    Towards fearless SIMD, 7 years later

    By TechAiVerse, March 30, 2025

    Raph Levien, March 29, 2025

    Seven years ago I wrote a blog post, Towards fearless SIMD, outlining a vision for Rust as a compelling language for writing fast SIMD programs.
    Where are we now?

    Unfortunately, the present-day experience of writing SIMD in Rust is still pretty rough, though there has been progress, and there are promising efforts underway.
    As in the previous post, this post will outline a possible vision.

    Up to now, Linebender projects have not used SIMD, but that is changing.
    As we work on CPU/GPU hybrid rendering techniques, it’s clear that we need SIMD to get maximal performance out of the CPU side.
    We also see opportunities in faster color conversion and accelerated 2D geometry primitives.

    This blog post is also a companion to a podcast I recorded recently with André Popovitch.
    That podcast is a good introduction to SIMD concepts, while this blog post focuses more on future directions.

    A simple example

    As a running example, we’ll compute a sigmoid function for a vector of 4 values.
    The scalar version is as follows:

    fn sigmoid(x: [f32; 4]) -> [f32; 4] {
        x.map(|y| y / (1.0 + y * y).sqrt())
    }
    

    This particular simple code autovectorizes nicely (Godbolt link), but more complex examples frequently fail to autovectorize, often because of subtle differences in floating point semantics.
    (Editorial note: the example in a previous version of this post didn’t autovectorize (Godbolt) because the optimization level was set at -O, which is less aggressive than -C opt-level=3, the default for release builds.)

    Safety

    One of the biggest problems with writing SIMD in Rust is that all exposed SIMD intrinsics are marked as unsafe, even in cases where they can be used safely.
    The reason is that support for SIMD features varies widely, and executing a SIMD instruction on a CPU that does not support it is undefined behavior – the chip can crash, ignore the instruction, or do something unexpected.
    To be used safely, there must be some other mechanism to establish that the CPU does support the feature.

    Here’s the running example in hand-written intrinsic code, showing the need to write unsafe to access SIMD intrinsics at all:

    #[cfg(target_arch = "aarch64")]
    fn sigmoid_neon(x: [f32; 4]) -> [f32; 4] {
        use core::arch::aarch64::*;
        unsafe {
            let x_simd = core::mem::transmute(x);
            let x_squared = vmulq_f32(x_simd, x_simd);
            let ones = vdupq_n_f32(1.0);
            let sum = vaddq_f32(ones, x_squared);
            let sqrt = vsqrtq_f32(sum);
            let ratio = vdivq_f32(x_simd, sqrt);
            core::mem::transmute(ratio)
        }
    }
    
    #[cfg(target_arch = "x86_64")]
    fn sigmoid_sse2(x: [f32; 4]) -> [f32; 4] {
        use core::arch::x86_64::*;
        unsafe {
            let x_simd = core::mem::transmute(x);
            let x_squared = _mm_mul_ps(x_simd, x_simd);
            let ones = _mm_set1_ps(1.0);
            let sum = _mm_add_ps(ones, x_squared);
            let sqrt = _mm_sqrt_ps(sum);
            let ratio = _mm_div_ps(x_simd, sqrt);
            core::mem::transmute(ratio)
        }
    }
    

    This is quite a simplified example.
    For one, the SIMD width is fixed at 4 lanes (128 bits).
    Most likely, in practice you’d iterate over a larger slice, taking chunks equal to the natural SIMD width.
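As a minimal sketch of that slice-based pattern, assuming a fixed width of 4 lanes rather than the natural width of the target, the loop below walks the slice in full chunks and finishes the leftover elements one at a time. The names sigmoid_scalar, sigmoid4, sigmoid_slice and LANES are made up for illustration, and the 4-lane kernel is plain scalar code standing in for the intrinsics versions above.

```rust
/// One lane of the running example.
fn sigmoid_scalar(y: f32) -> f32 {
    y / (1.0 + y * y).sqrt()
}

/// The fixed-width kernel; in real code this is where intrinsics would go.
fn sigmoid4(x: [f32; 4]) -> [f32; 4] {
    x.map(sigmoid_scalar)
}

/// Applies the sigmoid in place to a slice of any length.
/// (`sigmoid_slice` and `LANES` are hypothetical names for this sketch.)
fn sigmoid_slice(data: &mut [f32]) {
    const LANES: usize = 4;
    let mut chunks = data.chunks_exact_mut(LANES);
    for chunk in &mut chunks {
        // Copy one full chunk into a fixed-size array, run the kernel,
        // and write the result back over the same chunk.
        let mut lanes = [0.0f32; LANES];
        lanes.copy_from_slice(chunk);
        chunk.copy_from_slice(&sigmoid4(lanes));
    }
    // Fewer than LANES elements are left over; handle them one at a time.
    for y in chunks.into_remainder() {
        *y = sigmoid_scalar(*y);
    }
}
```

Calling sigmoid_slice on a six-element slice runs the kernel once and the scalar loop twice.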

    Multiversioning

    A central problem for SIMD is multiversioning and runtime dispatch.
    In some cases, you know the exact CPU target, for example when compiling a binary you’ll run only on your machine (in which case target-cpu=native is appropriate).
    But when distributing software more widely, there may be a range of capabilities.
    For highest performance, it’s necessary to compile multiple versions of the code, and do runtime detection to dispatch to the best SIMD code the hardware can run.
    This problem was described in the original fearless SIMD blog post, and there hasn’t been a significant advance at the Rust language level since then.
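To make the problem concrete, here is a minimal hand-rolled sketch of multiversioning with what the standard library offers today: one loop body compiled twice (once for the baseline target and once with AVX2 enabled) plus a runtime check that picks between them. The function names are made up for illustration, and AVX2 is just an example feature; this is not how any particular library does it.

```rust
/// Shared loop body, inlined into each version below so that it is
/// compiled once per set of enabled target features.
#[inline(always)]
fn sigmoid_body(data: &mut [f32]) {
    for y in data.iter_mut() {
        *y = *y / (1.0 + *y * *y).sqrt();
    }
}

/// Baseline version, built for the crate's default target features.
fn sigmoid_baseline(data: &mut [f32]) {
    sigmoid_body(data);
}

/// AVX2 version: same source, but the compiler may use AVX2 instructions.
#[cfg(target_arch = "x86_64")]
#[target_feature(enable = "avx2")]
unsafe fn sigmoid_avx2(data: &mut [f32]) {
    sigmoid_body(data);
}

/// Picks the best available version at runtime.
fn sigmoid_dispatch(data: &mut [f32]) {
    #[cfg(target_arch = "x86_64")]
    {
        if std::arch::is_x86_feature_detected!("avx2") {
            // SAFETY: AVX2 support was just confirmed on this CPU.
            unsafe { sigmoid_avx2(data) };
            return;
        }
    }
    sigmoid_baseline(data);
}
```

Every additional feature level means another annotated copy and another branch in the dispatcher, which is the boilerplate that macro-based and type-based approaches try to remove.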

    In the C++ world, the Highway library provides excellent SIMD support for a very wide range of targets, and also solves the multiversioning problem.
    Among its uses are the codecs for the JPEG-XL image format.
    Such codecs are an ideal use case for SIMD programming in general, and shipping them in a browser requires a good solution to multiversioning.
    Highway has a really good explanation of their approach to multiversioning.
    It will be useful to study it carefully to see how they’ve solved various problems.
    And a concise way of saying what I’d like to see is “Highway for Rust.”

    One possible approach is a crate called multiversion, which uses macros to replicate the code for multiple versions.
    A more recent macro-based approach is rust-target-feature-dispatch.
    It is generally a similar approach to multiversion, and the specific differences are set out in that crate’s README.

    Another approach, as I believe first advocated in my 2018 blog post, is to write functions polymorphic on a zero-sized type representing the SIMD capabilities, then rely on monomorphization to create the various versions.
    One motivation for this approach is to encode safety in Rust’s type system.
    Having the zero-sized token is proof of the underlying CPU having a certain level of SIMD capability, so calling those intrinsics is safe.
    A major library that uses this approach is pulp, which also powers the faer linear algebra library.
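A minimal sketch of the token idea follows, with made-up names (SimdToken, Fallback, Avx2, sigmoid_pixels) that are not the API of pulp or any other library. The AVX2 token is zero-sized and is meant to be constructible only through a checked constructor, so the unsafe call behind it can be wrapped in a safe method; the generic function is then monomorphized once per token type.

```rust
/// Hypothetical trait for this sketch: holding a value of an implementing
/// type is proof that the corresponding CPU features are available.
trait SimdToken: Copy {
    fn sigmoid4(self, x: [f32; 4]) -> [f32; 4];
}

/// Always available: plain scalar code.
#[derive(Clone, Copy, Debug)]
struct Fallback;

impl SimdToken for Fallback {
    fn sigmoid4(self, x: [f32; 4]) -> [f32; 4] {
        x.map(|y| y / (1.0 + y * y).sqrt())
    }
}

/// Zero-sized AVX2 token. In a real crate the private field would sit
/// behind a module boundary, so `try_new` is the only way to build one.
#[cfg(target_arch = "x86_64")]
#[derive(Clone, Copy, Debug)]
struct Avx2(());

#[cfg(target_arch = "x86_64")]
impl Avx2 {
    /// Returns a token only if the running CPU reports AVX2 support.
    fn try_new() -> Option<Self> {
        if std::arch::is_x86_feature_detected!("avx2") {
            Some(Avx2(()))
        } else {
            None
        }
    }
}

/// Same arithmetic, compiled with AVX2 enabled.
#[cfg(target_arch = "x86_64")]
#[target_feature(enable = "avx2")]
unsafe fn sigmoid4_avx2(x: [f32; 4]) -> [f32; 4] {
    let mut out = x;
    for y in out.iter_mut() {
        *y = *y / (1.0 + *y * *y).sqrt();
    }
    out
}

#[cfg(target_arch = "x86_64")]
impl SimdToken for Avx2 {
    fn sigmoid4(self, x: [f32; 4]) -> [f32; 4] {
        // SAFETY: an `Avx2` value exists only if `try_new` saw AVX2 support.
        unsafe { sigmoid4_avx2(x) }
    }
}

/// Generic over the token; monomorphized once per level that is used.
fn sigmoid_pixels<T: SimdToken>(token: T, pixels: &mut [[f32; 4]]) {
    for p in pixels.iter_mut() {
        *p = token.sigmoid4(*p);
    }
}
```

Callers write sigmoid_pixels(Fallback, &mut pixels) unconditionally, or obtain an Avx2 token from Avx2::try_new() and pass that instead. Neither call site needs unsafe.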

    I started putting together a pulp version of the running example, but ran into the immediate problem that it lacks a sqrt intrinsic (this would be easy enough to add, however).
    It also works a bit differently, in that it only supports vectors of the natural width, not ones of a fixed width.
    For general linear algebra, that’s fine, but for some other applications it adds friction; for example, colors with alpha are naturally chunks of 4 scalars.
    To see an example of pulp code, as well as some discussion, see this Zulip thread.

    In fearless_simd#2 I propose a prototype of reasonably-ergonomic SIMD multiversioning.
    Like the original fearless_simd prototype, vector data types are polymorphic on SIMD level.
    The new prototype goes beyond that in several important ways.
    For one, arithmetic traits in std::ops are implemented for vector types, so it’s possible to add two vectors together, multiply vectors by scalars, etc.

    Here’s what the running example looks like in that prototype:

    #[inline(always)]
    fn sigmoid_impl<S: Simd>(simd: S, x: [f32; 4]) -> [f32; 4] {
        let x_simd: f32x4<S> = x.simd_into(simd);
        (x_simd / (1.0 + x_simd * x_simd).sqrt()).into()
    }
    
    simd_dispatch!(sigmoid(level, rgba: [f32; 4]) -> [f32; 4] = sigmoid_impl);
    

    An advantage of the fearless_simd#2 prototype over pulp is a feature for downcasting based on SIMD level, so it’s possible to write different code optimized for different chips.
    See the srgb example in that pull request for more detail.
    Though there are clear advantages, at this point I’m not sure whether this is the direction to go.
    It would be a lot of work to build out all the needed types and operations, with potentially a large amount of repetitive boilerplate code in the library, which in turn may cause issues with compile time.
    Another possible direction is a smarter, compiler-like proc macro which synthesizes the SIMD intrinsics as needed based on the types and operations in the source program.

    One additional consideration for Rust is that the implementation of runtime feature detection is slower than it should be.
    Thus, feature detection and dispatch shouldn’t be done at every function call.
    A good working solution is to do feature detection once, at the start of the program, then pass that token down through function calls.
    It’s workable but definitely an ergonomic paper cut.
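The pattern looks roughly like the sketch below, in which the Level enum and all function names are hypothetical. Detection runs once in detect, and every function underneath takes the level as an ordinary argument instead of checking again. A real library would make the level an unforgeable token rather than a plain enum, since the unsafe call here is only sound if Level::Avx2 always comes from detect.

```rust
/// SIMD level chosen at startup (hypothetical type for this sketch).
#[allow(dead_code)]
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
enum Level {
    Fallback,
    Neon,
    Avx2,
}

/// Runs the comparatively slow feature checks. Call once, early.
fn detect() -> Level {
    #[cfg(target_arch = "x86_64")]
    {
        if std::arch::is_x86_feature_detected!("avx2") {
            return Level::Avx2;
        }
    }
    #[cfg(target_arch = "aarch64")]
    {
        if std::arch::is_aarch64_feature_detected!("neon") {
            return Level::Neon;
        }
    }
    Level::Fallback
}

/// Shared loop body, inlined into each caller.
#[inline(always)]
fn sigmoid_body(row: &mut [f32]) {
    for y in row.iter_mut() {
        *y = *y / (1.0 + *y * *y).sqrt();
    }
}

#[cfg(target_arch = "x86_64")]
#[target_feature(enable = "avx2")]
unsafe fn sigmoid_avx2(row: &mut [f32]) {
    sigmoid_body(row);
}

/// Leaf function: branches on the level it was handed, with no detection.
fn sigmoid_row(level: Level, row: &mut [f32]) {
    match level {
        // SAFETY: relies on `Level::Avx2` only ever coming from `detect`.
        #[cfg(target_arch = "x86_64")]
        Level::Avx2 => unsafe { sigmoid_avx2(row) },
        // Other levels share the baseline build in this sketch.
        _ => sigmoid_body(row),
    }
}

/// Intermediate function: just threads the level through.
fn sigmoid_rows(level: Level, rows: &mut [Vec<f32>]) {
    for row in rows.iter_mut() {
        sigmoid_row(level, row);
    }
}
```

The paper cut is visible in the signatures: every function between the program's entry point and the SIMD kernel gains a level parameter, whether or not it cares about SIMD.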

    FP16 and AVX-512

    A general trend in parallel computation, really fueled by AI workloads, is smaller scalars with higher throughputs.
    While not yet common on x86_64, the FP16 extension is supported on all Apple Silicon desktop CPUs and most recent high-ish end ARM-based phones.
    Since Neon is only 128 bits wide, having 8 lanes is welcome.
    I find the f16 format to be especially useful for pixel values, as it can encode color values with more than enough precision to avoid visual artifacts (8 bits is not quite enough, though it is good enough for some applications, as long as you’re not trying to do HDR).

    Native Rust support for the f16 type has not yet landed (tracked in rust#125440), which makes use of this scalar size harder.
    However, there is some support in the half library, and also the fearless_simd#2 prototype exports a number of FP16 Neon instructions through inline assembly.
    When true f16 support lands, it will be possible to switch over to intrinsics, which will have better optimization and ergonomics (for example, the same method will be able to splat both constants, which are converted to f16 at compile time, and f32 variables, which are converted at runtime).

    AVX-512 is a somewhat controversial SIMD capability.
    It first appeared in the ill-fated Larrabee project, which shipped in limited numbers as the Xeon Phi starting in 2010, and has since appeared in scattered Intel CPUs, but with compromises.
    In particular, sprinkling even a small amount of AVX-512 code into a program could result in downclocking, reducing performance for all workloads (see Stack Overflow thread on throttling for more details).
    These days, the most likely way to get a CPU with AVX-512 is an AMD Zen 4 or Zen 5; it is on their strength that AVX-512 makes up about 16% of computers in the Steam hardware survey.

    The increased width is not the main reason to be enthusiastic about AVX-512.
    Indeed, on Zen 4 and most Zen 5 chips, the datapath is 256 bits so full 512 bit instructions are “double pumped.” The most exciting aspect is predication based on masks, a common implementation technique on GPUs.
    In particular, memory load and store operations are safe when the mask bit is zero, which is especially helpful for using SIMD efficiently on strings.
    Without predication, a common technique is to write two loops, the first handling only even multiples of the SIMD width, and a second, usually written as scalars, to handle the odd-size “tail”.
    There are lots of problems with this – code bloat, worse branch prediction, inability to exploit SIMD for chunks slightly less than the natural SIMD width (which gets worse as SIMD grows wider), and risks that the two loops don’t have exactly the same behavior.
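To illustrate what predication buys, the sketch below emulates it in portable scalar Rust; it uses no AVX-512 mask registers, and the names kernel, sigmoid_padded and LANES are made up. A short final chunk is copied into a zero-padded array (standing in for a masked load), run through the same fixed-width kernel as every other chunk, and only the active lanes are written back (standing in for a masked store). There is one loop and one code path, so the main body and the tail cannot drift apart.

```rust
const LANES: usize = 4;

/// Fixed-width kernel from the running example (scalar stand-in).
fn kernel(x: [f32; LANES]) -> [f32; LANES] {
    x.map(|y| y / (1.0 + y * y).sqrt())
}

/// Applies the kernel in place to a slice of any length with a single loop.
fn sigmoid_padded(data: &mut [f32]) {
    for chunk in data.chunks_mut(LANES) {
        // `n` is LANES for every chunk except possibly the last one.
        let n = chunk.len();
        // Emulated masked load: inactive lanes read as 0.0.
        let mut lanes = [0.0f32; LANES];
        lanes[..n].copy_from_slice(chunk);
        let out = kernel(lanes);
        // Emulated masked store: only the first `n` lanes are written.
        chunk.copy_from_slice(&out[..n]);
    }
}
```

With hardware predication the padding copies disappear, because the load and store themselves take the mask.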

    Going forward, Intel has proposed AVX10, and will hopefully ship AVX10.2 chips in the next few years.
    This extension has pretty much all of the features of AVX-512, with some cleanups and new features (until recently, AVX10 was defined as having a 256 bit base width and optionally 512, but 512 is now the baseline).
    In addition, AVX10.2 will include 16-bit floats (currently available only in the Sapphire Rapids high-end server and workstation chips).

    About std::simd

    The “portable SIMD” work has been going on for many years and currently has a home as the nightly std::simd.
    While I think it will be very useful in many applications, I am not personally very excited about it for my applications.
    For one, because it emphasizes portability, it encourages a “lowest common denominator” approach, while I believe that for certain use cases it will be important to tune algorithms to best use the specific quirks of the different SIMD implementations.
    For two, std::simd does not itself solve the multiversion problem.
    From my perspective, it’s probably best to consider it as a souped-up version of autovectorization.

    Language evolution

    Rust’s out of the box support for SIMD is still quite rough, especially the need to use unsafe extensively.
    While some of the gap can be filled with libraries, arguably it should be a goal of the language itself to support safe SIMD code.
    There is progress in this direction.

    First, the original version of target_feature requires unsafe to call into any function annotated with #[target_feature].
    A proposal to relax that, so that functions already under a target_feature gate can safely call into another function with the same gate, is called “target_feature 1.1” and is scheduled to ship in Rust 1.86.
    Closely related, once inside the suitable target_feature gate, the majority of SIMD intrinsics (broadly, those that don’t do memory access through pointers) should be considered safe by the compiler, and that feature (safe intrinsics in core::arch) is also in flight.

    There’s more that can be done to help the Rust compiler recognize when SIMD use is safe, in particular to allow target_features when a concrete witness to the SIMD level is passed in as a function argument.
    The “struct target_features” proposal (RFC 3525) enables target_feature in such cases, and is one of the proposals considered in the proposed Rust project goal Nightly support for ergonomic SIMD multiversioning.

    In general, improving Rust SIMD support will require both libraries and support in the Rust language.
    Different approaches at the library level may indicate different language features to best support them.

    Looking forward

    My main goal in putting these prototypes forward, as well as writing these blog posts, is to spark conversation on how best to support SIMD programming in Rust.
    If done well, it is a great opportunity for the language, and fits in with its focus on performance and portability.

    As we build out the Vello hybrid CPU/GPU renderer, performance of the CPU components will rely heavily on SIMD, so we need to invest in writing a lot of SIMD code.
    The most conservative approach would be hand-writing unsafe intrinsics-based code for all targets, but that’s a lot of work and the use of unsafe is unappealing.
    I’d love for the Rust ecosystem to come together and build good infrastructure, competitive with Highway.
    For now, I think it’s time to carefully consider the design space and try to come to consensus on what that should look like.
