Close Menu

    Subscribe to Updates

    Get the latest creative news from FooBar about art, design and business.

    What's Hot

    Is Windows’ Alt+Tab shortcut broken for you? A rogue update is to blame

    Is Intel finally ready to compete with AMD’s Ryzen X3D chips?

    Nvidia’s legendary GeForce GTX 10-series cards might lose support soon

    Facebook X (Twitter) Instagram
    • Artificial Intelligence
    • Business Technology
    • Cryptocurrency
    • Gadgets
    • Gaming
    • Health
    • Software and Apps
    • Technology
    Facebook X (Twitter) Instagram Pinterest Vimeo
    Tech AI Verse
    • Home
    • Artificial Intelligence

      Apple sued by shareholders for allegedly overstating AI progress

      June 22, 2025

      How far will AI go to defend its own survival?

      June 2, 2025

      The internet thinks this video from Gaza is AI. Here’s how we proved it isn’t.

      May 30, 2025

      Nvidia CEO hails Trump’s plan to rescind some export curbs on AI chips to China

      May 22, 2025

      AI poses a bigger threat to women’s work, than men’s, report says

      May 21, 2025
    • Business

      Cloudflare open-sources Orange Meets with End-to-End encryption

      June 29, 2025

      Google links massive cloud outage to API management issue

      June 13, 2025

      The EU challenges Google and Cloudflare with its very own DNS resolver that can filter dangerous traffic

      June 11, 2025

      These two Ivanti bugs are allowing hackers to target cloud instances

      May 21, 2025

      How cloud and AI transform and improve customer experiences

      May 10, 2025
    • Crypto

      Armed Robbers Steal $100,000 in a Mall Parking Lot | Crypto Horror Story

      July 1, 2025

      Lummis Proposes Tax Cuts for Crypto Miners in Trump’s Big Beautiful Bill

      July 1, 2025

      3 CoinGecko Top Gainers to Watch for the First Week of July

      July 1, 2025

      What Circle’s US Trust Bank Bid Means for USDT and Stablecoin Competition

      July 1, 2025

      Why PI’s Breakout Wasn’t What It Seemed: Network Nears All-Time Low

      July 1, 2025
    • Technology

      Is Windows’ Alt+Tab shortcut broken for you? A rogue update is to blame

      July 1, 2025

      Is Intel finally ready to compete with AMD’s Ryzen X3D chips?

      July 1, 2025

      Nvidia’s legendary GeForce GTX 10-series cards might lose support soon

      July 1, 2025

      NASA will start streaming live on Netflix later this summer

      July 1, 2025

      How to secure your printer with Windows 11’s new Protected Print mode

      July 1, 2025
    • Others
      • Gadgets
      • Gaming
      • Health
      • Software and Apps
    Shop Now
    Tech AI Verse
    You are at:Home»Technology»Towards fearless SIMD, 7 years later
    Technology

    Towards fearless SIMD, 7 years later

    TechAiVerseBy TechAiVerseMarch 30, 2025No Comments12 Mins Read3 Views
    Facebook Twitter Pinterest Telegram LinkedIn Tumblr Email Reddit
    Share
    Facebook Twitter LinkedIn Pinterest WhatsApp Email

    Towards fearless SIMD, 7 years later

    Raph Levien, March 29, 2025

    Seven years ago I wrote a blog post Towards fearless SIMD, outlining a vision for Rust as a compelling language for writing fast SIMD programs.
    Where are we now?

    Unfortunately, the present-day experience of writing SIMD in Rust is still pretty rough, though there has been progress, and there are promising efforts underway.
    As in the previous post, this post will outline a possible vision.

    Up to now, Linebender projects have not used SIMD, but that is changing.
    As we work on CPU/GPU hybrid rendering techniques, it’s clear that we need SIMD to get maximal performance of the CPU side.
    We also see opportunities in faster color conversion and accelerated 2D geometry primitives.

    This blog post is also a companion to a podcast I recorded recently with André Popovitch.
    That podcast is a good introduction to SIMD concepts, while this blog post focuses more on future directions.

    A simple example

    As a running example, we’ll compute a sigmoid function for a vector of 4 values.
    The scalar version is as follows:

    fn sigmoid(x: [f32; 4]) -> [f32; 4] {
        x.map(|y| y / (1.0 + y * y).sqrt())
    }
    

    This particular simple code autovectorizes nicely (Godbolt link), but more complex examples often fail to autovectorize, often because of subtle differences in floating point semantics.
    (Editorial note: a previous version of this post didn’t autovectorize (Godbolt) because optimization level was set at -O, which is less aggressive than -C opt-level=3, the latter of which is the default for release builds)

    Safety

    One of the biggest problems with writing SIMD in Rust is that all exposed SIMD intrinsics are marked as unsafe, even in cases where they can be used safely.
    The reason is that support for SIMD features varies widely, and executing a SIMD instruction on a CPU that does not support it is undefined behavior – the chip can crash, ignore the instruction, or do something unexpected.
    To be used safely, there must be some other mechanism to establish that the CPU does support the feature.

    Here’s the running example in hand-written intrinsic code, showing the need to write unsafe to access SIMD intrinsics at all:

    #[cfg(target_arch = "aarch64")]
    fn sigmoid_neon(x: [f32; 4]) -> [f32; 4] {
        use core::arch::aarch64::*;
        unsafe {
            let x_simd = core::mem::transmute(x);
            let x_squared = vmulq_f32(x_simd, x_simd);
            let ones = vdupq_n_f32(1.0);
            let sum = vaddq_f32(ones, x_squared);
            let sqrt = vsqrtq_f32(sum);
            let ratio = vdivq_f32(x_simd, sqrt);
            core::mem::transmute(ratio)
        }
    }
    
    #[cfg(target_arch = "x86_64")]
    fn sigmoid_sse2(x: [f32; 4]) -> [f32; 4] {
        use core::arch::x86_64::*;
        unsafe {
            let x_simd = core::mem::transmute(x);
            let x_squared = _mm_mul_ps(x_simd, x_simd);
            let ones = _mm_set1_ps(1.0);
            let sum = _mm_add_ps(ones, x_squared);
            let sqrt = _mm_sqrt_ps(sum);
            let ratio = _mm_div_ps(x_simd, sqrt);
            core::mem::transmute(ratio)
        }
    }
    

    This is quite a simplified example.
    For one, the SIMD width is fixed at 4 lanes (128 bits).
    Most likely, in practice you’d iterate over a larger slice, taking chunks equal to the natural SIMD width.

    Multiversioning

    A central problem important for SIMD is multiversioning and runtime dispatch.
    In some cases, you know the exact CPU target, for example when compiling a binary you’ll run only on your machine (in which case target-cpu=native is appropriate).
    But when distributing software more widely, there may be a range of capabilities.
    For highest performance, it’s necessary to compile multiple versions of the code, and do runtime detection to dispatch to the best SIMD code the hardware can run.
    This problem was expressed in the original fearless SIMD blog post, and there hasn’t been significant advance at the Rust language level since then.

    In the C++ world, the Highway library provides excellent SIMD support for a very wide range of targets, and also solves the multiversioning problem.
    Among other uses are the codecs for the JPEG-XL image format.
    Such codecs are an ideal use case for SIMD programming in general, and shipping them in a browser requires a good solution to multiversioning.
    Highway has a really good explanation of their approach to multiversioning.
    It will be useful to study it carefully to see how they’ve solved various problems.
    And a concise way of saying what I’d like to see is “Highway for Rust.”

    One possible approach is a crate called multiversion, which uses macros to replicate the code for multiple versions.
    A more recent macro-based approach is rust-target-feature-dispatch.
    It is generally a similar approach to multiversion, and the specific differences are set out in that crate’s README.

    Another approach, as I believe first advocated in my 2018 blog post, is to write functions polymorphic on a zero-sized type representing the SIMD capabilities, then rely on monomorphization to create the various versions.
    One motivation for this approach is to encode safety in Rust’s type system.
    Having the zero-sized token is proof of the underlying CPU having a certain level of SIMD capability, so calling those intrinsics is safe.
    A major library that uses this approach is pulp, which also powers the faer linear algebra library.

    I started putting together a pulp version of the running example, but ran into the immediate problem that it lacks a sqrt intrinsic (this would be easy enough to add, however).
    It also works a bit differently, in that it only supports vectors of the natural width, not ones of a fixed width.
    For general linear algebra, that’s fine, but for some other applications it adds friction, for example colors with alpha are naturally chunks of 4 scalars.
    To see an example of pulp code, as well as some discussion, see this Zulip thread.

    In fearless_simd#2 I propose a prototype of reasonably-ergonomic SIMD multiversioning.
    Like the original fearless_simd prototype, vector data types are polymorphic on SIMD level.
    The new prototype goes beyond that in several important ways.
    For one, arithmetic traits in std::ops are implemented for vector types, so it’s possible to add two vectors together, multiply vectors by scalars, etc.

    Here’s what the running example looks like in that prototype:

    #[inline(always)]
    fn sigmoid_impl(simd: S, x: [f32; 4]) -> [f32; 4] {
        let x_simd: f32x4 = x.simd_into(simd);
        (x_simd / (1.0 + x_simd * x_simd).sqrt()).into()
    }
    
    simd_dispatch!(sigmoid(level, rgba: [f32; 4]) -> [f32; 4] = sigmoid_impl);
    

    An advantage of the fearless_simd#2 prototype over pulp is a feature for downcasting based on SIMD level, so it’s possible to write different code optimized for different chips.
    See the srgb example in that pull request for more detail.
    Though there are clear advantages, at this point I’m not sure whether this is the direction to go.
    It would be a lot of work to build out all the needed types and operations, with potentially a large amount of repetitive boilerplate code in the library, which in turn may cause issues with compile time.
    Another possible direction is a smarter, compiler-like proc macro which synthesizes the SIMD intrinsics as needed based on the types and operations in the source program.

    One additional consideration for Rust is that the implementation of runtime feature detection is slower than it should be.
    Thus, feature detection and dispatch shouldn’t be done at every function call.
    A good working solution is to do feature detection once, at the start of the program, then pass that token down through function calls.
    It’s workable but definitely an ergonomic paper cut.

    FP16 and AVX-512

    A general trend in parallel computation, really fueled by AI workloads, is smaller scalars with higher throughputs.
    While not yet common on x86_64, the FP16 extension is supported on all Apple Silicon desktop CPUs and most recent high-ish end ARM-based phones.
    Since Neon is only 128 bits wide, having 8 lanes is welcome.
    I find the f16 format to be especially useful for pixel values, as it can encode color values with more than enough precision to avoid visual artifacts (8 bits is not quite enough, though it is good enough for some applications, as long as you’re not trying to do HDR).

    Native Rust support for the f16 type has not yet landed (tracked in rust#125440), which makes use of this scalar size harder.
    However, there is some support in the half library, and also the fearless_simd#2 prototype exports a number of FP16 Neon instructions through inline assembly.
    When true f16 support lands, it will be possible to switch over to intrinsics, which will have better optimization and ergonomics (for example, the same method will splat constants converted to f16 at compile time and f32 variables to be converted at runtime).

    AVX-512 is a somewhat controversial SIMD capability.
    It first appeared in the ill-fated Larrabee project, which shipped in limited numbers as the Xeon Phi starting in 2010, and has since appeared in scattered Intel CPUs, but with compromises.
    In particular, sprinkling even a small amount of AVX-512 code into a program could result in downclocking, reducing performance for all workloads (see Stack Overflow thread on throttling for more details).
    These days, the most likely way to get a CPU with AVX-512 is an AMD Zen 4 or Zen 5; it is on their strength that AVX-512 makes up about 16% of computers in the Steam hardware survey.

    The increased width is not the main reason to be enthusiastic about AVX-512.
    Indeed, on Zen 4 and most Zen 5 chips, the datapath is 256 bits so full 512 bit instructions are “double pumped.” The most exciting aspect is predication based on masks, a common implementation technique on GPUs.
    In particular, memory load and store operations are safe when the mask bit is zero, which is especially helpful for using SIMD efficiently on strings.
    Without predication, a common technique is to write two loops, the first handling only even multiples of the SIMD width, and a second, usually written as scalars, to handle the odd-size “tail”.
    There are lots of problems with this – code bloat, worse branch prediction, inability to exploit SIMD for chunks slightly less than the natural SIMD width (which gets worse as SIMD grows wider), and risks that the two loops don’t have exactly the same behavior.

    Going forward, Intel has proposed AVX10, and will hopefully ship AVX 10.2 chips in the next few years.
    This extension has pretty much all of the features of AVX-512, with some cleanups and new features (until recently, AVX10 was defined has having a 256 bit base width and optionally 512, but 512 is now the baseline).
    In addition, AVX10.2 will include 16-bit floats (currently available only in the Sapphire Rapids high-end server and workstation chips).

    About std::simd

    The “portable SIMD” work has been going on for many years and currently has a home as the nightly std::simd.
    While I think it will be very useful in many applications, I am not personally very excited about it for my applications.
    For one, because it emphasizes portability, it encourages a “lowest common denominator” approach, while I believe that for certain use cases it will be important to tune algorithms to best use the specific quirks of the different SIMD implementations.
    For two, std::simd does not itself solve the multiversion problem.
    From my perspective, it’s probably best to consider it as a souped-up version of autovectorization.

    Language evolution

    Rust’s out of the box support for SIMD is still quite rough, especially the need to use unsafe extensively.
    While some of the gap can be filled with libraries, arguably it should be a goal of the language itself to support safe SIMD code.
    There is progress in this direction.

    First, the original version of target_feature requires unsafe to call into any function annotated with #[target_feature].
    A proposal to relax that so that functions already under a target_feature gate can call safely call into another function with the same gate is called “target_feature 1.1” and is scheduled to ship in 1.86.
    Closely related, once inside the suitable target_feature gate, the majority of SIMD intrinsics (broadly, those that don’t do memory access through pointers) should be considered safe by the compiler, and that feature (safe intrinsics in core::arch) is also in flight.

    There’s more that can be done to help the Rust compiler recognize when SIMD use is safe, in particular to allow target_features when a concrete witness to the SIMD level is passed in as a function argument.
    The “struct target_features” proposal (RFC 3525) enables target_feature in such cases, and is one of the proposals considered in the proposed Rust project goal Nightly support for ergonomic SIMD multiversioning.

    In general, improving Rust SIMD support will require both libraries and support in the Rust language.
    Different approaches at the library level may indicate different language features to best support them.

    Looking forward

    My main goal in putting these prototypes forward, as well as writing these blog posts, is to spark conversation on how best to support SIMD programming in Rust.
    If done well, it is a great opportunity for the language, and fits in with its focus on performance and portability.

    As we build out the Vello hybrid CPU/GPU renderer, performance of the CPU components will rely heavily on SIMD, so we need to invest in writing a lot of SIMD code.
    The most conservative approach would be hand-writing unsafe intrinsics-based code for all targets, but that’s a lot of work and the use of unsafe is unappealing.
    I’d love for the Rust ecosystem can come together and build good infrastructure, competitive with Highway.
    For now, I think it’s time to carefully consider the design space and try to come to consensus on what that should look like.

    Share. Facebook Twitter Pinterest LinkedIn Reddit WhatsApp Telegram Email
    Previous ArticleToday’s NYT Connections: Sports Edition Hints and Answers for March 30, #188
    Next Article A timeline of IBM keyboard history
    TechAiVerse
    • Website

    Jonathan is a tech enthusiast and the mind behind Tech AI Verse. With a passion for artificial intelligence, consumer tech, and emerging innovations, he deliver clear, insightful content to keep readers informed. From cutting-edge gadgets to AI advancements and cryptocurrency trends, Jonathan breaks down complex topics to make technology accessible to all.

    Related Posts

    Is Windows’ Alt+Tab shortcut broken for you? A rogue update is to blame

    July 1, 2025

    Is Intel finally ready to compete with AMD’s Ryzen X3D chips?

    July 1, 2025

    Nvidia’s legendary GeForce GTX 10-series cards might lose support soon

    July 1, 2025
    Leave A Reply Cancel Reply

    Top Posts

    New Akira ransomware decryptor cracks encryptions keys using GPUs

    March 16, 202526 Views

    OpenAI details ChatGPT-o3, o4-mini, o4-mini-high usage limits

    April 19, 202522 Views

    Rsync replaced with openrsync on macOS Sequoia

    April 7, 202517 Views

    I love that this tiny NAS offers up to 32TB of SSD storage and four 2.5GbE LAN ports, but why the DisplayPort connector?

    May 18, 202513 Views
    Don't Miss
    Technology July 1, 2025

    Is Windows’ Alt+Tab shortcut broken for you? A rogue update is to blame

    Is Windows’ Alt+Tab shortcut broken for you? A rogue update is to blame Image: Ham…

    Is Intel finally ready to compete with AMD’s Ryzen X3D chips?

    Nvidia’s legendary GeForce GTX 10-series cards might lose support soon

    NASA will start streaming live on Netflix later this summer

    Stay In Touch
    • Facebook
    • Twitter
    • Pinterest
    • Instagram
    • YouTube
    • Vimeo

    Subscribe to Updates

    Get the latest creative news from SmartMag about art & design.

    About Us
    About Us

    Welcome to Tech AI Verse, your go-to destination for everything technology! We bring you the latest news, trends, and insights from the ever-evolving world of tech. Our coverage spans across global technology industry updates, artificial intelligence advancements, machine learning ethics, and automation innovations. Stay connected with us as we explore the limitless possibilities of technology!

    Facebook X (Twitter) Pinterest YouTube WhatsApp
    Our Picks

    Is Windows’ Alt+Tab shortcut broken for you? A rogue update is to blame

    July 1, 20250 Views

    Is Intel finally ready to compete with AMD’s Ryzen X3D chips?

    July 1, 20250 Views

    Nvidia’s legendary GeForce GTX 10-series cards might lose support soon

    July 1, 20250 Views
    Most Popular

    Ethereum must hold $2,000 support or risk dropping to $1,850 – Here’s why

    March 12, 20250 Views

    Xiaomi 15 Ultra Officially Launched in China, Malaysia launch to follow after global event

    March 12, 20250 Views

    Apple thinks people won’t use MagSafe on iPhone 16e

    March 12, 20250 Views
    © 2025 TechAiVerse. Designed by Divya Tech.
    • Home
    • About Us
    • Contact Us
    • Privacy Policy
    • Terms & Conditions

    Type above and press Enter to search. Press Esc to cancel.