Close Menu

    Subscribe to Updates

    Get the latest creative news from FooBar about art, design and business.

    What's Hot

    Honda Malaysia Targets 60,000 Sales in 2026 with Expanded e:HEV Lineup

    Older Windows 11 PCs need a Secure Boot fix ASAP

    Why Ring’s Super Bowl ad hits so sinister

    Facebook X (Twitter) Instagram
    • Artificial Intelligence
    • Business Technology
    • Cryptocurrency
    • Gadgets
    • Gaming
    • Health
    • Software and Apps
    • Technology
    Facebook X (Twitter) Instagram Pinterest Vimeo
    Tech AI Verse
    • Home
    • Artificial Intelligence

      Read the extended transcript: President Donald Trump interviewed by ‘NBC Nightly News’ anchor Tom Llamas

      February 6, 2026

      Stocks and bitcoin sink as investors dump software company shares

      February 4, 2026

      AI, crypto and Trump super PACs stash millions to spend on the midterms

      February 2, 2026

      To avoid accusations of AI cheating, college students are turning to AI

      January 29, 2026

      ChatGPT can embrace authoritarian ideas after just one prompt, researchers say

      January 24, 2026
    • Business

      New VoidLink malware framework targets Linux cloud servers

      January 14, 2026

      Nvidia Rubin’s rack-scale encryption signals a turning point for enterprise AI security

      January 13, 2026

      How KPMG is redefining the future of SAP consulting on a global scale

      January 10, 2026

      Top 10 cloud computing stories of 2025

      December 22, 2025

      Saudia Arabia’s STC commits to five-year network upgrade programme with Ericsson

      December 18, 2025
    • Crypto

      HBAR Shorts Face $5 Million Risk if Price Breaks Key Level

      February 10, 2026

      Ethereum Holds $2,000 Support — Accumulation Keeps Recovery Hopes Alive

      February 10, 2026

      Miami Mansion Listed for 700 BTC as California Billionaire Tax Sparks Relocations

      February 10, 2026

      Solana Drops to 2-Year Lows — History Suggests a Bounce Toward $100 is Incoming

      February 10, 2026

      Bitget Cuts Stock Perps Fees to Zero for Makers Ahead of Earnings Season, Expanding Access Across Markets

      February 10, 2026
    • Technology

      Older Windows 11 PCs need a Secure Boot fix ASAP

      February 11, 2026

      Why Ring’s Super Bowl ad hits so sinister

      February 11, 2026

      This dual-CPU PC from 1995 was so cool, Microsoft had to kill it

      February 11, 2026

      1,300 games for $10: ‘No ICE in Minnesota’ bundle launched

      February 11, 2026

      Gemini gave my Plex server a checkup. Its diagnosis surprised me

      February 11, 2026
    • Others
      • Gadgets
      • Gaming
      • Health
      • Software and Apps
    Check BMI
    Tech AI Verse
    You are at:Home»Technology»Faster substring search with SIMD in Zig
    Technology

    Faster substring search with SIMD in Zig

    TechAiVerseBy TechAiVerseAugust 11, 2025No Comments5 Mins Read12 Views
    Facebook Twitter Pinterest Telegram LinkedIn Tumblr Email Reddit
    Faster substring search with SIMD in Zig
    Share
    Facebook Twitter LinkedIn Pinterest WhatsApp Email

    Faster substring search with SIMD in Zig

    Published 10.08.2025

    I’ve been learning a lot about low-level programming languages lately, and for a long time there has been one thing that has interested me: SIMD (or ‘single instruction, multiple data’) code. I’ve seen a lot of articles about having massive performance gains by utilizing SIMD and wanted to learn how to do it myself.

    This article is a journey into implementing ~60% faster substring searching compared to Zig’s std.mem.indexOf using a SIMD-friendly algorithm.

    Baseline
    #

    This is the baseline function that we will be comparing against:

    fn find_substr(needle: []const u8, haystack: []const u8) ?usize {
        return std.mem.indexOf(u8, haystack, needle);
    }
    

    It’s the closest thing to a substring search function from Zig’s standard library. It returns the first index of a subsequence – or null if not found.

    SIMD algorithm
    #

    This algorithm is taken directly from Wojciech Muła’s fantastic article SIMD-friendly algorithms for substring searching, which seems to have the best algorithms for finding substrings in a large body of text.

    Here’s how the algorithm works: say that we want to find the index of the word “blue” (the needle) in “It was a beautiful, bounteous, blue day” (the haystack). First, we extract the first and last character of the needle (‘b’ and ’e’) and store them in a variable.

    Then we will loop through all of the characters in haystack, loading the next 32 characters (bytes) from memory into a SIMD register and comparing each character (byte) in the register with ‘b’. This will result in a mask containing 32 bytes, 1 if the character is ‘b’ and 0 in all other cases.

    We will do the same with the last character, but load the characters with an offset (needle.len - 1).

    Without the offset, any match that starts in one 32‑byte chunk and ends in the next would be missed. With this method, we can also check for needles that are longer than 32 characters.

    The result will be two bit masks, First and Last, where we can use bit-wise AND (Result = First & Last) to figure out potential substring occurrences.

    Result will be 1 only when there is a ‘b’ at index i followed by an ’e’ at index i+3. We still need to check if those positions actually contain the value “blue”, but this still dramatically reduces the number of checks (= individual memory accesses) that are necessary. We’ll see how this works in practice in the next section.

    Implementation in Zig
    #

    First, to properly use SIMD, let’s assume that the CPU supports AVX2 (Advanced Vector Extensions 2) and has 256-bit wide registers.

    All desktop processors less than 10 years old support AVX2, with newer ones also supporting AVX-512 with 512-bit wide registers.

    This allows us to use Zig’s @Vector function to make a type:

    const Block = @Vector(32, u8); // number of elements, element type (32*8=256)
    

    By using Block, we are telling the compiler that the operations on this datatype should use SIMD instructions where possible.

    Next, we take the first and last letters of the search word (’needle’) and load them into two SIMD registers, so that every byte of the register is filled with the character. This is handled by another built-in function, @splat:

    const first_letter: Block = @splat(needle[0]);
    const last_letter: Block = @splat(needle[needle.len - 1]);
    

    In the main loop, we check that there is enough characters left in haystack so that we can read the next 32 + needle.len characters. Inside the block, we load the blocks that we’re going to compare first_letter and last_letter with.

    const n = haystack.len;
    const k = needle.len;
    var i: usize = 0;
    while (i + k + 32 <= n) : (i += 32) {
        const first_block: Block = haystack[i..][0..32].*;
        const last_block: Block = haystack[i + k - 1 ..][0..32].*;
    

    Now we can make the comparisons and combine them into a mask:

    	// ...
        const eq_first = first_letter == first_block;
        const eq_last = last_letter == last_block;
        var mask: std.bit_set.IntegerBitSet(32) = .{ .mask = @bitCast(eq_first & eq_last) };
    

    Here we can use an IntegerBitSet from Zig’s standard library. We construct it by casting the result of eq_first & eq_last into a 32-bit integer. If the resulting mask is non-zero, there are candidates in the current block.

    	// ...
        while (mask.findFirstSet()) |bitpos| {
            if (std.mem.eql(u8, haystack[i + bitpos + 1 ..][0 .. k - 1], needle[1..])) {
                return i + bitpos;
            }
            _ = mask.toggleFirstSet();
        }
    

    The first and last characters of the substring are checked already, so we don’t need to check their equality again.

    Finally, if there are leftover characters, we can fall back to std.mem.IndexOf.

    	// ...
    	// Fallback to scalar search for the tail
    	if (i < n) {
    	    if (std.mem.indexOf(u8, haystack[i..], needle)) |rel_idx| {
    	        return i + rel_idx;
    	    }
    	}
    	return null; // no substring found
    }
    

    Benchmarks
    #

    To properly show the effects of our SIMD algorithm, we’re going to need a large haystack. For this, I’ve chosen to use the entirety Moby Dick in plain text, and a search word ’newsletter’, which appears at the very end of the text.

    The code is available on GitHub

    To compile the code, I ran zig build with -Doptimize=ReleaseFast:

    > zig build -Doptimize=ReleaseFast
    

    Support for bitwise operations on boolean vectors was added in Zig 0.15, which is unreleased as of now (August 2025). If you want to run the code on your system, you need to build Zig from the master branch.

    To measure performance and compare against baseline, I’ll use one of my favorite CLI tools, poop:

    > poop -d 10000 "./zig-out/bin/substr" "./zig-out/bin/substr --simd"
    
    
    Benchmark 1 (6361 runs): ./zig-out/bin/substr
      measurement          mean ± σ            min … max           outliers         delta
      wall_time          1.22ms ±  185us     903us … 5.33ms        242 ( 4%)        0%
      peak_rss           1.20MB ±  290      1.18MB … 1.20MB          2 ( 0%)        0%
      cpu_cycles         2.15M  ± 40.5K     2.10M  … 2.71M         312 ( 5%)        0%
      instructions       1.85M  ± 0.75      1.85M  … 1.85M          56 ( 1%)        0%
      cache_references   43.8K  ±  620      38.3K  … 44.9K           9 ( 0%)        0%
      cache_misses       19.0K  ± 10.3K     4.08K  … 33.6K           0 ( 0%)        0%
      branch_misses      48.1   ± 17.4        20   …  104           97 ( 2%)        0%
    
    Benchmark 2 (10000 runs): ./zig-out/bin/substr --simd
      measurement          mean ± σ            min … max           outliers         delta
      wall_time           500us ± 96.9us     397us … 4.23ms        840 ( 8%)        ⚡- 58.9% ±  0.4%
      peak_rss           1.20MB ±  164      1.18MB … 1.20MB          1 ( 0%)          +  0.0% ±  0.0%
      cpu_cycles          369K  ± 36.1K      340K  … 1.10M        1167 (12%)        ⚡- 82.8% ±  0.1%
      instructions        578K  ± 0.53       578K  …  578K           6 ( 0%)        ⚡- 68.8% ±  0.0%
      cache_references   38.8K  ±  545      34.1K  … 40.5K           6 ( 0%)        ⚡- 11.4% ±  0.0%
      cache_misses       5.62K  ± 4.97K     2.11K  … 27.9K        1529 (15%)        ⚡- 70.3% ±  1.2%
      branch_misses      2.88K  ± 23.4      2.81K  … 3.09K         453 ( 5%)        💩+5879.8% ±  1.4%
    

    (Scroll right to see more data)

    As you can see, for a large body of text, the speedup is noticeable: 59% faster with 80% less CPU cycles!

    The SIMD version only took 500 microseconds to complete on average, including the overhead of loading the program into memory and printing the result. 500 microseconds is half a millisecond. That’s how fast my laptop searches through a whole book, 200 000 words, cover to cover. This is why computers are so powerful! How long would it take for a human to do that?

    This is quite a large improvement, and proves that the SIMD code is actually working (otherwise the reduction in CPU cycles wouldn’t be so massive). Can we do even better though?

    Character selection
    #

    You may notice from the output of poop that the number of branch misses has absolutely blown up – from 48 on average to 2.88k !

    Why does this happen? Well, if you were to count how many times the inner while loop is entered when the mask is non-zero:

        var i: usize = 0;
        var count: usize = 0;
        while (i + k + 32 <= n) : (i += 32) {
            const block_first: Block = haystack[i..][0..32].*;
            const block_last: Block = haystack[i + k - 1 ..][0..32].*;
            const eq_first = first == block_first;
            const eq_last = last == block_last;
            var mask: std.bit_set.IntegerBitSet(32) = .{ .mask = @bitCast(eq_first & eq_last) };
            while (mask.findFirstSet()) |bitpos| {
                count += 1;
                if (std.mem.eql(u8, haystack[i + bitpos + 1 ..][0 .. k - 1], needle[1..])) {
                    std.debug.print("found match with count: {}n", .{count});
                    return i + bitpos;
                }
                _ = mask.toggleFirstSet();
            }
        }
    
    found match with count: 2792
    

    The fact that count is so close to the number of mispredictions suggests that each time the mask is non‑zero we incur a branch miss.

    Unfortunately, there is no obvious way to prevent this with the current algorithm. The state-of-the-art seems to be choosing two bytes in the needle that occur less frequently according to a pre-calculated frequency distribution. This is used in the memchr crate in Rust, as explained by the author in this comment on Hacker News.

    For example, the needle newsletter has the rarest characters w at index 2 and l at index 4.

    The function in memchr can be found here. I’ve ported it into Zig, and you can see it here.

        const needle_index_pair = find_rarest(needle);
    
        const first_letter: Block = @splat(needle[needle_index_pair[0]]);
        const first_offset = needle_index_pair[0];
        const second_letter: Block = @splat(needle[needle_index_pair[1]]);
        const second_offset = needle_index_pair[1];
    

    The algorithm is the exact same, but the index for first_letter and second_letter now varies according to the pre-calculated frequency distribution.

    Benchmarks
    #

    
    Benchmark 1 (10000 runs): ./zig-out/bin/substr --simd
      measurement          mean ± σ            min … max           outliers         delta
      wall_time           472us ± 62.9us     400us … 1.62ms        735 ( 7%)        0%
      peak_rss           1.20MB ±    0      1.20MB … 1.20MB          0 ( 0%)        0%
      cpu_cycles          376K  ± 44.7K      347K  … 1.46M        1213 (12%)        0%
      instructions        578K  ± 0.54       578K  …  578K          10 ( 0%)        0%
      cache_references   38.7K  ±  715      28.3K  … 40.6K          96 ( 1%)        0%
      cache_misses       7.37K  ± 5.83K     2.78K  … 27.7K        1608 (16%)        0%
      branch_misses      2.88K  ± 23.4      2.82K  … 3.08K         415 ( 4%)        0%
    
    Benchmark 2 (10000 runs): ./zig-out/bin/substr --simdv2
      measurement          mean ± σ            min … max           outliers         delta
      wall_time           429us ± 75.5us     369us … 3.85ms        393 ( 4%)        ⚡-  9.1% ±  0.4%
      peak_rss           1.20MB ±    0      1.20MB … 1.20MB          0 ( 0%)          -  0.0% ±  0.0%
      cpu_cycles          304K  ± 28.4K      282K  … 1.07M        1140 (11%)        ⚡- 19.2% ±  0.3%
      instructions        561K  ± 0.52       561K  …  561K           5 ( 0%)        ⚡-  2.9% ±  0.0%
      cache_references   38.7K  ±  610      29.9K  … 40.3K          25 ( 0%)          -  0.1% ±  0.0%
      cache_misses       5.21K  ± 3.53K     2.57K  … 27.3K        1306 (13%)        ⚡- 29.3% ±  1.8%
      branch_misses      1.07K  ± 14.0      1.02K  … 1.17K         275 ( 3%)        ⚡- 62.8% ±  0.0%
    
    

    Comparing to the previous SIMD version, the number of branch misses has dropped by 60%, and it’s 9% faster too. Nice!

    The number of branch misses is lower, which can cause faster execution, but I suspect that a much bigger impact is the fact that there are less false positives, which means less byte-by-byte memory accesses and comparisons.

    AVX-512
    #

    Since AMD Zen 4 and Intel Cannon Lake, there has been a new SIMD instruction set, AVX-512 with 512-bit instructions – double the size of AVX2. I don’t have a computer that has AVX-512 right now, but I suspect that changing the Zig code to process 64 characters at once would lead to even better results.

    A smaller haystack
    #

    It’s clear that with a very large haystack, the SIMD version is much faster. But what about a tiny input, like less than a hunder characters?

    I did a bit of benchmarking with poop, but I found that I couldn’t accurately measure the speed, since both versions finish extremely very quickly. I decided to use zBench to do a microbenchmark. I decided to use a snippet from Moby Dick as seen here.

    +- run test stderr
    benchmark              runs     total time     time/run (avg ± σ)    (min ... max)                p75        p99        p995      
    -----------------------------------------------------------------------------------------------------------------------------
    find_substr            100000   424.368ms      4.243us ± 740ns       (3.964us ... 107.923us)      4.187us    7.075us    7.245us   
    find_substr_simd_v2    100000   147.883ms      1.478us ± 186ns       (1.417us ... 21.354us)       1.483us    1.539us    1.548us   
    

    I was surprised to see that even when processing less than a hundred characters, the SIMD algorithm is still faster! The difference between 4μs vs 1μs is extremely small, but it’s slightly faster nonetheless.

    Conclusion
    #

    As you can see, SIMD can be used to make substring searching dramatically faster, for both very large and very small strings.

    But if it’s so much better, then why haven’t I made a pull request to change std.mem.indexOf to use SIMD? Well, the reason is that

    1. std.mem.indexOf is generic over element size, and having a size larger than u8 makes the algorithm much slower
    2. The algorithm used in stdmem.indexOf is cross-platform, while the SIMD code wouldn’t be. (not all platforms have SIMD registers at all, Arm has only 128-bit)

    Substring searching is rarely the bottleneck in programs, especially ones written in a fast language like Zig. That’s why I don’t personally think it would be worth it to add it to the standard library.

    Still, it was great to learn about this advanced optimization technique and see some concrete performance measurements from it!

    The full code is available on GitHub here.

    Further reading
    #

    • SIMD with Zig https://www.openmymind.net/SIMD-With-Zig/
    • SIMD-friendly algorithms for substring searching: http://0x80.pl/notesen/2016-11-28-simd-strfind.html
    • memchr source code: https://github.com/BurntSushi/memchr
    Share. Facebook Twitter Pinterest LinkedIn Reddit WhatsApp Telegram Email
    Previous ArticleHand-picked selection of articles on AI fundamentals/concepts
    Next Article GPT-OSS-120B runs on just 8GB VRAM & 64GB+ system RAM
    TechAiVerse
    • Website

    Jonathan is a tech enthusiast and the mind behind Tech AI Verse. With a passion for artificial intelligence, consumer tech, and emerging innovations, he deliver clear, insightful content to keep readers informed. From cutting-edge gadgets to AI advancements and cryptocurrency trends, Jonathan breaks down complex topics to make technology accessible to all.

    Related Posts

    Older Windows 11 PCs need a Secure Boot fix ASAP

    February 11, 2026

    Why Ring’s Super Bowl ad hits so sinister

    February 11, 2026

    This dual-CPU PC from 1995 was so cool, Microsoft had to kill it

    February 11, 2026
    Leave A Reply Cancel Reply

    Top Posts

    Ping, You’ve Got Whale: AI detection system alerts ships of whales in their path

    April 22, 2025666 Views

    Lumo vs. Duck AI: Which AI is Better for Your Privacy?

    July 31, 2025251 Views

    6.7 Cummins Lifter Failure: What Years Are Affected (And Possible Fixes)

    April 14, 2025151 Views

    6 Best MagSafe Phone Grips (2025), Tested and Reviewed

    April 6, 2025111 Views
    Don't Miss
    Gadgets February 11, 2026

    Honda Malaysia Targets 60,000 Sales in 2026 with Expanded e:HEV Lineup

    Honda Malaysia Targets 60,000 Sales in 2026 with Expanded e:HEV Lineup Honda Malaysia is accelerating…

    Older Windows 11 PCs need a Secure Boot fix ASAP

    Why Ring’s Super Bowl ad hits so sinister

    This dual-CPU PC from 1995 was so cool, Microsoft had to kill it

    Stay In Touch
    • Facebook
    • Twitter
    • Pinterest
    • Instagram
    • YouTube
    • Vimeo

    Subscribe to Updates

    Get the latest creative news from SmartMag about art & design.

    About Us
    About Us

    Welcome to Tech AI Verse, your go-to destination for everything technology! We bring you the latest news, trends, and insights from the ever-evolving world of tech. Our coverage spans across global technology industry updates, artificial intelligence advancements, machine learning ethics, and automation innovations. Stay connected with us as we explore the limitless possibilities of technology!

    Facebook X (Twitter) Pinterest YouTube WhatsApp
    Our Picks

    Honda Malaysia Targets 60,000 Sales in 2026 with Expanded e:HEV Lineup

    February 11, 20263 Views

    Older Windows 11 PCs need a Secure Boot fix ASAP

    February 11, 20262 Views

    Why Ring’s Super Bowl ad hits so sinister

    February 11, 20263 Views
    Most Popular

    7 Best Kids Bikes (2025): Mountain, Balance, Pedal, Coaster

    March 13, 20250 Views

    VTOMAN FlashSpeed 1500: Plenty Of Power For All Your Gear

    March 13, 20250 Views

    This new Roomba finally solves the big problem I have with robot vacuums

    March 13, 20250 Views
    © 2026 TechAiVerse. Designed by Divya Tech.
    • Home
    • About Us
    • Contact Us
    • Privacy Policy
    • Terms & Conditions

    Type above and press Enter to search. Press Esc to cancel.