Close Menu

    Subscribe to Updates

    Get the latest creative news from FooBar about art, design and business.

    What's Hot

    Acer Expands Predator Lineup with AI Laptop, Gaming Desktops, Ultra-Fast Monitor and Keyboard

    IFA 2025: Acer unveils world’s lightest 16-inch laptop

    Samsung Malaysia introduces Galaxy A07 and Galaxy A17 5G

    Facebook X (Twitter) Instagram
    • Artificial Intelligence
    • Business Technology
    • Cryptocurrency
    • Gadgets
    • Gaming
    • Health
    • Software and Apps
    • Technology
    Facebook X (Twitter) Instagram Pinterest Vimeo
    Tech AI Verse
    • Home
    • Artificial Intelligence

      Blue-collar jobs are gaining popularity as AI threatens office work

      August 17, 2025

      Man who asked ChatGPT about cutting out salt from his diet was hospitalized with hallucinations

      August 15, 2025

      What happens when chatbots shape your reality? Concerns are growing online

      August 14, 2025

      Scientists want to prevent AI from going rogue by teaching it to be bad first

      August 8, 2025

      AI models may be accidentally (and secretly) learning each other’s bad behaviors

      July 30, 2025
    • Business

      Cloudflare hit by data breach in Salesloft Drift supply chain attack

      September 2, 2025

      Cloudflare blocks largest recorded DDoS attack peaking at 11.5 Tbps

      September 2, 2025

      Why Certified VMware Pros Are Driving the Future of IT

      August 24, 2025

      Murky Panda hackers exploit cloud trust to hack downstream customers

      August 23, 2025

      The rise of sovereign clouds: no data portability, no party

      August 20, 2025
    • Crypto

      Trump Death Rumors Fueled $1.6 Million In Prediction Market Bets This Weekend

      September 3, 2025

      3 US Crypto Stocks to Watch This Week

      September 3, 2025

      The Shocking Cost Of Bitcoin Payments: One Transaction Can Power a UK Home For 3 Weeks

      September 3, 2025

      Analysts Increase IREN Price Target: Will The Stock Keep Rallying?

      September 3, 2025

      ​​Pi Network Gears Up for Version 23 Upgrade, But Market Demand Stays Flat

      September 3, 2025
    • Technology

      I love Windows PCs, but a $599 MacBook would be mighty tempting

      September 3, 2025

      Acer’s new OLED gaming monitor hits a blistering 720Hz, with a catch

      September 3, 2025

      Acer Chromebook Plus Spin 514 review: This 2-in-1 multitasks like a pro

      September 3, 2025

      Acer’s Nitro V 16S packs big RTX gaming power into a slim frame

      September 3, 2025

      Acer’s latest Chromebook Plus laptop joins Google’s AI wave

      September 3, 2025
    • Others
      • Gadgets
      • Gaming
      • Health
      • Software and Apps
    Check BMI
    Tech AI Verse
    You are at:Home»Technology»Faster substring search with SIMD in Zig
    Technology

    Faster substring search with SIMD in Zig

    TechAiVerseBy TechAiVerseAugust 11, 2025No Comments5 Mins Read2 Views
    Facebook Twitter Pinterest Telegram LinkedIn Tumblr Email Reddit
    Faster substring search with SIMD in Zig
    Share
    Facebook Twitter LinkedIn Pinterest WhatsApp Email

    BMI Calculator – Check your Body Mass Index for free!

    Faster substring search with SIMD in Zig

    Published 10.08.2025

    I’ve been learning a lot about low-level programming languages lately, and for a long time there has been one thing that has interested me: SIMD (or ‘single instruction, multiple data’) code. I’ve seen a lot of articles about having massive performance gains by utilizing SIMD and wanted to learn how to do it myself.

    This article is a journey into implementing ~60% faster substring searching compared to Zig’s std.mem.indexOf using a SIMD-friendly algorithm.

    Baseline
    #

    This is the baseline function that we will be comparing against:

    fn find_substr(needle: []const u8, haystack: []const u8) ?usize {
        return std.mem.indexOf(u8, haystack, needle);
    }
    

    It’s the closest thing to a substring search function from Zig’s standard library. It returns the first index of a subsequence – or null if not found.

    SIMD algorithm
    #

    This algorithm is taken directly from Wojciech Muła’s fantastic article SIMD-friendly algorithms for substring searching, which seems to have the best algorithms for finding substrings in a large body of text.

    Here’s how the algorithm works: say that we want to find the index of the word “blue” (the needle) in “It was a beautiful, bounteous, blue day” (the haystack). First, we extract the first and last character of the needle (‘b’ and ’e’) and store them in a variable.

    Then we will loop through all of the characters in haystack, loading the next 32 characters (bytes) from memory into a SIMD register and comparing each character (byte) in the register with ‘b’. This will result in a mask containing 32 bytes, 1 if the character is ‘b’ and 0 in all other cases.

    We will do the same with the last character, but load the characters with an offset (needle.len - 1).

    Without the offset, any match that starts in one 32‑byte chunk and ends in the next would be missed. With this method, we can also check for needles that are longer than 32 characters.

    The result will be two bit masks, First and Last, where we can use bit-wise AND (Result = First & Last) to figure out potential substring occurrences.

    Result will be 1 only when there is a ‘b’ at index i followed by an ’e’ at index i+3. We still need to check if those positions actually contain the value “blue”, but this still dramatically reduces the number of checks (= individual memory accesses) that are necessary. We’ll see how this works in practice in the next section.

    Implementation in Zig
    #

    First, to properly use SIMD, let’s assume that the CPU supports AVX2 (Advanced Vector Extensions 2) and has 256-bit wide registers.

    All desktop processors less than 10 years old support AVX2, with newer ones also supporting AVX-512 with 512-bit wide registers.

    This allows us to use Zig’s @Vector function to make a type:

    const Block = @Vector(32, u8); // number of elements, element type (32*8=256)
    

    By using Block, we are telling the compiler that the operations on this datatype should use SIMD instructions where possible.

    Next, we take the first and last letters of the search word (’needle’) and load them into two SIMD registers, so that every byte of the register is filled with the character. This is handled by another built-in function, @splat:

    const first_letter: Block = @splat(needle[0]);
    const last_letter: Block = @splat(needle[needle.len - 1]);
    

    In the main loop, we check that there is enough characters left in haystack so that we can read the next 32 + needle.len characters. Inside the block, we load the blocks that we’re going to compare first_letter and last_letter with.

    const n = haystack.len;
    const k = needle.len;
    var i: usize = 0;
    while (i + k + 32 <= n) : (i += 32) {
        const first_block: Block = haystack[i..][0..32].*;
        const last_block: Block = haystack[i + k - 1 ..][0..32].*;
    

    Now we can make the comparisons and combine them into a mask:

    	// ...
        const eq_first = first_letter == first_block;
        const eq_last = last_letter == last_block;
        var mask: std.bit_set.IntegerBitSet(32) = .{ .mask = @bitCast(eq_first & eq_last) };
    

    Here we can use an IntegerBitSet from Zig’s standard library. We construct it by casting the result of eq_first & eq_last into a 32-bit integer. If the resulting mask is non-zero, there are candidates in the current block.

    	// ...
        while (mask.findFirstSet()) |bitpos| {
            if (std.mem.eql(u8, haystack[i + bitpos + 1 ..][0 .. k - 1], needle[1..])) {
                return i + bitpos;
            }
            _ = mask.toggleFirstSet();
        }
    

    The first and last characters of the substring are checked already, so we don’t need to check their equality again.

    Finally, if there are leftover characters, we can fall back to std.mem.IndexOf.

    	// ...
    	// Fallback to scalar search for the tail
    	if (i < n) {
    	    if (std.mem.indexOf(u8, haystack[i..], needle)) |rel_idx| {
    	        return i + rel_idx;
    	    }
    	}
    	return null; // no substring found
    }
    

    Benchmarks
    #

    To properly show the effects of our SIMD algorithm, we’re going to need a large haystack. For this, I’ve chosen to use the entirety Moby Dick in plain text, and a search word ’newsletter’, which appears at the very end of the text.

    The code is available on GitHub

    To compile the code, I ran zig build with -Doptimize=ReleaseFast:

    > zig build -Doptimize=ReleaseFast
    

    Support for bitwise operations on boolean vectors was added in Zig 0.15, which is unreleased as of now (August 2025). If you want to run the code on your system, you need to build Zig from the master branch.

    To measure performance and compare against baseline, I’ll use one of my favorite CLI tools, poop:

    > poop -d 10000 "./zig-out/bin/substr" "./zig-out/bin/substr --simd"
    
    
    Benchmark 1 (6361 runs): ./zig-out/bin/substr
      measurement          mean ± σ            min … max           outliers         delta
      wall_time          1.22ms ±  185us     903us … 5.33ms        242 ( 4%)        0%
      peak_rss           1.20MB ±  290      1.18MB … 1.20MB          2 ( 0%)        0%
      cpu_cycles         2.15M  ± 40.5K     2.10M  … 2.71M         312 ( 5%)        0%
      instructions       1.85M  ± 0.75      1.85M  … 1.85M          56 ( 1%)        0%
      cache_references   43.8K  ±  620      38.3K  … 44.9K           9 ( 0%)        0%
      cache_misses       19.0K  ± 10.3K     4.08K  … 33.6K           0 ( 0%)        0%
      branch_misses      48.1   ± 17.4        20   …  104           97 ( 2%)        0%
    
    Benchmark 2 (10000 runs): ./zig-out/bin/substr --simd
      measurement          mean ± σ            min … max           outliers         delta
      wall_time           500us ± 96.9us     397us … 4.23ms        840 ( 8%)        ⚡- 58.9% ±  0.4%
      peak_rss           1.20MB ±  164      1.18MB … 1.20MB          1 ( 0%)          +  0.0% ±  0.0%
      cpu_cycles          369K  ± 36.1K      340K  … 1.10M        1167 (12%)        ⚡- 82.8% ±  0.1%
      instructions        578K  ± 0.53       578K  …  578K           6 ( 0%)        ⚡- 68.8% ±  0.0%
      cache_references   38.8K  ±  545      34.1K  … 40.5K           6 ( 0%)        ⚡- 11.4% ±  0.0%
      cache_misses       5.62K  ± 4.97K     2.11K  … 27.9K        1529 (15%)        ⚡- 70.3% ±  1.2%
      branch_misses      2.88K  ± 23.4      2.81K  … 3.09K         453 ( 5%)        💩+5879.8% ±  1.4%
    

    (Scroll right to see more data)

    As you can see, for a large body of text, the speedup is noticeable: 59% faster with 80% less CPU cycles!

    The SIMD version only took 500 microseconds to complete on average, including the overhead of loading the program into memory and printing the result. 500 microseconds is half a millisecond. That’s how fast my laptop searches through a whole book, 200 000 words, cover to cover. This is why computers are so powerful! How long would it take for a human to do that?

    This is quite a large improvement, and proves that the SIMD code is actually working (otherwise the reduction in CPU cycles wouldn’t be so massive). Can we do even better though?

    Character selection
    #

    You may notice from the output of poop that the number of branch misses has absolutely blown up – from 48 on average to 2.88k !

    Why does this happen? Well, if you were to count how many times the inner while loop is entered when the mask is non-zero:

        var i: usize = 0;
        var count: usize = 0;
        while (i + k + 32 <= n) : (i += 32) {
            const block_first: Block = haystack[i..][0..32].*;
            const block_last: Block = haystack[i + k - 1 ..][0..32].*;
            const eq_first = first == block_first;
            const eq_last = last == block_last;
            var mask: std.bit_set.IntegerBitSet(32) = .{ .mask = @bitCast(eq_first & eq_last) };
            while (mask.findFirstSet()) |bitpos| {
                count += 1;
                if (std.mem.eql(u8, haystack[i + bitpos + 1 ..][0 .. k - 1], needle[1..])) {
                    std.debug.print("found match with count: {}n", .{count});
                    return i + bitpos;
                }
                _ = mask.toggleFirstSet();
            }
        }
    
    found match with count: 2792
    

    The fact that count is so close to the number of mispredictions suggests that each time the mask is non‑zero we incur a branch miss.

    Unfortunately, there is no obvious way to prevent this with the current algorithm. The state-of-the-art seems to be choosing two bytes in the needle that occur less frequently according to a pre-calculated frequency distribution. This is used in the memchr crate in Rust, as explained by the author in this comment on Hacker News.

    For example, the needle newsletter has the rarest characters w at index 2 and l at index 4.

    The function in memchr can be found here. I’ve ported it into Zig, and you can see it here.

        const needle_index_pair = find_rarest(needle);
    
        const first_letter: Block = @splat(needle[needle_index_pair[0]]);
        const first_offset = needle_index_pair[0];
        const second_letter: Block = @splat(needle[needle_index_pair[1]]);
        const second_offset = needle_index_pair[1];
    

    The algorithm is the exact same, but the index for first_letter and second_letter now varies according to the pre-calculated frequency distribution.

    Benchmarks
    #

    
    Benchmark 1 (10000 runs): ./zig-out/bin/substr --simd
      measurement          mean ± σ            min … max           outliers         delta
      wall_time           472us ± 62.9us     400us … 1.62ms        735 ( 7%)        0%
      peak_rss           1.20MB ±    0      1.20MB … 1.20MB          0 ( 0%)        0%
      cpu_cycles          376K  ± 44.7K      347K  … 1.46M        1213 (12%)        0%
      instructions        578K  ± 0.54       578K  …  578K          10 ( 0%)        0%
      cache_references   38.7K  ±  715      28.3K  … 40.6K          96 ( 1%)        0%
      cache_misses       7.37K  ± 5.83K     2.78K  … 27.7K        1608 (16%)        0%
      branch_misses      2.88K  ± 23.4      2.82K  … 3.08K         415 ( 4%)        0%
    
    Benchmark 2 (10000 runs): ./zig-out/bin/substr --simdv2
      measurement          mean ± σ            min … max           outliers         delta
      wall_time           429us ± 75.5us     369us … 3.85ms        393 ( 4%)        ⚡-  9.1% ±  0.4%
      peak_rss           1.20MB ±    0      1.20MB … 1.20MB          0 ( 0%)          -  0.0% ±  0.0%
      cpu_cycles          304K  ± 28.4K      282K  … 1.07M        1140 (11%)        ⚡- 19.2% ±  0.3%
      instructions        561K  ± 0.52       561K  …  561K           5 ( 0%)        ⚡-  2.9% ±  0.0%
      cache_references   38.7K  ±  610      29.9K  … 40.3K          25 ( 0%)          -  0.1% ±  0.0%
      cache_misses       5.21K  ± 3.53K     2.57K  … 27.3K        1306 (13%)        ⚡- 29.3% ±  1.8%
      branch_misses      1.07K  ± 14.0      1.02K  … 1.17K         275 ( 3%)        ⚡- 62.8% ±  0.0%
    
    

    Comparing to the previous SIMD version, the number of branch misses has dropped by 60%, and it’s 9% faster too. Nice!

    The number of branch misses is lower, which can cause faster execution, but I suspect that a much bigger impact is the fact that there are less false positives, which means less byte-by-byte memory accesses and comparisons.

    AVX-512
    #

    Since AMD Zen 4 and Intel Cannon Lake, there has been a new SIMD instruction set, AVX-512 with 512-bit instructions – double the size of AVX2. I don’t have a computer that has AVX-512 right now, but I suspect that changing the Zig code to process 64 characters at once would lead to even better results.

    A smaller haystack
    #

    It’s clear that with a very large haystack, the SIMD version is much faster. But what about a tiny input, like less than a hunder characters?

    I did a bit of benchmarking with poop, but I found that I couldn’t accurately measure the speed, since both versions finish extremely very quickly. I decided to use zBench to do a microbenchmark. I decided to use a snippet from Moby Dick as seen here.

    +- run test stderr
    benchmark              runs     total time     time/run (avg ± σ)    (min ... max)                p75        p99        p995      
    -----------------------------------------------------------------------------------------------------------------------------
    find_substr            100000   424.368ms      4.243us ± 740ns       (3.964us ... 107.923us)      4.187us    7.075us    7.245us   
    find_substr_simd_v2    100000   147.883ms      1.478us ± 186ns       (1.417us ... 21.354us)       1.483us    1.539us    1.548us   
    

    I was surprised to see that even when processing less than a hundred characters, the SIMD algorithm is still faster! The difference between 4μs vs 1μs is extremely small, but it’s slightly faster nonetheless.

    Conclusion
    #

    As you can see, SIMD can be used to make substring searching dramatically faster, for both very large and very small strings.

    But if it’s so much better, then why haven’t I made a pull request to change std.mem.indexOf to use SIMD? Well, the reason is that

    1. std.mem.indexOf is generic over element size, and having a size larger than u8 makes the algorithm much slower
    2. The algorithm used in stdmem.indexOf is cross-platform, while the SIMD code wouldn’t be. (not all platforms have SIMD registers at all, Arm has only 128-bit)

    Substring searching is rarely the bottleneck in programs, especially ones written in a fast language like Zig. That’s why I don’t personally think it would be worth it to add it to the standard library.

    Still, it was great to learn about this advanced optimization technique and see some concrete performance measurements from it!

    The full code is available on GitHub here.

    Further reading
    #

    • SIMD with Zig https://www.openmymind.net/SIMD-With-Zig/
    • SIMD-friendly algorithms for substring searching: http://0x80.pl/notesen/2016-11-28-simd-strfind.html
    • memchr source code: https://github.com/BurntSushi/memchr

    BMI Calculator – Check your Body Mass Index for free!

    Share. Facebook Twitter Pinterest LinkedIn Reddit WhatsApp Telegram Email
    Previous ArticleHand-picked selection of articles on AI fundamentals/concepts
    Next Article GPT-OSS-120B runs on just 8GB VRAM & 64GB+ system RAM
    TechAiVerse
    • Website

    Jonathan is a tech enthusiast and the mind behind Tech AI Verse. With a passion for artificial intelligence, consumer tech, and emerging innovations, he deliver clear, insightful content to keep readers informed. From cutting-edge gadgets to AI advancements and cryptocurrency trends, Jonathan breaks down complex topics to make technology accessible to all.

    Related Posts

    I love Windows PCs, but a $599 MacBook would be mighty tempting

    September 3, 2025

    Acer’s new OLED gaming monitor hits a blistering 720Hz, with a catch

    September 3, 2025

    Acer Chromebook Plus Spin 514 review: This 2-in-1 multitasks like a pro

    September 3, 2025
    Leave A Reply Cancel Reply

    Top Posts

    Ping, You’ve Got Whale: AI detection system alerts ships of whales in their path

    April 22, 2025176 Views

    6.7 Cummins Lifter Failure: What Years Are Affected (And Possible Fixes)

    April 14, 202548 Views

    New Akira ransomware decryptor cracks encryptions keys using GPUs

    March 16, 202530 Views

    Is Libby Compatible With Kobo E-Readers?

    March 31, 202529 Views
    Don't Miss
    Gadgets September 3, 2025

    Acer Expands Predator Lineup with AI Laptop, Gaming Desktops, Ultra-Fast Monitor and Keyboard

    Acer Expands Predator Lineup with AI Laptop, Gaming Desktops, Ultra-Fast Monitor and Keyboard Acer has…

    IFA 2025: Acer unveils world’s lightest 16-inch laptop

    Samsung Malaysia introduces Galaxy A07 and Galaxy A17 5G

    I love Windows PCs, but a $599 MacBook would be mighty tempting

    Stay In Touch
    • Facebook
    • Twitter
    • Pinterest
    • Instagram
    • YouTube
    • Vimeo

    Subscribe to Updates

    Get the latest creative news from SmartMag about art & design.

    About Us
    About Us

    Welcome to Tech AI Verse, your go-to destination for everything technology! We bring you the latest news, trends, and insights from the ever-evolving world of tech. Our coverage spans across global technology industry updates, artificial intelligence advancements, machine learning ethics, and automation innovations. Stay connected with us as we explore the limitless possibilities of technology!

    Facebook X (Twitter) Pinterest YouTube WhatsApp
    Our Picks

    Acer Expands Predator Lineup with AI Laptop, Gaming Desktops, Ultra-Fast Monitor and Keyboard

    September 3, 20252 Views

    IFA 2025: Acer unveils world’s lightest 16-inch laptop

    September 3, 20252 Views

    Samsung Malaysia introduces Galaxy A07 and Galaxy A17 5G

    September 3, 20252 Views
    Most Popular

    Xiaomi 15 Ultra Officially Launched in China, Malaysia launch to follow after global event

    March 12, 20250 Views

    Apple thinks people won’t use MagSafe on iPhone 16e

    March 12, 20250 Views

    French Apex Legends voice cast refuses contracts over “unacceptable” AI clause

    March 12, 20250 Views
    © 2025 TechAiVerse. Designed by Divya Tech.
    • Home
    • About Us
    • Contact Us
    • Privacy Policy
    • Terms & Conditions

    Type above and press Enter to search. Press Esc to cancel.