An optimizing compiler doesn’t help much with long instruction dependencies

By TechAiVerse, June 1, 2025

We at Johnny’s Software Lab LLC are experts in performance. If performance is in any way a concern in your software project, feel free to contact us.

I read a rumor somewhere related to training AI models, something along the lines of: “whether we compile our code in debug mode or release mode, it doesn’t matter, because our models are huge and all of our code is memory bound”.

I wanted to check whether this is true for the cases that interest me, so I wrote a few small kernels to investigate. Here is the first:

    for (size_t i { 0ULL }; i < pointers.size(); i++) {
        sum += vector[pointers[i]];
    }

This is a very memory-intensive kernel. The data in vector is read from random locations; depending on the size of vector, we can experiment with the data coming from the L1, L2, or L3 cache, or from main memory.
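The post doesn’t show how the data is prepared. A minimal sketch of a setup that matches the loop (the element type, sizes, and RNG seed are my assumptions, not from the post):

```cpp
#include <cstddef>
#include <cstdint>
#include <numeric>
#include <random>
#include <vector>

// Hypothetical setup for the kernel above: `vector` holds the payload
// and `pointers` holds random indices into it. Growing `n` moves the
// working set from L1 through L2/L3 into main memory.
std::uint64_t run_kernel(std::size_t n, std::size_t accesses) {
    std::vector<std::uint64_t> vector(n);
    std::iota(vector.begin(), vector.end(), 0ULL);

    std::mt19937_64 rng{42};
    std::uniform_int_distribution<std::size_t> dist(0, n - 1);
    std::vector<std::size_t> pointers(accesses);
    for (auto& p : pointers) p = dist(rng);

    // The loop from the post: every load is independent of the others.
    std::uint64_t sum = 0;
    for (std::size_t i { 0ULL }; i < pointers.size(); i++) {
        sum += vector[pointers[i]];
    }
    return sum;
}
```

Timing this function at several values of n, from a few kilobytes up to hundreds of megabytes, reproduces the cache-level sweep described above.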

I compiled this loop with GCC at optimization levels -O0 (no optimizations) and -O3 (full optimizations). Then I calculated the instruction count ratio, instruction_count(O0) / instruction_count(O3), and the runtime ratio, runtime(O0) / runtime(O3). On imaginary perfect hardware, where runtime is proportional to instruction count and doesn’t depend on memory at all, the graph could look like this:

In the above graph, the O0 version might have 10 times more instructions than the O3 version, and therefore it is ten times slower than the O3 version, regardless of the vector size.

    Of course, real hardware is different. The same graph for the same loop executed on my brand new AMD Ryzen 9 PRO 8945HS w/ Radeon 780M Graphics looks like this:

The O0 version generates almost 10 times more instructions than the O3 version, but when the dataset is big enough this doesn’t matter much: the O3 version is only about 3 times faster than the O0 version. So the claim is at least partially true. Of course, a threefold speedup is still something to appreciate, especially since very memory-intensive code like the above doesn’t appear too often (but it does appear; e.g. the above loop is very similar to looking up data in a hash map).

    Do you need to discuss a performance problem in your project? Or maybe you want a vectorization training for yourself or your team? Contact us
Or follow us on LinkedIn, Twitter, or Mastodon to get notified as soon as new content becomes available.

    What about instruction level parallelism?

The above code is very memory intensive, but there is a catch: there is no loop-carried dependency on the loads, only a loop-carried dependency on the addition. This means the CPU can issue many loads in advance and run them in parallel. Even though the problem itself is memory intensive, it has a lot of instruction-level parallelism and can be executed relatively fast.
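As an aside, even the remaining loop-carried dependency on the addition can be relaxed by splitting the sum into several independent accumulators. This is a standard trick, not something the post measures, so treat it as a sketch:

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Four independent partial sums shorten the addition dependency chain:
// the four `+=` chains can make progress in parallel instead of
// serializing on a single `sum` register.
std::uint64_t sum_unrolled(const std::vector<std::uint64_t>& v,
                           const std::vector<std::size_t>& ptrs) {
    std::uint64_t s0 = 0, s1 = 0, s2 = 0, s3 = 0;
    std::size_t i = 0;
    for (; i + 4 <= ptrs.size(); i += 4) {
        s0 += v[ptrs[i]];
        s1 += v[ptrs[i + 1]];
        s2 += v[ptrs[i + 2]];
        s3 += v[ptrs[i + 3]];
    }
    for (; i < ptrs.size(); i++) s0 += v[ptrs[i]];  // leftover elements
    return s0 + s1 + s2 + s3;
}
```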

What happens when the problem is significantly memory bound and also has very low instruction-level parallelism? Consider the following example:

    while(p < total_accesses) {
        sum += list[idx].value;
        idx = list[idx].next;
        
        if (idx == node_t::NULLPTR) {
            idx = 0U;
        }
    
        p++;
    }

This code essentially performs total_accesses accesses on a linked list stored in a vector. In contrast to the previous snippet, this snippet has very low instruction-level parallelism: to get the address of the next load, we first have to load the current node from memory. If a load takes 200 cycles, the CPU needs to dispatch a load, wait 200 cycles for it to complete, and only then dispatch the next load, and so on.
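The post doesn’t show the definition of node_t. A plausible layout consistent with the snippet (the field types and the NULLPTR sentinel value are my assumptions):

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// The "linked list" lives in a vector, so `next` is an index, not a
// pointer; NULLPTR is a sentinel index marking the end of the list.
struct node_t {
    static constexpr std::uint32_t NULLPTR = 0xFFFFFFFFu;
    std::uint64_t value;  // payload added to `sum`
    std::uint32_t next;   // index of the next node in `list`
};

// The traversal from the post: each load's address depends on the
// previous load, so the loads form one long serial dependency chain.
std::uint64_t traverse(const std::vector<node_t>& list,
                       std::size_t total_accesses) {
    std::uint64_t sum = 0;
    std::uint32_t idx = 0;
    for (std::size_t p = 0; p < total_accesses; p++) {
        sum += list[idx].value;
        idx = list[idx].next;
        if (idx == node_t::NULLPTR) {
            idx = 0;  // wrap around and traverse the list again
        }
    }
    return sum;
}
```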

So what does the previous graph look like for the above snippet? Here it is:

Ouch! The O3 version executes on average 4.6 times fewer instructions than the O0 version, but in the best case (the smallest vector of 1K values, or 16 kB) the speedup is only 2.12. In all other cases the speedup is less than 1.1. All the CPU resources are wasted: low ILP means the out-of-order CPU can process only a limited number of instructions per cycle. In this case it doesn’t matter whether the compiler optimizes or not; the bottleneck is the low ILP.
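A back-of-envelope model makes the collapse concrete: per iteration, the CPU pays at least the latency of the dependency chain and at least the raw throughput cost of the instructions, and whichever is larger decides whether the instruction count matters. (The model and the numbers in the note below are my illustration, not measurements from the post.)

```cpp
#include <algorithm>

// Crude latency/throughput model: an iteration can finish no sooner
// than its dependency chain, and no sooner than instrs / issue_width
// cycles of raw instruction throughput.
double cycles_per_iter(double chain_latency_cycles, double instrs,
                       double issue_width) {
    return std::max(chain_latency_cycles, instrs / issue_width);
}
```

With a 200-cycle load on the dependency chain, a hypothetical O0 iteration (~46 instructions, 4-wide issue) and an O3 iteration (~10 instructions) both cost about 200 cycles: max(200, 11.5) versus max(200, 2.5). The extra O0 instructions simply hide under the load latency.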

