
    An optimizing compiler doesn’t help much with long instruction dependencies

    By TechAiVerse, June 1, 2025

    We at Johnny’s Software Lab LLC are experts in performance. If performance is in any way a concern in your software project, feel free to contact us.

    I once read a rumor related to training AI models, something along the lines of “whether we compile our code in debug mode or release mode, it doesn’t matter, because our models are huge, all of our code is memory bound”.

    I wanted to check whether this is true for the cases that interest me, so I wrote a few small kernels. Here is the first:

    for (size_t i { 0ULL }; i < pointers.size(); i++) {
        sum += vector[pointers[i]];
    }

    This is a very memory-intensive kernel. The data from vector is read from random locations; by varying the size of vector, we can experiment with data being read from the L1, L2 or L3 cache, or from main memory.
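
    For readers who want to reproduce the loop, here is a minimal self-contained sketch. The element types, the fixed seed and the names make_random_indices and gather_sum are assumptions made for illustration; the article does not show how vector and pointers were declared or filled.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <random>
#include <vector>

// Hypothetical helper: returns `count` uniformly random indices into a data
// vector of `data_size` elements. The fixed seed keeps runs repeatable.
std::vector<std::uint32_t> make_random_indices(std::size_t count,
                                               std::size_t data_size,
                                               std::uint32_t seed = 42U) {
    assert(data_size > 0);
    std::mt19937 gen(seed);
    std::uniform_int_distribution<std::uint32_t> dist(
        0U, static_cast<std::uint32_t>(data_size - 1));
    std::vector<std::uint32_t> pointers(count);
    for (auto& p : pointers) {
        p = dist(gen);
    }
    return pointers;
}

// The kernel from the text. Every load address comes from `pointers`, so it
// is known before the previous load finishes; only `sum` is carried from one
// iteration to the next.
std::uint64_t gather_sum(const std::vector<std::uint32_t>& data,
                         const std::vector<std::uint32_t>& pointers) {
    std::uint64_t sum = 0;
    for (std::size_t i = 0; i < pointers.size(); i++) {
        sum += data[pointers[i]];
    }
    return sum;
}
```

    Calling gather_sum(data, make_random_indices(n, data.size())) with a growing data.size() moves the working set from the caches out to main memory, which is the knob the experiment turns.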

    I compiled this loop with GCC at optimization levels -O0 (no optimizations) and -O3 (full optimizations). Then I calculated the instruction count ratio, instruction_count(O0) / instruction_count(O3), and the runtime ratio, runtime(O0) / runtime(O3). On imaginary perfect hardware, where the runtime is proportional to the instruction count and doesn’t depend on memory at all, the graph could look like this:

    In the above graph, the O0 version might have 10 times more instructions than the O3 version, and therefore it should be 10 times slower than the O3 version, regardless of the vector size.

    Of course, real hardware is different. The same graph for the same loop executed on my brand new AMD Ryzen 9 PRO 8945HS w/ Radeon 780M Graphics looks like this:

    The O0 version executes almost 10 times more instructions than the O3 version, but when the dataset is big enough, this doesn’t matter much: the O3 version is only about 3 times faster than the O0 version. So, the claim is (at least) partially true. Of course, being three times faster is nevertheless something to appreciate, especially since very memory-intensive code like the above doesn’t appear too often (but it does appear, e.g. the above loop is very similar to looking up data in a hash map).
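
    One way to read these numbers, offered as a rough sketch rather than the author's own analysis: runtime is the instruction count times the average cycles per instruction, divided by the clock frequency. The symbols N, CPI and f are notation introduced here, f is assumed equal for both builds, and the figures 10 and 3 are the rounded ratios quoted above.

```latex
\[
T = \frac{N \cdot \mathrm{CPI}}{f}
\quad\Longrightarrow\quad
\frac{T_{O0}}{T_{O3}}
  = \frac{N_{O0}}{N_{O3}} \cdot \frac{\mathrm{CPI}_{O0}}{\mathrm{CPI}_{O3}}
\]
\[
3 \approx 10 \cdot \frac{\mathrm{CPI}_{O0}}{\mathrm{CPI}_{O3}}
\quad\Longrightarrow\quad
\frac{\mathrm{CPI}_{O0}}{\mathrm{CPI}_{O3}} \approx 0.3
\]
```

    On the imaginary perfect hardware the CPI ratio is 1, so the runtime ratio equals the instruction count ratio. On the real machine with a large dataset, the O0 version spends roughly a third as many cycles per instruction on average, which is consistent with its extra instructions being cheap compared with the cache misses that both versions pay for.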

    Do you need to discuss a performance problem in your project? Or maybe you want a vectorization training for yourself or your team? Contact us.
    Or follow us on LinkedIn, Twitter or Mastodon and get notified as soon as new content becomes available.

    What about instruction level parallelism?

    The above code is very memory intensive, but there is a catch. There is no loop-carried dependency on the loads; there is only a loop-carried dependency on the addition. This means that the CPU can issue many loads in advance and run them in parallel. Even though the problem itself is memory intensive, it has a lot of instruction-level parallelism and can be executed relatively fast.
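
    The effect of overlapping loads can be sketched with a back-of-the-envelope model. This is an assumption made for illustration, not a measurement: the function estimated_cycles and its parameters are hypothetical, and the model ignores bandwidth limits, caches and everything else the CPU does.

```cpp
#include <cassert>

// Rough model (assumption): if `loads_in_flight` independent loads overlap,
// the total time spent waiting on memory is the fully serial latency divided
// by the amount of overlap.
double estimated_cycles(double accesses, double latency_cycles,
                        double loads_in_flight) {
    assert(loads_in_flight >= 1.0);
    return accesses * latency_cycles / loads_in_flight;
}
```

    With an assumed latency of 200 cycles, 1,000 accesses issued one at a time cost about 200,000 cycles in this model, while the same accesses with ten loads in flight cost about 20,000. The number of loads a real core can keep in flight depends on the microarchitecture.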

    What happens when the problem is significantly memory bound and it has very low instruction level parallelism? Consider the following example:

    while(p < total_accesses) {
        sum += list[idx].value;
        idx = list[idx].next;
        
        if (idx == node_t::NULLPTR) {
            idx = 0U;
        }
    
        p++;
    }

    This code essentially performs total_accesses accesses on a linked list stored in a vector. In contrast to the previous snippet, this snippet has very low instruction-level parallelism: to get the address of the next load, we have to load the current node from memory. If a load takes 200 cycles, the CPU needs to dispatch a load, wait 200 cycles for it to complete, and only then dispatch the next load, and so on.
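
    A self-contained version of this traversal might look like the sketch below. The node layout, the sentinel value and the helper names make_shuffled_list and chase_sum are assumptions; the article only shows the fields value and next and the constant node_t::NULLPTR. The layout here takes 8 bytes per node, smaller than the 16 bytes per node that the article's "1K values or 16 kB" figure implies.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <numeric>
#include <random>
#include <vector>

// Assumed node layout: indices into the vector act as "pointers".
struct node_t {
    static constexpr std::uint32_t NULLPTR = 0xFFFFFFFFU;
    std::uint32_t value;
    std::uint32_t next;
};

// Hypothetical builder: links all `n` nodes into one chain that starts at
// index 0 and visits the remaining nodes in shuffled order, so consecutive
// nodes are usually far apart in memory. Each node stores its own index.
std::vector<node_t> make_shuffled_list(std::size_t n, std::uint32_t seed = 42U) {
    assert(n > 0);
    std::vector<std::uint32_t> order(n);
    std::iota(order.begin(), order.end(), 0U);
    std::mt19937 gen(seed);
    std::shuffle(order.begin() + 1, order.end(), gen);  // node 0 stays the head
    std::vector<node_t> list(n);
    for (std::size_t i = 0; i < n; i++) {
        std::uint32_t next = node_t::NULLPTR;  // last node ends the chain
        if (i + 1 < n) {
            next = order[i + 1];
        }
        list[order[i]].value = order[i];
        list[order[i]].next = next;
    }
    return list;
}

// The loop from the text: each load address depends on the previous load.
std::uint64_t chase_sum(const std::vector<node_t>& list,
                        std::size_t total_accesses) {
    std::uint64_t sum = 0;
    std::uint32_t idx = 0U;
    std::size_t p = 0;
    while (p < total_accesses) {
        sum += list[idx].value;
        idx = list[idx].next;
        if (idx == node_t::NULLPTR) {
            idx = 0U;  // wrap around to the head of the list
        }
        p++;
    }
    return sum;
}
```

    Because every node stores its own index, one full pass over a list of n nodes must sum to n(n-1)/2, which is a cheap check that the shuffled chain really visits every node once before wrapping.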

    So, what does the previous graph look like for the above snippet? Here it is:

    Ouch! The O3 version executes on average 4.6 times fewer instructions than the O0 version, but in the best case (the smallest vector of 1K values or 16 kB) the speedup is only 2.12. In all other cases the speedup is less than 1.1. All the CPU resources are in vain: low ILP means the out-of-order CPU can process only a limited number of instructions at a time. In this case it doesn’t matter whether the compiler optimizes or not; the bottleneck is low ILP.

    Jonathan is a tech enthusiast and the mind behind Tech AI Verse. With a passion for artificial intelligence, consumer tech, and emerging innovations, he delivers clear, insightful content to keep readers informed. From cutting-edge gadgets to AI advancements and cryptocurrency trends, Jonathan breaks down complex topics to make technology accessible to all.
