An optimizing compiler doesn’t help much with long instruction dependencies

By TechAiVerse, June 1, 2025

We at Johnny’s Software Lab LLC are experts in performance. If performance is in any way a concern in your software project, feel free to contact us.

I read a rumor somewhere related to training AI models, something along the lines of: “whether we compile our code in debug mode or release mode, it doesn’t matter, because our models are huge and all of our code is memory bound”.

I wanted to check whether this is true for the cases that interest me, so I wrote a few small kernels to investigate. Here is the first:

    for (size_t i { 0ULL }; i < pointers.size(); i++) {
        sum += vector[pointers[i]];
    }

This is a very memory-intensive kernel. The data in vector is read from random locations; depending on the size of vector, we can experiment with the data coming from the L1, L2, or L3 cache, or from main memory.
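The post doesn’t show how the data is prepared. A minimal sketch of a setup that matches the loop (the element type, sizes, and RNG seed are my assumptions, not from the post):

```cpp
#include <cstddef>
#include <cstdint>
#include <numeric>
#include <random>
#include <vector>

// Hypothetical setup for the kernel above: `vector` holds the payload
// and `pointers` holds random indices into it. Growing `n` moves the
// working set from L1 through L2/L3 into main memory.
std::uint64_t run_kernel(std::size_t n, std::size_t accesses) {
    std::vector<std::uint64_t> vector(n);
    std::iota(vector.begin(), vector.end(), 0ULL);

    std::mt19937_64 rng{42};
    std::uniform_int_distribution<std::size_t> dist(0, n - 1);
    std::vector<std::size_t> pointers(accesses);
    for (auto& p : pointers) p = dist(rng);

    // The loop from the post: every load is independent of the others.
    std::uint64_t sum = 0;
    for (std::size_t i { 0ULL }; i < pointers.size(); i++) {
        sum += vector[pointers[i]];
    }
    return sum;
}
```

Timing this function at several values of n, from a few kilobytes up to hundreds of megabytes, reproduces the cache-level sweep described above.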

I compiled this loop with GCC at optimization levels -O0 (no optimizations) and -O3 (full optimizations). Then I calculated the instruction count ratio, instruction_count(O0) / instruction_count(O3), and the runtime ratio, runtime(O0) / runtime(O3). On imaginary perfect hardware, where runtime is proportional to instruction count and doesn’t depend on memory at all, the graph could look like this:

In the above graph, the O0 version might have 10 times more instructions than the O3 version, and therefore it is ten times slower than the O3 version, regardless of the vector size.

    Of course, real hardware is different. The same graph for the same loop executed on my brand new AMD Ryzen 9 PRO 8945HS w/ Radeon 780M Graphics looks like this:

The O0 version generates almost 10 times more instructions than the O3 version, but when the dataset is big enough this doesn’t matter much: the O3 version is only about 3 times faster than the O0 version. So the claim is at least partially true. Of course, a threefold speedup is still something to appreciate, especially since very memory-intensive code like the above doesn’t appear too often (but it does appear; e.g. the above loop is very similar to looking up data in a hash map).

    Do you need to discuss a performance problem in your project? Or maybe you want a vectorization training for yourself or your team? Contact us
Or follow us on LinkedIn, Twitter, or Mastodon to get notified as soon as new content becomes available.

    What about instruction level parallelism?

The above code is very memory intensive, but there is a catch: there is no loop-carried dependency on the loads, only a loop-carried dependency on the addition. This means the CPU can issue many loads in advance and run them in parallel. Even though the problem itself is memory intensive, it has a lot of instruction-level parallelism and can be executed relatively fast.
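As an aside, even the remaining loop-carried dependency on the addition can be relaxed by splitting the sum into several independent accumulators. This is a standard trick, not something the post measures, so treat it as a sketch:

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Four independent partial sums shorten the addition dependency chain:
// the four `+=` chains can make progress in parallel instead of
// serializing on a single `sum` register.
std::uint64_t sum_unrolled(const std::vector<std::uint64_t>& v,
                           const std::vector<std::size_t>& ptrs) {
    std::uint64_t s0 = 0, s1 = 0, s2 = 0, s3 = 0;
    std::size_t i = 0;
    for (; i + 4 <= ptrs.size(); i += 4) {
        s0 += v[ptrs[i]];
        s1 += v[ptrs[i + 1]];
        s2 += v[ptrs[i + 2]];
        s3 += v[ptrs[i + 3]];
    }
    for (; i < ptrs.size(); i++) s0 += v[ptrs[i]];  // leftover elements
    return s0 + s1 + s2 + s3;
}
```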

What happens when the problem is significantly memory bound and also has very low instruction-level parallelism? Consider the following example:

    while(p < total_accesses) {
        sum += list[idx].value;
        idx = list[idx].next;
        
        if (idx == node_t::NULLPTR) {
            idx = 0U;
        }
    
        p++;
    }

This code essentially performs total_accesses accesses on a linked list stored in a vector. In contrast to the previous snippet, this snippet has very low instruction-level parallelism: to get the address of the next load, we first have to load the current node from memory. If a load takes 200 cycles, the CPU needs to dispatch a load, wait 200 cycles for it to complete, and only then dispatch the next load, and so on.
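The post doesn’t show the definition of node_t. A plausible layout consistent with the snippet (the field types and the NULLPTR sentinel value are my assumptions):

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// The "linked list" lives in a vector, so `next` is an index, not a
// pointer; NULLPTR is a sentinel index marking the end of the list.
struct node_t {
    static constexpr std::uint32_t NULLPTR = 0xFFFFFFFFu;
    std::uint64_t value;  // payload added to `sum`
    std::uint32_t next;   // index of the next node in `list`
};

// The traversal from the post: each load's address depends on the
// previous load, so the loads form one long serial dependency chain.
std::uint64_t traverse(const std::vector<node_t>& list,
                       std::size_t total_accesses) {
    std::uint64_t sum = 0;
    std::uint32_t idx = 0;
    for (std::size_t p = 0; p < total_accesses; p++) {
        sum += list[idx].value;
        idx = list[idx].next;
        if (idx == node_t::NULLPTR) {
            idx = 0;  // wrap around and traverse the list again
        }
    }
    return sum;
}
```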

So what does the previous graph look like for the above snippet? Here it is:

Ouch! The O3 version executes on average 4.6 times fewer instructions than the O0 version, but in the best case (the smallest vector of 1K values, or 16 kB) the speedup is only 2.12. In all other cases the speedup is less than 1.1. All the CPU resources are wasted: low ILP means the out-of-order CPU can process only a limited number of instructions per cycle. In this case it doesn’t matter whether the compiler optimizes or not; the bottleneck is the low ILP.
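A back-of-envelope model makes the collapse concrete: per iteration, the CPU pays at least the latency of the dependency chain and at least the raw throughput cost of the instructions, and whichever is larger decides whether the instruction count matters. (The model and the numbers in the note below are my illustration, not measurements from the post.)

```cpp
#include <algorithm>

// Crude latency/throughput model: an iteration can finish no sooner
// than its dependency chain, and no sooner than instrs / issue_width
// cycles of raw instruction throughput.
double cycles_per_iter(double chain_latency_cycles, double instrs,
                       double issue_width) {
    return std::max(chain_latency_cycles, instrs / issue_width);
}
```

With a 200-cycle load on the dependency chain, a hypothetical O0 iteration (~46 instructions, 4-wide issue) and an O3 iteration (~10 instructions) both cost about 200 cycles: max(200, 11.5) versus max(200, 2.5). The extra O0 instructions simply hide under the load latency.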

