[[art of unix programming]] P290/291 advocates (presumably based on experience) bending over backward to ensure the central data structure and the time-critical loop NEVER fall out of the i-cache/d-cache.
I think this is a useful sound bite to impress interviewers.
Developers need to know the sizes of the L1, L2 and L3 caches on our hardware. Then pick one of them as the target for the central data structure, say the L2 cache, and size the data structure so it fits within that boundary and never spills past it.
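As a side note, on Linux/glibc these sizes can be queried at runtime. A minimal sketch (the _SC_LEVEL*_CACHE_SIZE constants are glibc extensions; other systems expose the same information through /sys or sysctl instead):

```c
/* Minimal sketch, Linux/glibc-specific: query cache sizes via sysconf().
 * The _SC_LEVEL*_CACHE_SIZE constants are glibc extensions and may
 * return 0 or -1 on other platforms. */
#include <stdio.h>
#include <unistd.h>

int main(void) {
    long l1d = sysconf(_SC_LEVEL1_DCACHE_SIZE);
    long l2  = sysconf(_SC_LEVEL2_CACHE_SIZE);
    long l3  = sysconf(_SC_LEVEL3_CACHE_SIZE);
    printf("L1d: %ld bytes\nL2 : %ld bytes\nL3 : %ld bytes\n", l1d, l2, l3);
    return 0;
}
```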
“Small is beautiful” applies to both code and data.
Example 1 : cpu vs i-cache (IMHO the more convincing of the two)
Example 2 : cpu vs d-cache
In both examples, once the cpu becomes fast enough, we ought to make it do more work in order to economize on scarce cache.
— Example 1
Loop-unrolling enlarges the text section of the binary, so the additional “text” must compete for the scarce instruction cache.
[[art of unix programming]] claims that at least in some real (not theoretical) cases, this compiler optimization can backfire, i.e. hurt rather than improve performance.
The critical situation – suppose the loop runs in one of the extremely hot functions that must remain in the instruction cache; then we want to minimize its code footprint.
Loading this hot function over and over from main memory can be worse than executing the original loop (without unrolling). This becomes more likely as cpu execution speed improves over the years while main-memory access speed remains a bottleneck.
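To make the trade-off concrete, here is a toy sketch (my own, not from the book) of a compact loop next to a manually unrolled one:

```c
/* Toy illustration of the trade-off: the unrolled version saves some
 * loop/branch overhead but emits more instructions, so it occupies more
 * i-cache. Whether that wins or loses depends on how hot the caller is
 * and how much other code shares the same cache lines. */
#include <stddef.h>

/* compact: small code footprint, one branch per element */
long sum_compact(const int *a, size_t n) {
    long s = 0;
    for (size_t i = 0; i < n; i++)
        s += a[i];
    return s;
}

/* unrolled by 4: fewer branches, but roughly 4x the loop body in the
 * text section, competing for the same instruction cache */
long sum_unrolled(const int *a, size_t n) {
    long s = 0;
    size_t i = 0;
    for (; i + 4 <= n; i += 4)
        s += a[i] + a[i + 1] + a[i + 2] + a[i + 3];
    for (; i < n; i++)          /* leftover tail elements */
        s += a[i];
    return s;
}
```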
— Example 2
A second example, claimed in the same paragraph, revolves around a pre-computed look-up table for frequently accessed values. This kind of optimization can shave cpu cycles, but it also intensifies the fierce competition for scarce L1 cache.
In other words, pre-computing a data table is meant to save cpu time, but it could backfire if the increased data footprint leads to data-cache misses.
Again we are talking about extremely hot data that needs to remain in the L1 cache. If this pre-computed table is accessed frequently, it can hog the L1 cache.
In such a case, removing the pre-computation and re-computing each time could be faster.
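A toy sketch of the same trade-off on the data side (again my own, not from the book): a 64KB popcount table answers each query with a single load, but the table is typically larger than an L1 data cache, whereas recomputing touches no extra data at all.

```c
/* Toy illustration: table-driven popcount vs recomputing per call.
 * The table version does less arithmetic, but its 64KB footprint
 * competes with the rest of the hot data for L1d. */
#include <stdint.h>
#include <stddef.h>

static uint8_t popcount_table[1u << 16];   /* 65536 bytes of d-cache pressure */

void init_table(void) {
    for (size_t i = 0; i < (1u << 16); i++) {
        unsigned v = (unsigned)i, c = 0;
        while (v) { c += v & 1u; v >>= 1; }
        popcount_table[i] = (uint8_t)c;
    }
}

/* table-driven: one load per call, but the table hogs the d-cache */
unsigned popcount16_table(uint16_t x) {
    return popcount_table[x];
}

/* recompute: a few more cpu instructions per call, zero data footprint */
unsigned popcount16_compute(uint16_t x) {
    unsigned c = 0;
    while (x) { c += x & 1u; x >>= 1; }
    return c;
}
```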
I think this problem can become real –
• If the cpu has become much faster in recent years, or the computation gets simplified, then the cpu saving from pre-computation diminishes.
• At the same time, if the pre-computed data grows larger, the “hogging” of the cache worsens.