1) template code bloat 2) inlining .. can hurt i-cache

Q: does template code bloat increase instruction cache pressure and increase risk of cache thrashing?
A: I have seen about eight forum posts mentioning that template code bloat increases i-cache pressure, but no clear evidence and no consensus among authorities. In fact, no known authority has confirmed this issue.

However, I would think that having many similar functions in the executable is likely to lead to unnecessary competition for i-cache.
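A minimal sketch of that concern (the function and types are made up for illustration). Each distinct template argument below stamps out a separate copy of the machine code, so one template used with many types multiplies its i-cache footprint:

```cpp
#include <cstdint>
#include <vector>

// Each distinct template argument produces its own copy of this
// function's machine code in the executable.
template <typename T>
T sumAll(const std::vector<T>& v) {
    T total{};
    for (const T& x : v) total += x;
    return total;
}

int main() {
    // Four instantiations -> up to four near-identical code copies,
    // all competing for the same i-cache lines if they are all hot.
    std::vector<std::int32_t> a{1, 2, 3};
    std::vector<std::int64_t> b{1, 2, 3};
    std::vector<float>        c{1.0f, 2.0f};
    std::vector<double>       d{1.0, 2.0};
    return static_cast<int>(sumAll(a) + sumAll(b) + sumAll(c) + sumAll(d));
}
```

If the compiler does not inline them away, `nm -C` on the resulting binary shows one `sumAll` symbol per instantiation.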

In contrast, for inlining there is consensus. Wikipedia states that excess inlining will hurt speed, due to inlined code consuming too much of the instruction cache: “Inlining imposes a cost on performance, due to the code expansion (due to duplication) hurting instruction cache performance.[5] This is most significant if, prior to expansion, the working set of the program (or a hot section of code) fit in one level of the memory hierarchy (e.g., L1 cache), but after expansion it no longer fits, resulting in frequent cache misses at that level.”

See also inline: footprint+perf can Backfire ! #Google


i-cache caution: manual inlining

Suppose hot function f2() calls f1(). Should f1() be inlined?

Martin Thompson suggested that developers can effectively “take over” inlining by copying a short function f1()’s body into a caller function f2(), if both functions are hot.

This way, we don’t rely on the C++ compiler or the JIT compiler to decide what to inline. Note that a JIT compiler can make that decision dynamically, based on runtime heuristics.

However, if the expanded f2()’s footprint becomes too big to fit into i-cache, then this practice can backfire.
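Bearing that caveat in mind, here is a minimal sketch of the practice, with made-up function bodies (f1/f2 are the names from the discussion above):

```cpp
// Before: the hot caller relies on the compiler/JIT to inline f1().
inline int f1(int x) { return x * 31 + 7; }

int f2(int x) {
    return f1(x) + f1(x + 1);
}

// After "taking over": f1()'s body is copied into the caller by hand,
// so no inlining heuristic is involved. The risk: if the expanded body
// grows too big for the i-cache, this backfires.
int f2_manual(int x) {
    int a = x * 31 + 7;          // f1(x), expanded by hand
    int b = (x + 1) * 31 + 7;    // f1(x + 1), expanded by hand
    return a + b;
}
```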

i-cache: extract corner-case into non-inlined function

https://www.eetimes.com/document.asp?doc_id=1275470&page_number=2 gave an elegant illustration of reverse-inlining.

“If you have a large function with many cases, ask yourself which will be executed the most often. Place the infrequently-executed cases in a different function.”

If a corner case is embedded in a hot function, the i-cache may suffer. A JIT compiler could in theory move the corner case to the end of a hot loop, but that depends on many factors. Is it good practice for developers to manually move the less frequent code paths towards the end of the function, with early-returns before them? I guess the i-cache would then only need to hold the first half of the function, which contains the hot code paths.

The trick: if you extract the corner case into a separate function and somehow[1] ensure it is not inlined, then the i-cache is relieved.

[1] A JIT compiler would notice that the function should NOT be inlined, but in C++ I am not sure how to guarantee it.
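For what it’s worth, GCC and Clang offer a non-standard attribute that gives this guarantee (MSVC spells it `__declspec(noinline)`). A hedged sketch with made-up function names:

```cpp
#include <cstdio>

// The extracted corner case. The noinline attribute (a GCC/Clang
// extension) keeps this cold code out of the hot function's body.
__attribute__((noinline))
static int handleCornerCase(int badValue) {
    std::fprintf(stderr, "rare input: %d\n", badValue);
    return -1;
}

int hotFunction(int x) {
    // __builtin_expect (GCC/Clang) hints that this branch is rarely taken.
    if (__builtin_expect(x < 0, 0))
        return handleCornerCase(x);
    return x * 2 + 1;   // the hot path stays small and contiguous
}
```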

 

i-cache: avoid handling corner cases in hot functions #JIT inlining

“Functions often contain a significant amount of code that deals with corner cases, that is, cases which are rarely executed. This means that a large number of instructions read into cache are rarely executed.” https://www.eetimes.com/document.asp?doc_id=1275470

Martin Thompson agreed. He hinted that JIT compilers could notice (or have been designed to notice) the “cold” corner cases and dynamically refactor the code path so the corner case is handled at the end of a hot loop.

Martin also said inlining has to be done judiciously, to avoid pulling corner cases (cold stuff) into hot functions. In C++, inlining decisions are made at compile time, but a JIT can make the same decisions at run time.

Personally, I guess this JIT technique is academic or experimental and probably hit-and-miss, so the performance gain or penalty is hard to predict.

compiler loop-unrolling could hurt i-cache

[[art of unix programming]] P290/291 advocates (presumably based on experience) bending over backward to ensure the central data structure and the time-critical loop NEVER fall out of i-cache/d-cache.

I think this is a useful sound bite to impress interviewers.

Developers need to know the sizes of the L1, L2 and L3 caches on our hardware, then target the central data structure at one of them, say the L2 cache. We want to size our data structure to fit in there and never break the boundary.
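On Linux/glibc the cache sizes can be queried programmatically via sysconf; the OrderBook structure and the 256KB L2 budget below are made-up examples of enforcing such a boundary at compile time:

```cpp
#include <unistd.h>   // sysconf (Linux/glibc)
#include <cstdio>

// Hypothetical central data structure, sized to stay inside an
// assumed 256KB L2 budget (here 2 x 8192 doubles = 128KB).
struct OrderBook {
    double bids[8192];
    double asks[8192];
};
static_assert(sizeof(OrderBook) <= 256 * 1024,
              "central data structure must fit the assumed L2 budget");

int main() {
    // glibc reports cache geometry via sysconf (0 or -1 if unknown).
    std::printf("L1d: %ld bytes\n", sysconf(_SC_LEVEL1_DCACHE_SIZE));
    std::printf("L2 : %ld bytes\n", sysconf(_SC_LEVEL2_CACHE_SIZE));
    std::printf("L3 : %ld bytes\n", sysconf(_SC_LEVEL3_CACHE_SIZE));
}
```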

“Small is beautiful” applies to both code and data.

Example 1: cpu vs i-cache (IMHO the more convincing of the two)
Example 2: cpu vs d-cache

In both examples, once the cpu becomes fast enough, we ought to make it do more work in order to economize on the scarce caches.

— Example 1
Loop-unrolling enlarges text-section of the binary, so the additional “text” must compete for scarce instruction-cache.

[[art of unix programming]] claims that at least in some real (not theoretical) cases, this compiler optimization may backfire, i.e. hurt rather than improve performance.

The critical situation: suppose the loop runs in one of the extremely hot functions that need to remain in the instruction cache. In that case we want to minimize the code footprint.

Loading this hot function over and over from main memory can be worse than executing the original loop (without unrolling). This happens because cpu execution speed has improved over the years while main memory access speed remains a bottleneck.
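To make the footprint cost concrete, here is a minimal sketch of unrolling by hand, i.e. what a compiler does automatically at aggressive optimization levels:

```cpp
// Rolled: one compare-and-branch per element; tiny i-cache footprint.
long sumRolled(const int* a, int n) {
    long s = 0;
    for (int i = 0; i < n; ++i) s += a[i];
    return s;
}

// Unrolled by 4: fewer branches per element, but roughly 4x the
// loop-body "text". In a hot function already near the i-cache
// budget, the extra footprint can cost more than the saved branches.
long sumUnrolled(const int* a, int n) {
    long s = 0;
    int i = 0;
    for (; i + 4 <= n; i += 4)
        s += a[i] + a[i + 1] + a[i + 2] + a[i + 3];
    for (; i < n; ++i)   // remainder loop for the leftover elements
        s += a[i];
    return s;
}
```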

— Example 2
A second example claimed in the same paragraph revolves around a pre-computed look-up table for frequently accessed values. This kind of optimization can shave cpu cycles, but it can also intensify the fierce competition for the scarce L1 cache.

In other words, pre-computing a data table was designed to save cpu time, but it could backfire if the increased data footprint leads to data-cache misses.

We are again talking about extremely hot data that needs to remain in L1 cache. If this pre-computed data is accessed frequently, it can hog the L1 cache.

In such a case, removing the pre-computation, and re-computing each time, could be faster.

I think this problem can become real:
• If the cpu becomes much faster over the years, or the computation gets simplified, then the cpu saving diminishes.
• At the same time, if the pre-computed data grows larger, then the “hogging” worsens.
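A hedged sketch of the trade-off; the sine table and its 512KB size are invented for illustration:

```cpp
#include <cmath>
#include <cstddef>

constexpr double kTwoPi = 6.283185307179586;

// Pre-computed: saves a sin() call per lookup, but the 512KB table
// (65536 x 8 bytes) competes with all other hot data for d-cache.
struct SineTable {
    static constexpr std::size_t N = 65536;
    double values[N];
    SineTable() {
        for (std::size_t i = 0; i < N; ++i)
            values[i] = std::sin(kTwoPi * i / N);
    }
};

// Re-computed: burns cpu cycles on every call but touches no extra
// data. If the cpu is fast and the table would thrash L1/L2, this wins.
inline double sineRecomputed(std::size_t i) {
    return std::sin(kTwoPi * i / 65536);
}
```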

inline: footprint+perf can Backfire ! #Google #i-cache

As illustrated below, without inlining, the instruction cache system could hit a change-of-flow (COF) twice as it enters and exits your function aa(). If aa() is instead inlined, i.e. embedded in a hostFunc, then the instruction cache system can often load the entire hostFunc at once, eliminating the COF. This helps the instruction cache, but excessive inlining can increase the executable footprint (code bloat).

The Google C++ style guide points out that

  • inlining can either increase or decrease (for tiny functions) the executable footprint; in general, a smaller footprint improves running time thanks to instruction-cache efficiency
  • virtual functions are often inlined (i.e. defined in the class body) primarily for convenience/maintainability, not performance

See also https://www.eetimes.com/document.asp?doc_id=1275470 and https://www.eetimes.com/document.asp?doc_id=1275474

As an example, consider the function call tree flow in Figure 1 of the linked article. Suppose function F2 is linked near function F1, but function F3 is not linked near F1. When function F1 calls F2, it is possible that F2 is already in the cache and that there will be no cache miss. (The likelihood that F2 is in the cache depends on the sizes of F1 and F2, as well as the location of the call inside F1.) In contrast, there will probably be a cache miss when F1 calls F3. Because F3 is located far away from F1 in memory, it is not likely to be in the cache when F1 calls it.
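A related (though not identical) technique: GCC and Clang offer non-standard hot/cold attributes that influence layout, grouping hot functions together in the text section and moving cold ones away. A sketch reusing the F1/F3 names from the example:

```cpp
// GCC/Clang extensions: "hot" functions are placed together (e.g. in
// .text.hot), so a hot caller and hot callee are more likely to share
// i-cache lines; "cold" functions are moved out of the way.
__attribute__((hot))
int F1(int x) {
    return x + 1;       // the frequently called function
}

__attribute__((cold))
int F3(int x) {
    return x - 1;       // rarely called; a cache miss here is acceptable
}
```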

optimize code for i-cache: a few tips

I don’t see any ground-breaking suggestions here. I think only very hot functions (confirmed by oprofile + cachegrind) require such micro-optimization.

I like the function^code based fragmentation framework on https://www.eetimes.com/document.asp?doc_id=1275472 (3 parts)

  • inline: footprint+perf can backfire. Can be classified as embedding
  • use table lookup to replace an “if” ladder, minimizing jumps (see the sketch after this list)
  • branching — refactor a lengthy-n-corner-case (not “hot”) code chunk out to a function, so 99% of the time the instruction cache (esp. the pre-fetch flavor) doesn’t load a big chunk of cold stuff.
    • this is the opposite of embedding !
  • Trim the executable footprint. Reduce code bloat due to inlining and templates?
  • loop unrolling to minimize jumps. I think this is practical and time-honored — at aggressive optimization levels some compilers actually perform loop unrolling! Programmers can do it manually.
  • Use array (anything contiguous) instead of linked list or maps to exploit d-cache + i-cache
  • https://software.intel.com/en-us/blogs/2014/11/17/split-huge-function-if-called-by-loop-for-best-utilizing-instruction-cache is a 2014 Intel paper — split huge function if it’s invoked in a loop.
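As promised under the table-lookup bullet, a minimal sketch; the grades and prices are invented for illustration:

```cpp
#include <array>

// Before: an "if" ladder, i.e. a chain of compares and jumps.
int priceBandLadder(int grade) {
    if (grade == 0) return 100;
    if (grade == 1) return 150;
    if (grade == 2) return 225;
    if (grade == 3) return 340;
    return -1;
}

// After: one bounds check plus one indexed load; no jump ladder.
int priceBandTable(int grade) {
    static constexpr std::array<int, 4> kTable{100, 150, 225, 340};
    return (grade >= 0 && grade < 4) ? kTable[grade] : -1;
}
```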