I don’t see any ground-breaking suggestions. I think only very hot functions (confirmed by oprofile + cachegrind) require such micro-optimization.
I like the function^code based fragmentation framework on https://www.eetimes.com/document.asp?doc_id=1275472 (3 parts)
- inline: can backfire on both footprint and perf, since every call site gets its own copy of the function body. Can be classified as embedding
- use table lookup to replace “if” ladder — minimize jumps
- branching — refactor a lengthy-n-corner-case (not “hot”) code chunk out to a function, so 99% of the time the instruction cache (esp. the pre-fetch flavor) doesn’t load a big chunk of cold stuff.
  - this is the opposite of embedding!
- Trim the executable footprint. Reduce code bloat due to inlining and templates?
- loop unrolling to minimize jumps. I think this is practical and time-honored. At aggressive optimization levels some compilers perform loop unrolling on their own; programmers can also do it manually.
- Use an array (anything contiguous) instead of a linked list or map to exploit d-cache + i-cache
- https://software.intel.com/en-us/blogs/2014/11/17/split-huge-function-if-called-by-loop-for-best-utilizing-instruction-cache is a 2014 Intel blog post: split a huge function if it’s invoked in a loop.
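
To make a few of the bullets above concrete, here is a minimal C++ sketch of three of them: table lookup in place of an "if" ladder, moving a cold corner-case chunk out of the hot function, and a manually unrolled loop over a contiguous array. All names (feeLadder, feeTable, clampRare, scale, sumUnrolled4) and the numbers are made up for illustration; none come from the linked articles. The noinline/cold attribute is a GCC/Clang extension and is an assumption about the toolchain. Whether any of this actually helps must be confirmed by profiling.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <vector>

// --- 1. Table lookup instead of an "if" ladder -------------------------
// Hypothetical mapping from a small type code (0..3) to a fee.
inline int feeLadder(unsigned type) {
    if (type == 0) return 5;
    else if (type == 1) return 12;
    else if (type == 2) return 30;
    else if (type == 3) return 75;
    return 0;  // unknown type
}

// Same mapping: one range check plus one indexed load.
inline int feeTable(unsigned type) {
    static constexpr std::array<int, 4> kFee = {5, 12, 30, 75};
    return type < kFee.size() ? kFee[type] : 0;
}

// --- 2. Cold corner-case code moved out of the hot function ------------
// Assumption: GCC or Clang. Other compilers get an empty macro.
#if defined(__GNUC__)
#define COLD_NOINLINE __attribute__((noinline, cold))
#else
#define COLD_NOINLINE
#endif

// Stand-in for a lengthy corner-case chunk; only called for bad input.
COLD_NOINLINE int clampRare(int x) {
    assert(x < 0 || x > 1000);  // precondition: caller saw out-of-range x
    return x < 0 ? 0 : 1000;
}

// Hot path stays tiny: one compare, a rarely taken call, one multiply.
inline int scale(int x) {
    if (x < 0 || x > 1000) x = clampRare(x);
    return x * 2;
}

// --- 3. Manual 4-way unrolling over contiguous storage -----------------
inline long long sumUnrolled4(const std::vector<int>& v) {
    const std::size_t n = v.size();
    long long s0 = 0, s1 = 0, s2 = 0, s3 = 0;
    std::size_t i = 0;
    for (; i + 4 <= n; i += 4) {  // one loop branch per four elements
        s0 += v[i];
        s1 += v[i + 1];
        s2 += v[i + 2];
        s3 += v[i + 3];
    }
    for (; i < n; ++i) s0 += v[i];  // leftover 0..3 elements
    return s0 + s1 + s2 + s3;
}
```

For example, feeTable(2) and feeLadder(2) both give 30, and scale(2000) takes the cold path once and gives 2000. An optimizing compiler may already turn the ladder into a table or unroll the loop by itself, so compare the generated assembly before and after.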