loop fusion^fission

Stroustrup told me about the compiler optimization “loop fusion” when I asked how C/C++ can get beaten in benchmarks.

  • loop fusion may reduce loop overhead (fewer increments, compares and branches) and improve data locality
  • loop fission may improve locality of reference and expose parallelism on multi-core

I guess this is a technique to improve raw speed.
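To make the two terms concrete, here is a minimal sketch (my own illustration, not from Stroustrup; the functions and data are made up):

```cpp
#include <vector>

// Two separate passes over the same data: each pass pays its own loop
// overhead (increment, compare, branch) and re-walks the array.
void separatePasses(std::vector<double>& a, std::vector<double>& b) {
    for (std::size_t i = 0; i < a.size(); ++i) a[i] *= 2.0;
    for (std::size_t i = 0; i < a.size(); ++i) b[i] += a[i];  // assumes b.size() >= a.size()
}

// Loop fusion: one pass does both jobs -- half the loop overhead, and a[i]
// is still hot in cache when b[i] += a[i] runs.
void fused(std::vector<double>& a, std::vector<double>& b) {
    for (std::size_t i = 0; i < a.size(); ++i) {
        a[i] *= 2.0;
        b[i] += a[i];
    }
}

// Loop fission: the original single loop touched two unrelated arrays per
// iteration; splitting it lets each loop stream through one array (better
// locality), and the two loops could even run on different cores.
void fissioned(std::vector<double>& a, std::vector<double>& b) {
    for (std::size_t i = 0; i < a.size(); ++i) a[i] *= 2.0;
    for (std::size_t i = 0; i < b.size(); ++i) b[i] *= 3.0;
}
```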

default malloc is slow, can be improved

See https://stackoverflow.com/questions/161053/which-is-faster-stack-allocation-or-heap-allocation

I see various evidence that industry practitioners consider the default allocator too slow.

I don’t think system calls are the issue; malloc makes system calls only infrequently.
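One common mitigation, sketched below under my own assumptions (a toy bump/arena allocator, not any particular vendor product such as tcmalloc or jemalloc): pay for one big allocation up front, then hand out chunks by bumping an offset, so the hot path never touches malloc.

```cpp
#include <cstddef>
#include <vector>

// Minimal bump allocator: one upfront allocation, then each alloc() is just
// a pointer increment -- no locks, no free-list search, no syscalls.
class Arena {
public:
    explicit Arena(std::size_t bytes) : buf_(bytes), offset_(0) {}

    void* alloc(std::size_t n) {
        // round up so the next allocation stays suitably aligned
        const std::size_t a = alignof(std::max_align_t);
        n = (n + a - 1) & ~(a - 1);
        if (offset_ + n > buf_.size()) return nullptr;  // arena exhausted
        void* p = buf_.data() + offset_;
        offset_ += n;
        return p;
    }

    void reset() { offset_ = 0; }  // "free" everything at once

private:
    std::vector<char> buf_;   // backing storage, allocated once
    std::size_t offset_;
};
```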

inline perf can backfire ! #Google #i-cache

As illustrated below, without inlining the instruction cache system hits a change of flow (COF) twice as execution enters and exits your function aa(). If aa() is inlined into a hostFunc, the instruction cache system can often load the entire hostFunc in one go, eliminating the COF. This helps the i-cache, but excessive inlining can increase the executable footprint (code bloat).

google c++ guide points out that

  • inlining can either increase or decrease (for tiny functions) the executable footprint. In general, a smaller footprint improves running time due to instruction-cache efficiency
  • virtual functions are inlined (i.e. defined in class body) primarily for convenience/maintainability, not performance

See also https://www.eetimes.com/document.asp?doc_id=1275470 and https://www.eetimes.com/document.asp?doc_id=1275474

As an example, consider the function call tree flow in Figure 1. Suppose function F2 is linked near function F1, but function F3 is not linked near F1. When function F1 calls F2, it is possible that F2 is already in the cache and that there will be no cache miss. (The likelihood that F2 is in the cache depends on the sizes of F1 and F2, as well as the location of the call inside F1.) In contrast, there will probably be a cache miss when F1 calls F3. Because F3 is located far away from F1 in memory, it is not likely to be in the cache when F1 calls it.
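To make the trade-off concrete, here is a minimal sketch (my own toy example; __attribute__((noinline)) is a GCC/Clang extension):

```cpp
// Tiny accessor defined in the class body -- implicitly inline. The call
// disappears at each call site: no change of flow (COF), and the footprint
// may even shrink because the call/return sequence is gone.
class Quote {
public:
    double mid() const { return (bid_ + ask_) * 0.5; }
private:
    double bid_ = 0.0;
    double ask_ = 0.0;
};

// Large, rarely-executed routine: keep it out of line so it isn't copied
// into every caller and doesn't crowd hot code out of the i-cache.
__attribute__((noinline)) double slowRevalue(const Quote& q) {
    double v = q.mid();
    // ... imagine a long, cold corner-case computation here ...
    return v;
}
```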

mlock() : low-level syscall to prevent paging ] real-time apps

https://access.redhat.com/documentation/en-us/red_hat_enterprise_linux_for_real_time/7/html/reference_guide/using_mlock_to_avoid_page_io has sample code

See also https://eklitzke.org/mlock-and-mlockall

https://access.redhat.com/documentation/en-US/Red_Hat_Enterprise_MRG/1.3/html/Realtime_Reference_Guide/sect-Realtime_Reference_Guide-Memory_allocation-Using_mlock_to_avoid_memory_faults.html says

If the application is entering a time sensitive region of code, an mlockall call prior to entering, followed by munlockall can reduce paging while in the critical section. Similarly, mlock can be used on a data region that is relatively static or that will grow slowly but needs to be accessed without page faulting.
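A minimal sketch along the lines of that advice (criticalSection is a hypothetical placeholder; error handling kept minimal):

```cpp
#include <sys/mman.h>   // mlockall, munlockall (POSIX / Linux)
#include <cstdio>

// Hypothetical placeholder for the time-sensitive work.
static void criticalSection() { /* ... */ }

int main() {
    // Lock current and future pages into RAM so the critical section never
    // takes a major page fault. Usually needs CAP_IPC_LOCK or a generous
    // RLIMIT_MEMLOCK.
    if (mlockall(MCL_CURRENT | MCL_FUTURE) != 0) {
        std::perror("mlockall");
        return 1;
    }

    criticalSection();

    munlockall();   // allow paging again once we are past the hot path
    return 0;
}
```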

## optimize code for i-cache: few tips

I don’t see any ground-breaking suggestions. I think only very hot functions (confirmed by oprofile + cachegrind) require such micro-optimization.

I like the function^code based fragmentation framework on https://www.eetimes.com/document.asp?doc_id=1275472 (3 parts)

  • inline: footprint+perf can backfire. Can be classified as embedding
  • use table lookup to replace “if” ladder — minimize jumps (see the sketch after this list)
  • branching — refactor a lengthy-n-corner-case (not “hot”) code chunk out to a function, so 99% of the time the instruction cache (esp. the pre-fetch flavor) doesn’t load a big chunk of cold stuff.
    • this is the opposite of embedding !
  • Trim the executable footprint. Reduce code bloat due to inlining and templates?
  • loop unrolling to minimize jumps. I think this is practical and time-honored — at aggressive optimization levels some compilers actually perform loop unrolling! Programmers can do it manually.
  • Use array (anything contiguous) instead of linked list or maps to exploit d-cache + i-cache
  • https://software.intel.com/en-us/blogs/2014/11/17/split-huge-function-if-called-by-loop-for-best-utilizing-instruction-cache is a 2014 Intel paper — split huge function if it’s invoked in a loop.
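Here is the sketch promised above, combining two of the bullets: table lookup instead of an “if” ladder, and moving the cold corner case out of the hot function. All names are hypothetical; __attribute__((noinline)) is a GCC/Clang extension.

```cpp
#include <cstdio>

// --- table lookup instead of an "if" ladder -----------------------------
// Map a small message-type code to a handler without a chain of
// compare-and-branch instructions.
enum MsgType { QUOTE = 0, TRADE = 1, HEARTBEAT = 2, MSG_TYPE_COUNT };

void onQuote()     { /* hot path */ }
void onTrade()     { /* hot path */ }
void onHeartbeat() { /* hot path */ }

using Handler = void (*)();
constexpr Handler handlers[MSG_TYPE_COUNT] = { onQuote, onTrade, onHeartbeat };

inline void dispatch(MsgType t) {
    handlers[t]();   // one indexed load + indirect call, no if-ladder
}

// --- refactor the cold corner case out of the hot function --------------
// The rarely-taken error path lives in its own, non-inlined function, so
// the i-cache (and its prefetcher) streaming through processTick() never
// loads this cold chunk.
__attribute__((noinline)) void handleCorruptTick(int seq) {
    std::fprintf(stderr, "corrupt tick, seq=%d\n", seq);
}

inline void processTick(int seq, bool corrupt) {
    if (corrupt) { handleCorruptTick(seq); return; }  // rare cold branch
    dispatch(QUOTE);                                   // 99% hot path
}
```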

 

translation lookaside buffer #stats

  • a.k.a Address Translation Cache. The TLB lets the processor very quickly convert virtual addresses to physical addresses.
  • TLB is a cache for the big, slow page table
  • a typical TLB entry is a mapping of {virtual address -> physical address}
  • in contrast, a typical L1-cache entry is a mapping of {physical address -> payload}
  • You can hit both caches!
  • Both caches sit between the processor and main memory
  • each hardware system has one or more TLBs
  • a TLB miss can be handled in hardware or by the kernel
  • typical miss probability — 0.01% to 1%
  • typical miss latency (penalty) — 10 to 100 clock cycles to read the page table
  • typical hit latency: 0.5 to 1 clock cycle

If a TLB hit takes 1 clock cycle, a miss costs an extra 30 clock cycles (the penalty), and the miss rate is 1%, the effective memory cycle rate is an average of 1 × 0.99 + (1 + 30) × 0.01 = 1.30 clock cycles per memory access. This is just the standard formula: hit time + miss rate × miss penalty = 1 + 0.01 × 30 = 1.3.

https://stackoverflow.com/questions/5440128/thread-context-switch-vs-process-context-switch says the TLB favors thread context-switches over process context-switches: the TLB gets flushed on a process context-switch but not on a thread context-switch.

 

## fastest container choices: array of POD #or pre-sized vector

relevant to low-latency market data.

  • raw array is “lean and mean” — the most memory efficient; vector is very close, but we need to avoid reallocation
  • std::array is less popular but should offer similar performance to vector
  • all other containers are slower, with bigger footprint
  • For high performance, avoid containers of nodes/pointers — cache locality loves contiguous memory: after accessing the 1st element, accessing the 2nd element is likely a cache hit (see the sketch after this list)
    • set/map and linked lists suffer the same problem
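A minimal sketch of the pre-sized-vector-of-POD idea (Tick and the expected size are hypothetical):

```cpp
#include <cstddef>
#include <vector>

struct Tick {        // plain-old-data market data record
    long   seq;
    double px;
    int    qty;
};

// Hypothetical feed-handler buffer: reserve once at start-up so the hot
// path never reallocates, and the Ticks stay in one contiguous block
// (cache-friendly sequential access, unlike a list or map of pointers).
std::vector<Tick> makeTickBuffer(std::size_t expectedTicks) {
    std::vector<Tick> buf;
    buf.reserve(expectedTicks);   // one allocation up front
    return buf;
}
```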

contiguous memory data structures : page faults+cache miss

Compared to linked data structures, I feel vectors (and arrays, deques, circular arrays, unrolled linked lists…) reduce page faults and cache misses.

P124 [[art of debugging]] describes how a simple array is loaded from swap space into a real memory “page”. This is managed by the virtual memory system.

P19 [[optimized c++]] describes how the L1/L2 cache also loads a vector more efficiently than a linked data structure: each time main memory is read, the adjacent memory cells are loaded together into the L1 cache (one cache line at a time), so contiguous-memory data structures result in far fewer cache misses, in theory.

A virtual memory page is typically 4 KB. A cache line is typically 64 bytes.

Both page faults and cache misses are costly at runtime. In both cases the “backing store” is much slower, but we have to read from it. These situations are by design and very common, but it pays to reduce their incidence. Even reducing page-fault or cache-miss frequency by 10% could be helpful.
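A rough back-of-the-envelope illustration (numbers assume 4-byte int, 64-byte cache lines, 4-KB pages, 64-bit pointers):

```cpp
#include <list>
#include <numeric>
#include <vector>

// Summing 1 million ints:
// - vector<int>: 4,000,000 contiguous bytes => ~62,500 cache lines and
//   ~1,000 pages, touched in strictly sequential, prefetch-friendly order.
// - list<int>: each node carries two pointers plus padding (~24 bytes, plus
//   per-allocation overhead), scattered across the heap => many more cache
//   lines and pages touched, in an unpredictable order.
long long sumVector(const std::vector<int>& v) {
    return std::accumulate(v.begin(), v.end(), 0LL);
}

long long sumList(const std::list<int>& l) {
    return std::accumulate(l.begin(), l.end(), 0LL);
}
```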

strace, ltrace, truss, oprofile, gprof – random notes

[[optimizing Linux performance]] has usage examples of ltrace.
I think truss is the oldest and most well-known.
Q: what values do the others add?
truss, strace, ltrace all show function arguments, though pointers to objects will not be “dumped”. (Incidentally, I guess apptrace has a unique feature to dump arguments of struct types.)
strace/ltrace are similar in many ways…
ltrace is designed for shared-library tracing, but can also trace syscalls.
truss is designed for syscalls, but “-u” covers shared libraries.
oprofile — can measure time spent and hit rates on library functions