C++ has a clear advantage in low-latency work
Ranked from most specific to most vague. Another possible ranking criterion is market value.
- memory mgmt – for time and space efficiency
- generic techniques to use stack objects instead of heapy thingies
- reduce reliance on dynamic containers
- virtual functions — reusable techniques to avoid virtual dispatch
- other compiler internals that impact latency
- knowledge of inline — to trade off latency against code size
- latency instrumentation tools
- safe c++ techniques, to reduce the chance of run-time errors. C++ is particularly vulnerable because it is
- more complex than C and most other languages, and
- closer to the hardware than Java and other managed languages.
- TMP — again to trade off latency against complexity. But this zbs domain is too big for me
- tool chain
C++ can be 5x faster than Java if both programs are well-tuned — a ball-park estimate given by Stroustrup.
In such benchmarks the C++ code is often written like Java code — lots of pointers, virtual functions, no inlining, and perhaps too many heap allocations (STL containers) rather than strictly-stack variables.
Many other benchmarks are similarly questionable. The newer languages out there are usually OO and rely on GC plus pointer indirection. If you translate their code literally to C++, the resulting C++ code is horribly inefficient, taking no advantage of the C++ compiler's powers. An expert C++ developer would rewrite everything to avoid virtual functions, favor local variables and inlining, and possibly use compile-time programming. The binary usually becomes comparable in the benchmark — and since the C++ compiler is more sophisticated and has more optimization opportunities, it usually produces faster code.
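To make the contrast concrete, here is a minimal, hypothetical sketch (Shape/Square are my own illustration, not from any benchmark): the first version is "java-style" with virtual dispatch and heap allocation; the rewrite uses a concrete non-virtual type in contiguous storage, which the compiler can inline.

```cpp
#include <memory>
#include <vector>

// "java-style" C++: heap-allocated objects behind a virtual interface
struct Shape {
    virtual double area() const = 0;
    virtual ~Shape() = default;
};

struct Square : Shape {
    double side;
    explicit Square(double s) : side(s) {}
    double area() const override { return side * side; }
};

double totalAreaJavaStyle(const std::vector<std::unique_ptr<Shape>>& v) {
    double total = 0;
    for (const auto& p : v) total += p->area();  // vptr indirection; hard to inline
    return total;
}

// tuned C++: concrete non-virtual type in contiguous storage
struct PlainSquare {
    double side;
    double area() const { return side * side; }  // non-virtual, trivially inlinable
};

double totalAreaTuned(const std::vector<PlainSquare>& v) {
    double total = 0;
    for (const auto& sq : v) total += sq.area();  // direct call, no vptr, no heap
    return total;
}
```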
Suppose your class Trade has virtual functions and a comparable class Order has no virtual functions. What are the specific runtime overheads of the vptr/vtable usage?
- cpu cache efficiency — memory footprint of the vptr in each object. Java is affected too! If you have a lot of Trade objects with only one char data field, then the vptr greatly expands the per-object footprint and wastes cache lines.
- [[ARM]] singles out this factor as a justification for -fno-rtti… see RTTI compiler-option enabled by default
- [[moreEffC++]] P116 singles out vptr footprint as the biggest performance penalty of vptr
- runtime indirection — the non-virtual Order call is “a few memory references” more efficient than the virtual Trade call
- inlining inhibition is the most significant overhead. P209 [[ARM]] says inline virtual functions make perfect sense, so it is best to bypass the vptr and call the virtual function directly, if possible. See the P209 [[ARM]] wording.
Note a virtual function unconditionally introduces the first overhead, but overheads #2 and #3 can sometimes be avoided by a smart compiler.
Official benchmark result – 700 KMPS per instance of rebus or parser. Here are some enabling features:
- [! iii] mostly single-threaded apps. Some older parsers are multi-threaded, but each thread operates in single-threaded mode. No shared mutable state :)
- In the order book replication engine, we use lockfree in one place only
- ! (ST mode loves) stateless parser, small footprint. Only the downstream component (Rebus) handles cancels/busts and modifications. So we can run many parser instances per machine, or many threads per instance, fully utilizing the cpu cores. See https://haydenjames.io/linux-performance-almost-always-add-swap-space/
- [! ii] allocation avoided — on millions of Output message objects. Pre-allocated ring buffer eliminates new(). very few and small STL containers .. mostly arrays… pre-allocate DTOs@SOD #HFT
- ! allocation avoided — on millions of Input message objects, thanks to reinterpret_cast() on pointers… nearly zero-copy. See reinterpret_cast^memcpy in raw mktData parsing
- ! allocation avoided — custom containers replace the STL containers, since STL containers all allocate from the heap
- ! p2p messaging beats MOM
- ! Socket buffer tuning — to cope with busts. 64-256MB receive buffer in kernel; 4MB UserModeBuffer
- low memory footprint. Most parsers use around 60MB. (My parsers were 100 to 150MB.) I think there are many benefits in terms of latency.
- epoll — to replace select(), with a two-orders-of-magnitude improvement in throughput
- buffering of Output list (of messages). I guess this hurts latency but enhances throughput
- Very fast duplicate seq check, without large history data — a hot function
- large containers like the ring buffer are pre-sized. No reallocation.
- mostly array-based data structures — cache-friendly
- Hot functions use pbref, RVO or move semantics, minimizing copying
- aggressively simplified parsing. Minimal logic
- Special Logger to reduce I/O cost
- Sharding to speed up symbol lookup
- kernel bypass : RTS
- No virtual functions — enhancing data cache and inlining .. 3 overheads@vptr #ARM
- —-Speed with control
- Best of both to reduce the real risk of our connection becoming slow or lossy
- Depth monitor to detect missed messages
- Capacity tracker to monitor mps rates
- Missed seq reporting
- [!=one of the really effective techniques in RTS]
- [i=presented in multiple (!!, !!!) interviews]
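As a rough sketch of the reinterpret_cast zero-copy idea above (the wire layout below is hypothetical; a real feed handler uses the exchange's exact packed layout and converts endianness per the spec):

```cpp
#include <cstdint>
#include <cstring>

// hypothetical fixed wire layout -- a real feed handler uses the exchange's
// exact packed layout and handles endianness per the spec
struct WireHeader {
    std::uint32_t seqNum;
    std::uint16_t msgType;
    std::uint16_t bodyLen;
};

// zero-copy: reinterpret the receive buffer in place -- no new(), no memcpy.
// (Strictly speaking this skirts the strict-aliasing rule; the alternative,
// as the reinterpret_cast^memcpy comparison suggests, is memcpy into a local
// struct, which compilers usually optimize away anyway.)
const WireHeader* parseHeader(const char* recvBuf) {
    return reinterpret_cast<const WireHeader*>(recvBuf);
}
```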
The parser does remember the sequence numbers. In fact sequence number management is crucial. Before retrans support was added to the parser, we never sent any retrans request, but we did keep track of gaps. Mostly we used the gap-tracker to check duplicates across Line A/B. We also relied on our sequence number state to handle sequence resets.
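A minimal sketch of such a gap/duplicate tracker (my own illustration, not RTS code): it keeps only the next expected sequence number, so memory stays constant regardless of message volume.

```cpp
#include <cstdint>

// constant-memory gap/duplicate tracker: only the next expected seq is stored.
// A production tracker would also keep the set of open gap ranges, so that a
// late arrival filling a gap is not misreported as a duplicate.
class SeqTracker {
    std::uint64_t nextExpected_ = 1;
public:
    enum Result { DUPLICATE, IN_ORDER, GAP };
    Result onSeq(std::uint64_t seq) {
        if (seq < nextExpected_) return DUPLICATE;          // e.g. Line B copy of a Line A message
        Result r = (seq == nextExpected_) ? IN_ORDER : GAP; // GAP: messages were missed
        nextExpected_ = seq + 1;
        return r;
    }
    void onReset(std::uint64_t newStart) { nextExpected_ = newStart; } // sequence reset
};
```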
Q: does template code bloat increase instruction cache pressure and increase risk of cache thrashing?
A: I have seen about 8 forum mentions of template code bloat increasing i-cache pressure, but no clear evidence and no consensus among authorities. In fact, no known authority has confirmed this issue.
However, I would think having many similar functions in the executable will likely lead to unnecessary competition for i-cache.
In contrast, for inlining there is consensus. Wikipedia states that excess inlining will hurt speed, due to inlined code consuming too much of the instruction cache: “Inlining imposes a cost on performance, due to the code expansion (due to duplication) hurting instruction cache performance. This is most significant if, prior to expansion, the working set of the program (or a hot section of code) fit in one level of the memory hierarchy (e.g., L1 cache), but after expansion it no longer fits, resulting in frequent cache misses at that level.”
See also inline: footprint+perf can Backfire ! #Google
https://www.eetimes.com/document.asp?doc_id=1275470&page_number=2 gave an elegant illustration of reverse-inlining.
“If you have a large function with many cases, ask yourself which will be executed the most often. Place the infrequently-executed cases in a different function.”
If a corner case is embedded in a hot function, the i-cache may suffer. A JIT compiler could in theory move it to the end of a hot loop, but that depends on many factors. Is it good practice for us developers to manually move the less-frequent code paths towards the end of the function, with early-returns before them? I guess the i-cache would then hold only the first half of the function, which contains the hot code paths.
The trick — if you extract the corner case into a separate function and somehow ensure it is not inlined, then the i-cache is relieved.
A JIT compiler would notice that such a function should NOT be inlined; in C++ I am not sure the compiler can tell.
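In GCC/Clang you can at least force the point manually (the noinline attribute and __builtin_expect are compiler-specific extensions, not standard C++; the function names are my own):

```cpp
#include <cstdint>
#include <stdexcept>
#include <string>

// cold corner case pulled out of the hot path; the noinline attribute is a
// GCC/Clang extension (MSVC has __declspec(noinline))
__attribute__((noinline))
void handleCornerCase(std::uint64_t seq) {
    // bulky, rarely-executed logic lives here, off the hot i-cache lines
    throw std::runtime_error("unexpected seq " + std::to_string(seq));
}

std::uint64_t hotPath(std::uint64_t seq, std::uint64_t expected) {
    if (__builtin_expect(seq != expected, 0))  // GCC/Clang hint: branch is cold
        handleCornerCase(seq);
    return seq + 1;                            // small, hot, inline-friendly
}
```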
“Functions often contain a significant amount of code that deals with corner cases, that is, cases which are rarely executed. This means that a large number of instructions read into cache are rarely executed.” — https://www.eetimes.com/document.asp?doc_id=1275470
Martin Thompson agreed. He hinted that JIT compilers could (or could be designed to) notice the “cold” corner cases and dynamically refactor the code path so the corner case is handled at the end of a hot loop.
Martin also said inlining has to be done judiciously, to avoid pulling corner cases (cold code) into hot functions. In C++, inlining decisions are made at compile time, but a JIT can make the same decisions at run-time.
Personally, I guess this JIT technique is academic or experimental, probably hit-n-miss so the performance gain/penalty is hard to predict.
P76 [[javaPerf]] described a nifty JIT technique to avoid runtime cost of the dynamic binding of virtual function equals(). Suppose in some class, we call obj1.equals(obj2).
After a priming (i.e. warm-up) period, JIT collects enough statistics to see that every dynamic dispatch at this call site invokes String.equals(), so JIT turns it into faster “static binding”: the String.equals() function address is hardwired into the assembly code (not JVM bytecode). JIT also needs to handle the possibility of, say, Character.equals(). I guess the assembly code can detect that obj1/obj2 is not a String instance and fall back to the virtual function lookup. JIT can generate assembly code to
1. verify obj1 is a String and call an inlined String.equals()
2. if obj1 is not a String, then use the obj1 vtable to look up the virtual function obj1.equals()
It may turn out that 99.9% of the time we can skip the time-consuming Step 2 :)
Martin gave a (hypothetical?) example. Suppose JIT notices that obj1 is always either a String or a Character. JIT could inline both equals() functions and completely bypass the vtable. (Branching can be done via if/else or …) This inline compilation is done after a fairly long period of instrumentation. I asked Martin why C++ can’t do it. He said C++ only uses static compilation. I feel C++ compilers don’t bother with this technique as it is not a proven performance win.
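Here is the guarded "static binding" idea rendered as ordinary C++ (Obj/Str are hypothetical stand-ins for Object/String; note a real JIT guard is a cheap class-pointer compare, far cheaper than the dynamic_cast used here for readability):

```cpp
#include <string>

// hypothetical stand-ins for Object/String
struct Obj {
    virtual bool equals(const Obj& other) const = 0;
    virtual ~Obj() = default;
};

struct Str : Obj {
    std::string s;
    explicit Str(std::string v) : s(std::move(v)) {}
    bool equals(const Obj& other) const override {
        const Str* o = dynamic_cast<const Str*>(&other);
        return o && o->s == s;
    }
};

bool guardedEquals(const Obj& a, const Obj& b) {
    // Step 1: type guard -- on the 99.9% monomorphic fast path the concrete
    // comparison is fully inlined, bypassing the vtable
    const Str* sa = dynamic_cast<const Str*>(&a);
    const Str* sb = dynamic_cast<const Str*>(&b);
    if (sa && sb) return sa->s == sb->s;
    // Step 2: slow path -- ordinary virtual dispatch
    return a.equals(b);
}
```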
Stroustrup was the first one to tell me c++ std::sort() can beat C qsort() easily.
— https://travisdowns.github.io/blog/2019/05/22/sorting.html says:
Since the qsort() code is compiled ahead of time and is found inside the shared libc binary, there is no chance that the comparator function, passed as a function pointer, can be inlined.
— https://martin-ueding.de/articles/qsort-vs-std-sort/index.html says
For qsort(), since the function is passed as a pointer, and the elements are passed as void pointers as well, it means that each comparison costs three indirections and a function call.
In C++, std::sort is a template algorithm, so it can be compiled once for each type. The operator< of the type is usually baked into the sort algorithm as well (inlining), reducing the cost of the comparison significantly.
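The two call styles side by side (a sketch; the inlining claims are the quoted authors', and actual timings vary by compiler and data):

```cpp
#include <algorithm>
#include <cstdlib>
#include <vector>

// C style: the comparator is a function pointer; libc's precompiled qsort
// cannot inline it, and every element is handled through a void*
int cmpInt(const void* a, const void* b) {
    int x = *static_cast<const int*>(a);
    int y = *static_cast<const int*>(b);
    return (x > y) - (x < y);
}

void sortC(std::vector<int>& v) {
    std::qsort(v.data(), v.size(), sizeof(int), cmpInt);
}

// C++ style: std::sort is instantiated for int, and operator< is typically
// inlined straight into the generated sorting code
void sortCpp(std::vector<int>& v) {
    std::sort(v.begin(), v.end());
}
```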
This is collected after i stopped active interviewing in Aug 2012.
Low-level domains are my strength, proven again in 2017-2018 interviews. Besides the obvious domains — threading, data structures, c++/java/c# — interviewers also went low-level on:
- exception in c++/java
- concurrent singleton
- big-O analysis in west coast interviews
- linux/compiler/C++ interviews are even more low-level than java interviews. Despite serious “weak joints”, I generally excel at the lower level
- shared mem — pointer therein!
- sockets .. like kernel bypass
- unix signals
- cpu cache – instruction and data
- rvr (rvalue reference), the most important c++0x feature, is very low-level
- pbclone^pbref^pbptr, slicing
- undefined behavior – usually are low level
- internals of deque, vector/hashtable reallocation…
- smart pointer internals
- reinterpret_cast vs move()
- pass a functor to a thread
- —-On the other hand, interviewers don’t bother to go low level on …
- thread pool? somehow never low-level
- Unix commands, python? never low level
- (noSQL and) SQL? no
- FIX? not so low-level
- SOA, MOM
- asynch designs
- design patterns
As illustrated below, without inlining, the instruction cache system could hit a change-of-flow (COF) twice as the cpu enters and exits your function aa(). If aa() is inlined into a hostFunc, then the instruction cache system can often load the entire hostFunc, eliminating the COFs. This helps the instruction cache, but excessive inlining can increase the executable footprint (code bloat).
google c++ guide points out that
- inline can either increase or decrease (for tiny functions) executable footprint. In general, smaller footprint improves running time due to instruction cache efficiency
- virtual functions are inlined (i.e. defined in class body) primarily for convenience/maintainability, not performance
See also https://www.eetimes.com/document.asp?doc_id=1275470 and https://www.eetimes.com/document.asp?doc_id=1275474
As an example, consider the function call tree flow in Figure 1. Suppose function F2 is linked near function F1, but function F3 is not linked near F1. When function F1 calls F2, it is possible that F2 is already in the cache and that there will be no cache miss. (The likelihood that F2 is in the cache depends on the sizes of F1 and F2, as well as the location of the call inside F1.) In contrast, there will probably be a cache miss when F1 calls F3. Because F3 is located far away from F1 in memory, it is not likely to be in the cache when F1 calls it.
I don’t see any ground-breaking suggestions. I think only very hot functions (confirmed by oprofile + cachegrind) require such micro-optimization.
I like the function^code based fragmentation framework on https://www.eetimes.com/document.asp?doc_id=1275472 (3 parts)
- inline: footprint+perf can backfire. Can be classified as embedding
- use table lookup to replace “if” ladder — minimize jumps
- branching — refactor a lengthy, corner-case (not “hot”) code chunk out into its own function, so 99% of the time the instruction cache (esp. the pre-fetch flavor) doesn’t load a big chunk of cold stuff.
- this is the opposite of embedding!
- Trim the executable footprint. Reduce code bloat due to inlining and templates?
- loop unrolling to minimize jumps. I think this is practical and time-honored — at aggressive optimization levels some compilers actually perform loop unrolling! Programmers can do it manually.
- Use array (anything contiguous) instead of linked list or maps to exploit d-cache + i-cache
- https://software.intel.com/en-us/blogs/2014/11/17/split-huge-function-if-called-by-loop-for-best-utilizing-instruction-cache is a 2014 Intel paper — split huge function if it’s invoked in a loop.
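A small sketch of the table-lookup item above (the venue codes and fees are made up for illustration): the if-ladder compiles to a chain of conditional jumps, while the table version is a single indexed load.

```cpp
#include <array>

// if-ladder: a chain of conditional jumps
int feeLadder(unsigned char venue) {
    if (venue == 'A') return 10;
    if (venue == 'B') return 12;
    if (venue == 'C') return 15;
    return 0;
}

// table lookup: one indexed load, no conditional jumps
struct FeeTable {
    std::array<int, 256> t{};  // zero-initialized: unknown venues map to 0
    FeeTable() { t['A'] = 10; t['B'] = 12; t['C'] = 15; }
    int operator()(unsigned char venue) const { return t[venue]; }
};
```

(For a handful of cases a compiler may build the jump table itself from a switch; the manual table guarantees it.)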
Let’s set the stage — you can pass in either a function name (like myFunc), a pointer to function (&myFunc), or a functor object. The functor is recommended even though it involves more typing. Justification — inlining. I remember a top C++ expert saying so.
I believe “myFunc” is implicitly converted to &myFunc by the compiler, so these two forms are equivalent.
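A minimal illustration (lessFn/LessFunctor are my own names): std::sort instantiated with a function-pointer type typically calls through the pointer, while the functor instantiation lets the compiler inline operator() directly.

```cpp
#include <algorithm>
#include <vector>

bool lessFn(int a, int b) { return a < b; }  // function name decays to a pointer

struct LessFunctor {                         // functor: the type itself carries the code
    bool operator()(int a, int b) const { return a < b; }
};

void demo(std::vector<int>& v) {
    std::sort(v.begin(), v.end(), lessFn);        // instantiated for bool(*)(int,int): indirect call
    std::sort(v.begin(), v.end(), &lessFn);       // same instantiation: myFunc and &myFunc are equivalent
    std::sort(v.begin(), v.end(), LessFunctor{}); // instantiated for LessFunctor: operator() easily inlined
}
```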
Dino of the BBG FX team asked me — when you mark a small function f1() inline (in effect manually copying its code into main()), do you save yourself a jump or a new stack frame?
A: both a jump and a new stack frame.
It turns out a new stack frame entails a jump: after the new stack frame is created, the thread jumps to the beginning of f1().
However, there’s something to set up before the jump. Suppose the call to f1() is on Line 5 in main(); then Line 6’s address has to be saved, otherwise the thread has no idea where to “jump back” after returning from f1(). According to my codebashing training (required in the RTS team), Line 6’s address is saved in the main() stack frame, not the f1() stack frame!
Note Line 6’s address is neither a heap address nor a stack address, but a pointer into the code area.
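The mechanics in code form (a sketch; the comments describe a typical x86-style calling convention, and f1/caller are my own names):

```cpp
// what happens at a call site, spelled out
int f1(int x) { return x + 1; }

int caller(int x) {
    // out-of-line call: the address of the NEXT instruction (the "Line 6"
    // address) is saved on the caller's side of the stack, then the cpu
    // jumps to f1 and later jumps back via that saved code-area address
    int a = f1(x);

    // inlined equivalent: the body is pasted in -- no saved return address,
    // no jumps, and the compiler may fold it further
    int b = x + 1;

    return a + b;
}
```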
MSVS and g++ debug builds both disable inlining (presumably to ease debugging). The performance difference vis-a-vis a release build is mostly due to this single factor, according to [[optimized c++]].
The same author asserts that inlining is probably the most powerful code optimization.
Stroustrup singled out inline as a key feature of c++.
In his low-latency seminar, Martin Thompson also quoted a computer scientist saying “Inlining is THE optimization”. This is now a memorable quote.
I think one of my java performance books singled out inlining as the single most effective optimization a compiler can do.
Pimpl effectively disables inlining
vptr-based dispatch also disables inlining
C++faq gives many reasons for and against inline.