C++ has a clear advantage in low latency
Ranked from most specific to most vague. Another ranking criterion is market value.
- memory mgmt – for time and space efficiency
- generic techniques to use stack objects instead of heapy thingies
- reduce reliance on dynamic containers
- virtual — reusable techniques to avoid virtual
- other compiler internals that impact latency
- knowledge of inline — to trade off latency against code size
- latency instrumentation tools
- safe c++ techniques, to reduce the chance of run-time errors. C++ is particularly vulnerable as it is
- more complex than C and others, and
- closer to the hardware than Java and other languages.
- TMP — again to trade off latency against complexity. But this zbs domain is too big for me
- tool chain
Stroustrup is not the first to tell me that Java objects are always accessed via a pointer, as the garbage collector may relocate the actual object.
At runtime, this indirection has a non-zero cost. In contrast, a C/C++ app (without GC) accesses the pointee directly.
I guess a GC language would need some lookup table.
- case 1 (standard java): you allocate heap memory. After you finish with it you wait for the java GC to clean it up.
- case 2 (low latency java): you allocate heap memory but disable java GC. Either you hold on to all your objects, or you leave unreachable garbage orbiting the earth forever.
- case 3 (c++): you allocate heap memory with the expectation of releasing it, so the compiler sets up housekeeping in advance for the anticipated delete(). This housekeeping overhead is somewhat similar to try/catch before c++11 ‘noexcept’.
Stroustrup suggested that #2 will be faster than #3, and #3 faster than #1. I asked, “But can’t C++ emulate the allocation the way the JVM does?” Stroustrup said C++ is not designed for that. I have seen online posts about this “emulation”, but I would trust Stroustrup more.
- case 4 (C): C/C++ can sometimes use local (stack) variables to beat heap allocation. In my experience, C programmers make rather few heap allocations.
Note: the JVM allocator and malloc are both userland allocators; they are not part of the kernel and usually avoid system calls. You can substitute your own malloc.
— https://stackoverflow.com/questions/18268151/java-collections-faster-than-c-containers top answer by Kanze is consistent with what Stroustrup told me.
- no dynamic allocation is always faster than even the fastest dynamic allocation. Similar to Case 4
- jvm allocation (without the GC clean-up) can be 10 times faster than c++ allocation. Similar to Case 2 vs Case 3
- Q: Is there a free list in JVM allocator?
— https://softwareengineering.stackexchange.com/questions/208656/java-heap-allocation-faster-than-c claims
- c++ Custom allocators managing a pool of fixed-sized objects can beat jvm
- jvm allocation often requires little more than one pointer addition, which is certainly faster than typical C++ heap allocation algorithms
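To make the “little more than one pointer addition” claim concrete, here is a minimal sketch (my own illustration, not JVM source) of a bump allocator over a pre-reserved arena, which is roughly how a JVM thread-local allocation buffer behaves:

```cpp
#include <cstddef>
#include <vector>

// Minimal bump allocator: allocation is one pointer addition, mimicking
// JVM-style TLAB allocation. Illustration only -- no per-object free().
class BumpArena {
public:
    explicit BumpArena(std::size_t bytes) : buf_(bytes), next_(buf_.data()) {}

    void* allocate(std::size_t n) {
        // round up so subsequent allocations stay aligned
        n = (n + alignof(std::max_align_t) - 1) & ~(alignof(std::max_align_t) - 1);
        if (next_ + n > buf_.data() + buf_.size()) return nullptr; // arena full
        void* p = next_;
        next_ += n;   // the entire "algorithm": one pointer bump
        return p;
    }

    // wholesale release, similar to what a relocating GC achieves when it
    // evacuates live objects and reclaims the whole region at once
    void reset() { next_ = buf_.data(); }

private:
    std::vector<char> buf_; // one up-front reservation
    char* next_;
};
```

Deallocation here is wholesale via reset(), which is essentially what a relocating GC gives the JVM for free; a general-purpose malloc cannot assume that, hence its free-list bookkeeping.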
Official benchmark result: 700 KMPS (thousand messages per second) per instance of Rebus or the parser. Here are some enabling features:
- [! iii] mostly single-threaded apps. Some older parsers are multi-threaded, but each thread operates in single-threaded mode. No shared mutable state :)
- In the order book replication engine, we use lockfree in one place only
- ! (ST mode loves) stateless parser, small footprint. Only the downstream component (Rebus) handles cancels/busts and modifications, so we can run many parser instances per machine, or many threads per instance, fully utilizing the cpu cores. See https://haydenjames.io/linux-performance-almost-always-add-swap-space/
- [! ii] allocation avoided — on millions of Output message objects. A pre-allocated ring buffer eliminates new(); very few, small STL containers, mostly arrays; pre-allocate DTOs@SOD #HFT (see the ring-buffer sketch after this list)
- ! allocation avoided — on millions of Input message objects, thanks to reinterpret_cast() on pointers: nearly zero-copy. See reinterpret_cast^memcpy in raw mktData parsing (see the zero-copy sketch after this list)
- ! allocation avoided — custom containers to replace the STL containers, since those all allocate from the heap
- ! p2p messaging beats MOM
- ! Socket buffer tuning — to cope with bursts. 64-256MB receive buffer in kernel; 4MB UserModeBuffer
- low memory footprint. Most parsers use around 60MB (my parsers were 100 to 150MB). I think there are many benefits in terms of latency.
- epoll — to replace select(), with a two-orders-of-magnitude improvement in throughput
- buffering of Output list (of messages). I guess this hurts latency but enhances throughput
- Very fast duplicate seq check, without large history data — a hot function (see the gap-tracker sketch below)
- large containers like the ring buffer are pre-sized. No reallocation.
- mostly array-based data structures — cache-friendly
- Hot functions use pass-by-reference (pbref), RVO, or move semantics, minimizing copying
- aggressively simplified parsing. Minimal logic
- Special Logger to reduce I/O cost
- Sharding to speed up symbol lookup
- kernel bypass : RTS
- No virtual functions — enhancing data cache and inlining .. 3 overheads@vptr #ARM
- Speed with control — best of both, to reduce the real risk of our connection becoming slow or lossy:
- Depth monitor to detect missed messages
- Capacity tracker to monitor mps rates
- Missed seq reporting
- [!=one of the really effective techniques in RTS]
- [i=presented in multiple (!!, !!!) interviews]
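As promised above, a minimal sketch of the pre-allocated ring-buffer idea. OutputMsg and its fields are made up; the point is that all slots are constructed once at SOD and the hot path only recycles them, never calling new():

```cpp
#include <array>
#include <cstddef>

struct OutputMsg {        // hypothetical DTO, pre-allocated at SOD
    long seqNum{};
    char payload[256]{};
};

// Fixed-capacity ring of pre-constructed messages. The hot path only
// recycles slots; it never allocates. Typically created once at start
// of day (e.g. as a static or one-shot heap object), not per message.
template <std::size_t N>
class MsgRing {
public:
    OutputMsg& acquire() {            // next slot, overwriting the oldest
        OutputMsg& m = slots_[head_];
        head_ = (head_ + 1) % N;
        return m;
    }
private:
    std::array<OutputMsg, N> slots_{}; // one-shot allocation, before market open
    std::size_t head_ = 0;
};
```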
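And a sketch of the reinterpret_cast^memcpy idea on inbound packets. The wire format here is hypothetical; the technique is to overlay a packed struct on the receive buffer instead of copying fields out:

```cpp
#include <cstdint>

// Hypothetical wire format -- packed so the struct overlays the bytes exactly.
#pragma pack(push, 1)
struct QuoteMsg {
    std::uint16_t msgType;
    std::uint32_t seqNum;
    std::uint64_t price;   // fixed-point
    std::uint32_t qty;
};
#pragma pack(pop)

// Zero-copy "parse": reinterpret the raw buffer in place. No allocation,
// no memcpy. Real code must also handle endianness and buffer length, and
// note this formally skirts strict-aliasing rules; some shops use a small
// memcpy instead, which compilers optimize away.
inline const QuoteMsg* parse(const char* buf) {
    return reinterpret_cast<const QuoteMsg*>(buf);
}
```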
The parser does remember sequence numbers; in fact, sequence number management is crucial. Before retrans was added to the parser, we never sent any retrans request, but we did keep track of gaps. Mostly we used the gap-tracker to check duplicates across Line A/B. We also relied on our sequence number state to handle sequence resets.
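A minimal sketch of that gap-tracker idea, under my own assumptions (per-line sequence numbers starting at 1). The “fast duplicate check without large history” can be as small as the next expected seq plus a normally tiny set of outstanding gaps:

```cpp
#include <cstdint>
#include <set>

// A message is a duplicate (e.g. already seen on Line A when it arrives on
// Line B) unless it is beyond the highest contiguous seq, or it fills a known
// gap. History stays tiny: one integer plus the (normally near-empty) gap set.
class GapTracker {
public:
    bool isNew(std::uint64_t seq) {
        if (seq == next_) { ++next_; return true; }   // in-order
        if (seq > next_) {                            // jumped ahead: record gap
            // real code would cap the gap size and trigger retrans / reset handling
            for (std::uint64_t s = next_; s < seq; ++s) gaps_.insert(s);
            next_ = seq + 1;
            return true;
        }
        return gaps_.erase(seq) > 0;   // old seq: new only if it fills a gap
    }
private:
    std::uint64_t next_ = 1;           // next expected sequence number
    std::set<std::uint64_t> gaps_;     // outstanding (missed) sequence numbers
};
```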
I see various evidence that industry practitioners consider the default allocator too slow.
I don’t think system call is the issue. System calls are very infrequent with malloc.
STL is faster than
* alternative designs using virtual functions
* alternative designs using runtime logic instead of compile-time TMP
However, STL containers are slower than containers without on-the-fly allocation. To match those containers, you can pre-allocate enough capacity at start-up.
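For instance (a trivial sketch; Order is a placeholder type):

```cpp
#include <vector>

struct Order { long id; double px; };   // placeholder element type

int main() {
    std::vector<Order> book;
    book.reserve(1'000'000);   // one up-front allocation at start-up...
    // ...so push_back on the hot path never reallocates
    book.push_back({1, 99.5});
}
```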
shared_ptr and STL containers, java autoboxing …
… are avoided in busy mkt data systems due to on-the-fly DAM (dynamically allocated memory) allocations.
shared_ptr has extra DAM in terms of the ref counter!
consider pre-allocation to pre-empt “on-the-fly” allocation
see also post on large in-memory search
I suspect there’s a dilemma —
- a large heap allocation request -> the free-list search is harder and slower. Need to avoid ad-hoc/unplanned requests for large chunks.
- many small heap allocation requests -> the free-list manager becomes a hotspot
- in reality, pre-allocating large arrays is probably a performance win.
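Here is a sketch of the fixed-size pool mentioned in the StackExchange claim above. Because every block is the same size, freed blocks chain into an intrusive free list, so both allocate() and release() are O(1) pointer swaps, with no free-list search at all:

```cpp
#include <cstddef>
#include <vector>

// Fixed-size object pool: all blocks pre-allocated in one shot; freed blocks
// are pushed onto an intrusive free list. No search, no fragmentation.
template <typename T>
class FixedPool {
    union Slot { Slot* next; alignas(T) unsigned char storage[sizeof(T)]; };
public:
    explicit FixedPool(std::size_t n) : slots_(n) {
        for (std::size_t i = 0; i + 1 < n; ++i)
            slots_[i].next = &slots_[i + 1];          // chain all slots
        if (n > 0) { slots_[n - 1].next = nullptr; free_ = &slots_[0]; }
    }
    void* allocate() {            // pop head of free list: O(1)
        if (!free_) return nullptr;
        Slot* s = free_;
        free_ = s->next;
        return s->storage;
    }
    void release(void* p) {       // push back onto free list: O(1)
        Slot* s = reinterpret_cast<Slot*>(p);
        s->next = free_;
        free_ = s;
    }
private:
    std::vector<Slot> slots_;     // one-shot, pre-sized backing store
    Slot* free_ = nullptr;
};
```

Combined with placement new, this can replace new/delete for hot DTOs of one type.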
I thought that in low latency, time costs outweigh space costs, but no: garbage collection is a major performance issue in low-latency systems, and I guess so is object creation. I guess that’s why memory efficiency affects latency: GC probably takes less cpu time if there’s less stuff to scan.
Distributed cache (memory virtualization) isn’t much used in low latency systems, possibly because of
- serialization
- network latency
- TCP? I feel the fastest HFT systems have very limited IO, perhaps in the form of shared memory. FIX is based on TCP, so I would assume very high overhead, but in reality it may be a small overhead.
This low-level topic is apparently important to multiple interviewers. I guess there are similarly low-level topics like lockfree, wait/notify, hashmap, const correctness.. These topics are purely for theoretical QQ interviews. I don’t think app developers ever need to write forward() in their code.
https://stackoverflow.com/questions/18543717/c-perfect-forwarding/18543824 touches on a few low-level optimizations. Suppose you follow Herb Sutter’s advice and write a factory accepting the Trade ctor args and returning a shared_ptr<Trade>:
- your factory’s parameter should be a universal reference. You should then std::forward() it to make_shared(). See the make_shared() source in https://gcc.gnu.org/onlinedocs/libstdc++/libstdc++-api-4.6/a01033_source.html
- make_shared() makes a single allocation for the Trade and an adjacent control block, which is cache-efficient: a read access through the Trade pointer will likely bring the control block into cache too
- I remember reading online that one allocation vs two is a huge performance win….
- if the arg object is a temporary, then an rvr is forwarded to the Trade ctor. Scott Meyers says the (named, hence lvalue) parameter is cast back to an rvr. The Trade ctor would then need to move() it.
- if the runtime object is carried by an lvr (the arg object is not a temporary), then the lvr is forwarded as-is to the Trade ctor?
Q: What if I omit std::forward()?
AA: the Trade ctor would always receive an lvr. See ScottMeyers P162 and my github code
https://github.com/tiger40490/repo1/blob/cpp1/cpp1/rvrDemo.cpp is my experiment.
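Putting the pieces together, here is a sketch of the Sutter-style factory. Trade and its ctor signature are invented for illustration:

```cpp
#include <memory>
#include <string>
#include <utility>

struct Trade {                        // hypothetical payload type
    std::string symbol;
    double px;
    Trade(std::string s, double p) : symbol(std::move(s)), px(p) {}
};

// Universal (forwarding) references + std::forward: lvalue args are copied,
// rvalue args are moved, and make_shared does one allocation for the Trade
// plus its adjacent control block.
template <typename... Args>
std::shared_ptr<Trade> makeTrade(Args&&... args) {
    return std::make_shared<Trade>(std::forward<Args>(args)...);
}

int main() {
    std::string sym = "IBM";
    auto t1 = makeTrade(sym, 120.5);                 // sym is an lvalue: copied
    auto t2 = makeTrade(std::string("MSFT"), 99.0);  // temporary: moved
}
```

If you omit std::forward, the named parameters are passed as lvalues, so even the temporary-argument case degrades to a copy, which matches the AA above.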
I said “1000 times” when a GS interviewer asked for an estimate of relative latency of main memory vs register. He said that’s about right.
Numbers below were taken from the CPU Cache Flushing Fallacy blog post by Martin Thompson, which indicates that for a particular 2012-era Intel processor, the following was observed:
- register access = single cycle; up to 4 instructions can issue per cycle, so a register op counts as roughly 0.25 cycles in the ratios below
- L1 latency = 3 cycles (3 x register)
- L2 latency = 12 cycles (4 x L1, 48 x register)
- L3 latency = up to 38 cycles (3 x L2, 12 x L1, 144 x register)
- MM latency = very roughly 200 cycles (5 x L3, 15 x L2, 60 x L1, 720 x register) = average 65 ns on a 3 GHz CPU
The diagram is more simplified than the text; there is a lot of fine print.
I feel a major reason is virtual address translation. https://en.wikipedia.org/wiki/Translation_lookaside_buffer says “page walk” is slow.
Heap allocation is extremely slow compared to ordinary operations like arithmetic or stack allocation.
https://stackoverflow.com/questions/1950878/c-for-loop-indexing-is-forward-indexing-faster-in-new-cpus top answer is concise. I think the observations may not be relevant in x years but the principles are.
- adjacent cache line (ACL) prefetcher — simple to understand
- cpu can detect streams of memory accesses in forward or backward directions
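A classic illustration (mine, not from the linked answer): row-by-row traversal of a matrix touches memory sequentially, which the stream prefetchers reward; column-by-column traversal strides across cache lines and defeats them:

```cpp
#include <cstddef>
#include <vector>

constexpr std::size_t N = 4096;   // m must hold N*N doubles

// Sequential, prefetcher-friendly: consecutive elements, consecutive cache lines.
double sumRowMajor(const std::vector<double>& m) {
    double s = 0;
    for (std::size_t r = 0; r < N; ++r)
        for (std::size_t c = 0; c < N; ++c)
            s += m[r * N + c];
    return s;
}

// Strided: jumps N*8 bytes per access; most reads miss cache.
double sumColMajor(const std::vector<double>& m) {
    double s = 0;
    for (std::size_t c = 0; c < N; ++c)
        for (std::size_t r = 0; r < N; ++r)
            s += m[r * N + c];
    return s;
}
```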
Note L1/L2/L3 caches are considered part of the CPU even if some of them are physically outside the microprocessor.
Is shared-memory a popular messaging solution in low-latency trading?
I know some high-volume data processing engines (like Ab Initio) favor shared memory as the fastest IPC solution. However, I feel that in low-latency trading, messaging middleware (like tibrv, 29west, Solace) is more popular. For a trading engine, shared-memory IPC can be the basis of messaging between processes on the same machine, but not across different machines.
Does your system use shared memory?
If interested, you can check out
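For reference, the POSIX building blocks of same-machine shared-memory IPC, as a bare-bones sketch with error handling omitted; real messaging layers put a lock-free ring and sequence counters on top. The segment name "/md_channel" is made up:

```cpp
#include <fcntl.h>      // O_CREAT, O_RDWR
#include <sys/mman.h>   // shm_open, mmap, munmap
#include <unistd.h>     // ftruncate
#include <cstddef>
#include <cstring>

int main() {
    const std::size_t kSize = 1 << 20;  // 1 MB segment
    // create/open a named segment; a peer process opens the same name
    int fd = shm_open("/md_channel", O_CREAT | O_RDWR, 0600);
    ftruncate(fd, kSize);               // size the segment
    void* p = mmap(nullptr, kSize, PROT_READ | PROT_WRITE,
                   MAP_SHARED, fd, 0);  // map into this process
    // writes are visible to any process that maps "/md_channel"
    std::memcpy(p, "hello", 6);
    munmap(p, kSize);
    return 0;
}
```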
See also post on large in-memory search
Reason: the larger the heap to scan, the slower the GC (though paradoxically, a large heap helps you postpone GC). The less memory you use, the lower your GC penalty.
reason: favor an in-memory data store rather than disk. In-memory can mean remote or local (best). The smaller your footprint, the easier.
reason: favor single-VM — serialization-free, network-free. The smaller, the easier.
reason: serialization (for IPC, cache replication, network or disk) is slower for larger footprints.
reason: allocation is costly, even for a 32-bit int. Look at shared_ptr and circular arrays.
reason: minimize object creation (i.e. allocation) — standard optimization advice
reason: reduce GC frequency
reason: reduce GC workload, i.e. the amount of memory to scan. If you must incur a full GC, make it short.
reason? Compared to graphs, arrays enjoy 1) random access and 2) one-shot memory allocation, but if each element is bulky then a long array requires too large a contiguous block of memory, which is hard to allocate. The smaller your object, the easier.
Note: a java array element is usually a 32-bit pointer.
Major technique: favor iterative over recursive algorithms to reduce stack memory footprint
Besides processor cores, physical RAM was also allocated to applications in my Mansion data center.