##G5 c++skills { Stroustrup chat

C++ has a clear advantage in low-latency work.

Ranked from most specific to most vague. Another ranking criterion is market value.

  1. memory mgmt – for time and space efficiency
  2. generic techniques to use stack objects instead of heapy thingies
    • reduce reliance on dynamic containers
  3. virtual — reusable techniques to avoid virtual functions (see the CRTP sketch after this list)
  4. other compiler internals that impact latency
    1. knowledge of inline — to trade off latency against code size
  5. latency instrumentation tools
  6. safe C++ techniques, to reduce the chance of run-time errors. C++ is particularly vulnerable because it is
    • more complex than C and other languages, and
    • closer to the hardware than Java and other languages.
  7. TMP — again to trade off latency against complexity. But this zbs domain is too big for me
  8. tool chain
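As an illustration of item 3 (avoiding virtual), here is a minimal CRTP sketch. The class names (ParserBase, NyseParser) are made up for illustration, not from any real codebase:

```cpp
#include <cstdio>

// CRTP: the base class knows the derived type at compile time, so the call
// to onQuote() resolves statically and can be inlined -- no vptr per object,
// no vtable lookup, no indirect call on the hot path.
template <typename Derived>
struct ParserBase {
    void handle(const char* rawMsg) {
        static_cast<Derived*>(this)->onQuote(rawMsg);  // static dispatch
    }
};

struct NyseParser : ParserBase<NyseParser> {
    void onQuote(const char* rawMsg) { std::printf("NYSE quote: %s\n", rawMsg); }
};

int main() {
    NyseParser p;
    p.handle("IBM 141.25");   // no virtual call anywhere
}
```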

 


indirection due to jGC: runtime cost

Stroustrup is not the first to tell me that Java objects are always accessed via a pointer, because the garbage collector may relocate the actual object.

At runtime, this indirection has a non-zero cost. In contrast, a C/C++ app (without GC) accesses the pointee directly.

I guess a GC language would need some lookup table.
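To make the indirection concrete, here is a toy C++ contrast between direct access and access through a relocatable handle table. This is only an illustration of the idea, not how HotSpot or any particular JVM actually implements object access:

```cpp
#include <cstdio>

struct Order { double price; };

// Direct access, as in C/C++ without GC: one dependent load.
double directRead(const Order* p) { return p->price; }

// Handle-based access, as a moving collector might require: the handle
// indexes a table whose slots the GC rewrites when it relocates objects,
// so every field access costs two dependent loads instead of one.
double handleRead(Order* const* handleTable, int handle) {
    return handleTable[handle]->price;
}

int main() {
    Order o{100.5};
    Order* table[1] = {&o};
    std::printf("%f %f\n", directRead(&o), handleRead(table, 0));
}
```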

heap allocation: java can beat c++

  • case 1 (standard java): you allocate heap memory. After you finish with it, you wait for the java GC to clean it up.
  • case 2 (low latency java): you allocate heap memory but disable the java GC. Either you hold on to all your objects, or you leave unreachable garbage orbiting the earth forever.
  • case 3 (c++): you allocate heap memory with the expectation of releasing it, so the allocator sets up bookkeeping in advance for the anticipated delete(). This housekeeping overhead is somewhat similar to the overhead of try/catch before C++11 ‘noexcept’.

Stroustrup suggested that #2 would be faster than #3, and #3 faster than #1. I asked, “But can C++ emulate the allocation the way the JVM does?” Stroustrup said C++ is not designed for that. I have seen online posts about this “emulation”, but I would trust Stroustrup more.

  • case 4 (C): C/C++ can sometimes use local variables to beat heap allocation. In my experience, C programmers use rather few heap allocations.

Note that the JVM allocator and malloc are both userland allocators: they are not part of the kernel and usually avoid system calls. You can substitute your own malloc.

The top answer by Kanze at https://stackoverflow.com/questions/18268151/java-collections-faster-than-c-containers is consistent with what Stroustrup told me.

  • no dynamic allocation is always faster than even the fastest dynamic allocation. Similar to Case 4
  • jvm allocation (without the GC clean-up) can be 10 times faster than c++ allocation. Similar to Case 2^3
    • Q: Is there a free list in JVM allocator?

https://softwareengineering.stackexchange.com/questions/208656/java-heap-allocation-faster-than-c claims

  • C++ custom allocators managing a pool of fixed-size objects can beat the JVM allocator
  • JVM allocation often requires little more than one pointer addition, which is certainly faster than typical C++ heap allocation algorithms. (A sketch of both ideas follows below.)
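To make the “one pointer addition” concrete, here is a minimal sketch of a bump-pointer arena (roughly what a JVM nursery does) and of a fixed-size object pool, the analogous trick on the C++ side. Names, sizes and the Tick struct are invented for illustration:

```cpp
#include <cstddef>
#include <cstdint>
#include <new>

// Bump-pointer arena: allocation is one pointer addition plus a bounds check.
// Nothing is freed individually; the whole arena is reset at once.
class Arena {
    std::byte* cur_;
    std::byte* end_;
public:
    Arena(std::byte* buf, std::size_t n) : cur_(buf), end_(buf + n) {}
    void* allocate(std::size_t n) {
        n = (n + 15) & ~std::size_t(15);      // keep 16-byte alignment
        if (cur_ + n > end_) return nullptr;  // arena exhausted
        void* p = cur_;
        cur_ += n;
        return p;
    }
    void reset(std::byte* buf) { cur_ = buf; }  // "free" everything at once
};

// Fixed-size object pool: a free list threaded through the unused slots,
// so allocate/deallocate are each a couple of pointer moves.
template <typename T, std::size_t N>
class FixedPool {
    union Slot { Slot* next; alignas(T) std::byte storage[sizeof(T)]; };
    Slot slots_[N];
    Slot* free_ = nullptr;
public:
    FixedPool() {
        for (std::size_t i = 0; i < N; ++i) { slots_[i].next = free_; free_ = &slots_[i]; }
    }
    void* allocate() {
        if (!free_) return nullptr;
        Slot* s = free_; free_ = s->next; return s->storage;
    }
    void deallocate(void* p) {
        Slot* s = reinterpret_cast<Slot*>(p);
        s->next = free_; free_ = s;
    }
};

struct Tick { std::uint64_t seq; double px; };

int main() {
    alignas(16) static std::byte buf[1 << 20];           // 1 MB reserved up front
    Arena arena(buf, sizeof buf);
    Tick* a = new (arena.allocate(sizeof(Tick))) Tick{1, 99.5};

    static FixedPool<Tick, 1024> pool;
    Tick* b = new (pool.allocate()) Tick{2, 99.6};
    b->~Tick();
    pool.deallocate(b);
    (void)a;
}
```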

How RTS achieved 400-700 KMPS #p2p

Official benchmark result – 700 KMPS per instance of rebus or parser. Here are some enabling features:

  • [! iii] mostly single-threaded apps. Some older parsers are multi-threaded, but each thread operates in single-threaded mode. No shared mutable state :)
    • In the order book replication engine, we use lock-free techniques in one place only
  • ! (ST mode loves) Stateless [1] parser, small footprint. Only the downstream component (Rebus) handles cancels/busts and modifications, so we can run many parser instances per machine, or many threads per instance, fully utilizing the cpu cores. See https://haydenjames.io/linux-performance-almost-always-add-swap-space/
  • [! ii] allocation avoided — on millions of Output message objects. A pre-allocated ring buffer eliminates new(). Very few and small STL containers .. mostly arrays… pre-allocate DTOs@SOD #HFT (see the sketch after footnote [1])
  • ! allocation avoided — on millions of Input message objects, thanks to reinterpret_cast() on pointers… nearly zero-copy. See reinterpret_cast^memcpy in raw mktData parsing
  • ! allocation avoided — custom containers to replace the STL containers, since those all allocate from the heap
  • ! p2p messaging beats MOM 
  • ! Socket buffer tuning — to cope with busts. 64-256MB receive buffer in kernel; 4MB UserModeBuffer
  • low memory footprint. Most parsers use around 60MB. (My parsers were 100 to 150MB.) I think there are many benefits in terms of latency.
  • epoll — to replace select(), with a two-orders-of-magnitude improvement in throughput
  • buffering of Output list (of messages). I guess this hurts latency but enhances throughput
  • Very fast duplicate seq check, without large history data — a hot function
  • large containers like the ring buffer are pre-sized. No reallocation.
  • mostly array-based data structures — cache-friendly
  • Hot functions use pbref or RVO or move semantics, minimizing copying of hot objects
  • aggressively simplified parsing. Minimal logic
  • Special Logger to reduce I/O cost
  • Sharding to speed up symbol lookup
  • kernel bypass : RTS
  • No virtual functions — enhancing data cache and inlining .. 3 overheads@vptr #ARM
  • Inline
  • —— Speed with control: the best of both, to reduce the real risk of our connection becoming slow or lossy
  • Depth monitor to detect missed messages
  • Capacity tracker to monitor mps rates
  • Missed seq reporting
  • [!=one of the really effective techniques in RTS]
  • [i=presented in multiple (!!, !!!) interviews]

[1] The parser does remember the sequence numbers; in fact sequence number management is crucial. Before retrans was added to the parser, we never sent any retrans request, but we did keep track of gaps. Mostly we used the gap-tracker to check duplicates across Line A/B. We also relied on our sequence number state to handle sequence resets.
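Here is a hedged sketch of the two “allocation avoided” items above: a pre-allocated ring of output DTOs plus reinterpret_cast on the input bytes. The struct layouts, names and sizes are invented for illustration; real code must also respect alignment, aliasing and endianness rules.

```cpp
#include <cstddef>
#include <cstdint>

// Hypothetical wire format -- not the real RTS message layout.
#pragma pack(push, 1)
struct RawQuote {                 // the bytes exactly as they arrive on the wire
    std::uint64_t seqNum;
    std::uint32_t symbolId;
    std::int64_t  priceE4;        // price scaled by 10^4
};
#pragma pack(pop)

struct QuoteDTO { std::uint64_t seqNum; std::uint32_t symbolId; std::int64_t priceE4; };

constexpr std::size_t RING_SIZE = 1 << 16;   // pre-sized, never reallocated
QuoteDTO    gRing[RING_SIZE];                // all output DTOs pre-allocated at start of day
std::size_t gHead = 0;

void onPacket(const char* buf, std::size_t len) {
    if (len < sizeof(RawQuote)) return;      // malformed / truncated packet

    // Zero-copy view of the input bytes: no allocation, no intermediate object.
    const RawQuote* raw = reinterpret_cast<const RawQuote*>(buf);

    // Reuse a pre-allocated slot for the output: no new(), no delete() on the hot path.
    QuoteDTO& out = gRing[gHead++ & (RING_SIZE - 1)];
    out.seqNum   = raw->seqNum;
    out.symbolId = raw->symbolId;
    out.priceE4  = raw->priceE4;
}

int main() {
    char packet[sizeof(RawQuote)] = {};      // stand-in for bytes read from the socket
    onPacket(packet, sizeof packet);
}
```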

default malloc is slow, can be improved

See https://stackoverflow.com/questions/161053/which-is-faster-stack-allocation-or-heap-allocation

I see various evidence that industry practitioners consider the default allocator too slow.

I don’t think system calls are the issue; they are very infrequent with malloc.

[10] y memory footprint is key to latency #JGC

see also post on large in-memory search

I suspect there’s a dilemma —

  • large heap allocation requests -> free-list search becomes harder and slower. Need to avoid ad-hoc/unplanned requests for large chunks.
  • many small heap allocation requests -> the free-list manager becomes a hotspot
  • .. in reality, pre-allocating large arrays is probably a performance win.

I thought that in low latency, time costs outweigh space costs, but no: garbage collection is a major performance issue in low-latency systems, and I guess so is object creation. I guess that’s why memory efficiency affects latency. GC probably takes less cpu time if there’s less stuff to scan.

Distributed cache (memory virtualization) isn’t much used in low latency systems, possibly because of

  • serialization
  • network latency
  • TCP? I feel the fastest HFT systems have very limited I/O, perhaps in the form of shared memory. FIX is based on TCP, so I would assume very high overhead, but in reality it may be a small overhead.

mlock() : low-level syscall to prevent paging ] real-time apps

https://access.redhat.com/documentation/en-us/red_hat_enterprise_linux_for_real_time/7/html/reference_guide/using_mlock_to_avoid_page_io has sample code

See also https://eklitzke.org/mlock-and-mlockall

https://access.redhat.com/documentation/en-US/Red_Hat_Enterprise_MRG/1.3/html/Realtime_Reference_Guide/sect-Realtime_Reference_Guide-Memory_allocation-Using_mlock_to_avoid_memory_faults.html says

If the application is entering a time sensitive region of code, an mlockall call prior to entering, followed by munlockall can reduce paging while in the critical section. Similarly, mlock can be used on a data region that is relatively static or that will grow slowly but needs to be accessed without page faulting.
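A minimal sketch of that pattern, with error handling trimmed; the critical-section function is a hypothetical placeholder:

```cpp
#include <sys/mman.h>   // mlockall, munlockall (Linux/POSIX)
#include <cstdio>

void timeSensitiveWork() { /* ... hypothetical critical section ... */ }

int main() {
    // Lock current and future pages into RAM so the critical section
    // cannot take a major page fault.
    if (mlockall(MCL_CURRENT | MCL_FUTURE) != 0) {
        std::perror("mlockall");
        return 1;
    }

    timeSensitiveWork();

    munlockall();   // release the locks once latency no longer matters
    return 0;
}
```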

make_shared() cache efficiency, forward()

This low-level topic is apparently important to multiple interviewers. I guess there are similarly low-level topics like lockfree, wait/notify, hashmap, const correctness.. These topics are purely for theoretical QQ interviews. I don’t think app developers ever need to write forward() in their code.

https://stackoverflow.com/questions/18543717/c-perfect-forwarding/18543824 touches on a few low-level optimizations. Suppose you follow Herb Sutter’s advice and write a factory accepting a Trade ctor arg and returning a shared_ptr<Trade>:

  • your factory’s parameter should be a universal reference. You should then std::forward() it to make_shared(). See the make_shared() source in https://gcc.gnu.org/onlinedocs/libstdc++/libstdc++-api-4.6/a01033_source.html
  • make_shared() makes a single allocation for the Trade and an adjacent control block, which is cache-efficient — a read access through the Trade pointer tends to bring the control block into cache too
  • I remember reading online that one allocation vs two is a huge performance win….
  • if the arg object is a temp object, then the rvr is forwarded to the Trade ctor. Scott Meyers says the named lvr is cast back to an rvr. The Trade ctor would then need to move() it.
  • if the runtime object is carried by an lvr (the arg object is not a temp object), then the lvr is forwarded as-is to the Trade ctor?

Q: What if I omit std::forward()?
AA: The Trade ctor would always receive an lvr. See Scott Meyers P162, my github code, and the sketch below.

https://github.com/tiger40490/repo1/blob/cpp1/cpp1/rvrDemo.cpp is my experiment.
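A minimal sketch of such a factory, with a toy Trade class (a single std::string member) standing in for the real thing:

```cpp
#include <memory>
#include <string>
#include <utility>

struct Trade {
    std::string symbol;
    explicit Trade(std::string s) : symbol(std::move(s)) {}
};

// Universal (forwarding) reference + std::forward: an rvalue argument stays
// an rvalue all the way into the Trade constructor; an lvalue stays an lvalue.
// make_shared() performs one allocation holding the Trade and its control
// block side by side.
template <typename Arg>
std::shared_ptr<Trade> makeTrade(Arg&& a) {
    return std::make_shared<Trade>(std::forward<Arg>(a));
}

int main() {
    std::string sym = "IBM";
    auto t1 = makeTrade(sym);                  // lvalue: Trade copies the string
    auto t2 = makeTrade(std::string("MSFT"));  // rvalue: forwarded, the string is moved
}
```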

 

L1/L2/L3 latency stats@2012 processor #Martin

I said “1000 times” when a GS interviewer asked for an estimate of relative latency of main memory vs register. He said that’s about right.

Numbers below were taken from the CPU Cache Flushing Fallacy blog post by Martin Thompson, which indicates that for a particular 2012-era Intel processor, the following was observed:

  • register access = single cycle or 4 instructions per cycle
  • L1 latency = 3 cycles (3 x register)
  • L2 latency = 12 cycles (4 x L1, 48 x register)
  • L3 latency = up to 38 cycles (3 x L2, 12 x L1, 144 x register)
  • MM latency = very roughly 200 cycles (5 x L3, 15 x L2, 60 x L1, 720 x register) = average 65 ns on a 3 GHz CPU

The diagram is more simplified than the text, but there is a lot of fine print.

malloc=long considered costly

Heap allocation is extremely slow compared to other operations.

 

CPU(data)cache prefetching

The top answer at https://stackoverflow.com/questions/1950878/c-for-loop-indexing-is-forward-indexing-faster-in-new-cpus is concise. I think the specific observations may not stay relevant in x years, but the principles will. A small sketch follows the note below.

  • adjacent cache line (ACL) prefetcher — simple to understand
  • cpu can detect streams of memory accesses in forward or backward directions

Note L1/L2/L3 caches are considered part of the CPU even if some of them are physically outside the microprocessor.
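Here is a small sketch of what a forward “stream of memory accesses” looks like versus a large-stride pattern. The relative timings vary by CPU, so treat this as an illustration rather than a benchmark:

```cpp
#include <vector>
#include <cstddef>

// Sequential, unit-stride access: the hardware prefetcher recognizes the
// forward stream and pulls the next cache lines in ahead of use.
long long sumRowMajor(const std::vector<int>& m, std::size_t rows, std::size_t cols) {
    long long s = 0;
    for (std::size_t r = 0; r < rows; ++r)
        for (std::size_t c = 0; c < cols; ++c)
            s += m[r * cols + c];        // adjacent elements, cache friendly
    return s;
}

// Large-stride access: each read lands on a different cache line, so the
// prefetcher helps far less and many loads miss.
long long sumColMajor(const std::vector<int>& m, std::size_t rows, std::size_t cols) {
    long long s = 0;
    for (std::size_t c = 0; c < cols; ++c)
        for (std::size_t r = 0; r < rows; ++r)
            s += m[r * cols + c];        // stride of `cols` ints between reads
    return s;
}

int main() {
    std::size_t R = 1024, C = 1024;
    std::vector<int> m(R * C, 1);
    return static_cast<int>(sumRowMajor(m, R, C) - sumColMajor(m, R, C)); // both sums equal R*C
}
```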

sharedMem in low latency systems

Hi Anthony,

Is shared-memory a popular messaging solution in low-latency trading?

I know some high-volume data processing engines (like Ab Initio) favor shared memory as the fastest IPC solution. However, I feel that in low-latency trading, messaging middleware (like tibrv, 29west, Solace) is more popular. For a trading engine, shared-memory IPC can be the basis of messaging between processes on the same machine, but not across different machines.

Does your system use shared memory?

If interested, you can check out

http://www.cisco.com/web/strategy/docs/finance/29W-INFA-Cisco-IPC-Performance-new.pdf

http://www.informatica.com/sg/company/news-and-events-calendar/press-releases/05152013-ultramessaging.aspx

http://solacesystems.com/blog/high-frequency-trading-to-warp-speed

y memory footprint hurts latency #my take

See also post on large in-memory search

Reason: the larger the heap to scan, the slower the GC (though paradoxically, a large heap helps you postpone GC). The less memory you use, the lower your GC penalty.

reason: favor an in-memory data store rather than disk. In-memory can mean remote or local (best). The smaller your footprint, the easier.
reason: favor a single VM — serialization-free, network-free. The smaller, the easier.
reason: serialization (for IPC, cache replication, network or disk) is slower for larger footprints.
reason: allocation is costly, even for a 32-bit int. Look at shared_ptr and circular arrays.
reason: minimize object creation (i.e. allocation) — standard optimization advice
reason: reduce GC frequency
reason: reduce GC workload, i.e. the amount of memory to scan. If you must incur a full GC, then make it short.

reason? Compared to graphs, arrays enjoy 1) random access and 2) one-shot memory allocation, but if each element is bulky [2] then a long array requires too large a contiguous block of memory, which is hard to allocate. The smaller your objects, the easier.

[2] Note that a Java array element is usually a 32-bit pointer.

Major technique: favor iterative over recursive algorithms to reduce stack memory footprint
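A trivial contrast for that last point (a sketch, not from the original material): the recursive version grows the stack with the input, the iterative one does not.

```cpp
#include <cstdint>

struct Node { int value; Node* next; };

// Recursive sum: each call adds a stack frame, so a list of N nodes
// uses O(N) stack memory and can overflow for large N.
std::int64_t sumRecursive(const Node* n) {
    return n ? n->value + sumRecursive(n->next) : 0;
}

// Iterative sum: constant stack footprint regardless of list length.
std::int64_t sumIterative(const Node* n) {
    std::int64_t s = 0;
    for (; n; n = n->next) s += n->value;
    return s;
}

int main() {
    Node c{3, nullptr}, b{2, &c}, a{1, &b};
    return static_cast<int>(sumRecursive(&a) - sumIterative(&a));  // both 6, returns 0
}
```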