Hog the cpu by busy-waiting, with interrupts off

In the low-latency seminar https://www.youtube.com/watch?v=BD9cRbxWQx8, Dan Shaya said our thread should hog the cpu by spinning, with interrupts off (to fend off other threads).

I like this unconventional advice.
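
Here is my own minimal, Linux-flavored sketch of the idea (not from the talk). User code cannot literally turn interrupts off; the closest user-space approximation is to pin the thread to a dedicated core (ideally one isolated via isolcpus/nohz_full) and busy-poll instead of blocking. The function name run_hot_loop and the atomic "work" stand-in are hypothetical.

```cpp
// My own minimal sketch (not from the talk), Linux-only: pin the thread to a
// dedicated core and busy-poll instead of blocking. True "interrupts off"
// needs kernel settings (e.g. isolcpus/nohz_full); user code can only pin+spin.
// Compile with: g++ -O2 -pthread spin.cpp
#include <pthread.h>
#include <sched.h>
#include <atomic>

std::atomic<bool> stop{false};
std::atomic<int>  work{0};   // hypothetical stand-in for a lock-free inbound queue

void run_hot_loop(int core_id) {
    cpu_set_t set;
    CPU_ZERO(&set);
    CPU_SET(core_id, &set);
    pthread_setaffinity_np(pthread_self(), sizeof(set), &set);  // pin to one core

    while (!stop.load(std::memory_order_relaxed)) {
        if (int task = work.exchange(0, std::memory_order_acquire)) {
            (void)task;  // handle the task here
        }
        // no sleep, no condition variable: keep the core hot
    }
}
```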

##G5 c++skills { Stroustrup chat

C++ has a clear advantage in low-latency.

Ranked from most specific to most vague. Another ranking criterion is market value.

  1. memory mgmt – for time and space efficiency
  2. generic techniques to use stack objects instead of heapy thingies
    • reduce reliance on dynamic containers
  3. virtual — reusable techniques to avoid virtual
  4. other compiler internals that impact latency
    1. knowledge of inline — to trade off latency against code size
  5. latency instrumentation tools
  6. safe c++ techniques, to reduce the chance of run-time errors. C++ is particularly vulnerable as it is
    • more complex than C and others, and
    • closer to hardware than Java and other languages.
  7. TMP — again to trade off latency against complexity. But this zbs domain is too big for me
  8. tool chain


scaffolding around try{}block #noexcept

[[ARM]] P358 says that all local non-static objects on the current call stack that are fully constructed since the start of the try-block are “registered” for stack unwinding. The registration is fine-grained in terms of partial destruction —

  • for any array with 3 out of 9 objects fully constructed, the stack unwinding would only destruct those 3
  • for a half-constructed composite object with sub-objects, all fully constructed sub-objects will be destructed
  • Any half-constructed object itself is not registered, since its dtor would be unsafe.

I guess this registration is an overhead at run time.

For the stack objects created in a noexcept function, this “registration” is not required, so the compiler may or may not call their destructors.

In http://www.stroustrup.com/C++11FAQ.html#noexcept Stroustrup hints at this scaffolding:

  • noexcept is an efficiency feature — widely and systematically used in the standard library to improve performance
  • noexcept is crude and “very efficient”
  • dtor may not be invoked upon stack unwinding
  • stack unwinding may not happen at all
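
A minimal demo of my own (not from [[ARM]] or the FAQ): when an exception escapes a noexcept function, std::terminate() is called, and the standard leaves it implementation-defined whether the stack is unwound, so local dtors may never run.

```cpp
// My own minimal demo of the point above: an exception escaping a noexcept
// function calls std::terminate(), and whether the stack is unwound (i.e.
// whether ~Resource runs) is implementation-defined.
#include <cstdio>

struct Resource {
    ~Resource() { std::puts("dtor"); }   // may never run
};

void risky() { throw 42; }

void wrapper() noexcept {   // promise: no exception escapes
    Resource r;
    risky();                // violates the promise => std::terminate()
}

int main() { wrapper(); }
```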


3 real overheads@vptr #inline

Suppose your class Trade has virtual functions and a comparable class Order has no virtual functions. What are the specific runtime overheads of the vptr/vtable usage?

  1. cpu cache efficiency — memory footprint of the vptr in each object. Java is affected too! If you have a lot of Trade objects with only one char data field, then the vptr greatly expands the footprint and you overuse cache lines.
    • [[ARM]] singles out this factor as a justification for -fno-rtti… see RTTI compiler-option enabled by default
    • [[moreEffC++]] P116 singles out vptr footprint as the biggest performance penalty of vptr
  2. runtime indirection — the non-virtual Order call is “a few memory references more efficient” [1]
  3. inlining inhibition is the most significant overhead. P209 [[ARM]] says inline virtual functions make perfect sense, so it is best to bypass the vptr and call the virtual function directly, where possible.

[1] P209 [[ARM]] wording

Note a virtual function unconditionally introduces the first overhead, but the #2/#3 overheads can sometimes be avoided by a smart compiler.
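
A tiny footprint sketch of overhead #1, using hypothetical minimal versions of the Trade/Order classes from the question:

```cpp
// Footprint sketch (hypothetical minimal Trade/Order): one virtual function
// adds a vptr (8 bytes on a typical 64-bit build) plus alignment padding.
#include <cstdio>

struct Order {                      // no virtual functions
    char side;
};
struct Trade {                      // one virtual function => vptr in every object
    char side;
    virtual ~Trade() = default;
};

int main() {
    std::printf("sizeof(Order)=%zu sizeof(Trade)=%zu\n",
                sizeof(Order), sizeof(Trade));   // typically 1 vs 16
}
```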


std::sort() beating ANSI-C qsort() #inline

Stroustrup was the first one to tell me c++ std::sort() can beat C qsort() easily.

https://travisdowns.github.io/blog/2019/05/22/sorting.html says:

Since the qsort() code is compiled ahead of time and is found inside the shared libc binary, there is no chance that the comparator function, passed as a function pointer, can be inlined.

https://martin-ueding.de/articles/qsort-vs-std-sort/index.html says

For qsort(), since the function is passed as a pointer, and the elements are passed as void pointers as well, it means that each comparison costs three indirections and a function call.

In C++, the std::sort is a template algorithm, so that it can be compiled once for each type. The operator< of the type is usually baked into the sort algorithm as well (inlining), reducing the cost of the comparison significantly.
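
A side-by-side sketch of my own showing the two call styles: qsort() receives an opaque function pointer and void* elements, while std::sort is instantiated for the element type and the comparator, so the comparison can be inlined.

```cpp
// My own side-by-side sketch of the two call styles.
#include <algorithm>
#include <cstdlib>
#include <vector>

int cmp(const void* a, const void* b) {   // opaque to qsort's pre-compiled code
    int x = *static_cast<const int*>(a);
    int y = *static_cast<const int*>(b);
    return (x > y) - (x < y);
}

int main() {
    std::vector<int> v{3, 1, 2};

    // C style: comparator passed as a function pointer into libc, elements as
    // void*; the comparison cannot be inlined.
    std::qsort(v.data(), v.size(), sizeof(int), cmp);

    // C++ style: std::sort is instantiated for int and this lambda; the
    // comparison is typically inlined into the generated sorting loop.
    std::sort(v.begin(), v.end(), [](int a, int b) { return a < b; });
}
```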

reinterpret_cast(zero-copy)^memcpy: raw mktData parsing

Raw market data input comes in as an array of unsigned chars. I “reinterpret_cast” it to a pointer-to-TradeMsgStruct before looking up each field inside the struct.

Now I think this is the fastest solution. Zero-cost at runtime.

As an alternative, memcpy is also popular but it requires a bitwise copy. It often requires allocating a temporary variable.
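
A sketch of both approaches, using a hypothetical fixed-width TradeMsgStruct (the field names are mine). The zero-copy cast assumes the wire layout matches the struct byte-for-byte (packing, byte order, alignment); strictly speaking the memcpy version is the one with fully defined behavior.

```cpp
// Sketch with a hypothetical fixed-width TradeMsgStruct (field names are mine).
#include <cstdint>
#include <cstring>

#pragma pack(push, 1)
struct TradeMsgStruct {
    std::uint64_t instrumentId;
    std::uint32_t qty;
    std::int64_t  price;          // e.g. fixed-point
};
#pragma pack(pop)

std::uint32_t parseQty(const unsigned char* buf) {
    // zero-copy: reinterpret the raw bytes in place (strict-aliasing caveat)
    auto* msg = reinterpret_cast<const TradeMsgStruct*>(buf);
    return msg->qty;
}

std::uint32_t parseQtyCopy(const unsigned char* buf) {
    // alternative: memcpy into a temporary struct, i.e. a bitwise copy
    TradeMsgStruct tmp;
    std::memcpy(&tmp, buf, sizeof tmp);
    return tmp.qty;
}
```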

unordered_map^map performance: unpredictable

Small halo… the theories and explanations are subject to change.

Many people report that unordered_map can be slower than traditional std::map for small collections. I feel it’s implementation-dependent. Results may NOT apply to java or c#.

http://www.stroustrup.com/C++11FAQ.html#std-unordered (by Stroustrup) says “For larger numbers of elements (e.g. thousands), lookup in an unordered_map can be much faster than for a std::map.” In the same paragraph Stroustrup also says unordered_map “lookup involves a single call of a hash function and one or more equality operations”.

https://blog.dubbelboer.com/2012/12/04/lru-cache.html is mostly about lookup speed. It shows that string key size hurts the hash table, while sample size hurts the RBTree.

Many interviewers asked me

  • Q: why for a small string collection, RBTree can outperform hash table?
  • A: String hashing takes time proportional to string size
  • A: Poor hashing => Hash collision => linear search is another potential problem. String hashing is unpredictable compared to integer hashing. No one can guarantee the next sample will be “uniform” according to our hash function.
  • A: rehash

Cache efficiency on the bucket array (an array of pointers) is another reported weakness of hash tables. However, I guess a small bucket array can be completely held in cache, and each linked list usually holds only 1 node, so the cache miss is on that 1 node. For a tree, logN nodes are accessed per lookup and every one of them could be a cache miss.

Practical suggestion — it’s easy to swap the two in a given codebase, so just swap and benchmark. Either can be faster. If your data set changes over time, you may need to re-benchmark.
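
A minimal benchmark sketch of my own for that suggestion; key type, collection size, allocator and compiler flags all change the outcome, so measure in your own codebase.

```cpp
// My own minimal benchmark sketch for "just swap and benchmark".
#include <chrono>
#include <cstdio>
#include <map>
#include <string>
#include <unordered_map>

template <typename MapT>
long long lookupNanos(const MapT& m, const std::string& key, int reps) {
    auto t0 = std::chrono::steady_clock::now();
    long long hits = 0;
    for (int i = 0; i < reps; ++i) hits += static_cast<long long>(m.count(key));
    auto t1 = std::chrono::steady_clock::now();
    std::printf("hits=%lld ", hits);   // keep the loop from being optimized away
    return std::chrono::duration_cast<std::chrono::nanoseconds>(t1 - t0).count();
}

int main() {
    std::map<std::string, int> tree;
    std::unordered_map<std::string, int> hash;
    for (int i = 0; i < 50; ++i) {                 // a "small collection"
        tree["key" + std::to_string(i)] = i;
        hash["key" + std::to_string(i)] = i;
    }
    std::printf("map lookups: %lld ns\n", lookupNanos(tree, "key42", 1000000));
    std::printf("unordered_map lookups: %lld ns\n", lookupNanos(hash, "key42", 1000000));
}
```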

##fastest container choices: array of POD #or pre-sized vector

relevant to low-latency market data.

  • raw array is “lean and mean” — the most memory efficient; vector is very close, but we need to avoid reallocation (see the pre-sizing sketch after this list)
  • std::array is less popular but should offer similar performance to vector
  • all other containers are slower, with bigger footprint
  • For high performance, avoid containers of nodes/pointers — cache affinity loves contiguous memory. After accessing the 1st element, accessing the 2nd element is likely a cache-hit
    • set/map, linked list suffer the same
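
A minimal sketch of the pre-sized vector idea referenced in the list (Quote is a hypothetical POD payload; the point is the single up-front reserve()).

```cpp
// Pre-size the vector so the hot path never reallocates or copies.
#include <vector>

struct Quote { long instrumentId; long px; long qty; };   // hypothetical POD payload

int main() {
    std::vector<Quote> book;
    book.reserve(1000000);                 // one allocation, before trading starts

    for (long i = 0; i < 1000; ++i)
        book.push_back({i, 100 + i, 10});  // contiguous, cache-friendly appends,
                                           // never triggers a reallocation
}
```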

Inlining is THE optimization

MSVS and g++ debug builds both disable inlining (presumably to ease debugging). The performance difference vis-a-vis a release build is mostly due to this single factor, according to [[optimized c++]].

The same author asserts that inlining is probably the most powerful code optimization.

Stroustrup singled out inline as a key feature of c++.

In his low-latency seminar, Martin Thompson also quoted a computer scientist saying “Inlining is THE optimization”. This is now a memorable quote.

I think one of my java performance books singled out inlining as the single most effective optimization the compiler can do.

Pimpl effectively disables inlining

vptr-based dispatch also disables inlining

C++faq gives many reasons for and against inline.

contiguous memory data structures : page faults+cache miss

Compared to linked data structures, I feel vectors (and array, deque, circular array…) reduce page faults and cache misses.

P124 [[art of debugging]] describes how a simple array is loaded from swap space into a real memory “page”. This is managed by the virtual memory system.

P19 [[optimized c++]] describes that the L1 or L2 cache also loads a vector more efficiently than a linked data structure. For example, each time main memory is read, the adjacent memory cells are loaded together into L1 cache, so contiguous memory data structures result in far fewer cache misses, in theory.

A virtual memory page is about 4KB. A cache line is about 64B.

Both page faults and cache misses are costly at runtime. In both cases the “backing store” is much slower but we have to read from it. These situations are by design and very common, but it pays to reduce their occurrence. Even if we reduce page fault frequency or cache miss frequency by 10%, it could be helpful.
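
A small traversal sketch of my own contrasting the two layouts; absolute numbers vary by machine, but the vector's contiguous layout normally wins by a wide margin.

```cpp
// Contiguous vs node-based traversal.
#include <chrono>
#include <cstdio>
#include <list>
#include <numeric>
#include <vector>

template <typename C>
long long sumNanos(const C& c) {
    auto t0 = std::chrono::steady_clock::now();
    long long s = std::accumulate(c.begin(), c.end(), 0LL);
    auto t1 = std::chrono::steady_clock::now();
    std::printf("sum=%lld ", s);            // defeat dead-code elimination
    return std::chrono::duration_cast<std::chrono::nanoseconds>(t1 - t0).count();
}

int main() {
    std::vector<int> v(1000000, 1);         // one contiguous block: few cache misses
    std::list<int>   l(v.begin(), v.end()); // one heap node per element: many misses
    std::printf("vector: %lld ns\n", sumNanos(v));
    std::printf("list:   %lld ns\n", sumNanos(l));
}
```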

inline getter methods

  • In C/C++, if you have a simple getter on a private field, you would typically ask the compiler to inline it, eliminating the function-call run-time overhead (see the sketch after this list).
  • In c#, the “property” getter is often inlined, but there’s no such requirement on the compiler.
  • Java getters can get inlined too, esp. if the field is final.
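
A minimal C++ sketch of the first bullet (the Quote class and px field are hypothetical names):

```cpp
// A getter defined inside the class body is implicitly inline, so an
// optimizing compiler usually removes the call entirely.
class Quote {
    long px_;
public:
    explicit Quote(long px) : px_(px) {}
    long px() const { return px_; }   // typically compiles down to a plain load
};

long spread(const Quote& bid, const Quote& ask) {
    return ask.px() - bid.px();       // no function-call overhead after inlining
}
```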

[12] y implement state transition by static methods#OMS

(This discussion is relevant to any high-concurrency finite state machine or FSM, but I will use an order management system as example.)

Suppose my FSM implements state transitions in the form of instance methods defined on the Order object, myOrder->A_to_B(Event). Inside such a method, we verify this->currentState == A, validate other Conditions, and take the appropriate Action to process the Event, before updating this->currentState = B.

(The Transition/Event/Condition/Action/Validation concepts are fundamental to FSM in general. See http://www.thetibcoblog.com/2007/06/26/differences-between-a-bre-and-a-rule-driven-cep-engine-part-1)

I once proposed that A_to_B could be converted to a static (java/c#/c++) or (c++) global, free-standing function A_to_B(Event, Order). I once said that this “might” be a bit easier to multi-thread. Now I feel in most cases such a simple change (from non-static to static) won’t improve concurrency. Such a change may have bigger benefits (to be elaborated) but concurrency is rarely one of them. However, here’s one possible “rare” context.

Following the Functional Programming principle (as seen in FMD), each and every object is immutable, including Order. Therefore the global function will return a new Order instance representing the same real-world order. Immutability removes the need to synchronize, completely.
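
A minimal C++ sketch of this free-function, immutable-Order style. Only the A_to_B(Event, Order) shape and the State/Event/Condition/Action vocabulary come from the discussion above; the fields are hypothetical.

```cpp
#include <stdexcept>

enum class State { A, B };
struct Event { long qty; };

struct Order {                // immutable: fields set at construction, never mutated
    const long  id;
    const long  qty;
    const State currentState;
};

// free-standing transition: no "this", no mutation, returns a fresh snapshot
Order A_to_B(const Event& ev, const Order& in) {
    if (in.currentState != State::A)                  // the Validation/Condition
        throw std::logic_error("not in state A");
    return Order{in.id, in.qty + ev.qty, State::B};   // the Action, then the new state
}
```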

Q: But what if 2 threads need to access a given Order?
A: If one of them is a reader, then immutability will probably help.
A: If both are writers, then my counter question is why this scenario is allowed to exist —

I feel in low latency OMS each order should ideally be updated by the same physical thread throughout. Multi-queue — each security/cusip belongs exclusively to a single thread, never hopping from thread X to thread Y — Rule 1.

I feel Rule 1 is an extremely practical rule but in some contexts it is over-restrictive and needs to be relaxed, but still at any moment an order is only updated on a single thread, never concurrently on 2 threads — Rule 1b.

However, I feel all good software rules have exceptions (as explained elsewhere in my blog) so sometimes it’s legitimate to break Rule 1b, though I won’t condone such reckless practice. Well I’m not boss…:)

In such a system, I take the liberty to assume every writer thread must put the updated Order into a shared cache. 2 writers using the instance method A_to_B()… must synchronize on the cache (usually a hash map). The FP design could use CAS instead. Since the chance of concurrent writes is low, a CAS retry is unlikely, so CAS will be faster. Uncontended synchronization is always slower than uncontended CAS — see Brian Goetz’s article at http://www.ibm.com/developerworks/java/library/j-jtp04186/index.html.

However, immutability entails more object creation instead of updating mutable fields. Trade-off to be measured.

avoid notifyAll() in low latency apps

notifyAll() is unpopular in low-latency apps. For fear of a lost signal, we ask the JVM to do a lot of useless work: we wake up everyone when we only need one of them woken up. But how do we get away with notify()? Here are 2 common conditions that should both be satisfied.

* All the threads in your target wait set must be identical, so any one can get up and finish the task.
* All are waiting on the same while() condition.

Imagine some threads are waiting for a blocking queue to become non-empty; other threads are waiting for non-full. If you notify(), the risk is that the scheduler chooses one waiting thread, but the chosen one checks its while() condition and sees it is waiting for non-empty, while the signal actually means no-longer-full, so the signal is ignored and lost. Using the JDK 5 Condition variable, you can simplify this kind of situation with multiple conditions bound to the same lock.
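
The post is about Java's notify()/notifyAll(), but the same "multiple conditions on one lock" idea is easy to show in C++ with std::condition_variable. A sketch of mine, not production code:

```cpp
// Two conditions bound to one lock, so notify_one() always wakes a thread
// that wants exactly this signal.
#include <condition_variable>
#include <cstddef>
#include <deque>
#include <mutex>

template <typename T>
class BoundedQueue {
    std::mutex mtx;
    std::condition_variable notEmpty;   // producers signal this
    std::condition_variable notFull;    // consumers signal this
    std::deque<T> q;
    const std::size_t cap;
public:
    explicit BoundedQueue(std::size_t c) : cap(c) {}

    void put(T item) {
        std::unique_lock<std::mutex> lk(mtx);
        notFull.wait(lk, [&] { return q.size() < cap; });
        q.push_back(std::move(item));
        notEmpty.notify_one();          // wake exactly one consumer; no lost signal
    }

    T take() {
        std::unique_lock<std::mutex> lk(mtx);
        notEmpty.wait(lk, [&] { return !q.empty(); });
        T item = std::move(q.front());
        q.pop_front();
        notFull.notify_one();           // wake exactly one producer
        return item;
    }
};
```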

clever use of enum(char@@) in demanding c++ app #Miami

class Der: public Base {…..}

Requirement — we need to downcast a Base pointer. dynamic_cast has overhead.

Der* returnVal = static_cast<Der*>(new Base); // will compile but is potentially disastrous, because returnVal can be a partially wild pointer.

Q: without incurring the overhead of vtbl lookup, how do I decide if a pointer to Base given to me actually points to a Base or Der instance?

%%A: type_info still uses vtbl.
%%A: use a Base field to indicate the type. Both ctors will set the field, so the Der ctor will overwrite it.
A: enum. If sizeof(aBase) is a tiny 2 bytes, then adding an enum field adds just 1 byte. Adding vptr would add 8 bytes. Big difference if we instantiate billions of this class.

You can also use a char, which is always 1 byte. An enum can be configured to take 1 byte, yet have meaningful, longer enumerator names.

https://github.com/tiger40490/repo1/blob/cpp1/cpp/88miscLang/enumBasedRTTI.cpp has my tested code
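
For illustration, here is my own minimal sketch of the enum-tag idea (the tested version is at the link above and may differ):

```cpp
// Poor man's RTTI via a 1-byte tag; no vptr, no vtbl lookup.
#include <cstdint>

enum class Kind : std::uint8_t { BASE, DER };   // forced to take 1 byte

struct Base {
    Kind kind;
    explicit Base(Kind k = Kind::BASE) : kind(k) {}
    // intentionally no virtual functions, so no vptr in the object
};

struct Der : Base {
    Der() : Base(Kind::DER) {}
};

Der* downcastIfDer(Base* p) {
    // check the tag instead of dynamic_cast
    return p->kind == Kind::DER ? static_cast<Der*>(p) : nullptr;
}
```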

no lock no multiplex`no collections/autobox`:mkt-data sys

Locks and condition variables are important to threading and Wall Street systems, but the highest performance market data systems don't use those. Many of them don't use java at all.

They dedicate a CPU to a single thread, eliminating context switches. The thread reads a (possibly fixed-sized) chunk of data from a single socket and puts the data into a buffer, then it goes back to the socket, non-stop, until there's no more data to read. At that point, the read operation blocks the thread and the exclusive CPU. Subsequent processing on the buffer is asynchronous and on a different thread. This 2nd thread can also get a dedicated CPU.
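
A rough sketch of such a reader thread using the POSIX socket API (my own; the hand-off to the processing thread, e.g. a lock-free ring buffer, is only hinted at in the comments).

```cpp
// Hypothetical dedicated reader loop on a POSIX socket.
#include <sys/types.h>
#include <sys/socket.h>
#include <cstddef>

void readerLoop(int sockFd, unsigned char* ringBuffer, std::size_t slotSize) {
    std::size_t slot = 0;
    for (;;) {
        // drain the socket as fast as possible; recv() blocks only when empty
        ssize_t n = recv(sockFd, ringBuffer + slot * slotSize, slotSize, 0);
        if (n <= 0) break;               // peer closed the socket, or error
        // publish (slot, n) to the processing thread here
        slot = (slot + 1) % 1024;        // assume 1024 pre-allocated slots
    }
}
```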

This design ensures that the socket is consumed at the highest possible speed. (Can you apply 2 CPUs to this socket? I doubt it.) You may notice that the dedicated CPU is idle when the socket is empty, but in the context of a high-volume market data feed that's unlikely, or a small price to pay for the throughput.

Large live feeds often require dedicated hardware anyway. Dedication means possible under-utilization.

What if you try to be clever and multiplex a single thread among 2 sockets? Well you can apply only one cpu on any one thread! Throughput is slower.

mkt-data favor arrays+fixed width, !! java OBJECTs

(See also post on market data and autoboxing.)

In the post on “size of Object.java” we see every single Object.java instance needs at least 8 bytes of bookkeeping.

Therefore a primitive array (say an array of 9 ints) has much lower overhead than a Collection of Integer.java
– the array takes 4 bytes x 9 + at least 8 bytes of bookkeeping
– the collection takes at least sizeof(Integer) x 9 + bookkeeping. Each Integer takes at least 4 bytes of payload + 8 bytes of bookkeeping.

Market-data engines get millions of primitives per second. They must use either c++ or java primitive arrays.

Market data uses lots of ints and chars. For chars, again java String is too wasteful. Most efficient would be a C-string without the null terminator.

The fastest market data feed is fixed-width, so no bandwidth is wasted on delimiter bits.

Floats are imprecise. I am not an experienced practitioner, but I don’t see any raw market data represented in floating point.

Java Objects also increase garbage collection work. Often an indeterminate full GC hurts transaction FIX engines more than market data FIX engines, but in HFT shops the latter need extreme consistency. An unpredictable pause in the market data FIX engine can shut out an HFT auto-trader for a number of milliseconds, during the most volatile moments such as a market crash.

mkt-data engines hate pre-java8 autoboxing

(See also post on market data and size of Object)

High volume market data systems deal with primitives like ints, char-arrays and occasionally bit-arrays. These often need to go into collections, like a vector of int. These systems avoid dynamic memory allocation like the plague.

Specifically, Java collections’ auto-boxing results in excessive memory allocation — a new object for every primitive item. This is a major justification for the C++ STL, according to a CS veteran.

Sun also says autoboxing is inappropriate for high performance.

C# generic collections are free of autoboxing, just like the STL, and so are the primitive streams in java 8. But I don’t know if market data systems have actually embraced primitive streams.

C-style programming]java: can be legit

A friend told me it doesn’t make sense to write C-style programs in java, but I believe some of the fastest high volume trading engines
– use very few objects (perhaps some Object.java instances or array objects), but lots of primitives
– minimize garbage collection to almost never
– use no java collections, but arrays (albeit objects)
– use many dedicated threads but not always thread pools
– use almost no locks among the threads

So why java instead of C? I feel in such domains, C is indeed the language of choice, but I guess java offers

* platform independence — (but do we need it, when we use JIT compiling?). Truly compile-once, run anywhere.
* GC — (occasional use), since C memory management can be tricky.
* threading — the superior threading constructs in java
* off-the-shelf modules — many are available to java only, though less common in low-latency.
* pointers — java eliminates all (problematic) access to raw pointers
* platform-specific coding — java prevents programmers from resorting to platform-specific features. For any powerful/resourceful developer, if there exists a shortcut, it will be exploited someday — human nature
* without pointers and complex type declarations, programming is more readable, easier and less error prone
* the option to occasionally use some of the many java features, like the ones below. Note in a superfast engine, not all parts are equally latency-sensitive, so those “easy” parts can (and probably should) be written in a higher-level language. C won’t give you that option.
** RMI
** collections
** NIO

10 coding tips of high-frequency exchange connectivity

[c] use enum/char rather than vptr #Miami
[c] hand-crafted hash table, which is often more optimized for our specific requirement #Barcap
[c] Fixed-width, padded char-array without null terminator #Miami
[c] favor Set to Map
[c] hand-crafted reference counted data structure #Miami
[c] monolithic network-friendly wire format STRUCT for high volume data transfer. Nothing fancy.
[c] avoid java finalizers, since they slow down garbage collection
[c] favor arrays to object graphs #UBS
[c] avoid boxing and java collections. Favor primitives. #CS
specialized operator-new. See [[eff c++]]
unions

—-
Note — all of these are considered optimizations on memory. That’s where C++ outsmarts java.

[c=according to insiders]