fastest container: array of POD #or pre-sized vector

relevant to low-latency market data.

  • raw array is the most memory efficient; vector is very close, but need to avoid reallocation
  • std::array is less popular but should offer similar performance to vector
  • all other containers are slower and bigger footprint
  • For high-performance, avoid container of pointers — Cache affinity loves contiguous memory. After accessing 1st element, the accessing 2nd element is likely a cache-hit

how2guage TPS capacity@mkt data engine

Pump in an artificial feed. Increase the input TPS rate until

  1. CPU utilization hits 100%
  2. messages get dropped

The input TPS is the the highest acceptable rate i.e. the “capacity” of this one process.

Note each feed has its own business logic complexity level, so the same software may have 600k TPS capacity for a simple Feed A but only 100k TPS for a complex Feed B.

Also in my experience the input interface is the bottle neck compared to the output interface. If System X feeds into System Y, then we want to have System X pumping at 50% of Y’s capacity. In fact, we actually monitor the live TPS rate. The difference between that and the capacity is the “headway”.

C for latency^TPS can use java

I’m 98% confident — low latency favors C/C++ over java [1]. FPGA is _possibly_ even faster.

I’m 80% confident — throughput (in real time data processing) is achievable in C, java, optimized python (Facebook?), optimized php (Yahoo?) or even a batch program. When you need to scale out, Java seems the #1 popular choice as of 2017. Most of the big data solutions seem to put java as the first among equals.

In the “max throughput” context, I believe the critical java code path is optimized to the same efficiency as C. JIT can achieve that. A python and php module can achieve that, perhaps using native extensions.

[1] Actually, java bytecode can run faster than compiled C code (See my other posts such as

MOM+threading ] low latency apps

Piroz told me that trading IT job interviews tend to emphasize multi-threading and MOM. Some use SQL too. I now feel all of these are unwelcome in low latency trading.

A) MOM – The HSBC interviewer was the first to point out to me that MOM adds latency. Their goal is to get the (market) data from producer to consumer as quickly as possible, with minimum stops in between.

Then I found that the ICE/IDC systems has no middleware between feed parser and order book engine (named Rebus). 29West documentation echos “Instead of implementing special messaging servers and daemons to receive and re-transmit messages, Ultra Messaging routes messages primarily with the network infrastructure at wire speed. Placing little or nothing in between the sender and receiver is an important and unique design principle of Ultra Messaging.

B) threading – ST is generally the fastest in theory and in practice. (I only have a small sample size) I feel the fastest trading engines are ST. No shared mutable.

MT is OK if they don’t compete for resources like CPU, I/O or locks. Compared to ST, most lockfree systems introduce latency like retries.

C) SQL – as stated elsewhere, flat files are much faster than relational DB. How about in-memory relational DB?

Rebus, the order book engine, is in-memory.

realistic TPS for exchange mkt data gateway: below 1M/sec

  • Bombay’s BSE infrastructure (including matching engine) can execute 1 mil trades/second. Must maintain order books, therefore NOT stateless.
  • Rebus maintains full order books for each symbol, can handle 300k (in some cases 700k) messages per second per instance. Uses AVL tree, which beat all other data structure in tests.
  • The Nasdaq feed parser (in c++) is stateless and probably single-threaded. Very limited parsing logic compared to Rebus. It once handled 600k message/second per instance
  • Product Gateway is probably a dumb process since we never need to change it. Can handle 550k message/second per instance

I believe TPS throughput, not latency, is the optimization goal. Biggest challenge known to me is the data burst.

Therefore, java GC pause is probably unacceptable. In my hypothesis, after you experience a data surge for a while, you tend to run out of memory and must run GC (like run to the bathroom). But that’s the wrong time to run GC. If the surge continues while you GC runs, then the incoming data would overflow the queue.

c++SCB eFX IV#Dmitry

I feel these are all QQ type, as defined in

Q: Scanning a vector of int (like finding the average or max). Forward iteration vs backward iteration, which one could be faster, considering all possible compiler optimizations.

%%A: forward. Memory read into cpu cache will be in chunks, not one element at a time. Easy for forward iteration. Not sure about backward.

Q: Which one could be fastest:

void f(double arg){…..}
void f(double & arg){….}

%%A: inlining for first but not 2nd?
A: See esp. the long answer.

Q: Thr1 and Thr2 on 2 CPU’s both update an object s, having 2 fields. Thr1 only updats s.field1. Thr2 only updates s.field2. No interference. No synchronization required. We observe the performance is slower than using one thread to update both fields. Any explanation?

Q: weak_ptr justification, when we have shared_ptr already? I feel [[effectiveModernC++]] has a good chapter on it.

Ashish pointed out in some apps, you could identify a clear risk of circular dependency. Replace with weak_ptr.

Q: given an 2D array arr[10][5], how do you use pointer arithmetic to hit arr[1][5]

A: Contiguous. see Note this is different from an array of pointers.

Q: what would you see if a TCP socket server has a full queue
%%A: TCP requires ack, so if server is unable to accept a request the client would know it.

Q: what STL algorithms did you use?
%%A: foreach(), find(), copy_if(), transform(), reverse()

inline getter methods

  • In C/C++, if you have a simple getter on a private field, you would typically request compiler to inline it, thus eliminating the function call run-time overhead.
  • In c#, the “property” getter is often inlined, but there’s no such requirement on the compiler.
  • Java getters can get inlined too, esp. if the field is final.

subverting kernel’s resource-allocation – a few eg.

[[Linux sys programming]] explains several techniques to subvert OS resource-allocation decisions. Relevant to performance-critical apps.

P275 mlockall(). A Real time app could benefit from locking its entire memory pages into physical RAM, i.e. disable swapping and eliminate
page faults.

P173 sched_setaffinity(). A strongly cache-sensitive app could benefit from hard affinity (stronger than the default soft affinity), which prevents the kernel scheduler migrating the process to another

[[The Linux Programmer’s Toolbox]]. A write-only variable will be removed by the compiler’s optimizer, but such a variable could be useful to debugger. I read somewhere that you can mark it volatile — subversive.

Any way to prevent “my” data or instruction leaving the L1/L2 cache?

Any way to stay “permantly” in the driver’s seat, subverting the thread scheduler’s time-slicing?