How we achieved 400 to 700 kmps #reinterpret_cast

  • ! Many threads but each in single threaded mode. No shared mutable.
  • ! allocation avoided — on millions of Output message objects. Pre-allocated ring buffer eliminates new()
  • ! allocation avoided — on millions of Input message objects, thanks to reinterpret_cast() on pointers. (Allocation would have returned a new address!)
  • ! Stateless Parser. Downstream component (Rebus) handles cancels, modifications..
  • Hot fictions use pbref or move, minimizing stack allocations
  • Socket tuning
  • Output list (of messages) buffering
  • aggressively simplified parsing. Minimal logic
  • large containers are pre sized. No reallocation
  • Very fast duplicate seq check — a hot function
  • Special Logger to reduce I/O cost
  • Sharding to speed up symbol lookup
  • No virtual functions
  • Inline
  • —-Speed with control
  • Best of both to reduce the real risk of our connection becoming slow or lossy
  • Depth monitor to detect missed messages
  • Capacity tracker to monitor mps rates
  • Missed seq reporting

conflation ] stream`mkt data: dilemma

I have hit this same question twice — Q: in a streaming price feed, you get IBM prices in the queue but you don’t want the consumer thread to use those outdated prices. Your consumer also needs a history of the prices.

I see two conflicting requirements by the interviewer. I will point out to the interviewer this conflict.

  1. if full tick history is important, then the consumer have to /process/ every tick, even if outdated. We can have dedicated system just to record ticks, with non-optimal latency. For example, Rebus receives every tick, saves it and sends it out without conflation.
  2. If your algo engine needs to react to opportunities with minimal latency, then it can’t afford to care about history. It must ignore history. I will focus on this requirement.
  3. Combining the two, if your algo engine needs to analyze tick-by-tick history and also react to the opportunities, then the producer thread alone has to do all the leg work but I don’t know if it can be fast enough. In general, the fastest data processing system is single-threaded without queues and minimal interaction with other data stores. FIFO Queue implies latency. Anyway, I will not talk about this stringent “combo” requirement. says “Many firms mitigate the data they consume through the use of simple time conflation. These firms throw data on the floor based solely on the time that data arrived.”

In the Wells interview, I proposed a design that bypasses the tick queue. The producer simply updates a “notice board” with latest prices for each of 999 tickers. Registered consumers get notified to re-read the notice board[1], on some messaging thread. Async design has a latency. I don’t know how tolerable that is. I feel async and MOM are popular and tolerable in algo trading. I should check my book [[all about HFT]]…

However, the HSBC manager (Brian?) seems to imply that for minimum latency, the producer thread must drive the algo all the way and send order out to exchange in one big function. Since the producer thread is also the consumer thread for the same message, there’s no conflation. Every tick is consumed! I am not sure about the scalability of this synchronous design.

Two market-leading investment bank gateways actually publish periodic updates regardless how many raw input messages hit it. Not event-driven and not monitoring every tick!

  • Barclays eq options real time vol publisher
  • BofA Stirt Sprite publishes short-term yield curves on the G10 currencies.

[1] The notification should not contain the latest price. Doing so defeats conflation and puts us back to a queuing system.

fastest container: array of POD #or pre-sized vector

relevant to low-latency market data.

  • raw array is the most memory efficient; vector is very close, but need to avoid reallocation
  • std::array is less popular but should offer similar performance to vector
  • all other containers are slower, with bigger footprint
  • For high-performance, avoid container of node/pointer — Cache affinity loves contiguous memory. After accessing 1st element, then accessing 2nd element is likely a cache-hit
    • set/map, linked list suffer the same

C for latency^TPS can use java

I’m 98% confident — low latency favors C/C++ over java [1]. FPGA is _possibly_ even faster.

I’m 80% confident — throughput (in real time data processing) is achievable in C, java, optimized python (Facebook?), optimized php (Yahoo?) or even a batch program. When you need to scale out, Java seems the #1 popular choice as of 2017. Most of the big data solutions seem to put java as the first among equals.

In the “max throughput” context, I believe the critical java code path is optimized to the same efficiency as C. JIT can achieve that. A python and php module can achieve that, perhaps using native extensions.

[1] Actually, java bytecode can run faster than compiled C code (See my other posts such as

MOM+threading ] low latency apps

Piroz told me that trading IT job interviews tend to emphasize multi-threading and MOM. Some use SQL too. I now feel all of these are unwelcome in low latency trading.

A) MOM – The HSBC interviewer was the first to point out to me that MOM adds latency. Their goal is to get the (market) data from producer to consumer as quickly as possible, with minimum stops in between.

Then I found that the ICE/IDC systems has no middleware between feed parser and order book engine (named Rebus). 29West documentation echos “Instead of implementing special messaging servers and daemons to receive and re-transmit messages, Ultra Messaging routes messages primarily with the network infrastructure at wire speed. Placing little or nothing in between the sender and receiver is an important and unique design principle of Ultra Messaging.

B) threading – ST is generally the fastest in theory and in practice. (I only have a small sample size) I feel the fastest trading engines are ST. No shared mutable.

MT is OK if they don’t compete for resources like CPU, I/O or locks. Compared to ST, most lockfree systems introduce latency like retries.

C) SQL – as stated elsewhere, flat files are much faster than relational DB. How about in-memory relational DB?

Rebus, the order book engine, is in-memory.

realistic TPS for exchange mkt data gateway: below 1M/sec

  • Bombay’s BSE infrastructure (including matching engine) can execute 1 mil trades/second. Must maintain order books, therefore NOT stateless.
  • Rebus maintains full order books for each symbol, can handle 300k (in some cases 700k) messages per second per instance. Uses AVL tree, which beat all other data structure in tests.
  • The Nasdaq feed parser (in c++) is stateless and probably single-threaded. Very limited parsing logic compared to Rebus. It once handled 600k message/second per instance
  • Product Gateway is probably a dumb process since we never need to change it. Can handle 550k message/second per instance

I believe TPS throughput, not latency, is the optimization goal. Biggest challenge known to me is the data burst.

Therefore, java GC pause is probably unacceptable. In my hypothesis, after you experience a data surge for a while, you tend to run out of memory and must run GC (like run to the bathroom). But that’s the wrong time to run GC. If the surge continues while you GC runs, then the incoming data would overflow the queue.

c++SCB eFX IV#Dmitry

100% QQ type, as defined in feel many are micro optimizations with questionable improvement. I wonder how much value such obscure knowledge adds to the team.

Q: Scanning a vector of int (like finding the average or max). Forward iteration vs backward iteration, which one could be faster, considering all possible compiler optimizations.

%%A: forward. Memory read into cpu cache will be in chunks, not one element at a time. Easy for forward iteration. Not sure about backward.

Q: Which one could be fastest:

void f(double arg){…..}
void f(double & arg){….}

%%A: inlining for first but not 2nd?
A: See esp. the long answer.

Q: Thr1 and Thr2 on 2 CPU’s both update an object s, having 2 fields. Thr1 only updates s.field1. Thr2 only updates s.field2. No interference. No synchronization required. We observe the performance is slower than using one thread to update both fields. Any explanation?
%%A: caching in cpu

Q: weak_ptr justification, when we have shared_ptr already? I feel [[effectiveModernC++]] has a good chapter on it.

Ashish pointed out in some apps, you could identify a clear risk of circular dependency. Replace with weak_ptr.

Q: given an 2D array arr[10][5], how do you use pointer arithmetic to hit arr[1][5]

A: Contiguous. see Note this is different from an array of pointers.

Q: what would you see if a TCP socket server has a full queue
%%A: TCP requires handshake, so if server is unable to accept a request the client would know it.
%%A: connection refused?

Q: what STL algorithms did you use?
%%A: foreach(), find(), copy_if(), transform(), reverse(), sort(), replace_if, remov_if