pre-allocate DTOs@SOD #HFT #RTS

example — RTS pre-allocates outgoing message objects from a ring buffer’s head, and “returns” to the ring buffer at the tail… See How RTS achieved 400-700 KMPS #epoll

example — Sell-side HFT OMS also uses pre-allocation. Suppose for every new order there are 3 new DataTransferObjects A/B/C to be instantiated on heap. Traditional approach would make 3 allocation requests in real time. I think the free-list manager becomes a hotspot, even if there’s a per-thread free list.

Basically HFT avoids new/malloc after market opens. RTS uses mostly arrays, and very few (rather small) STL containers. Those STL containers tend to be populated before market opens and remain static.

Pre-allocation is a popular technique. We compute at compile time the sizes of A/B/C based on their class declarations. For DTO class A, sizeof(A) just adds up the non-static data field sizes. Then we estimate how many orders we will get a day (say 7 million). Then we pre-allocate 7 million A objects in an array. The allocation happens at start-up, though the sizes are compile-time constants.

When an order comes in, the A/B/C DTO objects are already allocated but empty.

Byte-array is an alternative, but this smells like the raw free list to me…


consolidate into single-process: low-latency OMS

Example #2– In a traditional sell-side OMS, an client FIX order propagates through at least 3 machines in a chain —

  1. client order gateway
  2. main OMS engine
  3. exchange gateway such as smart order router or benchmark execution engine supporting VWAP etc

The faster version consolidates all-of-the-above into a single Process, cutting latency from 2ms to 150 micros .. latency in eq OMS

Example #1– In 2009 I read about or heard from interviewers about single-JVM designs to replace multi-stage architecture.

Q: why is this technique not used on west coast or main street ?
%%A: I feel on west coast throughput outweighs latency. So scale-out is the hot favorite. Single-JVM is scale-up.

10 μs Additional latency: collocated eq OMS

  • Many organizations “are using the words ultra low latency to describe latencies of under 1 millsec” [1]
  • 13 microsec in collocated eq OMS
  • 150 microsec “single-trip” latency in similar software outside collocation site, measured by Corvil, from A to B
    • Time A: FIX msg coming into our engine
    • Time B: FIX msg going out of our engine
    • 150 μs is median, not average
    • Corvial is (most likely) a TCP network sniffer with FIX parser so it can track a single order flow
  • 2 millis in a “regular” build


Treasury trading doesn’t need such low latency.

HFT mktData redistribution via MOM #real world

Several low-latency practitioners say MOM is unwelcome due to added latency:

  1. The HSBC hiring manager Brian R was the first to point out to me that MOM adds latency. Their goal is to get the raw (market) data from producer to consumer as quickly as possible, with minimum hops in between.
  2. 29West documentation echoes “Instead of implementing special messaging servers and daemons to receive and re-transmit messages, Ultra Messaging routes messages primarily with the network infrastructure at wire speed. Placing little or nothing in between the sender and receiver is an important and unique design principle of Ultra Messaging.” However, the UM system itself is an additional hop, right? Contrast a design where sender directly sends to receiver via multicast.
  3. Then I found that the RTS systems (not ultra-low-latency ) have no middle-ware between feed parser and order book engine (named Rebus).

However, HFT doesn’t always avoid MOM.

  • P143 [[all about HFT]] published 2010 says an HFT such as Citadel often subscribes to both individual stock exchanges and CTS/CQS [1], and multicasts the market data for internal components of the HFT platform. This design has additional buffers inherently. The first layer receives raw external data via a socket buffer. The 2nd layer components would receive the multicast data via their socket buffers.
  • SCB eFX system is very competitive in latency, with Solarflare NIC + kernel bypass etc. They still use Solace MOM!
  • Lehman’s market data is re-distributed over tibco RV, in FIX format.

[1] one key justification to subscribe redundant feeds — CTS/CQS may deliver a tick message faster than direct feed!

p2p messaging beats MOM ] low-latency trading

example — RTS exchange feed dissemination infrastructure uses raw TCP and UDP sockets and no MOM

example — the biggest sell-side equity OMS network uses MOM only for minor things (eg?). No MOM for market data. No MOM carrying FIX order messages. Between OMS nodes on the network, FIX over TCP is used

I read and recorded the same technique in 2009… in this blog

Q: why is this technique not used on west coast or main street ?
%%A: I feel on west coast throughput outweighs latency. MOM enhances throughput.

mkt data conflation: design question

I have hit this same question twice — Q: in a streaming price feed, you get IBM prices in the queue but you don’t want consumer thread AA to use “outdated” prices. Consumer BB needs a full history of the prices.

I see two conflicting requirements by the interviewer. I will point out to the interviewer this conflict.

I see two channels — in-band + out-of-band needed.

  1. in-band only — if full tick history is important, then the consumers have to /process/ every tick, even if outdated. We can have dedicated systems just to record ticks, with latency. For example, Rebus receives every tick, saves it and sends it out without conflation.
  2. out-of-band — If your algo engine needs to catch opportunities at minimal latency, then it can’t afford to care about history. It must ignore history. I will focus on this requirement.
  3. dual-band — Combining the two, if your super-algo-engine needs to analyze tick-by-tick history and also react to the opportunities, then the “producer” thread alone has to do all work till order transmission, but I don’t know if it can be fast enough. In general, the fastest data processing system is single-threaded without queues and minimal interaction with other data stores. Since the producer thread is also the consumer thread for the same message, there’s no conflation. Every tick is consumed! I am not sure about the scalability of this synchronous design. FIFO Queue implies latency. Anyway, I will not talk further about this stringent “combo” requirement. says “Many firms mitigate the data they consume through the use of simple time conflation. These firms throw data on the floor based solely on the time that data arrived.”

In the Wells interview, I proposed a two-channel design. The producer simply updates a “notice board” with latest prices for each of 999 tickers. Registered consumers get notified out-of-band to re-read the notice board[1], on some messaging thread. Async design has a latency. I don’t know how tolerable that is. I feel async and MOM are popular and tolerable in algo trading. I should check my book [[all about HFT]]…

In-band only — However, the HSBC manager (Brian?) seems to imply that for minimum latency, the socket reader thread must run the algo all the way and send order out to exchange in one big function.

Out-of-band only — For slightly slower markets, two market-leading investment bank gateways actually publish periodic updates regardless how many raw input messages hit it. Not event-driven, not monitoring every tick!

  • Lehman eq options real time vol publisher
  • BofA Stirt Sprite publishes short-term yield curves on the G10 currencies.
    • Not EURUSD spot prices

[1] The notification can only contain indicative price numbers and serve to invite clients to check the notice board. If clients rely solely on the indicative, it defeats conflation and brings us back to a FIFO design.

c++low latency phone IV #MS-shanghai

I think this is all QQ… i.e theoretical knowledge. Not really harder than the mvea interview. Easier than the CVA interview.

Q: why we should not throw exception from dtor? Why do you say sometimes you can break the rule? See throwing dtor: %%justified use cases
%%A: if i have a utility routine that may throw, and I want to use it in my dtor, we should assess the danger. If we know entire process should crash under this condition, whether this dtor throws exception or swallow it, then it’s fine to use that routine without try/catch. Even without double-exception situation, the default behavior is std::terminate()

Q: divide by zero — is that an exception?
%%A: yes in java but UB in c++
AA: UB in c++

Q: is there something you can do in C but can’t in c++?
%%A: I guess you mean “C code uncompilable under c++ compiler”? I don’t think so.
A: “include <stdio>” won’t compile in g++
%%A: in pure-C environment if you receive a pointer at run time and it’s known to point to heap, you can call free() but in a C++ environment, that pointer may be created by new/array-new so free() is problematic.
A: hints that if a C source code has no function prototype, then c++ compiler will complain.

Q: at what threshold of latency requirement would you switch from java to c++?
%%A: 100 μs. but I think nowadays java can rival c++.

Q: singleton — how would you implement
%%A: no friend class; declare but not define copier/op=. Provide a static method getInstance

Q: size of an empty c++ class’s instance
%%A: 1 byte (Correct !

Q: diff between struct and class
Q: can you implement inheritance and polymorphism in C?
%%A: yes. Most c++ features (until exception) used to be converted to C source code.
Q: what’s complete vs partial specialization of template? (Jargon question)
Q: what happens to stack when an exception is thrown?

Q: what’s your reaction when you see “delete this” in a program?
A: After that, need to be careful not to reference anything in the class, so as to avoid dereferencing a dangling pointer.


low-latency trade execution in a bond mkt-maker – suggestions

(I think dealership trading desks include most bonds, IRS, FX options, FX cash, CDS, OTC stock options…)

After saving the trade into database, all post-trade messages can be delegated to a task queue for a background thread (like swing event queue). These messages inform downstream[1] systems , so they can update their GUI [2] and database in real time. These post-trade messages aren’t can’t really fail (if they fail we just have to resend). These are one-way messages. So they don’t need to be on the critical path.

[1] could be internal modules of the same trading engine or external systems owned by other teams like commissions:). Either way, such a downstream always runs in a separate process or machine.
[2] Note these GUI aren’t web pages, but WPF, Swing or similar auto-updating screen.

Q: Must position be updated in the critical path, before we process the next trade?
A: For a dealer, yes I think so. If Trade 1 depletes inventory, then we must reject next trader-sell.

4 infrastructure features@Millennium

 swing trader station + OMS on the server-side + smart order router over low-latency connectivity layer

* gemfire distributed cache. why not DB? latency too high.
* tibrv is the primary MOM
* between internal systems — FIX based protocol over tibrv, just like Lehman equities. Compare to protobuf object serialization
* there’s more advanced math in risk system; but the highest latency requirements are on the eq front office systems.

10 (random) arch features of HFT

When talking to low-latency shops, i realize the focus shifts from pricing, trade booking, position mgmt … to market data, message formatting and sockets – rather low-level stuff. A high-frequency trading engine has many special features at architectural and impl levels, but here i will focus on some important architectural features that make a difference. By the way, my current system happens to show many of these features.

1) message-driven, often using RV or derivatives. Most trading signals come in as market data, tick data, benchmark shifts, position adjustments (by other traders of own own bank). Among these, I feel market data poses the biggest challenge from the latency perspective.
2) huge (reluctantly distributed – see other post) cache to minimize database access
) judicious use of async and sync IPC, if one-big-machine is undesirable.
3) optimized socket layer, often in C rather than c++. No object-orientation needed here:)
) server collocation
) large number of small orders to enable fine-grained timing/cancel and avoid disrupting market
) market data gateway instantiates a large number of small objects
) smart order router, since an order can often execute on multiple liquidity venues

Beyond the key features, I guess there’s often a requirement to immediately change a parameter in the runtime rather than updating a database and waiting for the change to be noticed by the runtime. I feel messaging is one option, and RMI/JMX is another.

DMA low latency interview #European bank Midtown

Q: Given a 32-core machine and a lot of CPU-bound calculations, how would you size your thread pool?
A: First, find out how many kernel threads — probably 32 * 2 or 4. i feel if tasks are IO intensive like waiting for a pricing or risk assessment from JMS, then a lot of threads would be waiting for IO. If each thread spends 90% of its time waiting for IO, then each cpu should handle 10 threads roughly. This way, each thread after getting the data needed won’t need to wait to get CPU.

The below questions were probably not prepared in advance.

Q1: for a one-direction linked list of size N, efficiently find the item at Position floor(N/2). Positions are 1 to N. You don’t know the size until you finish scanning.

Q2: for a strictly sorted binary tree of numbers, I give you a pointer to the root node and 2 arbitrary numbers in the tree, find the lowest common ancestor node. The only entry point is the root node address. The 2 input numbers are passed by value and guaranteed to exist in the tree. Note for this binary tree, any node on my left are less than me.

Q3: efficiently reshuffle a PokerCard[52] array. You can call a rand(int max) function to get a random number between 0 and max.
Tip: avoid “retry” as that wastes CPU
A: i think we need to call rand(51), rand(50), rand(49)….

Treasury low-latency trading#Barc

(Background: most low-latency trading systems are related to eq or FX…)

Bloomberg and tradeweb allow an institutional client to send RFQ to her favourite Treasury dealers. The dealer’s auto quoter is typically low-latency. Typical latency [from receiving RFQ at gateway, to sending out quote from gateway] is typically 100~200ms. 500ms would be too high.

For an interdealer broker (espeed/brokertec), 30ms is good. 60ms will arouse complaints from hedge fund clients. Perhaps 30ms from receiving order to sending out acknowledgement? Distributed cache? No need. Just a regular database cache is enough. Coherence is being considered though.

Order matching engine is the heart of a treasury ECN operator. An ECN for muni bond or corporate bond may have a very different algorithm.

c++^realtimeJVM for trading

Java might outperform c++ sometimes, but java performance is not consistent or controllable due to GC jitters — #1 motivation behind real time jvm.

IBM realtime jvm limits GC frequency so the GC may struggle to keep up with the demand, and heap gets bigger than in Sun jvm.

Sun realtime jvm requires some code change.

Some people feel that to benefit from real time jvm, you must code very carefully … why not use c++. C++ is the incumbent in low-latency systems.

Market-facing gateways — 1) order execution (FIX or custom), and 2) market data feed — still use c++ primarily for another reason – gateway API is often in c++.