array of std::byte^unsigned-char as “byte array”

Background: Suppose you need to store or transfer arbitrary binary data.

In RTS, we used a char array. Some say we should use unsigned char. It is the only data type that is guaranteed (by the ANSI C Standard) to have no padding bits, so every bit in an unsigned char (CHAR_BIT of them, 8 on mainstream platforms) contributes to the value. I think std::uint8_t is similar but less common.

Contrary to some online posts, unsigned char is a different type from plain “char”. In C++, char, signed char and unsigned char are three distinct types.

In C++17, std::byte is probably preferred because only the bitwise operations are defined. I believe you can reinterpret_cast a pointer to std::byte*. Personally, I used none of the above; I used placement-new 🙂
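To make the std::byte point concrete, here is a minimal C++17 sketch. The helper names asBytes and highNibble are my own inventions for illustration, not from any library. It views an existing unsigned-char buffer as std::byte and shows that arithmetic requires an explicit std::to_integer conversion.

```cpp
#include <cstddef>  // std::byte, std::to_integer (C++17)

// View an existing unsigned-char buffer as std::byte without copying.
// (asBytes is a hypothetical helper name.)
inline const std::byte* asBytes(const unsigned char* p) {
    return reinterpret_cast<const std::byte*>(p);
}

// Only bitwise operators are defined on std::byte, so we shift and mask
// first, then convert explicitly to an integer type.
inline int highNibble(std::byte b) {
    return std::to_integer<int>((b >> 4) & std::byte{0x0F});
}
```

Something like `b + 1` would not compile for a std::byte, which is exactly the safety the type is meant to give.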

In java we use the primitive “byte” type — an 8-bit signed integer.

retrans questions from IV+ from me

Q11 (2 x IV): do you track gaps for Line A and also track gaps in Line B?
A: No, according to Deepak and Shanyou. Deepak said we treat the seq from Line A/B as one combined sequence (with duplicates) and track gaps therein

Q2 (IV x 2): Taking parser + orderbook (i.e. rebus) as a single black box, when you notice a gap (say seq #55/56/57), do you continue to process seq # 58, 59 … or do you warehouse these #58, 59… messages and wait until you get the resent #55/56/57 messages?
A: no warehousing. We process #58 right away.

Q2b (IV): In your orderbook engine (like Rebus), suppose you get a bunch of order delete/exec/modify messages, but the orderId is unrecognized and possibly pending retrans. Rebus doesn’t know about any pending retrans. What would rebus do about those messages?
%%A: I don’t know the actual design [3], but if I were the architect I would always check the orderId. If orderId is unknown then I warehouse the message. If it is a known order Id in Rebus, I will apply the message on the order book. Risks? I can’t think of any.

[3] It’s important to avoid stating false facts, so I will add the disclaimer.

Q2c (IV): what data structures would you use to warehouse those pending messages? ( I guess this question implies warehousing is needed.)
%%A: a linked list would do. Duplicate seqNum check is taken care of by parser.

Q13 (IV): do you immediately send a retrans request every time you see a gap like (1-54, then 58,59…)? Or do you wait a while?
A: I think we do need to wait since UDP can deliver #55 out of sequence. Note Line A+B are tracked as a combined stream.

Q13b: But how long do we wait?
A: 5 ms according to Deepak

Q13c: how do you keep a timer for every gap identified?
%%A: I think we could attach a timestamp to each gap.

— The above questions were probably the most important questions in a non-tech interview. In other words, if an interview has no coding no QQ then most of the questions would be simpler than these retrans questions ! These questions test your in-depth understanding of a standard mkt data feed parser design. 3rd type of domain knowledge.

Q: after you detect a gap, what does your parser do?
A (Deepak): parser saves the gap and moves on. After a configured timeout, parser sends out the retrans request. Parser monitors messages on both Line A and B.

Q: if you go on without halting the parser, then how would the rebus cope?

  • A: if we are missing the addOrder, then rebus could warehouse all subsequent messages about unknown order IDs. Ditto for a Level 1 trade msg.

Deepak felt this warehouse could build up quickly since the ever-increasing permanent gaps could contain tens of thousands of missing sequence numbers. I feel orderId values are increasing and never reused within a day, so we can check if an “unknown” orderId is very low and immediately discard it, assuming the addOrder is permanently lost in a permanent gap.

  • A: if we are missing an order cancel (or trade cancel), i.e. the last event in the life cycle, then we don’t need to do anything special. When the Out-of-sequence message shows up, we just apply it to our internal state and send it to downstream with the OOS flag.

If an order cancel is lost permanently, we could get a crossed order book. After a few refreshes (15min interval), the system would discard stale orders sitting atop a crossed book.

In general, crossed book can be fixed via the snapshot feed. If not available in integrated feed, then available in the open-book feed.

  • A: If we are missing some intermediate msg like a partial fill, then we won’t notice it. I think we just proceed. The impact is smaller than in FIX.

OOS messages are often processed at the next refresh time.
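The warehousing idea for unknown order IDs can be sketched as below. This is only my guess at a workable structure, not the actual Rebus design. The names Warehouse, OrderMsg, setDiscardBelow and the one-letter message types are all hypothetical. I use a hash map keyed by orderId instead of the linked list I suggested in the interview, because lookup by orderId is the dominant operation.

```cpp
#include <cstddef>
#include <cstdint>
#include <unordered_map>
#include <unordered_set>
#include <vector>

// Hypothetical message: 'A' = addOrder, 'M' = modify, 'D' = delete.
struct OrderMsg {
    std::uint64_t orderId;
    char type;
};

class Warehouse {
public:
    // Returns the messages that can be applied to the book right now, in order.
    std::vector<OrderMsg> onMessage(const OrderMsg& m) {
        std::vector<OrderMsg> ready;
        if (m.type == 'A') {
            known_.insert(m.orderId);
            ready.push_back(m);
            auto it = parked_.find(m.orderId);
            if (it != parked_.end()) {  // replay whatever was waiting on this add
                ready.insert(ready.end(), it->second.begin(), it->second.end());
                parked_.erase(it);
            }
        } else if (known_.count(m.orderId) != 0) {
            ready.push_back(m);                  // known order: apply immediately
        } else if (m.orderId >= discardBelow_) {
            parked_[m.orderId].push_back(m);     // unknown: warehouse it
        }
        // else: orderId is very low, assume the addOrder is permanently lost; drop
        return ready;
    }

    // Assumption: orderIds increase and are never reused within a day.
    void setDiscardBelow(std::uint64_t id) { discardBelow_ = id; }
    std::size_t parkedOrders() const { return parked_.size(); }

private:
    std::unordered_set<std::uint64_t> known_;
    std::unordered_map<std::uint64_t, std::vector<OrderMsg>> parked_;
    std::uint64_t discardBelow_ = 0;
};
```

The discard threshold is what keeps the warehouse from growing without bound when gaps become permanent.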

Q3b: But how long do we wait before requesting retrans?
Q3c: how do you keep a timer for every gap identified?

Q14: after you send a retrans request but gets no data back, how soon do you resend the same request again?
A: a few millis, according to Deepak, I think.

Q14b: Do you maintain a timer for every gap?
%%A: I think my timestamp idea will help.
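Here is a minimal sketch of the "timestamp per gap" idea, under my own assumptions. The class name GapTracker and its methods are hypothetical, and the real parser may do this quite differently. It tracks the combined Line A + B sequence, remembers when each missing sequence number was first noticed, and reports the ones whose wait has expired. Reporting restarts the clock, so the same call also covers re-sending a request that got no answer.

```cpp
#include <chrono>
#include <cstddef>
#include <cstdint>
#include <map>
#include <vector>

using Clock = std::chrono::steady_clock;

class GapTracker {
public:
    explicit GapTracker(std::chrono::milliseconds wait) : wait_(wait) {}

    // Feed every sequence number seen on the combined Line A + B stream.
    void onSeq(std::uint64_t seq, Clock::time_point now) {
        if (seq >= next_) {
            for (std::uint64_t s = next_; s < seq; ++s) {
                missing_[s] = now;       // new gap: stamp each missing number
            }
            next_ = seq + 1;
        } else {
            missing_.erase(seq);         // late arrival fills a gap; dups are no-ops
        }
    }

    // Missing numbers that have waited long enough. The caller sends the
    // retrans request; the timestamp is reset so a later call re-requests.
    std::vector<std::uint64_t> due(Clock::time_point now) {
        std::vector<std::uint64_t> out;
        for (auto& kv : missing_) {
            if (now - kv.second >= wait_) {
                out.push_back(kv.first);
                kv.second = now;
            }
        }
        return out;
    }

    std::size_t pending() const { return missing_.size(); }

private:
    std::chrono::milliseconds wait_;
    std::uint64_t next_ = 1;             // assumption: sequence starts at 1
    std::map<std::uint64_t, Clock::time_point> missing_;
};
```

No per-gap timer object is needed. The single parser thread can call due() between messages, which fits the single-threaded design.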

Q: You said the retrans processing in your parser shares the same thread as regular (main publisher) message processing. What if the publisher stream is very busy so the gaps are neglected? In other words, the thread is overloaded by the publisher stream.
%%A: We accept this risk. I think this seldom happens. The exchange uses sharding to ensure each stream is never overloaded.

embed char_array ] my java object #XR

With market data it’s common to use some Message class(s) that “embeds” a fixed-length character array like 20-char for example.

Allocating an array object off-site on heap is very costly in memory footprint. One extra allocation per Message.

Also slower reading at run time due to data-cache inefficiency. Data cache favors contiguous data structures. See CPU(data)cache prefetching

c/c++ and c# (via Struct) can easily support exactly this “embedding”. I feel java also has some support. Besides JNI, I wonder if there’s another, pure-java solution.

Q: in java, how can I have embedded fixed-length char-array field in my Message or Acct object, rather than a separate array object allocated somewhere off-site?

  1. Solution: If the fixed length is small like 10, I could maintain 10 individual char fields.
  2. Solution: assuming the chars are ascii (8-bit rather than 16-bit in java), I can group first eight chars into a 64-bit long int field. Provide a translation interface when reading/writing the field. With 10 such fields I can support 80-char embedded.
  3. Solution: If not possible, I would use a gigantic singleton off-site char array to hold fixed-length “segments”. Then I need a single int “position”. Every Acct object has a field this.position, where this.position * fixedLength = offset, to identify one segment.
  4. There are two published solutions described in ringBuffer@pre_allocated objects to preempt JGC

Among them, not sure which solution is fastest in practice.
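Solution 2 (eight ASCII chars packed into one 64-bit integer) can be sketched as follows. I write it in C++ to stay consistent with the other examples in these notes; a Java version would use a long field with the same shifts. The function names pack8 and unpack8 are hypothetical.

```cpp
#include <cstddef>
#include <cstdint>
#include <string>

// Pack up to 8 ASCII chars into one 64-bit integer, first char in the
// most significant byte. Shorter strings are padded with zero bytes.
inline std::uint64_t pack8(const std::string& s) {
    std::uint64_t v = 0;
    for (std::size_t i = 0; i < 8; ++i) {
        const unsigned char c =
            i < s.size() ? static_cast<unsigned char>(s[i]) : 0;
        v = (v << 8) | c;
    }
    return v;
}

// Reverse translation: stop at the first zero padding byte.
inline std::string unpack8(std::uint64_t v) {
    std::string s;
    for (int shift = 56; shift >= 0; shift -= 8) {
        const char c = static_cast<char>((v >> shift) & 0xFF);
        if (c == '\0') break;
        s.push_back(c);
    }
    return s;
}
```

Putting the first char in the top byte is a design choice on my part: comparing two packed integers then gives the same order as comparing the original strings.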

ringBuffer@pre_allocated objects to preempt JGC

Goal — to eliminate JGC completely.

Design 1: I will want to use primitive fields only and avoid reference fields [1] at all cost, so the total footprint of an Order is known in advance. Say it’s 100 bytes. I will create 10M of dummy Order instances, possibly scattered in heap, not adjacent as in c++, and hold their 10M addresses in an Order array… about 1GB footprint for the Order objects + 80M footprint for the array of 8-byte pointers.

(Note I reuse these Order instances in this object pool and never let them get garbage-collected.)

Then I need a few subscripts to identify the “activeRegion” of the ring but how about released slots enclosed therein?

[1] timestamps will be ints; symbolIDs and clientIDs are ints; short ascii strings will use 64-bit ints (8 characters/int); free-form strings must be allocated off-site:(

Design 2a: To avoid the “scatter” and to place the Order instances side by side, Can we use a serialized byte[100] array object to represent one Order? Can we use one gigantic off-heap byte array to hold all Orders, eliminating the 80M footprint? See java off-heap memory

Design 2b: shows a contiguous array of java objects, like std::vector<MyObject>

Design 2c: is a feature in IBM jvm

Ring buffer is good if the object lifetimes are roughly equal, giving us FIFO phenomenon. This occurs naturally in market data or message passing gateways. Otherwise, we may need a linked list (free list) of released slots in addition to a pair of subscript to identify the active region.
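The "free list of released slots" variant can be sketched like this. It is a C++ illustration of the pooling idea only; in Java the slots would be pre-created Order instances held in an array. The names OrderPool, acquire and release, and the three Order fields, are hypothetical.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Primitive fields only, so the footprint per Order is fixed and known.
struct Order {
    std::int32_t symbolId;
    std::int32_t clientId;
    std::int64_t qty;
};

class OrderPool {
public:
    // All slots are allocated once, up front; nothing is allocated later.
    explicit OrderPool(std::size_t n) : slots_(n) {
        free_.reserve(n);
        for (std::size_t i = n; i > 0; --i) free_.push_back(i - 1);
    }

    // Returns nullptr when the pool is exhausted.
    Order* acquire() {
        if (free_.empty()) return nullptr;
        const std::size_t i = free_.back();
        free_.pop_back();
        return &slots_[i];
    }

    // Released slots go back on the free list in any order (no FIFO needed).
    void release(Order* o) {
        free_.push_back(static_cast<std::size_t>(o - slots_.data()));
    }

    std::size_t available() const { return free_.size(); }

private:
    std::vector<Order> slots_;        // contiguous storage
    std::vector<std::size_t> free_;   // indices of released slots
};
```

If lifetimes really are FIFO, the free list degenerates into the pair of subscripts described above.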

It might be better to allocate a dedicated buffer for each thread, to avoid contention. Drawback? One buffer may get exhausted when another stays unused.

RTS feed for Internet clients #Mark #50%

Q: Based on the RTS market data dissemination system, what if some clients subscribe by a slow Internet connection and your orderbook (TCP) feed needs to provide delta updates each time a client reconnects?

Default solution: similar to FIX/TCP .. sender to maintain per-client state. Kenny of Trecquant said his trading app can be extremely simple if exchange maintains state. I won’t elaborate. Here is my own Solution.

Note on terminology — in multicast there’s no TCP-style “server” . Instead, there’s a sender engine for many receiver clients.

Suppose we have too many clients. To minimize per-client state management, my engine would simply multicast to all clients real time updates + periodic snapshots.

A client AA can request a snapshot on any symbol group, and I will immediately multicast the snapshots on a refresh channel. If client BB never requests anything, BB can ignore the refresh multicast channel.

Request quota — each client like AA can request X free snapshots a day. Beyond that, AA’s requests would be regulated like

  • levied a fee
  • queued with lower priority

It’s feasible to replace the refresh multicast group with a unicast UDP channel per client, but to me the multicast-refresh solution offers clear advantages without major drawbacks.

  1. if there is an outage affecting two clients, each would request the same snapshots, creating unnecessary work on the engine. The request quota would incentivize each client to monitor the refresh group and avoid sending the same request as someone else
  2. multicast can be valuable to clients who like more frequent snapshots than the periodic.
  3. My operations team can also utilize this refresh channel to broadcast unsolicited (FreeOfCharge) snapshots. For this task, I would avoid using the publisher channel as it can be busy sending real time updates. We don’t want to interleave real time and snapshot messages.

mktData direct^Reuter feeds: realistic ibank set-up

NYSE – an ibank would use direct feed to minimize latency.

For some newer exchanges – an ibank would get Reuters feed first, then over a few years replace it with direct feed.

A company often gets duplicate data feed for the most important feeds like NYSE and Nasdaq. The rationale is discussed in [[all about hft]]

For some illiquid FX pairs, they rely on calculated rates from Reuters. BBG also calculates a numerically very similar rate, but BBG is more expensive. BBG also prohibits real time data redistribution within the ibank.

mktData parser: multi-threaded]ST mode #Balaji#mem

Parsers at RTS: a high-volume c++ system I worked on has only one thread. I recall some colleague (Shubin) saying the underlying framework is designed in single-threaded mode, so if a process has two threads they must share no mutable data. In such a context, I wonder what’s better

  1. EACH application having only one thread
  2. Each application has 2 independent threads sharing nothing

I feel (1) is simple to set up. I feel if the executable + reference data footprint is large like 100MB, then (2) would save memory since the 100MB can be shared between threads.

–Answer from a c++ trading veteran
Two independent instances is much simpler time/effort wise. Introducing multiple threads in a single threaded application is not trivial. But I do not see any value if you create multiple threads which do not share any (mutable) data.

–My response:
Executable and reference data take up memory. 1) I feel executable can take up 10MB+, sometimes 100MB. 2) If there’s a lot of immutable reference data loaded from some data store, they can also add megabytes to process footprint. Suppose their combined footprint is 100MB, then two instances would each need 100MB, but the multi-threaded design would need only 100MB in a single instance hosting multiple threads.

My RTS team manager required every market data parser process to reduce footprint below 60MB or explain why. My application there used 200MB+ and I had to find out why.

Apparently 100MB difference is non-trivial in that context.

370,000 MPS isn’t tough for multicast #CSY

370,000 msg/sec isn’t too high. Typical exchange message size is 100 bits, so we are talking about 37 Mbps, less than 0.1% of a 100 Gbps network capacity.

My networking mentor CSY and I both believe it’s entirely possible to host 8 independent threads in a single process to handle the 8 independent message channels. Capacity would be 296 Mbps on a single NIC and single PID.

See also mktData parser: multi-threaded]ST mode #Balaji

I feel a more bandwidth-demanding multicast application is video-on-demand where a single server may need to simultaneously stream different videos.

Q: how about world cup final real-time multicast video streaming to millions of subscribers?
%%A: now I think this is not so demanding, because the number of simultaneous video streams is one

how is mkt data used ] buy-side FI analytics@@

This is a BIG bond asset manager… They use 2-factor HJM model, among others.

They use EOD market data for risk measure + risk sensitivity calculations. No real time.

Models were written by 40+ quants untrained in c++. The 16-strong IT team integrates the models

I asked “Do you use liquid fixed income market data mostly to calibrate models and use the model to price illiquid instruments?”

A: both

  • To calibrate model — every day, as explained in [[complete guide]] P436
  • To derive valuation directly on existing positions if the instruments are comparable (between ref data instrument and position instrument)

retrans: FIX^TCP^xtap

The FIX part is very relevant to real world OMS.. Devil is in the details.

IP layer offers no retrans. UDP doesn’t support retrans.



| | TCP | FIX | xtap |
|---|---|---|---|
| seq# continuous | no | yes.. see seq]FIX | yes |
| ..reset | automatic loopback | managed by application | seldom #exchange decision |
| ..dup | possible | possible | normal under bestOfBoth |
| ..per session | per connection | per clientId | per day |
| ..resumption? | possible if wire gets reconnected quickly | yes upon re-login | unconditional. no choice |
| Ack | positive Ack needed | only needed for order submission etc | not needed |
| gap detection | sophisticated | every gap should be handled immediately since sequence is critical. Out-of-sequence is unacceptable. | gap mgr with timer |
| retrans | sophisticated | receiver(ECN) will issue resend request; original sender to react intelligently | gap mgr with timer |

Note original sender should be careful resending new orders.


churn !! bad ] mktData #socket,FIX,.. unexpected!

I feel the technology churn is remarkably low.

New low-level latency techniques are coming up frequently, but these topics are actually “shallow” and low complexity to the app developer.

  • epoll replacing select()? yes churn, but much less tragic than the stories with swing, perl, struts
  • most of the interview topics are unchanging
  • concurrency? not always needed. If needed, then often fairly simple.

CHANNEL for multicast; TCP has Connection

In NYSE market data lingo, we say “multicast channel”.

  • analogy: TV channel — you can subscribe but can’t connect to it.
  • analogy: Twitter hashtag — you can follow it, but can’t connect to it.

“Multicast connectivity” is barely tolerable but not “connection”. A multicast end system joins or subscribes to a group. You can’t really “connect” to a group as there could be zero or a million different peer systems without a “ring leader” or a representative.

Even for unicast UDP, “connect” is the wrong word as UDP is connectionless.

Saying nonsense like “multicast connection” is an immediate giveaway that the speaker isn’t familiar with UDP or multicast.

reinterpret_cast(zero-copy)^memcpy: raw mktData parsing

Raw market data input comes in as array of unsigned chars. I “reinterpret_cast” it to a pointer-to-TradeMsgStruct before looking up each field inside the struct.

Now I think this is the fastest solution. Zero-cost at runtime.

As an alternative, memcpy is also popular but it requires a bitwise copy. It often requires allocating a tmp variable.
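The two approaches side by side, as a rough sketch. TradeMsg here is a made-up wire layout, not the real TradeMsgStruct, and viewTrade/copyTrade are hypothetical helper names. Note the cast version leans on the packed layout and on the compiler tolerating the aliasing. Strictly speaking, standard C++ only blesses the memcpy version, even though the cast is common practice in feed handlers.

```cpp
#include <cstdint>
#include <cstring>

#pragma pack(push, 1)
struct TradeMsg {             // hypothetical wire layout, no padding
    std::uint32_t seqNum;
    std::uint32_t price;
    std::uint16_t qty;
};
#pragma pack(pop)

// Zero-copy: reinterpret the raw bytes in place and read fields directly.
inline const TradeMsg* viewTrade(const unsigned char* buf) {
    return reinterpret_cast<const TradeMsg*>(buf);
}

// Copying alternative: bitwise copy into a temporary, then read from it.
inline TradeMsg copyTrade(const unsigned char* buf) {
    TradeMsg tmp;
    std::memcpy(&tmp, buf, sizeof tmp);
    return tmp;
}
```

Both assume the sender and receiver agree on byte order; a real parser would have to deal with endianness per exchange spec.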

One ISIN map to multiple symbols across exchanges

Previously, a single company could have many different ticker symbols as they varied between the dozens of individual stock markets.

Today, Daimler AG stock trades on twenty-two different stock exchanges, and is priced in five different currencies; it has the same ISIN on each (DE0007100000), though not the same ticker symbol.

–For trade identification

In this case, ISIN cannot specify a particular trade, and another identifier (typically the three- or four-letter exchange code such as the Market Identifier Code) will have to be specified in addition to the ISIN.

–For price identification

The general public would want the price for an ISIN, while traders may want the price for a ticker symbol on one (or a few) liquidity venues.

symbol ticker (esp. short ones) are recycled

Symbol ticker is typically 1 to 4 chars, though numbers are often used in Asia such as HKSE. We can call it

  • symbol
  • stock symbol
  • ticker symbol
  • symbol ticker

Symbols are sometimes reused. In the US the single-letter symbols are particularly sought after as vanity symbols. For example, since Mar 2008 Visa Inc. has used the symbol V that had previously been used by Vivendi, which had delisted.

FIX heartBtInt^TCP keep-alive^XDP heartbeat^xtap inactivity^MSFilter heartbeat

When either end of a FIX connection has not sent any data for [HeartBtInt] seconds, it should transmit a Heartbeat message. This is first layer of defense.

When either end of the connection has not received any data for (HeartBtInt + “some reasonable transmission lag”) seconds, it will send a TestRequest probe message to verify the connection. This is 2nd layer of defense. If there is still no message received after (HeartBtInt + “some reasonable transmission lag”) seconds then the connection should be considered lost.

Heartbeats issued as a response to TestRequest must contain the TestReqID transmitted in the TestRequest. This is useful to verify that the Heartbeat is the result of the TestRequest and not as the result of a HeartBtInt timeout. If a response doesn’t have the expected TestReqId, I think we shouldn’t ignore it. If everyone were to ignore it, then TestReqId would be a worthless feature added to FIX protocol.

FIX clients should reset the heartbeat interval timer after every transmitted message (not just heartbeats).
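The two layers of defense described above can be condensed into one decision function. This is my own simplified sketch, not code from any FIX engine. The names HbState, HbAction and nextAction are hypothetical, and I assume the caller tracks elapsed time since the last send and the last receive.

```cpp
#include <chrono>

enum class HbAction { None, SendHeartbeat, SendTestRequest, Disconnect };

struct HbState {
    std::chrono::seconds heartBtInt;         // negotiated HeartBtInt
    std::chrono::seconds lag;                // "reasonable transmission lag" (assumed)
    std::chrono::seconds sinceLastSent;      // reset after every transmitted message
    std::chrono::seconds sinceLastReceived;  // reset after every received message
    bool testRequestOutstanding;             // a TestRequest is awaiting its Heartbeat
};

inline HbAction nextAction(const HbState& s) {
    const std::chrono::seconds limit = s.heartBtInt + s.lag;
    // Still silent one more limit after the probe: treat the connection as lost.
    if (s.testRequestOutstanding && s.sinceLastReceived >= 2 * limit) {
        return HbAction::Disconnect;
    }
    // 2nd layer: nothing received for HeartBtInt + lag, so probe the peer.
    if (!s.testRequestOutstanding && s.sinceLastReceived >= limit) {
        return HbAction::SendTestRequest;
    }
    // 1st layer: we have been quiet for HeartBtInt, so send a Heartbeat.
    if (s.sinceLastSent >= s.heartBtInt) {
        return HbAction::SendHeartbeat;
    }
    return HbAction::None;
}
```

Matching the TestReqID on the answering Heartbeat is left out; it would clear testRequestOutstanding only when the ID matches.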

TCP keep-alive is an optional, classic feature.

NYSE XDP protocol uses heartbeat messages in both multicast and TCP request server. The TCP heartbeat requires client response, similar to FIX probe message.

Xtap also has an “inactivity timeout”. Every incoming exchange message (including heartbeats) is recognized as “activity”. 60 sec of Inactivity triggers an alert in the log, the log watchdog…

MS Filter framework supports a server-initiated heartbeat, to inform clients that “I’m alive”.  An optional feature — heartbeat msg can carry an expiration value a.k.a dynamic frequency value like “next heartbeat will arrive in X seconds”.

#112=testRequest id. There can be many test requests, each with an id.

##mkt-data jargon terms: fakeIV

Context — real time high volume market data feed from exchanges.

  • Q: what’s depth of market? Who may need it?
  • Q: what are the different usages of FIX vs binary protocols?
  • Q: when is FIX not suitable? (market data dissemination …)
  • Q: when is binary protocol not suitable? (order submission …)
  • Q: what’s BBO i.e. best bid offer? Why is it important?
  • Q: How do you update BBO and when do you send out the update?
  • Q[D]: how do you support trade-bust that comes with a trade id? How about order cancellation — is it handled differently?
  • Q: how do you handle an order modification?
  • Q: how do you handle an order replacement?
  • Q: How do you handle currency code? (My parser sends the currency code for each symbol only once, not on every trade)
  • Q: do you have the refresh channel? How can it be useful?
  • Q[D]: do you have a snapshot channel? How can it be useful?
  • Q: when do you shut down your feed handler? (We don’t, since our clients expect to receive refresh/snapshots from us round the clock. We only restart once in a while)
  • Q: if you keep running, then how is your daily open/close/high/low updated?
  • Q: how do you handle partial fill of a big order?
  • Q: what’s an iceberg order? How do you handle it?
  • Q: what’s a hidden order? How do you handle it?
  • Q: What’s Imbalance data and why do clients need it?
    • Hint: they impact daily closing prices which can trigger many derivative contracts. Closing prices are closely watched somewhat like LIBOR.
  • Q: when would we get imbalance data?
  • Q: are there imbalance data for all stocks? (No)
  • ——— symbol/reference data
  • Q[D]: Do you send security description to downstream in every message? It can be quite long and take up lots of bandwidth?
  • Q: what time of the day do symbol data come in?
  • Q5[D]: what essential attributes are there in a typical symbol message?
  • Q5b: Can you name one of the key reasons why symbol message is critical to a market data feed?
  • Q6: how would dividend impact the numbers in your feed? Will that mess up the data you send out to downstream?
  • Q6b[D]: if yes how do you handle it?
  • Q6c: what kind of exchanges don’t have dividend data?
  • Q: what is a stock split? How do you handle it?
  • Q: how are corporate actions handled in your feed?
  • Q: what are the most common corporate actions in your market data? (dividends, stock splits, rights issue…)
  • ——— resilience, recovery, reliability, capacity
  • Q: if a security is lightly traded, how do you know if you have missed some updates or there’s simply no update on this security?
  • Q25: When would your exchange send a reset?
  • Q25b: How do you handle resets?
  • Q26[D]: How do you start your feed handler mid-day?
  • Q26b: What are the concerns?
  • Q[D]: how do you handle potential data loss in multicast?
  • Q[D]: how do you ensure your feed handler can cope with a burst in messages in TCP vs multicast?
  • Q[D]: what if your feed handler is offline for a while and missed some messages?
  • Q21: Do you see gaps in sequence numbers? Are they normal?
  • Q21b: How is your feed handler designed to handle them?
  • Q[D]: does your exchange have a primary and secondary line? How do you combine both?
  • Q[D]: between primary and secondary channel, if one channel has lost data but the other channel did receive the data, how does your system combine them?
  • Q[D]: how do you use the disaster recovery channel?
  • Q[D]: If exchange has too much data to send in one channel, how many more channels do they usually use, and how do they distribute the data among multiple channels?
  • Q: is there a request channel where you send requests to exchange? What kind of data do you send in a request?
  • Q: Is there any consequence if you send too many requests?
  • Q: what if some of your requests appear to be lost? How does your feed handler know and react?

[D=design question]

## mkt data: avoid byte-copying #NIO

I would say “avoid” or “eliminate” rather than “minimize” byte copying. Market data volume is gigabytes so we want and can design solutions to completely eliminate byte copying.

  • RTS uses reinterpret_cast but still there’s copying from kernel socket buffer to userland buffer.
  • Java NIO buffers can remove the copying between JVM heap and the socket buffer in C library. See P226 [[javaPerf]]
  • java autoboxing is highly unpopular for market data systems. Use byte arrays instead

mktData conflation: design question

I have hit this same question twice — Q: in a streaming price feed, you get IBM prices in the queue but you don’t want consumer thread AA to use “outdated” prices. Consumer BB needs a full history of the prices.

I see two conflicting requirements by the interviewer. I will point out to the interviewer this conflict.

I see two channels — in-band + out-of-band needed.

  1. in-band only — if full tick history is important, then the consumers have to /process/ every tick, even if outdated. We can have dedicated systems just to record ticks, with latency. For example, Rebus receives every tick, saves it and sends it out without conflation.
  2. out-of-band — If your algo engine needs to catch opportunities at minimal latency, then it can’t afford to care about history. It must ignore history. I will focus on this requirement.
  3. dual-band — Combining the two, if your super-algo-engine needs to analyze tick-by-tick history and also react to the opportunities, then the “producer” thread alone has to do all work till order transmission, but I don’t know if it can be fast enough. In general, the fastest data processing system is single-threaded without queues and minimal interaction with other data stores. Since the producer thread is also the consumer thread for the same message, there’s no conflation. Every tick is consumed! I am not sure about the scalability of this synchronous design. FIFO Queue implies latency. Anyway, I will not talk further about this stringent “combo” requirement. says “Many firms mitigate the data they consume through the use of simple time conflation. These firms throw data on the floor based solely on the time that data arrived.”

In the Wells interview, I proposed a two-channel design. The producer simply updates a “notice board” with latest prices for each of 999 tickers. Registered consumers get notified out-of-band to re-read the notice board[1], on some messaging thread. Async design has a latency. I don’t know how tolerable that is. I feel async and MOM are popular and tolerable in algo trading. I should check my book [[all about HFT]]…
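The "notice board" can be sketched as one slot per ticker that the producer overwrites. This is only an illustration of the conflation idea I proposed. The names NoticeBoard, post and read are hypothetical, prices are assumed to be integer ticks, and the out-of-band notification is left out.

```cpp
#include <array>
#include <atomic>
#include <cstddef>
#include <cstdint>

class NoticeBoard {
public:
    static constexpr std::size_t kTickers = 999;

    // Producer: overwrite the slot. Older prices are simply lost (conflated).
    void post(std::size_t ticker, std::int64_t priceTicks) {
        latest_[ticker].store(priceTicks, std::memory_order_release);
    }

    // Consumer: always sees the latest posted price, never a backlog.
    std::int64_t read(std::size_t ticker) const {
        return latest_[ticker].load(std::memory_order_acquire);
    }

private:
    // One slot per ticker, zero-initialized; no queue anywhere.
    std::array<std::atomic<std::int64_t>, kTickers> latest_{};
};
```

A slow consumer AA that reads once after three posts sees only the third price, which is the whole point. Consumer BB with the full-history requirement needs the separate in-band path.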

In-band only — However, the HSBC manager (Brian?) seems to imply that for minimum latency, the socket reader thread must run the algo all the way and send order out to exchange in one big function.

Out-of-band only — For slightly slower markets, two market-leading investment bank gateways actually publish periodic updates regardless how many raw input messages hit it. Not event-driven, not monitoring every tick!

  • Lehman eq options real time vol publisher
  • BofA Stirt Sprite publishes short-term yield curves on the G10 currencies.
    • Not EURUSD spot prices

[1] The notification can only contain indicative price numbers and serve to invite clients to check the notice board. If clients rely solely on the indicative, it defeats conflation and brings us back to a FIFO design.

##fastest container choices: array of POD #or pre-sized vector

relevant to low-latency market data.

  • raw array is “lean and mean” — the most memory efficient; vector is very close, but we need to avoid reallocation
  • std::array is less popular but should offer similar performance to vector
  • all other containers are slower, with bigger footprint
  • For high-performance, avoid container of node/pointer — Cache affinity loves contiguous memory. After accessing 1st element, then accessing 2nd element is likely a cache-hit
    • set/map, linked list suffer the same
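A small sketch of the "pre-sized vector" point: reserve once, then push_back never reallocates as long as we stay within capacity, so elements stay contiguous and pointers stay valid. Quote and fillWithoutReallocation are names I made up for the illustration.

```cpp
#include <cstddef>
#include <cstdint>
#include <type_traits>
#include <vector>

struct Quote {                 // plain-old-data element
    std::int64_t price;
    std::int32_t qty;
    std::int32_t symbolId;
};
static_assert(std::is_trivially_copyable<Quote>::value, "Quote should be POD-like");

// Returns true if n push_backs completed without moving the storage.
inline bool fillWithoutReallocation(std::size_t n) {
    std::vector<Quote> v;
    v.reserve(n);                          // single up-front allocation
    const Quote* base = v.data();
    for (std::size_t i = 0; i < n; ++i) {
        v.push_back(Quote{static_cast<std::int64_t>(i), 1, 7});
    }
    return v.data() == base && v.size() == n;
}
```

Without the reserve call, the vector would reallocate and copy several times on the way to n elements.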

real-time symbol reference-data: arch #RTS

Real Time Symbol Data is responsible for sending out all security/product reference data in real time, without duplication.

  • latency — typically 2ms (not microsec) latency, from receiving to sending out the enriched reference data to downstream.
  • persistence: any data worth sending out needs to be saved. In fact, every hour the same system sends a refresh snapshot to downstream.
    • performance penalty of disk write — is handled by innoDB. Most database access is in-memory. Disk write is rare. Enough memory to hold 30GB of data. shows how many symbols there across all trading venues.
  • insert is actually slower than update. But first, system must check if there’s a need to insert or update. If no change, then don’t save the data or send out.
  • burst / surge — is the main performance headache. We could have a million symbols/messages flooding in
  • relational DB with mostly in-memory storage

MOM^sharedMem ring buffer^UDP : mkt-data transmission

I feel in most environments, the MOM design is most robust, relying on a reliable middleware. However, latency sensitive trading systems won’t tolerate the additional latency and see it as unnecessary.

Gregory (ICE) told me about his home-grown simple ring buffer in shared memory. He used a circular byte array. Message boundary is embedded in the payload. When the producer finishes writing to the buffer, it puts some marker to indicate end of data. Greg said the consumer is slower, so he makes it a (periodic) polling reader. When consumer encounters the marker it would stop reading. I told Gregory we need some synchronization. Greg said it’s trivial. Here are my tentative ideas —

Design 1 — every time the producer or the consumer starts it would acquire a lock. Coarse-grained locking

But when the consumer is chipping away at head of the queue, the producer can simultaneously write to the tail, so here’s

Design 2 — the latest message being written is “invisible” to the consumer. Producer keeps the marker unchanged while adding data to the tail of queue. When it has nothing more to write, it moves the marker by updating it.

The marker can be a lock-protected integer representing the index of the last byte written.

No need to worry about buffer capacity, or a very slow consumer.
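Design 2 can be sketched as below, with several simplifications that are mine and not Gregory's: the buffer lives in ordinary process memory rather than shared memory, there is exactly one producer and one consumer, the marker is a std::atomic instead of a lock-protected integer, and overflow is not checked. The names ByteRing, stage, publish and poll are hypothetical.

```cpp
#include <atomic>
#include <cstddef>
#include <cstdint>
#include <vector>

class ByteRing {
public:
    explicit ByteRing(std::size_t capacity) : buf_(capacity) {}

    // Producer: write bytes past the marker. They stay invisible to the consumer.
    void stage(const unsigned char* data, std::size_t n) {
        for (std::size_t i = 0; i < n; ++i) {
            buf_[(staged_ + i) % buf_.size()] = data[i];
        }
        staged_ += n;
    }

    // Producer: nothing more to write, so move the marker in one step.
    void publish() { marker_.store(staged_, std::memory_order_release); }

    // Consumer (polling reader): copy out everything up to the marker, then stop.
    std::size_t poll(std::vector<unsigned char>& out) {
        const std::uint64_t end = marker_.load(std::memory_order_acquire);
        const std::size_t n = static_cast<std::size_t>(end - readPos_);
        for (; readPos_ < end; ++readPos_) {
            out.push_back(buf_[readPos_ % buf_.size()]);
        }
        return n;
    }

private:
    std::vector<unsigned char> buf_;
    std::uint64_t staged_ = 0;                // touched by the producer only
    std::uint64_t readPos_ = 0;               // touched by the consumer only
    std::atomic<std::uint64_t> marker_{0};    // the only shared variable
};
```

The release/acquire pair on the marker guarantees that once the consumer sees the new marker value, it also sees the bytes written before it.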

| | MOM | UDP multicast or TCP or UDS | shared_mem |
|---|---|---|---|
| how many processes | 3-tier | 2-tier | 2-tier |
| 1-to-many distribution | easy | easiest | doable |
| intermediate storage | yes | tiny. The socket buffer can be 256MB | yes |
| producer data burst | supported | message loss is common in such a situation | supported |
| async? | yes | yes, since the receiver must poll or be notified | I think the receiver must poll or be notified |
| additional latency | yes | yes | minimal |

##[11] data feed to FX pricer #pre/post trade, mid/short term

(see blog on influences on FX rates)

Pricing is at heart of FX trading. It’s precisely due to the different pricing decisions of various players that speculation opportunities exist. There are pricing needs in pre/post trade, and pricing timeframes of long term or short term

For mark to market and unrealized PnL, real time market trade/quote prices are probably best, since FX is an extremely liquid and transparent market, except the NDF markets.

That’s post trade pricer. For the rest of this write-up let’s focus on pre-trade pricer of term instruments. Incoming quotes from major electronic markets are an obvious data source, but for less liquid products you need a way to independently derive a fair value, as the market might be overpriced or underpriced.

For a market maker or dealer bank responding to RFQ,
– IRS, bond data from Bloomberg
– yield spread between government bonds of the 2 countries. Prime example – 2-year Bund vs T-note
– Libor, government bond yield. See
– Depth of market
– volume of _limit_orders_ and trades (It’s possible to detect trends and patterns)
– dealer’s own inventory of each currency
– cftc COT report. See
– risk reversal data on FXCM. See

For short term trading, interest rate is the most important input to FX forward pricing — There’s a separate blog post. Other significant drivers must be selected and re-selected from the following pool of drivers periodically, since one set of drivers may work for a few days and become *obsolete*, to be replaced  by another set of drivers.
– yield spread
– T yields of 3 month, 2 year and 10 year
– Libor and ED futures
– price of oil (usually quoted in USD). Oil up, USD down.
– price of gold

For a buy-and-hold trader interested in multi-year long term “fair value”, pricers need
– balance of trade?
– inflation forecast?
– GDP forecast?

mkt-data tech skills: portable/shared@@

Raw mkt data tech skill is better than soft mkt data even though it’s further away from “the money”:

  • standard — Exchange mkt data format won’t change a lot. Feels like an industry standard
  • the future — most OTC products are moving to electronic trading and will have market data to process
  • more necessary than many modules in a trading system. However ….. I guess only a few systems need to deal with raw market data. Most down stream systems only deal with the soft market data.

Q1: If you compare 5 typical market data gateway dev [1] jobs, can you identify a few key tech skills shared by at least half the jobs, but not a widely used “generic” skill like math, hash table, polymorphism etc?

Q2: if there is at least one, how important is it to a given job? One of the important required skills, or a make-or-break survival skill?

My view — I feel there is not a shared core skill set. I venture to say there’s not a single answer to Q1.

In contrast, look at quant developers. They all need skills in c++/excel, BlackScholes, bond math, swaps, …

In contrast, also look at dedicated database developers. They all need non-trivial SQL, schema design. Many need stored procs. Tuning is needed if large tables

Now look at market data gateway for OPRA. Two firms’ job requirements will share some common tech skills like throughput (TPS) optimization, fast storage.

If latency and TPS requirements aren’t stringent, then I feel the portable skill set is an empty set.

[1] There are also many positions whose primary duty is market data but not raw market data, not large volume, not latency sensitive. The skill set is even more different. Some don’t need development skill on market data — they only configure some components.

## 4 exchange feeds +! TCP/multicast

  • eSpeed
  • Dalian Commodity Exchange
  • SGX Securities Market Direct Feed
  • CanDeal (mostly Canadian dollar debt securities)

I have evidence (not my imagination) to believe that these exchange data feeds don’t use vanilla TCP or multicast but some proprietary API (based on them presumably).

I was told other China feeds (probably the Shenzhen feed) are also API-based.

ESpeed would ship a client-side library. Downstream would statically link or dynamically link it into parser. The parser then communicates to the server in some proprietary protocol.

IDC is not an exchange but we go one level deeper, requiring our downstream to provide rack space for our own blackbox “clientSiteProcessor” machine. I think this machine may or may not be part of a typical client/server set-up, but it communicates with our central server in a proprietary protocol.

what hours2expect execution msg{typical U.S.exchanges

There was a requirement that within 90 seconds, any execution on any online or traditional broker system needs to be reported to the “official” exchange. For each listed security, there’s probably a single “official” listing exchange.

Take IBM for example.

  • On NYSE, executions only take place after 9.30am, usually after an opening auction.
  • On-line electronic brokers operate 24/7 so an execution could happen and get reported any time. However, NYSE data feed only publishes it after 4am by right. I don’t know how strict this timing is. If your feed shows it before 4am I guess you are free to discard it. Who knows it might be a test message.


Locate msg]binary feed #multiple issues solved

Hi guys, thanks to all your help, I managed to locate the very first trading session message in the raw data file.

We hit and overcame multiple obstacles in this long “needle search in a haystack”.

  • Big Obstacle 1: endian-ness. It turned out the raw data is little-endian. For my “needle”, the symbol integer id 15852 (in decimal) or 3dec (in hex) is printed byte-swapped as “ec3d” when I finally found it.

Solution: read the exchange spec. It should be mentioned.

  • Big Obstacle 2: my hex viewers (like “xxd”) add line breaks to the output, so my needle can be missed during my search. (Thanks to Vishal for pointing this out.)

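Solution 1: xxd -p -c 999999 raw/feed/file > tmp.txt; grep $needle tmp.txt

By default xxd breaks the output after every 16 bytes (30 bytes in the plain -p mode), which is unwanted, so I set a very large column size of 999999. As far as I know, xxd only accepts a column count this large in -p mode; the normal mode caps it at 256.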

Solution 2: in vi editor after “%!xxd -p” if you see line breaks, then you can still search for “ec\_s*3d”. Basically you need to insert “\_s*” between adjacent bytes.

Here’s a 4-byte string I was able to find. It spans multiple lines: 15\_s*00\_s*21\_s*00

  • Obstacle 3: identify the data file among 20 files. Thanks to this one obstacle, I spent most of my time searching in the wrong files 😉

Solution: remove each file successively, starting from the later hours, and retest, until the needle stops showing. The last removed file must contain our needle. That file is a much smaller haystack.

o One misleading piece of information is the “9.30 am” mentioned in the spec. Actually the message came much earlier.

o Another misleading piece of information is the timestamp passed to my parser function. Not sure where it comes from, but it says 08:00:00.1 am, so I thought the needle must be in the 8am file, but actually it is in the 4am file. In this feed, the only reliable timestamp I have found is the one in the packet header, one level above the messages.

  • Obstacle 4: my “needle” was too short so there were too many useless matches.

Solution: find a longer and more unique needle, such as the SourceTime field, which is a 32-bit integer. When I convert it to hex digits I get 8 hex digits. Then I flip it due to endian-ness. Then I get a more unique needle “008e0959”. I was then able to search across all 14 data files:

for f in arca*0; do
  xxd -c999999 -p "$f" > "$f.hex"
  grep -ioH 008e0959 "$f.hex" && echo "found in $f"
done
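The needle construction and the search can also be done directly on the raw bytes, which sidesteps the line-break problem of hex viewers. Below is a minimal Python sketch, assuming the fields are unsigned integers stored little-endian. The helper names `le_needle` and `find_all` and the sample buffer are made up for illustration; they are not part of any feed handler.

```python
import struct

def le_needle(value, width):
    """Encode an unsigned integer the way a little-endian feed stores it."""
    fmt = {2: "<H", 4: "<I", 8: "<Q"}[width]
    return struct.pack(fmt, value)

def find_all(haystack, needle):
    """Return every byte offset at which needle occurs in haystack."""
    offsets = []
    pos = haystack.find(needle)
    while pos != -1:
        offsets.append(pos)
        pos = haystack.find(needle, pos + 1)
    return offsets

# Symbol id 15852 is 0x3dec, so on the wire it is the bytes ec 3d.
symbol_needle = le_needle(15852, 2)
# The 32-bit needle "008e0959" corresponds to the integer 0x59098e00.
time_needle = le_needle(0x59098E00, 4)

# Made-up buffer standing in for the contents of one raw feed file.
sample = bytes.fromhex("15002100") + time_needle + symbol_needle
print(symbol_needle.hex(), time_needle.hex(), find_all(sample, symbol_needle))
```

For a real file, reading it with `open(path, "rb").read()` and passing the result to `find_all` gives byte offsets, which are easier to cross-check against the spec than grep matches in a hex dump.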


  • Obstacle 5: I have to find and print the needle using my c++ parser. It’s easy to print a wrong hex representation in C/C++, so for most of this exercise I wasn’t sure if I was looking at a correct hex dump in my c++ log.

o If you convert a long byte array to hex and print without whitespace, you could see 15002100ffffe87600, but when I added a space after each byte, it looked like 15 00 21 00 ffffe876 00, so one of the bytes was overflowing without warning! The likely cause is sign extension: a plain char with its high bit set becomes a negative int when promoted for printing, so cast each byte to unsigned char first.

o If you forget zero-padding, then you see a lot of single “0” where you should get “00”. Again, if you don’t include whitespace you won’t notice.

Solution: I have worked out some simplified code that works. I have a c++ solution and c solution. You can ask me if you need it.
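The C and C++ code mentioned above is not reproduced here. Purely to illustrate the two pitfalls, here is a minimal Python sketch with a hypothetical helper `hex_dump`: it masks each value to 8 bits (the counterpart of casting to unsigned char) and zero-pads to two hex digits. The sample values are made up, loosely modeled on the dump above.

```python
def hex_dump(values, sep=" "):
    """Format an iterable of byte values as zero-padded two-digit hex.

    Masking with 0xFF maps a negative value (what a signed char would hold)
    back to its 8-bit pattern instead of printing a sign-extended number.
    """
    return sep.join(f"{v & 0xFF:02x}" for v in values)

# -24 is what a signed 8-bit char holds for the bit pattern 0xe8.
signed_view = [0x15, 0x00, 0x21, 0x00, -24, 0x76, 0x00]
print(hex_dump(signed_view))
```

Printing with a separator first and removing it only once the output looks right makes both mistakes visible early.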

  • Obstacle 6: In some cases, the sequence number is not in the raw feed. In this case the sequence number is in the feed, so Nick’s suggestion was valid, but I was blocked by other obstacles.

Tip: If sequence number is in the feed, you would probably spot a pattern of incrementing hex numbers periodically in the hex viewer.

OPRA: name-based sharding by official feed provider

2008 (GFC) peak OPRA msg rate: the Wikipedia “low latency” article says 1 million updates per second. Note my NYSE parser can handle 370,000 messages per second per thread! shows 48 multicast groups, each for a subset of the option symbols. When there were 24 groups, the symbols starting with RU to SMZZZ were too voluminous for one multicast group, more so than the other 23 groups.

Our OPRA parser instance is simple and efficient (probably stateless) so presumably capable of handling multiple OPRA multicast groups per instance. We still use one parser per MC group for simplicity and ease of management.

From: Chen, Tao
Sent: Tuesday, May 30, 2017 8:24 AM
To: Tan, Victor
Subject: RE: OPRA feed volume

Opra data is provided by SIAC(securities industry automation corporation). The data is disseminated on 53 multicast channels. TP runs 53 instances of parser and 48 instances of rebus across 7 servers to handle it.

realistic TPS for exchange mkt-data gateway: below 1M/sec

  • Bombay’s BSE infrastructure (including matching engine) can execute 1 mil trades/second. Must maintain order books, therefore NOT stateless.
  • Rebus maintains full order books for each symbol, can handle 300k (in some cases 700k) messages per second per instance. Uses AVL tree, which beat all other data structures in tests.
  • The Nasdaq feed parser (in c++) is stateless and probably single-threaded. Very limited parsing logic compared to Rebus. It once handled 600k messages/second per instance
  • Product Gateway is probably a dumb process since we never need to change it. Can handle 550k messages/second per instance

I believe TPS throughput, not latency, is the optimization goal. Biggest challenge known to me is the data burst.

Therefore, java GC pause is probably unacceptable. In my hypothesis, after you experience a data surge for a while, you tend to run out of memory and must run GC (like a run to the bathroom). But that’s the wrong time to run GC. If the surge continues while your GC runs, then the incoming data would overflow the queue.

nyse opening auction mkt-data

There are various order types usable before the 9.30 opening auction (and also before a halted security comes back). We might receive these orders in the Level 2 feed. I guess the traders use those orders to test the market before trading starts. Most of these orders can be cancelled, since there’s no execution.

The Imbalance data is a key feature of the exchange auction engine, and possibly very valuable to traders. It has

  • indicative match price, which at 9.30 becomes final match price
  • imbalance at that match price

The secret optimization algorithm – The auction engine looks at all the orders placed, and works out an optimal match price to maximize execution volume. Traders would monitor that indicative price published by exchange, and adjust their orders. This adjustment would be implemented in an execution algorithm.
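The exchange’s actual algorithm is not public in these notes, so the following is only a textbook-style Python sketch of the “maximize execution volume” idea. The function name, the sample orders, the integer prices in cents and the tie-break (the lowest price wins) are all assumptions.

```python
def indicative_match(buys, sells):
    """buys, sells: lists of (limit_price, quantity).

    Returns (price, matched_volume, imbalance) for the price that
    maximizes matched volume, or None if there are no orders.
    Imbalance is buy interest minus sell interest at that price.
    """
    best = None
    for price in sorted({p for p, _ in buys + sells}):
        demand = sum(q for p, q in buys if p >= price)   # buyers willing to pay this much
        supply = sum(q for p, q in sells if p <= price)  # sellers willing to accept this
        matched = min(demand, supply)
        if best is None or matched > best[1]:
            best = (price, matched, demand - supply)
    return best

# Prices in cents to avoid floating-point comparisons.
buys = [(1002, 300), (1001, 200), (1000, 500)]
sells = [(999, 100), (1000, 300), (1001, 400)]
print(indicative_match(buys, sells))
```

Re-running the function after every order add or cancel would give the stream of indicative prices and imbalances that traders watch before the open.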

stale (replicated) orderbook

This page is intended to list ideas on possible ways of improving the orderbook replication system so that it can better determine when an instrument or instruments are stale. It further lists possible ways of automating retransmission of lost and/or stale data to improve the recovery time in the event of an outage.

The current system is largely “best effort”; at each level of the ticker-plant and distribution network, components are capable of dropping data should a problem arise. There are few processes capable of correlating what was lost in a format that is useful to a customer. To account for the possibility of lost data, the orderbook replication components constantly generate “refresh” messages with the most up-to-date state of an instrument.

Although this system works well in practice, it leaves open the possibility that a customer may have a cache with “stale” data for an unbounded length of time. This inability to track staleness can be a point of concern for customers.

= Recovery types =

When a downstream component loses one or more update messages for an instrument it is generally safe to assume the instrument is stale. However there can be two different kinds of recoveries:

== Retransmit the latest snapshot ==

This method of retransmission and stale detection revolves around keeping the current tick snapshot database up to date. It is useful for customers that need an accurate tick cache. It may not be a full solution to customers that need an accurate time and sales database.

== Retransmit all lost ticks ==

It is also possible to retransmit all lost ticks after an outage. This is typically useful when trying to repair a time-and-sales database.

Although it is possible to build an accurate “current” snapshot record when all lost ticks are retransmitted, it is a very tedious and error-prone process. It is expected that customers will, in general, be unwilling to rebuild the “current” from the retransmission of lost ticks.

So, a scheme that involves retransmission of lost ticks will still likely require a scheme that retransmits the latest snapshot.

Most of the following discussions are centered around the concept of latest snapshot recovery.

= Gap prevention =

There may be simple ways to reduce the number of times gaps occur. This process could be called “gap prevention”.

In general, it is not possible to eliminate gaps, because severe outages and equipment failure can always occur. The process of gap prevention may be useful, however, where the alternative gap recovery solution is expensive or undesirable. It is also useful in systems that need full lost tick recovery.

There are two possible ways of preventing gaps from occurring. Both trade bandwidth and latency for increased reliability during intermittent outages.

== Wait for retransmit ==

The simplest form of gap prevention involves retransmitting any packets lost on the network. The sender keeps a buffer of recently sent messages, and the receiver can request retransmissions. In the event of packet loss, the receiver waits for the retransmissions before processing data.

This form of gap recovery is a basic feature of the TCP/IP transmission protocol.

== Forward error correction ==

It is also possible to prevent gaps by sending additional data on the feed.

The most basic form of this is used in the “best-of-both” system. It sends two or more copies of the data, and the receiver can fill lost ticks from the additional copies.

It is not necessary to send a full additional feed. For example, one could send a block of parity codes on every tenth packet. A receiver could then theoretically recover from up to ten percent packet loss by using the parity code packets.

Although the forward error correction scheme uses additional bandwidth, additional bandwidth may be available due to line redundancy.
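As a concrete illustration of the parity idea, the simplest code is a single XOR parity packet per block, which can rebuild one lost packet in that block. The Python sketch below assumes exactly that scheme and equal-length packets; a real feed would likely use a stronger code, and the function names and packet contents are made up.

```python
from functools import reduce

def xor_bytes(a, b):
    """Byte-wise XOR of two equal-length packets."""
    return bytes(x ^ y for x, y in zip(a, b))

def make_parity(packets):
    """Sender side: XOR of all data packets in the block."""
    return reduce(xor_bytes, packets)

def recover(received, parity):
    """Receiver side: received is the block with None marking the lost packet.

    Returns (index, rebuilt_packet). Only a single loss can be repaired.
    """
    missing = [i for i, p in enumerate(received) if p is None]
    if len(missing) != 1:
        raise ValueError("XOR parity can rebuild exactly one lost packet")
    present = [p for p in received if p is not None]
    # XOR-ing the parity with every surviving packet leaves the lost one.
    return missing[0], reduce(xor_bytes, present, parity)

# Nine made-up data packets followed by one parity packet (every tenth packet).
block = [bytes([i, i + 1, i + 2, i + 3]) for i in range(9)]
parity = make_parity(block)
damaged = list(block)
damaged[4] = None
print(recover(damaged, parity))
```

With one parity packet per nine data packets, the receiver tolerates one loss per block of ten, which is where the ten percent figure comes from in the best case.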

= Snapshot recovery types =

In order to correct a stale instrument, it may be necessary to send the full contents of the instrument. When doing so, one may send them serialized in the real-time feed or out of order.

== In sequence snapshots ==

The simplest form of snapshot transmission involves using the same socket or pipe as the real-time feed itself. In this case, the receiver can simply apply the snapshot to its database; it does not need to handle the cases where the snapshot arrives after/before a real-time update arrives.

The downside of this scheme, however, is that a single upstream supplier of snapshots might get overloaded with requests for retransmissions. If additional distributed databases are used to offload processing, then each additional component will add latency to the real-time feed.

== Out of sequence snapshots ==

It is also possible to send snapshot transmissions using sockets and/or pipes separate from the real-time feed. The advantage of this scheme, is it is relatively cheap and easy to increase the number of distributed snapshot databases from which one can query. However, it requires the receiver of the snapshot to work harder when attempting to apply the response to its database.

One way to apply out of order snapshots is to build a “reorder buffer” into the receiver. This buffer would store the contents of the real-time feed. When a snapshot response arrives, the receiver could locate where the sender was in the real-time stream when it generated the snapshot (possibly by using sequence numbers). It can then apply the snapshot and internally replay any pending updates from the reorder buffer. In the case where a snapshot arrived that was based on real-time traffic that the receiver has yet to receive, the receiver must wait for that traffic to arrive before applying the snapshot.

This scheme is thought to be complex, resource intensive, and error-prone.
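To make the reorder buffer concrete, here is a minimal Python sketch. It assumes that real-time updates are quantity deltas carrying a sequence number, and that each snapshot carries `as_of_seq`, the sequence number of the last real-time message it reflects. The class name, the field names and the single pending-snapshot slot are simplifications for illustration, not the design of any existing component.

```python
class SnapshotReceiver:
    def __init__(self):
        self.state = {}      # instrument -> quantity
        self.buffer = []     # (seq, instrument, delta) in arrival order
        self.last_seq = 0    # highest real-time sequence number seen
        self.pending = None  # snapshot that is ahead of the real-time feed

    def on_update(self, seq, instrument, delta):
        self.buffer.append((seq, instrument, delta))
        self.last_seq = max(self.last_seq, seq)
        self.state[instrument] = self.state.get(instrument, 0) + delta
        # A held snapshot can be applied once real-time traffic catches up.
        if self.pending is not None and self.last_seq >= self.pending[0]:
            snapshot, self.pending = self.pending, None
            self._apply(*snapshot)

    def on_snapshot(self, as_of_seq, instrument, quantity):
        if as_of_seq > self.last_seq:
            self.pending = (as_of_seq, instrument, quantity)
        else:
            self._apply(as_of_seq, instrument, quantity)

    def _apply(self, as_of_seq, instrument, quantity):
        # Start from the snapshot, then replay buffered updates newer than it.
        for seq, inst, delta in self.buffer:
            if inst == instrument and seq > as_of_seq:
                quantity += delta
        self.state[instrument] = quantity

# True stream: seq 1 +100, seq 2 +50 (lost), seq 3 -30, seq 4 +5.
r = SnapshotReceiver()
r.on_update(1, "IBM", 100)
r.on_update(3, "IBM", -30)
r.on_update(4, "IBM", 5)
r.on_snapshot(3, "IBM", 120)  # snapshot taken after seq 3
print(r.state["IBM"])
```

The sketch never trims `buffer`, which hints at why the scheme is considered resource intensive: a real receiver would need a retention policy and per-instrument indexing.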

If, however, the feed were changed to eliminate the distributed business rules, it may be possible to implement a much simpler out-of-order snapshot application system. See [[Out of sequence snapshots idea]] for a possible implementation.

= Gap detection =

In order to accurately determine when an instrument is “stale”, it is necessary to be able to determine when one or more update messages have been lost. The following sections contain notes on various schemes that can provide this information.

Note that some of the schemes may be complementary. That is, a possible future solution might use parts of several of these methods.

== Sequence numbers with keep-alive on idle ==

The most common method to detect a gap involves placing a monotonically incrementing number on all outgoing messages or message blocks. The receiver can then detect a gap when a message (or message block) arrives with a sequence number that is not one greater than the last sequence number.

In order to account for the case where all messages from a source are lost, or where a source goes idle just after a message loss, the sender needs to arrange to transmit a “keep alive” indicator periodically when the source is otherwise idle. With knowledge of the keep-alive period, the receiver can detect a gap by timing out if it does not receive a message from a source within the specified period. The larger the period, the fewer keep-alive messages need to be sent when idle. However, a larger period also increases the worst case time to detect a message gap.
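The receiver side of this scheme can be sketched in a few lines of Python. The sketch assumes one sequence series, caller-supplied timestamps in seconds, and a keep-alive that carries the sender’s last used sequence number; that last point is an assumption, as are the class name and the `grace` factor.

```python
class GapDetector:
    def __init__(self, keepalive_period, grace=2.0):
        self.expected = None     # next sequence number we expect
        self.last_heard = None   # time of the last message or keep-alive
        self.timeout = keepalive_period * grace

    def on_message(self, seq, now):
        """Return the list of sequence numbers skipped before this message."""
        missing = []
        if self.expected is not None and seq > self.expected:
            missing = list(range(self.expected, seq))
        if self.expected is None or seq >= self.expected:
            self.expected = seq + 1
        self.last_heard = now
        return missing

    def on_keepalive(self, last_sent_seq, now):
        """A keep-alive reveals messages lost just before the source went idle."""
        missing = []
        if self.expected is not None and last_sent_seq >= self.expected:
            missing = list(range(self.expected, last_sent_seq + 1))
            self.expected = last_sent_seq + 1
        self.last_heard = now
        return missing

    def is_silent(self, now):
        """True if neither data nor keep-alive arrived within the timeout."""
        return self.last_heard is not None and now - self.last_heard > self.timeout

d = GapDetector(keepalive_period=1.0)
d.on_message(54, now=0.0)
print(d.on_message(58, now=0.1))
```

Duplicates and late arrivals (a sequence number below `expected`) are simply ignored here; a receiver that merges two lines would see many of those.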

It is possible for the sender to generate multiple sequence number series simultaneously by separating instruments into multiple categories. For example, the outgoing feed currently generates an independent sequence number for each “exchange”. At the extreme, it is possible to generate a sequence number stream per instrument; however this would increase the bandwidth due to larger number of keep-alive messages necessary. (One would also not be able to optimize bandwidth by sequencing only message blocks.)

When a sequence number gap is detected, the receiver must consider all instruments using the sequence series as suspect.

Only a single source can reliably generate a sequence number. If multiple ticker-plants generate a feed, they need to use different sequence series. If an upstream ticker-plant switch occurs, the receiver needs to mark the full range of affected instruments as suspect.

Finally, some exchanges provide sequenced data feeds. However, there are [[issues with exchange provided sequence numbers]]. Due to this, it may be difficult to rely on exchange sequencing as a basis for a distribution network sequencing.

== Sequence number of last message ==

A variant of the basic sequencing scheme involves the sequence number of the last message (SNLM) that updates an instrument. This field would be kept by both sender and receiver, and included with real-time messages. If SNLM matches the receiver’s record, implying that the receiver has not missed any updates for this instrument, then the instrument can transition from “suspect” to “clean”. Conversely, a non-match should force the instrument to “stale”.

An advantage of this scheme is that ordinary real-time traffic could reduce the number of suspect records after an outage. It may also make using exchange provided sequence numbers more practical.

As a disadvantage, however, it would require that both a sequence number and SNLM be provided on every real-time update. This might significantly increase bandwidth.
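A minimal Python sketch of the receiver-side SNLM check follows. It assumes each update carries its own sequence number plus the SNLM for its instrument, that the sender uses `None` as the SNLM of a newly created instrument, and that a stale instrument stays stale until some snapshot mechanism (not shown) repairs it. All names are hypothetical.

```python
CLEAN, SUSPECT, STALE = "clean", "suspect", "stale"

class SnlmReceiver:
    def __init__(self):
        self.last_seq = {}  # instrument -> sequence number of last update seen
        self.status = {}    # instrument -> CLEAN, SUSPECT or STALE

    def mark_all_suspect(self):
        """Called when a gap is detected in the shared sequence series."""
        for instrument in self.last_seq:
            if self.status.get(instrument) != STALE:
                self.status[instrument] = SUSPECT

    def on_update(self, seq, instrument, snlm):
        if self.last_seq.get(instrument) != snlm:
            # We missed at least one update for this instrument.
            self.status[instrument] = STALE
        elif self.status.get(instrument) != STALE:
            # Nothing missed: a suspect instrument is proven clean.
            self.status[instrument] = CLEAN
        self.last_seq[instrument] = seq
        return self.status[instrument]

r = SnlmReceiver()
r.on_update(1, "IBM", None)
r.on_update(2, "MSFT", None)
r.mark_all_suspect()               # seq 3 (an IBM update) was lost
print(r.on_update(4, "MSFT", 2))   # MSFT missed nothing
print(r.on_update(5, "IBM", 3))    # IBM missed seq 3
```

This shows the advantage described above: ordinary traffic clears MSFT without any snapshot request, and only IBM needs recovery.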

== Periodic message count checks ==

It is also possible to detect gaps if the sender periodically transmits an accounting of all messages sent since the last period. This scheme may use less bandwidth than sequence numbers, because it is not necessary to send a sequence number with every message (or message block).

The scheme still has the same limitations as sequence numbers when ticker-plant switches occur and when trying to determine what was lost when a gap occurs.

== Periodic hash checks ==

Another possible method of detecting gaps is by having the sender generate a hash of the contents of its database. The receiver can then compare the sender’s hash to the same hash generated for its database. If the two differ, a gap must have occurred. (If the two match, however, a gap may have occurred but already been corrected; this method is therefore not useful when full tick recovery is necessary.)

This scheme may be beneficial when ticker-plant switches occur. If two senders have identical databases and no data is lost during a network switch, then the hash checks should still match at the receiver. This scheme, however, still faces the problem of determining which instruments from the set are actually stale when a gap is detected.

Technically, it is possible that two databases could differ while sharing the same hash key. However, it is possible to choose a hash function that makes the possibility of this extremely small.
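A minimal Python sketch of the comparison is shown below, assuming the database is a mapping from instrument to a record and that both sides serialize records identically. The function name and the sample records are made up; SHA-256 is simply one readily available hash with a negligible collision probability.

```python
import hashlib

def database_hash(db):
    """Hash the whole database in a canonical (sorted) instrument order."""
    h = hashlib.sha256()
    for instrument in sorted(db):
        h.update(repr((instrument, db[instrument])).encode("utf-8"))
    return h.hexdigest()

sender = {"IBM": (15010, 200), "MSFT": (7050, 100)}
receiver = {"MSFT": (7050, 100), "IBM": (15010, 200)}  # same content, other order
print(database_hash(sender) == database_hash(receiver))

receiver["IBM"] = (15010, 150)  # the receiver missed an update
print(database_hash(sender) == database_hash(receiver))
```

Hashing per exchange or per symbol range instead of the whole database would narrow down which instruments are stale, at the cost of sending more hashes.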

Finally, this system may face challenges during software upgrades and rollouts. If either the sender or the receiver changes how or what it stores in its database, it may be difficult to maintain a consistent hash representation.

== Sender tells receiver of gaps ==

If a reliable transmission scheme (eg, tcp) is in use between the sender and receiver, then it may be possible for the sender to inform the receiver when the receiver is unable to receive some portion of the content.

For example, if a sender can not transmit a block of messages to a receiver because the receiver does not have sufficient bandwidth at the time of the message, then it is possible for the sender to make a note of all instruments that receiver was unable to receive. When the receiver has sufficient bandwidth to continue receiving updates, the sender can iterate through the list of lost instruments and inform the receiver.

The scheme has the advantage that it allows the receiver to quickly determine what instruments are stale. It may also be useful when a component upstream in the ticker-plant detects a gap – it can just push down the known stale messages to all components down-stream from it. (For example, an exchange parser might detect a gap and send a stale indicator downstream while it attempts to fill the gap from the exchange.)

As a disadvantage, it may significantly complicate data senders. It also does not help in cases where a receiver needs to change to a different sender.

== Receiver analyzes gapped messages ==

In some systems, the receiver may need to obtain all lost messages (eg, to build a full-tick database). If the receiver knows the contents of messages missing earlier in the stream it can determine which messages are stale. Every instrument that contains an update message in the list of missing messages would be stale; instruments that did not have update messages would be “clean”.

An advantage of this system is that it is relatively simple to implement for receivers that need full tick retransmissions.

However, in the general case, it is not possible to implement full tick retransmissions due to the possibility of hard failures and ticker-plant switches. Therefore this scheme would only be useful to reduce the number of stale instruments in certain cases.

Also, the cost of retransmitting lost ticks may exceed the benefits found from reducing the number of instruments marked stale. This makes the scheme less attractive for receivers that do not need all lost ticks retransmitted.

= Stale correction =

This section discusses possible methods of resolving “suspect conditions” that occur when it is detected that an instrument may have missed real-time update messages.

There are likely many other possible schemes not discussed here. It is also possible that a combination of one or more of these schemes may provide a useful solution.

These solutions center around restoring the snapshot database. Restoration of a tick history database is left for a separate discussion.

== Background refresh ==

The simplest method of clearing stale records is to have the ticker-plant generate a periodic stream of refresh messages. This is what the system currently does.

This system is not very good at handling intermittent errors, because it could take a very long time to refresh the full instrument database. However, if enough bandwidth is allocated, it is a useful system for recovering from hard failures where the downstream needs a full refresh anyway. It is also possible to combine this with one of the gap prevention schemes discussed above to help deter intermittent outages.


Advantages:

* simple to implement at both receiver and sender

Disadvantages:

* time to recovery can be large
* can be difficult to detect when an instrument should be deleted, or when an IPO should be added

== Receiver requests snapshot for all stale instruments ==

In this system, the receiver would use one of the above gap detection mechanisms to determine when an instrument may be stale. It then issues a series of upstream requests until all such instruments are no longer stale.

In order to reduce the number of requests during an outage, the instruments on the feed could be broken up into multiple sets of sequenced streams (eg, one per exchange).


Advantages:

* could lead to faster recovery when there is available bandwidth and few other customers requiring snapshots

Disadvantages:

* could be complex trying to request snapshots for instruments where the initial create message is lost

Notes:

* see discussion on [[#Snapshot recovery types]]
* see discussion on [[#Gap detection]] for possible methods of reducing the universe of suspect instruments during an outage

== Sender sends snapshots ==

This is a variant of [[#Sender tells receiver of gaps]]. However, in this scheme, the sender would detect a gap for a receiver and automatically send the snapshot when bandwidth becomes available. (It may also be possible to send only the part of the snapshot that is necessary.)


Advantages:

* Simple for receiver

Disadvantages:

* Could be complex for sender
* Isn’t useful if receiver needs to change upstream sources.

== Receiver requests gapped sequences ==

This method involves the receiver detecting when an outage occurs and making an upstream request for the sequence numbers of all messages (or message blocks) not received. The sender would then retransmit the lost messages (or blocks) to the receiver.

The receiver would then place the lost messages along with all subsequently received messages into a “reorder” buffer. The receiver can then internally “play back” the messages from the reorder buffer to rebuild the current state.


Advantages:

* Useful for clients that need to build full-tick databases and thus need the lost messages anyway.

Disadvantages:

* Thought to be complex and impractical to implement. The reorder buffer could grow to large sizes and might take significant resources to store and apply.
* The bandwidth necessary to retransmit all lost messages may exceed the bandwidth necessary to retransmit the current state of all instruments.
* Doesn’t help when a ticker-plant switch occurs.

== Sender analyzes gapped sequences ==

This scheme is a variant on [[#Receiver requests gapped sequences]]. The receiver detects when an outage occurs and makes an upstream request for the sequence numbers of all messages (or message blocks) not received.

Upon receipt of the request the sender would generate a series of snapshots for all instruments that had real-time updates present in the lost messages. It can do this by analyzing the contents of the messages that it sent but the receiver did not obtain. The sender would also have to inform the receiver when all snapshots have been sent so the receiver can transition the remaining instruments into a “not stale” state.


Advantages:

* May be useful in conjunction with gap prevention. That is, the sender could try resending the lost messages themselves if there is a good chance the receiver will receive them before timing out. If the receiver does time out, the sender could fall back to the above snapshot system.
* May be simple for receivers

Disadvantages:

* May be complicated for senders
* Doesn’t help when a ticker-plant switch occurs.

Notes:

* Either in-sequence or out-of-sequence snapshot transmissions could be used. (See [[#Snapshot recovery types]] for more info.) The receiver need not send the requests to the sender; it could send them to another (more reliable) receiver.

== Receiver could ask if update necessary ==

This is a variant of [[#Receiver requests snapshot for all stale instruments]], however, in this system the receiver sends the sequence number of the last message that updated the instrument (SNLM) with the request. The sender can then compare its SNLM with the receiver’s and either send an “instrument okay” message or a full snapshot in response.


Advantages:

* Reduces downstream bandwidth necessary after an outage

Disadvantages:

* Doesn’t work well in cases where instruments are updating, because the receiver and sender may be at different points in the update stream
* Lost create messages: see disadvantages of [[#Receiver requests snapshot for all stale instruments]]

== Receiver could ask with hash ==

This is a variant of [[#Receiver could ask if update necessary]], however, in this system the receiver sends a hash value of the current instrument’s database record with the request. The sender can then compare its database hash value with the receiver’s and either send an “instrument okay” message or a full snapshot in response.


Advantages:

* Works during tp switches

Disadvantages:

* Doesn’t work well in cases where instruments are updating, because the hash values are unlikely to match if sender and receiver are at a different point in the update stream.
* Rollout issues: see [[#Periodic hash checks]]
* Lost create messages: see disadvantages of [[#Receiver requests snapshot for all stale instruments]]

= Important considerations =

In many stale detection and correction systems there are several “corner cases” that can be difficult to handle. Planning for these cases in advance can simplify later development.

The following is a list of “corner cases” and miscellaneous ideas:

== Ticker plant switches ==

It can be difficult to handle the case where a receiver starts obtaining messages from a different ticker-plant. Our generated sequence numbers won’t be synchronized between the ticker-plants. Many of the above schemes would need to place any affected instruments into a “suspect” state should a tp switch occur.

Even if one could guarantee that no update messages were lost during a tp switch (for example by using exchange sequence numbers) there might still be additional work. The old ticker-plant might have been sending incorrect or incomplete messages — indeed, that may have been the reason for the tp switch.

== Lost IPO message ==

When the real-time feed gaps, it is possible that a message that would have created a new instrument was lost. An automatic recovery process should be capable of recovering this lost information.

There are [[schemes to detect extra and missing records]].

== Lost delete message ==

Similar to the IPO case, a real-time gap could have lost an instrument delete message. An automatic recovery process should be able to properly handle this.

A stranger, but technically possible, situation involves losing a combination of delete and create messages for the same instrument. The recovery process should be robust enough to ensure that full resynchronization is possible regardless of the real-time message update content.

There are [[schemes to detect extra and missing records]].

== Exchange update patterns ==

Some exchanges have a small number of instruments that update relatively frequently (eg, equities). Other exchanges have a large number of instruments that individually update infrequently, but have a large aggregate update (eg, US options).

Schemes for gap detection and correction should be aware of these differences and be capable of handling both gracefully.

== Orderbooks ==

Recovering orderbooks can be a difficult process. However, getting it right can result in dramatic improvements to their quality, because orderbooks are more susceptible to problems resulting from lost messages.

The key to getting orderbooks correct is finding good solutions to all of the above corner cases. Orderbooks have frequent record creates and deletes. They also have the peculiar situation where some of the orders (those at the “top”) update with very high frequency, but most other orders (not at the “top”) update very infrequently.

== Sources can legitimately idle ==

Many exchanges follow a pattern of high traffic during market hours, but little to no traffic on off hours. Ironically, the traffic near idle periods can be extremely important (eg, opens, closes, deletes, resets).

It is important to make sure a detection scheme can handle the case where a gap occurs around the time of a legitimate feed idle. It should also be able to do so in a reasonable amount of time. (An example of this is the “keep alive” in the above sequence number scheme.)

== Variations of stale ==

A record is sometimes thought to be either “clean” or “stale”. However, it is possible to graduate and/or qualify what stale means. That is, it is possible to be “suspect” or “suspect for a reason” instead of just being “stale”.

Possible per-instrument stale conditions:

  • clean: the instrument is not stale
  • possible gap: gap in sequence number that could affect many instruments
  • definite gap: some recovery schemes can determine when an instrument has definitely lost updates
  • upstream possible gap: the tp might have seen a sequence gap from the exchange
  • upstream definite gap: the tp might have deduced which instruments actually gapped from exchange
  • stale due to csp startup: the csp was recently started and has an unknown cache state
  • suspect due to tp switch: a ticker-plant switch occurred
  • pre-clean: possible state in out-of-order snapshot recovery schemes
  • downstream gap: in some schemes a sender can inform a receiver that it lost updates
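
The graduated conditions above map naturally onto an enumeration. The following is only a sketch: the type name StaleState, its enumerator names and the helpers isClean() and describe() are hypothetical and not taken from any real ticker-plant code.

```cpp
#include <cassert>
#include <string>

// Hypothetical names for the per-instrument conditions listed above.
enum class StaleState {
    Clean,
    PossibleGap,
    DefiniteGap,
    UpstreamPossibleGap,
    UpstreamDefiniteGap,
    CspStartup,
    TpSwitch,
    PreClean,
    DownstreamGap
};

// Only Clean means "not stale"; every other value carries a reason.
inline bool isClean(StaleState s) { return s == StaleState::Clean; }

// Human-readable reason, using the wording of the list.
inline std::string describe(StaleState s) {
    switch (s) {
    case StaleState::Clean:               return "clean";
    case StaleState::PossibleGap:         return "possible gap";
    case StaleState::DefiniteGap:         return "definite gap";
    case StaleState::UpstreamPossibleGap: return "upstream possible gap";
    case StaleState::UpstreamDefiniteGap: return "upstream definite gap";
    case StaleState::CspStartup:          return "stale due to csp startup";
    case StaleState::TpSwitch:            return "suspect due to tp switch";
    case StaleState::PreClean:            return "pre-clean";
    case StaleState::DownstreamGap:       return "downstream gap";
    }
    return "unknown";
}
```

A consumer can then act on isClean() alone while still logging or displaying the specific reason.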

Y more threads !! help`throughput if I/O bound

To keep things concrete, you can think of the output interface as the I/O.

The paradox: given an I/O bound busy server, the conventional wisdom says more threads could increase CPU utilization [1]. However, the work queue for CPU gets quickly /drained/, whereas the I/O queue is constantly full, as the I/O subsystem is working at full capacity.

[1] In a CPU bound server, adding 20 threads will likely create 20 idle, starved new threads!

Holy Grail is simultaneous saturation. Suggestion: “steal” a cpu core from this engine and use it for unrelated tasks. Additional threads or processes basically achieve that purpose. In other words, the cpu cores aren’t dedicated to this purpose.

Assumption — adding more I/O hardware is not possible. (Instead, scaling out to more nodes could help.)

If the CPU cores are dedicated, then there’s no way to improve throughput without adding more I/O capacity. At a high level, I clearly see too much CPU /overcapacity/.

how2gauge TPS capacity@mkt-data engine

Pump in an artificial feed. Increase the input TPS rate until either

  1. CPU utilization hits 100%, or
  2. messages get dropped

The input TPS at that point is the highest acceptable rate, i.e. the “capacity” of this one process.

Note each feed has its own business logic complexity level, so the same software may have 600k TPS capacity for a simple Feed A but only 100k TPS for a complex Feed B.

Also, in my experience the input interface is the bottleneck compared to the output interface. If System X feeds into System Y, then we want to have System X pumping at 50% of Y’s capacity. In fact, we actually monitor the live TPS rate. The difference between that and the capacity is the “headway”.
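
The arithmetic is trivial but worth pinning down. In this sketch the function names headway() and upstreamRateOk() are hypothetical, and reading the 50% guideline as "no more than half" is my own assumption.

```cpp
#include <cassert>

// "Headway" as defined above: measured capacity minus the live rate, both in TPS.
inline long headway(long capacityTps, long liveTps) {
    return capacityTps - liveTps;
}

// Assumed reading of the 50% guideline: X should pump at no more than half of Y's capacity.
inline bool upstreamRateOk(long xRateTps, long yCapacityTps) {
    return 2 * xRateTps <= yCapacityTps;
}
```

For example, a process with a 600k TPS capacity that sees a live rate of 450k TPS has a headway of 150k TPS.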

most popular/important instruments by Singapore banks

I spoke to a derivative market data vendor’s presales. Let’s just say it’s a lady named AA.

Without referring specifically to Singapore market, she said in all banks (i guess she means trading departments) FX is the bread and butter. She said FX desk is the heaviest desk. She said interest rate might be the 2nd most important instrument. Equities and commodities are not …(heavy/active?) among banks.

I feel commercial banks generally prefer currencies and high quality bonds over equities, unrated bonds and commodities. Worldwide, commercial banks’ lending business model is most dependent on interest rates. Singapore being an import/export trading hub, its banks have more forex exposure than US or Japanese banks. Their use of credit products is interesting.

AA later cited credit derivatives as potentially the 2nd most useful derivative market data for a typical Singapore bank (FXVol being the #1). Actually, most banks don’t trade a lot of credit derivatives, but they need the market data for analysis (like CVA) and risk management. She gave an example: say your bank enters a long-term OTC contract with BNP. You need to assess BNP’s default probability as part of counterparty risk. The credit derivative market data would be relevant. I think the most common is CDS.

(Remember this vendor is a specialist in derivative market data.)

The FX desk of most banks makes the bulk of its money from FXO, not FX spot. She felt spot volume is higher but margin is as low as 0.1 pip, with competition from EBS and other electronic liquidity venues. What she didn’t say is that the FXO market is less crowded.

She agreed that many products are moving to the exchanges, but OTC model is more flexible.

same dealers are behind multiple mkt-data aggregators

I spoke to a derivative market data vendor’s presales. Let’s just say it’s a lady named AA.

A market data vendor adds value by “calibrating” raw data (quotes, i guess) collected from contributors. AA told me they are NOT purely aggregators.

Before calibration, basic data cleansing would include …. removing outliers.

–Anonymous quotes, Executable quotes and Busted trades —

Some vendors would reveal the contributor identity behind a quote, while other vendors keep their identities confidential. If citi is a dealer/market-maker on a particular instrument, then I guess citi is likely to give more authentic (probably tighter) quotes to the second kind of vendor.

In reality, for a given currency pair or major index (and derivatives thereof), the big dealers are often well known, so in the 2nd case it’s not hard (according to AA) to guess who is behind an anonymous quote.

Whether the contributor/mkt-maker is identified or anonymous, you can’t really trade on the quote without calling them. ECNs provide executable, tradable quotes, whereas mkt data vendors only provide informational or indicative quotes, which aren’t executable. Is the difference a big deal? Probably not. As illustrated in , unlike on regulated exchanges, OTC executable quotes are not truly guaranteed because trade execution happens in the dealer’s system, not in the ECN, since ECNs don’t have capital to hold positions. A dealer can reject an order on any “reasonable” ground. In other words, the trade is subject to the dealer’s approval. Either the market taker or the dealer can even cancel the trade long after execution, for example due to technical error. On the exchange, that would be a busted trade and is always initiated and ruled by the exchange; investors have no say.

I suppose busted trades should be clearly justifiable. They are supposed to protect the integrity of the market and maintain investor confidence. It’s like public service. Therefore, they probably happen due to regulator pressure or public pressure.

no 2 thread for 1 symbol: fastest mkt-data distributor

Quotes (and other market data) sent downstream should be in FIFO sequence, not out-of-sequence (OOS).

In FX and cash equities (eg EURUSD), I know many major market data aggregators design the core of the feed engine to be single-threaded per symbol: each symbol is confined to a single “owning” thread. I was told the main reason is to avoid synchronization between 2 load-sharing threads. 2 threads improve throughput but can introduce OOS risk.
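
One simple way to confine a symbol to a single owning thread is to shard by a hash of the symbol. This is a sketch under my own assumptions: ownerOf() is a hypothetical helper, and real engines may use a static symbol-to-thread table instead of a hash.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <string>

// Map a symbol to the index of its single owning worker thread.
// Precondition: workerCount > 0. The same symbol always maps to the same index,
// so all quotes on one symbol are processed in arrival order by one thread.
inline std::size_t ownerOf(const std::string& symbol, std::size_t workerCount) {
    return std::hash<std::string>{}(symbol) % workerCount;
}
```

Throughput still scales across symbols, but no two threads ever share a symbol, so there is nothing to synchronize and no way to reorder quotes on one name.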

You can think of a typical stringent client as a buy-side high-frequency trader (HFT). This client assumes later-delivered quote is physically “generated” later. If 2 quotes arrive on the same name, one by one, then the later one always overwrites the earlier one – conflation.

A client’s HFT can react in microseconds, from receiving quote (data entering client’s network) to placing orders (data leaving client’s network). For such a fast client, a little bit of delay can be quite bad, but not as bad as OOS. I feel OOS delivery makes the data feed unreliable.

I was told many automated algo trading engines (including automatic offer/bid pricers in bond) send fake orders just to test the market. It sends a test order and waits for the response in the data feed. An OOS delivery would confuse this “observer”.

An HFT could be trend-sensitive. It monitors the rise and fall of sizes of the quotes on a given name (say SPX). It assumes the market data are delivered in-sequence.

tick data feed – mkt-data distributor

  • You have a ticker feed, a queue that the feed can push onto, data workers that consume the queue, and subscribing listeners to the data workers.
  • Explain a non-blocking, scalable algorithm for how the data workers can transfer the data to the listeners.
  • Suppose some listeners consume messages slower than others. Suppose some listeners are only interested in certain tickers.
  • Suppose the ticker feed pushes data on the queue in bursts, how might you ensure the data workers don’t needlessly poll an empty queue?

That’s the original question. I feel this is a fairly realistic market data system. I covered a similar requirement in

I feel a data worker deals with 2 “external systems” — big queue + listeners. The queue is a messaging FIFO thing (not necessarily JMS queue) to be polled. When it’s empty, data worker threads should wait in wait(). A socket thread can probably receive new data and notify data workers. When all the workers have processed a given message, it needs to be removed — See the same blog above.
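
To illustrate the "wait instead of poll" idea, here is a minimal blocking FIFO in C++ (the wait() above is the Java idiom; std::condition_variable plays the same role). The class name TickQueue is hypothetical, and ticks are modelled as plain strings just to keep the sketch short.

```cpp
#include <cassert>
#include <condition_variable>
#include <deque>
#include <mutex>
#include <string>
#include <utility>

// Minimal blocking FIFO: consumers sleep on a condition variable while the queue is empty,
// so an idle data worker burns no CPU between bursts.
class TickQueue {
public:
    void push(std::string tick) {
        {
            std::lock_guard<std::mutex> lock(mu_);
            q_.push_back(std::move(tick));
        }
        notEmpty_.notify_one();  // wake one waiting data worker
    }

    std::string pop() {  // blocks until a tick is available
        std::unique_lock<std::mutex> lock(mu_);
        notEmpty_.wait(lock, [this] { return !q_.empty(); });
        std::string tick = std::move(q_.front());
        q_.pop_front();
        return tick;
    }

private:
    std::mutex mu_;
    std::condition_variable notEmpty_;
    std::deque<std::string> q_;
};
```

The socket thread would call push() and each data worker would loop on pop(). Note this sketch is lock-based; it addresses the empty-queue polling question, not the non-blocking requirement.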

I feel non-blocking kind of comes by-default, but maybe I don’t understand the question.

Message filter should probably be implemented in the data worker. See the same blog above.

To ensure slow listeners don’t block fast listeners, I feel we need multiple threads per data worker. In contrast, the simple design is that the data worker thread is also the thread running the onMessage() methods of all the listeners. A multi-threaded data worker is probably a common design.

OPRA feed processing – load sharing

On the Frontline, one (or more) socket receives the raw feed. Number of sockets is dictated by the data provider system. Socket somehow feeds into tibco (possibly on many boxes). Optionally, we could normalize or enrich the raw data.

Tibco then multicasts messages on (up to) 1 million channels i.e. hierarchical subjects. For each underlier, there are potentially hundreds of option tickers. Each option is traded on (up to) 7 exchanges (CBOE is #1). Therefore there’s one tibco subject for each ticker/exchange. These add up to 1 million hierarchical subjects.

Now, that’s the firmwide market data distributor, the so-called producer. My system, one of the consumers, actually subscribes to most of these subjects. This entails a dual challenge

* we can’t run one thread per subject. Too many threads.
* Can we subscribe to all the subjects in one JVM? I guess possible if one machine has enough kernel threads. In reality, we use up to 500 machines to cope with this volume of subjects.

We ended up grouping thousands of subjects into each JVM instance. All of these instances are kept busy by the OPRA messages.

Note it’s not acceptable to skip any single message in our OPRA feed because the feed is incremental and cumulative.

no lock no multiplex`no collections/autobox`:mkt-data sys

Locks and condition variables are important to threading and Wall street systems, but the highest performance market data systems don't use those. Many of them don't use java at all.

They dedicate a CPU to a single thread, eliminating context switches. The thread reads a (possibly fixed-size) chunk of data from a single socket and puts the data into a buffer, then it goes back to the socket, non-stop, until there's no more data to read. At that time, the read operation blocks the thread and the exclusive CPU. Subsequent processing on the buffer is asynchronous and on a different thread. This 2nd thread can also get a dedicated CPU.
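
The hand-off buffer between the socket thread and the processing thread can be built without locks when there is exactly one producer and one consumer. The following single-producer/single-consumer ring is a sketch under that assumption; the class name SpscRing is hypothetical and I am not claiming any particular feed handler uses this exact layout.

```cpp
#include <array>
#include <atomic>
#include <cassert>
#include <cstddef>

// Single-producer/single-consumer ring buffer. The socket thread calls tryPush,
// the processing thread calls tryPop. No locks, no condition variables.
// One slot is kept empty to tell "full" from "empty", so it holds at most N-1 items.
template <typename T, std::size_t N>
class SpscRing {
    static_assert(N >= 2, "ring needs at least two slots");
public:
    bool tryPush(const T& item) {  // producer thread only
        const std::size_t head = head_.load(std::memory_order_relaxed);
        const std::size_t next = (head + 1) % N;
        if (next == tail_.load(std::memory_order_acquire)) return false;  // full
        slots_[head] = item;
        head_.store(next, std::memory_order_release);  // publish the slot
        return true;
    }

    bool tryPop(T& out) {  // consumer thread only
        const std::size_t tail = tail_.load(std::memory_order_relaxed);
        if (tail == head_.load(std::memory_order_acquire)) return false;  // empty
        out = slots_[tail];
        tail_.store((tail + 1) % N, std::memory_order_release);  // free the slot
        return true;
    }

private:
    std::array<T, N> slots_{};
    std::atomic<std::size_t> head_{0};  // next slot to write
    std::atomic<std::size_t> tail_{0};  // next slot to read
};
```

When tryPush returns false the reader has outrun the processor, which is itself a useful capacity signal.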

This design ensures that the socket is consumed at the highest possible speed. (Can you apply 2 CPUs on this socket? I doubt it.) You may notice that the dedicated CPU is idle when socket is empty, but in the context of a high-volume market data feed that's unlikely or a small price to pay for the throughput.

Large live feeds often require dedicated hardware anyway. Dedication means possible under-utilization.

What if you try to be clever and multiplex a single thread among 2 sockets? Well you can apply only one cpu on any one thread! Throughput is lower.

[11] real time high volume FX quote processing #letter

Horizontal scale-out (distributing to different boxes) is the design of choice when we are cpu-bound, for instance if we get hundreds of updates a sec and each update requires repricing a large number of objects.

Ideally, you would want cpu to be saturated. (By using twice the hardware threads, you want throughput to double.) Our pricing engine didn’t have that much cpu load, so we didn’t scale out to more than a few boxes.

The complication of scale-out is that the data required to reprice one object may reside in different boxes. People try many solutions like memory virtualization (non-trivial synchronization cost + network latency), message-passing, RMI, … but I personally prefer the one-big-machine approach. Throw in 16 (or 128) processors, each with say 4 to 8 hardware threads, run 64-bit, throw in 256G RAM. No network latency. No RMI/messaging latency. I think this hardware is rather costly. 8 smaller machines with comparable total CPU power would cost much less, so most big banks prefer that: so-called grid computing.

According to my observations, most practitioners in your type of situations eventually opt for scale-out.

It sounds like after routing a message, your “worker” process has all it needs in its local memory. That would be an ideal use case for parallel processing.

I don’t know if FX spot real time pricing is that ideal. Specifically, suppose a worker process is *dedicated* to update and publish eur/usd spot quote. I know you would listen to the eurusd quotes from all liquidity providers, but do you also need to watch usd/jpy and eur/jpy?

mkt-data subscription engines prefer c++ over C

For a busy feed, Java is usually slower. One of the many reasons is autoboxing. Market data always prefer primitive integers (rather than floats), char-arrays (rather than null-terminated or fancy strings).

I think another reason is the garbage collector, which is non-deterministic. I feel explicit free() is fast and efficient [1].

A market data engineer at 2-Sigma said C++ is the language of choice, rather than C or java. Some market data subscription engines use C to simulate basic C++ features.

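[1] free(3) is a standard library function, not a syscall (manpage section 2). It usually involves no kernel call, though the allocator may occasionally return memory to the OS (e.g. via munmap or brk).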

order book live feed – stream^snapshot

Refer to the blog on order-driven vs quote-driven markets. Suppose an ECN receives limit orders or executable quotes (like limit orders) from traders. The ECN maintains them in an order book. Suppose you are a trader and want to receive all updates to that order book. There are 2 common formats

1) The snapshot feed can simplify development/integration with the ECN as the subscriber application does not need to manage the internal state of the ECN order book. The feed publishes a complete snapshot of the order book at a fixed interval; this interval could be something like 50ms. It is possible to subscribe to a specific list of securities upon connecting.

2) Streaming event-based format. This consumes more bandwidth (1Mb/sec) and requires that the subscriber manage the state of the ECN order book by applying events received over the feed. The advantage of this feed is that the subscriber’s market data will be as accurate as possible and will not have the possible 50ms delay of the snapshot feed.
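
With the streaming format the subscriber owns the book state. A minimal sketch of "applying events" follows. The event layout (Action, BookEvent) and the class LocalBook are hypothetical and much simpler than any real ECN protocol, which would also carry side, sequence numbers and so on.

```cpp
#include <cassert>
#include <cstddef>
#include <map>

// Hypothetical event layout for illustration only.
enum class Action { Add, Modify, Delete };
struct BookEvent { Action action; long orderId; long price; long size; };

// Subscriber-side replica of the order book, keyed by order id.
class LocalBook {
    struct Order { long price; long size; };
    std::map<long, Order> orders_;

public:
    // Returns false when the event cannot be applied (unknown order, or duplicate add).
    // A false return is a hint that an earlier message was lost.
    bool apply(const BookEvent& e) {
        auto it = orders_.find(e.orderId);
        switch (e.action) {
        case Action::Add:
            if (it != orders_.end()) return false;
            orders_[e.orderId] = Order{e.price, e.size};
            return true;
        case Action::Modify:
            if (it == orders_.end()) return false;
            it->second = Order{e.price, e.size};
            return true;
        case Action::Delete:
            if (it == orders_.end()) return false;
            orders_.erase(it);
            return true;
        }
        return false;
    }

    std::size_t openOrders() const { return orders_.size(); }
};
```

The snapshot feed needs none of this: each snapshot simply replaces whatever the subscriber held before.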

simplified wire data format in mkt-data feed

struct quote {
  char symbol[7]; // fixed width; null terminator not needed
  int bidPrice;   // integer price, no float please
  int bidSize;
  int offerPrice;
  int offerSize;
};
Assuming 4-byte ints and a packed layout (no padding), this is a 23-byte fixed-width record, extremely network friendly.

Q: write a hash function q[ int hashCode(char * char7) ] for 300,000 symbols
%%A: (c[0]*31 + c[1])*31 …. but this is too slow for this options exchange

I was told there’s a solution with no loop and no math. I guess it is some kind of bitwise operation on the 56 bits of char7.
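
One possible reading of "no loop no math" is to treat the 7 bytes as a single integer. The sketch below is my own guess, not the exchange's actual solution: packSymbol() and this hashCode() signature (with a power-of-two table mask) are hypothetical, and I make no claim about how well the folded bits distribute over real symbols.

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>

// Copy the 7 symbol bytes into a zeroed 64-bit integer. Distinct symbols give distinct keys,
// so the key itself can serve as a collision-free 64-bit hash.
inline std::uint64_t packSymbol(const char* char7) {
    std::uint64_t key = 0;
    std::memcpy(&key, char7, 7);  // fixed-size copy, no explicit loop in our code
    return key;
}

// Fold the key down to a table index using only shifts, xors and a mask.
// tableMask is assumed to be (power of two) - 1, e.g. 0x7FFFF for 524288 slots.
inline std::uint32_t hashCode(const char* char7, std::uint32_t tableMask) {
    std::uint64_t k = packSymbol(char7);
    k ^= k >> 32;
    k ^= k >> 16;
    return static_cast<std::uint32_t>(k) & tableMask;
}
```

For 300,000 symbols a 2^19 slot table (mask 0x7FFFF) keeps the load factor under 60%.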

mkt-data favor arrays+fixed width, !! java OBJECTs

(See also post on market data and autoboxing.)

In the post on “size of” we see every single instance needs at least 8 bytes of book-keeping.

Therefore a primitive array (say an array of 9 ints) has much lower overhead than a Collection of Integer objects:
– array takes 4 bytes x 9 + at least 8 bytes of book-keeping
– collection takes at least sizeof(Integer) x 9 + book-keeping. Each Integer takes at least 4 bytes of payload + 8 bytes of book-keeping.
That is roughly 44 bytes versus at least 108 bytes, before counting the collection’s own per-element references.

Market-data engines get millions of primitives per second. They must use either c++ or java primitive arrays.

Market data uses lots of ints and chars. For chars, again java String is too wasteful. Most efficient would be C-string without null terminator.

The fastest market data feed is fixed-width, so no bandwidth is wasted on delimiter bits.

Floats are imprecise. Am not an experienced practitioner, but i don’t see any raw market data info represented in floating points.

Java objects also increase garbage collection. Often an indeterminate full GC hurts transaction FIX engines more than the market data FIX engines, but in HFT shops the latter need extreme consistency. An unpredictable pause in market data FIX can shut out an HFT auto-trader for a number of milliseconds, during the most volatile moment such as a market crash.

necessity: some trading module imt others

“non-optional + non-trivial” is the key.

Context – trading systems.

I feel trade booking/capture is among the “least optional”. Similarly, settlement, cash mgmt, GL, position database, daily pnl (incl. unrealized). Even the smallest trading shops have these automated. Reasons – automation, reliability, volume. Relational Database is necessary and proven. These are Generally the very first wave of boring, low-pay IT systems. In contrast a lot of new, hot technologies look experimental, evolving and not undoubtedly necessary or proven —

* Sophisticated risk engine is less proven. I don’t know if traders really trust it.
* Pre-trade analysis is less proven.
* huge Market data often feed into research department, risk/analysis systems. I feel some small portion of market data is necessary.
* models
* Algo trading, often based on market data and models
* object DB, dist cache, cloud aren’t always needed
* MOM? I guess some trading systems don’t use MOM, but that is very rare.

mkt-data is primarily collected for … quants in research dept

(background — there are many paper designs to process market data. The chosen designs on wall street reflect the intended use of this data.)

I won’t say “by traders” since it’s too much data for human consumption. It must be digested. Filtering is one of many (primitive) forms of digestion.

I won’t say “by trading systems” as that’s vague. Which part of the trading system?

I won’t say “by algo trading engines”. What’s the algo? The abstract algo (before the implementation) is designed by quants based on models, not designed by IT. Traders may design too.

Q:Who has the knowledge to interpret/analyze such flood of market data?
A: not IT. We don’t have the product knowledge
A: not traders. In its raw form such market data is probably unusable.
A: quantitative researchers by definition are responsible for analyzing quantitative data.
A: data scientists also need to understand the domain and can use the data to extract insight.

which "Technology" were relevant to 2010 GS annual report #upstream

(Another blog post. See also the post on Upstream)

Re the GS annual report page about Technology… When yet another market goes electronic, commissions drop, bid/ask spreads drop, profit margins drop, trade volumes increase, competition intensifies. So which IT systems will rise?

* market data, which tends to explode when a market goes electronic
  * tick data
* trade execution, including
  * order state management
  * order matching, order book
* real time risk assessment
* real time deal pricing
* offer/bid price adjustment upon market events
* database performance tuning
* distributed cache
* MOM ie Message-oriented-middleware
* multi-processor machines
* grid computing

How about back office systems? If volumes escalate, I feel back office systems will need higher capacity but no stringent real time requirements.

On the other hand, what IT systems will shrink, fade away, phase out? Not sure, but overall business user population may drop as system goes low-touch. If that happens, then IT budget for some departments will shrink, even though overall IT budget may rise.

In a nutshell, some systems will rise in prominence, while others fall.

Tick data repository – real-time/historical mkt-data

I think this was a CS interview…

Used for 1) back-testing, 2) trading signal generation, even if no real order sent, and 3) algorithmic real-$ trading.

First target user is back testing. Users would also try 1 -> 2 -> 3 in that sequence.

Stores FX, FI and equity market data. Therefore the system treats everything just as generic tick data either quote ticks or trade ticks.

Multiple TB of data in memory. Time-series (not SQL) database.

Created using c++ and java. Probably due to sockets.

mkt-data engines hate pre-java8 autoboxing

(See also post on market data and size of Object)

High volume market data systems deal with primitives like ints, char-arrays and occasionally bit-arrays. These often need to go into collections, like vector of int. These systems avoid DAM (dynamic memory) allocation like the plague.

Specifically, Java collections auto-boxing results in excessive memory allocation — a new object for every primitive item. A major justification for c++ STL, according to a CS veteran.

Sun also says autoboxing is inappropriate for high performance.

C# generic collections are free of autoboxing, just like STL, and so are the primitive streams in java 8. But I don’t know if market data systems have actually embraced primitive streams.

ION-mkt — nested event handler

MkvPlatformListener — Platform events such as connection status.
MkvPublishListener — Data Events like publish and unpublish.
MkvRecordListener — Data Events like Record updates
MkvChainListener — Data Events like Chain updates.

class CustomChainListener implements MkvChainListener {
  public void onSupply(MkvChain chain, String record, int pos,
      MkvChainAction action) {
    System.out.println("Updated " + chain.getName() + " record: " + record
        + " pos: " + pos + " action: " + action);
  }
}

class CustomPublishListener implements MkvPublishListener {
  public void onPublish(MkvObject object, boolean start, boolean dwl) {
    if (object.getMkvObjectType() == MkvObjectType.CHAIN
        && object.getName().equals("MY_CHAIN") && start) {
      MkvChain mychain = (MkvChain) object;
      System.out.println("Published " + mychain.getName());
      try {
        // the new() creates a listener just like swing ActionListeners
        mychain.subscribe(new CustomChainListener());
      } catch (MkvConnectionException e) {}
    }
  }
  public void onPublishIdle(String component, boolean start) {}
  public void onSubscribe(MkvObject obj) {}
}

Mkv.getInstance().getPublishManager().addPublishListener(new CustomPublishListener());

common quote pricing rules + live feeds for munis

Rule: (same as FX day trader, and applicable to market making only) based on inventory changes
Rule: dollar cost averaging
Rule: based on BMA index, which is typically 70% of Libor, assuming a 30% tax exemption
Rule: on the muni desk, the swap positions and quotes are priced using SIFMA swap index and ED futures
Rule: on the muni desk, the T (and futures) positions and quotes are priced using T/T-futures price feed
Rule: based on ETF prices. If one of our quoted bonds is part of one (or several) ETF, we price our bid/ask using live ETF prices
Rule: based on Evaluation prices, received twice daily (JJ Kenny and InteractiveData)
Rule (one of the most important): based on last-traded quantity/price reported to MSRB
Rule: based on “pins” controlled by senior traders of the desk
Rule: based on The One Muni Curve of the trading desk
Rule: stock (hedging) positions (not quotes) are priced using stock price feeds

Other feeds – ION, SBA (SmallBizAdmin) feed

“Based on” generally
Always apply a bid/ask spread matrix (call dates/quantity as x/y dimensions)
Always apply a commission matrix
Apply odd lot discount matrix
Always convert from clean to dirty price
Always check for occurrence of bid/ask inversion

##what mkt-data feed your autoReo pricing engine@@

background — every algo pricing/trading engine needs market data feeds.

For my algo engine,
– Libor, ED futures
– Treasury futures
– Treasury ticks
– Bloomberg
– credit rating updates
– material event feed
– ION mkt view
– JJ Kenny

There are a few essential INTERNAL feeds to my algo engines —
– positions,
– marks by other desks,  and other algo engines.

eq listed drv desk

Some basic info from a friend –

Equity Listed derivatives – mostly options on single stocks or options on index/future, but also variance-swaps. Even if a stock has no listed options, we would still create a vol surface so as to price OTC options on it, but the technique would be different — The standard technique if given many pairs of {expiration, strike} is to fit a curve on a single expiration, then create similar curves for other expirations on the same underlyer (say IBM), then try to consolidate all IBM curves into a smooth IBM vol surface. Each “point” on the surface is an implied vol value. I was told some of the more advanced “fitting” math is extracted out into a C++ quant lib.

Instrument pricing has to be fast, not multi-second. I guess this is pre-trade, RFQ bid/offer pricing, similar to bond markets’ bid-wanted. In contrast, the more “real” need for vol surface is position pricing (or mark-to-market), which provides unrealized PnL. I feel this is usually end-of-day, but some traders actually want it real time. Beside the traders on the flow[3]/listed/OTC derivative desks, the vol surface is also used by many other systems such as structured derivatives, which are entirely OTC.

It’s quite hard to be really event-driven since the events are too frequent, instruments too numerous, and the pricing algo non-trivial, exactly like FX option real time risk. Instead, you can schedule periodic repricing batches once every few minutes.

About 3500 underliers and about 450,000 derivative instruments. Average 100 derivatives on each underlier (100 combinations of strike/tenor). S&P500 has more than 1000 derivatives on it.

Market data vendors: Reuters, Wombat, Bloomberg.

Inputs to vol calculation — product reference (strike/tenor), live market quotes, dividend, interest rate …

One of the most common OTC equity derivatives is barrier option.

Pricing and risk tend to be the most mathematically challenging.

Exchange connectivity is usually c++, client connectivity (clients to send orders or receive market data) is usually java.

[3] Flow means agency trading, mostly for institutional clients. Retail clients are very wealthy. Those ordinary retail investors won’t use an investment bank. Flow equity derivative can be listed or OTC.

mkt-data subscription engine java IV #barc eq drv

Say you have market data feeds from Reuters, Wombat, Bloomberg, eSpeed, BrokerTec, ION… Data covers some 4000 underliers and about half a million derivative instruments on these underliers. For each instrument, there can be new bid/offer/trade ticks at any millisecond mark[1]. Volume is similar to option data feed like OPRA.

Say you have institutional clients (in addition to in-house systems) who register to receive IBM ticks when a combination of conditions occurs, like “when bid/ask spread reaches X, and when some other pricing pattern occurs”. There are other conditions like “send me the 11am IBM snapshot best bid/ask”, but let’s put those aside. For each of the instruments, there are probably a few combinations of conditions, but each client could have a different target value for a condition: 2% for u, 2.5% for me. Assuming just 10 combinations for each instrument, we have 5 million combinations to monitor. To fulfill clients, we must continuously evaluate these conditions. CEP and Gemfire continuous query have this functionality.

I proposed a heavily multi-threaded architecture. Each thread is event-driven (primary event) and wakes up to reevaluate a bunch of conditions and generate secondary events to be sent out. It can drop the new 2ndary event into a queue so as to quickly return. The “consumer” can pick up the 2ndary events and send out by multicast.

Each market data vendor (Reuters, e-speed, ION, even tibrv) provides a “client-runtime” in the form of a jar or DLL. You embed the client-runtime into your VM, and it may create private threads dedicated to communicating with the remote publisher.

[1] Each IBM tick actually has about 10 fields, but each IBM update from the vendor only contains 2 fields if the other fields of the symbol didn’t change. So we need something like Gemfire to reconstruct the entire 10-field object.
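
The reconstruction in footnote [1] amounts to a last-value cache that merges each partial update into the full record. This is a bare sketch of that idea, not how Gemfire works internally; LastValueCache, FullTick and Delta are hypothetical names, and fields are modelled as integers indexed 0 to 9.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <string>
#include <unordered_map>
#include <utility>
#include <vector>

constexpr std::size_t kFieldCount = 10;  // per the footnote: about 10 fields per tick
using FullTick = std::array<long, kFieldCount>;
using Delta = std::vector<std::pair<std::size_t, long>>;  // (field index, new value)

// Keeps the latest full record per symbol and merges partial updates into it.
class LastValueCache {
public:
    const FullTick& apply(const std::string& symbol, const Delta& delta) {
        FullTick& tick = cache_[symbol];  // all zeros the first time a symbol is seen
        for (const auto& field : delta) {
            if (field.first < kFieldCount) tick[field.first] = field.second;
        }
        return tick;  // the reconstructed full record
    }

private:
    std::unordered_map<std::string, FullTick> cache_;
};
```

Condition evaluation can then run against the returned full record rather than the 2-field delta.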

real-world OMS c++data structure for large collection of orders

#include <memory>
#include <unordered_map>

class OrderMap {
  struct data_holder { /* fields only. no methods */ };
  std::unordered_map<int, std::shared_ptr<data_holder>> orders; // keyed by orderID
  … // methods to access the orders
};

Each data_holder is instantiated on heap, and the new pointer goes into a shared_ptr. No need to worry about memory.

Do you see ANY drawback, from ANY angle?

GUI for fast-changing mkt-data (AMM

For a given symbol, market data MOM could pump in many (possibly thousands of) messages per second. Client jvm would receive all updates by regular MOM listener, and update a small (possibly one-element) hashmap by repeated overwriting in the same memory location. Given the update frequency, synchronization must be avoided by either CAS or double-buffering, in the case of an object or a long/double. For an int or float, a regular volatile field might be sufficient.

Humans don’t like screen updated so frequently. Solution – GUI worker thread (like swing timer) to query the local cache every 2 sec — a reasonable refresh rate. It will miss a lot of updates but fine.
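
The overwrite-then-sample pattern is easy to show in a few lines. The real implementation was in Java; this is a C++ analogue using std::atomic in place of a volatile field, and the class name LatestQuote is hypothetical.

```cpp
#include <atomic>
#include <cassert>

// One slot per symbol: the feed thread overwrites, the GUI timer thread samples.
// Intermediate values are deliberately lost (conflation).
class LatestQuote {
public:
    // Called by the MOM listener thread on every update.
    void onUpdate(double price) { last_.store(price, std::memory_order_release); }

    // Called by the GUI refresh timer, e.g. every 2 seconds.
    double sample() const { return last_.load(std::memory_order_acquire); }

private:
    std::atomic<double> last_{0.0};
};
```

Neither thread ever waits for the other, which is the point: the feed side must never be slowed down by the screen.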

(Based on a real implementation hooked to an OPRA feed)

Changi(citi)c++IV #onMsg dispatcher, STL, clon`#done

I feel this is more like a BestPractice (BP) design question.

void onMsg(Price const& price){ // need to persist to DB in arrival order
// Message rate is very high, so I chose async producer-consumer pattern like

// sharding by symbol name is a common practice

// %%Q: should I dispatch to multiple queues?
// A: Requirement says arrival order, so I decided to use a single queue.

// The actual price object could be a StockPrice or an OptionPrice. To be able to persist the option-specific attributes, the queue consumer must figure out the actual type.

// Also, price object may be modified concurrently. I decided to make a deep copy of the object.

// Q: when to make a copy, in onMsg or the consumer? I personally prefer the latter, but let’s say we need to adopt the former. Here’s a poor solution —

//Instead of dispatch(price)
dispatchPBClone(price); // slicing a StockPrice down to a Price. Here’s my new solution —


// clone() is a virtual method in StockPrice and OptionPrice.
//Q: clone() should allocate memory on the heap, but where is the de-allocation?
//A: Use smart ptr.
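
One way the clone-then-enqueue idea could look is sketched below. Everything here is an assumption for illustration: the field names, the PriceQueue class, the global g_queue and the kind() helper are hypothetical, and std::unique_ptr is just one choice of smart pointer. The point is that a virtual clone() copies the dynamic type, so nothing is sliced, and the queue entry owns the copy, so nobody has to call delete.

```cpp
#include <cassert>
#include <condition_variable>
#include <memory>
#include <mutex>
#include <queue>
#include <string>
#include <utility>

struct Price {
    double px = 0;
    virtual ~Price() = default;
    // Virtual copy: each subclass copies itself as its own type.
    virtual std::unique_ptr<Price> clone() const { return std::make_unique<Price>(*this); }
    virtual std::string kind() const { return "Price"; }
};
struct StockPrice : Price {
    std::string symbol;
    std::unique_ptr<Price> clone() const override { return std::make_unique<StockPrice>(*this); }
    std::string kind() const override { return "StockPrice"; }
};
struct OptionPrice : Price {
    double strike = 0;  // option-specific attribute that slicing would lose
    std::unique_ptr<Price> clone() const override { return std::make_unique<OptionPrice>(*this); }
    std::string kind() const override { return "OptionPrice"; }
};

// Single FIFO queue, so the consumer sees messages in arrival order.
class PriceQueue {
public:
    void push(std::unique_ptr<Price> p) {
        assert(p != nullptr);
        {
            std::lock_guard<std::mutex> guard(m_);
            q_.push(std::move(p));
        }
        cv_.notify_one();
    }
    // Blocks until a message is available, then hands ownership to the caller.
    std::unique_ptr<Price> pop() {
        std::unique_lock<std::mutex> lock(m_);
        cv_.wait(lock, [this] { return !q_.empty(); });
        std::unique_ptr<Price> p = std::move(q_.front());
        q_.pop();
        return p;
    }
private:
    std::mutex m_;
    std::condition_variable cv_;
    std::queue<std::unique_ptr<Price>> q_;
};

PriceQueue g_queue;  // hypothetical global queue shared with the DB-writer thread

// Deep copy happens here, in the producer, before the caller can modify the original.
void onMsg(Price const& price) { g_queue.push(price.clone()); }
```

The consumer thread would loop on g_queue.pop() and use the dynamic type (a virtual persist method or dynamic_cast) to write the option-specific columns.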

select() syscall multiplex vs 1 thread/socket ]mkt-data gateway

Low-volume market data gateways could multiplex using the select() syscall, according to Warren of CS. A single thread can service thousands of low-volume clients. (See my brief write-up on epoll.) With blocking sockets and one thread per socket, each read() and write() could block an entire thread. If 90% of 1000 sockets have full buffers, then 900 threads would block in write(). Too many threads slow down the entire system.

A standard blocking socket server’s main thread blocks in accept(). Upon return, it gets a file handle. It could save the file handle somewhere then go back to accept(). Over time it will collect a bunch of file handles, each being a socket for a particular network client. Another server thread can then use select() to talk to multiple clients, while the main accept() thread continues to wait for new connections.
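
Here is a minimal sketch of the select() step only, assuming POSIX. The helper name readableFds is hypothetical; it takes the file handles collected by the accept() thread and reports which ones have data waiting, so one thread can serve them all.

```cpp
#include <cassert>
#include <sys/select.h>
#include <unistd.h>
#include <vector>

// Returns the subset of fds that are ready to read, waiting at most timeout_ms.
// An empty result means timeout or error.
std::vector<int> readableFds(const std::vector<int>& fds, int timeout_ms) {
    fd_set readSet;
    FD_ZERO(&readSet);
    int maxFd = -1;
    for (int fd : fds) {
        assert(fd >= 0 && fd < FD_SETSIZE);  // select() cannot watch larger descriptors
        FD_SET(fd, &readSet);
        if (fd > maxFd) maxFd = fd;
    }

    timeval tv;
    tv.tv_sec = timeout_ms / 1000;
    tv.tv_usec = (timeout_ms % 1000) * 1000;

    std::vector<int> ready;
    if (select(maxFd + 1, &readSet, nullptr, nullptr, &tv) <= 0) return ready;

    // select() rewrote readSet in place: only the ready descriptors remain set.
    for (int fd : fds) {
        if (FD_ISSET(fd, &readSet)) ready.push_back(fd);
    }
    return ready;
}
```

The serving thread would call this in a loop and read() only from the returned handles, so a quiet client never blocks it. The FD_SETSIZE limit is one reason epoll is usually preferred for very large client counts.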

However, in high-volume mkt data gateways, you might prefer one dedicated thread per socket. This supposedly reduces context switching. I believe in this case there’s a small number of sockets preconfigured, perhaps one socket per exchange. In such a case there’s no benefit in multiplexing. Very different from a Google web server.

This dedicated thread may experience short periods of silence on the socket – I guess market data could come in bursts. I was told the “correct” design is spin-wait, with a short sleep between iterations. I was told there’s no onMsg() in this case. I guess onMsg() requires another thread to wake up the blocking thread. Instead, the spin thread simply sleeps briefly, then reads the socket until there’s no data to read.
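
The loop described above (sleep briefly, then read until there is nothing left) might look like the sketch below, assuming POSIX and a non-blocking descriptor. The function names and the maxIdleSpins cut-off are hypothetical; a real feed handler would spin forever, and the cut-off is only there so the sketch terminates.

```cpp
#include <cassert>
#include <cerrno>
#include <chrono>
#include <cstddef>
#include <fcntl.h>
#include <string>
#include <thread>
#include <unistd.h>

// Switch the descriptor to non-blocking mode so read() returns immediately when empty.
bool setNonBlocking(int fd) {
    int flags = fcntl(fd, F_GETFL, 0);
    return flags != -1 && fcntl(fd, F_SETFL, flags | O_NONBLOCK) != -1;
}

// Read until the kernel buffer is empty (EAGAIN) or the peer closes. Returns bytes appended.
std::size_t drain(int fd, std::string& out) {
    char buf[4096];
    std::size_t total = 0;
    for (;;) {
        ssize_t n = read(fd, buf, sizeof buf);
        if (n > 0) {
            out.append(buf, static_cast<std::size_t>(n));
            total += static_cast<std::size_t>(n);
            continue;
        }
        if (n < 0 && errno == EINTR) continue;  // interrupted, just retry
        break;  // n == 0 (closed), EAGAIN/EWOULDBLOCK (nothing left), or a real error
    }
    return total;
}

// Spin loop: drain the socket, sleep briefly, repeat.
// Stops after maxIdleSpins consecutive empty polls (hypothetical cut-off).
std::size_t spinRead(int fd, std::string& out, int maxIdleSpins) {
    assert(maxIdleSpins > 0);
    std::size_t total = 0;
    int idle = 0;
    while (idle < maxIdleSpins) {
        std::size_t n = drain(fd, out);
        total += n;
        idle = (n == 0) ? idle + 1 : 0;
        std::this_thread::sleep_for(std::chrono::microseconds(50));
    }
    return total;
}
```

No callback and no second thread is involved: the same thread both waits and reads, which is the point of the design.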

If this single thread and this socket are dedicated to each other like husband and wife, then there’s not much difference between blocking vs non-blocking read/write. The reader probably runs in an endless loop, and reads as fast as possible. If non-blocking, then perhaps the thread can do something else when socket buffer is found empty. For blocking socket, the thread is unable to do any useful work while blocked.

I was told UDP asynchronous read will NOT block.

solace^tibcoAppliance #OPRA volume

The Solace JMS broker (Solace Message Router) supports 100,000 messages per second in persistent mode and 10 million messages per second in non-persistent mode. A more detailed article shows 11 million 100-byte non-persistent messages.

A major sell-side’s messaging platform chief said his most important consideration was the deviation of peak-to-average latency and outliers. A small amount of deviation and (good) predictability were key. They chose Solace. has good details.

In all cases (Solace, Tibco, Tervela), hardware-based appliances *promise* at least a 10-fold boost in performance compared to software solutions. Latency within the appliance is predictably low, but the end-to-end latency is not. Because of the separate /devices/ and the network hops between them, the best-case latency is in the tens of microseconds. The next logical step is to integrate the components into a single system to avoid all the network latency and intermediate memory copies (including serializations). Solace has demonstrated sub-microsecond latencies by adding support for inter-process communications (IPC) via shared memory. Developers will be able to fold the ticker feed function, the messaging platform, and the algorithmic engine into the same “application” [1], and use shared memory IPC as the data transport (though I feel a single-application design needs no IPC).

For best results you want to keep each “application” [1] on the same multi-core processor, and nail individual application components (like the feed handler and algo engine) to specific cores. That way, application data can be shared between the cores in the Level 2 cache.
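
Pinning a component to a core could be done along these lines. This is a Linux/glibc-only sketch using sched_setaffinity; the two helper names are hypothetical, and which core to choose for the feed handler versus the algo engine is a deployment decision not covered here.

```cpp
#include <cassert>
#include <sched.h>

// Pin the calling thread to one core. Returns true on success. Linux-specific.
bool pinCurrentThreadTo(int core) {
    assert(core >= 0 && core < CPU_SETSIZE);
    cpu_set_t set;
    CPU_ZERO(&set);
    CPU_SET(core, &set);
    // pid 0 means "the calling thread".
    return sched_setaffinity(0, sizeof set, &set) == 0;
}

// First core the calling thread is currently allowed to run on, or -1 on error.
int firstAllowedCore() {
    cpu_set_t set;
    CPU_ZERO(&set);
    if (sched_getaffinity(0, sizeof set, &set) != 0) return -1;
    for (int c = 0; c < CPU_SETSIZE; ++c) {
        if (CPU_ISSET(c, &set)) return c;
    }
    return -1;
}
```

Each component's thread would call pinCurrentThreadTo() once at startup. Whether two pinned cores actually share a cache level depends on the processor topology, so that has to be checked per machine.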

[1] Each “application” is potentially a multi-process application with multiple address spaces, and may need IPC.

Benchmark — Solace ran tests with a million 100-byte messages per second, achieving an average latency of less than 700 nanoseconds using a single Intel processor. As of 2009, OPRA topped out at about a million messages per second. OPRA hit 869,109 mps (msg/sec) in Apr 2009.

Solace vs RV appliance — Although Solace already offers its own appliance, it runs other messaging software. The Tibco version runs Rendezvous (implemented in ASIC+FPGA), providing a clear differentiator between the Tibco and Solace appliances.

Solace 3260 Message Router is the product chosen by most Wall St. customers. provides good tech insights.

Merrill S’pore: fastest stock broadcast

Updates — RV or multicast topic; msg selector

I think this is a typical Wall Street interview question for a senior role. System requirements as remembered by my friend, the interviewee: ML needs a new relay system to receive real-time stock updates from a stock exchange such as SGX. Each ML client, one of many thousands [1], will install a new client-software [3] to receive updates on the stocks [2] she is interested in. Some clients use algorithmic trading systems and need the fastest feed.

[1] Not clear about the order of magnitude. Let’s target 10,000
[2] Not clear how many stocks per client on average. Let’s target 100.
[3] Maintenance and customer support for a custom client-software is a nightmare and perhaps impractical. Practically, the client-software has to be extremely mature, like browsers or email clients.

Q: database locking?
A: I don’t think so. Only concurrent reading, no write contention.

Key#1 to this capacity planning is how to identify bottlenecks. Bandwidth might be a more severe bottleneck than other bottlenecks described below.

Key#2 — 2 separate architectures for algorithmic clients and traditional clients. Each architecture would meet a different minimum latency standard, perhaps a few seconds for traditional and sub-second for algorithmic.

Solution 0: Whatever broadcasting system SGX uses. In an ideal world, no budget constraint. Highest capacity desired.

Solution 2: no MQ? No asynchronous transmission? As soon as an update is received from SGX, the relay calls each client directly. Server-push.

Solution 1: MQ — the standard solution in my humble opinion.

Solution 1A: topics. One topic per stock. If 2000 clients want IBM updates, they all subscribe to this topic.
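
To make the topic model concrete, here is a toy in-process sketch. It is not a real MQ API: the class TopicRelay and its methods are hypothetical, and real brokers add networking, persistence and flow control. It only shows the shape of Solution 1A: one subscriber list per stock, and one publish call that fans out to everyone on that topic.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <string>
#include <unordered_map>
#include <utility>
#include <vector>

// Toy relay: one topic per stock symbol, many subscribers per topic.
class TopicRelay {
public:
    using Handler = std::function<void(const std::string& symbol, double px)>;

    // A client registers interest in one stock.
    void subscribe(const std::string& symbol, Handler h) {
        assert(h && "subscriber callback must be callable");
        topics_[symbol].push_back(std::move(h));
    }

    // One update from the exchange is delivered to every subscriber of that topic.
    // Returns the number of subscribers reached.
    std::size_t publish(const std::string& symbol, double px) const {
        auto it = topics_.find(symbol);
        if (it == topics_.end()) return 0;  // nobody cares about this stock
        for (const Handler& h : it->second) h(symbol, px);
        return it->second.size();
    }

private:
    std::unordered_map<std::string, std::vector<Handler>> topics_;
};
```

The relay publishes each update once per topic, and the fan-out cost sits inside the messaging layer rather than in the relay's own code.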

Q: client-pull? I think this is the bottleneck.

Q: Would Client-pull introduce additional delays?

Solution 1B: queues. One queue per client per stock. With the targets in [1] and [2] (10,000 clients, 100 stocks each), that is up to 1,000,000 queues.

If 2000 clients want IBM updates, the relay needs to make that many copies of an update and send them to that many queues, which is duplication of effort. I think this is the bottleneck. Not absolutely sure if this affects relay system performance. Massively parallel processing is required, with thousands of native CPU threads (not java green threads).