DB=%% favorite data store due to instrumentation

The noSQL products all provide some GUI/query tool, but none of them is very good. Piroz had to write a web GUI to show the content of gemfire. Without the GUI it’s very hard to manage anything that’s built on gemfire.

As data stores, even binary files are valuable.

Note snoop/capture is not a data store; it falls in the same category as logging. Both are easily suppressed, including critical error messages.

Why is RDBMS my #1 pick? ACID requires every datum to be persistent/durable, therefore viewable from any 3rd-party app, so we aren’t dependent on the writer application.


deadlock avoidance: release all locks fail-fast

In some deadlock-avoidance algorithms, at runtime we need to ensure we immediately release every lock already acquired, without hesitation, so to speak. Here’s one design my friend Shanyou shared with me.

  • Tip: limit the number of “grabbing” functions, i.e. functions that explicitly grab any lock.

Suppose a grabbing function f1 acquires a lock and calls another function f2 (a red flag to concurrency designers). In such a situation, f2() will use a -1 return value to indicate “must release lock”. If f2() calls a grabbing function f3(), and f3() fails to grab, it propagates the -1 to f2 and f1 (sketched after the list below).

  1. f3() could use trylock.
  2. f3() could check the lock hierarchy and return -1.
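Here is a minimal C++ sketch of this fail-fast scheme, under my own assumptions: the names f1/f2/f3 follow the text, while the two mutexes and the retry loop are illustrative inventions.

```cpp
#include <mutex>
#include <thread>

std::mutex lockA, lockB;   // hypothetical locks, for illustration only

// f3 is a "grabbing" function: it uses try_lock and fails fast.
int f3() {
    if (!lockB.try_lock())
        return -1;                 // can't grab: signal "must release locks"
    // ... critical section for lockB ...
    lockB.unlock();
    return 0;
}

// f2 grabs nothing itself; it only propagates -1 upward.
int f2() { return f3(); }

// f1 is a grabbing function that calls f2 while holding lockA (the red flag).
void f1() {
    for (;;) {
        lockA.lock();
        if (f2() == 0) {
            // ... work under lockA ...
            lockA.unlock();
            return;
        }
        lockA.unlock();            // fail fast: release everything acquired
        std::this_thread::yield(); // back off, then retry from scratch
    }
}
```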

##good-looking-only designs/ideas

I think a true architect knows the difference. The best design is often not so good-looking.

Not all of them are classified as “white elephants”, and not all of them are “designs”.

  1. lock-free
  2. multi-threading — not always significantly faster than multi-processing
  3. sharedMem — not always significantly faster than sockets. I briefly discussed these two choices with my friend Deepak M, in the parser/rebus context; I feel sharedMem may not be faster.
  4. java generics module (beyond collections) — looks impressive and can be hard to maintain, but doesn’t buy us much
  5. converting a java system to c++ — doesn’t always bring significant performance gains
  6. noSQL — not always significantly faster than SQL with lots of memory
  7. hash-table — not always faster than an RBTree

conflation: design question

I have hit this same question twice — Q: in a streaming price feed, you get IBM prices in the queue but you don’t want consumer thread AA to use “outdated” prices. Consumer BB needs a full history of the prices.

I see two conflicting requirements from the interviewer. I would point out this conflict to the interviewer.

I see two channels needed — in-band + out-of-band.

  1. in-band only — if full tick history is important, then the consumers have to /process/ every tick, even outdated ones. We can have dedicated systems just to record ticks, tolerating some latency. For example, Rebus receives every tick, saves it and sends it out without conflation.
  2. dual-band — if your algo engine needs to catch opportunities at minimal latency, then it can’t afford to care about history; it must ignore history. I will focus on this requirement.
  3. in-band only — combining the two: if your super-algo-engine needs to analyze tick-by-tick history and also react to the opportunities, then the “producer” thread alone has to do all the work up to order transmission, and I don’t know if it can be fast enough. In general, the fastest data processing system is single-threaded, without queues and with minimal interaction with other data stores. Since the producer thread is also the consumer thread for the same message, there’s no conflation — every tick is consumed! I am not sure about the scalability of this synchronous design; a FIFO queue implies latency. Anyway, I will not discuss this stringent “combo” requirement further.

https://tabbforum.com/opinions/managing-6-million-messages-per-second?print_preview=true&single=true says “Many firms mitigate the data they consume through the use of simple time conflation. These firms throw data on the floor based solely on the time that data arrived.”

In the Wells interview, I proposed a two-channel design. The producer simply updates a “notice board” with the latest price for each of 999 tickers. Registered consumers get notified out-of-band, on some messaging thread, to re-read the notice board[1]. The async design adds some latency; I don’t know how tolerable that is, but I feel async and MOM are popular and tolerable in algo trading. I should check my book [[all about HFT]]… A sketch of the notice board follows.
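Below is a hedged C++ sketch of such a notice board, under my own assumptions: the class name, the version counter and the snapshot API are illustrative inventions, not part of the Wells design.

```cpp
#include <condition_variable>
#include <mutex>
#include <string>
#include <unordered_map>

// Hypothetical conflating "notice board": one slot per ticker holding only
// the latest price. The notification itself carries no price (footnote [1]).
class NoticeBoard {
    std::mutex mu_;
    std::condition_variable cv_;
    std::unordered_map<std::string, double> latest_; // ticker -> latest price
    unsigned long version_ = 0;                      // bumped on every publish

public:
    void publish(const std::string& ticker, double price) { // producer thread
        {
            std::lock_guard<std::mutex> lk(mu_);
            latest_[ticker] = price;                 // overwrite = conflation
            ++version_;
        }
        cv_.notify_all();                            // out-of-band "re-read" signal
    }

    // Consumer: blocks until something changed, then re-reads the whole board.
    // Intermediate ticks may have been overwritten; that is the point.
    std::unordered_map<std::string, double> snapshot(unsigned long& seenVersion) {
        std::unique_lock<std::mutex> lk(mu_);
        cv_.wait(lk, [&] { return version_ != seenVersion; });
        seenVersion = version_;
        return latest_;                              // copy taken under the lock
    }
};
```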

In-band only — However, the HSBC manager (Brian?) seems to imply that for minimum latency, the socket reader thread must run the algo all the way and send the order out to the exchange in one big function.

Out-of-band only — two market-leading investment bank gateways actually publish periodic updates regardless of how many raw input messages hit them. Not event-driven, and not monitoring every tick!

  • Lehman eq options real time vol publisher
  • BofA Stirt Sprite publishes short-term yield curves on the G10 currencies.

[1] The notification should not contain price numbers. Doing so defeats conflation and brings us back to a FIFO design.


blocking scenario ] CPU bound system

Think of a few CPU-bound systems like

  • database server
  • MC simulation engine
  • stress testing

I tend to think that a thread submitting a heavy task is usually the same thread that processes the task. Such a thread doesn’t block!

In a task-queue producer/consumer architecture, the submitter thread enqueues the task and can do other things or return to the thread pool. A processor thread picks up the task from the queue and spends hours completing it. There is nothing else to do on this task. Again, no blocking here.

Here’s a trivial blocking scenario in a CPU-bound system — any of these threads can block on I/O.


complexities: replicate exchange order book #Level1+2

– For a generic order book, the Level 1 complexity is mostly about trade cancel/correct.
All trades must be stored in a database (like our TickCache). In the database, each trade has a TradeId but also an ArrivalTime.

When a trade is canceled by TradeId, we need to regenerate LastPrice/OpenPrice, so we need the ArrivalTime attribute.

VWAP/High/Low are all generated from TickCache.
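Here is a minimal sketch of that cancel path, assuming a much-simplified TickCache keyed by ArrivalTime; the class layout and method names are my own inventions for illustration.

```cpp
#include <map>
#include <string>

struct Trade { std::string tradeId; long arrivalTime; double price; };

// Simplified stand-in for TickCache: trades indexed by arrival time.
class TickCache {
    std::map<long, Trade> byArrival_; // arrivalTime -> trade, kept sorted
public:
    void add(const Trade& t) { byArrival_[t.arrivalTime] = t; }

    // Cancel by TradeId; the ArrivalTime ordering lets us regenerate prices.
    void cancel(const std::string& tradeId) {
        for (auto it = byArrival_.begin(); it != byArrival_.end(); ++it) {
            if (it->second.tradeId == tradeId) { byArrival_.erase(it); break; }
        }
    }

    // Regenerated after every cancel; assumes at least one trade remains.
    double lastPrice() const { return byArrival_.rbegin()->second.price; }
    double openPrice() const { return byArrival_.begin()->second.price; }
};
```

VWAP/High/Low could be regenerated by a similar scan over the remaining trades.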

The other complexity is BBO generation from Level 2.

– For a level-based Level 2 order book, the complexity is higher than at Level 1.

OrderId is usually required in order to support cancel/modify.


real-time symbol reference-data: arch #RTS

Real Time Symbol Data is responsible for sending out all security/product reference data in real time, without duplication.

  • latency — typically 2 ms (not microseconds) from receiving to sending out the enriched reference data to downstream.
  • persistence — any data worth sending out needs to be saved. In fact, every hour the same system sends a refresh snapshot to downstream.
    • performance penalty of disk writes — handled by InnoDB. Most database access is in-memory; disk writes are rare, and there is enough memory to hold 30GB of data. https://bintanvictor.wordpress.com/2017/05/11/exchange-tickers-and-symbols/ shows how many symbols there are across all trading venues.
  • insert is actually slower than update. But first, the system must check whether an insert or an update is needed; if nothing changed, don’t save the data or send it out (sketched after this list).
  • burst / surge — the main performance headache. We could have a million symbols/messages flooding in.
  • relational DB with mostly in-memory storage
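A hedged sketch of that check-before-write step, assuming hypothetical Record, insertDb/updateDb and publish helpers:

```cpp
#include <string>
#include <unordered_map>

struct Record {
    std::string symbol;
    std::string payload; // stand-in for the enriched reference data
    bool operator==(const Record& o) const { return payload == o.payload; }
};

std::unordered_map<std::string, Record> cache; // in-memory image of the table

void onReferenceData(const Record& incoming) {
    auto it = cache.find(incoming.symbol);
    if (it != cache.end() && it->second == incoming)
        return;                  // no change: don't save, don't send out
    if (it == cache.end()) {
        // insertDb(incoming);   // hypothetical, the slower path
    } else {
        // updateDb(incoming);   // hypothetical, the faster path
    }
    cache[incoming.symbol] = incoming;
    // publish(incoming);        // hypothetical: send enriched data downstream
}
```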

MOM^sharedMem ring buffer^UDP : mkt-data transmission

I feel in most environments, the MOM design is the most robust, relying on a reliable middleware. However, latency-sensitive trading systems won’t tolerate the additional latency and see it as unnecessary.

Gregory (ICE) told me about his home-grown simple ring buffer in shared memory. He used a circular byte array. Message boundaries are embedded in the payload. When the producer finishes writing to the buffer, it puts down a marker to indicate the end of data. Greg said the consumer is slower, so he made it a (periodic) polling reader. When the consumer encounters the marker it stops reading. I told Gregory we need some synchronization. Greg said it’s trivial. Here are my tentative ideas —

Design 1 — every time the producer or the consumer starts, it acquires a lock. Coarse-grained locking.

But when the consumer is chipping away at the head of the queue, the producer can simultaneously write to the tail, so here’s

Design 2 — the latest message being written is “invisible” to the consumer. Producer keeps the marker unchanged while adding data to the tail of queue. When it has nothing more to write, it moves the marker by updating it.

The marker can be a lock-protected integer representing the index of the last byte written.

For this discussion, I won’t worry about buffer capacity or a very slow consumer.
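Here is a single-producer/single-consumer sketch of Design 2 in C++. Two assumptions of mine: I use a C++11 atomic in place of the lock-protected integer, and the array lives in-process rather than in shared memory, purely for illustration. Wrap-around of a lapped consumer is ignored, as stated above.

```cpp
#include <atomic>
#include <cstddef>

class SpscRing {
    static const std::size_t N = 1 << 20; // 1MB circular byte array
    char buf_[N];
    std::atomic<std::size_t> marker_{0};  // index one past the last published byte
    std::size_t readPos_ = 0;             // owned by the consumer thread only

public:
    // Producer: copy the bytes in first, then move the marker. The message
    // being written stays invisible until the marker is updated (Design 2).
    void publish(const char* msg, std::size_t len) {
        std::size_t w = marker_.load(std::memory_order_relaxed);
        for (std::size_t i = 0; i < len; ++i)
            buf_[(w + i) % N] = msg[i];
        marker_.store(w + len, std::memory_order_release); // publish atomically
    }

    // Consumer (polling): read up to the marker, then stop. Returns bytes read.
    std::size_t poll(char* out, std::size_t maxLen) {
        std::size_t m = marker_.load(std::memory_order_acquire);
        std::size_t n = 0;
        while (readPos_ < m && n < maxLen) {
            out[n++] = buf_[readPos_ % N];
            ++readPos_;
        }
        return n; // 0 means "nothing new yet"; poll again later
    }
};
```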

| | MOM | UDP multicast or TCP or UDS | shared_mem |
|---|---|---|---|
| how many processes | 3-tier | 2-tier | 2-tier |
| 1-to-many distribution | easy | easiest | doable |
| intermediate storage | yes | tiny — the socket buffer can be 256MB | yes |
| producer data burst | supported | message loss is common in such a situation | supported |
| async? | yes | yes, since the receiver must poll or be notified | I think the receiver must poll or be notified |
| additional latency | yes | yes | minimal |

tick query: design notes #GS

— to support -Oprint —
For each symbol, we will use a vector to hold Tick objects (or shared_ptr thereof). Lookup will use a modified binary search — In the case of an exact hit, we need to scan backwards (for the starting timestamp) until we see a smaller value than the target. In the case of a search miss, we know the target timestamp value is between 2 vector items, and no scanning is needed.
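A sketch of that lookup, with one substitution of mine: std::lower_bound already gives the first element not less than the target even among duplicates, so it covers both the exact-hit backward scan and the search-miss case described above.

```cpp
#include <algorithm>
#include <vector>

struct Tick { long ts; /* ... other fields ... */ };

// Find the first Tick at or after targetTs in a vector sorted by timestamp.
std::vector<Tick>::const_iterator
findStart(const std::vector<Tick>& v, long targetTs) {
    return std::lower_bound(v.begin(), v.end(), targetTs,
        [](const Tick& t, long ts) { return t.ts < ts; });
}
```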

I decided against a multiset due to its larger memory footprint. Also, a red-black tree would require lots of re-balancing when the input data stream is already sorted.

This vector will support product-query, but not optimally.

Trade-off: this optimization mode saves memory and speeds up print-queries, but (the less frequent) product-queries are slower.

— to support -Oproduct —
In addition to the data structure of -Oprint, under each symbol we will have 11 additional vectors assuming there are 11 fields.

For Field1, the vector will hold 98,700 addresses, assuming that many records have Field1. Each address points to a Tick object that has Field1 populated.

For Field2, let’s assume the vector is longer.

If a product query specifies Field1 and Field2, the engine first identifies which of the two vectors is shorter. In our case, the Field1 vector is chosen, so we iterate over it. (Note we don’t need Field2’s vector.) For each item, we check if Field2 is present (low-cost) and produce a product.

Before iteration, we need to identify the start and end items in the vector. This is the same modified-binary search as in -Oprint.
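A hedged sketch of this query step. The Tick layout, field ids and helper names here are simplified stand-ins, not the real classes.

```cpp
#include <map>
#include <vector>

struct Tick {
    long ts;
    std::map<int, double> fields; // fieldId -> value (stand-in layout)
    bool has(int id) const { return fields.count(id) != 0; }
    double get(int id) const { return fields.at(id); }
};

// Iterate the SHORTER per-field vector (here Field1's); for each Tick in the
// time range, check Field2's presence (low-cost) and emit the product.
std::vector<double> productQuery(const std::vector<const Tick*>& field1Vec,
                                 int field1, int field2,
                                 long startTs, long endTs) {
    std::vector<double> products;
    for (const Tick* t : field1Vec) {
        if (t->ts < startTs) continue; // in practice the start is found with
        if (t->ts > endTs) break;      // the same modified binary search
        if (!t->has(field2)) continue; // Field2 absent: no product
        products.push_back(t->get(field1) * t->get(field2));
    }
    return products;
}
```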

Suppose Line 1 has 3 fields, Line 2 has 5 fields, Line 3 has 1 field … The additional space requirement in this design is proportional to the total count of fields (3+5+1+…) present in the tick file.

To support print-query, in theory we could iterate each of the 11 vectors under the target symbol. However, we need a place to store the Tick objects. The data structure of -Oprint is an efficient solution, so I see no reason to abandon it and invent another. This data structure already supports faster print-query.

Trade-off: In this optimization mode, we support faster product-query, but use more memory. Also, the initial file processing is slower.

— to support -O2product (not implemented) —
If we know in advance that total distinct field count is low like 10, then there are only 10-choose-2 i.e. 45 pairs. We could support a new -O2product optimization mode.

Instead of 10 vectors, we need 45 vectors of floats, for the 45 combinations.

If an incoming Tick object has 4 fields, we will iterate over the 4-choose-2 i.e. 6 pairs of fields. Each pair is always one of the 45 possible combinations, so we will pick one of the 45 vectors, and push_back the product of the two fields. Since we have 6 pairs within this Tick object, we update exactly 6 of the 45 vectors.

Once the tick file is fully processed, we have 45 vectors to support fast product-query. To select one of 45 vectors, we use a hash table keyed by the two target field names (in sorted order) concatenated. Once we select the one vector, we will use the same modified binary search then iterate. Each element in the iteration is now guaranteed to contain both fields, so there’s no wasted iteration as in -Oproduct.
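A sketch of the -O2product bookkeeping described above. The "|" separator and the function names are my assumptions.

```cpp
#include <cstddef>
#include <string>
#include <unordered_map>
#include <utility>
#include <vector>

// One vector of products per field-pair, keyed by the two field names
// in sorted order, concatenated.
std::unordered_map<std::string, std::vector<double>> pairVectors;

std::string pairKey(std::string a, std::string b) {
    if (b < a) std::swap(a, b);
    return a + "|" + b;
}

// Called once per incoming Tick: an n-field tick updates n-choose-2 vectors.
void onTick(const std::vector<std::pair<std::string, double>>& fields) {
    for (std::size_t i = 0; i < fields.size(); ++i)
        for (std::size_t j = i + 1; j < fields.size(); ++j)
            pairVectors[pairKey(fields[i].first, fields[j].first)]
                .push_back(fields[i].second * fields[j].second);
}
```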

Print-query would need the data structure in -Oprint.

Trade-off: This mode uses even more memory but further speeds up product-queries. The additional space is proportional to the total count of field-pairs present in the tick file.

– use of shared_ptr —
The memory footprint of a shared_ptr is about twice that of a raw pointer, and much smaller than the footprint of a string. There are millions of field names (and also many Tick objects) in my data structures. In fact, copy operations are disabled for the Field (and Tick) classes, which are essentially immutable objects.

It was an obvious memory optimization to use pointers to avoid allocating the same string twice. However, there’s a question of whether to store shared_ptr or raw pointers in the large data structures.

I chose raw pointer.

–duplicate copies of field strings or symbol strings —

If there are a million ticks, averaging 5 fields each, there would be 5 million field names to store, but only 50 distinct fields! To save memory, I allocate memory for each globally unique field name once only. We end up with 50 Field objects referenced by many shared_ptr objects.
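A minimal sketch of that interning step, under my assumptions: Field is reduced to a stub, and the pool keeps the owning shared_ptr while the large data structures can hold the raw pointer (my choice above).

```cpp
#include <map>
#include <memory>
#include <string>

struct Field {
    const std::string name;
    explicit Field(std::string n) : name(std::move(n)) {}
    Field(const Field&) = delete;            // essentially immutable:
    Field& operator=(const Field&) = delete; // copy operations disabled
};

// ~50 entries in practice; the pool owns each unique Field exactly once.
std::map<std::string, std::shared_ptr<Field>> internPool;

std::shared_ptr<Field> intern(const std::string& name) {
    auto it = internPool.find(name);
    if (it != internPool.end()) return it->second;
    auto f = std::make_shared<Field>(name);
    internPool.emplace(name, f);
    return f; // data structures may store f.get() as a raw pointer instead
}
```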

For lookup, a hash table is possibly faster when the map is large, but not really in our case, according to some benchmark tests; our map size is assumed to be small, like 100.

Symbol strings are also repeating, but by design, my data structures do not hold many duplicates for a symbol string, because symbol is only a lookup key.

–Improvement feedback from friends

  • avoid using namespace std
  • avoid virtual

STL+smart_pointer for SQL DTO

Are there any best practices online?

Q1: Say I have a small db table of 10 columns x 100 rows. Keys are non-unique. To cache it we want to use STL containers. Which container?
%%A: multimap or list. unordered_multimap? I may start with a vector, for simplicity. Note if two duplicate rows aren’t 100% identical, then a multimap will lose data

Q1a: search?
%%A: for a map, just look up using its find() member. For a list, iterate using the generic std::find()

Q1c: what if I have a list of keys to search?
%%A: is there a “set_intersection()” algorithm? (There is: std::set_intersection, but it requires sorted ranges.) Otherwise I would write my own nested iteration: loop through the target keys, and find() each.
A: for_each()?

Q1e: how do you hold the 10 columns?
%%A: each object in the container will have 10 fields. They could be 10 custom data classes, or strings, ints, floats. Probably 10 smart pointers for maximum flexibility.

Q1h: what if I have other tables to cache too?
%%A: parametrize the CacheService class. The CacheService class will be a wrapper around the vector; there will be other fields besides the vector. (See the sketch at the end.)

Q1m: how about the data class? Say you have a position table and an account table to cache.
%%A: either inheritance or templates.
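To tie Q1/Q1a/Q1c/Q1h together, here is a hedged sketch of such a parametrized CacheService; the class and member names are my inventions, and RowT is assumed to expose a public `key` field.

```cpp
#include <string>
#include <vector>

template <typename RowT, typename KeyT>
class CacheService {
    std::vector<RowT> rows_; // Q1: start with a vector, for simplicity
public:
    void add(const RowT& r) { rows_.push_back(r); }

    // Q1a: keys are non-unique, so a search returns ALL matching rows.
    std::vector<const RowT*> find(const KeyT& key) const {
        std::vector<const RowT*> out;
        for (const RowT& r : rows_)
            if (r.key == key) out.push_back(&r);
        return out;
    }

    // Q1c: a list of keys: nested iteration, find() on each key.
    std::vector<const RowT*> findAll(const std::vector<KeyT>& keys) const {
        std::vector<const RowT*> out;
        for (const KeyT& k : keys) {
            auto part = find(k);
            out.insert(out.end(), part.begin(), part.end());
        }
        return out;
    }
};

// Q1m: one data class per table, e.g. a hypothetical position row.
struct PositionRow { std::string key; /* 10 fields, possibly smart pointers */ };
// usage: CacheService<PositionRow, std::string> positionCache;
```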