Tick query: design notes #GS

— to support -Oprint —
For each symbol, we will use a vector to hold Tick objects (or shared_ptr thereof). Lookup will use a modified binary search. On an exact hit, we scan backwards (for the starting timestamp) until we see a value smaller than the target. On a miss, the target timestamp falls between two adjacent vector items, so no scanning is needed.
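The modified search described above can be sketched as follows (a minimal sketch; the Tick layout is assumed, with only the timestamp shown):

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

struct Tick { int64_t ts; /* other fields omitted */ };

// Returns the index of the first tick with timestamp >= target.
// Exact hit: scan backwards so that duplicates of the target timestamp
// are included. Miss: the binary search converges between two items,
// so no scan is needed.
std::size_t findStart(const std::vector<Tick>& v, int64_t target) {
    std::size_t lo = 0, hi = v.size();
    while (lo < hi) {
        std::size_t mid = lo + (hi - lo) / 2;
        if (v[mid].ts == target) {                            // exact hit
            while (mid > 0 && v[mid - 1].ts == target) --mid; // scan backwards
            return mid;
        }
        if (v[mid].ts < target) lo = mid + 1; else hi = mid;
    }
    return lo; // miss: target falls between v[lo-1] and v[lo]
}
```

Note that std::lower_bound gives the same result without the explicit backward scan; the scan-back version simply mirrors the prose above.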

I decided against a multiset due to its larger memory footprint. A red-black tree would also require lots of re-balancing when the input data stream is already sorted.

This vector will support product-query, but not optimally.

Trade-off: this optimization mode saves memory and speeds up print-queries, but the (less frequent) product-queries are slower.

— to support -Oproduct —
In addition to the data structure of -Oprint, under each symbol we will have 11 additional vectors, assuming there are 11 distinct fields.

For Field1, the vector will hold 98700 addresses, assuming there are that many records containing Field1. Each address points to a Tick object that has Field1 populated.

For Field2, let’s assume the vector is longer.

If a product query specifies Field1 and Field2, the engine first identifies which of the two vectors is shorter. In our case the Field1 vector is chosen, so we iterate over it. (Note we don't need Field2's vector at all.) For each item, we check whether Field2 is present (a low-cost check) and, if so, produce a product.

Before iteration, we need to identify the start and end items in the vector. This is the same modified-binary search as in -Oprint.
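The iteration step above can be sketched as follows (the integer field ids and the Tick accessors `has`/`value` are hypothetical, chosen for illustration):

```cpp
#include <map>
#include <vector>

struct Tick {
    std::map<int, double> fields; // fieldId -> value (assumed layout)
    bool has(int f) const { return fields.count(f) != 0; }
    double value(int f) const { return fields.at(f); }
};

// Iterate the shorter per-field vector; for each tick, cheaply check whether
// the other field is present, and emit the product of the two field values.
std::vector<double> productQuery(const std::vector<const Tick*>& shorterVec,
                                 int fieldA, int fieldB) {
    std::vector<double> out;
    for (const Tick* t : shorterVec)
        if (t->has(fieldB)) // low-cost presence check
            out.push_back(t->value(fieldA) * t->value(fieldB));
    return out;
}
```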

Suppose Line 1 has 3 fields, Line 2 has 5 fields, Line 3 has 1 field … The additional space requirement in this design is proportional to the total count of fields (3+5+1+…) present in the tick file.

To support print-query, in theory we could iterate over each of the 11 vectors under the target symbol. However, we still need a place to store the Tick objects themselves. The data structure of -Oprint is an efficient solution, so I see no reason to abandon it and invent another; it already supports the faster print-query.

Trade-off: In this optimization mode, we support faster product-query, but use more memory. Also, the initial file processing is slower.

— to support -O2product (not implemented) —
If we know in advance that the total distinct field count is low, say 10, then there are only 10-choose-2 i.e. 45 pairs. We could support a new -O2product optimization mode.

Instead of 10 vectors, we need 45 vectors of floats, for the 45 combinations.

If an incoming Tick object has 4 fields, we will iterate over the 4-choose-2 i.e. 6 pairs of fields. Each pair is always one of the 45 possible combinations, so we will pick one of the 45 vectors, and push_back the product of the two fields. Since we have 6 pairs within this Tick object, we update exactly 6 of the 45 vectors.

Once the tick file is fully processed, we have 45 vectors to support fast product-query. To select one of 45 vectors, we use a hash table keyed by the two target field names (in sorted order) concatenated. Once we select the one vector, we will use the same modified binary search then iterate. Each element in the iteration is now guaranteed to contain both fields, so there’s no wasted iteration as in -Oproduct.
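A sketch of the pair-keyed layout described above (the function names and the '|' separator are my own choices):

```cpp
#include <algorithm>
#include <cstddef>
#include <string>
#include <unordered_map>
#include <utility>
#include <vector>

// One vector of precomputed products per field pair. The key is the two
// field names in sorted order, joined with '|' to avoid ambiguity.
using PairTable = std::unordered_map<std::string, std::vector<float>>;

std::string pairKey(std::string a, std::string b) {
    if (b < a) std::swap(a, b); // canonical (sorted) order
    return a + "|" + b;
}

// On ingestion: a tick with k fields contributes k-choose-2 products,
// each pushed into exactly one of the 45 possible vectors.
void ingest(PairTable& table,
            const std::vector<std::pair<std::string, float>>& fields) {
    for (std::size_t i = 0; i < fields.size(); ++i)
        for (std::size_t j = i + 1; j < fields.size(); ++j)
            table[pairKey(fields[i].first, fields[j].first)]
                .push_back(fields[i].second * fields[j].second);
}
```

A query for (Field1, Field2) then hashes the canonical key once and iterates a single vector, with no wasted iterations.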

Print-query would need the data structure in -Oprint.

Trade-off: This mode uses even more memory but further speeds up product-queries. The additional space is proportional to the total count of field-pairs present in the tick file.

— use of shared_ptr —
Memory footprint of a shared_ptr is about twice that of a raw pointer, and much smaller than the footprint of a string. There are millions of field names (and also many Tick objects) in my data structures. In fact, copy operations are disabled for the Field (and Tick) classes, which are essentially immutable objects.

It was an obvious memory optimization to use pointers to avoid allocating the same string twice. However, there's a question of whether to store shared_ptr or raw pointers in the large data structures.

I chose raw pointers.

— duplicate copies of field strings or symbol strings —

If there are a million ticks, averaging 5 fields each, there would be 5 million field names to store, but only 50 distinct fields! To save memory, I allocate memory for each globally unique field name once only. We end up with 50 Field objects referenced by many shared_ptr objects.
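A minimal sketch of this interning idea (the class and method names are mine; a plain string stands in for the Field class):

```cpp
#include <map>
#include <memory>
#include <string>

// Each globally unique field name is allocated once; every Tick that carries
// the field shares the same instance through shared_ptr.
class FieldPool {
    std::map<std::string, std::shared_ptr<const std::string>> pool; // ~50 keys
public:
    std::shared_ptr<const std::string> intern(const std::string& name) {
        auto it = pool.find(name);
        if (it != pool.end()) return it->second; // reuse existing allocation
        auto sp = std::make_shared<const std::string>(name);
        pool.emplace(name, sp);
        return sp;
    }
};
```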

For lookup, a hash table could be faster than a tree map when the map is large, but according to some benchmark tests the gain is negligible here. Our map size is assumed to be small, around 100 entries.

Symbol strings are also repeating, but by design, my data structures do not hold many duplicates for a symbol string, because symbol is only a lookup key.

— improvement feedback from friends —

  • avoid using namespace std
  • avoid virtual

java coding IV #threading #Gelber

My solution was attached on 27 Jan 2011
Hi Bin,
Here is the problem I was speaking about:
1. Develop an application in Java that will satisfy the following requirements:

– Read in any number of text files.
– Calculate 5 most commonly used letters
– Calculate 3 most commonly used characters
– Calculate 3 most commonly used numbers
– Calculate 10 most commonly used words
– Concurrently parse all files, however, limit your application to 2 parser threads. If there are more files than available parser threads, you will need to queue files for parsing.

2. Write an application in Java that will pull down the HTML content of any number of specified websites – single file per URL, no depth. Strip out all metadata and generate statistics by showing 2 most commonly used letters, numbers, characters and words.


simplified order book design doc: jump

It’s tempting to use virtual function processMessage() to process various order types (A, M, X …) and trade types, but virtual functions add runtime overhead. Template specialization is a more efficient design, but due to the limited timeframe I implemented an efficient and simple alternative.

Assumption: M messages can only change quantity. Price change not allowed — Sender would need to cancel and submit new order. The B/S and price fields of the order should not change, but validation is omitted in this version.

Assumption: T messages and the corresponding M and X messages (also the initiator A message) are assumed consistent and complete. Validation is technically possible but omitted. Validation failure indicates lost messages.

The cornerstone of the design is the data structure of the order book — a RB-tree of linked lists. Add is O(logN) due to the tree-insert. Modify is O(1) thanks to the lookup array. Remove is O(1) — eliminating tree search. This is achieved with the lookup array, and by saving iterator into the order object.
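A sketch of that structure (bid side only; the names are mine, and the lookup array is sized by an assumed maximum orderID):

```cpp
#include <functional>
#include <list>
#include <map>
#include <vector>

struct Order;
using Level = std::list<Order*>;                             // FIFO queue at one price
using Tree  = std::map<double, Level, std::greater<double>>; // RB-tree of lists

struct Order {
    long id; double price; int qty;
    Tree::iterator level; // saved tree iterator: no tree search on remove
    Level::iterator pos;  // saved list iterator: O(1) unlink
};

class Book {
    Tree levels;
    std::vector<Order*> byId; // index = orderID (assumed dense auto-increment)
public:
    explicit Book(std::size_t maxId) : byId(maxId, nullptr) {}
    void add(Order* o) {                  // O(logN): tree insert
        o->level = levels.emplace(o->price, Level()).first;
        o->level->second.push_back(o);
        o->pos = std::prev(o->level->second.end());
        byId[o->id] = o;
    }
    void modify(long id, int newQty) {    // O(1): lookup array
        byId[id]->qty = newQty;
    }
    void remove(long id) {                // O(1): array + saved iterators
        Order* o = byId[id];
        o->level->second.erase(o->pos);
        if (o->level->second.empty()) levels.erase(o->level);
        byId[id] = nullptr;
    }
    double bestBid() const { return levels.begin()->first; }
};
```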

There are 2 containers of pointers: the map of lists and the lookup array. It would be better to use containers of smart pointers to ease memory management, but the (pre-C++11) STL doesn't provide a smart pointer that can be stored in containers.

All equality tests on doubles are done using "==". Some kind of tolerance should be used if time permits.

Here’s the documentation in the lookup array class

/*This class encapsulates an array of pointers.
Assumption 1 — An exchange is likely to generate auto-incrementing orderIDs. Here's my reasoning. OrderIDs are unique, as stated in the question. If orderID generation isn't continuous, then the generator has 2 choices about the inevitable gap between 2 adjacent ID numbers. It can leave the gap forever wasted, or somehow "go back" into a gap and pick a number therein as a new orderID. To do the latter it must keep track of which numbers are already assigned, which is rather inefficient. There are proven in-memory algorithms to generate auto-increment identity numbers, and I assume an exchange would use them. Auto-increment numbers make a good candidate for an array index, but what about the total number range?

Assumption 2 — each day the number range has an upper limit. Exchange must agree with exchange members on the format of the orderID. It’s likely to be 32 bits, 64 bits etc and won’t be a million bits.

Question 1: Can the exchange use orderID 98761234 on both IBM and MSFT during a trading day? I don't know, and I feel it doesn't matter. Here's the reason.

Case 1: suppose exchange uses an *independent* auto-increment generator for each stock. So IBM and MSFT generators can both generate 98761234. My design would use one array for IBM and one array for MSFT. For basket orders, additional generator instances might be needed.

Case 2: suppose exchange uses an independent auto-increment generator for each stock, but each stock uses a non-overlapping number range. 98761234 will fall into IBM's number range only. My design would need to know the number ranges so as to convert an orderID to an array index and conserve memory.

Case 3: suppose exchange uses a singleton auto-increment generator across all stocks (a bottleneck inside the exchange). My design would use one gigantic array. Given Assumption 1, the numbers would be quasi-continuous rather than sparse (sparse meaning below 50% of the range assigned). Suppose the range is S+1, S+2 … S+N; then my array would be allocated N elements (where S is orderIDBase). There's a limit on N in reality. Every system is designed for a finite N: no system can handle 10^9999 (that's one followed by 9,999 zeros) orders in a day. Each array element is a pointer, so on a 64-bit machine, N elements take 64N bits or 8N bytes. With 640GB of memory, N can be 80 billion but not higher. To scale out horizontally, we would hope Case 1 or 2 is the case.

Therefore, in all three cases, an array of pointers is feasible for lookup by orderID. In a real system a hash table is likely to be more time/space efficient, but in this exercise only STL is available, which provides no hash table. A tree-based map has O(logN) lookup, which is too slow. My choice is between a built-in array and a non-expanding vector; I chose the array for simplicity.

Fwd: design a finite state machine for a FIX engine

We need a cache (in the OMS/EMS?) of all the pending orders we created. Every from-exchange FIX message must match one of the pending orders. Once matched, we examine the order's current state.

If the message is belated, then we must reject it (informing all parties involved).

If the message is delivered out of sequence and too early, then we must keep it on hand. (a 2nd cache)

If the message is for immediate consumption, then consume.


…the FSM (finite state machine) in FIX is about state transitions, much like any state machine. Basically, an order goes from pending-new to new, then to filled or canceled. From canceled, it cannot go back to new or filled, etc.
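That transition rule can be captured in a small table (a simplified subset of my own choosing; real FIX order-state matrices are richer):

```cpp
#include <set>
#include <utility>

enum class OrdStatus { PendingNew, New, PartiallyFilled, Filled, Canceled };

// Allowed transitions only move forward; terminal states have no exits.
static const std::set<std::pair<OrdStatus, OrdStatus>> allowed = {
    {OrdStatus::PendingNew,      OrdStatus::New},
    {OrdStatus::New,             OrdStatus::PartiallyFilled},
    {OrdStatus::New,             OrdStatus::Filled},
    {OrdStatus::New,             OrdStatus::Canceled},
    {OrdStatus::PartiallyFilled, OrdStatus::Filled},
    {OrdStatus::PartiallyFilled, OrdStatus::Canceled},
};

bool canTransition(OrdStatus from, OrdStatus to) {
    return allowed.count({from, to}) != 0;
}
```

An out-of-sequence message (e.g. a fill arriving after a cancel ack) fails this check and would be parked in the 2nd cache or rejected.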



SCB threading design IV: caching remote DB

This could be a good story to tell during interviews.

The MOM design is still unfinished but I think a practitioner could see we are on the right track.

Q: A remote position data source (a DB or some opaque data source) is super slow. It takes up to 500ms to return a position object when queried. The read(int key) method is provided by the DB’s client-side API and is thread safe, so multiple threads can call read() concurrently. Now, we need some kind of caching to minimize the 500ms latency, but we want a lazy cache, partly to avoid overloading this busy DB — We don’t want to load all 9 million positions when we are interested in only a few positions. Our cache shall provide an API like a Cache class with a method

“public Position lookup(int key) const”

If a local lookup thread is blocked due to a cache miss (resulting in a DB read), it should not block other, unrelated lookup threads. Also, if 2 threads happen to call lookup(123) in quick succession and it's a cache miss, the 2nd thread should block and get the result as soon as the first thread receives the position and populates the cache, perhaps via some kind of notification.
———– Analysis ———–
Similar to my B2bTradingEngine. I feel we can address this either with low-level tools or with high-level tools. Perhaps the interviewer is interested in the low-level know-how, or perhaps in the higher-level design techniques and principles. I feel it's more challenging to assume high volume and slow connections, which is more relevant to the growing big-data economy. Perhaps a distributed infrastructure.

It’s generally easier to scale down an over-engineered design than to scale-up a simplistic design.
———– MOM based idea ———-
(not yet a “design”). If the request volume is too high, then we run the risk of creating thousands of blocked threads. MOM to the rescue. I will refer to the MOM-based server process as the “engine”.

The client thread has to be "decoupled" from the engine threads; otherwise 5000 concurrent requests would map to 5000 threads in the MOM server. To decouple, the client thread (one of 5000 or 55000) could block as a synchronous MOM consumer, or it could register an onMsg() callback.

On the MOM server, we check the request against the cache and return the Position synchronously on a hit. Position data goes out on a response "topic", so dozens of clients can filter the responses to look for their desired positions. What about a cache miss? The key goes into a task queue (cacheMissQueue, a producer-consumer setup). The consumer thread pool could have 1 or more worker threads. Since the read() method is blocking, the worker thread must block. Upon failure, we update the cache with a special error record for key 123; upon success, we update the global cache with the real position. Either way we publish to the response topic.

Note: if a request for key 123 is already being processed or already sitting in the task queue, we won't add it again. The task queue should "see" each key only once.
——–Low level design ——-
Now let's assume this is a low-level single-process threading challenge, without MOM. The lookup() method could be called on 50 threads simultaneously. If 2 threads both request key 123 (a cache miss), both must block. The first reader thread creates a lock tagged with the key. During the read(), this lock is not held. Upon a successful read, the reader thread locks, updates the cache, notifies, unlocks, then returns the Position object. If in the interim a 2nd thread (a want2read thread) also has a cache miss on key 123, it looks up the lock tagged with 123, acquires it, then waits on it in a loop. Upon wake-up, it checks the cache; if the key is found, it exits the loop and returns the Position object.

If read() times out or fails, the reader thread would fake a special Position object with an error msg, and do the same as above.

%%A: We need a thread pool to consume the queue. The pool threads will execute processOneReadTask(), "complete" the Position object in the cache, and notify on the position id.

class Cache{
   std::map<int, Position*> map; // could hold shared_ptr<Position> instead
   Mutex mutex;
   Condition cond;
   Queue<int> queue; //thread-safe queue, not std::queue
   DB * db;
 public:
   Cache() : db(getRemoteDB()) {}
   void submitReadTask(int key){
       queue.enqueueIfNew(key); //thread-safely enqueue, only if key has never been enqueued
   }
   //callable by clients
   Position * lookup(int key){
       ScopedLock lock(mutex); //may block
       Position * found = map.findByKey(key);
       if (found) return found;
       submitReadTask(key); //cache miss: hand the slow read to the pool
       while (true){
           cond.wait(lock); //releases mutex while waiting
           found = map.findByKey(key);
           if (found) return found; //may be the special "error" Position
       }
   }
   // executed by pool threads
   void processOneReadTask(){
       int key = queue.pop();
       if (key == -1) return; // empty queue
       Position * readOut = this->db->read(key); //slow; mutex NOT held here
       ScopedLock lock(this->mutex); //may block
       map.populate(key, readOut);
       cond.notifyAll(); //wake every waiting lookup() thread to re-check the cache
   }
}; //end of class

How about a listener solution instead of wait/notify? If we have 500 threads requesting position 123, they all block in wait(), which is memory-intensive. With the listener solution, each request would package its local state into a state object and save it into a global collection; a few threads would then be enough. For example, a single thread could wake up upon notification and sequentially complete the tasks of the 500 requests.
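A sketch of that listener registry (Position and all the names here are made up; locking is omitted for brevity and would be needed in real code):

```cpp
#include <functional>
#include <map>
#include <utility>
#include <vector>

struct Position { int id; double qty; };

// Instead of 500 blocked threads, each request registers a callback keyed by
// position id; a single completion thread fires them all when the data arrives.
class PendingRequests {
    std::map<int, std::vector<std::function<void(const Position&)>>> waiters;
public:
    void registerCallback(int key, std::function<void(const Position&)> cb) {
        waiters[key].push_back(std::move(cb));
    }
    void complete(int key, const Position& p) { // runs on the completion thread
        auto it = waiters.find(key);
        if (it == waiters.end()) return;        // no one waiting on this key
        for (auto& cb : it->second) cb(p);      // sequentially serve all waiters
        waiters.erase(it);
    }
};
```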