ringBuffer@pre_allocated objects to preempt JGC

Goal — to eliminate JGC completely.

Design 1: I want Order.java to use primitive fields only and avoid reference fields [1] at all costs, so the total footprint of an Order is known in advance. Say it’s 100 bytes. I will create 10M dummy Order instances, possibly scattered in heap, not adjacent as in c++, and hold their 10M addresses in an Order array… about 1GB footprint for the Order objects + 80MB footprint for the array of 8-byte pointers.

(Note I reuse these Order instances in this object pool and never let them get garbage-collected.)

Then I need a few subscripts to identify the “activeRegion” of the ring, but how about released slots enclosed therein?

[1] timestamps will be ints; symbolIDs and clientIDs are ints; short ascii strings will use 64-bit ints (8 characters/int); free-form strings must be allocated off-site:(

Design 2a: To avoid the “scatter” and to place the Order instances side by side, can we use a serialized byte[100] array to represent one Order? Can we use one gigantic off-heap byte array to hold all Orders, eliminating the 80MB footprint? See java off-heap memory

Design 2b: https://blog.bramp.net/post/2015/08/26/unsafe-part-2-using-sun.misc.unsafe-to-create-a-contiguous-array-of-objects/ shows a contiguous array of java objects, like std::vector<MyObject>

Design 2c: https://www.ibm.com/support/knowledgecenter/en/SSYKE2_7.1.0/com.ibm.java.lnx.71.doc/user/packed_optimizing.html is a feature in IBM jvm

Ring buffer is good if the object lifetimes are roughly equal, giving us FIFO behavior. This occurs naturally in market data or message passing gateways. Otherwise, we may need a linked list (free list) of released slots in addition to a pair of subscripts to identify the active region.
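A minimal sketch of the pre-allocated pool with a free list of released slots. It is written in C++ only to keep one language across these notes; the same index-based layout should carry over to a Java Order[] pool. The Order fields, OrderPool, acquire(), release() and kNone are hypothetical names, not from any library.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical Order record with primitive fields only, so its size is fixed.
struct Order {
    std::int32_t timestamp = 0;
    std::int32_t symbolId = 0;
    std::int32_t clientId = 0;
    std::int64_t shortText = 0;  // up to 8 ASCII characters packed into one integer
};

// Sentinel index meaning "no slot".
constexpr std::size_t kNone = static_cast<std::size_t>(-1);

// Fixed-capacity pool: every Order is allocated once, up front, and reused.
class OrderPool {
public:
    explicit OrderPool(std::size_t capacity) : slots_(capacity), nextFree_(capacity) {
        // Chain every slot into the free list: 0 -> 1 -> ... -> kNone.
        for (std::size_t i = 0; i < capacity; ++i)
            nextFree_[i] = (i + 1 < capacity) ? i + 1 : kNone;
        freeHead_ = capacity ? 0 : kNone;
    }

    // Returns a slot index, or kNone when the pool is exhausted.
    std::size_t acquire() {
        if (freeHead_ == kNone) return kNone;
        std::size_t idx = freeHead_;
        freeHead_ = nextFree_[idx];
        return idx;
    }

    // Pushes a released slot back onto the free list; releases may arrive in any order.
    void release(std::size_t idx) {
        nextFree_[idx] = freeHead_;
        freeHead_ = idx;
    }

    Order& at(std::size_t idx) { return slots_[idx]; }

private:
    std::vector<Order> slots_;           // one contiguous allocation, never freed
    std::vector<std::size_t> nextFree_;  // free list stored as "next index" per slot
    std::size_t freeHead_ = kNone;
};
```

With a free list the slots no longer need to be released in FIFO order, at the cost of one extra index per slot.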

It might be better to allocate a dedicated buffer for each thread, to avoid contention. Drawback? One buffer may get exhausted when another stays unused.

RTS feed for Internet clients #Mark #50%

Q: Based on the RTS market data dissemination system, what if some clients subscribe by a slow Internet connection and your orderbook (TCP) feed needs to provide delta updates each time a client reconnects?

Default solution: similar to FIX/TCP .. sender to maintain per-client state. Kenny of Trecquant said his trading app can be extremely simple if exchange maintains state. I won’t elaborate. Here is my own Solution.

Note on terminology — in multicast there’s no TCP-style “server” . Instead, there’s a sender engine for many receiver clients.

Suppose we have too many clients. To minimize per-client state management, my engine would simply multicast to all clients real time updates + periodic snapshots.

A client AA can request a snapshot on any symbol group, and I will immediately multicast the snapshots on a refresh channel. If client BB never requests anything, BB can ignore the refresh multicast channel.

Request quota — each client like AA can request X free snapshots a day. Beyond that, AA’s requests would be regulated like

  • levied a fee
  • queued with lower priority

It’s feasible to replace the refresh multicast group with a unicast UDP channel per client, but to me the multicast-refresh solution offers clear advantages without major drawbacks.

  1. if there is an outage affecting two clients, each would request the same snapshots, creating unnecessary work on the engine. The request quota would incentivize each client to monitor the refresh group and avoid sending the same request as someone else
  2. multicast can be valuable to clients who like more frequent snapshots than the periodic.
  3. My operations team can also utilize this refresh channel to broadcast unsolicited (FreeOfCharge) snapshots. For this task, I would avoid using the publisher channel as it can be busy sending real time updates. We don’t want to interleave real time and snapshot messages.

mutex ] rebus: 3designs #RCU #CSY #80%

Requirement: Each worker thread operates independently on its own symbols, never interfering. Admin thread may read/write all symbols.

My friend CSY’s design resembles ConcurrentHashMap — split a big “graph-container” into 32 independent sub-graphs, to be locked individually.

For my 1st design, I briefly considered one lock per symbol. I think 100,000 locks is fine if symbols are consecutive integers. I simply use a symbol as array index to locate the corresponding lock. Every worker thread and the admin thread need to acquire the lock for symbol X before updating the AVL tree of X.

— Here’s my 2nd design, based on a single global lock:

  • Regular writer threads only check, and never acquire, the lock. If the lock is in use (i.e. taken by the admin thread) then wait in a 1-second-sleep loop.
  • admin thread immediately and unconditionally takes the lock and waits for a grace period[1] before writing. I assume this “write” can take up to 1 minute.
  • how to “check” the lock?
    • TryLock solution [Design#1] — Just trylock and immediately unlock.
    • LockFree solution [Design #2]: an atomic boolean to be set only on the admin thread. I think this is one of the simplest usage scenarios of atomic boolean. We are not even relying on the atomicity. We only need the visibility feature.

[1] This grace period is designed to be sufficient for a batch of updates on the worker thread. If we know a batch can write 100 orders and take X microsecond, then set grace period to X. We require worker routine to recheck the lock after each batch.

My friend CSY raised the concern that mid-stream the update could be suspended by kernel scheduler. I think in this case, a simple integer update could take ages, as in the GC stop-the-world scenario. So the design requires some realtime kernel support to guarantee timely response.

read-copy-update lockfree +! retry #RCU mentions a grace period used in RCU api

— Here’s a bigger lockfree design [Design #3]

  • without realtime kernel
  • Two global variables — in addition to the atomic boolean flag, I will add an atomic int “activeWorkerCnt”
  • Two non-reentrant routines (to avoid incremental acquisition)
Worker thread routine:
1. check the flag until it’s false
2. increment activeWorkerCnt
3. update data store
4. decrement activeWorkerCnt
Admin thread routine:
1. unconditionally set the flag # rejecting all new workers
2. while activeWorkerCnt > 0 # draining the active worker pool
     sleep 1 millisecond
3. update data store
4. unset the flag
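The two routines can be sketched with std::atomic as below. This is a sketch under assumptions, not a vetted implementation: workerUpdate and adminUpdate are hypothetical names, and the data-store update is passed in as a callable. As I read the steps, a worker could pass step 1 just before the admin sets the flag and then increment the counter after the admin has already seen zero, so the sketch adds a re-check of the flag right after the increment. That re-check is my addition, not part of the routine as listed.

```cpp
#include <atomic>
#include <cassert>
#include <chrono>
#include <thread>

std::atomic<bool> adminFlag{false};   // set only by the admin thread
std::atomic<int> activeWorkerCnt{0};  // workers currently inside the data store

// Worker routine: 'update' is any callable that writes this worker's own symbols.
template <typename Fn>
void workerUpdate(Fn update) {
    for (;;) {
        while (adminFlag.load()) std::this_thread::yield();  // step 1: wait until the flag is false
        activeWorkerCnt.fetch_add(1);                         // step 2
        if (!adminFlag.load()) break;                         // added re-check (assumption, see above)
        activeWorkerCnt.fetch_sub(1);                         // admin slipped in: back off and retry
    }
    update();                                                 // step 3
    activeWorkerCnt.fetch_sub(1);                             // step 4
}

// Admin routine: 'update' may touch all symbols.
template <typename Fn>
void adminUpdate(Fn update) {
    adminFlag.store(true);                                    // step 1: reject all new workers
    while (activeWorkerCnt.load() > 0)                        // step 2: drain active workers
        std::this_thread::sleep_for(std::chrono::milliseconds(1));
    update();                                                 // step 3
    adminFlag.store(false);                                   // step 4
}
```

The default sequentially consistent ordering is used throughout, so either the worker sees the flag on its re-check or the admin sees a non-zero count.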

tick`95%mark #Kam #70%

“Ticking 95th percentile server” is the blog title I would use. Original question is on paper/pencil, scanned and sent to my gmail, from my friend Deepak CM. I find this problem rather realistic with practical usage, rather than fake and contrived. I treat it as a System Design + implementation question.

Q: Using only std library, write a c++ program to process a live stream of 128,000,000 (or much more) double numbers, representing temperature readings not necessarily unique. As the temperatures come in, print the current 95th percentile on-demand. I call it the lucky “winner”. We can use the nearest-rank percentile definition.

====Idea 1: the approach in “given unsorted ints, find median in O(N)” is for the median but can be tweaked for any percentile. Unfortunately, it is not “ticking”.

====design 2, for static data set
use an “order statistic tree” i.e. a RBTree where each node remembers the size of its subtree. (A leaf node has size 1.)
====design 3, optimized for high volume of updates like 128 million updates, not optimized for frequent query

The entire temperature range is divided into non-overlapping segments, each represented by a segment-head temperature i.e. the lower bound [1b]. Each segment has a range (i.e.distance to next segment head), size (i.e.item count) and density (i.e. size/range ratio). We mostly care about “size” only.

We need a RB-tree (or sorted vector) containing P=1024 [1] nodes, each an unsorted container[3]. The RB-tree serves to maintain the containers i.e segments.

Each incoming temperature is quickly “routed” to the correct container and simply appended therein, increasing its size.

Upon query request, we will use the latest segment sizes to build a cumulative profile, and run an O(log P) binary search to identify the one segment containing the “winner”. This segment size would hopefully be much smaller than 128,000 [2] and far more /tractable/.

–Within the chosen segment of size S, we can use a vector to sort in O(S logS) the temperatures and identify the winner.  After completing a query, the chosen container will become (partially) sorted, helping subsequent queries if this segment is picked again.

Since we only support 95th percentile, chance is good that this segment will be picked most of the time. If x% of the queries hit this segment, then I will convert this “favorite segment” to a RB-tree.

Alternatively, we can also use the O(S) algorithm in Idea 1, but the container won’t become sorted.
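Design 3 can be sketched as follows, assuming the segment heads were already chosen in a priming phase. SegmentedPercentile, add() and percentile() are hypothetical names. For brevity the sketch walks the cumulative sizes linearly instead of running a binary search over a prebuilt profile, and inside the chosen segment it uses std::nth_element, i.e. the O(S) alternative rather than a full sort.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <map>
#include <vector>

// Segment head (lower bound) -> unsorted container of the readings in that segment.
class SegmentedPercentile {
public:
    // 'heads' are assumed to come from a priming phase that is not shown here.
    explicit SegmentedPercentile(const std::vector<double>& heads) {
        for (double h : heads) segments_[h];  // create one empty container per head
    }

    // Route the reading to its segment and append. Requires at least one head.
    void add(double t) {
        auto it = segments_.upper_bound(t);  // first head strictly above t
        if (it != segments_.begin()) --it;   // step back to the segment that owns t
        it->second.push_back(t);             // readings below the lowest head land in the first segment
        ++count_;
    }

    // Nearest-rank percentile, pct in 1..100. Requires at least one reading.
    double percentile(int pct) {
        std::size_t rank = (count_ * pct + 99) / 100;  // ceil(count * pct / 100), 1-based
        std::size_t before = 0;                        // readings in all lower segments
        for (auto& kv : segments_) {
            std::vector<double>& seg = kv.second;
            if (rank <= before + seg.size()) {
                std::size_t k = rank - before - 1;     // 0-based position inside this segment
                std::nth_element(seg.begin(), seg.begin() + k, seg.end());
                return seg[k];
            }
            before += seg.size();
        }
        return 0.0;  // not reached when count_ > 0
    }

private:
    std::map<double, std::vector<double>> segments_;
    std::size_t count_ = 0;
};
```

For example, with heads {0, 50, 90, 94, 96} and the readings 1 to 100, the 95th-percentile query only touches the two-element segment starting at 94.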


[2] 128,000 is roughly 1/1024 of the original sample size… not ideal. The segments need to be initialized carefully, during a priming phase, inspired by the JIT compiler. Shall we assume roughly uniform distribution or Gaussian distribution? Assuming we know the total sample size is 128 million, I will use the first 100,000 temperatures to select the 1024 segment heads. The segments are structured not for equal length (in temperature) or equal size (i.e. element count). In fact the low segments can be very long and very crowded.

Instead, the segment heads are chosen so that between 94th percentile and 96th percentile we have half of all segments. These small segment sizes will be much smaller than 128,000 and quicker to manipulate.

–Foot notes:

Q: what if some of the containers grow too big, like three times 128,000,000/1024, because the priming/estimate was ineffective?
A: Not a problem unless the winner happens to be in such a container.
A: One idea is to split up such a container when we notice it, and grow a new node on the RB-tree. std::map::insert() can take a hint for the position where new node can be inserted. Note we don’t want to split a node JJ too early since JJ may not grow any further subsequently and may end up no larger than other nodes as other nodes keep growing.

[1] Sizing of P: first we estimate the total sample size. If unknown, then set P:=1024 so all nodes stay in L1 cache (typically 32KB). If we assume 16 bytes/node (8-byte pointer to container + 8-byte double), then 32KB can hold about 2000 nodes.

If query becomes more frequent, I can increase P by 1024, sacrificing insertion.

[1b] The node values are “lower-bounds” and don’t have to be really the minimum of the original temperatures in the container. We can probably cheat with 4-byte floats, and we get away with 2700 twelve-byte tree nodes.

[3] slist vs vector — vector might be faster due to pre-allocation, provided a node will never grow beyond a capacity. Vector has reserve() (Note resize() is wrong choice.)

RTS pbflow msg+time files #wait_until #60%

Background: in RTS we relied every day on a UDP message replay tool, taking in a msg file and a corresponding time file, both binary. I saw no message delimiters in the msg file, partly because this file is produced in production. Inserting delimiters would add overhead in the production parser engine.

Given there’s no delimiter, I believe the timestamp file must have a seek-offset value (marker) corresponding to each timestamp.

Q: how would you implement a similar replay tool?

Driver is the timestamp file. Every time I see a timestamp, I would wait_until() that timestamp (adjusted to my start time). Upon wake-up, I would send the corresponding chunk of bytes between current and next markers.

I would use condition_variable::wait_until(). As noted in c++condVar 2 usages #timedWait, this function accepts time points at nanosecond resolution, though the actual wake-up precision depends on the OS scheduler.

(The mutex needed in wait_until? My program is single-threaded, so the dummy lock would be held forever.)

The chunk of bytes would be sent as one UDP packet. No split. Each chunk should start with an original packet header created by the matching engine, and received in the original feed.

Q2: how about TCP?

TCP receiver can deal with partial messages very effectively, as I saw in my tests.

However, TCP receiver can introduce delays in the sender i.e. my replay tool, so the transmission will not be instantaneous. In fact, send() can block! This could lead to drift in the timer. That’s why I need wait_until()

Q3: Explain timer drift?

Suppose in the time file, Timestamp #1 is 9:00:00:000001 and Timestamp #2 is one microsec later, but the data transmission can take 10 nanosec esp. with TCP blocking send().  This 10 nanosec is a small drift but it adds up to microseconds.
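A minimal sketch of the replay loop, showing how absolute deadlines avoid the drift. Assumptions: TimeRecord is a hypothetical layout for one entry of the time file, the send callback stands in for the actual UDP send, and std::this_thread::sleep_until is swapped in for condition_variable::wait_until because a single-threaded tool needs no mutex.

```cpp
#include <cassert>
#include <chrono>
#include <cstddef>
#include <cstdint>
#include <functional>
#include <thread>
#include <vector>

// One record from the (hypothetical) time file: a timestamp plus the seek offset
// where the matching chunk starts in the msg file.
struct TimeRecord {
    std::int64_t micros;  // original timestamp in microseconds
    std::size_t offset;   // marker: start of this chunk in the msg file
};

// Replays the chunks; 'send' receives the byte range [begin, end) of one chunk.
void replay(const std::vector<TimeRecord>& recs, std::size_t msgFileSize,
            const std::function<void(std::size_t, std::size_t)>& send) {
    if (recs.empty()) return;
    const auto start = std::chrono::steady_clock::now();
    for (std::size_t i = 0; i < recs.size(); ++i) {
        // Absolute deadline = my start time + offset from the first original timestamp.
        // It never depends on how long earlier sends took, so delays do not accumulate.
        const auto deadline =
            start + std::chrono::microseconds(recs[i].micros - recs[0].micros);
        std::this_thread::sleep_until(deadline);
        const std::size_t end = (i + 1 < recs.size()) ? recs[i + 1].offset : msgFileSize;
        send(recs[i].offset, end);
    }
}
```

A loop that instead slept for the gap between consecutive timestamps would add each send's duration to every later message, which is the drift described above.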

SCB-FM design IV #AshS#80%

Q: Your parser class is a plug-in in a framework. The framework would call your parser’s member function onData(seqNum, packet)  whenever the framework receives a packet on a UDP socket. You need to deal with

  • out of sequence packets
  • duplicate packets

Inside your onData(), you need to invoke a client callback function like theClient->callback(ptr2packet) but you need to invoke it 1) in correct sequence and 2) without duplicates.

Note the requirement is part of TCP’s job functions. TCP receives out-of-sequence packets (and/or duplicates) from IP layer and must deliver sequenced packets to the application layer.

Does TCP use ring buffer or hashtable? I doubt it, but we are building simpler solutions and are free to choose our data structure.

====My solution=====

an illustration
seq # received | warehoused | send | in-use region of ring buffer
1              |            | 1    |
5              | 5          |      | 2-5
9              | 5,9        |      | 2-9
3              | 3,5,9      |      |
2              |            | 2-3  | 4-9
6              | 5,6,9      |      |
8              | 5,6,8,9    |      |
7              | 5-9        |      |
11             | 5-9,11     |      | 4-11
4              |            | 4-9  | 10-11
  • keep the packets (like fixed-size struct instances) in a large singleton circular array (or a deque). Save each packet in a slot keyed by the seq number of the packet (modulus the array size). Remember the nextSeqToSend. If we get a higher sequence than that, just warehouse it in the circular buffer.
  • (Interviewer didn’t ask) How do you reuse slots in the circular buffer? Given ten thousand slots #0~#9999, when I’m warehousing packet #109,999 in slot #9999, then conceptually the old packet in the #0 slot was already sent out, so I can safely “wrap around” to save next packet (#110,000) in there. I can implement my system to ensure it is actually safe.
  • What if the sequence numbers I receive jump wildly? Well, in real systems this will never happen (except an explicit seq reset). At most the sequence numbers jump ahead by a few hundred. Assuming the sequence numbers arrive largely in correct order with occasional out-of-order arrivals, ring buffer is a natural choice. Without this assumption, dictionary solutions (Ashish saw in QuickFix) might be more suitable.
  • permanent gaps? If I see an old gap (like nextSeqToSend == #55 and no #56 but #57~#8000 all received) then we need a policy to mark the gap as permanent. Otherwise we would have to wait for it indefinitely.
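The ring-buffer approach can be sketched as below. Packet, Sequencer and the policy of dropping packets that fall outside the buffer window are assumptions for illustration. The permanent-gap policy is not handled here.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <functional>
#include <utility>
#include <vector>

struct Packet { std::uint64_t seq; };  // hypothetical fixed-size packet

class Sequencer {
public:
    Sequencer(std::size_t capacity, std::function<void(const Packet&)> callback)
        : slots_(capacity), filled_(capacity, false), callback_(std::move(callback)) {}

    void onData(std::uint64_t seq, const Packet& pkt) {
        if (seq < nextSeqToSend_) return;                   // duplicate of a delivered packet
        if (seq >= nextSeqToSend_ + slots_.size()) return;  // too far ahead to warehouse (policy assumption)
        const std::size_t i = seq % slots_.size();          // slot keyed by seq modulo array size
        if (filled_[i]) return;                             // duplicate of a warehoused packet
        slots_[i] = pkt;
        filled_[i] = true;
        // Deliver the longest in-sequence run starting at nextSeqToSend_.
        for (std::size_t j = nextSeqToSend_ % slots_.size(); filled_[j];
             j = nextSeqToSend_ % slots_.size()) {
            callback_(slots_[j]);
            filled_[j] = false;  // slot becomes reusable after wrap-around
            ++nextSeqToSend_;
        }
    }

private:
    std::vector<Packet> slots_;
    std::vector<bool> filled_;
    std::function<void(const Packet&)> callback_;
    std::uint64_t nextSeqToSend_ = 1;
};
```

Feeding it the arrival order from the illustration (1, 5, 9, 3, 2, 6, 8, 7, 11, 4) delivers 1 through 9 in order and keeps 11 warehoused until 10 arrives.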

Q (from interviewer): if you use a deque, how do you allocate slot for packet #5 while waiting for #4?
%%A: i would allocate for both, but keep #4 slot vacant. Not sure if std::deque has this API. I think my deque will hold pointers … dummy pointer represents vacant.

Justification for deque is similar to ring buffer — to keep the queue length short and release once-used memory.

I haven’t analyzed hashtables i.e. dictionaries. I believe it’s a proven solution with no major drawbacks.

#1 minor drawback of hashtable-based (or deque) relative to ring buffer is runtime allocation which is about 100 to 1000 times slower than arithmetic operations. For this reason, I always favor ring buffers when it’s a natural and logical data structure choice. This is my bias in many system designs. Sequence-number-based systems can often use ring buffers.

Another minor drawback of hashtable is memory overhead. Ring buffer has relatively small overhead in addition to the packet footprints, whereas hashtable wraps each packet in a link node. Hashtable also needs an expandable bucket array.

In terms of runtime efficiency, I am not so sure. I feel circular array has faster read/write. Hashtable depends on the hash function, which can degrade due to hash collisions.


SDI: URL shortening

Q: Design TinyURL or bitly (a URL shortening service)

Given a (typically) long URL, how would you design a service that would generate a shorter and unique alias for it?

Discuss things like:

  • How to generate a unique ID for each URL?
  • How would you generate unique IDs at scale (thousands of URL shortening requests coming every second)?
  • How would your service handle redirects?
  • How would you support custom short URLs?
  • How to delete expired URLs etc?
  • How to track click stats?

https://www.interviewbit.com/problems/design-url-shortener/ is a long discussion.

high-volume DoS-guard #Nsdq SDI

Q: Design an API Rate Limiter (e.g. for Firebase or Github)

You are expected to develop a Rate Limiter service that can:

  • Limit the number of requests an entity can send to an API within a time window e.g., 15 requests per second.
  • The rate limiting should work for a distributed setup, as the APIs are accessible through a cluster of servers.

(A similar question was asked at Nsdq… )

Q2: how do your cluster of cache servers detect a given IP on the Internet is sending requests too frequently, causing Denial of Service? How do you protect yourself?

Q2b: After you blacklist a client IP, it goes quiet, then it sends a single request again. How do you decide whether to ignore the request?

Q2c: what algorithm to decide if a client IP has legitimate need to send lots of requests vs another client IP engaging in Denial of Service attack?

Q2d: what if distributed DoS attack?

https://en.wikipedia.org/wiki/Denial-of-service_attack#Defense_techniques has practical solutions.

SDI: elevator design #DeepakCM #70%

My friend Deepak gave me these Basic Requirements:

  • N-level building. You can assume 5 for now
  • Each level’s lift lobby has up/down buttons
  • Inside each lift there are N buttons for the N target floors
  • Any time, system can receive requests from any button

My basic design is not optimized for efficiency. The number of pending requests will stay below 20 since there are only that many buttons, so we iterate over all requests frequently.

My design effort is heavily focused on data structure — The more complex the requirements, the more I need to focus on clean, concise, sound data structure. They may not be necessary — a less optimal data structure can also work, but an optimal data structure helps us tremendously to cope with the complexity. I feel this problem is tractable once the data structures take shape.

Q: what if lift is in motion towards some target when a lift-lobby button is pressed and it happens to be serviceable?
A: Like the pencil solution to the space-pen challenge, my endless loop in main() may qualify as a simple solution to this daunting challenge. The system wakes up frequently to check for new requests. When sleeping, it ignores all inputs.

This simple design, if viable, avoids asynchronous or multi-threading complexities.  http://pubs.vmware.com/foundry1/pg/wwhelp/wwhimpl/common/html/wwhelp.htm?context=pg&file=Foundry_PG_concepts.3.30.html is a similar single-threaded design

For simplicity, I assume the new requests from all buttons show up in some buffer (or database), so we can poll it to see them. There’s no interrupt or callback.

Q: in your design, there are “targets” set by in-lift passengers vs. up/down requests (from lift lobbies) assigned by the system. How do you prioritize between the two types?
A: In general, targets are at higher priority than assignments, but if lift is already moving down towards Level 2 for an assigned request, it will move down all the way till that level.

I don’t want to spend too much time on any module since the correct emphasis/focus could be different in a real world design or design interview.

implement an exchange #Trex Kenny

Q: create an exchange with messaging for NewOrderSingle, ExecutionReport etc. (I think interviewer means the matching server.)

  • need to support restart in any client or the exchange itself.
  • (Use ack or other techniques) basic reliability such as
    • Client needs to know for sure if new order is received.
    • Exchange needs to ensure execution report is received.
  • Please write c++ code, and compile it if possible. One hour given.
  • no need to send out market data to subscribers — not the focus

bbg-Eq: filters in a screen

Exchanges constantly send you live update messages in a fixed format like

IBM, Price, 100
GOOG, Volume, 5000
MSFT, Volume, 14000
IBM, Price, 99
GS, Volume, 7000
MSFT, Price, 250
GOOG, Price, 332

Each message always has exactly 3 fields — ticker, field name, integer value. The value only overwrites the previous value and never aggregates. In this example, the IBM price 99 will wipe out the IBM price 100.

There are up to 100 distinct field names, but many tickers and very high messaging rate.

Each user has a Screen, which holds a variable number of Filters defined by this user. Each Filter has a fixed format like “Price > 150” or “Volume < 9000”. All the filters need to be applied, to return a list of qualifying tickers. In this example, first filter alone returns a list of tickers {MSFT,GOOG} and 2nd filter alone returns {GS,GOOG}. Applying all filters would reduce the list to {GOOG}

Data feed goes into the server. Server maintains thousands of Screen objects, each for one user. When a new message is received, the server must apply all relevant filters. If out of 3000 screens 500 Screens’ lists get updated, then another module will update those 500 clients. You don’t need to design the interface to clients, data feed, threading.

Requirement: design the data store, the screen and filter.

I will assume 3000 Screens. On average, each Screen has 4 filters, defined by that user. If you assume every incoming message must go through 3000*4 = 12000 filters, I guess you are probably right.

Can you avoid linear searches when applying filters? To illustrate, suppose all the messages are Price numbers. I maintain an unordered_map<ticker, priceValue>, updated constantly. When server applies a single filter, we start with the survivor list from previous filter. Go through each key/value of the map, and erase each disqualified ticker from survivor list. If I start with N tickers and perform N look-ups we get O(N).

Akhil hinted a reverse sorted map<priceValue, ticker>. I will call it a RBTree. This RBTree must be updated with each message, by erase/insert. Suppose we care only about filter latency, not about update latency. Or suppose we get very few updates but many Screens to applyFilters(), then this RBTree can help. We will use lower_bound followed by backward iteration (for example, Price < 150). This is presumably faster than my simple solution of full hash table iteration.

That’s the first filter. How about second filter? Let’s name the RBTree for the second filter “tree2”.  In my simple solution, I start with the survivor list (of size N) from first filter and look up N times against the hash table of the second filter, without full iteration. With the RBTree, we have a choice

  • if tree2.lower_bound() leaves very few (like 5) tickers to be checked, but the survivor list from first filter is large, then we should convert the survivor list to an unordered_set and drive from tree2.lower_bound()
  • If tree2.lower_bound() leaves many (like 9999) tickers to be checked, then we should drive from the survivor list. The tree2 won’t be used.
  • However, in a RBTree, how many nodes come before a given node is hard to count. This counting requires expensive iteration. so I feel the choice cannot be made at runtime. See RBTree range count #enum,auto
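A minimal sketch of the sorted-map idea for one field. FieldIndex, update(), below() and above() are hypothetical names. The sketch keeps the multimap in ascending order rather than reverse order, and adds a side hash table so each update can find and erase the ticker's old entry.

```cpp
#include <cassert>
#include <map>
#include <set>
#include <string>
#include <unordered_map>

// Per-field index: value -> ticker, kept sorted so range filters avoid full scans.
class FieldIndex {
public:
    // A new value overwrites the previous one: erase the old entry, insert the new.
    void update(const std::string& ticker, int value) {
        auto old = where_.find(ticker);
        if (old != where_.end()) byValue_.erase(old->second);
        where_[ticker] = byValue_.emplace(value, ticker);
    }

    // Tickers whose value is strictly below 'limit', e.g. "Volume < 9000".
    std::set<std::string> below(int limit) const {
        std::set<std::string> out;
        const auto stop = byValue_.lower_bound(limit);  // first entry >= limit
        for (auto it = byValue_.begin(); it != stop; ++it) out.insert(it->second);
        return out;
    }

    // Tickers whose value is strictly above 'limit', e.g. "Price > 150".
    std::set<std::string> above(int limit) const {
        std::set<std::string> out;
        for (auto it = byValue_.upper_bound(limit); it != byValue_.end(); ++it)
            out.insert(it->second);
        return out;
    }

private:
    std::multimap<int, std::string> byValue_;
    std::unordered_map<std::string, std::multimap<int, std::string>::iterator> where_;
};
```

With the sample messages, price.above(150) gives {GOOG, MSFT}, volume.below(9000) gives {GOOG, GS}, and intersecting the two leaves {GOOG}.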

mktData conflation: design question

I have hit this same question twice — Q: in a streaming price feed, you get IBM prices in the queue but you don’t want consumer thread AA to use “outdated” prices. Consumer BB needs a full history of the prices.

I see two conflicting requirements by the interviewer. I will point out to the interviewer this conflict.

I see two channels — in-band + out-of-band needed.

  1. in-band only — if full tick history is important, then the consumers have to /process/ every tick, even if outdated. We can have dedicated systems just to record ticks, with latency. For example, Rebus receives every tick, saves it and sends it out without conflation.
  2. out-of-band — If your algo engine needs to catch opportunities at minimal latency, then it can’t afford to care about history. It must ignore history. I will focus on this requirement.
  3. dual-band — Combining the two, if your super-algo-engine needs to analyze tick-by-tick history and also react to the opportunities, then the “producer” thread alone has to do all work till order transmission, but I don’t know if it can be fast enough. In general, the fastest data processing system is single-threaded without queues and minimal interaction with other data stores. Since the producer thread is also the consumer thread for the same message, there’s no conflation. Every tick is consumed! I am not sure about the scalability of this synchronous design. FIFO Queue implies latency. Anyway, I will not talk further about this stringent “combo” requirement.

https://tabbforum.com/opinions/managing-6-million-messages-per-second?print_preview=true&single=true says “Many firms mitigate the data they consume through the use of simple time conflation. These firms throw data on the floor based solely on the time that data arrived.”

In the Wells interview, I proposed a two-channel design. The producer simply updates a “notice board” with latest prices for each of 999 tickers. Registered consumers get notified out-of-band to re-read the notice board[1], on some messaging thread. Async design has a latency. I don’t know how tolerable that is. I feel async and MOM are popular and tolerable in algo trading. I should check my book [[all about HFT]]…
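The notice board itself can be as small as one overwritable slot per ticker. In this sketch NoticeBoard, publish() and read() are hypothetical names, prices are assumed to be integers (e.g. in cents), and the out-of-band notification mechanism is left out.

```cpp
#include <array>
#include <atomic>
#include <cassert>
#include <cstddef>

constexpr std::size_t kTickers = 999;  // ticker count taken from the text

// Notice board: one slot per ticker holding only the latest price.
class NoticeBoard {
public:
    NoticeBoard() {
        for (auto& p : latest_) p.store(0);  // 0 means "no price yet"
    }

    // Producer: overwrite the slot, so older prices are conflated away.
    void publish(std::size_t tickerId, long price) { latest_[tickerId].store(price); }

    // Consumer: called after an out-of-band notification; returns the newest price.
    long read(std::size_t tickerId) const { return latest_[tickerId].load(); }

private:
    std::array<std::atomic<long>, kTickers> latest_;
};
```

A slow consumer that reads after several updates only ever sees the last one, which is exactly the conflation wanted by consumer AA and exactly what consumer BB cannot accept.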

In-band only — However, the HSBC manager (Brian?) seems to imply that for minimum latency, the socket reader thread must run the algo all the way and send order out to exchange in one big function.

Out-of-band only — For slightly slower markets, two market-leading investment bank gateways actually publish periodic updates regardless how many raw input messages hit it. Not event-driven, not monitoring every tick!

  • Lehman eq options real time vol publisher
  • BofA Stirt Sprite publishes short-term yield curves on the G10 currencies.
    • Not EURUSD spot prices

[1] The notification can only contain indicative price numbers and serve to invite clients to check the notice board. If clients rely solely on the indicative, it defeats conflation and brings us back to a FIFO design.

SDI: order-resend timer #CSY #retrans

Requirement — Each time we send an order (with a unique orderID number), we need to wait for about 5 seconds. If no Ack received on this id, we would resend it using the same id. Please design a data structure and algo to achieve it.

Interview question Context 1 — probably re-transmission request, so “order id” means a sequence number.

Interview question Context 2 — re-transmission in TCP, so “order id” means a sequence number.

Interview question Context 3 — FIX connectivity

I believe we must keep data structure size under control, so when there are too many pending orders then very old pending orders would be dropped according to a reasonable policy.

A reasonable assumption — For simplicity, we resend any order only once and drop the order. If needed, we could send the same or a modified order but under a new orderID.

For now, I will relax the timing precision so that a little longer than 5 seconds is tolerable in practice. I would hope it takes sub-millis to iterate through any data structure under size control.

Note TCP has an extremely robust, efficient and well-thought-out design for a similar challenge, tested millions of times every second throughout the world. However, I will not reference it. Below is ..

—- my homemade design —-

System is driven by 4 types of events — timer, ack, new-order, resend. The first 3 are asynchronous, primary events, whereas the resend is a secondary event after a primary event. To minimize data races, I will use a single thread, so all event handlers must be brief.

Ring-buffer is the most popular underlying data structure for this type of system. I will implement a linked queue where each queue node is allocated from a ring buffer, and returned to buffer after delete/erase. Note a contiguous array will NOT accommodate mid-stream deletion.

  • Hashmap holds {orderId -> address of link node}
  • Each link node has {integer orderId; expiry time, pointer to next node; other trade details}.
  • We enqueue only at the tail, but we could erase either from head (dequeue) or the middle (ack received)
  • If we take a snapshot at any time, all link nodes are always ordered by expiry.
  • Only one timer is needed. It is either empty or has a single expiry time.

Event-handler algorithms:

  • After sending a new order:
    • iterate from the head of the queue. If any node has an expiry time already passed, then resend it and dequeue it. Once we see a node that’s not expired yet, iteration ends.
    • enqueue the new id. If there’s no capacity, then simply remove the oldest node i.e. head of queue.
  • After a resend:
    • Always erase (the node for) the resent id, usually mid-stream. This is where linked lists beat arrays.
    • If this resend is due to a timer event, then we need to set the timer to the expiry time of the queue head.
    • (No data structure scan since this is a secondary event.)
  • After a timer event:
    • iterate from the head of the queue. If any node has an expiry time already passed, then resend it and dequeue it. Once we see a node that’s not expired yet, iteration ends.
    • set the timer to the expiry time of the current queue head.
  • After an ack is received:
    • get the id in the ack message
    • use it to look up in hashmap to get the order object.
    • erase the node from linked queue
    • iterate from the head of the queue. If any node has an expiry time already passed, then resend it and dequeue it. Once we see a node that’s not expired yet, iteration ends.
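The event handlers above can be sketched as follows. ResendQueue and its method names are hypothetical. For brevity the sketch uses std::list with its default allocator instead of nodes carved from a ring buffer, represents time as plain milliseconds, and returns the ids to resend instead of sending them.

```cpp
#include <cassert>
#include <cstddef>
#include <iterator>
#include <list>
#include <unordered_map>
#include <vector>

class ResendQueue {
public:
    ResendQueue(std::size_t capacity, long timeoutMs)
        : capacity_(capacity), timeoutMs_(timeoutMs) {}

    // New-order event: sweep expired nodes, then enqueue at the tail.
    std::vector<int> onNewOrder(int orderId, long nowMs) {
        std::vector<int> resent = sweep(nowMs);
        if (!queue_.empty() && queue_.size() >= capacity_)
            erase(queue_.front().orderId);  // no capacity: drop the oldest node
        queue_.push_back(Node{orderId, nowMs + timeoutMs_});
        index_[orderId] = std::prev(queue_.end());
        return resent;
    }

    // Ack event: erase the acked node (usually mid-stream), then sweep.
    std::vector<int> onAck(int orderId, long nowMs) {
        erase(orderId);
        return sweep(nowMs);
    }

    // Timer event: sweep; the caller re-arms the single timer with nextExpiry().
    std::vector<int> onTimer(long nowMs) { return sweep(nowMs); }

    // Expiry of the queue head, or -1 when nothing is pending.
    long nextExpiry() const { return queue_.empty() ? -1 : queue_.front().expiryMs; }

private:
    struct Node { int orderId; long expiryMs; };

    // Pops every expired node from the head and returns the ids to resend (once only).
    std::vector<int> sweep(long nowMs) {
        std::vector<int> resent;
        while (!queue_.empty() && queue_.front().expiryMs <= nowMs) {
            resent.push_back(queue_.front().orderId);
            erase(queue_.front().orderId);
        }
        return resent;
    }

    void erase(int orderId) {
        auto it = index_.find(orderId);
        if (it == index_.end()) return;  // unknown, already acked or already resent
        queue_.erase(it->second);
        index_.erase(it);
    }

    std::list<Node> queue_;  // stays ordered by expiry because the timeout is constant
    std::unordered_map<int, std::list<Node>::iterator> index_;
    std::size_t capacity_;
    long timeoutMs_;
};
```

Because every node gets the same timeout, appending at the tail keeps the list sorted by expiry, so each sweep stops at the first unexpired node.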

SDI: design Uber

Q: Design Uber or Lyft (a ride sharing service)

While designing a ride-sharing service, discuss things like:

  • The most critical use case — when a customer requests a ride and how to efficiently match them with the nearby drivers?
  • How to store millions of geographical locations for drivers and riders who are always moving.
  • How to handle updates to driver/rider locations (millions of updates every second)? Comparable to market data systems, but less latency sensitive.
  • How to scale out (or scale up)?
  • What data store? See below
    • sharding policy? by location?
  • Any MOM?
  • Any multicast?
  • Any queuing/buffering? I doubt it.
  • –secondary features
  • payment? I feel is less challenging in terms of performance and data volume. It’s like a post-trade system.

–minimize location update messaging volume

If a rider app is in the background, then the location should not be monitored. Server would ignore any location data on this rider. When a rider app goes foreground, then a rider is possibly looking at the Uber app screen, then the app will send a msg to the server and server will start tracking its location.

Driver would log in to start receiving requests. We can log her out after a timeout like not responding to any requests, or staying background. By default, we could log out a driver after a configured time. Driver app can also have a feature to stay always-on.

Driver location updates should be 10 times more frequent than rider.

–Data store for driver movement (static data will go to a separate, slower data store)

Let’s stick to something we know well — RDBMS. I feel a single big database is enough for any country including U.S. A traditional SQL table can hold 200 million rows (drivers) easily and support concurrent updates like

  • location update, the most common, perhaps 10 times per minute for each driver
  • driver logout, when she decides not to receive booking requests
  • driver login

(It’s possible to upgrade to an in-memory database or a NoSQL store.)

We need to have at least an index on DriverId and an index on Location i.e. latitude/longitude/zipcode
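
To make the two indexes concrete, here is a small in-memory Java sketch. It is not the RDBMS design itself, just an illustration of what the DriverId index and the Location index each answer. It assumes location is reduced to a zipcode string, is single-threaded, and uses hypothetical names (`DriverLocationIndex`, `update`, `driversIn`).

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

/** Two lookups over logged-in drivers: by driver id and by zipcode (hypothetical). */
class DriverLocationIndex {
    private final Map<Long, String> zipByDriver = new HashMap<>();        // "index on DriverId"
    private final Map<String, Set<Long>> driversByZip = new HashMap<>();  // "index on Location"

    /** Driver login and location update are the same operation in this sketch. */
    void update(long driverId, String zip) {
        String old = zipByDriver.put(driverId, zip);
        if (zip.equals(old)) return;                 // still in the same zipcode
        if (old != null) removeFromZip(driverId, old);
        driversByZip.computeIfAbsent(zip, z -> new HashSet<>()).add(driverId);
    }

    /** Driver logout: she no longer receives booking requests. */
    void logout(long driverId) {
        String old = zipByDriver.remove(driverId);
        if (old != null) removeFromZip(driverId, old);
    }

    /** Candidate drivers for a rider located in the given zipcode. */
    Set<Long> driversIn(String zip) {
        return driversByZip.getOrDefault(zip, Collections.emptySet());
    }

    private void removeFromZip(long driverId, String zip) {
        Set<Long> drivers = driversByZip.get(zip);
        drivers.remove(driverId);
        if (drivers.isEmpty()) driversByZip.remove(zip);
    }
}
```

A location update touches both structures, which is the same cost an RDBMS pays when an indexed column changes.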

count unique words]big file using5machines: high-level design

Q: Design a system to calculate the number of unique words in a file
1) What if the file is huge? (i.e. cannot fit in the main memory)
2) Assuming that you have more than one computer available, how can you distribute the problem?

Constraints are the key to such an optimization. Let’s make it more realistic but hopefully without loss of generality. Say the file is 2 TB ascii of purely alphabetical words of any language in a unified alphabet, with natural distribution such as text from world newspapers. Word length is typically below 20.

I’d assume regular 100GB network with dedicated sockets between machines. The machines have roughly equal memory, and the combined memory is enough to hold the file.

I’d minimize disk and network access since these are slower than memory access and require serialization.

Q: is the network transfer such a bottleneck that I’m better off processing the entire file on one machine?

— one-machine solution —
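Assume my memory (2GB) can hold only 1% of the unique words. I’d select only those words “below” ad*, i.e. aa*, ab*, ac* only, and save the unique words to a temp file. Then I’d rescan the input file looking for ad*, ae*…ak* to produce a 2nd temp file, and so on. Finally, combine the temp files. Since the ranges are disjoint, no word appears in two temp files, so the unique-word count is simply the sum of the per-file counts.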
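
The multi-pass idea can be sketched in Java as below. This is a toy version under my own assumptions: the input is an in-memory `Iterable<String>` standing in for repeated scans of the file, the per-pass set stands in for a temp file, and the cut points (`cuts`) are supplied by the caller. The class and method names are hypothetical.

```java
import java.util.HashSet;
import java.util.Set;

class MultiPassUniqueCounter {
    /**
     * One scan of the input per range. Pass i keeps only words w with
     * cuts[i-1] <= w < cuts[i], so the in-memory set stays small.
     * Ranges are disjoint, so the per-pass counts simply add up.
     */
    static long countUnique(Iterable<String> words, String[] cuts) {
        long total = 0;
        for (int pass = 0; pass <= cuts.length; pass++) {
            String lo = pass == 0 ? null : cuts[pass - 1];
            String hi = pass == cuts.length ? null : cuts[pass];
            Set<String> seen = new HashSet<>();      // stands in for one temp file
            for (String w : words) {                 // one full rescan of the input
                boolean aboveLo = lo == null || w.compareTo(lo) >= 0;
                boolean belowHi = hi == null || w.compareTo(hi) < 0;
                if (aboveLo && belowHi) seen.add(w);
            }
            total += seen.size();
        }
        return total;
    }
}
```

With cuts `{"ad", "al"}` the three passes cover words below ad*, words from ad* up to ak*, and everything else.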

— multi-machine solution —
Don’t bother to have one machine scan the file and tcp the words to the other machines. Just copy the entire input file by CD or file transfer to each machine. Each machine would ignore words outside its target range.

How do we divide the task? Say we have 50 machines. We don’t know the exact distribution, so if we assume the range aa-ak has few enough unique words to fit into one machine (2GB), the assumption might be wrong. Instead, we’d divide the entire universe into 50 * 10 ranges. We assume that even if we are underestimating, each range should still fit into one machine. Every time a machine finishes one range, it sends a tiny signal to a controller and waits for the controller to give it the next range.
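
The controller hand-out can be simulated in one JVM, with threads standing in for machines and a shared queue standing in for the controller. This is a sketch only; it assumes lowercase a-z words, splits the universe by the first two letters, and uses hypothetical names (`RangeController`, `rangeOf`).

```java
import java.util.HashSet;
import java.util.List;
import java.util.Queue;
import java.util.Set;
import java.util.concurrent.ConcurrentLinkedQueue;
import java.util.concurrent.atomic.AtomicLong;

class RangeController {
    /** Maps a lowercase word to one of `ranges` alphabetical buckets by its first two letters. */
    static int rangeOf(String w, int ranges) {
        int c0 = w.isEmpty() ? 0 : w.charAt(0) - 'a';
        int c1 = w.length() < 2 ? 0 : w.charAt(1) - 'a';
        return (c0 * 26 + c1) * ranges / 676;        // 676 = 26 * 26 two-letter prefixes
    }

    static long countUnique(List<String> words, int machines, int ranges) {
        Queue<Integer> todo = new ConcurrentLinkedQueue<>();   // the "controller"
        for (int r = 0; r < ranges; r++) todo.add(r);
        AtomicLong total = new AtomicLong();
        Thread[] pool = new Thread[machines];
        for (int m = 0; m < machines; m++) {
            pool[m] = new Thread(() -> {
                Integer range;
                while ((range = todo.poll()) != null) {        // ask for the next range
                    Set<String> seen = new HashSet<>();
                    for (String w : words) {                   // every machine has the whole file
                        if (rangeOf(w, ranges) == range) seen.add(w);
                    }
                    total.addAndGet(seen.size());              // the "tiny signal" back
                }
            });
            pool[m].start();
        }
        for (Thread t : pool) {
            try {
                t.join();
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                throw new IllegalStateException(e);
            }
        }
        return total.get();
    }
}
```

Having many more ranges than machines means a machine that draws a dense range simply takes fewer ranges overall.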

— hashing on words —
Hash table should be sized to minimize rehash. We need superfast hashCode and compression. hashcode should use all the characters, perhaps except the first, since it tends to be the same within a range.

max-thruput quote distribution: 6designs#CAS,socket

Update: the fastest design would require a single-threaded model with no shared mutable state

Suppose a live feed of market quotes pumps in messages at the max speed of the network (up to 100gigabit/sec). We have (5) thousands of hedge fund clients, each with some number (not sure how large, perhaps hundreds) of subscriptions to these quotes. Each subscription sets up a filter that may look like some combination of “Symbol = IBM”, “bid/ask spread < 0.2…”, or “size at the best bid price….”. All the filters only reference fields of the quote object such as symbol, size and price. We need the fastest distribution system. Bottleneck should be network, not our application.

–memory allocation and copying–
If an IBM /quote/ matches 300 filters, then we need to send it to 300 destinations, therefore copying 300 times, but not 300 allocations within JVM. We want to minimize allocation within JVM. I believe the standard practice is to send just one copy as a message and let the receiver (different machine) forward it to those 300 hedge funds. Non-certified RV is probably efficient, but unicast JMS is fine too.

–socket reader thread latency–
Given the messaging rate, the socket reader thread should be as lean as possible. I suggest it should blindly drop each msg into a buffer, without looking at it. Consumer threads can then asynchronously apply the filters and distribute the quotes.

A fast wire format is fixed-width. The socket reader takes 500 bytes, assumes it’s one complete quote object, and blindly drops this 500-byte array into the buffer.

–multicast rather than concurrent unicast–
See single/multi-thread TCP servers contrasted

–cpu dedication–
Each thread is busy and important enough to deserve a dedicated cpu. That CPU is never given to another thread.

Now let me introduce my design. One thread per filter. Buffer is a circular array: bounded, but with efficient pre-allocation. Pre-allocation requires fixed-sized nodes, probably byte arrays of 500 each. I believe de-allocation is free, since slots are simply recycled. Another friend (csdoctor) suggested an unbounded linked list of arrays. Total buffer capacity should exceed the *temporary* queue build-up. Slowest consumer thread must be faster than producer, though momentarily the reverse could happen.
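
A minimal Java sketch of such a pre-allocated circular array follows. It is deliberately simplified to one producer (the socket reader) and one consumer, whereas the design above has one consumer per filter. The capacity being a power of two, and the names `QuoteRing`, `offer` and `poll`, are my assumptions.

```java
import java.util.concurrent.atomic.AtomicLong;

/** Bounded ring of pre-allocated 500-byte slots; one producer, one consumer. */
class QuoteRing {
    static final int SLOT_BYTES = 500;
    private final byte[][] slots;                        // allocated once, recycled forever
    private final int mask;
    private final AtomicLong head = new AtomicLong();    // next slot to read
    private final AtomicLong tail = new AtomicLong();    // next slot to write

    QuoteRing(int capacity) {
        if (Integer.bitCount(capacity) != 1) {
            throw new IllegalArgumentException("capacity must be a power of two");
        }
        slots = new byte[capacity][SLOT_BYTES];
        mask = capacity - 1;
    }

    /** Producer: copy one fixed-width message in. Returns false if the ring is full. */
    boolean offer(byte[] msg) {
        long t = tail.get();
        if (t - head.get() == slots.length) return false;
        System.arraycopy(msg, 0, slots[(int) (t & mask)], 0, SLOT_BYTES);
        tail.set(t + 1);                                 // volatile write publishes the slot
        return true;
    }

    /** Consumer: copy the oldest message out. Returns false if the ring is empty. */
    boolean poll(byte[] dst) {
        long h = head.get();
        if (h == tail.get()) return false;
        System.arraycopy(slots[(int) (h & mask)], 0, dst, 0, SLOT_BYTES);
        head.set(h + 1);                                 // slot is now reusable, no GC involved
        return true;
    }
}
```

After construction no further allocation happens: "freeing" a slot is just advancing `head`.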

—-garbage collection—-
Note jvm gc can’t free the memory in our buffer.

–Design 3–
Allocate a counter in each quote object. Each filter applied will decrement the counter. The thread that hits zero will free it. But this incurs allocation cost for that counter.
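
A sketch of the countdown in Java. One possible way around the per-quote allocation, which is my variation rather than part of Design 3 as stated, is to keep all counters in a single pre-allocated `AtomicIntegerArray` parallel to the buffer. The names `SlotRefCounts`, `publish` and `filterDone` are hypothetical.

```java
import java.util.concurrent.atomic.AtomicIntegerArray;

/** One countdown per buffer slot, allocated once alongside the buffer. */
class SlotRefCounts {
    private final AtomicIntegerArray remaining;
    private final int filterCount;

    SlotRefCounts(int slots, int filterCount) {
        this.remaining = new AtomicIntegerArray(slots);
        this.filterCount = filterCount;
    }

    /** Producer: a new quote landed in this slot, every filter still has to see it. */
    void publish(int slot) {
        remaining.set(slot, filterCount);
    }

    /**
     * A filter thread finished with this slot. Returns true for exactly one
     * caller, the one that hit zero; that thread recycles the slot.
     */
    boolean filterDone(int slot) {
        return remaining.decrementAndGet(slot) == 0;
    }
}
```

The atomic decrement costs a CAS-like operation per filter per quote, which may matter at these message rates.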

–Design 6–
Each filter thread records in a global var its current position within the queue. Each filter thread advances through the queue and increments its global var. One design is based on the observation that, given the dedicated CPU, the slowest thread is always the slowest in the wolfpack. This designated thread would free the memory after applying its filter.

However, it’s possible for 2 filters to be equally slow.

–Design 8– We can introduce a sweeper thread that periodically wakes up to sequentially free all allocations that have been visited by all filters.
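
The sweeper's core calculation can be sketched in Java as the minimum over the per-filter positions from Design 6. Taking the minimum also covers the case of two equally slow filters. The names `FilterPositions`, `advance` and `reclaimableBelow` are hypothetical, and positions are assumed to be ever-increasing sequence numbers rather than wrapped array indexes.

```java
import java.util.concurrent.atomic.AtomicLongArray;

/** Each filter thread publishes how far it has read; the sweeper takes the minimum. */
class FilterPositions {
    // next.get(i) = sequence number of the next quote filter i will read
    private final AtomicLongArray next;

    FilterPositions(int filters) {
        next = new AtomicLongArray(filters);
    }

    /** Called by filter thread i after it has finished with one quote. */
    void advance(int filter) {
        next.incrementAndGet(filter);
    }

    /** Sweeper: every sequence number below the result has been visited by all filters. */
    long reclaimableBelow() {
        long min = Long.MAX_VALUE;
        for (int i = 0; i < next.length(); i++) {
            min = Math.min(min, next.get(i));
        }
        return min;
    }
}
```

Because each position only ever increases, a slightly stale read by the sweeper is safe: it can only under-estimate what is reclaimable.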

–Design 9– One thread to apply all filters for a given HF client. This works if filter logic is few and simple.

–Design A (CAS)– Create any # of “identical” consumer threads. Any time we can expand this thread pool.
1) read BigArrayBuffer[++MyThreadPtr] into this thread’s register and examine the fields, without converting to a Quote instance.
2) examine the Taken boolean flag. If already set, then simply “continue” the loop. This step might be needed if CAS is costly.
3) CAS to set this flag
4a) if successful, apply ALL filters on the quote. Then somehow free up the memory (without the GC). Perhaps set another boolean flag to indicate this fixed-length block is now reusable storage.
4b) else just “continue” since another thread will process and free it.
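
Steps 2 to 4 can be sketched in Java with an `AtomicIntegerArray` holding one flag per slot. This is a simplified sketch: a single flag serves as both the Taken marker and the reusable marker, the filter work is replaced by a counter, and the names `CasClaimBuffer`, `tryClaim`, `release` and `raceOnce` are hypothetical.

```java
import java.util.concurrent.atomic.AtomicIntegerArray;

/** One claim flag per fixed-length block; identical consumers race with CAS. */
class CasClaimBuffer {
    private static final int FREE = 0;
    private static final int TAKEN = 1;
    private final AtomicIntegerArray flags;

    CasClaimBuffer(int slots) {
        flags = new AtomicIntegerArray(slots);            // all FREE initially
    }

    boolean tryClaim(int slot) {
        if (flags.get(slot) == TAKEN) return false;       // step 2: cheap read before the CAS
        return flags.compareAndSet(slot, FREE, TAKEN);    // step 3: only one thread wins
    }

    /** Step 4a: the winner marks the block reusable once it is done. */
    void release(int slot) {
        flags.set(slot, FREE);
    }

    /** Demo: several identical threads scan the same slots; returns claims per slot. */
    static int[] raceOnce(int threads, int slots) {
        CasClaimBuffer buf = new CasClaimBuffer(slots);
        AtomicIntegerArray claims = new AtomicIntegerArray(slots);
        Thread[] pool = new Thread[threads];
        for (int t = 0; t < threads; t++) {
            pool[t] = new Thread(() -> {
                for (int s = 0; s < slots; s++) {
                    if (buf.tryClaim(s)) claims.incrementAndGet(s);  // 4a: apply ALL filters here
                    // 4b: otherwise just continue, another thread owns this slot
                }
            });
            pool[t].start();
        }
        for (Thread th : pool) {
            try {
                th.join();
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        }
        int[] out = new int[slots];
        for (int s = 0; s < slots; s++) out[s] = claims.get(s);
        return out;
    }
}
```

In `raceOnce` no slot is released during the race, so each slot should end up claimed by exactly one thread regardless of how many consumers are in the pool.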

message requester to wait5min for response #wait()

Imagine a typical request/reply messaging system. I think in JMS it’s usually based on temp queues, reply-to and correlation-Id — See other blog post. In contrast, RV has no broker. It’s decentralized into multiple peer rv daemons. No difference in this case —

Suppose a message broker holds a lot of queues. One of the queues is for a request message, from requester system to a pricing system. Another queue is for pricing system to return the new price to the requester.

Now, pricing system is slow. Requester should wait for no more than 5 minutes. If the new price comes back through the reply-queue 301 sec later, requester will ignore this stale price since it’s too risky to place an order on a stale price in a fast market. How do you implement this?

My design: the requester main thread can use condVar wait(5*60*1000), i.e. 300,000 ms. Another thread in the requester JVM can block forever in onMsg(), and notify the main thread when something is received.

Timer is the 2nd usage of condVar as Stroustrup described.
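
A minimal Java sketch of this wait-with-deadline, using plain `Object.wait` and `notifyAll`. It assumes one `ReplyWaiter` per request and a price delivered as a `double`; both the class and its method names are hypothetical. The loop recomputes the remaining time after every wake-up, because `wait` can return early (spurious wake-ups), and it never passes a zero timeout, since `wait(0)` means wait indefinitely.

```java
/** One instance per outstanding request (hypothetical). */
class ReplyWaiter {
    private final Object lock = new Object();
    private Double reply;                      // null until the pricing system answers

    /** Called by the listener thread that blocks in the message callback. */
    void onMsg(double price) {
        synchronized (lock) {
            reply = price;
            lock.notifyAll();
        }
    }

    /** Main thread: returns the price, or null if none arrived within the timeout. */
    Double await(long timeoutMillis) {
        long deadline = System.nanoTime() + timeoutMillis * 1_000_000L;
        synchronized (lock) {
            try {
                while (reply == null) {
                    long left = deadline - System.nanoTime();
                    if (left <= 0) return null;            // too late, treat any later reply as stale
                    lock.wait(left / 1_000_000L, (int) (left % 1_000_000L));
                }
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();        // treated as "no reply" in this sketch
                return null;
            }
            return reply;
        }
    }
}
```

For the 5-minute case the requester would call `await(5 * 60 * 1000L)`. A reply arriving after `await` has returned null is simply never read.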

(I actually implemented this in a few trading engines.)

Merrill S’pore: fastest stock broadcast

Updates — RV or multicast topic; msg selector

I think this is a typical wall-street interview question for a senior role. System requirement as remembered by my friend the interviewee: ML needs a new relay system to receive real-time stock updates from a stock exchange such as SGX. Each ML client, one of many thousands[1], will install a new client-software [3] to receive updates on the stocks [2] she is interested in. Some clients use algorithmic trading systems and need the fastest feed.

[1] Not clear about the order of magnitude. Let’s target 10,000
[2] Not clear how many stocks per client on average. Let’s target 100.
[3] Maintenance and customer support for a custom client-software is a nightmare and perhaps impractical. Practically, the client-software has to be extremely mature such as browsers or email clients.

Q: database locking?
A: I don’t think so. Only concurrent reading. No write-contention.

Key#1 to this capacity planning is how to identify bottlenecks. Bandwidth might be a more severe bottleneck than other bottlenecks described below.

Key#2 — 2 separate architectures for algorithmic clients and traditional clients. Each architecture would meet a different minimum latency standard, perhaps a few seconds for traditional and sub-second for algorithmic.

Solution 0: Whatever broadcasting system SGX uses. In an ideal world, no budget constraint. Highest capacity desired.

Solution 2: no MQ? No asynchronous transmission? As soon as an update is received from SGX, the relay calls each client directly. Server-push.

Solution 1: MQ — the standard solution in my humble opinion.

Solution 1A: topics. One topic per stock. If 2000 clients want IBM updates, they all subscribe to this topic.

Q: client-pull? I think this is the bottleneck.

Q: Would Client-pull introduce additional delays?

Solution 1B: queues. one queue for each client each stock.

If 2000 clients want IBM updates, Relay needs to make that many copies of an update and send them to that many queues, a duplication of effort. I think this is the bottleneck. Not absolutely sure if this affects relay system performance. Massively parallel processing is required, with thousands of native CPU threads (not Java green threads).