bbg-Eq: filters in a screen

Exchanges constantly send you live update messages in a fixed format like

IBM, Price, 100
GOOG, Volume, 5000
MSFT, Volume, 14000
IBM, Price, 99
GS, Volume, 7000
MSFT, Price, 250
GOOG, Price, 332

Each message always has exactly 3 fields: ticker, field name, integer value. A new value overwrites the previous value; values never aggregate. In this example, the IBM price 99 wipes out the IBM price 100.

There are up to 100 distinct field names, but many tickers and a very high message rate.

Each user has a Screen, which holds a variable number of Filters defined by that user. Each Filter has a fixed format like “Price > 150” or “Volume < 9000”. All the filters must be applied to produce the list of qualifying tickers. In this example, the first filter alone returns {MSFT, GOOG} and the second filter alone returns {GS, GOOG}; applying both filters reduces the list to {GOOG}.
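A minimal sketch of one way to represent a Filter and a Screen in C++ (the struct names, the single-character op and the qualifies() helper are my own assumptions, not part of the original spec):

#include <string>
#include <vector>

// One filter like "Price > 150": field name, comparison, threshold.
// Only > and < appear in the examples, so only those two are modelled.
struct Filter {
    std::string field;   // e.g. "Price" or "Volume"
    char        op;      // '>' or '<'
    long        limit;   // e.g. 150 or 9000

    bool qualifies(long value) const {
        return op == '>' ? value > limit : value < limit;
    }
};

// One user's screen: a variable number of filters, all of which must pass.
struct Screen {
    std::vector<Filter> filters;
};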

The data feed goes into the server. The server maintains thousands of Screen objects, one per user. When a new message is received, the server must apply all relevant filters. If, out of 3000 screens, 500 Screens’ lists get updated, another module will notify those 500 clients. You don’t need to design the interface to the clients, the data feed, or the threading.

Requirement: design the data store, the Screen and the Filter.

I will assume 3000 Screens. On average, each Screen has 4 filters, defined by that user. If you assume every incoming message must go through 3000*4 = 12000 filters, you are probably right.
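To make that 12000-evaluation estimate concrete, here is a brute-force sketch. It reuses the Filter/Screen structs above and assumes a last-value cache keyed by ticker then field; onMessage and lastValue are names I made up:

#include <string>
#include <unordered_map>
#include <vector>

// Last-value cache: ticker -> field name -> latest integer value.
std::unordered_map<std::string, std::unordered_map<std::string, long>> lastValue;

// Brute force: on each message, overwrite the stored value, then re-check this
// ticker against every filter of every screen -- roughly 3000 * 4 = 12000
// filter evaluations per message.
void onMessage(const std::string& ticker, const std::string& field, long value,
               std::vector<Screen>& screens) {
    lastValue[ticker][field] = value;            // overwrite, never aggregate

    const auto& fields = lastValue[ticker];
    for (Screen& screen : screens) {
        bool qualifies = true;
        for (const Filter& f : screen.filters) {
            auto it = fields.find(f.field);
            // A ticker with no value yet for the filtered field does not qualify.
            if (it == fields.end() || !f.qualifies(it->second)) { qualifies = false; break; }
        }
        (void)qualifies;  // updating the screen's ticker list / notifying the client is out of scope
    }
}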

Can you avoid linear searches when applying filters? To illustrate, suppose all the messages are Price numbers. I maintain an unordered_map<ticker, priceValue>, updated constantly. When the server applies a single filter, we start with the survivor list from the previous filter, go through each ticker, and erase each disqualified ticker from the survivor list. If I start with N tickers and perform N look-ups, we get O(N).
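A sketch of that simple pass, assuming the map is std::unordered_map<std::string, long> keyed by ticker and the survivor list is a vector; it builds a fresh output list instead of erasing in place, which is equivalent for this discussion:

#include <string>
#include <unordered_map>
#include <vector>

// Latest price per ticker, overwritten on every Price message.
std::unordered_map<std::string, long> priceOf;

// Apply one price filter (e.g. "Price > 150") to the survivors of the previous
// filter: N survivors -> N hash look-ups -> O(N).
std::vector<std::string> applyPriceFilter(const std::vector<std::string>& survivors,
                                          bool greaterThan, long limit) {
    std::vector<std::string> out;
    for (const std::string& ticker : survivors) {
        auto it = priceOf.find(ticker);
        if (it == priceOf.end()) continue;       // no price yet -> disqualified
        if (greaterThan ? it->second > limit : it->second < limit)
            out.push_back(ticker);
    }
    return out;
}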

Akhil hinted at a reverse sorted map<priceValue, ticker>. I will call it an RBTree. This RBTree must be updated with each message, by erase/insert. Suppose we care only about filter latency, not about update latency. Or suppose we get very few updates but many Screens to applyFilters(); then this RBTree can help. We will use lower_bound followed by backward iteration (for example, Price < 150). This is presumably faster than my simple solution of full hash table iteration.
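A sketch of that index, written here as an ascending std::multimap (multimap because several tickers can share one price), so that “Price < 150” becomes lower_bound(150) followed by backward iteration as described; the reverse-sorted variant would simply iterate forward instead. The handleOf map is my own addition to make the per-message erase/insert cheap:

#include <map>
#include <string>
#include <unordered_map>
#include <vector>

// Price index: priceValue -> ticker.
std::multimap<long, std::string> priceIndex;
// Handle per ticker so the stale entry can be erased when a new price arrives.
std::unordered_map<std::string, std::multimap<long, std::string>::iterator> handleOf;

// Erase/insert on every Price message to keep the index current.
void onPrice(const std::string& ticker, long px) {
    auto h = handleOf.find(ticker);
    if (h != handleOf.end()) priceIndex.erase(h->second);
    handleOf[ticker] = priceIndex.insert({px, ticker});
}

// "Price < 150": position at the first entry >= limit, then walk backward over
// the strictly smaller keys instead of scanning the whole container.
std::vector<std::string> priceLessThan(long limit) {
    std::vector<std::string> out;
    auto it = priceIndex.lower_bound(limit);   // first entry with key >= limit
    while (it != priceIndex.begin()) {
        --it;                                  // now key < limit
        out.push_back(it->second);
    }
    return out;
}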

That’s the first filter. How about the second filter? Let’s name the RBTree for the second filter “tree2”. In my simple solution, I start with the survivor list (of size N) from the first filter and look up N times against the hash table of the second filter, without full iteration. With the RBTree, we have a choice (both options are sketched after this list):

  • if tree2.lower_bound() leaves very few (like 5) tickers to be checked, but the survivor list from the first filter is large, then we should convert the survivor list to an unordered_set and drive from tree2.lower_bound().
  • If tree2.lower_bound() leaves many (like 9999) tickers to be checked, then we should drive from the survivor list. tree2 won’t be used.
  • However, in an RBTree it is hard to count how many nodes come before a given node; that count requires expensive iteration, so I feel the choice cannot be made at runtime. See RBTree range count #enum,auto
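Both options sketched, with the caveat from the last bullet: nothing below decides between them at runtime, because a plain RBTree cannot cheaply report how many entries fall below the limit. The names (tree2, volumeOf, driveFromTree2, driveFromSurvivors) are illustrative only:

#include <map>
#include <string>
#include <unordered_map>
#include <unordered_set>
#include <vector>

// Second filter "Volume < 9000": tree2 is its sorted index, volumeOf its hash table.
std::multimap<long, std::string> tree2;             // volume -> ticker
std::unordered_map<std::string, long> volumeOf;     // ticker -> volume

// Option 1: drive from tree2's qualifying range and probe a set built from the
// survivor list. Attractive only when the range below 'limit' is small.
std::vector<std::string> driveFromTree2(const std::vector<std::string>& survivors, long limit) {
    std::unordered_set<std::string> alive(survivors.begin(), survivors.end());
    std::vector<std::string> out;
    auto end = tree2.lower_bound(limit);            // entries before this have volume < limit
    for (auto it = tree2.begin(); it != end; ++it)
        if (alive.count(it->second)) out.push_back(it->second);
    return out;
}

// Option 2: drive from the survivor list and probe the hash table; tree2 is not used.
std::vector<std::string> driveFromSurvivors(const std::vector<std::string>& survivors, long limit) {
    std::vector<std::string> out;
    for (const std::string& ticker : survivors) {
        auto it = volumeOf.find(ticker);
        if (it != volumeOf.end() && it->second < limit) out.push_back(ticker);
    }
    return out;
}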