opaque c++trouble-shooting: bustFE streamIn

This is a good illustration of fairly common opaque c++ problems, the most dreadful/terrifying species of developer nightmares.

The error seems to be somewhat consistent but not quite.

Reproducing it in dev enviroment was a first milestone. Adding debug prints proved helpful in this case, but sometimes it would take too long.

In the end, I needed a good hypothesis, before we could set out to verify it.

     81     bool SwapBustAction::streamInImpl(ETSFlowArchive& ar)
     82     { // non-virtual
     83       if (exchSliceId.empty())
     84       {
     85         ar >> exchSliceId;
     86       }
    104     }
    105     void SwapBustAction::streamOutImpl(ETSFlowArchive& ar) const
    106     { // non-virtual
    107       if (exchSliceId.size())
    108       {
    109         ar << exchSliceId;
    110       }

When we save the flow element to file, we write out the exchSliceId field conditionally as on Line 107, but when we restore the same flow element from file, the function looks for this exchSliceId field unconditionally as on Line 85. When the function can’t find this field in the file, it hits BufferUnderflow and aborts the restore of entire flow chain.

The serialization file uses field delimiters between the exchSliceId field and the next field which could be a map. When the exchSliceId field is missing, and the map is present, the runtime would notice an unusable data item. It throws a runtime exception in the form of assertion errors.

The “unconditional” restore of exchSliceId is the bug. We need to check the exchSliceId field is present in the file, before reading it.

In my testing, I only had a test case where exchSliceId was present. Insufficient testing.

story: %%investigation skill #etscfg

Hi Paul, In my etscfg I used this simple ifelse:

1        %IFELSE{defined(SOME_CHECK)}%
3        %ELSE%
5        %IFELSE%

To my surprise, Line 4 comes out preceded by a new-line when the if-condition fails. This impacts a lot of config artifacts that shouldn’t be impacted.

On the other hand, when the if-condition evaluates to true, I get exactly Line 2 and I don’t see any new-line added anywhere. This is correct output.

— Investigation —
On my Line 3, the implicit new-line character right after %ELSE% is incorrectly included in $else_block and printed out before Line 4, causing the surprise result. Attached are two test programs showing

my ($if_block, $else_block) = split(/(?:[^\w\n]*)\%ELSE\%/, $block); ### should be
my ($if_block, $else_block) = split(/(?:[^\w\n]*)\%ELSE\%\n?/, $block); ### to include trailing newline in delimiter of split()

In my fix, I also added a “guard” to deal with any trailing spaces sandwiched between %ELSE% and new-line character. I know sometimes I can leave a trailing space sandwiched there.

Altogether, my fix consists of two changes

• 1 new line of perl
• 1 modified line of perl


pre-allocate DTOs@SOD #HFT #ets

example — etsflow framework pre-allocates object pool (presumably the flow elements) for the day, to avoid runtime call to malloc. Are these objects ever released to the pool? I doubt it since all of these objects are subject to query or bust.

example — RTS pre-allocates outgoing message objects from a ring buffer’s head, and “returns” to the ring buffer at the tail… See How RTS achieved 400-700 KMPS #epoll

example — Sell-side HFT OMS also uses pre-allocation. Suppose for every new order there are 3 new DataTransferObjects A/B/C to be instantiated on heap. Traditional approach would make 3 allocation requests in real time. I think the free-list manager becomes a hotspot, even if there’s a per-thread free list.

Basically HFT avoids new/malloc after market opens. RTS uses mostly arrays, and very few (rather small) STL containers. Those STL containers tend to be populated before market opens and remain static.

Pre-allocation is a popular technique. We compute at compile time the sizes of A/B/C based on their class declarations. For DTO class A, sizeof(A) just adds up the non-static data field sizes. Then we estimate how many orders we will get a day (say 7 million). Then we pre-allocate 7 million A objects in an array. The allocation happens at start-up, though the sizes are compile-time constants.

When an order comes in, the A/B/C DTO objects are already allocated but empty.

Byte-array is an alternative, but this smells like the raw free list to me…

How RTS achieved 400-700 KMPS #p2p

Official benchmark result – 700 KMPS per instance of rebus or parser. Here are some enabling features:

  • [! iii] mostly single-threaded apps. Some older parsers are multi-threaded but each thread operating in single threaded mode. No shared mutable:)
    • In the order book replication engine, we use lockfree in one place only
  • ! (ST mode loves) Stateless [1] parser, small footprint. Only downstream component (Rebus) handles cancels/busts, modifications.. So we can run many parser instances per machine, or many threads per instance, fully utilizing the cpu cores. See https://haydenjames.io/linux-performance-almost-always-add-swap-space/
  • [! ii] allocation avoided — on millions of Output message objects. Pre-allocated ring buffer eliminates new(). very few and small STL containers .. mostly arrays… pre-allocate DTOs@SOD #HFT
  • ! allocation avoided — on millions of Input message objects, thanks to reinterpret_cast() on pointers… nearly Zero-copy. See reinterpret_cast^memcpy in raw mktData parsing
  • ! allocation avoided — custom containers to replace STL containers used, since they all allocate from heap
  • ! p2p messaging beats MOM 
  • ! Socket buffer tuning — to cope with busts. 64-256MB receive buffer in kernel; 4MB UserModeBuffer
  • low memory footprint. Most parsers use around 60MB. (My parsers was 100 to 150MB.) I think there are many benefits in terms of latency.
  • epoll — to replace select() with 2 orders of magnitude improve in throughput
  • buffering of Output list (of messages). I guess this hurts latency but enhances throughput
  • Very fast duplicate seq check, without large history data — a hot function
  • large containers like the ring buffer are pre-sized. No reallocation.
  • mostly array-based data structures — cache-friendly
  • Hot functions use pbref or RVO or move semantics, minimizing stack allocations
  • aggressively simplified parsing. Minimal logic
  • Special Logger to reduce I/O cost
  • Sharding to speed up symbol lookup
  • kernel bypass : RTS
  • No virtual functions — enhancing data cache and inlining .. 3 overheads@vptr #ARM
  • Inline
  • —-Speed with control
  • Best of both to reduce the real risk of our connection becoming slow or lossy
  • Depth monitor to detect missed messages
  • Capacity tracker to monitor mps rates
  • Missed seq reporting
  • [!=one of the really effective techniques in RTS]
  • [i=presented in multiple (!!, !!!) interviews]

[1] parser does remembers the sequence numbers. In fact sequence number management is crucial. Before retrans was added to parser, we never sent any retrans request. We did keep track of gaps. Mostly we used the gap-tracker to check duplicates across Line A/B. We also relied on our sequence number state to handle sequence resets.

crossed orderbook detection@refresh #CSY

At least one interviewer asked me — in your orderbook replication (like our rebus), how do you detect a crossed orderbook? I have an idea for your comment.

Rebus currently has a rule to generate top-book-marker. Rebus puts this token on the best bid (+ best ask) whenever a new top-of-book bid emerges.

I think rebus can have an additional rule to check the new best bid against the reining best ask. If the new best bid is ABOVE the best ask, then rebus can attach a “crossed-orderbook-warning” flag to the top-book-marker. This way, downstream gets alerted and have a chance to take corrective action such as removing the entire orderbook or blocking the new best bid until rebus sends a best-bid without this warning.

Such a warning could help protect the customer’s reputation and integrity of the data.

crossed orderbook mgmt: real story4IV

disappears after a while; Some days better some days worse

data quality visible to paying customers

–individual contribution?
not really, to be honest.

–what would you do differently?
Perhaps detect and set a warning flag when generating top-book token? See other post on “crossed”


  • query the exchange if query is supported – many exchanges do.
  • compare across ticker plants? Not really done in our case.
  • replay captured data for investigation
  • retrans as the main solution
  • periodic snapshot feed is supported by many exchanges, designed for late-starting subscribers. We could (though we don’t) use it to clean up our crossed orderbook
  • manual cleaning via the cleaner script, as a “2nd-last” resort
  • hot failover.. as last resort

–the cleaner script:

This “depth-cleaner” tool is essentially a script to delete crossed/locked (c/l) entries from our replicated order books. It is run by a user in response to an alert.

… The script then compares the Ask & Bid entries in the order book. If it finds a crossed or locked order, where the bid price is greater than (crossed) or equal to (locked) the ask price, it writes information about that order to a file. It then continues checking entries on both sides of the book until it finds a valid combination. Note that the entries are not checked for “staleness”. As soon as it finds non-crossed/locked Ask and Bid entries, regardless of their timestamps, it is done checking that symbol.

The script takes the entries from the crossed/locked file, and creates a CIM file containing a delete message for every order given. This CIM data is then sent to (the admin port of) order book engine to execute the deletes.

Before the cleaner is invoked manually, we have a scheduled scanner for crossed books. This scanner checks every symbol once a minute. I think it uses a low-priority read-only thread.

m_activity IV story@investigation skill #RTS

symptom — array filled up beyond limit of 1024 .. Undefined Behavior, often crashing entire process, but no guarantees.

This “m_activity” array is a process-wide singleton, holding ALL active tcp/udp socket descriptors (each an int id). Every time we close a socket we are supposed to remove its id from the array, and shift all “upper” elements down.

Sometimes a connection can drop unexpectedly.

What’s this array for? We iterate this array frequently for timer events. We don’t but could use select() to monitor a bunch of sockets.

I found that when we reconnect after an unexpected disruption, we were not following a proper sequence
* check if connected
* if disconnected, then remove the socket id from the array
* connect
* upon success, append the new socket id to the array

Due to bug, in a unstable period, usually at start of day, we could drop connection and reconnect many times, and fill up this array within the first hour (often within minutes).

Hard to reproduce.

%%c++keep crash` I keep grow`as hacker #zbs#Ashish

Note these are fairly portable zbs, more than local GTD know-how !

My current c++ project has high data volume, some business logic, some socket programming challenges, … and frequent crashes.

The truly enriching part are the crashes. Three months ago I was afraid of c++, largely because I was afraid of any crash.

Going back to 2015, I was also afraid of c++ build errors in VisualStudio and Makefiles, esp. those related to linkers and One-Definition-Rule, but I overcame most of that fear in 2015-2016. In contrast, crashes are harder to fix because 70% of the crashes come with no usable clue. If there’s a core file I may not be able to locate it. If I locate it, it may not have symbols. If it has symbols the crash site is usually in some classes unrelated to any classes that I wrote. I have since learned many lessons how to handle these crashes:

  • I have a mental list like “10 common crash patterns” in my log
  • I have learned to focus on the 20% of my codebase that are most convoluted, most important, most tricky and contribute most to debugging difficulties. I then invest my time strategically to rewrite (parts of) that 20% and dramatically simplify them. I managed to get familiar and confident with that 20%.
    • If the code belongs to someone else including 3rd party, I try to rewrite it locally for my dev
  • I have learned to pick the most useful things to log, so they show a *pattern*. The crashes usually deviate from the patterns and are now easier to spot.
  • I have developed my binary data dumper to show me the raw market data received, which often “cause” crashes.
  • I have learned to use more assertions and a hell lot of other validations to confirm my program is not in some unexpected *state*. I might even overdo this and /leave no stoned unturned/.
  • I figured out memset(), memcpy(), raw arrays are the most crash-prone constructs so I try to avoid them or at least build assertions around them.
  • I also figured signed integers can become negative and don’t make sense in my case so I now use unsigned int exclusively. In hind sight not sure if this is best practice, but it removed some surprises and confusions.
  • I also gained quite a bit of debugger (gdb) hands-on experience

Most of these lessons I picked up in debugging program crashes, so these crashes are the most enriching experience. I believe other c++ programs (including my previous jobs) don’t crash so often. I used to (and still do) curse the fragile framework I’m using, but now I also recognize these crashes are accelerating my growth as a c++ developer.

algo trading engine from scratch: ez route hopefully

Update: TrexQuant had an extremely simple trading system, mostly an FIX connectivity engine…

Start by identifying some high-quality, flexible, working code base that’s as close to our requirement as possible. Then slowly add X) business features + Y) optimizations (on throughput, latency etc.) I feel [Y] is harder than [X], thought [X] gives higher business value. Latency tuning is seldom high-value, but data volume could be a show-stopper.

Both X/Y enhancements could benefit from the trusty old SQL or an in-memory data store[1]. We can also introduce MOM. These are mature tools to help X+Y. [3]

As I told many peers, my priorities as architect are 1) instrumentation 2) transparent languages 3) product maturity

GTD Show-stopper: data rate overflow. Already addressed
GTD Show-stopper: frequent[4] crashes. Unlikely to happen if you start with a mature working code base. Roll back to last-working version and retest incrementally. Sometimes the crash is intermittent and hard to reproduce 😦 Good luck with those.

To blast through the stone walls, you need power tools like instrumentation, debuggers … I feel these are more important to GTD than optimization skills.

To optimize, you can also introduce memory manager such as the ring buffer and custom allocator in TP, or the custom malloc() in Facebook. If performance doesn’t improve, just roll back as in Rebus.

For backend, there are many high or low cost products, so they are effectively out of scope, including things like EOD PnL, position management, risk management, reporting. Ironically, many products in these domains advertise themselves as “trading platforms”. In contrast, what I consider in-scope would be algo executor, OMS[2], market data engine [2], real time PnL.

— The “easy route” above is probably an over-simplification, but architects must be cautiously optimistic to survive the inevitable onslaught of adversities and setbacks —

It’s possible that such a design gradually becomes outdated like GMDS or the Perl codebase in PWM-commissions, but that happens to many architects, often for no fault of their own. The better architects may start with a more future-proof design, but more likely, the stronger architects are better at adjusting both the legacy design + new requirements

Ultimately, you are benchmarked against your peers in terms of how fast you figure things out and GTD….

Socket tuning? Might be required to cope with data rate. Latency is seldom a hard requirement.

Threading? single-threaded model is probably best. Multiple processes rather than multiple threads.

Shared memory? Even though shared memory is the fastest way to move data between processes, the high-performance and high-throughput ticket plant uses TCP/Multicast instead.

MOM? for high-speed market data gateway, many banks use MOM because it’s simpler and flexible.

Inter-process data encoding? RTS uses a single simplified FIX-like, monolithic format “CTF”. There are thousands of token types defined in a “master” data dictionary — semi-static data.

GUI for trader? I would think HTML+javascript is most popular and quick. For a barebones trading engine, the GUI is fairly simple.

Scheduled tasks? Are less common in high speed trading engines and seldom latency-sensitive. I would rely on database or java/c++ async timers. For the batch tasks, I would use scripts/cron.

Testing? I would use scripts as much as possible.

[1] eg: GMDS architect chose memory-mapped-file which was the wrong choice. [2] both require an exchange interface
[3] data store is a must; MOM is optional;
[4]If it crashes once a day we could still cope. Most trading engines can shut down when market closed.

stale (replicated) orderbook

This page is intended to list ideas on possible ways of improving the orderbook replication system so that it can better determine when an instrument or instruments are stale. It further lists possible ways of automating retransmission of lost and/or stale data to improve the recovery time in the event of an outage.

The current system is largely “best effort”; at each level of the ticker-plant and distribution network components are capable of dropping data should a problem arise. There are few processes capable of correlating what was lost in a format that is useful to a customer. To account for the possibility of lost data, the orderbook replication components constantly generates “refresh” messages with the most up-to-date state of an instrument.

Although this system works well in practice, it leaves open the possibility that a customer may have a cache with “stale” data for an unbounded length of time. This inability to track staleness can be a point of concern for customers.

= Recovery types =

When a downstream component loses one or more update messages for an instrument it is generally safe to assume the instrument is stale. However there can be two different kinds of recoveries:

== Retransmit the latest snapshot ==

This method of retransmission and stale detection revolves around keeping the current tick snapshot database up to date. It is useful for customers that need an accurate tick cache. It may not be a full solution to customers that need an accurate time and sales database.

== Retransmit all lost ticks ==

It is also possible to retransmit all lost ticks after an outage. This is typically useful when trying to repair a time-and-sales database.

Although it is possible to build an accurate “current” snapshot record when all lost ticks are retransmitted, it is a very tedious and error-prone process. It is expected that customers will, in general, be unwilling to rebuild the “current” from the retransmission of lost ticks.

So, a scheme that involves retransmission of lost ticks will still likely require a scheme that retransmits the latest snapshot.

Most of the following discussions are centered around the concept of latest snapshot recovery.

= Gap prevention =

There may be simple ways to reduce the number of times gaps occur. This process could be called “gap prevention”.

In general, it is not possible to eliminate gaps, because severe outages and equipment failure can always occur. The process of gap prevention may be useful, however, where the alternative gap recovery solution is expensive or undesirable. It is also useful in systems that need full lost tick recovery.

There are two possible ways of preventing gaps from occurring. Both trade bandwidth and latency for increased reliability during intermittent outages.

== Wait for retransmit ==

The simplest form of gap prevention involves retransmitting any packets lost on the network. The sender keeps a buffer of recently sent messages, and the receiver can request retransmissions. In the event of packet loss, the receiver waits for the retransmissions before processing data.

This form of gap recovery is a basic feature of the TCP/IP transmission protocol.

== Forward error correction ==

It is also possible to prevent gaps by sending additional data on the feed.

The most basic form of this is used in the “best-of-both” system. It sends two or more copies of the data, and the receiver can fill lost ticks from the additional copies.

It is not necessary to send a full additional feed. For example, one could send a block of parity codes on every tenth packet. A receiver could then theoretically recover from up to ten percent packet loss by using the parity code packets.

Although the forward error correction scheme uses additional bandwidth, additional bandwidth may be available due to line redundancy.

= Snapshot recovery types =

In order to correct a stale instrument, it may be necessary to send the full contents of the instrument. When doing so, one may send them serialized in the real-time feed or out of order.

== In sequence snapshots ==

The simplest form of snapshot transmission involves using the same socket or pipe as the real-time feed itself. In this case, the receiver can simply apply the snapshot to its database; it does not need to handle the cases where the snapshot arrives after/before a real-time update arrives.

The downside of this scheme, however, is that a single upstream supplier of snapshots might get overloaded with requests for retransmissions. If additional distributed databases are used to offload processing, then each additional component will add latency to the real-time feed.

== Out of sequence snapshots ==

It is also possible to send snapshot transmissions using sockets and/or pipes separate from the real-time feed. The advantage of this scheme, is it is relatively cheap and easy to increase the number of distributed snapshot databases from which one can query. However, it requires the receiver of the snapshot to work harder when attempting to apply the response to its database.

One way to apply out of order snapshots is to build a “reorder buffer” into the receiver. This buffer would store the contents of the real-time feed. When a snapshot response arrives, the receiver could locate where the sender was in the real-time stream when it generated the snapshot (possibly by using sequence numbers). It can then apply the snapshot and internally replay any pending updates from the reorder buffer. In the case where a snapshot arrived that was based on real-time traffic that the receiver has yet to receive, the receiver must wait for that traffic to arrive before applying the snapshot.

This scheme is thought to be complex, resource intensive, and error-prone.

If, however, the feed were changed to eliminate the distributed business rules, it may be possible to implement a much simpler out-of-order snapshot application system. See [[Out of sequence snapshots idea]] for a possible implementation.

= Gap detection =

In order to accurately determine when an instrument is “stale”, it is necessary to be able to determine when one or more update messages have been lost. The following sections contains notes on various schemes that can provide this information.

Note that some of the schemes may be complementary. That is, a possible future solution might use parts of several of these methods.

== Sequence numbers with keep-alive on idle ==

The most common method to detect a gap involves placing a monotonically incrementing number on all outgoing messages or message blocks. The receiver can then detect a gap when a message (or message block) arrives with a sequence number that is not one greater than the last sequence number.

In order to account for the case where all messages from a source are lost, or where a source goes idle just after a message loss, the sender needs to arrange to transmit a “keep alive” indicator periodically when the source is otherwise idle. With knowledge of the keep-alive period, the receiver can detect a gap by timing out if it does not receive a message from a source within the specified period. The larger the period, the less keep-alive messages need to be sent when idle. However, it also increases the worst case time to detect a message gap.

It is possible for the sender to generate multiple sequence number series simultaneously by separating instruments into multiple categories. For example, the outgoing feed currently generates an independent sequence number for each “exchange”. At the extreme, it is possible to generate a sequence number stream per instrument; however this would increase the bandwidth due to larger number of keep-alive messages necessary. (One would also not be able to optimize bandwidth by sequencing only message blocks.)

When a sequence number gap is detected, the receiver must consider all instruments using the sequence series as suspect.

Only a single source can reliably generate a sequence number. If multiple ticker-plants generate a feed, they need to use different sequence series. If an upstream ticker-plant switch occurs, the receiver needs to mark the full range of affected instruments as suspect.

Finally, some exchanges provide sequenced data feeds. However, there are [[issues with exchange provided sequence numbers]]. Due to this, it may be difficult to rely on exchange sequencing as a basis for a distribution network sequencing.

== Sequence number of last message ==

A variant of the basic sequencing scheme involves the sequence number of the last message (SNLM) that updates an instrument. This field would be kept by both sender and receiver, and included with real-time messages. If SNLM matches the receiver’s record, implying that the receiver has not missed any updates for this instrument, then the instrument can transition from “suspect” to “clean”. Conversely, a non-match should force the instrument to “stale”.

An advantage of this scheme is that ordinary real-time traffic could reduce the number of suspect records after an outage. It may also make using exchange provided sequence numbers more practical.

As a disadvantage, however, it would require that both a sequence number and SNLM be provided on every real-time update. This might significantly increase bandwidth.

== Periodic message count checks ==

It is also possible to detect gaps if the sender periodically transmits an accounting of all messages sent since the last period. This scheme may use less bandwidth than sequence numbers, because it is not necessary to send a sequence number with every message (or message block).

The scheme still has the same limitations as sequence numbers when ticker-plant switches occur and when trying to determine what was lost when a gap occurs.

== Periodic hash checks ==

Another possible method of detecting gaps is by having the sender generate a hash of the contents of its database. The receiver can then compare the sender’s hash to the same hash generated for its database. If the two differ, a gap must have occurred. (If the two match, however, a gap may have occurred but already been corrected; this method is therefore not useful when full tick recovery is necessary.)

This scheme may be beneficial when ticker-plant switches occur. If two senders have identical databases and no data is lost during a network switch, then the hash checks should still match at the receiver. This scheme, however, still faces the problem of determining which instruments from the set are actually stale when a gap is detected.

Technically, it is possible that two databases could differ while sharing the same hash key. However, it is possible to choose a hash function that makes the possibility of this extremely small.

Finally, this system may face challenges during software upgrades and rollouts. If either the sender or the receiver change how or what they database, it may be difficult to maintain a consistent hash representation.

== Sender tells receiver of gaps ==

If a reliable transmission scheme (eg, tcp) is in use between the sender and receiver, then it may be possible for the sender to inform the receiver when the receiver is unable to receive some portion of the content.

For example, if a sender can not transmit a block of messages to a receiver because the receiver does not have sufficient bandwidth at the time of the message, then it is possible for the sender to make a note of all instruments that receiver was unable to receive. When the receiver has sufficient bandwidth to continue receiving updates, the sender can iterate through the list of lost instruments and inform the receiver.

The scheme has the advantage that it allows the receiver to quickly determine what instruments are stale. It may also be useful when a component upstream in the ticker-plant detects a gap – it can just push down the known stale messages to all components down-stream from it. (For example, an exchange parser might detect a gap and send a stale indicator downstream while it attempts to fill the gap from the exchange.)

As a disadvantage, it may significantly complicate data senders. It also does not help in cases where a receiver needs to change to a different sender.

== Receiver analyzes gapped messages ==

In some systems, the receiver may need to obtain all lost messages (eg, to build a full-tick database). If the receiver knows the contents of messages missing earlier in the stream it can determine which messages are stale. Every instrument that contains an update message in the list of missing messages would be stale; instruments that did not have update messages would be “clean”.

An advantage of this system is that it is relatively simple to implement for receivers that need full tick retransmissions.

However, in the general case, it is not possible to implement full tick retransmissions due to the possibility of hard failures and ticker-plant switches. Therefore this scheme would only be useful to reduce the number of stale instruments in certain cases.

Also, the cost of retransmitting lost ticks may exceed the benefits found from reducing the number of instruments marked stale. This makes the scheme less attractive for receivers that do not need all lost ticks retransmitted.

= Stale correction =

This section discusses possible methods of resolving “suspect conditions” that occur when it is detected that an instrument may have missed real-time update messages.

There are likely many other possible schemes not discussed here. It is also possible that a combination of one or more of these schemes may provide a useful solution.

These solutions center around restoring the snapshot database. Restoration of a tick history database is left for a separate discussion.

== Background refresh ==

The simplest method of clearing stale records is to have the ticker-plant generate a periodic stream of refresh messages. This is what the system currently does.

This system is not very good at handling intermittent errors, because it could take a very long time to refresh the full instrument database. However, if enough bandwidth is allocated, it is a useful system for recovering from hard failures where the downstream needs a full refresh anyway. It is also possible to combine this with one of the gap prevention schemes discussed above to help deter intermittent outages.


* simple to implement at both receiver and sender


* time to recovery can be large

* can be difficult to detect when an instrument should be deleted, or when an IPO should be added

== Receiver requests snapshot for all stale instruments ==

In this system, the receiver would use one of the above gap detection mechanisms to determine when an instrument may be stale. It then issues a series of upstream requests until all such instruments are no longer stale.

In order to reduce the number of requests during an outage, the instruments on the feed could be broken up into multiple sets of sequenced streams (eg, one per exchange).


* could lead to faster recovery when there is available bandwidth and few other customers requiring snapshots


* could be complex trying to request snapshots for instruments where the initial create message is lost


* see discussion on [[#Snapshot recovery types]]

* see discussion on [[#Gap detection]] for possible methods of reducing the universe of suspect instruments during an outage

== Sender sends snapshots ==

This is a variant of [[#Sender tells receiver of gaps]]. However, in this scheme, the sender would detect a gap for a receiver and automatically send the snapshot when bandwidth becomes available. (It may also be possible to send only the part of the snapshot that is necessary.)


* Simple for receiver


* Could be complex for sender

* Isn’t useful if receiver needs to change upstream sources.

== Receiver requests gapped sequences ==

This method involves the receiver detecting when an outage occurs and making an upstream request for the sequence numbers of all messages (or message blocks) not received. The sender would then retransmit the lost messages (or blocks) to the receiver.

The receiver would then place the lost messages along with all subsequently received messages into a “reorder” buffer. The receiver can then internally “play back” the messages from the reorder buffer to rebuild the current state.


* Useful for clients that need to build full-tick databases and thus need the lost messages anyway.


* Thought to be complex and impractical to implement. The reorder buffer could grow to large sizes and might take significant resources to store and apply.
* The bandwidth necessary to retransmit all lost messages may exceed the bandwidth necessary to retransmit the current state of all instruments.
* Doesn’t help when a ticker-plant switch occurs.

== Sender analyzes gapped sequences ==

This scheme is a variant on [[#Receiver requests gapped sequences]]. The receiver detects when an outage occurs and makes an upstream request for the sequence numbers of all messages (or message blocks) not received.

Upon receipt of the request the sender would generate a series of snapshots for all instruments that had real-time updates present in the lost messages. It can do this by analyzing the contents of the messages that it sent but the receiver did not obtain. The sender would also have to inform the receiver when all snapshots have been sent so the receiver can transition the remaining instruments into a “not stale” state.


* May be useful in conjunction with gap prevention. That is, the sender could try resending the lost messages themselves if there is a good chance the receiver will receive them before timing out. If the receiver does timeout, the sender could fall back to the above snapshot system.

* May be simple for receivers


* May be complicated for senders
* Doesn’t help when a ticker-plant switch occurs.


* Either in-sequence or out-of-sequence snapshot transmissions could be used. (See [[#Snapshot recovery types]] for more info.) The receiver need not send the requests to the sender – it could send them to another (more reliable) receiver.

== Receiver could ask if update necessary ==

This is a variant of [[#Receiver requests snapshot for all stale instruments]], however, in this system the receiver sends the sequence number of the last message that updated the instrument (SNLM) with the request. The sender can then compare its SNLM with the receiver’s and either send an “instrument okay” message or a full snapshot in response.


* Reduces downstream bandwidth necessary after an outage


* Doesn’t work well in cases where instruments are updating, because the receiver and sender may be at different points in the update stream
* Lost create messages – see disadvantages of [[#Receiver requests snapshot for all stale instruments]]

== Receiver could ask with hash ==

This is a variant of [[#Receiver could ask if update necessary]], however, in this system the receiver sends a hash value of the current instrument’s database record with the request. The sender can then compare its database hash value with the receiver’s and either send an “instrument okay” message or a full snapshot in response.


* Works during tp switches


* Doesn’t work well in cases where instruments are updating, because the hash values are unlikely to match if sender and receiver are at a different point in the update stream.

* Rollout issues – see [[#Periodic hash checks]]

* Lost create messages – see disadvantages of [[#Receiver requests snapshot for all stale instruments]]

= Important considerations =

In many stale detection and correction system there are several “corner cases” that can be difficult to handle. Planning for these cases in advance can simplify later development issues.

The following is a list of “corner cases” and miscellaneous ideas:

== Ticker plant switches ==

It can be difficult to handle the case where a receiver starts obtaining messages from a different ticker-plant. Our generated sequence numbers wont be synchronized between the ticker-plants. Many of the above schemes would need to place any affected instruments into a “suspect” state should a tp switch occur.

Even if one could guarantee that no update messages were lost during a tp switch (for example by using exchange sequence numbers) there might still be additional work. The old ticker-plant might have been sending incorrect or incomplete messages — indeed, that may have been the reason for the tp switch.

== Lost IPO message ==

When the real-time feed gaps, it is possible that a message that would have created a new instrument was lost. An automatic recovery process should be capable of recovering this lost information.

There are [[schemes to detect extra and missing records]].

== Lost delete message ==

Similar to the IPO case, a real-time gap could have lost an instrument delete message. An automatic recover process should be able to properly handle this.

A more strange, but technically possible situation, involves losing a combination of delete and create messages for the same instrument. The recovery process should be robust enough to ensure that full resynchronization is possible regardless of the real-time message update content.

There are [[schemes to detect extra and missing records]].

== Exchange update patterns ==

Some exchanges have a small number of instruments that update relatively frequently (eg, equities). Other exchanges have a large number of instruments that individually update infrequently, but have a large aggregate update (eg, US options).

Schemes for gap detection and correction should be aware of these differences and be capable of handling both gracefully.

== Orderbooks ==

Recovering orderbooks can be a difficult process. However, getting it right can result in dramatic improvements to their quality, because orderbooks are more susceptible to problems resulting from lost messages.

The key to getting orderbooks correct is finding good solutions to all of the above corner cases. Orderbooks have frequent record creates and deletes. They also have the peculiar situation where some of the orders (those at the “top”) update with very high frequency, but most other orders (not at the “top”) update very infrequently.

== Sources can legitimately idle ==

Many exchanges follow a pattern of high traffic during market hours, but little to no traffic on off hours. Ironically, the traffic near idle periods can be extremely important (eg, opens, closes, deletes, resets).

It is important to make sure a detection scheme can handle the case where a gap occurs around the time of a legitimate feed idle. It should also be able to do so in a reasonable amount of time. (An example of this is the “keep alive” in the above sequence number scheme.)

== Variations of stale ==

A record is sometimes thought to be either “clean” or “stale”. However, it is possible to graduate and/or qualify what stale means. That is, it is possible to be “suspect” or “suspect for a reason” instead of just being “stale”.

Possible per-instrument stale conditions:

  • ; clean : the instrument is not stale
  • ; possible gap : gap in sequence number that could affect many instruments
  • ; definite gap : some recovery schemes can determine when an instrument has definitely lost updates
  • ; upstream possible gap : the tp might have seen a sequence gap from the exchange
  • ; upstream definite gap : the tp might have deduced which instruments actually gapped from exchange
  • ; stale due to csp startup : the csp was recently started and has an unknown cache state
  • ; suspect due to tp switch : a ticker-plant switch occurred
  • ; pre-clean : possible state in out-of-order snapshot recovery schemes
  • ; downstream gap : in some schemes a sender can inform a receiver that it lost updates

simplified order book design doc: jump

It’s tempting to use virtual function processMessage() to process various order types (A, M, X …) and trade types, but virtual functions add runtime overhead. Template specialization is a more efficient design, but due to the limited timeframe I implemented an efficient and simple alternative.

Assumption: M messages can only change quantity. Price change not allowed — Sender would need to cancel and submit new order. The B/S and price fields of the order should not change, but validation is omitted in this version.

Assumption: T messages and the corresponding M and X messages (also the initiator A message) are assumed consistent and complete. Validation is technically possible but omitted. Validation failure indicates lost messages.

The cornerstone of the design is the data structure of the order book — a RB-tree of linked lists. Add is O(logN) due to the tree-insert. Modify is O(1) thanks to the lookup array. Remove is O(1) — eliminating tree search. This is achieved with the lookup array, and by saving iterator into the order object.

There are 2 containers of pointers — the map of lists and the lookup-array. It would be better to use container of smart pointers to ease memory management, but STL doesn’t provide any smart pointer.

All equality test on doubles are done using “==”. Should use some kind of tolerance if time permits.

Here’s the documentation in the lookup array class

/*This class encapsulates an array of pointers.
Assumption 1 — Exchanges is likely to generate auto-incrementing orderID’s. Here’s my reasoning. OrderID’s are unique, as stated in the question. If orderID generation isn’t continuous, then the generator has 2 choices about the inevitable gap between 2 adjacent ID numbers. It can keep the gap forever wasted, or somehow “go back” into a gap and pick a number therein as a new orderID. To do the latter it must keep track of what numbers are already assigned — rather inefficient. There are proven in-memory algorithms to generate auto-increment identity numbers. I assume an exchange would use them. Auto-increment numbers make a good candidate as array index, but what about the total number range?

Assumption 2 — each day the number range has an upper limit. Exchange must agree with exchange members on the format of the orderID. It’s likely to be 32 bits, 64 bits etc and won’t be a million bits.

Question 1: Can the exchange use OrderID 98761234 on both IBM and MSFT during a trading day? I don’t know and i feel it doesn’t matter. Here’s the reason.

Case 1: suppose exchange uses an *independent* auto-increment generator for each stock. So IBM and MSFT generators can both generate 98761234. My design would use one array for IBM and one array for MSFT. For basket orders, additional generator instances might be needed.

Case 2: suppose exchange uses an independent auto-increment generator for each stock, but each stock uses a non-overlap number range. 98761234 will fall into IBM number range. My design would need to know the number range so as to convert orderID to array index and conserve memory.

Case 3: suppose exchange uses a singleton auto-increment generator across all stocks (bottleneck inside the exchange). My design would use one gigantic array. Given Assumption 1, the numbers would be quasi-continuous rather than sparse — below 50% of the range is assigned. Suppose the range is S+1, S+2 … S+N, then my array would be allocated N elements (where S is orderIDBase). There’s a limit on N in reality. Every system is designed for a finite N — no system can handle 10^9999 (that’s one followed by ten thousand zeros) orders in a day. Each array element is a pointer. For a 64-bit machine, N elements take 64N bits or 8N bytes. If I have 640GB memory, N can be 80 billion but not higher. To scale out horizontally, we would hope Case 1 or 2 is the case.

Therefore the answer to Question 1 shows array of pointer is feasible for the purpose of lookup by orderID. In a real system hash table is likely to be time/space efficient. In this exercise, only STL is available, which provides no hash table. Tree based map has logN time complexity — too slow. My choice is between a built-in array vs a non-expanding vector. I chose array for simplicity.

Fwd: design a finite state machine for a FIX engine

We need a cache (OMS/EMS?) of all the pending orders we created. All from-exchange FIX messages must match one of the pending orders. Once matched, we examine its current state.

If the message is belated, then we must reject it (informing all parties involved).

If the message is delivered out of sequence and too early, then we must keep it on hand. (a 2nd cache)

If the message is for immediate consumption, then consume.


…the FSM (finite state machine) in FIX is about state transition and should be pertinent to all state machines. Basically, it goes from pending new to new to fill or canceled. From canceled, it cannot go back to new or fill etc.



story-telling for behavior interviews

As stated in https://bintanvictor.wordpress.com/2010/08/13/vague-vs-normal-vs-specific-answers-in-nontech-interviews/, during non-tech interviews, stories are a good way to organize your thoughts, and make yourself memorable.

Gayle McDowell pointed out that your story needs to

  • explain why it was done that way
  • reflect on you, not your team
  • understandable and substantial
  • Also be prepared to say how you would do it differently

–To prove a _team_player_ —

  • These aren’t stories but … voted most helpful colleague in Zed;
  • knowledge sharing;
  • hands-on guidance over freshers. In ICE team, all four new hires come to me for help.
  • one of the most gregarious guys on the floor, making friends across department boundaries.

–To prove “help those in need” —

rated substantially-exceed by all the freshers I helped, mostly Indian freshers; Never turned away a help seeker

–To prove constructiveness in conflict —

presenting alternative designs to senior managers

–To prove knowledge sharing —

–To prove “can work with difficult colleagues” —

Chih Hao who likes to criticize my code ..

–To prove under-pressure — biggest release of 2009

–To prove personal sacrifice —

swing OMS screen (PWM), briefly

PWM screen, used till today (2012). Handles Eq, FX, FI, derivatives (esp. options), futures….

Real time OMS — “order state management”, including but not limited to
* real time status updates
* order entry
* manual trade booking

* manual order cancel/mod before fully executed

Web GUI won’t provide the responsiveness and volume —

At least 30,000 orders/day in US alone. 50-200 executions/order, executed in-house or on external liquidity venues. Typically 10 MOM messages per execution.
– new order placed
– acknowledged
– partial fill
– cancel/mod

Each swing JVM has its own MOM subscriber(s).

This codebase isn’t part of the IC build and has a frequent release cycle.

design considerations – my first homemade FIX client

In a typical sell-side / market maker, the FIX gateway is a FIX client for a liquidity venue. The liquidity venue (typically an exchange, but also ECNs, dark pools) runs a massively multi-threaded, FIX server farm, possibly with custom-made FPGA hardware. I believe the technical requirements for an exchange FIX server is different from a sell-side FIX client.

Therefore, I assume my audience are far more interested in the client code than the server code.

My mock server implementation is a mock-up, with no optimization, no resilience/robust design, no fault-tolerance, no select()-based sockets, no high-availability, no checksum validation.

— design priorities —
Since this is a from-scratch implementation of a large spec, I focused on basic functionality. Debugging and instrumentation is important in the early stage. A lot of extra code was created to help developers verify the correct operation of all those nitty gritty details.

I don’t aim to build a complete engine, but the basic functionality I build, i try to build it on solid ground, so hopefully it doesn’t become throw-away code.

At the heart of the OO design was the abstraction expressed by AbstractClient, AbstractState and ClientSession. These form the backbone of the core object graph. The fields and constructors were chosen with care. Cohesion and low-coupling were my ideals but sometimes hard to achieve.

These core classes help meet most of the fundamental requirements related to object state of a client connection, sequence number, state transition, output stream sharing, connection liveness monitoring… I wouldn’t say these are well-design classes, but they are a reasonable first attempt at modelling the problem using java classes.

Reusable components were created whenever possible, such as the message parser, message formatter.

FIX engine are highly configurable. All config parameters are extracted into a config object, which is obtained from a config factory. One config for Test, one for Production or UAT etc.

— features —
– parsing a raw FIX msg into an array, which is faster to access than a hashmap. By our design, we avoid hash lookup completely.
– Destroyer.java is a home-made solution to the potential forever-blocking of readLine(). On some platforms, socket read is not interruptible and the only way to unfreeze a socket read is to close the socket. We do that in this implementation but only after sending a final logout message.
– Various template methods to guarantee close() and guard against resource leak
– AbstractClient has most of the generic logic, while SingleOrderClient implements very small amount of custom logic.
– FIX msg formatter is generic enough to accept a map of tags and values — any tags and values
– sequence num validation
– state machines on both client and (simplified) server. State design pattern.
– server side — very few features, but ConnectionManager is reasonable encapsulation of new socket creation.

—- client-side features to be implemented —-
– heart beat — more robust. Should never fail.
– timeout waiting for ack
– re-logon if no heartbeat for 60 seconds
– checksum validation
– storing exchange order-ack into database
– Right now, given time constraint, i didn’t use nonblocking IO. All FIX system should use non-blocking IO.
———- Disclaimer ————-
I spent about 4 hours on this project, as suggested by Jonathan. I had to do some basic research on FIX since I have never programmed FIX (except some ECN connectivity modules that probably used FIX under the hood).
————— How to run the server/client —————–
Run ServerMain.java, then ClientMain.java.
To see the heart beat monitoring/intervention (Destroyer) in action, put 500 in getSilenceThreshold().

1900 tiers of quotes, RFS over FIX, indicative/executable quotes #400w

One of the REAL bottlenecks in a large SELL-side FX dealer system is tiered pricer. Trigger event could be a market data change. Since such an event could trigger an avalanche of messages, the frequency of such events is not very high, probably below 10 events/second on 1 currency pair. If you have 10(or 50 or whatever) active currency pairs, then you could get 100 triggers/sec through your entire system.

Once a trigger is activated, pricer computers new bid/ask quotes for Tier 1 Gold clients. Pricer then adds a distinct spread for each tier. For an active pair like EUR/USD, there can be 1900 tiers. There can be up to 1900 (non-unique) pairs of bid/ask quotes. Typically, the “best” quotes would have a bid/ask spread of 2 to 3 pip, applicable to the best and largest clients. For a *retail* client, it could be 20 – 100 pips.

Core of the tiered pricer is a Drools rule engine.

Another module is the messaging engine using Nirvana by My-Channels. If a particular bid/ask quote applies for all (say, 20) tiers in Silver category, then pricer broadcast the quote to a topic like Quote.EURUSD.Silver. This is kind of alias for 20 different “tier” topics. For efficiency, this is probably multicast.

In the worst case, one event can trigger an avalanche of 1900 messages for one currency pair alone.

Last module is the FIX engine. Quotes often go out the door in FIX format, just as RFQ. Now there are 1900 tiers but the number of clients could be higher or lower than 1900. If there are more than 1900 clients and if all of them subscribe to our quote, then each must be sent the quote. I know a Chicago prop trading firm (Gelber?) subscribe to a lot of “bank feeds”.

The most demanding type of subscription is a RequestForStream (RFS). A typical RFS could ask for a stream of EURUSD quotes for 10 (up to 120 minutes), during which time all quotes must be delivered.

Unlike RFQ, RFS requires special approval. The bid/ask quotes in an RFS can be indicative or *executable* (similar to Firm, but see separate blot post). If a client hit an executable bid or lift an executable offer, then the trade is considered executed, though I believe cancellation is still a possibility, just like any Firm quote in bonds.

How does a dealer make sure he has enough position to honor the quote? Perhaps by setting aside reserve quantities, or by monitoring the open market.

Unlike bidwanted systems (non-negotiable quotes), it’s possible for a client to negotiate on our quote electronically, though I feel manual negotiation is more practical.