This post is about really achieving it, not “trying to achieve”.
First off, How do you measure or define that latency? I guess from the moment you get one piece of data at one end (first marker) to the time you send it out at the other end (2nd marker). If the destination is far away, then I feel we should use the ack time as the 2nd marker. There are 2 “ends” to the system being measured. The 2 ends are
* the exchange and
* the internal machines processing the data.
There are also 2 types of exchange data — order vs market data (MD = mostly other people’s quotes and last-executed summaries). Both are important to a trader. I feel market data is slightly less critical, though some practitioners would probably point out evidence to the contrary.
Here are some techniques brought up by a veteran in exchange connectivity but not the market data connectivity.
* Most places use c++. For a java implementation, most important technique is memory tuning – GC, object creation.
* avoid MOM? — HFT mktData redistribution via MOM
* avoid DB — local flat file is one way they use for mandatory persistence (execution). 5ms latency is possible. I said each DB stored proc or query takes 30-50 ms minimum. He agreed.
** A market data engineer at 2Sigma also said “majority of data is not in database, it’s in file format”. I guess his volume is too large for DB.
* object pooling — to avoid full GC. I asked if a huge hash table might introduce too much look-up latency. He said the more serious problem is the uncontrollable, unpredictable GC kick-in. He felt hash table look-up is probably constant time, probably pre-sized and never rehashed. Such a cache must not grow indefinitely, so some kind of size control might be needed.
* multi-queue — below 100 queues in each JVM. Each queue takes care of execution on a number of securities. Those securities “belong” to this queue. Same as B2B
* synchronize on security — I believe he means “lock” on security. Lock must attach to objects, so the security object is typically simple string objects rather than custom objects.
* full GC — GC tuning to reduce full GC, ideally to eliminate it.
* use exchange API. FIX is much slower, but more flexible and standardized. See other posts in this blog
Some additional techniques not specific to exchange connectivity —
$ multicast — all trading platforms use multicast nowadays
$ consolidate network requests — bundle small request into a large requests
$ context switching — avoid
$ dedicated CPU — If a thread is really important, dedicate a cpu to it.
$ shared memory — for IPC
$ OO overhead — avoid. Use C or assembly
$ pre-allocate large array
$ avoid trees, favour arrays and hash tables
$ reflection — avoid
$ concurrent memory allocator (per-thread)
$ minimize thread blocking
** data parallelism