profilers for low-latency java

Most (simple) java profilers are based on jvm safepoint. At a safepoint, they can use JVM API to query the JVM. Safepoint-based profiling is relatively easy to implement.

s-sync (Martin) is not based on safepoint.

Async profiler is written for openJDK, but some features are usable on Zinc JVM. Async profiler is based on process level counters. Can’t really measure micro-latency with any precision.

Perf is an OS-level profiler, probably based on kernel counters.

370,000 MPS isn’t tough for multicast #CSY

370,000 msg/sec isn’t too high. Typical exchange message size is 100 bits, so we are talking about 37 Mbps, , less than 0.1% of a 100 Gbps network capacity.

My networking mentor CSY and I both believe it’s entirely possible to host 8 independent threads in a single process to handle the 8 independent message channels. Capacity would be 296 Mbps on a single NIC and single PID.

See also mktData parser: multi-threaded]ST mode #Balaji

I feel a more bandwidth-demanding multicast application is video-on-demand where a single server may need to simultaneously stream different videos.

Q: how about world cup final real-time multicast video streaming to millions of subscribers?
%%A: now I think this is not so demanding, because the number of simultaneous video streams is one

4 components@latency

  1. Propagation delay — Amount of time required for a message to travel from the sender to receiver, which is a function of distance.
  2. Transmission delay — Amount of time required to push all the packet’s bits into the link, which is a function of the packet’s length and data rate of the link, and has nothing to do with distance
  3. Processing delay — Amount of time required to process the packet header, check for bit-level errors, and determine the packet’s destination. Personally, I would think application logic also introduces processing delay.
  4. Queuing delay —  Amount of time the packet is waiting in the queue until it can be processed. Personally, I would guess that a larger buffer tends to exacerbate queuing delay. I would say there are often multiple queuing points along the critical path, like an airport.



throughput^latency #wiki

High bandwidth often means high-latency:( .. see also linux tcp buffer^AWS tuning params

  • RTS is throughput driven, not latency-driven.
  • Twitter/FB fanout is probably throughput-driven, not latency-driven
  • I feel MOM is often throughput-driven and introduces latency.
  • I feel HFT OMS like in Mvea is latency-driven. There are probably millions of small orders, many of them cancelled. shows

  • satellite is high-latency, regardless of throughput
  • offline data transfer by trucks) is poor latency, excellent throughput

-XX:CompileThreshold to control JIT priming

## how might jvm surpass c++]latency #MS #priming has 2 other tips from jvm trading engine veterans is a 2014 short piece focused on q[ -XX:CompileThreshold ] — sets the number of method invocations before Hotspot will compile a method to native code.

The -server VM defaults to 10,000 and -client defaults to 1500.

Smaller numbers reduce priming time, but Very low numbers would mean that the server starts considerably slower because of the time taken by the JIT to compile too many methods (which may not be used that often after all).

## mkt data: avoid byte-copying #NIO

I would say “avoid” or “eliminate” rather than “minimize” byte copying. Market data volume is gigabytes so we want and can design solutions to completely eliminate byte copying.

  • RTS uses reinterpret_cast but still there’s copying from kernel socket buffer to userland buffer.
  • Java NIO buffers can remove the copying between JVM heap and the socket buffer in C library. See P226 [[javaPerf]]
  • java autoboxing is highly unpopular for market data systems. Use byte arrays instead

C for latency^^TPS can use java

I’m 98% confident — low latency favors C/C++ over java [1]. FPGA is _possibly_ even faster.

I’m 80% confident — throughput (in real time data processing) is achievable in C, java, optimized python (Facebook?), optimized php (Yahoo?) or even a batch program. When you need to scale out, Java seems the #1 popular choice as of 2017. Most of the big data solutions seem to put java as the first among equals.

In the “max throughput” context, I believe the critical java code path is optimized to the same efficiency as C. JIT can achieve that. A python and php module can achieve that, perhaps using native extensions.

[1] Actually, java bytecode can run faster than compiled C code (See my other posts such as

MOM+threading Unwelcome ] low latency@@ #FIX/socket

Piroz told me that trading IT job interviews tend to emphasize multi-threading and MOM. Some use SQL too. I now feel all of these are unwelcome in low latency trading.

A) MOM – see also HFT mktData redistribution via MOMFor order processing, FIX is the standard. FIX can use MOM as transport, but not popular and unfamiliar to me.

FIX does use buffers to hold a burst of incoming or outgoing messages. The buffers resemble message queues.

B) threading – Single-Threaded-Mode is generally the fastest in theory and in practice. (I only have a small observed sample size.) I feel the fastest trading engines are STM. No shared mutable. Nsdq new platform (in java) is STM

Multithreading is OK if the threads don’t compete for resources like CPU, I/O or locks. Compared to STM, most lockfree systems introduce latency like retries, and additional memory barrier. By default compiler optimization doesn’t need such memory barriers.

C) SQL – as stated elsewhere, flat files are much faster than relational DB. How about in-memory relational DB?

Rebus, the order book engine, is in-memory.

realistic TPS for exchange mkt-data gateway: below 1M/sec

  • Bombay’s BSE infrastructure (including matching engine) can execute 1 mil trades/second. Must maintain order books, therefore NOT stateless.
  • Rebus maintains full order books for each symbol, can handle 300k (in some cases 700k) messages per second per instance. Uses AVL tree, which beat all other data structure in tests.
  • The Nasdaq feed parser (in c++) is stateless and probably single-threaded. Very limited parsing logic compared to Rebus. It once handled 600k message/second per instance
  • Product Gateway is probably a dumb process since we never need to change it. Can handle 550k message/second per instance

I believe TPS throughput, not latency, is the optimization goal. Biggest challenge known to me is the data burst.

Therefore, java GC pause is probably unacceptable. In my hypothesis, after you experience a data surge for a while, you tend to run out of memory and must run GC (like run to the bathroom). But that’s the wrong time to run GC. If the surge continues while you GC runs, then the incoming data would overflow the queue.

c++SCB eFX IV#Dmitry

100% QQ type, as defined in feel many are micro optimizations with questionable improvement. I wonder how much value such obscure knowledge adds to the team.

Q: Scanning a vector of int (like finding the average or max). Forward iteration vs backward iteration, which one could be faster, considering all possible compiler optimizations.

%%A: forward. Memory read into cpu cache will be in chunks, not one element at a time. Easy for forward iteration. Not sure about backward.

Q: Which one could be fastest:

void f(double arg){…..}
void f(double & arg){….}

%%A: inlining for first but not 2nd?
A: See esp. the long answer.

Q: Thr1 and Thr2 on 2 CPU’s both update an object s, having 2 fields. Thr1 only updates s.field1. Thr2 only updates s.field2. No interference. No synchronization required. We observe the performance is slower than using one thread to update both fields. Any explanation?
%%A: caching in cpu

Q: weak_ptr justification, when we have shared_ptr already? I feel [[effModernC++]] has a good chapter on it.

Ashish pointed out in some apps, you could identify a clear risk of circular dependency. Replace with weak_ptr.

Q: given an 2D array arr[10][5], how do you use pointer arithmetic to hit arr[1][5]

A: Contiguous. see Note this is different from an array of pointers.

Q: what would you see if a TCP socket server has a full queue
%%A: TCP requires handshake, so if server is unable to accept a request the client would know it.
%%A: connection refused?

Q: what STL algorithms did you use?
%%A: foreach(), find(), copy_if(), transform(), reverse(), sort(), replace_if, remov_if

how2guage TPS capacity@mkt-data engine

Pump in an artificial feed. Increase the input TPS rate until

  1. CPU utilization hits 100%
  2. messages get dropped

The input TPS is the the highest acceptable rate i.e. the “capacity” of this one process.

Note each feed has its own business logic complexity level, so the same software may have 600k TPS capacity for a simple Feed A but only 100k TPS for a complex Feed B.

Also in my experience the input interface is the bottle neck compared to the output interface. If System X feeds into System Y, then we want to have System X pumping at 50% of Y’s capacity. In fact, we actually monitor the live TPS rate. The difference between that and the capacity is the “headway”.

strace, ltrace, truss, oprofile, gprof – random notes

[[optimizing Linux performance]] has usage examples of ltrace.
I think truss is the oldest and most well-known.
Q: what values do the others add?
truss, strace, ltrace all show function arguments, though pointer to objects will not be “dumped”. (Incidentally, I guess apptrace has a unique feature to dump arguments of struct types.)
strace/ltrace are similar in many ways…
ltrace is designed for shared LLLibrary tracing, but can also trace syscalls.
truss is designed for syscalls, but “-u” covers shared libraries.
oprofile — can measure time spent and hit rates on library functions

oprofile – phrasebook

root – privilege required to start/stop the daemon, but the query tools don’t need root

dtrace – comparable. I think these two are the most powerful profilers on solaris/linux.

statistical – results can be partially wrong. Example – call graph.

Per-process – profiling is possible. I think default is system-wide.

CPU – counters (hardware counters). Therefore, low impact on running apps, lower than “attachment” profilers.

userland – or kernel : both can be profiled

recompile – not required. Other profilers require recompiling.

kernel support – must be compiled in.

oprifiled – the daemon. Note there’s no executable named exactly “oprofile”.

[[Optimizing Linux performance]] has detailed usage examples of oprofile. [[linux programmer’s toolbox]] has decent coverage too.

algorithm efficiency at Google/HFT: random thoughts

I was told Google engineers discuss algorithm efficiency everyday, including lunch time.

I feel latency due to software algorithm might be a small component of overall latency. However, the bigger portions of that latency may be unavoidable – network latency, disk write(?), serialization for transmission(?), … So the only part we could tune might be the software algorithm.

Further, it’s also possible that all the competitors are already using the same tricks to minimize network latency. In that case the competitive advantage is in the software algorithm.

I feel algorithm efficiency could be more permanent and fundamental than threading. If I compare 2 algorithms A1 and A2 and find A2 being 2x A1’s speed, then no matter what threading or hardware solutions I use, A2 still beats A1.

sub-millis OMS arch – last tips from Anthony

I feel ideally you want to confine entire OMS to one single process, minimizing IPC latency [1]. In practice however, even for one symbol OMS is often split into multiple processes or “instances”.

So what’s the IPC of choice? It turns out that in sub-millis trading, MOM messaging is the IPC of choice. I mentioned synchronous call and shared memory, but my veteran friend pointed out messaging performs better in practice.

The main platform-component is one big JVM instance with an internal order lookup cache for order state maintenance.

Multi-queue – if there are 50,001 symbols, there will be 50,001 queues. Once a queue is assigned a given thread T351, it is permanently bound to T351. This is to prevent multiple threads handling events concurrently on the same symbol. Obviously we don’t want 50,001 threads. Therefore, some kind of multiplexing is in place.

[1] Note data parallelism (into multiple processes) is free of IPC and perfectly fine.

achieving sub-millis latency: exchange connectivity

This post is about really achieving it, not “trying to achieve”.

First off, How do you measure or define that latency? I guess from the moment you get one piece of data at one end (first marker) to the time you send it out at the other end (2nd marker). If the destination is far away, then I feel we should use the ack time as the 2nd marker. There are 2 “ends” to the system being measured. The 2 ends are
* the exchange and
* the internal machines processing the data.

There are also 2 types of exchange data — order vs market data (MD = mostly other people’s quotes and last-executed summaries). Both are important to a trader. I feel market data is slightly less critical, though some practitioners would probably point out evidence to the contrary.

Here are some techniques brought up by a veteran in exchange connectivity but not the market data connectivity.
* Most places use c++. For a java implementation, most important technique is memory tuning – GC, object creation.
* avoid MOM? — HFT mktData redistribution via MOM
* avoid DB — local flat file is one way they use for mandatory persistence (execution). 5ms latency is possible. I said each DB stored proc or query takes 30-50 ms minimum. He agreed.
** A market data engineer at 2Sigma also said “majority of data is not in database, it’s in file format”. I guess his volume is too large for DB.
* object pooling — to avoid full GC. I asked if a huge hash table might introduce too much look-up latency. He said the more serious problem is the uncontrollable, unpredictable GC kick-in. He felt hash table look-up is probably constant time, probably pre-sized and never rehashed. Such a cache must not grow indefinitely, so some kind of size control might be needed.
* multi-queue — below 100 queues in each JVM. Each queue takes care of execution on a number of securities. Those securities “belong” to this queue. Same as B2B
* synchronize on security — I believe he means “lock” on security. Lock must attach to objects, so the security object is typically simple string objects rather than custom objects.
* full GC — GC tuning to reduce full GC, ideally to eliminate it.
* use exchange API. FIX is much slower, but more flexible and standardized. See other posts in this blog

Some additional techniques not specific to exchange connectivity —
$ multicast — all trading platforms use multicast nowadays
$ consolidate network requests — bundle small request into a large requests
$ context switching — avoid
$ dedicated CPU — If a thread is really important, dedicate a cpu to it.
$ shared memory — for IPC
$ OO overhead — avoid. Use C or assembly
$ pre-allocate large array
$ avoid trees, favour arrays and hash tables
$ reflection — avoid
$ concurrent memory allocator (per-thread)
$ minimize thread blocking
** immutable
** data parallelism
** lockfree

##[12] bottlenecks in a high performance data "flow" #abinitio


#1 probably most common — database, both read and write operations. Therefore, ETL solutions achieve superior throughput by taking data processing out of database. ETL uses DB mostly as dumb storage.

  • write – if a database data-sink capacity is too slow, then entire pipe is limited by its throughput, just like sewage.
    • relevant in mkt data and high frequency trading, where every execution must be recorded
  • read – if you must query a DB to enrich or lookup something, this read can be much slower than other parts of the pipe.

#2 (similarly) flat files. Write tends to be faster than database write. (Read is a completely different story.)
* used in high frequency trading
* used in high volume market data storage — Sigma2 for example. So flat file writing is important in industry.
* IDS uses in-memory database + some kind of flat file write-behind for persistence.

#? Web service

#? The above are IO-bound. In contrast, CPU-bound compute-intensive transform can (and do) also become bottlenecks.

low latency: C++preferred over java

(Fastest is FPGA, but Let’s put that aside. )

Low latency developers generally prefer C and avoid OO (runtime overhead), but they do use C++ templates to achieve power and flexibility.

In terms of data structures, I think they use STL too. Array is essential. In contrast, graph data structures incur additional allocation due to the graph Node objects — no such overhead in arrays.

Java’s issues —
* autoboxing — market data use mostly of primitive objects
* Every instance takes something like 8+ bytes.
* Indeterminate garbage collection
* virtual function overhead
* Even purely local variables are often put into heap for delayed clean-up
* JVM could reach good throughput wrt c++, but only after a slow warm-up.

## personal xp on low latency trading

Thread — lockfree becomes relevant in latency-sensitive contexts
Thread — create opportunity for parallelism, and exploit multi-core
Thread — avoid locks including concurrent data structures and database. Be creative.
Data structures — STL is known to be fairly efficient but can affect performance
Data structures — minimize footprint
Data structures — favor primitive arrays because allocations are fewer and footprint is smaller
Algo — favor iterative over recursive
DB tuning — avoid hitting DB? Often impractical IMO
Serialize — favor primitives over java autoboxing
Serialize — favor in-process transfers, and bypass serialization
Mem — avoid vtbl for large volume of small objects
Mem — reuse objects, and avoid instantiation? Often impractical IMO
Mem — mem virtualization (like gemfire) as alternative to DB. A large topic.
Mem — Xms == Xmx, to avoid dynamic heap expansion
Mem — control size of large caches.
Mem — consider weak reference
MOM — multicast is the clear favorite to cope with current market data volume
MOM — shrink payload. FIX is a primary example.
GC — avoid forced GC, by allocating large enough heap
Socket — dedicate a thread to a single connection and avoid context switching
Socket — UDP can be faster than TCP
Socket — non-blocking can beat blocking if there are too many low-volume connections. See post on select()

instrumentation overhead: latency/throughput]low latency app

(Note these overheads primarily hurt latency, but throughput-sensitive apps would also suffer, to a lesser extent I presume. Here’s my Reason — if it takes 20 minutes (or hours) to finish a big batch, 5% longer would not matter that much. However, in low-latency trading a consistent 5% additional latency can mean many lost opportunities. An inconsistent latency is even more prevalent.)

A top engineer in a ultra high speed option exchange told me truss is the #1 instrumentation tool (they run linux). I believe this is for occasional use. truss is intrusive and can slow down an app to HALF its speed, according to Solaris tuning experts.

Dtrace jvm provider has method-entry and method-exit hooks and can add overhead too. Overall, dtrace is less intrusive than truss.


— logging —
By default, logging is intrusive and adds latency to the low latency critical-path. We still need real time logging for production support. Solutions? Someone told me ramdisk is one option. Someone told me async logging is available in log4j. 

Given that disk is a block-oriented storage (see other posts on this blog), you can also increase disk buffer and write in bigger blocks. Non-ideal for real time monitoring. Minutes of delay is common.

When quizzed in interviews, I gave some impromptu suggestions —
* main thread puts instrumentation/logging data into a shared data structure in local memory. A background thread reads it asynchronously — non-intrusive.  Shared mutable => locking required.
* we can also send messages to a logging server. In this case, I feel few-big messages beat many-smaller messages.

tibrv latency tips – from tibrv documentation #logging

1) offload to worker thread — When inbound messages require lengthy processing, we recommend shifting the processing load asynchronously. Quickly extract data from the message, and process it in another thread.

* compare EDT/swingWorker

In a ML muni trading engine, we designed several “grabber” listeners
– grab messages and route them to different processing queues, based on hashcode.
– grab messages, append a short code to the subject, then republish to the same queue. Not sure if this works.

2) reduce message size — Avoid XML. RV now supports integer field identifiers similar to FIX. 16-bits and much smaller than String field names.

3) reduce logging — I always felt logging hurts performance. Now confirmed in Tibrv manual! When logging is required for monitoring or auditing, shift the I/O burden to another computer to log messages without introducing a time penalty

Tick data repository – real-time/historical mkt-data

I think this was a CS interview…

Used for 1) back-testing, 2) trading signal generation, even if no real order sent, and 3) algorithmic real-$ trading.

First target user is back testing. Users would also try the 1 -} 2 -} 3 in that sequence.

Stores FX, FI and equity market data. Therefore the system treats everything just as generic tick data either quote ticks or trade ticks.

Multiple TB of data in memory. Time-series (not SQL) database.

Created using c++ and java. Probably due to sockets.

max throughput ^ max concurrency

I now feel this is a a bit too academic. In practice, I feel concurrency is a technique to achieve throughput.

“max concurrency” is a system feature and probably means “full exploitation of processor and threading capacity”. I know a single SUN processor can have 4 cores and 32 concurrent kernel threads. In such a system, it’s also possible to create user-level threads so total concurrent threads can far exceed 32. I would think thousands are possible.

“max throughput” is a tangible benefit. I think it’s measured by the max number of completed requests per second. The highest throughput is usually achieved in latency-insensitive batch systems rather than synchronous or asynchronous request/response. The fastest way to transfer billions of terabytes of data is to load flash drives on trucks and ship to the receiving site – batch mode.

Max throughput requires finding out bottlenecks. Throughput of a given system is the throughput of the “narrowest” or “slowest” link on the critical path. Adding concurrency to the bottleneck spots (not other spots) improves throughput. For a realistic system, people would buy the max amount of memory they can afford to avoid hitting disk, put in a lot of processors, and somehow multiplex I/O. In low-latency systems, network I/O is often the biggest latency contributor. People use java NIO and exchange collocation, among other things.

Other concurrency techniques include partitioned table, parallel query, disk array, grid and cloud computing.

By “narrowest” I mean the high-way. The more lanes, the better the throughput. A 32-thread processor has 32 lanes. Each thread has a capacity limited by clock speed and software synchronization – that’s what I mean by “slowest”.

##[09]low latency techniques

roughly ranked in terms of interviewer’s emphasis

  1. OOM and GC. memory conservation.
    1. Avoid dupe objects.
    2. avoid keeping large order state object
  2. parallelism
  3. avoid recursion
  4. JVM tuning beyond GC tuning
  5. –network
  6. avoid network like distributed cache. Favor single-JVM designs
  7. NIO
  8. FIX payload size reduction, such as encoding and compression
  9. –MOM
  10. multicast. RV is more stable than the newly invented multicast-JMS
  11. MOM message size control
  12. !! Peer-to-peer messaging eliminates message brokers and daemon processes — diagrams, brevity.

JRockit real time —