I think Java could deliver latency numbers similar to C/C++, but the essential techniques are probably unnatural to Java:
- STM (single-threaded mode) — Really low-latency systems should run single-threaded. This mode is widely used and well proven. Concurrency is Java's biggest advantage, but unfortunately it is not effective for low latency.
- DAM (dynamically allocated memory) — needs strict control, but DAM usage permeates mainstream Java.
- arrays — Latency engineering favors contiguous arrays, rather than object graphs such as hash tables, lists, trees, or arrays of heap pointers. C pointers were designed around tight integration with arrays, and subsequent languages have all moved away from arrays. Programming with raw arrays in Java is unnatural.
- struct — C data structures have a second dimension besides arrays, namely structs. Like arrays, structs are very compact, waste no memory, and can live on or off the heap. In Java this would translate to a class with only primitive fields. Such a class is unnatural in Java.
- GC — Low latency doesn’t tolerate a garbage-collector thread that can relocate objects. I don’t feel confident discussing this topic, but I feel GC is a handicap in the latency race. Suppressing GC is unnatural in a GC language like Java.
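To make the “arrays” and “struct” points concrete, here is a minimal sketch (all class and field names are my own illustration, not from any real system): a C struct translates to a Java class with only primitive fields, and latency-sensitive code often goes further, keeping each field in its own contiguous primitive array (a structure-of-arrays) instead of an array of heap pointers.

```java
// A C-style struct translated to Java: only primitive fields, no object graph.
final class Quote {
    long instrumentId;
    double bid;
    double ask;
}

// Structure-of-arrays layout: each field lives in one contiguous primitive
// array, unlike Quote[], which is an array of pointers to scattered objects.
final class QuoteTable {
    final long[] instrumentId;
    final double[] bid;
    final double[] ask;

    QuoteTable(int capacity) {
        instrumentId = new long[capacity];
        bid = new double[capacity];
        ask = new double[capacity];
    }

    void set(int i, long id, double b, double a) {
        instrumentId[i] = id;
        bid[i] = b;
        ask[i] = a;
    }

    // sequential scans over bid[]/ask[] stay cache-friendly
    double mid(int i) {
        return (bid[i] + ask[i]) / 2;
    }
}

public class SoaDemo {
    public static void main(String[] args) {
        QuoteTable t = new QuoteTable(1024);
        t.set(0, 42L, 99.0, 101.0);
        System.out.println(t.mid(0));   // prints 100.0
    }
}
```

The flat layout also avoids per-object headers and pointer chasing, which is exactly what makes object graphs expensive in the latency race.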
My friend Qihao commented —
There are more management barriers than technical barriers to low-latency Java. A common example is the point above that “suppressing GC is unnatural”.
Tens of milliseconds of latency between London and NY.
— Mostly based on [[JavaPerf]] P28
Runtime.getRuntime().availableProcessors() returns the count of virtual processors, i.e. the count of hardware threads. This is an important number for CPU tuning and bottleneck analysis.
When run-queue depth exceeds 4 times the processor count, the host system becomes visibly slow. For a host dedicated to Java, this is a second indicator of CPU saturation; the first indicator is high CPU usage.
Note: run-queue depth is the first column (r) in vmstat output.
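The numbers above can be sketched as follows (class and method names are my own; the 4x factor is the rule of thumb from the text, not any API constant):

```java
public class CpuSaturation {
    // rule of thumb from the text: run-queue depth > 4 * processors => slow host
    static final int FACTOR = 4;

    static boolean visiblySlow(int runQueueDepth, int processors) {
        return runQueueDepth > FACTOR * processors;
    }

    public static void main(String[] args) {
        // availableProcessors() is an instance method, hence getRuntime() first;
        // it counts virtual processors, i.e. hardware threads
        int vcpus = Runtime.getRuntime().availableProcessors();
        System.out.println("virtual processors:   " + vcpus);
        System.out.println("saturation threshold: " + FACTOR * vcpus);
    }
}
```

The run-queue depth itself would come from vmstat (first column), not from the JVM.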
Adapted from blog by Hayden James
Even when our average memory usage is smaller than RAM capacity, the system still benefits from swap!
Most server processes are daemons, and any daemon can create lots of memory pages that are rarely accessed until shutdown. The kernel often decides to relocate these rarely used pages to swap, mostly to free up RAM. The reclaimed RAM can remain vacant for some time, so you may think the relocation was unnecessary, but ..
- the relocation is usually harmless — kswapd uses very little CPU unless the swapping workload becomes frequent and bidirectional, a sign of insufficient RAM.
- the relocation is preemptive — the vacant RAM becomes available to any process that can use it more productively. In general, faster cache ought to hold “hotter” data; in other words, hotter data should be cheaper to access.
But what if there is no other process or hotter data? What if all the data fits into RAM? This is rare, but yes, you can disable swap. Rarely needed tweaks are sometimes under-documented, as the kernel is a very efficient “government”, like Singapore.
Note: as explained in my [[linux kernel]] notes, kswapd is a kernel thread (it also shows up in the process list) that wakes up from time to time.
In real-world applications on IA32, we have not seen a significant performance boost from icc, but it is no worse than gcc. On IA64, however, icc consistently outperforms apps built with gcc.
Serious authors describe techniques for memory efficiency as well as speed efficiency.
Trading is mostly about speed; memory efficiency matters more in other domains.
800+ G4 Linux machines (HP), 4-8 cores / 16 GB RAM each, 32-bit OS
Later consolidated to
400+ G7 Linux machines, 32 cores / 144 GB RAM each, 64-bit OS
Roughly 6 of these machines are dedicated Coherence machines.
The database is not part of this market-data and real-time risk engine.
A real, practical challenge in a low-latency market-data system is quickly finding out “what is the system doing?”. Log files usually have plenty of detail, but we also want to know which files and sockets our process is accessing, and what kind of data it is reading and writing.
truss -r or -w (per file descriptor) can reveal the actual data transferred (??)
If a write() syscall is stuck, perhaps the disk is full.
lsof reads /proc to get open sockets/files
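As a hedged illustration of what lsof does under the hood on Linux (the class and method names are mine): every entry under /proc/<pid>/fd is a symlink to the open file, socket, or pipe, so a process can enumerate its own descriptors directly. On systems without procfs this sketch simply returns an empty list.

```java
import java.nio.file.*;
import java.util.*;

public class OpenFds {
    // list this process's open fds by reading the /proc symlinks, as lsof does
    static List<String> openFds() {
        Path fdDir = Paths.get("/proc/self/fd");
        List<String> result = new ArrayList<>();
        if (!Files.isDirectory(fdDir)) return result;   // no procfs, e.g. macOS
        try (DirectoryStream<Path> fds = Files.newDirectoryStream(fdDir)) {
            for (Path fd : fds) {
                try {
                    // each entry is a symlink to the underlying file/socket/pipe
                    result.add(fd.getFileName() + " -> " + Files.readSymbolicLink(fd));
                } catch (java.io.IOException e) {
                    // fd may close between listing and readlink; skip it
                }
            }
        } catch (java.io.IOException e) {
            // directory vanished or permission denied; return what we have
        }
        return result;
    }

    public static void main(String[] args) {
        openFds().forEach(System.out::println);
    }
}
```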
Q: request wait-queuing (the “toilet queue”)? I know WebLogic can configure this queue.
A: keep the queue entries small. We keep only the object id in the queue while the objects are serialized to disk (?!)
Q: is 1kB too large?
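The “keep the queue entries small” answer above can be sketched as follows, under my own assumptions (a file-per-object layout and java.io serialization; the real system’s persistence format is not described): the in-memory queue holds only 8-byte ids, while each payload is serialized to disk and reloaded on take.

```java
import java.io.*;
import java.nio.file.*;
import java.util.concurrent.*;

public class IdQueue {
    // the queue holds only ids; payloads live on disk until dequeued
    private final BlockingQueue<Long> ids = new LinkedBlockingQueue<>();
    private final Path dir;

    public IdQueue(Path dir) { this.dir = dir; }

    public void put(long id, Serializable payload) throws IOException, InterruptedException {
        // spill the (possibly large) payload to disk; enqueue only the 8-byte id
        try (ObjectOutputStream out = new ObjectOutputStream(
                Files.newOutputStream(dir.resolve(id + ".ser")))) {
            out.writeObject(payload);
        }
        ids.put(id);
    }

    public Object take() throws Exception {
        long id = ids.take();
        try (ObjectInputStream in = new ObjectInputStream(
                Files.newInputStream(dir.resolve(id + ".ser")))) {
            return in.readObject();
        }
    }

    public static void main(String[] args) throws Exception {
        IdQueue q = new IdQueue(Files.createTempDirectory("idq"));
        q.put(7L, "hello");
        System.out.println(q.take());   // prints "hello"
    }
}
```

A production version would also delete the file after take() and handle crash recovery, which is exactly what a persistent-store MQ provides out of the box.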
Q: most common cause of perf issues?
A: memory leaks, still present after regression tests.
Q: JVM tuning?
A: yes, important, especially memory-related.
Q: regression tests?
Q: perf tools?
A: no tools; primarily based on logs, e.g. tracking a long-running transaction and computing the duration between SOAP transaction start and end.
Q: web services?
A: many of the transactions are based on SOAP and Axis. The TCP Monitor tool can help with your perf investigation.
A: yes, we use two-phase commit. Too many transactions are involved, with really complex business logic. The solution is async.
A: handled by WebLogic.
Q: how are the async processing and the queue implemented?
A: WebLogic MQ with a persistent store, crash-proof.
- cache: model manager
  (telecom: circuits don’t change that often)
- async with queues
  (telecom: request volume)
- loose coupling, with async, queues and stateless services
separate JVMs for the web tier, the SLSBs, and the dispatchers and workers
configurable worker threads, on multiple machines with multi-core processors
configurable worker pool
stop runaway threads
clustering on the web tier
clustering between dispatcher instances
clustering of the transcoder SLSB
Q: processing power?
A: the E10k is old, with roughly 400 MHz CPUs. The T2000 has more processing power.
Q: form factor?
A: the E10k takes a full rack, with up to 16 boards, each holding up to 4 processors.
Q: any special skills required to administer an E10k or above?
A: not much different from mid-range SPARC systems.