typically 10 – 100 millisec
— Mostly based on Charlie Hunt’s [[JavaPerf]] P28
Runtime.availableProcessors() returns the count of virtual processors, i.e. the number of hardware threads. This is an important number for CPU tuning and bottleneck analysis.
When run-queue depth exceeds 4 times the processor count, the host system becomes visibly slow (presumably due to excessive context switching). For a host dedicated to a JVM, this is a second cause of CPU saturation. The first cause is high CPU usage, which even a single CPU-hog can drive up.
Note: run-queue depth is the first column (r) in vmstat output
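The 4× rule of thumb above can be expressed in code; a minimal sketch (the factor 4 is the heuristic from these notes, not a JVM constant):

```java
public class CpuSaturation {
    /** Rule of thumb from these notes: a host gets visibly slow when
     *  run-queue depth (first column "r" of vmstat) exceeds 4x the
     *  virtual processor count. */
    static boolean runQueueSaturated(int runQueueDepth) {
        int vcpus = Runtime.getRuntime().availableProcessors();
        return runQueueDepth > 4 * vcpus;
    }

    public static void main(String[] args) {
        int vcpus = Runtime.getRuntime().availableProcessors();
        System.out.println("virtual processors: " + vcpus);
        System.out.println("saturation threshold (run-queue depth): " + 4 * vcpus);
    }
}
```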
I think Java could deliver latency numbers similar to C/C++, but the essential techniques are probably unnatural to Java:
My friend Qihao commented —
There are more management barriers than technical barriers to low-latency Java. One common example: “suppressing GC is unnatural”.
Adapted from blog by Hayden James
Even when our average memory usage is smaller than RAM capacity, the system still benefits from swap!
Most server processes are daemons. Any daemon can create lots of memory pages that are rarely accessed until shutdown. The kernel often relocates these rarely-used pages to swap for performance reasons, mostly to free up RAM. The reclaimed RAM can remain vacant for some time, so the relocation may look unnecessary, but ..
But what if there's no other process or hotter data? What if all the data fits into RAM? This is rare, but yes, you can disable swap. Rarely-needed tweaks are sometimes under-documented, as the kernel is a very efficient “government”, like Singapore.
Note, as explained in my [[linux kernel]], kswapd is both a process and a kernel thread that wakes up from time to time.
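Related sketch: on Linux, the kernel's eagerness to swap out idle pages is tunable via the vm.swappiness sysctl; a minimal Java snippet to read it (the /proc path is Linux-specific, so it prints a fallback elsewhere):

```java
import java.nio.file.Files;
import java.nio.file.Path;

public class Swappiness {
    public static void main(String[] args) throws Exception {
        // Linux-only path; value is 0..100-ish, higher = kernel swaps
        // idle pages out more aggressively
        Path p = Path.of("/proc/sys/vm/swappiness");
        if (Files.exists(p)) {
            System.out.println("vm.swappiness = " + Files.readString(p).trim());
        } else {
            System.out.println("not a Linux host");
        }
    }
}
```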
In real-world applications on IA32, we have not seen a significant performance boost from icc, but it is no worse than gcc. On IA64, however, icc-built apps consistently outperform those built with gcc.
10’s of millisec between London and NY
–[[java performance]] by Scott Oaks
best of breed .. see chapter details on:
* [jvm] heap memory
* lambda, stream (java 8 interviews!)
The Introduction chapter outlines 3 broad aspects
* JVM – like memory tuning
* java language – like threading, collections
* Java API — like xml parser, JDBC, serialization, JSON…
JVM tuning is done by “system engineers” who may not be developers.
Leading authors describe techniques for memory efficiency as well as speed efficiency.
Trading is mostly about speed. Memory is important in other domains.
800+ G4 linux machines (HP). 4-8 core / 16G RAM each. 32-bit OS
Later consolidated to
400+ G7 linux machines. 32 core / 144G RAM each. 64-bit OS
Roughly 6 of these machines are dedicated Coherence machines.
A database is not part of this market-data and real-time risk engine.
A real, practical challenge in a low-latency market-data system is quickly finding out “what is the system doing?”. Log files usually have a lot of detail, but we also want to know what files/sockets our process is accessing, and what kind of data it is reading and writing.
truss -s or -r can reveal actual data transferred(??)
if the write() syscall is stuck, perhaps the disk is full.
lsof reads /proc to get open sockets/files
Q: request wait-queuing (the “toilet queue”)? I know weblogic can configure this queue
A: keep the queue entries small. We only keep the object id while the objects are serialized to disk (?!)
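A hypothetical sketch of the pattern in this answer: the in-memory queue holds only small ids, while each payload object is serialized to its own file on disk (class and file layout are illustrative, not the actual system):

```java
import java.io.*;
import java.nio.file.*;
import java.util.ArrayDeque;
import java.util.Queue;

/** Sketch: queue entries stay small (just a long id);
 *  the payload objects live on disk until dequeued. */
public class IdQueue {
    private final Queue<Long> ids = new ArrayDeque<>();
    private final Path dir;
    private long next = 0;

    IdQueue(Path dir) throws IOException {
        this.dir = Files.createDirectories(dir);
    }

    /** Enqueue: serialize the payload to disk, keep only the id in memory. */
    long put(Serializable payload) throws IOException {
        long id = next++;
        try (ObjectOutputStream out = new ObjectOutputStream(
                Files.newOutputStream(dir.resolve(Long.toString(id))))) {
            out.writeObject(payload);
        }
        ids.add(id);
        return id;
    }

    /** Dequeue: read the payload back from disk by id, then delete the file. */
    Object take() throws IOException, ClassNotFoundException {
        Long id = ids.poll();
        if (id == null) return null;
        Path f = dir.resolve(Long.toString(id));
        try (ObjectInputStream in = new ObjectInputStream(Files.newInputStream(f))) {
            Object o = in.readObject();
            Files.delete(f);
            return o;
        }
    }
}
```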
Q: is 1kB too large?
Q: most common cause of perf issues?
A: mem leak, still present after regression tests
Q: jvm tuning?
A: yes, important, esp. memory-related
Q: regression tests?
Q: perf tools?
A: no tools; primarily based on logs, e.g. track a long-running
transaction and compute the duration between SOAP transaction start and end
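The log-based timing described above could be sketched like this (the timestamp format is a hypothetical example; real log formats differ):

```java
import java.time.Duration;
import java.time.LocalDateTime;
import java.time.format.DateTimeFormatter;

public class LogTiming {
    // hypothetical log timestamp format; adapt to the actual log layout
    private static final DateTimeFormatter FMT =
        DateTimeFormatter.ofPattern("yyyy-MM-dd HH:mm:ss.SSS");

    /** Milliseconds elapsed between a "start" and "end" log timestamp. */
    static long elapsedMillis(String startTs, String endTs) {
        LocalDateTime start = LocalDateTime.parse(startTs, FMT);
        LocalDateTime end   = LocalDateTime.parse(endTs, FMT);
        return Duration.between(start, end).toMillis();
    }

    public static void main(String[] args) {
        // e.g. SOAP transaction start/end lines grepped from the log
        System.out.println(elapsedMillis(
            "2024-01-15 10:00:01.250",
            "2024-01-15 10:00:03.750"));  // prints 2500
    }
}
```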
Q: web services?
A: many of the transactions are based on SOAP (Axis). A TCP monitor
can help with your perf investigation.
A: yes, we use two-phase commits. Too many transactions involved;
really complex biz logic. The solution is async.
A: handled by weblogic.
Q: how is the async and queue implemented?
A: weblogic-mq with persistent store, crash-proof
–cache: model manager
telecom: circuits don’t change that often
–async with queues
telecom: request volume
–loose coupling, with async, queues and stateless
separate jvm for web tier, for slsb, for dispatchers and workers
configurable worker threads, on multiple machines, with multi-processor cores
configurable worker pool
stop runaway threads
clustering on web tier
clustering btw dispatcher instances
clustering transcoder slsb