CPU run-queue #java perspective

— Mostly based on Charlie Hunt’s [[JavaPerf]] P28

Runtime.getRuntime().availableProcessors() returns the count of virtual processors, i.e. the number of hardware threads visible to the JVM. This is an important number for CPU tuning and bottleneck analysis.

When run-queue depth exceeds 4 times the processor count, the host becomes visibly slow (presumably due to excessive context switching). For a host dedicated to the JVM, this is the second indicator of CPU saturation. The first is high CPU usage, which can occur even with a single CPU hog.

Note that run-queue depth is the first column (r) in vmstat output.
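The 4x rule of thumb above can be sketched in a few lines of java. This is only an illustration: the class name RunQueueCheck and the idea of passing the run-queue depth on the command line are my assumptions, and in practice the depth would be read off the r column of vmstat.

```java
class RunQueueCheck {
    // Rule of thumb from the text: trouble starts when depth > 4 x processor count.
    static final int SATURATION_FACTOR = 4;

    // True when the run queue is deeper than the rule-of-thumb limit.
    static boolean isSaturated(int runQueueDepth, int processors) {
        return runQueueDepth > SATURATION_FACTOR * processors;
    }

    public static void main(String[] args) {
        // Count of virtual processors (hardware threads) visible to the JVM.
        int processors = Runtime.getRuntime().availableProcessors();
        // Hypothetical usage: depth copied by hand from the "r" column of vmstat.
        int depth = args.length > 0 ? Integer.parseInt(args[0]) : 0;
        System.out.printf("processors=%d runQueue=%d saturated=%b%n",
                processors, depth, isSaturated(depth, processors));
    }
}
```

On an 8-thread host, a depth of 32 is still within the limit while 33 is over it.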

[20] java≠a natural choice 4 latency #DQH

I think java could deliver similar latency numbers to c/c++, but the essential techniques are probably unnatural to java:

  • STM: Really low-latency systems should use single-threaded mode. STM is widely used and well proven. Concurrency is the biggest advantage of java, but unfortunately it is not effective in serious latency engineering.
  • DAM: Dynamically allocated memory needs strict control, but DAM usage permeates mainstream java.
  • arrays: Latency engineering favors contiguous data structures, i.e. arrays, over object graphs such as hash tables, lists, trees, or arrays of heap pointers. C pointers were designed around tight integration with arrays, and subsequent languages have all moved away from arrays. Programming with raw arrays in java is unnatural.
    • struct: Data structures in C have a second dimension besides arrays, namely structs. Like arrays, structs are very compact, waste no memory, and can live on the heap or off it. In java, this would translate to a class with only primitive fields. Such a class is unnatural in java.
  • GC: Low latency doesn’t like a garbage collector thread that can relocate objects. I don’t feel confident discussing this topic, but I feel GC is a handicap in the latency race. Suppressing GC is unnatural for a GC language like java.
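To make the arrays/struct/DAM points concrete, here is a minimal sketch of what "unnatural" java can look like: parallel primitive arrays allocated once up front, instead of a list of quote objects. The class QuoteTable and its fields are hypothetical names I made up for illustration, not taken from any real system.

```java
class QuoteTable {
    // One contiguous primitive array per column, allocated once at startup.
    private final long[] timestampNanos;
    private final double[] bid;
    private final double[] ask;
    private int size;

    QuoteTable(int capacity) {
        timestampNanos = new long[capacity];
        bid = new double[capacity];
        ask = new double[capacity];
    }

    // Stores one quote and returns its row index; no object is allocated here.
    int add(long tsNanos, double bidPx, double askPx) {
        if (size == timestampNanos.length) {
            throw new IllegalStateException("table full");
        }
        timestampNanos[size] = tsNanos;
        bid[size] = bidPx;
        ask[size] = askPx;
        return size++;
    }

    double spread(int row) {
        return ask[row] - bid[row];
    }

    int size() {
        return size;
    }

    public static void main(String[] args) {
        QuoteTable table = new QuoteTable(1024);
        int row = table.add(System.nanoTime(), 100.0, 100.5);
        System.out.println("rows=" + table.size() + " spread=" + table.spread(row));
    }
}
```

After construction the hot path creates no garbage, which also speaks to the DAM and GC bullets. The price is that a "quote" is now just a row index rather than an object, which is exactly the kind of code most java developers find unnatural.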

My friend Qihao commented —

There are more management barriers than technical barriers towards low latency java. One common example is with “suppressing gc is unnatural”.

swap usage when RAM already adequate

Adapted from blog by Hayden James

Even when our average memory usage is smaller than RAM capacity, the system still benefits from swap!

Most server processes are daemons. Any daemon can create lots of memory pages that are rarely accessed until shutdown. The kernel often decides to relocate rarely used memory pages to swap for performance reasons, mostly to free up RAM. The reclaimed RAM can remain vacant for some time, so you may think the relocation is unnecessary, but:

  • the relocation is usually harmless: kswapd uses very little CPU unless such relocation becomes frequent and bidirectional, which is a sign of insufficient RAM.
  • the relocation is preemptive: the vacant RAM is available to any process that can use it more productively. In general, faster cache ought to hold “hotter” data. In other words, hotter data should be cheaper to access.

But what if there’s no other process or hotter data? What if all the data can fit into RAM? This is rare, but yes, you can then disable swap. Rarely needed tweaks are sometimes under-documented, as the kernel is a very efficient “government” like Singapore.

Note, as explained in my [[linux kernel]], kswapd is both a process and a kernel thread that wakes up from time to time.
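To see how much swap a linux host is actually using, one option is to read SwapTotal and SwapFree from /proc/meminfo. Below is a minimal sketch, assuming the usual "Key:   12345 kB" line layout of that file; the class and method names are my own.

```java
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

class SwapUsage {
    // Parses lines such as "SwapTotal:   2097148 kB" into key -> value in kB.
    static Map<String, Long> parseMeminfo(List<String> lines) {
        Map<String, Long> kb = new HashMap<>();
        for (String line : lines) {
            int colon = line.indexOf(':');
            if (colon < 0) continue;
            String[] parts = line.substring(colon + 1).trim().split("\\s+");
            try {
                kb.put(line.substring(0, colon), Long.parseLong(parts[0]));
            } catch (NumberFormatException e) {
                // Skip lines that do not carry a plain number.
            }
        }
        return kb;
    }

    // Swap in use = SwapTotal - SwapFree (0 if the fields are absent).
    static long swapUsedKb(Map<String, Long> kb) {
        return kb.getOrDefault("SwapTotal", 0L) - kb.getOrDefault("SwapFree", 0L);
    }

    public static void main(String[] args) throws Exception {
        Path meminfo = Paths.get("/proc/meminfo");
        if (!Files.isReadable(meminfo)) {
            System.out.println("/proc/meminfo not available on this host");
            return;
        }
        Map<String, Long> kb = parseMeminfo(Files.readAllLines(meminfo));
        System.out.println("swap used (kB): " + swapUsedKb(kb));
    }
}
```

A small, stable swap-used figure on a host with spare RAM is consistent with the harmless, preemptive relocation described above. A figure that keeps moving up and down is the frequent, bidirectional case.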

[[java performance]] by Scott Oaks

Best of breed. See the chapter details on:

* [jvm] heap memory
* [jvm] threading
* [jvm] instrumentation
* JPA
* serialization
* lambda, stream (java 8 interviews!)

The Introduction chapter outlines 3 broad aspects

* JVM – like memory tuning

* java language – like threading, collections

* Java API – like XML parser, JDBC, serialization, JSON

JVM tuning is done by “system engineers” who may not be developers.

 

let’s find out What the system is doing

A real, practical challenge in a low-latency, market-data system is to quickly find out “What’s the system doing”. Log files usually have a lot of details, but we also want to know what files/sockets our process is accessing, what kind of data it is reading and writing.

truss -r (reads) or -w (writes) can reveal the actual data transferred on selected file descriptors.

if write() syscall is stuck, then perhaps disk is full.

lsof reads /proc to get open sockets/files

snoop (the Solaris packet sniffer) shows the traffic on the wire

perf techniques in T J W’s project–ws,mq,tx

Q: request wait-queuing (toilet queue)? I know weblogic can configure the toilet queue
A: keep the queue entries small. we only keep object id while the objects are serialized to disk (?!)

Q: is 1kB too large?
A: no

Q: most common cause of perf issue?
A: mem leak. still present after regression test

Q: jvm tuning?
A: yes important, esp mem related

Q: regression test?
A: important

Q: perf tools?
A: no tools. primarily based on logs. eg. track a long-running transaction and compute the duration between soap transaction start and end.

Q: web services?
A: Many of the transactions are based on soap, axis. TCP monitor
(http://ws.apache.org/axis/java/user-guide.html#AppendixUsingTheAxisTCPMonitorTcpmon)
can help with your perf investigation.

Q: tx?
A: yes we use two phase commits. Too many transactions involved.
really complex biz logic. Solution is async.

Q: multi-threaded?
A: handled by weblogic.

Q: how is the async and queue implemented?
A: weblogic-mq with persistent store, crash-proof
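The log-based approach mentioned above (compute the duration between a transaction's start and end) can be sketched as follows. The log line layout "<ISO timestamp> <txId> START|END" is purely my assumption for illustration; the project's real log format was not discussed.

```java
import java.time.Duration;
import java.time.LocalDateTime;
import java.util.Arrays;
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

class TxDurations {
    // Assumed (hypothetical) line shape: "2020-01-01T10:00:00 tx1 START"
    static Map<String, Duration> durations(List<String> lines) {
        Map<String, LocalDateTime> started = new HashMap<>();
        Map<String, Duration> result = new LinkedHashMap<>();
        for (String line : lines) {
            String[] f = line.trim().split("\\s+");
            if (f.length != 3) continue;            // ignore unrelated log lines
            LocalDateTime ts = LocalDateTime.parse(f[0]);
            if (f[2].equals("START")) {
                started.put(f[1], ts);
            } else if (f[2].equals("END")) {
                LocalDateTime begin = started.remove(f[1]);
                if (begin != null) {
                    result.put(f[1], Duration.between(begin, ts));
                }
            }
        }
        return result;
    }

    public static void main(String[] args) {
        // Made-up sample lines, only to exercise the parser.
        List<String> sample = Arrays.asList(
                "2020-01-01T10:00:00 tx1 START",
                "2020-01-01T10:00:01 tx2 START",
                "2020-01-01T10:00:02.500 tx1 END",
                "2020-01-01T10:00:09 tx2 END");
        durations(sample).forEach((tx, d) ->
                System.out.println(tx + " took " + d.toMillis() + " ms"));
    }
}
```

Sorting the result by duration is usually enough to spot the long-running transactions worth investigating.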

Y K on T2000 vs E10k

Q: processing power?
A: E10k is old. About 400MHz cpu. T2000 has more processing power.
 
Q: form factor?
A: E10k takes a full rack, with up to 16 boards, each holding up to 4 processors
 
Q: any special skill required to administer e10k or above?
A: not much different from mid-range sparc systems