SDI: 3 ways to expire cached items

server-push update ^ TTL ^ conditional-GET # write-through is not cache expiration

Few online articles list these solutions explicitly. Some of these are simple concepts, but fundamental to DB tuning and app tuning. Some articles compare write-through ^ write-behind ^ refresh-ahead. I think refresh-ahead is similar to TTL.

B) cache-invalidation — some “events” would trigger an invalidation. Without invalidation, a cache item would live forever with an infinite TTL, like the list of China provinces.

After a cache proxy gets the invalidation message in a small payload (bandwidth-friendly), the proxy discards the outdated item and can decide when to request an update. The request may be skipped completely if the item is no longer needed.
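The invalidate-then-lazily-refetch idea can be sketched as below. This is a toy sketch, not any real cache library's API; the class and method names are mine:

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Function;

// Hypothetical cache proxy: invalidation only discards; re-fetch is lazy.
public class InvalidatingProxy {
    private final Map<String, String> cache = new ConcurrentHashMap<>();
    private final Function<String, String> backingStore; // stand-in for a DB lookup

    public InvalidatingProxy(Function<String, String> backingStore) {
        this.backingStore = backingStore;
    }

    // Small invalidation message: just the key, not the new content.
    public void onInvalidate(String key) {
        cache.remove(key); // discard outdated item; no immediate re-fetch
    }

    // Re-fetch happens only when (and if) the item is requested again.
    public String get(String key) {
        return cache.computeIfAbsent(key, backingStore);
    }
}
```

If the invalidated key is never requested again, the backing store is never hit, which is the bandwidth/load advantage described above.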

B2) cache-update by server push — IFF bandwidth is available, server can send not only a tiny invalidation message, but also the new cache content.

IFF combined with TTL, or with reliability added, multicast can be used to deliver cache updates, as explained in my other blog posts.

T) TTL — more common. Each “cache item” embeds a time-to-live data field, a.k.a. an expiry timestamp. The HTTP cookie is the prime example.

In Coherence, it’s possible for the cache proxy to pre-emptively request an update on an expired item. This would reduce latency but requires a multi-threaded cache proxy.
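The embedded expiry timestamp can be illustrated with a toy cache (names are mine; passing `nowMillis` explicitly keeps the sketch deterministic, where a real cache would call the system clock):

```java
import java.util.HashMap;
import java.util.Map;

// Toy TTL cache: each item embeds its own expiry timestamp.
public class TtlCache {
    private static final class Entry {
        final String value;
        final long expiresAtMillis; // the embedded time-to-live field
        Entry(String value, long expiresAtMillis) {
            this.value = value;
            this.expiresAtMillis = expiresAtMillis;
        }
    }

    private final Map<String, Entry> map = new HashMap<>();

    public void put(String key, String value, long ttlMillis, long nowMillis) {
        map.put(key, new Entry(value, nowMillis + ttlMillis));
    }

    // An expired entry is treated as a miss (returns null).
    // A refresh-ahead variant would instead re-fetch shortly BEFORE expiry.
    public String get(String key, long nowMillis) {
        Entry e = map.get(key);
        if (e == null || nowMillis >= e.expiresAtMillis) return null;
        return e.value;
    }
}
```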

G) conditional-GET in HTTP is a proven, industrial-strength solution described in my 2005 book [[computer networking]]. The cache proxy always sends a GET to the database but with an If-Modified-Since header. This reduces unnecessary database load and network load.
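The revalidation handshake can be modelled without any real network. Below is a simplified sketch of the If-Modified-Since / 304 semantics (class and method names are mine):

```java
// Toy model of HTTP conditional GET, simplified from RFC 7232 semantics:
// the proxy revalidates with If-Modified-Since; the origin answers 304
// (not modified, no body) when unchanged, else 200 with the fresh body.
public class ConditionalGet {
    public static final class Response {
        public final int status;        // 200 or 304
        public final String body;       // null on a 304
        public final long lastModified; // echoed validator
        public Response(int status, String body, long lastModified) {
            this.status = status;
            this.body = body;
            this.lastModified = lastModified;
        }
    }

    // Origin side: compare the resource's mtime against the validator
    // the cache proxy sent in the If-Modified-Since header.
    public static Response handleGet(String body, long mtime, long ifModifiedSince) {
        if (mtime <= ifModifiedSince) {
            return new Response(304, null, mtime); // tiny response, no payload
        }
        return new Response(200, body, mtime);
    }
}
```

The 304 path is what makes the “frequent but tiny requests” pattern cheap: the proxy pays one small round trip but no payload transfer.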

W) write-behind (asynchronous) or write-through — in some contexts, the cache proxy handles not only Reads but also Writes. So Read requests read from or add to the cache, and Write requests update both the cache proxy and the master data store. Drawback — in a distributed topology, updates from other sources are not visible to “me” the cache proxy, so I still rely on one of the other 3 means.
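A minimal write-through sketch (names are mine). Note how an out-of-band update to the master store stays invisible to the proxy's cache, which is exactly the drawback just described:

```java
import java.util.HashMap;
import java.util.Map;

// Toy write-through proxy: a Write updates the master store synchronously
// AND refreshes the local cache; a Read is served from cache when possible.
public class WriteThroughProxy {
    private final Map<String, String> cache = new HashMap<>();
    private final Map<String, String> masterStore; // stand-in for the DB

    public WriteThroughProxy(Map<String, String> masterStore) {
        this.masterStore = masterStore;
    }

    public void write(String key, String value) {
        masterStore.put(key, value); // synchronous write to master
        cache.put(key, value);       // keep the local copy in sync
    }

    public String read(String key) {
        return cache.computeIfAbsent(key, masterStore::get);
    }
}
```

A write-behind variant would queue the `masterStore.put` and apply it asynchronously; the caching logic is otherwise the same.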

|  | TTL | eager server-push | conditional-GET |
| --- | --- | --- | --- |
| frequent query, infrequent updates | efficient | efficient | frequent but tiny requests between DB and cache proxy |
| latency important | OK | lowest latency | slower lazy fetch, though efficient |
| infrequent query | good | wastes DB/proxy/NW resources as “push” is unnecessary | efficient on DB/proxy/NW |
| frequent update | unsuitable | high load on DB/proxy/NW | efficient conflation |
| frequent update + query | unsuitable | can be wasteful | perhaps most efficient |


CPU run-queue #java perspective

— Mostly based on [[JavaPerf]] P28

Runtime.getRuntime().availableProcessors() returns the count of virtual processors, i.e. the count of hardware threads. This is an important number for CPU tuning and bottleneck analysis.

When the run-queue depth exceeds 4 times the processor count, the host system becomes visibly slow. For a host dedicated to Java, this is a second sign of CPU saturation; the first sign is high CPU usage.

Note: run-queue depth is the first column (r) in vmstat output
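The 4x rule is easy to encode. `availableProcessors()` is the real JDK call; the threshold helper is mine:

```java
// Rule of thumb from [[JavaPerf]]: a run-queue depth beyond 4x the
// virtual-processor count suggests the host is saturated.
public class RunQueueCheck {
    public static int saturationThreshold(int virtualProcessors) {
        return 4 * virtualProcessors;
    }

    public static boolean looksSaturated(int runQueueDepth, int virtualProcessors) {
        return runQueueDepth > saturationThreshold(virtualProcessors);
    }

    public static void main(String[] args) {
        // JDK API: counts virtual processors, i.e. hardware threads
        int n = Runtime.getRuntime().availableProcessors();
        System.out.println("run-queue saturation threshold = " + saturationThreshold(n));
    }
}
```

Compare the printed threshold against the `r` column of vmstat.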

too many DB-writes: sharding insufficient #Indeed

Context: each company receives many, many reviews. A similar scenario: too many user comments flooding in during the soccer World Cup.

Aha — the updates don’t need to show up on the browser in strict order

Aha — the commenting user only needs to see her own comment on top, and need not see other users’ latest comments. The comments below her own could be browser-cached content. Ajax…
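That merge can be sketched in a few lines (a toy sketch; class and method names are mine):

```java
import java.util.ArrayList;
import java.util.List;

// Toy sketch: the commenting user sees her own new comment on top of a
// (possibly stale) browser-cached list, with no fresh server round trip.
public class CommentView {
    public static List<String> render(String ownNewComment, List<String> cachedComments) {
        List<String> view = new ArrayList<>();
        view.add(ownNewComment);     // her own comment, immediately visible
        view.addAll(cachedComments); // stale-tolerant cached content below
        return view;
    }
}
```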

Interviewer: OK you said horizontal sharding by company id can address highly concurrent data store updates. Now what if one company, say, Amazon, by itself gets 20% of the updates so sharding can’t help this one company.

me: I guess the update requests would block and possibly time out. We can use a task queue.

This is similar to WordPress import/export requests.

  • Each task takes a few seconds, so we don’t want the user to wait.
  • Under high server load, the wait could be longer.
  • More important — this task need not be immediate. It can be scheduled.
  • We should probably optimize for server efficiency rather than latency or throughput.

So similarly, all the review submissions on Amazon can be queued and scheduled.
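A minimal queue-and-drain sketch using the JDK's BlockingQueue (class and method names are mine):

```java
import java.util.List;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;

// Toy sketch: review submissions for a hot shard go into a queue and are
// drained by a scheduled worker, so the user never waits on the DB write.
public class ReviewQueue {
    private final BlockingQueue<String> pending = new LinkedBlockingQueue<>();

    // Returns immediately; the actual DB write is deferred.
    public boolean submit(String review) {
        return pending.offer(review);
    }

    // Called by a background worker on its own schedule; drains one batch
    // (batching also lets the worker write to the DB more efficiently).
    public int drainBatch(List<String> sink, int maxBatch) {
        return pending.drainTo(sink, maxBatch);
    }
}
```

The submitter gets an instant acknowledgement ("queued"), matching the Aha above that strict ordering and instant global visibility are not required.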

In FIX protocol, an order sender can receive 150=A, meaning PendingNew. I call it “queued for exchange”. The OMS Server sends the acknowledgement in two steps.

  1. Optional. Execution Report message on order placement in OMS Server’s Order Book. ExecType (150) = A (Pending New)
  2. Execution Report message on receiving acknowledgement from exchange. ExecType (150) = 0 (New). I guess this indicates placement in the exchange order book
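For illustration, the two tag-150 values above as a tiny lookup (the descriptions' wording is mine; the 'A' and '0' codes are standard FIX ExecType values):

```java
// FIX tag 150 (ExecType) values seen in the two-step acknowledgement.
public class ExecType {
    public static String describe(char execType) {
        switch (execType) {
            case 'A': return "PendingNew: queued, e.g. placed in the OMS order book";
            case '0': return "New: acknowledged by the exchange";
            default:  return "other ExecType";
        }
    }
}
```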

Note this optimization is completely different from HFT latency engineering.

swap usage when RAM already adequate

Adapted from blog by Hayden James

Even when our average memory usage is smaller than RAM capacity, system still benefits from swap!

Most server processes are daemons. Any daemon can create lots of memory pages that are rarely accessed until shutdown. The kernel often decides to relocate rarely used pages to swap for performance reasons, mostly to free up RAM. The reclaimed RAM can remain vacant for some time, so you may think the relocation unnecessary, but ..

  • the relocation is usually harmless — the kswapd daemon uses very little CPU, unless such relocation becomes frequent and bidirectional, a sign of insufficient RAM.
  • the relocation is preemptive — The vacant RAM is available to any process that can use it more productively. In general, faster cache ought to hold “hotter” data. In other words, hotter data should be cheaper to access.

But what if there’s no other process or hotter data? What if all the data can fit into RAM? This is rare but yes you can disable swap. Rarely needed tweaks are sometimes under-documented, as kernel is a very efficient “government” like Singapore.

Note, as explained in my [[linux kernel]], kswapd is both a process and a kernel thread that wakes up from time to time.

[[java performance]] by Scott Oaks



best of breed.. see chapter details on:

[jvm] heap memory

[jvm] threading

[jvm] instrumentation



lambda, stream  (java 8 interviews!)


The Introduction chapter outlines 3 broad aspects

* JVM – like memory tuning

* java language – like threading, collections

* Java API — like xml parser, JDBC, serialization, Json


JVM tuning is done by “system engineers” who may not be developers.


## personal xp on low latency trading

Thread — lockfree becomes relevant in latency-sensitive contexts
Thread — create opportunity for parallelism, and exploit multi-core
Thread — avoid locks, including those in concurrent data structures and databases. Be creative.
Data structures — STL is known to be fairly efficient, but data-structure choice still affects performance
Data structures — minimize footprint
Data structures — favor primitive arrays because allocations are fewer and footprint is smaller
Algo — favor iterative over recursive
DB tuning — avoid hitting DB? Often impractical IMO
Serialize — favor primitives over java autoboxing
Serialize — favor in-process transfers, and bypass serialization
Mem — avoid vtbl for large volume of small objects
Mem — reuse objects, and avoid instantiation? Often impractical IMO
Mem — mem virtualization (like gemfire) as alternative to DB. A large topic.
Mem — Xms == Xmx, to avoid dynamic heap expansion
Mem — control size of large caches.
Mem — consider weak reference
MOM — multicast is the clear favorite to cope with current market data volume
MOM — shrink payload. FIX is a primary example.
GC — avoid forced GC, by allocating large enough heap
Socket — dedicate a thread to a single connection and avoid context switching
Socket — UDP can be faster than TCP
Socket — non-blocking can beat blocking if there are too many low-volume connections. See post on select()
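One of the Mem points above, "reuse objects, avoid instantiation", as a toy pool (a sketch; class and field names are mine):

```java
import java.util.ArrayDeque;

// Toy object pool illustrating object reuse: acquire() hands back a
// recycled instance when one is available, allocating only on a pool miss.
// Fewer allocations means less GC pressure in the hot path.
public class TickPool {
    public static final class Tick { public long price; public long qty; }

    private final ArrayDeque<Tick> free = new ArrayDeque<>();

    public Tick acquire() {
        Tick t = free.poll();
        return (t != null) ? t : new Tick(); // allocate only on pool miss
    }

    public void release(Tick t) {
        t.price = 0; t.qty = 0; // scrub state before recycling
        free.push(t);
    }
}
```

As the list notes, this is often impractical (pooling adds lifecycle bugs and contention), so it is usually reserved for genuinely hot, high-volume objects like market-data ticks.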

Tick data repository – real-time/historical mkt-data

I think this was a CS interview…

Used for 1) back-testing, 2) trading signal generation, even if no real order sent, and 3) algorithmic real-$ trading.

First target user is back-testing. Users would also progress through 1 -> 2 -> 3 in that sequence.

Stores FX, FI and equity market data. The system treats everything as generic tick data, either quote ticks or trade ticks.

Multiple TB of data in memory. Time-series (not SQL) database.

Created using C++ and Java. Probably due to sockets.

let’s find out what the system is doing

A real, practical challenge in a low-latency, market-data system is to quickly find out “What’s the system doing”. Log files usually have a lot of details, but we also want to know what files/sockets our process is accessing, what kind of data it is reading and writing.

truss -r or -w can reveal the actual data transferred (read/write buffers)

if write() syscall is stuck, then perhaps disk is full.

lsof reads /proc to get open sockets/files


## [09] low latency techniques

roughly ranked in terms of interviewer’s emphasis

  1. OOM and GC. memory conservation.
    1. Avoid dupe objects.
    2. avoid keeping large order state object
  2. parallelism
  3. avoid recursion
  4. JVM tuning beyond GC tuning
  5. –network
  6. avoid network like distributed cache. Favor single-JVM designs
  7. NIO
  8. FIX payload size reduction, such as encoding and compression
  9. –MOM
  10. multicast. RV is more stable than the newly invented multicast-JMS
  11. MOM message size control
  12. !! Peer-to-peer messaging eliminates message brokers and daemon processes — diagrams, brevity.

JRockit real time —

perf techniques in T J W’s project–ws,mq,tx

Q: request wait-queuing (toilet queue)? I know weblogic can configure the toilet queue
A: keep the queue entries small. we only keep object id while the objects are serialized to disk (?!)

Q: is 1kB too large?
A: no

Q: most common cause of perf issues?
A: mem leak. still present after regression test

Q: jvm tuning?
A: yes important, esp mem related

Q: regression test?
A: important

Q: perf tools?
A: no tools. primarily based on logs, e.g. track a long-running transaction and compute the duration between soap transaction start and end.

Q: web services?
A: Many of the transactions are based on soap, axis. TCP monitor
can help with your perf investigation.

Q: tx?
A: yes we use two phase commits. Too many transactions involved.
really complex biz logic. Solution is async.

Q: multi-threaded?
A: handled by weblogic.

Q: how is the async and queue implemented?
A: weblogic-mq with persistent store, crash-proof

Y K on T2000 vs E10k

Q: processing power?
A: E10k is old, about 400MHz CPUs. T2000 has more processing power.
Q: form factor?
A: E10k takes a full rack, with up to 16 boards, each holding up to 4 processors
Q: any special skill required to administer e10k or above?
A: not much different from mid-range sparc systems