too many DB-writes: sharding insufficient #Indeed

Context: each company receives many many reviews. In a similar scenario, we can say a stock receives many investor comments

Interviewer: OK you said horizontal sharding by company id can address highly concurrent data store updates. Now what if one company, say, Amazon, by itself gets 20% of the updates so sharding can’t help this one company.

me: I guess the update requests would block and possibly time out. We can use a task queue.

This is similar to WordPress import/export requests.

  • Each task takes a few seconds, so we don’t want user to wait.
  • If high server load, the wait could be longer.
  • More important — this task need not be immediate. It can be scheduled.
  • We should probably optimize for server efficiency rather than latency or throughput

So similarly, all the review submissions on Amazon can be queued and scheduled.

In FIX protocol, an order sender can receive 150=A meaning PendingNew. I call it “queued for exchange”.  The OMS Server sends the acknowledgement in two steps.

  1. Optional. Execution Report message on order placement in OMS Server’s Order Book. ExecType (150) = A (Pending New)
  1. Execution Report message on receiving acknowledgement from exchange. ExecType (150) = 0 (New). I guess this indicates placement in exchange order book

Note this optimization is completely different from HFT latency engineering.


3ways to expire cached items

server-push update ^ TTL ^ conditional-GET

Online articles would hint at both, but few articles list these two modes explicitly. This is a simple concept but fundamental to DB tuning and app tuning.

A) TTL — more common. Each “cache item” embeds a time-to-live data field a.k.a expiry timestamp

B) cache-invalidation — some “events” would trigger an invalidation. Without invalidation, a cache item would live forever with a infinity TTL, like the list of China provinces.

If TTL is predictable, you should combine A and B. I think cookie is an example.

G) conditional-GET in HTTP is a proven industrial strength solution described in my 2005 book [[computer networking]]. The cache server always sends a GET to the database but with a If-modified-since header. This reduces unnecessary database load and network load.

TTL eager server-push conditional-GET
if frequent query, infrequent updates efficient efficient high network load but limited to tiny requests
if latency important OK lowest latency slower lazy fetch, though efficient
if infrequent query good waste DB/client/NW resources as “push” is unnecessary efficient on DB/client/NW
if frequent update unsuitable high load on DB/client/NW efficient conflation
if frequent update+query unsuitable can be wasteful fairly efficient


swap usage when RAM already adequate

Adapted from blog by Hayden James

Even when our average memory usage is smaller than RAM capacity, system still benefits from swap!

Most server processes are daemons. Any daemon can create lots of memory pages rarely accessed till shutdown. Kernel often decides to relocate rarely used memory pages to swap for performance reasons, mostly to free up RAM. The reclaimed RAM space can remain vacant for some time, so you may think the relocation is unnecessary but ..

  • the relocation is usually harmless — the kswap pid uses very little cpu unless such relocation workload becomes frequent and bidirectional, a sign of insufficient RAM.
  • the relocation is preemptive — The vacant RAM is available to any process that can use it more productively. In general, faster cache ought to hold “hotter” data. In other words, hotter data should be cheaper to access.

But what if there’s no other process or not hotter data? What if all the data can fit into RAM? This is rare but yes you can disable swap.

Note, as explained in my [[linux kernel]], kswapd is both a process and a kernel thread that wakes up from time to time.

[[java performance]] by Scott Oaks

–[[java performance]] by Scott Oaks


best of breed..see chapter details on

[jvm] heap memory

[jvm] threading

[jvm] instrumentation



lambda, stream  (java 8 interviews!)


The Introduction chapter outlines 3 broad aspects

* JVM – like memory tuning

* java language – like threading, collections

* Java API — like xml parser, JDBC, serialization, Json


JVM tuning is done by “system engineers” who may not be developers.


## personal xp on low latency trading

Thread — lockfree becomes relevant in latency-sensitive contexts
Thread — create opportunity for parallelism, and exploit multi-core
Thread — avoid locks including concurrent data structures and database. Be creative.
Data structures — STL is known to be fairly efficient but can affect performance
Data structures — minimize footprint
Data structures — favor primitive arrays because allocations are fewer and footprint is smaller
Algo — favor iterative over recursive
DB tuning — avoid hitting DB? Often impractical IMO
Serialize — favor primitives over java autoboxing
Serialize — favor in-process transfers, and bypass serialization
Mem — avoid vtbl for large volume of small objects
Mem — reuse objects, and avoid instantiation? Often impractical IMO
Mem — mem virtualization (like gemfire) as alternative to DB. A large topic.
Mem — Xms == Xmx, to avoid dynamic heap expansion
Mem — control size of large caches.
Mem — consider weak reference
MOM — multicast is the clear favorite to cope with current market data volume
MOM — shrink payload. FIX is a primary example.
GC — avoid forced GC, by allocating large enough heap
Socket — dedicate a thread to a single connection and avoid context switching
Socket — UDP can be faster than TCP
Socket — non-blocking can beat blocking if there are too many low-volume connections. See post on select()

Tick data repository – real-time/historical mkt-data

I think this was a CS interview…

Used for 1) back-testing, 2) trading signal generation, even if no real order sent, and 3) algorithmic real-$ trading.

First target user is back testing. Users would also try the 1 -} 2 -} 3 in that sequence.

Stores FX, FI and equity market data. Therefore the system treats everything just as generic tick data either quote ticks or trade ticks.

Multiple TB of data in memory. Time-series (not SQL) database.

Created using c++ and java. Probably due to sockets.

let’s find out What the system is doing

A real, practical challenge in a low-latency, market-data system is to quickly find out “What’s the system doing”. Log files usually have a lot of details, but we also want to know what files/sockets our process is accessing, what kind of data it is reading and writing.

truss -s or -r can reveal actual data transferred(??)

if write() syscall is stuck, then perhaps disk is full.

lsof reads /proc to get open sockets/files


##[09]low latency techniques

roughly ranked in terms of interviewer’s emphasis

  1. OOM and GC. memory conservation.
    1. Avoid dupe objects.
    2. avoid keeping large order state object
  2. parallelism
  3. avoid recursion
  4. JVM tuning beyond GC tuning
  5. –network
  6. avoid network like distributed cache. Favor single-JVM designs
  7. NIO
  8. FIX payload size reduction, such as encoding and compression
  9. –MOM
  10. multicast. RV is more stable than the newly invented multicast-JMS
  11. MOM message size control
  12. !! Peer-to-peer messaging eliminates message brokers and daemon processes — diagrams, brevity.

JRockit real time —

perf techniques in T J W’s project–ws,mq,tx

Q: request wait-queuing (toilet queue)? I know weblogic can configure the toilet queue
A: keep the queue entries small. we only keep object id while the objects are serialized to disk (?!)

Q: is 1kB too large?
A: no

q: most common cause of perf issue?
A: mem leak. still present after regression test

q: jvm tuning?
A: yes important, esp mem related

q: regression test?
a: important

q: perf tools?
a: no tools. primarily based on logs. eg. track a long-running
transaction and compute the duration between soap transaction start
and end.

Q: web services?
A: Many of the transactions are based on soap, axis. TCP monitor
can help with your perf investigation.

Q: tx?
A: yes we use two phase commits. Too many transactions involved.
really complex biz logic. Solution is async.

Q: multi-threaded?
A: handled by weblogic.

Q: how is the async and queue implemented?
A: weblogic-mq with persistent store, crash-proof