Context: each company receives a great many reviews. In a similar scenario, a stock receives many investor comments.
Interviewer: OK, you said horizontal sharding by company id can address highly concurrent data-store updates. Now what if one company, say Amazon, by itself gets 20% of the updates, so sharding can't help that one company?
me: I guess the update requests would block and possibly time out. We can use a task queue.
This is similar to WordPress import/export requests.
- Each task takes a few seconds, so we don't want the user to wait.
- Under high server load, the wait could be even longer.
- More important: this task need not be immediate. It can be scheduled.
- We should probably optimize for server efficiency rather than latency or throughput.
So similarly, all the review submissions on Amazon can be queued and scheduled.
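A minimal sketch of that queue (all class and method names here are my invention, not from any real system): submissions for the hot company are acknowledged immediately, and a single worker drains the queue, so the datastore never sees a burst of concurrent updates on one shard.

```java
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;

// Hypothetical sketch: hot-company review submissions go into an in-memory
// queue and the caller returns immediately; one worker applies the writes
// sequentially, serializing the load on the hot shard.
public class ReviewQueue {
    private final BlockingQueue<String> pending = new LinkedBlockingQueue<>();

    // Called by the web tier: enqueue and ack immediately, no waiting on the DB.
    public boolean submit(String reviewJson) {
        return pending.offer(reviewJson);
    }

    // Single consumer thread: one write at a time for the hot shard.
    public void startWorker() {
        Thread worker = new Thread(() -> {
            try {
                while (true) {
                    String review = pending.take(); // blocks until work arrives
                    persist(review);
                }
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        });
        worker.setDaemon(true);
        worker.start();
    }

    private void persist(String review) {
        // placeholder for the real datastore write
    }
}
```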
In the FIX protocol, an order sender can receive 150=A, meaning PendingNew. I call it "queued for exchange". The OMS server sends the acknowledgement in two steps.
- Optional. Execution Report message on order placement in OMS Server’s Order Book. ExecType (150) = A (Pending New)
- Execution Report message on receiving acknowledgement from exchange. ExecType (150) = 0 (New). I guess this indicates placement in exchange order book
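The two-step ack can be decoded with a trivial helper. A sketch (the class name is made up), covering only the two ExecType values discussed above:

```java
// Sketch: map FIX tag 150 (ExecType) from the two-step ack to a readable state.
// 'A' = Pending New and '0' = New are standard FIX ExecType values.
public class ExecTypeDecoder {
    public static String decode(char execType) {
        switch (execType) {
            case 'A': return "PendingNew: queued in OMS order book, not yet at exchange";
            case '0': return "New: acknowledged by exchange, in exchange order book";
            default:  return "other ExecType: " + execType;
        }
    }
}
```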
Note this optimization is completely different from HFT latency engineering.
server-push update ^ TTL ^ conditional-GET
Online articles hint at these modes, but few list them explicitly. This is a simple concept but fundamental to DB tuning and app tuning.
A) TTL — more common. Each “cache item” embeds a time-to-live data field a.k.a expiry timestamp
B) cache-invalidation — some "events" trigger an invalidation. Without invalidation, a cache item would live forever with an infinite TTL, like the list of China provinces.
If TTL is predictable, you should combine A and B. I think cookies are an example.
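A minimal sketch of combining A and B (class and method names are my own, and ttlMillis = -1 standing for "infinite TTL" is my convention): each entry carries an expiry timestamp, and an update event can evict it early regardless of TTL.

```java
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical sketch: TTL (A) plus event-driven invalidation (B) in one cache.
public class TtlCache<K, V> {
    private static final class Entry<V> {
        final V value;
        final long expiresAtMillis; // Long.MAX_VALUE ~ "infinite TTL"
        Entry(V value, long expiresAtMillis) {
            this.value = value;
            this.expiresAtMillis = expiresAtMillis;
        }
    }

    private final ConcurrentHashMap<K, Entry<V>> map = new ConcurrentHashMap<>();

    // ttlMillis == -1 means "never expires" (my convention for this sketch)
    public void put(K key, V value, long ttlMillis) {
        long expiresAt = (ttlMillis == -1)
                ? Long.MAX_VALUE
                : System.currentTimeMillis() + ttlMillis;
        map.put(key, new Entry<>(value, expiresAt));
    }

    // A) TTL: an expired entry reads as absent (lazy eviction on read).
    public V get(K key) {
        Entry<V> e = map.get(key);
        if (e == null || System.currentTimeMillis() > e.expiresAtMillis) {
            map.remove(key);
            return null;
        }
        return e.value;
    }

    // B) event-driven invalidation: an update event evicts regardless of TTL.
    public void invalidate(K key) {
        map.remove(key);
    }
}
```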
G) conditional-GET in HTTP is a proven, industrial-strength solution, described in my 2005 book [[computer networking]]. The cache server always sends a GET to the backend, but with an If-Modified-Since header; if nothing has changed, only a tiny 304 Not Modified comes back. This reduces unnecessary database load and network load.
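A sketch of a conditional GET in Java (the helper names are mine; setIfModifiedSince is the standard URLConnection call that emits the header): a 304 means the cached copy is still valid and no payload crosses the network.

```java
import java.net.HttpURLConnection;
import java.net.URL;

// Sketch of a conditional GET (helper names invented for illustration).
public class ConditionalGet {
    // Pure decision: 304 Not Modified means the cached copy is still valid.
    public static boolean needRefetch(int statusCode) {
        return statusCode != HttpURLConnection.HTTP_NOT_MODIFIED; // 304
    }

    // Ask the origin "changed since my copy?" via If-Modified-Since.
    public static boolean fetchIfModified(String url, long lastFetchedMillis)
            throws Exception {
        HttpURLConnection conn = (HttpURLConnection) new URL(url).openConnection();
        conn.setRequestMethod("GET");
        conn.setIfModifiedSince(lastFetchedMillis); // emits If-Modified-Since
        try {
            return needRefetch(conn.getResponseCode());
        } finally {
            conn.disconnect();
        }
    }
}
```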
||if frequent query, infrequent updates ||high network load but limited to tiny requests
||if latency important ||slower lazy fetch, though efficient
||if infrequent query ||"push" wastes DB/client/NW resources as it's unnecessary; a pull is efficient on DB/client/NW
||if frequent update ||high load on DB/client/NW
||if frequent update+query ||can be wasteful
Adapted from blog by Hayden James
Even when our average memory usage is smaller than RAM capacity, the system still benefits from swap!
Most server processes are daemons. Any daemon can create lots of memory pages that are rarely accessed until shutdown. The kernel often decides to relocate such rarely-used pages to swap, mostly to free up RAM. The reclaimed RAM can sit vacant for some time, so you may think the relocation is unnecessary, but ..
- the relocation is usually harmless: the kswapd process uses very little CPU, unless the relocation workload becomes frequent and bidirectional, a sign of insufficient RAM.
- the relocation is preemptive: the vacant RAM is available to any process that can use it more productively. In general, a faster cache ought to hold "hotter" data; in other words, hotter data should be cheaper to access.
But what if there's no other process and no hotter data? What if all the data fits into RAM? This is rare, but yes, you can disable swap.
Note, as explained in my [[linux kernel]], kswapd is both a process and a kernel thread that wakes up from time to time.
In real-world applications on IA32, we have not seen a significant performance boost from icc, but no regression either. On IA64, however, icc-built apps consistently outperform those built with gcc.
Important authors describe techniques for memory efficiency as well as speed efficiency.
Trading is mostly about speed. Memory is important in other domains.
Thread — lockfree becomes relevant in latency-sensitive contexts
Thread — create opportunity for parallelism, and exploit multi-core
Thread — avoid locks including concurrent data structures and database. Be creative.
Data structures — STL is known to be fairly efficient, but container overhead (heap allocation, node indirection) can still hurt performance in hot paths
Data structures — minimize footprint
Data structures — favor primitive arrays because allocations are fewer and footprint is smaller
Algo — favor iterative over recursive
DB tuning — avoid hitting DB? Often impractical IMO
Serialize — favor primitives over Java autoboxing
Serialize — favor in-process transfers, and bypass serialization
Mem — avoid vtbl for large volume of small objects
Mem — reuse objects, and avoid instantiation? Often impractical IMO
Mem — mem virtualization (like gemfire) as alternative to DB. A large topic.
Mem — Xms == Xmx, to avoid dynamic heap expansion
Mem — control size of large caches
Mem — consider weak references
MOM — multicast is the clear favorite to cope with current market data volume
MOM — shrink payload. FIX is a primary example.
GC — avoid forced GC, by allocating a large enough heap
Socket — dedicate a thread to a single connection and avoid context switching
Socket — UDP can be faster than TCP
Socket — non-blocking can beat blocking if there are too many low-volume connections. See post on select()
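To illustrate the "favor primitive arrays / avoid autoboxing" items above, a small comparison (names invented): the primitive version does one allocation and no boxing, while the boxed version holds a wrapper object per element (outside the small-value Integer cache) and unboxes on every read.

```java
import java.util.List;

// Sketch: the same sum computed over primitives vs boxed Integers.
public class PrimitiveVsBoxed {
    public static long sumPrimitive(int[] xs) {
        long total = 0;
        for (int x : xs) total += x;       // no unboxing, contiguous memory
        return total;
    }

    public static long sumBoxed(List<Integer> xs) {
        long total = 0;
        for (Integer x : xs) total += x;   // unboxes each element
        return total;
    }
}
```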
800+ G4 linux machines (HP). 4-8 core / 16G RAM each. 32-bit OS
Later consolidated to
400+ G7 linux machines. 32 core / 144G RAM each. 64-bit OS
Roughly 6 of these machines are dedicated Coherence machines.
Database is not part of this market data and real time risk engine.
I think this was a CS interview…
Used for 1) back-testing, 2) trading signal generation, even if no real order sent, and 3) algorithmic real-$ trading.
First target use is back-testing. Users would also try 1 → 2 → 3 in that sequence.
Stores FX, FI and equity market data. Therefore the system treats everything as generic tick data, either quote ticks or trade ticks.
Multiple TB of data in memory. Time-series (not SQL) database.
Built with C++ and Java. C++ probably due to sockets.
A real, practical challenge in a low-latency market-data system is to quickly find out "what is the system doing". Log files usually have a lot of detail, but we also want to know what files/sockets our process is accessing, and what kind of data it is reading and writing.
truss -r (or -w) can reveal the actual data transferred (??)
if a write() syscall is stuck, then perhaps the disk is full.
lsof reads /proc to get open sockets/files
roughly ranked in terms of interviewer’s emphasis
- OOM and GC. memory conservation.
- Avoid dupe objects.
- avoid keeping large order state object
- avoid recursion
- JVM tuning beyond GC tuning
- avoid network like distributed cache. Favor single-JVM designs
- FIX payload size reduction, such as encoding and compression
- multicast. RV is more stable than the newly invented multicast-JMS
- MOM message size control
- !! Peer-to-peer messaging eliminates message brokers and daemon processes
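One way to act on the "OOM and GC / avoid dupe objects" emphasis above is a tiny free-list pool. A hedged sketch (the Order type and all names are hypothetical): the hot path allocates only when the pool is empty, and released objects are scrubbed before reuse, so the GC sees fewer short-lived objects.

```java
import java.util.ArrayDeque;

// Sketch: reuse mutable objects via a small free-list instead of allocating
// a fresh object per message.
public class OrderPool {
    public static final class Order {
        long id;
        double price;
        int qty;
        void reset() { id = 0; price = 0.0; qty = 0; }
    }

    private final ArrayDeque<Order> free = new ArrayDeque<>();

    public Order acquire() {
        Order o = free.poll();
        return (o != null) ? o : new Order(); // allocate only when pool is empty
    }

    public void release(Order o) {
        o.reset();     // scrub state before reuse
        free.push(o);
    }
}
```

Note this is a single-threaded sketch; a real pool on a multi-threaded path would need a concurrent deque or per-thread pools.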
http://www.sun.com/solutions/documents/pdf/fn_lowlatency.pdf — diagrams, brevity.
JRockit real time — http://www.oracle.com/appserver/docs/low-latency-capital-markets-whitepaper.pdf
#1: squid cache
#2: a single mysql master replicating to multiple slave machines
exactly as advised in [[ programming php ]]. Other interesting tips:
#9 job queue for mysql operations
#1: async — between every pair of client-server
#2: load balancers
#3: request queue. Pipeline is a DB-based request queue at the most congested interface
# DB table shrinking
query cache@@ applicable for doris
conn pool@@ perhaps applicable
Q: request wait-queuing (toilet queue)? I know weblogic can configure the toilet queue
A: keep the queue entries small. we only keep object id while the objects are serialized to disk (?!)
Q: is 1kB too large?
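The answer above (tiny queue entries in memory, payload serialized to disk) might look like this sketch (file layout and all names are my invention):

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.concurrent.LinkedBlockingQueue;

// Sketch: the in-memory queue holds only small ids; the bulky payload is
// spilled to a spool directory and reloaded on dequeue.
public class SmallEntryQueue {
    private final LinkedBlockingQueue<Long> ids = new LinkedBlockingQueue<>();
    private final Path spoolDir;

    public SmallEntryQueue(Path spoolDir) {
        this.spoolDir = spoolDir;
    }

    public void enqueue(long id, byte[] payload) throws IOException {
        Files.write(spoolDir.resolve(id + ".bin"), payload); // spill to disk
        ids.add(id);                                         // queue entry stays tiny
    }

    public byte[] dequeue() throws Exception {
        long id = ids.take();                 // blocks until an id is available
        Path f = spoolDir.resolve(id + ".bin");
        byte[] payload = Files.readAllBytes(f);
        Files.delete(f);
        return payload;
    }
}
```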
q: most common cause of perf issue?
A: mem leak. still present after regression test
q: jvm tuning?
A: yes important, esp mem related
q: regression test?
q: perf tools?
a: no tools. primarily based on logs, e.g. track a long-running transaction and compute the duration between SOAP transaction start and end.
Q: web services?
A: Many of the transactions are based on soap, axis. TCP monitor
can help with your perf investigation.
A: yes we use two phase commits. Too many transactions involved.
really complex biz logic. Solution is async.
A: handled by weblogic.
Q: how is the async and queue implemented?
A: weblogic-mq with persistent store, crash-proof
–cache: model manager
telecom: circuits don’t change that often
–async with queues
telecom: request volume
–loose coupling, with async, queues and stateless
separate jvm for web tier, for slsb, for dispatchers and workers
configurable worker threads, on multiple machines, with multi-processor cores
configurable worker pool
stop runaway threads
clustering on web tier
clustering btw dispatcher instances
clustering transcoder slsb
Q: processing power?
A: The E10k is old, with roughly 400MHz CPUs. The T2000 has more processing power.
Q: form factor?
A: E10k takes a full rack, with up to 16 boards, each holding up to 4 processors
Q: any special skill required to administer e10k or above?
A: not much different from mid-range sparc systems