malloc()performance #tips #CSY


  • stack allocation is much faster than heap allocation but you may not notice the difference
  • custom heap allocator to replace malloc() can be as fast as stack allocation
  • Many shops like FaceBook create custom allocators because standard malloc is too slow

Why is malloc so slow? Many online commentators point their fingers at the complexity of heap memory (and free list) management.

I don’t think system call is the issue. System calls are very infrequent with malloc.


10 μs additional latency: collocated eq exec engine

  • 13 microsec in collocated eq exec engine
  • 150 microsec “single-trip” latency in similar software outside collocation site, measured by Corvil, from A to B
    • Time A: FIX msg coming into our engine
    • Time B: FIX msg going out of our engine
    • 150 μs is median, not average
    • Corvial is (most likely) a TCP network sniffer with FIX parser so it can track a single order flow
  • 2 millis in a “regular” build

Treasury trading doesn’t need such low latency.

L1/L2/L3 latency stats@2012 processor

I said “1000 times” when a GS interviewer asked for an estimate of relative latency of main memory vs register. He said that’s about right.

Numbers below were taken from the CPU Cache Flushing Fallacy blog post, which indicates that for a particular 2012-era Intel processor, the following is true:

  • register access = 4 instructions per cycle
  • L1 latency = 3 cycles (12 x register)
  • L2 latency = 12 cycles (4 x L1, 48 x register)
  • L3 latency = 38 cycles (3 x L2, 12 x L1, 144 x register)
  • MM Latency= 195 cycles  (5 x L3, 15 x L2, 60 x L1, 720 x register) = 65 ns on a 3 GHz CPU

translation lookaside buffer #stats

  • a.k.a Address Translation Cache. The TLB lets the processor very quickly convert virtual addresses to physical addresses.
  • TLB is a cache of the big, slow page table
  • A typical entry in TLB is a pair of {virtual -> physical addresses}. In contrast,
  • A typical entry in a L1 cache is mapping of {physical address -> payload}.
  • You can hit both caches!
  • Both caches sits between processor and main memory
  • each hardware system has one or more TLBs
  • TLB-miss can be handled in hardware or kernel
  • typical miss probability — 0.01% to 1%
  • typical miss latency (penalty) — 10 to 100 clock cycles to read the page table
  • typical hit latency: 0.5 to 1 clock cycle

If a TLB hit takes 1 clock cycle, a miss takes 30 clock cycles, and the miss rate is 1%, the effective memory cycle rate is an average of 1 × 0.99 + (1 + 30) × 0.01 = 1.30 i.e. 1.30 clock cycles per memory access.

sharedMem in low latency systems

Hi Anthony,

Is shared-memory a popular messaging solution in low-latency trading?

I know some high-volume data processing engines (like Ab Initio) favor

shared memory as the fastest IPC solution.

However, I feel in low latency trading, messaging (like tibrv, 29west,

Solace) is more popular. For a trading engine, shared memory IPC can

be the basis of messaging between processes on the same machine, but

not across different machines.

Do your system use shared memory?

If interested, you can check out

hidden cost, stolen CPU cycles – latency engineering

Latency /engineering/ and optimization is all about the implicit operations, hidden costs and “stolen” cpu cycles. Incur the minimum CPU costs for a task.

  • eg (practical!): function A calling B, calling C is one more stack frame than A-calling-C
  • eg: boxing/unboxing — extra work for cpu. Also creates garbage for the GC.
  • eg: one thread switching between multiple sockets — more cpu workload than one thread dedicated to one (exchange) socket
  • eg: un-contended lock acquisition — more work than no-lock, partly due to memory fence
  • eg: garbage collection – competes for CPU at a bad time. Usually, if there’s no OOM, then the GC thread is very low priority and won’t slow down a critical task.
  • eg: page swap as part of virtual memory systems — competes for CPU
  • eg: vtbl lookup — adds a few clock cycles per function call. To be avoided inside the most critical apps in an \
  • exchange. Therefore c developers favor templates than virtuals
  • eg: RTTI — latency sensitive apps generally disable RTTI early on — during compilation
  • eg: null terminator at end of c-strings — adds network traffic by x%
  • eg: inline – trims cpu cycles.
  • eg: one kernel thread mapping to multiple user threads — fastest system should create no more user threads than the maximum cpu (or kernel) threads, so the thread scheduler doesn’t need to run at all. I feel this is possible only in a dedicated machine, but such a machine must then communicate with peripheral machines, involving serialization and network latency.

For a dedicated kernel thread to service a busy stream of tasks, we need to consider what if the tasks come in bursts so the thread becomes temporarily idle. One idea is to suspend the thread in wait() but i believe kernel thread can’t be suspended. In the kernel, a common approach is to simply keep the thread in spinlock. Assumption is, one cpu is exclusively dedicated to this thread, so this cpu can’t do anything else even if we suspend this thread.

10 (random) arch features of HFT

When talking to low-latency shops, i realize the focus shifts from pricing, trade booking, position mgmt … to market data, message formatting and sockets – rather low-level stuff. A high-frequency trading engine has many special features at architectural and impl levels, but here i will focus on some important architectural features that make a difference. By the way, my current system happens to show many of these features.

1) message-driven, often using RV or derivatives. Most trading signals come in as market data, tick data, benchmark shifts, position adjustments (by other traders of own own bank). Among these, I feel market data poses the biggest challenge from the latency perspective.
2) huge (reluctantly distributed – see other post) cache to minimize database access
) judicious use of async and sync IPC, if one-big-machine is undesirable.
3) optimized socket layer, often in C rather than c++. No object-orientation needed here:)
) server collocation
) large number of small orders to enable fine-grained timing/cancel and avoid disrupting market
) market data gateway instantiates a large number of small objects
) smart order router, since an order can often execute on multiple liquidity venues

Beyond the key features, I guess there’s often a requirement to immediately change a parameter in the runtime rather than updating a database and waiting for the change to be noticed by the runtime. I feel messaging is one option, and RMI/JMX is another.

##low latency study topics

ranked in terms of interviewer’s emphasis

* OOM and GC
* memory conservation. Avoid dupe objects
* avoid keeping large order state object
* parallelism
* avoid recursion
* FIX package size reduction, such as encoding and compression
* avoid network — distributed cache? No. Favor single-JVM designs
* minimize garbage collection overhead
* multicast. RV is more stable than the newly invented muktucast-JMS
* Peer-to-peer messaging eliminates message brokers and daemon processes
* JVM tuning
* message size control — diagrams, brevity.

JRockit real time —