4 components@latency

  1. Propagation delay — Amount of time required for a message to travel from the sender to receiver, which is a function of distance.
  2. Transmission delay — Amount of time required to push all the packet’s bits into the link, which is a function of the packet’s length and the data rate of the link, and has nothing to do with distance.
  3. Processing delay — Amount of time required to process the packet header, check for bit-level errors, and determine the packet’s destination. Personally, I would think application logic also introduces processing delay.
  4. Queuing delay —  Amount of time the packet is waiting in the queue until it can be processed. Personally, I would guess that a larger buffer tends to exacerbate queuing delay. I would say there are often multiple queuing points along the critical path, like an airport.

From https://hpbn.co/primer-on-latency-and-bandwidth/.

 


keep central data structure+loop in cache

[[art of unix programming]] P290/291 advocates (presumably based on experience) bending over backward to ensure the central data structure + time-critical loop NEVER fall out of cache.

I think this is vague but a useful sound bite to impress interviewers.

Developers need to know the sizes of the L1, L2 and L3 caches on our hardware, then target the central data structure at one of them — say the L2 cache. We want to size our data structure to fit in there and never break that boundary.

An interesting illustration given on the same page — pre-computing a data table is designed to save CPU time, but can backfire if the increased data footprint leads to cache misses. In that case it may be faster to recompute each time.

The same page also touches on loop unrolling, which is designed to speed up execution but may backfire when the increased code size overflows the instruction cache.

1)template code bloat 2)inline..hurt i-cache@@

Q: does template code bloat increase instruction cache pressure and increase risk of cache thrashing?
A: I have seen roughly 8 forum mentions of template code bloat increasing i-cache pressure, but no clear evidence and no consensus among authorities. In fact, no known authority has confirmed this issue.

However, I would think having many similar functions in the executable will likely lead to unnecessary competition for i-cache.

In contrast, for inlining there is consensus. Wikipedia states that excess inlining will hurt speed, due to inlined code consuming too much of the instruction cache: “Inlining imposes a cost on performance, due to the code expansion (due to duplication) hurting instruction cache performance. This is most significant if, prior to expansion, the working set of the program (or a hot section of code) fit in one level of the memory hierarchy (e.g., L1 cache), but after expansion it no longer fits, resulting in frequent cache misses at that level.”

See also inline: footprint+perf can Backfire ! #Google

kernel bypass : possible usage ] RTS

Partially hypothetical usage scenario/proposal.

“Bypass” means bypassing the standard kernel functions and using faster, lighter firmware instead.

In other words, every network packet goes straight from the NIC to the user application, without passing through the TCP/IP stack in the kernel.

Background — Traditional packet processing goes through the TCP/IP software stack, implemented as a family of kernel functions. Whenever a network packet is received, the NIC writes the packet to a ring buffer and raises a hardware interrupt. The interrupt handler routine and the bottom-half then perform packet processing in the kernel socket buffer, and finally copy the payload to a UserModeBuffer.

Note the two separate buffers. In our parser config file, we configure them as sock_recv_buf vs read_buf. The former is accessible by kernel only and is not used when we turn on kernel bypass.

In contrast, with kernel bypass,

  • the network card (NIC) has an FPGA chip, which contains the low-level packet-processing software (actually firmware “burned” into the FPGA)
  • This firmware replaces the TCP/IP kernel functions and delivers the packets directly to the application. However, my parser relies more on another feature —
  • The SolarFlare firmware also lets my parser (a user application) access the NIC ring buffer directly. This zero-copy technique bypasses the socket receive buffer in the kernel.

My parser uses SolarFlare NIC for both multicast and tcp.

The kernel-bypass API was only used in some low-level modules of the framework; it is disabled by default and configurable for each connection defined in the configuration file.

http://jijithchandran.blogspot.com/2014/05/solarflare-multicast-and-kernel-bypass.html is relevant.

SDI: DoS-guard #Nsdq

Q: Design an API Rate Limiter (e.g. for Firebase or Github)

You are expected to develop a Rate Limiter service that can:

  • Limit the number of requests an entity can send to an API within a time window e.g., 15 requests per second.
  • The rate limiting should work for a distributed setup, as the APIs are accessible through a cluster of servers.

(A similar question was asked at Nsdq… )

Q2: how does your cluster of cache servers detect that a given IP on the Internet is sending requests too frequently, causing denial of service? How do you protect yourself?

Q2b: After you blacklist a client IP, it goes quiet, then it sends a single request again. How do you decide whether to ignore the request?

Q2c: what algorithm decides whether a client IP has a legitimate need to send lots of requests, versus another client IP engaging in a denial-of-service attack?

Q2d: what about a distributed DoS attack?

https://en.wikipedia.org/wiki/Denial-of-service_attack#Defense_techniques has practical solutions.
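
For the basic per-entity limit (e.g. 15 requests per second), here is a minimal single-node token-bucket sketch; class name and parameters are illustrative, and a distributed setup would keep the counters in a shared store rather than process memory.

#include <algorithm>
#include <chrono>

// Minimal single-node token bucket: refills at "rate" tokens/second, allows
// bursts up to "capacity". A distributed limiter would keep these counters in
// a shared store; this sketch only shows the core algorithm.
class TokenBucket {
    using Clock = std::chrono::steady_clock;
    double rate_;
    double capacity_;
    double tokens_;
    Clock::time_point last_;
public:
    TokenBucket(double rate, double capacity)
        : rate_(rate), capacity_(capacity), tokens_(capacity), last_(Clock::now()) {}

    bool allow() {                                    // true if the request may proceed
        Clock::time_point now = Clock::now();
        double elapsed = std::chrono::duration<double>(now - last_).count();
        last_ = now;
        tokens_ = std::min(capacity_, tokens_ + elapsed * rate_);   // refill
        if (tokens_ >= 1.0) { tokens_ -= 1.0; return true; }
        return false;                                 // over the limit: reject or queue
    }
};
// usage: TokenBucket perClient(15.0, 15.0);  if (perClient.allow()) { /* serve it */ }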

inline perf can Backfire ! #Google

As illustrated below, without inlining, the instruction-cache system could hit a change of flow (COF) twice as execution enters and exits your function aa(). If aa() is actually inlined and embedded in a hostFunc, then the instruction-cache system can often load the entire hostFunc, eliminating the COF. This helps the instruction cache, but excessive inlining can increase the executable footprint (code bloat).

The Google C++ style guide points out that

  • inline can either increase or decrease (for tiny functions) executable footprint. In general, smaller footprint improves running time due to instruction cache efficiency
  • virtual functions are inlined (i.e. defined in class body) primarily for convenience/maintainability, not performance

See also https://www.eetimes.com/document.asp?doc_id=1275470 and https://www.eetimes.com/document.asp?doc_id=1275474

As an example, consider the function call tree flow in Figure 1. Suppose function F2 is linked near function F1, but function F3 is not linked near F1. When function F1 calls F2, it is possible that F2 is already in the cache and that there will be no cache miss. (The likelihood that F2 is in the cache depends on the sizes of F1 and F2, as well as the location of the call inside F1.) In contrast, there will probably be a cache miss when F1 calls F3. Because F3 is located far away from F1 in memory, it is not likely to be in the cache when F1 calls it.

mlock() : low-level syscall to prevent paging ] real-time apps

https://access.redhat.com/documentation/en-us/red_hat_enterprise_linux_for_real_time/7/html/reference_guide/using_mlock_to_avoid_page_io has sample code

See also https://eklitzke.org/mlock-and-mlockall

https://access.redhat.com/documentation/en-US/Red_Hat_Enterprise_MRG/1.3/html/Realtime_Reference_Guide/sect-Realtime_Reference_Guide-Memory_allocation-Using_mlock_to_avoid_memory_faults.html says

If the application is entering a time sensitive region of code, an mlockall call prior to entering, followed by munlockall can reduce paging while in the critical section. Similarly, mlock can be used on a data region that is relatively static or that will grow slowly but needs to be accessed without page faulting.
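
A minimal sketch of that advice (not the Red Hat sample itself); error handling is kept to a bare minimum.

#include <sys/mman.h>
#include <cstdio>

int main() {
    // Pin current and future pages before the time-sensitive region.
    if (mlockall(MCL_CURRENT | MCL_FUTURE) != 0) {
        std::perror("mlockall");
        return 1;
    }
    // ... critical section: resident memory will not be paged out ...

    munlockall();   // allow paging again once the critical section is over
    return 0;
}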

## optimize code for instruction cache: few tips

I don’t see any ground-breaking suggestions. I think only very hot functions (confirmed by oprofile + cachegrind) require such micro-optimization.

I like the function^code based fragmentation framework on https://www.eetimes.com/document.asp?doc_id=1275472 (3 parts)

  • inline: footprint+perf can backfire. Can be classified as embedding
  • use table lookup to replace an “if” ladder — minimize jumps (see the sketch after this list)
  • branching — refactor a lengthy-n-cold (not “hot”) code chunk out to a function, so 99% of the time the instruction cache (esp. the pre-fetch flavor) doesn’t load a big chunk of cold stuff.
    • this is the opposite of embedding !
  • Trim the executable footprint. Reduce code bloat due to inlining and templates?
  • loop unrolling to minimize jumps. I think this is practical and time-honored — at aggressive optimization levels some compilers actually perform loop unrolling!
  • Use array (anything contiguous) instead of linked list or maps??
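
On the table-lookup bullet: a hypothetical sketch where a 256-entry function-pointer table replaces an if/else ladder on a message-type byte. Handler names and message types are made up for illustration.

#include <cstdint>

// Hypothetical handlers; real ones would parse the payload.
void onQuote(const char*) {}
void onTrade(const char*) {}
void onHeartbeat(const char*) {}
void onUnknown(const char*) {}

using Handler = void (*)(const char*);
Handler dispatchTable[256];          // one entry per possible message-type byte

void initDispatch() {
    for (int i = 0; i < 256; ++i) dispatchTable[i] = onUnknown;
    dispatchTable['Q'] = onQuote;
    dispatchTable['T'] = onTrade;
    dispatchTable['H'] = onHeartbeat;
}

inline void dispatch(uint8_t msgType, const char* payload) {
    dispatchTable[msgType](payload); // single indirect call, no branch ladder
}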

https://software.intel.com/en-us/blogs/2014/11/17/split-huge-function-if-called-by-loop-for-best-utilizing-instruction-cache is a 2014 Intel paper — split huge function if it’s invoked in a loop.

 

reinterpret_cast(zero-copy)^memcpy: raw mktData parsing

Raw market data input comes in as an array of unsigned chars. I “reinterpret_cast” it to a pointer-to-TradeMsgStruct before looking up each field inside the struct.

Now I think this is the fastest solution. Zero-cost at runtime.

As an alternative, memcpy is also popular, but it requires a bitwise copy and often requires allocating a temporary variable.
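
A sketch of the two approaches. The struct layout and field names are made up, and #pragma pack is used so the struct matches the wire layout byte for byte.

#include <cstdint>
#include <cstring>

#pragma pack(push, 1)          // no padding: struct layout == wire layout
struct TradeMsg {
    uint64_t seqNum;
    uint32_t symbolId;
    int64_t  price;            // e.g. fixed-point price
    uint32_t quantity;
};
#pragma pack(pop)

// Zero-copy: overlay the struct on the raw bytes and read fields in place.
inline const TradeMsg* viewTrade(const unsigned char* raw) {
    return reinterpret_cast<const TradeMsg*>(raw);
}

// memcpy alternative: one bitwise copy into a temporary, which sidesteps
// alignment and strict-aliasing concerns at the cost of the copy.
inline TradeMsg copyTrade(const unsigned char* raw) {
    TradeMsg tmp;
    std::memcpy(&tmp, raw, sizeof tmp);
    return tmp;
}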

linux tcp buffer^AWS tuning params

—receive buffer configuration
In general, there are two ways to control how large a TCP socket receive buffer can grow on Linux:

  1. You can set SO_RCVBUF via setsockopt() explicitly as the maximum receive buffer size on individual TCP/UDP sockets
  2. Or you can leave it to the operating system and allow it to auto-tune the buffer dynamically, using the global tcp_rmem values as a hint.
  3. … both values are capped by the global limits below.

/proc/sys/net/core/rmem_max — a global hard limit on all sockets (TCP/UDP). I see 256MB on my system. Can you set it to 1GB? I’m not sure, but it’s probably unaffected by the boolean flag below.

/proc/sys/net/ipv4/tcp_rmem — doesn’t override SO_RCVBUF. The max value on the RTS system is again 256MB. The receive buffer for each socket is adjusted by the kernel dynamically, at runtime.

The linux “tcp” manpage explains the relationship.
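
For option 1 above, a minimal sketch of requesting a large receive buffer and reading back what the kernel actually granted; the 64MB figure is just an example, and the grant is capped by rmem_max.

#include <sys/socket.h>
#include <netinet/in.h>
#include <cstdio>

int main() {
    int fd = socket(AF_INET, SOCK_DGRAM, 0);      // e.g. a UDP/multicast socket
    int requested = 64 * 1024 * 1024;             // 64MB
    setsockopt(fd, SOL_SOCKET, SO_RCVBUF, &requested, sizeof requested);

    int granted = 0;
    socklen_t len = sizeof granted;
    getsockopt(fd, SOL_SOCKET, SO_RCVBUF, &granted, &len);
    std::printf("receive buffer granted: %d bytes\n", granted);  // capped by rmem_max
    return 0;
}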

Note a large TCP receive buffer is usually required for high-latency, high-bandwidth, high-volume connections. Low-latency systems should use smaller TCP buffers.

For high-volume multicast connections, you need large receive buffers to guard against data loss — a UDP sender doesn’t obey flow control, so nothing prevents receiver overflow.

—AWS

/proc/sys/net/ipv4/tcp_window_scaling is a boolean configuration, turned on by default. With window scaling on, 1GB is the new limit on AWS. If turned off, the AWS value is constrained by the 16-bit window field in the TCP header — 65,536.

I think this flag affects AWS and not receive buffer size.

  • if turned on, and if the buffer is configured to grow beyond 64KB, then the Ack can advertise an AWS above 65,536.
  • if turned off, then we don’t (?) need a large buffer, since AWS can only be 65,536 or lower.

 

HFT mktData redistribution via MOM #real world

Several low-latency practitioners say MOM is unwelcome due to added latency:

  1. The HSBC hiring manager Brian R was the first to point out to me that MOM adds latency. Their goal is to get the raw (market) data from producer to consumer as quickly as possible, with minimum hops in between.
  2. 29West documentation echoes “Instead of implementing special messaging servers and daemons to receive and re-transmit messages, Ultra Messaging routes messages primarily with the network infrastructure at wire speed. Placing little or nothing in between the sender and receiver is an important and unique design principle of Ultra Messaging.” However, the UM system itself is an additional hop, right? Contrast a design where sender directly sends to receiver via multicast.
  3. Then I found that the RTS systems (not ultra-low-latency) have no middleware between the feed parser and the order-book engine (named Rebus).

However, HFT doesn’t always avoid MOM.

  • P143 [[all about HFT]] published 2010 says an HFT such as Citadel often subscribes to both individual stock exchanges and CTS/CQS [1], and multicasts the market data for internal components of the HFT platform. This design has additional buffers inherently. The first layer receives raw external data via a socket buffer. The 2nd layer components would receive the multicast data via their socket buffers.
  • SCB eFX system is very competitive in latency, with Solarflare NIC + kernel bypass etc. They still use Solace MOM!
  • Lehman’s market data is re-distributed over tibco RV, in FIX format.

[1] one key justification for subscribing to redundant feeds — CTS/CQS may deliver a tick message faster than the direct feed!

unordered_map^map performance: unpredictable

Small halo… the theories and explanations are subject to change.

Many people report that unordered_map can be slower than traditional std::map for small collections. I feel it’s implementation-dependent. Result may NOT apply to java or c#.

https://blog.dubbelboer.com/2012/12/04/lru-cache.html is mostly about lookup speed. It shows that string key size hurts the hash table, while sample size hurts the RBTree.

Many interviewers asked me

  • Q: why, for a small string collection, can an RBTree outperform a hash table?
  • A: String hashing takes time proportional to string size
  • A: Poor hashing => Hash collision => linear search is another potential problem. String hashing is unpredictable compared to integer hashing. No one can guarantee the next sample will be “uniform” according to our hash function.
  • A: rehash

Cache efficiency on the bucket array (array of pointers) is another reported weakness in hash tables. However, I guess a small bucket array can be completely held in cache and each linked list usually holds 1 node only so cache miss is on that 1 node. For a tree, logN nodes are accessed for a lookup and all of them could get cache-miss.

Practical suggestion — it’s easy to swap the two in a given codebase, so just swap and benchmark. Either can be faster. If your data set changes over time, you may need to re-benchmark.
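
A bare-bones benchmark sketch along those lines; key count and key format are arbitrary, and real results depend heavily on compiler, allocator and data set.

#include <chrono>
#include <cstdio>
#include <map>
#include <string>
#include <unordered_map>
#include <vector>

// Times lookups only; numbers vary wildly by setup, which is the point of
// "just swap and benchmark".
template <class MapT>
long long lookupNanos(const MapT& m, const std::vector<std::string>& keys) {
    auto t0 = std::chrono::steady_clock::now();
    long long hits = 0;
    for (const auto& k : keys) hits += m.count(k);
    auto t1 = std::chrono::steady_clock::now();
    std::printf("hits=%lld ", hits);   // keep the loop from being optimized away
    return std::chrono::duration_cast<std::chrono::nanoseconds>(t1 - t0).count();
}

int main() {
    std::vector<std::string> keys;
    for (int i = 0; i < 1000; ++i) keys.push_back("symbol_" + std::to_string(i));

    std::map<std::string, int> rbTree;
    std::unordered_map<std::string, int> hashTable;
    for (const auto& k : keys) { rbTree[k] = 1; hashTable[k] = 1; }

    std::printf("map: %lld ns\n", lookupNanos(rbTree, keys));
    std::printf("unordered_map: %lld ns\n", lookupNanos(hashTable, keys));
    return 0;
}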

How RTS achieved 400 to 700 KMPS #epoll

Official benchmark result – 700 KMPS per instance of rebus or parser. Here are some enabling features:

  • ! mostly single-threaded apps. Some older parsers are multi-threaded, but each thread effectively runs in single-threaded mode. No shared mutable state :)
    • In the order book replication engine, we use lockfree in one place only
  • ! allocation avoided — on millions of Output message objects. A pre-allocated ring buffer eliminates new() (see the sketch after this list)
  • ! allocation avoided — on millions of Input message objects, thanks to reinterpret_cast() on pointers. Zero-copy. See reinterpret_cast^memcpy in raw mktData parsing
  • ! Socket buffer tuning — to cope with bursts. 64-256MB receive buffer in the kernel; 4MB UserModeBuffer
  • ! epoll — to replace select(), with a two-orders-of-magnitude improvement in throughput
  • ! buffering of Output list (of messages)
  • ! Stateless parser, small footprint. Only downstream component (Rebus) handles cancels/busts, modifications.. So we can run many parser instances per machine, utilizing the cpu cores. See https://haydenjames.io/linux-performance-almost-always-add-swap-space/
  • Very fast duplicate seq check, without large history data — a hot function
  • large containers are pre-sized. No reallocation
  • mostly array-based data structures — cache-friendly
  • Hot functions use pass-by-reference (pbref), RVO or move semantics, minimizing copies and stack allocations
  • aggressively simplified parsing. Minimal logic
  • Special Logger to reduce I/O cost
  • Sharding to speed up symbol lookup
  • kernel bypass : RTS
  • No virtual functions
  • Inline
  • —-Speed with control
  • Best of both to reduce the real risk of our connection becoming slow or lossy
  • Depth monitor to detect missed messages
  • Capacity tracker to monitor mps rates
  • Missed seq reporting
  • [!=one of the really effective techniques]
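
On the pre-allocated ring buffer point: a minimal sketch of the idea. The message fields and capacity are assumptions; the real RTS structure is not shown here.

#include <array>
#include <cstddef>
#include <cstdint>

struct OutMsg {                     // illustrative output-message layout
    uint64_t seqNum;
    char     payload[128];
};

template <std::size_t N>
class MsgRing {
    std::array<OutMsg, N> slots_;   // allocated once up front, reused forever
    std::size_t next_ = 0;
public:
    OutMsg& acquire() {             // hand out the next slot: no new(), no delete
        OutMsg& slot = slots_[next_];
        next_ = (next_ + 1) % N;    // wrap around; old slots get overwritten
        return slot;
    }
};
// usage: static MsgRing<4096> ring;  OutMsg& m = ring.acquire();  m.seqNum = 42;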

##fastest container choices: array of POD #or pre-sized vector

relevant to low-latency market data.

  • raw array is “lean and mean” — the most memory efficient; vector is very close, but we need to avoid reallocation
  • std::array is less popular but should offer similar performance to vector
  • all other containers are slower, with bigger footprint
  • For high-performance, avoid container of node/pointer — Cache affinity loves contiguous memory. After accessing 1st element, then accessing 2nd element is likely a cache-hit
    • set/map, linked list suffer the same

Sliding-^Advertised- window size

https://networklessons.com/cisco/ccnp-route/tcp-window-size-scaling/ has real life illustration using wireshark.

https://www.ibm.com/support/knowledgecenter/en/SSGSG7_7.1.0/com.ibm.itsm.perf.doc/c_network_sliding_window.html

https://web.cs.wpi.edu/~rek/Adv_Nets/Spring2002/TCP_SlidingWindows.pdf

  • AWS = amount of free space in the receive buffer
    • This number and the ack seq # are both sent from receiver to sender
  • lastSeqSent and lastSeqAcked are two control variables in the sender process.

Q: how are the two variables updated during transmission?
A: When an Ack for packet #9183 is received, the sender updates its control variable “lastSeqAcked”. It then computes how many more packets to send, ensuring that lastSeqSent – lastSeqAcked < AWS.

  • SWS (sliding window size) = lastSeqSent – lastSeqAcked = amount of transmitted but unacknowledged bytes, is a derived control variable in the sender process, like a fluctuating “inventory level”.
  • SWS is a concept; AWS is a TCP header field
  • receive buffer size — is sometimes incorrectly referred to as the window size
  • “window size” is vague but usually refers to SWS
  • AWS is probably named after the sender’s sliding window.
  • receiver/sender — only these players control the SWS, not the intermediate routers etc.
  • too large — large SWS is always set based on large receive buffer
  • too small — underutilized bandwidth. As explained in linux tcp buffer^AWS tuning, high bandwidth connections should use larger AWS.

Q: how are AWS and SWS related?
A: The sender adjusts lastSeqSent (and SWS) based on the feedback of AWS and lastSeqAcked.
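
A toy illustration of that book-keeping; real TCP counts bytes and also applies a congestion window, and the numbers here are invented.

#include <cstdint>
#include <cstdio>

int main() {
    uint32_t lastSeqAcked = 9183;   // from the latest Ack
    uint32_t lastSeqSent  = 9190;
    uint32_t aws          = 16;     // advertised window carried by the same Ack

    uint32_t sws      = lastSeqSent - lastSeqAcked;  // in flight, unacknowledged
    uint32_t headroom = aws - sws;                   // how much more may be sent

    std::printf("SWS=%u, room to send %u more before the window is full\n", sws, headroom);
    return 0;
}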

 

C for latency^^TPS can use java

I’m 98% confident — low latency favors C/C++ over java [1]. FPGA is _possibly_ even faster.

I’m 80% confident — throughput (in real time data processing) is achievable in C, java, optimized python (Facebook?), optimized php (Yahoo?) or even a batch program. When you need to scale out, Java seems the #1 popular choice as of 2017. Most of the big data solutions seem to put java as the first among equals.

In the “max throughput” context, I believe the critical java code path is optimized to the same efficiency as C. JIT can achieve that. A python and php module can achieve that, perhaps using native extensions.

[1] Actually, java bytecode can run faster than compiled C code (See my other posts such as https://bintanvictor.wordpress.com/2017/03/20/how-might-jvm-beat-cperformance/)

MOM+threading Unwelcome ] low latency@@ #FIX/socket

Piroz told me that trading IT job interviews tend to emphasize multi-threading and MOM. Some use SQL too. I now feel all of these are unwelcome in low latency trading.

A) MOM – see also HFT mktData redistribution via MOM. For order processing, FIX is the standard. FIX can use MOM as a transport, but that is not popular and is unfamiliar to me.

B) threading – Single-Threaded-Mode is generally the fastest in theory and in practice. (I only have a small observed sample size.) I feel the fastest trading engines are STM. No shared mutable state. Nsdq’s new platform (in java) is STM.

MT is OK if the threads don’t compete for resources like CPU, I/O or locks. Compared to STM, most lockfree designs introduce latency such as retries and additional memory barriers. By default, compiler optimization doesn’t need such memory barriers.

C) SQL – as stated elsewhere, flat files are much faster than relational DB. How about in-memory relational DB?

Rebus, the order book engine, is in-memory.

realistic TPS for exchange mkt-data gateway: below 1M/sec

  • Bombay’s BSE infrastructure (including matching engine) can execute 1 mil trades/second. Must maintain order books, therefore NOT stateless.
  • Rebus maintains full order books for each symbol, can handle 300k (in some cases 700k) messages per second per instance. Uses AVL tree, which beat all other data structure in tests.
  • The Nasdaq feed parser (in c++) is stateless and probably single-threaded. Very limited parsing logic compared to Rebus. It once handled 600k message/second per instance
  • Product Gateway is probably a dumb process since we never need to change it. Can handle 550k message/second per instance

I believe TPS throughput, not latency, is the optimization goal. Biggest challenge known to me is the data burst.

Therefore, java GC pause is probably unacceptable. In my hypothesis, after you experience a data surge for a while, you tend to run out of memory and must run GC (like run to the bathroom). But that’s the wrong time to run GC. If the surge continues while you GC runs, then the incoming data would overflow the queue.

Does inline really make a difference@@ Yes

MSVS and g++ debug builds both disable inlining (presumably to ease debugging). The performance difference vis-a-vis a release build is mostly due to this single factor, according to [[optimized c++]].

The same author asserts that inlining is probably the most powerful code optimization.

Pimpl effectively disables inline.

However, c++faq gives many reasons for and against inline.

contiguous memory data structures : page faults+cache miss

Compared to linked data structures, I feel vectors (and array, deque, circular array…) reduce page faults and cache miss.

P124 [[art of debugging]] describes how a simple array is loaded from swap space into a real memory “page”. This is managed by the virtual memory system.

P19 [[optimized c++]] describes that the L1 or L2 cache also loads a vector more efficiently than a linked data structure. For example, each time main memory is read, the adjacent memory cells are loaded together into the L1 cache, so contiguous-memory data structures result in far fewer cache misses, in theory.

A virtual memory page is about 4KB. A cache line is about 64B.

Both page faults and cache misses are costly at runtime. In both cases the “backing store” is much slower, but we have to read from it. These situations are by design and very common, but it pays to reduce their incidence. Even if we reduce page-fault frequency or cache-miss frequency by 10%, it could be helpful.

c++SCB eFX IV#Dmitry

100% QQ type, as defined in https://bintanvictor.wordpress.com/2017/02/15/qqzz-mutual-exclusion-cjava/. I feel many are micro-optimizations with questionable improvement. I wonder how much value such obscure knowledge adds to the team.

Q: Scanning a vector of int (like finding the average or max). Forward iteration vs backward iteration, which one could be faster, considering all possible compiler optimizations.

%%A: forward. Memory read into cpu cache will be in chunks, not one element at a time. Easy for forward iteration. Not sure about backward.

Q: Which one could be fastest:

void f(double arg){…..}
void f(double & arg){….}

%%A: inlining for first but not 2nd?
A: See http://stackoverflow.com/questions/722257/should-i-take-arguments-to-inline-functions-by-reference-or-value esp. the long answer.

Q: Thr1 and Thr2 on 2 CPU’s both update an object s, having 2 fields. Thr1 only updates s.field1. Thr2 only updates s.field2. No interference. No synchronization required. We observe the performance is slower than using one thread to update both fields. Any explanation?
%%A: caching in cpu — likely false sharing: both fields sit on the same cache line, so the two cores keep invalidating each other’s copy of that line

Q: weak_ptr justification, when we have shared_ptr already? I feel [[effModernC++]] has a good chapter on it.

Ashish pointed out in some apps, you could identify a clear risk of circular dependency. Replace with weak_ptr.

Q: given a 2D array arr[10][5], how do you use pointer arithmetic to hit arr[1][5]?

A: Contiguous. see http://stackoverflow.com/questions/7784758/c-c-multidimensional-array-internals. Note this is different from an array of pointers.
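
A small sketch of the idea; note that [1][5] is one past the end of row 1 and shares its address with [2][0], which is the point of the question.

#include <cstdio>

int main() {
    int arr[10][5] = {};            // 50 contiguous ints
    arr[2][0] = 42;

    int* base = &arr[0][0];
    int* p    = base + 1 * 5 + 5;   // element [i][j] lives at offset i*5 + j

    // p lands on the same address as &arr[2][0]; prints "42 1"
    std::printf("%d %d\n", *p, static_cast<int>(p == &arr[2][0]));
    return 0;
}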

Q: what would you see if a TCP socket server has a full queue
%%A: TCP requires handshake, so if server is unable to accept a request the client would know it.
%%A: connection refused?

Q: what STL algorithms did you use?
%%A: foreach(), find(), copy_if(), transform(), reverse(), sort(), replace_if, remove_if

how2gauge TPS capacity@mkt-data engine

Pump in an artificial feed. Increase the input TPS rate until

  1. CPU utilization hits 100%
  2. messages get dropped

The input TPS is the highest acceptable rate, i.e. the “capacity” of this one process.

Note each feed has its own business logic complexity level, so the same software may have 600k TPS capacity for a simple Feed A but only 100k TPS for a complex Feed B.

Also, in my experience the input interface is the bottleneck compared to the output interface. If System X feeds into System Y, then we want to have System X pumping at 50% of Y’s capacity. In fact, we actually monitor the live TPS rate. The difference between that and the capacity is the “headway”.

inline getter methods

  • In C/C++, if you have a simple getter on a private field, you would typically request the compiler to inline it, thus eliminating the function-call run-time overhead (see the sketch after this list).
  • In c#, the “property” getter is often inlined, but there’s no such requirement on the compiler.
  • Java getters can get inlined too, esp. if the field is final.
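
A typical C++ example of the first bullet. With the getter defined in the class body (implicitly inline), the call below can compile down to a plain member load, subject to the compiler’s own judgement.

// Getter defined in the class body is implicitly inline; the compiler can
// replace each call with a direct member load, removing call overhead.
class Order {
    double price_ = 0.0;
public:
    double price() const { return price_; }     // implicitly inline
    void setPrice(double p) { price_ = p; }
};

double mid(const Order& bid, const Order& ask) {
    return (bid.price() + ask.price()) / 2.0;   // no function-call overhead expected
}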

subverting kernel’s resource-allocation – a few eg

[[Linux sys programming]] explains several techniques to subvert OS resource-allocation decisions. Relevant to performance-critical apps.

P275 mlockall() — mlock() : prevent paging ] real time apps

P173 sched_setaffinity(). A strongly cache-sensitive app could benefit from hard affinity (stronger than the default soft affinity), which prevents the kernel scheduler from migrating the process to another processor.
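
A minimal hard-affinity sketch; the CPU number is arbitrary, and on glibc this call is a GNU extension.

#include <sched.h>
#include <cstdio>

int main() {
    cpu_set_t mask;
    CPU_ZERO(&mask);
    CPU_SET(3, &mask);                                   // pin to CPU 3 (example)
    if (sched_setaffinity(0, sizeof mask, &mask) != 0) { // 0 = calling process
        std::perror("sched_setaffinity");
        return 1;
    }
    // ... cache-sensitive work now stays on CPU 3, keeping its caches warm ...
    return 0;
}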

[[The Linux Programmer’s Toolbox]]. A write-only variable will be removed by the compiler’s optimizer, but such a variable could be useful to debugger. I read somewhere that you can mark it volatile — subversive.

Any way to prevent “my” data or instruction leaving the L1/L2 cache?

Any way to stay “permanently” in the driver’s seat, subverting the thread scheduler’s time-slicing?

strace, ltrace, truss, oprofile, gprof – random notes

[[optimizing Linux performance]] has usage examples of ltrace.
I think truss is the oldest and most well-known.
Q: what values do the others add?
truss, strace, ltrace all show function arguments, though pointers to objects will not be “dumped”. (Incidentally, I guess apptrace has a unique feature to dump arguments of struct types.)
strace/ltrace are similar in many ways…
ltrace is designed for shared-library tracing, but can also trace syscalls.
truss is designed for syscalls, but “-u” covers shared libraries.
oprofile — can measure time spent and hit rates on library functions

oprofile – phrasebook

root – privilege required to start/stop the daemon, but the query tools don’t need root

dtrace – comparable. I think these two are the most powerful profilers on solaris/linux.

statistical – results can be partially wrong. Example – call graph.

Per-process – profiling is possible. I think default is system-wide.

CPU – counters (hardware counters). Therefore, low impact on running apps, lower than “attachment” profilers.

userland – or kernel : both can be profiled

recompile – not required. Other profilers require recompiling.

kernel support – must be compiled in.

oprofiled – the daemon. Note there’s no executable named exactly “oprofile”.

[[Optimizing Linux performance]] has detailed usage examples of oprofile. [[linux programmer’s toolbox]] has decent coverage too.

cache-miss in (CPU hogging)hot-function

P245 [[Optimizing Linux Perf]] (2005) points out this “intriguing” scenario. Here are the investigation steps to identify it —

First, we use _oprofile_ to identify the function(s) taking up the most application time. In other words, the process is spending a large portion (I would imagine 5%+) of CPU cycles in this function. However, the cycles could be spent in one lengthy entry or a million quick re-entries. Either way, this would be a hotspot. Then we use oprofile/valgrind(cachegrind)/kcachegrind on the same process, and check if the hot function generates high cache misses.

The punch line – the high cache misses could be the cause of the observed process hogging. I assume the author has experienced the same but I’m not sure how rare or common this scenario is.

Some optimization guy in a HFT shop told me main memory is now treated as IO, so cache miss is treated seriously. http://programmers.stackexchange.com/questions/142864/writing-low-latency-java mentions that “Cache misses are your biggest cost to performance. Use algorithms that are cache friendly.”

By the way, instruction cache miss is worse than data cache miss. My friend Shanyou also said the same.

algorithm efficiency at Google/HFT: random thoughts

I was told Google engineers discuss algorithm efficiency everyday, including lunch time.

I feel latency due to software algorithm might be a small component of overall latency. However, the bigger portions of that latency may be unavoidable – network latency, disk write(?), serialization for transmission(?), … So the only part we could tune might be the software algorithm.

Further, it’s also possible that all the competitors are already using the same tricks to minimize network latency. In that case the competitive advantage is in the software algorithm.

I feel algorithm efficiency could be more permanent and fundamental than threading. If I compare 2 algorithms A1 and A2 and find A2 being 2x A1’s speed, then no matter what threading or hardware solutions I use, A2 still beats A1.

sub-millis OMS arch – last tips from Anthony

I feel ideally you want to confine the entire OMS to one single process, minimizing IPC latency [1]. In practice, however, even for one symbol the OMS is often split into multiple processes or “instances”.

So what’s the IPC of choice? It turns out that in sub-millis trading, MOM messaging is the IPC of choice. I mentioned synchronous call and shared memory, but my veteran friend pointed out messaging performs better in practice.

The main platform-component is one big JVM instance with an internal order lookup cache for order state maintenance.

Multi-queue – if there are 50,001 symbols, there will be 50,001 queues. Once a queue is assigned a given thread T351, it is permanently bound to T351. This is to prevent multiple threads handling events concurrently on the same symbol. Obviously we don’t want 50,001 threads. Therefore, some kind of multiplexing is in place.

[1] Note data parallelism (into multiple processes) is free of IPC and perfectly fine.

achieving sub-millis latency – exchange connectivity

This post is about really achieving it, not “trying to achieve”.

First off, How do you measure or define that latency? I guess from the moment you get one piece of data at one end (first marker) to the time you send it out at the other end (2nd marker). If the destination is far away, then I feel we should use the ack time as the 2nd marker. There are 2 “ends” to the system being measured. The 2 ends are
* the exchange and
* the internal machines processing the data.

There are also 2 types of exchange data — order vs market data (MD = mostly other people’s quotes and last-executed summaries). Both are important to a trader. I feel market data is slightly less critical, though some practitioners would probably point out evidence to the contrary.

Here are some techniques brought up by a veteran in exchange connectivity but not the market data connectivity.
* Most places use c++. For a java implementation, most important technique is memory tuning – GC, object creation.
* avoid MOM? — HFT mktData redistribution via MOM
* avoid DB — local flat file is one way they use for mandatory persistence (execution). 5ms latency is possible. I said each DB stored proc or query takes 30-50 ms minimum. He agreed.
** A market data engineer at 2Sigma also said “majority of data is not in database, it’s in file format”. I guess his volume is too large for DB.
* object pooling — to avoid full GC. I asked if a huge hash table might introduce too much look-up latency. He said the more serious problem is the uncontrollable, unpredictable GC kick-in. He felt hash table look-up is probably constant time, probably pre-sized and never rehashed. Such a cache must not grow indefinitely, so some kind of size control might be needed.
* multi-queue — below 100 queues in each JVM. Each queue takes care of execution on a number of securities. Those securities “belong” to this queue. Same as B2B
* synchronize on security — I believe he means “lock” on the security. A lock must attach to an object, so the security object is typically a simple string object rather than a custom object.
* full GC — GC tuning to reduce full GC, ideally to eliminate it.
* use exchange API. FIX is much slower, but more flexible and standardized. See other posts in this blog

Some additional techniques not specific to exchange connectivity —
$ multicast — all trading platforms use multicast nowadays
$ consolidate network requests — bundle small requests into a large request
$ context switching — avoid
$ dedicated CPU — If a thread is really important, dedicate a cpu to it.
$ shared memory — for IPC
$ OO overhead — avoid. Use C or assembly
$ pre-allocate large array
$ avoid trees, favour arrays and hash tables
$ reflection — avoid
$ concurrent memory allocator (per-thread)
$ minimize thread blocking
** immutable
** data parallelism
** lockfree

##[12] bottlenecks in a high performance data "flow" #abinitio

Bottlenecks:

#1 probably most common — database, both read and write operations. Therefore, ETL solutions achieve superior throughput by taking data processing out of database. ETL uses DB mostly as dumb storage.

  • write – if a database data-sink capacity is too slow, then entire pipe is limited by its throughput, just like sewage.
    • relevant in mkt data and high frequency trading, where every execution must be recorded
  • read – if you must query a DB to enrich or lookup something, this read can be much slower than other parts of the pipe.

#2 (similarly) flat files. Write tends to be faster than database write. (Read is a completely different story.)
* used in high frequency trading
* used in high volume market data storage — Sigma2 for example. So flat file writing is important in industry.
* IDS uses in-memory database + some kind of flat file write-behind for persistence.

#? Web service

#? The above are IO-bound. In contrast, CPU-bound, compute-intensive transforms can (and do) also become bottlenecks.

low latency: C++preferred over java

(Fastest is FPGA, but let’s put that aside.)

Low latency developers generally prefer C and avoid OO (runtime overhead), but they do use C++ templates to achieve power and flexibility.

In terms of data structures, I think they use STL too. Array is essential. In contrast, graph data structures incur additional allocation due to the graph Node objects — no such overhead in arrays.

Java’s issues —
* autoboxing — market data consists mostly of primitives
* Every Object.java instance takes something like 8+ bytes.
* Indeterminate garbage collection
* virtual function overhead
* Even purely local variables are often put into heap for delayed clean-up
* JVM could reach good throughput wrt c++, but only after a slow warm-up.

avoid notifyAll() in low latency apps

notifyAll() is unpopular in low-latency apps. For fear of lost signals, we ask the JVM to do a lot of useless work — we wake up everyone when we just need one of them woken up. But how do we get away with notify()? Here are 2 common conditions that should be satisfied.

* All the threads in your target wait set must be identical, so any one can get up and finish the task.
* All are waiting on the same while() condition.

Imagine some threads are waiting for a blocking queue to become non-empty; other threads are waiting for non-full. If you notify(), the risk is that the scheduler chooses one waiting thread, the chosen one checks its while() condition and sees it is waiting for non-empty, but the signal actually meant no-longer-full, so the signal is ignored and lost. Using the JDK 5 condition variable, you can simplify this kind of situation using multiple conditions bound to the same lock.

c++low latency phone IV #MS-shanghai

I think this is all QQ… i.e theoretical knowledge. Not really harder than the mvea interview. Easier than the CVA interview.

Q: why should we not throw an exception from a dtor? Why do you say sometimes you can break the rule? See throwing dtor: %%justified use cases
%%A: if I have a utility routine that may throw, and I want to use it in my dtor, we should assess the danger. If we know the entire process should crash under this condition, whether this dtor throws the exception or swallows it, then it’s fine to use that routine without try/catch. Even without a double-exception situation, the default behavior is std::terminate().

Q: divide by zero — is that an exception?
%%A: yes in java but UB in c++
AA: UB in c++

Q: is there something you can do in C but can’t in c++?
%%A: I guess you mean “C code uncompilable under c++ compiler”? I don’t think so.
A: “include <stdio>” won’t compile in g++
%%A: in pure-C environment if you receive a pointer at run time and it’s known to point to heap, you can call free() but in a C++ environment, that pointer may be created by new/array-new so free() is problematic.
A: https://isocpp.org/wiki/faq/big-picture#back-compat-with-c hints that if a C source code has no function prototype, then c++ compiler will complain.

Q: at what threshold of latency requirement would you switch from java to c++?
%%A: 100 μs. but I think nowadays java can rival c++.

Q: singleton — how would you implement
%%A: no friend class; declare but not define copier/op=. Provide a static method getInstance

Q: size of an empty c++ class’s instance
%%A: 1 byte (Correct ! https://stackoverflow.com/questions/6552319/c-sizeof-of-a-class-with-functions)

Q: diff between struct and class
Q: can you implement inheritance and polymorphism in C?
%%A: yes. Most c++ features (until exception) used to be converted to C source code.
Q: what’s complete vs partial specialization of template? (Jargon question)
Q: what happens to stack when an exception is thrown?

Q: what’s your reaction when you see “delete this” in a program?
A: After that, need to be careful not to reference anything in the class, so as to avoid dereferencing a dangling pointer.

 

clever use of enum(char?) in demanding c++ app #Miami

class Der: public Base {…..}

Requirement — we need to down cast a Base pointer. dynamic_cast has overhead.

Der* returnVal = static_cast<Der*>(new Base); // will compile but is potentially disastrous, because returnVal can be a partially wild pointer.

Q: without incurring the overhead of vtbl lookup, how do I decide if a pointer to Base given to me actually points to a Base or Der instance?

%%A: type_info still uses vtbl.
%%A: use a Base field to indicate the type. Both ctors will set the field, so the Der ctor overwrites it.
A: enum. If sizeof(aBase) is a tiny 2 bytes, then adding an enum field adds just 1 byte. Adding vptr would add 8 bytes. Big difference if we instantiate billions of this class.

You can also use a char, which is always 1 byte. An enum can be configured to take 1 byte, and have meaningful longer itemized names.

https://github.com/tiger40490/repo1/blob/cpp1/cpp/88miscLang/enumBasedRTTI.cpp has my tested code
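
A minimal sketch of the same idea; names are illustrative, and my tested version is in the repo linked above.

#include <cstdint>

enum class Kind : uint8_t { BASE, DER };    // 1 byte, vs 8 bytes for a vptr

struct Base {
    Kind kind;
    explicit Base(Kind k = Kind::BASE) : kind(k) {}
};

struct Der : Base {
    int extra = 0;
    Der() : Base(Kind::DER) {}
};

inline Der* downcastIfDer(Base* b) {
    // one byte compare instead of a vtbl-based dynamic_cast
    return b->kind == Kind::DER ? static_cast<Der*>(b) : nullptr;
}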

## low latency – key expertise4GTD

I talked to a few big and small low-latency shops. I feel latency optimization is “low-logic, high speed, high throughput”. Below is the expertise required, but in any real system you will see diminishing returns on each aspect, so remember to prioritize the cost-effective areas to optimize. See my post on 80/20 (http://bigblog.tanbin.com/2012/03/8020-rule-dimishing-return.html)

* threading
** parallel lock free threading but sometimes we can’t avoid inter-thread-comm (condVar)
** [L] try every technique to avoid locking (such as immutables)

* [L] low-level memory management — custom allocators, custom (or disabled) garbage collectors
** care and feeding of java GC
** try every way to avoid non-deterministic garbage collection

* in-memory data stores — KDB etc
** This is how secDB compensates for some of the performance drawbacks
** state maintenance in memory — important in many OMS, DMA, exchange connectivity engines.

* try every techniques to avoid RDBMS
** use flat files

* low-latency messaging — 29West, tibrv messaging appliance …
** [H] async MOM is the only choice for high volume systems
** multicast is the de-facto standard

* OS monitoring and optimization
** dtrace
** network utilization
** paging activity
** cpu utilization

* [L] socket programming customized
* avoid ethernet collisions
* [H] avoid choke points, promote multi-lane highways
* [H] connectivity, collocation
* [L] serialization – customized
* [L] powerful hardware, FPGA
* [H] scale out – important
* 64 bit
* realtime java vs C++
[L] = low-level
[H] = high-level architectural feature

no lock no multiplex`no collections/autobox`:mkt-data sys

Locks and condition variables are important to threading and Wall street systems, but the highest performance market data systems don't use those. Many of them don't use java at all.

They dedicate a CPU to a single thread, eliminating context switches. The thread reads a (possibly fixed-size) chunk of data from a single socket and puts the data into a buffer, then it goes back to the socket, non-stop, until there's no more data to read. At that point, the read operation blocks the thread and the exclusive CPU. Subsequent processing of the buffer is asynchronous and on a different thread. This 2nd thread can also get a dedicated CPU.
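
A minimal sketch of that read loop. fd is an already-opened socket, and handOff() stands in for whatever mechanism passes the buffer to the processing thread.

#include <sys/types.h>
#include <sys/socket.h>
#include <cstddef>

// Stand-in for handing the filled buffer to the processing thread,
// e.g. via a pre-allocated ring buffer.
void handOff(const char* /*data*/, std::size_t /*len*/) {}

void readLoop(int fd) {                            // runs on its dedicated thread/CPU
    char buf[64 * 1024];                           // reused for every read
    for (;;) {
        ssize_t n = recv(fd, buf, sizeof buf, 0);  // blocks only when the socket is empty
        if (n <= 0) break;                         // peer closed or error
        handOff(buf, static_cast<std::size_t>(n)); // processing happens elsewhere
    }
}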

This design ensures that the socket is consumed at the highest possible speed. (Can you apply 2 CPUs on this socket? I doubt it.) You may notice that the dedicated CPU is idle when socket is empty, but in the context of a high-volume market data feed that's unlikely or a small price to pay for the throughput.

Large live feeds often require dedicated hardware anyway. Dedication means possible under-utilization.

What if you try to be clever and multiplex a single thread among 2 sockets? Well you can apply only one cpu on any one thread! Throughput is slower.

tips on how to incur just 1 STW GC/day — low-latency java

Big eq trading desks in Citi and other banks claim they incur a single full-GC “penalty” a day in most jvm instances (where an instance is typically a node in a scalable grid). Here are some techniques.

0) “pre-allocate” long-living objects. Probably allocate a large array of primitive objects. Each “primitive object” could be a byteArray of fixed size. No POJO no java beans.
)) don’t let GC do the clean-up; reuse the memory pre-allocated. Reuse by dropping in new data. Circular array of fixed-size byteArray is one of my ideas.

2) avoid promoting objects beyond the nursery/eden — see other posts in this blog.

3) run 64bit and allocate enough heap (typically 10G) to last a day.
)) Each jvm instance presumably serves only US, or only Hongkong…, so it could shutdown nightly.
)) If an instance must run 24×7, then keep it simple and small.
)) If we still need more heap, split to 2 jvm instances.

)) avoid distributed cache and latency. Barc DMA guys also avoided distributed caching.
)) How do you size heap? See post on heap sizing

5) During a scheduled quiet period, trigger a full GC throughout the entire (huge) heap, which could actually take 30 minutes. Just call Runtime.gc(). If this doesn’t trigger GC, then check that the -XX:+DisableExplicitGC flag isn’t set.

personal xp on low latency trading

Thread — lockfree becomes relevant in latency-sensitive contexts
Thread — create opportunity for parallelism, and exploit multi-core
Thread — avoid locks including concurrent data structures and database. Be creative.
Data structures — STL is known to be fairly efficient but can affect performance
Data structures — minimize footprint
Data structures — favor primitive arrays because allocations are fewer and footprint is smaller
Algo — favor iterative over recursive
DB tuning — avoid hitting DB? Often impractical IMO
Serialize — favor primitives over java autoboxing
Serialize — favor in-process transfers, and bypass serialization
Mem — avoid vtbl for large volume of small objects
Mem — reuse objects, and avoid instantiation? Often impractical IMO
Mem — mem virtualization (like gemfire) as alternative to DB. A large topic.
Mem — Xms == Xmx, to avoid dynamic heap expansion
Mem — control size of large caches.
Mem — consider weak reference
MOM — multicast is the clear favorite to cope with current market data volume
MOM — shrink payload. FIX is a primary example.
GC — avoid forced GC, by allocating large enough heap
Socket — dedicate a thread to a single connection and avoid context switching
Socket — UDP can be faster than TCP
Socket — non-blocking can beat blocking if there are too many low-volume connections. See post on select()

low-latency trade execution in a bond mkt-maker – suggestions

(I think dealership trading desks include most bonds, IRS, FX options, FX cash, CDS, OTC stock options…)

After saving the trade into the database, all post-trade messages can be delegated to a task queue for a background thread (like the swing event queue). These messages inform downstream [1] systems, so they can update their GUI [2] and database in real time. These post-trade messages can’t really fail (if they fail we just have to resend). These are one-way messages, so they don’t need to be on the critical path.

[1] could be internal modules of the same trading engine or external systems owned by other teams like commissions:). Either way, such a downstream always runs in a separate process or machine.
[2] Note these GUIs aren’t web pages, but WPF, Swing or similar auto-updating screens.

Q: Must position be updated in the critical path, before we process the next trade?
A: For a dealer, yes I think so. If Trade 1 depletes inventory, then we must reject next trader-sell.

infrastructure features@High Frq eq hedge fund (Millenium

 swing trader station + OMS on the server-side + smart order router over low-latency connectivity layer

* gemfire distributed cache. why not DB? latency too high.
* tibrv is the primary MOM
* between internal systems — FIX based protocol over tibrv, just like Lehman equities. Compare to protobuf object serialization
* there’s more math in risk system; but the highest latency requirements are on the eq front office systems.

y memory footprint hurts latency #my take

See also post on large in-memory search

Reason: the larger the heap size to scan, the slower is the GC (but paradoxically, large heap helps you postpone the GC). The less memory you use, the lower your GC penalty.

reason: favor in-memory data stores rather than disk. In-memory can mean remote or local (best). The smaller your footprint, the easier.
reason: Favor single-VM — serialization-free, network-free. The smaller the easier.
reason: serialization (for IPC, cache replication, network or disk) is slower for larger footprints.
reason: allocation is costly, even for a 32-bit int. Look at shared_ptr and circular arrays.
reason: minimize object creation (i.e. allocation) — standard optimization advice
reason: reduce GC frequency
reason: reduce GC workload ie amount of memory to scan. If you must incur a full GC, then make it short.

reason? Compared to graphs, arrays enjoy 1)random access and 2)one-shot memory allocation, but if each element is bulky[2] then a long array requires too large a contiguous block of memory, which is hard to allocate. The smaller your object, the easier.

[2] Note java array element is usually a 32 bit pointer.

Major technique: favor iterative over recursive algorithms to reduce stack memory footprint

mkt-data favor arrays+fixed width, !! java OBJECTs

(See also post on market data and autoboxing.)

In the post on “size of Object.java” we see every single Object.java instance needs at least 8 bytes of book keeping.

Therefore primitive array (say array of 9 ints) has much lower overhead than Collection of Integer.java
– array takes 4 bytes x 9 + at least 8 bytes of bookkeeping
– collection takes at least sizeof(Integer) x 9 + bookkeeping. Each Integer takes at least 4 bytes of payload + 8 bytes of bookkeeping.

Market-data engines get millions of primitives per second. They must use either c++ or java primitive arrays.

Market data uses lots of ints and chars. For chars, again java String is too wasteful. Most efficient would be C-string without null terminator.

The fastest market data feed is fixed-width, so no bandwidth is wasted on delimiter bits.

Floats are imprecise. I am not an experienced practitioner, but I don’t see any raw market data represented in floating point.

Java objects also increase garbage-collection load. Often an indeterminate full GC hurts transaction FIX engines more than market data FIX engines, but in HFT shops the latter need extreme consistency. An unpredictable pause in market data FIX can shut out an HFT auto-trader for a number of milliseconds, during the most volatile moments such as a market crash.

instrumentation overhead: latency/throughput]low latency app

(Note these overheads primarily hurt latency, but throughput-sensitive apps would also suffer, to a lesser extent I presume. Here’s my Reason — if it takes 20 minutes (or hours) to finish a big batch, 5% longer would not matter that much. However, in low-latency trading a consistent 5% additional latency can mean many lost opportunities. An inconsistent latency is even more prevalent.)

A top engineer in a ultra high speed option exchange told me truss is the #1 instrumentation tool (they run linux). I believe this is for occasional use. truss is intrusive and can slow down an app to HALF its speed, according to Solaris tuning experts.

Dtrace jvm provider has method-entry and method-exit hooks and can add overhead too. Overall, dtrace is less intrusive than truss.

Strace??

— logging —
By default, logging is intrusive and adds latency to the low latency critical-path. We still need real time logging for production support. Solutions? Someone told me ramdisk is one option. Someone told me async logging is available in log4j. 

Given that disk is a block-oriented storage (see other posts on this blog), you can also increase disk buffer and write in bigger blocks. Non-ideal for real time monitoring. Minutes of delay is common.

When quizzed in interviews, I gave some impromptu suggestions —
* main thread puts instrumentation/logging data into a shared data structure in local memory. A background thread reads it asynchronously — non-intrusive. Shared mutable => locking required. (See the sketch after this list.)
* we can also send messages to a logging server. In this case, I feel few-big messages beat many-smaller messages.
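
A minimal sketch of the first suggestion (shared structure plus background drain thread). A production version would use a lock-free or pre-allocated ring rather than a mutex-protected deque; "app.log" is an arbitrary file name.

#include <condition_variable>
#include <deque>
#include <fstream>
#include <mutex>
#include <string>
#include <thread>

// The hot thread only enqueues under a short lock; a background thread does
// the slow file I/O.
class AsyncLogger {
    std::mutex mu_;
    std::condition_variable cv_;
    std::deque<std::string> q_;
    bool stop_ = false;
    std::thread worker_;          // declared last so the members above are ready first
public:
    AsyncLogger() : worker_([this] { drain(); }) {}
    ~AsyncLogger() {
        { std::lock_guard<std::mutex> g(mu_); stop_ = true; }
        cv_.notify_one();
        worker_.join();
    }
    void log(std::string line) {  // called on the critical path
        { std::lock_guard<std::mutex> g(mu_); q_.push_back(std::move(line)); }
        cv_.notify_one();
    }
private:
    void drain() {                // background thread: all slow I/O happens here
        std::ofstream out("app.log", std::ios::app);
        std::unique_lock<std::mutex> lk(mu_);
        while (!stop_ || !q_.empty()) {
            cv_.wait(lk, [this] { return stop_ || !q_.empty(); });
            while (!q_.empty()) {
                std::string line = std::move(q_.front());
                q_.pop_front();
                lk.unlock();      // write outside the lock
                out << line << '\n';
                lk.lock();
            }
        }
    }
};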

tibrv latency tips – from tibrv documentation #logging

1) offload to worker thread — When inbound messages require lengthy processing, we recommend shifting the processing load asynchronously. Quickly extract data from the message, and process it in another thread.

* compare EDT/swingWorker

In a ML muni trading engine, we designed several “grabber” listeners
– grab messages and route them to different processing queues, based on hashcode.
– grab messages, append a short code to the subject, then republish to the same queue. Not sure if this works.

2) reduce message size — Avoid XML. RV now supports integer field identifiers similar to FIX. 16-bits and much smaller than String field names.

3) reduce logging — I always felt logging hurts performance. Now confirmed in Tibrv manual! When logging is required for monitoring or auditing, shift the I/O burden to another computer to log messages without introducing a time penalty

Tick data repository – real-time/historical mkt-data

I think this was a CS interview…

Used for 1) back-testing, 2) trading signal generation, even if no real order sent, and 3) algorithmic real-$ trading.

First target user is back testing. Users would also progress through 1 -> 2 -> 3 in that sequence.

Stores FX, FI and equity market data. Therefore the system treats everything just as generic tick data, either quote ticks or trade ticks.

Multiple TB of data in memory. Time-series (not SQL) database.

Created using c++ and java. Probably due to sockets.

mkt-data engines hate pre-java8 autoboxing

(See also post on market data and size of Object)

High-volume market data systems deal with primitives like ints, chars and bools (bits). These often need to go into collections, like a vector of int. These systems avoid runtime memory allocation like the plague.

Specifically, Java collections auto-boxing results in excessive memory allocation — a new object for every primitive item. A major justification for c++ STL, according to a CS veteran.

Sun also says autoboxing is inappropriate for high performance.

C# generic collections are free of autoboxing, just like the STL; so are the primitive streams in Java 8.

C-style programming]java: can be legit

A friend told me it doesn’t make sense to write C-style programs in java, but I believe some of the fastest high volume trading engines
– use very few objects (perhaps some Object.java instances or array objects), but lots of primitives
– minimize garbage collector to almost never
– use no java collections, but arrays (albeit objects)
– use many dedicated threads but not always thread pools
– use almost no locks among the threads

So why java instead of C? I feel in such domains, C is indeed the language of choice, but I guess java offers

* platform independence — (but do we need it, when we use JIT compiling?). Truly compile-once, run anywhere.
* GC — (occasional use), since C memory management can be tricky.
* threading — the superior threading constructs in java
* off-the-shelf modules — many are available to java only, though less common in low-latency.
* pointers — java eliminates all (problematic) access to raw pointers
* platform-specific coding — java prevents programmers from resorting to platform-specific features. For any powerful/resourceful developer, if there exists a shortcut, it will be exploited someday — human nature
* without pointers and complex type declarations, programming is more readable, easier and less error prone
* the option to occasionally use some of the many java features, like the ones below. Note in a superfast engine, not all parts are equally latency-sensitive, so those “easy” parts can (and probably should) be written in a higher-level language. C won’t give you that option.
** RMI
** collections
** NIO

10 coding tips of high-frequency exchange connectivity

[c] use enum/char rather than vptr (Miami
[c] hand-crafted hash table, which is often more optimized for our specific requirement (Barcap
[c] Fixed-width, padded char-array without null terminator (Miami
[c] favor Set to Map
[c] hand-crafted reference counted data structure (Miami
[c] monolithic network-friendly wire format STRUCT for high volume data transfer. Nothing fancy.
[c] java finalizers slowing down garbage collection
[c] favor arrays to object graphs (UBS
[c] avoid boxing and java collections. Favor primitives. (CS
specialized operator-new. See [[eff c++]]
unions

—-
Note — all of these are considered optimizations on memory. That’s where C++ outsmarts java.

[c=according to insiders]

max throughput vs max concurrency

I now feel this is a bit too academic. In practice, I feel concurrency is a technique to achieve throughput.

“max concurrency” is a system feature and probably means “full exploitation of processor and threading capacity”. I know a single SUN processor can have 4 cores and 32 concurrent kernel threads. In such a system, it’s also possible to create user-level threads so total concurrent threads can far exceed 32. I would think thousands are possible.

“max throughput” is a tangible benefit. I think it’s measured by the max number of completed requests per second. The highest throughput is usually achieved in latency-insensitive batch systems rather than synchronous or asynchronous request/response. The fastest way to transfer billions of terabytes of data is to load flash drives on trucks and ship to the receiving site – batch mode.

Max throughput requires finding out bottlenecks. Throughput of a given system is the throughput of the “narrowest” or “slowest” link on the critical path. Adding concurrency to the bottleneck spots (not other spots) improves throughput. For a realistic system, people would buy the max amount of memory they can afford to avoid hitting disk, put in a lot of processors, and somehow multiplex I/O. In low-latency systems, network I/O is often the biggest latency contributor. People use java NIO and exchange collocation, among other things.

Other concurrency techniques include partitioned table, parallel query, disk array, grid and cloud computing.

By “narrowest” I mean the high-way. The more lanes, the better the throughput. A 32-thread processor has 32 lanes. Each thread has a capacity limited by clock speed and software synchronization – that’s what I mean by “slowest”.

y memory footprint is key to latency#c,java

see also post on large in-memory search

A large allocation request means the free-list search is harder and slower.

I thought in low latency, time costs outweigh space costs, but no. Garbage collection is a major performance issue in low latency systems. I guess so is object creation. I guess that’s why memory efficiency affects latency. GC probably takes less CPU time if there’s less stuff to scan.

distributed cache isn’t much used in low latency systems, perhaps because
* serialization
* network latency

DMA low latency interview (European bank Midtown

Q: Given a 32-core machine and a lot of CPU-bound calculations, how would you size your thread pool?
A: First, find out how many kernel threads — probably 32 * 2 or 4. I feel if tasks are IO-intensive, like waiting for a pricing or risk assessment from JMS, then a lot of threads would be waiting for IO. If each thread spends 90% of its time waiting for IO, then each CPU should handle roughly 10 threads. This way, each thread, after getting the data it needs, won’t need to wait for a CPU.

The below questions were probably not prepared in advance.

Q1: for a one-direction linked list of size N, efficiently find the item at Position floor(N/2). Positions are 1 to N. You don’t know the size until you finish scanning.

Q2: for a strictly sorted binary tree of numbers, I give you a pointer to the root node and 2 arbitrary numbers in the tree; find the lowest common ancestor node. The only entry point is the root node address. The 2 input numbers are passed by value and guaranteed to exist in the tree. Note for this binary tree, any node on my left is less than me.

Q3: efficiently reshuffle a PokerCard[52] array. You can call a rand(int max) function to get a random number between 0 and max.
Tip: avoid “retry” as that wastes CPU
A: I think we need to call rand(51), rand(50), rand(49)….
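
That answer is the Fisher-Yates shuffle; a sketch below, with std::rand() standing in for the rand(max) described in the question and int standing in for PokerCard.

#include <cstdlib>

// Stand-in for the rand(max) in the question: an integer in [0, max].
int randUpTo(int max) { return std::rand() % (max + 1); }

void shuffle(int cards[52]) {           // int stands in for PokerCard
    for (int i = 51; i > 0; --i) {
        int j = randUpTo(i);            // rand(51), rand(50), ..., rand(1)
        int tmp = cards[i];             // swap cards[i] and cards[j]
        cards[i] = cards[j];
        cards[j] = tmp;
    }
}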

Treasury low-latency trading(Barc

(Background: most low-latency trading systems are related to eq or FX…)

Bloomberg and tradeweb allow an institutional client to send an RFQ to her favourite Treasury dealers. The dealer’s auto-quoter is typically low-latency. Typical latency [from receiving the RFQ at the gateway, to sending out the quote from the gateway] is 100~200ms. 500ms would be too high.

For an interdealer broker (espeed/brokertec), 30ms is good. 60ms will arouse complaints from hedge fund clients. Perhaps 30ms from receiving order to sending out acknowledgement? Distributed cache? No need. Just a regular database cache is enough. Coherence is being considered though.

Order matching engine is the heart of a treasury ECN operator. An ECN for muni bond or corporate bond may have a very different algorithm.

c++ vs realtime JVM

Java might outperform c++ sometimes, but java performance is not consistent or controllable, due to GC jitter — the #1 motivation behind the real-time JVM. The IBM JVM limits GC frequency, so the GC struggles to keep up with demand and the heap gets bigger than in the Sun JVM. The Sun JVM requires some code changes. Some people feel that to benefit from a real-time JVM you must code very carefully… so why not use c++? C++ is the incumbent in low-latency systems.

Market-facing gateways — order execution and market data feed — still use c++ primarily for another reason – gateway API is often in c++.

perf techniques in T J W’s project–ws,mq,tx

Q: request wait-queuing (toilet queue)? I know weblogic can configure the toilet queue
A: keep the queue entries small. we only keep object id while the objects are serialized to disk (?!)

Q: is 1kB too large?
A: no

q: most common cause of perf issue?
A: mem leak. still present after regression test

q: jvm tuning?
A: yes important, esp mem related

q: regression test?
a: important

q: perf tools?
a: no tools. primarily based on logs. e.g. track a long-running transaction and compute the duration between soap transaction start and end.

Q: web services?
A: Many of the transactions are based on soap, axis. TCP monitor
(http://ws.apache.org/axis/java/user-guide.html#AppendixUsingTheAxisTCPMonitorTcpmon)
can help with your perf investigation.

Q: tx?
A: yes we use two phase commits. Too many transactions involved.
really complex biz logic. Solution is async.

Q: multi-threaded?
A: handled by weblogic.

Q: how is the async and queue implemented?
A: weblogic-mq with persistent store, crash-proof