average_case ^ amortized

>> bigO (and big-Theta) is about input SIZE; therefore it is usable with average-case or worst-case data quality

There’s no special notation like bigX specifically for average case.

>> Worst vs average case … is about data quality.

With hash tables, we (novices) are permitted to say “quality of hash function” but this is technically incorrect.  My book [[algorithms in a nutshell]] P93 states that IFF given a uniform hash function, then even in the worst case bucket sort runs in linear time O(N).

>> Amortized … is about one expensive step among a large number (≅ input size) of steps like inserts. Amortized analysis makes no assumption about the quality of data, and is therefore a worst-case analysis.

Note: sorting has no amortized analysis, AFAIK.

eg: quicksort+quickSelect can degrade to O(N²). Amortization doesn’t apply.

worst case amortized — is a standard analysis
Eg: vector expansion. Worst-case doesn’t bother us.

— statistical-average amortized analysis — is a standard analysis
eg: Hashtable insert is O(1) based on both statistical average AND amortized analysis :

  • Some inserts will hit a long chain, while most insertions hit a short chain, so O(1) on average, statistically
  • Amortization is relevant iFF an insert triggers a hash table expansion.

What if the input is so bad that all hash codes are identical? Then amortization won’t help. For hash tables, people don’t worry about worst-case data, because a decent, usable hash function is always, always possible.

eg: iterating a RBTree. Data quality doesn’t bother us as RBTree has guaranteed time complexity :). Traversing entire tree is O(N). Therefore statistical average per-iteration is O(1). Worst per-iteration is O(log N)

http://www.cs.cornell.edu/courses/cs312/2006sp/lectures/lec18.html says “amortized analysis is a worst case analysis”, where “worst” refers to data quality

average case per-operation — can be quite pessimistic
eg: vector insert

eg: RBTree iteration (see above)

worst case per-operation — not common, relevant to realtime systems only.

I feel this analysis easily hits the extreme far-outliers, so the per-operation cost is far worse than typical.

q[static thread_local ] in %%production code

static thread_local std::vector<std::vector<std::string>> allFills; // I have this in my CRAB codebase, running in production.

Justification — In this scenario, data is populated on every SOAP request, so keeping them as non-static data members is doable but considered pollutive.

How about static field? I used to think it’s thread-safe …

When thread_local is applied to a variable of block scope, the storage-class-specifier static is implied if it does not appear explicitly. In my code I make it explicit.

criticalMass[def]against churn ] tech IV

See also body-building impact{c++jobs#^self-xx

Critical mass of IV knowledge (eg: in core java self-study) is one of the most effective strategies against technology churn in tech interviews. Once I accumulate the critical mass, I don’t need a full time job to sustain it.

I have reached critical mass with core java IV, core c++ IV, swing IV (no churn) and probably c# IV.

The acid test is job interviews over a number of years.

Q: how strongly is it (i.e. critical mass) related to accumulation?
A: not much AFTER you accumulate the critical mass. With core java I did it through enough interviews and reading.

Q: how strongly is it related to leverage?
A: not much, though critical mass enhances leverage.

Q: why do some domains offer no critical mass?
A: some (jxee) interview topics have limited depth
A: some (TMP, py) interview topics have no pattern I could identify from interview questions.


c++complexity≅30% above java #c#=in_between

Numbers are just gut feelings, not based on any measurement. I often feel “300% more complexity” but it’s nicer to say 30% 🙂

  • in terms of interview questions, I have already addressed in numerous blog posts.
  • see also mkt value@deep-insight: java imt c++
  • — tool chain complexity in compiler+optimizer+linker… The c++ compiler is 200% to 400% (not merely 30%) more complex than java’s… see my blogpost on buildQuirks. Here are some examples:
  • undefined behaviors … see my blogposts on iterator invalidation
  • RVO — top example of optimizer frustrating anyone hoping to verify basic move-semantics.
  • See my blogpost on gdb stepping through optimized code
  • See my blogpost on implicit
  • — syntax — c++ >> c# > java
  • java is very very clean yet powerful 😦
  • C++ has too many variations, about 100% more than c# and 300% more than java
  • — core language details required for GTD:
  • my personal experience shows me c++ errors are more low-level.
  • Java runtime problems tend to be related to the (complex) packages you adopt from the ecosystem. They often use reflection.
  • JVM offers many runtime instrumentation tools, because JVM is an abstract, simplified machine.
  • — opacity — c++ > c# > java
  • dotnet IL bytecode is very readable. Many authors reference it.
  • java is even cleaner than c#. Very few surprises.
  • — more low-level — c++ > c# > java.
  • JVM is an excellent abstraction, probably the best in the world. The C# CLR is not as good as the JVM; it is a thin layer above the Windows OS.

POSIX semaphores do grow beyond initial value #CSY

My friend Shanyou pointed out a key difference between a binary semaphore and a simple mutex — namely ownership. A POSIX semaphore (binary or otherwise) has no ownership, because non-owner threads can create permits from thin air, then use them to enter the protected critical section. Analogy — NFL fans creating free tickets to enter the stadium.

https://stackoverflow.com/questions/7478684/how-to-initialise-a-binary-semaphore-in-c is one of the better online discussions.

  • ! to my surprise, if you initialize a posix semaphore to 1 permit, it can be incremented to 2. So it is not a strict binary semaphore.

https://github.com/tiger40490/repo1/blob/cpp1/cpp/thr/binarySem.cpp is my experiment using linux POSIX semaphore. Not sure about system-V semaphore. I now think a posix semaphore is always a multi-value counting semaphore with the current value indicating current permits available.

  • ! Initial value is NOT “max permits” but rather “initial permits”

I feel the restriction imposed by a semaphore is just advisory, like advisory file locking. If a program is written to subvert the restriction, it’s easy. Therefore, a “binary semaphore” is binary iFF all threads follow protocol.

https://stackoverflow.com/questions/62814/difference-between-binary-semaphore-and-mutex claims “Mutex can be released only by thread that had acquired it, while you can signal semaphore from any non-owner thread (or process).” This does NOT mean a non-owner thread can release a toilet key owned by another thread/process. It means the non-owner thread can mistakenly create a 2nd key, in breach of exclusive, serialized access to the toilet. https://blog.feabhas.com/2009/09/mutex-vs-semaphores-–-part-1-semaphores/ explicitly says that a thread can release the single toilet key without ever acquiring it.


[[DougLea]] P220 confirms that in some thread libraries such as pthreads, release() can increment a 0/1 binary semaphore value to beyond 1, destroying the mutual exclusion control.

However, a java binary semaphore is effectively a mutex, because releasing it before acquiring has no effect (no “spare key”) and doesn’t throw an error.

TCP_NODELAY to improve latency #lower efficiency



The default Nagle’s algo improves efficiency in small-packet applications like telnet, by coalescing tiny payloads. However, it may increase latency when sending streaming data.

In the case of interactive applications or chatty protocols with a lot of handshakes such as SSL, Citrix and Telnet, Nagle’s algorithm can cause a drop in performance, whereas enabling TCP_NODELAY can improve latency, but at the expense of efficiency, as briefly mentioned in 2011 white paper@high-perf messaging.

In such cases, disabling Nagle’s algorithm is a better option. Enabling the TCP_NODELAY option disables Nagle’s algorithm.

NoDelay means noNagle.


temp object bind`preferences: rvr,lvr,, #SCB

(Note I used “temp object” as a loose shorthand for “rval-object”.)

— based on P545 [[c++primer]]

Usually, a library class would define two overloads of an API such as vector::push_back():

  • push_back(X const &)
  • push_back(X &&)

No const rvr; no mutable lvr. The author gave brief reasons for each.  However, outside the library, I often see functions taking a mutable lvr param, in order to modify the object passed in.

— Based on https://www.codesynthesis.com/~boris/blog/2012/07/24/const-rvalue-references/

  • a const (not mutable [1]) L-value reference … … can bind to a naturally-occurring temp object (or a to-be-robbed object from std::move), or any object according to P545 [[c++primer]]
  • a non-const r-value reference can bind to a naturally-occurring rvalue object, treating the object as a target for robbing
  • a const r-value reference (crvr, rarely encountered) can bind to a naturally-occurring rvalue object but canNOT bind to an lvalue object
  • [1] See https://stackoverflow.com/questions/13826897/why-not-non-const-reference-to-temporary-objects. Basically, temp objects and robbed objects have no “location” so they can’t be the target of a mutable L-value reference.

Q: so in the presence of all of these overloads, which reference can naturally-occurring temp objects bind to?
A: a const rvr. Such an object prefers to bind to a const rvalue reference rather than a const lvalue reference. There are some obscure use cases for this binding preference.

More important is the fact that const lvr (param type of copy-ctor and assignment operator) can bind to temp object. [[c++primer]] P540 has a section describing that if you pass a temp into Foo ctor you may hit the copy-ctor:

Foo z(std::move(anotherFoo)) // compiles and runs fine even if move-ctor is unavailable. This is THE common scenario before c++11 and after.

The compiler doesn’t bother to synthesize the move-ctor when a copy-ctor is user-defined!

short-term ROTI ] java IV: tuning imt threading

Analogy — physics vs chemistry in high-school — physics problems involve more logic and more abstraction; chemistry is more factual. Similarly, Java threading challenges are more complex, while Java tuning knowledge-pearls are relatively easy to memorize and show off.

“Short-term” here means 3-5Y horizon.

I’m many times more successful with java threading — accu, deepening knowledge; a lead over the pack.

I hope to build some minor zbs in JVM tuning, mostly GC tuning. However, many important parts of JVM tuning skills suffer from churn and poor shelf life.

mktData direct^Reuter feeds: realistic ibank set-up

Nyse – an ibank would use direct feed to minimize latency.

For some newer exchanges – an ibank would get Reuters feed first, then over a few years replace it with direct feed.

A company often gets duplicate data feed for the most important feeds like NYSE and Nasdaq. The rationale is discussed in [[all about hft]]

For some illiquid FX pairs, they rely on calculated rates from Reuters. BBG also calculates a numerically very similar rate, but BBG is more expensive. Bbg also prohibits real time data redistribution within the ibank.

mktData parser: multi-threaded]ST mode #Balaji#mem

Parsers at RTS: a high-volume c++ system I worked on has only one thread. I recall some colleague (Shubin) saying the underlying framework is designed in single-threaded mode, so if a process has two threads they must share no mutable data. In such a context, I wonder what’s better

  A. EACH application having only one thread
  B. Each application having 2 independent threads sharing nothing

I feel (A) is simple to set up. I feel if the executable + reference data footprint is large like 100MB, then (B) would save memory since the 100MB can be shared between threads.

–Answer from a c++ trading veteran
Two independent instances is much simpler time/effort wise. Introducing multiple threads in a single threaded application is not trivial. But I do not see any value if you create multiple threads which do not share any (mutable) data.

–My response:
Executable and reference data take up memory. 1) I feel executable can take up 10MB+, sometimes 100MB. 2) If there’s a lot of immutable reference data loaded from some data store, they can also add megabytes to process footprint. Suppose their combined footprint is 100MB, then two instances would each need 100MB, but the multi-threaded design would need only 100MB in a single instance hosting multiple threads.

My RTS team manager required every market data parser process to reduce footprint below 60MB or explain why. My application there used 200MB+ and I had to find out why.

Apparently 100MB difference is non-trivial in that context.

live-free-of restrictions #LS

I can live with the restrictions of diet, burn rate, workout routine, perhaps strict work hours … but not LiuShuo style self-restraint in office context. Why?

  • I feel the suffering is not worthwhile.
  • I don’t feel self-confident I can “improve” myself.

I want to live free of this last category of restrictions. For this, I would let go of any leadership career.

3 real overheads@vptr #inline

Suppose your class Trade has virtual functions and a comparable class Order has no virtual functions. What are the specific runtime overheads of the vptr/vtable usage?

  1. cpu cache efficiency — memory footprint of the vptr in each object. Java affected! If you have a lot of Trade objects with only one char data field, then the vptr greatly expands footprint and you overuse cache lines.
    • [[ARM]] singles out this factor as a justification for -fno-rtti… see RTTI compiler-option enabled by default
    • [[moreEffC++]] P116 singles out vptr footprint as the biggest performance penalty of vptr
  2. runtime indirection — the non-virtual call in the Order usage is “a few memory references more efficient” [1]
  3. inlining inhibition is the most significant overhead. P209 [[ARM]] says inline virtual functions make perfect sense so it is best to bypass vptr and directly call the virtual function, if possible.

[1] P209 [[ARM]] wording

Note a virtual function unconditionally introduces the first overhead, but the #2/#3 overheads can sometimes be avoided by a smart compiler.


How RTS achieved 400-700 KMPS #p2p

Official benchmark result – 700 KMPS per instance of rebus or parser. Here are some enabling features:

  • [! iii] mostly single-threaded apps. Some older parsers are multi-threaded but each thread operates in single-threaded mode. No shared mutable state :)
    • In the order book replication engine, we use lockfree in one place only
  • ! (ST mode loves) Stateless [1] parser, small footprint. Only downstream component (Rebus) handles cancels/busts, modifications.. So we can run many parser instances per machine, or many threads per instance, fully utilizing the cpu cores. See https://haydenjames.io/linux-performance-almost-always-add-swap-space/
  • [! ii] allocation avoided — on millions of Output message objects. Pre-allocated ring buffer eliminates new(). very few and small STL containers .. mostly arrays… pre-allocate DTOs@SOD #HFT
  • ! allocation avoided — on millions of Input message objects, thanks to reinterpret_cast() on pointers… nearly Zero-copy. See reinterpret_cast^memcpy in raw mktData parsing
  • ! allocation avoided — custom containers to replace STL containers used, since they all allocate from heap
  • ! p2p messaging beats MOM 
  • ! Socket buffer tuning — to cope with busts. 64-256MB receive buffer in kernel; 4MB UserModeBuffer
  • low memory footprint. Most parsers use around 60MB. (My parsers were 100 to 150MB.) I think there are many benefits in terms of latency.
  • epoll — to replace select(), with a 2-orders-of-magnitude improvement in throughput
  • buffering of Output list (of messages). I guess this hurts latency but enhances throughput
  • Very fast duplicate seq check, without large history data — a hot function
  • large containers like the ring buffer are pre-sized. No reallocation.
  • mostly array-based data structures — cache-friendly
  • Hot functions use pbref or RVO or move semantics, minimizing stack allocations
  • aggressively simplified parsing. Minimal logic
  • Special Logger to reduce I/O cost
  • Sharding to speed up symbol lookup
  • kernel bypass : RTS
  • No virtual functions — enhancing data cache and inlining .. 3 overheads@vptr #ARM
  • Inline
  • —-Speed with control
  • Best of both to reduce the real risk of our connection becoming slow or lossy
  • Depth monitor to detect missed messages
  • Capacity tracker to monitor mps rates
  • Missed seq reporting
  • [!=one of the really effective techniques in RTS]
  • [i=presented in multiple (!!, !!!) interviews]

[1] the parser does remember the sequence numbers. In fact sequence number management is crucial. Before retrans was added to the parser, we never sent any retrans request. We did keep track of gaps. Mostly we used the gap-tracker to check duplicates across Line A/B. We also relied on our sequence number state to handle sequence resets.

enough bandwidth2help kids study@@ : 3 past jobs

Which past jobs would allow me to spend more time helping kids with studies?

  1. #1 Citi — I could finish off personal work in the office
  2. OC
  3. RTS? yes during the quiet weeks. More importantly, I was familiar with the system, possibly more so than my coworkers, so with 70% of the effort I could meet the requirements, leaving me more time to help the kids

How about Macq? Hours are short but I often decide to work on weekends. I dare not leave earlier because I feel guilty that I haven’t made enough progress on my projects.

I feel the stress of helping kids with studies would /resonate/ (“conflict” is the wrong word) with the stress of stigma/PIP, and  create a highly amplified response, bigger than each stressor alone could produce. I would feel like a candle burning from both ends .. insufficient energy and would be forced to consider tough sacrifices.

I think Mvea team would be much better as Josh is very tolerant and forgiving.

Commute — how important is it? I now feel commute is insignificant compared to coworker benchmarking or PIP

mutex ] rebus: 3designs #RCU #CSY #80%

Requirement: Each worker thread operates independently on its own symbols, never interfering. Admin thread may read/write all symbols.

My friend CSY’s design resembles ConcurrentHashMap — split a big “graph-container” into 32 independent sub-graphs, to be locked individually.

For my 1st design, I briefly considered one lock per symbol. I think 100,000 locks is fine if symbols are consecutive integers. I simply use a symbol as array index to locate the corresponding lock. Every worker thread and the admin thread need to acquire the lock for symbol X before updating the AVL tree of X.

— Here’s my 2nd design, based on a single global lock:

  • Regular worker threads only check, never acquire, the lock. If the lock is in use (i.e. taken by the admin thread) then they wait in a 1-second-sleep loop.
  • The admin thread immediately and unconditionally takes the lock and waits for a grace period[1] before writing. I assume this “write” can take up to 1 minute.
  • how to “check” the lock?
    • TryLock solution [Design#1] — Just trylock and immediately unlock.
    • LockFree solution [Design #2] — atomic boolean to be set only on the admin thread. I think this is one of the simplest usage scenarios of atomic boolean. We are not even relying on the atomicity. We only need the visibility feature.

[1] This grace period is designed to be sufficient for a batch of updates on the worker thread. If we know a batch can write 100 orders and take X microsecond, then set grace period to X. We require worker routine to recheck the lock after each batch.

My friend CSY raised the concern that mid-stream the update could be suspended by kernel scheduler. I think in this case, a simple integer update could take ages, as in the GC stop-the-world scenario. So the design requires some realtime kernel support to guarantee timely response.

read-copy-update lockfree +! retry #RCU mentions a grace period used in RCU api

— Here’s a bigger lockfree design [Design #3]

  • without realtime kernel
  • Two global variables — in addition to the atomic boolean flag, I will add an atomic int “activeWorkerCnt”
  • Two non-reentrant routines (to avoid incremental acquisition)
Worker threads routine:
1. check the flag until it’s false
2. increment activeWorkerCnt
3. update data store
4. decrement activeWorkerCnt
Admin thread routine:
1. unconditionally set the flag # rejecting all new workers
2. while activeWorkerCnt > 0: sleep 1 millisecond ## draining the active worker pool
3. update data store
4. unset the flag

close(): with timeout or non-blocking #CSY

I was right that close() can be non-blocking even on a TCP socket. It is the default.

When you specify SO_LINGER, you can also give a timeout to control how long close() blocks on a TCP socket.


So there are 3 modes on a TCP socket:

  • close() blocking with timeout via SO_LINGER
  • close() blocking without timeout
  • close() nonblocking

[19]with3sockets: select^non-block recv#CSY

Update — CSY said when you increase from 3 to 99, the difference becomes obvious.

Q (TradeWeb mkt data team): given 3 non-blocking receive-sockets, you can either read them sequentially in a loop or use select(). What’s the difference?

http://compgeom.com/~piyush/teach/4531_06/project/hell.html compares many alternatives. It says Non-block socket read() is least efficient as it wastes CPU.

If you have 3 receive-sockets to monitor, select()/poll() is designed just for your situation.

Using read() instead, you may try specifying a timeout in your read() like

//set socket option via SO_RCVTIMEO to 5 seconds

Low latency systems can’t use this design because while you wait 5 seconds on socket1, you miss new data on socket2 😦

Therefore, low-latency systems MUST disable SO_RCVTIMEO. In such a design, the loop wastes cpu in a spin-wait.

–another advantage to select(): you can process a high-priority socket earlier when select() says multiple sockets are ready.

I gave this answer on the spot.

recv()/send() with timeout #CSY

I was right — timeout is supported Not only in select()/poll(), but also read/recv/send..


Timeouts only have effect for system calls that perform socket I/O (e.g., read(2), recvmsg(2), send(2), sendmsg(2)); timeouts have no effect for select(2), poll(2), epoll_wait(2), and so on. These latter functions only check socket status… no I/O.

I believe this socket option has no effect on non-blocking sockets. A nonblocking socket can be created with either

  • O_NONBLOCK in fcntl() or
  • SOCK_NONBLOCK in socket()

git commit --amend # risky

Consider the harmless-looking command git commit --amend

Most of the time, I rely on this command as a safe way to amend commit message without touching the original files in the commit.

However, if there are further changes on file1 in staging, then this command would silently introduce those changes into the commit. Once you finalize the commit message and save it, the change inadvertently becomes part of the commit!

nonVirtual1() calling this->virt2() #templMethod

http://www.cs.technion.ac.il/users/yechiel/c++-faq/calling-virtuals-from-base.html has a simple sample code. Simple idea but there are complexities:

  • the given print() should never be used inside the base class ctor/dtor. In general, I believe virt2(), like any virtual function, behaves as non-virtual when called from a ctor/dtor.
  • superclass now depends on subclass. The FAQ author basically says this dependency is by-design. I believe this is template-method pattern.
  • pure-virtual is probably required here.

strided list-traversal: excel-formula/python/boost #JackZ


(Figure omitted: the green cells formed part of a natural number sequence.)

lazy evaluation: a common lambda usage


Lambda functions fit nicely with lazy evaluation, giving a special syntax to capture an expression or block to be evaluated later:

// Java: delay counting lines of an input text
Supplier<Integer> createLineCounter(String text) {
    return () -> countLines(text);
}

Here, the method creates and returns a Supplier-of-integer, a callable object that produces an integer not now, but at a later time.

Q: at registration time, the string text is received but used only at counting time, so where is this string stored?
%%A: probably “captured” in some hidden field of the Supplier object

I suspect our dropcopy framework uses lambda mostly for this purpose.

%%engaging[def2]deep dive shines through QQ interviews

Opening example — I had 3M of fixation on tibrv^JMS. Just look at my blogposts. I go beyond interview topics. This fixation hopefully shines through on interviews.

This post is more about “engaging” than focused effort — I couldn’t really force myself to focus on speed coding or c++ TMP, or high-churning javascript.

Ken Li (Citi-muni) was probably a sharp observer of my personality and once said “If you are interested in software development then you will have a long career till retirement … no need to worry. A lot of developers lack that interest.”

Very few candidates have spent more than 3M of personal spare time engaged in any tech topic… something like a deep dive. (Even fewer can sustain 12M of focus.) Consider XR’s interest in machine learning. Consider grandpa.

Apparently, many interviewers are personally interested in these same details (the whys/hows/whats) so they seek out like-minded enthusiasts. This is one hypothetical explanation why those crazy theoretical QQ questions come up so often.

Other examples of personal fixation that (hopefully) shine through

  • sockets — blocking operations …; unplugged
  • clustered index
  • timer hardware interrupts
  • userland vs kernel mode
  • nested classes across languages
  • compiler rules on overriding

Now, many of the weeks spent have low ROTI, but I was not fixated on ROTI since my earliest engaging studies in 1998 — first deep dive in unix + C programming + perl. As it turned out

  • unix and linux became dominant on server side and enhanced my career
  • perl became my entry point into a programming career
  • C domain was hopeless, but with c++ I was able to capitalize on it

Still, many of my engaging studies had low ROTI, but what if I hadn’t followed my interest? I think I would have ended up wasting my spare energy!

Striking balance — too much “engaging” –> regretful ROTI; too little “engaging” –> bored and unfulfilled.

decline@quant-dev domain: low value-add

Q: why drv pricing quant jobs used to be more abundant and pay a premium even though the economic value-add of the quant skill has always been questionable?
%%A: because the traders made money for a few years and subsequently received a big budget. Now budget is lower.

Same can happen to quant funds in NY, Chicago, HK…

Some UChicago lecturer once commented that everyone should get some quant training (almost as every high school student does math). But I think it is not necessary….

I am more theoretical and was naturally attracted to this domain but the value-add reality soon prevailed in my case.

Tech shops love algo challenges and speed-coding, which have dubious value-add similar to the quant skills. However, the hot money in the tech sector is bigger and lasts longer.

car park concurrency design #volatile field#AshS

Design a car park class for concurrent access given

  • a single sensor A at the entry
  • a single sensor B at the exit

Without using locks, provide enter() and leave() API methods that can be safely invoked on multiple threads.

— If at any time there’s only one thread calling enter (and leave), then volatile-based solution might work. I think this assumption is reasonable. Given there’s only one physical entry point, it’s not physically possible for two threads to run enter() simultaneously.

volatile int sensorA, sensorB; // entry count and exit count

int attemptToPark() {
    int count = this.sensorA - this.sensorB;
    if (count > CAPACITY) throw new IllegalStateException();
    if (count == CAPACITY) return 0; // full; rejected
    this.sensorA++; // safe only because a single thread ever writes sensorA
    return count + 1;
}
void leave() {
    // concurrent access disallowed for this read-modify-write operation
    this.sensorB++; // safe only because a single thread ever writes sensorB
}

— If multiple threads are allowed to increment this.senserB, then we need (either a lock or) atomicInteger. ABA problem is not a concern here given the integer never decrements.

short coding IV: ibank^HFT^inetShop

See also prevalence@speed coding among coding-IV

  • “InetShop” includes all west-coast type of hiring teams including China shops
  • Ibanks are mostly interested in language knowledge, so they use coding test (including multithreaded) for that purpose.  Ibanks also use simpler coding questions for basic screening.
  • HFT shops (and bbg) have an unpredictable mix of focuses

I predict that Wall St will Not adopt the west coast practice due to .. poor test-coverage — Not testing language knowledge, latency engineering, threading etc.

The table below excludes take-home tests and includes hackerrank, white-board, webex, and by-phone algo questions.

|  | ibanks | web2.0 shops | HFT |
| --- | --- | --- | --- |
| language knowledge; latency, NOT bigO | #1 focus | never | key focus, not #1 |
| simple-medium speed coding | rare | #1 focus; high bar | common |
| medium-simple algo without implementation (logistically easiest => popular) | common #SCB; minimum bar is low | common | key focus, not #1 |
| tough algo | never | sometimes | rare |
| big-O of your solution, NOT latency | LGlp; low bar | #2 focus; high bar | LGlp |
| concurrency coding test | G5 focus | never #python/ruby/js | sometimes |


## marketable syntax nlg: c++ > j/c#

Every language has poorly understood syntax rules, but only in c++ did these become fashionable, and halos in job interviews!

  • ADL
  • CRTP
  • double pointers
  • hacks involving void pointers
  • operator overloading to make smart ptr look like original pointers
  • TMP hacks using typedef
  • TMP hacks using non-type template param
  • universal reference vs rvr
  • rval: naturally occurring vs moved
    • const ref to extend lifetime of a naturally occurring rval object

techShop-plan: rare opp2try tech start-ups

The Sg2019 presents a rare opportunity to work for such a firm and gain some insight how they select candidates.

Salary might be slightly lower, like 150k vs 180k@SCB, but I think it’s completely fine, until we (irrationally) compare to some peers.

The inherent risk of stigma (damagedGood, PIP..) is lower since I’m older and newer, bringing down the expectation. I would feel like Michael Jordan joining baseball.

std::bitset: popular ] WallSt coding IV

std::bitset is popular in WallSt c++ coding interviews. Easy to use iFF you are an experienced c++ developer.

  • to_string() and cvctor from string like “011100101011010”
  • to_ulong() and cvctor from ulong
  • operator[] returns a reference, so you can assign to myBitset[0]. Designed to superficially mimic a raw array. Actually, the return type is not a reference to a bool. Rather, I think it’s a proxy reference to a specific position in the “backing store”. See http://www.cplusplus.com/reference/bitset/bitset/operator%5B%5D/
  • Avoid the notion of “left/right” when using bitset. Prefer “higher/lower” instead. The lowest bit has position 0, because bitset is designed to represent a binary integer. Therefore, to scan from the highest bit, we must iterate from bitset[size()-1] down to bitset[0].

new c++features=mostly4lib developers

In C++0x, half of the major new language features are designed for the standard library developers.

  • The unspoken assumption — these features are equally useful to other library developers.
  • The follow-up assumption — app developers also need to use some (and understand all) of the nitty-gritty details.

In reality, these nitty-gritty details are Not relevant to GTD for app developers.

price discovery n closing auction

When an order is sent directly to NYSE, NYSE could follow an algorithm and route it to BATS-Y which fills it. See https://en.wikipedia.org/wiki/National_best_bid_and_offer

I guess the fill report should probably show 30=BATY

liquidity indicator might show 3 (routed out)

An exchange provides the (nearly) ideal venue for price discovery, but in reality the fragmentation of liquidity (eg: routing-out) makes price discovery non-trivial even on a large exchange like NYSE.

Luckily, the closing auction remains a centralized process. Even though a stock symbol can be traded on multiple exchanges besides the primary listing exchange, auction orders are always sent to the primary listing exchange.

insert to sorted collection of N items: I can beat O(log N)

features/advantages of sorted collection

  • O(N) ascending iteration
  • O(log N) range query
  • O(log N) closest neighbor query above/below
  • O(log N) boolean contains() query
  • O(1) min() max() queries
  • deleteByKey is typically O(log N) but can be O(1) if we construct a hashtable of {key -> iterator} once the collection is built. This hashtable can also accept new items.

Q: insert is typically O(log N), but can we improve on it?

— A: if there’s only one insert, we can hold it in a hidden field of the collection, to be checked at the end of each operation above.

— A: if there are unlimited inserts but all keys are 64-bit integers, then we can treat the incoming integer as a short radix array (8 one-byte elements, for example). Take each element of the array as a lookup key, look up in a 256-slot array to find a subtree.

Basically the entire sorted collection is an 8-level tree with fan_out=256

You may say that there are 8 lookups required so it’s O(k) where k=8, but here we assume the size of the radix array is a constant equal to eight, so each insert is O(1). Note the collection size is unlimited and far more than 2^64 if duplicates are permitted.
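A minimal sketch of the 8-level fan_out=256 tree, assuming 64-bit unsigned keys. Class names are mine; only insert/contains are shown (sorted iteration would walk children in ascending byte order):

```cpp
#include <array>
#include <cstdint>
#include <memory>

// each byte of the 64-bit key selects one of 256 children; insert and
// contains each do exactly 8 lookups, a constant independent of size
struct RadixNode {
    std::array<std::unique_ptr<RadixNode>, 256> kids;
    std::uint64_t count = 0;   // at leaf level; duplicates permitted
};

class RadixSet {
    RadixNode root_;
public:
    void insert(std::uint64_t key) {
        RadixNode* cur = &root_;
        for (int shift = 56; shift >= 0; shift -= 8) {
            unsigned byte = (key >> shift) & 0xFFu;
            if (!cur->kids[byte]) cur->kids[byte] = std::make_unique<RadixNode>();
            cur = cur->kids[byte].get();
        }
        ++cur->count;
    }
    bool contains(std::uint64_t key) const {
        const RadixNode* cur = &root_;
        for (int shift = 56; shift >= 0; shift -= 8) {
            unsigned byte = (key >> shift) & 0xFFu;
            if (!cur->kids[byte]) return false;
            cur = cur->kids[byte].get();
        }
        return cur->count > 0;
    }
};
```

The tradeoff is memory: each node holds a 256-slot child array, so this sketch favors speed over space.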

FIX^tibrv: order flow

Background — In a typical sell-side, a buy/sell order goes through multiple nodes of a flow chain before it goes out to the trading venue. The order message is the payload on a messaging platform.

FIX and Tibrv (and derivatives) are dominant and default choices as messaging platforms. Both are highly adaptable to complex combinations of requirements.

transport — FIX: TCP; tibrv: UDP
fan-out — FIX: unicast; tibrv: multicast pub/sub
latency — FIX: designed for low-latency eq trading; tibrv: popular for bond trading
editable payload — order msg gets modified along the flow chain?

tick`95%mark #Kam #70%

“Ticking 95th percentile server” is the blog title I would use. The original question came on paper/pencil, scanned and sent to my gmail by my friend Deepak CM. I find this problem realistic and practical, rather than fake and contrived. I treat it as a System Design + implementation question.

Q: Using only std library, write a c++ program to process a live stream of 128,000,000 (or much more) double numbers, representing temperature readings not necessarily unique. As the temperatures come in, print the current 95th percentile on-demand. I call it the lucky “winner”. We can use the nearest-rank percentile definition.

====Idea 1: “given unsorted ints, find median in O(N)” is for the median, but can be tweaked for any percentile; unfortunately, it is not “ticking”

====design 2, for static data set
use an “order statistic tree”, i.e. a RBTree where each node remembers the size of its subtree. (A leaf node has size 1.)
====design 3, optimized for high volume of updates like 128 million updates, not optimized for frequent query

The entire temperature range is divided into non-overlapping segments, each represented by a segment-head temperature i.e. the lower bound [1b]. Each segment has a range (i.e.distance to next segment head), size (i.e.item count) and density (i.e. size/range ratio). We mostly care about “size” only.

We need a RB-tree (or sorted vector) containing P=1024 [1] nodes, each an unsorted container[3]. The RB-tree serves to maintain the containers i.e segments.

Each incoming temperature is quickly “routed” to the correct container and simply appended therein, increasing its size.

Upon query request, we will use the latest segment sizes to build a cumulative profile, and run an O(log P) binary search to identify the one segment containing the “winner”. This segment size would hopefully be much smaller than 128,000 [2] and far more /tractable/.

–Within the chosen segment of size S, we can use a vector to sort in O(S logS) the temperatures and identify the winner.  After completing a query, the chosen container will become (partially) sorted, helping subsequent queries if this segment is picked again.

Since we only support 95th percentile, chance is good that this segment will be picked most of the time. If x% of the queries hit this segment, then I will convert this “favorite segment” to a RB-tree.

Alternatively, we can also use the O(S) algorithm in Idea 1, but the container won’t become sorted.


[2] 128,000 is 1/1024th of the original sample size… not ideal. The segments need to be initialized carefully, during a priming phase, inspired by the JIT compiler. Shall we assume a roughly uniform distribution or a Gaussian distribution? Assuming we know the total sample size is 128 million, I will use the first 100,000 temperatures to select the 1024 segment heads. The segments are structured not for equal length (in temperature) or equal size (i.e. element count). In fact the low segments can be very long and very crowded.

Instead, the segment heads are chosen so that between 94th percentile and 96th percentile we have half of all segments. These small segment sizes will be much smaller than 128,000 and quicker to manipulate.

–Foot notes:

Q: what if some of the containers grow too big like three times 128,000,000/1024. The priming/estimate was ineffective.
A: Not a problem unless the winner happens to be in such a container.
A: One idea is to split up such a container when we notice it, and grow a new node on the RB-tree. std::map::insert() can take a hint for the position where new node can be inserted. Note we don’t want to split a node JJ too early since JJ may not grow any further subsequently and may end up no larger than other nodes as other nodes keep growing.

[1] Sizing of P — First we estimate the total sample size. If unknown, then set P:=1024 so all nodes stay in L1-cache (typically 32KB). If we assume 16 bytes/node (8 bytes pointer to container + 8 bytes double), then 32KB can hold 2000 nodes.

If query becomes more frequent, I can increase P by 1024, sacrificing insertion.

[1b] The node values are “lower-bounds” and don’t have to be really the minimum of the original temperatures in the container. We can probably cheat with 4-byte floats, and we get away with 2700 twelve-byte tree nodes.

[3] slist vs vector — vector might be faster due to pre-allocation, provided a node will never grow beyond a capacity. Vector has reserve() (Note resize() is wrong choice.)
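Design 3 can be sketched as below, with a sorted vector of segment heads standing in for the RB-tree. Class and method names are mine; for brevity the query uses nth_element within the chosen segment (the O(S) approach of Idea 1) rather than fully sorting it:

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

class TickingPercentile {
    std::vector<double> heads_;                 // segment lower bounds, sorted
    std::vector<std::vector<double>> buckets_;  // one unsorted container per segment
    std::size_t total_ = 0;
public:
    explicit TickingPercentile(std::vector<double> heads)
        : heads_(std::move(heads)), buckets_(heads_.size()) {}

    // O(log P) routing + O(1) append
    void insert(double t) {
        auto it = std::upper_bound(heads_.begin(), heads_.end(), t);
        std::size_t seg = (it == heads_.begin()) ? 0 : (it - heads_.begin() - 1);
        buckets_[seg].push_back(t);
        ++total_;
    }

    // nearest-rank definition: the ceil(0.95 * N)-th smallest value
    double percentile95() {
        std::size_t rank = (total_ * 95 + 99) / 100;          // 1-based rank
        std::size_t seen = 0, seg = 0;
        while (seen + buckets_[seg].size() < rank)             // cumulative profile
            seen += buckets_[seg++].size();
        auto& b = buckets_[seg];
        std::size_t k = rank - seen - 1;                       // 0-based within segment
        std::nth_element(b.begin(), b.begin() + k, b.end());   // O(S) on average
        return b[k];
    }
};
```

The query leaves the chosen bucket partially ordered, which (as noted above) helps subsequent queries hitting the same segment.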

value@xRef imt tricks@frontier_Qn: retention

My absorbency is limited. My absorbency is more limited than my time (that’s why I camp out in office when in the mood.) When I’m in the “absorbency” mood,….

… I need to apply my laser energy on the most valuable learning. Many coding drill guys would focus on unfamiliar problems, however, learning a few tricks on a few unfamiliar categories of problems  is unlikely an organic growth.. limited xRef.. not connecting the dots .. no thick->thin. Nowadays I think the more valuable learning is building xRef including

Even if my xRef leads to a non-optimal solution, it is still extremely valuable. A non-optimal solution is sometimes more versatile and can become more effective in a modified problem [3].

If you focus on optimal solution, you may experience wipe-out. Same wipe-out effect due to fixation on passing leetcode tests and fixation on “#new problems solved”

A shift from “frontier” exploration to xRef means slowing down and revisiting old problems.

In some coding problems, you really need some new technique [2], but across 90% of the coding problems, the number of constructs needed is below 10.

[2] any example?

[3] example:

  • find-all-paths
  • in-order-walk on a non-sorted tree to assign auto-increment nodeIDs, then use pre-order or post-order
  • reusable technique — RBTree as alternative to heap or sorted vector
  • reusable technique — yield-based cookbook
  • reusable technique — radix sort on ints, floats and fixed-length strings as a O(N) wall-blaster. Variable-length string can also achieve O(N) as described in my blogpost on “FastestSortFor”

by default, interrupts==hardware interrupts

In operating system jargon, “interrupts” means hardware interrupts by default. Timer hardware, NIC, keyboard/mouse, and possibly the hard disk … can all generate hardware interrupts.

“Interrupt signal” often refers to the electrical signal, never the higher-level SIG* signals in unix (unsupported in Windows AFAIK).

There is also a well-known concept of “software interrupt”. I see them as fake interrupts, simulated in software. My book [[linux kernel]] says software interrupts are employed for 1) system calls 2) debugger breakpoints.

A key difference between software vs hardware interrupts — hardware interrupts come from devices with their independent power supply and clock signal, so hardware interrupts can hit CPU any time, even in the middle of a CPU clock cycle. Software interrupts won’t. Software interrupts are simulated by code running in the same CPU, driven by the same clock signal.

In other words, hardware interrupts are unsynchronized with CPU, whereas software interrupts are synchronized with CPU.

missing q[ return ] in non-void function #Josh

I told Josh about this real Scenario: my non-void function has 3 paths to exit the function, one of them without a return statement.

g++ usually gives a warning under “-Wall”. Flowing off the end of a function is equivalent to a return with no value — undefined behavior in a non-void function.

If you print the function return value and the 3rd path executes, you get undefined behavior.

https://stackoverflow.com/questions/3402178/omitting-return-statement-in-c shows -Wreturn-type

I believe it’s a compilation error in other languages.
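A minimal reproduction of the scenario, with the buggy shape shown in comments (the function name sign is mine) and the corrected version as compilable code:

```cpp
// buggy shape -- the 3rd path flows off the end; g++ -Wall warns via
// -Wreturn-type ("control reaches end of non-void function"), and using
// the "returned" value on that path is undefined behavior:
//
//   int sign(int x) {
//       if (x > 0) return 1;
//       if (x < 0) return -1;
//   }   // path 3: x == 0, no return statement

// corrected version: every path returns a value
int sign(int x) {
    if (x > 0) return 1;
    if (x < 0) return -1;
    return 0;
}
```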

hardware mutex, based@XCHG instruction

[[OS by William Stallings]] P214 is concise and clear.

The atomic XCHG instruction on Intel processors can swap a local variable (held in cpu register) with a main memory location visible to all threads.

This XCHG instruction alone can implement a simple mutex usable by multiple threads on multiple processors sharing the same main memory.

Another atomic instruction, the test-and-set instruction, can also by itself implement a mutex.

Both mutexes have only academic value because their Limitations are severe:

  • even with a single mutex, deadlock is possible at the hardware level — ThrA in critical section can get interrupted and preempted — perfectly normal. Now ThrA has to wait for high-priority ThrH to complete, but ThrH is busy-waiting (wheel-spinning) to enter the same critical section 😦
    • Note in this example there’s no “lock object” per-se, only a critical section in code !
  • by default, unsuccessful threads like ThrH are stuck in busy-wait — wasting CPU and generating heat 😉
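A sketch of such a mutex using std::atomic<bool>::exchange, which is the standard-C++ counterpart of the atomic swap the book describes (class and method names are mine). Note the busy-wait in lock() is exactly the limitation discussed above:

```cpp
#include <atomic>

class SpinMutex {
    std::atomic<bool> taken_{false};   // the memory location visible to all threads
public:
    bool try_lock() {
        // atomically swap in "true"; the OLD value tells us whether
        // another thread already held the mutex
        return !taken_.exchange(true, std::memory_order_acquire);
    }
    void lock() {
        while (!try_lock())
            ;   // busy-wait (wheel-spinning) until the swap succeeds
    }
    void unlock() { taken_.store(false, std::memory_order_release); }
};
```

Production spinlocks add backoff (e.g. a pause instruction) and eventually park the thread in the kernel, precisely to avoid the CPU-burning busy-wait.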


lexdfs: home-grown implementation #YH

Requirement: given an undirected graph of numbered nodes (each with an integer id), and a random root node, there’s a specialized DFS algo — When visiting multiple neighbors of a node AA, we must visit the “lower” neighbor first.

Basic rule of any DFS — Graph may contain cycles, so care must be taken to visit each node exactly once.

I will use color Grey to denote any node in my current stack. The root node is always in the current stack. I will use Black to paint fully explored nodes, which were once in my stack. White nodes are yet unexplored. Every node has exactly one color. Therefore, IFF we encounter a white node, we explore it.

( Color scheme on [[algo in a nutshell]] P144 )

==== analysis ====

Efficient insert/remove on the white/grey/black collections is key. For a simple DFS, I think a regular hashtable might do.

  1. first pass I will build a sorted edgeList for each node. I like LinkedHashSet. When a node K is in the edgeList of T, T is also in K’s edgeList.
    • Every edgeList shall be sorted. This requires building a temporary edgeList, sorting it, and finally creating the real edgeList
    • I also put all nodes in the white collection
  2. now we start DFS loop by moving the root node to the grey collection. lexdfs(cur) is a recursive function returning nothing:
    1. if cur has a non-empty edgeList, then identify the first neighbor, move this item from white to grey collection, and call lexdfs() on it.
    2. if cur has no edgeList, then move cur item from grey to black collection, and return from the function (i.e. pop stack)
      1. whenever we paint any item T black, we also remove it from all edgeLists. We look at every neighbor (i.e. every item in T’s edgeList). For each item like K, we efficiently remove T from K’s edgeList.  This removal is efficient due to hashtable, and also maintains the sorting.
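A simplified home-grown sketch of lexDFS. I use std::set<int> as the sorted edgeList (a stand-in for the sorted LinkedHashSet above) and a single white collection; the grey/black distinction is implicit in the recursion. Class and method names are mine:

```cpp
#include <map>
#include <set>
#include <vector>

class LexDfs {
    std::map<int, std::set<int>> adj_;   // node -> sorted edgeList
public:
    void addEdge(int a, int b) {         // undirected: K in T's list and vice versa
        adj_[a].insert(b);
        adj_[b].insert(a);
    }
    std::vector<int> run(int root) {
        std::set<int> white;             // unexplored nodes
        for (auto& kv : adj_) white.insert(kv.first);
        std::vector<int> order;          // visit order, each node exactly once
        visit(root, white, order);
        return order;
    }
private:
    void visit(int cur, std::set<int>& white, std::vector<int>& order) {
        white.erase(cur);                // cur is now grey (black on return)
        order.push_back(cur);
        for (int nb : adj_[cur])         // ascending IDs: "lower" neighbor first
            if (white.count(nb))
                visit(nb, white, order);
    }
}; 
```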

##FX/eq projects ] PWM, for SG IV

—-fx or eq
* Swing app to monitor multiple live database tables and auto-refresh upon new quotes/trades and position updates.
* Post-trade Affirmation for complex derivative deals. A workflow application. Operations verify deal attributes by phone and Affirm on the deal to send out trade confirm.
* Statistical feed “blender” (with multiple purposes), needed to remove outliers among diverse indicative eFX forward quotes. Outliers tend to mess up our auto pricers, tiered pricers and cross rate pricers.
* (Ledger balance?) PnL report. Nightly batch to apply marks on option positions and compute PnL. Reports both clients and firm accounts.
* Volatility surface fitter for FX (& equity index/single-names). Volatility smile curve fitter. Quote inversion. Option position mark-to-market, driven by live quotes from multiple live eFX markets.
* Workflow engine for Error trade reconciliation/chargeback. A single error ticket can compute PnL and chargeback across multiple (up to 100) error trades and cover trades. Every quarter, this engine reconciles hundreds of error trades whose total values add up to millions of dollars.

* Deal entry screen for trade booking after a retail dealer executes over phone. If trade validation fails, we notify the dealer to amend and resubmit. Same swing app is also used by spot/swap/options dealers to book voice trades.
system? data normalization to take care of splits and dividends
system? basic back testing of simple strategies
system? re-sampling

* (EOS) web-based trade/order entry, with multiple validations. Used primarily by investment advisers across Americas, Europe and Asia. Thousands (up to 5-figure) of orders received daily.
** showing real time bid/ask from GS + bid/ask from trading venues
** supports most order types — limit/market/FOK/IOC/SL

* swing-based real time order book display. Updates to displayed order books come from the backend OMS via real time messaging.
* Also contributed to the OMS backend. Orders originate exclusively from private bank clients. We then monitor their cancels/amends and execution reports (via firm-wide messaging hub) from exchanges, ECN and GS private liquidity pool. Total message volume up to 6 figures.

We are the interface to multiple ECNs to 1) receive quotes, enrich and publish to PWM clients 2) forward client orders to ECN in 2-leg spread trades 3) execute trades received from ECN 4) generate and validate (against tri-arb) cross rates using ECN quotes. Also contributed to auto quoters and RFQ engine. Supervised and monitored the progress of high-frequency eFX applications. Performed eFX development activities such as requirement gathering, design, testing and deployment. Personal contributions included
* Tiered quote pricer. Each FX customer falls into one of the tiers. When a currency pair updates in our pricer, all registered clients would get a new quote (??delivered to their Financial Advisors??). Silver tier clients receive a better bid/ask spread than regular clients; Gold tier gets the best quote.
* eFX rate update/distribution engine to downstream real time risk, PnL, position marking systems
* eFX option real time risk report (unfinished). Option position risk calc is rather slow, so each user selects a limited number of positions into his watch-list. Watched positions get periodically updated based on live ECN rates to show real-time risk.
* (questionable project) Compute cross rates for PWM trades that are large, recurring and involving illiquid currencies. Cross rates computation from daily close USD buy/sell rates.
— blender
http://www.olsendata.com/fileadmin/Publications/Tech_Papers/FXBlenderDoc_01.pdf (C:\0x\xz_ref)
outliers are particularly damaging to market making systems.
input – transaction prices and indicative quotes
output – only indicative quotes

— cross rate calculator

What we compute are backdated cross rates. A PWM (non-trading) client agrees on a currency conversion AAA/BBB. She agrees on the AAA amount, and waits a few hours/days to get the actual BBB amount, just like a cross-currency credit card payment —

MasterCard exchange rates are based on multiple market sources (such as Bloomberg, Reuters, Central Banks, and others). These rates are collected during our daily rate setting process. The exchange rates displayed on the web site are derived from the buy and sell rates included in the MasterCard daily T057 Currency Conversion Rate File.

MasterCard applies the USD as unique reconciliation currency to manage all currency conversions globally. Due to possible rounding differences, the published calculated cross-rates may not precisely reflect the actual rate applied to the transaction amount when converting to the cardholder billing amount. The calculated cross-rates will be loaded onto the MasterCard web site on a daily basis.

MasterCard applies the exchange rate to transactions at the time of settlement, not at the point of authorization of the sale.

–ECN interface
(my code doesn’t use FIX. Our volume is lower. conversion from FIX to xml is encapsulated in a library)

FXall, Currenex, Hotspot, TradeWeb, Bloomberg. Example applications:
* auto quoters
* price feeds
* market data feeds
* post trade feeds
* execution gateways

returning^throwing local object

Tag line – always catch by reference. [[moreEffC++]] has a chapter with this title.

  • If you return a local object by reference, it will lead to a runtime error as the object would be wiped out from the stack. The compiler is likely to give a warning.
  • If you throw a local object and catch by reference up the call stack, it is actually considered best practice, because compiler always clones the local object and throws the clone.
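A minimal sketch of the safe pattern (function name classify is mine): the runtime copies the thrown local object into special exception storage before the stack frame unwinds, so catching by reference is safe and avoids slicing:

```cpp
#include <stdexcept>
#include <string>

std::string classify(int code) {
    try {
        if (code < 0) {
            std::runtime_error local("negative code");  // local object on the stack
            throw local;    // a clone is thrown, not the stack object itself
        }
        return "ok";
    } catch (const std::runtime_error& e) {  // best practice: catch by (const) reference
        return e.what();
    }
}
```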

count allocation strategies of $K into N funds #70%

!! how is this related to the CSY staircase problem?


Q: given K dollars and N distinct mutual funds (including a cash fund), how many unique allocation strategies are there? It’s permissible to allocate $0 to any fund, but total amount in a strategy must add up to exactly K dollars

Note K can be higher or lower than N, both positive integers. Allocation amount must be whole dollars.

Without loss of generality, let’s say K is $5 and N is 3.


I think this is more a mathematical than an algorithm question. Each strategy can be represented by an N-element array. [0,5,0] is one strategy, allocating $5 to the 2nd fund and $0 to the other funds. If we are required to enumerate all the strategies, we can print out the arrays.

My solution starts by defining a mathematical function f(k, n) := count of allocation strategies given k dollars and n funds. Based on such a definition, we can see

  • f(any_integer, 1) == 1 because there’s only one fund to receive the full amount. One strategy only.
  • f(0, any_integer) == 1 because all funds must receive $0 each. No other strategy.

Without loss of generality, let’s say K is 33 and N is 21. Using bottom-up DP, I assert that when we get a “new” fund, the 22nd fund, we can use the earlier answers to find f(33,22):

f($33,22) = f($33,21)·f($0,1) + f($32,21)·f($1,1) + f($31,21)·f($2,1) + … + f($1,21)·f($32,1) + f($0,21)·f($33,1)

Let me explain the 3rd term on the right-hand-side. This term means the count for “$31 into the 21 existing funds” multiplied by the count for “$2 into the new fund”. This describes the scenario of first allocating $2 to the new fund, and then allocating the remaining $31 to the existing funds. The above formula simplifies to

f($33,22) = f($33,21) + f($32,21) + f($31,21) +… + f($1,21) +1

This formula for f(33,22) uses earlier results for f(smaller or equal amounts, one fund fewer). A bottom-up DP algorithm is straightforward. Generalization of this example gives

f(K,n+1) = f(K,n) + f(K-1,n) + f(K-2,n) + f(K-3,n)+..+ f(2,n) + f(1,n) + f(0,n)
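The bottom-up DP above can be coded in a few lines (function name is mine). Observing that f(K, n+1) is a prefix sum of the f(·, n) row, each new fund becomes one in-place pass:

```cpp
#include <vector>

// f(k, n) := count of allocation strategies for k dollars into n funds
long long countStrategies(int K, int N) {
    std::vector<long long> f(K + 1, 1);   // base row: f(k, 1) == 1 for every k
    for (int n = 2; n <= N; ++n) {
        // in-place prefix sum implements f(k, n) = f(k, n-1) + f(k-1, n)
        for (int k = 1; k <= K; ++k) f[k] += f[k - 1];
    }
    return f[K];
}
```

As a sanity check, the closed form is the stars-and-bars count C(K+N-1, N-1), so f(5, 3) = C(7, 2) = 21.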


avoid lopsided BST

Self-balancing is not a theoretical nicety but an essential BST feature. Without it a BST could easily become lopsided and all operations would slow down.

If for any reason (I don’t know any) we can’t use an AVL tree or RBT, then we could randomize the input order and insert the items (without deletion) to get a fairly balanced BST.

intervalTree: classic RBTree to find overlapp`event

Notable detail — non-balancing BST won’t give logN performance, since the tree depth may degrade

Q1: given N event objects {start, end, id}, use a RBTree to support a O(logN) query(interval INT) that returns any one event that overlaps with INT, or NULL if none.

If the end time == start time in the input INT, then it becomes the focus today :

Q2: given N event objects {start, end, id}, use a RBTree to support a O(logN) query(timestamp ii) that returns any one event that’s ongoing at time ii, or NULL if none.

P312 [[intro to algorithms]] describes an elegant solution.

The text didn’t highlight — I think the end time of the input interval INT is … ignored. (Therefore, in my Q2, input ii is a single timestamp, not an event.) Precisely because of this conscious decision, the tree is sorted by event start time, and the additional payload (in each node N) is the subtree end time i.e. the latest end time among all events in N’s subtree (including N itself). By construction, N’s payload is equal to or higher than N’s start time. (Note our input ii can fall into gaps between the event intervals.)

Eg: suppose N starts at 2:22 and the left-child payload says 3:33, then we know for sure there is at least one ongoing event during the interval 2:22 to 3:33. Not sure if this is a useful insight.

The text illustrates and proves why this design enables a clean and simple binary search, i.e. after each decision, we can safely discard one of “my” two subtrees. Here’s my (possibly imperfect) recall:

Let’s suppose each time the current node/event is not “ongoing”. (If it is then we can return it 🙂  ) So here’s the gist of the algorithm:

Suppose i’m the “current node”, which is root node initially. Compare ii to my left child L’s payload (not my payload).

As defined, L’s payload is the highest end time of L + all its sub-nodes. In other words, this payload is the highest end time of all events starting before me.

Note the algo won’t compare L’s start time with ii
Note the algo won’t compare my start time with ii, though the scenario analysis below considers it.

  • case 1: If payload is lower than ii, then we know all events on my left have ended before ii, so we can discard the left subtree completely. Only way to go is down the right subtree.
  • the other case (case 2): if payload is higher or equal to ii, then we know my left subtree contains some events (known as “candd” aka candidates) that will end after ii.
  • case 2a: suppose my start time is before ii; then by BST definition every candidate has started before ii, so a candidate is an ongoing event. We are sure to find the “candd” among them.
  • case 2b: suppose my start time is after ii, then by BST definition, all right subtree events have not started. Right subtree can be discarded
  • In both cases 2a and 2b, we can discard right subtree.
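The algorithm can be sketched as follows. For brevity I use an unbalanced BST (a real implementation would be an RBTree); struct and function names are mine:

```cpp
#include <algorithm>
#include <memory>

struct Event { int start, end, id; };

// augmented BST node: sorted by event start, payload maxEnd is the
// highest end time in this node's subtree (including itself)
struct Node {
    Event ev;
    int maxEnd;
    std::unique_ptr<Node> left, right;
    explicit Node(Event e) : ev(e), maxEnd(e.end) {}
};

void insert(std::unique_ptr<Node>& root, Event e) {
    if (!root) { root = std::make_unique<Node>(e); return; }
    root->maxEnd = std::max(root->maxEnd, e.end);   // maintain payload on the way down
    insert(e.start < root->ev.start ? root->left : root->right, e);
}

// returns any event ongoing at time ii (start <= ii <= end), or nullptr
const Event* query(const Node* cur, int ii) {
    while (cur) {
        if (cur->ev.start <= ii && ii <= cur->ev.end) return &cur->ev;
        if (cur->left && cur->left->maxEnd >= ii)
            cur = cur->left.get();    // case 2: a candidate lives on the left
        else
            cur = cur->right.get();   // case 1: left subtree all ended before ii
    }
    return nullptr;
}
```

Note query() compares ii to the left child’s payload, never to the left child’s start time, exactly as described above, and discards one subtree per step.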


EnterpriseServiceBus phrasebook #ESB

Mostly based on https://www.it.ucla.edu/news/what-esb

  • enterprise — used in big enterprises, but now out of fashion
  • HTTP/MQ — Http is the more popular protocol than MQ
  • MOM — I think this middleware is a separate process. https://en.wikipedia.org/wiki/Enterprise_service_bus#ESB_as_software says MOM vendors call their products ESB
  • async — no synchronous http call between client and server
  • latency — perhaps not popular for real time trading, which prefers FIX
  • SOA — ESB jargon was created along with SOA
  • jxee — you may need to know this jargon in the jxee job market.
  • churn 😦

c++ecosystem[def]questions are tough #DeepakCM

C++ interviewers may demand c++ecosystem knowledge, but java also has its own ecosystem like add-on packages.

As I told my friend and fellow c++ developer Deepak CM,

  1. c++ecosystem QQ questions can be more obscure and tougher than core c++ questions
    • tool chain — compiler, linker, debugger, preprocessor
    • IPC, socket, pthreads and other C-level system libraries
    • kernel interface — signals, interrupts, timers, device drivers, virtual memory + system programming in general # see the blog category
    • processor cache tuning
    • (at a higher level) boost, design patterns, CORBA, xml
    • cross-language integration with python, R, pyp, Fortran + other languages
  2. java ecosystem QQ questions are easier than core java questions. In other words, toughest java QQ questions are core java.
    • java ecosystem questions are often come-n-go, high-churn

Low level topics are tough

  1. c++ ecosystem questions are mostly in C and very low-level
  2. java ecosystem questions are usually high-level
    • JVM internals, GC … are low-level and core java


priorityQ: 2 advantages over RBtree#O(1)add #Part2

RBTree O(1) insert is quite common in coding questions.

[1] Contrary to popular belief, RBTree mass insert (including mass-construction) is O(logN) rather than O(1) per node in the general case. However, see link above.

See lecture notes https://courses.cs.washington.edu/courses/cse373/02au/lectures/lecture11l.pdf and SOF post on

std::priority_queue phrasebook

Mostly based on [[Josuttis]]

  • #include <queue> // this class template is functionally closer to a regular queue
  • container adapter — based on other containers
    • defaults to vector, but deque is fine.
  • random access — base container must provide random access iterator (to support fast sorting)
  • push and pop — both O(logN). https://en.cppreference.com/w/cpp/algorithm/push_heap shows individual insert and delete_max are both logN in worst case
  • make_heap() — linear in worst case, guaranteed:)
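The phrasebook in action (helper name drainDescending is mine): construction from a range uses make_heap, linear in the worst case, and each pop is O(log N):

```cpp
#include <queue>
#include <vector>

// the default priority_queue is a max-heap on a vector, so draining it
// yields a descending sequence; std::greater<int> would flip it to min-first
std::vector<int> drainDescending(std::vector<int> input) {
    std::priority_queue<int> pq(input.begin(), input.end());  // make_heap: O(N)
    std::vector<int> out;
    while (!pq.empty()) {
        out.push_back(pq.top());
        pq.pop();                                             // each pop: O(log N)
    }
    return out;
}
```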

xchg between register/memory triggers fencing

I guess xchg is used in some basic concurrency constructs, but I need to research more.

See my blogpost hardware mutex, based@XCHG instruction

https://c9x.me/x86/html/file_module_x86_id_328.html points out the implicit locking performed by CPU whenever one of the two operands is a memory location. The other operand must be a register.

This implicit locking is quite expensive according to https://stackoverflow.com/questions/50102342/how-does-xchg-work-in-intel-assembly-language, but presumably cheaper than user-level locking.

This implicit locking involves a memory fence.

competitiveness: Pre-teen sprinter + HFT selectivity

I was a decent sprinter at age 6 through 13, then I realized I was not good enough to compete at Beijing municipal level.

There are quite a number of HFT shops in Singapore. I think they have a higher bar than ibanks. At ibank interviews I felt like that pre-teen sprinter again, but HFT interviews are generally too hard for me

op=(): java cleaner than c++ #TowerResearch

A Tower Research interviewer asked me to elaborate why I claimed java is a very clean language compared to c++ (and c#). I said “clean” means consistency and fewer special rules, such that regular programmers can reason about the language standard.

I also said python is another clean language, but it’s not a compiled language so I won’t compare it to java.

See c++complexity≅30% mt java

— I gave the interviewer the example of q[=]. In java, this is either a content update at a given address (for primitive types) or a pointer reseat (for reference types). No ifs or buts.

In c++, q[=] can invoke the copy ctor, move ctor, copy assignment, move assignment, conversion ctor, or conversion operator. In addition to these standard meanings,

  • Its meaning is somewhat special for a reference variable at site of initialization vs update.
  • For (dereferenced) pointers variables, there are additional subtleties.
  • You can even put a function call on the LHS
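A sketch distinguishing three of the overloads that q[=] can invoke (struct and counter names are mine; counters record which special member ran; assumes C++17 for inline statics):

```cpp
struct Probe {
    inline static int convCtor = 0, copyCtor = 0, copyAssign = 0;
    int val = 0;
    Probe() = default;
    Probe(int v) : val(v) { ++convCtor; }                 // Probe p = 42;  (conversion ctor)
    Probe(const Probe& o) : val(o.val) { ++copyCtor; }    // Probe q = p;   (copy ctor)
    Probe& operator=(const Probe& o) {                    // q = p;         (copy assignment)
        val = o.val;
        ++copyAssign;
        return *this;
    }
};
```

The same q[=] token at an initialization site calls a constructor, while at an update site it calls operator=, which is precisely the inconsistency absent from java.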

represent graph: matrix^edgeList #std::forward_list

Josh’s book [[data structures and other objects using c++]] has 5 pages devoted to three best-practice data structures representing a generic directed graph. (I think my [[discrete math]] also covers the same topic.)

Note all these data structures can serialize graphs.

  1. boolean adjacency matrix — adj mat, unable to represent parallel edges between nodes A and B (rare) or self-to-self
  2. edge set — unable to represent parallel edges but can support complex graph algos having risk of traversing the same edge repeatedly. The set can be a field of a graph node, or a hash table {node-> edgeSet}
  3. edge list — can be implemented using std::forward_list, marginally faster than a hashset… not discussed further in this blogpost

pros and cons:

  • iterating all edges connected to a node — edge set wins. adjMat very poor.
  • sparse graph — adj mat wastes memory, not scalable.
  • test linkage between 2 nodes — adj mat and edgeSet are O(1)
  • ease of implementation — adj mat wins

Fundamental classification:

  • In a so-called dense graph, the number of edges ~ O(V*V) i.e. between any pair of nodes there’s likely an edge
  • In a so-called sparse graph, number of edges ~ O(V).

[[algo in a nutshell]] written by some professors, compares matrix and edge list. For a dense graph, P140 describes an intriguing edge list practical implementation to save space and speed up isConnected(node1, node2). Key insight —

   “Number all individual nodes using auto-incrementing ID numbers.”

After this, we will see that a node’s edge list may look like 2,3,4,5, 8,9 … which can be summarized as 2-5, 8-9. I felt this was a trivial trick but it is actually a standard implementation. There’s a tradeoff though — inserting an edge is now slower, but I feel the big-O could remain unchanged.
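The range-compression trick can be sketched in a few lines (function name compress is mine), assuming the node IDs in the edge list are already sorted:

```cpp
#include <utility>
#include <vector>

// collapse a sorted edge list of node IDs into inclusive ranges,
// e.g. 2,3,4,5,8,9 -> (2,5),(8,9)
std::vector<std::pair<int,int>> compress(const std::vector<int>& sortedIds) {
    std::vector<std::pair<int,int>> ranges;
    for (int id : sortedIds) {
        if (!ranges.empty() && ranges.back().second + 1 == id)
            ranges.back().second = id;       // extend the current run
        else
            ranges.push_back({id, id});      // start a new run
    }
    return ranges;
}
```

isConnected(node1, node2) then becomes a binary search over node1’s ranges, and a dense neighborhood costs only one pair instead of thousands of entries.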


ABA problem ] lockfree designs #DeepakCM

  • breaks basic CAS assumptions
  • most common solution is [1]
  • affects c++ CAS only, probably not java CAS

[1] https://lumian2015.github.io/lockFreeProgramming/aba-problem.html and http://moodycamel.com/blog/2014/solving-the-aba-problem-for-lock-free-free-lists explain the “tagged state reference”  aka “tagged pointer“. I feel I don’t need to understand it. Even if I spent some time getting a grasp it may not be enough to impress an interviewer.

http://www.modernescpp.com/index.php/aba-a-is-not-the-same-as-a has more details, including

  • compare_exchange_strong
  • lockfree stack hitting UndefinedBehavior due to ABA
  • RCU

This topic was asked briefly in Deepak’s MS c++ interview in 2019. I think it’s too arcane and low-level for most interviewers.


compile-time ^ run-time linking

https://en.wikipedia.org/wiki/Dynamic_linker describes the “magic” of linking *.so files with some a.out at runtime. This is more unknown and “magical” than compile-time linking.

“Linking is often referred to as a process that is performed when the executable is compiled, while a dynamic linker is a special part of an operating system that loads external shared libraries into a running process”

I now think when the technical literature mentions linking or linker I need to ask “early linker or late linker?”

c++changed more than coreJava: QQ perspective

Recap — A QQ topic is defined as a “hard interview topic that’s never needed in projects”.

Background — I used to feel as new versions of an old language get adopted, the QQ interview topics don’t change much. I can see that in java7, c#, perl6, python3.

To my surprise, compared to java7/8, c++0x has had a more disruptive impact on QQ questions. Why? Here are my guesses:

  • Reason: low-level — c++ is more low-level than java, at least in terms of interview topics. Both java8 and c++0x introduced many low-level changes, but the java interviewers don’t care that much.
  • Reason: performance — c++0x changes have performance impact, esp. latency impact, which is the hot focus of my target c++ employers. In contrast, java8 doesn’t have much performance impact, and java employers are less latency-sensitive.
  • Reason: template — the c++0x feature set has a disproportionate amount of TMP features, which are very hard. No such “big rock” in java.
    • move/forward, enable_if, type traits

Q: if that’s the case, for my career longevity, is c++ a better domain than java?
A: I’m still biased in favor of low-level languages

Q: is that a form of technology churn?
A: yes since the c++11 QQ topics are likely to show up less over the years, replaced by newer features.

UseSpinning jvm flag #pthr/kernel/c#

Hi XR,

Around 2011 I once said a java thread waiting for a lock would not keep grabbing the lock…. Wrong! Actually it does keep grabbing for a while. It runs some useless CPU instructions and checks the lock periodically. This is really spin locking inside the JVM.

Spin locking wastes CPU but frequently results in faster application performance (by reducing context switches). Therefore, the JVM runtime uses some policy to decide how long a thread spins before placing it in some kind of “parking lot” (my terminology) so the thread doesn’t occupy the CPU any more. This spin phase used to be configurable via the boolean UseSpinning JVM flag, but in java7 and beyond, this flag has no effect — the JVM always has the freedom to use spinning. This is described in [[javaPerf2014]]

Comparison 1 — Dotnet SpinWait() provides exactly the same spinlock feature, as described in https://docs.microsoft.com/en-us/dotnet/standard/threading/spinwait

Comparison 2 — In the linux kernel (not userland), 1) spinlock and 2) counting semaphore are the two locking primitives, as described on Page 164 of [[understandingLinuxKernel]]. Both locks block a “grabbing” thread from entering a critical section until another thread leaves it. Similar to my 2011 description, the kernel semaphore remembers all sleeping processes (P174) and wakes one of them when the critical section is clear. I think this kernel semaphore is used by the scheduler in the kernel.

Comparison 3 — In the linux pthreads library (the basis of the linux JVM), the locking constructs are implemented using system calls and userland C functions. Pthreads, as a userland library, probably can’t use the above two kernel constructs directly. A userland C function can implement a spinlock, but to implement “parking” I assume kernel support and system calls are required. So far I assume each pthreads thread maps to a native thread controlled by the kernel. This mapping is now the dominant usage.

The alternative mapping model is green threads, where userland threads are invisible to kernel. Parking has to be implemented in userland.  Nowadays it’s rare to see green threads.

job hopper^stagnant mediocre#CNA

https://www.channelnewsasia.com/news/commentary/career-mobility-new-normal-career-stability-job-hopping-11469760 is a Singapore perspective:

Long gone is the notion of the career ladder, where the ideal CV looks like a narrow, vertical progression. Today’s gold-standard CV looks like a career matrix, with horizontal and vertical moves signifying depth and breadth of experience, skills and exposure to different cultures.

Employers have gone from being cynical about hiring job-hoppers to becoming accustomed to seeing diverse CVs from top talent who are in frequent demand.

double-ptr is rare@@

  • c# ref-param (on a class type, not a value type) is really a double ptr. There is no other way to achieve this purpose.
  • java offers absolutely no support for double-ptr
  • python? not sure. Not common

In conclusion, I now feel

  1. the need for double-ptr is prevalent across languages. One of the most common needs is — main() passes my method a String or List object, and I want to reseat that pointer so that main() will operate on the new String
  2. Despite these common needs, java and probably many newer languages were designed from the ground up to live without double pointers. It’s a conscious decision, an active refusal. I have not needed workarounds, but you may.
  3. Workarounds are plentiful and never too ugly, convoluted or complex
  4. In fact, many would consider (practically) all double-pointers ugly and dangerous for the average programmer. If you are more than an average programmer and want to use double-pointers, then those workarounds are rather easy for you.
  5. Conclusion — in this new era, double-pointers are rarely needed.
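To remind myself what “reseating a caller’s pointer” looks like, here is a c++ toy example of a double ptr (pointer-to-pointer); the names are my own:

```cpp
#include <string>

// Reseating a caller's pointer requires a pointer-to-pointer (or a
// reference-to-pointer). A plain std::string* parameter could not do
// this: assigning to it would only change the callee's local copy.
void reseat(std::string** pp) {
    static std::string replacement = "new string";
    *pp = &replacement;   // the caller's pointer now targets the new object
}
```

A c++ reference-to-pointer parameter (std::string*& pp) achieves the same with nicer syntax — essentially what a c# ref parameter on a class type does.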

Can a .so file get loaded 5min after process starts@@

Q: Can some shared library abc.so file get loaded 5 min after my process pid123 starts?

https://stackoverflow.com/questions/7767325/replacing-shared-object-so-file-while-main-program-is-running says NO. This abc.so file has to be loaded into pid123 memory (then dynamically linked into the executable) before main() is called.

Among the 3 major mechanisms 1) static linker 2) dynamic linker 3) dlopen, only dlopen is able to achieve this purpose, but I’m unfamiliar with dlopen.

If pid123 reads a config file to decide whether to load abc.so, then dlopen is the only solution. I saw such an industrial strength implementation in 2019.
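A minimal dlopen sketch, assuming a Linux host. I use libm.so.6 as a stand-in for the hypothetical abc.so, and the helper name is my own:

```cpp
#include <dlfcn.h>   // POSIX dynamic-loading API (may need -ldl on older glibc)

// Hypothetical loader: returns the address of `symbol` from `libPath`,
// or nullptr on failure. dlopen() may be called at ANY point in the
// process lifetime -- 5 minutes after main() started, if we like.
void* loadSymbolAtRuntime(const char* libPath, const char* symbol) {
    void* handle = dlopen(libPath, RTLD_NOW);
    if (!handle) return nullptr;
    return dlsym(handle, symbol);   // look up the exported symbol
}
```

The caller casts the returned void* to the known function signature, which is exactly how config-driven plugin loading works.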

A remotely related note — The same stackoverflow webpage also shows that after pid123 starts, you can actually remove (and replace) abc.so without affecting pid123, since the old file content is already loaded into pid123 memory.

STL containers implemented +! pointer

  1. std::array
  2. std::string SSO i.e. small-string optimization

In these containers

  1. heap storage is not required
    • trivial special case — if this container is a nonref field of a Shape object, then q(new Shape) would still allocate the container on heap.
    • This is similar to “Are java primitives on heap/stack?”
  2. Therefore they contain no pointers.
  3. Therefore the move-ctor runs in linear [1] time, compared to the constant-time move-ctor of other STL containers
    • std::array probably offers a move-ctor and move-assignment but I don’t have time to research

[1] P205 [[effModernC++]] also addresses —

Q: What happens when you move a std::array container with 77 Acct elements?
A: 77 individual Acct objects are moved to the new container, in linear time, invoking the move-ctor of Acct class
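A tiny illustration of the per-element move (my own example, not from the book):

```cpp
#include <array>
#include <string>
#include <utility>

// std::array stores its elements inline (no heap pointer to steal), so
// its move-ctor must move all N elements one by one -- linear time.
// Contrast with std::vector, whose move-ctor pilfers the heap pointer
// in constant time.
std::array<std::string, 3> makeMoved() {
    std::array<std::string, 3> src{"alpha", "beta", "gamma"};
    std::array<std::string, 3> dst = std::move(src);  // per-element move
    return dst;
}
```

With 77 Acct elements instead of 3 strings, the same per-element move-ctor of Acct runs 77 times.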

##[19]if I were2re-select%%tsn bets #tbudget=30m

##[17] checklist@tsn #engaging provides a checklist but here’s a FRESH look, which may not be so in-depth, balanced, and comprehensive.

Q: Hypothetically, knowing what I know now, beside c++ and python, which trySomethingNew bets would I have chosen in 2010, after GS?

  1. mkt data and sockets
  2. forex, equities, options, IRS — I like the well-defined body of dnlg as entry-barrier. I learned it fast 🙂
  3. trading engine dev including pricing, OMS, connectivity
  4. risk analytics?
  5. devops?
  6. big data including hadoop, cloud etc?
  7. — not-so-great
  8. c# — heavy investment, a lot of legwork but insufficient ROTI
  9. MOM and Gemfire — shrinking demand
  10. swing? Fun but very poor job market
  11. quantDev? extremely disappointing job market
  12. HFT? entry barrier too high

prevent c++compiler reordering statements

http://www.cplusplus.com/reference/atomic/atomic_signal_fence/ says this low-level c++11 function is something like a

directive to the compiler inhibiting it from making optimizations that involve moving writing operations beyond a releasing fence or read operations before an acquire fence.

I doubt there’s any need for this in any application.
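For the record, a minimal usage sketch (my own, based on the documented signature). The fence restrains only the compiler, not the CPU, and is meaningful only when the observer runs on the same thread, e.g. a signal handler:

```cpp
#include <atomic>
#include <csignal>

// Hypothetical publisher: the payload store must not be compiler-reordered
// past the flag store, because a signal handler on this SAME thread may
// observe the flag. atomic_signal_fence emits no hardware fence
// (unlike atomic_thread_fence) -- it is purely a compiler barrier.
volatile std::sig_atomic_t dataReady = 0;
int payload = 0;

void publishForSignalHandler(int v) {
    payload = v;                                           // plain store
    std::atomic_signal_fence(std::memory_order_release);   // compiler-only fence
    dataReady = 1;                                         // flag store
}
```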

##%%c++(n java)memoryMgmt trec : IV

Q: Your resume says “memory mgmt” for c++, so what did you do personally, not as a team?
A: I will talk about the features in my system. Most of them are not written by me, honestly

  • per-class: restricting allocation to heap or stack
  • per-class: customize operator new for my class
  • per-class: disable operator new for my class and provide allocateFromRingBuffer() API
  • per-class: move support
  • static thread_locals
  • ring buffer to eliminate runtime allocation
  • custom smart pointers
  • memory leak detector tools like valgrind+massif .. breakdown heap/non-heap footprint@c++app #massif
  • RAII
  • pre-allocate DTOs@SOD #HFT #RTS
  • customize new_handler() — I didn’t write this

Q: java

  • GC tuning
  • memory sizing
  • JDK tools like jhat
  • memory profiler tools
  • string intern
  • investigating memory leaks, all related to GC

##G3 uncertainties10Y out: #Demand^otherCandd

Let’s be specific — on the competitive landscape over next 10Y, what specific factors are we trying to predict

  • G3: predicting my health, energy, motivation, absorbency, “retention rate” (MSFM…), coding drill visible-progress ..
    • I have 3x more insight on this factor than other factors
  • G5: predicting strength of competing candd (i.e. candidates), often younger
    • one note — my accumulated theoretical and low-level QQ strength will continue to be rare among the young competitors
  • G9: predicting market demand evolution, including churn. Coding IV is the prime example.
    • Note this evolution is not so fast. Read my analysis of MOM and RDBMS

Since 2017, I have tentatively started to shift more focus towards the west coast. As a result, the Competition and Demand factors felt too fast-changing, which made my painstaking analysis look naive and futile.

Critical thinking needed — Suppose I keep my focus on ibanks. In such a case, my analysis and prediction is actually relevant, even valuable. The Competition/Demand factors have not changed drastically.

I’m pretty good at this kind of competitive landscape analysis, but it’s truly tough to predict any of these 3 factors. However, as Warren Buffett (?) observed, those who decide to give up and submit to mindless investing are likely (not guaranteed) to lose out to those who try to analyze and predict.


Nsdq onsite IV #emerging Nsdq architecture

Q: Java NonBlocking IO?
%%A: one thread monitoring multiple sockets

Q: many simple Objects with complex relationships/interactions vs one (or very few) huge complex object?

Q: thread Futures? Any comments?

–some of my findings

  • microservices and containers are important
  • According to the system architect (last interviewer), the new framework uses one dedicated instance of matching engine serving a single symbol such as AAPL.
  • the new java framework only supports a few smaller exchanges initially, but is designed to roll out to more “exchanges” esp. those to be installed on client sites (like CSP). It is (theoretically) designed to be capable enough to implement the main stock exchange.
  • is Mostly java + javascript (Amber, react) with small amount of c++
  • the new java framework was deliberately designed to operate in single-threaded mode, despite opposition. Therefore, multiple “clients” calling the same library concurrently would be unsafe.
  • java8
  • noSQL is adopted in addition to PostgreSQL, but the principal architect is less concerned about data store, which is only used in individual components

error stack trace: j^c++

Without stack trace, silent crashes are practically intractable.

In my java and python /career/ (I think c# as well), exceptions always generate a console stack trace. The only time they didn’t was with a Barclays JNI library written in c++.

In c++, getting the stack trace is harder.

  • when I used the ETSFlowAsert() construct I get a barely usable stack trace, with function names, without line numbers
  • [[safeC++]] described a technique to generate a simple stack trace with some line numbers but I have never tried it
  • the standard assert() macro doesn’t generate stack trace
  • In RTS and mvea, memory access errors lead to seg-fault and core dump. In these contexts we are lucky because the runtime environment (host OS, standard library, seg-fault signal handler ..) cooperates to dump some raw data into the core file, but it’s not as reliable as the JVM runtime. Here are some of the obstacles:
    • core files may be suppressed
    • To actually make out this information, we need gdb + a debug build of the binary with debug symbols.
    • it can take half an hour to load the debug symbols
    • My blogpost %%first core dump gdb session describes how much trouble and time is involved to see the function names in the stack trace.
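For reference, the closest c++ analogue to java’s automatic stack trace is glibc’s backtrace API (Linux-specific; this is my own minimal sketch). Function names show up only if the binary is linked with -rdynamic; line numbers still require gdb plus debug symbols:

```cpp
#include <execinfo.h>   // glibc backtrace API (Linux only)
#include <cstdio>
#include <cstdlib>

// Dump the current call stack to stdout; returns the frame count.
// backtrace_symbols() mallocs the symbol array, so we must free it.
int printTrace() {
    void* frames[64];
    int n = backtrace(frames, 64);
    char** symbols = backtrace_symbols(frames, n);
    for (int i = 0; i < n; ++i)
        std::printf("  %s\n", symbols[i]);
    std::free(symbols);
    return n;
}
```

Calling printTrace() from an assertion-failure handler gives a barely usable trace, much like the ETSFlowAsert() construct mentioned above.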


Those IV victories define me!

Q: how is the Siddesh phenomenon related?
A: I think my dev-till-70 longevity is my strategic advantage. The Siddeshes can’t have it, given their tech interview capabilities.


Without those IV successes, I am likely to feel a total failure, worthless, esp. when comparing myself to the so-called “high-flyers”, virtually all managers in big companies. As grandpa pointed out, these managers belong to an incompatible, unsuitable comparison group.

Nevertheless, I would probably fall into this pit at least once a year, based on my self-knowledge.

Q: with aging, my IV success rates would probably decline gradually. What’s the plan?

  • #1) maintaining my IV body-building “form” and stay “in-shape” is still the #1 most reliable solution
  • nonwork income, mostly from rentals, can help cushion the decline and hedge the loss of self-esteem, but the high credit risk in the high-yield investments can create stress and self-hate (beside the financial impact)
  • favor longevity jobs — Questionable! I feel very unsure about this strategy, after licking multiple wounds for years due to repeated PIPs.
  • specialize in low-churn, high-accu domains — but very tricky to find one. So far coreJava is the best, with c++ showing many problems. BondMath, MktData are some of the dnlg domains I favor.

zbs=vague,cf to QQ+GTD

Many zbs learning experiences (mine or DQH’s) were arguably irrelevant if never quizzed.

–iv questions on malloc, tcmalloc and the facebook malloc
In latency benchmarks, c++ should simply 1) avoid DMA (dynamic memory allocation) or 2) pre-allocate. Therefore, if the iv question touches on benchmarks, then the malloc question becomes moot.

–iv questions on lockfree in c++
messier and more low-level than java lockfree. I would say lockfree is seldom written by a c++ app developer. Therefore not a GTD topic at all. Purely QQ halo topic.

–iv question on volatile
Again, many interviewers treat volatile as a concurrency feature in c++, but that usage is non-standard! Even if it has a side effect under gcc, the practice is “not something I would recommend”, though I won’t criticize a hiring team’s current practice.

Not a GTD skill, not zbs either.

–iv question on throwing dtor
The c++ standard says UDB, but some interviewers still ask about it. So this topic is not zbs, not GTD, but really QQ.

–iv question on debugger breakpoint
Not GTD, not even zbs

–iv question (GS-HK) on CAS no slower than uncontended lock acquisition
I feel I don’t need to know how much faster. My knowledge is good enough for GTD

–iv question on how many clock cycles per instruction
no GTD but zbs for a latency enthusiast


unspecified^undefined behavior

Some interviewers really go deep into UDB, so to impress them it is very useful to understand UnSpecifiedBehavior (USB):

Compliant compilers are not required to document the behavior, but each outcome must still be valid (not, say, a crash). For a given construct exhibiting USB, the Standard provides two or more possibilities (i.e. permissible outcomes), and an executable from a compliant compiler must produce one of these outcomes — possibly a different one on each occurrence.

Compared to USB, UDB gives compilers far more freedom. The actual result could be anything — an invalid result in one instance, and in another instance apparently working fine.
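A concrete c++ illustration of the difference (my own example): the evaluation order of the two operands of + is unspecified — either order is permissible, and the program stays valid under both. UB, by contrast, would permit literally anything:

```cpp
// Unspecified behavior: whether f() or g() is evaluated first below is
// unspecified -- the compiler must pick one of the two valid orders, but
// need not document (or keep consistent) which. The sum is 3 either way.
int callCount = 0;
int f() { ++callCount; return 1; }
int g() { ++callCount; return 2; }
int sumUnspecifiedOrder() { return f() + g(); }

// Undefined behavior examples (deliberately NOT executed here): signed-int
// overflow, or reading *p after `delete p` -- anything may happen,
// including "seems to work".
```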

RTS pbflow msg+time files #wait_until #60%

Background — In RTS we relied every day on a UDP message replay tool, taking in a msg file and a corresponding time file. Both are binary files. I saw no message delimiters in the msg file, partly because this file is produced in production — inserting delimiters would add overhead in the production parser engine.

Given there’s no delimiter, I believe the time file must hold a seek-offset value (marker) corresponding to each timestamp.

Q: how would you implement a similar replay tool?

Driver is the timestamp file. Every time I see a timestamp, I would wait_until() that timestamp (adjusted to my start time). Upon wake-up, I would send the corresponding chunk of bytes between current and next markers.

I would use condition_variable::wait_until(). As noted in c++condVar 2 usages #timedWait, this function has nanosec precision.

(The mutex needed in wait_until? My program is single-threaded, so the dummy lock would be held forever.)

The chunk of bytes would be sent as one UDP packet. No split. Each chunk should start with the original packet header created by the matching engine and received in the original feed.
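A sketch of the replay loop under my assumptions about the record format (the Entry layout and the names are invented). I use this_thread::sleep_until as a single-threaded stand-in for condition_variable::wait_until — both take an ABSOLUTE deadline, which is the key point:

```cpp
#include <chrono>
#include <cstddef>
#include <thread>
#include <vector>

// One record of the (hypothetical) time file: original capture timestamp
// plus the seek-offset of the corresponding chunk in the msg file.
struct Entry { std::chrono::nanoseconds ts; std::size_t offset; };

// Replay: wait until each absolute deadline (start time + original gap),
// then send the byte range [offset, nextOffset) as one UDP datagram.
// Absolute deadlines mean per-packet send latency cannot accumulate.
template <typename SendFn>
void replay(const std::vector<Entry>& timeIndex, std::size_t msgFileSize,
            SendFn send) {
    using Clock = std::chrono::steady_clock;
    if (timeIndex.empty()) return;
    const auto start = Clock::now();
    const auto firstTs = timeIndex[0].ts;
    for (std::size_t i = 0; i < timeIndex.size(); ++i) {
        std::this_thread::sleep_until(start + (timeIndex[i].ts - firstTs));
        std::size_t end = (i + 1 < timeIndex.size()) ? timeIndex[i + 1].offset
                                                     : msgFileSize;
        send(timeIndex[i].offset, end);   // one chunk = one UDP packet
    }
}
```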

Q2: how about TCP?

TCP receiver can deal with partial messages very effectively, as I saw in my tests.

However, the TCP receiver can introduce delays in the sender (i.e. my replay tool), so the transmission will not be instantaneous. In fact, send() can block! This could lead to drift in the timer. That’s why I need wait_until()

Q3: Explain timer drift?

Suppose in the time file, Timestamp #1 is 9:00:00.000001 and Timestamp #2 is one microsec later, but the data transmission can take 10 nanosec, esp. with a blocking TCP send(). This 10 nanosec is a small drift, but it adds up to microseconds.
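Back-of-envelope arithmetic for the drift (my own toy functions): with relative waits, the per-packet send overhead accumulates without bound; with absolute deadlines (wait_until / sleep_until), each packet’s overhead is absorbed into the next wait:

```cpp
#include <cstdint>

// RELATIVE waits: each send's overhead is added on top of the next gap,
// so after N packets the schedule lags N * overhead.
int64_t lagAfterRelativeWaitsNs(int64_t packets, int64_t overheadNs) {
    return packets * overheadNs;   // grows without bound
}

// ABSOLUTE deadlines: packet k's overhead merely shortens the wait
// before packet k+1, so the lag stays bounded by one packet's overhead.
int64_t lagAfterAbsoluteWaitsNs(int64_t /*packets*/, int64_t overheadNs) {
    return overheadNs;
}
```

With 10 ns of overhead per packet, a million packets lag by 10 ms under relative waits, versus a bounded 10 ns under absolute deadlines.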

1)num-theory 2)combinatorics: most-quizzed math

  • For regular programmers, the most-quizzed non-trivial subdomains of math are 1) number theory and 2) combinatorics.
  • Quant developer interviews are a different game, but these subdomains are still highly relevant.

These math sub-domains would have remained obscure if programming (as a profession) had not become this widespread, mainstream, lucrative (油水) and economically important. Since the 2000’s, more and more IV focus has shifted to math, largely in these two sub-domains.

This trend is bad news for the GTD types like Ashish/CSY, the domain experts and the managers like Shuo etc

I think some strong STEM students don’t bother with coding skill since it is less emphasized, less quizzed in school.

I suspect that in some colleges, math is regarded as more refined, more classic than coding skill in any language such as python or javascript. Coding is considered blue-collar.

java setting -Xmx Twice


$ java -XX:+PrintFlagsFinal -Xmx2G -Xmx1G 2>&1 | grep MaxHeap # shows the 2nd setting overrides the first

uintx MaxHeapSize                              := 1073741824                          {product}

export JAVA_OPTIONS=-Xmx…. # would append a -Xmx; the last -Xmx value overrides previous ones



restatement in FIX: triggers

http://fixwiki.org/fixwiki/ExecutionReport — represents an ExecutionRpt sent by the sellside communicating a change in the order (or a restatement of the order’s parameters) WITHOUT an electronic [1] request from the buyside.

“unsolicited order replace”

[1] this exec report can be triggered by a non-electronic request from buyside

A less common trigger — repricing of orders by the sellside such as making Sell Short orders compliant with uptick / downtick rules

%%tsn dream: largely constrained by risks


  • stigma/respect
  • figure things out faster than colleagues

For these specific risks, my current risk tolerance is 2% higher so I can be bolder, but the blow of a stigma is still rather heavy.

Note these risks exist with/without tsn. If we want to minimize these risks we need to find a low-calibre team like RTS, or OC (95G?), sometimes low-salary team

fractional shares: quite common]basket trading

https://www.investopedia.com/terms/f/fractionalshare.asp explains that fractional shares can be naturally-occurring, but I want to talk about artificial fractional shares in “normal” stocks.

Custom baskets are traded as equity-swaps. Clients specify a weightage profile of the basket, like x% IBM + y% AAPL + z% GE where x+y+z == 100. The sell-side dealer (not a broker) holds the positions on its own book, but on behalf of clients.

Such a basket is not listed on any exchange (unlike ETFs). If a client were to go long such a basket directly, due to limited liquidity in some constituent stocks she may not get the desired weightage. In contrast, a sell-side has more liquidity access, including smart-order-routers and internal execution.

The weightage means some stocks will have fractional quantities in a given basket. Also, such a basket could be too big in dollar amount, like $58,476, so the sell-side often uses a divisor (like 584) to create a unit price of about $100, similar to a unit price in a mutual fund. Fractional quantities are even more inevitable in one unit or in 391 units.

After a client buys 391 units she could sell partially.

With mutual funds, I often buy fractional units as well, like 391.52 units.
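The unit-price arithmetic, as I understand it (the divisor and prices reuse the illustrative numbers above; the helper names are mine):

```cpp
#include <cmath>

// A $58,476 basket divided by 584 gives a unit price near $100.
double unitPrice(double basketValue, double divisor) {
    return basketValue / divisor;
}

// Dollar allocation of one unit to a stock (by weightage), divided by the
// share price -- almost inevitably a fractional share quantity.
double sharesPerUnit(double weightPct, double basketValue,
                     double divisor, double sharePrice) {
    return (weightPct / 100.0) * unitPrice(basketValue, divisor) / sharePrice;
}
```

For example, a 30% weightage in a $150 stock gives roughly 0.2 shares per unit — a fractional quantity the sell-side must carry on its book.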

Every useful codebase requires never-ending maintenance #mod_rewrite

Imagine you maintain any mature, stable software product with non-trivial complexity, used by 100+ internal users or 1000+ external users. I would venture to predict that maintenance effort for such a product would never end.

  • New feature requirements — The bigger and more diverse your user base, the less likely you are to stay stagnant. Stagnation would lead to losing relevance gradually. Consider dinosaurs. If you don’t evolve, then other products would catch up, absorb your best features and make you /irrelevant/.
  • New maintainers — have a need to understand and learn your source code, otherwise in 20 years no one will be available to answer user questions or maintain the codebase. New maintainers shall see room for improvements and modernization (like adopting new technologies). After a few changes of guard, they could vote for a rewrite.
  • Imperfections — We wish all the bugs get ironed out X years after active development, but sadly, bugs never die off completely.
  • New users — come on-board, ask (rarely new) questions that require assistance, which is part of maintenance effort
  • Documentation — require periodic update and modernization. Otherwise, new readers would find the documentation old-fashioned.
  • testing — testing never ends. There will always be better ways to test your product. If lucky, they would uncover hitherto unknown bugs.

Therefore, there’s budget allocated to the system as long as there are important users. Budget creates jobs and employment. Some of these jobs are age-friendly.

I used to think apache mod_rewrite was written once and needs no update. I doubt it now.