const data member initialization: simple on the surface

The well-known Rule 1 — a const data member must be initialized exactly once, no more no less.

The lesser-known Rule 2 — for class-type data member, there’s an implicit default-initialization feature that can kick in without us knowing. This default-init interacts with ctor initializer in a strange manner.

On a side note, [[safeC++]] P38 makes clever use of Rule 2 to provide primitive wrappers. If you use such a wrapper in place of a primitive field (non-const), then you eliminate the operational risk of “forgetting to initialize a non-const primitive field

The well-known Rule 3 — the proper way to explicitly initialize a const field is the ctor initializer, not inside ctor body.

The lesser-known Rule 4 — at run-time, once control passes into the ctor body, you can only modify/edit an already-initialized field. Illegal for a const field.

To understand these rules, I created an experiment in

— for primitive fields like int, Rule 2 doesn’t apply, so we must follow Rule 1 and Rule 3.

— for a class-type field like “Component”,

  • We can either leave the field “as is” and rely on the implicit Rule 2…., or
  • If we want to initialize explicitly, we must follow Rule 3. In this case, the default-init is suppressed by compiler.

In either case, there’s only one initialization per const field (Rule 1)

c++nlg pearls: xx new to refresh old 知新而后温故

Is this applicable in java? I think so, but my focus here is c++.

— 温故而知新 is less effective at my level. thick->thin, reflective.

— 知新而后温故 — x-ref, thin->thick->thin learning.

However, the pace of learning new knowledge pearls could appear very slow and disappointing. 5% new learning + 95% refresh. In such a case, the main benefit and goal is the refresh. Patience and Realistic expectation needed.

In some situations, the most effective learning is 1% new and 99% refresh. If you force yourself to 2% new and 98% refresh, learning would be less effective.

This technique is effective with distinct knowledge PEARLS. Each pearl can be based on a sentence in an article but developed into a blogpost.


fine print ] source code

Q1: “Here is 95% of the logic” — In what contexts is such a documentation considered complete? Answered at the end.

When we programmers read source code and document the “business logic” implemented thereby, we are sometimes tempted to write “My write-up captures the bulk of the business logic. I have omitted minor details, but they are edge cases. At this stage we don’t need to worry about them”. I then hope I have not glossed over important details. I hope the omitted details are just fine prints. I was proven wrong time and again.

Sound byte: source code is all details, nothing but details.

Sound byte: Everything in source code is important detail until proven otherwise. The “proof” takes endless effort, so in reality, Everything in source code is important detail.

The “business logic” we are trying to capture actually consists of not only features and functionalities, but functional fragments i.e. the details.

When we examine source code, a visible chunk of code with explicit function names, variable names, or explicit comments are hard to miss. Those are the “easy parts”, but what about those tiny functional fragments …

  • Perhaps a short condition buried in a complicated if/while conditional
  • Perhaps a seemingly useless catch block among many catches.
  • Perhaps a break/continue statement that seems serve no purpose
  • Perhaps an intentional switch-case fall-through
  • Perhaps a seemingly unnecessary sanity check? I tend to throw in lots of them
  • Perhaps some corner case error handling module that look completely redundant and forgettable, esp. compared to other error handlers.
  • Perhaps a variable initialization “soon” before an assignment
  • Perhaps a missing curly brace after a for-loop header

( How about the equal sign in “>=” … Well, that’s actually a highly visible fragment, because we programmers have trained vision to spot that “=” buried therein. )

Let me stress again. The visibility, code size … of a functional fragment is not indication of its relative importance. The low-visibility, physically small functional fragments can be equally important as a more visible functional fragment.

To the computer, all of these functional fragments are equally significant. Each could have impact on a production request or real user.

Out of 900 such “functional fragments”, which ones deal with purely theoretical scenarios that would never arise (eg extraneous data) .. we really don’t know without analyzing tons of production data. One minor functional fragment might get activated by a real production data item. So the code gets executed unexpectedly, usually with immediate effect, but sometimes invisibly, because its effect is concealed by subsequent code.

I would say there are no fine-prints in executable source code. Conversely, every part of executable source code is fine print, including the most visible if/else. Every executable code has a real impact, unless we use real production data to prove otherwise.

A1: good enough if you have analyzed enough production data to know that every omitted functional fragment is truly unimportant.

OO-modeling: c++too many choices

  • container of polymorphic Animals (having vtbl);
  • Nested containers; singletons;
  • class inheriting from multiple supertypes ..

In these and other OO-modeling decisions, there are many variations of “common practices” in c++ but in java/c# the best practice usually boils down to one or two choices.

No-choice is a Very Good Thing, as proven in practice. Fewer mistakes…

These dynamic languages rely on a single big hammer and make everything look like a nail….

This is another example of “too many variations” in c++.

jGC heap: 2 unrelated advantages over malloc

Advantage 1: faster allocation, as explained in other blogposts

Advantage 2: programmer can "carelessly" create an "local" Object in any method1, pass (by reference) the object into other methods and happily forget about freeing the memory.

In this extremely common set-up, the reference itself is a stack variable in method1, but the heapy thingy is "owned" by the GC.

In contrast, c/c++ requires some "owner" to free the heap memory, otherwise memory would leak. There’s also the risk of double-free. Therefore, we absolutely need clearly documented ownership.

expertise: Y I trust lecture notes more than forums

Q: A lot of times we get technical information in forums, online lecture notes, research papers. why do I trust some more than others?

  1. printed books and magazines — receive the most scrutiny because once printed, mistakes are harder to correct. Due to this fundamental reason, there is usually more scrutiny.
  2. research papers — receive a lot of expert peer review and most stringent editorial scrutiny, esp. in a top journal and top conference. There’s also internal review within the authors’ organization, who typically cares about its reputation.
  3. college lecture notes — are more “serious” than forums,
    • mostly due to the consequence of misinforming students.
    • When deciding what to include in lecture notes, many professors are conservative and prudent when adding less established, less proven research findings. The professor may mention those findings verbally but more prudent about her posted lecture notes.
    • research students ask lots of curious questions and serve as a semi-professional scrutiny of the lecture notes
    • lecture notes are usually written by PhD holders
  4. Stackoverlow and Quora — have a quality control via voting by accredited members.
  5.  the average forum — Least reliable.

SDI: 3 ways to expire cached items

server-push update ^ TTL ^ conditional-GET # write-through is not cache expiration

Few Online articles list these solutions explicitly. Some of these are simple concepts but fundamental to DB tuning and app tuning. compares write-through ^ write-behind ^ refresh-ahead. I think refresh-ahead is similar to TTL.

B) cache-invalidation — some “events” would trigger an invalidation. Without invalidation, a cache item would live forever with a infinity TTL, like the list of China provinces.

After cache proxies get the invalidation message in a small payload (bandwidth-friendly), the proxies discard the outdated item, and can decide when to request an update. The request may be skipped completely if the item is no longer needed.

B2) cache-update by server push — IFF bandwidth is available, server can send not only a tiny invalidation message, but also the new cache content.

IFF combined with TTL, or with reliability added, then multicast can be used to deliver cache updates, as explained in my other blogposts.

T) TTL — more common. Each “cache item” embeds a time-to-live data field a.k.a expiry timestamp. Http cookie is the prime example.

In Coherence, it’s possible for the cache proxy to pre-emptively request an update on an expired item. This would reduce latency but requires a multi-threaded cache proxy.

G) conditional-GET in HTTP is a proven industrial strength solution described in my 2005 book [[computer networking]]. The cache proxy always sends a GET to the database but with a If-modified-since header. This reduces unnecessary database load and network load.

W) write-behind (asynchronous) or write-through — in some contexts, the cache proxy is not only handling Reads but also Writes. So the Read requests will read or add to cache, and Write requests will update both cache proxy and the master data store. Drawback — In distributed topology, updates from other sources are not visible to “me” the cache proxy, so I still rely one of the other 3 means.

TTL eager server-push conditional-GET
if frequent query, in-frequent updates efficient efficient frequent but tiny requests between DB and cache proxy
if latency important OK lowest latency slower lazy fetch, though efficient
if in-frequent query good waste DB/proxy/NW resources as “push” is unnecessary efficient on DB/proxy/NW
if frequent update unsuitable high load on DB/proxy/NW efficient conflation
if frequent update+query unsuitable can be wasteful perhaps most efficient


single-writer lockfree data structure@@

Most of the lockfree data structures I have seen have 2 writer threads or more.

Q: what if in your design, there can be at most one writer thread but some reader threads. Any need for lockfree?
A: I don’t think so. I think both scenarios below are very common.

  • Scenario: If notification required, then mutex + condVar
  • Scenario: In many scenarios, notification is not needed — Readers simply drop by and read the latest record. In java, a volatile field suffices.


clean language==abstracted away{hardware

Nowadays I use the word “clean” language more and more.

Classic example — java is cleaner than c# and c++. I told a TowerResearch interviewer that java is easier to reason with. Fewer surprises and inconsistencies.

The hardware is not “consistent” as wished. Those languages closer to the hardware (and programmers of those languages) must deal with the “dirty” realities. To achieve consistency, many modern languages make heavy use of heap and smart pointers.

For consistency, Everything is treated as a heapy thingy, including functions, methods, metadata about types …

For consistency, member functions are sometimes treated as data members.

freelist in pre-allocated object pool #DeepakCM

Deepak CM described a pre-allocated free-list used in his telecom system. is my self-tested implementation. Am proud of the low-level details that I had to nail down one by one.

He said his system initialization could make 400,000 new() calls to allocate 400,000 dummy objects and put them into a linked list. Personally, My design /emulates/ it with a single malloc() call. This is at startup time.

During the day, each new msg will overwrite [1] a linked node retrieved at the Head of the slist.

[1] using operator=(). Placement-new would be needed if we use a single malloc()

Every release() will link-in the node at the Tail of the slist. Can we link it in at the Head? I think so. Benefit — It would leave a large chunk of continuous free space near the tail. Improved Fragmentation.

Illustration — Initially the physical addresses in the slist are likely consecutive like addr1 -> addr2 -> addr3…. After some release() calls, it would look like a random sequence.

Using return-to-Head, I would get

  • pop, pop, pop: 4->5->..
  • rel 2: 2->4->5..
  • pop: 4->5->….
  • rel 1: 1->4->5…

— The API usage:

  • void ffree(void *)
  • void * fmalloc(size_t), possibly used by placement new

key^payload: realistic treeNode #hashset

I think only SEARCH trees have keys. “Search” means key search. In a search tree (and hashset), each node carries a (usually unique) key + 0 or more data fields as payload.

Insight — Conceptually the key is not part of the payload. Payload implies “additional information”. In contrast, key is part of the search tree (and hashset) infrastructure, similar to auto-increment record IDs in database. In many systems, the key also has natural meaning, unlike auto-incremented IDs.

Insight — If a tree has no payload field, then useful info can only exist in the key. This is fairly common.

For a treemap or hashmap, the key/value pair can be seen as “key+payload”. My implementation saves a key/value pair in each bucket.

A useful graph (including tree , linked list, but not hashset) can have payloads but no search keys. The graph nodes can hold pointers to payload objects on heap [1]. Or The graph nodes can hold Boolean values. Graph construction can be quite flexible and arbitrary. So how do you select a graph node? Just follow some iteration/navigation rules.

Aha — similar situation: linked list has well-defined iterator but no search key . Ditto vector.

[1] With pre-allocation, a static array substitutes for the heap.

Insight — BST, priorityQ, and hash tables need search key in each item.

I think non-search-TREES are rarely useful. They mostly show up in contrived interview questions. You can run BFT, pre|post-order DFT, but not BF S / DF S, since there’s nothing to “search”.

Insight — You may say search by payload, but some trees have Boolean payloads.

homemade clustered sparse array

O(1) lookup and Random Access is hard to achieve among sparse array implementation. Here my sparse array is a special subset of sparse arrays — the populated sections come in clusters. For with special subset, I hope Random access is feasible. My design is Fragment table:

  • My fragment table is a std::deque to hold raw pointers to “fragment arrays” not vectors. I choose the word “fragment” since std::deque has its own internal “segment” that is completely irrelevant.
  • elements 0 to 99 —  live in a 100-array, whose address is saved in fragmentTable[0]
  • elements 100 to 199: live in a 100-array, whose address is saved in fragmentTable[1]
  • elements 200 to 299: live in a 100-array, whose address is saved in fragmentTable[2]
  • elements 500 to 599: all missing, so a special ptr is is saved in fragmentTable[5]
  • elements 603 to 605 are present. A special ptr is is saved in fragmentTable[6]. This fragment consists of a small-footprint 3-array. This fragment also has a field this->offset=3
  • Note the fragmentTable is a deque of void pointers. Each void pointer can be a regular plain old array or a special pointer.

— As a bonus, this design supports unlimited append. Here’s a limited prepend() procedure

  1. to prepend a value X in front of the current sparseArr[0], we first allocate a new 100-array as a new fragment whose offset=99. This new fragment supports 99 future prepend() operations, but each time offset needs decrementing.
  2. save X as the last element in the new fragment.
  3. record the new fragment’s address in a new fragmentTable[0]. This is possible because the fragmentTable is a std::deque.

Note I can also allocate a small-footprint new fragment, iFF no further prepend() is allowed.

— Cornerstone of my design — all fragments have Identical range=100 inspired by the std::deque and the ragged 2D array. This enables simple and fast random access. The underlying arrays are the same size with few exceptions — empty fragments, and tiny-n-frozen fragments.

If one of the “tiny fragments” is expected to get populated in the future i.e. not frozen, then it should be allocated as a full array with bigger footprint. The wasted space should not be wasted for long.

Even if this design is not flexible and general-purpose, it meets my design goals so I’m proud of it.

Many implementations use linked fragments, where each fragment has a “next_fragment” pointer. I don’t see any benefit.

I don’t want to look at pure-java implementation as java array has many restrictions.

pre-allocated array as backing store for graph nodes #java No

I think any graph node can use the same technique, but here I present a simple yet interesting use case — a linked list with each node allocated from an array. shows three home-made implementations:

  1. backing array of dummy link nodes, pre-allocated at compile time
  2. backing array of dummy link nodes, pre-allocated from free store aka DMA
  3. backing array is a byte array on heap or data section. Each link node is constructed via placement-new.

Here are a few Advantages that I consider minor because linked list is seldom needed in low-latency

  1. d-cache efficiency
  2. eliminates runtime load on heap allocator, since memory is pre-allocated. See malloc=long considered costly

Advantage #3: For c++ algo questions, this set-up has an interesting advantage — The node address is now an index into the backing array. This index is a natural auto-increment ID , based on creation order.

Now, the biggest advantage of linked list over vector is mid-stream insert/delete. One of the biggest disadvantages is lack of random-access. If nothing happens mid-stream (as in coding questions), then we can achieve random-access-by-id using array as backing store.

If nothing happens mid-stream, then this linked list is physically similar to an array with extra capacity.

This technique won’t work in java because java array of Node is array-of-pointers.

T.isLikeSubtree(S) #60%

Q (Leetcode 572): Given two non-empty binary trees s and t, check whether tree t has exactly the same structure and node values as a subtree of s. A subtree of s is a tree consists of a node in s and all of this node’s descendants. The tree s could also be considered as a subtree of itself.

====analysis relies (as “primary check”) on the payloads of  each tree node, but in some trees, all payloads are empty or all payloads are either True or False. In these cases, the comparison of payload is only usable as a secondary check. The primary check must be structural. See key in a tree node

The O(N+K) is plain wrong.

I guess the 2nd solution (and possibly 1st solution) would compare every node in S to the root of T. I think there are more efficient solutions using subtree size and subtree height as secondary checks – more reliable than payload check.

My solution below uses BFT + pre/post/in-order walk !

— Preliminary step: post-order walk to get subtree-size, subtree-height at each S node + T root. (I will skip other T nodes.). Suppose T is size 22 height 4. We will look for any node Y of size 22 and height 4 and a matching payload. This would eliminate lots of S nodes:

If T height is more than 2, then lots of low-level S nodes are eliminated.
If T height is 2 or 1, then T size would be at most 3. Most high-level S nodes are eliminated.

— Solution 1: For both T and S, We take in-order walk to assign incremental IDs, then take pre-order walk to produce an array of IDs that represent the tree structure.

Can We run a level-aware BST. Only one level need to be examined … wrong!

I think the in-order walk itself is what I need. Find any node Y in S that matches size+height+payload of T root. Suppose ID(Y)=44 but ID(T root) = 4, then simply shift down by 40 and do a linear scan. Must check payload, but not height/size.


QQ study=burning_joy

See also the pleasure^chore framework

Unbelievably, QQ study feels like “burning” pleasure. After I complete some chore like coding drill, tax, I long for some QQ study !

There are very few burning pleasures:

  • tech study including zbs, halo…
  • localSys hacking — camp_out
  • weekend coding assignments
  • jogging in stadium
  • learning Oracle or unix in my younger days — camp_out

4 common mem techniques]%%HRT assignment

Technique: reinterpret_cast

Technique: memcpy

Technique: pre-allocated buffer as local static objects … can be better than "stack" allocation only for a large buffer.

Unlike local variables, local Static objects can be returned by pointer — major flexibility in low-level memory management. discusses the runtime cost of static local vs local

vector internal array: always on heap; cleaned up via RAII

Across languages, vector is the #1 most common and useful container. Hashtable might be #2.

I used to feel that a vector (and hashtable) can grow its “backing array” without heap. There’s global memory and stack memory and perhaps more. Now I think that’s naive.

  • Global area is allocated at compile time. The array would be fixed in size.
  • For an array to be allocated on stack, the runtime must know its size. Once it’s allocated on that stack frame, it can’t grow beyond that stack frame.
  • In fact, any array can’t grow at all.

The vector grows its backing array by allocating a bigger array and copying the payload. That bigger array can’t be in global area because this type on-demand reallocation is not compile-time.

Anything on heap is accessed by pointer. Vector owns this hidden pointer. It has to delete the array on heap as in RAII. No choice.

Now the painful part — heap allocation is much slower than stack allocation. Static allocation has no runtime cost at all after program starts.

c++toolChain: archaic for new coders #%%advantage

The modern languages all feature dramatically simplified tool chain. In contrast, c++ tool chain feels daunting to me, including profilers, static analyzers, binary file dumpers, linkers .. Just like manual cars, too complicated compared to newer languages like java. The dynamic scripting languages are even simpler.

This is one of the real obstacles to new entrants, young or old. This is also my (slowly growing) competitive advantage. I feel some people (like Kevin of Macq) know more, but most developers have a cursory working knowledge in this field. Just as I prefer command line, I often feel more confident (relative to my peers) with c++ tool chain

This learning curve, entry barrier … is a direct consequence to the c++ “sweet spot” as Stroustrup described — inherently complex codebase close to hardware.

.. Due to the bigger ecosystem needed to support c++, new features are added at a slower pace than languages having a central organization.

— personal xp:

In my early days using c++, I tried to set up eclipse CDT and spent lots of time on the tool chain. My goal was to set up similar convenience … Futile. Not worthwhile. java tools are miles ahead. Most c++ programmers don’t bother with such convenience and rely on command line tools.

I wrote dozens of blogposts about c++ build issues. For example, on windows, my strawberryPerl + git_bash + notepad++ setup is unknown to many. These fellow developers struggle with MSVS or Eclipse !

I was frustrated for years by the complex and messy build tools in c++. Same for the other new entrants — Rahul spent a month setting up Eclipse CDT…

— mileage as an entry barrier:

This toolchain is an entry barrier. Young people can take it up if determined, but majority of them are put off.

Q: A non-compSci graduate on a mid-career boot camp can take up programming in a scripting language or a simpler compiled language, but how many percent would take up c++?

There’s a minimum mileage required. Many young people don’t stay on it long enough.

My vi mileage is also too short.

Similarly, yoga is hard for most people but some individuals are determined and therefore able to overcome the initial hump (6-24M) and find joy and reward.


when2introduce new tokens into O()

The extra tokens like A  B C in a O(A+BB+logC) are not always independent dimensions.

  • Case: DFT/BFT have O(V+E) but some people may say V <= E, so why not remove V and say O(E).

I think this estimate misses the point that E could greatly outnumber V. If another algo is O(V+logE) it would probably win.

  • case: radix sort is O(WN), but some may say W <=N, so why not remove W and say O(NN)

Well, W is typically log(unique count among N), though W can be half N.

Some bigO students don’t like too many tokens and may simplify it to O(KK), but I think it’s imprecise. If M or N is very small like logK, then O(MN) is much faster than O(KK).

baseclass(template)use subclass field+!vptr

In my baseclass template is able to use a subclass field. This reusable technique is reusable in a very restrictive context —

  1. the base and sub classes represent market data messages, used in a reinterpret_cast context, with zero padding. Therefore, a proposed vptr would add a pointer and mess up reinterpret_cast.
  2. multiple (say 11) subclasses have a “qty” field, so I don’t want code duplication 11 times
  3. The order among the fields is fixed in struct definition. Without this restriction, I would define the “price” field in a base struct and also a getPrice() method. With a subclass instance, CRTP could probably work like static_cast<Subclass const*>(ptr)->getPrice() but the “price” field’s physical offset would be be dictated by the compiler not according to my struct definition

My technique uses CRTP but no SFINAE no enable_if.

My technique is easier to /internalize/, as it relies on simple overload resolution + simple type deduction. In contrast, the SFINAE technique used in my RTS codebase is off-putting and alien to me

FizzBuzz in O(N)

Q (Leetcode ) Write a program that outputs the string representation of numbers from 1 to n. But for multiples of three it should output “Fizz” instead of the number and for the multiples of five output “Buzz”. For numbers which are multiples of both three and five output “FizzBuzz”.

What if the divisors in the requirement is not two but much more and each a prime number?


In the standard solution, there’s a hashtable O(NK) algo — for every int 1 to N, check the K divisors.  Here’s a O(N) solution:

  • first populate the return vector with the basic strings
  • Then make a strided iteration to append “Fizz” on every multiple of 3.
  • Then make a strided iteration to append “Buzz” on every multiple of 5.

The fact that prime numbers become very high very soon help our bigO:

N/3 + N/5 +… + N/31 + N/37… will be lower than 5N

hashtable resizing ≅ jGC jitter #unpredicatable

“Note because of the rehashing issue – a realtime applications and applications that need low latency- should not use a hash table as their data structure.” — said one guy on StackOverflow.

I would say if we have relative confidence about total data size (like below 1 million) then we can still use hash table. A benchmark test is not hard to set up, to compare against alternative designs such as RBTree. For large data sizes, I think hash table would win.

Note in latency sensitive systems, resizing (and stop-the-world GC) is likely to happen at busiest time.

Can we trigger a resize? Just send in more data to build up the table.

Can you trigger a pre-emptive GC? See can you force garbage collector to run@@


## same-complexity optimization in algo design

A lot of algorithm optimizations do not reduce the time/space complexity, but offer other benefits. I will use “ECO” to denote such optimizations. (E can stand for Equivalent.)

  • benefit — smaller data set
    • eg: tree pruning
    • eg: keep only peak+trough
  • benefit — fewer edge cases
  • benefit — easier to understand
  • benefit — easier to implement
  • benefit — easier to debug and investigate.
    • eg: RBTree has this advantage over priorityQ
  • — examples:
  • eg: yield-generator
  • eg: replace RBTree with priorityQ or sorted vector
  • eg: bidirectional BFS in

Note memoization is more then ECO.

If you can convert a recursive solution, then it is often more than ECO.

given any string, generate splits into palindromes

Q: Given a string s, partition s such that every substring of the partition is a palindrome. Return all possible palindrome partitioning of s.


I found this question quite hard and had no confidence, .. overwhelmed, like many DP problems. But then I came up with two DP solutions ..

I don’t bother to run the leetcode tests as they tend to deflate my burning joy, absorbency… precious stuff.

insight — the word “partition” is horribly confusing. A “partition” can mean two very important entities in this problem domain — either 1) a palindrome substring or 2) a complete family of substrings that form the original word. It’s crucial to avoid this word

insight — challenge is not O() but a correct solution that generates all splits without repetition

–idea 1: recursive top-down DP

memoization is not easy since I used generator.

–idea 2: iterative bottom-up DP

sorted_Array|_List ⇒ balanced_BST

Q1 (easy): Given an array where elements are sorted in ascending order, convert it to a height balanced BST, where the depth of the two subtrees of every node never differ by more than 1.

Q2 (medium): How about a sorted slist?

==== analysis

I feel this “easy” problem should be medium. Perhaps there are elegant solutions.

I need not care about the payloads in the array. I can assume each payload equals the subscript.

There are many balanced BSTs. Just need to find one

— idea 1: construct an sequence of positions to probe the array. Each probed value would be added to the tree. The tree would grow level by level, starting from root.

The sequence is similar to a BFT. This way of tree building ensures

  • only the lowest branch nodes can be incomplete.
  • at all branch levels, node count is 2^L

I think it’s challenging to ensure we don’t miss any position.

Observation — a segment of size 7 or shorter is easy to convert into a balanced subtree, to be attached to the final BST.

When a subarray (between two probed positions) is recognized as such a segment we can pass it to a simple routine that returns the “root” of the subtree, and attach it to the final BST. This design is visual and a clean solution to the “end-of-iteration” challenge.

The alternative solution would iterate until the segment sizes become 0 or 1. I think the coding interviewer probably prefers this solution as it is shorter but I prefer mine.

— idea 2

Note a complete binary tree can be efficiently represented as an array indexed from one (not zero).

— Idea 3 (elegant) dummy payload — For the slist without using extra storage, I can construct a balanced tree with dummy payload, and fill in the payload via in-order walk

All tree levels are full except the leaf level — guarantees height-balanced. We can leave the right-most leaf nodes missing — a so-called “complete” binary tree. This way the tree-construction is simple — build level by level, left to right.

— idea 4 STL insert() — For the slist, we can achieve O(N) by inserting each element in constant time using STL insert() with hint

— For the slist without using an array, I can get the size SZ. (then divide it by 2 until it becomes 7,6,5 or 4.) Find the highest 3 bits (of SZ) which represent an int t, where 4 <= t <= 7. Suppose that value is 6.

We are lucky if every leaf-subtree can be size 6. Then we just construct eight (or 4, or 16 …) such leaf-subtrees

If SZ doesn’t give us such a nice scenario, then some leaf-subtrees would be size 5.  Then my solution is to construct eight (or 4, or 16 …) leaf-trees of type AAAAA (size 6) then BBB (size 5).

Lucky or unlucky, the remaining nodes must number 2^K-1 like 3,7,15,31 etc.  We then construct the next level.

–For the slist without using an array, I can propose a O(N logN) recursive solution:
f(slist, sz){
locate the middle node of slist. Construct a tree with root node populated with this value.
Then cut the left segment as a separated slist1, compute its length as sz1. node1 = f(slist1,sz1); set node1 as the left child of root.
Repeat it for the right segment
return the root

efficient memoization: keyed by auto-increment id

Market Value of this knowledge — I feel this level of intricate knowledge is useful mostly in bigO coding interviews. If you get the bigO wrong there, you can lose major points. Versioned dict #indeed is one typical coding interview question, where the result of read(id) can be memoized to speed up subsequent reads.

Q: what’s the optimal time complexity of maintaining the memoization data structure? User requests can hit us with read(9), read(88), read(15), read(0), read(15) again…

Notation — N:= number of id values.

  • Option 0 (unusable): hash table — nearest-neighbor search impossible
  • Option 1 (sub-optimal): RBTree —  Lookup is O(logN). In general, insertion costs O(logN).

( Special case — If the read() id never decreases, then STL RBtree insert-with-hint is O(1). )

  • Option 2: vector — indexed by id [1]. Lookup is binary search O(logN). Insertion is always amortized O(1). Here’s why.

Eg: after read(9), we get read(88). We would need to null-fill the vector until position 88. This null-fill design looks sub-optimal.

But how is the id 88 created after 87 ? Probably in some write() operation. So write() can incrementally expand this vector 🙂

Actually, this incremental-insert design is not faster. I would rather leave write() alone as the null-fill design is a mass-insert and minimizes re-allocation.

[1] Is it common or lucky to have id values as auto-increment integers? I think it’s quite common in practice and in algo interviews. Auto-increment Integer id is light-weight and fast to generate.

language war: 4 criteria

  1. criteria-E: efficiency, performance including platform-specific optimization.
    • Ling of MS felt threading support is crucial
    • many interactive or batch applications don’t care much about latency
  2. criteria-P1: proven, widely used in “my” community (me = tech lead), with mature ecosystem.
  3. criteria-F: killer features, including OO
  4. criteria : RAD, simplicity, ease-of-use

It is instructive to study the “dethrone” stories.

  • Case: On server-side, java dethroned c++ due to P1, RAD, E
    • In contrast, windows GUI seldom use java. They use 1)c++ 2)c# due to P1, F and E
  • Case: Python dethroned perl due to RAD
  • Case: shell scripting is very old but survived, due to F
  • Case: php survived due to F (designed for web server side), RAD

20Y long term trend – demand for high-level language skillset is rising (unsteadily) relative to java/c++. The reasons are numerous and include RAD, F, P1 Q: which factor is the biggest threat to java? A: F, RAD, not E ! I guess the RAD productivity gap between java and python isn’t so big. For large modularized projects, java probably has a productivity advantage.

Q: will c++ (or even java) be relegated to assembly’s status? No but I wonder why.

recombinant binTree^2D_grid

I prefer “recombinant binary tree”. As a special 2D-grid, it has BOTH [2] child links pointing down [1]

[1] or “eastward” in a treeDump() output.

[2] Note if there are four child links then no binTree.

This constraint is important. Important for visualization. Easy to remember:

  • Every edge is slanted at 45 degrees, either southeast or southwest
  • Consequently, we can visualize that all nodes 33 steps below root line up horizontally on the 33rd Level 🙂

Now I feel that any 2D-grid with this feature should be drawn (and used) as a recombinant binary tree. This includes half the 2D-grid problems and path-through-matrix problems 🙂


streaming data median #70%

Q (Leetcode 295): Design a data structure that supports the following two operations:

  • void addNum(int num) – Add a integer number from the data stream to the data structure.
  • double findMedian() – Return the median of all elements so far.

Follow up:

Q2 If all integer numbers from the stream are between 0 and 100, how would you optimize it?

Q3: If 99% of all integer numbers from the stream are between 0 and 100, how would you optimize it?

===== analysis
For Q2, I use a fixed array to keep the frequency. Based on this array, I maintain a segment tree.
For Q3, I will add two additional “buckets” i.e. array elements for ‘below0’ and ‘above99’.
Below is for Q1
— keeping a size-constrained minHeap holding the “upper house”. Similarly, a maxHeap to hold the “lower house”.

If N is even, then both heaps have same size. If N is odd, then lower half (maxHeap) gets 1 more item and its top is the median

if a new number is below current median it goes into the lower house (max heap) whose top may relocate to the other heap if necessary.

pop() is O(log N/2), so addNum() is O(logN) whereas findMedian() is O(1). Can RBTree achieve the same?

shared vars across files: prefer static field

When I have state to maintain and share across compilation units, there are basically three main types of variables I can create. (Non-static Local variables are simple and don’t maintain state.)

  1. nonstatic field of a singleton class — Note any user of this variable need a reference to the single object 😦
  2. file scope var in a *.cpp  — relatively simple usage, but don’t put in a shared header , as explained in global^file-scope variables]c++
  3. public static fields — most versatile design. Any user can access it after they #include the class header file.
  4. — non-contenders
  5. local static variables — (niche usage) You can create a local static var in myfunc(). To share the variable across compilation units, myfunc() can return a reference to this object, so from anywhere you can use the return value of myfunc(). This is a simple for of singleton.
  6. global variables — monster. Probably involves “extern”. See my other blogposts

The advantages of static field is often applicable to static methods too.

In fact, java leaves you with nothing but this choice, because this choice is versatile. Java has no “local static”, no file-scope, no global variables.

ruthless march@technology

There’s a concept of “best practices across industry”, as I experienced in Macq. Using new technology, things can be done faster, at a large scale, and more automated, even though I may feel it doesn’t make such a difference.

CTO’s don’t want to be seen as laggards. Same motivation at MS-Iceman, Quoine …

  • PWM-billing, PWM-comm. I remember Mark wanted “strategic improvement” not incremental improvement. He needs it for his promotion 政绩
  • RTS infrastructure was considered (by Jack He and outsiders) outdated and lagging behind competitors

You can call it “ruthless march of technology” — a ruthless progress. At a fundamental level, this “progress” can wipe out the promised benefit of “slow-changing, stable domain knowledge”

  1. quant skillset
  2. SQL skillset — affected by noSQL
  3. c++ skillset — perhaps affected by c++0x
  4. FIX skillset — perhaps affected by faster proprietary exchange APIs?
  5. … However, the skills above are still relatively robust. Other skillsets (they are not today’s focus) have proved arguably more robust against this march — sockets, pthread, STL, coreJava, bond math,.. I listed them in my spreadsheet pastTechBet.xlsx.

identical binTree

Q: Given two binary trees, write a function to check if they are the same or not. Two binary trees are considered the same if they are structurally identical and the nodes have the same key.


Look at hints? No need !

— solution 1: I can use the serialization solution and compare the two serialized strings in real time.

— solution 2: BFT but ensure each branch level has strictly 2^n items including nulls

Intuition — If both sides match up on every level, then identical

This solution works if each tree node has a fixed number of child nodes.

— solution 2b: a null node at level N will reduce Level N+1 nodes by two.

— solution 3: recursion
bool diff(node1, node2){
if diff(node1.left, node2.left) or diff(node1.right, node2,right): return false
There might be some implementation issues

— Idea 9: BFT but each queue item is {payload, childrenCount}

This solution doesn’t work for binTree as it can’t detect “null leftChild” vs “null rightChild”.

This solution works if each node can have arbitrary number of child nodes, but I don’t know how common this is.

frequent rests@work → productivity drop

As I get older I believe my brain needs more frequent mini-breaks, but I think sometimes my breaks become distractions. The more treacherous “breaks” tend to be

  • blogging
  • personal investment — Eliminate !
  • generic tech learning, unrelated to current project

— sense of urgency — When taking these breaks, A sense of urgency is what I need, but I favor a relaxed sense of urgency.

— focus — frequent mini-breaks should not affect my focus at work. I might need to avoid chitchats.

I think it’s possible to maintain or enhance my focus by taking breaks.

— stay late — Frequent breaks often require longer hours. I should prefer coming early, but in reality I often stay late.

— I also want to avoid long sit-down sessions. The breaks give relief to my eyes, neck, back, shoulder etc.

linkedHashSet as dynamic sorted collection: O(1)lookup/remove O(logN)insert

linkedHashSet as dynamic sorted collection with O(1) remove/lookup, but insert is O(logN)

Note the java class doesn’t support pinpoint insertion

— My idea:

one-time build a red-black tree then a linkedHashSet therefrom. Te tree is the real container. The linkedHashSet points into the tree nodes.

From now on, additional insert is O(logN) via the tree.

lookup/remove is still O(1) via the hashtable

versioned dict #indeed

Suppose a resume consists of key/value pairs of strings. We need to support two operations —

  1. Each write(key, value) operation creates or updates a key with a new value, and returns an incremented integer vid i.e. version ID. A version ID represents a point in history, i.e. a snapshot of the entire resume
  2. Each read(vid) returns a key/value dictionary as a complete resume snapshot
  3. * In a realistic system, we also need a deleteVersion(vid) but no need to optimize for it

Single threaded system. Version 0 is an empty resume.

Q: design an efficient solution.
Q: after creating N versions via N write() operations, what’s the time and space complexity of next write()? What’s the time and space complexity of the next read()?


Without loss of generality, Let’s say there are K fields in the resume version N. Note K <= N. I should have differentiated K vs N, which is useful for my clarity of thinking.

Holy grail — O(1) read and write. Impossible?

—Design 5 in hind sight, optimized for read(): Simple solution would snapshot entire vector of k/v pairs [1] at each write() i.e. each version. Space complexity is bad (I now believe time complexity is more important.).  Worst case is K==N as version1 creates key1, version 5 creates key5 etc.

[1] not dict because returning a hashmap and forcing client to iterate over the hashmap can lead to sub-optimal iteration efficiency. However, the requirement may say “return a hashmap”

A single read() is presumably O(K) just to read the K fields. I was convinced this was the theoretically limit for read() complexity, because I was thinking of network serialization — either my system or client system need to iterate over all fields of the resume. Now in hind sight I feel interviewer was thinking in java mode — to return entire dict by reference is O(1) rather than O(K) … I didn’t get his hint when he said O(K) read was not really theoretical limit.

Each write() clones the dict for a new vid and saves the new dict in a vector of pointers (!) . Therefore write() is O(K).

—My design 1, Based on the same idea, I used a vector for each key. However, my read() had to create a dict to be returned, on the fly in O(K). I didn’t think of return-by-reference/pointer.

—My design 2 is a minor optimization to remove the cloning of unchanged values. I thought I was on to something efficient but in java (and python?), all strings are copied by reference so my optimization is meaningless in java.

—My design 3 is lazy write. So write() only appends to one of the K vectors. The other K-1 vectors are updated only when needed by a later read() or write(). Amortized cost?

This O(1) write() and O(?) read() complexity can be emulated and surpassed by …

—My design 4 used K RBTrees (small trees) to optimize for frequent write() at O(1). Each write appends on one tree. STL insert with hint is O(1). No such feature in java or python. Read() would construct a new hashmap and return it. Efficiency is discussed shortly.

Design 4b would use a separate vector of Record{vid, value} to replace the small tree. Append is O(1).

Read(vid) requires binary search in each [3] of K trees:

Aha — After N updates, total node count across all K trees is N, so even a naive search would not exceed O(N).

If needed, I can also maintain at O(1) cost a “replay vector of originals” i.e. original {key, value} from each write(). We can use this vector for replay, discussed shortly.

[3] Q1: Can we avoid scanning all K trees when vid is small or K is much smaller than N ?

A: Like rebus, let’s maintain a separate vector (or RBTree) “birthday” holding records {vid, pointer to a single small tree}. Birthday expands (O(1)) by one element whenever a new small-tree is born i.e. new key is created. My read(vid=55) would binary-search in this birthday data structure using vid to locate the last-born small-tree before version 55, and then iterate backward to visit each small tree born before version 55. This optimized read() uses binary-search throughout to eliminate unnecessary linear search.

Q1b: Worst case read():  K == N i.e. every write() creates a new key. So read() is still O(N)?

A: read() can become O(A) where A:=difference between queried vid and any smaller vid previously cached.
A: If we find out in practice K is close to N, we can use additional data structure to memoize previous read(vid) results i.e. the new hashmaps. I can use a sparse vector of records {pointer to read(vid) result}, indexed by vid. Each read() would save, at O(1) cost [4], the result in this memoization vector. For the next read(vid=77) I would use binary search in this memoization vector to find the closest lower neighbor (eg 55), then replay from vid=77 towards  55, using XR’s algorithm.

[4] see my blogpost on memoization keyed by auto-increment id

Note RBTree supports efficient deletion from both birthday tree + small tree.


Now i feel a big string representing the entire resume is still under 100KB. I assumed were were talking about millions of fields of gigabytes each.

For a long time I was confused and assumed that each write() could update multiple key/value pairs.

in-place merge: 2 sorted arrays

Q: Given two sorted integer arrays nums1 and nums2, merge nums2 into nums1 as one sorted array.

  • The number of elements initialized in nums1 and nums2 are m and n respectively.
  • You may assume that nums1 has enough space (size that is greater or equal to m + n) to hold additional elements from nums2.

I would add a requirement — O(1) additional space. so you can’t create another array. This can be realistic if allocation is strictly controlled to prevent fragmentation in embedded environment.


Rather contrived, so I won’t spend too much time

–Idea: use num1’s right portion as the “new” array.

Suppose the allocated array of num1 has capacity k >= m + n. I will call it array KK. Note The right portion of KK is currently unused, so I can wipe it clean with some dummy value.

( If no dummy value is possible, then I probably can still solve the problem but with less clarity. )

Now backscan both arrays and put the highest value in KK[m+n-1] , filling KK leftward. The spare capacity to the right of this position will remain unused forever.

Implementation note — We need a back-scanner pointer into num1 as “cur” + another pointer to the right, “lastPicked”… meaning the item at this position has been copied to KK.

(We may not need lastPicked pointer, but it is less ambiguous more clear, easier to reason with.  You may say it’s a device for analysis and communication, not necessarily for coding.)

We also need such a pointer into num2.

git | diff | make file6 appear before file2

The pain — a big commit touches on too many files, 3 of them major changes, while 5 files have cosmetic changes. In git-diff or git-difftool, I want to focus on the 3 major files.

Some developers would want to split the big commit into a “major” and a “cosmetic” commit, but one of the snapshots would be half-cooked and unusable. In such a case, the commit comment would warn “This commit is part of a bigger change. Please do not build this commit alone”.

As an alternative we can make the 3 major files show up first (or last) in diff. You can put the content below in src/diffOrder.txt and use it as instructed. All unspecified files would show up in natural order.

# git diff -O src/diffOrder.txt
# check EACH line below to ensure unix line ending, even if entire file is unix-style.
# wildcards are needed below.


retrans questions from IV+ from me

Q11 (2 x IV): do you track gaps for Line A and also track gaps in Line B?
A: No, according to Deepak and Shanyou. Deepak said we treat the seq from Line A/B as one combined sequence (with duplicates) and track gaps therein

Q2 (IV x 2): Taking parser + orderbook (i.e. rebus) as a single black box, when you notice a gap (say seq #55/56/57), do you continue to process seq # 58, 59 … or do you warehouse these #58, 59… messages and wait until you get the resent #55/56/57 messages?
A: no warehousing. We process #58 right away.

Q2b (IV): In your orderbook engine (like Rebus), suppose you get a bunch of order delete/exec/modify messages, but the orderId is unrecognized and possibly pending retrans. Rebus doesn’t know about any pending retrans. What would rebus do about those messages?
%%A: I don’t know the actual design [3], but if I were the architect I would always check the orderId. If orderId is unknown then I warehouse the message. If it is a known order Id in Rebus, I will apply the message on the order book. Risks? I can’t think of any.

[3] It’s important to avoid stating false facts, so i will add the disclaimer.

Q2c (IV): what data structures would you use to warehouse those pending messages? ( I guess this question implies warehousing is needed.)
%%A: a linked list would do. Duplicate seqNum check is taken care of by parser.

Q13 (IV): do you immediately send a retrans request every time you see a gap like (1-54, then 58,59…)? Or do you wait a while?
A: I think we do need to wait since UDP can deliver #55 out of sequence. Note Line A+B are tracked as a combined stream.

Q13b: But how long do we wait?
A: 5 ms according to Deepak

Q13c: how do you keep a timer for every gap identified?
%%A: I think we could attach a timestamp to each gap.

— The above questions were probably the most important questions in a non-tech interview. In other words, if an interview has no coding no QQ then most of the questions would be simpler than these retrans questions ! These questions test your in-depth understanding of a standard mkt data feed parser design. 3rd type of domain knowledge.

Q: after you detect a gap, what does your parser do?
A (Deepak): parser saves the gap and moves on. After a configured timeout, parser sends out the retrans request. Parser monitors messages on both Line A and B.

Q: if you go on without halting the parser, then how would the rebus cope?

  • A: if we are missing the addOrder, then rebus could warehouse all subsequent messages about unknown order IDs. Ditto for a Level 1 trade msg.

Deepak felt this warehouse could build up quickly since the ever-increasing permanent gaps could contain tens of thousands of missing sequence numbers. I feel orderId values are increasing and never reused within a day, so we can check if an “unknown” orderId is very low and immediately discard it, assuming the addOrder is permanently lost in a permanent gap.

  • A: if we are missing an order cancel (or trade cancel), i.e. the last event in the life cycle, then we don’t need to do anything special. When the Out-of-sequence message shows up, we just apply it to our internal state and send it to downstream with the OOS flag.

If a order cancel is lost permanently, we could get a crossed order book. After a few refreshes (15min interval), system would discard stale orders sitting atop a crossed book.

In general, crossed book can be fixed via the snapshot feed. If not available in integrated feed, then available in the open-book feed.

  • A: If we are missing some intermediate msg like a partial fill, then we won’t notice it. I think we just proceed. The impact is smaller than in FIX.

OOS messages are often processed at the next refresh time.

Q3b: But how long do we wait before requesting retrans?
Q3c: how do you keep a timer for every gap identified?

Q14: after you send a retrans request but gets no data back, how soon do you resend the same request again?
A: a few mills, according to Deepak. I think

Q14b: Do you maintain a timer for every gap?
%%A: I think my timestamp idea will help.

Q: You said the retrans processing in your parser shares the same thread as regular (main publisher) message processing. What if the publisher stream is very busy so the gaps are neglected? In other words, the thread is overloaded by the publisher stream.
%%A: We accept this risk. I think this seldom happens. The exchange uses sharding to ensure each stream is never overloaded.

short+efficient : holy grail ] algo search

For most of my tough algo problems (such as leetcode problems)

  1. many celebrated solutions are very short, but not so efficient
    • I think about 60% of them are easier to memorize due to length.
    • if you can understand and memorize them, they are quicker to write due to fewer keystrokes
    • About half of them are easier to understand, thanks to the source code size
    • eg: yield-based generators are a prime example
    • eg: recursive solutions are often brief but inefficient
  2. many great solutions are efficient in terms of O(), but verbose
    • eg: bottom-up DP
    • eg: top-down DP with memoization
    • eg: using a tree (when not required) can sometimes give an efficient solution

However, these two goals are often hard to harmonize. It is my holy grail to meet both criteria simultaneously, but i won”t try so hard.

  1. For real coding problems, I will prefer brevity
  2. For verbal discussions, I will concentrate on efficient solutions

java primitives have no address #unlike C

In C, any variable, including those on stack, can have its address printed.

In java, the primitive variables have no address.  Every reference type object has an addresses, by definition (“reference” means address)

C# is somewhat mixed and I’m not going into it.

Python rule is extreme, simple and consistent. Every object has an address. You can print id(x) or getrefcount(x)

>>> from sys import getrefcount as rc
>>> i=40490
>>> rc(i)

topological_sort@DAG: linear algo to assign ranks?

Q: topological sort — given a directed graph, any linear algo to assign ranks?

I prefer to use one shared rank for multiple nodes that can be simultaneously started/concretized/evaluated. This feature can increase flexibility and parallelism

terminology — avoid “dependency” — confusing. Prefer “upstream/downstream” or “ancestor/descendant”. Note ancestors and upstreams should be started/processed first.

rank table — We can use a hashtable (or pre-sized vector) to store the ranks: {rank -> list of nodes of that rank}. Assigning a node means adding the node id to the correct list, in O(1)

Assumption 1: the original graph node contains links to ancestors but no descendants. spreadsheet-model. I think this is Kahn’s assumption.

Assumption 2: the original graph nodes contain links to descendants but no ancestors. notification list or “call list”, or “listener list”. I think this model is used the DFS algo.

In most situations, One of these two assumptions would hold, but rarely both.

==== my modified version of Kahn’s algo

Scan-1 O(V+E) — build a hashtable-based two-way edgeSet representation of the graph. For each node, we maintain a hashset (or slist) of ancestors and a hashset of descendants. The duplication is needed, as described below in the Kahn context. (I think the DFS algo needs no duplication.)

Scan-2 O(V) — assign rank 0 to all top-level nodes (no precedent). Now we can use the rank table to scan rank-0 nodes

Scan-3 — Now scan the last assigned rank, rank-0 in this case. For each node in that list, check each downstream child. Unconditionally remove (O(1) thanks to hashset) the upstream link from inside the child. After that, If the child has empty hashset of ancestors it is assigned rank 1. I now believe the precedent/dependent link is never accessed again, so we can remove both.

Repeat the last scan at Rank 1, then Rank 2..

Every node is assigned only once. Every edge is checked only once or twice.

Can the ancestors hashset become an integer count?

— simplicity

Insight — In this design, I use multiple Simple passes and avoid doing too much in one pass. If needed, you can combine Scan-2 and Scan-1.

We treat the original nodes as readonly — nice simplification.

— terminology:
precedent/dependent is accurate but abstract.
“Dependency” is a confusing term. It means someone I depend on. Better avoid this word in graph problems.
uplink/downlink is visual only in a tree with root on top

— Kahn uses “incoming edge” to mean a link to an upstream/ancestor
“All nodes with no incoming edge” … implies a Node::ancestors field

When he visits downstream nodes from “current node”, he needs this->descendants field

This crucial detail is not explained in wikipedia

=== Kyle’s favorite DFS algo, as described on wikipedia, and in [[CLRS]]

Basic idea — check each remaining node and start a DFS. Whenever a leaf (downstream) node is found, Remove it from DAG, and prepend it to output list. Game over when last (most upstream) node is removed from DAG.

  • last node removed will be one of the most upstream nodes
  • first node removed will be one of the original leaf nodes
  • Output list has a meaning — a sorted list of items to “process”
  • Invariant — At any time, size of output list + size of graph == N
  • Implementation note: Actual removal of a node is tricky in either matrix representation or edge-list representation, so it’s easier to use a hashtable to hold “removedNodes”

The simple algo above will fail if cycles exist. To check cycles, we need a  small feature — “temporary mark”. I think this feature can detect cycles in any directed graph such as a tree.

## candd are assessed mostly@tech skill !!dnlg

I feel mostly we as candidates are assessed on technical not domain knowledge.

Q: Among your past job interviews which one had the highest emphasis on dnlg? In the interview, which one of the 3 dnlg categories? Usually math I would assume.

I over-invested in dnlg, relying on it to stand out and differentiate. Well, it’s not “reliable” as a competitive advantage just as SQL/Unix skill isn’t such a selective screening criteria. In contrast, I would say about half of my job interviews had highly selective tech screening.

  • eg: west-coast
  • eg: HFT shops
  • pure algo challenge — About 30% are tough, including a few west-coast challenges
  • whiteboard threading — easy for me, but 60% hard for most guys
  • online tests — MCQ (30%) or coding (80% hard)
  • interactive coding, remote/onsite, whiteboard/IDE — 70% are tough
  • core java/c++/c# knowledge — 50% are tough

embed char_array ] my java object #XR

With market data it’s common to use some Message class(s) that “embeds” a fixed-length character array like 20-char for example.

Allocating an array object off-site on heap is very costly in memory footprint. One extra allocation per Message.

Also slower reading at run time due to data-cache inefficiency. Data cache favors contiguous data structures. See CPU(data)cache prefetching

c/c++ and c# (via Struct) can easily support exactly this “embedding”. I feel java also has some support. Beside JNI, I wonder if there’s another, pure-java solution.

Q: in java, how can I have embedded fixed-length char-array field in my Message or Acct object, rather than a separate array object allocated somewhere off-site?

  1. Solution: If the fixed length is small like 10, I could maintain 10 individual char fields.
  2. Solution: assuming the chars are ascii (8-bit rather than 16-bit in java), I can group first eight chars into a 64-bit long int field. Provide a translation interface when reading/writing the field. With 10 such fields I can support 80-char embedded.
  3. Solution: If not possible, I would use a gigantic singleton off-site char array to hold fixed-length “segments”. Then I need a single int “position”. Every Acct object has a field this.position, where this.position * fixedLength = offset, to identify one segment.
  4. There are two published solutions described in ringBuffer@pre_allocated objects to preempt JGC

Among them, not sure which solution is fastest in practice.

ringBuffer@pre_allocated objects to preempt JGC

Goal — to eliminate JGC completely.

Design 1: I will want to use primitive fields only and avoid reference fields [1] at all cost, so the total footprint of an Order is known in advance. Say it’s 100 bytes. I will create 10M of dummy Order instances, possibly scattered in heap, not adjacent as in c++, and hold their 10M addresses in an Order array… about 1GB footprint for the Order objects + 80M footprint for the array of 8-byte pointers.

(Note I reuse these Order instances in this object pool and never let them get garbage-collected.)

Then i need a few subscripts to identify the “activeRegion” of the ring but how about released slots enclosed therein?

[1] timestamps will be ints; symbolIDs and clientIDs are ints; short ascii strings will use 64-bit ints (8 characters/int); free-form strings must be allocated off-site:(

Design 2a: To avoid the “scatter” and to place the Order instances side by side, Can we use a serialized byte[100] array object to represent one Order? Can we use one gigantic off-heap byte array to hold all Orders, eliminating the 80M footprint? See java off-heap memory

Design 2b: shows a contiguous array of java objects, like std::vector<MyObject>

Design 2c: is a feature in IBM jvm

Ring buffer is good if the object lifetimes are roughly equal, giving us FIFO phenomenon. This occurs naturally in market data or message passing gateways. Otherwise, we may need a linked list (free list) of released slots in addition to a pair of subscript to identify the active region.

It might be better to allocate a dedicated buffer for each thread, to avoid contention. Drawback? One buffer may get exhausted when another stays unused.

##RDBMS strategies4faster read+write #widely useful#Sunil

(Further to our chat in mid Apr 2019…) Hi Sunil,

I generally agree that RDBMS such as Sybase, Oracle, DB2, MSSQL, … have good performance for Select and slower performance for Inserts, as you concluded in our last call. However, this conclusion is full of ifs and buts.

  • Read performance can be improved with partitioning — Imagine if the table is relatively large like 100 million rows. Even if the queries use an index, the index can have fairly large footprint (with implications). Often the Asia users only query the Asia records while Europe users only query Europe records.
  • Read performance can be improved with pre-computed data tables. If the same calculation is frequently requested via some frequent queries, then the calculation results can be saved in a pre-computed result table.
  • Read performance can be improved with stored procedures.
  • Read performance can benefit from de-normalization.
  • Read performance can improve if entire table is in-memory, as confirmed by my 2009 DB2 trainer in NY.

Now I think most RDBMS performance tuning techniques target Select as slow Select is the most common pain and the most severe pain.

Insert is typically much slower than read, but user expectation is also less demanding. In my observation, most RDBMS databases are either mostly-select or mostly-insert. The mostly-insert database can benefit from batch insert (bcp in sybase/MSSQL), or a background writer thread (Gemfire).

However, sometimes real-time insert is needed. I think the most common strategy is sharding (horizontal partitioning) like splitting 100 million rows into two tables of 50 millions each.

A related strategy is normalization (vertical partitioning). Normalization removes duplicate data and helps inserts.

[19] assignment^rebind in python^c++j

For a non-primitive, java assignment is always rebinding. Java behavior is well-understood and simple, compared to python.

Compared to python, c++ assignment is actually well-documented .. comparable to a mutator method.

Afaik, python assignment is always rebinding afaik, even for an integer. Integer objects are immutable, reference counted.
In python, if you want two functions to share a single mutable integer variable, you can declare a global myInt.
It would be in the global idic/namespace. q[=] on myInt has special meaning similar to

idic[‘myInt’] =..

Alternatively, you can wrap the int in a singular list and call list mutator methods, without q[=].

See my experiment in github py/88lang and my blogpost on immutable arg-parssing

required experience: QQ imt CIV

A few years into my technology career, I observed that Solaris/Oracle was harder to self-teach at home than linux/mysql/php/python/perl/javascript because even a high-school student can install and hack with the latter. Entry-barrier was non-existent.

Similarly, I now observe that Wall St QQ-IV demands longer experience than coding IV of west-coast style. Coding drill can use Leetcode over months. QQ requires not so much project experience but a lot of interview experience. It takes more than months of exploration and self-learning.

  • example — tcp/ip. My friend Shanyou preferred to read books systematically, but a book touches on hundreds of topic. He wouldn’t know what topics interviews like to dig into.
  • example — c++ TMP. A fresh grad can read about it but won’t know the favorite topics for interviewers
  • Compelling Example — java concurrency. A fresh grad can build up theoretical knowledge but won’t have my level of insight

Many inexperienced candidates were highly appreciated in west coast interviews. No such appreciation on Wall St because Wall St can VP or contract roles require work experience .. tried-n-tested.

  • Wall St interviews are selective in terms of experience
  • West Coast coding interviews are selective in terms of … speed and optimality

low-level^high-level expertise

See also my blogpost on wrapper^low-level API

Compare to laymen on the street, I have accumulated fairly deep, specific and demonstrable non-local expertise in two lucrative field i.e. software dev + finance

  • concurrency details (theory++) in real languages
  • dStruct choices and effective combinations
  • dStruct implementation details in java/c++/pyhton
  • SQL joins and tuning
  • sockets? no deep insight but my expertise beats most peers
  • Most standard algo problems are familiar to me
  • memory models, memory layout.. beneath c++ and java
  • drv pricing math, VaR statistics, bond math. My expertise goes beyond the degree

These trophies are won based on TSN, interviews and personal time investment … not automatically

— Let me focus on low-level vs high-level
Most of the dev expertise domains are low-level, consequently very specific — you either know it or don’t. I can even teach in these subjects. The more low-level, the rarer is the expertise. The laymen developers have a vague idea of the low-level details, for many reasons.

High-level understanding is often more useful (than low-level, theoretical QQ knowledge) in projects, but in job interviews, low-level knowledge is differentiator.

Practically all (98%) high-end java/c++ interviews use low-level questions as differentiator. I excluded the start-ups as I’m unfamiliar with them… Eg: Quoine.

In my dev (not system architect) interviews, they rarely asked high-level stuff like spring rationale, design patterns…
I tend to believe these questions can’t /separate the crop from the chaff/

The finance dnlg  also serves as differentiator among developers. The most specific sub-domains are quant math, the “architecture” and some of the jargon.

C++ is more low-level than java; java is more low-level than python…

Q: Paradoxically, C and assembly are even more low-level but not more valuable?
%%A: Some C knowledge is highly valued. For example, kernel knowledge is high-value and mostly at the level of C, compiler, assembly, and hardware
%%A: assembly expertise in device drivers or embedded is not really relevant to HFT, and too far from the money

MOM is not low-level and seldom asked.

Will tough coding IV decline]10Y@@ #Rahul

It’s possible that as easy coding IVs spread or grow, tough coding IVs decline i.e. less wide-spread. In such a case, my t-investment now will have lower ROTI than expected.

It’s possible that as easy coding IVs spread or grow, tough coding IVs also grow … for differentiation of talent.

Rahul responded to my question and pointed out the two styles of interviewers he observed:

* interested in can/can’t solve — can this candidate get the problem solved in time?
* interested in thought process

Rahul felt the first type will want to use ever tougher problems to pick outstanding candidates… selectivity

Rahul felt the second type will want to use easier problems to get more insight into the candidate’s thought process.

Rahul felt to differentiate, employers can compare completion time on easy questions

float/int 1:1 mapping#java support

See last paragraph of

you can use an AtomicInteger to hold byte values, and cast appropriately. You can also hold doubles using Double.doubleToRawLongBits(double) and Double.longBitsToDouble(long) conversions… because there’s a valuable 1:1 mapping between a 64-bit number and a 64-bit float number.

  • Useful if you need to precisely compare a 64-bit float value. For example, you can check if a value has changed at all. The value could be user-supplied or shared mutable.
  • Useful if you need to store float values in a hashtable, though I’m not sure about performance.
  • Useful if you need radix sort on floats

EnumSet^regular enum

[category javaOrphan]
A java enum type usually represents .. (hold your breath) .. a radio-button-group. A variable of this type will bind to exactly one of the declared enum constants.

eg: Continent — there are only 7 declared constants. A Continent variable binds to Africa or Antarctic but not both.
eg: SolarPlanet — there are only 8 declared constants
eg: ChemicalElement — there are only 118 declared constants
eg: ChinaProvince — there are only 23 declared constants

In contrast, enum type has a very different meaning if used within an EnumSet. (I will name a meaning soon). Each enum constant is an independent boolean flag. You can mix and match these flags.

Eg: Given enum BaseColor { Red,Yellow,Blue} we can have only 2^3 = 8 distinct combinations. R+Y gives orange color. R+Y+B gives white color.

Therefore, the BaseColor enum represents the 3 dimensions of color composition.

EnumSet was created to replace bit vector. If your bit vector has a meaning (rare!) then the underlying enum type would have a meaning. Here’s an example from [[effJava]]

Eg: enum Style {Bold, Underline, Italic, Blink, StrikeThrough, Superscript, Subscript… } This enum represents the 7 dimensions of text styling.

[[effJava]] reveals the telltale sign — if the enum type has up to 64 declared constants (only three in, then entire EnumSet is actually represented as a single 64-bit integer. This proves that our three enum constants are three boolean flags.

Q:are java primitive+reference on heap or stack #escape

An old question but my answers are not really old 🙂

In Java, a so-called “referent” is a non-primitive thingy with a unique address on heap, accessed via heap pointers.

In java, a referent is always an Object, and an Object is always on heap therefore always a referent.

(Java language defines “reference types” in terms of primitive types, so we need a clear understanding of primitive types first.)

In java, a primitive thingy is either part of a (heapy) Object or a local thingy on stack

(In C++ lingo, object can be a new int(88)…)

A reference is, at run-time really a heap pointer. Assuming 32-bit machine, the pointer itself occupies 4 bytes and must be allocated somewhere. If the reference is local to a method, like a parameter or a local variable, then the 4 bytes are on stack like a 32-bit primitive local variable. If it’s part of an object then it’s on heap just like a 32-bit primitive field.

— advanced topic: escape

Escape analysis is enabled by default. EA can avoid construction an Object on heap, by using the individual fields as local variables.

— advanced topic: arrays

Arrays are special and rarely quizzed. My unverified hypothesis:

  • at run-time an array of 3 ints is allocated like an Object with 3 int-fields
  • at run-time an array of 3 Dogs is allocated like an Object with 3 Dog fields. This resembles std::vector<shared_ptr<Dog>>
  • Q: how about std::vector<Dog>?
  • %%A: I don’t think java supports it.
  • The array itself is an Object


This is a pure QQ topic, but you can score a few points by mentioning it.

[[javaPerf2014]] P268 says biased locking is enabled by default but can be disabled to improve performance in app servers using a thread pool and contended locks. It explains the reason i.e. biased locking comes with a price — a runtime overhead. is very concise — un-contended locking may incur zero cost, as claimed in the accepted answer

  • .. incurs no synchronization cost. I suppose this is typically geared towards overly conservative code that performs locks on objects without ever exposing them to another thread. The actual synchronization overhead will only kick in once another thread tries to obtain a lock on the object.
  • Biased locking is the default in java6 ====> -XX:+UseBiasedLocking improving the performance of uncontended synchronization. An object is “biased” toward the thread which first acquires its monitor; subsequent monitor-related operations are relatively faster. Some applications with significant amounts of uncontended synchronization may attain significant speedups

Is this technique considered lockfree? No, but some may speculate that it might be even faster than lockfree. So if you suspect most of the time there’s only one thread accessing a critical section, you could choose to rely on the (java6 default) biased locking rather than lockfree solution. Most of the time this mutex-based design would challenge (if not beat) a lockfree performance.

However, I believe single-threaded mode is still faster, where a thread isn’t aware of other threads, as if it is the only thread access those objects i.e. no shared-mutable. There would be no lock grab, no memory fencing at all. [[javaPerf2014]] P375 agrees.

2 threads taking turn #std::ref #CSY

Vanguard Q: write a program containing exactly 3 threads printing 1/3/5/7/… 2/4/6/8… respectively, but in lock steps, as if passing a token between them. is my solution.

I managed to avoid sleep() and condVar. The idea — all thread run the same function which

  1. check the shared mutable variable Next. If it’s “my” turn then
  2. grab lock, print my next number, update Next, release lock
  3. end-if
  4. yield and exit

I used an atomic<char> “Next” set to lastOutput%something, that’s visible to all threads, even if not holding a lock.


RTS feed for Internet clients #Mark #50%

Q: Based on the RTS market data dissemination system, what if some clients subscribe by a slow Internet connection and your orderbook (TCP) feed need to provide delta updates each time a client reconnects?

Default solution: similar to FIX/TCP .. sender to maintain per-client state. Kenny of Trecquant said his trading app can be extremely simple if exchange maintains state. I won’t elaborate. Here is my own Solution.

Note on terminology — in multicast there’s no TCP-style “server” . Instead, there’s a sender engine for many receiver clients.

Suppose we have too many clients. To minimize per-client state management, my engine would simply multicast to all clients real time updates + periodic snapshots.

A client AA can request a snapshot on any symbol group, and I will immediately multicast the snapshots on a refresh channel. If client BB never requests anything, BB can ignore the refresh multicast channel.

Request quota — each client like AA can request X free snapshots a day. Beyond that, AA’s requests would be regulated like

  • levied a fee
  • queued with lower priority

It’s feasible to replace the refresh multicast group with an unicast UDP channel per client, but to me the multicast-refresh solution offers clear advantages without major drawbacks.

  1. if there is an outage affecting two clients, each would request the same snapshots, creating unnecessary work on the engine. The request quota would incentivize each client to monitor the refresh group and avoid sending the same request as someone else
  2. multicast can be valuable to clients who like more frequent snapshots than the periodic.
  3. My operations team can also utilize this refresh channel to broadcast unsolicited (FreeOfCharge) snapshots. For this task, I would avoid using the publisher channel as it can be busy sending real time updates. We don’t want to interleave real time and snapshot messages.

noexcept impact on RAII

If a function (esp. dtor) is declared noexcept, compiler can choose to omit stack-unwinding “scaffolding” around it.  Among other things, there’s a runtime performance gain. This gain is a real advantage of using noexcept instead of empty throw() which is deprecated in c++0x.

Q: Is there any impact on RAII?
%%A: yes

Q: Can we even use RAII in such a context?
%%A: I think we can. If the function does throw, then std::terminate() runs, instead of the destructors

Note Best concise intro to noexcept is P780 [[c++primer]]

success@concurrency features] java^c++^c#

I’m biased towards java.

I feel c# concurrency is less impactful because most of the important concurrent systems use *nix servers not windows, and most concurrent programming jobs do not go to windows developers.

Outside windows, c++ concurrency is mostly based on the C library pthreads, non-OO and inconvenient compared to java/c#

The c++11 thread classes are the next generation following pthreads, but not widely used.

Java’s concurrency support is the most successful among languages, well-designed from the beginning and rather stable. It’s much simpler than c++11 thread classes, having only the and data types. More than half the java interviews would ask threading, because java threading is understandable and usable by the average programmer, who can’t really understand c++ concurrency features.

%%perfect hash: 1000 known string keys #MLP-sg %50

See also string bucket-sort

Q: similar to FIX tags, you are given a published dictionary of 5000 short string keys, each up to 10-characters. Every message consists of about 100 key/payload pairs. Design a parser for these messages. Note every key would be in the dictionary.

Simple solution 1 BST like AVL or RBT to hold the pairs. O(log N=5000) … 5000 keys require about 12 levels. Possibly faster than hashtable.

Simple solution 2 standard hashtable to hold the pairs. The hashcode for string is usually fast enough, with minimal collision if load factor is set sufficiently low. Even if zero collision, this requires allocating 5000 linked nodes in memory.

In this case since the key strings are known in advance, I believe we can design a perfect hashing algorithm without collision. Interviewer said perfect hashing is hard in this case.

Solution 3: perfect hashing. Group the key strings by length, to 10 groups

  • If the number of distinct 8-char keys is small, for instance, we can use switch-case
  • for 1-char string keys, we maintain a dedicated bucket array (“table”) indexed by the ascii code
  • for 2-char string keys I can have a nested 2D array. Given key string “px”, “p” identifies the level2 array and within it, “x” identifies the bucket for the payload.
  • for 3-char keys, ditto

For those 10-char string keys, a 10-level nested array would be too large. Here’s my strategy. Assuming a lot of key strings (say, 3000 out of total population of 5000) are 4-char long, let’s tackle this sub-problem first:

  • for 4-char keys, I would designate an index range like 1 to 6000 (6000 buckets) to hold 3000 keys in this group. I would compute an initial hashcode := (char1*31+char2)*31+char3… % 6000 for this group of 3000 key strings and check for collision within this group. If there’s collision, then adjust some of the prime numbers and retry. Since load factor is 0.5 I hope we can avoid collision within this group. If it’s hard to avoid collision, I can reduce load factor, or introduce my collision resolver[2].
  • for 5-char keys, I would designate an index range like 6001 to 7800 (1800 buckets) to hold 900 keys in this group. I would compute initial hashcode := (char1*31+char2)*31+char3… %1800 for this group of keys and check for collision within this group.
  • So far, we have designed with a big combined bucket array. As an alternative, if there are only 7 key groups [4-char, 5-char, .. 10-char] then we can use seven dedicated bucket arrays, but what if 7000 groups? 7000 bucket arrays? I think so. Memory footprint is same as a combined bucket array.

[2] Here’s a simple collision resolver. Suppose Strings A and B(and C,D..) all hash to the same initial-hashcode of 55. I need to find another bucket for B. After I compute all the initial-hashcode values for each of 3000 key strings, I can easily find an unused bucket like 77. I will simply add an if-block to use bucket 77 for key B. Now this scheme is probably faster than open-addressing and separate chaining since the calc logic can be manually optimized , held in instruction-cache, or burned into FPGA. In contrast,

  • Separate chaining requires runtime memory access to follow the linked nodes. We can only hope the linked nodes are cached
  • open addressing requires runtime probing. Our load factor of 0.5 is not very good so open addressing can become slow. is a published perfect hashing solution.

POSIX semaphores do grow beyond initial value #CSY

My friend Shanyou pointed out a key difference between c++ binary semaphore and a simple mutex — namely ownership. Posix Semaphore (binary or otherwise) has no ownership, because non-owner threads can create permits from thin air, then use them to enter the protected critical section. Analogy — NFL fans creating free tickets to enter the stadium. is one of the better online discussions.

  • ! to my surprise, if you initialize a posix semaphore to 1 permit, it can be incremented to 2. So it is not a strict binary semaphore. is my experiment using linux POSIX semaphore. Not sure about system-V semaphore. I now think a posix semaphore is always a multi-value counting semaphore with the current value indicating current permits available.

  • ! Initial value is NOT “max permits” but rather “initial permits”

I feel the restriction by semaphore is just advisory, like advisory file locking. If a program is written to subvert the restriction, it’s easy. Therefore, “binary semaphore” is binary IIF all threads follow protocol. claims “Mutex can be released only by thread that had acquired it, while you can signal semaphore from any non-owner thread (or process),” This does NOT mean a non-owner thread can release a toilet key owned by another thread/process. It means the non-owner thread can mistakenly create a 2nd key, in breach of exclusive, serialized access to the toilet.–-part-1-semaphores/ is explicit saying that a thread can release the single toilet key without ever acquiring it.


[[DougLea]] P220 confirms that in some thread libraries such as pthreads, release() can increment a 0/1 binary semaphore value to beyond 1, destroying the mutual exclusion control.

However, java binary semaphore is a mutex because releasing a semaphore before acquiring has no effect (no “spare key”) but doesn’t throw error.

How RTS achieved 400-700 KMPS #p2p

Official benchmark result – 700 KMPS per instance of rebus or parser. Here are some enabling features:

  • [! iii] mostly single-threaded apps. Some older parsers are multi-threaded but each thread operating in single threaded mode. No shared mutable:)
    • In the order book replication engine, we use lockfree in one place only
  • ! (ST mode loves) Stateless [1] parser, small footprint. Only downstream component (Rebus) handles cancels/busts, modifications.. So we can run many parser instances per machine, or many threads per instance, fully utilizing the cpu cores. See
  • [! ii] allocation avoided — on millions of Output message objects. Pre-allocated ring buffer eliminates new(). very few and small STL containers .. mostly arrays… pre-allocate DTOs@SOD #HFT
  • ! allocation avoided — on millions of Input message objects, thanks to reinterpret_cast() on pointers… nearly Zero-copy. See reinterpret_cast^memcpy in raw mktData parsing
  • ! allocation avoided — custom containers to replace STL containers used, since they all allocate from heap
  • ! p2p messaging beats MOM 
  • ! Socket buffer tuning — to cope with busts. 64-256MB receive buffer in kernel; 4MB UserModeBuffer
  • low memory footprint. Most parsers use around 60MB. (My parsers was 100 to 150MB.) I think there are many benefits in terms of latency.
  • epoll — to replace select() with 2 orders of magnitude improve in throughput
  • buffering of Output list (of messages). I guess this hurts latency but enhances throughput
  • Very fast duplicate seq check, without large history data — a hot function
  • large containers like the ring buffer are pre-sized. No reallocation.
  • mostly array-based data structures — cache-friendly
  • Hot functions use pbref or RVO or move semantics, minimizing stack allocations
  • aggressively simplified parsing. Minimal logic
  • Special Logger to reduce I/O cost
  • Sharding to speed up symbol lookup
  • kernel bypass : RTS
  • No virtual functions — enhancing data cache and inlining .. 3 overheads@vptr #ARM
  • Inline
  • —-Speed with control
  • Best of both to reduce the real risk of our connection becoming slow or lossy
  • Depth monitor to detect missed messages
  • Capacity tracker to monitor mps rates
  • Missed seq reporting
  • [!=one of the really effective techniques in RTS]
  • [i=presented in multiple (!!, !!!) interviews]

[1] parser does remembers the sequence numbers. In fact sequence number management is crucial. Before retrans was added to parser, we never sent any retrans request. We did keep track of gaps. Mostly we used the gap-tracker to check duplicates across Line A/B. We also relied on our sequence number state to handle sequence resets.

insert to sorted collection of N items: I can beat O(log N)

features/advantages of sorted collection

  • O(N) ascending iteration
  • O(log N) range query
  • O(log N) closest neighbor query above/below
  • O(log N) boolean contains() query
  • O(1) min() max() queries
  • deleteByKey is typically O(log N) but can be O(1) if we construct a hashtable of {key -> iterator} once the collection is built. This hashtable can also accept new items.

Q: insert is typically O(log N), but can we improve on it?

— A: if there’s only one insert, we can hold it in a hidden field of the collection, to be checked at the end of each operation above.

— A: if there are unlimited inserts but all 64-bit integers, then we can treat the incoming integer as a short radix array (8-elements for example). Take each element of the array as a lookup key, look up in an array to find a subtree.

Basically the entire sorted collection is an  8-level tree with fan_out=256

You may say that there are 8 lookups required so O(k) where K=8, but here we assume the size of the radix array is a constant equal to eight. Note the collection size is unlimited and far more than 2^64 if duplicates are permitted.

tick`95%mark #Kam #70%

“Ticking 95th percentile server” is the blog title I would use. Original question is on paper/pencil, scanned and sent to my gmail, from my friend Deepak CM. I find this problem rather realistic with practical usage, rather than fake and contrived. I treat it as a System Design + implementation question.

Q: Using only std library, write a c++ program to process a live stream of 128,000,000 (or much more) double numbers, representing temperature readings not necessarily unique. As the temperatures come in, print the current 95th percentile on-demand. I call it the lucky “winner”. We can use the nearest-rank percentile definition.

====Idea 1: given unsorted ints, find median in O(N) is for median but can be tweaked for any percentile, but unfortunately, not “ticking”

====design 2, for static data set
use an “order statistic tree i.e. a RBTree where each node remembers the size of its subtree. (A leaf node has size 1.)
====design 3, optimized for high volume of updates like 128 million updates, not optimized for frequent query

The entire temperature range is divided into non-overlapping segments, each represented by a segment-head temperature i.e. the lower bound [1b]. Each segment has a range (i.e.distance to next segment head), size (i.e.item count) and density (i.e. size/range ratio). We mostly care about “size” only.

We need a RB-tree (or sorted vector) containing P=1024 [1] nodes, each an unsorted container[3]. The RB-tree serves to maintain the containers i.e segments.

Each incoming temperature is quickly “routed” to the correct container and simply appended therein, increasing its size.

Upon query request, we will use the latest segment sizes to build a cumulative profile, and run a O[logP] binary search to identify the one segment containing the “winner”. This segment size would be hopefully much smaller than 128,000 [2] and far more /tractable/.

–Within the chosen segment of size S, we can use a vector to sort in O(S logS) the temperatures and identify the winner.  After completing a query, the chosen container will become (partially) sorted, helping subsequent queries if this segment is picked again.

Since we only support 95th percentile, chance is good that this segment will be picked most of the time. If x% of the queries hit this segment, then I will convert this “favorite segment” to a RB-tree.

Alternatively, we can also use the O(S) algorithm in Idea 1, but the container won’t become sorted.


[2] 128,000 is 1024th the size of original sample size… not ideal. The segments need to be initialized carefully, during a priming phase, inspired by JIT compiler. Shall we assume roughly uniform distribution or Gaussian distribution? Assuming we know the total sample size is 128 million, I will use the first 100,000 temperatures to select the 1024 segment heads. The segments are structured not for equal length (in temperature) or equal size (i.e. element count). In fact the low segments can be very long very crowded.

Instead, the segment heads are chosen so that between 94th percentile and 96th percentile we have half of all segments. These small segment sizes will be much smaller than 128,000 and quicker to manipulate.

–Foot notes:

Q: what if some of the containers grow too big like three times 128,000,000/1024. The priming/estimate was ineffective.
A: Not a problem unless the winner happens to be in such a container.
A: One idea is to split up such a container when we notice it, and grow a new node on the RB-tree. std::map::insert() can take a hint for the position where new node can be inserted. Note we don’t want to split a node JJ too early since JJ may not grow any further subsequently and may end up no larger than other nodes as other nodes keep growing.

[1] Sizing of P — First we estimate total sample size. If unknown, then set N:=1024 so all nodes stay in L1-cache (typically 32KB). If we assume 16 bytes/node ( 8 bytes pointer to container + 8 bytes double ), then 32KB can hold 2000 nodes.

If query becomes more frequent, I can increase P by 1024, sacrificing insertion.

[1b] The node values are “lower-bounds” and don’t have to be really the minimum of the original temperatures in the container. We can probably cheat with 4-byte floats, and we get away with 2700 twelve-byte tree nodes.

[3] slist vs vector — vector might be faster due to pre-allocation, provided a node will never grow beyond a capacity. Vector has reserve() (Note resize() is wrong choice.)

intervalTree: classic RBTree to find overlapp`event

Notable detail — non-balancing BST won’t give logN performance, since the tree depth may degrade

Q1: given N event objects {start, end, id}, use a RBTree to support a O(logN) query(interval INT) that returns any one event that overlaps with INT, or NULL if none.

If the end time == start time in the input INT, then it becomes the focus today :

Q2: given N event objects {start, end, id}, use a RBTree to support a O(logN) query(timestamp ii) that returns any one event that’s ongoing at time ii, or NULL if none.

P312 [[intro to algorithms]] describes an elegant solution.

The text didn’t highlight — I think the end time of the input interval INT is … ignored. (Therefore, in my Q2, input ii is a single timestamp, not an event.) Precisely because of this conscious decision, the tree is sorted by event start time, and the additional payload (in each node N) is the subtree end time i.e. last end time of all events started before N (including N itself). By construction, N’s payload is equal or higher than N’s start time. (Note Our input ii can fall into gaps between the event intervals.)

Eg: suppose N starts at 2:22 and left-child payload says 3:33, then we know for sure there at least one ongoing even during the interval 2:22 to 3:33.  Not sure if this is useful insight.

The text illustrates and proves why this design enables a clean and simple binary search, i.e. after each decision, we can safely discard one of “my” two subtrees. Here’s my (possibly imperfect) recall:

Let’s suppose each time the current node/event is not “ongoing”. (If it is then we can return it 🙂  ) So here’s the gist of the algorithm:

Suppose i’m the “current node”, which is root node initially. Compare ii to my left child L’s payload (not my payload).

As defined, L’s payload is the highest end times of L + all its sub nodes. In other words, this payload is the highest end time of all events starting before me.

Note the algo won’t compare L’s start time with ii
Note the algo won’t compare my start time with ii, though the scenario analysis below considers it.

  • case 1: If payload is lower than ii, then we know all events on my left have ended before ii, so we can discard the left subtree completely. Only way to go is down the right subtree.
  • the other case (case 2): if payload is higher or equal to ii, then we know my left subtree contains some events (known as “candd” aka candidates) that will end after ii.
  • case 2a: suppose my start time is before ii, then by BST definition every candidate has started before ii. We are sure to find the “candd” among them.
  • case 2b: suppose my start time is after ii, then by BST definition, all right subtree events have not started. Right subtree can be discarded
  • In both cases 2a and 2b, we can discard right subtree.


UseSpinning jvm flag #pthr/kernel/c#

Hi XR,

Around 2011 I once said a java thread waiting for a lock would not keep grabbing the lock…. Wrong! Actually it does keep grabbing for a while. It runs some useless CPU instructions and checks the lock periodically. This is really spin locking inside the JVM.

Spin locking wastes CPU but frequently results in faster application performance (reducing context switches). Therefore, JVM runtime uses some policy to decide how long a thread would spin before placing the thread in some kind of “parking lot” (my terminology) so the thread doesn’t’ occupy CPU any more. This spin phase used to be configurable via the boolean UseSpinning JVM flag, but in java7 and beyond, this flag has no effect — JVM always has freedom to use spinning. This is described in [[javaPerf2014]]

Comparison 1 — Dotnet SpinWait() provides exactly the same spinlock feature, as described in

Comparison 2 — In linux Kernel (not userland), 1) spinlock and 2) counting semaphore are the two locking primitives, as described in Page 164 [[understandingLinuxKernel]]. Both locks block a “grabbing” thread getting into a critical section until another thread leaves it. Similar to my 2011 description, the kernel semaphore remembers all sleeping processes (P174) and wakes one of them when the critical section is clear. I think this kernel semaphore is used by the scheduler in the kernel.

Comparison 3 — In linux pthreads library (the basis of linux JVM), the locking constructs are implemented using system calls and userland C functions. Pthreads as a userland library probably can’t use the above two kernel constructs directly. Userland C function can implement spinlock, but to implement “parking” I assume (kernel support and) system calls are required. So far I assume the pthreads thread maps to native threads controlled by the kernel. This mapping is now the dominant usage.

The alternative mapping model is green threads, where userland threads are invisible to kernel. Parking has to be implemented in userland.  Nowadays it’s rare to see green threads.

double-ptr is rare@@

  • c# ref-param (on a class type, not a Value data type) is really a double ptr. No other ways to achieve this purpose.
  • java offers absolutely no such support or any support for double-ptr
  • python? not sure. Not common

In conclusion, I now feel

  1. the need for double-ptr is prevalent across languages. One of the most common need is — main() passes into my method a String or List object, and I want to reseat that pointer so that main() will operate on the new String
  2. Despite these common needs, java and probably many new languages were designed from ground up to live without double pointers. It’s a conscious decision, an active refusal. I have not, but you may need to find workarounds.
  3. Workarounds are plenty and never too ugly too convoluted too complex
  4. In fact, many would consider (practically) all double-pointers as ugly and dangerous for the average programmers. If you are more than an average programmer and want to use double-pointers, then those workarounds are rather easy for you.
  5. Conclusion — in this new era, double-pointers are rarely needed.

RTS pbflow msg+time files #wait_until #60%

Background — In RTS we relied everyday on a UDP message replay tool, taking in a msg file and a corresponding time file. Both binary files. I saw no message delimiters in the msg file, partly because this file is produced in production. Inserting delimiters would add overhead in production parser engine.

Given there’s no delimiter, I believe the timestamp file must have a seek-offset values (marker) corresponding to each timestamp.

Q: how would you implement a similar replay tool?

Driver is the timestamp file. Every time I see a timestamp, I would wait_until() that timestamp (adjusted to my start time). Upon wake-up, I would send the corresponding chunk of bytes between current and next markers.

I would use condition_variable::wait_until(). As noted in c++condVar 2 usages #timedWait, this function has nanosec precision.

(The mutex needed in wait_until? My program is single-threaded, so the dummy lock would be held forever.)

The chuck of bytes would be sent as one UDP packet. No split. Each chunk should starts with an original packet header created by the matching engine, and received in the original feed.

Q2: how about TCP?

TCP receiver can deal with partial messages very effectively, as I saw in my tests.

However, TCP receiver can introduce delays in the sender i.e. my replay tool, so the transmission will not be instantaneous. In fact, send() can block! This could lead to drift in the timer. That’s why I need wait_until()

Q3: Explain timer drift?

Suppose in the time file, Timestamp #1 is 9:00:00:000001 and Timestamp #2 is one microsec later, but the data transmission can take 10 nanosec esp. with TCP blocking send().  This 10 nanosec is a small drift but it adds up to microseconds.

Every useful codebase requires never-ending maintenance #mod_rewrite

Imagine you maintain any mature, stable software product with non-trivial complexity, used by 100+ internal users or 1000+ external users. I would venture to predict that maintenance effort for such a product would never end.

  • New feature requirements – The bigger and more diverse your user base, the less likely you would stay stagnant. Stagnant would lead to losing relevance gradually. Consider dinosaurs. If you don’t evolve , then other products would catch up, absorb your best features and make you /irrelevant/.
  • New maintainers — have a need to understand and learn your source code, otherwise in 20 years no one will be available to answer user questions or maintain the codebase. New maintainers shall see room for improvements and modernization (like adopting new technologies). After a few changes of guard, they could vote for a rewrite.
  • Imperfections — We wish all the bugs get ironed out X years after active development, but sadly, bugs never die off completely.
  • New users — come on-board, ask (rarely new) questions that require assistance, which is part of maintenance effort
  • Documentation — require periodic update and modernization. Otherwise, new readers would find the documentation old-fashioned.
  • testing — testing never ends. There will be better ways to test your product. If lucky, then would uncover hitherto unknown bugs.

Therefore, there’s budget allocated to the system as long as there are important users. Budget creates jobs and employment. Some of these jobs are age-friendly.

I used to think apache mod_rewrite was written once and needs no update. I doubt it now.

strong coreJava candd are most mobile

Jxee guys face real barriers when breaking into core java. Same /scene/ when a c++ interviewer /grills/ a java candidate. These interviewers view their technical context as a level deeper and (for the skill level) a level higher. I have seen and noticed this perception many times.

Things are different in reverse — core java guys can move into jxee with relatively small effort. When jxee positions grow fast, the hiring teams often lower their requirements and take in core java guys. Fundamentally (as some would agree with me) the jxee add-on packages are not hard otherwise they won’t be popular.

Luckily for java guys, Javaland now has the most jobs and often well-paying jobs but ..

  1. c# guys can’t move into the huge java market. Ellen have seen many
  2. c++ is a harder language than java, but a disproportionate percentage (80%?) of c++ guys face real entry barrier when breaking into javaland.
  3. I wonder why..
  • I have heard of many c++ and c# veterans complain about the huge ecosystem in javaland.
  • I have spoken to several c++/c# friends. I think they don’t sink their teeth in java. Also a typical guy doesn’t want to abandon his traditional stronghold, even though in reality the stronghold is shrinking.
  • Age is a key factor. After you have gone though not only many years but also a tough accumulation phase on a steep learning curve, you are possibly afraid of going through the same.

[20] java≠a natural choice 4 latency #DQH

I think java could deliver similar latency numbers to c/c++, but the essential techniques are probably unnatural to java:

  • STM — Really low latency systems should use single-threaded mode. STM is widely used and well proven. Concurrency is the biggest advantage of java but unfortunately not effective in serious latency engineering.
  • DAM — (dynamically allocated memory) needs strict control, but DAM usage permeates mainstream java.
  • arrays — Latency engineering favors contiguous data structures i.e. arrays, rather than object graphs including hash tables, lists, trees, or array of heap pointers,,. C pointers were designed based on tight integration with array, and subsequent languages have all moved away from arrays. Programming with raw arrays in java is unnatural.
    • struct — Data structures in C has a second dimension beside arrays – namely structs. Like arrays, structs are very compact, wasting no memory and can live on heap or non-heap. In java, this would translate to a class with only primitive fields. Such a class is unnatural in java.
  • GC — Low latency doesn’t like a garbage collector thread that can relocate objects. I don’t feel confident discussing this topic, but I feel GC is a handicap in the latency race. Suppressing GC is unnatural for a GC language like java.

My friend Qihao commented —

There are more management barriers than technical barriers towards low latency java. One common example is with “suppressing gc is unnatural”.

tech firms value motivation in coding drill

I agree with you that to pass FB/Goog/indeed/twitter interviews, we all need to practice for months.

We also need to study the problem patterns during our practice. “Study” is euphemism for a hard long struggle. If we don’t put in a hell lot of focused energy, we can’t absorb the amount of variations and *recognize* the patterns — 从厚学到薄. Without learning the patterns, we will be overwhelmed by the sheer amount of variations in the problems, and forget a lot, as in 狗熊掰棒子.

Q: Why the hell do these firms like to hire coders who practice so hard?
A: My hypothesis:

  • such a coder is typically young — with plenty of spare energy and spare time. An older techie with kids is unlikely to endure such heavy coding drill. We lack spare energy and spare time.
  • such a coder has perseverance and dedication — hard-driving, hard-working, focused, achievement-oriented and determined
  • such a coder is able to tolerate repetitive, boring, lonely tasks — even joy the boredom. I call it absorbency capacity or 耐得住寂寞. This is not only attitude; but an aptitude. My son currently lacks this capacity, but my dad, my daughter and I have this capacity, to varying degrees.
  • such a coder is able to pay attention to details — Getting a program to work on a range of corner cases requires attention to details.
  • such a coder has a quick analytical mind for abstract logical problem solving

In contrast, financial firms target slightly different skills. Their coding questions test language-specific efficiency, language features, or pseudo-code algorithm. Tech firms don’t test these.

Most c++guys have no insight in STL

When I recall my STL interviews on Wall St, it’s clear that majority of c++ app developers use STL almost like a black box, every year for 5-10 years. Only 1% has the level of insight as Henry Wu.

Reason? such insight is never needed on the job. This factor also explains my advantage in QQ interviews.

  • Analogy — Most of us live in a house without civil engineering insight.

STL uses many TMP and memory techniques. Some may say STL is simple but I doubt they have even scratched surface on any of the advanced topics.

  • Analogy — some may say VWAP execution algo is simple.

Many interview questions drill in on one or two essential STL functionality (which everyone would have used), just below the surface. These questions filter out majority[1] of candidates. Therein I see an opportunity — You can pick one or two topics you like, and grow an edge and a halo, just as Henry did. Intellectual curiosity vs intellectual laziness. 不求甚解,不加咀嚼,囫囵吞枣 — I see it in many self-taught developers. That’s exactly what these Wall St interview questions are designed to identify and screen out.

How about west coast and other high-end tech interviews? I’m not sure.

[1] sometimes 90%, sometimes 60%.


[19] sorting J integers, each in [1,N]

Q: What’s the time complexity of sorting J integer scores which are all [1, N], possibly non-unique?

This is classic counting sort. My book [[algo in a nutshell]] incorrectly says O(J). Incorrect when N is large.

My analysis below uses a total of three dimensions N, J and K, where N and J can be vastly different, and K <=min(N, J), but K could be much smaller.


Textbook counting sort is listed at the bottom. If N is smaller than J, then O(N+J) is dominated by O(J). Now suppose N is huge (like the revenue figure of a company).

Suppose there are K distinct scores among the J scores. I would want a constant-time translation from each distinct score to a distinct counter. I wish there’s a perfect hashing from the K scores to K buckets but I think it’s impossible since the K distinct values are not known in advance. Even with imperfect hashing, I can get O(1) translation from each score value to a distinct counter. I will iterate over the J scores. For each,

  • look up (on hash table) the corresponding counter.
  • If the score is new i.e. missing from the hash table,
    • then create a counter
    • add the “pair” {score -> counter} into the hash table.
    • Also insert the score into a min-heap
  • increment the counter

O(K logK) : Now pop the min-heap K times, each time to get a distinct score in ascending order. Lookup the score in hash table to get the counter. If counter for score 55 the count is 3, then output 55 three times. This output would be a sorted sequence of the original J scores.

— comparing with alternatives: Mine is O(J + K logK). If K*logK exceeds N, then I would fall back to standard counting sort.

The comparison-based sort would be O(J logJ), inferior to mine if K is much smaller than J.

The textbook counting sort would be O(J + N) , using N independent counters.

topK,K-th ranked within a bag: quickselect

For any top-K problem, and for any problems about K-th ranked item, a Fairly efficient solution is the quickselect algo, in O(N). O(N) is the theoretical lower bound since you must examine each of N element.

For top-K, we can use quickselect to partition the array and the upper section would be top-K. This is O(N). In addition, I can think of two O(N logK) solutions.

Keep the container size at up-to-K. For each new item encountered, compare to the minimum in the container. IFF bigger, then remove the minimum then insert new-comer.


insert is O(1) on average but remove() is O(logK)

“If you are inserting a single node for every one you remove, then you lose the advantage of the asymptotic O(1) average insert that heaps provide as the delete would dominate, and you might as well use a BST.” according to


Each remove() and insert() is logK

[13] U.S. vs Singapore economies

I used to feel Singapore economy is less stable than North America, Japan and Western Europe. Even a small European economy like Belgium or Finland is in a safer position than Singapore because it’s a stable, integral part of a large, strong ecosystem. These western economies have experienced relatively low or negative growth, persistent unemployment, high national debt, aging population, diminishing manufacturing competitiveness under competition from emerging economies, …. but "strong" means they can withstand bigger adversities longer. Strong means economic competitive advantage over the developing countries.

Now I see signs of U.S. economic status in decline. However, positive signs are still visible — in high tech, new/old media, life science, aerospace, innovation, advanced R&D, energy, material science…. I feel the next 10 years may see further decline in USD and U.S. GDP relative to rest of the world.

In contrast, over 30 years, I feel S’pore is still too small to be strong. S’pore government has to make all the right decisions to avoid the many potential disasters (i won’t list them). Being small, I feel S’pore would suffer quite badly from even a mild misstep. (Sorry no specific examples.)

For big or small economies, every country needs to stay competitive and stay relevant, and identify new growth sectors every 20 years. Just witness the current decline of Nokia and BlackBerry. Nortel is another high-tech example — once the biggest company in Canada. Creative Technology, DEC too. I guess AOL is another cautionary tale of staying relevant. The software industry is even more unforgiving and littered with even more dinosaurs compared to hardware or telecom industries.

Perhaps Singapore would fall on a similar path, but there’s opportunity of renewal…

For a big country like the U.S., renewal seems inevitable. Is there a major sector in U.S. economy that may decline and never recover again? Maybe the railway, steel and garment industries of the past?

strategic value of MOM]tech evolution

What’s the long-term value of MOM technology? “Value” to my career and to the /verticals/ I’m following such as finance and internet. JMS, Tibrv (and derivatives) are the two primary MOM technologies for my study.

  • Nowadays JMS (tibrv to a lesser extent) seldom features in job interviews and job specs, but the same can be said about servlet, xml, Apache, java app servers .. I think MOM is falling out of fashion but not a short-lived fad technology. MOM will remain relevant for decades. I saw this longevity deciding to invest my time.
  • Will socket technology follow the trend?
  • [r] Key obstacle to MOM adoption is perceived latency “penalty”. I feel this penalty is really tolerable in most cases.
  • — strengths
  • [r] compares favorably in terms of scalability, efficiency, reliability, platform-neutrality.
  • encourages modular design and sometimes decentralized architecture. Often leads to elegant simplification in my experience.
  • [r] flexible and versatile tool for the architect
  • [rx] There has been extensive lab research and industrial usage to iron out a host of theoretical and practical issues. What we have today in MOM is a well-tuned, time-honored, scalable, highly configurable, versatile, industrial strength solution
  • works in MSA
  • [rx] plays well with other tech
  • [rx] There are commercial and open-source implementations
  • [r] There are enterprise as well as tiny implementations
  • — specific features and capabilities
  • [r] can aid business logic implementation using content filtering (doable in rvd+JMS broker) and routing
  • can implement point-to-point request/response paradigm
  • [r] transaction support
  • can distribute workload as in 95G
  • [r] can operate in-memory or backed by disk
  • can run from firmware
  • can use centralized hub/spoke or peer-to-peer (decentralized)
  • easy to monitor in real time. Tibrv is subject-based, so you can easily run a listener on the same topic
  • [x=comparable to xml]
  • [r=comparable to RDBMS]

average O(): hashtable more IMperfect than qsort

In these celebrated algorithms, we basically accept the average complexity as if they were very likely in practice. Naive…

In comp science problems, hash table’s usage and importance is about 10 times higher than qsort

  • I would say qsort is faster than many O(N logN) sorts. Qsort can use random pivot. It degrades only if extremely “lucky;)” like getting a “6” on all ten dice.
  • In contrast, hash table performance depends mostly on programmer skill in designing the hash function, less on luck.

Performance compared to the alternatives — qsort competitive performance is pretty good in practice, but hash table relative performance is often underwhelming compared to red-black trees or AVL trees in practice. Recall RTS.

std::weak_ptr phrasebook

ALWAYS need to compare with raw ptr + shared_ptr, to understand the usage context, motivations and justifications is concise.

— Based on the chapter in [[effModernC++]]:

#1 feature — detect dangle

  • use case — a subject that keeps track of its observers who might become dangling pointers
  • use case — objects A and B pointing to each other with ref count … leading to island. Using raw pointers exclusively is possible but requires explicit deletion, as pointed out on P 84 [[Josuttis]]
  • In both use cases, Raw ptr won’t work since dangle becomes unnoticed.
  • Achilles’ heel of the #1 feature — manual “delete” on the raw ptr is beneath the radar of reference counting, and leads to chaos and subversion of ownership control, as illustrated —
#include <iostream>
#include <memory>
using namespace std;

void f1(){
  auto p = new int(55);
  shared_ptr<int> sp(p);
  weak_ptr<int> wp(sp);

  cout<<"expired()? "<<wp.expired()<<endl; // false
  cout<<"deleting from down below\n";
  delete p; // sp.reset();
  cout<<"expired()? "<<wp.expired()<<endl; // still false!
  // at end of this function, shared_ptr would double-delete as the manual delete 
// is beneath the radar of reference counting:(
int main(){

##[18] y dotnet hasn’t caught on #enterprise-focus

Don’t spend too much time about this unless considering to take up c#.

In 2012/13 I invested my time into c# just in case c# would become more widespread and high-paying. However, after 2013, I have not noticed this trend strengthening. If I have to guess why, here some personal glimpses:

  • web-GUI — Rise of alternative user interfaces such as mobiles and javascript-based web UI, etching away the strongest stronghold of C#. Nowadays, which new build chooses a fat client built in swing, javaFX, winforms, WPF or Silverlight? Fewer and fewer, to my surprise.
  • Linux — Maturity and further performance improvement of Linux machines compared to windows machines. For large server side apps, or cloud apps, I see more and more adoption of Linux.
  • Java@server-side — (in financial sector) Strengthening of java as the technology of choice on the server side. c# made a valiant but unsuccessful attempt to dethrone java in this game. Java may have limitations (do we know any?), but somehow the decision makers are not worried about them.
  • Open source — there’s continuing community effort to improve the java ecosystem (+ python, javascript, c++ ecosystems). C# is losing out IMO.
  • C++@windows — (My personal speculation) I feel c# offers higher performance than java on windows machines, but vc++ is even faster.

Among ibanks, the main reasons for the current c# situation are 1) Linux 2) web-GUI

— Some update based on Wallace chat.

I think 1) microsoft stack, 2) jvm stack and 3) web 2.0 stack all have vast ecosystems, but at enterprise level, only microsoft^jvm

How about server side c++ or python? much smaller mind-share.

Wallace said on Wall St (where he worked for 5+ years) he knows no ibank using c# on the back-end. Wallace believed Unix/Linux is the main reason.

How about outside ibanks? Wallace said c# back-end is not so rare.

## y they ask QQ questions #revisit

Remember QQ means ‘tough topics not needed for GTD’. A QQ quiz (to some extent, algo quiz too)

  • measures your diligence, amount of learning effort you put in beyond your work projects.
  • measures your dedication and commitment to self-learning a dry, tough topic. Learning is not only a drudgery, but also a major challenge on a typical job. 90% of the day-to-day challenges involve learning, re-learning, connecting the dots..
  • checks for a self-starter vs a passive learner
  • measures your self-reliance at learning
  • measures how detail-oriented a candidate is
  • checks for intellectually laziness
  • measures soundness of fundamental concepts underlying your logical “framework”. In many large financial systems there are fundamental design principles and concepts that are .. fairly logical and consistent. Yang was good at grasping the logical, and questioning the illogical.
  • measures depth of curiosity and learning
  • measures technical communication bandwidth with another developer

rvr^rvalueObject #rvr=compiler concept

  • ALL objects by definition exist in runtime memory but references may not.
  • std::move() is about rvr variables not rvalue objects!
    • You can even use move() on natural-occurring temp though std::move is supposed to be used on regular lval objects
  • I now believe rvr is a compiler concept. Underlying is just an address of an object (temporary or otherwise)
    • Further, the traditional lvr is probably a compiler concept too. Underlying is a regular pointer variable.

“rvalue object is an object that can ONLY appear on the RHS of assignment.” I wrote this in 2018 but now less sure.

rvalue object can be naturally occurring (usually anonymous), or “converted from a named object” by std::move(), but if some variable still references the object, then the object is actually not a proper rvalue object.

The SCB architect pointed that “some variable” can be a const (lvr) ref bound to a naturally occurring rvalue! To my surprise the c++ syntax rule says the object is still a temp object i.e. rvalue object, so you can’t assign this object to a lvr !

jvm^c++ as infrastructure

c/c++ is part of the infrastructure of many new technologies, and consequently will last for decades whereas java may not.

😦 This doesn’t mean there will be enough c++ jobs for me and my C++ friends.

  • JVM is an infrastructure for a relatively small number of new languages and new frameworks like spring, hadoop,.. However, the machine learning community seem to regard python and c++ as the mainstay.
  • Java (not JVM) serves as infrastructure in the new domains of MSA, cloud, big data etc, but not Machine Learning.

self-hate due to hearsay 300k salary #XR

.. You seem to feel the (hearsay) income level of 300k is the minimum you need to feel good about yourself. In that case, your worry and negative self-assessment about income is misaligned with reality.

A bit of real statistics to chew on – rank all countries by GDP per-capita. Most of the top 10 richest countries have population below 9 million including Switzerland and most of the Northern Europe countries.

Q: How many countries are richer than U.S. *and* with a population above 20 million?
Answer: zero. Japan, Germany, UK, France,  … are all less rich than the U.S. Now, I believe you don’t want to compare with developing countries like China, Korean, Taiwan, India, let’s first focus on rich countries —

  • I believe half the American families earn less than 60k combined income, so do you think half the American families are struggling to survive every day?
  • I would estimate (based on my knowledge) more than half the *families* across the rich countries earn less than USD 60k, but you are implying that a single income of 300k is the minimum you need?
  • USD 300k single income would put you in the top 2% in any rich country, but you feel that’s the minimum you need?
  • USD 300k is higher than at least half the doctors’ and lawyers’ income across the rich countries, so you seem to say most doctors and lawyers are struggling to survive based on their income
  • My wife’s income is about SGD 30k. A regular teacher’s salary in Singapore is about SGD 50k. Singapore is, by most rankings, more affluent than the U.S. and teachers are a large, white-collar workforce. By your standard, even a 500% increase in a Singapore teacher’s income would still be too low for you.
  • In one of the most expensive cities of our world – London, a USD 300k salary would be top 2%. I know from many sources that London finance IT salary is lower than New York. A 700-pound daily contract rate is “extremely rare” (unheard of to many people) but it works to be only USD 230k, but you would think that’s not enough to survive. Mind you, London is more expensive than New York.
  • Some would say London is still cheaper than … Hongkong. A UBS VP position I tried was at HKD 1.2 million, about half your minimum standard.
  • I have friends in Shanghai and Beijing – the most expensive Chinese cities (along with Shenzhen). A 300k USD salary would be one in 500 jobs over there, but you think it’s barely enough for you. They would guess you live in a city where home price is 10 times higher than Shanghai/Beijing but in reality, your home is more affordable — A comparable apartment (not a 2-storey house with backyard) in Beijing/Shanghai would cost at least USD 1.5 million.

You are living in an ivory tower (and refusing to step out to the real world) if you hold on to that irrational and negative view. You sound like a guy complaining about his 10-room, 3-story mansion. It’s not depression but hallucination.

If you carry this habit into parenting, then beware — your kids could be top of their class but you may still feel they are not good enough because they didn’t win a gold medal. Would be tragic. I think Chinese parents are demanding but most are not at that level. We love our kids and accept them. We ought to love and accept ourselves.

I recommend [[compassion and self-hate]] by Theodore Issac Rubin, my favorite American self-help writer. His books changed my life. I feel good to know he is now 95. I’m grateful to Dr Rubin; I’m grateful to my dad; I’m grateful to Buddhism teachings; and I’m grateful when I answer your questions — I often get a chance to look into myself. I thank our creator to give me the analytical powers (though not as powerful as Dr Rubin) to dissect the layers and uncover the core issues in my psyche. As I endeavor to answer your questions I often reach a deeper understanding of my own pains, stupidity, irrationality and unfairness to myself and love ones.

[10] Can interviewers see through our soft skill weaknesses@@

Now I feel even experienced interviewers (I had chitchat with some recently) can only judge a candidate’s non-technical qualities to a very limited extent.
Experienced interviewers try to focus on “thought process” and “problem solving skill” (soft skills) rather than practical experience and textbook knowledge. White boarding, expanding a given problem’s scope — “what if we need …”, “What if we can’t …” … By the way, my Google interviews are all of this type. Heavy on algorithm and almost language-agnostic algorithms. Some of my most in-depth interviews with good companies seem to follow similar patterns. These are not designed to be traditional technical screening as they test analysis and problem solving, under pressure.

In addition, Many interviewers ask about java’s fundamental design principles, though we developers seldom need to know those. Examples include those wildly crazy questions on hash table implementations, wait/notify, equals() and hashCode(), generics implementation in Java 1.5, beauty of dependency injection, reentrant locks .. These questions are less about “productivity/competency” or design skill and more about “depth of understanding“. Perhaps they want to know how deep the candidate thinks. I consider these non-technical qualities.

Obviously interviewers watch out for personality issues like arguing, strong opinion, arrogance, refusal to admit defeat, insufficient give and take, lack of confidence… Luckily none of my serious weaknesses are visible to them — such as my weakness in office politics.

Taking a step back, if we analyze why a person A fails on a job, obviously a lot of responsibilities are on other people. But if we focus on the weaknesses in A herself and try to classify the common weaknesses in workers, very loosely i see 2 types
1) a range of technical weaknesses
2) a range of non-technical weaknesses

I feel the art and science of candidate screening has advanced more on the first front then the 2nd front. Sharp interviewers can now find out technical weaknesses more than they can non-technical weaknesses. I’m talking about things like real (not fake) honesty, ownership and initiative, efficiency, follow-up, prioritizing, sense of urgency, can-do attitude, client-service attitude, dedication, knowledge sharing, helping out, give and take, push-back, persistence and determination, respect for cultural differences, bias, fairness, attention to detail, code testing attitude, ethical standard on code quality, professionalism, self-starter, motivation, hard working, personal sacrifice … ( … some of my strongest and weakest parts!)

I think these are important to a hiring manager and can break or delay a project, but these personal qualities are invisible to even the smartest interviewers I can imagine. However, I was told some interviewers can be unbelievably smart, beyond my imagination. I tend to feel a perceptive person can be very sensitive but she can’t be sure about personalities based on a brief conversation.

My conclusion — once you pass the technical screening, then interviewers can’t easily find out your relative weaknesses in communication, personality, attitude .. provided you don’t make an obvious mistake.

celebrity search in O(N) #MS

Q (MS 2018): write an algorithm to search for celebrities in a collection of people with the help of a blackbox predicate that can tell whether one person knows the other (bool a.knows(b)) which is an asymmetric relation. Celebrities are those who know no one but are known by everyone else.

Aha : There can be 0 or 1 celebrity.

There are exactly N*(N-1) relationships in the system, so brute force is O(N*N) but I propose a 2-pass O(N) solution:

  1. First pass: shrink the collection of Nodes [A, B .. N]
    1. check a random pair. if A knows B, then remove A from list; else, remove B from list. So first step always shrinks list by one
    2. Now check any random pair of persons in remaining list. Check either direction. Will remove exactly one person.
    3. list will shrink to 1 person (Dan) after N-1 steps
  2. Second pass: check this Dan thoroughly. Dan must be known to everyone and must not know anyone. We end up with either no celebrity OR one celebrity in Dan.

Aha — better use A/B/C or Alice/Bob/Cody instead of #1, #2, #3.

easiest MSWindows G++ installer #c++17 is an easy installer for Perl and it bundles GCC. I was able to compile in c++17 mode:

g++ -std=c++17 my.cpp

My experiences with cygwin and mingw were both horrible.

  1. cygwin — too bulky and unnecessary, and affected my existing Windows features. I tried sygwin once many years ago and quickly concluded that it was designed for people unlike me. Those people would have only positive experiences about cygwin that I don’t share.
  2. mingw — supposed to be minimalist, but I tried installing it at least 3 (up to 6) times and always painful.

In contrast, I installed strawberryPerl about 3 times and once on a friend’s laptop … always quick and easy. No tweaking required. GCC + gdb works out of the box.

gdb to show c++thread wait`for mutex+condVar shows gdb works out of the box.

Through almost 10 installations of GCC on various windows laptops, I have come to the obvious conclusion.

Don’t use null/None to indicate empty

Update — [[c++codingStandard]] says we should use (smart) ptr parameter if it can be null.

Many developers return (python) None/null (java) and put such a value in a containers to positively indicate an empty value.

This practice creates unnecessary ambiguity (rather than reduce ambiguity) because many built-in modules authors use None/null when they have no choice.  If you positively return this value, it’s harder to differentiate the various scenarios. This becomes an unnecessary ambiguity when troubleshooting in production.

In Java, I often create a dummy instance of a MyType and return it to mean “empty”.  I think applicable to c++ too.

In python, I tend to return a negative number from a function that ordinarily returns  a MyType, since python functions can return type1 in ContextA but type2 in ContextB! Python list also allows unrelated types.

##what UPSTREAM tech domains: defined by eg

It pays to specialize in a domain that helps you break into similar domains. (Unsuccessful with my quant adventure so far.)

Note “wrappers” API are never upstream. Very common in big banks including Qz.

  • #1 eg: [L] socket —- C++ is the alpha wolf, leader of the wolf pack
  • [D] market data —- Nyse/Nasdaq (binary) feed parer seems to be the leader, over the slower FIX feeds
  • [D] execution algo —- sell side equities desk seems to be the leader in terms of volume and variety of algos
  • data analytics language —- python looks like the leader
  • eg: [D] alpha —- equities trading is the leader
  • eg: threading —- java is the leader of the pack. In c++ threading is too hard, so once I master some important details in java threading, it helps me build zbs in c/c#, but only to some extent
  • eg: regex/string —- Perl is the leader
  • eg: [D] consolidated firm-wide risk —- SecDB is the leader. Applicable to buy-side? I think so.
  • lambda —- C# is the leader. Scripting languages follow a different pattern!
  • list/stream —- Linq is the leader
  • drv pricing theory —- Black-Scholes —- feels like a leader or upstream, but not /convincing/
  • [L] heap/stack —- C++ is the leader. Java and c# are cleaner, simpler and hides the complexities.
  • defensive coding —- (and crash management) C++ is the leader, because c++ hits more errors, like Japan on earthquake management.
  • eg: collections —- C++ is the leader of the pack. There’s a lot of low level details you gain in STL that help you understand the java/c#/python collections
  • eg: [L] latency —- C++ is the leader of the pack in low-level optimizations .. In-line, cache-efficiency, footprint..
  • eg: [L] pbref^val —- C++ is the leader. C# has limited complexity and java has no complxity
  • [L = lowLevel]
  • [D= well-defined application domain]
  • ? automated pricing —- no upstream
  • ? Fixed income trading systems —- no upstream. IRS^Treasury^credit bonds (trading platforms) have limited features in common.
  • [D] VaR/risk analytics —- FI derivatives in big banks seem to be the leader, but I feel one trading desk in one firm may not respect another system

xx: weekendCIV^codingDrill^work

Amount of tech learning and zbs growth over weekend coding assignments are different from coding drill or work projects

  • intensity — highest
  • QQ halos — comparable to my language feature experiments — improves QQ
  • scale — larger than coding drill but not too large like work projects
  • BP — is a major learning focus and evaluation criteria
  • absorbency — highest
  • sustained focus — lasting over a few days, pretty rare for me

Are equities simpler than FICC@@

I agree that FICC products are more complex, even if we exclude derivatives

  • FI product valuations are sensitive to multiple factors such as yield curve, credit spread
  • FI products all have an expiry date
  • We often calculate a theoretical price since market price is often unavailable or illiquid.
  • I will omit other reasons, because I want to talk more (but not too much) about …

I see some complexities (mostly) specific to equities. Disclaimer — I have only a short few years of experience in this space. Some of the complexities here may not be complex in many systems but may be artificially, unnecessarily complex in one specific system. Your mileage may vary.

  • Many regulatory requirements, not all straightforward
  • Restrictions – Bloomberg publishes many types of restrictions for each stock
  • Short sale — Many rules and processes around short sale
  • Benchmarks, Execution algorithms and alphas. HFT is mostly on equities (+ some FX pairs)
  • Market impact – is a non-trivial topic for quants
  • Closing auctions and opening auctions
  • Market microstructure
  • Order books – are valuable, not easy to replicate, and change by the second
  • Many orders in a published order book get cancelled quickly. I think some highly liquid government bonds may have similar features
  • Many small rules about commission and exchange fees
  • Aggregate exposure — to a single stock… aggregation across accounts is a challenge mostly in equities since there are so many trades. You often lose track of your aggregate exposure.
  • Exchange connectivity
  • Order routing
  • Order management

in c++11 overload resolution, TYPE means..

In the presence of rvr, compiler must more carefully choose overloads based on type of argument and type of parameter.

  • I think type-of-parameter is the declared type, as programmer wrote it.

No change in c++11. No surprise.

  • “argument” means the object. Type-of-argument means…?
  1. Traditionally it means int vs float
  2. Traditionally it means const vs non-const
  3. (Traditionally, it means Acct vs TradingAcct but this is runtime type info, not available at compile time.)
  4. In c++11, it also means rval-obj vs regular object. You can convert regular object to a rval-object via … move(). But What if argument is a rval-variable, declared as “int&& param” ?

This is one of the trickiest confusions about rvr. The compiler “reasons” differently than a human reasons. [[effModernC++]] tried to explain it on P2 and P162 but I don’t get it.

Yes the ultimate runtime int object is an rval-obj (either naturally-occurring or moved), but when compiler sees the argument is an “int&& param”, compiler treats this argument as lvalue as it has a Location !

My blogpost calling std::move() inside mv-ctor  includes my code experiments.

spreadsheet concretize #Junli Part2

Note the java algo is event-queue based — every newly concretized cell is an event added to the event queue. When we encounter this cell again after a dequeue, All registered dependents of this cell are checked. If the check results in a new cell concretized, this cell is enqueued.

In contrast, my c++ algo is a modified BFT. Key idea is, whenever a branch node can’t be concretized (due to an unresolved upstream reference) we basically ignore that node’s subtree. The other root nodes’s BFT would eventually visit this node, unless there’s a cycle.

I believe both algorithms are relatively easy to visualize at a high level. Which algo implementation is more tricky and error-prone? I guess the BFT but not really sure.

— Topo-sort — “topological sorting” is the reusable general technique for similar problems like even scheduling. As I described to Kyle, the idea is “Given a directed graph, assign (artificial) integer ranks to all nodes so that every arrow is from a low rank to a high rank”

There are linear time algorithms to assign the ranks. I think some form of BFT may work… need more thinking.

I think it depends on what’s more natural — start from leaf nodes or start from root nodes. The start level would use lower integers.

For a typical spreadsheet, I feel it’s natural to start from nodes that have no downstream.

My c++ implementation was similar to Kahn’s algorithm.

[[Algorithms]] P 550 presents an elegant DFT  algo but not so intuitive to me yet. I think it DFT can’t solve this spreadsheet.

–Some additional notes on the c++ implementation

  • 🙂 I encountered much few seg-faults than in other projects. I think it’s because very few arrays (including vectors) are used.
  • Before I look up a map/set, I always use count() to verify.
  • 🙂 I didn’t need to check against end() as lower_bound() and find() functions would require.
  • no smart ptr needed. container of raw ptr worked well. See source code comments
  • In fact, container of cell names (as strings) is even “safer”.

python IV: xp-verification^QQ questions

Many CVs include claims like — "used py for sys automation". If interviewer wants to verify such claims, There are too many realistic IV questions they can use.

Too many, so I want to be selective. Many verification questions requires extensive learning (and periodic re-learning to compensate for low usage). See [14] mock/realistic python IV

In fact, I do come with more wide-ranging python project experience than other candidates. If out of 30 "verification questions" I struggle with 50% then other candidates probably /fare worse/.

If you look at java/c++ interviewers, they don’t ask so many "experience verification" questions. On the contrary, QQ questions by definition means "not required in projects"

max-sum path problemS #high fan-out/fan-in

Note “path” implies directed (not bidirectional) edges i.e. dag

  • variation (rare): bi-directional edges
  • variation: single origin, single destination
  • variation: single origin, any destination among 55 leaf nodes
  • variation: each node has a +/- value
  • variation (rare): each edge has a +/- value. Edges usually outnumber nodes, as illustrated in the balloon-burst problem

–Common Special case: pyramid (see blogpost on recombinant binTree). The balloon problem has high fan-out.

I would try DP. For each node AA, determine (and store) the max cum-sum up to AA. Need to start from the layer 1 nodes (1 hop from origin), then use Layer 1 data to update layer 2 nodes.

Markov-like — For any node AA, the max cum-sum to AA only depends on the previous layer. No need to even care about a “fullpath” to AA. This property leads to dramatic *simplification*. Efficiency is a by-product of this simplification.

–Common Special case: non-recombinant binary tree.

i would use DFT + Kadane. Each origin-to-leaf path is independently handled. No layering.

Note Kadane algo handles +/- node values.

[19]Cod`drill in our 50’s #CSY

  • I agree that Competing with younger guys on (speed) coding test is like math competition, sports competition, … where experience doesn’t matter. But in our chosen profession, this is reality to be accepted.
  • I agree on Wall St etc there exist decent jobs with modest, reasonable coding tests. I hope coding test do not grow in popularity but I think they will.
  • I agree that you and me are not less intelligent than those strong candidates at coding tests. Your steadfast conviction is inspiring n uplifting. I tend to have self doubts.

I agree that most peers at my age have stopped coding practice.  Some individuals still do. (Deepak is just one example.) I see them as role models just like male yoga practitioners, among a sea of women. Most of my peers see coding drill as a chore, a hard job, a pain in the ass, but I’m different.

For the next 5 to 10 years I hope to find joy in coding drill. Coding drill is like jigsaw puzzle n board games … mental gymnastics to keep our brain young.

There is no shame in revisiting college level subjects.  At age 44 I still studied calculus and passed math exams with flying colors. I am proud of myself, not ashamed of myself. I can probably help my son n my daughters with math till their college. Not many dads can do that.

I will say this again — some technical interviews ask probabilities n SQL. These are college level topics. We just accept and prepare for these interview questions.

classic generator problems: non-trivial

classic generator problems (such as combination sum, combo with redraw, phonePad…) are not simple for most of us programmers. Even though these are classic and well-researched, most programmers can’t really master them .. cognitive barrier is formidable.

Our mind is not constructed to deal with such challenges. I have tried and failed many times to compare across various generator problems hoping to extract the gist like thick->thin 总结规律

Duplication in output —  is one of many tough challenges.

The recursion-in-loop magic is another “thingy” beyond our comprehension.

Many of them are DP problems —

  • A size-5 data set is easier to deal with than size-6
  • real challenge is to train our human brain to see through the “forest” and identify how size-5  RESULTS can help solve size-6

I believe these algos are building blocks of many coding problems, both real-world problems and quiz problems.

coding drill≈yoga + weight loss

I had a brief discussion with my friend Ashish.. Coding test improvement is similar to weight or flexibility improvement. I also tell my son about diet control. You probably can think of many other examples.

  • we don’t notice any improvement for a long time. Still, I believe the practice makes me better than before.
    • In the absence of visible progress, Optimism and Faith is important. A pessimist tends to give up before seeing ROTI
  • when we stop the exercise, we are likely to lose part of the effort and continue the downward spiral, esp. with flexibility. See coding drill: LASTING value4family wellbeing for10Y 
  • as we grow older, we feel we tend to lose the competition, even though there’s not enough evidence.
  • it can feel boring and repetitive
  • pair exercise is easier
    • Otherwise, promise a friend

killer app@RBTree #priorityQ,sortedVector

A common usage context is query on some data set after pre-processing . In these contexts, BST competes with sorted vector and priority queue.

  • Killer app against vector: incremental insert or live updates
  • Killer app against vector: if there’s even occasional delete(), then sorted vector would suffer
  • Killer app against vector: update on one item can be implemented as delete/reinsert. Vector requires binary search -> mid-stream insert
  • minor advantage over sorted vector: vector sorting requires swapping, so value-semantics is problematic
  • Killer app against priority queue: range query, approximate query,
  • Killer app against priority queue: both max() and min()
  • Application: if each BST node needs additional data. Binary heap doesn’t expose those nodes(?)

It’s important to remember the advantages of vector

  1. cache efficiency
  2. runtime malloc cost

Java, c++ and c# all provide self-balancing BST in their standard libraries, from the very beginning. In my projects, I use these containers on an everyday basis. However, after talking to a few friends I now agree that most coding problems don’t need self-balancing BST because

  1. These problems have no incremental insertion/deletion, so we can simply sort an array of items at O(N logN) as pre-processing
  2. In some cases, priorityQ is a viable alternative
  3. Python doesn’t have this container at all and all coding problems must support python.

bitBucket: investigate history of merges

Scenario: throughout the month, your main branch takes in many pull-requests. You want to see the sequence of merge points in chronological order.

On BitBucket, the commit listing is chronological but doesn’t show the merge commits. I usually unhide the merge commits, so on my page the merge commits are interleaved with “regular” commits in chronological order like

  • merg3
  • merge2
  • commit2c
  • merge1
  • commit1b
  • commit2b
  • commit 2a
  • commit1a
  • commit3a

The regular commits line up according to their commit timestamps. the most recent merge could have its one and only commit created months ago !

What I want is “nothing but the merge points”. I think I have to ignore the “regular” commits and only look at the merge commits.

FIX^MOM: exchange connectivity

The upshot — some subsystem use FIX not MOM; some subsystem use MOM not FIX. A site could send FIX over MOM (such as RV) but not common.

It’s important to realize FIX is not a type of MOM system. Even though FIX uses messages, I believe they are typically sent over persistent TCP sockets, without middle-ware or buffering/queuing. RTS is actually a good example.

Q: does FIX have a huge message buffer? I think it’s small, like TCP, but TCP has congestion control, so sender will wait for receiver.

Say we (sell-side or buy-side) have a direct connectivity to an exchange. Between exchange and our exterior gateway, it’s all sockets and possibly java NIO — no MOM. Data format could be FIX, or it could be the exchange’s proprietary format — here’s why.

The exchange often has an internal client-server protocol — the real protocol and a data flow “choke point” . All client-server request/response relies solely on this protocol. The exchange often builds a FIX translation over this protocol (for obvious reasons….) If we use their FIX protocol, our messages are internally translated from FIX to the exchange proprietary format. Therefore it’s obviously faster to directly pass messages in the exchange proprietary format. The exchange offers this access to selected hedge funds.

As an espeed developer told me, they ship a C-level dll (less often *.so [1]) library (not c++, not static library), to be used by the hedge fund’s exterior gateway. The hedge fund application uses the C functions therein to communicate with the exchange. This communication protocol is possibly proprietary. HotspotFX has a similar client-side library in the form of a jar. There’s no MOM here. The hedge fund gateway engine makes synchronous function calls into the C library.

[1] most trader workstations are on win32.

Note the dll or jar is not running in its own process, but rather loaded into the client process just like any dll or jar.

Between this gateway machine and the trading desk interior hosts, the typical communication is dominated by MOM such as RV, 29West or Solace. In some places, the gateway translates/normalizes messages into an in-house standard format.

[19] j^python arg passing #(im)mutable

Python is ALWAYS pbref.

— list/arrayList, dict/HM etc: same behavior between java and py —
pbref. Edits by the function hits the original object.

— string: probably identical behavior between java and py —
immutable in both java and py.

Passed by reference in both.

If function “modifies” the string content, it gets a new string object (like copy-on-write), unrelated to the original string object in the “upstairs” stackframe.

— ints, floats etc: functionally similar between java and py —
If function modifies the integer content, the effect is “similar”. In python a new int object is created (like string). In java this int is a mutable clone of the original, created on the stackframe.

Java int is primitive, without usable address and pbclone, never pbref.

Python numbers are like python strings – immutable objects passed by reference. Probably copy-on-write.

read_matrix(r,c)beats matrix[r][c]

Usually read_matrix(r,c) simply returns matrix[r][c] but

Such a wrapper function can

  • assertions
  • read-count + write-count on each element
  • write-once constraint enforced
  • saves typing — a(r,c) vs mat[r][c]
  • encapsulation — the matrix can be encapsulated as a local static object.
  • simplifies conditional logic when r==0 or r==c. Without this wrapper, I often need separate code to initialize those special cells
  • In python, defaultdict() can handle index-out-of-range automatically but read_matrix() also does a fine job

IDE IV=easier4some !!4me

I guess many (majority ?) candidates consider IDE coding IV easier than white-board. I think they rely on IDE as a coding aid. Many programmers are not used to “demo” coding using white board … stressful, unnatural?

The “coding aid” feature helps me too, but it helps my competitors more !

If on white-board (or paper), my relative rating is A-, then on IDE i would score B

  • my relative weakness on IDE — ECT speed
  • my relative strength on white-board — clarity of thinking/explanation

python^perl #my take

(Collected from various forums + my own, but ranking is based on my experience and judgment)

  1. OO
    1. – Not sure about perl6, but not many app developers create perl classes. Many CPAN modules are OO though. Python users don’t create many classes either but more than in perl. I guess procedural  is simple and good enough.
    2. – not sure about perl6, I feel perl OO is an afterthought. Python was created with OO in mind.
    3. – even if you don’t create any class, out-of-the-box python relies (more deeply) on more OO features than CPAN modules. Object-orientation is more ingrained in Python’s ethos.
    4. OO is important for large-scale, multi-modular development project. Python is more battle tested in such a context, partly due to it’s OO design, but i think a more important reason is the python import machinery, which is probably more powerful and more widely used than perl’s
  2. – Python can cooperate better with other popular technologies like DotNet (IronPython) and Java (Jython). See also python project in boost.
  3. – perl’s text processing (and to a lesser extent unix integration) features are richer and more expressive. Using every key on the keyboard is like using all 20 fingers and toes. I immediately felt the difference when switching to java. In this aspect, pytyon is somewhere between those 2 extremes.
  4. – perl can be abused to create unreadable line-noise; python has a rather clean syntax
  5. – As part of unix integration, perl offers many practical one-liners competing effectively with grep/awk/sed. Perl one-liners integrate well with other unix commands
  6. – Perl was initially designed for unix, text and as an alternative for shell script programmers. It’s still very good for that purpose. For that purpose, I feel OO offers limited value-add.
  7. – for unix scripting, perl is more similar to unix shell scripting at least on the surface, but python isn’t hard to learn.
  8. – I feel perl was more entrenched and more established in certain domains such as unix system automation, production support, bio-informatics
  9. big data, data science and data analytics domains have already picked python as the default scripting language. Perl may be usable but not popular
  10. – some say Perl the base language is much larger(??) than Python’s base language so probably it takes longer(???) time to learn Perl than Python but in the end you have a more expressive language???
  11. – CPAN was once larger than python’s module collection. If CPAN’s collection is A and python’s module colleciton is B, then I personally feel (A-B) are mostly non-essential modules for specialized purposes.
  12. – Python has better support for Windows GUI apps than Perl.
  13. I feel the open source contributions are growing faster in python
  14. There are more books on python

jvm thread Maps-To(isn’t)native thread: %%speculation

I don’t (need to) know HOW a jvm thread maps to a native thread. I believe by construction, JVM attaches house-keeping info on the native thread handle (something based on a ptr). The housekeeping data probably provides essential management features, such as

* thread dump
* interactive debugger support
* jGC knows all the live objects on a thread’s call stack

save lookup-key in payload object: data duplication

In c++ data structure designs, I often have a key/payload pair in a lookup map, and face the design choice — should I duplicate data by saving the key in the payload object?

Context (simple) — if every time we access any payload object, we always have the key on hand, then no need. It could be unnecessary complication to copy the key into the payload instance, since we may not have the key at payload construction time.

Context — if sometimes we manipulate the payload object without (easy access) to the key, then reconsider. For example, the key object may be a private field of some Acct class so we would need a handle on that Acct instance. I would incur the cost of data duplication by saving the key inside payload, upon construction.

If key object address is “permanent” .. like maintained in some global collection or allocated on heap, then I prefer reference/pointer. (In java, every object is a reference.) This way, key object content change won’t affect me.

(Special case) If the key object is an integral type and immutable, I would simply copy its content into the payload object.

(complex special case) if the key object is BASED-ON a simple data type (string or int) but needs to be a custom MyKey type, then we have the (KISS) choice of saving the string as a payload field, but half the times when we read this field from a payload, we may need to create a temp MyKey. I usually prefer KISS.

I treat this as a design issue more than merely a low-level implementation issue, because I have many concerns about the data duplication:

  1. memory footprint
  2. !! More important than (A) is artificial complexity and higher chance of bugs. I’m very lucky if the key object location is permanent as described above.
  3. code maintenance

java’s amazing perf^simplicity #c++/c#

* java’s syntax simplicity is on par with python, better than c# and much better than c++.
* java’s performance is on par with c# and c++, largely due to JVM and JIT

java has been popular on web servers, and crucially the newer mobile, cloud, big-data platforms, beating c++, c#, python

java’s adoption rate as a foundation platform or integration-target… is better than all other languages. Many products are built for java, on JVM or with java in mind. I’m deliberately vague here because I don’t want to spend too much time analyzing this vague, general observation.

SQL expertise(+tuning)as competitive advantage: losing mkt val

Opening example — I have received very few selective/tough SQL interview questions. The last interview with non-trivial SQL is Lazada.

If you want to rely on SQL/RDBMS skill as a competitive advantage, you will be disappointed.

I think many teams still use SQL, but use it lightly. Simple queries, simple stored proc, small tables, no tuning required. Therefore, interview questions are dumbing down…

I believe I over-invested in SQL. The last job that required any non-trivial SQL was 95G…

strategic value of MOM]tech evolution  is about MOM, but similar things can be said about SQL and RDBMS.

This is a cautionary tail in favor of TrySomethingNew. If my friend Victoria stays within the familiar domain of SQL/Perl/PHP/Apache, then her skillset would slowly lose market value.

Don’t forget — TSN can have low ROTI. We have to accept that possibility

frequency@(hardware)timer interrupts #KHz

In the 2003 edition of [[LinuxKernel]] P195, 1200 interrupts/second was the highest frequency, above 1KHz.

The CPU must process these timer interrupts. If it were easy to “nullify” these interrupts and eliminate the CPU overhead, then we would have millions of time interrupts per second but I doubt it. The overhead is probably unavoidable.

Scheduler is a kernel component. I think scheduler divides cpu time into slots. I guess the smallest quantum is limited by this frequency, among other factors.

P350 of The same book says that scheduler runs upon timer interrupts + keyboard interrupts.

Note many people get confused by software timer vs hardware timer. Hardware timer is my focus.

Many people get confused by software interrupt vs hardware interrupts. I have a separate blogpost (interrupts=hardware interrupts #by default), but basically most interrupts are hardware interrupts. Software interrupts are simulated.


##RDBMS=architect’s favorite

A specific advantage .. stored proc can _greatly_ simplify business logic as the business logic lives with the data …

Even without stored proc, a big join can replace tons of application code implementing non-trivial business logic. Hash table lookup can be implemented in SQL (join or sub-query) with better clarity and instrumentation because

* Much fewer implicit complexities in initialization, concurrency, input validation, null pointers, state validity, invariants, immutabilities, …
* OO and concurrency design patterns are often employed to manage these complexities, but SQL can sidestep these complexities.

Modularity is another advantage. The query logic can be complex and maintained as an independent module (Similarly, on the presentation layer, javascript offers modularity too). Whatever modules dependent on the query logic has an dependency interface that’s well-defined and easy to test, easy to investigate.

In fact testability might be the most high-profile feature of RDBMS.

I think one inflexibility is adding new column. There are probably some workarounds but noSQL is still more flexible.

Another drawback is concurrency though there are various solutions.

thread^process: lesser-known differences #IV

Popular IV question. Largely a QQ question.  Some may consider it zbs.

To the kernel, there are man similarities between the “thread” construct vs the “process” construct. In fact, a (non-kernel) thread is often referenced as a LightWeightProcess in many kernels such as Solaris and Linux.

  • context switching — is faster between threads than between processes because no TLB flush needed. In linux, context switching between kernel-threads is even faster.
  • creation — some thread libraries can create threads without the kernel knowing. No such thing for a process.
  • socket — 2 threads in a process can access the same socket; two processes usually can’t access the same socket, unless … parent-child. See post on fork()
  • memory — thread AA can access all heap objects, and even Thread BB’s stack objects via pointers. Two processes can’t share these, except via shared memory.
  • a non-kernel thread can never exist without an owner process. In contrast, every process always has a parent process which could be long gone.


## JVM interceptions #mem,exception..%%hypothesis

Java provides a virtual machine to resemble a physical machine. To the upper-level[1] applications running therein[2], JVM provides similar interfaces as a physical machine does. Through these interfaces, JVM intercepts runtime requests made by the hosted application, and provides powerful value-added services to the hosted application. Below are a subset of the interfaces I’m interested in.

Note when I say “intercept”, I mean the request is ultimately serviced by the host kernel of the physical machine. However, in many cases the JVM completes the requests in itself without calling into the host kernel.

  • memory allocation. Note de-allocation is not a legitimate request from a java app. The JVM is likely to handle allocations without calling into kernel. When allocation fails, the outcome is much better than in an unmanaged runtime.
  • memory access beyond an array or linked data structure. JVM probably forwards the request on, but the physical address is managed by the GC.
  • thread — management (including creation). Mostly native thread is used, so JVM forwards the requests to kernel, but I think JVM is very much in-the-loop and can provide instrumentation support
  • exception handling. In an un-managed environment, exception outcome can be inconsistent, unpredictable. Linux uses interrupts and signals (interrupts^signal ] kernel). JVM handles exceptions at a higher level, more effectively:)

[1] a layered view
[2] a “hosting” view. In this view, the JVM is a “container”. A docker container is cgroup that includes the JVM process

## My github profile

Some interviewers have requested my github profile. I create github content for learning and to share with my friends, not really for interviewers. I have decided not to spend too much time polishing my github content. The content would be rough round the edges. If there’s any doubt or bug report, I would be happy to discuss.

unbounded queue for 1-Producer-1-Consumer #Wells

———- Forwarded message ———
From: Bin TAN (Victor) <>
Subject: demo of my design that first interviewer didn’t see

Hi Angelo,

During my interview with Jun, I spent the last 20 minutes locked in a rigorous debate — given an unbounded queue designed for single-threaded mode, can we provide a thread-safe facade so a producer thread and a consumer thread can operate on it safely, but without using locks in every operation.
Note In JDK and STL there are many collections designed for single-threaded mode, because such designs are simpler and significantly faster.
I argued valiantly that there exist designs where most of the time we can avoid both locking and CompareAndSwap. I failed to convince Jun. Jun believed firmly that unsynchronized access by two threads is always unsafe so we must use lock or CompareAndSwap every single time.
I just spent 20 minutes at home posting an implementation in
I would also like to point out the java ConcurrentHashMap lets two threads (possibly a writer thread and a reader thread) access the shared data structure concurrently, usually without locking or CompareAndSwap , when the two threads happen to access two distinct segments. Note there can be a large number of segments, up to a few hundred at least, so the chance of two threads hitting distinct segments is very high (99.999% chance for a 333-segments map). Therefore, contrary to what Jun said, for reader/writer threads to access the same data structure,  we don’t need locking in every read and write access.
concurrencyLevel – the estimated number of concurrently updating threads. The implementation may use this value as a sizing hint.
I wrote this (like most of my concurrent designs) in Java since Java memory model is better documented and better understood.

read-mostly data store: needs synchronization

Suppose a data store is infrequently updated, and only by a single writer thread, but is accessed frequently by many, many reader threads.

Q1: Can we cheat and skip some of the (expensive) locking?
A1: No. The writer could be in the middle of a multi-step update, so readers would see a half-updated data store

Q2: What if we somehow ensure the updates are never multi-step [1]?
A2: Unfortunately no, due to visibility failure. If a reader thread doesn’t synchronize, the crucial memory fencing would not occur and the reader may not notice a recent update, and use cached data instead.

[1] One way to do it — writer creates a new version of a record, and at last reseat a single pointer , as described in my AtomicReference blogpost

calling database access method while holding mutex

Update: XR referred me to —

Hi XR,

Q: If some vendor provides a database access function (perhaps dbget()) that may be slow and may acquire some lock internally, is it a good idea to call this function while holding an unrelated mutex?

We spoke about this. Now I think this is so bad it should be banned. This dbget() could take an unknown amount of time. If someone deleted a row from a table that dbget() needs, it could block forever. The mutex you hold is obviously shared with other threads, so those threads would be starved. Even if this scenario happens once in a million times, it is simply unacceptable.

Here, I’m assuming the dbget() internal lock is completely unrelated to the mutex. In other words, dbget() doesn’t know anything about your mutex and will not need it.

As a rule of thumb, we should never call reach-out functions while holding a mutex. Reach-out functions need to acquire some shared resources such as a file, a web site, a web service, a socket, a remote system, a messaging system, a shared queue, a shared-memory mapping… Most of these resources are protected and can take some amount of time to become available.

(I remember there’s some guideline named open-call but I can’t find it.)

That’s my understanding. What do you think?

##[17] proliferation → consolidation.. beware of churn

This is an extension of my 2015 blog post

Imagine you spent months of serious personal effort [1] learning to use, debug, and tune, say, MongoDB but after this project you only find projects that need just superficial Mongo knowledge. Developer time-investment has no recurring return. I think this is widespread in tech: A domain heats up attracting too many players creating competing products with varying degrees of similarity. We wish these products are mostly similar so we developers’ time investment can help us more than once, rather than learn-n-forget like 狗熊掰棒子. Often it takes decades to see some consolidation among the competitors, when most of them drop out of the race and we one player emerges dominant, or a common standard [2] is adopted with limited vendor extensions.

Therefore I see two phases : Proliferation -> Consolidation. The churn in the early phase represents a hazardous pitfall.

If we invest too much learning effort there we get burned.

  • Javascript packages
  • JVM languages — javascript, Scala, Groovy, jython
    • I don’t even know which company uses them
  • ORM and database access “frameworks”–, LINQ, EntityFramework, SpringJDBC,  iBatis,
  • Data Grid and NoSQL — Terracotta, Hazelcast, Gigaspace, Gemfire, Coherence, …
  • MOM — tibco, solace, 29west, Tervela, zeroc
  • machine learning?
  • web app languages
    • php, perl and the LAMP stack
    • Javascript MEAN stack
    • ASP and the Microsoft stack
    • Java stack

[1] You could have spent the time on personal investment, or something else if not required by the project.

[2] Some positive examples of standardization —

  1. RDBMS vendors
  2. Unix vendors
  3. c++ vendors — mostly GCC vs Microsoft VC++
  4. Java IDEs; c++/java/c# debuggers
  5. cvs, svn, git

A few development technologies “free” of proliferation pains —

  1. socket and system programming — complexities are low level and in C not c++
  2. core java
  3. SQL
  4. c/c++
  5. Unix know-how for testing, investigation, devops and process management
    1. shell scripting,
    2. regular expressions
    3. searching

online solution2algoQ:I always try%%self1st #maxRectangle is the Nasdaq question. It includes a link to my solution in github. I came up with my own solution. Then I tried your solution found online. is another interview question. I tried my own and came up with a “better” solution.

If I don’t try my own solution, I would not learn a lot, and I would not have fun implementing my own idea (even if not optimal).

I also find it hard to absorb then remember clever solutions I found online, such as the maxRectangle.

Solving problems myself also grows my confidence when facing unfamiliar problems. I do read online solutions when I have no clue. confidence-build`: another reason y I don’t look at Leetcode answer

By the way, one advantage of my solution is — it can identify the missing numbers, all within O(N) time and O(1) space.

cod`IV^QnA: relative importance

Let’s focus on technical screening but put aside SDI questions, which are sometimes like QnA .

  • In terms of “weighting” assigned by employer — coding 50/50 QnA on average, though only on-site coding is authentic. If an interviewer uses on-site coding, then it would often have higher weighting than the QnA portion.
  • interviewers’ time — coding 30/70 QnA. Interviewer needs to spend more time conducting QnA interview than coding interview.
  • candidate’s time — roughly coding 45/55 QnA. On average coding takes a candidate less time but more effort.
  • candidate’s effort — coding 70/30 QnA on average. QnA requires effort in advance but little effort during the interview.
    • I feel coding tests often require substantial concentration/effort but many people are not willing. One hour of coding test effort == 4 hours of project effort
    • timed coding test is like exam… full concentration.
  • rejection percentage — coding 70/30 QnA on average. I feel rejection rate is higher in coding than in QnA, partly due to insufficient effort by many candidates. West Coast and HFT has higher rejection rate on coding tests.
  • In terms of long term benefits on personal financial security and family wellbeing — about the same value.

Conclusion — as a job candidate, we had better embrace coding interviews, for our own long-term security.

make_shared() cache efficiency, forward()

This low-level topic is apparently important to multiple interviewers. I guess there are similarly low-level topics like lockfree, wait/notify, hashmap, const correctness.. These topics are purely for theoretical QQ interviews. I don’t think app developers ever need to write forward() in their code. touches on a few low-level optimizations. Suppose you follow Herb Sutter’s advice and write a factory accepting Trade ctor arg and returning a shared_ptr<Trade>,

  • your factory’s parameter should be a universal reference. You should then std::forward() it to make_shared(). See gcc source code See make_shared() source in
  • make_shared() makes a single allocation for a Trade and an adjacent control block, with cache efficiency — any read access on the Trade pointer will cache the control block too
  • I remember reading online that one allocation vs two is a huge performance win….
  • if the arg object is a temp object, then the rvr would be forwarded to the Trade ctor. Scott Meryers says the lvr would be cast to a rvr. The Trade ctor would need to move() it.
  • if the runtime object is carried by an lvr (arg object not a temp object), then the lvr would be forwarded as is to Trade ctor?

Q: What if I omit std::forward()?
AA: Trade ctor would receive always a lvr. See ScottMeyers P162 and my github code is my experiment.

— in summary, advantages of make_shared
* code size is smaller and more i-cache friendly
* one allocation fewer. I think one allocation is thousands of times slower than a simple calc


deadlock avoidance: release all locks fail-fast

In some deadlock avoidance algorithms, at runtime we need to ensure we immediately release every lock already acquired, without hesitation, so to speak. Here’s one design my friend Shanyou shared with me.

  • Tip: limit the number of “grabbing” functions, functions that explicit grab any lock.

Suppose a grabbing function f1 acquires a lock and calls another function f2 (a red flag to concurrency designers). In such a situation, f2() will use -1 return value to indicate “must release lock”. If f2() calls a grabbing function f3(), and if f3() fails to grab, it propagates “-1” to f2 and f1.

  1. f3() could use trylock.
  2. f3() could check lock hierarchy and return -1.

Q: which thread/PID drains NicBuffer→socketBuffer

Too many kernel concepts here. I will use a phrasebook format. I have also separated some independent tips into hardware interrupt handler #phrasebook

  1. Scenario 1 : A single CPU. I start my parser which creates the multicast receiver socket but no data coming. My process (PID 111) gets preempted on timer interrupt. CPU is running unrelated PID 222 when my data wash up on the NIC.
  2. Scenario 2: pid111 is running handleInput() while additional data comes in on the NIC.

Some key points about all scenarios:

  • context switching — There’s context switch to interrupt handler (i-handler). In all scenarios, the running process gets suspended to make way for the interrupt handler function. I-handler’s instruction address gets loaded into the cpu registers and this function starts “driving” the cpu. Traditionally, the handler function would use the suspended process’s existing stack.
    • After the i-handler completes, the suspended “current” process resumes by default. However, the handler may cause another pid333 to be scheduled right away [1 Chapter 4.1].
  • no pid — interrupt handler execution has no pid, though some authors say it runs on behalf of the suspended pid. I feel the suspended pid may be unrelated to the socket (Scenario 2), rather than the socket’s owner process pid111.
  • kernel scheduler — In Scenario 1, pid111 would not get to process the data until it gets in the “driver’s seat” again. However, the interrupt handler could trigger a rescheduling and push pid111 “to the top of the queue” so to speak. [1 Chapter 4.1]
  • top-half — drains the tiny NIC ring-buffer into main memory (presumably socket buffer) as fast as possible [2] as NIC buffer can only hold a few packets — [[linux kernel]] P 629.
  • bottom-half — (i.e. deferrable functions) includes lengthy tasks like copying packets. Deferrable function run in interrupt context [1 Chapter 4.8], under nobody’s pid
  • sleeping — the socket owner pid 111 would be technically “sleeping” in the socket’s wait queue initially. After the data is copied into the socket receive buffer in user space, I think the kernel scheduler would locate pid111 in the socket’s wait queue and make pid111 the cpu-driver. This pid111 would call read() on the socket.
    • wait queue — How the scheduler does it is non-trivial. See [1 Chapter]
  • burst — What if there’s a burst of multicast packets? The i-handler would hog or steal the driver’s seat and /drain/ the NIC ring-buffer as fast as possible, and populate the socket receive buffer. When the i-handler takes a break, our handleInput() would chip away at the socket buffer.
    • priority — is given to the NIC’s interrupt handler as NIC buffer is much smaller than socket buffer.
    • UDP could overrun the socket receive buffer; TCP uses transmission control to prevent it.

Q: What if the process scheduler is triggered to run (on timer interrupt) while i-handler is busy draining the NIC?
A: Well, all interrupt handlers can be interrupted, but I would doubt the process scheduler would suspend the NIC interrupt handler.

One friend said the while the i-handler runs on our single-CPU, the executing pid is 1, the kernel process. I doubt it.

[1] [[UnderstandingLinuxKernel, 3rd Edition]]


if thread fails b4 releasing mutex #CSY

My friend Shanyou asked:

Q: what if a thread somehow fails before releasing mutex?

I see only three scenarios:

  • If machine loses power, then releasing mutex or not makes no difference.
  • If process crashes but the mutex is in shared memory, then we are in trouble. The mutex will be seen as forever in-use. The other process can’t get this mutex. I feel this could be a practical problem, with practical solutions like reboot or process restart.
  • If process is still alive, I rely on stack unwinding.

Stack unwinding is set up by compiler. The only situation when this compiler-generated stack unwinding is incomplete is — if the failing function is declared noexcept. (In such a case, the failure is your self-inflicted problem since you promised to compiler it should never throw exception.) I will assume we don’t have a noexcept function. Therefore, I assume stack unwinding is robust and all stack objects will be destructed.

If one of the stack objects is a std::unique_lock, then compiler guarantees an unlocked status on destruction. That’s the highest reliability and reassurance I can hope for.

atomic{int} supports operator+=()

Background — on a machine with lock-free CPU…

My friend Shanyou asked me — with c++11 atomic data types, can we simply say

myAtomicInt ++; //without any mutex

A: Yes according to [[c++StdLib]]

Q: Is there some hidden CAS while-loop therein?
A: Yes I am 99% sure because other threads could be updating the same shared mutable object concurrently on other CPU cores.

Q: is there a CAS while-loop in a basic store/load operation?
A: I don’t think so

updates2shared-mutable: seldom flushed2memory immediately #CSY

  • memory fence c++
  • memory barrier c++
  • c++ thread memory visibility


You can search for these keywords on Google. Hundreds of people would agree that without synchronization, a write to sharedMutableObject1 by thread A at Time 1 is not guaranteed to be visible to Thread B at Time 2.

In any aggressively multithreaded program, there are very few shared mutable objects. If there’s none by design, then all threads can operate in single-threaded mode as if there’s no other thread in existence.

In single-threaded mode (the default) compilers would Not generate machine code to always flush a write to main memory bypassing register/L1/L2/L3 caches. Such a memory barrier/fence is extremely expensive — Main memory is at least 100 times slower than register access.

I would hypothesize that by default, the most recent write (at Time 1) is saved to register only, not L1 cache, because at compile time, compiler author doesn’t know if at runtime this same thread may write to that same address! If you update this object very soon, then it’s wasteful to flush the intermediate, temporary values to L1 cache, since no other threads need to see it.

L1 cache is about 10 times slower than register.

Multi-threaded lock-free programming always assumes multiple threads access shared mutable objects.

Even a lock-free function without contention [1] requires memory barriers and is therefore slower than single-threaded mode. I would say in a low-contention context, the main performance gain of single-threaded over lock-free is data cache efficiency. Another performance gain is statement reordering.

[1] i.e. no retry needed, since other threads are unlikely to touch sharedMutableObject1 concurrently

[11]quant library in java/c# !! catching on


You once told me java can emulate the same quant lib functionality of C/C++. I asked quants in GS, MS, ML, Barcap and a few other banks. I don’t remember anyone saying their quant lib is in java. I now feel there’s no industry momentum behind such a migration. Further, I feel there’s no justification either. I’d go out on a limb and say there’s justification for sticking to C++.

A java implementation is less accessible from dotnet, python, and other scripting languages that could be making (slow) inroads into trading floors.  In contrast, all major languages support an decent interface to integrate with a C library. C is the common denominator.

More importantly, the sponsors of the quant lib are business users (not only traders) and they
know none of the languages but they know MSExcel. I’d say Excel integration is a must for every quant lib, otherwise traders may refuse to use it. C implementations easily integrate with MSExcel, via the
Microsoft COM interface and other interfaces. C# also integrates well with Excel.

Some quant libs are used in visualization and GUI. Dotnet and WPF are a market leader in GUI.

I also feel C implementation tends to be faster, at least no slower, than java quant lib. In pre-trade real time apps, a quant lib needs to be fast. A Barcap veteran told me the most important justification for C++ in quant lib is speed/performance.

In 2018 I asked an Executive Director in MS CVA team why java is not used. He said performance is the main reason.

rvalue Object holding a resource : rather rare

I think naturally-occurring rvalue objects  rarely hold a resource.

  • literals — but these objects don’t hold any resources via a heap pointer
  • string1 + “.victor”
  • myInventoryLevel – 5000
  • myVector.push_back(Trade(12345)) — there is actually a temp Trade object. Compiler will call the rvr overload of push_back(). is my investigation. My temp object actually hold a resource via a heap pointerBut this usage scenario is rare in my opinion

However, if you have a regular nonref variable Connection myConn (“hello”), you can generate a rvr variable:

Connection && rvr2 = std::move(myConn);

By using std::move(), you promise to the compiler not to use myConn object afterwards.



reinterpret_cast(zero-copy)^memcpy: raw mktData parsing

Raw market data input comes in as array of unsigned chars. I “reinterpret_cast” it to a pointer-to-TradeMsgStruct before looking up each field inside the struct.

Now I think this is the fastest solution. Zero-cost at runtime.

As an alternative, memcpy is also popular but it requires bitwise copy. It often require allocating a tmp variable.

rvr coding experiments=tricky4everyone

I think every candidate faces the same challenge, so each person’s understanding would be patchy in some area.

Each candidate tries to identify and internalize a small number of “fundamental” principles in this subject, and hope the fundamentals would help connect the dots in a logical, natural fashion, but I think this is hard. There are too many surprises, too many unnatural “phenomena”. Therefore, I can only hope to connect a few dots, while the other dots remain scattered and hard to remember.

The best way to clear up the confusion and doubts, and deepen our understanding is through coding experiments, but in this domain, I found it very tricky to write code to confirm my understanding of rvr or move().

Many of the relevant language rules are too implicit, so I can’t easily insert probing prints. Some of those language rules are related to compiler’s function-override resolution.

Therefore, my standard experiment techniques are often inapplicable or ineffective.

Therefore, I now hold lower expectation of my eventual understanding of this domain. I consider further t-investment low-yielding in terms of ROTI. I should spend my time on other domains.

single/multi-thread TCP servers contrasted

In all cases, listening socket LL and worker sockets W4 W5 look like

local/server port local/server ip remote/client ip remote/client port socket file descriptor
 80 ANY unbound unbound LL  3
 80 high port A W4  4
 80 high port B W5  5 newly created

Now the 3 designs. [1] is described in accept()+select() : multiple persistent worker-sockets

LL W4 W5
not used by the “engaged” thr sending data single-thr primitive
not used by the “engaged” thr, but will be monitored via select() sending data single-thr multiplexing[1]
 listening Established. sending or idle sending multi-threaded

tcp: detect wire unplugged

In general for either producer or consumer, the only way to detect peer-crash is by probing (eg: keepalive, 1-byte probe, RTO…).

  • Receiver generally don’t probe and will remain oblivious.
  • Sender will always notice a missing Ack (timeout OR 3 duplicate Acks). After retrying, TCP module will give up and generate SIGPIPE.
send/recv buffers full buffer full then receiver-crash receiver-crash then sender has data to send receiver-crash amid active transmission
visible symptom 1-byte probe from sender triggers Ack containing AWS=0 The same Ack suddenly stops coming very first expected Ack doesn’t come Ack was coming in then suddenly stops coming
retrans by sender yes yes yes yes
SIGPIPE no probably yes probably

Q20: if a TCP producer process dies After transmission, what would the consumer get?
AA: nothing. See — Receiver is ready to receive data, and has no idea that sender has crashed.
AA: Same answer on

Q21: if a TCP producer process dies During transmission, what would the consumer get?
%A: ditto. Receiver has to assume sender stopped.

Q30: if a TCP consumer process dies during a quiet period After transmission, what would the producer get?
AA: P49 [[tcp/ip sockets in C]] Sender doesn’t know right away. At the next send(), sender will get -1 as return value. In addition, SIGPIPE will also be delivered, unless configured otherwise.

Q30b: Is SIGPIPE generated immediately or after some retries?
AA: describes Ack and re-transmission. Sender will notice a missing Ack and RTO will kick in.
%A: I feel TCP module will Not give up prematurely. Sometimes a wire is quickly restored during ftp, without error. If wire remains unplugged it would look exactly like a peer crash.

Q31: if a TCP consumer dies During transmission, what would the producer get?
%A: see Q30.

Q32: if a TCP consumer process dies some time after buffer full, what would the producer get?
%A: probably similar to above, since sender would send a 1-byte probe to trigger a Ack. Not getting the Ack tells sender something. This probe is builtin and mandatory , but functionally similar to (the optional) TCP Keepalive feature

I never studied these topics but they are quite common.

Q: same local IP@2 worker-sockets: delivery to who

Suppose two TCP server worker-sockets both have local address port 80. Both connections active.

When a packet comes addressed to, which socket will kernel deliver it to? Not both.

Not sure about UDP, but TCP connection is a virtual circuit or “private conversation”, so Socket W1  knows the client is If the incoming packet doesn’t have source IP:port matching that, then this packet doesn’t belong to the conversation.

accept()+select() : multiple persistent worker-sockets

I feel it is not that common. See is very relevant.

The naive design — a polling-thread (select/poll) to monitor new data on 2 worker-sockets + accept-thread to accept on the listening socket. The accept-thread must inform the polling thread after a worker-socket is born.

The proposed design —

  1. a single polling thread to watch two existing worker sockets W1/W2 + listening socket LL. select() or poll() would block.
  2. When LL is seen “ready”, select() returns, so the same thread will run accept() on LL and immediately get a 3rd worker-socket W3. No blocking:)
  3. process the data on the new W3 socket
  4. go back to select() on W1 W2 W3 LL
  • Note if any worker socket has data our polling thread must process it quickly. If any worker socket is hogging the polling thread, then we need another thread to offload the work.
  • Note all worker sockets, by definition, have identical local (i.e. server-side) port, since they all inherit the local port from LL.

[[tcp/ip socket programming in C]] shows a select() example with multiple server ports.


simultaneous send to 2 tcp clients #multicast emulation

Consider a Mutlti-core machine hosting either a forking or multi-threaded tcp server. The accept() call would return twice with file descriptors 5 and 6 for two new-born worker sockets. Both could have the same server-side address, but definitely their ports must be identical to the listening port like 80.

There will be 2 dedicated threads (or processes) serving file descriptor 5 and 6, so they can both send the same data simultaneously. The two data streams will not be exactly in-sync because the two threads are uncoordinated.

My friend Alan confirmed this is possible. Advantages:

  • can handle slow and fast receives at different data rates. Each tcp connection maintains its state
  • guaranteed delivery

For multicast, a single thread would send just a single copy  to the network. I believe the routers in between would create copies for distribution.  Advantages:

  • Efficiency — Router is better at this task than a host
  • throughput — tcp flow control is a speed bump


tcp listening socket doesn’t use a lot of buffers, probably

The server-side “worker” socket relies on the buffers. So does the client-side socket, but the listening socket probably doesn’t need the buffer since it exchanges very little data with client

That’s one difference between listening socket vs worker socket

2nd difference is the server (i.e. local) address field. Worker socket has a single server address filled in. The listening socket often has “ANY” as its server address.

non-forking STM tcp server

This a simple design. In contrast, I proposed a multiplexing design for a non-forking single-threaded tcp server in accept()+select() : multiple persistent worker-sockets

ObjectSpace manual P364 shows a non-forking single-threaded tcp server. It exchanges some data with each incoming client and immediate disconnects the client and goes back to accept()

Still, when accept() returns, it returns with a new “worker” socket that’s already connected to the client.

This new “worker” socket has the same port as the listening socket.

Since the worker socket object is a local variable in the while() loop, it goes out of scope and gets destructed right away.

spend 4H/D self-study@@ #XR

Hi XR,

I called you to highlight to you the realistic amount of hours we can spend on self-study.

Real example 1 —- I monitored it for months in my Master’s program — I would spend about 30 hours a week, including lectures, tutorial classes and homework(about 15-30 Hr/week, often taking up one full weekend-day). If I self-study without the program, I think it will drop to 3 hours a week. Why so low? Read on.

Eg 2 —- Some graduate students prepare for coding interviews by going through some question bank, just as you planned. Could be close to a hundred questions, during a 3M period — just my imagination, not based on any observation. They probably have 4 hours/weekday + 16 hours/weekend-day of free time. How many hours do they actually spend ? I would guess 10 – 20 hours a week. Why so low? Read on

Real example 3 —- In Singapore, I had kids, bills, grocery, household chores, home maintenance, family outing … so I had far less spare time than now, but still I probably had 1H/weekday + 8H/WED (weekend-day) for myself. In reality, I averaged 3 hours a week on self-study, but you know what a motivated person I am. Why so low? Read on

Real example 4 —- in Feb 2018 I had some very promising, valuable interview opportunities (FB). Even on the very last weekend before the interview, my time utilization rate was very very low. I lock myself up in office with no one else to distract me. I sleep only 7 hours and I eat in front my computer. In fact, I tell my friends “I spent entire weekend in office without going out of the building. You would think I could spend 15 hours a day on the preparation. Well, more like 6. Why so low?

It’s not all about motivation. I had high motivation in Singapore. I had the highest motivation over the recent weekend. But the numbers are still much lower than “theoretical limits”. There’s huge difference between wall-clock time vs real time spent:

Example 5 —- My son would spend an hour pretending to study 20 Chinese characters but only learn 3 characters and soon forget all of them. So the real time he spent was … a few minutes?

It’s about intensity. I call it laser energy intensity.

Real example 6 —- in a coding interview, my mind was working at top capacity. Over 15 minutes I solved some very tough problem. The rest of the interview (I was there for 3 hours) was far less intense, though it’s still intense, as intense as any technical interview.

Example 4 continued —- I time myself for white-board coding. Over 20 minutes my mind was working under immense pressure. Over the next 2 hours I sit down at a computer and make the actual code work. It’s still a lot of work, but my mind was at 50% peak capacity.

Use numbers to gauge the amount of workload and your time utilization rate. Say each problem (in homework or a coding exercise) takes average 30 minutes in exam / interview. Beside eating, using toilet, you have 15 hours to spend, so you would want to do 30 problems. How many can you actually complete? Perhaps 6, up to 10. Why so low? Our mind can’t operate at peak capacity for 15 hours!

In Example 4, after I spend 20 intense minutes, I need to wind down. After I spend 2 hours on lower-intensity work, I also need to wind down.

In Example 1, at the lecture/tutorial classes, the intensity was about 30% on average. That’s why I can do 2-3 hours. At a higher intensity the students’ mind would /disengage/ after an hour. All experienced teachers would slow down and give breaks when introducing tough technical topics.

In Example 1, When I spent 10 hours to complete my homework, my intensity was about 50% and I took many breaks. Actually the concentration time spent was more like 6-7 hours.

How does this mental “workload” compare to paid project work in a regular job, including work-from-home?

  • 😦 Project work is often repetitive or mundane. I can’t choose another topic.
  • 🙂 Paid work often has much bigger impact (otherwise why paid) and often a sense of achievement
  • 😦 Paid work stress is harmful, since the impact is high.
  • 🙂 Project work often has a discipline built-in, so we don’t wander off. This is similar to my Master’s program
  • 🙂 Project work requires us to keep going when the going gets tough. Also in my Master’s program

[17] widely in-use, no longer quizzed #spring,SOAP.

I see a pattern — a new technology is getting adopted and quizzed in-depth at interviews. After 5 years, it is still a favorite, perhaps dominant solution, but 1) the know-how has become common knowledge and candidates are assumed to know it and 2) usage is now standardized and simplified, so the “bar” is lower, and candidates without the knowledge can easily pick it up.

No more in-depth questions needed. Therefore, time previously invested here is wasted, since only superficial knowledge is required now.

  1. Eg: spring/hibernate
  2. Eg: java servlets and JSP — From 1999 to 2008 these topics were heavily quizzed. Still widely in use but often invisible.
  3. Eg: Apache web server — In 2000 I was asked a lot on Apache details. Apache is still very popular. See
  4. Eg: php — still widely used, but I feel not asked a lot. See
  5. Eg: xml parsing — I used to get in-depth questions about DOM or SAX etc. Now I believe xml is still widely used
  6. Eg: web services, SOA, SOAP — Still very much in use
  7. Eg: HTTP protocol details like GET/POST, status codes
  8. Eg: Maven and Ant

Most of my examples are in the high-churn domains like Internet, mobile. I believe the same will happen to interview questions on big data, javascript, Android, iOS , blockchain, ..

The opposite list — some essential technologies underlying multiple technology waves were never heavily quizzed, but, after the waves subsided, remain rather relevant to many niche interviews.

  • SQL query — joins, subquery, case, ..
  • SQL and DB tuning
  • Unix automation — It can take years to become reasonably competent with all of bash, piping, subshells, trap, shell functions, string operators, file manipulation, and top 30 Unix commands
  • Unix system administration
  • Pthreads, a POSIX standard C library
  • http client programming
  • regular expression
  • Inter-Process-Communication
  • Java servlet session management
  • Java serialization
  • Java reflection

— You write —
There are still many projects using Spring. My current project is also using Spring, but it’s modified by internal team to create an internal framework. When people discuss in meeting, they say “Spring” to refer to this framework. But there are many pitfalls when I use it. To name a few:

  1. a) restful service is easy to implement in spring, ust add related annotations, but it doesn’t work, and after I spent a few days of research, I gave up and choose to use a internally created annotation.
  2. b) some configurations doesn’t work, parameters couldn’t be passed in. I still don’t know what’s the reason. The internal framework code is not accessible for other teams developers, so I don’t think it worth to spent more time to try to figure out.

For this project using Spring, the interview only mentioned this project is using Spring, but didn’t ask any questions about Spring.

For last year, I went through 5 interviews, 2 mentioned the projects are using Spring, and only one client asked some Spring questions.

I recall 5 years ago, 8/10 will ask spring and hibernate questions. Now, still a few clients asked Spring questions, but none asked Hibernate questions.


c++interview tougher than java on WallSt #XR

My friend XR is first to point this out.

Q1: for candidates with 5+ years of experience, is the passing rate really worse in c++ IV than java IV? Let’s limit ourselves to sell-side.
%%A: indeed slightly lower.

Q2: is c++ paying slightly higher? I don’t think so.

Both java and c++ (c# too but not python) interviews go to crazy low levels details. Most tough questions in java/c#/python are low-level. C/C++ is a lower-level language.

c++(!!Java)ecosystem questions are tough #Deepak

Perhaps the most important reason for my Q1 answer is selectivity. A java hiring team can be highly selective but on average, c++ hiring teams have higher selectivity:

  • higher bar
  • more time allocated to selecting the right candidate
  • less interested in commodity skills and more demanding on in-depth knowledge
  • less likely to settle for second best talent

What does it mean for Shanyou and Deepak? They are unlucky to hit a rising bar in a shrinking domain… increasingly competitive.

c++matrix using deque@deque #python easier

My own experiment shows

  • had better default-populate with zeros. Afterwards, you can easily overwrite individual cells without bound check.
  • it’s easy to insert a new row anywhere. Vector would be inefficient.
  • To insert a new column, we need a simple loop

For python, # zero-initialize a 5-row, 8-column matrix:
width, height = 8, 5
Matrix = [[0 for x in range(width)] for y in range(height)]

In any programming language, the underlying data structure is a uniform pile-of-horizontal-arrays, therefore it’s crucial (and tricky) to understand indexing. It’s very similar to matrix indexing — Mat[0,1] refers to first row, 2nd element.

Warning: 2d array is hard to pass in c++, based on personal experience 😦 You often need to specify the size in the receiving function declaration. Even if this is feasible, it’s unwanted legwork.

[1] Warning — The concept of “column” is mathematical (matrix) and non-existent in our implementation, therefore misleading! I will avoid any mention of it in my source code. No object no data structure for the “column”!

[2] Warning — Another confusion due to mathematics training. Better avoid Cartesian coordinates. Point(4,1) is on 2nd row, 5th item, therefore arr[2][5] — so you need to swap the subscripts.

1st subscript 2nd subscript
max subscript  44  77
height #rowCnt 45 #not an index value  <==
width #rowSz [1]  ==> 78 #not an index value
example value 1 picks 2nd row 4 picks 5th item in the row
variable name rowId, whichRow subId [1]
Cartesian coordinate[2] y=1 (Left index) x=4

y I use lots of if..{continue;}

Inside a loop, many people prefer if/elif/else. To them, it looks neat and structured.

However, I prefer the messier if…continue; if…continue; if..continue. Justification?

I don’t have to look past pageful of if/elif/elif/…/else to see what else happens to my current item. I can ignore the rest of the loop body.

Beside the current item, I also can safely let go (rather than keeping track) of all the loop-local variable values longer. All of them will be wiped out and reset to new values.

choose python^c++for cod`IV

1) Some hiring teams have an official policy — focus on coding skills and let candidate pick any language they like. I can see some interviewers are actually language-agnostic.

2) 99% of the hiring teams also have a timing expectation. Many candidates are deemed too slow. This is obvious in online coding tests. We all have failed those due to timing. (However, I guess on white-board python is not so much faster to code.)

If these two factors are important to a particular position, then python is likely better than c++ or java or c#.

  • Your code is shorter and easier to edit. No headers to include.
  • Compiler errors are shorter
  • c++ pointer or array indexing errors can crash without error message. Dynamic languages like python always give an error message.
  • STL iterator can become end() or invalidated. Result would often look normal, so we can spend a long time struggling a hidden bug.
  • Edit-Compile-Test cycle is reduced to Edit-Test
  • no uninitialized variables
  • Python offers some shortcuts to tricky c++ tasks, such as
    1. string: split,search, and many other convenient features. About 1/3 of the coding questions require non-trivial string manipulation.
    2. vector (a.k.a List): slicing, conversion(to string etc), insertion, deletion…, are much faster to write. Shorter code means lower chance of mistakes
    3. For debugging: Easy printing of vector, map and nested containers. No iteration required.
    4. easy search in containers
    5. iterating over any container or iterating over the characters of a string — very easy. Even easier than c++11 for-loop
    6. Dictionary lookup failure can return a default value
    7. Nested containers. Basically, the more complex the data structure , the more time-saving python is.
    8. multiple simultaneous return values — very very easy in python functions.
    9. a python function can return a bool half the times and a list otherwise!

If the real challenge lies in the algorithm, then solving it in any language is equally hard, but I feel timed coding tests are never that hard. A Facebook seminar presenter emphasized that across tech companies, every single coding problem is always, always solvable within the time limit.

live updated hitCount over last5s#presumably Indeed

I hit a similar question in NY, possibly LiquidNet or CVA

Q: Make a system (perhaps a function?) that returns the average number of hits per minute from the past 5 minutes.

I will keep things simple by computing the total hit over the last 300 seconds. (Same complexity if you want average order amount at Amazon over last 5 minutes.)

Let’s first build a simple system before expanding it for capacity.

Let’s first design a ticking system that logs an update every time there’s an update. The log can be displayed or broadcast like a “notice board”, or we can update a shared atomic<int>.

Whenever we get a new record (a hit), we save it in a data structure stamped with an expiry date (datetime). At any time, we want to quickly find the earliest unexpired record i.e. the blue record. There’s only one blue at any time.

What data structure? RingBuffer with enough capacity to hold the last 5 minutes worth of record.

I will keep the address of the current blue record which is defined as the earliest unexpired record in the last update. When a new record comes in, i check “Is the blue expired?” If NO, then easy.. this new record is too close to the last new record. I simply update my “notice board” in O(1). If YES then we run a binary search for the new blue. Once we find it, we have to compute a new update in O(W), where W is the minimum of two counts, A) recently expired records B) still unexpired records.  After the update, we remove the expired items from our data structure.

–That concludes my first design. Now what if we also need to update the notice board even when there is no new record?

I would need an alarm set to the expiry time of the current blue.

–Now what if the updates are too frequent? I can run a schedule update job. I need to keep the address of a yellow record, defined as the newest record of the last update.

When triggered, routine is familiar. I check “Is the blue expired?” If NO then easy… If YES then binary-search for the new blue.

c++compiler select`move-ctor

This is a __key__ part of understanding move-semantics, seldom quizzed. Let’s set the stage:

  • you overload a traditional insert(Amount const &)  with a move version insert(Amount &&)
    • by the way, Derrick of TrexQuant introduced a 3rd alternative emplace()
  • without explicit std::move, you pass in an argument into insert()
  • (For this example, I want to keep things simple by avoid constructors, but the rules are the same.)

Q1: When would the compiler select the rvr version?

P22 [[c++stdLib]] has a limited outline. Here’s my illustration

  • if I pass in a temporary like insert(originalAmount + 15), then this argument is a rvalue obj so the rvr version is selected
  • if I pass in a regular variable like insert(originalAmount), then this argument is an lvalue obj so the traditional version is selected
  • … See also my dedicated blogpost in c++11overload resolution TYPE means..

After we are clear on Q1, we can look at Q2

Q2: how would std::move help?
A: insert(std::move(originalAmount)); // if we know the object behind originalAmount is no longer needed. shows when we need to use std::move() and when we don’t need.

git | backup b4 history rewrite

Personal best practice. Git History rewrite is not always no-brainer and riskless. Once in 100 times it can become nasty.

It should Never affect file content, so at end of the rewrite we need to diff against a before-image to confirm no change.

The dumbest (and most foolproof) before-image is a zip of entire directory but here’s a lighter alternative:

  1. git branch b4rewrite/foo
  2. git reset or rebase or cherry-pick
  3. git diff b4rewrite/foo

Note the branch name can be long but always explicit so I can delete it later without doubt.