linux on cloud

Today I managed to set up a simple, functional linux machine on the cloud

—- login via ipv4 address 149.28.37.xx
I had to pick a “server size” big enough to get an ipv4 address. The smallest (i.e. cheapest) server size comes with 500GB bandwidth.

It takes a few seconds to deploy the instance and get the ipv4 address + the root password. I was then able to ssh as root from my windows git-bash which provides a functional ssh client.

—- github set-up
I had immediate access to github — a big win over Oracle virtual box 🙂 Not hard to set up my username, email address, credential store (to git-push quickly).

I needed to set up vim to edit my commit messages. Default was something unfamiliar, perhaps Emacs.

—- other set-up
My full screen is automatically usable, a big win over Oracle virtual box 🙂

I was able to install g++ and python within seconds.

Once I ssh’d in, I changed the root password, because the generated one was hard to remember. The password to log in on the provider’s website is separate and pretty long — 10 characters.

I was able to log in again and see my personal files intact. So the same physical instance is still there.

—- cost: $3.50 a month for my instance.

I was told that as soon as I destroy (not “stop”) the instance, the “meter” stops ticking. All instances are billed hourly up to the monthly cap of $3.50, even if they are up every day. The hourly rate is determined by dividing the monthly rate by 672 hours (28 days).

I had to provide my credit card details. To prevent accidental charges, I need to destroy all instances after use.


## y they ask QQ questions #revisit

Remember QQ means ‘tough topics not needed for GTD’. A QQ quiz (and to some extent, an algo quiz too):

  • measures your diligence, amount of learning effort you put in beyond your work projects.
  • measures your dedication and commitment to self-learning a dry, tough topic. Learning is not only drudgery, but also a major challenge on a typical job. 90% of the day-to-day challenges involve learning, re-learning, connecting the dots..
  • checks for a self-starter vs a passive learner
  • measures your self-reliance at learning
  • measures how detail-oriented a candidate is
  • checks for intellectual laziness
  • measures soundness of fundamental concepts underlying your logical “framework”. In many large financial systems there are fundamental design principles and concepts that are .. fairly logical and consistent. Yang was good at grasping the logical, and questioning the illogical.
  • measures depth of curiosity and learning
  • measures technical communication bandwidth with another developer

rvr^rvalueObject #rvr=compiler concept

  • ALL objects by definition exist in runtime memory but references may not.
  • std::move() is about rvr variables not rvalue objects!
    • You can even use move() on a naturally-occurring temp, though std::move is supposed to be used on regular lval objects
  • I now believe rvr is a compiler concept. Underlying is just an address of an object (temporary or otherwise)
    • Further, the traditional lvr is probably a compiler concept too. Underlying is a regular pointer variable.

“rvalue object is an object that can ONLY appear on the RHS of assignment.” I wrote this in 2018 but now less sure.

rvalue object can be naturally occurring (usually anonymous), or “converted from a named object” by std::move(), but if some variable still references the object, then the object is actually not a proper rvalue object.

The SCB architect pointed out that “some variable” can be a const (lvr) ref bound to a naturally-occurring rvalue! To my surprise the c++ syntax rule says the object is still a temp object i.e. rvalue object, so you can’t bind a non-const lvr to this object!
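To make these points concrete, here is a tiny sketch (the function and variable names are my own, not from any source):

```cpp
#include <cassert>
#include <string>
#include <utility>

std::string makeTemp() { return std::string("hello"); } // returns a temp

void demo() {
    // a naturally-occurring temp (rvalue object) binds to a const lvr,
    // which extends the temp's lifetime to the end of cref's scope
    const std::string& cref = makeTemp();
    // std::string& lref = makeTemp(); // compile error: a non-const lvr
    //                                 // cannot bind to an rvalue object

    std::string named("world");
    std::string&& rref = std::move(named); // move() is only a cast to rvr;
    (void)rref;                            // no move ctor has run yet
    std::string stolen = std::move(named); // the move ctor runs here
    assert(cref == "hello" && stolen == "world");
}
```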

1)template code bloat 2)inline..hurt i-cache@@

Q: does template code bloat increase instruction cache pressure and increase risk of cache thrashing?
A: I have seen about 8 forum mentions of template code bloat increasing i-cache pressure, but no clear evidence and no consensus among authorities. In fact, no known authority has confirmed this issue.

However, I would think having many similar functions in the executable will likely lead to unnecessary competition for i-cache.

In contrast, for inlining there is consensus. wikipedia states — excess inlining will hurt speed, due to inlined code consuming too much of the instruction cache .. Inlining imposes a cost on performance, due to the code expansion (due to duplication) hurting instruction cache performance.[5] This is most significant if, prior to expansion, the working set of the program (or a hot section of code) fit in one level of the memory hierarchy (e.g., L1 cache), but after expansion it no longer fits, resulting in frequent cache misses at that level.

See also inline: footprint+perf can Backfire ! #Google

unique_ptr implicit copy : only for rvr #auto_ptr

P 470-471 [[c++primer]] made it clear that

  • on a regular unique_ptr variable, explicit copy is a compilation error. Different from auto_ptr here.
  • However returning an unnamed temp unique_ptr (rvalue object) from a function is a standard idiom.
    • Factory returning a unique_ptr by value is the most standard idiom.
    • This is actually the scenario in my SCB-FM interview by the team architect

Underlying reason is what I have known for a long time — move-only. What I didn’t know (well enough to impress interviewer) — the implication for implicit copy. Implicit copy is the most common usage of unique_ptr.
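A minimal sketch of the factory idiom (the factory name is hypothetical):

```cpp
#include <cassert>
#include <memory>
#include <utility>

// hypothetical factory: returning an unnamed temp unique_ptr (rvalue object)
// by value is the standard idiom -- the caller move-constructs from the temp
std::unique_ptr<int> makeInt(int v) { return std::make_unique<int>(v); }

void demo() {
    std::unique_ptr<int> p = makeInt(42);  // OK: implicit move from the temp
    // std::unique_ptr<int> q = p;         // compile error: copy ctor deleted
    std::unique_ptr<int> q = std::move(p); // OK: explicit move of a named lvalue
    assert(!p && *q == 42);                // p is now empty
}
```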

max all-black subMatrix #ZR

Same problem as

Q: Given a 2D binary matrix (L by N) filled with white(0) and black(1) cells, find the largest all-black rectangle. See raiserchu’s mail on 12 Sep 13. There is a clever DP solution, probably O(LN).


Worst case — A standard chess board? We can’t do better than O(LN) since there are LN cells to read.

–O(LN) leetcode solution based on histogram .. is latest code with my adaptations and my detailed comments.

— sol5:

First scan O(LN) to record, in each cell {bar height; horizontalBarStart/End}.

— idea 4 (unfinished)

Scan #1 O(LN): build a shadow matrix “histogram” where each integer in the cell is the height (possibly 0) of the bar anchored therein.

Scan #2 O(LN) for each cell, remember the currentRunStart column index i.e. from that column until current column, we have an all-black box of height == current bar height

— sol3 O(LNS): new idea based on max rectangle ] histogram. Treat the top 2 (denote J:=2) rows as a histogram and find the max rectangle therein. Then J:=3 …

  • Scan #1 O(LN): build a shadow matrix “histogram” where each integer in the cell is the height (possibly 0) of the bar anchored therein. In other words, if a cell value=5 then there are exactly 4 consecutive black cells above this (black) cell. Build it incrementally, level by level, top to bottom.
  • Scan #2a: for each row in the shadow matrix, we run the proven algo in O(NS). Note there’s no help from previous row:(
    • S:= #unique heights, N:= matrix width 
  • Scan #2 := the entire scan of L rows. so worst case we hit O(LNS)

Q: Can we do better by reducing scan #2a complexity to O(N), by making use of the previous row results?
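For reference, here is my own sketch of the histogram approach: scan #1 builds the per-cell bar heights; each row then runs the classic stack-based “max rectangle in histogram” in amortized O(N), giving O(LN) overall:

```cpp
#include <algorithm>
#include <cstddef>
#include <stack>
#include <vector>

// classic O(N) stack algorithm: largest rectangle under a histogram
int largestRectInHistogram(const std::vector<int>& h) {
    std::stack<int> st; // indices of bars, heights strictly increasing
    int best = 0, n = (int)h.size();
    for (int i = 0; i <= n; ++i) {
        int cur = (i == n) ? 0 : h[i]; // sentinel 0 flushes the stack
        while (!st.empty() && h[st.top()] >= cur) {
            int height = h[st.top()]; st.pop();
            int width = st.empty() ? i : i - st.top() - 1;
            best = std::max(best, height * width);
        }
        st.push(i);
    }
    return best;
}

int maxBlackRect(const std::vector<std::vector<int>>& m) {
    if (m.empty()) return 0;
    int best = 0;
    std::vector<int> height(m[0].size(), 0); // the "shadow matrix" row
    for (const auto& row : m) {              // build histogram incrementally
        for (std::size_t c = 0; c < row.size(); ++c)
            height[c] = row[c] ? height[c] + 1 : 0;
        best = std::max(best, largestRectInHistogram(height));
    }
    return best;
}
```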

— My brute force solution 1: Each rectangle is identified by 2 vertices, i.e 4 integers. Without loss of generality, We require the “high” corner to have higher x-coordinate and higher y-coordinate than the “low” corner. (We can assume y-axis run upward.) With this O(N^4) nested loop we can iterate over all possible rectangles:

Lock low corner
Move high corner in typewriter (zigzag) steps i.e.
  hold highY and move highX step by step
  process the (series of) resulting rectangles
  increment highY and repeat
Move the lower corner in typewriter steps and repeat

Key observation: any “bad pixel” disqualifies every rectangle containing it.

— Here’s my partial solution:
We can effectively ignore all the “good pixels”.

1) Look at the x coordinates of all bad pixels. Sort them into an array. Find the largest gap. Suppose it’s between x=22 and x=33. Our candidate rectangle extends horizontally from 23 to 32, exactly. Notice there’s no bad pixel within this vertical band [1].
2) Look at the y coordinates of all bad pixels. Sort them into an array. Find the largest gap. Suppose it’s between y=15 and y=18. Our candidate rectangle extends vertically from 16 to 17, exactly.
[1] This candidate rectangle can expand all the way vertically, possibly giving a bigger rectangle. Ditto horizontally.

SCB-FM algo Q2a 2 slists

Q: Given two well-formed singly-linked lists (loop-free), that might be merged at some node, locate the merge point. If not merged, return null.

Note every node has exactly one child node, but may have two parent nodes.


Suppose list A has size P=55 and list B has size Q=32

–SolH: hashtable O(N+M) — optimal in time but not space

First populate hashtable with the short list’s 32 nodes. Then iterate the longer list and check each node’s address.

–SolR: reverse one list … how?

–SolA: array-based.  construct a simple 55-array of A node pointers, and another 32-array of B node pointers. Then compare the two final elements in the two arrays. If same then binary search in B. O(N+M) + O(log M)

–Sol2: 2-pointer algo O(1) space, usable on read-only lists 🙂

first scan to find the 2 end nodes and remember the sizes like 55 vs 32.

IFF the end nodes match, then 2nd scan:

skip first P-Q (23) nodes in the longer list and then increment 2 iterators in lock steps. Compare the 2 iterators at each step.
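Sol2 can be sketched as follows (my own minimal node struct; lists stay read-only):

```cpp
#include <cstddef>

struct Node { int val; Node* next; };

// O(1) space, two scans: compare end nodes, then walk in lock step
Node* mergePoint(Node* a, Node* b) {
    std::size_t p = 0, q = 0;
    Node *ta = a, *tb = b;
    for (Node* n = a; n; n = n->next) { ++p; ta = n; } // size + end node of A
    for (Node* n = b; n; n = n->next) { ++q; tb = n; } // size + end node of B
    if (ta != tb) return nullptr;     // different end nodes: never merged
    // skip the first |P-Q| nodes of the longer list ...
    while (p > q) { a = a->next; --p; }
    while (q > p) { b = b->next; --q; }
    // ... then increment 2 iterators in lock step and compare at each step
    while (a != b) { a = a->next; b = b->next; }
    return a;
}
```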

SCB-FM stack-based FIFO in O(1)amortized

Q: given a hardware-based stack API consisting of 3 functions {pop/push/isEmpty}, please implement a queue API consisting of 3 functions {enqueue/dequeue/isEmpty}


Idea — service dequeue() from a hidden stack.

When hidden stack is empty, pop all nodes from visible stack to hidden stack. Amortized O(1) pop()

isEmpty() must add up two sizes.

[[python cookbook]] P658 implements this classic algo in 9 lines.
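The same classic algo in C++, a sketch with std::stack standing in for the hardware stack API:

```cpp
#include <stack>

// two-stack FIFO: enqueue to the visible stack, dequeue from the hidden one
template <typename T>
class StackQueue {
    std::stack<T> visible; // receives every enqueue
    std::stack<T> hidden;  // services every dequeue
public:
    void enqueue(const T& v) { visible.push(v); }
    T dequeue() {          // amortized O(1): each element moves at most once
        if (hidden.empty())
            while (!visible.empty()) { hidden.push(visible.top()); visible.pop(); }
        T v = hidden.top(); hidden.pop();
        return v;
    }
    bool isEmpty() const { return visible.empty() && hidden.empty(); }
};
```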

SCB-FM IV by architect #shared_ptr upcast

Q: how does the compiler accept this code:
shared_ptr<C> aa = myDerSharedPtr; //myDerSharedPtr is a shared_ptr<D> object

%%Q: shared_ptr<C> has a copy ctor and also a conversion ctor accepting a C raw ptr, but here we are passing in a shared_ptr<D> instance. How does compiler handle it?
%%A: I guess shared_ptr<D> has a conversion operator returning a D raw ptr, but this is not used.
AA: there’s a conversion ctor template<class U> shared_ptr(shared_ptr<U>…) — a TMP trick.

The github experiment also reveals — If a function lvr param is shared_ptr<C> & and you pass in a shared_ptr<D>, the compiler will complain about binding an rvalue (i.e. anonymous temp) object to an lvalue reference — a key insight into rvr + rvalue objects.
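A minimal reproduction of both findings (class names as in the question, the rest is my own):

```cpp
#include <cassert>
#include <memory>

struct C { virtual ~C() = default; };
struct D : public C {};

void demo() {
    std::shared_ptr<D> myDerSharedPtr = std::make_shared<D>();
    // accepted via the member template ctor
    // template<class U> shared_ptr(const shared_ptr<U>&), enabled (by TMP)
    // only when U* is convertible to C* -- no raw-ptr conversion involved
    std::shared_ptr<C> aa = myDerSharedPtr;
    assert(aa.get() == myDerSharedPtr.get());
    assert(aa.use_count() == 2); // the two instances share one control block

    // the second finding: a non-const lvr param shared_ptr<C>& cannot bind
    // to the temp shared_ptr<C> converted from a shared_ptr<D> argument:
    //   void f(std::shared_ptr<C>&);
    //   f(myDerSharedPtr); // compile error: rvalue bound to lvalue reference
}
```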

Q3: just when is the memory freed for temp objects like q[ string1 + string2 ]
%%A: at an unspecified time. A custom string implementation could use COW, in a single-threaded project. This is a common practice in many pre-c++11 libraries
A(from architect): after the semicolon

Q3b: how can you extend the lifetime of those naturally occurring temp object?
A: assign the temp to a “const ref” variable.

Q: what are your favorite c++11/14 features? See ## c++11 features I understand as significant

Q: OK you briefly mentioned move semantic..what is it?

#include <iostream>
#include <string>
using namespace std;

struct C{ //tested
  virtual void f(){/*..*/}
  ~C(){ cout<<"C dtor\n"; } //non-virtual
};
struct D: public C{
  string s;
  D(): s("def"){}
  ~D(){ cout<<"D dtor\n"; }
};
D createD(){return D();} //return by value! probably via RVO
int main(){
  C const & trade = createD();
} //prints "D dtor" then "C dtor" when trade goes out of scope

Q: is string memory freed?
%%A: yes. Verified

Q: what if the string field is in D?
%%A: yes. Verified

I believe the temp D object is on stack and is guaranteed to be destructed. Since the D ctor called base ctor, the dtor sequence is guaranteed to be ~D then ~C.

kernThr ] linux^Unix

Here are a few things I can recall, from [[UnderstandingLinuxKernel]].

P 104 lists some minor Linux kernel threads, but here are the important ones:

  • process 0 (swapper) is a kernel thread. This process gets scheduled when there’s no other process to run
  • process 1 (init) is a kernel thread. This process launches and monitors all other processes except process 0. It’s also the adoptive parent of all orphaned processes, reaping their zombies
  • both of them run forever. Most other linux kernel threads run for a short while and exit.

Linux kernThr is a completely different concept from the traditional unix kernThr. More differences than similarities (zero?)

  • Unix kernel threads aka “native threads” concept was first known to me in jvm.  kernel scheduler can see native threads but not java green threads.
  • native threads often run user mode but linux kernel threads only run in kernel mode and very limited in usage. I think they are mostly set up to access hardware

Linux kernel threads are less fundamental than kernel routines including sysCall handlers + interrupt handlers.

  1. every Linux user process enters kernel mode via kernel routines
  2. not every user process interacts with kernel threads

your brain power^brain power baked into localSys

Background – “how fast you figure things out relative to your peers”.

For each team member AA, the struggle is the same — AA’s brain power vs the cumulative brain power that has gone into the local system, which is a measure of its complexity. If the local system complexity is too high then AA would struggle and take a long time (before he gives up).

The “local system” could include firmwide frameworks, or something open-source.

I prefer a local system created by low site-specific brain power, like one with standard SQL/stored-procs, standard noSQL, standard data encoding (FIX, Json..), standard java/c++ libraries, including OSS.

  • RTS and OC – relatively small amount of site-specific brain power in the system.
  • PWM comm – actually small amount of local system complexity but time given is too short
  • Barc – brand new codebase .. low site-specific brain power.
  • Quartz — the worst

merge 2 sorted slists #2try

Q: Merge two sorted linked lists and return it as a new (ascending) list. The new list should be made by splicing together the nodes of the first two lists.

A test of implementation skill. A test of clear communication and clear thinking. My solution below is O(1) space and O(N) time without allocating any heap memory.

I would pick the list whose head has the lower value as the Yellow jersey (leading). The other list is the green. At any time there exist two head nodes for the two slists.

I always keep a brown pointer to the last of the leading pack in the yellow’s list. The leading pack are the “retired” nodes.

Swap – In the event of a ‘swap’, the brown’s current child (current yellow jersey list head) gets detached and becomes the new green jersey, and the old green jersey moves into the leading pack. Then we increment the ptr — this old green jersey’s child becomes the yellow jersey.

Every time after we increment the pointer, we compare its child node vs the green jersey.

I won’t bother with std::list::splice() and will use python or my own c++ linked list.
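The splice-based merge can be sketched like this (my own minimal node struct; a dummy head plays the “brown” pointer’s starting position):

```cpp
#include <cstddef>

struct Node { int val; Node* next; };

// O(1) space, O(N) time, no heap allocation -- only next-pointers rewired
Node* mergeSorted(Node* a, Node* b) {
    Node dummy{0, nullptr};
    Node* brown = &dummy;          // always points to the last retired node
    while (a && b) {
        if (a->val <= b->val) { brown->next = a; a = a->next; }
        else                  { brown->next = b; b = b->next; }
        brown = brown->next;       // retire one node into the leading pack
    }
    brown->next = a ? a : b;       // splice the remaining tail
    return dummy.next;
}
```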

LFU cache #cf.LRU #72%

Q: design an LFU (Least-Frequently-Used) cache to support the following operations in O(1): get and put
* get(key) – Get the value (will always be positive) of the key if the key exists in the cache, otherwise return -1.
* put(key, value) – Set or insert the value if the key is not already present. When the cache reaches its capacity, it should invalidate the least frequently used item before inserting a new item. For the purpose of this problem, when there is a tie (i.e., two or more keys that have the same frequency), the least recently used key would be evicted.


  1. dstruct — centry i.e. CacheEntry node {key, value, hitCount, lastHit (timestamp), (optional) ptr to host LinkNode}, to be used in an inner linked list.
    • invariant: hitCount can only increase
  2. dstruct — inner minilist of centry nodes
    • invariant: list always sorted by lastHit. We can remove any intermediate node, but incoming node is always added to the Tail
  3. dstruct — fixed-sized (rehash-free) hashtable {key -> ptr to centry}, needed for mid-stream laser-removal
  4. dstruct — LinkNode {level, minilist-of-centry} where all centry objects share the same hitCount denoted “level”.
  5. dstruct — outer list of LinkNodes, always sorted by level

“bubble-up” operation — Whenever a centry gets a cache-hit, its hitCount increments. It immediately and unconditionally bubbles up to the LinkNode one level higher (to be created in O(1) if necessary):
* [o1] query the hashtable and follow ptr to remove the centry from the minilist in an old LinkNode
* [o1] insert the centry to the new level, at Tail of minilist. The new LinkNode could be non-existent but Never empty!
* [o1] optionally, new host LinkNode’s address is saved in the centry

  • Get() hit — relatively easy. Update the hitCount and bubble up
  • Get() miss — trivial
  • Put() Update — similar to get-hit
  • Insertion (possibly after deletion) — [o1] append to the minilist Tail in the Level-1 LinkNode (to be created if necessary) and add to hashtable
  • Deletion — always from list to hashtable, never the converse
    • [o1] identify lowest level present, then delete the head (i.e. eviction target) of minilist
    • when a linkNode becomes empty, it must disappear from the outer list, to prevent build-up of consecutive empty LinkNodes leading to linear search for eviction target. Imagine aaaaa bbbbb c[Now need to evict an “a”]. Therefore, array of LinkNode is unacceptable.
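The design above can be condensed into a sketch like this (my own naming; std::list serves as both the outer list of LinkNodes and the inner minilists, and lastHit is implicit in minilist order, so no timestamp field is needed):

```cpp
#include <cstddef>
#include <iterator>
#include <list>
#include <unordered_map>
#include <utility>

class LFUCache {
    struct Entry { int key, value; };
    struct Level { int hits; std::list<Entry> items; }; // one LinkNode
    std::list<Level> levels;                 // outer list, ascending by hits
    using LvIt = std::list<Level>::iterator;
    using EnIt = std::list<Entry>::iterator;
    std::unordered_map<int, std::pair<LvIt, EnIt>> pos; // O(1) laser-removal
    std::size_t cap;

    // "bubble-up": move the entry to level hits+1, creating that LinkNode
    // in O(1) if necessary, and erasing the old LinkNode if it empties
    void bubbleUp(int key) {
        LvIt lv = pos[key].first;
        EnIt en = pos[key].second;
        int newHits = lv->hits + 1;
        LvIt next = std::next(lv);
        if (next == levels.end() || next->hits != newHits)
            next = levels.insert(next, Level{newHits, {}});
        next->items.push_back(*en);          // incoming node goes to Tail
        lv->items.erase(en);
        if (lv->items.empty()) levels.erase(lv); // no empty LinkNodes
        pos[key] = std::make_pair(next, std::prev(next->items.end()));
    }
public:
    explicit LFUCache(std::size_t capacity) : cap(capacity) {}

    int get(int key) {
        auto it = pos.find(key);
        if (it == pos.end()) return -1;      // miss: trivial
        int v = it->second.second->value;
        bubbleUp(key);                       // hit: bubble up
        return v;
    }
    void put(int key, int value) {
        if (cap == 0) return;
        auto it = pos.find(key);
        if (it != pos.end()) {               // update: similar to get-hit
            it->second.second->value = value;
            bubbleUp(key);
            return;
        }
        if (pos.size() == cap) {             // evict LFU, ties broken by LRU:
            Level& lo = levels.front();      // lowest level present,
            pos.erase(lo.items.front().key); // head of its minilist
            lo.items.pop_front();
            if (lo.items.empty()) levels.pop_front();
        }
        if (levels.empty() || levels.front().hits != 1)
            levels.push_front(Level{1, {}}); // Level-1 LinkNode
        levels.front().items.push_back(Entry{key, value});
        pos[key] = std::make_pair(levels.begin(),
                                  std::prev(levels.front().items.end()));
    }
};
```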

localSys learning: get over the hump

  • eg: nyse xtap
  • eg: OC Quest ownership handover
  • eg: Same with my son’s piano … Same with my son’s math problem solving ..

Once I get over the hump … relief ! and then I tend to keep going for a few hours and gain more local system insight.

There are various signs of the hump…. It’s nice to notice “Hey, we might be at the hump”, so we could turn on “supportive self-coach”

  • block out distractions/interruptions … put on earphone
  • ask for help rather than pushing too hard on our own
  • take frequent breaks with food or mini-workout
  • recognize that even 30 minutes of immersion is tough, valuable and an achievement to be celebrated [1], with or without visible progress.

[1] I WILL reward my son for that.

replacing auto_ptr with unique_ptr #IViewer’s focus

I spent lots of time cleaning up hundreds of auto_ptr usages but interviewers were more interested in the theoretical differences between the two.

  • On the job, it’s an operational challenge. You don’t need the knowledge.
  • On the interview, it’s a knowledge challenge… theoretical, bookish. You don’t need the GTD capacities.

##af 70 ..spend%%spare time meaningfully

Holy grail — Long-term sustainable (hopefully intrinsic) motivation + modest level of expertise, with reliable (albeit low) income and (a bit of) social value.

I need a purpose, a goal to work towards… Without it, the absence of a … job would create a void. Depression, lack of purpose, loss of energy. None of the below is easily achievable or easily available. Whichever I choose, need to work towards it.

  • Research would be ideal. I have proven aptitude in theoretical domains ..
  • xp: I think the RTS/NYSE work has more meaning as it impacts more users.
  • xp: devops effort has proven value to the local team
  • I might consider joining a start-up, which provides employment and learning opportunity for younger workers (perhaps in their 50’s?)
  • Teach (online) — Chinese/English, with emphasis on writing and vocab
  • Teach (online) — programming? threading, data struct, algo
  • Teach — statistical data analysis, If not outdated..
  • Teach — enterprise app design, If not outdated? Too competitive. They may not take in an old programmer.
  • Teach — financial math? After 70?
    • ▼this domain is too competitive and entry barrier too high. A lot of effort to cross it but demand is low.
    • ▼Limited practical value. more specialized, but growing demand.
    • ▼I feel I need interaction with people.
  • Two-way translation service, but I prefer interactions.
  • Chinese medicine?

Tim (RTS), a friend in his 50’s gave 3 points

  1. earn a salary to help kids pay student loan
  2. sightseeing worldwide — costs a lot
  3. volunteering

increase vi usage in git-bash #leverage

  • vi is incredibly resilient, long-living, and widely available
  • More vi usage can prolong my professional life
    • comparable to memorizing phone numbers without phone book
    • comparable to sleeping on floor
    • comparable to standing desk — not absolutely necessary, but trains the body system. Slowly Builds confidence
    • comparable to barefoot jogging
    • comparable to mental math
  • it gives me (real) advantage over younger programmers

Please be realistic and slowly increase the usage. Some may suggest big-bang.

localSys insight: Self-learning

Q: by correlating the logs, config, source code, queries … on your own [3] in your spare time [2], do you think you can gain insight into local system?

This question is at the heart of my GTD challenges, including code-reading weakness.

I don’t need to be super-fast to make progress.

[2] I often have some spare time, not a lot but a few hours a week. Sometimes my work hour is exactly allocated to this. In 2018, I have plenty of weekend hours 🙂

[3] self-learning — There’s a balance to strike between asking and self-learning. I’m reasonably good at guessing :).

I guess some developers rely on asking others and then making rapid progress. It has not been very successful for me, and I have not tried it many times. For me, it’s almost always a personal journey, not a guided tour.

new orderId is generated for order-replace, !! order-modify

XDP_Integrated_Feed_Client_Specification_v2.1g.pdf shows that the NYSE matching engine has specific business logic for order-replace vs order-modify

  • order-modify reuses existing orderId
  • order-replace would create a new orderId — The “sitting” order must be removed from the book and replaced with the new order.

My friend Ashish said in more than one forex ECN’s,  order update always generates new orderId. He actually implemented the order update business logic in FIX clients that connect to the ECN’s.

risk-neutral means..illustrated by CIP

Background — all of my valuation procedures are subjective, like valuing a property, an oil field, a commodity …

Risk-Neutral has always been confusing, vague, abstract to me. CIP ^ UIP, based on Mark Hendricks notes has an illustration —

  • RN means .. regardless of individuals’ risk profiles … therefore objective
  • RN means .. partially [1] backed by arbitrage arguments, but often theoretical
    • [1] partially can mean 30% or 80%
    • If it’s well supported by arbitrage argument, then replication becomes theoretical foundation of RN pricing


max-profit # easy SCB-FM

Q: maximize profit on a given a price series … you can buy, sell, buy, sell any number of times, each time one share. No buy-buy-sell-sell allowed, i.e. at the time of buying, you must have zero inventory.

Q: ok you said your solution is O(N), can you do it in one scan?

====my answer

If price keeps dropping, then no trade possible

If price keeps rising steadily, then only one buy-sell pair

if i get peak-trough-peak-trough… then i must discard the first peak since I can’t do anything with it.
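The one-scan answer reduces to: total profit = sum of all positive day-over-day increments (buy each trough, sell each peak). A sketch:

```cpp
#include <cstddef>
#include <vector>

// one scan, O(N) time, O(1) space: with at most one share held at a time,
// the optimal strategy collects every positive price increment
int maxProfit(const std::vector<int>& price) {
    int profit = 0;
    for (std::size_t i = 1; i < price.size(); ++i)
        if (price[i] > price[i-1]) profit += price[i] - price[i-1];
    return profit;
}
```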


SCB-FM design IV #AshS#80%

Q: Your parser class is a plug-in in a framework. The framework would call your parser’s member function onData(seqNum, packet)  whenever the framework receives a packet on a UDP socket. You need to deal with

  • out of sequence packets
  • duplicate packets

Inside your onData(), you need to invoke a client callback function like theClient->callback(ptr2packet) but you need to invoke it 1) in correct sequence and 2) without duplicates.

Note the requirement is part of TCP’s job functions. TCP receives out-of-sequence packets (and/or duplicates) from IP layer and must deliver sequenced packets to the application layer.

Does TCP use ring buffer or hashtable? I doubt it, but we are building simpler solutions and are free to choose our data structure.

====My solution=====

an illustration —

seq# received | warehoused | sent | in-use region of ring buffer
1             |            | 1    |
5             | 5          |      | 2-5
9             | 5,9        |      | 2-9
3             | 3,5,9      |      |
2             |            | 2-3  | 4-9
6             | 5,6,9      |      |
8             | 5,6,8,9    |      |
7             | 5-9        |      |
11            | 5-9,11     |      | 4-11
4             |            | 4-9  | 10-11
  • keep the packets (like fixed-size struct instances) in a large singleton circular array (or a deque). Save each packet in a slot keyed by the seq number of the packet (modulus the array size). Remember the nextSeqToSend. If we get a higher sequence than that, just warehouse it in the circular buffer.
  • (Interviewer didn’t ask) How do you reuse slots in the circular buffer? Given ten thousand slots #0~#9999, when I’m warehousing packet #109,999 in slot #9999, then conceptually the old packet in the #0 slot was already sent out, so I can safely “wrap around” to save next packet (#110,000) in there. I can implement my system to ensure it is actually safe.
  • What if the sequence numbers I receive jump wildly? Well, in real systems this will never happen (except an explicit seq reset). At most the sequence numbers jump ahead by a few hundreds. Assuming the sequence numbers arrive largely in correct order with occasional out-of-order arrivals, ring buffer is a natural choice. Without this assumption, dictionary solutions (Ashish saw in QuickFix) might be more suitable.
  • permanent gaps? If I see an old gap (like nextSeqToSend == #55 and no #56 but #57~#8000 all received) then we need a policy to mark the gap as permanent. Otherwise we would have to wait for it indefinitely.
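A sketch of the ring-buffer sequencer (my own naming; assumes seq numbers start at 1 and never jump ahead by more than the buffer capacity, per the bullet above):

```cpp
#include <cstddef>
#include <functional>
#include <string>
#include <utility>
#include <vector>

class Sequencer {
    struct Slot { bool occupied = false; std::string packet; };
    std::vector<Slot> ring;           // slot keyed by seq % capacity
    std::size_t nextSeqToSend;
    std::function<void(std::size_t, const std::string&)> callback;
public:
    Sequencer(std::size_t capacity,
              std::function<void(std::size_t, const std::string&)> cb)
        : ring(capacity), nextSeqToSend(1), callback(std::move(cb)) {}

    void onData(std::size_t seq, const std::string& packet) {
        if (seq < nextSeqToSend) return;  // duplicate of an already-sent seq
        Slot& s = ring[seq % ring.size()];
        if (s.occupied) return;           // duplicate already warehoused
        s.occupied = true;                // warehouse it
        s.packet = packet;
        // drain the in-sequence run starting at nextSeqToSend, if any
        for (;;) {
            Slot& head = ring[nextSeqToSend % ring.size()];
            if (!head.occupied) break;    // gap: wait for it
            callback(nextSeqToSend, head.packet);
            head.occupied = false;        // slot reusable after wrap-around
            ++nextSeqToSend;
        }
    }
};
```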

Q (from interviewer): if you use a deque, how do you allocate slot for packet #5 while waiting for #4?
%%A: i would allocate for both, but keep #4 slot vacant. Not sure if std::deque has this API. I think my deque will hold pointers … dummy pointer represents vacant.

Justification for deque is similar to ring buffer — to keep the queue length short and release once-used memory.

I haven’t analyzed hashtables i.e. dictionaries. I believe it’s a proven solution with no major drawbacks.

#1 minor drawback of hashtable-based (or deque) relative to ring buffer is runtime allocation which is about 100 to 1000 times slower than arithmetic operations. For this reason, I always favor ring buffers when it’s a natural and logical data structure choice. This is my bias in many system designs. Sequence-number-based systems can often use ring buffers.

Another minor drawback of hashtable is memory overhead. Ring buffer has relatively small overhead in addition to the packet footprints; hashtable wraps each packet in a link node. Hashtable also needs an expandable bucket array.

In terms of runtime efficiency, I am not so sure. I feel circular array has faster read/write. Hashtable depends on the hash function, which can degrade due to hash collisions.


SCB-FM eTrading IV#1

Q: tell me more about your pricing projects
Q: is your project management style agile?
Q: what’s const correctness and “mutable”
Q: cpu cache optimization

(open) Q: forward vs backward iteration of array .. any performance difference?
%%A: i don’t know any, though some people advocate backward

Q: make_shared advantage over calling ctor of shared_ptr?
%%A: memory leak… Correct.
%%A: one allocation only
%%A: perfect forwarding

Q: is shared_ptr thread safe?
%%A: yes, but only for the increments/decrements of the reference count
%%A: if concurrent with a copying operation on inst3, inst3 is reset on another thread, then I don’t know if it’s thread safe. See thread-unsafe shared_ptr: tiny examples

Q5: any experience with c++11?
Q5a: what are the c++11 code modernization changes you described in resume. Examples?

Q: auto_ptr vs unique_ptr
%%A: unique_ptr can be moved (explicitly), not copied. auto_ptr can be copied and moved??
%%A: unique_ptr can go into containers. Yes see unique^shared^auto_ptr #container

q[visible progress]=Unreasonable expectations

See also my sample list in the sms twister blog visible progress # very rare therefore to be celebrated

Contrast with the views in [[reconciliation]]. Beware empty [g=] glorifications that don’t mean much when I look back 20Y later.

Every week, I actually make more progress than my fellow daddies with kids and commute etc. However, once in a while, in retrospect, I would fall apart and cast serious doubt on (and belittle) my progress and point out the unfortunately invisible long-term effect.

I think many people implicitly follow harsh and simplistic criteria like earning capacity, kids’ grades or absolute return, to dismiss and discredit all the “progresses”. This can become irrational, counterproductive, and /demotivating/ — engine loss of power. Such criteria are unfair to the self. If you are a teacher or coach, would you be so harsh on your student?

It can be a punishment, like a flogging whip.

Putting on a critical thinker’s hat, I feel that for most guys in my situation, it’s no mean achievement to maintain current condition and make small progress, with invisible long-term effect. Anything more is asking too much, and requires luck, talent, determination, context etc.

  • –ranked by …? I want to highlight the unsung heroes…
  • cholesterol, dental, belly and weight? maintaining is no mean achievement
  • loving relationship with wife? maintained, even strengthened
  • knowledge (and first hand experience) with diet, fitness, aging? building up slowly
  • more blood donation, done for my kids.
  • semi-retirement planning? improving through 5 discussions/year
  • more familiar with Bayonne residential market
  • relationship with in-laws? improved, as visible long term progress. More important — relationship with my own parents maintained
  • boy’s renzi and Chinese reading? improved slightly. Not really long term visible progress but at least he maintained
  • physical flexibility? maintained .. yes! Improvement? yes a bit of visible progress, with huge effort
  • stamina? maintained … no mean achievement
  • [g] financial domain knowledge? I expanded to FX; market data; low-latency equity; FIX exchange trading…. Visible progress but shallow.
  • algo and coding test performance? I tend to belittle the improvement
  • bonding with kids? constantly building, deepening… Not by default, but by effort.
  • c++/c# conquered as a visible long term progress. Rather hard and long mileage, which probably means high entry barrier for the new entrants.
    • Probably more important — java skill level maintained.
  • credit score
  • financial assets (mostly holding well against inflation)? yes visible progress but I tend to belittle it. Building this portfolio actually required persistent effort, years of analysis, experiments, ..

our coding drills r different: fundamental reason #le2XR

Fundamentally, one chooses how to practice based on past interview experience of his own, not hearsays.

Question for you — how many times in the last 18 months did you get a coding interview that required 30+ minutes per question?

A: about 20 out of 25 positions in my recent experience, not including a BGC-partners onsite where they gave me 4 different questions but each 20 minutes only.

Most of my coding rounds are hackerrank/codility or over weekend.

  • Now I think for you the answer is 10% (but did you include those hacker rank tests?)
  • Now I think for you, mostly you don’t need to compile ! That’s why you call them “algorithm questions” rather than coding questions.

Even if I persuade you on the importance of edit-compile-test speed, or the value of python, sooner or later you would doubt “Really? How come I seldom need to write real code so fast in an interview?”. You would eventually stop practicing with real-code and refocus on pure algorithms, by reading key ideas behind common questions.

If you spend hours of focused energy writing real code as practice, and feel worn out, your own, personal experience would eventually kick in and remind you that it’s not worth it.

Conversely, if I were to follow your method to memorize key ideas only, my personal experience would soon shout out — “Take a step back and look again, you are already pretty fast coming up with data structure and algorithm ideas, but your REAL weakness is implementing them — too slow !”

I listed all recent interview episodes in tabular format —

  • 100% of big tech companies require at least one coding question lasting more than 30 minutes.
  • 100% of big buy-side shops require at least one coding question lasting more than 30 minutes.
  • More than 50% of investment bank jobs also require it.
  • For the other c++ financial companies (the so-called “third party” players) like Bloomberg, exchanges, brokers, market data providers, about 80% of the jobs require it.

local jargon, localSys architecture #WFH

Background: the blogpost “3 types of portable dnlg ] finance IT” listed 3 portable types of dnlg (i.e. domain knowledge), but now I feel the local (i.e. non-portable) dnlg is more relevant to

  • your remote work
  • GTD productivity,
  • your survival,
  • your overall value-add to the team,
  • assessment by manager

Interviewers can only ask about portable dnlg. Local dnlg includes things like

  1. local jargon
  2. local system architecture, local “data flow”

mgr|risk| age-unfriendly job mkt cf contractor

Statistically, very few IT managers can maintain the income level beyond age 55.

I believe those younger managers in their 30’s and 40’s are often more competitive and more hungry (ambitious), and more capable, at least in terms of tech learning.

Even if you are above average as a manager, the chance of rising up is statistically slim and you end up contending against the younger, hungrier, /up-and-coming/ rising stars.

::value^::type #typedef

This TMP technique is the simplest but not for the uninitiated.

In the &lt;type_traits&gt; header, many templates expose a ::value or ::type construct. Often these two “members” are the only visible output from the type-trait meta-functions.

  • ::value is a static field, typically true/false
  • ::type is a member typedef, typically related to the type argument T, but can be “void” like in enable_if
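To illustrate, here is a minimal hand-rolled trait in the same style. The name is_pointer_like is my own invention; the standard equivalents are std::is_pointer (for ::value) and std::remove_pointer (for ::type).

```cpp
#include <type_traits>

// Hypothetical trait written in the same style as <type_traits>:
// the only visible outputs are ::value and ::type.
template <typename T>
struct is_pointer_like {
    static const bool value = false;   // the ::value output
};

template <typename T>
struct is_pointer_like<T*> {           // partial specialization for pointers
    static const bool value = true;
    typedef T type;                    // the ::type output: the pointee type
};

// Compile-time checks, cross-checked against the standard traits:
static_assert(is_pointer_like<int*>::value, "pointer detected via ::value");
static_assert(std::is_same<is_pointer_like<int*>::type, int>::value,
              "pointee extracted via ::type");
```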


low-latency: avoid DAM #escape analysis

After talking to Stroustrup and Martin Thompson, I now believe low-latency loves the stack more than the heap.

  • eg: std::shared_ptr designers spent a lot of time optimizing away the heap allocation of the ref counter. See other blogposts
  • eg: Escape analysis (available in java compiler for many years) is one proven technique to convert a heap allocation into a stack usage

Unlike the “avoid concurrency” principle, I think this “avoid DAM” principle is hard to achieve in large systems. However, it is more achievable in reusable data structures, which can be building blocks in large systems.
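As a sketch of the “avoid DAM” principle (function names and the capacity bound are my own assumptions), the same computation can be served from a stack buffer instead of the heap whenever the size is bounded:

```cpp
#include <array>
#include <vector>
#include <numeric>

// Heap version: std::vector's buffer is DAM (dynamically allocated memory).
int sumHeap(int n) {
    std::vector<int> buf(n, 1);
    return std::accumulate(buf.begin(), buf.end(), 0);
}

// Stack version: with a known upper bound, a fixed-capacity std::array
// lives entirely on the stack: no allocator call, no deallocation.
int sumStack(int n) {
    std::array<int, 64> buf{};            // capacity bound is an assumption
    int m = n < 64 ? n : 64;
    for (int i = 0; i < m; ++i) buf[i] = 1;
    return std::accumulate(buf.begin(), buf.begin() + m, 0);
}
```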

i-cache caution: manual inlining

Suppose hot function f2() calls f1(). Should f1() be inlined?

Martin Thompson suggested we developers can effectively “take over” inlining by copying a short function f1()’s body into a caller function f2(), if both functions are hot.

This way, we don’t rely on c++ compiler or JIT compiler to decide what to inline. Note JIT compiler can make that decision dynamically based on runtime heuristics.

However, if the expanded f2()’s footprint becomes too big to fit into i-cache, then this practice can backfire.
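A trivial sketch of the manual inlining above (function names f1/f2 follow the text; the arithmetic is made up):

```cpp
// Before: hot f2() calls the short, hot f1().
static int f1(int x) { return x + 1; }
static int f2_before(int x) { return f1(x) * 2; }

// After manual inlining: f1()'s body is copied into f2(), so we no
// longer rely on the c++ compiler or JIT to make the inlining decision.
static int f2_after(int x) { return (x + 1) * 2; }
```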

One article gave an elegant illustration of reverse-inlining: extract the corner-case into a non-inlined function.

“If you have a large function with many cases, ask yourself which will be executed the most often. Place the infrequently-executed cases in a different function.”

If a corner case is embedded in a hot function, i-cache may suffer. A JIT compiler could in theory move it to end of a hot loop, but that depends on many factors. Is it good practice for us developers to manually move the less frequent code paths towards the end, and introduce early-returns before them? I guess the i-cache would only contain the first half of the function containing the hot code paths?

The trick — if you extract the corner case into a separate function and somehow[1] ensure it is not inlined, then i-cache would be relieved.

[1] A JIT compiler would notice the function should NOT be inlined, but in c++ I am not sure.
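In c++ one way to do [1] is a compiler-specific attribute. A sketch (function names are made up; [[gnu::noinline]] is a GCC/Clang attribute, while MSVC spells it __declspec(noinline)):

```cpp
#include <cstdio>

// Cold path, deliberately kept out of line so it does not pollute
// the caller's i-cache footprint.
[[gnu::noinline]] static int handleCornerCase(int x) {
    std::printf("rare input %d\n", x);
    return 0;
}

// Hot path stays short, so more of it fits in the i-cache.
static int hotFunction(int x) {
    if (x < 0)                       // rarely true
        return handleCornerCase(x);  // out-of-line cold call
    return x * 2;                    // the common, cache-friendly path
}
```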


i-cache: avoid handling corner cases]hot functions #JIT inlining

“Functions often contain a significant amount of code that deals with corner cases, that is, cases which are rarely executed. This means that a large number of instructions read into cache are rarely executed.”

Martin Thompson agreed. He hinted that JIT compilers could (or have been designed to) notice the “cold” corner cases and dynamically refactor the code path so the corner case is handled at the end of a hot loop.

Martin also said inlining has to be done judiciously, to avoid pulling corner cases (cold stuff) into hot functions. In c++, inlining decisions are made at compile time, but JIT can make the same decisions at run-time.

Personally, I guess this JIT technique is academic or experimental, probably hit-n-miss so the performance gain/penalty is hard to predict.

JIT has opportunities to avoid vtable latency #Martin

P76 [[javaPerf]] described a nifty JIT technique to avoid runtime cost of the dynamic binding of virtual function equals(). Suppose in some class, we call obj1.equals(obj2).

After a priming (i.e. warm-up) period, JIT collects enough statistics to see that every dynamic dispatch at this site is calling String.equals(), so it turns the call into faster “static binding”: the String.equals() function address is hardwired into the assembly code (not JVM bytecode). JIT also needs to handle the possibility of Character.equals(): I guess the assembly code can detect that obj1 is not a String instance and retry the virtual function lookup. JIT can generate assembly code to
1. verify obj1 is a String and call an inlined String.equals()
2. if obj1 is not a String, then use obj1’s vtable to look up the virtual equals()

It may turn out that 99.9% of the time we can skip the time-consuming Step 2 🙂

Martin gave a (hypothetical?) example. Suppose JIT notices that obj1 is always either a String or Character. JIT could inline both equals() functions and completely bypass vtable. (Branching can be done via if/else or …) This inline compilation is done after a fairly long period of instrumentation. I asked Martin why c++ can’t do it. He said c++ only uses static compilation. I feel c++ compilers don’t bother with this technique as it is not a proven performance win.
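A hand-written c++ analogue of that JIT guard (class names are made up): test the hot type first and call it non-virtually (hence inlinable), keeping vtable dispatch as the rare fallback.

```cpp
#include <typeinfo>

struct Base {
    virtual ~Base() {}
    virtual int id() const { return 0; }
};
struct HotType : Base {
    int id() const override { return 42; }
};

int callId(const Base& obj) {
    if (typeid(obj) == typeid(HotType))   // cheap guard on the observed hot type
        // qualified call: devirtualized, so the compiler may inline it
        return static_cast<const HotType&>(obj).HotType::id();
    return obj.id();                      // rare slow path: vtable lookup
}
```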

low-latency: avoid concurrency #ST-mode

Backgrounder — CPU speed is increasing more gradually than before. The technology industry as a whole is advancing more horizontally, by increasing parallelism. Yet the best low-latency designs use neither lock-free techniques nor concurrency at all.

I asked Martin Thompson — To really push the limit of latency, should we avoid concurrency as much as possible, completely eliminating it if possible? Answer is yes. Martin pointed out the difference between

  • parallel design: use multitasking, possibly in ST-mode, to “do multiple things at the same time”
  • concurrent design: deal with multitasking, i.e. “deal with multiple things at the same time”. The expression “deal with” implies complexities, hazards, risks, control, management.

One of the hidden hazards Martin pointed out is heap memory de-allocation, but that’s for another blogpost.

fastestSortFor: 1-str; arr@Str; 64bit floats #non-comparison

I think the big-O coding question may require this knowledge, because

… real-world sorting problems can often use key-based sorting rather than comparison-based sorting.

Therefore, if your overall solution requires sorting, don’t assume it would become O(J*K + M^2 + P*N logN) based on your O(N logN) sorting assumption. Here are some achievable linear-time sorts:

| sort | O(?) | data type | ref | notes |
| --- | --- | --- | --- | --- |
| counting | O(N) | chars | | |
| radix | O(N) | 64-bit ints | | O(Nw) -> O(N) |
| radix | O(N) | floats | [3][4] | IEEE format |
| radix | O(N) | variable-size strings | [1/2] | #1 fastest |
| burst | O(N) | variable-size strings | | #2 fastest; key-based, trie-based |
| radix | O(N) | date/time | | convert to fixed-length strings or fixed-size integers |
| bucket? | O(N) | random ints in python/perl | | unbounded; no limit on #digits |
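As a minimal example of a non-comparison (key-based) sort, counting sort over chars runs in O(N + 256), i.e. linear in the input length (countingSort is my own name for the helper):

```cpp
#include <string>
#include <array>

// Counting sort for chars: tally each byte, then emit bytes in order.
std::string countingSort(const std::string& s) {
    std::array<int, 256> cnt{};            // one bucket per possible byte value
    for (unsigned char c : s) ++cnt[c];
    std::string out;
    out.reserve(s.size());
    for (int c = 0; c < 256; ++c)
        out.append(cnt[c], static_cast<char>(c));  // emit each byte cnt[c] times
    return out;
}
```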

MSA seminar by Chris Richardson

One of the main (perhaps #1) justifications for MSA is devops including testability, phased roll-out, resilience …

Useful terminology — A particular microservice has a single service provider and some “consumers”.

REST is the simplest implementation, but async (with msg broker) was highly recommended by Chris.

Q: too many services, too many failure points?
A: async (with msg brokers) improves reliability and availability

— cloud and MSA

Containers are not the only way to host a service in the cloud. Chris mentioned “lambda” as the simplest alternative.

— each service can use a different programming language. Chris said this is controversial but can be useful. Say we are migrating from golang to node.js. We could implement one of five services in the new language and migrate in phases.

— “validation” in production environment #we don’t say “testing”

  1. deploy a new version of a service to production, with no traffic routed to it
  2. expose the new service instance to the same production requests as the live service, but in readonly mode, so there’s no write to data store, no visible output except logging.
  3. Now run it in full mode
  4. route some fake traffic (like a test symbol in mvea) and validate the output
  5. route some real traffic in small quantity and rely on comprehensive service monitoring

— My questions + other audience questions
Q1: UDP as alternative to TCP?
A: UDP is used when latency targets are in microseconds. Many MSA use cases are in milliseconds.

Q2: latency … I would think monolithic is fastest
A: no simple yes/no answer
A: MSA definitely improves scalability. I would think the web2.0 shops do care about latency too.

Q3: distributed transaction/rollback …. I thought this technology had been reliable for decades?
A: CAP .. choose between consistency and availability. I think Chris means
A: there’s a paper

Q4: c++…. is MSA unsuitable?
A: the prominent voices tend to be java, dotnet, node.js , golang, and python, but c++ is not unsuitable. Some key software projects are in c++ (I think for efficiency)

git | beyondCompare

— bcomp advantage over git-diff
bcomp shows byte count in new vs old versions… can detect intrafile cut-paste

— based on

git config --global diff.tool bc
git config --global difftool.bc.path /c/Progra~1/BeyondCompare4/BComp.exe
git config --global difftool.prompt false

The above sets up git-difftool but how about git-diff? I feel one of the two is sufficient.

git config --global merge.tool bc
git config --global mergetool.bc.path /c/Progra~1/BeyondCompare4/BComp.exe

— no integration with git-diff?
Think again: if your git-diff always uses a GUI, then you lose the console UI, which is frequently valuable. Therefore, it’s a good thing that git-diff is left alone. The set-up above worked for me, including a 3-way merge GUI —

git mergetool feature/my_br6 origin/feature/my_br6 --


array=#1 important data structure

C supports only array (horizontal) and struct (vertical). I feel most standard libraries across languages are designed based on the same two constructs. Array + graph are about the only data structures in those libraries and in CIV.

For cross-language coding drill, we should probably keep our focus on arrays.

For comp science algorithm research, there’s more energy focused on array than any other data structure.

The performance domain is low-level. Among the various data structures, the array is the real focus of micro-tuning and hardware optimizations.

Proven GTD: worthless in candidate ranking #JackZ

I feel that Jack Zhang is competent with localSys GTD but weak on c++ and comp science.

Does he have working knowledge of c++? I assume so. Working knowledge is attainable in a couple of months for a clean language, and up to a year for c++

The required level of working knowledge and basic skill is very low for localSys GTD.

His c++ knowledge is probably barely enough to do the job. Remember, I once didn’t know which things live on the java heap vs the stack.

Based on my guesstimate, he would fail any c++ interview and any algo interview. He can write simple SQL in an interview, but I am not sure if he can write complex joins.

The fact that Jack and DeepakM are proven on GTD is useless, and gets lost in the conversation.

How about CSY? He can solve many algo problems without practice, but he is reluctant to practice.

I think the self-sense of on-the-job competency is misleading. Many in their positions might feel GTD competency is more important than IV skills. They are so afraid of the benchmark that they don’t want to study for it.

When the topic of tech interview comes up, I think they wish to escape or cover their ears.

jvm^c++ as infrastructure

c/c++ is part of the infrastructure of many new technologies, and consequently will last for decades whereas java may not.

😦 This doesn’t mean there will be enough c++ jobs for me and my C++ friends.

  • JVM is an infrastructure for a relatively small number of new languages and new frameworks like spring, hadoop, .. However, the machine-learning community seems to regard python and c++ as the mainstay.
  • Java (not JVM) serves as infrastructure in the new domains of MSA, cloud, big data etc, but not Machine Learning.

longest consecutive ints ] matrix 70% done


Q: Given an n*n square matrix where all numbers are distinct, find the maximum-length path (starting from any cell) such that all cells along the path are in increasing order with a difference of 1. We can move in 4 directions from a given cell (i, j), i.e. we can move to (i+1, j), (i, j+1), (i-1, j) or (i, j-1), with the condition that adjacent cells on the path differ by 1.


Input: mat[][] = {
{1, 2, 9}
{5, 3, 8}
{4, 6, 7}}
Output: 4
The longest path is 6-7-8-9.

–sol1, probably O(N*N) for an N*N matrix, since each node is checked no more than 5 times.

1) typewriter-search for the first unvisited node. Exit if all nodes are visited. Once a node is found, designate it as cur.
2) down-search .. explore cur’s four neighbors to see if any holds cur’s value minus 1.
3) if found, then designate that node as cur and continue the down-search.
4) at the end of the down-search, cur is the minimum of its chain; now start the symmetric up-search (looking for cur’s value plus 1).
5) at the end of the up-search we have a linked list (a chain). Mark all nodes on it as visited (even a singleton, else step 1 would loop forever), and record the list size.
6) go back to #1.
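A sketch of sol1 (my own implementation; it relies on the distinct values guaranteeing that each cell has at most one neighbor with value minus 1 and one with value plus 1, so chains are unique and disjoint):

```cpp
#include <vector>
#include <algorithm>

// Typewriter-scan for an unvisited cell, slide "down" to its chain's
// minimum, then walk "up" counting the chain length, marking every
// node visited. Each cell is touched O(1) times, so overall O(N*N).
int longestConsecutivePath(const std::vector<std::vector<int>>& mat) {
    const int n = (int)mat.size();
    std::vector<std::vector<bool>> visited(n, std::vector<bool>(n, false));
    const int dr[] = {1, -1, 0, 0}, dc[] = {0, 0, 1, -1};

    // find the (unique) neighbor of (r,c) holding the given value, if any
    auto findNeighbor = [&](int r, int c, int value, int& nr, int& nc) -> bool {
        for (int k = 0; k < 4; ++k) {
            int rr = r + dr[k], cc = c + dc[k];
            if (rr >= 0 && rr < n && cc >= 0 && cc < n && mat[rr][cc] == value) {
                nr = rr; nc = cc; return true;
            }
        }
        return false;
    };

    int best = 0;
    for (int i = 0; i < n; ++i)
        for (int j = 0; j < n; ++j) {
            if (visited[i][j]) continue;
            int r = i, c = j, nr, nc;
            // down-search: slide to the chain's minimum value
            while (findNeighbor(r, c, mat[r][c] - 1, nr, nc)) { r = nr; c = nc; }
            // up-search: count the chain, marking every node visited
            int len = 0;
            for (;;) {
                visited[r][c] = true;
                ++len;
                if (!findNeighbor(r, c, mat[r][c] + 1, nr, nc)) break;
                r = nr; c = nc;
            }
            best = std::max(best, len);
        }
    return best;
}
```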