c++11 lockfree simple tutorial




Not written by a foremost expert, but I like the simplicity.


unnoticed gain{SG3jobs: 看破quantDev

All three jobs were java-lite , with some quantDev exposure. Through these jobs, I gained the crucial clarity about the bleak reality of the quantDev career direction. The clarity enabled me to take the bold decision to stop the brave but costly TSN attempts to secure a foothold. Foothold is simply too tough and futile.

Traditional QuantDev in derivative pricing is a shrinking job pool. Poor portability of skills without any standard set of interview topics.

QuantDev offers no contract roles !

Instead, I successfully established some c#/py/c++ trec. The c++ accu, though incomplete, was especially difficult and precious.

Without these progresses, I would be lacking the confidence in py/c#/c++ to work towards and achieving multiple job offers. I would still be stuck in the quantDev direction.

closestMatch in sorted-collection: j^python^c++

–java is cleanest. P236 (P183 for Set) [[java generics]] lists four methods belonging to the NavigableMap interface

  • ceilingEntry(key) — closest entry higher or equal
  • higherEntry(key) — closest entry strictly higher than key
  • lowerEntry
  • floorEntry

both DEPTH+breadth expected of%%age: selective dig

Given my age, many interviewers expect me to demonstrate insight into many essential (not-obscure) topics such as lockfree (Ilya),

Interviewers expect a tough combination of breadth + some depth. To my advantage I’m a low-level digger, and also a broad-based avid reader. The cross-reference in blog is esp. valuable.

Challenge for me — identify which subtopic to dig deeper, among the breadth of topics, given my limited absorbency and the distractions.

jvm footprint: classes can dominate objects

P56 of The official [[java platform performance]], written by SUN java dev team, has pie charts showing that

  • a typical Large server app can have about 20% of heap usage taken up by classes, rather than objects.
  • a typical small or medium client app usually have more RAM used by classes than data, up to 66% of heap usage take up by classes.

On the same page also says it’s possible to reduce class footprint.

long-term value: QQ imt ECT speed

The alpha geeks — authors, experts, open source contributors … are they fast enough to win those coding contests?

The speed-coding contest winners … are they powerful, influential, innovative, creative, insightful? Their IQ? Not necessarily high, but they are nobody if not superfast.

The QQ knowledge is, by definition, not needed on projects, usually obscure, deep, theoretical or advanced technical knowledge. As such, QQ knowledge has some value, arguably more than the ECT speed.

Some say a self-respecting programmer need some of this QQ knowledge.

noexcept impact on RAII

If a function (esp. dtor) is declared noexcept, compiler can choose to omit stack-unwinding “scaffolding” around it.  Among other things, there’s a runtime performance gain. This gain is a real advantage of using noexcept instead of empty throw() which is deprecated in c++0x.

Q: Is there any impact on RAII?
%%A: yes

Q: Can we even use RAII in such a context?
%%A: I think we can. If the function does throw, then std::terminate() runs, instead of the destructors

q[static thread_local ] in %%production code

static thread_local std::vector<std::vector<std::string>> allFills; // I have this in my CRAB codebase, running in production.

Justification — In this scenario, data is populated on every SOAP request, so keeping them as non-static data members is doable but considered pollutive.

How about static field? I used to think it’s thread-safe …

When thread_local is applied to a variable of block scope, the storage-class-specifier static is implied if it does not appear explicitly. In my code I make it explicit.

TCP_NODELAY to improve latency


Nagle’s algo helps in applications like telnet. However, it may increase latency when sending streaming data. Additionally, if the receiver implements the ‘delayed ACK policy’, it will cause a temporary deadlock situation. In such cases, disabling Nagle’s algorithm is a better option.

temp object binding preferences: rvr,lvr..

Based on https://www.codesynthesis.com/~boris/blog/2012/07/24/const-rvalue-references/

  • a const L-value reference … … can bind to a naturally-occuring rvalue object
  • a const rvalue reference (rvr) can bind to a naturally-occuring rvalue object but canNOT bind to an lvalue object

Q: so what kind of reference can naturally occuring temp objects bind to?
A: a const rvr. Such an object prefers to bind to a const rvalue reference rather than a const lvalue reference

There are some obscure use cases for this binding preference.

nonVirtual1() calling this->virt2() #templMethod

http://www.cs.technion.ac.il/users/yechiel/c++-faq/calling-virtuals-from-base.html has a simple sample code. Simple idea but there are complexities:

  • the given print() should never be used inside base class ctor/dtor. In general, I believe any virt2() like any virtual function behaves non-virtual in ctor/dtor.
  • superclass now depends on subclass. The FAQ author basically says this dependency is by-design. I believe this is template-method pattern.
  • pure-virtual is probably required here.

%% absorbency: reading+blogging imt coding

I can read dry QQ topics for hours each day for many days, but tend to lose steam after coding for a few hours.

My coding drill drains my “laser” energy faster than QQ reading/blogging. When my laser is weakened (by boredom), I must drill harder to go through the “brick walls”.

–I guess many fellow programmers enjoy coding more than reading. I feel lucky that in my interviews, knowledge tests still outweigh coding test.


xchg between register/memory triggers fencing

I guess xchg is used in some basic concurrency constructs, but I need to research more.

See my blogpost hardware mutex, based@XCHG instruction

https://c9x.me/x86/html/file_module_x86_id_328.html points out the implicit locking performed by CPU whenever one of the two operands is a memory location. The other operand must be a register.

This implicit locking is quite expensive according to https://stackoverflow.com/questions/50102342/how-does-xchg-work-in-intel-assembly-language, but presumably cheaper than user-level locking.

This implicit locking involves a memory fence.

compile-time ^ run-time linking

https://en.wikipedia.org/wiki/Dynamic_linker describes the “magic” of linking *.so files with some a.out at runtime, This is more unknown and “magical” than compile-time linking.

“Linking is often referred to as a process that is performed when the executable is compiled, while a dynamic linker is a special part of an operating system that loads external shared libraries into a running process”

I now think when the technical literature mentions linking or linker I need to ask “early linker or late linker?”

reasons to limit tcost@SG job hunt #XR

XR said a few times that it is too time consuming each time to prepare for job interviews. The 3 or 4 months he spent has no long-term value. I immediately voiced my disagreement because I took IV fitness training as a lifelong mission, just like jogging or yoga or chin-up.

This view remains as my fundamental perspective, but my disposable time is limited. If I can save the time and spend in on some meaningful endeavors  [1] then it’s better to have a shorter job hunt.

[1] Q: what endeavors?
A: yoga
A: diet
A: stocks? takes very little effort
A: ?

Hadoop apps: Is java preferred@@

I didn’t hear about any negative experience with any other languages, I would assume yes java is preferred, and the most proven choice. If you go with the most popular “combination” then you can find the thriving ecosystem — most resources online and the widest support tools.

–According to one website:

Hadoop itself is written in Java, with some C-written components. The Big Data solutions are scalable and can be created in any language that you prefer. Depending on your preferences, advantages, and disadvantages presented above, you can use any language you want.

POSIX countingSemaphore ^ lock+condVar #Solaris docs

https://docs.oracle.com/cd/E19120-01/open.solaris/816-5137/sync-11157/index.html points out a lesser-known difference in the Solaris context:

Because semaphores need not be acquired and be released by the same thread, semaphores can be used for asynchronous event notification, such as in signal handlers (but presumably not interrupt handlers). And, because semaphores contain state, semaphores can be used asynchronously without acquiring a mutex lock as is required by condition variables. However, semaphores are not as efficient as mutex locks.

The same page also shows POSIX countingSemaphore can be used IPC or between threads.

prefer ::at()over operator[]read`containers#UB

::at() throws exception … consistently 🙂

  • For (ordered or unordered) maps, I would prefer ::at() for reading, since operator[] silently inserts for lookup miss.
  • For vector, I would always favor vector::at() since operator[] has undefined behavior when index is beyond the end.
    1. worst outcome is getting trash without warning. I remember getting trash from an invalid STL iterator.
    2. better is consistent seg fault
    3. best is exception, since I can catch it


Machine Learning #notes

Machine Learning — can be thought of as a method of data analysis, but a method that can automate analytical model building. As such, this method can find hidden insights unknown to the data scientist. I think the AlphaGo Zero is an example .. https://en.wikipedia.org/wiki/AlphaGo_Zero

Training artificial intelligence without datasets derived from human experts is… valuable in practice because expert data is “often expensive, unreliable or simply unavailable.”

AlphaGo Zero’s neural network was trained using TensorFlow. The robot engaged in reinforcement learning, playing against itself until it could anticipate its own moves and how those moves would affect the game’s outcome

So the robot’s training is by playing against itself, not studying past games by other players.

The robot discovered many playing strategies that human players never thought of. In the first three days AlphaGo Zero played 4.9 million games against itself and learned more strategies than any human can.

In the game of GO, world’s strongest players are no longer humans. Strongest players are all robots. The strongest strategies humans have developed are easily beaten by these robots. Human players can watch these top (robot) players fight against each other, and try to understand why their strategies work.

t-investment: QQ^coding^localSys

My investment in QQ category is more adequate than my investment in

  • coding drill
  • local sys knowledge

All 3 are important to family well-being, sense of security, progress, self-esteem..

However, localSys learning is obviously non-portable. Still at the current juncture localSys is the most important area lacking “sunshine”

I created a shared_ptr with a local object address..

In my trade busting project, I once created a local object, and used its address to construct a shared_ptr (under an alias like TradePtr).

I hit consistent crashes. I think the reason is — shared_ptr likes heap objects. When my function returns, the shared_ptr tried to call delete on the raw ptr, which points at the local stack, leading to crash.

The proven solution — make_shared()

Q: passive income ⇒ reduce GTD pressure#positive stress

My (growing) Passive income does reduce cash flow pressure… but it has no effect so far on my work GTD pressure.

Q: Anything more effective more practical?

  1. take more frequent unpaid leaves, to exercise, blog or visit family
  2. expensive gym membership

How about a lower salary job (key: low caliber team)? No I still want some challenge some engagement, some uphill, some positive stress.

unique_ptr implicit copy : only for rvr #auto_ptr

P 470-471 [[c++primer]] made it clear that

  • on a regular unique_ptr variable, explicit copy is a compilation error. Different from auto_ptr here.
  • However returning an unnamed temp unique_ptr (rvalue object) from a function is a standard idiom.
    • Factory returning a unique_ptr by value is the most standard idiom.
    • This is actually the scenario in my SCB-FM interview by the team architect

Underlying reason is what I have known for a long time — move-only. What I didn’t know (well enough to impress interviewer) — the implication for implicit copy. Implicit copy is the most common usage of unique_ptr.

mgr role risk: age-unfriendly

Statistically, very few IT managers can maintain the income level beyond age 55.

I believe those managers in 30’s and 40’s are often more capable, more competitive and more ambitious.

Even if you are above average as a manager, the chance of rising up is statistically slim and you end up contending against the younger, hungrier, /up-and-coming/ rising stars.

git | merge conflict #personal tips

I prefer cherry-pick.

Often the merge/rebase/cherry-pick/stash-pop operation would silently put some of the changes in _staging_, therefore git-diff would fail to show them. I have to use git-diff-HEAD instead:(

–If you have no choice then here’s the standard procedure

  1. git rebase master
  2. you get a merge conflict and file1 now contains <<<< ===
  3. vi file1 # to remove the bad stuff
  4. git add file1 # on the unnamed branch
  5. # no commit needed
  6. git rebase –continue # will bring you back to your feature branch


overcoming exchange-FIX-session throughput limit

Some exchanges (CME?) limits each client to 30 orders per second. If we have a burst of order to send , I can see two common solutions A) upstream queuing B) multiple sessions

  1. upstream queuing is a must in many contexts. I think this is similar to TCP flow control.
    • queuing in MOM? Possible but not the only choice
  2. an exchange can allow 100+ FIX sessions for one big client like a big ibank.
    • Note a big exchange operator like nsdq can have dozens of individual exchanges.

Q: is there any (sender self-discipline) flow control in intranet FIX?
A: not needed.

prefer std::deque when you need stack/queue

  • std::stack is unnecessarily troublesome for dumping and troubleshooting … see https://github.com/tiger40490/repo1/blob/cpp1/cpp/2d/maxRectangle.cpp
  • std::queue has a similar restriction.

Philosophically, std::stack and std::queue are container adapters designed to deny random access, often for safety. They are often based on an underlying “crippled deque”. As soon as you need to dump the container content, you want a full-fledged deque.

c++^java..how relevant ] 20Y@@

See [17] j^c++^c# churn/stability…

C++ has survived more than one wave of technology churn. It has lost market share time and time again, but hasn’t /bowed out/. I feel SQL, Unix and shell-scripting are similar survivors.

C++ is by far the most difficult languages to use and learn. (You can learn it in 6 months but likely very superficial.) Yet many companies still pick it instead of java, python, ruby — sign of strength.

C is low-level. C++ usage can be equally low-level, but c++ is more complicated than C.

convert a recursive algo to iterative #inOrderWalk

Suppose you have just one function being called recursively. (2-function scenario is similar.) Say it has 5 parameters. Create a struct named FRAME (having 5 fields + possibly a field for lineNo/instructionPointer.)

Maintain a stack holding the Frame instances. Each time the recursive algorithm adds to the call stack, we add to our stack too.

Wiki page on inorder tree walk  has very concise recursive/iterative algos. https://github.com/tiger40490/repo1/blob/py1/py/tree/iterative_InOrderWalk.py is my own attempt that’s not so simple. Some lessons:

  • Differentiate between popping vs peeking the top.
  • For a given node, popping and printing generally happen at different times without any clear pattern.
    • the sequence of pop() is probably a pre-order tree walk
    • the sequence of print is an in-order tree walk

reactive java #learning notes

After a 5-minute glance, I feel this is yet another (one of a group) jxee add-on package. Not sure about its shelf-life.

There are many jargon terms in, or related to, this concept. Presumably too many (and intimidating) to a new comer.

https://spring.io/blog/2016/06/07/notes-on-reactive-programming-part-i-the-reactive-landscape is a 2016 Spring article.

https://dzone.com/articles/rxjava-part-1-a-quick-introduction is a 2016 tutorial to RxJava library.

How is debugger breakpoint implemented@@ brief notes #CSY

This is a once-only obscure interview question. I said up-front that CPU interrupts were needed. I still think so.

I believe CPU support is needed to debug assembly programs, where kernel may not exist.

For regular C program I still believe special CPU instructions are needed.

https://eli.thegreenplace.net/2011/01/27/how-debuggers-work-part-2-breakpoints seems to agree.

It says Interrupt #3 is designed for debugging. It also says SIGTRAP is used in linux, but windows supports no signals.


NxN matrix: graph@N nodes #IV

Simon Ma of CVA team showed me this simple technique.

https://github.com/tiger40490/repo1/blob/cpp1/cpp1/miscIVQ/tokenLinked_Friend.cpp is my first usage of it.

  • I only needed half of all matrix cells (excluding the diagonal cells) because relationships are bilateral.
  • Otherwise, if graph edges are directed, then we need all (N-1)(N-1) cells since A->B is not same as B->A.

My discrete math textbook shows this is a simplified form of representation and can’t handle self-link or parallel edge. The vertex-edge matrix is more robust.


churn !! bad ] mktData #socket,FIX,.. unexpected!

I feel the technology churn is remarkably low.

New low-level latency techniques are coming up frequently, but these topics are actually “shallow” and low complexity to the app developer.

  • epoll replacing select()? yes churn, but much less tragic than the stories with swing, perl, structs
  • most of the interview topics are unchanging
  • concurrency? not always needed. If needed, then often fairly simple.

UDP recv()from 1 send()at most

P116 [[tcp/ip sockets in C]] made it very clear.

A call to recv on the receiver machine will return data from at most one send() on the sender machine.

It can be a partial message, but would be the first part. See https://stackoverflow.com/questions/13317532/receiving-a-part-of-packet-via-recvfrom-udp

I believe entire payload of one send()/sendto() is packaged into an envelope. The kernel would never deliver two envelopes to one recv()/recvfrom() call. Therefore receiver can only receive one envelope at a time. If entire envelope is too large then only only first part of the payload is delivered.

childThr.get_id() after join()

Not sure about pthreads but here is c++11 std::thread::get_id() behavior:
“If the thread object is not joinable, the function returns a default-constructed object of member type thread::id.”
I believe after you join a childThr, that thread is no longer joinable, SO get_id() will return a meaningless boilerplate value.

Solution: to use that id, you need to save it before joining
https://github.com/tiger40490/repo1/blob/cpp1/cpp/thr/takeTurn.cpp is my experiment

## optimize code for i-cache: few tips

I don’t see any ground-breaking suggestions. I think only very hot functions (confirmed by oprofile + cachegrind) requires such micro-optimization.

I like the function^code based fragmentation framework on https://www.eetimes.com/document.asp?doc_id=1275472 (3 parts)

  • inline: footprint+perf can backfire. Can be classified as embedding
  • use table lookup to replace “if” ladder — minimize jumps
  • branching — refactor a lengthy-n-cold (not “hot”) code chunk out to a function, so 99% of the time the instruction cache (esp. the pre-fetch flavor) doesn’t load a big chunk of cold stuff.
    • this is the opposite of embedding !
  • Trim the executable footprint. Reduce code bloat due to inlining and templates?
  • loop unrolling to minimize jumps. I think this is practical and time-honored — at aggressive optimization levels some compilers actually perform loop unrolling!
  • Use array (anything contiguous) instead of linked list or maps??

https://software.intel.com/en-us/blogs/2014/11/17/split-huge-function-if-called-by-loop-for-best-utilizing-instruction-cache is a 2014 Intel paper — split huge function if it’s invoked in a loop.


reinterpret_cast(zero-copy)^memcpy: raw mktData parsing

Raw market data input comes in as array of unsigned chars. I “reinterpret_cast” it to a pointer-to-TradeMsgStruct before looking up each field inside the struct.

Now I think this is the fastest solution. Zero-cost at runtime.

As an alternative, memcpy is also popular but it requires bitwise copy. It often require allocating a tmp variable.

JGC G1 Metaspace: phrasebook #intern


c++condVar 2 usages #timedWait

poll()as timer]real time C : industrial-strength #RTS is somewhat similar.

http://www.stroustrup.com/C++11FAQ.html#std-condition singles out two distinct usages:

1) notification
2) timed wait — often forgotten

https://en.cppreference.com/w/cpp/thread/condition_variable/wait_for shows std::condition_variable::wait_for() takes a std::chrono::duration parameter, which has nanosec precision.

Note java wait() also has nanosec precision.

std::condition_variable::wait_until() can be useful too, featured in my proposal RTS pbflow msg+time files #wait_until

killing a stuck thread #cancellation points#CSY/Vanguard

Q: once you know one of many threads is stuck in a production process, what can you do? Can you kill a single thread?
A: there will Not be a standard construct provided by OS or thread library because killing a thread is inherently unsafe.. Look at java Thread.stop()
A: yes if I have a builtin kill-hook in the binary

https://www.thoughtspot.com/codex/threadstacks-library-inspect-stacktraces-live-c-processes describes a readonly custom hook. Its conceivable to add a kill feature —

  • Each thread runs a main loop to check an exit-condition periodically.
  • This exit-condition would be similar to pthreads “cancellation points”

https://stackoverflow.com/questions/10961714/how-to-properly-stop-the-thread-in-java shows two common kill hooks — interrupt and Boolean flag


socket^swing: distinct(specialized skill)from core language

  • I always believe swing is a distinct skill from core java. A regular core Java or jxee guy needs a few years experience to become swing veteran.
  • Now I feel socket programming is similarly a distinct skill from core C/c++

In both cases, since the core language knowledge won’t extend to this specialized domain, you need to invest personal time outside work hours .. look at CSY. That’s why we need to be selective which domain.

Socket domain has much better longevity (shelf-life)  than swing!


  1. fads — vaguely I feel these are fads.
  2. salary — (Compare to financial IT) absolute profit created by data science is small but headcount is high ==> most practitioners are not well-paid. Only buy-side data science stands out
  3. volatile — I see data science too volatile and churning, like javascript, GUI and c#.
  4. shrink — I see traditional derivative-pricing domain shrinking.
  5. entry barrier — quant domain requires huge investment but may not reward me financially
  6. value — I am suspicious of the economic value they claim to create.

##functions(outside big4)using rvr param or std::move

Q: Any function(outside big4)with rvr param?
%%A: Such functions are rare. I don’t know any.
AA: [[effModernC++]] has a few functions taking rvr param.
AA: P544 [[c++primer]] says class methods could use rvr param
* eg: push_back()

Q: any function (outside big4) using std::move?

  • [[effModernC++]] has a few functions.
  • P544 [[c++primer]] says rarely needed
  • [[josuttis]] p20


down-cast a reference #idiom

ARM P69 says down-cast a reference is fairly common. I have never seen it.

Q: Why not use ptr?
%%A: I guess pointer can be null so the receiver function must face the risk of a null ptr.
%%A: 99% of references I have seen in my projects are function parameters, so references are extremely popular and proven in this use case. If you receive a ref-to-base, you can down cast it.

See post on new-and-dynamic_cast-exceptions
see also boost polymorphic_cast

favor std::begin(arrayOrContainer)

https://stackoverflow.com/questions/7593086/why-use-non-member-begin-and-end-functions-in-c11 explains some important details.

Q: So how do we choose between

  • this free global function
  • the container member function cont::begin() / end()?

%%A: Basically, I would always use std::begin() instead of cont.begin() esp. in template-enable programs.

slist in python@@ #no std #Ashish

A quick google search shows

* python doesn’t offer linked list in standard library

* python’s workhorse list like [2,1,5] is a expendable array, i.e. vector. See https://stackoverflow.com/questions/3917574/how-is-pythons-list-implemented and https://www.quora.com/How-are-Python-lists-implemented-internally

* {5, 1, 0} braces can initialize a set. I very seldom use a set since a dict is almost always good-enough.

mgr role risk: smaller job pool

When not comfortable (under threat), or job lost … the prospect of finding a similar job is much worse than a hands-on developer, because the number of senior mgr jobs is much smaller.

Avichal basically said he would avoid hands-off manager roles.

As contractor, most of the time I feel very relaxed about moving in and out. The price to pay, of course, is lower salary.

## emulators of secDB #GregR

— secDB derivatives/cousins/emulators:

  • Pimco and Macquarie licensed Beacon
  • CS hired two secDB veterans to build a similar system on java
  • MS has a RCE (risk calc engine) project, based on scala and java
  • UBS tried it too. I applied for this job in 2011 or 2012
  • Athena and Quartz
  • BlackRock Aladdin (1988), written in java, for risk management across portfolios. All other functionalities are secondary.

I feel you need such a system only if your books have many derivative contracts that needs constant “revaluation”. This is a core feature of derivative risk systems.

Q: beyond risk systems, why is Quartz also supporting trade booking and execution?

I think secrete key is in the data store, which is central to those systems. SecDB systems feature a specially designed in-memory and replicated data store, which can be the basis of those systems.

A special data store is live and reference market data.

y I use lots of if..{continue;}

Inside a loop, many people prefer if/elif/else. To them, it looks neat and structured.

However, I prefer the messier if…continue; if…continue; if..continue. Justification?

I don’t have to look past pageful of if/elif/elif/…/else to see what else happens to my current item. I can ignore the rest of the loop body.

Beside the current item, I also can safely let go (rather than keeping track) of all the loop-local variable values longer. All of them will be wiped out and reset to new values.

pass temp obj into func by val: move-ctor skipped #RVO CSY

Suppose we have a class MoveOnlyStr which has only move-ctor, no copy-ctor. Suppose we pass an unnamed temporary instance of this class into a function by value, like func1(arg_mos).

Q: Will the move-ctor be used to create the argument object arg_mos? We discussed this in your car last time we met up.
A: No. My test shows I was right to predict that compiler optimizes away the temporary, due to RVO. So move-ctor is NOT used.

This RVO optimization has existed long before c++11.

c++GC interface


GC interface is partly designed to enable

  • reachability-based leak detectors
  • garbage collection

The probe program listed in the URL shows that as of 2019, all major compilers provide trivial support for GC.

Q: why does c++ need GC, given RAII and smart pointers?
A: system-managed automatic GC instead of manual deallocation, without smart pointers

Y allocate static field in .c file %%take

why do we have to define static field myStaticInt in a cpp file?

For a non-static field myInt, the allocation happens when the class instance is allocated on stack, on heap (with new()) or in global area.

However, myStaticInt isn’t take care of. It’s not on the real estate of the new instance. That’s why we need to declare it in the class header, and then define it exactly once (ODR) in a cpp file.

p2p messaging beats MOM ] low-latency trading

example — RTS exchange feed dissemination infrastructure uses raw TCP and UDP sockets and no MOM

example — the biggest sell-side equity OMS network uses MOM only for minor things (eg?). No MOM for market data. No MOM carrying FIX order messages. Between OMS nodes on the network, FIX over TCP is used

I read and recorded the same technique in 2009… in this blog

Q: why is this technique not used on west coast or main street ?
%%A: I feel on west coast throughput outweighs latency. MOM enhances throughput.

CMS JGC: deprecated in java9

Java9/10 default GC is G1. CMS is officially deprecated in Java 9.

Java8/7 default GC is ParallelGC, CMS. See https://stackoverflow.com/questions/33206313/default-garbage-collector-for-java-8

Note parallelGC uses

  • parallel in most generations
  • serial in old gen

…whereas parallelOldGC uses parallel in all generations.

Q: why is CMS deprecated?
A: one blogger seems to know the news well. He said JVM engineering team needs to focus on new GC engines and need to let go the most high-maintenance but outdated codebase — the CMS, As a result, new development will cease on CMS but CMS engine is likely to be available for a long time.

efficient swap(): two containers-of-T

Background — template function std::swap(T&, T&) works for int, float etc, but the same implementation will not work efficiently for vector, list, map or set. Therefore I suspected there might be specializations of swap() template function.

As it turns out, vector (and the other containers) provides a swap() member function. So the implementation of vector swap is indeed different from std::swap().

q[inline] to avoid a.. jump^stackFrame

Dino of BBG FX team asked me — when you mark a small f1() function inline (like manually copying the code into main()), you save yourself a jump or a new stack frame?

A: both a jump and a new stack frame.

It turns out a new stack frame would require a jump, because after the new stack frame is created, thread jumps to the beginning of f1().

However, there’s something to set up before the jump — Suppose f1() is on Line 5 in main(), then Line 6’s address has to be saved to CPU register, otherwise the thread has no idea where to” jump back” after returning from f1(). According to my codebashing training (required at RTS team), this Line 6’s address is saved in the main() stack frame, not the f1() stack frame!

Note the Line 6’s address is not a heap address not a stack address but an pointer into the code area.

jmp_buf/setjmp() basics for IV #ANSI-C

Q: how is longjmp different from goto? See http://ecomputernotes.com/what-is-c/function-a-pointer/what-is-the-difference-between-goto-and-longjmp-and-setjmp

A: longjmp can 1) jump across functions, and 2) restore program state from a jmp_buf, which was saved earlier by setjmp.

fileHandle/socket/dbConection are thread Unsafe

There’s a buffer in a DB connection, in a file handle, in a socket …

The buffer is a shared mutable object. Consider a read-buffer. The host object knows how much of the data in the buffer was already delivered, to avoid repeated delivery. There’s some kind of watermark, which is moved by a consumer thread.

As all shared mutables, these objects are thread unsafe.

All of these objects can also be allocated on stack and therefore invisible to other threads. Therefore, this could be the basis of a thread-safe design.


python run complex external commands #subprocess

I prefer one single full-feature solution that’s enough for all my needs. The os.system() solution is limited. The subprocess module is clearly superior. One of the simplest features is

>>> subprocess.call([“ls”, “-l”])

If you need redirection and background, then try the single-string version

>>> subprocess.call(‘ls /tmp > /tmp/a.log &’, shell=True) # output goes to STDOUT, hard to capture

c++4beatFronts: which1to LetGo if I must pick1

What to let go, given I have limited O2/laser/bandwidth and mental capacity?

  1. give up BP — biggest room for improvement but least hope
  2. give up codility style — Just get other friends to help me. See codility: ignore overflow, bigO
  3. give up algo — already decent. diminishing return. Smaller room for improvement? But practice on some small algo questions can significantly improve my ECT.

container of smart^raw pointer

In many cases, people need to store addresses in a container. Let’s use std::vector for example. Both smart ptr and raw ptr are common and practical

  • Problem with raw ptr — stray pointer. Usually we the vector doesn’t “own” the pointee, and won’t delete them. But what if the pointee is deleted somewhere and we access the stray pointer in this vector? smart pointer would solve this problem nicely.
  • J4 raw ptr — footprint efficiency. Raw ptr object is smaller.

WordPad : rich-text note taker

I found WordPad RTF files a good middle ground between ascii text and MSWord files.

  • 🙂 basic text effects like background+font colors
  • 🙂 footprint is comparable to ascii, much smaller than MSWord files.
    • Q: How about after compression?
  • 🙂 how about copying to Linux? tested

Verdict —

  1. default general-purpose note-taker remains ascii, but
  2. for office work notes with text highlighting, let’s try WordPad more often. But let’s not become dependent. Convert old rtf notes to ascii.

select^poll # phrasebook

Based on https://www.ulduzsoft.com/2014/01/select-poll-epoll-practical-difference-for-system-architects/, which I respect.

  • descriptor count — up to 200 is fine with select(); 1000 is fine with poll(); Above 1000 consider epoll
  • time-out precision — poll/epoll has millisec precision. select() has nanosec, a million times higher precision, but only embedded devices need such precision.
  • single-threaded app — poll is just as fast as epoll. epoll() excels in MT.

https://www.ibm.com/support/knowledgecenter/ssw_ibm_i_71/rzab6/poll.htm has sample code on poll().

collection-of-abstract-shape: j^c++

In java, this usage pattern is simple and extremely common — Shape interface.

In c++, we need a container of shared_ptr to pure abstract base class 😦

  • pure abstract interface can support MI
  • shared_ptr supports reference counting, in place of garbage collection
  • pointer instead of nonref payload type, to avoid slicing.

This little “case study” illustrates some fundamental differences between java and c++, and showcases some of the key advantages of java.

wfh – j4^concerns

I trust XR’s experience…

  1. 😦 Productivity is likely (XR: inevitably) affected … increasing the deadline stress.
  2. 🙂 saves commute time, can be used on coding drill or with family.
  3. 😦 I might feel isolated, lack of support … though I do come into office quite often. I think WFH workers need to be more independent.

gdb –tui #split screen

You can also start gdb normally, then switch to split screen, with “ctrl-x  ctrl-a” (thanks to Gregory).

Upper screen shows the source code with a moving marker

–Here’s my full gdb command showing

  • –args to run the target executable with arguments
  • redirect stderr to a file so my gdb screen isn’t messed up — not always effective

gdb –tui –args $base/shared/tp_xtap/bin/xtap -D 9 -c $base/etc/test_replay.cfg 2 >

big guns: template4c++^reflection4(java+python)

Most complex libraries (or systems) in java require reflection to meet the inherent complexity;

Most complex libraries in c++ require template meta-programming.

But these are for different reasons… which I’m not confident to point out.

Most complex python systems require … reflection + import hacks? I feel python’s reflection (as with other scripting languages) is more powerful, less restricted. I feel reflection is at the core of some (most?) of the power features in python – import, polymorphism

## low-complexity QQ topics #JGC/parser..

java GC is an example of “low-complexity domain”. Isolated knowledge pearls. (Complexity would be high if you delve into the implementation.)

Other examples

  • FIX? slightly more complex when you need to debug source code. java GC has no “source code” for us.
  • socket programming? conceptually, relatively small number of variations and combinations. But when I get into a big project I am likely to see the true color.
  • stateless feed parser coded against an exchange spec

GTD/KPI/zbs/effi^productivity defined#succinctly

  • zbs — real, core strength (内功) of the tech foundation for GTD and IV. Zbs (not GTD) is required as architect, lead developer, or for west coast interviews.
  • GTD — make the damn thing work. LG of quality, code smell, maintainability etc
  • KPI — boss’s assessment often uses productivity as the most important underlying KPI, though they won’t say it.
  • productivity — GTD level as measured by __manager__, at a higher level than “effi”
  • effi — a lower level measurement than “Productivity”

joining/leaving a multicast group

Every multicast address is a group address. In other words, a multicast address identifies a group.

Sending a multicast datagram is much simpler than receiving…

[1] http://www.tldp.org/HOWTO/Multicast-HOWTO-2.html is a concise 4-page introduction. Describes joining/leaving.

[2] http://ntrg.cs.tcd.ie/undergrad/4ba2/multicast/antony/ has sample code to send/receive. Note there’s no server/client actually.


Single-Threaded^STM #jargon

SingleThreadedMode means each thread operates without regard to other threads, as if there’s no shared mutable data with other threads.

Single-Threaded can mean

  • no other thread is doing this task. We ensure there’s strict control in place. Our thread still needs to synchronize with other threads doing other tasks.
  • there’s only one thread in the process.

Does inline really make a difference@@ Yes

MSVS and g++ debug build both disable inline (presumably to ease debugging). The performance difference vis-a-vis release build is mostly due to this single factor, according to [[optimized c++]]

The same author asserts that inlining is probably the most powerful code optimization.

Pimpl effectively disables inline.

However, c++faq gives many reasons for and against inline.

simple implementation of memory allocator#no src

P9 [[c++game development primer]] has a short implementation without using heap. The memory pool comes from a large array of chars. The allocator keeps track of allocated chunks but doesn’t reuse reclaimed chunks.

It showcases the header associated with each allocated chunk. This feature is also part of a real heap allocator.

reinterpret_cast is used repeatedly.

y concurrentHM.size() must lock entire map#my take

Why not lock one segment, get the subcount, unlock, then move to next segment?

Here’s my take. Suppose 2 threads concurrently inserts an item in each of two segments. Before that, there are 33 items. Afterwards, there are 35 items. So 33 and 35 are both "correct" answers. 34 is incorrect.

If you lock one segment at a time, you could count an old value in one segment then a new value in another segment.

multicast address 1110xxxx #briefly

By definition, multicast addresses all start with 1110 in the first half byte. Routers seeing such a destnation (never a source) address knows the msg is a multicast msg.

However, routers don’t forward any msg with destnation address through because these are local multicast addresses. I guess these local multicast addresses are like 192.168.* addresses.

serialize access to shared mutable: mutex^CAS

[[optimized c++]] P290 points out that, in addition to mutex, CAS construct also serializes access to shared mutable objects.

I feel it’s nothing but a restatement of the definition of “shared mutable”.  More relevant question is

Q: what constructs support unimpeded concurrent access to shared mutable?
A: read-write lock lets multiple readers proceed, in the absence of writers.
A: RCU lets all readers proceed, but writers are restricted.

mutex^condition top2constructs across platforms#except win32

https://developer.mozilla.org/en-US/docs/Mozilla/Projects/NSPR/About_NSPR is one more system programming manual that asserts the fundamental importance of mutex and condition variable.

NSPR started with java in mind but primary application is supporting clients written entirely in C or C++.

I said many times in this blog that most thread synchronization features are built on top these duo.

However, on Windows the fundamental constructs are locks and waitHandles…

reinterpret_cast{int}( somePtr): practical use

http://stackoverflow.com/questions/17880960/what-does-it-mean-to-reinterpret-cast-a-pointer-as-long shows a real use case of reinterpret_cast from pointer to integer, and reinterpret_cast back to pointer.

What if I serialize an object to xml and the object contains a pointer field (reference field is less common but possible)? Boost::serialization would unwrap the pointer and serialize the pointee object.

Most of the time (like parsing unsigned char array), we reinterpret_cast from an (array) address into address of another data type

async – 2 threads 2 different execution contexts

See also https://bintanvictor.wordpress.com/2013/01/06/every-async-operation-involves-a-sync-call/

Sync call is simple — the output of the actual operation is processed on the same thread, so all the objects on the stack frame are available.

In an async call, the RR fires and forgets. Firing means registration.

The output data is processed …. usually not RR thread. Therefore the host object of the callback and other objects on the RR stack frame are not available.

If one of them is made available to the callback, then the object must be protected from concurrent access.

SQL is slow even if completely in-memory

Many people told me flat-file data store is always faster than RDBMS.

For writing, I guess one reason is — transactional. REBMS may have to populate the redo log etc.

For reading, I am less sure. I feel noSQL (including flat-file) simplify the query operation and avoid scans or full-table joins. So is that faster than a in-memory SQL query? Not sure.

Instead of competing with RDBMS on the traditional game, noSQL products change the game and still get the job done.

memory fencing: terminology

Let’s standardize our terminology in this blog to ease searching — “fence instruction” and “memory-fencing” are more flexible words than “barrier”… also shorter.

Visibility-of-update is part of memory fencing.

Q: does memory fencing imply flush to main memory and bypassing register?
A: I think memory fencing is more about statement reordering and less about “flush to main memory”. All of these affect global visibility of a shared mutable.


## insertion sort: phrasebook

  • nearly sorted — for nearly-sorted sample, beat all other sorts  including counting sort
  • small — for small sample, beats MSD radix sort and all comparison sort
    • switch — many comparison sort implementations switch to insertion-sort when the sub-array becomes small
  • poker — Intuitively, this is how players sort poker cards
    • shift — requires shifting part of the sorted portion.

python initializing list^dict differently #ECT

Favor {} to dict() due to performance. Also dict() can’t take numbers as keys 😦

pprint({1:’value1′,   3+2:’value2′})

Empty [] is faster than list() but I prefer list() because

See https://stackoverflow.com/questions/5790860/and-vs-list-and-dict-which-is-better

##ECT tips@python triple quote #print

  • 3-single-quote or 3-double-quote are the  same thing.  Single is faster to type 🙂
  • trick: You can print a triple-quoted multi-line string
  • trick: I also use this for multi-line comments, as a standard practice — https://www.python.org/dev/peps/pep-0008/#documentation-strings
  • trick: You can construct formatted multi-line strings from a list of value, or from a dict such as locals()

*morePure* SAM^functional interface]java

Most experts and authors on the internet don’t differentiate the 2. [[functional thinking]] says that a java 8 functional interface can have default methods in addition to a SAM (single abstract method)

Later, I will probably add a bit more details, but this is the key message.

Q: is a lambda related to SAM or FI (with default methods)? Official answer seems to be FI but can a FI-with-default-methods?

send()recv() ^ write()read() @sockets

Q: A socket is also a file descriptor, so why bother with send()recv() when you can use write()read()?

A: See https://stackoverflow.com/questions/9475442/unix-domain-socket-vs-named-pipes

send()recv() are recommended, more widely used and better documented.

[[linux kernel]] P623 actually uses read()write() for udp, in stead of sendto()recvfrom(), but only after a call to connect() to set the remote address

STL containers: pick correct SORT for each

Needed in coding IV

  • std::sort() is good for array, vector, deque…
  • std::list has no random access iterator so it must use list::sort() method
  • a single string? You have to construct a vector<char>
  • unordered containers don’t need sorting

map (and set) remain sorted so don’t need a sort operation. They can specify a special sort comparitor either as a template arg or a ctor arg. P316 and P334 [[c++ std lib]] compare the two choices. You can also also pass no comparitor arg but define operator< in the key class

avoid unsigned-int type if you ever test for positiveness

Beware in coding tests…

Even though unsigend types are self-documenting, Google style guide advises against unsigned int types as they are error-prone.

When I use size_t as a loop control variable, I need to avoid decrementing it below 0, which is something like undefined behavior for me.

size_t sz=myMap.size();
for(int i=0; i<sz-N; ++i) //sz-N can overflow to a very large number!

c++atomic^volatile #fencing

See other posts about volatile, in both c++ and java.

For a “volatile” variable, c++ (unlike java) compiler is not required to introduce memory fence i.e. updates by one thread is not guaranteed to be visible globally. See Eric Lippert on http://stackoverflow.com/questions/26307071/does-the-c-volatile-keyword-introduce-a-memory-fence

based on [[effModernC++]]
atomic volatile
optimization: reorder prevented permitted 😦
atomicity of RMW operations enforced not enfoced
optimization: redundant access skip permitted prevented
cross-thr visibility of update probably enforced no guarantee [1]
memory fencing enforced no guarantee

[1] https://en.wikipedia.org/wiki/Memory_barrier

linux command elapse time

–q(time) command

output can be a bit confusing.

Q: does it work with redirection?

–$SECONDS variable

std::lock_guard simple idiom]cod`IV

Since this data type is designed as a stack object i.e. scoped variable, simply declare (and initialize) this object within curly brackets. When this variable goes out of scope, the object behind the variable gets destructed.

In fact, the compiler sets up this sequence of events.

std::lock_guard<std::mutex> aLocalLock(sharedMutex);
someMutableShared ++;

Ajax on java: nothing fancy

As of 2019, I found a 2007 book [[ ajax on java: The Essentials of XMLHttpRequest and XML Programming with Java ]]. Some basic observations:

  • ajax works with tomcat servlet engine
  • ajax works with JSF and GWT but not so popular nowadays.

There’s also a 2006 book [[Practical Ajax Projects with Java Technology]], so I guess java+ajax integration was most prominent in 2007 !java

## Y in sg(^U.S.)u can’t be developer till 65,succinctly

In the US, at 65 you could work as a developer. (Actually that’s not the mainstream for most immigrant techies. What do they do? Should ask Ed? Anirudh? Liu Shuo, ZR…)

Why SG is different? Here’s my answer, echoing my earlier posts.

  1. US culture (job market, managers…) has a tradition of being more open to older techies
  2. US culture respects technologists. Main street techies get paid significantly higher than SG main street techies
  3. high-end (typical VP-level) technical work – more comon to get in the US than SG, partly because wage premium is smaller, like 100k -> 150k

new-expression alloates AND invokes ctor…most of the time

Neither the allocation or the ctor step is universal and guaranteed to happen.

1) ctor

If allocation fails then no “raw memory” is there to initialize.

new int; // This won’t invoke any ctor since “int” is not a class/struct and doesn’t have a ctor.

2) allocation

It’s possible to bypass the allocation, if you use placement-new

equivalent FX(+option) trades, succinctly

The equivalence among FX trades can be confusing to some. I feel there are only 2 common scenarios:

1) Buying usdjpy is equivalent to selling jpyusd.
2) Buying usdjpy call is equivalent to Buying jpyusd put.

However, Buying a fx option is never equivalent to Selling an fx option. The seller wants (implied) vol to drop, whereas the buyer wants it to increase.

socket stats monitoring tools – on-line resources

This is a rare interview question, perhaps asked 1 or 2 times. I don’t want to overspend.

In ICE RTS, we use built-in statistics modules written in C++ to collect the throughput statistics.

If you don’t have source code to modify, I guess you need to rely on standard tools.

how effective is %%resume]NY^sg fin IT

* On Wall St, the battlefield is  the tech IV.
* In Singapore financial IT, the battlefield is the resume. I feel 70% of the battle is on the resume.

In Singapore, I have tried 3 times (2011, 2014 and 2015). Much fewer “senior” roles (i.e. 150k+); much lower chance of shortlist. I believe the competitors are way too many for each role.

On Wall St, Many friends (XR, YH, Mithun etc) all get many interviews without too much effort.


order routing inside an exchange #Nsdq

A major stock exchange’s job spec mentions that the exchange actually runs an internal SOR.

I thought only trading houses maintain SOR. Now I know that when an order (probably a market order) hits a stock exchange NDQ, NDQ may need to route it somewhere else. Not sure if it’s internal or external destination.

My friend Alan said all 13 U.S. exchange receive a live NBBO feed. Based on that, ABC can “forward” the order to the “best” venue.

[14]JGC tuning: will these help@@

Q: have a small nursery generation, so as to increase the frequency of GC collections?

AA: An optimal nursery size for maximum application throughput is such that as many objects as possible are garbage collected by young collection rather than old collection. This value approximates to about half of the free heap.

Q: have maximum heap size exceeding RAM capacity.
A: 32-bit JVM won’t let you specify more than 4G even with 32 GB RAM. Suppose you use 64-bit JVM, then actually JVM would start and would likely use up all available RAM and starts paging.

test if a key exists in multimap: count() perf tolerable

map::count() is simpler and slightly slower 🙂 than map::find()

Even for a multimap, count() is slower but good enough in a quick coding test. Just add a comment to say “will upgrade to find()”

In terms of cost, count() is only slightly slower than find(). Note multimap::count() complexity is Logarithmic in map size, plus linear in the number of matches… Probably because the matching entries live together in the RB tree.

c++mem mgmt: MSDN overview

http://msdn.microsoft.com/en-us/library/vstudio/hh438473.aspx is a brief but decent overview, illustrated with code snippets. I re-read in Nov 2017.

  • nonref variables as data members ! Default dtor is good enough 🙂
  • vector as data member ! Default dtor is good enough 🙂
  • (MSDN didn’t mention) shared_ptr and unique_ptr as data member. Default dtor is good enough 🙂
  • Heap-based vs stack-based — different techniques
  • c++ mem mgmt — “designed” for stack-based, so prefer stack as far as possible.
  • “observing” vs “owning” a heap object. An observing raw ptr can dangle but will never leak.


http://en.wikipedia.org/wiki/Multicast shows(suggests?) that broadcast is also time-efficient since sender only does one send. However, multicast is smarter and more bandwidth-efficient.

IPv6 disabled broadcast — to prevent disturbing all nodes in a network when only a few are interested in a particular service. Instead it relies on multicast addressing, a conceptually similar one-to-many routing methodology. However, multicasting limits the pool of receivers to those that join a specific multicast receiver group.

4 Scopes for operator-overloads #new()

  1. non-static member operators are very common, such as smart ptr operator++(), operator< () in iterators
  2. static member operator new() is sometimes needed. ARM explains why static.
  3. friend operators are fairly common, such as operator<<()
  4. class specific free standing operator is recommended by Sutter/Andrei, to be placed in the same namespace as the target class. Need to understand more. Advanced technique.

select() syscall and its drv

select() is the most “special” socket kernel syscall (not a “standard library function”). It’s treated special.

– Java Non-blocking IO is related to select().
– Python has at least 3 modules — select, asyncore, asynchat all built around select()
– dotnet offers 2 solutions to the same problem:

  1. a select()-based and
  2. a newer asynchronous solution to the same problem

static object initialization order #lazy singletons

Java’s lazy singleton has a close relative among C++ idioms, comparable in popularity and implementation.

Basically, local statics are initialized upon first use, exactly. Other static Objects are initialized “sometime” before main() starts. See the Item on c vs c++ in [[More eff c++]] and P222 [[EffC++]].

In real life, this’s hard to control — a known problem but with standard solutions — something like a lazy singleton. See P 170 C++FAQ. Also addressed in [[EffC++]] P221.