MOM^sharedMem ring buffer^UDP : mkt-data transmission

I feel in most environments, the MOM design is most robust, relying on a reliable middleware. However, latency sensitive trading systems won’t tolerate the additional latency and see it as unnecessary.

Gregory (ICE) told me about his home-grown simple ring buffer in shared memory. He used a circular byte array. Message boundary is embedded in the payload. When the producer finishes writing to the buffer, it puts some marker to indicate end of data. Greg said the consumer is slower, so he makes it a (periodic) polling reader. When consumer encounters the marker it would stop reading. I told Gregory we need some synchronization. Greg said it’s trivial. Here are my tentative ideas —

Design 1 — every time the producer or the consumer starts it would acquire a lock. Coarse-grained locking

But when the consumer is chipping away at head of the queue, the producer can simultaneously write to the tail, so here’s

Design 2 — the latest message being written is “invisible” to the consumer. Producer keeps the marker unchanged while adding data to the tail of queue. When it has nothing more to write, it moves the marker by updating it.

The marker can be a lock-protected integer representing the index of the last byte written.

No need to worry about buffer capacity, or a very slow consumer.

MOM UDP multicast or TCP or UDS shared_mem
how many processes 3-tier 2-tier 2-tier
1-to-many distribution easy easiest doable
intermediate storage yes tiny. The socket buffer can be 256MB yes
producer data burst supported message loss is common in such a situation supported
async? yes yes, since the receiver must poll or be notified I think the receiver must poll or be notified
additional latency yes yes minimal

job4accu@20Y: which past job was closest2ideal

Q: do I want to deepen and specialize in one single area, rather than diversify? skill: deepen^diversify^stack up provides a context
A: let’s assume answer is yes deepen.

Q: which area?
A: programmatic data analysis in finance … with two entry barriers in math and coding, with reasonable “moneyness” but market depth is poor.
A: financial enterprise app with multi-threading …?
A: high speed market data engine in TCP/multicast?
A: orderbook replication with retransmission?
A: bond math including IRS?
A: option math? Poor market depth

Once we figure out which domain is good enough (very few), then we need a good boss…

So which past job is most likely to offer me that professional growth prospect?

  • 95G — java threading for trade execution
  • Barcap, but market depth is poor for that skill
  • citi?
  • RTS? but I get NO exposure to the raw sockets and I’m not allowed to ask!

DeMorgan’s law in boolean algebra

{\displaystyle {\begin{aligned}{\overline {A\cup B}}&={\overline {A}}\cap {\overline {B}},\\{\overline {A\cap B}}&={\overline {A}}\cup {\overline {B}},\end{aligned}}}

Note the perfect symmetry 🙂

The symmetry makes it easy to remember the pair, but the two laws are actually related. You can easily derive one from the other if you denote A’ as C … and B’ as D.

Starting from first rule

(A or B)’ = A’ and B’ # now negate both sides
A or B = (A or B)” = (A’ and B’)’ = (C and D)’ # now rewrite everything in C/D
C’ or D’ = (C and D)’ # which is the same as the 2nd rule 🙂

Tick query: design notes #GS

— to support -Oprint —
For each symbol, we will use a vector to hold Tick objects (or shared_ptr thereof). Lookup will use a modified binary search — In the case of an exact hit, we need to scan backwards (for the starting timestamp) until we see a smaller value than the target. In the case of a search miss, we know the target timestamp value is between 2 vector items, and no scanning is needed.

I decided against a multiset due to larger memory footprint. Also a red-black tree would require lots of re-balancing when the input data stream is already sorted.

This vector will support product-query, but not optimally.

Trade-off: this optimization mode saves memory, speeds up print-queries but (the less frequent) product-queries are slower.

— to support -Oproduct —
In addition to the data structure of -Oprint, under each symbol we will have 11 additional vectors assuming there are 11 fields.

For Field1, the vector will hold 98700 addresses assuming there are that many records having Field1. Each address is a Tick object that has Field1 populated.

For Field2, let’s assume the vector is longer.

If a product query specifies Field1 and Field2, engine first identifies which of the two vectors is shorter. In our case, Field1 vector is chosen, so we iterate over it. (Note we don’t need Field2’s vector.) For each item, we check if Field2 is present (low-cost) and produce a product.

Before iteration, we need to identify the start and end items in the vector. This is the same modified-binary search as in -Oprint.

Suppose Line 1 has 3 fields, Line 2 has 5 fields, Line 3 has 1 field … The additional space requirement in this design is proportional to the total count of fields (3+5+1+…) present in the tick file.

To support print-query, in theory we could iterate each of the 11 vectors under the target symbol. However, we need a place to store the Tick objects. The data structure of -Oprint is an efficient solution, so I see no reason to abandon it and invent another. This data structure already supports faster print-query.

Trade-off: In this optimization mode, we support faster product-query, but use more memory. Also, the initial file processing is slower.

— to support -O2product (not implemented) —
If we know in advance that total distinct field count is low like 10, then there are only 10-choose-2 i.e. 45 pairs. We could support a new -O2product optimization mode.

Instead of 10 vectors, we need 45 vectors of floats, for the 45 combinations.

If an incoming Tick object has 4 fields, we will iterate over the 4-choose-2 i.e. 6 pairs of fields. Each pair is always one of the 45 possible combinations, so we will pick one of the 45 vectors, and push_back the product of the two fields. Since we have 6 pairs within this Tick object, we update exactly 6 of the 45 vectors.

Once the tick file is fully processed, we have 45 vectors to support fast product-query. To select one of 45 vectors, we use a hash table keyed by the two target field names (in sorted order) concatenated. Once we select the one vector, we will use the same modified binary search then iterate. Each element in the iteration is now guaranteed to contain both fields, so there’s no wasted iteration as in -Oproduct.

Print-query would need the data structure in -Oprint.

Trade-off: This mode uses even more memory but further speeds up product-queries. The additional space is proportional to the total count of field-pairs present in the tick file.

–use of shared_tr —
Memory footprint of a shared_ptr is about twice of a raw ptr, and much smaller than the footprint of a string. There are millions of field names (and also many Tick objects) in my data structures. In fact, the copy operations are disabled for Field (and Tick) classes, which are essentially immutable objects.

It was an obvious memory optimization to use pointers to avoid allocation for the same string twice. However, there’s question whether to store shared_ptr or raw ptr in the large data structures.

I chose raw pointer.

–duplicate copies of field strings or symbol strings —

If there are a million ticks, averaging 5 fields each, there would be 5 million field names to store, but only 50 distinct fields! To save memory, I allocate memory for each globally unique field name once only. We end up with 50 Field objects referenced by many shared_ptr objects.

For lookup, a hash table is possibly faster (if map is large) but not really, according to some benchmark tests. Our map size is assumed to be small, like 100.

Symbol strings are also repeating, but by design, my data structures do not hold many duplicates for a symbol string, because symbol is only a lookup key.

–Improvement feedback from friends

  • avoid using namespace std
  • avoid virtual

af 70,non-profit OK; voluntary work no

After my prime years, when I can only work half the time, I may be able to work towards some meaningful cause, but not completely voluntary work. If there’s no income, I will have low motivation to continue.

With a salary, I feel more commitment, more responsibility.

In our later years, my wife and I also have a non-trivial financial need. I don’t want to depend on my kids or welfare to support ourselves. I may have to continue my drive for more income.

sorted vector^multiset for tick query

I see no advantage to using a tree.

Note the key (timestamps) is repeating, so only multiset is valid.

Space — vector is more compact leading to better cache affinity. There’s a price to pay — Vector should use reserve() to prevent reallocation, if possible. Tree doesn’t need it.

Query by lower_bound/upper_bound — I guess both are similar O(log N)

Initial insertions — Remember our data are ticks from an exchange loaded from a large file.

  1. 1. For one insert (assuming no re-balancing no reallocation), vector is O(1) but tree is O(log N) 😦
  2. 2. If data comes in randomly, then tree re-balancing is less frequent but more than once. For vector, it’s a single quicksort after all insertions. I would say vector is faster mostly due to the previous factor
  3. 3. if data comes in (almost) by ascending timestamp, then vector requires no or minimal sorting, but the tree would require frequent re-balancing — presorted data is probably the worst input sequence for one-by-one-insertion.
  4. 4. If no re-balancing is done, then the tree would have depth = N, and query would be linear search 😦

memset: a practical usage #Gregory

  • memset is a low-level C function.
  • memset takes a void pointer.
  • Fast and simple way to zero out an array of struct, having primitive data members. No std::string please. No ptr please. Use sizeof to get the byte count.
  • Useful in low level wire coding
// illustrates packed and memset
#include <iostream>
using namespace std;

struct A{
  unsigned int i1; //4 bytes
  bool b; //1 byte
  char cstr[2];
  int* ptr; //8 bytes
} __attribute__((packed));
size_t const cnt = 3;
A arr[cnt];
int main(){
  size_t sz = sizeof(arr);
  memset(arr, 0, sz);
  for(size_t i=0; i<cnt; ++i){
    A* tmp = &arr[i];
    cout<<"i1 = "<<tmp->i1<<"; b = "<<tmp->b<<" ; cstr[1] = "<<(int)tmp->cstr[1]<<" ptr = "<<tmp->ptr<<endl;

Nomura C++quant IV

Q: vs operaotr[]?
%%A: operator[] throws exception only if you specify a compiler flag like D_GLIBCXX_DEBUG. I think this is correct

Q: given a vector or ints, how would you sort it by abs value?
%%A: free function
%%A: subclass “less” with operator(). body is same as the free function

Q: clustered index vs non-clustered?

Q: inputs to price an equity option?
%%A: 5 inputs: K, S, T, r, dividend rate. I missed “implied volatility”, which is a soft market data derived from a raw market data.

Q: inputs to a mortgage prepayment model (or callable bond)? ….
%%A: Black model to price a callable bond. Inputs include yield, maturity, interest rate, but there’s no strike per-se? I think the implied volatility (in which random process?) is the most important factor and needs to be backed out from some market data

Q: how did you construct your yield curves at Barcap?

Q: what’s local vol
A: the instantaneous volatility is a deterministic function of 2 inputs. Deterministic because there’s no dB … See other blog posts.

Q: zero-volatility spread?

Q: how do you compute OAS?
%%A: I think OAS is a curve.

Q: how about dv01?

Q: duration of a 10Y T-bond vs 10Y muni bond? I answered with the (more “visual”) Macaulay duration, which is slightly different from the (most useful) modified duration

GS c++/dataStruct IV

Q1: what’s the problem to be solved by virtual inheritance
A: diamond..

Q2: consider such a diamond with A as grandparent, B/C virtually deriving from A, then grandchild D. Suppose class C has a foo() method that uses a field A.a. Now, within C, the offset of A.a depends on the inheritance hierarchy, so how does the compiler compile

A: I do see a problem here. In this ABCD case, the offset of A.a is one value, depending on the size of B. However, in a ABJKCD case (J/K virtually subclass A), the offset of A.a will be a different value, depending on the sizes of B/J/K. So the method will compile to different object code! Yet the compiler has a single version of

The trick is the offset of the A sub-object within C. This offset value can be saved in the vtable of C.

Note if there’s no virtual inheritance, then simpler. Within C, the A sub-object’s offset is a constant. No need to look up vtable.

Q3: appending a million integers to a list vs a vector, which is faster and by how much?
%%A: vector requires reallocation, roughly 20 times, assuming doubling each time, much fewer than 1M allocations for the list.  The reallocation entails element-by-element copy, but that’s cheaper than allocations. If I must estimate how much faster, I need to estimate how many allocations and how many copying:

  • for list, 1M allocations + 1M copying
  • for vector, 20 allocations + about 2M copying

%%A: I still think allocation is a hotspot and much more expensive than copying, perhaps by 10 or 100 times.

%%A: of course, you can pre-allocate for vector, but for this problem we ignore that feature.

Q4: deque — known for fast append/prepend, and random access. How?
%%A: I think it’s implemented as segments of vectors, with the front and back segments not fully occupied and growable. Once one of them reaches a certain size, a new segment is created. RandomAccess is implemented by identifying the host segment. To optimize, segments should have equal length.

Q4b: what if we want to support fast midstream insertion?
%%A: this is a vector limitation inherited by deque. If we need to support it, we may need to allow unequal segment lengths. RandomAccess is now slightly slower. For a given subscript 16, we need to maintain a step function like

  • 5-10 in Segment 2
  • 11-13 in Segment 3
  • 14-20 in Segment 4
  • 21-30 in Segment 5
  • 31-32 in Segment 6
  • … so each time any segment expands, the step function needs an update
  • Once we have the step function, a binary search will do, at log(S) where S = Segment count

WordPad : rich-text note taker

I found WordPad RTF files a good middle ground between ascii text and MSWord files.

  • 🙂 basic text effects like background+font colors
  • 🙂 footprint is comparable to ascii, much smaller than MSWord files.
    • Q: How about after compression?
  • 🙂 how about copying to Linux? tested

Verdict —

  1. default general-purpose note-taker remains ascii, but
  2. for office work notes with text highlighting, let’s try WordPad more often. But let’s not become dependent. Convert old rtf notes to ascii.

select^poll # phrasebook

Based on, which I respect.

  • descriptor count — up to 200 is fine with select(); 1000 is fine with poll(); Above 1000 consider epoll
  • time-out precision — poll/epoll has millisec precision. select() has nanosec, a million times higher precision, but only embedded devices need such precision.
  • single-threaded app — poll is just as fast as epoll. epoll() excels in MT. has sample code on poll().

a profession with Enjoyment+income+barrier

Update — Enjoyment is often short-lived and affected by too many factors like

  • respect by mgr #bonus
  • “strategic” learning #Barclays
  • commute
  • income cf to peers

Instead of Enjoyment, I often prefer “engagement”.

However, for both enjoyment and engagement, the signal-to-noise ratio is below 0.2 i.e. those “noise factors” above are at least 5 times stronger.


Background — discussion with an older fellow developer about helping our high-school kids select a major.

Example — Look at Aunt Gennifer. She changed her profession around age 40 to find a better paying and less boring job.

Example — grandpa has an ever-green interest in his research field, but most researchers in this field are not well-paid (it’s the least commercial field of research). In terms of income, I think grandpa’s medical and pension benefits are in the the top 1%.

Example — me. I entered trading-engine-dev circa 2007.

  • It pays well above the average professional salary across all professions. (I was told U.S. family doctors earn about 150k only.)
  • Market depth is excellent — look at my post on NBA salary.
  • The constant pressure to upgrade and learn is perfectly acceptable to me. In fact, I seek opportunities to maintain my learning pace
  • I don’t find it tiring or boring. For a few years before 2007, I told myself “I don’t want to remain a developer for life” but had a U-turn.

For most people, it’s hard to find a profession that’s not boring, not too tiring, and pays well, a field you could keep plowing till retirement, if you ever retire.

In fact, such a field is likely to be /oversubscribed/ — too many hopeful new entrants but few would get in and fewer would survive 😦

Example — Lianzhong’s daughter Borong loves writing, but she realized it wouldn’t be an easy way to make a living. Similarly, a large number of individuals have a meaningful and rewarding “hobby” but can’t make a decent income from it

  • tweaking with computers and gadgets
  • visual arts # including photography
  • literary arts
  • performing arts #including music
  • sports, games #including board and electronic
  • cooking

field destruction]class dtor and leaks: string,vector,shared_ptr has some scenarios that are realistic and practical.

1) if you have a local std::string instance behind a variable mystr, then dtor of mystr will clean up the heap memory that was allocated by the mystr ctor. This is one of the most basic usages of std::string. There’s No heap memory leak in this simple scenario.

1b) Ditto for a std::vector.

2) If you have a vector of raw pointer, then be careful. If you used new() for each raw ptr, then you need to call delete() otherwise memory would leak.

2b) shared_ptr is a great replacement. No memory leak even if you don’t call delete()

3) if your class C has a field as a raw ptr to A, then the C dtor will run the default dtor of the ptr. As Scott Meyers said, that default dtor is a no-op. Therefore by default memory leaks. You need to manually delete()

3b) smart-ptr-to-A as a field is a clean, recommended solution.

TimeInForce [59] FIX

0 = Day (or session)
1 = Good Till Cancel (GTC)
2 = At the Opening (OPG)
3 = Immediate or Cancel (IOC)
4 = Fill or Kill (FOK)
5 = Good Till Crossing (GTX)
6 = Good Till Date
7 = At the Close

Many of these values are interesting. Most confusing is IOC ^ FOK. The IOC order can (but does not have to) be filled in full. If it cannot be filled immediately or fully, the pending part gets canceled (a partial execution is generated) by the market. It differs from a fill or kill order in that in fill or kill orders, partial executions are not accepted.

demo: static method declare/define separately n inherited

Low complexity in this demo, but if you lack a firm grip on the important details here, they will add to the complexity in a bigger code base.

  • When subclass is compiled, compiler complains about undefined sf() since it sees only the declaration. You need “g++ der.cpp base.cpp”.
  • Note the static method is inherited automatically, so you could call der::sf().
#include <iostream>
struct base{
  static void sf();
///////// end of header /////////
#include "base.h"
using namespace std;
void base::sf(){ // no "static" please
///////// end of base class /////////
#include "base.h"
using namespace std;
struct der: public base{};
int main(){
///////// end of sub class /////////

cross-currency equity swap: %%intuition

Trade 1: At Time 1, CK (a hedge fund based in Japan) buys one share of GE priced at USD 10, paying JPY 1000. Eleven months later, GE is still at USD 10 which is now JPY 990. CK faces a paper loss due to FX. I will treat USD as asset currency. CK bought 10 greenbacks at 100 yen each and now each greenback is worth 99 yen only.

Trade 2: a comparable single-currency eq-swap trade

Trade 3: a comparable x-ccy swap. At Time 1, the dealer (say GS) buys and holds GE on client’s behalf.

(It is instructive to compare this to compare this to Trade 2. The only difference is the FX.)

In Trade 3, how did GS pay to acquire the share? GS received JPY 1000 from CK and used it to get [1] 10 greenbacks to pay for the stock.

Q: What (standard) solutions do GS have to eliminate its own FX risk and remain transparent to client? I think GS must pass on the FX risk to client.

I think in any x-ccy deal with a dealer bank, this FX risk is unavoidable for CK. Bank always avoids the FX risk and transfer the risk to client.

[1] GS probably bought USDJPY on the street. Who GS bought from doesn’t matter, even if that’s another GS trader. For an illiquid currency, GS may not have sufficient inventory internally. Even if GS has inventory under trader Tom, Tom may not want to Sell the inventory at the market rate at this time. Client ought to get the market rate always.

After the FX trade, GS house account is long USDJPY at price 100 and GS want USD to strengthen. If GS effectively passes on the FX risk, then CK would be long USDJPY.

I believe GS need to Sell USDJPY to CK at price 100, to effectively and completely transfer the FX risk to client. In a nutshell, GS sells 10 greenbacks to CK and CK uses the 10 greenbacks to enter an eq-swap deal with GS.

GS trade system probably executes two FX trades

  1. buy USDJPY on street
  2. sell USDJPY to CK

After that,

  • GS is square USDJPY.
  • CK is Long USDJPY at price 100. In other words, CK wants USD to strengthen.

I believe the FX rate used in this trade must be communicated to CK.

Eleven months later, GS hedge account has $0 PnL since GE hasn’t moved. GS FX account is square. In contrast, CK suffers a paper loss due to FX, since USD has weakened.

As a validation (as I instructed my son), notice that this outcome is identical to the traditional trade, where CK buys USDJPY at 100 to pay for the stock. Therefore, this deal is fair deal.

Q: Does GS make any money on the FX?
A: I don’t think so. If they do, it’s Not by design. By design, GS ought to Sell USDJPY to client at fair market price. “Fair” implies that GS could only earn bid/ask spread.

gdb cheatsheet

#1 tip:
gdb –tui # starts an upper screen showing a moving marker against the current line. If you started gdb without “–tui”, you can still use Ctrl-X Ctrl-A.

#2 tip:
After hitting a breakpoint, you can use q(n) i.e. q(next) and just Enter to repeat previous q(n).

I found it easier to use q(break filename:lineno) # Filename doesn’t need full path J

Other essential commands

  • · info break # or info b to list all breakpoints
  • · bt # or backtrace
  • · frame 0
  • · list
  • · p for print
  • · info variables
  • · info locals
  • · info args
  • · n # for step-over
  • · s # for step-in
  • · fin # for step-out
  • ctrl-L toclear gdb command screen

declare^define: additional complexity@c over java/c#

It slowly dawned on me that a big share of c++ programming headaches (confusions, compiler/linker errors, makefile complexity) stem from one basic design of C — names need to be declare to the “world”, and separately defined.

This design gives rise to header files.

Variables/objects vs functions vs custom classes vs templates have different rules.

I think only objects (including static fields) obey this particular rule: the definition allocates (non-stack) memory.

I think class instance fields are completely different. See

I think functions are completely different. I blogged about function ODR —

Pimpl is one of many issues.

more MCQ questions on core java skills #SGA GregR

You can find sample MCQ questions in Brainbench, Select2perform and IKM. Let me know if you want me to help you review and select some good questions — not obscure, not ambiguous…

Below are my own inventions, including those I sent before. (So you can discard the last mail.)

For candidates — Please pick the most likely answer for the most typical case, even if the question is not 100% clear. You can ignore corner cases.

Q1: which command produces java bytecode [correct answer is marked with *]

A) java (lower case)

B) * javac (lower case)

C) jar (lower case)

D) jvm (lower case)

E) None of the above

Q1b: which command executes the bytecode

Q1c: which command is the java debugger

Q1d: which command could launch the JVM
Q1e: which command can read but not execute java bytecode

Q2: which among these is a modifier on a class

A) finally

B) finalize

C) * final

Q2b: which is actually the name of a well-known method? [B]

Q2c: which is a keyword only used inside methods? [A]

Q3: which keywords are legal for a method? (multiple correct answers)

A) * synchronized

B) virtual

C) * final

D) mutable

E) * abstract

F) const

G) override

H) deprecated

I) volatile

J) atomic

Q3b: which keyword is related to concurrency (all lower case, multiple correct answers) [ A/I ]

Q3c: which ones are not java language keywords [Advanced question. B/D/F/G/H/J]

Q4: do I need at least one “import” to use basic java threads? [Answer is no]

Q4b: how about java collections?

Q4c: how about java generics, assuming I avoid generic collections

Q5: if your java app creates no thread, will there be any other thread beside the main thread, in a typical java app? [Answer is yes]

Q11: if I have a typical java application’s bytecode compiled under 32-bit, can I use it in a 64-bit JVM?

A) * Usually yes, sometimes with minor changes.

B) Usually no, as it requires recompilation under 64-bit

Q11b: how much more memory can the 64-bit java use compared to 32-bit, in theory?

A) About double

B) * Much, much more

C) About the same.

Q11c: how much faster will the 64-bit jvm run that same bytecode, compared to the 32-bit jvm, assuming it needs minimal memory? [Advanced]

A) about double or higher

B) much, much faster

C) * About the same

Q12: which download has more content — jdk or jre? [ answer is jdk]

Q13: does garbage collection slow down a typical java server on a single-processor machine?

A) yes. Clearly Noticeable and very significant

B) * slightly

C) Should have no impact

Q13b: what if the app uses very little memory? [Answer is C]

Q13c: what if the app creates a huge amount of short-lived objects? [Advanced. Answer is B]

Q13d: what if the app creates a fair but not excessive amount of long-lived objects? [Advanced. Answer is B]

Q14: have you used any java debuggers? If yes, can you see the content of protected fields?

Q14b: can you see the content of protected static fields?

Q14c: can you modify its value using the debugger? [Advanced question. answer is yes]

c++11 class removing the 4 copy operations

There are at least four main copy operations
a) copy-ctor
a2) move-ctor
b) operator=
b2) move-operator=

Each one can be declared “=delete”. In, the author says you can have four combinations:
1) all_enabled
2) movable_only
3) copyable_only — rare. A meaningless design, according to the author.
4) all_deleted — most restricted, most locked-down

Some important library classes are movable_only, such as …. unique_ptr, and some thread objects

c++11,sockets,IPC in QnA IV #Ashish

[18] Update — This has become one of the more  valuable posts in my c++ blog. It cuts down the amount of info overload and learning curve steepness by half then again by half.

Hi Ashish,

I now have a little theory on the relative importance of several c++ tech skills in a job candidate. I feel all of the skills below are considered “secondary importance” to most of the (15 – 30) interviewers I have met. These skill are widely used in projects, but if we say “no experience” in them, BUT demonstrate strength in core c++ then we win respect.

[e=c++ecosystem topics]
[c=core c++topics]

—#2 [e] c++threading —— many job descriptions say they use threading but it’s probably very simple threading. Threading is such complex topic that only the well-reviewed proven designs are safe to use. If a project team decides to invent their concurrency design ( I invented once ) , and have any non-trivial deviation from the proven designs, they may unknowingly introduce bugs that may not surface for years. So the actual threading usage in any project is always minimal, simple and isolated to a very small number of files.

The fastest systems tend to have nothing shared-mutable, so multi-threading presents zero risk and requires no design. No locking no CAS. Essentially single-threaded.

However, we don’t have the luxury to say “limited experience” in threading. I have a fair amount of concurrency design experience across languages and thread libraries (pthreads, ObjectSpace, c#), using locks+conditional variables+CAS as building blocks, but the c++ thread library used in another team is probably different. I used to say I used boost::thread a lot but it back-fired.

— design patterns ——- never required in coding tests. In fact, basically no OO features show up in short coding tests. 90% of short coding tests are about GettingThingsDone, using algorithms and data structures.

Design patterns do show up in QnA interviews but usually not difficult. I usually stick to a few familiar ones — Singleton, Factory, TemplateMethod, ProducerConsumer. I’m no expert with any of these but I feel very few candidates are. I feel most interviewers have a reasonable expectation. I stay away from complex patterns like Visitor and Bridge.

—#1 [c] c++11 —— is not yet widely used. Many financial jobs I applied have old codebases they don’t want to upgrade. Most of the c++11 features we use as developers are optional convenience features, Some features are fundamental (decltype, constexpr …) yet treated as simple convenience features. I feel move semantics and r-value references are fairly deep but these are really advanced features for library writers, not application developers. Beside shared_ptr, C++11 features seldom affect system design. I have “limited project experience using c++11“.

Interviewers often drill into move semantics. If I admit ignorance I could lose. Therefore, I’m actively pursuing the thin->thick->thin learning path.

—#5 [e] Boost —— is not really widely used. 70% of the financial companies I tried don’t use anything beyond shared_ptr. Most of the boost features are considered optional and high-level, rather than fundamental. If I tell them I only used shared_ptr and no other boost library, they will usually judge me on other fronts, without deducting points. I used many boost libraries but only understand shared_ptr better.

Q: Are there job candidates who are strong with some boost feature (beside shared_ptr) but weak on core c++?
A: I have not seen any.

Q: Are there programmers strong on core c++ but unfamiliar with boost?
A: I have seen many

—#4 [c] templates —— another advanced feature primarily for library developers. Really deep and complex. App developers don’t really need this level of wizardry. A few experienced c++ guys told me their teams each has a team member using fancy template meta-programing techniques that no one could understand or maintain. None of my interviewers went very deep on this topic. I have only limited meta-programming experience, but I will focus on 2 common template techniques — CRTP and SFINAE and try to build a good understanding (thin->thick->thin)

—#6 [c] IKM —— is far less important than we thought. I know many people who score very high but fail the tech interview badly. On the other hand, I believe some candidates score mediocre but impress the interviewers when they come on-site.

—#3 [e] linux system programming like sockets —— is relevant to low-latency firms. I think half the c++ finance firms are. Call them infrastructure team, or engineering team, or back-end team. I just happened to apply for too many of this kind. To them, socket knowledge is essential, but to the “mainstream” c++ teams, socket is non-essential. I have “modest socket experience” but could impress some interviewers.

(Actually socket api is not a c/c++ language feature, but a system library. Compared to STL, socket library has narrower usage.)

Messaging is a similar skill like sockets. There are many specific products so we don’t need to be familiar with each one.

InterProcessCommunication  (Shared memory, pipes… ) is a similar “C” skill to sockets but less widely required. I usually say my system uses sockets, database or flat files for IPC, though shared memory is probably faster. I hope interviewers don’t mind that. If there’s some IPC code in their system, it’s likely isolated and encapsulated (even more encapsulated than threading code), so hopefully most developers don’t need to touch it

Memory Manager is another specialized skill just like IPC. Always isolated in some low-level module in a library, and seldom touched. A job spec may require that but actually few people have experience and never a real show stopper.

If a role requires heavy network programming (heavy IPC or messaging development is less common) but we have limited experience, then it can be a show-stopper.

These topics are never tested in short coding questions, and seldom in longer coding questions.

What c++ knowledge is valued more highly? See ##20 C++territories for QQ IV

collection-of-abstract-shape: j^c++

In java, this usage pattern is simple and extremely common — Shape interface.

In c++, we need a container of shared_ptr to pure abstract base class 😦

  • pure abstract interface can support MI
  • shared_ptr supports reference counting, in place of garbage collection
  • pointer instead of nonref payload type, to avoid slicing.

This little “case study” illustrates some fundamental differences between java and c++, and showcases some of the key advantages of java.

##[11] data feed to FX pricer #pre/post trade, mid/short term

(see blog on influences on FX rates)

Pricing is at heart of FX trading. It’s precisely due to the different pricing decisions of various players that speculation opportunities exist. There are pricing needs in pre/post trade, and pricing timeframes of long term or short term

For mark to market and unrealized PnL, real time market trade/quote prices are probably best, since FX is an extremely liquid and transparent market, except the NDF markets.

That’s post trade pricer. For the rest of this write-up let’s focus on pre-trade pricer of term instruments. Incoming quotes from major electronic markets are an obvious data source, but for less liquid products you need a way to independently derive a fair value, as the market might be overpriced or underpriced.

For a market maker or dealer bank responding to RFQ,
– IRS, bond data from Bloomberg
– yield spread between government bonds of the 2 countries. Prime example – 2-year Bund vs T-note
– Libor, government bond yield. See
– Depth of market
– volume of _limit_orders_ and trades (It’s possible to detect trends and patterns)
– dealer’s own inventory of each currency
– cftc COT report. See
– risk reversal data on FXCM. See

For short term trading, interest rate is the most important input to FX forward pricing — There’s a separate blog post. Other significant drivers must be selected and re-selected from the following pool of drivers periodically, since one set of drivers may work for a few days and become *obsolete*, to be replaced  by another set of drivers.
– yield spread
– T yields of 3 month, 2 year and 10 year
– Libor and ED futures
– price of oil (usually quoted in USD). Oil up, USD down.
– price of gold

For a buy-and-hold trader interested in multi-hear long term “fair value”, pricers need
– balance of trade?
– inflation forecast?
– GDP forecast?

LRU cache #Part 2 c++#LinkedHashMap

Java offers LinkedHashMap-based solution. Below is a simple c++ version written by

    • Basic idea is nice — list as primary storage; and map value points to list nodes.  Therefore, By design, the lookup operation never inserts new nodes.
      • (Even if I can think of the same idea, I would be slow writing up this code in real time.)
    • This code showcases a handful of useful methods of list and map. For example,
    • list.splice() : Transfers the element pointed to by 3rd arg (itr), from 2nd arg (another list), into *this. The element is inserted before the element pointed to by 1st arg (itr)
    • To reproduce this code in real time, you need to memorize list.splice()! I don’t have the time.
    • The map value is an iterator — essential!
class LRUCache{
    size_t m_capacity;
    unordered_map<int,  list<pair<int, int>>::iterator> m_map; //m_map_iter->first: key, m_map_iter->second: list iterator;
    list<pair<int, int>> m_list;                               //m_list_iter->first: key, m_list_iter->second: value;
    LRUCache(size_t capacity):m_capacity(capacity) {
    int get(int key) {
        auto found_iter = m_map.find(key);
        if (found_iter == m_map.end()) //key doesn't exist
            return -1;
        m_list.splice(m_list.begin(), m_list, found_iter->second); //move the node corresponding to key to front
        return found_iter->second->second;                         //return value of the node
    void set(int key, int value) {
        auto found_iter = m_map.find(key);
        if (found_iter != m_map.end()) //key exists
            m_list.splice(m_list.begin(), m_list, found_iter->second); //move the node corresponding to key to front
            found_iter->second->second = value;                        //update value of the node
        if (m_map.size() == m_capacity) //reached capacity
           int key_to_del = m_list.back().first; 
           m_list.pop_back();            //remove node in list;
           m_map.erase(key_to_del);      //remove key in map
        m_list.emplace_front(key, value);  //create new node in list
        m_map[key] = m_list.begin();       //create correspondence between key and node

SFINAE #sizeof#ptr2member as template param is relatively simple, concise. Shows how to test T has method1() is shorter and uses the same sizeof trick. is another illustration

–all 3 resource above use sizeof and function template (not class template) — useful demo of my own code in production, powering the nyse+nyseAmerican real time market data parsers behind most of the biggest financial data sites.

When the compiler evaluates sizeof, which is a compile-time task, it would try one of the 3 func() overloads and check the winner’s return type[1] . Always exactly one of the 3 overloads can compile.

When T is AddRefreshOrderStruct, the compiler tries 1st overload, where AddRefreshOrderStruct needs a long field, and AddRefreshOrderStruct needs a sendTimeNS field. Pass! So the return type is int.

When T is EmptyStruct, the compiler tries the 1st overload, where EmptyStruct needs a long field … failed:( Only the last overload, the default overload, passes.

[1] the size of the return type is used to initialize the static const field!

The asterisk at end of the func declarations is needed as the func() argument will be NULL pointer. NULL pointer can match a pointer to any type.

As CTO, I’d favor transparent langs, wary of outside libraries

If I were to propose a system rewrite, or start a new system from scratch without constraints like legacy code, then Transparency (+instrumentation) is my #1 priority.

  • c++ is the most opaque. Just look at complex declarations, or linker rules, the ODR…
  • I feel more confident debugging java. The JVM is remarkably well-behaving (better than CLR), consistent, well discussed on-line
  • key example — the SOAP stub/skeleton hurts transparency, so does the AOP proxies. These semi-transparent proxies are not like regular code you can edit and play with in a sandbox
  • windows is more murky than linux
  • There are many open-source libraries for java, c++, py etc but many of them affects transparency. I think someone like Piroz may say Spring is a transparent library
  • SQL is transparent except performance tuning
  • Based on my 1990’s experience, I feel javascript is transparent but I could be wrong.
  • I feel py, perl are still more transparent than most compiled languages. They too can become less transparent, when the codebase grows. (In contrast, even small c++ systems can be opaque.)

This letter is about which language, but allow me to digress briefly. For data store and messaging format (both require serialization), I prefer the slightly verbose but hugely transparent solutions, like FIX, CTF, json (xml is too verbose) rather than protobuf. Remember 99% of the sites use only strings, numbers, datetimes, and very rarely involve audio/visual data.

STL+smart_pointer for SQL DTO

Are there any best practice online?

Q1: Say I have a small db table of 10 columns x 100 rows. Keys are
non-unique. To cache it we want to use STL containers. What container?
%%A: multimap or list. unordered_multimap? I may start with a vector, for simplicity. Note if 2 duplicate rows aren’t 100% identical, then multimap will lose data

Q1a: search?
%A: for a map, just lookup using this->find(). For list, iterate using generic find()

Q1c: what if I have a list of keys to search?
%%A: is there an “set_intersect()” algorithm? If none, then I would write my nested iteration. Loop through the target keys, and find() on each.
A: for_each()?

Q1e: how do you hold the 10 col?
%%A: each object in container will have 10 fields. They could be 10 custom data classes or strings, ints, floats. Probably 10 smart pointers for maximum flexibility.

Q1h: what if I have other tables to cache too?
%%A: parametrize the CacheService class. CacheService class will be a wrapper of the vector. There will be other fields beside the vector.

Q1m: how about the data class? Say you have a position table and account table to cache
%%A: either inheritance or template.

with algo cracked,syntax=雕虫小技 but ECT=time-consum`

Background — Let’s focus on short coding tests , compiler-based.

Q: If it takes 2Y to improve from 3 to 6, then how many percent of the effort is syntax?
A: less than 50%, but can be as short as a few days. Note ECT is harder.

In java or python, my coding test track record is better. The syntax know-how required is relatively small, because the questions use only array, hash table, linked nodes, StringBuilder/stringstream… Look at EPI300.

C++ is actually similar. Once the algo (and data structures) is cracked, the syntax know-how required consist of about 500 common operations. I have some of it listed in my blog…

However, once I have a decent idea, it often takes me 20 – 60 minutes to pass the basic tests, but i am given only 30 minutes! Why am I so slow? Not only syntax, but also ECT, even if without debugging the corner cases.