mgr position stress: relationships

A technical or contract role is less complicated, though relationships are also important and can make your life very stressful or relatively easy.

In ## 2 heaviest work stressors, I listed “figure-things-out” as a key stressor — if I’m reasonably fast on this front, then the relationships have limited impact on my stress level at work. Not true for a manager role — even if you get things done fast enough, relationships can still mess up your life.

  • the relationship with the immediate boss is most critical. I had many problems in the past.
  • relationship with other teams. Dependency means … stressful relationship
  • relationship with big bosses
  • relationship with subordinates can also become difficult. Shuo told me it was not for him, but I feel some managers depend on some key subordinates. Dependency means stress.
    • managing a non-performing subordinates … is not easy at all. I could see Kevin had headaches with me.
  • relationship with key business users. I feel Venkat (ICE) is under that pressure.

How RTS achieved 400 to 700 KMPS #epoll

Official benchmark result – 700 KMPS per instance of rebus or parser. Here are some enabling features:

  • ! mostly single-threaded apps. Some older parsers are multi-threaded but each in single threaded mode. No shared mutable:)
    • In the order book replication engine, we use lockfree in one place only
  • ! allocation avoided — on millions of Output message objects. Pre-allocated ring buffer eliminates new()
  • ! allocation avoided — on millions of Input message objects, thanks to reinterpret_cast() on pointers. Zero-copy. See reinterpret_cast^memcpy in raw mktData parsing
  • ! Socket buffer tuning — to cope with busts. 64-256MB receive buffer in kernel; 4MB UserModeBuffer
  • ! epoll — to replace select() with 2 orders of magnitude improve in throughput
  • ! buffering of Output list (of messages)
  • ! Stateless parser, small footprint. Only downstream component (Rebus) handles cancels/busts, modifications.. So we can run many parser instances per machine, utilizing the cpu cores. See
  • Very fast duplicate seq check, without large history data — a hot function
  • large containers are pre-sized. No reallocation
  • mostly array-based data structures — cache-friendly
  • Hot fictions use pbref or RVO or move semantics, minimizing stack allocations
  • aggressively simplified parsing. Minimal logic
  • Special Logger to reduce I/O cost
  • Sharding to speed up symbol lookup
  • kernel bypass : RTS
  • No virtual functions
  • Inline
  • —-Speed with control
  • Best of both to reduce the real risk of our connection becoming slow or lossy
  • Depth monitor to detect missed messages
  • Capacity tracker to monitor mps rates
  • Missed seq reporting
  • [!=one of the really effective techniques]

convert non-null-terminated char-array to std::string

std::string ccy (ptr->ccy, ptr->ccy+3); //using a special string() ctor

my ptr->ccy is the address of a 3-char array, but it’s immediately followed by other chars belonging to another field, in a tightly packed struct without padding. If you simply pass ptr->ccy to string() ctor, your string will take in many extra chars until a null terminator.

check a receiver socket is connected: tcp^udp

Q: given a TCP receiver socket, how do you tell if it’s connected to a session or disconnected?

Shanyou said when you recv() on the socket but got 0 it means disconnected. shows recv() return value of 0 indicates dead connection i.e. disconnected. uses getsockopt()

Q: given a UDP multicast receiver socket, how do you tell if it’s still has a live subscription to the multicast group?

%%A: I guess you can use getsockopt() to check socket /aliveness/. If alive but no data, then the group is quiet

thesys mgr QnA IV

Q: why do you say look-up can be slower on hash table than RBTree?
A: hash collision; slow hashing;

Q: Given a fast parser but a slow disk writer, how do you persist all the messages received?
%%A: socket for IPC
%%A: shared mem is probably faster, even with synchronization cost
%%A: we can also use threads. Now I feel we should use coarse-grained concurrency. Each task is a big chunk of data. One of multiple consumers/writers pick up a chunk and spend a sizable amount of time writing to disk.

Q: what’s your concept of multiplexing?

Q: you notice packet dropping due to overcapacity. What would you do to improve throughput?

Q: what system calls have you used?

bbg cod`IV: %%clarifications

I had four technical discussions (before and after the wonderful lunch chat with Chitan and Neel). For three of them, I have a few echnical clarifications, as in a product design discussion.

  1. FX team:  in-place list shuffle — At home, I wrote and tested a splice() solution, but splice() works only on a linked list, which has roughly double the memory footprint of a vector of integers. If we forcibly apply some kind of splice on a vector, runtime cost of repeated shifts will be unacceptable. If there’s an in-place solution with O(1) space complexity and O(N) time complexity, then I have not found it.
  2. FX team: local static variables are zero-initialized by default, contrary to what we assumed in our discussion. This represents a bug in my solution. The extreme-value tests will indeed flush out this bug.
  3. Equities team: filters in screen —

At Akhil’s hint I sketched a conceptual design based on sorted map<priceValue, ticker>. I will call it a RBTree. This RBTree must be updated with each message, by erase/insert. Suppose we care only about filter latency, not about update latency. Or suppose we get relatively infrequent updates but many users perform applyAllFilters(). Then this RBTree can help. We will use lower_bound followed by backward iteration (for example, Price < 150). This is presumably faster than my simple solution of full hash table iteration. With both designs, we get the list of “survivor tickers” from this filter.

That’s the first filter in a screen. How about the second filter?  In my simple solution, I take the survivor list (of size N) from first apply_filter and then look up N times against filter2’s hash table (no full iteration). With the RBTree design, we have a choice. Let’s name the filter2’s RBTree as “tree2”.

  • if tree2.lower_bound() leaves very few (like 5) qualified tickers, but the survivor list from filter1 is large, then we should take each qualified ticker from tree2.lower_bound() and check if it is in filter1’s survivor list.
  • If tree2.lower_bound() leaves many (like 99999999) tickers to be checked, then we should drive from the survivor list. The tree2 won’t be used.
  • However, in a RBTree, how many nodes come before a given node is hard to count. This counting requires expensive iteration. so I feel the choice cannot be made at runtime.

As I told Akhil and Rajan, the RBTree vs my simple solution should be benchmarked in a realistic setting. Depending on the context, one will outperform the other.

strncpy^strcpy^strlcpy #padding nulls claims that strncpy is not safe, but I am not sure. The strncpy() API is well defined, while strlcpy is a BSD extension and not even available in my linux. May not be in glibc.

One key difference between strcpy vs strncpy is the implicit \0 terminator —

strcpy() unconditionally appends a single null (\0), whereas strncpy won’t. strncpy() appends enough padding nulls IFF input array contains a null within first “n” slots.

q[inline] to avoid a.. jump^stackFrame

Dino of BBG FX team asked me — when you mark a small f1() function inline (like manually copying the code into main()), you save yourself a jump or a new stack frame?

A: both a jump and a new stack frame.

It turns out a new stack frame would require a jump, because after the new stack frame is created, thread jumps to the beginning of f1().

However, there’s something to set up before the jump — Suppose f1() is on Line 5 in main(), then Line 6’s address has to be saved to CPU register, otherwise the thread has no idea where to” jump back” after returning from f1(). According to my codebashing training (required at RTS team), this Line 6’s address is saved in the main() stack frame, not the f1() stack frame!

Note the Line 6’s address is not a heap address not a stack address but an pointer into the code area.

RBTree range count #enum,auto

// demo range counting with lower_bound. I don't know any faster algorithm
// demo auto keyword
// demo enum
// demo upper_bound different from lower_bound!
#include <iostream>
#include <climits>
#include <set>
#include <assert.h>
using namespace std;

set<int> s;
//typedef set<int>::iterator It; // ---> auto is easier
enum Mode { NoMin, NoMax, BothGiven };

size_t countWithInclusiveLimits(int limit1, int limit2, Mode m = BothGiven){
  if (s.empty()) return 0;
  auto it  = s.begin();
  auto it2 = s.end(); //it2 is the node past the last wanted node.

  if (m != NoMin) it = s.lower_bound(limit1);
  if (m != NoMax){
    it2 = s.upper_bound(limit2);
    assert(*it2 != limit2 && "it2 initial value should be consistent with end()");

  size_t ret = 0;
  for (; it != it2; ++it){
    cout<<*it<<" ";
  cout<<" --> "<<ret<<endl;
  return ret;
int main(){
  for(int i=-4; i<=9; ++i) s.insert(i*10);
  countWithInclusiveLimits(11, 55);
  countWithInclusiveLimits(0, 50, NoMin);
  countWithInclusiveLimits(10, 0, NoMax);

##5 long^short-term stressors

Short-term stressors

  • best examples: sickness; quarrel; deadlines; something broken
  • mgr’s assessment; respect from team; figure-things-out speed
  • stagnation, wasting my “potential”
  • —-poor examples
  • GC progress
  • peer comparison
  • BMI

10Y long-term concerns

  • retirement planning
  • kids’ well-being
  • kids’ academic benchmark
  • employability and income stability
    • competence in job^interview; beat-fronts
  • Some peers at my age worry about health but I’m free from that worry.


c++user-defined constant:(scoped)enum^const static


  • best practice — enclose in a namespace or a class
  • enum can group 2 related constants. Scoped enum is even more descriptive.
  • enum also creates a typename for code documentation
  • enum const value must be signed integers 😦
  • Both enum type and a single static const field can be declared as class members 🙂
  • For singular constants, use enum or const static variables (file-scope, and independently instantiated in each compilation unit). Avoid extern.

STL iterator invalidation rules, succinctly is concise with explanations. Specifically,

  • list insertions don’t invalidate any iterator. I feel iterator is a pointer to a node.
  • tree insertions don’t invalidate any iterator. Same reason as list.
  • for erasure from lists or trees, only the iterator to the erased node is invalidated.

Now for vector:

  • vector insertion invalidates any iterator positioned somewhere after the insertion point. If reallocation happens due to exceeding vector.capacity() then all invalidated

bbg-Eq: filters in a screen

Exchanges constantly send you live update messages in a fixed format like

IBM, Price, 100
GOOG, Volume, 5000
MSFT, Volume, 14000
IBM, Price, 99
GS, Volume, 7000
MSFT, Price, 250
GOOG, Price 332

Each message always has exactly 3 fields — ticker, field name, integer value. The value only overwrites the previous value and never aggregates. In this example, the IBM price 99 will wipe out the IBM price 100.

There are up to 100 distinct field names, but many tickers and very high messaging rate.

Each user has a Screen, which holds a variable number of Filters defined by this user. Each Filter has a fixed format like “Price > 150” or “Volume < 9000”. All the filters need to be applied, to return a list of qualifying tickers. In this example, first filter alone returns a list of tickers {MSFT,GOOG} and 2nd filter alone returns {GS,GOOG}. Applying all filters would reduce the list to {GOOG}

Data feed goes into the server. Server maintains thousands of Screen objects, each for one user. When a new message is received, the server must apply all relevant filters. If out of 3000 screens 500 Screens’ lists get updated, then another module will update those 500 clients. You don’t need to design the interface to clients, data feed, threading.

Requirement: design the data store, the screen and filter.

I will assume 3000 Screens. On average, each Screen has 4 filters, defined by that user. If you assume every incoming message must be go through 3000*4 = 12000 filters, I guess you are probably right.

Can you avoid linear searches when applying filters? To illustrate, suppose all the messages are Price numbers. I maintain an unorder_map<ticker, priceValue>, updated constantly. When server applies a single filter, we start with the survivor list from previous filter. Go through each key/value of the map, and erase each disqualified ticker from survivor list. If I start with N tickers and perform N look-ups we get O(N).

Akhil hinted a reverse sorted map<priceValue, ticker>. I will call it a RBTree. This RBTree must be updated with each message, by erase/insert. Suppose we care only about filter latency, not about update latency. Or suppose we get very few updates but many Screens to applyFilters(), then this RBTree can help. We will use lower_bound followed by backward iteration (for example, Price < 150). This is presumably faster than my simple solution of full hash table iteration.

That’s the first filter. How about second filter? Let’s name the RBTree for the second filter “tree2”.  In my simple solution, I start with the survivor list (of size N) from first filter and look up N times against the hash table of the second filter, without full iteration. With the RBTree, we have a choice

  • if tree2.lower_bound() leaves very few (like 5) tickers to be checked, but the survivor list from first filter is large, then we should convert the survivor list to an unordered_set and drive from filter2.lower_bound()
  • If tree2.lower_bound() leaves many (like 9999) tickers to be checked, then we should drive from the survivor list. The tree2 won’t be used.
  • However, in a RBTree, how many nodes come before a given node is hard to count. This counting requires expensive iteration. so I feel the choice cannot be made at runtime. See RBTree range count #enum,auto

bbg-Eq: fewest increments on odometer

There’s a special odometer with N (eg 5) digits. Each digit can increment (never decrement, to keep the problem simpler) independently and cyclically, like 8->9->0->1. You can think of them as 5 independent mini-odometers. The starting sequence is input (perhaps something like 07748). Ending sequence can be any other value.

Each increment on any mini-odometer has cost of $1. Find the cheapest way to go from start to end. One simple solution (from me) –increment each digit cyclically to the target value. So a “8” will increment to 9 -> 0 -> 1. Other mini-odometers are not affected by this mini-odometer.

However, 18021 is one of many illegal sequences — barriers.

Key insight — When N==2, this becomes a 2D matrix Tom-n-Jerry problem with barriers. We can simplify the problem to a 3×3 matrix i.e. 2-digit odometer from 00 to 22. Nine possible sequences.

We need to build a graph of nine nodes, which helps answer

  • Q: feasible? [2,2] would be unreachable if [1,2] and [2,1] are blocked.
  • Q: which path is shortest path

Once we build the graph, shortest path btw 2 graph nodes #binary matrix as illutration will solve it.

bbg-Eq: longest abbreviation #easier

Given a long word with N non-unique characters, we know there can be 2^N abbreviations (including the original word, and including the empty string).

Requirement: I give you a list of strings and you suspect some of them could be an abbreviation of your word. Find the longest string among them that’s a valid abbreviation. Optimize for time complexity.

I feel this problem appears simple but not easy to complete quickly is my python solution, different from the c++ solution below. Not sure which has more bugs.

I started with string::find() and hit the problem of non-unique characters. Interviewer hinted char-matching — the key point I needed for this type of problem.

#include <iostream>
#include <algorithm>
#include <vector>
using namespace std;

typedef string candidate;

vector<candidate> v{"monkey","aaaaaaaaaaaaaaaaaa", "mamalon", "monk"};
string const haystack{"mamalonkey"};
size_t hsize=haystack.size();

//Tip: can use typdef alias in argument:)
bool lenComp(candidate const & a, candidate const & b){
  return a.size()<=b.size();
ostream & operator<<(ostream & os, vector<candidate> const & c){
  size_t const sz = c.size();
  for (int i=0;i<sz; ++i){
        os<<c[i]<<" ";
  os<<endl; return os; } //single-pass 🙂 

bool check1candate(candidate const & c){ 
  if (c.size() > hsize) return false;

  for(int i=0, h=0; h<hsize; ++h){
        if (c[i] == haystack[h]){
          if (i==c.size()-1) return true;
  return false;
int main(){
  sort(v.begin(), v.end(), lenComp);
  cout<<v; for (int i=v.size()-1; i>=0; --i){
    if (check1candate(v[i])){
        cout<<"winner is "<<v[i];
        return 0;


bbg-FX: in-place array shuffle #splice

Update: In-Place usually requires 1) splice 2) swap.

Requirement: given a list like

{1,2,3,4,  25,29,44,33,  159,139,155,150,  177,176,169,180} which describes four Persons
{id1,id2,..age1,age2, weight1,weight2,…height1,height2,..}

Write a program to output each person’s full attributes like

{1, 25, 139, 177,   2, 29, 129, 176,   3, 24, 135, 169,   4, 33, 140, 180}

Note the number of attributes (in this case 4) is an input parameter to your program. You don’t know (or care about) the name of each attribute. If you are told there are 37 attributes in each person, just output attr1 .. attr37.

Note the input and output sequence are important.

Q (Bonus question): do it in-place without creating a new collection. You can still have a small number of variables in your code.
A: one way to “cheat” is to append each output person to end of the original vector. It will double in size, so we will truncate the first half.
A: I believe I can use the “homeless/displacement” technique.
A: I now feel many in-place array reshuffle can be done in place with a single temp variable but this may not be one of them.
A: how about starting with a linked list? has ..

bbg-FX: check if a binary tree is BST #no deep recursion

Binary search tree has an “ascending” property — any node in my left sub-tree are smaller than me.

Q1: given a binary tree, check if it’s a BST. (No performance optimization required.) Compile and run the program.

Suppose someone tells you that the lastSeen variable can start with a random (high) initial value, what kind of test tree would flush out the bug? My solution is below, but let’s Look at Q1b.

Q1b: what if the tree could be deep so you can’t use recursion?
A: use BFT. When we actually “print” each node, we check left/right child nodes.

I made a mistake with lastSeen initial value. There’s no “special” initial value to represent “uninitialized”. Therefore I added a boolean flag.

Contrary to the interviewer’s claim, local statics are automatically initialized to zero, but zero or any initial value is unsuitable (payload can be any large negative value), so we still need the flag.

#include <iostream>
#include <climits>
using namespace std;

struct Node {
    int data;
    Node *le, *ri;
    Node(int x, Node * left = NULL, Node * right = NULL) : data(x), le(left), ri(right){}
/*    5
  2       7
 1 a
Node _7(7);
Node _a(6);
Node _1(1);
Node _2(2, &_1, &_a);
Node _5(5, &_2, &_7); //root

bool isAscending = true; //first assume this tree is BST
void recur(Node * n){
//static int lastSeen; // I don't know what initial value can safely represent a special value

  //simulate a random but unlucky initial value, which can break us if without isFirstNodeDone
  static int lastSeen = INT_MAX; //simulate a random but unlucky initial value
  static bool isFirstNodeDone=false; //should initialize to false

  if (!n) return;
  if (n->le) recur(n->le);

  // used to be open(Node *):
  if (!isAscending) return; //check before opening any node
  cout<<"opening "<<n->data<<endl; if (!isFirstNodeDone){ isFirstNodeDone = true; }else if (lastSeen > n->data){
        return; //early return to save time
  lastSeen = n->data;

  if (n->ri) recur(n->ri);

int main(){
  cout<< (isAscending?"ok":"nok");

rather few exceptions when using STL #virtual

Update — virtual function is also very rare in STL classes. I don’t know any

STL functions seldom throw exception. See P 248 [[c++standard library]]

  1. at() method throws, since it’s the “checked” version of the subscript operator
  2. reserve() method throws

Besides these two unusual functions, STL would only throw the standard low-level exceptions like memory allocation failures.

Payload data types going into STL container can create exceptions:

  • If a payload class ctor/assignment throws, then it would propagate.
  • Payload destructors should never throw. [[c++coding standard]] says it’s forbidden by STL standard.

private dtor restricts allocation to heap only

My test proves that merely declaring the a stack instance or global instance is illegal because compiler generates a (illegal) call to the private dtor!

Friend can also call dtor!

class PrivDtor{
  static void destroy(PrivDtor* p){p->~PrivDtor();}
  static void del(PrivDtor* p){delete p;}

//PrivDtor globalInstance;  //won't compile
int main(){
  //PrivDtor stackInstance; //won't compile

  PrivDtor * p = new PrivDtor;
  //PrivDtor::del(p); // works!
  PrivDtor::destroy(p);  //works!

19obscure but recurr`c++QQ questions #halo

All QQ topics, not related to GTD or coding IV

  1. Throwing dtor
  2. how to avoid virtual functions (use enum?); CRTP
  3. IPC mutex; synchronization in shared mem
  4. Delete a pointer to base, but the actual object is a derived without virtual dtor. See my post
  5. casting “this” pointer. See
  6. memory: placement new and how to delete
  7. reinterpret_cast
  8. why division by zero is not an exception
  9. debug build modifying behavior
  10. memory: q(delete this)
  11. SFINAE — I was asked in … 2 MS teams?

–socket programming interviews

  1. ##tcp tuning params
  2. non-blocking socket calls
  3. buffer full

non-blocking socket readiness: alternatives to periodic polling

Some interviewer once asked me —

Q: After your non-blocking send() fails due to a full buffer, what can you do to get your data sent ASAP?

Simple solution is retrying after 0 or more millisecond. Zero would be CPU spinning. Non-zero means unwanted latency.

A 1st alternative is poll()/select() with a timeout, and immediately retry the same. There’s basically no latency. No spinning either. The linux proprietary epoll() is more efficient than poll()/select() and a popular solution for asynchronous IO

2nd alternative is SIGIO. says it doesn’t waste CPU. P52 [[tcp/ip sockets in C]] also picked this solution to go with non-blocking sockets.


conflation: design question

I have hit this same question twice — Q: in a streaming price feed, you get IBM prices in the queue but you don’t want consumer thread AA to use “outdated” prices. Consumer BB needs a full history of the prices.

I see two conflicting requirements by the interviewer. I will point out to the interviewer this conflict.

I see two channels — in-band + out-of-band needed.

  1. in-band only — if full tick history is important, then the consumers have to /process/ every tick, even if outdated. We can have dedicated systems just to record ticks, with latency. For example, Rebus receives every tick, saves it and sends it out without conflation.
  2. dual-band — If your algo engine needs to catch opportunities at minimal latency, then it can’t afford to care about history. It must ignore history. I will focus on this requirement.
  3. in-band only — Combining the two, if your super-algo-engine needs to analyze tick-by-tick history and also react to the opportunities, then the “producer” thread alone has to do all work till order transmission, but I don’t know if it can be fast enough. In general, the fastest data processing system is single-threaded without queues and minimal interaction with other data stores. Since the producer thread is also the consumer thread for the same message, there’s no conflation. Every tick is consumed! I am not sure about the scalability of this synchronous design. FIFO Queue implies latency. Anyway, I will not talk further about this stringent “combo” requirement. says “Many firms mitigate the data they consume through the use of simple time conflation. These firms throw data on the floor based solely on the time that data arrived.”

In the Wells interview, I proposed a two-channel design. The producer simply updates a “notice board” with latest prices for each of 999 tickers. Registered consumers get notified out-of-band to re-read the notice board[1], on some messaging thread. Async design has a latency. I don’t know how tolerable that is. I feel async and MOM are popular and tolerable in algo trading. I should check my book [[all about HFT]]…

In-band only — However, the HSBC manager (Brian?) seems to imply that for minimum latency, the socket reader thread must run the algo all the way and send order out to exchange in one big function.

Out-of-band only — two market-leading investment bank gateways actually publish periodic updates regardless how many raw input messages hit it. Not event-driven and not monitoring every tick!

  • Lehman eq options real time vol publisher
  • BofA Stirt Sprite publishes short-term yield curves on the G10 currencies.

[1] The notification should not contain price numbers. Doing so defeats conflation and brings us back to a FIFO design.

java assignment-expression returns assigned value

A small halo in coding interviews 🙂

This feature is inherited from C.

  • eg: return singletonInstance = new MySingleton();
  • eg: a = b = 22;
  • eg: while ((line = reader.readLine()) != null)
  • eg (bigger):
    List<Object> processables;
    while ((processables = retrieveProcessableItems(..)).size() > 0) {/*/}

Incidentally, unlike C, java while-loop header can’t declare variables 😦  Solution — use for-loop 🙂

ibank interviewers: naive,arrogant..

In 2017 I interviewed with

  • * Pimco
  • Blomberg
  • * BGC partners
  • * ICE
  • * TradeWeb
  • Thesys (A trading software vendor)
  • Citadel
  • —-investment banks:
  • * BAML
  • * MS
  • Wells Fargo
  • GS
  • UBS

I now have a bias against investment banks. Wells is the most recent and most typical ibank interview:

  1. wishful thinking that if they are lucky to find a candidate who has used (for example) non-blocking-IO or shared-memory, he will be able to create such a system and make it WORK. I think in reality, many senior developers who have used such a technology professionally for a few years won’t be able to overcome the many technical challenges. I know from experience that we all struggle with a technology before we understand its limitations and pitfalls and figure out how to navigate around them. If you use an existing codebase that was fine-tuned years ago, then you have missed the “struggle” phase. You have to find opportunities to struggle with it (am trying now) to learn it. No pain no gain.
  2. They demand relevant domain knowledge and domain experience as a prerequisite.
  3. They have a long list of topics and they hope to find someone who scores well in most of them.
  4. Focus on in-depth, precise understanding of a few “cornerstone” topics such as threading memory model, and data structures. Such knowledge is very seldom needed in projects. We gain that knowledge by reading and interviewing and use it for interview only.
  5. For all other topics, they have a focus on textbook knowledge, slightly beyond skin-deep. These investment banks seem to believe if a candidate has such a knowledge, then he has experienced using it (or can learn it fast). Naïve thinking.

The non-banks (particularly Pimco, Bloomberg, BGC) are not “smarter interviewers”, but their focus is slightly different. They don’t need domain knowledge or a super long list of topics to rate a candidate. They don’t look for skin-deep “insight”. Many use coding tests, more often on a compiler than on paper.

Many non-banks (BGC, TradeWeb) also have the same high requirement on the same cornerstone topics above. Others focus on C++/java language details including low-level and tricky topics, more than the ibank interviews. My GS interview is like this.

tcp: one of 3 client-receivers is too slow

See also no overflow]TCP slow receiver #non-blocking sender

This is a real BGC interview question

Q: server is sending data fast. One of the clients (AA) is too slow.

Background — there will be 3 worker-sockets. The local address:port will look identical among them if the 3 clients connect to the same network interface, from the same network segment.

The set-up is described in simultaneous send to 2 tcp clients #mimic multicast

Note every worker socket for every client has identical local port.


I believe the AA connection/session/thread will be stagnant. At a certain point [1] server will have to remove the (mounting) data queue and release memory — data loss for the AA client.

[1] can happen within seconds for a fast data feed.

I also feel this set-up overloads the server. A TCP server has to maintain state for each laggard client, assuming single-threaded multiplexing(?). If each client takes a dedicated thread then server gets even higher load.

Are 5 client Connections using 5 sockets on server? I think so. Can a single thread multiplex them? I think so.

blocking scenario ] CPU-bound system

Q: can you describe a blocking scenario in a CPU-bound system?

Think of a few CPU bound systems like

  • database server
  • O(N!) algo
  • MC simulation engine
  • stress testing

I tend to think that a thread submitting a heavy task is usually the same thread that processes the task. (Such a thread doesn’t block!)

However, in a task-queue producer/consumer architecture, the submitter thread enqueues the task and can do other things or return to the thread pool.

A workhorse thread picks up the task from queue and spends hours to complete it.

Now, I present a trivial blocking scenario in a CPU bound system —

  • Any of these threads can briefly block in I/O if it has big data to send. Still, system is CPU-bound.
  • Any of these threads can block on a mutex or condVar

## Y avoid blocking design

There are many contexts. I only know a few.

1st, let’s look at an socket context. Suppose there are many (like 500 or 50) sockets to process. We don’t want 50 threads. We prefer fewer, perhaps 1 thread to check each “ready” socket, transfer whatever data can be transferred then go back to waiting. In this context, we need either

  • /readiness notification/, or
  • polling
  • … Both are compared on P51 [[TCP/IP sockets in C]]

2nd scenario — GUI. Blocking a UI-related thread (like the EDT) would freeze the screen.

3rd, let’s look at some DB request client. The request thread sends a request and it would take a long time to get a response. Blocking the request thread would waste some memory resource but not really CPU resource. It’s often better to deploy this thread to other tasks, if any.

Q: So what other tasks?
A: ANY task, in the thread pool design. The requester thread completes the sending task, and returns to the thread pool. It can pick up unrelated tasks. When the DB server responds, any thread in the pool can pick it up.

This can be seen as a “server bound” system, rather than IO bound or CPU bound. Both the CPU task queue and the IO task queue gets drained quickly.


concurrent lazy singleton using static-local var has 20 upvotes and looks substantiated. It also considers double-checking, std::call_once, atomics, CAS…

GCC uses platform-specific tricks to ensure a static local variable is initialized only once, on first use. 

If it works, this is the easiest solution.

As explained in another blog post, static local is a shared mutable.

no overflow]TCP slow receiver #non-blocking sender

Q: Does TCP receiver ever overflow due to a fast sender?

A: See

A: should not. When the receiver buffer is full, the receiver sends AdvertizedWindowSize to informs the sender. If sender app ignores it and continues to send, then sent data will remain in the send buffer and not sent over the wire. Soon the send buffer will fill up and send() will block. On a non-blocking TCP socket, send() returns with error only when it can’t send a single byte. (UDP is different.)

Non-block send/receive operations either complete the job or returns an error.

Q: Do they ever return with part of the data processed?
A: Yes they return the number of bytes transferred. Partial transfer is considered “completed”.


UDP/TCP socket read buffer size: can be 256MB

For my UDP socket, I use 64MB.
For my TCP socket, I use 64MB too!

These are large values and required kernel turning. In my linux server, /etc/sysctl.conf shows these permissible read buffer sizes:

net.core.rmem_max = 268435456 # —–> 256 MB
net.ipv4.tcp_rmem = 4096   10179648   268435456 # —–> 256 MB

Note a read buffer of any socket is always maintained by the kernel and can be shared across processes [1]. In my mind, the TCP/UDP code using these buffers is kernel code, like hotel service. Application code is like hotel guests.

[1] Process A will use its file descriptor 3 for this socket, while Process B will use its file descriptor 5 for this socket.

c++template^java generics #%%take

A 2017 Wells Fargo interviewer asked me this question. There are many many differences. Here I list my top picks. I feel c# is more like java.

  1. (1st word indicates the category winner)
  2. C++ TMP is quite an advanced art and very powerful. Java generics is useful mostly on collections and doesn’t offer equivalents to most of the TMP techniques.
  3. java List<Student> and List<Trade> shares a single classfile, with uniform implementation of all the methods. In c++ there are distinct object files. Most of the code is duplicated leading to code bloat, but it also supports specialization and other features.
  4. java generics supports extends/super. C# is even “richer”. I think c++ can achieve the same with some of the TMP tricks
  5. c++ supports template specialization
  6. C++ wins — java doesn’t allow primitive type arguments and requires inefficient boxing. C# improved on it. This is more serious than it looks because most c++ templates use primitive type arguments.
  7. c++ supports non-dummy-type template param, so you can put in a literal argument of “1.3”
  8. c++ actual type argument is available at runtime. Java erases it, but I can’t give a concrete example illustrating the effect.


3rd effect@volatile ] java5

A Wells Fargo java interviewer said there are 3 effects. I named

  1. load/store to main memory on the target variable
  2. disable statement reordering

I think interviewer mentioned a 3rd effect about memory barrier.

This quote from Java Concurrency in Practice, chap. 3.1.4 may be relevant:

The visibility effects of volatile variables extend beyond the value of the volatile variable itself. When thread A writes to a volatile variable and subsequently thread B reads that same variable, the values of all variables that were visible to A prior to writing to the volatile variable become visible to B after reading the volatile variable. So from a memory visibility perspective, writing a volatile variable is like exiting a synchronized block and reading a volatile variable is like entering a synchronized block. address my doubt about “other writes“. Nialscorva’s answer echoes the interviewer:

Before java 1.5, the compiler can reorder the two steps

  1. construction of the new object
  2. assigning the new address to the variable

In such a scenario, other threads (unsynchronized) can see the address in the variable and use the incomplete object, while the construction thread is preempted indefinitely like for 3 hours!

So in java 1.5, the construction is among the “other writes” by the volatile-writing thread! Therefore, the construction is flushed to memory before the address assignment. Below is my own solution, using a non-static volatile field:

public class DoubleCheckSingleton {
	private static DoubleCheckSingleton inst = null;
	private volatile boolean isConstructed = false;
	private DoubleCheckSingleton() {
		/* other construction steps */
		this.isConstructed = true; //last step
	DoubleCheckSingleton getInstance() {
		if (inst != null && inst.isConstructed) return inst;
		synchronized(DoubleCheckSingleton.class) {
			if (inst != null && inst.isConstructed) return inst;
/**This design makes uses of volatile feature that's reputed to be java5
* Without the isConstructed volatile field, an incomplete object's 
* address can be assigned to inst, so another thread entering getInstance()
* will see a non-null inst and use the half-cooked object 😦
* The isConstructed check ensures the construction has completed
			return inst = new DoubleCheckSingleton();

Wells IV #socket/c++/threading

This is one of the longest tech interviews and one of the most enriching 🙂 even though I don’t “like” all their questions — very academic/theoretical, text-book driven, too low-level to be practical, not testing practical zbs.

Q: Any consistently reliable technique to detect stale order in your book?

Q: what’s your frame size if your tcp/ucp buffers sizes are so large?

Q: experience storing ticks?

—— C++ questions:
Q1: how can you avoid the cost of virtual function?
%%A: enum/switch or CRTP

Q1b: what’s CRTP

Q: what if your copy ctor signature is missing the “const”?
%%A: you can’t pass in temporaries or literal values. Such arguments must pass into “const &” parameter or “const” parameter. Correct.

Q: double delete?

Q: policy class? traits class?

Q: STL binders.. use case?

Q: how many types of smart pointers do you use?

Q: difference between java generics vs c++ templates?
%%A: type erasure. Java Compiler enforces many rules, but bytecode saves no info about the type argument, so we can get strange runtime errors.
%%A: template specialization. Template meta-programming
%%A (accepted): code bloat since each template instantiation is a separate chunk of code in the object file and in memory.
A: I have a dedicated blog post on this.

Q: what’s returned by a std::queue’s dequeue operation when it’s empty?
AA: undefined behavior, so we must check empty() before attempting dequeue. I believe ditto for a std::stack

Q: why template specialization?
%%A: customize behavior for this particular vector since the standard implementation is not suitable.

Q: how do you implement a thread-safe c++singleton
A: not obvious. See concurrent lazy singleton using static-local var

Q12: in a simple function I have
vector v1 = {“a”, “b”}; vector v2 = v1; cout<<….
What happens to the ctor, dtor etc?
A: 2 std::strings constructed on heap, vector constructed on stack; 2nd vector copy-constructed on stack; 2 new strings constructed on heap; vector destructors deletes all four strings
A: the actual char array is allocated only once for each actual value, due to reference counting in the current std::string implementation.

Q12b: you mean it’s interned?

Coding question: implement void remove_spaces(char * s) //modify the s char array in place. See %%efficient removeSpaces(char*) #Wells

—— threading, mostly in java
Q: What are the problems of CAS solutions?
A: too many retries. CAS is optimistic, but if there are too many concurrent writes, then the assumption is invalid.
%%A: missed update? Not a common issue so far.

%%Q: Synchronized keyword usage inside a static method?
AA: you need be explicit about the target object, like synchronized(MyClass.class)

Q21: Name 3 effects of java volatile keyword — Advanced. See 3effects@volatile ] java5.

Q21b: analyze the double-checking singleton implementation.
staticInst = new Student(); // inside synchronized block, this can assign an incomplete object’s address to staticInst variable, which will be visible to an unsynchronized reader. Solution – declare staticInst as volatile static field

—— System programming (ANSI-C) questions
Q: have you used kernel bypass?

Q5: how do you construct a Student object in a memory-mapped-file?
%%A: placement new

Q5b: what if we close the file and map it again in the same process?
%%A: the ptr I got from malloc is now a dangling ptr. Content will be intact, as in Quest

Q5c: what if Student has virtual functions and has a vptr.
%%A: I see no problem.

Q: you mentioned non-blocking TCP send(), so when your send fails, what can you do?
%%A: retry after completing some other tasks.

Q4: why is UDP faster than TCP
%%A: no buffering; smaller envelopes; no initial handshake; no ack;
%%A: perhaps most important for market data — multiple recipient with TCP is very heavy on sender and on network

Q4b: why buffering in tcp?
%%A: resend

Q: can tcp client use bind()?
%%A: bind to a specific client port, rather than a random port

Q6: is there socket buffer overflow in TCP?
A: probably not

Q6b: what if I try to send a big file by TCP, and the receiver’s buffer is full? Will sender care about it? Is that reliable transmission?
A: Aquis sends 100MB file. See no overflow]TCP slow receiver #non-blocking sender

Q: in your ticking risk system, how does the system get notified with new market data?
%%A: we use polling instead of notification. The use case doesn’t require notification.

Q: any network engineering techniques?

Q: what kernel parameters did you tune?


—— Stupid best-practice questions:
Q: what’s the benefit of IOC?

Q: Fitnesse, mock objects

Question on j4 factory. See separate blog post.


java lock: atomic^visibility

You need to use “synchronized” on a simple getter, for the #2 effect.!


A typical lock() operation in a java method has two effects:

  1. serialized access — unless I release the lock, all other threads will be blocked grabbing this lock
  2. memory barrier — after I acquire the lock, all the changes (on shared variables) by other threads are now visible. The exact details are .. to be detailed.

I now feel the 2nd effect is often more important (and more tricky) than the 1st effect. See P94 [[Doug Lea]] . I like this simple summary, even if not 100% complete —

“In essence, releasing a lock forces a flush of all writes from working memory….and acquiring a lock forces reload of accessible fields.”

Q: what are the subset of “accessible fields” in a class with 9 fields?
A: I believe the compiler knows “in advance” what subset of fields will be accessed after lock acquisition.

Q: what if I acquire a lock, do nothing and release the lock? Is there the #2 effect?
A: I doubt it. You need to enclose all the “tricky” operations between a lock grab/release. If you leave some update (to a shared mutable) outside in the “cold”, then #1 will fail and #2 may also fail.


first N prime Fibonacci: cache server #Thesys

The challenge is not in Fib or prime, but in the algo around them.

//This version is memory efficient as it only remembers the last
//two Fib numbers + all previous prime Fib numbers

//Requirement: For a given input number X, print
//the first X prime Fib numbers. Each output is a Fib and prime
//This version is memory efficient as it only remembers the last
//two Fib numbers + all previous prime Fib numbers
//The typedef of Fib is nice documentation
//My version of isPrime() is simpler
#include <iostream>
#include <string>
#include <vector>
#include <cstdlib>
using namespace std;

typedef unsigned int Fib;
bool isPrime(int i){
        if (i <= 1) return false;
        for(int j=2; j<=i/2; ++j) {
            if(i%j==0) return false;
        return true;

pair<Fib,bool> generate1Fib(){
  static Fib parent = 1;
  static Fib grandparent = 0;
  Fib newFib = parent + grandparent;
  grandparent = parent;
  parent = newFib;
  return make_pair(newFib, isPrime(newFib));

void print1stXFibPrimes(size_t const X){
    static vector<Fib> fibPrimes;
    int count = 0;
    for(int j=0; j<fibPrimes.size(); ++j){
                cout<<fibPrimes[j]<<" ";
                if (++count >= X) return;

    while( count<X ) {
          pair<Fib,bool> new1 = generate1Fib();
          if (new1.second){
                cout<<new1.first<<" ";
inline bool isInteger(const string & s) {
   if(s.empty() || ((!isdigit(s[0])) && (s[0] != '-') && (s[0] != '+'))) return false ;
   char * p ;
   strtol(s.c_str(), &p, 10) ;
   return (*p == 0) ;
int main(int argc, char *argv[]){
  for(int i=1; i<argc; ++i){
    string arg(argv[i]);
    if (isInteger(arg)){
      cout<<"calculate the first "<<arg<<" prime fibs"<<endl;
      print1stXFibPrimes( stoi(arg)) ;
      cout<<arg<<" is not an integer\n";

#include <xtap/PluginConfig.h> trick

I have seen in many large systems:

The actual path to the header is …/shared/tp_xtap/include/PluginConfig.h, but develoeprs prefer an abbreviated include like #include <xtap/PluginConfig.h>.

#1) Here’s one simple implementation:

ls -l shared/tp_xtap/include/ # should show a symbolic link to be created automatically:

    xtap -> ./

Therefore, -I/full/path/to/shared/tp_xtap/include/ will resolve #include <xtap/PluginConfig.h>

#2) I guess a second-best solution is code generation. Checked-in source file has #include <xtap/PluginConfig.h> but the build system follows configured rewrite-rules, to convert it into #include <some/other/path>

jmp_buf/setjmp() basics for IV #ANSI-C

Q: how is longjmp different from goto? See

A: longjmp can 1) jump across functions, and 2) restore program state from a jmp_buf, which was saved earlier by setjmp.

77 c++IV paperTigers

Avoid spending too much time on this list…. These c++ topics appeared non-trivial (perhaps daunting, intimidating) for years, until I started cracking the QnA interviews. Then I realized in-depth expertise isn’t required.

  1. [s] what debuggers and memory leak detectors are you familiar with?
  2. [a] singleton, factory, pimpl and other design patterns
  3. [s] c++ regex
  4. [s] stream I/O and manipulators
  5. [A] CRPP, SFINAE — real tigers. I got asked about these around 5 times, sometimes in-depth
  6. —sockets # many more paper tigers to be listed
  7. udp, multicast, select()
  8. [s] socket buffers
  9. [a] byte alignment
  10. endianness
  11. non-blocking
  12. TCP flow control
  13. TCP handshake and disconnect
  14. ack numbers
  15. partial send
  16. close() vs shutdown()
  17. — STL # many more paper tigers to be listed
  18. [s] STL binders
  19. [s] STL allocators
  20. adapters for containers, iterators and functors
  21. [a] iterator invalidation rules
  22. [s] how is deque different from a vector
  23. RBtree
  24. —concurrency # many more paper tigers to be listed
  25. [A] atomic types
  26. pthread functions
  27. [A] IPC mutex
  28. mutex in shared memory
  29. recursive lock
  30. read-write lock
  31. what if a thread dies while holding a lock?
  32. [s] RAII scoped lock
  33. — multi-file build
  34. forward class declarations (as required in pimpl) and their limitations
  35. [s] C/C++ integration, extern-C — heavily quizzed, but no in-depth
  36. [s] what C features are not supported by c++ compiler
  37. circular dependency between libraries — confusing. real tigers
  38. [s] shared lib vs static lib
  39. —integration and data exchange
  40. [A] shared memory
  41. [A] CORBA, RPC
  42. [a] serialization in compact wire format — only primitive data types!
  43. [s] OS system calls vs std library (glibc) — sounds daunting to most developers
  44. —exception
  45. catch by ref or by value or by pointer?
  46. [s] exception guarantees
  47. [s] stack unwinding due to exception
  48. throwing destructors — various questions
  49. —memory
  50. which part of memory do static data members go? How about file scope static variables? How about global variables
  51. [s] preventing heap/stack allocation of my class
  52. [s] custom new/delete,  set_new_handler()
  53. [s] intrusive smart ptr, weak_ptr
  54. [s] ref counting
  55. custom deleter in shared_ptr
  56. [s] reinterpret_cast  # mostly on pointers
  57. custom allocators
  58. [A] free list in the free store
  59. what if you call delete on a pointer that’s created by array-new?
  60. placement new
  61. out-of-memory in operator-new
  62. —inheritance
  63. dynamic_cast
  64. [A] multiple inheritance
  65. [s] virtual inheritance… which base class ctor gets called first? See
  66. [a] slicing problem
  67. [a] private inheritance
  68. [s] pure virtual
  69. —other low level topics
  70. [s] setjmp, jmp_buf… See the dedicated blog post jmp_buf/setjmp() basics for IV #ANSI-C
  71. [s] cpu cache levels
  72. [s] translation lookaside buffer
  73. [s] what data structures are cache-friendly?
  74. [a] memcpy, memset
  75. [s] ++myItr vs myItr++ how are they implemented differently?
  76. —other language features
  77. [s] RAII
  78. [s] operator-overload
  79. [A] template specialization — part of the STL fabric but largely transparent
  80. [s] ptr to member (function) — seldom used outside library code. I tried the syntax in my binary tree serializer
  81. [A] std::forward() std::move(), rvalue-ref
  82. const and constexp
  83. [a] lambda with capture
  84. [a] double-pointer .. why do you need it?
  85. —-
  86. [s == shallow book knowledge is enough]
  87. [a == actually not that deep, IMHO]
  88. [A == actually deep topic]

fileHandle/socket/dbConection are thread Unsafe

There’s a buffer in a DB connection, in a file handle, in a socket …

The buffer is a shared mutable object. Consider a read-buffer. The host object knows how much of the data in the buffer was already delivered, to avoid repeated delivery. There’s some kind of watermark, which is moved by a consumer thread.

As all shared mutables, these objects are thread unsafe.

All of these objects can also be allocated on stack and therefore invisible to other threads. Therefore, this could be the basis of a thread-safe design.


##how fast I figurED things out relative2teammates

GS — Initially I was much slower than my colleagues. After 18 months I was 80% as fast as them. Everyone needed to learn and memorize lots of details in a big system. Without that “mileage” you can’t be as fast as the rest.

I relied on my colleagues to give me pointers and detailed explanations. Otherwise it was too many details to wade through by one self.

Citi — no comparison — No “team mates” per-se. I had two old-timer colleagues who were very familiar with the system. I was not benchmarked against them.

I didn’t need to understand the ins and outs of the 5-year old codebase. I never became productive with it.

95G — not slower — i joined this greenfield project from beginning (luckily). I was as fast as the other consultant. Also, I designed a few key components, so I had the “original designer” advantage.

Barclays — not slower — i joined this greenfield project from very beginning (luckily), replacing a half-finished c# system. I had very few “team peers” to benchmark against. I designed one of the most valuable and complex components, so I had the “original designer” advantage. Huge advantage. I basically name my price when it came to effort estimates.

I did read some old code to replicate the logic, but I was fairly fast, and no team member ever read the same codebase anyway.

OCBC — not slower — I was seen as slower, but actually I became 70% to 90% as fast as the other team members. I had to spend many long hours on the existing codebase but it’s a small codebase (1 man-year) so I was able to get a hang of it over a few months.

I also designed some minor modules, where I had the “original designer” advantage. The other c# developers had to come to me and understand my designs.

Stirt — I was much slower than the old-timers, but I was not slower than the new joiners. I should have given up earlier. It was hopeless. Codebase was rather opaque and you need to ask many people to understand it.

Other new joiners seemed to ask only within their own team, but I stick out my neck and asked all around my team. It may have made me look slower and /greener/, but I was trying to learn faster. Did I learn faster than them? Not enough time to show.

Macquarie — The benchmark was my predecessor + my boss. I think I was much slower than them. My boss only complained about it in the 2nd year, months after paying me a small bonus.

RTS — not slower — I’m probably faster than some team members. I feel confident how fast I can figure things out now. There are many team members to compare, so from Day 1 it was crucial to hit the average standard.

GTD skill is harder,lasts longer in c++ than in Cleaner languages

In terms of troubleshooting, C++ is 90% same as C, which is a low-level language, close to the hardware.

In contrast, higher level languages strive to have the low level details encapsulated, so developers only need to deal with a simplified, standardized, cleaner façade. Some call it a virtualization.

Eg: sockets

Eg: c++ threading vs java threading

pbclone large obj(eg:vector)rely`@move

This is impressive in QQ interviews + coding questions

GotW #90 Solution: Factories

has a good illustration of move semantics put to good use.

  • Before c++11, a function returning a large vector (or any large object) by value incurs expensive deep copying of all vector elements.
  • With c++11 move features added to std::vector class, returning a vector by value is cheap and recommended.
  • RVO may kick in but (i feel) less reliable than move semantic. For the specific rules see RVO^move-semantics

declare iterator]function template #gotcha

(Needed in some coding interviews and also in GTD!)

Update: With c++11, you can use the “auto” keyword and avoid the complexity.

If you drop the “typename” from the for-loop header, then compiler is confused

error: dependent-name ‘std::multiset::iterator’ is parsed as a non-type, but (template) instantiation yields a type
note: say ‘typename std::multiset::iterator’ if a type is meant

Basically, we need to be extra explicit to the confused compiler.

template<typename T> ostream & operator<<(ostream & os, multiset<T> const & l){
  for(typename multiset<T>::iterator it = l.begin(); 
      it != l.end(); ++it){
        os<<*it<<" ";

populate map<string,list> no list cloning #make_shared

I feel a few things are fairly impressive for a short coding interview:

  • no leak.
  • no explicit new(). make_shared allocates the control block efficiently

In c++11, we could probably return the list by value, and rely on move semantics.

Here’s an inefficient nested container. Saving pointer-to-list in the map is more common, because inserting or reading the map doesn’t need to clone the list!

//tested with -std=c++0x
unordered_map<string, list<size_t> > lookup;

shared_ptr<list<size_t> > myfunc(){
  shared_ptr<list<size_t> > ret= make_shared<list<size_t> >();
  return ret; //return shared_ptr by clone, cheaply

lookup.insert(make_pair(word, *myfunc())); //dereference to get the list

Codility: ignore overflow+bigO

The machine won’t check your bigO!

Focus on getting it to work. If you have time, address the overflow. If you get bogged down by the overflow concerns, you lose focus and lose speed.

–Here are my practical strategy for codility

  1. Don’t aim for high marks, since these tests are very hard-to-please and not very popular.
  2. Get friends to help crack it.
  3. Remember that ROTI is lower than other coding tests like Bloomberg style.  ECT and thinking speed is way too demanding.

## +ve keywords]%%annual reviews

I really don’t want to care too much about manager’s comments. Those comments tend to hurt deeply. They often reflect the manager’s personal agenda, never a balanced/unbiased view. The keyword list is designed to counter some of the negative comments.

  • [zed, catcha] versatile, broad-based
  • [GS] ownership
  • [GS] client relationship
  • [GS, Chartered] attention to details
  • [GS] code quality. I was asked to present and publish
  • [GS, Mansion] technical strength
  • [GS, Barc] analytical
  • well-planned
  • [Barc, GS] knowledge sharing
  • [Barc, GS] personal sacrifices
  • [95G] architecture design
  • [Mac] good at team working across departments
  • [Mac] adaptive to new tech; fast learning

Thesys QnA IV

Q: what’s the average O() complexity of insertion sort
AA: N^2 in most implementations — comparison-based, but using contiguous memory. NlogN if using skip list

Q: what’s the average O() complexity of heap sort
AA: NlogN, since it’s comparison-based.

Q: what are the synchronization primitives on the operating systems you know

Q: how many levels of caches are in a typical Intel processor
%%A: at least 3 possibly 4.
AA: some intel processors do have 4 levels

Q: name some IPC mechanisms
%%A: 1) shared mem 2) sockets 3) pipes

–2nd round (Don Bush?)

Q: Requirement: insert into a std::map only if key is absent
AA: the insert() function does that by default! Returns a pair<Iterator-where-inserted, bool-isInserted>. I used this same technique in my handleInputInternal()

Q: if my class Student has a private dtor, will it compile?
%%A: yes it’s possible to define a member function to invoke the dtor on a pointer passed in
%%A: deleting a ptr-to-Student won’t compile. Correct!
AA: have a friend call the dtor!

Q: Any guess why STL stack designer didn’t combine top() and void pop() like in other languages? hint: what if the copy ctor throws?
%%A: it’s more efficient if a function doesn’t need to return anything, so we don’t want to force the pop() users to pay a price for something they don’t need.

Q: do you create your own templates?
A: I should name the SFINAE hack!

Q: any restriction on a non-type template parameter’s type?
%%A: int and bool are common. Pointers are allowed. shows even arrays and ptr-to-functions are allowed.
AA: has to be a discrete type, not a float

%%Q: what’s the usage of this non-type parameter?
A: std::array uses it for the size. It can’t be a ctor argument since if you construct an array on stack the size must be known at compile time.

Q: can you use new-arrays with shared_ptr? I think he wanted vector<shared_ptr>
%%A: if I have a new-array to store somewhere, then I think shared_array or scoped_array
A: how about shared_ptr<vector>?

Q: what if ctor failes in Foo * p = new Foo() // will memory leak?
%%A: i know it’s common practice to throw from ctor…

Q: name some c++11 features you used in your recent projects
%%A: unique_ptr, shared_ptr, auto, unordered_map, initializer list for vector/map

%%top3GTD concerns as lead dev #instru/code reading..

Below are my biggest concerns when considering a lead dev role. Actually, many of my peers often hints “My role (as a mgr, or architect, or app owner..) is even harder” but I think they are Not more effective if asked to play the hands-on developer role.

  1. code reading — requires focus and stamina.
    • In addition, taking ownership of a big code base requires determination and confidence
  2. instrumentation — using prints or special tools, troubleshooting, error reproducing, ..
  3. pressure due to major feature not completed by impending deadlines

Note none of these is tested in any interview!

  • How about firefighting? I tend to panic but my peers aren’t stronger than me technically!
  • How about design? Not really my weakness. My peers’ designs are not necessarily better than mine
  • knowledge of generic tech (rather than local system knowledge)? I’m actually stronger than most peers
  • quality, attention to details? is usually my strength
  • fine tuning? Most peers are not good at this
(not ranked) bonus peers are .. generic or local nlg?
code reading no impact some are stronger  local sys nlg
investigate boss needs answers usually no diff  local
deadline yes ?  local
firefighting yes not stronger  local
design somewhat visible not stronger  some generic

top N active stocks #bbg#cache,VO

Requirement: implement top_n(int N): top N most active stocks

Requirement 2: add a top_n_last_x_minute(int N, int X): top N that are last traded in the last X millisec

This program also demonstrates how to avoid storing the same string twice, as map key and also as field of the map value

#include <unordered_map> //requires -std=c++0x
#include <iostream>
#include <vector>
#include <string>
#include <algorithm>
#include <chrono>
#include <assert.h>
using namespace std;

struct Rec{
  string const * strPtr; //a best-effort to avoid storing duplicate strings
  int cumvol; //cum vol
  long lastUpd; //millsec since epoch

  Rec(int vol): strPtr(NULL), cumvol(vol),
    lastUpd( (chrono::system_clock::now().time_since_epoch()).count()) { }

  bool operator<(Rec const & other) const{
        return this->cumvol < other.cumvol;
  friend ostream & operator<<(ostream & os, Rec const & r){
        os<<r.lastUpd<<"-"<<*(r.strPtr)<<"  : "<<r.cumvol;
        return os;

typedef unordered_map<string, Rec>::iterator ITR;

class App{
  unordered_map<string, Rec> lookup;
  vector<Rec> v;
  bool dirty;
  App(): dirty(true){}
  void update(string const & symbol, int vol){
    this->dirty = true;
    ITR it = lookup.find(symbol);
    if (it == lookup.end()){
        pair<ITR, bool> pair = lookup.insert(make_pair(symbol, Rec(vol)));
        assert(pair.first->second.strPtr == NULL);
        pair.first->second.strPtr = //second is the valye in the pair
          & pair.first->first; //the key of the new pair
        assert(it->second.strPtr != NULL);
        it->second.cumvol += vol;
        it->second.lastUpd = (std::chrono::system_clock::now().time_since_epoch()).count();
  void top_n(int N, unsigned long X=-1){ //unsigned -1 gives the largest positive value!
    size_t sz=lookup.size();
    if (dirty){
      cout<<"resorting due to updates to database ..."<<endl;
      for(ITR i=lookup.begin(); i!= lookup.end(); ++i){
      sort(v.begin(), v.end());
      dirty = false;
    long now = (std::chrono::system_clock::now().time_since_epoch()).count();
    //cout<<now<<" is now"<<endl;
    for(int k=sz-1, count=0; k>=0; --k){
        if (now - v[k].lastUpd > X) continue;
        if (++count >= N) break;
int main(){
  App app;
  app.update("ibm", 1000);
  app.update("gs", 700);
  app.update("gs", 500);
  app.update("goog", 600);
  app.update("msft", 800);
  app.update("ge", 500);
  app.update("ibm", 160);
  app.update("c", 400);

std::string is COW-reference-counted now but should change]c++11 (circa 2003) explains that COW speeds up string copy and string destruction.

This proved a small halo.

However, gcc 5.1 introduced a breaking change. says These changes were necessary to conform to the 2011 C++ standard which forbids Copy-On-Write strings

Q: why forbidden?
A: thread-safety … See [[std c++lib]] P692

TCP/UDP: partial or multiple messages in one buffer

This is often mentioned in IV. At least you can demonstrate your knowledge.

What if the UDP datagram is too big for recv() i.e. specified buffer length is too small? P116 [[tcp/ip soclets in C]] seems to say the oversize message is silently truncated.

UDP recv() will only return a single “logical” message [1]. I believe TCP can put partial or multiple messages into one “buffer” for recv().

Q: if my buffer is big enough, will my UDP recv() ever truncate a msg?
%%A: never

Note IP would always deliver a whole msg or miss a whole msg, never a partial msg. See P 329 [[comp networking]]

[1] a logical msg is the payload from one send()

## threading: a better specialization than algo,quant…

  • In summary — appears daunting and opaque but actually simple in practice
  • theoretical — my strength. Few coding tests and usually easy for me
  • fairly low-level — my strength. Not as low level as debugger, or template hacks …
  • GTD — no GTD challenges, since most projects use only simple threading designs. The code tracing, trouble-shooting expectation on me is relatively low.
  • opaque — for my peers. Even the basic condVar…
  • churn — low churn at the low level, but high churn at high level
  • ever-green — favorite topic of interviewers, esp. java
  • thin->thick->thin — I have achieved this for a long time.

Many of my halos are in this domain — ##halo across domains #specific=better