mgr|obligation| relationships

A technical or contract role is less complicated, though relationships are also important and can make your life very stressful or relatively easy.

In ## 2 heaviest work stressors, I listed “figure-things-out” as a key stressor — if I’m reasonably fast on this front, then the relationships have limited impact on my stress level at work. Not true for a manager role — even if you get things done fast enough, relationships can still mess up your life.

  • the relationship with the immediate boss is most critical. I had many problems in the past.
  • relationship with other teams. Dependency means … stressful relationship
  • relationship with big bosses
  • relationship with subordinates can also become difficult. Shuo told me it was not for him, but I feel some managers depend on some key subordinates. Dependency means stress.
    • managing a non-performing subordinates … is not easy at all. I could see Kevin had headaches with me.
  • relationship with key business users. I feel Venkat (ICE) is under that pressure.

opaque challenge:(intermediate)data browser #push the limit

Opacity — is a top 3 (possibly biggest) fear and burden in terms of figure-things-out-relative-to-cowokers on a localSys. The most effective and direct solution is some form of instrumentation tool for intermediate data. If you develop or master an effective browsing tool, however primitive, it would likely become a competitive advantage in terms of figure-out speed, and consequently earn you some respect.

LocalSys — I feel most of the effective data browser knowledge is localSys knowledge.

If you are serious about your figure-out speed weakness, if you are seriously affected by the opaque issues, then consider investing more time in these browsing tools.

Hard work, but worthwhile.

  • eg: Piroz built a Gemfire data browser and it became crucial in the Neoreo project
  • #1 eg: in my GS projects, the intermediate data was often written into RDBMS. Also important — input and output data are also written into RDBMS tables. Crucial in everyday trouble-shooting. I rank this as #1 in terms of complexity and value. Also this is my personal first-hand experience
  • #2 eg: RTS rebus — during development, I captured lots of output CTF messages as intermediate data… extremely valuable
    • Once deployed, QA team relied on some web tools. I didn’t need to debug production issues.
    • I remember the static data team save their static data to RDBMS, so they relied on the RDBMS query tool on a daily basis.

Now some negative experiences

  • eg: I’m not too sure, but during the Stirt QZ dev projects I didn’t develop enough investigation skill esp. in terms of checking intermediate data.
  • eg: in Mvea, we rely on net-admin commands to query order state, flow-element state and specific fills… Not as convenient as a data store. I never became proficient.
    • I would say the FIX messages are logged consistently and serve as input and output data browser.

Many projects in my recent past have no such data store. I don’t know if there’s some effective solution to the opacity, but everyone else face the same issue.

find 4-digit natural number X, where (X^2)%10000==X #csdoctor

Lets denote the number X^2 as Y;
Let’s denote the lowest digit of Y as Y_1 and 2nd lowest digit as Y_2 ..
Let’s denote the four digits of X as a b c d (where each a single-digit integers), so X=1000a + 100b + 10c + d. Multiplying out this expression, we get

  • the ten’s digit of Y, denoted Y_2 = [2cd + carry from dd]%10 = c
  • the hundred’s digit of Y, denoted Y_3 = [(200 db + 100 cc)%100 + carry]%10 = (2bd+cc+carry)%10 = b

What can d be?

dd  %10 = d so d can only be 0 || 1 || 5 || 6. I will now rule out 0, 1 and 5

  • if d = 0, then the Y_2 expression evaluates to 0 i.e c = 0 i.e. X is a multiple of 100, so Y is a multiple of 10000, so X = 0000, an invalid integer.
  • if d = 1, Y_2 evaluates to 2c%10 = c, i.e. c = 0. Then Y_3 evaluates to 2b%10 = b i.e. b = 0, so X is “?001” … easy to prove this is impossible
  • if d = 5, then Y_2 evaluates to [10c + 2]%10 = 2 = c. Then Y_3 evaluates to (cc+carry from 20cd)%10= 6 = b i.e. X looks like “?625” .. easy to prove this is impossible
  • So d can only be 6

Now let’s work out c:

Y_2 evaluates to (12c+3)%10 = c, so c can only be 7, i.e. X looks like “? ? 7 6”

Now let’s work out b:

Y_3 evaluates to (2 b + 7) %10 = b so b can only be 3, i.e. X looks like “?376”

I tried all 10 possible values for a, to find a must be 9.

convert non-null-terminated char-array to std::string

std::string ccy (ptr->ccy, ptr->ccy+3); //using a special string() ctor

my ptr->ccy is the address of a 3-char array, but it’s immediately followed by other chars belonging to another field, in a tightly packed struct without padding. If you simply pass ptr->ccy to string() ctor, your string will take in many extra chars until a null terminator.

check a receiver socket is alive: tcp^udp

Q: given a TCP receiver socket, how do you tell if it’s connected to a session or disconnected?

Shanyou said when you recv() on the socket but got 0 it means disconnected.

http://man7.org/linux/man-pages/man2/recv.2.html#RETURN_VALUE shows recv() return value of 0 indicates dead connection i.e. disconnected.

https://stackoverflow.com/questions/4142012/how-to-find-the-socket-connection-state-in-c uses getsockopt()

Q: given a UDP multicast receiver socket, how do you tell if it’s still has a live subscription to the multicast group?

%%A: I guess you can use getsockopt() to check socket /aliveness/. If alive but no data, then the group is quiet

thesys mgr QnA IV

Q: why do you say look-up can be slower on hash table than RBTree?
A: hash collision; slow hashing;

Q: Given a fast parser but a slow disk writer, how do you persist all the messages received?
%%A: socket for IPC
%%A: shared mem is probably faster, even with synchronization cost
%%A: we can also use threads. Now I feel we should use coarse-grained concurrency. Each task is a big chunk of data. One of multiple consumers/writers pick up a chunk and spend a sizable amount of time writing to disk.

Q: what’s your concept of multiplexing?

Q: you notice packet dropping due to overcapacity. What would you do to improve throughput?

Q: what system calls have you used?

bbg cod`IV: %%clarifications

I had four technical discussions (before and after the wonderful lunch chat with Chitan and Neel). For three of them, I have a few echnical clarifications, as in a product design discussion.

  1. FX team:  in-place list shuffle — At home, I wrote and tested a splice() solution, but splice() works only on a linked list, which has roughly double the memory footprint of a vector of integers. If we forcibly apply some kind of splice on a vector, runtime cost of repeated shifts will be unacceptable. If there’s an in-place solution with O(1) space complexity and O(N) time complexity, then I have not found it.
  2. FX team: local static variables are zero-initialized by default, contrary to what we assumed in our discussion. This represents a bug in my solution. The extreme-value tests will indeed flush out this bug.
  3. Equities team: filters in screen —

At Akhil’s hint I sketched a conceptual design based on sorted map<priceValue, ticker>. I will call it a RBTree. This RBTree must be updated with each message, by erase/insert. Suppose we care only about filter latency, not about update latency. Or suppose we get relatively infrequent updates but many users perform applyAllFilters(). Then this RBTree can help. We will use lower_bound followed by backward iteration (for example, Price < 150). This is presumably faster than my simple solution of full hash table iteration. With both designs, we get the list of “survivor tickers” from this filter.

That’s the first filter in a screen. How about the second filter?  In my simple solution, I take the survivor list (of size N) from first apply_filter and then look up N times against filter2’s hash table (no full iteration). With the RBTree design, we have a choice. Let’s name the filter2’s RBTree as “tree2”.

  • if tree2.lower_bound() leaves very few (like 5) qualified tickers, but the survivor list from filter1 is large, then we should take each qualified ticker from tree2.lower_bound() and check if it is in filter1’s survivor list.
  • If tree2.lower_bound() leaves many (like 99999999) tickers to be checked, then we should drive from the survivor list. The tree2 won’t be used.
  • However, in a RBTree, how many nodes come before a given node is hard to count. This counting requires expensive iteration. so I feel the choice cannot be made at runtime.

As I told Akhil and Rajan, the RBTree vs my simple solution should be benchmarked in a realistic setting. Depending on the context, one will outperform the other.

strncpy^strcpy^strlcpy #padding nulls

https://stackoverflow.com/questions/6987217/strncpy-or-strlcpy-in-my-case claims that strncpy is not safe, but I am not sure. The strncpy() API is well defined, while strlcpy is a BSD extension and not even available in my linux. May not be in glibc.

One key difference between strcpy vs strncpy is the implicit \0 terminator —

strcpy() unconditionally appends a single null (\0), whereas strncpy won’t. strncpy() appends enough padding nulls IFF input array contains a null within first “n” slots.

q[inline] to avoid a.. jump^stackFrame

Dino of BBG FX team asked me — when you mark a small f1() function inline (like manually copying the code into main()), you save yourself a jump or a new stack frame?

A: both a jump and a new stack frame.

It turns out a new stack frame would require a jump, because after the new stack frame is created, thread jumps to the beginning of f1().

However, there’s something to set up before the jump — Suppose f1() is on Line 5 in main(), then Line 6’s address has to be saved to CPU register, otherwise the thread has no idea where to” jump back” after returning from f1(). According to my codebashing training (required at RTS team), this Line 6’s address is saved in the main() stack frame, not the f1() stack frame!

Note the Line 6’s address is not a heap address not a stack address but an pointer into the code area.

RBTree range count #enum,auto

// demo range counting with lower_bound. I don't know any faster algorithm
// demo auto keyword
// demo enum
// demo upper_bound different from lower_bound!
#include <iostream>
#include <climits>
#include <set>
#include <assert.h>
using namespace std;

set<int> s;
//typedef set<int>::iterator It; // ---> auto is easier
enum Mode { NoMin, NoMax, BothGiven };

size_t countWithInclusiveLimits(int limit1, int limit2, Mode m = BothGiven){
  if (s.empty()) return 0;
  auto it  = s.begin();
  auto it2 = s.end(); //it2 is the node past the last wanted node.

  if (m != NoMin) it = s.lower_bound(limit1);
  if (m != NoMax){
    it2 = s.upper_bound(limit2);
    assert(*it2 != limit2 && "it2 initial value should be consistent with end()");
  }

  size_t ret = 0;
  for (; it != it2; ++it){
    cout<<*it<<" ";
    ++ret;
  }
  cout<<" --> "<<ret<<endl;
  return ret;
}
int main(){
  for(int i=-4; i<=9; ++i) s.insert(i*10);
  countWithInclusiveLimits(11, 55);
  countWithInclusiveLimits(0, 50, NoMin);
  countWithInclusiveLimits(10, 0, NoMax);
}

c++user-defined constant:(scoped)enum^const static

See https://stackoverflow.com/questions/112433/should-i-use-define-enum-or-const

  • best practice — enclose in a namespace or a class
  • enum can group 2 related constants. Scoped enum is even more descriptive.
  • enum also creates a typename for code documentation
  • enum const value must be signed integers 😦
  • Both enum type and a single static const field can be declared as class members 🙂
  • For singular constants, use enum or const static variables (file-scope, and independently instantiated in each compilation unit). Avoid extern.

STL iterator invalidation rules, succinctly

http://www.martinbroadhurst.com/iterator-invalidation-rules-for-c-containers.html has concise explanations. Specifically,

  • list insertions don’t invalidate any iterator. I feel iterator is a pointer to a node.
  • tree insertions don’t invalidate any iterator. Same reason as list.
  • for erasure from lists or trees, only the iterator to the erased node is invalidated.

Now for vector:

  • vector insertion invalidates any iterator positioned somewhere after the insertion point. If reallocation happens due to exceeding vector.capacity() then all invalidated

Who cares?

  1. in projects, we always check the documentation
  2. in coding interviews, we can save precious time if we remember these rules
  3. in QQ interviews, I have never received such a knowledge question. I think interviewers don’t consider this very “deep”, but we could possibly impress some interviews.

bbg-Eq: filters in a screen

Exchanges constantly send you live update messages in a fixed format like

IBM, Price, 100
GOOG, Volume, 5000
MSFT, Volume, 14000
IBM, Price, 99
GS, Volume, 7000
MSFT, Price, 250
GOOG, Price 332

Each message always has exactly 3 fields — ticker, field name, integer value. The value only overwrites the previous value and never aggregates. In this example, the IBM price 99 will wipe out the IBM price 100.

There are up to 100 distinct field names, but many tickers and very high messaging rate.

Each user has a Screen, which holds a variable number of Filters defined by this user. Each Filter has a fixed format like “Price > 150” or “Volume < 9000”. All the filters need to be applied, to return a list of qualifying tickers. In this example, first filter alone returns a list of tickers {MSFT,GOOG} and 2nd filter alone returns {GS,GOOG}. Applying all filters would reduce the list to {GOOG}

Data feed goes into the server. Server maintains thousands of Screen objects, each for one user. When a new message is received, the server must apply all relevant filters. If out of 3000 screens 500 Screens’ lists get updated, then another module will update those 500 clients. You don’t need to design the interface to clients, data feed, threading.

Requirement: design the data store, the screen and filter.

I will assume 3000 Screens. On average, each Screen has 4 filters, defined by that user. If you assume every incoming message must be go through 3000*4 = 12000 filters, I guess you are probably right.

Can you avoid linear searches when applying filters? To illustrate, suppose all the messages are Price numbers. I maintain an unorder_map<ticker, priceValue>, updated constantly. When server applies a single filter, we start with the survivor list from previous filter. Go through each key/value of the map, and erase each disqualified ticker from survivor list. If I start with N tickers and perform N look-ups we get O(N).

Akhil hinted a reverse sorted map<priceValue, ticker>. I will call it a RBTree. This RBTree must be updated with each message, by erase/insert. Suppose we care only about filter latency, not about update latency. Or suppose we get very few updates but many Screens to applyFilters(), then this RBTree can help. We will use lower_bound followed by backward iteration (for example, Price < 150). This is presumably faster than my simple solution of full hash table iteration.

That’s the first filter. How about second filter? Let’s name the RBTree for the second filter “tree2”.  In my simple solution, I start with the survivor list (of size N) from first filter and look up N times against the hash table of the second filter, without full iteration. With the RBTree, we have a choice

  • if tree2.lower_bound() leaves very few (like 5) tickers to be checked, but the survivor list from first filter is large, then we should convert the survivor list to an unordered_set and drive from filter2.lower_bound()
  • If tree2.lower_bound() leaves many (like 9999) tickers to be checked, then we should drive from the survivor list. The tree2 won’t be used.
  • However, in a RBTree, how many nodes come before a given node is hard to count. This counting requires expensive iteration. so I feel the choice cannot be made at runtime. See RBTree range count #enum,auto

bbg-Eq: fewest increments on odometer

There’s a special odometer with N (eg 5) digits. Each digit can increment (never decrement, to keep the problem simpler) independently and cyclically, like 8->9->0->1. You can think of them as 5 independent mini-odometers. The starting sequence is a user input (perhaps something like 07748). Ending sequence is another user input value.

Each increment on any mini-odometer has cost of $1. Find the cheapest way to go from start to end. One simple solution (from me) –increment each digit cyclically to the target value. So a “8” will increment to 9 -> 0 -> 1. Other mini-odometers are not affected by this mini-odometer.

However, 18021 is one of many illegal sequences — barriers.

Key insight — When N==2, this becomes a 2D matrix Tom-n-Jerry problem with barriers. We can represent it with a 3×3 matrix i.e. 2-digit odometer from 00 to 22. Nine possible states.

We need to build a graph of nine nodes, which helps answer

  • Q: feasible? [2,2] would be unreachable if [1,2] and [2,1] are blocked.
  • Q: which path is shortest path

Once we build the graph, shortest path btw 2 graph nodes #binary matrix as illutration will solve it.

bbg-Eq: longest abbreviation #easier

Given a long word with N non-unique characters, we know there can be 2^N abbreviations (including the original word, and including the empty string).

Requirement: I give you a list of strings and you suspect some of them could be an abbreviation of your word. Find the longest string among them that’s a valid abbreviation. Optimize for time complexity.

I feel this problem appears simple but not easy to complete quickly

https://github.com/tiger40490/repo1/blob/py1/py/str/testAbbr_ez.py is my python solution, different from the c++ solution below. Not sure which has more bugs.


I started with string::find() and hit the problem of non-unique characters. Interviewer hinted char-matching — the key point I needed for this type of problem.

#include <iostream>
#include <algorithm>
#include <vector>
using namespace std;

typedef string candidate;

vector<candidate> v{"monkey","aaaaaaaaaaaaaaaaaa", "mamalon", "monk"};
string const haystack{"mamalonkey"};
size_t hsize=haystack.size();

//Tip: can use typdef alias in argument:)
bool lenComp(candidate const & a, candidate const & b){
  return a.size()<=b.size();
}
ostream & operator<<(ostream & os, vector<candidate> const & c){
  size_t const sz = c.size();
  for (int i=0;i<sz; ++i){
        os<<c[i]<<" ";
  }
  os<<endl; return os; } //single-pass 🙂 
} 

bool check1candate(candidate const & c){ 
  if (c.size() > hsize) return false;

  for(int i=0, h=0; h<hsize; ++h){
        if (c[i] == haystack[h]){
          if (i==c.size()-1) return true;
          ++i;
        }
  }
  return false;
}
int main(){
  sort(v.begin(), v.end(), lenComp);
  cout<<v; for (int i=v.size()-1; i>=0; --i){
    if (check1candate(v[i])){
        cout<<"winner is "<<v[i];
        return 0;
    }
  }
}

 

bbg-FX: in-place array shuffle #splice^evict

Update: In-Place usually requires 1) splice 2) swap.

Requirement: given a list like

{1,2,3,4,  25,29,44,33,  159,139,155,150,  177,176,169,180} which describes four Persons
{id1,id2,..age1,age2, weight1,weight2,…height1,height2,..}

Write a program to output each person’s full attributes like

{1, 25, 139, 177,   2, 29, 129, 176,   3, 24, 135, 169,   4, 33, 140, 180}

Note the number of attributes (in this case 4) is an input parameter to your program. You don’t know (or care about) the name of each attribute. If you are told there are 37 attributes in each person, just output attr1 .. attr37.

Note the input and output sequence are important.

Q (Bonus question): do it in-place without creating a new collection. You can still have a small number of variables in your code.
A: one way to “cheat” is to append each output person to end of the original vector. It will double in size, so we will truncate the first half.
A: I believe I can use the “homeless/displacement” technique.
A: I now feel many in-place array reshuffle can be done in place with a single temp variable but this may not be one of them.
A: how about starting with a linked list?

https://github.com/tiger40490/repo1/blob/cpp1/cpp/array/insitu_reshuffle_attrib_bbg.cpp has ..

bbg-FX: isBST() #no recursion #DeepakCM

Binary search tree has an “ascending” property — any node in my left sub-tree are smaller than me.

Q1: given a binary tree, check if it’s a BST. (No performance optimization required.) Compile and run the program.

Suppose someone tells you that the lastSeen variable can start with a random (high) initial value, what kind of test tree would flush out the bug? My solution is below, but let’s Look at Q1b.

Q1b: what if the tree could be deep so you can’t use recursion?
%%A: Use an explicit stack to convert the recursive to iterative algo. Same idea as Deepak
%%A: use BFT. When we actually “print” each node, we check greatest node in its left subtree (and least node in the right subtree). Now I think this algo is inefficient.
%%A: run a first BFT to add uplink to every node(elegant), perhaps using a hashtable. In 2nd BFT scan, when we print a payload, also upate every ancestor in terms of maxPayloadInLeftSubTree. So far, two fields added to each node. In another BFT scan, update the third field minPayloadInRightSubTree.


I made a mistake with lastSeen initial value. There’s no “special” initial value to represent “uninitialized”. Therefore I added a boolean flag.

Contrary to the interviewer’s claim, local statics are automatically initialized to zerohttps://stackoverflow.com/questions/1597405/what-happens-to-a-declared-uninitialized-variable-in-c-does-it-have-a-value, but zero or any initial value is unsuitable (payload can be any large negative value), so we still need the flag.

#include <iostream>
#include <climits>
using namespace std;

struct Node {
    int data;
    Node *le, *ri;
    Node(int x, Node * left = NULL, Node * right = NULL) : data(x), le(left), ri(right){}
};
/*    5
  2       7
 1 a
*/
Node _7(7);
Node _a(6);
Node _1(1);
Node _2(2, &_1, &_a);
Node _5(5, &_2, &_7); //root

bool isAscending = true; //first assume this tree is BST
void recur(Node * n){
//static int lastSeen; // I don't know what initial value can safely represent a special value

  //simulate a random but unlucky initial value, which can break us if without isFirstNodeDone
  static int lastSeen = INT_MAX; //simulate a random but unlucky initial value
  static bool isFirstNodeDone=false; //should initialize to false

  if (!n) return;
  if (n->le) recur(n->le);

  // used to be open(Node *):
  if (!isAscending) return; //check before opening any node
  cout<<"opening "<<n->data<<endl; 
  if (!isFirstNodeDone){ 
        isFirstNodeDone = true; 
  }else if (lastSeen > n->data){
        isAscending=false;
        return; //early return to save time
  }
  lastSeen = n->data;

  if (n->ri) recur(n->ri);
}

int main(){
  recur(&_5);
  cout<< (isAscending?"ok":"nok");
}

STL: few exceptions #no virtual

side note — virtual function is also very rare in STL classes. I don’t know any.


STL functions seldom throw exception. See P 248 [[c++standard library]]

  1. at() method of vector/string/map… throws, since it’s the “checked” version of the subscript operator
  2. reserve() method throws

Besides these two unusual functions, STL would only throw the standard low-level exceptions like memory allocation failures. All other error conditions are undefined-behavior, such as

  • top()/pop() on priority queue
  • std::string operator[index] with invalid index

However, Payload data types going into STL container can create exceptions:

  • If a payload class ctor/assignment throws, then it would propagate.
  • Payload destructors should never throw. [[c++coding standard]] says it’s forbidden by STL standard.

private dtor restricts instantiation to heap only

My test proves that declaration (even without definition) of a stack instance or global instance is illegal because compiler generates a (illegal) call to the private dtor!

Friend and static method can both also call the private dtor.

One usage of such a class — force all users to go trough factory to instantiate this class.

class PrivDtor{
  ~PrivDtor(){
        cout<<"dtor\n";
  }
public:
  static void destroy(PrivDtor* p){p->~PrivDtor();}
  static void del(PrivDtor* p){delete p;}
};

//PrivDtor globalInstance;  //won't compile
int main(){
  //PrivDtor stackInstance; //won't compile

  PrivDtor * p = new PrivDtor;
  //PrivDtor::del(p); // works!
  PrivDtor::destroy(p);  //works!
}

 

19obscure but recurr`c++QQ questions #halo

All QQ topics, not related to GTD or coding IV

  1. Throwing dtor
  2. how to avoid virtual functions (use enum?); CRTP
  3. Delete a pointer to base, but the actual object is a derived without virtual dtor. See my post https://bintanvictor.wordpress.com/2015/04/21/deleting-base-class-ptr-non-virt-dtor
  4. casting “this” pointer. See https://bintanvictor.wordpress.com/2017/11/15/crtp-1st-lesson/
  5. memory: q(delete this)
  6. memory: placement new and how to delete
  7. why division by zero is not an exception
  8. debug build modifying behavior
  9. IPC mutex; synchronization in shared mem
  10. memcpy
  11. ADL
  12. template template params

–socket programming interviews

  1. ##tcp tuning params
  2. non-blocking socket calls
  3. buffer full

non-blocking socket readiness: alternatives2periodic poll

Some Wells interviewer once asked me —

Q: After your non-blocking send() fails due to a full buffer, what can you do to get your data sent ASAP?

Simple solution is retrying after zero or more millisecond. Zero would be busy-weight i.e. spinning CPU. Non-zero means unwanted latency.

A 1st alternative is poll()/select() with a timeout, and immediately retry the same. There’s basically no latency. No spinning either. The linux proprietary epoll() is more efficient than poll()/select() and a popular solution for asynchronous IO.

2nd alternative is SIGIO. http://compgeom.com/~piyush/teach/4531_06/project/hell.html says it doesn’t waste CPU. P52 [[tcp/ip sockets in C]] also picked this solution to go with non-blocking sockets.

http://compgeom.com/~piyush/teach/4531_06/project/hell.html is actually a concise overview of several alternatives

  • non-blocking socket
  • select/poll/epoll
  • SIGIO
  • .. other tricks

class scope as a pseudo namespace #goog/effC++

https://google.github.io/styleguide/cppguide.html#Nonmember,_Static_Member,_and_Global_Functions

[[effC++]] P 120 has a concise chapter on namespace vs namespace-emulating class.

So when you see someName::someVar or someName::someClass or someName::someFunc, the “someName” may be a class.

When you see “someVar” without the prefix, don’t assume it’s in your local namespace. It could be typedef of someName::someVar !

mktData conflation: design question

I have hit this same question twice — Q: in a streaming price feed, you get IBM prices in the queue but you don’t want consumer thread AA to use “outdated” prices. Consumer BB needs a full history of the prices.

I see two conflicting requirements by the interviewer. I will point out to the interviewer this conflict.

I see two channels — in-band + out-of-band needed.

  1. in-band only — if full tick history is important, then the consumers have to /process/ every tick, even if outdated. We can have dedicated systems just to record ticks, with latency. For example, Rebus receives every tick, saves it and sends it out without conflation.
  2. out-of-band — If your algo engine needs to catch opportunities at minimal latency, then it can’t afford to care about history. It must ignore history. I will focus on this requirement.
  3. dual-band — Combining the two, if your super-algo-engine needs to analyze tick-by-tick history and also react to the opportunities, then the “producer” thread alone has to do all work till order transmission, but I don’t know if it can be fast enough. In general, the fastest data processing system is single-threaded without queues and minimal interaction with other data stores. Since the producer thread is also the consumer thread for the same message, there’s no conflation. Every tick is consumed! I am not sure about the scalability of this synchronous design. FIFO Queue implies latency. Anyway, I will not talk further about this stringent “combo” requirement.

https://tabbforum.com/opinions/managing-6-million-messages-per-second?print_preview=true&single=true says “Many firms mitigate the data they consume through the use of simple time conflation. These firms throw data on the floor based solely on the time that data arrived.”

In the Wells interview, I proposed a two-channel design. The producer simply updates a “notice board” with latest prices for each of 999 tickers. Registered consumers get notified out-of-band to re-read the notice board[1], on some messaging thread. Async design has a latency. I don’t know how tolerable that is. I feel async and MOM are popular and tolerable in algo trading. I should check my book [[all about HFT]]…

In-band only — However, the HSBC manager (Brian?) seems to imply that for minimum latency, the socket reader thread must run the algo all the way and send order out to exchange in one big function.

Out-of-band only — For slightly slower markets, two market-leading investment bank gateways actually publish periodic updates regardless how many raw input messages hit it. Not event-driven, not monitoring every tick!

  • Lehman eq options real time vol publisher
  • BofA Stirt Sprite publishes short-term yield curves on the G10 currencies.
    • Not EURUSD spot prices

[1] The notification can only contain indicative price numbers and serve to invite clients to check the notice board. If clients rely solely on the indicative, it defeats conflation and brings us back to a FIFO design.

java assignment-expression returns assigned value

A small halo in coding interviews 🙂

This feature is inherited from C.

  • eg: return singletonInstance = new MySingleton();
  • eg: a = b = 22;
  • eg: while ((line = reader.readLine()) != null)
  • eg (bigger):
    List<Object> processables;
    while ((processables = retrieveProcessableItems(..)).size() > 0) {/*/}

Incidentally, unlike C, java while-loop header can’t declare variables 😦  Solution — use for-loop 🙂

[20] how many cpu cycles per instruction

A 2019 GS-Hongkong interviewer pointed out that very few instructions can complete within one clock cycle. Indeed only simple add/move etc take one clock cycle, without pipelining:

https://www.d.umn.edu/~gshute/arch/performance-equation.xhtml#clocks-per-instruction says “Without instruction-level parallelism, simple instructions usually take 4 or more cycles to execute…. Pipelining (overlapping execution of instructions) can bring the average for simple instructions down to near 1 clock per instruction. Superscalar pipelining (issuing multiple instructions per cycle) can bring the average down to a fraction of a clock per instruction”

https://www.agner.org/optimize/instruction_tables.pdf is very detailed and up to date.

http://ithare.com/infographics-operation-costs-in-cpu-clock-cycles/ is an info-graphic

  • integer multiplication 3-6 cycles
  • integer division 12-44 cycles
  • L1 cache hit: 3-4
  • L2 cache hit: 10-12, about 3 times slower
  • L3 cache hit on the same core: 30-70, about 3 times slower
  • (L3 cache hit across sockets: 100-300, about 3 times slower )
  • main RAM read or L1/2/3 cache miss: 100-150, about 3 times slower than L3
  • ..main RAM read across socket : 300-500, about 3 times slower
  • CAS on single-core hardware and hitting L1 only: 15-30 cycles
  • .. about 5 times slower than L1 hit in Single-threaded-mode.
  • CAS on multi-core across sockets: 100-300 cycles, guesstimated by the author. Same as L3 hit across socket.
  • c/c++ function call: 15-30 cycles
  • c++ virtual function call: 30-60 cycles
  • heap/free-store/DAM {allocation+deallocation} pair for small objects: 200-500 cycles, more than 100 times slower than integer add/subtract (1 cycle)
  • thread context switch (including d-cache and i-cache invalidation): 10,000 – 1 million cycles

Note a socket is the seat on the mainboard to hold a physical chip. The seat has pins sticking out like an octopus. One socket holds one chip, which can hold multiple cores.

ibank interviewers: naive,arrogant..

In 2017 I interviewed with

  • * Pimco
  • Blomberg
  • * BGC partners
  • * ICE
  • * TradeWeb
  • Thesys (A trading software vendor)
  • Citadel
  • —-investment banks:
  • * BAML
  • * MS
  • Wells Fargo
  • GS
  • UBS

I now have a bias against investment banks. Wells is the most recent and most typical ibank interview:

  1. wishful thinking that if they are lucky to find a candidate who has used (for example) non-blocking-IO or shared-memory, he will be able to create such a system and make it WORK. I think in reality, many senior developers who have used such a technology professionally for a few years won’t be able to overcome the many technical challenges. I know from experience that we all struggle with a technology before we understand its limitations and pitfalls and figure out how to navigate around them. If you use an existing codebase that was fine-tuned years ago, then you have missed the “struggle” phase. You have to find opportunities to struggle with it (am trying now) to learn it. No pain no gain.
  2. They demand relevant domain knowledge and domain experience as a prerequisite.
  3. They have a long list of topics and they hope to find someone who scores well in most of them.
  4. Focus on in-depth, precise understanding of a few “cornerstone” topics such as threading memory model, and data structures. Such knowledge is very seldom needed in projects. We gain that knowledge by reading and interviewing and use it for interview only.
  5. For all other topics, they have a focus on textbook knowledge, slightly beyond skin-deep. These investment banks seem to believe if a candidate has such a knowledge, then he has experienced using it (or can learn it fast). Naïve thinking.

The non-banks (particularly Pimco, Bloomberg, BGC) are not “smarter interviewers”, but their focus is slightly different. They don’t need domain knowledge or a super long list of topics to rate a candidate. They don’t look for skin-deep “insight”. Many use coding tests, more often on a compiler than on paper.

Many non-banks (BGC, TradeWeb) also have the same high requirement on the same cornerstone topics above. Others focus on C++/java language details including low-level and tricky topics, more than the ibank interviews. My GS interview is like this.

tcp: one of 3 client-receivers is too slow

See also no overflow]TCP slow receiver #non-blocking sender

This is a real BGC interview question https://bintanvictor.wordpress.com/2017/04/08/bgc-iv-c-and-java/

Q: server is sending data fast. One of the clients (AA) is too slow.

Background — there will be 3 worker-sockets. The local address:port will look identical among them if the 3 clients connect to the same network interface, from the same network segment.

The set-up is described in simultaneous send to 2 tcp clients #mimic multicast

Note every worker socket for every client has identical local port.

See https://stackoverflow.com/questions/1997691/what-happens-when-tcp-udp-server-is-publishing-faster-than-client-is-consuming

I believe the AA connection/session/thread will be stagnant. At a certain point [1] server will have to remove the (mounting) data queue and release memory — data loss for the AA client.

[1] can happen within seconds for a fast data feed.

I also feel this set-up overloads the server. A TCP server has to maintain state for each laggard client, assuming single-threaded multiplexing(?). If each client takes a dedicated thread then server gets even higher load.

Are 5 client Connections using 5 sockets on server? I think so. Can a single thread multiplex them? I think so.

blocking scenario ] CPU-bound system

Q: can you describe a blocking scenario in a CPU-bound system?

Think of a few CPU bound systems like

  • database server
  • O(N!) algo
  • MC simulation engine
  • stress testing

I tend to think that a thread submitting a heavy task is usually the same thread that processes the task. (Such a thread doesn’t block!)

However, in a task-queue producer/consumer architecture, the submitter thread enqueues the task and can do other things or return to the thread pool.

A workhorse thread picks up the task from queue and spends hours to complete it.

Now, I present a trivial blocking scenario in a CPU bound system —

  • Any of these threads can briefly block in I/O if it has big data to send. Still, system is CPU-bound.
  • Any of these threads can block on a mutex or condVar

## Y avoid blocking design

There are many contexts. I only know a few.

1st, let’s look at an socket context. Suppose there are many (like 500 or 50) sockets to process. We don’t want 50 threads. We prefer fewer, perhaps 1 thread to check each “ready” socket, transfer whatever data can be transferred then go back to waiting. In this context, we need either

  • /readiness notification/, or
  • polling
  • … Both are compared on P51 [[TCP/IP sockets in C]]

2nd scenario — GUI. Blocking a UI-related thread (like the EDT) would freeze the screen.

3rd, let’s look at some DB request client. The request thread sends a request and it would take a long time to get a response. Blocking the request thread would waste some memory resource but not really CPU resource. It’s often better to deploy this thread to other tasks, if any.

Q: So what other tasks?
A: ANY task, in the thread pool design. The requester thread completes the sending task, and returns to the thread pool. It can pick up unrelated tasks. When the DB server responds, any thread in the pool can pick it up.

This can be seen as a “server bound” system, rather than IO bound or CPU bound. Both the CPU task queue and the IO task queue gets drained quickly.

 

concurrent lazy singleton using static-local var#c++

As explained in another blog post, static local is a shared mutable, but fortunately initialization is thread-safe:

“If multiple threads attempt to initialize the same static local variable concurrently, the initialization occurs exactly once (similar behavior can be obtained for arbitrary functions with std::call_once).” — https://en.cppreference.com/w/cpp/language/storage_duration

https://stackoverflow.com/questions/26013650/threadsafe-lazy-initialization-static-vs-stdcall-once-vs-double-checked-locki has 20 upvotes and looks substantiated. It also considers double-checking, std::call_once, atomics, CAS…

GCC uses platform-specific tricks to ensure a static local variable is initialized only once, on first use. 

Conclusion — On GCC and other compilers, this is the easiest solution for a concurrent lazy singleton.

 

no overflow]TCP slow receiver #non-blocking sender

Q: Does TCP receiver ever overflow due to a fast sender?

A: See http://www.mathcs.emory.edu/~cheung/Courses/455/Syllabus/7-transport/flow-control.html

A: should not. When the receiver buffer is full, the receiver sends AdvertizedWindowSize to informs the sender. If sender app ignores it and continues to send, then sent data will remain in the send buffer and not sent over the wire. Soon the send buffer will fill up and send() will block. On a non-blocking TCP socket, send() returns with error only when it can’t send a single byte. (UDP is different.)

Non-block send/receive operations either complete the job or returns an error.

Q: Do they ever return with part of the data processed?
A: Yes they return the number of bytes transferred. Partial transfer is considered “completed”.

 

UDP/TCP socket read buffer size: can be 256MB

For my UDP socket, I use 64MB.
For my TCP socket, I use 64MB too!

These are large values and required kernel turning. In my linux server, /etc/sysctl.conf shows these permissible read buffer sizes:

net.core.rmem_max = 268435456 # —–> 256 MB
net.ipv4.tcp_rmem = 4096   10179648   268435456 # —–> 256 MB

Note a read buffer of any socket is always maintained by the kernel and can be shared across processes [1]. In my mind, the TCP/UDP code using these buffers is kernel code, like hotel service. Application code is like hotel guests.

[1] Process A will use its file descriptor 3 for this socket, while Process B will use its file descriptor 5 for this socket.

c++template^java generics #%%take

A 2017 Wells Fargo interviewer asked me this question. There are many many differences. Here I list my top picks. I feel c# is more like java.

  1. (1st word indicates the category winner)
  2. C++ TMP is quite an advanced art and very powerful. Java generics is useful mostly on collections and doesn’t offer equivalents to most of the TMP techniques.
  3. java List<Student> and List<Trade> shares a single classfile, with uniform implementation of all the methods. In c++ there are distinct object files. Most of the code is duplicated leading to code bloat, but it also supports specialization and other features.
  4. java generics supports extends/super. C# is even “richer”. I think c++ can achieve the same with some of the TMP tricks
  5. c++ supports template specialization
  6. C++ wins — java doesn’t allow primitive type arguments and requires inefficient boxing. C# improved on it. This is more serious than it looks because most c++ templates use primitive type arguments.
  7. c++ supports non-dummy-type template param, so you can put in a literal argument of “1.3”
  8. c++ actual type argument is available at runtime. Java erases it, but I can’t give a concrete example illustrating the effect.

 

3rd effect@volatile introduced@java5

See also https://bintanvictor.wordpress.com/wp-admin/post.php?post=36111&action=edit&classic-editor

A Wells Fargo java interviewer said there are 3 effects. I named

  1. load/store to main memory on the target variable. C++ volatile has a somewhat similar effect. See the comparisonTable
  2. disable statement reordering

I think interviewer mentioned a 3rd effect about memory barrier.

This quote from Java Concurrency in Practice, chap. 3.1.4 may be relevant:

The visibility effects of volatile variables extend beyond the value of the volatile variable itself. When thread A writes to a volatile variable and subsequently thread B reads that same variable, the values of all variables that were visible to A prior to writing to the volatile variable become visible to B after reading the volatile variable. So from a memory visibility perspective, writing a volatile variable is like exiting(STORE/flush?) a synchronized block and reading a volatile variable is like entering (LOAD?) a synchronized block.

— “volatile” on a MyClass field provides special concurrency protection when the MyClass field is constructed :

java reference-type volatile #partially constructed has good details.

In https://stackoverflow.com/questions/9169232/java-volatile-and-side-effects , Nialscorva’s answer echoes the interviewer:

Before java 1.5, the compiler can reorder the two steps

  1. construction of the new object
  2. assigning the new address to the variable

In such a scenario, other threads (unsynchronized) can see the address in the variable and use the incomplete object, while the construction thread is preempted indefinitely like for 3 hours!

So in java 1.5, the construction is among the “other writes” by the volatile-writing thread! Instance construction often requires writing to some instance fields. Therefore, the construction is flushed to memory before the address assignment. Below is my own solution, using a non-static volatile field:

public class DoubleCheckSingleton {
	private static DoubleCheckSingleton inst = null;
	private volatile boolean isConstructed = false;
	private DoubleCheckSingleton() {
		/* other construction steps */
		this.isConstructed = true; //last step
	}
	DoubleCheckSingleton getInstance() {
		if (inst != null && inst.isConstructed) return inst;
		synchronized(DoubleCheckSingleton.class) {
			if (inst != null && inst.isConstructed) return inst;
			
/**This design makes uses of volatile feature that's reputed to be java5
*
* Without the isConstructed volatile field, an incomplete object's 
* address can be assigned to inst, so another thread entering getInstance()
* will see a non-null inst and use the half-cooked object 😦
* 
* The isConstructed check ensures the construction has completed
*/
			return inst = new DoubleCheckSingleton();
		}
	}
}

Wells IV #socket/c++/threading

This is one of the longest tech interviews and one of the most enriching 🙂 even though I don’t “like” all their questions — very academic/theoretical, text-book driven, too low-level to be practical, not testing practical zbs.

Q: Any consistently reliable technique to detect stale order in your book?

Q: what’s your frame size if your tcp/ucp buffers sizes are so large?

Q: experience storing ticks?

—— C++ questions:
Q1: how can you avoid the cost of virtual function?
%%A: enum/switch or CRTP

Q1b: what’s CRTP

Q: what if your copy ctor signature is missing the “const”?
%%A: you can’t pass in temporaries or literal values. Such arguments must pass into “const &” parameter or “const” parameter. Correct.

Q: double delete?

Q: policy class? traits class?

Q: STL binders.. use case?

Q: how many types of smart pointers do you use?

Q: difference between java generics vs c++ templates?
%%A: type erasure. Java Compiler enforces many rules, but bytecode saves no info about the type argument, so we can get strange runtime errors.
%%A: template specialization. Template meta-programming
%%A (accepted): code bloat since each template instantiation is a separate chunk of code in the object file and in memory.
A: I have a dedicated blog post on this.

Q: what’s returned by a std::queue’s dequeue operation when it’s empty?
AA: undefined behavior, so we must check empty() before attempting dequeue. I believe ditto for a std::stack

Q: why template specialization?
%%A: customize behavior for this particular vector since the standard implementation is not suitable.

Q: how do you implement a thread-safe c++singleton
A: not obvious. See concurrent lazy singleton using static-local var

Q12: in a simple function I have
vector v1 = {“a”, “b”}; vector v2 = v1; cout<<….
What happens to the ctor, dtor etc?
A: 2 std::strings constructed on heap, vector constructed on stack; 2nd vector copy-constructed on stack; 2 new strings constructed on heap; vector destructors deletes all four strings
A: the actual char array is allocated only once for each actual value, due to reference counting in the current std::string implementation.

Q12b: you mean it’s interned?

Coding question: implement void remove_spaces(char * s) //modify the s char array in place. See %%efficient removeSpaces(char*) #Wells

—— threading, mostly in java
Q: What are the problems of CAS solutions?
A: too many retries. CAS is optimistic, but if there are too many concurrent writes, then the assumption is invalid.
%%A: missed update? Not a common issue so far.

%%Q: Synchronized keyword usage inside a static method?
AA: you need be explicit about the target object, like synchronized(MyClass.class)

Q21: Name 3 effects of java volatile keyword — Advanced. See 3effects@volatile ] java5.

Q21b: analyze the double-checking singleton implementation.
staticInst = new Student(); // inside synchronized block, this can assign an incomplete object’s address to staticInst variable, which will be visible to an unsynchronized reader. Solution – declare staticInst as volatile static field

—— System programming (ANSI-C) questions
Q: have you used kernel bypass?

Q5: how do you construct a Student object in a memory-mapped-file?
%%A: placement new

Q5b: what if we close the file and map it again in the same process?
%%A: the ptr I got from malloc is now a dangling ptr. Content will be intact, as in Quest

Q5c: what if Student has virtual functions and has a vptr.
%%A: I see no problem.

Q: you mentioned non-blocking TCP send(), so when your send fails, what can you do?
%%A: retry after completing some other tasks.

Q4: why is UDP faster than TCP
%%A: no buffering; smaller envelopes; no initial handshake; no ack;
%%A: perhaps most important for market data — multiple recipient with TCP is very heavy on sender and on network

Q4b: why buffering in tcp?
%%A: resend

Q: can tcp client use bind()?
%%A: bind to a specific client port, rather than a random port

Q6: is there socket buffer overflow in TCP?
A: probably not

Q6b: what if I try to send a big file by TCP, and the receiver’s buffer is full? Will sender care about it? Is that reliable transmission?
A: Aquis sends 100MB file. See no overflow]TCP slow receiver #non-blocking sender

Q: in your ticking risk system, how does the system get notified with new market data?
%%A: we use polling instead of notification. The use case doesn’t require notification.

Q: any network engineering techniques?

Q: what kernel parameters did you tune?

 

—— Stupid best-practice questions:
Q: what’s the benefit of IOC?

Q: Fitnesse, mock objects

Question on j4 factory. See separate blog post.

 

java lock: atomic^visibility

You need to use “synchronized” on a simple getter, for the #2 effect.!

 

A typical lock() operation in a java method has two effects:

  1. serialized access — unless I release the lock, all other threads will be blocked grabbing this lock
  2. memory barrier — after I acquire the lock, all the changes (on shared variables) by other threads are now visible. The exact details are .. to be detailed.

I now feel the 2nd effect is often more important (and more tricky) than the 1st effect. See P94 [[Doug Lea]] . I like this simple summary, even if not 100% complete —

“In essence, releasing a lock forces a flush of all writes from working memory….and acquiring a lock forces reload of accessible fields.”

Q: what are the subset of “accessible fields” in a class with 9 fields?
A: I believe the compiler knows “in advance” what subset of fields will be accessed after lock acquisition.

Q: what if I acquire a lock, do nothing and release the lock? Is there the #2 effect?
A: I doubt it. You need to enclose all the “tricky” operations between a lock grab/release. If you leave some update (to a shared mutable) outside in the “cold”, then #1 will fail and #2 may also fail.

 

first N prime Fibonacci: cache server #Thesys

The challenge is not in Fib or prime, but in the algo around them.

//This version is memory efficient as it only remembers the last
//two Fib numbers + all previous prime Fib numbers

//Requirement: For a given input number X, print
//the first X prime Fib numbers. Each output is a Fib and prime
//
//This version is memory efficient as it only remembers the last
//two Fib numbers + all previous prime Fib numbers
//
//The typedef of Fib is nice documentation
//
//My version of isPrime() is simpler
#include <iostream>
#include <string>
#include <vector>
#include <cstdlib>
using namespace std;

typedef unsigned int Fib;
bool isPrime(int i){
        if (i <= 1) return false;
        for(int j=2; j<=i/2; ++j) {
            if(i%j==0) return false;
        }
        return true;
}

pair<Fib,bool> generate1Fib(){
  static Fib parent = 1;
  static Fib grandparent = 0;
  Fib newFib = parent + grandparent;
  grandparent = parent;
  parent = newFib;
  return make_pair(newFib, isPrime(newFib));
}

void print1stXFibPrimes(size_t const X){
    static vector<Fib> fibPrimes;
    int count = 0;
    for(int j=0; j<fibPrimes.size(); ++j){
                cout<<fibPrimes[j]<<" ";
                if (++count >= X) return;
    }

    while( count<X ) {
          pair<Fib,bool> new1 = generate1Fib();
          if (new1.second){
                fibPrimes.push_back(new1.first);
                cout<<new1.first<<" ";
                ++count;
          }
    }
}
inline bool isInteger(const string & s) {
   if(s.empty() || ((!isdigit(s[0])) && (s[0] != '-') && (s[0] != '+'))) return false ;
   char * p ;
   strtol(s.c_str(), &p, 10) ;
   return (*p == 0) ;
}
int main(int argc, char *argv[]){
  for(int i=1; i<argc; ++i){
    string arg(argv[i]);
    if (isInteger(arg)){
      cout<<"calculate the first "<<arg<<" prime fibs"<<endl;
      print1stXFibPrimes( stoi(arg)) ;
      cout<<endl;
    }else
      cout<<arg<<" is not an integer\n";
  }
}

#include <xtap/PluginConfig.h> trick

I have seen in many large systems:

The actual path to the header is …/shared/tp_xtap/include/PluginConfig.h, but develoeprs prefer an abbreviated include like #include <xtap/PluginConfig.h>.

#1) Here’s one simple implementation:

ls -l shared/tp_xtap/include/ # should show a symbolic link to be created automatically:

    xtap -> ./

Therefore, -I/full/path/to/shared/tp_xtap/include/ will resolve #include <xtap/PluginConfig.h>

#2) I guess a second-best solution is code generation. Checked-in source file has #include <xtap/PluginConfig.h> but the build system follows configured rewrite-rules, to convert it into #include <some/other/path>

jmp_buf/setjmp() basics for IV #ANSI-C

Q: how is longjmp different from goto? See http://ecomputernotes.com/what-is-c/function-a-pointer/what-is-the-difference-between-goto-and-longjmp-and-setjmp

A: longjmp can 1) jump across functions, and 2) restore program state from a jmp_buf, which was saved earlier by setjmp.

fileHandle/socket/dbConection are thread Unsafe

There’s a buffer in a DB connection, in a file handle, in a socket …

The buffer is a shared mutable object. Consider a read-buffer. The host object knows how much of the data in the buffer was already delivered, to avoid repeated delivery. There’s some kind of watermark, which is moved by a consumer thread.

As all shared mutables, these objects are thread unsafe.

All of these objects can also be allocated on stack and therefore invisible to other threads. Therefore, this could be the basis of a thread-safe design.

 

##how fast I figurED things out relative2teammates

GS — Initially I was much slower than my colleagues. After 18 months I was 80% as fast as them. Everyone needed to learn and memorize lots of details in a big system. Without that “mileage” you can’t be as fast as the rest.

I relied on my colleagues to give me pointers and detailed explanations. Otherwise it was too many details to wade through by one self.

Citi — no comparison — No “team mates” per-se. I had two old-timer colleagues who were very familiar with the system. I was not benchmarked against them.

I didn’t need to understand the ins and outs of the 5-year old codebase. I never became productive with it.

95G — not slower — i joined this greenfield project from beginning (luckily). I was as fast as the other consultant. Also, I designed a few key components, so I had the “original designer” advantage.

Barclays — not slower — i joined this greenfield project from very beginning (luckily), replacing a half-finished c# system. I had very few “team peers” to benchmark against. I designed one of the most valuable and complex components, so I had the “original designer” advantage. Huge advantage. I basically name my price when it came to effort estimates.

I did read some old code to replicate the logic, but I was fairly fast, and no team member ever read the same codebase anyway.

OCBC — not slower — I was seen as slower, but actually I became 70% to 90% as fast as the other team members. I had to spend many long hours on the existing codebase but it’s a small codebase (1 man-year) so I was able to get a hang of it over a few months.

I also designed some minor modules, where I had the “original designer” advantage. The other c# developers had to come to me and understand my designs.

Stirt — I was much slower than the old-timers, but I was not slower than the new joiners. I should have given up earlier. It was hopeless. Codebase was rather opaque and you need to ask many people to understand it.

Other new joiners seemed to ask only within their own team, but I stick out my neck and asked all around my team. It may have made me look slower and /greener/, but I was trying to learn faster. Did I learn faster than them? Not enough time to show.

Macquarie — The benchmark was my predecessor + my boss. I think I was much slower than them. My boss only complained about it in the 2nd year, months after paying me a small bonus.

RTS — not slower — I’m probably faster than some team members. I feel confident how fast I can figure things out now. There are many team members to compare, so from Day 1 it was crucial to hit the average standard.

GTD skill is harder,lasts longer in c++ than in Cleaner languages

In terms of troubleshooting, C++ is 90% same as C, which is a low-level language, close to the hardware.

In contrast, higher level languages strive to have the low level details encapsulated, so developers only need to deal with a simplified, standardized, cleaner façade. Some call it a virtualization.

Eg: sockets

Eg: c++ threading vs java threading

return-pbclone large obj(eg:vector)rely`@move

This is impressive in QQ interviews + coding questions

GotW #90 Solution: Factories

has a good illustration of move semantics put to good use.

  • Before c++11, a function returning a large vector (or any large object) by value incurs expensive deep copying of all vector elements.
  • With c++11 move features added to std::vector class, returning a vector by value is cheap and recommended.
  • RVO may kick in but (i doubt it) less reliable than move semantic. For the specific rules see RVO^move-semantics

declare iterator]function template #gotcha

(Needed in some coding interviews and also in GTD!)

Update: With c++11, you can use the “auto” keyword and avoid the complexity.

If you drop the “typename” from the for-loop header, then compiler is confused

error: dependent-name ‘std::multiset::iterator’ is parsed as a non-type, but (template) instantiation yields a type
note: say ‘typename std::multiset::iterator’ if a type is meant

Basically, we need to be extra explicit to the confused compiler.

template<typename T> ostream & operator<<(ostream & os, multiset<T> const & l){
  for(typename multiset<T>::iterator it = l.begin(); 
      it != l.end(); ++it){
        os<<*it<<" ";
  }
  os<<endl;
}

populate map<string,list> no list cloning #make_shared

I feel a few things are fairly impressive for a short coding interview:

  • no leak.
  • no explicit new(). make_shared allocates the control block efficiently

In c++11, we could probably return the list by value, and rely on move semantics.

Here’s an inefficient nested container. Saving pointer-to-list in the map is more common, because inserting or reading the map doesn’t need to clone the list!

//tested with -std=c++0x
unordered_map<string, list<size_t> > lookup;

shared_ptr<list<size_t> > myfunc(){
  shared_ptr<list<size_t> > ret= make_shared<list<size_t> >();
  //...
  return ret; //return shared_ptr by clone, cheaply
}

lookup.insert(make_pair(word, *myfunc())); //dereference to get the list

Codility: ignore overflow+bigO

The machine won’t check your bigO!

Focus on getting it to work. If you have time, address the overflow. If you get bogged down by the overflow concerns, you lose focus and lose speed.

–Here are my practical strategy for codility

  1. Don’t aim for high marks, since these tests are very hard-to-please and not very popular.
  2. Get friends to help crack it.
  3. Remember that ROTI is lower than other coding tests like Bloomberg style.  ECT and thinking speed is way too demanding.

## +ve keywords]%%annual appraisal

I really don’t want to care too much about manager’s comments. Those comments tend to hurt deeply. They often reflect the manager’s personal agenda, never a balanced/unbiased view. The keyword list is designed to counter some of the negative comments.

  • [zed, catcha] versatile, broad-based
  • [GS] ownership
  • [GS] client relationship
  • [GS, Chartered] attention to details
  • [GS] code quality. I was asked to present and publish
  • [GS, Mansion] technical strength
  • [GS, Barc] analytical
  • well-planned
  • [Barc, GS] knowledge sharing
  • [Barc, GS] personal sacrifices
  • [95G] architecture design
  • [Mac] good at team working across departments
  • [Mac] adaptive to new tech; fast learning

Thesys QnA IV

Q: what’s the average O() complexity of insertion sort
AA: N^2 in most implementations — comparison-based, but using contiguous memory. NlogN if using skip list

Q: what’s the average O() complexity of heap sort
AA: NlogN, since it’s comparison-based.

Q: what are the synchronization primitives on the operating systems you know

Q: how many levels of caches are in a typical Intel processor
%%A: at least 3 possibly 4.
AA: some intel processors do have 4 levels

Q: name some IPC mechanisms
%%A: 1) shared mem 2) sockets 3) pipes

–2nd round (Don Bush?)

Q: Requirement: insert into a std::map only if key is absent
AA: the insert() function does that by default! Returns a pair<Iterator-where-inserted, bool-isInserted>. I used this same technique in my handleInputInternal()

Q: if my class Student has a private dtor, will it compile?
%%A: yes it’s possible to define a member function to invoke the dtor on a pointer passed in
%%A: deleting a ptr-to-Student won’t compile. Correct!
AA: have a friend call the dtor!

Q: Any guess why STL stack designer didn’t combine top() and void pop() like in other languages? hint: what if the copy ctor throws?
%%A: it’s more efficient if a function doesn’t need to return anything, so we don’t want to force the pop() users to pay a price for something they don’t need.

Q: do you create your own templates?
A: I should name the SFINAE hack!

Q: any restriction on a non-type template parameter’s type?
%%A: int and bool are common. Pointers are allowed. https://www.ibm.com/support/knowledgecenter/en/SSLTBW_2.2.0/com.ibm.zos.v2r2.cbclx01/non-type_template_parameters.htm shows even arrays and ptr-to-functions are allowed.
AA: has to be a discrete type, not a float

%%Q: what’s the usage of this non-type parameter?
A: std::array uses it for the size. It can’t be a ctor argument since if you construct an array on stack the size must be known at compile time.

Q: can you use new-arrays with shared_ptr? I think he wanted vector<shared_ptr>
%%A: if I have a new-array to store somewhere, then I think shared_array or scoped_array
A: how about shared_ptr<vector>?

Q: what if ctor failes in Foo * p = new Foo() // will memory leak?
%%A: i know it’s common practice to throw from ctor…

Q: name some c++11 features you used in your recent projects
%%A: unique_ptr, shared_ptr, auto, unordered_map, initializer list for vector/map

%%3 GTD concerns as lead dev #instru/code reading..

Below are my biggest concerns when considering a lead dev role. Actually, many of my peers often hints “My role (as a mgr, or architect, or app owner..) is even harder” but I think they are Not more effective if asked to play the hands-on developer role.

  1. code reading — requires focus and stamina.
    • In addition, taking ownership of a big code base requires determination and confidence
  2. instrumentation — using prints or special tools, troubleshooting, error reproducing, ..
  3. pressure due to major feature not completed by impending deadlines

Note none of these is tested in any interview!

  • How about firefighting? I tend to panic but my peers aren’t stronger than me technically!
  • How about design? Not really my weakness. My peers’ designs are not necessarily better than mine
  • knowledge of generic tech (rather than local system knowledge)? I’m actually stronger than most peers
  • quality, attention to details? is usually my strength
  • fine tuning? Most peers are not good at this
(not ranked) bonus peers are .. generic or local nlg?
code reading no impact some are stronger  local sys nlg
investigate boss needs answers usually no diff  local
deadline yes ?  local
firefighting yes not stronger  local
design somewhat visible not stronger  some generic

top N active stocks #bbg#cache,VO

Requirement: implement top_n(int N): top N most active stocks

Requirement 2: add a top_n_last_x_minute(int N, int X): top N that are last traded in the last X millisec

This program also demonstrates how to avoid storing the same string twice, as map key and also as field of the map value

//
#include <unordered_map> //requires -std=c++0x
#include <iostream>
#include <vector>
#include <string>
#include <algorithm>
#include <chrono>
#include <assert.h>
using namespace std;

struct Rec{
  string const * strPtr; //a best-effort to avoid storing duplicate strings
  int cumvol; //cum vol
  long lastUpd; //millsec since epoch

  Rec(int vol): strPtr(NULL), cumvol(vol),
    lastUpd( (chrono::system_clock::now().time_since_epoch()).count()) { }

  bool operator<(Rec const & other) const{
        return this->cumvol < other.cumvol;
  }
  friend ostream & operator<<(ostream & os, Rec const & r){
        os<<r.lastUpd<<"-"<<*(r.strPtr)<<"  : "<<r.cumvol;
        return os;
  }
};

typedef unordered_map<string, Rec>::iterator ITR;

class App{
  unordered_map<string, Rec> lookup;
  vector<Rec> v;
  bool dirty;
public:
  App(): dirty(true){}
  void update(string const & symbol, int vol){
    this->dirty = true;
    ITR it = lookup.find(symbol);
    if (it == lookup.end()){
        pair<ITR, bool> pair = lookup.insert(make_pair(symbol, Rec(vol)));
        assert(pair.first->second.strPtr == NULL);
        pair.first->second.strPtr = //second is the valye in the pair
          & pair.first->first; //the key of the new pair
    }else{
        assert(it->second.strPtr != NULL);
        it->second.cumvol += vol;
        it->second.lastUpd = (std::chrono::system_clock::now().time_since_epoch()).count();
    }
  }
  void top_n(int N, unsigned long X=-1){ //unsigned -1 gives the largest positive value!
    size_t sz=lookup.size();
    if (dirty){
      cout<<"resorting due to updates to database ..."<<endl;
      v.clear();
      for(ITR i=lookup.begin(); i!= lookup.end(); ++i){
        v.push_back(i->second);
      }
      sort(v.begin(), v.end());
      dirty = false;
    }
    long now = (std::chrono::system_clock::now().time_since_epoch()).count();
    //cout<<now<<" is now"<<endl;
    for(int k=sz-1, count=0; k>=0; --k){
        if (now - v[k].lastUpd > X) continue;
        cout<<v[k]<<endl;
        if (++count >= N) break;
    }
  }
};
int main(){
  App app;
  app.update("ibm", 1000);
  app.update("gs", 700);
  app.update("gs", 500);
  app.update("goog", 600);
  app.update("msft", 800);
  app.update("ge", 500);
  app.update("ibm", 160);
  app.top_n(6,29);
  app.update("c", 400);
  app.top_n(3,39);
}

std::string is COW-reference-counted now but should change]c++11

http://www.drdobbs.com/cpp/c-string-performance/184405453 (circa 2003) explains that COW speeds up string copy and string destruction.

https://stackoverflow.com/questions/12520192/is-stdstring-refcounted-in-gcc-4-x-c11

This proved a small halo.

However, gcc 5.1 introduced a breaking change. https://gcc.gnu.org/onlinedocs/libstdc++/manual/using_dual_abi.html says These changes were necessary to conform to the 2011 C++ standard which forbids Copy-On-Write strings

Q: why forbidden?
A: thread-safety … See [[std c++lib]] P692

TCP/UDP: partial or multiple messages in one buffer

This is often mentioned in IV. At least you can demonstrate your knowledge.

What if the UDP datagram is too big for recv() i.e. specified buffer length is too small? P116 [[tcp/ip soclets in C]] seems to say the oversize message is silently truncated.

UDP recv() will only return a single “logical” message [1]. I believe TCP can put partial or multiple messages into one “buffer” for recv().

Q: if my buffer is big enough, will my UDP recv() ever truncate a msg?
%%A: never

Note IP would always deliver a whole msg or miss a whole msg, never a partial msg. See P 329 [[comp networking]]

[1] a logical msg is the payload from one send()

## threading: a better specialization than algo,quant…

  • In summary — appears daunting and opaque but actually simple in practice
  • theoretical — like [[dougLea]].. my strength. Few coding tests and usually easy for me
  • fairly low-level — my strength. Not as low level as debugger, or template hacks …
  • GTD — no GTD challenges, since most projects use only simple threading designs. The code tracing, trouble-shooting expectation on me is relatively low.
  • opaque — for my peers. Even the basic condVar…
  • churn — low churn at the low level, but high churn at high level
  • ever-green — favorite topic of interviewers, esp. java
  • thin->thick->thin — I have achieved this for a long time.

Many of my halos are in this domain — ##halo across domains #specific=better