non-blocking socket readiness: alternatives to periodic polling

Some interviewer once asked me —

Q: After your non-blocking send() fails due to a full buffer, what can you do to get your data sent ASAP?

Simple solution is retrying after 0 or more millisecond. Zero would be CPU spinning. Non-zero means unwanted latency.

A 1st alternative is poll()/select() with a timeout, and immediately retry the same. There’s basically no latency. No spinning either. The linux proprietary epoll() is more efficient than poll()/select() and a popular solution for asynchronous IO

2nd alternative is SIGIO. says it doesn’t waste CPU. P52 [[tcp/ip sockets in C]] also picked this solution to go with non-blocking sockets.



conflation ] stream`mkt data: dilemma

I have hit this same question twice — Q: in a streaming price feed, you get IBM prices in the queue but you don’t want the consumer thread to use those outdated prices. Your consumer also needs a history of the prices.

I see two conflicting requirements by the interviewer. I will point out to the interviewer this conflict.

  1. if full tick history is important, then the consumer have to /process/ every tick, even if outdated. We can have dedicated system just to record ticks, with non-optimal latency. For example, Rebus receives every tick, saves it and sends it out without conflation.
  2. If your algo engine needs to react to opportunities with minimal latency, then it can’t afford to care about history. It must ignore history. I will focus on this requirement.
  3. Combining the two, if your algo engine needs to analyze tick-by-tick history and also react to the opportunities, then the producer thread alone has to do all the leg work but I don’t know if it can be fast enough. In general, the fastest data processing system is single-threaded without queues and minimal interaction with other data stores. FIFO Queue implies latency. Anyway, I will not talk about this stringent “combo” requirement. says “Many firms mitigate the data they consume through the use of simple time conflation. These firms throw data on the floor based solely on the time that data arrived.”

In the Wells interview, I proposed a design that bypasses the tick queue. The producer simply updates a “notice board” with latest prices for each of 999 tickers. Registered consumers get notified to re-read the notice board[1], on some messaging thread. Async design has a latency. I don’t know how tolerable that is. I feel async and MOM are popular and tolerable in algo trading. I should check my book [[all about HFT]]

However, the HSBC manager (Brian?) seems to imply that for minimum latency, the producer thread must drive the algo all the way and send order out to exchange in one big function. Since the producer thread is also the consumer thread for the same message, there’s no conflation. Every tick is consumed! I am not sure about the scalability of this synchronous design.

[1] The notification should not contain the latest price. Doing so defeats conflation and puts us back to a queuing system.

java assignment-expression returns assigned value

A small halo in coding interviews 🙂

This feature is inherited from C.

  • eg: return singletonInstance = new MySingleton();
  • eg: a = b = 22;
  • eg: while ((line = reader.readLine()) != null)
  • eg (bigger):
    List<Object> processables;
    while ((processables = retrieveProcessableItems(..)).size() > 0) {/*/}

Incidentally, unlike C, java while-loop header can’t declare variables 😦  Solution — use for-loop 🙂

ibank tech IV: naive,arrogant..

In 2017 I interviewed with

  • * Pimco
  • Blomberg
  • * BGC partners
  • * ICE
  • * TradeWeb
  • Thesys (A trading software vendor)
  • Citadel
  • —-investment banks:
  • * BAML
  • * MS
  • Wells Fargo
  • GS
  • UBS

I now have a bias against investment banks. Wells is the most recent and most typical ibank interview:

  1. wishful thinking that if they are lucky to find a candidate who has used (for example) non-blocking-IO or shared-memory, he will be able to create such a system and make it WORK. I think in reality, many senior developers who have used such a technology professionally for a few years won’t be able to overcome the many technical challenges. I know from experience that we all struggle with a technology before we understand its limitations and pitfalls and figure out how to navigate around them. If you use an existing codebase that was fine-tuned years ago, then you have missed the “struggle” phase. You have to find opportunities to struggle with it (am trying now) to learn it. No pain no gain.
  2. They demand relevant domain knowledge and domain experience as a prerequisite.
  3. They have a long list of topics and they hope to find someone who scores well in most of them.
  4. Focus on in-depth, precise understanding of a few “cornerstone” topics such as threading memory model, and data structures. Such knowledge is very seldom needed in projects. We gain that knowledge by reading and interviewing and use it for interview only.
  5. For all other topics, they have a focus on textbook knowledge, slightly beyond skin-deep. These investment banks seem to believe if a candidate has such a knowledge, then he has experienced using it (or can learn it fast). Naïve thinking.

The non-banks (particularly Pimco, Bloomberg, BGC) are not “smarter interviewers”, but their focus is slightly different. They don’t need domain knowledge or a super long list of topics to rate a candidate. They don’t look for skin-deep “insight”. Many use coding tests, more often on a compiler than on paper.

Many non-banks (BGC, TradeWeb) also have the same high requirement on the same cornerstone topics above. Others focus on C++/java language details including low-level and tricky topics, more than the ibank interviews. My GS interview is like this.

blocking scenario ] CPU bound system

Think of a few CPU bound systems like

  • database server
  • MC simulation engine
  • stress testing

I tend to think that a thread submitting a heavy task is usually the same thread that processes the task. Such a thread doesn’t block!

In a task-queue producer/consumer architecture, the submitter thread enqueues the task and can do other things or return to the thread pool. A processor thread picks up the task from queue and spends hours to complete it. There is nothing else to do on this task. Again, no blocking here.

Here’s a trivial blocking scenario in a CPU bound system — Any of these threads can block in I/O.

##why avoid blocking design

There are many contexts. I only know a few.

1st, let’s look at an socket context. Suppose there are many (like 500 or 50) sockets to process. We don’t want 50 threads. We prefer fewer, perhaps 1 thread to check each “ready” socket, transfer whatever data can be transferred then go back to waiting. In this context, we need either

  • /readiness notification/, or
  • polling
  • … Both are compared on P51 [[TCP/IP sockets in C]]

2nd scenario — GUI. Blocking a UI-related thread (like the EDT) would freeze the screen.

3rd, let’s look at some DB request client. The request thread sends a request and it would take a long time to get a response. Blocking the request thread would waste some memory resource but not really CPU resource. It’s often better to deploy this thread to other tasks, if any.

Q: So what other tasks?
A: ANY task, in the thread pool design. The requester thread completes the sending task, and returns to the thread pool. It can pick up unrelated tasks. When the DB server responds, any thread in the pool can pick it up.

This can be seen as a “server bound” system, rather than IO bound or CPU bound. Both the CPU task queue and the IO task queue gets drained quickly.


%%efficient removeSpaces(char*) #Wells

Wells Fargo IV white-board coding question.

//Avoid strlen() which has O(N) cost
//Avoid unnecessary char-copying in the initial part since most strings start with non-spaces.

void removeSpaces(char * s){
  int i=0;
  for(; !isspace(s[i]); ++i){
    if (s[i] == 0) return;

  assert(isspace(s[i]) &&
"This should be the first space. Before this position, we have saved unnecessary char-copying");

  for(int front=i; s[front]!=0; ++front){
    if (isspace(s[front])) continue;
      s[i++] = s[front];
int main(){
  char s[]="  at  b c dd ";

concurrent lazy singleton using static-local var has 12 upvotes and looks substantiated. It also considers double-checking, std::call_once, atomics, CAS…

GCC uses platform-specific tricks to ensure a static local variable is initialized only once, on first use. 

If it works, this is the easiest solution.

As explained in another blog post, static local is a shared mutable.