Q: which linux c++thread is stuck #CSY

This is a typical “c++ecosystem question”. It’s not about c++ or C; it’s about linux instrumentation tools.

Q1: Given a multi-threaded server, you see telltale signs that the process is stuck, and you suspect only one of the threads is stuck while the other threads are fine. How do you verify?

Q2: What if it’s a production environment?
A: I guess all my solutions should be usable in production, since the entire machine is non-functioning. We can't make it any worse. If the machine is still doing useful work, then we should probably wait till end of day to investigate.

–Method: thread dump? Not popular for c++ processes. I have reason to believe it’s a JVM feature, since java threads are always jvm constructs, usually based on operating system threads [1]. JVM has full visibility into all threads and provides comprehensive instrumentation interface.

https://www.thoughtspot.com/codex/threadstacks-library-inspect-stacktraces-live-c-processes shows a custom c++ thread dumper but you need custom hooks in your c++ source code.

[1] Note “kernel-thread” has an unrelated meaning in the linux context

–Method: gdb

thread apply all bt – prints a stack trace of every thread, allowing you to somewhat easily find the stuck one

I think in gdb you can release each thread one by one and suspend only one suspect thread, allowing the good threads to continue

–Method: /proc — the dynamic pseudo file system

For each process, a lot of information is available in /proc/12345. Information on each thread is available in /proc/12345/task/67890 where 67890 is the kernel thread ID. This is where ps, top -H and other tools get thread information.


[19]with3sockets: select^non-block recv#CSY

Update — CSY said when you increase from 3 to 99, the difference becomes obvious.

Q (TradeWeb mkt data team): given 3 non-blocking receive-sockets, you can either read them sequentially in a loop or use select(). What’s the difference?

http://compgeom.com/~piyush/teach/4531_06/project/hell.html compares many alternatives. It says Non-block socket read() is least efficient as it wastes CPU.

If you have 3 receive-sockets to monitor, select()/poll() is designed just for your situation.

Using read() instead, you may try specifying a timeout in your read() like

//set socket option via SO_RCVTIMEO to 5 seconds

Low latency systems can’t use this design because while you wait 5 seconds on socket1, you miss new data on socket2 😦

Therefore, low-latency systems MUST disable SO_RCVTIMEO. In such a design, the loop wastes cpu in a spin-wait.

–another advantage to select(): you can process a high-priority socket earlier when select() says multiple sockets ready.

I gave this answer on the spot.

RTS pbflow msg+time files #wait_until #60%

Background — In RTS we relied everyday on a UDP message replay tool, taking in a msg file and a corresponding time file. Both are binary files. I saw no message delimiters in the msg file, partly because this file is produced in production. Inserting delimiters would add overhead in the production parser engine.

Given there's no delimiter, I believe the timestamp file must hold a seek-offset value (marker) corresponding to each timestamp.

Q: how would you implement a similar replay tool?

Driver is the timestamp file. Every time I see a timestamp, I would wait_until() that timestamp (adjusted to my start time). Upon wake-up, I would send the corresponding chunk of bytes between current and next markers.

I would use condition_variable::wait_until(). As noted in c++condVar 2 usages #timedWait, this function has nanosec precision.

(The mutex needed in wait_until? My program is single-threaded, so the dummy lock would be held forever.)

The chunk of bytes would be sent as one UDP packet. No split. Each chunk should start with the original packet header created by the matching engine, as received in the original feed.

Q2: how about TCP?

TCP receiver can deal with partial messages very effectively, as I saw in my tests.

However, TCP receiver can introduce delays in the sender i.e. my replay tool, so the transmission will not be instantaneous. In fact, send() can block! This could lead to drift in the timer. That’s why I need wait_until()

Q3: Explain timer drift?

Suppose in the time file, Timestamp #1 is 9:00:00:000001 and Timestamp #2 is one microsec later, but the data transmission can take 10 nanosec esp. with TCP blocking send().  This 10 nanosec is a small drift but it adds up to microseconds.

unbounded queue for 1-Producer-1-Consumer #Wells

———- Forwarded message ———
From: Bin TAN (Victor) <tiger40490@gmail.com>
Subject: demo of my design that first interviewer didn’t see

Hi Angelo,

During my interview with Jun, I spent the last 20 minutes locked in a rigorous debate — given an unbounded queue designed for single-threaded mode, can we provide a thread-safe facade so a producer thread and a consumer thread can operate on it safely, but without using locks in every operation.
Note In JDK and STL there are many collections designed for single-threaded mode, because such designs are simpler and significantly faster.
I argued valiantly that there exist designs where most of the time we can avoid both locking and CompareAndSwap. I failed to convince Jun. Jun believed firmly that unsynchronized access by two threads is always unsafe so we must use lock or CompareAndSwap every single time.
I just spent 20 minutes at home posting an implementation in https://github.com/tiger40490/repo1/tree/jProj/java/com/wells
I would also like to point out the java ConcurrentHashMap lets two threads (possibly a writer thread and a reader thread) access the shared data structure concurrently, usually without locking or CompareAndSwap, when the two threads happen to access two distinct segments. Note there can be a large number of segments, up to a few hundred at least, so the chance of two threads hitting distinct segments is very high (about 99.7%, i.e. 1 - 1/333, for a 333-segment map). Therefore, contrary to what Jun said, for reader/writer threads to access the same data structure, we don't need locking in every read and write access.
concurrencyLevel – the estimated number of concurrently updating threads. The implementation may use this value as a sizing hint.
I wrote this (like most of my concurrent designs) in Java since Java memory model is better documented and better understood.

c++debug^release build can modify app behavior #IV

This was actually asked in an interview, but it’s also good GTD knowledge.

https://stackoverflow.com/questions/4012498/what-to-do-if-debug-runs-fine-but-release-crashes points out —

  • fewer uninitialized variables — Debug mode is more forgiving because it is often configured to initialize variables that have not been explicitly initialized.
    • For example, perhaps you're deleting an uninitialized pointer. In debug mode it works because the pointer was zeroed, and delete on a NULL pointer is a no-op. In release it holds rubbish, so the delete actually causes a problem.

https://stackoverflow.com/questions/186237/program-only-crashes-as-release-build-how-to-debug points out —

  • guard bytes on the stack frame– The debugger puts more on the stack, so you’re less likely to overwrite something important.

I had frequent experience reading/writing beyond an array limit.

https://stackoverflow.com/questions/312312/what-are-some-reasons-a-release-build-would-run-differently-than-a-debug-build?rq=1 points out —

  • relative timing between operations is changed by debug build, leading to race conditions

Echoed on P260 [[art of concurrency]] which says (in theory) it's possible to hit a threading error with optimization and no such error without optimization. If the code is correctly synchronized, that would be a compiler bug; more commonly the optimized build merely exposes a latent race in the program.

P75 [[moving from c to c++]] hints that compiler optimization may lead to “critical bugs” but I don’t think so.

  • poor use of assert can have side effects in the debug build. Release builds typically define NDEBUG, which compiles out all assertions, as assertion failure messages are always unwelcome in production.

SDI: order-resend timer #CSY #retrans

Requirement — Each time we send an order (with a unique orderID number), we need to wait for about 5 seconds. If no Ack received on this id, we would resend it using the same id. Please design a data structure and algo to achieve it.

Interview question Context 1 — probably re-transmission request, so “order id” means a sequence number.

Interview question Context 2 — re-transmission in TCP, so “order id” means a sequence number.

Interview question Context 3 — FIX connectivity

I believe we must keep data structure size under control, so when there are too many pending orders then very old pending orders would be dropped according to a reasonable policy.

A reasonable assumption — For simplicity, we resend any order only once and drop the order. If needed, we could send the same or a modified order but under a new orderID.

For now, I will relax the timing precision so that a little longer than 5 seconds is tolerable in practice. I would hope it takes sub-millis to iterate through any data structure under size control.

Note TCP has an extremely robust, efficient and well-thought-out design for a similar challenge, tested millions of times every second throughout the world. However, I will not reference it. Below is ..

—- my homemade design —-

System is driven by 4 types of events — timer, ack, new-order, resend. The first 3 are asynchronous, primary events, whereas the resend is a secondary event after a primary event. To minimize data races, I will use a single thread, so all event handlers must be brief.

Ring-buffer is the most popular underlying data structure for this type of system. I will implement a linked queue where each queue node is allocated from a ring buffer, and returned to buffer after delete/erase. Note a contiguous array will NOT accommodate mid-stream deletion.

  • Hashmap holds {orderId -> address of link node}
  • Each link node has {integer orderId; expiry time; pointer to next node; other trade details}.
  • We enqueue only at the tail, but we could erase either from head (dequeue) or the middle (ack received)
  • If we take a snapshot at any time, all link nodes are always ordered by expiry.
  • Only one timer is needed. It is either empty or has a single expiry time.

Event-handler algorithms:

  • –after sending a new order,
  • iterate from the head of the queue. If any node has an expiry time already passed, then resend it and dequeue it. Once we see a node that’s not expired yet, iteration ends.
  • enqueue the new id. If there’s no capacity, then simply remove the oldest node i.e. head of queue.
  • –After a resend,
  • Always erase (the node for) the resent id, usually mid-stream. This is where linked lists beat arrays.
  • If this resend is due to a timer event, then we need to set the timer to the expiry time of the queue head.
  • (No data structure scan since this is a secondary event.)
  • –After a timer event,
  • iterate from the head of the queue. If any node has an expiry time already passed, then resend it and dequeue it. Once we see a node that’s not expired yet, iteration ends.
  • set the timer to the expiry time of the current queue head.
  • –After an ack is received,
  • get the id in the ack message
  • use it to look up in hashmap to get the order object.
  • erase the node from linked queue
  • iterate from the head of the queue. If any node has an expiry time already passed, then resend it and dequeue it. Once we see a node that’s not expired yet, iteration ends.

[17]deque with uneven segments #GS

Update — Now I think unrolled linked list for small payloads is very similar.

Q: (2017 GS interview question) We know std::deque offers efficient insert at both ends. But now we want efficient insert mid-stream, and willing to relax O(1) lookup.

I proposed a solution using uneven segments. Each segment has a max capacity. When reached, a new segment would be allocated and linked in. Lookup using a staircase function.

Segment sizes (not capacities) are kept in a mutable int array. From this array, we lazily derive another int array, the "threshold array" or staircase array. Every lookup requires a binary search in this threshold array to locate the target segment. If we keep a hard limit (like 1024) on segment count, then the threshold array is at most 1024 long, so within 10 probes we always locate the target segment — a bounded constant, rather than O(log N) in the element count.

The two int arrays should be L1-cached.

ref-counted copy-on-write string #MIAX exchange IV

I completely failed this 2011 IV question from MIAX options exchange:

Q: outline a ref-counted copy-on-write string class, showing all the function declarations
A: here’s my 2017 answer

struct Payload{ //this object must live outside any Str instance. If 1st Str instance in the group is destroyed, this object must live on.
	char * const arr;
	size_t const length;
	mutable size_t refCount;
	Payload(std::string const &);
};
class Str{
	Payload const * payload;
public:
	~Str(); //decrement refCount and if needed, delete the payload
	//Str(); //default ctor is useless since after construction we can't really modify this instance, due to copy-on-write
	Str(std::string const &);
	Str(Str const &); // similar to shared_ptr; increments refCount

	Str & operator=(Str const & other) const; // will return a reference to a new instance constructed on heap (never on stack!). The new instance will have a ref-count initialized to 1
	Str & replace_with(std::string const &) const; //ditto

// optional utilities
	char const * c_str()    const; // <-- Hey mine is very similar to std::string
	Str(char const * arr, size_t const len); // <-- Hey mine looks similar to std::string
	friend ostream & operator<<(ostream &, Str const &);
};

[12]back-scan any container,print`every Other item #MS

Have I overspent my time on this once-asked question?

The umbrella question — write a utility function to iterate any container and print out every Other element backwards?

Good coding practice! I think this is all about iterator syntax knowledge (my weakness) not algorithm (my strength)!

Note this is really about knowledge not  coding abilities. QQ not ZZ.

Iterator declaration is a can of worms 😦 I might need to give up on this.

#include <iostream>
#include <vector>
#include <list>
#include <set>
#include <string>
using namespace std;

template<class _InIt>  void printAlternateItem2itr(_InIt _First, _InIt _Last){
	bool flag = true;
	// if the iterator is from rbegin, then ++ would reverse it!
	for (_InIt it = _First; it != _Last; ++it, flag=!flag) {
		if (flag) cout << *it << ' ';
	}
	cout << endl;
}
template <typename CONT> void printAlternateItemBackward(CONT const & cont) {
	printAlternateItem2itr(cont.rbegin(), cont.rend());
}
int main() {
	//vector<int> cont = { 11,2,3,4,5,6,7,18 };
	//list<int> cont = { 11,2,3,4,5,6,7,18 };
	string cont = "0123456789a";
	set<int> cont2 = { 11,2,33,44,55,66,77,88,99 };
	printAlternateItemBackward(cont);
	printAlternateItemBackward(cont2);
	int arr[] = { 11,2,3,4,5,6,7,18,9 };
	int size = sizeof(arr) / sizeof(arr[0]);
	printAlternateItem2itr(arr, arr + size); //forward only
}

Q: is comparison defined on all iterators?
A: No. Equality (==, !=) is defined for all iterators, but ordering comparisons (<, >) exist only for random-access iterators; a linked list's iterators don't support them.

%%Q: what’s the signature of STL find()? I will use those declarations of iterators in my function. (Actually the map and set containers have member functions find() outperforming std::find)

%%Q: from a const container, can u get a non-const iterator?

Q: why don’t you take a container as input? Why must you take iterators?
%%A: it's more common to take iterators, but in this case a container will do. All containers provide rbegin() and begin(), including string. A raw array has no member functions, though a raw pointer works fine as a forward iterator.

Separate question
Q: OO design — how would you represent Order state transition graph in an OMS?

minimize locking – queue for producer/consumer

I was asked this question in a big bank IV.

Q: if our queue needs to be as fast as possible, we want to avoid a global lock. How?

%%A1: multi-queue, based on cusip or account prefix. I implemented this partially in a JPM take-home coding test

%%A2: if were are very sure the queue is never depleted, then use 2 locks at both ends, so consumer threads only need to contend for the consumer lock.

%%A3: lock free queues are probably quite common in c#, java and c++

For A2, On one hand, we want to keep things simple by having fewer locks. On the other hand, we want to increase parallelism by breaking up one big sync group (of threads) into independent groups, each having a “smaller” lock.
Coarse-grained parallelism is key. If the 2 smaller sync groups never cross paths, then the additional locks won't add complexity. We may need a way to ensure that the 2 ends of the queue never cross, hopefully without needing both locks. The risk — the consumer is too fast and "eats" an item that's not yet fully added. We should make sure this fails reliably and fast, in production and in stress tests, rather than causing UndefinedBehavior.

I feel in reality, there is usually some bound on the capacity of producer, consumer or queue. Sometimes the producer will be too fast (overflow); sometimes too slow, letting the consumer catch up and cross. So a queue without any bound check is unreliable in practice.

dynamic poker game in Zhou Xinfeng book

Q: you are dealt the 52 cards in a random sequence.
Each red card earns you $360
Each black card costs you $360
Game ends at end of the 52 draws or any time you quit. How do you optimize your earning and how much is admission ticket worth?

Let’s denote the amount of money you take home as H, a random variable. Your net profit/loss would be H minus admission price. If 555 reasonable/intelligent people play this game, then there would be five hundred and fifty-five values of H. What’s the average? That would be the answer.

p or “+” denotes the # of positive cards earned so far
n or “-” denotes the # of negative cards suffered so far

Exp(H|p=25,n=26) = Exp(H|p=26,n=26) = $0.
There’s an intrinsic value in our hands, Defined as i=(p-n). The e-value or Extrinsic value Evaluated from Expectation, may be higher or lower. Whenever i-value > e-value, we should exit. This is our strategy/policy.

Whenever intrinsic value <$0, E(H) is out of the money but always non-negative because we can wait till maturity and get all 52 cards.

E(p24,n26) = p(next is p)E(p25,n26) = 100%E(p25,n26) = $0
E(p26,n25) = $360     because we will exit

E(p25,n25) = Prob(next is p)×E(p26,n25) + Prob(next is n)×E(p25,n26) = (1/2)×$360 + 0 = $180
E(p24,n25) = p(p)×E(p25,n25) + p(n)×E(p24,n26) = (2/3)×$180 + $0 = $120
E(p25,n24) = p(p)×E(p26,n24) + p(n)×E(p25,n25) = (1/3)×$720 + (2/3)×$180 = $360

E(p24,n24) = half of E(p25,n24)+E(p24,n25) = half of $360+$120 = $240
E(p23,n25) = p(p)×E(p24,n25) + p(n)×E(p23,n26) = (3/4)×$120 + (1/4)×$0 = $90
E(p23,n24) = p(p)×E(p24,n24) + p(n)×E(p23,n25) = (3/5)×$240 + (2/5)×$90 = $180

dynamic dice game (Zhou Xinfeng book)

P126 [[Zhou Xinfeng]] presents — 
Game rule: you toss a fair dice repeatedly until you choose to stop or you lose everything due to a 6. If you get 1/2/3/4/5, then you earn an incremental $1/$2/$3/$4/$5. This game has an admission price. How much is a fair price? In other words, how many dollars is the expected take-home earning by end of the game?

Let’s denote the amount of money you take home as H. Your net profit/loss would be H minus admission price. If 555 reasonable/intelligent people play this game, then there would be 555 H values. What’s the average? That would be the answer.

It’s easy to see that if your cumulative earning (denoted h) is $14 or less, then you should keep tossing.

Exp(H|h=14) is based on 6 equiprobable outcomes. Let’s denote Exp(H|h=14) as E14
E14 = (1/6)×$0 + (1/6)(h+1) + (1/6)(h+2) + (1/6)(h+3) + (1/6)(h+4) + (1/6)(h+5) with h=14, so E14 = $85/6 = $14.1667

E15=1/6 $0 + 1/6(h+1) + 1/6(h+2) + 1/6(h+3) + 1/6(h+4) + 1/6(h+5) where h=15, so E15=$15 so when we have accumulated $15, we can either stop or roll again.

It’s trivial to prove that E16=$16, E17=$17 etc because we should definitely leave the game — we have too much at stake.

How about E13? It’s based on 6 equiprobable outcomes.
E13 = 1/6 $0 +1/6(E14) + 1/6(E15) + 1/6(E16) + 1/6(E17) + 1/6(E18) = $13.36111
E12 = 1/6 $0 + 1/6(E13) +1/6(E14) + 1/6(E15) + 1/6(E16) + 1/6(E17) = $12.58796296

E1 =  1/6 $0 + 1/6(E2) +1/6(E3) + 1/6(E4) + 1/6(E5) + 1/6(E6)

Finally, at start of game, expected end-of-game earning is based on 6 equiprobable outcomes —
E0 =  1/6 $0 + 1/6(E1) + 1/6(E2) +1/6(E3) + 1/6(E4) + 1/6(E5) = $6.153737928

mother of 2 kids, at least 1 boy

A classic puzzle showing most people have unreliable intuition about Cond Prob.

Question A: Suppose there’s a club for mothers of exactly 2 kids — no more no less. You meet Alice and you know she has at least one boy. What’s Prob(both boys)?
Question K: You meet Kate (at clubhouse) along with her son. What’s P(she has 2 boys)?
Question K2: You also see the other kid in the stroller but not sure Boy or Girl. What’s P(BB)? This is essentially the same question on P166 [[Cows in the maze]]

Solution A: 4 equi-events BB/BG/GB/GG of 25% each. GG is ruled out, so she is equally likely to be BB/BG/GB. Answer=33%

Solution K: 8 equi-events BB1/BB2/BG1/GB2/BG2/GB1/GG1/GG2. The latter 4 cases are ruled out, so what you saw was equally likely to be BB1/BB2/BG1/GB2. Answer=50%

Question C: Each mother wears a wrist lace if she has a boy and 2 if 2 boys (Left for 1st born, Right for 2nd born). Each mother comes with a transparent (hardly visible) hairband if she has either 1 or 2 boys. There are definitely more wrist laces than hairbands in the clubhouse. If you notice a mother with a hairband, you know she has either 1 or 2 wrist laces. If you see a wrist lace, you know this mother must have a hairband.

C-A: What’s P(BB) if you see a mother with a hairband?
C-K: What’s P(BB) if you see a mother with a wrist lace on the left hand?

Solution C-A: Out of 2000 mothers, 1500 have hairband. 500 have 2 boys. P(BB) = 33%
Solution C-K: 500 have 2 wrist laces; 500 have only a left wrist lace; 500 have only a right wrist lace. P(BB) = 50%

Seeing a wrist lace is not the same as seeing a hairband. The 2 statements are NOT equivalent. Wrist laces (2000) outnumber hairbands (1500), yet passing a random wrist-lace test is rarer than passing the hairband test: within the clubhouse, 3 out of 4 hairband "tests" are positive, but only 2 out of 4 wrist-lace tests are positive (we check one randomly chosen hand).

 Applied to original questions…
* Alice wears a hairband but perhaps one of her wrists is naked. If she brings one child at a time to the clubhouse, we may not always see a boy.
* Kate wears at least one wrist lace (so we know she has a hairband too).

$ if we randomly “test” Alice for wrist lace on a random hand, she may fail
$ if we randomly “test” Alice for hairband, sure pass.
–> the 2 tests are NOT equivalent.

$$ if we randomly “test” Kate for wrist lace on a random hand, she may fail
$$ if we randomly “test” Kate for hairband, sure pass.
–> the 2 tests are NOT equivalent for Kate either

The wrist-lace-test pass implies hairband-test pass, but the same knowledge object contains additional knowledge. The 2 tests aren’t equivalent.

—– How is Scenario K2 different from A?
–How many mothers are like K2? We need to divide the club into 8 equal groups
* perhaps Kate is from the BB group and you saw the first kid or the 2nd kid
* perhaps Kate is from the BG group and you saw the first kid – BG1
* perhaps Kate is from the GB group (500 mothers) and you saw the 2nd kid – GB2. Now if you randomly pick one hand from each GB mother then 250 of them would show left hand (GB1) and 250 of them would show right hand (GB2). Dividing them into 2 groups, we know Kate could be from the GB2 group.
=> Kate could be from the bb1, bb2, bg1, gb2 groups. In other words, all 4 of these groups are "like Kate". These 1000 mothers all wear a wrist lace, but not everyone with a wrist lace is like-Kate: the bg2 (250) and gb1 (250) mothers are not.

–How many mothers are like Alice? 75% consisting of BB BG GB
^ Spotting a hairband, the wearer (Alice) is equally likely from the 3 groups — BB(33%) BG(33%) GB(33%)
^ Spotting a wrist lace, the wearer (Kate) is more likely from the BB group (50%) than BG(25%) or GB(25%) group.

If I hope to meet a BB mother, then spotting a wrist lace is more valuable “signal” than a hairband.  Reason? Out of the 2000 mothers, there are 2000 wrist laces, half of them from-BB. There are 1500 hairbands, and a third of them are from-BB.

Further suppose each twin-BB mother gets 100 free wrist laces (because wrist lace manufacturer is advertising?), and all the BB mothers claim to have a twin-BB. As a result, wrist laces explode. Virtually every wrist lace you see is from-BB.
There are many simple ways of reasoning behind the 33% and 50%, but they don't address the apparent similarity and the subtle difference between A and K. When would a given line of reasoning become inapplicable? It's good to get to the bottom of the A-vs-K difference, subtle but fundamental. A practitioner needs to spot the difference (like an eagle).

reliably convert Any c++ source to C : IV

More than one person asked me “Can you compile a c++ app if you only have a c compiler?”

Some of the foremost experts on c/c++ compiler said on http://www.edg.com/faq/convert —

If you mean “can you convert C++ source to C source, and run the C through a C compiler to get object code“, as a way to run C++ code on a system that has only a C compiler, yes it is possible to implement all of the features of ISO standard C++ by translation to C source code, and except for exception handling this produces object code with efficiency comparable to that of the code generated by a conventional compiler.

For exception handling, it is possible to do an implementation using setjmp/longjmp that is completely conformant, but the code generated will be 5-20% slower than code generated by a true c++ compiler.

16bit page-counter to Estimate when a webpage is hit the 1,000,000th time

We don’t need to estimate the hit Frequency which could be time-varying.

If we can use 2 counters, we can precisely determine the 1,000,000th hit: one slow counter and one regular fast counter. But that means we use 32 bits for our counters, which defeats the 16-bit constraint!

void increment(){ // called on every page hit; was named count()?
  static int count16bit = 1000000/16; // 62500 fits in 16 bits; counts down

  long unsigned int now = system_time_in_nanos();
  if (now%16 == 0){ //sample roughly 1 in 16 hits, via last 4 bits of "now"
     if (--count16bit <= 0){ cout<<"got it"<<endl; }
  }
}

https://en.wikipedia.org/wiki/Approximate_counting_algorithm#Algorithm is the standard implementation of a similar idea.

spreadsheet concretize #Junli Part1

Q: A spreadsheet consists of a two-dimensional array of cells, labelled A1, A2, etc. Rows are identified using letters, columns by numbers. Each cell contains either a numeric value or an expression. Expressions contain numbers, cell references, and any combination of ‘+’, ‘-‘, ‘*’, ‘/’ (4 basic operators). Task: Compute all RPN expressions, or point out one of the Cyclic dependency paths.

——— Here is “our” design ———-
I feel it’s unwise to start by detecting circles. Better concretize as many cells as possible in Phase 1.

* First pass — construct all the RPN objects and corresponding cell objects. An RPN holds all the concrete or symbolic tokens. A cell has an rpn and also a cell name. If a cell is completely concrete, then calculate the result, and add the cell to a FIFO queue.

Also construct a p2d (precedent2dependent) lookup map of <Name, Set-of-dependent-Names>. This will help us fire update events. Why Name? In this context, a name is the unique identifier of a cell. I use a simple hash map.

* 2nd pass — process the queue of concrete cells. For every cell removed from the queue, get its name and concrete value into a pair (call it ppair since it’s a Precedent). Look up p2d to get all dependents. Fire update events by feeding the ppair to each dependent cell, which will use the ppair to concretize (part of) its expression. If any dependent cell gets fully concretized, add it to the queue.

Remember to remove the ppair cell from p2d.

Only 2 data structures needed — queue and p2d.

* Phase 1 over when queue depletes. If p2d is still non-empty, we have cyclic dependency (Phase 2). All concrete cells have been “applied” on the spreadsheet yet some cells still reference other cells.

All remaining cells are guaranteed to be involved in some circle(s). To print out one circle, just start from any cell and follow first link and you are bound to hit the starting point.

count unique words]big file using5machines: high-level design

Q: Design a system to calculate the number of unique words in a file
1) What if the file is huge? (i.e. cannot fit in the main memory)
2) Assuming that you have more than one computer available, how can you distribute the problem?

Constraints are the key to such an optimization. Let’s make it more realistic but hopefully without loss of generality. Say the file is 2 TB ascii of purely alphabetical words of any language in a unified alphabet, with natural distribution such as text from world newspapers. Word length is typically below 20.

I'd assume a regular 100Gb/s network with dedicated sockets between machines. The machines have roughly equal memory, and the combined memory is enough to hold the file.

I’d minimize disk and network access since these are slower than memory access and require serialization.

Q: is the network transfer such a bottleneck that I'm better off processing the entire file on one machine?

— one-machine solution —
Assuming my memory (2GB) can only hold 1% of the unique words, I'd first pick up only the words "below" ad*, i.e. aa*, ab*, ac* only. Save the unique words to a temp file, then rescan the input file looking for ad*, ae*…ak* to produce a 2nd temp file… Finally combine the temp files.

— multi-machine solution —
Don’t bother to have one machine scanning the file and tcp the words to other machines. Just copy the entire input file by CD or file transfer to each machine. Each machine would ignore words outside its target range.

How do we divide the task? Say we have 50 machines. We don't know the exact distribution, so if we assume aa-ak does Not have too many unique words to fit into one machine (2GB), the assumption might be wrong. Instead, we'd divide the entire universe into 50 × 10 ranges. Even if we are underestimating, each range should still fit into one machine. Every time a machine finishes one range, it sends a tiny signal to a controller and waits for the controller to give it the next range.

— hashing on words —
Hash table should be sized to minimize rehash. We need superfast hashCode and compression. hashcode should use all the characters, perhaps except the first, since it tends to be the same within a range.

throwing dtor: %%justified use cases

Status — still I don’t have a well-justified use case.

I feel throwing dtor is frowned upon but not "illegal". In practice it's seldom done by design, and when it is, it's rarely worth deep analysis. In interviews, however, this topic receives disproportionate attention.

However, in practice, there are reasons to break the rule. Suppose I have

int i1, i2; // temp variables
try{
  i1 = myobj->eat();
  i2 = myobj->drink();
  delete myobj; // dtor may throw business_exception
}catch(business_exception & be){
  //handle exception using the temp variables i1, i2 etc
}

Since eat(), drink() etc and dtor all throw the same business_exception, this code is clean and maintainable. If we need to throw more than one exception type from those 3 functions, we can easily add the code. The same exception handler is used as a catch-all.

It would be messy to pass the temp variables i1, i2 etc into myobj dtor and replicate the same exception-handling logic therein.

So in this case, I’d make myobj dtor throw business_exception.

Now, what if, as described in [[moreEffC++]], myobj dtor is invoked as part of stack unwinding due to another exception? [1] Well, in this case, I know that's a fatal scenario and I do want the system to crash anyway, like an assertion error, so the terminate() behavior is not unacceptable.

In other words, myobj’s class is written such that its dtor should throw exception only under normal object destruction and should never be part of an exceptional stack unwinding. In such a case, no one should misuse this class in an exception-unsafe context. If they ignore the restrictions on this class, they could get this dtor invoked as part of an exceptional stack unwinding, and the consequence is something they must deal with.

[1] in c++11, system will trigger std::terminate() whether or not this is part of unwinding. See https://akrzemi1.wordpress.com/2011/09/21/destructors-that-throw/

implement a thread-safe queue using 2 stacks

Your idea is basically

void enqueue(Element e){
    //lock stack1; push e onto stack1; unlock stack1
}
Element dequeue(){
    if (this->stack2.size()){
      return this->stack2.pop();
    }
    this->transferFromStack1to2();
    if ( ! this->stack2.size()){
        //throw exception or return a special value indicating QUEUE_EMPTY;
    }
    return this->stack2.pop();
}
void transferFromStack1to2(){
    assert(this->stack2.size() == 0);
    //get locks on Stack1 and Stack2
    //pop every element off stack1 and push it onto stack2, reversing the order
    //release locks
}

We can improve performance a bit with a proactive transfer — as soon as stack2 becomes empty. However, this must be done on another thread.
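Here is a runnable sketch of the pseudocode above (the class name TwoStackQueue is my own), using one big lock and returning std::optional rather than throwing on QUEUE_EMPTY:

```cpp
#include <mutex>
#include <optional>
#include <stack>

// Thread-safe FIFO queue built on two LIFO stacks, guarded by one big lock.
template <typename T>
class TwoStackQueue {
    std::stack<T> in_, out_;   // in_ receives pushes; out_ serves pops
    std::mutex mtx_;
public:
    void enqueue(const T& e) {
        std::lock_guard<std::mutex> g(mtx_);
        in_.push(e);
    }
    // Returns std::nullopt instead of throwing when the queue is empty.
    std::optional<T> dequeue() {
        std::lock_guard<std::mutex> g(mtx_);
        if (out_.empty()) {            // transferFromStack1to2
            while (!in_.empty()) {     // reversing the order restores FIFO
                out_.push(in_.top());
                in_.pop();
            }
        }
        if (out_.empty()) return std::nullopt;  // QUEUE_EMPTY
        T v = out_.top();
        out_.pop();
        return v;
    }
};
```

The transfer happens lazily inside dequeue(), so each element moves between stacks at most once — amortized O(1) per operation.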

how many tosses to get 3 heads in a row

A common probability quiz — Keep tossing a fair coin, how many tosses on average until you see 4 heads in a row? There’s a Markov chain solution, but here’s my own solution.

I will show a proposed solution using math induction. Suppose the answer to this question is A4. We will first work out A2 and A3.

Q1: how many tosses to see a head?

You can get h, th, tth, ttth, or tttth …  Summing up 1/2 + 2/4 + 3/8 + 4/16 + 5/32 + … == 2.0. This is A1.

Q2: how many tosses to see 2 heads in a row?

t: total tosses in this scenario = 1 + A2. Probability = 50%
hh: 2. P = 25%
ht: 2 + A2. P = 25%

(1+A2:50%) + (2:25%) + (2+A2:25%) = A2. A2 works out to be 6.0

Q3: how many tosses to see 3 heads in a row?
After we have tossed x times and got hh, we have 2 scenarios only
…..hhh — total tosses = x + 1. Probability = 50%
…..hht: x + 1 + A3 : 50%

(x+1: 50%) + (x+1+A3 : 50%) = x+1 + 0.5*A3 = A3, so A3 = 2*(1+x), where x can be 2 or 3 or 4 ….

In English, this says

“If in one trial it took 5 tosses to encounter 2 heads in row, then this trial has expected score of 2*(1+5) = 12 tosses.”
“If in one trial it took 9 tosses to encounter 2 heads in row, then this trial has expected score of 2*(1+9) = 20 tosses.”

Putting on my mathematics hat, we know x has some probability distribution with an average of 6, because A2 = 6. Since A3 = 2*(1+x) is linear in x, we can take expectations and substitute 6 for x, so

A3 = 2*(1+A2) = 14.0. This is my answer to Q3.

Now we realize A2 and A1 are related the same way: A2 == 2*(1+A1).

I believe A4 would be 2*(1+A3) == 30.0. Further arithmetic shows A[n] = 2(A[n-1]+1) = 2^(n+1) – 2.
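The recurrence and the closed form can be cross-checked in a few lines (the function name is mine):

```cpp
// Expected tosses to see n heads in a row, via the recurrence
// A[n] = 2*(A[n-1] + 1); this equals the closed form 2^(n+1) - 2.
double tossesToNHeadsInARow(int n) {
    double a = 0.0;  // A[0]: zero heads needs zero tosses
    for (int i = 0; i < n; ++i) a = 2.0 * (a + 1.0);
    return a;
}
```

Evaluating n = 1..4 reproduces A1 = 2, A2 = 6, A3 = 14, A4 = 30.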

slist iteration with concurrent add()


Thanks for the interview question —

Q: Can you iterate over a LinkedList while another thread adds or removes elements? How might you make it thread safe?

If a long linked list is super busy with a lot of add/remove, then we may never finish iterating — we get ConcurrentModificationException every time.

I feel solution depends on length of the list, how critical is iteration vs add()/remove(), performance requirement etc.

I feel your initial idea is plausible — make a copy of the entire linked list before iterating. But when we make a copy, we need to iterate — chicken and egg problem.

How about iteration-free copying? Get a node, copy it, then read the “nextNode” pointer to reach the next node. This way I implement my own loop. Must handle concurrent modification — not easy, but at least I don’t get exceptions. In contrast, a similar home-made solution is harder on vector/hashtable/deque, as their reallocation is more complex.

I also like your 2nd idea of a custom linked list, where add() operations are customized. This is fine if iteration can finish quickly or add() is infrequent. However, if add()/remove() are time-critical, then we need other solutions.

If we need to iterate a long linked list that’s super busy, then the __design__ is probably wrong. Why not use a database?

Alternatively, split the long linked list into shorter lists. Easier to iterate.
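The custom-linked-list idea can be illustrated in C++ (a sketch under my own assumptions — prepend-only, no remove(), since safe concurrent removal needs hazard pointers or RCU): add() publishes a fully built node with a CAS on the head, so an iterator walking the next pointers never sees a half-linked node and never throws:

```cpp
#include <atomic>

// A node in a singly linked list; nodes are never freed in this sketch.
struct Node {
    int value;
    Node* next;
};

class PrependOnlyList {
    std::atomic<Node*> head_{nullptr};
public:
    void add(int v) {
        Node* n = new Node{v, head_.load()};
        // Retry until we swing head from our snapshot to the new node.
        // On failure, compare_exchange_weak reloads n->next with the
        // current head, so the new node is always fully linked first.
        while (!head_.compare_exchange_weak(n->next, n)) {}
    }
    int sum() const {  // exception-free iteration
        int s = 0;
        for (Node* p = head_.load(); p; p = p->next) s += p->value;
        return s;
    }
};
```

An iterator started before some add() calls may simply miss the newly prepended nodes — often acceptable for a snapshot-style traversal.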

polymorphic equals() in java/c#@@ NO

A java interviewer quizzed me why my equals() implementation insists on both objects having exactly the same runtime type, and rejecting a derived object. He questioned what if we happen to have

B var1 = new Derived(1,2); B var2 = new B(1);

and the B portion of both objects are identical?

Biggest problem with this type of polymorphic equals() — equality is no longer transitive: X === Y === Z doesn’t mean X === Z. Suppose the base sub-objects of X/Y/Z are identical, and Y is a base object. Then Y.equals(X) == true == Y.equals(Z), but X and Z are subtype instances whose derived parts differ.

I’m a purist with high standards. If my equals() determines 2 objects to be equal, then I “swear” they are indistinguishable and interchangeable carbon copies.

It’s generally safe to be strict writing equals() — insist both objects’ actual types at runtime are identical.
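The same strict rule can be sketched in C++ (class names are mine), using typeid to insist on identical runtime types, analogous to a getClass()-based equals() in java:

```cpp
#include <typeinfo>

struct B {
    int x;
    explicit B(int x): x(x) {}
    virtual ~B() = default;
    virtual bool equals(const B& other) const {
        // insist both runtime types are identical, not merely related
        return typeid(*this) == typeid(other) && x == other.x;
    }
};

struct Derived : B {
    int y;
    Derived(int x, int y): B(x), y(y) {}
    bool equals(const B& other) const override {
        if (typeid(*this) != typeid(other)) return false;
        const Derived& d = static_cast<const Derived&>(other);
        return x == d.x && y == d.y;
    }
};
```

With this design, a B and a Derived with identical B portions compare unequal from either side, so equality stays symmetric and transitive.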

create unbounded queue using STL deque – threading

There’s an STL container adapter — queue over deque. (Probably no queue over vector exists, since one end of the vector is hard to operate on.) Like all STL containers it’s thread-unsafe. But in this context, let’s ignore this adapter and try to design our own adapter.

How many condition variables? I think 1 only — isEmpty.

Q1: How many mutexes? At most 1 at each end of the queue.
%%A: probably 1. See below.

Q1b: Is it possible to allow consumers and producers to operate simultaneously?
%%A1b: I don’t think so. See below.

Q2: What if there’re 0 elements now and both threads are operating on that 1 new element?

For a queue backed by a linked list, Answer 1b might be yes — I feel the producer would button up the entire node and put its address into the head pointer in one instruction. For a deque though, the copying is not an atomic one-stepper. It’s either operator=() or copy-ctor. Before the producer finishes copying, consumer reads the half-copied node! In conclusion, I feel we need a big lock over the entire data structure.

equals method of a GraphNode.java class


You are spot on about linked list — If a class N has-a field of type N, then N is almost always, by definition, a node in a graph. That N field is probably a parent node. So allow me to put in some meaningful names — Each GraphNode has-a field named this.parent. Now the question becomes “how to override equals() in GraphNode and deal with the unbounded recursion”.

It’s an unusual technical requirement to make equals() compare all ancestor nodes. However, it’s a reasonable business requirement to compare 2 GraphNodes by comparing all ancestors. Such a business requirement calls for a (static) utility method, NOT an instance method in GraphNode.java. A static utility method like compareAllAncestor(GraphNode, GraphNode) can be iterative, avoiding recursion and stack overflow. Once this static method is in place, I might (grudgingly) create an instance method compare(GraphNode other) which simply returns compareAllAncestor(this, other), without unbounded recursion or stack overflow.

If 2 threads both perform this comparison, then I feel the method may need to lock the entire graph — expensive.

Even in a single-threaded environment, this comparison is expensive. (The recursive version would add an additional memory cost.) Potentially a performance issue. For most graph data structures in business applications, GraphNode should be Serializable and collections-friendly. Therefore hashCode() and equals() should be cheap.

For most graph data structures in business applications, each graph node usually represents a real world entity like a member in a MLM network. Now, if a graph node represents a real world entity, then it’s always, without exception, identifiable by an immutable and unique ID. Usually this ID is saved in database (could also be generated in application). Therefore, in most cases, equals() should compare ID only.
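A sketch of that iterative utility in C++ (field names are mine), combining the two conclusions above — walk both parent chains in lock-step, comparing the immutable IDs only:

```cpp
struct GraphNode {
    int id;             // immutable unique ID, per the last paragraph
    GraphNode* parent;  // nullptr at the root
};

// Iterative ancestor comparison -- no recursion, no stack overflow.
bool compareAllAncestors(const GraphNode* a, const GraphNode* b) {
    while (a && b) {
        if (a->id != b->id) return false;
        a = a->parent;
        b = b->parent;
    }
    return a == nullptr && b == nullptr;  // both chains must end together
}
```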

mkt-data subscription engine java IV #barc eq drv

Say you have market data feeds from Reuters, Wombat, Bloomberg, eSpeed, BrokerTec, ION… Data covers some 4000 underliers and about half a million derivative instruments on these underliers. For each instrument, there can be new bid/offer/trade ticks at any millisecond mark[1]. Volume is similar to option data feed like OPRA.

Say you have institutional clients (in addition to in-house systems) who register to receive IBM ticks when a combination of conditions occurs, like “when bid/ask spread reaches X, and when some other pricing pattern occurs”. There are other conditions like “send me the 11am IBM snapshot best bid/ask”, but let’s put those aside. For each of the instruments, there are probably a few combinations of conditions, but each client could have a different target value for a condition — 2% for you, 2.5% for me. Assuming just 10 combinations for each instrument, we have 5 million combinations to monitor. To fulfill clients, we must continuously evaluate these conditions. CEP and Gemfire continuous query have this functionality.

I proposed a heavily multi-threaded architecture. Each thread is event-driven (primary event) and wakes up to reevaluate a bunch of conditions and generate secondary events to be sent out. It can drop the new 2ndary event into a queue so as to quickly return. The “consumer” can pick up the 2ndary events and send out by multicast.

Each market data vendor (Reuters, e-speed, ION, even tibrv) provides a “client-runtime” in the form of a jar or DLL. You embed the client-runtime into your VM, and it may create private threads dedicated to communicating with the remote publisher.

[1] Each IBM tick actually has about 10 fields, but each IBM update from the vendor may contain only the 2 fields that changed, since the other fields didn’t change. So we need something like Gemfire to reconstruct the entire 10-field object.
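The reconstruction in footnote [1] can be sketched as a last-value cache (the names and field layout are illustrative, not from any real feed): keep the last full image per symbol and overlay each partial update onto it.

```cpp
#include <map>
#include <string>

using Tick = std::map<std::string, double>;  // field name -> value

class LastValueCache {
    std::map<std::string, Tick> bySymbol_;
public:
    // Apply a partial update and return the merged full image.
    const Tick& apply(const std::string& symbol, const Tick& partial) {
        Tick& full = bySymbol_[symbol];
        for (const auto& kv : partial) full[kv.first] = kv.second;
        return full;
    }
};
```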

pure-pure virtual invoked during base ctor/dtor #bbg

Update: P273 [[eff c#]] compares c# vs c++ when a pure virtual is invoked by ctor (book skips dtor). It confirms c++ would crash, but my bloodshed c++ compiler detected pure virtual called in base ctor and failed compilation. See source code at bottom.

This is rather obscure, not typical — not even a valid question.

struct B{
  virtual void cleanup() = 0;
  ~B(){ this->cleanup(); } // not virtual
};
struct D: public B{
  void cleanup(){}
};
int main(){ D derivedObject; }
// derivedObject destructed at end of main.
// If this code were to compile (see below why that's impossible), what would happen here?

%%A: the invisible ~D() runs, then the defined ~B() runs even though it’s non-virtual.
%%A: I think it’s undefined behavior ONLY with “delete”.
%%A: virtual method called during base object destruction/construction – Warning by Scott Meyers. At time of base class destruction only the pure virtual is available, so system crashes saying “pure virtual function called”.

(In java, superclass ctor calling a virtual function results in the subclass’s version invoked, before the subclass ctor runs! Compiles but dangerous. If you do this, make sure the subclass ctor is empty.)

Q: how can a compiler intercept that error condition, with 0 runtime cost?
A: see post on pure pure virtual

Note, if Derived had not implemented the pure virtual, then Derived would have become an abstract class and non-instantiable.

Actually compiler detects the call to cleanup() is a call to B::cleanup() which is abstract. Here’s a demo.

#include <iostream>
using namespace std;
struct B{
  virtual void cleanup() = 0;
  B(){  cout<<"B()\n";  }  //adding this->cleanup() in the body breaks the compiler
  ~B(){ cout<<"~B()\n"; }  //adding this->cleanup() in the body breaks the compiler; not virtual
};
//void B::cleanup(){ cout<<"B::cleanup\n"; } // a pure virtual may still have an out-of-class body
struct D: public B{
  void cleanup(){       cout<<"D::cleanup\n"; }
  ~D(){cout<<"~D()\n"; }
};
int main(){
    if (true){
       D derivedObject; // destructed at end of this block
    }
}

Merrill S’pore: fastest stock broadcast

Updates — RV or multicast topic; msg selector

I think this is a typical wall-street interview question for a senior role. System requirement as remembered by my friend the interviewee: ML needs a new relay system to receive real-time stock updates from a stock exchange such as SGX. Each ML client, one of many thousands[1], will install a new client-software [3] to receive updates on the stocks [2] she is interested in. Some clients use algorithmic trading systems and need the fastest feed.

[1] Not clear about the order of magnitude. Let’s target 10,000
[2] Not clear how many stocks per client on average. Let’s target 100.
[3] Maintenance and customer support for a custom client-software is a nightmare and perhaps impractical. Practically, the client-software has to be extremely mature, such as browsers or email clients.

Q: database locking?
A: I don’t think so. only concurrent reading. No write-contention.

Key#1 to this capacity planning is how to identify bottlenecks. Bandwidth might be a more severe bottleneck than other bottlenecks described below.

Key#2 — 2 separate architectures for algorithmic clients and traditional clients. Each architecture would meet a different minimum latency standard, perhaps a few seconds for traditional and sub-second for algorithmic.

Solution 0: whatever broadcasting system SGX uses. In an ideal world with no budget constraint, the highest capacity is desired.

Solution 2: no MQ? No asynchronous transmission? As soon as an update is received from SGX, the relay calls each client directly. Server-push.

Solution 1: MQ — the standard solution in my humble opinion.

Solution 1A: topics. One topic per stock. If 2000 clients want IBM updates, they all subscribe to this topic.

Q: client-pull? I think this is the bottleneck.

Q: Would Client-pull introduce additional delays?

Solution 1B: queues. one queue for each client each stock.

If 2000 clients want IBM updates, the relay needs to make that many copies of each update and send them to that many queues — duplication of effort. I think this is the bottleneck. Not absolutely sure if this affects relay system performance. Massively parallel processing is required, with thousands of native CPU threads (not java green threads).
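Solution 1A’s fan-out can be sketched in-process (class and method names are mine; a real deployment would sit behind an MQ or multicast layer): one topic per stock, and each incoming SGX update is delivered to every subscriber of that topic.

```cpp
#include <cstddef>
#include <functional>
#include <map>
#include <string>
#include <vector>

// One topic per stock; subscribers register a callback per symbol.
class TopicBus {
    std::map<std::string, std::vector<std::function<void(double)>>> subs_;
public:
    void subscribe(const std::string& symbol, std::function<void(double)> cb) {
        subs_[symbol].push_back(std::move(cb));
    }
    // Fan one update out to every subscriber; returns how many were notified.
    std::size_t publish(const std::string& symbol, double price) {
        auto it = subs_.find(symbol);
        if (it == subs_.end()) return 0;
        for (auto& cb : it->second) cb(price);
        return it->second.size();
    }
};
```

This makes the contrast with Solution 1B concrete: the relay publishes once per topic, instead of copying the update into one queue per client per stock.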