“Ticking 95th percentile server” is the blog title I would use. Original question is on paper/pencil, scanned and sent to my gmail, from my friend Deepak CM. I find this problem rather **realistic with practical usage,** rather than fake and contrived. I treat it as a System Design + implementation question.

Q: Using only std library, write a c++ program to process a live stream of 128,000,000 (or much more) double numbers, representing temperature readings not necessarily unique. As the temperatures come in, print the current 95th percentile on-demand. I call it the lucky “winner”. We can use the **nearest-rank percentile definition.**

====Idea 1: **given unsorted ints, find median in O(N)** is for median but can be tweaked for any percentile, but unfortunately, not “ticking”

====design 2, optimized for high volume of updates like 128 million updates, not optimized for frequent query

The entire temperature range is divided into non-overlapping segments, each represented by a segment-head temperature i.e. the lower bound [1b]. Each segment has a range (i.e.distance to next segment head), size (i.e.item count) and density (i.e. size/range ratio). We care about “size” only.

We need a RB-tree containing P=1024 [1] nodes, each an unsorted container[3]. The RB-tree serves to maintain the containers i.e segments.

Each incoming temperature is quickly “routed” to the correct container and simply appended therein, incrementing its size.

Upon query request, we will use the latest segment sizes to build a cumulative profile, and run a O[logP] binary search to identify the one segment containing the “winner”. This segment size would be hopefully much smaller than 128,000 [2] and far more /tractable/.

–Within the chosen segment of size S, we can use a vector to sort in O(S logS) the temperatures and identify the winner. After completing a query, the chosen container will become (partially) sorted, helping subsequent queries if this segment is chosen again.

If x% of the queries hit this segment, then I will convert this “favorite segment” to a RB-tree.

Alternatively, we can also use the O(S) algorithm in Idea 1, but the container won’t become sorted.

–priming

[2] 128,000 is 1024th the size of original sample size… not ideal. The segments need to be initialized carefully, during a priming phase, inspired by JIT compiler. Assuming we know the total sample size is 128 million, I will use the first 100,000 temperatures to select the 1024 segment heads. The segments are structured not for equal length or equal size. In fact the low segments can be very long very large. Rather the segment heads are chosen so that between 94th percentile and 96th percentile we have 1/4 of all segments. These small segments will be much smaller than 128,000 and quicker to manipulate.

–Foot notes:

Q: what if some of the containers grow too big like three times 128,000,000/1024

A: Not a problem unless the winner happens to be in such a container.

A: One idea is to split up such a container when we notice it, and grow a new node on the RB-tree. *std::map::insert() can take a hint* for the position where new node can be inserted. Note we don’t want to split a node JJ too early since JJ may not grow any further subsequently and may end up no larger than other nodes as other nodes keep growing.

[1] Sizing of P — First we estimate total sample size. If unknown, then set N:=1024 so all nodes stay in **L1-cache (typically 32KB)**. If we assume 16 bytes/node ( 8 bytes pointer to container + 8 bytes double ), then 32KB can hold 2000 nodes.

If query becomes more frequent, I can increase P by 1024, sacrificing insertion.

[1b] The node values are “lower-bounds” and don’t have to be really the minimum of the original temperatures in the container. We can probably cheat with 4-byte floats, and we get away with 2700 twelve-byte tree nodes.

[3] slist vs vector — vector might be faster due to pre-allocation, provided a node will never grow beyond a capacity. Vector has reserve() (Note resize() is wrong choice.)