MOM^shared memory ring buffer^UDP : mkt data transmission

I feel in most environments, the MOM design is most robust, relying on a reliable middleware. However, latency sensitive trading systems won’t tolerate the additional latency and see it as unnecessary.

Gregory told me about his home-grown simple ring buffer in shared memory. He used a circular byte array. Message boundary is embedded in the payload. When the producer finishes writing to the buffer, it puts some marker to indicate end of data. Greg said the consumer is slower, so he makes it a (periodic) polling reader. When consumer encounters the marker it would stop reading. I told Gregory we need some synchronization. Greg said it’s trivial. Here’s my idea —

Design 1 — every time the producer or the consumer starts it would acquire a lock. Coarse-grained locking

But when the consumer is chipping away at head of the queue, the producer can simultaneously write to the tail, so here’s

Design 2 — the latest message being written is “invisible” to the consumer. Producer keeps the marker unchanged while adding data to the tail of queue. When it has nothing more to write, it moves the marker by updating it.

The marker can be a lock-protected integer representing the index of the last byte written.

No need to worry about buffer capacity, or a very slow consumer.

MOM UDP multicast or TCP shared_mem
how many processes 3-tier 2-tier 2-tier
1-to-many distribution easy easiest doable
intermediate storage yes tiny. The socket buffer is typically 64MB yes
producer data burst supported message loss is common in such a situation supported
async? yes no yes
additional latency yes zero latency minimal

job4accu@20Y: which past job was closest2ideal

Q: do I want to deepen and specialize in one single area, rather than diversify? skill: deepen^diversify^stack up provides a context

A: let’s assume answer is yes deepen.

Q: which area?

A: programmatic data analysis in finance … with two entry barriers in math and coding, with reasonable "moneyness" but market depth is poor.

A: financial enterprise app with multi-threading …

Once we figure out which domain is good enough (very few), then we need a good boss…

So which past job is most likely to offer me that professional growth prospect? Barcap, but market depth is poor for that skill

tick DB query engine: design notes #GS

— to support -Oprint —
For each symbol, we will use a vector to hold Tick objects (or shared_ptr thereof). Lookup will use a modified binary search — In the case of an exact hit, we need to scan backwards (for the starting timestamp) until we see a smaller value than the target. In the case of a search miss, we know the target timestamp value is between 2 vector items, and no scanning is needed.

I decided against a multiset due to larger memory footprint. Also a red-black tree would require lots of re-balancing when the input data stream is already sorted.

This vector will support product-query, but not optimally.

Trade-off: this optimization mode saves memory, speeds up print-queries but (the less frequent) product-queries are slower.

— to support -Oproduct —
In addition to the data structure of -Oprint, under each symbol we will have 11 additional vectors assuming there are 11 fields.

For Field1, the vector will hold 98700 addresses assuming there that many records having Field1. Each address is a Tick object that has Field1 populated.

For Field2, let’s assume the vector is longer.

If a product query specifies Field1 and Field2, engine first identifies which of the two vector is shorter. In our case, Field1 vector is chosen, so we iterate over it. (Note we don’t need Field2’s vector.) For each item, we check if Field2 is present (low-cost) and produce a product.

Before iteration, we need to identify the start and end items in the vector. This is the same modified-binary search as in -Oprint.

Suppose Line 1 has 3 fields, Line 2 has 5 fields, Line 3 has 1 field … The additional space requirement in this design is proportional to the total count of fields (3+5+1+…) present in the tick file.

To support print-query, in theory we could iterate each of the 11 vectors under the target symbol. However, we need a place to store the Tick objects. The data structure of -Oprint is an efficient solution, so I see no reason to abandon it and invent another. This data structure already supports faster print-query.

Trade-off: In this optimization mode, we support faster product-query, but use more memory. Also, the initial file processing is slower.

— to support -O2product (not implemented) —
If we know in advance that total distinct field count is low like 10, then there are only 10-choose-2 i.e. 45 pairs. We could support a new -O2product optimization mode.

Instead of 10 vectors, we need 45 vectors of floats, for the 45 combinations.

If an incoming Tick object has 4 fields, we will iterate over the 4-choose-2 i.e. 6 pairs of fields. Each pair is always one of the 45 possible combinations, so we will pick one of the 45 vectors, and push_back the product of the two fields. Since we have 6 pairs within this Tick object, we update exactly 6 of the 45 vectors.

Once the tick file is fully processed, we have 45 vectors to support fast product-query. To select one of 45 vectors, we use a hash table keyed by the two target field names (in sorted order) concatenated. Once we select the one vector, we will use the same modified binary search then iterate. Each element in the iteration is now guaranteed to contain both fields, so there’s no wasted iteration as in -Oproduct.

Print-query would need the data structure in -Oprint.

Trade-off: This mode uses even more memory but further speeds up product-queries. The additional space is proportional to the total count of field-pairs present in the tick file.

–use of shared_tr —
Memory footprint of a shared_ptr is about twice of a raw ptr, and much smaller than the footprint of a string. There are millions of field names (and also many Tick objects) in my data structures. In fact, the copy operations are disabled for Field (and Tick) classes, which are essentially immutable objects.

It was an obvious memory optimization to use pointers to avoid allocation for the same string twice. However, there’s question whether to store shared_ptr or raw ptr in the large data structures.

I chose raw pointer.

after 70,non-profit OK; voluntary work no

After my prime years, when I can only work half the time, I may be able to work towards some meaningful cause, but not completely voluntary work. If there’s no income, I will have low motivation to continue.

With a salary, I feel more commitment, more responsibility.

In our later years, my wife and I also have a non-trivial financial need. I don’t want to depend on my kids or welfare to support ourselves. I may have to continue my drive for more income.

memset() a practical usage #Gregory

  • memset is a low-level C function.
  • memset takes a void pointer.
  • Fast and simple way to zero out an array of struct, having primitive data members. No std::string please. No ptr please. Use sizeof to get the byte count.
  • Useful in low level wire coding
// illustrates packed and memset
#include <iostream>
using namespace std;

struct A{
  unsigned int i1; //4 bytes
  bool b; //1 byte
  char cstr[2];
  int* ptr; //8 bytes
} __attribute__((packed));
size_t const cnt = 3;
A arr[cnt];
int main(){
  size_t sz = sizeof(arr);
  memset(arr, 0, sz);
  for(size_t i=0; i<cnt; ++i){
    A* tmp = &arr[i];
    cout<<"i1 = "<<tmp->i1<<"; b = "<<tmp->b<<" ; cstr[1] = "<<(int)tmp->cstr[1]<<" ptr = "<<tmp->ptr<<endl;

Nomura C++quant IV

Q: vs operaotr[]?
%%A: operator[] throws exception only if you specify a compiler flag like D_GLIBCXX_DEBUG. I think this is correct

Q: given a vector or ints, how would you sort it by abs value?
%%A: free function
%%A: subclass “less” with operator(). body is same as the free function

Q: clustered index vs non-clustered?

Q: inputs to price an equity option?
%%A: 5 inputs: K, S, T, r, dividend rate. I missed “implied volatility”, which is a soft market data derived from a raw market data.

Q: inputs to a mortgage prepayment model (or callable bond)? ….
%%A: Black model to price a callable bond. Inputs include yield, maturity, interest rate, but there’s no strike per-se? I think the implied volatility (in which random process?) is the most important factor and needs to be backed out from some market data

Q: how did you construct your yield curves at Barcap?

Q: what’s local vol
A: the instantaneous volatility is a deterministic function of 2 inputs. Deterministic because there’s no dB … See other blog posts.

Q: zero-volatility spread?

Q: how do you compute OAS?
%%A: I think OAS is a curve.

Q: how about dv01?

Q: duration of a 10Y T-bond vs 10Y muni bond? I answered with the (more “visual”) Macaulay duration, which is slightly different from the (most useful) modified duration

GS c++/dataStruct IV

Q1: what’s the problem to be solved by virtual inheritance
A: diamond..

Q2: consider such a diamond with A as grandparent, B/C virtually deriving from A, then grandchild D. Suppose class C has a foo() method that uses a field A.a. Now, within C, the offset of A.a depends on the inheritance hierarchy, so how does the compiler compile

A: I do see a problem here. In this ABCD case, the offset of A.a is one value, depending on the size of B. However, in a ABJKCD case (J/K virtually subclass A), the offset of A.a will be a different value, depending on the sizes of B/J/K. So the method will compile to different object code! Yet the compiler has a single version of

The trick is the offset of the A sub-object within C. This offset value can be saved in the vtable of C.

Note if there’s no virtual inheritance, then simpler. Within C, the A sub-object’s offset is a constant. No need to look up vtable.

Q3: appending a million integers to a list vs a vector, which is faster and by how much?
%%A: vector requires reallocation, roughly 20 times, assuming doubling each time, much fewer than 1M allocations for the list.  The reallocation entails element-by-element copy, but that’s cheaper than allocations. If I must estimate how much faster, I need to estimate how many allocations and how many copying:

  • for list, 1M allocations + 1M copying
  • for vector, 20 allocations + about 2M copying

%%A: I still think allocation is a hotspot and much more expensive than copying, perhaps by 10 or 100 times.

%%A: of course, you can pre-allocate for vector, but for this problem we ignore that feature.

Q4: deque — known for fast append/prepend, and random access. How?
%%A: I think it’s implemented as segments of vectors, with the front and back segments not fully occupied and growable. Once one of them reaches a certain size, a new segment is created. RandomAccess is implemented by identifying the host segment. To optimize, segments should have equal length.

Q4b: what if we want to support fast midstream insertion?
%%A: this is a vector limitation inherited by deque. If we need to support it, we may need to allow unequal segment lengths. RandomAccess is now slightly slower. For a given subscript 16, we need to maintain a step function like

  • 5-10 in Segment 2
  • 11-13 in Segment 3
  • 14-20 in Segment 4
  • 21-30 in Segment 5
  • 31-32 in Segment 6
  • … so each time any segment expands, the step function needs an update
  • Once we have the step function, a binary search will do, at log(S) where S = Segment count

select^poll # phrasebook

Based on, which I respect.

* descriptor count — up to 200 is fine with select(); 1000 is fine with poll(); Above 1000 consider epoll

* single-threaded app — poll is just as fast as epoll. epoll() excels in MT.

* time-out precision — poll/epoll has millisec precision. select() has nanosec, but only embedded devices need such precision.

* linux-only — epoll

##after70..spend%%spare time meaningfully

Holy grail — Long-term sustainable (hopefully intrinsic) motivation + modest level of expertise, with reliable (albeit low) income and (a bit of) social value.

I need a purpose, a goal to work towards… Without it, the absence of a … job would create a void. Depression, lack of purpose, loss of energy. None of the below is easily achievable or easily available. Whichever I choose, need to work towards it.

  • I have proven aptitude in theoretical domains ..
  • Teach Chinese/English, with emphasis on writing and vocab
  • Two-way translation service, but I prefer interactions.
  • Teach programming — threading, data struct, algo
  • If not outdated, Teach statistical data analysis?
  • If not outdated, Teach enterprise app design? Too competitive. They may not take in an old programmer.
  • Teach financial math? After 70?
    • ▼this domain is too competitive and entry barrier too high. A lot of effort to cross it but demand is low.
    • ▼Limited practical value. more specialized, but growing demand.
    • ▼I feel I need interaction with people.
  • Chinese medicine?
  • help my children take care of their families

Tim (ICE), a friend in his 50’s gave 3 points

  • earn a salary to help kids pay student loan
  • travel — costs a lot
  • volunteering