— to support -Oprint —
For each symbol, we will use a vector to hold Tick objects (or shared_ptr thereof). Lookup will use a modified binary search — In the case of an exact hit, we need to scan backwards (for the starting timestamp) until we see a smaller value than the target. In the case of a search miss, we know the target timestamp value is between 2 vector items, and no scanning is needed.
I decided against a multiset due to larger memory footprint. Also a red-black tree would require lots of re-balancing when the input data stream is already sorted.
This vector will support product-query, but not optimally.
Trade-off: this optimization mode saves memory, speeds up print-queries but (the less frequent) product-queries are slower.
— to support -Oproduct —
In addition to the data structure of -Oprint, under each symbol we will have 11 additional vectors assuming there are 11 fields.
For Field1, the vector will hold 98700 addresses assuming there are that many records having Field1. Each address is a Tick object that has Field1 populated.
For Field2, let’s assume the vector is longer.
If a product query specifies Field1 and Field2, engine first identifies which of the two vectors is shorter. In our case, Field1 vector is chosen, so we iterate over it. (Note we don’t need Field2’s vector.) For each item, we check if Field2 is present (low-cost) and produce a product.
Before iteration, we need to identify the start and end items in the vector. This is the same modified-binary search as in -Oprint.
Suppose Line 1 has 3 fields, Line 2 has 5 fields, Line 3 has 1 field … The additional space requirement in this design is proportional to the total count of fields (3+5+1+…) present in the tick file.
To support print-query, in theory we could iterate each of the 11 vectors under the target symbol. However, we need a place to store the Tick objects. The data structure of -Oprint is an efficient solution, so I see no reason to abandon it and invent another. This data structure already supports faster print-query.
Trade-off: In this optimization mode, we support faster product-query, but use more memory. Also, the initial file processing is slower.
— to support -O2product (not implemented) —
If we know in advance that total distinct field count is low like 10, then there are only 10-choose-2 i.e. 45 pairs. We could support a new -O2product optimization mode.
Instead of 10 vectors, we need 45 vectors of floats, for the 45 combinations.
If an incoming Tick object has 4 fields, we will iterate over the 4-choose-2 i.e. 6 pairs of fields. Each pair is always one of the 45 possible combinations, so we will pick one of the 45 vectors, and push_back the product of the two fields. Since we have 6 pairs within this Tick object, we update exactly 6 of the 45 vectors.
Once the tick file is fully processed, we have 45 vectors to support fast product-query. To select one of 45 vectors, we use a hash table keyed by the two target field names (in sorted order) concatenated. Once we select the one vector, we will use the same modified binary search then iterate. Each element in the iteration is now guaranteed to contain both fields, so there’s no wasted iteration as in -Oproduct.
Print-query would need the data structure in -Oprint.
Trade-off: This mode uses even more memory but further speeds up product-queries. The additional space is proportional to the total count of field-pairs present in the tick file.
–use of shared_tr —
Memory footprint of a shared_ptr is about twice of a raw ptr, and much smaller than the footprint of a string. There are millions of field names (and also many Tick objects) in my data structures. In fact, the copy operations are disabled for Field (and Tick) classes, which are essentially immutable objects.
It was an obvious memory optimization to use pointers to avoid allocation for the same string twice. However, there’s question whether to store shared_ptr or raw ptr in the large data structures.
I chose raw pointer.
–duplicate copies of field strings or symbol strings —
If there are a million ticks, averaging 5 fields each, there would be 5 million field names to store, but only 50 distinct fields! To save memory, I allocate memory for each globally unique field name once only. We end up with 50 Field objects referenced by many shared_ptr objects.
For lookup, a hash table is possibly faster (if map is large) but not really, according to some benchmark tests. Our map size is assumed to be small, like 100.
Symbol strings are also repeating, but by design, my data structures do not hold many duplicates for a symbol string, because symbol is only a lookup key.
–Improvement feedback from friends
- avoid using namespace std
- avoid virtual