Hi guys, thanks to all your help, I managed to locate the very first trading session message in the raw data file.
We hit and overcame multiple obstacles in this long “needle search in a haystack”.
- · Big Obstacle 1: endianness. It turned out the raw data is little-endian. My “needle”, the integer symbol id 15852 (decimal) or 3dec (hex), shows up byte-swapped as “ec3d” in the hex dump, which is how I finally found it.
Solution: read the exchange spec, which should state the byte order.
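To avoid flipping bytes by hand, the swap can be sketched in C++ as below. This is only an illustration: little_endian_needle is a hypothetical helper name, and the field width in bytes (2 for the symbol id here) is an assumption that should be confirmed against the exchange spec.

```cpp
#include <cstddef>
#include <cstdint>
#include <string>

// Hypothetical helper: returns the hex digits of `value` in little-endian
// byte order, i.e. the way the field would appear in a hex dump of the feed.
// `width` is the field size in bytes (an assumption; take it from the spec).
std::string little_endian_needle(std::uint64_t value, std::size_t width) {
    static const char digits[] = "0123456789abcdef";
    std::string out;
    for (std::size_t i = 0; i < width; ++i) {
        unsigned byte = static_cast<unsigned>(value & 0xffu);  // lowest byte first
        out += digits[byte >> 4];    // high nibble
        out += digits[byte & 0x0f];  // low nibble
        value >>= 8;                 // move on to the next byte
    }
    return out;
}
```

With these assumptions, little_endian_needle(15852, 2) returns "ec3d", the form that appeared in the dump.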
- · Big Obstacle 2: hex viewers (like “xxd”) add line breaks to their output, so a needle that straddles a line break is missed by a plain search. (Thanks to Vishal for pointing this out.)
Solution 1: xxd -c 999999 raw/feed/file > tmp.txt; grep $needle tmp.txt
The default xxd column size is 16, so the output gets an unwanted line break after every 16 bytes. That is why I set a very large column size of 999999.
Solution 2: in the vi editor, after running “%!xxd -p”, you can still search across the line breaks with a pattern like “ec\_s*3d”. Basically, you insert “\_s*” between adjacent bytes.
Here’s a 4-byte string I was able to find this way, even though it spans two lines: 15\_s*00\_s*21\_s*00
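A different way around the line-break problem, not the one I used, is to search the raw bytes directly so that no hex text is involved at all. Below is a minimal C++ sketch of that idea. The names find_all and read_file are hypothetical, and reading the whole file into memory is an assumption that only holds for files that fit in RAM.

```cpp
#include <algorithm>
#include <cstddef>
#include <fstream>
#include <iterator>
#include <string>
#include <vector>

// Returns the offset of every occurrence of `needle` in `haystack`.
// Both are raw bytes, not hex text, so there are no line breaks to worry about.
std::vector<std::size_t> find_all(const std::vector<unsigned char>& haystack,
                                  const std::vector<unsigned char>& needle) {
    std::vector<std::size_t> offsets;
    if (needle.empty()) return offsets;
    auto it = haystack.begin();
    while (true) {
        it = std::search(it, haystack.end(), needle.begin(), needle.end());
        if (it == haystack.end()) break;
        offsets.push_back(static_cast<std::size_t>(it - haystack.begin()));
        ++it;  // resume one byte later so overlapping matches are also found
    }
    return offsets;
}

// Reads a whole file into memory. Acceptable for a sketch, not for huge captures.
std::vector<unsigned char> read_file(const std::string& path) {
    std::ifstream in(path, std::ios::binary);
    return std::vector<unsigned char>(std::istreambuf_iterator<char>(in),
                                      std::istreambuf_iterator<char>());
}
```

For the 4-byte string above, the needle would be the bytes {0x15, 0x00, 0x21, 0x00}, and each returned offset can then be inspected with xxd -s to look at the surrounding bytes.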
- · Obstacle 3: identify the data file among 20 files. Thanks to this one obstacle, I spent most of my time searching in the wrong files 😉
Solution: remove each file successively, starting from the later hours, and retest, until the needle stops showing. The last removed file must contain our needle. That file is a much smaller haystack.
o One misleading piece of information is the “9.30 am” mentioned in the spec. The message actually came much earlier.
o Another is the timestamp passed to my parser function. I'm not sure where it comes from, but it says 08:00:00.1 am, so I thought the needle must be in the 8am file, when it was actually in the 4am file. In this feed, the only reliable timestamp I have found is the one in the packet header, one level above the messages.
- · Obstacle 4: my “needle” was too short, so there were too many useless matches.
Solution: find a longer, more distinctive needle, such as the SourceTime field, which is a 32-bit integer. Converting it to hex gives 8 hex digits, and reversing the byte order (because of the endianness) gives the needle “008e0959”. With that I was able to search across all 14 data files:
for f in arca*0; do
  xxd -c999999 -p "$f" > "$f.hex"
  grep -ioH 008e0959 "$f.hex" && echo "found in $f"
done
- · Obstacle 5: I had to find and print the needle using my C++ parser. It’s easy to print a wrong hex representation in C/C++, so for most of this exercise I wasn’t sure whether the hex dump in my C++ log was correct.
o If you convert a long byte array to hex and print it without whitespace, you could see 15002100ffffe87600. When I added a space after each byte, it looked like 15 00 21 00 ffffe876 00, so one byte was overflowing its two hex digits without any warning!
o If you forget padding, you will see a lot of single “0” where you should get “00”. Again, without whitespace between bytes you won’t notice.
Solution: I have worked out some simplified code that works, in both C++ and C. Ask me if you need it.
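For anyone who wants a starting point, here is a minimal C++ sketch of a byte-to-hex printer that avoids both pitfalls. It is an illustration written for this note under my own assumptions, not the code mentioned above, and hex_dump is a hypothetical name. It assumes the overflow comes from sign extension of a signed char, which is the usual cause of runs of “ff” in this kind of output.

```cpp
#include <cstddef>
#include <iomanip>
#include <sstream>
#include <string>

// Formats `len` bytes as two-digit hex values separated by single spaces.
std::string hex_dump(const char* data, std::size_t len) {
    std::ostringstream os;
    os << std::hex << std::setfill('0');
    for (std::size_t i = 0; i < len; ++i) {
        if (i != 0) os << ' ';
        // Go through unsigned char first: where plain char is signed, a byte
        // of 0x80 or above converted straight to int is sign-extended, so
        // 0xe8 would print as ffffffe8 instead of e8.
        unsigned value = static_cast<unsigned char>(data[i]);
        // setw applies to the next item only, so it is set for every byte.
        // Together with setfill('0') it turns "0" into "00".
        os << std::setw(2) << value;
    }
    return os.str();
}
```

Keeping the space between bytes is deliberate: any value wider than two digits stands out immediately instead of blending into its neighbours.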
- · Obstacle 6: in some feeds, the sequence number is not in the raw data. In this feed it is, so Nick’s suggestion was valid, but I was blocked by the other obstacles.
Tip: if the sequence number is in the feed, you will probably spot a pattern of incrementing hex numbers recurring at intervals in the hex viewer.