Locate msg in binary feed #multiple issues solved

Hi guys, thanks to all your help, I managed to locate the very first trading session message in the raw data file.

We hit and overcame multiple obstacles in this long “needle in a haystack” search.

  • Big Obstacle 1: endianness. It turned out the raw data is little-endian. For my “needle”, the symbol integer id 15852 (decimal), i.e. 3dec in hex, shows up byte-swapped as “ec3d” in the dump, as I saw when I finally found it.

Solution: read the exchange spec. It should state the byte order.
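To turn a decimal id into the byte-swapped hex string to search for, a minimal Python sketch follows. The helper name le_hex_needle and the 2-byte field width are my assumptions; the real width should come from the exchange spec.

```python
def le_hex_needle(value: int, width_bytes: int) -> str:
    """Return the hex digits of `value` as they appear in a little-endian dump."""
    return value.to_bytes(width_bytes, byteorder="little").hex()

# Symbol id 15852 is 0x3dec; stored little-endian in 2 bytes it reads "ec3d".
print(le_hex_needle(15852, 2))  # ec3d
```

The same helper works for wider fields, as long as the width passed in matches the field width in the spec.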

  • Big Obstacle 2: my hex viewers (like “xxd”) add line breaks to the output, so a needle that straddles a line break is missed by the search. (Thanks to Vishal for pointing this out.)

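Solution 1: xxd -p -c 999999 raw/feed/file > tmp.txt; grep "$needle" tmp.txt

The default xxd column size is 16 (30 in plain -p mode), so the output gets a line break every few bytes, which is unwanted. So I set a very large column size of 999999. Plain mode matters here because the default mode caps the column count at 256 and interleaves offsets and spaces with the hex digits.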

Solution 2: in the vi editor, after “%!xxd -p”, if you see line breaks you can still search for “ec\_s*3d”. Basically you insert “\_s*” (any run of whitespace, including line breaks) between adjacent bytes.

Here’s a 4-byte string I was able to find this way. It spanned two lines: 15\_s*00\_s*21\_s*00
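Another way around the line-break problem is to skip the hex dump and search the raw bytes directly. The sketch below is in Python; the helper name find_all is hypothetical, and it assumes the data fits in memory.

```python
def find_all(haystack: bytes, needle_hex: str) -> list:
    """Return every byte offset where the needle (given as hex digits) occurs."""
    needle = bytes.fromhex(needle_hex)
    offsets = []
    pos = haystack.find(needle)
    while pos != -1:
        offsets.append(pos)
        pos = haystack.find(needle, pos + 1)
    return offsets

# Illustrative data only: the 4-byte string from above embedded in padding.
sample = bytes.fromhex("ffff" + "15002100" + "aa" + "15002100")
print(find_all(sample, "15002100"))  # [2, 7]
```

On a real file, the haystack would be the result of reading the file in binary mode ("rb"). Because the search runs on bytes rather than on printed hex digits, it cannot produce a match that starts in the middle of a byte.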

  • Obstacle 3: identify the data file among 20 files. Thanks to this one obstacle, I spent most of my time searching in the wrong files 😉

Solution: remove each file successively, starting from the later hours, and retest, until the needle stops showing. The last removed file must contain our needle. That file is a much smaller haystack.

o One misleading piece of information is the “9.30 am” mentioned in the spec. Actually the message came much earlier.

o Another misleading piece of information is the timestamp passed to my parser function. Not sure where it comes from, but it says 08:00:00.1 am, so I thought the needle must be in the 8am file, but it is actually in the 4am file. In this feed, the only reliable timestamp I have found is the one in the packet header, one level above the messages.

  • Obstacle 4: my “needle” was too short, so there were too many useless matches.

Solution: find a longer and more distinctive needle, such as the SourceTime field, which is a 32-bit integer. Converting it to hex gives 8 hex digits. Then I reverse the byte order because of endianness, which gives the needle “008e0959”. I was then able to search across all 14 data files:

for f in arca*0; do
  xxd -c999999 -p "$f" > "$f.hex"
  grep -ioH 008e0959 "$f.hex" && echo "found in $f"
done
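The same loop can be sketched in Python without the intermediate .hex files. The function name files_containing is hypothetical, and the glob pattern simply mirrors the shell loop. Each file is read fully into memory, which may be impractical for very large captures.

```python
import glob
import struct

def files_containing(paths, needle: bytes):
    """Return (path, first_offset) for every file whose bytes contain needle."""
    hits = []
    for path in paths:
        with open(path, "rb") as fh:
            offset = fh.read().find(needle)
        if offset != -1:
            hits.append((path, offset))
    return hits

# 0x59098e00 is the needle "008e0959" with its bytes reversed back;
# "<I" packs it as a little-endian unsigned 32-bit integer.
needle = struct.pack("<I", 0x59098E00)
for path, offset in files_containing(sorted(glob.glob("arca*0")), needle):
    print(f"found in {path} at byte offset {offset}")
```

The byte offset is a bonus over grep on a hex dump: it tells you where in the file to point the parser or the hex viewer.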


  • Obstacle 5: I have to find and print the needle using my C++ parser. It’s easy to print a wrong hex representation in C/C++, so for most of this exercise I wasn’t sure whether the hex dump in my C++ log was correct.

o If you convert a long byte array to hex and print without whitespace, you could see 15002100ffffe87600, but when I added a space after each byte, it looked like 15 00 21 00 ffffe876 00, so the 3rd byte was overflowing without warning! (A likely cause is a signed value being sign-extended when it is passed to the hex formatter.)

o If you forget zero-padding, you see a lot of single “0” where you should get “00”. Again, if you don’t include whitespace you won’t notice.

Solution: I have worked out some simplified code that works. I have a C++ solution and a C solution. You can ask me if you need them.
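As a cross-check for a C++ log, a reference dump can be sketched in Python. This is not the C or C++ solution mentioned above. The helper name hex_dump is hypothetical. It masks each value with 0xFF so that a negative value (what a signed char holds for bytes above 0x7f) still prints as exactly two digits, which is the same idea as casting to unsigned char in C or C++ before formatting.

```python
def hex_dump(values) -> str:
    """Format each value as exactly two lowercase hex digits, space-separated."""
    return " ".join(f"{v & 0xFF:02x}" for v in values)

# Illustrative values, not taken from the feed: -24 is how a signed char
# holds the byte 0xe8, and 0 must print as "00" rather than "0".
print(hex_dump([0x15, 0x00, 0x21, 0x00, -24, 0x76, 0x00]))
# 15 00 21 00 e8 76 00
```

Keeping the space between bytes makes both failure modes above visible at a glance: any token that is not exactly two characters long is wrong.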

  • Obstacle 6: In some feeds the sequence number is not in the raw data. In this feed it is, so Nick’s suggestion was valid, but I was blocked by the other obstacles.

Tip: If sequence number is in the feed, you would probably spot a pattern of incrementing hex numbers periodically in the hex viewer.


hard edit vs soft edit when submitting order to mainframe

In one order entry system (at a big European megabank), new orders are sent to the mainframe. Upon validation, the mainframe can return a message to the order entry system.

If the message is a hard edit, the order is rejected by the mainframe validation module.

If the message is a soft edit, the order is accepted by the mainframe validation module. The soft edit is purely informational, not necessarily a warning. No action is required from the user of the order entry system. I guess the soft edit is just “FYI”.

Posted By familyman to learning finance,c++,py… at 3/15/2011 11:52:00 PM

cobol copybook = input format spec

A COBOL copybook is “a file describing an input data format”.

“cobol copybook” is the standard term (“cobol-layout” is less common) for files like that mentioned in https://ssl.kundenserver.de/shop.softproject.de/downloads/CobolCopybookToolkitForJava.pdf


— based on http://edocs.bea.com/JAM/v51/program/egenapp.html
A COBOL CICS or IMS mainframe application typically uses a copybook source file to define its data layout. This file is specified in a COPY directive within the LINKAGE SECTION of the source program for a CICS application, or in the WORKING-STORAGE SECTION of an IMS program. If the CICS or IMS application does not use a copybook file, you will have to create one from the data definition contained in the program source.

A copybook is conceptually (not technically) part of a COBOL program. Usually the copybook is a standalone file, included by its parent program.
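To illustrate the idea of a copybook as an input format spec, here is a Python sketch that slices a fixed-width record according to a layout. The field names, widths and sample record are all made up, and real copybooks have features (for example packed decimals and nested groups) that this ignores.

```python
# Hypothetical layout, roughly what a copybook like this would describe:
#   05 ACCOUNT-ID  PIC X(6).
#   05 CURRENCY    PIC X(3).
#   05 QUANTITY    PIC 9(5).
LAYOUT = [("account_id", 6), ("currency", 3), ("quantity", 5)]

def parse_record(record: str, layout=LAYOUT) -> dict:
    """Split a fixed-width text record into named fields."""
    fields, start = {}, 0
    for name, width in layout:
        fields[name] = record[start:start + width].strip()
        start += width
    return fields

print(parse_record("AB1234USD00150"))
# {'account_id': 'AB1234', 'currency': 'USD', 'quantity': '00150'}
```

The point is only that the layout lives outside the parsing logic, much as a copybook lives outside the program that includes it.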

Posted By familyman to learning finance,c++,py… at 1/20/2008 08:53:00 PM

R-programming resources #ebooks …

–ebooks (master copy is in USB drive)
There are also decent ebooks outside CRAN.

http://cran.r-project.org/doc/manuals/R-intro.pdf — more techie

http://cran.r-project.org/doc/manuals/R-data.pdf — includes excel integration
http://cran.r-project.org/doc/contrib/usingR.pdf — good
http://cran.r-project.org/doc/contrib/Verzani-SimpleR.pdf — more stats

model-based expert system

Q: why is SSCFI expert system model-based? Why is the model central to this expert system?

* According to published data, the largest body of this expert system’s “knowledge” is about line records, even though the expert system’s main job is something else, i.e. diagnosis! This is common among real-world expert systems.

* I guess among expert systems, model-based designs form one well-known type. A one-liner introduction is “An expert system based on fundamental knowledge of the design and function of an object. Such systems are used to diagnose equipment problems, for example.”

* Circuit models help the system survive and continue to function despite 2 difficulties:
1) many line records are unreliable (non-standard)
2) many pieces of test equipment (used to test circuits) are unreliable (often misconfigured)

Most if not all of the arguments below are my hypotheses with limited evidence.

* I feel an intelligent expert system can “reason” and use judgement, just like humans do with an internal model. The more comprehensive the model, the more it can reason and make sense of confusing data.

* I feel fault isolation may require the system to keep track of test results, to be interpreted in context. The circuit model is part of the context.

* I feel test data are perhaps correlated. The relations can be hard to identify. A model helps. A human tester, too, relies on a circuit model to correlate data.

* I feel test results have patterns, as experienced human testers know. Perhaps patterns about brands and models, about seasons, about circuit types and designs … These could presumably be incorporated into the circuit model.

strongly^weakly typed

Most complex software favors strong typing. I feel it’s not all due to ignorance, inertia, corporate politics or the marketing machine. Some brave and sharp technical minds ….

I think large teams need clean and well defined module-to-module interfaces. (module ~= class) A variable (mostly a pointer to a chunk of memory) should have well defined operations for it.

The precision comes with a cost — development time, inflexibility … but large teams usually need more coordination and control. At the heart of it is “identification”.

In the military, hospitals, government, and also large companies, identification is part of everyday life. It provides a foundation to security and coordination.

At the heart of OO modelling is translating real-world security policies into rules built into the system. Strong typing = precise type identification.