tags: respect^bmark^GTD^localSys

These tags/categories overlap. Not a big problem.

  • respect — respect from MANAGER only. Synonyms include approval
  • coworker benchmark — usually the main reason for PIP, though Macq is exception
  • GTD — the basis of coworker benchmark
  • localSys knowledge — the #1 source of GTD power

op-new: allocate^construct #placement #IV

Popular IV topic. P41 [[more effective c++]] has an excellent summary:

  1. to BOTH allocate (on heap) and call constructor, use regular q(new)
  2. to allocate Without construction, use q(operator new)
    1. You can also use malloc. See https://stackoverflow.com/questions/8959635/malloc-placement-new-vs-new
  3. to call constructor on heap storage already allocated, use placement-new, which invokes ctor

The book has examples of Case 2 and Case 3.

Note it’s common to construct objects directly on the stack and in the global area, but on heap storage that is already allocated, placement-new is the only way to invoke a constructor.

Placement-new is a popular interview topic (Jump, DRW and more …), rarely used in common projects.
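
The three cases above can be sketched in a few lines. This is only a minimal illustration, not the book's example; the Trade class and the function name are hypothetical.

```cpp
#include <cassert>
#include <new>      // operator new, placement new
#include <string>

// Hypothetical class, used only for illustration.
struct Trade {
    std::string symbol;
    int qty;
    Trade(const std::string& s, int q) : symbol(s), qty(q) {}
};

// Walks through the three cases and returns the sum of the two quantities.
int demoThreeCases() {
    // Case 1: regular new allocates on the heap AND calls the constructor.
    Trade* t1 = new Trade("IBM", 100);

    // Case 2: operator new allocates raw storage only; no constructor runs.
    void* raw = ::operator new(sizeof(Trade));

    // Case 3: placement-new runs the constructor on the storage from Case 2.
    Trade* t2 = new (raw) Trade("MSFT", 200);
    assert(static_cast<void*>(t2) == raw);  // same address, now holding a live object

    int total = t1->qty + t2->qty;

    delete t1;               // destructor + deallocation in one step
    t2->~Trade();            // placement-new objects need an explicit destructor call
    ::operator delete(raw);  // then release the raw storage separately
    return total;
}
```

The clean-up mirrors the construction: Case 1 is undone by a single delete, while Cases 2 and 3 are undone in two separate steps.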

some Technicalities for QnA IV

We often call these “obscure details”. At the same level, these are a small subset of a large amount of details, so we can’t possibly remember them all 😦

Surprisingly, interviewers show certain patterns when picking which technicality to ask. Perhaps these “special” items aren’t at the same level as the thousands of other obscure details??

These topics are typical of QQ i.e. tough topics for IV only, not tough topics for GTD.

  • archetypical: which socket syscalls are blocking and when
  • $LD_LIBRARY_PATH
  • hash table theoretical details? too theoretical to be part of this discussion
  • select() syscall details
  • vptr, vtable

 

##y MultiCast favored over TCP

Reason: data rate constraints inherent in TCP protocol. Congestion Control?
Reason: TCP to a large group would be one-by-one unicast, highly inefficient and too much load on the sender.
Reason: TCP has more data-overhead in the form of non-payload data.
  • TCP header is typically 20 bytes vs 8 bytes for UDP
  • Receiver needs to acknowledge

q[g++ -g -O] together

https://linux.die.net/man/1/g++ has a section specifically on debugging. It says

GCC allows you to use -g with -O

I think -g adds additional debug info into the binary to help debuggers; -O turns on compiler optimization.

By default, our binaries are compiled with “-g3 -O2”. When I debug these binaries, I can see variables but execution appears to jump around the source lines, causing minor problems. See my blog posts on gdb.

respected veteran (前辈) civil engineer^old C programmers

Opening example — I shared with Wael… If you meet a regular (not a specialist) civil engineer aged 50, you respect and value his skills, but what about a C programmer of the same age? I guess in US this is similar to a civil engineer, but in SG? Likely to be seen with contempt. Key is the /shelf-life/ of the skill.

Look at civil engineers, chemical engineers, accountants, dentists, carpenters or history researchers (like my dad). A relatively high percentage of those with 20Y experience are respected veterans. These fields let you specialize and accumulate.

In contrast, look at fashion, pop music, digital media… I’m not familiar with these professions, but I feel someone with 20Y experience may not be a respected veteran. Why? Because their earliest experiences lose relevance like radioactive decay. The more recent the experience, the more relevant to today’s consumers and markets.


Now let’s /cut to the chase/. For programmers, there are some high-churn and some “accumulative” technical domains. It’s each programmer’s job to monitor, navigate, avoid or seek. We need to be selective. If you are in the wrong domain, then after 20Y you are just an old programmer, not a respected veteran. I’d love to deepen my understanding of my favorite longevity[1] technologies like

  • data structures, algos
  • threading
  • unix
  • C/C++? at the heart or beneath many of these items
  • RDBMS tuning and design; SQL big queries
  • MOM like tibrv
  • OO design and design patterns
  • socket
  • interactive debuggers like gdb

Unfortunately, unlike civil engineering, even the most long-living members above could fall out of favor, in which case your effort doesn’t accumulate “value”.

– C++ is now behind-the-scenes of java and c#.
– threading shows limited value in small systems.

[1] see the write-up on relevant55

–person-profession matching–
An “accumulative” profession like medical research can 1) be hard to get into, 2) require certain personal attributes like perseverance, attention to details, years of focus, and 3) be uninspiring to an individual. Only a small percentage of the population get into that career. (Drop-out rate could be quite high.)

For many years in my late 20’s I was completely bored with technical work, esp. programming, in favor of pre-sales and start-up. But after my U.S. stay I completely changed my preferences.

mkt-data tech skills: portable/shared@@

Raw mkt data tech skill is better than soft mkt data even though it’s further away from “the money”:

  • standard — Exchange mkt data format won’t change a lot. Feels like an industry standard
  • the future — most OTC products are moving to electronic trading and will have market data to process
  • more necessary than many modules in a trading system. However ….. I guess only a few systems need to deal with raw market data. Most down stream systems only deal with the soft market data.

Q1: If you compare 5 typical market data gateway dev [1] jobs, can you identify a few key tech skills shared by at least half the jobs, but not a widely used “generic” skill like math, hash table, polymorphism etc?

Q2: if there is at least one, how important is it to a given job? One of the important required skills, or a make-or-break survival skill?

My view — I feel there is not a shared core skill set. I venture to say there’s not a single answer to Q1.

In contrast, look at quant developers. They all need skills in c++/excel, BlackScholes, bond math, swaps, …

In contrast, also look at dedicated database developers. They all need non-trivial SQL, schema design. Many need stored procs. Tuning is needed if large tables

Now look at market data gateway for OPRA. Two firms’ job requirements will share some common tech skills like throughput (TPS) optimization, fast storage.

If latency and TPS requirements aren’t stringent, then I feel the portable skill set is an empty set.

[1] There are also many positions whose primary duty is market data but not raw market data, not large volume, not latency sensitive. The skill set is even more different. Some don’t need development skill on market data — they only configure some components.

gdb q[next] over if/else +function calls #optimized

I used an optimized binary. Based on limited testing, un-optimized doesn’t suffer from these complexities.

Conventional wisdom: q(next) differs from q(step) and should not go into a function

Rule (simple case): When you are on a line of if-statement in a source code, q(next) would evaluate this condition. If the condition doesn’t involve any function call, then debugger would evaluate it and move to the “presumed next line”, hopefully another simple statement.

Rule 1: suppose your “presumed next line” involves a function call, debugger would often show the first line in the function as the actual “pending”. This may look like step-into!

Eg: In the example below. Previous pending is showing L432 (See Rule 2b to interpret it). The presumed line is L434, but L434 involves a function call, so debugger actually shows L69 as the “pending” i.e. the first line in the function

Rule 2 (more tricky): suppose presumed line is an if-statement involving a function call. Debugger would show first line in the function as the pending.

Eg: In the example below, Previous pending was showing L424. Presumed line is L432, but we hit Rule 2, so actual pending is L176, i.e. first line in the function.

Rule 2b: when debugger shows such an if-statement as the “pending”, then probably the function call completed and debugger is going to evaluate the if-condition.

424 if (isSendingLevel1){
425 //……
426 //……….
427 //……..
428 //……….
429 } // end of if
430 } // end of an outer block
431
432 if (record->generateTopOfBook()
433 && depthDb->isTopOfTheBook(depthDataRecord)) {
434 record->addTopOfBookMarker(outMsg);
435 }

#1 challenge if u rely@gdb to figure things out: optimizer

Background: https://bintanvictor.wordpress.com/2015/12/31/wall-st-survial-how-fast-you-figure-things-out-relative-to-team-peers/ explains why “figure things out quickly” is such a make-or-break factor.

In my recent experience, I feel compiler optimization is the #1 challenge. It can mess up GDB step-through. For a big project using automated build, it is often tricky to disable every optimization flag like “-O2”.

More fundamentally, it’s often impossible to tell if the compiled binary in front of you was compiled as optimized or not. Rarely the binary shows it.

Still, compared to other challenges in figuring things out, this one is tractable.

gdb skill level@Wall St

I notice that none of my veteran c++ colleagues (I asked only 3) [2] is a gdb expert, in the way that some are concurrency experts, algo experts [1], …

Most of my c++ colleagues are reluctant to use a console debugger. Many are more familiar with GUI debuggers such as eclipse and MSVS. All agree that prints are often a sufficient debugging tool.

[1] Actually, these other domains are more theoretical and produce “experts”.

[2] maybe I didn’t meet enough true c++ alpha geeks. I bet many of them may have very good gdb skills.

I would /go out on a limb/ to say that gdb is a powerful tool and can save lots of time. It’s similar to adding a meaningful toString() or operator<< to your custom class.

Crucially, it could help you figure things out faster than your team peers. I first saw this potential when learning remote JVM debugging in GS.

— My view on prints —
In perl and python, I use prints exclusively and never needed interactive debuggers. However, in java/c++/c# I heavily relied on debuggers. Why the stark contrast? No good answer.

Q: when are prints not effective?
A: when the edit-compile-test cycle is too long, not automated but too frequent (like 40 times in 2 hours) and when there is real delivery pressure. Note the test part could involve many steps and many files and other systems.
A: when you can’t edit the file at all. I have not seen it.

A less discussed fact — prints are simple and reliable. GUI or console debuggers are often poorly understood. Look at step-through. Optimization, threads, and exceptions often have unexpected impacts. Or look at program state inspection. Many variables are hard to “open up” in console debuggers. You can print var.func1()

 

gdb stop@simple assignments #compiler optimize

Toggle between -O2 and -O0, which is the default non-optimized compilation.

In my definition, A “simple assignment” is one without using functions. It can get value from another variable or a literal. Simple assignments are optimized away under -O2, so gdb cannot stop on these lines. This applies to break point or step-through.

In particular, if you breakpoint on a simple assignment then “info breakpoint” will show a growing hit count on this breakpoint, but under -O2 gdb would never stop there. -O0 works as expected.

As another illustration, if an if-block contains nothing but simple assignment, then gdb has nowhere to stop inside it and will only stop after the if-block. You won’t know whether you entered it. -O0 works as expected.
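
A small test program along these lines could be used to reproduce the observation. It is only a sketch with hypothetical names; build it once with “g++ -g -O0” and once with “g++ -g -O2”, then set a breakpoint on the line marked as a simple assignment. Exact behavior will likely vary with compiler version.

```cpp
#include <cassert>

// Hypothetical helper, here only so that one statement involves a function call.
int lookupLimit(int depth) {
    assert(depth >= 0);
    return depth * 10;
}

int demoAssignments(bool isSendingLevel1) {
    int depth = 0;
    if (isSendingLevel1) {
        depth = 1;  // "simple assignment" from a literal: try a breakpoint here
    }
    int limit = lookupLimit(depth);  // not "simple": the value comes from a function
    return limit;
}
```

Note that under -O2 the helper itself may be inlined, so even the second statement is not guaranteed to be a stopping point.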

MSVS linker option -L vs -l #small L

Linker option -l (small ell) specifies the lib file name; -L specifies the search path

———-

Finally, you must tell the linker where to find the library files. The ST-Developer Windows installation contains programming libraries built for use with Visual C++ 2005 (VC8) and 2008 (VC9) with the /MD flag, as well as versions that are link compatible with other compilers. These are kept in different subdirectories under the ST-Developer “lib” directory. The library directories available for Visual Studio 2005 and 2008 are as follows:

  • $(ROSE)/lib/i86_win32_vc8_md (2005, 32bit platform, /MD)
  • $(ROSE)/lib/x64_win64_vc8_md (2005, 64bit platform, /MD)
  • $(ROSE)/lib/i86_win32_vc9_md (2008, 32bit platform, /MD)
  • $(ROSE)/lib/x64_win64_vc9_md (2008, 64bit platform, /MD)

Select the Linker category of properties and, pick the General options under that. In the Additional Library Directories property, add the appropriate ST-Developer library directory.

 

BFT^DFT, pre^in^post-order #trees++

Here are my own “key” observations, possibly incomplete, on 5 standard tree walk algos, based on my book [[discrete math]]

  • in-order is defined for binary trees only. Why? in-order means “left-subtree, parent, right-subtree”, implying only 2 subtrees.
  • In contrast, BFS/DFS are more versatile. Their input can be any graph. If not a tree, then BFS/DFS could produce a tree — you first choose a “root” node arbitrarily.
  • The 3 *-order walks are recursively defined. I believe they are three specializations of DFS. See https://www.cs.bu.edu/teaching/c/tree/breadth-first/ and https://en.wikipedia.org/wiki/Tree_traversal
  • The differences among the 3 *-order walks are visually illustrated via the black dot on https://en.wikipedia.org/wiki/Tree_traversal#Pre-order_(NLR)

—-That’s a tiny intro. Below are more advanced —

BFT uses an underlying queue. In contrast DFS is characterized by backtracking, which requires a stack. However, you can implement the queue/stack without recursion.

BFT visits each node twice; DFT visits each node at least twice — opening (to see its edges) and leaving (when all edges are explored)

DFT backtrack in my own language —

  1. start at root.
  2. At each node, descend into first [1] subtree
  3. descend until a leaf node A1
  4. backtrack exactly one level up to B
  5. descend into A1’s immediate next sibling A2’s family tree (if any) until a leaf node. If unable to move down (i.e. no A2), then move up to C.

Visually — if we assign a $val to each node such that these nodes form a BST, then we get the “shadow rule”. In-order DFS would probably visit the nodes in ascending order by $val

[1] the child nodes, of size 1 or higher, at each node must be remembered in some order.
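
To make the queue/stack point concrete, here is a minimal sketch for binary trees only. The Node type and function names are my own for illustration. BFT uses a FIFO queue; in-order DFT uses an explicit stack in place of recursion.

```cpp
#include <cassert>
#include <queue>
#include <stack>
#include <vector>

// Hypothetical binary tree node.
struct Node {
    int val;
    Node* left = nullptr;
    Node* right = nullptr;
    explicit Node(int v) : val(v) {}
};

// BFT: a FIFO queue yields level-by-level order.
std::vector<int> breadthFirst(Node* root) {
    std::vector<int> out;
    std::queue<Node*> q;
    if (root) q.push(root);
    while (!q.empty()) {
        Node* n = q.front();
        q.pop();
        out.push_back(n->val);
        if (n->left) q.push(n->left);
        if (n->right) q.push(n->right);
    }
    return out;
}

// In-order DFT with an explicit stack instead of recursion.
std::vector<int> inOrder(Node* root) {
    std::vector<int> out;
    std::stack<Node*> st;
    Node* cur = root;
    while (cur || !st.empty()) {
        while (cur) {            // descend into the left subtree as far as possible
            st.push(cur);
            cur = cur->left;
        }
        cur = st.top();          // backtrack exactly one level
        st.pop();
        assert(cur != nullptr);
        out.push_back(cur->val); // visit the parent between its two subtrees
        cur = cur->right;        // then descend into the right subtree
    }
    return out;
}
```

On a BST, inOrder() returns the values in ascending order, which matches the observation above.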

FIX support for multi-leg order # option strategy

  • There are multi-leg FIX orders to trade two stocks without options
    • I believe there’s no valid multi-leg order to trade the same stock using market orders. The multiple legs simply collapse to one. How about two-leg limit orders? I would say Please split into two disconnected limit orders.
  • There are multi-leg FIX orders to trade two futures contracts
  • There are multi-leg FIX orders to trade 3 products like a butterfly.
  • There are multi-leg FIX orders with a special ratio like an option “ratio spread”

But today I show a more common multi-leg strategy order — buy 5 buy-write with IBM. At least four option exchanges support multi-leg FIX orders — Nasdaq-ISE / CBOE / ARCA / Nasdaq-PHLX

35=AB;
11=A1;
38=5; // quantity = 5 units
54=1;
555=2; // how many legs in this order. This tag precedes a repeatingGroup!
654=Leg1; //leg id for one leg
600=IBM; // same value repeated later!
608=OC; //OC=call option
610=201007; //option expiry
612=85; //strike price
624=2; //side = Sell
654=Leg2;
600=IBM; …

Note that in this case, the entire strategy is defined in FIX, without reference to some listed symbol or productID.
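
To show how the repeating group hangs together, here is a rough sketch of a parser for the simplified listing above. It is not a real FIX engine: real FIX uses the SOH (0x01) delimiter rather than ';', and all function names here are hypothetical. It assumes tag 654 is the first field of each leg, as in the listing.

```cpp
#include <cassert>
#include <sstream>
#include <string>
#include <utility>
#include <vector>

using Field = std::pair<int, std::string>;  // (tag, value)

// Split "tag=value;tag=value;..." into ordered fields.
std::vector<Field> parseFields(const std::string& msg, char delim = ';') {
    std::vector<Field> fields;
    std::istringstream in(msg);
    std::string token;
    while (std::getline(in, token, delim)) {
        std::size_t eq = token.find('=');
        if (eq == std::string::npos) continue;  // skip empty or malformed tokens
        assert(eq > 0);                         // a field must have a non-empty tag
        fields.emplace_back(std::stoi(token.substr(0, eq)), token.substr(eq + 1));
    }
    return fields;
}

// Each occurrence of tag 654 (leg id) starts a new leg in the repeating group.
// Simplification: any fields after the group would be lumped into the last leg.
std::vector<std::vector<Field>> splitLegs(const std::vector<Field>& fields) {
    std::vector<std::vector<Field>> legs;
    for (const Field& f : fields) {
        if (f.first == 654) legs.emplace_back();
        if (!legs.empty()) legs.back().push_back(f);
    }
    return legs;
}

// True when the declared leg count (tag 555) matches the legs actually found.
bool legCountConsistent(const std::vector<Field>& fields) {
    for (const Field& f : fields)
        if (f.first == 555)
            return std::stoul(f.second) == splitLegs(fields).size();
    return false;  // no 555 tag: not a multi-leg message
}
```

The order of fields matters here, which is why the fields are kept in a vector rather than a map: tag 600 legitimately appears once per leg.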

ODR@functions # and classes

Warning – ODR is never quizzed in IV. A weekend coding test might touch on it but we can usually circumvent it.

OneDefinitionRule is more strict on global variables (which have static duration). You can’t have 2 global variables sharing the same name. Devil is in the details:

(As explained in various posts, you declare the same global variable in a header file that’s included in various compilation units, but you allocate storage in exactly one compilation unit. Under a temporary suspension of disbelief, let’s say there are 2 allocated storage for the same global var, how would you update this variable?)

With free function f1(), ODR is more relaxed. http://www.drdobbs.com/cpp/blundering-into-the-one-definition-rule/240166489 (possibly buggy) explains the Lesser ODR vs Greater ODR. Lesser ODR is simpler and more familiar, forbidding multiple (same or different) definitions of f1() within one compilation unit.

My real focus today is the Greater ODR. Obeying Lesser ODR, the same static or inline function is often included via a header file and compiled into multiple binary files. If you want to put a non-template free function definition in a shared header file without violating the Greater ODR, then it must be static or inline, implicitly or explicitly. I find the Dr Dobbs article unclear on this point — In my test, when a free function was defined in a shared header without “static” or “inline” keywords, then linker screams “ODR!”

The most common practice is to move function definitions out of shared headers, so the linker (or compiler) sees only one definition globally.

With inline, Linker actually sees multiple (hopefully identical) physical copies of func1(). Two copies of this function are usually identical definitions. If they actually have different definitions, compiler/linker can’t easily notice and are not required to verify, so no build error (!) but you could hit strange run time errors.

Java linker is simpler and never causes any problem so I never look into it.

//* if I have exactly one inline, then the non-inlined version is used. 
// Linker doesn't detect the discrepancy between two implementations.
//* if I have both inline, then the build fails at link time since neither definition 
// is visible to main.cpp
//* if I remove both inline, then we hit ODR 
//* objdump on the object file would show the function name 
// IFF it is exposed i.e. non-inline
::::::::::::::
lib1.cpp
::::::::::::::
#include <iostream>
using namespace std;

//inline
void sameFunc(){
    cout<<"hi"<<endl;
}
::::::::::::::
lib2.cpp
::::::::::::::
#include <iostream>
using namespace std;

inline
void sameFunc(){
    cout<<"hey"<<endl;
}
::::::::::::::
main.cpp
::::::::::::::
void sameFunc(); //needed
int main(){
  sameFunc();
}

 

Sliding-^Advertised- window size

https://networklessons.com/cisco/ccnp-route/tcp-window-size-scaling/ has real life illustration using wireshark.

https://www.ibm.com/support/knowledgecenter/en/SSGSG7_7.1.0/com.ibm.itsm.perf.doc/c_network_sliding_window.html

https://web.cs.wpi.edu/~rek/Adv_Nets/Spring2002/TCP_SlidingWindows.pdf

  • AWS = amount of free space on receive buffer
    • This number, along with the ack seq #, are both sent from receiver to sender
  • lastSeqSent and lastSeqAcked are two control variables in the sender process.

SWS indicates position within a stream; AWS is a single integer.

Q: how are the two variables updated during transmission?
A: When an Ack for packet #9183 is received, sender updates its control variable “lastSeqAcked”. It then computes how many more packets to send, ensuring that “lastSeqSent – lastSeqAcked < AWS”

  • SWS (sliding window size) = lastSeqSent – lastSeqAcked = amount of transmitted but unacknowledged bytes, is a derived control variable in the sender process, like a fluctuating “inventory level”.
  • SWS is a concept; AWS is a TCP header field
  • receive buffer size — is sometimes incorrectly referred to as window size
  • “window size” is vague but usually refers to SWS
  • AWS is probably named after the sender’s sliding window.
  • receiver/sender — only these players control the SWS, not the intermediate routers etc.
  • too large — large SWS is always set based on large receive buffer, perhaps AWS
  • too small — underutilized bandwidth. As explained in linux tcp buffer^AWS tuning, high bandwidth connections should use larger AWS.

Q: how are AWS and SWS related?
A: The sender adjusts lastSeqSent (and SWS) based on the feedback of AWS and lastSeqAcked.
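
The sender-side bookkeeping can be sketched as below. This is a toy model under my own assumptions, not real TCP: it ignores sequence-number wraparound and congestion control, and it lets the in-flight amount reach AWS rather than stay strictly below it. The struct and function names are hypothetical; the two field names follow the text.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical sender-side state.
struct SenderState {
    std::uint32_t lastSeqSent;   // highest byte sequence number sent so far
    std::uint32_t lastSeqAcked;  // highest byte sequence number acknowledged
};

// SWS: bytes in flight, i.e. transmitted but not yet acknowledged.
std::uint32_t slidingWindowSize(const SenderState& s) {
    assert(s.lastSeqSent >= s.lastSeqAcked);  // wraparound is ignored in this sketch
    return s.lastSeqSent - s.lastSeqAcked;
}

// How many more bytes may be sent without SWS exceeding the advertised window.
std::uint32_t bytesAllowedToSend(const SenderState& s, std::uint32_t aws) {
    std::uint32_t inFlight = slidingWindowSize(s);
    return inFlight >= aws ? 0 : aws - inFlight;
}

// An incoming Ack moves lastSeqAcked forward, which shrinks SWS.
void onAck(SenderState& s, std::uint32_t ackedSeq) {
    if (ackedSeq > s.lastSeqAcked && ackedSeq <= s.lastSeqSent)
        s.lastSeqAcked = ackedSeq;
}
```

SWS is derived from the two control variables on every call, which reflects the “inventory level” idea: nothing stores it, it simply fluctuates as data is sent and acknowledged.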

## 4 exchange feeds +! TCP/multicast

  • eSpeed
  • Dalian Commodity Exchange
  • SGX Securities Market Direct Feed
  • CanDeal (http://www.candeal.com) mostly Canadian dollar debt securities

I have evidence (not my imagination) to believe that these exchange data feeds don’t use vanilla TCP or multicast but some proprietary API (based on them presumably).

I was told other China feeds (probably Shenzhen feed) are also API-based.

ESpeed would ship a client-side library. Downstream would statically link or dynamically link it into parser. The parser then communicates to the server in some proprietary protocol.

IDC is not an  exchange but we go one level deeper, requiring our downstream to provide rack space for our own blackbox “clientSiteProcessor” machine. I think this machine may or may not be part of a typical client/server set-up, but it communicates with our central server in a proprietary protocol.

%%GTD xp: 2types@technical impasse#hard problems

See also post on alpha geeks…
See also post on how fast you figure things out relative to peers
See also ##a few projects technically too challeng` 4me
See also https://bintanvictor.wordpress.com/2017/03/26/google-searchable-softwares/
see also https://bintanvictor.wordpress.com/2017/05/29/transparentsemi-transparentopaque-languages/

tuning? never experienced this challenge in my projects.
NPE? Never really difficult in my experience.

#1 complexity/opacity/lack of google help

eg: understanding a hugely complex system like the Quartz dag and layers
eg: replaying raw data, why multicast works consistently but tcp fails consistently
eg: adding ssl to Guardian. Followed the standard steps but didn’t work. Debugger was not able to reveal anything.
Eg: Quartz dag layers
Eg: Quartz cancelled trade

#2 Intermittent, hard to reproduce
eg: Memory leak is one example, in theory but not in my experience

eg: crashes in GMDS? Not really my problem.

eg: Quartz preferences screen frequently but intermittently fails to remember the setting. Unable to debug into it i.e. opaque.

edit 1 file in big python^c++ production system #XR

Q1: suppose you work in a big, complex system with 1000 source files, all in python, and you know a change to a single file will only affect one module, not a core module. You have tested it + ran a 60-minute automated unit test suite. You didn’t run a prolonged integration test that’s part of the department-level full release. Would you and approving managers have the confidence to release this single python file?
A: yes

Q2: change “python” to c++ (or java or c#). You already followed the routine to build your change into a dynamic library, tested it thoroughly and ran unit test suite but not full integration test. Do you feel safe to release this library?
A: no.

Assumption: the automated tests were reasonably well written. I never worked in a team with a measured test coverage. I would guess 50% is too high and often impractical. Even with high measured test coverage, the risk of bug is roughly the same. I never believe higher unit test coverage is a vaccination. Diminishing return. Low marginal benefit.

Why the difference between Q1 and Q2?

One reason — the source file is compiled into a library (or a jar), along with many other source files. This library is now a big component of the system, rather than one of 1000 python files. The managers will see a library change in c++ (or java) vs a single-file change in python.

Q3: what if the change is to a single shell script, used for start/stop the system?
A: yes. Manager can see the impact is small and isolated. The unit of release is clearly a single file, not a library.

Q4: what if the change is to a stored proc? You have tested it and run full unit test suite but not a full integration test. Will you release this single stored proc?
A: yes. One reason is transparency of the change. Managers can understand this is an isolated change, rather than a library change as in the c++ case.

How do managers (and anyone except yourself) actually visualize the amount of code change?

  • With python, it’s a single file so they can use “diff”.
  • With stored proc, it’s a single proc. In the source control, they can diff this single proc. Unit of release is traditionally a single proc.
  • with c++ or java, the unit of release is a library. What if in this new build, beside your change there’s some other change , included by accident? You can’t diff a binary 😦

So I feel transparency is the first reason. Transparency of the change gives everyone (not just yourself) confidence about the size/scope of this change.

Second reason is isolation. I feel a compiled language (esp. c++) is more “fragile” and the binary modules more “coupled” and inter-dependent. When you change one source file and release it in a new library build, it could lead to subtle, intermittent concurrency issues or memory leaks in another module, outside your library. Even if you as the author sees evidence that this won’t happen, other people have seen innocent one-line changes giving rise to bugs, so they have reason to worry.

  • All 1000 files (in compiled form) run in one process for a c++ or java system.
  • A stored proc change could affect DB performance, but it’s easy to verify. A stored proc won’t introduce subtle problems in an unrelated module.
  • A top-level python script runs in its own process. A python module runs in the host process of the top-level script, but a typical top-level script will include just a few custom modules, not 1000 modules. Much better isolation at run time.

There might be python systems where the main script actually runs in a process with hundreds of custom modules (not counting the standard library modules). I have not seen it.

what hours2expect execution msg{typical U.S.exchanges

There was a requirement that within 90 seconds, any execution on any online or traditional broker system needs to be reported to the “official” exchange. For each listed security, there’s probably a single “official” listing exchange.

Take IBM for example.

  • On NYSE, executions only take place after 9.30am, usually after an opening auction.
  • On-line electronic brokers operate 24/7 so an execution could happen and get reported any time. However, NYSE data feed only publishes it after 4am by right. I don’t know how strict this timing is. If your feed shows it before 4am I guess you are free to discard it. Who knows it might be a test message.

 

socket accept() key points often missed

I have studied accept() many times but am still unfamiliar with it.

Useful as zbs, and perhaps QQ, rarely for GTD…

Based on P95-97 [[tcp/ip socket in C]]

  • used in tcp only
  • used on server side only
  • usually called inside an endless loop
  • blocks most of the time, when there’s no incoming new connections. The existing clients don’t bother us as they communicate with the “child” sockets independently. The accept() “show” starts only upon a new incoming connection
    • thread remains blocked, from the arrival of the incoming connection request until a newborn socket is fully Established.
    • at that juncture the new remote client is probably connected to the newborn socket, so the “parent thread[2]” has the opportunity/license to let-go and return from accept()
    • now, parent thread has the newborn socket, it needs to pass it to a child thread/process
    • after that, parent thread can go back into another blocking accept()
  • new born or other child sockets all share the same local port, not some random high port! Until now I still find this unbelievable. https://stackoverflow.com/questions/489036/how-does-the-socket-api-accept-function-work confirms it.
  • On a host with a single IP, 2 sister sockets would share the same local ip too, but luckily each socket structure has at least 4 [1] identifier keys — local ip:port / remote ip:port. So our 2 sister sockets are never identical twins.
  • [1] I omitted a 5th key — protocol as it’s a distraction from the key point.
  • [2] 2 variations — parent Thread or parent Process.
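
The “same local port” point can be checked with a small POSIX experiment. This is a sketch of my own, not taken from the book: it listens on a loopback port chosen by the kernel, connects to itself, calls accept() once instead of looping forever, and compares local ports via getsockname(). Function names are hypothetical.

```cpp
#include <arpa/inet.h>
#include <cassert>
#include <netinet/in.h>
#include <sys/socket.h>
#include <unistd.h>

// Local port of a socket in host byte order, or -1 on failure.
int localPort(int fd) {
    sockaddr_in addr{};
    socklen_t len = sizeof(addr);
    if (getsockname(fd, reinterpret_cast<sockaddr*>(&addr), &len) != 0) return -1;
    return ntohs(addr.sin_port);
}

// Reports whether the newborn socket shares the listening socket's local port.
bool newbornSharesListeningPort() {
    int listener = socket(AF_INET, SOCK_STREAM, 0);
    int client = socket(AF_INET, SOCK_STREAM, 0);
    assert(listener >= 0 && client >= 0);

    sockaddr_in addr{};
    addr.sin_family = AF_INET;
    addr.sin_addr.s_addr = htonl(INADDR_LOOPBACK);
    addr.sin_port = 0;  // ask the kernel for any free port
    socklen_t len = sizeof(addr);
    sockaddr* sa = reinterpret_cast<sockaddr*>(&addr);

    bool same = false;
    if (bind(listener, sa, len) == 0 && listen(listener, 1) == 0 &&
        getsockname(listener, sa, &len) == 0 &&  // learn the port the kernel chose
        connect(client, sa, len) == 0) {         // handshake completes inside the kernel
        // A connection is already queued, so this accept() returns at once.
        int newborn = accept(listener, nullptr, nullptr);
        if (newborn >= 0) {
            int port = localPort(listener);
            same = port > 0 && localPort(newborn) == port;
            close(newborn);
        }
    }
    close(client);
    close(listener);
    return same;
}
```

In a real server the accept() call would sit inside the endless loop described above, and the newborn socket would be handed to a child thread or process.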

big guns: template4c++^reflection4(java+python)

Most complex libraries (or systems) in java require reflection to meet the inherent complexity;

Most complex libraries in c++ require template meta-programming.

But these are for different reasons… which I’m not confident to point out.

Most complex python systems require … reflection + import hacks? I feel python’s reflection (as with other scripting languages) is more powerful, less restricted. I feel reflection is at the core of some (most?) of the power features in python – import, polymorphism

##minimum python know-how for cod`IV

Hi XR,

My friend Ashish Singh (in cc) said “For any coding tests with a free choice of language, I would always choose python”. I agree that perl and python are equally convenient, but python is the easiest language to learn, perhaps even easier than javascript and php in my opinion. If you don’t already have a general-purpose scripting language as a handy tool, then consider python as a duct tape and Swiss army knife

(Actually linux automation still requires shell scripting. Perl and python are both useful additions.)

You can easily install py on windows. Linux has it pre-installed. You can then write a script in any text editor and test-run, without compilation. On windows the bundled IDLE tool is optional but even easier. For the ECT cycle – see https://stackoverflow.com/questions/6513967/running-python-script-from-idle-on-windows-7-64-bit

I actually find some inconveniences — IDLE uses Alt-P to get previous command. Also copy-paste doesn’t work at all. On Windows The basic python command-line shell is better than IDLE!

For coding tests, a beginner would need to learn

  • String common operations — learn 30 and master 20 of them
    • “in” operator on string, list, dict — is one of the operations to master
    • slicing on string and list — is one of the operations to master
    • converting between string, list, dict etc — is one of the operations to master
    • Regex not needed since many developers aren’t familiar with it
  • list and dict data structures and common operations — learn 30 and master 20 of them
    • A “Set” rarely needed. I never create tuple but some built-in functions return tuples.
  • Define simple functions
    • Recursion is frequently coding-tested.
    • multiple return values are possible but not required
  • “global” keyword used inside functions
  • if/elif/else; while loop with break and continue. Switch statement doesn’t exist.
  • for-each loop is useful in coding test, esp. iterating list, dict, string, range(), file content
  • range() and xrange() function – frequently needed in coding test
  • check whether 2 objects have the same address
  • null pointer i.e. None
  • what counts as true / false
  • No need to handle exceptions
  • No need to create classes
    • I think “struct-type” classes with nothing but data fields are useful in coding tests, but not yet needed in my experience.
  • No need to learn OO features
  • No need to use list comprehension and generator expressions, though very useful features of python
  • No need to use lambda, map()/reduce()/filter()/zip(), though essential for functional programming
  • No need to use import os and sys modules or open files, which are essential for everyday automation scripts

algoTrading^bigData^quant

Update: I told bbg (Karin?) and Trex interviewers that domain isn’t a big concern to me. Even a back office IT role can turn out to be better.


  1. quantDev
  2. algoTrading
  3. bigData

… are the 3 big directions for trySomethingNew. I’m cautious about each.

quantDev (not pure quant) — low demand; poor market depth; unable to find another job; least jobs and only in banks. CVA is not really quant dev, based on what I gathered. See also %% poor accu ] quantDev

algoTrading — perhaps I should try java and medium frequency?

bigData — no consolidation; questionable accumulation and value-creation

effi^instrumentation ] new project

I always prioritize instrumentation over effi/productivity/GTD.

A peer could be faster than me in the beginning but if she lacks instrumentation skill with the local code base there will be more and more tasks that she can’t solve without luck.

In reality, many tasks can be done with superficial “insight”, without instrumentation, with old-timer’s help, or with lucky search in the log.

What if developer had not added that logging? You are dependent on that developer.

I could be slow in the beginning, but once I build up (over x months) a real instrumentation insight I will be more powerful than my peers including some older timers. I think the Stirt-tech London team guru (John) was such a guy.

In reality, even though I prioritize instrumentation it’s rare to make visible progress building instrumentation insight.

Flatten 2D array and convert subscripts #MS

Here’s a standard way to flatten a 2D array into a 1D array. Suppose A is a 2D array int[3][5]. It flattens to a 1D array F of int[15]. Imagine A as 3 rows of 5 items each. Denote constants R = 3, S = 5. So A(1,2) → F(7), because 1*S + 2 = 7. All subscripts are 0-based.

Q1: Implement

int convertSubscripts(r,s)

Q2: Now B is 3D array int [Q][R][S], where Q, R and S are the sizes. Implement

int convertSubscripts(q,r,s)

You need to work out the algorithm on the white board. I actually drew the 3D array on white board to show my thinking.
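
One possible answer, sketched in python rather than C, assuming the usual row-major layout (last subscript varies fastest). The function names are my own, and the sizes R and S are passed as defaults only to keep the calls short.

```python
R, S = 3, 5   # sizes from the example above

def convert_subscripts_2d(r, s, S=S):
    """Q1: each full row before row r contributes S items."""
    return r * S + s

def convert_subscripts_3d(q, r, s, R=R, S=S):
    """Q2: each full 2D slab before slab q contributes R*S items."""
    return (q * R + r) * S + s

print(convert_subscripts_2d(1, 2))      # 7, matches A(1,2) -> F(7)
print(convert_subscripts_3d(1, 0, 0))   # 15, first item of the second slab
```

Note that Q (the outermost size) never appears in the formula; it only bounds the valid range of q.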

Locate msg]binary feed #multiple issues solved

Hi guys, thanks to all your help, I managed to locate the very first trading session message in the raw data file.

We hit and overcame multiple obstacles in this long “needle search in a haystack”.

  • Big Obstacle 1: endian-ness. It turned out the raw data is little-endian. For my “needle”, the symbol integer id 15852 (in decimal) or 3dec (in hex) is printed swapped as “ec3d” when I finally found it.

Solution: read the exchange spec. It should be mentioned.

  • Big Obstacle 2: my hex viewers (like “xxd”) add line breaks to the output, so my needle can be missed during my search. (Thanks to Vishal for pointing this out.)

Solution 1: xxd -c 999999 raw/feed/file > tmp.txt; grep $needle tmp.txt

The default xxd column size is 16 so every 16 bytes output will get a line break — unwanted! So I set a very large column size of 999999.

Solution 2: in vi editor after “%!xxd -p” if you see line breaks, then you can still search for “ec\_s*3d”. Basically you need to insert “\_s*” between adjacent bytes.

Here’s a 4-byte string I was able to find. It spans across lines: 15\_s*00\_s*21\_s*00

  • Obstacle 3: identify the data file among 20 files. Thanks to this one obstacle, I spent most of my time searching in the wrong files 😉

Solution: remove each file successively, starting from the later hours, and retest, until the needle stops showing. The last removed file must contain our needle. That file is a much smaller haystack.

o One misleading piece of info is the “9.30 am” mentioned in the spec. Actually the message came much earlier.

o Another misleading piece of info is the timestamp passed to my parser function. Not sure where it comes from, but it says 08:00:00.1 am, so I thought the needle must be in the 8am file, but actually it is in the 4am file. In this feed, the only reliable timestamp I have found is the one in the packet header, one level above the messages.

  • Obstacle 4: my “needle” was too short so there are too many useless matches.

Solution: find a longer and more unique needle, such as the SourceTime field, which is a 32-bit integer. When I convert it to hex digits I get 8 hex digits. Then I flip it due to endian-ness. Then I get a more unique needle “008e0959”. I was then able to search across all 14 data files:

for f in arca*0; do
  xxd -c999999 -p "$f" > "$f.hex"
  grep -ioH 008e0959 "$f.hex" && echo "found in $f"
done
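
The needle construction (integer to hex digits, then byte-flip for little-endian) can be sketched in python 3. The helper names are hypothetical. The value 0x59098E00 is simply what I get by un-flipping the needle “008e0959” above; it is not taken from the exchange spec. Searching the raw bytes directly also sidesteps the line-break problem of the hex viewers.

```python
def le_needle(value, width_bytes=4):
    """Hex digits of an unsigned integer as it appears in a little-endian feed."""
    return value.to_bytes(width_bytes, "little").hex()

def find_in_bytes(haystack, needle_hex):
    """Byte offset of the first match, or -1 if the needle is absent."""
    return haystack.find(bytes.fromhex(needle_hex))

def find_in_file(path, needle_hex):
    """Same search on a raw data file, read in binary mode."""
    with open(path, "rb") as f:
        return find_in_bytes(f.read(), needle_hex)

print(le_needle(15852, 2))       # ec3d      (symbol id 0x3dec, 16-bit)
print(le_needle(0x59098E00))     # 008e0959  (32-bit field, flipped)
```

For multi-gigabyte files, reading the whole file at once may not be practical; a chunked read with overlap would be needed, which this sketch leaves out.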

  • Obstacle 5: I have to find and print the needle using my c++ parser. It’s easy to print out wrong hex representation using C/C++, so for most of this exercise I wasn’t sure if I was looking at correct hex dump in my c++ log.

o If you convert a long byte array to hex and print without whitespace, you could see 15002100ffffe87600, but when I added a space after each byte, it looked like 15 00 21 00 ffffe876 00, so the 3rd byte was overflowing without warning! The likely cause is sign extension: a negative char gets promoted to int before it is printed.

o If you forget padding, then you can see a lot of single “0” when you should get “00”. Again, if you don’t include white space you won’t notice.

Solution: I have worked out some simplified code that works. I have a c++ solution and c solution. You can ask me if you need it.
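
The two pitfalls are easy to reproduce outside C/C++. Below is a python 3 sketch that mimics them: hex_dump_buggy() imitates a signed char promoted to a 32-bit int and printed without zero padding, while hex_dump() is the intended output. Both function names and the sample bytes are made up; this is not my c/c++ solution.

```python
def hex_dump(data):
    """Intended output: two zero-padded hex digits per byte."""
    return " ".join(f"{b:02x}" for b in data)

def hex_dump_buggy(data):
    """Mimic the C/C++ mistakes: sign extension and missing padding."""
    out = []
    for b in data:
        signed = b - 256 if b >= 0x80 else b      # byte seen through a signed char
        out.append(f"{signed & 0xFFFFFFFF:x}")    # promoted to 32-bit int, no padding
    return " ".join(out)

sample = bytes([0x15, 0x00, 0x21, 0x00, 0xE8, 0x76, 0x00])   # made-up bytes
print(hex_dump(sample))         # 15 00 21 00 e8 76 00
print(hex_dump_buggy(sample))   # 15 0 21 0 ffffffe8 76 0
```

Without the spaces, the buggy output would read 150210ffffffe8760, and neither mistake would be obvious.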

  • Obstacle 6: In some cases, sequence number is not in the raw feed. In this case the sequence number is in the feed, so Nick’s suggestion was valid, but I was blocked by other obstacles.

Tip: If sequence number is in the feed, you would probably spot a pattern of incrementing hex numbers periodically in the hex viewer.

tech strength: Depth [def]

Some top geeks I know are fast at reading code + logs. Few rely on documentation. I’m OK but not the fastest.

Some top geeks in “sister” teams of my team are experts with instrumentation tools and techniques. I guess other top geeks seldom need a lot of instrumentation. I feel they lack the experience but make up for it in other skills.

Some top geeks keep gaining depth if they focus on one complex system. I might have potential here. “Deep” is slightly different from “complex”. Deep means you need to slow down, /quiet down/, absorb, go through thick->thin cycles, get a laser focus, look again, and think deeper.

Perhaps related to my Depth capacity, I’m not that fast with timed online coding tests.

  • — some well-defined, often named technical subjects with depth, often opaque to the uninitiated
  • rvr, RVO
  • RAII, swap
  • [b] TMP including SFINAE, CRTP
  • [b] java generics
  • [b] JNI, python extension module — complexities lurking by the /dimly lit path/ due to low-level interfacing
  • java/c# reflection techniques in practice
  • ? python introspection
  • [B] concurrency in general and java threading in particular
  • [B] algorithms and data structures
  • clever SQL single join to solve tricky problems.
  • [B] query/sproc tuning — depth created by the large number of tables/indices and variations in queries
  • ? serialization over network
  • ? ADL techniques
  • any math and statistics subject
  • [b/B=books have been written on this topic]

3groupsOf3digits #YH #li.remove()..

http://entrepidea.com/blogs/tech/index.php/2017/06/02/three-3-digit-numbers/ is the original blog post by my friend

Q: Find three 3-digit numbers with a ratio of 1:2:3;
These numbers pick their 3 digits from a range of 1 to 9;
All digits that form these numbers must be completely unique, i.e. you must use up all distinct 9 digits. For example, a set of 123:246:369 doesn’t qualify.

https://github.com/tiger40490/repo1/blob/py1/py/miscIVQ/3groupsOf3digits.py is my solution, not necessarily smart.

Shows list.remove(anyValue)
shows default return value is None
shows sys.exit()
shows no main() needed, to save screen real estate

I see two distinct constraints.
* The obvious — number2 and number3 must be 2x and 3x and they must use the remaining 6 digits.
* The other constraint — number1 must use 3 distinct digits.

Real challenge is the first step — iteration to generate the first 3-digit number without checking the 2x and the 3x numbers. My implementation of it is basically the code outside the functions.

Now I think the simpler iteration is to take each number from 123 to 329 and check every one of them the same way. I would reuse the “check” routine, which is better code reuse. Performance would only be significantly slower when there are many, many disqualified integers between 123 and 329, i.e. a sparse array.

In contrast my iteration won’t waste time on those disqualified. So I consider it a little bit clever. My iteration took 20-40 minutes to write. Once this is done, the check on the 2x and 3x is much simpler.
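
The simpler iteration can be sketched as follows in python 3. This is not the code in my github repo, just a minimal brute-force version; check() here is a fresh, hypothetical routine that tests all three numbers at once.

```python
def check(n):
    """True if n, 2n and 3n together use each digit 1..9 exactly once."""
    digits = f"{n}{2 * n}{3 * n}"
    return len(digits) == 9 and "0" not in digits and len(set(digits)) == 9

# 329 is the largest candidate because 3 * 329 = 987
solutions = [(n, 2 * n, 3 * n) for n in range(123, 330) if check(n)]
print(solutions)
# [(192, 384, 576), (219, 438, 657), (273, 546, 819), (327, 654, 981)]
```

Only about 200 candidates are checked, so the wasted work on disqualified integers is negligible here.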

C for latency^^TPS can use java

I’m 98% confident — low latency favors C/C++ over java [1]. FPGA is _possibly_ even faster.

I’m 80% confident — throughput (in real time data processing) is achievable in C, java, optimized python (Facebook?), optimized php (Yahoo?) or even a batch program. When you need to scale out, Java seems the #1 popular choice as of 2017. Most of the big data solutions seem to put java as the first among equals.

In the “max throughput” context, I believe the critical java code path is optimized to the same efficiency as C. JIT can achieve that. A python or php module can achieve that, perhaps using native extensions.

[1] Actually, java bytecode can sometimes run faster than compiled C code (See my other posts such as https://bintanvictor.wordpress.com/2017/03/20/how-might-jvm-beat-cperformance/)

putty color

I tried various color schemes. None worked, so I found “monochrome with bold” good enough. That’s the fallback option. Now here’s one color scheme that might produce readable color on white background. In this case I turned on all the putty settings under Windows->colors:

  • tick AllowTerminaltoSpecifyAnsiColors
  • tick AllowTerminalToUseXterm256
  • tick AttemptToUseLogicalPalettes
  • tick UseSysColors
  • radio group choose Both

Imperfection: in vi, I get yellow line number on white background 😦 so I had to go to Windows -> colors -> bottom scrolling list -> set ANSI YellowBold to 222:222:0

Whenever you turn on color support in terminal and in q(grep), you open a can of worms. Often LESS would need -R switch to cope with “ESC[01”

addiction2low-level hacking:keep doing; no shame

Update: low-level hacking is generally easier in c++ than java.

When I become interested in a tech topic, I often throw cold water over my head — “This is such a /juvenile/, albeit productive and wholesome, hobby. Look at ex-classmates/colleagues so and so, with their business wing. They deal with business strategies. My tech stuff is so low-level and boring compared to what they deal with.”

Damaging, harmful, irrational, demoralizing SMS! Get Real, Man! Let’s assess our own situation

  • A) On one hand, I need to avoid spending too much time becoming expert in some low-leverage or high-churn technology (php? XML? ASP?).
  • B) On the other hand, the enthusiasm and keen interest are hard to get and extremely valuable. They could be the catalysts that grow my zbs and transform me into a veteran over a short few years. Even with this enthusiasm and depth of interest, such a quick ascent is not easy and not likely. Without them, it’s simply impossible.

Case: grandpa. His research domain(s) is considered unglamorous 冷门 but he is dedicated and passionate about it. He knows that in the same Academy of social sciences, economics, geopolitics and some other fields are more important. He often feels outside the spotlight (kind of sidelined but for valid reasons). That is a fact which had a huge impact on my own choice of specialization. But once he decided to dig in and invest his whole life, he needed to deal with that fact and not let it affect his motivation and self-image. As a senior leader of these unglamorous research communities, he has to motivate the younger researchers.

Case: Greg Racioppo, my recruiter, treats his work as his own business. The successful recruiters are often in the same business for many years and make a long term living and even create an impact for their employees (and people like me). They could easily feel “boring” compared to the clients or the candidates, but they don’t have to.

Case: PWM wealth advisors. They could feel “boring” compared to the filthy rich clients they deal with, but in reality, these advisors are more successful than 99% of the population.

Case: The ratio of support staff to traders is about 50:1, but I don’t feel “boring” because of them.

Case: Look at all the staff in a show, movie, supporting the stars.

char-array dump in hex digits: printf/cout

C++ code is convoluted. Must cast twice!

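// same output from c and c++: 57 02 ff 80
#include <cstdio> //printf
void dumpBufferPrintf(){
  //casts avoid a narrowing error in C++11 brace-init when char is signed
  static const char tag[] = {'W', 2, (char)0xFF, (char)0x80};
  for(size_t i = 0; i < sizeof(tag); ++i)
    printf("%02hhx ", tag[i]); //hh: print as unsigned char, so no ffffff prefix
  printf ("\n");
}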
///////////////////
#include <iostream>
#include <sstream> //stringstream
#include <iomanip> //setfill

//This function was also used to dump a class instance. See below
void dumpBufferCout(const char * buf, size_t const len){
                std::stringstream ss;
                ss << std::hex << std::setfill('0');
                
                for(size_t i=0; i<len; ++i){
                          if (i%8 == 0) ss<< "  ";
                          ss<<std::setw(2)<< (int)(unsigned char) buf[i]<<" ";
                }
                std::cerr<<ss.str()<<std::endl;
}
// usage, from inside some function:
//   dumpBufferCout((const char*)&myStruct, sizeof(myStruct));

q[nm] instrumentation #learning notes

When you want to reduce the opacity of the c++ compiled artifacts, q(nm) is instrumental. It is related to other instrumentation tools like

c++filt
objdump
q(strings -a)

Subset of noteworthy features:
--print-file-name
--print-armap? Tested with my *.a file. The filename printed is different from the above
--line-numbers? Tested
--no-sort
--demangle? Similar to c++filt but c++filt is more versatile
--dynamic? for “certain” types of shared libraries
--extern-only

My default command line is


nm --print-armap --print-file-name --line-numbers --demangle
nm --demangle ./obj/debug/ETSMinervaBust/src.C/ReparentSource.o //worked better

In May 2018, I ran nm on a bunch of *.so files (not *.a) to locate missing symbol definitions. Once I found a needed symbol is exported by libabc.so, I had to add -labc to my g++ command line.
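
That hunt across *.so files can be scripted. Below is a python 3 sketch, assuming GNU nm is on the PATH; the function names are hypothetical. It relies on the GNU nm switches --dynamic, --defined-only and --demangle, and on the default output layout of address, one-letter type, name.

```python
import re
import subprocess

# A defined symbol line: hex address, one-letter type, then the (demangled) name
_DEFINED = re.compile(r"^([0-9a-fA-F]{8,})\s+(\S)\s+(.+)$")

def parse_defined(nm_output):
    """Return {symbol_name: type_letter} for every line that carries an address."""
    found = {}
    for line in nm_output.splitlines():
        m = _DEFINED.match(line.strip())
        if m:                                   # undefined symbols have no address
            found[m.group(3)] = m.group(2)
    return found

def libraries_defining(symbol, so_paths):
    """Run nm on each shared library and keep those that define the symbol."""
    hits = []
    for path in so_paths:
        out = subprocess.run(
            ["nm", "--dynamic", "--defined-only", "--demangle", path],
            capture_output=True, text=True, check=False,
        ).stdout
        if symbol in parse_defined(out):
            hits.append(path)
    return hits
```

With --demangle the name includes the full signature, so the symbol passed in must match it exactly, e.g. "Foo::bar(int, int)" rather than "Foo::bar". A typical call would be libraries_defining("Foo::bar(int, int)", glob.glob("lib/*.so")), where the path pattern is only an example.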

[17] 5 unusual tips@initial GTD

See also https://bintanvictor.wordpress.com/wp-admin/edit.php?s&post_status=all&post_type=post&action=-1&m=0&cat=560907660&filter_action=Filter&paged=1&action2=-1

* build up instrumentation toolset
* Burn weekends, but first … build momentum and foundation including the “instrumentation” detailed earlier
* control distractions — parenting, housing, personal investment, … I didn’t have these in my younger years. I feel they take up O2 and also sap the momentum.
* Focus on output that’s visible to boss, that your colleagues could also finish so you have nowhere to hide. Clone if you need to. CSDoctor told me to buy time so later you can rework “under the hood” like quality or design

–secondary suggestions:
* Limit the amount of “irrelevant” questions/research, when you notice they are taking up your O2 or dispersing the laser. Perhaps delay them.

Inevitably, this analysis relies on the past work experiences. Productivity(aka GTD) is a subjective, elastic yardstick. #1 Most important is GTD rating by boss. It sinks deep… #2 is self-rating https://bintanvictor.wordpress.com/2016/08/09/productivity-track-record/