Asia catch`up with U.S. exchanges

Thanks Sonny,

I could imagine that U.S. exchanges are more advanced in terms of trading rules, operational complexity, matching engines, order types … presumably because U.S. exchanges have more volume and more variety in securities. It’s like a bigger, older hospital being more sophisticated because it has treated many more patients.

On the other hand, I have read more than once (over the last 7 years) that in terms of pure latency the bigger exchanges in the U.S. are often slower, but only moderately, so everything still works fine. I don’t know of any capacity concerns on the horizon. Some of the smaller exchanges in Asia have been aggressive in beating the bigger exchanges on latency. The world’s fastest exchange is now Bombay.

Good sharing.

Victor


bash script: demo`trap,read,waiting for gdb-attach

#1 valuable feature is the wait for gdb to attach, before unleashing the data producer.

#2 signal trap. I don’t have to remember to kill off background processes.

# Better source this script. One known benefit -- q(jobs) command would now work

sigtrap(){
  echo Interrupted
  kill %1 %2 %3 %4 # q(jobs) can show the process %1 %2 etc
  set -x
  trap - INT
  trap # show active signal traps
  sleep 1
  jobs
  set +x
}

set +x
ps4 # my alias to show relevant processes
echo -e "\njobs:"
jobs
echo -en "\nContinue? [y/any_other_key] "
unset REPLY; read REPLY # read into REPLY directly; q(read $REPLY) would expand to a bare q(read)
[ "$REPLY" = "y" ] || return

trap "sigtrap" INT # not sure what would happen to the current cmd and to the shell

base=/home/vtan//repo/tp/plugins/xtap/nysemkt_integrated_parser/
pushd $base
make NO_COMPILE=1 || return
echo '---------------------'
popd
/bin/rm --verbose $base/*_vtan.*log /home/vtan/nx_parser/working/CSMIParser_StaticDataMap.dat

set -x

#If our parser is a client to rebus server, then run s2o as a fake rebus server:
s2o 40490|tee $base/2rebus_vtan.bin.log | decript2 ctf -c $base/etc/cdd.cfg > $base/2rebus_vtan.txt.log 2>&1 &

#if our parser is a server outputing rtsd to VAP, then run c2o as a fake client:
c2o localhost 40492|tee $base/rtsd_vtan.bin.log | decript2 ctf -c $base/etc/cdd.cfg > $base/rtsd_vtan.txt.log 2>&1 &

# run a local xtap process:
$base/shared/tp_xtap/bin/xtap -c $base/etc/test_replay.cfg > $base/xtap_vtan.txt.log 2>&1 &
#sleep 3; echo -en "\n\n\nDebugger ready? Start pbflow? [y/any_other_key] "
#unset REPLY; read $REPLY; [ "$REPLY" = "y" ] || return

# playback some historical data, on a multicast port:
pbflow -r999 ~/captured/ax/arcabookxdp1-primary 224.0.0.7:40491 &

set +x
jobs
trap

c++variables: !! always objects

Every variable that holds data is an object. Objects are created either with static duration (sometimes by defining rather than declaring the variable), with automatic duration (declaration alone) or with dynamic duration via malloc().

That’s the short version. Here’s the long version:

  • heap objects — have no name, no host variable, no door plate. They only have addresses. The address could be saved in a “pointer-object”, which is a can of worm.
    • In many cases, this heap address is passed around without any pointer object.
  • stack variables (including function parameters) — each stack object has a name (multiple possible?) i.e. the host variable name, like a door plate on the memory location. This memory is allocated when the stack frame is created. When you clone a stack variable you get a cloned object.
    • Advanced — You could create a reference to the stack object, when you pass the host variable by-reference into another function. However, you should never return a stack variable by reference.
  • static Locals — the name myStaticLocal is a door plate on the memory location. This memory is allocated the first time this function is run. You can return a reference to myStaticLocal.
  • file-scope static objects — memory is allocated at program initialization, but if you have many of them scattered in many files, their order of initialization is unpredictable. The name myStaticVar is a door plate on that memory location, but this name is visible only within this /compilation-unit/. If you declare and define it (in one step!) in a shared header file (bad practice) then you get multiple instances of it:(
  • extern static objects — Again, you declare and define it in one step, in one file — ODR. All other compilation units  would need an extern declaration. An extern declaration doesn’t define storage for this object:)
  • static fields — are tricky. The variable name is there after you declare it, but it is a door plate without a door. It only becomes a door plate on a storage location when you allocate storage i.e. create the object by “defining” the host variable. There’s also one-definition-rule (ODR) for static fields, so you first declare the field without defining it, then you define it elsewhere. See https://bintanvictor.wordpress.com/2017/05/30/declared-but-undefined-variable-in-c/

Note: thread_local is a fourth storage duration, after 1) dynamic, 2) automatic and 3) static
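Here’s a minimal C++ sketch (all variable names are mine, purely illustrative) mapping the categories above to code:

#include <iostream>

int   globalCounter = 0;        // file-scope object: storage set up before main()
static int tuPrivateVar = 0;    // file-scope, but visible only in this compilation unit
extern int definedElsewhere;    // declaration only: no storage allocated by this line

int & getStaticLocal(){
    static int myStaticLocal = 0;   // allocated on the first call; OK to return by reference
    return myStaticLocal;
}

int main(){
    int stackVar = 5;               // automatic duration: a named door plate on stack memory
    int clone    = stackVar;        // cloning a stack variable clones the object

    int * heapAddr = new int(7);    // heap object: no name, only an address held in a pointer-object
    getStaticLocal() = clone;       // writing through the returned reference is safe
    std::cout << getStaticLocal() << " " << *heapAddr << "\n";
    delete heapAddr;
}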

in-depth article: epoll illustrated #SELECT

(source code is available for download in the article)

Compared to select(), the newer Linux epoll facility (epoll_create/epoll_ctl/epoll_wait) is designed to be more performant.

Ticker Plant uses epoll. No select() at all.

https://banu.com/blog/2/how-to-use-epoll-a-complete-example-in-c/ is a nice article with sample code of a TCP server.

  • bind(), listen(), accept()
  • main() function with an event loop. In the loop
  • epoll_wait() to detect
    • new client
    • new data on existing clients
    • (Using the timeout parameter, it could also react to a timer events.)

I think this toy program is more readable than a real-world epoll server with thousands of lines.
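For flavour, here is a minimal sketch (my own, not the article’s code) of the event loop described above; error handling is omitted and listenFd is assumed to be already bound and listening:

#include <sys/epoll.h>
#include <sys/socket.h>
#include <unistd.h>

void eventLoop(int listenFd){
    int epfd = epoll_create1(0);
    epoll_event ev{};
    ev.events  = EPOLLIN;
    ev.data.fd = listenFd;
    epoll_ctl(epfd, EPOLL_CTL_ADD, listenFd, &ev);      // watch for new clients

    epoll_event events[64];
    for(;;){
        int n = epoll_wait(epfd, events, 64, 1000);     // 1s timeout: room for timer-like events
        for(int i = 0; i < n; ++i){
            if(events[i].data.fd == listenFd){          // new client connecting
                int clientFd = accept(listenFd, nullptr, nullptr);
                epoll_event cev{};
                cev.events  = EPOLLIN;
                cev.data.fd = clientFd;
                epoll_ctl(epfd, EPOLL_CTL_ADD, clientFd, &cev);
            } else {                                    // new data on an existing client
                char buf[4096];
                ssize_t got = read(events[i].data.fd, buf, sizeof buf);
                if(got <= 0) close(events[i].data.fd);  // client disconnected
            }
        }
    }
}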

incisive eg show`difference with^without extern-C

---- dummy8.c ----
#include <stdio.h> 
//without this "test", we could be using c++ compiler unknowingly 😦
int cfunc(){ return 123; }
---- dummy9.cpp ----
#include <iostream>
extern "C" // Try removing this line and see the difference
  int cfunc();
int main(){std::cout << cfunc() <<std::endl; }

Above is the complete source of a c++ application using a pre-compiled C function. It shows the need for extern-C.

/bin/rm -v *.*o *.out
### 1a
g++ -v -c dummy8.c # 
objdump --syms dummy8.o # would show mangled function name _Z5cfuncv
### 1b
gcc -v -x c -c dummy8.c # Without the -x c, we could end up with c++ compiler 😦
objdump --syms dummy8.o # would show unmangled function name "cfunc"
### 2
g++ -v dummy8.o dummy9.cpp  # link the dummy8.o into executable

# The -v flag reveals the c vs c++ compiler versions 🙂
### 3
./a.out

So that’s how to compile and run it. Note you need both a c compiler and a c++ compiler. If you only use a c++ compiler, then you won’t have any pre-compiled C code. You can still make the code work, but you won’t be mixing C and C++ and you won’t need extern-C.

My goal is not merely “make the code work”. It’s very easy to make the code work if you have full source code. You won’t need extern-C. You have a simpler alternative — compile every source file in c++ after trivial adjustments to #include.

c++dynamicLoading^dynamicLinking^staticLinking, basics

https://en.wikipedia.org/wiki/Dynamic_loading

*.so and *.dll files are libraries for dynamic linking.
*.a and *.lib files are libraries for static linking.

“Dynamic loading” allows an executable to start up in the absence of these libraries and integrate them at run time, rather than at link time.

You use the dlopen("path/to/some.so") library call. On Windows it's LoadLibrary("path/to/some.dll")
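A minimal sketch of dynamic loading — the library name and the function computePV are made up for illustration; on Linux, link with -ldl:

#include <dlfcn.h>
#include <iostream>

int main(){
    // dynamic loading: the executable starts fine even if libpricing.so was absent at link time
    void * handle = dlopen("./libpricing.so", RTLD_LAZY);
    if(!handle){ std::cerr << dlerror() << '\n'; return 1; }

    // look up a symbol by name and cast it to the expected function type
    auto computePV = reinterpret_cast<double(*)(double)>(dlsym(handle, "computePV"));
    if(!computePV){ std::cerr << dlerror() << '\n'; dlclose(handle); return 1; }

    std::cout << computePV(100.0) << '\n';
    dlclose(handle);
}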

## low-complexity QQ topics #JGC/parser..

java GC is an example of “low-complexity domain”. Isolated knowledge pearls. (Complexity would be high if you delve into the implementation.)

Other examples

  • FIX? slightly more complex when you need to debug source code. java GC has no “source code” for us.
  • socket programming? conceptually, relatively small number of variations and combinations. But when I get into a big project I am likely to see its true colors.
  • stateless feed parser coded against an exchange spec

C++build error: declared but undefined variable

I sometimes declare a static field in a header, but fail to define it (i.e. give it storage). It compiles fine and may even link successfully. When you run the executable, you may hit

error loading library /home/nysemkt_integrated_parser.so: undefined symbol: _ZN14arcabookparser6Parser19m_srcDescriptionTknE

Note this is a shared library.
Note the field name is mangled. You can un-mangle it using c++filt:

c++filt _ZN14arcabookparser6Parser19m_srcDescriptionTknE -> arcabookparser::Parser::m_srcDescriptionTkn

According to Deepak Gulati, the binary files only contain mangled names. The linker and all subsequent programs deal exclusively with mangled names.

If you don’t use this field, the undefined variable actually will not bother you! I think the linker simply never needs to resolve the unreferenced symbol.
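A hedged reconstruction of the trap — only the namespace/class/field names come from the mangled symbol above; the field’s type and value are my assumptions:

---- Parser.h ----
#include <string>
namespace arcabookparser {
  struct Parser {
    static std::string m_srcDescriptionTkn;  // declaration only: no storage yet
  };
}

---- Parser.cpp ----
#include "Parser.h"
// forgetting this single line still compiles, and the .so may even link;
// the missing storage surfaces only when something references the field:
std::string arcabookparser::Parser::m_srcDescriptionTkn = "some default";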

OPRA: name-based sharding by official feed provider

2008 (GFC) peak OPRA message rate — the Wikipedia “low latency” article says 1 million updates per second. Note my NYSE parser can handle 370,000 messages per second per thread!

https://www.opradata.com/specs/48_Line_Notification_Common_IP_Multicast_Specification.pdf shows 48 multicast groups, each for a subset of the option symbols. When there were only 24 groups, the symbols from RU through SMZZZ were too voluminous for one multicast group, more so than the other 23 groups.

Our OPRA parser instance is simple and efficient (probably stateless) so presumably capable of handling multiple OPRA multicast groups per instance. We still use one parser per MC group for simplicity and ease of management.

From: Chen, Tao
Sent: Tuesday, May 30, 2017 8:24 AM
To: Tan, Victor
Subject: RE: OPRA feed volume

Opra data is provided by SIAC(securities industry automation corporation). The data is disseminated on 53 multicast channels. TP runs 53 instances of parser and 48 instances of rebus across 7 servers to handle it.

##transparent^semi-transparent^opaque languages

With a transparent language, I am very likely (high correlation) to have higher GTD/productivity/KPI.

Stress — correlates with opacity. Would I take a job in php, SQL, javascript, perl?

Confidence to take on a gig — The more transparent, the higher my confidence — like py, javascript

Bootstrap — with a transparent language, I’m confident to download an open source project and hack it (Moodle …). With an opaque language like C++, I can download, make and run it fine, but to make changes I often face the opacity challenge. Other developers are often more competent at this juncture.

Learning — The opaque parts of a language requires longer and more “tough” learning, but sometimes low market value or low market depth.

Competitiveness — I usually score higher percentiles in IV, and lower percentiles in GTD. The “percentile spread” is wider and worse with opaque languages. Therefore, I feel 滥竽充数 (faking competence to fill a seat) more often

In this context, transparency is defined as the extent to which you can use __instrumentation__ (like debugger or print) to understand what’s going on.

  • The larger systems tend to use the advanced language features, which are less transparent.
  • The more low-level, the less transparent.

–Most of the items below are “languages” capable of expressing some complexity:

  • [T] SQL, perl, php, javascript, ajax,
  • [T] stored proc unless complex ones, which are uncommon
  • [T] java threading is transparent to me, but not other developers
  • [S] java reflection-based big systems
  • [T] regular c++, c# and java apps
  • [O]… but consider java dynamic proxy, which is used in virtually every non-trivial package
  • [T] most python scripts
  • [S] … but look at python import and other hacks. Import is required in large python systems.
  • [O] quartz
  • [S] Spring underlying code base. I initially thought it was transparent. Transparent to Piroz
  • [O] Swing visual components
  • [O] C# WCF, remoting
  • [T] VBA in Excel
  • —-below are not “languages” even in the generalized sense
  • [S] git .. semi-transparency became stressor cos KPI!
  • [O] java GC … but NO STRESS cos so far this is not a KPI
  • [O] MOM low level interface
  • [S] memory leak detectors in java, c#, c++
  • [O] protobuf. I think the java binding uses proxies
  • [T] XML + simple SAX/DOM
  • [S =semi-transparent]
  • [O =opaque]

 

MOM+threading Unwelcome ] low latency@@ #FIX/socket

Piroz told me that trading IT job interviews tend to emphasize multi-threading and MOM. Some use SQL too. I now feel all of these are unwelcome in low latency trading.

A) MOM — see also my post on HFT mktData redistribution via MOM. For order processing, FIX is the standard. FIX can use MOM as a transport, but that is not popular and I’m unfamiliar with it.

B) threading — Single-Threaded-Mode (STM) is generally the fastest in theory and in practice. (I only have a small observed sample size.) I feel the fastest trading engines are STM, with no shared mutable state. The new Nasdaq platform (in java) is STM.

MT is OK if the threads don’t compete for resources like CPU, I/O or locks. Compared to STM, most lock-free systems introduce latency such as retries and additional memory barriers. Without those barriers, the compiler is free to optimize more aggressively.

C) SQL – as stated elsewhere, flat files are much faster than relational DB. How about in-memory relational DB?

Rebus, the order book engine, is in-memory.

ref-counted copy-on-write string #MIAX exchange IV

I completely failed this 2011 IV question from MIAX options exchange:

Q: outline a ref-counted copy-on-write string class, showing all the function declarations
A: here’s my 2017 answer

struct Payload{ //this object must live outside any Str instance. If 1st Str instance in the group is destroyed, this object must live on.
	char * const arr;
	size_t const length;
	mutable size_t refCount;
	Payload(std::string const &);
};
class Str{
	Payload const * payload;
public:
	~Str(); //decrement refCount and if needed, delete the payload
	//Str(); //default ctor is useless since after construction we can't really modify this instance, due to copy-on-write
	Str(std::string const &);
	Str(Str const &); // similar to shared_ptr

	Str & operator=(Str const & other) const; // will return a reference to a new instance constructed on heap (never on stack!). The new instance will have a ref-count initialized to 1
	Str & replace_with(std::string const &) const; //ditto

// optional utilities
	char const * c_str()    const; // <-- Hey mine is very similar to std::string
	Str(char const * arr, size_t const len); // <-- Hey mine looks similar to std::string
	friend std::ostream & operator<<(std::ostream &, Str const &);
};

returning const std::string #test program#DeepakCM

1) Returning a const std::string is meaningful — it disallows f().clear(), f().append(), f().assign() etc. (Deepak’s 2019 MS interview.)

2) returning a const int is useless IMO. [1] agrees.

3) I agree with an online post: returning a const ptr is the same as returning a non-const ptr. The caller would clone that const ptr just as it clones a mutable ptr.

I would say what’s returned is an address, just like returning an int value of 315.

#include <iostream>
using namespace std;

int i=444;
int * const pi = &i;
int * p2 = pi;
int main(){
  int i2=222;
  p2 = &i2;
  cout << *p2 <<endl;
}

It does make a difference if you return a ptr-to-const. The caller would make a clone of the ptr-to-const and can’t subsequently write to the pointee.

4) How about returning a const Trade object? [1] gave some justifications:

[1] https://www.linuxtopia.org/online_books/programming_books/thinking_in_c++/Chapter08_014.html

JGC duration=100→10ms ] “Z-GC” #frequency

This blog has many posts on JGC overhead.

  • For low-Latency JVM, duration (pause time) outweighs frequency
  • For mainstream JVM, overhead + throughput outweighs duration

–frequency

Could be every 10 seconds, as documented in my blogpost

–stop-the-world duration:

A 100 ms pause is probably good enough for most apps but too long for latency-sensitive apps, according to my blogpost.

For a 32GB JVM in a latency-sensitive Barclays system, worst long pause == 300ms.

The new Z-GC features GC pause times below 10ms on multi-terabyte heaps. This is cutting-edge low pause.

TCP client set-up steps #connect()UDP

TCP Client-side is a 2-stepper (look at Wikipedia and [[python ref]], among many references)
1) [SC] socket()
2) [C] connect()

[SC = used on server and client sides]
[S=server-only]
[C=client-only. seldom/never used on server-side.]

Note UDP is connection-less but connect() can be used too — to set the default destination. See https://stackoverflow.com/questions/9741392/can-you-bind-and-connect-both-ends-of-a-udp-connection.

Under TCP, the verb connect() means something quite different — “reach across and build a connection”[1]. You see it when you telnet … Also, the server side doesn’t make outgoing connections, so connect() is used by the TCP client only. When making a connection, we often see error messages about the server refusing the connection, because no server is “accepting”.

[1] think of a foreign businessman traveling to China to build guanxi with local government officials.
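A minimal sketch of connect() on a UDP socket — it only records the default destination; the address and port below are illustrative, and connect() itself sends no packet and performs no handshake:

#include <arpa/inet.h>
#include <netinet/in.h>
#include <sys/socket.h>
#include <unistd.h>

int main(){
    int fd = socket(AF_INET, SOCK_DGRAM, 0);            // UDP: connection-less

    sockaddr_in dest{};                                  // made-up destination for illustration
    dest.sin_family = AF_INET;
    dest.sin_port   = htons(40491);
    inet_pton(AF_INET, "224.0.0.7", &dest.sin_addr);

    // merely records the default destination on this socket
    connect(fd, reinterpret_cast<sockaddr*>(&dest), sizeof dest);

    const char msg[] = "hello";
    send(fd, msg, sizeof msg, 0);                        // goes to the default destination
    close(fd);
}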

 

static_cast^reinterpret_cast

I like the top answer in https://stackoverflow.com/questions/6855686/what-is-the-difference-between-static-cast-and-reinterpret-cast.

Both are unsafe casts and could hit runtime errors, but RC is more unsafe. RC turns off compiler error checking, so you are on your own in the dark forest.

I feel RC is the last resort, when you have control on the bit representation of raw data, with documented assurance this bit representation won’t change.

In a real example in RTS — a raw byte array comes in as a char array, and you RC it into a (pointer to a) packed struct. The compiler has no reason to believe the char array can be interpreted as that struct, so it skips all safety checks. If the raw data somehow changes, you get undefined behavior at run time. In this case SC is illegal — it will not pass the compiler.
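A hedged sketch of that RTS-style usage — the packed struct layout is made up, and a little-endian host is assumed:

#include <cstdint>
#include <iostream>

#pragma pack(push, 1)
struct MsgHeader {          // packed: assumed to match the wire format byte-for-byte
    uint16_t msgType;
    uint32_t seqNum;
};
#pragma pack(pop)

int main(){
    // 6 raw bytes as they might arrive off the wire
    const char rawBytes[] = {0x01, 0x00, 0x2A, 0x00, 0x00, 0x00};

    // static_cast<const MsgHeader*>(rawBytes) would not compile -- unrelated pointer types.
    // reinterpret_cast turns off the checks; we are on our own if the layout ever changes.
    const MsgHeader * hdr = reinterpret_cast<const MsgHeader*>(rawBytes);
    std::cout << "type=" << hdr->msgType << " seq=" << hdr->seqNum << "\n";
}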

 

c++non-void function +! a return value

Strictly speaking, this is undefined behavior, not a compile error. https://stackoverflow.com/questions/9936011/if-a-function-returns-no-value-with-a-valid-return-type-is-it-okay-to-for-the explains the rationale.

However, in practice,

  • For an int function the compiler could return any int value.
  • For functions returning type AA, I don’t know what is returned. Could it be a default-constructed instance of AA?
    • My specific case — I modified a well-behaving function to introduce an exception, then added a catch-all block without a return value. It actually worked fine, so some instance of AA was apparently returned!
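A hedged reconstruction of that scenario (AA and the function are made up); the catch-all block falls off the end of a non-void function, which is undefined behavior even when it appears to work:

#include <iostream>
#include <stdexcept>
#include <string>

struct AA { std::string label; };

AA getAA(bool fail){
    try {
        if(fail) throw std::runtime_error("boom");
        return AA{"normal path"};
    } catch(...) {
        // no return here: control can fall off the end of a non-void function,
        // which is undefined behavior even if it appears to "work"
    }
}   // g++/clang warn: control reaches end of non-void function

int main(){
    std::cout << getAA(false).label << "\n";     // well-defined path
    // std::cout << getAA(true).label << "\n";   // the case discussed above: UB
}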

 

zero sum game #my take

“Zero sum game” is a vague term. One of my financial math professors said every market is a zero sum game. After the class I brought up to him that over the long term, the stock (as well as gov bond) market grows in value [1] so the aggregate “sum” is positive. If AA sells her 50 shares to BB who later sells them back to AA, they can all become richer. With a gov bond, if you buy it at par, collect some coupons, sell it at par, then everyone makes money. My professor agreed, but he said his context was the very short term.

Options (if expired) and futures look more like ZSG to me, over any horizon.

If an option is exercised then I’m not sure, since the underlier asset bought (unwillingly) could appreciate next day, so the happy seller and the unwilling buyer could both grow richer. Looks like non-zero-sum-game.

Best example of ZSG is football bet among friends, with a bookie; Best example of NZSG is the property market. Of course we must do the “sum” in a stable currency and ignore inflation.

[1] including dividends but excluding IPO and delisting.

realistic TPS for exchange mkt-data gateway: below 1M/sec

  • Bombay’s BSE infrastructure (including matching engine) can execute 1 mil trades/second. Must maintain order books, therefore NOT stateless.
  • Rebus maintains full order books for each symbol and can handle 300k (in some cases 700k) messages per second per instance. It uses an AVL tree, which beat all other data structures in tests.
  • The Nasdaq feed parser (in c++) is stateless and probably single-threaded, with very limited parsing logic compared to Rebus. It once handled 600k messages/second per instance.
  • Product Gateway is probably a dumb process since we never need to change it. It can handle 550k messages/second per instance.

I believe TPS throughput, not latency, is the optimization goal. Biggest challenge known to me is the data burst.

Therefore, java GC pause is probably unacceptable. In my hypothesis, after you experience a data surge for a while, you tend to run out of memory and must run GC (like a run to the bathroom). But that’s the wrong time to run GC: if the surge continues while your GC runs, the incoming data would overflow the queue.

q[return]: bash func^sourced^regular script

http://stackoverflow.com/questions/9640660/any-way-to-exit-bash-script-but-not-quitting-the-terminal says

you can use return instead of exit. Its main purpose is to return from a shell function, but if you use it within a sourced script, it returns from that script.

In a regular shell script, “return” also kind-of works. It is illegal, so the return itself fails, but all the commands before it execute as expected.

                          return               exit
from function             🙂                   probably dangerous
from sourced script       🙂                   immediate disconnection 😦
from standalone script    fails but OK 🙂      nice
from shell                fails but OK         immediate disconnection 😦

cvsignore troubleshoot`

gitignore can be investigated — git can tell you why a particular file is ignored (git check-ignore -v). CVS doesn’t support that instrumentation.

If you need to confirm a file is explicitly ignored, then you can put a single “!” in ~/.cvsignore (as described in the official manual) to clear the ignore list.

%%first core dump gdb session

gdb $base/shared/tp_xtap/bin/xtap   ~/nx_parser/core

### 1st arg is the executable responsible for the core dump. If this executable is correct, then you should not get “No symbol table is loaded. Use the file command.” If you do, then try

(gdb) file /path/to/xtap

I think “symbol” means the variable/function names. I think gdb sees only … addresses, not names. To translate them, presumably you need *that* file.

In my case the “xtap” executable is a debug-build, verified with “file” and “objdump” commands, according to http://stackoverflow.com/questions/3284112/how-to-check-if-program-was-compiled-with-debug-symbols

Anyway, we need to load the symbols. For a large executable (like minerva) with many *.so libraries, 30 minutes may be needed. Once the symbols are loaded, run “bt” to show the backtrace

(gdb) bt

Now choose one of the frames such as the 2nd most recent function i.e. Frame #1

(gdb) frame 1

Now gdb shows the exact line number and the line of source code. You can see before/after lines with “list”. Those lines may belong to other functions collocating in the same source file.

(gdb) list

GTD/KPI/zbs/effi^productivity defined#succinctly

  • zbs — real, core strength (内功) of the tech foundation for GTD and IV. Zbs (not GTD) is required as architect, lead developer, or for west coast interviews.
  • GTD — make the damn thing work. LG of quality, code smell, maintainability etc
  • KPI — boss’s assessment often uses productivity as the most important underlying KPI, though they won’t say it.
  • productivity — GTD level as measured by __manager__, at a higher level than “effi”
  • effi — a lower level measurement than “Productivity”

cvs cheatsheet

–top 10 how-to

  1. list modified files — 2 choices
    1. cvs -qn up # shows both untracked and uncommitted
    2. cvs status |grep # doesn’t offer a quick glance
  2. list untracked files in the current dir (git clean -fxd) —
    1. cvs -qn up | grep '^?'  # listing only
  3. discard local changes (cvs equivalent of git reset --hard)
    1. cvs up -C the/file
  4. checkout a single dir/file. First, cd into the parent dir, then 2 choices
    1. cvs up -d missingDir
    2. cvs co tp/plugins/xtap/missingDir # no leading slash. See http://www.slac.stanford.edu/grp/cd/soft/cvs/cvs_cheatsheet.html for explanation

–Tier 2 (Still useful):

  • cvs -H any command
  • the -l option: -l Local; run only in top of current working directory, rather than recursing through subdirectories. Available with the following commands: checkout, diff, log, status, tag, update..

Monkey-jump problem{Ashish IV@Baml #solved

On the 2D-coordinate plane, a monkey starts from point [a,b] and wants to reach [P,Q]. From any point, the monkey can only jump to one of two points:

  1. [x,y] up to [x, x+y]
  2. [x,y] out to [x+y, y]

Q: Can you write bool canReach(unsigned long long a,b,P,Q)
A: start from [P,Q] and work backwards — then there is no decision to make!

#include <iostream>
using namespace std;
struct Point {
    int x, y;
    Point(int x, int y) : x(x), y(y) {}
    friend ostream& operator<<(ostream& os, Point & n){
      os<<n.x<<","<<n.y;
      return os;
    }
};
bool canReach(unsigned int a, unsigned int b, unsigned int P, unsigned int Q){
  Point lower(a,b), higher(P,Q);
  for(Point p=higher;;cout<<p<<endl){
    if (p.x == lower.x && lower.y == p.y){
        cout<<"Good:)"<<endl;
        return true;
    }
    if (p.x==p.y || p.x<lower.x || p.y<lower.y){
        cout<<"Failed:( Can't jump further"<<endl;
        return false;
    }
    if (p.x<p.y){ p.y = p.y-p.x ; continue;} 
    if (p.x>p.y){ p.x = p.x-p.y ; continue;}
  }
}
int main() {
  cout<<canReach(1,10, 57,91);
}

 

nyse opening auction mkt-data

There are various order types usable before the 9.30 opening auction (and also before a halted security comes back). We might receive these orders in the Level 2 feed. I guess the traders use those orders to test the market before trading starts. Most of these orders can be cancelled, since there’s no execution.

The Imbalance data is a key feature published by the exchange auction engine, and possibly very valuable to traders. It has

  • indicative match price, which at 9.30 becomes final match price
  • imbalance at that match price

The secret optimization algorithm – The auction engine looks at all the orders placed, and works out an optimal match price to maximize execution volume. Traders would monitor that indicative price published by exchange, and adjust their orders. This adjustment would be implemented in an execution algorithm.

scan entire codebase4given func name #how2

–Challenge: scan a c++ codebase for a given func name

  • A script would offer more flexibility.
  • find + perl + grep is a crude solution, without support for comments

See also the task in Outlook!

–A related challenge: suppose you have the definition of a function, how do you see all the callers?

Csmi.C: In static member function ‘static csmiparser::Csmi& csmiparser::Csmi::getInstance()’:
Csmi.C:14: warning: ‘__comp_ctor ’ is deprecated (declared at /home/vtan/tp/plugins/xtap/csmi/include/Csmi.h:33)

  • Technique — rename the function by appending _xxx and rebuild; the compile errors reveal every caller

 

factorize a natural number #AQR

My friend Dilip received this question in a 2017 AQR on-site.

Q: given a natural number (like 8), write a function to output every factorization such as (2,4) (2,2,2). You can ignore or include the trivial factorization (1,8). You can use recursion if you want.
— (incomplete) Analysis

  1. I would first generate all the prime numbers up to sqrt(N)
  2. among them, i would find all the factors. Once I find a prime factor x, keep dividing by x so I know in total I have 3 x’s, 2 y’s and 1 z, or (x,x,x,y,y,z). I call them 6 non-distinct “prime factors”.

From there, I might be able to write a (recursive?) function to output the factorization formulas. The ideal algo automatically avoids duplicate factorizations, but here’s my non-ideal design: generate all 2-way “splits”, all 3-way splits… If I keep all my “splits” in a hashtable, I can detect duplicates. So just treat the 6 factors as 6 distinct factors. Now the problem is well-defined — next split@N boys.

— trie solution based on generate combinationSum compositions #backtrack up] trie+tree

Candidates are the non-distinct prime factors and their products, each a factor of the big number.

— recursive solution by CSY

  • no prime number needed! A major drawback — if the target number is odd, we would still keep testing 2, 4, 6, 8 as possible divisors!

https://github.com/tiger40490/repo1/blob/cpp1/cpp/algo_comboPermu/factorize_AQR.cpp is a very short solution by CSY (a minimal sketch in the same spirit appears at the end of this section). Here’s my analysis —

  • I believe every time the factorize(60) function finds a small factor like 2, it pushes the factor onto a global stack, then run factorize() on the quotient i.e. 30 — wherein every factorization formula on 30 is “decorated” with the stack.

https://github.com/tiger40490/repo1/blob/py1/py/algo_combo_perm/factorize_AQR.py is my modified/improved python solution

  • I replaced the global vector with a local immutable list on each call stack. It helps me reason. This is also thread-friendly, if the target number is large.
  • It’s highly instructive to work out the expected output from the recursive loops, as in my comments.
  • Just like the continuousSentence problem, the recursive solution is clever-looking but not scalable.
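Below is my own minimal C++ sketch of that recursive idea (not CSY’s code): each call tries divisors starting from the previous factor, so every formula is printed once, in sorted order.

#include <iostream>
#include <vector>
using namespace std;

// print every factorization of n into factors >= minFactor (trivial 1 x n ignored)
void factorize(unsigned n, unsigned minFactor, vector<unsigned> & path){
  for(unsigned d = minFactor; d * d <= n; ++d){
    if(n % d) continue;                        // d is not a factor
    path.push_back(d);
    for(unsigned f : path) cout << f << " x "; // one formula: current path + the quotient
    cout << n / d << endl;
    factorize(n / d, d, path);                 // recurse on the quotient, divisors start at d
    path.pop_back();                           // backtrack
  }
}

int main(){
  vector<unsigned> path;
  factorize(8, 2, path);   // prints: 2 x 4, then 2 x 2 x 2
}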

managers(more than techies)need boss’s like

In a nutshell, compared to techies, managers depend heavily on boss relationship. Techies rely more on self-effort.

Grandpa pointed out that

  • A) as an aspiring manager (at least in China), your boss’s opinion is the overriding factor;
  • B) as a techie or a academic/researcher, you have the right to seek promotion based on merit and technical achievement. If you don’t get it you can try elsewhere.

I feel A) has 2 levels.

  • A1) at the mid-management level, I don’t have much insight, so I guess that unlike entrepreneurs, most managers are not more capable/intelligent than other managers, so I agree the boss’s opinion is the #1 factor in your career progression.
  • A2) at the entry level, I actually observed many tech managers including tech leads (or architects), application owners, dev managers, support managers. I think competence level is visibly different. Some (example?) show very high technical capability. This is most relevant at the lowest level, but across the levels, technical capability is not always the #1 or #2 factor. Why?
    • High-level design, technical foresight, persuasion (on tech front) are not always “innate” to these guys.
    • Communication with users, team members, and manager could be equally important aspects. The Mdaq CTO singled out “listening to team members”. Not every leader is good at that.
    • The most technical guy may not be a suitable leader and may be best in another role in the same team.
  • Luck seems to have more impact on managerial than technical track

I think Yihai would have more to add.

Grandpa’s advice

  1. since you clearly know you are not good at management, don’t ever compare with the high-flyers (and invite the self-hate).
  2. know the many people who are less fortunate.

multicast address ownership#eg exchanges

https://www.iana.org/assignments/multicast-addresses/multicast-addresses.xhtml shows a few hundred big companies including exchanges. For example, one exchange multicast address 224.0.59.76 falls within the range

224.0.58.0 to 224.0.61.255 Inter-continental Exchange, Inc.

It’s educational to compare with a unicast IP address. If you own such a unicast address, you can put it on a host and bind an http server to it. No one else can bind a server to that unicast address. Any client connecting to that IP will hit your host.

As owner of a multicast address, you alone can send datagrams to it and (presumably) you can restrict who can send or receive on this group address. Alan Shi pointed out the model is pub-sub MOM.

Twitter hashtags are another analogy.

UDP^TCP #TV-channel

http://www.diffen.com/difference/TCP_vs_UDP is relevant.

  • FIFO delivery — TCP: yes; UDP: packet sequencing is uncontrolled
  • Virtual circuit — TCP: yes; UDP: datagram network
  • Connection — TCP: connection-oriented; UDP: connectionless
  • Channel vs Connection — in RTS and xtap, we use the analogy of a “TV channel” for multicast. TCP uses “Connection”.

With http, ftp etc, you establish a Connection (like a session). No such connection for UDP communication.

Retransmission — built into TCP; with UDP, the application layer (not the network layer) on the receiving end must request retransmission.

To provide guaranteed FIFO data delivery over an unreliable channel, TCP must be able to detect gaps and request retransmission. UDP doesn’t bother. An application built on UDP needs to create that functionality, as in the IDC (Interactive Data Corp) ticker plant. Here’s one simple scenario (easy to set up as a test), with a sketch of the gap-detection step after the list:

  • sender keeps multicasting
  • shut down and restart receiver.
  • receiver detects the sequence number gap, indicating message loss during the down time.
  • receiver requests retransmission.
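A minimal sketch (my own, with an assumed message framing where every message carries an increasing sequence number) of the gap-detection step in that scenario:

#include <cstdint>
#include <initializer_list>
#include <iostream>

class GapDetector {
  uint64_t expectedSeq = 0;                    // next sequence number we expect (0 = none seen yet)
public:
  uint64_t onMessage(uint64_t seq){            // returns how many messages were missed before this one
    uint64_t missed = (expectedSeq != 0 && seq > expectedSeq) ? seq - expectedSeq : 0;
    expectedSeq = seq + 1;
    return missed;
  }
};

int main(){
  GapDetector gd;
  for(uint64_t s : {1, 2, 3, 7, 8}){           // 4, 5, 6 lost while the receiver was down
    if(uint64_t n = gd.onMessage(s))
      std::cout << "gap of " << n << " before seq " << s << " -> request retransmission\n";
  }
}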

 

blockchain #phrasebook

A blockchain is a peer-to-peer network that timestamps records by hashing them into an ongoing chain of hash-based proof-of-work, forming a record that cannot be changed without redoing the proof-of-work.

In contrast, a distributed ledger is a peer-to-peer network that uses a defined consensus mechanism to prevent modification of an ordered series of time-stamped records. All blockchains are distributed ledgers, but not all distributed ledgers are blockchains.

Keywords:

  • Peer-to-peer — no central single-point-of-failure
  • Immutable — records of past transactions
  • Ever-growing — the chain keeps growing and never shrinks. Is there some capacity issue in terms of storage, backup, search performance?
  • Double-spend — is a common error to be prevented by blockchain

use Unsigned-char-array (i.e. ByteArray) to transfer binary chunk

If you intend to store or transfer arbitrary binary data, you should use …. unsigned char. ANSI-C has no “byte” data type.

It is the only data type that is guaranteed (by the ANSI C Standard) to have no padding bits. So all 8 bits in an unsigned char contribute to the value. None of them is a padding bit.

Contrary to some online posts, the unsigned-char type is distinct from “char” — they are two different types in both C and C++ (plain char may behave like signed or unsigned char, but it is always a separate, third type).
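A minimal sketch of the idea — round-tripping an arbitrary object through an unsigned-char buffer (the Quote struct is made up for illustration):

#include <cstdint>
#include <cstring>

struct Quote { uint32_t symbolId; double price; };

int main(){
    Quote q{42, 101.5};
    unsigned char buf[sizeof q];        // ByteArray: every bit of the object lands here
    std::memcpy(buf, &q, sizeof q);     // serialize: raw bytes, no padding-bit surprises

    Quote q2;
    std::memcpy(&q2, buf, sizeof buf);  // deserialize on the "receiving" side
    return q2.symbolId == q.symbolId && q2.price == q.price ? 0 : 1;
}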

focus+engagement2dive into a tech topic#Ashish

(Blogging. No need to reply)

Learning any of the non-trivial parts of c++ (or python) requires focus and engagement. I call it the “laser”. For example, I was trying to understand all the rules about placing definitions vs declarations in header files. There are not just “3 simple rules”. There are perhaps 20 rules, with exceptions. To please the compiler and linker you have various strategies.

OK this is not the most typical example of what I want to illustrate. Suffice to say that, faced with this complexity (or ambiguity, or “chaos”) many developers at my age simply throw up their hands. People at my age are bombarded with kids’ schooling, kids’ enrichment, baby-sitting, home repair [1], personal investment [2], home improvement… It’s hard to find a block of 3 hours to dive in/zoom in on some c++ topic.

As a result, we stop digging after learning the basics. We learn only what’s needed for the project.

Sometimes, without the “laser”, you can’t break through the stone wall. You can’t really feel you have gained any insight on that topic. You can’t connect the dots. You can’t “read a book from thin to thick, then thick to thin again”. You can’t gain traction even though you are making a real effort. Based on my experience, on most of the those tough topics the focus and engagement is a must.

I’m at my best when I have my “laser” on. Gaining that insight is what I’m good at. I relied on my “laser” to gain insights and compete on the job market for years.

Now I have the time and bandwidth, I need to capitalize on it.

[1] old wood houses give more problems than, say, condos with a management fee

[2] some spend hours every day

total number@exchange tickers+symbols

In one major market data provider, there are 20 to 30 million “instruments”, mostly derivatives. Globally there are more than 300,000 stocks on various trading venues, as of 2017.

This site aggregates 300 to 400 “sources”. Each source could be a data feed or a distinct trading venue. For example, Nyse has at least 3 independent trading venues — nyse classic, Arca and Amex.

As of Apr 2018, all 8600 stocks and ETFs listed in the US are now tradable on NYSE. Until then, trading at the venerable exchange was limited to the roughly 3,150 securities listed on the NYSE.

prepare]advance for RnD career

Grandpa became too old to work full time. Similarly, at age 75 I may not be able to work 8 hours a day. Some job functions are more suitable for that age…

I guess there’s a spectrum of “insight accumulation” — from app developer to tuning, to data science/analysis to academic research and teaching. The older I get (consider age 70), the more I should consider a move towards the research end of the spectrum…

My master’s degree from a reputable university is a distinct advantage. Without it, this career choice would be less viable. (Perhaps more importantly) It also helps that my degree is in a “hard” subject. A PhD may not give me more choices.

For virtually all of these domains, U.S. has advantages over Singapore. Less “difficult/unlikely” in U.S.

In theory I could choose an in-demand research domain within comp science, math, investment and asset pricing … a topic I believe in, but in reality entry barrier could be too high, and market depth poor

Perhaps my MSFM and c++ investment don’t bear fruit for many years, but become instrumental when I execute a bold career switch.

 

[11] threading,boost IV #Sapient perhaps

Read-guard, write-guard ?

Q: how to test if current thread already holds a target lock?
%%A: Though there might be platform-specific tricks, I think it’s not widely supported. Best to use recursive mutex to bypass the problem.
AA: pthread_self() is a C function (=> non-method) that returns the thread id of the current thread.
AA: https://blogs.msdn.microsoft.com/oldnewthing/20130712-00/?p=3823 has a microsoft specific solution
AA: using this thread::id object a lock can “remember” which thread is holding it.  https://stackoverflow.com/questions/3483094/is-it-possible-to-determine-the-thread-holding-a-mutex shows a gdb technique, not available at runtime.
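A hedged sketch of the “remember the owning thread” idea from the last answer — this is my own wrapper, not a standard or boost facility:

#include <atomic>
#include <mutex>
#include <thread>

// the mutex "remembers" the owning thread's id, so isHeldByCurrentThread() can be queried
class OwnerTrackingMutex {
  std::mutex mtx;
  std::atomic<std::thread::id> owner{ std::thread::id{} };  // empty id == "nobody"
public:
  void lock()   { mtx.lock();  owner.store(std::this_thread::get_id()); }
  void unlock() { owner.store(std::thread::id{}); mtx.unlock(); }       // clear before releasing
  bool isHeldByCurrentThread() const {
    return owner.load() == std::this_thread::get_id();
  }
};

int main(){
  OwnerTrackingMutex m;
  m.lock();
  bool held = m.isHeldByCurrentThread();   // true on this thread, false on any other
  m.unlock();
  return held ? 0 : 1;
}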

Q: Is scoped_lock reentrant by default? Can you make it reentrant?
AA: With a boost recursive mutex, a single thread may lock the same mutex several times and must unlock the mutex the **same-number-of-times**.

AA: pthreads supports it. http://www.opengroup.org/onlinepubs/009695399/functions/pthread_mutex_lock.html says — If the mutex type is PTHREAD_MUTEX_RECURSIVE and the mutex is currently owned by the calling thread, the mutex lock count shall be incremented by one and the pthread_mutex_trylock() function shall immediately return success.

Q: Boost supports try_lock()?
AA: yes

Q: Boost supports lockInterruptibly()?
A: probably not, but there’s timed_lock

Q: Boost supports reader/writer locks?
AA: read/write lock in the form of boost::shared_mutex

Q: difference between mutex and condition var?

Q: can you write code to implement “recursive” locking with reader/writer threads?
A: key technique is a bool isHoldingLock(targetlock)

Q: what types of smart pointers did you use and when. What are the differences? I’d like to focus on the fundamentals not the superstructures.

Q: what other boost lib?
%%A: I briefly used boost bind. I believe STL binders are imperfect, and lambdas are superior

Q2: if I want a custom class as a map key?
%%A: must overload the less-than operator

Q2c: what if I can’t change the source code?
%%A: derive from binary_function to get a functor, and pass it to map constructor as the optional type param. However, the new functor only has “operator()” overloaded? I think it will still work, as the function doesn’t have to be named “less()”.
AA: Item 42 [[eff STL]]

%%Q2g: is operator less-than a friend func or a method?
AA: can be a non-friend(if everything public) or a friend class/func.

##what defined%%self-img]each period

See posts on time allocation. Those things I overspent time on tend to be what defined me.

Income alone is never an item. Instead, “Earning capacity due to …..” is often one of the defining features.

In This list I prefer one-word items.

  • till high school, what defined me was nothing but
    • grades
    • simple focus and dedication, driven by academic ambition
  • NUS ( grades became mediocre even though I worked very hard towards Dean’s list )
    • academic ambition -> commercial ambition
  • 1999 – 2002
    • wizard — salary (due to my wizardry) was high but gradually became just about average
  • 2002 SCS – …
    • bizwing
  • 2007 – 2012
    • IV — in java, c++, swing, dnlg … became my killer skill
      • billing rate
    • tech zbs — (“GTD”? No) zbs was my focus even though I was not really outstanding
  • Aug 2012 – Apr 2017. Focus was finally lost. I became “interested in” many other things
    • wealth preservation, income maintenance — became the #1 undercurrent
      • MSFM was part of “income maintenance” effort
      • properties became the main part of wealth preservation and retirement planning
      • (alternative income source (rental, per investment, trading) — became a temporary focus and never got big)
    • I also defined myself as a dedicated and serious father to Yixin, not really hoping to produce a prodigy, but to shape his character and habits, and help him find his motivation.
  • now
    • IV and tech learning
    • continuous push to go deeper, further and increase zbs
    • low burn rate, conserver lifestyle

Hope to regain motivation and sharpen focus on c++, exchange conn and grow another “wing”

U.S.hir`mgr may avoid bright young candd

My colleague Alan pointed out that some hiring managers are cautious with young bright candidates.

Some companies have a policy to promote younger employees. In terms of competitive threat, the older candidates are perceived as "disarmed" weapons.

Some hiring managers may also want to protect his old loyal employees, who may be put under threat by a young bright newcomer.

sense-of-urgency, for kids+professionals #IBM #Alok

“If there is no time frame specified, there is no sense of urgency and nothing to measure against. If you set a time frame you will be able to use this as a reason to go back to your boss at regular intervals and update them on the progress, ask for help and set appropriate expectations throughout the year.”

So said an IBM site. “Sense of urgency” is a simple but powerful propaganda slogan introduced by the 1993 incoming CEO of IBM. In my experience, workers with a consistent sense of urgency will be noticed by the boss. A boss without a sense of urgency could be a nice boss but not effective.

I like this slogan, because my son needs it, badly. It should be instilled at a younger age.

If a frog sits in a slow boiler, she won’t have the sense of urgency to jump out.

By default, folks fall into inaction — weight management, regular exercise, or self-improvement. Setting some kind of measurable goal is a simple device. My dad goes a step further — until now, he has daily plans and he gets things done.

Take for example my house search. If I don’t have any sense of urgency, I would simply postpone indefinitely any research or discussion including school-districts. Reality is, sooner or later my son needs a school. The more I postpone my research, the less time I have to do a proper research. So I do feel a sense of urgency, but hopefully I’m not hitting too many colleagues in the office.

Let’s remind each other the sacrifices we made — the cut, living away from family, additional expenses including rent.

[[Focus]] for technical xx

See also my email to Raja, XR and ZengSheng on “the zone”, the “creative engine”…

As the book author, Dan’s concept of focus is more narrow. It’s related to a student’s focus. I told my son he needs better concentration. As a student I was very good with concentration and engaged.

Compared to my engaged “traction” days, I now spend too much time emailing, blogging on non-tech, calling home, research about housing, shopping, food preparation … But that’s not the key. During the work or study hours, I am not focusing as I did in primary school.

Traction/engaged days are with growth, concentration, positive feedback loop, tech blogging….

I think Yang was the first manager who pointed out I need better focus. Perhaps he meant I tried to study all the systems in PWM Comm all at once. I made real progress (with ErrorMemos, SQL, Java …) after I sent my wife/son back to Singapore.

I made real progress with quant stuff during the Barcap days. However, I also had to keep job hunting remotely.

Avichal, who sat beside me, said I had too many distractions… Disengaged.

Similar to dabao’s case, I guess the initial code complexity requires focus… Without sufficient focus, the complexity is just too much for my capacity. Better shut out everything else and focus on delivering something non-trivial. Remember PWM, Quartz, …

Grandpa had a lot to say about focus…

github branch backup

I plan to use github branches for backup:

  1. pick a single most valuable directory, like cpp (or py)
  2. once a while merge it into another branch like perl1 or bash
    • Watch out for any deletion of a directory like cppProj. You may need to merge the cppProj branch

Note:

  • Do NOT rely on this back-up. It’s a low-cost back-up with unknown reliability
  • more frequent back-up doesn’t hurt and should be quick and simple

c++class field defined]header,but global vars obey ODR

Let’s put function declaration/definition aside — simpler.

Let’s put aside local static/non-static variables — different story.

Let’s put aside function parameters. They are like local variables.

The word “static” is heavily overloaded and confusing. I will try to avoid it as far as possible.

The tricky/confusing categories are

  • category: static field. Most complex and better discussed in a dedicated post — See https://bintanvictor.wordpress.com/2017/02/07/c-static-field-init-basic-rules/
  • category: file-scope var — i.e. those non-local vars with “static” modifier
  • category: global var declaration — using “extern”
    • definition of the same var — without “extern” or “static”
  • category: non-static class field, same as the classic C struct field <– the main topic in the post. This one is not about declaration/definition of a variable with storage. Instead, this is defining a type!

I assume you can tell a variable declaration vs a variable definition. Our intuition is usually right.

The Aha — [2] pointed out — A struct field listing is merely describing what constitutes a struct type, without actually declaring the existence of any variables, anything to be constructed in memory, anything addressable. Therefore, this listing is more like an integer variable declaration than a definition!

Q: So when is the memory allocated for this field?
A: when you allocate memory for an instance of this struct. The instance then becomes an object in memory. The field also becomes a sub-object.

The main purpose of keeping the struct definition in a header — the compiler needs to calculate the size of the struct. This is a completely different purpose from function or object declarations in headers. Scott Meyers discussed this in depth along with class forward declaration and pimpl.
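A hedged sketch (all names made up) contrasting the type-only struct definition in a header with object declarations/definitions that must obey ODR:

---- point.h ----
struct Point {          // type definition only: describes the fields, allocates no storage
    int x;
    int y;              // x and y become sub-objects only when a Point object is created
};
extern int globalCount; // declaration: safe in a header, no storage here (ODR respected)
static int tuLocalVar;  // each compilation unit that includes this gets its own private copy

---- point.cpp ----
#include "point.h"
int globalCount = 0;    // the one and only definition: storage allocated here
Point p1;               // now memory exists, and p1.x / p1.y are addressable sub-objects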

See also