q[static thread_local] in %%production code

static thread_local std::vector<std::vector<std::string>> allFills; // I have this in my CRAB codebase, running in production.

Justification — In this scenario, the data is populated on every SOAP request, so keeping it in non-static data members is doable but considered pollutive.

How about a static field? I used to think it’s thread-safe …

When thread_local is applied to a variable of block scope, the storage-class-specifier static is implied if it does not appear explicitly. In my code I make it explicit.
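Illustration — a minimal sketch (not the actual CRAB code) of a block-scope thread_local with explicit q[static]:

#include <string>
#include <vector>

std::vector<std::vector<std::string>>& getAllFills() {
    static thread_local std::vector<std::vector<std::string>> allFills; // q[static] would be implied anyway
    return allFills; // each thread gets its own instance
}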


criticalMass[def]against churn ] tech IV

See also body-building impact{c++jobs#^self-xx

Critical mass of IV knowledge (eg: from core java self-study) is one of the most effective strategies against technology churn in tech interviews. Once I accumulate the critical mass, I don’t need a full-time job to sustain it.

I have reached critical mass with core java IV, core c++ IV, swing IV (no churn) and probably c# IV.

The acid test is job interviews over a number of years.

Q: how strongly is it (i.e. critical mass) related to accumulation?
A: not much AFTER you accumulate the critical mass. With core java I did it through enough interviews and reading.

Q: how strongly is it related to leverage?
A: not much, though critical mass enhances leverage.

Q: why do some domains offer no critical mass?
A: some (jxee) interview topics have limited depth
A: some (TMP, py) interview topics have no pattern I could identify from the interview questions.

 

c++complexity≅30% mt java

Numbers are just gut feelings, not based on any measurement. I often feel “300% more complexity” but it’s nicer to say 30% 🙂

  • in terms of interview questions, I have already addressed this in numerous blog posts.
  • see also mkt value@deep-insight: java imt c++
  • — syntax/compiler complexity — c++ >> c# > java
  • java is very very clean yet powerful 😦
  • C++ has too many variations, about 100% more than c# and 300% more than java
  • — core language details required for GTD:
  • my personal experience shows me c++ errors are more low-level.
  • Java runtime problems tend to be related to the (complex) packages you adopt from the ecosystem. They often use reflection.
  • JVM offers many runtime instrumentation tools, because JVM is an abstract, simplified machine.
  • — opacity — c++ > c# > java
  • dotnet IL bytecode is very readable. Many authors reference it.
  • java is even cleaner than c#. Very few surprises.
  • — more low-level — c++ > c# > java.
  • JVM is an excellent abstraction, probably the best in the world. The C# CLR is not as good as JVM — a thin layer above the Windows OS.

POSIX semaphore can grow beyond initial value #CSY

My friend Shanyou pointed out a key difference between c++ binary semaphore and a simple mutex — namely ownership. Posix Semaphore (binary or otherwise) has no ownership, because non-owner threads can create permits from thin air, then use them to enter the protected critical section. Analogy — NFL fans creating free tickets to enter the stadium.

https://stackoverflow.com/questions/7478684/how-to-initialise-a-binary-semaphore-in-c is one of the better online discussions.

  • ! to my surprise, if you initialize a posix semaphore to 1 permit, it can be incremented to 2. So it is not a strict binary semaphore.

https://github.com/tiger40490/repo1/blob/cpp1/cpp/thr/binarySem.cpp is my experiment using linux POSIX semaphore. Not sure about system-V semaphore. I now think a posix semaphore is always a multi-value counting semaphore with the current value indicating current permits available.

  • ! Initial value is NOT “max permits” but rather “initial permits”
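For concreteness, a minimal sketch (in the spirit of my binarySem.cpp experiment) showing the permit count growing past the initial value on linux:

#include <semaphore.h>
#include <cstdio>

int main() {
    sem_t sem;
    sem_init(&sem, 0, 1);   // 1 initial permit, not shared across processes
    sem_post(&sem);         // a non-owner creates a 2nd permit from thin air
    int permits = 0;
    sem_getvalue(&sem, &permits);
    std::printf("permits = %d\n", permits); // prints 2
    sem_destroy(&sem);
}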

I feel the restriction by semaphore is just advisory, like advisory file locking. If a program is written to subvert the restriction, it’s easy. Therefore, a “binary semaphore” is binary IFF all threads follow protocol.

https://stackoverflow.com/questions/62814/difference-between-binary-semaphore-and-mutex claims “Mutex can be released only by thread that had acquired it, while you can signal semaphore from any non-owner thread (or process).” This does NOT mean a non-owner thread can release a toilet key owned by another thread/process. It means the non-owner thread can mistakenly create a 2nd key, in breach of exclusive, serialized access to the toilet. https://blog.feabhas.com/2009/09/mutex-vs-semaphores-–-part-1-semaphores/ explicitly says that a thread can release the single toilet key without ever acquiring it.

–java

[[DougLea]] P220 confirms that in some thread libraries such as pthreads, release() can increment a 0/1 binary semaphore value to beyond 1, destroying the mutual exclusion control.

However, java binary semaphore is a mutex because releasing a semaphore before acquiring has no effect (no “spare key”) but doesn’t throw an error.

TCP_NODELAY to improve latency #unclear

https://stackoverflow.com/questions/3761276/when-should-i-use-tcp-nodelay-and-when-tcp-cork

Nagle’s algo helps in applications like telnet. However, it may increase latency when sending streaming data. Additionally, if the receiver implements the ‘delayed ACK policy’, it will cause a temporary deadlock situation. In such cases, disabling Nagle’s algorithm is a better option.
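A minimal sketch of disabling Nagle on a connected socket (assuming sockfd is a valid TCP socket):

#include <netinet/in.h>
#include <netinet/tcp.h>
#include <sys/socket.h>

void disableNagle(int sockfd) {
    int flag = 1; // 1 = disable Nagle's algo on this socket
    setsockopt(sockfd, IPPROTO_TCP, TCP_NODELAY, &flag, sizeof(flag));
}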

temp object binding preferences: rvr,lvr.. #SCB

(Note I used “temp object” as a loose shorthand for “rval-object”.)

Based on https://www.codesynthesis.com/~boris/blog/2012/07/24/const-rvalue-references/

  • a const L-value reference … … can bind to a naturally-occurring rvalue object (or a robbed object after std::move)
  • a non-const r-value reference can bind to a naturally-occurring rvalue object
  • a const r-value reference (crvr) can bind to a naturally-occurring rvalue object but canNOT bind to an lvalue object

Q: so in the presence of all overloads, what kind of reference can naturally occurring temp objects bind to?
A: a const rvr. Such an object prefers to bind to a const rvalue reference rather than a const lvalue reference. There are some obscure use cases for this binding preference.

More important is the fact that const lvr (param type of copy-ctor) can bind to temp object. [[c++primer]] P540 has a section describing that if you pass a temp into Foo ctor you may hit the copy-ctor:

Foo z(std::move(anotherFoo)); // compiles and runs fine even if move-ctor is unavailable. This is THE common scenario before c++11. No change.

Compiler doesn’t bother to synthesize the move-ctor, when copy-ctor is defined!
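A minimal sketch of the binding preferences (hypothetical overload set):

#include <iostream>
#include <utility>

struct Foo {};
Foo make() { return Foo{}; }

void f(const Foo&)  { std::cout << "const lvr\n"; }
void f(Foo&&)       { std::cout << "rvr\n"; }
void f(const Foo&&) { std::cout << "const rvr\n"; }

int main() {
    Foo x;
    f(x);            // const lvr -- the only overload an lvalue can bind to here
    f(make());       // rvr -- a temp prefers the non-const rvr overload
    f(std::move(x)); // rvr
    // without the q[Foo&&] overload, a temp prefers q[const Foo&&] over q[const Foo&]
}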

 

short-term ROTI ] java IV: tuning imt threading

Analogy — physics vs chemistry in high school — physics problems involve more logic and are more abstract; chemistry is more factual. Similarly, java threading challenges are more complex and abstract, while java tuning knowledge-pearls are relatively easy to memorize and show off.

“Short-term” here means 3-5Y horizon.

I’m many times more successful with java threading– accu, deepening knowledge; lead over the pack.

I hope to build some minor zbs in JVM tuning, mostly GC tuning. However, many important parts of JVM tuning skills suffer from churn and poor shelf life.

mktData direct^Reuter feeds: realistic ibank set-up

NYSE – an ibank would use a direct feed to minimize latency.

For some newer exchanges – an ibank would get Reuters feed first, then over a few years replace it with direct feed.

A company often gets duplicate data feed for the most important feeds like NYSE and Nasdaq. The rationale is discussed in [[all about hft]]

For some illiquid FX pairs, they rely on calculated rates from Reuters. BBG also calculates a numerically very similar rate, but BBG is more expensive. BBG also prohibits real-time data redistribution within the ibank.

mktData parser: multi-threaded]ST mode #Balaji#mem

Parsers at RTS: a high-volume c++ system I worked on has only one thread. I recall some colleague (Shubin) saying the underlying framework is designed in single-threaded mode, so if a process has two threads they must share no mutable data. In such a context, I wonder what’s better

  A. EACH application having only one thread
  B. Each application having 2 independent threads sharing nothing

I feel (A) is simple to set up. I feel if the executable + reference data footprint is large like 100MB, then (B) would save memory since the 100MB can be shared between threads.

–Answer from a c++ trading veteran
Two independent instances is much simpler time/effort wise. Introducing multiple threads in a single threaded application is not trivial. But I do not see any value if you create multiple threads which do not share any (mutable) data.

–My response:
Executable and reference data take up memory. 1) I feel executable can take up 10MB+, sometimes 100MB. 2) If there’s a lot of immutable reference data loaded from some data store, they can also add megabytes to process footprint. Suppose their combined footprint is 100MB, then two instances would each need 100MB, but the multi-threaded design would need only 100MB in a single instance hosting multiple threads.

My RTS team manager required every market data parser process to reduce footprint below 60MB or explain why. My application there used 200MB+ and I had to find out why.

Apparently 100MB difference is non-trivial in that context.

3 real overheads@vptr

Suppose your class Trade has virtual functions and a comparable class Order has no virtual functions. What are the specific runtime overheads of the vptr/vtable usage?

  1. cpu cache efficiency — memory footprint of the vptr in each object. Java affected! If you have a lot of Trade objects with only one char data field, then the vptr greatly expands footprint and you overuse cache lines.
    • [[ARM]] singles out this factor as a justification for -fno-rtti… see RTTI compiler-option enabled by default
    • [[moreEffC++]] P116 singles out vptr footprint as the biggest performance penalty of vptr
  2. runtime indirection — the non-virtual Order call is “a few memory references more efficient” [1]
  3. inlining inhibition is the most significant overhead. P209 [[ARM]] says inline virtual functions make perfect sense so it is best to bypass vptr and directly call the virtual function, if possible.

[1] P209 [[ARM]] wording

Note a virtual function unconditionally introduces the first overhead, but the #2/#3 overheads can sometimes be avoided by a smart compiler.
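To see overhead #1 concretely, a sizeof comparison (illustrative numbers for a typical 64-bit system):

#include <iostream>

struct Order { char c; };                      // no virtual functions
struct Trade { char c; virtual ~Trade() {} };  // vptr added by compiler

int main() {
    std::cout << sizeof(Order) << '\n'; // typically 1
    std::cout << sizeof(Trade) << '\n'; // typically 16: 8-byte vptr + alignment padding
}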

 

consolidate into single-process: low-latency OMS

Example #2– In a traditional sell-side OMS, a client FIX order propagates through at least 3 machines in a chain —

  1. client order gateway
  2. main OMS engine
  3. exchange gateway such as smart order router or benchmark execution engine supporting VWAP etc

The faster version consolidates all-of-the-above into a single process, cutting latency from 2ms to 150 micros .. see my blogpost latency in eq OMS

Example #1– In 2009 I read about or heard from interviewers about single-JVM designs to replace multi-stage architecture.

Q: why is this technique not used on west coast or main street ?
%%A: I feel on west coast throughput outweighs latency. So scale-out is the hot favorite. Single-JVM is scale-up.

How RTS achieved 400-700 KMPS #p2p

Official benchmark result – 700 KMPS per instance of rebus or parser. Here are some enabling features:

  • [! iii] mostly single-threaded apps. Some older parsers are multi-threaded, but each thread operates in single-threaded mode. No shared mutable:)
    • In the order book replication engine, we use lockfree in one place only
  • ! (ST mode loves) Stateless [1] parser, small footprint. Only downstream component (Rebus) handles cancels/busts, modifications.. So we can run many parser instances per machine, or many threads per instance, fully utilizing the cpu cores. See https://haydenjames.io/linux-performance-almost-always-add-swap-space/
  • [! ii] allocation avoided — on millions of Output message objects. Pre-allocated ring buffer eliminates new(). very few and small STL containers .. mostly arrays… pre-allocate DTOs@SOD #HFT
  • ! allocation avoided — on millions of Input message objects, thanks to reinterpret_cast() on pointers… nearly Zero-copy. See reinterpret_cast^memcpy in raw mktData parsing
  • ! allocation avoided — custom containers to replace STL containers used, since they all allocate from heap
  • ! p2p messaging beats MOM 
  • ! Socket buffer tuning — to cope with busts. 64-256MB receive buffer in kernel; 4MB UserModeBuffer
  • low memory footprint. Most parsers use around 60MB. (My parser was 100 to 150MB.) I think there are many benefits in terms of latency.
  • epoll — to replace select(), with 2 orders of magnitude improvement in throughput
  • buffering of Output list (of messages). I guess this hurts latency but enhances throughput
  • Very fast duplicate seq check, without large history data — a hot function
  • large containers like the ring buffer are pre-sized. No reallocation.
  • mostly array-based data structures — cache-friendly
  • Hot functions use pbref or RVO or move semantics, minimizing stack allocations
  • aggressively simplified parsing. Minimal logic
  • Special Logger to reduce I/O cost
  • Sharding to speed up symbol lookup
  • kernel bypass : RTS
  • No virtual functions — enhancing data cache and inlining .. 3 overheads@vptr #ARM
  • Inline
  • —-Speed with control
  • Best of both to reduce the real risk of our connection becoming slow or lossy
  • Depth monitor to detect missed messages
  • Capacity tracker to monitor mps rates
  • Missed seq reporting
  • [!=one of the really effective techniques in RTS]
  • [i=presented in multiple (!!, !!!) interviews]

[1] the parser does remember sequence numbers. In fact sequence number management is crucial. Before retrans was added to the parser, we never sent any retrans request. We did keep track of gaps. Mostly we used the gap-tracker to check duplicates across Line A/B. We also relied on our sequence number state to handle sequence resets.

enough bandwidth2help kids study@@ : 3 past jobs

Which past jobs would allow me to spend more time helping kids with studies?

  1. #1 Citi — I can finish off personal work in the office
  2. OC
  3. RTS? yes during the quiet weeks. More importantly, I was familiar with the system, possibly more so than my coworkers, so with 70% of effort I could meet the requirements, so I can spend more time helping kids

How about Macq? Hours are short but I often decide to work on weekends. I dare not leave earlier because I feel guilty that I haven’t made enough progress on my projects.

I feel the stress of helping kids with studies would /resonate/ (“conflict” is the wrong word) with the stress of stigma/PIP, and create a highly amplified response, bigger than each stressor alone could produce. I would feel like a candle burning from both ends .. insufficient energy, and I would be forced to consider tough sacrifices.

I think Mvea team would be much better as Josh is very tolerant and forgiving.

Commute — how important is it? I now feel commute is insignificant compared to coworker benchmarking or PIP

mutex ] rebus: 3designs #RCU #CSY #80%

Requirement: Each worker thread operates independently on its own symbols, never interfering. Admin thread may read/write all symbols.

My friend CSY’s design resembles ConcurrentHashMap — split a big “graph-container” into 32 independent sub-graphs, to be locked individually.

For my 1st design, I briefly considered one lock per symbol. I think 100,000 locks is fine if symbols are consecutive integers. I simply use a symbol as array index to locate the corresponding lock. Every worker thread and the admin thread need to acquire the lock for symbol X before updating the AVL tree of X.

— Here’s my 2nd design, based on a single global lock:

  • Regular writer threads only check, never acquire, the lock. If the lock is in-use (i.e. taken by the admin thread) then they wait in a 1-second-sleep loop.
  • the admin thread immediately and unconditionally takes the lock and waits for a grace period[1] before writing. I assume this “write” can take up to 1 minute.
  • how to “check” the lock?
    • TryLock solution [Design#1] — Just trylock and immediately unlock.
    • LockFree solution [Design #2] — an atomic boolean to be set only on the admin thread. I think this is one of the simplest usage scenarios of atomic boolean. We are not even relying on the atomicity. We only need the visibility feature.

[1] This grace period is designed to be sufficient for a batch of updates on the worker thread. If we know a batch can write 100 orders and take X microsecond, then set grace period to X. We require worker routine to recheck the lock after each batch.

My friend CSY raised the concern that mid-stream the update could be suspended by kernel scheduler. I think in this case, a simple integer update could take ages, as in the GC stop-the-world scenario. So the design requires some realtime kernel support to guarantee timely response.

read-copy-update lockfree +! retry #RCU mentions a grace period used in RCU api

— Here’s a bigger lockfree design [Design #3]

  • without realtime kernel
  • Two global variables — in addition to the atomic boolean flag, I will add an atomic int “activeWorkerCnt”
  • Two non-reentrant routines (to avoid incremental acquisition)
Worker thread routine:
1. check the flag until it’s false
2. increment activeWorkerCnt
3. update data store
4. decrement activeWorkerCnt

Admin thread routine:
1. unconditionally set the flag # rejecting all new workers
2. while activeWorkerCnt > 0, sleep 1 millisecond # draining the active worker pool
3. update data store
4. unset the flag

close(): with timeout or non-blocking #CSY

I was right that close() can be non-blocking even on a TCP socket. It is the default.

When you specify SO_LINGER, you can also give a timeout to control how long close() blocks on a TCP socket

https://stackoverflow.com/questions/6418277/how-to-close-a-non-blocking-socket

So there are 3 modes for close() on a TCP socket:

  • close() blocking with timeout via SO_LINGER
  • close() blocking without timeout
  • close() nonblocking
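A minimal sketch of the first mode (assuming sockfd is a connected TCP socket):

#include <sys/socket.h>

void closeWithTimeout(int sockfd) {
    struct linger lg;
    lg.l_onoff  = 1; // enable lingering close
    lg.l_linger = 5; // close() may block up to 5 seconds flushing unsent data
    setsockopt(sockfd, SOL_SOCKET, SO_LINGER, &lg, sizeof(lg));
    // close(sockfd) would now block up to 5 seconds
}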

[19]with3sockets: select^non-block recv#CSY

Update — CSY said when you increase from 3 to 99, the difference becomes obvious.

Q (TradeWeb mkt data team): given 3 non-blocking receive-sockets, you can either read them sequentially in a loop or use select(). What’s the difference?

http://compgeom.com/~piyush/teach/4531_06/project/hell.html compares many alternatives. It says Non-block socket read() is least efficient as it wastes CPU.

If you have 3 receive-sockets to monitor, select()/poll() is designed just for your situation.

Using recv() instead, you may try specifying a timeout on each socket like

//set socket option via SO_RCVTIMEO to 5 seconds, on each socket
while(1){
    recv(sock1, buf, sizeof buf, 0); // blocks up to 5 seconds
    recv(sock2, buf, sizeof buf, 0);
    recv(sock3, buf, sizeof buf, 0);
}
Low latency systems can’t use this design because while you wait 5 seconds on socket1, you miss new data on socket2 😦

Therefore, low-latency systems MUST disable SO_RCVTIMEO. In such a design, the loop wastes cpu in a spin-wait.

–another advantage of select(): you can process a high-priority socket earlier when select() says multiple sockets are ready.

I gave this answer on the spot.
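For completeness, a minimal select() sketch (assuming sock1..sock3 are connected descriptors and sock1 is the high-priority one):

#include <sys/select.h>
#include <sys/socket.h>
#include <algorithm>

void pollThree(int sock1, int sock2, int sock3) {
    char buf[4096];
    for (;;) {
        fd_set fds;
        FD_ZERO(&fds);
        FD_SET(sock1, &fds); FD_SET(sock2, &fds); FD_SET(sock3, &fds);
        int maxfd = std::max({sock1, sock2, sock3});
        if (select(maxfd + 1, &fds, nullptr, nullptr, nullptr) <= 0) continue;
        if (FD_ISSET(sock1, &fds)) recv(sock1, buf, sizeof buf, 0); // high-priority first
        if (FD_ISSET(sock2, &fds)) recv(sock2, buf, sizeof buf, 0);
        if (FD_ISSET(sock3, &fds)) recv(sock3, buf, sizeof buf, 0);
    }
}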

recv()/send() with timeout #CSY

I was right — timeout is supported not only in select()/poll(), but also in read/recv/send..

SO_RCVTIMEO and SO_SNDTIMEO

Timeouts only have effect for system calls that perform socket I/O (e.g., read(2), recvmsg(2), send(2), sendmsg(2)); timeouts have no effect for select(2), poll(2), epoll_wait(2), and so on. These latter functions only check socket status… no I/O.

I believe this socket option has no effect on non-blocking sockets. A nonblocking socket can be created via either

  • O_NONBLOCK in fcntl() or
  • SOCK_NONBLOCK in socket()

nonVirtual1() calling this->virt2() #templMethod

http://www.cs.technion.ac.il/users/yechiel/c++-faq/calling-virtuals-from-base.html has a simple sample code. Simple idea but there are complexities:

  • the given print() should never be called inside the base class ctor/dtor. In general, virt2(), like any virtual function, behaves non-virtually in ctor/dtor.
  • superclass now depends on subclass. The FAQ author basically says this dependency is by-design. I believe this is template-method pattern.
  • pure-virtual is probably required here.
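A minimal sketch of the pattern (hypothetical names):

#include <iostream>

class Base {
public:
    void nonVirtual1() { virt2(); } // fixed skeleton calls the customizable step
    virtual ~Base() = default;
private:
    virtual void virt2() = 0;       // pure-virtual: subclass must supply the step
};

class Derived : public Base {
    void virt2() override { std::cout << "Derived::virt2\n"; }
};

int main() {
    Derived d;
    d.nonVirtual1(); // prints Derived::virt2
}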

strided list-traversal: excel-formula/python/boost #JackZ

=OFFSET(
  $A$1,
  (COLUMN()-COLUMN($C$2))*3,
  (COLUMN()-COLUMN($C$2))
)

The (COLUMN()-COLUMN($C$2)) part (shown in green in the original) generates a natural number sequence 0, 1, 2 … as the formula is filled across columns.

%%engaging[def2]deep dive shines through QQ interviews

Opening example — I had 3M of fixation on tibrv^JMS. Just look at my blogposts. I go beyond interview topics. This fixation hopefully shines through on interviews.

This post is more about “engaging” than focused effort — I couldn’t really force myself to focus on speed coding or c++ TMP, or high-churning javascript.

Ken Li (Citi-muni) was probably a sharp observer of my personality and once said “If you are interested in software development then you will have a long career till retirement … no need to worry. A lot of developers lack that interest.”

Very few candidates had more than 3M of personal spare time engaged in any tech topic .. something like a deep dive. (Even fewer can sustain 12M of focus.) Consider XR’s interest in machine learning. Consider grandpa.

Apparently, many interviewers are personally interested in these same details (the whys/hows/whats) so they seek out like-minded enthusiasts. This is one hypothetical explanation why those crazy theoretical QQ questions come up so often.

Other examples of personal fixation that (hopefully) shine through

  • sockets — blocking operations …; unplugged
  • clustered index
  • timer hardware interrupts
  • userland vs kernel mode
  • nested classes across languages
  • compiler rules on overriding

Now, many of the weeks spent have low ROTI, but I was not fixated on ROTI since my earliest engaging studies in 1998 — first deep dive into unix + C programming + perl. As it turned out

  • unix and linux became dominant on server side and enhanced my career
  • perl became my entry point into a programming career
  • C domain was hopeless, but with c++ I was able to capitalize on it

Still, many of my engaging studies had low ROTI, but what if I hadn’t followed my interest? I think I would end up wasting my spare energy!

Striking balance — too much “engaging” –> regretful ROTI; too little “engaging” –> bored and unfulfilled.

decline@quant-dev domain: low value-add

Q: why did drv pricing quant jobs use to be more abundant and pay a premium, even though the economic value-add of the quant skill has always been questionable?
%%A: because the traders made money for a few years and subsequently received a big budget. Now budget is lower.

Same can happen to quant funds in NY, Chicago, HK…

Some UChicago lecturer once commented that everyone should get some quant training .. (almost as every high school student does math). But I think it is not necessary….

I am more theoretical and was naturally attracted to this domain but the value-add reality soon prevailed in my case.

Tech shops love algo challenges and speed-coding, which have dubious value-add similar to the quant skills. However, the hot money in the tech sector is bigger and lasts longer.

short coding IV: ibank^HFT^inetShop

See also prevalence@speed coding among coding-IV

  • “InetShop” includes all west-coast type of hiring teams including China shops
  • Ibanks are mostly interested in language knowledge, so they use coding test (including multithreaded) for that purpose.  Ibanks also use simpler coding questions for basic screening.
  • HFT shops (and bbg) have an unpredictable mix of focuses

I predict that Wall St will not adopt the west coast practice due to its poor test-coverage — not testing language knowledge, latency engineering, threading etc.

The table below excludes take-home tests and includes hackerrank, white-board, webex, and by-phone algo questions.

                                           | ibanks                          | web2.0 shops           | HFT
language knowledge; latency NOT bigO       | #1 focus                        | never                  | key focus, not #1
simple-medium speed coding                 | rare                            | #1 focus. High bar     | common
medium-simple algo without implementation
  (logistically easiest => popular)        | common #SCB. minimum bar is low | common                 | key focus, not #1
tough algo                                 | never                           | sometimes              | rare
big-O of your solution NOT latency         | LGlp. low bar                   | #2 focus. high bar     | LGlp
concurrency coding test                    | G5 focus                        | never #python/ruby/js  | sometimes

 

## marketable syntax nlg: c++ > j/c#

Every language has poorly understood syntax rules, but only in c++ did these become fashionable — even halos in job interviews!

  • ADL
  • CRTP
  • SFINAE
  • double pointers
  • hacks involving void pointers
  • operator overloading to make smart ptr look like original pointers
  • TMP hacks using typedef
  • TMP hacks using non-type template param
  • universal reference vs rvr
  • rval: naturally occurring vs moved
    • const ref to extend lifetime of a naturally occurring rval object
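A minimal sketch of the last bullet:

#include <string>

std::string make() { return "temp"; }

int main() {
    const std::string& r = make(); // the temp's lifetime is extended to r's scope
    return static_cast<int>(r.size());
}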

techShop-plan: rare opp2try tech start-ups

The Sg2019 presents a rare opportunity to work for such a firm and gain some insight how they select candidates.

Salary might be slightly lower, like 150k vs 180k@SCB, but I think it’s completely fine, until we (irrationally) compare to some peers.

The inherent risks of stigma (damagedGood, PIP..) are lower since I’m older and newer, bringing down expectations. I would feel like Michael Jordan joining baseball.

denigrate%%intellectual strength #ChengShi

I have a real self-esteem problem as I tend to belittle my theoretical and low-level technical strength. CHENG, Shi was the first to point out “你就是比别人强” (you are simply stronger than others).

  • eg: my grasp of middle-school physics was #1 strongest across my entire school (a top Beijing middle school) but I often told myself that math was more valuable and more important
  • eg: my core-java and c++ knowledge (QQ++) is stronger than most candidates (largely due to absorbency++) but I often say that project GTD is more relevant. Actually, to a technical expert, knowledge is more important than GTD.
  • eg: I gave my dad an illustration — medical professor vs GP. The Professor has more knowledge but GP is more productive at treating “common” cases. Who is a more trusted expert?
  • How about pure algo? I’m rated “A-” stronger than most, but pure algo has far lower practical value than low-level or theoretical knowledge. Well, this skill is highly sought-after by many world-leading employers.
    • Q: Do you dismiss pure algo expertise as worthless?
  • How about quant expertise? Most of the math has limited and questionable practical value, though the quants are smart individuals.

Nowadays I routinely trivialize my academic strength/trec relative to my sister’s professional success. To be fair, I should say my success was more admirable if measured against an objective standard.

Q: do you feel any IQ-measured intelligence is overvalued?

Q: do you feel anything intellectual (including everything theoretical) is overvalued?

Q: do you feel entire engineering education is too theoretical and overvalued? This system has evolved for a century in all successful nations.

The merit-based immigration process focuses on expertise. Teaching positions require expertise. When laymen know you are a professional, they expect you to have expertise. What kind of knowledge? Not GTD but a published body of jargon and “bookish” knowledge based on verifiable facts.

FIX^tibrv: order flow

Background — In a typical sell-side, a buy/sell order goes through multiple nodes of a flow chain before it goes out to the trading venue. The order message is the payload in a messaging platform.

FIX and Tibrv (and derivatives) are dominant and default choices as messaging platforms. Both are highly adaptable to complex combinations of requirements.

                 | FIX                                           | tibRV
transport        | TCP                                           | UDP
fan-out          | unicast                                       | multicast pub/sub
latency          | designed for low-latency eq trading           | popular for bond trading
editable payload | order msg gets modified along the flow chain  | ?

tick`95%mark #Kam #70%

“Ticking 95th percentile server” is the blog title I would use. The original question came on paper/pencil, scanned and sent to my gmail by my friend Deepak CM. I find this problem rather realistic and practical, rather than fake and contrived. I treat it as a System Design + implementation question.

Q: Using only std library, write a c++ program to process a live stream of 128,000,000 (or much more) double numbers, representing temperature readings not necessarily unique. As the temperatures come in, print the current 95th percentile on-demand. I call it the lucky “winner”. We can use the nearest-rank percentile definition.

====Idea 1: the blogpost given unsorted ints, find median in O(N) is for the median but can be tweaked for any percentile; unfortunately it is not “ticking”

====design 2, for static data set
use an “order statistic tree” i.e. a RBTree where each node remembers the size of its subtree. (A leaf node has size 1.)
====design 3, optimized for high volume of updates like 128 million updates, not optimized for frequent query

The entire temperature range is divided into non-overlapping segments, each represented by a segment-head temperature i.e. the lower bound [1b]. Each segment has a range (i.e. distance to next segment head), size (i.e. item count) and density (i.e. size/range ratio). We mostly care about “size” only.

We need a RB-tree (or sorted vector) containing P=1024 [1] nodes, each an unsorted container[3]. The RB-tree serves to maintain the containers i.e segments.

Each incoming temperature is quickly “routed” to the correct container and simply appended therein, increasing its size.
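A minimal sketch of the routing step (hypothetical layout, using a sorted vector of segments):

#include <algorithm>
#include <vector>

struct Segment { double head; std::vector<double> items; }; // kept sorted by head

void route(std::vector<Segment>& segs, double t) {
    // find the first segment whose head exceeds t, then step back one
    auto it = std::upper_bound(segs.begin(), segs.end(), t,
        [](double v, const Segment& s) { return v < s.head; });
    if (it != segs.begin()) --it; // t below the lowest head goes to segment 0
    it->items.push_back(t);       // simple append, increasing the segment size
}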

Upon query request, we will use the latest segment sizes to build a cumulative profile, and run a O(logP) binary search to identify the one segment containing the “winner”. This segment size would hopefully be much smaller than 128,000 [2] and far more /tractable/.

–Within the chosen segment of size S, we can use a vector to sort in O(S logS) the temperatures and identify the winner.  After completing a query, the chosen container will become (partially) sorted, helping subsequent queries if this segment is picked again.

Since we only support 95th percentile, chance is good that this segment will be picked most of the time. If x% of the queries hit this segment, then I will convert this “favorite segment” to a RB-tree.

Alternatively, we can also use the O(S) algorithm in Idea 1, but the container won’t become sorted.

–priming

[2] 128,000 is 1/1024th of the original sample size… not ideal. The segments need to be initialized carefully, during a priming phase, inspired by the JIT compiler. Shall we assume roughly uniform distribution or Gaussian distribution? Assuming we know the total sample size is 128 million, I will use the first 100,000 temperatures to select the 1024 segment heads. The segments are structured not for equal length (in temperature) nor equal size (i.e. element count). In fact the low segments can be very long and very crowded.

Instead, the segment heads are chosen so that between 94th percentile and 96th percentile we have half of all segments. These small segment sizes will be much smaller than 128,000 and quicker to manipulate.

–Foot notes:

Q: what if some of the containers grow too big like three times 128,000,000/1024. The priming/estimate was ineffective.
A: Not a problem unless the winner happens to be in such a container.
A: One idea is to split up such a container when we notice it, and grow a new node on the RB-tree. std::map::insert() can take a hint for the position where new node can be inserted. Note we don’t want to split a node JJ too early since JJ may not grow any further subsequently and may end up no larger than other nodes as other nodes keep growing.

[1] Sizing of P — First we estimate the total sample size. If unknown, then set P:=1024 so all nodes stay in L1-cache (typically 32KB). If we assume 16 bytes/node (8 bytes pointer to container + 8 bytes double), then 32KB can hold 2000 nodes.

If query becomes more frequent, I can increase P by 1024, sacrificing insertion.

[1b] The node values are “lower-bounds” and don’t have to be really the minimum of the original temperatures in the container. We can probably cheat with 4-byte floats, and we get away with 2700 twelve-byte tree nodes.

[3] slist vs vector — vector might be faster due to pre-allocation, provided a node never grows beyond its capacity. Vector has reserve(). (Note: resize() is the wrong choice.)

interrupts=hardware interrupts #by default

In operating system jargon, “interrupts” means hardware interrupts by default. Timer hardware, NIC, keyboard/mouse, and possibly hard disk … can all generate hardware interrupts.

“Interrupt signal” often refers to the electrical signal, never the higher-level SIG* signals in unix (unsupported in Windows AFAIK).

There is also a well-known concept of “software interrupt”. I see them as fake interrupts, simulated in software. My book [[linux kernel]] says software interrupts are employed for 1) system calls 2) debugger breakpoints.

A key difference between software and hardware interrupts — hardware interrupts come from devices with their independent power supply and clock signal, so hardware interrupts can hit the CPU any time, even in the middle of a CPU clock cycle. Software interrupts won’t. Software interrupts are simulated by code running in the same CPU, driven by the same clock signal.

In other words, hardware interrupts are unsynchronized with CPU, whereas software interrupts are synchronized with CPU.

missing q[ return ] in non-void function #Josh

I told Josh about this real Scenario: my non-void function has 3 control paths, one of them without a return statement.

g++ usually gives a warning under “-Wall”. Flowing off the end of a function is equivalent to a return with no value — undefined behavior in a non-void function.

If you print the function return value and the 3rd control path executes, you get undefined behavior.

https://stackoverflow.com/questions/3402178/omitting-return-statement-in-c
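A minimal sketch of the scenario:

// 3 control paths; the x == 0 path flows off the end
int sign(int x) {
    if (x > 0) return 1;
    if (x < 0) return -1;
}   // g++ -Wall warns: control reaches end of non-void function
// calling sign(0) and using the result is undefined behavior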

hardware mutex, based@XCHG instruction

[[OS by William Stallings]] P214 is concise and clear.

The atomic XCHG instruction on Intel processors can swap a local variable (held in cpu register) with a main memory location visible to all threads.

This XCHG instruction alone can implement a simple mutex usable by multiple threads on multiple processors sharing the same main memory.

Another atomic instruction, test-n-set, can also implement a mutex by itself.

Both mutexes have only academic value because their Limitations are severe:

  • even with a single mutex, deadlock is possible at the hardware level — ThrA in critical section can get interrupted and preempted — perfectly normal. Now ThrA has to wait for high-priority ThrH to complete, but ThrH is busy-waiting (wheel-spinning) to enter the same critical section 😦
    • Note in this example there’s no “lock object” per-se, only a critical section in code !
  • by default, unsuccessful threads like ThrH are stuck in busy-wait — wasting CPU and generating heat 😉
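A minimal c++ sketch in the spirit of the XCHG mutex — std::atomic::exchange typically compiles down to such an atomic swap on x86:

#include <atomic>

class SpinLock {
    std::atomic<bool> taken{false};
public:
    void lock()   { while (taken.exchange(true)) { /* busy-wait, wasting CPU */ } }
    void unlock() { taken.store(false); }
};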

 

##FX/eq projects ] PWM, for SG IV

—-fx or eq
* Swing app to monitor multiple live database tables and auto-refresh upon new quotes/trades and position updates.
* Post-trade Affirmation for complex derivative deals. A workflow application. Operations verify deal attributes by phone and Affirm on the deal to send out trade confirm.
* Statistical feed “blender” (with multiple purposes), needed to remove outliers among diverse indicative eFX forward quotes. Outliers tend to mess up our auto pricers, tiered pricers and cross rate pricers.
* (Ledger balance?) PnL report. Nightly batch to apply marks on option positions and compute PnL. Reports both clients and firm accounts.
* Volatility surface fitter for FX (& equity index/single-names). Volatility smile curve fitter. Quote inversion. Option position mark-to-market, driven by live quotes from multiple live eFX markets.
* Workflow engine for Error trade reconciliation/chargeback. A single error ticket can compute PnL and chargeback across multiple (up to 100) error trades and cover trades. Every quarter, this engine reconciles hundreds of error trades whose total values add up to millions of dollars.

* Deal entry screen for trade booking after a retail dealer executes over phone. If trade validation fails, we notify the dealer to amend and resubmit. Same swing app is also used by spot/swap/options dealers to book voice trades.
—-eq
system? data normalization to take care of splits and dividends
system? basic back testing of simple strategies
system? re-sampling

* (EOS) web-based trade/order entry, with multiple validations. Used primarily by investment advisers across Americas, Europe and Asia. Thousands (up to 5-figure) of orders received daily.
** showing real time bid/ask from GS + bid/ask from trading venues
** supports most order types — limit/market/FOK/IOC/SL

* swing-based real time order book display. Updates to displayed order books come from the backend OMS via real time messaging.
* Also contributed to the OMS backend. Orders originate exclusively from private bank clients. We then monitor their cancels/amends and execution reports (via firm-wide messaging hub) from exchanges, ECN and GS private liquidity pool. Total message volume up to 6 figures.

—-fx
We are the interface to multiple ECNs to 1) receive quotes, enrich and publish to PWM clients 2) forward client orders to ECN in 2-leg spread trades 3) execute trades received from ECN 4) generate and validate (against tri-arb) cross rates using ECN quotes. Also contributed to auto quoters and RFQ engine. Supervised and monitored the progress of high-frequency eFX applications. Performed eFX development activities such as requirement gathering, design, testing and deployment. Personal contributions included
* Tiered quote pricer. Each FX customer falls into one of the tiers. When a currency pair updates in our pricer, all registered clients would get a new quote (??delivered to their Financial Advisors??). Silver tier clients receive a better bid/ask spread than regular clients; Gold tier gets the best quote.
* eFX rate update/distribution engine to downstream real time risk, PnL, position marking systems
* eFX option real time risk report (unfinished). Option position risk calc is rather slow, so each user selects a limited number of positions into his watch-list. Watched positions get periodically updated based on live ECN rates to show real-time risk.
* (questionable project) Compute cross rates for PWM trades that are large, recurring and involving illiquid currencies. Cross rates computation from daily close USD buy/sell rates.
————
— blender
http://www.olsendata.com/fileadmin/Publications/Tech_Papers/FXBlenderDoc_01.pdf (C:\0x\xz_ref)
outliers are particularly damaging to market making systems.
input – transaction prices and indicative quotes
output – only indicative quotes

— cross rate calculator

What we compute are backdated cross rates. A PWM (non-trading) client agrees on a currency conversion AAA/BBB. She agrees on the AAA amount, and waits for a few hours/days to get the actual BBB amount, just like a cross-currency credit card payment —

MasterCard exchange rates are based on multiple market sources (such as Bloomberg, Reuters, Central Banks, and others). These rates are collected during our daily rate setting process. The exchange rates displayed on the web site are derived from the buy and sell rates included in the MasterCard daily T057 Currency Conversion Rate File.

MasterCard applies the USD as unique reconciliation currency to manage all currency conversions globally. Due to possible rounding differences, the published calculated cross-rates may not precisely reflect the actual rate applied to the transaction amount when converting to the cardholder billing amount. The calculated cross-rates will be loaded onto the MasterCard web site on a daily basis.

MasterCard applies the exchange rate to transactions at the time of settlement, not at the point of authorization of the sale.

–ECN interface
(my code doesn’t use FIX. Our volume is lower. conversion from FIX to xml is encapsulated in a library)

FXall, Currenex, Hotspot, TradeWeb, Bloomberg. Example applications:
* auto quoters
* price feeds
* market data feeds
* post trade feeds
* execution gateways

returning^throwing local object

Tag line – always catch by reference. [[moreEffC++]] has a chapter with this title.

  • If you return a local object by reference, it leads to a runtime error as the object would be wiped out from the stack. Compiler is likely to give a warning.
  • If you throw a local object and catch by reference up the call stack, it is actually considered best practice, because compiler always clones the local object and throws the clone.
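A minimal sketch of the second bullet:

#include <iostream>
#include <stdexcept>

void risky() {
    std::runtime_error ex("boom"); // local object
    throw ex;                      // compiler copies it out of the dying stack frame
}

int main() {
    try { risky(); }
    catch (const std::exception& e) { // catch by reference: no slicing, no extra copy
        std::cout << e.what() << '\n';
    }
}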

avoid lopsided BST

Self-balanced is not a theoretical nicety but an essential BST feature. Without it a BST could easily become lopsided and all operations will slow down.

If for any reason (I don’t know any) we can’t use an AVL or RBT, then we could randomize the input, insert the items (without deletion) and get a fairly balanced BST.

intervalTree: classic RBTree to find overlapp`event

Notable detail — non-balancing BST won’t give logN performance, since the tree depth may degrade

Q1: given N event objects {start, end, id}, use a RBTree to support a O(logN) query(interval INT) that returns any one event that overlaps with INT, or NULL if none.

If the end time == start time in the input INT, then it becomes the focus today:

Q2: given N event objects {start, end, id}, use a RBTree to support a O(logN) query(timestamp ii) that returns any one event that’s ongoing at time ii, or NULL if none.

P312 [[intro to algorithms]] describes an elegant solution.

The text didn’t highlight — I think the end time of the input interval INT is … ignored. (Therefore, in my Q2, input ii is a single timestamp, not an event.) Precisely because of this conscious decision, the tree is sorted by event start time, and the additional payload (in each node N) is the subtree end time i.e. last end time of all events started before N (including N itself). By construction, N’s payload is equal or higher than N’s start time. (Note Our input ii can fall into gaps between the event intervals.)

Eg: suppose N starts at 2:22 and its payload says 3:33; then we know for sure there is at least one ongoing event during the interval 2:22 to 3:33. Not sure if this is useful insight.

The text illustrates and proves why this design enables a clean and simple binary search, i.e. after each decision, we can safely discard one of “my” two subtrees. Here’s my (possibly imperfect) recall:

Let’s suppose each time the current node/event is not “ongoing”. (If it is then we can return it 🙂 So here’s the gist of the algorithm:

Suppose i’m the “current node”, which is root node initially. Compare ii to my left child L’s payload.

As defined, this payload is the highest end time of L + all its sub nodes. In other words, this payload is the highest end time of all events starting before me.

Note the algo won’t compare L’s start time with ii
Note the algo won’t compare my start time with ii, though the scenario analysis below considers it.

  • case 1: If payload is lower than ii, then we know all events on my left have ended before ii, so we can discard the left subtree completely. Only way to go is down the right subtree.
  • the other case (case 0): if payload is higher or equal to ii, then we know my left subtree contains some events (known as “candd” aka candidates) that will end after ii.
  • case 0a: suppose my start time is before ii, then by BST definition every candidate has started before ii. We can pick any candidate to return.
  • case 0b: suppose my start time is after ii, then by BST definition, all right subtree events have not started. Right subtree can be discarded
  • In both cases 0a and 0b, we can discard right subtree.

 

EnterpriseServiceBus phrasebook #ESB

Mostly based on https://www.it.ucla.edu/news/what-esb

  • enterprise — used in big enterprises, but now out of fashion
  • HTTP/MQ — HTTP is a more popular protocol than MQ
  • MOM — I think this middleware is a separate process. https://en.wikipedia.org/wiki/Enterprise_service_bus#ESB_as_software says MOM vendors call their products ESB
  • async — no synchronous http call between client and server
  • latency — perhaps not popular for real time trading, which prefers FIX
  • SOA — ESB jargon was created along with SOA
  • jxee — you may need to know this jargon in the jxee job market.
  • churn 😦

c++ecosystem[def]questions are tough #DeepakCM

C++ interviewers may demand c++ecosystem knowledge, but java also has its own ecosystem like add-on packages.

As I told my friend and fellow c++ developer Deepak CM,

  1. c++ecosystem QQ questions can be more obscure and tougher than core c++ questions
    • tool chain — compiler, linker, debugger, preprocessor
    • IPC, socket, pthreads and other C-level system libraries
    • kernel interface — signals, interrupts, timers, device drivers, virtual memory + system programming in general # see the blog category
    • processor cache tuning
    • (at a higher level) boost, design patterns, CORBA, xml
    • cross-language integration with python, R, pyp, Fortran + other languages
  2. java ecosystem QQ questions are easier than core java questions. In other words, toughest java QQ questions are core java.
    • java ecosystem questions often come and go — high churn

Low level topics are tough

  1. c++ ecosystem questions are mostly in C and very low-level
  2. java ecosystem questions are usually high-level
    • JVM internals, GC … are low-level and core java

 

priorityQ: 2 advantages over RBtree#O(1)add #Part2

[1] Contrary to popular belief, RBTree mass insert (including mass-construction) is O(logN) rather than O(1) per node in the general case. However, it is O(1) if input is presorted.

See lecture notes https://courses.cs.washington.edu/courses/cse373/02au/lectures/lecture11l.pdf and SOF post on
https://stackoverflow.com/questions/6147242/heap-vs-binary-search-tree-bst

std::priority_queue phrasebook

Mostly based on [[Josuttis]]

  • #include <queue> // this class template is functionally closer to a regular queue
  • container adapter — based on other containers
    • defaults to vector, but deque fine.
  • random access — base container must provide random access iterator (to support fast sorting)
  • push and pop — both O(logN). https://en.cppreference.com/w/cpp/algorithm/push_heap shows individual insert and delete_max are both logN in worst case
  • make_heap() — linear in worst case, guaranteed:)
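A minimal usage sketch:

#include <functional>
#include <iostream>
#include <queue>
#include <vector>

int main() {
    std::priority_queue<int> maxHeap; // defaults: std::vector + std::less
    std::priority_queue<int, std::vector<int>, std::greater<int>> minHeap;
    for (int v : {3, 1, 4, 1, 5}) maxHeap.push(v); // each push is O(logN)
    std::cout << maxHeap.top() << '\n';            // 5; pop() is also O(logN)
}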

xchg between register/memory triggers fencing

I guess xchg is used in some basic concurrency constructs, but I need to research more.

See my blogpost hardware mutex, based@XCHG instruction

https://c9x.me/x86/html/file_module_x86_id_328.html points out the implicit locking performed by CPU whenever one of the two operands is a memory location. The other operand must be a register.

This implicit locking is quite expensive according to https://stackoverflow.com/questions/50102342/how-does-xchg-work-in-intel-assembly-language, but presumably cheaper than user-level locking.

This implicit locking involves a memory fence.

append() / maxInRangeK() #Rahul

Q: given a sequence of integers, implement two operations

  1. Operation 1: append another integer. Size becomes N
  2. Operation 2: given two arbitrary subscripts into the sequence, (they define a sub-array of size K), find the max

Rahul saw discussions without any optimal solution. I think a simple vector-based algo can achieve amortized O(1) for Append and O(logK) for Max.

AuxDS required.

— idea 1: segment the sequence into fixed segments
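A minimal sketch of one possible reading of idea 1 — fixed-size segments, each caching its max. Append is O(1); maxInRange scans whole segments via the cached maxima, which gives O(K/SEG + SEG) per query, not yet the O(logK) target:

#include <algorithm>
#include <climits>
#include <vector>

struct Seq {
    static const size_t SEG = 512;
    std::vector<int> a;      // the sequence
    std::vector<int> segMax; // cached max of each fixed segment

    void append(int v) {
        if (a.size() % SEG == 0) segMax.push_back(INT_MIN); // start a new segment
        a.push_back(v);
        segMax.back() = std::max(segMax.back(), v);
    }
    int maxInRange(size_t lo, size_t hi) { // half-open [lo, hi)
        int best = INT_MIN;
        while (lo < hi) {
            if (lo % SEG == 0 && lo + SEG <= hi) { // whole segment inside the range
                best = std::max(best, segMax[lo / SEG]);
                lo += SEG;
            } else {
                best = std::max(best, a[lo++]);
            }
        }
        return best;
    }
};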

 

competitiveness: Pre-teen sprinter + HFT selectivity

I was a decent sprinter at age 6 through 13, then I realized I was not good enough to compete at Beijing municipal level.

There are quite a number of HFT shops in Singapore. I think they have a higher bar than ibanks. At ibank interviews I felt again like that pre-teen sprinter, but HFT interviews are generally too hard for me.

op=(): java cleaner than c++ #TowerResearch

A Tower Research interviewer asked me to elaborate why I claimed java is a very clean language compared to c++ (and c#). I said “clean” means consistency and fewer special rules, such that regular programmers can reason about the language standard.

I also said python is another clean language, but it’s not a compiled language so I won’t compare it to java.

I gave the example of q[=]. In java, this is either content update at a given address (for primitive data types) or pointer reseat (for reference types). No ifs or buts.

In c++ q[=] can invoke the copy ctor, move ctor, copy assignment, move assignment, conversion ctor, conversion operator. In addition to the standard meanings,

  • Its meaning is somewhat special for a reference variable at site of initialization vs update.
  • For (dereferenced) pointer variables, there are additional subtleties.
  • You can even put a function call on the LHS
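A minimal sketch of the many faces of q[=] in c++:

struct S {
    S() = default;
    S(const char*) {} // conversion ctor
};

int main() {
    S a;
    S b = a;      // q[=] here invokes the copy ctor, not assignment
    S c = "hi";   // q[=] here invokes the conversion ctor
    b = a;        // copy assignment
    b = S{};      // move assignment
    int x = 1;
    int& r = x;   // q[=] at initialization: reference binding, no copy
    r = 3;        // q[=] later: content update at x's address
    int* p = &x;
    *p = 5;       // q[=] through a dereferenced pointer
    (void)c;
}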

represent graph: matrix^edgeList #std::forward_list

Josh’s book [[data structures and other objects using c++]] has 5 pages devoted to three best-practice data structures representing a generic directed graph. (I think my [[discrete math]] also covers the same topic.)

Note all these data structures can serialize any graph.

  1. boolean adjacency matrix — adj mat
  2. edge set — can’t include two identical edges between nodes A and B (rare) but can support complex graph algos having risk of traversing the same edge repeatedly. The set can be a field of a graph node, or a hash table {node-> edgeSet}
  3. edge list — can be implemented using std::forward_list, marginally faster than set… not discussed here

pros and cons:

  • iterating all edges connected to a node — edge set wins
  • sparse graph — adj mat wastes memory and is not scalable.
  • test linkage between 2 nodes — adj mat wins
  • ease of implementation — adj mat wins

Fundamental classification:

  • In a so-called dense graph, number of edges ~ O(V²) i.e. between any pair of nodes there’s likely an edge
  • In a so-called sparse graph, number of edges ~ O(V).

[[algo in a nutshell]] written by some professors, compares matrix and edge list. For a dense graph, P140 describes an intriguing edge list practical implementation to save space and speed up isConnected(node1, node2). Key idea —

   “Number all individual nodes using auto-incrementing ID numbers.”

After this, we will see that a node’s edge list may look like 2,3,4,5, 8,9 … which can be summarized as 2-5, 8-9. I felt this was a trivial trick but it is actually a standard implementation. There’s a tradeoff though — inserting an edge is now slower, but I feel the O() is unchanged.

 

ABA problem ] lockfree designs #DeepakCM

  • breaks basic CAS assumptions
  • most common solution is [1]
  • affects c++ CAS only, probably not java CAS

[1] https://lumian2015.github.io/lockFreeProgramming/aba-problem.html and http://moodycamel.com/blog/2014/solving-the-aba-problem-for-lock-free-free-lists explain the “tagged state reference”  aka “tagged pointer“. I feel I don’t need to understand it. Even if I spent some time getting a grasp it may not be enough to impress an interviewer.

http://www.modernescpp.com/index.php/aba-a-is-not-the-same-as-a has more details, including

  • compare_exchange_strong
  • lockfree stack hitting UndefinedBehavior due to ABA
  • RCU

This topic was asked briefly in Deepak’s MS c++ interview in 2019. I think it’s too arcane and low-level for most interviewers.

 

compile-time ^ run-time linking

https://en.wikipedia.org/wiki/Dynamic_linker describes the “magic” of linking *.so files with some a.out at runtime. This is more unknown and “magical” than compile-time linking.

“Linking is often referred to as a process that is performed when the executable is compiled, while a dynamic linker is a special part of an operating system that loads external shared libraries into a running process”

I now think when the technical literature mentions linking or linker I need to ask “early linker or late linker?”

c++changed more than coreJava: QQ perspective

Recap — A QQ topic is defined as a “hard interview topic that’s never needed in projects”.

Background — I used to feel as new versions of an old language get adopted, the QQ interview topics don’t change much. I can see that in java7, c#, perl6, python3.

To my surprise, compared to java7/8, c++0x has more disruptive impact on QQ questions. Why? Here are my guesses:

  • Reason: low-level —- c++ is more low-level than java at least in terms of interview topics. Both java8 and c++0x introduced many low-level changes, but the java interviewers don’t care that much.
  • Reason: performance —- c++0x changes have performance impact esp. latency impact, which is the hot focus of my target c++ employers. In contrast, java8 doesn’t have much performance impact, and java employers are less latency-sensitive.
  • Reason: template  —- c++0x feature set has a disproportionate amount of TMP features which are very hard. No such “big rock” in java.
    • move/forward, enable_if, type traits

Q: if that’s the case, for my career longevity, is c++ a better domain than java?
A: I’m still biased in favor of low-level languages

Q: is that a form of technology churn?
A: yes since the c++11 QQ topics are likely to show up less over the years, replaced by newer features.

UseSpinning jvm flag #pthr/kernel/c#

Hi XR,

Around 2011 I once said a java thread waiting for a lock would not keep grabbing the lock…. Wrong! Actually it does keep grabbing for a while. It runs some useless CPU instructions and checks the lock periodically. This is really spin locking inside the JVM.

Spin locking wastes CPU but frequently results in faster application performance (reducing context switches). Therefore, the JVM uses some policy to decide how long a thread would spin before placing the thread in some kind of “parking lot” (my terminology) so the thread doesn’t occupy CPU any more. This spin phase used to be configurable via the boolean UseSpinning JVM flag, but in java7 and beyond, this flag has no effect — the JVM always has freedom to use spinning. This is described in [[javaPerf2014]]

Comparison 1 — Dotnet SpinWait() provides exactly the same spinlock feature, as described in https://docs.microsoft.com/en-us/dotnet/standard/threading/spinwait

Comparison 2 — In linux Kernel (not userland), 1) spinlock and 2) counting semaphore are the two locking primitives, as described in Page 164 [[understandingLinuxKernel]]. Both locks block a “grabbing” thread getting into a critical section until another thread leaves it. Similar to my 2011 description, the kernel semaphore remembers all sleeping threads and wakes one of them when the critical section is clear. I think this kernel semaphore is used by the scheduler in the kernel.

Comparison 3 — In linux pthreads (the basis of linux JVM), the locking constructs are implemented using system calls and userland C functions. Pthreads as a userland library probably can’t use the above two kernel constructs directly. Userland C function can implement spinlock, but to implement “parking” I assume (kernel support and) system calls are required. So far I assume the pthreads thread maps to native threads controlled by the kernel. This mapping is now the dominant usage.

The alternative mapping model is green threads, where userland threads are invisible to kernel. Parking has to be implemented in userland.  Nowadays it’s rare to see green threads.

double-ptr is rare@@

  • c# ref-param (on a class type, not a Value data type) is really a double ptr. No other ways to achieve this purpose.
  • java offers absolutely no support for double-ptr
  • python? not sure. Not common

In conclusion, I now feel

  1. the need for double-ptr is prevalent across languages. One of the most common needs is — main() passes into my method a String or List object, and I want to reseat that pointer so that main() will operate on the new String. See the sketch after this list.
  2. Despite these common needs, java and probably many new languages were designed from ground up to live without double pointers. It’s a conscious decision, an active refusal. I have not, but you may need to find workarounds.
  3. Workarounds are plentiful and never too ugly, too convoluted or too complex
  4. In fact, many would consider (practically) all double-pointers as ugly and dangerous for the average programmers. If you are more than an average programmer and want to use double-pointers, then those workarounds are rather easy for you.
  5. Conclusion — in this new era, double-pointers are rarely needed.
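A minimal c++ sketch of need #1 (hypothetical names):

#include <string>

std::string replacement("new content");

// reseating the caller's pointer requires a double pointer (or a reference to pointer)
void reseat(std::string** pp) {
    *pp = &replacement;
}

int main() {
    std::string s("old");
    std::string* p = &s;
    reseat(&p); // p now points to replacement; s is untouched
}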

Can a .so file get loaded 5min after process starts@@

Q: Can some shared library abc.so file get loaded 5 min after my process pid123 starts?

https://stackoverflow.com/questions/7767325/replacing-shared-object-so-file-while-main-program-is-running says NO: the abc.so file has to be loaded into pid123’s memory (then dynamically linked into the executable) before main() is called.

Among the 3 major mechanisms, namely 1) static linker 2) dynamic linker 3) dlopen, only dlopen can achieve this purpose, but I’m unfamiliar with dlopen.

If pid123 reads a config file to decide whether to load abc.so, then dlopen is the only solution. I saw such an industrial-strength implementation in 2019.
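A minimal sketch of the dlopen mechanism (“abc.so” and the symbol name “calc” are made-up names; link with -ldl on linux):

```cpp
#include <dlfcn.h>  // dlopen/dlsym/dlclose
#include <cstdio>

// Sketch: loading a shared library long after the process starts.
int main() {
  // ... pid123 could run for 5 minutes, read a config file, etc ...
  void* handle = dlopen("./abc.so", RTLD_NOW);
  if (!handle) {
    std::fprintf(stderr, "dlopen failed: %s\n", dlerror());
    return 1;
  }
  typedef int (*calc_fn)(int);  // expected signature of the symbol
  calc_fn calc = reinterpret_cast<calc_fn>(dlsym(handle, "calc"));
  if (calc) std::printf("calc(42) = %d\n", calc(42));
  dlclose(handle);
}
```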

A remotely related note — the same stackoverflow page also shows that after pid123 starts, you can actually remove (and replace) abc.so on disk without affecting pid123, since the old file content is already loaded into pid123’s memory.

STL containers +! pointer

  1. std::array
  2. std::string SSO i.e. small-string optimization

These containers

  1. heap storage is not required
    • trivial special case — if this container is a nonref field of a Shape object, then q(new Shape) would still allocate the container on heap.
    • This is similar to “Are java primitives on heap/stack?”
  2. Therefore they contain no pointer.
  3. Therefore the move-ctor runs in linear [1] time, compared to the constant-time move-ctor of other STL containers
    • std::array probably offers a move-ctor and move-assignment

[1] P205 of [[effModernC++]] also addresses this point.

Q: What happens when you move a std::array container with 77 Acct elements?
A: 77 individual Acct objects are moved to the new container, in linear time, invoking the move-ctor of the Acct class, as illustrated below.
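A compilable sketch, using a minimal Acct class of my own:

```cpp
#include <array>
#include <string>
#include <utility>

struct Acct {
  std::string name;  // moving an Acct moves its string member
};

int main() {
  std::array<Acct, 77> a{};
  // std::array holds no pointer to steal, so its move-ctor
  // move-constructs all 77 elements one by one: linear time,
  // unlike std::vector where a move copies 3 pointers (constant time).
  std::array<Acct, 77> b = std::move(a);
  (void)b;
}
```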

##[19]if I were2re-select%%tsn bets #tbudget=30m

##[17] checklist@tsn #engaging provides a checklist, but here’s a FRESH look, which may not be as in-depth, balanced, or comprehensive.

Q: Hypothetically, knowing what I know now, beside c++ and python, which trySomethingNew bets would I have chosen in 2010, after GS?

  1. mkt data and sockets
  2. forex, equities, options, IRS — I like the well-defined body of dnlg as entry-barrier. I learned it fast 🙂
  3. trading engine dev including pricing, OMS, connectivity
  4. risk analytics?
  5. devops?
  6. big data including hadoop, cloud etc?
  7. — not-so-great
  8. c# — heavy investment, a lot of legwork but insufficient ROTI
  9. MOM and Gemfire — shrinking demand
  10. swing? Fun but very poor job market
  11. quantDev? extremely disappointing job market
  12. HFT? entry barrier too high

prevent c++compiler reordering statements

http://www.cplusplus.com/reference/atomic/atomic_signal_fence/ says this low-level c++11 function is something like a

directive to the compiler inhibiting it from making optimizations that involve moving writing operations beyond a releasing fence or read operations before an acquire fence.

I doubt there’s any need for this in typical application code.
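For the record, here is the call syntax in a toy sketch I put together (the variable names are my own; this is the classic same-thread signal-handler scenario, not a production use case):

```cpp
#include <atomic>
#include <csignal>

// A compiler-only fence between normal code and a signal handler
// running on the same thread. No hardware fence instruction is emitted.
volatile std::sig_atomic_t data = 0;
volatile std::sig_atomic_t ready = 0;

extern "C" void handler(int) {
  if (ready) {
    std::atomic_signal_fence(std::memory_order_acquire);
    // the write to `data` below main() is guaranteed visible here
  }
}

int main() {
  std::signal(SIGUSR1, handler);  // SIGUSR1 is POSIX/linux
  data = 42;
  std::atomic_signal_fence(std::memory_order_release); // compiler barrier
  ready = 1;  // the compiler may not hoist this above the fence
  std::raise(SIGUSR1);
}
```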

##%%c++(n java)memoryMgmt trec : IV

Q: Your resume says “memory mgmt” for c++, so what did you do personally, not as a team?
A: I will talk about the features in my system. Most of them were not written by me, honestly

  • per-class: restricting allocation to heap or stack
  • per-class: customize operator new for my class (see the sketch after this list)
  • per-class: disable operator new for my class and provide allocateFromRingBuffer() API
  • per-class: move support
  • static thread_locals
  • ring buffer to eliminate runtime allocation
  • custom smart pointers
  • memory leak detector tools like valgrind+massif .. breakdown heap/non-heap footprint@c++app #massif
  • RAII
  • pre-allocate DTOs@SOD #HFT #RTS
  • customize new_handler() — I didn’t write this
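A sketch of two of the per-class techniques above. This is my own toy code, not the production system; alignment is ignored for brevity:

```cpp
#include <cstddef>
#include <new>

// 1) restrict allocation: deleting operator new makes `new StackOnly`
//    a compile error, so instances live on the stack (or as members).
struct StackOnly {
  static void* operator new(std::size_t) = delete;
};

// 2) customize operator new: a toy bump allocator drawing from a
//    pre-allocated static buffer, so no runtime heap allocation.
struct PoolManaged {
  int acctId = 0;
  static char pool[4096];
  static std::size_t used;
  static void* operator new(std::size_t sz) {
    if (used + sz > sizeof pool) throw std::bad_alloc();
    void* p = pool + used;
    used += sz;  // bump the watermark; alignment ignored for brevity
    return p;
  }
  static void operator delete(void*) { /* pool reclaimed in bulk */ }
};
char PoolManaged::pool[4096];
std::size_t PoolManaged::used = 0;

int main() {
  PoolManaged* p = new PoolManaged;  // served from the static pool
  delete p;                          // calls the no-op operator delete
}
```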

Q: java

  • GC tuning
  • memory sizing
  • JDK tools like jhat
  • memory profiler tools
  • string intern
  • investigating memory leaks, all related to GC

## G3 uncertainties for 10Y: #Demand^otherCandd

Let’s be specific: on the competitive landscape over the next 10Y, what specific factors are we trying to predict?

  • G3: predicting my health, energy, motivation, absorbency, “retention” (MSFM…), coding drill visible-progress ..
    • I have 3x more insight into this factor than into the others
  • G5: predicting strength of competing candd (i.e. candidates), often younger
    • one note — my accumulated theoretical and low-level QQ strength will continue to be rare among the young competitors
  • G9: predicting market demand evolution, including churn. Coding IV is the prime example.
    • Note this evolution is not so fast. Read my analysis of MOM and RDBMS

Since 2017, I have tentatively started to shift more focus towards the west coast. As a result, the Competition and Demand factors feel too fast-changing, which makes my painstaking analysis look naive and futile.

Critical thinking needed — suppose I keep my focus on ibanks. In such a case, my analysis and prediction are actually relevant, even valuable. The Competition/Demand factors there have not changed drastically.

I’m pretty good at this kind of competitive landscape analysis, but it’s truly tough to predict any of these 3 factors. However, as Warren Buffett (?) observed, those who give up and submit to mindless investing are likely (not guaranteed) to lose out to those who try to analyze and predict.

 

Nsdq onsite IV #emerging Nsdq architecture

Q: Java NonBlocking IO?
%%A: one thread monitoring multiple sockets

Q: many simple Objects with complex relationships/interactions vs one (or very few) huge complex object?

Q: thread Futures? Any comments?

– some of my findings

  • microservices and containers are important
  • According to the system architect (last interviewer), the new framework uses one dedicated instance of the matching engine per symbol, such as AAPL.
  • the new java framework only supports a few smaller exchanges initially, but is designed to roll out to more “exchanges” esp. those to be installed on client sites (like CSP). It is (theoretically) designed to be capable enough to implement the main stock exchange.
  • mostly java + javascript (Amber, react) with a small amount of c++
  • the new java framework was deliberately designed to operate in single-threaded mode, despite opposition. Therefore, multiple “clients” calling the same library concurrently would be unsafe.
  • java8
  • noSQL is adopted in addition to PostgreSQL, but the principal architect is less concerned about the data store, which is only used in individual components

error stack trace: j^c++

Without a stack trace, silent crashes are practically intractable.

In my java and python /career/ (I think c# as well), exceptions always generate a console stack trace. The only time it didn’t happen was a Barclays JNI library written in c++.

In c++, getting the stack trace is harder (a glibc-based sketch appears after this list).

  • when I used the ETSFlowAsert() construct I got a barely usable stack trace, with function names but without line numbers
  • [[safeC++]] described a technique to generate a simple stack trace with some line numbers but I have never tried it
  • the standard assert() macro doesn’t generate a stack trace
  • In RTS and mvea, memory access errors lead to seg-faults and core dumps. In these contexts we are lucky because the runtime environment (host OS, standard library, seg-fault signal handler ..) cooperates to dump some raw data into the core file, but it’s not as reliable as the JVM runtime. Here are some of the obstacles:
    • core files may be suppressed
    • To actually make out this information, we need gdb + a debug build of the binary with debug symbols.
    • it can take half an hour to load the debug symbols
    • My blogpost %%first core dump gdb session describes how much trouble and time is involved to see the function names in the stack trace.
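The promised sketch, using glibc’s backtrace facilities (linux-specific; build with -rdynamic to get function names, which will still be mangled and without line numbers):

```cpp
#include <execinfo.h>  // backtrace(), backtrace_symbols()
#include <cstdio>
#include <cstdlib>

// Crude self-made stack trace: function names but no line numbers,
// comparable to what ETSFlowAsert() gave me.
void printTrace() {
  void* frames[64];
  int n = backtrace(frames, 64);           // capture return addresses
  char** symbols = backtrace_symbols(frames, n);
  if (!symbols) return;
  for (int i = 0; i < n; ++i) std::printf("%s\n", symbols[i]);
  std::free(symbols);  // backtrace_symbols() malloc'ed the array
}

void innerFunc() { printTrace(); }

int main() { innerFunc(); }
```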

 

RTS pbflow msg+time files #wait_until #60%

Background — In RTS we relied every day on a UDP message replay tool, taking in a msg file and a corresponding time file, both binary. I saw no message delimiters in the msg file, partly because this file is produced in production; inserting delimiters would add overhead to the production parser engine.

Given there’s no delimiter, I believe the timestamp file must hold a seek-offset value (marker) corresponding to each timestamp.

Q: how would you implement a similar replay tool?

The driver is the timestamp file. Every time I see a timestamp, I wait_until() that timestamp (adjusted to my start time). Upon wake-up, I send the corresponding chunk of bytes between the current and next markers.

I would use condition_variable::wait_until(). As noted in c++condVar 2 usages #timedWait, this function has nanosec precision.

(The mutex needed in wait_until? My program is single-threaded, so the dummy lock would be held forever.)

The chunk of bytes would be sent as one UDP packet, no split. Each chunk should start with the original packet header created by the matching engine and received in the original feed.
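A single-threaded sketch of the replay loop. TimeRec and sendUdp() are my made-up stand-ins for the real file format and socket code, and spurious wakeups of wait_until() are ignored for brevity:

```cpp
#include <chrono>
#include <condition_variable>
#include <cstdio>
#include <mutex>
#include <vector>

struct TimeRec {
  std::chrono::nanoseconds offset;  // capture time relative to 1st msg
  std::size_t seekOffset;           // marker into the msg file
};

void sendUdp(const char* bytes, std::size_t len) {
  std::printf("sending %zu bytes\n", len);  // stand-in for a real send()
  (void)bytes;
}

void replay(const std::vector<TimeRec>& times, const char* msgBuf,
            std::size_t msgLen) {
  std::mutex mtx;                        // dummy mutex, single-threaded
  std::condition_variable cv;
  std::unique_lock<std::mutex> lk(mtx);  // held forever
  const auto start = std::chrono::steady_clock::now();

  for (std::size_t i = 0; i < times.size(); ++i) {
    // absolute deadline: a slow send() never pushes later deadlines
    // back, which is the defence against timer drift
    cv.wait_until(lk, start + times[i].offset);  // nobody notifies;
                                                 // returns at deadline
    std::size_t begin = times[i].seekOffset;
    std::size_t end =
        (i + 1 < times.size()) ? times[i + 1].seekOffset : msgLen;
    sendUdp(msgBuf + begin, end - begin);  // one chunk = one packet
  }
}

int main() {
  const char msg[] = "msg1msg2";  // two fake 4-byte messages
  std::vector<TimeRec> times = {{std::chrono::nanoseconds(0), 0},
                                {std::chrono::microseconds(1), 4}};
  replay(times, msg, 8);
}
```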

Q2: how about TCP?

A TCP receiver can deal with partial messages very effectively, as I saw in my tests.

However, a TCP receiver can introduce delays in the sender (i.e. my replay tool), so the transmission will not be instantaneous. In fact, send() can block! This could lead to timer drift. That’s why I need wait_until().

Q3: Explain timer drift?

Suppose in the time file, Timestamp #1 is 9:00:00.000001 and Timestamp #2 is one microsec later, but each data transmission can take an extra 10 nanosec, esp. with TCP’s blocking send(). This 10 nanosec is a tiny drift, but with relative sleeps it would accumulate over millions of messages, adding up to microseconds and beyond. Waiting until absolute deadlines avoids the accumulation.

%%tsn dream: largely constrained by risks

My tsn dream is largely constrained by these risks:

  • stigma/respect
  • the risk of not figuring things out as fast as colleagues

For these specific risks, my current risk tolerance is 2% higher than before, so I can be bolder, but the blow of a stigma is still rather heavy.

Note these risks exist with or without tsn. If we want to minimize them, we need to find a low-calibre team like RTS or OC (95G?), sometimes a low-salary team.

fractional shares: quite common]basket trading

https://www.investopedia.com/terms/f/fractionalshare.asp explains that fractional shares can be naturally-occurring, but I want to talk about artificial fractional shares in “normal” stocks.

Custom baskets are traded as equity swaps. The client specifies a weightage profile of the basket, like x% IBM + y% AAPL + z% GE where x+y+z == 100. The sell-side dealer (not a broker) would hold the positions on its own book, but on behalf of clients.

Such a basket is not listed on any exchange (unlike ETFs). If a client were to go long such a basket directly, she may not get the desired weightage, due to limited liquidity in some constituent stocks. In contrast, a sell-side dealer has more liquidity access, including smart order routers and internal execution.

The weightage means some stocks will have fractional quantities in a given basket. Also, such a basket could be too big in dollar amount, like $58,476, so the sell-side often uses a divisor (like 584) to create a unit price near $100, similar to a mutual fund’s unit price. Fractional quantities are even more inevitable in one unit or in 391 units.
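A quick worked example of the divisor arithmetic. The $58,476 basket value comes from the paragraph above; the 30% IBM weight and the $140 share price are made-up numbers:

```cpp
#include <cstdio>

int main() {
  double basketValue = 58476.0;              // whole basket, in dollars
  double divisor = 584.0;                    // chosen for ~$100 units
  double unitPrice = basketValue / divisor;  // ~$100.13 per unit

  double ibmWeight = 0.30;                   // hypothetical x% = 30%
  double ibmPrice = 140.0;                   // hypothetical share price
  double ibmSharesPerUnit = unitPrice * ibmWeight / ibmPrice;  // ~0.2146

  std::printf("unit price = $%.2f; IBM shares per unit = %.4f\n",
              unitPrice, ibmSharesPerUnit);   // fractional, inevitably
}
```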

After a client buys 391 units, she could sell part of the position.

With mutual funds, I often buy fractional units as well, like 391.52 units.