TCP sliding-window: congestion/receive-buffer

Based on P 248/265 [[computerNetworking]] and blogpost on AWS^SWS

Sliding window is a protocol, or a design framework. AdvertizedWindowSize is an integer field in the Ack message. There are also two variables, ReceiveWindow and CongestionWindow, in the TCP sender codebase (part of the kernel).

Sliding window kills two birds with one stone — 1) congestion control and 2) overflow protection on the receiving end. Without these controls, the sender can pump packets too fast, causing

  • receiver socket buffer overflows
  • congestion on the sender-to-receiver path so some of the router buffers overflow

Basic idea is to limit the sending rate (sketched in code after this list) such that

  • unacknowledged bytes := lastByteSent – lastByteAcked < ReceiveWindow
  • unacknowledged bytes := lastByteSent – lastByteAcked < CongestionWindow
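
A minimal java sketch of that rule; names like receiveWindow/congestionWindow are illustrative, not from any real TCP stack:

/** Minimal sketch of the sender-side window check. All names are illustrative. */
class SlidingWindowSender {
    long lastByteSent, lastByteAcked;
    long receiveWindow;     // from AdvertizedWindowSize in the latest Ack
    long congestionWindow;  // maintained by the congestion-control algorithm

    // unacknowledged (in-flight) bytes must stay below BOTH windows
    long usableWindow() {
        long inFlight = lastByteSent - lastByteAcked;
        return Math.min(receiveWindow, congestionWindow) - inFlight;
    }

    boolean maySend(int nBytes) { return nBytes <= usableWindow(); }
}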

UDP doesn’t limit sending rate and is favored by many aggressive applications such as market data dissemination.

serious delays@ project^feature levels

An insightful article on managing tight deadlines provides the background.

— serious delay at feature level, without user impact

This is a real stressor, primarily because some team colleagues can GTD faster.

For the serious delay, user impact is … minimal! This is true across all my projects, but user impact doesn’t really matter. In fact, the feature may not be used much, whether you deliver it in high or low quality, within or over budget.

— blatant delay at project level when you are architect / team lead

Users never care that much about any new system cf the current production systems. New systems often hit various teething problems and remain functionally unreliable for months.

OK, users don’t really care that much, but there’s a visible technology-budget impact which makes the technology MD look bad. The MD must look for an “explanation” and may cut headcount, as in StirtRisk and RTS-Trep.

Whose fault? Many fingers point at you the architect, but often it’s partly the fault of the user-facing manager, due to immature and overconfident timeline estimates.

A delay is the architect’s fault only if another architect could deliver faster, but I can’t imagine two architects working on two comparable projects at the same time. Therefore it’s never mostly the architect’s fault. In reality, management does apply pressure and you suffer, but it’s really for political reasons (like a sacrificial lamb or scapegoat).

eg: RTS Trep
eg: PWM Special-Investment-management
eg: Post-Saurabh GMDS — multiple delays and scope reductions

Reusable data structure for many pure algo challenges

Construct a data store to hold a bunch of “structs” in a linked list of growing vectors (O(1) insertion). Then we can build multiple “indices” pointing to the nodes (see the sketch after the list).

Here are a few average-O(1) indices:
(Note O(1) lookup is the best we can dream of)

  • hashtable keyed by a struct field like a string
  • array indexed by a struct field like small int id
  • If there’s a non-unique int field, then we can use the same array lookup to reach a "group", and within it use one (or multiple) hashtable(s) keyed by another field
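
A minimal java sketch of the store + indices; field names are illustrative, and ArrayList plays the role of the growing vector:

import java.util.*;

/** Sketch: one store, multiple average-O(1) indices into the same nodes. */
class Store {
    static class Node { String name; int id; int groupKey; }

    private final List<Node> nodes = new ArrayList<>();        // O(1) amortized insertion
    private final Map<String, Node> byName = new HashMap<>();  // hashtable keyed by a string field
    private final Node[] byId = new Node[1_000_000];           // array indexed by a small int id
    private final Map<Integer, Map<String, Node>> byGroup = new HashMap<>(); // group, then hashtable

    void insert(Node n) {
        nodes.add(n);
        byName.put(n.name, n);
        byId[n.id] = n;
        byGroup.computeIfAbsent(n.groupKey, k -> new HashMap<>()).put(n.name, n);
    }
}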

[17]#1 impactful endeavor(Now)4family: IV^zbs^gym..

meaningful endeavor Now: algo^zbs^…

In the quiet hours, inevitably I would get my share of self-doubt about the value of my endeavors. See also what efforts go towards 20Y-career building

At other times, I would feel the impacts (of my effort today) on the people who depend on me — grandparents, kids, wife. There’s a lot I can do to make their lives safer, richer, easier, … For example, the interview-preparation effort looks short-term and less meaningful than zbs, but actually has more impact on family well-being such as education, health-care, housing and life-chances. Re zbs, now I realize zbs accumu is rather theoretical and actually limited. Interviews build my confidence and capacity to provide for them.

Looking at my peers … I feel their personal endeavors are not much better than mine:

  • move up to leadership positions. I think that’s a good direction if you CAN move up. I gave up long ago. So I see myself almost like a specialist consultant for hire
  • personal (property) investments
  • kids grades and top schools
| endeavor | accumu #not a benefit | #3 mental fitness, anti-aging | #2 career after 65 (RnD/teach) | #1 family well-being: impact 1-5 | #4 lasting social value? |
| --- | --- | --- | --- | --- | --- |
| IV: QQ/BP #incl. algo ideas | good if low churn | good | minimal | 4 | minimal |
| IV: algo practice | good | excellent | none | 4 | none |
| …cf: yoga, jog | good | excellent | N.A. | 5 | none |
| zbs #+GTD, instrumentation | good if low churn | good | possible | 3 #helps PKI !! IV | minimal |
| data science, machine learning | churn | ask HuKun | too volatile | 0 | none |
| portable dnlg(+quant) | some | some, esp. quant | none | 1 {–2 | none |
| xx local sys | none | some | none | 4 #if stay long; can help move-up but low correlation | none |
| per investment analysis #unlike XR | some | minimal | none; can’t teach | 1 #questionable; can help kids | none |
| … cf: family expense management | NA | minimal | none | 2 {–1 | none |
| En/Ch learning | some | some | possible | 0 | high |
| contribute to OSS?? | churn | good | minimal | 0 #coding practice churn! Get Real no bandwidth! | none |

“spread%%nlg” as a column? trivial.

top9 workhorse-algos ] speedCoding #XR

XR said we should take speed coding contests to find out what basic algos are needed.

Opening example — parent/child pairs→tree algos #Indeed is one example of a realistic problem that requires common comp-science constructs…

Need to memorize enough to implement on the spot, as these are basic “comp science constructs” needed for “realistic problems”

  • dft/bft, with levels
  • BST find predecessor
  • Binary search in array
  • Merge-sort/insertion-sort/partitioning
  • .. realistic string problems:
  • .. realistic int array problems:
  • max profit; max subarray sum
  • .. realistic matrix problems:

Below are basic algos for hackerrank/codility but NOT applicable to realistic problems typical of Indeed/FB

  • Linked list: remove dupes
  • string: collapse consecutive chars

eg@traction(def2) ] GTD^IV #Sunil

Sunil is not the only one who tried but failed to break into java. Sunil was motivated by the huge java job market. I believe he had opportunities to work on (probably small) java projects and gained confidence, but that’s really the easy part. That GTD experience was completely insufficient to crack java interviews. He needs IV traction.

  • Zhurong also tried java
  • [g] Venkat actually got a java job in SCB but didn’t like it. I feel he was lacking GTD traction
  • [g] XR and the Iranian java guy had some c# projects at work but didn’t gain traction.
  • CSY had a lot to tell me about breaking into java.
  • [gi] CSY and Deepak CM both had java projects at work but no traction
  • [i=IV traction]
  • [g=GTD traction]

IV/QQ traction — I experienced better IV traction in c# than c++. I think it’s because half my c++ interviews were HFT shops.

GTD traction — I had good GTD traction with javascript, php …, much better than c++. My C# GTD traction was also better than c++. C++ is probably the hardest language in terms of GTD. As explained to Kyle, my friend Ashish experienced tremendous GTD traction, but can he crack the interviews? Hiring teams can’t assess your GTD, but they can ask tough QQ questions.

CPU(data)cache prefetching

https://stackoverflow.com/questions/1950878/c-for-loop-indexing-is-forward-indexing-faster-in-new-cpus top answer is concise. I think the observations may not be relevant in x years but the principles are.

  • adjacent cache line (ACL) prefetcher — simple to understand
  • cpu can detect streams of memory accesses in forward or backward directions

Note L1/L2/L3 caches are considered part of the CPU even if some of them are physically outside the microprocessor.

read-mostly data store: needs synchronization

Suppose a data store is infrequently updated, and only by a single writer thread, but is accessed frequently by many, many reader threads.

Q1: Can we cheat and skip some of the (expensive) locking?
A1: No. The writer could be in the middle of a multi-step update, so readers would see a half-updated data store.

Q2: What if we somehow ensure the updates are never multi-step [1]?
A2: Unfortunately no, due to visibility failure. If a reader thread doesn’t synchronize, the crucial memory fencing would not occur and the reader may not notice a recent update, and use cached data instead.

[1] One way to do it — the writer creates a new version of a record, and at last reseats a single pointer, as described in my AtomicReference blogpost https://wp.me/p74oew-s3
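
A minimal java sketch of this single-pointer-reseat idiom (illustrative names, not taken from the blogpost):

import java.util.*;
import java.util.concurrent.atomic.AtomicReference;

/** Sketch: the single writer builds a full copy off-line, then reseats one reference. */
class ReadMostlyStore {
    private final AtomicReference<Map<String, String>> current =
            new AtomicReference<>(Collections.emptyMap());

    void update(String k, String v) {             // single writer thread only
        Map<String, String> next = new HashMap<>(current.get());
        next.put(k, v);
        current.set(next); // readers can only ever see a complete version
    }

    String lookup(String k) {                     // many reader threads
        return current.get().get(k); // the atomic read provides the visibility guarantee
    }
}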

java singleton^immutable classes #enum

[[effJava]] explains that an immutable class needs no copying.

However, we don’t need to work hard to make it a singleton.

If an immutable class happens to be used as a one-instance data type, then lucky. But tomorrow it may get instantiated twice.. no worries 🙂

If boss asks you to make this class singleton, you should point out the legwork required and the numerous loopholes to fix before achieving that goal. Worthwhile?

Java enum types are special. JVM guarantees them to be immutable AND singleton, even in the face of serialization. See P 311.
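
A minimal sketch of the enum-as-singleton idiom (illustrative class, following the [[effJava]] recommendation):

/** Sketch: the JVM guarantees exactly one INSTANCE, even across serialization. */
enum ConfigRegistry {
    INSTANCE;

    private final java.util.Properties props = new java.util.Properties();

    String get(String key) { return props.getProperty(key); }
}
// usage: ConfigRegistry.INSTANCE.get("timeout");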

[16] Fwd: techies’ common complaints about jobs

We complain about high churn, but why the hell do none of us go teach math?

We complain about low $ROTI, but what percentage of techies get any $ROTI from personal investment or self-learning?

We complain about low (if any) lasting social value in our work, but why the hell do none of us choose an RnD career?

Hi friends,

Most techies (including developers) probably feel undervalued, and have a lot of potential not utilized on the current job.

We blame our company or our team or our job. Maybe it’s not challenging enough; maybe too repetitive; maybe too niche.

We look up at some high flyer and ask “what if I’m given that role… I may not do better than that person, but surely I will be competent and up to the job.  It may be boring and stressful but Hey I will earn so much more!”

In many over-staffed IT departments, about 20% of the roles are critical and some 20% of the roles are dedicated to “peripheral” systems that no business users care about. Perhaps that system is lightly used, and users don’t trust the output anyway…

Well, my current (devops) job gives me a lot of opportunities to push myself higher. It’s not too niche (like Quartz/Athena/SecDB). It offers complexity and depth. Not mindless and repetitive. Not something I feel already too familiar with (and jaded). I can see the impact quickly. The impact is on many people. The impact is on front office.

Still I’m not fired-up. I guess there are always better roles out there.

We had better condition our mind not to think that way. Instead make the best use of the current role. “When life gives you lemons, make lemonade”

sg19: risks@tsn

This time my TSN attitude is not driven by fear of stagnation, but a free-spirited, fun-seeking desire to explore.

My past TSN failures (and some successes) are the best guide. Given the multiple TSN failures, my expectation is rather low. However, the pain of stigma, confidence loss, and impact on family can still be very real. That’s why I wrote in my blog that the safest job is back-office java+SQL+scripting with low-calibre colleagues.

3changes]SG job market #cautious optimism

  1. Many non-finance companies now can pay 150k base or higher for a senior dev role. In my 2015 job search, I didn’t find any.
  2. Many smaller fintech companies (not hedge funds) can pay 150k base or higher
  3. c++ (and c#) is /conceding/ market share to java, partly due to the two categories above. Apparently, java is growing more dominant than before. I guess java is more proven, better supported by a bigger ecosystem, and has a bigger talent pool. In contrast, c++ skill is harder to find in Singapore?
    1. Overall good news for me since my java arm is still stronger than c++ arm
  4. remote hiring — more Singapore teams are willing to hire from overseas. Lazada said “mostly over skype”
  5. no longer blue-collar — programmer used to be blue-collar support staff for the revenue staff. Some of the companies listed above treat programmers as first-class citizens.

[17]predict 2-5Y job satisfaction

Q: did the learning actually help with my job interviews (am not salary-focused like Mithun)? This is a key feedback.
A: Yes to some extent

The “table” is subject to frequent change, so I keep it in recoll (it was a google sheet). Here are some notes:

  • stigma/appreciation/respect(zbs) — turns out to be the #1 key to job satisfaction, but appreciation is tricky. Bonus can break it, performance review can break it, other things can break it too. I often felt like “damaged goods”.
    • In Mac and Stirt (and OC too), managers considered me good enough for transfer to other teams.
  • salary + other income — turns out to be insignificant when I have inner confidence and self-esteem. It is still one of the most important factors. When it’s very high, it overrides almost everything else.
  • distractions — do hurt my O2 for GTD, zbs development and self-learning
  • traction — positive feedback, includes zbs growth, instrumentation, self confidence, IV, catching up or surpassing peers
  • strategic orgro/stagnation — turns out to be a white elephant.
  • Many of the top factors are “perceptions” rather than “hardships”
    • perceptions — self-esteem@comp; strategic tsn; engaging… #best eg@perception — peer comparison
    • hardships — mkt depth (job pool size); workload; family time; commute; salary

! OC job would actually have been not bad, had there been some appreciation from the boss. However, the new, strategic specialization didn’t bring enough satisfaction.

! Verizon job experience was rather positive. I was on the rise, in my prime. It all ended when I moved to GS. I should have quit GS earlier. Citi was the start of another prime period. Prime mostly in terms of self-confidence, self-esteem …

My prediction — to have a satisfying (not necessarily strategic) job next time,

  • I need the #1 factor — appreciation.
  • A well-paid java job will most likely make me feel good.
  • LearningSomethingNew and engagement will NOT be a deciding factor (Recall c#/py experiences). I will still make time to learn something, just like in 95G

 

##dry$valuable topics 4 high(absorbency)period #flight

This ranking was originally compiled for “in-flight edutainment”. Warning — if high churn, low accu, low traction, or short shelf life, then the precious absorbency effort is wasted.

  1. coding drill — esp. the hot picks. Look at tag “top50”. Practice solving them again quickly.
    • 😦 low accu?
  2. c++ TMP? seldom asked but might be asked more in high-end interviews. TMP is heavily used in real world libraries.
  3. effModernC++
  4. java concurrency book by Lea. More valuable than c++ concurrency because java threading is a reference implementation and widely used.
  5. java tuning, effJava,
  6. linux kernel as halo?
    • 😦 low traction since I’m a few layers removed from the kernel internals
  7. c++11 concurrency features?
    • 😦 low traction since seldom asked in-depth

## sg19 java QQ to brush up

  • — jxee
  • [10=10 min ]
  • [20] Containers
  • Cloud native? LG
  • JPA? LG
  • — other ecosystem items
  • containers, cgroups
  • jdk tools for heap analysis
  • tuning best practices
  • GC in general? bookish ok
  • Ajax integration? no one asked
  • — coreJava
  • java8/9 fundamentals?
  • high-level concurrency tools — how they are related to the low-level. Lower mkt value than low-level tools
  • serialization? not popular in IV
  • advanced generics like wildcard? not popular in IV
  • reflection? not popular in IV

longer u stay,more localSys nlg⇒safer@@ #5%GS

As I said in this blog, FTE-dev is often worse off than contractors.

I think if you stay for many years without moving up while some of your colleagues move up, you may or may not get a stigma. Some of the newer members may become your manager 🙂 But this is not the main focus here.

The longer you stay, the more knowledgeable you become about the local system. Hopefully more relaxed and less stressed? Partly true, but a very dangerous slippery slope. You hope that you can make do with 80% of the earlier /intensity/, but I think it won’t happen in most ibanks [1].

In my observation, most VP-level old timers operate under /unrelenting/ pressure (“threat”) to maintain high productivity. They are expected to be more proficient and more productive than earlier, not allowed to slow down and take it easy … No retirement home here.

Otherwise, they would fail the peer benchmark.

Another part of the threat comes from hungrier, younger colleagues able to reach (then surpass) half your speed within a year or two, driven by /formidable/ brain-power (energy, intelligence, ..)

[1] There are exceptions, but I only know a few, so I don’t want to spend too much time analyzing them. Exception — if you were the original architect and you keep an eye on the evolution of your brainchild and remain knowledgeable about it; but this scenario requires some brain-power.

That’s the harsh competitive reality even if you don’t seek increasing responsibilities. A small percentage of the people are ambitious ladder climbers. (Am clearly not, because at my current level I already feel the heavy workload.)

Many people I talk to want an “easy life”, not increasing responsibilities. However, If you don’t take up increasing responsibilities, you may become too expensive. You may get a token bonus. I think you may even get a humiliating bonus.

Overall, in “peacetime”, long service without moving up can feel embarrassing and uncomfortable at times, for some individuals. (It’s more noticeable if most of the peers at your level are much younger, as in Macq and OC.) Some corporate cultures may tolerate that, but still stigmatize it.

Employers claim they prefer employees staying longer rather than shorter. That’s blatant marketing. In reality, some employers wish some old timers would leave on their own, to make way for younger, cheaper fresh blood. The GS annual 5% cull in peacetime is widely reported in the WSJ, the Independent... A few common motivations:

  1. Old timers are sometimes perceived as obstacles to change, to be removed.
  2. some employers believe younger, newer workers are cheaper and more motivated on average
  3. Whenever a new manager comes in he would bring in his old friends, otherwise he is weak.

Downturn? All hell breaks loose. Rather than protecting you, your long service track record may make you vulnerable. You may be seen as an opportunity to “replenish fresh blood”. In contrast, the less-productive but newer colleagues may show potential, and the hiring manager doesn’t want to look bad — hiring then firing new guys. In other words, it’s sometimes safer for the manager to sacrifice an old timer than a recent new hire. This is different from my Stirt experience.

My personal biased conclusions —

  • no such thing as long service recognition. No such thing as two-way commitment.
  • If you can be replaced cheaper, you are at risk. The more you earn, the riskier your position.
  • If you are earning above the market rate then you need enough value-add, regardless how long you have served.

float/int 1:1 mapping#java support

[category javaOrphan ]

See last paragraph of https://docs.oracle.com/javase/8/docs/api/java/util/concurrent/atomic/package-summary.html

  • Useful if you need to precisely compare a 64-bit float value. For example, you can check if a value has changed at all. The value could be user-supplied or shared mutable.
  • Useful if you need to store float values in a hashtable, though I’m not sure about performance.
  • Useful if you need radix sort on floats
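
A minimal sketch of the 1:1 mapping; Double.doubleToLongBits is the jdk method the linked oracle page refers to:

public class FloatBits {
    public static void main(String[] args) {
        double d = 3.14;
        long bits = Double.doubleToLongBits(d);     // canonical 1:1 mapping (NaNs collapsed)
        double back = Double.longBitsToDouble(bits);
        System.out.println(back == d);              // true: precise round-trip
        // doubleToRawLongBits preserves NaN bit patterns exactly
        System.out.println(Long.toHexString(bits));
    }
}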

##specializations fac`same IV quizzes 20Y on#socket !! c++11

(tagline: the most churn-resistant specializations.)

Socket IV questions have remained unchanged for 20Y — Unmatched stability and churn-resistance, but not necessarily accumulating

  • Preservation of t-investment
  • Preservation of accumulation
  • Preservation of deep learning? Socket programming has modest depth.

Q: Beside the specialization of socket programming, are there other specializations that are dominated by the same old QQ questions 20 years on?

  • [S] classic data structures
  • [S] classic sort/search algorithms on int-array, char-array, list ..
  • [S] classic traversal algorithms on trees, general graphs
  • [s] classic recursive, DP, greedy algorithms beyond graphs
  • [S] pre-c++0x core-C++ (a specialization!) questions are largely unchanged. C++11 questions are rooted in the old knowledge base.. BUT most of the c++11 QQ topics will likely fall out of interview fashion
  • [s] cross-language concurrency primitives.
  • unix file/process/string/signal manipulation
  • unix shell scripting — low market value
  • [S] SQL — including join, index design … but seldom quizzed in depth nowadays
  • [S] regex — seldom quizzed, but often needed in coding
  • [S=classic, well-defined specialization]
  • [s=slightly less well-defined specialization]

Now the disqualified skills

  1. JGC + jvm tuning — high churn over 20Y
  2. TMP — new features introduced in c++11

## strategic TSN among%%abandoned

Q: name the most strategic trySomethingNew domains that I have since abandoned, given up on, quit.

Don’t spend too much time, because the answers are nothing new even though this is a decent question.

  1. — ranked by surprise
  2. algo trading? actually very few roles spread across a number of firms
  3. c#
  4. drv pricing quant
  5. real time risk as in GS, Qz and Athena
  6. distributed cache like Coherence and Gemfire
  7. web dev for dotcom? I did this for years then abandoned it. Many of the tech skills are still relevant like sessions, java thread safety, long-running jobs
  8. Unix sys admin and DBA
  9. perl, SQL, Unix/Linux power-user knowledge? No longer a top 10 focus
  • python? Not yet abandoned

EnumSet^regular enum

[category javaOrphan]
A java enum type usually represents .. (hold your breath) .. a radio-button-group. A variable of this type will bind to exactly one of the declared enum constants.

eg: Continent — there are only 7 declared constants. A Continent variable binds to Africa or Antarctic but not both.
eg: SolarPlanet — there are only 8 declared constants
eg: ChemicalElement — there are only 118 declared constants
eg: ChinaProvince — there are only 23 declared constants

In contrast, an enum type has a very different meaning if used within an EnumSet. (I will name the meaning soon.) Each enum constant is an independent boolean flag. You can mix and match these flags.

Eg: Given enum BaseColor { Red,Yellow,Blue} we can have only 2^3 = 8 distinct combinations. R+Y gives orange color. R+Y+B gives white color.

Therefore, the BaseColor enum represents the 3 dimensions of color composition.

EnumSet was created to replace bit vector. If your bit vector has a meaning (rare!) then the underlying enum type would have a meaning. Here’s an example from [[effJava]]

Eg: enum Style {Bold, Underline, Italic, Blink, StrikeThrough, Superscript, Subscript… } This enum represents the 7 dimensions of text styling.

[[effJava]] reveals the telltale sign — if the enum type has up to 64 declared constants (only three in BaseColor.java), then the entire EnumSet is actually represented as a single 64-bit integer. This proves that our three enum constants are three boolean flags.
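
A minimal sketch using the Style example (the constants come from [[effJava]]; the demo itself is mine):

import java.util.EnumSet;

enum Style { BOLD, UNDERLINE, ITALIC, BLINK, STRIKETHROUGH, SUPERSCRIPT, SUBSCRIPT }

public class EnumSetDemo {
    public static void main(String[] args) {
        // each constant is an independent boolean flag; mix and match freely
        EnumSet<Style> heading = EnumSet.of(Style.BOLD, Style.UNDERLINE);
        heading.add(Style.ITALIC);
        System.out.println(heading.contains(Style.BOLD)); // true
        // with <= 64 constants, the whole set is one 64-bit word internally
    }
}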

pre-allocate DTOs@SOD #HFT #RTS

example — RTS pre-allocates outgoing message objects from a ring buffer’s head, and “returns” to the ring buffer at the tail… See How RTS achieved 400-700 KMPS #epoll

example — Sell-side HFT OMS also uses pre-allocation. Suppose for every new order there are 3 new DataTransferObjects A/B/C to be instantiated on heap. Traditional approach would make 3 allocation requests in real time. I think the free-list manager becomes a hotspot, even if there’s a per-thread free list.

Basically HFT avoids new/malloc after market opens.

Pre-allocation is a popular technique. We compute at compile time the sizes of A/B/C based on their class declarations. For DTO class A, sizeof(A) just adds up the non-static data field sizes. Then we estimate how many orders we will get a day (say 7 million) and pre-allocate 7 million A objects in an array. The allocation happens at start-up, though the sizes are compile-time constants.

When an order comes in, the A/B/C DTO objects are already allocated but empty.

Byte-array is an alternative, but this smells like the raw free list to me…
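
A java-flavored sketch of the pre-allocation idea (an HFT shop would do this in c++; the pool/DTO names are invented):

/** Sketch: allocate every DTO at start-up; no new/malloc after market open. */
class OrderDtoPool {
    static class OrderDto { long qty; double px; }

    private final OrderDto[] pool;
    private int next;

    OrderDtoPool(int expectedOrdersPerDay) {  // e.g. 7 million
        pool = new OrderDto[expectedOrdersPerDay];
        for (int i = 0; i < pool.length; i++)
            pool[i] = new OrderDto();         // all allocation happens here, at start-up
    }

    OrderDto acquire() { return pool[next++]; } // already allocated but empty
}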

java enum: elegant

“Elegant” (and “clean”) is probably the best adjective. My comments below are mostly based on [[effJava]]

  • 🙂 immutable in the face of serialization
  • 🙂 strict singleton in the face of serialization .. P311
  • 🙂 simple enough to be easily embedded as a static nested class
  • 🙂 you can add behavior (and data) unique to Jupiter, using a constant-specific class body
  • 🙂 You can switch on enum values
  • compiler adds two implicit static methods values() and valueOf(), not listed on the official javadoc 😦
  • values() returns a fixed array of the predefined enum values
  • valueOf(planetNameStr) would return the matching enum instance
    • Note this method is unrelated to String.valueOf()
    • you can even add your own fromString(abbreviatedPlanetName) method. See [[effJava]]

EnumSet (see separate blogpost) and EnumMap build on the strengths of the enum feature.
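
A minimal sketch pulling several of these features together (Planet is the classic [[effJava]] example; the nickname() body is my invention):

enum Planet {
    EARTH(5.976e24),
    JUPITER(1.9e27) {
        @Override String nickname() { return "gas giant"; } // constant-specific class body
    };

    private final double mass;
    Planet(double mass) { this.mass = mass; }
    String nickname() { return name().toLowerCase(); }

    public static void main(String[] args) {
        for (Planet p : Planet.values())               // implicit static method
            System.out.println(p + " : " + p.nickname());
        System.out.println(Planet.valueOf("JUPITER")); // implicit static method
    }
}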

##gains{Stirt job loss

Immediately after I joined Macq, I realized I had traded up for a better job, better in every way.

  • gain: self-confidence that I can comfortably handle the ensuing financial impact on family
  • gain: self-confidence that I have survived again, albeit with a scar
  • gain: a protective scar — as of 2019 I’m still traumatized but slowly I’m healing from inside
  • gain: lesson learned the hard way — avoid FTE
  • $gain: compensation package more than covered my bench time but still I prefer the … something else instead of the compensation

unnoticed gain{SG3jobs: 看破quantDev (seeing through quantDev)

All three jobs were java-lite, with some quantDev exposure. Through these jobs, I gained crucial clarity about the bleak reality of the quantDev career direction. The clarity enabled me to take the bold decision to stop the brave but costly TSN attempts to secure a foothold. A foothold is simply too tough, and the pursuit futile.

Traditional QuantDev in derivative pricing is a shrinking job pool. Poor portability of skills without any standard set of interview topics.

QuantDev offers no contract roles !

Instead, I successfully established some c#/py/c++ trec. The c++ accu, though incomplete, was especially difficult and precious.

Without these progresses, I would be lacking the confidence in py/c#/c++ to work towards and achieving multiple job offers. I would still be stuck in the quantDev direction.

[19]if I had stayed ] java

I think every tsn experience and proposal has some “buts”, so does it mean we should stop trying?

No. If I had stayed within java/sql/perl then I would have been worse off —

  • fewer job opportunities, less broadened technical base
  • slightly more worry about job security
  • more worry about churn
  • more worry about outsourcing
  • no differentiation from millions of java guys
  • left behind by some of the alpha geeks who branch out and establish new strongholds.
    • My technical mind is the BROAD-n-deep, curious explorer type so it is stifled if confined to java
  • sql and perl both falling out of fashion

But ….

  • possibly more competent in terms of figure-things-out relative to team peers
  • possibly fewer stigmas and more respect
  • ^^ These factors depend mostly on localSys knowledge
  • not so much stress, nor so much painful struggle in the beginning
  • possibly an architect role, provided I stay long and invest heavily in localSys

Was leverage good on my multiple tsn attempts after GS? Reasonable leverage in some attempts.

checked STL^checked java Collections

jdk checkedList, checkedMap etc are designed to catch type errors — checking that any newly added item has the correct type for the collection. See P 246 [[java generics]]

STL checked container checks very different coding errors. See http://www.informit.com/articles/article.aspx?p=373341, which is extracted from my book [[c++codingStd]]
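
A minimal demo of the jdk behavior (the raw-type assignment simulates legacy code):

import java.util.*;

public class CheckedDemo {
    public static void main(String[] args) {
        List<String> checked = Collections.checkedList(new ArrayList<>(), String.class);
        List unsafe = checked;  // raw type, as in legacy code
        unsafe.add(42);         // throws ClassCastException HERE, at insertion time
    }
}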

java6 biasedLocking^lockfree^ST-mode

This is a pure QQ topic, but you can score a few points by mentioning it.

https://stackoverflow.com/questions/9439602/biased-locking-in-java is very concise — un-contended locking may incur zero cost, as claimed in the accepted answer

  • .. incurs no synchronization cost. I suppose this is typically geared towards overly conservative code that performs locks on objects without ever exposing them to another thread. The actual synchronization overhead will only kick in once another thread tries to obtain a lock on the object.
  • Biased locking is the default in java6 (-XX:+UseBiasedLocking), improving the performance of uncontended synchronization. An object is “biased” toward the thread which first acquires its monitor; subsequent monitor-related operations are relatively much faster. Some applications with significant amounts of uncontended synchronization may attain significant speedups.

Is this technique considered lockfree? No, but some may speculate that it might be even faster than lockfree. So if you suspect most of the time there’s only one thread accessing a critical section, you could choose to rely on the (java6 default) biased locking rather than a lockfree solution. Most of the time this mutex-based design would rival (if not beat) a lockfree solution.

However, I believe single-threaded mode is still faster, where a thread isn’t aware of other threads, as if it were the only thread accessing those objects, i.e. no shared mutables. There would be no lock grab, no memory fencing at all.

Q:which linux c++thread is stuck #CSY/Vanguard

This is a typical “c++ecosystem question”. It’s not about c++ or C; it’s about linux instrumentation tools.

Q1: Given a multi-threaded server, you see some telltale signs that process is stuck and you suspect only one of the threads is stuck while the other threads are fine. How do you verify?

Q2: What if it’s a production environment?
A: I guess all my solutions should be usable in production, since the entire machine is non-functioning — we can’t make it any worse. If the machine is still doing useful work, then we should probably wait till end of day to investigate.

–Method: thread dump? Not popular for c++ processes. I have reason to believe it’s a JVM feature, since java threads are always jvm constructs, usually based on operating system threads [1]. JVM has full visibility into all threads and provides comprehensive instrumentation interface.

https://www.thoughtspot.com/codex/threadstacks-library-inspect-stacktraces-live-c-processes shows a custom c++ thread dumper but you need custom hooks in your c++ source code.

[1] Note “kernel-thread” has an unrelated meaning in the linux context

–Method: gdb

q[ thread apply all bt ] prints a stack trace of every thread, allowing you to somewhat easily find the stuck one

I think in gdb you can release each thread one by one and suspend only one suspect thread, allowing the good threads to continue

–Method: /proc — the dynamic pseudo file system

For each process, a lot of information is available in /proc/12345. Information on each thread is available in /proc/12345/task/67890, where 67890 is the kernel thread ID. This is where ps, top and other tools get thread information.

 

closestMatch in sorted-collection: j^python^c++

–java is cleanest. P236 (P183 for Set) [[java generics]] lists four methods belonging to the NavigableMap interface

  • ceilingEntry(key) — closest entry higher or equal
  • higherEntry(key) — closest entry strictly higher than key
  • lowerEntry
  • floorEntry
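
A minimal demo with assumed keys (TreeMap implements NavigableMap):

import java.util.NavigableMap;
import java.util.TreeMap;

public class ClosestMatch {
    public static void main(String[] args) {
        NavigableMap<Integer, String> m = new TreeMap<>();
        m.put(10, "ten");
        m.put(20, "twenty");
        System.out.println(m.ceilingEntry(15)); // 20=twenty : lowest entry >= 15
        System.out.println(m.higherEntry(20));  // null      : strictly > 20
        System.out.println(m.floorEntry(20));   // 20=twenty : highest entry <= 20
        System.out.println(m.lowerEntry(20));   // 10=ten    : strictly < 20
    }
}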

strong coreJava candidate=most mobile

Jxee guys face real barriers when breaking into core java. Same /scene/ when a c++ interviewer /grills/ a java candidate. These interviewers view their technical context as a level deeper and (for the skill level) a level higher. I have seen and noticed this perception many times.

Things are different in reverse — core java guys can move into jxee with relatively little effort. When jxee positions grow fast, the hiring teams often lower their requirements and take in core java guys. Fundamentally (as some would agree with me), the jxee add-on packages are not hard; otherwise they wouldn’t be popular.

Luckily for java guys, Javaland now has the most jobs and often well-paying jobs but ..

  1. c# guys can’t move into the huge java market. Ellen has seen many
  2. c++ is a harder language than java, but a disproportionate percentage (80%?) of c++ guys face real entry barrier when breaking into javaland.
  3. I wonder why..
  • I have heard many c++ and c# veterans complain about the huge ecosystem in javaland.
  • I have spoken to several c++/c# friends. I think they don’t sink their teeth into java. Also, a typical guy doesn’t want to abandon his traditional stronghold, even though in reality the stronghold is shrinking.
  • Age is a key factor. After you have gone through not only many years but also a tough accumulation phase on a steep learning curve, you are possibly afraid of going through the same again.

2 threads taking turn #std::ref #CSY

Vanguard Q: write a program containing exactly 2 threads printing 1/3/5/7/… and 2/4/6/8… respectively, but in lock steps, as if passing a token between them.

https://github.com/tiger40490/repo1/blob/cpp1/cpp/thr/takeTurn.cpp is my solution.

I managed to avoid sleep() and condVar. The idea — all threads run the same function, which

  1. check the shared mutable variable Next. If it’s “my” turn then
  2. grab lock, print my next number, update Next, release lock
  3. end-if
  4. yield and exit

I used an atomic<char> “Next”, set to lastOutput%something, that’s visible to all threads even when not holding a lock.
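
Here is a java analogue of the same idea (my own sketch, not the linked c++ code):

import java.util.concurrent.atomic.AtomicInteger;

/** Sketch: two threads take turns via one shared atomic counter; no sleep, no condVar. */
public class TakeTurn {
    static final AtomicInteger next = new AtomicInteger(1);

    public static void main(String[] args) {
        Runnable worker = () -> {
            int myParity = Integer.parseInt(Thread.currentThread().getName());
            while (next.get() <= 8) {
                int n = next.get();
                if (n <= 8 && n % 2 == myParity) { // my turn?
                    System.out.println(n);
                    next.incrementAndGet();        // pass the "token"
                }
                Thread.yield();                    // polite busy-wait
            }
        };
        new Thread(worker, "1").start(); // odd numbers
        new Thread(worker, "0").start(); // even numbers
    }
}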

 

contains(): Set^SortedSet

–in java # See P247/32 [[java generics]]

  • Set<Acct> contains() uses Acct.equals()
  • SortedSet<Acct> contains() uses Comparable<Acct> or a custom comparator class, and ignores Acct.equals() (see the demo below)
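
A quick demo of the java behavior, with an illustrative comparator:

import java.util.*;

public class ContainsDemo {
    public static void main(String[] args) {
        // comparator treats all strings of equal length as "equivalent"
        SortedSet<String> byLen = new TreeSet<>(Comparator.comparingInt(String::length));
        byLen.add("abc");
        System.out.println(byLen.contains("xyz")); // true! equals() is ignored
        Set<String> plain = new HashSet<>(byLen);
        System.out.println(plain.contains("xyz")); // false: uses equals()/hashCode()
    }
}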

–In STL

The tree-based containers use “equivalence” to determine containment, basically the same as the java comparator.

The hash-based containers use hashCode + an equality predicate. The implementation details are slightly complicated since both the hash function and the predicate function are template params. Upon template instantiation, the two concrete types become “member types” of the host class. If the host class is unordered_set<string>, then we get two concrete member types:

unordered_set<string>::hasher and unordered_set<string>::key_equal

These member types can be implemented as typedefs or nested classes. See ARM.

Can C implement basic containers@@

Templates — I think template is unavailable in C, and I don’t know how C could emulate it, except using macros.

C doesn’t support exceptions, but luckily STL throws very few exceptions.

Most STL container classes have relatively few member functions. I think C can implement them as free functions with “this” as an additional parameter.

The bulk of the basic operations on containers are global functions in the form of STL algorithms. I think C can implement them.

Iterators are tricky. In STL I think they are usually nested classes. I think I would just give up. C supports no operator overloading so there’s no point providing an iterator.

##container iterator is implemented as ..

  • implemented as raw ptr — if your data is held in an array
  • implemented as member class (sugarcoated as member typedef) — most common
  • implemented as friend class
  • implemented as wrapper over internal container’s iterator — (cheat) if your custom container is kind of wrapper over an STL container, then just use the internal container’s iterator as your iterator.

Remember an iterator class is a form of smart pointer by definition, since it implements operator->() and operator*()

##java8 features #MethodRef

  • lambda
  • default methods
  • streaming
  • method references? Supposed to be an important feature in the core language/syntax. Related to lambda syntax, presumably a syntactic sugar coating.
  • java.util.Optional<T> introduced to reduce NPE. Compare boost::optional
  • CompletableFuture<T> to promote async, event-driven programming
  • PermGen disappears in java8
  • –not really java8 features:
  • java7 introduced G1 garbage collector though default GC remains ParallelGC, for both java8 and java7
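
A minimal demo touching several of these features (illustrative data):

import java.util.*;
import java.util.stream.*;

public class Java8Demo {
    public static void main(String[] args) {
        List<String> names = Arrays.asList("alice", "bob");
        // lambda + streaming
        List<Integer> lens = names.stream().map(s -> s.length()).collect(Collectors.toList());
        // method reference: sugar coating over an equivalent lambda
        names.forEach(System.out::println);
        // Optional<T> to reduce NPE
        Optional<String> z = names.stream().filter(s -> s.startsWith("z")).findFirst();
        System.out.println(lens + " " + z.orElse("none"));
    }
}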

pureDev beats techLead: paycut=OK #CYW

My hypothesis based on discussions with CYW:

Bottom line — YW said if the pure-tech salary is only 20k lower, then he may prefer the lighter pure-dev workload. The work-life balance, the freedom-from-care and simplicity .. are often highlighted by my tech-lead friends (like Su from GS?), though I suspect most of them still take tech-lead roles (or higher).

A techLead has (usually difficult) deliverables for multiple business groups pushing conflicting priorities. The techLead also needs to enlist support from external teams. As such, he has nontrivial additional workload, including supervising his own team members, managing users, liaising with other teams … all in addition to his development workload.

  • 😦 吃力不讨好 (thankless drudgery: hard work, little credit)
  • 😦 It’s not easy to mobilize other people to work for your deliverables.
  • 😦 The extra workload often requires overtime. RTS’s Venkat might be an example when he claimed he works 7 days a week.

The power dynamics are based on loyalty. Some big boss might value your loyalty more than your GTD capacity. Loyalty is a tricky investment. If you invest in loyalty to big boss A, but A gets replaced by big boss B (as often happens), and B invariably brings in his own loyal lieutenants, then you have a problem. I think Jerry Zhang (Citi) faced this situation when Srini took over.

Unlike the tech lead, the typical pure-tech contributor doesn’t need to demonstrate loyalty.

I think some senior developer roles are pure-tech-contributors.

both DEPTH+breadth expected of%%age: selective dig

Given my age, many interviewers expect me to demonstrate insight into many essential (not obscure) topics such as lockfree (Ilya).

Interviewers expect a tough combination of breadth + some depth. To my advantage, I’m a low-level digger and also a broad-based avid reader. The cross-references in my blog are esp. valuable.

Challenge for me — identify which subtopic to dig deeper, among the breadth of topics, given my limited absorbency and the distractions.

##java heap allocation+!explicit q[new]

Most of these are java compiler tricks.

  • (pre-runtime) enum instance instantiation — probably at class-loading time
    • P62 [[java precisely]]
  • String myStr = “string1”; // see string pool blogpost
    • P11 [[java precisely]]
    • NOT anonymous temp object like in c++
  • “string1” + “str2” — is the same as myStr.concat(..). So a method can often new up an object like this.
    • P10/11 [[java precisely]]
  • boxing
  • (most tricky) array initialization
    • int[] days ={31,28,31/* instantiates the array on heap */};
    • P17 [[java precisely]] has examples
    • P16 [[java precisely]] also shows an alternative syntax “new int[]”
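
A minimal demo collecting these tricks in one place (illustrative values):

/** Sketch: heap allocations with no explicit q[new] in the source. */
public class HiddenNew {
    enum Color { RED } // instantiated at class-loading time

    public static void main(String[] args) {
        String myStr = "string1";      // interned in the string pool
        String s2 = myStr + "str2";    // compiles to concat -> a new String on heap
        Integer boxed = 1234;          // boxing allocates (small values come from a cache)
        int[] days = {31, 28, 31};     // array literal instantiates the array on heap
        System.out.println(s2 + boxed + days.length);
    }
}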

crypto exchange revenue #hearsay

–Jasper wrote “Trading Fees: When it comes to crypto-to-fiat trading pairs, there are no fees. As for crypto-to-crypto pairs, 0.15% of transaction value for market takers and -0.075% of transaction value (a rebate) for market makers is the rule.”

$100M daily trading volume is a figure I heard. That implies $150k/day of fee income, assuming a 15 bps fee.

I guess the rebate is like an income tax rebate. Market makers still pay a fee, but a lower fee than market-takers.

–one-time listing fee

I was told that some creators of new cryptocurrencies pay $5 mil to a popular exchange for a listing. It’s a one-time listing fee.

 

3ways to expire cached items

server-push update ^ TTL ^ conditional-GET

Online articles would hint at these, but few articles list the three modes explicitly. This is a simple concept but fundamental to DB tuning and app tuning.

A) TTL — more common. Each “cache item” embeds a time-to-live data field a.k.a expiry timestamp

B) cache-invalidation — some “events” would trigger an invalidation. Without invalidation, a cache item would live forever with an infinite TTL, like the list of China provinces.

If TTL is predictable, you should combine A and B. I think cookie is an example.

C) conditional-GET in HTTP is a proven industrial-strength solution described in my 2005 book [[computer networking]]. The cache server always sends a GET to the database but with an If-Modified-Since header. This saves unnecessary database load and network load.
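
A minimal sketch of mode A (illustrative class; each item embeds its expiry timestamp):

/** Sketch: a cache item carrying its own TTL. */
class CacheItem<V> {
    final V value;
    final long expiresAtMillis; // Long.MAX_VALUE approximates an infinite TTL

    CacheItem(V value, long ttlMillis) {
        this.value = value;
        this.expiresAtMillis = System.currentTimeMillis() + ttlMillis;
    }

    boolean isExpired() { return System.currentTimeMillis() >= expiresAtMillis; }
}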

| scenario | TTL | eager server-push | conditional-GET |
| --- | --- | --- | --- |
| frequent query, infrequent update | efficient | efficient | high network load, but limited to tiny requests |
| latency important | OK | lowest latency | slower lazy fetch, though efficient |
| infrequent query | good | wastes DB/client/NW resources | efficient on DB/client/NW |
| frequent update | unsuitable | high load on DB/client/NW | efficient conflation |
| frequent update+query | unsuitable | can be wasteful | fairly efficient |

 

market-depth^elite domains #jxee

I used to dismiss “commodity” skills like market data, risk system, JXEE… I used to prefer high-end specializations like algo-trading, quant-dev, derivative pricers. In reality, average salary is only slightly different and a commodity job can often outpay a specialist job.

As I get older, it makes sense to prefer market depth rather than “elite”(high-end niche) domains. A job market with depth (eg jxee, market-data, risk systems) offers a large number of positions. The typical salary of top 10% vs the median are not very different — small gaps. In contrast, the elite domains feature bigger gaps. As I grow older, I may need to reconsider the specialist vs generalist-manager choices.

Reminders about this preference (See also the spreadsheet):

  1. stagnation in my orgradient
  2. may or may not use my specialist skills in math, concurrency, algorithms, or SQL …
  3. robust demand
  4. low churn — a critical criterion whenever I mention “market depth”. I don’t like the market depth of javascript and web java.
  5. salary probabilities(distro): mgr^NBA#marketDepth etc

–case study: Algo trading domain

The skillset overlap between HFT and other algo systems (sell-side, OTC, RFQ, automated pricing/execution..) is questionable. So is “accumulation” across the boundary. There seems to be a formidable “dragon gate” — 鲤鱼跳龙门 (the carp leaping over the dragon gate).

Within c++ based HFT, accumulation is conceivable. Job pool is so small that I worry about market depth. My friend Shanyou agreed that most of the technical requirement is latency. C/C++ latency techniques are different from java.

However, HFT developers seldom need to optimize latency.

Outside HFT, the level of sophistication and latency-sensitivity varies. Given the vague definition, there are many (mostly java) jobs related to algo trading i.e. better market depth. Demand is more robust. Less elitist.

jvm footprint: classes can dominate objects

P56 of the official [[java platform performance]], written by the SUN java dev team, has pie charts showing that

  • a typical Large server app can have about 20% of heap usage taken up by classes, rather than objects.
  • a typical small or medium client app usually has more RAM used by classes than by data — up to 66% of heap usage taken up by classes.

The same page also says it’s possible to reduce the class footprint.

mgr role risk: promotion=hazard #Alex

“mgr” in this context means any lead role.

When I feel left behind on the slow track, it’s usually in comparing to the manager peers.

Some Morgan Stanley developer rose to ED but after a while, he wanted hands-on dev, so he stopped managing teams and became a very senior dev. But his performance/value-add was benchmarked against those manager EDs. Presumably, after a while he was seen as too expensive as a dev and got the golden handshake in 2019 Mar.

When my friend Alex told me this story, I gave this illustration — Suppose with hard work I am competent at Level 5 (senior VP), very comfortable at Level 4 (junior VP), but struggle a bit at Level 6 (ED) when benchmarked against Level 6 peers. In such a case, for job safety I may want to remain at Level 5 rather than moving up. For an easy life, I may even stay at Level 4. If I get promoted to 6 I face stiff peer competition, from guys competent at Level 6. The benchmark-risk can be real. 高处不胜寒 (it’s cold at the top)

When you get promoted to Level 6, you can’t avoid the peer bench-marking. You will get bench-marked, whether you like it or not. I find this peer benchmark unhealthy and hazardous, but Alex is very explicit about the benchmark/calibration system in many firms.

Better remain at a level you can perform well relative to peers.

try{}must be completed by ..#j^c++^c#

— Java before 7 and c#
try{} should be completed by at least a catch or a finally. A lone-wolf try{} block won’t compile. See https://www.c-sharpcorner.com/UploadFile/skumaar_mca/exception-handling-in-C-Sharp/

In particular, try/finally without catch is a standard idiom.

— java 7:
try{} should be completed by a catch, a finally, both or none .. four configurations 🙂

The try/finally configuration now has an important special case, i.e. try-with-resources, where the finally is implicit so you won’t see it anywhere.
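
A minimal sketch of that special case (data.txt is an assumed input file):

import java.io.*;

public class Twr {
    public static void main(String[] args) throws IOException {
        // no catch, no visible finally: the compiler generates the close() call
        try (BufferedReader r = new BufferedReader(new FileReader("data.txt"))) {
            System.out.println(r.readLine());
        }
    }
}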

— c++ as of 2019
C++ has no finally.

try{} must be followed by catch.

RTS feed for Internet clients #Mark #50%

Q: Based on the RTS market data dissemination system, what if some clients subscribe over a slow Internet connection and your orderbook feed needs to provide delta updates each time a client reconnects?

Default solution: similar to FIX/TCP .. the sender maintains per-client state. I won’t elaborate. Here is my own solution.

Note on terminology — in multicast there’s no TCP-style “server”. Instead, there’s a sender engine for many receiver clients.

Suppose we have too many clients. To minimize per-client state management, my engine would simply multicast to all clients real time updates + periodic snapshots.

A client AA can request a snapshot on any symbol group, and I will immediately multicast the snapshots on a refresh channel. If client BB never requests anything, BB can ignore the refresh multicast channel.

Request quota — each client like AA can request X free snapshots a day. Beyond that, AA’s requests would be regulated like

  • levied a fee
  • queued with lower priority

It’s feasible to replace the refresh multicast group with a unicast UDP channel per client, but to me the engine multicast offers clear advantages without disadvantages.

  1. if there is an outage affecting two clients, each would request the same snapshots, creating unnecessary work for the engine. The request quota would incentivize each client to monitor the refresh group and avoid sending the same request as someone else
  2. multicast can be valuable to clients who like more frequent snapshots than the periodic.
  3. My operations team can also utilize this refresh channel (or the publisher channel?) to broadcast unsolicited (FreeOfCharge) snapshots. For this task, I would avoid using the publisher channel as it can be busy sending real time updates. We don’t want to interleave real time and snapshot messages.

%%CV profile cf 200x era

Until my early 30’s I avoided hot enterprise technologies like java/c++/SQL so my rating in the "body-building contest" was rather low. Analogy?

* An electronics engineering graduate stuck in a small, unsuccessful wafer fab
* An uneducated pretty girl unable to speak well, dress well.

Now my resume features java/c++/py/SQL and algo trading, quant, latency … and I have some accumulated insight on core c++/c#, SQL, sockets, swing, ..

[19] body-build`can hit low visPgress#leverage,disengaged

In Singapore (and very few NY) jobs, I noticed a pattern — the longer I stayed in a stable job, the lower my motivation, incentives and positive feedback/reinforcement for IV body-building (QQ, coding..) dropped.

Every time I pass a non-trivial tech screening, I feel a real boost … Reinforcement of absorbency and reinforcement of a wellspring of energy lasting a few days to a few months … sometimes years thanks to my retrospective blogging. My Singapore experience is missing this crucial IV element. Without it, my absorbency of dry technical learning is hard to sustain. This also explains why my absorbency of localSys is hard to sustain.

(My son has never experienced such positive reinforcement.)

To gain perspective I find it necessary to compare with other endeavors. My conclusion — body-building has the highest leverage. See also meaningful endeavor(Now)4family: IV^zbs^gym..

Whenever I feel guilty/ashamed of my fixation on IV, and try to learn zbs, localSys, GTD etc,  eventually (often quickly) I hit a broken reinforcement loop and a weak or depleted “energy supply” and invariably give up, very rationally.

Q: is there any /endeavor/ with higher visPgress than IV body-building?

| chores that require absorbency | visPgress #immediate | $ROTI | leverage over long-term, on family well-being |
| --- | --- | --- | --- |
| body-building | Yes, in the form of blog+github… not $$ [1] | reasonably high, given huge $gain and huge t-investment | 😦 higher leverage than everything else combined [2], despite the churn |
| … cf localSys | highly visible | respect, not $$ | necessary but insufficient condition for move-up |
| non-prop investment | easily visible but small | low, given the small profit | no leverage so far |
| yoga (+ fitness) | some, but hard to keep | $zero | high leverage, well-known |
| diet for BMI | highest, but hard to keep | $zero | 😦 low since it’s hard to keep up |

[1] I think many of my peers can’t keep up the body-building effort precisely because there’s no $ROTI… personal projects: any ROI ] salary@@
[2] direct fruits of the endeavor:

  • made my nice TPY home possible
  • enabled me to move between countries
  • made GC possible
  • gave wife the SPR then Singapore citizenship
  • gave me decent salary
  • supported my property investments

ibank c# shops moving to java #Ellen

A Singapore banking recruiter shared with me her observation. I had earlier spoken to her a few times and she had earned my respect for her insight.

She said a few Singapore ibank departments were using c# before but now hiring java developers instead. I said c# is a newer language and I have never heard of such a migration. She didn’t address that but said java is apparently more open-source than c#.

I think this is a relatively isolated case. Some time in the past or future you may see java shops moving to c#.

She is very close to Standard Chartered bank (a c++ shop, as she acknowledged) and said SCB is hiring more java developers than before.

She said nowadays just about the only Singapore employers of c++ are HFT shops. I disagree. SCB, Macq and some ibanks still use some c++.

She said java is now the dominant language for internet companies, to my surprise. She said there’s now improving mobility between java developers in ibanks vs internet shops. I think she meant the big-data java roles, not core-java roles.

She said the SG banking job pool is dominated by java — out of 10 jobs, 9 are java jobs — “crazy” as she said. I guess less than half of those 9 jobs are coreJava.

hands-on dev beats mgr @same pay

BA, project mgr, even mid-level managers in some companies can earn the same 160k salary as a typical “developer” role. For a manager in finance IT, salary is often higher, but for a statistically meaningful comparison I will use a 160k benchmark. Note in finance IT or tech firms, 160k is not high, but on main street many developer positions pay below 160k.

As stated in other blogposts, at the same salary, developers enjoy higher mobility, more choices, higher career security…

jvm heap histogram simple console tool

The simplest among many similar tools. This type of analysis is easier in java than in c++ because jvm manages memory allocations. In fact, GC can even relocate objects.

~$jcmd test GC.class_histogram
58161:

num #instances #bytes class name
----------------------------------------------
 1: 1020 93864 [C
 2: 481 54856 java.lang.Class
 3: 526 26072 [Ljava.lang.Object;
 4: 13 25664 [B
 5: 1008 24192 java.lang.String
 6: 79 5688 java.lang.reflect.Field
 7: 256 4096 java.lang.Integer
 8: 94 3760 java.lang.ref.SoftReference
 9: 91 3712 [I
 10: 111 3552 java.util.Hashtable$Entry
 11: 8 3008 java.lang.Thread
...

REST^SOAP

REST stands for Representational State Transfer … it basically means that each unique URL is a representation of some object. You can get the contents of that object using an HTTP GET, and use a POST, PUT, or DELETE to modify the object (in practice most services use a POST for this).

— soap vs REST (most interviewers probably focus here) —

  • REST has only GET POST PUT DELETE; soap uses custom methods “setAge()” etc
  • SOAP takes more dev effort, despite its name
  • SOAP used to dominate enterprise apps, though XR used REST in ibanks.

–real REST URLs

https://restfulapi.net/resource-naming/ shows some examples. It also says

URIs should not be used to indicate that a CRUD function is performed. URIs should be used to uniquely identify resources and not any action upon them. HTTP request methods should be used to indicate which CRUD function is performed.

HTTP GET http://api.example.com/device-management/managed-devices  //Get all devices
HTTP POST http://api.example.com/device-management/managed-devices  //Create new Device
HTTP GET http://api.example.com/device-management/managed-devices/{id}  //Get device for given Id
HTTP PUT http://api.example.com/device-management/managed-devices/{id}  //Update device for given Id
HTTP DELETE http://api.example.com/device-management/managed-devices/{id}  //Delete device for given Id

parent/child pairs→tree algos #Indeed

As a speed-coding test, this problem requires you to apply common computer science constructs to a realistic problem, and then challenges you

“How many seconds do you need to build a working directed graph from raw data, and run BFT/DFT on it?”

45 minutes given. The target is to complete the first 2 questions with working code. Can some candidates complete all 3 questions? I guess so.

Q: You are given some raw data like

parent_child_pairs = [ (1, 3), (2, 3), (3, 6), (5, 6), (5, 7), (4, 5), (4, 8), (8, 10), (11,2) ]

Suppose we have some input data describing a graph of relationships between parents and children over multiple generations. The data is formatted as a list of (parent, child) pairs, where each individual is assigned a unique integer identifier.

For example, in this diagram, 3 is a child of 1 and 2, and 5 is a child of 4:

  11
   \
1   2   4
 \ /   / \
  3   5   8
   \ / \   \
    6   7   10

Q1: write a function to output all individuals having no parents(like 1 and 4) in a list, and another list of individuals having a single parent (like 8 and 7)
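
For Q1, here is a minimal java sketch (my own, not an official solution): count parents per individual, then bucket by zero or one.

import java.util.*;

public class ZeroOrOneParent {
    public static void main(String[] args) {
        int[][] pairs = {{1,3},{2,3},{3,6},{5,6},{5,7},{4,5},{4,8},{8,10},{11,2}};
        Map<Integer, Integer> parentCount = new HashMap<>();
        for (int[] p : pairs) {
            parentCount.putIfAbsent(p[0], 0);          // parent seen; ensure it has an entry
            parentCount.merge(p[1], 1, Integer::sum);  // child gains one parent
        }
        List<Integer> zero = new ArrayList<>(), one = new ArrayList<>();
        for (Map.Entry<Integer, Integer> e : parentCount.entrySet()) {
            if (e.getValue() == 0) zero.add(e.getKey());
            else if (e.getValue() == 1) one.add(e.getKey());
        }
        System.out.println(zero + " " + one); // zero: 1,4,11   one: 2,5,7,8,10
    }
}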

Q2: write a bool function to determine if two named individuals have any common ancestor. 3 and 6 yes; 3 and 1 no!

I wrote a DFT solution .. https://github.com/tiger40490/repo1/blob/py1/py/tree/commonAncestor_Indeed.py Not very efficient, but I really should care less about that since the real challenge is .. timely completion. I was not used to writing DFT on the spot within minutes, but I hacked it together under a ticking clock — first time in my career!

To find if two sets intersect, I was forced to make a quick judgment call to write my own loop.

  • I didn’t know if there’s a simple and reliable solution online
  • i didn’t know how much effort is required to search online and understand it
  • i didn’t know how much effort is required to adapt standard solution to suit my needs
  • My own loop means more legwork but more control, in case requirements turn out to be non-standard.

Q3: (original wording) Write a function that, for a given individual in our dataset, returns their earliest known ancestor — the one at the farthest distance from the input individual. If there is more than one ancestor tied for “earliest”, return any one of them. If the input individual has no parents, the function should return null (or -1).

Sample input and output:

findEarliestAncestor(parentChildPairs, 8) => 4
findEarliestAncestor(parentChildPairs, 7) => 4
findEarliestAncestor(parentChildPairs, 6) => 11
findEarliestAncestor(parentChildPairs, 1) => null or -1

Indeed: hearsay algo questions#speedPractice

Q: The first interview consisted of interviews with two separate people. The first asked for a function that takes in a string of words separated by spaces and prints the first duplicate word. The second asked for a function that takes in two sorted unique int arrays and returns an array of the duplicates between the two arrays.

Q: Implement an LRU (Least-Recently-Used) cache. The question as told: a cache stores just 50 objects; how do you keep those 50 elements unique, such that adding a 51st kicks out the least recently used of the 50? A sketch follows.
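A minimal sketch of the standard design (names mine): a doubly-linked list keeps usage order, and a hashmap gives O(1) lookup into the list:

#include <list>
#include <unordered_map>
#include <utility>

class LruCache {
    size_t cap;
    std::list<std::pair<int,int>> items;   // front = most recently used
    std::unordered_map<int, std::list<std::pair<int,int>>::iterator> pos;
public:
    explicit LruCache(size_t capacity = 50) : cap(capacity) {}
    void put(int key, int value) {
        auto it = pos.find(key);
        if (it != pos.end()) items.erase(it->second);    // keep keys unique
        items.push_front({key, value});
        pos[key] = items.begin();
        if (items.size() > cap) {                        // adding the 51st...
            pos.erase(items.back().first);               // ...kicks out LRU
            items.pop_back();
        }
    }
    int* get(int key) {
        auto it = pos.find(key);
        if (it == pos.end()) return nullptr;
        items.splice(items.begin(), items, it->second);  // touch: move to front
        return &it->second->second;
    }
};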

Q: Shunting Yard Algorithm

containers+MSA: cloud-affinity

I feel both the MSA architecture and the low-level container technology can have a long-term impact thanks to the cloud.

The current implementations might be replaced by newer ones in a few years, but some of the ideas may evolve. Now is still the volatile early phase… ##[17] proliferation → consolidation.. beware of churn

Both of them are easy to launch on an elastic cloud.

Both of them can be started on multiple machines, as a wolf pack, to handle a bigger workload.

## Beat Sg2011Search

  • reason: I’m now open to java
  • reason: I’m now open to non-finance
  • reason: I will only work in SG for 1.5 years till 2021 spring, so not that much at stake.
  • reason: my USD salary is much lower.
  • tactic: will accept 140k jobs without complaint
  • tactic: will seriously target a few fertile grounds — FX, mkt data…
  • tactic: will seriously prepare for architect roles
  • tactic: will fly back for job interviews
  • reason: c++/algo body-building showing

22 notable features added to c++

https://en.cppreference.com/w/cpp/language/history briefly mentions

  • [90] exception handling
  • [90] templates
  • [98] cast operators
  • [98] dynamic_cast and typeid()
  • [98] covariant return type
  • [07] boost ref wrapper .. see std::reference_wrapper
  • [11] GarbageCollector interface .. See c++ GC interface
  • [11] std::next(), prev(), std::begin(), std::end() .. see favor std::begin(arrayOrContainer)
  • [11] exception_ptr? not sure how useful
  • [14] shared_lock — a RW lock
  • [14] shared_timed_mutex .. see try_lock: since pthreads
  • [14] std::exchange — comparable to std::swap() but doesn’t offer the atomicity of std::atomic_exchange()

cpu sharing among Docker container for jvm

Note cgroup is also usable beyond jvm and Docker, but I will just focus on a jvm running in a Docker container.

Based on https://jaxenter.com/nobody-puts-java-container-139373.html

CPU shares are the default CPU isolation (or sharing??) and basically provide a priority weighting across all cpu time slots across all cores.

The default weight value of any process is 1024, so if you start a container as follows q[ docker run -it --rm -c 512 stress ] it will receive fewer CPU cycles than a default process/container.

But how many cycles exactly? That depends on the overall set of processes running at that node. Let us consider two cgroups A and B.

sudo cgcreate -g cpu:A
sudo cgcreate -g cpu:B
sudo cgset -r cpu.shares=768 A   # cgroup A: 75%
sudo cgset -r cpu.shares=256 B   # cgroup B: 25%

Cgroup A has 768 CPU shares and B has 256. That means if nothing else is running on the system, A is going to receive 75% of the CPU time (768/(768+256) = 0.75) and B the remaining 25%.

If we remove cgroup A, then cgroup B would end up receiving 100% of CPU shares.

https://access.redhat.com/documentation/en-us/red_hat_enterprise_linux/6/html/resource_management_guide/sec-cpu has more precise details.

https://scoutapp.com/blog/restricting-process-cpu-usage-using-nice-cpulimit-and-cgroups compares q(nice), cpulimit and cgroups. It provides more precise info on cpu.shares.

cpulimit can be used on an existing PID 1234:

cpulimit -l 50 -p 1234 # limit process 1234 to 50% of cpu timeslots. The remaining cpu timeslots can go to other processes or go to waste.

Leetcode contest #Rahul

  • don’t look at ranking
  • yoga — I CAN keep up this practice. This practice is good for my mental health and family well-being
  • yoga — I feel even if I don’t improve visibly, the fact that my participation count is increasing means I’m improving
  • if I don’t do the contest then I may not do any coding drill at all
  • What if I give up after one or a few episodes?
  • impact on family well-being?
  • leverage over long term?

In [[who moved my cheese]], we discover the world has changed. The harsh reality is, in this profession, your experience (like doctors) is heavily discounted. Your current runtime performance is easily benchmarked, just like a painter, or pianist, or chef.

##Java9 features #fewer than java8

  1. #1 Most important – modular jars featuring declarative module-descriptors i.e. requires and exports
  2. #2 linux cgroup support.. For one example, see Docker/java9 cpu isolation/affinity
  3. #3 G1 becoming default JGC.. CMS JGC: deprecated in java9
  4. REPL JShell
  5. private interface methods, either static or non-static
  6. Minor: C++11 style collection factory methods like

List<String> strings = List.of("first", "second");


It’s unbelievable but not uncommon in Java history —

  • Java9 release introduced significantly fewer and less impactful features than java8.
  • Similarly, java5 overshadows java6 and java7 combined


jvm spawns(approx)32 GC threads on 32-core

Based on https://jaxenter.com/nobody-puts-java-container-139373.html

jvm would spawn 32 GC threads on a 32-core box [1].  As of 2018, this is the default, but you can change it with jvm parameters like -XX:ParallelGCThreads and -XX:ConcGCThreads

Java 9 introduced automatic detection of cpu-set when jvm runs in a cgroup (such as a docker container) with a cpu-set. For example, JVM detects there are 3 cores in the cpu-set, and spawns 3 GC threads.

[1] presumably for the parallel GC or the CMS GC but I’m not sure. https://docs.oracle.com/javase/8/docs/technotes/guides/vm/gctuning/parallel.html and https://blog.codecentric.de/en/2013/01/useful-jvm-flags-part-6-throughput-collector/ say for the parallelGC the default would be 5/8*32 + 3 = 23 threads.

wasm: a distant threat to javascript

https://medium.com/javascript-scene/what-is-webassembly-the-dawn-of-a-new-era-61256ec5a8f6 is a good WebAssembly intro.

  • wasm is an in-browser language, like javascript
  • wasm offers low-level (web-ASSEMBLY) programming constructs to complement javascript
  • I feel wasm will only be considered by extreme-performance browser apps. It’s too low-level, too inconvenient to be popular.
  • How much market share will javascript lose to wasm? My guess: 0.1% to 0.5%
  • The fact that wasm is cited as the main challenger to javascript means javascript is unchallenged as the dominant choice on the browser

##[19] cited strengths@java language

In this post we compare c++, python, javascript, c#

  • Scalability [1] — James Governor has a saying: “When web companies grow up, they become Java shops”. Java is built with scalability in mind, which is why it is so popular among enterprises and scaling startups. Twitter moved from Ruby to Java for scaling purposes.
  • community support [1] such as stackoverflow —
  • portability [1] — on Linux and Android.
  • versatile — for web, batch jobs, and server-side. MS is a java shop, using java for trading. But DDAM is not a typical MS app — DDAM has many batch jobs, while the UI is all in web java.
  • Java has high correlation with fashionable technologies — hadoop; cloud; big data; microservices… Python and javascript are also in the league.
  • proven —
    • web apps are the biggest market segment. Some (js/php/ruby) of the top 10 most popular languages are used exclusively for web. Java is more proven than c#, python, c++.
    • enterprise apps (complex business logic + DB) are my primary focus. java is more proven than python, javascript, php, c#

[1] https://stackify.com/popular-programming-languages-2018/ explains Java’s popularity


scaffolding around try{} block #noexcept

Updates — Stroustrup has more to say about noexcept and the scaffolding in http://www.stroustrup.com/C++11FAQ.html#noexcept

[[ARM]] P358 says that all local non-static objects on the current call stack fully constructed since start of the try-block are “registered” for stack unwinding. The registration is fine-grained in the form of partial destruction —

  • for any array with 3 out of 9 objects fully constructed, the stack unwinding would only destruct those 3
  • for a half constructed composite object with sub-objects, all constructed sub-objects will be destructed
  • Any half-constructed object is not registered since the dtor would be unsafe.

I guess this registration is an overhead at run time.

For the local non-static objects inside a noexcept function, this “registration” is not required; if an exception does escape such a function, std::terminate() runs and those destructors may never run.
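Here's a toy demo (mine, not from [[ARM]]) of the fine-grained partial destruction: if the 3rd array element's ctor throws, unwinding destructs only the 2 fully constructed elements:

#include <cstdio>
#include <stdexcept>

struct Widget {
    static int count;
    int id;
    Widget() : id(count++) {
        if (id == 2) throw std::runtime_error("ctor #3 fails");
        std::printf("constructed #%d\n", id);
    }
    ~Widget() { std::printf("destructed #%d\n", id); }
};
int Widget::count = 0;

int main() {
    try {
        Widget arr[9];                  // construction aborts at element #2
    } catch (const std::exception& e) { // only #1 then #0 get destructed
        std::printf("caught: %s\n", e.what());
    }
}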


long-term value: QQ imt ECT speed

The alpha geeks — authors, experts, open source contributors … are they fast enough to win those coding contests?

The speed-coding contest winners … are they powerful, influential, innovative, creative, insightful? Their IQ? Not necessarily high, but they are nobody if not superfast.

The QQ knowledge is, by definition, not needed on projects — usually obscure, deep, theoretical or advanced technical knowledge. As such, QQ knowledge has some value, arguably more than ECT speed.

Some say a self-respecting programmer need some of this QQ knowledge.

RAII phrasebook

See [[ARM]] P358 and [[Essential C++]] P199.

  • local — a local non-static object is required. See [[ARM]]
  • dtor — is required.
  • stack unwinding — either by exception or normal return. Note noexcept may skip stack unwinding.
  • partial destruction — see other blog posts
  • scaffolding — see other blog posts
  • exception guarantee — RAII is the foundation of exception guarantees
  • exception strategy — RAII is the best exception strategy
  • double-exception — what if an unhandled exception triggers unwinding, but en-route a new exception is born? No good strategy.
  • == for memory management .. RAII is the #1 most important memory management technique.
  • memory leak prevention
  • smart ptr — example of RAII for memory management.
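A minimal sketch of a home-made RAII guard (in real code prefer std::unique_ptr / std::lock_guard), just to illustrate dtor-based cleanup during stack unwinding; the file path is hypothetical:

#include <cstdio>
#include <memory>

class FileGuard {
    std::FILE* f_;
public:
    explicit FileGuard(const char* path) : f_(std::fopen(path, "r")) {}
    ~FileGuard() { if (f_) std::fclose(f_); } // runs on return OR unwinding
    std::FILE* get() const { return f_; }
    FileGuard(const FileGuard&) = delete;
    FileGuard& operator=(const FileGuard&) = delete;
};

void parse() {
    FileGuard in("/tmp/quotes.txt");           // hypothetical path
    auto buf = std::make_unique<char[]>(4096); // smart ptr: RAII for memory
    // ... code here may throw; both dtors still run during unwinding
}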

##tech skill superficial exposure: top5eg

see also exposure: semi-automatic(shallow)Accu #$valuable contexx and low-complexity topics #JGC..

  • –(defining?) features (actually limitations) of superficial exposure
  • accu — (except bash, sql) I would say very few of the items below offer any accu comparable to my core-c++ and core-java accu
  • entry barrier — (except SQL) is not created since you didn’t gain some real edge
  • commodity skill —
  • traction — spinning the wheel
  1. JGC — I think most guys have only textbook knowledge, no GTD knowledge.
  2. lockfree — most guys have only textbook knowledge
  3. [e] spring, tibrv — xp: I used it many times and read many books but no breakthrough. In-depth knowledge is never quizzed
  4. bash scripting — xp: I read books for years but only gained some traction by writing many non-trivial bash scripts
  5. SQL — xp: 5 years typical experience is less than 1Y@PWM
  6. –other examples (beware oth)
  7. [e] EJB, JPA (hibernate), JMS, Hadoop(?)
  8. memory profilers; leak detectors
  9. design patterns — really just for show.
  10. Ajax
  11. [e] Gemfire
  12. java latency tuning — is an art not a science.
    • Poor portability. JVM tuning is mostly about GC, but at the application level the real problem is usually I/O — data store + serialization. Data structures + threading also matter, yet GC receives a disproportionate share of the limelight.
    • conclusion — superficial, textbook knowledge in this skillset probably won’t help with latency tuning.
  13. [e=ecosystem, like add-on packages outside the core language.] Ecosystem skills rarely enable deep learning and traction, as employers don’t need deep insight. GTD is all they need.

friend to a class template

Similar to java reflection, void pointers, reinterpret_cast etc, friend is a powerful keyword in real projects, with GTD power. The rules are a bit tricky when you combine friend + template.

I think this is a relatively useful IV knowledge pearl as it demonstrates a fundamental concept — each template instantiation (“concretized template class”) is treated as a distinct class.

[[ARM]] P351 stipulates that

* A simple non-template friend introduced into a class template is a friend to ALL instances of the class template.

That’s the simple case.

* Now another common scenario — the class template C2 has a dummy type T, and the friend is another template F3 parameterized with the same type T. Now there’s a one-to-one friendship:

  • F3<float> is a trusted friend to C2<float>
  • F3<Acct> is a trusted friend to C2<Acct>
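A sketch of the two friendship patterns (all names invented):

#include <iostream>

template <typename T> class C2;          // the class template

template <typename T>
class F3 {                               // F3<T> befriends C2<T> one-to-one
public:
    int peek(const C2<T>& c) { return c.secret; }
};

class Auditor {                          // a simple non-template friend
public:
    template <typename T>
    int peek(const C2<T>& c) { return c.secret; }
};

template <typename T>
class C2 {
    int secret = 42;
    friend class Auditor;                // friend to ALL instances of C2
    friend class F3<T>;                  // friend to C2<T> only
};

int main() {
    C2<float> cf;
    C2<int>   ci;
    std::cout << Auditor().peek(cf) << Auditor().peek(ci) << '\n'; // both OK
    std::cout << F3<float>().peek(cf) << '\n';  // OK: matching T
    // F3<float>().peek(ci);   // error: F3<float> is NOT a friend of C2<int>
}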


noexcept impact on RAII

If a function (esp. a dtor) is declared noexcept, the compiler can omit the stack-unwinding “scaffolding” around it. Among other things, there’s a runtime performance gain. This gain is a real advantage of noexcept over the empty throw() specification, which is deprecated since c++11.

Q: Is there any impact on RAII?
%%A: yes

Q: Can we even use RAII in such a context?
%%A: I think we can. If the function does throw, then std::terminate() runs, instead of the destructors

move()doesn’t move..who does@@

(Note whenever I say mvctor i mean move-ctor and move-assignment.)

As explained in other posts on this blog, move() is a cast, not performing a physical “steal” at runtime, so

Q: what c++11 constructs performs a physical steal?
%%A: mvctor

  • move() ⇏ steal — std::move() may not trigger a physical steal — If you call myContainer.insert(std::move(myString)) but your custom String class has no mvctor, then copy-ctor is picked by compiler .. see P21 [[Josuttis]].
  • steal ⇏ move() — a physical steal may not require std::move() — if you have a naturally occurring rvalue object.
  • steal => mvctor
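A sketch of the first two bullets (MyString is my toy class; a user-declared copy-ctor suppresses the implicit mvctor):

#include <iostream>
#include <string>
#include <utility>
#include <vector>

struct MyString {                        // has NO move-ctor
    std::string s;
    MyString(const char* p) : s(p) {}
    MyString(const MyString& rhs) : s(rhs.s) {
        std::cout << "copy-ctor picked\n";
    }
};

int main() {
    std::vector<MyString> v;
    MyString m("hello");
    v.push_back(std::move(m)); // move() but no steal: prints "copy-ctor picked"

    std::string a = "a long string .......... to defeat SSO";
    std::string b = std::move(a);  // std::string HAS a mvctor: a real steal
    std::cout << "a after the steal: [" << a << "]\n"; // valid but unspecified
}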

##fellow techies’workload stress when not slower than teammates

I won’t put on “t_gzPain” or t_stress, as these factors are not stressors in my career! The opinions below are valid based on “their” personal experiences, not my experience. As such, I should NOT overthink these.

  • Daniel Yan of RTS cited the effect of imposed deadline by exchange
  • Several IT friends cited the fire-fighting mode
  • CSDoctor said in some companies like Huawei, there was an unhealthy culture of long office hours, eroding family time, but i think this is rare nowadays.

2 heaviest work stressors provides an incisive introduction to the same topic.

I tend to believe that “high performance” teams tend to emphasize GTD and intensity (“productivity” and “efficiency”) … remember Macq’s Kevin. He is highly efficient and he feels I’m not productive enough, even though there’s no one else to compare me with.

However, my Barcap workload was not so heavy but high-impact 🙂

My conclusion at the beginning and end of this analysis — figuring things out faster (or slower) than team colleagues is still the primary determinant of stress/respect/stigma.

success@concurrency features] java^c++^c#

I’m biased towards java.

I feel c# concurrency is less impactful because most of the important concurrent systems use *nix servers not windows, and most concurrent programming jobs do not go to windows developers.

Outside windows, c++ concurrency is mostly based on the C library pthreads, non-OO and inconvenient compared to java/c#

The c++11 thread classes are the next generation following pthreads, but not widely used.

Java’s concurrency support is the most successful among languages, well-designed from the beginning and rather stable. It’s much simpler than c++11 thread classes, having only the Thread.java and Runnable.java data types. More than half the java interviews would ask threading, because java threading is understandable and usable by the average programmer, who can’t really understand c++ concurrency features.

starting thread in ctor #Lea

Q: is it good practice to call new Thread(..).start() in MyClass ctor? This is common practice in my projects. I feel there are always alternatives.

I feel 30% confident this blog has another blogpost on this subject, to be combined.

[[DougLea]] points out the potential danger of starting thread in your ctor, esp. if subclasses are beyond your control.

The visibility effect is equivalent to parent thread releasing an implicit lock, and acquisition by run() on new thread.

Frequently, the constructed MyClass instance is used by the new thread, but is MyClass fully constructed? It is if the new Thread(..).start() is last line in ctor, but what if this ctor runs as part of a subclass ctor?
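A c++ rendition of the same hazard (toy code, mine) — the new thread can observe a partially constructed object:

#include <thread>
#include <vector>

struct Base {
    std::thread t;
    Base() : t(&Base::run, this) {}  // DANGER: 'this' escapes before the
                                     // subclass ctor has run
    virtual ~Base() { t.join(); }
    void run() { /* may read members Derived hasn't initialized yet */ }
};

struct Derived : Base {
    std::vector<int> data{1, 2, 3};  // initialized AFTER Base's ctor returns
};

int main() { Derived d; }            // run() may race with Derived's ctor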

%%perfect hash: 1000 known string keys #ME-sg %50

See also string bucket-sort

Q: similar to FIX tags, you are given a published dictionary of 5000 short string keys, each up to 10-characters. Every message consists of about 100 key/payload pairs. Design a parser for these messages. Note every key would be in the dictionary.

Simple solution 1: a BST like AVL or RBT to hold the pairs. O(log N), N=5000 … 5000 keys require about 12 levels. Possibly faster than a hashtable.

Simple solution 2: a standard hashtable to hold the pairs. The hashcode for a string is usually fast enough, with minimal collision if the load factor is set sufficiently low. Even with zero collision, this requires allocating 5000 linked nodes in memory.

In this case since the key strings are known in advance, I believe we can design a perfect hashing algorithm without collision. Interviewer said perfect hashing is hard in this case.

Solution 3: perfect hashing. Group the key strings by length, into 10 groups:

  • If the number of distinct 8-char keys is small, for instance, we can use switch-case
  • for 1-char string keys, we maintain a dedicated bucket array (“table”) indexed by the ascii code
  • for 2-char string keys I can have a nested 2D array. Given key string “px”, “p” identifies the level2 array and within it, “x” identifies the bucket for the payload.
  • for 3-char keys, ditto. (For the 1-char and 2-char lookups, see the sketch below.)
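Here's a minimal sketch of those direct lookups for the 1-char and 2-char groups (names invented; ascii keys assumed):

#include <array>
#include <string>

struct Payload { int tag = -1; };

std::array<Payload, 128> oneChar;                    // indexed by ascii code
std::array<std::array<Payload, 128>, 128> twoChar;   // nested 2D array

Payload* lookup(const std::string& key) {
    switch (key.size()) {
    case 1: return &oneChar[(unsigned char)key[0]];
    case 2: return &twoChar[(unsigned char)key[0]][(unsigned char)key[1]];
    default: return nullptr;  // longer keys: grouped hashing, described below
    }
}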

For those 10-char string keys, a 10-level nested array would be too large. Here’s my strategy. Assuming a lot of key strings (say, 3000 out of total population of 5000) are 4-char long, let’s tackle this sub-problem first:

  • for 4-char keys, I would designate an index range like 1 to 6000 (6000 buckets) to hold 3000 keys in this group. I would compute an initial hashcode := (char1*31+char2)*31+char3… % 6000 for this group of 3000 key strings and check for collision within this group. If there’s collision, then adjust some of the prime numbers and retry. Since load factor is 0.5 I hope we can avoid collision within this group. If it’s hard to avoid collision, I can reduce load factor, or introduce my collision resolver[2].
  • for 5-char keys, I would designate an index range like 6001 to 7800 (1800 buckets) to hold 900 keys in this group. I would compute initial hashcode := (char1*31+char2)*31+char3… %1800 for this group of keys and check for collision within this group.
  • So far, we have designed with a big combined bucket array. As an alternative, if there are only 7 key groups [4-char, 5-char, .. 10-char] then we can use seven dedicated bucket arrays, but what if 7000 groups? 7000 bucket arrays? I think so. Memory footprint is same as a combined bucket array.

[2] Here’s a simple collision resolver. Suppose Strings A and B (and C, D..) all hash to the same initial-hashcode of 55. I need to find another bucket for B. After I compute all the initial-hashcode values for each of the 3000 key strings, I can easily find an unused bucket like 77. I will simply add an if-block to use bucket 77 for key B. Now this scheme is probably faster than open-addressing and separate chaining since the calc logic can be manually optimized, held in instruction-cache, or burned into FPGA. In contrast,

  • Separate chaining requires runtime memory access to follow the linked nodes. We can only hope the linked nodes are cached
  • open addressing requires runtime probing. Our load factor of 0.5 is not very good so open addressing can become slow.

http://www.dre.vanderbilt.edu/~schmidt/PDF/gperf.pdf is a published perfect hashing solution.

q[static thread_local ] in %%production code

static thread_local std::vector<std::vector<std::string>> allFills; // I have this in my CRAB codebase, running in production.

Justification — In this scenario, data is populated on every SOAP request, so keeping them as non-static data members is doable but considered pollutive.

How about static field? I used to think it’s thread-safe …

When thread_local is applied to a variable of block scope, the storage-class-specifier static is implied if it does not appear explicitly. In my code I make it explicit.

criticalMass[def]against churn ] tech IV

See also body-building impact{c++jobs#^self-xx

IV knowledge Critical mass (eg: in core java self-study) is one of the most effective strategies against technology churn in tech interviews. Once I accumulate the critical mass, I don’t need a full time job to sustain it.

I have reached critical mass with core java IV, core c++ IV, swing IV (no churn) and probably c# IV.

The acid test is job interviews over a number of years.

Q: how strongly is it (i.e. critical mass) related to accumulation?
A: not much AFTER you accumulate the critical mass. With core java I did it through enough interviews and reading.

Q: how strongly is it related to leverage?
A: not much though Critical mass enhances leverage.

Q: why do some domains offer no critical mass?
A: some interview topics (jxee) have limited depth
A: some interview topics (TMP, py) have no pattern I could identify from interview questions.


c++^bigData^crypto #SG

Workload factors (figure-out-faster-than, team caliber, manager expectation …) are far more important, so let’s not spend too much time comparing:

  • –c++
  • see blog .. my view on java
  • –jxee/bigData
  • 🙂 many “commodity” jobs but often paying rather well.
  • 😦 i feel as a “specialization” this is low quality — high churn, low depth, low complexity,
  • 🙂 but I might be able to leverage it for future positions. I think my full-time python, full-time SQL, web app dev experiences provided the same leverage, albeit no-depth
  • –crypto —
  • 🙂 many small shops capable of paying
  • 😦 low chance to develop deep knowledge

c++complexity≅30% mt java

Numbers are just gut feelings, not based on any measurement. I often feel “300% more complexity” but it’s nicer to say 30% 🙂

  • in terms of interview questions, I have already addressed in numerous blog posts.
  • see also mkt value@deep-insight: java imt c++
  • — syntax/compiler complexity — c++ >> c# > java
  • java is very very clean yet powerful 😦
  • C++ has too many variations, about 100% more than c# and 300% more than java
  • — core language details required for GTD:
  • my personal experience shows me c++ errors are more low-level.
  • Java runtime problems tend to be related to the (complex) packages you adopt from the ecosystem. They often use reflection.
  • JVM offers many runtime instrumentation tools, because JVM is an abstract, simplified machine.
  • — opacity — c++ > c# > java
  • dotnet IL bytecode is very readable. Many authors reference it.
  • java is even cleaner than c#. Very few surprises.
  • — more low-level — c++ > c# > java.
  • JVM is an excellent abstraction, probably the best in the world. C# CLR is not as good as JVM — it’s a thin layer above the windows OS.

POSIX semaphore can grow beyond initial value #CSY

My friend Shanyou pointed out a key difference between c++ binary semaphore and a simple mutex — namely ownership. Posix Semaphore (binary or otherwise) has no ownership, because non-owner threads can create permits from thin air, then use them to enter the protected critical section. Analogy — NFL fans creating free tickets to enter the stadium.

https://stackoverflow.com/questions/7478684/how-to-initialise-a-binary-semaphore-in-c is one of the better online discussions.

  • ! to my surprise, if you initialize a posix semaphore to 1 permit, it can be incremented to 2. So it is not a strict binary semaphore.

https://github.com/tiger40490/repo1/blob/cpp1/cpp/thr/binarySem.cpp is my experiment using linux POSIX semaphore. Not sure about system-V semaphore. I now think a posix semaphore is always a multi-value counting semaphore with the current value indicating current permits available.

  • ! Initial value is NOT “max permits” but rather “initial permits”

I feel the restriction by semaphore is just advisory, like advisory file locking. If a program is written to subvert the restriction, it’s easy. Therefore, a “binary semaphore” is binary IFF all threads follow protocol.
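A minimal demo of the permit inflation, using the linux POSIX semaphore api:

#include <cstdio>
#include <semaphore.h>

int main() {
    sem_t s;
    sem_init(&s, /*pshared=*/0, /*initial permits=*/1); // a "binary" semaphore?
    sem_post(&s);       // a non-owner creates a 2nd permit from thin air
    int val = 0;
    sem_getvalue(&s, &val);
    std::printf("value = %d\n", val);   // prints 2 -- not capped at 1
    sem_destroy(&s);
}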

https://stackoverflow.com/questions/62814/difference-between-binary-semaphore-and-mutex claims “Mutex can be released only by thread that had acquired it, while you can signal semaphore from any non-owner thread (or process),” This does NOT mean a non-owner thread can release a toilet key owned by another thread/process. It means the non-owner thread can mistakenly create a 2nd key, in breach of exclusive, serialized access to the toilet. https://blog.feabhas.com/2009/09/mutex-vs-semaphores-–-part-1-semaphores/ is explicit saying that a thread can release the single toilet key without ever acquiring it.

–java

[[DougLea]] P220 confirms that in some thread libraries such as pthreads, release() can increment a 0/1 binary semaphore value to beyond 1, destroying the mutual exclusion control.

However, java binary semaphore is a mutex because releasing a semaphore before acquiring has no effect (no “spare key”) but doesn’t throw error.

TCP_NODELAY to improve latency

https://stackoverflow.com/questions/3761276/when-should-i-use-tcp-nodelay-and-when-tcp-cork

Nagle’s algo helps in applications like telnet. However, it may increase latency when sending streaming data. Additionally, if the receiver implements the ‘delayed ACK policy’, it will cause a temporary deadlock situation. In such cases, disabling Nagle’s algorithm is a better option.
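A minimal sketch of disabling Nagle's algorithm on an existing TCP socket fd:

#include <netinet/in.h>
#include <netinet/tcp.h>
#include <sys/socket.h>

void disableNagle(int fd) {
    int one = 1;   // TCP_NODELAY=1: send small segments immediately,
                   // instead of coalescing them
    setsockopt(fd, IPPROTO_TCP, TCP_NODELAY, &one, sizeof(one));
}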

temp object binding preferences: rvr,lvr..

Based on https://www.codesynthesis.com/~boris/blog/2012/07/24/const-rvalue-references/

  • a const L-value reference … … can bind to a naturally-occurring rvalue object
  • a const rvalue reference (rvr) can bind to a naturally-occurring rvalue object but canNOT bind to an lvalue object

Q: so what kind of reference can naturally occuring temp objects bind to?
A: a const rvr. Such an object prefers to bind to a const rvalue reference rather than a const lvalue reference

There are some obscure use cases for this binding preference.
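A sketch of the binding preference, using overload resolution to reveal the winner:

#include <iostream>
#include <string>

void f(const std::string&)  { std::cout << "const lvr\n"; }
void f(const std::string&&) { std::cout << "const rvr\n"; }

int main() {
    std::string s = "x";
    f(s);                   // lvalue: only the const lvr can bind
    f(std::string("tmp"));  // naturally occurring temp: const rvr preferred
}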

short-term ROTI ] java IV: tuning imt threading

Analogy — physics vs chemistry in high school — physics problems involve more logic and abstraction; chemistry is more factual. Similarly, Java threading challenges are more complex, while Java tuning knowledge-pearls are relatively easy to memorize and show off.

“Short-term” here means 3-5Y horizon.

I’m many times more successful with java threading — accu, deepening knowledge, and a lead over the pack.

I hope to build some minor zbs in JVM tuning, mostly GC tuning. However, many important parts of JVM tuning skills suffer from churn and poor shelf life.

mktData direct^Reuter feeds: realistic ibank set-up

Nyse – an ibank would use direct feed to minimize latency.

For some newer exchanges – an ibank would get Reuters feed first, then over a few years replace it with direct feed.

A company often gets duplicate data feed for the most important feeds like NYSE and Nasdaq. The rationale is discussed in [[all about hft]]

For some illiquid FX pairs, they rely on calculated rates from Reuters. BBG also calculates a numerically very similar rate, but BBG is more expensive. BBG also prohibits real-time data redistribution within the ibank.

mktData parser: multi-threaded]ST mode #Balaji#mem

Parsers at RTS: a high-volume c++ system I worked on has only one thread. I recall some colleague (Shubin) saying the underlying framework is designed in single-threaded mode, so if a process has two threads they must share no mutable data. In such a context, I wonder what’s better

  A. EACH application having only one thread
  B. each application having 2 independent threads sharing nothing

I feel (A) is simple to set up. I feel if the executable + reference data footprint is large like 100MB, then (B) would save memory since the 100MB can be shared between threads.

–Answer from a c++ trading veteran
Two independent instances is much simpler time/effort wise. Introducing multiple threads in a single threaded application is not trivial. But I do not see any value if you create multiple threads which do not share any (mutable) data.

–My response:
Executable and reference data take up memory. 1) I feel executable can take up 10MB+, sometimes 100MB. 2) If there’s a lot of immutable reference data loaded from some data store, they can also add megabytes to process footprint. Suppose their combined footprint is 100MB, then two instances would each need 100MB, but the multi-threaded design would need only 100MB in a single instance hosting multiple threads.

My RTS team manager required every market data parser process to reduce footprint below 60MB or explain why. My application there used 200MB+ and I had to find out why.

Apparently 100MB difference is non-trivial in that context.

3 real overheads@vptr

Suppose your class Trade has virtual functions and a comparable class Order has no virtual functions. What are the specific runtime overheads of the vptr/vtable usage?

  1. cpu cache efficiency — memory footprint of the vptr in each object. If you have a lot of Trade objects with only one char data field, then the vptr greatly expands footprint and you overuse cache lines.
  2. runtime indirection — the non-virtual call in the Order usage is “a few memory references more efficient” [1]
  3. inlining inhibition is the most significant overhead. P209 [[ARM]] says inline virtual functions make perfect sense so it’s best to bypass vptr and directly call the virtual function, if possible.

[1] P209 [[ARM]] wording

Note a virtual function unconditionally introduces the first overhead, but the #2/#3 overheads can sometimes be avoided by a smart compiler.
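A tiny demo of overhead #1, the footprint (typical 64-bit numbers, not guaranteed by the standard):

#include <iostream>

struct Order { char side; };                      // no virtual functions
struct Trade { char side; virtual ~Trade() {} };  // has a vptr

int main() {
    std::cout << sizeof(Order) << '\n';  // typically 1
    std::cout << sizeof(Trade) << '\n';  // typically 16: 8-byte vptr
                                         // + char + padding
}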


consolidate into single-process: low-latency OMS

Example #2 — In a traditional sell-side OMS, a client FIX order propagates through at least 3 machines in a chain —

  1. client order gateway
  2. main OMS engine
  3. exchange gateway such as smart order router or benchmark execution engine supporting VWAP etc

The faster version consolidates all-of-the-above into a single Process, cutting latency from 2ms to 150 micros .. latency in eq OMS

Example #1– In 2009 I read about or heard from interviewers about single-JVM designs to replace multi-stage architecture.

Q: why is this technique not used on west coast or main street ?
%%A: I feel on west coast throughput outweighs latency. So scale-out is the hot favorite. Single-JVM is scale-up.

How RTS achieved 400-700 KMPS #p2p

Official benchmark result – 700 KMPS per instance of rebus or parser. Here are some enabling features:

  • ! mostly single-threaded apps. Some older parsers are multi-threaded but each thread operating in single threaded mode. No shared mutable:)
    • In the order book replication engine, we use lockfree in one place only
  • ! (ST mode loves) Stateless [1] parser, small footprint. Only downstream component (Rebus) handles cancels/busts, modifications.. So we can run many parser instances per machine, or many threads per instance, fully utilizing the cpu cores. See https://haydenjames.io/linux-performance-almost-always-add-swap-space/
  • ! allocation avoided — on millions of Output message objects. Pre-allocated ring buffer eliminates new().. pre-allocate DTOs@SOD #HFT
  • ! allocation avoided — on millions of Input message objects, thanks to reinterpret_cast() on pointers… nearly Zero-copy. See reinterpret_cast^memcpy in raw mktData parsing
  • ! allocation avoided — no STL containers used, since they all allocate from heap
  • ! p2p messaging beats MOM 
  • ! Socket buffer tuning — to cope with busts. 64-256MB receive buffer in kernel; 4MB UserModeBuffer
  • low memory footprint. Most parsers use around 60MB. (My parser used 100 to 150MB.) I think there are many benefits in terms of latency.
  • epoll — to replace select(), with 2 orders of magnitude improvement in throughput
  • buffering of Output list (of messages). I guess this hurts latency but enhances throughput
  • Very fast duplicate seq check, without large history data — a hot function
  • large containers like the ring buffer are pre-sized. No reallocation.
  • mostly array-based data structures — cache-friendly
  • Hot functions use pbref or RVO or move semantics, minimizing stack allocations
  • aggressively simplified parsing. Minimal logic
  • Special Logger to reduce I/O cost
  • Sharding to speed up symbol lookup
  • kernel bypass : RTS
  • No virtual functions — enhancing data cache and inlining .. 3 overheads@vptr #ARM
  • Inline
  • —-Speed with control
  • Best of both to reduce the real risk of our connection becoming slow or lossy
  • Depth monitor to detect missed messages
  • Capacity tracker to monitor mps rates
  • Missed seq reporting
  • [!=one of the really effective techniques in RTS]

[1] the parser does remember the sequence numbers. In fact sequence number management is crucial. Before retrans was added to the parser, we never sent any retrans request. We did keep track of gaps. Mostly we used the gap-tracker to check duplicates across Line A/B. We also relied on our sequence number state to handle sequence resets.
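Here's a sketch of the reinterpret_cast zero-copy idea (struct layout invented; real handlers also mind endianness, alignment and strict-aliasing):

#include <cstdint>

#pragma pack(push, 1)
struct MsgHeader {
    std::uint16_t msgType;
    std::uint32_t seqNum;
    std::uint16_t bodyLen;
};
#pragma pack(pop)

const MsgHeader* parseHeader(const char* buf) {
    // no allocation, no memcpy — just overlay the struct on the buffer
    return reinterpret_cast<const MsgHeader*>(buf);
}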

mutex ] rebus: 3designs #RCU #CSY #80%

Requirement: Each worker thread operates independently on its own symbols, never interfering. Admin thread may read/write all symbols.

My friend CSY’s design resembles ConcurrentHashMap — split a big “graph-container” into 32 independent sub-graphs, to be locked individually.

For my 1st design, I briefly considered one lock per symbol. I think 100,000 locks is fine if symbols are consecutive integers. I simply use a symbol as array index to locate the corresponding lock. Every worker thread and the admin thread need to acquire the lock for symbol X before updating the AVL tree of X.

— Here’s my 2nd design, based on a single global lock:

  • Regular writer threads only check, never acquire, the lock. If the lock is in-use (i.e. taken by the admin thread) then wait in a 1-second-sleep loop.
  • The admin thread immediately and unconditionally takes the lock and waits for a grace period[1] before writing. I assume this “write” can take up to 1 minute.
  • how to “check” the lock?
    • TryLock solution [Design#1] — Just trylock and immediately unlock.
    • LockFree solution [Design #2] — atomic boolean to be set only on the admin thread. I think this is one the simplest usage scenarios of atomic boolean. We are not even relying on the atomicity. We only need the visibility feature.

[1] This grace period is designed to be sufficient for a batch of updates on the worker thread. If we know a batch can write 100 orders and take X microsecond, then set grace period to X. We require worker routine to recheck the lock after each batch.

My friend CSY raised the concern that mid-stream the update could be suspended by kernel scheduler. I think in this case, a simple integer update could take ages, as in the GC stop-the-world scenario. So the design requires some realtime kernel support to guarantee timely response.

read-copy-update lockfree +! retry #RCU mentions a grace period used in RCU api

— Here’s a bigger lockfree design [Design #3]

  • without realtime kernel
  • Two global variables — in addition to the atomic boolean flag, I will add an atomic int “activeWorkerCnt”
  • Two non-reentrant routines (to avoid incremental acquisition)
Worker threads routine:
1. check the flag until it’s false
2. increment activeWorkerCnt
3. update data store
4. decrement activeWorkerCnt
Admin thread routine:
1. unconditionally set the flag # rejecting all new workers
2. while activeWorkerCnt > 0: sleep 1 millisecond ## draining the active worker pool
3. update data store
4. unset the flag
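A minimal sketch of Design #3 with std::atomic (variable names invented):

#include <atomic>
#include <chrono>
#include <thread>

std::atomic<bool> adminActive{false};   // the flag
std::atomic<int>  activeWorkerCnt{0};

void workerRoutine() {
    while (adminActive.load()) std::this_thread::yield(); // 1. check the flag
    activeWorkerCnt.fetch_add(1);                         // 2. register myself
    /* 3. update data store for my own symbols */
    activeWorkerCnt.fetch_sub(1);                         // 4. deregister
}

void adminRoutine() {
    adminActive.store(true);             // 1. reject all new workers
    while (activeWorkerCnt.load() > 0)   // 2. drain the active worker pool
        std::this_thread::sleep_for(std::chrono::milliseconds(1));
    /* 3. update data store across all symbols */
    adminActive.store(false);            // 4. unset the flag
}
// caveat (mine): a worker could pass step 1, get preempted, then increment
// after the admin's drain-check. Closing that window needs e.g.
// increment-first-then-recheck on the worker side.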

close(): with timeout or non-blocking #CSY

I was right that close() can be non-blocking even on a TCP socket. It is the default.

When you specify SO_LINGER, you can also give a timeout to control how long close() blocks on a TCP socket

https://stackoverflow.com/questions/6418277/how-to-close-a-non-blocking-socket

So there are 3 modes on a TCP socket:

  • close() blocking with timeout via SO_LINGER
  • close() blocking without timeout
  • close() nonblocking
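A sketch of driving those modes via SO_LINGER:

#include <sys/socket.h>

void setLinger(int fd, int onoff, int seconds) {
    struct linger lg;
    lg.l_onoff  = onoff;    // 1 = close() blocks (lingers)
    lg.l_linger = seconds;  // timeout in seconds; 0 = abortive close (RST)
    setsockopt(fd, SOL_SOCKET, SO_LINGER, &lg, sizeof(lg));
}
// default (no SO_LINGER): close() returns immediately and the kernel
// delivers any queued data in the background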

[19]with3sockets: select^non-block recv#CSY

Update — CSY said when you increase from 3 to 99, the difference becomes obvious.

Q (TradeWeb mkt data team): given 3 non-blocking receive-sockets, you can either read them sequentially in a loop or use select(). What’s the difference?

http://compgeom.com/~piyush/teach/4531_06/project/hell.html compares many alternatives. It says Non-block socket read() is least efficient as it wastes CPU.

If you have 3 receive-sockets to monitor, select()/poll() is designed just for your situation.

Using read() instead, you may try specifying a timeout on each read(), like

// set a 5-second receive timeout per socket via SO_RCVTIMEO
while(1){
recv(sock1);
recv(sock2);
recv(sock3);
}
Low latency systems can’t use this design because while you wait 5 seconds on socket1, you miss new data on socket2 😦

Therefore, low-latency systems MUST disable SO_RCVTIMEO. In such a design, the loop wastes cpu in a spin-wait.

–another advantage to select(): you can process a high-priority socket earlier when select() says multiple sockets are ready.

I gave this answer on the spot.
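For completeness, a sketch of the select() alternative:

#include <algorithm>
#include <sys/select.h>

void pump(int sock1, int sock2, int sock3) {
    for (;;) {
        fd_set rd;
        FD_ZERO(&rd);
        FD_SET(sock1, &rd); FD_SET(sock2, &rd); FD_SET(sock3, &rd);
        int maxfd = std::max({sock1, sock2, sock3});
        // blocks until at least one socket is readable — no spin-wait
        if (select(maxfd + 1, &rd, nullptr, nullptr, nullptr) <= 0) continue;
        // can service a high-priority socket first if multiple are ready
        if (FD_ISSET(sock1, &rd)) { /* recv(sock1, ...) */ }
        if (FD_ISSET(sock2, &rd)) { /* recv(sock2, ...) */ }
        if (FD_ISSET(sock3, &rd)) { /* recv(sock3, ...) */ }
    }
}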

recv()/send() with timeout #CSY

I was right — timeout is supported not only in select()/poll(), but also in read/recv/send..

SO_RCVTIMEO and SO_SNDTIMEO

Timeouts only have effect for system calls that perform socket I/O (e.g., read(2), recvmsg(2), send(2), sendmsg(2)); timeouts have no effect for select(2), poll(2), epoll_wait(2), and so on. These latter functions only check socket status… no I/O.

I believe this socket option has no effect on non-blocking sockets. A nonblocking socket can be created with either

  • O_NONBLOCK in fcntl() or
  • SOCK_NONBLOCK in socket()
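A sketch of giving recv() a 5-second timeout on a blocking socket:

#include <sys/socket.h>
#include <sys/time.h>

void setRecvTimeout(int fd) {
    struct timeval tv;
    tv.tv_sec  = 5;
    tv.tv_usec = 0;
    setsockopt(fd, SOL_SOCKET, SO_RCVTIMEO, &tv, sizeof(tv));
    // a later recv() returns -1 with errno EAGAIN/EWOULDBLOCK on timeout
}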

nonVirtual1() calling this->virt2() #templMethod

http://www.cs.technion.ac.il/users/yechiel/c++-faq/calling-virtuals-from-base.html has a simple sample code. Simple idea but there are complexities:

  • the given print() should never be used inside the base class ctor/dtor. In general, I believe virt2(), like any virtual function, behaves non-virtually in ctor/dtor.
  • superclass now depends on subclass. The FAQ author basically says this dependency is by-design. I believe this is template-method pattern.
  • pure-virtual is probably required here.
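A minimal sketch of the pattern, following the names in the blog title:

#include <iostream>

class Base {
public:
    void nonVirtual1() {          // the fixed "template method"
        std::cout << "pre-steps\n";
        this->virt2();            // dynamic dispatch to the subclass
        std::cout << "post-steps\n";
    }
    virtual ~Base() = default;
private:
    virtual void virt2() = 0;     // pure virtual: subclass must customize
};

class Concrete : public Base {
    void virt2() override { std::cout << "subclass-specific step\n"; }
};

int main() {
    Concrete c;
    c.nonVirtual1();   // but never call virt2() from Base's ctor/dtor!
}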

strided list-traversal: excel-formula/python/boost #JackZ

=OFFSET(
$A$1,
(COLUMN()-COLUMN($C$2))*3,
  (COLUMN()-COLUMN($C$2))
)

The (COLUMN()-COLUMN($C$2)) part generates a natural number sequence 0, 1, 2 … as the formula is filled rightward from $C$2, so the row offset advances with a stride of 3.

## marketable syntax nlg: c++ > j/c#

Every language has poorly understood syntax rules, but only in c++ did these become fashionable — halos in job interviews!

  • ADL
  • CRTP
  • SFINAE
  • double pointers
  • hacks involving void pointers
  • operator overloading to make smart ptr look like original pointers
  • TMP hacks using typedef
  • TMP hacks using non-type template param
  • universal reference vs rvr
  • rval: naturally occurring vs moved
    • const ref to extend lifetime of a naturally occurring rval object

FIX^tibrv: order flow

Background — In a typical sell-side, a buy/sell order goes through multiple nodes of a flow chain before it goes out to the trading venue. The order message is the payload in a messaging platform.

FIX and Tibrv (and derivatives) are dominant and default choices as messaging platforms. Both are highly adaptable to complex combinations of requirements.

  • transport — FIX: TCP; tibRV: UDP
  • fan-out — FIX: unicast; tibRV: multicast pub/sub
  • latency — FIX: designed for low-latency eq trading; tibRV: popular for bond trading
  • editable payload — FIX: order msg gets modified along the flow chain; tibRV: ?

tick`95%mark #Kam #60%

“Ticking 95th percentile server” is the blog title I would use. Original question is on paper/pencil, scanned and sent to my gmail, from my friend Deepak CM. I find this problem rather realistic with practical usage, rather than fake and contrived. I treat it as a System Design + implementation question.

Q: Using only std library, write a c++ program to process a live stream of 128,000,000 (or much more) double numbers, representing temperature readings not necessarily unique. As the temperatures come in, print the current 95th percentile on-demand. I call it the lucky “winner”. We can use the nearest-rank percentile definition.

====Idea 1: the “given unsorted ints, find median in O(N)” algorithm is for the median but can be tweaked for any percentile — unfortunately, not “ticking”
====design 2, optimized for high volume of updates like 128 million updates, not optimized for frequent query
The entire temperature range is divided into non-overlapping segments, each represented by a segment-head temperature i.e. the lower bound [1b]. Each segment has a range (i.e. distance to next segment head), size (i.e. item count) and density (i.e. size/range ratio). We care about “size” only.

We need a RB-tree containing P=1024 [1] nodes, each an unsorted container[3]. The RB-tree serves to maintain the containers i.e segments.

Each incoming temperature is quickly “routed” to the correct container and simply appended therein, incrementing its size.

Upon a query request, we will use the latest segment sizes to build a cumulative profile, and run an O(logP) binary search to identify the one segment containing the “winner”. This segment size would hopefully be much smaller than 128,000 [2] and far more /tractable/.

–Within the chosen segment of size S, we can use a vector to sort the temperatures in O(S logS) and identify the winner. After completing a query, the chosen container will have become (partially) sorted, helping subsequent queries if this segment is chosen again.

If x% of the queries hit this segment, then I will convert this “favorite segment” to a RB-tree.

Alternatively, we can also use the O(S) algorithm in Idea 1, but the container won’t become sorted.

–priming

[2] 128,000 is 1/1024th of the original sample size… not ideal. The segments need to be initialized carefully, during a priming phase, inspired by the JIT compiler. Assuming we know the total sample size is 128 million, I will use the first 100,000 temperatures to select the 1024 segment heads. The segments are structured not for equal length or equal size. In fact the low segments can be very long and very large. Rather, the segment heads are chosen so that between the 94th percentile and 96th percentile we have 1/4 of all segments. These small segments will be much smaller than 128,000 and quicker to manipulate.

–Foot notes:

Q: what if some of the containers grow too big, like three times 128,000,000/1024?
A: Not a problem unless the winner happens to be in such a container.
A: One idea is to split up such a container when we notice it, and grow a new node on the RB-tree. std::map::insert() can take a hint for the position where new node can be inserted. Note we don’t want to split a node JJ too early since JJ may not grow any further subsequently and may end up no larger than other nodes as other nodes keep growing.

[1] Sizing of P — First we estimate the total sample size. If unknown, then set P:=1024 so all nodes stay in L1-cache (typically 32KB). If we assume 16 bytes/node (8-byte pointer to container + 8-byte double), then 32KB can hold about 2000 nodes.

If query becomes more frequent, I can increase P by 1024, sacrificing insertion.

[1b] The node values are “lower-bounds” and don’t have to be really the minimum of the original temperatures in the container. We can probably cheat with 4-byte floats, and we get away with 2700 twelve-byte tree nodes.


[3] slist vs vector — a vector might be faster due to pre-allocation, provided a node never grows beyond its capacity. Vector has reserve(). (Note resize() is the wrong choice.)
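A minimal sketch of design 2 (mine; priming omitted, and a linear cumulative scan instead of the binary search, for brevity):

#include <algorithm>
#include <map>
#include <vector>

class TickingPercentile {
    std::map<double, std::vector<double>> segments; // head -> unsorted bucket
    long total = 0;
public:
    void addHead(double head) { segments[head]; }   // done during priming

    void insert(double t) {                         // O(logP) route + append
        auto it = segments.upper_bound(t);
        if (it != segments.begin()) --it;           // segment whose head <= t
        it->second.push_back(t);                    // below-range t falls
        ++total;                                    // into the first segment
    }

    double query95() {                              // nearest-rank definition
        if (total == 0) return 0;                   // no data yet
        long rank = (total * 95 + 99) / 100, seen = 0;
        for (auto& [head, bucket] : segments) {     // cumulative profile
            if (seen + (long)bucket.size() >= rank) {
                long k = rank - seen - 1;           // k-th smallest in bucket
                std::nth_element(bucket.begin(), bucket.begin() + k, bucket.end());
                return bucket[k];
            }
            seen += bucket.size();
        }
        return 0;                                   // unreachable
    }
};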

%% absorbency: reading+blogging imt coding

I can read dry QQ topics for hours each day for many days, but tend to lose steam after coding for a few hours.

My coding drill drains my “laser” energy faster than QQ reading/blogging. When my laser is weakened (by boredom), I must drill harder to go through the “brick walls”.

–I guess many fellow programmers enjoy coding more than reading. I feel lucky that in my interviews, knowledge tests still outweigh coding tests.


interrupts=hardware interrupts #by default

In operating system jargon, “interrupts” means hardware interrupts by default. Timer hardware, NIC, keyboard/mouse, and possibly hard disk … can all generate hardware interrupts.

“Interrupt signal” often refers to the electrical signal, never the higher-level SIG* signals in unix (unsupported in Windows AFAIK).

There is also a well-known concept of “software interrupt”. I see them as fake interrupts, simulated in software. My book [[linux kernel]] says software interrupts are employed for 1) system calls 2) debugger breakpoints.

A key difference between software vs hardware interrupts — hardware interrupts come from devices with their independent power supply and clock signal, so hardware interrupts can hit CPU any time, even in the middle of a CPU clock cycle. Software interrupts won’t. Software interrupts are simulated by code running in the same CPU, driven by the same clock signal.

In other words, hardware interrupts are unsynchronized with CPU, whereas software interrupts are synchronized with CPU.

missing q[ return ] in non-void function #Josh

I told Josh about this real Scenario: my non-void function has 3 control paths, one of them without a return statement.

g++ usually gives a warning under “-Wall”. Flowing off the end of a function is equivalent to a return with no value — undefined behavior in a non-void function.

If you print the function return value and the 3rd control path executes, you get undefined behavior.
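A toy reproduction of the scenario:

// g++ -Wall warns: "control reaches end of non-void function"
int sign(int x) {
    if (x > 0) return 1;
    if (x < 0) return -1;
    // 3rd control path (x == 0): flows off the end — undefined behavior
}   // printing sign(0) reads a garbage "return value"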

https://stackoverflow.com/questions/3402178/omitting-return-statement-in-c

hardware mutex, based@XCHG instruction

[[OS by William Stallings]] P214 is concise and clear.

The atomic XCHG instruction on Intel processors can swap a local variable (held in cpu register) with a main memory location visible to all threads.

This XCHG instruction alone can implement a simple mutex usable by multiple threads on multiple processors sharing the same main memory.

Another atomic instruction, atomic test-n-set instruction, can also by itself implement mutex.

Both mutexes have only academic value because their limitations are severe:

  • even with a single mutex, deadlock is possible at the hardware level — ThrA in critical section can get interrupted and preempted — perfectly normal. Now ThrA has to wait for high-priority ThrH to complete, but ThrH is busy-waiting (wheel-spinning) to enter the same critical section 😦
    • Note in this example there’s no “lock object” per-se, only a critical section in code !
  • by default, unsuccessful threads like ThrH are stuck in busy-wait — wasting CPU and generating heat 😉
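Still, a sketch of such a spinlock in modern c++ — std::atomic<bool>::exchange typically compiles down to the atomic XCHG (or a LOCK-prefixed instruction) on x86:

#include <atomic>

class SpinMutex {
    std::atomic<bool> taken{false};
public:
    void lock()   { while (taken.exchange(true)) { /* busy-wait */ } }
    void unlock() { taken.store(false); }
};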


##FX/eq projects ] PWM, for SG IV

—-fx or eq
* Swing app to monitor multiple live database tables and auto-refresh upon new quotes/trades and position updates.
* Post-trade Affirmation for complex derivative deals. A workflow application. Operations verify deal attributes by phone and Affirm on the deal to send out trade confirm.
* Statistical feed “blender” (with multiple purposes), needed to remove outliers among diverse indicative eFX forward quotes. Outliers tend to mess up our auto pricers, tiered pricers and cross rate pricers.
* (Ledger balance?) PnL report. Nightly batch to apply marks on option positions and compute PnL. Reports both clients and firm accounts.
* Volatility surface fitter for FX (& equity index/single-names). Volatility smile curve fitter. Quote inversion. Option position mark-to-market, driven by live quotes from multiple live eFX markets.
* Workflow engine for error trade reconciliation/chargeback. A single error ticket can compute PnL and chargeback across multiple (up to 100) error trades and cover trades. Every quarter, this engine reconciles hundreds of error trades whose total values add up to millions of dollars.

* Deal entry screen for trade booking after a retail dealer executes over phone. If trade validation fails, we notify the dealer to amend and resubmit. Same swing app is also used by spot/swap/options dealers to book voice trades.
—-eq
system? data normalization to take care of splits and dividends
system? basic back testing of simple strategies
system? re-sampling

* (EOS) web-based trade/order entry, with multiple validations. Used primarily by investment advisers across Americas, Europe and Asia. Thousands (up to 5-figure) of orders received daily.
** showing real time bid/ask from GS + bid/ask from trading venues
** supports most order types — limit/market/FOK/IOC/SL

* swing-based real time order book display. Updates to displayed order books come from the backend OMS via real time messaging.
* Also contributed to the OMS backend. Orders originate exclusively from private bank clients. We then monitor their cancels/amends and execution reports (via firm-wide messaging hub) from exchanges, ECN and GS private liquidity pool. Total message volume up to 6 figures.

—-fx
We are the interface to multiple ECNs to 1) receive quotes, enrich and publish to PWM clients 2) forward client orders to ECN in 2-leg spread trades 3) execute trades received from ECN 4) generate and validate (against tri-arb) cross rates using ECN quotes. Also contributed to auto quoters and RFQ engine. Supervised and monitored the progress of high-frequency eFX applications. Performed eFX development activities such as requirement gathering, design, testing and deployment. Personal contributions included
* Tiered quote pricer. Each FX customer falls into one of the tiers. When a currency pair updates in our pricer, all registered clients would get a new quote (??delivered to their Financial Advisors??). Silver tier clients receive a better bid/ask spread than regular clients; Gold tier gets the best quote.
* eFX rate update/distribution engine to downstream real time risk, PnL, position marking systems
* eFX option real time risk report (unfinished). Option position risk calc is rather slow, so each user selects a limited number of positions into his watch-list. Watched positions get periodically updated based on live ECN rates to show real-time risk.
* (questionable project) Compute cross rates for PWM trades that are large, recurring and involving illiquid currencies. Cross rates computation from daily close USD buy/sell rates.
————
— blender
http://www.olsendata.com/fileadmin/Publications/Tech_Papers/FXBlenderDoc_01.pdf (C:\0x\xz_ref)
outliers are particularly damaging to market making systems.
input – transaction prices and indicative quotes
output – only indicative quotes

— cross rate calculator

What we compute are backdated cross rates. A PWM (non-trading) client agrees on a currency conversion AAA/BBB. She agrees on the AAA amount, and waits for a few hours/days to get the actual BBB amount, just like cross-currency credit card payment —

MasterCard exchange rates are based on multiple market sources (such as Bloomberg, Reuters, Central Banks, and others). These rates are collected during our daily rate setting process. The exchange rates displayed on the web site are derived from the buy and sell rates included in the MasterCard daily T057 Currency Conversion Rate File.

MasterCard applies the USD as unique reconciliation currency to manage all currency conversions globally. Due to possible rounding differences, the published calculated cross-rates may not precisely reflect the actual rate applied to the transaction amount when converting to the cardholder billing amount. The calculated cross-rates will be loaded onto the MasterCard web site on a daily basis.

MasterCard applies the exchange rate to transactions at the time of settlement, not at the point of authorization of the sale.

–ECN interface
(my code doesn’t use FIX. Our volume is lower. conversion from FIX to xml is encapsulated in a library)

FXAall, Currenex, Hotspot, TradeWeb , Bloomberg. Example applications:
* auto quoters
* price feeds
* market data feeds
* post trade feeds
* execution gateways

components@kernel beside deviceDrivers #signals,interrupts

Q: exactly what constitute the linux kernel?

[[OpSys by William Stallings]] P99 has a nice diagram showing most of the components in linux kernel are device drivers and similar software

  • NIC driver
  • char device driver — for keyboard
  • block device driver — for disk
  • physical memory interface
  • device interrupt handlers

What are the other components of kernel?

1. system calls — the highest level in the kernel
1. Signals — the Signal definition and handlers are at the same level as syscalls, but at a higher level than the device interrupt module.
2. TCP/IP module —-  is a level between syscalls .. and .. NIC driver
2. virtual memory —– is a level between syscalls .. and .. physical memory interface
2. filesystem module – is a level between syscalls .. and .. block device driver

Lastly, scheduler — is a highly visible component of kernel. Scheduler relies critically on the device interrupt module.

avoid lopsided BST

Self-balancing is not a theoretical nicety but an essential BST feature. Without it a BST could easily become lopsided and all operations would slow down.

If for any reason (I don’t know any) we can’t use an AVL or RBT, then we could randomize the input order and insert them (without deletion) to get a fairly balanced BST, with expected depth O(log N).

intervalTree: classic RBTree to find overlapp`event

Notable detail — a non-balancing BST won’t give logN performance, since the tree depth may degrade

Q1: given N event objects {start, end, id}, use a RBTree to support a O(logN) query(interval INT) that returns any one event that overlaps with INT, or NULL if none.

If the end time == start time in the input INT, then it becomes

Q2: given N event objects {start, end, id}, use a RBTree to support a O(logN) query(timestamp ii) that returns any one event that’s ongoing at time ii, or NULL if none.

P312 [[intro to algorithms]] describes an elegant solution.

The text didn’t highlight — I think the end time of the input interval INT is … ignored. (Therefore, in my Q2, input ii is a single timestamp, not an event.) Precisely because of this conscious decision, the tree is sorted by event start time, and the additional payload (in each node N) is the highest end time among all (including N itself) nodes of entire subtree rooted at N.

The text illustrates and proves why this design enables a clean and simple binary search, i.e. after each decision, we can safely discard one of “my” two subtrees. Here’s my (possibly imperfect) recall:

Let’s suppose each time the current node/event is not “ongoing”. (If it is then we can return it 🙂 So here’s the gist of the algorithm:

Compare ii to the left child L’s payload. As defined, this payload is the highest end time of L + all its sub-nodes. In other words, this payload is the highest end time of all events starting before me.

  • case: If the payload is lower than ii, then we know all events on my left have ended, so we can discard the left subtree completely. The only way to go is down the right subtree.
  • the other case: if payload is higher or equal to ii, then we know left subtree contains some event(s), known as “candidates”, that will end after ii. We don’t know if those candidates have started. So why can we discard entire right subtree?
    • Suppose some candidate is ongoing, then we can safely discard right subtree
    • Suppose none of those candidates has yet started, then we know L has not started, myself has not started, and all right side events have not started, since tree is sorted by start time.
    • conclusion — in either case, we can discard right subtree.


EnterpriseServiceBus phrasebook #ESB

Mostly based on https://www.it.ucla.edu/news/what-esb

  • enterprise — used in big enterprises, but now out of fashion
  • HTTP/MQ — HTTP is the more popular transport than MQ
  • MOM — I think this middleware is a separate process. https://en.wikipedia.org/wiki/Enterprise_service_bus#ESB_as_software says MOM vendors call their products ESB
  • async — no synchronous http call between client and server
  • latency — perhaps not popular for real time trading, which prefers FIX
  • SOA — ESB jargon was created along with SOA
  • jxee — you may need to know this jargon in the jxee job market.
  • churn 😦

c++ecosystem[def]questions are tough #Deepak

C++ interviewers may demand c++ ecosystem knowledge, but java also has its own ecosystem of add-on packages.

As I told my friend and fellow c++ developer Deepak CM,

  1. c++ecosystem QQ questions can be more obscure and tougher than core c++ questions
    • tool chain — compiler, linker, debugger, preprocessor
    • IPC, socket, pthreads and other C-level system libraries
    • kernel interface — signals, interrupts, timers, device drivers, virtual memory + system programming in general # see the blog category
    • processor cache tuning
    • (at a higher level) boost, design patterns, CORBA, xml
    • cross-language integration with python, R, pyp, Fortran + other languages
  2. java ecosystem QQ questions are easier than core java questions. In other words, toughest java QQ questions are core java.
    • java ecosystem questions tend to come-n-go, i.e. high-churn

Low level topics are tough

  1. c++ ecosystem questions are mostly in C and very low-level
  2. java ecosystem questions are usually high-level
    • JVM internals, GC … are low-level and core java


xchg between register/memory triggers fencing

I guess xchg is used in some basic concurrency constructs, but I need to research more.

See my blogpost hardware mutex, based@XCHG instruction

https://c9x.me/x86/html/file_module_x86_id_328.html points out the implicit locking performed by the CPU whenever one of the two operands is a memory location. The other operand must be a register.

This implicit locking is quite expensive according to https://stackoverflow.com/questions/50102342/how-does-xchg-work-in-intel-assembly-language, but presumably cheaper than user-level locking.

This implicit locking involves a memory fence.
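
As a minimal sketch, here’s a spinlock in that spirit. I’m assuming std::atomic_flag::test_and_set lowers to an implicitly locked instruction such as xchg on x86, which is typical but compiler-dependent:

```cpp
#include <atomic>

class SpinLock {
    std::atomic_flag flag = ATOMIC_FLAG_INIT;
public:
    void lock() {
        // atomic exchange; on x86 this typically compiles to an implicitly
        // LOCKed instruction, which also acts as a full memory fence
        while (flag.test_and_set(std::memory_order_acquire)) { /* spin */ }
    }
    void unlock() { flag.clear(std::memory_order_release); }
};
```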

competitiveness: Pre-teen sprinter + HFT selectivity

I was a decent sprinter from age 6 through 13; then I realized I was not good enough to compete at the Beijing municipal level.

There are quite a number of HFT shops in Singapore, and I think they have a higher bar than ibanks. At ibank interviews I felt again like that pre-teen sprinter; HFT interviews are generally too hard for me.

represent graph: matrix ^ edgeSet #forward_list

Josh’s book [[data structures and other objects using c++]] has 5 pages devoted to three best-practice data structures representing a generic directed graph. (I think my [[discrete math]] also covers the same topic.)

  1. boolean adjacency matrix — amat
  2. edge set — eset
  3. edge list — can be implemented using std::forward_list, marginally better than a set… not discussed here
  • iterating all edges connected to a node — eset wins
  • sparse graph — amat wastes memory
  • test linkage between 2 nodes — amat wins
  • ease of implementation — amat wins
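
A minimal sketch of the first two representations, using the amat/eset names from the comparison above (the code is mine, not Josh’s):

```cpp
#include <unordered_set>
#include <vector>

constexpr int N = 5;  // nodes identified by small ints 0..N-1

// 1) boolean adjacency matrix: O(1) linkage test, but O(N^2) memory even when sparse
std::vector<std::vector<bool>> amat(N, std::vector<bool>(N, false));

// 2) edge set: one neighbor-set per node; cheap to iterate a node's edges
std::vector<std::unordered_set<int>> eset(N);

void addEdge(int from, int to) {  // directed edge
    amat[from][to] = true;
    eset[from].insert(to);
}

bool linked(int a, int b) { return amat[a][b]; }  // where amat wins
```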

ABA problem ] lockfree designs #DeepakCM

  • breaks basic CAS assumptions
  • most common solution is [1]
  • affects c++ CAS only, probably not java CAS

[1] https://lumian2015.github.io/lockFreeProgramming/aba-problem.html and http://moodycamel.com/blog/2014/solving-the-aba-problem-for-lock-free-free-lists explain the “tagged state reference” aka “tagged pointer” (sketched below). I feel I don’t need to understand it deeply. Even if I spent some time getting a grasp, it may not be enough to impress an interviewer.

http://www.modernescpp.com/index.php/aba-a-is-not-the-same-as-a has more details, including

  • compare_exchange_strong
  • lockfree stack hitting UndefinedBehavior due to ABA
  • RCU
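
My own minimal sketch of the tagged-pointer idea (not code from those articles): a version counter travels with the pointer, so a recycled address no longer compares equal under CAS. Caveat: the 16-byte atomic may not be lock-free on every platform (check is_lock_free()):

```cpp
#include <atomic>
#include <cstdint>

struct Node { int value; Node* next; };

struct TaggedPtr {
    Node*    ptr;
    uint64_t tag;  // bumped on every successful swap
};

std::atomic<TaggedPtr> head{ TaggedPtr{nullptr, 0} };

void push(Node* n) {
    TaggedPtr oldHead = head.load();
    TaggedPtr newHead;
    do {
        n->next = oldHead.ptr;
        newHead = TaggedPtr{ n, oldHead.tag + 1 };  // stale tag => CAS fails, defeating ABA
    } while (!head.compare_exchange_weak(oldHead, newHead));
}
```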

This topic was asked briefly in Deepak’s MS c++ interview in 2019. I think it’s too arcane and low-level for most interviewers.


compile-time ^ run-time linking

https://en.wikipedia.org/wiki/Dynamic_linker describes the “magic” of linking *.so files with some a.out at runtime. This is more obscure and “magical” than compile-time linking.

“Linking is often referred to as a process that is performed when the executable is compiled, while a dynamic linker is a special part of an operating system that loads external shared libraries into a running process”

I now think whenever the technical literature mentions linking or a linker, I need to ask: “early (compile-time) linker or late (runtime) linker?”

c++changed more than java: QQ perspective

Recap — A QQ topic is a hard interview topic that’s never needed in projects.

Background — I used to feel that as new versions of an old language get adopted, the QQ interview topics don’t change much. I saw that pattern in java5, c#, perl6, python3.

To my surprise, compared to java7/8, c++0x has had a more disruptive impact on QQ questions. Why? Here are my guesses:

  • Reason: low-level — c++ is more low-level than java, at least in terms of interview topics. Both java8 and c++0x introduced many low-level changes, but the java interviewers don’t care that much.
  • Reason: performance — c++0x changes have performance impact, esp. latency impact, which is the hot focus of my target c++ employers. In contrast, java8 doesn’t have much performance impact, and java employers are less latency-sensitive.
  • Reason: template — the c++0x feature set has a disproportionate amount of TMP features, which are very hard. No such “big rock” in java.
    • move/forward, enable_if, type traits

Q: if that’s the case, for my career longevity, is c++ a better domain than java?
A: I’m still biased in favor of low-level languages

Q: is that churn?
A: yes, since the c++11 QQ topics are likely to show up less over the years, replaced by newer features.

STL containers +! pointer

  1. std::array
  2. std::string SSO i.e. small-string optimization

These containers:

  1. do not require heap storage
    • trivial special case — if this container is a nonref field of a Shape object, then q(new Shape) would still allocate the container on the heap.
    • This is similar to “Are java primitives on heap/stack?”
  2. therefore contain no pointer to separately allocated storage
  3. therefore have a move-ctor that runs in linear [1] time, compared to the constant-time move-ctor of other STL containers
    • std::array does offer a move-ctor and move-assignment, but they move element by element

[1] P205 [[effModernC++]] also addresses this.

Q: What happens when you move a std::array container with 77 Acct elements?
A: 77 individual Acct objects are moved to the new container in linear time, invoking the move-ctor of the Acct class.
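
A small sketch of that contrast, with a hypothetical Acct type:

```cpp
#include <array>
#include <string>
#include <utility>
#include <vector>

struct Acct { std::string name; };  // hypothetical payload with a real move-ctor

int main() {
    std::vector<Acct> v(77);
    auto v2 = std::move(v);    // O(1): the heap pointer is stolen; v is left empty

    std::array<Acct, 77> a{};  // elements live inside the array object itself
    auto a2 = std::move(a);    // O(77): Acct's move-ctor runs once per element
}
```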

##[19]if I were2re-select%%tsn bets #tbudget=30m

##[17] checklist@tsn #engaging provides a checklist, but here’s a FRESH look, which may not be as in-depth, balanced, or comprehensive.

Q: Hypothetically, knowing what I know now, besides c++ and python, which trySomethingNew bets would I have chosen in 2010, after GS?

  1. mkt data and sockets
  2. forex, equities, options, IRS — I like the well-defined body of dnlg as entry-barrier. I learned it fast 🙂
  3. trading engine dev including pricing, OMS, connectivity
  4. risk analytics?
  5. devops?
  6. big data including hadoop, cloud etc?
  — not-so-great —
  7. c# — heavy investment, a lot of legwork but insufficient ROTI
  8. MOM and Gemfire — shrinking demand
  9. swing? Fun but very poor job market
  10. quantDev? extremely disappointing job market
  11. HFT? entry barrier too high

##%%c++(n java)memoryMgmt trec : IV

Q: Your resume says “memory mgmt” for c++, so what did you do personally, not as a team?
A: I will talk about the features in my system. Honestly, most of them were not written by me.

  • per-class: restricting allocation to heap or stack
  • per-class: customize operator new for my class (see the sketch after this list)
  • per-class: disable operator new for my class and provide allocateFromRingBuffer() API
  • per-class: move support
  • static thread_locals
  • ring buffer to eliminate runtime allocation
  • custom smart pointers
  • memory leak detector tools like valgrind+massif .. breakdown heap/non-heap footprint@c++app #massif
  • RAII
  • pre-allocate DTOs@SOD #HFT #RTS
  • customize new_handler() — I didn’t write this
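
As an illustration of the per-class items above, a minimal sketch (my own; the pool is stubbed with malloc where a ring buffer would plug in):

```cpp
#include <cstddef>
#include <cstdlib>
#include <new>

class Order {
public:
    // per-class operator new/delete: the hook where a pool/ring buffer plugs in;
    // stubbed here by forwarding to malloc
    static void* operator new(std::size_t sz) {
        if (void* p = std::malloc(sz)) return p;
        throw std::bad_alloc{};
    }
    static void operator delete(void* p) noexcept { std::free(p); }
};

class StackOnly {
public:
    StackOnly() = default;
private:
    static void* operator new(std::size_t) = delete;    // heap allocation disabled:
    static void* operator new[](std::size_t) = delete;  // q(new StackOnly) won't compile
};

int main() {
    Order* o = new Order;  // routed through Order::operator new
    delete o;
    StackOnly s;           // stack allocation still fine
}
```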

Q: the same question, for java?

  • GC tuning
  • memory sizing
  • JDK tools like jhat
  • memory profiler tools
  • string intern
  • investigating memory leaks, all related to GC

error stack trace: j^c++

Without a stack trace, silent crashes are practically intractable.

In my java and python /career/ (c# too, I think), exceptions always generate a console stack trace. The only time this didn’t happen was with a Barclays JNI library written in c++.

In c++, getting the stack trace is harder.

  • when I used the ETSFlowAsert() construct I got a barely usable stack trace, with function names but without line numbers
  • [[safeC++]] described a technique to generate a simple stack trace with some line numbers, but I have never tried it (a glibc-based alternative is sketched after this list)
  • the standard assert() macro doesn’t generate a stack trace
  • In RTS and mvea, memory access errors lead to seg-faults and core dumps. In these contexts we are lucky, because the runtime environment (host OS, standard library, seg-fault signal handler ..) cooperates to dump some raw data into the core file, but it’s not as reliable as the JVM runtime. Here are some of the obstacles:
    • core files may be suppressed
    • To actually make out this information, we need gdb + a debug build of the binary with debug symbols.
    • it can take half an hour to load the debug symbols
    • My blogpost %%first core dump gdb session describes how much trouble and time is involved to see the function names in the stack trace.
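
Not the [[safeC++]] technique, but here’s a sketch of the glibc execinfo route; it prints mangled names, and you typically need -rdynamic at link time for names to show up at all:

```cpp
#include <execinfo.h>  // glibc-specific backtrace API

void printTrace() {
    void* frames[64];
    int n = backtrace(frames, 64);       // capture raw return addresses
    backtrace_symbols_fd(frames, n, 2);  // symbolize to stderr (fd 2)
}

int main() { printTrace(); }
```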


RTS pbflow msg+time files #wait_until #60%

Background — In RTS we relied every day on a UDP message replay tool, taking in a msg file and a corresponding time file, both binary. I saw no message delimiters in the msg file, partly because this file is produced in production; inserting delimiters would add overhead to the production parser engine.

Given there’s no delimiter, I believe the time file must hold a seek-offset value (marker) corresponding to each timestamp.

Q: how would you implement a similar replay tool?

The driver is the timestamp file. Every time I see a timestamp, I wait_until() that timestamp (adjusted to my start time). Upon wake-up, I send the corresponding chunk of bytes between the current and next markers.

I would use condition_variable::wait_until(). As noted in c++condVar 2 usages #timedWait, this function has nanosec precision.

(The mutex needed by wait_until? My program is single-threaded, so the dummy lock is held forever.)

The chunk of bytes would be sent as one UDP packet, no split. Each chunk starts with the original packet header created by the matching engine and received in the original feed.
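
A minimal sketch of the replay loop. The TimeEntry layout and the sendUdp callback are my guesses at the tool’s shape, not the actual RTS code:

```cpp
#include <chrono>
#include <condition_variable>
#include <cstddef>
#include <functional>
#include <mutex>
#include <vector>

struct TimeEntry {                    // hypothetical record in the time file
    std::chrono::nanoseconds offset;  // timestamp relative to the first packet
    std::size_t begin, end;           // seek-offset markers into the msg file
};

void replay(const std::vector<char>& msg,
            const std::vector<TimeEntry>& times,
            const std::function<void(const char*, std::size_t)>& sendUdp) {
    std::mutex mtx;                        // dummy: the program is single-threaded
    std::condition_variable cv;
    std::unique_lock<std::mutex> lk(mtx);  // the "held forever" dummy lock
    const auto start = std::chrono::steady_clock::now();

    for (const auto& t : times) {
        // absolute deadline: a slow (even blocking) send delays this iteration
        // only, so the drift doesn't accumulate into later deadlines
        cv.wait_until(lk, start + t.offset, [] { return false; });
        sendUdp(msg.data() + t.begin, t.end - t.begin);
    }
}
```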

Q2: how about TCP?

A TCP receiver can deal with partial messages very effectively, as I saw in my tests.

However, a TCP receiver can introduce delays in the sender i.e. my replay tool, so transmission will not be instantaneous. In fact, send() can block! This could lead to timer drift. That’s why I need wait_until() with absolute deadlines.

Q3: Explain timer drift?

Suppose in the time file, Timestamp #1 is 9:00:00.000001 and Timestamp #2 is one microsec later, but the data transmission takes 10 nanosec, esp. with a blocking TCP send(). This 10 nanosec is a small drift, but over many packets it adds up to microseconds. Sleeping for relative durations would accumulate that drift; waiting until absolute deadlines does not.

max salary: simple game-plan for sg

Whether I ask for a big base or a modest base, there’s a high chance of problems with manager expectations. So let’s just go for the max salary and forget about learning/tsn.

  • Some quant skill? tends to pay a premium
  • algo trading? tends to pay a premium, but I wonder how they would assess my trec.
  • java/c++ combo role? will not pay lower
  • If a HFT shop makes a real offer at S$150k base, I will decline.

UDP: send/receive buffers are configurable #CSY

It’s wrong to say UDP uses a small receive buffer but doesn’t use a send buffer.

Receive — https://access.redhat.com/documentation/en-US/JBoss_Enterprise_Web_Platform/5/html/Administration_And_Configuration_Guide/jgroups-perf-udpbuffer.html shows how to increase UDP receive buffer to 25MB

Send — https://stackoverflow.com/questions/2031109/understanding-set-getsockopt-so-sndbuf shows you CAN configure the send buffer on a UDP socket.

Send — https://www.ibm.com/support/knowledgecenter/en/SSB23S_1.1.0.15/gtpc2/cpp_sendto.html also confirms the SO_SNDBUF send-buffer size applies to UDP sockets.

In addition, the application is free to maintain its own custom buffer, e.g. a vector.
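
A small sketch of setting both buffers via setsockopt (the 25MB figure echoes the JBoss guide; note the kernel may clamp the value, e.g. per net.core.rmem_max on Linux):

```cpp
#include <netinet/in.h>
#include <sys/socket.h>
#include <unistd.h>

int main() {
    int sock = socket(AF_INET, SOCK_DGRAM, 0);
    int rcv = 25 * 1024 * 1024;  // 25MB receive buffer
    int snd = 1 * 1024 * 1024;   // yes, a UDP socket has a send buffer too
    setsockopt(sock, SOL_SOCKET, SO_RCVBUF, &rcv, sizeof(rcv));
    setsockopt(sock, SOL_SOCKET, SO_SNDBUF, &snd, sizeof(snd));
    close(sock);
}
```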

handle java OOM @runtime

https://stackoverflow.com/questions/2679330/catching-java-lang-outofmemoryerror says "only very infrequently is OutOfMemoryError the death-knell to a JVM. There is only one good reason to catch an OutOfMemoryError and that is to close down gracefully, cleanly releasing resources and logging the reason for the failure best you can (if it is still possible to do so)."

Note in a Docker container, OOM can lead to the immediate death of not only the JVM but the entire container.

identityHashCode,minimum object size,relocation by GC

https://srvaroa.github.io/jvm/java/openjdk/biased-locking/2017/01/30/hashCode.html offers a few “halo” knowledge pearls

  • every single java Object must be able to give an idHashcode on demand, even if its host class has overridden hashCode() to return a hard-coded 55.
    • hashCode() doesn’t overshadow idHashcode
  • The “contract” says an object’s idHashcode must never [2] change, even in the face of object relocations by GC. So it’s not really computed from the address. Once someone requests the idHashcode (like 4049040490), this number must be retained somewhere in the object, as per the “contract”. It is retained in the 12-byte object header (8 bytes on a 32-bit JVM).
    • Therefore, the idHashcode contributes to the minimum size of java objects.
  • contrary to common belief, the idHashcode can clash between two objects, so idHashcode is a misnomer: more “hashcode” than “identity”. https://bugs.java.com/bugdatabase/view_bug.do?bug_id=6321873 explains there are insufficient integer values given the maximum object count.
  • Note anyone can call hashCode() on the same object, and it could be overridden to bypass the idHashcode.
  • [2] in contrast, a custom hashCode() can change its value when the object state changes.