loop fusion^fission

Stroustrup told me about compiler optimization “loop fusion” when I asked how c/c++ can get beaten in benchmarks.

  • loop fusion may reduce loop overhead at runtime
  • loop fission may improve locality of reference, and enable parallelism on multi-core??

I guess this is a technique to improve raw speed.
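A minimal c++ sketch of the two transformations (my own illustration, not from Stroustrup; the array names are hypothetical):

#include <cstddef>

// loop fusion: two passes over the same range become one pass
void beforeFusion(double* a, double* b, std::size_t n) {
  for (std::size_t i = 0; i < n; ++i) a[i] *= 2;    // pass 1
  for (std::size_t i = 0; i < n; ++i) b[i] += a[i]; // pass 2 re-reads a[]
}
void afterFusion(double* a, double* b, std::size_t n) {
  for (std::size_t i = 0; i < n; ++i) { a[i] *= 2; b[i] += a[i]; } // a[i] still hot in cache
}

// loop fission: one loop touching two unrelated arrays becomes two loops,
// each streaming through a single array; the resulting loops are independent,
// so they could run on separate cores or vectorize more easily
void beforeFission(double* a, double* b, std::size_t n) {
  for (std::size_t i = 0; i < n; ++i) { a[i] *= 2; b[i] *= 3; }
}
void afterFission(double* a, double* b, std::size_t n) {
  for (std::size_t i = 0; i < n; ++i) a[i] *= 2;
  for (std::size_t i = 0; i < n; ++i) b[i] *= 3;
}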


GS-sg core IV with rare feedback

A Tech win by my own assessment.

–rare email feedback, with highlighting by me. It is clear to me the Hong Kong OMS interviewer gave the most direct (negative) feedback. I feel grateful for that.

I feel this was a technical win. I think only the Hong Kong OMS interviewer found some weakness in me.

I feel perm role selection/screening was more stringent than for contractors, because of leadership, communication, personality match, ownership… The new permanent hire is often seen as a potential manager joining the “race”.

In contrast, contractors are mostly assessed for technical + basic communication/attitude.

In SG, more than half the time I was assessed as a dev lead (or architect) but I don’t want that role. The U.S. contract market is better.

As he interviewed till the end, I’d like to share a bit of feedback.

The interviewers believed he had good industry experience (low latency, low-level coding, memory usage, CPU usage, MSA) and can implement/maintain systems, and appreciated his honesty e.g. not hiding the fact that he didn’t perform performance testing.

However, the interviewers wished to see more leadership and communication skills commensurate with 15+ years of experience, and wanted him to conduct his own tests rather than take others’ assumptions at face value.

— Tokyo interviewer
Given a BST, how do you insert and look up by key?

Lastly, how do you remove an arbitrary branch node? You need a deterministic algorithm to reshuffle the affected subtree and keep the BST property. Considering the python coding rounds too, this is about the hardest pure-algo question in the entire interview process.

  • This level of difficulty is very different from tech shops
  • This level of difficulty is much lower than the coreJava QQ questions
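For my own revision, a minimal c++ sketch of the branch-node removal asked above (the classic in-order-successor trick; Node and removeKey are my own names, not from the interview):

struct Node { int key; Node* left = nullptr; Node* right = nullptr; };

// remove a node by key; returns the (possibly new) subtree root
Node* removeKey(Node* root, int key) {
  if (!root) return nullptr;
  if (key < root->key)      root->left  = removeKey(root->left, key);
  else if (key > root->key) root->right = removeKey(root->right, key);
  else {
    if (!root->left)  { Node* r = root->right; delete root; return r; }
    if (!root->right) { Node* l = root->left;  delete root; return l; }
    Node* succ = root->right;                         // branch node with two children:
    while (succ->left) succ = succ->left;             // find the in-order successor,
    root->key = succ->key;                            // copy its key up,
    root->right = removeKey(root->right, succ->key);  // then delete the successor
  }
  return root;
}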

— QQ interviewer Ming

Q: suppose you find your prod log file suddenly stops growing, what do you do?
%%A: check disk space; check whether a DB query hung due to a pending transaction; check whether socket input/output is blocking
%%A: take a thread dump, a nice JVM feature not available to c++ or most other applications

Q: how did you tune your jvm?

Q2: how did you size the heap to ensure no jGC for the entire day?
%%A: We estimated the high watermark (like 2GB) and configured the jvm with a heap bigger than that.

This technique is simple and crude but it usually works. I think the interviewer may not have been impressed, but I was confident because … it works!

Q2b: but will that prevent GC? Do you know every GC must stop all application threads first?
%%A: at least the app thread can run after the snapshot, during the actual collection, right? I assume the snapshot is quick.

Q: in java, even without any protection, reading a 32-bit int will give you either the before-value or the after-value, right?
%%A: Not in c++. See c++11 atomic{int}^AtomicInteger.java #Bool can be half-written
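A minimal c++11 sketch of the contrast (my own illustration): with std::atomic<int> a load returns either the before-value or the after-value, never a torn mix, whereas a plain shared int would be a data race.

#include <atomic>
#include <thread>

std::atomic<int> counter{0};       // whole-int loads/stores are indivisible

void writer() { for (int i = 0; i < 1000; ++i) counter.store(i); }
void reader() {
  int snapshot = counter.load();   // a complete before- or after-value, never half-written
  (void)snapshot;
}
// with a plain (non-atomic) int shared across threads, the same read/write
// pair is a data race i.e. undefined behavior in c++

int main() {
  std::thread t1(writer), t2(reader);
  t1.join(); t2.join();
}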

Q: beside speed, what are other benefits/advantages of lock-free compared to lock-based designs?
%%A: the unsuccessful thread can do something else before retrying, rather than block indefinitely
%%A: a blocked thread (due to lock contention) is likely context-switched off the CPU. All the CPU data caches would soon be flushed.

Q: single-producer, single-consumer … can you design an efficient data structure to share data between them
Now I think we only have one shared mutable variable — the producer’s moving pointer. The interviewer hinted that all we need is for the producer to notify the consumer of the new position of the moving pointer in the data structure. Perhaps we don’t need lockfree. Volatile is probably sufficient. Alternatively, mutex+condVar is a proven solution.

Aha — If the consumer thread needs to wait (not busy-wait), then notification is the key, so condVar is more suitable than lockfree.

I tend to think of that moving pointer as an integer.
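A minimal c++ sketch of the single-producer/single-consumer idea (my own, not what I wrote in the interview): the only shared mutable variables are the two indices, published via atomics (roughly the c++ analogue of java volatile).

#include <array>
#include <atomic>
#include <cstddef>

template<typename T, std::size_t N>
class SpscRing {
  std::array<T, N> buf;
  std::atomic<std::size_t> head{0}; // written by consumer only
  std::atomic<std::size_t> tail{0}; // written by producer only
public:
  bool push(const T& item) {        // producer thread only
    std::size_t t = tail.load(std::memory_order_relaxed);
    std::size_t next = (t + 1) % N;
    if (next == head.load(std::memory_order_acquire)) return false; // full
    buf[t] = item;
    tail.store(next, std::memory_order_release); // publish the new position
    return true;
  }
  bool pop(T& out) {                // consumer thread only
    std::size_t h = head.load(std::memory_order_relaxed);
    if (h == tail.load(std::memory_order_acquire)) return false;    // empty
    out = buf[h];
    head.store((h + 1) % N, std::memory_order_release);
    return true;
  }
};

If the consumer must sleep rather than busy-wait, this is where the mutex+condVar design becomes more suitable, as noted above.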

I asked interviewer: are these concurrency knowledge and skills needed in your projects?
Interviewer: “Your CV mentioned these skills“. I believe the interviewer implicitly answered NO. Standard practice of verifying/scrutinizing claims on the CV!

–This OMS interviewer was trying to break the candidate…

Q: describe CAS. I recalled the details in https://bintanvictor.wordpress.com/2018/04/05/cas-cpu-instruction-takes-3-inputs/..
Q: how many clock cycles for a CAS instruction?

See further details in toy^surgeon

Q: what’s your actual CAS usage in your project, not a textbook use case?
%%A: At citi-muni, I had two threads putting quotes on a lockfree queue or stack. Queue is natural. In hindsight, the stack can be useful — when readers want to process the latest quote first, LIFO.
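A minimal c++11 sketch of a CAS-based lock-free push (my own illustration of the textbook pattern, not the citi-muni code):

#include <atomic>

template<typename T>
class LockFreeStack {
  struct Node { T val; Node* next; };
  std::atomic<Node*> head{nullptr};
public:
  void push(const T& v) {
    Node* n = new Node{v, head.load(std::memory_order_relaxed)};
    // CAS loop: succeed only if head is still what we read; on failure,
    // compare_exchange_weak refreshes n->next with the current head and we retry
    while (!head.compare_exchange_weak(n->next, n,
                                       std::memory_order_release,
                                       std::memory_order_relaxed)) {
    }
  }
  // pop is trickier (ABA, memory reclamation) and omitted from this sketch
};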

Q: how would you design an order book, if you ignore your current system?
Q: why did you say array-based is faster than AVL?
%%A: most likely cache efficiency

Q: what’s the performance level of your order book? “I (the interviewer) would use array-based order book .. can easily achieve 1M msg/sec”
Q: what’s the disadvantage of an order book based on AVL tree?
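A minimal sketch of what I understand by “array-based” (my own guess at the interviewer’s design, not the Rebus code): price levels indexed by integer price tick within a known band, so an add/modify is an O(1), cache-friendly array access.

#include <cstdint>
#include <vector>

struct ArrayOrderBook {           // hypothetical names
  int64_t minTick;                // lowest representable price tick
  std::vector<uint64_t> bidQty;   // index = priceTick - minTick
  std::vector<uint64_t> askQty;

  ArrayOrderBook(int64_t lo, int64_t hi)
    : minTick(lo), bidQty(hi - lo + 1, 0), askQty(hi - lo + 1, 0) {}

  void addBid(int64_t tick, uint64_t qty) { bidQty[tick - minTick] += qty; }
  void addAsk(int64_t tick, uint64_t qty) { askQty[tick - minTick] += qty; }
  // best bid = highest non-empty bid level; caching that index avoids full scans
};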

— coding^QQ^non-tech questions

I think I did well on non-tech (A-) and coding (A), and OK on QQ (B). On the non-technical side, again, I feel interviewers can’t reach firm conclusions.

I believe the tough questions listed above are classic QQ i.e. theoretical knowledge not needed in real projects. Interviewers may consider it zbs but I would call it QQ.

Consistency and stability in high-end coreJava QQ topics — No real change since 2007

Do I look like Deepak? I think with algos I seldom feel fake and broken. As to the QQ topics, if I am armed with a tidbit of c++ low-level knowledge then I can often defend myself, because on those questions c++ is invariably more low-level than java. There are still some “weakness topics” where I may come across as … superficial+academic, like Deepak.

I generally concede to the interviewer, but not today during the “CAS justification” discussion.

baseclass(template)use subclass field+!vptr

A very restrictive context —

  1. the base and sub classes represent market data messages, used in a reinterpret_cast context, with zero padding. Therefore, a vptr would add a hidden pointer and mess up the reinterpret_cast.
  2. multiple (say 11) subclasses have a “price” field, so I don’t want code duplication 11 times
  3. The field order is fixed in the struct definition. Without this restriction, I would define the “price” field in a base struct along with a getPrice() method. With a subclass instance, CRTP could probably work like static_cast<Subclass const*>(ptr)->getPrice(), but the “price” field’s physical offset would be dictated by the compiler, not by my struct definition

My technique uses CRTP but no SFINAE no enable_if.

My technique is easier to /internalize/, as it relies on simple overload resolution + simple type deduction. In contrast, the SFINAE technique used in my RTS codebase is off-putting and alien to me.
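A minimal sketch of the idea as I understand it (PriceApi, QuoteMsg and TradeMsg are hypothetical names, not the real message structs): the base class template has no data members and no virtual functions, so each subclass keeps its exact wire layout, yet the shared getPrice() logic lives in one place.

#include <cstdint>

template<typename Derived>
struct PriceApi {                 // empty base: no vptr, no data members
  double getPrice() const {
    // resolved at compile time; reads the subclass's own field at its own offset
    return static_cast<const Derived*>(this)->price;
  }
};

#pragma pack(push, 1)             // zero padding, matching the wire format
struct QuoteMsg : PriceApi<QuoteMsg> {
  char     symbol[8];             // field order fixed by the exchange spec
  double   price;
  uint32_t size;
};
struct TradeMsg : PriceApi<TradeMsg> {
  char     symbol[8];
  uint32_t size;
  double   price;                 // different offset, same shared getPrice()
};
#pragma pack(pop)

// usage after reinterpret_cast of a raw buffer (assuming matching layout/endianness):
//   auto* q = reinterpret_cast<const QuoteMsg*>(rawBytes);
//   double p = q->getPrice();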

tsn: esteem hazards^boosts

With the exception of c++, ventures like socket (mkt data), py (devops), c#, MSVS … in fact the majority of ventures outside my tech sweet spot of java/SQL/scripting … present esteem-hazards (like health hazards) but dismal prospects in terms of self-esteem boost.

high-risk, low-return ventures, in terms of self-esteem?

Looking deeper, the esteem-hazard is really nothing but one mgr’s assessment. For my recent painful jobs, I continue to dismiss/ignore the possible boost to self-esteem —

  • I conquered some of my biggest technology fears — MSVS; c++ crash management; c++ large codebase navigation. Other people may not agree, but my own experience proves that these challenges are harder than high-level dev (like web/scripting..). My fears about these c++ challenges were more deep-rooted.
  • OC — I built real mileage in c#. I even proved myself stronger than some c# veterans when I created a simple web server to dump log files for everyday troubleshooting.
  • Macq — I invented elegant solutions for splunk even though my boss wasn’t impressed
  • .. many more
  • see more pointers in https://bintanvictor.wordpress.com/wp-admin/post.php?post=27139&action=edit

nth largest element in unsorted array #QuickSelect

Q: Find the kth largest element in an unsorted array. Note that it is the kth largest element in the sorted order, not the kth distinct element.

I think this is mostly a bigO algoQQ problem.

std::nth_element is linear on average .. https://stackoverflow.com/questions/11068429/nth-element-implementations-complexities talks about QuickSelect algo

— idea 6: priority Q (Fib-heap) of size k
if any incoming item is higher than the current min, then pop the min in O(logK) and insert the new item in O(1)
— idea 6b: a single priority Q over all N items
Linear time to build it
— idea 5: 95th percentile problem from NYSE
— idea 4: multiple scans
— idea 3: segments
— Sol2: O(N). use the O(N) algo in the blog on “given int array, find median ] O(N)”. Then discard one of the two segments. Then repeat.
Note: Each time the target element must exist in one of the 2 segments.

O(N) + O(N/2) + O(N/4) … -> O(N)

— Sol2a: Use the O(N) qsort partition algo to anchor a (random) pivot element to create two segments. Our target must exist in one of the two, so discard the other by adjusting the le/ri boundaries.

This idea is same as the most voted solution in leetcode discussion.
O(N) on average — we get O(N)+O(N/2) + O(N/4) + … < O(2N)

Note: for hashtables we routinely accept average (not worst-case) complexity, so an average-case O(N) should be acceptable here too!
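A minimal c++ sketch of Sol2a (my own code, not the leetcode solution): partition around a random pivot, then keep only the segment that must contain the target.

#include <algorithm>
#include <random>
#include <vector>

// kth largest (k=1 means the maximum); average O(N), like std::nth_element
int kthLargest(std::vector<int> v, int k) {
  int target = static_cast<int>(v.size()) - k;     // its index in ascending order
  int lo = 0, hi = static_cast<int>(v.size()) - 1;
  std::mt19937 gen{std::random_device{}()};
  while (true) {
    std::uniform_int_distribution<int> dist(lo, hi);
    std::swap(v[dist(gen)], v[hi]);                // random pivot moved to the end
    int pivot = v[hi], store = lo;
    for (int i = lo; i < hi; ++i)
      if (v[i] < pivot) std::swap(v[i], v[store++]);
    std::swap(v[store], v[hi]);                    // pivot lands at its final rank
    if (store == target) return v[store];
    if (store < target) lo = store + 1;            // discard the other segment
    else hi = store - 1;
  }
}
// or simply: std::nth_element(v.begin(), v.begin() + target, v.end()); return v[target];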

ringBuffer@pre_allocated objects to preempt JGC

Goal — to eliminate JGC completely.

Design 1: I will want Order.java to use primitive fields only and avoid reference fields [1] at all cost, so the total footprint of an Order is known in advance. Say it’s 100 bytes. I will create 10M dummy Order instances, possibly scattered in the heap (not adjacent as in c++), and hold their 10M addresses in an Order array… about 1GB footprint for the Order objects + 80MB footprint for the array of 8-byte pointers.

(Note I reuse these Order instances in this object pool and never let them get garbage-collected.)

Then I need a few subscripts to identify the “activeRegion” of the ring, but how about released slots enclosed therein?

[1] timestamps will be ints; symbolIDs and clientIDs are ints; short ascii strings will use 64-bit ints (8 characters/int); free-form strings must be allocated off-site:(

Design 2a: To avoid the “scatter” and place the Order instances side by side, can we use a serialized byte[100] array object to represent one Order? Can we use one gigantic off-heap byte array to hold all Orders, eliminating the 80MB footprint? See java off-heap memory

Design 2b: https://blog.bramp.net/post/2015/08/26/unsafe-part-2-using-sun.misc.unsafe-to-create-a-contiguous-array-of-objects/ shows a contiguous array of java objects, like std::vector<MyObject>

Design 2c: https://www.ibm.com/support/knowledgecenter/en/SSYKE2_7.1.0/com.ibm.java.lnx.71.doc/user/packed_optimizing.html is a feature in IBM jvm

Ring buffer is good if the object lifetimes are roughly equal, giving us FIFO phenomenon. This occurs naturally in market data or message passing gateways. Otherwise, we may need a linked list (free list) of released slots in addition to a pair of subscript to identify the active region.

It might be better to allocate a dedicated buffer for each thread, to avoid contention. Drawback? One buffer may get exhausted while another stays unused.

Q: which linux c++thread is stuck #CSY

This is a typical “c++ecosystem question”. It’s not about c++ or C; it’s about linux instrumentation tools.

Q1: Given a multi-threaded server, you see some telltale signs that process is stuck and you suspect only one of the threads is stuck while the other threads are fine. How do you verify?

Q2: What if it’s a production environment?
A: I guess all my solutions should be usable in production, since the entire machine is non-functioning. We can’t make it any worse. If the machine is still doing useful work, then we should probably wait till end of day to investigate.

–Method: thread dump? Not popular for c++ processes. I have reason to believe it’s a JVM feature, since java threads are always jvm constructs, usually based on operating system threads [1]. The JVM has full visibility into all threads and provides a comprehensive instrumentation interface.

https://www.thoughtspot.com/codex/threadstacks-library-inspect-stacktraces-live-c-processes shows a custom c++ thread dumper but you need custom hooks in your c++ source code.

[1] Note “kernel-thread” has an unrelated meaning in the linux context

–Method: gdb

thread apply all bt – prints a stack trace of every thread, allowing you to somewhat easily find the stuck one

I think in gdb you can release each thread one by one and suspend only one suspect thread, allowing the good threads to continue

–Method: /proc — the dynamic pseudo file system

For each process, a lot of information is available in /proc/12345. Information on each thread is available in /proc/12345/task/67890, where 67890 is the kernel thread ID. This is where ps, top and other tools get thread information.

 

tick`95%mark #Kam #70%

“Ticking 95th percentile server” is the blog title I would use. The original question came on paper/pencil, scanned and sent to my gmail by my friend Deepak CM. I find this problem rather realistic and practical, rather than fake and contrived. I treat it as a System Design + implementation question.

Q: Using only the std library, write a c++ program to process a live stream of 128,000,000 (or many more) double numbers, representing temperature readings, not necessarily unique. As the temperatures come in, print the current 95th percentile on demand. I call that element the lucky “winner”. We can use the nearest-rank percentile definition.

====Idea 1: the “given unsorted ints, find median in O(N)” algo is designed for the median but can be tweaked for any percentile; unfortunately, it is not “ticking”

====design 2, for a static data set
use an “order statistic tree” i.e. a RBTree where each node remembers the size of its subtree. (A leaf node has size 1.)
====design 3, optimized for high volume of updates like 128 million updates, not optimized for frequent query

The entire temperature range is divided into non-overlapping segments, each represented by a segment-head temperature i.e. the lower bound [1b]. Each segment has a range (i.e. distance to the next segment head), a size (i.e. item count) and a density (i.e. size/range ratio). We mostly care about “size” only.

We need a RB-tree (or sorted vector) containing P=1024 [1] nodes, each pointing to an unsorted container [3]. The RB-tree serves to maintain the containers i.e. segments.

Each incoming temperature is quickly “routed” to the correct container and simply appended therein, increasing its size.

Upon a query request, we will use the latest segment sizes to build a cumulative profile, and run an O(logP) binary search to identify the one segment containing the “winner”. This segment’s size would hopefully be much smaller than 128,000 [2] and far more /tractable/.

–Within the chosen segment of size S, we can use a vector to sort the temperatures in O(S logS) and identify the winner. After completing a query, the chosen container will be (partially) sorted, helping subsequent queries if this segment is picked again.

Since we only support 95th percentile, chance is good that this segment will be picked most of the time. If x% of the queries hit this segment, then I will convert this “favorite segment” to a RB-tree.

Alternatively, we can also use the O(S) algorithm in Idea 1, but the container won’t become sorted.
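A minimal c++ sketch of this design (my own approximation; the segment heads are passed in here instead of coming from the priming phase described below, and the cumulative profile is a simple linear scan over the P segments):

#include <algorithm>
#include <cstddef>
#include <map>
#include <vector>

class TickingPercentile {
  // key = segment-head temperature (lower bound), value = unsorted container
  std::map<double, std::vector<double>> segments;
  std::size_t total = 0;
public:
  explicit TickingPercentile(const std::vector<double>& heads) {
    for (double h : heads) segments[h];            // create empty segments
  }
  void onTemperature(double t) {                   // O(logP) routing + O(1) append
    auto it = segments.upper_bound(t);
    if (it != segments.begin()) --it;              // the segment whose head <= t;
    it->second.push_back(t);                       // readings below the first head fall into the lowest segment
    ++total;
  }
  double percentile95() {                          // nearest-rank definition
    if (total == 0) return 0.0;
    std::size_t rank = (total * 95 + 99) / 100;    // ceil(0.95 * total)
    std::size_t seen = 0;
    for (auto& kv : segments) {
      auto& bucket = kv.second;
      if (seen + bucket.size() >= rank) {          // the winner lives in this segment
        std::sort(bucket.begin(), bucket.end());   // O(S logS); segment stays sorted afterwards
        return bucket[rank - seen - 1];
      }
      seen += bucket.size();
    }
    return 0.0;                                    // unreachable
  }
};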

–priming

[2] 128,000 is roughly 1/1024th of the original sample size… not ideal. The segments need to be initialized carefully, during a priming phase, inspired by the JIT compiler. Shall we assume a roughly uniform distribution or a Gaussian distribution? Assuming we know the total sample size is 128 million, I will use the first 100,000 temperatures to select the 1024 segment heads. The segments are deliberately not structured for equal length (in temperature) or equal size (i.e. element count). In fact the low segments can be very long and very crowded.

Instead, the segment heads are chosen so that between the 94th percentile and the 96th percentile we have half of all segments. These segments will be much smaller than 128,000 elements and quicker to manipulate.

–Foot notes:

Q: what if some of the containers grow too big, like three times 128,000,000/1024? In other words the priming/estimate was ineffective.
A: Not a problem unless the winner happens to be in such a container.
A: One idea is to split up such a container when we notice it, and grow a new node on the RB-tree. std::map::insert() can take a hint for the position where the new node can be inserted. Note we don’t want to split a node JJ too early, since JJ may not grow any further and may end up no larger than other nodes as they keep growing.

[1] Sizing of P — First we estimate the total sample size. If unknown, then set P:=1024 so all nodes stay in L1 cache (typically 32KB). If we assume 16 bytes/node (8-byte pointer to container + 8-byte double), then 32KB can hold 2000 nodes.

If queries become more frequent, I can increase P by 1024, sacrificing insertion speed.

[1b] The node values are “lower-bounds” and don’t have to be really the minimum of the original temperatures in the container. We can probably cheat with 4-byte floats, and we get away with 2700 twelve-byte tree nodes.

[3] slist vs vector — a vector might be faster due to pre-allocation, provided a node never grows beyond its capacity. Vector has reserve(). (Note resize() is the wrong choice here.)

xchg between register/memory triggers fencing

I guess xchg is used in some basic concurrency constructs, but I need to research more.

See my blogpost hardware mutex, based@XCHG instruction

https://c9x.me/x86/html/file_module_x86_id_328.html points out the implicit locking performed by CPU whenever one of the two operands is a memory location. The other operand must be a register.

This implicit locking is quite expensive according to https://stackoverflow.com/questions/50102342/how-does-xchg-work-in-intel-assembly-language, but presumably cheaper than user-level locking.

This implicit locking involves a memory fence.
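A minimal c++ sketch in the spirit of XCHG (my own illustration): a spinlock whose lock() atomically exchanges in “true”; on x86 such an exchange typically compiles to a lock-prefixed or implicitly locked instruction, with the fencing effect described above.

#include <atomic>

class SpinLock {
  std::atomic<bool> locked{false};
public:
  void lock() {
    // atomically swap in true; if the old value was already true, someone else holds the lock
    while (locked.exchange(true, std::memory_order_acquire)) {
      // busy-wait; a real implementation would add a pause/yield here
    }
  }
  void unlock() { locked.store(false, std::memory_order_release); }
};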

Nsdq onsite IV #emerging Nsdq architecture

Q: Java NonBlocking IO?
%%A: one thread monitoring multiple sockets

Q: many simple objects with complex relationships/interactions, vs one (or very few) huge complex object?

Q: thread Futures? Any comments?

–some of my findings

  • microservices and containers are important
  • According to the system architect (last interviewer), the new framework uses one dedicated instance of matching engine serving a single symbol such as AAPL.
  • the new java framework only supports a few smaller exchanges initially, but is designed to roll out to more “exchanges”, esp. those to be installed on client sites (like CSP). It is (theoretically) designed to be capable enough to implement the main stock exchange.
  • it is mostly java + javascript (Amber, react) with a small amount of c++
  • the new java framework was deliberately designed to operate in single-threaded mode, despite opposition. Therefore, multiple “clients” calling the same library concurrently would be unsafe.
  • java8
  • noSQL is adopted in addition to PostgreSQL, but the principal architect is less concerned about data store, which is only used in individual components

## linux kernel concurrency primitives #to elaborate

I feel the concurrency primitives in the kernel are conceptually similar to those in thread libraries but fundamentally different in implementation. This topic is deep and time-consuming, so I will only memorize some jargon for show-off

I should read the OReilly Linux book + Josh’s book

  • spinlock — addressed in both books
  • read-write spinlock
  • binary and counting semaphores
  • read-write semaphores
  • RCU?

 

SCB-FM design IV #Ashish #80%

Q: Your parser class is a plug-in in a framework. The framework would call your parser’s member function onData(seqNum, packet)  whenever the framework receives a packet on a UDP socket. You need to deal with

  • out of sequence packets
  • duplicate packets

Inside your onData(), you need to invoke a client callback function like theClient->callback(ptr2packet) but you need to invoke it 1) in correct sequence and 2) without duplicates.

Note this requirement is part of TCP’s job description. TCP receives out-of-sequence packets (and/or duplicates) from the IP layer and must deliver sequenced packets to the application layer.

Does TCP use ring buffer or hashtable? I doubt it, but we are building simpler solutions and are free to choose our data structure.

====My solution=====

an illustration

seq # received | warehoused | sent | in-use region of ring buffer
1              |            | 1    |
5              | 5          |      | 2-5
9              | 5,9        |      | 2-9
3              | 3,5,9      |      |
2              |            | 2-3  | 4-9
6              | 5,6,9      |      |
8              | 5,6,8,9    |      |
7              | 5-9        |      |
11             | 5-9,11     |      | 4-11
4              |            | 4-9  | 10-11
  • keep the packets (like fixed-size struct instances) in a large singleton circular array (or a deque); see the sketch after this list. Save each packet in the slot keyed by the packet’s seq number (modulo the array size). Remember nextSeqToSend. If we get a higher sequence than that, just warehouse it in the circular buffer.
  • (Interviewer didn’t ask) How do you reuse slots in the circular buffer? Given ten thousand slots #0~#9999, when I’m warehousing packet #109,999 in slot #9999, conceptually the old packet in slot #0 was already sent out, so I can safely “wrap around” and save the next packet (#110,000) there. I can implement my system to ensure this is actually safe.
  • What if the sequence numbers I receive jump wildly? Well, in real systems this never happens (except for an explicit seq reset). At most the sequence numbers jump ahead by a few hundred. Assuming the sequence numbers arrive largely in order with occasional out-of-order arrivals, a ring buffer is a natural choice. Without this assumption, dictionary solutions (Ashish saw one in QuickFix) might be more suitable.
  • permanent gaps? If I see an old gap (like nextSeqToSend == #55, no #56, but #57~#8000 all received) then we need a policy to mark the gap as permanent. Otherwise we would wait for it indefinitely.
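A minimal c++ sketch of the ring-buffer solution (my own code; Packet and deliver() are placeholders, and it assumes a gap never spans more than N sequence numbers, per the bullet above):

#include <cstddef>
#include <cstdint>
#include <vector>

struct Packet { /* fixed-size payload */ };

class Sequencer {
  static const std::size_t N = 10000;                  // ring capacity
  std::vector<Packet> slots   = std::vector<Packet>(N);
  std::vector<bool>   present = std::vector<bool>(N, false);
  uint64_t nextSeqToSend = 1;
  void deliver(const Packet&) { /* theClient->callback(...) in the real system */ }
public:
  void onData(uint64_t seq, const Packet& pkt) {
    if (seq < nextSeqToSend) return;                   // duplicate, already delivered
    std::size_t slot = seq % N;
    if (present[slot]) return;                         // duplicate of a warehoused packet
    slots[slot] = pkt;                                 // warehouse out-of-sequence packets
    present[slot] = true;
    while (present[nextSeqToSend % N]) {               // drain the contiguous run, in order
      deliver(slots[nextSeqToSend % N]);
      present[nextSeqToSend % N] = false;
      ++nextSeqToSend;
    }
  }
};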

Q (from interviewer): if you use a deque, how do you allocate a slot for packet #5 while waiting for #4?
%%A: I would allocate for both, but keep the #4 slot vacant. Not sure if std::deque has this API. I think my deque would hold pointers … a dummy (null) pointer represents vacant.

Justification for deque is similar to ring buffer — to keep the queue length short and release once-used memory.

I haven’t analyzed hashtables i.e. dictionaries. I believe it’s a proven solution with no major drawbacks.

#1 minor drawback of the hashtable-based (or deque-based) solution relative to the ring buffer is runtime allocation, which is about 100 to 1000 times slower than arithmetic operations. For this reason, I always favor ring buffers when one is a natural and logical data structure choice. This is my bias in many system designs. Sequence-number-based systems can often use ring buffers.

Another minor drawback of the hashtable is memory overhead. A ring buffer has relatively small overhead beyond the packet footprints, whereas a hashtable wraps each packet in a link node. A hashtable also needs an expandable bucket array.

In terms of runtime efficiency, I am not so sure. I feel circular array has faster read/write. Hashtable depends on the hash function, which can degrade due to hash collisions.

 

spare time as slack resource #blogging,parenting,localSys

In 2017 I wrote to Gerald Robinson that spare-time depletion would become one of the key challenges/pains of my upcoming adjustments.

For many (probably most) of my peers, who don’t feel the drive to keep learning in their spare time, this resource depletion doesn’t feel so harmful, hazardous or stressful. Besides this “learning”, I also need quiet time for reflection.

— commute — Bayonne commute is tolerable only when I have plenty of spare time as “slack resource”.

— Parenting — (esp. my son) became a huge commitment and strain on my slack resources.

— Additional paid annual leave is supposed to be a slack resource but I think it doesn’t reduce the pressure to perform

… but I did spend those hours (weekends + evenings) in the office to learn localSys, provided I was engaged and in the mood.

I need to think harder about how to turn these large quantities of high-quality spare resources into stress relief. These resources feel like an untapped oil reserve.

spreadsheet concretize #Junli Part2

Note the java algo is event-queue based — every newly concretized cell is an event added to the event queue. When that cell is dequeued, all registered dependents of the cell are checked. If a check results in a new cell being concretized, that cell is enqueued.

In contrast, my c++ algo is a modified BFT. The key idea is: whenever a branch node can’t be concretized (due to an unresolved upstream reference), we simply ignore that node’s subtree. Another root node’s BFT would eventually visit this node, unless there’s a cycle.

I believe both algorithms are relatively easy to visualize at a high level. Which algo implementation is more tricky and error-prone? I guess the BFT but not really sure.

— Topo-sort — “topological sorting” is the reusable general technique for similar problems like event scheduling. As I described to Kyle, the idea is “Given a directed graph, assign (artificial) integer ranks to all nodes so that every arrow points from a low rank to a high rank”

There are linear time algorithms to assign the ranks. I think some form of BFT may work… need more thinking.

I think it depends on what’s more natural — start from leaf nodes or start from root nodes. The start level would use lower integers.

For a typical spreadsheet, I feel it’s natural to start from nodes that have no downstream.

My c++ implementation was similar to Kahn’s algorithm.
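A minimal c++ sketch of Kahn’s algorithm (my own illustration, not the original implementation): nodes with zero unresolved upstream references are “concretized” first, and each one may unlock its downstream dependents.

#include <queue>
#include <vector>

// adj[u] lists the downstream dependents of node u; returns nodes in topological order
std::vector<int> kahnTopoSort(const std::vector<std::vector<int>>& adj) {
  std::vector<int> indegree(adj.size(), 0);
  for (const auto& outEdges : adj)
    for (int v : outEdges) ++indegree[v];
  std::queue<int> ready;                         // nodes with no unresolved upstream
  for (int u = 0; u < (int)adj.size(); ++u)
    if (indegree[u] == 0) ready.push(u);
  std::vector<int> order;
  while (!ready.empty()) {
    int u = ready.front(); ready.pop();
    order.push_back(u);                          // "concretize" u
    for (int v : adj[u])
      if (--indegree[v] == 0) ready.push(v);     // v's last upstream just resolved
  }
  // if order.size() < adj.size(), the leftover nodes form a cycle
  return order;
}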

[[Algorithms]] P 550 presents an elegant DFT algo, but it is not so intuitive to me yet. I don’t think DFT can solve this spreadsheet problem.

–Some additional notes on the c++ implementation

  • 🙂 I encountered far fewer seg-faults than in other projects. I think it’s because very few arrays (including vectors) are used.
  • Before I look up a map/set, I always use count() to verify the key exists.
  • 🙂 I didn’t need to check against end() as the lower_bound() and find() functions would require.
  • no smart ptr needed. A container of raw ptrs worked well. See source code comments
  • In fact, a container of cell names (as strings) is even “safer”.

## clean design may look over-complicated

What is simple/clean really depends on (your knowledge of) best practices.

  • #1 eg: shared_ptr as member variable. Looks complicated but simplifies many things.
  • eg: nested container without pointer — looks scary[1] to java/python developers, but I think in c++ this is a proven design and simplifies many things.
  • eg: most of the 23 GOF design patterns introduce additional classes to address coupling and cohesion. These are classic and clean designs but add complexity. MVC pattern ditto.
  • try-with-resource looks complicated to me
  • RAII looks complicated to me

[1] inserting an entire sub-container into the umbrella container requires copying the sub-container.

## any unexpected success@tsn@@

## past vindicative specializations is a more comprehensive post, but in this blogpost I want to start on a clean slate and answer the question —

Q: I have explored far and wide… for new domains, new challenges, so are there any unexpected successes of try-something-new?

skill | discovered | proven over | %%strength 1-5 | val 4 GTD | val4IV
unix shell + unix admin | 2000 Catcha | 2Y | 3 | 4 | 1
! perl | 2000 Catcha | 3Y | 4 #many years ago | 2 | 1
LAMP+js | 2000 | 1Y | 2 | 0 | 0
! python #popularity no surprise | | 1Y | 2 #xp@Mac | 2 | 3
! socket #and tools | 2017 | never | 1 #lowLevel | 3 if relevant | 2
!! threading concept+java impl | 2010 | 4Y | 5 #theoretical | 0 | 5
! x-lang collections | 2010 | 5Y | 4 #lowLevel+theoretical | 1 | 5
! x-lang OO | 2010 | NA | 4 #lowLevel | 0 | 4
white board coding [1] | 2011 | 2Y | 2 @WallSt | 0 | 3
c++ instrumentation/build tools | | | | |
! bond math | 2010 Citi | 1Y | 2 | 1 if relevant | 2
option math | 2011 Barc | 1Y | 3 | 2 if relevant | 1
fin jargon | 2010 | 4Y #mvea | 3 #know what’s relevant | 2 | 2
finSys arch #abstract | 2010 | 2Y | 2 | 3 | 3

[1] Am more comfortable on whiteboard than other candidates.

Exclamation marks mean Surprise

SDI: URL shortening

Q: Design TinyURL or bitly (a URL shortening service)

Given a (typically) long URL, how would you design a service that generates a shorter and unique alias for it?

Discuss things like:

  • How to generate a unique ID for each URL?
  • How would you generate unique IDs at scale (thousands of URL shortening requests coming every second)?
  • How would your service handle redirects?
  • How would you support custom short URLs?
  • How to delete expired URLs etc?
  • How to track click stats?

https://www.interviewbit.com/problems/design-url-shortener/ is a long discussion.
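A common building block for the “unique ID” bullet is base62 encoding of a numeric ID; a minimal c++ sketch (my own, not from the interviewbit discussion):

#include <cstdint>
#include <string>

// encode a numeric row ID as a short alias, e.g. 125 -> "21" with this alphabet
std::string toBase62(uint64_t id) {
  static const char alphabet[] =
      "0123456789abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ";
  if (id == 0) return "0";
  std::string s;
  while (id > 0) {
    s.insert(s.begin(), alphabet[id % 62]);
    id /= 62;
  }
  return s;                    // 7 base62 chars cover about 3.5 trillion URLs
}

// At scale, each shortener instance could reserve a block of IDs (e.g. from a
// ticket server or a coordination service) so no two instances generate the same alias.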

y FIX needs seqNo over TCP seqNo

My friend Alan Shi said … Suppose your FIX process crashed or lost power, reloaded (from disk) the last sequence number received, and reconnected (resetting the tcp seq#). It might then receive a live seq # higher than expected. This could mean some execution reports were missed. If the exchange notices a sequence gap, then it could mean some order cancellation was missed. Both scenarios are serious and require a resend request. CME documentation states:

… a given system, upon detecting a higher than expected message sequence number from its counterparty, requests a range of ordered messages resent from the counterparty.

Major difference from TCP sequence numbers — FIX specifies no Ack, though some exchanges do use one. See Ack in FIX^TCP

Q; how could FIX miss messages given TCP reliability? I guess tcp session disconnect is the main reason.

https://kb.b2bits.com/display/B2BITS/Sequence+number+handling has details:

  • two streams of seq numbers, each controlled by exch vs trader
  • how to react to unexpected high/low values received. Note “my” outgoing seq is controlled by me hence never “unexpected”
  • Sequence number reset policy — After a logout, sequence numbers are supposed to reset to 1, but if the connection is terminated ‘non-gracefully’, sequence numbers continue when the session is restored. In fact a lot of service providers (eg: Trax) never reset sequence numbers during the day. There are also some who reset sequence numbers once per week, regardless of logout.

 

CVA c++ IV 2 #oq

void func(ResourceMgr & rm){ 
  int connId = rm.getConn();
  double d=externalFunc(connId); 
  cout<<d<<endl;
  rm.reclaim(connId); //release the resource to the mgr
}
  • Q5: In the code above, externalFunc can throw, so write a RAII wrapper to prevent a resource leak. (See my sketch after this list.)
  • Q5b: what if you aren’t allowed to use RAII? OK you said catch-all. Is an empty catch block enough?
    AA: need to re-throw the original exception, to mimic the RAII behavior.
  • Q (paper coding): Iterate over two vectors in lock step. Which is faster — iterator or vector index?
  • Q (bond math): Is there any uncertainty in the present value calc on a pre-existing vanilla IRS?
  • q: what design patterns do you use?
  • q: write a singleton skeleton
  • Q: how do we make a class “final” in pre-c++11
    %%A: either make dtor private or make all constructors private
  • Q: is shared_ptr thread safe?
  • Q: any difference — const shared_ptr<T> vs shared_ptr<const T>?
  • %%Q: does it make sense to pass a shared_ptr by const reference? I feel it’s cleaner to pass by raw ptr
    AA!: pass by const-reference is faster and recommended
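My after-the-fact sketch of the Q5 RAII wrapper (assuming the ResourceMgr and externalFunc declarations implied by the snippet above):

#include <iostream>

// releases the connection back to the mgr even if externalFunc() throws
class ConnGuard {
  ResourceMgr& rm_;
  int connId_;
public:
  explicit ConnGuard(ResourceMgr& rm) : rm_(rm), connId_(rm.getConn()) {}
  ~ConnGuard() { rm_.reclaim(connId_); }       // also runs during stack unwinding
  int id() const { return connId_; }
  ConnGuard(const ConnGuard&) = delete;        // single owner, non-copyable
  ConnGuard& operator=(const ConnGuard&) = delete;
};

void func(ResourceMgr& rm) {
  ConnGuard guard(rm);
  double d = externalFunc(guard.id());         // may throw; guard still reclaims
  std::cout << d << std::endl;
}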

–Zheng

  • Q: why use weak_ptr instead of raw ptr. Valid question.
    A: See [[std c++lib]] P84.
    A: See last paragraph in https://bintanvictor.wordpress.com/2010/01/20/weak_ptr-can-access-inner-ptr-only-through-a-shared_ptr/
  • Q: you said atomic<int> operator++ probably wraps a CAS in a while loop internally, so is it spinlock?
    %%A: I think so.   http://www.cs.cornell.edu/courses/cs4410/2015su/lectures/lec06-spin.html is a detailed explanation
  • Q34: various synchronization techniques?
  • Q34b: what’s the c++ function to force a memory barrier?
    A: std::atomic_thread_fence(),
    but i feel if an application (not a library) uses this function then something is seriously wrong.
  • Q: can we implement a mutex using CAS?
    AA: blockingMutex implementation ] kernel
    AA: spinlock, not blockingMutex, can be implemented as in   https://www.andrew.cmu.edu/course/15-440-sp09/applications/ln/lecture4.html
  • ? Q (paper coding): write a piecewise linear interpolation Curve class with a ctor taking a vector<pair<double, double>>.  See https://github.com/tiger40490/repo1/blob/cpp1/cpp1/miscIVQ/curveInterpolation_CVA.cpp
  • Q: What are r-values and r-value references
  • Q: Explain your use of SFINAE. See https://github.com/tiger40490/repo1/blob/cpp1/cpp1/template/SFINAE_demo1.cpp
  • Q: what’s the c++ library function for cmpxch?
    A: atomic::compare_exchange_weak()
  • Q: is your barcap vol surface used for EOD risk only?
    %%A: no. A trader might suspect a lesser known product is currently mis-priced. She could confirm the live bid/ask are obvious outliers on the fitted surface.
    %%A: a dealer desk price new deals based on the fitted surface, whether or not the strike/expiry requires interpolation.

–Simon Ma:

  • Q (paper coding): implement Fib() recursively and iteratively
  • Q: what’s inline?
  • Q: semaphore vs mutex
  • Q: how did you compute greeks like gamma?
  • Q (bond math): given 6M spot IR = 5% and 12M = 10%, what’s the 6M rate 6M forward? (A worked answer follows this list.)
  • Q: Assign first half of a vector to another vector
    %%A: vector::assign() probably takes two iterators so I can pass v.begin(), v.begin()+5
  • Q (obscure): What data structure to represent a directed graph of N nodes? See NxN matrix for graph of N nodes
    %%A: create a Node class with a single ptr field…
  • Q: use parallel algorithm to compute sum of a vector<int>
    %%A: only the last stage global aggregation has a shared mutable that needs protection. The sub-aggregation can be single-threaded.
  • Q (probability problem): Both Alan and Bob agree to show up at Cafe9 sometime between 12 and 2pm. Whoever arrives first will wait for 15 minutes only. What’s the probability of them actually meeting?
  • Q: what’s thread_local?
    %%A (correct): a new storage class that’s similar to static
  • Q (paper coding): A natural number is Good if it is a product of only 3, 5 and 7. The smallest Good numbers are 1,3,5,7,9,15,21,25,27,35… Write a print(int N) to print the Nth Good number. Hint: write a isGood(int k). See https://github.com/tiger40490/repo1/blob/cpp1/cpp1/miscIVQ/isGood357_CVA.cpp
  • Q (unclear): implement subtract(int a, int b) using only operator+ and comparison operators. I feel this question is unclear. Can we use bitwise? Can we multiply?
    %%A: just try all the integers (i) one by one a == b+i
  • Q (obscure): what operators can’t be overloaded?
    %%A: q(?:) correct
    AA: address-of CAN be overloaded!
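For the bond-math question above, a quick worked answer under simplifying assumptions (simple annualized rates, no day-count or compounding subtleties): the 6M-forward-6M rate f must satisfy (1 + 5%*0.5) * (1 + f*0.5) = 1 + 10%*1, so 1 + f*0.5 = 1.10/1.025 ≈ 1.0732, giving f ≈ 14.6%, close to the intuitive 2*10% - 5% = 15%.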

–Mikhail the mgr

  • Q: inserting 1,000,000 items into a list vs a vector without reserve()
    A: vector wins
  • Q: do you define your exception classes?
  • Q: in what context is a deque completely unusable whereas vector is perfectly fine?
    A: a C function taking an array (i.e. a ptr + a count). Vector::data() is designed for this. In Pre-c++11, we can pass &v[0] and v.size()
    A: if the code takes one element and uses pointer increment to access other elements. Deque would fail at segment boundaries.
  • Q89 (coding question): given a vector of T [passed in by const ref], write a template function to return [by const reference] the minimum element.
  • Q89b: how do you handle an empty vector?
  • ? Q89c (obscure): given q(vector const & vec), what can you say about q{auto it = vec.begin()} versus q{vector::const_iterator it=vec.begin()}

##order-book common data structureS #array

Rebus has two levels of trees, according to Steve.

  1. For each feed there’s a separate symbol lookup tree — one would think a hashtable would be faster, but given our symbol data size (a few thousand per feed) the AVL tree is proven faster.
  2. A per-symbol bid-tree (and ask-tree) sorted by price

crossed orderbook mgmt: real story4IV

–challenges:
disappears after a while; Some days better some days worse

–impact
data quality visible to paying customers

–individual contribution?
not really, to be honest.

–what would you do differently?
Perhaps detect and set a warning flag when generating top-book token? See other post on “crossed”

–details:

  • query the exchange if query is supported – many exchanges do.
  • compare across ticker plants? Not really done in our case.
  • replay captured data for investigation
  • retrans as the main solution
  • periodic snapshot feed is supported by many exchanges, designed for late-starting subscribers. We could (though we don’t) use it to clean up our crossed orderbook
  • manual cleaning via the cleaner script, as a “2nd-last” resort
  • hot failover.. as last resort

–the cleaner script:

This “depth-cleaner” tool is essentially a script to delete crossed/locked (c/l) entries from our replicated order books. It is run by a user in response to an alert.

… The script then compares the Ask & Bid entries in the order book. If it finds a crossed or locked order, where the bid price is greater than (crossed) or equal to (locked) the ask price, it writes information about that order to a file. It then continues checking entries on both sides of the book until it finds a valid combination. Note that the entries are not checked for “staleness”. As soon as it finds non-crossed/locked Ask and Bid entries, regardless of their timestamps, it is done checking that symbol.

The script takes the entries from the crossed/locked file, and creates a CIM file containing a delete message for every order given. This CIM data is then sent to (the admin port of) order book engine to execute the deletes.

Before the cleaner is invoked manually, we have a scheduled scanner for crossed books. This scanner checks every symbol once a minute. I think it uses a low-priority read-only thread.

detect cycle in slist #Part 2

Note there’s an open question at the end

Don’t use the fast-slow iterators. Each of the 3 solutions below has advantages over the fast-slow-iterator solution.

–solution: list-reversal, based on https://blog.ostermiller.org/find-loop-singly-linked-list and implemented in https://github.com/tiger40490/repo1/blob/py1/py/slist/slistReverseLoopDetect.py
Basically, iterate from aa to bb to cc … and reverse each arrow. Suppose dd points to bb initially forming a loop. At some point the snapshot is

  • null <- aa <- bb <- cc <- dd  while currentNode is dd and nextNode is bb.

If we go on, we would be back at aa, proving the existence of a loop.

Benefit of this solution — O(1) space complexity and O(N) time complexity

constraint — won’t work on read-only lists.

–solution: count nodes against memory usage — my invention
We can get a reliable estimate of the memory usage of the entire data structure, and divide it by the per-node footprint, so we know within good approximation how many nodes exist. If we count more visited nodes than that, we must be revisiting nodes, proving a loop.

https://blog.ostermiller.org/find-loop-singly-linked-list shows one way to estimate the memory usage — keep track of the maximum and minimum addresses among all visited nodes.

The low level languages usually provide addresses. The higher level languages usually provide memory usage. In realistic applications, there is always a way to get an estimate of memory usage.

–solution: Brent. See my code https://github.com/tiger40490/repo1/blob/cpp1/cpp/linkedList/cycleDetect.cpp

If we keep count of total number of increments, we should see that every time we double “power”, that total count is a power of 2. If we have incremented 32 times then we know { mu + lambda } must exceed 32… Advantages over the simple-2-iterator:

  1. tortoise is much faster, so if mu (merge point, or loop starting point) is far out, then we will find it faster. With the standard solution, tortoise takes a long time to reach the merge point.
  2. can return loop size

I now believe I can predict in which “phase” we will detect the loop — If lambda is between 2^5 and 2^6, then we will detect the loop in Phase 6, and we can’t detect in Phase 5

I suspect power-of-3 is also working:

  • If the slower iterator stops moving completely, then the algorithm is broken because the slower iterator could remain outside the loop.
  • If the slower iterator tailgates the faster iterator and the following distance always stays within a static limit (like 99), then the algorithm is broken, because the loop size could exceed 99, so the two iterators would have no chance of rendezvous.
  • ==> If the following distance grows gradually, AND the follower is never stuck forever, I think eventually they will meet —
    • Suppose right after a tortoise jump, powOfN bumps up to 81 but actual loop length is shorter, like 80. Now within the next 81 iterations, the hare would move step by step while the tortoise remains. They would rendezvous for sure.
  • —- pros and cons for power-of-3 over power-of-2 —-
  • pro: if a big loop involves root, tortoise won’t have to jump too early too frequently ==> Fewer comparison needed.
  • con: if there’s a tiny loop far out, tortoise would end up jumping too late?

common string tasks: IV+GTD

(Let’s be imprecise here… Don’t sweat the small stuff.)

We should be able to perform all of these using c-strings, std::string (limited adoption since c++98), and the standard string in java, c#, perl, python, php. This is a master list. Tolerate multiple names for each task.

See basic tasks on https://bintanvictor.wordpress.com/2018/01/26/22tasksarraystrdictq-algoiv/

–STL
use string iterator with STL algorithms

–the easy
toUpper

–advanced
convert to vector and apply vector tricks
convert to std::string and apply tricks
equalsIgnoreCase
compareIgnoreCase
lastIndexOf
count how many times a substr occurs
sort content
endsWith

c++iv: Jump#2

Mostly obscure QQ-type questions. I feel I may have to give up on some of the very low-level (perf optimization) topics. I feel java and c# interviews are not so low-level.

Q: Stack overflow – who can detect it and print an error msg? JVM can do it but what if there’s no VM?

Q: What data type would you use for the tasks in a thread pool??
(I find this question too advanced. c++11 offers Futures…)
%%A: look at pthread-create. a func ptr taking a void ptr

Q: After malloc(), how do you cast the pointer to MyClass* ? Do you call the ctor? How?
(This is asked again by Alex of DRW)
A: placement-new?
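A minimal sketch of the placement-new answer (my own, after the fact):

#include <cstdlib>
#include <new>

struct MyClass { int x; explicit MyClass(int v) : x(v) {} };

void demo() {
  void* raw = std::malloc(sizeof(MyClass));   // raw, uninitialized memory
  MyClass* p = new (raw) MyClass(42);         // placement-new runs the ctor in place; no new allocation
  // ... use p ...
  p->~MyClass();                              // must invoke the dtor explicitly
  std::free(raw);                             // then free the raw memory
}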

  • Inter-thread communications in thread pool – how does it work?
  • Thread pool — Your resume mentioned your home-made thread pool? How?
  • Boost::any, boost::bind, boost::function
  • CPU cache – how do you use it to improve performance? Any specific techniques?
  • Stack size – who controls it? at Compile time or run time?
  • Shared ptr – how is it implemented?
  • Scoped lock – what is it, why use it?
  • Your bash shell customizations as a cpp developer?
  • $LD_LIBRARY_PATH — what is it?

UBS IV emerging market credit/rates: java

Q: How do you create a deep copy of a java List where each element could be an Integer, Float …?
A: see the email I sent on 4 Aug 2018.

Q1: how do you price an interest rate swap? I think this can mean at least two things
Q1a: how do you compute the p/l of an existing IRS?
Q1b: how do you determine the fixed rate on a new IRS?

Q: how do you call c from java?

Q: what does a credit spread mean?

Q: Challenges of multithreading?
%%A: one feature of java can restrict the number of threads to a number smaller than the number of processors? Can’t name that feature.

Q: Say a master producer thread puts objects (like tasks or …) into a shared collection, and worker threads pick them up and put results objects into a shared place. How do you ensure data integrity/consistency.

–perl QQ topics — by using python or perl day to day I won’t know these topics 😦

Q: Say your perl program needs to create 15 concurrent child processes.. How do you wait for all children to finish?

Q: How do you connect your (perl) STDIN to a child process?

credit desk — CDS, corp bonds, MBS
rate desk — IRS, treasuries
c++ is used by quants but not really others.

JPMC IV #stock loan

Q: concurrent hashmap locking — locking on an internal bucket object or a map entry object?

Q: a weak reference object can have a finalize()? What if in finalize() you make the object “reachable from the GC root set”?

Q: when do you use WeakHashMap?

Q: what’s a phantom reference?

Q: static fields during serialization?

Q: how do you make a hashmap (not a hashtable) thread safe?

Q: how does jms provider keep track of which message has not been delivered to each durable subscriber?

Q: different modes of acknowledgment in JMS? what’s the default mode?

Q: what’s JMS correlation id?

Q: how does a java app know a file is already open for writing by another unix process id, so it should wait and avoid concurrent edit?
A: see https://www.thegeekstuff.com/2012/04/linux-file-locking-types/

Q: if one method has synchronized(b){synchronized(c){…}}, and another method has synchronized(c){synchronized(b){…}}, and b and c point to the same object, can you think of when people write such code?
A: always bad design, unnecessarily confusing. Deadlock prone. Programmer must know for certain that b and c reference the same object throughout. I’d add an assert as documentation.

Q: A set should contain unique objects, but what if you insert 2 distinct objects and then make them identical twins (i.e. a.equals(b))? Does the set now have duplicate entries? Consequence?

Q: how many types of iterators are there?

Q: what classic design patterns did you use?

Q: how is the treeMap or TreeSet implemented internally? What kind of binary tree?

Q: what’s Shell sort? What other sorts do you know?

Q: what’s the time complexity of quick sort?

Q: name 5 things you would consider before developing a java threading app.
A: privatize all fields in shared objects
A: immutables
A: reduce number of locks
A: I should know exactly how many threads system will create. Reduce the number.
A: any wait/notify required? Need to know early since they tend to mess up my design
A: extremely mindful of deadlocks
A: avoid synchronization if possible — use concurrent data structures
A: producer/consumer and other threading patterns. Avoid creating my own threading “pattern” or framework.

essential trading server arch QnA #tibrv,gemfire

Every trading server invariably uses some non-http network daemon. There’s always more than one process (JVM, C# or c++) on the server side. In FI/commodities (not low-latency eq) there’s often some MOM daemon such as JMS, tibrv or a gemfire notification daemon. Here are some fundamental questions:

Q22: on top of tcp/udp, what specific network protocol between the server-side and GUI?
A: I have seen rmi and protobuf over tib ems.

Q22a: how about JMS between server and swing? Did we see 160 subscribers on a given topic, due to that many swing installations?

Q33: on top of tcp/udp, what specific network protocol among the server-side processes?
A: I have seen tibrv, JMS, RMI, gemfire data distribution protocols …

Q44: since most trading servers must avoid DB latency, where does the trading data live? In memory?
A: i have seen gemfire, rttp,

Q45: in case of distributed cache (not replicated), how does one cache listener update another node?

Q55: how does the daemon stay alive after main() exits?
A: Look at ion, gemfire, activemq. There’s often at least 1 (1 enough) non-daemon thread that’s stuck in wait()

OMS in FX vs equities

OMS is often the biggest and the core module in any trading platform that supports pending orders and FIX execution.

What are the differences between an eq OMS (see [[Complete Guide]]) and an FX OMS?

FX is fundamentally OTC and quote-driven, not order driven as listed markets are.

Q: in FX, there’s no exchange making the (irrevocable) matching?

digitalSearchTree^trie^radixTree

  • designed for string (ie alphanumeric) keys. Simplest example could be decimal integers
  • sort order is the English dictionary sort order.
  • nearly optimal performance with low implementation effort
  • At insert time (or tree-build time), each digit (or character) of the key is used at the next branching node to decide whether to move left or right.
  • first digit is used to compare with the root node’s first digit? 2nd digit is used to compare with the 2nd digit of the 2nd generation branching node?
  • Unlike binary search tree, nodes on my left are not always smaller, but why???
  • comparable insert/search performance to a red-black tree, but with an implementation as easy as a binary search tree.
  • A trie forms the fundamental data structure of Burst-sort, which (in 2007) was the fastest string sorting algorithm. A faster algorithm (some radix sort) has since appeared; see fastest sort for:one str; arrayOfStr; 32bit ints: non-comparison
  • Digit can also be a Bit, but still different from a binary search tree. To understand, you may need a reasonable understanding of binary search tree first.
  • BST requires key-comparison at each branching node; DST uses quicker digit-comparison?
  • DST is not a common term. Could be related to the more common radix-tree and “trie”
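A minimal c++ sketch of trie insert/search over lowercase keys (my own, to contrast with a BST’s full key-comparison at each branching node):

#include <array>
#include <memory>
#include <string>

struct TrieNode {
  std::array<std::unique_ptr<TrieNode>, 26> child; // one branch per letter
  bool isKey = false;                              // marks the end of a stored key
};

void insert(TrieNode& root, const std::string& key) {
  TrieNode* cur = &root;
  for (char c : key) {                             // one digit/char per level,
    auto& next = cur->child[c - 'a'];              // no full key comparison
    if (!next) next = std::make_unique<TrieNode>();
    cur = next.get();
  }
  cur->isKey = true;
}

bool contains(const TrieNode& root, const std::string& key) {
  const TrieNode* cur = &root;
  for (char c : key) {
    const auto& next = cur->child[c - 'a'];
    if (!next) return false;
    cur = next.get();
  }
  return cur->isKey;
}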