Optional.empty()=immutable !!singleton

https://docs.oracle.com/javase/8/docs/api/java/util/Optional.html#empty-- points out that Optional.empty() may return a global singleton instance, but there is no guarantee! No one should assume it is a singleton. Use of q[ == ] assumes it, and is wrong.

Unlike with enums, the JVM doesn’t guarantee “singleton” here. I think this no-guarantee decision was made to give the compiler/JVM maximum freedom.

[[effJava]] suggests that immutable instances need no copying. They probably can be singletons, but don’t need to be singletons.

 


Optional.java notes

Q: if an optional is empty, will it remain forever empty?
A: Yes — the Optional instance itself is immutable, though the variable can be reseated to point to a different Optional.

— An Optional.java variable could be null but should never be, as even an empty Optional is a real instance holding at least the boolean isPresent state.

If a method is declared to return Optional<C>, then the author needs to ensure she doesn’t return a null Optional! This is not guaranteed by the language.

https://dzone.com/articles/considerations-when-returning-java-8s-optional-from-a-method illustrates a simple rule — use a local var retVal throughout then, at the very last moment, return Optional.ofNullable(retVal). This way, retVal can be null but the returned reference is never null.
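Below is a minimal java sketch of that rule; the method name and lookup logic are made up for illustration:

import java.util.Optional;

public class NicknameLookup {
  static Optional<String> findNickname(String user) {
    String retVal = null; // the local var may legitimately stay null
    if ("alice".equals(user))
      retVal = "Ally";
    return Optional.ofNullable(retVal); // wrap at the last moment; the returned reference is never null
  }
}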

If needed, an Optional variable should be initialized to Optional.empty() rather than null.

–immutability is tricky

  1. the referent object is mutable
  2. the Optional reference can be reseated, i.e. not q[ final ]
  3. the Optional instance itself is immutable.
  4. Therefore, I think an Optional == a mutable ptr to a const wrapper object enclosing a regular ptr to a mutable java object.

Similarity to String.java — [B/C]

Compare to shared_ptr instance — [A] is true.

  • C) In contrast, a shared_ptr instance has mutable State, in terms of refcount etc
  • B) I say Not-applicable as I seldom use a pointer to a shared_ptr

— get() throws NoSuchElementException if not present

— not serializable

— My motivation for learning Optional is 1) QQ 2) simplify my design in a common yet simple scenario

https://www.mkyong.com/java8/java-8-optional-in-depth/ is a demo, featuring … flatMap() !!

finding1st is easier than optimizing

  • Problem Type: iterate all valid choices without duplicates — sounds harder than other types, but usually the search space is quite constrained and tractable
    • eg: regex
    • eg: multiple-word search in matrix
  • Problem Type: find best route/path/combo, possibly pruning large subtrees in the search space — often the hardest type
  • Problem Type: find first — usually easier

In each case, there are recurring patterns.

GTD fear-graph: delays^opacity^benchmark..

  • — manifestations listed first; fundamental, hidden factors last
  • PIP, stigma, damagedGood. Note I’m not worried about cashflow
  • [f] project delays
  • [f] long hours
  • [f] long commute
  • large codebase in brown field. Am slower than others at reading code.
  • [f] opacity — is worse than complexity or codebase size
  • figure-out speed benchmark — including impostor’s syndrome
  • [f = faced by every team member]
The connections among these “nodes” look complex but may not be really complex.
PIP is usually the most acute, harmful item among them. I usually freak out, though in hindsight I always feel I tried my best and should not feel ashamed. Josh is one manager who refuses to use PIP, so under Josh I had no real fear.

single-writer lockfree data structure@@

All the lockfree data structures I have seen are designed for 2 or more writer threads.

Q: what if in your design, there can be at most one writer thread but some reader threads. Any need for lockfree?
A: I don’t think so. I think both scenarios below are very common.

  • Scenario: If notification required, then mutex + condVar
  • Scenario: In many scenarios, notification is not needed — readers simply drop by and read the latest record. In java, a volatile field suffices — see the sketch below.
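Here is a minimal java sketch of that second scenario (names made up). The single writer publishes via a volatile field; readers merely observe the latest record:

public class LatestRecordHolder {
  private volatile String latest; // volatile gives visibility; atomicity not needed with one writer

  public void publish(String record) { latest = record; } // called by the single writer thread only
  public String read() { return latest; }                 // called by any reader thread
}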

 

STL container=#1 common resource owner #RAII

Stroustrup said every resource (usually heapy thingies) needs to have an owner, who will eventually return the resource.

By his definition, every resource has an acquire/release protocol. These resources include locks, DB connections and file handles. The owner is the Object responsible for the release.

  • The most common resource owner is the STL container. When a std::vector or std::unordered_multimap … is destructed, it would release/return the resource to the heap memory manager.
  • The best-known resource-owner is the family of smart pointers.
  • You can also combine them as a container of smart pointers.
  • All of these resource owners rely on RAII

key^payload: realistic treeNode #hashset

I think only SEARCH trees have keys. “Search” means key search. In a search tree (and hashset), each node carries a (usually unique) key + 0 or more data fields as payload.

Insight — Conceptually the key is not part of the payload. Payload implies “additional information”. In contrast, key is part of the search tree (and hashset) infrastructure, similar to auto-increment record IDs in database. In many systems, the key also has natural meaning, unlike auto-incremented IDs.

Insight — If a tree has no payload field, then useful info can only exist in the key. This is fairly common.

For a treemap or hashmap, the key/value pair can be seen as “key+payload”. My re_hash_table_LinearProbing.py implementation saves a key/value pair in each bucket.

A useful graph (including tree, linked list, but not hashset) can have payloads but no search keys. The graph nodes can hold pointers to payload objects on heap [1], or the graph nodes can hold boolean values. Graph construction can be quite flexible and arbitrary. So how do you select a graph node? Just follow some iteration/navigation rules.

Aha — similar situation: a linked list has a well-defined iterator but no search key. Ditto vector.

[1] With pre-allocation, a static array substitutes for the heap.

Insight — BST, priorityQ, and hash tables need search key in each item.

I think non-search-TREES are rarely useful. They mostly show up in contrived interview questions. You can run BFT, pre|post-order DFT, but not BFS/DFS, since there’s nothing to “search”.

Insight — You may say search by payload, but some trees have Boolean payloads.

std::move(): robbed object still usable !

Conventional wisdom says after std::move(obj2), this object is robbed and invalid for any operation…. Well, not quite!

https://en.cppreference.com/w/cpp/utility/move specifies exactly what operations are invalid. To my surprise, a few common operations are still valid, such as clear() and operator=().

The way I read it — if (but not iFF) an operation wipes out the object content regardless of current content, then this operation is valid on a robbed/hollowed object like our obj2.

Crucially, any robbed object should never hold a pointer to any “resource”, since that resource is now used by the robber object. Most movable data types hold such pointers. The classic implementation (and the only implementation I know) is by pointer reseat.

func overloading: pro^con #c++j..

Overloading is a common design tool in java and other languages, but more controversial in c++. As an interviewer, I once asked for a “justification for using overload as a design tool”. I kinda prefer printInt/printStr/printPtr… rather than overloading on print(). I like explicit type differentiation, similar to [[safe c++]] advice on asEnum()/asString() conversion functions.

Besides the key features listed below, the most common j4 is convenience | readability….. a non-technical justification!

— ADL — is a key c++ feature to support smart overload resolution

— TMP — often relies heavily on function overloading

— optional parameter and default arguments — unsupported in java so overloading is the alternative.

— visitor pattern — uses overloading. See https://wordpress.com/post/bintanvictor.wordpress.com/2115

— ctor and operator overloading — no choice. Unable to use differentiated function names

— C language — doesn’t support overloading. In a sense, overloading is non-essential.

Name mangling is a key ABI bridge from c++ to C

5 constructs: c++implicit singletons

#1 most implicit singleton in c++ is the ubiquitous “file-scope variable”. Extremely common in my projects.

  • — The constructs below are less implicit as they all use some explicit keyword to highlight the programmer’s intent
  • keyword “extern” — file-scope variable with extern
    • I seldom need it and don’t feel the need to remember the details.. see other blogposts
  • keyword “static” — file-scope static variables
  • keyword “static” within function body — local static variables — have the nice feature of predictable timing of initialization
  • keyword “static” within a class declaration —  static field

~~~~~~  The above are the 5 implicit singleton constructs ~~~~~~

Aha — it’s useful to recognize that when a data type is instantiated many many times i.e. non-singleton usage, it is usually part of a collection, or a local (stack) variable.

Sometimes we have the ambiguous situation where we use one of the constructs above, but we instantiate multiple instances of the class. It’s best to document the purpose like “instance1 for …; instance2 for …”

power-of-2 sized hash tables: implications

https://dzone.com/articles/hashmap-internal  and http://coding-geek.com/how-does-a-hashmap-work-in-java/ explain that java HashMap uses bucket count equal to power-of-two. A few implications

  • the modulo operation is a fast bit-masking i.e. select the lowest N bits: hash & (length-1)
  • an unlucky hashCode() may not offer dispersion among the lower bits. Assume 16 buckets, so only the lowest 4 bits matter, but suppose hashCode() returns predominantly multiples of 8. Then two out of 16 buckets would be white hot. To guard against this, HashMap performs a rehash on the hashCode() output.

I implemented the same feature in https://github.com/tiger40490/repo1/blob/py1/py/88miscIVQ/re_hash_table_LinearProbing.py

Note rehash is ineffective if hashCode() returns very few distinct values. In java8, there’s one more protection against it — RBTree as alternative to linked list… See RBTree used inside java8 hashmap
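A small demo of the bit-masking plus the java8 spread function. The sample hash values are contrived to differ only in their high bits, which is exactly the case the rehash guards against:

public class PowerOfTwoBuckets {
  static int spread(int h) { return h ^ (h >>> 16); } // the java8 HashMap rehash: fold high bits into low bits

  public static void main(String[] args) {
    int mask = 16 - 1; // 16 buckets -> select the lowest 4 bits
    for (int h : new int[]{1 << 16, 2 << 16, 3 << 16}) {
      System.out.println(h + ": naive bucket " + (h & mask)
          + " -> spread bucket " + (spread(h) & mask)); // all land in bucket 0 naively; 1,2,3 after spread
    }
  }
}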

lowest missing+ve int#Codility #70%

Q: Write a function int solution(int[] A);  that, given an array A of N integers, returns the smallest natural number that does not occur in A. For example, given A = [1, 3, 6, 4, 1, 2], the function should return 5.
Given A = [1, 2, 3], the function should return 4.
Given A = [−1, −3], the function should return 1.

• each element of array A is a 32-bit signed int
• expected worst-case time complexity is O(N)
• expected worst-case space complexity is O(N)
• elements of input arrays can be modified

https://leetcode.com/problems/first-missing-positive/description/ is similar but O(1) space and average O(N) time!

—— my analysis —–

The mutability and O(1) space hint at — saving array indices as payload!

—- Idea A:

first scan to swap non-positives to the end and remember the new boundary (say #111).

In the same scan also determine min and max. Suppose min=55.

Another scan to reduce all payloads by 55 so they fall in the range of 0 to max-55.

Now use CSY technique .. check array@0-N in-situ #Nsdq#contrived
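For reference, here is a standard in-place java sketch in the same spirit (not necessarily identical to the CSY technique): park each value v at cell v-1 by swapping, then scan for the first mismatch. O(N) time, O(1) space:

public class FirstMissingPositive {
  static int solution(int[] A) {
    int n = A.length;
    for (int i = 0; i < n; i++) {
      // keep swapping A[i] into its home cell A[i]-1, unless out of range or already home
      while (A[i] > 0 && A[i] <= n && A[A[i] - 1] != A[i]) {
        int t = A[A[i] - 1]; A[A[i] - 1] = A[i]; A[i] = t;
      }
    }
    for (int i = 0; i < n; i++)
      if (A[i] != i + 1) return i + 1;
    return n + 1; // array holds exactly 1..n
  }
  public static void main(String[] args) {
    System.out.println(solution(new int[]{1, 3, 6, 4, 1, 2})); // 5
  }
}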

—-solutionB: make_heap O(1) space but O(N logN) time. Build a min-heap based on the array in O(1) space, then keep calling min() and delete_min().

  • the make_heap documentation shows a random access container (vector used in the demo) being “rearranged”, implying O(1) space
  • the make_heap documentation shows O(N) to construct the heap

min() is O(1) but delete_min() is O(log N), so overall we probably get O(N logN)

—-solutionC: radix sort. https://en.wikipedia.org/wiki/Radix_sort#In-place_MSD_radix_sort_implementations shows an in-place binary radix sort.

First pass to transform all negative numbers to 0. Then iterate the sorted array and check for the earliest “gap”. Worst case — you get 1,2,3… without a gap, so the answer is 1 + the largest array element.

O(W*N) where W is width of the largest integer. If we assume 64-bit then W is a constant:)

http://www.drdobbs.com/parallel/parallel-in-place-radix-sort-simplified/229000734 is another in-place radix sort.

clean-up: non-overlapping intervals #70%

Q (L435): Given a collection of intervals, find the minimum number of intervals you need to remove to make the rest of the intervals non-overlapping.

Note:

  1. You may assume the interval’s end point is always bigger than its start point.
  2. Intervals like [1,2] and [2,3] have borders “touching” but they don’t overlap each other.

==== analysis:

I think this is the same as the greedy room scheduler. Schedule the earliest-ending task, so as to maximize accepted meetings.

A deselected meeting is an interval removed.
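A minimal java sketch of this greedy idea, assuming each interval is an int[2]: sort by end point; accept every interval that starts at or after the last accepted end (touching borders are fine); removals = total − accepted.

import java.util.Arrays;

public class NonOverlapIntervals {
  static int minRemovals(int[][] intervals) {
    Arrays.sort(intervals, (a, b) -> Integer.compare(a[1], b[1])); // earliest-ending first
    int accepted = 0, lastEnd = Integer.MIN_VALUE;
    for (int[] iv : intervals)
      if (iv[0] >= lastEnd) { accepted++; lastEnd = iv[1]; }
    return intervals.length - accepted;
  }
  public static void main(String[] args) {
    System.out.println(minRemovals(new int[][]{{1,2},{2,3},{3,4},{1,3}})); // 1
  }
}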

java off-heap memory #cf database,reinterpret_cast

  • justification — see “jGC” item
  • tradeoff — ?
  • java.nio.ByteBuffer — non-blocking IO api is very relevant. It has API to access memory-mapped files, among other things. More often, I think we probably use NIO to request a big chunk of memory from kernel, and bypass the java heap.
  • serialization — most (if not all) off-heap techniques require objects serialized and stored off-heap.
  • byte array — Unlike an object pool, off-heap memory solutions must serialize Trade objects into a binary byte array [1] and store it. It has to be transformed back to Trade objects before use. C would use reinterpret_cast() — no disk involved.
  • offloading jGC — With huge data moved off the heap, jGC is now much faster, and jitter-free. This is the main goal and justification of off-heap
  • bigger — off-heap storage can be bigger than on-heap storage.
  • slower — off-heap storage is slower (less immediate) than on-heap, but faster than disk or database[2].
  • performance boost — even though “slower”, it can boost performance of GC.
  • memory sharing — between two JVM’s is easier if using off-heap. I think memory-mapped file is one means.

[1] In this way, the NYSE XDP protocol could arguably be considered an off-heap serialization protocol.

[2] database is actually a competing alternative solution for some use cases like “shared-cache” or “large lukewarm cache”. I feel database solution is very proven, reliable but slower than off-heap.
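To make the earlier ByteBuffer bullet concrete, here is a minimal sketch of requesting off-heap memory via NIO; the 100-byte “Trade” layout is made up for illustration:

import java.nio.ByteBuffer;

public class OffHeapDemo {
  public static void main(String[] args) {
    ByteBuffer buf = ByteBuffer.allocateDirect(100); // off-heap; its payload is invisible to jGC
    buf.putLong(0, 12345L);   // say, serialized tradeId at offset 0
    buf.putDouble(8, 99.25);  // say, price at offset 8
    System.out.println(buf.getLong(0) + " @ " + buf.getDouble(8)); // deserialize before use
  }
}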

Among the many disparate online articles about off-heap, here are my top 2 fav

  1. https://stackoverflow.com/questions/6091615/difference-between-on-heap-and-off-heap
  2. https://www.tutorialdocs.com/article/spark-memory-management.html

hashtable resizing ≅ jGC jitter #unpredictable

“Note because of the rehashing issue — realtime applications and applications that need low latency — should not use a hash table as their data structure.” — said one guy on StackOverflow.

I would say if we have relative confidence about total data size (like below 1 million) then we can still use hash table. A benchmark test is not hard to set up, to compare against alternative designs such as RBTree. For large data sizes, I think hash table would win.

Note in latency sensitive systems, resizing (and stop-the-world GC) is likely to happen at busiest time.

Can we trigger a resize? Just send in more data to build up the table.

Can you trigger a pre-emptive GC? See can you force garbage collector to run@@

 

vtable also contains.. #class file

C++ is more complex than java. A typical vtable in c++ contains

  • offset of base type subobject. In multiple inheritance, this offset is often non-zero. This offset is needed not only for field access but also up-casting
  • typeid for RTTI

These details are part of the compiler ABI, since object files from older and newer compilers (of the same brand) could link together iFF they agree on these details.

Best-known part of ABI is name-mangling algorithm. This vtable detail would be the 2nd best-known ABI feature.

I believe the class file in java is one file per class, holding per-class metadata. Therefore, the c++ vtable is loosely the equivalent of a java class file.

 

touch not cross: path between 2 corners

Q1: from P183 [[discrete math]]: given an n x n grid, start from the north-west corner, moving south or east each step, towards the south-east corner. The diagonal connecting the two corners can be touched from the north, but not crossed. Print all paths.

easier to treat origin as [0,0] and end as [N,N]

DFT will require deep recursion.
BFT (with color) where each node remembers all paths-from-root? Kinda brute force

Insight — Actually this is not necessarily a graph problem though it can be solved that way.

Q2 (accepted@leetcode ): Given n pairs of parentheses, write a function to generate all combinations of well-formed parentheses. For example, given n = 3, a solution set is:

[ “((()))”,
“(()())”,
“(())()”,
“()(())”,
“()()()” ]

====analysis
Four related problems — These two problems are related to the abbreviation generator i.e. the combination generator.

However, the abbr/combo generators are more versatile and possibly overkill for this problem. These two problems can use a bit array to represent the output. I think my solution on github is probably considered inelegant but I don’t care. BigO insight — the number of paths (or valid strings) is the Catalan number, which grows exponentially in N, so no solution can beat that bound.
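For Q2, here is a standard backtracking sketch (not the github solution above): grow the string left to right; we may open while opens < n, and close while closes < opens, so every prefix stays well-formed:

import java.util.ArrayList;
import java.util.List;

public class GenParentheses {
  static List<String> generate(int n) {
    List<String> out = new ArrayList<>();
    recurse(out, new StringBuilder(), 0, 0, n);
    return out;
  }
  static void recurse(List<String> out, StringBuilder sb, int open, int close, int n) {
    if (sb.length() == 2 * n) { out.add(sb.toString()); return; }
    if (open < n)     { sb.append('('); recurse(out, sb, open + 1, close, n); sb.setLength(sb.length() - 1); }
    if (close < open) { sb.append(')'); recurse(out, sb, open, close + 1, n); sb.setLength(sb.length() - 1); }
  }
  public static void main(String[] args) {
    System.out.println(generate(3)); // [((())), (()()), (())(), ()(()), ()()()]
  }
}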

In general, our own ideas are often inefficient. If efficient, then often inelegant by some arbitrary interview standard. Still more valuable than learning standard solutions. One of the biggest values is xRef which helps build insight, intuition, thick->thin.

serialize binTree #funptr

struct Node{
  Node *left, *right; // both asterisks needed — q[Node * left, right] would declare right as a non-pointer
  int data;
};

… If you simply output as “data,leftId,rightId | data,leftId,rightId | …” then the graph structure i.e. the links are lost. Therefore, I feel the stream needs to identify each node: “Id,data,leftId,rightId|…” Next question is how to assign the id values. Easiest — use the node address as a string id! Suppose there’s a treewalk algo walk(Node * root, void ( *callback ) (Node *)), I will call it like

walk(root, serialize1node); // where serialize1node is the callback.

Note the position of the asterisk:

  • It’s on the right of type name Node — read left to right .. pointer to Node
  • It’s on the left of variable name callback — read left to right .. callback is a pointer to …

https://github.com/tiger40490/repo1/blob/cpp1/cpp/binTree/serialize_bbg.cpp is my tested solution. I feel the challenge is ECT and syntax, not the idea.

—- elegant home-grown two-pass solution

Assign incremental IDs to each node in an in-order walk. Then run a pre-order walk to output each ID.

  • 🙂 applicable for any binTree, but not for 3-way trees or higher
  • 🙂 No null node needed
  • 🙂 no space overhead. Instead of serializing addresses, I serialize my generated ID of each node, which is usually smaller and never bigger than a pointer. I think the serialized byte array is the smallest possible.
  • Key insight — from a given preorder output, there’s exactly one possible BST we can construct.

—- solution: standard graph representations — AdjMatrix + edgeSet, applicable on any graph

—- solution: BFT -> list including nulls

—- idea from my friend Deepak: pre-order tree walk and output one line per node {payload, direction from root, direction from Level 1 ancestor, direction from Level 2 ancestor..}. The technique is reusable even though overall efficiency and complexity is not optimal. We don’t need to be optimal.

  • first node in the file is always the root
  • the output file always includes the higher level nodes before the lower-level nodes
  • delimiter (\n) needed between nodes
  • demerit — data bloat… too verbose if tree is sparse but 99 levels deep

—-If the tree is a BST, then this idea is based on the fact that inserting a series of unsorted keys can reconstruct a BSTree.

For this solution, we have no requirement on the payloads. They can be booleans.

first pass — in-order walk to build a hash table of {node address -> integer id}. If we treat the ID’s as node keys, then the tree is a BST.

2nd pass — BFT or pre-order walk the original tree. For each node, we write node ID’s to the file, without writing the links or addresses. File looks like T,5|T,4|T,6|T,2|… When we reconstruct the tree, we read each node from the file, and insert that node into our new tree, using the id value as key to decide where.

If the original tree nodes have keys, I will just treat it as payload, and use my assigned ID as key

sorted_Array|_List ⇒ balanced_BST

Q1 (easy): Given an array where elements are sorted in ascending order, convert it to a height balanced BST, where the depth of the two subtrees of every node never differ by more than 1.

Q2 (medium): How about a sorted slist?

==== analysis

I feel this “easy” problem should be medium. Perhaps there are elegant solutions.

I need not care about the payloads in the array. I can assume each payload equals the subscript.

There are many balanced BSTs. Just need to find one
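For Q1 on an array, the standard recursion is short — a sketch (payload equals subscript, per the assumption above): the middle element becomes the root, so the two subtrees differ in size by at most 1 at every level:

public class SortedArrayToBst {
  static class Node {
    int val; Node left, right;
    Node(int v) { val = v; }
  }
  static Node build(int[] a, int lo, int hi) { // inclusive range [lo, hi]
    if (lo > hi) return null;
    int mid = lo + (hi - lo) / 2;
    Node root = new Node(a[mid]);
    root.left  = build(a, lo, mid - 1);
    root.right = build(a, mid + 1, hi);
    return root;
  }
}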

— idea 1: construct a sequence of positions to probe the array. Each probed value would be added to the tree. The tree would grow level by level, starting from the root.

The sequence is similar to a BFT. This way of tree building ensures

  • only the lowest branch nodes can be incomplete.
  • at all branch levels, node count is 2^L

I think it’s challenging to ensure we don’t miss any position.

Observation — a segment of size 7 or shorter is easy to convert into a balanced subtree, to be attached to the final BST.

When a subarray (between two probed positions) is recognized as such a segment we can pass it to a simple routine that returns the “root” of the subtree, and attach it to the final BST. This design is visual and a clean solution to the “end-of-iteration” challenge.

The alternative solution would iterate until the segment sizes become 0 or 1. I think the coding interviewer probably prefers this solution as it is shorter but I prefer mine.

— idea 2

Note a complete binary tree can be efficiently represented as an array indexed from one (not zero).

— Idea 3 (elegant) dummy payload — For the slist without using extra storage, I can construct a balanced tree with dummy payload, and fill in the payload via in-order walk

All tree levels are full except the leaf level — guarantees height-balanced. We can leave the right-most leaf nodes missing — a so-called “complete” binary tree. This way the tree-construction is simple — build level by level, left to right.

— idea 4 STL insert() — For the slist, we can achieve O(N) by inserting each element in constant time using STL insert() with hint

— For the slist without using an array, I can get the size SZ. (then divide it by 2 until it becomes 7,6,5 or 4.) Find the highest 3 bits (of SZ) which represent an int t, where 4 <= t <= 7. Suppose that value is 6.

We are lucky if every leaf-subtree can be size 6. Then we just construct eight (or 4, or 16 …) such leaf-subtrees

If SZ doesn’t give us such a nice scenario, then some leaf-subtrees would be size 5.  Then my solution is to construct eight (or 4, or 16 …) leaf-trees of type AAAAA (size 6) then BBB (size 5).

Lucky or unlucky, the remaining nodes must number 2^K-1 like 3,7,15,31 etc.  We then construct the next level.

–For the slist without using an array, I can propose an O(N logN) recursive solution:
f(slist, sz){
  locate the middle node of slist. Construct a tree with root node populated with this value.
  Then cut the left segment as a separate slist1, compute its length as sz1. node1 = f(slist1, sz1); set node1 as the left child of root.
  Repeat for the right segment.
  return the root
}

shared vars across files: prefer static field

When I have state to maintain and share across compilation units, there are basically three main types of variables I can create. (Non-static Local variables are simple and don’t maintain state.)

  1. nonstatic field of a singleton class — Note any user of this variable needs a reference to the single object 😦
  2. file scope var in a *.cpp  — relatively simple usage, but don’t put in a shared header , as explained in global^file-scope variables]c++
  3. public static fields — most versatile design. Any user can access it after they #include the class header file.
  4. — non-contenders
  5. local static variables — (niche usage) You can create a local static var in myfunc(). To share the variable across compilation units, myfunc() can return a reference to this object, so from anywhere you can use the return value of myfunc(). This is a simple form of singleton.
  6. global variables — monster. Probably involves “extern”. See my other blogposts

The advantages of static fields often apply to static methods too.

In fact, java leaves you with nothing but this choice, because this choice is versatile. Java has no “local static”, no file-scope, no global variables.

void list.reverse() ^ reversed()→iterator

These three constructs do three different things, without overlap

  • mylist[::-1] returns a reversed clone
    • myStr[::-1] works too
  • mylist.reverse() is a void in-place mutation, same as mylist.sort()
  • reversed(mylist) returns an iterator, not printable but can be used as argument to some other functions
    • I feel this can be convenient in some contexts.
    • reversed(myStr) works too but you can’t print it directly

Similarly,

  • Unlike items(), dict.iteritems() returns an iterator (python2)

Different case:

  • sorted(any_iterable) creates a new vector (i.e. “list”). Returning an iterator here would be technically stupid.

generate Fib(N) in O(1)time and space: 2 techniques

Like all constant-recursive sequences, the Fib sequence has a closed-form formula Fib(N) that returns an integer, albeit in floating-point form. By rounding, we can generate Fib(N) in constant time.

https://www.math.hmc.edu/funfacts/ffiles/10002.4-5.shtml has the formula we can easily program.

[ Phi^n – (-Phi)^-n ] / √5  # Notice the three minus signs

where the constant Phi := (1+√5) / 2.

Similarly, if we are given 89 as a Fib number, we can use logarithm to compute the rank of this Fib number.

https://github.com/tiger40490/repo1/blob/cpp1/cpp/lang_33template/FibDeepak.cpp shows two O(1) solutions
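A minimal java sketch of both O(1) techniques — closed-form Fib(N) by rounding, and rank-from-value by logarithm. Note double precision keeps the rounding exact only for small N (roughly N <= 70):

public class FibClosedForm {
  static final double SQRT5 = Math.sqrt(5), PHI = (1 + SQRT5) / 2;

  static long fib(int n) { // the rounded closed-form formula above
    return Math.round((Math.pow(PHI, n) - Math.pow(-PHI, -n)) / SQRT5);
  }
  static int rank(long fibNumber) { // inverse: which Fib is this?
    return (int) Math.round(Math.log(fibNumber * SQRT5) / Math.log(PHI));
  }
  public static void main(String[] args) {
    System.out.println(fib(11));  // 89
    System.out.println(rank(89)); // 11
  }
}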

a standard representation of binTree

I think on Leetcode there’s now a standard representation of simple binTree. https://leetcode.com/problems/convert-sorted-array-to-binary-search-tree/ shows one example.

see also (de)serialize binary tree #funptr ECT

Let’s not spend too much time here. This representation is simple but not very visual. Can’t deal with recombinant trees. It is useful only for Binary trees not general graphs. Graphs need adjacency matrix or edgeSet.

— My own idea is probably similar:

Can we make do without id and design the serialization sequence to save/reconstruct the tree structure?
I can output layer by layer. If the 2nd layer has no missing node, then the 3rd layer consists of exactly 4 nodes, using a sentinel value for any NUL. If the 2nd node is NUL as in [A/NUL/C/D], then the next layer would have “2 child nodes for A, 2 child nodes for C, 2 child nodes for D”, where the 2 child nodes for A could be NUL/NUL

What if all integer values are valid values? Trivial problem. I could add a one-bit flag to each serialized payload.

The id/edgeList based solution is general purpose and efficient. The techniques above are petty tricks (雕虫小技). Venkat of OC said something similar about the regex state machine. In the same vein, disjoint set is a general solution though many problems have simplified union-find solutions.

too many DB-writes: sharding insufficient #Indeed

Context: each company receives many many reviews. In a similar scenario, we can say that too many user comments flood in during the soccer world cup.

Aha — the updates don’t need to show up on browser in strict order

Aha — commenting user only need to see her own comment on top, and need not see other users’ latest comments. The comments below her own could be browser-cached content. Ajax…

Interviewer: OK you said horizontal sharding by company id can address highly concurrent data store updates. Now what if one company, say, Amazon, by itself gets 20% of the updates so sharding can’t help this one company.

me: I guess the update requests would block and possibly time out. We can use a task queue.

This is similar to WordPress import/export requests.

  • Each task takes a few seconds, so we don’t want user to wait.
  • If high server load, the wait could be longer.
  • More important — this task need not be immediate. It can be scheduled.
  • We should probably optimize for server efficiency rather than latency or throughput

So similarly, all the review submissions on Amazon can be queued and scheduled.

In FIX protocol, an order sender can receive 150=A meaning PendingNew. I call it “queued for exchange”.  The OMS Server sends the acknowledgement in two steps.

  1. Optional. Execution Report message on order placement in OMS Server’s Order Book. ExecType (150) = A (Pending New)
  2. Execution Report message on receiving acknowledgement from exchange. ExecType (150) = 0 (New). I guess this indicates placement in the exchange order book

Note this optimization is completely different from HFT latency engineering.

ringBuffer@pre_allocated objects to preempt JGC

Goal — to eliminate JGC completely.

Design 1: I will want Order.java to use primitive fields only and avoid reference fields [1] at all cost, so the total footprint of an Order is known in advance. Say it’s 100 bytes. I will create 10M dummy Order instances, possibly scattered in the heap (not adjacent as in c++), and hold their 10M addresses in an Order array… about 1GB footprint for the Order objects + 80M footprint for the array of 8-byte pointers.

(Note I reuse these Order instances in this object pool and never let them get garbage-collected.)

Then I need a few subscripts to identify the “activeRegion” of the ring, but how about released slots enclosed therein?

[1] timestamps will be ints; symbolIDs and clientIDs are ints; short ascii strings will use 64-bit ints (8 characters/int); free-form strings must be allocated off-site:(
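Here is a minimal java sketch of Design 1 (all names made up; the released-slot question above is side-stepped by assuming FIFO lifetimes):

public class OrderRing {
  static final class Order { // primitive fields only, so footprint is fixed
    long timestamp; int symbolId; int clientId; double price; int qty;
  }
  private final Order[] ring;
  private long head, tail; // head = next free slot; tail = oldest live order

  OrderRing(int capacity) {
    ring = new Order[capacity];
    for (int i = 0; i < capacity; i++) ring[i] = new Order(); // allocate once at start-up; never garbage-collected
  }
  Order acquire() { return ring[(int) (head++ % ring.length)]; } // reuse a dummy instance; no q[new] at runtime
  void release()  { tail++; } // FIFO assumption: the oldest order retires first
}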

Design 2a: To avoid the “scatter” and to place the Order instances side by side, Can we use a serialized byte[100] array object to represent one Order? Can we use one gigantic off-heap byte array to hold all Orders, eliminating the 80M footprint? See java off-heap memory

Design 2b: https://blog.bramp.net/post/2015/08/26/unsafe-part-2-using-sun.misc.unsafe-to-create-a-contiguous-array-of-objects/ shows a contiguous array of java objects, like std::vector<MyObject>

Design 2c: https://www.ibm.com/support/knowledgecenter/en/SSYKE2_7.1.0/com.ibm.java.lnx.71.doc/user/packed_optimizing.html is a feature in IBM jvm

Ring buffer is good if the object lifetimes are roughly equal, giving us a FIFO phenomenon. This occurs naturally in market data or message passing gateways. Otherwise, we may need a linked list (free list) of released slots in addition to a pair of subscripts to identify the active region.

It might be better to allocate a dedicated buffer for each thread, to avoid contention. Drawback? One buffer may get exhausted when another stays unused.

low-level^high-level expertise

See also my blogpost on wrapper^low-level API

Compared to laymen on the street, I have accumulated fairly deep, specific and demonstrable non-local expertise in two lucrative fields i.e. software dev + finance

  • concurrency details (theory++) in real languages
  • dStruct choices and effective combinations
  • dStruct implementation details in java/c++/python
  • SQL joins and tuning
  • sockets? no deep insight but my expertise beats most peers
  • Most standard algo problems are familiar to me
  • memory models, memory layout.. beneath c++ and java
  • drv pricing math, VaR statistics, bond math. My expertise goes beyond the degree

These trophies are won based on TSN, interviews and personal time investment … not automatically

— Let me focus on low-level vs high-level
Most of the dev expertise domains are low-level, consequently very specific — you either know it or don’t. I can even teach in these subjects. The more low-level, the rarer is the expertise. The laymen developers have a vague idea of the low-level details, for many reasons.

High-level understanding is often more useful (than low-level, theoretical QQ knowledge) in projects, but in job interviews, low-level knowledge is differentiator.

Practically all (98%) high-end java/c++ interviews use low-level questions as differentiator. I excluded the start-ups as I’m unfamiliar with them… Eg: Quoine.

In my dev (not system architect) interviews, they rarely asked high-level stuff like spring rationale, design patterns…
I tend to believe these questions can’t /separate the wheat from the chaff/

The finance dnlg  also serves as differentiator among developers. The most specific sub-domains are quant math, the “architecture” and some of the jargon.

C++ is more low-level than java; java is more low-level than python…

Q: Paradoxically, C and assembly are even more low-level but not more valuable?
%%A: Some C knowledge is highly valued. For example, kernel knowledge is high-value and mostly at the level of C, compiler, assembly, and hardware
%%A: assembly expertise in device drivers or embedded is not really relevant to HFT, and too far from the money

MOM is not low-level and seldom asked.

java singleton^immutable classes #enum

In c++ (and perhaps java) most singletons are designed as “casual singletons” i.e. they are used as singletons, but can be instantiated twice if you really want to, although there’s seldom a good justification.

[[effJava]] explains that an immutable class needs no copying. I think this is a performance view. (Does it apply to Optional.java? )

However, we don’t need to work hard trying to make them strict singletons. Strict singleton is a functional requirement i.e. a 2nd instance would be a functional defect.

If an immutable class happens to be used as one-instance data type, then lucky. But tomorrow it may get instantiated twice.. no worries 🙂

If boss asks you to make this class singleton, you should point out the legwork required and the numerous loopholes to fix before achieving that goal. Worthwhile?

Java enum types are special. The JVM guarantees them to be singletons, even in the face of serialization; immutability still depends on the fields you declare, though enum constants are conventionally immutable. See P 311.
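The classic [[effJava]] enum singleton, as a reminder (names made up):

public enum ConfigHolder {
  INSTANCE; // the JVM guarantees exactly one instance, even across serialization

  private final String env = "prod"; // keep fields final to preserve immutability
  public String env() { return env; }
}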

python nested function2reseat var] enclos`scope

My maxPalindromeSubstr code in https://github.com/tiger40490/repo1/tree/py1/py/algo_str demos the general technique, based on https://stackoverflow.com/questions/7935966/python-overwriting-variables-in-nested-functions

Note — inside your nested function you can’t simply assign to such a variable; the assignment would silently create a new local. This is like assigning to a local reference variable in java. Python3’s nonlocal keyword is the direct fix; python2 needs a workaround such as a one-element list.

https://jonskeet.uk/java/passing.html explains the fundamental property of java reference parameter/argument-passing. Basically same as the python situation.

In c# you probably (99% sure) need to use ref-parameters. In c++, you need to pass in a double-pointer. Equivalently, you can pass in a reference to a pre-existing 64-bit ptr object.

if external lib return object by pointer #Rahul

My colleague Rahul used some external library util and found out it returns a pointer. We briefly discussed the implications. Here are some afterthoughts.

  • Pattern — if I pass in an original object by reference, the lib util may return a wrapper object by pointer.  In such a context, is this wrapper object on heap? Always?

I would say usually on heap. But since the lib implementation is invisible, it may do other things.

In an alternative design, the library could update an object pre-existing in a ring buffer (or 3D matrix) and make it a wrapper. There’s no heap allocation in this round trip. The ring buffer itself could be previously allocated on heap or data segment, not as a local object in main() function. Reason — main() can end before other threads 😦

  • Pattern — if I pass in some parameters, the lib util could create a brand new object, based on my parameters. So in this context is this new-born on heap? Always?

I would say usually on heap. But since the lib implementation is invisible, it may use non-heap.

In a specialized design, the library could return singletons held in a lookup table. The singletons are usually pre-existing in heap or data-segment.

Q: Now suppose the library does return a new heap object, who would be responsible to release the memory?
A: One common design (Sutter) requires the library factory to own it. Reference counted smart pointers are useful, as the factory knows the reference count of the new heap object.

Q: Now suppose the library factory returns a new heap object by raw pointer, who would be responsible to release the memory?

c++factory usually heapy

Java factory has little (any?) choice, but the focus here is c++ factory techniques. I think a c++ factory usually has to manufacture on heap.

In specialized cases, my factory can maintain a private pre-allocated object pool, carved out from heap, so the factory returns objects from this pool, without calling new().

To my surprise, some c++ factories actually return by value, as illustrated in both [[c++cookbook]] and https://herbsutter.com/2013/05/30/gotw-90-solution-factories/ — RVO and move-semantics would likely kick in.

pre-allocate DTOs@SOD #HFT #ets

example — etsflow framework pre-allocates object pool (presumably the flow elements) for the day, to avoid runtime call to malloc. Are these objects ever released to the pool? I doubt it since all of these objects are subject to query or bust.

example — RTS pre-allocates outgoing message objects from a ring buffer’s head, and “returns” to the ring buffer at the tail… See How RTS achieved 400-700 KMPS #epoll

example — Sell-side HFT OMS also uses pre-allocation. Suppose for every new order there are 3 new DataTransferObjects A/B/C to be instantiated on heap. Traditional approach would make 3 allocation requests in real time. I think the free-list manager becomes a hotspot, even if there’s a per-thread free list.

Basically HFT avoids new/malloc after market opens. RTS uses mostly arrays, and very few (rather small) STL containers. Those STL containers tend to be populated before market opens and remain static.

Pre-allocation is a popular technique. We compute at compile time the sizes of A/B/C based on their class declarations. For DTO class A, sizeof(A) just adds up the non-static data field sizes (plus any padding). Then we estimate how many orders we will get a day (say 7 million). Then we pre-allocate 7 million A objects in an array. The allocation happens at start-up, though the sizes are compile-time constants.

When an order comes in, the A/B/C DTO objects are already allocated but empty.

Byte-array is an alternative, but this smells like the raw free list to me…

checked STL^checked java Collections

jdk checkedList, checkedMap etc are designed to check type errors — checking any newly added item has the correct type for the collection. See P 246 [[java generics]]

STL checked containers check very different coding errors. See http://www.informit.com/articles/article.aspx?p=373341, which is extracted from the book [[c++codingStd]]

contains(): Set^SortedSet

–in java # See P247/32 [[java generics]]

  • Set<Acct> contains() uses Acct.equals()
  • SortedSet<Acct> contains() uses Comparable<Acct> or a custom comparator class, and ignores Acct.equals()
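A quick java demo of that difference (contrived comparator):

import java.util.HashSet;
import java.util.TreeSet;

public class ContainsDemo {
  public static void main(String[] args) {
    TreeSet<String> sorted = new TreeSet<>(String.CASE_INSENSITIVE_ORDER);
    sorted.add("Acct1");
    System.out.println(sorted.contains("ACCT1")); // true — the comparator deems them equivalent

    HashSet<String> plain = new HashSet<>();
    plain.add("Acct1");
    System.out.println(plain.contains("ACCT1"));  // false — equals() says different
  }
}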

–In STL

The tree-based containers use “equivalence” to determine containment, basically same as the java  comparator.

The hash-based containers use hashCode + an equality predicate. The implementation details are slightly complicated since both the hash function and the predicate function are template params. Upon template instantiation, the two concrete types become “member types” of the host class. If the host class is unordered_set<string>, then we get two concrete member types:

unordered_set<string>::hasher and unordered_set<string>::key_equal

These member types can be implemented as typedefs or nested classes. See ARM.

##java8 features #MethodRef

  • lambda
  • default methods
  • streaming
  • method references? Supposed to be an important feature in the core language/syntax. related to lambda syntax, presumably a syntactic sugar coating.
  • java.util.Optional<T> introduced to reduce NPE. Compare boost::optional
  • CompletableFuture<T> to promote async, event-driven programming
  • PermGen disappears in java8
  • –not really java8 features:
  • java7 introduced G1 garbage collector though default GC remains ParallelGC, for both java8 and java7

##java heap allocation+!explicit q[new]

Most of these are java compiler tricks. Here are a few java heap allocations without explicit q[new]

  • (pre-runtime) enum instance instantiation — probably at class-loading time
    • P62 [[java precisely]]
  • String myStr = “string1”; // see string pool blogpost
    • P11 [[java precisely]]
    • NOT anonymous temp object like in c++
  • “string1” + “str2” — is same as myStr.concat(..). So a method can often new up an object like this.
    • P10/11 [[java precisely]]
  • boxing
  • (most tricky) array initialization
    • int[] days ={31,28,31/* instantiates the array on heap */};
    • P17 [[java precisely]] has examples
    • P16 [[java precisely]] also shows an alternative syntax “new int[]”

##Java9 features #fewer than java8

  1. #1 Most important – modular jars featuring declarative module-descriptors i.e. requires and exports
  2. #2 linux cgroup support.. For one example, see Docker/java9 cpu isolation/affinity
  3. #3 G1 becoming default JGC.. CMS JGC: deprecated in java9
  4. REPL JShell
  5. private interface methods, either static or non-static
  6. Minor: C++11 style collection factory methods like

List<String> strings = List.of("first", "second");


It’s unbelievable but not uncommon in Java history —

  • Java9 release introduced significantly fewer and less impactful features than java8.
  • Similarly, java5 overshadows java6 and java7 combined

 

real-world treeNode has uplink

I think only in contrived interview questions do we find tree nodes without uplink.

Uplink basically means every edge is bidirectional. With uplinks, from any node we can easily trace to the ancestor nodes.

In real world trees, uplink is cheap-yet-valuable most of the time. AVL tree node has uplink, but let’s look at RBTree:

  • RBTree needs uplink for tree rotation.
  • Also, I believe when you give an insertion hint, the “engine” needs to validate the hinted node, by checking parent + another ancestor.

noexcept impact on RAII

If a function (esp. dtor) is declared noexcept, the compiler can choose to omit stack-unwinding “scaffolding” around it. Among other things, there’s a runtime performance gain. This gain is a real advantage of using noexcept instead of the empty throw(), which is deprecated in c++0x.

Q: Is there any impact on RAII?
%%A: yes

Q: Can we even use RAII in such a context?
%%A: I think we can. If the function does throw, then std::terminate() runs, instead of the destructors

3 real overheads@vptr #inline

Suppose your class Trade has virtual functions and a comparable class Order has no virtual functions. What are the specific runtime overheads of the vptr/vtable usage?

  1. cpu cache efficiency — memory footprint of the vptr in each object. Java affected! If you have a lot of Trade objects with only one char data field, then the vptr greatly expands footprint and you overuse cache lines.
    • [[ARM]] singles out this factor as a justification for -fno-rtti… see RTTI compiler-option enabled by default
    • [[moreEffC++]] P116 singles out vptr footprint as the biggest performance penalty of vptr
  2. runtime indirection — “a few memory references more efficient” [1] in the Order usage
  3. inlining inhibition is the most significant overhead. P209 [[ARM]] says inline virtual functions make perfect sense so it is best to bypass vptr and directly call the virtual function, if possible.

[1] P209 [[ARM]] wording

Note a virtual function unconditionally introduces the first overhead, but the #2/#3 overheads can sometimes be avoided by a smart compiler.

 

mutex ] rebus: 3designs #RCU #CSY #80%

Requirement: Each worker thread operates independently on its own symbols, never interfering. Admin thread may read/write all symbols.

My friend CSY’s design resembles ConcurrentHashMap — split a big “graph-container” into 32 independent sub-graphs, to be locked individually.

For my 1st design, I briefly considered one lock per symbol. I think 100,000 locks is fine if symbols are consecutive integers. I simply use a symbol as array index to locate the corresponding lock. Every worker thread and the admin thread need to acquire the lock for symbol X before updating the AVL tree of X.

— Here’s my 2nd design, based on a single global lock:

  • Regular worker threads only check, never acquire, the lock. If the lock is in-use (i.e. taken by the admin thread) then they wait in a 1-second-sleep loop.
  • admin thread immediately and unconditionally takes the lock and waits for a grace period[1] before writing. I assume this “write” can take up to 1 minute.
  • how to “check” the lock?
    • TryLock solution [Design#1] — Just trylock and immediately unlock.
    • LockFree solution [Design #2] — an atomic boolean to be set only by the admin thread. I think this is one of the simplest usage scenarios of atomic boolean. We are not even relying on the atomicity. We only need the visibility feature.

[1] This grace period is designed to be sufficient for a batch of updates on the worker thread. If we know a batch can write 100 orders and take X microsecond, then set grace period to X. We require worker routine to recheck the lock after each batch.

My friend CSY raised the concern that mid-stream the update could be suspended by kernel scheduler. I think in this case, a simple integer update could take ages, as in the GC stop-the-world scenario. So the design requires some realtime kernel support to guarantee timely response.

read-copy-update lockfree +! retry #RCU mentions a grace period used in RCU api

— Here’s a bigger lockfree design [Design #3]

  • without realtime kernel
  • Two global variables — in addition to the atomic boolean flag, I will add an atomic int “activeWorkerCnt”
  • Two non-reentrant routines (to avoid incremental acquisition)
Worker thread routine:
1. check the flag until it’s false
2. increment activeWorkerCnt
3. update data store
4. decrement activeWorkerCnt

Admin thread routine:
1. unconditionally set the flag # rejecting all new workers
2. while activeWorkerCnt > 0: sleep 1 millisecond ## draining the active worker pool
3. update data store
4. unset the flag
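A direct java transcription of Design #3 (names made up). Note the check-then-increment window between worker steps 1 and 2 is a weakness of this sketch — the grace-period/recheck discipline above is still needed:

import java.util.concurrent.atomic.AtomicBoolean;
import java.util.concurrent.atomic.AtomicInteger;

public class RebusGate {
  private final AtomicBoolean adminWants = new AtomicBoolean(false);
  private final AtomicInteger activeWorkerCnt = new AtomicInteger(0);

  void workerRoutine(Runnable updateDataStore) throws InterruptedException {
    while (adminWants.get()) Thread.sleep(1);          // 1. wait out the admin
    activeWorkerCnt.incrementAndGet();                 // 2. announce myself
    try { updateDataStore.run(); }                     // 3. update my own symbols
    finally { activeWorkerCnt.decrementAndGet(); }     // 4. leave
  }
  void adminRoutine(Runnable updateDataStore) throws InterruptedException {
    adminWants.set(true);                              // 1. reject all new workers
    while (activeWorkerCnt.get() > 0) Thread.sleep(1); // 2. drain the active workers
    try { updateDataStore.run(); }                     // 3. the admin write
    finally { adminWants.set(false); }                 // 4. unset the flag
  }
}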

[19]with3sockets: select^non-block recv#CSY

Update — CSY said when you increase from 3 to 99, the difference becomes obvious.

Q (TradeWeb mkt data team): given 3 non-blocking receive-sockets, you can either read them sequentially in a loop or use select(). What’s the difference?

http://compgeom.com/~piyush/teach/4531_06/project/hell.html compares many alternatives. It says non-blocking socket read() is the least efficient as it wastes CPU.

If you have 3 receive-sockets to monitor, select()/poll() is designed just for your situation.

Using read() instead, you may try specifying a timeout in your read() like

//set socket option via SO_RCVTIMEO to 5 seconds on each socket
while(1){
  recv(sock1); // each call may block up to 5 seconds
  recv(sock2);
  recv(sock3);
}
Low latency systems can’t use this design because while you wait 5 seconds on socket1, you miss new data on socket2 😦

Therefore, low-latency systems MUST disable SO_RCVTIMEO. In such a design, the loop wastes cpu in a spin-wait.

–another advantage to select(): you can process a high-priority socket first when select() says multiple sockets are ready.

I gave this answer on the spot.

recv()/send() with timeout #CSY

I was right — timeout is supported not only in select()/poll(), but also in read/recv/send..

SO_RCVTIMEO and SO_SNDTIMEO

Timeouts only have effect for system calls that perform socket I/O (e.g., read(2), recvmsg(2), send(2), sendmsg(2)); timeouts have no effect for select(2), poll(2), epoll_wait(2), and so on. These latter functions only check socket status… no I/O.

I believe this socket option has no effect if used on non-blocking sockets. A nonblocking socket can be created either via

  • O_NONBLOCK in fcntl() or
  • SOCK_NONBLOCK in socket()
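For comparison, the java analogue of SO_RCVTIMEO is Socket.setSoTimeout(); a blocking read gives up with SocketTimeoutException:

import java.net.ServerSocket;
import java.net.Socket;
import java.net.SocketTimeoutException;

public class RecvTimeoutDemo {
  public static void main(String[] args) throws Exception {
    try (ServerSocket server = new ServerSocket(0);
         Socket client = new Socket("localhost", server.getLocalPort());
         Socket conn = server.accept()) {
      conn.setSoTimeout(5000); // 5-second receive timeout
      try {
        conn.getInputStream().read(); // blocks for up to 5 seconds
      } catch (SocketTimeoutException e) {
        System.out.println("no data within 5s; socket remains usable");
      }
    }
  }
}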

denigrate%%intellectual strength #ChengShi

I have a real self-esteem problem as I tend to belittle my theoretical and low-level technical strength. CHENG, Shi was the first to point out “you are simply stronger than others” (你就是比别人强).

  • eg: my grasp of middle-school physics was #1 strongest across my entire school (a top Beijing middle school) but I often told myself that math was more valuable and more important
  • eg: my core-java and c++ knowledge (QQ++) is stronger than most candidates’ (largely due to absorbency++) but I often say that project GTD is more relevant. Actually, to a technical expert, knowledge is more important than GTD.
  • eg: I gave my dad an illustration — medical professor vs GP. The Professor has more knowledge but GP is more productive at treating “common” cases. Who is a more trusted expert?
  • How about pure algo? I’m rated “A-”, stronger than most, but pure algo has far lower practical value than low-level or theoretical knowledge. Well, this skill is highly sought-after by many world-leading employers.
    • Q: Do you dismiss pure algo expertise as worthless?
  • How about quant expertise? Most of the math has limited and questionable practical value, though the quants are smart individuals.

Nowadays I routinely trivialize my academic strength/trec relative to my sister’s professional success. To be fair, I should say my success was more admirable if measured against an objective standard.

Q: do you feel any IQ-measured intelligence is overvalued?

Q: do you feel anything intellectual (including everything theoretical) is overvalued?

Q: do you feel entire engineering education is too theoretical and overvalued? This system has evolved for a century in all successful nations.

The merit-based immigration process focus on expertise. Teaching positions require expertise. When laymen know you are a professional they expect you to have expertise. What kind of knowledge? Not GTD but published body of jargon and “bookish” knowledge based on verifiable facts.

FIX^tibrv: order flow

Background — In a typical sell-side, a buy/sell order goes through multiple nodes of a flow chain before it goes out to the trading venue. The order message is the payload in a messaging platform.

FIX and Tibrv (and derivatives) are dominant and default choices as messaging platforms. Both are highly adaptable to complex combinations of requirements.

  • transport — FIX: TCP; tibRV: UDP
  • fan-out — FIX: unicast; tibRV: multicast pub/sub
  • latency — FIX: designed for low-latency eq trading; tibRV: popular for bond trading
  • editable payload — FIX: order msg gets modified along the flow chain; tibRV: ?

tick`95%mark #Kam #70%

“Ticking 95th percentile server” is the blog title I would use. Original question is on paper/pencil, scanned and sent to my gmail, from my friend Deepak CM. I find this problem rather realistic with practical usage, rather than fake and contrived. I treat it as a System Design + implementation question.

Q: Using only std library, write a c++ program to process a live stream of 128,000,000 (or much more) double numbers, representing temperature readings not necessarily unique. As the temperatures come in, print the current 95th percentile on-demand. I call it the lucky “winner”. We can use the nearest-rank percentile definition.

====Idea 1: the “given unsorted ints, find median in O(N)” technique is for the median but can be tweaked for any percentile; unfortunately, it is not “ticking”

====design 2, for static data set
use an “order statistic tree” i.e. a RBTree where each node remembers the size of its subtree. (A leaf node has size 1.)
====design 3, optimized for high volume of updates like 128 million updates, not optimized for frequent query

The entire temperature range is divided into non-overlapping segments, each represented by a segment-head temperature i.e. the lower bound [1b]. Each segment has a range (i.e.distance to next segment head), size (i.e.item count) and density (i.e. size/range ratio). We mostly care about “size” only.

We need a RB-tree (or sorted vector) containing P=1024 [1] nodes, each an unsorted container[3]. The RB-tree serves to maintain the containers i.e segments.

Each incoming temperature is quickly “routed” to the correct container and simply appended therein, increasing its size.

Upon a query request, we will use the latest segment sizes to build a cumulative profile, and run an O(log P) binary search to identify the one segment containing the “winner”. This segment size would hopefully be much smaller than 128,000 [2] and far more /tractable/.

–Within the chosen segment of size S, we can use a vector to sort in O(S logS) the temperatures and identify the winner.  After completing a query, the chosen container will become (partially) sorted, helping subsequent queries if this segment is picked again.

Since we only support 95th percentile, chance is good that this segment will be picked most of the time. If x% of the queries hit this segment, then I will convert this “favorite segment” to a RB-tree.

Alternatively, we can also use the O(S) algorithm in Idea 1, but the container won’t become sorted.

–priming

[2] 128,000 is 1024th the size of original sample size… not ideal. The segments need to be initialized carefully, during a priming phase, inspired by JIT compiler. Shall we assume roughly uniform distribution or Gaussian distribution? Assuming we know the total sample size is 128 million, I will use the first 100,000 temperatures to select the 1024 segment heads. The segments are structured not for equal length (in temperature) or equal size (i.e. element count). In fact the low segments can be very long very crowded.

Instead, the segment heads are chosen so that between 94th percentile and 96th percentile we have half of all segments. These small segment sizes will be much smaller than 128,000 and quicker to manipulate.

–Foot notes:

Q: what if some of the containers grow too big like three times 128,000,000/1024. The priming/estimate was ineffective.
A: Not a problem unless the winner happens to be in such a container.
A: One idea is to split up such a container when we notice it, and grow a new node on the RB-tree. std::map::insert() can take a hint for the position where new node can be inserted. Note we don’t want to split a node JJ too early since JJ may not grow any further subsequently and may end up no larger than other nodes as other nodes keep growing.

[1] Sizing of P — First we estimate the total sample size. If unknown, then set P:=1024 so all nodes stay in L1-cache (typically 32KB). If we assume 16 bytes/node ( 8 bytes pointer to container + 8 bytes double ), then 32KB can hold 2000 nodes.

If query becomes more frequent, I can increase P by 1024, sacrificing insertion.

[1b] The node values are “lower-bounds” and don’t have to be really the minimum of the original temperatures in the container. We can probably cheat with 4-byte floats, and we get away with 2700 twelve-byte tree nodes.

[3] slist vs vector — vector might be faster due to pre-allocation, provided a node will never grow beyond a capacity. Vector has reserve() (Note resize() is wrong choice.)

represent graph: matrix^edgeList #std::forward_list

Josh’s book [[data structures and other objects using c++]] has 5 pages devoted to three best-practice data structures representing a generic directed graph. (I think my [[discrete math]] also covers the same topic.)

Note all these data structures can serialize graphs.

  1. boolean adjacency matrix — adj mat, unable to represent parallel edges between nodes A and B (rare) or self-to-self
  2. edge set — unable to represent parallel edges but can support complex graph algos having risk of traversing the same edge repeatedly. The set can be a field of a graph node, or a hash table {node-> edgeSet}
  3. edge list — can be implemented using std::forward_list, marginally faster than hashset… not discussed further in this blogpost

pros and cons:

  • iterating all edges connected to a node — edge set wins. adjMat very poor.
  • sparse graph — adj mat wastes memory, not scalable.
  • test linkage between 2 nodes — adj mat and edgeSet are O(1)
  • ease of implementation — adj mat wins

Fundamental classification:

  • In a so-called dense graph, number of edges ~ O(V*V) i.e. between any pair of nodes there’s likely an edge
  • In a so-called sparse graph, number of edges ~ O(V).

[[algo in a nutshell]] written by some professors, compares matrix and edge list. For a dense graph, P140 describes an intriguing edge list practical implementation to save space and speed up isConnected(node1, node2). Key insight —

   “Number all individual nodes using auto-incrementing ID numbers.”

After this, we will see that a node’s edge list may look like 2,3,4,5, 8,9 … which can be summarized as 2-4, 8-9. I felt this was a trivial trick, but it is actually a standard implementation. There’s a tradeoff though — inserting an edge is now slower, though I feel the big-O could remain unchanged.
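A hedged sketch of the compressed (ranged) edge list. The Range struct and isConnected are my own names, not from [[algo in a nutshell]]:

#include &lt;algorithm&gt;
#include &lt;vector&gt;

struct Range { int lo, hi; };  // inclusive range of neighbor IDs, e.g. {2,4} means 2,3,4

bool isConnected(const std::vector&lt;Range&gt;&amp; ranges, int node){
    // ranges are sorted; binary-search for the first range whose hi >= node
    auto it = std::lower_bound(ranges.begin(), ranges.end(), node,
        [](const Range&amp; r, int n){ return r.hi &lt; n; });
    return it != ranges.end() &amp;&amp; it-&gt;lo &lt;= node;
}
// usage: isConnected({{2,5},{8,9}}, 4) is true; isConnected({{2,5},{8,9}}, 6) is false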

 

ABA problem ] lockfree designs #DeepakCM

  • breaks basic CAS assumptions
  • most common solution is [1]
  • affects c++ CAS only, probably not java CAS

[1] https://lumian2015.github.io/lockFreeProgramming/aba-problem.html and http://moodycamel.com/blog/2014/solving-the-aba-problem-for-lock-free-free-lists explain the “tagged state reference” aka “tagged pointer”. I feel I don’t need to understand it fully. Even if I spent some time getting a grasp, it may not be enough to impress an interviewer.
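For concreteness, a minimal sketch of the tagged-pointer idea (my own toy, not taken from those links; a real implementation also needs safe memory reclamation before touching ->next):

#include &lt;atomic&gt;
#include &lt;cstdint&gt;

struct Node { int payload; Node* next; };

struct TaggedPtr {
    Node*    ptr;
    uint64_t tag;   // version counter, bumped on every successful CAS
};

// Lock-free pop sketch: even if cur.ptr is freed and a new node reuses the
// same address, the stale tag makes the CAS fail -- defeating ABA.
bool tryPop(std::atomic&lt;TaggedPtr&gt;&amp; head, Node*&amp; out){
    TaggedPtr cur = head.load();
    while (cur.ptr != nullptr){
        TaggedPtr next{cur.ptr-&gt;next, cur.tag + 1};
        if (head.compare_exchange_weak(cur, next)){ out = cur.ptr; return true; }
    }
    return false;
}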

http://www.modernescpp.com/index.php/aba-a-is-not-the-same-as-a has more details, including

  • compare_exchange_strong
  • lockfree stack hitting UndefinedBehavior due to ABA
  • RCU

This topic was asked briefly in Deepak’s MS c++ interview in 2019. I think it’s too arcane and low-level for most interviewers.

 

UseSpinning jvm flag #pthr/kernel/c#

Hi XR,

Around 2011 I once said a java thread waiting for a lock would not keep grabbing the lock…. Wrong! Actually it does keep grabbing for a while. It runs some useless CPU instructions and checks the lock periodically. This is really spin locking inside the JVM.

Spin locking wastes CPU but frequently results in faster application performance (reducing context switches). Therefore, the JVM uses some policy to decide how long a thread spins before placing the thread in some kind of “parking lot” (my terminology) so the thread doesn’t occupy CPU any more. This spin phase used to be configurable via the boolean UseSpinning JVM flag, but in java7 and beyond, this flag has no effect — the JVM always has the freedom to use spinning. This is described in [[javaPerf2014]]
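Below is a hand-rolled spin-then-park sketch in c++ (my own toy; the JVM’s real policy is adaptive and far more sophisticated), just to illustrate the two phases:

#include &lt;atomic&gt;
#include &lt;condition_variable&gt;
#include &lt;mutex&gt;

class SpinThenParkLock {
    std::atomic&lt;bool&gt; locked{false};
    std::mutex mu;
    std::condition_variable cv;
public:
    void lock(){
        for (int i = 0; i &lt; 1000; ++i)            // spin phase: "useless" CPU work
            if (!locked.exchange(true)) return;   // grabbed it without a context switch
        std::unique_lock&lt;std::mutex&gt; ul(mu);      // park phase: the "parking lot"
        cv.wait(ul, [this]{ return !locked.exchange(true); });
    }
    void unlock(){
        { std::lock_guard&lt;std::mutex&gt; lg(mu); locked.store(false); }
        cv.notify_one();
    }
};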

Comparison 1 — Dotnet SpinWait() provides exactly the same spinlock feature, as described in https://docs.microsoft.com/en-us/dotnet/standard/threading/spinwait

Comparison 2 — In the linux Kernel (not userland), 1) spinlock and 2) counting semaphore are the two locking primitives, as described on Page 164 of [[understandingLinuxKernel]]. Both locks block a “grabbing” thread from entering a critical section until another thread leaves it. Similar to my 2011 description, the kernel semaphore remembers all sleeping threads and wakes one of them when the critical section is clear. I think this kernel semaphore is used by the scheduler in the kernel.

Comparison 3 — In linux pthreads (the basis of the linux JVM), the locking constructs are implemented using system calls and userland C functions. Pthreads as a userland library probably can’t use the above two kernel constructs directly. A userland C function can implement a spinlock, but to implement “parking” I assume kernel support and system calls are required. So far I assume each pthreads thread maps to a native thread controlled by the kernel. This mapping is now the dominant usage.

The alternative mapping model is green threads, where userland threads are invisible to kernel. Parking has to be implemented in userland.  Nowadays it’s rare to see green threads.

STL containers +! pointer

  1. std::array
  2. std::string SSO i.e. small-string optimization

These containers

  1. heap storage is not required
    • trivial special case — if this container is a nonref field of a Shape object, then q(new Shape) would still allocate the container on heap.
    • This is similar to “Are java primitives on heap/stack?”
  2. Therefore they contain no pointer.
  3. Therefore the move ctor runs in linear [1] time, compared to the constant-time move-ctor of other STL containers
    • std::array probably offers a move-ctor and move-assignment

[1] P205 [[effModernC++]] also addresses this.

Q: What happens when you move a std::array container with 77 Acct elements?
A: 77 individual Acct objects are moved to the new container, in linear time, invoking the move-ctor of Acct class
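A minimal sketch contrasting the two move costs (Acct is a stand-in payload type):

#include &lt;array&gt;
#include &lt;string&gt;
#include &lt;utility&gt;
#include &lt;vector&gt;

struct Acct { std::string name; };

int main(){
    std::vector&lt;Acct&gt; v(77);
    std::vector&lt;Acct&gt; v2 = std::move(v);    // O(1): just steals the heap pointer

    std::array&lt;Acct, 77&gt; a{};
    std::array&lt;Acct, 77&gt; a2 = std::move(a); // O(77): move-constructs each Acct
}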

RTS pbflow msg+time files #wait_until #60%

Background — In RTS we relied every day on a UDP message replay tool, taking in a msg file and a corresponding time file. Both are binary files. I saw no message delimiters in the msg file, partly because this file is produced in production — inserting delimiters would add overhead in the production parser engine.

Given there’s no delimiter, I believe the timestamp file must hold a seek-offset value (marker) corresponding to each timestamp.

Q: how would you implement a similar replay tool?

Driver is the timestamp file. Every time I see a timestamp, I would wait_until() that timestamp (adjusted to my start time). Upon wake-up, I would send the corresponding chunk of bytes between current and next markers.

I would use condition_variable::wait_until(). As noted in c++condVar 2 usages #timedWait, this function has nanosec precision.

(The mutex needed in wait_until? My program is single-threaded, so the dummy lock would be held forever.)

The chunk of bytes would be sent as one UDP packet. No split. Each chunk should start with the original packet header created by the matching engine and received in the original feed.
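A toy sketch of this replay driver, assuming the two binary files have already been parsed into vectors (offsets, delays and the printf stand in for the real msg-file I/O):

#include &lt;chrono&gt;
#include &lt;condition_variable&gt;
#include &lt;cstdio&gt;
#include &lt;mutex&gt;
#include &lt;vector&gt;

using Clock = std::chrono::steady_clock;

// offsets[i] marks where chunk i starts in the msg file; delays[i] is its
// original time-offset from the first timestamp.
void replay(const std::vector&lt;long&gt;&amp; offsets,
            const std::vector&lt;std::chrono::nanoseconds&gt;&amp; delays){
    std::mutex mu;                        // dummy: the program is single-threaded
    std::condition_variable cv;           // never notified; used only for wait_until()
    std::unique_lock&lt;std::mutex&gt; ul(mu);  // the dummy lock, held throughout
    const auto t0 = Clock::now();
    for (size_t i = 0; i + 1 &lt; offsets.size(); ++i){
        const auto deadline = t0 + delays[i];  // absolute deadline => no cumulative drift
        while (cv.wait_until(ul, deadline) != std::cv_status::timeout)
            ;  // guard against spurious wake-ups
        std::printf("send bytes [%ld, %ld) as one UDP packet\n", offsets[i], offsets[i+1]);
    }
}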

Q2: how about TCP?

TCP receiver can deal with partial messages very effectively, as I saw in my tests.

However, the TCP receiver can introduce delays in the sender (i.e. my replay tool), so the transmission will not be instantaneous. In fact, send() can block! This could lead to drift in the timer. That’s why I need wait_until().

Q3: Explain timer drift?

Suppose in the time file, Timestamp #1 is 9:00:00:000001 and Timestamp #2 is one microsec later, but the data transmission can take 10 nanosec esp. with TCP blocking send().  This 10 nanosec is a small drift but it adds up to microseconds.

identityHashCode,minimum object size,relocation by JGC

https://srvaroa.github.io/jvm/java/openjdk/biased-locking/2017/01/30/hashCode.html offers a few “halo” knowledge pearls

  • every single java Object must always give an idHashcode on-demand, even if its class has overridden hashCode() to return a hard-coded 55.
    • hashCode() doesn’t overshadow idHashcode
  • The “contract” says an object’s idHashcode number must never [2] change, in the face of object relocations. So it’s not really computed based on address. Once someone requests the idHashCode number (like 4049040490), this number must be retained somewhere in object, as per the “contract”. It is retained in the 12-byte object header. (8-byte for a 32-bit JVM)
    • Therefore, the idHashcode contributes to the minimum size of java objects.
  • contrary to common belief, the idHashcode can clash between two objects, so idHashcode is a misnomer — more “hashcode” than “identity”. https://bugs.java.com/bugdatabase/view_bug.do?bug_id=6321873 explains there are insufficient integer values given the maximum object count.
  • Note anyone can call the hashcode() method on this same object and it could be overridden to bypass the idHashcode.
  • [2] in contrast a custom hashcode() can change its value when object state changes.

average O(): hashtable more IMperfect than qsort

In these celebrated algorithms, we basically accept the average complexity as if it were reliably achieved in practice. Naive…

In computer science problems, the hash table’s usage and importance are about 10 times higher than qsort’s.

  • I would say qsort is faster than many O(N logN) sorts. Qsort can use a random pivot. It degrades only if extremely “lucky;)” — like rolling a 6 on all ten dice.
  • In contrast, hash table performance depends mostly on programmer skill in designing the hash function, less on luck.

Performance compared to the alternatives — qsort’s relative performance is pretty good in practice, but a hash table’s relative performance is often underwhelming compared to red-black trees or AVL trees. Recall RTS.

get()vs operator*()on c++11 smart pointers #Rahul

Same for unique_ptr and shared_ptr

  • shared_ptr::get() returns a raw ptr like T*
  • shared_ptr::operator*() returns an lvalue reference, similar to operator*() on a raw ptr
    • equivalent to: q[  * get() ]

Rahul wrote

   auto newShPtr = make_shared<Acct>(*existingShPtr);

After much discussion, I now believe

  • q[*existingShPtr] evaluates to a reference to the Acct instance on heap.
  • Therefore, at run time we hit the Acct class copy ctor.
  • We thereby instantiate a second Acct instance on heap — deep clone.
  • The address of this second instance is saved in a new “club” of shared_ptr instances. This new club is about the second Acct instance, unrelated to the existing club, so there’s no risk of double-delete. In contrast, the version below would probably create a (disastrous) new club around the same Acct instance on heap:
  auto newShPtr = shared_ptr<Acct>(existingShPtr.get());

You can also call make_shared<Acct>() to invoke the default ctor. Note there might be a default argument like

Acct(int id=0); // public

parts@mv-semantics impl: helicopter/hist view

Move semantics is 90% compile-time + 10% run-time programming. 95% in-the-fabric and invisible to us.

  • 90% TMP
    • 70% is about special casts — in the form of std::move and std::forward
    • std::swap
  • 10% traditional programming
    • RAII to destroy the “robbed” rvalue object
    • heap ptr assignment [1] to rob the resource

[1] This “stealing” is tip of the iceberg, the most visible part of move semantics. Bare-hand stealing was doable in c++03, but too dangerous. The rvr is a fundamental language feature to make the “steal” safer:

  • a safety device around the dangerous “steal”
  • a safety glove
  • a controlled destruction

The resource is typically a heapy thingy, a common fixture in all STL containers, std::string and smart pointers. Therefore, move-semantics is widely used only in libraries, not in applications.
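A bare-bones illustration of that “10% traditional programming” part (a toy of mine, not from any book): the move-ctor steals the heap pointer and leaves the robbed object safely destructible.

#include &lt;cstddef&gt;

class MiniStr {
    char*  buf = nullptr;   // the heapy resource
    size_t len = 0;
public:
    explicit MiniStr(size_t n) : buf(new char[n]), len(n) {}
    MiniStr(MiniStr&amp;&amp; rhs) noexcept : buf(rhs.buf), len(rhs.len) {
        rhs.buf = nullptr;  // rob the resource; rhs left empty but valid
        rhs.len = 0;
    }
    ~MiniStr(){ delete[] buf; }  // RAII destroys the robbed (now empty) object harmlessly
    MiniStr(const MiniStr&amp;) = delete;
    MiniStr&amp; operator=(const MiniStr&amp;) = delete;
};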

IV Q: implement op=()using copier #swap+RAII

A 2010 interviewer asked:

Q: do you know any technique to implement the assignment operator using an existing copy ctor?
A: Yes. See P100 [[c++codingStd]] and P347 [[c++cookbook]]. There are multiple learning points:

  • RAII works with a class owning some heap resource.
  • RAII works with a local stack object owning a heapy thingy
  • RAII provides a much-needed exception safety. For the exception guarantee to hold, std::swap and operator delete should never throw, as echoed in my books.
  • I used to think only half the swap (i.e. transfer-in) is needed. Now I know the transfer-out of the old resource is more important. It guarantees the old resource is released even if an exception is thrown, thanks to RAII.
  • The std::swap() trick needed here is powerful because there’s a heap pointer field. Without this field, I don’t think std::swap will be relevant.
  • self-assignment check — not required as it is rare and tolerable
  • Efficiency — considered highly efficient. The same swap-based op= is used extensively in the standard library.
  • idiomatic — this implementation of operator= for a resource-owning class is considered idiomatic, due to simplicity, safety and efficiency

http://www.geeksforgeeks.org/copy-swap-idiom-c/ shows one technique. Not sure if it’s best practice. Below is my own

C & operator=(C const & rhs){
  C localCopy(rhs); //This step is not needed for move-assignment
  std::swap(localCopy.heapResource, this->heapResource);
  return *this;
}// at end of this function, localCopy is destructed, and the original this->heapResource is deleted

C & operator=(C && rhs){ //move-assignment
  std::swap(rhs.heapResource, this->heapResource);
  return *this;
}

 

## placement new: pearls to impress IV

container{string}: j^c++

In java, any container (of string or int or anything) holds pointers only.

I think c# collections (i.e. containers) contain pointers if T is a reference type.

In cpp,

  • container of int always contains nonref, unlike java
  • container of container contains ptr, just like in java
  • but container of string is widely used, and invariably contains nonref std::string !

Q: is there any justification for container<(smart) ptr to string>? I found rather few online discussions.
A: See boost::ptr_container

Q: what if the strings are very large?
A: many (pre-c++11) std::string implementations use COW to speed up copying and assignment; however, the string copy ctor is O(lengthOfString) per the standard! So in a standard-compliant implementation copy and assignment would be expensive, so I believe we must use container<(smart) ptr to string>

 

std::defer_lock #kill1@4deadlock factors

Only in the lock() call does the thread grab both locks together. This breaks “incremental acquisition”, one of the four deadlock conditions.

Sample code from https://en.cppreference.com/w/cpp/thread/lock_tag

  std::unique_lock<std::mutex> ulock1(mutex1,std::defer_lock);
  std::unique_lock<std::mutex> ulock2(mutex2,std::defer_lock);

  // now grab both locks
  std::lock(ulock1,ulock2); 

c++TMP: 9 fundamental features + little tricks

ranked:

  1. SFINAE — fundamental compiler rule for overload resolution
  2. template specialization
  3. NDTTP — non-type template param, widely used, probably most of the standard TMP
  4. within a class template, define member function templates or nested class templates, with additional dummy types, probably required by SFINAE
  5. Even if an actual type (eg, Trade) behind a dummy type T doesn’t support an operation in vector<T>, compiler can still instantiate vector<Trade> if you don’t use that operation. [[c++Primer]] P329 has examples.
  6. default arguments — for type-param (or non-type-param), a small syntactical feature with BIG usages

Tricks:

  1. #include <type_traits>
  2. member typedefs — a regular class can define member typedefs. but this trick is used much more in class templates

enable_if{bool,T=void} #sfinae,static_assert

  • enable_if is a TMP technique that hides a function overload (or template specialization) — convenient way to leverage SFINAE to conditionally remove functions from overload resolution based on type traits and to provide separate function overloads for different type traits. [3]
    • I think static_assert doesn’t do the job.
  • typically used with std::is_* compile-time type checks provided in type_traits.h
  • enable_if_t evaluates to either a valid type or nothing. You can use enable_if_t as a function return type. See [4]. This is the simplest usage of enable_if to make the target function eligible for overload resolution. You can also use enable_if_t as a function parameter type, but I find that too complicated. [3]
  • There are hundreds of references to it in the C++11 standard template library [1]
  • type traits — often combined with enable_if [1]
  • sfinae — is closely related to enable_if [1]
  • by default (common usage), the enable_if_t evaluates to void if “enabled”. I have a github experiment specifically on enable_if_t. If you use enable_if_t as return type you had better put a q(*) after it!
    • Deepak’s demo in [4] uses bool as enable_if_t
  • static_assert — is not necessary for enable_if but can be used for compile-time type validation. Note static_assert is unrelated to sfinae. https://stackoverflow.com/questions/16302977/static-assertions-and-sfinae explains that
    • sfinae checks declarations of overloads. sfinae = Substitution failure is not a (compiler) error. An error would break the compilation.
    • static assert is always in definitions, not declarations. static asserts generate compiler errors.
    • Note the difference between failure vs error
    • I think template instantiation (including overload resolution) happens first and static_assert happens later. If static_assert fails, it’s too late. SFINAE game is over. Compilation has failed irrecoverably.
    • Aha moment on SFINAE !

[1] https://eli.thegreenplace.net/2014/sfinae-and-enable_if/

[2] https://stackoverflow.com/questions/30556176/template-detects-if-t-is-pointer-or-class

[3] https://en.cppreference.com/w/cpp/types/enable_if

[4] https://github.com/tiger40490/repo1/blob/cpp1/cpp/template/SFINAE_Deepak.cpp
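Condensing the bullets above, a minimal sketch of my own (not from [4]) using enable_if_t as the return type; each overload is removed from overload resolution when its condition fails:

#include &lt;iostream&gt;
#include &lt;type_traits&gt;

template &lt;typename T&gt;
std::enable_if_t&lt;std::is_integral&lt;T&gt;::value, void&gt; show(T x){
    std::cout &lt;&lt; x &lt;&lt; " is integral\n";
}
template &lt;typename T&gt;
std::enable_if_t&lt;std::is_floating_point&lt;T&gt;::value, void&gt; show(T x){
    std::cout &lt;&lt; x &lt;&lt; " is floating point\n";
}
int main(){
    show(42);    // picks the integral overload
    show(3.14);  // picks the floating-point overload
}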

## template tricks for type constraint

–Here are the compile-time validation/checks offered by c++:

  1. std::is_pointer and family — each trait checks one property of the type. Very specific and clean
  2. q[ = delete ] — to eliminate specific concrete type args. Laser removal .. extremely clean. Scott Meyers pointed out this usage of q[=delete]. Does this work with SFINAE? I didn’t find anything online.
  3. std::enable_if — see separate blogpost. Designed more for SFINAE than type constraint
  4. sfinae — probably the most versatile, flexible and powerful
  5. static_assert? Can be combined with other techniques to implement compile-time validation and constraints. See https://github.com/tiger40490/repo1/blob/cpp1/cpp/template/SFINAE_ptrCheck.cpp

std::weak_ptr phrasebook

ALWAYS need to compare with raw ptr + shared_ptr, to understand the usage context, motivations and justifications

http://www.stroustrup.com/C++11FAQ.html#std-weak_ptr is concise.

— Based on the chapter in [[effModernC++]]:

#1 feature — detect dangle

  • use case — a subject that keeps track of its observers who might become dangling pointers
  • use case — objects A and B pointing to each other with ref count … leading to island. Using raw pointers exclusively is possible but requires explicit deletion, as pointed out on P 84 [[Josuttis]]
  • In both use cases, Raw ptr won’t work since dangle becomes unnoticed.
  • Achilles’ heel of the #1 feature — manual “delete” on the raw ptr is beneath the radar of reference counting, and leads to chaos and subversion of ownership control, as illustrated —
#include <iostream>
#include <memory>
using namespace std;

void f1(){
  auto p = new int(55);
  shared_ptr<int> sp(p);
  weak_ptr<int> wp(sp);

  cout<<"expired()? "<<wp.expired()<<endl; // false
  cout<<"deleting from down below\n";
  delete p; // wrong: beneath the refcount radar. sp.reset() would be the proper way
  cout<<"expired()? "<<wp.expired()<<endl; // still false!
  // at end of this function, shared_ptr would double-delete as the manual delete 
// is beneath the radar of reference counting:(
}
int main(){
  f1();
}

## G5 move-only types #std::atomic

[1] P106 [[effModernC++]]

unique_ptr implicit copy : only for rvr #auto_ptr

P 470-471 [[c++primer]] made it clear that

  • on a regular unique_ptr variable, explicit copy is a compilation error. Different from auto_ptr here.
  • However returning an unnamed temp unique_ptr (rvalue object) from a function is a standard idiom.
    • Factory returning a unique_ptr by value is the most standard idiom.
    • This is actually the scenario in my SCB-FM interview by the team architect

The underlying reason is what I have known for a long time — move-only. What I didn’t know (well enough to impress the interviewer) — the implication for implicit copy, i.e. the implicit move from an rvalue, which is the most common usage of unique_ptr.

SCB-FM IV by architect #shared_ptr upcast

Q: how does the compiler accept this code:
shared_ptr<C> aa = myDerSharedPtr; //myDerSharedPtr is a shared_ptr<D> object

%%Q: shared_ptr<C> has a copy ctor and also a conversion ctor accepting a C raw ptr, but here we are passing in a shared_ptr<D> instance. How does compiler handle it?
%%A: I guess shared_ptr<D> has a conversion operator returning a D raw ptr, but this is not used.
AA: there’s a conversion ctor template<class U> shared_ptr(shared_ptr<U>…) — a TMP trick. See https://github.com/tiger40490/repo1/blob/cpp1/cpp/template/shPtrUpcastCopy.cpp .

The github experiment also reveals — if a function lvr param is shared_ptr<C> & and you pass in a shared_ptr<D>, the compiler will complain about binding an rvalue (i.e. anonymous temp) object to an lvalue reference — a key insight into rvr + rvalue objects.

Q3: just when is the memory freed for temp objects like q[ string1 + string2 ]
%%A: at an unspecified time. A custom string implementation could use COW, in a single-threaded project. This is a common practice in many pre-c++11 libraries
A(from architect): after the semicolon

Q3b: how can you extend the lifetime of those naturally occurring temp object?
A: assign the temp to a “const ref” variable.

Q: what are your favorite c++11/14 features? See ## c++11 features I understand as significant

Q: OK you briefly mentioned move semantic..what is it?

struct C{ //tested
  virtual void f(){/*..*/}
  ~C(){     cout<<"C dtor\n";  } //non-virtual
};
struct D: public C{
  string s;
  D(): s("def"){}
  ~D(){     cout<<"D dtor\n";  }
};
D createD(){return D();} //return by value! probably via RVO
int main(){
  C const & trade = createD();
}

Q: is string memory freed?
%%A: yes. Verified

Q: what if the string field is in D?
%%A: yes. Verified

I believe the temp D object is on stack and is guaranteed to be destructed. Since the D ctor called base ctor, the dtor sequence is guaranteed to be ~D then ~C.

kernThr ] linux^Unix

Here are a few things I can recall, from [[UnderstandingLinuxKernel]].

P 104 lists some minor Linux kernel threads, but here are the important ones:

  • process 0 (swapper) is a kernel thread. This process gets scheduled when there’s no other process to run
  • process 1 (init) is a kernel thread. This process launches and monitors all other processes except process 0. It’s also the adoptive parent of all orphaned processes (including zombies).
  • both of them run forever. Most other linux kernel threads run for a short while and exit.

Linux kernThr is a completely unrelated concept to the traditional unix kernThr. More differences than similarities (zero?)

  • The Unix kernel-thread concept, aka “native threads”, first became known to me via the JVM. The kernel scheduler can see native threads but not java green threads.
  • Native threads often run in user mode, but linux kernel threads run only in kernel mode and are very limited in usage. I think they are mostly set up to access hardware.

Linux kernel threads are less fundamental than kernel routines including sysCall handlers + interrupt handlers.

  1. every Linux user process enters kernel mode via kernel routines
  2. not every user process interacts with kernel threads

risk-neutral means..illustrated by CIP

Background — all of my valuation procedures are subjective, like valuing a property, an oil field, a commodity …

Risk-Neutral has always been confusing, vague, abstract to me. CIP ^ UIP, based on Mark Hendricks’ notes, has an illustration —

  • RN means .. regardless of individuals’ risk profiles … therefore objective
  • RN means .. partially [1] backed by arbitrage arguments, but often theoretical
    • [1] partially can mean 30% or 80%
    • If it’s well supported by arbitrage argument, then replication becomes theoretical foundation of RN pricing

 

set::count^multiset_count()

ALL of the std::set, std::map containers including the multi*, unordered* and unordered_multi* versions offer a count() method.

For the multi* containers, count() can return 2 or higher. For the unique-containers,  count() returns 0 or 1 only.

Is it bad form to use count() on the unique-containers? Not really. count() is easier than find().

python slicing q( [] )ALWAYS generate clone

This is crucial knowledge for ECT coding test

See mystr[:-2] copy_truncate last 2 chars #mystr[2:]

See copy_reverse list/string/tuple by [::-1]

See python initialize list/dict : perf

Among the various list cloning solutions on https://stackoverflow.com/questions/2612802/how-to-clone-or-copy-a-list:

  • slicing is consistent with string/tuple
  • list(oldList) is similar to java API
  • copy.deepcopy is useful across many containers

[18] Integer(like String)objects always immutable: java+python #XR

Integer (like String) objects are always immutable in java. My google search confirmed that.

Besides serialization, there is no practical reason to deep-copy them.

Python integer objects are also immutable. Every time we modify an int, the new value has a different id(). See also my blog post on python immutable types.

See also https://medium.com/@meghamohan/mutable-and-immutable-side-of-python-c2145cf72747

detach()^pthread_exit()^pthread_join() at end@main()

See https://stackoverflow.com/questions/3559463/is-it-ok-to-call-pthread-exit-from-main

Q: When would main() call ..join() vs ..exit()?

  • I would call pthread_join() if main thread has something to do after child threads completes.
  • I would call pthread_exit() if main thread has nothing to do and should leave the child threads running.
    • However, if the child thread object is destructed…
  • I would call detach() to create a daemon child thread, which would automatically terminate with host process — according to MS training.
  • I would not call any of them from main() if I want main() to bring down entire process… The normal return of main() kills all threads, unlike jvm — see Anthony Williams comment in  https://stackoverflow.com/questions/3692591/return-versus-pthread-exit-in-pthread-start-functions

For a c++11 std::thread object, the rules are different from pthreads. If a std::thread object is destructed at end of main() while still joinable, it would by default invoke std::terminate()! So before any return statement, main() needs to call child.detach(), so as to leave the child thread running.

Explained more in std::thread dtor std::terminate()
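A minimal sketch of that rule:

#include &lt;chrono&gt;
#include &lt;thread&gt;

int main(){
    std::thread child([]{
        std::this_thread::sleep_for(std::chrono::seconds(1));
    });
    child.detach();  // child becomes daemon-like; dies with the process
    return 0;        // without detach()/join(), this return would hit std::terminate()
}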

PendingNew^New: OrdStatus[39]

“8” has special meaning

  • in tag 35, it means execution report
  • in tag 39 and tag 150, it means status=rejected.

PendingNew and New are two possible statuses for a given order.

PendingNew (39=A, 150=A) is relatively simple. The downstream system sends a PendingNew to upstream as soon as it receives the “envelope”, before opening, validating or saving it. I would say even a bad order can go into PendingNew.

New (39=0, 150=0) is significant. It’s an official acknowledgement (or confirmation) of acceptance. It’s synonymous with “Accept” and “Ack”. I think it means fully validated and saved for execution. For an intermediate system, usually it waits for an Ack i.e. 39=0 from exchange before sending an Ack to the upstream. Tag 39 is usually not modified.

I see both A and 0 frequently in my systems, in execution report messages.

For a market Buy order, I think it will be followed by (partial) fills, but not guaranteed, because there may be no offers, or execution could fail for any reason. For a dealer system, execution can fail due to inventory shortage. I implemented such an execution engine in 95G.

I’m no expert on order statuses.

max-sum path problemS #high fan-out/fan-in

Note “path” implies directed (not bidirectional) edges i.e. dag

  • variation (rare): bi-directional edges
  • variation: single origin, single destination
  • variation: single origin, any destination among 55 leaf nodes
  • variation: each node has a +/- value
  • variation (rare): each edge has a +/- value. Edges usually outnumber nodes, as illustrated in the balloon-burst problem

–Common Special case: pyramid (see blogpost on recombinant binTree). The balloon problem has high fan-out.

I would try DP. For each node AA, determine (and store) the max cum-sum up to AA. Need to start from the layer 1 nodes (1 hop from origin), then use Layer 1 data to update layer 2 nodes.

Markov-like — For any node AA, the max cum-sum to AA only depends on the previous layer. No need to even care about a “fullpath” to AA. This property leads to dramatic *simplification*. Efficiency is a by-product of this simplification.
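A sketch of this layered DP for the pyramid case (maxPathSum and the row layout are my own illustration): maxTo[j] is the max cum-sum reaching node j of the current layer; the Markov property means each layer depends only on the previous one.

#include &lt;algorithm&gt;
#include &lt;climits&gt;
#include &lt;vector&gt;

int maxPathSum(const std::vector&lt;std::vector&lt;int&gt;&gt;&amp; pyramid){
    std::vector&lt;int&gt; maxTo{pyramid[0][0]};              // layer 0: the origin
    for (size_t i = 1; i &lt; pyramid.size(); ++i){
        const std::vector&lt;int&gt;&amp; row = pyramid[i];       // row i has i+1 nodes
        std::vector&lt;int&gt; next(row.size());
        for (size_t j = 0; j &lt; row.size(); ++j){
            int best = (j &lt; maxTo.size()) ? maxTo[j] : INT_MIN; // parent directly above
            if (j &gt; 0) best = std::max(best, maxTo[j-1]);       // parent up-left
            next[j] = best + row[j];
        }
        maxTo = std::move(next);
    }
    return *std::max_element(maxTo.begin(), maxTo.end());  // any node on the last layer
}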

–Common Special case: non-recombinant binary tree.

I would use DFT + Kadane. Each origin-to-leaf path is independently handled. No layering.

Note Kadane algo handles +/- node values.

shared_ptr {vector} for array@heap

You can’t create a raw ptr from array-new and then use it to create a shared_ptr. The final dtor of the shared_ptr club would call delete, not the required array-delete.

I would prefer shared_ptr<vector<int>>. See https://stackoverflow.com/questions/13061979/shared-ptr-to-an-array-should-it-be-used. The vector would be allocated on heap [1]. The ptr-to-vector would be deleted by the last “club member”. This deletion would trigger the (RAII) dtor of the vector. The RAII would clean up the memory of the underlying raw array.

[1] In contrast, when we instantiate a vector as a local object, the vector “shell” including the housekeeping fields is allocated on the stack. See housekeeping^payload fields: vector,string,shared_ptr

If you must use shared_ptr<int> instead, then https://www.acodersjourney.com/top-10-dumb-mistakes-avoid-c-11-smart-pointers/ shows a simple custom deleter to invoke array-delete
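A sketch of both approaches (the default_delete trick is standard; the shared_ptr<T[]> alternative needs c++17):

#include &lt;memory&gt;
#include &lt;vector&gt;

int main(){
    // preferred: let the vector own the raw array
    auto spv = std::make_shared&lt;std::vector&lt;int&gt;&gt;(1024);

    // if you must hold the raw array: supply an array-deleter
    std::shared_ptr&lt;int&gt; spa(new int[1024], std::default_delete&lt;int[]&gt;());

    // c++17 alternative: shared_ptr&lt;int[]&gt; calls delete[] automatically
    // std::shared_ptr&lt;int[]&gt; spa17(new int[1024]);
}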

growth factor ] string/vector/hashtable #xLang

  1. std::string
  2. vector
  3. python list
  4. ArrayList
  5. hashtables

… all have algorithms to decide exactly how much more capacity to acquire during re-allocation. Usually capacity grows by a factor of up to 2.0.
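A quick empirical probe (gcc’s vector typically doubles; msvc grows by about 1.5; it’s implementation-defined):

#include &lt;iostream&gt;
#include &lt;vector&gt;

int main(){
    std::vector&lt;int&gt; v;
    auto lastCap = v.capacity();
    for (int i = 0; i &lt; 10000; ++i){
        v.push_back(i);
        if (v.capacity() != lastCap){   // print only when capacity jumps
            lastCap = v.capacity();
            std::cout &lt;&lt; "size " &lt;&lt; v.size() &lt;&lt; " -> capacity " &lt;&lt; lastCap &lt;&lt; "\n";
        }
    }
}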

FIX^MOM: exchange connectivity

The upshot — some subsystems use FIX, not MOM; some subsystems use MOM, not FIX. A site could send FIX over MOM (such as RV), but that’s not common.

It’s important to realize FIX is not a type of MOM system. Even though FIX uses messages, I believe they are typically sent over persistent TCP sockets, without middle-ware or buffering/queuing. RTS is actually a good example.

Q: does FIX have a huge message buffer? I think it’s small, like TCP, but TCP has congestion control, so sender will wait for receiver.

Say we (sell-side or buy-side) have a direct connectivity to an exchange. Between exchange and our exterior gateway, it’s all sockets and possibly java NIO — no MOM. Data format could be FIX, or it could be the exchange’s proprietary format — here’s why.

The exchange often has an internal client-server protocol — the real protocol and a data flow “choke point”. All client-server request/response relies solely on this protocol. The exchange often builds a FIX translation over this protocol (for obvious reasons…). If we use their FIX protocol, our messages are internally translated from FIX to the exchange proprietary format. Therefore it’s obviously faster to directly pass messages in the exchange proprietary format. The exchange offers this access to selected hedge funds.

As an espeed developer told me, they ship a C-level dll (less often *.so [1]) library (not c++, not static library), to be used by the hedge fund’s exterior gateway. The hedge fund application uses the C functions therein to communicate with the exchange. This communication protocol is possibly proprietary. HotspotFX has a similar client-side library in the form of a jar. There’s no MOM here. The hedge fund gateway engine makes synchronous function calls into the C library.

[1] most trader workstations are on win32.

Note the dll or jar is not running in its own process, but rather loaded into the client process just like any dll or jar.

Between this gateway machine and the trading desk interior hosts, the typical communication is dominated by MOM such as RV, 29West or Solace. In some places, the gateway translates/normalizes messages into an in-house standard format.

Linux(!!Unix)kernel thread can! run user program

–“Linux kernel thread cannot run user programs”, as explained in [[UnderstandingLinuxKernel]] P3.

Removing ‘kernel‘… Linux userland threads (based on LWP, explained below) do run user programs.

Removing ‘thread‘… Linux kernel processes or interrupt routines could possibly run under a user process pid.

Removing ‘Linux‘ … Other Unix variants do use kernel threads to run both 1) user programs and 2) kernel routines. This is my understanding from reading about JVM…

  • unix  — multi-threaded user apps =use=> LWP =use=> kernel threads
  • linux — multi-threaded user apps =use=> LWP. LWP is the basic unit of concurrency in user apps. Linux LWP is unrelated to Unix LWP.

##simplicity@design pushed to the limit

Here I collect simple concepts proven rather versatile, resilient, adaptable. Note in these designs, the complexity never disappears or reduces; it shifts to somewhere else more manageable.

  • [c] stateless — http
  • [!s] microservices –complexity moves out of a big service into the architecture
  • [c] pure functions — without side effects
  • use the database concept in solving algo problems such as the skyline #Gelber
  • stateless static functions in java — my favorite
  • [c !s] garbage collection — as a concept. Complexity shifts from the application into the GC codebase
  • REST
  • in c# and c++, all nested classes are static, unlike in java
  • python for-loop iteration over a dir, a file, a string … See my blog post
  • [c] immutable — objects in concurrent systems
  • [c] STM i.e. single-threaded mode, without shared mutable #Nsdq
  • [c] pipe — the pipe concept in unix is a classic
  • JSON
  • [c] hash table as basis of OO — py, javascript, perl..
  • [c] sproc (+trigger) — as a simple concept “data storage as guardian of its data, and a facade hiding the internal complexities”
  • [!s] dependency injection
  • [c !s] EDT — swing EDT and WPF
  • [c] RAII
  • smart pointers as a concept
  • singleton implemented as a static local object, #Scott Meyers
  • [c=celebrated, classic, time-honored, ..]
  • [!s = not so simple in implementation]

 

3ways to call ctor #placement new

  1. to construct an Acct at a pre-existing address, call placement-new, which implicitly uses Acct ctor
  2. to construct an Acct in stack or static memory (usually data segment), just call ctor explicitly
  3. to construct an Acct in heap, call new, which implicitly uses the ctor.

By the way, you can also allocate memory without calling ctor. The malloc() and q[ operator new ] can do that, as explained in [[moreEffC++]]

y FIX needs seqNo over TCP seqNo

My friend Alan Shi said … Suppose your FIX process crashed or lost power, reloaded (from disk) the last sequence received and reconnected (resetting the tcp seq#). It would then receive a live seq # higher than expected. This could mean some execution reports were missed. If the exchange notices a sequence gap, then it could mean some order cancellation was missed. Both scenarios are serious and require a resend request. CME documentation states:

… a given system, upon detecting a higher than expected message sequence number from its counterparty, requests a range of ordered messages resent from the counterparty.

Major difference from TCP sequence number — FIX specifies no Ack, though some exchanges do. See Ack in FIX^TCP

Q: how could FIX miss messages given TCP reliability? I guess tcp session disconnect is the main reason.

https://kb.b2bits.com/display/B2BITS/Sequence+number+handling has details:

  • two streams of seq numbers, each controlled by exch vs trader
  • how to react to unexpected high/low values received. Note “my” outgoing seq is controlled by me hence never “unexpected”
  • Sequence number reset policy — after a logout, sequence numbers are supposed to reset to 1, but if the connection is terminated ‘non-gracefully’ sequence numbers will continue when the session is restored. In fact a lot of service providers (eg: Trax) never reset sequence numbers during the day. There are also some who reset sequence numbers once per week, regardless of logout.

 

make_shared() cache efficiency, forward()

This low-level topic is apparently important to multiple interviewers. I guess there are similarly low-level topics like lockfree, wait/notify, hashmap, const correctness.. These topics are purely for theoretical QQ interviews. I don’t think app developers ever need to write forward() in their code.

https://stackoverflow.com/questions/18543717/c-perfect-forwarding/18543824 touches on a few low-level optimizations. Suppose you follow Herb Sutter’s advice and write a factory accepting Trade ctor arg and returning a shared_ptr<Trade>,

  • your factory’s parameter should be a universal reference. You should then std::forward() it to make_shared(). See the make_shared() source in https://gcc.gnu.org/onlinedocs/libstdc++/libstdc++-api-4.6/a01033_source.html
  • make_shared() makes a single allocation for a Trade and an adjacent control block, with cache efficiency — any read access on the Trade pointer will cache the control block too
  • I remember reading online that one allocation vs two is a huge performance win….
  • if the arg object is a temp object, then the rvr would be forwarded to the Trade ctor. Scott Meyers says the lvr would be cast to an rvr. The Trade ctor would need to move() it.
  • if the runtime object is carried by an lvr (arg object not a temp object), then the lvr would be forwarded as is to Trade ctor?

Q: What if I omit std::forward()?
AA: Trade ctor would always receive an lvr. See ScottMeyers P162 and my github code

https://github.com/tiger40490/repo1/blob/cpp1/cpp1/rvrDemo.cpp is my experiment.
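A sketch of the factory described above (Trade and makeTrade are stand-ins of mine, following Sutter’s advice):

#include &lt;memory&gt;
#include &lt;string&gt;
#include &lt;utility&gt;

struct Trade { std::string sym; explicit Trade(std::string s) : sym(std::move(s)) {} };

template &lt;typename Arg&gt;
std::shared_ptr&lt;Trade&gt; makeTrade(Arg&amp;&amp; a){                 // universal reference
    return std::make_shared&lt;Trade&gt;(std::forward&lt;Arg&gt;(a));  // one allocation: Trade + control block
}
int main(){
    std::string s{"IBM"};
    auto t1 = makeTrade(s);                  // lvalue: forwarded as lvr, Trade ctor copies
    auto t2 = makeTrade(std::string{"GS"});  // temp: forwarded as rvr, Trade ctor can move
}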

 

How is debugger breakpoint implemented@@ brief notes #CSY

This is a once-only obscure interview question. I said up-front that CPU interrupts were needed. I still think so.

I believe CPU support is needed to debug assembly programs, where kernel may not exist.

For regular C program I still believe special CPU instructions are needed.

https://eli.thegreenplace.net/2011/01/27/how-debuggers-work-part-2-breakpoints seems to agree.

It says Interrupt #3 is designed for debugging. It also says SIGTRAP is used in linux, but windows supports no signals.

 

if thread fails b4 releasing mutex #CSY

My friend Shanyou asked:

Q: what if a thread somehow fails before releasing mutex?

I see only three scenarios:

  • If machine loses power, then releasing mutex or not makes no difference.
  • If process crashes but the mutex is in shared memory, then we are in trouble. The mutex will be seen as forever in-use. The other process can’t get this mutex. I feel this could be a practical problem, with practical solutions like reboot or process restart.
  • If process is still alive, I rely on stack unwinding.

Stack unwinding is set up by the compiler. The only situation when this compiler-generated stack unwinding is incomplete is when the failing function is declared noexcept. (In such a case, the failure is your self-inflicted problem, since you promised the compiler it would never throw an exception.) I will assume we don’t have a noexcept function. Therefore, I assume stack unwinding is robust and all stack objects will be destructed.

If one of the stack objects is a std::unique_lock, then compiler guarantees an unlocked status on destruction. That’s the highest reliability and reassurance I can hope for.
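A minimal sketch of that guarantee:

#include &lt;mutex&gt;
#include &lt;stdexcept&gt;

std::mutex mu;

void mayThrow(){
    std::unique_lock&lt;std::mutex&gt; ul(mu);   // locked here
    throw std::runtime_error("boom");      // ul's dtor unlocks during stack unwinding
}
int main(){
    try { mayThrow(); } catch (const std::exception&amp;) {}
    std::unique_lock&lt;std::mutex&gt; ul(mu);   // re-acquires fine: the mutex isn't wedged
}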

JGC G1 Metaspace: phrasebook #intern

Incidentally, NIO buffer is also in native memory

 

CRTP,ADL,thread_local to replace old-school

  • — thread_local variable to replace member data .. q[static thread_local ] in %%production code
  • eliminates pollution
  • — ADL  is often chosen to replace member operator and methods.. ADL #namespace
  • reduces coupling
  • –CRTP is often chosen to replace runtime binding (dynamic dispatch) of virtual function call
  • Template only
  • shaves a few clock cycles in HFT

I think all three are advanced.

killing a stuck thread #cancellation points#CSY

Q: once you know one of many threads is stuck in a production process, what can you do? Can you kill a single thread?
A: there will not be a standard construct provided by the OS or thread library, because killing a thread is inherently unsafe. Look at java Thread.stop().
A: yes, if I have a builtin kill-hook in the binary.

https://www.thoughtspot.com/codex/threadstacks-library-inspect-stacktraces-live-c-processes describes a readonly custom hook. It’s conceivable to add a kill feature —

  • Each thread runs a main loop to check an exit-condition periodically.
  • This exit-condition would be similar to pthreads “cancellation points”

https://stackoverflow.com/questions/10961714/how-to-properly-stop-the-thread-in-java shows two common kill hooks — interrupt and Boolean flag
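A minimal c++ sketch of the Boolean-flag hook (names are mine):

#include &lt;atomic&gt;
#include &lt;chrono&gt;
#include &lt;thread&gt;

std::atomic&lt;bool&gt; exitRequested{false};

void workerMainLoop(){
    while (!exitRequested.load()){   // the cooperative "cancellation point"
        // ... one unit of real work ...
        std::this_thread::sleep_for(std::chrono::milliseconds(10));
    }
    // cleanup runs normally here, unlike after a hard kill
}
int main(){
    std::thread w(workerMainLoop);
    exitRequested.store(true);       // "kill" request from a supervisor
    w.join();
}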

 

sentinel node trick: array/slist

  • eg: STL uses sentinel nodes. I believe container.end() returns that.
  • eg: c-string uses \0 as sentinel value.
  • eg: binary search needle can be too high and we have to test if the return value is a real element or the end-of-container, so it’s extremely useful to add another sentinel to guarantee a successful search

When solving timed coding questions on “clean” arrays, it can be useful to append a sentinel node. It can simplify some algorithms which take action after each segment.

P90 [[Pearls]] introduced a loop coding idiom. The idiom is applicable whenever we loop over a container and have an early-exit “break”. Such a loop has exactly two [1] conditional tests per iteration, and therefore can run faster if we combine the two into one conditional test. This is a small halo idiom for latency. But beyond latency, there are more interesting benefits such as cognitive complexity reduction.

For example, consider the business logic after reaching “normal” end of the container without hitting the early exit. Rarely can we “forget” this business logic and simply exit the function and rely on the implicit return. Instead, this normal-completion scenario requires careful handling and a cognitive burden. To remind myself, I often place a comment after end of the loop. (Python provides an Else clause for a for-loop.)

In some cases, the business logic after end of the loop is 2 pages away from the early-exit business logic, but they really should be in one module. I hate this situation so much that I always set a flag and use a “break”, so that the two routines are both written after end of the loop.

In all these scenarios, it’s often simpler to append an artificial sentinel element at end of the container!  The sentinel is guaranteed to hit the early-exit, so the loop would never exhaust all elements and complete normally. Therefore we can combine the two into one conditional test:)

I usually replace “break” with “exit”, further reducing the control-flow complexity. Such micro simplifications can pay huge dividends in high-pressure timed tests.

Right before the exit/break we may still need to handle normal-completion scenario by checking if we are at the sentinel. Interesting scenario. Now we can combine the logic of normal-completion vs early exit. In many cases, they are (nearly) identical, so we can express the logic in very structured code.

Whether they are identical or not, handling the two scenarios (early vs normal completion) in a centralized module is usually simpler and more structured, reducing the cognitive complexity.

[1] what if three tests (two breaks)? The mental burden is even worse. The sentinel could reduce it
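A tiny sketch of the idiom on a “clean” array (my own illustration):

#include &lt;vector&gt;

// Append the needle so the scan always "succeeds", collapsing the two
// per-iteration tests (bounds check + match check) into one.
int findWithSentinel(std::vector&lt;int&gt;&amp; v, int needle){
    v.push_back(needle);                  // the sentinel guarantees the early exit fires
    size_t i = 0;
    while (v[i] != needle) ++i;           // a single conditional test per iteration
    v.pop_back();                         // restore the container
    return i == v.size() ? -1 : (int)i;   // at-sentinel means normal completion
}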

Ack in FIX #TCP

Does TCP require an Ack for every segment? Not for each segment individually; see cumulative acknowledgement in [[computerNetworking]]

FIX protocol standard doesn’t require Ack for every message.

However, many exchanges’ specs require an Ack message for {order submission OR execution report}. At a TrexQuant design interview, I dealt with this issue —

  • –trader sends an order and expecting an Ack
  • exchange sends an Ack and expecting another Ack … endless cycle
  • –exchange sends an exec report and expecting an Ack
  • trader sends an Ack waiting for another Ack … endless cycle
  • –what if at any point trader crashes? Exchange would assume the exec report is not acknowledged?

My design minimizes duty of the exchange —

  • –trader sends an order and expecting an Ack
  • exchange sends an Ack and assumes it’s delivered.
  • If trader misses the Ack she would decide either to resend the order (with PossResend flag) or query the exchange
  • –exchange sends an exec report and expecting an Ack. Proactive resend (with PossResend) when appropriate.
  • trader sends an Ack and assumes it’s delivered.
  • If the exchange doesn’t get any Ack it would force a heartbeat with a TestRequest. Then the exchange assumes the trader is offline.

malloc=long considered costly

Heap allocation is extremely slow compared to other operations.

 

## CRTP usageS #template Method

This blog post describes two usages of CRTP

I briefly read the excellent blog https://www.fluentcpp.com/2017/05/16/what-the-crtp-brings-to-code/. I feel CRTP is advanced for almost all the contexts I can think of. (In contrast,  I can see some “necessary” usages of SFINAE, such as the one in my little AddOrder.h)

https://stackoverflow.com/questions/262254/crtp-to-avoid-dynamic-polymorphism shows a simple [1] example of CRTP to replace virtual functions. How much efficiency improvement does it make? Questionable. I always say that if you want the lowest latency, then write selected modules in assembly language, and store it in hardware like FPGA.

[1] if there is anything simple in template meta-programming.

I have heard of several techniques to avoid virtual functions, but I believe the actual evidence (in terms of measured improvement in latency) is likely unconvincing or insignificant. Therefore, if CRTP is used to eliminate virtual function latency, then I am not sure how much value it adds.

There are other techniques to avoid “virtual”. I feel they are easier to understand than CRTP.

Sometimes CRTP is the only choice — if the virtual function needs its own template parameters, then compiler will complain that “function template can’t be virtual”. Rahul hit this complaint.

Q: for a true virtual method v1(), the derived class is not yet written when the base class is compiled. Later, Only at run time can the “system” pick the right implementation of v1(). How about CRTP?
A: base class is Not compiled ahead of the derived class. Each derived class includes a header defining the base class template.

——

Beside this “virtual-elimination” use case, CRTP has other applications (which I am still unfamiliar with), but if I’m asked in interviews I will only mention this one use case. One of the common “other usages” is TemplateMethod with compile time (not runtime) resolution, the 1st usage in the excellent blog above. In the classic template method pattern, the “procedure” is published and finalized in the base class. Individual steps of the procedure are virtual methods, resolved at runtime. In the CRTP version, superclass methods call subclass methods, safely and cleanly. Superclass using subclass is a taboo in most contexts, but both traditional TemplateMethod and CRTP-TemplateMethod are notable exceptions.

The article didn’t clearly highlight a key point about this usage — The base class NumericalFunctions is general purpose, designed to be subclassed by anyone.  I could write a Temperature class to subclass NumericalFunctions too. This way, the code in NumericalFunctions is available for reuse.

https://github.com/tiger40490/repo1/blob/cpp1/cpp/template/CRTP_demo1.cpp is working code. Key points to remember about the code sample:

  • base-class — is a template with a dummy type “Sub”
  • derived classes — have the form “class Sub1 : public Base<Sub1>”
  • the static-dispatch (non-virtual) function in Base always static_casts “this” to Sub*.
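A condensed version of those three points (the full code is in the github link above):

#include &lt;iostream&gt;

template &lt;typename Sub&gt;
struct Base {
    void dispatch(){                      // non-virtual, resolved at compile time
        static_cast&lt;Sub*&gt;(this)-&gt;impl();  // "this" cast to the derived type
    }
};
struct Sub1 : public Base&lt;Sub1&gt; {
    void impl(){ std::cout &lt;&lt; "Sub1::impl\n"; }
};
int main(){
    Sub1 s;
    s.dispatch();   // prints Sub1::impl; no vtable involved
}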

 

pass temp into func by val: mv-ctor skipped #RVO CSY

Suppose we have a class MoveOnlyStr which has only a move-ctor, no copy-ctor. Suppose we pass an unnamed temporary instance of this class into a function by value, like void func1(MoveOnlyStr arg_mos).

Q: Will the move-ctor be used to create the argument object arg_mos? We discussed this in your car last time we met up.

A: If the temp is produced by a function, then No. My test shows I was right to predict that the compiler optimizes away the temporary, due to RVO. So the move-ctor is NOT used. This RVO optimization has existed long before c++11.

A: if the temp is not produced by a function, then RVO is irrelevant (nothing “Returned”) but I don’t know if there’s still some copy-elision.

## mkt data: avoid byte-copying #NIO

I would say “avoid” or “eliminate” rather than “minimize” byte copying. Market data volume is gigabytes so we want and can design solutions to completely eliminate byte copying.

  • RTS uses reinterpret_cast but still there’s copying from kernel socket buffer to userland buffer.
  • Java NIO buffers can remove the copying between JVM heap and the socket buffer in C library. See P226 [[javaPerf]]
  • java autoboxing is highly unpopular for market data systems. Use byte arrays instead

STL: few exceptions #no virtual

side note — virtual function is also very rare in STL classes. I don’t know any.


STL functions seldom throw exceptions. See P248 [[c++standard library]].

  1. at() method of vector/string/map… throws, since it’s the “checked” version of the subscript operator
  2. reserve() method throws

Besides these two unusual functions, STL would only throw the standard low-level exceptions like memory allocation failures. All other error conditions are undefined-behavior, such as

  • top()/pop() on priority queue
  • std::string operator[index] with invalid index

However, Payload data types going into STL container can create exceptions:

  • If a payload class ctor/assignment throws, then it would propagate.
  • Payload destructors should never throw. [[c++coding standard]] says it’s forbidden by STL standard.

private dtor restricts instantiation to heap only

My test proves that merely defining a stack instance or global instance (even without using it) is illegal, because the compiler generates an (illegal) call to the private dtor!

Friends and static methods can also call the private dtor.

One usage of such a class — force all users to go through a factory to instantiate this class.

#include <iostream>
using namespace std;

class PrivDtor{
  ~PrivDtor(){
        cout<<"dtor\n";
  }
public:
  static void destroy(PrivDtor* p){p->~PrivDtor();} // runs the dtor only; memory is not freed
  static void del(PrivDtor* p){delete p;}
};

//PrivDtor globalInstance;  //won't compile
int main(){
  //PrivDtor stackInstance; //won't compile

  PrivDtor * p = new PrivDtor;
  //PrivDtor::del(p); // works!
  PrivDtor::destroy(p);  //works!
}

 

non-blocking socket readiness: alternatives2periodic poll

Some Wells interviewer once asked me —

Q: After your non-blocking send() fails due to a full buffer, what can you do to get your data sent ASAP?

A simple solution is retrying after zero or more milliseconds. Zero would be busy-wait, i.e. spinning CPU. Non-zero means unwanted latency.

A 1st alternative is poll()/select() with a timeout, and immediately retry the same. There’s basically no latency. No spinning either. The linux proprietary epoll() is more efficient than poll()/select() and a popular solution for asynchronous IO.

2nd alternative is SIGIO. http://compgeom.com/~piyush/teach/4531_06/project/hell.html says it doesn’t waste CPU. P52 [[tcp/ip sockets in C]] also picked this solution to go with non-blocking sockets.

http://compgeom.com/~piyush/teach/4531_06/project/hell.html is actually a concise overview of several alternatives

  • non-blocking socket
  • select/poll/epoll
  • SIGIO
  • .. other tricks

## threading: a better specialization than algo,quant…

  • In summary — appears daunting and opaque but actually simple in practice
  • theoretical — like [[dougLea]].. my strength. Few coding tests and usually easy for me
  • fairly low-level — my strength. Not as low level as debugger, or template hacks …
  • GTD — no GTD challenges, since most projects use only simple threading designs. The code tracing, trouble-shooting expectation on me is relatively low.
  • opaque — for my peers. Even the basic condVar…
  • churn — low churn at the low level, but high churn at high level
  • ever-green — favorite topic of interviewers, esp. java
  • thin->thick->thin — I have achieved this for a long time.

Many of my halos are in this domain — ##halo across domains #specific=better

housekeeping^payload fields: vector,string,shared_ptr

See also std::string/vector are on heap; reserve() to avoid re-allocation

std::vector — payload is an array on heap. Housekeeping fields hold things like size(??), capacity, pointer to the array. These fields form a “shell” and are allocated either on stack or heap or global area depending on your variable declaration.

  • Most STL (and boost) containers are similar to vector in terms of memory allocation
  • std::string — payload is a char-array on heap, so it can expand both ways. Housekeeping data includes size…
  • shared_ptr — payload includes a ref counter and a raw-pointer object [1] on heap. This is the control-block shared by all “club members”. There’s still some housekeeping data (pointer to the control block), allocated on stack if you declare the shared_ptr object on stack and then use RAII.

If you use “new vector” or “new std::string”, then the housekeeping data will also live on heap, but I find this practice less common.

[1] this is a 32-byte pointer object, not a pure address. See 3 meanings of POINTER + tip on q(delete this)

non-local^local static object initialization ] c++

Regarding when such objects are initialized, there are simple rules as illustrated in 2 simple yet concurrent singleton implementations ] c++. The two singleton implementations each use one of the 2 types.

The rules:

  1. lazy init — local statics are initialized in the first function call. Only one thread would initialize it, never concurrently on two threads. See concurrent lazy singleton using static-local var
  2. eager init — non-local statics are initialized before main(), on one single thread, but the sequence among them is non-deterministic, as explained by Scott Meyers.
  3. Both are kind of thread-safe
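A minimal sketch of rule 1, the classic Meyers singleton (c++11 guarantees the lazy init runs exactly once, even under concurrent first calls):

class Config {
    Config() = default;          // private ctor
public:
    static Config&amp; instance(){
        static Config theOne;    // initialized on first call, thread-safely
        return theOne;
    }
};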

shared_ptr dislikes private ctor/dtor in payload class

1) make_shared can’t call a private ctor. If your ctor is private, you have to use

shared_ptr<MyClass> sp(new MyClass); // within your class, e.g. in a static factory method

2) If your MyClass dtor is private, you simply can’t hold it in shared_ptr.

#include <iostream>
#include <memory>
using namespace std;
class C{
  //~C(){cout<<"dtor\n"; } // breaks shared_ptr<C>
  C(){cout<<"ctor\n"; }
public:
  static shared_ptr<C> mk(){
    //shared_ptr<C> sp = make_shared<C>(); //won't compile if ctor is private
    return shared_ptr<C>(new C());
  }
};
int main(){ shared_ptr<C> sp = C::mk(); }

a shared_ptr !! @heap #no make_shared

Boost documentation says shared_ptr class template stores a pointer to a dynamically allocated object (not an “auto” object), typically allocated with a C++ new-expression (not array-new .. see separate blogpost). The object pointed to is guaranteed to be deleted.

If you feed it a stack object, then the eventual deletion will probably be undefined behavior. For a non-heap pointer, you can’t use make_shared() according to [[effModernC++]]

Q: Can you finish your task and exit the program before the eventual deletion? Maybe you don’t care?

Anyways, it’s unwise to create a library routine accepting a pointer argument and returning a shared_ptr. The library routine has no way to test if it points to the heap or the stack (or global area). Such a routine (if you must make one) had better force callers to pass in an explicit flag “isOnHeap”.

As stated in http://stackoverflow.com/questions/7383572/c-shared-ptr-of-stack-object, the function should be designed to receive a shared_ptr as argument, not a raw ptr.

I believe it’s possible to seed a shared_ptr with a non-heap pointer, if you provide your own deleter.
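A sketch of that custom-deleter trick, assuming the stack object outlives every club member:

#include &lt;memory&gt;

int main(){
    int onStack = 42;
    std::shared_ptr&lt;int&gt; sp(&amp;onStack, [](int*){ /* deliberately no delete */ });
}   // sp's dtor runs the no-op lambda: no undefined behavior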

algoTrading^bigData^quant

Update: I told bbg (Karin?) and Trex interviewers that domain isn’t a big concern to me. Even a back office IT role can turn out to be better.


  1. quantDev
  2. algoTrading
  3. bigData

… are the 3 big directions for trySomethingNew. I’m cautious about each.

quantDev (not pure quant) — low demand; poor market depth; unable to find another job; least jobs and only in banks. CVA is not really quant dev, based on what I gathered. See also %% poor accu ] quantDev

algoTrading — perhaps I should try java and medium frequency?

bigData — no consolidation; questionable accumulation and value-creation

read-copy-update lockfree +! retry #RCU

RCU is an advanced concurrency construct, not implemented in a regular application, but valuable for lock-free interviews.

http://concurrencyfreaks.blogspot.sg/2016/09/a-simple-userspace-rcu-in-java.html is a “simple Userspace” java implementation. I didn’t read it in detail and assume it’s rather different from the linux kernel RCU

http://www.modernescpp.com/index.php/aba-a-is-not-the-same-as-a (by a published author) has a brief mention of Userspace RCU solution to ABA problem. Also touches on Garbage Collection.

https://lwn.net/Articles/262464/ — by RCU inventor. I can see RCU is non-trivial !

The Wikipedia article is accessible until the paragraph introducing read-side-critical-section and grace-period. This is the first paragraph on the implementation. I found it hard. Therefore a few background pointers:

  • There must be multiple threads, typically many readers and few writers.
  • There must be a shared mutable data structure, probably on heap.
  • Not all data structures are suitable for RCU. So which subset is? I would say pointer-graphs including hash tables.

In the simplified model, we are talking about

  • A reader thread R1 executing a block of code that has started reading the shared mutable data structure. This code block is the critical section.
  • A writer thread W1 executing a block of code attempting to update the same data.
  • After the so-called grace period, a GC thread would reclaim the obsolete data that R1 has finished reading.
  • A reader R2 enters the critical section after the update is done. R2 would see the new data.

The GC thread needs to know when it’s safe to reclaim. It needs the wait-for-readers operation and the grace period.

Q: What if R1 gets a reference to the “obsolete node”, performs some slow task, reads the node, then exits the critical section? This node is reclaimable only after R1 exits the critical section, so the grace period could be long?

Q: Among 9999 nodes, how does the system keep track of which nodes are reclaimable?
%%A: I feel kernel access to CPU registers might be needed.

Q: How does it compare with copy-on-write?
A: COW probably copies the entire data structure. Also, the modified copy is not visible to subsequent readers (like R2)

Q: How does it compare with read/write lock?
A: readlock can block

Q: How does it relate to lock-free concepts?
A: reader threads are lock-free and even better — not required to retry.
A: GC threads need to wait.
A: Writer threads (like W1)? Not sure. Might be lock-free in some situations.
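
To make the reader side concrete, here is a minimal C++ sketch using atomic_load/atomic_store on a shared_ptr. Strictly speaking this is closer to COW than to real RCU (the refcount plays the role of the grace period, whereas real RCU avoids even the per-read refcount traffic), and it assumes a single writer. All names are made up:

#include <map>
#include <memory>
#include <string>
using Table = std::map<std::string,int>;
std::shared_ptr<const Table> g_table = std::make_shared<const Table>();

int readerLookup(const std::string& key){ // R1/R2's critical section
  std::shared_ptr<const Table> snap = std::atomic_load(&g_table); // no lock, no retry
  auto it = snap->find(key);
  return it == snap->end() ? -1 : it->second;
} // an obsolete snapshot is reclaimed when its last reader releases it

void writerUpdate(const std::string& key, int val){ // W1: copy, modify, publish
  auto fresh = std::make_shared<Table>(*std::atomic_load(&g_table)); // private copy
  (*fresh)[key] = val;
  std::atomic_store(&g_table, std::shared_ptr<const Table>(std::move(fresh)));
}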

support MyType a=7 despite q[ explicit ] conversion constructor@@

Background:

  explicit MyType(int); // would disallow:
  MyType a = 77;

http://en.cppreference.com/w/cpp/language/converting_constructor has a solution:

  MyType a = (MyType) 77; // the cast (a C-style cast acting as static_cast) invokes the explicit conversion constructor!

In general, most custom types should make conversion constructors explicit to prevent hidden bugs, but a home-grown smart pointer may keep its conversion constructor implicit, to support

  SmartPtr myPtr = new int(77);

(Note the standard smart pointers made the opposite choice: their raw-pointer constructors are explicit, so this copy-initialization is rejected for shared_ptr/unique_ptr.)
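
A toy sketch (a hypothetical class, unlike the standard smart pointers) of what the implicit conversion ctor buys:

template<typename T>
class SmartPtr {
  T* p_;
public:
  SmartPtr(T* raw) : p_(raw) {} // deliberately NOT explicit
  ~SmartPtr(){ delete p_; }
  // toy only: a real smart ptr also needs copy/move control
};
int main(){
  SmartPtr<int> myPtr = new int(77); // copy-initialization relies on the implicit ctor
}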

–A real example from CFM quant code

 FwdCurve::const_iterator iter = find( key );

 // original code, probably written before the conversion ctor became "explicit":
 // the non-const iterator returned by end() converted implicitly to const_iterator
 QL_ASSERT( iter != ((FwdCurve * const) this)->end(), "not found" );

// I had to change it to
 FwdCurve_iterator endItr = ((FwdCurve * const) this)->end(); // non-const iterator
 QL_ASSERT( iter != FwdCurve_const_iterator(endItr), "not found" ); // call the conversion ctor explicitly

factory returning smart ptr^pbclone #Sutter2013

Background — This is one example of “too many variations”. A simple idea becomes popular… People start to mix-n-match it with other ideas and create many, many interesting questions. In practice we should stick to one or two best practices and avoid the other 50 variations.

Update: https://herbsutter.com/2013/05/30/gotw-90-solution-factories/ is a little too dogmatic as an article. A few take-aways:

  • returning raw ptr can indeed leak in real life. I would still do this in simple projects.
  • For a polymorphic type, return a unique_ptr by default. See also [[effModernC++]]. (Return shared_ptr only if the factory saves it in some internal state, but who does?)
  • For a non-polymorphic type, return a clone. Safely assume it’s copyable or movable in c++11.
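
A minimal sketch of that default advice for the polymorphic case (Shape/Circle are made-up names):

#include <memory>
struct Shape { virtual ~Shape() = default; };
struct Circle : Shape {};

std::unique_ptr<Shape> makeShape(){ // polymorphic factory: unique_ptr by default
  return std::make_unique<Circle>(); // heap resource; ownership transferred to caller
}

int main(){
  auto s = makeShape(); // the unique_ptr itself is on the stack -- no way to "forget" it
  std::shared_ptr<Shape> sh = makeShape(); // caller can still opt into shared ownership
} // both pointees deleted automatically here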

I think it’s quite common (see [[safe c++]] P51) to have a factory function creating a heap “resource”. Given it’s on the heap, the factory must return the object by pointer. It’s easy to hit a memory leak this way. The standard solution is to return a ref-counted smart ptr.

Q: what if I forget to destruct the obj returned?
A: I feel it’s uncommon and unnatural. (The more common mistake is to forget deletion.) Often the object returned (i.e. the smart pointer object) is on the stack – no way to “forget”.

Note if a (64-bit) raw pointer variable is on the stack, then stack unwinding will deallocate/reclaim those 64 bits, but won’t call delete on the pointee!

Q: what if the object returned is saved as a field like myUmbrella.resource1
A: I feel this is fine because destructing the myUmbrella object is someone else’s responsibility. That destruction would automatically destruct the smart pointer field this->resource1.

As an alternative to the ref-count smart ptr, we can also return a (boost) scoped pointer. See [[safe c++]].

Q: what if factory need to return a pointer to stack?
A: I don’t think so. The pointee goes out of scope….
A: I feel most “resources” are represented by a heapy thingy. Alternatively, it’s also conceivable to hold a “permanent” resource globally. Memory management is then a non-issue.

By the way, [[safe c++]] also mentions the __common__ pattern of a (non-factory) function returning a pointer to an existing object, rather than a “new SomeClass”. The challenges and solutions are different.

c++compiler must know data type sizes

http://stackoverflow.com/questions/6264249/how-does-the-compilation-linking-process-work points out

So what the compiler outputs is rough machine code that is not yet fully built, but is laid out so we know the size of everything, in other words so we can start to calculate where all of the absolute addresses will be located. The compiler also outputs a list of symbols which are name/address pairs. The symbols relate a memory offset in the machine code in the module with a name. The offset being the absolute distance to the memory location of the symbol in the module.

Stroustrup told me about a key performance advantage of c++ over modern languages — local variables. If we want to use more local variables and fewer heap objects, then I can understand why the compiler must know, at compile time, the size of every data type.
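
A minimal illustration via an incomplete type:

struct Foo; // forward declaration: size unknown at this point
Foo* p = nullptr; // OK -- a pointer's size is known regardless of Foo's size
// Foo f1; // error: cannot define a variable of incomplete type
struct Foo { int x; double y; }; // size now known
Foo f2; // OK -- the compiler can lay out this (global or local) variable
int main(){ return 0; }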

tibrv is decentralized, peer-to-peer, !! hub/spoke

JMS unicast topic has a centralized [1] broker. RV is decentralized. All the rvd processes are equal in status. This is a little bit similar to
– email servers among Ivy League universities.
– file-sharing peer networks

RV is peer-to-peer; JMS is hub/spoke. Note in JMS request/response, “server” has ambiguous meanings. Here we mean the broker.

[1] You can load-balance the broker but the “outside world” views them as a single broker. It’s the Hub of hub/spoke. Scalability bottleneck.

4 distinct sub-domains of financial math

Even though many real world scenarios involve more than a single topic below, it’s still practically useful if you understand one of the topics well.

* bond math including IRS, FRA…
* option
* VAR — correlation, simulation
* credit risk — essential to commercial banks, consumers, corporate borrowers… though I don’t know how much math

Every single topic is important to risk management (like the OC quant team); only half the topics are important to pre-trade pricing. For example, a lot of bond math (duration, OAS, term structure) is not that critical to pre-trade quote pricing.

All derivatives involve more math than their underlying instruments do, partly because of the expiration built into every derivative instrument. Among derivatives, option math complexity tends to exceed that of swaps, futures and forwards.

python tuples aren’t waterproof immutable

–Based on http://www.velocityreviews.com/forums/t339699-are-tuple-really-immutable.html
>>> t = ([1],[2])
>>> map(id, t)   # apply the id() function to each item in t (Python 2 session)
[47259925375816, 47259925376392]
>>> t[0].append(0)   # mutate the first referent
>>> t
([1, 0], [2])
>>> map(id, t)   # ids unchanged -- same two member objects
[47259925375816, 47259925376392]

So tuple deviates from java-style immutability, which would mandate t[0] returning a clone — essentially copy-on-write.

A tuple is like an ordered club roster written with indelible ink. The members of the club may change jobs, age, salary etc but the roster remains the same: same members, same SSN like python id(), same ranking.

In C++ lingo, the “t” tuple has 2 pointers on its real estate. It qualifies as immutable since the two pointer-sized fields remain _bit_wise_constant_. The pointees live outside the tuple’s real estate and are Editable.
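
A hedged C++ sketch of that mental model:

#include <vector>
struct Tup { // the "roster": two const pointers on the struct's own real estate
  std::vector<int>* const first;
  std::vector<int>* const second;
};
int main(){
  std::vector<int> a{1}, b{2};
  Tup t{&a, &b};
  t.first->push_back(0); // pointee is editable, like t[0].append(0)
  // t.first = &b; // error: the pointer field is bitwise-const
}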

message requester to wait5min for response #wait()

Imagine a typical request/reply messaging system. I think in JMS it’s usually based on temp queues, reply-to and correlation-Id — See other blog post. In contrast, RV has no broker. It’s decentralized into multiple peer rv daemons. No difference in this case —

Suppose a message broker holds a lot of queues. One of the queues is for a request message, from requester system to a pricing system. Another queue is for pricing system to return the new price to the requester.

Now, pricing system is slow. Requester should wait for no more than 5 minutes. If the new price comes back through the reply-queue 301 sec later, requester will ignore this stale price since it’s too risky to place an order on a stale price in a fast market. How do you implement this?

My design — Requester main thread can use condVar wait(5*60*1000). Another thread in the requester JVM can block forever in onMsg(), and notify the main thread when something is received.

Timer is the 2nd usage of condVar, as Stroustrup described.

(I actually implemented this in a few trading engines.)
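
A C++ analogue of this design, as a minimal sketch with std::condition_variable (the Java version would use wait(timeout)/notify; names here are made up):

#include <chrono>
#include <condition_variable>
#include <mutex>
#include <string>

std::mutex mu;
std::condition_variable cv;
std::string price; // filled in by the listener thread
bool received = false;

void onReply(const std::string& p){ // runs on the messaging callback thread, like onMsg()
  { std::lock_guard<std::mutex> lk(mu); price = p; received = true; }
  cv.notify_one();
}

bool waitForPrice(std::string& out){ // requester's main thread
  std::unique_lock<std::mutex> lk(mu);
  if (cv.wait_for(lk, std::chrono::minutes(5), []{ return received; })){
    out = price; // reply arrived within 5 minutes
    return true;
  }
  return false; // timed out; any later price is ignored as stale
}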

msg overflow ] JMS^tibrv #60sec

Message overflow is a frequent headache in MOM. Basically consumption rate (possibly 0) is slower than production rate, eating up storage capacity, either in memory or on disk.

  • eg: JMS Topic with NDS (non-durable subscribers) only — almost free of overflow. Bonnie said the broker will discard. Jerry said the broker keeps track of alive subscribers. Every incoming message is guaranteed to be delivered (by retry) to each. The broker must persist pending messages. When a subscriber disconnects gracefully, the broker no longer needs to keep any pending message for her.
  • eg: chatrooms don’t persist data but what type of MOM is that? probably NDS topic
  • eg: JMS queue. While any number of producers can send messages to the queue, each message is guaranteed to be delivered, and consumed by one consumer. If no consumers are registered to consume the messages, the queue holds them (on disk) until a consumer registers to consume them.
  • eg: JMS temp queue??
  • eg: regular queue? In SIM upfront, unread messages –persist–
  • eg: mailing list (topic with durable subscribers)? –Persist–
  • eg: JMS topic without durable subscriber? see post on message retention
  • eg: tibrv CM? Storage is the ledger file. –persist–
  • eg: tibrv non-CM? market data feed? Discarded, but how soon? 60 sec by default? Yes, with a slow consumer and a fast producer, message loss is possible — http://arun-tibco.blogspot.com/2010/09/tibco-rv-faqs.html

another basic difference iterator^container^algo

(Why bother? Well, you need to know these when you debug STL or extend STL.)

– Containers — are class templates. 100%
– Algorithms — are function templates. 100%

– iterators? A beginner can safely assume that, most of the time, iterators are defined inside each container template as a member type. Since the container is parameterized on a dummy type T, the iterator must be a class template of T (or a typedef thereof), presumably a nested class template, as inferred from the syntax:

vector<int>::iterator

– The container/algo/iterator adapters are typically class templates
– functors are typically class templates

A trivial consequence — the declarations of containers and iterators are complicated by the templates. Algorithm declarations are simpler in comparison.
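
A minimal side-by-side of the three, under the usual STL conventions:

#include <algorithm>
#include <vector>
int main(){
  std::vector<int> v{3,1,2}; // container: instance of a class template
  std::sort(v.begin(), v.end()); // algorithm: call to a function template
  std::vector<int>::iterator it = v.begin(); // iterator: member type of the container
  return *it; // 1 after sorting
}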

JMS temporary queue, tibrv and sybase temp table

Update — tibrv’s flexibility makes the JMS tmp queue look unnecessary, rigid, uptight and old-fashioned.

A tmp queue (or tmp topic? [4]) is used for request/reply. I feel the essential elements of JMS request/reply are
1) tmp queue
2) set and get reply-to — setJMSReplyTo(yourTmpQueue)
3) correlation id — setJMSCorrelationID(someRequestID)

Server-side (not the broker, but another JMS user) uses getJMSCorrelationID() and then setJMSCorrelationID() in the response message, then replies to getJMSReplyTo()

tmp queue is like a Sybase tmp table — (I feel this is the key to understanding tmp queue.)
* created by requestor
* removed by the “system janitor”, or by requestor
* private to the session. Therefore creator threads must keep the session alive.

[4] temp topic too. p67 [[JMS]]

generic callback in a concurrent java/c++ sys

(Note OO is a development-time concept. Once compiled, an OO program looks no different from a C program with tons of structs and function pointers.)

Most financial systems today are concurrent OO systems. The lowly callback pattern is extremely common but actually requires a thorough understanding of both thread and object interaction. Let’s examine some. Note just about every player in the callback scenarios is a pointer, including function pointers.

In the simplest 1-thread case, an observer pointee registers interest with a subject object by giving its own address. After the subject’s state changes in thread T1, the same thread loops through the list of observers and invokes the callback method on each observer. Note the observer object must not be deallocated while registered.
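
A minimal single-threaded sketch of this scenario (all names made up):

#include <functional>
#include <iostream>
#include <vector>

class Subject { // "Hollywood"
  std::vector<std::function<void(int)>> observers_; // registered callbacks
public:
  void subscribe(std::function<void(int)> cb){ observers_.push_back(std::move(cb)); }
  void setState(int s){ // state change in the calling thread...
    for (auto& cb : observers_) cb(s); // ...then the SAME thread invokes each callback
  }
};

void freeFuncObserver(int s){ std::cout << "free func saw " << s << "\n"; }

int main(){
  Subject subj;
  subj.subscribe(freeFuncObserver); // ptr to a free function, no object needed
  subj.subscribe([](int s){ std::cout << "lambda saw " << s << "\n"; });
  subj.setState(42); // both callbacks fire on this thread
}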

For onMsg() in tibrv or onMessage() in JMS, there’s a remote “upstream” thread in the message broker’s PID/runtime, and there’s a separate downstream thread in the listener’s PID. I believe this thread starts out blocking for messages. Upon receiving a msg, the runtime somehow wakes up this blocking thread and this thread invokes onMsg() on the listener pointee. Meanwhile what’s the remote upstream thread doing? Simplest design is synchronous — upstream thread waits for the listener thread to finish. This means the upstream thread can only dispatch the message to remote listeners sequentially. I feel it’s much better to get the upstream thread to return right away after sending the one-way-message[3] (see Doug Lea’s book) so this thread can contact the 2nd listener in the list. Rewind to registration time — Any thread could register the listener with “Hollywood”. This thread could terminate immediately but the listener pointee must live.

[3] in classic Reo, we actually send one-way messages to the market-facing gateway. We get the status message on a separate thread via onMessage().

In C/C++, you can also register into “Hollywood” a ptr to a free func. No object, usually no state (unless the free func uses stateful static objects). Free functions are free of destruction. Hollywood could save the (function) ptr in any collection. When there’s an event, Hollywood calls us back, in any thread.

tibrv-CM ^JMS

[[ rv concepts ]] has a concise comparison. A few highlights on CM:

(CM is one of the most important RV use cases in trading.)

* A producer sends a message to consumers directly (peer to peer, decentralized). The producer stores the message until each consumer has acknowledged receipt.

**Note the rv daemon is not a central server. There is usually one daemon on each host. These function a bit like cisco routers in a network. They don’t store or retry messages.

**I think p2p is a defining feature of rv (not only CM) relative to JMS.

* Disk failure on a peer host computer affects only the messages that its programs produce or consume. However, disk mirroring (often done on jms servers) for each individual peer is often impractical.

tibRV event object@@

C# “event” also has a dual-meaning…

In RV, an Event object can represent
1) a listener (or program’s interest in events) and also
2) an “event occurrence” (but NOT a TibrvMsg instance). I think we need to get comfortable with this unusual design.

As described in the “flow” blogpost, probably the main event object we use is the listener object. In fact, Chapter 5 of the java manual seems to suggest that
* TibrvListener is (probably the most important) subclass of TibrvEvent
* TibrvTimer is (probably the 2nd most important) subclass of TibrvEvent
* For beginners, you can think of TibrvEvent.java as an abstract base class of the 2

tibrv-UDP-multicast vs tcp-hub/spoke-JMS

http://narencoolgeek.blogspot.com/2006/01/tibco-rv-vs-tibco-ems.html
http://stackoverflow.com/questions/580858/tibco-rendezvous-and-msmq

Looks like the consensus is
* tibco rv — fastest, high volume, high throughput, less-than-perfect reliability even with certified delivery
* JMS — slower but perfect reliability.

Before adding bells and whistles, core RV was designed for
* efficient delivery — no bandwidth waste
* high volume, high throughput
* multicast, not hub/spoke, not p2p
* imperfect reliability. CM is on top of core RV
* no distributed transaction, whereas JMS standard requires XA. I struggled with a Weblogic jms xa issue in 2006

— simple scenario —

If a new IBM quote has to reach 100 subscribers, JMS (hub/spoke unicast topic) uses tcp/ip to send it 100 times, sequentially [3]. Each subscriber has a unique IP address and receives its own copy of the message. In contrast, Multicast sends [4] it once, and only duplicates/clones a message when the links to the destinations split [5]. RV uses a proprietary protocol over UDP.

[5] see my blogpost https://wordpress.com/post/bintanvictor.wordpress.com/32651

[3] http://www.cs.cmu.edu/~priya/WFoMT2002/Pang-Maheshwari.pdf benchmark tests reveal “The SonicMQ broker requires a separate TCP connection to each of the subscriber. Hence the more subscribers there are, the longer it takes for SonicMQ to deliver the same amount of messages.”
[4] The same benchmark shows the sender is the RVD daemon. It simply broadcasts to the entire network. Whoever is interested in the subject will get it.

— P233 [[weblogic definitive guide]] suggests
* a multicast jms broker has a constant workload regardless of how many subscribers there are. This is because the broker sends each message once rather than 100 times as in our example.
* multicast saves network bandwidth. Unicast topic requires 100 envelopes processed by the switches and routers.

2 questions on tibRV to a Tibco friend #LS

Q1: Why is certified messaging so widely used in trading systems when CM is not as reliable as JMS? I guess if a system already uses Rendezvous, then there’s motivation to avoid mixing it with JMS. If the given system needs higher reliability than the standard non-guaranteed RV delivery, then the easiest solution is CM. Is that how your users think?

A: I don’t think CM is not as reliable as JMS. However you can say CM’s acknowledgment mechanisms are not as flexible as JMS/EMS. The latter supports several acknowledgment modes. An important advantage of RV is it supports multi-cast which makes broadcasting quotes more efficient. JMS 5 supports something similar for topic subscriber but not as natural as RV.

Q2: in JMS, a receiver can 1) poll the queue or 2) wait passively for the queue to call back, using onMessage() — known as a JMS listener. These are the synchronous receiver and asynchronous receiver, respectively. I feel Rendezvous supports asynchronous only, in the form of onMsg(). The poll() or dispatch() methods of the TibrvQueue object don’t return the message, unlike the JMS polling operation.

c++ exception handlers – global var

These “types” are all aliases of the same type: a typedef over “ptr to void-void function”.

– terminate_handler / set_terminate() [1]
– unexpected_handler / set_unexpected() [1]
– new_handler / set_new_handler() [2]

Each setter modifies a global (implicitly static) variable. Global variables are quite common in C. The canonical example is the errno global variable (which /bothers/ pthreads). I believe all of these handler globals are shared by all threads.
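
A minimal sketch of one of these setters in action, std::set_terminate() swapping the process-wide handler:

#include <cstdlib>
#include <exception>
#include <iostream>

void myTerminate(){ // signature matches terminate_handler, i.e. void(*)()
  std::cerr << "custom terminate handler\n";
  std::abort(); // a terminate handler must not return
}

int main(){
  std::set_terminate(myTerminate); // overwrites the global handler variable
  std::terminate(); // the runtime reads that global and calls myTerminate
}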

For set_unexpected, Java offers similar things. (http://www.javapractices.com/topic/TopicAction.do?Id=229 highlights Swing EDT.)
* setUncaughtExceptionHandler()
* setDefaultUncaughtExceptionHandler()

[1] FAQ 271
[2] [[ effC++ ]]