cgroups #basics

  • group-of-Processes, …. not group of resources.
    • Initially named “Process Containers”
  • 2006 started at google
  • 2007 “mainlined” into linux kernel
  • Not available beyond linux, as far as I know

lockfree stack with ABA fix #AtomicStampedReference

http://15418.courses.cs.cmu.edu/spring2013/article/46 explains ABA in the specific context of a lockfree stack. It uses CAS2 to counter ABA problem.

Java CAS2? I believe AtomicStampedReference is designed for it.

http://tutorials.jenkov.com/java-util-concurrent/atomicstampedreference.html#atomicstampedreference-and-the-a-b-a-problem explains the AtomicStampedReference solving the ABA problem

Actually, many ABA illustrations are simplistic. Consider this illustration:

  1. Thread 0 begins a POP and sees “A” as the top, followed by “B”. Thread 0 saves “A” and “B” before committing.
  2. Thread 1 begins and completes a POP, returning “A”.
  3. Thread 1 begins and completes a push of “D”.
  4. Thread 1 pushes “A” back onto the stack and completes. So now the actual stack top has A above D above B.
  5. Thread 0 sees that “A” is on top and returns “A”, setting the new top to “B”.
  6. Node D is lost.

— With a vector-of-pointer implementation, Thread 0 needs to save integer position within the stack. At Step 5, it should then notice the A sitting at stack top is now at a higher position than before, and avoid ABA.

The rest is simple. Thread 0 should then query the new item (D) below A. Lastly, the CAS would compare current stack top position with the saved position, before committing.

However, in reality I think a vector-of-ptr is extremely complex to implement if we have two shared mutable things to update via CAS: 1) the stackTopPosition integer and 2) the (null) ptr in the next slot of the vector.

— with a linked list implementation, I think we only have the node addresses, so at Step 5, Thread 0 can’t tell that D has been inserted between A and B.

You may consider using the stack size as a second check, but that would have similar complexity to CAS2 while being less reliable.
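Below is a hedged C++ sketch of the tagged-pointer (CAS2-style) fix: the stack top is a {pointer, version} pair updated in one compare-exchange, so a pop racing with a pop-push-push sequence still fails because the version has moved on. All names are illustrative; the 16-byte atomic may need a double-width CAS (e.g. cmpxchg16b) to be lock-free, and padding/ABI details are glossed over.

#include <atomic>

struct Node { int val; Node* next; };

struct TaggedTop {                  // pointer + version counter, compared as one unit
    Node* ptr;
    unsigned long ver;
};

std::atomic<TaggedTop> top{ TaggedTop{nullptr, 0} };

bool pop(int& out){
    TaggedTop cur = top.load();
    while (cur.ptr != nullptr) {
        TaggedTop next{ cur.ptr->next, cur.ver + 1 };  // bump version on every successful pop
        if (top.compare_exchange_weak(cur, next)) {    // fails if ptr OR version changed
            out = cur.ptr->val;
            return true;
        }                                              // on failure, cur is reloaded
    }
    return false;
}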

array-based order book: phrasebook

For HFT + mkt data + .. this is a favorite interview topic, kind of similar to retransmission.

Perhaps look at … https://web.archive.org/web/20141222151051/https://dl.dropboxusercontent.com/u/3001534/engine.c has a brief design doc, referenced by https://quant.stackexchange.com/questions/3783/what-is-an-efficient-data-structure-to-model-order-book

  • TLB?
  • cache efficiency

std::move(): robbed object still usable !

Conventional wisdom says after std::move(obj2), this object is robbed and invalid for any operation…. Well, not quite!

https://en.cppreference.com/w/cpp/utility/move specifies exactly what operations are invalid. To my surprise, a few common operations are still valid, such as clear() and operator=().

The way I read it — if (but not iff) an operation wipes out the object content regardless of current content, then this operation is valid on a robbed/hollowed object like our obj2.

Crucially, any robbed object should never hold a pointer to any “resource”, since that resource is now used by the robber object. Most movable data types hold such pointers. The classic implementation (and the only implementation I know) is by pointer reseat.
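A minimal sketch (standard containers only) showing that a moved-from object still accepts those “wipe regardless of content” operations:

#include <cassert>
#include <string>
#include <vector>

int main(){
    std::string s = "hello";
    std::string t = std::move(s);   // s is now "valid but unspecified"
    s.clear();                      // OK: clear() has no precondition on current content
    s = "reused";                   // OK: operator= overwrites whatever state s holds
    assert(s == "reused");

    std::vector<int> v{1, 2, 3};
    std::vector<int> w = std::move(v);
    v = {4, 5};                     // assigning to a moved-from vector is also fine
    assert(v.size() == 2);
    return 0;
}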

reliable multicast for replicated-cache update

https://www.usenix.org/conference/usits-99/scalable-web-caching-frequently-updated-objects-using-reliable-multicast is a 1999 research paper. I hope by now multicast has grown more mature and more proven. Not sure where this is used, perhaps within certain network boundaries such as a private network of data servers.

This paper examines reliable multicast for invalidation and delivery of popular, frequently updated objects to web cache proxies.

vector used in hash table bucket

Summary — Same worst case time/space complexity as traditional chaining, but d-cache efficient during lookup, insert and removal.

Background — hash table worst case complexity is due to a bad input (and a weak hash function) leading to heavy collision and a single hot bucket. Load factor will not help as the overall hash table size is still small. We end up with linear search in a long linked list. This affects not only lookup and removal; insert also requires a lookup to prevent duplicates.

— Based on https://en.wikipedia.org/wiki/Hash_table#Separate_chaining_with_other_structures

  • This design makes more efficient use of CPU caching and the translation lookaside buffer (TLB), because slot entries are stored in sequential memory positions.
  • It also dispenses with the next pointers that are required by linked lists, which saves space.
  • Despite frequent array resizing, space overheads incurred by the operating system such as memory fragmentation were found to be small.
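A hedged C++ sketch of the idea (all names hypothetical): separate chaining where each bucket is a std::vector, so probing a hot bucket walks contiguous memory instead of chasing next pointers.

#include <cstddef>
#include <functional>
#include <string>
#include <utility>
#include <vector>

template<typename K, typename V>
class VecBucketMap {
    std::vector<std::vector<std::pair<K, V>>> buckets_;
public:
    explicit VecBucketMap(std::size_t n = 16) : buckets_(n) {}

    void insert(const K& k, const V& v){
        auto& b = buckets_[std::hash<K>{}(k) % buckets_.size()];
        for (auto& kv : b)                        // linear scan, but over contiguous memory
            if (kv.first == k){ kv.second = v; return; }
        b.emplace_back(k, v);
    }

    V* find(const K& k){
        auto& b = buckets_[std::hash<K>{}(k) % buckets_.size()];
        for (auto& kv : b)
            if (kv.first == k) return &kv.second;
        return nullptr;
    }
};

// usage: VecBucketMap<std::string, int> m; m.insert("abc", 1); int* p = m.find("abc");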

power-of-2 sized hash tables: implications

https://dzone.com/articles/hashmap-internal and http://coding-geek.com/how-does-a-hashmap-work-in-java/ explain that java HashMap uses a bucket count equal to a power of two. A few implications:

  • the modulo operation is a fast bit-masking i.e. select the lowest N bits: hash & (length-1)
  • an unlucky hashCode() may not offer dispersion among the lower bits. Suppose there are 16 buckets, so only the lowest 4 bits matter, but hashCode() returns predominantly multiples of 8; then two out of 16 buckets would be white hot. To guard against this, HashMap performs a supplemental rehash on the hashCode() output.

I implemented the same feature in https://github.com/tiger40490/repo1/blob/py1/py/88miscIVQ/re_hash_table_LinearProbing.py

Note rehash is ineffective if hashCode() returns very few distinct values. In java8, there’s one more protection against it — RBTree as alternative to linked list… See RBTree used inside java8 hashmap
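The bit-masking trick itself is language-neutral. A small C++ sketch (spread() only illustrates the supplemental-rehash idea, it is not HashMap's exact formula):

#include <cstdint>
#include <cstdio>

static uint32_t spread(uint32_t h){ return h ^ (h >> 16); }  // mix high bits into the low bits

static uint32_t bucketIndex(uint32_t rawHash, uint32_t capacityPow2){
    // equivalent to rawHash % capacity when capacity is a power of two, but much cheaper
    return spread(rawHash) & (capacityPow2 - 1);
}

int main(){
    for (uint32_t h : {7u, 8u, 0x10008u, 0xABCD0008u})
        std::printf("hash 0x%x -> bucket %u (of 16)\n", h, bucketIndex(h, 16));
}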

doubly-linked list as AuxDS for BST

I find it a very natural auxDS. Every time the BST gets an insert/delete, this list can be updated easily.

Q: How about the self-adjustment after an insert or delete?
%%A: I think this list is unaffected

With this list, in-order walk becomes really easy.

https://leetcode.com/problems/kth-smallest-element-in-a-bst/solution/ is the first place I saw this simple technique.
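A minimal C++ sketch of the auxiliary list (layout assumed, head/tail tracking omitted): every BST node also sits in a doubly-linked list kept in key order, and an insert splices the new node between its in-order predecessor and successor.

struct Node {
    int key;
    Node *left = nullptr, *right = nullptr;   // BST links
    Node *prev = nullptr, *next = nullptr;    // sorted doubly-linked list links
};

// splice newNode into the list right after its in-order predecessor `pred`
// (pred may be null if newNode is the new minimum; that branch is omitted here)
void spliceAfter(Node* pred, Node* newNode){
    if (pred){
        newNode->next = pred->next;
        newNode->prev = pred;
        if (pred->next) pred->next->prev = newNode;
        pred->next = newNode;
    }
}

// in-order walk is now a plain list traversal:
// for (Node* n = head; n; n = n->next) visit(n);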

j8 MethodRef #cheatsheet

Need to develop more low-level insights as QQ…

  • Out of the four kinds of method refs, only AA) static-method and BB) specific-instance kinds are popular. The other two types are obscure.
  • q[ :: ] is used in all four kinds
  • I think the static-method kind is most readable and most intuitive. The javadoc tutorial features a BB example that should be implemented as a static method IMHO.
  • a lambda expression has parameter types. A method ref has none and must be converted to a lambda expression. Where does the compiler infer the parameter types from? I think they are inferred from the calling context.

array of std::byte^unsigned-char as “byte array”

Background: Suppose you need to store or transfer arbitrary binary data.

In RTS, we used char-array. Some say we should use unsigned char. It is the only data type that is guaranteed (by the ANSI C Standard) to have no padding bits. So all 8 bits in an unsigned char contribute to the value. None of them is a padding bit. I think std::uint8_t is similar but less common.

Contrary to some online posts, the unsigned-char type is distinct from plain “char” —

In C++17, std::byte is probably preferred because only the bitwise operations are defined. I believe you can reinterpret_cast a pointer to std::byte*. In https://github.com/tiger40490/repo1/blob/cpp1/cpp/lang_66mem/slistFromArray.cpp, I used none of the above — I used placement-new 🙂

In java we use the primitive “byte” type — an 8-bit signed integer.

##G5 c++skills { Stroustrup chat

C++ has clear advantage in low-latency

Ranked from most specific to most vague. Another ranking criterion is market value.

  1. memory mgmt – for time and space efficiency
  2. generic techniques to use stack objects instead of heapy thingies
    • reduce reliance on dynamic containers
  3. virtual — reusable techniques to avoid virtual
  4. other compiler internals that impact latency
    1. knowledge of inline — to trade off latency against code size
  5. latency instrumentation tools
  6. safe c++ techniques, to reduce the chance of run-time errors. C++ is particularly vulnerable as it is
    • more complex than C and others and
    • more close-to-hardware than Java and other languages.
  7. TMP — again to trade off latency against complexity. But this zbs domain is too big for me
  8. tool chain

 

local_var=c++strength over GC languages

Stroustrup told me c++ code can use lots of local variables, whereas garbage collected languages put most objects on heap.

I hypothesized that whether I have 200 local variables in a function, or no local variable, the runtime cost of stack allocation is the same. He said it’s nanosec scale, basically free. In contrast, with heap objects, biggest cost is allocation. The deallocation is also costly.

Aha — at compile-time, the compiler already knows how many bytes are needed for a given stack frame.

insight — I think local variables don’t need pointers. GC languages rely heavily on “indirect” pointers. Since GC often relocates objects, the pointer content needs to be translated to the current address of the target object. I believe this translation has to be done at run time. This is what I mean by “indirect” pointer.

insight — STL containers almost always use heap, so they are not strictly “local variables” in the memory sense

heap allocation: java Can beat c++

  • case 1 (standard java): you allocate heap memory. After you finish with it you wait for the java GC to clean it up.
  • case 2 (low latency java): you allocate heap memory but disable java GC. Either you hold on to all your objects, or you leave unreachable garbage orbiting the earth forever.
  • case 3 (c++): you allocate heap memory with the expectation of releasing it, so the compiler sets up housekeeping in advance for the anticipated delete(). This housekeeping overhead is somewhat similar to try/catch before c++11 ‘noexcept’.

Stroustrup suggested that #2 will be faster than #3, but #3 is faster than #1. I said “But c++ can emulate the allocation as jvm does?” Stroustrup said C++ is not designed for that. I have seen online posts about this “emulation” but I would trust Stroustrup more.

  • case 4 (C): C/c++ can sometimes use local variables to beat heap allocation. C programmers use rather few heap allocations, in my experience.

Note the jvm allocator and malloc are both userland allocators, not part of the kernel and usually not using system calls. You can substitute your own malloc.

https://stackoverflow.com/questions/18268151/java-collections-faster-than-c-containers top answer by Kanze is consistent with what Stroustrup told me.

  • no dynamic allocation is always faster than even the fastest dynamic allocation. Similar to Case 4
  • jvm allocation (without the GC clean-up) can be 10 times faster than c++ allocation. Similar to Case 2^3
    • Q: Is there a free list in JVM allocator?

https://softwareengineering.stackexchange.com/questions/208656/java-heap-allocation-faster-than-c claims

  • c++ Custom allocators managing a pool of fixed-sized objects can beat jvm
  • jvm allocation often requires little more than one pointer addition, which is certainly faster than typical C++ heap allocation algorithms

BST: finding next greater node is O(1) #Kyle

https://stackoverflow.com/questions/11779859/whats-the-time-complexity-of-iterating-through-a-stdset-stdmap shows that in a std::map, the time complexity of in-order iterating every node is O(N) from lowest to highest.

Therefore, each step is O(1) amortized. In other words, finding next higher node in such a BST is a constant-time operation.

https://en.cppreference.com/w/cpp/iterator/next confirms that std::next() is O(1) when advancing a single step. More generally, across all STL containers, advancing an iterator by N steps is O(N), except for random-access iterators, which are even faster, at O(1).
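A tiny illustration with std::map and std::next():

#include <cstdio>
#include <iterator>
#include <map>

int main(){
    std::map<int, const char*> m{{10, "a"}, {20, "b"}, {30, "c"}};
    auto it  = m.find(20);
    auto nxt = std::next(it);   // a single step: amortized O(1) on a red-black tree iterator
    std::printf("next greater key after 20 is %d\n", nxt->first);
}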

heap usage: C ilt C++/Java #Part 2

I now recall that when I programmed in C, my code never used malloc() directly.

The library functions probably used malloc to some extent, but malloc was an advanced feature. Alexandrescu confirmed my experience and said that C programmers usually make rather few malloc() calls, each time requesting a large chunk. Instead of malloc, I used mostly local variables and static variables. In contrast, C++ uses the heap much more:

  • STL containers are 99% heap-based
  • virtual functions require pointer, and the target objects are usually on heap, as Alexandrescu said on P78
  • pimpl idiom i.e. private implementation requires heap object, as Alexandrescu said on P78
  • the c++ reference is used mostly for pass-by-reference. Pass-by-reference usually works with heap objects.

In contrast to C’s few large allocations, C++ tends to request many small chunks of heap memory.

Across languages, heap usage is slow because:

  • In general OO programming uses more pointers, more indirection and more heap objects
  • heap allocation is much slower than stack allocation, as Stroustrup explained to me
  • using a heap object always requires a runtime indirection. The heap object has no name, only an address!
  • In Garbage-Collected languages, there’s one more indirection.

STL map[k]=.. Always uses assignment@runtime

The focus is on the value2, of type Trade.

— someMap[key1] = value2; //uses default-ctor if needed. Then Unconditionally invokes Trade::operator=() for assignment! See P344 [[Josuttis]]

Without a default-ctor or operator=, the Trade class probably won’t compile with this code.

— someMap.emplace(key1, value2); //uses some ctor, probably the copy-ctor, or the cv-ctor if value2’s type is not Trade
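A hedged sketch (hypothetical Trade class) that makes the difference visible by printing from each special member function:

#include <cstdio>
#include <map>
#include <string>
#include <utility>

struct Trade {
    std::string sym;
    Trade() { std::puts("default-ctor"); }
    Trade(std::string s) : sym(std::move(s)) { std::puts("cv-ctor"); }
    Trade(const Trade& rhs) : sym(rhs.sym) { std::puts("copy-ctor"); }
    Trade& operator=(const Trade& rhs){ sym = rhs.sym; std::puts("operator="); return *this; }
};

int main(){
    std::map<int, Trade> someMap;
    Trade value2{"IBM"};        // prints cv-ctor
    someMap[1] = value2;        // prints default-ctor, then operator=
    someMap.emplace(2, value2); // prints copy-ctor only
}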

hash table time complexity: O(N)^O(logN)^O(1)

Q: When interviewers ask about time complexity, shall I point out my hash table is O(log N) not O(1)?
A: if time permits, I would briefly mention O(N) and O(logN) worst case
A: if no time, then just say typically O(1). I don’t think we would lose many points.

https://stackoverflow.com/questions/9214353/hash-table-runtime-complexity-insert-search-and-delete has a fairly authoritative answer. Basically, it is indeed O(N) worst case in general but

  • can improve to O(logN) in many situations where java8 technique applies
  • in other cases it is O(N) in theory, but extremely rare in practice. The only known case is a deliberate DoS attack.

multicast: multi-player games

  • data loss is tolerable
  • many-to-many messaging — with multiple senders and multiple receivers

https://www.cisco.com/c/en/us/about/press/internet-protocol-journal/back-issues/table-contents-19/reliable-multicast.html says

Collaborative applications such as data conferences (whiteboarding) and network-based games are many-to-many applications with … modest scaling requirements of less than 100 participants.

This kind of application requires low latency of less than 400 millisec.

Transmission does not always need strict reliability; for example, refresh of background information for a network game could wait for the next refresh.

generate Fib(N) in O(1)time and space: 2 techniques

Like all constant-recursive sequences, the Fib sequence has a closed-form formula for Fib(N); the result is an integer, though computed in floating point. By rounding, we can generate Fib(N) in constant time.

https://www.math.hmc.edu/funfacts/ffiles/10002.4-5.shtml has the formula we can easily program.

Fib(N) = [ Phi^N − (−Phi)^(−N) ] / √5  # Notice the three minus signs

where the constant Phi := (1+√5) / 2.

Similarly, if we are given 89 as a Fib number, we can use logarithm to compute the rank of this Fib number.
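A minimal C++ sketch of both directions (double precision is only exact up to roughly Fib(70), so treat this as an illustration, not production code):

#include <cmath>
#include <cstdio>

long long fib(int n){                               // closed-form (Binet) formula, then round
    const double phi = (1.0 + std::sqrt(5.0)) / 2.0;
    return std::llround((std::pow(phi, n) - std::pow(-phi, -n)) / std::sqrt(5.0));
}

int rankOfFib(long long f){                          // inverse direction via a logarithm
    const double phi = (1.0 + std::sqrt(5.0)) / 2.0;
    return (int)std::llround(std::log(f * std::sqrt(5.0)) / std::log(phi));
}

int main(){ std::printf("fib(11)=%lld  rank(89)=%d\n", fib(11), rankOfFib(89)); }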

https://github.com/tiger40490/repo1/blob/cpp1/cpp/lang_33template/FibDeepak.cpp shows two O(1) solutions

ringBuffer@pre_allocated objects to preempt JGC

Goal — to eliminate JGC completely.

Design 1: I will want Order.java to use primitive fields only and avoid reference fields [1] at all cost, so the total footprint of an Order is known in advance. Say it’s 100 bytes. I will create 10M of dummy Order instances, possibly scattered in heap, not adjacent as in c++, and hold their 10M addresses in an Order array… about 1GB footprint for the Order objects + 80M footprint for the array of 8-byte pointers.

(Note I reuse these Order instances in this object pool and never let them get garbage-collected.)

Then I need a few subscripts to identify the “activeRegion” of the ring, but how about released slots enclosed therein?

[1] timestamps will be ints; symbolIDs and clientIDs are ints; short ascii strings will use 64-bit ints (8 characters/int); free-form strings must be allocated off-site:(

Design 2a: To avoid the “scatter” and to place the Order instances side by side, can we use a serialized byte[100] array object to represent one Order? Can we use one gigantic off-heap byte array to hold all Orders, eliminating the 80M footprint? See java off-heap memory

Design 2b: https://blog.bramp.net/post/2015/08/26/unsafe-part-2-using-sun.misc.unsafe-to-create-a-contiguous-array-of-objects/ shows a contiguous array of java objects, like std::vector<MyObject>

Design 2c: https://www.ibm.com/support/knowledgecenter/en/SSYKE2_7.1.0/com.ibm.java.lnx.71.doc/user/packed_optimizing.html is a feature in IBM jvm

Ring buffer is good if the object lifetimes are roughly equal, giving us FIFO phenomenon. This occurs naturally in market data or message passing gateways. Otherwise, we may need a linked list (free list) of released slots in addition to a pair of subscript to identify the active region.

It might be better to allocate a dedicated buffer for each thread, to avoid contention. Drawback? One buffer may get exhausted when another stays unused.

function returning (unnamed)rvalue object

The rules are hard to internalize but many interviewers like to zero in on this topic. [[effModernC++]] explains; a sketch follows the list below.

  • case: if return type is nonref, and in the function the thingy-to-be-returned is a …. rvr parameter [1], then by default, it goes through copy-ctor on return.
    • You should apply explicit return move(thingy)
    • [1] See P172, extremely rare except in interviews
  • case: if return type is nonref and in the function the thingy-to-be-returned is a …. stack (non-static) object behind a local variable (NOT nameless object) .. a very common scenario, then RVO optimization usually happens.
    • Such a function call is always (P175 bottom) seen by the caller as evaluating to a naturally-occurring nameless rvalue object.
    • the book says move() should never be used in this case
  • case: if return type is rvr? avoid it if possible. I don’t see any use case.
  • case: if return type is lvr (not rvr) and returned thingy is a local object, then compilation fails as explained in the blogpost above
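A hedged sketch of the two common cases above:

#include <string>
#include <utility>

// Case: the returned thingy is a local variable and the return type is non-reference.
// RVO/NRVO (or an implicit move) usually applies; wrapping the return in std::move()
// would actually defeat RVO, so we just return the name.
std::string makeGreeting(){
    std::string s = "hello";
    s += " world";
    return s;                 // do NOT write return std::move(s) here
}

// Case: the returned thingy is an rvr PARAMETER and the return type is non-reference.
// Without the explicit move, the named parameter is an lvalue and would be copied.
std::string passThrough(std::string&& s){
    return std::move(s);      // explicit move needed in this rare case
}

int main(){
    std::string a = makeGreeting();
    std::string b = passThrough(std::move(a));
    return b.empty();
}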

float/int 1:1 mapping#java support

See last paragraph of https://docs.oracle.com/javase/8/docs/api/java/util/concurrent/atomic/package-summary.html

you can use an AtomicInteger to hold byte values, and cast appropriately. You can also hold doubles using Double.doubleToRawLongBits(double) and Double.longBitsToDouble(long) conversions… because there’s a valuable 1:1 mapping between a 64-bit integer and a 64-bit floating-point value.

  • Useful if you need to precisely compare a 64-bit float value. For example, you can check if a value has changed at all. The value could be user-supplied or shared mutable.
  • Useful if you need to store float values in a hashtable, though I’m not sure about performance.
  • Useful if you need radix sort on floats

pre-allocate DTOs@SOD #HFT #ets

example — the etsflow framework pre-allocates an object pool (presumably the flow elements) for the day, to avoid runtime calls to malloc. Are these objects ever released back to the pool? I doubt it since all of these objects are subject to query or bust.

example — RTS pre-allocates outgoing message objects from a ring buffer’s head, and “returns” to the ring buffer at the tail… See How RTS achieved 400-700 KMPS #epoll

example — Sell-side HFT OMS also uses pre-allocation. Suppose for every new order there are 3 new DataTransferObjects A/B/C to be instantiated on heap. Traditional approach would make 3 allocation requests in real time. I think the free-list manager becomes a hotspot, even if there’s a per-thread free list.

Basically HFT avoids new/malloc after market opens. RTS uses mostly arrays, and very few (rather small) STL containers. Those STL containers tend to be populated before market opens and remain static.

Pre-allocation is a popular technique. We compute at compile time the sizes of A/B/C based on their class declarations. For DTO class A, sizeof(A) just adds up the non-static data field sizes. Then we estimate how many orders we will get a day (say 7 million). Then we pre-allocate 7 million A objects in an array. The allocation happens at start-up, though the sizes are compile-time constants.

When an order comes in, the A/B/C DTO objects are already allocated but empty.

Byte-array is an alternative, but this smells like the raw free list to me…
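A minimal C++ sketch (hypothetical names) of the start-of-day pool idea for one DTO class:

#include <cstddef>
#include <vector>

struct A { long orderId; int qty; double px; };    // plain data, sizeof(A) known at compile time

class PoolOfA {
    std::vector<A> slots_;            // one big allocation at start-up
    std::size_t nextFree_ = 0;
public:
    explicit PoolOfA(std::size_t expectedOrders) : slots_(expectedOrders) {}
    A* acquire(){                     // O(1) at runtime, no new/malloc after market opens
        return nextFree_ < slots_.size() ? &slots_[nextFree_++] : nullptr;
    }
};

// usage: PoolOfA pool(7000000); A* a = pool.acquire();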

Q:are java primitive+reference on heap or stack #escape

An old question but my answers are not really old 🙂

In Java, a so-called “referent” is a non-primitive thingy with a unique address on heap, accessed via heap pointers.

In java, a referent is always an Object, and an Object is always on heap therefore always a referent.

(Java language defines “reference types” in terms of primitive types, so we need a clear understanding of primitive types first.)

In java, a primitive thingy is either part of a (heapy) Object or a local thingy on stack

(In C++ lingo, object can be a new int(88)…)

A reference is, at run-time really a heap pointer. Assuming 32-bit machine, the pointer itself occupies 4 bytes and must be allocated somewhere. If the reference is local to a method, like a parameter or a local variable, then the 4 bytes are on stack like a 32-bit primitive local variable. If it’s part of an object then it’s on heap just like a 32-bit primitive field.

— advanced topic: escape

Escape analysis is enabled by default. EA can avoid constructing an Object on the heap, by using the individual fields as local variables.

— advanced topic: arrays

Arrays are special and rarely quizzed. My unverified hypothesis:

  • at run-time an array of 3 ints is allocated like an Object with 3 int-fields
  • at run-time an array of 3 Dogs is allocated like an Object with 3 Dog fields. This resembles std::vector<shared_ptr<Dog>>
  • Q: how about std::vector<Dog>?
  • %%A: I don’t think java supports it.
  • The array itself is an Object

jvm footprint: classes can dominate objects

P56 of The official [[java platform performance]], written by SUN java dev team, has pie charts showing that

  • a typical Large server app can have about 20% of heap usage taken up by classes, rather than objects.
  • a typical small or medium client app usually has more RAM used by classes than by data, with up to 66% of heap usage taken up by classes.

The same page also says it’s possible to reduce the class footprint.

11 notable features added to c++

https://en.cppreference.com/w/cpp/language/history briefly mentions

  • [90] exception handling
  • [90] templates
  • [98] cast operators
  • [98] dynamic_cast and typeid()
  • [98] covariant return type
  • [07] boost ref wrapper .. see std::reference_wrapper
  • [11] GarbageCollector interface .. See c++ GC interface
  • [11] std::next(), prev(), std::begin(), std::end() .. see favor std::begin(arrayOrContainer)
  • [11] exception_ptr? not sure how useful
  • [14] shared_lock — a RW lock
  • [14] shared_timed_mutex .. see try_lock: since pthreads
  • [14] std::exchange — comparable to std::swap() but doesn’t offer the atomicity of std::atomic_exchange()

jvm spawns(approx)32 JGC threads on 32-core

Based on https://jaxenter.com/nobody-puts-java-container-139373.html

jvm would spawn 32 GC threads on a 32-core box [1].  As of 2018, this is the default, but you can change it with jvm parameters like -XX:ParallelGCThreads and -XX:ConcGCThreads

Java 9 introduced automatic detection of cpu-set when jvm runs in a cgroup (such as a docker container) with a cpu-set. For example, JVM detects there are 3 cores in the cpu-set, and spawns 3 GC threads.

[1] presumably for the parallel GC or the CMS GC but i’m not sure. https://docs.oracle.com/javase/8/docs/technotes/guides/vm/gctuning/parallel.html and https://blog.codecentric.de/en/2013/01/useful-jvm-flags-part-6-throughput-collector/ says for the parallelGC the default would be 5/8*32 + 3 = 23 threads.

wasm: a distant threat to javascript

https://medium.com/javascript-scene/what-is-webassembly-the-dawn-of-a-new-era-61256ec5a8f6 is a good WebAssembly intro.

  • wasm is an in-browser language, like javascript
  • wasm offers low-level (web-ASSEMBLY) programming constructs to complement javascript
  • I feel wasm will only be considered by extreme-performance browser apps. It’s too low-level, too inconvenient to be popular.
  • How many percent market share will javascript lose to wasm? 0.1% to 0.5%
  • The fact that wasm is cited as the main challenger to javascript means javascript is unchallenged as the dominant choice on the browser

scaffolding around try{}block #noexcept

[[ARM]] P358 says that all local non-static objects on the current call stack fully constructed since start of the try-block are “registered” for stack unwinding. The registration is fine-grained in terms of partial destruction —

  • for any array with 3 out of 9 objects fully constructed, the stack unwinding would only destruct those 3
  • for a half constructed composite object with sub-objects, all constructed sub-objects will be destructed
  • Any half-constructed object is not registered since the dtor would be unsafe.

I guess this registration is an overhead at run time.
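A hypothetical illustration of the fine-grained partial destruction: when the composite ctor throws, the already-constructed sub-objects are destructed, but the half-constructed composite’s own dtor never runs.

#include <cstdio>
#include <stdexcept>

struct Part {
    const char* name;
    explicit Part(const char* n) : name(n) { std::printf("constructed %s\n", name); }
    ~Part(){ std::printf("destructed %s\n", name); }
};

struct Composite {
    Part a{"a"};
    Part b{"b"};
    Composite(){ throw std::runtime_error("ctor failed"); }   // a and b are already built here
    ~Composite(){ std::puts("Composite dtor (never runs)"); }
};

int main(){
    try { Composite c; }
    catch (const std::exception&){ std::puts("caught; b then a were destructed, Composite dtor skipped"); }
}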

For the stack objects created in a noexcept function, this “registration” is not required, so compiler may or may not call their destructors.

— in http://www.stroustrup.com/C++11FAQ.html#noexcept Stroustrup hints at the  scaffolding

  • noexcept is an efficiency feature — widely and systematically used in the standard library to improve performance
  • noexcept is crude and “very efficient”
  • dtor may not be invoked upon stack unwinding
  • stack unwinding may not happen at all

 

 

friend to a class template

Similar to java reflection, void pointers, reinterpret_cast etc, friend is a powerful keyword in real projects, with GTD power. The rules are a bit tricky when you combine friend + template.

I think this is a relatively useful IV knowledge pearl as it demonstrates a fundamental concept — each template instantiation (“concretized template class”) is treated as a distinct class.

[[ARM]] P351 stipulates that

* A simple non-template friend introduced into a class template is a friend to ALL instances of the class template.

That’s the simple case.

* Now another common scenario — the class template C2 has a dummy type T, and the friend is another template F3 parameterized with the same type T. Now there’s a one-to-one friendship:

  • F3<float> is a trusted friend to C2<float>
  • F3<Acct> is a trusted friend to C2<Acct>
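A sketch of the one-to-one friendship (the C2/F3 names follow the text above; everything else is illustrative):

#include <iostream>

template<typename T> class C2;                 // forward declaration

template<typename T>
class F3 {
public:
    void peek(const C2<T>& c);
};

template<typename T>
class C2 {
    T secret;
    friend class F3<T>;                        // only F3<T>, not F3<U>, is a friend
public:
    explicit C2(T s) : secret(s) {}
};

template<typename T>
void F3<T>::peek(const C2<T>& c){ std::cout << c.secret << '\n'; }

int main(){
    C2<int> c(42);
    F3<int>{}.peek(c);                         // OK: F3<int> is a friend of C2<int>
}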

 

 

starting thread in ctor #Lea

Q: is it good practice to call new Thread(..).start() in MyClass ctor? This is common practice in my projects. I feel there are always alternatives.

I feel 30% confident this blog has another blogpost on this subject, to be combined.

[[DougLea]] points out the potential danger of starting thread in your ctor, esp. if subclasses are beyond your control.

The visibility effect is equivalent to parent thread releasing an implicit lock, and acquisition by run() on new thread.

Frequently, the constructed MyClass instance is used by the new thread, but is MyClass fully constructed? It is if the new Thread(..).start() is last line in ctor, but what if this ctor runs as part of a subclass ctor?

POSIX semaphores do grow beyond initial value #CSY

My friend Shanyou pointed out a key difference between c++ binary semaphore and a simple mutex — namely ownership. Posix Semaphore (binary or otherwise) has no ownership, because non-owner threads can create permits from thin air, then use them to enter the protected critical section. Analogy — NFL fans creating free tickets to enter the stadium.

https://stackoverflow.com/questions/7478684/how-to-initialise-a-binary-semaphore-in-c is one of the better online discussions.

  • ! to my surprise, if you initialize a posix semaphore to 1 permit, it can be incremented to 2. So it is not a strict binary semaphore.

https://github.com/tiger40490/repo1/blob/cpp1/cpp/thr/binarySem.cpp is my experiment using linux POSIX semaphore. Not sure about system-V semaphore. I now think a posix semaphore is always a multi-value counting semaphore with the current value indicating current permits available.

  • ! Initial value is NOT “max permits” but rather “initial permits”

I feel the restriction by a semaphore is just advisory, like advisory file locking. If a program is written to subvert the restriction, it’s easy to do so. Therefore, a “binary semaphore” is binary only if all threads follow protocol.
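A minimal POSIX sketch (Linux, unnamed semaphore) demonstrating that the initial value is not a cap:

#include <cstdio>
#include <semaphore.h>

int main(){
    sem_t sem;
    sem_init(&sem, /*pshared=*/0, /*value=*/1);        // "binary" semaphore with 1 permit
    sem_post(&sem);                                     // no prior sem_wait, yet this succeeds
    sem_post(&sem);
    int val = 0;
    sem_getvalue(&sem, &val);
    std::printf("semaphore value is now %d\n", val);    // typically prints 3
    sem_destroy(&sem);
}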

https://stackoverflow.com/questions/62814/difference-between-binary-semaphore-and-mutex claims “Mutex can be released only by thread that had acquired it, while you can signal semaphore from any non-owner thread (or process),” This does NOT mean a non-owner thread can release a toilet key owned by another thread/process. It means the non-owner thread can mistakenly create a 2nd key, in breach of exclusive, serialized access to the toilet. https://blog.feabhas.com/2009/09/mutex-vs-semaphores-–-part-1-semaphores/ is explicit saying that a thread can release the single toilet key without ever acquiring it.

–java

[[DougLea]] P220 confirms that in some thread libraries such as pthreads, release() can increment a 0/1 binary semaphore value to beyond 1, destroying the mutual exclusion control.

However, java binary semaphore is a mutex because releasing a semaphore before acquiring has no effect (no “spare key”) but doesn’t throw error.

hardware mutex, based@XCHG instruction

[[OS by William Stallings]] P214 is concise and clear.

The atomic XCHG instruction on Intel processors can swap a local variable (held in cpu register) with a main memory location visible to all threads.

This XCHG instruction alone can implement a simple mutex usable by multiple threads on multiple processors sharing the same main memory.

Another atomic instruction, test-and-set, can also by itself implement a mutex.
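A minimal spin-lock sketch built on an atomic exchange/test-and-set (std::atomic_flag typically compiles down to such an instruction on x86). It shares the limitations listed below.

#include <atomic>

class SpinLock {
    std::atomic_flag flag_ = ATOMIC_FLAG_INIT;
public:
    void lock()   { while (flag_.test_and_set(std::memory_order_acquire)) { /* busy-wait */ } }
    void unlock() { flag_.clear(std::memory_order_release); }
};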

Both mutexes have only academic value because their Limitations are severe:

  • even with a single mutex, deadlock is possible at the hardware level — ThrA in critical section can get interrupted and preempted — perfectly normal. Now ThrA has to wait for high-priority ThrH to complete, but ThrH is busy-waiting (wheel-spinning) to enter the same critical section 😦
    • Note in this example there’s no “lock object” per-se, only a critical section in code !
  • by default, unsuccessful threads like ThrH are stuck in busy-wait — wasting CPU and generating heat 😉

 

STL containers implemented +! pointer

  1. std::array
  2. std::string SSO i.e. small-string optimization

In these containers

  1. heap storage is not required
    • trivial special case — if this container is a nonref field of a Shape object, then q(new Shape) would still allocate the container on heap.
    • This is similar to “Are java primitives on heap/stack?”
  2. Therefore contains no pointer.
  3. Therefore move-ctor runs in linear [1] time compared to constant-time move-ctor of other STL containers
    • std::array probably offers a move-ctor and move-assignment but I don’t have time to research

[1] P205 [[effModernC++]] also addresses —

Q: What happens when you move a std::array container with 77 Acct elements?
A: 77 individual Acct objects are moved to the new container, in linear time, invoking the move-ctor of Acct class

prevent c++compiler reordering statements

http://www.cplusplus.com/reference/atomic/atomic_signal_fence/ says this low-level c++11 function is something like a

directive to the compiler inhibiting it from making optimizations that involve moving writing operations beyond a releasing fence or read operations before an acquire fence.

I doubt there’s any need for this in any application.

handle java OOM @runtime

https://stackoverflow.com/questions/2679330/catching-java-lang-outofmemoryerror says "only very infrequently is OutOfMemoryError the death-knell to a JVM. There is only one good reason to catch an OutOfMemoryError and that is to close down gracefully, cleanly releasing resources and logging the reason for the failure best you can (if it is still possible to do so)."

Note in a Docker container, OOM leads to an immediate crash of not only the JVM but the entire container.

eg@ JGC-JNI interaction: String

First, let’s remember Java string objects are collected by GC. In fact, String.java instances are often the biggest group of “unreachable” objects in a GC iteration.

I guess interned string objects are treated differently … need more research just for QQ interviews.

The GC thread has built-in safety checks to cooperate with application threads, so that a string still in use is not collected.

However, JNI code is not 100% compatible with GC. Example from [[Core java]]:

Suppose a JNI C function instantiates a jstring object in the jvm heap (not in native memory), to be deallocated by GC.

Note the GC doesn’t know when this object is unreachable. To inform the GC, the JNI function needs to call ReleaseStringUTFChars().

identityHashCode,minimum object size,relocation by JGC

https://srvaroa.github.io/jvm/java/openjdk/biased-locking/2017/01/30/hashCode.html offers a few “halo” knowledge pearls

  • every single java Object must always give an idHashcode on-demand, even if its host class has hashCode() method overridden to return a hard-coded 55.
    • hashcode() doesn’t overshadow idHashcode
  • The “contract” says an object’s idHashcode number must never [2] change, in the face of object relocations. So it’s not really computed based on address. Once someone requests the idHashCode number (like 4049040490), this number must be retained somewhere in object, as per the “contract”. It is retained in the 12-byte object header. (8-byte for a 32-bit JVM)
    • Therefore, the idHashcode contributes to the minimum size of java objects.
  • contrary to common belief, the idHashcode can clash between two objects, so idHashcode is a misnomer — it is more “hashcode” than “identity”. https://bugs.java.com/bugdatabase/view_bug.do?bug_id=6321873 explains there are insufficient integer values given the maximum object count.
  • Note anyone can call the hashcode() method on this same object and it could be overridden to bypass the idHashcode.
  • [2] in contrast a custom hashcode() can change its value when object state changes.

## linux kernel concurrency primitives #to elaborate

I feel the concurrency primitives in kernel are conceptually similar to those in thread libraries but fundamentally different in implementation. This topic is deep and time consuming so I will only memorize some jargons for show-off

I should read the OReilly Linux book + Josh’s book

  • spinlock — addressed in both books
  • read-write spinlock
  • binary and counting semaphores
  • read-write semaphores
  • RCU?

 

master The yield-generator

https://github.com/tiger40490/repo1/tree/py1/py/algo_combo_perm uses python q[ yield ] to implement classic generators of

  • combinations
  • permutations
  • abbreviations
  • redraw

Key features and reasons why we should try to memorize it

  1. very brief code, not too dense (cryptic), .. helps us remember.
  2. reliability — brief means fewer hiding places for a bug. I added automated tests for basic quality assurance. Published solution
  3. versatile — one single function to support multiple classic generators.
  4. yield functions’ suspend-resume feature is particularly powerful and possibly a power drill. This is my first opportunity to learn it.
  5. instrumentation — relatively easy to add prints to see what’s going on.
  6. halo — yield-based generator functions are a halo and also a time-saver
  7. elegant — brief, simple, relatively easy to understand
  8. recursive — calling itself recursively in a loop therefore fairly brief but powerful
  9. useful in take-home assignments
  10. identity-aware — The second time you call myYieldFunc(44), it would usually return the same generator object as the first time you call it with that same argument (i.e. 44). If that generator object has reached the end of its execution, then you won’t get any more output.

— How I might try to memorize it

  1. If needed, we just need to reproduce it quickly from memory a few times
  2. I added comments to help me understand and memorize it

[17] semaphore^mutex

See also CSY blogpost on binary semaphore vs mutex..

I usually trust official documentation by Microsoft, Oracle, Linux .. more than online forums, but I did find some worthwhile reading in
https://stackoverflow.com/questions/62814/difference-between-binary-semaphore-and-mutex . Multiple authors pointed out the notification feature of semaphore. I agree.

There are at least 4 different semaphore APIs
* java has Semaphore class
* dotnet has a Semaphore class based on Windows semaphore
* system-V semaphore API
* POSIX semaphore API

The windows (including dotnet) semaphore API is somewhat different from the rest. The notification feature is built into the windows semaphore, but I don’t think the other 3 API documentations have any mention of notification or signaling features. I believe ownership is NOT the key point. By “ownership” the (Windows) semaphore discussions basically mean “notification”. Non-owners of a semaphore can announce intention-to-acquire.

prefer ::at()over operator[]read`containers#UB

::at() throws exception … consistently 🙂

  • For (ordered or unordered) maps, I would prefer ::at() for reading, since operator[] silently inserts for lookup miss.
  • For vector, I would always favor vector::at() since operator[] has undefined behavior when index is beyond the end.
    1. worst outcome is getting trash without warning. I remember getting trash from an invalid STL iterator.
    2. better is consistent seg fault
    3. best is exception, since I can catch it
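A small sketch of the difference:

#include <cstdio>
#include <map>
#include <stdexcept>
#include <string>
#include <vector>

int main(){
    std::map<std::string, int> m{{"a", 1}};
    // m["b"] would silently insert {"b", 0} during a supposedly read-only lookup
    try { int x = m.at("b"); (void)x; }
    catch (const std::out_of_range&){ std::puts("map::at threw; nothing was inserted"); }

    std::vector<int> v{1, 2, 3};
    // v[99] is undefined behavior; v.at(99) throws consistently
    try { v.at(99); }
    catch (const std::out_of_range&){ std::puts("vector::at threw"); }
}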

 

## placement new: pearls to impress IV

container{string}: j^c++

In java, any container (of string or int or anything) holds pointers only.

I think c# collections (i.e. containers) contain pointers if T is a reference type.

In cpp,

  • container of int always contains nonref, unlike java
  • container of container contains ptr, just like in java
  • but container of string is widely used, and invariably contains nonref std::string !

Q: is there any justification for container<(smart) ptr to string>? I found rather few online discussions.
A: See boost::ptr_container

Q: what if the strings are very large?
A: many std::string implementations use COW to speed up copying and assignment; however, the string copy ctor is O(lengthOfString) per the standard! So in a standard-compliant implementation copy and assignment would be expensive, so I believe we must use container<(smart) ptr to string>

 

##5 understandable-yet-useful type traits for TMP

type_traits.h defines too many constructs but only a few are easy to understand and unambiguous. I think these can be halos in open-ended interviews.

However, these are clearly “superstructure” knowledge pearls. We first need to “beat them (the interviewers) at their own game”, otherwise they would feel “This candidate has some good knowledge but not strong on the fundamentals.”  — recall Deepak M.

  1. enable_if
  2. is_same
  3. conditional — https://en.cppreference.com/w/cpp/types/conditional. Not necessarily widely used but my favorite
  4. is_base_of — https://en.cppreference.com/w/cpp/types/is_base_of
  5. is_polymorphic — https://en.cppreference.com/w/cpp/types/is_polymorphic about virtual function
  6. — not so understandable —
  7. is_pointer — ambiguous. I think it only knows about regular ptr and function ptr
  8. is_class and is_scalar — possibly useful to check T is a class vs a “simple” type
  9. — understandable but not so valuable —
  10. is_abstract — having pure virtual function
  11. — neither very understandable nor highly valuable —

I guess all of these type traits are class templates. The result is usually accessed via the (boolean) ::value static member, though enable_if_t evaluates to a type.
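Tiny illustrations of the “understandable” traits above, accessing the results via ::value and ::type (C++11 assumed):

#include <iostream>
#include <type_traits>

static_assert(std::is_same<int, std::conditional<true, int, double>::type>::value, "conditional picks int");
static_assert(std::is_base_of<std::ios_base, std::ostream>::value, "ostream derives from ios_base");

// enable_if as a crude type constraint: only integral T compiles here
template<typename T>
typename std::enable_if<std::is_integral<T>::value, T>::type
twice(T x){ return x * 2; }

int main(){
    std::cout << twice(21) << '\n';   // OK
    // twice(3.14);                   // would not compile: SFINAE removes the overload
}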

## template tricks for type constraint

  1. –Here are the compile-time validation/checks offered by c++:
  2. std::is_pointer and family — each function checks one type of pointer. Very specific and clean
  3. q[ = delete ] — to eliminate specific concrete type args. Laser removal .. extremely clean. Scott Meyers pointed out this usage of q[=delete]. Does this work with SFINAE? I didn’t find anything online. See the sketch after this list.
  4. std::enable_if() — see separate blogpost. Designed more for SFINAE than for type constraints
  5. sfinae — probably the most versatile, flexible and powerful
  6. static_assert? Can be combined with other techniques to implement compile-time validation and constraints. See https://github.com/tiger40490/repo1/blob/cpp1/cpp/template/SFINAE_ptrCheck.cpp
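A small sketch of the q[ = delete ] trick mentioned in item 3 (the process() function is hypothetical): the double overload is deleted, so an unwanted implicit conversion fails at compile time.

#include <iostream>

void process(int x){ std::cout << "int: " << x << '\n'; }
void process(double) = delete;          // laser-remove the double overload

int main(){
    process(42);      // OK
    // process(3.14); // error: use of deleted function
}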

covariant return type: c++98→java

java “override” rules permit covariant return — an overriding function may return a type D that is a subtype of B, the original return type of the overridden method.

— ditto c++

https://en.wikibooks.org/wiki/More_C%2B%2B_Idioms/Covariant_Return_Types. Most common usage is a “clone” method that

  • returns ptr to Derived, in the Derived class
  • returns ptr to Base in the Base class

Covariant return types work with multiple inheritance and with protected and private inheritance — these simply affect the access levels of the relevant functions.

I was wrong to say virtual mechanism requires exact match on return type.

CRT was added in c++98. ARM P211 (c 1990) explains why CRT was considered problematic in the Multiple Inheritance context.
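The classic clone() sketch:

struct Base {
    virtual ~Base() = default;
    virtual Base* clone() const { return new Base(*this); }
};

struct Derived : Base {
    // still an override, even though the return type is Derived* rather than Base*
    Derived* clone() const override { return new Derived(*this); }
};

int main(){
    Derived d;
    Derived* copy = d.clone();   // no downcast needed, thanks to the covariant return
    delete copy;
}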

python Protocols #phrasebook

  • interface — similar to java Interface
  • unenforced — unlike java Interface, python compiler doesn’t enforce anything about protocols
  • eg: ContextManager protocol defines __enter__() and __exit__() methods
  • eg: Sequence protocol defines __len__() and __getitem__() methods
  • partial — you can partially implement the required functions of a protocol
    • eg: your class can implement just the __getitem__() and still works as a Sequence

concurrent python #my take

I’m biased against multi-threading, biased towards multiprocessing because …

  1. threading is for high-performance, but java/c++ leaves python in the dust
  2. GIL in CPython, which is the default download version of python. The standard doc says “If you want your application to make better use of the computational resources of multi-core machines, you are advised to use multiprocessing”.

(For my academic curiosity ….) Python thread library offers common bells and whistles:

  • join()
  • timers
  • condVar
  • lock
  • counting semaphore
  • barrier
  • concurrent queue
  • isDaemon()
  • Futures? python3.2 😦

 

##all c++ explicit casts #std::move included

  • — implicit casts
  • cvctor — is the most important implicit cast
  • conversion operator member-functions
  • numerical type conversions specified by C standard

=== now the explicit casts

  • — C:
  • parentheses cast
  • — C++03 added four casts
  • dynamic_cast — usable on references and pointers.
  • const_cast
  • reinterpret_cast — usually on pointers
  • static_cast
  • — c++11
  • std::move() — only for references, not pointers
  • std::forward()
  • numerous type conversions in type_traits.h
  • — boost:
  • boost::polymorphic_cast? is similar to dynamic_cast but much less popular and not essential knowledge expected of a c++ veteran

 

## G5 move-only types #std::atomic

[1] P106 [[effModernC++]]

Docker(+java9)cpu affinity #since2006

https://jaxenter.com/nobody-puts-java-container-139373.html is a 2018 article with some concrete examples demonstrating cpu isolation.

A Docker cgroup can specify a cpu-set (like core0 + core3 + core14) and limit itself to this cpu-set. Performance motivation — preventing a process from hopping between cores.

The “cpu-set” scheme provides conceptually simpler cpu isolation, but less popular than the “cpu-share” scheme.

Java9 offers support for cpu isolation if you adopt the cpu-set scheme but not the cpu-share scheme, as explained succinctly in the article.

A historical note — In 2006 (Mansion/Strategem) I spoke to a Sun Microsystems consultant. An individual Solaris “zone” can specify which cpu core to use. This was my first encounter with CPU isolation/affinity.

rvr^rvalueObject #rvr=compiler concept

  • ALL objects by definition exist in runtime memory but references may not.
  • std::move() is about rvr variables not rvalue objects!
    • You can even use move() on a naturally-occurring temp, though std::move is supposed to be used on regular lval objects
  • I now believe rvr is a compiler concept. Underlying is just a temporary’s address.
    • Further, the traditional lvr is probably a compiler concept too. Underlying is a regular pointer variable.

rvalue object is an object that can ONLY appear on the RHS of assignment.

An rvalue object can be naturally occurring (usually anonymous), or “converted from a named object” by std::move(), but if some variable still references the object, then the object is actually not a proper rvalue object.

The SCB architect pointed out that “some variable” can be a const ref bound to a naturally occurring rvalue! To my surprise the c++ syntax rule says the object is still a temp object i.e. rvalue object, so you can’t assign it to an lvr!

::value^::type #typedef

This TMP technique is the simplest but not for the uninitiated.

In type_traits.h, many templates expose a ::value or ::type construct. Often these two “members” are the only visible output from the type trait meta-functions.

  • ::value is a static field, typically true/false
  • ::type is a member typedef, typically related to the type argument T, but can be “void” like in enable_if

 

JIT has opportunities to avoid vtable latency #Martin

P76 [[javaPerf]] described a nifty JIT technique to avoid runtime cost of the dynamic binding of virtual function equals(). Suppose in some class, we call obj1.equals(obj2).

After a priming (i.e. warm-up) period, JIT collects enough statistics to see that every dynamic dispatch at this site is calling String.equals(), so JIT decides to turn it into faster “static binding” so the String.equals() function address is hardwired into the assembly code (not JVM bytecode). JIT also needs to handle the possibility of Character.equals(). I guess the assembly code can detect that obj1/obj2 is not a String.java instance and retry the virtual function lookup. JIT can generate assembly code to
1. verify obj is a String and call an inlined String.equals()
2. if obj1 is not String, then use obj1 vtable to look up the virtual function obj1.equals()

It may turn out that 99.9% of the time we can skip the time-consuming Step 2 🙂

Martin gave a (hypothetical?) example. Suppose JIT notices that obj1 is always either a String or Character. JIT could inline both equals() functions and completely bypass vtable. (Branching can be done via if/else or …) This inline compilation is done after a fairly long period of instrumentation. I asked Martin why c++ can’t do it. He said c++ only uses static compilation. I feel c++ compilers don’t bother with this technique as it is not a proven performance win.

IP_MC^application_layer_MC

I think there are many application-layer multicast protocols. http://pages.cs.wisc.edu/~suman/pubs/sigcomm02.pdf says

Unlike native multicast (i.e. IP-multicast) where data packets are replicated at routers inside the network, in application-layer multicast data packets are replicated at end hosts.

When we hear “multicast” we should not assume it means IP-multicast using UDP! “Multicast” is not a specific protocol like UDP, but a generic, loosely defined term.

— RGDD

https://pdfs.semanticscholar.org/87f8/41def2a4e5740a13589d4b60d7a2406da866.pdf says “many current systems support RGDD at application layer using unicast TCP”.

default malloc is slow, can be improved

See https://stackoverflow.com/questions/161053/which-is-faster-stack-allocation-or-heap-allocation

I see various evidence that industry practitioners consider the default allocator too slow.

I don’t think system calls are the issue. System calls are very infrequent with malloc.

OrderedDict == LinkedHashMap

collections.OrderedDict can be useful in some coding interviews. As Rahul pointed out, interviewers are likely interested in your skill implementing it.

  • similar preference — http endpoint skill is less valued than socket programming skill
  • Counterexample — interviewers are seldom interested in how you implement a priority queue.

In all cases, some knowledge of actual implementation is a small halo, as Venkat shows

–impl:

doubly-linked circular list, as explained on https://stackoverflow.com/questions/33748340/how-does-pythons-ordereddict-remember-elements-inserted, and in the source code link

http://code.activestate.com/recipes/496761-a-more-clean-implementation-for-ordered-dictionary/ shows a non-standard implementation, but del is not O(1)

std::sort() beating ANSI-C qsort() #inline

Stroustrup was the first one to tell me c++ std::sort() can beat C qsort() easily.

https://travisdowns.github.io/blog/2019/05/22/sorting.html says:

Since the qsort() code is compiled ahead of time and is found inside the shared libc binary, there is no chance that the comparator function, passed as a function pointer, can be inlined.

https://martin-ueding.de/articles/qsort-vs-std-sort/index.html says

For qsort(), since the function is passed as a pointer, and the elements are passed as void pointers as well, it means that each comparison costs three indirections and a function call.

In C++, the std::sort is a template algorithm, so that it can be compiled once for each type. The operator< of the type is usually baked into the sort algorithm as well (inlining), reducing the cost of the comparison significantly.
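A side-by-side sketch of why inlining is possible for std::sort but not for qsort():

#include <algorithm>
#include <cstdlib>
#include <vector>

// qsort calls this through a function pointer, on void* arguments:
// two indirections plus an opaque call per comparison.
static int cmpInt(const void* a, const void* b){
    int x = *(const int*)a, y = *(const int*)b;
    return (x > y) - (x < y);
}

int main(){
    std::vector<int> v1{5, 3, 1, 4, 2}, v2 = v1;

    std::qsort(v1.data(), v1.size(), sizeof(int), cmpInt);

    std::sort(v2.begin(), v2.end(),
              [](int a, int b){ return a < b; });   // comparison typically inlined into the instantiation
}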

q[struct]^q[class] in c++

–difference: by default,

  • deriving from a q[struct] is public derivation
  • deriving from a q[class] is private derivation
  • P 243 [[ARM]] explains why.
  • Best to specify explicitly private or public

–difference: members (including static) default to public in a struct

–difference: keyword q[class] is usable in template<class X>

–difference: q[struct] is a keyword in C and C has a few other syntax rules about q[struct]
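A small illustration of the default access and default inheritance:

struct S { int x; };      // x is public by default
class  C { int x; };      // x is private by default

struct DS : S {};         // public inheritance by default
class  DC : C {};         // private inheritance by default

int main(){
    DS ds; ds.x = 1;      // OK: public member via a public base
    // DC dc; dc.x = 1;   // error: x is private in C (and C is a private base of DC)
    return ds.x;
}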

kernel bypass : possible usage ] RTS

Partially hypothetical usage scenario/proposal.

“Bypass” means .. bypassing standard kernel functions and using faster, lighter firmware instead.

“Bypass” means .. every network packet would go straight from NIC to user application, without passing through tcp/ip stack in the kernel.

“Bypass” probably means bypassing the socket buffer in kernel

Background — Traditional packet processing goes through the tcp/ip software stack, implemented as a family of kernel functions. Whenever a network packet is received, the NIC writes the packet to a circular array and raises a hardware interrupt. The i-handler (interrupt handler routine) and bottom-half will then perform packet processing using the kernel socket buffer, and finally copy the packet to a UserModeBuffer.

Note the two separate buffers. In the RTS parser config file, we configure them as sock_recv_buf vs read_buf for every channel, regardless of TCP or multicast. The socket buffer is accessible by the kernel only (probably unused when we turn on kernel bypass), as the kernel can’t expose a fast-changing memory location to a slow userland thread. I believe a userland thread uses read() or similar functions to drain the socket buffer, so that the kernel can refill it. See [[linux kernel]] and https://eklitzke.org/how-tcp-sockets-work

With kernel bypass,

  • the Network card (NIC) has an FPGA chip, which contains the low-level packet processing software (actually firmware “burned” into the fpga)
  • This firmware replaces the tcp/ip kernel functions and delivers the packets directly and automatically to the application UserModeBuffer. However, my parser relies more on another feature —
  • The SolarFlare firmware also lets my parser (user application) read the NIC circular array directly. This zero-copy technique could bypass not only the socket buffer but also the UserModeBuffer.

My parser uses SolarFlare NIC for both multicast and tcp.

The kernel bypass API was only used in some low-level modules of the framework; it is disabled by default and configurable for each connection defined in the configuration file.

http://jijithchandran.blogspot.com/2014/05/solarflare-multicast-and-kernel-bypass.html is relevant.

compiler loop-unrolling could hurt i-cache

[[art of unix programming]] P290/291 advocates (presumably based on experience) to bend over backward and ensure the central data structure + time-critical loop NEVER fall out of i-cache/d-cache.

I think this is a useful sound bite to impress interviewers.

Developers need to know the size of L1 cache, L2 cache, L3 cache in our hardware. Then target the central data structure to one of them. Say L2 cache. We want to size our data structure to fit in there and never break the boundary.

“Small is beautiful” applies to both code and data.

Example 1: cpu vs i-cache, IMHO more convincing than
Example 2: cpu vs d-cache

In both examples, if cpu becomes fast enough then we ought to make it do more work in order to economize the caches.

— Example 1
Loop-unrolling enlarges text-section of the binary, so the additional “text” must compete for scarce instruction-cache.

[[art of unix programming]] claims that at least in some (not theoretical) cases, this compiler optimization may backfire, i.e. hurt rather than improve performance.

The critical situation – suppose the loop runs in one of the extremely hot functions that need to remain in the instruction-cache, then we want to minimize code footprint.

Loading this hot function over and over from main memory can be worse than executing the original loop (without unrolling). This happens when cpu execution speed improves over the years but main memory access speed remains a bottleneck.

— Example 2
A second example “claimed” in the same paragraph, revolves around pre-computed look-up table for frequent access. This kind of optimization strategy can shave cpu cycles, but can increase the fierce competition for scarce L1 cache.

In other words, pre-computing a data table was designed to save cpu time but could backfire if the increased data footprint leads to data-cache misses.

We are again talking about extremely hot data that needs to remain in L1 cache. If this pre-computed data is accessed frequently, it can hog the L1 cache.

In such a case, removing the pre-computation, and re-computing each time, could be faster.

I think this problem can become real –
• If cpu is becoming much faster in recent years, or the computation gets simplified, then the cpu saving would diminish.
• At the same time, if the pre-computed data size grows larger, then the “hogging” would worsen.

std::reference_wrapper usages

std::reference_wrapper is a class template that wraps a reference in a copyable, assignable object. Usages:

  • std::reference_wrapper is primarily used to store references inside standard containers which cannot hold references. In my projects, no one ever stores references in containers. Pointless.
  • std::reference_wrapper is also used to pass objects by reference to the constructor of std::thread, or the helper functions std::make_pair() and std::make_tuple()

Helper functions std::ref and std::cref() are often used to generate std::reference_wrapper objects.

See https://en.cppreference.com/w/cpp/utility/functional/reference_wrapper

–details on the std::thread usage

https://github.com/tiger40490/repo1/blob/cpp1/cpp/thr/takeTurn.cpp shows that without std::ref(), the object passed to std::thread ctor is a clone, so the original object is not updated by the thread routine.

My https://github.com/tiger40490/repo1/blob/cpp1/cpp/thr/functorToThrCtor.cpp uses std::ref() too. The same code also shows another adapter wrapper for a function.

https://en.cppreference.com/w/cpp/thread/thread/thread says std::ref() is likely needed.

https://thispointer.com/c11-multithreading-part-3-carefully-pass-arguments-to-threads/ has detailed sample code
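A minimal sketch combining both usages (passing by reference into std::thread, and storing “references” in a container):

#include <cassert>
#include <functional>
#include <thread>
#include <vector>

void increment(int& x){ ++x; }

int main(){
    int counter = 0;
    // without std::ref, the thread ctor would copy `counter` and the original would stay 0
    std::thread t(increment, std::ref(counter));
    t.join();
    assert(counter == 1);

    int a = 1, b = 2;
    std::vector<std::reference_wrapper<int>> refs{a, b};   // vector<int&> would not compile
    refs[0].get() = 99;
    assert(a == 99);
}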

Linux(!!Unix)kernel thread can! run user program

–“Linux kernel thread cannot run user programs”, as explained in [[UnderstandingLinuxKernel]] P3.

Removing ‘kernel‘… Linux userland threads (based on LWP, explained below) do run user programs.

Removing ‘thread‘… Linux kernel processes or interrupt routines could possibly run under a user process pid.

Removing ‘Linux‘ … Other Unix variants do use kernel threads to run both 1) user programs and 2) kernel routines. This is my understanding from reading about JVM…

  • unix  — multi-threaded user apps =use=> LWP =use=> kernel threads
  • linux — multi-threaded user apps =use=> LWP. LWP is the basic unit of concurrency in user apps. Linux LWP is unrelated to Unix LWP.

TCP sliding-window: congestion^receiver

Based on P248/265 [[computerNetworking]], a 2011 paper@high-perf messaging, and my blogpost on AWS^SWS

Sliding window is a protocol or a design framework. AdvertisedWindowSize (AWS) is an integer field in the Ack message. There are also two variables “ReceiveWindow/CongestionWindow” in the TCP sender codebase (part of the kernel).

Sliding window kills two birds with one stone — 1) congestion control and 2) overflow protection on the receiving end. Without these controls, the sender can pump packets too fast, causing

  • receiver socket buffer overflows
  • congestion on the sender-to-receiver path so some of the router buffers overflow

The basic idea is to limit the sending rate such that

  • unacknowledged bytes := lastByteSent – lastByteAcked < ReceiveWindow
  • unacknowledged bytes := lastByteSent – lastByteAcked < CongestionWindow

UDP doesn’t limit sending rate and is favored by many aggressive applications such as market data dissemination.
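
A tiny sketch (hypothetical names, not real kernel code) of the sender-side rule — the two bullets above collapse into a min() of the two windows:

#include <algorithm>
#include <cstdint>

struct TcpSenderState {
    uint64_t lastByteSent = 0;
    uint64_t lastByteAcked = 0;
    uint64_t receiveWindow = 65536;    // from AdvertisedWindowSize in the latest Ack
    uint64_t congestionWindow = 14600; // maintained by the congestion-control algorithm
};

// bytes the sender may still push without violating either constraint
inline uint64_t sendableBytes(const TcpSenderState& s) {
    uint64_t inFlight = s.lastByteSent - s.lastByteAcked;       // unacknowledged bytes
    uint64_t window   = std::min(s.receiveWindow, s.congestionWindow);
    return inFlight >= window ? 0 : window - inFlight;
}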

thread^process: lesser-known differences #IV

Popular IV question. Largely a QQ question.  Some may consider it zbs.

To the kernel, there are many similarities between the “thread” construct and the “process” construct. In fact, a (non-kernel) thread is often referred to as a LightWeightProcess in many kernels such as Solaris and Linux.

  • context switching — is faster between threads than between processes. In linux, context switching between kernel-threads is even faster.
  • creation — some thread libraries can create threads without the kernel knowing. No such thing for a process.
  • socket — 2 threads in a process can access the same socket; two processes usually can’t access the same socket, unless … parent-child. See post on fork()
  • memory — thread AA can access all heap objects, and even Thread BB’s stack objects via pointers. Two processes can’t share these, except via shared memory.
  • a non-kernel thread can never exist without an owner process. In contrast, every process always has a parent process which could be long gone.

 

string pool IV notes

String str = new String("Cat"); //[1] inefficiency

Based on https://www.quora.com/String-s-new-string-ABC-How-many-objects-are-created-after-the-above-statement — the net effect of the above statement is either 1 or 2 instantiations.

> At class-loading time, JVM picks up all string literals in the class file and populates the string pool. Iff this “Cat” is the first occurrence then it would be instantiated in the string pool.

Net effect so far is 0 or 1 instantiation.

> At runtime, the new() forces JVM to instantiate a brand new object, in the regular heap, unrelated to the string pool.

Net effect now is 1 or 2 instantiations.

[1] Note this is (almost) never needed according to P20 [[effJava]]

CVA c++ IV 2 #oq

void func(ResourceMgr & rm){ 
  int connId = rm.getConn();
  double d=externalFunc(connId); 
  cout<<d<<endl;
  rm.reclaim(connId); //release the resource to the mgr
}
  • Q5: In the code above, externalFunc can throw, so write a RAII wrapper to prevent a resource leak. (See the sketch after this list.)
  • Q5b: what if you aren’t allowed to use RAII? Ok you said catch-all. Is an empty catch block enough?
    AA: need to re-throw the original exception, to mimic the RAII behavior.
  • Q(paper coding): Iterate over two vectors in lock step. Which is faster — iterator vs vector index?
  • Q (bond math): Is there any uncertainty in the present value calc on a pre-existing vanilla IRS?
  • q: what design patterns do you use?
  • q: write a singleton skeleton
  • Q: how do we make a class “final” in pre-c++11
    %%A: either make dtor private or make all constructors private
  • Q: is shared_ptr thread safe?
  • Q: any difference — const shared_ptr<T> vs shared_ptr<const T>?
  • %%Q: does it make sense to pass a shared_ptr by const reference? I feel it’s cleaner to pass by raw ptr
    AA!: pass by const-reference is faster and recommended
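
For Q5 above, here is a minimal RAII sketch. ResourceMgr/externalFunc are stand-ins for the interfaces in the question; the stubs exist only to make the snippet compile.

#include <iostream>

struct ResourceMgr {                       // stub matching the interface in the question
    int getConn() { return 42; }
    void reclaim(int) {}
};
double externalFunc(int) { return 3.14; }  // stand-in; may throw in the real question

class ConnGuard {                          // RAII: acquire in ctor, release in dtor
    ResourceMgr& mgr_;
    int connId_;
public:
    explicit ConnGuard(ResourceMgr& rm) : mgr_(rm), connId_(rm.getConn()) {}
    ~ConnGuard() { mgr_.reclaim(connId_); }      // runs even if externalFunc() throws
    ConnGuard(const ConnGuard&) = delete;
    ConnGuard& operator=(const ConnGuard&) = delete;
    int id() const { return connId_; }
};

void func(ResourceMgr& rm) {
    ConnGuard guard(rm);
    double d = externalFunc(guard.id());
    std::cout << d << std::endl;
}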

–Zheng

  • Q: why use weak_ptr instead of raw ptr. Valid question.
    A: See [[std c++lib]] P84.
    A: See last paragraph in https://bintanvictor.wordpress.com/2010/01/20/weak_ptr-can-access-inner-ptr-only-through-a-shared_ptr/
  • Q: you said atomic<int> operator++ probably wraps a CAS in a while loop internally, so is it spinlock?
    %%A: I think so.   http://www.cs.cornell.edu/courses/cs4410/2015su/lectures/lec06-spin.html is a detailed explanation
  • Q34: various synchronization techniques?
  • Q34b: what’s the c++ function to force a memory barrier?
    A: std::atomic_thread_fence(),
    but i feel if an application (not a library) uses this function then something is seriously wrong.
  • Q: can we implement a mutex using CAS?
    AA: blockingMutex implementation ] kernel
    AA: spinlock, not blockingMutex, can be implemented as in   https://www.andrew.cmu.edu/course/15-440-sp09/applications/ln/lecture4.html
  • ? Q (paper coding): write a piecewise linear interpolation Curve class with a ctor taking a vector<pair<double, double>>.  See https://github.com/tiger40490/repo1/blob/cpp1/cpp1/miscIVQ/curveInterpolation_CVA.cpp
  • Q: What are r-values and r-value references
  • Q: Explain your use of SFINAE. See https://github.com/tiger40490/repo1/blob/cpp1/cpp1/template/SFINAE_demo1.cpp
  • Q: what’s the c++ library function for cmpxch?
    A: atomic::compare_exchange_weak()
  • Q: is your barcap vol surface used for EOD risk only?
    %%A: no. A trader might suspect a lesser known product is currently mis-priced. She could confirm the live bid/ask are obvious outliers on the fitted surface.
    %%A: a dealer desk prices new deals based on the fitted surface, whether or not the strike/expiry requires interpolation.

–Simon Ma:

  • Q (paper coding): implement Fib() recursively and iteratively
  • Q: what’s inline?
  • Q: semaphore vs mutex
  • Q: how did you compute greeks like gamma?
  • Q (bond math): given 6M spot IR = 5% and 12M = 10%, what’s the 6M rate 6M forward? (Worked numbers after this list.)
  • Q: Assign first half of a vector to another vector
    %%A: vector::assign() probably takes two iterators so I can pass v.begin(), v.begin()+5
  • Q (obscure): What data structure to represent a directed graph of N nodes? See NxN matrix for graph of N nodes
    %%A: create a Node class with a single ptr field…
  • Q: use parallel algorithm to compute sum of a vector<int>
    %%A: only the last stage global aggregation has a shared mutable that needs protection. The sub-aggregation can be single-threaded.
  • Q (probability problem): Both Alan and Bob agree to show up at Cafe9 sometime between 12 and 2pm. Whoever arrives first would wait for 15 minutes only. What’s the probability of them actually meeting
  • Q: what’s thread_local?
    %%A (correct): a new storage class that’s similar to static
  • Q (paper coding): A natural number is Good if it is a product of only 3, 5 and 7. The smallest Good numbers are 1,3,5,7,9,15,21,25,27,35… Write a print(int N) to print the Nth Good number. Hint: write a isGood(int k). See https://github.com/tiger40490/repo1/blob/cpp1/cpp1/miscIVQ/isGood357_CVA.cpp
  • Q (unclear): implement subtract(int a, int b) using only operator+ and comparison operators. I feel this question is unclear. Can we use bitwise? Can we multiply?
    %%A: just try all the integers (i) one by one a == b+i
  • Q (obscure): what operators can’t be overloaded?
    %%A: q(?:) correct
    AA: address-of CAN be overloaded!
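
For the bond-math question above, a worked example under a simple-interest convention (an assumption on my part; semiannual compounding shifts the answer to roughly 15.1%):

\Big(1 + 0.05 \times \tfrac{1}{2}\Big)\Big(1 + f \times \tfrac{1}{2}\Big) = 1 + 0.10 \times 1
\;\Rightarrow\;
1 + \tfrac{f}{2} = \tfrac{1.10}{1.025} \approx 1.0732
\;\Rightarrow\;
f \approx 14.6\%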

–Mikhail the mgr

  • Q: inserting 1,000,000 items into a list vs a vector without reserve()
    A: vector wins
  • Q: do you define your exception classes?
  • Q: in what context is a deque completely unusable whereas vector is perfectly fine?
    A: a C function taking an array (i.e. a ptr + a count). Vector::data() is designed for this. In Pre-c++11, we can pass &v[0] and v.size()
    A: if the code takes one element and uses pointer increment to access other elements. Deque would fail at segment boundaries.
  • Q89 (coding question): given a vector of T [passed in by const ref], write a template function to return [by const reference] the minimum element.
  • Q89b: how do you handle an empty vector?
  • ? Q89c (obscure): given q(vector const & vec), what can you say about q{auto it = vec.begin()} versus q{vector::const_iterator it=vec.begin()}

inline perf can Backfire ! #Google#i-cache

As illustrated below, without inline, instruction cache system could hit ChangeOfFlow twice as it enters/exits your function aa(). If aa() is actually inlined and embedded in a hostFunc, then the instruction cache system can often load entire hostFunc, eliminating COF. This helps instruction cache, but excessive inlining can increase executable footprint (code bloat).

google c++ guide points out that

  • inline can either increase or decrease (for tiny functions) executable footprint. In general, smaller footprint improves running time due to instruction cache efficiency
  • virtual functions are inlined (i.e. defined in class body) primarily for convenience/maintainability, not performance

See also https://www.eetimes.com/document.asp?doc_id=1275470 and https://www.eetimes.com/document.asp?doc_id=1275474

As an example, consider the function call tree flow in Figure 1. Suppose function F2 is linked near function F1, but function F3 is not linked near F1. When function F1 calls F2, it is possible that F2 is already in the cache and that there will be no cache miss. (The likelihood that F2 is in the cache depends on the sizes of F1 and F2, as well as the location of the call inside F1.) In contrast, there will probably be a cache miss when F1 calls F3. Because F3 is located far away from F1 in memory, it is not likely to be in the cache when F1 calls it.

TCP blocking send() timeout #details

See also recv()/send() with timeout #CSY

I see three modes

  1. non-blocking send() — immediate return if unable to send
  2. blocking send() without timeout — blocks forever, so the thread can’t do anything.
  3. blocking send() with timeout —

SO_SNDTIMEO: sets the timeout value specifying the amount of time that an output function blocks because flow control prevents data from being sent. If a send operation has blocked for this time, it shall return with a partial count or with errno set to [EAGAIN] or [EWOULDBLOCK] if no data is sent. The default for this option is zero, which indicates that a send operation shall not time out. This option stores a timeval structure. Note that not all implementations allow this option to be set.
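
A hedged sketch of mode 3 above (assuming a POSIX platform and an already-connected TCP socket fd):

#include <sys/socket.h>
#include <sys/time.h>
#include <cstdio>

// Put a cap (in seconds) on how long a blocking send() may stall on fd.
bool setSendTimeout(int fd, int seconds) {
    timeval tv{};
    tv.tv_sec  = seconds;
    tv.tv_usec = 0;
    if (setsockopt(fd, SOL_SOCKET, SO_SNDTIMEO, &tv, sizeof(tv)) != 0) {
        perror("setsockopt(SO_SNDTIMEO)");
        return false;
    }
    return true; // a later send() stalled this long returns a partial count or -1/EAGAIN
}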

In the xtap library, timeout isn’t implemented at all. Default is non-blocking. If we configure to use 2), then we can hit a strange problem — one of three receivers gets stuck but keeps its connection open. The other receivers are starved even though their receive buffers are free.

fragmentation: IP^TCP #retrans

I consider this a “halo” knowledge pearl because it is part of an essential everyday service. We can easily find an opportunity to inject it into an Interview. Interviews are unlikely to go this deep, but it’s good to over-prepare here.

This comparison ties together many loose ends like Ack, reassembly, retrans, seq resets..

This comparison clarifies what reliability IP offers + what reliability features it lacks:

  • it offers complete datagrams
  • it can lose an entire datagram — TCP (not UDP) will detect it and use retrans
  • it can deliver two datagrams out of order — TCP will detect and fix it

[1] IP fragmentation can cause excessive retransmissions when fragments encounter packet loss and reliable protocols such as TCP must retransmit ALL of the fragments in order to recover from the loss of a SINGLE fragment
[2] TCP seq# never looks like 1,2,3
[3] IP (de)fragmentation #MTU,offset

| | IP4 fragmentation | TCP fragmentation |
|---|---|---|
| minimum guarantees | all-or-nothing; never a partial packet | stream in-sequence without gap |
| reliability | unreliable | fully reliable |
| name of a “whole” piece | IP datagram #“packet/msg” are vague | a TCP session with thousands of sequence numbers |
| name for a “part” | IP fragment | TCP segment |
| sequencing | each fragment has an offset | each segment has a seq# |
| .. continuous? | yes | discontinuous ! [2] |
| .. seq # reset? | yes, for each packet | loops back to 0 right before overflow |
| Ack | no Ack ! | positive Ack needed |
| gap detection | using offset | using seq# [2] |
| id for the “msg” | identification number | no such thing |
| end-of-msg | flag in last fragment | no such thing; continuous stream |
| out-of-sequence? | likely | likely |
| .. reassembly | based on id/offset/flag [3] | based on seq# |
| retrans | not by IP4 [1][3] | commonplace |
| missing a piece? | entire IP datagram discarded [3] | triggers retrans |

reinterpret_cast(zero-copy)^memcpy: raw mktData parsing

Raw market data input comes in as array of unsigned chars. I “reinterpret_cast” it to a pointer-to-TradeMsgStruct before looking up each field inside the struct.

Now I think this is the fastest solution. Zero-cost at runtime.

As an alternative, memcpy is also popular but it requires a bitwise copy. It often requires allocating a tmp variable.
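
A minimal sketch of the zero-copy approach (the wire layout below is hypothetical; a real feed handler must also respect the protocol’s endianness, alignment and packing):

#include <cstdint>
#include <cstdio>

#pragma pack(push, 1)
struct TradeMsgStruct {        // made-up layout standing in for the real message struct
    uint32_t symbolId;
    uint32_t qty;
    int64_t  priceMantissa;
};
#pragma pack(pop)

void onPacket(const unsigned char* buf, std::size_t len) {
    if (len < sizeof(TradeMsgStruct)) return;
    // zero-copy: view the raw bytes as the struct -- no memcpy, no tmp variable
    const auto* msg = reinterpret_cast<const TradeMsgStruct*>(buf);
    std::printf("sym=%u qty=%u\n",
                static_cast<unsigned>(msg->symbolId),
                static_cast<unsigned>(msg->qty));
}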

linux tcp buffer^AWS tuning params

—receive buffer configuration
In general, there are two ways to control how large a TCP socket receive buffer can grow on Linux:

  1. You can set setsockopt(SO_RCVBUF) explicitly as the max receive buffer size on individual TCP/UDP sockets
  2. Or you can leave it to the operating system and allow it to auto-tune it dynamically, using the global tcp_rmem values as a hint.
  3. … both values are capped by

/proc/sys/net/core/rmem_max — is a global hard limit on all sockets (TCP/UDP). I see 256M in my system. Can you set it to 1GB? I’m not sure but it’s probably unaffected by the boolean flag below.

/proc/sys/net/ipv4/tcp_rmem — doesn’t override SO_RCVBUF. The max value on RTS system is again 256M. The receive buffer for each socket is adjusted by kernel dynamically, at runtime.

The linux “tcp” manpage explains the relationship.

Note large TCP receive buffer size is usually required for high latency, high bandwidth, high volume connections. Low latency systems should use smaller TCP buffers.

For high-volume multicast channel, you need large receive buffers to guard against data loss — UDP sender doesn’t obey flow control to prevent receiver overflow.

—AWS

/proc/sys/net/ipv4/tcp_window_scaling is a boolean configuration, turned on by default. With window scaling turned on, the AWS limit rises to 1GB. If turned off, the AWS value is constrained to the 16-bit integer field in the TCP header — 65536.

I think this flag affects AWS and not receive buffer size.

  • if turned on, and if buffer is configured to grow beyond 64KB, then Ack can set AWS to above 65536.
  • if turned off, then we don’t (?) need a large buffer since AWS can only be 65536 or lower.

 

translation lookaside buffer #stats

  • a.k.a Address Translation Cache. The TLB lets the processor very quickly convert virtual addresses to physical addresses.
  • TLB is a cache for the big, slow page table
  • A typical entry in TLB is a pair of {virtual -> physical addresses}. In contrast,
  • A typical entry in a L1 cache is mapping of {physical address -> payload}.
  • You can hit both caches!
  • Both caches sit between the processor and main memory
  • each hardware system has one or more TLBs
  • TLB-miss can be handled in hardware or kernel
  • typical miss probability — 0.01% to 1%
  • typical miss latency (penalty) — 10 to 100 clock cycles to read the page table
  • typical hit latency: 0.5 to 1 clock cycle

If a TLB hit takes 1 clock cycle, a miss takes 30 clock cycles, and the miss rate is 1%, the effective memory cycle rate is an average of 1 × 0.99 + (1 + 30) × 0.01 = 1.30 i.e. 1.30 clock cycles per memory access.

https://stackoverflow.com/questions/5440128/thread-context-switch-vs-process-context-switch says TLB favors thread rather than process context-switch. TLB gets flushed during process rather than thread context-switch.

 

##order-book common data structureS #array

Rebus has two levels of trees, according to Steve.

  1. For each feed there’s a separate symbol lookup tree — one would think a hashtable would be faster, but given our symbol data size (a few thousand per feed) the AVL tree proved faster.
  2. A per-symbol bid-tree (and ask-tree) sorted by price

sizeof(‘a’) in c^c++^java #CSY

  • ‘a’ is taken as an int object in C, so sizeof(‘a’) is 4 (i.e. sizeof(int)) !!
  • c++ takes it as a char object, so sizeof(‘a’) is 1.

See https://stackoverflow.com/questions/2172943/size-of-character-a-in-c-c

Java char is 2-byte to support 16-bit unicode! In Java, only the “byte” type is 1-byte, by definition.

c++ doesn’t have “byte” type, because “char” is the equivalent of the java “byte” type.
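
A two-line check — compiled as C++, the program below prints 1 1; the same sizeof(‘a’) expression in a C translation unit would report sizeof(int), typically 4:

#include <iostream>
int main() {
    std::cout << sizeof('a') << ' ' << sizeof(char) << '\n'; // 1 1 in C++
}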

JGC G1 Metaspace: phrasebook #intern

Incidentally, NIO buffer is also in native memory

 

crossed orderbook mgmt: real story4IV

–challenges:
The crossed book disappears after a while; some days are better, some days worse.

–impact
data quality visible to paying customers

–individual contribution?
not really, to be honest.

–what would you do differently?
Perhaps detect and set a warning flag when generating top-book token? See other post on “crossed”

–details:

  • query the exchange if query is supported – many exchanges do.
  • compare across ticker plants? Not really done in our case.
  • replay captured data for investigation
  • retrans as the main solution
  • periodic snapshot feed is supported by many exchanges, designed for late-starting subscribers. We could (though we don’t) use it to clean up our crossed orderbook
  • manual cleaning via the cleaner script, as a “2nd-last” resort
  • hot failover.. as last resort

–the cleaner script:

This “depth-cleaner” tool is essentially a script to delete crossed/locked (c/l) entries from our replicated order books. It is run by a user in response to an alert.

… The script then compares the Ask & Bid entries in the order book. If it finds a crossed or locked order, where the bid price is greater than (crossed) or equal to (locked) the ask price, it writes information about that order to a file. It then continues checking entries on both sides of the book until it finds a valid combination. Note that the entries are not checked for “staleness”. As soon as it finds non-crossed/locked Ask and Bid entries, regardless of their timestamps, it is done checking that symbol.

The script takes the entries from the crossed/locked file, and creates a CIM file containing a delete message for every order given. This CIM data is then sent to (the admin port of) order book engine to execute the deletes.

Before the cleaner is invoked manually, we have a scheduled scanner for crossed books. This scanner checks every symbol once a minute. I think it uses a low-priority read-only thread.

c++condVar 2 usages #timedWait

poll()as timer]real time C : industrial-strength #RTS is somewhat similar.

http://www.stroustrup.com/C++11FAQ.html#std-condition singles out two distinct usages:

1) notification
2) timed wait — often forgotten

https://en.cppreference.com/w/cpp/thread/condition_variable/wait_for shows std::condition_variable::wait_for() takes a std::chrono::duration parameter, which has nanosec precision.

Note java wait() also has nanosec precision.

std::condition_variable::wait_until() can be useful too, featured in my proposal RTS pbflow msg+time files #wait_until
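
A minimal sketch of usage 2 (timed wait). With no notifier thread, the predicate stays false and the call returns after roughly 500ms:

#include <chrono>
#include <condition_variable>
#include <iostream>
#include <mutex>

std::mutex m;
std::condition_variable cv;
bool dataReady = false;

int main() {
    std::unique_lock<std::mutex> lk(m);
    // timed wait -- give up after 500ms even if no notification arrives
    bool got = cv.wait_for(lk, std::chrono::milliseconds(500), [] { return dataReady; });
    std::cout << (got ? "notified" : "timed out") << '\n';
}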

IP4 fragmentation+reassembly #MTU,offset

I consider this a “halo” knowledge pearl because it is part of an essential everyday service. We can easily find an opportunity to inject it into an IV.

A Trex interviewer said something questionable. I said fragmentation is done at IP layer and he said yes but not reassembly. He was wrong. See P329 [[Computer Networking]]

I was talking about the IP layer breaking up, say, a 4KB packet (TCP or UDP packet) into three IP-fragments no bigger than 1500B [1]. The reassembly task is to put all 3 fragments back together in sequence (and detect missing fragments) and hand it over to TCP or UDP.

This reassembly is done in IP layer. IP4 uses an “offset” number in each fragment to identify the sequencing and to detect missing fragments. The fragment with the highest offset also has a flag indicating it’s the last fragment of a given /logical/ packet.

Therefore, IP4 detects and will never deliver partial packets to UDP/TCP (P328 [[computer networking]]), even though IP is considered an unreliable service. IP4 can detect missing/incomplete IP-datagram and will refuse to “release” it to upper layer (i.e. UDP/TCP).

  • If TCP doesn’t get one “segment” (a.k.a. IP-datagram), it will request retransmission
  • UDP does no retransmission
  • IP4 does no retransmission

[1] MTU for some hardware is lower than 1500 Bytes …

down-cast a reference #idiom

ARM P69 says down-casting a reference is fairly common. I have never seen it.

Q: Why not use ptr?
%%A: I guess pointer can be null so the receiver function must face the risk of a null ptr.
%%A: 99% of references I have seen in my projects are function parameters, so references are extremely popular and proven in this use case. If you receive a ref-to-base, you can down cast it.

See post on new-and-dynamic_cast-exceptions
see also boost polymorphic_cast
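
A small sketch of the reference down-cast — unlike the pointer form, a failed dynamic_cast on a reference throws std::bad_cast instead of yielding null:

#include <iostream>
#include <typeinfo>

struct Base { virtual ~Base() = default; };
struct Derived : Base { void hi() const { std::cout << "Derived\n"; } };

void useBase(Base& b) {
    try {
        Derived& d = dynamic_cast<Derived&>(b); // down-cast a reference
        d.hi();
    } catch (const std::bad_cast& e) {          // failure throws rather than returning null
        std::cout << "not a Derived: " << e.what() << '\n';
    }
}

int main() {
    Derived d; Base b;
    useBase(d); // prints Derived
    useBase(b); // prints not a Derived: ...
}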

sentinel node trick: array/slist

  • eg: STL uses sentinel nodes. I believe container.end() returns that.
  • eg: c-string uses \0 as sentinel value.
  • eg: binary search needle can be too high and we have to test if the return value is a real element or the end-of-container, so it’s extremely useful to add another sentinel to guarantee a successful search

When solving timed coding questions on “clean” arrays, it can be useful to append a sentinel node. It can simplify some algorithms which take action after each segment.

P90 [[Pearls]] introduced a loop coding idiom. The idiom is applicable whenever we loop over a container and have an early-exit “break”. Such a loop has exactly two [1] conditional tests per iteration, and can therefore run faster if we combine the two into one conditional test. This is a small halo idiom for latency. But beyond latency, there are more interesting benefits such as reduced cognitive complexity.

For example, consider the business logic after reaching “normal” end of the container without hitting the early exit. Rarely can we “forget” this business logic and simply exit the function and rely on the implicit return. Instead, this normal-completion scenario requires careful handling and a cognitive burden. To remind myself, I often place a comment after end of the loop. (Python provides an Else clause for a for-loop.)

In some cases, the business logic after the end of the loop is 2 pages away from the early-exit business logic, but they really should be in one module. I hate this situation so much that I always set a flag and use a “break”, so that the two routines are both written after the end of the loop.

In all these scenarios, it’s often simpler to append an artificial sentinel element at end of the container!  The sentinel is guaranteed to hit the early-exit, so the loop would never exhaust all elements and complete normally. Therefore we can combine the two into one conditional test:)

I usually replace “break” with “exit”, further reducing the control-flow complexity. Such micro simplifications can pay huge dividends in high-pressure timed tests.

Right before the exit/break we may still need to handle normal-completion scenario by checking if we are at the sentinel. Interesting scenario. Now we can combine the logic of normal-completion vs early exit. In many cases, they are (nearly) identical, so we can express the logic in very structured code.

Whether they are identical or not, handling the two scenarios (early vs normal completion) in a centralized module is usually simpler and more structured, reducing the cognitive complexity.

[1] what if three tests (two breaks)? The mental burden is even worse. The sentinel could reduce it
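
A minimal sketch of the sentinel idea on an array/vector — the loop body keeps exactly one conditional test, and the “not found” case falls out of the same exit:

#include <iostream>
#include <vector>

// returns the first index holding `needle`, or the original size if absent;
// v is taken by value so appending the sentinel does not touch the caller's container
std::size_t findWithSentinel(std::vector<int> v, int needle) {
    const std::size_t n = v.size();
    v.push_back(needle);            // sentinel guarantees the loop exits via the "hit" test
    std::size_t i = 0;
    while (v[i] != needle) ++i;     // single conditional test per iteration
    return i;                       // i == n means we hit the sentinel, i.e. not found
}

int main() {
    std::vector<int> v{4, 8, 15, 16};
    std::cout << findWithSentinel(v, 15) << ' ' << findWithSentinel(v, 99) << '\n'; // 2 4
}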

## notable linux system calls: QQ question

https://syscalls.kernelgrok.com can sort the functions by function id

http://asm.sourceforge.net/syscall.html is ordered by function id

  • fork()
  • waitpid()
  • open() close() read() write()
  • –socket
  • socket() connect() accept()
  • recvfrom() sendto()
  • shutdown() is for socket shutdown and is more granular than the generic close()
  • select()
  • epoll family
  • –memory
  • brk
  • mmap

std::sort (!!quicksort) requires random-access

list::sort() works on a list — a non-random-access container — using a stable sort (typically a merge sort, per the linked discussion). See https://stackoverflow.com/questions/1717773/which-sorting-algorithm-is-used-by-stls-listsort.

There are many online resources about quicksort without random access.

In contrast, java Collections.sort() does NOT require RandomAccess. See RandomAccess #ArrayList sorting.

unordered_map^map performance: unpredictable

Small halo… the theories and explanations are subject to change.

Many people report that unordered_map can be slower than traditional std::map for small collections. I feel it’s implementation-dependent. Result may NOT apply to java or c#.

http://www.stroustrup.com/C++11FAQ.html#std-unordered (by Stroustrup) says “For larger numbers of elements (e.g. thousands), lookup in an unordered_map can be much faster than for a std::map.” In the same paragraph Stroustrup also says unordered_map “lookup involves a single call of a hash function and one or more equality operations”.

https://blog.dubbelboer.com/2012/12/04/lru-cache.html is mostly about lookup speed. It shows that string key size hurt hash table, but sample size hurts RBTree.

Many interviewers asked me

  • Q: why for a small string collection, RBTree can outperform hash table?
  • A: String hashing takes time proportional to string size
  • A: Poor hashing => Hash collision => linear search is another potential problem. String hashing is unpredictable compared to integer hashing. No one can guarantee the next sample will be “uniform” according to our hash function.
  • A: rehash

Cache efficiency on the bucket array (array of pointers) is another reported weakness in hash tables. However, I guess a small bucket array can be completely held in cache and each linked list usually holds 1 node only so cache miss is on that 1 node. For a tree, logN nodes are accessed for a lookup and all of them could get cache-miss.

Practical suggestion — it’s easy to swap the two in a given codebase, so just swap and benchmark. Either can be faster. If your data set changes over time, you may need to re-benchmark.

unordered_set implementation note

https://www.gamedev.net/forums/topic/582613-c-how-is-unordered_set-implemented/ says

“unordered_set is typically implemented as a linked list that holds all the elements and a hash table stores pointers to the linked list nodes.”, echoed on https://stackoverflow.com/questions/31112852/how-stdunordered-map-is-implemented

LinkedHashMap.java?

detect cycle in slist #Part 2

Note there’s an open question at the end

Don’t use fast-slow iterators. These 3 solutions below each has advantages over the fast-slow iterators solution.

–solution: list-reversal, based on https://blog.ostermiller.org/find-loop-singly-linked-list and implemented in https://github.com/tiger40490/repo1/blob/py1/py/slist/slistReverseLoopDetect.py
Basically, iterate from aa to bb to cc … and reverse each arrow. Suppose dd points to bb initially forming a loop. At some point the snapshot is

  • null <- aa <- bb <- cc <- dd  while currentNode is dd and nextNode is bb.

If we go on, we would be back at aa, proving the existence of a loop.

Benefit of this solution — O(1) space complexity and O(N) time complexity

constraint — won’t work on read-only lists.

–solution: count nodes against memory usage — my invention
We can get a reliable estimate of the memory usage of the entire data structure, and divide it by per-node footprint, so we know within good approximation how many nodes exist. If we count more than that, there must be duplicates.

https://blog.ostermiller.org/find-loop-singly-linked-list shows one way to estimate the memory usage — keep track of the maximum and minimum addresses among all visited nodes.

The low level languages usually provide addresses. The higher level languages usually provide memory usage. In realistic applications, there is always a way to get an estimate of memory usage.

–solution: Brent. See my code https://github.com/tiger40490/repo1/blob/cpp1/cpp/linkedList/cycleDetect.cpp

If we keep count of total number of increments, we should see that every time we double “power”, that total count is a power of 2. If we have incremented 32 times then we know { mu + lambda } must exceed 32… Advantages over the simple-2-iterator:

  1. tortoise is much faster, so if mu (merge point, or loop starting point) is far out, then we will find it faster. With the standard solution, tortoise takes a long time to reach the merge point.
  2. can return loop size

I now believe I can predict in which “phase” we will detect the loop — If lambda is between 2^5 and 2^6, then we will detect the loop in Phase 6, and we can’t detect in Phase 5

I suspect power-of-3 is also working:

  • If the slower iterator stops moving completely, then the algorithm is broken because the slower iterator could remain outside the loop.
  • If the slower iterator tailgates the faster iterator and the following distance is always within a static limit (like 99), then the algorithm is broken because the loop size could exceed 99, so the two iterators would have no chance of rendezvous.
  • ==> If the following distance grows gradually, AND the follower is never stuck forever, I think eventually they will meet —
    • Suppose right after a tortoise jump, powOfN bumps up to 81 but actual loop length is shorter, like 80. Now within the next 81 iterations, the hare would move step by step while the tortoise remains. They would rendezvous for sure.
  • —- pros and cons for power-of-3 over power-of-2 —-
  • pro: if a big loop involves root, tortoise won’t have to jump too early too frequently ==> Fewer comparison needed.
  • con: if there’s a tiny loop far out, tortoise would end up jumping too late?

lazy singleton based@JVM dynamic classloading #Lea

In [[DougLea]] P86, this JVM expert pointed out that a simple “eager” singleton is eager in other languages, but lazy in Java due to runtime on-demand class loading.

Specifically we mean a public[1] static final field. This initialization is thread-safe by default. Assuming the field is immediately accessed after class loading, this simple design is comparable to the familiar synchronized lazy singleton. What are the pros and cons?

Synchronized singleton requires more legwork (by developer) but
* It lets you pass computed ctor parameters at runtime. Note the singleton ctor is private but you can call getInstance(userInput).
* As hinted earlier, if you load the class but do not immediately use the instance, then the simple design incurs the expensive initialization cost too early.

[[DougLea]] was written before java5. With java5, [[EffJava]] advocates enum.

[1] DougLea actually prefers private field with public getter, for encapsulation.

pass temp into func by val: mv-ctor skipped #RVO CSY

Suppose we have a class MoveOnlyStr which has only a move-ctor, no copy-ctor. Suppose we pass an unnamed temporary instance of this class into a function by value, like void func1(MoveOnlyStr arg_mos).

Q: Will the move-ctor be used to create the argument object arg_mos? We discussed this in your car last time we met up.

A: If the temp is produced by a function, then No. My test shows I was right to predict that the compiler optimizes away the temporary, due to RVO. So the move-ctor is NOT used. This RVO optimization existed long before c++11.

A: if the temp is not produced by a function, then RVO is irrelevant (nothing “Returned”) but I don’t know if there’s still some copy-elision.
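
A sketch to observe which constructor fires (MoveOnlyStr here is my own toy class with logging, not the one from any test). In C++17 both calls benefit from guaranteed elision; pre-C++17 the move-ctor may or may not print, depending on elision:

#include <iostream>
#include <string>
#include <utility>

struct MoveOnlyStr {
    std::string s;
    explicit MoveOnlyStr(std::string x) : s(std::move(x)) { std::cout << "ctor\n"; }
    MoveOnlyStr(const MoveOnlyStr&) = delete;
    MoveOnlyStr(MoveOnlyStr&& o) noexcept : s(std::move(o.s)) { std::cout << "move-ctor\n"; }
};

MoveOnlyStr make() { return MoveOnlyStr("from factory"); }

void func1(MoveOnlyStr arg_mos) { std::cout << arg_mos.s << '\n'; }

int main() {
    func1(make());                    // temp produced by a function: typically no move-ctor printed
    func1(MoveOnlyStr("plain temp")); // prvalue temp: elided in C++17; pre-17 may print move-ctor
}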

multiple inheritance !! always trouble-maker

Many interviewers ask about MI.

  • Both java and c# officially support MI via interface inheritance
  • python supports MI but seldom used in my projects
  • c++ MI is safe if the superclasses are protocol classes, just like java/c#.
  • c++ MI is OK if the superclasses do not conflict and the subclass is final, preventing the diamond problem
  • [[effC++]] also says it’s OK to publicly subclass an interface and privately subclass a concrete class. I don’t fully understand it, and it’s not widely used IMHO
  • [[Alexandrescu]] pointed out a combination of MI+TMP is essential for library designs. STL doesn’t use it, but ETSFlow does.

c++GC interface

https://stackoverflow.com/questions/27728142/c11-what-is-its-gc-interface-and-how-to-implement

GC interface is partly designed to enable

  • reachability-based leak detectors
  • garbage collection

The probe program listed in the URL shows that as of 2019, all major compilers provide trivial support for GC.

Q: why does c++ need GC, given RAII and smart pointers?
A: system-managed automatic GC instead of manual deallocation, without smart pointers

POSIX^SysV-semaphores

https://www.ibm.com/developerworks/library/l-semaphore/index.html — i have not read it.

My [[beginning linux programming]] book also touches on the differences.

I feel this is less important than the sharedMem topic.

  • The posix semaphore is part of pthreads
  • The sysV semaphore is part of IPC and often mentioned along with sysV sharedMem

The counting semaphore is best known and easy to understand.

  • The pthreads semaphore can be used this way or as a binary semaphore.
  • The system V semaphore can be used this way or as a binary semaphore. See http://portal.unimap.edu.my/portal/page/portal30/Lecturer%20Notes/KEJURUTERAAN_KOMPUTER/SEM10809/EKT424_REAL_TIME_SYSTEM/LINUX_FOR_YOU/12_IPC_SEMAPHORE.PDF

The Linux manpage pointed out — System V semaphores (semget(2), semop(2), etc.) are an older semaphore API. POSIX semaphores provide a simpler, better designed interface than System V semaphores; on the other hand POSIX semaphores are less widely available (especially on older systems) than System V semaphores.

The same manpage implies both APIs use a _counting_ semaphore semantic, without notification semantics.
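
A tiny POSIX counting-semaphore sketch (threads within one process; link with -pthread on Linux):

#include <semaphore.h>
#include <cstdio>

int main() {
    sem_t sem;
    sem_init(&sem, /*pshared=*/0, /*initial count=*/2); // pshared=0: shared among threads of this process
    sem_wait(&sem);    // count 2 -> 1
    sem_wait(&sem);    // count 1 -> 0; a third wait here would block
    sem_post(&sem);    // count 0 -> 1, releasing one blocked waiter if any
    sem_destroy(&sem);
    std::puts("done");
}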

## mkt data: avoid byte-copying #NIO

I would say “avoid” or “eliminate” rather than “minimize” byte copying. Market data volume is gigabytes so we want and can design solutions to completely eliminate byte copying.

  • RTS uses reinterpret_cast but still there’s copying from kernel socket buffer to userland buffer.
  • Java NIO buffers can remove the copying between JVM heap and the socket buffer in C library. See P226 [[javaPerf]]
  • java autoboxing is highly unpopular for market data systems. Use byte arrays instead

Arrays.sort(primitiveArray) beats List.sort() #defaultMethod

In terms of sorting performance, Arrays.sort(primitiveArray) is a few times faster than Collections.sort() even though both are O(N logN). My learning notes:

  • Arrays.sort(int []) is a double-pivot quicksort, probably using random access
  • Arrays.sort(Object []) is a mergesort
  • Collections.sort(List) defers to List.sort()
    • List.sort() is a Java8 default method in the List.java interface. It copies data to an array then runs a mergesort
    • ArrayList.java overrides the default method, so no copying for ArrayList from  java8 onwards

RandomAccess marker interface (ArrayList implements) is completely irrelevant. That’s because any List.java subtype that provides RandomAccess can simply override (at source code level) the default method as demonstrated in ArrayList.java. This is cleaner than checking RandomAccess at runtime. One or Both designs could potentially be JIT-compiled to remove the runtime check.

 

ADL #namespace

  • “Put the function in the same namespace as the classes it operates on.” is a one-liner summary
  • If you want to write a function that needs only a class’s public interface – then that function doesn’t have to be a (static/non-static) member. The function can become a free function placed in the same name space as the class. This increases encapsulation and data hiding, and reduces coupling as only the public interface is needed.. P79 [[c++codingStd]]
  • It’s also an idiom/pattern to impress interviewers. The published experts all seem to view ADL as a library feature mainly for overloaded operators. Read cppreference and [[c++codingStd]]. I’m not interested in that usage.
  • boost seems to use a lot of ADL
  • I feel namespace loves ADL more than any other c++ feature 🙂
  • ADL is an endorsed, prized compiler feature but still with criticisms [1]. To eliminate the confusion, simply fully qualify the function call.

[1] https://stackoverflow.com/questions/8111677/what-is-argument-dependent-lookup-aka-adl-or-koenig-lookup has the best quickguide with simple examples.

https://softwareengineering.stackexchange.com/questions/274306/free-standing-functions-in-global-namespace is a short, readable discussion.

My annotations on the formal, long-winded definition — ADL governs look-up of unqualified function names in function-call expressions (including operator calls). These function names are looked up in the namespaces of their arguments in addition to the usual namespace-based lookup.
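
A minimal illustration of the one-liner above — the unqualified call is resolved via ADL because the argument’s type lives in namespace zoo (all names here are made up):

#include <iostream>

namespace zoo {
    struct Animal { const char* name = "cat"; };
    // free function in the same namespace as the class it operates on
    void feed(const Animal& a) { std::cout << "feeding " << a.name << '\n'; }
}

int main() {
    zoo::Animal a;
    feed(a);      // unqualified call: found via ADL in namespace zoo
    zoo::feed(a); // fully qualified call: always works, sidesteps any ADL confusion
}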

CPU(data)cache prefetching

https://stackoverflow.com/questions/1950878/c-for-loop-indexing-is-forward-indexing-faster-in-new-cpus top answer is concise. I think the observations may not be relevant in x years but the principles are.

  • adjacent cache line (ACL) prefetcher — simple to understand
  • cpu can detect streams of memory accesses in forward or backward directions

Note L1/L2/L3 caches are considered part of the CPU even if some of them are physically outside the microprocessor.

c++user-defined constant:(scoped)enum^const static

See https://stackoverflow.com/questions/112433/should-i-use-define-enum-or-const

  • best practice — enclose in a namespace or a class
  • enum can group 2 related constants. Scoped enum is even more descriptive.
  • enum also creates a typename for code documentation
  • enum const value must be signed integers 😦
  • Both enum type and a single static const field can be declared as class members 🙂
  • For singular constants, use enum or const static variables (file-scope, and independently instantiated in each compilation unit). Avoid extern.
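
A small sketch pulling the bullets together (namespace/class names are made up):

#include <iostream>

namespace cfg {                        // best practice: enclose constants in a namespace or class
    enum class Side { Buy, Sell };     // scoped enum: two related, self-documenting constants
    enum { MaxRetries = 3 };           // classic enum as an integral constant
    constexpr double Tolerance = 1e-9; // c++11 constexpr for a singular constant
}

struct Order {
    static constexpr int MaxLegs = 4;  // single constant as a class member
    cfg::Side side = cfg::Side::Buy;
};

int main() {
    std::cout << cfg::MaxRetries << ' ' << Order::MaxLegs << ' ' << cfg::Tolerance << '\n';
}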

STL iterator invalidation rules, succinctly

http://www.martinbroadhurst.com/iterator-invalidation-rules-for-c-containers.html has concise explanations. Specifically,

  • list insertions don’t invalidate any iterator. I feel iterator is a pointer to a node.
  • tree insertions don’t invalidate any iterator. Same reason as list.
  • for erasure from lists or trees, only the iterator to the erased node is invalidated.

Now for vector:

  • vector insertion invalidates any iterator positioned somewhere after the insertion point. If reallocation happens due to exceeding vector.capacity() then all invalidated

Who cares?

  1. in projects, we always check the documentation
  2. in coding interviews, we can save precious time if we remember these rules
  3. in QQ interviews, I have never received such a knowledge question. I think interviewers don’t consider this very “deep”, but we could possibly impress some interviewers.

19obscure but recurr`c++QQ questions #halo

All QQ topics, not related to GTD or coding IV

  1. Throwing dtor
  2. how to avoid virtual functions (use enum?); CRTP
  3. Delete a pointer to base, but the actual object is a derived without virtual dtor. See my post https://bintanvictor.wordpress.com/2015/04/21/deleting-base-class-ptr-non-virt-dtor
  4. casting “this” pointer. See https://bintanvictor.wordpress.com/2017/11/15/crtp-1st-lesson/
  5. memory: q(delete this)
  6. memory: placement new and how to delete
  7. why division by zero is not an exception
  8. debug build modifying behavior
  9. IPC mutex; synchronization in shared mem
  10. memcpy
  11. ADL
  12. template template params

–socket programming interviews

  1. ##tcp tuning params
  2. non-blocking socket calls
  3. buffer full

non-blocking socket readiness: alternatives2periodic poll

Some Wells interviewer once asked me —

Q: After your non-blocking send() fails due to a full buffer, what can you do to get your data sent ASAP?

The simple solution is retrying after zero or more milliseconds. Zero would be busy-wait, i.e. spinning CPU. Non-zero means unwanted latency.

A 1st alternative is poll()/select() with a timeout, and immediately retry the same send(). There’s basically no added latency. No spinning either. The Linux-specific epoll() is more efficient than poll()/select() and a popular solution for asynchronous IO.
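
A sketch of that first alternative (POSIX poll(); fd is assumed to be an existing non-blocking socket):

#include <poll.h>

// After a non-blocking send() returned EWOULDBLOCK on fd,
// wait up to timeoutMs until the socket is writable again, then retry the send.
bool waitWritable(int fd, int timeoutMs) {
    pollfd pfd{};
    pfd.fd = fd;
    pfd.events = POLLOUT;               // interested in "send buffer has room"
    int rc = poll(&pfd, 1, timeoutMs);  // 0 = timeout, <0 = error, >0 = ready
    return rc > 0 && (pfd.revents & POLLOUT);
}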

2nd alternative is SIGIO. http://compgeom.com/~piyush/teach/4531_06/project/hell.html says it doesn’t waste CPU. P52 [[tcp/ip sockets in C]] also picked this solution to go with non-blocking sockets.

http://compgeom.com/~piyush/teach/4531_06/project/hell.html is actually a concise overview of several alternatives

  • non-blocking socket
  • select/poll/epoll
  • SIGIO
  • .. other tricks

java assignment-expression returns assigned value

A small halo in coding interviews 🙂

This feature is inherited from C.

  • eg: return singletonInstance = new MySingleton();
  • eg: a = b = 22;
  • eg: while ((line = reader.readLine()) != null)
  • eg (bigger):
    List<Object> processables;
    while ((processables = retrieveProcessableItems(..)).size() > 0) { /*…*/ }

Incidentally, unlike C, java while-loop header can’t declare variables 😦  Solution — use for-loop 🙂

concurrent lazy singleton using static-local var#c++

As explained in another blog post, static local is a shared mutable, but fortunately initialization is thread-safe:

“If multiple threads attempt to initialize the same static local variable concurrently, the initialization occurs exactly once (similar behavior can be obtained for arbitrary functions with std::call_once).” — https://en.cppreference.com/w/cpp/language/storage_duration

https://stackoverflow.com/questions/26013650/threadsafe-lazy-initialization-static-vs-stdcall-once-vs-double-checked-locki has 20 upvotes and looks substantiated. It also considers double-checking, std::call_once, atomics, CAS…

GCC uses platform-specific tricks to ensure a static local variable is initialized only once, on first use. 

Conclusion — On GCC and other compilers, this is the easiest solution for a concurrent lazy singleton.
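
The resulting pattern is the familiar Meyers singleton; a bare-bones sketch (Config is a made-up class):

#include <string>

class Config {
public:
    static Config& instance() {
        static Config theOne;   // initialized exactly once, thread-safely, on first call
        return theOne;
    }
    std::string env = "prod";
private:
    Config() = default;         // no outside construction
    Config(const Config&) = delete;
    Config& operator=(const Config&) = delete;
};

// usage: auto& cfg = Config::instance();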

 

3rd effect@volatile in java5

A Wells Fargo java interviewer said there are 3 effects. I named

  1. load/store to main memory on the target variable
  2. disable statement reordering

I think the interviewer mentioned a 3rd effect, about memory barriers.

This quote from Java Concurrency in Practice, chap. 3.1.4 may be relevant:

The visibility effects of volatile variables extend beyond the value of the volatile variable itself. When thread A writes to a volatile variable and subsequently thread B reads that same variable, the values of all variables that were visible to A prior to writing to the volatile variable become visible to B after reading the volatile variable. So from a memory visibility perspective, writing a volatile variable is like exiting a synchronized block and reading a volatile variable is like entering a synchronized block.

— “volatile” on a MyClass field provides special concurrency protection when the MyClass field is constructed :

java reference-type volatile #partially constructed has good details.

https://stackoverflow.com/questions/9169232/java-volatile-and-side-effects addresses my doubt about “other writes“. Nialscorva’s answer echoes the interviewer:

Before java 1.5, the compiler can reorder the two steps

  1. construction of the new object
  2. assigning the new address to the variable

In such a scenario, other threads (unsynchronized) can see the address in the variable and use the incomplete object, while the construction thread is preempted indefinitely like for 3 hours!

So in java 1.5, the construction is among the “other writes” by the volatile-writing thread! Therefore, the construction is flushed to memory before the address assignment. Below is my own solution, using a non-static volatile field:

public class DoubleCheckSingleton {
	private static DoubleCheckSingleton inst = null;
	private volatile boolean isConstructed = false;
	private DoubleCheckSingleton() {
		/* other construction steps */
		this.isConstructed = true; //last step
	}
	public static DoubleCheckSingleton getInstance() {
		if (inst != null && inst.isConstructed) return inst;
		synchronized(DoubleCheckSingleton.class) {
			if (inst != null && inst.isConstructed) return inst;
			
/**This design makes uses of volatile feature that's reputed to be java5
*
* Without the isConstructed volatile field, an incomplete object's 
* address can be assigned to inst, so another thread entering getInstance()
* will see a non-null inst and use the half-cooked object 😦
* 
* The isConstructed check ensures the construction has completed
*/
			return inst = new DoubleCheckSingleton();
		}
	}
}

jmp_buf/setjmp() basics for IV #ANSI-C

Q: how is longjmp different from goto? See http://ecomputernotes.com/what-is-c/function-a-pointer/what-is-the-difference-between-goto-and-longjmp-and-setjmp

A: longjmp can 1) jump across functions, and 2) restore program state from a jmp_buf, which was saved earlier by setjmp.
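
A minimal sketch of the jump-across-functions behavior (C-style, compiles as C++ via <csetjmp>):

#include <csetjmp>
#include <cstdio>

std::jmp_buf checkpoint;   // program state saved by setjmp, restored by longjmp

void deepWork() {
    std::puts("about to bail out");
    std::longjmp(checkpoint, 7);   // jumps across the function boundary, back to setjmp
}

int main() {
    if (setjmp(checkpoint) == 0) { // 0 on the initial call
        deepWork();                // never returns normally
    } else {
        std::puts("recovered after longjmp");
    }
}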

pbclone large obj(eg:vector)rely`@move

This is impressive in QQ interviews + coding questions

GotW #90 Solution: Factories has a good illustration of move semantics put to good use.

  • Before c++11, a function returning a large vector (or any large object) by value incurs expensive deep copying of all vector elements.
  • With c++11 move features added to std::vector class, returning a vector by value is cheap and recommended.
  • RVO may kick in but (I feel) is less reliable than move semantics. For the specific rules see RVO^move-semantics

std::string is COW-reference-counted now but should change]c++11

http://www.drdobbs.com/cpp/c-string-performance/184405453 (circa 2003) explains that COW speeds up string copy and string destruction.

https://stackoverflow.com/questions/12520192/is-stdstring-refcounted-in-gcc-4-x-c11

This proved a small halo.

However, gcc 5.1 introduced a breaking change. https://gcc.gnu.org/onlinedocs/libstdc++/manual/using_dual_abi.html says These changes were necessary to conform to the 2011 C++ standard which forbids Copy-On-Write strings

Q: why forbidden?
A: thread-safety … See [[std c++lib]] P692

declare variables ] loop header: c^j #IF/for/while

Small trick to show off in your coding test…

Background — In short code snippet, I want to minimize variable declarations. The loop control variable declaration is something I always want to avoid.

https://stackoverflow.com/questions/38766891/is-it-possible-to-declare-a-variable-within-a-java-while-conditional shows java WHILE-loop header allows assignment:

List<Object> processables;
while ((processables = retrieveProcessableItems(..)).size() > 0) { /*…*/ }

But only (I’m 99% sure) the c++ WHILE-loop header allows variable declaration.

The solution — both java/c++ FOR-loop headers allow variable declaration. Note the condition is checked Before first iteration, in both for/while loops.

update — c++ (even before c++11) allows a variable declaration in the IF-statement condition, designed to limit the variable’s scope.

if (int a = string().size() + 3) cout << a << " = a \n"; // shows 3

c++debug^release build can modify app behavior #IV

This was actually asked in an interview, but it’s also good GTD knowledge.

https://stackoverflow.com/questions/4012498/what-to-do-if-debug-runs-fine-but-release-crashes points out —

  • fewer uninitialized variables — Debug mode is more forgiving because it is often configured to initialize variables that have not been explicitly initialized.
    • For example, Perhaps you’re deleting an uninitialized pointer. In debug mode it works because pointer was nulled and delete ptr will be ok on NULL. On release it’s some rubbish, then delete ptr will actually cause a problem.

https://stackoverflow.com/questions/186237/program-only-crashes-as-release-build-how-to-debug points out —

  • guard bytes on the stack frame– The debugger puts more on the stack, so you’re less likely to overwrite something important.

I had frequent experience reading/writing beyond an array limit.

https://stackoverflow.com/questions/312312/what-are-some-reasons-a-release-build-would-run-differently-than-a-debug-build?rq=1 points out —

  • relative timing between operations is changed by debug build, leading to race conditions

Echoed on P260 [[art of concurrency]] which says (in theory) it’s possible to hit threading error with optimization and no such error without optimization, which represents a bug in the compiler.

P75 [[moving from c to c++]] hints that compiler optimization may lead to “critical bugs” but I don’t think so.

  • poor use of assert can have side effect on debug build. Release build always turns off all assertions as the assertion failure messages are always unwelcome.

j4 factory: java^c++ #Wells

A Wells Fargo interviewer asked

Q6: motivation of factory pattern?
Q6b: why prevent others calling your ctor?

  • %%A: some objects are expensive to construct (DbConnection) and I need tight control.
  • %%A: similarly, after construction, I often have some initialization logic in my factory, but I may not be allowed to modify the ctor, or our design doesn’t favor doing such things in the ctor. I make my factory users’ life easier if they don’t call new() directly.
  • AA: more importantly, the post-construction function could be virtual! This is a sanctioned justification for c++ factory on the authoritative [[c++codingStd]] P88
  • %%A: I want to (save the new instance and) throw exception in a post-construction routine, but I don’t want to throw from ctor
  • %%A: I want to centralize the non-trivial business rule of selecting target types to return. Code reuse rather than duplication
  • %%A: a caching factory
  • %%A: there are too many input parameters to my ctor and I want to provide users a simplified façade

a shared_ptr !! @heap #no make_shared

Boost documentation says shared_ptr class template stores a pointer to a dynamically allocated object (not an “auto” object), typically allocated with a C++ new-expression (not array-new .. see separate blogpost). The object pointed to is guaranteed to be deleted.

If you feed it a stack object, then the eventual deletion will probably be undefined behavior. For a non-heap pointer, you can’t use make_shared() according to [[effModernC++]]

Q: Can you finish your task and exit the program before the eventual deletion? Maybe you don’t care?

Anyways, it’s unwise to create a library routine accepting a pointer argument and returning a shared_ptr. The library routine has no way to test if it points to the heap or the stack (or global area). Such a routine (if you must make one) had better force callers to pass in an explicit flag “isOnHeap”.

As stated in http://stackoverflow.com/questions/7383572/c-shared-ptr-of-stack-object, the function should be designed to receive a shared_ptr as argument, not a raw ptr.

I believe it’s possible to seed a shared_ptr with a non-heap pointer, if you provide your own deleter.

thread-unsafe shared_ptr: tiny examples

##shared_ptr thr-safety: 3 cases is a “framework”.

http://www.boost.org/doc/libs/1_35_0/libs/smart_ptr/shared_ptr.htm#ThreadSafety gave simple examples of unsafe access.

//— Example 3, where the shared mutable is a club member p3 ——————
// thread A
p = p3; // reads p3, writes p

// thread B
p3.reset(); // writes p3: Undefined, simultaneous read/write

I gave this example correctly to a Sep 2018 SCB interview 🙂

===> In the same clock cycle, p3 Object is accessed on 2 threads. This is nothing to do with the ref count, or the pointee object.

//— Example 4,  where the shared mutable is a club member p2 ——————
// thread A
p3 = refToP2; // reads p2, writes p3

// thread B
// p2 goes out of scope: Undefined, simultaneous read/write since the destructor is considered a “write access”

===> In the same clock cycle, p2 Object gets write-accessed.

simple implementation of memory allocator#no src

P9 [[c++game development primer]] has a short implementation without using heap. The memory pool comes from a large array of chars. The allocator keeps track of allocated chunks but doesn’t reuse reclaimed chunks.

It showcases the header associated with each allocated chunk. This feature is also part of a real heap allocator.

reinterpret_cast is used repeatedly.

##10 c++coding habits to optimize perf

Many of these suggestions are based on [[optimized c++]]

· #1 Habit – in c++ at least, ++counter performance is strictly “better or equal to” counter++. If there’s no compelling reason, I would prefer the former.

· #2 Habit – in a for loop, one of the beginning and ending values is more expensive to evaluate. Choose the more expensive one as the beginning value, so you don’t evaluate it over and over. Some people object that compiler can cache the more expensive end value, but 2016 tests show otherwise.

· If a method can be static, then make it static. Good for performance and semantics.

· For a small if-else-if block, put the most likely scenario first. May affect readability. Worthwhile only in a hot spot.

· For a long if-elif-elif-elif-elif block, a switch statement performance is strictly “greater or equal”

· For-loop starts by checking the condition (2nd component in header). If this initial check is redundant (as often is), then use a do-while loop

· Call a loop in a function, rather than call a function in a loop. Another micro-optimization.

java8 lambda under the hood #phrasebook

Q: how are java8 lambda expressions translated?

* Special helper method – InvokeDynamic. You can see it in the bytecode
* Static methods – a non-capturing (stateless) lambda expression is simply converted to a static method
* a capturing lambda expression can also become a static method with the captures as additional method args. This may not be the actual compiler action, but it is a proven model. (Compare : separate chaining is one proven implementation of hash tables.)

However, static methods obscure an essential rule — the lambda expression’s type must “look like” a subtype of a SAM interface. Remember you often pass a lambda around as if it’s a SAM implementation instance.

So even if the actual work (like number crunching) is done in a static method, there must be some non-static wrapper method in a SAM subtype instance.

https://blog.codefx.org/java/dev/lambdas-java-peek-hood/ has some details on the evolution and design.

reinterpret_cast{int}( somePtr): practical use

http://stackoverflow.com/questions/17880960/what-does-it-mean-to-reinterpret-cast-a-pointer-as-long shows a real use case of reinterpret_cast from pointer to integer, and reinterpret_cast back to pointer.

What if I serialize an object to xml and the object contains a pointer field (reference field is less common but possible)? Boost::serialization would unwrap the pointer and serialize the pointee object.

Most of the time (like parsing unsigned char array), we reinterpret_cast from an (array) address into address of another data type
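
A minimal round-trip sketch of the pointer-to-integer direction discussed in the first link:

#include <cassert>
#include <cstdint>

int main() {
    int x = 7;
    int* p = &x;
    auto key  = reinterpret_cast<std::uintptr_t>(p); // pointer -> integer (e.g. for a map key or a log)
    int* back = reinterpret_cast<int*>(key);          // integer -> the same pointer
    assert(back == p && *back == 7);
}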

c++method default arg isn’t saved]vtable

  1. Google c++ style guide forbids default args in virtual functions. It’s error-prone.
  2. [[c++primer]] P607 has a slightly more liberal advice — on a virtual function, subclass/baseclass always use same default arg.

Ashish said he was asked a common question in multiple c++ interviews. The default arg value is “declared” in base class and (or) derived class but never saved in the vtable. It’s basically decided statically (i.e. at compile time).

If you use a pointer to Base to invoke the virtual method, then the Base version’s declared default arg applies unconditionally, even though subclass override is chosen at run time, via the vtable. Highly confusing.
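
A small demo of the confusion (toy classes): the override is dispatched dynamically via the vtable, while the default arg is bound statically from the declared type.

#include <iostream>

struct Base {
    virtual ~Base() = default;
    virtual void greet(const char* who = "base-default") const {
        std::cout << "Base::greet "  << who << '\n';
    }
};
struct Child : Base {
    void greet(const char* who = "child-default") const override {
        std::cout << "Child::greet " << who << '\n';
    }
};

int main() {
    Child c;
    Base& b = c;
    b.greet(); // prints "Child::greet base-default" -- derived body, Base's default arg
    c.greet(); // prints "Child::greet child-default"
}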

##c++good habits like java q[final]

  • Good habit — internal linkage
    • ‘static’ keyword on file-level static variables
    • anonymous namespace
  • if (container.end() ==itrFromLower_bound) {…}
  • use typedef to improve readability of nested containers
  • typedef — “typedef int FibNum” as source documentation
  • initialize local variables upon declaration.
  • use ‘const/constexpr’ whenever it makes sense. The c++11 constexpr is even better.
  • [c++11] – use ‘override’ or ‘final’ keywords when overriding inherited virtual methods. Compiler will break the build if there’s nothing to override. (A combined sketch of several habits follows this list.)
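A combined sketch of a few habits above, assuming a made-up Name2Score map (anonymous namespace for internal linkage, typedef for readability, end() check after lower_bound, and the c++11 override keyword):

#include <map>
#include <string>
#include <iostream>

namespace {                       // anonymous namespace: internal linkage, like a file-level static
  typedef std::map<std::string, int> Name2Score;   // typedef to tame the container name
}

struct Base{ virtual void run(){} virtual ~Base() = default; };
struct Derived : public Base{
  void run() override {}          // compiler breaks if there is nothing to override
};

int main(){
  Name2Score m{{"alice", 3}};
  const Name2Score::const_iterator it = m.lower_bound("bob");
  if (m.end() == it) {            // check against end() before dereferencing
    std::cout << "not found\n";
  }
}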

factory returning smart ptr^pbclone #Sutter2013

Background — This is one example of “too many variations”. A simple idea becomes popular… People start to mix-n-match it with other ideas and create many, many interesting questions. In practice we should stick to one or two best practices and avoid the other 50 variations.

Update: https://herbsutter.com/2013/05/30/gotw-90-solution-factories/ is a little too dogmatic as an article. A few take-aways:

  • returning raw ptr can indeed leak in real life. I would still do this in simple projects.
  • For a polymorphic type, return a unique_ptr by default. See also [[effModernC++]]. (Return shared_ptr only if the factory saves it in some internal state, but who does?)
  • For a non-polymorphic type, return a clone. Safely assume it’s copyable or movable in c++11.

I think it’s quite common (see [[safe c++]] P51) to have a factory function creating a heap “resource”. Given it’s on heap, factory must return the object by pointer. Easy to hit memory leak. Standard solution is return by ref count smart ptr.
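A minimal sketch combining the recommendations above, assuming C++14 for make_unique; Shape/Circle/makeShape are made-up names. The factory returns unique_ptr, so the heap object can’t leak even if the caller ignores the result, and the caller may still upgrade to shared_ptr:

#include <memory>
#include <iostream>

struct Shape { virtual void draw() const = 0; virtual ~Shape() = default; };
struct Circle : public Shape { void draw() const override { std::cout << "circle\n"; } };

std::unique_ptr<Shape> makeShape(){ return std::make_unique<Circle>(); }

int main(){
  auto s = makeShape();                      // unique_ptr on the stack -- destructed automatically
  std::shared_ptr<Shape> sp = makeShape();   // implicit upgrade to shared ownership if needed
  s->draw();
  sp->draw();
}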

Q: what if I forget to destruct the obj returned?
A: I feel it’s uncommon and unnatural. (A more common mistake is to forget deletion.) Often the object returned (i.e. the smart pointer object) is on the stack – no way to “forget”.

Note if a (64-bit) raw pointer variable is on the stack, then stack unwinding will deallocate/reclaim those 64 bits (the pointer variable itself), but won’t call delete on the pointee!

Q: what if the object returned is saved as a field like myUmbrella.resource1
A: I feel this is fine because the myUmbrella object will be destructed by someone’s responsibility. That would automatically destruct the smart pointer field this->resource1.

As an alternative to the ref-count smart ptr, we can also return a (boost) scoped pointer. See [[safe c++]].

Q: what if the factory needs to return a pointer to a stack object?
A: I don’t think so. The pointee goes out of scope….
A: I feel most “resources” are represented by a heapy thingy. Alternatively, it’s also conceivable to hold a “permanent” resource, e.g. a global object, in which case memory management is a non-issue.

By the way, [[safe c++]] also mentions the __common__ pattern of a (non-factory) function returning a pointer to an existing object, rather than a “new SomeClass”. The challenges and solutions are different.

c++atomic^volatile #fencing

See other posts about volatile, in both c++ and java.

For a “volatile” variable, the c++ (unlike java) compiler is not required to introduce a memory fence, i.e. updates by one thread are not guaranteed to be visible to other threads. See Eric Lippert on http://stackoverflow.com/questions/26307071/does-the-c-volatile-keyword-introduce-a-memory-fence

based on [[effModernC++]]

                                        atomic              volatile
optimization: reordering               prevented           permitted 😦
atomicity of RMW operations            enforced            not enforced
optimization: skip redundant access    permitted           prevented
cross-thread visibility of update      probably enforced   no guarantee [1]
memory fencing                         enforced            no guarantee

[1] https://en.wikipedia.org/wiki/Memory_barrier
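A small sketch of the RMW-atomicity row, with a trivial counter incremented by two threads: fetch_add on the std::atomic is an indivisible read-modify-write, whereas ++ on the volatile int can lose updates.

#include <atomic>
#include <thread>
#include <iostream>

std::atomic<int> atomicCounter{0};
volatile int volatileCounter = 0;   // volatile blocks some compiler optimizations, but ++ is NOT atomic

void work(){
  for (int i = 0; i < 100000; ++i){
    atomicCounter.fetch_add(1, std::memory_order_relaxed);  // atomic RMW
    ++volatileCounter;            // load/add/store -- threads can interleave and lose updates
  }
}

int main(){
  std::thread t1(work), t2(work);
  t1.join(); t2.join();
  std::cout << "atomic:   " << atomicCounter   << "\n";   // always 200000
  std::cout << "volatile: " << volatileCounter << "\n";   // often less than 200000
}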

non-virtual dtor: some random scenarios tested #IV

Background — How important are these scenarios? First off, tech quizzes are extremely important since you are judged over just a few questions. Second, these scenarios pop up by accident, rather than by design, all the time in real projects. You’d better learn to deal with a drunken driver while on the road.

Q1: what if Base and Derived dtor both non-virtual and an autoVar is destroyed?
AA: Tested — ~Derived() and then ~Base().  See post on DCBC.

Q2: What if Base dtor is non-virtual but Derived is virtual, and a Derived auto variable is destroyed on the stack?
AA: Same as Q1. For an autoVariable that’s not deleted via a ptr, the Derived dtor (virtual or not) runs, followed by the Base dtor. Same DCBC.

Q3: Base dtor is non-virtual but Derived is virtual. Derived heap pointer is assigned to a B pointer and deleted?
AA: only ~Base runs in my g++ test, though officially UB.

Q4: Base and Derived are virtual. Derived heap pointer is assigned to a B pointer and deleted?
AA: ~Derived() then ~Base(). DCBC + dynamic dispatch

Note the well-known __undefinedBehavior__ affects delete only, not stack variables or static variables. Note the virtual keyword matters only when deleting via a pointer; non-pointer (non-reference) variables aren’t affected.

The object to destruct is a Derived.

 #   ~B   ~D             destroyed how           test result   notes
 1   nv   nv             stack                   ~D ~B         DCBC + static dispatch
 2   nv   virtual        stack                   ~D ~B         DCBC + static dispatch
 3   nv   virtual        delete heap D via B*    ~B only       static dispatch; “virtual” ignored
 4   v    virtual        delete heap D via D*    ~D ~B         DCBC + dynamic dispatch
 5   v    implicitly v   delete heap D via D*    ditto         ditto
// test for scenario 3: ~B non-virtual, ~D virtual, delete heap D via a B*
#include <iostream>
using std::cout;

struct B{
  ~B(){ cout<<"~B\n";}            // non-virtual base dtor
};
struct D: public B{
  virtual ~D(){ cout<<"~D\n";}    // virtual, but never reached below
};
int main(){
  B* myD=new D();
  delete myD;   // officially UB; in my g++ test only ~B runs
}

Morris in-order walk]O(N) #O(1)space

I feel this is too hard and unlikely to come up in coding interviews. Also, most trees have up-links, so the no-uplink constraint is artificial and unrealistic. In reality, Morris complexity is usually unnecessary.

–threaded binary tree is the basis of Morris

https://en.wikipedia.org/wiki/Threaded_binary_tree

The Morris algo needs to construct the threaded BST, walk it, and remove the extra links, all without recursion.

  • Tip: For my ascending walk, I only add right-up thread link to bring me to my ancestors. I don’t need any leftward thread link.
  • Tip: All the thread links point upward
  • How many right-up links? Let’s define two clubs Honey and White.
    • every node HH having a left child will get an incoming right-up link pointing to it! After printing HH’s left subtree, this link is used to revisit HH.
    • Except the very last node, every node WW without a right child needs to add a right-up link, so we don’t get stuck at WW.
    • If the Honey club has H members and the White club has W members (excluding the very last node, as above), then I think we create exactly H right-up links, one per Honey member, and H = W.
    • A node can be both Honey and White, i.e. it is the target of one incoming right-up link and the source of one outgoing right-up link.
  • Tip: the sequence of Honey nodes to try has to be top-down. We must create the uplink to every higher Honey node first, as insurance that we can always come back.  While a higher Honey node is locked down and we search for its predecessor in its left subtree, we ignore any lower Honey node
    • First time we hit a Honey node, I’m sure it is not uplinked. After we get it uplinked, we must descend Left.
    • 2nd time we hit a Honey node, I believe its left subtree is completely fixed, so if we descend left again, we will go to a dead end. We must move right, either down or up.
  • Tip: should really use a few variables instead of “cur”. This “cur” is like a “temp1” variable used in first pass, then temp2 variable in 2nd pass etc. These variables are not related at all.

https://github.com/tiger40490/repo1/blob/cpp1/cpp1/binTree/MorrisInOrder.cpp is my c++ implementation with a concise/cryptic implementation and an “unfolded” elaborate implementation
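For reference, a minimal sketch of the textbook Morris in-order walk, not the repo version above; Node is a made-up struct with no uplinks:

#include <iostream>

struct Node { int val; Node *left; Node *right; };

void morrisInOrder(Node* root){
  Node* cur = root;
  while (cur){
    if (!cur->left){                 // no left subtree: visit, then follow right ptr (possibly a thread)
      std::cout << cur->val << ' ';
      cur = cur->right;
    } else {                         // a "Honey" node: locate its in-order predecessor
      Node* pred = cur->left;
      while (pred->right && pred->right != cur) pred = pred->right;
      if (!pred->right){             // first visit: create the right-up thread, then descend left
        pred->right = cur;
        cur = cur->left;
      } else {                       // second visit: left subtree done; remove thread, visit, go right
        pred->right = nullptr;
        std::cout << cur->val << ' ';
        cur = cur->right;
      }
    }
  }
}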

return value optimization, again

See also http://bigblog.tanbin.com/2010/09/return-value-optimization.html. There’s also a brief but complete item in [[More eff c++]]

Q : what are required for RVO? (From MS CVA interview. Obscure detail but a halo)
A: a copy/move ctor must be available (not private, not deleted)

Q: when would RVO kick in?
%%A:
– object created as a nonref auto/local/stackVar in the called function
– return by value (pbclone)
– the return value is used to initialize a new variable at the call site.

Q: what conditions would suppress RVO?
??

Such performance features are good to know, but I feel we don’t need to predict, count on or avoid RVO. In a given context, it may or may not happen and our app should work correctly nevertheless.
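A tiny experiment with a made-up Widget class to observe whether the copy is elided: if RVO/NRVO kicks in, “copy ctor” never prints.

#include <iostream>

struct Widget{
  Widget(){ std::cout << "ctor\n"; }
  Widget(const Widget&){ std::cout << "copy ctor\n"; }   // a viable copy/move is required for NRVO
};

Widget makeWidget(){
  Widget w;            // nonref local variable in the called function
  return w;            // returned by value (pbclone) -- candidate for NRVO
}

int main(){
  Widget w2 = makeWidget();   // return value initializes a new variable
  (void)w2;
}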

boost weak_ptr can access inner ptr only through a shared_ptr

Unlike shared_ptr, scoped_ptr and auto_ptr, a boost weak_ptr doesn’t overload operator-> and operator*. Ditto for std::weak_ptr.

There’s an important implication.[2]

“If I want to veto deletion of internal ptr, then I must join the club, i.e. the club must register me as –veto– member”.

As a passive observer, weak_ptr is helpless to prevent the internal pointer’s demise. There’s an important implication again [3].

[2] No direct means to get the raw pointer from a weak_ptr. You must create and go through a shared_ptr, either this->lock() or shared_ptr conversion ctor.  I feel a weak_ptr instance is an observer using a shared_ptr instance as a lens.

[3] User of weak_ptr can’t reliably use the referenced resource, since the resource could disappear any time.

Crucially, a std::weak_ptr can be empty, as explained on P96 [[std c++lib]]. We can detect weak_ptr emptiness when the inner ptr is deleted. In contrast, raw ptr doesn’t offer this feature — if the underlying heap object is reclaimed, the raw ptr becomes dangling and we have absolutely no way to detect that, according to my friend Paul M.
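A minimal sketch of the “lens” idea with std::weak_ptr: the only way to reach the pointee is to lock() into a shared_ptr, and the same weak_ptr also detects that the pointee is gone:

#include <memory>
#include <iostream>

int main(){
  std::weak_ptr<int> wp;
  {
    auto sp = std::make_shared<int>(42);
    wp = sp;                                   // weak_ptr observes; it doesn't co-own
    if (auto lens = wp.lock())                 // must go through a shared_ptr to dereference
      std::cout << "value: " << *lens << "\n";
  }                                            // sp destroyed here; pointee deleted
  if (wp.expired())
    std::cout << "pointee gone; the weak_ptr can tell\n";
}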

stack to heap trend in languages

  • I guess you seldom see malloc() and free() in business applications. I have seen a number of C applications but didn’t notice them. However, how do C pointers deal with automatic destruction of the objects they reference? I guess all such objects are declared and used in the same scope, never declared in an inner function and used in an outer one?
  • C++ biz apps use lots of new(), but by default, if you don’t call new() or malloc(), C++ uses the stack and global area. In terms of language support and encouragement, stack is still easier to use than heap in C++. I consider C++ as a transition between C and newer languages.
  • STL seems to be mostly heap, to support expansion
  • Java uses the heap more than the stack. Every object is on the heap. Memory leaks become too likely, so Java had to provide a garbage collector.
Conclusion: going from c -> c++ -> STL -> java, I see less stack and more heap usage. However, Stroustrup pointed out the performance advantage of stack over heap.