non-volatile field can have volatile behavior #DQH

Unsafe.getObjectVolatile() and setObjectVolatile() should be the only means of accessing the field.

I think for an integer or boolean field (very important use cases), we need to use Unsafe.putIntVolatile() and Unsafe.getIntVolatile()

Q: why not use a volatile field?
A: I guess in some designs, a field need not be volatile at most access points, but at one access point it needs to behave like a volatile field.  Qihao agrees that we want to control when we insert a load/store fence.

Non-volatile behavior usually has lower latency.

 

Is java/c# interpreted@@No; CompiledTwice!

category? same as JIT blogposts

Q: are java and c# interpreted? QQ topic — academic but quite popular in interviews.

https://stackoverflow.com/questions/8837329/is-c-sharp-partially-interpreted-or-really-compiled shows one explanation among many:

The term “interpreter”, when referring to a runtime, generally means existing code interprets some non-native code. There are two large paradigms — parsing: reads the raw source code and takes logical actions; bytecode execution: first compiles the code to a non-native binary representation, which requires far fewer CPU cycles to interpret.

Java originally compiled to bytecode, then went through an interpreter; now, the JVM reads the bytecode and just-in-time compiles it to native code. CIL does the same: The CLR uses just-in-time compilation to native code.

C# compiles to CIL, which the JIT then compiles to native code; by contrast, Perl immediately compiles a script to bytecode, and then runs this bytecode through an interpreter.

multicast routing: efficient data duplication

The generic multicast routing algorithm has optimization features that offer efficient data duplication.

— optimized message duplication at routing layer (Layer 3?)

https://www.metaswitch.com/knowledge-center/reference/what-is-multicast-ip-routing explains that

Routers duplicate data packets and forward multiple copies wherever the path to recipients diverges. Group membership information is used to calculate optimal branch points i.e. the best routers at which to duplicate the packets to optimize the use of the network.

The multicast distribution tree of receiving hosts holds the route to every recipient that has joined the multicast group, and is optimized so that

  • multicast traffic does not reach networks that do not have any such recipients (unless the network is a transit network on the way to other recipients)
  • duplicate copies of packets are kept to a minimum.

RBTree O(1)insert quite common ] coding IV

— for single-insert
https://en.cppreference.com/w/cpp/container/map/insert is precise, saying that hinted insert is amortized O(1) if the insertion happens just before the hint position.
— for mass insert
Special case — if we know the input sequence is pre-sorted and is going to be “appended” to the existing tree nodes, then each insert will always hit end(). We can rely on hinted insert to achieve O(N).

http://www.cplusplus.com/reference/map/map/insert/ is even more encouraging — “If N elements are inserted, Nlog(size+N) in general, but linear in size+N if the elements are already sorted”, so as long as the input is pre-sorted, we get O(N).

For 64-bit integer or float inputs, we can radix sort in O(N) then mass-insert in O(N).

— for construction from a pre-sorted sequence
https://en.cppreference.com/w/cpp/container/map/map confirms it’s O(N) if the input range is already sorted.
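
Below is a minimal sketch of the hinted mass-insert, assuming the keys arrive pre-sorted (the keys 10/20/30/40 are made up for illustration). Passing end() as the hint makes each insert amortized O(1), so the whole loop is O(N):

#include <cstdio>
#include <map>
#include <utility>
#include <vector>

int main() {
    std::vector<int> sortedKeys = {10, 20, 30, 40};   // pre-sorted input
    std::map<int, int> tree;
    for (int k : sortedKeys) {
        // hint = end() because every new key is the largest so far,
        // so each hinted insert is amortized O(1)
        tree.insert(tree.end(), std::make_pair(k, k * k));
    }
    std::printf("size = %zu\n", tree.size());
}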

pre-allocated array as backing store for graph nodes #java No

I think any graph node can use the same technique, but here I present a simple yet interesting use case — a linked list with each node allocated from an array. https://github.com/tiger40490/repo1/blob/cpp1/cpp/lang_66mem/slistFromArray.cpp shows three home-made implementations:

  1. backing array of dummy link nodes, pre-allocated at compile time
  2. backing array of dummy link nodes, pre-allocated from free store aka DMA
  3. backing array is a byte array on heap or data section. Each link node is constructed via placement-new.
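
Below is a minimal sketch of variant #3, my own illustration rather than the repo code; the Node layout and the create() helper are made-up names. Each node is constructed in-place via placement-new, and a node’s “address” is simply its index into the backing array:

#include <cstddef>
#include <iostream>
#include <new>

struct Node {
    int payload;
    int nextIdx;          // index into the backing array, -1 means null
    Node(int p) : payload(p), nextIdx(-1) {}
};

int main() {
    const std::size_t CAP = 1024;
    // raw backing store; a static buffer lives in the data/bss section
    alignas(Node) static unsigned char backing[CAP * sizeof(Node)];

    std::size_t count = 0;                 // creation order = auto-increment id
    auto create = [&](int payload) -> int {
        void* slot = backing + count * sizeof(Node);
        new (slot) Node(payload);          // placement-new, no heap allocator call
        return static_cast<int>(count++);
    };

    int head = create(111);
    int second = create(222);
    reinterpret_cast<Node*>(backing + head * sizeof(Node))->nextIdx = second;

    std::cout << "node count = " << count << '\n';
}

Note the returned index doubles as the auto-increment ID discussed under Advantage #3 below.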

Here are a few advantages, which I consider minor because a linked list is seldom needed in low-latency systems:

  1. d-cache efficiency
  2. eliminates runtime load on heap allocator, since memory is pre-allocated. See malloc=long considered costly

Advantage #3: for c++ algo questions, this set-up has an interesting advantage — the node address is now an index into the backing array. This index is a natural auto-increment ID, based on creation order.

Now, the biggest advantage of linked list over vector is mid-stream insert/delete. One of the biggest disadvantages is lack of random-access. If nothing happens mid-stream (as in coding questions), then we can achieve random-access-by-id using array as backing store.

If nothing happens mid-stream, then this linked list is physically similar to an array with extra capacity.

This technique won’t work in java because java array of Node is array-of-pointers.

RBTree used inside java8 hashmap

https://en.wikipedia.org/wiki/Hash_table#Separate_chaining_with_other_structures is a language-neutral discussion.

https://digitalmars.com/d/archives//640.html#N803 shows an early implementation

they are implemented as a hashed array of binary trees. 
My experience with them is that they are very efficient.

— JEP 180 is the white paper that introduced a self-balancing tree as an alternative to the linked list.

https://dzone.com/articles/hashmap-performance shows that with one million entries in a HashMap, a single lookup takes 20 CPU cycles, or less than 10 nanoseconds. Another benchmark demonstrates O(logN) get(key) in java8, but O(N) in java7, as traditionally known, N being the number of colliding elements in a single bucket.

If two hashCode() values are different but end up in the same bucket (due to rehashing and bucketing), one is considered bigger and goes to the right. If hashCodes are identical (as in a contrived hashCode() implementation), HashMap hopes that the keys are Comparable, so that it can establish some order. This is not a requirement on HashMap keys.

If hashCodes are mostly identical (rare) and keys are not comparable, don’t expect any performance improvements in case of heavy hash collisions. Here’s my analysis of this last case:

  • if your Key.equals() is based on address and hashCode() is mostly the same, then the RBTree ordering can and should use address. You won’t be able to look up using a “clone” key.
  • if you have customized Key.hashCode() then you ought to customize equals(), but suppose you don’t implement Comparable; then you are still allowed to look up using a clone key. Since there’s no real ordering among the tree nodes, the only way to look up is running equals() on every node.

http://coding-geek.com/how-does-a-hashmap-work-in-java/#JAVA_8_improvements says

A given bucket contains both Node (linked list) and TreeNode (red-black tree). Oracle decided to use both data structures with the following rules:

  • If for a given index (bucket) in the inner table there are more than 8 nodes, the linked list is transformed into a red black tree
  • If for a given index (bucket) in the inner table there are less than 6 nodes, the tree is transformed into a linked list

With the self-balanced tree replacing the linked list, worst-case lookup, insert and delete are no longer O(N) but O(logN) guaranteed.

This technique, albeit new, is one of the best simple ideas I have ever seen. Why has nobody thought of it earlier?

RBTree: technical notes #Realtime

Red–black trees offer … worst-case guarantees, valuable in real-time applications. The Completely Fair Scheduler used in current Linux kernels and epoll system call implementation[19] uses red–black trees.

This valuable guarantee requires an always-balanced tree, though not a strictly balanced one. In fact, AVL is more rigidly balanced, yet (per the quote further below) it lacks the constant-amortized-rebalancing guarantee on deletion.

In contrast, a hash table doesn’t offer a worst-case guarantee in the face of hash collisions. In fact, Java 8 HashMap uses an RBTree in addition to the linked list… see my blogpost RBTree used in java hashmap.

After an insert or delete, restoring the red-black properties requires a small number (O(log n) or amortized O(1)) of color changes (which are very quick in practice) and no more than three tree rotations (two for insertion). Although insert and delete operations are complicated logically, their times remain O(log n).

The AVL tree is another structure supporting O(log n) search, insertion, and removal. AVL trees can be colored red-black, thus are a subset of RB trees. Worst-case height is better than the worst-case height of RB trees, so AVL trees are more rigidly balanced. However, Mehlhorn & Sanders (2008) point out: “AVL trees do not support constant amortized deletion costs”, but red-black trees do.[25]

CPU run-queue #java perspective

— Mostly based on Charlie Hunt’s [[JavaPerf]] P28

Runtime.availableProcessors() returns the count of virtual processors, i.e. the count of hardware threads. This is an important number for CPU tuning and bottleneck analysis.

When run-queue depth exceeds 4 times the processor count, the host system becomes visibly slow (presumably due to excessive context switching). For a host dedicated to a jvm, this is a second sign of CPU saturation. The first sign is high CPU usage, which can spike even with a single CPU-hog.

Note run-queue depth is the first column (“r”) in vmstat output.

QuickSelect: select nth key#O(1)space

Quickselect has average O(N) runtime but worst-case O(N^2), if the partitioning goes bad repeatedly. A randomized pivot can reduce the chance but I don’t think we can eliminate it.

Space complexity is O(1) despite the apparent recursive set-up. Tail recursion optimization is usually available to reuse the same stack frame. Otherwise, recursion can be replaced by a loop.

std::nth_element() is linear on average but quadratic in worst case — https://stackoverflow.com/questions/11068429/nth-element-implementations-complexities explains QuickSelect algo

Quickselect was devised by Tony Hoare, the inventor of quicksort.

https://en.wikipedia.org/wiki/Quickselect
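
Below is a minimal iterative quickselect sketch, my own illustration with a naive last-element (Lomuto) pivot; the loop replaces tail recursion, so extra space stays O(1). A randomized pivot would lower the odds of the O(N^2) worst case:

#include <iostream>
#include <utility>
#include <vector>

// returns the k-th smallest element (0-based), rearranging 'a' in the process
int quickselect(std::vector<int>& a, int k) {
    int lo = 0, hi = static_cast<int>(a.size()) - 1;
    while (lo < hi) {
        int pivot = a[hi], store = lo;      // Lomuto partition
        for (int i = lo; i < hi; ++i)
            if (a[i] < pivot) std::swap(a[i], a[store++]);
        std::swap(a[store], a[hi]);         // pivot lands at 'store'
        if (k == store) return a[k];
        if (k < store)  hi = store - 1;     // keep searching one side only
        else            lo = store + 1;
    }
    return a[k];
}

int main() {
    std::vector<int> v = {7, 1, 5, 3, 9};
    std::cout << quickselect(v, 2) << '\n'; // prints 5, the median
}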

## halo knowledge to impress west coast interviews

My naive list of "halo" technical knowledge that might impress some west coast interviewers.

  • order stat tree
  • Morris tree walk – O(1) space
  • path compression in union-find
  • hash table worst-case lookup is no longer O(N), thanks to the java8 treeified buckets
  • shortest-path algorithms
  • Fibonacci and other advanced heaps — I discussed it at Facebook
  • tail recursion to reduce space complexity
  • worst case quick sort and randomized pivot selection
  • bucket sort
  • insertion sort as the fastest sort for tiny or nearly-sorted inputs
  • radix sort for floats and strings

XOR: key points for algo IV

  • usage: swap two int variables. I think the “minus” trick is easier to understand
  • usage: doubly-linked list … MLP-sg connectivity java IV#2
  • usage: ‘1’ XOR an unknown bit among neighboring bits => TOGGLE the bit
  • usage: If you apply an all-1 toggle (bit vector), you get the “bitwise NOT” also known as “one’s complement”
  • Like AND and OR, XOR is bitwise, meaning each bit position is computed independently — nice simplicity

If you are looking for a one-phrase intro to the 2-input XOR, consider

) TOGGLE ie toggle the selected bits. If you apply a toggle twice, you get the original.
) DIFFERENCE ie difference gate, as a special case of “odd number of ONEs”

See https://hackernoon.com/xor-the-magical-bit-wise-operator-24d3012ed821

— how about a bunch of bits to XOR together?

Wikipedia points out —  a chain of XORs — a XOR b XOR c XOR d (and so on) — evaluates to ONE iFF there is an odd number of ONEs in the inputs. Every pair of toggles cancels out.
Venn 0110 1001.svg is the Venn diagram for a xor b xor c. Red means True. Each of the three circles has a simple meaning: if your dart lands inside the ‘a’ circle, then a=True; outside it, a=False. You can see that your dart lands on red (i.e. True) only when it is encircled an odd number of times. Note your dart is unable to land outside the big circle.

Insight — iFF you toggle a NULL (taken as False) an odd number of times, it becomes True. Therefore, if among the N input bits the count of True (toggles) is odd, then the result is True.

Does operand order matter? No: XOR is commutative and associative.

https://leetcode.com/problems/single-number/ has an O(N) time, O(1) space solution using this hack. It is applicable to a collection of floats, or dates, or any serializable objects.
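
A minimal sketch of that single-number hack, assuming ints for simplicity: duplicate pairs toggle the accumulator twice and cancel out, leaving the lone value.

#include <iostream>
#include <vector>

int singleNumber(const std::vector<int>& nums) {
    int acc = 0;
    for (int x : nums) acc ^= x;   // each duplicate pair toggles twice, i.e. a no-op
    return acc;
}

int main() {
    std::cout << singleNumber({4, 1, 2, 1, 2}) << '\n';  // prints 4
}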

 

biasedLocking^lockfree^ST-mode

This is a pure QQ topic, but you can score a few points by mentioning it.

[[javaPerf2014]] P268 says biased locking is enabled by default but can be disabled to improve performance in app servers using a thread pool and contended locks. It explains the reason i.e. biased locking comes with a price — a runtime overhead.

https://stackoverflow.com/questions/9439602/biased-locking-in-java is very concise — un-contended locking may incur zero cost, as claimed in the accepted answer

  • .. incurs no synchronization cost. I suppose this is typically geared towards overly conservative code that performs locks on objects without ever exposing them to another thread. The actual synchronization overhead will only kick in once another thread tries to obtain a lock on the object.
  • Biased locking is the default since java6: -XX:+UseBiasedLocking improves the performance of uncontended synchronization. An object is “biased” toward the thread which first acquires its monitor; subsequent monitor-related operations are relatively faster. Some applications with significant amounts of uncontended synchronization may attain significant speedups

Is this technique considered lockfree? No, but some may speculate that it might be even faster than lockfree. So if you suspect that most of the time only one thread accesses a critical section, you could choose to rely on the (java6 default) biased locking rather than a lockfree solution. Most of the time this mutex-based design would challenge (if not beat) lockfree performance.

However, I believe single-threaded mode is still faster: a thread isn’t aware of other threads, as if it were the only thread accessing those objects, i.e. no shared mutable state. There would be no lock grab and no memory fencing at all. [[javaPerf2014]] P375 agrees.

cpu sharing among Docker container for jvm

Note cgroups are also usable beyond the jvm and Docker, but I will just focus on a jvm running in a Docker container.

Based on https://jaxenter.com/nobody-puts-java-container-139373.html

CPU shares are the default CPU isolation (or sharing??) and basically provide a priority weighting across all cpu time slots across all cores.

The default weight of any process is 1024, so if you start a container as follows q[ docker run -it --rm -c 512 stress ] it will receive fewer CPU cycles than a default process/container.

But how many cycles exactly? That depends on the overall set of processes running on that node. Let us consider two cgroups A and B.

sudo cgcreate -g cpu:A
sudo cgcreate -g cpu:B
sudo cgset -r cpu.shares=768 A   # cgroup A: 75%
sudo cgset -r cpu.shares=256 B   # cgroup B: 25%

Cgroup A has 768 CPU shares and B has 256. That means that, if nothing else is running on the system, A is going to receive 75% of the CPU time and B will receive the remaining 25%.

If we remove cgroup A, then cgroup B ends up receiving 100% of the CPU time.

https://access.redhat.com/documentation/en-us/red_hat_enterprise_linux/6/html/resource_management_guide/sec-cpu has more precise details.

https://scoutapp.com/blog/restricting-process-cpu-usage-using-nice-cpulimit-and-cgroups compares q(nice), cpulimit and cgroups. It provides more precise info on cpu.shares.

cpulimit can be used on an existing PID 1234:

cpulimit -l 50 -p 1234 # limit process 1234 to 50% of cpu timeslots. The remaining cpu timeslots can go to other processes or go to waste.

##Java9 features #fewer than java8

  1. #1 Most important – modular jars featuring declarative module-descriptors i.e. requires and exports
  2. #2 linux cgroup support.. For one example, see Docker/java9 cpu isolation/affinity
  3. #3 G1 becoming default JGC.. CMS JGC: deprecated in java9
  4. REPL JShell
  5. private interface methods, either static or non-static
  6. Minor: C++11 style collection factory methods like

List<String> strings = List.of("first", "second");


It’s unbelievable but not uncommon in Java history —

  • Java9 release introduced significantly fewer and less impactful features than java8.
  • Similarly, java5 overshadows java6 and java7 combined

 

UseSpinning jvm flag #pthr/kernel/c#

Hi XR,

Around 2011 I once said a java thread waiting for a lock would not keep grabbing the lock…. Wrong! Actually it does keep grabbing for a while. It runs some useless CPU instructions and checks the lock periodically. This is really spin locking inside the JVM.

Spin locking wastes CPU but frequently results in faster application performance (reducing context switches). Therefore, the JVM runtime uses some policy to decide how long a thread spins before placing the thread in some kind of “parking lot” (my terminology) so the thread doesn’t occupy CPU any more. This spin phase used to be configurable via the boolean UseSpinning JVM flag, but in java7 and beyond this flag has no effect — the JVM always has the freedom to use spinning. This is described in [[javaPerf2014]].

Comparison 1 — Dotnet SpinWait() provides exactly the same spinlock feature, as described in https://docs.microsoft.com/en-us/dotnet/standard/threading/spinwait

Comparison 2 — In the linux Kernel (not userland), 1) spinlock and 2) counting semaphore are the two locking primitives, as described on Page 164 of [[understandingLinuxKernel]]. Both locks block a “grabbing” thread from getting into a critical section until another thread leaves it. Similar to my 2011 description, the kernel semaphore remembers all sleeping processes (P174) and wakes one of them when the critical section is clear. I think this kernel semaphore is used by the scheduler in the kernel.

Comparison 3 — In the linux pthreads library (the basis of the linux JVM), the locking constructs are implemented using system calls and userland C functions. Pthreads, as a userland library, probably can’t use the above two kernel constructs directly. A userland C function can implement a spinlock, but to implement “parking” I assume kernel support and system calls are required. So far I assume each pthreads thread maps to a native thread controlled by the kernel. This mapping is now the dominant usage.
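
As a rough illustration (not the JVM’s or pthreads’ actual implementation), here is what a userland spinlock looks like in C++: the waiting thread burns CPU re-testing a flag, whereas “parking” the thread would require kernel support such as a system call.

#include <atomic>
#include <iostream>
#include <thread>

std::atomic_flag flag = ATOMIC_FLAG_INIT;
long counter = 0;

void lock()   { while (flag.test_and_set(std::memory_order_acquire)) { /* spin, burning CPU */ } }
void unlock() { flag.clear(std::memory_order_release); }

int main() {
    auto work = [] {
        for (int i = 0; i < 100000; ++i) {
            lock();
            ++counter;          // critical section, protected by the spinlock
            unlock();
        }
    };
    std::thread t1(work), t2(work);
    t1.join(); t2.join();
    std::cout << counter << '\n';   // 200000
}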

The alternative mapping model is green threads, where userland threads are invisible to kernel. Parking has to be implemented in userland.  Nowadays it’s rare to see green threads.

parts@mv-semantics impl: helicopter/hist view

Move semantics is 90% compile-time + 10% run-time programming. 95% in-the-fabric and invisible to us.

  • 90% TMP
    • 70% is about special casts — in the form of std::move and std::forward
    • std::swap
  • 10% traditional programming
    • RAII to destroy the “robbed” rvalue object
    • heap ptr assignment [1] to rob the resource

[1] This “stealing” is the tip of the iceberg, the most visible part of move semantics. Bare-hand stealing was doable in c++03, but too dangerous. The rvr is a fundamental language feature that makes the “steal” safer:

  • a safety device around the dangerous “steal”
  • a safety glove
  • a controlled destruction

The resource is typically a heapy thingy, a common fixture in all STL containers, std::string and smart pointers. Therefore, move semantics is widely used only in libraries, not in applications.
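
Here is a minimal sketch of that 10% “traditional programming” part, using a made-up move-only Str class: the move ctor steals the heap pointer and leaves the robbed object safely destructible.

#include <cstring>
#include <iostream>
#include <utility>

class Str {
    char* buf;                                   // the heapy resource
public:
    explicit Str(const char* s) : buf(new char[std::strlen(s) + 1]) {
        std::strcpy(buf, s);
    }
    Str(Str&& other) noexcept : buf(other.buf) { // steal the pointer ...
        other.buf = nullptr;                     // ... and defuse the robbed object
    }
    ~Str() { delete[] buf; }                     // RAII destroys the robbed object harmlessly
    Str(const Str&) = delete;                    // keep the sketch move-only
    const char* c_str() const { return buf ? buf : ""; }
};

int main() {
    Str a("hello");
    Str b(std::move(a));                         // rvr enables the controlled "steal"
    std::cout << b.c_str() << '\n';              // prints hello
}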

IV Q: implement op=()using copier #swap+RAII

A 2010 interviewer asked:

Q: do you know any technique to implement the assignment operator using an existing copy ctor?
A: Yes. See P100 [[c++codingStd]] and P347 [[c++cookbook]]. There are multiple learning points:

  • RAII works with a class owning some heap resource.
  • RAII works with a local stack object owning a heapy thingy
  • RAII provides much-needed exception safety. For the exception guarantee to hold, std::swap and operator delete should never throw, as echoed in my books.
  • I used to think only half the swap (i.e. transfer-in) is needed. Now I know the transfer-out of the old resource is more important. It guarantees the old resource is released even if an exception is thrown, thanks to RAII.
  • The std::swap() trick needed here is powerful because there’s a heap pointer field. Without this field, I don’t think std::swap will be relevant.
  • self-assignment check — not required as it is rare and tolerable
  • Efficiency — considered highly efficient. The same swap-based op= is used extensively in the standard library.
  • idiomatic — this implementation of operator= for a resource-owning class is considered idiomatic, due to simplicity, safety and efficiency

http://www.geeksforgeeks.org/copy-swap-idiom-c/ shows one technique. Not sure if it’s best practice. Below is my own:

C & operator=(C const & rhs){
  C localCopy(rhs); // deep copy via the copy ctor; this step is not needed for move-assignment
  std::swap(localCopy.heapResource, this->heapResource);
  return *this;
} // at the end of this function, localCopy is destructed, releasing the ORIGINAL this->heapResource

C & operator=(C && rhs){ // move-assignment
  std::swap(rhs.heapResource, this->heapResource); // rhs releases the old resource when it dies
  return *this;
}

 

std::defer_lock #kill1@4deadlock factors

Only in the lock() call does the thread grab both locks together. This breaks “incremental acquisition”, one of the four deadlock conditions.

Sample code from https://en.cppreference.com/w/cpp/thread/lock_tag

  std::mutex mutex1, mutex2;
  std::unique_lock<std::mutex> ulock1(mutex1, std::defer_lock); // constructed, NOT locked yet
  std::unique_lock<std::mutex> ulock2(mutex2, std::defer_lock); // constructed, NOT locked yet

  // now grab both locks together, in a deadlock-avoiding order
  std::lock(ulock1, ulock2);

1)template code bloat 2)inline..hurt i-cache@@

Q: does template code bloat increase instruction cache pressure and increase risk of cache thrashing?
A: I have seen about 8 forum mentions of template code bloat increasing i-cache pressure, but no clear evidence and no consensus among authorities. In fact, no known authority has confirmed this issue.

However, I would think having many similar functions in the executable will likely lead to unnecessary competition for i-cache.

In contrast, for inlining there is consensus. wikipedia states — excess inlining will hurt speed, due to inlined code consuming too much of the instruction cache .. Inlining imposes a cost on performance, due to the code expansion (due to duplication) hurting instruction cache performance.[5] This is most significant if, prior to expansion, the working set of the program (or a hot section of code) fit in one level of the memory hierarchy (e.g., L1 cache), but after expansion it no longer fits, resulting in frequent cache misses at that level.

See also inline: footprint+perf can Backfire ! #Google

fastestSortFor: 1-str; arr@Str; 64bit floats #non-comparison

I think the big-O coding question may require this knowledge, because

….real world sorting problems can often use key-based sorting, rather than comparison-based sorting.

Therefore, if your overall solution requires sorting, don’t assume it becomes O(J*K + M^2 + P*N logN) based on an O(N logN) sorting assumption. Here are some achievable linear-time sorts:

sort     | O(?) | data type                  | ref    | notes
counting | O(N) | chars                      |        |
radix    | O(N) | 64-bit ints                |        | O(Nw) -> O(N)
radix    | O(N) | floats                     | [3][4] | IEEE format
radix    | O(N) | variable-size strings      | [1/2]  | #1 fastest
burst    | O(N) | variable-size strings      |        | #2 fastest; key-based, trie-based
radix    | O(N) | date/time                  |        | convert to fixed-length strings or fixed-size integers
bucket?  | O(N) | random ints in python/perl |        | unbounded… no limit on #digits
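
As a minimal illustration of the first row, here is a counting sort for a single string of plain chars (the 256-bucket size is an assumption for 8-bit chars). It is key-based, not comparison-based, so it runs in O(N + 256), effectively O(N):

#include <array>
#include <iostream>
#include <string>

std::string countingSort(const std::string& in) {
    std::array<int, 256> freq{};                    // one bucket per char value
    for (unsigned char c : in) ++freq[c];           // count each key
    std::string out;
    out.reserve(in.size());
    for (int c = 0; c < 256; ++c)                   // emit keys in ascending order
        out.append(freq[c], static_cast<char>(c));
    return out;
}

int main() {
    std::cout << countingSort("banana") << '\n';    // prints aaabnn
}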

thread^process: lesser-known differences4IV

Popular IV question. Largely a QQ question.  Some may consider it zbs.

To the kernel, there are many similarities between the “thread” construct and the “process” construct. In fact, a (non-kernel) thread is often referred to as a LightWeightProcess in many kernels such as Solaris and Linux.

  • context switching — is faster between threads than between processes because no TLB flush needed.
    • In linux, context switching between kernel-threads is even faster.
  • creation — some thread libraries can create threads without the kernel knowing. No such thing for a process.
  • socket — 2 threads in a process can access the same socket; two processes usually can’t access the same socket, unless …
    • .. unless parent-child. See post on fork()
  • memory — thread AA can access all heap objects of Thread BB, and even BB’s stack objects via pointers. Two processes can’t share these, except via shared memory.
  • a non-kernel thread can never exist without an owner process. In contrast, every process always has a parent process which could be long gone.

 

L1/L2/L3 latency stats@2012 processor #Martin

I said “1000 times” when a GS interviewer asked for an estimate of relative latency of main memory vs register. He said that’s about right.

Numbers below were taken from the CPU Cache Flushing Fallacy blog post by Martin Thompson, which indicates that for a particular 2012-era Intel processor, the following was observed:

  • register access = single cycle or 4 instructions per cycle
  • L1 latency = 3 cycles (3 x register)
  • L2 latency = 12 cycles (4 x L1, 48 x register)
  • L3 latency = up to 38 cycles (3 x L2, 12 x L1, 144 x register)
  • MM Latency= very roughly 200 cycles  (5 x L3, 15 x L2, 60 x L1, 720 x register) = average 65 ns on a 3 GHz CPU

The diagram is more simplified than the text, but there is a lot of fine print.

## optimize code for i-cache: few tips

I don’t see any ground-breaking suggestions. I think only very hot functions (confirmed by oprofile + cachegrind) require such micro-optimization.

I like the function^code based fragmentation framework on https://www.eetimes.com/document.asp?doc_id=1275472 (3 parts)

  • inline: footprint+perf can backfire. Can be classified as embedding
  • use table lookup to replace an “if” ladder — minimize jumps (see the sketch after this list)
  • branching — refactor a lengthy-n-corner-case (not “hot”) code chunk out to a function, so 99% of the time the instruction cache (esp. the pre-fetch flavor) doesn’t load a big chunk of cold stuff.
    • this is the opposite of embedding !
  • Trim the executable footprint. Reduce code bloat due to inlining and templates?
  • loop unrolling to minimize jumps. I think this is practical and time-honored — at aggressive optimization levels some compilers actually perform loop unrolling! Programmers can do it manually.
  • Use array (anything contiguous) instead of linked list or maps to exploit d-cache + i-cache
  • https://software.intel.com/en-us/blogs/2014/11/17/split-huge-function-if-called-by-loop-for-best-utilizing-instruction-cache is a 2014 Intel paper — split huge function if it’s invoked in a loop.
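
Here is a minimal sketch of the table-lookup tip above (made-up example, assuming scores in 0..100): the if-ladder compiles to a chain of conditional jumps, while the table version is a single indexed load, which is friendlier to the i-cache and the branch predictor.

#include <array>
#include <iostream>

// branchy version: an if/else ladder
int gradeBranchy(int score) {
    if (score >= 90) return 4;
    else if (score >= 80) return 3;
    else if (score >= 70) return 2;
    else if (score >= 60) return 1;
    else return 0;
}

// table version: one lookup, no conditional jumps (assumes 0 <= score <= 100)
int gradeTable(int score) {
    static const std::array<int, 11> table = {0,0,0,0,0,0,1,2,3,4,4}; // index = score/10
    return table[score / 10];
}

int main() {
    std::cout << gradeBranchy(85) << ' ' << gradeTable(85) << '\n';   // 3 3
}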

 

-XX:CompileThreshold to control JIT priming

## how might jvm surpass c++]latency #MS #priming has 2 other tips from jvm trading engine veterans

https://www.theserverside.com/tip/Improving-Java-performance-by-minimizing-Virtual-Machine-JVM-latency is a 2014 short piece focused on q[ -XX:CompileThreshold ], which sets the number of method invocations before Hotspot will compile a method to native code.

The -server VM defaults to 10,000 and -client defaults to 1500.

Smaller numbers reduce priming time, but very low numbers mean the server starts considerably slower because of the time taken by the JIT to compile too many methods (which may not be used that often after all).

CRTP,ADL,thread_local to replace old-school

  • thread_local variable to replace member data .. q[ static thread_local ] in %%production code
    • eliminates pollution
  • ADL is often chosen to replace member operators and methods .. see ADL #namespace
    • reduces coupling
  • CRTP is often chosen to replace runtime binding (dynamic dispatch) of a virtual function call (see the sketch below)
    • template-only
    • shaves a few clock cycles in HFT

I think all three are advanced techniques.
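
Here is a minimal CRTP sketch, my own illustration with made-up names: the base template calls into the derived class through a compile-time downcast, so the call is resolved statically (and can be inlined) instead of going through a vtable.

#include <iostream>

template <class Derived>
struct OrderBase {
    void submit() {                               // non-virtual
        static_cast<Derived*>(this)->doSubmit();  // static dispatch into the derived class
    }
};

struct FixOrder : OrderBase<FixOrder> {
    void doSubmit() { std::cout << "FIX order sent\n"; }
};

int main() {
    FixOrder o;
    o.submit();     // no virtual call, no vtable lookup
}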

RVO^move : on return value

Let’s set the stage. A function returns a local Trade object “myTrade” by value. Will RVO kick in, or move semantics? Not both!

I had lots of confusion about these 2 features. [[effModernC++]] P176 has a long discussion and a piece of advice — do not write std::move() hoping to “help” the compiler on a local object being returned from a function:

  • If the local object is eligible for RVO then all compilers would elide the copy. Your std::move() would hinder the compiler and backfire.
  • if the local object is ineligible for RVO then compilers are required to treat the returned local as an rvalue, implicitly as if std::move() were applied, so your help is unneeded.
    • Note a local object returned by value (a clone) is a naturally-occurring temp object.

P23 [[c++stdLib]] gives a 2-line answer:

  1. if Trade class has a suitable copy or move ctor, then compiler may choose to “elide the copy”. This was long implemented as RVO optimization in most compilers before c++11. https://github.com/tiger40490/repo1/blob/cpp1/cpp/rvr/moveOnlyType_pbvalue.cpp is my experiment.
  2. otherwise, if Trade class has a move ctor, the myTrade object is robbed

So if the conditions for RVO are present, then most likely your move-ctor will NOT run.
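
Here is a minimal sketch of the scenario, with a made-up Trade class that prints from each ctor. Returning the local by value lets RVO/NRVO build the object directly in the caller’s storage; if elision doesn’t happen, the compiler treats the local as an rvalue and the move ctor runs. Writing return std::move(t) would only disable RVO:

#include <iostream>
#include <string>
#include <utility>

struct Trade {
    std::string symbol;
    Trade(std::string s) : symbol(std::move(s)) { std::cout << "ctor\n"; }
    Trade(const Trade& o) : symbol(o.symbol)    { std::cout << "copy\n"; }
    Trade(Trade&& o) noexcept : symbol(std::move(o.symbol)) { std::cout << "move\n"; }
};

Trade makeTrade() {
    Trade t("IBM");
    return t;          // do NOT write std::move(t) here
}

int main() {
    Trade myTrade = makeTrade();   // typically prints just "ctor", thanks to RVO/NRVO
    std::cout << myTrade.symbol << '\n';
}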

c++11 atomic{int}^AtomicInteger.java #Bool can be half-written

The #1 usage of atomic<int> is load() and store(). I will use short form “load/store” or “l/s”.

The #2 usage is CAS. Interviewers are mostly interested in this usage, though I won’t bother to remember the function names —
* compare_exchange_strong()
* compare_exchange_weak()


The CAS usage is the same as AtomicInteger.java, but the load/store usage is more like the thread-safety feature of Vector.java. To see the need for load/store, we need to realize that a simple “int” assignment is not atomic [1]:

  • P1012 [[c++ standard library]] shocked me by affirming that without locks, you can read a “half-written Boolean” [1].

To solve this problem, atomic<int> can fall back on internal locks (just like Vector.java) to ensure load() and store() are always atomic.

[1] different from java. https://stackoverflow.com/questions/11459543/should-getters-and-setters-be-synchronized points out that 32-bit int in java is never “half-written”. If you read a shared mutable int in java, you can hit a stale value but never a c++ style half-written value. Therefore, java doesn’t need guarded load()/store() functions on an integer.

Q: are these c++ atomic types lock-free?
A: for load/store — not guaranteed lock-free; it is implementation-dependent. See P1013
A: for CAS — lock-free CPU instructions are used, if available.
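
A minimal sketch of the two usages discussed above (load/store and CAS), plus a peek at is_lock_free():

#include <atomic>
#include <iostream>

int main() {
    std::atomic<int> counter{0};

    counter.store(42);                     // usage #1: atomic store
    int seen = counter.load();             // usage #1: atomic load, never half-written

    int expected = 42;
    // usage #2: CAS, set to 43 only if the current value is still 42;
    // on failure, 'expected' is updated with the actual current value
    bool swapped = counter.compare_exchange_strong(expected, 43);

    std::cout << seen << ' ' << swapped << ' ' << counter.load() << '\n'; // 42 1 43
    std::cout << "lock-free? " << counter.is_lock_free() << '\n';         // usually 1 for int
}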

posix^SysV-sharedMem^MMF

http://www.boost.org/doc/libs/1_65_0/doc/html/interprocess/sharedmemorybetweenprocesses.html#interprocess.sharedmemorybetweenprocesses.sharedmemory.xsi_shared_memory points out

  • Boost.Interprocess provides portable shared memory in terms of POSIX semantics. I think this is the simplest or default mode of Boost.Interprocess. (There are at least two other modes.)
  • Unlike POSIX shared memory segments, SysV shared memory segments are not identified by names but by ‘keys’. The SysV shared memory mechanism is quite popular and portable, and it’s not based on file-mapping semantics; it uses special system functions (shmget, shmat, shmdt, shmctl…).
  • We could say that memory-mapped files offer the same interprocess communication services as shared memory, with the addition of filesystem persistence. However, as the operating system has to synchronize the file contents with the memory contents, memory-mapped files are not as fast as shared memory. Therefore, I don’t see any market value in this knowledge.

java lock: atomic^visibility

You need to use “synchronized” even on a simple getter, for the #2 effect (visibility)!

 

A typical lock() operation in a java method has two effects:

  1. serialized access — unless I release the lock, all other threads will be blocked grabbing this lock
  2. memory barrier — after I acquire the lock, all the changes (on shared variables) made by other threads become visible. The exact details are … to be elaborated.

I now feel the 2nd effect is often more important (and more tricky) than the 1st effect. See P94 [[Doug Lea]]. I like this simple summary, even if not 100% complete —

“In essence, releasing a lock forces a flush of all writes from working memory….and acquiring a lock forces reload of accessible fields.”

Q: what is the subset of “accessible fields” in a class with 9 fields?
A: I believe the compiler knows “in advance” what subset of fields will be accessed after lock acquisition.

Q: what if I acquire a lock, do nothing and release the lock? Is there the #2 effect?
A: I doubt it. You need to enclose all the “tricky” operations between a lock grab/release. If you leave some update (to a shared mutable) outside in the “cold”, then #1 will fail and #2 may also fail.

 

java8 default method break backward compatibility #HSBC

Among the java8 features, I think the default method is a more drastic and fundamental change to the language than lambda or stream.

In my HSBC interview a London interviewer (Wais) challenged me and said that he thought default methods are designed for backward compatibility. I now think he was wrong.

—- Based on P171 [[Mastering Lambdas]]
Note: the rare cases of incompatibility are an obscure (almost academic) concern. More important are the rules of method resolution when default methods are among the candidates. This topic is similar in spirit to popular interview questions around overriding vs overloading.

Backward compatibility (BWC) means that when an existing interface like Collection.java gains a brand new default method, the existing “customer” source code should work as before. Default methods have a few known violations of BWC.

  • simplest case: all (incl. default) methods in an interface must be public. No ifs or buts.  Suppose Java7 MyConcreteClass has private m2() and implements MyIntf. What if MyIntf is now updated with a default method m2()? Compilation error!
  • a more serious case: the java overriding rule (similar to c++) is very strict, so m(int) vs m(long) is always, automatically, overloading not overriding. Consider a method call myObj.m(33). Originally, this binds to the m(long) declared in the class. Suppose the new default method is m(int) … an overload! At compile time, this is seen as a better match, so it is selected by the compiler (not the JVM runtime)… a silent, unexpected change in business logic and a violation of BWC!

This refreshingly thin book gives 2 more examples. Its last example is a serious backward incompatibility issue but I am not convinced it is technically possible. Here’s my rationale —

Any legacy code relying on putIfAbsent() must have an implementation of putIfAbsent() somewhere in some legacy java7 class. Due to the “class-wins-over-interface” rule, a new default method putIfAbsent() will never be chosen when compiling the legacy code using the java8 tool chain.