c++11 atomic{int}^AtomicInteger.java

The #1 usage of atomic<int> is load() and store(). I will use the short form “load/store” or “l/s”.

The #2 usage is CAS. Interviewers are mostly interested in this usage, though I won’t bother to memorize the function names —
* compare_exchange_strong()
* compare_exchange_weak()

The CAS usage is the same as in AtomicInteger.java, but the load/store usage is more like the thread-safety feature of Vector.java. To see the need for load/store, we need to realize that a simple “int” assignment is not atomic [1]:

  • P1012 [[c++ standard library]] shocked me by affirming that without locks, you can read a “half-written Boolean” [1].

To solve this problem, atomic<int> may use internal locks (just like Vector.java) to ensure load() and store() are always atomic.

[1] Different from java. https://stackoverflow.com/questions/11459543/should-getters-and-setters-be-synchronized points out that a 32-bit int in java is never “half-written”. If you read a shared mutable int in java, you may hit a stale value but never a c++-style half-written value. Therefore, java doesn’t need guarded load()/store() functions on an integer.

Q: are these c++ atomic types lock-free?
A: for load/store — not lock-free. See P1013
A: for CAS — lock-free CPU instructions are used, if available.
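
A minimal sketch of both usages in c++11, plus the lock-free query from the Q above:

```cpp
#include <atomic>
#include <cassert>

std::atomic<int> counter{0};

int main() {
    // usage #1: load/store -- always atomic, never a half-written value
    counter.store(42);
    int snapshot = counter.load();
    assert(snapshot == 42);

    // usage #2: CAS -- change 42 to 43 only if the value is still 42
    int expected = 42;
    bool swapped = counter.compare_exchange_strong(expected, 43);
    assert(swapped && counter.load() == 43);

    // lock-free or not is platform-dependent, and queryable at runtime
    bool lf = counter.is_lock_free();
    (void)lf;
    return 0;
}
```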


lockfree usage in high speed trading system@@

Hi Deepak,

Further to our conversation, do high frequency shops use lock-free? I would guess that the most demanding high-speed data processing systems (such as the matching engine in an exchange) are single-threaded, rather than lock-free multithreaded.

I hope to hear of a typical high-speed data processing system that has lots of shared mutable data. I have not found one.

· Order matching

· Airline booking

· Vote counting in real time?

If 99% of the data is not shared mutable, then single-threaded design is probably better.

· I feel one reason is complexity vs simplicity. Lock-free designs can be very tricky, according to experts.

· Another reason is performance. Lock-free is, in my view, definitely slower than single-threaded. The memory fences required on those atomic objects impede compiler optimizations.

· Most important reason, in my view — it’s always possible to achieve the same functionality without using locks or lock-free designs. Single-threaded designs are possible, simpler and faster.

If we have 64 cores, just run 64 processes, each single-threaded. However, in reality these systems often do have some shared mutable data between two threads. There are various solutions I’m not qualified to compare and illustrate. These solutions could use lock-free or locks.


serialize access to shared mutable: mutex^CAS

[[optimized c++]] P290 points out that, in addition to mutex, the CAS construct also serializes access to shared mutable objects.

I feel it’s nothing but a restatement of the definition of “shared mutable”. A more relevant question is

Q: what constructs support unimpeded concurrent access to shared mutable?
A: a read-write lock lets multiple readers proceed, in the absence of writers (sketched below).
A: RCU lets all readers proceed, but writers are restricted.
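
A minimal sketch of the first answer, using std::shared_mutex (c++17; c++14 offers shared_timed_mutex):

```cpp
#include <mutex>
#include <shared_mutex>

std::shared_mutex rw;
int shared_value = 0;   // the shared mutable

int read_value() {
    // many readers can hold a shared_lock at once -- unimpeded, absent writers
    std::shared_lock<std::shared_mutex> rlock(rw);
    return shared_value;
}

void write_value(int v) {
    // a writer takes the exclusive lock, waiting out all readers
    std::unique_lock<std::shared_mutex> wlock(rw);
    shared_value = v;
}
```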



See also blog on c++ atomic<int> — primary usage is load/store, not CAS

lock-free isn’t equal to CAS — lock-free constructs also include Read-Copy-Update, widely used in linux and other kernels.

atomic isn’t equal to lock-free — for example, the c++ atomic classes (actually templates) can use locks if the underlying processor doesn’t have a suitable CAS instruction.
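
A quick demo — std::atomic happily wraps a type too wide for a single CPU CAS, and is_lock_free() exposes the fallback (output is platform-dependent):

```cpp
#include <atomic>
#include <cstdio>

struct Big { char payload[64]; };  // trivially copyable, but too wide for one CAS

int main() {
    std::atomic<int> ai{0};
    std::atomic<Big> abig{};
    std::printf("atomic<int> lock-free? %d\n", (int)ai.is_lock_free());   // typically 1
    std::printf("atomic<Big> lock-free? %d\n", (int)abig.is_lock_free()); // typically 0: internal lock
    return 0;
}
```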



read-copy-update lockfree +! retry

RCU is an advanced concurrency construct, not something you would implement in a regular application, but valuable for lock-free interviews.

http://concurrencyfreaks.blogspot.sg/2016/09/a-simple-userspace-rcu-in-java.html is a simple java implementation

The Wikipedia article is accessible up to the paragraph introducing the read-side critical section and the grace period — the first paragraph on the implementation. I found it hard, so here are a few background pointers:

· There must be multiple threads, typically many readers and few writers.

· There must be a shared mutable data structure, probably on the heap.

· Not all data structures are suitable for RCU. So which subset is? I would say pointer-graphs, including hash tables.

In the simplified model, we are talking about

· A reader thread R1 executing a block of code that has started reading the shared mutable data structure. This code block is the critical section.

· A writer thread W1 executing a block of code attempting to update the same data.

· After the so-called grace period, a GC thread would reclaim the obsolete data that R1 has finished reading.

· A reader R2 enters the critical section after the update is done. R2 would see the new data.

The GC needs to know when it’s safe to reclaim. It needs the wait-for-readers operation and the grace period.
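
Below is a much-simplified userspace sketch in c++11, using shared_ptr reference counts as a crude stand-in for wait-for-readers and the grace period, and assuming a single writer. Real RCU (and the java example linked above) avoids this per-read counting overhead:

```cpp
#include <map>
#include <memory>
#include <string>

// the shared mutable pointer-graph (a map standing in for a hash table)
using Table = std::map<std::string, int>;
std::shared_ptr<Table> g_table = std::make_shared<Table>();

// reader (R1/R2): the snapshot pins the current version for the whole critical section
int reader(const std::string& key) {
    std::shared_ptr<Table> snap = std::atomic_load(&g_table); // enter critical section
    auto it = snap->find(key);
    return it == snap->end() ? -1 : it->second;
}   // snap destroyed -- this reader no longer pins the old version

// writer (W1): copy, update the copy, publish; never touches the version readers hold
void writer(const std::string& key, int value) {
    std::shared_ptr<Table> cur = std::atomic_load(&g_table);
    auto next = std::make_shared<Table>(*cur);   // read-COPY-
    (*next)[key] = value;                        // -UPDATE
    std::atomic_store(&g_table, next);           // publish: R2 now sees the new data
}   // the old Table is freed once the last pinned reader exits -- the "grace period"
```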

Q: What if R1 gets a reference to the “obsolete node”, performs some slow task, reads the node, then exits the critical section? The node is reclaimable only after R1 exits the critical section, so the grace period could be long?

Q: Among 9999 nodes, how does the system keep track of which nodes are reclaimable?

%%A: I feel kernel access to CPU registers might be needed.

Q: How does it compare with copy-on-write?

A: COW probably copies the entire data structure. Also, the modified copy is not visible to subsequent readers (like R2).

Q: How does it compare with read/write lock?

A: the read lock can block (when a writer holds the lock), whereas RCU readers never block.

Q: How does it relate to lock-free concepts?

A: reader threads are lock-free and even better — not required to retry.

A: GC threads need to wait.

A: Writer threads (like W1)? Not sure. They might be lock-free in some situations.


minimize locking – queue for producer/consumer

I was asked this question in a big bank IV.

Q: if our queue needs to be as fast as possible, we want to avoid a global lock. How?

%%A1: multi-queue, based on cusip or account prefix. I implemented this partially in a JPM take-home coding test

%%A2: if we are very sure the queue is never depleted, then use 2 locks at the two ends, so consumer threads only need to contend for the consumer lock.

%%A3: lock free queues are probably quite common in c#, java and c++

For A2: on one hand, we want to keep things simple by having fewer locks; on the other hand, we want to increase parallelism by breaking up one big sync group (of threads) into independent groups, each with a “smaller” lock.
Coarse-grained parallelism is key. If the 2 smaller sync groups never cross paths, then the additional locks won’t add complexity. We may need a way to ensure that the 2 ends of the queue never cross, hopefully without holding both locks. The risk — the consumer is too fast and “eats” an item that’s not yet fully added. We should make sure this fails reliably in production and stress tests, rather than causing undefined behavior. A sketch follows.
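
A sketch of A2’s two-lock idea, along the lines of the classic Michael & Scott two-lock queue. The permanent dummy node is one way to keep the 2 ends from crossing without holding both locks, and an empty queue fails reliably (pop returns false) instead of eating a ghost item:

```cpp
#include <atomic>
#include <mutex>
#include <utility>

template <typename T>
class TwoLockQueue {
    struct Node {
        T value{};
        std::atomic<Node*> next{nullptr};  // atomic: push/pop can touch the same
                                           // next pointer when the queue is empty
    };
    Node* head_;                  // always the dummy node; guarded by head_mtx_
    Node* tail_;                  // last node; guarded by tail_mtx_
    std::mutex head_mtx_, tail_mtx_;
public:
    TwoLockQueue() : head_(new Node), tail_(head_) {}
    void push(T v) {                       // producers contend only on tail_mtx_
        Node* n = new Node;
        n->value = std::move(v);
        std::lock_guard<std::mutex> g(tail_mtx_);
        tail_->next.store(n);
        tail_ = n;
    }
    bool pop(T& out) {                     // consumers contend only on head_mtx_
        std::lock_guard<std::mutex> g(head_mtx_);
        Node* first = head_->next.load();
        if (!first) return false;          // empty -- no undefined behavior
        out = std::move(first->value);
        Node* old_dummy = head_;
        head_ = first;                     // first becomes the new dummy
        delete old_dummy;
        return true;
    }
    // destructor omitted in this sketch
};
```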

I feel in reality there is usually some bound on the capacity of the producer, the consumer or the queue. Sometimes the producer will be too fast (overflow) or too slow (cross), so a queue without any bound check is unreliable in practice.


atomic operations offered in c++11 ^ pthreads libraries

P341 [[c++ concurrency in action]] has a nice table showing the 7 to 10 most important concurrency features across pthreads vs c++11 [2]. It’s obvious to me that the 2 fundamental [1] and universally important features are locks and condVars. These are thoroughly and adequately supported everywhere — pthreads, c++11, java, c#. Beyond these 2 features, other features aren’t universally supported across the board. Let’s look at some of them.

— Atomic operations —
pthreads? no support
c++11? yes atomic types
java? yes atomic variables
boost? no support
c#? yes interlocked

Notice the C#, c++11 and java thread libraries have specific constructs (classes and methods) for atomic INT and atomic BOOLEAN (but not atomic float), because in practice most atomic algorithms use primitive int and boolean types.

Atomic operations don’t always rely on specific CPU instructions. They are typically “offered” on no more than a few specific atomic data types. Can we apply atomic operations to a random data type, like some method in a regular class? Mostly no — c++ std::atomic<T> does accept any trivially copyable T, but may silently fall back to an internal lock. I feel in each thread library there are no more than a few specific atomic operations, tied to a handful of atomic data types.

— thread pool —
java? yes
c#? yes
c++11? no
pthreads? no
boost? To my surprise, No, according to the maintainer of boost::thread

I think it’s technically very feasible to implement a thread pool using locks and condVars, so this feature is left out of the c/c++ base libraries.

[1] “Fundamental” is an imprecise term that I would not spend too much time debating. In c#, locks and condVars aren’t really rock-bottom fundamental — they are built on more fundamental constructs, namely primitive kernel objects. In other languages, I’m not sure. Locks and condVars are often implemented in thread libraries, not syscalls. It’s a bit obscure, arcane and even irrelevant to many developers.
[2] (and java and boost too, but this blog is about c++)


max-throughput live quote distribution: 6designs(CAS,socket)

Update — fastest would require single-threaded model with no shared mutable

Suppose a live feed of market quotes pumps in messages at the max speed of the network (up to 100GB/sec). We have (5) thousand (Hedge Fund etc) clients, each with some number (not sure how large) of subscriptions to these quotes. Each subscription sets up a filter that may look like some combination of “Symbol = IBM”, “bid/ask spread < 0.2…”, or “size at the best bid price….”. All the filters only reference fields of the quote object such as symbol, size and price. We need the fastest distribution system. The bottleneck should be the network, not our application.

–memory allocation and copying–
If an IBM /quote/ matches 300 filters, then we need to send it to 300 destinations, which means copying it 300 times, but not 300 allocations within the JVM. We want to minimize allocation within the JVM. I believe the standard practice is to send just one copy as a message and let the receiver (a different machine) forward it to those 300 hedge funds. Non-certified RV is probably efficient, but unicast JMS is fine too.

–socket reader thread latency–
Given the messaging rate, the socket reader thread should be as lean as possible. I suggest it should blindly drop each msg into a buffer, without looking at it. Consumer threads can then asynchronously apply the filters and distribute the quotes.

A fast wire format is fixed-width. The socket reader takes 500 bytes, assumes it’s one complete quote object, and blindly drops this 500-byte array into the buffer.

–cpu dedication–
Each thread is busy and important enough to deserve a dedicated cpu. That CPU is never given to another thread.
Now let me introduce my design. One thread per filter. The buffer is a circular array — bounded but with efficient pre-allocation. Pre-allocation requires fixed-size nodes, probably byte arrays of 500 each. I believe de-allocation is free — just recycling. Another friend (csdoctor) suggested an unbounded linked list of arrays. Total buffer capacity should exceed the *temporary* queue build-up. The slowest consumer thread must be faster than the producer, though momentarily the reverse could happen. A sketch of the circular array follows.
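
Here is a minimal sketch of that circular array, written in c++ (though the original discussion assumes a JVM) and restricted to one producer and one consumer; the multi-consumer variants are Designs 3/6/8/A below. The 500-byte slot size is from the wire format above; kSlots is illustrative:

```cpp
#include <array>
#include <atomic>
#include <cstddef>
#include <cstring>

constexpr std::size_t kSlotBytes = 500;      // one fixed-width quote per slot
constexpr std::size_t kSlots     = 1 << 12;  // capacity for the temporary build-up

struct QuoteRing {
    std::array<std::array<char, kSlotBytes>, kSlots> slots;  // pre-allocated once
    std::atomic<std::size_t> wr{0};   // producer cursor (monotonically increasing)
    std::atomic<std::size_t> rd{0};   // consumer cursor

    bool produce(const char* wire_bytes) {           // socket reader: blind copy
        std::size_t w = wr.load(std::memory_order_relaxed);
        if (w - rd.load(std::memory_order_acquire) == kSlots)
            return false;                            // full: build-up exceeded capacity
        std::memcpy(slots[w % kSlots].data(), wire_bytes, kSlotBytes);
        wr.store(w + 1, std::memory_order_release);
        return true;
    }
    bool consume(char* out) {                        // filter thread
        std::size_t r = rd.load(std::memory_order_relaxed);
        if (r == wr.load(std::memory_order_acquire))
            return false;                            // empty
        std::memcpy(out, slots[r % kSlots].data(), kSlotBytes);
        rd.store(r + 1, std::memory_order_release);  // de-allocation is free: recycle
        return true;
    }
};
```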

—-garbage collection—-
Note jvm gc can’t free the memory in our buffer.

–Design 3–
Allocate a counter in each quote object. Each filter applied will decrement the counter. The thread that hits zero will free it. But this incurs allocation cost for that counter.
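
A sketch of Design 3 — fetch_sub identifies the unique thread that hits zero. Quote and recycle() are illustrative names, not from the original:

```cpp
#include <atomic>

struct Quote {
    char wire[500];
    std::atomic<int> remaining;  // initialized to the number of filters to run
};

void recycle(Quote* q);          // hypothetical: return the block to the pool

void after_filter_applied(Quote* q) {
    // fetch_sub returns the OLD value, so exactly one thread sees 1 -> 0
    if (q->remaining.fetch_sub(1, std::memory_order_acq_rel) == 1)
        recycle(q);              // the thread that hits zero frees it
}
```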

–Design 6–
Each filter thread records in a global var its current position within the queue. Each filter thread advances through the queue and increments its global var. One design is based on the observation that, given the dedicated CPUs, the slowest thread is always the slowest in the wolfpack. This designated thread would free the memory after applying its filter.

However, it’s possible for 2 filters to be equally slow.

–Design 8– We can introduce a sweeper thread that periodically wakes up to sequentially free all allocations that have been visited by all filters. A sketch combining Designs 6 and 8 follows.
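
Taking the minimum over all per-thread positions also handles 2 equally slow filters. kFilters and recycle_slot() are illustrative:

```cpp
#include <algorithm>
#include <atomic>
#include <cstddef>

constexpr int kFilters = 8;
std::atomic<std::size_t> cursor[kFilters];  // last queue position each filter finished
std::atomic<std::size_t> reclaimed{0};      // everything below this is recycled

void recycle_slot(std::size_t pos);         // hypothetical: return slot to the ring

void sweep() {   // periodically run by the sweeper thread
    std::size_t min_pos = cursor[0].load(std::memory_order_acquire);
    for (int i = 1; i < kFilters; ++i)
        min_pos = std::min(min_pos, cursor[i].load(std::memory_order_acquire));
    // free all allocations that every filter has visited
    for (std::size_t p = reclaimed.load(); p < min_pos; ++p)
        recycle_slot(p);
    reclaimed.store(min_pos);
}
```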

–Design 9– One thread to apply all filters for a given HF client. This works if the filter logic is simple and the filters are few.

–Design A (CAS)– Create any # of “identical” consumer threads. At any time we can expand this thread pool.
1) Read BigArrayBuffer[++MyThreadPtr] into this thread’s registers and examine the fields, without converting to a Quote instance.
2) Examine the Taken boolean flag. If already set, simply “continue” the loop. This step might be needed if CAS is costly.
3) CAS to set this flag.
4a) If successful, apply ALL filters on the quote. Then somehow free up the memory (without the GC). Perhaps set another boolean flag to indicate this fixed-length block is now reusable storage.
4b) Else just “continue”, since another thread will process and free it.
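
A sketch of one iteration of this consumer loop; apply_all_filters() is an illustrative stand-in:

```cpp
#include <atomic>

struct Slot {
    char quote[500];                  // fixed-length block, never converted to Quote
    std::atomic<bool> taken{false};
    std::atomic<bool> reusable{false};
};

void apply_all_filters(const char* quote);   // hypothetical

void consume_one(Slot& s) {
    if (s.taken.load(std::memory_order_acquire))            // step 2: cheap pre-check
        return;                                             // someone else owns it
    bool expected = false;
    if (!s.taken.compare_exchange_strong(expected, true))   // step 3: CAS the flag
        return;                                             // step 4b: lost the race
    apply_all_filters(s.quote);                             // step 4a: we won
    s.reusable.store(true, std::memory_order_release);      // block is reusable storage
}
```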


uncontended lock acquisition always uses CAS

In http://www.ibm.com/developerworks/java/library/j-jtp04186/index.html, Brian Goetz (IBM) points out that un-contended lock acquisition always involves [1] a CAS.

[1] Not sure about the “always involves”. I imagine this operation as “if the lock owner is Nobody, then set current thread as lock owner”.
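
That mental model as a sketch — one CAS on the owner field, with thread ids simplified to ints. Real lock implementations layer spinning, queueing and fairness on top of this fast path:

```cpp
#include <atomic>

std::atomic<int> owner{0};   // 0 means "Nobody"

bool try_acquire(int my_tid) {
    int nobody = 0;
    // "if the lock owner is Nobody, then set current thread as lock owner"
    return owner.compare_exchange_strong(nobody, my_tid,
                                         std::memory_order_acquire,
                                         std::memory_order_relaxed);
}

void release() {
    owner.store(0, std::memory_order_release);  // only the owner should call this
}
```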