serialize access to shared mutable: mutex^CAS

[[optimized c++]] P290 points out that, in addition to a mutex, a CAS construct also serializes access to shared mutable objects.

I feel it’s nothing but a restatement of the definition of “shared mutable”. A more relevant question is:

Q: what constructs support unimpeded concurrent access to shared mutable?
A: a read-write lock lets multiple readers proceed, in the absence of writers (see the sketch below).
A: RCU lets all readers proceed, but writers are restricted.
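
To make the read-write-lock point concrete, here is a minimal Java sketch (ReentrantReadWriteLock; the class and field names are mine): any number of readers can hold the read lock at once, while a writer must wait for exclusive access.

import java.util.concurrent.locks.ReentrantReadWriteLock;

public class RWCounter {
    private final ReentrantReadWriteLock rw = new ReentrantReadWriteLock();
    private long value = 0;                 // the shared mutable state

    public long read() {
        rw.readLock().lock();               // many reader threads can hold this at once
        try { return value; }
        finally { rw.readLock().unlock(); }
    }

    public void write(long v) {
        rw.writeLock().lock();              // exclusive: waits for all readers to drain
        try { value = v; }
        finally { rw.writeLock().unlock(); }
    }
}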

read-copy-update lockfree without retry

RCU is an advanced concurrency construct, not something you would implement in a regular application, but valuable for lock-free interviews.

http://concurrencyfreaks.blogspot.sg/2016/09/a-simple-userspace-rcu-in-java.html is a simple java implementation

The Wikipedia article is accessible up to the paragraph introducing read-side critical section and grace period, which is the first paragraph on the implementation. I found it hard, so here are a few background pointers:

· There must be multiple threads, typically many readers and few writers.

· There must be a shared mutable data structure, probably on the heap.

· Not all data structures are suitable for RCU. Which subset is? I would say pointer-based structures (linked graphs), including hash tables.

In the simplified model, we are talking about

· A reader thread R1 executing a block of code that has started reading the shared mutable data structure. This code block is the read-side critical section.

· A writer thread W1 executing a block of code attempting to update the same data.

· After the so-called grace period, a GC thread would reclaim the obsolete data that R1 has finished reading.

· A reader R2 enters the critical section after the update is done. R2 would see the new data.

The GC needs to know when it’s safe to reclaim. That’s what the wait-for-readers operation and the grace period are for.
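
To make read-side critical section, wait-for-readers and grace period concrete, here is a hedged Java sketch loosely in the spirit of the userspace RCU linked above. Everything in it (ToyRcu, the fixed reader-slot scheme, the names) is my own simplification, not the canonical algorithm.

import java.util.concurrent.atomic.AtomicLong;
import java.util.concurrent.atomic.AtomicLongArray;

public class ToyRcu {
    private static final long NOT_READING = Long.MAX_VALUE;
    private final AtomicLong globalVersion = new AtomicLong(1);
    private final AtomicLongArray readerVersion;   // one slot per reader thread

    public ToyRcu(int maxReaders) {                // assumes a fixed, known set of reader threads
        readerVersion = new AtomicLongArray(maxReaders);
        for (int i = 0; i < maxReaders; i++) readerVersion.set(i, NOT_READING);
    }

    // reader: enter the read-side critical section (must be called BEFORE loading the shared pointer)
    public void readLock(int readerId) {
        readerVersion.set(readerId, globalVersion.get());
    }

    // reader: leave the read-side critical section
    public void readUnlock(int readerId) {
        readerVersion.set(readerId, NOT_READING);
    }

    // writer/GC: the wait-for-readers operation. Returns only after every reader
    // that might still see the old data has exited its critical section (the grace period).
    public void synchronizeRcu() {
        long newVersion = globalVersion.incrementAndGet();
        for (int i = 0; i < readerVersion.length(); i++) {
            while (readerVersion.get(i) < newVersion) Thread.yield();   // busy-wait on stragglers
        }
    }
}

The writer would first publish the new node (e.g. by swapping an AtomicReference), then call synchronizeRcu(), and only then reclaim the obsolete node; readers like R2 that arrive after the swap simply see the new data.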

Q: What if R1 gets a reference to the “obsolete node”, performs some slow task, reads the node, then exits the critical section? This node is reclaimable only after R1 exits the critical section, so wouldn’t the grace period be long?

Q: Among 9999 nodes, how does the system keep track of which nodes are reclaimable?

%%A: I feel kernel access to CPU registers might be needed.

Q: How does it compare with copy-on-write?

A: COW probably copies the entire data structure. Also, the modified copy may not become visible to subsequent readers (like R2) until it is swapped in.

Q: How does it compare with read/write lock?

A: a read lock can still block (e.g. when a writer holds or is waiting for the lock), whereas RCU readers never block.

Q: How does it relate to lock-free concepts?

A: reader threads are lock-free and even better — not required to retry.

A: the GC thread needs to wait (for the grace period to elapse).

A: writer threads (like W1)? Not sure; they might be lock-free in some situations.

minimize locking – queue for producer/consumer

I was asked this question in a big bank IV.

Q: if our queue needs to be as fast as possible, we want to avoid a global lock. How?

%%A1: multi-queue, partitioned by CUSIP or account prefix. I implemented this partially in a JPM take-home coding test.

%%A2: if we are very sure the queue is never depleted, then use 2 locks, one at each end, so consumer threads only contend for the consumer lock.

%%A3: lock free queues are probably quite common in c#, java and c++

For A2: on one hand, we want to keep things simple with fewer locks. On the other hand, we want to increase parallelism by breaking one big sync group (of threads) into independent groups, each with its own “smaller” lock.
Coarse-grained parallelism is key. If the 2 smaller sync groups never cross paths, the additional lock adds no complexity. We do need a way to ensure the 2 ends of the queue never cross, hopefully without grabbing both locks (see the two-lock sketch below). The risk: the consumer is too fast and “eats” an item that’s not yet fully added. We should make sure this fails reliably in production and in stress tests, rather than silently turning into undefined behavior.

I feel in reality there is usually some bound on the capacity of the producer, the consumer or the queue. Sometimes the producer is too fast (overflow), sometimes too slow (the consumer catches up and crosses it), so a queue without any bound check is unreliable in practice.
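For A2, here is a hedged sketch of the classic two-lock linked queue (the same idea behind Java’s LinkedBlockingQueue): a dummy node keeps the two ends apart, producers contend only on putLock, consumers only on takeLock, and the count check is the “bound check” that stops a consumer crossing the producer. Class and field names are mine.

import java.util.concurrent.atomic.AtomicInteger;
import java.util.concurrent.locks.ReentrantLock;

public class TwoLockQueue<T> {
    private static final class Node<T> { T item; Node<T> next; Node(T x) { item = x; } }

    private Node<T> head = new Node<>(null);       // dummy node: head.item is never handed out
    private Node<T> tail = head;
    private final ReentrantLock putLock  = new ReentrantLock();  // producers contend only here
    private final ReentrantLock takeLock = new ReentrantLock();  // consumers contend only here
    private final AtomicInteger count = new AtomicInteger();     // size; also carries visibility across the two ends

    public void put(T x) {
        Node<T> n = new Node<>(x);
        putLock.lock();
        try { tail.next = n; tail = n; count.getAndIncrement(); }
        finally { putLock.unlock(); }
    }

    public T poll() {                  // non-blocking: returns null rather than "crossing" the producer end
        takeLock.lock();
        try {
            if (count.get() == 0) return null;     // the explicit bound check discussed above
            Node<T> first = head.next;
            T x = first.item;
            first.item = null;
            head = first;              // first becomes the new dummy; the old dummy becomes garbage
            count.getAndDecrement();
            return x;
        } finally { takeLock.unlock(); }
    }
}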

atomic operations offered in c++11 ^ pthreads libraries

P341 [[c++ concurrency in action]] has a nice table showing about 7 to 10 most important concurrency features across pthreads vs c++11 [2]. It’s obvious to me the 2 fundamental[1] and universally important features are locks and condVars. These are thoroughly and adequately supported everywhere — pthreads, c++11, java, c#. Beyond these 2 features, other features aren’t universally supported across the board. Let’s look at some of them.

— Atomic operations —
pthreads? no support
c++11? yes atomic types
java? yes atomic variables
boost? no support
c#? yes interlocked

Notice in the C#, c++11, java thread libraries there are specific constructs (classes and methods) for atomic INT and atomic BOOLEAN (but not atomic float), because in practice most atomic algorithms use primitive int and boolean types.

Atomic operations often rely on specific CPU instructions. They are usually “offered” on no more than a few specific atomic data types. Can we apply atomic operations to an arbitrary data type, like some method in a regular class? Generally no: in each thread library there are only a few specific atomic operations, tied to a handful of atomic data types. (c++11 does let you declare std::atomic<T> for any trivially copyable T, but it may fall back to an internal lock.)
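
For illustration, here is the kind of CAS retry loop these atomic types exist for, in Java (AtomicInteger; the counter class itself is made up):

import java.util.concurrent.atomic.AtomicInteger;

public class CasCounter {
    private final AtomicInteger value = new AtomicInteger(0);

    // classic CAS retry loop: read, compute, attempt to swap, retry on contention
    public int increment() {
        for (;;) {
            int current = value.get();
            int next = current + 1;
            if (value.compareAndSet(current, next)) return next;
        }
    }
}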

— thread pool —
java? yes
c#? yes
c++11? no
pthreads? no
boost? To my surprise, No, according to the maintainer of boost::thread

I think it’s technically very feasible to implement a thread pool using locks and condVars (see the sketch below), so this feature is left out of the C/C++ base libraries.
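
As a rough illustration of that feasibility, a toy fixed-size pool needs little more than a lock, a condVar and a task queue. A hedged Java sketch, using the built-in monitor as the lock/condVar pair (all names are mine):

import java.util.ArrayDeque;
import java.util.Deque;

public class ToyThreadPool {
    private final Deque<Runnable> tasks = new ArrayDeque<>();
    private boolean shutdown = false;

    public ToyThreadPool(int nThreads) {
        for (int i = 0; i < nThreads; i++) {
            new Thread(this::workerLoop, "worker-" + i).start();
        }
    }

    public synchronized void submit(Runnable task) {
        tasks.addLast(task);
        notify();                           // condVar signal: wake one idle worker
    }

    public synchronized void shutdown() {
        shutdown = true;
        notifyAll();
    }

    private void workerLoop() {
        while (true) {
            Runnable task;
            synchronized (this) {           // the lock
                while (tasks.isEmpty() && !shutdown) {
                    try { wait(); }         // the condVar wait
                    catch (InterruptedException e) { return; }
                }
                if (tasks.isEmpty()) return;   // shut down and drained
                task = tasks.pollFirst();
            }
            task.run();                     // run the task outside the lock
        }
    }
}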

[1] “Fundamental” is an imprecise term that I would not spend too much time debating. In c#, locks and condVars aren’t really rock-bottom fundamental; they are built on more fundamental constructs, namely primitive kernel objects. In other languages, I’m not sure. Locks and condVars are often implemented in thread libraries, not as syscalls. It’s a bit obscure, arcane and even irrelevant to many developers.
[2] (and java and boost too, but this blog is about c++)

max-throughput live quote distribution: 6 designs (CAS, socket)

Update — fastest would require single-threaded model with no shared mutable

Suppose a live feed of market quotes pumps in messages at the max speed of the network (up to 100GB/sec). We have (5) thousands of (Hedge Fund etc) clients, each with some number (not sure how large) of subscriptions in these quotes. Each subscription sets up a filter that may look like some combination of “Symbol = IBM”, “bid/ask spread < 0.2…”, or “size at the best bid price….”. All the filters only reference fields of the quote object such as symbol, size and price. We need the fastest distribution system. Bottleneck should be network, not our application.

–memory allocation and copying–
If an IBM /quote/ matches 300 filters, then we need to send it to 300 destinations, therefore copying it 300 times, but not allocating 300 times within the JVM; we want to minimize allocation within the JVM. I believe the standard practice is to send just one copy as a message and let the receiver (a different machine) forward it to those 300 hedge funds. Non-certified RV is probably efficient, but unicast JMS is fine too.

–socket reader thread latency–
Given the messaging rate, the socket reader thread should be as lean as possible. I suggest it should blindly drop each msg into a buffer, without looking at it. Asynchronously, consumer threads can apply the filters and distribute the quotes.

A fast wire format is fixed-width. The socket reader takes 500 bytes, assumes they form one complete quote object, and blindly drops this 500-byte array into the buffer.
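
A hedged sketch of that lean reader loop, assuming the fixed 500-byte wire format; QuoteRing is a hypothetical pre-allocated ring, sketched after the design paragraph below:

import java.io.DataInputStream;
import java.io.IOException;
import java.net.Socket;

// Hypothetical lean reader thread: no parsing, no per-message allocation;
// QuoteRing (sketched further down) hands out pre-allocated 500-byte slots.
public class SocketReader implements Runnable {
    private final DataInputStream in;
    private final QuoteRing ring;

    SocketReader(Socket socket, QuoteRing ring) throws IOException {
        this.in = new DataInputStream(socket.getInputStream());
        this.ring = ring;
    }

    public void run() {
        try {
            while (true) {
                byte[] slot = ring.nextWriteSlot();           // recycled fixed-width slot
                in.readFully(slot, 0, QuoteRing.RECORD_LEN);  // blindly copy one complete quote record
                ring.publish();                               // make the slot visible to filter threads
            }
        } catch (IOException e) { /* feed disconnected */ }
    }
}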

–cpu dedication–
Each thread is busy and important enough to deserve a dedicated cpu. That CPU is never given to another thread.
————-
Now let me introduce my design. One thread per filter. The buffer is a circular array — bounded, but with efficient pre-allocation. Pre-allocation requires fixed-sized nodes, probably byte arrays of 500 bytes each. I believe de-allocation is free — recycling. Another friend (csdoctor) suggested an unbounded linked list of arrays. Total buffer capacity should exceed the *temporary* queue build-up. The slowest consumer thread must be faster than the producer, though momentarily the reverse could happen.
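
A hedged sketch of that pre-allocated circular buffer (names invented; a single socket-reader thread writes, filter threads read by index; no overflow check, per the assumption that consumers keep up):

import java.util.concurrent.atomic.AtomicLong;

// Hypothetical pre-allocated ring of fixed-width 500-byte slots.
public class QuoteRing {
    static final int RECORD_LEN = 500;
    private final byte[][] slots;            // allocated once, recycled forever: no per-quote GC
    private final int capacity;
    private final AtomicLong writeSeq = new AtomicLong(-1);   // last published slot sequence number
    private long nextWrite = 0;              // producer-private cursor

    public QuoteRing(int capacity) {
        this.capacity = capacity;
        this.slots = new byte[capacity][RECORD_LEN];
    }

    byte[] nextWriteSlot() {                 // the single reader thread fills this slot in place
        return slots[(int) (nextWrite % capacity)];
    }

    void publish() {                         // volatile-style write: consumers can now see this slot
        writeSeq.set(nextWrite++);
    }

    byte[] slotAt(long seq) { return slots[(int) (seq % capacity)]; }
    long lastPublished()    { return writeSeq.get(); }
}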

—-garbage collection—-
Note the JVM GC can’t free the memory in our buffer: the pre-allocated slots stay reachable, so “freeing” a quote really means marking its slot reusable.

–Design 3–
Allocate a counter in each quote object. Each filter thread decrements the counter after applying its filter. The thread that decrements it to zero frees the quote. But this incurs an allocation cost for that counter.
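
A tiny sketch of that per-quote countdown (assuming the filter count is known up front; all names invented):

import java.util.concurrent.atomic.AtomicInteger;

// Design 3 sketch: each quote carries a countdown initialised to the number of
// filters; whichever filter thread decrements it to zero recycles the slot.
public class CountedQuote {
    final byte[] rawBytes;                         // the 500-byte record
    private final AtomicInteger remainingFilters;  // the extra allocation mentioned above

    CountedQuote(byte[] rawBytes, int filterCount) {
        this.rawBytes = rawBytes;
        this.remainingFilters = new AtomicInteger(filterCount);
    }

    // called by a filter thread once it has finished with this quote
    void doneWithFilter() {
        if (remainingFilters.decrementAndGet() == 0) {
            recycle();                             // last one out recycles the slot
        }
    }

    private void recycle() { /* return rawBytes' slot to the pre-allocated ring */ }
}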

–Design 6–
Each filter thread records, in a global variable, its current position within the queue, advancing through the queue and incrementing its own global variable. One design is based on the observation that, given dedicated CPUs, the slowest thread tends to remain the slowest of the pack. That designated thread would free the memory after applying its filter.

However, it’s possible for 2 filters to be equally slow.

–Design 8– We can introduce a sweeper thread that periodically wakes up to sequentially free all slots that have been visited by all filter threads.
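
A hedged sketch of that sweeper, combining the design-6 cursors with design 8: each filter thread publishes the highest slot it has fully processed, and anything at or below the minimum of those cursors has been seen by every filter (names invented; the actual recycling is left as a comment):

import java.util.concurrent.atomic.AtomicLongArray;

public class Sweeper implements Runnable {
    private final AtomicLongArray filterPos;   // per-filter cursor: highest slot fully processed (initialised to -1)
    private long freedUpTo = -1;               // highest slot already recycled

    Sweeper(AtomicLongArray filterPos) { this.filterPos = filterPos; }

    public void run() {
        while (true) {
            long min = Long.MAX_VALUE;
            for (int i = 0; i < filterPos.length(); i++) {
                min = Math.min(min, filterPos.get(i));
            }
            while (freedUpTo < min) {          // every slot up to min has been visited by all filters
                freedUpTo++;
                // mark slot (freedUpTo % capacity) as reusable storage here
            }
            try { Thread.sleep(1); } catch (InterruptedException e) { return; }   // periodic wake-up
        }
    }
}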

–Design 9– One thread applies all filters for a given HF client. This works if the filters are few and simple.

–Design A (CAS)– Create any number of “identical” consumer threads. We can expand this thread pool at any time.
while(true){
  1) read BigArrayBuffer[++myThreadPtr] into this thread's registers and examine the fields, without converting to a Quote instance;
  2) examine the "taken" boolean flag; if already set, simply "continue" the loop (this cheap pre-check is worthwhile if CAS is costly);
  3) CAS the flag from false to true;
  4a) if the CAS succeeded, apply ALL filters on the quote, then free the memory ourselves without the GC, e.g. by setting another boolean flag to mark this fixed-length block as reusable storage;
  4b) else just "continue", since another thread will process and free it.
}