cancel^thread.stop()^interrupt^unixSignal

Cancel, interrupt and thread.stop() are three less-quizzed topics that show up occasionally in java/c++ interviews. They are fundamental features of the concurrency API. You can consider this knowledge as zbs.

As described in ##thread cancellation techniques: java #pthread,c#, thread cancellation is supported in c# and pthreads, whereas java supports it only indirectly.

— cancel and java thread.stop() are semantically similar, but java thread.stop() is deprecated and unsafe.

PTHREAD_CANCEL_ASYNCHRONOUS is usable in very restricted contexts. I think this is similar to thread.stop().

— cancel and interrupt both define stop points (known as cancellation points in pthreads). In both cases, the target thread can choose to ignore the cancellation/interrupt request, or honor it at the stop points.

Main difference between cancel and interrupt? Perhaps just the wording. In pthreads there’s only cancel, no interrupt; in java there’s no cancel.

Note a java thread that has terminated after an interrupt can’t be resumed.
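Java’s cooperative cancellation can be shown in a tiny sketch (class name and structure are mine, not from any book):

```java
public class InterruptDemo {
    public static void main(String[] args) throws InterruptedException {
        Thread worker = new Thread(() -> {
            // the "stop point": the target thread checks the interrupt flag voluntarily
            while (!Thread.currentThread().isInterrupted()) {
                // do one unit of work, then re-check the flag
            }
            System.out.println("worker honored the interrupt");
        });
        worker.start();
        worker.interrupt();   // merely a request — the worker is free to ignore it
        worker.join();
    }
}
```

If the worker never checks the flag (and never calls a blocking method that throws InterruptedException), the request is silently ignored — exactly the cooperative semantics described above.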

Not sure if a Unix signal handler also supports stop points.

Java GC on-request is also cooperative. You can’t force the GC to start right away.

Across the board, the only immediate (non-cooperative) mechanism is power loss. Even a keyboard “kill” is subject to programmed software behavior, typically in the OS kernel.

[17] deadlock involving 1 lock(again) #StampedLock

— Scenario 3: If your app uses just one lock, you can still get deadlock. Eg: ThreadM (master) starts ThreadS (slave). S acquires a lock. M calls the now-deprecated suspend() method on S, then tries to acquire the lock, which is held by the suspended S. S is waiting to be resumed by M — deadlock.

Now the 4 conditions of deadlock:
* wait-for cycle? indeed
* mutually exclusive resource? Just one resource
* incremental acquisition? indeed
* non-preemptive? Indeed. But if there’s a third thread, it can resume ThreadS.

— Scenario 2: a single thread using a single non-reentrant lock (such as a default pthreads mutex and many c++ lock objects) can deadlock with itself. A common mis-design.

— Scenario 1:

In multithreaded apps, I feel single-lock deadlock is very common. Here’s another example — “S gets a lock, and waits for ConditionA. Only M can create ConditionA, but M is waiting for the lock.”

Here, ConditionA can be many things, including a Boolean flag, a JMS message, a file, a count reaching a threshold, a data condition in a DB, a UI event where the UI depends partly on ThreadM.

P90 [[eff c#]] also shows a single-lock deadlock. Here’s my recollection — ThrA acquires the lock and goes into a while-loop, periodically checking a flag to be set by ThrB. ThrB wants to set the flag but needs the lock.
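The standard fix for this flag-checking deadlock is to wait() inside the loop, because wait() releases the lock while waiting, letting the other thread in. A minimal java sketch (names mine):

```java
public class FlagDemo {
    private static final Object lock = new Object();
    private static boolean flag = false;

    public static void main(String[] args) throws InterruptedException {
        Thread thrB = new Thread(() -> {
            synchronized (lock) {   // B can enter, because A's wait() released the lock
                flag = true;
                lock.notifyAll();
            }
        });
        synchronized (lock) {       // "ThrA" role: main thread holds the lock
            thrB.start();
            while (!flag) {
                lock.wait();        // releases the lock while waiting — no deadlock
            }
        }
        System.out.println("flag seen: " + flag);
    }
}
```

Had the main thread busy-checked the flag while holding the lock, ThrB could never enter the synchronized block — the single-lock deadlock described above.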


multi-core as j4 multi-threading #STM

Q: why choose multi-threaded design instead of single-threaded processes?

Most publications mention multi-core hardware as an answer. Questionable.

With multi-threading, you can run 30 threads in one process. Or you can run 30 single-threaded processes, as in the RTS parser and Rebus — an industrial-strength, proven solution. Both designs make use of all CPU cores.

Between these two designs, heap memory efficiency can be different, as the 30 threads are able to share 99GB of objects in the same address space, but the 30 processes would need shared memory.

I feel the lesser-known middle-ground design is … 30 threads running in single-threaded mode. By definition, these 30 threads share only immutable data; anything mutable is thread-local.

In any multi-threaded design, those 30 threads can share the text segment i.e. memory occupied by code. The text segment tends to be smaller than the heap footprint.

In a boss-worker design, the worker threads may need to share a few mutable objects, so these worker threads are not strictly STM. They are quasi-STM.

success@concurrency features] java^c++^c#

I’m biased towards java.

I feel c# concurrency is less impactful because most of the important concurrent systems use *nix servers not windows, and most concurrent programming jobs do not go to windows developers.

Outside windows, c++ concurrency is mostly based on the C library pthreads — non-OO and inconvenient compared to java/c#.

The c++11 thread classes are the next generation following pthreads, but not widely used.

Java’s concurrency support is the most successful among languages — well-designed from the beginning and rather stable. It’s much simpler than the c++11 thread classes. More than half of java interviews would ask about threading, because java threading is understandable and usable by the average programmer, who can’t really understand c++ concurrency features.

POSIX semaphores do grow beyond initial value #CSY

My friend Shanyou pointed out a key difference between a c++ binary semaphore and a simple mutex — namely ownership. A Posix Semaphore (binary or otherwise) has no ownership, because non-owner threads can create permits from thin air, then use them to enter the protected critical section. Analogy — NFL fans printing free tickets to enter the stadium.

  • ! to my surprise, if you initialize a posix semaphore to 1 permit, it can be incremented to 2. So it is not a strict binary semaphore. My experiment used a linux POSIX semaphore; not sure about the system-V semaphore. I now think a posix semaphore is always a multi-value counting semaphore, with the current value indicating permits currently available.

  • ! Initial value is NOT “max permits” but rather “initial permits”

I feel the restriction by a semaphore is just advisory, like advisory file locking. If a program is written to subvert the restriction, it’s easy. Therefore, a “binary semaphore” is binary IFF all threads follow protocol. One discussion claims “Mutex can be released only by the thread that had acquired it, while you can signal a semaphore from any non-owner thread (or process).” This does NOT mean a non-owner thread can release a toilet key owned by another thread/process. It means the non-owner thread can mistakenly create a 2nd key, in breach of exclusive, serialized access to the toilet. Another tutorial says explicitly that a thread can release the single toilet key without ever acquiring it.


[[DougLea]] P220 confirms that in some thread libraries such as pthreads, release() can increment a 0/1 binary semaphore value to beyond 1, destroying the mutual exclusion control.

Note the standard java.util.concurrent.Semaphore behaves the same way — release() without a prior acquire() quietly adds a permit (a “spare key”) and doesn’t throw, so a java binary semaphore is not a mutex either. In contrast, a true mutex like ReentrantLock throws IllegalMonitorStateException when a non-owner calls unlock().
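You can verify the java behavior with a two-line experiment using java.util.concurrent.Semaphore — release() without a prior acquire() does add a permit:

```java
import java.util.concurrent.Semaphore;

public class SemGrow {
    public static void main(String[] args) {
        Semaphore sem = new Semaphore(1);  // nominally "binary" — 1 initial permit
        sem.release();                     // non-owner creates a permit from thin air
        // initial value is "initial permits", not "max permits"
        System.out.println(sem.availablePermits());
    }
}
```

This prints 2, mirroring the POSIX experiment above.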

hardware mutex, based@XCHG instruction

[[OS by William Stallings]] P214 is concise and clear.

The atomic XCHG instruction on Intel processors can swap a local variable (held in cpu register) with a main memory location visible to all threads.

This XCHG instruction alone can implement a simple mutex usable by multiple threads on multiple processors sharing the same main memory.
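For illustration, here is a java analogue (all names mine) — AtomicBoolean.getAndSet() is an atomic swap in the spirit of XCHG, and by itself suffices to build a (busy-waiting) mutex:

```java
import java.util.concurrent.atomic.AtomicBoolean;

public class SpinLock {
    private final AtomicBoolean locked = new AtomicBoolean(false);

    public void lock() {
        // atomic swap: keep writing "true" until we observe the old value was "false"
        while (locked.getAndSet(true)) {
            // busy-wait ("wheel-spinning") — the severe limitation noted below
        }
    }

    public void unlock() {
        locked.set(false);
    }

    public static void main(String[] args) throws InterruptedException {
        SpinLock lock = new SpinLock();
        int[] counter = {0};
        Runnable task = () -> {
            for (int i = 0; i < 100_000; i++) {
                lock.lock();
                counter[0]++;           // critical section, now race-free
                lock.unlock();
            }
        };
        Thread t1 = new Thread(task), t2 = new Thread(task);
        t1.start(); t2.start();
        t1.join(); t2.join();
        System.out.println(counter[0]); // 200000 — no lost updates
    }
}
```

Without the lock, the two threads would lose updates; with it, the count is always exactly 200000.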

Another atomic instruction, test-and-set, can also implement a mutex by itself.

Both mutexes have only academic value because their limitations are severe:

  • even with a single mutex, a deadlock-like hang (priority inversion) is possible at the hardware level — ThrA in the critical section can get interrupted and preempted — perfectly normal. Now ThrA has to wait for high-priority ThrH to complete, but ThrH is busy-waiting (wheel-spinning) to enter the same critical section 😦
    • Note in this example there’s no “lock object” per-se, only a critical section in code !
  • by default, unsuccessful threads like ThrH are stuck in busy-wait — wasting CPU and generating heat 😉


xchg between register/memory triggers fencing

I guess xchg is used in some basic concurrency constructs, but I need to research more.

My blogpost “hardware mutex, based@XCHG instruction” points out the implicit locking performed by the CPU whenever one of the two operands is a memory location. The other operand must be a register.

This implicit locking is reportedly quite expensive, but presumably cheaper than user-level locking.

This implicit locking involves a memory fence.

UseSpinning jvm flag #pthr/kernel/c#

Hi XR,

Around 2011 I once said a java thread waiting for a lock would not keep grabbing the lock…. Wrong! Actually it does keep grabbing for a while. It runs some useless CPU instructions and checks the lock periodically. This is really spin locking inside the JVM.

Spin locking wastes CPU but frequently results in faster application performance (reducing context switches). Therefore, the JVM runtime uses some policy to decide how long a thread would spin before placing the thread in some kind of “parking lot” (my terminology) so the thread doesn’t occupy CPU any more. This spin phase used to be configurable via the boolean UseSpinning JVM flag, but in java7 and beyond, this flag has no effect — the JVM always has freedom to use spinning. This is described in [[javaPerf2014]].
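A toy sketch of the spin-then-park idea, using java’s LockSupport (the JVM-internal policy is of course far more sophisticated; all names are mine):

```java
import java.util.concurrent.locks.LockSupport;

public class SpinThenPark {
    public static void main(String[] args) throws InterruptedException {
        Thread waiter = new Thread(() -> {
            // spin phase: burn a few cycles hoping the "lock" frees up quickly
            for (int spins = 0; spins < 10_000; spins++) {
                // useless work, periodically re-checking (check elided in this toy)
            }
            LockSupport.park();   // give up the CPU — enter the "parking lot"
            System.out.println("unparked");
        });
        waiter.start();
        LockSupport.unpark(waiter);  // permit is remembered, so park() returns even if it runs later
        waiter.join();
    }
}
```

LockSupport is the primitive underneath java.util.concurrent locks; park/unpark is the “parking lot” mechanism, while the loop stands in for the spin phase.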

Comparison 1 — Dotnet SpinWait() provides exactly the same spinlock feature.

Comparison 2 — In the linux Kernel (not userland), 1) spinlock and 2) counting semaphore are the two locking primitives, as described on Page 164 of [[understandingLinuxKernel]]. Both locks block a “grabbing” thread from entering a critical section until another thread leaves it. Similar to my 2011 description, the kernel semaphore remembers all sleeping processes (P174) and wakes one of them when the critical section is clear. I think this kernel semaphore is used by the scheduler in the kernel.

Comparison 3 — In the linux pthreads library (the basis of the linux JVM), the locking constructs are implemented using system calls and userland C functions. Pthreads as a userland library probably can’t use the above two kernel constructs directly. A userland C function can implement a spinlock, but to implement “parking” I assume kernel support and system calls are required. So far I assume each pthreads thread maps to a native thread controlled by the kernel. This mapping is now the dominant usage.

The alternative mapping model is green threads, where userland threads are invisible to kernel. Parking has to be implemented in userland.  Nowadays it’s rare to see green threads.

## linux kernel concurrency primitives #to elaborate

I feel the concurrency primitives in the kernel are conceptually similar to those in thread libraries but fundamentally different in implementation. This topic is deep and time-consuming, so I will only memorize some jargon for show-off.

I should read the OReilly Linux book + Josh’s book

  • spinlock — addressed in both books
  • read-write spinlock
  • binary and counting semaphores
  • read-write semaphores
  • RCU?


semaphore: often !! ] thread library

lock and condVar are essential components of any thread library. Counting Semaphore is not.

  • In (C library) POSIX, the semaphore functions do not start with pthread_ the way lock and condVar functions do.
  • In (C library) SysV, the semaphore API is not part of any thread library whatsoever
  • In java, locks and condVars are integrated with every object, but Semaphore is a separate class
  • Windows is different

Important Note — both dotnet and java are a few abstraction levels higher than the thread or semaphore “libraries” provided on top of an operating system. These libraries [1]  are sometimes NOT part of kernel, even if the kernel provides basic thread support.

[1] ObjectSpace and RogueWave both provide thread libraries, not built into any operating system whatsoever.

[17] semaphore^mutex

See also the CSY blogpost on binary semaphore vs mutex.

I usually trust official documentation by Microsoft, Oracle, Linux .. more than online forums, but I did find some worthwhile reading in one forum thread. Multiple authors pointed out the notification feature of semaphore. I agree.

There are at least 4 different semaphore APIs
* java has Semaphore class
* dotnet has a Semaphore class based on Windows semaphore
* system-V semaphore API
* POSIX semaphore API

The windows (including dotnet) semaphore API is somewhat different from the rest. A notification feature is built into the windows semaphore, but I don’t think the other 3 API documentations have any mention of notification or signaling features. I believe ownership is NOT the key point. By “ownership”, the (Windows) semaphore discussions basically mean “notification”. Non-owners of a semaphore can announce intention-to-acquire.

low-latency: avoid concurrency #ST-mode

Backgrounder — CPU speed is increasing more gradually than before. The technology industry as a whole is advancing more horizontally — increasing parallelism. Yet the best low-latency designs use neither lock-free constructs nor concurrency at all.

I asked Martin Thompson — To really push the limit of latency, should we avoid concurrency as much as possible, completely eliminating it if possible? Answer is yes. Martin pointed out the difference between

  • parallel design — use multitasking, possibly in ST-mode, or “do multiple things at the same time”
  • concurrent design — deal with multitasking, or “deal with multiple things at the same time”. The expression “deal with” implies complexities, hazards, risks, control, management.

One of the hidden hazards Martin pointed out is heap memory de-allocation, but that’s for another blogpost.

fiber^thread, conceptually #shallow

I believe the concept of fiber is not standardized across languages. Here are some general observations

  • fibers are unknown to kernel. They are similar to userland threads that are implemented in userland thread libraries, rather than implemented by kernel and system calls.
  • like userland threads, a fiber adds less load on the kernel. See [[pthreads]]
  • diff: fibers are even more light-weight than threads
  • diff: fibers are usually short-lived, perhaps similar to tasks
  • diff: fibers have smaller stacks
  • a few languages (I have only heard about scala?) support millions of concurrent fibers in one OS process. For threads, with an IO-heavy workload, you can probably run tens of thousands of threads on a single JVM.

per-thread^per-DBConn writeBuffer

  1. Broadly speaking, if you write to a table in your DB connection but don’t commit, it will be invisible to other connections.
  2. Similarly, if a thread writes to a shared object without a memory fence, the update stays in the per-thread buffer and is invisible to other threads.
  3. File write is an even more common scenario. One process writes a single line to the file, but doesn’t flush. It will not be visible to other processes.
  4. CVS

blockingMutex implementation ] kernel!!userland

Background — the Linux kernel provides two types of locks — spinlock and blocking mutex. Here I focus on the mutex; I think it is far more useful to userland applications. Good pointers:

  • I believe context switch is expensive, since the CPU cache has to be replaced. Therefore, optimistic spin is beneficial.

  • a blocking mutex used in kernel, perhaps not directly used by userland apps
  • implemented using spin lock + some wait_lock
  • maintains a wait_list. Not visible to any userland app.


read-mostly data store: needs synchronization

Suppose a data store is infrequently updated, and only by a single writer thread, but is accessed frequently by many, many reader threads.

Q1: Can we cheat and skip some of the (expensive) locking?
A1: No. The writer could be in the middle of a multi-step update, so readers would see a half-updated data store

Q2: What if we somehow ensure the updates are never multi-step [1]?
A2: Unfortunately no, due to visibility failure. If a reader thread doesn’t synchronize, the crucial memory fencing would not occur and the reader may not notice a recent update, and use cached data instead.

[1] One way to do it — the writer creates a new version of a record, and finally reseats a single pointer, as described in my AtomicReference blogpost.
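A minimal sketch of that single-pointer reseat, using java’s AtomicReference (data and names are mine):

```java
import java.util.concurrent.atomic.AtomicReference;
import java.util.Collections;
import java.util.HashMap;
import java.util.Map;

public class PointerReseat {
    // readers call store.get() and see either the old or the new COMPLETE version,
    // never a half-updated data store
    static final AtomicReference<Map<String, Integer>> store =
            new AtomicReference<>(Collections.singletonMap("a", 1));

    public static void main(String[] args) {
        // writer: build the full new version off to the side...
        Map<String, Integer> next = new HashMap<>(store.get());
        next.put("b", 2);
        store.set(next);   // ...then reseat the single pointer in one atomic, fenced step
        System.out.println(store.get().size());
    }
}
```

The set() (a volatile write) supplies the memory fence, so readers can’t miss the update or see the multi-step construction in progress.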

calling database access method while holding mutex

Update: XR referred me to a related article.

Hi XR,

Q: If some vendor provides a database access function (perhaps dbget()) that may be slow and may acquire some lock internally, is it a good idea to call this function while holding an unrelated mutex?

We spoke about this. Now I think this is so bad it should be banned. This dbget() could take an unknown amount of time. If someone deleted a row from a table that dbget() needs, it could block forever. The mutex you hold is obviously shared with other threads, so those threads would be starved. Even if this scenario happens once in a million times, it is simply unacceptable.

Here, I’m assuming the dbget() internal lock is completely unrelated to the mutex. In other words, dbget() doesn’t know anything about your mutex and will not need it.

As a rule of thumb, we should never call reach-out functions while holding a mutex. Reach-out functions need to acquire some shared resources such as a file, a web site, a web service, a socket, a remote system, a messaging system, a shared queue, a shared-memory mapping… Most of these resources are protected and can take some amount of time to become available.
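The rule of thumb can be sketched as follows — dbget() here is a hypothetical stand-in for the vendor function, and all other names are mine:

```java
import java.util.concurrent.locks.ReentrantLock;

public class OpenCall {
    private static final ReentrantLock mu = new ReentrantLock();
    private static String sharedState = "req-42";

    // hypothetical slow reach-out function — could block for an unknown time
    static String dbget(String key) { return "row for " + key; }

    public static void main(String[] args) {
        String localCopy;
        mu.lock();
        try {
            localCopy = sharedState;      // touch shared data only while locked
        } finally {
            mu.unlock();                  // release BEFORE the reach-out call
        }
        String row = dbget(localCopy);    // slow call made with no lock held
        System.out.println(row);
    }
}
```

If dbget() blocks forever, other threads contending for mu are unaffected — the lock’s hold time is bounded by the quick local copy.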

(I remember there’s a guideline named “open call” — calling an outside method without holding a lock — but I can’t find the reference.)

That’s my understanding. What do you think?

deadlock avoidance: release all locks fail-fast

In some deadlock avoidance algorithms, at runtime we need to ensure we immediately release every lock already acquired, without hesitation, so to speak. Here’s one design my friend Shanyou shared with me.

  • Tip: limit the number of “grabbing” functions — functions that explicitly grab any lock.

Suppose a grabbing function f1 acquires a lock and calls another function f2 (a red flag to concurrency designers). In such a situation, f2() will use a -1 return value to indicate “must release lock”. If f2() calls a grabbing function f3(), and f3() fails to grab, it propagates the “-1” to f2 and f1.

  1. f3() could use trylock.
  2. f3() could check lock hierarchy and return -1.
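A sketch of the fail-fast idea using java’s ReentrantLock.tryLock() — the two-lock function and all names are hypothetical:

```java
import java.util.concurrent.locks.ReentrantLock;

public class FailFast {
    // grab two locks; on ANY failure, release everything already acquired
    static boolean doBoth(ReentrantLock lockA, ReentrantLock lockB) {
        if (!lockA.tryLock()) return false;        // f1 fails fast, holding nothing
        try {
            if (!lockB.tryLock()) return false;    // f3 propagates "-1": finally releases lockA
            try {
                return true;                       // both held: do the real work here
            } finally {
                lockB.unlock();
            }
        } finally {
            lockA.unlock();                        // released on every path, success or failure
        }
    }

    public static void main(String[] args) {
        ReentrantLock a = new ReentrantLock(), b = new ReentrantLock();
        System.out.println(doBoth(a, b));          // uncontended, so both grabs succeed
    }
}
```

The caller of doBoth() can retry later (possibly after a randomized back-off), which breaks the incremental-acquisition condition of deadlock.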

if thread fails b4 releasing mutex #CSY

My friend Shanyou asked:

Q: what if a thread somehow fails before releasing mutex?

I see only three scenarios:

  • If machine loses power, then releasing mutex or not makes no difference.
  • If process crashes but the mutex is in shared memory, then we are in trouble. The mutex will be seen as forever in-use. The other process can’t get this mutex. I feel this could be a practical problem, with practical solutions like reboot or process restart.
  • If process is still alive, I rely on stack unwinding.

Stack unwinding is set up by the compiler. The only situation where this compiler-generated stack unwinding is incomplete is when the failing function is declared noexcept. (In such a case, the failure is your self-inflicted problem, since you promised the compiler it would never throw an exception.) I will assume we don’t have a noexcept function. Therefore, I assume stack unwinding is robust and all stack objects will be destructed.

If one of the stack objects is a std::unique_lock, then compiler guarantees an unlocked status on destruction. That’s the highest reliability and reassurance I can hope for.
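The closest java analogue of that unique_lock guarantee is try/finally, which exception-driven stack unwinding always honors (sketch; all names mine):

```java
import java.util.concurrent.locks.ReentrantLock;

public class UnwindSafe {
    public static void main(String[] args) {
        ReentrantLock mutex = new ReentrantLock();
        try {
            mutex.lock();
            try {
                throw new RuntimeException("thread fails before releasing");
            } finally {
                mutex.unlock();   // the java analogue of unique_lock's destructor
            }
        } catch (RuntimeException e) {
            // even after the failure, the mutex is free for other threads
            System.out.println("unlocked: " + !mutex.isLocked());
        }
    }
}
```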

updates2shared-mutable: seldom flushed2memory immediately #CSY

  • memory fence c++
  • memory barrier c++
  • c++ thread memory visibility


You can search for these keywords on Google. Hundreds of people would agree that without synchronization, a write to sharedMutableObject1 by thread A at Time 1 is not guaranteed to be visible to Thread B at Time 2.
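The visibility failure itself is timing-dependent and hard to demo reliably, so this hedged java sketch shows the fix instead: a volatile flag supplies the fence, guaranteeing the reader eventually sees the write (names mine):

```java
public class VisibleFlag {
    // volatile inserts the fences, so the writer's update is published to the reader
    private static volatile boolean stop = false;

    public static void main(String[] args) throws InterruptedException {
        Thread reader = new Thread(() -> {
            while (!stop) {
                // without volatile, this loop might never observe the write
            }
            System.out.println("saw the update");
        });
        reader.start();
        Thread.sleep(50);   // let the reader spin briefly
        stop = true;        // the volatile write at "Time 1"
        reader.join();
    }
}
```

Drop the volatile keyword and the JIT is allowed to hoist the read out of the loop, so the reader may spin forever on its cached value — exactly the failure described above.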

In any aggressively multithreaded program, there are very few shared mutable objects. If there’s none by design, then all threads can operate in single-threaded mode as if there’s no other thread in existence.

In single-threaded mode (the default), compilers would NOT generate machine code to always flush a write to main memory, bypassing the register/L1/L2/L3 caches. Such a memory barrier/fence is extremely expensive — main memory is at least 100 times slower than register access.

I would hypothesize that by default, the most recent write (at Time 1) is saved to a register only, not L1 cache, because at compile time the compiler doesn’t know whether this same thread will soon write to the same address again. If you update this object very soon, it’s wasteful to flush the intermediate, temporary values to L1 cache, since no other thread needs to see them.

L1 cache is about 10 times slower than register.

Multi-threaded lock-free programming always assumes multiple threads access shared mutable objects.

Even a lock-free function without contention [1] requires memory barriers and is therefore slower than single-threaded mode. I would say in a low-contention context, the main performance gain of single-threaded over lock-free is data cache efficiency. Another performance gain is statement reordering.

[1] i.e. no retry needed, since other threads are unlikely to touch sharedMutableObject1 concurrently

c++condVar 2 usages #timedWait

My blogpost “poll() as timer ] real time C : industrial-strength #RTS” is somewhat similar. I single out two distinct usages:

1) notification
2) timed wait — often forgotten. std::condition_variable::wait_for() takes a std::chrono::duration parameter, which supports nanosec precision.

Note java wait() also has nanosec precision.
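A minimal java sketch of the timed-wait usage with nanosec precision, using Condition.awaitNanos() — the while-loop guards against spurious wake-ups (names mine):

```java
import java.util.concurrent.TimeUnit;
import java.util.concurrent.locks.Condition;
import java.util.concurrent.locks.Lock;
import java.util.concurrent.locks.ReentrantLock;

public class TimedWait {
    public static void main(String[] args) throws InterruptedException {
        Lock lock = new ReentrantLock();
        Condition cond = lock.newCondition();
        lock.lock();
        try {
            long nanos = TimeUnit.MILLISECONDS.toNanos(50);
            while (nanos > 0) {
                // awaitNanos returns the remaining time; <= 0 means the deadline passed.
                // Nobody signals cond here, so we always reach the deadline.
                nanos = cond.awaitNanos(nanos);
            }
            System.out.println("timed out");
        } finally {
            lock.unlock();
        }
    }
}
```

This is the java counterpart of wait_for(): a notification-or-deadline wait, useful as a poor man’s timer tied to a condition.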

std::condition_variable::wait_until() can be useful too, featured in my proposal RTS pbflow msg+time files #wait_until

c++11 atomic<int> #Bool can be half-written

The #1 usage of atomic<int> is load() and store(). I will use short form “load/store” or “l/s”.

The #2 usage is CAS. Interviewers are mostly interested in this usage, though I won’t bother to remember the function names —
* compare_exchange_strong()
* compare_exchange_weak()

The CAS usage is same as java’s AtomicInteger, but the load/store usage is more like the thread-safety feature of java’s volatile. To see the need for load/store, we need to realize the simple “int” type assignment is not atomic [1]:

  • P1012 [[c++ standard library]] shocked me by affirming that without locks, you can read a “half-written Boolean” [1].

To solve this problem, atomic<int> may use internal locks to ensure load() and store() are always atomic.

[1] different from java. The java language spec points out that a 32-bit int in java is never “half-written”. If you read a shared mutable int in java, you can hit a stale value but never a c++-style half-written value. Therefore, java doesn’t need guarded load()/store() functions on an integer.

Q: are these c++ atomic types lock-free?
A: for load/store — not lock-free. See P1013
A: for CAS — lock-free CPU instructions are used, if available.

after fork(): threads,sockets.. #Trex

I have read about fork() many times without knowing these details, until Trex interviewer asked !

— based on the fork(2) man page:

The child process is created with a single thread — the one that called fork(). The entire virtual address space of the parent is replicated in the new process, including the states of pthread mutexes, pthread condition variables, and other pthreads objects. In particular, if in the parent process a lock was held by some other thread t2, then the child process has only the main thread (which called fork()) and no t2, but the lock is still unavailable. This is a common problem, conventionally addressed via pthread_atfork().

The very 1st instruction executed in the Child is the instruction right after fork().

The child inherits copies of the parent’s set of open file descriptors, including stdin/stdout/stderr. Child process should usually close them.

Special case — socket file descriptors are inherited too.


## Y avoid blocking design

There are many contexts. I only know a few.

1st, let’s look at a socket context. Suppose there are many (like 50 or 500) sockets to process. We don’t want 50 threads. We prefer fewer — perhaps 1 thread to check each “ready” socket, transfer whatever data can be transferred, then go back to waiting. In this context, we need either

  • /readiness notification/, or
  • polling
  • … Both are compared on P51 [[TCP/IP sockets in C]]

2nd scenario — GUI. Blocking a UI-related thread (like the EDT) would freeze the screen.

3rd, let’s look at some DB request client. The request thread sends a request and it would take a long time to get a response. Blocking the request thread would waste some memory resource but not really CPU resource. It’s often better to deploy this thread to other tasks, if any.

Q: So what other tasks?
A: ANY task, in the thread pool design. The requester thread completes the sending task, and returns to the thread pool. It can pick up unrelated tasks. When the DB server responds, any thread in the pool can pick it up.

This can be seen as a “server bound” system, rather than IO bound or CPU bound. Both the CPU task queue and the IO task queue get drained quickly.


fileHandle/socket/dbConnection are thread-unsafe

There’s a buffer in a DB connection, in a file handle, in a socket …

The buffer is a shared mutable object. Consider a read-buffer. The host object knows how much of the data in the buffer was already delivered, to avoid repeated delivery. There’s some kind of watermark, which is moved by a consumer thread.

As all shared mutables, these objects are thread unsafe.

All of these objects can also be allocated on stack and therefore invisible to other threads — this could be the basis of a thread-safe design.


mutex: allocated in heap or global memory

In java, a mutex always lives on the heap. Primitive variables can’t host a mutex.

In pthreads, mutex object can be allocated anywhere, but I have seen it allocated only in heap or in global area.

  • in global area, you use a one-liner to allocate and initialize a non-ref variable
  • on heap, you first malloc then call pthread_mutex_init(). I think this is like a constructor.
    • As a special case, you can also allocate mutex in shared memory, creating a cross-process mutex! Not a common practice in java.

Q: what if I allocate a mutex in stack?
A: in java, the actual object is still on heap, though the variable is on stack and invisible to other threads
A: in C, the entire mutex object can be on stack, but such a mutex is useless. Imagine a lock on a door with a single key to be shared, but other people have their own doors — so there’s no synchronization, no access control at all 😦

Single-Threaded^STM #jargon

SingleThreadedMode means each thread operates without regard to other threads, as if there’s no shared mutable data with other threads.

Single-Threaded can mean

  • no other thread is doing this task. We ensure there’s strict control in place. Our thread still needs to synchronize with other threads doing other tasks.
  • there’s only one thread in the process.

wait() needs while-loop #spurious

Across all languages, I have never seen an exception to this rule. You must enclose wait() in a while-loop.

C++11 has a syntactic sugar — wait(myLock, myPredicate), hiding the while loop.

Q: why is spurious wake unavoidable? [[Josuttis]] P1004 has a concise answer:
A: the thread library (not the operating system) can’t reliably suppress every stray wake-up, so to play safe, the thread library may wake the waiting thread without a notification.

— I wrote about java:

Wait() is always awakened by a notification message sent to the monitor object.

Between the notification thread releasing the lock and the waking thread acquiring, a third thread can modify some shared data[1], and the condition you are waiting for disappears as a result.

Therefore the waking thread needs to recheck the condition. if() won’t do [2]; while() is required.

If you know there is no 3rd thread messing with the condition, then the waking thread can assume the condition is still intact. However, this design unnecessarily weakens code extensibility and re-usability. It’s fairly easy to put a while() loop around a wait(). No good reason not to do it.

[2] This kind of bug is intermittent.
[1] in a database for example (a crude example)

condition.signalAll Usually requires locking

Across languages,

  • notify() doesn’t strictly need the lock;
  • wait() always requires the lock.
    • c++ wait() takes the lock as argument
    • old jdk uses the same object as lock and conditionVar
    • jdk5 makes the lock beget the conditionVar

[[DougLea]] P233 points out that pthreads signal() doesn’t require the lock being held

—- original jdk
100% — notify() must be called within a synchronized block, otherwise IllegalMonitorStateException.
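A one-line demonstration of that strictness (class name mine):

```java
public class NotifyNoLock {
    public static void main(String[] args) {
        Object monitor = new Object();
        try {
            monitor.notify();  // calling notify() without holding the monitor...
        } catch (IllegalMonitorStateException e) {
            System.out.println("IllegalMonitorStateException");  // ...fails fast
        }
    }
}
```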

—- jdk 5 onwards
99% — the javadoc says that “An implementation may (and typically does) require that the lock be held by the notifying thread.”

Not 100% strict.

— compare await():
100% — On the same page, the await() javadoc is 100% strict on this requirement!

0% — the c++ std::condition_variable::notify_all() method does NOT require holding the lock.

–compare wait()
100% — wait(myLock) requires a lock as argument!

100% — Similar to old JDK, the notifying thread must hold the lock.



See also blog on c++ atomic<int> — primary usage is load/store, not CAS

lock-free isn’t equal to CAS — lock-free constructs also include Read-Copy-Update, widely used in linux and other kernels.

atomic isn’t equal to lock-free — for example, the c++ atomic classes (actually templates) can use locks if the underlying processor lacks a CAS instruction.


read-copy-update lockfree +! retry #RCU

RCU is an advanced concurrency construct — not implemented in a regular application, but valuable for lock-free interviews. There is a “simple Userspace” java implementation; I didn’t read it in detail and assume it’s rather different from the linux kernel RCU. A write-up by a published author has a brief mention of the Userspace RCU solution to the ABA problem, and also touches on Garbage Collection. The RCU inventor’s own papers make it clear that RCU is non-trivial!

The Wikipedia article is accessible until the paragraph introducing read-side-critical-section and grace-period. This is the first paragraph on the implementation. I found it hard. Therefore a few background pointers:

  • There must be multiple threads, typically many readers and few writers.
  • There must be a shared mutable data structure, probably on heap.
  • Not all data structures are suitable for RCU. So which subset is? I would say pointer graphs, including hash tables.

In the simplified model, we are talking about

  • A reader thread R1 executing a block of code that has started reading the shared mutable data structure. This code block is the critical section.
  • A writer thread W1 executing a block of code attempting to update the same data.
  • After the so-called grace period, a GC thread would reclaim the obsolete data that R1 has finished reading.
  • A reader R2 entering the critical section after the update is done. R2 would see the new data.

The GC needs to know when it’s safe to reclaim. It needs the wait-for-readers operation and the grace period.

Q: What if R1 gets a reference to the “obsolete node”, performs some slow task, reads the node, then exits the critical section? This node is reclaimable only after R1 exits the critical section. Wouldn’t the grace period be long?

Q: Among 9999 nodes, how does the system keep track of which nodes are reclaimable?
%%A: I feel kernel access to CPU registers might be needed.

Q: How does it compare with copy-on-write?
A: COW probably copies the entire data structure. Also, the modified copy is not visible to subsequent readers (like R2)

Q: How does it compare with read/write lock?
A: readlock can block

Q: How does it relate to lock-free concepts?
A: reader threads are lock-free and even better — not required to retry.
A: GC threads need to wait.
A: Writer thread (like W1)? Not sure. Might be lock-free in some situations.

where are jvm locks stored@@ #cf Windows

A 2017 Morgan Stanley on-site interviewer asked…

  • %%A: can’t be on stack, since multiple threads need access to it
  • %%A: probably kernel objects. On Windows, most of the synchronization objects are wrappers over kernel objects. See [[Optimized c++]] P288.
    • If java threads map to kernel threads, then I would think the same.
    • If java threads are implemented in userland, then the kernel scheduler is unaware and can’t use kernel objects to control them, so I would think the JVM must have its own mutex objects without kernel support. One article explains that only embedded or low-power devices use “green threads” i.e. userland threads. Linux and Windows JVMs probably don’t support green threads, since the kernel provides “better” thread support.

jargon: mutex^lock^monitor

All these terms appear in function names and class names. They must have well-defined technical meanings. Problem is, each platform has its own definitions. [[Optimized c++]] P288 singled out the “Semaphore” jargon as rather different animals across operating systems.

–In Java

  • Monitor refers to the “combo” object serving as both lock and condition variable. Java 5 introduced an alternative family of objects, where a lock object gives birth to condition objects. Most java developers still use the simpler combo object.
  • I feel lock vs mutex mean the same.

–in c++

  • Lock is an RAII object, based on a mutex.
  • C++ doesn’t use “monitor” jargon as far as I know.

–On Win32


readLock +! writeLock – same as no lock@@

Q9: With regard to shared-mutable access semantics, is using a readLock while discarding the writeLock (i.e. never using it) the same, semantically, as using no lock at all?

This is a theoretical discussion. The implementations are not familiar to me and not relevant here. Before we answer Q9, let’s warm up with simpler questions.

Q1: in the absence of shared-mutable, do we need any lock?
A1: no

Q1b: what if we use a readLock without any writeLock?
A1b: should be fine.

Q3: what if we do use a lock in a shared function?
A3: system behavior is modified — serialized access, which is unnecessary

A9: should be semantically identical. Everyone gets a free ticket at entrance.
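To illustrate A9 in java — a minimal sketch (class and method names are mine) showing that a ReentrantReadWriteLock readLock, with the writeLock never used, excludes nobody: two reader threads hold it simultaneously, so everyone does get a free ticket at the entrance.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicBoolean;
import java.util.concurrent.locks.ReentrantReadWriteLock;

public class ReadLockDemo {
    // Proves two threads can hold the read lock at the same time:
    // each thread grabs the read lock, then waits (inside it) for the other to arrive.
    public static boolean twoReadersOverlap() {
        ReentrantReadWriteLock rw = new ReentrantReadWriteLock();
        CountDownLatch bothInside = new CountDownLatch(2);
        AtomicBoolean overlapped = new AtomicBoolean(false);
        Runnable reader = () -> {
            rw.readLock().lock(); // shared lock: does NOT exclude other readers
            try {
                bothInside.countDown();
                // this await can only succeed if both threads hold the read lock together
                if (bothInside.await(2, TimeUnit.SECONDS)) overlapped.set(true);
            } catch (InterruptedException ignored) {
            } finally {
                rw.readLock().unlock();
            }
        };
        Thread t1 = new Thread(reader), t2 = new Thread(reader);
        t1.start(); t2.start();
        try { t1.join(); t2.join(); } catch (InterruptedException ignored) {}
        return overlapped.get();
    }

    public static void main(String[] args) {
        System.out.println("readers overlapped: " + twoReadersOverlap());
    }
}
```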

mutex^condition top2constructs across platforms #except win32

One more system programming manual asserts the fundamental importance of the mutex and the condition variable.

NSPR started with java in mind, but its primary application is supporting clients written entirely in C or C++.

I have said many times in this blog that most thread synchronization features are built on top of this duo.

However, on Windows the fundamental constructs are locks and waitHandles…

Y more threads !! help`throughput if I/O bound

To keep things concrete, you can think of the output interface as the I/O in question.

The paradox — given an I/O bound busy server, conventional wisdom says more threads could increase CPU utilization [1]. However, the work queue for the CPU gets quickly /drained/, whereas the I/O queue is constantly full, as the I/O subsystem is working at full capacity.

Can we find threads doing CPU-only-zero-I/O tasks?

[1] In a CPU bound server, adding 20 threads will likely create 20 idle, starved new threads!

Holy Grail is simultaneous saturation. Suggestion: “steal” a cpu core from this engine and use it for unrelated tasks. Additional threads basically achieve that purpose. In other words, the cpu cores aren’t strictly dedicated to this engine.

If the CPU cores are dedicated, then there’s no way to improve throughput without adding more I/O capacity. At a high level, I see too much CPU /overcapacity/.

Assumption — adding more I/O hardware is not possible. (Instead, scaling out to more nodes could help.)

async – 2 threads 2 different execution contexts


Sync call is simple — the output of the actual operation is processed on the same thread, so all the objects on the stack frame are available.

In an async call, the RR (request) thread fires and forgets. “Firing” means registering the request and its callback.

The output data is processed later, usually not on the RR thread. Therefore the host object of the callback, and other objects on the RR stack frame, are not available.

If one of them is made available to the callback, then the object must be protected from concurrent access.

memory fencing: terminology

Let’s standardize our terminology in this blog to ease searching — “fence instruction” and “memory-fencing” are more flexible words than “barrier”… also shorter.

Visibility-of-update is part of memory fencing.

Q: does memory fencing imply flush to main memory and bypassing register?
A: I think memory fencing is more about statement reordering and less about “flush to main memory”. All of these affect global visibility of a shared mutable.


threading – I/O^CPU intensive

See also — [[linux sys programming]] has a one-pager on this topic.
This is a common topic in IV and in the literature. The HSBC market data + algo trading IV in 2017 also touched on it.
IO-intensive – may scale up to hundreds of threads, even with just 4 cores. Each thread handles some I/O channel or connection.
eg(?): network server
eg: GUI – multiple threads could be waiting for disk or user input
CPU-intensive on multi-core machines – don’t need too many threads, but single-thread is non-optimal. Each core is effectively dedicated to a single thread.

how likely are two threads sharing a CHM to clash@@

I would also like to point out that the java ConcurrentHashMap lets two threads (possibly a writer thread and a reader thread) access the shared data structure concurrently, probably without locking or CAS, whenever the two threads happen to access two distinct segments.

Note there can be a large number of segments, up to a few hundred at least, so the chance of two threads hitting distinct segments is very high (about 99.7% for a 333-segment map). Therefore, contrary to what Jun said, for reader/writer threads to access the same data structure, we don’t always need locking on every read and write access.

concurrencyLevel – the estimated number of concurrently updating threads. The implementation may use this value as a sizing hint.
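A minimal java sketch (class and method names are mine) of two threads updating disjoint key ranges of a ConcurrentHashMap concurrently — no external locking, and no updates are lost:

```java
import java.util.concurrent.ConcurrentHashMap;

public class ChmDemo {
    // Two writer threads touch disjoint keys (even vs odd) of one shared CHM.
    // Returns the final map size; every put must survive.
    public static int parallelPut(int perThread) {
        // concurrencyLevel hint of 16 — the sizing hint mentioned in the javadoc
        ConcurrentHashMap<Integer, Integer> map = new ConcurrentHashMap<>(64, 0.75f, 16);
        Thread even = new Thread(() -> { for (int i = 0; i < perThread; i++) map.put(2 * i, i); });
        Thread odd  = new Thread(() -> { for (int i = 0; i < perThread; i++) map.put(2 * i + 1, i); });
        even.start(); odd.start();
        try { even.join(); odd.join(); } catch (InterruptedException ignored) {}
        return map.size();
    }

    public static void main(String[] args) {
        System.out.println("entries: " + parallelPut(1000));
    }
}
```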

c++atomic^volatile #fencing

See other posts about volatile, in both c++ and java.

For a “volatile” variable, the c++ (unlike java) compiler is not required to introduce a memory fence, i.e. updates by one thread are not guaranteed to be visible globally. See Eric Lippert’s writings on volatile.

Based on [[effModernC++]]:

                                      atomic             volatile
optimization: reordering              prevented          permitted 😦
atomicity of RMW operations           enforced           not enforced
optimization: redundant-access skip   permitted          prevented
cross-thread visibility of update     probably enforced  no guarantee [1]
memory fencing                        enforced           no guarantee


##thread cancellation techniques: java #pthread,c#

Cancellation is required when you decide a target thread should be told to give up halfway. Cancellation is a practical technique, too advanced for most IV.

Note in both java and c#, cancellation is cooperative. The requester (on its own thread) can’t force the target thread to stop.

C# has comprehensive support for thread cancellation (CancellationToken etc). Pthreads also offers a cancellation feature. Java uses a number of simpler constructs, described concisely in [[thinking in java]]. Doug Lea discussed cancellation in his book.

Here are the java techniques

  • interrupt
  • loop polling – the preferred method if your design permits.
  • thread pool shutdown, which calls thread1.interrupt(), thread2.interrupt() …
  • Future — myFuture.cancel(true) can call underlyingThread.interrupt()

Some blocking conditions are clearly interruptible — indicated by the compulsory try block surrounding the wait() and sleep(). Other blocking conditions are immune to interrupt.

NIO is interruptible but the traditional I/O isn’t.

The new Lock objects support lockInterruptibly(), but the traditional synchronized() lock grab is immune to interrupt.
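A minimal java sketch (class/method names mine) of that contrast — a thread parked inside lockInterruptibly() is woken by interrupt(), something a synchronized grab would simply ignore:

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.locks.ReentrantLock;

public class InterruptibleLockDemo {
    // Main thread holds the lock for the whole test; a second thread blocks in
    // lockInterruptibly() and is woken by interrupt() instead of waiting forever.
    public static boolean blockedGrabIsInterruptible() {
        ReentrantLock lock = new ReentrantLock();
        CountDownLatch interrupted = new CountDownLatch(1);
        lock.lock(); // never released until the end, so t can't acquire it
        Thread t = new Thread(() -> {
            try {
                lock.lockInterruptibly(); // parks here (or throws at once if already interrupted)
                lock.unlock();            // not reached in this demo
            } catch (InterruptedException e) {
                interrupted.countDown();  // interrupt() kicked us out of the blocking grab
            }
        });
        t.start();
        // crude but adequate: give t time to park inside lockInterruptibly()
        try { Thread.sleep(200); } catch (InterruptedException ignored) {}
        t.interrupt();
        try { t.join(2000); } catch (InterruptedException ignored) {}
        lock.unlock();
        return interrupted.getCount() == 0;
    }

    public static void main(String[] args) {
        System.out.println("interrupted while blocked: " + blockedGrabIsInterruptible());
    }
}
```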

threading support in compiler^library

Not sure which category POSIX threads falls into. An OS-level library?

Java (followed by c#) absorbed many concurrency features into the compiler, by adding language keywords. Is this related to the VM? I think so. Is the library approach still necessary in the presence of a VM? Yes — it is confirmed that the JVM uses pthreads on linux and macOS.

Python, perl and most languages have no VM and use the common add-on “library” approach. The library is usually a wrapper over C threading library.

calling unknown code while hold`(or Not hold`) lock #XR

In c++, if the unknown code does a fork(), then all locks are replicated into the newborn process. See the post “after fork(): threads”.

[[eff c#]] warns against calling “unknown” code while holding a lock. This advice applies to java and c++ too.

Now, what defines “unknown”? Any delegate can have an invocation list containing unknown callbacks. Raising an event involves delegate invocation. If you receive an Action instance, realize it’s a delegate instance too.
Another category of “unknown code” might be implementations of an interface. You get an IComparable instance, and you just call CompareTo() without knowing the implementation.
Suppose within obj1.method1() we grab a lock and then call the unknown code. Such unknown code could circle back to call obj1.method2(). If method2() also grabs the same lock, then no issue, thanks to c# reentrancy. Note c# reentrancy is implemented by counting acquisitions.
If method2 is on another thread, then we get a classic single-lock deadlock — method2 would be blocked by the lock held by method1(), hopefully for a short time only. If method1() is unfortunately waiting for method2 to finish its work on the other thread, then the deadlock would persist.
Even if there’s no locking involved, method2 may circle back to invoke method1 (on the same or a separate thread, on the same obj1 or another object), recursively. If this recursion is unplanned (like unplanned pregnancy:), then stack overflow may ensue.
Dumb luck if unintentional recursion doesn’t hit stack overflow.
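A minimal java sketch (names mine) of the benign same-thread case above — the “unknown” callback circles back into a synchronized method on the same object, and reentrancy (acquire counting) saves us from self-deadlock:

```java
public class ReentrantCallbackDemo {
    private int depth;

    // method1 holds the monitor while running unknown code
    public synchronized String method1(Runnable unknownCode) {
        unknownCode.run(); // the callback circles back into method2() on the SAME thread
        return "done";
    }

    public synchronized void method2() {
        // same thread re-acquires the same monitor: acquire count goes 1 -> 2, no deadlock
        depth++;
    }

    public static boolean sameThreadCallbackIsSafe() {
        ReentrantCallbackDemo obj = new ReentrantCallbackDemo();
        return "done".equals(obj.method1(obj::method2));
    }

    public static void main(String[] args) {
        System.out.println("no deadlock: " + sameThreadCallbackIsSafe());
    }
}
```

If method2() ran on a second thread instead, the same callback would block on the monitor, as the post describes.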

Producer-Consumer threading question by YH

This example (published on my blog) presents a simple but tricky threading scenario. Without the sleep(), we get random output like

Putting : 0
Putting : 1
Getting : 0
Q: how could “Putting 1” happen before “Getting 0”?
Explanation — Actually, the assignment to this.hasPut does happen in lockstep like true false true false true…. (You can confirm by printing its value immediately after assignment.) This is thanks to the “synchronized” keyword. The problem is the Putting/Getting prints: these statements are outside the synchronized blocks, and can get badly delayed, as in this scenario —
put() returns -> “Putting 0” -> get() returns but consumer thread preempted before “Getting 0” -> put() returns -> “Putting 1” -> “Getting 0”
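A minimal java sketch of the fix (class and method names mine): move the prints inside the synchronized methods, so the log lines come out in the same strict lockstep as the hasPut flag:

```java
import java.util.ArrayList;
import java.util.List;

public class HandoffDemo {
    private boolean hasPut = false;
    private int slot;
    private final List<String> log = new ArrayList<>(); // guarded by the monitor

    public synchronized void put(int v) throws InterruptedException {
        while (hasPut) wait();
        slot = v;
        hasPut = true;
        log.add("Putting : " + v); // "print" INSIDE the lock -> no out-of-order lines
        notifyAll();
    }

    public synchronized int get() throws InterruptedException {
        while (!hasPut) wait();
        hasPut = false;
        log.add("Getting : " + slot);
        notifyAll();
        return slot;
    }

    // runs a producer and a consumer for n hand-offs, returns the ordered log
    public static List<String> run(int n) {
        HandoffDemo d = new HandoffDemo();
        Thread producer = new Thread(() -> {
            try { for (int i = 0; i < n; i++) d.put(i); } catch (InterruptedException ignored) {}
        });
        Thread consumer = new Thread(() -> {
            try { for (int i = 0; i < n; i++) d.get(); } catch (InterruptedException ignored) {}
        });
        producer.start(); consumer.start();
        try { producer.join(); consumer.join(); } catch (InterruptedException ignored) {}
        synchronized (d) { return new ArrayList<>(d.log); }
    }
}
```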

testing threaded designs – enterprise apps( !! lib)

Bottom line – (Unconventional wisdom) Be bold to create new threaded designs. It doesn’t have to be rock solid like in the standard library.

Experts unanimously agree that non-trivial MT designs are hard to verify or test, often exceedingly hard. There are often too many possibilities. Maybe a million tests pass but next test reveals a bug. Therefore peer review is the way to go.  I feel that’s the “library” view, the view from the language creators. Different from enterprise apps.

In enterprise apps, if an MT design passes load test and UAT, then it’s good enough; there’s no budget to test further. If only 1 in a million runs fails, then that case has something special — perhaps a special combination, or purely a timing coincidence. Strictly speaking those are still logical errors and design defects; a sound design ought to handle such cases gracefully, but most designs aren’t so sound. If such a failure happens so rarely, just about everyone involved would agree it’s basically impossible to catch during testing. Such a bug, if ever uncovered, would be too hard to catch, and catching it should not be in any job scope here. Missing it is tolerable and only human.

A goal keeper can’t be expected to catch 99 penalties in a row.

In the enterprise reality, such a bug is probably never uncovered (unless a log happens to provide the evidence). It often takes too much effort to investigate such a rare issue. Not worth the effort.

"preempt a thread" == move it off driver’s seat

First some jargon. The “eligible club” (as described in other posts) includes all the runnable threads that are eligible to take the driver’s seat. In contrast, a “waiting” thread is BLOCKED and suspended.

Usually there are more eligible threads than processors (even though some CPUs like the Sparc T1 can have 32 simultaneous threads on 8 cores). In other words, there are more drivers than cars.

The technical jargon “preempt” means “move an executing thread off the driver’s seat”. See P127[[headfirstC]]

[[linux sys programming]] P165 has more details.

sync primitives]pthreads^winThreads

(see also post on “top 2 threading constructs in java ^ c#”)

[[Art of concurrency]] P89 has a 3-pager on pthreads vs win threads, the 2 dominant thread libraries. It claims “… most of the functionality in one model can be found in the other.”

In pthreads (like java), the 2 “most common” controls are mutex and condition var

In win threads library (and later dotnet), the 2 essential controls are

  1. events/WaitHandles — kernel objects. Also known as event mutexes, these are comparable to condVars, according to P175 [[object-oriented multithreading using c++]]
  2. locks
    • mutex — kernel object, cross process
    • CRITICAL_SECTION — userland object, single-process

top 2 threading constructs in java ^ win32

Update: Now I think condition is not a fundamental construct in c#. The wait handle is. WH are based on primitive kernel objects…. See other posts

(Soundbite — I feel the know-how about low-level constructs is more valuable/versatile/powerful. Interviewers often recognize that. These include CAS and conditions.)

See other posts on the additional threading constructs added by dotnet …
See also my post on NSPR, a cross-platform library with heavy concurrency emphasis.

Most important, practical and popular [1] constructs —
* pthreads       — 1) locks, 2) conditions
* java              — 1) locks, 2) conditions. On a “higher” layer, timers; thread pools; queues and other thread-safe collections
* win32/dotnet — 1) locks, 2) event WaitHandle (not conditions)… Also timers, thread pools. Task is supposed to be popular too.
* the dbx debugger offers “mutex” and “condition” commands as the only 2 thread-related features (beside the “thread” command)

In Win32, there are 2 lock constructs
1a) mutex : usable cross-process , like other kernel objects
1b) CRITICAL_SECTION : single-process only, like other userland objects

In general, locks alone are sufficient for simple thread coordination. Sometimes you need fancier tools —
– In java, you can use wait/notify, which is another primitive.
– In C#, wait/notify seems less popular; WaitHandles seem to be popular. Wait handles are designed (by MS) to be a more complex but feature-rich notification /construct/ than conditions. However, experts agree conditions are more powerful. The IAsyncResult uses wait handles for inter-thread signaling.

Interlocked/atomicVar seems to be equally popular in java and c#.

I think, with the exception of CAS, all other threading constructs rely on locks and conditions. As [[c# threading]] points out, you can simulate most features of dotnet WaitHandles with simple condition techniques. However, wait handles (but not conditions) support IPC.

[1] some techniques are important and practical but poorly marketed and unpopular — wait/notify, immutable, interrupts,..

producer/consumer threading pattern #YH


You and I are big fans of the producer/consumer threading pattern. Now I realize there are drawbacks.

There is synchronization overhead since the task queue is shared mutable data. The more fine-grained the tasks, the more frequently threads add/remove on the task queue, and the more overhead.

I feel P/C is a high-level “general purpose” threading pattern, good for most everyday concurrency, but not for those extreme requirements. In a lot of high-end gigs, they look right past the standard techniques like P/C and look for specialized techniques/experience that is needed for higher performance. I think they look for lockfree, and they look for parallelism without inter-thread interference.

For example, if the (market) data floods in thick and fast, I fear P/C may suffer. Perhaps a naive design is for the “dispatcher” or “boss” thread to accumulate a chunk of data and package it into a task and add it to the queue. Such a design may make the dispatcher a hotspot or bottleneck.

The lower we descend, the less overhead is tolerated, and the less synchronization code I see. I feel the low-level coders avoid synchronization like a plague.

What’s your view?

reentrant/recursive lock always uses acquire-count, across languages

I have seen c++ (boost) and c# documentation. Now I believe that, across languages, all reentrant locks need to keep a count of repeated “acquires” by the same thread. The lock is released only when the count hits zero.

P273 [[concur programming on windows]] confirms that CLR monitors support reentrance by counting.

One Java reentrant lock (ReentrantReadWriteLock) has a max acquire count of 65535.
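A minimal java sketch (names mine) using ReentrantLock.getHoldCount() to watch the count climb and fall — the lock is only truly released when the count returns to zero:

```java
import java.util.concurrent.locks.ReentrantLock;

public class HoldCountDemo {
    // Acquires the same lock twice on one thread, then unwinds, sampling the count.
    public static int[] nestedHoldCounts() {
        ReentrantLock lock = new ReentrantLock();
        lock.lock();                        // count 1
        lock.lock();                        // same thread again: count 2, no deadlock
        int inner = lock.getHoldCount();
        lock.unlock();                      // count back to 1 — still held!
        int afterOne = lock.getHoldCount();
        lock.unlock();                      // count 0 — only now is the lock released
        int afterTwo = lock.getHoldCount();
        return new int[]{inner, afterOne, afterTwo};
    }

    public static void main(String[] args) {
        int[] c = nestedHoldCounts();
        System.out.println(c[0] + " " + c[1] + " " + c[2]);
    }
}
```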

## most popular base threading libraries

win32 – on windows
Pthreads – on *nix

Traditionally, these are the 2 most important among the base thread libraries. Unsurprisingly, all the base thread libraries are in C. As explained elsewhere on my blog, there’s no reason to package these libraries in any other language. C is the clear, overwhelming favorite. It is confirmed that the JVM uses pthreads on linux and macOS.

How many is too many threads in a process@@

By default the dotnet thread pool has 250 threads per core[1]. In dotnet, I believe all threads are native, not “green” threads manufactured by the thread library.

I read in other places that 400 threads per core is considered a soft upper limit.

Remember each core, like a car, can only have one driver thread at any time. So all other threads must be inactive — either blocked, or runnable but context-switched out (preempted).

[1] My friend told me it’s just 25 in older versions.

mutable-shared data – source of all concurrency hazard

Consistency, liveness, deadlock, mutex, synchronization … and all the concurrency complexities have a necessary conditions — mutable shared data. Without this single condition, all these headaches disappear or become unnecessary.

If all mutable data are non-shared and all shared data are immutable, then you can create a great many threads and they will run faster than under synchronization

Note “data” doesn’t imply classes. In java/c#, such data is an instance of a class or struct. In non-OO languages like C, “data” can be global variable or any heapy thingy, manipulated by a pointer.

mutex: serialize access to… data or code@@ #kernel

These are 2 ways to look at the basic mutex use case —
DD) you can use a mutex to serialize access to a piece of shared mutable data
CC) you can use a mutex to serialize passage through a code path — the so-called critical section

In the kernel, CC (not DD) is the main concern.

Many people intuitively cling on to the intuitive DD view and visualize the runtime using a mutex to regulate multiple threads’ access to shared Data, as if the mutex is a locker. However, I find CC a more realistic/accurate view.

DD is the “logical” view whereas CC is the physical view.

DD is achieved through CC.

The more low-level you go, the more you take the physical view and speak in terms of critical sections.

DD is more object oriented. For non-OO languages like C, DD is not as natural as in OO language. If a data item is passed by pointer, then any function running on any thread can modify it. It’s not enough to use a mutex on 99% of those functions and have one function access it directly.

In OO languages, that data item would Always be a private field, automatically restricting access to very few methods of the class. Now we understand why encapsulation is essential to threading.

In non-OO languages, it’s less natural to associate a mutex with a piece of data. If the shared data is passed by pointer, then the mutex must also be passed by pointer. The mutex-pointer must accompany the data-pointer.

#1 motivation/justification for threading in GUI

In most multi-threading domains, the driving force is one of throughput, latency and parallelism. The most prevalent motivation goes “processor clock speed is reaching a plateau, so the only way ahead is multi-core. Without multi-threading, the extra cores go to waste.” This is like scale-out vs scale-up. This is the resource “utilization” argument — any underutilized resource is a guilt and shame to some managers.

In GUI space, however, the real motivation is not “utilization” per se but Responsiveness.

I have seen many GUI apps that can't keep up with the live market data rate — latency.

I have seen many GUI apps that leave a long ugly trail when you move/resize a window.

Most GUI threading problems and techniques are about Responsiveness, not performance or throughput. Even “latency” is not really the right word. In latency-sensitive systems, latency is often measured in clock cycles and microseconds, sometimes sub-micro, but Responsiveness doesn't need that. Such latency techniques can lead to over-engineering.

##what bad things could happen to a thread in a pool

Do you ever wonder why threads in a thread pool can go bad and become unusable? Weblogic has long realized that. Hard to diagnose.

However, I don’t believe “voodoo” exists in a threading system. A stuck thread must be stuck in a useless loop or waiting for cpu, IO or memory. It won’t be stuck for “no reason”. Pretend to be a spy looking at the program counter inside any thread — why not moving? There are only a limited (not UNlimited) number of reasons. Here are various bad things that can happen to an innocent thread.

– in java, any uncaught exception can fall on a thread (like a rotten apple:). These are extremely common so a thread in a pool should handle these /gracefully/.

– a thread may be seen doing useful work in normal conditions, but in an edge case it gets into an endless loop doing no useful work — just wasting cpu
** Note endless loop is fine if the thread is doing useful work.

– divide by zero (perhaps terminating the thread)
– access wrong memory location and trigger a segfault. In C++ there are many pointer/memory-related errors.
– java thread can run out of memory too
– in java, a thread can be stopped due to the deprecated but supported stop() and suspend() methods. Such a thread might emerge in bad shape.
– starvation due to low thread priority.
– stack overflow, due to deep recursion (like a recursive equals() or hashCode()).

– thread could get stuck in a normal blocking operation like accept(), read(), write(), lock()/synchronized keyword, wait(), join(), long sleep. Many of these aren’t even interruptible.
** IO (disk or network) wait is notorious culprit. In this state, everything including those “kill -9” signals are ignored (see [[Unix Power Tools]]). Entire process may appear stuck in mud.
** my java thread book also says a thread blocked on a socket may not respond to interrupt.
– deadlock
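Since uncaught exceptions are the most common rotten apple, here is a minimal java sketch (class and method names mine) of the standard defense — wrap every task so a throwing task is counted/logged gracefully, and the worker thread survives to serve the next task:

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;

public class SafeTaskDemo {
    // Wraps a task so a "rotten apple" exception is counted instead of
    // propagating and killing the worker thread.
    static Runnable guarded(Runnable task, AtomicInteger failures) {
        return () -> {
            try {
                task.run();
            } catch (RuntimeException e) {
                failures.incrementAndGet(); // a real pool would log the stack trace here
            }
        };
    }

    // Returns failures*10 + completions, so 11 means: 1 caught failure, 1 later success.
    public static int runWithRottenApple() {
        AtomicInteger failures = new AtomicInteger();
        AtomicInteger completed = new AtomicInteger();
        ExecutorService pool = Executors.newFixedThreadPool(1); // one worker — easy to poison
        pool.execute(guarded(() -> { throw new RuntimeException("rotten apple"); }, failures));
        pool.execute(guarded(completed::incrementAndGet, failures)); // still gets served
        pool.shutdown();
        try { pool.awaitTermination(5, TimeUnit.SECONDS); } catch (InterruptedException ignored) {}
        return failures.get() * 10 + completed.get();
    }

    public static void main(String[] args) {
        System.out.println("result: " + runWithRottenApple());
    }
}
```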

mutex^condVar^countingSemaphore – 3 synchronization devices

I believe mutex is a primitive construct, hardware-wise, therefore elemental and low-level. It’s NOT built using condition variables or counting semaphores. In contrast,

– mutex is usually needed _inside_ counting Semaphore. See OS/2 and java. Also see solaris pthreads sem_wait() and sem_post() [1] as described on P93 [[Bil Lewis]]
– mutex is usually needed as guard _around_ Condition var. See java, pthreads [[Bil Lewis]] and boost thread.
** In fact, both work similarly. See the similarity between the wait-loops in [[Bil Lewis]].

Complicating the big-3 interdependency: when one of several “grabbing” threads is granted the contended mutex, the system may internally send signals, but that happens at a lower level than a condition var. Condition variables and counting semaphores probably depend on the same low-level signaling. This low-level signaling construct is not an API construct and not one of our big 3.

counting semaphore is less fundamental than mutex and condition variables.
* java has built-in support for M and C but not S
* OS/2 doesn’t support counting semaphores. Note that for some time, pthreads, win32 and OS/2 were the 3 dominant thread implementations.

[1] pthreads sem_wait() means “grab a permit”, sem_post() means “return my permit to the pool”.

— Jargon clean-up —
There are 2 distinct meanings of “semaphore”+ some vague meanings.
1) Counting semaphore is the first and best known semaphore API.
2) Other types of semaphores were added, notably System V IPC semaphores.
x) In some discussions, “semaphores” vaguely mean “various synchronization variables” and includes mutex, condVar and countingSemaphores.

The counting semaphore concept is fundamental and elemental, but implementation-wise, it’s always constructed using mutex and condVar.

Some people chose to write their own counting semaphore using mutex, condVar and a counter
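A minimal java sketch of such a hand-rolled counting semaphore — a mutex (synchronized), a condition (wait/notify) and a counter. Class name is mine, and the interrupt handling is deliberately simplified (a real one would propagate InterruptedException):

```java
// A counting semaphore constructed from the classic recipe: mutex + condVar + counter.
public class CountingSemaphore {
    private int permits; // the counter, guarded by the monitor of "this"

    public CountingSemaphore(int permits) { this.permits = permits; }

    public synchronized void acquire() {
        while (permits == 0) {       // condition wait, always inside a while-loop
            try {
                wait();
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt(); // simplified: swallow and keep waiting
            }
        }
        permits--;                   // "grab a permit" (cf. sem_wait)
    }

    public synchronized void release() {
        permits++;                   // "return my permit to the pool" (cf. sem_post)
        notify();                    // wake one waiter
    }

    public synchronized int availablePermits() { return permits; }
}
```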

Q: why is “lock” not in our big 3.
A: “Lock” often means mutex, but is sometimes a wrapper on a mutex, as in boost.

— sysV semaphore and shared memory
Linux supports three types of IPC mechanisms — message queues, sysV IPC semaphores and shared memory. As the Camel book points out, those are the same 3 sysV IPC constructs.

Once the memory is shared, there are no checks on how the processes are using it. They must rely on other mechanisms, for example System V semaphores, to synchronize access to shared memory.

Live hardware threads ^ kernel threads – Barrel processor

Exec summary — in one cpu cycle a Barrel-processor machine (like T2000) can have
– 8 executing threads
– 32 Live hardware threads
– hundreds of kernel threads
– thousands of userland threads
I worked for a while on the Sun T2000 server, which supports 4 cpu threads per core. The reference I read explains —

– Such a cpu-core does not allow execution of multiple instructions in one cycle. So “Simultaneous” is misleading.
– Each thread runs at roughly 1/4 the original CPU speed.

A single-tasking processor spends a lot of time idle, not doing anything useful whenever a cache miss or pipeline stall occurs

In the T2000 (8-core, 32 hardware threads), if you have 100 processes, then you have at least 100 kernel threads (software threads). Optimally, 32 of them would be Running as “live hardware-threads”, each assigned to a cpu core. In one cycle, 8 of them are executing, while the other 24 are in the pipelines inside the cpu cores.

avoid notifyAll() in low latency apps

notifyAll() is unpopular in low-latency apps. For fear of a lost signal, we ask the jvm to do a lot of useless work: you wake up everyone when you need just one of them woken up. But how do we get away with notify()? Here are 2 common conditions that should be satisfied.

* All the threads in your target wait set must be identical, so any one can get up and finish the task.
* All are waiting on the same while() condition.

Imagine some threads are waiting for a blocking queue to become non-empty, while other threads are waiting for non-full. If you notify(), the risk is that the scheduler chooses one waiting thread, but the chosen one checks its while() condition, sees it’s waiting for non-empty while the signal actually meant no-longer-full, so the signal is ignored and lost. Using jdk 5 conditionVars, you can simplify this kind of situation with multiple conditions bound to the same lock.
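A minimal java sketch (class name mine) of that jdk 5 simplification — one lock, two conditions, so a put only ever signals a taker and a take only ever signals a putter; the “wrong waiter” problem disappears:

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.concurrent.locks.Condition;
import java.util.concurrent.locks.ReentrantLock;

public class TwoConditionQueue<T> {
    private final Deque<T> buf = new ArrayDeque<>();
    private final int capacity;
    private final ReentrantLock lock = new ReentrantLock();
    private final Condition notEmpty = lock.newCondition(); // takers wait here
    private final Condition notFull  = lock.newCondition(); // putters wait here

    public TwoConditionQueue(int capacity) { this.capacity = capacity; }

    public void put(T x) {
        lock.lock();
        try {
            while (buf.size() == capacity) notFull.awaitUninterruptibly();
            buf.addLast(x);
            notEmpty.signal(); // wakes exactly one TAKER — can never be "lost" on a putter
        } finally { lock.unlock(); }
    }

    public T take() {
        lock.lock();
        try {
            while (buf.isEmpty()) notEmpty.awaitUninterruptibly();
            T x = buf.removeFirst();
            notFull.signal(); // wakes exactly one PUTTER
            return x;
        } finally { lock.unlock(); }
    }
}
```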

offload non-essential processing to helper threads

Update — one design uses a queue for logging requests.

In many real time trading systems, a novice threading developer may decide to dispatch some non-essential logging[1] logic to a task queue backed by a thread pool, without measuring the resulting latency. I don’t always have time to profile such trivial code, so I am not sure if latency might be better if we simply execute the logging code in the current “request processing” thread.

However — here’s the key point — even if separate threading doesn’t help in this case, it’s still important to know how to use threading in similar contexts.

I’d rather have a developer with threading know-how even though we don’t need it now, than a developer uninitiated to the intricacies of threading.

If you are wondering why the fuss over non-essentials, here’s an example — In some of the busiest, most time-critical trading engines in the exchanges/ECNs, there’s just a single thread sorting/matching quotes for a given symbol. This thread has to be lean and mean. All non-essentials must be delegated. We just mentioned a naive Design #1.

Design #2) Another common technique is messaging. Send a message with all the details, and leave the rest to the receivers. I feel this is slower than threading. Serialization, data copy, network bottleneck, socket handshake. Fastest thread should be network-free and disk-free.

Design #3) Another producer/consumer design uses a pre-allocated (circular) array of Requests. The main thread just sets a flag to mark a request as executed and moves on. Some consumer threads would clean it up and remove it from the array. Typically, we have a single producer and a configurable bunch of consumers, just like a thread pool. If there are enough consumer threads, then the buffer [2] will be almost empty. I feel it’s helpful if the buffer fits into CPU cache.

Note all the designs are producer/consumer.

[1] or other non-essential work such as serialization, persistence, replication, comparison, verification, DB query, or web download.

[2] You can call it a task queue, but physically it’s really a simple memory buffer, the simpler the faster.
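A minimal java sketch of the Design #1 idea (class names mine; a plain LinkedBlockingQueue stands in for the fancier ring buffer) — the hot path merely enqueues, and a helper thread does the slow logging work:

```java
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.atomic.AtomicInteger;

public class AsyncLogger {
    private static final String POISON = "__SHUTDOWN__"; // hypothetical shutdown sentinel
    private final BlockingQueue<String> queue = new LinkedBlockingQueue<>();
    private final AtomicInteger written = new AtomicInteger();
    private final Thread helper;

    public AsyncLogger() {
        helper = new Thread(() -> {
            try {
                String msg;
                while (!(msg = queue.take()).equals(POISON)) {
                    written.incrementAndGet(); // stand-in for the slow disk/network write
                }
            } catch (InterruptedException ignored) {}
        });
        helper.start();
    }

    // hot path: just enqueue and return — the request-processing thread stays lean
    public void log(String msg) { queue.offer(msg); }

    // drains the queue (poison pill after all messages), returns how many were written
    public int shutdownAndCount() {
        queue.offer(POISON);
        try { helper.join(5000); } catch (InterruptedException ignored) {}
        return written.get();
    }
}
```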

producer/consumer – fundamental to MT(+MOM/trading) apps

I hardly find any non-trivial threading app without some form of producer/consumer.

#1) Async — always requires a buffer in a P/C pattern.

# dispatcher/worker — architectures use a task-queue in a P/C pattern.
# thread pools — all (java or C++) use a task-queue in a P/C pattern.
# Swing — EDT comes with a event queue in a P/C pattern
# mutex — Both producer and consumer need write access to a shared object, of size 1 or higher. This always needs a mutex.
# condVar — is required in most cases.

uncontended lock grab !! as cheap as unsynchronized #pthread

Some java authorities claim that uncontended lock acquisition is so cheap that it's almost “free” — comparable to no locking at all. I see opposite evidence.

Evidence: java Vector and HashTable are superseded by no-lock classes. Even if a HashTable object is never shared by 2 threads, the uncontended synchronization is still considered expensive.

Evidence: the pthreads standard recognizes that the synchronized version of getc() results in a “serious” performance hit for single-threaded apps, and added 4 unsynchronized functions — getc_unlocked(), getchar_unlocked(), putc_unlocked() and putchar_unlocked().

getchar_unlocked — same as getc_unlocked(stdin)

These are like the ArrayList and HashMap in java.

consistency and liveness hazards – largely theoretical

The primary tension/conflict in multi-threading — the twin challenges of consistency vs liveness — is seldom seen as a practical challenge. This challenge is supposed to be the only show in town, but it has an empty audience. After years of practice, I now see a few reasons.

Reason: Developers stick to simple or (a tiny number of) well-tested concurrency constructs. For example, swing developers use lots of threads often without knowing. Threads are indispensable to servlet and JDBC. Even more important roles are given to threads in the MOM arena. No threading knowledge required though.

Reason: Creating a new concurrency construct is considered tricky, unproven, hard to sell, unjustified, low ROI, so few developers try.

Reason: When we do try, we seldom confront the twin challenge. We focus on getting the pieces to work together.

Reason: When concurrency bugs do exist, they stay dormant practically “forever”. I have never seen a test on Wall St that actually revealed a concurrency bug. Concurrency Bugs are discovered by code review.

immutables and their total costs of ownership

I always maintain that one of the practical cures for deadlock is immutable object, but there are several costs.

* every time you deep-clone, you must ensure all the objects cloned are immutable.
* every method and constructor taking in an object arg must deep-clone that object.
* every method returning an object must deep-clone the object. Note Java strings are inherently immutable
* deep-cloning usually requires a massive chain reaction. If the MyImm class has a pointer field of type MyImm, then deep-cloning will clone a linked list. Immutability means object state is immutable. If your definition of “state” includes that MyImm field, then the entire linked list is part of the state of the head node! Cloning could lead to stack overflow.

* It’s easy to overlook a detail, and unknowingly create an almost-immutable, with (possibly disastrous) consequences in production
* immutable classes are harder to extend, limiting reusability, flexibility and usefulness.
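A minimal sketch of the ownership costs above, using a hypothetical MyImm class with one mutable Date field — every object going in and coming out must be defensively cloned:

```java
import java.util.Date;

// hypothetical immutable class illustrating the deep-clone costs
public final class MyImm {
    private final int id;
    private final Date stamp; // Date is mutable, so it must be defensively copied

    public MyImm(int id, Date stamp) {
        this.id = id;
        this.stamp = new Date(stamp.getTime()); // clone on the way in
    }

    public Date getStamp() {
        return new Date(stamp.getTime()); // clone on the way out
    }

    public int getId() { return id; }
}
```

A caller who mutates the Date it passed in (or the Date it got back) can’t corrupt the MyImm instance — that’s the payoff for all the cloning.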

can’t avoid threads in socket, GUI, MOM…

I wrote about this topic in other posts but here’s a new angle. Forget about multicore hardware or the low latency arms race. A few application domains were born (and remain inherently) multi-threaded. A single-threaded design would require a redesign, if it is feasible at all.

+ Domain – GUI (swing, wpf ..) — EDT is a super busy thread. Any long-running task would need a helper thread.
+ Domain – blocking socket — accept() would get the main thread blocked. read() and write() would also block a worker thread.
+ Domain – nonblocking socket — I believe you still need multiple threads. The entire design of sockets presumes multi-threading.
+ Domain – MOM — onMsg() would get one dedicated thread blocked.
+ Domain – DB server — 2 clients can keep their transactional sessions open, and would need 2 dedicated threads. Contrast —
– Domain – stateless web server — which need not maintain client state. In this case, a single thread can service many users off a request queue (at least in theory).
– Domain – multi-user Operating System — at least one session per user, but the OS usually fork-exec a new process, without multi-threading

real Protagonist in an asynchronous system

When discussing asynchronous systems, folks say “this system” will send a request, go on with other work, and handle the response when it comes in. When we say that, our finger is pointing at some code modules — functions, methods, classes — business logic. By “this system” we implicitly refer to that code, not data. That code has life, intelligence and behavior, and looks like the protagonist of the show.

However, a code module must run on a thread, and the thread that sent the request can’t possibly “react” to or “handle” the event once it has moved on — the event has to be handled on another thread. Therefore the code we are looking at technically can’t react to events. The code handling them is another code module, unrelated to our code module. So what’s “this system”?

A: a daemon…. Well not a bad answer, but I prefer …
A: the cache. Possibly in the form of a disk/memory database, or in the form of coherence/gemfire, or a lowly hashmap, or more generally a buffer. In every asynchronous design, I have always been able to uncover a bunch of stateful objects, living in memory long enough for the event/response. It’s clear to me the cache is one eternal, universal, salient feature across all asynchronous systems.

The “cache” in this context can be POJO with absolute zero behavior, but often people put business logic around/in it —
* database triggers, stored procs
* coherence server-side modules
* gemfire cache listeners
* wrapper classes around POJO
* good old object-oriented techniques to attach logic to the cache
* more creative ways to attach logic to the cache

However, I feel it’s critically important to see through the smoke and the fluff and recognize that even the barebones POJO qualifies as the “this system” (as in the opening sentence). It qualifies better than anyone else.

Swing/WPF can have one thread displaying data according to some logic, and another thread (EDT?) responding to events. Again, there are stateful objects at the core.

observer can be MT or SingleT

Observer pattern can either go MT or stay ST ie single-threaded.  Both are common.

The textbook showcase implementation is actually single threaded. The event happens on the same thread as the notifications, often looping through the list of observers. That’s actually a real world industrial-strength implementation.

(—> Notably, the single thread can span processes, where one method blocks while another method runs in an RMI server. This is the so-called synchronous, blocking mode.)

Other implementations are MT — event origination and notifications happen on different threads. As explained in the post on async and buffer ( ), we need a buffer to hold the event objects, as the observer could suffer any amount of latency.
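For the record, the textbook single-threaded flavor can be sketched in a few lines (names are my own) — the event thread itself loops through the observers:

```java
import java.util.ArrayList;
import java.util.List;

// textbook single-threaded observer: notification runs on the event thread
interface Observer { void onEvent(String event); }

class Subject {
    private final List<Observer> observers = new ArrayList<>();

    void register(Observer o) { observers.add(o); }

    void fire(String event) {
        // the same thread loops through the observers -- synchronous, blocking
        for (Observer o : observers)
            o.onEvent(event);
    }
}

public class ObserverDemo {
    public static void main(String[] args) {
        Subject s = new Subject();
        s.register(e -> System.out.println("observer1 saw " + e));
        s.fire("priceTick"); // prints on the caller's own thread
    }
}
```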

Condition^Lock objects in pthreads and java

I feel Windows has a different implementation altogether…

Pthreads (and many threading systems) employs a classic 2-object design for thread synchronization:

1) a mutex lock object. Think of it as a bathroom.
2) a condition object aka condition variable. I call it a waiting room.

In Pthreads, Condition object offers

* wait() and timed_wait() — similar to Object.wait()
* signal() and broadcast() — similar to Object.notify() and Object.notifyAll()

In POSIX, if Thread A wants to call these methods on the condition object, then A must hold the mutex. Therefore the 2 objects are tightly coupled.

Java 1.5 provides Lock and Condition objects. 1.5 lets you have 2 “waiting rooms” tied to a single mutex.

  Condition cond1 = myLock.newCondition();
  Condition cond2 = myLock.newCondition();

Now you can target your notification to a subset of waiting threads, rather than indiscriminate notifyAll()

In java 1.4, you grab myMutex and call myMutex.wait(). Single-object design.
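A hedged sketch of the 2-waiting-room design — a bounded buffer with one condition for producers and one for consumers, both tied to the same mutex (class and field names are mine):

```java
import java.util.concurrent.locks.Condition;
import java.util.concurrent.locks.Lock;
import java.util.concurrent.locks.ReentrantLock;

// one mutex ("bathroom") with two waiting rooms -- a bounded buffer sketch
public class TwoWaitingRooms {
    private final Lock myLock = new ReentrantLock();
    private final Condition notFull = myLock.newCondition();  // producers wait here
    private final Condition notEmpty = myLock.newCondition(); // consumers wait here
    private final int[] buf = new int[4];
    private int count;

    public void put(int x) throws InterruptedException {
        myLock.lock();
        try {
            while (count == buf.length)
                notFull.await();      // like Object.wait(), but in a named room
            buf[count++] = x;
            notEmpty.signal();        // wake only a waiting consumer
        } finally {
            myLock.unlock();
        }
    }

    public int take() throws InterruptedException {
        myLock.lock();
        try {
            while (count == 0)
                notEmpty.await();
            int x = buf[--count];
            notFull.signal();         // wake only a waiting producer
            return x;
        } finally {
            myLock.unlock();
        }
    }
}
```

With wait()/notifyAll() in 1.4 you would wake producers and consumers indiscriminately; here each signal() targets exactly the subset that can make progress.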

a simple cache server staying alive

Main server threads simply keep running, usually blocked in socket.accept() or rmi server mode. The thread below is responsible for shutting down the JVM at 23:00. As a redundant measure, there’s also an Autosys stop job to shut down the JVM.

public void run() {
      int hour = Calendar.getInstance().get(Calendar.HOUR_OF_DAY);
      if (hour >= 23) { return; } // already past shutdown time at startup
      while (true) {
            try {
                  Thread.sleep(10 * 60 * 1000); // poll every 10 minutes
                  hour = Calendar.getInstance().get(Calendar.HOUR_OF_DAY); // refresh hour from current system time
                  if (hour >= 23) { System.exit(0); } // shut down the JVM at 23:00
            } catch (Exception ex) {} // do nothing
      }
}

callback object – called by which thread@@

(See also post on generic callback in generic OO)

A novice developer first encounters callback functions. 2 years later she realizes it’s actually a function pointer i.e. the address of a function. For the callback to be registered with some Hollywood, we must use and understand function pointers.

Then the developer encounters the (anonymous) callback object, typically implementing a callback interface. This trivial idiom is extremely widespread but non-trivial in a concurrent OO context. Examples —

– the simplest observer in [[head first design patterns]]
– swing event listener  — always Objects
– onMsg / onMessage listener
– gemfire callback object and listener object – i can never remember the difference

In the simplest form, a callback object is a stateless wrapper over a function pointer.

Q: called on which call stack?
A: Long answer. The thread creating the callback object is not relevant. The thread registering the callback object with Hollywood is relevant but not the answer. Once the callback object’s address (not the naked function address, thanks to the wrapper) is saved in the framework, many threads can access it. Often it’s an internal framework thread.
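A toy sketch of the long answer — “Hollywood” here is a made-up framework class whose internal thread (not the registering thread) makes the callback:

```java
// sketch: "Hollywood" saves the callback object's address and calls it
// back later, on its own internal thread (all names here are hypothetical)
public class CallbackDemo {
    interface Listener { void onMsg(String msg); }

    static class Hollywood {
        private Listener listener;

        void register(Listener l) { listener = l; } // any thread may register

        void start() {
            // an internal framework thread makes the callback
            new Thread(() -> listener.onMsg("hello"), "framework-thread").start();
        }
    }

    public static void main(String[] args) {
        Hollywood hw = new Hollywood();
        hw.register(msg -> System.out.println(
                Thread.currentThread().getName() + " got " + msg));
        hw.start(); // prints from framework-thread, not from main
    }
}
```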

thread detach — boost ^ java

In most (if not all) systems, a thread object is a “poor handle” on a real world thread.

in java, if the thread object becomes unreachable (and garbage collected) then you can't join or interrupt the thread, or test if it holds a particular lock. You have to wait for it to finish or pull the plug with System.exit().

Same scenario exists in boost thread. Additionally, boost lets you explicitly call myThread.detach(), to decouple the myThread object and the real world thread.

Q: after detach(), what's the thread object?
A: it represents not-a-thread

jargon – grabbing, waiting, blocked, sleeping

A few types of thread suspension. Guess which methods can throw InterruptedException!

1) sleeping — sleep()
2) waiting — (implicitly for another thread) wait(),join(), invokeAndWait() …
3) blocked — socket.accept(),
4) grabbing — blocked while acquiring a lock. Note when a wait() returns due to a notification/exception/timeout, it must go grab the lock.

I would say “the thread is grabbing the lock” or “my thread is blocked by a lock”
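Quiz answer, in code — sleep(), wait() and join() can throw InterruptedException, and an interrupt wakes the sleeper early. A minimal sketch:

```java
// a sleeping thread is interruptible: sleep() throws InterruptedException
public class InterruptDemo {
    public static void main(String[] args) throws InterruptedException {
        Thread sleeper = new Thread(() -> {
            try {
                Thread.sleep(60_000); // "sleeping"
                System.out.println("woke up normally");
            } catch (InterruptedException e) {
                System.out.println("interrupted while sleeping");
            }
        });
        sleeper.start();
        sleeper.interrupt(); // wakes the sleeper long before the 60s elapse
        sleeper.join();
    }
}
```

In contrast, a thread grabbing a “synchronized” monitor stays stuck and ignores interrupts; only Lock.lockInterruptibly() gives you an interruptible grab.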

cyclic barrier – boost thread ^ java

A quick and memorable analogy of a boost thread barrier object (of the barrier CLASS). Think of such an object as a thread-safe collection (say vector) of 3 items. 3 threads all have a pointer to the same collection.

Thread A calls myBarrier.wait() inside some function1, to remove one item, and blocks
Thread B does the same
Thread C does the same in some function2. Now the collection is empty, all 3 threads join the Eligible club (as described in the post on Eligible^Running)

java offers CyclicBarrier.
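A hedged sketch of the analogy with java’s CyclicBarrier — 3 threads block at await() until the last arrival “empties the collection” and releases everyone:

```java
import java.util.concurrent.CyclicBarrier;

// 3 threads rendezvous at the barrier; the last arrival releases all 3
public class BarrierDemo {
    public static void main(String[] args) {
        CyclicBarrier myBarrier = new CyclicBarrier(3,
                () -> System.out.println("all 3 arrived -- collection empty"));
        for (int i = 0; i < 3; i++) {
            new Thread(() -> {
                try {
                    myBarrier.await(); // like removing one item, then blocking
                    System.out.println(Thread.currentThread() + " released");
                } catch (Exception e) { /* ignored in this sketch */ }
            }).start();
        }
    }
}
```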

break the 4 deadlock conditions

My long-time java friend told me lock ordering and lock timeout are the 2 standard solutions, but here are additional suggestions directly or indirectly addressing deadlocks.

(“[top x]” means one of my favorite x  techniques.)

For want of a better word, “stuck” is my general term meaning “no-progress” and includes
– lock grab,
– Object.wait(), waking up and grabbing lock
– join(),
– socket accept(),
– long sleep(),
– endless loop,
– unbounded *recursion*

Now the solutions.
–to provide preemption
* [top 3] remain interruptible while getting “stuck”
** even if a recursion/loop calls no blocking method, the thread can still check for the interrupt status from time to time and exit if necessary. see
* interrupting the lock holder — if thread 1 knows another thread is holding the mutex, it could interrupt that thread and make it release, but how does it know which thread?

–to break incremental acquisition

*[top 2] have a timeout while getting “stuck” — a wide-spread implementation
** example — in an endless loop, check how long we have been stuck and exit if necessary.
** example — lock timeout — a thread tries to acquire a lock within a specified period. Upon timeout, it releases all its locks and waits for a period before retrying. For example, release the 1st chopstick if the 2nd chopstick remains unavailable for x seconds. Achievable with tryLock()? Yes, P279 [[java concurrency in practice]]. I believe tryLock() can tell you the 2nd chopstick is unavailable and give you a chance to give up the 1st chopstick held. See P69 [[pthreads]] and deadlock avoidance: release all locks voluntarily
* open call? impractical

–to break circular wait
* lockInterruptibly() of the Lock interface. It lets you try to acquire a lock while remaining responsive to interruption.
* reentrant/recursive lock. By default pthreads locks aren’t recursive, so a single thread can deadlock itself.
* tryLock()
* [top 2] globally fixed resource acquisition order
* if you identify that 3 threads form a wait-for circle, then reverse one of the threads’ acquisition order
* a finally block to release the lock if using Lock objects. “synchronized” doesn’t need this.
* Release locks asap. minimize lock scope.
* [top 5] Make all methods finite. Avoid using locks while inside infinite loops or unbounded recursions.

–to break the need for mutex resources
* immutables. no need to lock anything to access them. CopyOnWrite collections are thread-safe because each internal snapshot is immutable.
* ThreadLocal
* copy On Write
* copy the data for each thread, and sync up later
* replacing sync with atomic variables? both options can be criticized as risky
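The tryLock() chopstick idea above, sketched in java (a toy, not production code) — give up the 1st lock whenever the 2nd can’t be had in time, then back off and retry:

```java
import java.util.concurrent.TimeUnit;
import java.util.concurrent.locks.ReentrantLock;

// "release the 1st chopstick if the 2nd stays unavailable" via tryLock()
public class ChopstickBackoff {
    static boolean eat(ReentrantLock first, ReentrantLock second)
            throws InterruptedException {
        while (true) {
            first.lock();
            try {
                // give up the 1st lock if the 2nd can't be had in time
                if (second.tryLock(50, TimeUnit.MILLISECONDS)) {
                    try {
                        return true; // got both chopsticks -- eat
                    } finally {
                        second.unlock();
                    }
                }
            } finally {
                first.unlock(); // voluntary release breaks incremental acquisition
            }
            Thread.sleep(10); // back off before retrying, to avoid livelock
        }
    }
}
```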

really simple scenario of data races

Q: Suppose we comment out increment(), is it thread-safe?
A: I feel yes. Imagine 2 threads calling the method on the same object in the same clock cycle, on 2 processors, setting it to 5 and 6. What’s the new value? I believe it’s one of the 2. This is a data race, but the object state is still valid.

Q: Suppose we comment out set(), is it thread-safe?
A: thread-unsafe, due to lost-updates if 2 threads call increment() simultaneously.

Q: Suppose we remove the volatile keyword and comment out increment(), and keep only getter/setter ?
A: thread-unsafe. Threads can cache the new this.counter value in per-thread registers for many clock cycles (ie forever), so the new values are invisible to other threads indefinitely.

final class C {
    private volatile int counter;

    public void increment() {
        this.counter++; // read-modify-write: NOT atomic, despite volatile
    }

    public void set(int newVal) {
        this.counter = newVal;
    }

    public int get() {
        return this.counter;
    }
}

mutable fields in concurrency — private volatile !! enough

say your class has an int counter field. “this.counter++” translates to 3 assembly instructions — LOAD, increment, STORE. The executing thread could be preempted after LOAD but before STORE, and could stay preempted for many clock cycles. Many things could happen meanwhile on a 128-core machine.

If not careful, another thread could see the old value and overwrite our update. A private field won’t help. Volatile won’t help — it fixes visibility, not atomicity. You must use a lock to serialize all read/write access to this memory location, or use an atomic variable.

By the way, remember volatile adds one special atomicity — for long/double fields, LOAD and STORE become atomic. Specifically,
* composite operations like increment are never made atomic by the volatile keyword
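The standard lock-free fix, for the record — an atomic variable folds LOAD-increment-STORE into one indivisible step (my sketch, not part of the original class):

```java
import java.util.concurrent.atomic.AtomicInteger;

// lock-free fix: getAndIncrement() makes LOAD-increment-STORE one atomic step
public class AtomicCounter {
    private final AtomicInteger counter = new AtomicInteger();

    public void increment() { counter.getAndIncrement(); }
    public int get() { return counter.get(); }

    public static void main(String[] args) throws InterruptedException {
        AtomicCounter c = new AtomicCounter();
        Runnable r = () -> { for (int i = 0; i < 100_000; i++) c.increment(); };
        Thread t1 = new Thread(r), t2 = new Thread(r);
        t1.start(); t2.start();
        t1.join(); t2.join();
        System.out.println(c.get()); // always 200000 -- no lost updates
    }
}
```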

statement reorder for atomic variables #java,c++

see also [[effModernC++]]

–Java: Atomic is similar to volatile

See also

Atomic package javadoc says:

The memory effects for accesses and updates of atomics generally follow the rules for volatiles, as stated in  The Java Language Specification, Third Edition (17.4 Memory Model):

* get() has the memory effects of reading a volatile variable.
* set() has the memory effects of writing (assigning) a volatile variable.
* lazySet() … is weaker
* compareAndSet() and all other read-and-update operations such as getAndIncrement have the memory effects of both reading and writing volatile variables.
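A small illustration of the last bullet — the classic CAS retry loop, where each compareAndSet() carries both volatile-read and volatile-write memory effects:

```java
import java.util.concurrent.atomic.AtomicLong;

// compareAndSet acts as both a volatile read and a volatile write
public class CasDemo {
    public static void main(String[] args) {
        AtomicLong balance = new AtomicLong(100);
        long seen, next;
        do {
            seen = balance.get();     // volatile-read effect
            next = seen + 25;
        } while (!balance.compareAndSet(seen, next)); // volatile-write effect
        System.out.println(balance.get()); // 125
    }
}
```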

deadlock involving 1 lock,2 threads

See also ..

Deadlock is seldom found in real projects. Even if an observed problem is actually due to deadlock, the problem is usually resolved without the deadlock being identified. Most managers and teams are happy to see a problem disappear. They are not motivated to uncover root cause(s).

The classic 2-lock deadlock situation is usually hard to reproduce, which makes it a rare “natural disaster”, but possibly catastrophic. I’ve never seen it but if a deadlock were consistently reproducible, then it would be much easier to resolve.

However, if you are serious about deadlock risks in real projects, you can’t ignore one of the more common deadlock patterns — 2 threads deadlocking each other with only one lock.

I’d /go out on a limb/ to claim that this pattern is arguably more common than the 2-lock textbook pattern. It may account for up to 90% of all uncovered deadlock cases.
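Here’s a minimal sketch of the 1-lock pattern (a join() flavor of the suspend() scenario): the master holds the only lock while waiting for the slave, and the slave needs that same lock. The join timeout is there only so the demo terminates; with a plain join() the two threads hang forever:

```java
// one lock, two threads: M holds the lock and waits for S, while S needs it
public class OneLockDeadlock {
    public static void main(String[] args) throws InterruptedException {
        final Object lock = new Object();
        Thread slave = new Thread(() -> {
            synchronized (lock) { // S blocks here as long as M holds the lock
                System.out.println("slave got the lock");
            }
        });
        synchronized (lock) {     // M grabs the only lock...
            slave.start();
            slave.join(500);      // ...then waits for S while still holding it
            System.out.println("slave still alive? " + slave.isAlive()); // true
        }
        slave.join(); // once M releases the lock, S can finish
    }
}
```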

recursive tree walk -> iterative #barcap

Suppose there’s a deep directory tree (no hard/soft links) with millions of files and directories. Write a tree walker to list files excluding directories. For memory efficiency, turn the recursive algorithm into iterative.

I feel recursive thinking is far more powerful than iterative, because it solves problems otherwise unsolvable. Most classic algo problems fall into this category. I always try to train myself to think recursively. In this regard, it’s good to study and compare iterative algorithms. What kind of recursive algo can be converted to iterative? I feel only simple ones.

Inspired by the article in the link, here are my rules for converting recursion to iteration

Rule: one push (on the stack) per call of the recursive func.

Rule: a stack to hold the arg of the recursive calls, in the correct order. Assuming m(1) -> m(2) -> m(3), then we should push 1,2,3 on the stack. Ideally, the order in our stack should match the recursive call stack.

Rule: since it’s a stack, you remove from the top and add to the top too.

Rule: first step when you reenter the while(not_empty) loop, you simulate a new recursive call by _popping_ the stack. The popped item is no longer a “pending” todo item.

Next Question: how to parallelize? You don’t know how deep each sub-tree is.

import java.io.File;
import java.util.LinkedList;

public class LowLatencyTreeWalker {
    static LinkedList<Object> envelopesToOpen = new LinkedList<>(); // the LIFO "todo" stack

    static void factorial(int n) { // a simple iterative warm-up: compute n!
        envelopesToOpen.add(n);
        long ret = 1;
        while (!envelopesToOpen.isEmpty()) {
            int currNum = (Integer) envelopesToOpen.removeLast();
            System.out.println(ret = ret * currNum);
            if (currNum > 1)
                envelopesToOpen.add(currNum - 1);
        }
    }

    static void walkTree(String root) { // a harder iterative algorithm: tree-walk, e.g. walkTree("C:/")
        envelopesToOpen.add(new File(root));
        while (!envelopesToOpen.isEmpty()) {
            // once we open an envelope, we remove it from the todo stack
            File f = (File) envelopesToOpen.removeLast();
            File[] children = f.listFiles();
            if (children != null) { // a directory: push its children as new todos
                for (File c : children)
                    envelopesToOpen.add(c);
            } else {
                System.out.println(f); // a plain file: list it
            }
        }
    }
}

2 threads sharing a stack variable — c++^java

In C++, a primitive or class object on the stack or heap is often passed by reference. It’s possible to create a char stackVar and pass its address to another thread; 2 threads can then write to the same stack memory location simultaneously. In contrast, Java’s primitive stack variables are thread-confined.

It’s important to make sure the stack object doesn’t go out of scope too early, creating a dangling pointer.

Possible in java? I don’t think so. The Java stack only holds primitives and references, and Java primitives are never passed by reference. Suppose thread A runs a method that creates an int on its stack, then starts a thread B that reads the int stackVar. It looks like pass-by-reference but is actually pass-by-value behind the scenes, because java requires that local int to be _final_ — the anonymous class captures a copy.
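A small demo of that pass-by-value capture — thread B reads a private copy of the final local, not main’s stack slot:

```java
// the "final" local is copied by value into the lambda/anonymous class;
// thread B never touches A's stack slot
public class StackConfinement {
    public static void main(String[] args) throws InterruptedException {
        final int stackVar = 42; // lives on main's stack
        Thread b = new Thread(() -> {
            // this reads a private copy captured at construction time
            System.out.println("B sees " + stackVar);
        });
        b.start();
        b.join();
        // main's stack frame may be long gone by the time B runs -- still safe
    }
}
```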

generic callback in a concurrent java/c++ sys

(Note OO is a development-time concept. Once compiled, an OO program looks no different from a C program with tons of structs and function pointers.)

Most financial systems today are concurrent OO systems. The lowly callback pattern is extremely common but actually requires a thorough understanding of both thread and object interaction. Let’s examine some. Note just about every player in the callback scenarios is a pointer, including function pointers.

In the simplest 1-thread case, an observer pointee registers interest with a subject object, by giving its own address. After subject state changes in Thr 1, the same thread loops through the list of observers and invokes the callback method on each observer. Note the observer object must not be deallocated.

For onMsg() in tibrv or onMessage() in JMS, there’s a remote “upstream” thread in the message broker’s PID/runtime, and there’s a separate downstream thread in the listener’s PID. I believe this thread starts out blocking for messages. Upon receiving a msg, the runtime somehow wakes up this blocking thread and this thread invokes onMsg() on the listener pointee. Meanwhile what’s the remote upstream thread doing? Simplest design is synchronous — upstream thread waits for the listener thread to finish. This means the upstream thread can only dispatch the message to remote listeners sequentially. I feel it’s much better to get the upstream thread to return right away after sending the one-way-message[3] (see Doug Lea’s book) so this thread can contact the 2nd listener in the list. Rewind to registration time — Any thread could register the listener with “Hollywood”. This thread could terminate immediately but the listener pointee must live.

[3] in classic Reo, we actually send one-way messages to the market-facing gateway. We get the status message on a separate thread via onMessage().

In C/C++, you can also register into “Hollywood” a ptr to a free func. No object, usually no state (unless the free func uses stateful static objects). Free functions are free of destruction. Hollywood could save the (function) ptr in any collection. When there’s an event, Hollywood calls us back, in any thread.

multi-threading with onMessage()

Update — another tutorial says: A JMS Session is a single-threaded context for producing and consuming messages. Once a connection has been started, any session with a registered message listener(s) is *dedicated* to the thread [1] of control that delivers messages to it. It is erroneous for client code to use this session or any of its constituent objects from another thread of control. The only exception to this is the use of the session or connection close method.

In other words, other threads should not touch this session or any of its constituent objects.

[1] I think such a session is used by one thread only. but who creates and starts that thread? If I write the jms listener app, then it has to be started by my app. [[java enterprise in a nutshell]] P329 has sample code but looks buggy. Now I think java.jms.Connection.start() is the answer. API mentions “Starts delivery of *incoming* messages only”, not outgoing! This method is part of a process that
* creates a lasting connection
* creates the sessionSSS in memory (multiple-session/one-connection)
* perhaps starts the thread that will call onMessage()

——– says:
Sessions and Threading

The Chat application uses a separate session for the publisher and subscriber, pubSession and subSession, respectively. This is due to a threading restriction imposed by JMS. According to the JMS specification, a session may not be operated on by more than one thread at a time. In our example, two threads of control are active: the default main thread of the Chat application and the thread that invokes the onMessage( ) handler. The thread that invokes the onMessage( ) handler is owned by the JMS provider(??). Since the invocation of the onMessage( ) handler is asynchronous, it could be called while the main thread is publishing a message in the writeMessage( ) method. If both the publisher and subscriber had been created by the same session, the two threads could operate on these methods at the same time; in effect, they could operate on the same TopicSession concurrently — a condition that is prohibited.

asynchronous: meaning…@@

When people talk about async, they could mean a few things.

Meaning — non-blocking thread. HTTP interaction is not async because the browser thread blocks.
* eg: email vs browser
* eg: async query

Meaning — initiating thread *returns* immediately, before the task is completed by another thread.

Meaning — no immediate response, less stringent demand on response time. An HTTP server needs to service the client right away since the client is blocking. As a web developer I constantly worry about round-trip duration.

Meaning — delayed response. In any design where a response is provided by a different thread, the requester generally doesn’t need an immediate response. The requester thread goes on with its business.
* eg: email, MQ, producer/consumer
* eg: jms onMessage() vs poll, but RV is purely async
* eg: acknowledgement messages in zed, all-http

You can (but i won’t) think of these “meanings” as “features” of async.

main thread early exit

in java, the main thread can “feel free” to exit early, because any remaining non-daemon thread keeps the entire JVM alive. Not so in c++. [[c++cookbook]] says

when the operating system destroys a process, all of its child threads go with it, whether they’re done or not. Without the call to join(), main() doesn’t wait for its child thread: it exits, and the operating system thread is destroyed.

This assumes a 1:1 native thread model, so each thread is actually a kernel thread. When main() returns, the entire process ends, taking every kernel thread with it.
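The java side, for contrast — main() returns but the non-daemon worker keeps the JVM alive, with no join() needed:

```java
// main() returns early, but the non-daemon worker keeps the JVM alive
public class EarlyExit {
    public static void main(String[] args) {
        Thread worker = new Thread(() -> {
            try { Thread.sleep(200); } catch (InterruptedException e) { }
            System.out.println("worker still ran after main returned");
        });
        // worker.setDaemon(true); // uncomment and the JVM may die before this prints
        worker.start();
        System.out.println("main exiting now");
        // no join() needed in java -- unlike the C++ scenario above
    }
}
```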

each method call belongs to a thread^object

a java thread is essentially a call stack. Methods on the stack can belong to different objects or (in the case of static methods) classes.

Each method call in a jvm
1) belongs to a thread (same thread as the caller method) AND
2) it also belongs to an obj/class.
3) By the way, it often has a caller obj.

WHO (which obj, not which class) calls WHOSE (which obj’s or class’s) method in WHICH thread largely defines the multi-threading architecture. This interaction can be complex but there are some patterns, esp. in GUI

Simple deadlock demo

package com.test1;

// showcasing Thread.yield(), Thread.join(), anonymous class
public class DeadlockSimpleDemo {
    public static void main(String[] args) {
        Thread thr1 = startNewThread("lockA", "lockB");
        Thread thr2 = startNewThread("lockB", "lockA"); // identical string literals are interned, i.e. the same objects
        // thr1.join();
        // thr2.join();
    }

    private static Thread startNewThread(final String lock1, final String lock2) {
        Thread ret = new Thread() {
            public void run() {
                synchronized (lock1) {
                    System.out.println(currentThread() + ": " + lock1 + " acquired, grabbing " + lock2);
                    yield(); // needed to reliably produce a deadlock
                    synchronized (lock2) {
                        System.out.println(currentThread() + ": " + lock1 + " acquired, -> then acquired " + lock2);
                    }
                }
                System.out.println(currentThread() + ": end of run()");
            }
        };
        ret.start();
        return ret;
    }
}