joinable instance of std::thread

[[effModernC++]] P 252 explains why in c++ joinable std::thread objects must not get destroyed. Such a destruction would trigger std::terminate(), therefore, programmers must make their std::thread objects non-joinable before destruction.

The key is a basic understanding of “joinable”. Informally, I would say a joinable std::thread has a real thread attached to it, even if that real thread has finished running. says “A thread that has finished executing code, but has not yet been joined is still considered an active thread of execution and is therefore joinable.”

An active std::thread object becomes unjoinable

  • after it is joined, or
  • after it is detached, or
  • after it is be “robbed” via move()

The primary mechanism to transition from joinable to unjoinable is via join().

std::thread key points

For a thread to actually become eligible, a Java thread needs start(), but c++ std::thread becomes eligible immediately after initialization i.e. after it is initialized with its target function.

For this reason, [[effModernC++]] dictates that between an int field and a std::thread field in a given class Runner, the std::thread field should be the last initialized in constructor. The int field needs to be already initialized if it is needed in the new thread.

Q1: Can you initialize the std::thread field in the constructor body?
A: yes unless the std::thread field is a declared const field

Now let’s say there’s no const field.

Q2: can the Runner copy ctor initialize the std::thread field in the ctor body, via move()?
A: yes provided the ctor parameter is non-const reference to Runner.
A: no if the parameter is a const reference to Runner. move(theConstRunner) would evaluate to a l-value reference, not a rvr. std::thread ctor and op= only accept rvr, because std::thread is move-only

See for my experiments.

if external lib return object by pointer #Rahul

My colleague Rahul used some external library util and found out it returns a pointer. We briefly discussed the implications. Here are some afterthoughts.

  • Pattern — if I pass in an original object by reference, the lib util may return a wrapper object by pointer.  In such a context, is this wrapper object on heap? Always?

I would say usually on heap. But since the lib implementation is invisible, it may do other things.

In an alternative design, the library could update an object pre-existing in a ring buffer (or 3D matrix) and make it a wrapper. There’s no heap allocation in this round trip. The ring buffer itself could be previously allocated on heap or data segment, not as a local object in main() function. Reason — main() can end before other threads 😦

  • Pattern — if I pass in some parameters, the lib util could create a brand new object, based on my parameters. So in this context is this new-born on heap? Always?

I would say usually on heap. But since the lib implementation is invisible, it may use non-heap.

In a specialized design, the library could return singletons held in a lookup table. The singletons are usually pre-existing in heap or data-segment.

Q: Now suppose the library does return a new heap object, who would be responsible to release the memory?
A: One common design (Sutter) requires the library factory to own it. Reference counted smart pointers are useful, as the factory knows the reference count of the new heap object.

Q: Now suppose the library factory returns a new heap object by raw pointer, who would be responsible to release the memory?

2 threads taking turn #std::ref #CSY

Vanguard Q: write a program containing exactly 3 threads printing 1/3/5/7/… 2/4/6/8… respectively, but in lock steps, as if passing a token between them. is my solution.

I managed to avoid sleep() and condVar. The idea — all thread run the same function which

  1. check the shared mutable variable Next. If it’s “my” turn then
  2. grab lock, print my next number, update Next, release lock
  3. end-if
  4. yield and exit

I used an atomic<char> “Next” set to lastOutput%something, that’s visible to all threads, even if not holding a lock.


prevent c++compiler reordering statements says this low-level c++11 function is something like a

directive to the compiler inhibiting it from making optimizations that involve moving writing operations beyond a releasing fence or read operations before an acquire fence.

I doubt there’s any need for this in any application.

std::defer_lock #kill1@4deadlock factors

Only in the lock() call does the thread grabs both locks together. This breaks “incremental acquisition” , one of the four deadlock conditions.

Sample code from

  std::unique_lock<std::mutex> ulock1(mutex1,std::defer_lock);
  std::unique_lock<std::mutex> ulock2(mutex2,std::defer_lock);

  // now grab both locks

pthread_join() retriev`return value: uncommon practice

Most pthreads programs don’t retrieve the return value via pthread_join(). has a comment by the author of boost::thread (reference implementation for c++11 thread library). He said

(3) I never use the return value of a thread in raw POSIX threads. However, I tend to use higher level facilities such as the Boost thread library, and more recently the C++0x thread library, which provide alternative means for transferring values between threads such as futures, which avoid the problems associated with memory management that you allude to.

Therefore, even though you felt it was unnatural to store the per-thread computed results in a global array, in practice it’s not bad. It is inherently thread-safe because there’s no data sharing, if we enforce a truly parallel mode.

That was my preferred solution, but to experiment, I also used new() to return a value to pthread_join(). Personally, I am always wary of using new() in one function and the corresponding delete() in another function … unreliable. As much as possible, I use smart pointers to manage new/delete. shows both solutions

detach()^pthread_exit()^pthread_join() at end@main()


Q: When would main() call ..join() vs ..exit()?

  • I would call pthread_join() if main thread has something to do after child threads completes.
  • I would call pthread_exit() if main thread has nothing to do and should leave the child threads running.
    • However, if the child thread object is destructed…
  • I would call detach() to create a daemon child thread, which would automatically terminate with host process — according to MS training.
  • I would not call any of them from main() if I want main() to bring down entire process… The normal return of main() kills all threads, unlike jvm — see Anthony Williams comment in

For c++11 std::thread object, rules are different from pthreads. If a std::thread object is destructed at end of main(), it would by default invoke std::terminate()! So before any return statement, main() need to call child.detach() , so as to leave the child Thread running.

Explained more in std::thread dtor std::terminate()

std::reference_wrapper usages

std::reference_wrapper is a class template that wraps a reference in a copyable, assignable object. Usages:

  • std::reference_wrapper is primarily used to store references inside standard containers which cannot hold references. In my projects, no one ever stores references in containers. Pointless.
  • std::reference_wrapper is also used to pass objects by reference to the constructor of std::thread, or the helper functions std::make_pair() and std::make_tuple()

Helper functions std::ref and std::cref() are often used to generate std::reference_wrapper objects.


–details on the std::thread usage shows that without std::ref(), the object passed to std::thread ctor is a clone, so the original object is not updated by the thread routine.

My uses std::ref() too. Same code also shows another adapter wrapper for a function says std::ref() is likely needed. has detailed sample code

CVA c++ IV 2 #oq

void func(ResourceMgr & rm){ 
  int connId = rm.getConn();
  double d=externalFunc(connId); 
  rm.reclaim(connId); //release the resource to the mgr
  • Q5: In the code above, external func can throw, so write a RAII wrapper to prevent resource leak.
  • Q5b: what if you aren’t allowed to use RAII? Ok you said catch-all. Is an empty catch block enough?
    AA: need to re-throw the original exception, to mimic the RAII behavior.
  • Q(paper coding): Iterate over two vectors in lock steps. Which is faster — iterator vs vector index?
  • Q (bond math): Is there any uncertainty in the present value calc on an pre-existing vanilla IRS?
  • q: what design patterns do you use?
  • q: write a singleton skeleton
  • Q: how do we make a class “final” in pre-c++11
    %%A: either make dtor private or make all constructors private
  • Q: is shared_ptr thread safe?
  • Q: any difference — const shared_ptr<T> vs shared_ptr<const T>?
  • %%Q: does it make sense to pass a shared_ptr by const reference? I feel it’s cleaner to pass by raw ptr
    AA!: pass by const-reference is faster and recommended


  • Q: why use weak_ptr instead of raw ptr. Valid question.
    A: See [[std c++lib]] P84.
    A: See last paragraph in
  • Q: you said atomic<int> operator++ probably wraps a CAS in a while loop internally, so is it spinlock?
    %%A: I think so. is a detailed explanation
  • Q34: various synchronization techniques?
  • Q34b: what’s the c++ function to force a memory barrier?
    A: std::atomic_thread_fence(),
    but i feel if an application (not a library) uses this function then something is seriously wrong.
  • Q: can we implement a mutex using CAS?
    AA: blockingMutex implementation ] kernel
    AA: spinlock, not blockingMutex, can be implemented as in
  • ? Q (paper coding): write a piecewise linear interpolation Curve class with a ctor taking a vector<pair<double, double>>.  See
  • Q: What are r-values and r-value references
  • Q: Explain your use of SFINAE. See
  • Q: what’s the c++ library function for cmpxch?
    A: atomic::compare_exchange_weak()
  • Q: is your barcap vol surface used for EOD risk only?
    %%A: no. A trader might suspect a lesser known product is currently mis-priced. She could confirm the live bid/ask are obvious outliers on the fitted surface.
    %%A: a dealer desk price new deals based on the fitted surface, whether or not the strike/expiry requires interpolation.

–Simon Ma:

  • Q (paper coding): implement Fib() recursively and iteratively
  • Q: what’s inline?
  • Q: semaphore vs mutex
  • Q: how did you compute greeks like gamma?
  • Q (bond math): given 6M spot IR = 5% and 12M = 10%, what’s the 6M rate 6M forward?
  • Q: Assign first half of a vector to another vector
    %%A: vector::assign() probably takes two iterators so I can pass v.begin(), v.begin()+5
  • Q (obscure): What data structure to represent a directed graph of N nodes? See NxN matrix for graph of N nodes
    %%A: create a Node class with a single ptr field…
  • Q: use parallel algorithm to compute sum of a vector<int>
    %%A: only the last stage global aggregation has a shared mutable that needs protection. The sub-aggregation can be single-threaded.
  • Q (probability problem): Both Alan and Bob agree to show up at Cafe9 sometime between 12 and 2pm. Whoever arrives first would wait for 15 minutes only. What’s the probability of them actually meeting
  • Q: what’s thread_local?
    %%A (correct): a new storage class that’s similar to static
  • Q (paper coding): A natural number is Good if it is a product of only 3, 5 and 7. The smallest Good numbers are 1,3,5,7,9,15,21,25,27,35… Write a print(int N) to print the Nth Good number. Hint: write a isGood(int k). See
  • Q (unclear): implement subtract(int a, int b) using only operator+ and comparison operators. I feel this question is unclear. Can we use bitwise? Can we multiply?
    %%A: just try all the integers (i) one by one a == b+i
  • Q (obscure): what operators can’t be overloaded?
    %%A: q(?:) correct
    AA: address-of CAN be overloaded!

–Mikhail the mgr

  • Q: inserting 1000,000 items into a list vs a vector without reserve()
    A: vector wins
  • Q: do you define your exception classes?
  • Q: in what context is a deque completely unusable whereas vector is perfectly fine?
    A: a C function taking an array (i.e. a ptr + a count). Vector::data() is designed for this. In Pre-c++11, we can pass &v[0] and v.size()
    A: if the code takes one element and uses pointer increment to access other elements. Deque would fail at segment boundaries.
  • Q89 (coding question): given a vector of T [passed in by const ref], write a template function to return [by const reference] the minimum element.
  • Q89b: how do you handle an empty vector?
  • ? Q89c (obscure): given q(vector const & vec), what can you say about q{auto it = vec.begin()} versus q{vector::const_iterator it=vec.begin()}

atomic{int} supports operator+=()

Background — on a machine with lock-free CPU…

My friend Shanyou asked me — with c++11 atomic data types, can we simply say

myAtomicInt ++; //without any mutex

A: Yes according to [[c++StdLib]]

Q: Is there some hidden CAS while-loop therein?
A: Yes I am 99% sure because other threads could be updating the same shared mutable object concurrently on other CPU cores.

Q: is there a CAS while-loop in a basic store/load operation?
A: I don’t think so

childThr.get_id() after join()

Not sure about pthreads but here is c++11 std::thread::get_id() behavior:
“If the thread object is not joinable, the function returns a default-constructed object of member type thread::id.”
I believe after you join a childThr, that thread is no longer joinable, SO get_id() will return a meaningless boilerplate value.

Solution: to use that id, you need to save it before joining is my experiment

gdb to show c++thread wait`for mutex/condVar shows my experiment using gdb supplied by StrawberryPerl.

On this g++/gdb set-up, “info threads” shows thread id number 1 for main thread, “2” for the thread whose pthread_self() == 2 … matching 🙂

The same “info-threads” output also shows

  • one of the worker threads is executing sleep() while holding lock (by design)
  • the other worker threads are all waiting for the lock.
  • At the same time, the main thread is waiting in a conditional variable, so info-threads shows it executing a different function.

concurrent lazy singleton using static-local var#c++

As explained in another blog post, static local is a shared mutable, but fortunately initialization is thread-safe:

“If multiple threads attempt to initialize the same static local variable concurrently, the initialization occurs exactly once (similar behavior can be obtained for arbitrary functions with std::call_once).” — has 20 upvotes and looks substantiated. It also considers double-checking, std::call_once, atomics, CAS…

GCC uses platform-specific tricks to ensure a static local variable is initialized only once, on first use. 

Conclusion — On GCC and other compilers, this is the easiest solution for a concurrent lazy singleton.


non-local static composite object: pitfalls

Google style guide and this MSDN article both warn against non-local static objects with a ctor/dtor.

  • (MSDN) construction order is tricky, and not thread-safe
  • dtor order is tricky. Some code might access an object after destruction 😦
  • (MSDN) regular access is also thread-unsafe, unless immutable, for any static object.
  • I feel any static object including static fields and local statics can increase the risk of memory leak since they are destructed very very late. What if they hold a growing container?

I feel stateless global objects are safe, but perhaps they don’t need to exist.

non-local^local static object initialization ] c++

Regarding when such objects are initialized, there are simple rules as illustrated in 2 simple yet concurrent singleton implementations ] c++The two singleton implementations each use one of the 2 types.

The rules:

  1. lazy init — local statics are initialized in the first function call. Only one thread would initialize it, never concurrently on two threads. See concurrent lazy singleton using static-local var
  2. eager init — non-local statics are initialized before main(), on one single thread, but the sequence among them is non-deterministic, as explained by Scott Meyers.
  3. Both are kind of thread-safe

##shared_ptr thr-safety: 3 cases

This topic is worthwhile as it is about two high-value topics … threading + smart ptr

At least 3 interviewers pointed out thread safety issues … first answer shows many correct MT usages and incorrect MT usages. Looking at that answer, I see at least 3 distinct”objects” that could be “shared-mutable”:

  1. control block — shared by different club members. Any one club member, like global_instance could be a shared mutable object. Concurrent access to the control block is internally managed by the shared_ptr implementation and probably thread-safe.
  2. pointee on heap — is shared  mutable. If 2 threads call mutator methods on this object, you can hit race condition.
  3. global_instance variable —  a shared mutable instance of shared_ptr. race condition 😦




%%rwlock using only 1 mutex #AshS

Q: implement a simple read/write lock using a basic mutex provided by a basic thread library.

— Target Usage Scenario —
Imagine a data table that any thread can read in the absence of concurrent updates, but only one writer thread at a time can update.

I feel some minimum fairness is needed. You don’t want a writer thread to keep waiting for a long time while readers come and go, as if the writer is a second-class citizen. In fact, this implementation favors writers to readers, assuming write is more urgent. Reader starvation is considered tolerable.

— implementation notes–

Since there are 2 mutexes involved, we will always acquire the police first. Helps reduce chance of deadlocks. Still, it’s better to use a single mutex, as shown in [[pthreads]]. One simple suggestion is replacing this->_mutex with a bool this->isWriting. I feel this is a common simplification to many synchronization designs.

After reading P84 [[pthreads]], the mutex field can be a nonref rather than a pointer. In that case, we should never copy this object. It should be created at time of construction.

Ashish pointed out a bug, so now, all 4 methods must acquire _police from onset, before reading/writing  internal shared data.

#include <cassert>
using namespace std;
class mutex{ //this class should be provided to us
  void release();
  void acquire();

class raiiLock{//when an instance goes out of scope on the stack, dtor will release the lock
  mutex const _lock; 

class rwlock{
  rwlock(rwlock const & other); //disabled
  rwlock & operator =(rwlock const & other); //disabled

  // data members
  mutex  _mutex; //used by writers only, never by readers
  mutex  _police; // controls modifications to internal data members
  size_t const _permits; // max number of concurrent readers allowed in theory
  size_t _available; // available permits, controlled by the police
  size_t _waitingWriters; // controlled by the police
  // note dtor will be synthesized, which will do nothing if all data members are primitives or pointers
  // note no-arg ctor is not allowed.
  rwlock(size_t maxPermits):
    _permits  (maxPermits),

  void release1permit(){
    raiiLock autoLock(_police); //blocking briefly
    _available ++;
    if (_available == _permits && _waitingWriters){
      //wake up one of the writers, perhaps using a conditional variable tied to _police mutex
  void releaseWriterLock(){
    raiiLock autoLock(_police); //blocking briefly
    _available = _permits;
    if (_waitingWriters){
      //wake up one of the writers, perhaps using a conditional variable tied to _police mutex
  char try_acquire1permit(){ // non-blocking
    //As Ashish pointed out, _police need to be acquired before 
    //checking _waitingWriters and _available. Any access to shared internal
    //data needs police protection.
    raiiLock autoLock(_police); //blocking briefly
    if (_waitingWriters) return 'w'; //immediately reject this
    // reader's request if there's any waiting writer.
    // Therefore, any subsequent reader requests will not affect the waiting writers.
    if (_available == 0 ) return 'a';

    assert( _available > 0 && _waitingWriters == 0);
    _available --;
    return '0'; // 0 means safe to read the protected data table
  void acquireWriterLock(){ // blocking
    raiiLock autoLock(_police); //blocking briefly
    _waitingWriters ++ ; //Subsequent readers will be rejected, to avoid writer starvation, which is common with java RWlock
    while(_available < _permits){
      //block in some kind of waiting room perhaps using a conditional variable tied to _police mutex
    assert (_available == _permits && "all permits should be released and the writer, if any, is done");
    _mutex.acquire(); // should never block
    _available = 0;
    _waitingWriters --;
    // now the current thread is free to update the table

[11] threading,boost IV #Sapient perhaps

Read-guard, write-guard ?

Q: how to test if current thread already holds a target lock?
%%A: Though there might be platform-specific tricks, I think it’s not widely supported. Best to use recursive mutex to bypass the problem.
AA: pthread_self() is a C function (=> non-method) that returns the thread id of the current thread.
AA: has a microsoft specific solution
AA: using this thread::id object a lock can “remember” which thread is holding it. shows a gdb technique, not available at runtime.

Q: Is scoped_lock reentrant by default? Can you make it reentrant?
AA: With a boost recursive mutex, a single thread may lock the same mutex several times and must unlock the mutex the **same-number-of-times**.

AA: pthreads supports it. says — If the mutex type is PTHREAD_MUTEX_RECURSIVE and the mutex is currently owned by the calling thread, the mutex lock count shall be incremented by one and the pthread_mutex_trylock() function shall immediately return success.

Q: Boost supports try_lock()?
AA: yes

Q: Boost supports lockInterruptibly()?
A: probably not, but there’s timed_lock

Q: Boost supports reader/writer locks?
AA: read/write lock in the form of boost::shared_mutex

Q: difference between mutex and condition var?

Q: can you write code to implement “recursive” locking with reader/writer threads?
A: key technique is a bool isHoldingLock(targetlock)

Q: what types of smart pointers did you use and when. What are the differences? I’d like to focus on the fundamentals not the superstructures.

Q: what other boost lib?
%%A: I briefly used boost bind. I believe STL binders are imperfect, and lambdas are superior

Q2: if I want a custom class as a map key?
%%A: must overload the less-than operator

Q2c: what if I can’t change the source code?
%%A: derive from binary_function to get a functor, and pass it to map constructor as the optional type param. However, the new functor only has “operator()” overloaded? I think it will still work, as the function doesn’t have to be named “less()”.
AA: Item 42 [[eff STL]]

%%Q2g: is operator less-than a friend func or a method?
AA: can be a non-friend(if everything public) or a friend class/func.

concurrency complexity/learn`curve c++imt java

C++ concurrency learning curve has lower /leverage/ higher effort.

* well-established — Java is designed with concurrency at its core so its memory model is mature and sound. There are a few very well-respected books on java concurrency (and some secondary books). As a result, concurrency best practices are more established in java than arguably any other language as of 2017.

* templates — Many concurrency constructs use templates with possibly specializations.

* C — Java developers never need to deal with the system-level concurrency library (invariably in C), the c++ developer may need to descend to the C level.

* multiple libraries — See post on c++ concurrency libraries.

* Low-level — C++ is a lower-level language than java.
** The concurrency classes must deal with the big 4.
** Memory management also complicate concurrency.
** Not only heap objects, but primitives on the stack are also accessible from multiple threads thanks to pointers.
** The atomic class templates are also more low-level than java’s, and much harder to get right.

* There are many mutex constructs.

##common c++concurrency libraries

Java has language-level support for concurrency + some special concurrency classes in the JDK library.

C/C++ use a library-only approach. C libraries include pthreads and winthreads. I think these are provided by the OS, because these threads are kernel threads.

Many threaded c++ projects use nothing but these C libraries.

There are also c++ thread libraries, (probably always) based on the underlying C libraries.
* c++11 thread library
* boost
* Microsoft Foundation Classes
* RogueWave

While the lucky java developer can afford to focus on a single concurrency tool set, the unlucky c++ developer often have to juggle 2 or more.

Therefore, it’s harder to get the “halo” in c++ concurrency interviews

c++atomic^volatile #fencing

See other posts about volatile, in both c++ and java.

For a “volatile” variable, c++ (unlike java) compiler is not required to introduce memory fence i.e. updates by one thread is not guaranteed to be visible globally. See Eric Lippert on

based on [[effModernC++]]
atomic volatile
optimization: reorder prevented permitted 😦
atomicity of RMW operations enforced not enfoced
optimization: redundant access skip permitted prevented
cross-thr visibility of update probably enforced no guarantee [1]
memory fencing enforced no guarantee


std::lock_guard simple idiom]cod`IV

Since this data type is designed as a stack object i.e. scoped variable, simply declare (and initialize) this object within curly brackets. When this variable goes out of scope, the object behind the variable gets destructed.

In fact, the compiler sets up this sequence of events.

std::lock_guard<std::mutex> aLocalLock(sharedMutex);
someMutableShared ++;

c++threading support,historically

[[c++concurrencyInAction]] has a concise historical review…

* POSIX is a platform-specific C api, just like the Windows API.
I feel this is an OS-level API. Every OS-level API tends to be in C (not c++), so various languages can all use it. OS-level API tend to “present” the hardware as is. The hardware has no concepts of object orientation or exceptions. The C language is the natural language for the hardware.
* Threading is a low-level, OS-level feature. (Kernel support? Not sure about the jargon). Naturally, threading is implemented in C and exposed first as C api. Many c++ threading developers simply use C api without any motivation to use c++.
* There’s no surprise that the early c++ concurrency libraries are wrappers over the existing C api. Examples include MFC and Boost.
* Note low-level C api is usually platform-dependent. The c++ wrapper api has an opportunity to smooth out the platform differences.

sample code showing boost scoped_lock# !! c++14

#include <iostream>
#include <boost/thread.hpp>
#include <string>
#include <iostream>
#include <sstream>
using namespace std;
boost::posix_time::milliseconds sleepTime(1);
template<typename T>
class MyQueue {
 void enqueue(const T& x) {
  cout << "\t\t\t > enqueuing ... " << x << "\n";
  boost::mutex::scoped_lock    myScopedLock(this->mutex_);
  cout << "\t\t\t >> just got lock ... " << x << "\n";
  // A scoped_lock is destroyed (and thus unlocked) when it goes out of scope
 T dequeue() {
  boost::mutex::scoped_lock lock(this->mutex_);
  if (list_.empty()) {
   throw 0; // unlock
  T tmp = list_.front();
  cout << "< dequeu " << tmp << "\n";
  return (tmp);
 std::list<T> list_;
 boost::mutex mutex_;
MyQueue<std::string> queueOfStrings;
int reps = 5;
void sendSomething() {
 for (int i = 0; i < reps; ++i) {
  stringstream st;
  st << i;
  queueOfStrings.enqueue("item_" + st.str());
void recvSomething() {
 for (int i = 0; i < reps*3; ++i) {
  try {
  } catch (int ex) {
   cout << "<- - (    ) after releasing lock, catching " << ex <<"\n";
int main() {
 boost::thread thr1(sendSomething);
 boost::thread thr2(recvSomething);

volatile in c++ #hardware eg

(See other posts on volatile.) I now get more and more questions on this keyword.

Now i think a volatile variable can be updated by another thread, another process via shared-memory[1], or by hardware (via interrupts). P206 [[from c to c++]] has a “hardware” example.

[1] see effModernC++

P178 [[understanding and using c pointers]] shows how to read a hardware port, declared as a const pointer to a volatile integer object —

unsigned int volatile * const port = ….

Unlike in java, Volatile can also be used on regular variables, methods, fields. Similar restriction as “const”.
For memory fence, see post on c++volatile^atomic.
–experts all agree q(volatile) is unrelated to threading:



2 important predecessors to c++11 thread API

One of the committee members behind the c++11 thread api said in [[c++ concurrency in action]] that Boost::thread has been used as the primary reference for the standard api, although the latter includes non-Boost features, notably atomic operations.

The 2nd strongest influence among the early cross-platform thread libraries is ACE.

3rd influence? Perhaps MFC, but I'm not sure.

IPC mutex #boost

favorite IV topic

Update — both pthreads mutex and pthreads condVar can be configured as cross-process. Note — both are used for inter-thread coordination. For inter-process, better use semaphore. favors IPC mutex over IPC semaphore.

IPC mutex is supported in many operating systems such as ….?

A mutex in boost::thread is effective only within a single process [1]. To share a mutex between 2 processes, you can use either

A) an anonymous mutex in shared mem managed by your processes

B) a named mutex managed by the kernel (“hotel service desk”). They don’t live in shm. In this case, I guess the mutex is treated like a modem or any other hardware device, whereby any access is regulated by the kernel.

  • In B, the mutex is a boost::interprocess::named_mutex
  • In A, the mutex is a boost::interprocess::interprocess_mutex

If you want recursive,

  • In B, you can use a boost::interprocess::named_recursive_mutex
  • In A, you can use a boost::interprocess::interprocess_recursive_mutex

But Why bother with a shared mutex in the first place? Usually to guard a shared resource such as an object (say object K) in shared mem. By definition K is accessible from 2 (or more) processes P1 and P2. They can both ignore and bypass the mutex and access K directly. Therefore the guardian mutex lock is only advisory.

[1] because the mutex object lives in the private memory of the process

pthread-based thread pool – full source code

On my Ubuntu laptop, i tested the source code bundled with [[pthreads programming]]. I once spent an hour studying the source code.

  • uses malloc/free, not new/delete
  • uses a struct to describe each task, with a pointer to a void func returning void, taking in a void ptr
  • I have problem compiling it in g++, but gcc worked.
  • use a request queue (task queue?) A single Producer adds tasks to it; multiple consumers (pool threads) remove them.
  • uses 3 condVars to ….
  • uses a single mutex

Which files have “substance”? Only the tpool.c. The header file contains no implementation. The test code is like unit test. The bugtest is optional.

c++ thread library – all wrappers@@

C++ thread libraries invariably smell like C library wrappers. They have a hard time shaking off that image.

Fundamentally, concurrency utilities are low level, and less suitable for OO modelling. OO often introduces overhead. A friend said “thread library is always using function API, not class API”.

I have not seen a c++ thread library that’s not based on a C thread library. A C version is more useful to more people.

SCB threading design IV: caching remote DB

This could be a good story to tell during interviews.

The MOM design is still unfinished but I think a practitioner could see we are on the right track.

Q: A remote position data source (a DB or some opaque data source) is super slow. It takes up to 500ms to return a position object when queried. The read(int key) method is provided by the DB’s client-side API and is thread safe, so multiple threads can call read() concurrently. Now, we need some kind of caching to minimize the 500ms latency, but we want a lazy cache, partly to avoid overloading this busy DB — We don’t want to load all 9 million positions when we are interested in only a few positions. Our cache shall provide an API like a Cache class with a method

“public Position lookup(int key) const”

If a local lookup thread is blocked due to a cache miss (resulting in a DB read), it should not block other unrelated lookup threads. Also, if 2 threads happen to call lookup(123) in quick succession which is a cache miss, 2nd thread should block and gets the result as soon as first thread receives the position and populates the cache. Perhaps some kind of notification.
———– Analysis ———–
Similar to my B2bTradingEngine. I feel we can address this either with low-level tools or with high level tools. Perhaps the interviewer is interested in the low-level know-how, or perhaps interested in the higher-level design techniques and principles. I feel it’s more challenging to assume high volume and slow connections, more relevant to the growing big data economy. Perhaps a distributed infrastructure.

It’s generally easier to scale down an over-engineered design than to scale-up a simplistic design.
———– MOM based idea ———-
(not yet a “design”). If the request volume is too high, then we run the risk of creating thousands of blocked threads. MOM to the rescue. I will refer to the MOM-based server process as the “engine”.

The client thread would have to be “decoupled” from the engine thread, otherwise 5000 concurrent requests would map to 5000 threads in the MOM server. To decouple, the client thread (one of 5000 or 55000) could block as a synchronous MOM consumer, or the client thread could register an onMsg() callback.

On the MOM server, we check request against the cache and returns the Position if hit — synchronously. Position data goes out in a response “topic”, so dozens of clients could filter the responses to look for their desired position. What if cache miss? Request goes into another queue (cacheMissQueue). Add the key to a task queue (producer-consumer). The consumer thread pool could have 1 or more work threads. Assuming the read() method is blocking, the worker thread must block. Upon failure, we update the cache with a special record for key 123. Upon success, we update the global cache. Either way we publish to the response topic.

Note If a request key 123 is already being processed or in the task queue, then we won’t even add it to the task queue. The task queue should “see” each key only once.
——–Low level design ——-
Now let’s assume this is a low-level single-process threading challenge, without MOM. The lookup() method could be called on 50 threads simultaneously. If 2 threads both request key 123 (cache miss), they must both block. Each reader thread should create a lock tagged with the key. During the read(), the lock should not be held. Upon successful read, the reader thread should lock, update cache, notify, unlock, then return the Position object. If in the interim a 2nd thread (a want2read thread) also has a cache miss on key 123, it would look for the lock tagged with 123, lock, then wait in it. Upon wake-up, it check cache. If found, it exits the loop and returns the Position object.

If read() times out or fails, the reader thread would fake a special Position object with an error msg, and do the same as above.

%%A: We need a thread pool to consume the queue. The pool threads will execute processOneReadTask(), “complete” the Position object in cache and notify on the position id

class Cache{
   std::map<.....> * map; // should use shared_ptr<position>
   Mutex mutex;
   Condition cond;
   Queue<int> queue; //thread safe queue, not std::queue
   DB * db;
     map(new std::map<....>),
     db(getRemoteDB()) {}
   void submitReadTask(int key){
         queue.enqueueIfNew(key); //thread-safely enqueue, if key has never been enqueued
   //callable by clients
   Position * lookup(int key){
       ScopedLock lock(mutex); //may block
       Position * found = map.findByKey(key);
       if (found) return found;
          Position * found = map.findByKey(key);
          if (found) return found;
   // executed by pool threads
   void processOneReadTask(){
       int key = queue.pop();
       if (key == -1) return; // empty queue
       Position * readOut =; //slow
       ScopedLock lock(this.mutex); //may block
       map.populate(key, readOut);
}; //end of class

How about the listener solution instead of wait/notify? If we have 500 threads requesting position 123. They all block in wait() — memory-intensive. The listener solution means each thread would package its local state info into a state object and save it into a global collection. A few threads would be enough. For example, a single thread could wake up upon notification and sequentially complete the tasks of the 500 requests.

[13]c^c++ thread libraries

Update — [[understanding and using c pointers]] c2013 says that C11 standard implements threading, but it’s not widely supported as of 2013. Among the supported C threading libraries, pthreads is most widely supported.

pthreads — not much change since the beginning.

Boost, roguewave and other C++ thread libraries are often (sugar coated) class libraries over C-level thread libraries. On a given platform, if there is pthreads and another thread library at C level, then the class library usually provides a consistent OO wrapper to hide their differences. I feel there will never be significant capabilities added to the c++ library and not added to the underlying C library. Here’s my reasoning —

Like any system call and standard system library, thread libraries are designed to support multiple languages and diverse applications. Many languages and many essential applications are still written in C not C++. Multi-threaded or not, C is the true common denominator, not c++. (All unix thread libraries I know are C-based). It’s inconceivable for a thread library vendor with new “ideas” to add them at the c++ layer rather than the underlying C-layer. says —

Many of today’s operating systems now provide support for multithreading and synchronization. This support takes the form of a low-level procedural API accessed as C language functions. is a detailed TOC

pure-output param in C++ standard library

Remember sybase/mssql stored proc can have pure-output param. Some of my blog posts ( point out that @param1=@localVar2 actually assigns left to right !

Anyway, ansi C pointer parameter also supports pure-output param. Example:

Case A) pthread_create() has a ptr-to-pthread_t param. This param is purely for output only. Caller allocates a buffer then gives its address to pthread_create(). This function populates that buffer with (not the address but the value of) a new-born thread’s handle

The buffer deserves scrutiny. If caller were to declare an uninitialized ptr-to-pthread_t then it would allocate 32bits for the pointer. But in this case *content* of the buffer might possibly exceed 32-bit if it is a struct. If you assume it’s a struct, then you need to allocate the struct, then pass its address to pthread_create.

In most implementations, pthread_t is a typedef (supported in ansi C) for an integer, but it’s instructive to think of it as a struct.

Allocation could be on heap or stack, but in threading apps, stack allocation is often meaningless.

recvfrom() also has a pure-output, in addition to a in/out parameter. See

pthreads mutex variable, briefly

The lock function [1] accepts an _address_ to a mutex object. So you can either have a ptr-to-mutex variable, or a nonref variable and pass its address. It’s hopeless to pass-by-value (pbclone) because multiple threads must share the same mutex instance, not copies of it.

Since address is needed everywhere, it’s slightly simpler to use a ptr, but don’t forget to free the malloc.

Typically you can’t avoid malloc, because a mutex object on stack is not very useful. A mutex is typically used by multiple threads.

[1] Actually all pthread mutex functions

create unbounded queue using STL deque – threading

There’s a STL container adapter — queue over deque. (Probably no queue over vector exists since one end of the vector is hard to operate.) Like all STL containers it’s thread unsafe. But in this context, let’s ignore this adapter and try to design our own adapter.

How many condition variables? I think 1 only — isEmpty.

Q1: How many mutexes? At most 1 at each end of the queue.
%%A: probably 1. See below.

Q1b: Is it possible to allow consumers and producers to operate simultaneously?
%%A1b: I don’t think so. See below.

Q2: What if there’re 0 elements now and both threads are operating on that 1 new element?

For a queue backed by a linked list, Answer 1b might be yes — I feel the producer would button up the entire node and put its address into the head pointer in one instruction. For a deque though, the copying is not an atomic one-stepper. It’s either operator=() or copy-ctor. Before the producer finishes copying, consumer reads the half-copied node! In conclusion, I feel we need a big lock over the entire data structure.

wait() locking/unlocking #boost

For both boost-thread, java 1.4 wait() and java Condition variables, here’s the sequence of events
1) before wait(), application programmer must lock explicitly.
——– going into wait() ——-
2) unlock implicitly

3) lock implicitly
——– return from wait() ——

Q: look at (3) — will an abnormal “exceptional” return from wait() still grab the lock?

A: yes for boost. This is a choke point — before the thread emerges from “under water” and awakes from “the avatar dream”, it must grab the lock — grabbing indefinitely if it’s not available.

A: yes for java 1.4. i guess the wait() method never throws any checked or unchecked exception except those 2 listed in API. The standard wait() call is wrapped in synchronized block, so it’s obvious that exceptional return from wait() must grab the lock.

A: ditto for See example in javadoc

user->kernel thread mapping]Solaris

P91 Hughes and Hughes diagram portraits the most general scheme, where 1 or more user threads (LWPs) map to a kernel thread. This 1-or-m:1 mapping depends critically on an unnoticed layer of thread library, which

* receives thread creation requests
* handles thread scheduling

Therefore, in general we have to assume threads are managed by the thread library, and threads we create may be invisible to kernel, if they exists as USER-threads.

The pthread thread-id TID doesn’t always map to a kernel thread id. Pthreads do not need to be implemented with kernel threads, so some thread libraries are entirely user threads or mixed. P92 Hughes&Hughes confirms Pthreads is a hybrid model.

java offers green and native threads.
– green — [[weblogic definitive guide]] and [[java threads]] clarify that multiple user-threads map to a single kernel thread under the green thread model. Therefore thread creation and scheduling is invisible to kernel.
– win32 native threads — one java thread maps to a kernel thread
– solaris and linux generally support native threads better. one-to-one mapping

boost thread to run a method of my object (full source

class Worker{
    Worker()    {
        // the thread is not-a-thread ie a dummy thread object, until we assign a real thread object
    void start(int N)    {
        // pass “this” as first arg to processQueue(), which runs not as a method but as a free func in its own thread
        m_Thread = boost::thread(&Worker::processQueue, this, N);
    void join(){m_Thread.join();}
    void processQueue(unsigned N)    {
        float ms = N * 1e3;
        boost::posix_time::milliseconds workTime(ms);
        std::cout << "Worker: started, will work for "
                  << ms << "ms"
                  << std::endl;
        std::cout << "Worker: completed" << std::endl;
    boost::thread m_Thread;
int main(int argc, char* argv[]){
    std::cout << "main: startup" << std::endl;
    Worker worker;
    std::cout << "main: waiting for thread" << std::endl;
    std::cout << "main: done" << std::endl;
    return 0;

RAII lock auto-release with boost SCOPED_lock

using namespace std;
boost::posix_time::milliseconds sleepTime(1);

class MyQueue {
    void enqueue(const T& x) {
        cout < enqueuing … ” << x << "n";
        boost::mutex::scoped_lock myScopedLock(mutex_);
        cout <> just got lock … ” << x << "n";
        // A scoped_lock is destroyed (and thus unlocked) when it goes out of scope
    T dequeue() {
        boost::mutex::scoped_lock lock(mutex_);
        if (list_.empty()) {
            throw “empty”; // unlock
        T tmp = list_.front();
        cout << "< dequeu " << tmp << "n";
        return (tmp);
    std::list list_;
    boost::mutex mutex_;
MyQueue queueOfStrings;
int reps = 5;
void sendSomething() {
    std::string s;
    for (int i = 0; i < reps; ++i) {
        stringstream st;
        st << i;
        s = “item_” + st.str();
void recvSomething() {
    std::string s;
    for (int i = 0; i < reps*3; ++i) {
        try {
            s = queueOfStrings.dequeue();
        } catch (…) {
            cout << "<- – (    ) after releasing lock n";
int main() {
    boost::thread thr1(sendSomething);
    boost::thread thr2(recvSomething);

simple boost thread program


void workerFunc() {
    boost::posix_time::seconds workTime(3);
    std::cout << "Worker: starting up then going to sleep" << std::endl;
    boost::this_thread::sleep(workTime); // Thread.sleep
    std::cout << "Worker: finished" << std::endl;
int main(int argc, char* argv[]) {
    std::cout << "main: startup" << std::endl;
    boost::thread workerThread(workerFunc); // pass a func ptr to thread ctor
    boost::posix_time::seconds workTime(1);
    std::cout << "main: sleeping" << std::endl;
    boost::this_thread::sleep(workTime); // Thread.sleep
    std::cout << "main: waking up and waiting for thread" << std::endl;
    std::cout << "main: done" << std::endl;
    return 0;

boost thread — recursive, try-lock, reader/writer, timeout …

These lock features are implemented at different layers in Boost.

In java, Reentrance is the only option — all locks are recursive. Reader/writer lock is a feature built on top of the basic reentrant lock. tryLock and lock timeout are basic features of the Lock interface.

In Boost, those CONCEPTS, lock types, free functions, typedef are noises to a beginner;) Specifically, people say mutex objects are accessed not directly but through wrapper, but the wrappers add noise. Actually, core classes[1] are mutex and unique_lock. Both of them support try-locking. However, to understand our big 4 features, it’s cleaner to focus on the mutex classes —

* Try-lock — supported by all mutex classes
* Timed locking — supported by a subset of the mutex classes, namely timed_mutex and recursive_timed_mutex.
* Reader/writer — supported by exactly one mutex class, namely shared_mutex.
* Reentrance — supported by a subset of the mutex classes, namely recursive_mutex and recursive_timed_mutex. Note Reentrance is implemented by these 2 mutex classes only, and not in the Lockable concepts or those complicated lock types. Just scan the boost documentation.

Once we are clear on these features in the mutex classes, we can understand the structure among the Lockable CONCEPTS —

+ Try-lock — supported by all Lockable concepts.
+ Timed locking — TimedLockable and derivatives
+ Reader/writer — SharedLockable only.
(Reentrance — not in these concepts)

[1]How about the workhorse —> scoped_lock? Confusing to some novices, scoped_lock is a typedef of unique_lock

heap issue in concurrent C++ apps

Q: how to avoid double-delete across threads?
%%A: Java solves the problem with a garbage collector. C++ Smart ptr can be designed to do the delete()  in a critical section.

Q: how to avoid reading a pointer on Thread A after Thread B deletes it?
%%A: with smart ptr, thread B won’t delete it.
A: Basic technique is a reference-counted smart pointer such as shared_ptr and rogue-wave smart pointers.

A technique is to give different mini-heaps to different threads, provided they don’t share heap objects.

[13]pthreads dominates in linux #jvm

See also post on c^c++ threading libraries.

— (3.0 kernel, c2013) [[Linux system programming ]] P226 points out

  • “Pthreads remains the predominant threading solution for both C __and__ C++ on unix systems”
    • ** not just C
    • ** not just linux
    • ** “Even if you use a different threading solution, … most of them are built on top of Pthreads”
  • * in linux, pthreads is “provided by” glibc — the main C library — but is in a separate library libpthread (not “libpthreadS”)
  • * the linux pthreads implementation could create “thousands of threads in a single process without slowdown”
  • shows c++11 thread constructs use underlying thread support in the OS. On linux, the underlying is pthreads. 

RAII mutex needed : no guaranteed unlock at thread death

See other posts on thread death, such as if a thread fails before releasing mutex #Shanyou

Update: RAII mutex introduces data inconsistency and is not always a good thing.

Mutex is shared by multiple stacks so can’t live on stack, but in a special semaphore memory area. According to [[OO MT using c++]], some semaphores (mutex locks) are shared among PROCESSES, so must live outside the address space of any process. When a process (or thread) dies, it could die owning the semaphore, blocking other processes indefinitely.

Because mutex release is manual, not by default, you should consider RAII, such as boost scoped_lock. See other posts on RAII mutex. says

If a thread is the owner of one or more exclusive write locks acquired by pthread_rwlock_wrlock(), pthread_rwlock_trywrlock(), or pthread_rwlock_timedwrlock_np(), when that thread terminates, the exclusive write locks are not automatically released by the system. This is an error in the application and indicates that the data associated with the lock is in an inconsistent state. If another thread attempts to get a shared read or exclusive write lock on the lock, that thread blocks forever. echos

“When a thread terminates, the mutex does not unless explicitly unlocked. Nothing happens by default.”

[[c++ cookbook]] echos “If an operating system thread is killed, it will not release its locks…”

##RAII (mutex + eg) — briefly

Resource Release Is Finalization is arguably the more relevant part of RAII.

The “resource” in question is often a shared resource, but not shared simultaneously. It has to be acquired and released. Such resources include
* locks
* files
* DB connections
* heap objects (via pointers)

— exemplifications
java synchronized keyword
java finally() to close DB connection

RAII mutex unlocking in exception stack unwinding

Q: why adopt exception handling in my c++ app instead of traditional error handling?

I don’t buy the “convenience” argument. To me the real justification is stack unwinding and RAII, including nearly guaranteed [1]
* unlock — same as in java synchronized keyword
* dtor

Without try/catch, deadlock is more likely.

[1] actually guarantee is tough and may require careful coding. See –>

When an exception occurs, and the stack unwound, only destructors for objects created on the stack are invoked. If you had created an object using “operator new()”, and only held a pointer to that heap object via a stack ref/ptr variable, the heap object will not be destroyed, resulting in a memory leak. A similar problem can arise when using threads in that an exception occurring within the bounds of a mutex lock, can result in the mutex not being unlocked when the stack is unwound and control returned to a higher point in the call chain.

I think one way to guarantee is using a scoped mutex. As a stackvar, the object has a limited LIFETIME and the var has a limited SCOPE.

In java, synchronized keyword guarantees automatic unlocking upon exception. If you use Lock objects, then you must do it yourselves. Try/finally seems to be the way. My suggestion:

* put lock() in Method 1, in a try block
* put abnormal unlock() in finally block
* put normal unlock() statements in any subsequent method on the call stack.

c++ exception handlers – global var

These “types” are aliases of the same type. Typedef over “ptr to void-void function”.

– terminate_handler / set_terminate() [1]
– unexpected_handler / set_unexpected() [1]
– new_handler / set_new_handler() [2]

Each setter modifies a global (implicitly static) variable. Global variable is quite common in C. The canonical global var is the errno global variable (which /bothers/ pthreads). I believe all of these global variables are shared by all threads.

For set_unexpected, Java offers similar things. ( highlights Swing EDT.)
* setUncaughtExceptionHandler()
* setDefaultUncaughtExceptionHandler()

[1] FAQ 271
[2] [[ effC++ ]]

boost thread described by its creator

Excerpts from There's also a very short program featuring boost mutex, and a program featuring boost bind, …

Many C++ experts provided input to the design of Boost.Threads. The interface is not just a simple wrapper around any C threading API. Many features of C++ (such as the existence of constructors/destructors, function objects, and templates) were fully utilized to make the interface more flexible. The current implementation works for POSIX, Win32

Currently (2002), not a lot can be done with a thread object created in Boost.Threads. In fact only two operations can be performed. Thread objects can easily be compared for equality or inequality using the == and != operators to verify if they refer to the same thread of execution, and you can wait for a thread to complete by calling boost::thread::join. Other threading libraries allow you to perform other operations with a thread (for example, set its priority or even cancel it). However, because these operations don’t easily map into portable interfaces….

Boost.Threads is designed to make deadlock very difficult. No direct access to operations for locking and unlocking any of the mutex types is available. Instead, mutex classes define nested typedefs for types that implement the RAII (Resource Acquisition in Initialization) idiom for locking and unlocking a mutex. This is known as the Scoped Lock pattern. To construct one of these types, pass in a reference to a mutex. The constructor locks the mutex and the destructor unlocks it. C++ language rules ensure the destructor will always be called, so even when an exception is thrown, the mutex will always be unlocked properly.

void ptr – thread library and other usages

There’s an ivory tower view that “malloc, char arrays (string) and other arrays are largely unused in c++, since we have delete/new, std::string, vector”. Now i see one more — void pointers. despite the ivory tower, void pointers are widely used and indispensable

* global op-new returns a pointer of type … can’t be any specific type; must be void
** even per-class op-new (implicitly static) returns void ptr. See P282 ARM
* allocators have an op-new. ditto
* other usages — See boost::any.

A lot of, if not most, thread libraries are written in C to be usable to C and C++. A single void pointer is a flexible and versatile vehicle to “transport” several arguments into a functor, which is then handed to a thread. Here’s an example.

thread (void (FVVP*)(void *), void* vptr) // ——–creates and starts a new thread. FVVP means a function ptr returning Void, accepting a Void Pointer

Above is a thread api with 2 arguments — a functor and a void ptr. The functor points to a function accepting a void ptr, so the 2nd argument feeds into the functor.

Now Suppose you realize the actual function you intend for the thread needs access to multiple objects, but you have only the 2nd argument to transport them. Thanks to void pointer, you can create a struct containing those multiple objects, and pass a single pointer as 2nd arg to the thread API. This usage of void pointer resembles
* the Object pointer in java.
* the raw HashMap argument Piroz suggested.
* the std::vector holding heterogeneous objects.

pthreads according to S. Wang

Boost thread is a wrapper around pthread? But boost creator disagrees — see other post.

Think of pthread as a c library on solaris. There’s a similar pthread library on linux. There’s another pthread library on aix … All of them have the same application programmer interface. If i have a c program using the solaris pthread library, and port to linux, then it should be able to link to the linux pthread library.

Let’s zoom into the solaris (or linux or aix …) pthread library. It is probably a bunch of c functions. Each probably makes system calls to the OS to create, schedule threads, yield, join, acquire lock … I feel a thread uses system resources. All system resources are “guarded” by the OS software layer, so I feel thread operations (join, signal, unlock …) must go through OS.

How is a native kernel thread created in the OS? I feel all native thread creation must go through OS, since OS will soon schedule and control the new born thread. In linux, thread is created by fork-exec, much like a process, with one notable exception — the new born thread doesn’t get an independent heap. Its heap is shared with the parent process. In contrast, a regular fork-exec child process and its parent can’t share data on their 2 heaps. See P68 [[OO multithreading using c++]]