primitive mutex=most restrictive of all locks

  • recursive/reentrant lock is more permissive on the owner thread
  • reader-writer lock is more permissive on the reader threads
  • try-lock doesn’t block
  • a counting semaphore is more permissive until we use up all permits

Most advanced locks help with liveness (deadlock avoidance, etc.) and efficiency.


fiber^thread, conceptually #shallow

I believe the concept of fiber is not standardized across languages. Here are some general observations

  • fibers are unknown to the kernel. They are similar to userland threads, implemented in userland thread libraries rather than by the kernel and system calls.
  • like userland threads, a fiber adds less load on the kernel. See [[pthreads]]
  • diff: fibers are even more light-weight than threads
  • diff: fibers are usually short-lived, perhaps similar to tasks
  • diff: fibers have smaller stacks
  • a few languages support millions of concurrent fibers on one OS. For threads, with an IO-heavy workload, you can probably run tens of thousands of threads on a single JVM.

Linux kernel thread cannot run user programs

–“Linux kernel thread cannot run user programs”, as explained in [[UnderstandingLinuxKernel]].

Removing ‘kernel’… Linux userland threads do run user programs.

Removing ‘thread’… Linux kernel processes or interrupt routines could possibly run under user process pid.

Removing ‘Linux’ … Other Unix variants do use kernel threads to run both 1) user programs and 2) kernel routines. This is my understanding from reading about JVM…

  • unix — multi-threaded user apps =use=> LWP =use=> kernel threads
  • linux — multi-threaded user apps =use=> LWP. LWP is the basic unit of concurrency in user apps.

per-thread^per-DBConn writeBuffer

  1. Broadly speaking, if you write to a table in your DB connection but don’t commit, it will be invisible to other connections.
  2. Similarly, if a thread writes to a shared object without a memory fence, the update is only in the per-thread buffer and invisible to other threads.
  3. File write is an even more common scenario. One process writes a single line to the file but doesn’t flush; the line will not be visible to other processes (see the sketch after this list).
  4. CVS
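A minimal java sketch of scenario 3 (file name and flush point are my own, for illustration); the write sits in a per-process buffer until flushed:

import java.io.BufferedWriter;
import java.io.FileWriter;
import java.io.IOException;

public class FlushDemo {
    public static void main(String[] args) throws IOException {
        try (BufferedWriter w = new BufferedWriter(new FileWriter("shared.log", true))) {
            w.write("one line\n"); // sits in this process's buffer; a reader tailing the file sees nothing yet
            w.flush();             // now the bytes reach the OS and become visible to other processes
        }                          // close() would also flush
    }
}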

blockingMutex implementation ] kernel

Background — Linux kernel provides two types of locks — spinlock and blocking mutex, as in https://www.kernel.org/doc/htmldocs/kernel-locking/locks.html . Here I focus on the mutex. I think this is far more useful to userland applications.

https://lwn.net/Articles/575460/ has good pointers:

  • I believe context switch is expensive since CPU cache has to be replaced. Therefore, optimistic spin is beneficial.

https://github.com/torvalds/linux/blob/master/kernel/locking/mutex.c shows

  • a blocking mutex used in kernel, perhaps not directly used by userland apps
  • implemented using spin lock + some wait_lock
  • maintains a wait_list. Not visible to any userland app.

 

calling database access method while holding mutex

Hi XR,

Q: If some vendor provides a database access function (perhaps dbget()) that may be slow and may acquire some lock internally, is it a good idea to call this function while holding an unrelated mutex?

We spoke about this. Now I think this is so bad it should be banned. This dbget() could take an unknown amount of time. If someone deleted a row from a table that dbget() needs, it could block forever. The mutex you hold is obviously shared with other threads, so those threads would be starved. Even if this scenario happens once in a million times, it is simply unacceptable.

Here, I’m assuming the dbget() internal lock is completely unrelated to the mutex. In other words, dbget() doesn’t know anything about your mutex and will not need it.

As a rule of thumb, we should never call reach-out functions while holding a mutex. Reach-out functions need to acquire some shared resources such as a file, a web site, a web service, a socket, a remote system, a messaging system, a shared queue, a shared-memory mapping… Most of these resources are protected and can take some amount of time to become available.
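A minimal sketch of the rule (cache, cacheLock, key and dbget() are all hypothetical names): shrink the critical section so the reach-out call runs with no mutex held.

Object row;
synchronized (cacheLock) {     // quick, in-memory work only
    row = cache.get(key);
}
if (row == null) {
    row = dbget(key);          // slow reach-out call; no mutex held here
    synchronized (cacheLock) {
        cache.put(key, row);
    }
}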

(I remember there’s some guideline named open-call but I can’t find it.)

That’s my understanding. What do you think?

deadlock avoidance: release all locks fail-fast

In some deadlock avoidance algorithms, at runtime we need to ensure we immediately release every lock already acquired, without hesitation, so to speak. Here’s one design my friend Shanyou shared with me.

  • Tip: limit the number of “grabbing” functions, i.e. functions that explicitly grab any lock.

Suppose a grabbing function f1 acquires a lock and calls another function f2 (a red flag to concurrency designers). In such a situation, f2() will use a -1 return value to indicate “must release lock”. If f2() calls a grabbing function f3(), and f3() fails to grab, it propagates “-1” to f2 and f1.

  1. f3() could use trylock.
  2. f3() could check lock hierarchy and return -1.
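A minimal java sketch of this design (the function names f1/f2/f3 are from the text; the two locks are hypothetical):

import java.util.concurrent.locks.ReentrantLock;

final ReentrantLock lockA = new ReentrantLock(), lockC = new ReentrantLock();

int f3() { return lockC.tryLock() ? 0 : -1; }  // grabbing function: fail fast, never block

int f2() {
    if (f3() < 0) return -1;                   // propagate "must release lock" upward
    try { /* ... work under lockC ... */ return 0; }
    finally { lockC.unlock(); }
}

void f1() {
    lockA.lock();                              // grabbing function #1
    try {
        if (f2() < 0) return;                  // back out immediately; finally releases lockA
        /* ... work under lockA ... */
    } finally { lockA.unlock(); }
}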

if thread fails b4 releasing mutex #CSY

My friend Shanyou asked:

Q: what if a thread somehow fails before releasing mutex?

I see only three scenarios:

  • If machine loses power, then releasing mutex or not makes no difference.
  • If process crashes but the mutex is in shared memory, then we are in trouble. The mutex will be seen as forever in-use. The other process can’t get this mutex. I feel this could be a practical problem, with practical solutions like reboot or process restart.
  • If process is still alive, I rely on stack unwinding.

Stack unwinding is set up by the compiler. The only situation where this compiler-generated stack unwinding is incomplete is when the failing function is declared noexcept. (In that case, the failure is a self-inflicted problem, since you promised the compiler it would never throw.) I will assume we have no noexcept functions. Therefore, I assume stack unwinding is robust and all stack objects will be destructed.

If one of the stack objects is a std::unique_lock, then compiler guarantees an unlocked status on destruction. That’s the highest reliability and reassurance I can hope for.
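For comparison, the java counterpart of this guarantee is the try/finally idiom. A minimal sketch (riskyWork() is hypothetical):

lock.lock();
try {
    riskyWork();    // may throw
} finally {
    lock.unlock();  // runs whether riskyWork() returns or throws; the "unwinding" analogue
}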

updates2shared-mutable: seldom flushed2memory immediately #CSY

  • memory fence c++
  • memory barrier c++
  • c++ thread memory visibility

SY,

You can search for these keywords on Google. Hundreds of people would agree that without synchronization, a write to sharedMutableObject1 by thread A at Time 1 is not guaranteed to be visible to Thread B at Time 2.
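A self-contained demo of the claim (my own sketch; the reader thread may or may not terminate, which is exactly the point):

class VisibilityDemo {
    static boolean done; // not volatile, no lock: no happens-before edge
    public static void main(String[] args) throws InterruptedException {
        Thread reader = new Thread(() -> { while (!done) { /* may spin forever on a stale value */ } });
        reader.start();
        Thread.sleep(100);
        done = true;       // Thread A's write at Time 1
        reader.join(1000); // Thread B may still not see it at Time 2
        System.out.println("reader still alive? " + reader.isAlive());
    }
}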

In any aggressively multithreaded program, there are very few shared mutable objects. If there’s none by design, then all threads can operate in single-threaded mode as if there’s no other thread in existence.

In single-threaded mode (the default), compilers would Not generate machine code to flush every write to main memory, bypassing the register/L1/L2/L3 caches. Such a memory barrier/fence is extremely expensive — main memory is at least 100 times slower than register access.

I would hypothesize that by default, the most recent write (at Time 1) is saved to register only, not L1 cache, because at compile time, compiler author doesn’t know if at runtime this same thread may write to that same address! If you update this object very soon, then it’s wasteful to flush the intermediate, temporary values to L1 cache, since no other threads need to see it.

L1 cache is about 10 times slower than register.

Multi-threaded lock-free programming always assumes multiple threads access shared mutable objects.

Even a lock-free function without contention [1] requires memory barriers and is therefore slower than single-threaded mode. I would say in a low-contention context, the main performance gain of single-threaded over lock-free is data cache efficiency. Another performance gain is statement reordering.

[1] i.e. no retry needed, since other threads are unlikely to touch sharedMutableObject1 concurrently

c++11 atomic{int}^AtomicInteger.java

The #1 usage of atomic<int> is load() and store(). I will use short form “load/store” or “l/s”.

The #2 usage is CAS.  Interviewers are mostly interested in this usage, though I won’t bother to remember the function names —
* compare_exchange_strong()
* compare_exchange_weak()


The CAS usage is the same as AtomicInteger.java, but the load/store usage is more like the thread-safety feature of Vector.java. To see the need for load/store, we need to realize that simple “int” assignment is not atomic [1]:

  • P1012 [[c++ standard library]] shocked me by affirming that without locks, you can read a “half-written Boolean” [1].

To solve this problem, atomic<int> uses internal locks (just like Vector.java) to ensure load() and store() are always atomic.

[1] different from java. https://stackoverflow.com/questions/11459543/should-getters-and-setters-be-synchronized points out that 32-bit int in java is never “half-written”. If you read a shared mutable int in java, you can hit a stale value but never a c++ style half-written value. Therefore, java doesn’t need guarded load()/store() functions on an integer.

Q: are these c++ atomic types lock-free?
A: for load/store — not lock-free. See P 1013
A: for CAS — lock-free CPU instructions are used, if available.
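For comparison, the same CAS idiom in java; a minimal retry loop on AtomicInteger:

import java.util.concurrent.atomic.AtomicInteger;

class CasDemo {
    static final AtomicInteger counter = new AtomicInteger();
    static int increment() {
        int prev, next;
        do {
            prev = counter.get();
            next = prev + 1;
        } while (!counter.compareAndSet(prev, next)); // fails iff another thread raced us in between
        return next;
    }
}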

after fork(): threads,sockets.. #Trex

I have read about fork() many times without knowing these details, until a Trex interviewer asked!

–based on http://man7.org/linux/man-pages/man2/fork.2.html

The child process is created with a single thread—the one that called fork(). The entire virtual address space of the parent is replicated in the new process, including the states of mutexes, condition variables, and other pthreads objects.

The very 1st instruction executed in Child is the instruction after fork() — as proven in https://github.com/tiger40490/repo1/blob/cpp1/cpp1/fork3times.cpp

The child inherits copies of the parent’s set of open file descriptors, including stdin/stdout/stderr. The child process should usually close the descriptors it doesn’t need.

Special case — socket file descriptor inherited. See https://bintanvictor.wordpress.com/2017/04/29/socket-shared-between-2-processes/

 

## Y avoid blocking design

There are many contexts. I only know a few.

1st, let’s look at a socket context. Suppose there are many (say 50 or 500) sockets to process. We don’t want one thread per socket. We prefer fewer, perhaps 1 thread to check each “ready” socket, transfer whatever data can be transferred, then go back to waiting (see the NIO sketch after this list). In this context, we need either

  • /readiness notification/, or
  • polling
  • … Both are compared on P51 [[TCP/IP sockets in C]]
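Here is a minimal java NIO sketch of readiness notification (port number arbitrary, error handling elided):

import java.net.InetSocketAddress;
import java.nio.channels.*;
import java.util.Iterator;

ServerSocketChannel server = ServerSocketChannel.open();
server.bind(new InetSocketAddress(8080));
server.configureBlocking(false);
Selector sel = Selector.open();
server.register(sel, SelectionKey.OP_ACCEPT);
while (true) {
    sel.select();                               // block until some socket is "ready"
    Iterator<SelectionKey> it = sel.selectedKeys().iterator();
    while (it.hasNext()) {
        SelectionKey k = it.next(); it.remove();
        if (k.isAcceptable()) {                 // new connection: register it for reads
            SocketChannel c = server.accept();
            c.configureBlocking(false);
            c.register(sel, SelectionKey.OP_READ);
        } else if (k.isReadable()) {
            // transfer whatever data can be transferred, then go back to waiting
        }
    }
}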

2nd scenario — GUI. Blocking a UI-related thread (like the EDT) would freeze the screen.

3rd, let’s look at some DB request client. The request thread sends a request and it would take a long time to get a response. Blocking the request thread would waste some memory resource but not really CPU resource. It’s often better to deploy this thread to other tasks, if any.

Q: So what other tasks?
A: ANY task, in the thread pool design. The requester thread completes the sending task, and returns to the thread pool. It can pick up unrelated tasks. When the DB server responds, any thread in the pool can pick it up.
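A minimal sketch of that hand-off (dbGet() and render() are hypothetical):

import java.util.concurrent.*;

ExecutorService pool = Executors.newFixedThreadPool(8);
CompletableFuture
    .supplyAsync(() -> dbGet("someKey"), pool)  // the requester thread fires this and returns to the pool
    .thenAcceptAsync(row -> render(row), pool); // when the DB responds, ANY pool thread picks it up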

This can be seen as a “server bound” system, rather than IO bound or CPU bound. Both the CPU task queue and the IO task queue get drained quickly.

 

fileHandle/socket/dbConnection are thread Unsafe

There’s a buffer in a DB connection, in a file handle, in a socket …

The buffer is a shared mutable object. Consider a read-buffer. The host object knows how much of the data in the buffer was already delivered, to avoid repeated delivery. There’s some kind of watermark, which is moved by a consumer thread.

Like all shared mutables, these objects are thread unsafe.

All of these objects can also be confined to a single thread (e.g. referenced only from a local variable on one thread’s stack) and therefore invisible to other threads. This confinement could be the basis of a thread-safe design.
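One common confinement idiom in java: give each thread its own instance instead of sharing one (openConnection() is hypothetical):

import java.sql.Connection;

ThreadLocal<Connection> conn = ThreadLocal.withInitial(() -> openConnection());
// each thread calling conn.get() lazily receives its own Connection: no sharing, no locking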

 

mutex: allocated in heap or global memory

In java, a mutex is always on the heap. Primitives can’t host a mutex.

In pthreads, mutex object can be allocated anywhere, but I have seen it allocated only in heap or in global area.

  • in global area, you use a one-liner to allocate and initialize a non-ref variable
  • on heap, you first malloc then call init(). I think this is like a constructor.
    • As a special case, you can also allocate mutex in shared memory, creating a cross-process mutex! Not a common practice in java.

Q: what if I allocate a mutex on the stack?
A: in java, the actual object is still on the heap, though the variable is on the stack and invisible to other threads
A: in C, the entire mutex object can be on the stack, but such a mutex is useless. Imagine a lock on a door with a single key to be shared, but other people have their own doors, so there’s no synchronization, no access control at all :(

Single-Threaded^STM #jargon

SingleThreadedMode means each thread operates without regard to other threads, as if there’s no shared mutable data with other threads.

Single-Threaded can mean

  • no other thread is doing this task. We ensure there’s strict control in place. Our thread still needs to synchronize with other threads doing other tasks.
  • there’s only one thread in the process.

condition.signalAll Usually requires locking

Across languages,

  • notify() doesn’t strictly need the lock;
  • wait() always requires the lock.
    • c++ wait() takes the lock as argument
    • old jdk uses the same object as lock and conditionVar
    • jdk5 makes the lock beget the conditionVar

—- original jdk
100% — notify() must be called within synchronized block, otherwise exception. See https://docs.oracle.com/javase/7/docs/api/java/lang/Object.html#notify()
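Minimal illustration (mailbox is a hypothetical shared object with a boolean field):

synchronized (mailbox) { // mandatory: notify() outside the monitor throws IllegalMonitorStateException
    mailbox.ready = true;
    mailbox.notifyAll();
}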

—- jdk 5 onwards
99% — https://docs.oracle.com/javase/7/docs/api/java/util/concurrent/locks/Condition.html#signal() says that An implementation may (and typically does) require the lock acquired by the notifying thread.

Not 100% strict.

— compare await():
100% — On the same page, the await() javadoc is 100% strict on this requirement!

====c++
0% — notify_all() method does NOT require holding the lock.  See http://en.cppreference.com/w/cpp/thread/condition_variable/notify_all

–compare wait()
100% — wait(myLock) requires a lock as argument!

====c#
100% — Similar to old JDK, the notifying thread must hold the lock. See https://msdn.microsoft.com/en-us/library/system.threading.monitor.pulseall(v=vs.110).aspx

 

atomic ≠ lockFree ≠ CAS

See also blog on c++ atomic<int> — primary usage is load/store, not CAS

lock free isn’t equal to CAS — lock-free constructs also include Read-Copy-Update, widely used in linux and other kernels.

atomic isn’t equal to lock-free — for example, the c++ atomic classes (actually templates) can use locks if the underlying processor doesn’t have a CAS instruction.

 

read-copy-update lockfree +! retry

RCU is an advanced concurrency construct, not implemented in a regular application, but valuable for lock-free interviews.

http://concurrencyfreaks.blogspot.sg/2016/09/a-simple-userspace-rcu-in-java.html is a simple java implementation

The Wikipedia article is accessible until the paragraph introducing read-side-critical-section and grace-period. This is the first paragraph on the implementation. I found it hard. Therefore a few background pointers:

  • There must be multiple threads, typically many readers and few writers.

  • There must be a shared mutable data structure, probably on the heap.

  • Not all data structures are suitable for RCU. So which subset is? I would say pointer-based graphs, including hash tables.

In the simplified model, we are talking about

  • A reader thread R1 executing a block of code that has started reading the shared mutable data structure. This code block is the critical section.

  • A writer thread W1 executing a block of code attempting to update the same data.

  • After the so-called grace period, a GC thread would reclaim the obsolete data that R1 has finished reading.

  • A reader R2 that enters the critical section after the update is done. R2 would see the new data.

The GC needs to know when it’s safe to reclaim. It needs the wait-for-readers operation and the grace period.

Q: What if R1 gets a reference to the “obsolete node”, performs some slow task, reads the node, then exits the critical section? The node is reclaimable only after R1 exits the critical section. Wouldn’t the grace period be long?

Q: Among 9999 nodes, how does the system keep track of which nodes are reclaimable?

%%A: I feel kernel access to CPU registers might be needed.

Q: How does it compare with copy-on-write?

A: COW probably copies the entire data structure. Also, the modified copy is not visible to subsequent readers (like R2)

Q: How does it compare with read/write lock?

A: readlock can block

Q: How does it relate to lock-free concepts?

A: reader threads are lock-free and even better — not required to retry.

A: GC threads need to wait.

A: Writer threads (like W1)? Not sure; might be lock-free in some situations.

where are jvm locks stored@@ #cf Windows

A 2017 Morgan Stanley on-site interviewer asked…

  • %%A: can’t be on stack, since multiple threads need access to it
  • %%A: probably kernel objects. On Windows, most of the synchronization objects are wrappers over kernel objects. See [[Optimized c++]] P288.
    • If java threads maps to kernel threads then I would think the same.
    • If java threads are implemented in userland, then kernel scheduler is unaware and can’t use kernel objects to control them. I would think JVM must have its own mutex objects without kernel support.

http://stackoverflow.com/questions/5713142/green-threads-vs-non-green-threads explains that only embedded or low-power devices use “green threads” i.e. userland threads. Linux and Windows JVMs probably don’t support green threads, since the kernel provides “better” thread support.

jargon: mutex^lock^monitor

All these terms appear in function names and class names, so they must have well-defined technical meanings. Problem is, each platform has its own definitions. [[Optimized c++]] P288 singled out “Semaphore” as jargon meaning rather different animals across operating systems.

–In Java

  • Monitor refers to the “combo” object serving as both lock and condition variable. Java 5 introduced an alternative family of objects, where a lock object gives birth to condition objects. Most java developers still use the simpler combo object.
  • I feel lock vs mutex mean the same thing.

–in c++

  • Lock is an RAII object, based on a mutex.
  • C++ doesn’t use “monitor” jargon as far as I know.

–On Win32

See https://bintanvictor.wordpress.com/2010/06/01/c-mutex-vs-semaphore/

readLock +! writeLock – same as no lock@@

Q9: With regard to shared-mutable access semantics, using a readLock and discarding the writeLock (i.e. unable to use it) is similar to using no lock, but are they semantically the same?

This is a theoretical discussion. The implementations are not familiar to me and not relevant. Before we answer  Q9, let’s look at

Q1: in the absence of shared-mutable, do we need any lock?
A1: no

Q2: what if we use readLock without writeLock?
A2: should be fine.

Q3: what if we do use a lock in a shared function?
A3: system behavior is modified — serialized access, which is unnecessary

A9: should be semantically identical. Everyone gets a free ticket at entrance.

mutex^condition top2constructs across platforms#except win32

https://developer.mozilla.org/en-US/docs/Mozilla/Projects/NSPR/About_NSPR is one more system programming manual that asserts the fundamental importance of mutex and condition variable.

NSPR started with java in mind, but its primary application is supporting clients written entirely in C or C++.

I have said many times in this blog that most thread synchronization features are built on top of this duo.

However, on Windows the fundamental constructs are locks and waitHandles…

Y more threads !! help throughput if I/O bound

To keep things concrete, you can think of the output interface as the I/O in question.

The paradox — given an I/O bound busy server, conventional wisdom says more threads could increase CPU utilization [1]. However, the work queue for the CPU gets quickly /drained/, whereas the I/O queue is constantly full, as the I/O subsystem is working at full capacity.

[1] In a CPU bound server, adding 20 threads will likely create 20 idle, starved new threads!

Holy Grail is simultaneous saturation. Suggestion: “steal” a cpu core from this engine and use it for unrelated tasks. Additional threads or processes basically achieve that purpose. In other words, the cpu cores aren’t dedicated to this purpose.

Assumption — adding more I/O hardware is not possible. (Instead, scaling out to more nodes could help.)

If the CPU cores are dedicated, then there’s no way to improve throughput without adding more I/O capacity. At a high level, I clearly see too much CPU /overcapacity/.

async – 2 threads 2 different execution contexts

See also https://bintanvictor.wordpress.com/2013/01/06/every-async-operation-involves-a-sync-call/

Sync call is simple — the output of the actual operation is processed on the same thread, so all the objects on the stack frame are available.

In an async call, the RR fires and forgets. Firing means registration.

The output data is processed on some other thread — usually not the RR thread. Therefore the host object of the callback and other objects on the RR stack frame are not available.

If one of them is made available to the callback, then the object must be protected from concurrent access.

threading – I/O^CPU intensive

See also — [[linux sys programming]] has a one-pager on this topic.
This is a common topic in IV and in the literature. HSBC market data + algo trading IV 2017 also touched on this.
IO-intensive – may scale up to hundreds of threads, even with just 4 cores. Each thread handles some I/O channel or connection.
eg(?): network server
eg: GUI – multiple threads could be waiting for disk or user input
CPU-intensive on multi-core machines – don’t need too many threads, but single-thread is non-optimal. Each core is effectively dedicated to a single thread.

how likely are two threads sharing a CHM to clash@@

I would also like to point out the java ConcurrentHashMap lets two threads (possibly a writer thread and a reader thread) access the shared data structure concurrently, probably without locking or CAS, when the two threads happen to access two distinct segments.

Note there can be a large number of segments (up to a few hundred), so the chance of two threads hitting distinct segments is very high (about 99.7% for a 333-segment map). Therefore, contrary to what Jun said, for reader/writer threads to access the same data structure, we don’t always need locking in every read and write access.

concurrencyLevel – the estimated number of concurrently updating threads. The implementation may use this value as a sizing hint.
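For the record, the hint is passed via the 3-arg constructor (the sizes here are arbitrary):

import java.util.concurrent.ConcurrentHashMap;

// initialCapacity=1024, loadFactor=0.75, concurrencyLevel=64 estimated concurrent writers
ConcurrentHashMap<String, Integer> map = new ConcurrentHashMap<>(1024, 0.75f, 64);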

c++atomic^volatile

See other posts about volatile, in both c++ and java.

[[c++succinctly]] also firmly denounced volatile keyword’s usage for concurrency.

[[effModernC++]] echoes in Item 40: Use std::atomic for concurrency, volatile for special memory.

For a “volatile” variable, the c++ (unlike java) compiler is not required to introduce a memory fence, i.e. updates by one thread are not guaranteed to be visible globally. See Eric Lippert on http://stackoverflow.com/questions/26307071/does-the-c-volatile-keyword-introduce-a-memory-fence

threading support in compiler^library

Not sure what POSIX falls into. OS-level library?

Java (followed by c#) absorbed many concurrency features into the compiler, by adding language keywords. Is this related to the VM? I think so. Is The library approach still necessary in the presence of VM? Yes. http://stackoverflow.com/questions/30816307/does-java-jvm-use-pthread confirmed that jvm uses pthreads on linux and macOS.

Python, perl and most other languages take the common add-on “library” approach. The library is usually a wrapper over the C threading library.

calling unknown code while hold`(or !! hold`) lock

In c++, if the unknown code does a fork(), then the lock is replicated. See after fork(): threads

[[eff c#]] warns against calling “unknown” code while holding a lock. This advice applies to java and c++ too.

Now, what defines “unknown”? Any delegate can have an invocation list containing unknown callbacks. Raising an event involves delegate invocation. If you receive an Action instance, realize it’s a delegate instance too.

Another category of “unknown code” is implementations of an interface. You get an IComparable instance and just call CompareTo() without knowing the implementation.

Suppose within obj1.method1() we grab a lock and then call the unknown code. Such unknown code could circle back to call obj1.method2(). If method2() also grabs the same lock, then no issue, thanks to c# reentrancy. Note c# reentrancy is implemented by counting — http://stackoverflow.com/questions/3687505/recursive-nested-locking-in-c-sharp-with-the-lock-statement

If method2 runs on another thread, then we get a classic single-lock deadlock — method2 would be blocked by the lock held by method1(), hopefully for a short time only. If method1() is unfortunately waiting for method2 to finish its work on the other thread, then the deadlock would persist.

Even if there’s no locking involved, method2 may circle back to invoke method1 (on the same or a separate thread, on the same obj1 or another object), recursively. If this recursion is unplanned (like unplanned pregnancy:), then stack overflow may ensue. Dumb luck if unplanned recursion doesn’t hit stack overflow.
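A common defensive idiom, the “open call”: snapshot under the lock, invoke outside it. A sketch (Listener, listeners and evt are hypothetical):

import java.util.ArrayList;
import java.util.List;

List<Listener> snapshot;
synchronized (this) {
    snapshot = new ArrayList<>(listeners); // quick copy under the lock
}
for (Listener l : snapshot) {
    l.onEvent(evt); // unknown code runs with no lock held, so it can't deadlock on ours
}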

Producer-Consumer threading question by YH

(publishing this example on my blog)

https://community.oracle.com/thread/1142629?tstart=120 presents a simple but tricky threading scenario. Without the sleep(), we get random output like

Putting : 0
Putting : 1
Getting : 0
Q: how could “Putting 1” happen before “Getting 0”?
Explanation — actually, the assignments to this.hasPut do happen in lockstep: true, false, true, false… (you can confirm by printing its value immediately after assignment). This is thanks to the “synchronized” keyword. The problem is the Putting/Getting prints. These statements are outside the synchronized blocks and can get badly delayed, as in the scenario above —
put() returns -> “Putting 0” -> get() returns but the consumer thread is preempted before “Getting 0” -> put() returns -> “Putting 1” -> “Getting 0”
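A minimal reconstruction of the scenario (my own sketch, not the original forum code); note the prints sit outside the synchronized methods:

class Cell {
    private int value;
    private boolean hasPut;
    synchronized void put(int v) throws InterruptedException {
        while (hasPut) wait();
        value = v; hasPut = true; notifyAll();
    }
    synchronized int get() throws InterruptedException {
        while (!hasPut) wait();
        hasPut = false; notifyAll();
        return value;
    }
}
// producer thread: cell.put(i);        System.out.println("Putting : " + i); // print runs after the lock is released
// consumer thread: int v = cell.get(); System.out.println("Getting : " + v); // may be preempted before printing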

testing threaded designs – enterprise apps( !! lib)

Bottom line – (Unconventional wisdom) Be bold to create new threaded designs. It doesn’t have to be rock solid like in the standard library.

Experts unanimously agree that non-trivial MT designs are hard to verify or test, often exceedingly hard. There are often too many possibilities. Maybe a million tests pass but next test reveals a bug. Therefore peer review is the way to go.  I feel that’s the “library” view, the view from the language creators. Different from enterprise apps.

In enterprise apps, if a MT design passes load test and UAT, then good enough. No budget to test further. If only 1 in a million cases fail then that case has something special — perhaps a special combination or purely timing coincidence. Strictly speaking those are still logical errors and design defects. A sound design ought to handle such cases gracefully. Most designs aren’t so sound. If such a failure happens so rarely, just about everyone involved would agree it’s basically impossible to catch it during testing. Such a bug, if ever uncovered, would be too hard to catch and catching it should not be in any job scope here. Missing it is tolerable and only human.

A goal keeper can’t be expected to catch 99 penalties in a row.

In the enterprise reality, such a bug is probably never uncovered (unless a log happens to provide the evidence). It often takes too much effort to investigate such a rare issue. Not worth the effort.

"preempt a thread" = move it off driver’s seat

First some jargon. The “eligible club” (as described in other posts) includes all the runnable threads that are eligible to take the driver’s seat. In contrast, a “waiting” thread is BLOCKED and suspended.

Usually there are more eligible threads than processors (even though some CPUs like the Sparc T1 can have 32 simultaneous threads on 8 cores). In other words, there are more drivers than cars.

The technical jargon “preempt” means “move an executing thread off the driver’s seat”. See P127[[headfirstC]]

[[linux sys programming]] P165 has more details.

sync primitives in pthreads^winThreads

(see also post on “top 2 threading constructs in java ^ c#”)

[[Art of concurrency]] P89 has a 3-pager on pthreads vs win threads, the 2 dominant thread libraries. It claims “… most of the functionality in one model can be found in the other.”

In pthreads (like java), the 2 “most common” controls are mutex and condition var

In win threads library, the 2 essential controls are

* events/WaitHandles — kernel objects. Also known as event mutexes, these are comparable to condVar, according to P175 [[object-oriented multithreading using c++]]
* locks
** mutex — kernel object, cross process
** CRITICAL_SECTION — userland object, single-process

top 2 threading constructs in java ^ win32

Update: Now I think condition is not a fundamental construct in c#. The wait handle is. WH are based on primitive kernel objects…. See other posts
—–

(Soundbite — I feel the know-how about low-level constructs is more valuable/versatile/powerful. Interviewers often recognize that. These include CAS and conditions.)

See other posts on the additional threading constructs added by dotnet …
See also my post on NSPR, a cross-platform library with heavy concurrency emphasis.

Most important, practical and popular [1] constructs —
* pthreads       — 1) locks, 2) conditions
* java              — 1) locks, 2) conditions. On a “higher” layer, timers; thread pools; queues and other thread-safe collections
* win32/dotnet — 1) locks, 2) event WaitHandle (not conditions)… Also timers, thread pools. Task is supposed to be popular too.
* the dbx debugger offers “mutex” and “condition” commands as the only 2 thread-related features (beside the “thread” command)

In Win32, there are 2 lock constructs
1a) mutex : usable cross-process , like other kernel objects
1b) CRITICAL_SECTION : single-process only, like other userland objects

In general, locks alone are sufficient for simple thread coordination. Sometimes you need fancier tools —
– In java, you can use wait/notify, which is another primitive.
– In C#, wait/notify seems less popular; WaitHandles seem to be popular. Wait handles are designed (by MS) to be a more complex but feature-rich notification /construct/ than conditions. See http://msdn.microsoft.com/en-us/library/ms228964.aspx#signaling. However, experts agree conditions are more powerful. The IAsyncResult uses wait handles for inter-thread signaling.

Interlocked/atomicVar seems to be equally popular in java and c#.

I think, except for CAS, all other threading constructs rely on locks and conditions. As [[c# threading]] points out, you can simulate most features of dotnet Wait Handles with simple Condition techniques. However, wait handles (but not conditions) support IPC.

[1] some techniques are important and practical but poorly marketed and unpopular — wait/notify, immutable, interrupts,..

producer/consumer threading pattern #letter to YH

YH,

You and I are big fans of the producer/consumer threading pattern. Now I realize there are drawbacks.

There is synchronization overhead since the task queue is shared mutable data. The more fine-grained the tasks, the more frequent threads would add/remove on the task queue, the more overhead.

I feel P/C is a high-level “general purpose” threading pattern, good for most everyday concurrency, but not for those extreme requirements. In a lot of high-end gigs, they look right past the standard techniques like P/C and look for specialized techniques/experience that is needed for higher performance. I think they look for lockfree, and they look for parallelism without inter-thread interference.

For example, if the (market) data floods in thick and fast, I fear P/C may suffer. Perhaps a naive design is for the “dispatcher” or “boss” thread to accumulate a chunk of data and package it into a task and add it to the queue. Such a design may make the dispatcher a hotspot or bottleneck.

The lower we descend, the less overhead is tolerated, and the less synchronization code I see. I feel the low-level coders avoid synchronization like the plague.

What’s your view?

reentrant/recursive lock always uses acquire-count, across languages

I have seen c++ (boost) and c# documentations. Now I believe across languages all reentrant locks need to keep count of repeated “acquires” by the same thread. Lock is released only when count hits zero.

P273 [[concur programming on windows]] confirms that CLR monitors support reentrance by counting.

One Java reentrant lock (reentrant readwriteLock) has a max acquire count of 65535.
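Java exposes the count directly; a minimal demo:

import java.util.concurrent.locks.ReentrantLock;

ReentrantLock lk = new ReentrantLock();
lk.lock();
lk.lock();                             // same thread re-acquires: count goes 1 -> 2
System.out.println(lk.getHoldCount()); // prints 2
lk.unlock();
lk.unlock();                           // the lock is actually released only now, when count hits 0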

## most popular base threading libraries

win32 – on windows
Pthreads – on *nix

Traditionally, these are the 2 most important among the base thread libraries. Unsurprisingly, all the base thread libraries are in C. As explained elsewhere on my blog, there’s no reason to package these libraries in any other language. C is the clear, overwhelming favorite.

 

http://stackoverflow.com/questions/30816307/does-java-jvm-use-pthread confirmed that jvm uses pthreads on linux and macOS

How many is too many threads in a process?

By default the dotnet thread pool has 250 threads per core[1]. In dotnet, I believe all threads are native, not “green” threads manufactured by the thread library.

I read in other places that 400 threads per core is considered a soft upper limit.

Remember each core, like a car, can only have one driver thread at any time. So all other threads must be inactive — blocked (i.e. context switched out, or preempted)

[1] My friend told me it’s just 25 in older versions.

mutable-shared data – source of all concurrency hazard

Consistency, liveness, deadlock, mutex, synchronization … all the concurrency complexities have one necessary condition — mutable shared data. Without this single condition, all these headaches disappear or become unnecessary.

If all mutable data are non-shared and all shared data are immutable, then you can create a great many threads and they will run faster than under synchronization.

Note “data” doesn’t imply classes. In java/c#, such data is an instance of a class or struct. In non-OO languages like C, “data” can be global variable or any heapy thingy, manipulated by a pointer.

mutex – serialize access to… data or code?

These are 2 ways to look at the basic mutex use case —
DD) you can use a mutex to serialize access to a piece of shared mutable data
CC) you can use a mutex to serialize passage through a code path — the so-called critical section

Many people cling to the intuitive DD view and visualize the runtime using a mutex to regulate multiple threads’ access to shared Data, as if the mutex were a locker. However, I find CC a more realistic/accurate view.

DD is the “logical” view whereas CC is the physical view.

DD is achieved through CC.

The more low-level you go, the more you take the physical view and speak in terms of critical sections.

DD is more object oriented. For non-OO languages like C, DD is not as natural as in OO language. If a data item is passed by pointer, then any function running on any thread can modify it. It’s not enough to use a mutex on 99% of those functions and have one function access it directly.

In OO languages, that data item would Always be a private field, automatically restricting access to very few methods of the class. Now we understand why encapsulation is essential to threading.

In non-OO languages, it’s less natural to associate a mutex with a piece of data. If the shared data is passed by pointer, then the mutex must also be passed by pointer. The mutex-pointer must accompany the data-pointer.
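A minimal java illustration of DD-through-CC; the data is private, so every access path runs through the critical section:

class Counter {
    private long n;                             // the shared mutable Data (the DD view)
    private final Object mu = new Object();
    void inc() { synchronized (mu) { n++; } }   // the critical section (the CC view)
    long get() { synchronized (mu) { return n; } }
}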

#1 motivation/justification for threading in GUI

In most multi-threading domains, the driving force is one of throughput, latency or parallelism. Yet the most prevalent motivation goes “processor clock speed is reaching a plateau, so the only way ahead is multi-core. Without multi-threading, the multiple cores lie waste.” This is like scale-out vs scale-up. This is the resource “utilization” argument: any underutilized resource is a source of “guilt” and shame to some managers.

In GUI space, however, the real motivation is not “utilization” per se but Responsiveness.

I have seen many GUI apps that can't keep up with the live market data rate — latency.

I have seen many GUI apps that leave a long ugly trail when you move/resize a window.

Most GUI threading problems and techniques are related to Responsiveness, not performance or throughput. Even “latency” is not really the appropriate term: in latency-sensitive systems, latency is often measured in clock cycles and microseconds, sometimes sub-micro, but Responsiveness doesn't need that. Such latency techniques can lead to over-engineering.

##what bad things could happen to a thread in a pool

Do you ever wonder why threads in a thread pool can go bad and become unusable? Weblogic has long realized that. Hard to diagnose.

However, I don’t believe “voodoo” exists in a threading system. A stuck thread must be stuck in a useless loop or waiting for cpu, IO or memory. It won’t be stuck for “no reason”. Pretend to be a spy looking at the program counter inside any thread — why not moving? There are only a limited (not UNlimited) number of reasons. Here are various bad things that can happen to an innocent thread.

– in java, any uncaught exception can fall on a thread (like a rotten apple:). These are extremely common, so a thread in a pool should handle them /gracefully/ (see the wrapper sketch after this list).

– a thread may be seen doing useful work in normal conditions, but in an edge case get into an endless loop but doing no useful work — just wasting cpu
** Note endless loop is fine if the thread is doing useful work.

– divide by zero (perhaps terminating the thread)
– access wrong memory location and trigger a segfault. In C++ there are many pointer/memory-related errors.
– java thread can run out of memory too
– in java, a thread can be stopped due to the deprecated but supported stop() and suspend() methods. Such a thread might emerge in bad shape.
– starvation due to low thread priority.
– stack overflow, due to deep recursion (like a recursive equals() or hashCode()).

– thread could get stuck in a normal blocking operation like accept(), read(), write(), lock()/synchronized keyword, wait(), join(), long sleep. Many of these aren’t even interruptible.
** IO (disk or network) wait is notorious culprit. In this state, everything including those “kill -9” signals are ignored (see [[Unix Power Tools]]). Entire process may appear stuck in mud.
** my java thread book also says a thread blocked on a socket may not respond to interrupt.
– deadlock
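One standard defense against the first item, the uncaught exception: a minimal wrapper sketch (realTask and pool are hypothetical):

Runnable guarded = () -> {
    try {
        realTask.run();
    } catch (Throwable t) {
        t.printStackTrace(); // log and swallow the rotten apple; the pool thread survives
    }
};
pool.execute(guarded);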

mutex^condVar^countingSemaphore – 3 synchronization devices

I believe mutex is a primitive construct, hardware-wise, therefore elemental and low-level. It’s NOT built using condition variables or counting semaphores. In contrast,

– mutex is usually needed _inside_ counting Semaphore. See OS/2 and java. Also see solaris pthreads sem_wait() and sem_post() [1] as described on P93 [[Bil Lewis]]
– mutex is usually needed as guard _around_ Condition var. See java, pthreads [[Bil Lewis]] and boost thread.
** In fact, both work similarly. See the similarity between the wait-loops in [[Bil Lewis]].

Complicating the big 3 interdependency, when one of several “grabbing” threads is given the contended mutex, system may internally send signals, but that happens at a lower level than a condition var. Condition variable and Counting semaphore probably depends on the same low-level signaling. This low-level signaling construct is not an API construct and not one of our big 3.

counting semaphore is less fundamental than mutex and condition variables.
* java has built-in support for M and C but not S
* OS/2 doesn’t support counting semaphore. Note that for some time, pthreads, win32 and OS/2 were the 3 dominant thread implementations.

[1] pthreads sem_wait() means “grab a permit”, sem_post() means “return my permit to the pool”.

— Jargon clean-up —
There are 2 distinct meanings of “semaphore”+ some vague meanings.
1) Counting semaphore is the first and best known semaphore API.
2) Other types of semaphores were added, notably System V IPC semaphores.
x) In some discussions, “semaphores” vaguely mean “various synchronization variables” and includes mutex, condVar and countingSemaphores.

The counting semaphore concept is fundamental and elemental, but implementation-wise, it’s always constructed using mutex and condVar.

Some people chose to write their own counting semaphore using mutex, condVar and a counter

Q: why is “lock” not in our big 3?
A: “Lock” often means mutex, but is sometimes a wrapper on a mutex, as in boost.

— sysV semaphore and shared memory
Linux supports three types of IPC mechanisms — message queues, sysV IPC semaphores and shared memory. As the Camel book points out, those are the same 3 sysV IPC constructs.

Once the memory is shared, there are no checks on how the processes are using it. They must rely on other mechanisms, for example System V semaphores, to synchronize access to shared memory.

Live hardware threads ^ kernel threads – Barrel processor

Exec summary — in one cpu cycle a Barrel-processor machine (like T2000) can have
– 8 executing threads
– 32 Live hardware threads
– hundreds of kernel threads
– thousands of userland threads
———–
I worked for a while on the Sun T2000 server supporting 4 cpu threads per core. http://en.wikipedia.org/wiki/Barrel_processor explains

– Such a cpu-core does not allow execution of multiple instructions in one cycle, so “Simultaneous” is misleading.
– Each thread runs at roughly 1/4 the original CPU speed.

A single-tasking processor spends a lot of time idle, not doing anything useful, whenever a cache miss or pipeline stall occurs.

In the T2000 (8-core, 32 hardware threads), if you have 100 processes, then you have at least 100 kernel threads (software threads). Optimally, 32 of them would be Running as “live hardware-threads” since each is assigned to a cpu core. In one cycle, 8 of them are executing, while the other 24 are in pipeline inside the cpu core.

http://en.wikipedia.org/wiki/UltraSPARC_T1 sheds more light.

avoid notifyAll() in low latency apps

notifyAll() is unpopular in low-latency apps. For fear of lost signals, we ask the jvm to do a lot of useless work: we wake up everyone when we just need one of them woken up. But how do we get away with notify()? Here are 2 common conditions that should both be satisfied.

* All the threads in your target wait set must be identical, so any one can get up and finish the task.
* All are waiting on the same while() condition.

Imagine some threads are waiting for a blocking queue to become non-empty; other threads are waiting for non-full. If you notify(), the risk is that scheduler chooses one waiting thread, but the chosen one checks the while() condition to see it’s waiting for non-empty, but the signal is no-longer-full, so the signal is ignored and lost. Using jdk 5 conditionVar, you can simplify this kind of situation using multiple conditions bound to the same lock.
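Here is the textbook two-condition bounded buffer (close to the example in the Condition javadoc), showing how signal() becomes safe:

import java.util.concurrent.locks.*;

class BoundedBuffer {
    private final Lock lock = new ReentrantLock();
    private final Condition notFull  = lock.newCondition();
    private final Condition notEmpty = lock.newCondition();
    private final Object[] items = new Object[16];
    private int putIdx, takeIdx, count;

    void put(Object x) throws InterruptedException {
        lock.lock();
        try {
            while (count == items.length) notFull.await();
            items[putIdx] = x; putIdx = (putIdx + 1) % items.length; count++;
            notEmpty.signal();            // wake exactly one taker: no lost signal, no thundering herd
        } finally { lock.unlock(); }
    }

    Object take() throws InterruptedException {
        lock.lock();
        try {
            while (count == 0) notEmpty.await();
            Object x = items[takeIdx]; takeIdx = (takeIdx + 1) % items.length; count--;
            notFull.signal();             // wake exactly one putter
            return x;
        } finally { lock.unlock(); }
    }
}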

offload non-essential processing to helper threads

In many real time trading systems, a novice threading developer may decide to dispatch some non-essential logging[1] logic to a thread pool, without measuring the resulting latency. I don't always have time to profile such trivial code, so I am not sure if latency might be better if we simply execute the logging code in the current “request processing” thread.

However — here's the key point — even if threading doesn't help in this case, it's still important to know how to use threading in similar contexts.

I'd rather have a developer with threading know-how even though we don't need it now, than a developer uninitiated to the intricacies of threading.

If you are wondering why the fuss over non-essentials, here's an example — In some of the busiest, most time-critical trading engines in the exchanges/ECNs, there's just a single thread sorting/matching quotes for a given symbol. This thread has to be lean and mean. All non-essentials must be delegated. We just mentioned a naive Design #1.

Design #2) Another common technique is messaging. Send a message with all the details, and leave the rest to the receivers. I feel this is slower than threading. Serialization, data copy, network bottleneck, socket handshake. Fastest thread should be network-free and disk-free.

Design #3) Another producer/consumer design uses a pre-allocated (circular) array of Requests. The main thread just sets a flag to mark a request as executed and moves on. Consumer threads clean it up and remove it from the array. Typically we have a single producer and a configurable bunch of consumers, just like a thread pool. If there are enough consumer threads, the buffer [2] will be almost empty. I feel it helps if the buffer fits into the CPU cache.

Note all the designs are producer/consumer.

[1] or other non-essential work such as serialization, persistence, replication, comparison, verification, DB query, or web download. See http://bigblog.tanbin.com/2010/10/gemfire-write-behind-and-gateway-queue.html

[2] You can call it a task queue, but physically it's really a simple memory buffer, the simpler the faster.

producer/consumer – fundamental to MT(+MOM/trading) apps

I hardly find any non-trivial threading app without some form of producer/consumer.

#1) Async — always requires a buffer in a P/C pattern. See http://bigblog.tanbin.com/2011/05/asynchronous-always-requires-buffer-and.html

# dispatcher/worker — architectures use a task-queue in a P/C pattern.
# thread pools — all (java or C++) use a task-queue in a P/C pattern.
# Swing — the EDT comes with an event queue in a P/C pattern
# mutex — both producer and consumer need write access to a shared object, size 1 or higher. Always needs a mutex.
# condVar — is required in most cases.

slist iteration with concurrent add()

Mithun,

Thanks for the interview question —

Q: Can you iterate over a LinkedList while another thread adds or removes elements? How might you make it thread safe?

If a long linked list is super busy with a lot of add/remove, then we may never finish iterating — we get ConcurrentModificationException every time.

I feel solution depends on length of the list, how critical is iteration vs add()/remove(), performance requirement etc.

I feel your initial idea is plausible — make a copy of the entire linked list before iterating. But when we make a copy, we need to iterate — chicken and egg problem.

How about iteration-free copying? Get a node, copy it, then read the “nextNode” pointer to reach the next node. This way I implement my own loop. I must handle concurrent modification. Not easy, but at least I don’t get exceptions. In contrast, a similar home-made solution is harder on vector/hashtable/deque, as their reallocation is more complex.

I also like your 2nd idea of a custom linked list, where add() operations are customized. This is fine if iteration can finish quickly or add() is infrequent. However, If add()/remove() are time-critical, then we need other solutions.

If we need to iterate a long linked list that’s super busy, then the __design__ is probably wrong. Why not use a database?

Alternatively, split the long linked list into shorter lists. Easier to iterate.
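Yet another option in java: the weakly consistent iterators of the java.util.concurrent collections never throw ConcurrentModificationException. A sketch (process() is hypothetical); whether it fits depends on whether you can tolerate missing concurrent add()s:

import java.util.Queue;
import java.util.concurrent.ConcurrentLinkedQueue;

Queue<String> q = new ConcurrentLinkedQueue<>();
for (String s : q) { // weakly consistent: never throws ConcurrentModificationException
    process(s);      // may miss elements added after the iteration began
}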

uncontended lock grab !! as cheap as unsynchronized #pthread

Some java authorities claim that uncontended lock acquisition is so cheap that it's almost “free”, comparable to no locking at all. I see opposite evidence:

Evidence: java Vector and HashTable are superseded by no-lock classes. Even if a HashTable object is never shared by 2 threads, the uncontended synchronization is still considered expensive.

Evidence: pthreads standard recognizes that synchronized versions of getc() will result in “serious” performance hits for single-threaded apps, and added 4 unsynchronized functions —

getc_unlocked
getchar_unlocked — same as getc_unlocked(stdin)
putchar_unlocked
putc_unlocked

These are like the ArrayList and HashMap in java.

consistency and liveness hazards – largely theoretical

The primary tension/conflict in multi-threading — the twin challenges of consistency vs liveness — is seldom seen as a practical challenge. This challenge is supposed to be the only show in town, but it has an empty audience. After years of practice, I now see a few reasons.

Reason: Developers stick to simple or (a tiny number of) well-tested concurrency constructs. For example, swing developers use lots of threads, often without knowing. Threads are indispensable to servlets and JDBC. Even more important roles are given to threads in the MOM arena, though no threading knowledge is required.

Reason: Creating a new concurrency construct is considered tricky, unproven, hard to sell, unjustified, low ROI, so few developers try.

Reason: When we do try, we seldom confront the twin challenge. We focus on getting the pieces to work together.

Reason: When concurrency bugs do exist, they stay dormant practically “forever”. I have never seen a test on Wall St that actually revealed a concurrency bug. Concurrency Bugs are discovered by code review.

immutables and their total costs of ownership

I always maintain that one of the practical cures for deadlock is immutable object, but there are several costs.

* every time you deep-clone, you must ensure all the objects cloned are immutable.
* every method and constructor taking in an object arg must deep-clone that object.
* every method returning an object must deep-clone the object. Note Java strings are inherently immutable
* deep-cloning usually require massive chain-reaction. If MyImm class has a pointer field of MyImm, then the deep-cloning will clone a linked list. Immutability means object state immutable. If your definition of “State” includes that MyImm field, then the entire linked list is part of the state of the head node! Cloning could lead to stack overflow.

* It’s easy to overlook a detail, and unknowingly create an almost-immutable, with (possibly disastrous) consequences in production
* immutable classes are harder to extend, limiting reusability, flexibility and usefulness.
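A minimal sketch of a properly immutable java class, showing the defensive-copy cost at both boundaries:

import java.util.List;

final class Imm {
    private final int x;
    private final List<String> tags;
    Imm(int x, List<String> tags) {
        this.x = x;
        this.tags = List.copyOf(tags); // defensive copy on the way in
    }
    List<String> tags() { return tags; } // List.copyOf result is unmodifiable; no clone needed on the way out
    int x() { return x; }
}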

can’t avoid threads in socket, GUI, MOM…

I wrote about this topic in other posts but here’s a new angle. Forget about multicore hardware or low latency arms race. A few application domains were created (and inherently) multi-threaded. A single-threaded design would require a redesign, if at all feasible.

+ Domain – GUI (swing, wpf ..) — EDT is a super busy thread. Any long-running task would need a helper thread.
+ Domain – blocking socket — accept() would get the main thread blocked. read() and write() would also block a worker thread.
+ Domain – nonblocking socket — I believe you still need multiple threads. The entire design of sockets presumes multi-threading.
+ Domain – MOM — onMsg() would get one dedicated thread blocked.
+ Domain – DB server — 2 clients can keep their transactional sessions open, and would need 2 dedicated threads. Contrast —
– Domain – stateless web server — which need not maintain client state. In this case, a single thread can service many users off a request queue (at least in theory).
– Domain – multi-user Operating System — at least one session per user, but the OS usually fork-exec a new process, without multi-threading

real Protagonist in an asynchronous system

When discussing an asynchronous system, folks say “this system” will send a request and go on with other work, and will handle the response when it comes in. When we say that, our finger is pointing at some code Modules — functions, methods, classes — business logic. By “this system” we implicitly refer to that code, not data. That code has life, intelligence and behavior, and looks like the protagonist of the show.

However, a code module must run on a thread, and such a thread can’t possibly “react” to or “handle” an event; the event has to be handled on another thread. Therefore the code we are looking at technically can’t react to events. The code handling them is another code module, unrelated to ours. So what’s “this system”?

A: a daemon…. Well not a bad answer, but I prefer …
A: the cache. Possibly in the form of a disk/memory database, or in the form of coherence/gemfire, or a lowly hashmap, or more generally a buffer (http://bigblog.tanbin.com/2011/05/asynchronous-always-requires-buffer-and.html). In every asynchronous design, I have always been able to uncover a bunch of stateful objects, living in memory long enough for the event/response. It’s clear to me the cache is one eternal, universal, salient feature across all asynchronous systems.

The “cache” in this context can be POJO with absolute zero behavior, but often people put business logic around/in it —
* database triggers, stored procs
* coherence server-side modules
* gemfire cache listeners
* wrapper classes around POJO
* good old object-oriented techniques to attach logic to the cache
* more creative ways to attach logic to the cache

However, I feel it’s critically important to see through the smoke and the fluff and recognize that even the barebones POJO qualifies as the “this system” (as in the opening sentence). It qualifies better than anyone else.

Swing/WPF can have one thread displaying data according to some logic, and another thread (EDT?) responding to events. Again, there are stateful objects at the core.

observer can be MT or SingleT

Observer pattern can either go MT or stay ST ie single-threaded.  Both are common.

The textbook showcase implementation is actually single threaded. The event happens on the same thread as the notifications, often looping through the list of observers. That’s actually a real world industrial-strength implementation.

(—> Notably, the single thread can span processes, where one method blocks while another method runs in an RMI server. This is the so-called synchronous, blocking mode.)

Other implementations are MT — event origination and notification happen on different threads. As explained in the post on async and buffer ( http://bigblog.tanbin.com/2011/05/asynchronous-always-requires-buffer-and.html ), we need a buffer to hold the event objects, as the observer could suffer any amount of latency.

Condition^Lock objects in pthreads and java

I feel Windows has a different implementation altogether…

Pthreads (and many threading systems) employs a classic 2-object design for thread synchronization:

1) a mutex lock object. Think of it as a bathroom.
2) a condition object aka condition variable. I call it a waiting room.

In Pthreads, Condition object offers

* wait() and timed_wait() — similar to Object.wait()
* signal() and broadcast() — similar to Object.notify() / notifyAll()

In POSIX, if Thread A wants to call these methods on the condition object, then A must hold the mutex. Therefore the 2 objects are tightly coupled.

Java 1.5 provides Lock and Condition objects. 1.5 lets you have 2 “waiting rooms” tied to a single mutex.

  Condition cond1 = myLock.newCondition();
  Condition cond2 = myLock.newCondition();

Now you can target your notification to a subset of waiting threads, rather than indiscriminate notifyAll()

In java 1.4, you grab myMutex and call myMutex.wait(). Single-object design.

a simple cache server staying alive

Main server threads simply keep running, usually blocked in socket.accept() or rmi server mode. The thread below is responsible for shutting down the JVM at 23:00. As a redundant measure, there’s also an Autosys stop job to shut down the JVM.

public void run() {
      int hour = java.time.LocalTime.now().getHour();
      if (hour >= 23 || hour < 1) { return; } // already inside the shutdown window; do nothing
      while (true) {
            try {
                  Thread.sleep(10 * 60 * 1000); // wake up every 10 minutes
                  hour = java.time.LocalTime.now().getHour(); // refresh hour from the current system time
                  if (hour >= 23 || hour < 1) { System.exit(0); } // 23:00 reached; shut down the JVM
            } catch (Exception ex) {} // ignore and keep looping
      }
}

callback object – called by which thread@@

(See also post on generic callback in generic OO)

A novice developer first encounters callback functions. 2 years later she realizes a callback is actually a function pointer i.e. the address of a function. For the callback to be registered with some Hollywood, we must use and understand function pointers.

Then the developer encounters the (anonymous) callback object, typically implementing a callback interface. This trivial idiom is extremely widespread but non-trivial in a concurrent OO context. Examples —

– the simplest observer in [[head first design patterns]]
– swing event listener  — always Objects
– onMsg / onMessage listener
– gemfire callback object and listener object – i can never remember the difference

In the simplest form, a callback object is a stateless wrapper over a function pointer.

Q: called on which call stack?
A: Long answer. The thread creating the callback object is not relevant. The thread registering the callback object with Hollywood is relevant but not the answer. Once the callback object’s address (not the naked function address, thanks to the wrapper) is saved in the framework, many threads can access it. Often it’s an internal framework thread.
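A bare-bones sketch of the answer. “Hollywood” below is my made-up framework class, purely for illustration; note the thread that registers the callback and the thread that invokes it are different.

public class Hollywood { // made-up framework class
    private volatile Runnable callback; // the saved address of the callback object

    public void register(Runnable cb) { // called on whichever thread the app likes
        this.callback = cb;
    }

    public void start() {
        // an internal framework thread invokes the callback, not the thread
        // that created or registered the callback object
        new Thread(() -> {
            Runnable cb = callback;
            if (cb != null)
                cb.run();
        }, "framework-internal").start();
    }

    public static void main(String[] args) {
        Hollywood framework = new Hollywood();
        framework.register(() -> System.out.println(
                "callback running on " + Thread.currentThread().getName()));
        framework.start(); // prints "callback running on framework-internal"
    }
}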

thread detach — boost ^ java

In most (if not all) systems, a thread object is a “poor handle” on a real world thread.

in java, if your code loses all references to the thread object, you can’t join or interrupt the thread, or test if it holds a particular lock. You have to wait for it to finish or pull the plug with System.exit(). (A running thread’s Thread object is never actually garbage collected; the runtime itself keeps it reachable.)

The same scenario exists in boost thread. Additionally, boost lets you explicitly call myThread.detach(), to decouple the myThread object from the real-world thread.

Q: after detach(), what's the thread object?
A: it represents not-a-thread

jargon – grabbing, waiting, blocked, sleeping

A few types of thread suspension. Guess which methods can throw InterruptedException!

1) sleeping — sleep()
2) waiting — (implicitly for another thread) wait(), join(), invokeAndWait() …
3) blocked — socket.accept() …
4) grabbing — blocked while acquiring a lock. Note when a wait() ends due to a notification/exception/timeout, the thread must grab the lock again before wait() returns.

I would say “the thread is grabbing the lock” or “my thread is blocked by a lock”
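Answering the quiz above with a compilable sketch: sleep(), join() and wait() declare InterruptedException, while grabbing a monitor and socket accept() do not.

import java.io.IOException;
import java.net.ServerSocket;
import java.net.SocketTimeoutException;

public class SuspensionKinds {
    public static void main(String[] args) throws InterruptedException, IOException {
        Object lock = new Object();
        Thread worker = new Thread(() -> { });
        worker.start();

        Thread.sleep(100);      // 1) sleeping: throws InterruptedException
        worker.join();          // 2) waiting: throws InterruptedException
        synchronized (lock) {   // 4) grabbing: NOT interruptible, no checked exception
            lock.wait(100);     // 2) waiting: throws InterruptedException; releases the lock meanwhile
        }
        try (ServerSocket ss = new ServerSocket(0)) {
            ss.setSoTimeout(100);
            ss.accept();        // 3) blocked: throws IOException, never InterruptedException
        } catch (SocketTimeoutException expected) {
        }
    }
}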

cyclic barrier – boost thread ^ java

A quick and memorable analogy for a boost thread barrier object (an instance of the barrier CLASS): think of such an object as a thread-safe collection (say a vector) of 3 items. 3 threads all have a pointer to the same collection.

Thread A calls myBarrier.wait() inside some function1, to remove one item, and blocks
Thread B does the same
Thread C does the same in some function2. Now the collection is empty, all 3 threads join the Eligible club (as described in the post on Eligible^Running)

java offers CyclicBarrier.
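A minimal java sketch of the same analogy using CyclicBarrier (the printed messages are mine):

import java.util.concurrent.CyclicBarrier;

public class BarrierDemo {
    public static void main(String[] args) {
        // the "thread-safe collection of 3 items"; all threads share a reference to it
        CyclicBarrier myBarrier = new CyclicBarrier(3,
                () -> System.out.println("collection empty: all 3 threads join the Eligible club"));
        for (int i = 0; i < 3; i++) {
            new Thread(() -> {
                try {
                    System.out.println(Thread.currentThread().getName() + " removes one item and blocks");
                    myBarrier.await(); // blocks until all 3 threads have arrived
                } catch (Exception e) { // InterruptedException / BrokenBarrierException
                }
            }).start();
        }
    }
}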

break the 4 deadlock conditions

My long-time java friend told me lock ordering and lock timeout are the 2 standard solutions, but here are additional suggestions directly or indirectly addressing deadlocks.

(“[top x]” means one of my favorite x  techniques.)

For want of a better word, “stuck” is my general term meaning “no-progress” and includes
– lock grab,
– Object.wait(), waking up and grabbing lock
– join(),
– socket accept(),
– long sleep(),
– endless loop,
– unbounded *recursion*

Now the solutions.
–to provide preemption
* [top 3] remain interruptible while getting “stuck”
** even if a recursion/loop calls no blocking method, the thread can still check for the interrupt status from time to time and exit if necessary. see http://download.oracle.com/javase/tutorial/essential/concurrency/interrupt.html
* tryInterruptibly(). Now, if thread 1 knows another thread is holding a mutex it can forcibly grab it, but how does it know which thread?

–to break incremental acquisition

*[top 2] have a timeout while getting “stuck” — a widespread implementation
** example — in an endless loop, check how long we have been stuck and exit if necessary.
** example — lock timeout — a thread tries to acquire a lock within a specified period. Upon timeout, it releases all its locks and waits for a period. For example, release the 1st chopstick if the 2nd chopstick remains unavailable for x seconds. Achievable with tryLock()? Yes, P279 [[java concurrency in practice]]. I believe tryLock() can tell you the 2nd chopstick is unavailable and give you a chance to give up the 1st chopstick held. See P69 [[pthreads]] on deadlock avoidance: release all locks voluntarily. (A concrete sketch follows at the end of this section.)
* open call? impractical

–to break circular wait
* lockInterruptibly() of the Lock interface. It lets you try to acquire a lock while remaining responsive to interruption.
* reentrant/recursive lock. By default pthreads locks aren’t recursive, so a single thread can deadlock itself.
* tryLock()
* [top 2] globally fixed resource acquisition order
* if you identify that 3 threads form a wait-for circle, then reverse one of the threads’ acquisition order
* a finally block to release the lock when using Lock objects. “synchronized” doesn’t need this.
* Release locks asap. minimize lock scope.
* [top 5] Make all methods finite. Avoid using locks while inside infinite loops or unbounded recursions.

–to break the need for mutex resources
* immutables. no need to lock anything to access them. CopyOnWrite collections are thread-safe because each internal snapshot array is immutable.
* ThreadLocal
* copy On Write
* copy the data for each thread, and sync up later
* replacing sync with atomic variables? both options can be criticized as risky
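As promised under the lock-timeout bullet, here is a minimal chopstick sketch (class and variable names are mine) using tryLock() with a timeout, voluntary release, and random backoff:

import java.util.Random;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.locks.ReentrantLock;

public class ChopstickBackoff {
    static final ReentrantLock chopstick1 = new ReentrantLock();
    static final ReentrantLock chopstick2 = new ReentrantLock();
    static final Random rand = new Random();

    // grab both chopsticks, or give up the 1st and retry after a random backoff;
    // no thread ever holds one chopstick while blocking indefinitely on the other
    static void eat() throws InterruptedException {
        while (true) {
            chopstick1.lockInterruptibly();
            try {
                if (chopstick2.tryLock(1, TimeUnit.SECONDS)) {
                    try {
                        System.out.println("eating");
                        return;
                    } finally {
                        chopstick2.unlock();
                    }
                }
            } finally {
                chopstick1.unlock(); // release the 1st chopstick voluntarily
            }
            Thread.sleep(rand.nextInt(100)); // random backoff before retrying
        }
    }

    public static void main(String[] args) throws InterruptedException {
        eat(); // trivially succeeds with one thread; the pattern matters with 2+ philosophers
    }
}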

really simple scenario of data races

Q: Suppose we comment out increment() in class C below. Is the class thread-safe?
A: I feel yes. Imagine 2 threads calling the method on the same object in the same clock cycle, on 2 processors, setting it to 5 and 6. What’s the new value? I believe it’s one of the 2. This is a DataRace (http://java.sun.com/docs/books/jls/third_edition/html/memory.html#61871), but object state is still valid.

Q: Suppose we comment out set(), is it thread-safe?
A: thread-unsafe, due to lost-updates if 2 threads call increment() simultaneously.

Q: Suppose we remove the volatile keyword and comment out increment(), and keep only getter/setter ?
A: thread-unsafe. Threads can cache the new this.counter value in per-thread registers for many clock cycles (ie forever), so the new values are invisible to other threads indefinitely.

final class C {
    private volatile int counter;

    public void increment() {
        this.counter++; // read-modify-write: NOT atomic, despite volatile
    }

    public void set(int newVal) {
        this.counter = newVal; // single volatile write: atomic and immediately visible
    }

    public int get() {
        return this.counter; // single volatile read
    }
}

mutable fields in concurrency — private volatile !! enough

say your class has an int counter field. “this.counter++” translates to 3 instructions — LOAD, increment, STORE. The executing thread could be preempted after the LOAD and before the STORE, and stay preempted for many clock cycles. Many things could happen on a 128-core machine in the meantime.

If not careful, another thread could see, and overwrite based on, the old value. A private field won’t help. Volatile won’t help. You must use a lock to serialize all read/write access to this memory location.

By the way, remember volatile adds a special atomicity — to long/double field LOAD and STORE. Specifically,
* composite operations like increment are never made atomic by the volatile keyword
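The standard fix, as a sketch: swap the volatile int for an AtomicInteger, whose incrementAndGet() performs the whole LOAD/increment/STORE as one atomic step. AtomicC below is my rewrite of class C from the previous section.

import java.util.concurrent.atomic.AtomicInteger;

final class AtomicC {
    private final AtomicInteger counter = new AtomicInteger();

    public void increment() {
        counter.incrementAndGet(); // atomic read-modify-write, no lock needed
    }

    public void set(int newVal) {
        counter.set(newVal); // memory effects of a volatile write
    }

    public int get() {
        return counter.get(); // memory effects of a volatile read
    }
}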

statement reorder for atomic variables

see also [[effModernC++]]

http://www.cs.umd.edu/~pugh/java/memoryModel/jsr-133-faq.html#volatile

Atomic package javadoc says:

The memory effects for accesses and updates of atomics generally follow the rules for volatiles, as stated in  The Java Language Specification, Third Edition (17.4 Memory Model):

* get has the memory effects of reading a volatile variable.
* set has the memory effects of writing (assigning) a volatile variable.
* lazySet … is weaker
* compareAndSet and all other read-and-update operations such as getAndIncrement have the memory effects of both reading and writing volatile variables.
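For intuition, getAndIncrement() behaves like this hand-rolled CAS retry loop (a sketch of the behavior, not the JDK’s actual implementation):

import java.util.concurrent.atomic.AtomicInteger;

public class CasRetryLoop {
    // behaves like AtomicInteger.getAndIncrement(); each iteration is a
    // volatile read followed by an atomic compare-and-swap
    static int getAndIncrement(AtomicInteger atomic) {
        while (true) {
            int current = atomic.get();                     // volatile-read effect
            if (atomic.compareAndSet(current, current + 1)) // volatile read + write effect
                return current;
            // another thread raced us between get() and the CAS: retry
        }
    }

    public static void main(String[] args) {
        AtomicInteger counter = new AtomicInteger(5);
        System.out.println(getAndIncrement(counter)); // prints 5
        System.out.println(counter.get());            // prints 6
    }
}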

deadlock involving 1 lock (again):suspend/resume methods

If your app uses just one lock, you can still get a deadlock. ThreadM (master) starts ThreadS (slave). S acquires a lock. M calls the suspend method on S, then tries to acquire the lock, but the lock is held by the suspended S. S is waiting to be resumed by M — deadlock.

Now the 4 conditions of deadlock:
* wait-for cycle? indeed
* mutually exclusive resource? Just one resource
* incremental acquisition? indeed
* non-Preemptive? Indeed. But if there’s a third thread, it can restart ThreadS.

In multithreaded apps, I feel single-lock deadlock is very common. Here’s another example — “S gets a lock, and waits for ConditionA. Only M can create ConditionA, but M is waiting for the lock.”

Here, ConditionA can be many things, including a Boolean flag, a JMS message, a file, a count to reach a threshold, a data condition in DB, a UI event where UI depends partly on ThreadM.
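Here is a minimal sketch of the second example, with ConditionA as a Boolean flag (all names mine). S holds the only lock while spinning on the flag; M needs the same lock to set it, so the program hangs forever, by design:

public class SingleLockDeadlock {
    private static final Object lock = new Object();
    private static volatile boolean conditionA = false;

    public static void main(String[] args) throws InterruptedException {
        Thread slave = new Thread(() -> {
            synchronized (lock) { // S acquires the only lock...
                while (!conditionA) {
                    // ...and spins here, never releasing it
                }
            }
        }, "ThreadS");
        slave.start();
        Thread.sleep(100);    // let S grab the lock first
        synchronized (lock) { // M (main) blocks here forever: deadlock
            conditionA = true;
        }
    }
}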

deadlock involving a single lock and 2 threads

Deadlock is seldom found in real projects. Even if an observed problem is actually due to deadlock, the problem is usually resolved without the deadlock being identified. Most managers and teams are happy to see a problem disappear and not motivated to uncover root cause(s).

The classic (2-lock) deadlock situation is usually hard to reproduce, which makes it a rare but possibly catastrophic natural disaster. I’ve never seen one in the wild; if a deadlock were consistently reproducible, it would be much easier to resolve.

However, if you are serious about deadlock problems in real projects, you can’t ignore one of the more common deadlock patterns — 2 threads deadlock each other with a single lock.

I’d /go out on a limb/ to claim that this pattern is arguably more common than the 2-lock textbook pattern. It may account for up to 90% of all uncovered deadlock cases.

P90 [[eff c#]] also shows a single-lock deadlock.

recursive tree walk -> iterative #barcap

Suppose there’s a deep directory tree (no hard/soft links) with millions of files and directories. Write a tree walker to list files, excluding directories. For memory efficiency, convert the recursive algorithm into an iterative one.

I feel recursive thinking is far more powerful than iterative, because it solves problems otherwise unsolvable. Most classic algo problems fall into this category. I always try to train myself to think recursively. In this regard, it’s good to study and compare iterative algorithms. What kind of recursive algo can be converted to iterative? I feel only simple ones.

Inspired by the article in the link, here are my rules for converting recursion to iteration

Rule: one push (on the stack) per call of the recursive func.

Rule: a stack to hold the args of the recursive calls, in the correct order. Assuming m(1) -> m(2) -> m(3), we should push 1, 2, 3 on the stack. Ideally, the order in our stack should match the recursive call stack.

Rule: since it’s a stack, you remove from the top and add to the top too.

Rule: as the first step when you re-enter the while(not_empty) loop, you simulate a new recursive call by _popping_ the stack. The popped item is no longer a “pending” todo item.

Next Question: how to parallelize? You don’t know how deep each sub-tree is.


import java.io.File;
import java.util.Arrays;
import java.util.LinkedList;

public class LowLatencyTreeWalker {

    static void factorial() { // a simple iterative solution to compute factorials
        LinkedList<Integer> envelopesToOpen = new LinkedList<>();
        envelopesToOpen.add(6);
        long ret = 1;
        while (!envelopesToOpen.isEmpty()) {
            int currNum = envelopesToOpen.removeLast();
            System.out.println(ret = ret * currNum);
            if (currNum > 1)
                envelopesToOpen.add(currNum - 1);
        }
    }

    static void walkTree() { // the harder iterative algorithm for tree-walk
        LinkedList<File> envelopesToOpen = new LinkedList<>();
        envelopesToOpen.add(new File("C:/"));
        while (!envelopesToOpen.isEmpty()) {
            // once we open an envelope, we remove it from the todo stack
            File f = envelopesToOpen.removeLast();
            File[] children = f.listFiles(); // null for an ordinary file
            if (children == null)
                System.out.println(f.getAbsolutePath()); // a file: print it
            else
                envelopesToOpen.addAll(Arrays.asList(children)); // a directory: defer its children
        }
    }
}

2 threads sharing a stack variable — c++^java

In C++, primitives and class objects on the stack or heap are often passed by reference. It’s possible to create a char stackVar and pass its address to another thread; 2 threads can then write to the same stack memory location simultaneously. In contrast, Java’s primitive stack variables are thread-confined.

It’s important to make sure the stack object doesn’t go out of scope too early, creating a dangling pointer.

Possible in java? I don’t think so. A Java stack frame holds only primitives and references, and Java primitives are never passed by reference. Suppose thread A runs a method that creates an int stackVar on its stack, then starts a thread B that reads stackVar. It looks like pass-by-ref but is actually pass-by-value behind the scenes; that’s why java requires that local int to be _final_.
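A tiny java sketch of that claim: the local int is captured by value, so thread B reads its own private copy, never thread A’s stack slot.

public class StackCapture {
    public static void main(String[] args) throws InterruptedException {
        final int stackVar = 42; // must be (effectively) final to be captured
        Thread b = new Thread(() -> {
            // thread B reads a private copy of the value, not main's stack slot
            System.out.println("thread B sees " + stackVar);
        });
        b.start();
        b.join();
        // stackVar cannot be reassigned here, so B's copy can never go stale
    }
}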

generic callback in a concurrent java/c++ sys

(Note OO is a development-time concept. Once compiled, an OO program looks no different from a C program with tons of structs and function pointers.)

Most financial systems today are concurrent OO systems. The lowly callback pattern is extremely common but actually requires a thorough understanding of both thread and object interaction. Let’s examine some. Note just about every player in the callback scenarios is a pointer, including function pointers.

In the simplest 1-thread case, an observer pointee registers interest with a subject object, by giving its own address. After subject state changes in Thr 1, the same thread loops through the list of observers and invokes the callback method on each observer. Note the observer object must not be deallocated.

For onMsg() in tibrv or onMessage() in JMS, there’s a remote “upstream” thread in the message broker’s PID/runtime, and there’s a separate downstream thread in the listener’s PID. I believe this thread starts out blocking for messages. Upon receiving a msg, the runtime somehow wakes up this blocking thread and this thread invokes onMsg() on the listener pointee. Meanwhile what’s the remote upstream thread doing? Simplest design is synchronous — upstream thread waits for the listener thread to finish. This means the upstream thread can only dispatch the message to remote listeners sequentially. I feel it’s much better to get the upstream thread to return right away after sending the one-way-message[3] (see Doug Lea’s book) so this thread can contact the 2nd listener in the list. Rewind to registration time — Any thread could register the listener with “Hollywood”. This thread could terminate immediately but the listener pointee must live.

[3] in classic Reo, we actually send one-way messages to the market-facing gateway. We get the status message on a separate thread via onMessage().

In C/C++, you can also register into “Hollywood” a ptr to a free func. No object, usually no state (unless the free func uses stateful static objects). Free functions are free of destruction. Hollywood could save the ptr in any collection. When there’s an event, Hollywood calls us back, in any thread.

multi-threading with onMessage()

Update — http://outatime.wordpress.com/2007/12/06/jms-patterns-with-activemq/ is another tutorial.

http://download.oracle.com/javaee/1.4/api/javax/jms/Session.html says -- A JMS Session is a single threaded context for producing and consuming messages. Once a connection has been started, any session with a registered message listener(s) is *dedicated* to the thread [1] of control that delivers messages to it. It is erroneous for client code to use this session or any of its constituent objects from another thread of control. The only exception to this is the use of the session or connection close method.

In other words, other threads should not touch this session object or any of its constituent objects.

[1] I think such a session is used by one thread only, but who creates and starts that thread? If I write the jms listener app, then it has to be started by my app. [[java enterprise in a nutshell]] P329 has sample code but looks buggy. Now I think javax.jms.Connection.start() is the answer. The API mentions “Starts delivery of *incoming* messages only”, not outgoing! This method is part of a process that
* creates a lasting connection
* creates the sessionSSS in memory (multiple-session/one-connection)
* perhaps starts the thread that will call onMessage()

——– http://onjava.com/lpt/a/951 says:
Sessions and Threading

The Chat application uses a separate session for the publisher and subscriber, pubSession and subSession, respectively. This is due to a threading restriction imposed by JMS. According to the JMS specification, a session may not be operated on by more than one thread at a time. In our example, two threads of control are active: the default main thread of the Chat application and the thread that invokes the onMessage( ) handler. The thread that invokes the onMessage( ) handler is owned by the JMS provider(??). Since the invocation of the onMessage( ) handler is asynchronous, it could be called while the main thread is publishing a message in the writeMessage( ) method. If both the publisher and subscriber had been created by the same session, the two threads could operate on these methods at the same time; in effect, they could operate on the same TopicSession concurrently — a condition that is prohibited.
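Putting the quoted advice into code: a hedged sketch assuming an ActiveMQ broker at the default localhost URL. ActiveMQConnectionFactory is provider-specific; any JMS provider’s factory plays the same role.

import javax.jms.*;
import org.apache.activemq.ActiveMQConnectionFactory; // assumption: ActiveMQ client jar on the classpath

public class TwoSessionChat {
    public static void main(String[] args) throws JMSException {
        // assumption: a broker is listening at this URL
        ConnectionFactory factory = new ActiveMQConnectionFactory("tcp://localhost:61616");
        Connection conn = factory.createConnection();
        // one session per thread, per the spec restriction quoted above
        Session pubSession = conn.createSession(false, Session.AUTO_ACKNOWLEDGE); // used by main thread
        Session subSession = conn.createSession(false, Session.AUTO_ACKNOWLEDGE); // used by delivery thread
        Topic topic = pubSession.createTopic("chat");
        MessageProducer producer = pubSession.createProducer(topic);
        MessageConsumer consumer = subSession.createConsumer(topic);
        consumer.setMessageListener(msg ->   // onMessage() runs on the provider's delivery thread
                System.out.println("got " + msg));
        conn.start(); // starts delivery of *incoming* messages
        producer.send(pubSession.createTextMessage("hello")); // main thread publishes concurrently
    }
}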

asynchronous: meaning…@@

When people talk about async, they could mean a few things.

Meaning — non-blocking thread. HTTP interaction is not async because the browser thread blocks.
* eg: email vs browser
* eg: http://search.cpan.org/~mewp/sybperl-2.18/pod/sybperl.pod async query

Meaning — initiating thread *returns* immediately, before the task is completed by another thread.

Meaning — no immediate response; less stringent demand on response time. An HTTP server needs to service the client right away since the client is blocking. As a web developer I constantly worry about round-trip duration.

Meaning — delayed response. In any design where a response is provided by a different thread, the requester generally doesn’t need an immediate response. The requester thread goes on with its business.
* eg: email, MQ, producer/consumer
* eg: jms onMessage() vs poll, but RV is purely async
* eg: acknowledgement messages in zed, all-http

You can (but i won’t) think of these “meanings” as “features” of async.
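The second meaning (initiating thread returns immediately) in a minimal java sketch (names mine):

import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

public class AsyncMeanings {
    public static void main(String[] args) throws Exception {
        ExecutorService pool = Executors.newSingleThreadExecutor();
        // submit() returns immediately; another thread completes the task
        Future<String> reply = pool.submit(() -> {
            Thread.sleep(500); // simulate a slow query
            return "delayed response";
        });
        System.out.println("main thread goes on with its business");
        System.out.println(reply.get()); // only now do we block for the response
        pool.shutdown();
    }
}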

main thread early exit — java^c++

in java, the main thread can “feel free” to exit if another non-daemon thread keeps the entire JVM alive. Not so in c++. [[c++cookbook]] says

when the operating system destroys a process, all of its child threads go with it, whether they’re done or not. Without the call to join(), main() doesn’t wait for its child thread: it exits, and the operating system thread is destroyed.

This assumes a 1:1 native thread model, so each operating system thread is a kernel thread. When main() returns, the runtime terminates the entire process, and every kernel thread in it dies.
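The java side of the contrast, as a sketch: main() returns immediately, yet the JVM stays alive for the non-daemon worker.

public class MainExitsEarly {
    public static void main(String[] args) {
        Thread worker = new Thread(() -> {
            try {
                Thread.sleep(2000);
            } catch (InterruptedException e) {
                return;
            }
            System.out.println("worker still running, 2s after main returned");
        }); // non-daemon by default, so it keeps the JVM alive
        worker.start();
        System.out.println("main thread exits now, without join()");
        // in c++, returning from main() here would tear down the child thread
    }
}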

each method call belongs to a thread^object

a java thread is essentially a call stack. Methods on the stack can belong to different objects or (in the case of static methods) classes.

Each method call in a jvm
1) belongs to a thread (same thread as the caller method) AND
2) it also belongs to an obj/class.
3) By the way, it often has a caller obj.

WHO (which obj, not which class) calls WHOSE (which obj’s or class’s) method in WHICH thread largely defines the multi-threading architecture. This interaction can be complex but there are some patterns, esp. in GUI

Simple deadlock demo


package com.test1;

/**
 * showcasing Thread.yield(), Thread.join(), anon class
 */
public class DeadlockSimpleDemo {
    public static void main(String[] args) {
        Thread thr1 = startNewThread("lockA", "lockB");
        Thread thr2 = startNewThread("lockB", "lockA"); // identical string literals are the same (interned) objects
        // thr1.join();
        // thr2.join();
    }

    private static Thread startNewThread(final String lock1, final String lock2) {
        Thread ret = new Thread() {
            @Override
            public void run() {
                synchronized (lock1) {
                    System.out.println(currentThread() + lock1 + " acquired, grabbing " + lock2);
                    yield(); // needed to reliably produce the deadlock
                    synchronized (lock2) {
                        System.out.println(currentThread() + lock1 + " acquired, -> then acquired " + lock2);
                    }
                }
                System.out.println(currentThread() + "end of run()");
            }
        };
        ret.start();
        return ret;
    }
}