- recursive/reentrant lock is more permissive on the owner thread
- reader-writer lock is more permissive on the reader threads
- try-lock doesn’t block
- a counting semaphore is more permissive until we use up all permits
Most advanced locks help with liveness (avoiding deadlock, etc.) and efficiency.
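To make two of the bullets above concrete, here is a minimal C++ sketch. The function names are my own, made up for illustration. It shows a recursive mutex being re-locked by its owner thread, and try_lock returning at once when another thread holds the mutex.

```cpp
#include <cassert>
#include <mutex>
#include <thread>

// A recursive mutex lets the owner thread lock again without deadlocking.
int nested_lock_depth() {
    std::recursive_mutex rm;
    int depth = 0;
    std::lock_guard<std::recursive_mutex> outer(rm);
    ++depth;
    {
        std::lock_guard<std::recursive_mutex> inner(rm);  // same thread: allowed
        ++depth;
    }
    return depth;
}

// try_lock returns immediately instead of blocking when the mutex is taken.
bool try_lock_while_held_elsewhere() {
    std::mutex m;
    bool acquired = true;
    std::lock_guard<std::mutex> held(m);   // the calling thread owns m
    std::thread other([&] {
        acquired = m.try_lock();           // m is busy, so this fails without blocking
        if (acquired) m.unlock();
    });
    other.join();
    return acquired;
}
```

Calling nested_lock_depth() should give 2, and try_lock_while_held_elsewhere() should give false. With a plain std::mutex, the second lock in nested_lock_depth() would be undefined behavior (typically a self-deadlock).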
I believe the concept of a fiber is not standardized across languages. Here are some general observations:
- fibers are unknown to kernel. They are similar to userland threads that are implemented in userland thread libraries, rather than implemented by kernel and system calls.
- like userland threads, a fiber adds less load on the kernel. See [[pthreads]]
- diff: fibers are even more light-weight than threads
- diff: fibers are usually short-lived
- diff: fibers have smaller stacks
- a few languages support millions of concurrent fibers in one OS. For threads, with an IO-heavy workload, you can probably run tens of thousands of threads on a single JVM.
“Linux kernel thread cannot run user programs”, as explained in [[UnderstandingLinuxKernel]].
Removing ‘Linux’ … Other Unix variants do use kernel threads to run both 1)user programs and 2)kernel routines. This is my understanding from reading about JVM…
Removing ‘kernel’… Linux userland threads do run user programs.
Removing ‘thread’… Linux kernel processes or interrupt routines could possibly run under a user process pid.
Background — Linux kernel provides two types of locks — spinlock and blocking mutex, as in https://www.kernel.org/doc/htmldocs/kernel-locking/locks.html . Here I focus on the mutex. I think this is far more useful to userland applications.
https://lwn.net/Articles/575460/ has good pointers:
- I believe context switch is expensive since CPU cache has to be replaced. Therefore, optimistic spin is beneficial.
- a blocking mutex used in kernel, perhaps not directly used by userland apps
- implemented using spin lock + some wait_lock
- maintains a wait_list. Not visible to any userland app.
Q: If some vendor provides a database access function (perhaps dbget()) that may be slow and may acquire some lock internally, is it a good idea to call this function while holding an unrelated mutex?
We spoke about this. Now I think this is so bad it should be banned. This dbget() could take an unknown amount of time. If someone deleted a row from a table that dbget() needs, it could block forever. The mutex you hold is obviously shared with other threads, so those threads would be starved. Even if this scenario happens once in a million times, it is simply unacceptable.
Here, I’m assuming the dbget() internal lock is completely unrelated to the mutex. In other words, dbget() doesn’t know anything about your mutex and will not need it.
As a rule of thumb, we should never call reach-out functions while holding a mutex. Reach-out functions need to acquire some shared resources such as a file, a web site, a web service, a socket, a remote system, a messaging system, a shared queue, a shared-memory mapping… Most of these resources are protected and can take some amount of time to become available.
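Here is a minimal C++ sketch of the alternative I have in mind: copy what you need under the mutex, release it, then reach out. slow_dbget() and the Cache struct are hypothetical stand-ins I made up, not the vendor's real API.

```cpp
#include <cassert>
#include <chrono>
#include <mutex>
#include <string>
#include <thread>

// Hypothetical stand-in for the vendor's slow call; not a real API.
std::string slow_dbget(int key) {
    std::this_thread::sleep_for(std::chrono::milliseconds(20));
    return "row-" + std::to_string(key);
}

struct Cache {
    std::mutex mtx;
    int next_key = 7;        // guarded by mtx
    std::string last_row;    // guarded by mtx

    std::string refresh() {
        int key;
        {   // 1) hold the mutex only long enough to copy what we need
            std::lock_guard<std::mutex> g(mtx);
            key = next_key;
        }
        // 2) reach out with no lock held, so other threads are not starved
        std::string row = slow_dbget(key);
        {   // 3) re-acquire briefly to publish the result
            std::lock_guard<std::mutex> g(mtx);
            last_row = row;
        }
        return row;
    }
};
```

The price is that the guarded state may change between step 1 and step 3, so step 3 may need to re-check it before publishing. I consider that a much smaller problem than starving every thread that shares the mutex.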
(I remember there’s some guideline named open-call but I can’t find it.)
That’s my understanding. What do you think?
In some deadlock avoidance algorithms, at runtime we need to ensure we immediately release every lock already acquired, without hesitation, so to speak. Here’s one design my friend Shanyou shared with me.
- Tip: limit the number of “grabbing” functions, i.e. functions that explicitly grab any lock.
Suppose a grabbing function f1 acquires a lock and calls another function f2 (a red flag to concurrency designers). In such a situation, f2() will use -1 return value to indicate “must release lock”. If f2() calls a grabbing function f3(), and if f3() fails to grab, it propagates “-1” to f2 and f1.
- f3() could use trylock.
- f3() could check lock hierarchy and return -1.
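The design above could be sketched in C++ roughly as follows, using the trylock variant of f3(). The lock names, the kMustRelease constant and the helper f1_while_lockB_is_busy() are my own assumptions for illustration; Shanyou's actual code may look quite different.

```cpp
#include <cassert>
#include <mutex>
#include <thread>

std::mutex lockA;   // grabbed by f1
std::mutex lockB;   // grabbed by f3

constexpr int kMustRelease = -1;   // the "-1" signal from the text

// f3 is a grabbing function: it uses try_lock and never blocks.
int f3() {
    std::unique_lock<std::mutex> b(lockB, std::try_to_lock);
    if (!b.owns_lock()) return kMustRelease;
    // ... work under lockB would go here ...
    return 0;
}

// f2 grabs nothing itself; it only propagates the signal upward.
int f2() {
    return f3();
}

// f1 returns true on success, false if it had to back off.
bool f1() {
    std::unique_lock<std::mutex> a(lockA);
    if (f2() == kMustRelease) {
        a.unlock();      // release immediately, without hesitation
        return false;    // the caller may retry later
    }
    return true;
}

// Demo helper: run f1 on another thread while this thread holds lockB.
bool f1_while_lockB_is_busy() {
    std::lock_guard<std::mutex> hold(lockB);
    bool ok = true;
    std::thread t([&] { ok = f1(); });
    t.join();
    return ok;
}
```

When lockB is free, f1() succeeds. When another thread holds lockB, f1() gives up lockA and reports failure instead of waiting, so no thread ever sits on one lock while blocked on another.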
My friend Shanyou asked:
Q: what if a thread somehow fails before releasing mutex?
I see only three scenarios:
- If machine loses power, then releasing mutex or not makes no difference.
- If process crashes but the mutex is in shared memory, then we are in trouble. The mutex will be seen as forever in-use. The other process can’t get this mutex. I feel this could be a practical problem, with practical solutions like reboot or process restart.
- If process is still alive, I rely on stack unwinding.
Stack unwinding is set up by the compiler. I know of two situations where this compiler-generated unwinding can be incomplete: an exception escaping a function declared noexcept, and an exception that is never caught anywhere. In both cases std::terminate is called, and whether the stack is unwound first is implementation-defined. (The noexcept case is a self-inflicted problem, since you promised the compiler the function would never throw.) I will assume we have neither situation. Therefore, I assume stack unwinding is robust and all stack objects will be destructed.
If one of the stack objects is a std::unique_lock that owns the mutex, then its destructor is guaranteed to unlock the mutex. That’s the highest reliability and reassurance I can hope for.
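A minimal C++ sketch of this scenario, with made-up names (g_mutex, fail_while_locked). A function throws while holding the mutex through a std::unique_lock, the exception is caught further up, and the mutex turns out to be free again.

```cpp
#include <cassert>
#include <mutex>
#include <stdexcept>

std::mutex g_mutex;

void fail_while_locked() {
    std::unique_lock<std::mutex> guard(g_mutex);
    throw std::runtime_error("simulated failure");
    // guard's destructor runs during stack unwinding and unlocks g_mutex
}

bool mutex_free_after_failure() {
    try {
        fail_while_locked();
    } catch (const std::runtime_error&) {
        // unwinding has completed by the time we get here
    }
    bool free_now = g_mutex.try_lock();   // succeeds only if nobody holds it
    if (free_now) g_mutex.unlock();
    return free_now;
}
```

mutex_free_after_failure() should return true. Note the sketch depends on the exception being caught somewhere; it does not cover a crash of the whole process.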
- memory fence c++
- memory barrier c++
- c++ thread memory visibility
You can search for these keywords on Google. Hundreds of people would agree that without synchronization, a write to sharedMutableObject1 by thread A at Time 1 is not guaranteed to be visible to Thread B at Time 2.
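To illustrate what "with synchronization" can look like, here is a small C++ sketch of my own, not taken from any of the search results. The writer publishes a plain int through an atomic flag using release/acquire ordering, which is one standard way to get the visibility guarantee.

```cpp
#include <atomic>
#include <cassert>
#include <thread>

int publish_and_read() {
    int payload = 0;                  // plain shared mutable object
    std::atomic<bool> ready{false};   // the synchronization point
    int seen = -1;

    std::thread reader([&] {
        // The acquire load pairs with the release store in the writer.
        while (!ready.load(std::memory_order_acquire)) {
            std::this_thread::yield();
        }
        seen = payload;               // guaranteed to observe the writer's value
    });
    std::thread writer([&] {
        payload = 42;                                   // ordinary write
        ready.store(true, std::memory_order_release);   // publish it
    });
    writer.join();
    reader.join();
    return seen;
}
```

publish_and_read() returns 42. If ready were a plain bool, the program would have a data race and no such guarantee.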
In any aggressively multithreaded program, there are very few shared mutable objects. If there’s none by design, then all threads can operate in single-threaded mode as if there’s no other thread in existence.
In single-threaded mode (the default) compilers would not generate machine code that forces every write out of the register and makes it visible to other cores. Such a memory barrier/fence is expensive. As a rough guide, main memory is at least 100 times slower than register access.
I would hypothesize that by default, the most recent write (at Time 1) is saved to register only, not L1 cache, because at compile time, compiler author doesn’t know if at runtime this same thread may write to that same address! If you update this object very soon, then it’s wasteful to flush the intermediate, temporary values to L1 cache, since no other threads need to see it.
L1 cache is about 10 times slower than register.
Multi-threaded lock-free programming always assumes multiple threads access shared mutable objects.
Even a lock-free function without contention (i.e. no retry needed, since other threads are unlikely to touch sharedMutableObject1 concurrently) requires memory barriers and is therefore slower than single-threaded mode. I would say in a low-contention context, the main performance gain of single-threaded over lock-free is data cache efficiency. Another performance gain is statement reordering.
The #1 usage of atomic<int> is load() and store(). I will use short form “load/store” or “l/s”.
The #2 usage is CAS. Interviewers are mostly interested in this usage, though I won’t bother to remember the function names —
The CAS usage is the same as AtomicInteger.java, but the load/store usage is more like the thread-safety feature of Vector.java. To see the need for load/store, we need to realize that a simple “int” assignment is not guaranteed to be atomic:
- P1012 [[c++ standard library]] shocked me by affirming that without locks, you can read a “half-written Boolean”.
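To solve this problem, atomic<int> guarantees that load() and store() are always atomic. An implementation is allowed to use internal locks for this (just like Vector.java), though on mainstream hardware atomic<int> is normally lock-free. The is_lock_free() member function tells you which.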
Note: this is different from java. https://stackoverflow.com/questions/11459543/should-getters-and-setters-be-synchronized points out that a 32-bit int in java is never “half-written”. If you read a shared mutable int in java, you can hit a stale value but never a c++ style half-written value. Therefore, java doesn’t need guarded load()/store() functions on an integer.
Q: are these c++ atomic types lock-free?
A: for load/store, not necessarily lock-free. See P1013. It depends on the platform, and is_lock_free() reports it.
A: for CAS — lock-free CPU instructions are used, if available.
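Since I don’t remember the function names, here is a small C++ sketch to jog my memory. cas_add() and run_cas_demo() are helper names I made up; load(), compare_exchange_weak() and is_lock_free() are the real std::atomic members.

```cpp
#include <atomic>
#include <cassert>
#include <thread>
#include <vector>

// CAS retry loop: replace the old value with old + delta only if nobody
// changed the atomic in between.
void cas_add(std::atomic<int>& a, int delta) {
    int expected = a.load();
    while (!a.compare_exchange_weak(expected, expected + delta)) {
        // On failure, expected has been refreshed with the current value; retry.
    }
}

// Several threads increment one atomic counter through the CAS loop.
int run_cas_demo(int threads, int per_thread) {
    std::atomic<int> total{0};
    std::vector<std::thread> pool;
    for (int t = 0; t < threads; ++t) {
        pool.emplace_back([&] {
            for (int i = 0; i < per_thread; ++i) cas_add(total, 1);
        });
    }
    for (auto& th : pool) th.join();
    return total.load();
}

// Reports whether atomic<int> is implemented without locks on this platform.
bool int_atomic_is_lock_free() {
    std::atomic<int> probe{0};
    return probe.is_lock_free();
}
```

run_cas_demo(4, 1000) returns 4000, since no increment is lost. I deliberately make no claim about what int_atomic_is_lock_free() returns, because that is decided by the platform.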