thread^process: lesser-known differences #kernel

This is largely a QQ question.  Some may consider it zbs.

  • To the kernel, there are man similarities between the thread construct vs the process construct. In fact, a (non-kernel) thread is often referenced as a LWP in many kernels such as Solaris and Linux
  • socket — threads in a process can access the same socket; two processes usually can’t access the same socket, unless … parent-child. See post on fork()
  • memory — thread AA can access all heap objects, and even Thread BB’s stack objects. Two process can’t share these, except via shared memory.
  • context switching — is faster between threads
  • creation — some thread libraries can create threads without the kernel knowing. No such thing for a process.
  • a non-kernel thread can never exist without an owner process. A process always has a parent process which could be long gone.

## UPSTREAM tech domains: defined by eg

It pays to specialize in a domain that helps you break into similar domains. (Unsuccessful with my quant adventure so far.)

  • eg: [L] socket —- C++ is the alpha wolf, leader of the wolf pack
  • eg: [L] latency —- C++ is the leader of the pack in low-level optimizations .. In-line, cache-efficiency, footprint..
  • eg: collections —- C++ is the leader of the pack. There’s a lot of low level details you gain in STL that help you understand the java/c#/python collections
  • eg: threading —- java is the leader of the pack. In c++ threading is too hard, so once I master some important details in java threading, it helps me build zbs in c/c#, but only to some extent
  • eg: [L]pbref^val —- C++ is the leader. C# has limited complexity and java has no complxity
  • eg: regex/string —- Perl is the leader
  • eg: alpha —- equities trading is the leader
  • [L] heap/stack —- C++ is the leader. Java and c# are cleaner, simpler and hides the complexities.
  • defensive coding —- (and crash management) C++ is the leader, because c++ hits more errors
  • lambda —- C# is the leader. Scripting languages follow a different pattern.
  • list/stream —- Linq is the leader
  • Black-Scholes —- feels like a leader or upstream, but not /convincing/
  • execution algo —- sell side equities desk seems to be the leader
  • data analytics —- python feels like the leader
  • [L = lowLevel]


calling database access method while holding mutex

Hi XR,

Q: If some vendor provides a database access function (perhaps dbget()) that may be slow and may acquire some lock internally, is it a good idea to call this function while holding an unrelated mutex?

We spoke about this. Now I think this is so bad it should be banned. This dbget() could take an unknown amount of time. If someone deleted a row from a table that dbget() needs, it could block forever. The mutex you hold is obviously shared with other threads, so those threads would be starved. Even if this scenario happens once in a million times, it is simply unacceptable.

Here, I’m assuming the dbget() internal lock is completely unrelated to the mutex. In other words, dbget() doesn’t know anything about your mutex and will not need it.

As a rule of thumb, we should never call reach-out functions while holding a mutex. Reach-out functions need to acquire some shared resources such as a file, a web site, a web service, a socket, a remote system, a messaging system, a shared queue, a shared-memory mapping… Most of these resources are protected and can take some amount of time to become available.

(I remember there’s some guideline named open-call but I can’t find it.)

That’s my understanding. What do you think?

## proliferation -> consolidation.. beware of churn

This is an extension of my 2015 blog post

Imagine you spent months of serious personal effort [1] learning to use, debug, and tune, say, MongoDB but after this project you only find projects that need just superficial Mongo knowledge. Developer time-investment has no recurring return. I think this is widespread in tech: A domain heats up attracting too many players creating competing products with varying degrees of similarity. We wish these products are mostly similar so we developers’ time investment can help us more than once, rather than learn-n-forget like 狗熊掰棒子. Often it takes decades to see some consolidation among the competitors, when most of them drop out of the race and we one player emerges dominant, or a common standard [2] is adopted with limited vendor extensions.

Therefore I see two phases : Proliferation -> Consolidation. The churn in the early phase represents a hazardous pitfall.

If we invest too much learning effort there we get burned.

  • Javascript packages
  • JVM languages — javascript, Scala, Groovy, jython
    • I don’t even know which company uses them
  • ORM and database access “frameworks”–, LINQ, EntityFramework, SpringJDBC,  iBatis,
  • Data Grid and NoSQL — Terracotta, Hazelcast, Gigaspace, Gemfire, Coherence, …
  • MOM — tibco, solace, 29west, Tervela, zeroc
  • machine learning?
  • web app languages
    • php, perl and the LAMP stack
    • Javascript MEAN stack
    • ASP and the Microsoft stack
    • Java stack

[1] You could have spent the time on personal investment, or something else if not required by the project.

[2] Some positive examples of standardization —

  1. RDBMS vendors
  2. Unix vendors
  3. c++ vendors — mostly GCC vs Microsoft VC++
  4. Java IDEs; c++/java/c# debuggers
  5. cvs, svn, git

A few development technologies “free” of proliferation pains —

  1. socket and system programming — complexities are low level and in C not c++
  2. core java
  3. SQL
  4. c/c++
  5. Unix know-how for testing, investigation, devops and process management
    1. shell scripting,
    2. regular expressions
    3. searching

make_shared() cache efficiency, forward()

This low-level topic is apparently important to multiple interviewers. I guess there are similarly low-level topics like lockfree, wait/notify, hashmap, const correctness.. These topics are purely for theoretical QQ interviews. I don’t think app developers ever need to write forward() in their code. touches on a few low-level optimizations. Suppose you follow Herb Sutter’s advice and write a factory accepting Trade ctor arg and returning a shared_ptr<Trade>,

  • your factory’s parameter should be a universal reference. You should then std::forward() it to make_shared(). See gcc source code See make_shared() source in
  • make_shared() makes one allocation for a Trade and an adjacent control block, with cache efficiency — any read access on the Trade pointer will cache the control block too
  • if the arg object is a temp object, then the rvr would be forwarded to the Trade ctor. Scott Meryers says the lvr would be cast to a rvr. The Trade ctor would need to move() it.
  • if the runtime object is carried by an lvr (arg object not a temp object), then the lvr would be forwarded as is to Trade ctor?

Q: What if I omit std::forward()?
AA: Trade ctor would receive always a lvr. See ScottMeyers P162 and my github code is my experiment.


deadlock avoidance: release all locks fail-fast

In some deadlock avoidance algorithms, at runtime we need to ensure we immediately release every lock already acquired, without hesitation, so to speak. Here’s one design my friend Shanyou shared with me.

  • Tip: limit the number of “grabbing” functions, functions that explicit grab any lock.

Suppose a grabbing function f1 acquires a lock and calls another function f2 (a red flag to concurrency designers). In such a situation, f2() will use -1 return value to indicate “must release lock”. If f2() calls a grabbing function f3(), and if f3() fails to grab, it propagates “-1” to f2 and f1.

  1. f3() could use trylock.
  2. f3() could check lock hierarchy and return -1.

which thread/pid drains NicBuffer->socketBuffer

Too many kernel concepts. I will use a phrasebook format. I have also separated some independent tips into hardware interrupt handler #phrasebook

  1. Scenario 1 : A single CPU. I start my parser which creates the multicast receiver socket but no data coming. My pid111 gets preempted. CPU is running unrelated pid222 when data /wash up/.
  2. Scenario 2: pid111 is running handleInput() while additional data comes in on the NIC.
  • context switching — to interrupt handler (i-handler). In all scenarios, the running process gets suspended to make way for the interrupt handler function. I-handler’s instruction address gets loaded into the cpu registers and it starts “driving” the cpu. Traditionally, the handler used the suspended process’s existing stack.
    • After the i-handler completes, the suspended “current” process resumes by default. However, the handler may cause another pid to be scheduled right away [1 Chapter 4.1].
  • no pid — interrupt handler execution has no pid, though some authors say it runs on behalf of the suspended pid. I feel the suspended pid may be unrelated to the socket, rather than the socket’s owner process (pid111).
  • kernel scheduler — In Scenario 1, pid111 would not get to process the data until it gets in the “driver’s seat” again. However, the interrupt handler could trigger a rescheduling and push pid111 “to the top” so to speak. [1 Chapter 4.1]
  • top-half — drains the tiny NIC buffer into main memory as fast as possible [2]
  • bottom-half — (i.e. deferrable functions) includes lengthy tasks like copying packets. Deferrable function run in interrupt context [1 Chapter 4.8], so there’s no pid
  • sleeping — the socket owner pid 111 would be technically “sleeping” in the socket’s wait queue initially. After the data is copied into the socket receive buffer, I think the kernel scheduler would locate pid111 in the socket’s wait queue and make pid111 the driver. Pid111 would call read() on the socket.
    • wait queue — How the scheduler does it is non-trivial. See [1 Chapter]
  • burst — What if there’s a burst of multicast packets? The i-handler would hog or steal the driver’s seat and /drain/ the NIC buffer as fast as possible, and populate the socket receive buffer. When the i-handler takes a break our handleInput() would chip away at the socket buffer.
    • priority — is given to the NIC’s interrupt handler, since we have a single CPU.

Q: What if the process scheduler wants to run while i-handler is busy draining the NIC?
A: Well, all interrupt handlers can be interrupted, but I would doubt the process scheduler would suspend the NIC interrupt handler.

One friend said the pid is 1, the kernel process.

[1] [[Understanding the Linux Kernel, 3rd Edition]]


if thread fails b4 releasing mutex #CSY

My friend Shanyou asked:

Q: what if a thread somehow fails before releasing mutex?

I see only three scenarios:

  • If machine loses power, then releasing mutex or not makes no difference.
  • If process crashes but the mutex is in shared memory, then we are in trouble. The mutex will be seen as forever in-use. The other process can’t get this mutex.
  • If process is still alive, I rely on stack unwinding.

Stack unwinding is set up by compiler. The only situation when this compiler-generated stack unwinding is incomplete is — if the failing function is declared noexcept. (In such a case, the failure is your problem since you promised to compiler it should never throw exception.) I will assume we don’t have a noexcept function. Therefore, I assume stack unwinding is robust and all stack objects will be destructed.

If one of the stack objects is a std::unique_lock, then compiler guarantees an unlocked status on destruction. That’s the highest reliability I can hope to achieve.

atomic{int} offers operator+=()

Background — on a machine with lock-free CPU…

My friend Shanyou asked me — with c++11 atomic data types, can we simply say

myAtomicInt ++; //without any mutex

A: Yes according to [[c++StdLib]]

Q: Is there some hidden CAS while-loop therein?
A: Yes I am 99% sure because other threads could be updating the same shared mutable object concurrently on other CPU cores.

Q: is there a CAS while-loop in a basic store/load operation?
A: I don’t think so

updates2shared-mutable: seldom flushed2memory immediately

  • memory fence c++
  • memory barrier c++
  • c++ thread memory visibility


You can search for these keywords on Google. Hundreds of people would agree that without synchronization, a write to sharedMutableObject1 by thread A at Time 1 is not guaranteed to be visible to Thread B at Time 2.

In any aggressively multithreaded program, there are very few shared mutable objects. If there’s none by design, then all threads can operate in single-threaded mode as if there’s no other thread in existence.

In single-threaded mode (the default) compilers would Not generate machine code to always flush a write to main memory bypassing register/L1/L2/L3 caches. Such a memory barrier/fence is extremely expensive — Main memory is at least 100 times slower than register access.

I would hypothesize that by default, the most recent write (at Time 1) is saved to register only, not L1 cache, because at compile time, compiler author doesn’t know if at runtime this same thread may write to that same address! If you update this object very soon, then it’s wasteful to flush the intermediate, temporary values to L1 cache, since no other threads need to see it.

L1 cache is about 10 times slower than register.

Multi-threaded lock-free programming always assumes multiple threads access shared mutable objects.

Even a lock-free function without contention [1] requires memory barriers and is therefore slower than single-threaded mode. I would say in a low-contention context, the main performance gain of single-threaded over lock-free is data cache efficiency. Another performance gain is statement reordering.

[1] i.e. no retry needed, since other threads are unlikely to touch sharedMutableObject1 concurrently