Latency engineering and optimization is all about the implicit operations, hidden costs and "stolen" CPU cycles. The goal: incur the minimum CPU cost for a task.
- eg (practical!): function A calling B, which in turn calls C, sets up one more stack frame than A calling C directly
- eg: boxing/unboxing – extra work for the CPU. Also creates garbage for the GC.
- eg: one thread multiplexing across multiple sockets – more CPU workload than one thread dedicated to one (exchange) socket
- eg: un-contended lock acquisition – more work than no-lock, partly due to the memory fence (see the lock-timing sketch after this list)
- eg: garbage collection – competes for CPU at a bad time. Usually, if there's no OOM pressure, the background GC threads run at low priority and won't slow down a critical task, but a stop-the-world collection can still strike at the worst moment.
- eg: page swap as part of virtual memory systems — competes for CPU
- eg: vtbl lookup – adds a few clock cycles per virtual call, to be avoided inside the most critical apps in an exchange. Therefore C++ developers favor templates over virtuals (see the CRTP sketch after this list)
- eg: RTTI – latency-sensitive apps generally disable RTTI early on, during compilation (-fno-rtti on gcc/clang; see the RTTI snippet after this list)
- eg: the null terminator at the end of c-strings – inflates network traffic by x%
- eg: inline – trims CPU cycles by removing call overhead
- eg: one kernel thread mapping to multiple user threads – the fastest system creates no more user threads than the number of CPUs (or kernel threads), so the thread scheduler doesn't need to run at all. I feel this is possible only on a dedicated machine, but such a machine must then communicate with peripheral machines, incurring serialization and network latency. A common building block is to pin each critical thread to its own core (see the core-pinning sketch after this list).
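To make the un-contended-lock bullet concrete, here is a minimal timing sketch (my own illustration, not from any production codebase). It times N plain increments against N increments each guarded by a never-contended std::mutex; even with zero contention, every lock/unlock pair implies an atomic read-modify-write plus fence semantics.

```cpp
#include <chrono>
#include <cstdio>
#include <mutex>

int main() {
    constexpr int N = 10'000'000;
    volatile long counter = 0;   // volatile: stop the compiler eliding the loops
    std::mutex mu;

    auto t0 = std::chrono::steady_clock::now();
    for (int i = 0; i < N; ++i) counter = counter + 1;   // no lock
    auto t1 = std::chrono::steady_clock::now();
    for (int i = 0; i < N; ++i) {
        std::lock_guard<std::mutex> g(mu);               // uncontended lock
        counter = counter + 1;
    }
    auto t2 = std::chrono::steady_clock::now();

    auto ns = [](auto a, auto b) {
        return (long)std::chrono::duration_cast<std::chrono::nanoseconds>(b - a).count();
    };
    std::printf("no lock: %ld ns   uncontended lock: %ld ns\n", ns(t0, t1), ns(t1, t2));
}
```

On a typical box the locked loop is several times slower despite the mutex never being contended; that gap is the atomic RMW plus fence.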
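For the vtbl bullet, a sketch of the template alternative. The names (TickHandler, on_tick) are hypothetical. The virtual version dispatches through the vtbl on every call; the CRTP version is resolved at compile time and can be inlined, which is also where the "inline trims CPU cycles" bullet pays off.

```cpp
#include <cstdio>

// Dynamic polymorphism: every call goes through the vtbl.
struct TickHandlerBase {
    virtual void on_tick(double px) = 0;
    virtual ~TickHandlerBase() = default;
};
struct LoggingHandler : TickHandlerBase {
    void on_tick(double px) override { std::printf("tick %.2f\n", px); }
};

// Static polymorphism (CRTP): dispatch resolved at compile time, inlinable.
template <class Derived>
struct TickHandlerCRTP {
    void on_tick(double px) { static_cast<Derived*>(this)->do_tick(px); }
};
struct FastHandler : TickHandlerCRTP<FastHandler> {
    void do_tick(double px) { std::printf("tick %.2f\n", px); }
};

int main() {
    LoggingHandler lh;
    TickHandlerBase* base = &lh;
    base->on_tick(101.25);   // vtbl lookup + indirect call
    FastHandler fh;
    fh.on_tick(101.25);      // direct, inlinable call
}
```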
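The RTTI bullet in a few lines: both constructs below depend on RTTI, so compiling this with -fno-rtti is rejected.

```cpp
#include <typeinfo>
#include <cstdio>

struct Base { virtual ~Base() = default; };
struct Derived : Base {};

int main() {
    Base* b = new Derived;
    if (auto* d = dynamic_cast<Derived*>(b))                   // needs RTTI
        std::printf("downcast ok: %s\n", typeid(*d).name());   // typeid needs RTTI too
    delete b;
}
```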
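And a minimal, Linux-specific core-pinning sketch: pin the current thread to one core with pthread_setaffinity_np so the scheduler never migrates it. Core 0 is an arbitrary choice here; a production box would reserve isolated cores (isolcpus) for this.

```cpp
#include <pthread.h>
#include <sched.h>
#include <cstdio>

int main() {
    cpu_set_t set;
    CPU_ZERO(&set);
    CPU_SET(0, &set);   // arbitrary choice: dedicate core 0 to this thread
    int rc = pthread_setaffinity_np(pthread_self(), sizeof(set), &set);
    if (rc != 0) {
        std::fprintf(stderr, "pinning failed, rc=%d\n", rc);
        return 1;
    }
    std::puts("pinned to core 0; the hot loop would run here");
    return 0;
}
```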
For a dedicated kernel thread to service a busy stream of tasks, we need to consider what happens when the tasks come in bursts and the thread becomes temporarily idle. One idea is to suspend the thread in wait(), but suspending and waking a thread means trips into the kernel scheduler and adds wake-up latency to the first task of the next burst. The common low-latency approach is to simply keep the thread busy-spinning, as sketched below. The assumption is that one CPU is exclusively dedicated to this thread, so this CPU can't do anything else even if we suspend the thread.
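A minimal busy-spin sketch of that idea, with an atomic counter standing in for a real task queue (all names are mine). The consumer never blocks; between bursts it spins, using the x86 pause instruction (_mm_pause) to ease the spin without ever entering the kernel.

```cpp
#include <atomic>
#include <cstdio>
#include <thread>
#include <immintrin.h>

std::atomic<int> pending{0};             // stand-in for a real task queue
std::atomic<bool> shutting_down{false};

void consumer() {
    for (;;) {
        if (pending.load(std::memory_order_acquire) == 0) {
            if (shutting_down.load(std::memory_order_relaxed)) return;
            _mm_pause();                 // spin: stay hot, never sleep
            continue;
        }
        int left = pending.fetch_sub(1, std::memory_order_acq_rel) - 1;
        std::printf("handled one task, %d left in burst\n", left);
    }
}

int main() {
    std::thread t(consumer);
    pending.fetch_add(3, std::memory_order_release);   // a burst of 3 tasks arrives
    while (pending.load(std::memory_order_acquire) != 0) _mm_pause();
    shutting_down.store(true, std::memory_order_relaxed);
    t.join();
}
```

The trade-off is one core permanently at 100%, which is acceptable precisely because that core was assumed to be dedicated anyway.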