Latency /engineering/ and optimization is all about the implicit operations, hidden costs, and “stolen” CPU cycles. The goal: incur the minimum CPU cost for a task.
eg: boxing/unboxing (Java, C#) — extra work for the CPU. The boxed objects also create garbage for the GC.
eg: function A calling B, which calls C, sets up one more stack frame than A calling C directly
eg: one thread multiplexing between multiple sockets — more CPU work (polling and demultiplexing) than one thread dedicated to one (exchange) socket
eg: uncontended lock acquisition — still an atomic read-modify-write (and usually a fence), so more work than no lock at all
eg: garbage collection – competes for CPU.
eg: page swap as part of virtual memory systems — competes for CPU
eg: vtbl lookup — adds a few clock cycles per function call and usually prevents inlining. To be avoided inside the most critical apps in an exchange. Therefore C++ developers favor templates over virtuals
eg: RTTI — latency-sensitive apps generally disable RTTI at compile time (e.g. -fno-rtti)
eg: null terminator at the end of C strings — one extra byte per string on the wire, which for short fields is a measurable share of network traffic
eg: inline — eliminates call/return overhead and opens up further optimization, trimming CPU cycles.
eg: one kernel thread mapped to multiple user threads — the fastest systems create no more runnable threads than the number of CPU cores (or kernel threads), so the thread scheduler barely needs to run at all. I feel this is possible only on a dedicated machine, but such a machine must then communicate with peripheral machines, which brings in serialization and network latency.
For a dedicated kernel thread servicing a busy stream of tasks, we need to consider what happens if tasks arrive in bursts and the thread becomes temporarily idle. One solution is to suspend the thread in wait(), but a more radical approach is to simply let the thread busy-wait in a loop. The assumption is that one CPU core is exclusively dedicated to this thread, so the core can't do anything else even if we suspend the thread.