context switch: 1-10 microsec

See also #CPU-cycles. One time slice ~ millions of cycles. #AshS

(A thread context switch is faster than a process context switch.) In theory it could be as long as 10 millisec. See P351 [[linux kernel]]

In an MS interview, Deepak was asked about the (typical) duration of a TaskSwitch i.e. context switch. I think 20 millisec is way too long.

https://stackoverflow.com/questions/21887797/what-is-the-overhead-of-a-context-switch and other sites suggest “below 10 microsecond”

— One of the most interesting and popular focus points is the difference between thread switching and process switching

Does the L1-cache (private to a cpu core) get reloaded when a core switches to another process…?
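To get a feel for the number, here is a rough Linux-only ping-pong sketch (names and constants are mine) that bounces a byte between two processes over a pair of pipes. Each round trip costs at least two switches plus syscall/pipe overhead, so it only gives an upper bound.

```cpp
#include <chrono>
#include <cstdio>
#include <sys/wait.h>
#include <unistd.h>      // pipe, fork, read, write

// Run pinned to one core (e.g. "taskset -c 0 ./a.out") to actually force
// context switches rather than concurrent execution on two cores.
int main() {
    const int ROUNDS = 100000;
    int p2c[2], c2p[2];                        // parent->child, child->parent
    if (pipe(p2c) != 0 || pipe(c2p) != 0) { std::perror("pipe"); return 1; }

    char buf = 'x';
    pid_t pid = fork();
    if (pid == 0) {                            // child: echo whatever arrives
        for (int i = 0; i < ROUNDS; ++i) {
            read(p2c[0], &buf, 1);
            write(c2p[1], &buf, 1);
        }
        _exit(0);
    }

    auto t0 = std::chrono::steady_clock::now();
    for (int i = 0; i < ROUNDS; ++i) {         // parent: ping, wait for pong
        write(p2c[1], &buf, 1);
        read(c2p[0], &buf, 1);
    }
    auto t1 = std::chrono::steady_clock::now();
    waitpid(pid, nullptr, 0);

    double ns = std::chrono::duration_cast<std::chrono::nanoseconds>(t1 - t0).count();
    std::printf("~%.0f ns per switch (upper bound, includes pipe overhead)\n",
                ns / (2.0 * ROUNDS));
}
```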

 

loop fusion^fission

Stroustrup told me about compiler optimization “loop fusion” when I asked how c/c++ can get beaten in benchmarks.

  • loop fusion may reduce loop overhead (fewer branches and counter updates)
  • loop fission may improve locality of reference and parallelism on multi-core??

I guess these are techniques to improve raw speed.
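A hand-written sketch of both transformations (assuming a and b have the same length); a compiler would apply them automatically when it can prove the loops are independent.

```cpp
#include <vector>

// Two separate passes over the same data ...
void beforeFusion(std::vector<double>& a, std::vector<double>& b) {
    for (std::size_t i = 0; i < a.size(); ++i) a[i] *= 2.0;
    for (std::size_t i = 0; i < a.size(); ++i) b[i] += a[i];
}

// ... fused into one pass: half the loop overhead, and a[i] is still hot
// in a register/cache line when b[i] += a[i] runs.
void afterFusion(std::vector<double>& a, std::vector<double>& b) {
    for (std::size_t i = 0; i < a.size(); ++i) {
        a[i] *= 2.0;
        b[i] += a[i];
    }
}

// Fission goes the other way: one loop touching two unrelated arrays is
// split into two loops, each with a smaller cache footprint, and the two
// loops could then run on separate cores.
void afterFission(std::vector<double>& c, std::vector<double>& d) {
    for (std::size_t i = 0; i < c.size(); ++i) c[i] *= 3.0;   // independent work
    for (std::size_t i = 0; i < d.size(); ++i) d[i] *= 5.0;   // independent work
}
```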

xchg between register/memory triggers fencing

I guess xchg is used in some basic concurrency constructs, but I need to research more.

See my blogpost hardware mutex, based@XCHG instruction

https://c9x.me/x86/html/file_module_x86_id_328.html points out the implicit locking performed by CPU whenever one of the two operands is a memory location. The other operand must be a register.

This implicit locking is quite expensive according to https://stackoverflow.com/questions/50102342/how-does-xchg-work-in-intel-assembly-language, but presumably cheaper than user-level locking.

This implicit locking involves a memory fence.
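A minimal sketch of the "hardware mutex" idea using std::atomic: on x86, exchange() on an atomic integer typically compiles down to XCHG with a memory operand, and that implicit lock is what provides the fence.

```cpp
#include <atomic>

// Minimal spinlock sketch: std::atomic::exchange() typically compiles to
// XCHG on x86, whose implicit locking acts as a full memory fence.
class SpinLock {
    std::atomic<int> locked{0};
public:
    void lock() {
        // spin until we atomically swap 0 -> 1
        while (locked.exchange(1) == 1) {
            // busy-wait; a real implementation would add a pause/yield here
        }
    }
    void unlock() {
        locked.store(0, std::memory_order_release);
    }
};
```

std::atomic_flag::test_and_set would be the more idiomatic choice in portable code; the point here is only the XCHG connection.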

JIT has opportunities to avoid vtable latency #Martin

P76 [[javaPerf]] described a nifty JIT technique to avoid runtime cost of the dynamic binding of virtual function equals(). Suppose in some class, we call obj1.equals(obj2).

After a priming (i.e. warm-up) period, JIT collects enough statistics to see that every dynamic dispatch at this call site invokes String.equals(), so JIT decides to turn it into a faster “static binding”: the String.equals() function address is hardwired into the generated assembly code (not JVM bytecode). JIT also needs to handle the possibility of Character.equals(). I guess the assembly code can detect that obj1 is not a String.java instance and fall back to the regular virtual function lookup. JIT can generate assembly code to
1. verify obj1 is a String and call an inlined String.equals()
2. if obj1 is not a String, then fall back to obj1’s vtable to look up the virtual function equals()

It may turn out that 99.9% of the time we can skip the time-consuming Step 2 :)

Martin gave a (hypothetical?) example. Suppose JIT notices that obj1 is always either a String or Character. JIT could inline both equals() functions and completely bypass vtable. (Branching can be done via if/else or …) This inline compilation is done after a fairly long period of instrumentation. I asked Martin why c++ can’t do it. He said c++ only uses static compilation. I feel c++ compilers don’t bother with this technique as it is not a proven performance win.
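Purely as an illustration (in C++, keeping one language for all sketches in this post), here is a hand-written analogue of that guard-then-direct-call structure. The class names MyString/MyChar and the helper guardedEquals are made up; the real JIT emits this from runtime type-profile statistics, not from source code.

```cpp
struct Obj {
    virtual bool equals(const Obj& other) const = 0;
    virtual ~Obj() = default;
};

struct MyString : Obj {
    bool equals(const Obj& other) const override { return this == &other; }
};
struct MyChar : Obj {
    bool equals(const Obj& other) const override { return this == &other; }
};

// What guarded devirtualization conceptually produces for obj1.equals(obj2):
bool guardedEquals(const Obj& obj1, const Obj& obj2) {
    // fast path: guard on the profiled type, then a direct, inlinable call
    if (const MyString* s = dynamic_cast<const MyString*>(&obj1))
        return s->MyString::equals(obj2);   // static binding, no vtable lookup
    // slow path: ordinary virtual dispatch through the vtable
    return obj1.equals(obj2);
}
```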

string comparison is costly #loop-unroll

P120 [[ algos in a nutshell ]] claims that string comparison is considered expensive.

String comparison requires byte-by-byte comparison in a loop. Sometimes people need to write this loop by hand in assembly, and sometimes unroll the loop to fill the instruction pipeline.

I have no reason to doubt the authors, and I believe these techniques have proven effective in practice.
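A sketch of the plain loop versus a 4x-unrolled version. The unrolled variant assumes fixed-length buffers whose length is a multiple of 4, purely for illustration.

```cpp
#include <cstddef>

// Naive byte-by-byte comparison of two NUL-terminated strings.
int cmpNaive(const char* a, const char* b) {
    while (*a && *a == *b) { ++a; ++b; }
    return (unsigned char)*a - (unsigned char)*b;
}

// Same comparison with the loop manually unrolled 4x, assuming both
// buffers are exactly n bytes long and n is a multiple of 4.
int cmpUnrolled4(const char* a, const char* b, std::size_t n) {
    for (std::size_t i = 0; i < n; i += 4) {
        if (a[i]   != b[i])   return (unsigned char)a[i]   - (unsigned char)b[i];
        if (a[i+1] != b[i+1]) return (unsigned char)a[i+1] - (unsigned char)b[i+1];
        if (a[i+2] != b[i+2]) return (unsigned char)a[i+2] - (unsigned char)b[i+2];
        if (a[i+3] != b[i+3]) return (unsigned char)a[i+3] - (unsigned char)b[i+3];
    }
    return 0;
}
```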

translation lookaside buffer #stats

  • a.k.a Address Translation Cache. The TLB lets the processor very quickly convert virtual addresses to physical addresses.
  • TLB is a cache for the big, slow page table
  • A typical entry in TLB is a pair of {virtual -> physical addresses}. In contrast,
  • A typical entry in an L1 cache is a mapping of {physical address -> payload}.
  • You can hit both caches!
  • Both caches sit between the processor and main memory
  • each hardware system has one or more TLBs
  • a TLB-miss can be handled in hardware or in the kernel
  • typical miss probability — 0.01% to 1%
  • typical miss latency (penalty) — 10 to 100 clock cycles to read the page table
  • typical hit latency: 0.5 to 1 clock cycle

If a TLB hit takes 1 clock cycle, a miss takes 30 clock cycles, and the miss rate is 1%, the effective memory cycle rate is an average of 1 × 0.99 + (1 + 30) × 0.01 = 1.30 i.e. 1.30 clock cycles per memory access.
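The same weighted-average arithmetic as a tiny helper (names are mine):

```cpp
#include <cstdio>

// effective = hit_cost * (1 - miss_rate) + (hit_cost + miss_penalty) * miss_rate
double effectiveCycles(double hitCycles, double missPenaltyCycles, double missRate) {
    return hitCycles * (1.0 - missRate) + (hitCycles + missPenaltyCycles) * missRate;
}

int main() {
    std::printf("%.2f cycles per memory access\n", effectiveCycles(1, 30, 0.01)); // 1.30
}
```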

https://stackoverflow.com/questions/5440128/thread-context-switch-vs-process-context-switch says the TLB favors a thread context-switch over a process context-switch: the TLB gets flushed during a process switch, but not during a thread switch.

 

CPU(data)cache prefetching

https://stackoverflow.com/questions/1950878/c-for-loop-indexing-is-forward-indexing-faster-in-new-cpus — the top answer is concise. I think the specific observations may not be relevant in x years but the principles are.

  • adjacent cache line (ACL) prefetcher — simple to understand
  • cpu can detect streams of memory accesses in forward or backward directions

Note L1/L2/L3 caches are considered part of the CPU even if some of them are physically outside the microprocessor.
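A sketch contrasting a prefetcher-friendly sequential scan with a large-stride scan over the same data (the 4096 stride is an arbitrary illustration):

```cpp
#include <cstddef>
#include <vector>

// Sequential, forward traversal: the access pattern hardware prefetchers
// (including the adjacent-cache-line prefetcher) detect most easily.
long long sumSequential(const std::vector<int>& v) {
    long long total = 0;
    for (std::size_t i = 0; i < v.size(); ++i) total += v[i];
    return total;
}

// Large-stride traversal: touches a new cache line on almost every access,
// which a stream prefetcher has a much harder time keeping up with.
long long sumStrided(const std::vector<int>& v, std::size_t stride = 4096) {
    long long total = 0;
    for (std::size_t start = 0; start < stride; ++start)
        for (std::size_t i = start; i < v.size(); i += stride) total += v[i];
    return total;
}
```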

[20] how many cpu cycles per instruction

A 2019 GS-Hongkong interviewer pointed out that very few instructions can complete within one clock cycle. Indeed, only simple add/move etc can average about one clock cycle, and only thanks to pipelining:

https://www.d.umn.edu/~gshute/arch/performance-equation.xhtml#clocks-per-instruction says “Without instruction-level parallelism, simple instructions usually take 4 or more cycles to execute…. Pipelining (overlapping execution of instructions) can bring the average for simple instructions down to near 1 clock per instruction. Superscalar pipelining (issuing multiple instructions per cycle) can bring the average down to a fraction of a clock per instruction”

https://www.agner.org/optimize/instruction_tables.pdf is very detailed and up to date.

http://ithare.com/infographics-operation-costs-in-cpu-clock-cycles/ is an info-graphic

  • integer multiplication 3-6 cycles
  • integer division 12-44 cycles
  • L1 cache hit: 3-4
  • L2 cache hit: 10-12, about 3 times slower
  • L3 cache hit on the same core: 30-70, about 3 times slower
  • (L3 cache hit across sockets: 100-300, about 3 times slower )
  • main RAM read or L1/2/3 cache miss: 100-150, about 3 times slower than L3
  • ..main RAM read across socket : 300-500, about 3 times slower
  • CAS on single-core hardware and hitting L1 only: 15-30 cycles
  • .. about 5 times slower than L1 hit in Single-threaded-mode.
  • CAS on multi-core across sockets: 100-300 cycles, guesstimated by the author. Same as L3 hit across socket.
  • c/c++ function call: 15-30 cycles
  • c++ virtual function call: 30-60 cycles
  • heap/free-store/DAM {allocation+deallocation} pair for small objects: 200-500 cycles, more than 100 times slower than integer add/subtract (1 cycle)
  • thread context switch (including d-cache and i-cache invalidation): 10,000 – 1 million cycles

Note a socket is the seat on the mainboard to hold a physical chip. The seat has pins sticking out like an octopus. One socket holds one chip, which can hold multiple cores.
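For rough cycle counts on x86 I can read the time-stamp counter directly (__rdtsc from x86intrin.h on GCC/Clang). This sketch ignores frequency scaling, core migration and out-of-order effects, so it is only indicative.

```cpp
#include <cstdio>
#include <x86intrin.h>   // __rdtsc()

int main() {
    unsigned long long start = __rdtsc();

    volatile long long acc = 0;                            // volatile: keep the loop
    for (long long i = 0; i < 1'000'000; ++i) acc += i;    // some simple work

    unsigned long long cycles = __rdtsc() - start;
    std::printf("~%llu reference cycles, ~%.1f per iteration\n",
                cycles, cycles / 1e6);
}
```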

subverting kernel’s resource-allocation – a few eg

[[Linux sys programming]] explains several techniques to subvert OS resource-allocation decisions. Relevant to performance-critical apps.

P275 mlockall() — mlock(): prevent paging, in real-time apps

P173 sched_setaffinity(). A strongly cache-sensitive app could benefit from hard affinity (stronger than the default soft affinity), which prevents the kernel scheduler from migrating the process to another processor.
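A minimal sketch combining both calls (the CPU number 2 is arbitrary; mlockall typically needs CAP_IPC_LOCK or a raised memlock ulimit; compile as C++ with g++, which defines _GNU_SOURCE for the CPU_* macros):

```cpp
#include <cstdio>
#include <sched.h>       // sched_setaffinity, CPU_ZERO, CPU_SET
#include <sys/mman.h>    // mlockall

int main() {
    // Lock all current and future pages into RAM: no paging for this process.
    if (mlockall(MCL_CURRENT | MCL_FUTURE) != 0)
        std::perror("mlockall");

    // Hard affinity: pin this process to CPU 2 so the scheduler never migrates it.
    cpu_set_t set;
    CPU_ZERO(&set);
    CPU_SET(2, &set);
    if (sched_setaffinity(0 /* this process */, sizeof(set), &set) != 0)
        std::perror("sched_setaffinity");

    // ... latency-critical work runs here, on CPU 2, never paged out ...
    return 0;
}
```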

[[The Linux Programmer’s Toolbox]]. A write-only variable will be removed by the compiler’s optimizer, but such a variable could be useful to a debugger. I read somewhere that you can mark it volatile to keep it — subversive.

Any way to prevent “my” data or instructions from leaving the L1/L2 cache?

Any way to stay “permanently” in the driver’s seat, subverting the thread scheduler’s time-slicing?

cache-miss in(CPU hogging)hot-function

P245 [[Optimizing Linux Perf]] (2005) points out this “intriguing” scenario. Here are the investigation steps to identify it —

First, we use _oprofile_ to identify the function(s) taking up the most application time. In other words, the process is spending a large portion (I would imagine 5%+) of its cpu cycles in this function. The cycles could be spent in one lengthy call or a million quick re-entries; either way, this would be a hotspot. Then we run oprofile/valgrind(cachegrind)/kcachegrind on the same process, and check whether the hot function generates a high rate of cache misses.

The punch line – the high cache misses could be the cause of the observed process hogging. I assume the author has experienced the same but I’m not sure how rare or common this scenario is.

Some optimization guy in an HFT shop told me main memory is now treated as IO, so cache misses are taken seriously. http://programmers.stackexchange.com/questions/142864/writing-low-latency-java mentions that “Cache misses are your biggest cost to performance. Use algorithms that are cache friendly.”

By the way, an instruction cache miss is worse than a data cache miss. My friend Shanyou said the same.

empty while(true) loop hogging CPU

Someone (barclays?) gave me a basic tuning quiz —

Q: what cpu utilization will you see if a program executes an empty while(1==1){} ?

A: On a 16-core machine, 3 instances of this program each take up 4% of the aggregate CPU according to windows taskmgr. I think 4% means hogging one entire core.

A: on my RedHat linux, the same program has 99.8% CPU usage meaning one entire core.

A: On a dual-core, if I start 2 instances, each takes up about 50% i.e. one entire core, keeping all other processes off both cores. With 3 instances, system becomes visibly slow. Taskmgr itself becomes a bit hard to use and reports about 30%-40% by each instance.

I think this would count as a cpu-intensive job.
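The quiz program itself is trivial; the volatile flag just keeps the loop well-defined and un-removable under C++ forward-progress rules.

```cpp
// Busy loop: never blocks, never yields, so it pins one core at ~100% CPU.
int main() {
    volatile bool keepSpinning = true;     // volatile: loop cannot be optimized away
    while (keepSpinning) { /* empty body: pure CPU burn */ }
}
```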

##hidden cost, stolen CPU cycles: latency engineering

Update — “kernel thread” means something special in linux…

—-

Latency /engineering/ and optimization is all about implicit operations, hidden costs and “stolen” cpu cycles: incur the minimum CPU cost for a task.

  • eg (practical!): function A calling B which then calls C is one more stack frame than A calling C directly, unless B is actually inlined.
  • eg: boxing/unboxing — extra work for cpu. Also creates garbage for the GC.
  • eg: one thread switching between multiple sockets — more cpu workload than one thread dedicated to one (exchange) socket
  • eg: un-contended lock acquisition — more work than no-lock, partly due to memory fence
  • eg: garbage collection – competes for CPU at a bad time. Usually, if there’s no OOM, then the GC thread is very low priority and won’t slow down a critical task.
  • eg: page swap as part of virtual memory systems — competes for CPU
  • eg: vtbl lookup — adds a few clock cycles per function call. To be avoided inside the most critical apps in an exchange. Therefore c++ developers favor templates over virtuals
  • eg: RTTI — latency sensitive apps generally disable RTTI early on — during compilation
  • eg: null terminator at the end of c-strings — inflates network traffic by x% unnecessarily
  • eg: inline – trims cpu cycles.
  • eg: one kernel thread mapping to multiple user threads — the fastest system would create no more user threads than the available cpu (i.e. kernel) threads, so the thread scheduler doesn’t need to run at all. I feel this is possible only on a dedicated machine, but such a machine must then communicate with peripheral machines, involving serialization and network latency.

For a dedicated kernel thread servicing a busy stream of tasks, we need to consider what happens if the tasks come in bursts and the thread becomes temporarily idle. One idea is to suspend the thread in wait(), but I believe a kernel thread can’t be suspended. In the kernel, a common approach is to simply keep the thread spinning. The assumption is that one cpu is exclusively dedicated to this thread, so that cpu couldn’t do anything else anyway even if we suspended the thread. A user-space sketch of this busy-spin pattern follows.
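Here is a user-space sketch (with made-up names and a toy task protocol) of that pattern: a worker pinned to one core that never blocks, only spins.

```cpp
#include <atomic>
#include <cstdio>
#include <sched.h>       // sched_setaffinity (Linux)
#include <thread>

// Dedicated, busy-spinning worker: pinned to one core, never blocks; when no
// task is pending it simply spins, assuming the core is reserved for it.
std::atomic<int>  pendingTask{0};        // 0 = idle, otherwise a task id (toy protocol)
std::atomic<bool> shuttingDown{false};

void pinToCpu(int cpu) {
    cpu_set_t set;
    CPU_ZERO(&set);
    CPU_SET(cpu, &set);
    sched_setaffinity(0, sizeof(set), &set);   // pid 0 = the calling thread
}

void spinWorker() {
    pinToCpu(3);                               // arbitrary core for illustration
    while (!shuttingDown.load(std::memory_order_acquire)) {
        int task = pendingTask.exchange(0, std::memory_order_acquire);
        if (task != 0)
            std::printf("processed task %d\n", task);
        // else: spin; no yield, no sleep, no blocking syscall
    }
}

int main() {
    std::thread worker(spinWorker);
    for (int i = 1; i <= 3; ++i) {
        pendingTask.store(i, std::memory_order_release);
        while (pendingTask.load(std::memory_order_acquire) != 0) { /* wait for worker */ }
    }
    shuttingDown.store(true, std::memory_order_release);
    worker.join();
}
```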