context switch: 1-10 microsec

See also #CPU-cycles. One time slice ~ millions of cycles. #AshS

(A thread context switch is faster than a process context switch.) In theory it could be as long as 10 millisec. See P351 [[linux kernel]]

In an MS interview, Deepak was asked about the (typical) duration of a TaskSwitch i.e. context switch. I think 20 millisec is way too long.

https://stackoverflow.com/questions/21887797/what-is-the-overhead-of-a-context-switch and other sites suggest “below 10 microsecond”

— One of the most interesting and popular focus points is the difference between thread switching and process switching

Does the L1-cache (private to a cpu core) get reloaded when a core switches to another process…?
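To get a feel for the number, here is a rough Linux-only ping-pong sketch (names and constants are mine) that bounces a byte between two processes over a pair of pipes. Each round trip costs at least two switches plus syscall/pipe overhead, so it only gives an upper bound.

```cpp
#include <chrono>
#include <cstdio>
#include <sys/wait.h>
#include <unistd.h>      // pipe, fork, read, write

// Run pinned to one core (e.g. "taskset -c 0 ./a.out") to actually force
// context switches rather than concurrent execution on two cores.
int main() {
    const int ROUNDS = 100000;
    int p2c[2], c2p[2];                        // parent->child, child->parent
    if (pipe(p2c) != 0 || pipe(c2p) != 0) { std::perror("pipe"); return 1; }

    char buf = 'x';
    pid_t pid = fork();
    if (pid == 0) {                            // child: echo whatever arrives
        for (int i = 0; i < ROUNDS; ++i) {
            read(p2c[0], &buf, 1);
            write(c2p[1], &buf, 1);
        }
        _exit(0);
    }

    auto t0 = std::chrono::steady_clock::now();
    for (int i = 0; i < ROUNDS; ++i) {         // parent: ping, wait for pong
        write(p2c[1], &buf, 1);
        read(c2p[0], &buf, 1);
    }
    auto t1 = std::chrono::steady_clock::now();
    waitpid(pid, nullptr, 0);

    double ns = std::chrono::duration_cast<std::chrono::nanoseconds>(t1 - t0).count();
    std::printf("~%.0f ns per switch (upper bound, includes pipe overhead)\n",
                ns / (2.0 * ROUNDS));
}
```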

 

loop fusion^fission

Stroustrup told me about compiler optimization “loop fusion” when I asked how c/c++ can get beaten in benchmarks.

  • loop fusion may reduce loop overhead (fewer branches and counter updates)
  • loop fission may improve locality of reference and parallelism on multi-core??

I guess these are techniques to improve raw speed.
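A hand-written sketch of both transformations (assuming a and b have the same length); a compiler would apply them automatically when it can prove the loops are independent.

```cpp
#include <vector>

// Two separate passes over the same data ...
void beforeFusion(std::vector<double>& a, std::vector<double>& b) {
    for (std::size_t i = 0; i < a.size(); ++i) a[i] *= 2.0;
    for (std::size_t i = 0; i < a.size(); ++i) b[i] += a[i];
}

// ... fused into one pass: half the loop overhead, and a[i] is still hot
// in a register/cache line when b[i] += a[i] runs.
void afterFusion(std::vector<double>& a, std::vector<double>& b) {
    for (std::size_t i = 0; i < a.size(); ++i) {
        a[i] *= 2.0;
        b[i] += a[i];
    }
}

// Fission goes the other way: one loop touching two unrelated arrays is
// split into two loops, each with a smaller cache footprint, and the two
// loops could then run on separate cores.
void afterFission(std::vector<double>& c, std::vector<double>& d) {
    for (std::size_t i = 0; i < c.size(); ++i) c[i] *= 3.0;   // independent work
    for (std::size_t i = 0; i < d.size(); ++i) d[i] *= 5.0;   // independent work
}
```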

xchg between register/memory triggers fencing

I guess xchg is used in some basic concurrency constructs, but I need to research more.

See my blogpost hardware mutex, based@XCHG instruction

https://c9x.me/x86/html/file_module_x86_id_328.html points out the implicit locking performed by CPU whenever one of the two operands is a memory location. The other operand must be a register.

This implicit locking is quite expensive according to https://stackoverflow.com/questions/50102342/how-does-xchg-work-in-intel-assembly-language, but presumably cheaper than user-level locking.

This implicit locking involves a memory fence.
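A minimal sketch of the "hardware mutex" idea using std::atomic: on x86, exchange() on an atomic integer typically compiles down to XCHG with a memory operand, and that implicit lock is what provides the fence.

```cpp
#include <atomic>

// Minimal spinlock sketch: std::atomic::exchange() typically compiles to
// XCHG on x86, whose implicit locking acts as a full memory fence.
class SpinLock {
    std::atomic<int> locked{0};
public:
    void lock() {
        // spin until we atomically swap 0 -> 1
        while (locked.exchange(1) == 1) {
            // busy-wait; a real implementation would add a pause/yield here
        }
    }
    void unlock() {
        locked.store(0, std::memory_order_release);
    }
};
```

std::atomic_flag::test_and_set would be the more idiomatic choice in portable code; the point here is only the XCHG connection.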

JIT has opportunities to avoid vtable latency #Martin

P76 [[javaPerf]] described a nifty JIT technique to avoid runtime cost of the dynamic binding of virtual function equals(). Suppose in some class, we call obj1.equals(obj2).

After a priming (i.e. warm-up) period, JIT collects enough statistics to see that every dynamic dispatch at this call site invokes String.equals(), so JIT decides to turn it into a faster “static binding”: the String.equals() function address is hardwired into the generated assembly code (not JVM bytecode). JIT also needs to handle the possibility of Character.equals(). I guess the assembly code can detect that obj1 is not a String.java instance and fall back to the regular virtual function lookup. JIT can generate assembly code to
1. verify obj1 is a String and call an inlined String.equals()
2. if obj1 is not a String, then fall back to obj1’s vtable to look up the virtual function equals()

It may turn out that 99.9% of the time we can skip the time-consuming Step 2 :)

Martin gave a (hypothetical?) example. Suppose JIT notices that obj1 is always either a String or Character. JIT could inline both equals() functions and completely bypass vtable. (Branching can be done via if/else or …) This inline compilation is done after a fairly long period of instrumentation. I asked Martin why c++ can’t do it. He said c++ only uses static compilation. I feel c++ compilers don’t bother with this technique as it is not a proven performance win.
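Purely as an illustration (in C++, keeping one language for all sketches in this post), here is a hand-written analogue of that guard-then-direct-call structure. The class names MyString/MyChar and the helper guardedEquals are made up; the real JIT emits this from runtime type-profile statistics, not from source code.

```cpp
struct Obj {
    virtual bool equals(const Obj& other) const = 0;
    virtual ~Obj() = default;
};

struct MyString : Obj {
    bool equals(const Obj& other) const override { return this == &other; }
};
struct MyChar : Obj {
    bool equals(const Obj& other) const override { return this == &other; }
};

// What guarded devirtualization conceptually produces for obj1.equals(obj2):
bool guardedEquals(const Obj& obj1, const Obj& obj2) {
    // fast path: guard on the profiled type, then a direct, inlinable call
    if (const MyString* s = dynamic_cast<const MyString*>(&obj1))
        return s->MyString::equals(obj2);   // static binding, no vtable lookup
    // slow path: ordinary virtual dispatch through the vtable
    return obj1.equals(obj2);
}
```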

string comparison is costly #loop-unroll

P120 [[ algos in a nutshell ]] claims that string comparison is considered expensive.

String comparison requires byte-by-byte comparison in a loop. Sometimes people need to write this loop by hand in assembly, and sometimes unroll the loop to fill the instruction pipeline.

I have no reason to doubt the authors, and I believe these techniques have proven effective in practice.
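A sketch of the plain loop versus a 4x-unrolled version. The unrolled variant assumes fixed-length buffers whose length is a multiple of 4, purely for illustration.

```cpp
#include <cstddef>

// Naive byte-by-byte comparison of two NUL-terminated strings.
int cmpNaive(const char* a, const char* b) {
    while (*a && *a == *b) { ++a; ++b; }
    return (unsigned char)*a - (unsigned char)*b;
}

// Same comparison with the loop manually unrolled 4x, assuming both
// buffers are exactly n bytes long and n is a multiple of 4.
int cmpUnrolled4(const char* a, const char* b, std::size_t n) {
    for (std::size_t i = 0; i < n; i += 4) {
        if (a[i]   != b[i])   return (unsigned char)a[i]   - (unsigned char)b[i];
        if (a[i+1] != b[i+1]) return (unsigned char)a[i+1] - (unsigned char)b[i+1];
        if (a[i+2] != b[i+2]) return (unsigned char)a[i+2] - (unsigned char)b[i+2];
        if (a[i+3] != b[i+3]) return (unsigned char)a[i+3] - (unsigned char)b[i+3];
    }
    return 0;
}
```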

translation lookaside buffer #stats

  • a.k.a Address Translation Cache. The TLB lets the processor very quickly convert virtual addresses to physical addresses.
  • TLB is a cache for the big, slow page table
  • A typical entry in TLB is a pair of {virtual -> physical addresses}. In contrast,
  • A typical entry in an L1 cache is a mapping of {physical address -> payload}.
  • You can hit both caches!
  • Both caches sit between the processor and main memory
  • each hardware system has one or more TLBs
  • a TLB-miss can be handled in hardware or in the kernel
  • typical miss probability — 0.01% to 1%
  • typical miss latency (penalty) — 10 to 100 clock cycles to read the page table
  • typical hit latency: 0.5 to 1 clock cycle

If a TLB hit takes 1 clock cycle, a miss takes 30 clock cycles, and the miss rate is 1%, the effective memory cycle rate is an average of 1 × 0.99 + (1 + 30) × 0.01 = 1.30 i.e. 1.30 clock cycles per memory access.
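The same weighted-average arithmetic as a tiny helper (names are mine):

```cpp
#include <cstdio>

// effective = hit_cost * (1 - miss_rate) + (hit_cost + miss_penalty) * miss_rate
double effectiveCycles(double hitCycles, double missPenaltyCycles, double missRate) {
    return hitCycles * (1.0 - missRate) + (hitCycles + missPenaltyCycles) * missRate;
}

int main() {
    std::printf("%.2f cycles per memory access\n", effectiveCycles(1, 30, 0.01)); // 1.30
}
```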

https://stackoverflow.com/questions/5440128/thread-context-switch-vs-process-context-switch says the TLB favors a thread context-switch over a process context-switch: the TLB gets flushed during a process switch, but not during a thread switch.

 

CPU(data)cache prefetching

https://stackoverflow.com/questions/1950878/c-for-loop-indexing-is-forward-indexing-faster-in-new-cpus — the top answer is concise. I think the specific observations may not be relevant in x years but the principles are.

  • adjacent cache line (ACL) prefetcher — simple to understand
  • cpu can detect streams of memory accesses in forward or backward directions

Note L1/L2/L3 caches are considered part of the CPU even if some of them are physically outside the microprocessor.
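A sketch contrasting a prefetcher-friendly sequential scan with a large-stride scan over the same data (the 4096 stride is an arbitrary illustration):

```cpp
#include <cstddef>
#include <vector>

// Sequential, forward traversal: the access pattern hardware prefetchers
// (including the adjacent-cache-line prefetcher) detect most easily.
long long sumSequential(const std::vector<int>& v) {
    long long total = 0;
    for (std::size_t i = 0; i < v.size(); ++i) total += v[i];
    return total;
}

// Large-stride traversal: touches a new cache line on almost every access,
// which a stream prefetcher has a much harder time keeping up with.
long long sumStrided(const std::vector<int>& v, std::size_t stride = 4096) {
    long long total = 0;
    for (std::size_t start = 0; start < stride; ++start)
        for (std::size_t i = start; i < v.size(); i += stride) total += v[i];
    return total;
}
```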

[20] how many cpu cycles per instruction

A 2019 GS-Hongkong interviewer pointed out that very few instructions can complete within one clock cycle. Indeed, only simple add/move etc can average about one clock cycle, and only thanks to pipelining:

https://www.d.umn.edu/~gshute/arch/performance-equation.xhtml#clocks-per-instruction says “Without instruction-level parallelism, simple instructions usually take 4 or more cycles to execute…. Pipelining (overlapping execution of instructions) can bring the average for simple instructions down to near 1 clock per instruction. Superscalar pipelining (issuing multiple instructions per cycle) can bring the average down to a fraction of a clock per instruction”

https://www.agner.org/optimize/instruction_tables.pdf is very detailed and up to date.

http://ithare.com/infographics-operation-costs-in-cpu-clock-cycles/ is an info-graphic

  • integer multiplication 3-6 cycles
  • integer division 12-44 cycles
  • L1 cache hit: 3-4
  • L2 cache hit: 10-12, about 3 times slower
  • L3 cache hit on the same core: 30-70, about 3 times slower
  • (L3 cache hit across sockets: 100-300, about 3 times slower )
  • main RAM read or L1/2/3 cache miss: 100-150, about 3 times slower than L3
  • ..main RAM read across socket : 300-500, about 3 times slower
  • CAS on single-core hardware and hitting L1 only: 15-30 cycles
  • .. about 5 times slower than L1 hit in Single-threaded-mode.
  • CAS on multi-core across sockets: 100-300 cycles, guesstimated by the author. Same as L3 hit across socket.
  • c/c++ function call: 15-30 cycles
  • c++ virtual function call: 30-60 cycles
  • heap/free-store/DAM {allocation+deallocation} pair for small objects: 200-500 cycles, more than 100 times slower than integer add/subtract (1 cycle)
  • thread context switch (including d-cache and i-cache invalidation): 10,000 – 1 million cycles

Note a socket is the seat on the mainboard to hold a physical chip. The seat has pins sticking out like an octopus. One socket holds one chip, which can hold multiple cores.
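For rough cycle counts on x86 I can read the time-stamp counter directly (__rdtsc from x86intrin.h on GCC/Clang). This sketch ignores frequency scaling, core migration and out-of-order effects, so it is only indicative.

```cpp
#include <cstdio>
#include <x86intrin.h>   // __rdtsc()

int main() {
    unsigned long long start = __rdtsc();

    volatile long long acc = 0;                            // volatile: keep the loop
    for (long long i = 0; i < 1'000'000; ++i) acc += i;    // some simple work

    unsigned long long cycles = __rdtsc() - start;
    std::printf("~%llu reference cycles, ~%.1f per iteration\n",
                cycles, cycles / 1e6);
}
```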

subverting kernel’s resource-allocation – a few eg

[[Linux sys programming]] explains several techniques to subvert OS resource-allocation decisions. Relevant to performance-critical apps.

P275 mlockall() — mlock(): prevent paging, in real-time apps

P173 sched_setaffinity(). A strongly cache-sensitive app could benefit from hard affinity (stronger than the default soft affinity), which prevents the kernel scheduler from migrating the process to another processor.
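A minimal sketch combining both calls (the CPU number 2 is arbitrary; mlockall typically needs CAP_IPC_LOCK or a raised memlock ulimit; compile as C++ with g++, which defines _GNU_SOURCE for the CPU_* macros):

```cpp
#include <cstdio>
#include <sched.h>       // sched_setaffinity, CPU_ZERO, CPU_SET
#include <sys/mman.h>    // mlockall

int main() {
    // Lock all current and future pages into RAM: no paging for this process.
    if (mlockall(MCL_CURRENT | MCL_FUTURE) != 0)
        std::perror("mlockall");

    // Hard affinity: pin this process to CPU 2 so the scheduler never migrates it.
    cpu_set_t set;
    CPU_ZERO(&set);
    CPU_SET(2, &set);
    if (sched_setaffinity(0 /* this process */, sizeof(set), &set) != 0)
        std::perror("sched_setaffinity");

    // ... latency-critical work runs here, on CPU 2, never paged out ...
    return 0;
}
```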

[[The Linux Programmer’s Toolbox]]. A write-only variable will be removed by the compiler’s optimizer, but such a variable could be useful to a debugger. I read somewhere that you can mark it volatile to keep it — subversive.

Any way to prevent “my” data or instructions from leaving the L1/L2 cache?

Any way to stay “permanently” in the driver’s seat, subverting the thread scheduler’s time-slicing?

cache-miss in(CPU hogging)hot-function

P245 [[Optimizing Linux Perf]] (2005) points out this “intriguing” scenario. Here are the investigation steps to identify it —

First, we use _oprofile_ to identify the function(s) taking up the most application time. In other words, the process is spending a large portion (I would imagine 5%+) of its cpu cycles in this function. The cycles could be spent in one lengthy call or a million quick re-entries; either way, this would be a hotspot. Then we run oprofile/valgrind(cachegrind)/kcachegrind on the same process, and check whether the hot function generates a high rate of cache misses.

The punch line – the high cache misses could be the cause of the observed process hogging. I assume the author has experienced the same but I’m not sure how rare or common this scenario is.

Some optimization guy in an HFT shop told me main memory is now treated as IO, so cache misses are taken seriously. http://programmers.stackexchange.com/questions/142864/writing-low-latency-java mentions that “Cache misses are your biggest cost to performance. Use algorithms that are cache friendly.”

By the way, an instruction cache miss is worse than a data cache miss. My friend Shanyou said the same.

empty while(true) loop hogging CPU

Someone (barclays?) gave me a basic tuning quiz —

Q: what cpu utilization will you see if a program executes an empty while(1==1){} ?

A: On a 16-core machine, 3 instances of this program each take up 4% of the aggregate CPU according to windows taskmgr. I think 4% means hogging one entire core.

A: on my RedHat linux, the same program has 99.8% CPU usage meaning one entire core.

A: On a dual-core, if I start 2 instances, each takes up about 50% i.e. one entire core, keeping all other processes off both cores. With 3 instances, system becomes visibly slow. Taskmgr itself becomes a bit hard to use and reports about 30%-40% by each instance.

I think this would count as a cpu-intensive job.
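The quiz program itself is trivial; the volatile flag just keeps the loop well-defined and un-removable under C++ forward-progress rules.

```cpp
// Busy loop: never blocks, never yields, so it pins one core at ~100% CPU.
int main() {
    volatile bool keepSpinning = true;     // volatile: loop cannot be optimized away
    while (keepSpinning) { /* empty body: pure CPU burn */ }
}
```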

##hidden cost, stolen CPU cycles: latency engineering

Update — “kernel thread” means something special in linux…

—-

Latency /engineering/ and optimization is all about implicit operations, hidden costs and “stolen” cpu cycles: incur the minimum CPU cost for a task.

  • eg (practical!): function A calling B which then calls C is one more stack frame than A calling C directly, unless B is actually inlined.
  • eg: boxing/unboxing — extra work for cpu. Also creates garbage for the GC.
  • eg: one thread switching between multiple sockets — more cpu workload than one thread dedicated to one (exchange) socket
  • eg: un-contended lock acquisition — more work than no-lock, partly due to memory fence
  • eg: garbage collection – competes for CPU at a bad time. Usually, if there’s no OOM, then the GC thread is very low priority and won’t slow down a critical task.
  • eg: page swap as part of virtual memory systems — competes for CPU
  • eg: vtbl lookup — adds a few clock cycles per function call. To be avoided inside the most critical apps in an exchange. Therefore c++ developers favor templates over virtuals
  • eg: RTTI — latency sensitive apps generally disable RTTI early on — during compilation
  • eg: null terminator at the end of c-strings — inflates network traffic by x% unnecessarily
  • eg: inline – trims cpu cycles.
  • eg: one kernel thread mapping to multiple user threads — the fastest system would create no more user threads than the available cpu (i.e. kernel) threads, so the thread scheduler doesn’t need to run at all. I feel this is possible only on a dedicated machine, but such a machine must then communicate with peripheral machines, involving serialization and network latency.

For a dedicated kernel thread servicing a busy stream of tasks, we need to consider what happens if the tasks come in bursts and the thread becomes temporarily idle. One idea is to suspend the thread in wait(), but I believe a kernel thread can’t be suspended. In the kernel, a common approach is to simply keep the thread spinning. The assumption is that one cpu is exclusively dedicated to this thread, so that cpu couldn’t do anything else anyway even if we suspended the thread. A user-space sketch of this busy-spin pattern follows.
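Here is a user-space sketch (with made-up names and a toy task protocol) of that pattern: a worker pinned to one core that never blocks, only spins.

```cpp
#include <atomic>
#include <cstdio>
#include <sched.h>       // sched_setaffinity (Linux)
#include <thread>

// Dedicated, busy-spinning worker: pinned to one core, never blocks; when no
// task is pending it simply spins, assuming the core is reserved for it.
std::atomic<int>  pendingTask{0};        // 0 = idle, otherwise a task id (toy protocol)
std::atomic<bool> shuttingDown{false};

void pinToCpu(int cpu) {
    cpu_set_t set;
    CPU_ZERO(&set);
    CPU_SET(cpu, &set);
    sched_setaffinity(0, sizeof(set), &set);   // pid 0 = the calling thread
}

void spinWorker() {
    pinToCpu(3);                               // arbitrary core for illustration
    while (!shuttingDown.load(std::memory_order_acquire)) {
        int task = pendingTask.exchange(0, std::memory_order_acquire);
        if (task != 0)
            std::printf("processed task %d\n", task);
        // else: spin; no yield, no sleep, no blocking syscall
    }
}

int main() {
    std::thread worker(spinWorker);
    for (int i = 1; i <= 3; ++i) {
        pendingTask.store(i, std::memory_order_release);
        while (pendingTask.load(std::memory_order_acquire) != 0) { /* wait for worker */ }
    }
    shuttingDown.store(true, std::memory_order_release);
    worker.join();
}
```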