label – tuning
Here's a rule of thumb from P244 [[optimizing linux performance]].
The “time” command can show the user/kernel split. If kernel time > 25% of total CPU time, it's excessive and warrants investigation. The investigation is fairly standard – use strace (e.g. strace -c) to rank the most time-consuming system calls.
root – privilege required to start/stop the daemon, but the query tools don’t need root
dtrace – comparable in power. I think these two are the most powerful profilers on Solaris/Linux.
statistical – it's a sampling profiler, so results can be partially wrong. Example – the call graph.
Per-process – profiling is possible. I think default is system-wide.
CPU counters – uses hardware performance counters, so the impact on running apps is low – lower than “attachment” profilers.
userland – or kernel: both can be profiled
recompile – not required. Other profilers require recompiling.
kernel support – must be compiled in.
oprofiled – the daemon. Note there’s no executable named exactly “oprofile”.
[[Optimizing Linux performance]] has detailed usage examples of oprofile. [[linux programmer’s toolbox]] has decent coverage too.
Based on P244 [[linux sys programming]]
I am not quite sure about the use cases, but suppose a huge file needs to be loaded into memory by 2 processes. If read-only, savings are possible by sharing the memory pages between the 2 processes. Basically, the 2 virtual address spaces map to the same physical pages.
Now suppose one of them — Process-A — needs to write to that memory. Copy-on-write takes place so Process-B isn't affected. The write is “intercepted” by the kernel, which transparently copies the page before committing the write to the new copy.
The fork() system call is another user of the copy-on-write technique.
Use cases?
* perhaps a large library, where the binary code must be loaded into memory
* memory-mapped file perhaps
I was told Google engineers discuss algorithm efficiency everyday, including lunch time. I guess the intensity could be even higher at HFT shops.
I feel latency due to the software algorithm might be a small component of overall latency. However, the bigger portions – network latency, disk writes(?), serialization for transmission(?), … – may be unavoidable, so the software algorithm might be the only part we can actually tune.
Further, it's also possible that all the competitors are already using the same tricks to minimize network latency. In that case the competitive advantage is in the software algorithm.
I feel algorithm efficiency could be more permanent and fundamental than threading. If I compare 2 algorithms A1 and A2 and find A2 to be twice as fast as A1, then no matter what threading or hardware solutions I apply, A2 still beats A1.