how2gauge TPS capacity@mkt-data engine

Pump in an artificial feed. Increase the input TPS rate until either of these happens:

  1. CPU utilization hits 100%
  2. messages get dropped

The input TPS just before that point is the highest acceptable rate, i.e. the “capacity” of this one process.
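The ramp-up procedure above can be sketched as a loop. This is only a minimal illustration: `probe` is a hypothetical callback standing in for "run the artificial feed at this rate, then report CPU utilization and dropped-message count", and `fake_probe` is a made-up model, not a measurement of any real engine.

```python
from typing import Callable, Tuple


def find_capacity(probe: Callable[[int], Tuple[float, int]],
                  start_tps: int, step_tps: int, max_tps: int) -> int:
    """Return the highest tested rate that neither saturated the CPU
    nor dropped messages (0 if even the first rate failed)."""
    capacity = 0
    rate = start_tps
    while rate <= max_tps:
        cpu_pct, dropped = probe(rate)
        if cpu_pct >= 100.0 or dropped > 0:
            break  # one of the two stop conditions was hit
        capacity = rate
        rate += step_tps
    return capacity


def fake_probe(rate: int) -> Tuple[float, int]:
    """Hypothetical stand-in: CPU reaches 100% at 600k TPS,
    and messages start dropping above 650k TPS."""
    cpu_pct = min(100.0, rate / 6000.0)
    dropped = max(0, rate - 650_000)
    return cpu_pct, dropped


print(find_capacity(fake_probe, 100_000, 50_000, 1_000_000))
```

With a real feed, the step size decides how precisely the capacity is located, so a coarse pass followed by a finer pass near the breaking point is one reasonable approach.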

Note each feed has its own business logic complexity level, so the same software may have 600k TPS capacity for a simple Feed A but only 100k TPS for a complex Feed B.

Also, in my experience the input interface is more often the bottleneck than the output interface. If System X feeds into System Y, then we want to have System X pumping at 50% of Y’s capacity. In production we also monitor the live TPS rate. The difference between that and the capacity is the “headway”.
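As a small worked example of the headway arithmetic and the 50% rule, here is a sketch. The function names and the sample numbers are illustrative assumptions, not taken from any monitoring system.

```python
def headway(capacity_tps: float, live_tps: float) -> float:
    """Spare throughput: measured capacity minus the live rate."""
    return capacity_tps - live_tps


def within_target(upstream_tps: float, downstream_capacity_tps: float,
                  target_fraction: float = 0.5) -> bool:
    """True if the upstream system pumps at or below the target
    fraction (50% by default) of the downstream capacity."""
    return upstream_tps <= target_fraction * downstream_capacity_tps


# Illustrative numbers only: capacity 600k TPS, live rate 250k TPS.
print(headway(600_000, 250_000))
print(within_target(250_000, 600_000))
```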

socket stats monitoring tools – on-line resources

This is a rare interview question, perhaps asked 1 or 2 times. I don’t want to overspend.

In ICE RTS, we use built-in statistics modules written in C++ to collect the throughput statistics.
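To give a rough idea of what an in-process throughput statistics module does, here is a minimal sketch. It is in Python for brevity and is not the C++ module mentioned above; the class and method names are hypothetical.

```python
from collections import Counter


class ThroughputStats:
    """Counts messages per whole second so TPS can be reported later."""

    def __init__(self) -> None:
        self._per_second = Counter()

    def record(self, timestamp: float, count: int = 1) -> None:
        # Bucket by the integer second in which the message arrived.
        self._per_second[int(timestamp)] += count

    def peak_tps(self) -> int:
        return max(self._per_second.values(), default=0)

    def average_tps(self) -> float:
        if not self._per_second:
            return 0.0
        # Span covers idle seconds between the first and last message too.
        span = max(self._per_second) - min(self._per_second) + 1
        return sum(self._per_second.values()) / span


stats = ThroughputStats()
for ts in (10.1, 10.9, 11.5):
    stats.record(ts)
stats.record(13.0, count=2)
print(stats.peak_tps(), stats.average_tps())
```

In a live process the timestamps would come from a clock call on the receive path, which is itself a cost worth measuring.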

If you don’t have source code to modify, I guess you need to rely on standard tools.

strace, ltrace, truss, oprofile, gprof – random notes

[[optimizing Linux performance]] has usage examples of ltrace.
I think truss is the oldest and most well-known.
Q: what values do the others add?
truss, strace and ltrace all show function arguments, though objects behind pointers will not be “dumped”. (Incidentally, I guess apptrace has a unique feature to dump arguments of struct types.)
strace/ltrace are similar in many ways…
ltrace is designed for shared library tracing, but can also trace syscalls.
truss is designed for syscalls, but “-u” covers shared libraries.
oprofile — can measure time spent and hit rates on library functions

oprofile – phrasebook

root – privilege required to start/stop the daemon, but the query tools don’t need root

dtrace – comparable. I think these two are the most powerful profilers on solaris/linux.

statistical – results can be partially wrong. Example – call graph.

Per-process – profiling is possible. I think default is system-wide.

CPU – counters (hardware counters). Therefore, low impact on running apps, lower than “attachment” profilers.

userland – or kernel : both can be profiled

recompile – not required. Some other profilers (e.g. gprof) require recompiling.

kernel support – must be compiled in.

oprofiled – the daemon. Note there’s no executable named exactly “oprofile”.

[[Optimizing Linux performance]] has detailed usage examples of oprofile. [[linux programmer’s toolbox]] has decent coverage too.

cache-miss in(CPU hogging)hot-function

P245 [[Optimizing Linux Perf]] (2005) points out this “intriguing” scenario. Here are the investigation steps to identify it —

First, we use _oprofile_ to identify the function(s) taking up the most application time. In other words, the process is spending a large portion (I would imagine 5%+) of CPU cycles in this function. The cycles could be spent in one lengthy entry or a million quick re-entries; either way, this is a hotspot. Then we use oprofile/valgrind(cachegrind)/kcachegrind on the same process, and check whether the hot function generates a high number of cache misses.

The punch line – the high cache misses could be the cause of the observed process hogging. I assume the author has experienced the same but I’m not sure how rare or common this scenario is.

Some optimization guy in a HFT shop told me main memory is now treated as IO, so cache misses are treated seriously. Another source mentions that “Cache misses are your biggest cost to performance. Use algorithms that are cache friendly.”

By the way, instruction cache miss is worse than data cache miss. My friend Shanyou also said the same.

## c++ instrumentation tools #mostly unfamiliar

(Each star means one endorsement)

oprofile **
gprof *
callgrind (part of valgrind)
sar *
strace *
(Compared to strace, I feel there are more occasions when ltrace is useful.)

* Pin threads to CPUs. This prevents threads from moving between cores and invalidating caches etc. (sched_setaffinity)
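A minimal sketch of pinning, using Python's wrapper around the same sched_setaffinity call. This is Linux-only, and the helper name `pin_current_thread` is my own; a C++ program would call sched_setaffinity() or pthread_setaffinity_np() directly.

```python
import os


def pin_current_thread(cpu: int) -> set:
    """Pin the caller (pid 0 means "myself") to a single CPU.
    Returns the previous affinity set so it can be restored."""
    previous = os.sched_getaffinity(0)
    os.sched_setaffinity(0, {cpu})
    return previous


# Pick a CPU we are already allowed to run on, pin, then restore.
cpu = min(os.sched_getaffinity(0))
saved = pin_current_thread(cpu)
print("pinned to", os.sched_getaffinity(0))
os.sched_setaffinity(0, saved)
```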


empty while(true) loop hogging CPU

Someone (barclays?) gave me a basic tuning quiz —

Q: what cpu utilization will you see if a program executes an empty while(1==1){} ?

A: On a 16-core machine, 3 instances of this program each take up 4% of the aggregate CPU according to windows taskmgr. I think 4% means hogging one entire core (though strictly 1/16 of the aggregate would be 6.25%).

A: on my RedHat linux, the same program has 99.8% CPU usage meaning one entire core.

A: On a dual-core, if I start 2 instances, each takes up about 50% i.e. one entire core, keeping all other processes off both cores. With 3 instances, system becomes visibly slow. Taskmgr itself becomes a bit hard to use and reports about 30%-40% by each instance.

I think this would count as a CPU-intensive job.
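The quiz can be reproduced in a few lines by comparing the CPU time consumed against the wall-clock time elapsed. This is only a sketch: the loop is not literally empty because it has to check the clock in order to stop, and the 0.2-second duration is an arbitrary choice.

```python
import time


def busy_loop_cpu_share(duration_s: float = 0.2) -> float:
    """Spin for duration_s and return CPU time used / wall time elapsed.
    A value near 1.0 means the loop hogged one entire core."""
    wall_start = time.perf_counter()
    cpu_start = time.process_time()
    while time.perf_counter() - wall_start < duration_s:
        pass  # spin: nothing to do except check the clock
    cpu_used = time.process_time() - cpu_start
    wall_used = time.perf_counter() - wall_start
    return cpu_used / wall_used


print(f"share of one core: {busy_loop_cpu_share():.2f}")
```

On an otherwise idle machine the ratio should come out close to 1.0. It drops when other processes compete for the same core.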

[[linux programmer’s toolbox]]

MALLOC_CHECK_ is a glibc env var
–debugger on optimized code

P558 Sometimes without compiler optimization performance is unacceptable.

To prevent optimizer removing your variables, mark them volatile.

An inline function may not appear in call stack. Consider “-fno-inline”

–P569 double-free may not show issues until the next free() or malloc()

–P470 – 472 sar command
can show per-process performance data

can monitor network devices

–P515 printf + macros for debugging

buffering behavior differs between terminal and log files: stdout is typically line-buffered on a terminal but fully buffered when redirected to a file
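The buffering difference matters for debug prints because a crash can lose whatever is still sitting in a full buffer. Below is a minimal sketch of a debug helper that flushes explicitly. It is in Python rather than the C printf the book discusses, and the helper names are my own.

```python
import sys


def debug(msg: str, stream=None) -> None:
    """Write one debug line and flush, so it shows up immediately
    whether the stream is a terminal or a log file."""
    if stream is None:
        stream = sys.stderr
    stream.write(msg + "\n")
    stream.flush()


def describe_stdout() -> str:
    """Tell whether stdout is attached to a terminal right now."""
    return "terminal" if sys.stdout.isatty() else "file or pipe"


debug(f"stdout is a {describe_stdout()}")
```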

2kinds@essential developer tools #WallSt+elsewhere

Note: There are also “common investigative” tools, but for now I will ignore them. The reason is that most developers have only superficial knowledge of them, so the myriad of advanced features actually fall into #2).

Note: in the get-things-done stage, performance tools are much less “useful” than logic-revealing tools. In my experience, these tools seldom shed light on perf problems in DB, java, or MOM. Performance symptoms are often resolved by understanding logical flow. Simplest and best tools simply reveal intermediate data.
I think most interview questions on dev tools largely fall into these two categories. Their value is more “real” than other tools.

1) common tools, either indispensable or for productivity
* turn on/off exception breakpoint in VS
* add references in VS
* resolve missing assembly/references in VS
* app.config in VS projects
* setting classpath in IDE, makefile, autosys…
* eclipse plugin to locate classes in jars anywhere in windows
* literally dozens of IDE pains
* cvs,
* maven, ant,
* MS excel
* junction in windows
* vi
* bash
* grep, find, tail
* browser view-source

2) Specialized investigative/analytical/instrumentation tools — offers RARE insights. RARELY needed but RARE value. Most (if not all) of the stronger developers I met have these secret weapons. However, most of us don’t have bandwidth or motivation to learn all of these (obscure) features, because we don’t know which ones are worth learning.
* JMS — browser, weblogic JMS console…
* compiler flags
* tcpdump, snoop, lsof
* truss
* core dump analysis
* sar, vmstat, perfmeter,
* query plan
* sys tables, sp_lock, sp_depend
* set statistics io
* debuggers (IDE)
* IDE call hierarchy
* IDE search
* thread dump
* bytecode inspector, decompiler
* profilers
* leak detector
* jconsole
* jvmstat tools — visualgc, jps,..
* any tools for code tracing

Other productivity skills that no software tools can help:
* log analysis
* when I first came to autoreo, I quickly characterized a few key tables.

RAM insufficient@@ telltale signs #scanRate

Scan Rate is the most important indicator. There’s a threshold to look for. It indicates the “number of pages per second scanned by the page stealing daemon”. This applies to unix but not windows.

Paging (due to insufficient Physical memory) is another important indicator, correlated with Scan Rate. There’s a threshold to look for.
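On Linux, a comparable figure can be derived from the `pgscan_*` counters in /proc/vmstat by sampling twice and dividing by the interval. This is a rough sketch under assumptions: the counter names vary across kernel versions, and treating the kswapd plus direct-reclaim counters as the "scan rate" is my interpretation, not a definition from the notes above.

```python
def pages_scanned(vmstat_text: str) -> int:
    """Sum the kswapd and direct-reclaim page-scan counters found in
    text formatted like /proc/vmstat ("name value" per line)."""
    total = 0
    for line in vmstat_text.splitlines():
        parts = line.split()
        if len(parts) != 2:
            continue
        key, value = parts
        if key == "pgscan_direct_throttle":
            continue  # counts throttle events, not pages scanned
        if key.startswith(("pgscan_kswapd", "pgscan_direct")):
            total += int(value)
    return total


def scan_rate(before: str, after: str, interval_s: float) -> float:
    """Pages scanned per second between two samples."""
    return (pages_scanned(after) - pages_scanned(before)) / interval_s


# Made-up samples, taken 2 seconds apart.
sample_1 = "nr_free_pages 1000\npgscan_kswapd 100\npgscan_direct 20\n"
sample_2 = "nr_free_pages 900\npgscan_kswapd 400\npgscan_direct 120\n"
print(scan_rate(sample_1, sample_2, 2.0))
```

In real use the two samples would be the contents of /proc/vmstat read a few seconds apart.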

let’s find out What the system is doing

A real, practical challenge in a low-latency, market-data system is to quickly find out “What’s the system doing”. Log files usually have a lot of details, but we also want to know what files/sockets our process is accessing, what kind of data it is reading and writing.

On Solaris, truss -r (read buffers) or -w (write buffers) can reveal the actual data transferred on chosen file descriptors.

if write() syscall is stuck, then perhaps disk is full.

lsof reads /proc to get open sockets/files
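On Linux, the same information is available to any program by walking /proc/&lt;pid&gt;/fd. A minimal sketch follows (Linux-only). The function name `open_files` is hypothetical, and real lsof does considerably more than this.

```python
import os


def open_files(pid="self") -> dict:
    """Map each open file descriptor of a process to what it points at,
    by reading the symlinks under /proc/<pid>/fd."""
    fd_dir = f"/proc/{pid}/fd"
    result = {}
    for name in os.listdir(fd_dir):
        try:
            result[int(name)] = os.readlink(os.path.join(fd_dir, name))
        except OSError:
            continue  # descriptor was closed after the directory was listed
    return result


for fd, target in sorted(open_files().items()):
    print(fd, target)
```

Regular files show up as paths, while sockets and pipes show up as entries like `socket:[inode]` and `pipe:[inode]`.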


dtrace/truss, ptrace /proc/sys basics

On FreeBSD, truss works by stopping and restarting the process being monitored via ptrace().
On Solaris, truss works by stopping and restarting the process being monitored via /proc. Dtrace doesn’t stop/start a process, and therefore adds less overhead.
/proc is readable by cat but not less.
/proc is mostly read-only, but on Linux /proc/sys is writable! Wikipedia says

By using ptrace (the name is an abbreviation of “process trace”) one process can control another, enabling the controller to manipulate the internal state of its target. ptrace is used by debuggers such as gdb and dbx.

By attaching to another process using the ptrace call, a tool can single-step through the target’s code. The ability to write into the target’s memory allows not only its data store to be changed, but also the application’s own code segment, allowing the controller to install breakpoints and patch the running code of the target.

ptrace is available as a system call on AIX, FreeBSD, Mac OS X, Linux, and HPUX up to 11. On Solaris, ptrace is implemented as a library call, built on top of Solaris kernel’s procfs filesystem; Sun notes that ptrace on Solaris is intended for compatibility, and recommends that new implementations use the richer procfs.

3 types of tuning practitioners.

1) A 3-year “experienced coder” may know how to produce solid functionalities with good enough performance most of the time, but knows close to nothing about why his code has this performance. He has less than 1% of the written knowledge on tuning.

2) A “bookish tuning professional” may not know the limits of the written theory. Her theoretical knowledge could be non-trivial, based on many tuning books and online articles.

3) A “questioning practitioner” knows only the essential theory but relies heavily on utilities to /interrogate/ the target system and verify the tuning theories. As a questioning practitioner, he questions the tuning utilities as well.

Profiling helps you pinpoint bottlenecks by calculating the tcost (processing time) at each stage.

Look at an abstract/generic transaction processing system. For each one of millions of transactions, the total processing time is simply the sum of the processing times at each stage. This is an extremely simple, well-understood and powerful truth. Every tuning approach relies on this truth. This is one basis of profiling.

Common tuning utilities:

EXPLAIN, showplan…
Perl profiling
JVM profiling jconsole, JRA
GC monitor
PHP @@ apache benchmark and P316 [[ programming php ]]
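The sum-of-stages observation above is what a hand-rolled profiler exploits: time each stage, add them up, and look at the largest term. Here is a minimal sketch, where the stage names and the toy pipeline are made up for illustration.

```python
import time
from typing import Callable, Dict, List, Tuple


def profile_stages(stages: List[Tuple[str, Callable]], payload):
    """Run the payload through each stage in order and record the
    tcost (processing time in seconds) of every stage."""
    timings: Dict[str, float] = {}
    for name, func in stages:
        start = time.perf_counter()
        payload = func(payload)
        timings[name] = time.perf_counter() - start
    return payload, timings


def bottleneck(timings: Dict[str, float]) -> str:
    """The stage with the largest tcost."""
    return max(timings, key=timings.get)


# Toy three-stage "transaction": parse, convert, aggregate.
pipeline = [
    ("parse", lambda text: text.split(",")),
    ("convert", lambda fields: [int(f) for f in fields]),
    ("aggregate", sum),
]
result, tcosts = profile_stages(pipeline, "1,2,3")
print(result, sum(tcosts.values()), bottleneck(tcosts))
```

For a single tiny transaction the timings are dominated by noise, so in practice the tcost of each stage would be accumulated over many transactions before drawing conclusions.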