latency favors STM; lockfree is for moderate latency

Now I think real low-latency systems always prefer Single-Threaded Mode (STM). But is it feasible?

  • Nasdaq’s new java-based architecture is STM, including their matching engine.
  • The matching engines in many exchanges/ECNs are STM. Remember FXAll…
  • Rebus and xtap are both STM — very high performance, proven designs.



5 constructs: c++implicit singletons

The #1 most implicit singleton construct in c++ is the ubiquitous “file-scope variable”. Extremely common in my projects.

  • — The constructs below are less implicit, as each uses an explicit keyword to highlight the programmer’s intent
  • keyword “extern” — file-scope variable with extern
    • I seldom need it and don’t feel the need to remember the details… see other blogposts
  • keyword “static” — file-scope static variables
  • keyword “static” within function body — local static variables — have the nice feature of predictable timing of initialization
  • keyword “static” within a class declaration —  static field

~~~~~~  The above are the 5 implicit singleton constructs ~~~~~~
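A minimal sketch of all 5 constructs in one translation unit (the variable names and values are made up for illustration):

```cpp
// #1: plain file-scope variable -- the most implicit singleton
int fileScopeVar = 10;

// #2: "extern" -- normally the single definition lives in one other TU;
// it is defined in the same file here only to keep the sketch self-contained
extern int externVar;
int externVar = 20;

// #3: file-scope static -- internal linkage, one instance per TU
static int fileScopeStatic = 30;

// #4: function-local static -- initialized on first call (predictable timing)
int& localStatic() {
    static int theInstance = 40;
    return theInstance;
}

// #5: static data member of a class
struct Config {
    static int theSetting;
};
int Config::theSetting = 50;
```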

Aha — it’s useful to recognize that when a data type is instantiated many times, i.e. non-singleton usage, it is usually part of a collection, or a local (stack) variable.

Sometimes we have an ambiguous situation: we use one of the constructs above, but instantiate multiple instances of the class. It’s best to document the purpose, like “instance1 for …; instance2 for …”.

kernels(+lang extensions)don’t use c++

Q: why are kernels usually written in C, not c++? The answer underpins the value and longevity of both languages.

I asked Stroustrup. He clearly thinks c++ can do the job. As to why C still dominates, he cited a historical reason. Kernels were written long before C++ was invented.

Aha — I think there are conventions and standard interfaces (POSIX is one of them)… always in C.

I said “The common denominator among various languages is always a C API”. He said that’s also part of what he meant.

benchmark c++^newer languages

C++ can be 5x faster than java if both programs are well-tuned — A ball-park estimate given by Stroustrup.

In many published benchmarks, however, the c++ code is written like java code — lots of pointers, virtual functions, no inlining, perhaps too many heap allocations (STL containers) rather than strictly-stack variables.

Many other benchmarks are similarly questionable. The newer languages out there are usually OO and rely on GC + pointer indirection. A direct translation of their code into C++ produces horribly inefficient c++ code that takes no advantage of the c++ compiler’s powers. An expert c++ developer would rewrite it to avoid virtual functions, favor local variables and inlining, and possibly use compile-time programming. The resulting binary usually becomes comparable in benchmarks. Since the c++ compiler is more sophisticated and has more optimization opportunities, it usually produces faster code.

local_var=c++strength over GC languages

Stroustrup told me c++ code can use lots of local variables, whereas garbage collected languages put most objects on heap.

I hypothesized that whether I have 200 local variables in a function or none, the runtime cost of stack allocation is the same. He said it’s nanosec scale, basically free. In contrast, with heap objects, the biggest cost is allocation; deallocation is also costly.

Aha — at compile-time, the compiler already knows how many bytes are needed for a given stack frame.

insight — I think local variables don’t need pointers, whereas GC languages rely heavily on “indirect” pointers. Since GC often relocates objects, the pointer content needs to be translated to the current address of the target object, and I believe this translation has to be done at run time. This is what I mean by an “indirect” pointer.

insight — STL containers almost always use heap, so they are not strictly “local variables” in the memory sense
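A small sketch of the contrast (function names and values are made up; both functions compute the same result):

```cpp
#include <vector>

// Stack: the compiler computes the whole frame size at compile time, so
// "allocating" 200 locals is the same single stack-pointer adjustment as one.
int sumOnStack() {
    int a = 1, b = 2, c = 3;
    int buf[200] = {};        // 200 ints, still just part of the same frame
    return a + b + c + buf[0];
}

// Heap: every new is a run-time allocator call, and so is the delete.
// An STL container hides the same cost inside itself.
int sumOnHeap() {
    int* p = new int(6);
    std::vector<int> v(200, 0);   // heap allocation inside the container
    int result = *p + v[0];
    delete p;                     // deallocation is also costly
    return result;
}
```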

heap allocation: java Can beat c++

  • case 1 (standard java): you allocate heap memory. After you finish with it you wait for the java GC to clean it up.
  • case 2 (low latency java): you allocate heap memory but disable java GC. Either you hold on to all your objects, or you leave unreachable garbage orbiting the earth forever.
  • case 3 (c++): you allocate heap memory with the expectation of releasing it, so the compiler sets up housekeeping in advance for the anticipated delete(). This housekeeping overhead is somewhat similar to try/catch scaffolding before c++11 ‘noexcept’.

Stroustrup suggested that #2 will be faster than #3, but #3 is faster than #1. I asked “But can’t c++ emulate allocation the way the jvm does?” Stroustrup said C++ is not designed for that. I have seen online posts about this “emulation”, but I would trust Stroustrup more.

  • case 4 (C): C/c++ can sometimes use local variables to beat heap allocation. C programmers use rather few heap allocations, in my experience.

Note the jvm allocator and malloc are both userland allocators, not part of the kernel and usually not using system calls. You can substitute your own malloc. The top answer by Kanze is consistent with what Stroustrup told me.

  • no dynamic allocation is always faster than even the fastest dynamic allocation. Similar to Case 4
  • jvm allocation (without the GC clean-up) can be 10 times faster than c++ allocation. Similar to Case 2^3
    • Q: Is there a free list in JVM allocator? claims

  • c++ Custom allocators managing a pool of fixed-sized objects can beat jvm
  • jvm allocation often requires little more than one pointer addition, which is certainly faster than typical C++ heap allocation algorithms
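To make the “one pointer addition” concrete, here is a hypothetical bump allocator in C++ mimicking the JVM fast path (the struct name and sizes are my own, for illustration). Like Case 2, nothing is freed individually — garbage simply accumulates until the whole arena is discarded:

```cpp
#include <cstddef>

// Hypothetical bump allocator: the allocation fast path is little more
// than one pointer addition, vs a general-purpose malloc that must
// search free lists / size bins.
struct BumpArena {
    alignas(16) unsigned char buf[1 << 16];
    std::size_t next = 0;

    void* allocate(std::size_t n) {
        n = (n + 15) & ~std::size_t(15);            // round up for alignment
        if (next + n > sizeof(buf)) return nullptr; // arena exhausted
        void* p = buf + next;
        next += n;                                  // the entire "algorithm"
        return p;
    }
};
```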


See also posts on jvm portability

  • — Breaking API change is a decision by a library supplier/maintainer.
    • clients are required to make source-code changes just to compile against the library.
    • clients are free to keep using their previous compiler.
  • — Breaking ABI change is a decision by a compiler supplier.
    • clients are free to keep source code unchanged.
    • clients are often required to recompile all source code, including library source code.

API is source-level compatibility; ABI is binary-level compatibility. I feel

  1. jar file’s compile-once-run-anywhere is kind of ABI portability
  2. python, perl… offer API portability at source level

c++ABI #best eg@zbs

Mostly based on

Imagine a client uses libraries from Vendor AA and Vendor BB, among others. All vendors support the same c++ compiler brand, but new compiler versions keep coming out. In this context, unstable ABI means

  • Recompile-all – the client needs libAA and libBB (+ the application) all compiled with the same compiler version; otherwise the binary files don’t agree on some key details.
  • Linker error – LibAA compiled by version 21 and LibBB compiled under version 21.3 may fail to link
  • Runtime error – if they link, they may not run correctly, if ABI has changed.

Vendor’s solutions:

  1. binary releases — for libAA, Vendor AA needs to keep many old binary versions, even if compiler version 1.2 has been retired for a long time, because some clients may still need that libAA build. Many c++ libraries are distributed this way on vendor websites.
  2. Source distribution – Vendor AA may choose to distribute libAA in source form, which has its own maintenance issues.

In a better world,

  • ABI compatible — between compiler version 5.1 and 5.2. Kevin of Macq told me this does happen to some extent.
  • interchangeable parts — the main application (or libBB) can be upgraded to newer compiler version, without upgrading everything else. Main application could be upgraded to use version 5.2 and still link against legacy libAA compiled by older compiler.

(overloaded function) name mangling algorithm — the best-known part of c++ABI. Two incompatible compiler versions would use different algorithms, so LibAA and LibBB will not link correctly. However, I don’t know how c++filt demangles those names across compilers.
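A tiny sketch of why mangling matters (function names made up; the mangled forms shown are what the Itanium C++ ABI, used by gcc and clang, would roughly produce — MSVC uses a completely different scheme, which is one concrete reason binaries from incompatible compilers refuse to link):

```cpp
#include <cstring>

// Two overloads become two distinct linker symbols via name mangling.
// Under the Itanium C++ ABI the symbols look roughly like:
//   price(int)    ->  _Z5pricei
//   price(double) ->  _Z5priced
const char* price(int)    { return "int overload"; }
const char* price(double) { return "double overload"; }

// extern "C" suppresses mangling: the symbol stays plain "price_c"
extern "C" const char* price_c() { return "unmangled"; }
```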

No ABI between Windows and Linux binaries, but how about gcc vs llvm binaries on Intel linux machine? Possible according to

Java ABI?


11 notable features added to c++ briefly mentions

  • [90] exception handling
  • [90] templates
  • [98] cast operators
  • [98] dynamic_cast and typeid()
  • [98] covariant return type
  • [07] boost ref wrapper .. see std::reference_wrapper
  • [11] GarbageCollector interface .. See c++ GC interface
  • [11] std::next(), prev(), std::begin(), std::end() .. see favor std::begin(arrayOrContainer)
  • [11] exception_ptr? not sure how useful
  • [14] shared_lock — a RW lock
  • [14] shared_timed_mutex .. see try_lock: since pthreads
  • [14] std::exchange — comparable to std::swap() but doesn’t offer the atomicity of std::atomic_exchange()
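A quick sketch of the last item, std::exchange (the counter name and values are made up). The typical use is “store a new value, get the old one back in a single call” — handy in move constructors; unlike std::atomic_exchange it is not atomic:

```cpp
#include <utility>

int nextOrderId = 7;   // hypothetical counter, for illustration only

int takeOrderId() {
    // writes the new value, returns the old one -- but NOT atomically
    return std::exchange(nextOrderId, nextOrderId + 1);
}
```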

scaffolding around try{}block #noexcept

[[ARM]] P358 says that all local non-static objects on the current call stack fully constructed since the start of the try-block are “registered” for stack unwinding. The registration is fine-grained in terms of partial destruction —

  • for any array with 3 out of 9 objects fully constructed, the stack unwinding would only destruct those 3
  • for a half constructed composite object with sub-objects, all constructed sub-objects will be destructed
  • Any half-constructed object is not registered since the dtor would be unsafe.

I guess this registration is an overhead at run time.

For the stack objects created in a noexcept function, this “registration” is not required, so if an exception escapes, the compiler may or may not call their destructors.
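The fine-grained rules above can be observed directly. A sketch (class names and the trace string are my own) showing that a half-constructed composite is not destructed, but its fully-built sub-objects are, along with every fully-constructed local in the try-block:

```cpp
#include <stdexcept>
#include <string>

static std::string trace;  // records ctor/dtor calls for illustration

struct Part {
    char id;
    explicit Part(char c) : id(c) { trace += c; }
    ~Part() { trace += '~'; trace += id; }
};

struct Composite {
    Part a{'a'};   // fully constructed before the ctor body runs
    Part b{'b'};
    Composite() { throw std::runtime_error("ctor fails after a,b are built"); }
    // ~Composite never runs (the object was never fully constructed),
    // yet the fully-built sub-objects a and b ARE destructed during unwinding.
};

std::string unwindDemo() {
    trace.clear();
    try {
        Part outer('o');   // fully constructed -> "registered" for unwinding
        Composite c;       // throws mid-construction
    } catch (const std::runtime_error&) {}
    return trace;          // "oab~b~a~o"
}
```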

— Stroustrup hints at the scaffolding:

  • noexcept is an efficiency feature — widely and systematically used in the standard library to improve performance
  • noexcept is crude and “very efficient”
  • dtor may not be invoked upon stack unwinding
  • stack unwinding may not happen at all



3 real overheads@vptr #inline

Suppose your class Trade has virtual functions and a comparable class Order has no virtual functions. What are the specific runtime overheads of the vptr/vtable usage?

  1. cpu cache efficiency — memory footprint of the vptr in each object. Java is affected too! If you have a lot of Trade objects with only one char data field, then the vptr greatly expands the footprint and you waste cache lines.
    • [[ARM]] singles out this factor as a justification for -fno-rtti… see RTTI compiler-option enabled by default
    • [[moreEffC++]] P116 singles out vptr footprint as the biggest performance penalty of vptr
  2. runtime indirection — “a few memory references more efficient” [1] in the Order usage
  3. inlining inhibition is the most significant overhead. P209 [[ARM]] says inline virtual functions make perfect sense, so it is best to bypass the vptr and call the virtual function directly, if possible.

[1] P209 [[ARM]] wording

Note a virtual function unconditionally introduces the first overhead, but the #2/#3 overheads can sometimes be avoided by a smart compiler.
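Overhead #1 is easy to demonstrate with sizeof (the field names are made up):

```cpp
struct Order {             // no virtual functions: 1 byte
    char side;
};

struct Trade {             // one virtual function plants a vptr in EVERY object
    char side;
    virtual ~Trade() {}
};
// On a typical 64-bit platform sizeof(Order) == 1 while sizeof(Trade) is
// usually 16: an 8-byte vptr plus the char, padded to the vptr's alignment.
```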


std::sort() beating ANSI-C qsort() #inline

Stroustrup was the first one to tell me c++ std::sort() can beat C qsort() easily. says:

Since the qsort() code is compiled ahead of time and lives inside the shared libc binary, there is no chance that the comparator function, passed as a function pointer, can be inlined. says

For qsort(), since the function is passed as a pointer, and the elements are passed as void pointers as well, it means that each comparison costs three indirections and a function call.

In C++, the std::sort is a template algorithm, so that it can be compiled once for each type. The operator< of the type is usually baked into the sort algorithm as well (inlining), reducing the cost of the comparison significantly.
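Side by side (function and variable names are my own), both calls sort correctly, but only the std::sort comparison can be inlined:

```cpp
#include <algorithm>
#include <cstdlib>

// qsort(): comparator passed as a function pointer, elements as void* --
// the precompiled libc qsort cannot inline this call.
int cmpInt(const void* a, const void* b) {
    return *static_cast<const int*>(a) - *static_cast<const int*>(b);
}

void sortBothWays(int* viaQsort, int* viaStdSort, int n) {
    std::qsort(viaQsort, n, sizeof(int), cmpInt);
    // std::sort: a template instantiated for int*; the comparison
    // (operator< here) is typically inlined into the generated code.
    std::sort(viaStdSort, viaStdSort + n);
}
```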

c++condVar 2 usages #timedWait

poll()as timer]real time C : industrial-strength #RTS is somewhat similar. singles out two distinct usages:

1) notification
2) timed wait — often forgotten. std::condition_variable::wait_for() takes a std::chrono::duration parameter, which has nanosec precision.

Note java wait() also has nanosec precision.

std::condition_variable::wait_until() can be useful too, featured in my proposal RTS pbflow msg+time files #wait_until
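A minimal sketch of usage 2 (the function name and the 5ms timeout are my own choices). With nobody notifying, the wait simply times out and the predicate’s value is returned:

```cpp
#include <chrono>
#include <condition_variable>
#include <mutex>

bool timedWaitDemo() {
    std::mutex m;
    std::condition_variable cv;
    bool ready = false;                 // the condition; never set in this sketch
    std::unique_lock<std::mutex> lk(m);
    // wait_for() takes a std::chrono::duration -- could be nanoseconds.
    // Returns the predicate's result: false here, since we timed out unmet.
    return cv.wait_for(lk, std::chrono::milliseconds(5),
                       [&] { return ready; });
}
```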

who generates the local proxy for the remote

In any distributed-OO infrastructure (ejb, rmi, wcf, remoting…), someone needs to generate a proxy class based on the server-side API. If you aren’t paying attention, you will not notice where, when and how exactly this is generated. But this is a crucial detail not to be missed.

There’s a bit of ambiguity over 2 related concepts — instantiating a proxy instance at run-time vs generating a proxy class’s source code before compiling.

(I like to say “client-side proxy” but the adjective is superfluous. )

— WCF —
Usually I generate the “service reference” from a wsdl or the service endpoint URL.

pure virtual with/out implementation: popular@IV

Item 44 in [[EffC++]] offers a summary that basically says

1) a pure virtual means only interface is inherited (since parent provides no implementation)
2) a “simple/traditional virtual” means interface plus a default implementation is inherited
3) a non-virtual means interface plus a mandatory implementation is inherited and subclasses are advised to keep the implementation. C++ actually allows subclass to /deviate/ by redefining and hiding the parent implementation. Java has a cleaner solution in the “final” keyword.

I’d like to add

1b) a pure-virtual-with-an-implementation tells the subclass author

“Inherit this interface. By the way I also offer an implementation upon request, but uninheritable.”

This differs from (2). Both are illustrated in Item 36.

Author of a subclass of an abstract base class (featuring pure2()) can choose one of three options:

  • 1. don’t declare pure2() at all.  As the default and most popular usage, the subclass is also abstract (pun intended) by virtual of the inherited pure2().
  • 2. define pure2(), becoming a non-abstract class
  • … so far, exactly same as java syntax
  • 3. redeclare the same pure2() without implementation — an error. See P215 [[ARM]]
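Case 1b and option 2 together, in a sketch (the class names are hypothetical). The parent’s implementation is “uninheritable” — it can only be reached by an explicit qualified call:

```cpp
#include <string>

struct Instrument {                       // hypothetical abstract base
    virtual ~Instrument() {}
    virtual std::string pure2() = 0;      // pure virtual ...
};
std::string Instrument::pure2() {         // ... yet with an implementation (1b)
    return "base default";
}

struct Bond : Instrument {
    // Option 2: define pure2() and become concrete. The parent's
    // implementation is available only "upon request" via a qualified call:
    std::string pure2() override { return Instrument::pure2() + " + Bond"; }
};
```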