CPU run-queue #java perspective

— Mostly based on [[JavaPerf]] P28

Runtime.availableProcessors() returns the count of virtual processors, i.e. hardware threads. This is an important number for CPU tuning and bottleneck analysis.

When run-queue depth exceeds 4 times the virtual processor count, the host system becomes visibly slow. For a host dedicated to java, this is a second sign of CPU saturation; the first sign is high CPU utilization.

Note run-queue depth is the first column (r) in vmstat output

heap allocation: java Can beat c++

  • case 1 (standard java): you allocate heap memory. After you finish with it you wait for the java GC to clean it up.
  • case 2 (low latency java): you allocate heap memory but disable java GC. Either you hold on to all your objects, or you leave unreachable garbage orbiting the earth forever.
  • case 3 (c++): you allocate heap memory with the expectation of releasing it, so the compiler sets up housekeeping in advance for the anticipated delete(). This housekeeping overhead is somewhat similar to the cost of try/catch before c++11 ‘noexcept’.

Stroustrup suggested that #2 will be faster than #3, but #3 is faster than #1. I asked, “But can’t c++ emulate the allocation the way the jvm does?” Stroustrup said c++ is not designed for that. I have seen online posts about this “emulation”, but I would trust Stroustrup more.

  • case 4 (C): C/c++ can sometimes use local variables to beat heap allocation. C programmers use rather few heap allocations, in my experience.

Note the jvm allocator and malloc are both userland allocators, not part of the kernel and usually not making system calls. You can substitute your own malloc.

https://stackoverflow.com/questions/18268151/java-collections-faster-than-c-containers top answer by Kanze is consistent with what Stroustrup told me.

  • no dynamic allocation is always faster than even the fastest dynamic allocation. Similar to Case 4
  • jvm allocation (without the GC clean-up) can be 10 times faster than c++ allocation. Similar to Case 2^3
    • Q: Is there a free list in JVM allocator?

https://softwareengineering.stackexchange.com/questions/208656/java-heap-allocation-faster-than-c claims

  • c++ custom allocators managing a pool of fixed-size objects can beat the jvm
  • jvm allocation often requires little more than one pointer addition, which is certainly faster than typical c++ heap allocation algorithms (see the sketch below)
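
A minimal sketch of the idea (a hypothetical BumpArena class, not any real library) showing why pool/bump allocation can cost little more than one pointer addition:

#include <cstddef>
#include <cstdio>

// hypothetical bump allocator: hands out slices of one pre-allocated arena
class BumpArena {
    char buf[1 << 16];      // 64 KB arena, acquired once
    std::size_t next = 0;   // "top of heap" offset
public:
    void* alloc(std::size_t n) {
        n = (n + 7) & ~std::size_t(7);             // round up to 8-byte alignment
        if (next + n > sizeof buf) return nullptr; // arena exhausted
        void* p = buf + next;
        next += n;                                 // the entire cost: one addition
        return p;
    }
    void reset() { next = 0; }                     // bulk "free", vaguely like evacuating an empty region
};

int main() {
    BumpArena arena;
    int* i = static_cast<int*>(arena.alloc(sizeof(int)));
    *i = 42;
    printf("%d\n", *i);
}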

tsn: what if I fail due2capabilities #Okao

Yet another revisit. See also post on dare2fail.

My intern David Okao asked “What if the west coast workplaces are too demanding? Can you cope?” I replied

  • As an adventurer, I don’t mind the risk… robust, resilient confidence
  • As an adventurer, I see myself as adaptable, a survivor
  • I may have my secret weapons
  • I may find my strengths such as domain knowledge, data analysis, trouble-shooting, efficient design, math(yes)

I then went over a similar discussion about MLP with Ashish, when I said —

  • If similar to Macq, I put up a good fight but still fail due to personal “capabilities”, I ought to feel positive about the whole experience.
  • I’m good at the job-hunting game so no real worries.

java enum: elegant

“Elegant” (and “clean”) is probably the best adjective. My comments below are mostly based on [[effJava]]

  • 🙂 immutable in the face of serialization
  • 🙂 strict singleton in the face of serialization .. P311
  • 🙂 simple enough to be easily embedded as a static nested class
  • 🙂 you can add behavior (and data) unique to Jupiter, using a constant-specific class body
  • 🙂 You can switch on enum values
  • compiler adds two implicit static methods values() and valueOf(), not listed on the official javadoc 😦
  • values() returns a fixed array of the predefined enum values
  • valueOf(planetNameStr) would return the matching enum instance
    • Note this method is unrelated to String.valueOf()
    • you can even add your own fromString(abbreviatedPlanetName) method. see [[effJava]]

EnumSet (see separate blogpost) and EnumMap build on the strengths of the enum feature

identityHashCode,minimum object size,relocation by JGC

https://srvaroa.github.io/jvm/java/openjdk/biased-locking/2017/01/30/hashCode.html offers a few “halo” knowledge pearls

  • every single java Object must always give an idHashcode on-demand, even if its class has its hashCode() method overridden to return a hard-coded 55.
    • hashcode() doesn’t overshadow idHashcode
  • The “contract” says an object’s idHashcode number must never [2] change, even across object relocations by the GC. So it can’t simply be computed from the current address. Once someone requests the idHashCode number (like 4049040490), this number must be retained somewhere in the object, as per the “contract”. It is retained in the 12-byte object header (8-byte header on a 32-bit JVM).
    • Therefore, the idHashcode contributes to the minimum size of java objects.
  • contrary to common belief, the idHashcode can clash between two objects, so “identity hashcode” is a bit of a misnomer: it is a hashcode, not a unique identity. https://bugs.java.com/bugdatabase/view_bug.do?bug_id=6321873 explains there are not enough distinct int values for the maximum possible object count.
  • Note anyone can call the hashCode() method on this same object, and an override can bypass the idHashcode entirely.
  • [2] in contrast a custom hashcode() can change its value when object state changes.

POSIX countingSemaphore ^ lock+condVar #Solaris docs

https://docs.oracle.com/cd/E19120-01/open.solaris/816-5137/sync-11157/index.html points out a lesser-known difference in the Solaris context:

Because semaphores need not be acquired and be released by the same thread, semaphores can be used for asynchronous event notification, such as in signal handlers (but presumably not interrupt handlers). And, because semaphores contain state, semaphores can be used asynchronously without acquiring a mutex lock as is required by condition variables. However, semaphores are not as efficient as mutex locks.

The same page also shows that a POSIX countingSemaphore can be used for IPC or between threads.
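
A minimal sketch (Linux/POSIX unnamed semaphore shared between two threads of one process; error handling omitted) of the asynchronous-notification idea, where the poster never "acquired" anything:

#include <semaphore.h>
#include <pthread.h>
#include <cstdio>

sem_t sem;   // counting semaphore: it holds state even if nobody is waiting yet

void* producer(void*) {
    printf("event happened\n");
    sem_post(&sem);          // "release" without ever acquiring; sem_post is also async-signal-safe
    return nullptr;
}

int main() {
    sem_init(&sem, 0 /*shared between threads, not processes*/, 0 /*initial count*/);
    pthread_t t;
    pthread_create(&t, nullptr, producer, nullptr);
    sem_wait(&sem);          // a different thread consumes the notification
    printf("event consumed\n");
    pthread_join(t, nullptr);
    sem_destroy(&sem);
}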

## G5 move-only types #std::atomic

[1] P106 [[effModernC++]]

default malloc is slow, can be improved

See https://stackoverflow.com/questions/161053/which-is-faster-stack-allocation-or-heap-allocation

I see various evidence that industry practitioners consider the default allocator too slow.

I don’t think system calls are the issue. System calls are very infrequent with malloc.

thread^process: lesser-known differences #IV

Popular IV question. Largely a QQ question.  Some may consider it zbs.

To the kernel, there are many similarities between the “thread” construct and the “process” construct. In fact, a thread is often represented as a LightWeightProcess in kernels such as Solaris and Linux.

  • context switching — is faster between threads than between processes. In linux, context switching between kernel-threads is even faster.
  • creation — some thread libraries can create threads without the kernel knowing. No such thing for a process.
  • socket — 2 threads in a process can access the same socket; two processes usually can’t access the same socket, unless they are parent and child (the child inherits the descriptor). See post on fork()
  • memory — thread AA can access all heap objects, and even thread BB’s stack objects via pointers. Two processes can’t share any of these, except via shared memory. (See the sketch after this list.)
  • a non-kernel thread can never exist without an owner process. In contrast, every process always has a parent process which could be long gone.
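
A minimal sketch (POSIX, compile with -pthread) of the memory bullet above: a sibling thread sees the shared global, while a forked child only modifies its own copy.

#include <cstdio>
#include <thread>
#include <sys/wait.h>
#include <unistd.h>

static int counter = 0;   // ordinary global in the data segment, shared by all threads

int main() {
    std::thread t([]{ counter = 42; });            // sibling thread writes the same memory
    t.join();
    printf("after thread: counter=%d\n", counter); // 42

    pid_t pid = fork();
    if (pid == 0) { counter = 99; _exit(0); }      // child writes its own private copy
    waitpid(pid, nullptr, 0);
    printf("after fork:   counter=%d\n", counter); // still 42 in the parent
}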

 

[18] G4 IV(!! GTD)domains 2 provide 20Y job security

See also

Let’s ignore zbs or GTD or biz domains like mktData/risk here …

  • –roughly ranked by value-to-me
  • [c s] java? resilient in the face of c# and dynamic languages. At least 10Y relevance.
  • [c s] c++? resilient in the face of java. Time-honored like SQL
  • [c] abstract algorithm and data structures, comp science problem solving
  • [c n] tcp/udp optimization + other hardware/kernel/compiler optimizations
  • ……….No more [c]
  • py + shell scripting? no [c] rating since depth unappreciated
  • Linux and windows? at least 10Y growth, but no [c]
  • [s] SQL? resilient in the face of noSQL, but no [c]
  • bond math?
  • [n s] FIX? At least 10Y relevance
  • [c=high complexity in IV; shelf-life; depth appreciated …]
  • [n=niche, but resilient]
  • [s=survived serious challenges]

TCP blocking send() timeout #details

See also recv()/send() with timeout #CSY

I see three modes

  1. non-blocking send() — immediate return if unable to send
  2. blocking send() without timeout — blocks forever, so the thread can’t do anything.
  3. blocking send() with timeout —

SO_SNDTIMEO: sets the timeout value specifying the amount of time that an output function blocks because flow control prevents data from being sent. If a send operation has blocked for this time, it shall return with a partial count or with errno set to [EAGAIN] or [EWOULDBLOCK] if no data is sent. The default for this option is zero, which indicates that a send operation shall not time out. This option stores a timeval structure. Note that not all implementations allow this option to be set.

In the xtap library, the timeout isn’t implemented at all; the default is non-blocking. If we configure mode 2), we can hit a strange problem: one of three receivers gets stuck but keeps its connection open, and the other receivers are starved even though their receive buffers are free.
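
A minimal sketch of mode 3 (assuming fd is an already-connected TCP socket created elsewhere; error handling omitted):

#include <sys/socket.h>
#include <sys/time.h>
#include <cerrno>
#include <cstdio>

void sendWithTimeout(int fd, const char* buf, size_t len) {
    timeval tv{};
    tv.tv_sec = 2;                 // give up after 2 seconds of flow-control blockage
    tv.tv_usec = 0;
    setsockopt(fd, SOL_SOCKET, SO_SNDTIMEO, &tv, sizeof tv);

    ssize_t n = send(fd, buf, len, 0);
    if (n < 0 && (errno == EAGAIN || errno == EWOULDBLOCK))
        printf("send() timed out, nothing sent\n");
    else
        printf("sent %zd of %zu bytes\n", n, len);   // may be a partial count
}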

linux hardware interrupt handler #phrasebook

I think some interrupts are generated by software, but here I focus on hardware interrupt handlers.

  • pseudo-function — Each handler is like a pseudo-function containing a series of instructions that will run on a cpu core
  • top priority — interrupt context is higher priority than kernel context or process context. You can say “interrupt” means “emergency”. Emergency vehicles don’t obey traffic rules.
    • However, an interrupt handler “function” can itself get interrupted by another interrupt [1]. The kernel somehow remembers the nested “stack”.
  • not preemptible — except in the [1] scenario, the kernel can’t suspend a hardware interrupt handler mid-stream and put another series of instructions in the “driver’s seat”
  • no PID — since there’s no preemption, we don’t need a pid associated with this series of instructions.

de-multiplex by-destPort: UDP ok but insufficient for TCP

When people ask me the purpose of the port number in networking, I used to say that it helps demultiplex. Now I know that’s true for UDP, but TCP uses more than the destination port number.

Background — two processes X and Y on a single-IP machine need to maintain two private, independent ssh sessions. The incoming packets need to be directed to the correct process, based on the port numbers of X and Y… or is that the whole story?

If X is sshd with a listening socket on port 22, and Y is a forked child process from accept(), then Y’s “worker socket” also has local port 22. That’s why in our linux server, I see many ssh sockets where the local ip:port pairs are indistinguishable.

TCP demultiplex uses not only the local ip:port, but also remote (i.e. source) ip:port. Demultiplex also considers wild cards.

|  | TCP | UDP |
|---|---|---|
| socket has local IP:port | yes | yes |
| socket has remote IP:port | yes | no such thing |
| 2 sockets with same local port 22, in two processes | allowed | not allowed |
| 2 sockets with same local port 22, in one process | allowed | not allowed |
| 2 msg with same dest ip:port but different source ports | addressed to 2 sockets (2 ssh sessions) | addressed to the same socket |
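
To see the identical-local-port effect without sshd, here is a self-contained sketch (loopback TCP, error handling omitted): the listening socket and the accepted worker socket report the same local port.

#include <arpa/inet.h>
#include <netinet/in.h>
#include <sys/socket.h>
#include <unistd.h>
#include <cstdio>

static int localPort(int fd) {                  // helper: report a socket's local port
    sockaddr_in a; socklen_t len = sizeof a;
    getsockname(fd, (sockaddr*)&a, &len);
    return ntohs(a.sin_port);
}

int main() {
    int lis = socket(AF_INET, SOCK_STREAM, 0);
    sockaddr_in addr{};
    addr.sin_family = AF_INET;
    addr.sin_addr.s_addr = htonl(INADDR_LOOPBACK);
    addr.sin_port = 0;                          // let the kernel pick a free port
    bind(lis, (sockaddr*)&addr, sizeof addr);
    listen(lis, 1);

    socklen_t len = sizeof addr;
    getsockname(lis, (sockaddr*)&addr, &len);   // learn the chosen port

    int cli = socket(AF_INET, SOCK_STREAM, 0);
    connect(cli, (sockaddr*)&addr, sizeof addr);
    int worker = accept(lis, nullptr, nullptr);

    // the two server-side sockets share local ip:port; the kernel tells them
    // apart by the remote ip:port (which is a wildcard for the listener)
    printf("listener port=%d  worker port=%d\n", localPort(lis), localPort(worker));

    close(cli); close(worker); close(lis);
}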

JGC G1 Metaspace: phrasebook #intern

Incidentally, NIO direct buffers are also in native memory

 

long term planning can be demoralizing

My father often tells me I plan ahead too much…

Q: where will I be, what job will I have 5 years from now?

Such questions can be demoralizing and sometimes can dampen a precious spirit of optimism. I sometimes perform better by focusing on here and now.

I think the reality may be quite bland and uninspiring — same job, with declining income, not much “offensive” to mount …

declare variables ] loop header: c^j #IF/for/while

Small trick to show off in your coding test…

Background — in a short code snippet, I want to minimize variable declarations. The loop-control variable declaration is something I always want to avoid.

https://stackoverflow.com/questions/38766891/is-it-possible-to-declare-a-variable-within-a-java-while-conditional shows java WHILE-loop header allows assignment:

List<Object> processables;
while ((processables = retrieveProcessableItems(..)).size() > 0) { /*...*/ }

But (I’m 99% sure) only the c++ WHILE-loop header allows a variable declaration.

The solution — both java and c++ FOR-loop headers allow variable declaration. Note the condition is checked before the first iteration, in both for and while loops.

update — c++ (even before c++0x) allows a variable declaration in the IF-statement header, designed to limit the variable’s scope; c++17 adds a separate init-statement form on top of that.

if (int a=string().size()+3) cout << a << " = a \n"; // shows 3
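
A small recap of both tricks in one sketch (the WHILE-condition declaration is c++ only; the FOR-header declaration works in java too):

#include <cstdio>

int main() {
    for (int i = 0; i < 3; ++i)      // FOR-header declaration: legal in both java and c++
        printf("i=%d ", i);

    int data[] = {3, 2, 1, 0, 9};
    int* p = data;
    while (int cur = *p++)           // WHILE-condition declaration: c++ only; stops at the 0
        printf("cur=%d ", cur);
    printf("\n");
}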

MOM+threading Unwelcome ] low latency@@ #FIX/socket

Piroz told me that trading IT job interviews tend to emphasize multi-threading and MOM. Some use SQL too. I now feel all of these are unwelcome in low latency trading.

A) MOM – see also the post on HFT mktData redistribution via MOM. For order processing, FIX is the standard. FIX can use MOM as a transport, but that is not popular and I am unfamiliar with it.

B) threading – Single-Threaded-Mode is generally the fastest, in theory and in practice (I only have a small observed sample size). I feel the fastest trading engines are STM, with no shared mutable state. The new Nsdq platform (in java) is STM.

MT is OK if the threads don’t compete for resources like CPU, I/O or locks. Compared to STM, most lockfree designs introduce latency such as retries and additional memory barriers; without explicit synchronization, the compiler is free to optimize without such barriers.

C) SQL – as stated elsewhere, flat files are much faster than relational DB. How about in-memory relational DB?

Rebus, the order book engine, is in-memory.

Y more threads !! help`throughput if I/O bound

To keep things concrete, you can think of the output interface as the I/O in question.

The paradox — given an I/O-bound busy server, conventional wisdom says more threads could increase CPU utilization [1]. However, the work queue for the CPU is quickly /drained/, whereas the I/O queue is constantly full, as the I/O subsystem is working at full capacity.

[1] In a CPU bound server, adding 20 threads will likely create 20 idle, starved new threads!

The holy grail is simultaneous saturation of CPU and I/O. Suggestion: “steal” a cpu core from this engine and use it for unrelated tasks. Additional threads or processes basically achieve that purpose. In other words, the cpu cores aren’t dedicated to this engine.

Assumption — adding more I/O hardware is not possible. (Instead, scaling out to more nodes could help.)

If the CPU cores are dedicated, then there’s no way to improve throughput without adding more I/O capacity. At a high level, I clearly see too much CPU /overcapacity/.

struct packing^memory alignment %%experiment

Sound byte — packing is for structs but alignment and padding is Not only for structs

Longer version — alignment and padding apply to Any object placement, whether that object is a field of a struct or an independent object on stack. In contrast, the packing technique is specific to a struct having multiple fields.

https://github.com/tiger40490/repo1/blob/cpp1/cpp1/memAlign.cpp has my test code to reveal the rules:

  1. For a (8-byte) long long object on stack (or in a struct), the address is 8-byte aligned. So padding is added by compiler, unless you say “packed” on a struct.
  2. For a (4-byte) int object on stack (or in a struct), the address is 4-byte aligned.
  3. For a (2-byte) short object on stack (or in a struct), the address is 2-byte aligned.
  4. For a char object on stack (or in a struct), the address is 1-byte aligned. All memory address are 1-byte aligned, so compiler adds no padding.

http://stackoverflow.com/questions/11108328/double-alignment Wug’s answer echoes Ashish’s comment that tiny fields like char should be grouped together, due to Rule #4. This applies to stack layout as well [1]. However, compiler optimization can be subversive:

Not only can the compiler reorder the stack layout of the local variables, it can assign them to registers, assign them to live sometimes in registers and sometimes on the stack, it can assign two locals to the same slot in memory (if their live ranges do not overlap) and it can even completely eliminate variables.

[1] See https://stackoverflow.com/questions/1061818/stack-allocation-padding-and-alignment
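
A minimal illustration (separate from the linked test code; __attribute__((packed)) is gcc/clang syntax) of the rules above and of the field-grouping trick:

#include <cstdio>

struct Padded {            // natural alignment: the compiler inserts padding
    char c;                // 1 byte + 7 bytes padding, so ll lands on an 8-byte boundary (rule 1)
    long long ll;          // 8 bytes
    short s;               // 2 bytes + 6 bytes tail padding, so arrays of Padded stay aligned
};

struct __attribute__((packed)) Packed {   // "say packed on the struct": no padding at all
    char c;
    long long ll;
    short s;
};

struct Grouped {           // the grouping trick: put the big field first, tiny fields together
    long long ll;          // 8 bytes
    short s;               // 2 bytes
    char c;                // 1 byte + 5 bytes tail padding
};

int main() {
    printf("Padded=%zu Packed=%zu Grouped=%zu\n",
           sizeof(Padded), sizeof(Packed), sizeof(Grouped));   // typically 24 11 16 on x86-64
}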

shared_ptr: exclusive ownership

Background — Like any useful software component, shared_ptr offers a clean api as facade in front of a sophisticated implementation, and involves many non-trivial interactions when integrated into an application — Complexity.

Many ways to wrap your mind around the complexity. This is one of the best Perspectives.

The various ways to construct a shared_ptr clearly illustrate the fundamental use cases. An experienced user is probably familiar with all of them. These use cases are so important that you should be able to explain each constructor. Here’s what I remember.

1) pass in a raw ptr — starting a new club as the first member.
2) copier — joining an existing club.
3) pass in a weak_ptr
4) pass in an auto_ptr (or, since c++11, a unique_ptr)

Note the good old auto_ptr is a club of 1. When “joining” a club you must remove the existing “owner”.

Shared_ptr implements shared yet exclusive ownership within an exclusive club of equals. No other owners are allowed.
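
A minimal c++11 sketch of the constructor flavors (auto_ptr is deprecated, so unique_ptr stands in for case 4):

#include <memory>
#include <utility>
#include <cstdio>
using namespace std;

int main() {
    shared_ptr<int> club1(new int(42));     // 1) raw ptr: founding member of a new club
    shared_ptr<int> member2 = club1;        // 2) copier: joining the existing club
    weak_ptr<int> observer = club1;         // a non-member observer of the club
    shared_ptr<int> member3(observer);      // 3) from weak_ptr: throws bad_weak_ptr if the club is gone
    unique_ptr<int> lone(new int(7));
    shared_ptr<int> club2(move(lone));      // 4) from unique_ptr: the old club-of-one owner is stripped

    printf("club1 use_count=%ld, lone is now %s\n",
           club1.use_count(), lone ? "still owning" : "empty");   // 3, empty
}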

RAII^ContextManager^using^java-AutoCloseable

1) Stroustrup commented that c++ doesn’t support finally{} because it has RAII dtor. See
http://www.stroustrup.com/bs_faq2.html#finally

Both deal with exceptional exits.
Both are robust.
Both are best practices.

However, the try{} machinery has a performance cost, so much so that some c++ compilers can be configured to disable exceptions. C++ memory management relies heavily on RAII; using try/finally-style cleanup for that would be too costly.

2) python ContextManager protocol defines __enter__() and __exit__() methods

Keyword “with” required …

3) Java uses finally{}. Note finally{} becomes implicit in java7 try-with-resources

AutoCloseable interface is needed in try-with-resource. See https://docs.oracle.com/javase/tutorial/essential/exceptions/tryResourceClose.html

4) c# — the Achilles’ heel of java GC is its non-determinism. C#’s answer is q(using). C# provides both USING and try/finally; under the hood, USING compiles down to try/finally.

I feel c# USING is, evolution-wise, a closer cousin to RAII (while try/finally is a distant cousin). Both use variable (not object) scope to manage object (not variable) lifetime.

USING uses Dispose() method, which is frequently compared to the class dtor/Finalize(). For the difference between c# Dispose() vs dtor vs Finalize, see other blog post(s).

As you can see, c# borrowed all the relevant techniques from c++ and java. So it’s better to first understand the c++/java constructs before studying c# constructs.
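
A minimal c++ RAII sketch to contrast with try/finally and with-blocks (FileGuard is a made-up class, not from any library): the dtor plays the role of finally{} / __exit__() / Dispose().

#include <cstdio>

class FileGuard {
    FILE* f;
public:
    explicit FileGuard(const char* path) : f(fopen(path, "r")) {}
    ~FileGuard() { if (f) fclose(f); }      // runs on normal exit AND during exception unwinding
    FILE* get() const { return f; }
    FileGuard(const FileGuard&) = delete;   // sole owner, much like a using-block variable
    FileGuard& operator=(const FileGuard&) = delete;
};

void readConfig() {
    FileGuard guard("/etc/hostname");       // no try/finally anywhere in this function
    if (!guard.get()) return;
    char line[128];
    if (fgets(line, sizeof line, guard.get()))
        printf("%s", line);
}   // guard's dtor closes the file here, whichever path we took

int main() { readConfig(); }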

sizeof : array on stack^heap

Here are two int pointers and one stack array — almost indistinguishable at runtime, but at compile time…?

#include <iostream>
using namespace std;

int main(){
  cout << sizeof(char) << endl;   // always 1 by definition
  cout << sizeof(void*) << endl;  // shows 8 on my 64-bit machine, the size of any address
  cout << sizeof(int) << endl;    // shows 4 on my 64-bit machine

  int * ipNew = new int;
  cout << sizeof ipNew << endl;   // 8, size of an address

  int * iaNew = new int[5];
  cout << sizeof(iaNew) << endl;  // 8 too, since array-new returns just an address

  int ia[] = {1,3,9};
  cout << sizeof(ia) << endl;     // 12, size of an Entire array variable on stack
}

At run time, ia is probably just an address on the stack, and the program doesn’t remember the array length. However, sizeof is evaluated at compile time, when the compiler does know the array length!

##[09]low latency techniques

roughly ranked in terms of interviewer’s emphasis

  1. OOM and GC. memory conservation.
    1. Avoid dupe objects.
    2. avoid keeping large order state object
  2. parallelism
  3. avoid recursion
  4. JVM tuning beyond GC tuning
  5. –network
  6. avoid network like distributed cache. Favor single-JVM designs
  7. NIO
  8. FIX payload size reduction, such as encoding and compression
  9. –MOM
  10. multicast. RV is more stable than the newly invented multicast-JMS
  11. MOM message size control
  12. !! Peer-to-peer messaging eliminates message brokers and daemon processes

http://download.oracle.com/docs/cd/E13150_01/jrockit_jvm/jrockit/geninfo/diagnos/tune_fast_xaction.html
http://www.sun.com/solutions/documents/pdf/fn_lowlatency.pdf — diagrams, brevity.
http://www.quantnet.com/forum/showthread.php?t=5736
http://en.wikipedia.org/wiki/Low_latency_(capital_markets)#Reducing_Latency_in_the_Order_Chain

JRockit real time — http://www.oracle.com/appserver/docs/low-latency-capital-markets-whitepaper.pdf

##triggers for JGC: various comments

Some big trading engine (UBS?) runs just 1 GC cycle each day. I wonder … Q: how often does GC run normally?

P155 [[SanFrancisco]] nicely outlines that GC can start
* upon alloc-failure
* when jvm explicitly waits (IO, sleep …)
* every 10 seconds, but at a low priority

– In addition, application can request GC.

Synchronous GC (triggered on demand) is predominant, but google “asynchronous garbage collection idle periods of low activity” and you will see GC can also start when the system is idle. The Creditex java guys also hinted at that.

In my experiment, I do see GC springs into action when app is … paused.

http://java.sun.com/docs/hotspot/gc1.4.2/faq.html says about JDK1.4 —
* In the default garbage collector, a generation is collected when it is full (i.e., when no further allocations can be done from that generation).
* The concurrent low pause collector starts a collection when the occupancy of the tenured generation reaches a specified threshold(by default 68%).
* The incremental low pause collector collects a portion of the tenured generation during each young generation collection.
* A collection can also be started explicitly by the application.

http://www.softwareengineeringsolutions.com/blogs/2010/04/30/garbage-collection-in-java-part-2/ drills into alloc-failure —

The JVM is unable to fulfill the request (alloc-failure), and so it awakens the garbage collector to come to its aid. This results in a Young Generation collection cycle. This is the simplest and first answer to the opening question. Therefore, the novice thinks this is the only answer.

A Young Generation collection is called for because there isn’t enough space available in Eden to satisfy this allocation request (alloc-failure). However, unlike the earlier scenario, the promotion of the middle-aged object from the active Survivor space to being an elderly object in the old generation cannot proceed because there is no free space available in the Old Generation. In this case, the Young Generation collection is put on hold, while an Old Generation collection cycle is spun up to free up space in the Old Generation.

dynamic_cast, typeid(*ptr) and vtbl

We won’t be too far off if we imagine that all RTTI operations work their magic by following the vptr. Remember each polymorphic class has a distinct vtbl. If there are 521 polymorphic classes in memory, then there are exactly that many v-tables in memory. Each polymorphic object holds a pointer-sized vptr seated at the vtbl of the “real” class. Recall that during subclass instantiation, the vptr first gets seated at the parent vtbl, then reseated at the subclass vtbl.

An interviewer pointed out — if a pair of derived/base classes have no virtual function, then you can’t dynamic_cast between them even using pointers and references.

RTTI is inapplicable to types without a vtbl, because such types are fully known at compile time. For example, D extends B, without any virtual method. We create a D object but give its address to a B pointer. What’s the typeid of the dereferenced (_unwrapped_) pointer? It’s B. See post on AddressOfBasement.

See https://github.com/tiger40490/repo1/blob/cpp1/cpp/88miscLang/typeid_WithoutVptr.cpp
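
A minimal sketch (class names B/D/VB/VD are mine, separate from the linked test code) of the interviewer's point and of the "unwrapped pointer" question:

#include <cstdio>
#include <typeinfo>

struct B { int x; };                 // no virtual function, so no vptr
struct D : B {};

struct VB { virtual ~VB() {} };      // polymorphic: has a vptr
struct VD : VB {};

int main() {
    D d;  B* pb = &d;
    printf("%s\n", typeid(*pb).name());     // static answer: B (mangled), no vtbl to consult
    // dynamic_cast<D*>(pb);                // would not even compile: source type not polymorphic

    VD vd;  VB* pvb = &vd;
    printf("%s\n", typeid(*pvb).name());    // dynamic answer via the vptr: VD (mangled)
    printf("%d\n", dynamic_cast<VD*>(pvb) != nullptr);   // 1
}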

dependency,rigid,fragile: OO design concepts4IV

— based on http://www.objectmentor.com/resources/articles/oodmetrc.pdf

rigidity (inflexible, un-adaptable) is due to the fact that a single change to heavily interdependent software begins a cascade of changes in dependent modules. When the extent of that cascade cannot be predicted by the designers or maintainers, the impact of the change cannot be estimated, which makes the cost of the change impossible to estimate. Managers, faced with such unpredictability, become reluctant to authorize changes.

Fragility (non-robust) is the tendency of a program to break in many places when a single change is made. Often the new problems are in areas that have no conceptual relationship with the area that was changed. Simple changes to one part of the application lead to failures in other parts that appear to be completely unrelated.

———————
I think this “change-impact” analysis would help answer the question “how to reduce subclass/superclass dependency”