SDI: 3 ways to expire cached items

server-push update ^ TTL ^ conditional-GET # write-through is not cache expiration

Few Online articles list these solutions explicitly. Some of these are simple concepts but fundamental to DB tuning and app tuning. https://docs.oracle.com/cd/E15357_01/coh.360/e15723/cache_rtwtwbra.htm#COHDG198 compares write-through ^ write-behind ^ refresh-ahead. I think refresh-ahead is similar to TTL.

B) cache-invalidation — some “events” would trigger an invalidation. Without invalidation, a cache item would live forever with a infinity TTL, like the list of China provinces.

After cache proxies get the invalidation message in a small payload (bandwidth-friendly), the proxies discard the outdated item, and can decide when to request an update. The request may be skipped completely if the item is no longer needed.

B2) cache-update by server push — IFF bandwidth is available, server can send not only a tiny invalidation message, but also the new cache content.

IFF combined with TTL, or with reliability added, then multicast can be used to deliver cache updates, as explained in my other blogposts.

T) TTL — more common. Each “cache item” embeds a time-to-live data field a.k.a expiry timestamp. Http cookie is the prime example.

In Coherence, it’s possible for the cache proxy to pre-emptively request an update on an expired item. This would reduce latency but requires a multi-threaded cache proxy.

G) conditional-GET in HTTP is a proven industrial strength solution described in my 2005 book [[computer networking]]. The cache proxy always sends a GET to the database but with a If-modified-since header. This reduces unnecessary database load and network load.

W) write-behind (asynchronous) or write-through — in some contexts, the cache proxy is not only handling Reads but also Writes. So the Read requests will read or add to cache, and Write requests will update both cache proxy and the master data store. Drawback — In distributed topology, updates from other sources are not visible to “me” the cache proxy, so I still rely one of the other 3 means.

TTL eager server-push conditional-GET
if frequent query, in-frequent updates efficient efficient frequent but tiny requests between DB and cache proxy
if latency important OK lowest latency slower lazy fetch, though efficient
if in-frequent query good waste DB/proxy/NW resources as “push” is unnecessary efficient on DB/proxy/NW
if frequent update unsuitable high load on DB/proxy/NW efficient conflation
if frequent update+query unsuitable can be wasteful perhaps most efficient

 

Advertisements

[17] string(+vector) reserve()to avoid re-allocation #resize()wrong

See also std::vector capacity reduction and vector internal array: always on heap; cleaned up via RAII

string/vector use expandable arrays. Has to be allocated on heap not stack.

For vector, it’s very simple —

  • resize() creates dummy elements iFF upsizing
  • reserve() only allocates spare “raw” capacity without creating elements to fill it. See P279 [[c++stdLib]]

[[Optimized C++]] P70 confirms that std::string can get re-allocated when it grows beyond current capacity.

http://stackoverflow.com/questions/9521629/stdstringss-capacity-reserve-resize-functions compares std::string.resize() vs reserve().

  • Backgrounder: Capacity — allocated capacity is often uninitialized memory reserved for this string.
  • Backgrounder: size — length of the string that’s fully initialized and accessible. Some of the characters (nulls) could be invisible.
  • string.resize() can increase the string’s size by filling in space (at the end) with dummy characters (or a user-supplied char). See http://www.cplusplus.com/reference/string/string/resize/
    • After resize(55) the string size is 55. All 55 characters are part of the string proper.
    • changes string’s size
  • string.reserve() doesn’t affect string size. This is a request to increase or shrink the “capacity” of the object. I believe capacity (say 77) is always bigger than the size of 55. The 77 – 55 = 22 extra slots allocated are uninitialized! They only get populated after you push_back or insert to the string.

heap allocation: java Can beat c++

  • case 1 (standard java): you allocate heap memory. After you finish with it you wait for the java GC to clean it up.
  • case 2 (low latency java): you allocate heap memory but disable java GC. Either you hold on to all your objects, or you leave unreachable garbage orbiting the earth forever.
  • case 3 (c++): you allocate heap memory with the expectation of releasing it, so the compiler sets up housekeeping in advance for the anticipated delete(). This housekeeping overhead is somehow similar to try/catch before c++11 ‘noexcept’.

Stroustrup suggested that #2 will be faster than #3, but #3 is faster than #1. I said “But c++ can emulate the allocation as jvm does?” Stroustrup said C++ is not designed for that. I have seen online posts about this “emulation” but I would trust Stroustrup more.

  • case 4 (C): C/c++ can sometimes use local variables to beat heap allocation. C programmers use rather few heap allocations, in my experience.

Note jvm or malloc are all userland allocators, not part of kernel and usually not using system calls. You can substitute your own malloc.

https://stackoverflow.com/questions/18268151/java-collections-faster-than-c-containers top answer by Kanze is consistent with what Stroustrup told me.

  • no dynamic allocation is always faster than even the fastest dynamic allocation. Similar to Case 4
  • jvm allocation (without the GC clean-up) can be 10 times faster than c++ allocation. Similar to Case 2^3
    • Q: Is there a free list in JVM allocator?

https://softwareengineering.stackexchange.com/questions/208656/java-heap-allocation-faster-than-c claims

  • c++ Custom allocators managing a pool of fixed-sized objects can beat jvm
  • jvm allocation often requires little more than one pointer addition, which is certainly faster than typical C++ heap allocation algorithms

how could jvm surpass c++latency

A Shanghai Morgan Stanley interviewer asked in a 2017 java interview — “How could jvm surpass c++ latency?”

— One reason — JIT compiler could aggressively compile bytecode into machine code with speedy shortcuts for the “normal” code path + special code path to handle the abnormal conditions.

JIT to avoid vtable latency #Martin is a prime example.

Priming is tricky in practice — https://www.theserverside.com/tip/Avoid-JVM-de-optimization-Get-your-Java-apps-runnings-fast-right-fromt-the-start highlights pitfalls of priming in the trading context. Some take-aways:
1. Optimizing two paths rather than just one path
2. Reusing successful optimization patterns from one day to the next, using historical data

— One hypothesis — no free() or delete() in java, so the memory manager doesn’t need to handle reclaiming and reusing the memory. [[optimizedC++]] P333 confirmed the c++ mem mgr does that. See [1]

https://stackoverflow.com/questions/1984856/java-runtime-performance-vs-native-c-c-code is Not a published expert but he says —
On average, a garbage collector is far faster than manual memory management, for many reasons:
• on a managed heap, dynamic allocations can be done much faster than the classic heap
• shared ownership can be handled with negligible amortized cost, where in a native language you’d have to use reference counting which is awfully expensive
• in some (possibly rare and contrived) cases, object destruction is vastly simplified as well (Most Java objects can be reclaimed just by GC’ing the memory block. In C++ destructors must always be executed)

— One hypothesis — new() is faster in jvm than c++. See [1]

Someone said “Object instantiation is indeed extremely fast. Because of the way that new objects are allocated sequentially in memory, it often requires little more than one pointer addition, which is certainly faster than typical C++ heap allocation algorithms.”

[1] my blogpost java allocation Can beat c++


Julia, Go and Lua often beat C in benchmark tests .. https://julialang.org/benchmarks/

http://www.javaworld.com/article/2076593/performance-tests-show-java-as-fast-as-c–.html is a 1998 research.

In my GS-PWM days, a colleague circulated a publication claiming java could match C in performance, but didn’t say “surpass”.

##all c++ explicit casts #std::move included

  • — implicit casts
  • cvctor — is the most important implicit cast
  • conversion operator member-functions
  • numerical type conversions specified by C standard

=== now the explicit casts

  • — C:
  • parentheses cast
  • — C++03 added four casts
  • dynamic_cast — usable on references and pointers.
  • const_cast
  • reinterpret_cast — usually on pointers
  • static_cast
  • — c++11
  • std::move() — only for references, not pointers
  • std::forward()
  • numerous type conversions in type_traits.h
  • — boost:
  • boost::polymorphic_cast? is similar to dynamic_cast but much less popular and not essential knowledge expected of a c++ veteran

 

1)template code bloat 2)inline..hurt i-cache@@

Q: does template code bloat increase instruction cache pressure and increase risk of cache thrashing?
A: i have seen about 8 forums mention of template code bloat increasing i-cache pressure, but no clear evidence, no consensus among authorities. In fact, no known authority has confirmed this issue.

However, I would think having many similar functions in the executable will likely lead to unnecessary competition for i-cache.

In contrast, for inlining there is consensus. wikipedia states — excess inlining will hurt speed, due to inlined code consuming too much of the instruction cache .. Inlining imposes a cost on performance, due to the code expansion (due to duplication) hurting instruction cache performance.[5] This is most significant if, prior to expansion, the working set of the program (or a hot section of code) fit in one level of the memory hierarchy (e.g., L1 cache), but after expansion it no longer fits, resulting in frequent cache misses at that level.

See also inline: footprint+perf can Backfire ! #Google

interrupt+signal ] kernel

There are precise answers in the kernel code, but here are my high-level, imprecise pointers, based on [[linux kernel]]:

  • Based on my “adapted Intel terminology”, interrupts from cpu are known as cpu Exceptions, whereas interrupts from other devices are known as hardware interrupts
    • interrupting hardware includes hardware clocks, keyboard, NIC and other I/O devices.
    • cpu exceptions are generated by problematic/erroneous instructions
  • LG2 — “software interrupts” concept is not so well-defined, and used primarily for 1) notifying debugger 2) implement system calls

Both hardware interrupts and cpu exceptions can generate SIG* signals. Signals are much higher-level constructs than the hardware-level interrupts. Signal as a concept is 50% kernel 50% UserMode:

  • A signal is always addressed to a user process.
  • signal delivery is kernel’s job; signal handling is user process’s job

I feel interrupts^signals are somewhat like cells^molecules — at different levels!

fragmentation: IP^TCP #retrans

I consider this a “halo” knowledge pearl because it is part of an essential everyday service. We can easily find an opportunity to inject it into an Interview. Interviews are unlikely to go this deep, but it’s good to over-prepare here.

This comparison ties together many loose ends like Ack, reassembly, retrans, seq resets..

This comparison clarifies what reliability IP offers + what reliability features it lacks:

  • it offers complete datagrams
  • it can lose an entire datagram — TCP (not UDP) will detect it and use retrans
  • it can deliver two datagrams out of order — TCP will detect and fix it

[1] IP fragmentation can cause excessive retransmissions when fragments encounter packet loss and reliable protocols such as TCP must retransmit ALL of the fragments in order to recover from the loss of a SINGLE fragment
[2] TCP seq# never looks like 1,2,3
[3] IP (de)fragmentation #MTU,offset

IP4 fragmentation TCP fragmentation
minimum guarantees all-or-nothing. Never partial packet stream in-sequence without gap
reliability unreliable fully reliable
name of a “whole” piece IP datagram #“packet/msg” are vague
a TCP session with thousands of sequence numbers
name for a “part” IP fragment TCP segment
sequencing each fragment has an offset each segment has a seq#
.. continuous? yes discontinuous ! [2]
.. seq # reset? yes for each packet loops back to 0 right before overflow
Ack no Ack !
positive Ack needed
gap detection using offset using seq# [2]
id for the “msg” identification number no such thing
end-of-msg flag in last fragment no such thing. Continuous stream
out-of-sequence? likely likely
..reassembly based on id/offset/flag [3] based on seq#
retrans not by IP4 [1][3] commonplace
missing a piece? entire IP datagram discarded[3] triggers retrans

retrans: FIX^TCP^xtap

The FIX part is very relevant to real world OMS.. Devil is in the details.

IP layer offers no retrans. UDP doesn’t support retrans.

 

 

TCP FIX xtap
seq# continuous no yes.. see seq]FIX yes
..reset automatic loopback managed by application seldom #exchange decision
..dup possible possible normal under bestOfBoth
..per session per connection per clientId per day
..resumption? possible if wire gets reconnected quickly yes upon re-login unconditional. no choice
Ack positive Ack needed only needed for order submission etc not needed
gap detection sophisticated every gap should be handled immediately since sequence is critical. Out-of-sequence is unacceptable. gap mgr with timer
retrans sophisticated receiver(ECN) will issue resend request; original sender to react intelligently gap mgr with timer
Note original sender should be careful resending new orders.
See https://drivewealth-fix-api.readme.io/docs/resend-request-message

 

FIX heartBtInt^TCP keep-alive^XDP heartbeat^xtap inactivity^MSFilter hearbeat

When either end of a FIX connection has not sent any data for [HeartBtInt] seconds, it should transmit a Heartbeat message. This is first layer of defense.

When either end of the connection has not received any data for (HeartBtInt + “some reasonable transmission lag”) seconds, it will send a TestRequest probe message to verify the connection. This is 2nd layer of defense. If there is still no message received after (HeartBtInt + “some reasonable transmission lag”) seconds then the connection should be considered lost.

Heartbeats issued as a response to TestRequest must contain the TestReqID transmitted in the TestRequest. This is useful to verify that the Heartbeat is the result of the TestRequest and not as the result of a HeartBtInt timeout. If a response doesn’t have the expected TestReqId, I think we shouldn’t ignore it. If everyone were to ignore it, then TestReqId would be a worthless feature added to FIX protocol.

FIX clients should reset the heartbeat interval timer after every transmitted message (not just heartbeats).

TCP keep-alive is an optional, classic feature.

NYSE XDP protocol uses heartbeat messages in both multicast and TCP request server. The TCP heartbeat requires client response, similar to FIX probe message.

Xtap also has an “inactivity timeout”. Every incoming exchange message (including heartbeats) is recognized as “activity”. 60 sec of Inactivity triggers an alert in the log, the log watchdog…

MS Filter framework supports a server-initiated heartbeat, to inform clients that “I’m alive”.  An optional feature — heartbeat msg can carry an expiration value a.k.a dynamic frequency value like “next heartbeat will arrive in X seconds”.

c++condVar 2 usages #timedWait

poll()as timer]real time C : industrial-strength #RTS is somewhat similar.

http://www.stroustrup.com/C++11FAQ.html#std-condition singles out two distinct usages:

1) notification
2) timed wait — often forgotten

https://en.cppreference.com/w/cpp/thread/condition_variable/wait_for shows std::condition_variable::wait_for() takes a std::chrono::duration parameter, which has nanosec precision.

Note java wait() also has nanosec precision.

std::condition_variable::wait_until() can be useful too, featured in my proposal RTS pbflow msg+time files #wait_until

c++11 atomic{int}^AtomicInteger.java #Bool can be half-written

The #1 usage of atomic<int> is load() and store(). I will use short form “load/store” or “l/s”.

The #2 usage is CAS.  Interviewers are mostly interested in this usage, though I won’t bother to remember the function names —
* compare_exchange_strong()
* compare_exchange_week()


The CAS usage is same as AtomicInteger.java, but the load/store usage is more like the thread-safety feature of Vector.java. To see the need for load/store, we need to realize the simple “int” type assignment is not atomic [1]:

  • P1012 [[c++ standard library]] shocked me by affirming that without locks, you can read a “half-written Boolean” [1].

To solve this problem, atomic<int> uses internal locks (just like Vector.java) to ensure load() and store() is always atomic.

[1] different from java. https://stackoverflow.com/questions/11459543/should-getters-and-setters-be-synchronized points out that 32-bit int in java is never “half-written”. If you read a shared mutable int in java, you can hit a stale value but never a c++ style half-written value. Therefore, java doesn’t need guarded load()/store() functions on an integer.

Q: are these c++ atomic types lock-free?
A: for load/store — not lock-free. See P 1013
A: for CAS — lock-free CPU instructions are used, if available.

##which common UDP/TCP functions are blocking

Q: which blocking functions support a timeout?
A: All

A non-blocking send fails when it can’t send a single byte, usually because destination send buffer (for TCP) is full. For UDP, see [4]

Note read() and write() has similar behavior. See send()recv() ^ write()read() @sockets and http://www.mathcs.emory.edu/~cheung/Courses/455/Syllabus/7-transport/flow-control.html

[1] meaning of non-blocking connect() on TCP is special. See P51[[tcp/ip sokets in C]] and https://www.scottklement.com/rpg/socktut/nonblocking.html
[2b] accept() returning with timeout is obscure knowledge — see accept(2) manpage
[2c] accept() is designed to return immediately on a nonblocking socket — https://stackoverflow.com/questions/12861956/is-it-possible-and-safe-to-make-an-accepting-socket-non-blocking and https://www.scottklement.com/rpg/socktut/nonblocking.html
[3] select() on a non-blocking socket is obscure knowledge —  See https://www.scottklement.com/rpg/socktut/nonblocking.html
[4] UDP has no send buffer but some factors can still cause blocking … obscure knowledge
[5] close() depends on SO_LINGER and non-blocking by default … https://stackoverflow.com/questions/6418277/how-to-close-a-non-blocking-socket
[6] MSG_DONTWAIT flag

t/out default flags arg to func fcntl on entire socket touching TCP buffers?
y select blocking not supported? still blocking! [3]  no
epoll blocking not supported?  no
y recv blocking can change to NB [6] can change to NB  yes
y send/write TCP blocking can change to NB [6] can change to NB  yes
y recvfrom blocking can change to NB [6] can change to NB  yes
y sendto UDP can block [4] can change to NB [6] can change to NB  yes
y close() with SO_LINGER not the default not supported yes yes
close() by default NB [5] not supported can change to blocking yes
y[2b] accept by default blocking not supported yes yes
accept on NB socket [2c] not supported yes no
LG connect()TCP blocking not supported? can change to NB [1] no
LG connect()UDP NB NotApplicable

size of Object.java, and how does it add up@@

Typically, the per-object overhead is 8 bytes on 32-bit, and 12-byte on a 64-bit machine, as shown on https://stackoverflow.com/questions/44468639/memory-alignment-of-java-classes, but sometimes rounded to 16 bytes.

(Hypotheses and Personal opinion only. I’m no authority.)

Minimum info to be stored in an Object.java instance —

– (4-byte) wait-set as a hidden field — to hold the waiting threads. Expandable set, so perhaps a WS_pointer to a linked list
** It’s probably inefficient to try to “save” memory by building a static VM-wide lookup table of {instance-address, wait-set}. As this table grows, such hashtable lookup is going to impact the most time-critical operations in JVM. Not having this lookup means minimum overhead locating the wait-set.
** this particular WS_pointer starts out as null pointer
** Note c# value types don’t have wait-set. c# Reference types do. The sync-block takes 4-bytes
** why linked list? Well, Arrays can’t grow. Vector involves re-allocation.

– How about the data structure holding the set of threads blocked in lock() or synchronized keyword? A mutex “contender set” associated with each Object.java instance? Apparently there’s no such thing mentioned in popular literature. Is it possible to put these contenders in the wait-set but use a flag to distinguish?

– (4-byte) vptr as a hidden field — If you try to be clever and put this ptr as the first element of the wait-set, then every Object instance still must occupy 64bits == WS_pointer + 1st element in the wait set. Therefore it’s faster to store the vptr directly in the Object instance. Note the vtable also holds the runtime type info, just as in c++ RTTI. Vtable can even hold a pointer to the “class” object.

– ? pointer to the class object as a hidden (static) field — representing the Runtime type of the object.
** This can be stored in the per-class vtbl, as C++ does. This hidden field occupies 32 bits per Class, not per instance.
** I believe myObject.getClass() would use this pointer.
** See my blog post on type info stored inside instances (http://bigblog.tanbin.com/2012/02/type-info-stored-inside-instances-c-c.html).

– hashcode — Once generated, hashcode must not mutate (due to garbage collection relocation), until object state changes.
** I feel this can live on the wait-set, since most Object instances don’t get a hashcode generated.
** Why not use another hidden field? Well, that would require 32 bits dedicated real estate per object, even if an object needs no hashcode in its lifetime.
** Note python hashcode never mutates. See separate blog post.

?? this-pointer as a hidden field — to hold “my” own address. Remember this.m(…) is compiled to m(this, …) to let m() access fields of the host object.
** However, I believe this-pointer is possibly added (as a hidden field) only when you subclass Object.java and introduce a field, and then a method referencing that field. There’s no need to add this-pointer to classes whose instance methods never access instance fields. Such an object doesn’t (need to) know it’s own address — like a lost child.
** even in those “other” cases, I feel it’s possible for the compiler to remove this-pointer as a field (reducing memory footprint) because compilers implicitly translate myInstance.m1() to m1(&myIntance)

In conclusion, I’d guess minimum size = 8 bytes but will grow when hashcode generated.

Q1a: min size of a c# struct or any value type instance?
%%A: C# structs need zero overhead, so an Int32 instance needs just 32 bits.
%%A:
When we call myStruct.m2(), there’s no vptr involved — just a compile-time resolved function call as in C and non-OO languages. Compiler translates it to some_Free_Function(ptr_to_myStruct). Note even for this value-type there’s no pass-by-value — no passing at all. Just translation by compiler.

Q1b: min size of c# System.Object instance.
A: Supposed to be 8 bytes minimum (A value type instance has no such overhead)
* 4-byte vptr
* 4-byte sync block
A real instance seem to be 12-bytes long.

Q2: why is the minimum size of an empty c++ object no more than 1 byte?
%%A: obvious from the analysis above.

Q2b: why not 0 byte?
A: an object is defined as a storage location. Even if there’s no field in it, a=new MyObject() and b=new MyObject() (twice in a row, single-threaded program) must produce 2 objects at 2 locations.

Note the size of an empty c++ string class instance is 12 bytes, according to [[c++ primer]]

According to P89 [[C# in depth]] a C# byte takes 1 byte, but 8+1 bytes of heap usage when “boxed”. These 9 bytes are rounded up to 12 bytes (memory alignment). Since heap objects are nameless and accessed only via a pointer, the pointer becomes a field (of the boxed object) and takes 4 bytes (assuming a 32-bit machine)

[09]global free func^free func]a namespace^static method #ADL

In java there are static methods and nothing else. In c there are “functions” and nothing else. In c++ there are 3 alternatives.

1) static methods — like java
** unlike the free functions, static methods can be private or protected
** can access private static fields of host class
** unlike the free functions, static methods are inherited. See Operator-new in [[eff c++]]
** usually don’t need name space prefix

2) free func/operators in a “package” like a boost library or STL, typically organized into namespaces. Better than global. See ADL technique

3) GLOBAL free func /operators — vestige of C syntax

In quick and dirty applications (not libraries), you see lots of global free functions.

double-colon q[ :: ] in c++

Recently I hit this confusion multiple times. The compiler is not confused because the usage scenarios are distinct —

  1. usage — namespace1::free_func2(). Example – you can specify std::swap() instead of your local swap()
    • somewhat similar to java package separator the dot
  2. usage — classA::staticMethod1()
  3. usage (very useful) — superclassA::instanceMethod3(). This is equivalent to this->superclassA::instanceMethod3() //tested
  4. usage — classB::localType

Actually, the first 2 scenarios have similar meanings. Java simply merged both into the dot.

boost::this_thread::get_id() shows two usages.

I believe a field name can replace the function name.

MyNamespace1::MyNamespace2::j = 10; // nested namespace
std::terminate() and ::operator new() are similar to System.java and Runtime.java methods.

In a class template, the #2 and #4 usages can confuse the compiler. See P670 [[c++primer]].