[11]concurrent^serial allocation in JVM^c++

–adapted from online article (Covalent)

Problem: Multithreaded apps create new objects at the same time. During object creation, memory is locked. On a multi CPU machine (threads run concurrently) there can be contention

Solution: Allow each thread to have a private piece of the EDEN space. Thread Local Allocation Buffer

You can also Analyse TLAB usage -XX:+PrintTLAB

Low-latency c++ apps use a similar technique. http://www.facebook.com/notes/facebook-engineering/scalable-memory-allocation-using-jemalloc/480222803919 reveals insight of lock contention in malloc()

java array footprint: hypothetical calc

In low-latency or high-volume java apps (like exchange-facing), I think array of primitives is quite popular due to its efficiency advantage. Here I review a few potential footprint issues. Suppose we have an array AA of 12 ints, each 32-bit.

  • AA carries housekeeping data such as the array length (i.e. 12). AA also has the “standard” object header, of twelve (or eight) bytes… See size of Object.java, and how does it add up@@
  • The 12 ints take up 48 bytes, allocated in contiguous memory… lucky. For 12 Trade objects, we would get 12 scattered allocations.. expensive for runtime allocation and cache-unfriendly.
  • So we have 12 + 48 = 60 bytes so far. There might be padding for alignment. 64-bit CPU probably uses 8-byte alignment as shown on https://stackoverflow.com/questions/44468639/memory-alignment-of-java-classes
  • Finally, footprint is 64 bytes, but I have not verified it.

java off-heap memory #cf database,reinterpret_cast

  • justification — see “jGC” item
  • tradeoff — ?
  • java.nio.ByteBuffer — non-blocking IO api is very relevant. It has API to access memory-mapped files, among other things. More often, I think we probably use NIO to request a big chunk of memory from kernel, and bypass the java heap.
  • serialization — most (if not all) off-heap techniques require objects serialized and stored off-heap.
  • byte array — Unlikely object pool, off-heap memory solutions must serialize Trade objects into binary byte array [1] and stores it. It has to be transformed back to Trade objects before use. C would use reinterpet_cast() — no disk involved.
  • offloading jGC — With huge data moved off the heap, jGC is now much faster, and jitter-free. This is the main goal and justification of off-heap
  • bigger — off-heap storage can be bigger than on-heap storage.
  • slower — off-heap storage is slower (less immediate) than on-heap, but faster than disk or database[2].
  • performance boost — even though “slower”, it can boost performance of GC.
  • memory sharing — between two JVM’s is easier if using off-heap. I think memory-mapped file is one means.

[1] In this way, the NYSE XDP protocol could arguably be considered an off-heap serialization protocol.

[2] database is actually a competing alternative solution for some use cases like “shared-cache” or “large lukewarm cache”. I feel database solution is very proven, reliable but slower than off-heap.

Among the many disparate online articles about off-heap, here are my top 2 fav

  1. https://stackoverflow.com/questions/6091615/difference-between-on-heap-and-off-heap
  2. https://www.tutorialdocs.com/article/spark-memory-management.html

embed char_array ] my java object #XR

With market data it’s common to use some Message class(s) that “embeds” a fixed-length character array like 20-char for example.

Allocating an array object off-site on heap is very costly in memory footprint. One extra allocation per Message.

Also slower reading at run time due to data-cache inefficiency. Data cache favors contiguous data structures. See CPU(data)cache prefetching

c/c++ and c# (via Struct) can easily support exactly this “embedding”. I feel java also has some support. Beside JNI, I wonder if there’s another, pure-java solution.

Q: in java, how can I have embedded fixed-length char-array field in my Message or Acct object, rather than a separate array object allocated somewhere off-site?

  1. Solution: If the fixed length is small like 10, I could maintain 10 individual char fields.
  2. Solution: assuming the chars are ascii (8-bit rather than 16-bit in java), I can group first eight chars into a 64-bit long int field. Provide a translation interface when reading/writing the field. With 10 such fields I can support 80-char embedded.
  3. Solution: If not possible, I would use a gigantic singleton off-site char array to hold fixed-length “segments”. Then I need a single int “position”. Every Acct object has a field this.position, where this.position * fixedLength = offset, to identify one segment.
  4. There are two published solutions described in ringBuffer@pre_allocated objects to preempt JGC

Among them, not sure which solution is fastest in practice.

ringBuffer@pre_allocated objects to preempt JGC

Goal — to eliminate JGC completely.

Design 1: I will want Order.java to use primitive fields only and avoid reference fields [1] at all cost, so the total footprint of an Order is known in advance. Say it’s 100 bytes. I will create 10M of dummy Order instances, possibly scattered in heap, not adjacent as in c++, and hold their 10M addresses in an Order array… about 1GB footprint for the Order objects + 80M footprint for the array of 8-byte pointers.

(Note I reuse these Order instances in this object pool and never let them get garbage-collected.)

Then i need a few subscripts to identify the “activeRegion” of the ring but how about released slots enclosed therein?

[1] timestamps will be ints; symbolIDs and clientIDs are ints; short ascii strings will use 64-bit ints (8 characters/int); free-form strings must be allocated off-site:(

Design 2a: To avoid the “scatter” and to place the Order instances side by side, Can we use a serialized byte[100] array object to represent one Order? Can we use one gigantic off-heap byte array to hold all Orders, eliminating the 80M footprint? See java off-heap memory

Design 2b: https://blog.bramp.net/post/2015/08/26/unsafe-part-2-using-sun.misc.unsafe-to-create-a-contiguous-array-of-objects/ shows a contiguous array of java objects, like std::vector<MyObject>

Design 2c: https://www.ibm.com/support/knowledgecenter/en/SSYKE2_7.1.0/com.ibm.java.lnx.71.doc/user/packed_optimizing.html is a feature in IBM jvm

Ring buffer is good if the object lifetimes are roughly equal, giving us FIFO phenomenon. This occurs naturally in market data or message passing gateways. Otherwise, we may need a linked list (free list) of released slots in addition to a pair of subscript to identify the active region.

It might be better to allocate a dedicated buffer for each thread, to avoid contention. Drawback? One buffer may get exhausted when another stays unused.

jvm footprint: classes can dominate objects

P56 of The official [[java platform performance]], written by SUN java dev team, has pie charts showing that

  • a typical Large server app can have about 20% of heap usage taken up by classes, rather than objects.
  • a typical small or medium client app usually have more RAM used by classes than data, up to 66% of heap usage take up by classes.

On the same page also says it’s possible to reduce class footprint.

jvm heap histogram simple console tool

The simplest among many similar tools. This type of analysis is easier in java than in c++ because jvm manages memory allocations. In fact, GC can even relocate objects.

~$jcmd test GC.class_histogram

num #instances #bytes class name
 1: 1020 93864 [C
 2: 481 54856 java.lang.Class
 3: 526 26072 [Ljava.lang.Object;
 4: 13 25664 [B
 5: 1008 24192 java.lang.String
 6: 79 5688 java.lang.reflect.Field
 7: 256 4096 java.lang.Integer
 8: 94 3760 java.lang.ref.SoftReference
 9: 91 3712 [I
 10: 111 3552 java.util.Hashtable$Entry
 11: 8 3008 java.lang.Thread

jvm spawns(approx)32 JGC threads on 32-core

Based on https://jaxenter.com/nobody-puts-java-container-139373.html

jvm would spawn 32 GC threads on a 32-core box [1].  As of 2018, this is the default, but you can change it with jvm parameters like -XX:ParallelGCThreads and -XX:ConcGCThreads

Java 9 introduced automatic detection of cpu-set when jvm runs in a cgroup (such as a docker container) with a cpu-set. For example, JVM detects there are 3 cores in the cpu-set, and spawns 3 GC threads.

[1] presumably for the parallel GC or the CMS GC but i’m not sure. https://docs.oracle.com/javase/8/docs/technotes/guides/vm/gctuning/parallel.html and https://blog.codecentric.de/en/2013/01/useful-jvm-flags-part-6-throughput-collector/ says for the parallelGC the default would be 5/8*32 + 3 = 23 threads.

##%%c++(n java)memoryMgmt trec : IV

Q: Your resume says “memory mgmt” for c++, so what did you do personally, not as a team?
A: I will talk about the features in my system. Most of them are not written by me, honestly

  • per-class: restricting allocation to heap or stack
  • per-class: customize operator new for my class
  • per-class: disable operator new for my class and provide allocateFromRingBuffer() API
  • per-class: move support
  • static thread_locals
  • ring buffer to eliminate runtime allocation
  • custom smart pointers
  • memory leak detector tools like valgrind+massif .. breakdown heap/non-heap footprint@c++app #massif
  • RAII
  • pre-allocate DTOs@SOD #HFT #RTS
  • customize new_handler() — I didn’t write this

Q: java

  • GC tuning
  • memory sizing
  • JDK tools like jhat
  • memory profiler tools
  • string intern
  • investigating memory leaks, all related to GC

handle java OOM @runtime

https://stackoverflow.com/questions/2679330/catching-java-lang-outofmemoryerror says "only very infrequently is OutOfMemoryError the death-knell to a JVM. There is only one good reason to catch an OutOfMemoryError and that is to close down gracefully, cleanly releasing resources and logging the reason for the failure best you can (if it is still possible to do so)."

Note in a Docker container, OOM leads to immediate crash of not only JVM but entire container.

eg@ JGC-JNI interaction: String

First, let’s remember Java string objects are collected by GC. In fact, String.java instances are often the biggest group of “unreachable” objects in a GC iteration.

I guess interned string objects are treated differently … need more research just for QQ interviews.

GC thread has builtin safety checks to cooperate with application threads to avoid collecting a string.

However, JNI code is not 100% compatible with GC. Example from [[Core java]]:

Suppose a JNI C function instantiates a jstring object in the jvm heap (not in native memory), to be deallocated by GC.

Note the GC doesn’t know when this object is unreachable. To inform GC, JNI function need to call ReleaseStringUTFChars().

JGC G1 Metaspace: phrasebook #intern

Incidentally, NIO buffer is also in native memory


CMS JGC: deprecated in java9

Java9/10 default GC is G1. CMS is officially deprecated in Java 9.

Java8/7 default GC is ParallelGC, CMS. See https://stackoverflow.com/questions/33206313/default-garbage-collector-for-java-8

Note parallelGC uses

  • parallel in most generations
  • serial in old gen

…whereas parallelOldGC uses parallel in all generations.

Q: why is CMS deprecated?
A: one blogger seems to know the news well. He said JVM engineering team needs to focus on new GC engines and need to let go the most high-maintenance but outdated codebase — the CMS, As a result, new development will cease on CMS but CMS engine is likely to be available for a long time.

## low-complexity QQ topics #JGC/parser..

java GC is an example of “low-complexity domain”. Isolated knowledge pearls. (Complexity would be high if you delve into the implementation.)

Other examples

  • FIX? slightly more complex when you need to debug source code. java GC has no “source code” for us.
  • socket programming? conceptually, relatively small number of variations and combinations. But when I get into a big project I am likely to see the true color.
  • stateless feed parser coded against an exchange spec

JGC duration=100→10ms ] “Z-GC” #frequency

This blog has many posts on JGC overhead.

  • For low-Latency JVM, duration (pause time) outweighs frequency
  • For mainstream JVM, overhead + throughput outweighs duration


Could be Every 10 sec , as documented in my blogpost

–stop-the-world duration:

100 mills duration is probably good enough for most apps but too long for latency sensitive apps, according to my blogpost.

For a 32GB JVM in a latency-sensitive Barclays system, worst long pause == 300ms.

The new Z-GC features GC pause times below 10ms on multi-terabyte heaps. This is cutting-edge low pause.

java collection in a static field/singleton can leak memory

Beware of collections in static fields or singletons. By default they are unbounded, so by default they pose a risk of unexpected growth leading to memory leak.

Solution — Either soft or weak reference could help.

Q: why is soft reference said to support memory/capacity-sensitive cache?
A: only when memory capacity becomes a problem, will this soft reference show its value.

Q: is WeakHashMap designed for this purpose?
A: not an ideal solution. See other posts about the typical usage of WHM

Q: what if I make an unmodifiable facade?
A: you would need to ensure no one has the original, read-write interface. So if everyone uses the unmodifiable facade, then it should work.

[14]JGC tuning: will these help@@

Q: have a small nursery generation, so as to increase the frequency of GC collections?

AA: An optimal nursery size for maximum application throughput is such that as many objects as possible are garbage collected by young collection rather than old collection. This value approximates to about half of the free heap.

Q: have maximum heap size exceeding RAM capacity.
A: 32-bit JVM won’t let you specify more than 4G even with 32 GB RAM. Suppose you use 64-bit JVM, then actually JVM would start and would likely use up all available RAM and starts paging.

2 JGC algos for latency^throughput

https://databricks.com/blog/2015/05/28/tuning-java-garbage-collection-for-spark-applications.html is a 2015 Intel blog.

Before the G1 algo, Java applications typically use one of two garbage collection strategies: Concurrent Mark Sweep (CMS) garbage collection and ParallelOld GC (similar to parallelGC, the java8 default).

The former aims at lower latency, while the latter is targeted for higher throughput. Both strategies have performance bottlenecks: CMS GC does not do compaction, while Parallel GC performs only whole-heap compaction, which results in considerable pause times.

  1. For applications with real-time response, “generally” (by default) we recommend CMS GC;
  2. for off-line or batch programs, we use Parallel GC. In my experience, this 2nd scenario has less stringent requirements so no need to bother.

https://blog.codecentric.de/en/2013/01/useful-jvm-flags-part-6-throughput-collector/ (2013) has an intro on Throughput vs. pause times

suggest(hint) a GC run – CLR ^ JVM

Java’s gc() method is a suggestion/plea to JVM at run-time. (There’s no way to force an immediate GC cycle. I often call gc() some 4000 times in a few loops to coax the GC to start.)

CLR Collect() method with Optimized mode is exactly the same.

CLR Collect() method with Forced mode would force an immediate collection. No such feature in JVM.

All of these techniques are discouraged. When are they justified?

y java object always allocated on heap, again

Short answer #1: Because java objects are passed by reference.

Suppose a holder object (say a collection, or a regular object holding a field) holds a Reference to a student1 object. If student1 object were allocated on stack, and de-allocated when stack frame unwinds, the holder would hold a stray pointer. Java is allergic to and won’t tolerate a single stray pointer.

Java primitive entities are always passed by value (pbclone). No stray pointer – no pointer involved at all.

Short answer #2: all pbref entities live safer on heap.

How about c# and C++? Somewhat more complicated.

stop-the-world inevitable: either minor or major JGC

Across All of Sun’s GC engines so far (2012), young generation (eden + survivors) algorithm has _always_ been STW. The algo changed from single-threaded to parallel, but remains STW. Therefore, during a minor GC ALL application threads are suspended. Usually a short pause.

Across All of Sun’s GC engines so far (2012), oldgen is at best low-pause but _never_ no-pause. For the oldgen, there is _always_ some pause, due to an inevitable STW phase.

Note one of the oldgen algorithms (i.e. CMS) is mostly-concurrent, meaning the (non-concurrent) STW phase is a brief phase — low-pause. However, young gen algorithm has always been STW throughout, without any concurrent phase.

jvm stack ^ C stakck in JNI@@

Background — we know the jvm heap and the C heap (or off-heap?) are separate. Among other things, GC can relocate and clean up stuff in jvm stack only.

Q: is the stack segment divided into the jvm stack and the C stack?

%%A: yes and no. Stack segment is naturally divided into individual call stacks. If there are 33 jvm stacks and 11 of them extend into JNI, and there are 5 threads created in C, then naturally a shared stack area and 2 unshared stack areas exist in the stack segment.

Even the heap can be divided per thread, if you realize your objects aren’t shared across threads. I was told subheap-per-thread speeds up heap memory allocation, because the default allocator (malloc and derivatives) must synchronize to access the free list

See https://pangin.pro/posts/stack-overflow-handling

try to constrain objects to Eden #UBS tip

If you want to avoid full GC and can tolerate short jitters due to partial GCs, then here is a practical and effective technique.

Idea is to minimize promoting objects into the 2 survivor spaces or further into OldGen.

Make most of your newly created objects unreachable (i.e. garbage-collectible) ASAP. Well, even if I do that asap, it may still be 3000 instructions after instantiation, and in a multi-threade JVM, we don’t know how long that means to the memory allocator. Here’s the deal — Suppose we were a spy inside the eden and monitor the amount of reachable vs unreachable objects. Whenever eden gets full, I  want to see a good amount of unreachables.

If a large object is needed over and over again, put it into object pool (or local cache) and get it promoted into OldGen.

Keep eden the standard size, perhaps 1/8 (not 25%) of heap. No need to increase it.

[11]how2incur 1 STW JGC/day: #1 low-latency java

Big eq trading desks in Citi, UBS and other banks claim they incur a single StopTheWorld GC “penalty” a day in most jvm instances (where an instance is typically a node in a scalable grid). Here are some techniques.

1) “pre-allocate” long-living objects. Probably allocate a large array of primitive objects. Each “primitive object” could be a byteArray of fixed size. No POJO no java beans.
1b)) don’t let GC do the clean-up; reuse the memory pre-allocated. Reuse by dropping in new data. Circular array of fixed-size byteArray is one of my ideas. RTS xtap also uses circular array to hold new messages. There are millions of them a day.

See pre-allocate DTOs@SOD #HFT #RTS

2) avoid promoting objects beyond the nursery/eden — see other posts in this blog.

3) run 64bit and allocate enough heap (typically 10G) to last a day.
)) Each jvm instance presumably serves only US, or only Hongkong…, so it could shutdown nightly.
)) If we still need more heap, split to 2 jvm instances.
)) avoid distributed cache and latency. Barc DMA team also avoided distributed caching.
)) How do you size heap? See post on heap sizing

5) During a scheduled quiet period, preemptively trigger a full GC throughout the entire (huge) heap, which could actually take 30 minutes. Just call Runtime.gc(). If this doesn’t trigger GC, then tweak -XdisableExplicitGC. This flag disables explicit calls to Runtime.gc() but has rare side effects — https://www.ibm.com/developerworks/community/blogs/kevgrig/entry/malpractice_do_not_permanently_use_xdisableexplicitgc_or_xx_disableexplicitgc?lang=en

2 survivor spaces

— Annotated Sun documentation —

Java defines two generations: the young generation (sometimes called the “nursery”) and the old generation. The young generation consists of an “Eden space” and __two__”survivor spaces.” The VM initially assigns all objects to the Eden space, and most objects die there. When it performs a minor GC in Eden, the VM moves any remaining objects from the Eden space to __one__ of the 2 survivor spaces.

The VM moves objects that live long enough in the survivor spaces to the “tenured” space. When the tenured generation fills up, there is a full GC that is often much slower because it involves all live objects.

The permanent generation holds all the reflective data of the virtual machine itself, such as class and method objects.

— adapted from online article (Covalent) —
1. work work work
2. When EDEN is full -> minor collection copies surviving objects into 1st survivor space
3. work work work
4. Eden is full again -> minor GC again
Copy from Eden to 2nd
Copy from 1st to 2nd
5. If 2nd fills up and objects remain in Eden or 1st, these get copied to the tenured

One survivor space is always empty. Serves as destination for minor collections

6. work work work
7. Major collections occur when the tenured space fills up. Major collections free up Eden and both survivor spaces

object, referent, and reference — intro to weak reference

see also post on pointer.

— physical explanation of the 3 jargons
A regular object (say a String “Hi”) can be the target of many references
* a regular strong reference String s=”Hi”;
* another regular strong reference String ss1=s; // remote control duplicated
* a weak ref WeakReference wr1 = new WeakReference(ss1);
* another wak ref wr2 = new WeakReference(s);

1) First, distinguish a regular object “Hi” from a regular strong reference ss1.
* The object “Hi” is a “cookie” made from a “cookie cutter” (ie a class), with instance and static methods defined for it.
** like all objects, this one is created on the heap. Also, think of it as an onion — see other posts.
** “Hi” is nameless, but with a physical address. The address is crucial to the “pointer” below.

* strong reference ss1 is a “remote control” of type String. See blog post [[an obj ref = a remote control]]
** People don’t say it but ss1 is also a real thingy in memory. It can be duplicated like a remote control. It has a name. It’s smaller than an object. It can be nullified.
** It can’t have methods or fields. But it’s completely different from primitive variables.
** method calls duplicate the remote control as an argument. That’s why java is known for pass-by-value.

2) Next, distinguish a strong reference from a pointer.
* a regular strong reference ss1 wraps a pointer to the object “Hi”, but the pointer has no technical name and we never talk about the pointer inside a strong reference.
** the pointer uses the “address” above to locate the object
** when you duplicate a remote control, you duplicate the pointer. You can then point the old remote control at another object.
** when no pointers point to a given object, it can be garbage collected.

Summary — a pointer points to the Object “Hi”, and a Strong reference named ss1 wraps the pointer. Note among the 3, only the strong ref has a name ss1. The other strong ref s is another real thingy in memory.

3) Now, weak reference. A weakref wr1 wraps a pointer to an obj (ie an onion/cookie), just like a strong ref. In a weak reference (unlike strong reference) the pointer is known as a “referent in the reference”. Note wr1’s referent is not ss1, but the object “Hi” referenced by ss1. That’s why “Hi” is target of 4 pointers (4 reference). This object’s address is duplicated inside all 4 variables (4 references).

“Referent in a reference” is the pointer to object “Hi”. If you “set the referent to null” in wr1, then wr1 doesn’t point to any object in memory.

Unlike strong references, a weak reference is sometimes mentioned like an object, which is misleading.

Q: Is the weak ref wr1 also an object? We explained a strong ref ss1 differs completely from the object “Hi”.

A: Yes and no. you create it with new weakReference(..) but Please don’t treat it like a regular object like “Hi”. In our discussion, an object means a regular object, excluding reference objects. [[Hardcore Java]] page 267 talks about “reference object” and “referenced object” — confusing.
A: java creators like to dress up everything in OO, including basic programming constructs like a thread, a pointer, the VM itself, a function (Method object), or a class (Class object). Don’t overdo it. It’s good to keep some simple things non-OO.

A: functionally, wr1 is comparable to the remote control ss1. Technically, they are rather different

JVM tuning is 80% memory tuning, which is 80% JGC tuning

Look at real time java documentations. I feel most of the design
effort went into memory optimization.

I once read an article on Weblogic server tuning. The JVM tuning
section is 90% on memory tuning. Threads is a much smaller section.

Finally, I had a long discussion with a Oracle/BEA/Sun consultant. The
primary focus of JVM tuning is GC overhead and long pauses.

[11]how to incur 1 STW JGC/day: #2 heap sizing

(See other post on 1 GC/day)
UBS eq system in Stamford claims to incur a single GC a day. They probably need enough RAM for the heap. 32bit  gives you about 3GB memory, probably insufficient.

Q: Just How much RAM is sufficient?
A: In trading, 16G is a common figure, all dedicated to a single JVM.
A: In a risk engine, each machine has 512GB but runs multiple JVMs.
A: Here’s an answer from an Oracle/BEA tuning consultant (XiaoAn)+ my own input.

1) decide if caching is suitable.
1b) if yes, then make sure caches have size-control and leak-control

2) eliminate obvious memory hogs, reuse objects and pools if possible
3) Profile your engine in load test.
4) tentatively release to production. If it actually uses much higher memory, then investigate why the estimate failed. Possible reasons

– leaks
– cache

Many memory profilers can count how many instances of a particular class (like String.java) there are. If your own class has too many instances, you can add instrumentation to ctor

basic JVM tuning – 2 focuses

JVM tuning is part of system tuning, often a small part. I feel DB tuning and application tuning often/usually provide better ROI. In a sense, JVM is already very well-turned compared to application code (in any langauge). It’s worthwhile to keep the ROI in mind. After you make sure JVM performance metrics look reasonable, further effort often yields diminishing ROI. Better focus on other parts of the system.
It’s critical to establish measurable targets, otherwise we can’t tell when JVM performance metrics are reasonable. In practice, 2 main targets are
– OO) GC overall overhead – to be reduced, expressed as a percentage of system resources spent on GC
– PP) long pauses are often unacceptable and should be eliminated. Short jitters are more common and more tolerable
Once OO and PP are in reasonable range, further JVM tuning gives diminishing return, according to real world experiences.

You can see OO) in verbose GC output (more easily on GCHisto). 1-3% is ok. This goal is important to batch systems. An exchange gateway guy told me PP is more important. 10% OO is tolerable in the low-pause CMS.

You can see PP) in verbose GC output. One minute would be too long.  This goal is important to latency sensitive systems like most trading engines, where Even 100millis is bad. Long pause is often due to full GC. A first step is to turn on concurrent mark and sweep collector. Some trading shops create their own GC.

Real time java addresses PP. Not easy. UBS claims they incur a single GC “penalty” a day, so PP is eliminated. See other blog posts on “GC/day”

If your overall Overhead is too high, you need to see breakdown, then divide and conquer. A key metric is allocation rate.

Related tools needed for jvm tuning
– jstat
– jconsole
– hpjmeter

y java CMS needs remarking #my take

CMS marking consists of 3 phases — Initial Mark -> Concurrent Mark -> ReMark. I used to feel “once an object is found dead/unreachable, it will remain forever unreachable, and will never come back to life, and therefore needs no double-check”. Reality is, during IM and CM you can’t conclude “ok this object X is definitely unreachable”

Remember the #1 job duty of marking is to ensure EVERY (i am not saying “only”) live objects is identified.

Initial Mark (IM) puts together a list of “immediately reachable addresses”. Job not completed.

Concurrent marking (CM) is an imprecise process regarding the job duty #1. Imagine object A (say some collection) holds a pointer to our object at address X and is transferring that pointer to object B (like another collection). “Transfer” in the form of baton in a relay race — never drop the baton (once dropped it’s unreachable forever). If CM algorithm visits B before A, it may not catch sight of the (on-the-move) pointer to X. In such a case, it will not notice address X is reachable. Job not completed!

Only in the RM phase is the “job” completed.

This is part of the deal with Concurrent marking in the low-pause CMS. With STW you won’t get this complication.

sizeof.java – complete class

package com.gs;

import java.lang.management.GarbageCollectorMXBean;
import java.lang.management.ManagementFactory;
import java.lang.management.MemoryMXBean;
import java.lang.management.MemoryPoolMXBean;
import java.lang.management.MemoryUsage;
import java.util.Arrays;
import java.util.Formatter;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// —————————————————————————-
 * @author <a href="mailto:vlad@trilogy.com“>Vladimir Roubtsov, modified by
 *         TAN,Bin
public class Sizeof {

    // this is our way of requesting garbage collection to be run:
    // [how aggressive it is depends on the JVM to a large degree, but
    // it is almost always better than a single Runtime.gc() call]
    static public void runGC() {
        // for whatever reason it helps to call Runtime.gc()
        // using several method calls:
        for (int r = 0; r < 4; ++r)

    static public StringBuilder printHeapSegments4JDK6() {
        final StringBuilder ret = new StringBuilder();
        final Formatter formatter = new Formatter(ret);
        final List gcMBeans = ManagementFactory
        for (GarbageCollectorMXBean collector : gcMBeans) {
            final long justRan = justRan(collector);
            if (justRan > 0) {
                        ”    ^ ^ %16s collector ran %d/%d time(s) and covers “
                                + Arrays.toString(collector
                                        .getMemoryPoolNames()) + “n”,
                        collector.getName(), justRan, collector

        final MemoryMXBean memMXBean = ManagementFactory.getMemoryMXBean();
        memMXBean.setVerbose(true); // not sure how useful
        MemoryUsage heap = memMXBean.getHeapMemoryUsage();
        ret.append(heap.toString() + “n”);

        MemoryUsage nonHeap = memMXBean.getNonHeapMemoryUsage();
        ret.append(nonHeap.toString() + “n”);

        List pool = ManagementFactory.getMemoryPoolMXBeans();
        for (int i = 0; i < pool.size(); i++) {
            MemoryPoolMXBean bean = pool.get(i);
            ret.append(bean.getName() + “t”);
            ret.append(bean.getType() + “t”);
            ret.append(bean.getUsage() + “n”);
        return ret;

    private static long justRan(GarbageCollectorMXBean collector) {
        final Long newCount = collector.getCollectionCount();
        if (newCount <= 0)
            return 0;
        Long oldCount = collectionCounts.put(collector, newCount);
        if (oldCount == null)
            return newCount;
        long ret = newCount – oldCount;
        return ret > 0 ? ret : 0;

    private static ConcurrentHashMap collectionCounts = new ConcurrentHashMap();

    static private void _runGC() {
        long usedMem1 = used(), usedMem2 = Long.MAX_VALUE;
        for (int i = 0; (usedMem1 < usedMem2) && (i < 1000); ++i) {
            usedMem2 = usedMem1;
            usedMem1 = used();

    static public long used() {
        return s_runtime.totalMemory() – s_runtime.freeMemory();

    static public long total_ie_Committed() {
        return s_runtime.totalMemory();

    static private final Runtime s_runtime = Runtime.getRuntime();
} // end of class
// —————————————————————————-

fwd: heap memory allocation – java/c#/C

Thanks Nigel or the article. A quick glance suggests to me the allocation (not the de-allocation) procedure is no different from malloc, logically.

At run time, the memory management library (some functions pre-loaded into the code section of the address space) gives out small chunks of memory to the requesting application. When there’s insufficient “small chunks” to satisfy a request[1], the library synchronously grabs a large block from OS, and adds this large block to the private free-store, which is private to the process, managed by the library. This is the malloc() behavior I know.

[1] perhaps due to fragmentation

This “library” is compiled binary code. For c/c++, I think this means platform-specific object code. For java, this means platform-independent java bytecode, or more likely platform-specific native code.
For dotnet, I guess it’s windows native binary code. Now, is this binary code compiled from a C linker/compiler that used the standard malloc() function? I think it’s possible. If the dotnet runtime is itself written in C, then the runtime (including the  memory library) is compiled/linked with a C compiler/linker. It’s therefore possible to use the malloc library.
Alternatively, I guess it’s also possible to create a custom memory module with an equivalent function to malloc(). I make this claim because Microsoft created its own version of C compiler and C memory management library since the early days of C.
Subject: RE: question on heap memory allocation
Does this use malloc ?
Subject: question on heap memory allocation
Hi Nigel,
Someone speculated that “in any programming language, heap memory is allocated always using the C function malloc(). There’s no alternative.” I know JVM and dotnet runtime are both written in C/c++ so at run time, heap memory is grabbed from OS probably by malloc(). Not sure about other languages.
Is my understanding correct?

OutOfMemory ≠ allocation failure

IBM says —
Q: I am getting an OutOfMemoryError. Does this mean that the Java heap is exhausted?
A: Not necessarily. Sometimes the Java heap has free space but an OutOfMemoryError can occur. The error could occur because of
* out of memory but not due to a call to new()
* Excessive memory allocation in other parts of the application, unrelated to the JVM, if the JVM is just a part of the process, rather than the entire process (JVM through JNI, for instance).

Q: How can I confirm if the OutOfMemoryError was caused by the Java heap becoming exhausted?
A: Run with the -verbosegc option. VerboseGC will show messages such as “Insufficient heap space to satisfy allocation request”

Q: When I see an OutOfMemoryError, does that mean that the Java program will exit?
A: No. My experiments below show you can continue even after new() failed.
    static ArrayList list = new ArrayList();
    public static void main(String[] kkkj) throws InterruptedException {
        byte[] array = null;
        for (;;) {
            out.println(Sizeof.total_ie_Committed() + “\t>  ” + Sizeof.used()
                    + ” <– bytes in use. Trying to new up….");
            try {
                array = new byte[20 * 1000 * 1000];
            } catch (OutOfMemoryError e) {
                StringBuilder stat = Sizeof.printHeapSegments4JDK6();

java^c# generic: struct, boxing, erasure

At the heart of all the jargon and debates lies one core issue — field layout (FL). Essentially [1], field layout is what truly defines a type, be it class or struct.

For example, the java LinkedList’s Node class has field layout to hold a payload and a “next-node” — all pointers. Since untyped Node came to the scene first, Java generics resorted to type erasure (TE) for backward compatibility. As a result, Node of any payload is the same Node type at runtime. Consequently, LinkedList of any payload boils down to the same type. one-size-fit-all.

C# needs no TE. However, in a twist of clever design, Node of any reference payload has a field layout just like java. So LinkedList of any reference type boils down to the same type. Reduces code bloat compared to c++ template.

vvv Java Node of int has to be boxed/unboxed because Node payload (like any reference type) must be a subtype of Object.java. Given a node, the compiler is /hardwired/ to treat the “payload” field therein as a pointer to the Heap (where boxed Integer lives).

^^^ C# Node of int is a distinct type with a distinct field layout.[2] The int value is a struct Instance embedded in Node object’s real estate. No boxing/unboxing.
** Struct payloads are all treated this way.

This is yet another low-level technical issue affecting only high-performance, high volume market data where megabytes of packet copying is expensive. Non-trading developers tend to consider this a niche that’s “too” specialized.

See P156 [[c#precisely]]

[2] Incidentally, a Node of float has the same size as a Node of int, but compiler interprets the payload field as int vs float. Therefore different field layouts.
[1] This is practical thinking. I am no language lawyer. Specifically,
– A type can have only 1 field layout.
– If 2 types have identical field layout, then we can make one an alias of the other like a typedef. It’s wasteful to create too many duplicate types.

Runtime.gc() calls do make a difference

Hi YH,

You said calling Runtime.gc() is counter-productive. My experiment shows otherwise. Below is a heavy java job that uses about 200MB memory. “JJ_1001001=233564416” means after processing 1001001 rows, 233MB memory was in use. As you can see, usage grows with data volume.

[2010-09-07 14:03:07,132 INFO .Main:39] – {before_JJ=17944080, JJ_1=149162416, JJ_100101=161278496, JJ_200201=177440888, JJ_300301=179908768, JJ_400401=182329728, JJ_500501=198315560, JJ_600601=200645064, JJ_700701=216799944, JJ_800801=218142816, JJ_900901=231948296, JJ_1001001=233564416, before_destroySingletons=143855904}

——– Now I added calls to Runtime.gc() after every 100,000 rows. Memory usage is now kept constant at around 120MB.

[2010-09-07 13:40:12,347 INFO .Main:39] – {before_JJ=17944080, JJ_1=120664072, JJ_100101=120664248, JJ_200201=120664072, JJ_300301=120664504, JJ_400401=120664176, JJ_500501=120664760, JJ_600601=120664584, JJ_700701=120664864, JJ_800801=120664840, JJ_900901=120665336, JJ_1001001=120665256, before_destroySingletons=131827256}

I run the job again. Results are similar, so GC behavior is not 100% predictable —

[2010-09-07 13:57:02,060 INFO .Main:39] – {before_JJ=17944080, JJ_1=120664224, JJ_100101=120663944, JJ_200201=120664528, JJ_300301=120664048, JJ_400401=120664632, JJ_500501=120664608, JJ_600601=120664888, JJ_700701=120664712, JJ_800801=120664992, JJ_900901=120665280, JJ_1001001=120665248, before_destroySingletons=131828080}

Here's how to request gc():

public static void runGC() {
// for whatever reason it helps to call Runtime.gc()
// using several method calls:
for (int r = 0; r < 4; ++r)

private static void _runGC() {
long usedMem1 = usedMemory(), usedMem2 = Long.MAX_VALUE;
for (int i = 0; (usedMem1 < usedMem2) && (i < 1000); ++i) {
usedMem2 = usedMem1;
usedMem1 = usedMemory();

calculating JGC overhead as percentage

(Based on P120 [[JavaPerformance]] by Charlie Hunt) There are many GC related command line options. These two have quite specific meanings —

C) -XX:PrintGCApplicationConcurrentTime
S) -XX:PrintGCApplicationStoppedTime

With these 2, the GC log will show something like

“app time: 0.527905 seconds” # due to (C)
“total time for which app threads where stopped: 0.0453124 sec” # due to (S)

If you divide the (S) duration by the Preceding (C) duration, you get a good estimate of GC overhead — .04531/.5279 = 8%

Warning — If you see Another “app time: …” right after, that would be another “uneventful period”, which would be followed by a minor/major GC.

It’s critical to find out if the (S) duration is a minor or major GC. You can find such details in the intervening messages in [GC…]

concurrent JGC does occasional stop-the-world #GS

Folks often mention concurrent GC as the opposite of stop-the-world (STW) GC, but in fact in Java lingo “concurrent” actually involves short stop-the-world pauses. In the initial marking, the GC root objects are marked as alive. During this phase, all threads of the application are suspended.

* STW means always-STW.
* Concurrent means occasional-STW.

STW is simpler and cleaner than Concurent because … during concurrent marking you can miss some reachable objects. See https://java.sun.com/j2se/reference/whitepapers/memorymanagement_whitepaper.pdf  for the exact details.

Even in single-threaded STW you could design finalize() to perform resurrection like http://www.xyzws.com/Javafaq/what-is-resurrection-in-garbage-collection/47


java array memory allocation

See if anyone can shed some light. Not an interview question. Just my own curiosity.

Q: When i declare a Serializable[2] array, how does JVM decide how many contiguous bytes to reserve for 2 array elements of type Serializable? The actual object can have any number of fields in it, so the object can occupy any amount of memory.

Now the background. In C, i remember array element addresses are always uniform and contiguous, so we can use pointer arithmetic for random access. Someone mentioned java array is similar to C array. For example, int[2] would need exactly 32 bytes * 2 = 64 bytes, probably same in java vs C.

There's more circumstantial evidence that java arrays occupy contiguous memory — when inserting an element in the middle, java need to shift or copy data in memory, that's why linked list is known to be more efficient for this operation.

It's possible to measure runtime memory usage before and after declaring a Serializable[10000] but I'd like to understand how system decides how much memory to allocate.


3 essential JGC actions af marking

In this write-up, let’s put aside “live vs dead” object marking, and focus on actions after marking. Across all the GC engines as of java 5, there are only 3 post-deallocation actions.

C) copier action young generation
Coping needs a “destination” space in the form of the survivors. So copier is inapplicable to oldgen

S) sliding compaction action => oldgen
F) non-relocating free-list action => oldgen

(notation — the arrow sign “A=>B” means “A requires/implies B”)

Among the 3, C and S create “frontiers” to support fast allocation at the expense of long pauses. F offers low-pause but fragmentation.

Among the 3, C is always young gen. S and F are always oldgen.

Once you are clear about these 3 GC actions, you understand a lot of jargons such as

concurrent GC
promotion failure

new() and syscall – wholesale^retail

update — Exactly which code module performs the wholesale-retail …? [[linux sys programming]] P256 reveals glibc.

One of the shortest definitions of Operating System is — “software-controlling-access-to-hardware-devices”. OS encapsulates hardware resources and presents a facade to competing userland applications.

The most important hardware resource is memory. For memory, the OS control point is the _allocation_ of heap memory aka free-store, shared among competing applications. Once a particular memory cell is allocated to a user (or application, or process), the OS doesn’t intervene further. The application freely read/write to the memory location.

Q: At run time, when JVM calls new SomeClass(), does the thread go into kernel and executes a system call to grab a chunk of heap memory?
%%A: I would say NO.

Q: But first, let’s look at a malloc() call at run time… does the thread go into kernel and executes a system call to grab a chunk of heap memory?
A: answer is NO.

See the divvy-up “sawing” diagram on P188 [[Michael Daconta]]. It’s too _inefficient_ to make so many “small” syscalls. Instead when you call malloc(4), you call into glibc functions namely malloc() and friends. Synchronously, the library functions call into the kernel (brk()/sbrk() perhaps?) to grab a large chunk wholesale, then gives 4 bytes to you before your malloc() returns. This is one malloc() scenario, the wholesale scenario, when there’s insufficient memory “in the library”.

The more common scenario is the retail scenario — malloc() finds a free block of 4 bytes in one of the “large chunks” grabbed earlier. So malloc() marks those 4 bytes as ALLOCATED and gives you the starting address.

By the way, malloc() typically stores the size requested (i.e. 4) in the header block before returning the address. The size information is needed by free(). Since the header block is in addition to the requested 4 bytes, malloc() _loses_ more than 4 bytes.

Fwd: cache causing JVM memory leak


Someone told me cache can often lead to memory leak. I believe our system maintains several caches such as model-cache and circuit-design-cache(?). If one of these caches lacks a size limit or an effective expiry schedule, then the cache could grow indefinitely, someone told me.

Curious to know if we have anything to control it. Thanks.

tan bin

Update — after talking to Vibrant, i realized cache is one of the most common, effective, yet complex techniques. Memory leak due to cache is growing in importance as competition grows and we move deeper into optimization.

Use SoftRef.

CMS garbabe collector "losing the race"

Apt metaphor — losing the race

Think of a hare (promotions) chasing the tortoise (CMS). If the promotion rate exceeds the “release rate” for too long, oldgen occupancy will grow too high. If it grows to 100% (catching up), then the CMS collector loses the race — a full STW collection is inevitable. You lose the benefit of the low-pause concurrent collector, and get the worst case scenario — longest pause.

By the way, The STW collection is a sliding compaction — see other posts in this blog.

As soon as the CMS notices the race is on (promotion rate, occupancy etc) it starts the collection cycle, hoping to keep ahead of the rival.

!!all promotions are "awarded" to long-surviving objects

GC commentators usually tiptoes around this tricky topic — “generational GC typically promote an object from eden to old gen only when it has survived a few minor collections”. It’s not a hard guarantee that every promoted has survived at least 2 minor GC[1]

[1] but I believe there is a guarantee of at least 1 survival. Very simple, promotion can only happen during collection. If an object is promoted then it must be live at the time.

[[javaPerformance]] make it clear that premature promotion does happen — a new object is promoted during its very first collection. This object (like all new objects) is likely to become garbage soon, but once promoted will not be examined until a major GC. This tends to litter the oldgen with garbage.

!! every new object is allocated in Eden

Contrary to popular belief, some large objects might be allocated from oldgen (as of 2012).

This is at the discretion of the garbage collector. Note every JVM garbage collector controls allocation as its 2nd job (cleanup is first job). There’s no guarantee by the JVM vendor exactly when a large object may be allocated from eden or oldgen.

Therefore, it’s only safe to say that “most allocations hit eden”.

JVM memory usage stats — reconciliation

Suppose I start a jvm with initial heap of 1G and max 4G. Halfway through, 2G is in use.

Q: What would unix ps SIZE column show?
A: committed. Note RSS column is less useful. It’s the physical portion of the virtual memory allocated.

Q: What would windows task manager -> PageFile Usage graph show?
A: committed.

Q: What would Runtime.totalMemory() show?
A: committed

Q: What would jconsole show?
A: committed, used and max. {— These meanings are explained in the java.lang.Runtime api. JDK 5 onwards offers more instrumentation under MemoryMXBean

I believe “committed” is the actual amount of virtual memory dedicated to the JVM and not available to competing processes. Max is higher, but is partly available to other processes.

Difference between committed vs used? See post on wholesale/retail. JVM has grabbed the committed amount from OS syscalls, but only allocated the used amount.

eden + survivors must be small (!! too small)

I used to wonder why young generation (eden+survivors) should not be larger than ¼ (or 1/8….can’t remember the recommendation). The reason turns out to be fundamental and it’s good to remember it.

Reason is, minor GC is very frequent and is expected to be efficient and “cheap”. Long pause is associated with major GC, whereas minor GC STW pause is usually much shorter.

Note Even though minor GC is brief, it is Stop-The-World throughout. There are no phases. There’s just one phase and it’s STW.

Many financial trading engines incur just one major GC a day, but no such thing about minor GC.

General advice/observations —

* young gen too large -> bad pause during minor GC
* young gen too small -> too frequent minor GC

promotion failure in JGC, briefly

In a low-latency app, PF is rather serious. PF in the GC log usually foretells/omens a full GC (often a long pause), like the lightning before a rolling thunder.

PF occurs when a minor GC runs out of space, finds an object that has survived a few minor collections and wants to promote it to oldgen, but oldgen is too full.

PF can hit any java GC schemes you select.

JNI memory leak, briefly

Java has a problem with accessing resources outside the JVM, such as directly accessing hardware. Java solves this with native methods (JNI) that allows calls to functions written in another language (currently only C and C++ are supported). …

There are performance overhead in JNI, especially for large messages, due to copying of the data from the JVM’s heap onto the system buffer. JNI also may lead to memory leaks because in C the programmer is responsible for allocating and freeing the memory. GC can’t go beyond jvm heap to visit the malloc free store. See post on wholesale/retail.

Even regular java objects on java heap may become memory leak when you add JNI. http://www.iam.ubc.ca/guides/javatut99/native1.1/implementing/array.html says (using int array for example) that ReleaseIntArrayElements will “unpin” the Java array if it has been pinned in memory. I believe anything pinned by JNI will not be garbage collected. If JNI programmer forgets to release, it’s similar to a java programmer forgetting to clear a static hashmap

JGC pauses + stop-the-world JGC

These are some random notes. What we need later is go to the bottom of the issue and summarize the key points.

I guess GC can cause pauses even outside those brief stop-the-world moments. Concurrent collection is not immune to pauses. “Incremental CMS works by doing very small stop_the_world phases to accomplish the concurrent phases, instead of using concurrent threads. Sun recommends this mode for one or two processors.”

Concurrent garbage collectors seldom stops program execution, except perhaps briefly when the program’s execution stack is scanned. “when app needs to allocate a new object, the runtime system may need to suspend it until the collection cycle is complete, or …”

Concurrent GC runs concurrently_with_the_application.
http://www.softwareengineeringsolutions.com/blogs/2010/04/30/garbage-collection-in-java-part-2/ is a short blog article.
http://www.softwareengineeringsolutions.com/blogs/2010/04/30/garbage-collection-in-java-part-2/ says constructor may block when GC is called upon (asynchronously?) to free up more memory.

when does java heap expand, briefly

As advised in [[solaris 10 performance]] and numerous experts, better set -Xms and -Xmx identical, to avoid heap expansion, but what if we can’t?

Expansion occurs if GC is unable to free enough heap storage for an allocation request, or if the JVM determines that expanding the heap is required for better performance.

I believe expansion can only happen after a full (but ineffective) GC.

##triggers for JGC: various comments

Some big trading engine (UBS?) runs just 1 GC cycle each day. I wonder … Q: how often does GC run normally?

P155 [[SanFrancisco]] nicely outlines that GC can start
* upon alloc-failure
* when jvm explicitly waits (IO, sleep …)
* every 10 seconds, but at a low priority

– In addition, application can request GC.

Synchronous GC is predominant, but google “asynchronous garbage collection idle periods of low activity” and you see GC can start when system is idle. Creditex java guys also hinted at that.

In my experiment, I do see GC springs into action when app is … paused.

http://java.sun.com/docs/hotspot/gc1.4.2/faq.html says about JDK1.4 —
* In the default garbage collector, a generation is collected when it is full (i.e., when no further allocations can be done from that generation).
* The concurrent low pause collector starts a collection when the occupancy of the tenured generation reaches a specified threshold(by default 68%).
* The incremental low pause collector collects a portion of the tenured generation during each young generation collection.
* A collection can also be started explicitly by the application.

http://www.softwareengineeringsolutions.com/blogs/2010/04/30/garbage-collection-in-java-part-2/ drills into alloc-failure —

The JVM is unable to fulfill the request (alloc-failure), and so it awakens the garbage collector to come to its aid. This results in a Young Generation collection cycle. This is the simplest and first answer to the opening question. Therefore, the novice thinks this is the only answer.

A Young Generation collection is called for because there isn’t enough space available in Eden to satisfy this allocation request (alloc-failure). However, unlike the earlier scenario, the promotion of the middle-aged object from the active Survivor space to being an elderly object in the old generation cannot proceed because there is no free space available in the Old Generation. In this case, the Young Generation collection is put on hold, while an Old Generation collection cycle is spun up to free up space in the Old Generation.

JGC: low-pause ⇒high-overhead @@

Inevitably (IMO), a low-pause GC engine tends to increase your overall GC “overhead” — defined as a percentage of cpu cycles.

The efficient GC algorithms (usually?) need to stop the world for the entire cycle, and use compaction/copying to carve out a “frontier” of continuous free area to support fast allocation.

In theory, new algorithms might emerge with good pause characteristic but without overhead penalty.  I don’t know any. Sounds like a perpetual machine

ref-counter Garbage collector

#1 drawback = “island”. Probably uncurable.
#2 drawback = “unavoidable[1] overhead[2]” in terms of ref-counter updates for (almost) every assign and every “unassign”
The #2 drawback is not the only “unavoidable overhead” — There’s also an unavoidable overhead in a handle-table compactor GC, which
maps every variable [3] to a handle, and every handle to an address in the jvm’s memory space. This adds one additional level to a
direct variable-addr mapping. Probably additional CPU cycle required for every variable[3] load or store.

ref-counter is the only GC without root objects.

[1] Most if not all other GC algorithms add overheads[2] in other ways but They don’t add overhead for every single assign.
[2] overheads = additional CPU cycles
[3] excluding primitives

JGC-algorithm keywords

http://java.sun.com/docs/hotspot/gc5.0/gc_tuning_5.html is a good intro to generational GC

* Within the Eden (youngest generation area), you can use schedule your garbage collector visits most frequently.
* Within the old generation area, you can schedule your GC visits with least frequency.
* Now, the GC algorithm can be ref count in one area, and a tracer in another.

The *heap* can be divided into old generation area and young (Eden + survivors).

http://articles.techrepublic.com.com/5100-10878_11-6108296.html focuses on java 1.5 GC

–keywords, based on an intro to GC algorithm : http://www.javaworld.com/javaworld/jw-12-2001/jw-1207-java101.html

* copiers — advantage: no fragmentation

* generational copiers — a generational algorithm is-a copier(?). Objects are copied from young to old generation subheap.

* subheap gc — each subheap can have a different gc daemon thread

* concurrent subheap gc — each subheap’s gc daemon thread can run simultaneously

* ref-counters ^ tracers — basically the 2 main types of algorithms. I think most modern algorithms are tracers, including generational

* mark-and-sweep —- a tracer feature. Unmarked => reclaimed

* path-from-root —- a tracer feature.

* overhead — some algorithms adds dereference-overhead (handle table) to every object access [1], while other algorithms adds extra cpu cycles only when GC thread runs — daemon-overhead. The first type of overhead is found in ref-counters and handle tables.

[1] read/write

SoftReference^WeakReference : %%intro

First understand weak ref. It’s more useful than soft ref. [[hard core java]] says many jvm treats soft ref just as weak ref, but nowadays jvm tends to aggressively extend softRef lifetime.

1-word intro for Mr Soft — middle. Its strength level is between regular ref and weak ref. Weak reference is weakest.

— based on http://www.axlrosen.net/stuff/softreferences.html

  • Q: What’s the difference between SoftReferences and WeakReferences?
  • A: In terms of lifetime …..  SoftReferences are extended aggressively; WeakReferences are released aggressively;
  • A: Use WeakRef if release is desirable; use SoftRef if release is undesirable and should be postponed if possible
  • A: Use WeakRef if memory is more precious than the object; use SoftRef if memory is less precious than the object so you keep the object for contingency.
  • A: Both types are suggestions to GC. Using a SoftReference tells the garbage collector that you would like it to keep the object around for a while, until memory considerations cause it to reclaim the object. By contrast, using a WeakReference tells the garbage collector (to act aggressive) that there’s no reason to keep the object around any longer than necessary.

Usage?  SoftReferences are primarily for implementing capacity-sensitive caches. WeakReferences are mostly used in WeakHashMap, primarily for associating extra information with an object (that’s what WeakHashMap does). http://stackoverflow.com/questions/154724/when-would-you-use-a-weakhashmap-or-a-weakreference has one answer showing such an example.


soft ref ^ weak ref, according to UBS interviewer

Both features below are meaningful only in the GC context.

Let’s say there’s a target object t1 that our soft ref or weak ref points to.

– our soft ref behaves exactly the same as a regular ref until an outofmemory exception is to be thrown. At this moment, GC wakes up, scans the heap and encounters t1. If regular-reference-count is zero, and our soft ref is the only ref, then t1 goes.

Without outofmemory condition, our soft ref behaves like a regular ref, and counts in reference-count. (Personally, i don’t feel very sure, but i trust him.)

– a weak ref never counts during reference-count. Therefore, when t1 reference-count drops to zero, t1 goes, even with all the weak references clinging on to it.

JGC runs finalize() despite "protected"

Object class has an overridable instance method

protected void finalize()

Q: will “protected” hinder GC who will invoke finalize()?
A: “protected” was chosen for overriding/inheritance consideration only. Access modifiers do not affect GC daemon.

Q: why should GC be above the law (ie access modifiers)?
A: A possible GC behaviour — GC probably need to look into every line of bytecode to make out a reference graph. A private method suspended by yield()/wait()/after-notify() might hold a vital reference???

Encapsulation only controls object-to-object communication and collaboration

java static field – root object@@

Get Inside the static initializer In FTTP loader. Suppose we define a var like

XMLElement xml;

Q: will it get garbage collected at end of the initializer?
%% A: I think so. Read P8 [[ java precisely ]] top to bottom. When such a variable goes out of scope -> unreachable. I think this applies to any variable declared in a block or in a for(int i=0….)

Q: But static ? root objects?
A: I think a static /field/ is a root object “reachable” from any thread [1]. Our local var is not a static field.

[1] pubic or otherwise, I believe it’s a root object. See blog on GC root objects

JGC root objects #intern

Conceptually, think of the GC daemon waking up and scanning all threads and all static fields.

1) call stacks [3] — Most obvious category of root objects are “objects of unfinished method CALLs”. For each unfinished CALL on each call stack, root objects include:
– every argument of reference type
– every local var (except primitives)
– that class object /hosting/ the static method
– that instance hosting the instance method

[3] I prefer “call stacks”. A “thread” is more than a call-stack — see blog post on call stack.

2) static fields — 2nd most obvious type of root objects are “static fields[1] of loaded classes” . Technically, static fields are 2nd-level nodes whose parent nodes are “class-objects”[2]. However, static fields are a more tangible thing to programmers.

[1] except primitive static fields (these aren’t objects per se, even if on heap)
[2] Every loaded class has a class-object which persists in memory forever => class loaders can introduce mem leaks. Not sure about class-unloading(???)
eg: System.out
eg: Math.PI
eg: most if not all constants defined as “public final static”
eg: D.seen in my debugger. On demand [1], D.java class [2] is loaded by JVM. During class loading, every single static vars must be initialized. D.seen is a HashMap, so the default initial value of D.seen is null, but in a subsequent method, we call new HashMap(). This HashMap becomes a root object.

[1] If you don’t use it, the D class is not loaded.
[2] (By the way, no D() constructor call needed.)

Q: Why is Math.PI always reachable after class loading?
A: Anyone can access PI any time.

Q: what if D.seen is not public?
A: D.d() is a public static method that relies on D.seen. Anyone can call D.d() any time. D.seen must remain in memory forever.

3) jni. It’s good to know how jni relates to other java technologies

other groups of root objects?

Now, some non-root objects that can linger forever:

  1. interned strings — see https://www.infoq.com/articles/Java-PERMGEN-Removed by a GC expert
  2. event listeners + observers. Persistent collections. We need to remember to unregister/unsubscribe listeners or observers. Swing is particular prone to memory leak.
  3. static collections such as a cache

JVM mem leak detector basics + force JGC

Some JVM profilers like JProbe can “force” GC daemon to wake up and run. We know that A regular java method can’t “force” GC daemon thread to wake up and run. But a JVM profiler is not written in java.

jconsole can invoke GC on the target JVM. It’s on the Memory tab, and also in Mbeans -> java.lang->Memory->operations

I think JProbe uses JVM Tool Interface (aka JVM Profiler Interface) to do that? Probably. See
http://java.sun.com/j2se/1.5.0/docs/guide/jvmti/jvmti.html#ForceGarbageCollection. The profiler runs in the same process as the JVM,
sharing the address space and have access to all the jvm variables, GC, threads..

Many memory leak detectors work by counting instances for each class and detect growth over time. The assumption — instance count for any class should rise and fall, not rise and rise. IMHO, This is the #1 fundamental assumption in memory leak detectors.

In general, if by design a system is supposed to generate increasing “live” objects then it will crash sooner or later. However, if under normal load the increase is slow enough relative to the lifespan between VM restarts, then it’s tolerable but still, such a design remains worrisome and questionable for a robust server-side engine.

I feel a robust design should “retire” most of its new objects for GC.

JVM mem leak detectors — 2 types

jvmti ^ bytecode engineering.

A) profiler — is a separate OS process. This process “attaches” to the target jvm process, often via JVMTI, perhaps by some protocol? Yes

Perhaps an IPC protocol? Yes

Example: jprobe, Jrockit runtime analyzer

$ ./jprobe/jpprofiler # starting a unix process

$ ./jra/jrcmd jrarecording time=300 filename=/tmp/jra.xml.zip # Connects to the pid, starts a JRA recording of 300s and stores the result in the specified file.

B) bytecode instrumentation — like soaking clothes before washing. You first get the standard java bytecode. You feed this bytecode into some kind of “processor” (what name?), somewhat like an obfuscator or a perl text-converter.

Echoing the Decorator pattern, it produces a “decorated” bytecode. When you execute this bytecode, memory statistics would be collected and saved somewhere, perhaps a file?