jGC heap: 2 unrelated advantages over malloc

Advantage 1: faster allocation, as explained in other blogposts

Advantage 2: programmer can "carelessly" create an "local" Object in any method1, pass (by reference) the object into other methods and happily forget about freeing the memory.

In this extremely common set-up, the reference itself is a stack variable in method1, but the heapy thingy is "owned" by the GC.

In contrast, c/c++ requires some "owner" to free the heap memory, otherwise memory would leak. There’s also the risk of double-free. Therefore, we absolutely need clearly documented ownership.

STL container=#1 common resource owner #RAII

Stroustrup said every resource (usually heapy thingies) need to have an owner, who will eventually return the resource.

By his definition, every resource has an acquire/release protocol. These resources include locks, DB connections and file handles. The owner is the Object responsible for the release.

  • The most common resource owner is the STL container. When a std::vector or std::unorderded_multimap … is destructed, it would release/return the resource to the heap memory manager.
  • The best-known resource-owner is the family of smart pointers.
  • You can also combine them as a container of smart pointers.
  • All of these resource owners rely on RAII

freelist in pre-allocated object pool #DeepakCM

Deepak CM described a pre-allocated free-list used in his telecom system. I think https://www.embedded.com/dynamic-memory-and-heap-contiguity/ briefly describes the same scheme.

https://github.com/tiger40490/repo1/blob/cpp1/cpp/lang_66mem/fixedSizedFreeList.cpp is my self-tested implementation. Am proud of the low-level details that I had to nail down one by one.

He said his system initialization could make 400,000 new() calls to allocate 400,000 dummy objects and put them into a linked list. Personally, My design /emulates/ it with a single malloc() call. This is at startup time.

During the day, each new msg will overwrite [1] a linked node retrieved at the Head of the slist.

[1] using operator=(). Placement-new would be needed if we use a single malloc()

Every release() will link-in the node at the Tail of the slist. Can we link it in at the Head? I think so. Benefit — It would leave a large chunk of continuous free space near the tail. Improved Fragmentation.

Illustration — Initially the physical addresses in the slist are likely consecutive like addr1 -> addr2 -> addr3…. After some release() calls, it would look like a random sequence.

Using return-to-Head, I would get

  • pop, pop, pop: 4->5->..
  • rel 2: 2->4->5..
  • pop: 4->5->….
  • rel 1: 1->4->5…

— The API usage:

  • void ffree(void *)
  • void * fmalloc(size_t), possibly used by placement new

4 common mem techniques]%%HRT assignment

Technique: reinterpret_cast

Technique: memcpy

Technique: pre-allocated buffer as local static objects … can be better than "stack" allocation only for a large buffer.

Unlike local variables, local Static objects can be returned by pointer — major flexibility in low-level memory management.

https://stackoverflow.com/questions/3835857/in-c-does-using-static-variables-in-a-function-make-it-faster?rq=1 discusses the runtime cost of static local vs local

##%%c++(n java)memoryMgmt trec : IV

Q: Your resume says “memory mgmt” for c++, so what did you do personally, not as a team?
A: I will talk about the features in my system. Most of them are not written by me, honestly

  • per-class: restricting allocation to heap or stack
  • per-class: customize operator new for my class
  • per-class: disable operator new for my class and provide allocateFromRingBuffer() API
  • per-class: move support
  • static thread_locals
  • ring buffer to eliminate runtime allocation
  • custom smart pointers
  • memory leak detector tools like valgrind+massif .. breakdown heap/non-heap footprint@c++app #massif
  • RAII
  • pre-allocate DTOs@SOD #HFT #RTS
  • customize new_handler() — I didn’t write this

Q: java

  • GC tuning
  • memory sizing
  • JDK tools like jhat
  • memory profiler tools
  • string intern
  • investigating memory leaks, all related to GC

## placement new: pearls to impress IV

clone a std::vector by memcpy@@

In general, we can’t use memcpy to clone the internal array of a vector.

If we have a vector of simple value types like an int or float, memcpy is probably simpler.

If there’s any pointer member data, then we are in trouble. For example, if there’s any std::string in the internal array, memcpy will bypass the string copy operations. The internal pointer in the string will be copied incorrectly. When the original array content gets deleted, our cloned strings will be messed up.

There are many other element types that involve pointer:

  • the element may have a smart ptr data member. All smart ptr classes provide special copy control, not to be bypassed.
  • the element may be another container. Every container I know uses pointers. They all have special copy control.
  • the element may have some shared resource like a mutex or a connection. These resources are typically allocated on heap. When the original vector gets destroyed, the resource will need release or clean-up

Q: For vector of int, is memcpy faster than copying element by element?
A: Probably not. Both O(N). A for-loop to copy an array of integers may compile to the same binary.

3ways to call ctor #placement new

  1. to construct an Acct at a pre-existing address, call placement-new, which implicitly uses Acct ctor
  2. to construct an Acct in stack or static memory (usually data segment), just call ctor explicitly
  3. to construct an Acct in heap, call new, which implicit uses ctor.

By the way, you can also allocate memory without calling ctor. The malloc() and q[ operator new ] can do that, as explained in [[moreEffC++]]

C/c++ freelist #phasebook

[[Pro .net performance]] P92 has 2 pages on C/c++ free-list. Concise, in-depth.

  • uniform-size — freelist is good at supporting objects of identical sizes.
  • Search — for a matching block. Needed if object sizes are non-uniform
  • Fragmentation — many too-small blocks
  • Per-thread private free-list? Worsens fragmentation

 

c++GC interface

https://stackoverflow.com/questions/27728142/c11-what-is-its-gc-interface-and-how-to-implement

GC interface is partly designed to enable

  • reachability-based leak detectors
  • garbage collection

The probe program listed in the URL shows that as of 2019, all major compilers provide trivial support for GC.

Q: why does c++ need GC, given RAII and smart pointers?
A: system-managed automatic GC instead of manual deallocation, without smart pointers

container of smart^raw pointer

In many cases, people need to store addresses in a container. Let’s use std::vector for example. Both smart ptr and raw ptr are common and practical

  • Problem with raw ptr — stray pointer. Usually we the vector doesn’t “own” the pointee, and won’t delete them. But what if the pointee is deleted somewhere and we access the stray pointer in this vector? smart pointer would solve this problem nicely.
  • J4 raw ptr — footprint efficiency. Raw ptr object is smaller.

c++q[new] variations

  1. variation: new MyClass(arg1) // most common. Can throw bad_alloc
  2. variation: new MyClass() // better form, calling the no-arg ctor
  3. variation: new MyClass //bare word, same as above. See op-new: allocate^construct #placement #IV
  4. variation: new (nothrow) MyClass(…) // returns NULL upon failure
  5. variation: placement new
  6. variation: array-new // no argument allowed!

op-new: allocate^construct #placement #IV

Popular IV topic. P41 [[more effective c++]] has an excellent summary:

  1. to BOTH allocate (on heap) and call constructor, use regular q(new)
  2. to allocate Without construction, use q(operator new)
    1. You can also use malloc. See https://stackoverflow.com/questions/8959635/malloc-placement-new-vs-new
  3. to call constructor on heap storage already allocated, use placement-new, which invokes ctor

The book has examples of Case 2 and Case 3.

Note it’s common to directly call constructor on stack and in global area, but on the heap, placement-new is the only way.

Placement-new is a popular interview topic (Jump, DRW and more …), rarely used in common projects.

simple implementation of memory allocator#no src

P9 [[c++game development primer]] has a short implementation without using heap. The memory pool comes from a large array of chars. The allocator keeps track of allocated chunks but doesn’t reuse reclaimed chunks.

It showcases the header associated with each allocated chunk. This feature is also part of a real heap allocator.

reinterpret_cast is used repeatedly.

struct packing^memory alignment %%experiment

Sound byte — packing is for structs but alignment and padding is Not only for structs

Longer version — alignment and padding apply to Any object placement, whether that object is a field of a struct or an independent object on stack. In contrast, the packing technique is specific to a struct having multiple fields.

https://github.com/tiger40490/repo1/blob/cpp1/cpp1/memAlign.cpp has my test code to reveal the rules:

  1. For a (8-byte) long long object on stack (or in a struct), the address is 8-byte aligned. So padding is added by compiler, unless you say “packed” on a struct.
  2. For a (4-byte) int object on stack (or in a struct), the address is 4-byte aligned.
  3. For a (2-byte) short object on stack (or in a struct), the address is 2-byte aligned.
  4. For a char object on stack (or in a struct), the address is 1-byte aligned. All memory address are 1-byte aligned, so compiler adds no padding.

http://stackoverflow.com/questions/11108328/double-alignment Wug’s answer echoes Ashish’s comment that tiny fields like char should be grouped together, due to Rule #4. This applies to stack layout as well [1]. However, compiler optimization can be subversive:

Not only can the compiler reorder the stack layout of the local variables, it can assign them to registers, assign them to live sometimes in registers and sometimes on the stack, it can assign two locals to the same slot in memory (if their live ranges do not overlap) and it can even completely eliminate variables.

[1] See https://stackoverflow.com/questions/1061818/stack-allocation-padding-and-alignment

[16]standard practice around q[delete]

See also post about MSDN overview on mem mgmt…

  • “delete” risk: too early -> stray pointer
  • “delete” risk: too “late” -> leak. You will see steadily growing memory usage
  • “delete” risk: too many times -> double free
  • … these risks are resolved in java and dotnet.

For all simple scenarios, I feel the default and standard idiom to manage the delete is RAII. This means the delete is inside a dtor, which in turn means there’s a wrapper class, perhaps some smart ptr class.

It also means we create stack instances of
– the wrapper class
– a container holding the wrapper objects
– an umbrella holding the wrapper object

Should every departure/deviation be documented?

I feel it’s not a best idea to pass a pointer into some cleanup function, and inside the function, delete the passed-in pointer. What if the pointee is a pointee already deleted, or a pointee still /in active service/, or a non-heap pointee, or a pointee embedded in a container or umbrella… See P52 [[safe c++]]

std::vector memory allocation/free: always heap@@

https://stackoverflow.com/questions/8036474/when-vectors-are-allocated-do-they-use-memory-on-the-heap-or-the-stack is concise.

  • The vector “shell” can be on stack or heap or static memory, as you wish.
  • The continuous array (payload objects) are always allocated on the heap, so that the vector can grow or deallocate them.

Deallocation is explained in https://stackoverflow.com/questions/13944886/is-stdvector-memory-freed-upon-a-clear

avoid overhead of dynamic memory allocation – alloca etc

I now think there are various overhead with DMA

* search the free list for a suitable chunk
* fragmentation
* If an allocation is needed where the heap is almost used up, glibc must grab more memory from the kernel, then hand out a slice of it.

Linux alloca() and variable-length-array both avoids some of the overhead. See P267 [[linux sys programming]].

If a low-latency module does a ton of malloc(), then alloca() might outperforms malloc() easily. We should benchmark.

use delete() or delete[] @@ #briefly

An address stored in a pointer (ptr-Object or ptr-Nickname — see http://bigblog.tanbin.com/2012/07/3-meanings-of-pointer-tip-on-delete-this.html) can mean either a “lone-wolf” or a “wolf pack”. 
Specifically, the logical (rather than physical) data at the address could be an array of objects or a single one.

The op-new and op-new[] operators both return an address. If you received the address without knowing which new operator was used, then it’s impossible to know how to delete it. As [[effC++]] points out, deleting the wrong way is UndefinedBehavior. 

The suggestion in [[safeC++]] is to avoid smart array (like shared_array) but use vector instead.

On a similar note, if your function receives a char*, you would have no idea whether it’s a single char or a string.

prevent heap instantiation of my class#Part 2#nQuant

I might have blogged about this…. [[more effC++]] by Scott Meyers has a chapter on this. The basic ideas, expressed in my own language for maximum absorption —

case 1) it’s relatively easy to prevent qq( new MyClass ) P157 basically says customize a new-operator as a private static class member and leave it undefined.

— below cases 2 and 3 are non-issues in “our” projects, where team is small and guys are expected to read the source code comments —

case 2) MyClass could be subclassed to Der. If Der is beyond my control, then MyClass could (against my will) be instantiated on heap as a subobject

case 3) MyClass could be a field in class Umbrella. If Umbrella is beyond my control, then MyClass could (against my will) be instantiated on heap as a subobject

I feel in many enterprise app teams, Cases 2/3 are rare and impractical, because Der and Umbrella would be owned by myself or my own team.

Therefore, in many real world contexts it is feasible to prevent heap instantiation of my class, and reduces memory leak.

c++mem mgmt: MSDN overview

http://msdn.microsoft.com/en-us/library/vstudio/hh438473.aspx is a brief but decent overview, illustrated with code snippets. I re-read in Nov 2017.

  • nonref variables as data members ! Default dtor is good enough 🙂
  • vector as data member ! Default dtor is good enough 🙂
  • (MSDN didn’t mention) shared_ptr and unique_ptr as data member. Default dtor is good enough 🙂
  • Heap-based vs stack-based — different techniques
  • c++ mem mgmt — “designed” for stack-based, so prefer stack as far as possible.
  • “observing” vs “owning” a heap object. An observing raw ptr can dangle but will never leak.

pairing up array (or placement) new/delete

Item 8 of [[more eff C++]] has a good coverage. Here’s my summary.
Q: Which of these does the compiler allow you to invoke directly in your source code?
OPERATOR placement-new? Not sure. Perhaps never needed.
placement-new EXPRESSION? Of course. P40

placement delete? NO SUCH THING
dtor? yes. P42. I also saw it in some C++ FAQ.
operator new? yes P39. Rather similar to how you call malloc()
operator delete? Yes p41

delete[] EXPRESSION? Of course. That’s the standard way to de-allocate an array of “heapy thingies”
OPERATOR delete[]? Not sure.

operator new[] ? Not sure. I feel we can always avoid invoking this explicitly. We can invoke operator-new with a sizeof(targetObject)*arraySize. Both operators return a void pointer to a chunk of raw memory of the correct size.
————Here are the legitimate pairings. “Inter-faith marriage” leads to undefined behaviour ———–
malloc === free
operator new === operator delete // similar to malloc/free, but very rarely needed
new expression === delete expression
new[] expression === delete[] expression
placement-new expression === explicit dtor call followed by some special form of de-allocation. There’s no such thing as placement-delete
————
Q: which of these can we customize?
placement-new EXPRESSION? no
OPERATOR placement new? yes P40
OPERATOR new[] and operator delete[]? Yes P43

a concrete allocator@@

Every single STL container use allocators including the (implicit!) default allocators. It’s probably worthwhile to see a real allocator. [[STL tutorial]] has some incomplete code. [[effC++]] Items 8,9,10 present a customized op-new, which seems to be the basis of a reusable allocator class.

An allocator keeps its own free list, like malloc. Wholesale vs retail.

I believe an allocator is tightly integrated to customized op-new.

I was told in multi-threaded apps, the default op-new hits the same heap every time and can be a synchronization hotspot. I guess if there’s a single free-list then it needs synchronization protection. By giving a mini-heap to each thread, op-new must be customized to hit the mini-heaps.

piecemeal vs one-gulp – c++ object memory allocation

– some operations must be piecemeal
– some operations must be one-gulp

* ctor and dtor are always piece-meal. Reason — each subclass and embedded component class can define its own ctor and dtor.
** For example, dtor execute in the DCBC sequence — http://bigblog.tanbin.com/2010/01/d-c-b-destructor-execution-order.html

* “operator new” (I don’t mean a new-expression) is always one-gulp. Reason — the memory (“real-estate”) of an entire object must be a contiguous piece of “real estate”, but one allocation for one part of the real estate and another allocation for another part will be dis-contiguous.
** There’s an important implication on P157 [[more eff C++]]

Background —
+ if you are a derived class instance, Your base class instance will occupy part of “your” real estate
+ if you have a nonref[1] field of a another class, That instance occupies part of “your” real estate

[1] “nonref” is something other than a pointer or reference, something not dereferencible. No such field in java, because a non-primitive field is always a reference field in any java class.

prevent heap instantiation4my class#Part 1

(Much of this write-up is based on http://www.cprogramming.com/tutorial/operator_new.html) It’s useful if we can restrict a special class’s instantiation to the stack only, and disallow heap allocation. Overload op-new for your class, and make it throw exception unconditionally? Alternatively, can we declare it but leave it unimplemented like —

class Myclass {
public:
void* operator new(size_t);
void operator delete(void*);
};

Both of them are by default static members. Overloading can be used for many purposes. For example, we may need to alter the exception thrown in case of failure–std::bad_alloc–and throw something else:

The relationship between Operator New and the New Keyword

Don’t be confused by the fact that there is both a new keyword and an operator new. When you write:

MyClass *x = new MyClass;

there are actually two things that happen–memory allocation and object construction; the new keyword is responsible for both. One step in the process is to call operator new in order to allocate memory; the other step is to actually invoke the constructor. Operator new lets you change the memory allocation method, but does not have any responsibility for calling the constructor. That’s the job of the new keyword.

It is possible to invoke the constructor without calling operator new. That’s the job of placement new

customize new/malloc, briefly

For a given class C with a derived class D,

C::operator new(……); // is inheritable by D; implicit static; allocates raw memory
C::operator delete(…); // is inheritable by D; implicit static; deallocates raw memory i.e. return to the “pool”

C(……..); // never inherited; never static; turns allocated raw memory into object
~C(void); // never inherited; never static; turns object into raw memory, to be deallocated.

All of them can be private.

As explained in another blog post, the new expression like “new D()” can’t be overloaded. This expression invokes the operator new and the ctor.

[[effC++]] has a lengthy chapter on how to customize new/delete.

P139 [[understanding and using c pointers]] has a short chapter on how to avoid the overhead of frequent malloc/free operations. Note this is NOT replacing malloc with our own malloc.

P70 [[understanding and using c pointers]] has a one-pager super-simple wrapper over free()

new() and syscall – wholesale^retail

update — Exactly which code module performs the wholesale-retail …? [[linux sys programming]] P256 reveals glibc.

One of the shortest definitions of Operating System is — “software-controlling-access-to-hardware-devices”. OS encapsulates hardware resources and presents a facade to competing userland applications.

The most important hardware resource is memory. For memory, the OS control point is the _allocation_ of heap memory aka free-store, shared among competing applications. Once a particular memory cell is allocated to a user (or application, or process), the OS doesn’t intervene further. The application freely read/write to the memory location.

Q: At run time, when JVM calls new SomeClass(), does the thread go into kernel and executes a system call to grab a chunk of heap memory?
%%A: I would say NO.

Q: But first, let’s look at a malloc() call at run time… does the thread go into kernel and executes a system call to grab a chunk of heap memory?
A: answer is NO.

See the divvy-up “sawing” diagram on P188 [[Michael Daconta]]. It’s too _inefficient_ to make so many “small” syscalls. Instead when you call malloc(4), you call into glibc functions namely malloc() and friends. Synchronously, the library functions call into the kernel (brk()/sbrk() perhaps?) to grab a large chunk wholesale, then gives 4 bytes to you before your malloc() returns. This is one malloc() scenario, the wholesale scenario, when there’s insufficient memory “in the library”.

The more common scenario is the retail scenario — malloc() finds a free block of 4 bytes in one of the “large chunks” grabbed earlier. So malloc() marks those 4 bytes as ALLOCATED and gives you the starting address.

By the way, malloc() typically stores the size requested (i.e. 4) in the header block before returning the address. The size information is needed by free(). Since the header block is in addition to the requested 4 bytes, malloc() _loses_ more than 4 bytes.

OutOfMemory retry loop of "new" #c++

We know c++ keeps retrying when a “new” operation encounters OOM condition (no such exception in c++ actually)

Q: what are the very few (therefore important) ways to break out of the loop?
A1: more memory becomes available and retry succeeds
A1a: new-handler releases a secret “stash” of memory
A1b: install a more resourceful new-handler, which releases some special memory block

A2: new-handler turns out null -> std::bad_alloc -> abort()
A2a: set new-handler to null
A2b: directly throw bad_alloc and pass on the hot potato
A2c: directly call abort()

return mem cells to freelist, then fill your pointer’s 32 bits with 0

Suppose a 32-bit pointer p2 contains address 0x1234,and 22==sizeof(Derived)

******* q[base p2 = new Derived()] implements —
– grab 22 bytes from freelist
– initialize by ctor
– return 0x1234 i.e. starting address of the 22 bytes
–> from now on, “system” won’t give those 22 bytes to other malloc requests, since those are ALLOCATED to us.

****** q(delete p2) performs
– call dtor, possibly virtual
– determine size of the (polymorphic) pointee object, possibly via vtbl. Suppose it’s 22.
(The compile-time declared type of p2 doesn’t matter. If we did a Base p2 = new Derived(), then deletion will reclaim all the mem cells of Derived.)
– mark that many bytes starting from Address 0x1234
– return these bytes to freelist
–> Any time from now on, those 22 bytes can be allocated to anyone. Note p2 continues to be seated at 0x1234. The deletion performs no state change in the 32-bit object p2. If you read/write/delete on p2, you die.

Rule — after you call q(delete p2), always fill the 32-bit object p2 with zeros, by calling q(p2 = 0)

heap fragmentation and freelist

Heap memory allocators (java GC is actually among them) make a great effort to avoid (heap) fragmentation. Having many holes in the free store is wasteful and hurts allocation speed, to say the least

The fastest allocation-time algorithm is to simply carve out a piece of real estate from THE “frontier” and ignore any usable holes. Done naively, this could lead to huge waste of usable “hole” real estate – imagine a 100MB array gets freed and becomes a hole but ignored during all subsequent allocations.

The frontier is simply marked by a forward moving pointer (“bump-the-pointer”). In JVM, the popular mark-and-sweep (also mark-compact?) and young-generation coping GC carefully creates such a frontier.

In C, there’s no garbage collector so holes are perhaps inevitable. I think the standard solution is the free-list. This hurts allocation-time. To satisfy a 2KB allocation “request” we must search the free-list for a hole, before trying the frontier.

Both the free-list and the “frontier” solutions entail concurrency control.

memory leak detection ideas#malloc etc

http://www.flipcode.com/archives/How_To_Find_Memory_Leaks.shtml is a dated (2000) but readable and detailed treatment. A home-made new/delete overload, using malloc and free rather than qq[[ ::operator new ]] as advised by Scott Meyers.

  • Valgrind – no need to link anything… malloc/free are “replaced” automatically.
  • electric fence — link it into your code to get seg fault upon DAM errors. Won’t catch memory leaks. (My GDB book covers similar tools)
  • cmemleak traces malloc() and free() — the choke points.
  • [[c++nutshell]] says allocators can implement debugging or validity checks to detect programmer errors, such as memory leaks or double frees.
  • GlowCode — Three easy ways to use GlowCode (a windows-only profiler):
    • (1) use it to launch your application,
    • (2) attach GlowCode to a running program, or
    • (3) link GlowCode directly into your application. No source code or
      build change or post-build step required. Similar to Valgrind
  • IBM Purify — When a program is *linked* with Purify, corrected verification code is automatically inserted into the executable by parsing and adding to the object code, including libraries. Similar to Java bytecode instrumentation. That way, if a memory error occurs, the program will print out the exact location of the error, the memory address involved, and other relevant information. Purify also detects memory leaks (but I guess as a secondary feature). Leak report can be generated by calling the Purify leak-detection API from within an instrumented application. Object Code Insertion (OCI) OCI can be performed either during the link phase or after the link. Rational Purify reads object files generated by existing compilers and linkers, and adds error checking instructions without disturbing the ability to use conventional debuggers on the executable.
  • MDB is a commercial dynamic/shared library that provides replacements for malloc/free etc. You must link your software to these *.SO.
    • 🙂 Much lower overhead than Purify and Valgrind

don’t explicitly clean up a stack object in c++

don’t explicitly clean up a stack object — let system do it.

“clean up” includes both destruct or delete.  Don’t explicitly delete a stack pointee or call destructor on a stack object.  http://www.cembedded.com/2009/07/c-c-interview-questions-part-6.html says

The destructor will get called again at the close } of the block in which the local was created. But you can get really bad results from calling a destructor on the same object a second time!