contiguous memory data structures: page faults + cache misses

Compared to linked data structures, I feel vectors (and arrays, deques, circular arrays…) reduce page faults and cache misses.

P124 [[art of debugging]] describes how a simple array is loaded from swap space into a real memory “page”. This is managed by the virtual memory system.

P19 [[optimized c++]] describes how the L1/L2 cache also loads a vector more efficiently than a linked data structure. Each time main memory is read, the adjacent memory cells are loaded together into the cache as one cache line, so contiguous memory data structures should, in theory, suffer far fewer cache misses.

A virtual memory page is typically 4KB; a cache line is typically 64B.

Both page faults and cache misses are costly at runtime. In both cases the “backing store” is much slower, yet we have no choice but to read from it. These situations are by design and very common, but it pays to reduce their incidence. Even a 10% reduction in page-fault or cache-miss frequency can be a meaningful win.
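
To make the contrast concrete, here is a minimal Java sketch (shape only, not a rigorous benchmark; a fresh JVM may still allocate the nodes close together, but in a long-lived, fragmented heap they scatter):

    class LocalityDemo {
        static class Node { int val; Node next; }

        public static void main(String[] args) {
            int n = 1_000_000;

            // contiguous: one block of memory; each 64B cache line serves 16 ints
            int[] arr = new int[n];
            long arraySum = 0;
            for (int i = 0; i < n; i++) arraySum += arr[i];   // sequential scan, cache-friendly

            // linked: each node is a separate heap object; traversal chases pointers,
            // and once GC and interleaved allocation scatter the nodes, successive
            // reads land on different cache lines (or pages)
            Node head = null;
            for (int i = 0; i < n; i++) { Node node = new Node(); node.next = head; head = node; }
            long listSum = 0;
            for (Node p = head; p != null; p = p.next) listSum += p.val;

            System.out.println(arraySum + " " + listSum);     // keep the sums live
        }
    }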

RAM insufficient? telltale signs via Unix commands

Scan rate is the most important indicator, and there’s a threshold to look for. It measures the “number of pages per second scanned by the page stealing daemon”, a unix concept with no windows equivalent. (On Solaris, for example, it shows up as the sr column of vmstat output.) http://www.dba-oracle.com/oracle10g_tuning/t_server_ram.htm is concise.

Paging activity (due to insufficient physical memory) is another important indicator, correlated with scan rate; again there’s a threshold to look for.

type info stored inside instances – c#, c++, java

Java is simple and clean — all non-primitive objects have a pointer to the runtime (not compile-time[1]) class object, which lives in the permanent generation. This is what enables getClass().

C++ is similar. The RTTI info is typically stored in the vtbl of each *polymorphic* class. This means each java/c++ object needs only one pointer’s worth of real estate (4 bytes on a 32-bit system) for the vptr. (The type info itself is N bytes of real estate on the vtbl, shared by the entire class.)

[1] Note the class object for the DECLARED type doesn’t need to be stored in the instance; it’s only used by the compiler and not needed at runtime. RTTI stands for “RunTime Type Information”.
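
A short Java illustration of [1]: the instance carries its runtime class, while the declared type exists only in the compiler’s head.

    class RuntimeTypeDemo {
        public static void main(String[] args) {
            Object o = "hello";                   // declared type is Object, known to the compiler only
            System.out.println(o.getClass());     // prints "class java.lang.String": the runtime type
        }
    }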

—- C# is less simple. See P54,104 [[c#precisely]]
– C# reference objects are probably the same as in java.
– a C# struct instance has no “runtime type” and no real estate carved out for it. A struct instance doesn’t (need to) know its own type.
** (a C struct instance has no “runtime type” either, since C has no vtbl.)

– Boxing of a struct type is trickier. Even though a struct instance holds no pointer to the STRUCT type object, a boxed copy, like every c# reference object, does hold a pointer to the CLASS object.
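
For comparison, a sketch of the Java analogue of boxing:

    class BoxingDemo {
        public static void main(String[] args) {
            int i = 42;                            // primitive: no object header, no runtime type pointer
            Object boxed = i;                      // autoboxing allocates an Integer with a full header
            System.out.println(boxed.getClass()); // prints "class java.lang.Integer"
        }
    }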

memory^thread creation – kernel always involved?

memory – see post on new-and-syscall-wholesale^retail (http://bigblog.tanbin.com/2010/02/new-and-syscall-wholesaleretail.html)
A) wholesale – malloc grabbing big chunks from the kernel [1]
B) retail – malloc divvying them up for the host application (sketched below). Note malloc is an /embedded-library/, not a separate process; it lives in the code section of one application only.
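
A toy Java sketch of the wholesale/retail split. The slab stands in for memory grabbed wholesale from the kernel; all names here are mine, purely illustrative.

    class BumpAllocator {
        private final byte[] slab = new byte[1 << 20]; // "wholesale": one big 1MB grab up front
        private int next = 0;                          // bump pointer into the slab

        // "retail": carve small slices out of the slab with no further trips upstream
        int alloc(int size) {
            if (next + size > slab.length) throw new OutOfMemoryError("slab exhausted");
            int offset = next;
            next += size;
            return offset;   // the offset into the slab plays the role of a pointer
        }
    }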

thread –
A) if you create a kernel thread (or a Solaris LWP), then you probably call into the kernel [1]
A2) when your kernel thread is scheduled, it’s scheduled by the kernel. Your application can’t interfere with that.
B) if you create a userland thread, the thread library does it without the kernel
B2) when your userland thread is scheduled, it’s scheduled by the thread library. The kernel is unaware of userland threads and can’t schedule them.

Note the thread library could be bundled with the OS, just like malloc, but it doesn’t have to be. Half the time, these libraries can serve user applications without calling into the kernel.

[1] since memory and CPU are hardware resources protected by the OS

physical ^ virt memory — when do we see which

V – when you print out the address of any C++ object.
P? – out of memory
P? – committed memory as reported by various tools?
P – seg fault or “general protection error”
V – stack and heap sections growing towards each other. Note the heap is contiguous only virtually — physically it is discontiguous, as your app requests big chunks of physical memory from the OS.

Q: is the physical memory address knowable?
A: Probably not. “If, while executing an instruction, a CPU fetches an instruction located at a particular virtual address, or fetches data from a specific virtual address or stores data to a particular virtual address, the virtual address must be translated to the corresponding physical address. This is done by a hardware component”, i.e. the memory management unit (MMU).

Q1: is the virtual address knowable in C?
A: yes

Q1b: is the virtual address knowable in Java?
A: probably not
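
A small demonstration of the Q1b answer. Java deliberately exposes no address; the closest thing, the identity hash, is not an address, since the GC is free to move objects.

    class AddressDemo {
        public static void main(String[] args) {
            Object o = new Object();
            // NOT a memory address: just a stable per-object hash. Java offers no way
            // to observe the (virtual) address, which can change at every GC cycle.
            System.out.println(Integer.toHexString(System.identityHashCode(o)));
        }
    }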

XR’s memory efficiency in large in-memory search (DB diff via xml)

A real-world problem: 2 DB tables have very similar content. The owner decided to periodically export each table into a 200MB xml file: each row becomes an xml element, each field an attribute. Business users want to compare the rows on-demand via a servlet, showing per-field differences.

Performance target? A few seconds to compare the xml files and show the result.

Due to various overheads while reading and writing the files, -Xmx1000m is required.

Insight — duplicate objects are the chief performance issue. The key technique is an object registry (like SSN in the US), where each recurring object (strings, and Floats too!) is assigned a 32-bit Oid. Date fields are parsed into Dates and then long integers (memory-efficient). To compare 2 corresponding fields, just compare their Oids; string comparison is much slower than int comparison.

Use an AtomicInteger as the Oid auto-increment generator — thread-safe (sketched below).
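
A minimal sketch of such a registry (class and method names are mine):

    import java.util.Map;
    import java.util.concurrent.ConcurrentHashMap;
    import java.util.concurrent.atomic.AtomicInteger;

    class OidRegistry {
        private final AtomicInteger next = new AtomicInteger();           // thread-safe auto-increment
        private final Map<Object, Integer> oids = new ConcurrentHashMap<>();

        /** Returns the canonical Oid for this value, assigning a new one on first sight. */
        int oidOf(Object value) {
            return oids.computeIfAbsent(value, v -> next.getAndIncrement());
        }
    }

With this, 2 equal field values from either file map to the same Oid, so each row collapses to an int array and every field comparison becomes an int comparison.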

Solution:
– load each XML into memory, but never instantiate 2 identical objects; use the Oid instead
** construct an {Object,Oid} hashmap?
– now build object graphs using the Oids. Each row is represented by an int array
– 1m rows become 1m int arrays, where the 0th array element is always Column 0, the 1st element always Column 1 …
– since Column 3 is the primary key, we build a hashmap of (primary-key Oid, int array) from the 1st xml file
– now we go through the 1m rows of the 2nd xml file. For each row, we construct its int array, use the primary key’s Oid to locate the corresponding int array in the hashmap, and compare the 2 int arrays (sketched after the notes below).

Note the xml contains many partial rows — some of the 50 fields might be null. However, every field has a name, so our int array always has 50 elements, with nulls represented by -1.

For scale: the table has 50 columns and 400,000 rows; each field is char(20) to char(50).
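
The comparison step, sketched under the above assumptions (names are mine; rowsByPk is the hashmap built from the 1st file):

    import java.util.Map;

    class RowDiff {
        static final int PK_COL = 3;   // Column 3 is the primary key

        // rowsByPk: keyed by the primary-key Oid; each value is one row's int[50]
        static void diffRow(Map<Integer, int[]> rowsByPk, int[] row2) {
            int[] row1 = rowsByPk.get(row2[PK_COL]);
            if (row1 == null) { System.out.println("row only in 2nd file"); return; }
            for (int col = 0; col < row1.length; col++)     // 50 columns; -1 means null field
                if (row1[col] != row2[col])
                    System.out.println("column " + col + " differs");
        }
    }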

cache size control in trading engines#%%ideas

My answer in 2007 – a separate thread to monitor the size and send out alerts, plus a JMX-based clean-up operation.

Now I feel we can take the Swing EDT approach. “EDT” here means some kind of dedicated thread + queue for cache size control — a watchdog.

Each insert/delete operation on any thread would send a msg to the watchdog queue. The watchdog can decide how many messages to skip before taking a look; if the size is close to the limit, it could become more vigilant. Once the limit is hit, the watchdog could turn on a flag in the cache system to warn the inserters.
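
A sketch of that wiring (all names are mine; assumes the cache calls onMutation on every insert/delete):

    import java.util.concurrent.BlockingQueue;
    import java.util.concurrent.LinkedBlockingQueue;
    import java.util.concurrent.atomic.AtomicBoolean;

    class CacheWatchdog implements Runnable {
        private final BlockingQueue<Integer> events = new LinkedBlockingQueue<>(); // +1 = insert, -1 = delete
        private final AtomicBoolean nearLimit = new AtomicBoolean(false);          // the flag inserters check
        private final int limit;
        private int estimatedSize;                                                 // touched only by the watchdog

        CacheWatchdog(int limit) { this.limit = limit; }

        void onMutation(int delta) { events.offer(delta); }   // called by inserters/deleters; never blocks them

        boolean isNearLimit() { return nearLimit.get(); }

        public void run() {
            try {
                while (true) {
                    estimatedSize += events.take();               // could also drain a batch before looking
                    nearLimit.set(estimatedSize >= limit * 0.9);  // grow more vigilant near the limit
                }
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();               // shut down quietly
            }
        }
    }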

But how does the watchdog keep the size within the limit? It has to evict the least-recently-accessed item. A priority queue keyed on last-accessed time might work. See the post on LRU for a java builtin solution.
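
For reference, the java builtin alluded to is LinkedHashMap in access order. A minimal LRU sketch:

    import java.util.LinkedHashMap;
    import java.util.Map;

    class LruCache<K, V> extends LinkedHashMap<K, V> {
        private final int limit;

        LruCache(int limit) {
            super(16, 0.75f, true);   // accessOrder=true: each get() moves the entry to the tail
            this.limit = limit;
        }

        @Override
        protected boolean removeEldestEntry(Map.Entry<K, V> eldest) {
            return size() > limit;    // evict the least-recently-accessed entry once over the limit
        }
    }

Note this is not thread-safe by itself; wrap it with Collections.synchronizedMap or confine all access to the watchdog thread.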

More importantly, the app has to tolerate involuntary loss of cache items: “If the cache doesn’t have what I inserted, I will insert it again.”