private-header^shared-header

In our discussions on ODR, global variables, file-scope static variables, global functions … the concept of “shared header” is often misunderstood.

  • If a header is included in only one *.cpp, then its content is effectively part of that *.cpp.

Therefore, you may experiment by putting “wrong” things in such a private header and the set-up may work or fail, but either way it’s an invalid test of shared-header behavior. Your test is basically putting those “wrong” things in an implementation file!

 


STL iterator invalidation rules, succinctly

http://www.martinbroadhurst.com/iterator-invalidation-rules-for-c-containers.html is concise with explanations. Specifically,

  • list insertions don’t invalidate any iterator. I feel iterator is a pointer to a node.
  • tree insertions don’t invalidate any iterator. Same reason as list.
  • for erasure from a list or a tree, only iterators to the erased node are invalidated.

Now for vector:

  • vector insertion invalidates any iterator positioned at or after the insertion point. If the insertion exceeds vector.capacity() and triggers reallocation, then all iterators are invalidated.

##why avoid blocking design

There are many contexts. I only know a few.

1st, let’s look at a socket context. Suppose there are many (like 50 or 500) sockets to process. We don’t want 50 threads. We prefer fewer, perhaps 1 thread to check each “ready” socket, transfer whatever data can be transferred, then go back to waiting. In this context, we need either

  • /readiness notification/, or
  • polling
  • … Both are compared on P51 [[TCP/IP sockets in C]]
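The readiness-notification option can be sketched with a Java NIO Selector. This is only a minimal illustration under my own assumptions: in-process Pipe channels stand in for the sockets so the sketch runs without a network, and the class and method names (ReadinessDemo, drainAll) are made up for this post.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.ByteBuffer;
import java.nio.channels.Pipe;
import java.nio.channels.SelectionKey;
import java.nio.channels.Selector;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;

public class ReadinessDemo {
    /** One thread services `count` channels, touching only those reported ready. */
    public static int drainAll(int count) {
        List<Pipe> pipes = new ArrayList<>();
        try (Selector selector = Selector.open()) {
            for (int i = 0; i < count; i++) {
                Pipe p = Pipe.open();          // stand-in for one socket (assumption)
                pipes.add(p);
                p.source().configureBlocking(false);
                p.source().register(selector, SelectionKey.OP_READ);
                byte[] msg = ("msg" + i).getBytes(StandardCharsets.US_ASCII);
                p.sink().write(ByteBuffer.wrap(msg));   // make this channel "ready"
            }
            int received = 0;
            ByteBuffer buf = ByteBuffer.allocate(64);
            while (received < count) {
                selector.select();             // wait until at least one channel is ready
                Iterator<SelectionKey> it = selector.selectedKeys().iterator();
                while (it.hasNext()) {
                    SelectionKey key = it.next();
                    it.remove();
                    if (!key.isReadable()) continue;
                    buf.clear();
                    int n = ((Pipe.SourceChannel) key.channel()).read(buf);
                    if (n > 0) {               // transfer what is available, then move on
                        received++;
                        key.cancel();          // count each channel only once in this demo
                    }
                }
            }
            return received;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        } finally {
            for (Pipe p : pipes) {
                try { p.sink().close(); p.source().close(); } catch (IOException ignored) { }
            }
        }
    }
}
```

With real sockets the same loop would register SocketChannel objects instead of pipe sources. The point is that the single thread only ever blocks in select(), never on one particular socket.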

2nd scenario — GUI. Blocking a UI-related thread (like the EDT) would freeze the screen.

3rd, let’s look at some DB request client. The request thread sends a request and it would take a long time to get a response. Blocking the request thread would waste some memory resource but not really CPU resource. It’s often better to deploy this thread to other tasks, if any.

Q: So what other tasks?
A: ANY task, in the thread pool design. The requester thread completes the sending task, and returns to the thread pool. It can pick up unrelated tasks. When the DB server responds, any thread in the pool can pick it up.
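A rough sketch of that thread-pool idea, using CompletableFuture. Everything here is an assumption for illustration: sendQuery and the scheduled executor are a fake stand-in for a slow DB server, not any real driver API.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

public class AsyncDbDemo {
    // Hypothetical slow DB server: completes the future after delayMs.
    static CompletableFuture<String> sendQuery(String sql,
                                               ScheduledExecutorService fakeDb,
                                               long delayMs) {
        CompletableFuture<String> response = new CompletableFuture<>();
        fakeDb.schedule(() -> response.complete("rows for: " + sql),
                        delayMs, TimeUnit.MILLISECONDS);
        return response;   // returns immediately; no thread waits on the server
    }

    public static String run() {
        ExecutorService pool = Executors.newFixedThreadPool(2);
        ScheduledExecutorService fakeDb = Executors.newSingleThreadScheduledExecutor();
        try {
            CompletableFuture<String> handled = CompletableFuture
                .supplyAsync(() -> "select 1", pool)              // requester task builds the request
                .thenCompose(sql -> sendQuery(sql, fakeDb, 50))   // sends it, then the thread is free
                .thenApplyAsync(rows -> "handled " + rows, pool); // any pool thread picks up the response
            // Blocking here is only so the demo can return a value.
            return handled.get(5, TimeUnit.SECONDS);
        } catch (InterruptedException | ExecutionException | TimeoutException e) {
            throw new IllegalStateException(e);
        } finally {
            pool.shutdown();
            fakeDb.shutdown();
        }
    }
}
```

While the fake server takes its 50 ms, both pool threads are idle and could run unrelated tasks.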

This can be seen as a “server bound” system, rather than IO bound or CPU bound. Both the CPU task queue and the IO task queue get drained quickly.

 

no overflow]TCP slow receiver #non-blocking sender

Q: Does TCP receiver ever overflow due to a fast sender?

A: See http://www.mathcs.emory.edu/~cheung/Courses/455/Syllabus/7-transport/flow-control.html

A: It should not. When the receive buffer is full, the receiver sends AdvertizedWindowSize to inform the sender. If the sender app continues to send anyway, the data will remain in the send buffer and not go over the wire. Soon the send buffer will fill up and send() will block. On a non-blocking TCP socket, send() returns with an error only when it can’t send a single byte. (UDP is different.)

Non-blocking send/receive operations either complete the job or return an error.

Q: Do they ever return with part of the data processed?
A: Yes they return the number of bytes transferred. Partial transfer is considered “completed”.
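A partial transfer can be observed from Java NIO. A minimal sketch under my own assumptions: a non-blocking Pipe with nobody reading stands in for a TCP socket with a slow receiver, and PartialWriteDemo is a made-up name. Note that a Java NIO channel reports “can’t send a single byte” as a return value of 0 rather than an error.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.ByteBuffer;
import java.nio.channels.Pipe;

public class PartialWriteDemo {
    /** Returns {bytes accepted by the first write, bytes still left in the buffer}. */
    public static int[] firstWrite(int totalBytes) {
        try {
            Pipe pipe = Pipe.open();           // stand-in for a connected socket (assumption)
            try {
                pipe.sink().configureBlocking(false);
                ByteBuffer data = ByteBuffer.allocate(totalBytes);
                // Nobody reads the other end, so only part of a large buffer fits.
                int accepted = pipe.sink().write(data);
                return new int[] { accepted, data.remaining() };
            } finally {
                pipe.sink().close();
                pipe.source().close();
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

The caller is expected to keep the unsent remainder and retry once the channel becomes writable again. SocketChannel.write() follows the same contract.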

 

3rd effect@volatile ] java5

A Wells Fargo java interviewer said there are 3 effects. I named

  1. load/store to main memory on the target variable
  2. disable statement reordering

I think the interviewer mentioned a 3rd effect about memory barriers.

This quote from Java Concurrency in Practice, chap. 3.1.4 may be relevant:

The visibility effects of volatile variables extend beyond the value of the volatile variable itself. When thread A writes to a volatile variable and subsequently thread B reads that same variable, the values of all variables that were visible to A prior to writing to the volatile variable become visible to B after reading the volatile variable. So from a memory visibility perspective, writing a volatile variable is like exiting a synchronized block and reading a volatile variable is like entering a synchronized block.
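The quoted guarantee can be sketched as follows. The class and field names are hypothetical, chosen only for illustration: a plain (non-volatile) field is written before the volatile flag, so a reader that sees the flag set must also see the plain field’s value.

```java
public class VolatilePublish {
    private int payload;               // plain field, NOT volatile
    private volatile boolean ready;    // the volatile write/read pair creates the ordering

    void writer() {
        payload = 42;                  // one of the "other writes" made before the volatile write
        ready = true;                  // volatile write, like exiting a synchronized block
    }

    int reader() {
        while (!ready) {               // volatile read, like entering a synchronized block
            Thread.yield();
        }
        return payload;                // must observe 42 under the java5 memory model
    }

    public static int run() {
        VolatilePublish shared = new VolatilePublish();
        Thread w = new Thread(shared::writer);
        w.start();
        return shared.reader();
    }
}
```

If ready were not volatile, the reader could in principle spin forever or return a stale payload.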

https://stackoverflow.com/questions/9169232/java-volatile-and-side-effects addresses my doubt about “other writes”. Nialscorva’s answer echoes the interviewer:

Before java 1.5, the compiler could reorder the two steps

  1. construction of the new object
  2. assigning the new address to the variable

In such a scenario, other threads (unsynchronized) can see the address in the variable and use the incomplete object, while the construction thread is preempted indefinitely like for 3 hours!

So in java 1.5, the construction is among the “other writes” by the volatile-writing thread! Therefore, the construction is flushed to memory before the address assignment. Below is my own solution, using a non-static volatile field:

public class DoubleCheckSingleton {
	private static DoubleCheckSingleton inst = null;
	private volatile boolean isConstructed = false;
	private DoubleCheckSingleton() {
		/* other construction steps */
		this.isConstructed = true; //last step
	}
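	// static, so that callers can reach it without already holding an instance
	public static DoubleCheckSingleton getInstance() {
		DoubleCheckSingleton local = inst; // read the unsynchronized field only once
		if (local != null && local.isConstructed) return local;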
		synchronized(DoubleCheckSingleton.class) {
			if (inst != null && inst.isConstructed) return inst;
			
/**This design makes use of the volatile semantics introduced in java5
*
* Without the isConstructed volatile field, an incomplete object's 
* address can be assigned to inst, so another thread entering getInstance()
* will see a non-null inst and use the half-cooked object 😦
* 
* The isConstructed check ensures the construction has completed
*/
			return inst = new DoubleCheckSingleton();
		}
	}
}

volume alone doesn’t qualify as big data

The Oracle nosql book has these four “V”s to qualify any system as a big data system. I added my annotations:

  1. Volume
  2. Velocity
  3. Variety of data format — If any two data formats account for more than 99% of your data in your system, then it doesn’t meet this definition. For example, FIX is one format.
  4. Variability in value — Does the system treat each datum equally?

Most of the so-called big-data systems I have seen don’t have these four V’s. All of them have some volume but none has the Variety or the Variability.

I would venture to say that

  • 1% of the big-data systems today have all four V’s
  • 50%+ of the big-data systems have no Variety no Variability
    • 90% of financial big-data systems are probably in this category
  • 10% of the big-data systems have 3 of the 4 V’s

The reason these systems are considered “big data” is the big-data technologies applied. You may call it “big data technologies applied on traditional data”.

See #top 5 big-data technologies

Does my exchange data qualify? Definitely high volume and velocity, but no Variety or Variability.

data science^big data Tech

Consider quant library technology vs quant research. I think the relationship is similar.

Data science is an experimental discovery task, like other scientific research. I feel it’s somewhat academic and theoretical. As a result, it doesn’t pay so well. My friend Jingsong worked with data scientists in Nokia/Microsoft.

Big data technologies (acquisition, indexing, parsing, cleansing) are not exploratory. They are more similar to database technology than to scientific research.

data mining^big-data

Data mining has been around for 20 years (before 1995). The most visible and /compelling/ value-add in big-data always involves some form of data mining, often using AI including machine-learning.

Data mining is The valuable thing that customers pay for, whereas big-data technologies enhance the infrastructure supporting the mining.

https://www.quora.com/What-is-the-difference-between-the-concepts-of-Data-Mining-and-Big-Data has a /critical/ and concise comment. I modified it slightly for emphasis.

Data mining involves finding patterns from datasets. Big data involves large scale storage and processing of datasets. So combining both, data mining done on big data (e.g., finding buying patterns from large purchase logs) is getting a lot of attention currently.

NOT all big data tasks are data mining ones (e.g., large scale indexing).

NOT all data mining tasks are on big data (e.g., data mining on a small file which can be performed on a single node). However, note that wikipedia (as of 10 Sept. 2012) defines data mining as “the process that attempts to discover patterns in large data sets”.

(latency) DataGrid^noSQL (throughput)

  • Coherence/Gemfire/gigaspace are traditional data grids, probably distributed hashmaps.
  • One of the four categories of noSQL systems is also a distributed key/value hashmap, such as redis
  • …. so what’s the diff?

https://blog.octo.com/en/data-grid-or-nosql-same-same-but-different/ has an insightful answer — DataGrids were designed for latency; noSQL were designed for throughput.

I can see the same trade-off —

  • FaceBook’s main challenge/priority is fanout (throughput)
  • IDC’s main challenge is throughput, measured in messages per second
  • HFT’s main challenge is nanosec latency.
  • For a busy exchange, latency and throughput are both important, but if it must pick one, I think it would pick throughput