STL trees: insert sorted data one-by-one

The STL standard requires insert() and find() to be O(log N), so the tree has to remain fairly balanced at all times.

Q: When inserting sorted data one-by-one, what’s the frequency of re-balancing?

%%A: I didn’t find any definitive answer. I think it’s quite frequent. System has to assume “This could the last insertion and there will be many queries.” so re-balancing has to kick in whenever the tree becomes lopsided.

Incidentally, if you have the sorted data in some temp container and you insert them /in one gulp/, then STL standard requires the tree container to provide O(N) insertion which is better than O(N log N).

Advertisements

volume alone doesn’t qualify as big data

The Oracle nosql book has these four “V”s to qualify any system as big data system. I added my annotations:

  1. Volume
  2. Velocity
  3. Variety of data format — If any two data formats account for more than 99% of your data in your system, then it doesn’t meet this definition. For example, FIX is one format.
  4. Variability in value — Does the system treat each datum equally?

Most of the so-called big-data systems I have seen don’t have these four V’s. All of them have some volume but none has the Variety or the Variability.

I would venture to say that

  • 1% of the big-data systems today have all four V’s
  • 50%+ of the big-data systems have no Variety no Variability
    • 90% of financial big-data systems are probably in this category
  • 10% of the big-data systems have 3 of the 4 V’s

The reason that these systems are considered “big data” is the big-data technologies applied. You may call it “big data technologies applied on traditional data”

See #top 5 big-data technologies

Does my exchange data qualify? Definitely high volume and velocity, but no Variety or Variability.

##satisfying Occupation { partial financial freedom

(relevant to “post-65″…)

As I told German and Yihai, I feel my passive income may reach half of my burn rate — some level of financial freedom! So how do I want to spend my remaining career on more satisfying endeavors?

  • data science and machine learning? I feel market depth is poor, and I don’t feel confident to outshine
  • more in-depth c++, like ICE projects
  • Wenqiang’s innovative xml tool? I thought it had “value” and commercial potential but Wrong
  • dotnet? I used to feel this is up and coming, with strong potential
  • swing? I used to feel it needs help but no no … I feel it’s losing ground hopelessly.
  • See also after 65..contribute to stackOverFlow+OSS testing

Elements of a satisfying job after I achieve partial financial freedom:

  1. respect from team; decent benchmarking within the team.
  2. commute
  3. freedom to blog and experiment with coding in the office. Eg: OC, ICE
  4. —LG2
  5. trySomethingNew? minor. Can enhance job satisfaction slightly. More relevant is “engagement”, rather than mindless tasks.
  6. sustainable social value and social impact. Elusive. Idealistic. May have to Let Go.
  7. workload? I don’t mind if I really like the work, but there will always be some drudgery workload in any job.
  8. salary?
  9. market depth? less important after I have (partial) financial freedom?

MOM^shared memory ring buffer^UDP : mkt data transmission

I feel in most environments, the MOM design is most robust, relying on a reliable middleware. However, latency sensitive trading systems won’t tolerate the additional latency and see it as unnecessary.

Gregory (ICE) told me about his home-grown simple ring buffer in shared memory. He used a circular byte array. Message boundary is embedded in the payload. When the producer finishes writing to the buffer, it puts some marker to indicate end of data. Greg said the consumer is slower, so he makes it a (periodic) polling reader. When consumer encounters the marker it would stop reading. I told Gregory we need some synchronization. Greg said it’s trivial. Here are my tentative ideas —

Design 1 — every time the producer or the consumer starts it would acquire a lock. Coarse-grained locking

But when the consumer is chipping away at head of the queue, the producer can simultaneously write to the tail, so here’s

Design 2 — the latest message being written is “invisible” to the consumer. Producer keeps the marker unchanged while adding data to the tail of queue. When it has nothing more to write, it moves the marker by updating it.

The marker can be a lock-protected integer representing the index of the last byte written.

No need to worry about buffer capacity, or a very slow consumer.

MOM UDP multicast or TCP or UDS shared_mem
how many processes 3-tier 2-tier 2-tier
1-to-many distribution easy easiest doable
intermediate storage yes tiny. The socket buffer can be 256MB yes
producer data burst supported message loss is common in such a situation supported
async? yes yes, since the receiver must poll or be notified I think the receiver must poll or be notified
additional latency yes yes minimal

As CTO, I’d favor transparent langs, wary of outside libraries

If I were to propose a system rewrite, or start a new system from scratch without constraints like legacy code, then Transparency (+instrumentation) is my #1 priority.

  • c++ is the most opaque. Just look at complex declarations, or linker rules, the ODR…
  • I feel more confident debugging java. The JVM is remarkably well-behaving (better than CLR), consistent, well discussed on-line
  • key example — the SOAP stub/skeleton hurts transparency, so does the AOP proxies. These semi-transparent proxies are not like regular code you can edit and play with in a sandbox
  • windows is more murky than linux
  • There are many open-source libraries for java, c++, py etc but many of them affects transparency. I think someone like Piroz may say Spring is a transparent library
  • SQL is transparent except performance tuning
  • Based on my 1990’s experience, I feel javascript is transparent but I could be wrong.
  • I feel py, perl are still more transparent than most compiled languages. They too can become less transparent, when the codebase grows. (In contrast, even small c++ systems can be opaque.)

This letter is about which language, but allow me to digress briefly. For data store and messaging format (both require serialization), I prefer the slightly verbose but hugely transparent solutions, like FIX, CTF, json (xml is too verbose) rather than protobuf. Remember 99% of the sites use only strings, numbers, datetimes, and very rarely involve audio/visual data.

[13]arrayName^ptrVar differences tabulated

See other posts why an array name is a permanent name plate on a permanently allocated room in memory. Once you understand that, you know
– can’t move this name plate to another room
– can’t put a 2nd name plate on the same room
– can’t put this array name on the LHS of assignment

I feel overall, in most everyday contexts you can treat the array name in this example as if it’s a variable whose type is int*. However, the differences lurk in the dark like snakes, and once bitten, you realize there are many many differences. The 2 constructs are fundamentally different.

Q: how to edit the below html table structure (content is easy)?
A: edit the html
A: copy paste to ms-word

ptr to heap array ptr-var array-name (array not on heap)
declaration array-new int* p//allocate 32bit int arr[3]//allocate 3 ints
declared as a struct/class field same as ptr-var 4-bytes onsite entire array embedded onsite
declared as func param var no such thing common converted a ptr-var by compiler
initialize probably not supported char* p=”abc”;// puts the literal string in RO memory char arr[]=”xyz”; //copying the literal string from RO memory to stack/heap where the array is allocated. The new copy is editable.
object passed as func arg same as ptr-var common converted to &arr[0]
dereference same as ptr-var common gives the first int element value. P119[[primer]]
address of same as ptr-var double ptr &arr[0]==&arr==arr. See headfirstC post and also
http://publications.gbdirect.co.uk
/c_book/chapter5
/arrays_and_address_of.html
rebind/reseat same as ptr-var common syntax error
add alias to the pointee same as ptr-var common unsupported
as LHS (not as func param) same as ptr-var common. Reseat syntax error

accumulation: contractor^FTE #XR

XR,

You said that we contractors don’t accumulate (积累) as FTE do.

I do agree that after initial 2Y of tough learning, some FTE could reap the monetary rewards whereas consultants are often obliged, due to contract, to leave the team. (Although there are long-term contracts, they don’t always work out as promised.)

Here’s my experience in GS for 2.5Y. My later months had much lower “bandwidth” tension i.e. the later months required less learning and figure-things-out. Less stress, fewer negative feedbacks, less worry about my own competence, more confidence , more in-control because more familiar with the local system. If my compensation had become 150k I would say that money amounts to “reaping the reward”. In reality, the monetary accumulation was an empty promise.

As a developer stays longer, the accumulation in terms of his value-add to the team is natural and likely [1]. Managers like to point out that after a FTE stays in the team for 2Y her competence, her design, her solutions, her suggestions, her value-add per year grows higher every year. If her initial value-add to the company can be quantified as $100k, every year it grows by 30%. Alas, that doesn’t always translate to compensation.

That’s accumulation in personal income. How about accumulation in tech skill? Staying in one system usually means less exposure to other, or newer, technologies. Some developers prefer to be shielded from newer technologies. I embrace them. I feel my technical accumulation is higher when I keep moving from company to company.

[1] There are exceptions. About 5% of the old timers are, in my view, organization /dead-weights/. Their value-add doesn’t grow and is routinely surpassed within a year by bright new joiners. Often company can’t let them go due to political, legal or ethical reasons.

You said IV questions change over time so much (ignoring the superficial changes) that the IV skills we acquire today is useless in 5Y and we have to again learn new IV skills. This is not intuitive to me. Please give one typical example if you can without a lot of explaining (I understand your time constraints). I guess you mean technology churn? If I prepare for a noSQL or big data interview, then I will probably face technology churn.

On the other hand, in my experience, many interview topics remain ever-green including some hard topics — algorithms (classic algos and creative algos), classic data structures, concurrency, java OO, pass-by reference/value, SQL, unix commands, TCP/UDP sockets, garbage collection, asynchronous/synchronous concepts, pub/sub, producer/consumer, thread pool concepts… In the same vein, most coding tests are similar to 10 yeas ago when I first received them. So the study of these topics do accumulate to some extent.

edit1file]big python^c++ prod system #XR

Q1: suppose you work in a big, complex system with 1000 source files, all in python, and you know a change to a single file will only affect one module, not a core module. You have tested it + ran a 60-minute automated unit test suit. You didn’t run a prolonged integration test that’s part of the department-level full release. Would you and approving managers have the confidence to release this single python file?
A: yes

Q2: change “python” to c++ (or java or c#). You already followed the routine to build your change into a dynamic library, tested it thoroughly and ran unit test suite but not full integration test. Do you feel safe to release this library?
A: no.

Assumption: the automated tests were reasonably well written. I never worked in a team with a measured test coverage. I would guess 50% is too high and often impractical. Even with high measured test coverage, the risk of bug is roughly the same. I never believe higher unit test coverage is a vaccination. Diminishing return. Low marginal benefit.

Why the difference between Q1 and Q2?

One reason — the source file is compiled into a library (or a jar), along with many other source files. This library is now a big component of the system, rather than one of 1000 python files. The managers will see a library change in c++ (or java) vs a single-file change in python.

Q3: what if the change is to a single shell script, used for start/stop the system?
A: yes. Manager can see the impact is small and isolated. The unit of release is clearly a single file, not a library.

Q4: what if the change is to a stored proc? You have tested it and run full unit test suit but not a full integration test. Will you release this single stored proc?
A: yes. One reason is transparency of the change. Managers can understand this is an isolated change, rather than a library change as in the c++ case.

How do managers (and anyone except yourself) actually visualize the amount of code change?

  • With python, it’s a single file so they can use “diff”.
  • With stored proc, it’s a single proc. In the source control, they can diff this single proc
  • with c++ or java, the unit of release is a library. What if in this new build, beside your change there’s some other change , included by accident? You can’t diff a binary 😦

So I feel transparency is the first reason. Transparency of the change gives everyone (not just yourself) confidence about the size/scope of this change.

Second reason is isolation. I feel a compiled language (esp. c++) is more “fragile” and the binary modules more “coupled” and inter-dependent. When you change one source file and release it in a new library build, it could lead to subtle, intermittent concurrency issues or memory leaks in another module, outside your library. Even if you as the author sees evidence that this won’t happen, other people have seen innocent one-line changes giving rise to bugs, so they have reason to worry.

  • All 1000 files (in compiled form) runs in one process for a c++ or java system.
  • A stored proc change could affect DB performance, but it’s easy to verify. A stored proc won’t introduce subtle problems in an unrelated module.
  • A top-level python script runs in its own process. A python module runs in the host process of the top-level script, but a typical top-level script will include just a few custom modules, not 1000 modules. Much better isolation at run time.

There might be python systems where the main script actually runs in a process with hundreds of custom modules (not counting the standard library modules). I have not seen it.

effi^instrumentation ] new project

I always prioritize instrumentation over effi/productivity/GTD.

A peer could be faster than me in the beginning but if she lacks instrumentation skill with the local code base there will be more and more tasks that she can’t solve without luck.

In reality, many tasks can be done with superficial “insight”, without instrumentation, with old-timer’s help, or with lucky search in the log.

What if developer had not added that logging? You are dependent on that developer.

I could be slow in the beginning, but once I build up (over x months) a real instrumentation insight I will be more powerful than my peers including some older timers. I think the Stirt-tech London team guru (John) was such a guy.

In reality, even though I prioritize instrumentation it’s rare to make visible progress building instrumentation insight.

incisive example showing diff: with^without extern-C

---- dummy8.c ----
#include <stdio.h> 
//without this "test", we could be using c++ compiler unknowingly 😦
int cfunc(){ return 123; }
---- dummy9.cpp ----
#include <iostream>
extern "C" // Try removing this line and see the difference
  int cfunc();
int main(){std::cout << cfunc() <<std::endl; }

Above is complete source of a c++ application using a pre-compiled C function. It shows the need for extern-C.

/bin/rm -v *.*o *.out
### 1a
g++ -v -c dummy8.c # 
objdump --syms dummy8.o # would show mangled function name _Z5cfuncv
### 1b
gcc -v -x c -c dummy8.c # Without the -x c, we could end up with c++ compiler 😦
objdump --syms dummy8.o # would show unmangled function name "cfunc"
### 2
g++ -v dummy8.o dummy9.cpp  # link the dummy8.o into executable

# The -v flag reveals the c vs c++ compiler versions 🙂
### 3
./a.out

So that’s how to compile and run it. Note you need both a c compiler and a c++ compiler. If you only use a c++ compiler, then you won’t have any pre-compiled C code. You can still make the code work, but you won’t be mixing C and C++ and you won’t need extern-C.

My goal is not merely “make the code work”. It’s very easy to make the code work if you have full source code. You won’t need extern-C. You have a simpler alternative — compile every source file in c++ after trivial adjustments to #include.