opaque challenge:(intermediate)data browser #push the limit

Opacity is a top-3 (possibly the biggest) fear and burden in terms of figuring-things-out-relative-to-coworkers on a localSys. The most effective and direct solution is some form of instrumentation tool for intermediate data. If you develop or master an effective browsing tool, however primitive, it would likely become a competitive advantage in terms of figure-out speed, and consequently earn you some respect.

LocalSys — I feel most of the effective data browser knowledge is localSys knowledge.

If you are serious about your figure-out speed weakness, and seriously affected by opaque issues, then consider investing more time in these browsing tools.

Hard work, but worthwhile.

  • eg: Piroz built a Gemfire data browser and it became crucial in the Neoreo project
  • #1 eg: in my GS projects, the intermediate data was often written into RDBMS. Also important — input and output data are also written into RDBMS tables. Crucial in everyday trouble-shooting. I rank this as #1 in terms of complexity and value. Also this is my personal first-hand experience
  • #2 eg: RTS rebus — during development, I captured lots of output CTF messages as intermediate data… extremely valuable
    • Once deployed, QA team relied on some web tools. I didn’t need to debug production issues.
    • I remember the static data team saved their static data to RDBMS, so they relied on the RDBMS query tool on a daily basis.

Now some negative experiences

  • eg: I’m not too sure, but during the Stirt QZ dev projects I didn’t develop enough investigation skill esp. in terms of checking intermediate data.
  • eg: in Mvea, we rely on net-admin commands to query order state, flow-element state and specific fills… Not as convenient as a data store. I never became proficient.
    • I would say the FIX messages are logged consistently and serve as an input and output data browser.

Many projects in my recent past have no such data store. I don’t know if there’s an effective solution to the opacity, but everyone else faces the same issue.

error stack trace: j^c++

Without stack trace, silent crashes are practically intractable.

In my java and python /career/ (I think c# as well), exceptions always generate a console stack trace. The only time this didn’t happen was with a Barclays JNI library written in c++.

In c++, getting the stack trace is harder.

  • when I used the ETSFlowAsert() construct I got a barely usable stack trace, with function names but without line numbers
  • [[safeC++]] described a technique to generate a simple stack trace with some line numbers but I have never tried it
  • the standard assert() macro doesn’t generate stack trace
  • In RTS and mvea, memory access errors lead to a seg-fault and core dump. In these contexts we are lucky because the runtime environment (host OS, standard library, seg-fault signal handler ..) cooperates to dump some raw data into the core file, but it’s not as reliable as the JVM runtime. Here are some of the obstacles:
    • core files may be suppressed
    • To actually make out this information, we need gdb + a debug build of the binary with debug symbols.
    • it can take half an hour to load the debug symbols
    • My blogpost %%first core dump gdb session describes how much trouble and time is involved to see the function names in the stack trace.
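For contrast with the c++ obstacles above, here is a minimal python sketch (function names are invented) showing the console stack trace we get for free:

```python
# Any uncaught exception in python prints a full stack trace; here we
# capture the same trace explicitly with the stdlib traceback module.
import traceback

def inner():
    raise ValueError("simulated crash")

def outer():
    inner()

try:
    outer()
except ValueError:
    trace = traceback.format_exc()

# the trace shows every frame with file names AND line numbers,
# which is exactly what takes real effort to get in c++
print(trace)
```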

 

killing a stuck thread #cancellation points#CSY

Q: once you know one of many threads is stuck in a production process, what can you do? Can you kill a single thread?
A: there will Not be a standard construct provided by the OS or thread library, because killing a thread is inherently unsafe. Look at java Thread.stop().
A: yes if I have a builtin kill-hook in the binary

https://www.thoughtspot.com/codex/threadstacks-library-inspect-stacktraces-live-c-processes describes a readonly custom hook. It’s conceivable to add a kill feature —

  • Each thread runs a main loop to check an exit-condition periodically.
  • This exit-condition would be similar to pthreads “cancellation points”

https://stackoverflow.com/questions/10961714/how-to-properly-stop-the-thread-in-java shows two common kill hooks — interrupt and Boolean flag
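A minimal python sketch of the Boolean-flag kill hook (all names invented): the worker’s main loop checks an exit condition periodically, mirroring the pthreads cancellation-point idea.

```python
# Cooperative thread "kill" via a flag, using threading.Event.
# A real system would check the flag inside its actual main loop.
import threading
import time

stop_requested = threading.Event()
iterations = 0

def worker_main_loop():
    global iterations
    while not stop_requested.is_set():   # the "cancellation point"
        iterations += 1                  # placeholder for real work
        time.sleep(0.01)

t = threading.Thread(target=worker_main_loop)
t.start()
time.sleep(0.05)
stop_requested.set()    # the kill hook: ask the thread to exit
t.join(timeout=2)
```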

 

gdb to show c++thread wait`for mutex/condVar

https://github.com/tiger40490/repo1/blob/cpp1/cpp/thr/pthreadCondVar.cpp shows my experiment using gdb supplied by StrawberryPerl.

On this g++/gdb set-up, “info threads” shows thread id number 1 for main thread, “2” for the thread whose pthread_self() == 2 … matching 🙂

The same “info-threads” output also shows

  • one of the worker threads is executing sleep() while holding lock (by design)
  • the other worker threads are all waiting for the lock.
  • At the same time, the main thread is waiting on a condition variable, so info-threads shows it executing a different function.
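The same scenario can be sketched in python (a rough analogue, not the original pthread code): one thread sleeps while holding the lock, so any other thread’s acquire attempt fails, which is what gdb’s info-threads revealed.

```python
# One worker holds the lock while sleeping; the main thread observes
# that the lock is taken, like the blocked workers in the gdb session.
import threading
import time

lock = threading.Lock()
holder_started = threading.Event()

def holder():
    with lock:                # worker holds the lock...
        holder_started.set()
        time.sleep(0.2)       # ...while executing sleep(), by design

t = threading.Thread(target=holder)
t.start()
holder_started.wait()
got_it = lock.acquire(blocking=False)   # fails while the holder sleeps
t.join()                                # after join, the lock is free again
```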

breakdown heap/non-heap footprint@c++app #massif

After reading http://valgrind.org/docs/manual/ms-manual.html#ms-manual.not-measured, I was able to get massif to capture non-heap memory:

valgrind --tool=massif  --pages-as-heap=yes --massif-out-file=$massifOut .../xtap -c ....
ms_print $massifOut

Heap allocation functions such as malloc are built on top of system calls such as mmap, mremap, and brk. For example, when needed, an allocator will typically call mmap to allocate a large chunk of memory, and then hand over pieces of that memory chunk to the client program in response to calls to malloc et al. Massif directly measures only these higher-level malloc et al calls, not the lower-level system calls.

Furthermore, a client program may use these lower-level system calls directly to allocate memory. By default, Massif does not measure these. Nor does it measure the size of code, data and BSS segments. Therefore, the numbers reported by Massif may be significantly smaller than those reported by tools such as top that measure a program’s total size in memory.




python to dump binary data in hex digits

Note hex() is a built-in, but I find it inconvenient. I need to print two digits with a leading 0.

Full source is hosted in https://github.com/tiger40490/repo1/blob/py1/tcpEchoServer.py

def Hex(data): # a generator function
  i=0
  for code in map(ord,data):
    yield "%02x " % code
    i += 1
    if i%8==0: yield ' '

print ''.join(Hex("\x0a\x00")); exit(0)
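A python-3 equivalent (my own sketch, not from the repo above, which is python 2):

```python
# Same two-digit hex dump with an extra gap every 8 bytes,
# written for python 3 where iterating bytes yields ints directly.
def hex_dump(data: bytes) -> str:
    out = []
    for i, code in enumerate(data, 1):
        out.append("%02x " % code)       # two digits, leading zero
        if i % 8 == 0:
            out.append(' ')              # extra gap every 8 bytes
    return ''.join(out)

print(hex_dump(b"\x0a\x00"))
```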

As CTO, I’d favor transparent langs, wary of outside libraries

If I were to propose a system rewrite, or start a new system from scratch without constraints like legacy code, then Transparency (+instrumentation) is my #1 priority.

  • c++ is the most opaque. Just look at complex declarations, or linker rules, the ODR…
  • I feel more confident debugging java. The JVM is remarkably well-behaving (better than CLR), consistent, well discussed on-line
  • key example — the SOAP stub/skeleton hurts transparency, so do the AOP proxies. These semi-transparent proxies are not like regular code you can edit and play with in a sandbox
  • windows is more murky than linux
  • There are many open-source libraries for java, c++, py etc, but many of them hurt transparency. I think someone like Piroz may say Spring is a transparent library
  • SQL is transparent except performance tuning
  • Based on my 1990’s experience, I feel javascript is transparent but I could be wrong.
  • I feel py, perl are still more transparent than most compiled languages. They too can become less transparent, when the codebase grows. (In contrast, even small c++ systems can be opaque.)

This letter is about which language, but allow me to digress briefly. For data store and messaging format (both require serialization), I prefer the slightly verbose but hugely transparent solutions, like FIX, CTF, json (xml is too verbose) rather than protobuf. Remember 99% of the sites use only strings, numbers, datetimes, and very rarely involve audio/visual data.
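To make the transparency point concrete, a json payload (field names invented) is plain greppable text, unlike a binary encoding such as protobuf:

```python
import json

# a typical message: only strings, numbers, datetimes -- as noted above
msg = {"symbol": "IBM", "price": 142.5, "ts": "2018-05-02T09:30:00"}
wire = json.dumps(msg)
print(wire)   # readable in logs, in any editor, searchable with grep

# and it round-trips without a schema or generated stubs
assert json.loads(wire) == msg
```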

q[g++ -g -O] together

https://linux.die.net/man/1/g++ has a section specifically on debugging. It says

GCC allows you to use -g with -O

I think -g adds additional debug info into the binary to help debuggers; -O turns on compiler optimization.

By default, our binaries are compiled with “-g3 -O2”. When I debug these binaries, I can see variables, but the optimizer rearranges lines relative to the source code, causing minor problems. See my blog posts on gdb.

gdb skill level@Wall St

I notice that absolutely None of my c++ veteran colleagues (I asked only 3) [2] is a gdb expert, in the way there are concurrency experts, algo experts [1], …

Most of my c++ colleagues are reluctant to use a console debugger. Many are more familiar with GUI debuggers such as eclipse and MSVS. All agree that prints are often a sufficient debugging tool.

[1] Actually, these other domains are more theoretical and produce “experts”.

[2] maybe I didn’t meet enough true c++ alpha geeks. I bet many of them may have very good gdb skills.

I would /go out on a limb/ to say that gdb is a powerful tool and can save lots of time. It’s similar to adding a meaningful toString() or operator<< to your custom class.

Crucially, it could help you figure things out faster than your team peers. I first saw this potential when learning remote JVM debugging in GS.

— My view on prints —
In perl and python, I use prints exclusively and never needed interactive debuggers. However, in java/c++/c# I heavily relied on debuggers. Why the stark contrast? No good answer.

Q: when are prints not effective?
A: when the edit-compile-test cycle is too long, not automated but too frequent (like 40 times in 2 hours) and when there is real delivery pressure. Note the test part could involve many steps and many files and other systems.
A: when you can’t edit the file at all. I have not seen this happen.

A less discussed fact — prints are simple and reliable. GUI and console debuggers are often poorly understood. Look at step-through: optimization, threads, and exceptions often have unexpected impacts. Or look at program state inspection: many variables are hard to “open up” in console debuggers, whereas you can always print var.func1().

 

Locate msg]binary feed #multiple issues solved

Hi guys, thanks to all your help, I managed to locate the very first trading session message in the raw data file.

We hit and overcame multiple obstacles in this long “needle search in a haystack”.

  • Big Obstacle 1: endian-ness. It turned out the raw data is little-endian. My “needle”, the symbol integer id 15852 (in decimal) or 3dec (in hex), is printed byte-swapped as “ec3d” when I finally found it.

Solution: read the exchange spec. It should be mentioned.

  • Big Obstacle 2: my hex viewers (like “xxd”) add line breaks to the output, so my needle can be missed during my search. (Thanks to Vishal for pointing this out.)

Solution 1: xxd -c 999999 raw/feed/file > tmp.txt; grep $needle tmp.txt

The default xxd column size is 16 so every 16 bytes output will get a line break — unwanted! So I set a very large column size of 999999.

Solution 2: in vi editor after “%!xxd -p” if you see line breaks, then you can still search for “ec\_s*3d”. Basically you need to insert “\_s*” between adjacent bytes.

Here’s a 4-byte string I was able to find. It spans lines: 15\_s*00\_s*21\_s*00

  • Obstacle 3: identify the data file among 20 files. Thanks to this one obstacle, I spent most of my time searching in the wrong files 😉

Solution: remove each file successively, starting from the later hours, and retest, until the needle stops showing. The last removed file must contain our needle. That file is a much smaller haystack.

    • One piece of misleading info is the “9.30 am” mentioned in the spec. Actually the message came much earlier.

    • Another piece of misleading info is the timestamp passed to my parser function. Not sure where it comes from, but it says 08:00:00.1 am, so I thought the needle must be in the 8am file; actually, it is in the 4am file. In this feed, the only reliable timestamp I have found is the one in the packet header, one level above the messages.

  • Obstacle 4: my “needle” was too short so there were too many useless matches.

Solution: find a longer and more unique needle, such as the SourceTime field, which is a 32-bit integer. When I convert it to hex digits I get 8 hex digits. Then I flip it due to endian-ness. Then I get a more unique needle “008e0959”. I was then able to search across all 14 data files:

for f in arca*0; do
  xxd -c999999 -p $f > $f.hex
  grep -ioH 008e0959 $f.hex && echo found in $f
done
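The byte flips above can be double-checked in python with the struct module. (The 32-bit value 0x59098e00 is reconstructed from the needle, so treat it as illustrative.)

```python
import struct

# symbol id 15852 decimal == 0x3dec; in the little-endian feed
# it appears byte-swapped as "ec3d"
assert struct.pack('<H', 15852).hex() == 'ec3d'

# the longer SourceTime needle: flipping the 4 bytes of 0x59098e00
# yields the 8-hex-digit needle used in the grep loop
needle = struct.pack('<I', 0x59098e00).hex()
print(needle)
```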

  • Obstacle 5: I have to find and print the needle using my c++ parser. It’s easy to print a wrong hex representation using C/C++, so for most of this exercise I wasn’t sure if I was looking at a correct hex dump in my c++ log.

    • If you convert a long byte array to hex and print without whitespace, you could see 15002100ffffe87600, but when I added a space after each byte, it looks like 15 00 21 00 ffffe876 00, so one of the bytes was sign-extended past two digits without warning!

    • If you forget padding, then you see a lot of single “0” where you should get “00”. Again, if you don’t include whitespace you won’t notice.

Solution: I have worked out some simplified code that works. I have a c++ solution and a c solution. You can ask me if you need it.

  • Obstacle 6: in some feeds, the sequence number is not in the raw data. In this feed the sequence number is present, so Nick’s suggestion was valid, but I was blocked by the other obstacles.

Tip: If sequence number is in the feed, you would probably spot a pattern of incrementing hex numbers periodically in the hex viewer.

char-array dump in hex digits: printf/cout

C++ code is convoluted. Must cast twice!

// same output from c and c++: 57 02 ff 80
#include <cstdio>
void dumpBufferPrintf(){
  static const char tag[] = {'W', 2, '\xFF', '\x80'}; // char literals avoid narrowing warnings
  for(unsigned i = 0; i < sizeof(tag)/sizeof(char); ++i)
    printf("%02hhx ", tag[i]); // hh: print the char as an unsigned byte
  printf("\n");
}
///////////////////
#include <iostream>
#include <sstream> //stringstream
#include <iomanip> //setfill, setw

//This function was also used to dump a class instance. See below
void dumpBufferCout(const char * buf, size_t const len){
  std::stringstream ss;
  ss << std::hex << std::setfill('0');
  for(size_t i=0; i<len; ++i){
    if (i%8 == 0) ss << "  ";
    ss << std::setw(2) << (int)(unsigned char)buf[i] << " "; // cast twice: to unsigned char, then to int
  }
  std::cerr << ss.str() << std::endl;
}
// usage, e.g. on a struct instance:
// dumpBufferCout((const char*)&myStruct, sizeof(myStruct));

q[nm] instrumentation #learning notes

When you want to reduce the opacity of the c++ compiled artifacts, q(nm) is instrumental. It is related to other instrumentation tools like

c++filt
objdump
q(strings -a)

Subset of noteworthy features:
--print-file-name
--print-armap? Tested with my *.a file. The filename printed is different from the above
--line-numbers? Tested
--no-sort
--demangle? Similar to c++filt but c++filt is more versatile
--dynamic? for “certain” types of shared libraries
--extern-only

My default command line is


nm --print-armap --print-file-name --line-numbers --demangle
nm --demangle ./obj/debug/ETSMinervaBust/src.C/ReparentSource.o //worked better

In May 2018, I ran nm on a bunch of *.so files (not *.a) to locate missing symbol definitions. Once I found a needed symbol was exported by libabc.so, I had to add -labc to my g++ command line.

[17] 5 unusual tips@initial GTD

See also https://bintanvictor.wordpress.com/wp-admin/edit.php?s&post_status=all&post_type=post&action=-1&m=0&cat=560907660&filter_action=Filter&paged=1&action2=-1

* build up instrumentation toolset
* Burn weekends, but first … build momentum and foundation including the “instrumentation” detailed earlier
* control distractions — parenting, housing, personal investment, … I didn’t have these in my younger years. I feel they take up O2 and also sap the momentum.
* Focus on output that’s visible to boss, that your colleagues could also finish so you have nowhere to hide. Clone if you need to. CSDoctor told me to buy time so later you can rework “under the hood” like quality or design

–secondary suggestions:
* Limit the amount of “irrelevant” questions/research, when you notice they are taking up your O2 or dispersing the laser. Perhaps delay them.

Inevitably, this analysis relies on past work experiences. Productivity (aka GTD) is a subjective, elastic yardstick. #1 Most important is GTD rating by boss. It sinks deep… #2 is self-rating https://bintanvictor.wordpress.com/2016/08/09/productivity-track-record/

bash script: demo`trap,read,waiting for gdb-attach

#1 valuable feature is the wait for gdb to attach, before unleashing the data producer.

#2 signal trap. I don’t have to remember to kill off background processes.

# Better source this script. One known benefit -- q(jobs) command would now work

sigtrap(){
  echo Interrupted
  kill %1 %2 %3 %4 # q(jobs) can show the process %1 %2 etc
  set -x
  trap - INT
  trap # show active signal traps
  sleep 1
  jobs
  set +x
}

set +x
ps4 # my alias to show relevant processes
echo -e "\njobs:"
jobs
echo -en "\nContinue? [y/any_other_key] "
unset REPLY; read REPLY # q(read $REPLY) was a bug -- it expanded to a bare q(read)
[ "$REPLY" = "y" ] || return

trap "sigtrap" INT # not sure what would happen to the current cmd and to the shell

base=/home/vtan//repo/tp/plugins/xtap/nysemkt_integrated_parser/
pushd $base
make NO_COMPILE=1 || return
echo '---------------------'
popd
/bin/rm --verbose $base/*_vtan.*log /home/vtan/nx_parser/working/CSMIParser_StaticDataMap.dat

set -x

#If our parser is a client to rebus server, then run s2o as a fake rebus server:
s2o 40490|tee $base/2rebus_vtan.bin.log | decript2 ctf -c $base/etc/cdd.cfg > $base/2rebus_vtan.txt.log 2>&1 &

#if our parser is a server outputing rtsd to VAP, then run c2o as a fake client:
c2o localhost 40492|tee $base/rtsd_vtan.bin.log | decript2 ctf -c $base/etc/cdd.cfg > $base/rtsd_vtan.txt.log 2>&1 &

# run a local xtap process:
$base/shared/tp_xtap/bin/xtap -c $base/etc/test_replay.cfg > $base/xtap_vtan.txt.log 2>&1 &
#sleep 3; echo -en "\n\n\nDebugger ready? Start pbflow? [y/any_other_key] "
#unset REPLY; read $REPLY; [ "$REPLY" = "y" ] || return

# playback some historical data, on a multicast port:
pbflow -r999 ~/captured/ax/arcabookxdp1-primary 224.0.0.7:40491 &

set +x
jobs
trap

[09]%%design priorities as arch/CTO

Priorities depend on industry, target users and managers’ experience/preference… Here are my Real answers:

A: instrumentation (non-opaque) — #1 priority to an early-stage developer, not to a CTO.

Intermediate data store (even binary) is great — files; reliable[1] snoop/capture; MOM

[1] seldom reliable, due to the inherent nature — logging/capture, even error messages are easily suppressed.

A: predictability — #2 priority (I don’t prefer the word “reliability”); related to instrumentation. I hate opaque surprises and intermittent errors like

  • GMDS green/red LED
  • SSL in Guardian
  • thick, opaque libraries like Spring
  In contrast, some predictable technologies:
  • Database is rock-solid predictable.
  • javascript was predictable in my pre-2000 experience.
  • automation scripts are often more predictable, but advanced python is not.

(bold answers are good interview answers.)
A: separation of concern, encapsulation.
* any team dev effort needs task breakdown. PWM tech department consists of teams supporting their own systems, which talk to each other on an agreed interface.
* Use proc and views to allow data source internal change without breaking data users (RW)
* ftp, mq, web service, ssh calls, emails between departments
* stable interfaces. Each module’s internals are changeable without breaking client code
* in GS, any change in any module must be done along with other modules’ checkout, otherwise that single release may impact other modules unexpectedly.

A: prod support and easy to learn?
* less support => more dev.
* easy to reproduce prod issues in QA
* easy to debug
* audit trail
* easy to recover
* fail-safe
* rerunnable

A: extensible and configurable? It often adds complexity and workload. Probably the #1 priority among managers I know on wall st. It’s all about predicting what features users might add.

How about time-to-market? Without testability, changes take longer to regression-test? That’s pure theory. In trading systems, there’s seldom automated regression testing.

A: testability. I think Chad also liked this a lot. Automated tests are less important to Wall St than other industries.

* each team’s system to be verifiable to help isolate production issues.
* testable interfaces between components. Each interface is relatively easy to test.

A: performance — always one of the most important factors if our system is ever benchmarked in a competition. Benchmark statistics are circulated to everyone.

A: scalability — often needs to be an early design goal.

A: self-service by users? reduce support workload.
* data accessible (R/W) online to authorized users.

A: show strategic improvement to higher management and users. This is how to gain visibility and promotion.

How about data volume? important to eq/fx market data feed, low latency, Google, facebook … but not to my systems so far.

DB=%% favorite data store due to instrumentation

The noSQL products all provide some GUI/query, but not very good ones. Piroz had to write a web GUI to show the content of gemfire. Without the GUI it’s very hard to manage anything that’s built on gemfire.

As data stores, even binary files are valuable.

Note snoop/capture is not a data store, but falls in the same category as logging. Both are easily suppressed, including critical error messages.

Why is RDBMS my #1 pick? ACID requires every datum to be persistent/durable, therefore viewable from any 3rd-party app, so we aren’t dependent on the writer application.

c++ compiler to print __cplusplus

Based on http://stackoverflow.com/questions/1562074/how-do-i-show-the-value-of-a-define-at-compile-time:

#define VALUE(x) #x
#define VAR_NAME_VALUE(var) #var "=" VALUE(var)
#pragma message(VAR_NAME_VALUE(__cplusplus))
---- save the above in dummy.cpp ----

g++ -std=c++14 dummy.cpp # shows:
dummy.cpp:7:44: note: #pragma message: __cplusplus=201300L

strace, ltrace, truss, oprofile, gprof – random notes

[[optimizing Linux performance]] has usage examples of ltrace.
I think truss is the oldest and most well-known.
Q: what values do the others add?
truss, strace, ltrace all show function arguments, though pointers to objects will not be “dumped”. (Incidentally, I guess apptrace has a unique feature to dump arguments of struct types.)
strace/ltrace are similar in many ways…
ltrace is designed for shared-library tracing, but can also trace syscalls.
truss is designed for syscalls, but “-u” covers shared libraries.
oprofile — can measure time spent and hit rates on library functions

## c++ instrumentation tools

(Each star means one endorsement)

oprofile **
gprof *
papi
callgrind (part of valgrind)
sar *
strace *
(Compared to strace, I feel there are more occasions when ltrace is useful.)
systemtap
tcpdump
perf
perf_events


libc
* Pin threads to CPUs. This prevents threads from migrating between cores and invalidating caches etc. (sched_setaffinity)

See more http://virtualizationandstorage.wordpress.com/2013/11/19/algo-high-frequency-trading-design-and-optmization/

[[debug it]] c++, java.. — tips

I find this book fairly small and practical. No abstract theories. Uses c++, java etc for illustrations.

Covers unix, windows, web app.

=== debugging memory allocators
–P170
memory leaks
uninitialized variable access
variable access after deallocation
–p199
Microsoft VC++ has a debugging mem allocator built in. Try http://msdn.microsoft.com/en-us/library/e5ewb1h3(v=vs.90).aspx

Electric Fence

===
–P201 DTrace – included in Mac OS X
–P202 WireShark, similar to tcpdump
–P203 firebug – client-side debugging
edit DOM
full javascript debugging

–P188 rewrites – pitfalls

–A chapter on reproducing bugs — quite practical

string,debugging+other tips:[[moving from c to c++]]

[[moving from c to c++]] is fairly practical. Not full of good-on-paper “best practice” advice.

P132 don’t (and why) put “using” in header files
P133 nested struct
P129 varargs suppressing arg checking
P162 a practical custom Stack class non-template
P167 just when we could hit “missing default ctor” error. It’s a bit complicated.

–P102 offers practical tips on c++ debugging

* macro DEBUG flag can be set in #define and also … on the compiler command line
* frequently people (me included) don’t want to recompile a large codebase just to add DEBUG flag. This book shows simple techniques to turn on/off run-time debug flags
* perl dumper receives a variable $abc and dumps the value of $abc and also … the VARIABLE NAME “abc”. C has a similar feature via the preprocessor stringize operator “#”

— chapter on the standard string class — practical, good for coding IV

* ways to initialize

* substring

* append

* insert

y learn Trace/Debug classes in dotnet

I believe log4net (and others like NLog) are better than the built-in, but the built-in has one nice feature — In visual studio, the output window conveniently shows the log messages if you use Trace.WriteLine or Debug.WriteLine.

Note output window isn’t the console.

Can log4net write to the output window? See http://angledontech.blogspot.sg/2011/08/direct-log4net-output-to-visual-studio.html. It’s a bit of work. When I start a new project and have no time to set up log4net, I use Trace/Debug, which dumps to the output window automatically, by default.

Reason2 – some projects may use the builtin, not log4net.

Reason3 – some projects use other logging frameworks, so you may need to learn multiple logging frameworks. If you are lazy and you can decide, then the builtin is a basic solution always available.

##coding guru tricks (tools) learnt across Wall St teams

(Blogging. No need to reply.)

Each time I join a dev team, I tend to meet some “gurus” who show me a trick. If I am in a team for 6 months without learning something cool, that would be a low-calibre team. After Goldman Sachs, I don’t remember a sybase developer who showed me a cool sybase SQL trick (or any generic SQL trick). That’s because my GS colleagues were too strong in SQL.

After I learn something important about an IDE, in the next team again I become a newbie to the IDE since this team uses other (supposedly “common”) features.

eg: remote debugging
eg: hot swap
eg: generate proxy from a web service
eg: attach debugger to running process
eg: visual studio property sheets
eg: MSBuild

I feel this happens to a lesser extent with a programming language. My last team uses some c++ features and next c++ team uses a new set of features? Yes but not so bad.

Confucius said “Among any 3 people walking by, one of them could be a teacher for me”. That’s what I mean by guru.

Eg: a Barcap colleague showed me how to make a simple fixed-size cache with FIFO eviction-policy, based on a java LinkedHashMap.
Eg: a guy showed me a basic C# closure in action. Very cool.
Eg: a Taiwanese colleague showed me how to make a simple home-grown thread pool.
Eg: in Citi, i was lucky enough to have a lot of spring veterans in my project. They seem to know 5 times more spring than I do.
Eg: a sister team in GS had a big, powerful and feature-rich OO design. I didn’t know the details but one thing I learnt was — the entire OO thing has a single base class
Eg: GS guys taught me remote debugging and hot replacement of a single class
Eg: a guy showed me how to configure windows debugger to kick-in whenever any OS process dies an abnormal death.
Eg: GS/Citi guys showed me how to use spring to connect jconsole to the JVM management interface and change object state through this backdoor.
Eg: a lot of tricks to investigate something that’s supposed to work
Eg: a c# guy showed me how to consolidate a service host project and a console host project into a single project.
Eg: a c# guy showed me new() in generic type parameter constraints
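The first trick in the list, a fixed-size cache with FIFO eviction, can be sketched in python (collections.OrderedDict standing in for java’s LinkedHashMap with removeEldestEntry; names are illustrative):

```python
# Fixed-size cache with FIFO eviction: once full, the oldest
# insertion is dropped to make room for the newest.
from collections import OrderedDict

class FifoCache:
    def __init__(self, capacity):
        self.capacity = capacity
        self.data = OrderedDict()

    def put(self, key, value):
        self.data[key] = value
        if len(self.data) > self.capacity:
            self.data.popitem(last=False)   # evict the oldest insertion

    def get(self, key):
        return self.data.get(key)

cache = FifoCache(3)
for k in "abcd":
    cache.put(k, k.upper())
# capacity 3, so the first key "a" has been evicted
```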

These tricks can sometimes fundamentally change a design (of a class, a module or sub-module)

Length of experience doesn’t always bring a bag of tricks. It’s ironic that some team could be using, say, java for 10 years without knowing hot code replacement, so these guys had to restart a java daemon after a tiny code change.

Q: do you know anyone who knows how to safely use Thread.java stop(), resume(), suspend()?
Q: do you know anyone who knows how to read a query plan and predict the physical/logical io statistics spit out by a database optimizer?

So how do people discover these tricks? Either learn from another guru or by reading. Then try it out, iron out everything and make the damn code work.

arch xp is overrated as KPI #instrumentation, tools…

See also other blog posts —

// “in the initial dev stage, instrumentation is my #1 design goal”.
// http://bigblog.tanbin.com/2010/09/2-important-categories-of-software.html

Going out on a limb, I’d say that in high-pace projects GTD (Getting Things Done) is more important than many things which are supposed to be more important.

GTD is more important than … integrating deep domain knowledge insight into the system. I think such domain knowledge can sometimes be valuable. It can help a dev team avoid wasting time on “the wrong things”.

GTD is more important than … quality of code. For a few months at least, the only guy looking at the code is the original author.

GTD is more important than … “quality of design”, which is seldom well-defined.

GTD is more important than … being nice to others. Many times other people can tolerate a bit of personality if you can GTD. However, you must not offend the people who matter.

GTD is more important than … impressive presentations that address many real customer pains. A user may die for a guy who “understands my pains”, but when it is time to deliver, selling/understanding is an irrelevant distraction.

GTD is more important than … in-depth knowledge of the language or of a software platform/product. Such knowledge is more important than GTD during interviews though. Threading, data structure…

GTD is more important than … uncovering the root cause of (intermittent) problems/errors/surprises — the Toyota drill-down root-cause investigation. A regex engine creator may need to fully characterize every unexpected behavior, and an ATM error probably deserves a drill-down investigation, but in enterprise apps drill-down investigation simply takes too much time. We developers are paid to build a usable tool, not a fully-understood tool. Live with ambiguity and move on.

GTD is more important than … adding an important test to the automated test suite. Sure, that mistake we just found may happen again, so we really appreciate adding that test, but in reality most high-pace environments don’t have an automated test suite. If we are lucky enough to have a documented manual test plan, then yes add it, but such a plan is seldom comprehensive, so after a while people won’t follow it religiously. So in reality we just hope developers learn the lesson and avoid making the same mistake. If the mistake recurs, we just need someone who can GTD. Any system to incorporate such hard-learned lessons is likely an imperfect system, and whoever invests in such a system is often wasting his/her time. If a GTD guy doesn’t bother with that system, he will still be respected and loved by manager and users. Basically, any long-term investment is unappreciated. GTD is all about short-term results. This is the reality of quality control in fast-paced teams.

GTD is more important than … adding meaningful error messages. Anyone debugging the system would love the guy who added the meaningful error message, but he is an unsung hero. Managers love the guy who GTDs fast. Code quality is invisible and therefore ignored and unrewarded.

To achieve GTD, you must solve tech problems. Tech problems could (occasionally) call for an architectural perspective, but less often than they call for tool knowledge, debugging experience, or low-level system insight.

In defense of architects, architectural track record is more important in sales contexts, including selling to internal clients and business users.

I should probably get input from Raja, Xiao An, Yi Hai …

2kinds@essential developer tools #WallSt+elsewhere

Note: there are also “common investigative” tools, but for now I will ignore them. Reason: most developers have only superficial knowledge of them, so their myriad advanced features actually fall into category 2) below.

Note: in the get-things-done stage, performance tools are much less “useful” than logic-revealing tools. In my experience, perf tools seldom shed light on performance problems in DB, java, or MOM. Performance symptoms are often resolved by understanding the logical flow. The simplest and best tools simply reveal intermediate data.
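As a minimal illustration of a logic-revealing tool, here is a sketch (pipeline and names are made up) that simply prints the intermediate data after each stage, which is often all you need:

```python
import json

def trace(stage, data):
    """Reveal intermediate data after a processing stage, then pass it through."""
    print(f"[{stage}] {json.dumps(data, default=str)}")
    return data

# hypothetical two-stage pipeline over order records
orders = [{"id": 1, "qty": 5}, {"id": 2, "qty": 0}]
filled = trace("filter", [o for o in orders if o["qty"] > 0])
total = trace("sum", sum(o["qty"] for o in filled))
```

Primitive as it is, dropping such a `trace` call between stages answers most “why is the output wrong?” questions faster than a profiler would.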
————
I think most interview questions on dev tools largely fall into these two categories. Their value is more “real” than other tools.

1) common tools, either indispensable or for productivity
* turn on/off exception breakpoint in VS
* add references in VS
* resolve missing assembly/references in VS
* app.config in VS projects
* setting classpath in IDE, makefile, autosys…
* eclipse plugin to locate classes in jars anywhere in windows
* literally dozens of IDE pains
* cvs,
* maven, ant,
* MS excel
* junction in windows
* vi
* bash
* grep, find, tail
* browser view-source

2) Specialized investigative/analytical/instrumentation tools — these offer RARE insights: rarely needed, but of rare value. Most (if not all) of the stronger developers I have met have these secret weapons. However, most of us don’t have the bandwidth or motivation to learn all of these (obscure) features, because we don’t know which ones are worth learning.
* JMS — browser, weblogic JMS console…
* compiler flags
* tcpdump, snoop, lsof
* truss
* core dump analysis
* sar, vmstat, perfmeter,
* query plan
* sys tables, sp_lock, sp_depend
* set statistics io
* debuggers (IDE)
* IDE call hierarchy
* IDE search
* thread dump
* bytecode inspector, decompiler
* profilers
* leak detector
* jconsole
* jvmstat tools — visualgc, jps,..
* http://wiki.caucho.com/Category:Troubleshooting
* any tools for code tracing

Other productivity skills that no software tool can help with:
* log analysis
* when I first came to autoreo, i quickly characterized a few key tables.
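Log analysis, in particular, is often just pattern counting. A minimal sketch (the sample lines and severity levels are made up):

```python
import re
from collections import Counter

def summarize(lines):
    """Count log lines by severity — a quick way to spot anomalies in a big log."""
    counts = Counter()
    for line in lines:
        m = re.search(r"\b(ERROR|WARN|INFO)\b", line)
        if m:
            counts[m.group(1)] += 1
    return counts

sample = [
    "2024-01-01 INFO start",
    "2024-01-01 WARN slow query",
    "2024-01-01 ERROR timeout",
    "2024-01-01 ERROR timeout",
]
print(summarize(sample))
```

In practice you would correlate the counted patterns back to source code, as noted above.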

##instrumentation[def]skill sets u apart

Have you seen an old doctor in her 60’s or 70’s? What (practical) knowledge makes her valuable?

[T = Some essential diagnostic Tools are still useful after 100 years, even though new tools get invented. These tools must provide reliable observable data. I call it instrumentation knowledge]
[D = There’s constant high Demand for the same expertise for the past 300 years]
[E = There’s a rich body of literature in this long-Established field, not an Emerging field]

As a software engineer, I feel there are a few fields comparable to the practice of medicine —
* [TDE] DB tuning
* [DE] network tuning
* [T] Unix tuning
* [TDE] latency tuning
* [] GC tuning
* [DE] web tuning

Now the specific instrumentation skills

  • dealing with intermittent issues — reproduce
  • snooping like truss
  • binary data inspection
  • modify/create special input data to create a specific scenario. Sometimes you need a lot of preliminary data
  • prints and interactive debugger
  • source code reading, including table constraints
  • log reading and parsing — correlated with source code
  • data dumper in java, perl, py. Cpp is harder
  • pretty-print a big data structure
  • — above are the generic, widely useful items —
  • In the face of complexity, boil down the data and code components to a bare minimum to help you focus. Sometimes this is not optional.
  • black box testing — instrumentation usually requires source code, but some hackers can find out a lot about a system just by black-box probing.
  • instrumentation often requires manual steps — too many and too slow. In some cases you must automate them to make progress.
  • memory profiler + leak detector
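For the data-dumper and pretty-print items above, Python’s standard library already covers the need; the nested structure below is a made-up example:

```python
from pprint import pformat

# deeply nested structure — the kind that is unreadable on one line
state = {"order": {"id": 42,
                   "fills": [{"px": 99.5, "qty": 100},
                             {"px": 99.6, "qty": 200}]}}
dumped = pformat(state, width=40)
print(dumped)
```

`pformat` (or `pprint.pprint`) indents and wraps nested dicts/lists, which is exactly the “pretty-print a big data structure” skill; in java and cpp you typically need a library or hand-rolled dumper for the same effect.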

code tracing ^ performance tuning

In terms of practical value-add (keeping the job, managing workload, preserving time for family, keeping up with colleagues and staying out of the bottom quartile), the #1 challenge I see so far is legacy code tracing. How about latency and throughput?

market feed
eq/FX,
google, facebook
clearance

I think most performance optimization work is devoid of domain knowledge. Exceptions include
* high frequency trading
* CEP

perl var dumper "synopsis"

Q: For simple variables, a perl subroutine dump(‘price’) can dump @price content [1] along with the variable name — “price” in this case. But do we ever need to pass in a reference like dump(\@price, ‘price’)? How about a lexical my $price declared in a nested while inside an if, wrapped in a subroutine?

A: I think sooner or later you may have to pass in a ref, perhaps in a very rare and tricky context. To show the variable’s name, you need to pass 2 args in total — ref + name.

[1] in dump(), print Data::Dumper->Dump(map { [$_] } @_); — with @_ = ($ref, $name) this expands to Dump([$ref], [$name]), i.e. an arrayref of values paired with a parallel arrayref of names.
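A rough Python analog, for comparison (the `dump` helper is hypothetical): Python likewise cannot recover a variable’s name from its value, so the caller passes the name explicitly, just like the two-arg Perl version:

```python
import reprlib

def dump(name, value):
    """Print a variable's name alongside its (length-limited) repr."""
    line = f"${name} = {reprlib.repr(value)}"
    print(line)
    return line

price = [100.5, 101.25, 99.75]
dump("price", price)
```

`reprlib.repr` truncates huge containers, which keeps the dump readable for big intermediate data.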

real stories illustrating my debugging expertise — focus on java

see other stories in j,ora.xls

DNS reload
freebsd NIC driver
custom classloader breaking EJB casting
weblogic upgrade: max connections exceeded — ulimit for the shell, max_tcp_connections, file descriptors

jdbc failures — snoop

weblogic jms XA conn factory; disable catch clause to trigger rollback

2 (custom) class loaders loading the same class — cast exception

nbc: firefox and IE get different dates from the same URL. We reduced the code to the simplest; then I managed to get a Runtime.getRuntime().exec("ls /tmp");

log4j appenders in an assignment came from 2 classloaders, but the “murderer” (the classloader bug) looked like an accomplice — log4j errors are usually harmless