testing threaded designs – enterprise apps(!lib)

Bottom line – (unconventional wisdom) Be bold in creating new threaded designs. They don’t have to be rock-solid like those in the standard library.

Experts unanimously agree that non-trivial MT designs are hard to verify or test, often exceedingly hard. There are simply too many possibilities. A million tests may pass, and then the next test reveals a bug. Therefore peer review is the way to go. I feel that’s the “library” view, the view of the language creators. Enterprise apps are different.

In enterprise apps, if an MT design passes load test and UAT, that’s good enough. There is no budget to test further. If only 1 in a million cases fails, then that case has something special, perhaps a special combination of inputs or a pure timing coincidence. Strictly speaking those are still logical errors and design defects; a sound design ought to handle such cases gracefully, and most designs aren’t that sound. But when a failure happens so rarely, just about everyone involved would agree it’s practically impossible to catch during testing, so catching it should not be in anyone’s job scope here. Missing it is tolerable and only human.

A goalkeeper can’t be expected to save 99 penalties in a row.

In the enterprise reality, such a bug is probably never uncovered (unless a log happens to provide the evidence). Investigating such a rare issue takes too much effort to be worth it.


quiet confidence on go-live day

I used to feel “Let’s pray no bug is found in my code on go-live day. I didn’t check all the null pointers…”

I feel it’s all about… blame, even if managers make it a point to avoid blame.

Case: I once had a timebomb bug in my code. All tests passed, but the production system failed on the “scheduled” date. The UAT guys were not to blame.

Case: Suppose you used hardcoding to pass UAT. If things break on go-live, you bear the brunt of the blame.

Case: if a legitimate use case is mishandled on go-live day, then
* the UAT guys are at fault, including the business users who signed off. Often the business comes up with the test cases, so the blame question is “why wasn’t this use case specified?”
* Perhaps a more robust exception framework would have caught such a failure gracefully, but often the developer doesn’t bear the brunt of the blame.
**** I now feel business reality discounts code quality in the sense of airtight error-proofing.
**** I now feel business reality discounts automated testing for Implementation Imperfections (II). See http://bigblog.tanbin.com/2011/02/financial-app-testing-biz.html

Now I feel if you did a thorough and realistic UAT, then you should have quiet confidence on go-live day. Live data should be no “surprise” to you.

swing automated test, briefly

3) http://www.codework-solutions.com/testing-tools/qfs-test/ says …event is constructed and inserted artificially into the system’s EventQueue. To the SUT it is indistinguishable whether an event was triggered by an actual user or by QF-Test. These artificial events are more reliable than “hard” events that could be generated with the help of the AWT-Robot, for example, which could be used to actually move the mouse cursor across the screen. Such “hard” events can be intercepted by the operating system or other applications.

2) http://jemmy.java.net/Jemmy3UserTutorial.html and http://wiki.netbeans.org/Jemmy_Tutorial explain some fundamentals about component searching. Jemmy can simulate user input by mouse and keyboard operations.

1) java.awt.Robot is probably the most low-level — Using the class to generate input events differs from posting events to the AWT event queue —  the events are generated in the platform's native input queue. For example, Robot.mouseMove will actually move the mouse cursor instead of just generating mouse move events.
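To make the “soft” vs “hard” distinction concrete, here is a minimal sketch (the JButton, coordinates and keystroke are made-up stand-ins, not taken from any of the tools above). The first half builds a synthetic KeyEvent and posts it onto the AWT EventQueue; the second half drives the platform’s native input queue via java.awt.Robot, which really moves the cursor and can be intercepted by the OS.

import java.awt.AWTException;
import java.awt.Robot;
import java.awt.Toolkit;
import java.awt.event.KeyEvent;
import javax.swing.JButton;

public class SwingEventSketch {
    public static void main(String[] args) throws AWTException {
        JButton button = new JButton("Submit");   // stand-in for a component of the SUT

        // "Soft" event: constructed artificially and posted straight onto the EventQueue.
        // To the SUT it looks exactly like a user keystroke; the OS never sees it.
        KeyEvent enterPressed = new KeyEvent(button, KeyEvent.KEY_PRESSED,
                System.currentTimeMillis(), 0, KeyEvent.VK_ENTER, KeyEvent.CHAR_UNDEFINED);
        Toolkit.getDefaultToolkit().getSystemEventQueue().postEvent(enterPressed);

        // "Hard" event: java.awt.Robot injects into the platform's native input queue,
        // so the mouse cursor physically moves and other applications can intercept it.
        Robot robot = new Robot();
        robot.mouseMove(100, 100);
        robot.keyPress(KeyEvent.VK_ENTER);
        robot.keyRelease(KeyEvent.VK_ENTER);
    }
}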

IV questions/answers on trading system testing

See post on biz-scenario [S] ^ implementation error [I].

More and more interviewers seem to ask increasingly serious questions about trading system testing. I feel that’s part of our professional growth towards senior positions (though many Wall St lead developers I know aren’t big fans). In some cases, the higher you are paid, the deeper the testing experience expected of you.

Testing skill is a complement to design skill, and both become increasingly important as we move up. These interview questions tend to be open-ended. The interviewer gives us a chance to sell ourselves and convince them that we do have real experience and personal insight. Here are some pointers for myself. Your inputs welcome.
–At the Tools level, we can easily talk about Fitnesse, Mockito, JUnit, DbUnit etc [I]. Easy to read up on these, easy to tell real stories, and nothing unique to trading systems.
–Another level of easy topics – regression testing and integrated testing. Again nothing unique to trading systems. Since I have personal experience [S], I usually spend a lot of time at this level.
* scenario testing — essential in larger systems. Without scenarios, if you ask a rookie to list all the combinations of the various inputs, the list is endless and useless. Instead, I use assertions to cut down the combinations.
* matrix testing — I often draw up a spreadsheet with rows and columns of input values (see the sketch after this list).
* production data — essential if we are to involve users. The earlier we involve users, the better.
* I use JUnit to drive integrated tests against a real DB, real MOM, real cache servers and real web servers

–At another level, what is special to trading systems is load testing, performance testing, thread testing and MOM testing. MOM and especially thread testing use special techniques, but in reality these are seldom adopted. An interesting question I like to ask is “at what specific (number please) load level will my system crash?” It’s good if we can predict that magic number; if we can’t, we might get caught with our pants down, which is quite common. I feel trading system testing divides into logic testing and performance testing. In reality, most small enhancements need only logic testing.
–Perhaps the highest level is UAT [S]. I would mention Fitnesse. I would make it clear that in GS, users or BA are called upon to generate test cases.
–At a methodology level, a tougher question is something like “How do you ensure adequate testing before release?” or “How do you decide how much testing is enough?” Since I never used test-coverage [I] tools, I won’t mention them. Frankly, I don’t believe 80% test coverage means the code is more reliable than with 0% test coverage.
— The most productive GS team members do no automated testing at all. They rely on experience to know what scenarios are worth testing, and then test those manually, without automation. When we realize bugs have been released to production, in reality we almost always find our test scenarios were incomplete; automated tests wouldn’t have helped. But I can’t really say these things in interviews — the emperor’s new clothes.
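As promised in the matrix-testing bullet above, here is a minimal sketch of driving such a spreadsheet with JUnit’s Parameterized runner. The commission() method and its rates are made up purely to keep the example self-contained; in real life the rows would come from the spreadsheet and the method under test would live in production code.

import static org.junit.Assert.assertEquals;

import java.util.Arrays;
import java.util.Collection;

import org.junit.Test;
import org.junit.runner.RunWith;
import org.junit.runners.Parameterized;
import org.junit.runners.Parameterized.Parameters;

@RunWith(Parameterized.class)
public class CommissionMatrixTest {

    // Hypothetical calculator standing in for the real code under test.
    static double commission(String accountType, double notional) {
        double rate = "RETAIL".equals(accountType) ? 0.0015 : 0.0005;
        return notional * rate;
    }

    // Each row mirrors one cell combination of the spreadsheet matrix.
    @Parameters(name = "{0} / {1}")
    public static Collection<Object[]> rows() {
        return Arrays.asList(new Object[][] {
                { "RETAIL",        10_000d,  15d },
                { "RETAIL",        500_000d, 750d },
                { "INSTITUTIONAL", 10_000d,  5d },
                { "INSTITUTIONAL", 500_000d, 250d },
        });
    }

    private final String accountType;
    private final double notional;
    private final double expected;

    public CommissionMatrixTest(String accountType, double notional, double expected) {
        this.accountType = accountType;
        this.notional = notional;
        this.expected = expected;
    }

    @Test
    public void commissionMatchesMatrixCell() {
        assertEquals(expected, commission(accountType, notional), 0.001);
    }
}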

financial app testing – biz scenario^implementation error

Within the domain of _business_logic_ testing, I feel there are really 2 very different targets/focuses – Serious Scenario (SS) vs Implementation Imperfections (II). This dichotomy cuts through every discussion in financial application testing.

* II is the focus of junit, mock objects, fitnesse, test coverage, Test-driven-development, Assert and most of the techie discussions
* SS is the real meaning of quality assurance (QA) and user acceptance (UA) testing on Wall St. In contrast, II doesn’t provide that assurance, whether QA or UA.

SS is about real, serious scenario, not the meaningless fake scenarios of II.

When we find that bugs have been released to production, in reality we invariably trace the root cause to an incomplete set of SS, and seldom to II. Managers, users, BAs … are concerned with SS only, never II. SS requires business knowledge. I discussed this with a developer (Mithun) on a large Deutsche Bank application. He pointed out their technique of verifying intermediate data in a workflow SS test, and said II is simply too much effort for too little value.

NPE (NullPointerException) is a good example of the II tester’s mindset. Good scenario testing can provide good assurance that NPE doesn’t happen in any acceptable scenario. If a scenario is not within scope and not in the requirements, then in that scenario the system behavior is “undefined”. Test coverage is important in II, but if some (NPE-ridden) execution path is never exercised by our SS, then that path isn’t important and can safely be left untested in many financial apps. I’m speaking from practical experience.

Regression testing should be done in II testing and (more importantly) SS testing.

SS is almost always integrated testing, so mock objects won’t help.

automated test in non-infrastructure financial app

I worked in a few (non-infrastructure) financial apps under different senior managers. Absolutely zero management buy-in for automated testing (atest).

“automated testing is nice to have” — is the unspoken policy.
“regression test is compulsory” — but almost always done without automation.
“Must verify all known scenarios” — is the slogan, but if there are too many (5+) known scenarios and not all of them are critical, then usually there is no special budget or test plan.

I feel atest is a cost/risk analysis for the manager, just like a market risk system. The cost of maintaining a large atest and QA system is real. It is justified by
* risk, which ultimately must translate into cost;
* speeding up future changes;
* building confidence.

Reality is very different on Wall Street. [1]
– Blackbox confidence [2] is provided not by test coverage but by being “battle-tested”. Many minor bugs (catchable by atest but never caught) will not show up in a fully battle-tested system; ironically, a system with 90% atest coverage may show many kinds of problems once released. Which one enjoys confidence?

– future changes are only marginally helped by atest. Many atests become irrelevant. Even if a test scenario remains relevant, the test method may need a complete rewrite.

– in reality, system changes are budgeted in a formal corporate process. Most large banks deal in man-months, so they won’t appreciate a few days of effort saved (and not easily saved) thanks to automated tests.

– Risk is a vague term. At the whitebox level, automated tests provide visible and verifiable evidence and therefore a level of assurance, but as a test writer I know that a passing test can, paradoxically, cover up bugs. I never look at someone’s automated tests and feel that’s “enough” coverage; only the author knows how much coverage there really is. So risk reduction is questionable even at the whitebox level, and the blackbox matters more from the risk perspective. For a manager, risk is real, and automated tests offer only partial protection.

– If your module has a high concentration of if/else logic and computation, then it’s a different animal: automated tests are worthwhile.

[1] Presumably, IT product vendors (and infrastructure teams) are a different animal, with large install bases and stringent defect-tolerance levels.
[2] Users, downstream/upstream teams and managers always treat your code as a blackbox, even if they analyze your code. People maintaining your codebase see it as a whitebox. Blackbox confidence is more important than whitebox confidence.

dependency injection – promises && justifications

* “boilerplate code slows you down”.
* “code density”
* separating behavior from configuration. See post on “motivation of spring”
* code reuse and compile-time dependencies — cutting transitive dependencies. You can move the air-con without moving the house; otherwise “to build one class, you must build the entire system”.
* “tests are easier to write, so you end up writing more tests.”
* easy to mock

These are the promises and justifications as the authors and experienced users describe them. A small sketch below illustrates a few of them.
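A minimal constructor-injection sketch of those promises (PricingService, OrderValidator and the 10% sanity check are hypothetical names, not from any real framework or system): behavior stays in the class, the choice of implementation lives outside it, and a test can swap in a stub or a Mockito mock without touching real infrastructure.

// Without DI, OrderValidator would "new up" its own pricing service, so building
// this one class drags in the whole pricing subsystem ("to build one class, you
// must build the entire system"). With constructor injection, the wiring is done
// outside -- by Spring, Guice or plain code.
interface PricingService {
    double midPrice(String symbol);
}

class OrderValidator {
    private final PricingService pricing;        // injected, never new'ed here

    OrderValidator(PricingService pricing) {     // constructor injection
        this.pricing = pricing;
    }

    boolean isReasonable(String symbol, double limitPrice) {
        double mid = pricing.midPrice(symbol);
        return Math.abs(limitPrice - mid) / mid < 0.10;   // within 10% of mid
    }
}

public class DiSketch {
    public static void main(String[] args) {
        // In a test we inject a stub (or a mock) instead of the real market-data feed.
        PricingService stub = symbol -> 100.0;
        OrderValidator validator = new OrderValidator(stub);
        System.out.println(validator.isReasonable("IBM", 105.0));   // true
        System.out.println(validator.isReasonable("IBM", 130.0));   // false
    }
}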

Mockito partial mock (googlemock?)

partial mock using thenCallRealMethod()

First understand the norm — a full mock. Usually your MethodUnderTest (MUT) is on object A, which depends on object B. Object B is hard to instantiate => mock up B. ALL the methods of B are mocked using thenReturn().

Now what if the MUT in A depends on A.m2(), which depends on external infrastructure => hard to run during a test => mock up m2() using thenReturn() => A itself must be a mock object. To test the MUT, you then have to register the MUT with thenCallRealMethod().
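A minimal Mockito sketch of the above (class A and methods m1/m2 are hypothetical stand-ins; in practice Mockito.spy() is the more common route to a partial mock, but thenCallRealMethod() works as described):

import static org.junit.Assert.assertEquals;
import static org.mockito.Mockito.mock;
import static org.mockito.Mockito.when;

import org.junit.Test;

public class PartialMockTest {

    // Hypothetical class: m1() is the MUT, m2() hits external infrastructure.
    static class A {
        public double m1(String symbol) {
            return m2(symbol) * 1.1;          // MUT depends on m2()
        }
        public double m2(String symbol) {
            throw new UnsupportedOperationException("needs live infrastructure");
        }
    }

    @Test
    public void partialMock() {
        A a = mock(A.class);                        // A itself must be a mock
        when(a.m2("IBM")).thenReturn(100.0);        // stub out the infrastructure call
        when(a.m1("IBM")).thenCallRealMethod();     // but run the real MUT
        assertEquals(110.0, a.m1("IBM"), 0.001);
    }
}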

Not sure how googlemock (C++) does this…

fitnesse Q&&A

(We will intermix “fix”, “fixt” and “fixture”.)

Fitnesse runs with 2 JVMs:
* a long-running infrastructure JVM
* a JVM launched for the system under test

The FIT engine interprets HTML, inbound and outbound. Fitnesse wraps FIT.

set commission rate -> setCommissionRate() will be called.

Q: automated build?
A: Ant can trap the FIT failures

Q: cvs?
A: Fitnesse keeps versions of each page

Q: books on FIT tricks?
A: they may not sell, since the Fitnesse guys did good documentation(?)

— 5 types of colors
* green: expected = actual
* red: obvious
* yellow: a broken (not found) fixture — thanks to a blog reader’s comment below.
* grey background: nothing to run; the cell content simply shows. See the top summary saying “5 ignored”, meaning “5 table cells ignored”. Good for comments. We should leave the “expected” output header empty.
* grey text: an expected value we don’t care about; the actual value is shown. Good for debugging. We should leave the “expected” output header empty.

Q: What if I just want to execute a method and the result is not important (or the method is void)?
A: use the SetUpFixture instead (see http://www.fitnesse.info/fixturegallery:fitlibraryfixtures:setupfixture).
A: the ColumnFixture will also let you just print the result of the calculation without testing it.
A: or try an empty cell in the Fitnesse table.

—- various fixtures —-
One table => one fixture class. For a doFixture, one page => one fixture class; a stateful fixture.

Column/row fixtures need one fixture class per TABLE; a doFixture needs one fixture class per PAGE. That’s the norm. However, a doFix can also instantiate a rowFix, as Peter Walker showed.

— col fixt —
one input column => one field
one output column => one method
————-
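A hedged sketch of such a column fixture and the wiki table it might back (fixture, field and column names are all made up):

!|CommissionFixture|
|notional|rate  |commission?|
|100000  |0.0005|50.0       |

import fit.ColumnFixture;

public class CommissionFixture extends ColumnFixture {
    public double notional;        // input column => public field
    public double rate;            // input column => public field

    public double commission() {   // "commission?" output column => method
        return notional * rate;
    }
}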

— testing
Q: How do I know what type of ‘error’ matched my expected ‘error’?
A: provide another output method that exposes the exception details…

Q: how to share an object between 2 column fixtures on a page, so that the first fixture populates it for the second?
A: use a new class holding a static reference to the shared object
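A small sketch of that static-holder idea (all names hypothetical; in practice each fixture would be a public class in its own source file):

import fit.ColumnFixture;

// Plain holder class keeping a static reference so that two column fixtures
// on the same page see the same object.
class SharedTrade {
    static double quantity;
}

// The first table on the page populates the shared state...
public class EnterTradeFixture extends ColumnFixture {
    public double quantity;               // input column

    public boolean entered() {            // "entered?" output column
        SharedTrade.quantity = quantity;
        return true;
    }
}

// ...and a second table, backed by another fixture, reads it back.
class CheckTradeFixture extends ColumnFixture {
    public double quantity() {            // "quantity?" output column
        return SharedTrade.quantity;
    }
}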

Q: other output columns for an error row?
A: leave blank

Q: how to say expected = null?
A: just type the word null in the cell

Q: what if an “expected” value is in camel case and Fitnesse mistakes it for a WikiWord?
A: escape it with ! or “!-…-!”

Q: use FIT to test non-public stuff?
A: not designed for that; try a parallel tree

— wiki
Q: save the editor?
A: alt-S

Q: collapse or hide a table?
A: http://localhost/FitNesse.QuickReferenceGuide

Q: comment?
A: see notes on grey-background

Q: comment at end or elsewhere?
A: yes

!contents // will render to a table of links

brief notes from the Test-driven-development course

Refactor a 20-line nested if/else chunk of procedural code into design patterns? You will create new classes, but that’s not bad complexity. The OO trainer actually refactored a somewhat complex 20-line if/else into a multi-class State pattern!

Test coverage measures how many execution paths there are through your method and how many of them are tested. Apparently every if/else creates new execution paths. I didn’t believe developers could realistically write tests for every IF in a method (a method worth testing). The TDD trainer showed us, through 8 hours of hands-on demo, that it is actually best practice to write that many tests, and that it is not unrealistic to maintain all of them when the application changes.
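To make the “paths” point concrete, here is a tiny made-up illustration: two independent decisions give four execution paths, and one test per path in the spirit of the trainer’s demo.

import static org.junit.Assert.assertEquals;

import org.junit.Test;

public class FeePathTest {

    // Hypothetical method: two independent if/else decisions => 4 execution paths.
    static double fee(boolean vip, boolean crossBorder, double notional) {
        double rate = vip ? 0.0005 : 0.001;   // decision #1
        double fee = notional * rate;
        if (crossBorder) {                    // decision #2
            fee += 25;                        // flat cross-border surcharge
        }
        return fee;
    }

    // One test per execution path.
    @Test public void vipDomestic()       { assertEquals(50.0,  fee(true,  false, 100_000), 0.001); }
    @Test public void vipCrossBorder()    { assertEquals(75.0,  fee(true,  true,  100_000), 0.001); }
    @Test public void retailDomestic()    { assertEquals(100.0, fee(false, false, 100_000), 0.001); }
    @Test public void retailCrossBorder() { assertEquals(125.0, fee(false, true,  100_000), 0.001); }
}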

His own test coverage is up to 80%. I saw during the demo that 80% means practically every if/else and exception situation, among other things.