dotnet remoting and related jargon

P4 [[.net 1.1 remoting, reflection and threading]] shows an insightful history leading to dotnet remoting —
#1) RPC (pre-OO).
The OO movement brought about the next generation in the form of distributed objects (aka distributed components) —
#2) CORBA, RMI (later ejb) and dcom, which emerged around the same time.
COM is mostly for in-process use, while dcom is distributed.
#3) soap and web services , which are OO-agnostic
I feel soap is more like RPC… The 2 distinct features of soap are xml and http. All its predecessors are based on binary protocols (efficient), and their “service component” is often not hosted in any server.
#4) dotnet remoting feels more like RMI to me… According to the book above, remoting can use either
1) http channel with the soap formatter, or
2) tcp channel  with the binary formatter

Therefore, I feel remoting is an umbrella technology with different implementations for different usage scenarios.

#5) WCF
Remoting vs wcf? See other post.

private bank trade/order/quote/execution flow

Remember — Most non-exchange traded products are voice executed. Only a few very dominant, high volume products are electronically executed. Perhaps 1% of the products account for 99% of the trades — by number of trades. By dollar amount, IRS alone is probably 50% of all the trades, and IRS is probably voice-executed, given the large notional amounts.

The products traded between the bank and its clients (not interbank) are often customized private offerings, unavailable from any other bank. (Remember the BofA puttable floats.)

RM / PWA would get live quotes from the dealer and give them to a client. Sometimes the dealer publishes quotes on an internal network, but RFQ is more common. At any time the quote could be executed between RM and client. The RM would then book the new position into the bank's database. As soon as it's executed (before the booking), the bank has a position, but the dealer knows the position only after the booking, and would hedge quickly.

The dealer initially only responds to the RFQ. The trade is usually executed without her knowledge, just like an ECN flow.

I think in BofA's wealth management platform, many non-equity products (muni bonds are largely sold to retail clients) trade in the same way. The dealer publishes quotes on an intranet website. The RM negotiates with the client and executes over the phone. During trade booking, the price and quantity would be validated. Occasionally (in a volatile market) a trade fails to go through and the RM must inform the client and retry, perhaps with a requote. Fundamentally, the dealer gets a last look, unlike the exchange flow.

I believe structured products (traded between bank and clients) are usually not fast-moving and volatile, so fewer requotes. However, when the dealer hedges the position, I think she often uses vanilla instruments.

Terminology warning — some places use “trade” to mean many things including orders. I think in exchange flow, “order” is a precise word.

credit e-trading – basics

2 “external” venues —
$ (ECN) interdealer electronic brokers — Bloomberg, Marketaxess, TradeWeb, BondDesk, NYSE, TMC. These are like the exchange-connectivity interface.
$ Retail-facing Distributors – Fidelity, Charles Schwab etc. These are often the retail-oriented portfolio/wealth managers. These portfolio managers are like the “client connectivity” interface.
* Each external connectivity above can have a customized FIX protocol.

Volume — 300 trades/day, 1000 RFQs (client inquiries)/day, but 4000 price updates/SECOND at peak! These are probably incoming quotes, which would update the internal cache within the bank. These “incoming” updates must be synchronized with other updates initiated from “within” the bank — probably the trickiest technical challenge in this platform.
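
One way to picture that synchronization is the minimal C++ sketch below. The class and field names are hypothetical, and the single global mutex plus sequence number is an assumption of mine (a real platform would shard locks or go lock-free), but it shows both update streams being serialized through one cache.

#include <mutex>
#include <string>
#include <unordered_map>

struct Quote { double bid = 0, ask = 0; long long seqNo = 0; };

class QuoteCache {
public:
    // Called by the feed handler for "incoming" external price updates.
    void applyExternal(const std::string& id, const Quote& q) {
        std::lock_guard<std::mutex> lk(mx_);
        Quote& cur = cache_[id];
        if (q.seqNo > cur.seqNo) cur = q;          // ignore stale updates
    }
    // Called by traders/pricers for updates initiated within the bank.
    void applyInternal(const std::string& id, double bid, double ask) {
        std::lock_guard<std::mutex> lk(mx_);
        Quote& cur = cache_[id];
        cur.bid = bid; cur.ask = ask; ++cur.seqNo;
    }
private:
    std::mutex mx_;
    std::unordered_map<std::string, Quote> cache_;
};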

Most trades are institutional (often from wealth management firms) but there are retail trades as small as $5000/trade.

The treasury work-up process also takes place on some of these ECNs.

Most important products — corporate bonds, CDS, CDX. 

(Based on a major credit trading sell-side.)

simplified order book design doc – jump

It’s tempting to use a virtual function processMessage() to process the various order types (A, M, X …) and trade types, but virtual functions add runtime overhead. Template specialization is a more efficient design, but due to the limited timeframe I implemented an efficient and simple alternative.

Assumption: M messages can only change quantity. A price change is not allowed — the sender would need to cancel and submit a new order. The B/S and price fields of the order should not change, but validation is omitted in this version.

Assumption: T messages and the corresponding M and X messages (and the initiating A message) are assumed consistent and complete. Validation is technically possible but omitted; a validation failure would indicate lost messages.

The cornerstone of the design is the data structure of the order book — an RB-tree of linked lists. Add is O(logN) due to the tree insert. Modify is O(1) thanks to the lookup array. Remove is O(1) — no tree search. This is achieved with the lookup array, and by saving the list iterator in the order object.
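
Here is a minimal sketch of that structure for one side of the book (names and types are illustrative, not the original code). The saved level pointer and list iterator are what make remove O(1); empty price levels are simply left in the map here.

#include <cstddef>
#include <iterator>
#include <list>
#include <map>
#include <vector>

struct Order {
    long   id;                                    // assumed already offset from orderIDBase
    double price;
    int    qty;
    std::list<Order*>*          level = nullptr;  // price level this order lives in
    std::list<Order*>::iterator pos;              // saved iterator -> O(1) erase
};

struct BookSide {
    std::map<double, std::list<Order*>> levels;   // RB-tree keyed by price
    std::vector<Order*> lookup;                   // index = orderID - orderIDBase

    void add(Order* o) {                          // O(logN): tree insert/find
        auto& lvl = levels[o->price];
        lvl.push_back(o);
        o->level = &lvl;
        o->pos   = std::prev(lvl.end());
        if (lookup.size() <= static_cast<std::size_t>(o->id))
            lookup.resize(o->id + 1, nullptr);
        lookup[o->id] = o;
    }
    void modifyQty(long id, int newQty) { lookup[id]->qty = newQty; }   // O(1)
    void remove(long id) {                        // O(1): no tree search
        Order* o = lookup[id];
        o->level->erase(o->pos);                  // empty levels stay in the map
        lookup[id] = nullptr;
    }
};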

There are 2 containers of pointers — the map of lists and the lookup array. It would be better to use containers of smart pointers to ease memory management, but the STL (pre-C++11) doesn’t provide a smart pointer that can safely be stored in containers.

All equality tests on doubles are done using “==”. They should use some kind of tolerance if time permits.
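
A tolerance-based comparison could look like the sketch below; the epsilon is arbitrary and would have to reflect the instrument's tick size.

#include <cmath>

// Compare two prices within a tolerance instead of exact "==".
inline bool priceEqual(double a, double b, double eps = 1e-9) {
    return std::fabs(a - b) < eps;
}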

Here’s the documentation in the lookup array class

/*This class encapsulates an array of pointers.
 Assumption 1 — an exchange is likely to generate auto-incrementing orderIDs. Here’s my reasoning. OrderIDs are unique, as stated in the question. If orderID generation isn’t continuous, then the generator has 2 choices about the inevitable gap between 2 adjacent ID numbers. It can leave the gap forever wasted, or somehow “go back” into a gap and pick a number therein as a new orderID. To do the latter it must keep track of which numbers are already assigned — rather inefficient. There are proven in-memory algorithms to generate auto-increment identity numbers, and I assume an exchange would use them. Auto-increment numbers make a good candidate for an array index, but what about the total number range?

 Assumption 2 — each day the number range has an upper limit. The exchange must agree with its members on the format of the orderID. It’s likely to be 32 bits, 64 bits etc and won’t be a million bits.

 Question 1: Can the exchange use OrderID 98761234 on both IBM and MSFT during a trading day? I don’t know, and I feel it doesn’t matter. Here’s the reason.

 Case 1: suppose exchange uses an *independent* auto-increment generator for each stock. So IBM and MSFT generators can both generate 98761234. My design would use one array for IBM and one array for MSFT. For basket orders, additional generator instances might be needed.

 Case 2: suppose the exchange uses an independent auto-increment generator for each stock, but each stock uses a non-overlapping number range. 98761234 would fall into the IBM number range. My design would need to know the number range so as to convert orderID to array index and conserve memory.

 Case 3: suppose the exchange uses a singleton auto-increment generator across all stocks (a bottleneck inside the exchange). My design would use one gigantic array. Given Assumption 1, the assigned numbers would be quasi-continuous rather than sparse (where below 50% of the range would be assigned). Suppose the range is S+1, S+2 … S+N; then my array would be allocated N elements (where S is orderIDBase). There’s a limit on N in reality. Every system is designed for a finite N — no system can handle 10^9999 (that’s a one followed by 9999 zeros) orders in a day. Each array element is a pointer. On a 64-bit machine, N elements take 64N bits or 8N bytes. If I have 640GB memory, N can be 80 billion but not higher. To scale out horizontally, we would hope Case 1 or 2 holds.

 Therefore, whatever the answer to Question 1, an array of pointers is feasible for the purpose of lookup by orderID. In a real system a hash table is likely to be more time/space efficient. In this exercise only the STL is available, which (pre-C++11) provides no hash table, and a tree-based map has O(logN) lookup — too slow. My choice is between a built-in array and a non-expanding vector. I chose the array for simplicity.
 */
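
To make the comment concrete, here is a sketch of such a lookup array: a plain built-in array of pointers indexed by orderID minus orderIDBase. The names, and the assumption that N and the base are known up front for the day, are mine rather than the original code.

#include <cstddef>

struct Order;   // defined elsewhere in the book

class LookupArray {
public:
    LookupArray(long orderIDBase, std::size_t n)
        : base_(orderIDBase), slots_(new Order*[n]()) {}   // zero-initialized pointers
    ~LookupArray() { delete[] slots_; }
    LookupArray(const LookupArray&) = delete;              // no copying of the raw buffer
    LookupArray& operator=(const LookupArray&) = delete;

    void   put(long orderID, Order* o) { slots_[orderID - base_] = o; }   // bounds checks omitted
    Order* get(long orderID) const     { return slots_[orderID - base_]; }
    void   erase(long orderID)         { slots_[orderID - base_] = nullptr; }

private:
    long    base_;
    Order** slots_;
};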

##bottlenecks in a high performance data "flow" #abinitio

#1 probably the most common — the database, both read and write operations. Therefore, ETL solutions achieve superior throughput by taking data processing out of the database; ETL uses the DB mostly as dumb storage.
* write – if a database data “sink” is too slow, then the entire pipe is limited by its throughput, just like a sewage pipe.
** relevant in high frequency trading, where every execution must be recorded
* read – if you must query a DB to enrich or look up something, this read can be much slower than other parts of the pipe.

#2 (similarly) flat files. Writes tend to be faster than database writes. (Reads are a completely different story.)
** used in high frequency trading
** used in high volume market data storage — Sigma2 for example
So flat file writing is important in industry.

#? Web service

#? The above are IO-bound. In contrast, CPU-bound, compute-intensive transforms can (and do) also become bottlenecks.

high-level design tips – how many unique words in a huge file

Q: Design a system to calculate the number of unique words in a file
1) What if the file is huge? (i.e. cannot fit in the main memory)
2) Assuming that you have more than one computer available, how can you distribute the problem?

Constraints are the key to such an optimization. Let’s make it more realistic but hopefully without loss of generality. Say the file is 2TB of ascii, purely alphabetical words of any language in a unified alphabet, with a natural distribution such as text from world newspapers. Word length is typically below 20.

I’d assume a regular 100GB network with dedicated sockets between machines. The machines have roughly equal memory, and the combined memory is enough to hold the file.

I’d minimize disk and network access since these are slower than memory access and require serialization.

Q: is the network transfer such a bottleneck that I’m better off processing the entire file on one machine?

— one-machine solution —
Assume my memory (2GB) can only hold 1% of the unique words. I’d select only those words “below” ad* — i.e. aa*, ab*, ac* only. Save the unique words to a temp file, then rescan the input file looking for ad*, ae*…ak* to produce a 2nd temp file… Finally combine the temp files.
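
A sketch of one pass of this idea is below, assuming lowercase alphabetical words; the file name and range boundaries are illustrative, and single-letter words and error handling are ignored for brevity.

#include <fstream>
#include <iostream>
#include <string>
#include <unordered_set>

// Count unique words whose 2-letter prefix falls in [lo, hi), e.g. ("aa", "ad").
long long countRange(const std::string& path, const std::string& lo, const std::string& hi) {
    std::ifstream in(path);
    std::unordered_set<std::string> seen;    // stays within the memory budget for one range
    std::string w;
    while (in >> w) {
        std::string p = w.substr(0, 2);
        if (p >= lo && p < hi) seen.insert(w);
    }
    return static_cast<long long>(seen.size());   // or dump 'seen' to a temp file per pass
}

int main() {
    long long total = 0;
    total += countRange("huge.txt", "aa", "ad");
    total += countRange("huge.txt", "ad", "ak");
    // ... more passes covering the rest of the alphabet ...
    std::cout << total << '\n';
}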

— multi-machine solution —
Don’t bother having one machine scan the file and TCP the words to other machines. Just copy the entire input file by CD or file transfer to each machine. Each machine would ignore words outside its target range.

How do we divide the task? Say we have 50 machines. We don’t know the exact distribution, so if we assume aa–ak won’t have too many unique words to fit into one machine (2GB), that assumption might be wrong. Instead, we’d divide the entire universe into 50 * 10 ranges. Even if we are underestimating, each range should still fit into one machine. Every time a machine finishes one range, it sends a tiny signal to a controller and waits for the controller to give it the next range.
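
A sketch of the controller side of that dispatch, with the networking omitted (in reality the “tiny signal” would arrive over a socket; the names here are hypothetical):

#include <mutex>
#include <optional>
#include <queue>
#include <string>
#include <utility>

using Range = std::pair<std::string, std::string>;   // [lo, hi) prefix range

class Controller {
public:
    explicit Controller(std::queue<Range> ranges) : pending_(std::move(ranges)) {}

    // Called whenever a worker machine reports it has finished its current range.
    std::optional<Range> nextRange() {
        std::lock_guard<std::mutex> lk(mx_);
        if (pending_.empty()) return std::nullopt;   // nothing left: worker can shut down
        Range r = pending_.front();
        pending_.pop();
        return r;
    }
private:
    std::mutex mx_;
    std::queue<Range> pending_;                      // the 50 * 10 prefix ranges
};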

— hashing on words —
The hash table should be sized to minimize rehashing. We need a superfast hashCode and compression. The hashCode should use all the characters, perhaps except the first, since it tends to be the same within a range.
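
For example, a cheap hash along those lines might look like this sketch (the multiplier and the choice to skip only the first character are arbitrary):

#include <cstddef>
#include <string>

// Rolling multiply-add over all characters except the first,
// which is nearly constant within a prefix range.
inline std::size_t wordHash(const std::string& w) {
    std::size_t h = 0;
    for (std::size_t i = 1; i < w.size(); ++i)
        h = h * 31 + static_cast<unsigned char>(w[i]);
    return h;                                    // compress with h % tableSize
}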

max-throughput live quote distribution: 6 designs (CAS, socket)

Update — the fastest would require a single-threaded model with no shared mutable state

Suppose a live feed of market quotes pumps in messages at the max speed of the network (up to 100GB/sec). We have thousands (perhaps 5,000) of (hedge fund etc) clients, each with some number (not sure how large) of subscriptions to these quotes. Each subscription sets up a filter that may look like some combination of “Symbol = IBM”, “bid/ask spread < 0.2…”, or “size at the best bid price….”. All the filters only reference fields of the quote object such as symbol, size and price. We need the fastest distribution system. The bottleneck should be the network, not our application.

–memory allocation and copying–
If an IBM /quote/ matches 300 filters, then we need to send it to 300 destinations, therefore copying it 300 times, but not allocating 300 times within the JVM. We want to minimize allocation within the JVM. I believe the standard practice is to send just one copy as a message and let the receiver (a different machine) forward it to those 300 hedge funds. Non-certified RV is probably efficient, but unicast JMS is fine too.

–socket reader thread latency–
Given the messaging rate, the socket reader thread should be as lean as possible. I suggest it should blindly drop each msg into a buffer, without looking at it. Asynchronously, consumer threads can apply the filters and distribute the quotes.

A fast wire format is fixed-width. The socket reader takes 500 bytes, assumes it’s one complete quote object, and blindly drops this 500-byte array into the buffer.
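
A sketch of that fixed-width framing, assuming POSIX sockets and a hypothetical 500-byte quote record (the post itself discusses a JVM; C++ is used here only because the other sketches in this collection are in C++):

#include <sys/types.h>
#include <sys/socket.h>
#include <cstddef>

constexpr std::size_t QUOTE_WIDTH = 500;

// Returns false on EOF/error; 'slot' points at a pre-allocated 500-byte block.
bool readOneQuote(int fd, char* slot) {
    std::size_t got = 0;
    while (got < QUOTE_WIDTH) {
        ssize_t n = recv(fd, slot + got, QUOTE_WIDTH - got, 0);
        if (n <= 0) return false;            // connection closed or error
        got += static_cast<std::size_t>(n);
    }
    return true;                             // caller drops 'slot' into the buffer unparsed
}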

–cpu dedication–
Each thread is busy and important enough to deserve a dedicated CPU. That CPU is never given to another thread.
————-
Now let me introduce my design. One thread per filter. The buffer is a circular array — bounded but with efficient pre-allocation. Pre-allocation requires fixed-sized nodes, probably byte arrays of 500 each. I believe de-allocation is free — recycling. Another friend (csdoctor) suggested an unbounded linked list of arrays instead. Total buffer capacity should exceed any *temporary* queue build-up. The slowest consumer thread must be faster than the producer, though momentarily the reverse could happen.
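
Here is a minimal sketch of such a bounded circular buffer. The capacity, the memory-ordering choices and the recycle-by-flag scheme are my assumptions, not a definitive design.

#include <atomic>
#include <cstddef>
#include <cstring>
#include <vector>

constexpr std::size_t SLOT_SIZE = 500;

struct Slot {
    std::atomic<bool> ready{false};   // set by the producer, cleared when the slot is recycled
    char bytes[SLOT_SIZE];
};

class RingBuffer {
public:
    explicit RingBuffer(std::size_t capacity) : slots_(capacity) {}   // pre-allocated once

    // Producer (socket reader): copy raw bytes into the next slot and publish it.
    bool push(const char* raw) {
        Slot& s = slots_[writeIdx_ % slots_.size()];
        if (s.ready.load(std::memory_order_acquire))
            return false;                              // full: consumers have fallen behind
        std::memcpy(s.bytes, raw, SLOT_SIZE);
        s.ready.store(true, std::memory_order_release);
        ++writeIdx_;
        return true;
    }

    Slot& at(std::size_t i) { return slots_[i % slots_.size()]; }

private:
    std::vector<Slot> slots_;    // slots are recycled, never freed
    std::size_t writeIdx_ = 0;
};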

—-garbage collection—-
Note the JVM GC can’t free the memory in our buffer.

–Design 3–
Allocate a counter in each quote object. Each filter applied will decrement the counter; the thread that hits zero will free it. But this incurs an allocation cost for that counter.
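
A sketch of that countdown, using a C++ atomic for illustration even though the post talks about a JVM (names and the recycle hook are hypothetical):

#include <atomic>

struct CountedQuote {
    char bytes[500];
    std::atomic<int> remaining;            // initialized to the number of filters
};

inline void onFilterDone(CountedQuote& q /*, RingBuffer& buf */) {
    if (q.remaining.fetch_sub(1, std::memory_order_acq_rel) == 1) {
        // last filter to finish: mark the slot reusable instead of freeing it
        // buf.recycle(&q);
    }
}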

–Design 6–
Each filter thread records in a global variable its current position within the queue; as it advances through the queue it increments its global variable. This design is based on the observation that, given the dedicated CPUs, the slowest thread remains the slowest of the wolfpack, so this designated thread can free the memory after applying its filter.

However, it’s possible for 2 filters to be equally slow.

–Design 8– We can introduce a sweeper thread that periodically wakes up to sequentially free all allocations that have been visited by all filters.
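
A sketch of such a sweeper: it recycles everything below the minimum of the per-thread positions. The atomics, names and the commented-out recycle call are illustrative only.

#include <algorithm>
#include <atomic>
#include <cstddef>
#include <limits>
#include <vector>

struct Sweeper {
    std::vector<std::atomic<std::size_t>> positions;   // one published position per filter thread
    std::size_t freedUpTo = 0;

    explicit Sweeper(std::size_t nThreads) : positions(nThreads) {}

    void sweep(/* RingBuffer& buf */) {
        std::size_t minPos = std::numeric_limits<std::size_t>::max();
        for (auto& p : positions)
            minPos = std::min(minPos, p.load(std::memory_order_acquire));
        for (; freedUpTo < minPos; ++freedUpTo) {
            // buf.at(freedUpTo) has been seen by every filter -> mark the slot reusable
        }
    }
};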

–Design 9– One thread applies all the filters for a given HF client. This works if the filters are few and simple.

–Design A (CAS)– Create any # of “identical” consumer threads. At any time we can expand this thread pool.
while(true){
  1) read BigArrayBuffer[++MyThreadPtr] into this thread’s register and examine the fields, without converting to a Quote instance.
  2) examine the Taken boolean flag. If already set, then simply “continue” the loop. This step might be needed if CAS is costly.
  3) CAS to set this flag.
  4a) if successful, apply ALL filters on the quote. Then somehow free up the memory (without the GC). Perhaps set another boolean flag to indicate this fixed-length block is now reusable storage.
  4b) else just “continue”, since another thread will process and free it.
}
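
Below is a C++ rendering of this loop (the post assumes a JVM, but std::atomic shows the same CAS idea; the names, memory-ordering choices, and the omitted "slot actually filled" check are mine):

#include <atomic>
#include <cstddef>

struct CasSlot {
    char bytes[500];
    std::atomic<bool> taken{false};      // claimed by exactly one consumer thread
    std::atomic<bool> reusable{false};   // producer may overwrite once this is set
};

// One of N identical consumer threads. Checking that the producer has actually
// filled the slot (and resetting the flags on recycle) is omitted for brevity.
void consumerLoop(CasSlot* ring, std::size_t capacity, std::atomic<bool>& stop) {
    std::size_t myPtr = 0;
    while (!stop.load(std::memory_order_relaxed)) {
        CasSlot& s = ring[myPtr++ % capacity];
        if (s.taken.load(std::memory_order_acquire))
            continue;                                  // cheap pre-check before the CAS
        bool expected = false;
        if (!s.taken.compare_exchange_strong(expected, true,
                                             std::memory_order_acq_rel))
            continue;                                  // another thread won the CAS
        // won the CAS: apply ALL filters on s.bytes and dispatch any matches ...
        s.reusable.store(true, std::memory_order_release);   // "free" by recycling, no GC
    }
}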

hide client names and address

I proposed a system to a buy-side asset manager shop. I said client names don't need to be stored in the central database. Maybe the sales force and investment advisors need the names, but they don't need to save them in a shared central database for everyone else to see.

Each client is identified by account id, which might include an initial.

When a client logs in to the client-facing website, they will not see their name but rather some relatively public information such as their self-chosen nickname, investment objectives, account balance, and last login time.

A client's postal address is needed only for those who opt for paper statements, and only one system needs to access it — the statement printing shop.

A veteran of a similar system told me this is feasible and proposed an enhancement — encrypt sensitive client information.
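
To make the proposal concrete, the central record might look something like the sketch below. The field names and the encrypted-blob idea are illustrative, not a real schema.

#include <string>
#include <vector>

struct CentralClientRecord {
    std::string accountId;        // e.g. includes an initial
    std::string nickname;         // self-chosen, shown at login
    std::string objectives;       // investment objectives
    double      balance = 0.0;
    std::string lastLogin;        // timestamp of last login
    std::vector<unsigned char> encryptedPII;   // optional: name/address, decryptable only
                                               // by systems such as statement printing
};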

What are some of the inconveniences in practice?