As of 2017, I see evidence that both Morgan Stanley and Millennium have such a system.
In each case there’s a data buffer holding the “details” of the pending operation.
async – DB query
async – disk IO
async – http request
async – unix signal processing. Signal is “stored” in the kernel and the callback is invoked at a later time.
async – UDP read and write?
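A minimal sketch of the idea, using nothing beyond the JDK: the executor's work queue plays the role of the data buffer holding the details of the pending operation, and the result is consumed later, on a different call stack.

```java
import java.util.concurrent.*;

// Sketch only: the work queue below is the "data buffer"; the slow call's
// details sit in it until a worker thread picks them up, and the caller
// returns long before the result exists.
public class AsyncSketch {
    public static void main(String[] args) throws Exception {
        BlockingQueue<Runnable> buffer = new LinkedBlockingQueue<>();      // the buffer
        ExecutorService worker = new ThreadPoolExecutor(1, 1, 0L,
                TimeUnit.MILLISECONDS, buffer);

        Future<String> pending = worker.submit(() -> {
            Thread.sleep(100);          // stand-in for a slow DB / disk / http call
            return "result row";
        });

        System.out.println("caller returned before the responder got the message");
        System.out.println("later: " + pending.get());   // consumed on another call stack
        worker.shutdown();
    }
}
```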
The OO movement brought about the next generation in the form of distributed objects (aka distributed components).
Therefore, I feel remoting is an umbrella technology with different implementations for different usage scenarios.
Remoting vs wcf? See other post.
#1 probably the most common: the database, for both read and write operations. Therefore, ETL solutions achieve superior throughput by taking data processing out of the database; ETL uses the DB mostly as dumb storage.
* write – if the database “sink” is too slow, then the entire pipe is limited by its throughput, just like sewage.
** relevant in high frequency trading, where every execution must be recorded
* read – if you must query a DB to enrich or look up something, this read can be much slower than other parts of the pipe.
#2 (similarly) flat files. Write tends to be faster than database write. (Read is a completely different story.)
** used in high frequency trading
** used in high volume market data storage — Sigma2 for example
So flat file writing is important in industry.
#? Web service
#? The above are IO-bound. In contrast, CPU-bound, compute-intensive transforms can (and do) also become bottlenecks.
Q: Design a system to calculate the number of unique words in a file
1) What if the file is huge? (i.e. cannot fit in the main memory)
2) Assuming that you have more than one computer available, how can you distribute the problem?
Constraints are the key to such an optimization. Let’s make it more realistic but hopefully without loss of generality. Say the file is 2TB of ASCII, purely alphabetical words of any language in a unified alphabet, with a natural distribution such as text from world newspapers. Word length is typically below 20.
I’d assume a regular 100GB network with dedicated sockets between machines. The machines have roughly equal memory, and the combined memory is enough to hold the file.
I’d minimize disk and network access since these are slower than memory access and require serialization.
Q: is the network transfer such a bottleneck that I’m better off processing the entire file on one machine?
— one-machine solution —
Suppose my memory (2GB) can hold only 1% of the unique words. I’d first keep only the words “below” ad*, i.e. aa*, ab*, ac* only, and save those unique words to a temp file. Then I’d rescan the input file looking for ad*, ae*…ak* to produce a 2nd temp file… Finally, combine the temp files.
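A hedged sketch of that multi-pass idea; the file name, the 2GB assumption and the exact range boundaries are mine. Each pass over the input keeps in memory only the unique words whose prefix falls in the current range, so the set never outgrows memory.

```java
import java.io.*;
import java.nio.file.*;
import java.util.*;

// Sketch: count unique words range by range, rescanning the file once per range.
public class UniqueWordsByRange {
    public static void main(String[] args) throws IOException {
        Path input = Paths.get("input.txt");            // hypothetical (tiny) input file
        String[][] ranges = { {"aa", "ad"}, {"ad", "al"}, {"al", "zz~"} };  // [from, to)
        long total = 0;
        for (String[] r : ranges) {
            Set<String> unique = new HashSet<>();
            try (BufferedReader br = Files.newBufferedReader(input)) {
                String line;
                while ((line = br.readLine()) != null) {
                    for (String w : line.toLowerCase().split("[^a-z]+")) {
                        if (w.isEmpty()) continue;
                        // keep the word only if it belongs to the current prefix range
                        if (w.compareTo(r[0]) >= 0 && w.compareTo(r[1]) < 0) unique.add(w);
                    }
                }
            }
            total += unique.size();     // or spill the set to a temp file, as described above
        }
        System.out.println("unique words: " + total);
    }
}
```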
— multi-machine solution —
Don’t bother having one machine scan the file and stream the words over TCP to the other machines. Just copy the entire input file, by CD or file transfer, to each machine. Each machine would ignore words outside its target range.
How do we divide the task? Say we have 50 machines. We don’t know the exact distribution, so if we assume the aa-ak range has few enough unique words to fit into one machine (2GB), that assumption might be wrong. Instead, we’d divide the entire universe into 50 * 10 ranges; even if we are underestimating, each individual range should still fit into one machine. Every time a machine finishes one range, it sends a tiny signal to a controller and waits for the controller to give it the next range.
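A toy sketch of that controller protocol, with worker threads standing in for machines and two-letter prefixes standing in for the 50 * 10 ranges (all names are my own):

```java
import java.util.concurrent.*;

// Sketch: the controller holds the ranges in a queue; each worker asks for the
// next range as soon as it finishes the previous one.
public class RangeController {
    public static void main(String[] args) throws Exception {
        BlockingQueue<String> ranges = new LinkedBlockingQueue<>();
        for (char a = 'a'; a <= 'z'; a++)
            for (char b = 'a'; b <= 'z'; b++)
                ranges.add("" + a + b);                     // ~676 two-letter prefix ranges

        ExecutorService workers = Executors.newFixedThreadPool(50);   // "50 machines"
        for (int i = 0; i < 50; i++) {
            workers.submit(() -> {
                String range;
                while ((range = ranges.poll()) != null) {   // "give me the next range"
                    // ... scan the local copy of the input file, counting unique
                    //     words that start with this prefix ...
                }
            });
        }
        workers.shutdown();
        workers.awaitTermination(1, TimeUnit.HOURS);
    }
}
```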
— hashing on words —
The hash table should be sized to minimize rehashing. We need a superfast hashCode and compression (hash-to-bucket mapping). The hashCode should use all the characters, perhaps except the first, since it tends to be the same within a range.
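A minimal sketch of that hashCode idea; the 31 multiplier and the power-of-two table size are just my conventional choices:

```java
// Sketch: within a range like ad*, the leading letter carries no information,
// so the hash skips it; the final mask "compresses" the code to a bucket index.
public class WordHash {
    static int hash(String word) {
        int h = 0;
        for (int i = 1; i < word.length(); i++)     // start at 1: skip the first char
            h = 31 * h + word.charAt(i);
        return h;
    }

    static int bucket(String word, int tableSize) { // tableSize must be a power of 2
        return hash(word) & (tableSize - 1);        // the cheap compression step
    }

    public static void main(String[] args) {
        System.out.println(bucket("adventure", 1 << 20));
    }
}
```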
Update: the fastest design would require a single-threaded model with no shared mutable state.
Suppose a live feed of market quotes pumps in messages at the max speed of the network (up to 100GB/sec). We have thousands (say 5,000) of clients (hedge funds etc.), each with some number (not sure how large) of subscriptions to these quotes. Each subscription sets up a filter that may look like some combination of “Symbol = IBM”, “bid/ask spread < 0.2…”, or “size at the best bid price….”. All the filters reference only fields of the quote object such as symbol, size and price. We need the fastest distribution system; the bottleneck should be the network, not our application.
–memory allocation and copying–
If an IBM /quote/ matches 300 filters, then we need to send it to 300 destinations, copying it 300 times, but not making 300 allocations within the JVM. We want to minimize allocation within the JVM. I believe the standard practice is to send just one copy as a message and let the receiver (a different machine) forward it to those 300 hedge funds. Non-certified RV is probably efficient, but unicast JMS is fine too.
–socket reader thread latency–
Given the messaging rate, the socket reader thread should be as lean as possible. I suggest it blindly drop each msg into a buffer, without looking at it. Asynchronously, consumer threads can apply the filters and distribute the quotes.
A fast wire format is fixed-width. The socket reader takes 500 bytes, assumes it is one complete quote object, and blindly drops this 500-byte array into the buffer.
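Roughly like this; the feed host, port and queue capacity are assumptions, and only the 500-byte fixed-width frame comes from the text:

```java
import java.io.DataInputStream;
import java.net.Socket;
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;

// Sketch of the lean socket reader: read exactly one fixed-width quote,
// do not parse it, and hand the raw bytes straight to the buffer.
public class QuoteReader {
    static final int FRAME = 500;
    static final BlockingQueue<byte[]> buffer = new ArrayBlockingQueue<>(1 << 16);

    public static void main(String[] args) throws Exception {
        try (Socket s = new Socket("feed.example.com", 9000);   // hypothetical feed
             DataInputStream in = new DataInputStream(s.getInputStream())) {
            while (true) {
                byte[] frame = new byte[FRAME];   // (the ring-buffer sketch below avoids this allocation)
                in.readFully(frame);              // exactly one complete quote, fixed width
                buffer.put(frame);                // blindly drop it in; no parsing here
            }
        }
    }
}
```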
Each thread is busy and important enough to deserve a dedicated cpu. That CPU is never given to another thread.
Now let me introduce my design. One thread per filter. The buffer is a circular array: bounded, but with efficient pre-allocation. Pre-allocation requires fixed-size nodes, probably byte arrays of 500 bytes each. I believe de-allocation is free (recycling). Another friend (csdoctor) suggested an unbounded linked list of arrays. Total buffer capacity should exceed the *temporary* queue build-up. The slowest consumer thread must be faster than the producer, though momentarily the reverse could happen.
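A minimal sketch of that circular array, assuming a single producer; the capacity and names are mine. All 500-byte slots are allocated once up front, so "freeing" a slot just marks it reusable and the GC never runs on the hot path.

```java
// Sketch: bounded ring buffer of pre-allocated fixed-size slots.
public class RingBuffer {
    static final int SLOT = 500;
    final byte[][] slots;
    final boolean[] inUse;
    int writePos = 0;

    RingBuffer(int capacity) {
        slots = new byte[capacity][SLOT];   // one-time pre-allocation
        inUse = new boolean[capacity];
    }

    /** Producer: copy one raw quote into the next slot, if it is free. */
    boolean offer(byte[] quote) {
        if (inUse[writePos]) return false;  // buffer full: consumers have fallen behind
        System.arraycopy(quote, 0, slots[writePos], 0, SLOT);
        inUse[writePos] = true;             // publish (real code needs a memory barrier here)
        writePos = (writePos + 1) % slots.length;
        return true;
    }

    /** Consumer side: de-allocation is free, it is just recycling. */
    void release(int i) { inUse[i] = false; }
}
```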
Note the JVM GC can’t free the memory in our buffer.
Allocate a counter in each quote object. Each filter applied will decrement the counter. The thread that hits zero will free it. But this incurs allocation cost for that counter.
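In code, that counter design could look like this sketch (names are mine):

```java
import java.util.concurrent.atomic.AtomicInteger;

// Sketch: each slot carries a counter initialised to the number of filters;
// every filter thread decrements it, and whichever thread reaches zero
// recycles the slot. The per-slot counter is the extra cost.
public class CountedSlot {
    interface RingBufferLike { void release(int slotIndex); }

    final byte[] data = new byte[500];
    final AtomicInteger remaining = new AtomicInteger();

    void publish(int numFilters) { remaining.set(numFilters); }

    /** Called by each filter thread once it has applied its filter to this slot. */
    void doneWithFilter(RingBufferLike owner, int slotIndex) {
        if (remaining.decrementAndGet() == 0) {
            owner.release(slotIndex);       // the thread that hits zero frees (recycles) it
        }
    }
}
```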
Each filter thread records in a global var its current position within the queue. Each filter thread advances through the queue and increments its global var. One design is based on the observation that, given the dedicated CPUs, the slowest thread is always the slowest in the wolfpack. This designated thread would free the memory after applying its filter.
However, it’s possible for 2 filters to be equally slow.
–design 8– We can introduce a sweeper thread that periodically wakes up to sequentially free all allocations that have been visited by all filters.
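A sketch of that sweeper, assuming each filter thread publishes its current queue position (the field names are mine): everything below the minimum of those positions has been visited by every filter and can be recycled.

```java
import java.util.concurrent.atomic.AtomicLongArray;

// Sketch: periodically free every slot that all filter threads have passed.
public class Sweeper implements Runnable {
    final AtomicLongArray filterPositions;   // one published position per filter thread
    long freedUpTo = 0;                      // everything below this is already recycled

    Sweeper(int numFilters) { filterPositions = new AtomicLongArray(numFilters); }

    public void run() {
        while (!Thread.currentThread().isInterrupted()) {
            long min = Long.MAX_VALUE;
            for (int i = 0; i < filterPositions.length(); i++)
                min = Math.min(min, filterPositions.get(i));
            for (long p = freedUpTo; p < min; p++) {
                // ... recycle slot p of the circular buffer here ...
            }
            freedUpTo = Math.max(freedUpTo, min);
            try { Thread.sleep(1); } catch (InterruptedException e) { return; }
        }
    }
}
```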
–Design 9– One thread applies all filters for a given HF client. This works if the filters are few and simple.
–Design A (CAS)– Create any number of “identical” consumer threads. We can expand this thread pool at any time. (A sketch follows the numbered steps below.)
1) read BigArrayBuffer[++MyThreadPtr] into this thread’s register and examine the fields, without converting to a Quote instance.
2) examine the Taken boolean flag. If already set, then simply “continue” the loop. This step might be needed if CAS is costly.
3) CAS to set this flag
4a) if successful, apply ALL filters on the quote. Then somehow free up the memory (without the GC). Perhaps set another boolean flag to indicate this fixed-length block is now reusable storage.
4b) else just “continue” since another thread will process and free it.
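A sketch of Design A with my own names and sizes: identical consumer threads walk the big pre-allocated array, a CAS on the per-slot "taken" flag decides ownership, and the owner applies all filters and then marks the block reusable, with no GC involved.

```java
import java.util.concurrent.atomic.AtomicIntegerArray;

// Sketch only: steps 1-4b from above, minus the actual filter logic.
public class CasConsumers {
    static final int CAPACITY = 1 << 16;
    static final byte[][] bigArrayBuffer = new byte[CAPACITY][500];
    static final AtomicIntegerArray taken = new AtomicIntegerArray(CAPACITY);     // steps 2/3
    static final AtomicIntegerArray reusable = new AtomicIntegerArray(CAPACITY);  // step 4a

    static class Consumer implements Runnable {
        int myThreadPtr = -1;
        public void run() {
            while (true) {
                myThreadPtr = (myThreadPtr + 1) % CAPACITY;  // wrap around the big array
                int i = myThreadPtr;
                if (taken.get(i) == 1) continue;         // step 2: cheap check before the CAS
                if (taken.compareAndSet(i, 0, 1)) {      // step 3: this thread won the slot
                    byte[] quote = bigArrayBuffer[i];    // step 1: raw bytes, no Quote instance
                    // ... apply ALL filters to `quote` here ...
                    reusable.set(i, 1);                  // step 4a: block is reusable storage
                }                                        // step 4b: else another thread has it
            }
        }
    }

    public static void main(String[] args) {
        for (int t = 0; t < 4; t++) new Thread(new Consumer()).start();  // expandable pool
    }
}
```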
Any time I see asynchronous (Swing, MOM etc), I see additional complexity. Synchronous is simpler. Synchronous means blocking, and requires no object besides the caller actor and the service actor. The call is confined to a single call stack.
In contrast, async involves 2 call stacks and requires a 3rd object in the form of a buffer. Async means the caller can return before the responder even gets the message. In that /limbo/, the message must be kept in the buffer. If the responder were a doctor, then she might be “not accepting new patients”.
Producer/consumer pattern … (details omitted)
Buffer has capacity and can overflow.
Buffer is usually shared by different producer threads.
Buffer can resend.
Buffer can send the messages out of order.
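A minimal producer/consumer sketch against a bounded buffer; the queue size and message format are arbitrary choices of mine:

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;

// Sketch: the bounded queue is the buffer described above. It has a capacity
// (and would overflow if we used offer() instead of put()), and it is shared
// by more than one producer thread.
public class ProducerConsumer {
    public static void main(String[] args) {
        BlockingQueue<String> buffer = new ArrayBlockingQueue<>(1024);  // bounded capacity

        Runnable producer = () -> {
            try {
                for (int i = 0; ; i++) buffer.put("msg-" + i);   // blocks when the buffer is full
            } catch (InterruptedException ignored) { }
        };
        Runnable consumer = () -> {
            try {
                while (true) System.out.println(buffer.take()); // blocks when the buffer is empty
            } catch (InterruptedException ignored) { }
        };

        new Thread(producer).start();
        new Thread(producer).start();   // the buffer is shared by multiple producers
        new Thread(consumer).start();
    }
}
```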
I guess the Swing event object must be kept not just on the 2 call stacks, but also on the event queue (the buffer).