As of 2017, I see evidence that both Morgan Stanley and Millennium have such a system.
In each case there’s a data buffer holding the “details” of the pending operation.
async – DB query
async – disk IO
async – http request
async – unix signal processing. Signal is “stored” in the kernel and the callback is invoked at a later time.
async – UDP read and write?
The OO movement brought about the next generation in the form of distributed objects (aka distributed components).
Therefore, I feel remoting is an umbrella technology with different implementations for different usage scenarios.
Remoting vs wcf? See other post.
#1 probably the most common — database, for both read and write operations. This is why ETL solutions achieve superior throughput by taking data processing out of the database; ETL uses the DB mostly as dumb storage.
* write – if a database “sink” is too slow, then the entire pipe is limited by its throughput, just like sewage.
** relevant in high frequency trading, where every execution must be recorded
* read – if you must query a DB to enrich or look up something, this read can be much slower than other parts of the pipe.
#2 (similarly) flat files. Write tends to be faster than database write. (Read is a completely different story.)
** used in high frequency trading
** used in high volume market data storage — Sigma2 for example
So flat file writing is important in industry.
#? Web service
#? The above are IO-bound. In contrast, CPU-bound, compute-intensive transforms can (and do) also become bottlenecks.
Q: Design a system to calculate the number of unique words in a file
1) What if the file is huge? (i.e. cannot fit in the main memory)
2) Assuming that you have more than one computer available, how can you distribute the problem?
Constraints are the key to such an optimization. Let’s make it more realistic but hopefully without loss of generality. Say the file is 2TB of ASCII, purely alphabetical words of any language in a unified alphabet, with a natural distribution such as text from world newspapers. Word length is typically below 20.
I’d assume a regular 100GB network with dedicated sockets between machines. The machines have roughly equal memory, and the combined memory is enough to hold the file.
I’d minimize disk and network access since these are slower than memory access and require serialization.
Q: is the network transfer such a bottleneck that I’m better off processing the entire file on one machine?
— one-machine solution —
Assume my memory (2GB) can hold only 1% of the unique words. I’d select only those words “below” ad* — i.e. aa*, ab*, ac* only — and save the unique words to a temp file, then rescan the input file looking for ad*, ae*…ak* to produce a 2nd temp file… Finally, combine the temp files.
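The multi-pass idea above can be sketched as follows. An in-memory word list stands in for file scanning, and the three prefix ranges are hypothetical; each pass keeps only words in its range, so the in-memory set stays small.

```java
import java.util.*;

// Minimal sketch of multi-pass unique counting: each pass keeps only words
// whose first letter falls in one range, bounding memory per pass.
public class MultiPassUnique {
    // One "pass": count unique words whose first letter is in [lo, hi].
    static Set<String> passOver(List<String> words, char lo, char hi) {
        Set<String> unique = new HashSet<>();
        for (String w : words) {
            char c = Character.toLowerCase(w.charAt(0));
            if (c >= lo && c <= hi) unique.add(w.toLowerCase());
        }
        return unique;
    }

    public static void main(String[] args) {
        List<String> words = Arrays.asList("apple", "Apple", "banana", "cherry", "banana", "date");
        int total = 0;
        // Three passes over the same input, each covering a disjoint range.
        for (char[] range : new char[][] { {'a','b'}, {'c','d'}, {'e','z'} }) {
            total += passOver(words, range[0], range[1]).size();
        }
        System.out.println(total); // 4 unique words: apple, banana, cherry, date
    }
}
```

In the real design each pass would rescan the 2TB file and spill its unique set to a temp file instead of summing counts in memory.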
— multi-machine solution —
Don’t bother having one machine scan the file and send the words to other machines over TCP. Just copy the entire input file, by CD or file transfer, to each machine. Each machine would ignore words outside its target range.
How do we divide the task? Say we have 50 machines. We don’t know the exact distribution, so if we assume aa–ak does not have too many unique words to fit into one machine (2GB), that assumption might be wrong. Instead, we’d divide the entire universe into 50 × 10 ranges. Even if we underestimate, each of these smaller ranges should still fit into one machine. Every time a machine finishes one range, it sends a tiny signal to a controller and waits for the controller to give it the next range.
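A minimal sketch of that controller, assuming a simple concurrent work queue of 500 ranges; the class and range labels are hypothetical.

```java
import java.util.Queue;
import java.util.concurrent.ConcurrentLinkedQueue;

// Sketch of the controller idea: divide the universe into many small prefix
// ranges, queue them, and hand out the next one whenever a worker signals
// that it has finished a range.
public class RangeController {
    final Queue<String> pending = new ConcurrentLinkedQueue<>();

    RangeController(int rangeCount) {
        for (int i = 0; i < rangeCount; i++) {
            pending.add("range-" + i); // in reality, a prefix interval like [aa, ak]
        }
    }

    // A worker's "tiny signal": give me my next range, or null when all done.
    String nextRange() {
        return pending.poll();
    }

    public static void main(String[] args) {
        RangeController controller = new RangeController(500); // 50 machines x 10
        System.out.println(controller.nextRange()); // range-0
        System.out.println(controller.nextRange()); // range-1
    }
}
```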
— hashing on words —
The hash table should be sized to minimize rehashing. We need a superfast hashCode and compression. The hashCode should use all the characters, perhaps except the first, since it tends to be the same within a range.
Update — the fastest approach would require a single-threaded model with no shared mutable state.
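A sketch of such a hashCode: skip the first character (nearly constant within a range), mix the rest, then compress to a bucket index with a mask, assuming a power-of-two table size. The class name is hypothetical.

```java
// Hypothetical hash for words already partitioned by prefix range.
public class RangeHash {
    static int hash(String word) {
        int h = 0;
        for (int i = 1; i < word.length(); i++) { // start at 1: skip first char
            h = 31 * h + word.charAt(i);
        }
        return h;
    }

    // "Compression": map the hash to a bucket index.
    static int bucket(String word, int tableSize) {
        // tableSize must be a power of two for the mask trick to be valid.
        return hash(word) & (tableSize - 1);
    }

    public static void main(String[] args) {
        System.out.println(bucket("alpha", 1024));
        System.out.println(bucket("amber", 1024));
    }
}
```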
Suppose a live feed of market quotes pumps in messages at the max speed of the network (up to 100GB/sec). We have thousands (say 5,000) of clients (hedge funds etc.), each with some number (not sure how large) of subscriptions to these quotes. Each subscription sets up a filter that may look like some combination of “Symbol = IBM”, “bid/ask spread < 0.2…”, or “size at the best bid price….”. All the filters reference only fields of the quote object such as symbol, size and price. We need the fastest distribution system. The bottleneck should be the network, not our application.
–memory allocation and copying–
If an IBM /quote/ matches 300 filters, then we need to send it to 300 destinations, therefore copying it 300 times — but not allocating 300 times within the JVM. We want to minimize allocation within the JVM. I believe the standard practice is to send just one copy as a message and let the receiver (a different machine) forward it to those 300 hedge funds. Non-certified RV is probably efficient, but unicast JMS is fine too.
–socket reader thread latency–
Given the messaging rate, the socket reader thread should be as lean as possible. I suggest it should blindly drop each message into a buffer, without looking at it. Asynchronously, consumer threads can apply the filters and distribute the quotes.
A fast wire format is fixed-width: the socket reader takes 500 bytes, assumes it’s one complete quote object, and blindly drops this 500-long byte array into the buffer.
Each thread is busy and important enough to deserve a dedicated cpu. That CPU is never given to another thread.
Now let me introduce my design: one thread per filter. The buffer is a circular array — bounded, but with efficient pre-allocation. Pre-allocation requires fixed-size nodes, probably byte arrays of 500 bytes each. I believe de-allocation is free — recycling. Another friend (csdoctor) suggested an unbounded linked list of arrays. Total buffer capacity should exceed the *temporary* queue build-up. The slowest consumer thread must be faster than the producer, though momentarily the reverse could happen.
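The pre-allocated circular buffer might look like this minimal sketch. The 500-byte slot size comes from the text; the rest is assumption, it is single-producer, and it is not thread-safe as shown.

```java
// Minimal sketch of the pre-allocated circular buffer: fixed 500-byte slots,
// recycled by overwriting rather than freed by the GC.
public class QuoteRing {
    static final int SLOT = 500;
    final byte[][] slots;
    long writeSeq = 0; // next slot to fill

    QuoteRing(int capacity) {
        slots = new byte[capacity][SLOT]; // all memory allocated up front
    }

    // Producer: blindly copy one raw fixed-width quote into the next slot.
    void put(byte[] rawQuote) {
        byte[] slot = slots[(int) (writeSeq % slots.length)];
        System.arraycopy(rawQuote, 0, slot, 0, SLOT);
        writeSeq++; // "de-allocation is free": old contents are simply overwritten
    }

    byte[] get(long seq) {
        return slots[(int) (seq % slots.length)];
    }

    public static void main(String[] args) {
        QuoteRing ring = new QuoteRing(4);
        byte[] quote = new byte[SLOT];
        quote[0] = 42;
        ring.put(quote);
        System.out.println(ring.get(0)[0]); // 42
    }
}
```

A real version would need a sequence-based handshake (or the flags discussed below) so consumers never read a slot the producer is overwriting.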
Note the JVM GC can’t free the memory in our buffer, since the buffer always holds references to it.
Allocate a counter in each quote object. Each filter applied will decrement the counter; the thread that hits zero will free it. But this incurs an allocation cost for the counter.
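A sketch of that per-quote counter using an AtomicInteger; the class and field names are hypothetical.

```java
import java.util.concurrent.atomic.AtomicInteger;

// Sketch of the counter idea: initialize to the number of matching filters;
// each filter thread decrements after processing, and whichever thread hits
// zero recycles the slot.
public class RefCountedQuote {
    final byte[] payload;
    final AtomicInteger remainingFilters;

    RefCountedQuote(byte[] payload, int filterCount) {
        this.payload = payload;
        this.remainingFilters = new AtomicInteger(filterCount);
    }

    // Called by each filter thread when done; returns true for the one
    // thread that should free/recycle the slot.
    boolean release() {
        return remainingFilters.decrementAndGet() == 0;
    }

    public static void main(String[] args) {
        RefCountedQuote q = new RefCountedQuote(new byte[500], 3);
        System.out.println(q.release()); // false
        System.out.println(q.release()); // false
        System.out.println(q.release()); // true — this caller recycles the slot
    }
}
```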
Each filter thread records in a global var its current position within the queue; as it advances through the queue, it increments its global var. One design is based on the observation that, given the dedicated CPUs, the slowest thread always stays the slowest in the wolfpack. This designated thread would free the memory after applying its filter.
However, it’s possible for 2 filters to be equally slow.
–design 8– We can introduce a sweeper thread that periodically wakes up to sequentially free all allocations that have been visited by all filters.
–Design 9– One thread applies all filters for a given HF client. This works if the filters are few and simple.
–Design A (CAS)– Create any # of “identical” consumer threads. At any time we can expand this thread pool.
1) Read BigArrayBuffer[++myThreadPtr] into this thread’s register and examine the fields, without converting to a Quote instance.
2) Examine the Taken boolean flag. If already set, simply “continue” the loop. This step might be needed if CAS is costly.
3) CAS to set this flag.
4a) If successful, apply ALL filters on the quote, then somehow free up the memory (without the GC) — perhaps set another boolean flag to indicate this fixed-length block is now reusable storage.
4b) Else just “continue”, since another thread will process and free it.
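The CAS claim step above can be sketched with AtomicBoolean.compareAndSet; the slot and flag names are hypothetical.

```java
import java.util.concurrent.atomic.AtomicBoolean;

// Sketch of Design A's claim step: identical consumer threads race to claim
// each slot with CAS; exactly one wins and applies all filters.
public class CasClaim {
    static class Slot {
        final byte[] quote = new byte[500];
        final AtomicBoolean taken = new AtomicBoolean(false);
        volatile boolean reusable = false; // set when all filters are done
    }

    // Returns true only for the thread that won the slot and processed it.
    static boolean tryProcess(Slot slot) {
        if (slot.taken.get()) return false;           // cheap pre-check before CAS
        if (!slot.taken.compareAndSet(false, true))   // the CAS claim
            return false;                             // another thread won the race
        // Winner: apply ALL filters here, then mark the block as reusable storage.
        slot.reusable = true;
        return true;
    }

    public static void main(String[] args) {
        Slot s = new Slot();
        System.out.println(tryProcess(s)); // true — first claim wins
        System.out.println(tryProcess(s)); // false — already taken
    }
}
```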
Any time I see asynchrony (Swing, MOM etc.), I see additional complexity. Synchronous is simpler: synchronous means blocking, and requires no object besides the caller actor and the service actor. The call is confined to a single call stack.
In contrast, async involves 2 call stacks and requires a 3rd object in the form of a buffer. Async means the caller can return before the responder even gets the message. In that /limbo/, the message must be kept in the buffer. If the responder were a doctor, she might be “not accepting new patients“.
Producer/consumer pattern … (details omitted)
Buffer has capacity and can overflow.
Buffer is usually shared by different producer threads.
Buffer can resend.
Buffer can send the messages out of order.
I guess the Swing event object must be kept not just on the 2 call stacks, but also on the event queue — the buffer.
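A minimal producer/consumer sketch showing the buffer as that 3rd object, using a bounded BlockingQueue; the message contents are hypothetical.

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;

// The bounded queue holds messages "in limbo" between the producer's call
// stack and the consumer's call stack.
public class AsyncBuffer {
    // Producer fills the buffer on its own thread and returns before any
    // consumer has looked at the messages.
    static BlockingQueue<String> produce() {
        BlockingQueue<String> buffer = new ArrayBlockingQueue<>(2); // bounded: can overflow
        Thread producer = new Thread(() -> {
            buffer.offer("msg1");
            buffer.offer("msg2");
        });
        producer.start();
        try {
            producer.join(); // producer has returned; messages still sit in the buffer
        } catch (InterruptedException ignored) {}
        return buffer;
    }

    public static void main(String[] args) {
        BlockingQueue<String> buffer = produce();
        System.out.println(buffer.poll()); // msg1
        System.out.println(buffer.poll()); // msg2
    }
}
```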
What's so good about OO? Why do the 3 most “relevant” enterprise app dev languages all happen to be OO — Java, C# and C++?
Why is google choosing java, c++ and python?
(Though this is not really a typical “enterprise app”) Why is apple choosing to promote a compiled OO language — objective C?
Why is Microsoft choosing to promote a compiled OO language (C#) more vigorously than VB.net?
But why is facebook (and yahoo?) choosing php?
Before C++ came along, most enterprise apps were developed in C, COBOL, Fortran…. Experience in the field shows that C++ and Java require more learning but do offer real benefits. I guess it all boils down to the 3 basic OO features: encapsulation, inheritance and polymorphism.
(A personal blog) We discussed enterprise reporting on a database with millions of new records added each day. Some reflections…
One of my tables had about 10GB of data and more than 50 million rows. (100 million is kind of the minimum to qualify as a large table.) This is the base table for most of our important online reports. Every user hits this table (or its derivative summary tables) one way or another. We used more than 10 special summary tables plus Business Objects, and the performance was good enough.
With the aid of a summary table, users can modify specific rows in the main table. You can easily join; you can update the main table using a complex SELECT. The most complex reporting logic can often be implemented in pure SQL (joins, case, grouping…) without Java. None of this is available to GigaSpaces or Hibernate users — these tools simply get in the way of my queries, esp. when I do something fancy.
In all the production support (RTB) teams I have seen on Wall Street, investigating and updating the DB is the most useful technique at firefighting time. If the reporting system is based on tables without a cache, prod support will feel more comfortable: better control, better visibility. The fastest cars never use automatic gears.
Really need to limit disk I/O? Then throw enough memory at the DB.
MOM and distributed caches are popular in trading apps. Developers tend to shy away from the DB due to latency. However, for rapid development, a relational DB offers excellent debugging, tracing, and correlation capabilities in a context of event-driven, callback-driven, concurrent processing. When things fail mysteriously and intermittently, logging is the key, but you often have multiple log files. You can query the cache, but much less easily than a DB.
Important events can be logged in DB tables and
* joined (#1 most powerful)
* searched in complex ways
* log data-mining. We can discover baselines, trends and anti-trends.
* Log files are usually archived (less accessible) and then removed, but DB data are usually more permanent. Don't ask me why:)
* selectively delete log events, easily, quickly.
* Data can be transformed.
* accessible by web service
* concurrent access
* extracted into another, more usable table.
* More powerful than XML.
Perhaps the biggest logistical advantage of DB is easy availability. Most applications can access the DB.
Adding DB-logging requires careful design. When time to market is the priority, I feel the debug capability of the DB can justify the effort.
A GS senior manager preferred logging in DB. Pershing developers generally prefer searching the same data in DB rather than file.