Update — fastest would require single-threaded model with no shared mutable
Suppose a live feed of market quotes pumps in messages at the max speed of the network (up to 100GB/sec). We have (5) thousands of (Hedge Fund etc) clients, each with some number (not sure how large) of subscriptions in these quotes. Each subscription sets up a filter that may look like some combination of “Symbol = IBM”, “bid/ask spread < 0.2…”, or “size at the best bid price….”. All the filters only reference fields of the quote object such as symbol, size and price. We need the fastest distribution system. Bottleneck should be network, not our application.
–memory allocation and copying–
If an IBM /quote/ matches 300 filters, then we need to send it to 300 destinations, therefore copying 300 times, but not 300 allocations within JVM. We want to minimize allocation within JVM. I believe the standard practice is to send just one copy as a message and let the receiver (different machine) forward it to those 300 hedge funds. Non-certified RV is probably efficient, but unicast JMS is fine too.
–socket reader thread latency–
Given the messaging rate, socket reader thread should be as lean as possible. I suggest it should blindly drop each msg into a buffer, without looking at it. Asynchronously consumer threads can apply the filters and distribute the quotes.
A fast wire format is fixed-width. Socket reader takes 500bytes and assume it’s one complete quote object, and blindly drops this 500-long byte array into the buffer.
Each thread is busy and important enough to deserve a dedicated cpu. That CPU is never given to another thread.
Now let me introduce my design. One thread per filter. Buffer is a circular array — bounded but efficient pre-allocation. Pre-allocation requires fixed-sized nodes, probably byte arrays of 500 each. I believe de-allocation is free — recycling. Another friend (csdoctor) suggested an unbounded linked list of arrays . Total buffer capacity should exceed the *temporary* queue build-up. Slowest consumer thread must be faster than producer, though momentarily the reverse could happen.
Note jvm gc can’t free the memory in our buffer.
Allocate a counter in each quote object. Each filter applied will decrement the counter. The thread that hits zero will free it. But this incurs allocation cost for that counter.
Each filter thread records in a global var its current position within the queue. Each filter thread advances through the queue and increments it’s global var. One design is based on the observation that given the dedicated CPU, the slowest thread is always the slowest in the wolfpack. This designated thread would free the memory after applying its filter.
However, it’s possible for 2 filters to be equally slow.
–design 8–We can introduce a sweeper thread that periodically wakes up to sequentially free all allocations that have been visited by all filters.
–Design 9– One thread to apply all filters for a given HF client. This works if filter logic is few and simple.
–Design A (CAS)– Create any # of “identical” consumer threads. Any time we can expand this thread pool.
1)read BigArrayBuffer[++MyThreadPtr] into this thread’s register and examine the fields, without converting to a Quote instance.
2) examine the Taken boolean flag. If already set, then simply “continue” the loop. This step might be needed if CAS is costly.
3) CAS to set this flag
4a) if successful, apply ALL filters on the quote. Then somehow free up the memory (without the GC). Perhaps set another boolean flag to indicate this fixed-length block is now reusable storage.
4b) else just “continue” since another thread will process and free it.