Horizontal scale-out (distributing to different boxes) is the design of choice when we are cpu-bound. For instance, if we get hundreds of updates a sec and each update requires repricing a large number of objects.
Ideally, you would want cpu to be saturated. By using twice the hardware threads, you want throughput to doubles. Our pricing engine didn’t have that much cpu load, so we didn’t scale out to more than a few boxes.
The complication of scale-out is, data required to reprice one object may reside in different boxes. People try many solutions like memory virtualization (non-trivial synchronization cost + network latency), message-passing, RMI, … but I personally prefer the one-big machine approach. Throw in 16 (or 128) processors, each with say 4 to 8 hardware threads, run 64-bit, throw in 32G RAM. No network latency. No RMI/messaging latency. I think this hardware is rather costly. Total cost of 8 smaller machines with a comparable total CPU power would cost much less, so most big banks prefer it – so-called grid computing.
According to my observations, most practitioners in your type of situations eventually opt for scale-out.
It sounds like after routing a message, your “worker” process has all it needs in its local memory. That would an ideal use case for parallel processing.
I don’t know if FX spot real time pricing is that ideal. Specifically, suppose a worker process is *dedicated* to update and publish eur/usd spot quote. I know you would listen to the eurusd quotes from all liquidity providers, but do you also need to watch usd/jpy and eur/jpy?