SDI: DoS-guard #Nsdq

Q: Design an API Rate Limiter (e.g., for Firebase or GitHub)

You are expected to develop a Rate Limiter service that can:

  • Limit the number of requests an entity can send to an API within a time window, e.g., 15 requests per second.
  • The rate limiting should work in a distributed setup, as the APIs are accessible through a cluster of servers.

(A similar question was asked at Nsdq… )

Q2: how does your cluster of cache servers detect that a given IP on the Internet is sending requests too frequently, causing Denial of Service? How do you protect yourself?

Q2b: After you blacklist a client IP, it goes quiet, then it sends a single request again. How do you decide whether to ignore the request?

Q2c: what algorithm decides whether a client IP has a legitimate need to send lots of requests, versus another client IP engaging in a Denial of Service attack?

Q2d: what about a distributed DoS attack?

https://en.wikipedia.org/wiki/Denial-of-service_attack#Defense_techniques has practical solutions.
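
For Q2c in particular, the standard interview answer is a token bucket (or leaky bucket / sliding-window counter) per client. Below is a minimal single-node token-bucket sketch; all names are mine, and a real distributed deployment would keep the per-IP bucket state in a shared store (e.g., Redis) or give each server a share of the quota.

#include <algorithm>
#include <chrono>

// Hypothetical single-node token bucket, e.g. 15 requests/second.
class TokenBucket {
    double capacity_;                                  // max burst size
    double refillRate_;                                // tokens added per second
    double tokens_;
    std::chrono::steady_clock::time_point last_;
public:
    TokenBucket(double capacity, double refillRate)
        : capacity_(capacity), refillRate_(refillRate),
          tokens_(capacity), last_(std::chrono::steady_clock::now()) {}

    bool tryConsume() {                                // one call per request
        auto now = std::chrono::steady_clock::now();
        double elapsed = std::chrono::duration<double>(now - last_).count();
        last_ = now;
        tokens_ = std::min(capacity_, tokens_ + elapsed * refillRate_);
        if (tokens_ < 1.0) return false;               // bucket empty: throttle
        tokens_ -= 1.0;
        return true;                                   // admit the request
    }
};

A bucket that stays empty for a long stretch is one heuristic for Q2/Q2c: a legitimate bursty client drains the bucket briefly, while a DoS source keeps it pinned at zero.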

unordered_map^map performance: unpredictable

Small halo… the theories and explanations are subject to change.

Many people report that unordered_map can be slower than traditional std::map for small collections. I feel it’s implementation-dependent. The result may NOT apply to java or c#.

https://blog.dubbelboer.com/2012/12/04/lru-cache.html is mostly about lookup speed. It shows that string key size hurts the hash table, while sample size hurts the RBTree.

Many interviewers asked me

  • Q: why, for a small string collection, can an RBTree outperform a hash table?
  • A: String hashing takes time proportional to string size.
  • A: Poor hashing => hash collisions => linear search is another potential problem. String hashing is unpredictable compared to integer hashing: no one can guarantee the next sample will be “uniform” according to our hash function.
  • A: rehashing: when the table exceeds its load factor, the whole bucket array is rebuilt and every element is re-hashed.

Cache efficiency of the bucket array (an array of pointers) is another reported weakness of hash tables. However, I guess a small bucket array can be held entirely in cache, and each bucket’s linked list usually holds only one node, so the cache miss is limited to that one node. For a tree, logN nodes are accessed per lookup and every one of them could be a cache miss.

Practical suggestion — it’s easy to swap the two in a given codebase, so just swap and benchmark. Either can be faster. If your data set changes over time, you may need to re-benchmark.
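
A minimal sketch of that swap-and-benchmark idea, assuming string keys (the container type is the only thing that changes):

#include <chrono>
#include <cstdio>
#include <map>
#include <string>
#include <unordered_map>
#include <vector>

// Works for both std::map and std::unordered_map: same lookup interface.
template <typename MapT>
long long timeLookups(const MapT& m, const std::vector<std::string>& keys) {
    auto t0 = std::chrono::steady_clock::now();
    long long hits = 0;
    for (const auto& k : keys) hits += m.count(k);
    auto t1 = std::chrono::steady_clock::now();
    std::printf("hits=%lld\n", hits);   // defeat dead-code elimination
    return std::chrono::duration_cast<std::chrono::microseconds>(t1 - t0).count();
}

int main() {
    std::vector<std::string> keys;
    for (int i = 0; i < 50; ++i) keys.push_back("key" + std::to_string(i));
    std::map<std::string, int> rb;
    std::unordered_map<std::string, int> hash;
    for (const auto& k : keys) { rb[k] = 1; hash[k] = 1; }
    std::printf("map: %lld us, unordered_map: %lld us\n",
                timeLookups(rb, keys), timeLookups(hash, keys));
}

With only 50 short keys, either container can win depending on the implementation, which is the whole point.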

How we achieved 400 to 700 kmps #reinterpret_cast

  • ! Many threads, but each in single-threaded mode. No shared mutable state.
  • ! allocation avoided — on millions of Output message objects. Pre-allocated ring buffer eliminates new()
  • ! allocation avoided — on millions of Input message objects, thanks to reinterpret_cast() on pointers (a sketch follows this list). See reinterpret_cast^memcpy in raw mktData parsing
  • ! Stateless Parser. Downstream component (Rebus) handles cancels, modifications..
  • Hot functions use pbref (pass-by-reference), RVO, or move semantics, minimizing copies
  • Socket tuning
  • Output list (of messages) buffering
  • aggressively simplified parsing. Minimal logic
  • large containers are pre-sized. No reallocation
  • Very fast duplicate seq check — a hot function
  • Special Logger to reduce I/O cost
  • Sharding to speed up symbol lookup
  • No virtual functions
  • Inline
  • Speed with control: best of both, to reduce the real risk of our connection becoming slow or lossy
  • Depth monitor to detect missed messages
  • Capacity tracker to monitor mps rates
  • Missed seq reporting
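
The reinterpret_cast trick in the Input-message bullet above, as a sketch. The wire layout below is hypothetical (real exchange formats differ), and strictly speaking overlaying a struct on a raw buffer breaks the strict-aliasing rule, which is why memcpy is the conservative alternative discussed in the reinterpret_cast^memcpy post:

#include <cstdint>

#pragma pack(push, 1)             // match the packed wire format, no padding
struct MsgHeader {                // hypothetical layout
    std::uint16_t length;
    char          type;           // e.g. 'A' = add order
    std::uint64_t seqNum;
};
#pragma pack(pop)

// Zero-copy parse: overlay the struct on the receive buffer instead of
// allocating a fresh message object and copying into it.
inline const MsgHeader* parseHeader(const char* recvBuf) {
    return reinterpret_cast<const MsgHeader*>(recvBuf);
}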

fastest container choices: array of POD #or pre-sized vector

Relevant to low-latency market data.

  • raw array is “lean and mean” — the most memory efficient; vector is very close, but we need to avoid reallocation
  • std::array is less popular but should offer similar performance to vector
  • all other containers are slower, with bigger footprint
  • For high performance, avoid containers of nodes/pointers — cache affinity loves contiguous memory: after accessing the 1st element, accessing the 2nd element is likely a cache hit (a sketch follows this list)
    • set/map and linked lists suffer the same way
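
A minimal sketch of the pre-sizing idea (names are illustrative):

#include <vector>

struct Quote { int symbolId; double bid, ask; };   // POD, stored contiguously

int main() {
    std::vector<Quote> book;
    book.reserve(1000000);                 // one up-front allocation
    for (int i = 0; i < 1000000; ++i)
        book.push_back({i, 99.5, 100.5});  // never reallocates, stays contiguous
}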

C for latency^TPS can use java

I’m 98% confident — low latency favors C/C++ over java [1]. FPGA is _possibly_ even faster.

I’m 80% confident — throughput (in real-time data processing) is achievable in C, java, optimized python (Facebook?), optimized php (Yahoo?) or even a batch program. When you need to scale out, java seems the #1 popular choice as of 2017. Most of the big data solutions seem to treat java as the first among equals.

In the “max throughput” context, I believe the critical java code path can be optimized to the same efficiency as C; JIT can achieve that. A python or php module can achieve the same, perhaps using native extensions.

[1] Actually, java bytecode can run faster than compiled C code (See my other posts such as https://bintanvictor.wordpress.com/2017/03/20/how-might-jvm-beat-cperformance/)

MOM+threading Unwelcome ] low latency@@

Piroz told me that trading IT job interviews tend to emphasize multi-threading and MOM. Some use SQL too. I now feel all of these are unwelcome in low latency trading.

A) MOM – see HFT mktData redistribution via MOM

B) threading – Single-Threaded Mode (STM) is generally the fastest in theory and in practice. (I only have a small observed sample size.) I feel the fastest trading engines are STM, with no shared mutable state. The new Nsdq platform (in java) is STM.

MT is OK if the threads don’t compete for resources like CPU, I/O or locks. Compared to STM, most lock-free systems introduce latency, such as retries and additional memory barriers. Single-threaded code needs no such barriers, so compiler optimization is unconstrained by them.

C) SQL – as stated elsewhere, flat files are much faster than relational DB. How about in-memory relational DB?

Rebus, the order book engine, is in-memory.

realistic TPS for exchange mkt-data gateway: below 1M/sec

  • Bombay’s BSE infrastructure (including the matching engine) can execute 1 million trades/second. It must maintain order books, therefore it is NOT stateless.
  • Rebus maintains full order books for each symbol and can handle 300k (in some cases 700k) messages per second per instance. It uses an AVL tree, which beat all other data structures in tests.
  • The Nasdaq feed parser (in c++) is stateless and probably single-threaded, with very limited parsing logic compared to Rebus. It once handled 600k messages/second per instance.
  • Product Gateway is probably a dumb process, since we never need to change it. It can handle 550k messages/second per instance.

I believe TPS throughput, not latency, is the optimization goal here. The biggest challenge known to me is the data burst.

Therefore, a java GC pause is probably unacceptable. My hypothesis: after you experience a data surge for a while, you tend to run low on memory and must run GC (like a run to the bathroom), but that’s the wrong time to run GC. If the surge continues while GC runs, the incoming data would overflow the queue.

Does inlining really make a difference@@ Yes

MSVS and g++ debug builds both disable inlining (presumably to ease debugging). The performance difference vis-a-vis the release build is mostly due to this single factor, according to [[optimized c++]].

The same author asserts that inlining is probably the most powerful code optimization.

Pimpl effectively disables inlining.
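
A sketch of why: the forwarding function’s body lives in another translation unit, so at the call site the compiler (without link-time optimization) has nothing to inline. Names are illustrative:

// widget.h: callers see only this
class Widget {
    class Impl;                 // incomplete type
    Impl* pimpl_;
public:
    int size() const;           // body hidden in widget.cpp
};

// widget.cpp
class Widget::Impl { public: int size_ = 0; };
int Widget::size() const { return pimpl_->size_; }  // invisible to callers, so never inlined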

However, the c++faq gives many reasons for and against inlining.

c++SCB eFX IV#Dmitry

100% QQ type, as defined in https://bintanvictor.wordpress.com/2017/02/15/qqzz-mutual-exclusion-cjava/. I feel many are micro-optimizations with questionable improvement. I wonder how much value such obscure knowledge adds to the team.

Q: Scanning a vector of int (e.g., finding the average or max). Forward iteration vs backward iteration: which one could be faster, considering all possible compiler optimizations?

%%A: forward. Memory is read into the cpu cache in chunks (cache lines), not one element at a time, which suits forward iteration. Not sure about backward.
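
For concreteness, the two loops; note that modern hardware prefetchers typically recognize both ascending and descending access patterns, so any gap must be measured:

#include <cstddef>
#include <vector>

int sumForward(const std::vector<int>& v) {
    int s = 0;
    for (std::size_t i = 0; i < v.size(); ++i) s += v[i];  // ascending addresses
    return s;
}

int sumBackward(const std::vector<int>& v) {
    int s = 0;
    for (std::size_t i = v.size(); i-- > 0; ) s += v[i];   // descending addresses
    return s;
}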

Q: Which one could be fastest:

void f(double arg){…..}
void f(double & arg){….}

%%A: inlining for first but not 2nd?
A: See http://stackoverflow.com/questions/722257/should-i-take-arguments-to-inline-functions-by-reference-or-value esp. the long answer.

Q: Thr1 and Thr2 on 2 CPUs both update an object s, having 2 fields. Thr1 only updates s.field1. Thr2 only updates s.field2. No interference. No synchronization required. We observe the performance is slower than using one thread to update both fields. Any explanation?
%%A: cpu caching. The two fields likely share one cache line, so each core’s write invalidates the other core’s copy (false sharing).
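
A sketch of the usual fix, padding each field onto its own cache line (64 bytes is the typical x86 line size; C++17 also offers std::hardware_destructive_interference_size):

#include <thread>

struct Padded {
    alignas(64) long field1;    // each field gets its own cache line
    alignas(64) long field2;
};

int main() {
    Padded s{};
    std::thread t1([&s] { for (int i = 0; i < 100000000; ++i) ++s.field1; });
    std::thread t2([&s] { for (int i = 0; i < 100000000; ++i) ++s.field2; });
    t1.join();
    t2.join();
}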

Q: weak_ptr justification, when we have shared_ptr already? I feel [[effModernC++]] has a good chapter on it.

Ashish pointed out that in some apps you can identify a clear risk of circular dependency. Replace one side of the cycle with weak_ptr.
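
A minimal sketch of that cycle-breaking use (names are illustrative):

#include <memory>

struct Parent;
struct Child {
    std::weak_ptr<Parent> parent;      // back-link is weak_ptr, so no cycle
};
struct Parent {
    std::shared_ptr<Child> child;      // owning link stays shared_ptr
};

int main() {
    auto p = std::make_shared<Parent>();
    p->child = std::make_shared<Child>();
    p->child->parent = p;              // refcount on Parent stays 1
}   // both objects destroyed here; with a shared_ptr back-link they would leak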

Q: given a 2D array arr[10][5], how do you use pointer arithmetic to hit arr[1][5]?

A: The array is contiguous. See http://stackoverflow.com/questions/7784758/c-c-multidimensional-array-internals. Note this is different from an array of pointers.
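
A sketch; note arr[1][5] is one past the end of row 1 (valid column indices are 0..4), and because the rows are laid out contiguously that address coincides with arr[2][0]:

#include <cstdio>

int main() {
    int arr[10][5] = {};                   // 50 ints, row-major, contiguous
    int* flat = &arr[0][0];
    int* p = flat + 1 * 5 + 5;             // row 1, "column 5" => 10 ints in
    std::printf("%d\n", p == &arr[2][0]);  // prints 1: same address
}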

Q: what would you see if a TCP socket server has a full listen queue?
%%A: TCP requires a handshake, so if the server is unable to accept the connection, the client would find out.
%%A: connection refused?

Q: what STL algorithms did you use?
%%A: for_each(), find(), copy_if(), transform(), reverse(), sort(), replace_if(), remove_if()
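
A few of those in one self-contained snippet:

#include <algorithm>
#include <iterator>
#include <vector>

int main() {
    std::vector<int> v{3, 1, 4, 1, 5, 9};
    std::sort(v.begin(), v.end());                       // 1 1 3 4 5 9
    std::replace_if(v.begin(), v.end(),
                    [](int x) { return x < 2; }, 0);     // 0 0 3 4 5 9
    v.erase(std::remove_if(v.begin(), v.end(),
                           [](int x) { return x == 0; }),
            v.end());                                    // erase-remove idiom
    std::vector<int> doubled;
    std::transform(v.begin(), v.end(), std::back_inserter(doubled),
                   [](int x) { return 2 * x; });         // 6 8 10 18
    std::reverse(doubled.begin(), doubled.end());        // 18 10 8 6
}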