how urgent are linux interrupts #millisec

I suspect some hardware interrupts might be different, but many interrupts should be handled in milliseconds. See P 145 [[linux kernel]]


kernel bypass : possible usage ] RTS

Partially hypothetical usage scenario/proposal.

“Bypass” means .. bypassing standard kernel functions and using faster, lighter firmware instead.

“Bypass” means .. every network packet would go straight from NIC to user application, without passing through tcp/ip stack in the kernel.

Background — Traditional packet processing goes through tcp/ip software stack, implemented as a family of kernel functions. Whenever a network packet is received, NIC writes the packet to a ring buffer and raise a hardware interrupt. The i-handler (interrupt handler routine) and bottom-half will then perform packet processing in the kernel socket buffer, and finally copy it to a UserModeBuffer.

Note the two separate buffers. In our parser config file, we configure them as sock_recv_buf vs read_buf. The former is accessible by kernel only and is not used when we turn on kernel bypass.

In contrast, with kernel bypass,

  • the Network card (NIC) has a FPGA chip, which contains the low-level packet processing software (actually firmware “burned” into fpga)
  • This firmware replaces tcp/ip kernel functions and delivers the packets directly to application. However, my parser relies more on another feature —
  • The SolarFlare firmware also lets my parser (user applications) access the NIC ring-buffer directly. Zero-copy technique bypasses the socket receive buffer in the kernel.

My parser uses SolarFlare NIC for both multicast and tcp.

Kernel bypass API was only used in some low-level modules of the framework, and disabled by default and configurable for each connection defined in configuration file. is relevant.

SDI: DoS-guard #Nsdq

Q: Design an API Rate Limiter (e.g. for Firebase or Github)

You are expected to develop a Rate Limiter services that can:

  • Limit the number of requests an entity can send to an API within a time window e.g., 15 requests per second.
  • The rate limiting should work for a distributed setup, as the APIs are accessible through a cluster of servers.

(A similar question was asked at Nsdq… )

Q2: how do your cluster of cache servers detect a given IP on the Internet is sending requests too frequently, causing Denial of Service? How do you protect yourself?

Q2b: After you blacklist a client IP, it goes quiet, then it sends a single request again. How you decide whether to ignore the request?

Q2c: what algorithm to decide if a client IP has legitimate need to send lots of requests vs another client IP engaging in Denial of Service attack?

Q2d: what if distributed DoS attack? has practical solutions.

inline: footprint+perf can Backfire !

See also

Without inline, instruction cache system could hit ChangeOfFlow twice as it enters/exits your function f1(). If f1() is actually inlined and embedded in a hostFunc, then the instruction cache system can often load entire hostFunc, eliminating COF. This helps instruction cache, but excessive inlining can increase footprint (code bloat).

google c++ guide points out that

  • incline can either increase or decrease (for tiny functions) executable footprint. In general, smaller footprint improves running time due to instruction cache efficiency
  • virtual functions are inlined (i.e. defined in class body) primarily for convenience/maintainability, not performance

mlock() : low-level syscall to prevent paging ] real-time apps has sample code

See also says

If the application is entering a time sensitive region of code, an mlockall call prior to entering, followed by munlockall can reduce paging while in the critical section. Similarly, mlock can be used on a data region that is relatively static or that will grow slowly but needs to be accessed without page faulting.

## optimize code for instruction cache: few tips

I don’t see any ground-breaking suggestions. I think only very hot functions (confirmed by oprofile + cachegrind) requires such micro-optimization.

I like the function^code based fragmentation framework on (3 parts)

  • inline: footprint+perf can backfire. Can be classified as embedding
  • use table lookup to replace “if” ladder — minimize jumps
  • branching — refactor a lengthy-n-cold (not “hot”) code chunk out to a function, so 99% of the time the instruction cache (esp. the pre-fetch flavor) doesn’t load a big chunk of cold stuff.
    • this is the opposite of embedding !
  • Trim the executable footprint. Reduce code bloat due to inlining and templates?
  • loop unrolling to minimize jumps. I think this is practical and time-honored — at aggressive optimization levels some compilers actually perform loop unrolling!
  • Use array (anything contiguous) instead of linked list or maps?? is a 2014 Intel paper — split huge function if it’s invoked in a loop.


reinterpret_cast(zero-copy)^memcpy: raw mktData parsing

Raw market data input comes in as array of unsigned chars. I “reinterpret_cast” it to a pointer-to-TradeMsgStruct before looking up each field inside the struct.

Now I think this is the fastest solution. Zero-cost at runtime.

As an alternative, memcpy is also popular but it requires bitwise copy. It often require allocating a tmp variable.

linux tcp buffer^AWS tuning params

—receive buffer configuration
In general, there are two ways to control how large a TCP socket receive buffer can grow on Linux:

  1. You can set setsockopt(SO_RCVBUF) explicitly as the max receive buffer size on individual TCP/UDP sockets
  2. Or you can leave it to the operating system and allow it to auto-tune it dynamically, using the global tcp_rmem values as a hint.
  3. … both values are capped by

/proc/sys/net/core/rmem_max — is a global hard limit on all sockets (TCP/UDP). I see 256M in my system. Can you set it to 1GB? I’m not sure but it’s probably unaffected by the boolean flag below.

/proc/sys/net/ipv4/tcp_rmem — doesn’t override SO_RCVBUF. The max value on RTS system is again 256M. The receive buffer for each socket is adjusted by kernel dynamically, at runtime.

The linux “tcp” manpage explains the relationship.

Note large TCP receive buffer size is usually required for high latency, high bandwidth, high volume connections. Low latency systems should use smaller TCP buffers.

For high-volume multicast connections, you need large receive buffers to guard against data loss — UDP sender doesn’t obey flow control to prevent receiver overflow.


/proc/sys/net/ipv4/tcp_window_scaling is a boolean configuration. (Turned on by default) 1GB  is the new limit on AWS after turning on window scaling. If turned off, then AWS value is constrained to a 16-bit integer field in the TCP header — 65536

I think this flag affects AWS and not receive buffer size.

  • if turned on, and if buffer is configured to grow beyond 64KB, then Ack can set AWS to above 65536.
  • if turned off, then we don’t (?) need a large buffer since AWS can only be 65536 or lower.


HFT mktData redistribution via MOM

Several low-latency practitioners say MOM is unwelcome due to added latency:

  1. The HSBC hiring manager Brian R was the first to point out to me that MOM adds latency. Their goal is to get the raw (market) data from producer to consumer as quickly as possible, with minimum stops in between.
  2. 29West documentation echos “Instead of implementing special messaging servers and daemons to receive and re-transmit messages, Ultra Messaging routes messages primarily with the network infrastructure at wire speed. Placing little or nothing in between the sender and receiver is an important and unique design principle of Ultra Messaging.
  3. Then I found that theRTS systems (not ultra-low-latency ) have no middle-ware between feed parser and order book engine (named Rebus).

However, HFT doesn’t always avoid MOM. P143 [[all about HFT]] published 2010 says an HFT such as Citadel often subscribes to both individual stock exchanges and CTS/CQS [1], and multicasts the market data for other components of the HFT. This design has additional buffers inherently. The first layer receives raw external data via a socket buffer. The 2nd layer components would receive the multicast data via their socket buffers.

[1] one key reason to subscribe redundant feeds — CTS/CQS may deliver a tick message faster!

Lehman’s market data is re-distributed over tibco RV, in FIX format.