2011 white paper@high-perf messaging

https://www.informatica.com/downloads/1568_high_perf_messaging_wp/Topics-in-High-Performance-Messaging.htm is a 2011 white paper by several messaging experts. I have saved the HTML in my Google Drive. Here are some QQ + zbs knowledge pearls. Each sentence in the article could expand into a blog post .. thin->thick.

  • Under exactly what conditions would TCP provide low latency?
  • TCP’s primary concern is bandwidth sharing, to ensure “pain is felt equally by all TCP streams”. Consequently, a latency-sensitive TCP stream can’t have priority over other streams.
    • Therefore, one recommendation is to use a dedicated network with no congestion (or controlled congestion). Over such a network, the latency-sensitive system would not be victimized by TCP’s inherent speed control.
  • To see how many received packets are delayed on the receiver end due to out-of-sequence (OOS) arrival, use netstat -s.
  • TCP guaranteed delivery is “better late than never”, but latency-sensitive systems prefer “better never than late”. I think UDP is the natural choice for them.
  • The white paper features an in-depth discussion of group rate, e.g. one market-data sender feeding multiple (including some slow) receivers.

 

TCP_NODELAY to improve latency

https://www.extrahop.com/company/blog/2016/tcp-nodelay-nagle-quickack-best-practices/#3

https://stackoverflow.com/questions/3761276/when-should-i-use-tcp-nodelay-and-when-tcp-cork

Nagle’s algorithm (on by default) helps interactive applications like telnet. However, it can increase latency when sending streaming data.

For interactive applications or chatty protocols with a lot of handshakes, such as SSL, Citrix and Telnet, Nagle’s algorithm can cause a drop in performance. Enabling TCP_NODELAY can improve latency, but at the expense of bandwidth efficiency, as briefly mentioned in the 2011 white paper@high-perf messaging.

In such cases, disabling Nagle’s algorithm is the better option, and setting the TCP_NODELAY socket option does exactly that.
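
At the socket API level, this is a one-line option. A minimal C sketch (error handling omitted; fd is assumed to be an already-connected TCP socket):

```c
/* Minimal sketch: disable Nagle's algorithm on a TCP socket. */
#include <netinet/in.h>    /* IPPROTO_TCP */
#include <netinet/tcp.h>   /* TCP_NODELAY */
#include <sys/socket.h>    /* setsockopt */

static int disable_nagle(int fd) {
    int one = 1;
    /* With TCP_NODELAY set, small writes go out immediately instead of being
     * coalesced while waiting for an ACK of previously sent data. */
    return setsockopt(fd, IPPROTO_TCP, TCP_NODELAY, &one, sizeof(one));
}
```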

 

kernel bypass : possible usage ] RTS

Partially hypothetical usage scenario/proposal.

“Bypass” means .. bypassing standard kernel functions and using faster, lighter firmware instead.

“Bypass” means .. every network packet goes straight from the NIC to the user application, without passing through the TCP/IP stack in the kernel.

“Bypass” probably means bypassing the socket buffer in the kernel.

Background — traditional packet processing goes through the TCP/IP software stack, implemented as a family of kernel functions. Whenever a network packet is received, the NIC writes the packet into a circular array and raises a hardware interrupt. The interrupt handler (i-handler) and the bottom half then perform packet processing using the kernel socket buffer, and finally copy the payload into a UserModeBuffer.

Note the two separate buffers. In the RTS parser config file, we configure them as sock_recv_buf vs read_buf for every channel, whether TCP or multicast. The socket buffer is accessible to the kernel only (and probably unused when kernel bypass is turned on), since the kernel can’t expose a fast-changing memory location to a slow userland thread. I believe the userland thread uses read() or similar functions to drain the socket buffer, so that the kernel can refill it. See [[linux kernel]] and https://eklitzke.org/how-tcp-sockets-work
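
A minimal C sketch of that hand-off (buffer size and names are illustrative, not the actual RTS config values): the kernel fills its socket receive buffer off the wire, and the userland thread drains it into an application-level read buffer with recv().

```c
/* Sketch: drain the kernel socket receive buffer into a user-mode buffer.
 * sock_fd is assumed to be a connected TCP socket or a bound multicast socket. */
#include <sys/socket.h>
#include <sys/types.h>

#define READ_BUF_SZ (64 * 1024)

static void drain_socket(int sock_fd) {
    char read_buf[READ_BUF_SZ];               /* user-mode buffer ("read_buf") */
    for (;;) {
        /* recv() copies bytes out of the kernel socket buffer ("sock_recv_buf"),
         * freeing space for the kernel to accept more incoming packets. */
        ssize_t n = recv(sock_fd, read_buf, sizeof read_buf, 0);
        if (n <= 0)
            break;                            /* EOF or error */
        /* ... parse the n bytes sitting in read_buf ... */
    }
}
```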

With kernel bypass,

  • the network card (NIC) has an FPGA chip, which contains the low-level packet-processing logic (actually firmware “burned” into the FPGA)
  • This firmware replaces the TCP/IP kernel functions and delivers packets directly and automatically to the application’s UserModeBuffer. However, my parser relies more on another feature —
  • The SolarFlare firmware also lets my parser (a user application) read the NIC circular array directly. This zero-copy technique bypasses not only the socket buffer but also the UserModeBuffer.

My parser uses the SolarFlare NIC for both multicast and TCP.

The kernel-bypass API is used only in some low-level modules of the framework; it is disabled by default and configurable for each connection defined in the configuration file.

http://jijithchandran.blogspot.com/2014/05/solarflare-multicast-and-kernel-bypass.html is relevant.

linux tcp buffer^AWS tuning params

—receive buffer configuration
In general, there are two ways to control how large a TCP socket receive buffer can grow on Linux:

  1. You can call setsockopt(SO_RCVBUF) to explicitly set the max receive buffer size on an individual TCP/UDP socket.
  2. Or you can leave it to the operating system and let it auto-tune the buffer dynamically, using the global tcp_rmem values as a hint.
  3. … both approaches are subject to the kernel limits below:

/proc/sys/net/core/rmem_max — a global hard limit for all sockets (TCP/UDP). I see 256M on my system. Can you set it to 1GB? I’m not sure, but it’s probably unaffected by the boolean flag below.

/proc/sys/net/ipv4/tcp_rmem — doesn’t override SO_RCVBUF. The max value on the RTS system is again 256M. The kernel adjusts each socket’s receive buffer dynamically at runtime.

The Linux tcp(7) manpage explains the relationship.
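
A minimal C sketch of option 1 (error handling omitted; the 4MB figure is just an illustration). Two Linux behaviours worth knowing: the kernel doubles the value passed to SO_RCVBUF to allow for bookkeeping overhead (see socket(7)), and explicitly setting SO_RCVBUF switches off receive-buffer auto-tuning for that socket.

```c
/* Sketch: request a 4 MB receive buffer and read back what the kernel granted
 * (capped by /proc/sys/net/core/rmem_max, and doubled by Linux). */
#include <stdio.h>
#include <sys/socket.h>

static void size_recv_buffer(int fd) {
    int requested = 4 * 1024 * 1024;          /* 4 MB */
    setsockopt(fd, SOL_SOCKET, SO_RCVBUF, &requested, sizeof(requested));

    int granted = 0;
    socklen_t len = sizeof(granted);
    getsockopt(fd, SOL_SOCKET, SO_RCVBUF, &granted, &len);
    printf("requested %d bytes, kernel granted %d bytes\n", requested, granted);
}
```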

Note that a large TCP receive buffer is usually needed for high-latency, high-bandwidth, high-volume connections. Low-latency systems should use smaller TCP buffers.

For a high-volume multicast channel, you need large receive buffers to guard against data loss — a UDP sender doesn’t obey flow control, so nothing prevents receiver overflow.
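
For the multicast case, a hedged sketch (the group address and port are made-up, not real RTS channels): join the group and request a large receive buffer up front, since nothing will slow the sender down for you.

```c
/* Sketch: multicast receiver with a large receive buffer to ride out bursts. */
#include <arpa/inet.h>      /* inet_addr, htonl, htons */
#include <netinet/in.h>     /* sockaddr_in, ip_mreq */
#include <string.h>
#include <sys/socket.h>

static int open_mcast_rx(const char *group, unsigned short port) {
    int fd = socket(AF_INET, SOCK_DGRAM, 0);

    int big = 32 * 1024 * 1024;               /* ask for 32 MB, capped by rmem_max */
    setsockopt(fd, SOL_SOCKET, SO_RCVBUF, &big, sizeof(big));

    struct sockaddr_in addr;
    memset(&addr, 0, sizeof(addr));
    addr.sin_family = AF_INET;
    addr.sin_addr.s_addr = htonl(INADDR_ANY);
    addr.sin_port = htons(port);
    bind(fd, (struct sockaddr *)&addr, sizeof(addr));

    struct ip_mreq mreq;                      /* join e.g. "239.1.1.1" on any interface */
    mreq.imr_multiaddr.s_addr = inet_addr(group);
    mreq.imr_interface.s_addr = htonl(INADDR_ANY);
    setsockopt(fd, IPPROTO_IP, IP_ADD_MEMBERSHIP, &mreq, sizeof(mreq));
    return fd;
}
```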

—AWS (advertised window size)

/proc/sys/net/ipv4/tcp_window_scaling is a boolean configuration, turned on by default. With window scaling on, roughly 1GB becomes the new AWS limit. If turned off, the AWS is constrained by the 16-bit window field in the TCP header — at most 65,535.

I think this flag affects the AWS, not the receive buffer size.

  • if turned on, and the receive buffer is configured to grow beyond 64KB, then an ACK can advertise an AWS above 65,535.
  • if turned off, then we don’t (?) need a large receive buffer, since the AWS can only be 65,535 or lower.
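
For reference, the window-scale option (RFC 7323) shifts the 16-bit window field left by at most 14 bits, so the maximum advertised window is 65,535 × 2^14 = 1,073,725,440 bytes ≈ 1 GiB — which is where the “roughly 1GB” figure above comes from.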

 

which socket/port is hijacking bandwidth

I guess some HFT machines might be dedicated to one (or a few) processes, but in general multiple applications often share one host. A low-latency system may actually prefer this, due to the shared-memory messaging advantage. In such a set-up, it’s extremely useful to pinpoint exactly which process, which socket and which network port is responsible for high bandwidth usage.

Solaris 10? Using Dtrace? tough? See [[solaris performance and tools]]

Linux? doable

# use iptraf to see how much traffic is flowing through a given network interface.
# given a specific network interface, use iptraf to see the traffic broken down by individual ports. If you don’t believe it, [[optimizing linux perf]] P202 has an iptraf screenshot showing the per-port volumes.
# given a specific port, use netstat or lsof to find the PID of the process using that port.
# given a PID, use strace and /proc/[pid]/fd to drill down to the specific socket (among many) responsible for the traffic. A socket is seldom shared between processes (see other posts). I believe strace/ltrace can also reveal which user functions make those socket system calls.
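
For steps 3-4, here is a rough, hypothetical C sketch of what netstat/lsof do under the hood — map a local TCP port to its socket inode via /proc/net/tcp, then scan /proc/[pid]/fd for the symlink “socket:[inode]” to find the owning process. IPv4 TCP only; error handling trimmed.

```c
/* Sketch: local TCP port -> socket inode (/proc/net/tcp) -> owning PID (/proc/[pid]/fd). */
#include <dirent.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

/* Return the inode of an IPv4 TCP socket bound to `port`, or 0 if not found. */
static unsigned long inode_for_port(unsigned int port) {
    FILE *f = fopen("/proc/net/tcp", "r");
    if (!f) return 0;
    char line[512];
    unsigned long inode = 0;
    fgets(line, sizeof line, f);                     /* skip header row */
    while (fgets(line, sizeof line, f)) {
        unsigned int local_port;
        unsigned long ino;
        /* columns: sl local:port rem:port st tx:rx tr:when retrnsmt uid timeout inode */
        if (sscanf(line,
                   "%*d: %*8[0-9A-Fa-f]:%x %*8[0-9A-Fa-f]:%*x %*x %*x:%*x %*x:%*x %*x %*d %*d %lu",
                   &local_port, &ino) == 2 && local_port == port) {
            inode = ino;
            break;
        }
    }
    fclose(f);
    return inode;
}

/* Walk /proc/[pid]/fd/* looking for symlinks of the form "socket:[inode]". */
static void print_owners(unsigned long inode) {
    char needle[64];
    snprintf(needle, sizeof needle, "socket:[%lu]", inode);
    DIR *proc = opendir("/proc");
    struct dirent *p;
    while (proc && (p = readdir(proc)) != NULL) {
        if (p->d_name[0] < '0' || p->d_name[0] > '9')
            continue;                                /* numeric PID directories only */
        char fddir[512];
        snprintf(fddir, sizeof fddir, "/proc/%s/fd", p->d_name);
        DIR *fds = opendir(fddir);
        struct dirent *e;
        while (fds && (e = readdir(fds)) != NULL) {
            char path[1024], target[128];
            snprintf(path, sizeof path, "%s/%s", fddir, e->d_name);
            ssize_t n = readlink(path, target, sizeof target - 1);
            if (n > 0) {
                target[n] = '\0';
                if (strcmp(target, needle) == 0)
                    printf("pid %s, fd %s -> %s\n", p->d_name, e->d_name, target);
            }
        }
        if (fds) closedir(fds);
    }
    if (proc) closedir(proc);
}

int main(int argc, char **argv) {
    if (argc != 2) {
        fprintf(stderr, "usage: %s <local-port>\n", argv[0]);
        return 1;
    }
    unsigned long ino = inode_for_port((unsigned int)strtoul(argv[1], NULL, 10));
    if (ino == 0) {
        fprintf(stderr, "no IPv4 TCP socket found on that local port\n");
        return 1;
    }
    print_owners(ino);
    return 0;
}
```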