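Nagle’s algorithm helps interactive applications like telnet by coalescing small writes into fewer segments. However, it may increase latency when sending streaming data. Additionally, if the receiver implements the ‘delayed ACK policy’, the two can stall each other temporarily: the sender holds back a small segment until outstanding data is acknowledged, while the receiver holds back the ACK until its delay timer fires. In such cases, disabling Nagle’s algorithm is a better option.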
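Disabling Nagle is a per-socket option. A minimal sketch in Python, assuming a plain TCP socket (the helper name make_low_latency_socket is made up for illustration):

```python
import socket


def make_low_latency_socket() -> socket.socket:
    """Create a TCP socket with Nagle's algorithm disabled."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    # TCP_NODELAY=1 asks the stack to send small segments immediately
    # instead of holding them back until outstanding data is ACKed.
    sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_NODELAY, 1)
    return sock


sock = make_low_latency_socket()
# Read the option back to confirm that it took effect.
nagle_disabled = sock.getsockopt(socket.IPPROTO_TCP, socket.TCP_NODELAY) != 0
sock.close()
```

The trade-off is more, smaller packets on the wire, so this suits latency-sensitive traffic rather than bulk transfer.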
Partially hypothetical usage scenario/proposal.
“Bypass” means .. bypassing standard kernel functions and using faster, lighter firmware instead.
“Bypass” means .. every network packet would go straight from NIC to user application, without passing through tcp/ip stack in the kernel.
“Bypass” probably means bypassing the socket buffer in kernel
Background: traditional packet processing goes through the tcp/ip software stack, implemented as a family of kernel functions. Whenever a network packet is received, the NIC writes the packet to a circular array and raises a hardware interrupt. The i-handler (interrupt handler routine) and bottom-half will then perform packet processing using the kernel socket buffer, and finally copy the packet to a UserModeBuffer.
Note the two separate buffers. In the RTS parser config file, we configure them as sock_recv_buf vs read_buf for every channel, regardless of TCP or multicast. The socket buffer is accessible by the kernel only (and is probably unused when we turn on kernel bypass), as the kernel can’t expose a fast-changing memory location to a slow userland thread. I believe the userland thread uses read() or similar functions to drain the socket buffer, so that the kernel can refill it. See [[linux kernel]] and https://eklitzke.org/how-tcp-sockets-work
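To make the two-buffer picture concrete, here is a small sketch of the traditional (non-bypass) path. It is only an illustration: a local socketpair stands in for a real network channel, and the variable name read_buf merely echoes the config key above.

```python
import socket

# Userland buffer owned by the application (the "UserModeBuffer").
read_buf = bytearray(4096)

# A connected local socket pair stands in for a real channel.
sender, receiver = socket.socketpair()

message = b"tick-1|tick-2|"
sender.sendall(message)
# At this point the bytes sit in the receiver's kernel socket buffer.
# recv_into() copies them into the userland buffer, which frees
# space in the kernel buffer for more incoming data.
n = receiver.recv_into(read_buf)
payload = bytes(read_buf[:n])

sender.close()
receiver.close()
```

If the application drains too slowly, the kernel-side buffer is the one that fills up, which is why its size is configured separately from the userland buffer.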
With kernel bypass,
- the network card (NIC) has an FPGA chip, which contains the low-level packet processing software (actually firmware “burned” into the FPGA)
- This firmware replaces tcp/ip kernel functions and delivers the packets directly and automatically to application UserModeBuffer. However, my parser relies more on another feature —
- The SolarFlare firmware also lets my parser (a user application) read the NIC circular array directly. This zero-copy technique can bypass not only the socket buffer but also the UserModeBuffer.
My parser uses SolarFlare NIC for both multicast and tcp.
The kernel bypass API was used only in some low-level modules of the framework. It was disabled by default and configurable for each connection defined in the configuration file.
—receive buffer configuration
In general, there are two ways to control how large a TCP socket receive buffer can grow on Linux:
- You can set setsockopt(SO_RCVBUF) explicitly as the max receive buffer size on individual TCP/UDP sockets
- Or you can leave it to the operating system and allow it to auto-tune it dynamically, using the global tcp_rmem values as a hint.
- … both values are capped by
/proc/sys/net/core/rmem_max — is a global hard limit on all sockets (TCP/UDP). I see 256M in my system. Can you set it to 1GB? I’m not sure but it’s probably unaffected by the boolean flag below.
/proc/sys/net/ipv4/tcp_rmem — doesn’t override SO_RCVBUF. The max value on RTS system is again 256M. The receive buffer for each socket is adjusted by kernel dynamically, at runtime.
The linux “tcp” manpage explains the relationship.
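A quick way to see these limits in action is to request a buffer size and read back what the kernel granted. This is a sketch, not a tuning recipe; the helper names request_rcvbuf and read_sysctl are made up for illustration, and the /proc files exist only on Linux.

```python
import socket


def request_rcvbuf(sock: socket.socket, nbytes: int) -> int:
    """Ask for a receive buffer of nbytes and return what the kernel reports."""
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_RCVBUF, nbytes)
    return sock.getsockopt(socket.SOL_SOCKET, socket.SO_RCVBUF)


def read_sysctl(path: str):
    """Return the integers stored in a /proc/sys file, or None if unreadable."""
    try:
        with open(path) as f:
            return [int(x) for x in f.read().split()]
    except (OSError, ValueError):
        return None


sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
requested = 64 * 1024
granted = request_rcvbuf(sock, requested)
sock.close()

rmem_max = read_sysctl("/proc/sys/net/core/rmem_max")   # one value
tcp_rmem = read_sysctl("/proc/sys/net/ipv4/tcp_rmem")   # min, default, max
print("requested", requested, "granted", granted)
print("rmem_max", rmem_max, "tcp_rmem", tcp_rmem)
```

On Linux the granted figure is normally double the request, because the kernel reserves space for bookkeeping, and the request itself is first capped at rmem_max.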
Note large TCP receive buffer size is usually required for high latency, high bandwidth, high volume connections. Low latency systems should use smaller TCP buffers.
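The usual rule of thumb behind this is the bandwidth-delay product: the receive buffer should hold roughly one round trip's worth of data. A small sketch, where the link speeds and round-trip times are made-up illustrative figures:

```python
def bandwidth_delay_product(bandwidth_bps: int, rtt_us: int) -> int:
    """Bytes in flight needed to keep a link busy for one round trip.

    bandwidth_bps: link speed in bits per second
    rtt_us: round-trip time in microseconds
    """
    # bits/s * microseconds / 1e6 gives bits; divide by 8 for bytes.
    return bandwidth_bps * rtt_us // 8_000_000


# Illustrative long-haul link: 1 Gbit/s with a 100 ms round trip.
wan_bytes = bandwidth_delay_product(1_000_000_000, 100_000)

# Illustrative co-located link: 10 Gbit/s with a 50 microsecond round trip.
lan_bytes = bandwidth_delay_product(10_000_000_000, 50)

print(wan_bytes)  # 12500000
print(lan_bytes)  # 62500
```

With these assumed numbers the long-haul link wants about 12.5MB of buffer while the co-located link fits within 64KB, which matches the note above.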
For a high-volume multicast channel, you need large receive buffers to guard against data loss, because UDP has no flow control to stop the sender from overflowing the receiver.
/proc/sys/net/ipv4/tcp_window_scaling is a boolean configuration (turned on by default). With window scaling turned on, the limit on AWS (advertised window size) rises to 1GB. If turned off, AWS is constrained to the 16-bit window field in the TCP header, so it can be at most 65535.
I think this flag affects AWS and not receive buffer size.
- if turned on, and if buffer is configured to grow beyond 64KB, then Ack can set AWS to above 65535.
- if turned off, then we don’t (?) need a large buffer since AWS can only be 65535 or lower.
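The arithmetic behind these limits is easy to check. The sketch below assumes the standard TCP window scale option, where the 16-bit window field is left-shifted by a factor of at most 14; the function names are made up, and how the kernel actually picks the shift for a given socket is not modelled here.

```python
MAX_UNSCALED_WINDOW = 0xFFFF   # largest value of the 16-bit window field
MAX_WINDOW_SHIFT = 14          # largest shift the window scale option allows


def max_window(shift: int) -> int:
    """Largest advertised window, in bytes, for a given scale shift."""
    if not 0 <= shift <= MAX_WINDOW_SHIFT:
        raise ValueError("shift must be between 0 and 14")
    return MAX_UNSCALED_WINDOW << shift


def min_shift_for(buffer_bytes: int) -> int:
    """Smallest shift whose maximum window covers buffer_bytes (capped at 14)."""
    for shift in range(MAX_WINDOW_SHIFT + 1):
        if max_window(shift) >= buffer_bytes:
            return shift
    return MAX_WINDOW_SHIFT


print(max_window(0))                       # 65535
print(max_window(14))                      # 1073725440
print(min_shift_for(256 * 1024 * 1024))    # 13
```

So the 1GB limit is really 65535 shifted left by 14, and a 256M buffer could only be fully advertised with a shift of 13 or more.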
A Wells interviewer asked about window size when I said we use large receive buffers. What’s the relationship between the 2?
- tcp/udp receive buffer sizes
- tcp window size — see https://en.wikipedia.org/wiki/TCP_tuning#Window_size
There are many high-level, easy-to-read articles, so this is low-hanging fruit.
I guess some HFT machines might be dedicated to one (or a few) processes, but in general, multiple applications often share one host. A low latency system may actually prefer this, due to the shared memory messaging advantage. In such a set-up, it’s extremely useful to pinpoint exactly which process, which socket, which network port is responsible for high bandwidth usage.
Solaris 10? Using Dtrace? tough? See [[solaris performance and tools]]
# use iptraf to see how much traffic is flowing through a given network interface.
# given a specific network interface, use iptraf to see the traffic breakdown by individual ports. If you don’t believe it, [[optimizing linux perf ]] P202 has an iptraf screenshot showing the per-port volumes
# given a specific port, use netstat or lsof to see the process PID using that port.
# given a PID, use strace and /proc/[pid]/fd to drill down to the socket (among many) responsible for the traffic. Socket is seldom shared (see other posts) between processes. I believe strace/ltrace can also reveal which user functions make those socket system calls.
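The last two steps can also be scripted against /proc. The sketch below is a Linux-only illustration of what netstat and lsof do under the hood, assuming the usual /proc/net/tcp column layout (local address in the second column, socket inode in the tenth); the function names are made up, and seeing other users' processes normally requires root.

```python
import os


def inodes_for_port(proc_net_tcp_text: str, port: int) -> set:
    """Return socket inodes whose local port matches, from /proc/net/tcp text."""
    inodes = set()
    for line in proc_net_tcp_text.splitlines()[1:]:   # skip the header row
        fields = line.split()
        if len(fields) < 10:
            continue
        # The local address looks like 0100007F:1F90 (hex IP : hex port).
        local_port = int(fields[1].rsplit(":", 1)[1], 16)
        inode = fields[9]
        if local_port == port and inode != "0":
            inodes.add(inode)
    return inodes


def pids_holding(inodes: set) -> dict:
    """Map PID to the list of fd numbers that refer to any of the inodes."""
    owners = {}
    if not inodes:
        return owners
    targets = {"socket:[%s]" % i for i in inodes}
    for pid in filter(str.isdigit, os.listdir("/proc")):
        fd_dir = "/proc/%s/fd" % pid
        try:
            fds = os.listdir(fd_dir)
        except OSError:          # process exited, or permission denied
            continue
        for fd in fds:
            try:
                if os.readlink("%s/%s" % (fd_dir, fd)) in targets:
                    owners.setdefault(int(pid), []).append(int(fd))
            except OSError:      # fd closed while we were looking
                continue
    return owners
```

A typical use would be to read /proc/net/tcp, call inodes_for_port(text, 8080), and pass the result to pids_holding() to get the PIDs and file descriptors behind that port.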