TCP blocking send() timeout

I see three modes:

  1. non-blocking send() — immediate return if unable to send
  2. regular blocking send() — can block indefinitely, so the thread can’t do anything else
  3. blocking send() with timeout:

SO_SNDTIMEO: sets the timeout value specifying the amount of time that an output function blocks because flow control prevents data from being sent. If a send operation has blocked for this time, it shall return with a partial count or with errno set to [EAGAIN] or [EWOULDBLOCK] if no data is sent. The default for this option is zero, which indicates that a send operation shall not time out. This option stores a timeval structure. Note that not all implementations allow this option to be set.
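
For illustration, here is a minimal sketch of mode 3 (my own code, not from any library mentioned here); the helper name and the 2-second timeout are arbitrary:

```c
/* Sketch: cap a blocking send() at 2 seconds via SO_SNDTIMEO.
 * Error handling is trimmed for brevity. */
#include <errno.h>
#include <stdio.h>
#include <sys/socket.h>
#include <sys/time.h>
#include <sys/types.h>

static ssize_t send_with_timeout(int sockfd, const void *buf, size_t len)
{
    struct timeval tv = { .tv_sec = 2, .tv_usec = 0 };  /* 0 would mean "never time out" */
    if (setsockopt(sockfd, SOL_SOCKET, SO_SNDTIMEO, &tv, sizeof tv) < 0)
        perror("setsockopt(SO_SNDTIMEO)");              /* not all implementations allow it */

    ssize_t n = send(sockfd, buf, len, 0);
    if (n < 0 && (errno == EAGAIN || errno == EWOULDBLOCK))
        fprintf(stderr, "send() timed out with no data sent\n");
    else if (n >= 0 && (size_t)n < len)
        fprintf(stderr, "partial count: %zd of %zu bytes sent\n", n, len);
    return n;
}
```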

In the xtap library, timeout isn’t implemented at all; the default is non-blocking. If we configure it to use mode 2), we can hit a strange problem — one of three receivers gets stuck but keeps its connection open, and the other receivers are starved even though their receive buffers are free.


de-multiplex packets bearing same dest ip:port, different source

see de-multiplex by-destPort: UDP ok but insufficient for TCP

For UDP, the 2 packets are always delivered to the same destination socket. Source IP:port are ignored.

For TCP, if there are two matching worker sockets … then the packets are delivered to them separately. Perhaps two ssh sessions.

If there’s only a listening socket, then both packets delivered to the same socket, which has wild cards for remote ip:port.
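
To make that concrete, a small sketch (port 9000 and the code are my own, purely illustrative): one UDP socket bound to the destination port receives datagrams from any number of sources, and recvfrom() merely reports who sent each one:

```c
/* Sketch: one UDP socket bound to a dest port receives datagrams from any source;
 * the kernel demultiplexes on the two-tuple only. */
#include <arpa/inet.h>
#include <netinet/in.h>
#include <stdio.h>
#include <sys/socket.h>
#include <sys/types.h>

int main(void)
{
    int fd = socket(AF_INET, SOCK_DGRAM, 0);

    struct sockaddr_in local = {0};
    local.sin_family = AF_INET;
    local.sin_addr.s_addr = htonl(INADDR_ANY);
    local.sin_port = htons(9000);                    /* hypothetical dest port */
    bind(fd, (struct sockaddr *)&local, sizeof local);

    for (;;) {
        char buf[1500], src_ip[INET_ADDRSTRLEN];
        struct sockaddr_in src;
        socklen_t slen = sizeof src;
        /* both packets land here regardless of their source ip:port */
        ssize_t n = recvfrom(fd, buf, sizeof buf, 0, (struct sockaddr *)&src, &slen);
        inet_ntop(AF_INET, &src.sin_addr, src_ip, sizeof src_ip);
        printf("%zd bytes from %s:%u on the SAME socket\n", n, src_ip, ntohs(src.sin_port));
    }
}
```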

UDP socket is identified by two-tuple; TCP socket is by four-tuple

Based on [[computer networking]] P192. See also de-multiplex by-destPort: UDP ok but insufficient for TCP

  • Note the term in the subject is “socket”, not “connection”. UDP is connection-less.

A TCP segment has four header fields for Source IP:port and destination IP:port.

A TCP socket has internal data structure for a four-tuple — Remote IP:port and local IP:port.

A regular TCP “Worker socket” has all four items populated, to represent a real “session/connection”, but a Listening socket could have wild cards in all but the local-port field.
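
Here is a short sketch (my own helper, not from the book) that prints a TCP socket’s four-tuple; on a listening socket getpeername() fails, which is one way to observe the wild-carded remote side:

```c
/* Sketch: print the four-tuple of a TCP socket (IPv4 only for brevity). */
#include <arpa/inet.h>
#include <netinet/in.h>
#include <stdio.h>
#include <sys/socket.h>

static void print_four_tuple(int fd)
{
    struct sockaddr_in local, remote;
    socklen_t len = sizeof local;
    char lip[INET_ADDRSTRLEN], rip[INET_ADDRSTRLEN];

    getsockname(fd, (struct sockaddr *)&local, &len);
    inet_ntop(AF_INET, &local.sin_addr, lip, sizeof lip);

    len = sizeof remote;
    if (getpeername(fd, (struct sockaddr *)&remote, &len) == 0) {
        /* worker socket: all four items populated */
        inet_ntop(AF_INET, &remote.sin_addr, rip, sizeof rip);
        printf("local %s:%u  remote %s:%u\n", lip, ntohs(local.sin_port),
               rip, ntohs(remote.sin_port));
    } else {
        /* listening socket: remote side is wild-carded, getpeername() fails */
        printf("local %s:%u  remote *:*  (listening socket)\n",
               lip, ntohs(local.sin_port));
    }
}
```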

fragmentation: IP^TCP #retrans

See also IP (de)fragmentation #MTU,offset

Interviews are unlikely to go this deep, but it’s good to over-prepare here. This comparison ties together many loose ends like Ack, retrans, and seq resets.

[1] IP fragmentation can cause excessive retransmissions when fragments encounter packet loss and reliable protocols such as TCP must retransmit ALL of the fragments in order to recover from the loss of a SINGLE fragment
[2] see TCP seq# never looks like 1,2,3

                   | IP fragmentation                        | TCP fragmentation
minimum guarantees | all-or-nothing; never a partial packet  | stream in-sequence without gap
reliability        | unreliable                              | fully reliable
name for a “part”  | fragment                                | segment
sequencing         | each fragment has an offset             | each segment has a seq#
.. continuous?     | yes                                     | no! [2]
.. reset?          | yes, for each packet                    | loops back to 0 right before overflow
Ack                | no such thing                           | positive Ack needed
gap detection      | using offset                            | using seq# [2]
id for the “msg”   | identification number                   | no such thing
end-of-msg         | flag in last fragment                   | no such thing
out-of-sequence?   | likely                                  | likely
.. reassembly      | based on id/offset/flag                 | based on seq#
.. retrans         | not by IP [1]                           | commonplace
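
As a side note, a sketch of how an application can sidestep IP fragmentation on Linux (my own illustration using standard Linux socket options): force Path-MTU discovery so the stream is cut into TCP segments of at most MSS bytes rather than into IP fragments:

```c
/* Sketch: set DF via Path-MTU discovery so the IP layer never fragments,
 * and read back the MSS that TCP will use for segmentation. */
#include <netinet/in.h>
#include <netinet/tcp.h>
#include <stdio.h>
#include <sys/socket.h>

static void avoid_ip_fragmentation(int tcp_fd)
{
    int pmtu = IP_PMTUDISC_DO;          /* DF bit on: never fragment at the IP layer */
    setsockopt(tcp_fd, IPPROTO_IP, IP_MTU_DISCOVER, &pmtu, sizeof pmtu);

    int mss = 0;
    socklen_t len = sizeof mss;         /* MSS is meaningful once the connection is up */
    getsockopt(tcp_fd, IPPROTO_TCP, TCP_MAXSEG, &mss, &len);
    printf("TCP will cut the stream into segments of at most %d bytes\n", mss);
}
```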

retrans: FIX^TCP^xtap

The FIX part is very relevant to real-world OMS. The devil is in the details.

IP layer offers no retrans. UDP doesn’t support retrans.

                | TCP                                        | FIX                                                                 | xtap
seq# continuous | no                                         | yes                                                                 | yes
.. reset        | automatic loopback                         | managed by application                                              | seldom (exchange decision)
.. dup          | possible                                   | possible                                                            | normal under bestOfBoth
.. per session  | per connection                             | per clientId                                                        | per day
.. resumption?  | possible if wire gets reconnected quickly  | yes, upon re-login                                                  | unconditional; no choice
Ack             | positive Ack needed                        | only needed for order submission etc                                | not needed
gap detection   | sophisticated                              | every gap should be handled immediately since sequence is critical  | gap mgr with timer
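
To make “gap mgr with timer” a bit more concrete, here is a toy gap detector (names and logic are mine, not xtap’s) built on FIX-style continuous seq#s; a real gap manager would arm a timer before requesting a resend:

```c
/* Toy gap detector over a continuous seq# stream (FIX-style, starting at 1). */
#include <stdint.h>
#include <stdio.h>

static uint64_t expected_seq = 1;

static void on_message(uint64_t seq)
{
    if (seq == expected_seq) {
        expected_seq++;                              /* in order: just advance */
    } else if (seq > expected_seq) {
        /* gap detected: a real gap manager would start a timer here and
         * request a resend of [expected_seq, seq-1] if it isn't filled in time */
        printf("gap: missing %llu..%llu\n",
               (unsigned long long)expected_seq, (unsigned long long)(seq - 1));
        expected_seq = seq + 1;
    } else {
        printf("duplicate/resent msg %llu ignored\n", (unsigned long long)seq);
    }
}
```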

de-multiplex by-destPort: UDP ok but insufficient for TCP

When people ask me what the purpose of the port number in networking is, I used to say that it helps demultiplex. Now I know that’s true only for UDP; TCP uses more than the destination port number.

Background — two processes X and Y on a single-IP machine need to maintain two private, independent ssh sessions. The incoming packets need to be directed to the correct process, based on the port numbers of X and Y … or are they?

If X is sshd with a listening socket on port 22, and Y is a forked child process from accept(), then Y’s “worker socket” also has local port 22. That’s why in our linux server, I see many ssh sockets where the local ip:port pairs are indistinguishable.

TCP demultiplex uses not only the local ip:port, but also remote (i.e. source) ip:port. Demultiplex also considers wild cards.

                                                          | TCP                                     | UDP
socket has local IP:port                                  | yes                                     | yes
socket has remote IP:port                                 | yes                                     | no such thing
2 sockets with same local port 22 ???                     | can live in two processes               | not allowed
                                                          | can live in one process                 | not allowed
2 msg with same dest ip:port but different source ports   | addressed to 2 sockets; 2 ssh sessions  | addressed to the same socket
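
A standalone sketch of the sshd pattern above (this is not the sshd source, and binding port 22 needs root): after accept(), the worker socket shares local port 22 with the listener, so only the remote ip:port half of the four-tuple tells the kernel which socket an incoming segment belongs to:

```c
/* Sketch: listener + forked worker, both with local port 22. */
#include <netinet/in.h>
#include <sys/socket.h>
#include <unistd.h>

int main(void)
{
    int listen_fd = socket(AF_INET, SOCK_STREAM, 0);

    struct sockaddr_in addr = {0};
    addr.sin_family = AF_INET;
    addr.sin_addr.s_addr = htonl(INADDR_ANY);
    addr.sin_port = htons(22);                       /* same local port as sshd */
    bind(listen_fd, (struct sockaddr *)&addr, sizeof addr);
    listen(listen_fd, 16);

    for (;;) {
        /* worker socket: local port is still 22, but remote ip:port is now filled in */
        int worker_fd = accept(listen_fd, NULL, NULL);
        if (fork() == 0) {                           /* child owns the worker socket */
            close(listen_fd);
            /* ... serve this one session, like a forked sshd child ... */
            close(worker_fd);
            _exit(0);
        }
        close(worker_fd);                            /* parent keeps only the listener */
    }
}
```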

which thread/pid drains NicBuffer->socketBuffer

Too many kernel concepts. I will use a phrasebook format. I have also separated some independent tips into hardware interrupt handler #phrasebook

  1. Scenario 1: a single CPU. I start my parser, which creates the multicast receiver socket, but no data is coming. My pid111 gets preempted. The CPU is running an unrelated pid222 when data /washes up/ (see the sketch after this list).
  2. Scenario 2: pid111 is running handleInput() while additional data comes in on the NIC.
  • context switching — to interrupt handler (i-handler). In all scenarios, the running process gets suspended to make way for the interrupt handler function. I-handler’s instruction address gets loaded into the cpu registers and it starts “driving” the cpu. Traditionally, the handler used the suspended process’s existing stack.
    • After the i-handler completes, the suspended “current” process resumes by default. However, the handler may cause another pid to be scheduled right away [1 Chapter 4.1].
  • no pid — interrupt handler execution has no pid, though some authors say it runs on behalf of the suspended pid. I feel the suspended pid may be unrelated to the socket, rather than the socket’s owner process (pid111).
  • kernel scheduler — In Scenario 1, pid111 would not get to process the data until it gets in the “driver’s seat” again. However, the interrupt handler could trigger a rescheduling and push pid111 “to the top” so to speak. [1 Chapter 4.1]
  • top-half — drains the tiny NIC buffer into main memory as fast as possible [2]
  • bottom-half — (i.e. deferrable functions) includes lengthy tasks like copying packets. Deferrable functions run in interrupt context [1 Chapter 4.8], so there’s no pid
  • sleeping — the socket owner pid111 would be technically “sleeping” in the socket’s wait queue initially. After the data is copied into the socket receive buffer, I think the kernel scheduler would locate pid111 in the socket’s wait queue and make pid111 the driver. Pid111 would call read() on the socket.
    • wait queue — How the scheduler does it is non-trivial. See [1 Chapter 3.2.4.1]
  • burst — What if there’s a burst of multicast packets? The i-handler would hog or steal the driver’s seat and /drain/ the NIC buffer as fast as possible, and populate the socket receive buffer. When the i-handler takes a break, our handleInput() would chip away at the socket buffer.
    • priority — is given to the NIC’s interrupt handler, since we have a single CPU.
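
As mentioned in Scenario 1, here is a userspace sketch (group 239.1.1.1 and port 5000 are made up): the parser blocks in read(), i.e. sleeps on the socket’s wait queue, until the i-handler and the deferrable functions have filled the socket receive buffer:

```c
/* Sketch of Scenario 1 from userspace: create a multicast receiver socket,
 * then block until the kernel has drained NIC buffer -> socket receive buffer. */
#include <arpa/inet.h>
#include <netinet/in.h>
#include <stdio.h>
#include <sys/socket.h>
#include <unistd.h>

int main(void)
{
    int fd = socket(AF_INET, SOCK_DGRAM, 0);

    struct sockaddr_in addr = {0};
    addr.sin_family = AF_INET;
    addr.sin_addr.s_addr = htonl(INADDR_ANY);
    addr.sin_port = htons(5000);                          /* hypothetical port */
    bind(fd, (struct sockaddr *)&addr, sizeof addr);

    struct ip_mreq mreq;
    inet_pton(AF_INET, "239.1.1.1", &mreq.imr_multiaddr); /* hypothetical group */
    mreq.imr_interface.s_addr = htonl(INADDR_ANY);
    setsockopt(fd, IPPROTO_IP, IP_ADD_MEMBERSHIP, &mreq, sizeof mreq);

    char buf[2048];
    /* read() blocks here: this pid is parked on the socket's wait queue until
     * data has been copied into the socket receive buffer */
    ssize_t n = read(fd, buf, sizeof buf);
    printf("handleInput() can now drain %zd bytes from the socket buffer\n", n);
    close(fd);
    return 0;
}
```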

Q: What if the process scheduler wants to run while i-handler is busy draining the NIC?
A: Well, all interrupt handlers can be interrupted, but I would doubt the process scheduler would suspend the NIC interrupt handler.

One friend said the pid is 1, the kernel process.

[1] [[UnderstandingLinuxKernel, 3rd Edition]]

[2] https://notes.shichao.io/lkd/ch7/#top-halves-versus-bottom-halves

linux tcp buffer^AWS tuning params

—receive buffer configuration
In general, there are two ways to control how large a TCP socket receive buffer can grow on Linux:

  1. You can call setsockopt(SO_RCVBUF) explicitly to set the max receive buffer size on individual TCP/UDP sockets
  2. Or you can leave it to the operating system and allow it to auto-tune the buffer dynamically, using the global tcp_rmem values as a hint.
  3. … both values are capped by the global limit below.

/proc/sys/net/core/rmem_max — a global hard limit on all sockets (TCP/UDP). I see 256M on my system. Can you set it to 1GB? I’m not sure, but it’s probably unaffected by the boolean flag below.

/proc/sys/net/ipv4/tcp_rmem — doesn’t override SO_RCVBUF. The max value on my system is again 256M. The receive buffer for each socket is adjusted dynamically by the kernel at runtime.

The linux “tcp” manpage explains the relationship.
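
A sketch of option 1 above (the helper name is mine): as far as I know, Linux doubles whatever value you pass to SO_RCVBUF to leave room for bookkeeping, caps it at rmem_max, and switches off auto-tuning for that socket, so getsockopt() is the way to see what you actually got:

```c
/* Sketch: request an explicit receive buffer size and read back what the kernel granted. */
#include <stdio.h>
#include <sys/socket.h>

static void set_rcvbuf(int fd, int requested_bytes)
{
    setsockopt(fd, SOL_SOCKET, SO_RCVBUF, &requested_bytes, sizeof requested_bytes);

    int actual = 0;
    socklen_t len = sizeof actual;
    getsockopt(fd, SOL_SOCKET, SO_RCVBUF, &actual, &len);
    printf("asked for %d bytes, kernel granted %d\n", requested_bytes, actual);
}
```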

Note that a large TCP receive buffer is usually required for high-latency, high-bandwidth, high-volume connections. Low-latency systems should use smaller TCP buffers.

For high-volume multicast feeds, you need large receive buffers to guard against data loss — the UDP sender doesn’t obey flow control, so nothing prevents receiver overflow.

—AWS

/proc/sys/net/ipv4/tcp_window_scaling is a boolean configuration (turned on by default). With window scaling on, 1GB becomes the new limit on the AWS (advertised window size). If turned off, then the AWS is constrained by the 16-bit window field in the TCP header — max 65535.

I think this flag affects AWS and not receive buffer size.

  • if turned on, and if the buffer is configured to grow beyond 64KB, then the Ack can advertise an AWS above 65535.
  • if turned off, then we don’t need a large buffer since the AWS can only be 65535 or lower.