close(): with timeout or non-blocking #CSY

I was right that close() can be non-blocking even on a TCP socket. It is the default.

When you specify SO_LINGER, you can also give a timeout to control how long close() blocks on a TCP socket.

So there are 3 modes on a TCP socket:

  • close() blocking with timeout via SO_LINGER
  • close() blocking without timeout
  • close() nonblocking
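
A minimal Python sketch of how the linger setting could be inspected and changed. It assumes a Linux-style struct linger made of two C ints (l_onoff, l_linger). The helper names set_linger and get_linger are my own, not from any library.

```python
import socket
import struct

LINGER_FORMAT = "ii"  # assumption: struct linger is two C ints (l_onoff, l_linger)

def set_linger(sock, enabled, seconds):
    """Turn SO_LINGER on or off; 'seconds' caps how long close() may block."""
    value = struct.pack(LINGER_FORMAT, 1 if enabled else 0, seconds)
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_LINGER, value)

def get_linger(sock):
    """Return (enabled, seconds) as currently stored on the socket."""
    raw = sock.getsockopt(socket.SOL_SOCKET, socket.SO_LINGER,
                          struct.calcsize(LINGER_FORMAT))
    onoff, seconds = struct.unpack(LINGER_FORMAT, raw)
    return bool(onoff), seconds
```

On a fresh TCP socket get_linger() should report the option as off, which matches the non-blocking default described above. After set_linger(sock, True, 5), close() may block for up to roughly 5 seconds while unsent data drains.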

FIX^MOM: exchange connectivity

The upshot: some subsystems use FIX, not MOM; some subsystems use MOM, not FIX. A site could send FIX over MOM (such as RV), but that is not common.

It’s important to realize FIX is not a type of MOM system. Even though FIX uses messages, I believe they are typically sent over persistent TCP sockets, without middle-ware or buffering/queuing. RTS is actually a good example.

Q: does FIX have a huge message buffer? I think it’s small, like TCP, but TCP has flow control, so the sender will wait for a slow receiver.

Say we (sell-side or buy-side) have direct connectivity to an exchange. Between the exchange and our exterior gateway, it’s all sockets and possibly java NIO, with no MOM. Data format could be FIX, or it could be the exchange’s proprietary format. Here’s why.

The exchange often has an internal client-server protocol: the real protocol and a data flow “choke point”. All client-server request/response relies solely on this protocol. The exchange often builds a FIX translation over this protocol (for obvious reasons). If we use their FIX protocol, our messages are internally translated from FIX to the exchange proprietary format. Therefore it’s obviously faster to directly pass messages in the exchange proprietary format. The exchange offers this access to selected hedge funds.

As an espeed developer told me, they ship a C-level dll (less often *.so [1]) library (not c++, not static library), to be used by the hedge fund’s exterior gateway. The hedge fund application uses the C functions therein to communicate with the exchange. This communication protocol is possibly proprietary. HotspotFX has a similar client-side library in the form of a jar. There’s no MOM here. The hedge fund gateway engine makes synchronous function calls into the C library.

[1] most trader workstations are on win32.

Note the dll or jar is not running in its own process, but rather loaded into the client process just like any dll or jar.

Between this gateway machine and the trading desk interior hosts, the typical communication is dominated by MOM such as RV, 29West or Solace. In some places, the gateway translates/normalizes messages into an in-house standard format.

TCP sliding-window: congestion^receiver

Based on P 248/265 [[computerNetworking]] , 2011 paper@high-perf messaging, and my blogpost on AWS^SWS

Sliding window is a protocol or a design framework. AdvertizedWindowSize is an integer field in the Ack message. There are also two variables “ReceiveWindow/CongestionWindow” in the TCP sender codebase (part of kernel).

Sliding window kills two birds with one stone — 1) congestion control and 2) overflow protection on the receiving end. Without these controls, sender can pump packets too fast causing

  • receiver socket buffer overflows
  • congestion on the sender-to-receiver path so some of the router buffers overflow

The basic idea is to limit the sending rate such that

  • unacknowledged bytes := lastByteSent – lastByteAcked ≤ ReceiveWindow
  • unacknowledged bytes := lastByteSent – lastByteAcked ≤ CongestionWindow
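
The two limits can be combined into a single check. The sketch below is only an illustration of the arithmetic, not kernel code; the function and parameter names are my own.

```python
def sendable_bytes(last_byte_sent, last_byte_acked, receive_window, congestion_window):
    """How many more bytes the sender may put on the wire right now."""
    in_flight = last_byte_sent - last_byte_acked       # unacknowledged bytes
    limit = min(receive_window, congestion_window)     # the tighter window wins
    return max(0, limit - in_flight)
```

For example, with 2000 bytes in flight, a ReceiveWindow of 8000 and a CongestionWindow of 4000, the sender may send 2000 more bytes. If the ReceiveWindow drops to 1500, the sender has to wait.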

UDP doesn’t limit sending rate and is favored by many aggressive applications such as market data dissemination.

TCP blocking send() timeout #details

See also recv()/send() with timeout #CSY

I see three modes

  1. non-blocking send() — immediate return if unable to send
  2. blocking send() without timeout — blocks forever, so the thread can’t do anything.
  3. blocking send() with timeout —

SO_SNDTIMEO: sets the timeout value specifying the amount of time that an output function blocks because flow control prevents data from being sent. If a send operation has blocked for this time, it shall return with a partial count or with errno set to [EAGAIN] or [EWOULDBLOCK] if no data is sent. The default for this option is zero, which indicates that a send operation shall not time out. This option stores a timeval structure. Note that not all implementations allow this option to be set.
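
A hedged sketch of mode 3 in Python. It assumes a Linux-like platform where SO_SNDTIMEO takes a struct timeval of two C longs, and it uses a Unix-domain socketpair as a stand-in for a TCP connection whose receiver never reads. The helper names are my own.

```python
import socket
import struct

def set_send_timeout(sock, seconds):
    """Set SO_SNDTIMEO. Assumption: struct timeval is two C longs (sec, usec)."""
    sec = int(seconds)
    usec = int((seconds - sec) * 1_000_000)
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_SNDTIMEO,
                    struct.pack("ll", sec, usec))

def send_until_timeout(sock, chunk=b"x" * 65536):
    """Keep sending on a blocking socket until a send() times out with no data sent."""
    total = 0
    while True:
        try:
            total += sock.send(chunk)   # may return a partial count
        except BlockingIOError:         # errno EAGAIN / EWOULDBLOCK after the timeout
            return total
```

With a peer that never calls recv(), send_until_timeout() should return once the buffers are full and one send() has waited out the timeout, instead of blocking forever as in mode 2.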

In the xtap library, timeout isn’t implemented at all. Default is non-blocking. If we configure it to use mode 2, we can hit a strange problem: one of three receivers gets stuck but keeps its connection open. The sending thread then blocks in send() to that receiver, so the other receivers are starved even though their receive buffers are free.

4buffers in 1 TCP connection #full-duplex

Many socket programmers may not realize there are four transmission buffers in a single TCP connection or TCP session. These buffers are set aside by the kernel when the socket is created.

Two buffers in socket AA (say in U.S.) + two buffers in socket BB (say, in Singapore)

Two receive buffers + two send buffers

At any time, there could be data in all four buffers. AA can initiate a send in the middle of receiving data because TCP is a full-duplex protocol.

Whenever one of the two sockets initiates a send, it has a duty to ensure it never overflows the receive buffer on the other end. This is the essence of flow-control. This flow-control relies on the 2 buffers involved (not the other two buffers). P 247 [[computer networking]] has details. Basically, the sender has an estimate of the remaining free space in the receive buffer, so it never sends too many bytes. It keeps unsent data in its send buffer.

Is UDP full-duplex? My book doesn’t mention it. I think it is.

Ack in tcp # phrasebook

Ack: returned by receiver to original sender … on every segment. See P241 [[computerNetworking]]

byte-level: TCP seq number is at byte level and it jumps. Ack is such a seq number.

expected: Ack number is the next seq number expected

proactive Ack: Never. Receiver will Never send an Ack if it has not received anything. I think this means receiver can’t detect unplugged wire.

zero AWS: gradually the AWS value in the Ack will drop to zero

1-byte probe: See [1]

Slow-receiver: TCP flow-control is only evident with a slow receiver

retrans: sender always resends after the retransmission timeout (RTO) expires without an Ack. See tcp: detect wire unplugged



single/multi-thread TCP servers contrasted

In all cases, listening socket LL and worker sockets W4 W5 look like

local/server port   local/server ip   remote/client ip   remote/client port   socket   file descriptor
80                  ANY               unbound            unbound              LL       3
80                  (not shown)       (not shown)        high port A          W4       4
80                  (not shown)       (not shown)        high port B          W5       5 (newly created)

Now the 3 designs. [1] is described in accept()+select() : multiple persistent worker-sockets

LL, W4 and W5 under each design:

  • single-thr primitive: LL is not used by the “engaged” thr; the worker socket is sending data
  • single-thr multiplexing[1]: LL is not used by the “engaged” thr, but will be monitored via select(); the worker socket is sending data
  • multi-threaded: LL is listening; W4 is Established, sending or idle; W5 is sending

tcp: detect wire unplugged

In general for either producer or consumer, the only way to detect peer-crash is by probing (eg: keepalive, 1-byte probe, RTO…).

  • Receiver generally don’t probe and will remain oblivious.
  • Sender will always notice a missing Ack (timeout OR 3 duplicate Acks). After retrying, TCP module will give up and generate SIGPIPE.
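
The four scenarios compared:

  • send/recv buffers full
    • visible symptom: 1-byte probe from sender triggers Ack containing AWS=0
    • retrans by sender: yes
    • SIGPIPE: no
  • buffer full then receiver-crash
    • visible symptom: the same Ack suddenly stops coming
    • retrans by sender: yes
    • SIGPIPE: probably
  • receiver-crash then sender has data to send
    • visible symptom: very first expected Ack doesn’t come
    • retrans by sender: yes
    • SIGPIPE: yes
  • receiver-crash amid active transmission
    • visible symptom: Ack was coming in then suddenly stops coming
    • retrans by sender: yes
    • SIGPIPE: probably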

Q20: if a TCP producer process dies After transmission, what would the consumer get?
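AA: nothing, if the producer host crashes or the wire is cut. Receiver is ready to receive data, and has no idea that sender has crashed. (If only the process dies while the host stays up, the kernel normally closes the socket and sends FIN, so the consumer’s recv() returns 0.)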
AA: Same answer on

Q21: if a TCP producer process dies During transmission, what would the consumer get?
%A: ditto. Receiver has to assume sender stopped.

Q30: if a TCP consumer process dies during a quiet period After transmission, what would the producer get?
AA: P49 [[tcp/ip sockets in C]] Sender doesn’t know right away. At the next send(), sender will get -1 as return value. In addition, SIGPIPE will also be delivered, unless configured otherwise.

Q30b: Is SIGPIPE generated immediately or after some retries?
AA: describes Ack and re-transmission. Sender will notice a missing Ack and RTO will kick in.
%A: I feel TCP module will Not give up prematurely. Sometimes a wire is quickly restored during ftp, without error. If wire remains unplugged it would look exactly like a peer crash.

Q31: if a TCP consumer dies During transmission, what would the producer get?
%A: see Q30.

Q32: if a TCP consumer process dies some time after buffer full, what would the producer get?
%A: probably similar to above, since sender would send a 1-byte probe to trigger an Ack. Not getting the Ack tells sender something. This probe is builtin and mandatory, but functionally similar to (the optional) TCP Keepalive feature.

I never studied these topics but they are quite common.

Q: same local IP@2 worker-sockets: delivery to who

Suppose two TCP server worker-sockets both have the same local address and port 80. Both connections active.

When a packet comes addressed to that address and port, which socket will kernel deliver it to? Not both.

Not sure about UDP, but TCP connection is a virtual circuit or “private conversation”, so Socket W1 knows its client’s IP:port. If the incoming packet doesn’t have source IP:port matching that, then this packet doesn’t belong to the conversation.
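
A small loopback experiment in Python can make this concrete. It is only a sketch: the function name is my own, and the port is an ephemeral one rather than 80 so it runs without privileges.

```python
import socket

def demo_two_workers():
    """Two clients connect to one listener; each worker sees only its own client's bytes."""
    listener = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    listener.bind(("127.0.0.1", 0))          # ephemeral port stands in for port 80
    listener.listen(5)
    server_addr = listener.getsockname()

    c1 = socket.create_connection(server_addr)
    c2 = socket.create_connection(server_addr)

    workers = {}                              # keyed by the client's (ip, port)
    for _ in range(2):
        worker, peer = listener.accept()
        workers[peer] = worker

    c1.sendall(b"from-c1")
    c2.sendall(b"from-c2")
    got1 = workers[c1.getsockname()].recv(100)
    got2 = workers[c2.getsockname()].recv(100)

    local_endpoints = {w.getsockname() for w in workers.values()}
    for s in (c1, c2, listener, *workers.values()):
        s.close()
    return local_endpoints, got1, got2
```

Both workers report the same local address:port, yet each recv() returns only the bytes from its own client, because the kernel matches on the remote IP:port as well.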

new socket from accept()inherits port# from listen`socket

[[tcp/ip sockets in C]] P96 has detailed diagrams, but this write-up is based on

Background — a socket is an object in memory, with dedicated buffers.

We will look at an sshd server listening on port 22. When accept() returns without error, the return value is a positive integer file descriptor pointing to a new-born socket object with its own buffers. I call it the “worker” socket. It inherits almost every property from the listening socket (but see tcp listening socket doesn’t use a lot of buffers for differences.)

  • local (i.e. server-side) port is 22
  • local address is a single address that client source code specified. Note sshd server could listen on ANY address, or listen on a single address.
  • … so if you look at the local address:port, the worker socket and the listening socket may look identical, but they are two objects!
  • remote port is some random high port client host assigned.
  • remote address is … obvious
  • … so if you compare the worker-socket vs listening socket, they have
    • identical local port
    • different local ip
    • different remote ip
    • different remote port
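
The comparison can be checked with a short Python sketch. It assumes the listener binds to ANY (shown by Python as 0.0.0.0) on an ephemeral port instead of port 22; the function name is my own.

```python
import socket

def compare_listener_and_worker():
    """Return the address tuples of a listening socket and the worker it produces."""
    listener = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    listener.bind(("", 0))                    # "" means ANY local address
    listener.listen(5)
    port = listener.getsockname()[1]

    client = socket.create_connection(("127.0.0.1", port))
    worker, _ = listener.accept()

    result = {
        "listener_local": listener.getsockname(),   # ('0.0.0.0', port)
        "worker_local": worker.getsockname(),       # ('127.0.0.1', port)
        "worker_remote": worker.getpeername(),
        "client_local": client.getsockname(),
        "distinct_fds": listener.fileno() != worker.fileno(),
    }
    for s in (client, worker, listener):
        s.close()
    return result
```

The worker and the listener share the local port but are two separate descriptors, and only the worker has a remote address:port filled in.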

tcp listening socket doesn’t use a lot of buffers, probably

The server-side “worker” socket relies on the buffers. So does the client-side socket, but the listening socket probably doesn’t need the buffers since it exchanges very little data with the client.

That’s one difference between listening socket vs worker socket

2nd difference is the server (i.e. local) address field. Worker socket has a single server address filled in. The listening socket often has “ANY” as its server address.

non-forking STM tcp server

This is a simple design. In contrast, I proposed a multiplexing design for a non-forking single-threaded tcp server in accept()+select() : multiple persistent worker-sockets

ObjectSpace manual P364 shows a non-forking single-threaded tcp server. It exchanges some data with each incoming client, immediately disconnects the client and goes back to accept()

Still, when accept() returns, it returns with a new “worker” socket that’s already connected to the client.

This new “worker” socket has the same port as the listening socket.

Since the worker socket object is a local variable in the while() loop, it goes out of scope and gets destructed right away.
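
This is not the ObjectSpace code, only a rough Python sketch of the same loop. The function name, the upper-casing "protocol" and the max_clients cut-off are my own assumptions so that the example terminates.

```python
import socket

def serve_one_at_a_time(listener, max_clients):
    """Non-forking, single-threaded: handle one client fully, then accept the next."""
    for _ in range(max_clients):
        worker, peer = listener.accept()      # new worker socket, already connected
        with worker:                          # closed as soon as the block ends
            data = worker.recv(1024)
            worker.sendall(data.upper())      # exchange some data, then disconnect
```

While one client is being served, any other client simply waits in the listen queue.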

y tcp server listens on ANY address

Background: a server host owns multiple addresses, as shown by ifconfig. There’s usually an address for each network “interface”. (Routing is specified per interface.) Suppose we have one address on IF1 and another on IF2.

By default, an Apache server would listen on both addresses. This allows clients from different network segments to reach us.

If the Apache server listens on only one of them, some clients may be unable to reach it, because they can only connect to the other address, perhaps due to routing!

tcp: one of 3 client-receivers is too slow

See also no overflow]TCP slow receiver #non-blocking sender

This is a real BGC interview question

Q: server is sending data fast. One of the clients (AA) is too slow.

Background — there will be 3 worker-sockets. The local address:port will look identical among them if the 3 clients connect to the same network interface, from the same network segment.

The set-up is described in simultaneous send to 2 tcp clients #mimic multicast

Note every worker socket for every client has identical local port.


I believe the AA connection/session/thread will be stagnant. At a certain point [1] server will have to remove the (mounting) data queue and release memory — data loss for the AA client.

[1] can happen within seconds for a fast data feed.

I also feel this set-up overloads the server. A TCP server has to maintain state for each laggard client, assuming single-threaded multiplexing(?). If each client takes a dedicated thread then server gets even higher load.

Are 5 client Connections using 5 sockets on server? I think so. Can a single thread multiplex them? I think so.
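
A rough Python sketch suggesting the answer is yes on both counts: one thread, one listening socket and one worker socket per client, all watched through the standard selectors module. The function name and the upper-casing echo are my own assumptions.

```python
import selectors
import socket

def serve_multiplexed(listener, total_clients):
    """One thread serves many clients; returns after total_clients have disconnected."""
    sel = selectors.DefaultSelector()
    sel.register(listener, selectors.EVENT_READ)
    finished = 0
    while finished < total_clients:
        for key, _ in sel.select(timeout=5):
            if key.fileobj is listener:
                worker, _ = listener.accept()         # one new socket per client
                sel.register(worker, selectors.EVENT_READ)
            else:
                worker = key.fileobj
                data = worker.recv(4096)
                if data:
                    worker.sendall(data.upper())
                else:                                 # client closed its end
                    sel.unregister(worker)
                    worker.close()
                    finished += 1
    sel.close()
```

Note the sendall() here is blocking, so a single slow client could still stall this thread, which is exactly the risk discussed in this post.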

no overflow]TCP slow receiver #non-blocking sender

Q: Does TCP receiver ever overflow due to a fast sender?

A: See

A: should not. When the receive buffer is full, the receiver sends AdvertizedWindowSize = 0 to inform the sender. If the sender app continues to send, the data will remain in the send buffer and not go over the wire. Soon the send buffer will fill up and send() will block. On a non-blocking TCP socket, send() returns with error only when it can’t send a single byte. (UDP is different.)

Non-blocking send/receive operations either complete the job or return an error.

Q: Do they ever return with part of the data processed?
A: Yes they return the number of bytes transferred. Partial transfer is considered “completed”.
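
A hedged Python illustration of this behaviour. It uses a Unix-domain socketpair as a stand-in for a TCP connection whose receiver is not reading; the function name is my own.

```python
import socket

def fill_send_buffer(sender, chunk=b"x" * 65536):
    """Non-blocking send() until the kernel accepts nothing more; return bytes accepted."""
    sender.setblocking(False)
    total = 0
    while True:
        try:
            total += sender.send(chunk)   # may accept only part of the chunk
        except BlockingIOError:           # not a single byte could be queued
            return total
```

The function returns a positive byte count without ever blocking. The receiving side loses nothing: everything counted is still waiting in the buffers until the receiver calls recv().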


Sliding-^Advertised- window size has real life illustration using wireshark.

  • AWS = amount of free space on receive buffer
    • This number, along with the ack seq #, are both sent from receiver to sender
  • lastSeqSent and lastSeqAcked are two control variables in the sender process.

SWS indicates position within a stream; AWS is a single integer.

Q: how are the two variables updated during transmission?
A: When an Ack for packet #9183 is received, sender updates its control variable “lastSeqAcked”. It then computes how many more packets to send, ensuring that “lastSeqSent – lastSeqAcked < AWS”.

  • SWS (sliding window size) = lastSeqSent – lastSeqAcked = amount of transmitted but unacknowledged bytes, is a derived control variable in the sender process, like a fluctuating “inventory level”.
  • SWS is a concept; AWS is a TCP header field
  • receive buffer size — is sometimes incorrectly referred as window size
  • “window size” is vague but usually refers to SWS
  • AWS is probably named after the sender’s sliding window.
  • receiver/sender — only these players control the SWS, not the intermediate routers etc.
  • too large — large SWS is always set based on large receive buffer, perhaps AWS
  • too small — underutilized bandwidth. As explained in linux tcp buffer^AWS tuning, high bandwidth connections should use larger AWS.

Q: how are AWS and SWS related?
A: The sender adjusts lastSeqSent (and SWS) based on the feedback of AWS and lastSeqAcked.
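
A toy model of that feedback loop in Python. It ignores congestion control, segmentation and retransmission, and the class and method names are my own, so treat it as a sketch of the bookkeeping only.

```python
class ToySender:
    """Tracks lastSeqSent / lastSeqAcked and obeys the latest AWS from the receiver."""

    def __init__(self, initial_aws):
        self.last_seq_sent = 0
        self.last_seq_acked = 0
        self.aws = initial_aws

    def sws(self):
        """Bytes sent but not yet acknowledged (the fluctuating 'inventory level')."""
        return self.last_seq_sent - self.last_seq_acked

    def on_ack(self, ack_seq, aws):
        """Each Ack carries both the acknowledged position and a fresh AWS."""
        self.last_seq_acked = ack_seq
        self.aws = aws

    def send(self, wanted):
        """Send as much of 'wanted' as the window allows; return bytes actually sent."""
        allowed = max(0, self.aws - self.sws())
        sent = min(wanted, allowed)
        self.last_seq_sent += sent
        return sent
```

For instance, with an initial AWS of 1000 a request to send 1500 bytes sends only 1000. If an Ack then arrives with AWS=0, send() returns 0 until a later Ack reopens the window.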

TCP client set-up steps #connect()UDP

TCP Client-side is a 2-stepper (look at Wikipedia and [[python ref]], among many references)
1) [SC] socket()
2) [C] connect()

[SC = used on server and client sides]
[C=client-only. seldom/never used on server-side.]

Note UDP is connection-less but connect() can be used too — to set the default destination. See

Under TCP, the verb connect() means something quite different: “reach across and build connection”[1]. You see it when you telnet … Also, the server side doesn’t make outgoing connections, so this is used by TCP client only. When making a connection, we often see error messages about the server refusing the connection, because no server is “accepting”.

[1] think of a foreign businessman traveling to China to build guanxi with local government officials.
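
The two client-side steps, plus the UDP flavour of connect(), in a short Python sketch. The function names and the 2-second timeout are my own choices.

```python
import socket

def tcp_connect(host, port, timeout=2.0):
    """TCP client: 1) socket() then 2) connect(), which performs the handshake."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)   # 1) socket()
    sock.settimeout(timeout)
    try:
        sock.connect((host, port))                              # 2) connect()
    except OSError:
        sock.close()
        raise       # e.g. ConnectionRefusedError when nobody is listening
    return sock

def udp_connect(host, port):
    """UDP 'connect': no handshake, it only records the default destination."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    sock.connect((host, port))
    return sock     # send() now works without naming an address
```

Calling tcp_connect() against a port with no listening socket should raise ConnectionRefusedError, which is the "server refusing connection" message mentioned above.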


TCP listening socket shared by2processes #mcast

Common IV question: In what scenarios can a listening socket (in memory) be shared between 2 listening processes?

Background — a socket is a special type of file descriptor (at least in unix). Consider an output file handle. By default, this “channel” isn’t shared between 2 processes. Similarly, when a packet (say a price) is delivered to a given network endpoint, the kernel must decide which process to receive the data, usually not to two processes.

To have two processes both listening on the same listening-socket, one of them is usually a child of the other. The webpage in [1] and my own code show short python code illustrating this scenario. I tested. q(lsof) and q(ss) commands both (but not netstat) show the 2 processes listening on the same endpoint. OS delivers the data to A B A B… A separate, advanced kernel feature lets multiple processes bind() to the same endpoint.

For multicast (UDP only) two processes can listen to the same UDP endpoint. See [3] and [2]

A Unix domain socket can be shared between two unrelated processes.
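
A hedged Python sketch of the parent/child case on Unix. This is not the code referred to above. It forks children that inherit one listening socket, and each child accepts exactly one connection and replies with its own pid. All names are my own.

```python
import os
import socket

def accept_in_children(n_children=2):
    """Fork children sharing one listening socket; return (child_pids, pids_that_replied)."""
    listener = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    listener.bind(("127.0.0.1", 0))
    listener.listen(5)
    addr = listener.getsockname()

    child_pids = []
    for _ in range(n_children):
        pid = os.fork()
        if pid == 0:                          # child: inherited the same listening socket
            try:
                worker, _ = listener.accept()
                worker.sendall(str(os.getpid()).encode())
                worker.close()
            finally:
                os._exit(0)                   # never fall back into the parent's code
        child_pids.append(pid)

    replies = []
    for _ in range(n_children):
        with socket.create_connection(addr) as client:
            data = b""
            while chunk := client.recv(64):   # read until the child closes
                data += chunk
            replies.append(int(data))

    for pid in child_pids:
        os.waitpid(pid, 0)
    listener.close()
    return child_pids, replies
```

Each connection is handed to exactly one of the processes blocked in accept(), so every child pid shows up once in the replies.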





2 Active connections on 1 TCP server IP/port

This is not the most common design, but have a look at the following output:

remote          local        state

What needs to be unique is the 5-tuple (protocol, remote-ip, remote-port, local-ip, local-port), so this situation can exist. [[tcp/ip sockets in C]] P100 has a full section on this topic. Another source also says “Multiple worker-sockets on the same TCP server can share the same server-side IP/Port pair as long as they are associated with different client-side IP/Port pairs”.

This “accept, move the connection to a dedicated server socket, then go back to accept()” is the common design. On each incoming connection, the listening TCP server will start a new thread/task/process using a new “worker” socket on the server side. Note the new worker socket shares server ip:port with the original listening socket.

My own experiment pushes the concept of “sharing” further: two TCP servers sharing a single socket, not just a single ip:port endpoint!

tcp client bind()to non-random port: j4

TCP client doesn’t specify local endpoint. It only specifies the remote endpoint.

  • The local port is random. It’s conceptually an “outgoing” port as the client reaches out to the remote server.
  • The local IP address is probably chosen by the kernel, based on the remote IP address specified.


A Barclays TCP interview asked

Q: When a tcp client runs connect(), can it specify a client-side port rather than using a random port assigned by the system?
A: use bind() —

* I feel the client port number can work like a rudimentary tag for a “special” client thread
* similarly, debugging —
* firewall filtering on client port —
* some servers expect client to use a low port —

Note client bind() can also specify a particular client ip address (multihoming). Client side bind() defines the local port and interface address for the connection. In fact, connect() does an implicit bind(“”, 0) if one has not been done previously (with zero being taken as “any”). See
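
A small Python sketch of a client choosing its own port. The function name is my own, and the port passed in is just whatever free port the caller picked.

```python
import socket

def connect_from_port(server_addr, client_port, client_ip="127.0.0.1"):
    """Client-side bind() before connect(), so the source port is chosen by us."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
    sock.bind((client_ip, client_port))   # explicit local endpoint
    sock.connect(server_addr)             # no implicit bind needed now
    return sock
```

On the server side, accept() should then report the client's address with exactly that port, which is what makes the port usable as a rudimentary tag.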

SO_REUSEPORT TCP server socket option – hungry chicks

With SO_REUSEPORT option, multiple TCP server processes could bind() to the same server endpoint. Designed for the busiest multithreaded servers. Picture a bunch of hungry chicks competing to get the next worm the mother delivers. The mother can only give the worm to one chick at a time. SO_REUSEPORT option sets up a chick family. When an incoming connection arrives, kernel picks one of the accepting threads/processes and delivers the connection to it alone.

See  + my socket book P102.
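
A minimal Python sketch of a two-chick family. It assumes a platform where socket.SO_REUSEPORT exists (for example Linux 3.9 or later). The function name is my own, and both sockets live in one process only to keep the example short.

```python
import socket

def reuseport_listener(port=0, host="127.0.0.1"):
    """Create a listening socket that allows other sockets to bind the same endpoint."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEPORT, 1)   # must precede bind()
    sock.bind((host, port))
    sock.listen(16)
    return sock
```

Create the first listener with port 0, read its port from getsockname(), then pass that port to a second call. Both bind() calls succeed, and each incoming connection becomes acceptable on exactly one of the two listeners.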

TCP server socket lingering briefly af host process exits

[[tcp/ip sockets in C]] P159 points out that after a host process exits, the socket enters the TIME_WAIT state for some time, visible in netstat.

Problem is, the socket still binds to some address:port, so if a new socket were to attempt bind() to the same address:port it might fail. The exact rule is possibly more complicated but it does happen.

The book mentions 2 solutions:

  1. wait for the dying socket to exit TIME_WAIT. After I kill the process, I have seen this lingering for about a minute then disappearing.
  2. new socket to specify SO_REUSEADDR.

There are some simple rules about SO_REUSEADDR, so the new socket must be distinct from the existing socket in at least one of the 4 fields. Otherwise the selection rule in this post would have been buggy.
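
A hedged Python sketch of solution 2. It simulates a server "restart" inside one process: the first listener's worker closes first (so the server side is the one left in TIME_WAIT), then a second listener binds the same port with SO_REUSEADDR. Function names are my own, and the sketch only shows the success path.

```python
import socket

def listen_reusable(port=0, host="127.0.0.1"):
    """Listening socket with SO_REUSEADDR set before bind()."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
    sock.bind((host, port))
    sock.listen(5)
    return sock

def restart_on_same_port():
    """Return True if a new listener can bind the port right after the old one closed."""
    first = listen_reusable()
    port = first.getsockname()[1]
    client = socket.create_connection(("127.0.0.1", port))
    worker, _ = first.accept()
    worker.close()            # server closes first, so its end lingers after the close
    client.recv(1)            # returns b"" once the server's FIN arrives
    client.close()
    first.close()

    second = listen_reusable(port)            # rebinding the same port
    ok = second.getsockname()[1] == port
    second.close()
    return ok
```

Without the setsockopt() line, the second bind() may fail with "Address already in use" on some systems until the lingering socket disappears.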

(server)promiscuous socket^connected socket

[[tcp/ip sockets in C]] P100 has a diagram showing that an incoming packet will be matched against multiple candidate listening sockets:

  • format: {local address:local port / remote address:remote port}
  • Socket 0: { *:99/*:*}
  • Socket 1: {*:*}
  • Socket 2: {} — this one has the remote address:port populated because it’s an Established connection)

An incoming packet needs to match all fields, otherwise it’s rejected.

However it could find multiple candidate sockets. Socket 0 is very “promiscuous”. The rule (described in the book) is — the more wild cards, the less likely selected.

(Each packet must be delivered to at most 1 socket as far as I know.)

SSocket→ BBind→ LListen→ AAccept#details

My focus is internet-socket (not UnixDomain-socket) and server-side TCP. UDP and client-side will be addressed later.

“Bind before Listen” — also shows the same flow.

1) [SC] socket() system_call calls into the kernel to _create_ a new socket and returns a socket FILE descriptor integer

2) [S] bind() specifies the local end point. The bind() is the choke point to specify (socket() doesn’t) the local end point. It can bind to “any” one address available to the hosting OS, but must bind to a fixed port(??). I know bind() can use INADDR_ANY to bind to multiple addresses. Some say each socket can bind to one address at a time but I don’t think so and I don’t care. To an api user like me, bind() can indeed connect me to multiple addresses.

In a multicast receiver (no “server” per se), bind() specifies the group port, not the local port.

3) [S] listen() is the choke point to specify the _LLLLLLength of the queue. Any incoming connection exceeding the queue capacity will hit “server busy”.

4) [S] accept() is the only blocking call in the family —
** If no pending connections are present on the queue, and the socket is NOT marked as non-blocking, accept() blocks the caller until a connection is present. If the socket is marked non-blocking and no pending connections are present on the queue, accept() fails immediately with the error EAGAIN.

The name accept() means accept CONNECTIONs, so it’s used for connection-oriented TCP only.
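
The four steps in a short Python sketch, including the non-blocking accept() case. Function names, the loopback address and the backlog of 5 are my own choices.

```python
import select
import socket

def make_listener(host="127.0.0.1", port=0, backlog=5):
    sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)   # 1) socket()
    sock.bind((host, port))                                    # 2) bind(): local end point
    sock.listen(backlog)                                       # 3) listen(): queue length
    return sock

def try_accept(listener, wait_seconds=0.0):
    """4) accept() on a non-blocking listener; None means no pending connection."""
    listener.setblocking(False)
    if wait_seconds:
        select.select([listener], [], [], wait_seconds)        # optional bounded wait
    try:
        worker, peer = listener.accept()
    except BlockingIOError:                                    # EAGAIN / EWOULDBLOCK
        return None
    return worker, peer
```

With no client in the queue, try_accept() returns None immediately instead of blocking. Once a client has connected, it returns the new worker socket and the client's address.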

java ServerSocket.accept()

I think this should be one of the first important yet tricky socket methods to study and /internalize/. Memorize its extended signature and you will understand how it relates to other things.

See P221 [[ java threads ]].

A real server jvm always creates and uses 2 *types* of sockets — a single well-known listening socket, which on demand creates a “private socket” (aka data socket) for each incoming client request.

The new socket manufactured by accept() has the remote address:port set to the client’s address:port and is already connected to it.

I think the new socket initiates the connection, therefore this socket is considered a “socket” and not a ServerSocket. Nothing found online to confirm this.

java ServerSocket HAS-A queue

Default queue of 50 waiting “patrons” to our restaurant. If a patron arrives when the queue is full, the connection is refused.

Each “successful” patron would be allocated a dining table ie an address:port on the server-side.

The operating system stores incoming connection requests addressed to a particular port in a first-in, first-out queue. The default length of the queue is normally 50, though this can vary from operating system to operating system. Some operating systems (though not Solaris) have a maximum queue length, typically five. On these systems, the queue length will be the largest possible value less than or equal to 50. After the queue fills to capacity with unprocessed connections, the host refuses additional connections on that port until slots in the queue open up. Many (though not all) clients will try to make a connection multiple times if their initial attempt is refused. Managing incoming connections and the queue is a service provided by the operating system; your program does not need to worry about it. Several ServerSocket constructors allow you to change the length of the queue if its default length isn’t large enough; however, you won’t be able to increase the queue beyond the maximum size that the operating system supports.