reliable multicast for replicated-cache update is a 1999 research paper. I hope by now multicast has grown more mature more proven. Not sure where this is used, perhaps within certain network boundaries such as a private network of data servers.

This paper examines reliable multicast for invalidation and delivery of popular, frequently updated objects to web cache proxies.

multicast routing: efficient data duplication

The generic  routing algorithm in multicast has optimization features that offer efficient data duplication.

— optimized message duplication at routing layer (Layer 3?) explains that

Routers duplicate data packets and forward multiple copies wherever the path to recipients diverges. Group membership information is used to calculate optimal branch points i.e. the best routers at which to duplicate the packets to optimize the use of the network.

The multicast distribution tree of receiving hosts holds the route to every recipient that has joined the multicast group, and is optimized so that

  • multicast traffic does not reach networks that do not have any such recipients (unless the network is a transit network on the way to other recipients)
  • duplicate copies of packets are kept to a minimum.

multicast over public internet

Many people say that in general, multicast over public internet is unsupported. I think many retail websites (including the dominant west coast firms) are technically unable to use multicast.

As an end-user, You can multicast across the public Internet to another site by using a tunnel that supports multicast.

As a larger organization, like a video provider or an ISP, it is certainly possible to forward multicast packets across their domain boundary (i.e. across an Internet).

To forward those multicast packets to another ISP, you would need a peering agreement with them and use the Multicast Source Discovery Protocol (MSDP), configured on both ends.

While you won’t propagate your multicast across the global Internet, crossing network boundaries with multicast packets is not impossible.

multicast: multi-player games

  • data loss is tolerable
  • many-to-many messaging — with multiple senders and multiple receivers says

Collaborative applications such as data conferences (whiteboarding) and network-based games are many-to-many applications with ….

… modest scaling requirements of less than 100 participants.

This kind of application requires low latency of less than 400 millisec.

Transmission does not always need strict reliability; for example, refresh of background information for a network game could wait for the next refresh.

close(): with timeout or non-blocking #CSY

I was right that close() can be non-blocking even on a TCP socket. It is the default.

When you specify SO_LINGER, you can also give a timeout to control how long close() blocks on a TCP socket

So there are 3 modes on a TCP socket:

  • close() blocking with timeout via SO_LINGER
  • close() blocking without timeout
  • close() nonblocking

[19]with3sockets: select^non-block recv#CSY

Update — CSY said when you increase from 3 to 99, the difference becomes obvious.

Q (TradeWeb mkt data team): given 3 non-blocking receive-sockets, you can either read them sequentially in a loop or use select(). What’s the difference? compares many alternatives. It says Non-block socket read() is least efficient as it wastes CPU.

If you have 3 receive-sockets to monitor, select()/poll() is designed just for your situation.

Using read() instead, you may try specifying a timeout in your read() like

//set socket option via SO_RCVTIMEO to 5 seconds

Low latency systems can’t use this design because while you wait 5 seconds on socket1, you miss new data on socket2 😦

Therefore, low-latency systems MUST disable SO_RCVTIMEO. In such a design, the loop wastes cpu in a spin-wait.

–another advantage to selelct(): you can process a high-priority socket earlier when select() says multiple sockets ready.

I gave this answer on the spot.

recv()/send() with timeout #CSY

I was right — timeout is supported Not only in select()/poll(), but also read/recv/send..


Timeouts only have effect for system calls that perform socket I/O (e.g., read(2), recvmsg(2), send(2), sendmsg(2)); timeouts have no effect for select(2), poll(2), epoll_wait(2), and so on. These latter functions only check socket status… no I/O.

I believe this socket option has no effect if used on non-blocking sockets . Nonblocking socket can be created either

  • O_NONBLOCK in fcntl() or
  • SOCK_NONBLOCK in socket()

UDP: send/receive buffers are configurable #CSY

It’s wrong to say UDP uses a small receive buffer but doesn’t use send buffer.

Receive — shows how to increase UDP receive buffer to 25MB

Send — shows you CAN configure send buffer on a UDP socket.

Send — also confirms SO_SNDBUF send-buffer size applies to UDP sockets

In addition, Application is free to use its own custom buffer in the form of a vector for example.

xtap check`if UDP channel=healthy #CSY

xtap library needs reliable ways to check if a “connectivity” is down

UDP (including multicast) is a connectionless protocol; TCP is connection-oriented. Therefore the xtap “connection” class cannot be used for multicast channels. Multicast Channel is like a TV-channel (NYSE terminology).

–UDP is connectionless, has no session, no virtual circuit, so no “established” state. So how do we know the exchange server is dead or just quiet?

After discussing with CSY, I feel UDP unicast or UDP multicast can’t tell me.

heartbeat — I think we must rely on the heartbeat. I actually helped create an inactivity timeout alert precisely because in a multicast channel, we don’t know if exchange is down or just quiet.

–TCP should be easier.

Many online resources such as

According to CSY, as a receiver, we don’t need to send a probe message. if exchange has closed the TCP socket, the four-way handshake would have informed my socket that the connection is closed. So my select() would say my TCP socket is ready for reading. When I read it I would get 0 bytes.

I believe there’s a one-to-one mapping between a socket and a “connection” for TCP only


370,000 MPS isn’t tough for multicast #CSY

370,000 msg/sec isn’t too high. Typical exchange message size is 100 bits, so we are talking about 37 Mbps, , less than 0.1% of a 100 Gbps network capacity.

My networking mentor CSY and I both believe it’s entirely possible to host 8 independent threads in a single process to handle the 8 independent message channels. Capacity would be 296 Mbps on a single NIC and single PID.

See also mktData parser: multi-threaded]ST mode #Balaji

I feel a more bandwidth-demanding multicast application is video-on-demand where a single server may need to simultaneously stream different videos.

Q: how about world cup final real-time multicast video streaming to millions of subscribers?
%%A: now I think this is not so demanding, because the number of simultaneous video streams is one


I think there are many application-layer multicast protocols. says

Unlike native multicast (i.e. IP-multicast) where data packets are replicated at routers inside the network, in application-layer multicast data packets are replicated at end hosts.

When we hear “multicast” we should not assume it means IP-multicast using UDP ! “Multicast” is not a specific protocol like UDP, but a generic, loosely defined term.

— RGDD says “many current systems support RGDD at application layer using unicast TCP

message fragment in Internet Protocol !! TCP

IP layer handles fragmentation/defrag. UDP and TCP are one layer above IP and relies on this “service” of the IP layer.

UDP may (TCP is different) occasionally lose an entire “logical” packet, but never Part of a logical packet.

In my own words, If IP layer loses a “fragment” it discards the entire packet.

When a logical packet is broken up at IP layer into physical packets, the constituent physical packets will either be delivered altogether or lost altogether. The frag/defrag IP service is transparent to upper layers so UDP/TCP don’t need to worry about basic data integrity.

I will attempt to contrast it to TCP flow control, which breaks up a megabyte file into smaller chunks. Each chunk is a “logical” packet. TCP (not UDP) uses sequence numbers in the packets.

NAT: linux server can become a simple router


What limitations are there? I only know a few

  • need multiple network interfaces

Here’s how we use NAT to translate destination port number in every incoming IP packet:

[sa@rtppeslo2 ~]$  sudo iptables -t nat -nL

target     prot opt source               destination         
DNAT       udp  --           udp dpt:18020 to::48020 

specify(by ip:port) multicast group to join has zipped sample code showing

mc_addr.sin_port = thePort;

bind(sock, (struct sockaddr *) &mc_addr, sizeof(mc_addr) ) // set the group port, not local port!
mc_req.imr_multiaddr.s_addr = inet_addr(“”);

(void*) &mc_req, sizeof(mc_req) // set the IP by sending a IGMP join-request

Note setsocopt() actually sends a request!

====That’s for multicast receivers.  Multicast senders use a simpler procedure —

mc_addr.sin_addr.s_addr = inet_addr(“”);
mc_addr.sin_port = htons(thePort);

sendto(sock, send_str, send_len, 0, (struct sockaddr *) &mc_addr, …

FIX^MOM: exchange connectivity

The upshot — some subsystem use FIX not MOM; some subsystem use MOM not FIX. A site could send FIX over MOM (such as RV) but not common.

It’s important to realize FIX is not a type of MOM system. Even though FIX uses messages, I believe they are typically sent over persistent TCP sockets, without middle-ware or buffering/queuing. RTS is actually a good example.

Q: does FIX have a huge message buffer? I think it’s small, like TCP, but TCP has congestion control, so sender will wait for receiver.

Say we (sell-side or buy-side) have a direct connectivity to an exchange. Between exchange and our exterior gateway, it’s all sockets and possibly java NIO — no MOM. Data format could be FIX, or it could be the exchange’s proprietary format — here’s why.

The exchange often has an internal client-server protocol — the real protocol and a data flow “choke point” . All client-server request/response relies solely on this protocol. The exchange often builds a FIX translation over this protocol (for obvious reasons….) If we use their FIX protocol, our messages are internally translated from FIX to the exchange proprietary format. Therefore it’s obviously faster to directly pass messages in the exchange proprietary format. The exchange offers this access to selected hedge funds.

As an espeed developer told me, they ship a C-level dll (less often *.so [1]) library (not c++, not static library), to be used by the hedge fund’s exterior gateway. The hedge fund application uses the C functions therein to communicate with the exchange. This communication protocol is possibly proprietary. HotspotFX has a similar client-side library in the form of a jar. There’s no MOM here. The hedge fund gateway engine makes synchronous function calls into the C library.

[1] most trader workstations are on win32.

Note the dll or jar is not running in its own process, but rather loaded into the client process just like any dll or jar.

Between this gateway machine and the trading desk interior hosts, the typical communication is dominated by MOM such as RV, 29West or Solace. In some places, the gateway translates/normalizes messages into an in-house standard format.

TCP sliding-window: congestion^receiver

Based on P 248/265 [[computerNetworking]] , 2011 paper@high-perf messaging, and my blogpost on AWS^SWS

Sliding window is a protocol or a design framework. AdvertizedWindowSize is an integer field in the Ack message. There are also two variables “ReceiveWindow/CongestionWindow” in the TCP sender codebase (part of kernel).

Sliding window kills two birds with one stone — 1) congestion control and 2) overflow protection on the receiving end. Without these controls, sender can pump packets too fast causing

  • receiver socket buffer overflows
  • congestion on the sender-to-receiver path so some of the router buffers overflow

Basic idea is limit sending rate such that

  • unacknowledged bytes := lastByteSent – lastByteAcked < ReceiveWindow
  • unacknowledged bytes := lastByteSent – lastByteAcked < CongestionWindow

UDP doesn’t limit sending rate and is favored by many aggressive applications such as market data dissemination.

TCP blocking send() timeout #details

See also recv()/send() with timeout #CSY

I see three modes

  1. non-blocking send() — immediate return if unable to send
  2. blocking send() without timeout — blocks forever, so the thread can’t do anything.
  3. blocking send() with timeout —

SO_SNDTIMEO: sets the timeout value specifying the amount of time that an output function blocks because flow control prevents data from being sent. If a send operation has blocked for this time, it shall return with a partial count or with errno set to [EAGAIN] or [EWOULDBLOCK] if no data is sent. The default for this option is zero, which indicates that a send operation shall not time out. This option stores a timeval structure. Note that not all implementations allow this option to be set.

In xtap library, timeout isn’t implemented at all. Default is non-blocking.  If we configure to use 2), then we can hit a strange problem — one of three receivers gets stuck but keeps its connection open. The other receives are starved even though their receive buffers are free.

de-multiplex packets bearing Same dest ip:port Different source

see de-multiplex by-destPort: UDP ok but insufficient for TCP

For UDP, the 2 packets are always delivered to the same destination socket. Source IP:port are ignored.

For TCP, if there are two matching worker sockets … then delivered to them. Perhaps two ssh sessions.

If there’s only a listening socket, then both packets delivered to the same socket, which has wild cards for remote ip:port.

UDP socket is identified by two-tuple; TCP socket is by four-tuple

Based on [[computer networking]] P192. see also de-multiplex by-destPortNumber UDP ok but !! enough for TCP

  • Note the term in subject is “socket” not “connection”. UDP is connection-less.

A TCP segment has four header fields for Source IP:port and destination IP:port.

A TCP socket has internal data structure for a four-tuple — Remote IP:port and local IP:port.

A regular TCP “Worker socket” has all four items populated, to represent a real “session/connection”, but a Listening socket could have wild cards in all but the local-port field.

fragmentation: IP^TCP #retrans

I consider this a “halo” knowledge pearl because it is part of an essential everyday service. We can easily find an opportunity to inject it into an Interview. Interviews are unlikely to go this deep, but it’s good to over-prepare here.

This comparison ties together many loose ends like Ack, reassembly, retrans, seq resets..

This comparison clarifies what reliability IP offers + what reliability features it lacks:

  • it offers complete datagrams
  • it can lose an entire datagram — TCP (not UDP) will detect it and use retrans
  • it can deliver two datagrams out of order — TCP will detect and fix it

[1] IP fragmentation can cause excessive retransmissions when fragments encounter packet loss and reliable protocols such as TCP must retransmit ALL of the fragments in order to recover from the loss of a SINGLE fragment
[2] TCP seq# never looks like 1,2,3
[3] IP (de)fragmentation #MTU,offset

IP4 fragmentation TCP fragmentation
minimum guarantees all-or-nothing. Never partial packet stream in-sequence without gap
reliability unreliable fully reliable
name of a “whole” piece IP datagram #“packet/msg” are vague
a TCP session with thousands of sequence numbers
name for a “part” IP fragment TCP segment
sequencing each fragment has an offset each segment has a seq#
.. continuous? yes discontinuous ! [2]
.. seq # reset? yes for each packet loops back to 0 right before overflow
Ack no Ack !
positive Ack needed
gap detection using offset using seq# [2]
id for the “msg” identification number no such thing
end-of-msg flag in last fragment no such thing. Continuous stream
out-of-sequence? likely likely
..reassembly based on id/offset/flag [3] based on seq#
retrans not by IP4 [1][3] commonplace
missing a piece? entire IP datagram discarded[3] triggers retrans

retrans: FIX^TCP^xtap

The FIX part is very relevant to real world OMS.. Devil is in the details.

IP layer offers no retrans. UDP doesn’t support retrans.



TCP FIX xtap
seq# continuous no yes.. see seq]FIX yes
..reset automatic loopback managed by application seldom #exchange decision
..dup possible possible normal under bestOfBoth
..per session per connection per clientId per day
..resumption? possible if wire gets reconnected quickly yes upon re-login unconditional. no choice
Ack positive Ack needed only needed for order submission etc not needed
gap detection sophisticated every gap should be handled immediately since sequence is critical. Out-of-sequence is unacceptable. gap mgr with timer
retrans sophisticated receiver(ECN) will issue resend request; original sender to react intelligently gap mgr with timer
Note original sender should be careful resending new orders.


de-multiplex by-destPort: UDP ok but insufficient for TCP

When people ask me what is the purpose of the port number in networking, I used to say that it helps demultiplex. Now I know that’s true for UDP but TCP uses more than the destination port number.

Background — Two processes X and Y on a single-IP machine  need to maintain two private, independent ssh sessions. The incoming packets need to be directed to the correct process, based on the port numbers of X and Y… or is it?

If X is sshd with a listening socket on port 22, and Y is a forked child process from accept(), then Y’s “worker socket” also has local port 22. That’s why in our linux server, I see many ssh sockets where the local ip:port pairs are indistinguishable.

TCP demultiplex uses not only the local ip:port, but also remote (i.e. source) ip:port. Demultiplex also considers wild cards.

socket has local IP:port
socket has remote IP:port no such thing
2 sockets with same
local port 22 ???
can live in two processes not allowed
can live in one process not allowed
2 msg with same dest ip:port
but different source ports
addressed to 2 sockets;
2 ssh sessions
addressed to the
same socket

4buffers in 1 TCP connection #full-duplex

Many socket programmers may not realize there are four transmission buffers in a single TCP connection or TCP session. These buffers are set aside by the kernel when the socket is created.

Two buffers in socket AA (say in U.S.) + two sockets on socket BB (say, in Singapore)

Two receive buffers + two send buffers

At any time, there could be data in all four buffers. AA can initiate a send in the middle of receiving data because TCP is a full-duplex protocol.

Whenever one of the two sockets initiate a send, it has a job duty to ensure it never overflows the receiving buffer on the other end. This is a the essence of flow-control. This flow-control relies on the 2 buffers involved (not the other two buffers) P 247 [[computer networking]] has details. Basically, sender has an estimate of the remaining free space in the receive buffer so sender never sends too many bytes. It keeps unsent data in its send buffer.

Is UDP full-duplex? My book doesn’t mention it. I think it is.

Q: which thread/PID drains NicBuffer→socketBuffer

Too many kernel concepts here. I will use a phrasebook format. I have also separated some independent tips into hardware interrupt handler #phrasebook

  1. Scenario 1 : A single CPU. I start my parser which creates the multicast receiver socket but no data coming. My process (PID 111) gets preempted on timer interrupt. CPU is running unrelated PID 222 when my data wash up on the NIC.
  2. Scenario 2: pid111 is running handleInput() while additional data comes in on the NIC.

Some key points about all scenarios:

  • context switching — There’s context switch to interrupt handler (i-handler). In all scenarios, the running process gets suspended to make way for the interrupt handler function. I-handler’s instruction address gets loaded into the cpu registers and this function starts “driving” the cpu. Traditionally, the handler function would use the suspended process’s existing stack.
    • After the i-handler completes, the suspended “current” process resumes by default. However, the handler may cause another pid333 to be scheduled right away [1 Chapter 4.1].
  • no pid — interrupt handler execution has no pid, though some authors say it runs on behalf of the suspended pid. I feel the suspended pid may be unrelated to the socket (Scenario 2), rather than the socket’s owner process pid111.
  • kernel scheduler — In Scenario 1, pid111 would not get to process the data until it gets in the “driver’s seat” again. However, the interrupt handler could trigger a rescheduling and push pid111 “to the top of the queue” so to speak. [1 Chapter 4.1]
  • top-half — drains the tiny NIC ring-buffer into main memory (presumably socket buffer) as fast as possible [2] as NIC buffer can only hold a few packets — [[linux kernel]] P 629.
  • bottom-half — (i.e. deferrable functions) includes lengthy tasks like copying packets. Deferrable function run in interrupt context [1 Chapter 4.8], under nobody’s pid
  • sleeping — the socket owner pid 111 would be technically “sleeping” in the socket’s wait queue initially. After the data is copied into the socket receive buffer in user space, I think the kernel scheduler would locate pid111 in the socket’s wait queue and make pid111 the cpu-driver. This pid111 would call read() on the socket.
    • wait queue — How the scheduler does it is non-trivial. See [1 Chapter]
  • burst — What if there’s a burst of multicast packets? The i-handler would hog or steal the driver’s seat and /drain/ the NIC ring-buffer as fast as possible, and populate the socket receive buffer. When the i-handler takes a break, our handleInput() would chip away at the socket buffer.
    • priority — is given to the NIC’s interrupt handler as NIC buffer is much smaller than socket buffer.
    • UDP could overrun the socket receive buffer; TCP uses transmission control to prevent it.

Q: What if the process scheduler is triggered to run (on timer interrupt) while i-handler is busy draining the NIC?
A: Well, all interrupt handlers can be interrupted, but I would doubt the process scheduler would suspend the NIC interrupt handler.

One friend said the while the i-handler runs on our single-CPU, the executing pid is 1, the kernel process. I doubt it.

[1] [[UnderstandingLinuxKernel, 3rd Edition]]


CHANNEL for multicast; TCP has Connection

In NYSE market data lingo, we say “multicast channel”.

  • analogy: TV channel — you can subscribe but can’t connect to it.
  • analogy: Twitter hashtag — you can follow it, but can’t connect to it.

“Multicast connectivity” is barely tolerable but not “connection”. A multicast end system joins or subscribes to a group. You can’t really “connect” to a group as there could be zero or a million different peer systems without a “ring leader” or a representative.

Even for unicast UDP, “connect” is the wrong word as UDP is connectionless.

Saying a nonsense like “multicast connection” is an immediate giveaway that the speaker isn’t familiar with UDP or multicast.

UDP recv()from 1 send()at most

P116 [[tcp/ip sockets in C]] made it very clear.

A call to recv on the receiver machine will return data from at most one send() on the sender machine.

It can be a partial message, but would be the first part. See

I believe entire payload of one send()/sendto() is packaged into an envelope. The kernel would never deliver two envelopes to one recv()/recvfrom() call. Therefore receiver can only receive one envelope at a time. If entire envelope is too large then only only first part of the payload is delivered.

IP4 fragmentation+reassembly #MTU,offset

I consider this a “halo” knowledge pearl because it is part of an essential everyday service. We can easily find an opportunity to inject it into an IV.

A Trex interviewer said something questionable. I said fragmentation is done at IP layer and he said yes but not reassembly. He was wrong. See P329 [[Computer Networking]]

I was talking about IP layer breaking up , say, a 4KB packet (TCP or UDP packet) into three IP-fragments no bigger than 1500B [1]. The reassembly task is to put all 3 fragments back together in sequence (and detect missing fragments) and hand it over to TCP or UDP.

This reassembly is done in IP layer. IP4 uses an “offset” number in each fragment to identify the sequencing and to detect missing fragments. The fragment with the highest offset also has a flag indicating it’s the last fragment of a given /logical/ packet.

Therefore, IP4 detects and will never deliver partial packets to UDP/TCP (P328 [[computer networking]]), even though IP is considered an unreliable service. IP4 can detect missing/incomplete IP-datagram and will refuse to “release” it to upper layer (i.e. UDP/TCP).

  • If TCP doesn’t get one “segment” (a.k.a. IP-datagram), it will request retransmission
  • UDP does no retransmission
  • IP4 does no retransmission

[1] MTU for some hardware is lower than 1500 Bytes …

Ack in tcp # phrasebook

Ack — returned by receiver to original sender … on every segment. See P241 [[computerNetworking]]

byte-level — TCP seq number is at byte level and it jumps. Ack is such a seq number.

expected — Ack number is the next seq number expected

proactive Ack — Never. Receiver will Never send ACK if it has not receive anything. I think this means receiver can’t detect unplugged wire.

zero AWS — gradually the AWS value in the Ack will drop to zero

1-byte probe — See [1]

Slow-receiver — TCP flow-control is only evident with a slow receiver

retrans — sender always resends x nanosec (Timeout) after missing an Ack. See tcp: detect wire unplugged



single/multi-thread TCP servers contrasted

In all cases, listening socket LL and worker sockets W4 W5 look like

local/server port local/server ip remote/client ip remote/client port socket file descriptor
 80 ANY unbound unbound LL  3
 80 high port A W4  4
 80 high port B W5  5 newly created

Now the 3 designs. [1] is described in accept()+select() : multiple persistent worker-sockets

LL W4 W5
not used by the “engaged” thr sending data single-thr primitive
not used by the “engaged” thr, but will be monitored via select() sending data single-thr multiplexing[1]
 listening Established. sending or idle sending multi-threaded

tcp: detect wire unplugged

In general for either producer or consumer, the only way to detect peer-crash is by probing (eg: keepalive, 1-byte probe, RTO…).

  • Receiver generally don’t probe and will remain oblivious.
  • Sender will always notice a missing Ack (timeout OR 3 duplicate Acks). After retrying, TCP module will give up and generate SIGPIPE.
send/recv buffers full buffer full then receiver-crash receiver-crash then sender has data to send receiver-crash amid active transmission
visible symptom 1-byte probe from sender triggers Ack containing AWS=0 The same Ack suddenly stops coming very first expected Ack doesn’t come Ack was coming in then suddenly stops coming
retrans by sender yes yes yes yes
SIGPIPE no probably yes probably

Q20: if a TCP producer process dies After transmission, what would the consumer get?
AA: nothing. See — Receiver is ready to receive data, and has no idea that sender has crashed.
AA: Same answer on

Q21: if a TCP producer process dies During transmission, what would the consumer get?
%A: ditto. Receiver has to assume sender stopped.

Q30: if a TCP consumer process dies during a quiet period After transmission, what would the producer get?
AA: P49 [[tcp/ip sockets in C]] Sender doesn’t know right away. At the next send(), sender will get -1 as return value. In addition, SIGPIPE will also be delivered, unless configured otherwise.

Q30b: Is SIGPIPE generated immediately or after some retries?
AA: describes Ack and re-transmission. Sender will notice a missing Ack and RTO will kick in.
%A: I feel TCP module will Not give up prematurely. Sometimes a wire is quickly restored during ftp, without error. If wire remains unplugged it would look exactly like a peer crash.

Q31: if a TCP consumer dies During transmission, what would the producer get?
%A: see Q30.

Q32: if a TCP consumer process dies some time after buffer full, what would the producer get?
%A: probably similar to above, since sender would send a 1-byte probe to trigger a Ack. Not getting the Ack tells sender something. This probe is builtin and mandatory , but functionally similar to (the optional) TCP Keepalive feature

I never studied these topics but they are quite common.

Q: same local IP@2 worker-sockets: delivery to who

Suppose two TCP server worker-sockets both have local address port 80. Both connections active.

When a packet comes addressed to, which socket will kernel deliver it to? Not both.

Not sure about UDP, but TCP connection is a virtual circuit or “private conversation”, so Socket W1  knows the client is If the incoming packet doesn’t have source IP:port matching that, then this packet doesn’t belong to the conversation.

accept()+select() : multiple persistent worker-sockets

I feel it is not that common. See is very relevant.

The naive design — a polling-thread (select/poll) to monitor new data on 2 worker-sockets + accept-thread to accept on the listening socket. The accept-thread must inform the polling thread after a worker-socket is born.

The proposed design —

  1. a single polling thread to watch two existing worker sockets W1/W2 + listening socket LL. select() or poll() would block.
  2. When LL is seen “ready”, select() returns, so the same thread will run accept() on LL and immediately get a 3rd worker-socket W3. No blocking:)
  3. process the data on the new W3 socket
  4. go back to select() on W1 W2 W3 LL
  • Note if any worker socket has data our polling thread must process it quickly. If any worker socket is hogging the polling thread, then we need another thread to offload the work.
  • Note all worker sockets, by definition, have identical local (i.e. server-side) port, since they all inherit the local port from LL.

[[tcp/ip socket programming in C]] shows a select() example with multiple server ports.


simultaneous send to 2 tcp clients #multicast emulation

Consider a Mutlti-core machine hosting either a forking or multi-threaded tcp server. The accept() call would return twice with file descriptors 5 and 6 for two new-born worker sockets. Both could have the same server-side address, but definitely their ports must be identical to the listening port like 80.

There will be 2 dedicated threads (or processes) serving file descriptor 5 and 6, so they can both send the same data simultaneously. The two data streams will not be exactly in-sync because the two threads are uncoordinated.

My friend Alan confirmed this is possible. Advantages:

  • can handle slow and fast receives at different data rates. Each tcp connection maintains its state
  • guaranteed delivery

For multicast, a single thread would send just a single copy  to the network. I believe the routers in between would create copies for distribution.  Advantages:

  • Efficiency — Router is better at this task than a host
  • throughput — tcp flow control is a speed bump


new socket from accept()inherits port# from listen`socket

[[tcp/ip sockets in C]] P96 has detailed diagrams, but this write-up is based on

Background — a socket is an object in memory, with dedicated buffers.

We will look at an sshd server listening on port 22. When accept() returns without error, the return value is a positive integer file descriptor pointing to a new-born socket object with its own buffers. I call it the “worker” socket. It inherits almost every property from the listening socket (but see tcp listening socket doesn’t use a lot of buffers for differences.)

  • local (i.e. server-side) port is 22
  • local address is a single address that client source code specified. Note sshd server could listen on ANY address, or listen on a single address.
  • … so if you look at the local address:port, the worker socket and the listening socket may look identical, but they are two objects!
  • remote port is some random high port client host assigned.
  • remote address is … obvious
  • … so if you compare the worker-socket vs listening socket, they have
    • identical local port
    • different local ip
    • different remote ip
    • different remote port

tcp listening socket doesn’t use a lot of buffers, probably

The server-side “worker” socket relies on the buffers. So does the client-side socket, but the listening socket probably doesn’t need the buffer since it exchanges very little data with client

That’s one difference between listening socket vs worker socket

2nd difference is the server (i.e. local) address field. Worker socket has a single server address filled in. The listening socket often has “ANY” as its server address.

non-forking STM tcp server

This a simple design. In contrast, I proposed a multiplexing design for a non-forking single-threaded tcp server in accept()+select() : multiple persistent worker-sockets

ObjectSpace manual P364 shows a non-forking single-threaded tcp server. It exchanges some data with each incoming client and immediate disconnects the client and goes back to accept()

Still, when accept() returns, it returns with a new “worker” socket that’s already connected to the client.

This new “worker” socket has the same port as the listening socket.

Since the worker socket object is a local variable in the while() loop, it goes out of scope and gets destructed right away.

y tcp server listens on ANY address

Background — a server host owns multiple addresses, as shown by ifconfig. There’s usually an address for each network “interface”. (Routing is specified per interface.) Suppose we have on IF1 on IF2

By default, an Apache server would listen on and This allows clients from different network segments to reach us.

If Apache server only listens to, but some clients can only connect to perhaps due to routing!

making out a little-endian 32-bit int

Let’s set the stage — We have a stream of bytes in little-endian format. Let’s understand it according to the spec. The struct is packed tight without padding.

Spec says left-most field is a char. It is always on the left-most, regardless of endianness. If we look at the 8 bits, they are normal. 0x41 is ‘A’. Within the 8 bits, no reordering due to endianness.

Spec says next four bytes is an integer. Most significant bit (suppose a one) is on the right end, representing 2^31. What’s the integer value? To work out by hand, we need to pick the four bytes as is — Byte1 Byte2 Byte3 Byte4. Then we reverse them into Byte4 Byte 3 Byte2 Byte1. Now this 32-bit integer is human-readable. The human-readable form is now a binary number taught in classrooms.

Note the software program still uses the original “Byte1 Byte2 Byte3 Byte4” and can print out the correct integer value.

Spec says next four bytes is a float. There’s nothing I can do to make out its value without a computer, so I don’t bother to rearrange the bytes.

Next 2 bytes is a string like “XY”. First byte is “X”. Endian-ness doesn’t bother us.

after fork(): threads,sockets.. #Trex

I have read about fork() many times without knowing these details, until Trex interviewer asked !

–based on

The child process is created with a single thread—the one that called fork(). The entire virtual address space of the parent is replicated in the new process, including the states of pthread mutexes, pthread condition variables, and other pthreads objects In particular, if in parent process a lock was held by some other thread t2, then child process only has the main thread (which called fork()) and no t2 but the lock is still unavailable. This is a common problem, addressed in

The very 1st instruction executed in Child is the instruction after fork() — as proven in

The child inherits copies of the parent’s set of open file descriptors, including stdin/stdout/stderr. Child process should usually close them.

Special case — socket file descriptor inherited. See


multicast: IV care only about bookish nlg !!practical skills

Hi friends,

I recently used multicast for a while and I see it as yet another example of the same pattern — technical interviewers care about deep theoretical knowledge not practical skills.

Many new developers don’t know multicast protocol uses special IP addresses. This is practical knowledge required on my job, but not asked by interviewers.

Unlike TCP, there’s not a “server” or a “client” in a multicast set-up. This is practical knowledge in my project but not asked by interviewers.

When I receive no data from a multicast channel, it’s not obvious whether nobody is sending or I have no connectivity. (In contrast, with TCP, you get connection error if there’s no connectivity. See tcp: detect wire unplugged.) This is practical knowledge, but never asked by interviewers.

I never receive a partial message by multicast, but I always receive partial message by TCP when the message is a huge file. This is reality in my project, but never asked by any interviewer.

So what do interviewers focus on?

  • packet loss — UDP (including multicast) lacks delivery guarantee. This is a real issue for system design, but I seldom notice it.
  • higher efficiency than TCP — I don’t notice it, though it’s a true.
  • socket buffer overflow — should never happen in TCP but could happen in UDP including multiast. This knowledge is not needed in my project.
  • flow control — TCP receiver can notify sender to reduce sending speed. This knowledge is not needed in many projects.
  • non-blocking send/receive — not needed in any project.

So what can we do? Study beyond what’s needed in the project. (The practical skills used is only 10% of the interview requirements.) Otherwise, even after 2 years using multicast in very project, I would still look like as a novice to an interviewer.

Without the job interviews, it’s hard to know what theoretical details are required. I feel a multicast project is a valuable starting point to get me started. I can truthfully mention multicast in my resume. Then I need to attend interviews and study the theoretical topics.

check a receiver socket is alive: tcp^udp

Q: given a TCP receiver socket, how do you tell if it’s connected to a session or disconnected?

Shanyou said when you recv() on the socket but got 0 it means disconnected. shows recv() return value of 0 indicates dead connection i.e. disconnected. uses getsockopt()

Q: given a UDP multicast receiver socket, how do you tell if it’s still has a live subscription to the multicast group?

%%A: I guess you can use getsockopt() to check socket /aliveness/. If alive but no data, then the group is quiet

non-blocking socket readiness: alternatives2periodic poll

Some Wells interviewer once asked me —

Q: After your non-blocking send() fails due to a full buffer, what can you do to get your data sent ASAP?

Simple solution is retrying after zero or more millisecond. Zero would be busy-weight i.e. spinning CPU. Non-zero means unwanted latency.

A 1st alternative is poll()/select() with a timeout, and immediately retry the same. There’s basically no latency. No spinning either. The linux proprietary epoll() is more efficient than poll()/select() and a popular solution for asynchronous IO.

2nd alternative is SIGIO. says it doesn’t waste CPU. P52 [[tcp/ip sockets in C]] also picked this solution to go with non-blocking sockets. is actually a concise overview of several alternatives

  • non-blocking socket
  • select/poll/epoll
  • .. other tricks

tcp: one of 3 client-receivers is too slow

See also no overflow]TCP slow receiver #non-blocking sender

This is a real BGC interview question

Q: server is sending data fast. One of the clients (AA) is too slow.

Background — there will be 3 worker-sockets. The local address:port will look identical among them if the 3 clients connect to the same network interface, from the same network segment.

The set-up is described in simultaneous send to 2 tcp clients #mimic multicast

Note every worker socket for every client has identical local port.


I believe the AA connection/session/thread will be stagnant. At a certain point [1] server will have to remove the (mounting) data queue and release memory — data loss for the AA client.

[1] can happen within seconds for a fast data feed.

I also feel this set-up overloads the server. A TCP server has to maintain state for each laggard client, assuming single-threaded multiplexing(?). If each client takes a dedicated thread then server gets even higher load.

Are 5 client Connections using 5 sockets on server? I think so. Can a single thread multiplex them? I think so.

## Y avoid blocking design

There are many contexts. I only know a few.

1st, let’s look at an socket context. Suppose there are many (like 500 or 50) sockets to process. We don’t want 50 threads. We prefer fewer, perhaps 1 thread to check each “ready” socket, transfer whatever data can be transferred then go back to waiting. In this context, we need either

  • /readiness notification/, or
  • polling
  • … Both are compared on P51 [[TCP/IP sockets in C]]

2nd scenario — GUI. Blocking a UI-related thread (like the EDT) would freeze the screen.

3rd, let’s look at some DB request client. The request thread sends a request and it would take a long time to get a response. Blocking the request thread would waste some memory resource but not really CPU resource. It’s often better to deploy this thread to other tasks, if any.

Q: So what other tasks?
A: ANY task, in the thread pool design. The requester thread completes the sending task, and returns to the thread pool. It can pick up unrelated tasks. When the DB server responds, any thread in the pool can pick it up.

This can be seen as a “server bound” system, rather than IO bound or CPU bound. Both the CPU task queue and the IO task queue gets drained quickly.


no overflow]TCP slow receiver #non-blocking sender

Q: Does TCP receiver ever overflow due to a fast sender?

A: See

A: should not. When the receiver buffer is full, the receiver sends AdvertizedWindowSize to informs the sender. If sender app ignores it and continues to send, then sent data will remain in the send buffer and not sent over the wire. Soon the send buffer will fill up and send() will block. On a non-blocking TCP socket, send() returns with error only when it can’t send a single byte. (UDP is different.)

Non-block send/receive operations either complete the job or returns an error.

Q: Do they ever return with part of the data processed?
A: Yes they return the number of bytes transferred. Partial transfer is considered “completed”.


UDP/TCP socket read buffer size: can be 256MB

For my UDP socket, I use 64MB.
For my TCP socket, I use 64MB too!

These are large values and required kernel turning. In my linux server, /etc/sysctl.conf shows these permissible read buffer sizes:

net.core.rmem_max = 268435456 # —–> 256 MB
net.ipv4.tcp_rmem = 4096   10179648   268435456 # —–> 256 MB

Note a read buffer of any socket is always maintained by the kernel and can be shared across processes [1]. In my mind, the TCP/UDP code using these buffers is kernel code, like hotel service. Application code is like hotel guests.

[1] Process A will use its file descriptor 3 for this socket, while Process B will use its file descriptor 5 for this socket.

TCP/UDP: partial or multiple messages in one buffer

This is often mentioned in IV. At least you can demonstrate your knowledge.

What if the UDP datagram is too big for recv() i.e. specified buffer length is too small? P116 [[tcp/ip soclets in C]] seems to say the oversize message is silently truncated.

UDP recv() will only return a single “logical” message [1]. I believe TCP can put partial or multiple messages into one “buffer” for recv().

Q: if my buffer is big enough, will my UDP recv() ever truncate a msg?
%%A: never

Note IP would always deliver a whole msg or miss a whole msg, never a partial msg. See P 329 [[comp networking]]

[1] a logical msg is the payload from one send()

select^poll # phrasebook

Based on, which I respect.

  • descriptor count — up to 200 is fine with select(); 1000 is fine with poll(); Above 1000 consider epoll
  • time-out precision — poll/epoll has millisec precision. select() has nanosec, a million times higher precision, but only embedded devices need such precision.
  • single-threaded app — poll is just as fast as epoll. epoll() excels in MT. has sample code on poll().

sharing port or socket #index page

Opening example – we all know that once a local endpoint is occupied by a tcp server process, another process can’t bind to it.

However, various scenarios exist to allow some form of sharing.

Sliding-^Advertised- window size has real life illustration using wireshark.

  • AWS = amount of free space on receive buffer
    • This number, along with the ack seq #, are both sent from receiver to sender
  • lastSeqSent and lastSeqAcked are two sender control variable in the sender process.

SWS indicates position within a stream; AWS is a single integer.

Q: how are the two variables updated during transmission?
A: When an Ack for packet #9183 is received, sender updates its control variable “lastSeqAcked”. It then computes how many more packets to send, ensuring that “lastSeqSent – lastSeqAcked < AWS

  • SWS (sliding window size) = lastSeqSent – lastSeqAcked = amount of transmitted but unacknowledged bytes, is a derived control variable in the sender process, like a fluctuating “inventory level”.
  • SWS is a concept; AWS is a TCP header field
  • receive buffer size — is sometimes incorrectly referred as window size
  • “window size” is vague but usually refers to SMS
  • AWS is probably named after the sender’s sliding window.
  • receiver/sender — only these players control the SWS, not the intermediate routers etc.
  • too large — large SWS is always set based on large receive buffer, perhaps AWS
  • too small — underutilized bandwidth. As explained in linux tcp buffer^AWS tuning, high bandwidth connections should use larger AWS.

Q: how are AWS and SWS related?
A: The sender adjusts lastSeqSent (and SWS) based on the feedback of AWS and lastSeqAcked.

socket accept() key points often missed

I have studied accept() many times but still unfamiliar.

Useful as zbs, and perhaps QQ, rarely for GTD…

Based on P95-97 [[tcp/ip socket in C]]

  • used in tcp only
  • used on server side only
  • usually called inside an endless loop
  • blocks most of the time, when there’s no incoming new connections. The existing clients don’t bother us as they communicate with the “child” sockets independently. The accept() “show” starts only upon a new incoming connection
    • thread remains blocked, starting from receiving the incoming until a newborn socket is fully Established.
    • at that juncture the new remote client is probably connected to the newborn socket, so the “parent thread[2]” have the opportunity/license to let-go and return from accept()
    • now, parent thread has the newborn socket, it needs to pass it to a child thread/process
    • after that, parent thread can go back into another blocking accept()
  • new born or other child sockets all share the same local port, not some random high port! Until now I still find this unbelievable. confirms it.
  • On a host with a single IP, 2 sister sockets would share the same local ip too, but luckily each socket structure has at least 4 [1] identifier keys — local ip:port / remote ip:port. So our 2 sister sockets are never identical twins.
  • [1] I omitted a 5th key — protocol as it’s a distraction from the key point.
  • [2] 2 variations — parent Thread or parent Process.

in-depth article: epoll illustrated #SELECT

(source code is available for download in the article)

Compared to select(), the newer linux system call epoll() is designed to be more performant.

Ticker Plant uses epoll. No select() at all. is a nice article with sample code of a TCP server.

  • bind(), listen(), accept()
  • main() function with an event loop. In the loop
  • epoll_wait() to detect
    • new client
    • new data on existing clients
    • (Using the timeout parameter, it could also react to a timer events.)

I think this toy program is more readable than a real-world epoll server with thousands of lines.

TCP client set-up steps #connect()UDP

TCP Client-side is a 2-stepper (look at Wikipedia and [[python ref]], among many references)
1) [SC] socket()
2) [C] connect()

[SC = used on server and client sides]
[C=client-only. seldom/never used on server-side.]

Note UDP is connection-less but connect() can be used too — to set the default destination. See

Under TCP, The verb connect() means something quite different — “reach across and build connection”[1]. You see it when you telnet … Also, server-side don’t make outgoing connections, so this is used by TCP client only. When making connection, we often see error messages about server refusing connection, because no server is “accepting”.

[1] think of a foreign businessman traveling to China to build guanxi with local government officials.


multicast address ownership#eg exchanges shows a few hundred big companies including exchanges. For example, one exchange multicast address falls within the range to Inter-continental Exchange, Inc.

It’s educational to compare with a unicast IP address. If you own such an unicast address, you can put it on a host and bind an http server to it. No one else can bind a server to that uncast address. Any client connecting to that IP will hit your host.

As owner of a multicast address, you alone can send datagrams to it and (presumably) you can restrict who can send or receive on this group address. Alan Shi pointed out the model is pub-sub MOM.

Twitter hashtab is another analogy.

UDP^TCP #TV-channel is relevant.

  • FIFO — TCP; UDP — packet sequencing is uncontrolled
  • Virtual circuit — TCP; UDP — datagram network
  • Connectionless — UDP ; TCP — Connection-oriented
  • Channel vs Connection — In RTS and xtap, we use the analogy of “TV channel” for multicast. TCP uses “Connection”.

With http, ftp etc, you establish a Connection (like a session). No such connection for UDP communication.

Retransmission is part of — TCP; UDP — application layer (not network layer) on receiving end must request retransmission.

To provide guaranteed FIFO data delivery, over unreliable channel, TCP must be able to detect and request retransmission. UDP doesn’t bother. An application built on UDP need to create that functionality, as in the IDC (Interactive Data Corp) ticker plant. Here’s one simple scenario (easy to set up as a test):

  • sender keeps multicasting
  • shut down and restart receiver.
  • receiver detects the sequence number gap, indicate message loss during the down time.
  • Receiver request for retransmission.


TCP listening socket shared by2processes #mcast

Common IV question: In what scenarios can a listening socket (in memory) be shared between 2 listening processes?

Background — a socket is a special type of file descriptor (at least in unix). Consider an output file handle. By default, this “channel” isn’t shared between 2 processes. Similarly, when a packet (say a price) is delivered to a given network endpoint, the kernel must decide which process to receive the data, usually not to two processes.

To have two processes both listening on the same listening-socket, one of them is usually a child of the other. The webpage in [1] and my code in show a short python code illustrating this scenario. I tested. q(lsof) and q(ss) commands both (but not netstat) show the 2 processes listening on the same endpoint. OS delivers the data to A B A B… shows an advanced kernel feature to let multiple processes bind() to the same endpoint.

For multicast (UDP only) two processes can listen to the same UDP endpoint. See [3] and [2]

A Unix domain socket can be shared between two unrelated processes.





joining/leaving a multicast group

Every multicast address is a group address. In other words, a multicast address identifies a group.

Sending a multicast datagram is much simpler than receiving…

[1] is a concise 4-page introduction. Describes joining/leaving.

[2] has sample code to send/receive. Note there’s no server/client actually.


2 Active connections on 1 TCP server IP/port

This is not the most common design, but have a look at the following output:

remote          local        state

What needs to be unique, is the 5-tuple (protocol, remote-ip, remote-port, local-ip, local-port)… so this situation can exist. [[tcp/ip sockets in C]] P100 has a full section on this topic. also says “Multiple worker-sockets on the same TCP server can share the same server-side IP/Port pair as long as they are associated with different client-side IP/Port pairs”. This “accept, move the connection to a dedicated server socket, then go back to accept()” is the common design — On each incoming connection, the listening TCP server will start a new thread/task/process  using a new “worker” socket on the server side. Note the new worker socket shares server ip:port with the original listening socket is my experiment. It pushes the concept of “sharing” further  — two TCP serves sharing a single socket not just a single ip:port endpoint!

tcp client bind()to non-random port: j4

TCP client doesn’t specify local endpoint. It only specifies the remote endpoint.

  • The local port is random. It’s conceptually an “outgoing” port as the client reaches out to the remote server.
  • The local IP address is probably chosen by the kernel, based on the remote IP address specified.


A Barclays TCP interview asked

Q: When a tcp client runs connect(), can it specify a client-side port rather than using a random port assigned by the system?
A: use bind() —

* I feel the client port number can work like a rudimentary tag for a “special” client thread
* similarly, debugging —
* firewall filtering on client port —
* some servers expect client to use a low port —

Note client bind() can also specify a particular client ip address (multihoming). Client side bind() defines the local port and interface address for the connection. In fact, connect() does an implicit bind(“”, 0) if one has not been done previously (with zero being taken as “any”). See

multicast address 1110xxxx #briefly

By definition, multicast addresses all start with 1110 in the first half byte. Routers seeing such a destnation (never a source) address knows the msg is a multicast msg.

However, routers don’t forward any msg with destnation address through because these are local multicast addresses. I guess these local multicast addresses are like 192.168.* addresses.

SO_REUSEPORT TCP server socket option – hungry chicks

With SO_REUSEPORT option, multiple TCP server processes could bind() to the same server endpoint. Designed for the busiest multithreaded servers. – a bunch of hungry chicks competing to get the next worm the mother delivers. The mother can only give the worm to one chick at a time. SO_REUSEPORT option sets up a chick family. When an incoming connection hits the accept(), kernel picks one of the accepting threads/processes and delivers the data to it alone.

See  + my socket book P102.

TCP server socket lingering briefly af host process exits

[[tcp/ip sockets in C]] P159 points out that after a host process exits, the socket enters the TIME_WAIT state for some time, visible in netstat.

Problem is, the socket still binds to some address:port, so if a new socket were to attempt bind() to the same it might fail. The exact rule is possibly more complicated but it does happen.

The book mentions 2 solutions:

  1. wait for the dying socket to exit TIME_WAIT. After I kill the process, I have seen this lingering for about a minute then disappearing.
  2. new socket to specify SO_REUSEADDR.

There are some simple rules about SO_REUSEADDR, so the new socket must be distinct from the existing socket in at least one of the 4 fields. Otherwise the selection rule in this post would have been buggy.

(server)promiscuous socket^connected socket

[[tcp/ip sockets in C]] P100 has a diagram showing that an incoming packet will be matched against multiple candidate listening sockets:

  • format: {local address:local port / remote address:remote port}
  • Socket 0: { *:99/*:*}
  • Socket 1: {*:*}
  • Socket 2: {} — this one has the remote address:port populated because it’s an Established connection)

An incoming packet need to match all fields otherwise it’s rejected.

However it could find multiple candidate sockets. Socket 0 is very “promiscuous”. The rule (described in the book) is — the more wild cards, the less likely selected.

(Each packet must be delivered to at most 1 socket as far as I know.)

##which common UDP/TCP functions are blocking

Q: which blocking functions support a timeout?
A: All

A non-blocking send fails when it can’t send a single byte, usually because destination send buffer (for TCP) is full. For UDP, see [4]

Note read() and write() has similar behavior. See send()recv() ^ write()read() @sockets and

[1] meaning of non-blocking connect() on TCP is special. See P51[[tcp/ip sokets in C]] and
[2b] accept() returning with timeout is obscure knowledge — see accept(2) manpage
[2c] accept() is designed to return immediately on a nonblocking socket — and
[3] select() on a non-blocking socket is obscure knowledge —  See
[4] UDP has no send buffer but some factors can still cause blocking … obscure knowledge
[5] close() depends on SO_LINGER and non-blocking by default …

t/out default flags arg to func fcntl on entire socket touching TCP buffers?
y select blocking not supported? still blocking! [3]  no
epoll blocking not supported?  no
y recv blocking can change to NB [6] can change to NB  yes
y send/write TCP blocking can change to NB [6] can change to NB  yes
y recvfrom blocking can change to NB [6] can change to NB  yes
y sendto UDP can block [4] can change to NB [6] can change to NB  yes
y close() with SO_LINGER not the default not supported yes yes
close() by default NB [5] not supported can change to blocking yes
y[2b] accept by default blocking not supported yes yes
accept on NB socket [2c] not supported yes no
LG connect()TCP blocking not supported? can change to NB [1] no
LG connect()UDP NB NotApplicable

send()recv() ^ write()read() @sockets

Q: A socket is also a file descriptor, so why bother with send()recv() when you can use write()read()?

A: See

send()recv() are recommended, more widely used and better documented.

[[linux kernel]] P623 actually uses read()write() for udp, in stead of sendto()recvfrom(), but only after a call to connect() to set the remote address

socket stats monitoring tools – on-line resources

This is a rare interview question, perhaps asked 1 or 2 times. I don’t want to overspend.

In ICE RTS, we use built-in statistics modules written in C++ to collect the throughput statistics.

If you don’t have source code to modify, I guess you need to rely on standard tools.

broadcast^multicast shows(suggests?) that broadcast is also time-efficient since sender only does one send. However, multicast is smarter and more bandwidth-efficient.

IPv6 disabled broadcast — to prevent disturbing all nodes in a network when only a few are interested in a particular service. Instead it relies on multicast addressing, a conceptually similar one-to-many routing methodology. However, multicasting limits the pool of receivers to those that join a specific multicast receiver group.

multicast – highly efficient@@ # my take

(Note virtually all MC apps use UDP.)
To understand MC efficency, we must compare with UC (unicast) and BC (broadcast). First we need some “codified” metrics —
  • TT = imposing extra Traffic on network, which happens when the same packet is sent multiple times through the same network.
  • RR = imposing extra processing workload on the Receiver host, because the packet is addressed TO “me” (pretending to be a receiver). If “my” address were not mentioned in the packet, then I would have ignored it without processing.
  • SS = imposing extra processing workload by the Sender — a relatively low priority.
Now we can contrast MC, UC and BC. Suppose there are 3 receiver hosts to be notified, and 97 other hosts to leave alone, and suppose you send the message via —
  1. UC – TT not RR — sender dispatches 3 copies each addressed to a single host.
  2. BC – RR not TT — every host on the network sees a packet addressed to it though most would process then ignore it, wasting receiver’s time. When CEO sends an announcement email, everyone is in the recipient list.
  3. MC – not RR not TT. However, MC can still flood the network.

multicast: video streaming^mkt data

These are the 2 main usages of IP multicast. In both, Lost packets are considered lost forever. Resend is sometimes considered “too late”.

I think some of the world’s most cutting-edge network services — live price feed, live event broadcast, VOD, multiplayer gaming — rely on multicast.

Multicast is more intelligent data dissemination than broadcast, and faster than unicast. Intelligence is built into routers.

I believe JMS publish is unicast based, not broadcast based. The receivers don’t comprise an IP broadcast group. Therefore JMS broker must deliver to one receiver at a time.

tcp/udp use a C library, still dominating

History – The socket library was created in the 1980’s and has stood the test of time. Similar resilience is seen in SQL, Unix, and mutex/condition constructs

In the socket programming space, I feel C still dominates, though java is a contender (but i don’t understand why).

Perl and python both provide thin wrappers over the C socket API. (Python’s wrapper is a thin OO wrapper.)

Sockets are rather low level /constructs/ and performance-critical — latency and footprint. OO adds both overheads without adding well-appreciated or much-needed OO nicety. If you need flexibility, consider c++ templates. All modern languages try to encapsulate/wrap the low-level details and present an easier API to upper-layer developers.

Choose one option only between —
AA) If you write tcp/ip code by hand, then you probably don’t need OO wrappers in c#, java or python
BB) If you like high-level OO wrappers, then don’t bother with raw sockets.

My bias is AA, esp. on Wall St. Strong low-level experience always beats (and often compensates for lack of) upper-layer experience. If you have limited time, invest wisely.

I feel one problems with java is, sockets are low-level “friends” of ints and chars, but java collections need auto-boxing. If you write fast java sockets, you must avoid auto-boxing everywhere.

Q: is java socket based on C api?
A: yes

b4 and af select() syscall

Note select() is usually used on the server-side. It allows a /single/ server thread to handle hundreds of concurrent clients.
— B4 —
open the sockets. Each socket is represented by an integer file descriptor. It can be saved in an int array. (A vector would be better, but in C the array also looks like an int pointer).

FD_SET(socketDes1, &readfds); /* add socketDes1 to the readfds */

select() function argument includes readfds — the list of existing sockets[1]. Select will test each socket.

— After —
check the set of incoming sockets and see which socket is “ready”.

FD_ISSET(socketDes1, &readfds)

If ready, Then you can either read() or recvfrom() has sample code.

[1] Actually Three independent sets of file descriptors are watched, but for now let’s focus on the first — the incoming sockets

select() syscall and its drv

select() is the most “special” socket kernel syscall (not a “standard library function”). It’s treated special.

– Java Non-blocking IO is related to select().
– Python has at least 3 modules — select, asyncore, asynchat all built around select()
– dotnet offers 2 solutions to the same problem:

  1. a select()-based and
  2. a newer asynchronous solution to the same problem

socket file desc in main() function

I see real applications with main() function declaring a stack variable my_sockfd = socket(…).

Q: is the my_sockfd stackVar in scope throughout the app's lifetime?

Q2: what if the main() function exits with some threads still alive? Will the my_sockfd object (and variable) disappear?
A: yes see Q2b.

Q2b: Will the app exit?
A: yes. See blog

how does reliable multicast work #briefly

I guess a digest of the msg + a sequence number is sent out along with the msg itself.

See wiki.

One of the common designs is PGM —

While TCP uses ACKs to acknowledge groups of packets sent (something that would be uneconomical over multicast), PGM uses the concept of Negative Acknowledgements (NAKs). A NAK is sent unicast back to the host via a defined network-layer hop-by-hop procedure whenever there is a detection of data loss of a specific sequence. As PGM is heavily reliant on NAKs for integrity, when a NAK is sent, a NAK Confirmation (NCF) is sent via multicast for every hop back. Repair Data (RDATA) is then sent back either from the source or from a Designated Local Repairer (DLR).

PGM is an IETF experimental protocol. It is not yet a standard, but has been implemented in some networking devices and operating systems, including Windows XP and later versions of Microsoft Windows, as well as in third-party libraries for Linux, Windows and Solaris.

reliable multicast – basics

First, use a distinct sequence numbers for each packet. When one of the receivers notices a missed packet, it asks sender to resend ….to all receivers.

As an optimization, use bigger chunks. Use a window of packets. If the transmission is reliable, then expand the window size, so each sequence number covers to a (much) larger chunk of packets.

These are the basic reliability techniques of TCP. Reliable multicast could borrow these from TCP.

Note real TCP isn’t usable for multicast as each TCP transmission has exactly one sender and one receiver. I think entire TCP protocol is based on that premise — unicast circuit.

IV – UDP/java

My own Q: How do you make UDP reliable?
A: sequence number + gap management + retransmission

My own Q: can q(snoop) capture UDP traffic?

Q: Why would multicast use a different address space? Theoretical question?
A: each MC address is a group…

Q: why would a server refuse connection? (Theoretical question?)
%%A: perhaps tcp queue is full, so application layer won’t see anything


Q: How do you avoid full GC
Q: what’s the impact of 64 bit JVM?
Q: how many package break releases did your team have in a year?
Q: In a live production system, how do you make configuration changes with minimal impact to existing modules?

5 parts in socket data structure

— Adapted from

Note accept() instantiates a socket object and returns a file descriptor for it. accept() doesn’t open a new port.

A socket object in memory consists of 5 things – (source ip, source port, destination ip, destination port, protocol). Here the protocol could TCP or UDP[1]. This protocol is identified in the packet from the ‘protocol’ field in the IP datagram.

Thus it is possible to have 2 different applications on the server communicating to to the same client on exactly the same 4-tuples but different in protocol field. For example

Apache at server talking on ( on TCP) and
World of warcraft talking on ( on UDP)

Both the client and server will handle it since protocol field in the IP packet in both cases is different even if all the other 4 fields are same.

[1] or others

tcp/udp client+server briefly #connect()UDP

(Note no unix-domain sockets covered here.)

As elaborated in

TCP server is socket()-bind-listen-accept
TCP client is socket()-connect

UDP server is socket()-bind
UDP client is socket()-bind { optional }

(UDP both sides similar. See the nice diagram and sample code in

However, after the set-up, to move the data, UDP supports several choices.

SSocket→ BBind→ LListen→ AAccept#details

My focus is internet-socket (not UnixDomain-socket) and server-side TCP. UDP and client-side will be addressed later.

“Bind before Listen” — also shows the same flow.

1) [SC] socket() system_call calls into the kernel to _creates_ a new socket and returns a socket FILE descriptor integer

2) [S] bind() specifies the local end point. The bind() is the choke point to specify (socket() doesn’t) the local end point. It can bind to “any” one address available to the hosting OS, but must bind to a fixed port(??). I know bind() can use INADDR_ANY to bind to multiple addresses. Some say each socket can bind to one address at a time but I don’t think so and I don’t care. To an api user like me, bind() can indeed connect me to multiple addresses.

In a multicast receiver (no “server” per se), bind() specifies the group port, not the local port.

3) [S] listen() is the choke point to specify the _LLLLLLength of the queue. Any incoming connection exceeding the queue capacity will hit “server busy”

4) [S] accept() is the only blocking call in the family —
** If no pending connections are present on the queue, and the socket is NOT marked as non-blocking, accept() blocks the caller until a connection is present. If the socket is marked non-blocking and no pending connections are present on the queue, accept() fails immediately with the error EAGAIN.

The name accept() means accept CONNECTIONs, so it’s used for connection-oriented TCP only.

2011 BGC c++IV #socket #done

Q3: what can go wrong during a write to a socket??? (LG)
Q3b: if buffer is full, will the writing thread block???
%A: by default it blocks, but it’s probably possible to configure it to return an error code or even thrown an exception

Q: blocking vs non-blocking socket?

Q: socket programming – what’s select ?

Q: Should base class dtor always be virtual?
%%A: I would say yes. If a dotr is really not virtual, then it’s not supposed to be a base class. However, SOF mentions: “If you want to prevent the deletion of an instance through a base class pointer, you can make the base class destructor protected and non-virtual; by doing so, the compiler won’t let you call deleteon a base class pointer.”

Q: how many ways to share data between 2 processes? How about shared memory

Q: synchronize 2 unix processes accessing a shared file?
A: named semaphore

fd_set in select() syscall, learning notes

First thing to really understand in select() is the fd_set. Probably the basis of the java Selector.

A fd_set is a struct holding a bunch[2] of file descriptors. (I don’t think there’s any boolean flag in it). A fd_set instance is used as an in/out parameter to select().
– upon entry, it carries the list of sockets to check
– upon return, it carries the subset of those sockets found “dirty” [1]

FD_SET(fd, fdSet) adds a file descriptor “fd” to fdSet. Used before select().
FD_ISSET(fd, fdSet) checks if fd is part of fdSet. Used after select().

Now we understand fd_set, let’s look at …
First parameter to select() is max_descriptor. File Descriptors are numbered starting at zero, so the max_descriptor parameter must specify a value that is one greater than the largest descriptor number that is to be tested. I see a lot of confusion looking at how programmers populate this parameter.


[1] “ready” is the better word
[2] if you don’t have any file descriptor, then you should pass in NULL, not an empty fd_set

select() syscall lesson1: fd_set …

A fd_set instance is a struct (remember — C API) holding a bunch of file descriptors. I don’t think there’s any boolean flag in it.

A fd_set instance is used as an in/out parameter to select(). Only pointer arguments support in/out.
– upon entry, it carries the list of sockets to check
– upon return, it carries the subset of those sockets found “dirty” [1]

FD_SET(fd, fdSet) is a free function (remember — C API) which adds a file descriptor “fd” into a fdSet. Used before select().

FD_ISSET(fd, fdSet) is a free function (remember — C API) which checks if fd is part of fdSet. Used after select().

First parameter to select() is max_descriptor. File Descriptors are numbered starting at zero, so the max_descriptor parameter must specify a value that is one greater than the largest descriptor number that is to be tested.


[1] “ready” is the better word

select() syscall multiplex vs 1 thread/socket ]mkt-data gateway

Low-volume market data gateways could multiplex using select() syscall — Warren of CS. A single thread can service thousands of low-volume clients. (See my brief write-up on epoll) Blocking socket means each read() and write() could block an entire thread. If 90% of 1000 sockets have full buffers, then 900 threads would block in write(). Too many threads slow down entire system.

A standard blocking socket server’s main thread blocks in accept(). Upon return, it gets a file handle. It could save the file handle somewhere then go back to accept(). Over time it will collect a bunch of file handles, each being a socket for a particular network client. Another server thread can then use select() to talk to multiple clients, whild the main accept() thread continues to wait for new connections.

However, in high volume mkt data gateways, you might prefer one dedicated thread per socket. This supposedly reduces context switching. I believe in this case there’s a small number of sockets preconfigured, perhaps one socket per exchange. In such a case there’s no benefit in multiplex. Very different from a google web server.

This dedicated thread may experience short periods of silence on the socket – I guess market data could come in bursts. I was told the “correct” design is spin-wait, with a short sleep between iterations. I was told there’s no onMsg() in this case. I guess onMsg() requires another thread to wake up the blocking thread. Instead, the spin thread simply sleeps briefly, then reads the socket until there’s no data to read.

If this single thread and this socket are dedicated to each other like husband and wife, then there’s not much difference between blocking vs non-blocking read/write. The reader probably runs in an endless loop, and reads as fast as possible. If non-blocking, then perhaps the thread can do something else when socket buffer is found empty. For blocking socket, the thread is unable to do any useful work while blocked.

I was told UDP asynchronous read will NOT block.

socket operations in perl, python and c

(“i-socket” means internet socket. Our discussion doesn't cover UD-socket ie unix domain sockets.)

1st) and easy lesson – both perl and python provide wrappers over C-level syscalls, so function names are identical across C, perl and python.

2nd) Lesson – python uses a socket object as the Dictator (remember Benevolent Dictator For Life?) whereas Perl puts a socket FILEhandle as the first argument to those functions.

Like Perl, C puts a socket FILE descriptor as first argument to those functions. However, the C file descriptor is different from a Perl file handle —
– C file descriptor is an integer either for a socket or for a disk file
– Perl file handle is internally implemented like c++ i/o streams. As compared in the post on [[pipes, streams, io classes]], all of these are built on the PIPE concept

However, both C and Perl treat a socket like a file, to support read/write.

NIO q&&a

Prerequisite: blocking. See other posts

Q: Typically, how many active clients can an NIO server support?
A: thousands for a single-thread NIO server

Q: How about traditional IO?
A: typically hundreds

Q: can we have a single-thread server in traditional IO? P221
A: no. when data is unavailable on the socket, the server thread would “block” and stop doing anything at all. Since there are no other threads to do anything either, the entire server freezes up in waiting, ignoring any events or signals or inputs.

Q: basic NIO server uses a single thread or multiple threads?
A: p 233

Q: in traditional IO, what’s the threading, session, socket … allocation?
A: each client is linked to a server socket, a thread.

Q: why traditional io supports limited number of clients?
A: threads can’t multiply infinitely. (No real limit on sockets.). P227 shows 2nd reason

Q: problem of thread pool?
A: thousands of active clients would require same size (heavy!) of thread pool!

Q: in one sentence, explain a major complexity in NIO programming. P232

What q[bind]means to a java socket

Both types — ServerSocket and Socket — are bound to a [1] local address:port and never a remote one. “Bind” always implies “local”.

[1] one and only 1 local address:port.

Look at the ServerSocket.bind() and Socket.bind() API.

It may help to think of a server jvm and a client jvm.

It’s possible for a socket object (in JVM) to be unbound.

java ServerSocket.accept()

I think this should be one of the first important yet tricky socket methods to study and /internalize/. Memorize it’s extended signature and you will understand how it relates to other things.

See P221 [[ java threads ]].

A real server jvm always creates and uses 2 *types* of sockets — a single well-known listening socket, which on demand creates a “private socket” (aka data socket) for each incoming client request.

The new socket manufactured by accept() has the remote address:port set to the client’s address:port and is already connected to it.

I think the new socket initiates the connection, therefore this socket is considered a “socket” and not a ServerSocket. Nothing found online to confirm this.

java ServerSocket HAS-A queue

Default queue of 50 waiting “patrons” to our restaurant. If a patron arrives when the queue is full, the connection is refused.

Each “successful” patron would be allocated a dining table ie an address:port on the server-side.

The operating system stores incoming connection requests addressed to a particular port in a first-in, first-out queue. The default length of the queue is normally 50, though this can vary from operating system to operating system. Some operating systems (though not Solaris) have a maximum queue length, typically five. On these systems, the queue length will be the largest possible value less than or equal to 50. After the queue fills to capacity with unprocessed connections, the host refuses additional connections on that port until slots in the queue open up. Many (though not all) clients will try to make a connection multiple times if their initial attempt is refused. Managing incoming connections and the queue is a service provided by the operating system; your program does not need to worry about it. Several ServerSocket constructors allow you to change the length of the queue if its default length isn’t large enough; however, you won’t be able to increase the queue beyond the maximum size that the operating system supports.

java clientside inet-sockets — briefly

Look at constructor signatures. As a client side internet socket (not UD-socket), the most basic address:port pair needed is the REMOTE address:port.

Q: so how about the local address:port?
A: Usually, only after a socket is created with the remote address:port, does the socket need to bind() to a local address:port.

Q: Can a Socket object can be on server side or client side.
A: I think both. See ServerSocket.accept() javadoc. accept() manufactures a socket object in the server jvm.

Q: Can a java Socket object be associated to 2 connections? The output data would broadcast into both channels?