specify(by ip:port) multicast group to join

http://www.nmsl.cs.ucsb.edu/MulticastSocketsBook/ has zipped sample code showing:

mc_addr.sin_port = thePort;

bind(sock, (struct sockaddr *) &mc_addr, sizeof(mc_addr)); // bind to the group's port, not an arbitrary local port!
—-
mc_req.imr_multiaddr.s_addr = inet_addr("224.1.2.3");

setsockopt(sock, IPPROTO_IP, IP_ADD_MEMBERSHIP,
  (void*) &mc_req, sizeof(mc_req)); // join the group — this sends an IGMP join-request

Note setsockopt() actually sends a request — the join triggers an IGMP membership report on the wire!
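Putting the receiver fragments together — a minimal sketch, assuming the sample group 224.1.2.3 and an arbitrary port 4321 (error handling omitted):

```c
#include <arpa/inet.h>
#include <netinet/in.h>
#include <string.h>
#include <sys/socket.h>
#include <unistd.h>

int main(void) {
    unsigned short thePort = 4321;                 /* sample port, not from any spec */
    int sock = socket(AF_INET, SOCK_DGRAM, 0);

    struct sockaddr_in mc_addr;
    memset(&mc_addr, 0, sizeof(mc_addr));
    mc_addr.sin_family = AF_INET;
    mc_addr.sin_addr.s_addr = htonl(INADDR_ANY);
    mc_addr.sin_port = htons(thePort);             /* the group's port, not an arbitrary local port */
    bind(sock, (struct sockaddr *) &mc_addr, sizeof(mc_addr));

    struct ip_mreq mc_req;
    mc_req.imr_multiaddr.s_addr = inet_addr("224.1.2.3");
    mc_req.imr_interface.s_addr = htonl(INADDR_ANY);
    /* this setsockopt() actually triggers an IGMP membership report (join) */
    setsockopt(sock, IPPROTO_IP, IP_ADD_MEMBERSHIP, (void *) &mc_req, sizeof(mc_req));

    char buf[1500];
    recv(sock, buf, sizeof(buf), 0);               /* block until a datagram from the group arrives */
    close(sock);
    return 0;
}
```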

====That’s for multicast receivers.  Multicast senders use a simpler procedure —

mc_addr.sin_addr.s_addr = inet_addr("224.1.2.3");
mc_addr.sin_port = htons(thePort);

sendto(sock, send_str, send_len, 0, (struct sockaddr *) &mc_addr, …
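Spelled out fully, the sender side — a minimal sketch under the same assumptions (group 224.1.2.3, sample port 4321), with no bind() and no IGMP join:

```c
#include <arpa/inet.h>
#include <netinet/in.h>
#include <string.h>
#include <sys/socket.h>
#include <unistd.h>

int main(void) {
    unsigned short thePort = 4321;                 /* sample port, as above */
    const char *send_str = "hello group";
    int sock = socket(AF_INET, SOCK_DGRAM, 0);

    struct sockaddr_in mc_addr;
    memset(&mc_addr, 0, sizeof(mc_addr));
    mc_addr.sin_family = AF_INET;
    mc_addr.sin_addr.s_addr = inet_addr("224.1.2.3");
    mc_addr.sin_port = htons(thePort);

    /* no bind(), no IGMP join -- just sendto() the group address */
    sendto(sock, send_str, strlen(send_str), 0,
           (struct sockaddr *) &mc_addr, sizeof(mc_addr));
    close(sock);
    return 0;
}
```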


FIX^MOM: exchange connectivity

The upshot — some subsystems use FIX, not MOM; others use MOM, not FIX. A site could send FIX over MOM (such as RV), but that's not common.

It’s important to realize FIX is not a type of MOM system. Even though FIX uses messages, I believe they are typically sent over persistent TCP sockets, without middleware or buffering/queuing. RTS is actually a good example.

Q: does FIX have a huge message buffer? I think it’s small, like TCP’s, but TCP has flow control, so the sender will wait for the receiver.

Say we (sell-side or buy-side) have direct connectivity to an exchange. Between the exchange and our exterior gateway, it’s all sockets and possibly java NIO — no MOM. The data format could be FIX, or it could be the exchange’s proprietary format — here’s why.

The exchange often has an internal client-server protocol — the real protocol and a data-flow “choke point”. All client-server request/response relies solely on this protocol. The exchange often builds a FIX translation layer over this protocol (for obvious reasons). If we use their FIX protocol, our messages are internally translated from FIX to the exchange’s proprietary format. Therefore it’s obviously faster to pass messages directly in the exchange’s proprietary format. The exchange offers this access to selected hedge funds.

As an espeed developer told me, they ship a C-level dll (less often *.so [1]) library (not c++, not static library), to be used by the hedge fund’s exterior gateway. The hedge fund application uses the C functions therein to communicate with the exchange. This communication protocol is possibly proprietary. HotspotFX has a similar client-side library in the form of a jar. There’s no MOM here. The hedge fund gateway engine makes synchronous function calls into the C library.

[1] most trader workstations are on win32.

Note the dll or jar is not running in its own process, but rather loaded into the client process just like any dll or jar.
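To illustrate the in-process, synchronous style — a hypothetical sketch only; the library name and the xchg_* functions below are made up, not the actual espeed or HotspotFX API:

```c
#include <dlfcn.h>
#include <stdio.h>

/* hypothetical vendor signatures -- purely illustrative */
typedef int (*connect_fn)(const char *host, int port);
typedef int (*send_order_fn)(int session, const char *order);

int main(void) {
    /* the vendor .so/.dll is loaded into OUR process, like any shared library */
    void *lib = dlopen("libexchange_client.so", RTLD_NOW);        /* hypothetical name */
    if (!lib) { fprintf(stderr, "%s\n", dlerror()); return 1; }

    connect_fn xchg_connect = (connect_fn) dlsym(lib, "xchg_connect");        /* hypothetical */
    send_order_fn xchg_send = (send_order_fn) dlsym(lib, "xchg_send_order");  /* hypothetical */

    int session = xchg_connect("exchange.example.com", 9000);
    xchg_send(session, "BUY 100 XYZ @ 10.5");   /* synchronous call into the C library */

    dlclose(lib);
    return 0;
}
```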

Between this gateway machine and the trading desk interior hosts, the typical communication is dominated by MOM such as RV, 29West or Solace. In some places, the gateway translates/normalizes messages into an in-house standard format.

TCP blocking send() timeout

I see three modes

  1. non-blocking send() — immediate return if unable to send
  2. regular blocking send() — blocks forever, so the thread can’t do anything
  3. blocking send() with timeout:

SO_SNDTIMEO: sets the timeout value specifying the amount of time that an output function blocks because flow control prevents data from being sent. If a send operation has blocked for this time, it shall return with a partial count or with errno set to [EAGAIN] or [EWOULDBLOCK] if no data is sent. The default for this option is zero, which indicates that a send operation shall not time out. This option stores a timeval structure. Note that not all implementations allow this option to be set.
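A minimal sketch of mode 3 — setting SO_SNDTIMEO before a blocking send(); the 5-second value is arbitrary, and as noted above some implementations ignore this option:

```c
#include <errno.h>
#include <stdio.h>
#include <sys/socket.h>
#include <sys/time.h>
#include <sys/types.h>

/* sock is assumed to be a connected TCP socket */
int send_with_timeout(int sock, const char *buf, size_t len) {
    struct timeval tv;
    tv.tv_sec = 5;          /* arbitrary 5-second send timeout */
    tv.tv_usec = 0;
    setsockopt(sock, SOL_SOCKET, SO_SNDTIMEO, &tv, sizeof(tv));

    ssize_t n = send(sock, buf, len, 0);
    if (n < 0 && (errno == EAGAIN || errno == EWOULDBLOCK))
        fprintf(stderr, "send timed out: flow control blocked us\n");
    return (int) n;
}
```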

In the xtap library, the timeout isn’t implemented at all; the default is non-blocking. If we configure it to use 2), we can hit a strange problem — one of three receivers gets stuck but keeps its connection open. The other receivers are starved even though their receive buffers are free.

de-multiplex packets bearing Same dest ip:port Different source

see de-multiplex by-destPort: UDP ok but insufficient for TCP

For UDP, the 2 packets are always delivered to the same destination socket. Source IP:port are ignored.

For TCP, if there are two matching worker sockets (distinguished by remote ip:port), each packet is delivered to its own socket — e.g. two ssh sessions.

If there’s only a listening socket, then both packets delivered to the same socket, which has wild cards for remote ip:port.
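A sketch of the UDP case: one socket bound to a sample port 5000 receives datagrams from any number of senders; recvfrom() reports each source, but every datagram lands on this same socket:

```c
#include <arpa/inet.h>
#include <netinet/in.h>
#include <stdio.h>
#include <string.h>
#include <sys/socket.h>

int main(void) {
    int sock = socket(AF_INET, SOCK_DGRAM, 0);

    struct sockaddr_in local;
    memset(&local, 0, sizeof(local));
    local.sin_family = AF_INET;
    local.sin_addr.s_addr = htonl(INADDR_ANY);
    local.sin_port = htons(5000);                 /* sample port */
    bind(sock, (struct sockaddr *) &local, sizeof(local));

    char buf[1500];
    struct sockaddr_in src;
    socklen_t srclen;
    for (int i = 0; i < 2; i++) {
        srclen = sizeof(src);
        /* datagrams from DIFFERENT senders still arrive on this one socket */
        recvfrom(sock, buf, sizeof(buf), 0, (struct sockaddr *) &src, &srclen);
        printf("datagram from %s:%d\n",
               inet_ntoa(src.sin_addr), ntohs(src.sin_port));
    }
    return 0;
}
```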

UDP socket is identified by two-tuple; TCP socket is by four-tuple

Based on [[computer networking]] P192. See also de-multiplex by-destPort: UDP ok but insufficient for TCP

  • Note the term in subject is “socket” not “connection”. UDP is connection-less.

A TCP/IP packet carries four relevant header fields — source and destination ports in the TCP header, plus source and destination IPs in the enclosing IP header.

A TCP socket has internal data structure for a four-tuple — Remote IP:port and local IP:port.

A regular TCP “Worker socket” has all four items populated, to represent a real “session/connection”, but a Listening socket could have wild cards in all but the local-port field.
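A sketch of the four-tuple on a worker socket — after accept(), getsockname() shows the same local port as the listening socket, while the peer address returned by accept() fills in the remote half (port 2222 is a stand-in for sshd’s 22):

```c
#include <arpa/inet.h>
#include <netinet/in.h>
#include <stdio.h>
#include <string.h>
#include <sys/socket.h>

int main(void) {
    int lsock = socket(AF_INET, SOCK_STREAM, 0);

    struct sockaddr_in local;
    memset(&local, 0, sizeof(local));
    local.sin_family = AF_INET;
    local.sin_addr.s_addr = htonl(INADDR_ANY);    /* wildcard local IP on the listening socket */
    local.sin_port = htons(2222);                 /* stand-in for port 22 */
    bind(lsock, (struct sockaddr *) &local, sizeof(local));
    listen(lsock, 5);

    struct sockaddr_in peer;
    socklen_t len = sizeof(peer);
    int worker = accept(lsock, (struct sockaddr *) &peer, &len);

    struct sockaddr_in mine;
    len = sizeof(mine);
    getsockname(worker, (struct sockaddr *) &mine, &len);
    /* worker socket: local port equals the listening port; remote ip:port now populated */
    printf("local  %s:%d\n", inet_ntoa(mine.sin_addr), ntohs(mine.sin_port));
    printf("remote %s:%d\n", inet_ntoa(peer.sin_addr), ntohs(peer.sin_port));
    return 0;
}
```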

fragmentation: IP^TCP #retrans

See also IP (de)fragmentation #MTU,offset

Interviews are unlikely to go this deep, but it’s good to over-prepare here. This comparison ties together many loose ends like Ack, retrans, seq resets..

[1] IP fragmentation can cause excessive retransmissions when fragments encounter packet loss and reliable protocols such as TCP must retransmit ALL of the fragments in order to recover from the loss of a SINGLE fragment
[2] see TCP seq# never looks like 1,2,3

| | IP fragmentation | TCP fragmentation |
|---|---|---|
| minimum guarantees | all-or-nothing; never a partial packet | in-sequence stream without gaps |
| reliability | unreliable | fully reliable |
| name for a “part” | fragment | segment |
| sequencing | each fragment has an offset | each segment has a seq# |
| .. continuous? | yes | no! [2] |
| .. reset? | yes, for each packet | loops back to 0 right before overflow |
| Ack | no such thing | positive Ack needed |
| gap detection | using offset | using seq# [2] |
| id for the “msg” | identification number | no such thing |
| end-of-msg | flag in last fragment | no such thing |
| out-of-sequence? | likely | likely |
| .. reassembly | based on id/offset/flag | based on seq# |
| .. retrans | not by IP [1] | commonplace |

retrans: FIX^TCP^xtap

The FIX part is very relevant to real-world OMS work. The devil is in the details.

IP layer offers no retrans. UDP doesn’t support retrans.

| | TCP | FIX | xtap |
|---|---|---|---|
| seq# continuous | no | yes | yes |
| .. reset | automatic loopback | managed by application | seldom (exchange decision) |
| .. dup | possible | possible | normal under bestOfBoth |
| .. per session | per connection | per clientId | per day |
| .. resumption? | possible if wire gets reconnected quickly | yes, upon re-login | unconditional; no choice |
| Ack | positive Ack needed | only needed for order submission etc | not needed |
| gap detection | sophisticated | every gap should be handled immediately since sequence is critical | gap mgr with timer |

de-multiplex by-destPort: UDP ok but insufficient for TCP

When people ask me what the purpose of the port number in networking is, I used to say that it helps demultiplex. Now I know that’s true only for UDP — TCP uses more than the destination port number.

Background — two processes X and Y on a single-IP machine need to maintain two private, independent ssh sessions. The incoming packets need to be directed to the correct process, based on the port numbers of X and Y… or are they?

If X is sshd with a listening socket on port 22, and Y is a forked child process from accept(), then Y’s “worker socket” also has local port 22. That’s why on our linux server I see many ssh sockets whose local ip:port pairs are indistinguishable.

TCP demultiplex uses not only the local ip:port, but also remote (i.e. source) ip:port. Demultiplex also considers wild cards.

| | TCP | UDP |
|---|---|---|
| socket has local IP:port | yes | yes |
| socket has remote IP:port | yes | no such thing |
| 2 sockets with same local port 22? | can live in two processes, or in one process | not allowed |
| 2 msg with same dest ip:port but different source ports | addressed to 2 sockets; 2 ssh sessions | addressed to the same socket |

Q: which thread/PID drains NicBuffer→socketBuffer

Too many kernel concepts. I will use a phrasebook format. I have also separated some independent tips into hardware interrupt handler #phrasebook

  1. Scenario 1: A single CPU. I start my parser, which creates the multicast receiver socket, but no data is coming. My pid111 gets preempted. The CPU is running unrelated pid222 when data /washes up/.
  2. Scenario 2: pid111 is running handleInput() while additional data comes in on the NIC.
  • context switching — to interrupt handler (i-handler). In all scenarios, the running process gets suspended to make way for the interrupt handler function. I-handler’s instruction address gets loaded into the cpu registers and it starts “driving” the cpu. Traditionally, the handler used the suspended process’s existing stack.
    • After the i-handler completes, the suspended “current” process resumes by default. However, the handler may cause another pid to be scheduled right away [1 Chapter 4.1].
  • no pid — interrupt handler execution has no pid, though some authors say it runs on behalf of the suspended pid. I feel the suspended pid may be unrelated to the socket, rather than the socket’s owner process (pid111).
  • kernel scheduler — In Scenario 1, pid111 would not get to process the data until it gets in the “driver’s seat” again. However, the interrupt handler could trigger a rescheduling and push pid111 “to the top” so to speak. [1 Chapter 4.1]
  • top-half — drains the tiny NIC ring-buffer into main memory as fast as possible [2]
  • bottom-half — (i.e. deferrable functions) includes lengthy tasks like copying packets. Deferrable functions run in interrupt context [1 Chapter 4.8], so there’s no pid
  • sleeping — the socket owner pid111 would be technically “sleeping” in the socket’s wait queue initially. After the data is copied into the socket receive buffer (kernel memory, not user space), I think the kernel scheduler would locate pid111 in the socket’s wait queue and make pid111 the cpu-driver. Pid111 would then call read() on the socket to copy the data into user space.
    • wait queue — How the scheduler does it is non-trivial. See [1 Chapter 3.2.4.1]
  • burst — What if there’s a burst of multicast packets? The i-handler would hog or steal the driver’s seat and /drain/ the NIC ring-buffer as fast as possible, populating the socket receive buffer. When the i-handler takes a break, our handleInput() would chip away at the socket buffer.
    • priority — is given to the NIC’s interrupt handler, since we have a single CPU.
    • UDP could overrun the socket receive buffer; TCP uses flow control to avoid it.

Q: What if the process scheduler wants to run while i-handler is busy draining the NIC?
A: Well, all interrupt handlers can be interrupted, but I would doubt the process scheduler would suspend the NIC interrupt handler.

One friend said the pid is 1, the kernel process.

[1] [[UnderstandingLinuxKernel, 3rd Edition]]

[2] https://notes.shichao.io/lkd/ch7/#top-halves-versus-bottom-halves