hadoop^spark #ez2remember

All 3 are based on JVM:

  • hadoop — java
  • spark — Scala
  • storm — Clojure

Simplified — some practitioners view hadoop’s value-add as 2 fold

  1. HDFS
  2. MapReduce but in batch mold

Spark keeps the MapReduce part only.

Spark runs MapReduce in streaming often using HDFS for storage.


UDP socket is identified by two-tuple; TCP socket is by four-tuple

Based on [[computer networking]] P192. see also de-multiplex by-destPortNumber UDP ok but !! enough for TCP

  • Note the term in subject is “socket” not “connection”. UDP is connection-less.

A TCP segment has four header fields for Source IP:port and destination IP:port.

A TCP socket has internal data structure for a four-tuple — Remote IP:port and local IP:port.

A regular TCP “Worker socket” has all four items populated, to represent a real “session/connection”, but a Listening socket could have wild cards in all but the local-port field.

fragmentation: IP^TCP #retrans

See also IP (de)fragmentation #MTU,offset

Interviews are unlikely to go this deep, but it’s good to over-prepare here. This comparison ties together many loose ends like Ack, retrans, seq resets..

[1] IP fragmentation can cause excessive retransmissions when fragments encounter packet loss and reliable protocols such as TCP must retransmit ALL of the fragments in order to recover from the loss of a SINGLE fragment
[2] see TCP seq# never looks like 1,2,3

IP fragmentation TCP fragmentation
minimum guarantees all-or-nothing. Never partial packet stream in-sequence without gap
reliability unreliable fully reliable
name for a “part” fragment segment
sequencing each fragment has an offset each segment has a seq#
.. continuous? yes no! [2]
.. reset? yes for each packet loops back to 0 right before overflow
Ack no such thing positive Ack needed
gap detection using offset using seq# [2]
id for the “msg” identification number no such thing
end-of-msg flag in last fragment no such thing
out-of-sequence? likely likely
..reassembly based on id/offset/flag based on seq#
..retrans not by IP [1] commonplace

de-multiplex by-destPort: UDP ok but insufficient for TCP

When people ask me what is the purpose of the port number in networking, I used to say that it helps demultiplex. Now I know that’s true for UDP but TCP uses more than the destination port number.

Background — Two processes X and Y on a single-IP machine  need to maintain two private, independent ssh sessions. The incoming packets need to be directed to the correct process, based on the port numbers of X and Y… or is it?

If X is sshd with a listening socket on port 22, and Y is a forked child process from accept(), then Y’s “worker socket” also has local port 22. That’s why in our linux server, I see many ssh sockets where the local ip:port pairs are indistinguishable.

TCP demultiplex uses not only the local ip:port, but also remote (i.e. source) ip:port. Demultiplex also considers wild cards.

socket has local IP:port
socket has remote IP:port no such thing
2 sockets with same
local port 22 ???
can live in two processes not allowed
can live in one process not allowed
2 msg with same dest ip:port
but different source ports
addressed to 2 sockets;
2 ssh sessions
addressed to the
same socket

make_shared() cache efficiency, forward()

This low-level topic is apparently important to multiple interviewers. I guess there are similarly low-level topics like lockfree, wait/notify, hashmap, const correctness.. These topics are purely for theoretical QQ interviews. I don’t think app developers ever need to write forward() in their code.

https://stackoverflow.com/questions/18543717/c-perfect-forwarding/18543824 touches on a few low-level optimizations. Suppose you follow Herb Sutter’s advice and write a factory accepting Trade ctor arg and returning a shared_ptr<Trade>,

  • your factory’s parameter should be a universal reference. You should then std::forward() it to make_shared(). See gcc source code See make_shared() source in https://gcc.gnu.org/onlinedocs/libstdc++/libstdc++-api-4.6/a01033_source.html
  • make_shared() makes one allocation for a Trade and an adjacent control block, with cache efficiency — any read access on the Trade pointer will cache the control block too
  • if the arg object is a temp object, then the rvr would be forwarded to the Trade ctor. Scott Meryers says the lvr would be cast to a rvr. The Trade ctor would need to move() it.
  • if the runtime object is carried by an lvr (arg object not a temp object), then the lvr would be forwarded as is to Trade ctor?

Q: What if I omit std::forward()?
AA: Trade ctor would receive always a lvr. See ScottMeyers P162 and my github code

https://github.com/tiger40490/repo1/blob/cpp1/cpp1/rvrDemo.cpp is my experiment.


session^msg^tag level concepts

Remember every session-level operation is implemented using messages.

  • Gap management? Session level
  • Seq reset? Session level, but seq reset is a msg
  • Retrans? per-Msg
  • Checksum? per-Msg
  • MsgType tag? there is one per-Msg
  • Header? per-Msg
  • order/trade life-cycle? session level
  • OMS state management? None of them …. more like application level than Session level

linux tcp buffer^AWS tuning params

—receive buffer configuration
In general, there are two ways to control how large a TCP socket receive buffer can grow on Linux:

  1. You can set setsockopt(SO_RCVBUF) explicitly as the max receive buffer size on individual TCP/UDP sockets
  2. Or you can leave it to the operating system and allow it to auto-tune it dynamically, using the global tcp_rmem values as a hint.
  3. … both values are capped by

/proc/sys/net/core/rmem_max — is a global hard limit on all sockets (TCP/UDP). I see 256M in my system. Can you set it to 1GB? I’m not sure but it’s probably unaffected by the boolean flag below.

/proc/sys/net/ipv4/tcp_rmem — doesn’t override SO_RCVBUF. The max value on my system is again 256M. The receive buffer for each socket is adjusted by kernel dynamically, at runtime.

The linux “tcp” manpage explains the relationship.

Note large TCP receive buffer size is usually required for high latency, high bandwidth, high volume connections. Low latency systems should use smaller TCP buffers.

For high-volume multicast connections, you need large receive buffers to guard against data loss — UDP sender doesn’t obey flow control to prevent receiver overflow.


/proc/sys/net/ipv4/tcp_window_scaling is a boolean configuration. (Turned on by default) 1GB  is the new limit on AWS after turning on window scaling. If turned off, then AWS value is constrained to a 16-bit integer field in the TCP header — 65536

I think this flag affects AWS and not receive buffer size.

  • if turned on, and if buffer is configured to grow beyond 64KB, then Ack can set AWS to above 65536.
  • if turned off, then we don’t need a large buffer since AWS can only be 65536 or lower.


y FIX needs session seqNo over TCP seqNo #reset

My friend Alan said … Suppose your FIX process crashed or lost power, reloads (from disk) the last sequence received and reconnects (resetting tcp seq#). It would then receive a live seq # higher than expected. CME documentation states:

… a given system, upon detecting a higher than expected message sequence number from its counterparty, requests a range of ordered messages resent from the counterparty.

Major difference from TCP sequence number — FIX has no Ack. See Ack in FIX^TCP

— Sequence number reset policy:

After a logout, sequence numbers is supposed to reset to 1, but if connection is terminated ‘non-gracefully’ sequence numbers will continue when the session is restored. In fact a lot of service providers (eg: Trax) never reset sequence numbers during the day. There are also some, who reset sequence numbers once per week, regardless of logout.

FIX.5 + FIXT.1 breaking changes

I think many systems are not yet using FIX.5 …

  1. FIXT.1 protocol [1] is a kinda subset of the FIX.4 protocol. It specifies (the traditional) FIX session maintenance over naked TCP
  2. FIX.5 protocol is a kinda subset of the FIX.4 protocol. It specifies only the application messages and not the session messages.

See https://www.fixtrading.org/standards/unsupported/fix-5-0/. Therefore,

  • FIX4 over naked TCP = FIX.5 + FIXT
  • FIX4 over non-TCP = FIX.5 over non-TCP

FIX.5 can use a transport messaging queue or web service, instead of FIXT

In FIX.5, Header now has 5 mandatory fields:

  • existing — BeginString(8), BodyLength(9), MsgType(35)
  • new — SenderCompId(49), TargetCompId(56)

Some applications also require MsgSeqNo(34) and SendingTime(52), but these are unrelated to FIX.5

Note BeginString actually look like “8=FIXT1.1

[1] FIX Session layer will utilize a new version moniker of “FIXTx.y”

HFT mktData redistribution via MOM

Several practitioners say MOM is unwelcome due to added latency:

  1. The HSBC hiring manager Brian R was the first to point out to me that MOM adds latency. Their goal is to get the raw (market) data from producer to consumer as quickly as possible, with minimum stops in between.
  2. 29West documentation echos “Instead of implementing special messaging servers and daemons to receive and re-transmit messages, Ultra Messaging routes messages primarily with the network infrastructure at wire speed. Placing little or nothing in between the sender and receiver is an important and unique design principle of Ultra Messaging.
  3. Then I found that the ICE/RTS systems (not ultra-low-latency ) have no middleware between feed parser and order book engine (named Rebus).

However, HFT doesn’t always avoid MOM. P143 [[all about HFT]] published 2010 says an HFT such as Citadel often subscribes to both individual stock exchanges and CTS/CQS [1], and multicasts the market data for other components of the HFT. This design has additional buffers inherently. The first layer receives raw external data via a socket buffer. The 2nd layer components would receive the multicast data via their socket buffers.

[1] one key reason to subscribe redundant feeds — CTS/CQS may deliver a tick message faster!

Lehman’s market data is re-distributed over tibco RV, in FIX format.