Introduction to TCP

Jamshid Mahdavi
Novell
mahdavi@novell.com

Internet Protocol Stack and Network Layers

(5 min)

Physical layer: Raw bit stream...
Link layer: Local addressing, packet delivery, broadcast/multicast capabilities.
Network Layer: Unreliable datagram delivery. Globally routed heirarchical address space.
Transport Layer: Add Reliability, connections, etc.
Presentation
Session
Application

The end to end argument

(10 min)

Thought experiment: Suppose you are developing a program like FTP. If you want to be sure all of the data got there, how do you go about it?

Sender and receiver must communicate "success". Receiver must ultimately count the data and make sure the right number of bytes arrived.

Similarly, receiver must checksum or otherwise verify correctness.

The end-to-end argument says that if you are going to do something on the end-host anyway, don't replicate that function in the network.

Think "hop-by-hop" vs "end-to-end". This drives the decision of what goes into the transport layer.

Duties of the Transport Layer

(10 min)

IP provides an unreliable datagram service. What additional features are generally needed (and therefore should be provided by a Transport layer?)

In Order, Reliable Data Delivery
Connection Establishment or Session Control
Multiplexing Connections
Flow Control and Congestion control

TCP Packet Format

(10 min)

TCP Header Format (from RFC 793) 0 1 2 3 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ | Source Port | Destination Port | +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ | Sequence Number | +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ | Acknowledgment Number | +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ | Data | |U|A|P|R|S|F| | | Offset| Reserved |R|C|S|S|Y|I| Window | | | |G|K|H|T|N|N| | +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ | Checksum | Urgent Pointer | +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ | Options | Padding | +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ | data | +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ TCP Header Format Note that one tick mark represents one bit position. Figure 3.

Note that we refer to IP Packets and TCP Segments. A TCP segment refers just to the data. Because packets can be fragmented, it is useful to make this distinction. Also segments can be retransmitted; in this case the same segment is repeated in a different packet.

Which duty is each field in the packet format related to?

Port numbers: Multiplexing, connection setup/management.

Sequence numbers: reliability and ordering.

Flag bits: connection management (primarily).

Window: Flow control

Checksum: reliability

Urgent Pointer: control function. Not really used.

The TCP state machine

(15 min)

From 793 +---------+ ---------\ active OPEN | CLOSED | \ ----------- +---------+<---------\ \ create TCB | ^ \ \ snd SYN passive OPEN | | CLOSE \ \ ------------ | | ---------- \ \ create TCB | | delete TCB \ \ V | \ \ +---------+ CLOSE | \ | LISTEN | ---------- | | +---------+ delete TCB | | rcv SYN | | SEND | | ----------- | | ------- | V +---------+ snd SYN,ACK / \ snd SYN +---------+ | |<----------------- ------------------>| | | SYN | rcv SYN | SYN | | RCVD |<-----------------------------------------------| SENT | | | snd ACK | | | |------------------ -------------------| | +---------+ rcv ACK of SYN \ / rcv SYN,ACK +---------+ | -------------- | | ----------- | x | | snd ACK | V V | CLOSE +---------+ | ------- | ESTAB | | snd FIN +---------+ | CLOSE | | rcv FIN V ------- | | ------- +---------+ snd FIN / \ snd ACK +---------+ | FIN |<----------------- ------------------>| CLOSE | | WAIT-1 |------------------ | WAIT | +---------+ rcv FIN \ +---------+ | rcv ACK of FIN ------- | CLOSE | | -------------- snd ACK | ------- | V x V snd FIN V +---------+ +---------+ +---------+ |FINWAIT-2| | CLOSING | | LAST-ACK| +---------+ +---------+ +---------+ | rcv ACK of FIN | rcv ACK of FIN | | rcv FIN -------------- | Timeout=2MSL -------------- | | ------- x V ------------ x V \ snd ACK +---------+delete TCB +---------+ ------------------------>|TIME WAIT|------------------>| CLOSED | +---------+ +---------+ TCP Connection State Diagram Figure 6.

Normal server connection establishment:

Passive Open -> Listen -> SYN RCVD -> EST

Normal client connection establishment:

Active Open -> SYN SENT -> EST

3 way handshake protects against duplicate or old data in the network. Each host must ACK the attempt to establish the connection.

Similarly, FIN is used to close connections; each side must ACK the attempt to close. TCP permits half-open connections to continue sending in one direction. Don't know of anything that uses this.

Sliding Window Flow Control

(10 min)

The receiver announces a window which specifies the amount of data which can be outstanding in the network at any point in time. This prevents the sender from overrunning the receiver with data if the receiver can't keep up.

If the receiver slows down, the sender automatically slows down as the ACK rate decreases.

Side effect: what is the maximum bandwidth which can be obtained at a given window? Answer: Window/RTT.

Loss detection

(30 min)

Initial TCP performed loss detection through a simple timeout. Go-back-n algorithm went back to the last byte successfully received and started transmitting from there.

What are the problems with this approach?

Lots of extra data retransmitted.
Timeouts don't make use of all of the available information.
Timers also need to be adaptive.

All of these problems lead to "congestion collapse" in the Internet in 1988. Van Jacobson invented a set of fixes which revolutionized TCP.

Basic idea: Add a congestion window (cwnd). This adds sliding window congestion control to TCP. cwnd controls the amount of data in the network.

First benefit: if the network slows down, queues increase, ACKs arrive more slowly, and the sender naturally slows down. Introduce the idea of ACK clocking.

Second benefit: the sender can adjust cwnd to match the available network capacity.

Question: how do you adjust?

Answer: additive increase, multiplicative decrease. TCP adds one segment to cwnd for each RTT, and divides cwnd by two when there is a loss.

Problem: starting with cwnd=1 might take many RTTs to get to large values of cwnd.

Solution: Slow start by doubling cwnd each RTT.

Questions: How do you identify a loss? (duplicate ACKs). How can you distinguish loss from reordering? (Add a 3 dupacks check). Fast Retransmit...

Duplicate ACKs cause problems with clocking. Can't use snd.una + cwnd to figure out what to send.

Tahoe solution: Identify congestion; retransmit; slowstart up to cwnd/2.

Reno solution: Artificially inflate cwnd until ACK advances. (Fast Recovery).

At this point, show a nam animation of slowstart and congestion avoidance.

Question: Do we still need timeouts at all?

Yes -- could still lose a whole window of data or ACKs.

Discuss adaptive timers and the RTO calculation.

Time one packet per RTT. Compute a smoothed RTT (SRTT), and deviation (SDEV).

Base the timeout on RTO=SRTT+4*SDEV. This adapts the timeout to the connection RTT.

Timer backoff: Successive timeouts are an indication of heavy congestion. Apply an exponential backoff to the timeout (1,2,4,8,16,32,64,64,64,64,64...) for each successive timeout.

Spoofing TCP

(Not enough time for this, probably).

Now for something completely different. Suppose you want to attack a TCP connection. Example, insert data in the middle of a netscape download.

TCP provides some protection by randomizing the ISN. This makes it hard for an outsider to inject data. (Need to sniff the connection first to get the seq numbers, then add the false data).

Denial of service

Another popular attack is the SYN flood. Establish many half-open connections to fill up server memory.

Lab Tools

(10 min)

Labs will use a number of popular applications.

Netperf and TTCP run TCP connections and measure their performance.

WebPolygraph runs a web-like workload designed to benchmark web proxies.

tcpdump: collect packet traces.

tcptrace: analyze tcpdump traces.

xplot: plot traces from tcptrace.

Discussion of Exercises

(30 min)

This section introduces new developments to address some of the problems found in the first lab.

Try to roll through these solutions in chronological order.

Packet sizes: What is the dilemma? Want to use as large of packets as possible. This reduces overhead and improves host and network performance. But, packets which are too large will be fragmented by the network.
TCP includes an MSS negotiation option. Originally, this value was used if the hosts were on the same network. If going over a WAN, a minimal packet size of 576 bytes was used. This prevented unnecessary fragmentation, but was inefficient.
RFC1191 (Path MTU Discovery) provided a means for hosts to detect the path MTU. (But there are still problems with it due to various blackholing problems...)
Because of the blackholing problems, some OS come with PMTU disabled by default.
Inadequate Window: 16 bit receiver window field in the packet header limits BW to 64 kb/RTT. RFC1323 provides a Window Scale option to address this problem.
Can now scale window by up to 2^14. New max window is 1 GB. (How soon will it be until this runs out?) (With 1500 byte packets, this is a window of 600,000 packets. How long would that take to open?)
Timestamps: RFC1323 also introduced the Timestamp Option. This provides for timing of every packet in flight, rather than only one per RTT. This enables keeping a more dynamic and accurate value for RTO.
Can anyone see problems with this? Haven't changed the exponential smoothing, so RTO varies rapidly. In particular SDEV could become small quickly even if long term RTT fluctuations are occuring.
Next Lab one group devoted to this issue.
Problems with Reno: Fundamental problem is that Fast Retransmit assumes that only one segment was lost. This can result in loss of ACK clocking and timeouts if more than one segment is lost. Two solutions are available: SACK and NewReno.
NewReno attacks the problem by filling one hole per RTT. That is, a partial ACK is indicative of a new hole, which is immediately retransmitted. The downside of this is that it may take many RTTs to recover from a loss episode, and you must have enough new data around to keep the ACK clock running.
Selective Acknowledgment (SACK) attaches detailed information in SACK options on ACKs. Each SACK block specifies a block of data which has been successfully received. The sender can use this information to quickly retransmit only those segments which have been lost. This enables holes to be filled within roughly one RTT.
Problems with drop-tail queues: Unfair (more likely to drop many packets from one connection). Randomization of drops (RED) improves fairness and reduces overall queue length.
ECN builds on RED by adding a bit to the IP header to signal congestion. ECN capable routers and hosts can use this to control congestion with no packet loss.
Finally, for web and slowstart problems, RFC2414 increased the allowed initial window to 2-4 segments.

Kernel TCP architecture

Lab 2 exercises use NetBSD kernel. Some require kernel programming, some do not. Quick introduction to the BSD kernel. Copies of Stevens available for reference during the lab.

Q: Who has experience building kernels? Kernel hacking?

NetBSD quick primer:

Kernel config files live in /sys/arch/i386/conf/KERNEL_NAME
Edit the config; then type ./config KERNEL_NAME
cd to /sys/arch/i386/compile/KERNEL_NAME. make depend then make
Install the new kernel in /netbsd and reboot.

Kernel source files: (include copies of important sections in the handouts?)

tcp_input.c handles all input packet processing. The fast retransmit and fast recovery sections of the code are in here. RTTs are computed in this section of code (you need to do the timing when the ACK for the packet being timed comes in).
tcp_input uses a "header prediction" algorithm to speed up the code. This algorithm attempts to immediately go to the most common code path by predicting the next header. If successful, tcp_input processing is very short. Non-success cases include bi-directional data, duplicate ACKs, and state-machine transitions.
tcp_input calls tcp_output when ACKs come in which trigger new data. For bulk transfer (where cwnd or receiver window is limiting the transmissions), this is the normal means for calling tcp_output to send a segment.
tcp_output.c handles sending packets. New data and retransmissions are segmented, and headers are added to the segments.
what else?

Analysis of TCP Performance

Would like to better understand how TCP performs under various circumstances.

Simple analysis of slowstart. Go through math.

Simple analysis of 1/sqrt(p). Go through math for periodic loss case.

UMass calculation. Here, we add the following additional "features" to the calculation. (cwnd < 4 MSS results in timeout).

Fairness of TCP. Assume all connections at a bottleneck see equal loss rates. RTT and MSS weight the fairness.

Multiple congested gateways: Connections passing through two gateways get lower performance.

Introduce concept of min-max fairness.

New developments

Finally, some new things taking place.

T/TCP: Transaction TCP. Allow sending SYN/DATA/FIN in one packet. State machine is more complex to attempt maintain security guarantees. Goal is to reduce 5 packet transaction to 3 packets.

Rate-halving algorithm: unifies newreno, SACK, and ECN. Single algorithm.

Autotuning: This solves the receiver window problem. Receivers announce a maximum window. Note that receiver socket buffer consumes no memory normally. Sender is in full control using cwnd. To save memory on the sender, sender only allocates socket buffer for 2-4 times cwnd.

Sharing congestion state: Desirable because http likes to open multiple connections. RFC2140 specifies that these connections may share congestion state (cwnd).

IETF Endpoint Congestion Management WG is expanding on this idea. Create a UI for congestion control, and allow TCP and UDP to use the same UI. CM does all of the regulation of packets to and from the network. TCP, and UDP apps, control what data to send.

(I believe that RH is a good algorithm for this, because it separates data transmission from congestion control; exactly what CM needs).

New congestion control algorithms: For inherently rate-based applications where window based TCP can't work, two new approaches are being tested.

Rate based: instead of linear increase on cwnd, use a rate directly. Pace out packets into the network based on this rate.
Equation based: use the formulae we computed to set the rate of a connection based on the perceived loss rate. This is good, because you don't have to slowly increase your bandwidth to test higher speeds. You can apply the packet loss rules and instantly jump (for example) to a higher bandwidth coding.

Finally, reliable multicast is attempting to apply TCP congestion control principles to multicast. Multicast has the problem that you can't send ACKs for every packet back to the sender, so window-based schemes are hard to use.

One approach being worked on is heirarchical ACKs (HACKs). ACKs are aggregated on the way back up the tree to the sender, allowing for the possibility of using a window based CC scheme.

The other approach is using equation based CC. Here, receivers infrequently send back info on packet loss rates and RTTs. The sender aggregates all of this information together and chooses an appropriate rate.

For both of these schemes the general idea is "go no faster than TCP would go over the worst path in the multicast tree".

References

Textbooks:
- W. R. Stevens, "TCP/IP Illustrated, Volume 1: The Protocols", Addison-Wesley, 1994.
- D. Comer, "Internetworking with TCP/IP, Volume 1", Prentice Hall, 1991.
- Peterson, LL and Davie, S. Computer Networks: A Systems Approach. Morgan Kaufman, 1996.
TCP Standards RFCs:
- RFC793: Transmission Control Protocol
- RFC896: Congestion control in IP/TCP internetworks
- RFC1191: Path MTU Discovery
- RFC1323: TCP Extensions for High Performance
- RFC1644: T/TCP -- TCP Extensions for Transactions Functional Specification
- RFC2018: TCP Selective Acknowledgement Options
- RFC2140: TCP Control Block Interdependence
- RFC2398: Some Testing Tools for TCP Implementors
- RFC2525: Known TCP Implementation Problems
- RFC2581: TCP Congestion Control
- RFC2582: The NewReno Modification to TCP's Fast Recovery Algorithm
- draft-paxson-tcp-rto-01.txt: Computing TCP's Retransmission Timer
- draft-handley-tcp-cwv-02.txt: TCP Congestion Window Validation
- draft-floyd-sack-00.txt: An Extension to the Selective Acknowledgement (SACK) Option for TCP
- draft-allman-tcp-lossrec-00.txt: Enhancing TCP's Loss Recovery Using Early Duplicate Acknowledgment Response
- draft-ietf-sigtran-sctp-10.txt: Stream Control Transmission Protocol
- Pilc Working Group Charter: This includes references to several good works in progress on TCP protocol behavior under various conditions. http://www.ietf.org/html.charters/pilc-charter.html
TCP Design and Behavior:
- J.H Saltzer, D.P. Reed, and D.D. Clark. End-to-End Arguments in System Design. ACM Transactions on Computer Systems, 2(4):195-206, 1984.
- Floyd, S., Connections with Multiple Congested Gateways in Packet-Switched Networks Part 1: One-way Traffic. Computer Communication Review, Vol.21, No.5, October 1991, p. 30-47. ftp://ftp.ee.lbl.gov/papers/gates1.ps.Z
- Fall, K., and Floyd, S., Simulation-based Comparisons of Tahoe, Reno, and SACK TCP. Computer Communication Review, V. 26 N. 3, July 1996, pp. 5-21. This is a revised version of Comparisons of Tahoe, Reno, and Sack TCP. Technical report, December 1995. http://www.aciri.org/floyd/papers/sacks.ps.Z
- M. Mathis, J. Semke, J. Mahdavi, T. Ott, The Macroscopic Behavior of the TCP Congestion Avoidance Algorithm,Computer Communication Review, volume 27, number 3, pp. 67-82, July 1997. http://www.psc.edu/networking/papers/model_ccr97.ps
- Hari Balakrishnan, Venkata N. Padmanabhan and Randy H. Katz, The Effects of Asymmetry on TCP Performance, in Proc. of 3rd ACM/IEEE Intl. Conference on Mobile Computing and Networking (MobiCom), (Budapest, Hungary), Sep. 1997.
- Stefan Savage, Neal Cardwell, David Wetherall and Tom Anderson, TCP Congestion Control with a Misbehaving Receiver, ACM Computer Communication Review, vol. 29, no. 3, Oct. 1999.
- K. Fall and S. Floyd. Simulation-based Comparisons of Tahoe, Reno, and SACK TCP. ACM Computer Communication Review, 26(3):5-21, Jul. 96.
- L. Zhang, S. Shenker, and D.D. Clark. Observations on the Dynamics of a Congestion Control Algorithm: The Effects of Two-Way Traffic. Proc. of ACM SIGCOMM '91, pages 133-147, Sep. 1991.
- V. Paxson. End-to-End Internet Packet Dynamics. Proc. of ACM SIGCOMM '97, Sep. 1997.
- Neal Cardwell, Stefan Savage and Thomas Anderson, Modeling TCP Latency, in Proceedings of the Conference on Computer Communications (IEEE Infocom), (Tel Aviv, Israel), Mar. 2000.
- R. Ludwig, B. Rathonyi, A. Konrad, K. Oden, and A. Joseph, Multi-layer Tracing of TCP over a Reliable Wireless Link. In Proceedings of ACM SIGMETRICS 1999 (Atlanta, GA, May 1999).
Derivations of TCP Performance:
- Teunis J.Ott, J.H.B. Kemperman and Matt Mathis, Window Size Behavior in TCP/IP with Constant Loss Probability, November 1996. This paper is a revision of The Stationary Behavior of Ideal TCP Congestion Avoidance by the same authors. ftp://ftp.bellcore.com/pub/tjo/TCPwindow.ps
- Padhye, J., Firoiu, V., Towsley, D., and Kurose, J., Modeling TCP Throughput: a Simple Model and its Empirical Validation, UMASS CMPSCI Tech Report TR98-008, Feb. 1998. ftp://gaia.cs.umass.edu/pub/Padhye-Firoiu98:TCP-throughput.ps.Z
- J. Padhye, V. Firoiu and D. Towsley, A Stochastic Model of TCP Reno Congestion Avoidance and Control, Technical Report 99-02, Department of Computer Science, University of Massachusetts, Amherst. ftp://gaia.cs.umass.edu/pub/Padhye99-markov.ps
- Sally Floyd, Mark Handley, Jitendra Padhye, and Joerg Widmer, Equation-Based Congestion Control for Unicast Applications, May 2000. To appear in SIGCOMM 2000.
Other TCP Algorithms:
- Jacobson, V., Congestion Avoidance and Control. Proceedings of SIGCOMM '88 (Palo Alto, CA, Aug. 1988) (The pointers are to a slightly-revised 1992 version of the 1988 paper.) ftp://ftp.ee.lbl.gov/papers/congavoid.ps.Z
- Floyd, S., and Jacobson, V., Random Early Detection gateways for Congestion Avoidance, IEEE/ACM Transactions on Networking, V.1 N.4, August 1993, p. 397-413. http://www.aciri.org/floyd/papers/red/red.html
- Floyd, S., TCP and Explicit Congestion Notification. ACM Computer Communication Review, V. 24 N. 5, October 1994, p. 10-23. [This issue of CCR incorrectly has "1995" on the cover instead of "1994".] ftp://ftp.ee.lbl.gov/papers/tcp_ecn.4.ps.Z
- M. Mathis, J. Semke, J. Mahdavi, The Rate-Halving Algorithm for TCP Congestion Control, under development. http://www.psc.edu/networking/rate-halving/
- J. Semke, J. Mahdavi, M. Mathis, Automatic TCP Buffer Tuning, ACM SIGCOMM '98/ Computer Communication Review, volume 28, number 4, October 1998. http://www.psc.edu/networking/ftp/papers/autotune_sigcomm98.ps
- Shrikrishna Karandikar, Shivkumar Kalyanaraman, Prasad Bagal and Bob Packer, TCP Rate Control, ACM Computer Communication Review, vol. 30, no. 1, Jan. 2000.
- Deepak Bansal and Hari Balakrishnan, TCP-friendly Congestion Control for Real-time Streaming Applications, MIT, Cambridge, Massachusetts, no. MIT-LCS-TR-806, May 2000.
- Mark Allman, On the Generation and Use of TCP Acknowledgments, ACM Computer Communication Review, vol. 28, no. 5, pp. 4--21, Oct. 1998.
- Mark Allman, TCP Byte Counting Refinements, ACM Computer Communication Review, vol. 29, no. 3, Jul. 1999.
- G. Minshall, A Suggested Modification to Nagle's Algorithm, Internet Engineering Task Force, Jun. 1999. http://www.ietf.org/internet-drafts/draft-minshall-nagle-01.txt (expired)
- J.C. Hoe. Improving the Start-up Behavior of a Congestion Control Scheme for TCP, Proc. of ACM SIGCOMM '96, pages 270-280, Aug. 1996.
- The Perf_Tune Page: http://www.psc.edu/networking/perf_tune.html
- The TCP Friendly Website: http://www.psc.edu/networking/tcp_friendly.html
RED Stuff:
- S. Floyd and V. Jacobson. Random Early Detection Gateways for Congestion Avoidance. ACM/IEEE Transactions on Networking, 1(4):397-413, Aug. 1993.
- B. Braden, D. Clark, J. Crowcroft, B. Davie, S. Deering, D. Estrin, S. Floyd, V. Jacobson, G. Minshall, C. Partridge, L. Peterson, K. Ramakrishnan, S. Shenker, J. Wroclawski and L. Zhang, "Recommendations on Queue Management and Congestion Avoidance in the Internet," Internet Engineering Task Force, RFC 2309, Apr. 1998.
- S. Floyd and K. Fall, Promoting the Use of End-to-End Congestion Control in the Internet, IEEE/ACM Trans. on Networking, 7(4), August 1999, pp. 458-472.
TCP Implementation Stuff:

D. Clark, V. Jacobson, J. Romkey, and H. Salwen, An Analysis of TCP Processing Overhead, IEEE Communication Magazine, June 1989, pp 23-29.

L. S. Brakmo and L. L. Peterson, TCP Vegas: End to End Congestion Avoidance on a Global Internet. IEEE Journal of Selected Areas in Communication, Vol. 13, No. 8, pp. 1465-1480, October 1995.