tcp(no ip) review part1

TCP (No IP) Refresher Part-1

Disclaimer
• This presentation contains a mixture of self-made slides and material already available on the Internet.
• The goal of this presentation/talk is to provide a TCP refresher.
• At the end you won't know everything about TCP.
• The goal here is to introduce you to the mountains by climbing to the top of a hill. How to climb the mountains is an exercise left for the audience/reader.
• My hope is that you will get at least something new out of this, but if you already knew everything, then at least you got the free lunch.

Problem Statement
• Assume that we have to develop a protocol responsible for data transfer over a lossy medium. This protocol needs to provide a mechanism to transfer data reliably.
• If an error occurs during the data transfer, there are primarily two ways it can be corrected:
  • Forward Error Correction (FEC): error-correcting codes, built on Shannon's theory, mostly used in low-level protocols and hardware such as optical links and memory chips.
  • Automatic Repeat Request (ARQ): basically a brute-force method where you send the data again. TCP uses this form to provide reliability.


Reliability and Error Recovery
• General ARQ algorithms
  • Stop & Wait
  • Sliding Window Protocols
    • Go-Back-N
    • Selective Repeat

Error Recovery: Stop and Wait
• ARQ
  • Receiver sends an acknowledgement (ACK) when it receives a packet
  • Sender waits for the ACK and times out if it does not arrive within some time period
• Simplest ARQ protocol
• Send a packet, then stop and wait until the ACK arrives

[Figure: Sender/Receiver timeline. The sender transmits a Packet, the receiver returns an ACK, and a Timeout interval runs at the sender.]

Recovering from Error

[Figure: three Sender/Receiver timelines labeled "ACK lost", "Packet lost" and "Early timeout". In each case the sender's Timeout fires and the Packet is sent again. The "ACK lost" and "Early timeout" cases are marked DUPLICATE PACKETS!!!]

Problems with Stop and Wait
• How do I recognize a duplicate packet? (Generic problem)
• Performance issues:
  • Can only send one packet per round trip
• Question: How long should the sender wait for an ACK?
  • This turns out to be a hard problem to solve. We will look at it later.

How to Recognize Duplicates?
• Use sequence numbers
  • on both packets and ACKs
• The sequence number in a packet is finite. How big should it be?
  • For stop and wait?
  • One bit: the sender won't send seq #1 until it has received the ACK for seq #0

[Figure: Sender/Receiver exchange with the labels Pkt 0, ACK 0, Pkt 0, ACK 1, Pkt 1, ACK 0.]
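To make the one-bit idea concrete, here is a minimal sketch of the receiver side of stop and wait. The class name and the convention that an ACK carries the sequence bit of the packet it acknowledges are assumptions made for illustration, not something the slides specify.

```python
class AltBitReceiver:
    """Hypothetical stop-and-wait receiver using a one-bit sequence number."""

    def __init__(self):
        self.expected = 0     # sequence bit of the next *new* packet
        self.delivered = []   # payloads handed to the application, in order

    def on_packet(self, seq, payload):
        """Handle one arriving packet and return the bit to put in the ACK."""
        if seq == self.expected:
            # New data: deliver it and flip the expected bit.
            self.delivered.append(payload)
            self.expected ^= 1
        # A duplicate (seq != expected) is not delivered again, but it is
        # still acknowledged so the sender can stop retransmitting it.
        return seq
```

If packet 0 arrives twice because its ACK was lost, the second copy is acknowledged again but only one copy reaches `delivered`.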

Got Reliability but what about efficiency?
• So we found a way to achieve reliability in our protocol, but sending a packet and then waiting for an ACK isn't very efficient.
• The throughput of our Stop and Wait protocol = "M/R"
  • M is the size of the packet
  • R is the RTT
  • As you can see, the throughput is inversely proportional to the RTT.
• If a packet is lost, we have to send that data again, which brings our "goodput" further down.
• Can't keep the pipe full: utilization is low when the bandwidth-delay product (link capacity x RTT) is large, where RTT is the round-trip time.
• What is BDP? It's a measure of the capacity of the network path in bits. In other words, it is the amount of data that can be on the network at any given time, i.e. data that has been transmitted but not yet acknowledged.
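The arithmetic above can be sketched in a few lines. The function names and the example numbers (a 1500-byte packet, a 100 ms RTT and a 100 Mbit/s link) are assumptions chosen only for illustration.

```python
def stop_and_wait_throughput_bps(packet_bits, rtt_s):
    """Slide formula M/R: one packet of M bits per round trip of R seconds."""
    return packet_bits / rtt_s


def bdp_bits(link_bps, rtt_s):
    """Bandwidth-delay product: link capacity times round-trip time."""
    return link_bps * rtt_s


def packets_to_fill_pipe(packet_bits, link_bps, rtt_s):
    """How many packets would have to be in flight to keep the link busy."""
    return bdp_bits(link_bps, rtt_s) / packet_bits
```

With the example numbers, stop and wait delivers roughly 120 kbit/s on a 100 Mbit/s link, while the BDP is about 10 Mbit, so on the order of 800 packets would need to be in flight to fill the pipe.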

[Figure: Sender/Receiver timeline for stop and wait with a data packet of L bytes. The first packet bit is transmitted at t = 0, the first packet bit arrives at the receiver, an ACK is returned after one RTT, and the ACK arrives so the next packet is sent at t = RTT + L / R.]

So how can we make it efficient?
• By letting the sender put more packets into the network at a time, without waiting for the first one to be ACKed.
• Number of packets in flight = window
• Pipelining: the sender allows multiple "in-flight", yet-to-be-acknowledged data segments
  • the range of sequence numbers must be increased
  • buffering is needed at the sender and/or receiver
• Sender-side complexity:
  • When should the sender inject packets into the network?
  • How many packets should it inject into the network?
  • It needs to keep timers while waiting for ACKs and keep a copy of the un-ACKed packets in case retransmission is needed.
• Receiver-side complexity:
  • It needs ACK logic that distinguishes which packets have been received and which have not.
  • It needs a buffer to hold "out-of-sequence" packets (due to reordering), unless it wants to throw those packets away, which would be very inefficient (remember our goal? That efficiency thing we were talking about).
• Sender and receiver speed mismatch (flow control)
• The network can be overwhelmed by the sender (congestion control)

Now things are getting complicated, aren't they?

Introducing Windows of Packets (Sliding Windows)
• Define a window of packets that have been injected but not yet ACKed. (The window size is the number of un-ACKed packets.)
• We slide the window forward as the packets get acknowledged (sliding window).
• The sliding window is kept as a data structure on both sender and receiver.
  • It allows the sender to keep track of which packets can be released, which packets are waiting for ACKs, and which packets cannot yet be sent.
  • It allows the receiver to keep track of packets already received and ACKed.
• This structure looks good and promising, but where is the guideline for:
  • How large should the window be?
  • What if the receiver cannot handle the sender's data rate?
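A minimal sketch of the sender-side window bookkeeping might look like the following. It assumes unbounded integer sequence numbers and cumulative ACKs, and it leaves out timers and retransmission. The class and method names are hypothetical.

```python
class SlidingWindowSender:
    """Hypothetical sender-side sliding window (no timers, no retransmission)."""

    def __init__(self, window_size):
        self.window_size = window_size
        self.base = 0       # oldest packet sent but not yet ACKed
        self.next_seq = 0   # next packet that has not been sent yet

    def can_send(self):
        # Packets in [base, base + window_size) may be in flight.
        return self.next_seq < self.base + self.window_size

    def send(self):
        if not self.can_send():
            raise RuntimeError("window full: wait for an ACK")
        seq = self.next_seq
        self.next_seq += 1
        return seq

    def on_ack(self, ack_seq):
        """Cumulative ACK: every packet up to and including ack_seq arrived."""
        if ack_seq >= self.base:
            # Slide the window forward, never past what was actually sent.
            self.base = min(ack_seq + 1, self.next_seq)

    def in_flight(self):
        return self.next_seq - self.base
```

With a window of 3, the sender can release packets 0, 1 and 2, then has to wait. An ACK for packet 0 slides the window and allows packet 3 to go out.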

We also need to worry about Flow Control
• Flow control is the problem that arises when there is a mismatch between the sender's and the receiver's data rates.
• Window-based flow control is used to solve the flow control problem.
• In this approach the window size isn't fixed and varies over time.
• To make this work, the receiver needs to signal back to the sender how large a window to use. This is known as a window update.
• Technically an ACK sent by a receiver is different from a window update, but in practice the update piggybacks on the ACK packets.
• As we can see, this approach lets the receiver limit the number of packets the sender can inject by controlling the window size.

Flow Control
• Assume that the sender is allowed to inject "W" packets into the network before it hears an ACK for any of them.
• Assume both sender and receiver are sufficiently fast and the network has no loss and infinite capacity.
• Then the throughput = (SW/R), where W is the window size in packets, S is the packet size in bits, and R is the RTT.
• By controlling the size of "W" we can control the sender's throughput.
• Now time to throw in another monkey wrench:
  • What about the network carrying the traffic? It is possible for the sender to inject packets faster than the routers can handle, causing packet loss.
  • Dealing with this problem is called congestion control.

Congestion Control
• This problem involves the sender slowing down so as not to overwhelm the network between sender and receiver.
• How can we solve this problem?
• Two approaches:
  • We can use explicit signaling, similar to the flow control solution (where the receiver advertises to the sender the window size it can accept).
  • We can use implicit signaling, where the sender guesses that it needs to slow down based on some indication (a popular research area).

Quick Recap on what we have so far
• Sequence numbers for data and ACKs: solve the problem of identifying duplicates.
• Pipelining: brings efficiency by sending more than one packet at a time.
• Sender and receiver keep the window as a data structure: helps in keeping track of which packets are in flight, which are ACKed, and which need to be sent next.
• Receiver tells its window to the sender: helps in achieving flow control.
• Some form of implicit or explicit signaling to handle congestion control (we haven't gone into the specifics yet).

Introducing TCP

TCP Header

TCP Connection Management

Sequence Numbers
• So how large does the sequence number space need to be?
  • Must be able to detect wrap-around
  • Depends on the sender/receiver window size
• E.g.
  • size of seq. no. space = 8, send win = recv win = 7
  • If pkts 0 to 6 are sent successfully and all ACKs are lost
    • Receiver expects 7, 0 to 5; sender retransmits old 0 to 6!!!
  • The size of the sequence no. space must be at least send window + recv window
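The example above can be checked with a small sketch that lists the sequence numbers shared between the sender's retransmitted window and the receiver's next window. The function name and parameters are hypothetical. An empty result means the receiver cannot mistake an old packet for a new one.

```python
def ambiguous_seqs(seq_space, send_win, recv_win, first=0):
    """Sequence numbers that are both 'old retransmission' and 'expected new'."""
    # Packets the sender may retransmit after losing every ACK.
    old = {(first + i) % seq_space for i in range(send_win)}
    # Packets the receiver expects next, after accepting the whole send window.
    new_start = first + send_win
    new = {(new_start + i) % seq_space for i in range(recv_win)}
    return sorted(old & new)
```

For a sequence space of 8 and windows of 7, numbers 0 to 5 are ambiguous. With a space of 14 (send window + receive window) the overlap disappears.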

Sequence Numbers in TCP
• TCP regards data as a "byte stream"
  • each byte in the byte stream is numbered
  • 32-bit value, wraps around (after about 4 GB)
  • initial values are selected at connection start-up time
• TCP breaks up the byte stream into packets.
  • Packet size is limited to the Maximum Segment Size (MSS)
• Each packet has a sequence number
  • the seq. no. of the 1st byte indicates where the packet fits in the byte stream
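Because the 32-bit sequence number wraps around, "before" and "after" have to be decided modulo 2^32. The sketch below shows one common way to do that; the helper names are hypothetical and the comparison assumes the two numbers are less than 2^31 bytes apart.

```python
MOD = 1 << 32  # size of the 32-bit sequence number space


def seq_add(seq, nbytes):
    """Advance a sequence number by nbytes, wrapping at 2**32."""
    return (seq + nbytes) % MOD


def seq_lt(a, b):
    """True if a comes before b, assuming they are less than 2**31 apart."""
    return a != b and ((b - a) % MOD) < (1 << 31)
```

A segment that starts 16 bytes before the wrap point and carries 32 bytes ends at sequence number 16, and `seq_lt` still reports its start as coming before its end.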

TCP Options: MSS
• MSS is the largest segment that a TCP receiver is willing to receive from its peer and, as a result, the largest size its peer should ever use when sending.
• It only counts TCP data bytes.
• The MSS option is carried in the SYN segment. The default is 536 bytes. A typical value we see is 1460 bytes.

TCP Options: Window Scale
• Without window scaling, the maximum window that can be advertised, and so the maximum amount of un-ACKed data in flight, is 65535 bytes (about 64 KB).
• The WS option effectively increases the capacity of the TCP window advertisement field from 16 to 30 bits.
• The Window Scale option applies a scaling factor to the 16-bit window advertisement value.
• A shift count of 0 indicates no scaling, and the maximum shift count is 14, which provides a window of (65535 x 2^14) bytes, effectively 1 GB.
• This option can only appear in SYN segments, so the scale factor is fixed in each direction when the connection is established.
• Sample PCAP
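As a rough illustration of the scaling arithmetic, the sketch below turns a 16-bit advertised value and a shift count into the effective window in bytes. The function name is hypothetical, and the range checks simply mirror the limits stated above.

```python
MAX_SHIFT = 14  # largest shift count allowed by the Window Scale option


def effective_window(advertised, shift):
    """Effective receive window in bytes for a 16-bit value and a shift count."""
    if not 0 <= advertised <= 0xFFFF:
        raise ValueError("advertised window must fit in 16 bits")
    if not 0 <= shift <= MAX_SHIFT:
        raise ValueError("shift count must be between 0 and 14")
    # Scaling is a left shift, i.e. multiplication by 2**shift.
    return advertised << shift
```

With the maximum values, `effective_window(65535, 14)` gives 1,073,725,440 bytes, which is just under 1 GB.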

TCP Options: Timestamp Option
• The Timestamp option allows the sender to place two 4-byte timestamp values in every segment.
• The receiver reflects these values in the ACK, allowing the sender to calculate an estimate of the connection's RTT for each ACK received.
• The sender puts a 32-bit value in the Timestamp Value field (TSV or TSval), and the receiver echoes it back unchanged in the second field, Timestamp Echo Reply (TSER or TSecr).
• This option serves two purposes:
  • Better estimation of the RTT, which is an input to the RTO. (We will dig deeper later.)
  • It also helps with Protection Against Wrapped Sequence Numbers (PAWS).
• Sample PCAP

TCP Timeout (RTO) and Retransmission
• One problem we mentioned earlier was how long a sender should wait before concluding that a packet is lost (either the sent packet or its ACK).
• A timeout needs to be at least around: time to send the packet to the receiver + time for the ACK to travel back.
• The problem is that neither factor is constant. Both can vary in the network, and we can't rely on users to tell us the right value for every circumstance.
• The protocol tries to estimate (guess) the value by itself. This is known as round-trip time estimation.
• It's a statistical process, and the estimate is close to the sample mean of a collection of RTT samples. (Note: the average changes over time as network conditions change.)

TCP Round Trip Time and Timeout

Q: What are the challenges with setting the TCP timeout value?
• It has to be longer than the RTT
  • but the RTT varies, so it must adapt
• Importance of accurate RTT estimators:
  • If it's set too short:
    • premature timeouts, which cause unnecessary retransmissions
  • If it's set too long:
    • slow reaction to segment loss

Q: How to estimate the RTT?
• SampleRTT: measured time from segment transmission until ACK receipt
  • ignore retransmissions, why?
• SampleRTT will vary; we want the estimated RTT to be "smoother"
  • average several recent measurements, not just the current SampleRTT

Adaptive Retransmission (Original Algorithm)
• Measure SampleRTT for each segment/ACK pair
• Compute a weighted running average of the RTT
  • EstRTT = a x EstRTT + (1 - a) x SampleRTT
    - a is between 0.8 and 0.9 (to smooth the estimated RTT)
    - A small a makes the estimate follow temporary fluctuations; a large a is more stable but may not be quick to adapt to real changes
    - With the recommended value of a, 80% to 90% of each new estimate comes from the previous estimate and 10% to 20% from the new measurement
• Set the timeout based on EstRTT
  • TimeOut = 2 x EstRTT
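The running average can be sketched as follows. The default a = 0.875 is an assumed value picked from the 0.8 to 0.9 range given above, and the function names are hypothetical.

```python
def update_est_rtt(est_rtt, sample_rtt, a=0.875):
    """EstRTT = a x EstRTT + (1 - a) x SampleRTT (a = 0.875 is an assumption)."""
    return a * est_rtt + (1 - a) * sample_rtt


def timeout(est_rtt):
    """Original algorithm: the timeout is twice the estimated RTT."""
    return 2 * est_rtt
```

Starting from an estimate of 100 ms, a single 200 ms sample moves the estimate to 112.5 ms and the timeout to 225 ms, which shows how strongly the old estimate damps one outlier.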

Example RTT Estimation

[Figure: RTT from gaia.cs.umass.edu to fantasia.eurecom.fr. x-axis: time (seconds), 1 to 106; y-axis: RTT (milliseconds), 100 to 350; two series: SampleRTT and Estimated RTT.]

Problem: Retransmission Ambiguity
• The sender assumes the ACK is for the original transmission, but it was for the retransmission => SampleRTT is too large
• The sender assumes the ACK is for the retransmission, but it was for the original => SampleRTT is too small

[Figure: two Sender/Receiver timelines, each showing an original transmission, a retransmission, an ACK, and the resulting SampleRTT interval.]

Karn/Partridge Algorithm (Simple Proposal)
• Do not sample the RTT when retransmitting
  • only measure SampleRTT for segments that were sent once
• The Karn and Partridge proposal also uses exponential backoff of the timeout
  • Congestion is the most likely cause of lost segments
  • TCP sources should not react too aggressively to a timeout
  • The more timeouts occur, the more cautious the source should become (congestion problem)
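The two rules can be sketched together as below. Doubling on each timeout is the usual reading of "exponential backoff". The 60-second cap and all names are assumptions for illustration.

```python
class KarnTimer:
    """Hypothetical timer applying the two Karn/Partridge rules."""

    def __init__(self, initial_rto, max_rto=60.0):
        self.rto = initial_rto
        self.max_rto = max_rto  # assumed upper bound, not from the slides

    def on_timeout(self):
        """Back off exponentially: double the RTO after every timeout."""
        self.rto = min(self.rto * 2, self.max_rto)
        return self.rto

    @staticmethod
    def should_sample(transmit_count):
        """Only segments sent exactly once give an unambiguous RTT sample."""
        return transmit_count == 1
```

Three consecutive timeouts starting from 1 s give timeouts of 2 s, 4 s and 8 s, and a segment that had to be retransmitted never feeds the RTT estimator.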

Jacobson/Karels Algorithm
• The original computation for the RTT did not take the variance of the sample RTTs into account.
  • The original formula couldn't keep up with wide fluctuations in the RTT.
  • If the variation among samples is small, the estimated RTT can be trusted more, without doubling the estimate.
  • A large variance in the samples means timeout values should not be too tightly coupled to the estimated RTT.
• Jacobson's new calculation keeps track of both the average RTT and the variability of the samples.
• Using the standard deviation was considered an expensive operation (due to the square root), so the mean deviation was chosen, though many say it's baloney…
• Further Reading: https://www.leeds.ac.uk/educol/documents/00003759.htm



Jacobson/Karels Algorithm
• Diff = SampleRTT - EstRTT
• EstRTT = EstRTT + (δ x Diff)
• Dev = Dev + δ x (|Diff| - Dev)
  • where δ is a fraction between 0 and 1
• TimeOut = EstRTT + 4 x Dev
  • the weights (1 for EstRTT, 4 for Dev) are empirical values
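A minimal sketch of the Jacobson/Karels update follows. It uses one gain, here called delta (a fraction between 0 and 1), for both the estimate and the deviation. The default of 0.125, the zero initial deviation and the class name are assumptions.

```python
class RttEstimator:
    """Hypothetical Jacobson/Karels estimator with a single gain delta."""

    def __init__(self, first_sample, delta=0.125):
        self.delta = delta           # assumed gain, a fraction between 0 and 1
        self.est_rtt = first_sample  # seed the estimate with the first sample
        self.dev = 0.0               # assumed starting mean deviation

    def update(self, sample_rtt):
        """Fold one new SampleRTT into the estimate and return the timeout."""
        diff = sample_rtt - self.est_rtt
        self.est_rtt += self.delta * diff
        self.dev += self.delta * (abs(diff) - self.dev)
        return self.timeout()

    def timeout(self):
        # TimeOut = EstRTT + 4 x Dev
        return self.est_rtt + 4 * self.dev
```

Seeded at 100 ms, a 180 ms sample moves the estimate to 110 ms and the deviation to 10 ms, so the timeout becomes 150 ms. A following sample equal to the estimate shrinks the deviation and brings the timeout down to 145 ms.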

Timestamp Extension for RTT estimate
• This improves the timeout mechanism by providing an accurate measurement of the RTT
• When sending a packet, insert the current time into the option
  • 4 bytes for the time, 4 bytes for echoing a received timestamp
• Receiver echoes the timestamp in the ACK
  • Actually it will echo whatever is in the timestamp field
• Removes retransmission ambiguity
  • Can get an RTT sample on any packet

Linux RTO
• The Linux RTO mechanism uses both the standard RTT estimation and timestamps to increase the accuracy of the RTO.
• One basic problem with the standard RTT estimation is that if there is a sharp drop in RTT, the deviation grows and the RTO increases, which is counterintuitive. Linux takes care of that.

So far what we have learned
• A little bit about TCP and the problems it tries to solve
• The TCP header and a few TCP options
• TCP timeout calculation and how the RTT is estimated

Fast Retransmit
• Before Fast Retransmit, TCP detected packet loss solely by using the timeout, which means it took a long time (whatever the RTO is) to react to lost packets.
• The sender sends a packet and TCP sets up the timer (standard algorithm or timestamps). If the ACK isn't received within that time, it triggers a retransmit of the packet.
• Retransmission timeouts are considered really bad in the TCP world. (We will look later at why; for now just take my word for it.)
• That's why "Fast Retransmit" was introduced in TCP.
• The basic idea is that when the sender gets 3 duplicate ACKs, it resends the missing packet immediately instead of waiting for the RTO, which lets it react faster to lost packets.
• We will still have the retransmission timeout (RTO), but it acts as a backup mechanism.
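The duplicate-ACK rule can be sketched as a small counter on the sender. The sketch assumes each ACK carries the number of the next packet the receiver expects, and it ignores congestion window handling entirely. The names are hypothetical.

```python
DUP_ACK_THRESHOLD = 3  # duplicate ACKs needed to trigger a fast retransmit


class FastRetransmitDetector:
    """Hypothetical sender-side duplicate-ACK counter."""

    def __init__(self):
        self.last_ack = None
        self.dup_count = 0

    def on_ack(self, ack):
        """Return the packet number to retransmit now, or None."""
        if ack == self.last_ack:
            self.dup_count += 1
            # Fire exactly once, on the third duplicate of the same ACK.
            if self.dup_count == DUP_ACK_THRESHOLD:
                return ack
        else:
            # A new ACK value resets the duplicate counter.
            self.last_ack = ack
            self.dup_count = 0
        return None
```

Feeding it the ACK sequence 4, 5, 5, 5, 5, 5 triggers a retransmission of packet 5 on the fifth ACK, which is the third duplicate of ACK 5.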

Fast Retransmit (Ex:1)

Assumption:
• Only one packet in a window is dropped. In this diagram, it is Pkt1.
• TCP at the sending side implements the fast retransmit algorithm.
• The retransmitted packet is completely received at the receiver.

The packets in transit are not enough to trigger fast retransmit.

Ignore cwnd for now. We will look into it during congestion avoidance.

Fast Retransmit (Ex:2)

Assumption:
• Only one packet in a window is dropped. In this diagram, it is Pkt5.
• TCP at the sending side implements the fast retransmit algorithm.
• The retransmitted packet is completely received at the receiver.

In this example, the packets in transit are enough to trigger fast retransmit. Therefore, after receiving the third duplicate ACK5, TCP at the sending side retransmits Pkt5.

Fast Retransmit (Ex:3)

Assumption:
• Two packets in a window are dropped. In this diagram, they are Pkt7 and Pkt9.
• TCP at the sending side implements the fast retransmit algorithm.
• The retransmitted packets are completely received at the receiver.

The retransmission of Pkt9 happens because Pkt9's RTO expires, not because of fast retransmit. Why? As shown, there are not enough packets in transit to trigger a second fast retransmit.

Fast Retransmit - Recap
• So we saw how Fast Retransmit helps in recovering quickly in the case of a single packet loss, but not in the case of multiple packet losses.

Let's talk about Flow Control

TCP Traffic Control
• Traffic control
  • There are two reasons for the sender to reduce the rate of sending packets:
    • When the receiver's buffer space is not enough: flow control
    • When the network is congested: congestion control

[Figure: two cases labeled "Small-capacity receiver" and "network congestion".]

Flow Control
• The idea here is to not let the sender overwhelm the receiver with more data than it can handle.
• This is achieved by the receiver telling the sender its buffer size (its capability to receive), which gives the sender an idea of how much it can send.
• This is implemented by maintaining a window structure on both the sender and receiver sides.

Sliding Window Flow Control

[Figure: (a) sender's window and (b) receiver's window, each drawn over segments numbered 0 1 2 3 0 1 2 3. Sender: after the last segment that was ACKed come the segments sent but not acknowledged, then the segments that can be sent. The window shrinks as segments are sent and expands as ACKs are received. Receiver: segments that were received, then segments that will be received. The window shrinks as segments are received and expands as ACKs are sent.]

The receiver tells the sender its window size, which the sender tracks in a variable "awnd" (advertised window).

Window Flow Control
• ~ W packets per RTT
• Lost packet detected by missing ACK

[Figure: Source and Destination timelines. In each RTT the source sends a window of data packets 1, 2, ..., W and the destination returns the corresponding ACKs.]

Source Rate (Throughput)
• Limit the number of packets in the network to the window W (which is primarily dominated by "awnd")
• Source rate = (W x MSS) / RTT bps (with the MSS expressed in bits)
• If W is too small, then rate « capacity. If W is too big, then rate > capacity => congestion
• This doesn't take packet loss into account; the Mathis formula (Matthew Mathis) addresses that.
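The two rates can be compared with a short sketch. The first function is the window-limited rate from the slide. The second is the commonly quoted form of the Mathis formula, rate ≈ (MSS / RTT) x C / sqrt(p), where p is the loss probability and C is a constant often taken as sqrt(3/2). The slides do not spell the formula out, so treat that part as an assumption.

```python
import math


def window_limited_rate_bps(window_segments, mss_bytes, rtt_s):
    """Source rate = (W x MSS) / RTT, with the MSS converted to bits."""
    return window_segments * mss_bytes * 8 / rtt_s


def mathis_rate_bps(mss_bytes, rtt_s, loss_prob, c=math.sqrt(3 / 2)):
    """Approximate loss-limited rate (assumed form of the Mathis formula)."""
    return (mss_bytes * 8 / rtt_s) * c / math.sqrt(loss_prob)
```

With a 1460-byte MSS and a 100 ms RTT, a window of 44 segments allows about 5.1 Mbit/s, while 1% loss would cap the rate at roughly 1.4 Mbit/s under this approximation.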

TCP Persist Timer
• Background
  • When the TCP receiver advertises window = 0, the TCP sender stops sending temporarily. Afterwards, the receiver lets the sender know it can receive segments again by sending a new window advertisement. But if this new window advertisement is lost (since it doesn't contain any data, it is not delivered reliably), the sender will wait for the new advertisement forever. (Deadlock!!)
• Solution
  • After the sender learns that window = 0, it transmits a window probe segment periodically to check whether the receiver is ready to accept data. The window probe is sent according to the persist timer.
  • A window probe is a segment of 1 byte length.
  • TCP allows the sender to transmit one byte even if the receiver's window is closed.
  • The TCP persist timer interval increases exponentially.
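The probe schedule can be sketched as below. The function name is hypothetical, and the 5-second starting interval and 60-second cap used in the example are assumed values, not something the slides state.

```python
def persist_probe_times(first_interval, max_interval, count):
    """Times (seconds after window=0) at which window probes are sent."""
    times = []
    now = 0.0
    interval = first_interval
    for _ in range(count):
        now += interval
        times.append(now)
        # Exponential backoff between probes, clamped to max_interval.
        interval = min(interval * 2, max_interval)
    return times
```

With the assumed values, `persist_probe_times(5.0, 60.0, 5)` sends probes 5, 15, 35, 75 and 135 seconds after the window closed.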

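TCP Persist Timer

[Figure: two Sender/Receiver timelines. First: the receiver advertises win=0, the later win=256 advertisement is lost, and the connection ends in Deadlock. Second: after win=0 the sender sends three window probes, each answered with ACK(win=0), spaced by the persist timer (normal TCP exponential backoff).]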

Delayed Acknowledgement
• Delayed acknowledgments reduce the number of segments: the receiver does not send an ACK immediately after receiving an (error-free) packet, but waits up to ~200 ms/500 ms (depending on the host OS) to see whether a packet is queued in its send buffer. If so, it piggybacks the ACK onto that outgoing packet; if no outgoing packet becomes available, the receiver sends the ACK after ~200 ms/500 ms at the latest.

Questions ?
