TCP/IP and Other Transports for High Bandwidth Applications: Back to Basics

Richard Hughes-Jones, The University of Manchester
Summer School, Brasov, Romania, July 2005

www.hep.man.ac.uk/~rich/ then “Talks” then look for “Brasov”



Structure of the Talks

The aim is to give you a picture of how researchers are using high performance networks to support their work.

Back to Basics
  • Simple Introduction to Networking

TCP/IP on High Bandwidth Long Distance Networks
  • But TCP/IP works
  • The effect of packet loss
  • Advanced TCP Stacks
  • Fairness

Real Applications on Real Networks
  • Disk-2-disk applications on real networks
  • Memory-2-memory tests
  • Transatlantic disk-2-disk at Gigabit speeds
  • Remote Computing Farms: the effect of distance
  • Radio Astronomy e-VLBI

Thanks for allowing me to use their slides to: Sylvain Ravot (CERN), Les Cottrell (SLAC), Brian Tierney (LBL), Robin Tasker (DL).


Simple Introduction to Networking

What is a Protocol Stack?

The ISO OSI (Open Systems Interconnection) Seven Layer Model defines a framework allowing development of real network protocols.

A layer:
  • performs unique and specific tasks
  • only has knowledge of those layers immediately above and below
  • uses services of the layer below and provides services to the layer above
  • defines services that are implementation independent: it's a definition of how things work conceptually
  • communicates with its peer in the remote system

The Layering Principle: Encapsulation

Each protocol layer N adds a Header to the data unit from layer N+1. The Header contains control information.

  • Layer 7 Application: user processes
  • Layer 6 Presentation: data interpretation, code transformation
  • Layer 5 Session: connection negotiation, control
  • Layer 4 Transport: end-2-end data transfer & integrity; packet sequencing, flow control
  • Layer 3 Network: addressing, routing; packet sequencing, flow control
  • Layer 2 Data Link: packet assembly/disassembly, transmission control, error checking
  • Layer 1 Physical: electrical, optical, mechanical

[Figure: encapsulation. Going down the stack, the App data gains the headers PH, SH, TH and NH in turn, and finally DH plus a trailing FCS, before being sent as bits on the “wire”. The data unit is labelled Segment at the transport layer, Packet at the network layer and Frame at the data link layer.]

What do the Layers do?

  • Transport Layer: acts as a go-between for the user and the network
      • Provides end-to-end data movement & control
      • Gives the level of reliability/integrity needed by the application
      • Can ensure a reliable service (which the network layer cannot), e.g. assigns sequence numbers to identify “lost” packets
  • Network Layer: deals with logical addressing & the transmission of packets; mechanism for routing
  • Data Link Layer: provides the synchronization and error checking for the data transmitted over a single physical link (may ensure correct delivery of frames)
      • Going down: fits packets from the network layer above into frames
      • Going up: groups bits from the physical layer into frames
  • Physical Layer: concerned with the transmission of individual bits

How do the “IP” Protocols fit together?

[Figure: the “IP” protocols arranged by layer.]

  • Application (Presentation, Session): File Transfer Protocol (FTP) RFC 959; Simple Mail Transfer Protocol (SMTP) RFC 821; TELNET RFC 854; TFTP RFC 783; NFS RFC 1014, 1057 and 1094; SNMP RFC 1157; ssh; HTTP; POP3/IMAP; DNS
  • Transport: Transmission Control Protocol (TCP) RFC 793; User Datagram Protocol (UDP) RFC 768
  • Network: Internet Protocol (IP) RFC 791; Internet Control Message Protocol (ICMP) RFC 792; Address Resolution Protocols: ARP RFC 826, RARP RFC 903
  • Data Link: Ethernet, Token Ring, ISDN, FDDI, SMDS, ATM, SDH/SONET, xDSL; Network Interface Cards
  • Physical (Transmission Mode): TP, Copper, Fibre Optic, Satellite, Microwave, DWDM, CWDM, etc.
  • Also shown in the figure: ping, traceroute, Routing (OSPF, BGP)

Some of the “IP” Protocols

  • Transmission Control Protocol (TCP): provides application programs access to the network using a reliable, connection-oriented transport layer service.
  • User Datagram Protocol (UDP): provides an unreliable, connection-less delivery service using the IP protocol to transport messages between machines. It adds the ability to distinguish among multiple destinations on a single host computer.
  • Internet Protocol (IP): receives datagrams from the upper-layer software and transmits them to the destination host, based upon a best effort, connection-less delivery service.
  • Internet Control Message Protocol (ICMP): allows internet routers to transmit error messages and test messages.
  • Internet Group Management Protocol (IGMP): is used with multicast to send UDP datagrams to multiple hosts.
  • Address Resolution Protocol (ARP): translates between the 32 bit IP address and a 48 bit LAN address.
  • Reverse Address Resolution Protocol (RARP): translates between the 48 bit LAN address and the 32 bit IP address.

The Physical Layer (1): Ethernet

The Link Layer (2): Ethernet Frame

[Figure: Ethernet frame layout: Frame header | IP Datagram | FCS, followed by an Inter Frame Gap of 12 bytes.]

  • Preamble: 56 bits of alternating 0s and 1s. The preamble provides all the nodes on the network with a signal against which to synchronize.
  • Start Frame Delimiter: marks the start of a frame. It is 8 bits long with the pattern 10101011.
  • Media Access Control (MAC) Address: every Ethernet network card has built into its hardware a unique six-octet (48-bit) number, usually written in hexadecimal, that differentiates it from all other Ethernet cards in the universe. The destination address (DA) and source address (SA) define the path across the link.
  • Length/Type field: two octets long. If the value is ≤ 1500 (0x05dc hex) it indicates the length of the data. If the value is > 1500 (in practice ≥ 0x0600) it indicates the network-layer protocol: the “Ethernet Types”.
  • Data: the reason the frame exists. MTU: Maximum Transmission Unit.
  • Frame Check Sequence: protects the frame contents.
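The Length/Type rule can be sketched in a few lines of Python. This is only an illustration: the function name and the sample frame are made up, and the frame is assumed to start at the destination address (the preamble and start frame delimiter are normally stripped by the hardware).

```python
import struct

def describe_length_type(frame: bytes) -> str:
    """Interpret the two-octet Length/Type field that follows DA and SA."""
    # DA (6 octets) and SA (6 octets) come first; the field is big-endian.
    (value,) = struct.unpack("!H", frame[12:14])
    if value <= 1500:
        return f"length={value}"
    return f"ethertype=0x{value:04x}"

# Hypothetical frame: broadcast DA, made-up SA, type 0x0800 (IPv4).
frame = bytes.fromhex("ffffffffffff" "020000000001" "0800") + b"payload"
print(describe_length_type(frame))
```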

The Link Layer: Ethernet VLANs

  • VLANs are logical networks built over the same physical cable plant. A VLAN header assigns each Ethernet frame to its logical network.
  • A VLAN frame is identified by the value 0x8100 in the Type field location.
  • The next two octets are composed of the following three fields:
      • User Priority field: 3 bits in length, used to define the priority of the Ethernet frame. This is utilized to define and deliver a class of service.
      • Canonical format indicator: 1 bit in length. Just don't ask.
      • VLAN Identifier field: 12 bits in length, contains the VLAN identifier (VID) of this frame.
  • The original Length/Type field then follows the inserted VLAN tag.
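A short Python sketch of how the two octets after 0x8100 split into the three fields. The function name and the sample values (priority 5, VID 100) are illustrative assumptions, not taken from the slides.

```python
import struct

def parse_vlan_tag(tag: bytes) -> dict:
    """Split a 4-octet VLAN tag (0x8100 + 2 control octets) into its fields."""
    tpid, tci = struct.unpack("!HH", tag)
    if tpid != 0x8100:
        raise ValueError("not a VLAN tag")
    return {
        "priority": tci >> 13,      # top 3 bits: user priority
        "cfi": (tci >> 12) & 0x1,   # next bit: canonical format indicator
        "vid": tci & 0x0FFF,        # low 12 bits: VLAN identifier
    }

# Hypothetical tag: priority 5, CFI 0, VLAN 100.
tag = struct.pack("!HH", 0x8100, (5 << 13) | (0 << 12) | 100)
print(parse_vlan_tag(tag))
```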

The Network Layer (3): IP

IP Layer properties:
  • Provides best effort delivery
  • It is unreliable: packets may be lost, duplicated or out of order
  • Connection-less
  • Provides logical addresses
  • Provides routing
  • Demultiplexes data on protocol number

The Internet datagram

[Figure: Frame header | IP header (20 Bytes) | Transport | FCS]

IP header layout (bit positions 0 to 31 in each 32-bit word):
  Word 1: Vers (0-3) | Hlen (4-7) | Type of serv (8-15) | Total length (16-31)
  Word 2: Identification (0-15) | Flags (16-18) | Fragment offset (19-31)
  Word 3: TTL (0-7) | Protocol (8-15) | Header Checksum (16-31)
  Word 4: Source IP address
  Word 5: Destination IP address
  Then:   IP Options (if any) | Padding

IP Datagram Format (cont.)

  • Type of Service (TOS): now being used for QoS
  • Total length: length of datagram in bytes, includes header and data
  • Time to live (TTL): specifies how long the datagram is allowed to remain in the internet
      • Routers decrement by 1
      • When TTL = 0 the router discards the datagram
      • Prevents infinite loops
  • Protocol: specifies the format of the data area
      • Protocol numbers are administered by a central authority to guarantee agreement, e.g. ICMP=1, TCP=6, UDP=17 …
  • Source & destination IP address (32 bits each): contain the IP address of the sender and the intended recipient
  • Options (variable length): mainly used to record a route, or timestamps, or specify routing
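The field descriptions map directly onto a fixed 20-byte layout, which the Python sketch below unpacks. The function name and the sample header are made up for illustration, and options are ignored.

```python
import socket
import struct

def parse_ipv4_header(data: bytes) -> dict:
    """Unpack the fixed 20-byte part of an IP header (options are ignored)."""
    (ver_hlen, tos, total_len, ident, flags_frag,
     ttl, proto, checksum, src, dst) = struct.unpack("!BBHHHBBH4s4s", data[:20])
    return {
        "version": ver_hlen >> 4,
        "header_bytes": (ver_hlen & 0x0F) * 4,   # Hlen counts 32-bit words
        "tos": tos,
        "total_length": total_len,               # header + data, in bytes
        "identification": ident,
        "flags": flags_frag >> 13,               # top 3 bits
        "fragment_offset": flags_frag & 0x1FFF,  # low 13 bits
        "ttl": ttl,
        "protocol": proto,                       # e.g. ICMP=1, TCP=6, UDP=17
        "checksum": checksum,
        "src": socket.inet_ntoa(src),
        "dst": socket.inet_ntoa(dst),
    }

# Hypothetical header: version 4, Hlen 5 words, TTL 64, protocol 6 (TCP).
sample = struct.pack("!BBHHHBBH4s4s", 0x45, 0, 40, 1, 0, 64, 6, 0,
                     socket.inet_aton("192.168.0.1"),
                     socket.inet_aton("192.168.0.2"))
hdr = parse_ipv4_header(sample)
print(hdr["version"], hdr["header_bytes"], hdr["protocol"], hdr["src"])
```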

Internet Class-based addresses

An address looks like 192.168.22.123

  • Class A: large number of hosts, few networks
      • 0nnnnnnn hhhhhhhh hhhhhhhh hhhhhhhh
      • 7 network bits (0 and 127 reserved, so 126 networks), 24 host bits (> 16M hosts/net)
      • Initial byte 1-126 (decimal)
  • Class B: medium number of hosts and networks
      • 10nnnnnn nnnnnnnn hhhhhhhh hhhhhhhh
      • 16384 class B networks, 65534 hosts/network
      • Initial byte 128-191 (decimal)
  • Class C: large number of small networks
      • 110nnnnn nnnnnnnn nnnnnnnn hhhhhhhh
      • 2097152 networks, 254 hosts/network
      • Initial byte 192-223 (decimal)
  • Class D: Multicast (see RFC 1112)
      • 1110nnnn nnnnnnnn nnnnnnnn hhhhhhhh
      • Initial byte 224-239 (decimal)
  • Class E: Reserved
      • Initial byte 240-255 (decimal)
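As a rough illustration, the class can be read off the first byte alone. The sketch below uses the standard leading-bit patterns, under which Class E covers every address whose first byte is 240 or above. The function name and sample addresses are invented for the example.

```python
def address_class(address: str) -> str:
    """Return the class letter implied by the first byte of a dotted address."""
    first = int(address.split(".")[0])
    if not 0 <= first <= 255:
        raise ValueError("first byte out of range")
    if first < 128:   # leading bit  0
        return "A"
    if first < 192:   # leading bits 10
        return "B"
    if first < 224:   # leading bits 110
        return "C"
    if first < 240:   # leading bits 1110
        return "D"
    return "E"        # leading bits 1111

# Usable hosts per network: all-zeros and all-ones host parts are reserved.
HOSTS_PER_NETWORK = {"A": 2**24 - 2, "B": 2**16 - 2, "C": 2**8 - 2}

for addr in ("10.0.0.1", "130.88.0.1", "192.168.22.123", "224.0.0.1"):
    print(addr, address_class(addr))
```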

The Transport Layer (4): UDP

UDP provides:
  • Connection-less service over IP
      • No setup or teardown
      • One packet at a time
  • Minimal overhead: high performance
  • Best effort delivery
  • It is unreliable: packets may be lost, duplicated or out of order
  • Application is responsible for:
      • Data reliability
      • Flow control
      • Error handling

UDP Datagram format

[Figure: Frame header | IP header | UDP header (8 Bytes) | Application data | FCS]

UDP header layout (bit positions 0 to 31 in each 32-bit word):
  Word 1: Source port (0-15) | Destination port (16-31)
  Word 2: UDP message len (0-15) | Checksum (opt) (16-31)

  • Source/destination port: port numbers identify sending & receiving processes
      • Port number & IP address allow any application on the Internet to be uniquely identified
      • Ports can be static or dynamic
          • Static (< 1024): assigned centrally, known as well known ports
          • Dynamic
  • Message length: in bytes, includes the UDP header and data (min 8, max 65535)
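The 8-byte header is simple enough to build by hand. The Python sketch below is illustrative only: the function names and port numbers are assumptions, and the optional checksum is left at zero, which over IPv4 means "not computed".

```python
import struct

UDP_HEADER_LEN = 8  # bytes

def build_udp_datagram(src_port: int, dst_port: int, payload: bytes) -> bytes:
    """Prefix the payload with a UDP header; the length covers header + data."""
    length = UDP_HEADER_LEN + len(payload)
    if length > 65535:
        raise ValueError("datagram too long for the 16-bit length field")
    checksum = 0  # optional field left unset in this sketch
    return struct.pack("!HHHH", src_port, dst_port, length, checksum) + payload

def parse_udp_header(datagram: bytes) -> dict:
    src, dst, length, checksum = struct.unpack("!HHHH", datagram[:UDP_HEADER_LEN])
    return {"src_port": src, "dst_port": dst, "length": length, "checksum": checksum}

# Hypothetical datagram from dynamic port 5000 to well known port 53.
datagram = build_udp_datagram(5000, 53, b"hello")
print(parse_udp_header(datagram))
```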

The Transport Layer (4): TCP

TCP (RFC 793, RFC 1122) provides:
  • Connection-oriented service over IP
      • During setup the two ends agree on details
      • Explicit teardown
      • Multiple connections allowed
  • Reliable end-to-end byte stream delivery over an unreliable network. It takes care of:
      • Lost packets
      • Duplicated packets
      • Out of order packets
  • TCP also provides:
      • Data buffering
      • Flow control
      • Error detection & handling
      • Limits on network congestion

The TCP Segment Format

[Figure: Frame header | IP header | TCP header (20 Bytes) | Application data | FCS]

TCP header layout (bit positions 0 to 31 in each 32-bit word):
  Word 1: Source port (0-15) | Destination port (16-31)
  Word 2: Sequence number
  Word 3: Acknowledgement number
  Word 4: Hlen (0-3) | Resv (4-9) | Code (10-15) | Window (16-31)
  Word 5: Checksum (0-15) | Urgent ptr (16-31)
  Then:   Options (if any) | Padding

TCP Segment Format (cont.)

  • Source/Dest port: TCP port numbers to ID applications at both ends of the connection
  • Sequence number: first byte in the segment from the sender's byte stream
  • Acknowledgement: identifies the number of the byte the sender of this segment expects to receive next
  • Code: used to determine segment purpose, e.g. SYN, ACK, FIN, URG
  • Window: advertises how much data this station is willing to accept. Can depend on buffer space remaining.
  • Options: used for window scaling, SACK, timestamps, maximum segment size, etc.
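To make the field list concrete, here is a minimal Python sketch that unpacks the fixed 20-byte TCP header and decodes the Code bits. The function name, port numbers and sequence values are illustrative assumptions, and options are not parsed.

```python
import struct

# Code (flag) bits in the low 6 bits of the fourth 32-bit word's first half.
FLAG_BITS = {"FIN": 0x01, "SYN": 0x02, "RST": 0x04,
             "PSH": 0x08, "ACK": 0x10, "URG": 0x20}

def parse_tcp_header(data: bytes) -> dict:
    """Unpack the fixed 20-byte part of a TCP header."""
    (src, dst, seq, ack, hlen_code,
     window, checksum, urgent) = struct.unpack("!HHIIHHHH", data[:20])
    return {
        "src_port": src,
        "dst_port": dst,
        "seq": seq,
        "ack": ack,
        "header_bytes": (hlen_code >> 12) * 4,  # Hlen counts 32-bit words
        "flags": [name for name, bit in FLAG_BITS.items() if hlen_code & bit],
        "window": window,
        "checksum": checksum,
        "urgent_ptr": urgent,
    }

# Hypothetical SYN+ACK segment with a 65535-byte advertised window.
segment = struct.pack("!HHIIHHHH", 80, 5000, 1024, 2048,
                      (5 << 12) | 0x12, 65535, 0, 0)
print(parse_tcp_header(segment))
```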

TCP - providing reliability

  • Positive acknowledgement (ACK) of each received segment
      • Sender keeps a record of each segment sent
      • Sender awaits an ACK: “I am ready to receive byte 2048 and beyond”
      • Sender starts a timer when it sends a segment, so it can re-transmit
  • Inefficient: the sender has to wait

[Figure: Sender/Receiver timeline. Segment n (Sequence 1024, Length 1024) is answered by the ACK of Segment n (Ack 2048) one RTT later; Segment n+1 (Sequence 2048, Length 1024) is then answered by Ack 3072 after a further RTT.]
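A back-of-the-envelope Python sketch shows how costly this waiting is. It assumes one segment per round trip, with each cycle taking roughly the RTT plus the segment's time on the wire. The 100 ms RTT and 1 Gbit/s link are hypothetical figures, not measurements from the talk.

```python
def stop_and_wait_throughput_bps(segment_bytes: int, rtt_s: float,
                                 link_bps: float) -> float:
    """Approximate throughput when each segment must be ACKed before the next."""
    bits = segment_bytes * 8
    time_on_wire = bits / link_bps   # serialisation time of one segment
    return bits / (rtt_s + time_on_wire)

# Hypothetical path: 1024-byte segments, 100 ms RTT, 1 Gbit/s link.
rate = stop_and_wait_throughput_bps(1024, 0.100, 1e9)
print(f"{rate / 1e3:.1f} kbit/s on a 1 Gbit/s link")
```

Under these assumptions the sender achieves well under 0.01% of the link rate, which is why a window of several segments in flight is needed.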

Flow Control: Sender - Congestion Window

  • Uses the congestion window, cwnd, a sliding window to control the data flow
      • Byte count giving the highest byte that can be sent without an ACK
      • Transmit buffer size and advertised receive buffer size are important
      • ACK gives the next sequence number to receive AND the available space in the receive buffer
      • Timer kept for each packet

[Figure: the sender's buffer, divided into: data sent and ACKed; sent data, buffered waiting for ACK; unsent data that may be transmitted immediately; data to be sent, waiting for the window to open (the application writes here). The TCP cwnd slides: the sending host advances a marker as data is transmitted, a received ACK advances the trailing edge, and the receiver's advertised window advances the leading edge.]

Flow Control: Receiver - Lost Data

[Figure: the receiver's buffer, divided into: data given to the application (the application reads here); ACKed but not given to user; lost data; received but not ACKed. The last ACK given marks the next byte expected (the expected sequence number). The window slides: the receiver's advertised window advances the leading edge.]

  • If new data is received with a sequence number ≠ next byte expected, a duplicate ACK is sent with the expected sequence number

How it works: TCP Slowstart

  • Probe the network: get a rough estimate of the optimal congestion window size
  • The larger the window size, the higher the throughput
      • Throughput = Window size / Round-trip Time
  • Exponentially increase the congestion window size until a packet is lost
      • cwnd initially 1 MTU, then increased by 1 MTU for each ACK received
      • Send 1st packet, get 1 ACK, increase cwnd to 2
      • Send 2 packets, get 2 ACKs, increase cwnd to 4
      • Time to reach cwnd size W = RTT × log2(W)
      • Rate doubles each RTT

[Figure: CWND against time, showing slow start (exponential increase), congestion avoidance (linear increase), packet loss and retransmit, and slow start again after a timeout.]
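The doubling can be checked with a tiny Python model, counting cwnd in whole segments and assuming no loss and one ACK per segment. The function name and the target of 64 segments are arbitrary choices for the example.

```python
import math

def slow_start_history(target_segments: int) -> list:
    """cwnd (in segments) at the start of each round trip during slow start."""
    cwnd = 1
    history = [cwnd]
    while cwnd < target_segments:
        acks = cwnd    # one ACK comes back per segment sent this round trip
        cwnd += acks   # +1 segment per ACK, so cwnd doubles each RTT
        history.append(cwnd)
    return history

history = slow_start_history(64)
rounds = len(history) - 1
print(history)                  # [1, 2, 4, 8, 16, 32, 64]
print(rounds, math.log2(64))    # 6 round trips, matching RTT x log2(W)
```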

How it works: TCP Congestion Avoidance

  • Additive increase: starting from the rough estimate, linearly increase the congestion window size to probe for additional available bandwidth
      • cwnd increased by about 1 MTU per RTT (a fraction 1/cwnd of an MTU for each ACK): linear increase in rate
  • TCP takes packet loss as an indication of congestion
  • Multiplicative decrease: cut the congestion window size aggressively if a packet is lost
      • Standard TCP reduces cwnd by a factor of 0.5
  • The transition from slow start to congestion avoidance is determined by ssthresh

[Figure: CWND against time, showing slow start (exponential increase), congestion avoidance (linear increase), packet loss and retransmit, and slow start again after a timeout.]
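The sawtooth produced by additive increase and multiplicative decrease can be reproduced with a simplified Python model. It works in whole segments per round trip, treats every loss as a halving (no timeout), and uses made-up parameters. It is a sketch of the idea, not of any particular TCP stack.

```python
def aimd_trace(cwnd: int, ssthresh: int, loss_rounds: set, rounds: int) -> list:
    """cwnd (in segments) at the start of each round trip."""
    trace = []
    for rtt in range(rounds):
        trace.append(cwnd)
        if rtt in loss_rounds:
            ssthresh = max(cwnd // 2, 2)    # multiplicative decrease by 0.5
            cwnd = ssthresh
        elif cwnd < ssthresh:
            cwnd = min(cwnd * 2, ssthresh)  # slow start: double per RTT
        else:
            cwnd += 1                       # congestion avoidance: +1 per RTT
    return trace

# Hypothetical run: ssthresh of 8 segments, one loss in round trip 6.
trace = aimd_trace(cwnd=1, ssthresh=8, loss_rounds={6}, rounds=10)
print(trace)   # [1, 2, 4, 8, 9, 10, 11, 5, 6, 7]
```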

TCP Fast Retransmit & Recovery

  • Duplicate ACKs are due to lost segments or segments out of order
  • Fast Retransmit: if the receiver transmits 3 duplicate ACKs (i.e. it received 3 additional segments without getting the one expected), the sender will:
      • Send the missing segment
      • Set ssthresh to 0.5 × cwnd, so enter the congestion avoidance phase
      • Set cwnd = (0.5 × cwnd + 3), the 3 accounting for the 3 dup ACKs
      • Increase cwnd by 1 segment for each further duplicate ACK
      • Keep sending new data if allowed by cwnd
      • Set cwnd to half the original value on a new ACK
      • No need to go into “slow start” again
  • At steady state, CWND oscillates around the optimal window size
  • With a retransmission timeout, slow start is triggered again

[Figure: CWND against time, showing slow start (exponential increase), congestion avoidance (linear increase), packet loss and retransmit, and slow start again after a timeout.]
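The duplicate-ACK bookkeeping on the sender can be sketched as follows. This is a simplified Python model with cwnd in segments; the function name, the ACK numbers and the starting cwnd of 10 are illustrative assumptions, and it does not send any data.

```python
def process_acks(acks: list, cwnd: float, dup_threshold: int = 3):
    """Return (events, final cwnd) for a sequence of ACK numbers seen by the sender."""
    events = []
    last_ack = None
    dup_count = 0
    ssthresh = None
    in_recovery = False
    for ack in acks:
        if ack == last_ack:
            dup_count += 1
            if dup_count == dup_threshold and not in_recovery:
                ssthresh = cwnd / 2                 # ssthresh = 0.5 * cwnd
                cwnd = ssthresh + dup_threshold     # inflate by the 3 dup ACKs
                in_recovery = True
                events.append(("retransmit", ack))  # resend the missing segment
            elif in_recovery:
                cwnd += 1                           # each extra dup ACK adds 1 segment
        else:
            if in_recovery:
                cwnd = ssthresh                     # new ACK: half the original cwnd
                in_recovery = False
            last_ack = ack
            dup_count = 0
    return events, cwnd

# Hypothetical ACK stream: the segment starting at byte 2048 is lost.
events, final_cwnd = process_acks([1024, 2048, 2048, 2048, 2048, 2048, 6144], cwnd=10)
print(events, final_cwnd)   # [('retransmit', 2048)] 5.0
```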

TCP (Simple) Tuning - Filling the Pipe

  • Remember, TCP has to hold a copy of data in flight
  • Optimal (TCP buffer) window size depends on:
      • Bandwidth end to end, i.e. min(BW of links), AKA bottleneck bandwidth
      • Round Trip Time (RTT)
  • The number of bytes in flight to fill the entire path is the Bandwidth×Delay Product: BDP = RTT × BW
  • Can increase bandwidth by orders of magnitude
  • Windows are also used for flow control

[Figure: Sender/Receiver timeline showing a segment, its ACK and the RTT. Segment time on wire = bits in segment / BW.]
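A quick calculation in Python makes the point. The 1 Gbit/s bottleneck and 120 ms RTT are hypothetical figures chosen for illustration, and 65535 bytes is the largest window the 16-bit Window field can advertise without window scaling.

```python
def bandwidth_delay_product_bytes(bandwidth_bps: float, rtt_s: float) -> float:
    """Bytes that must be in flight to fill the path: BDP = RTT * BW."""
    return bandwidth_bps * rtt_s / 8

def window_limited_throughput_bps(window_bytes: int, rtt_s: float) -> float:
    """Throughput = Window size / Round-trip Time, in bits per second."""
    return window_bytes * 8 / rtt_s

# Hypothetical long-distance path: 1 Gbit/s bottleneck, 120 ms RTT.
bdp = bandwidth_delay_product_bytes(1e9, 0.120)
limited = window_limited_throughput_bps(65535, 0.120)
print(f"BDP = {bdp / 1e6:.0f} MB")                       # about 15 MB
print(f"64 KB window gives {limited / 1e6:.2f} Mbit/s")  # a few Mbit/s
```

With these numbers, a buffer of about 15 MB is needed to fill the pipe, while a 64 KB window caps the rate at a few Mbit/s.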

Congestion control: ACK clocking

More Information

  • Lectures, tutorials etc. on TCP/IP:
      • www.nvcc.va.us/home/joney/tcp_ip.htm
      • www.cs.pdx.edu/~jrb/tcpip.lectures.html
      • www.raleigh.ibm.com/cgi-bin/bookmgr/BOOKS/EZ306200/CCONTENTS
      • www.cisco.com/univercd/cc/td/doc/product/iaabu/centri4/user/scf4ap1.htm
      • www.cis.ohio-state.edu/htbin/rfc/rfc1180.html
      • www.jbmelectronics.com/tcp.htm
  • Encyclopaedia: http://www.freesoft.org/CIE/index.htm
  • TCP/IP Resources: www.private.org.il/tcpip_rl.html
  • Understanding IP addresses: http://www.3com.com/solutions/en_US/ncs/501302.html
  • Configuring TCP (RFC 1122): ftp://nic.merit.edu/internet/documents/rfc/rfc1122.txt
  • Assigned protocols, ports etc. (RFC 1010): http://www.es.net/pub/rfcs/rfc1010.txt & /etc/protocols

Any Questions?

Backup Slides

More Information: Some URLs

  • UKLight web site: http://www.uklight.ac.uk
  • MB-NG project web site: http://www.mb-ng.net/
  • DataTAG project web site: http://www.datatag.org/
  • UDPmon / TCPmon kit + writeup: http://www.hep.man.ac.uk/~rich/net
  • Motherboard and NIC Tests:
      • http://www.hep.man.ac.uk/~rich/net/nic/GigEth_tests_Boston.ppt & http://datatag.web.cern.ch/datatag/pfldnet2003/
      • “Performance of 1 and 10 Gigabit Ethernet Cards with Server Quality Motherboards”, FGCS Special issue 2004, http://www.hep.man.ac.uk/~rich/
  • TCP tuning information may be found at: http://www.ncne.nlanr.net/documentation/faq/performance.html & http://www.psc.edu/networking/perf_tune.html
  • TCP stack comparisons: “Evaluation of Advanced TCP Stacks on Fast Long-Distance Production Networks”, Journal of Grid Computing 2004
  • PFLDnet: http://www.ens-lyon.fr/LIP/RESO/pfldnet2005/
  • Dante PERT: http://www.geant2.net/server/show/nav.00d00h002

tcpdump / tcptrace

  • tcpdump: dump all TCP header information for a specified source/destination
      • ftp://ftp.ee.lbl.gov/
  • tcptrace: format tcpdump output for analysis using xplot
      • http://www.tcptrace.org/
  • NLANR TCP Testrig: nice wrapper for tcpdump and tcptrace tools
      • http://www.ncne.nlanr.net/TCP/testrig/

Sample use:
    tcpdump -s 100 -w /tmp/tcpdump.out host hostname
    tcptrace -Sl /tmp/tcpdump.out
    xplot /tmp/a2b_tsg.xpl

tcptrace and xplot

  • X axis is time
  • Y axis is sequence number
  • The slope of this curve gives the throughput over time
  • The xplot tool makes it easy to zoom in

Zoomed In View

  • Green Line: ACK values received from the receiver
  • Yellow Line: tracks the receive window advertised by the receiver
  • Green Ticks: track the duplicate ACKs received
  • Yellow Ticks: track the window advertisements that were the same as the last advertisement
  • White Arrows: represent segments sent
  • Red Arrows (R): represent retransmitted segments

TCP Slow Start

Page 2: Summer School, Brasov, Romania, July 2005, R. Hughes-Jones Manchester 1 TCP/IP and Other Transports for High Bandwidth Applications Back to Basics Richard

Summer School Brasov Romania July 2005 R Hughes-Jones Manchester 2

The aim is to give you a picture of how researchers are using high performance networks to support their work

Back to Basics Simple Introduction to Networking

TCPIP on High Bandwidth Long Distance Networks But TCPIP works

The effect of packet loss Advanced TCP Stacks Fairness

Real Applications on Real Networks Disk-2-disk applications on real networks

Memory-2-memory testsTransatlantic disk-2-disk at Gigabit speeds

Remote Computing FarmsThe effect of distance

Radio Astronomy e-VLBI

Thanks for allowing me to use their slides to Sylvain Ravot CERN Les Cottrell SLAC Brian Tierney LBL Robin Tasker DL

Structure of the Talks

Summer School Brasov Romania July 2005 R Hughes-Jones Manchester 3

Simple Introduction to Networking

Summer School Brasov Romania July 2005 R Hughes-Jones Manchester 4

What is a Protocol Stack ISO OSI (Open Systems Interconnection) Seven Layer Model defines a

framework allowing development of real network protocols A layerhellip

performs unique and specific tasks only has knowledge of those layers immediately above and below uses services of layer below and provides services to layer above the services defined by a layer are implementation independent ndash

itrsquos a definition of how things work conceptually communicates with its peer in the remote system

Summer School Brasov Romania July 2005 R Hughes-Jones Manchester 5

The Layering Principle Encapsulation

Each protocol layer N adds a Header to the data unit from layer N+1 Header contains control information

Layer 7 Applicationuser processes

Layer 6 Presentationdata interpretation code transformation

Layer 5 SessionConnection negotiation control

Layer 4 TransportEnd-2-end data transfer amp integrityPacket sequencing flow control

Layer 3 NetworkAddressing RoutingPacket sequencing flow control

Layer 2 Data LinkPacket assemblydisassembly Transmission control Error checking

Layer 1 PhysicalElectrical Optical Mechanical

DH App data FCSNH TH PHSH

App data NH TH PHSH

App data TH PHSH

App data PHSH

App data PH

App data

Bits on the ldquowirerdquo

Frame

Packet

Segment

Summer School Brasov Romania July 2005 R Hughes-Jones Manchester 6

What do the Layers do Transport Layer acts as a go-between for the user and network

Provides end-to-end data movement amp control Gives the level of reliabilityintegrity need by the application Can ensure a reliable service (which network layer cannot)

eg assigns sequence numbers to identify ldquolostrdquo packets Network Layer deals with logical addressing amp the transmission of

packets mechanism for routing Data Link Layer provides the synchronization and error checking for

the data transmitted over a single physical link (may ensure correct delivery of frames) 1048708Going down fits packets from the network layer above into

frames 1048708Going up Groups bits from the physical layer into frames

Physical Layer concerned with the transmission of individual bits

Summer School Brasov Romania July 2005 R Hughes-Jones Manchester 7

How do the ldquoIPrdquo Protocols fit together

Application

( Presentation

Session)

Transport

Network

Data Link

Physical

FileTransferProtocol

(FTP) RFC 559

Simple MailTransfer Protocol(SMTP) RFC 821

TELNETRFC 854

TFTP RFC 783

NFSRFC 1024 1057

and 1094

SNMPRFC 1157

Transmission Control Protocol (TCP)

RFC 793

User Datagram Protocol (UDP)

RFC 768

Address ResolutionProtocols

ARP RFC 826RARP RFC 903

Internet ProtocolIP

RFC 791

Internet ControlMessage Protocol

(ICMP) RFC 792

Ethernet Token Ring ISDN FDDI SMDS ATM SDHSONET xDSL

Transmission Mode

TP Copper Fibre Optic Satellite Microwave DWDM CWDM etc

Network Interface Cards

RoutingOSPF BGP

ssh

HTTP POP3IMAP

DNS

DNS

ping

traceroute

Summer School Brasov Romania July 2005 R Hughes-Jones Manchester 8

Some of the ldquoIPrdquo Protocols Transmission Control Protocol TCP provides application programs

access to the network using a reliable connection-oriented transport layer service

User Datagram Protocol UDP provides unreliable connection-less delivery service using the IP protocol to transport messages between machines It adds the ability to distinguish among multiple destinations on a single host computer

Internet Protocol IP receives datagrams from the upper-layer software and transmits it to the destination host based upon a best effort connection-less delivery service

Internet Control Message Protocol ICMP allows internet routers to transmit error messages and test messages

Internet Group Message Protocol IGMP is used with multicast to send UDP datagrams to multiple hosts

Address Resolution Protocol ARP translates between the 32 bit IP address and a 48 bit LAN address

Reverse Address Resolution Protocol RARP translates between the 48 bit LAN address and the 32 bit IP address

Summer School Brasov Romania July 2005 R Hughes-Jones Manchester 9

The Physical Layer 1 Ethernet

Summer School Brasov Romania July 2005 R Hughes-Jones Manchester 10

The Link Layer 2 Ethernet Frame

Preamble which is comprised of 56 bits of alternating 0s and 1s The preamble provides all the nodes on the network a signal against which to synchronize

Media Access Control (MAC) AddressEvery Ethernet network card has built into its hardware a unique six-octet (48-bit) hexadecimal number that differentiates it from all other Ethernet cards in the universe The DA and SA define the path across the link

Start Frame delimiter which marks the start of a frame The start frame delimiter is 8 bits long with the pattern10101011

Data the reason the frame exists MTU Maximum Transport Unit

Frame Check Sequence to protect the frame contents

LengthType field two octets longIf the value =lt 1500 (0x05dc hex) indicates the length of dataIf the value gt 1500 indicates network-layer protocol ldquoEthernet Typesrdquo

Frame header IP Datagram FCS

12 bytes

Inter Frame Gap

Summer School Brasov Romania July 2005 R Hughes-Jones Manchester 11

The Link Layer Ethernet VLANs

VLANS are logical networks built over the same physical cable plant Distinguishes Ethernet frames betweentheir logical networks using VLANheader

VLAN is defined by the use of value 0x8100 in the Type field location

The next two octets are composed of the following three fields

User Priority field This field is 3 bits in length and is used to define the priority of the Ethernet frame This is utilized to define and deliver a class of service

Canonical format indicator This is 1 bit in length Just donrsquot ask

VLAN Identifier field This field is 12 bits in length and contains the VLAN identifier (VID) of this frame

The original LengthType field will then follow the inserted VLAN tag

Summer School Brasov Romania July 2005 R Hughes-Jones Manchester 12

The Network Layer 3 IP IP Layer properties

Provides best effort delivery It is unreliable

Packet may be lostDuplicatedOut of order

Connection less Provides logical addresses Provides routing Demultiplex data on protocol number

Summer School Brasov Romania July 2005 R Hughes-Jones Manchester 13

The Internet datagram

31HlenVers Type of serv Total length

0 8 16

Identification Flags

244

Fragment offset

19

TTL Protocol Header Checksum

Source IP address

Destination IP address

IP Options (if any) Padding

20 Bytes

Frame header Transport FCSIP header

Summer School Brasov Romania July 2005 R Hughes-Jones Manchester 14

IP Datagram Format (cont) Type of Service ndash TOS

now being used for QoS Total length length of datagram

in bytes includes header and data Time to live ndash TTL specifies how long datagram is

allowed to remain in internet Routers decrement by 1 When TTL = 0 router discards datagram Prevents infinite loops

Protocol specifies the format of the data area Protocol numbers administered by central authority to guarantee

agreement eg ICMP=1 TCP=6 UDP=17 hellip Source amp destination IP address (32 bits each) contain

IP address of sender and intended recipient Options (variable length) Mainly used to record a route

or timestamps or specify routing

HlenVers TOS Total length

Identification Flags Fragment offset

TTL Protocol Header Checksum

Source IP address

Destination IP address

IP Options (if any) Padding

HlenVers TOS Total length

Identification Flags Fragment offset

TTL Protocol Header Checksum

Source IP address

Destination IP address

IP Options (if any) Padding

Summer School Brasov Romania July 2005 R Hughes-Jones Manchester 15

Internet Class-based addresses An Address looks like 19216822123

Class A large number of hosts few networks 0nnnnnnn hhhhhhhh hhhhhhhh hhhhhhhh

7 network bits (0 and 127 reserved so 126 networks) 24 host bits (gt 16M hostsnet)

Initial byte 1-127 (decimal) Class B medium number of hosts and networks

10nnnnnn nnnnnnnn hhhhhhhh hhhhhhhh16384 class B networks 65534 hostsnetworkInitial byte 128-191 (decimal)

Class C large number of small networks 110nnnnn nnnnnnnn nnnnnnnn hhhhhhhh

2097152 networks 254 hostsnetworkInitial byte 192-223 (decimal)

Class D Multicast (See RFC 1112) 1110nnnn nnnnnnnn nnnnnnnn hhhhhhhh

Initial byte 224-239 (decimal) Class E Reserved

Initial byte 248-255 (decimal)

Summer School Brasov Romania July 2005 R Hughes-Jones Manchester 16

The Transport Layer 4 UDP UDP Provides

Connection less service over IPNo setup teardownOne packet at a time

Minimal overhead ndash high performance Provides best effort delivery It is unreliable

Packet may be lostDuplicatedOut of order

Application is responsible for Data reliabilityFlow controlError handling

Summer School Brasov Romania July 2005 R Hughes-Jones Manchester 17

UDP Datagram format

Sourcedestination port port numbers identify sending amp receiving processes

Port number amp IP address allow any application on Internet to be uniquely identified

Ports can be static or dynamic

Static (lt 1024) assigned centrally known as well known ports

Dynamic

Message length in bytes includes the UDP header and data (min 8 max 65535)

8 16 3124

Source port Destination port

UDP message len Checksum (opt)

0

Frame header Application data FCSIP header UDP header

8 Bytes

Summer School Brasov Romania July 2005 R Hughes-Jones Manchester 18

The Transport Layer 4: TCP

TCP (RFC 793, RFC 1122) provides:
• Connection-oriented service over IP
    • During setup the two ends agree on details
    • Explicit teardown
    • Multiple connections allowed
• Reliable end-to-end byte stream delivery over an unreliable network

It takes care of:
• Lost packets
• Duplicated packets
• Out-of-order packets

TCP provides:
• Data buffering
• Flow control
• Error detection & handling
• Limits on network congestion

Summer School Brasov Romania July 2005 R Hughes-Jones Manchester 19

The TCP Segment Format

TCP header layout (20 bytes without options):

    Bit   0     4       10        16        24        31
          Source port             | Destination port
          Sequence number
          Acknowledgement number
          Hlen  | Resv  | Code    | Window
          Checksum                | Urgent ptr
          Options (if any)                  | Padding

On the wire: Frame header | IP header | TCP header | Application data | FCS

Summer School Brasov Romania July 2005 R Hughes-Jones Manchester 20

TCP Segment Format - cont.

• Source/Dest port: TCP port numbers to ID applications at both ends of the connection
• Sequence number: first byte in the segment from the sender's byte stream
• Acknowledgement: identifies the number of the byte the sender of this segment expects to receive next
• Code: used to determine segment purpose, e.g. SYN, ACK, FIN, URG
• Window: advertises how much data this station is willing to accept; can depend on buffer space remaining
• Options: used for window scaling, SACK, timestamps, maximum segment size, etc.
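The fixed 20-byte part of the header can be decoded field by field. The Python sketch below uses the standard `struct` module; the function name and the returned dictionary keys are illustrative, and the six Code bits are named as in RFC 793.

```python
import struct

# Ports (2 x 16 bits), sequence and acknowledgement numbers (2 x 32 bits),
# then Hlen/Resv/Code packed into 16 bits, window, checksum, urgent pointer.
TCP_FIXED_HEADER = struct.Struct("!HHIIHHHH")

# The six Code bits of the classic header, most significant first.
FLAG_BITS = {"URG": 0x20, "ACK": 0x10, "PSH": 0x08,
             "RST": 0x04, "SYN": 0x02, "FIN": 0x01}


def parse_tcp_header(segment: bytes) -> dict:
    """Decode the fixed part of a TCP header (options are not interpreted)."""
    (src_port, dst_port, seq, ack,
     hlen_code, window, checksum, urgent) = TCP_FIXED_HEADER.unpack_from(segment)
    return {
        "src_port": src_port,
        "dst_port": dst_port,
        "seq": seq,
        "ack": ack,
        "header_len": (hlen_code >> 12) * 4,  # Hlen counts 32-bit words
        "flags": [name for name, bit in FLAG_BITS.items() if hlen_code & bit],
        "window": window,
        "checksum": checksum,
        "urgent_ptr": urgent,
    }


if __name__ == "__main__":
    # A hand-built SYN+ACK header with no options (Hlen = 5 words).
    raw = TCP_FIXED_HEADER.pack(5001, 80, 1024, 2048, (5 << 12) | 0x12, 65535, 0, 0)
    print(parse_tcp_header(raw))
```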

Summer School Brasov Romania July 2005 R Hughes-Jones Manchester 21

TCP - providing reliability

• Positive acknowledgement (ACK) of each received segment
    • Sender keeps a record of each segment sent
    • Sender awaits an ACK: "I am ready to receive byte 2048 and beyond"
    • Sender starts a timer when it sends a segment, so it can re-transmit

[Diagram: sender/receiver time line. Segment n (Sequence 1024, Length 1024) is answered one RTT later by Ack 2048; segment n+1 (Sequence 2048, Length 1024) is answered one RTT later by Ack 3072.]

• Inefficient: the sender has to wait
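To see how inefficient this one-segment-per-RTT scheme is, the Python sketch below computes the ACK number for a segment and the resulting throughput. The link speed and RTT in the example are hypothetical values chosen for illustration, and receiver processing time is ignored.

```python
def expected_ack(sequence: int, length: int) -> int:
    """ACK number returned for a segment: the next byte the receiver expects."""
    return sequence + length


def stop_and_wait_throughput_bps(segment_bytes: int, rtt_s: float,
                                 link_bps: float) -> float:
    """Throughput when only one segment may be unacknowledged at a time."""
    bits = segment_bytes * 8
    time_on_wire = bits / link_bps
    # One segment is delivered per (serialisation time + round trip).
    return bits / (time_on_wire + rtt_s)


if __name__ == "__main__":
    print(expected_ack(1024, 1024))  # matches "Ack 2048" in the diagram
    # Hypothetical path: 1 Gbit/s link, 100 ms round-trip time.
    rate = stop_and_wait_throughput_bps(1024, 0.100, 1e9)
    print(f"{rate / 1e3:.1f} kbit/s")
```

Under these assumptions the sender achieves roughly 82 kbit/s on a 1 Gbit/s link, which is why a sliding window is needed.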

Summer School Brasov Romania July 2005 R Hughes-Jones Manchester 22

Flow Control: Sender - Congestion Window

• Uses the congestion window, cwnd, a sliding window to control the data flow
    • Byte count giving the highest byte that can be sent without an ACK
    • Transmit buffer size and advertised receive buffer size are important
    • An ACK gives the next sequence number to receive AND the available space in the receive buffer
    • Timer kept for each packet

[Diagram: the sender's byte stream, divided into data sent and ACKed; sent data, buffered waiting for an ACK; unsent data that may be transmitted immediately; and data to be sent, waiting for the window to open (the application writes here). The TCP cwnd slides along the stream: a received ACK advances the trailing edge, the receiver's advertised window advances the leading edge, and the sending host advances a marker as data is transmitted.]
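The sender's bookkeeping can be sketched in a few lines of Python. The class and attribute names (`SenderWindow`, `snd_una`, `snd_nxt`) are illustrative, and taking the smaller of cwnd and the advertised window as the effective window is the standard TCP rule, assumed here and not spelled out on the slide.

```python
from dataclasses import dataclass


@dataclass
class SenderWindow:
    """Byte-counting model of the sender's sliding window."""
    cwnd: int         # congestion window, bytes
    rwnd: int         # receiver's advertised window, bytes
    snd_una: int = 0  # trailing edge: oldest byte not yet ACKed
    snd_nxt: int = 0  # marker: next byte to transmit

    def usable(self) -> int:
        """Bytes that may be transmitted immediately."""
        window = min(self.cwnd, self.rwnd)
        return max(0, self.snd_una + window - self.snd_nxt)

    def send(self, nbytes: int) -> int:
        """Transmit up to nbytes and return how many bytes actually went out."""
        allowed = min(nbytes, self.usable())
        self.snd_nxt += allowed  # sending host advances the marker
        return allowed

    def on_ack(self, ack_no: int, advertised_window: int) -> None:
        """A new ACK advances the trailing edge and refreshes the receive window."""
        if ack_no > self.snd_una:
            self.snd_una = ack_no
        self.rwnd = advertised_window


if __name__ == "__main__":
    w = SenderWindow(cwnd=4096, rwnd=8192)
    print(w.send(6000), w.usable())  # only 4096 bytes fit in the window
    w.on_ack(2048, 8192)
    print(w.usable())                # the ACK opens room for 2048 more bytes
```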

Summer School Brasov Romania July 2005 R Hughes-Jones Manchester 23

Flow Control: Receiver - Lost Data

[Diagram: the receiver's byte stream, divided into data given to the application (the application reads here); data ACKed but not given to the user; lost data, starting at the last ACK given (the next byte expected, i.e. the expected sequence number); and data received but not ACKed. The window slides along the stream; the receiver's advertised window advances the leading edge.]

• If new data is received with a sequence number ≠ the next byte expected, a duplicate ACK is sent with the expected sequence number
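The receiver rule can be sketched as follows in Python. The `Receiver` class is a hypothetical, simplified model: it buffers whole out-of-order segments, assumes they never overlap, and ignores window limits.

```python
class Receiver:
    """Cumulative-ACK receiver that buffers out-of-order segments."""

    def __init__(self, initial_seq: int = 0) -> None:
        self.next_expected = initial_seq
        self.out_of_order = {}  # sequence number -> segment length

    def on_segment(self, seq: int, length: int) -> int:
        """Process one segment and return the ACK number sent in reply."""
        if seq == self.next_expected:
            self.next_expected += length
            # Buffered segments that are now contiguous are consumed too.
            while self.next_expected in self.out_of_order:
                self.next_expected += self.out_of_order.pop(self.next_expected)
        elif seq > self.next_expected:
            # A gap: keep the data, and the reply repeats the old ACK number.
            self.out_of_order[seq] = length
        return self.next_expected


if __name__ == "__main__":
    r = Receiver(initial_seq=1024)
    # The segment starting at 2048 is delayed, so two duplicate ACKs go out.
    for seq in (1024, 3072, 4096, 2048):
        print(seq, "->", r.on_segment(seq, 1024))
```

Once the missing segment arrives, the ACK number jumps past all the buffered data in one step.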

Summer School Brasov Romania July 2005 R Hughes-Jones Manchester 24

How it works: TCP Slowstart

• Probe the network - get a rough estimate of the optimal congestion window size
• The larger the window size, the higher the throughput
    • Throughput = Window size / Round-trip Time
• Exponentially increase the congestion window size until a packet is lost
    • cwnd initially 1 MTU, then increased by 1 MTU for each ACK received
    • Send 1st packet, get 1 ACK, increase cwnd to 2
    • Send 2 packets, get 2 ACKs, increase cwnd to 4
    • Time to reach cwnd size W = RTT * log2(W)
    • Rate doubles each RTT

[Figure: CWND versus time. Slow start: exponential increase; congestion avoidance: linear increase; packet loss; retransmit; timeout, then slow start again.]
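The RTT * log2(W) figure can be checked with a short Python simulation. It is a sketch that counts cwnd in segments and assumes one ACK per segment and no loss; the function name is illustrative.

```python
import math


def slow_start_rounds(target_segments: int) -> int:
    """Round trips needed for cwnd to grow from 1 segment to the target."""
    cwnd = 1
    rounds = 0
    while cwnd < target_segments:
        acks = cwnd    # every segment sent in this round trip is ACKed
        cwnd += acks   # +1 segment per ACK, so cwnd doubles each RTT
        rounds += 1
    return rounds


if __name__ == "__main__":
    w = 1024
    print(slow_start_rounds(w), math.log2(w))
```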

Summer School Brasov Romania July 2005 R Hughes-Jones Manchester 25

How it works: TCP Congestion Avoidance

• Additive increase: starting from the rough estimate, linearly increase the congestion window size to probe for additional available bandwidth
    • cwnd increased by about 1 MTU per RTT (1/cwnd of a segment for each ACK) - linear increase in rate
• TCP takes packet loss as an indication of congestion
• Multiplicative decrease: cut the congestion window size aggressively if a packet is lost
    • Standard TCP reduces cwnd by a factor of 0.5
• The slow start to congestion avoidance transition is determined by ssthresh

[Figure: CWND versus time. Slow start: exponential increase; congestion avoidance: linear increase; packet loss; retransmit; timeout, then slow start again.]
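The additive-increase, multiplicative-decrease behaviour can be sketched per round trip in Python. This is a simplified model and not a full TCP implementation: cwnd is counted in segments, every loss is assumed to be detected without a timeout, and the `loss_at` set of lossy round trips is a hypothetical input.

```python
def aimd_trace(rtts: int, ssthresh: float, loss_at: set) -> list:
    """cwnd (in segments) at the start of each round trip."""
    cwnd = 1.0
    trace = []
    for rtt in range(rtts):
        trace.append(cwnd)
        if rtt in loss_at:
            ssthresh = max(cwnd / 2, 2.0)   # multiplicative decrease by 0.5
            cwnd = ssthresh
        elif cwnd < ssthresh:
            cwnd = min(cwnd * 2, ssthresh)  # slow start: double per RTT
        else:
            cwnd += 1.0                     # congestion avoidance: +1 per RTT
    return trace


if __name__ == "__main__":
    print(aimd_trace(rtts=10, ssthresh=8, loss_at={6}))
```

The trace shows the familiar sawtooth: exponential growth up to ssthresh, linear growth afterwards, and a halving when the loss occurs.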

Summer School Brasov Romania July 2005 R Hughes-Jones Manchester 26

TCP Fast Retransmit & Recovery

• Duplicate ACKs are due to lost segments or segments out of order
• Fast Retransmit: if the sender receives 3 duplicate ACKs (i.e. the receiver got 3 additional segments without getting the one expected)
    • Send the missing segment
    • Set ssthresh to 0.5*cwnd - so enter congestion avoidance phase
    • Set cwnd = (0.5*cwnd + 3) - the 3 dup ACKs
    • Increase cwnd by 1 segment for each further duplicate ACK
    • Keep sending new data if allowed by cwnd
    • Set cwnd to half the original value on a new ACK
    • No need to go into "slow start" again
• At steady state, CWND oscillates around the optimal window size
• With a retransmission timeout, slow start is triggered again

[Figure: CWND versus time. Slow start: exponential increase; congestion avoidance: linear increase; packet loss; retransmit; timeout, then slow start again.]
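The steps listed on the slide map onto a small state machine. The Python sketch below counts cwnd in segments; the class name is illustrative, and the ssthresh update on timeout is an assumption based on standard Reno behaviour, since the slide only says that slow start is triggered again.

```python
class FastRecoverySender:
    """Window bookkeeping for fast retransmit and fast recovery."""

    DUP_ACK_THRESHOLD = 3

    def __init__(self, cwnd: float) -> None:
        self.cwnd = cwnd
        self.ssthresh = float("inf")
        self.dup_acks = 0
        self.in_recovery = False
        self.retransmissions = 0

    def on_duplicate_ack(self) -> None:
        self.dup_acks += 1
        if self.in_recovery:
            self.cwnd += 1                   # each further dup ACK inflates cwnd
        elif self.dup_acks == self.DUP_ACK_THRESHOLD:
            self.retransmissions += 1        # send the missing segment
            self.ssthresh = self.cwnd / 2
            self.cwnd = self.ssthresh + 3    # +3 for the three dup ACKs
            self.in_recovery = True

    def on_new_ack(self) -> None:
        if self.in_recovery:
            self.cwnd = self.ssthresh        # half the original value
            self.in_recovery = False         # continue in congestion avoidance
        self.dup_acks = 0

    def on_timeout(self) -> None:
        self.ssthresh = self.cwnd / 2        # assumption: standard Reno rule
        self.cwnd = 1.0                      # slow start is triggered again
        self.dup_acks = 0
        self.in_recovery = False


if __name__ == "__main__":
    s = FastRecoverySender(cwnd=20)
    for _ in range(3):
        s.on_duplicate_ack()
    print(s.cwnd, s.ssthresh)  # inflated window during recovery
    s.on_new_ack()
    print(s.cwnd)              # deflated to half the original window
```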

Summer School Brasov Romania July 2005 R Hughes-Jones Manchester 27

TCP Simple Tuning - Filling the Pipe

• Remember, TCP has to hold a copy of data in flight
• Optimal (TCP buffer) window size depends on:
    • Bandwidth end to end, i.e. min(BW of links), AKA bottleneck bandwidth
    • Round Trip Time (RTT)
• The number of bytes in flight to fill the entire path:
    • Bandwidth*Delay Product, BDP = RTT*BW
    • Can increase bandwidth by orders of magnitude
• Windows also used for flow control

[Diagram: sender/receiver time line showing one RTT between a segment and its ACK. Segment time on wire = bits in segment / BW.]
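The three formulas on this slide are easy to evaluate in Python. The link speeds, RTT and segment size in the example are hypothetical, and the function names are illustrative.

```python
def bdp_bytes(link_rates_bps: list, rtt_s: float) -> float:
    """Bandwidth*Delay Product: bytes in flight needed to fill the path."""
    bottleneck_bps = min(link_rates_bps)  # min(BW of links)
    return bottleneck_bps * rtt_s / 8


def window_limited_throughput_bps(window_bytes: int, rtt_s: float) -> float:
    """Throughput = Window size / Round-trip Time."""
    return window_bytes * 8 / rtt_s


def segment_time_on_wire_s(segment_bytes: int, link_bps: float) -> float:
    """Segment time on wire = bits in segment / BW."""
    return segment_bytes * 8 / link_bps


if __name__ == "__main__":
    links = [1e9, 622e6, 1e9]  # hypothetical path with a 622 Mbit/s bottleneck
    rtt = 0.120                # hypothetical 120 ms round-trip time
    print(f"BDP: {bdp_bytes(links, rtt) / 1e6:.2f} Mbytes")
    rate = window_limited_throughput_bps(64 * 1024, rtt)
    print(f"64 kbyte window: {rate / 1e6:.2f} Mbit/s")
    print(f"1500 byte segment: {segment_time_on_wire_s(1500, 1e9) * 1e6:.0f} us")
```

With these example numbers a 64 kbyte window limits the flow to a few Mbit/s, while a window equal to the BDP (about 9.3 Mbytes) would be needed to fill the path.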

Summer School Brasov Romania July 2005 R Hughes-Jones Manchester 28

Congestion control: ACK clocking

Summer School Brasov Romania July 2005 R Hughes-Jones Manchester 29

More Information

• Lectures, tutorials etc. on TCP/IP:
    • www.nvc.cc.va.us/home/joney/tcp_ip.htm
    • www.cs.pdx.edu/~jrb/tcpip.lectures.html
    • www.raleigh.ibm.com/cgi-bin/bookmgr/BOOKS/EZ306200/CCONTENTS
    • www.cisco.com/univercd/cc/td/doc/product/iaabu/centri4/user/scf4ap1.htm
    • www.cis.ohio-state.edu/htbin/rfc/rfc1180.html
    • www.jbmelectronics.com/tcp.htm
• Encyclopaedia: http://www.freesoft.org/CIE/index.htm
• TCP/IP Resources: www.private.org.il/tcpip_rl.html
• Understanding IP addresses: http://www.3com.com/solutions/en_US/ncs/501302.html
• Configuring TCP (RFC 1122): ftp://nic.merit.edu/internet/documents/rfc/rfc1122.txt
• Assigned protocols, ports etc. (RFC 1010): http://www.es.net/pub/rfcs/rfc1010.txt & /etc/protocols

Summer School Brasov Romania July 2005 R Hughes-Jones Manchester 30

Any Questions?

Summer School Brasov Romania July 2005 R Hughes-Jones Manchester 31

Backup Slides

Summer School Brasov Romania July 2005 R Hughes-Jones Manchester 32

More Information: Some URLs

• UKLight web site: http://www.uklight.ac.uk
• MB-NG project web site: http://www.mb-ng.net/
• DataTAG project web site: http://www.datatag.org/
• UDPmon / TCPmon kit + writeup: http://www.hep.man.ac.uk/~rich/net
• Motherboard and NIC Tests:
    • http://www.hep.man.ac.uk/~rich/net/nic/GigEth_tests_Boston.ppt & http://datatag.web.cern.ch/datatag/pfldnet2003/
    • "Performance of 1 and 10 Gigabit Ethernet Cards with Server Quality Motherboards", FGCS Special issue 2004, http://www.hep.man.ac.uk/~rich/
• TCP tuning information may be found at: http://www.ncne.nlanr.net/documentation/faq/performance.html & http://www.psc.edu/networking/perf_tune.html
• TCP stack comparisons: "Evaluation of Advanced TCP Stacks on Fast Long-Distance Production Networks", Journal of Grid Computing 2004
• PFLDnet: http://www.ens-lyon.fr/LIP/RESO/pfldnet2005/
• Dante PERT: http://www.geant2.net/server/show/nav.00d00h002

Summer School Brasov Romania July 2005 R Hughes-Jones Manchester 33

tcpdump / tcptrace

• tcpdump: dump all TCP header information for a specified source/destination
    • ftp://ftp.ee.lbl.gov/
• tcptrace: format tcpdump output for analysis using xplot
    • http://www.tcptrace.org/
    • NLANR TCP Testrig: nice wrapper for tcpdump and tcptrace tools
        • http://www.ncne.nlanr.net/TCP/testrig/
• Sample use:
    tcpdump -s 100 -w /tmp/tcpdump.out host hostname
    tcptrace -Sl /tmp/tcpdump.out
    xplot /tmp/a2b_tsg.xpl

Summer School Brasov Romania July 2005 R Hughes-Jones Manchester 34

tcptrace and xplot

• X axis is time
• Y axis is sequence number
• The slope of this curve gives the throughput over time
• The xplot tool makes it easy to zoom in
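The "slope gives throughput" reading of a time-sequence plot can be reproduced numerically. The Python sketch below works on hypothetical (time, sequence number) samples and is not tcptrace output; it also ignores 32-bit sequence number wrap-around.

```python
def throughput_bps(samples: list) -> list:
    """Slope of the sequence-number curve between consecutive samples, in bit/s.

    samples is a list of (time in seconds, sequence number in bytes) pairs,
    sorted by time.
    """
    rates = []
    for (t0, s0), (t1, s1) in zip(samples, samples[1:]):
        rates.append((s1 - s0) * 8 / (t1 - t0))
    return rates


if __name__ == "__main__":
    # Hypothetical trace: the flow speeds up in the second half-second.
    trace = [(0.0, 0), (0.5, 500_000), (1.0, 1_500_000)]
    print(throughput_bps(trace))
```

A steeper segment of the curve corresponds to a higher rate, and a flat segment to a stalled transfer.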

Summer School Brasov Romania July 2005 R Hughes-Jones Manchester 35

Zoomed In View

• Green Line: ACK values received from the receiver
• Yellow Line: tracks the receive window advertised by the receiver
• Green Ticks: track the duplicate ACKs received
• Yellow Ticks: track the window advertisements that were the same as the last advertisement
• White Arrows: represent segments sent
• Red Arrows (R): represent retransmitted segments

Summer School Brasov Romania July 2005 R Hughes-Jones Manchester 36

TCP Slow Start

Page 4: Summer School, Brasov, Romania, July 2005, R. Hughes-Jones Manchester 1 TCP/IP and Other Transports for High Bandwidth Applications Back to Basics Richard

Summer School Brasov Romania July 2005 R Hughes-Jones Manchester 4

What is a Protocol Stack ISO OSI (Open Systems Interconnection) Seven Layer Model defines a

framework allowing development of real network protocols A layerhellip

performs unique and specific tasks only has knowledge of those layers immediately above and below uses services of layer below and provides services to layer above the services defined by a layer are implementation independent ndash

itrsquos a definition of how things work conceptually communicates with its peer in the remote system

Summer School Brasov Romania July 2005 R Hughes-Jones Manchester 5

The Layering Principle Encapsulation

Each protocol layer N adds a Header to the data unit from layer N+1 Header contains control information

Layer 7 Applicationuser processes

Layer 6 Presentationdata interpretation code transformation

Layer 5 SessionConnection negotiation control

Layer 4 TransportEnd-2-end data transfer amp integrityPacket sequencing flow control

Layer 3 NetworkAddressing RoutingPacket sequencing flow control

Layer 2 Data LinkPacket assemblydisassembly Transmission control Error checking

Layer 1 PhysicalElectrical Optical Mechanical

DH App data FCSNH TH PHSH

App data NH TH PHSH

App data TH PHSH

App data PHSH

App data PH

App data

Bits on the ldquowirerdquo

Frame

Packet

Segment

What do the Layers do?

  • Transport Layer: acts as a go-between for the user and the network
      • Provides end-to-end data movement & control
      • Gives the level of reliability/integrity needed by the application
      • Can ensure a reliable service (which the network layer cannot), e.g. assigns sequence numbers to identify "lost" packets
  • Network Layer: deals with logical addressing & the transmission of packets; provides the mechanism for routing
  • Data Link Layer: provides the synchronization and error checking for the data transmitted over a single physical link (may ensure correct delivery of frames)
      • Going down: fits packets from the network layer above into frames
      • Going up: groups bits from the physical layer into frames
  • Physical Layer: concerned with the transmission of individual bits

How do the "IP" Protocols fit together?

[Figure: the protocols drawn against the OSI layers.]

  • Application (Presentation, Session): File Transfer Protocol (FTP) RFC 959; Simple Mail Transfer Protocol (SMTP) RFC 821; TELNET RFC 854; TFTP RFC 783; NFS RFC 1014, 1057 and 1094; SNMP RFC 1157
  • Transport: Transmission Control Protocol (TCP) RFC 793; User Datagram Protocol (UDP) RFC 768
  • Network: Internet Protocol (IP) RFC 791; Internet Control Message Protocol (ICMP) RFC 792; Address Resolution Protocols: ARP RFC 826, RARP RFC 903
  • Data Link: Ethernet, Token Ring, ISDN, FDDI, SMDS, ATM, SDH/SONET, xDSL; Network Interface Cards
  • Physical: Transmission Mode: TP Copper, Fibre Optic, Satellite, Microwave, DWDM, CWDM etc.
  • Also shown on the diagram: ssh, HTTP, POP3/IMAP, DNS (twice), ping, traceroute, Routing (OSPF, BGP)

Some of the "IP" Protocols

  • Transmission Control Protocol (TCP): provides application programs access to the network using a reliable, connection-oriented transport layer service
  • User Datagram Protocol (UDP): provides an unreliable, connection-less delivery service, using the IP protocol to transport messages between machines. It adds the ability to distinguish among multiple destinations on a single host computer
  • Internet Protocol (IP): receives datagrams from the upper-layer software and transmits them to the destination host, based upon a best effort, connection-less delivery service
  • Internet Control Message Protocol (ICMP): allows internet routers to transmit error messages and test messages
  • Internet Group Management Protocol (IGMP): is used with multicast to send UDP datagrams to multiple hosts
  • Address Resolution Protocol (ARP): translates between the 32 bit IP address and a 48 bit LAN address
  • Reverse Address Resolution Protocol (RARP): translates between the 48 bit LAN address and the 32 bit IP address

The Physical Layer 1: Ethernet

The Link Layer 2: Ethernet Frame

  • Preamble: 56 bits of alternating 0s and 1s. The preamble gives all the nodes on the network a signal against which to synchronize
  • Start Frame Delimiter: marks the start of a frame. It is 8 bits long with the pattern 10101011
  • Media Access Control (MAC) Address: every Ethernet network card has built into its hardware a unique six-octet (48-bit) number, usually written in hexadecimal, that differentiates it from all other Ethernet cards in the universe. The DA (destination address) and SA (source address) define the path across the link
  • Length/Type field: two octets long. If the value ≤ 1500 (0x05dc hex) it indicates the length of the data. If the value > 1500 it indicates the network-layer protocol ("Ethernet Types"; assigned type values start at 0x0600)
  • Data: the reason the frame exists. Its maximum size is the MTU (Maximum Transmission Unit)
  • Frame Check Sequence: protects the frame contents
  • Inter Frame Gap: 12 bytes

[Figure: frame layout: Frame header | IP Datagram | FCS.]

The Link Layer: Ethernet VLANs

  • VLANs are logical networks built over the same physical cable plant
  • A VLAN header distinguishes Ethernet frames between their logical networks
  • A VLAN is defined by the use of value 0x8100 in the Type field location
  • The next two octets are composed of the following three fields:
      • User Priority field: 3 bits in length, used to define the priority of the Ethernet frame. This is utilized to define and deliver a class of service
      • Canonical format indicator: 1 bit in length. Just don't ask
      • VLAN Identifier field: 12 bits in length, contains the VLAN identifier (VID) of this frame
  • The original Length/Type field will then follow the inserted VLAN tag
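The frame header fields and the VLAN tag described above can be decoded with a few lines of Python. This is a minimal sketch: the function name `parse_ethernet_header`, the dictionary keys and the sample frame bytes are assumptions made for illustration, and the preamble, start frame delimiter and FCS are assumed to have been stripped already (as they normally are before software sees a frame).

```python
import struct


def parse_ethernet_header(frame: bytes) -> dict:
    """Decode DA, SA, an optional VLAN tag, and the Length/Type field."""
    da, sa, type_or_len = struct.unpack("!6s6sH", frame[:14])
    info = {"da": da.hex(":"), "sa": sa.hex(":")}
    offset = 14
    if type_or_len == 0x8100:                # VLAN tag present
        tci, type_or_len = struct.unpack("!HH", frame[14:18])
        info["priority"] = tci >> 13         # 3-bit User Priority
        info["cfi"] = (tci >> 12) & 0x1      # 1-bit Canonical format indicator
        info["vid"] = tci & 0x0FFF           # 12-bit VLAN identifier
        offset = 18
    if type_or_len <= 1500:
        info["length"] = type_or_len         # value is the data length
    else:
        info["ethertype"] = hex(type_or_len) # value names the network-layer protocol
    info["payload_offset"] = offset
    return info


# Made-up frame header: broadcast DA, VLAN 100 at priority 5, carrying IP (0x0800).
sample = bytes.fromhex("ffffffffffff" "001122334455" "8100" "a064" "0800")
print(parse_ethernet_header(sample))
```

Note that the tagged frame's payload starts four bytes later than an untagged one, which is why the sketch reports `payload_offset`.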

The Network Layer 3: IP

IP Layer properties:
  • Provides best effort delivery
  • It is unreliable: a packet may be lost, duplicated, or delivered out of order
  • Connection less
  • Provides logical addresses
  • Provides routing
  • Demultiplexes data on protocol number

The Internet datagram

[Figure: Frame header | IP header | Transport | FCS. The IP header is 20 bytes without options, laid out in 32-bit rows (bits 0 to 31):]

  Row 1: Vers (bits 0-3), Hlen (4-7), Type of serv (8-15), Total length (16-31)
  Row 2: Identification (0-15), Flags (16-18), Fragment offset (19-31)
  Row 3: TTL (0-7), Protocol (8-15), Header Checksum (16-31)
  Row 4: Source IP address
  Row 5: Destination IP address
  Then: IP Options (if any), Padding
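To make the header layout concrete, here is a small Python sketch that unpacks the fixed 20-byte part of an IP header with the standard `struct` module. The function name `parse_ipv4_header`, the dictionary keys and the sample bytes are assumptions for illustration; options are ignored and the checksum is not verified.

```python
import ipaddress
import struct


def parse_ipv4_header(data: bytes) -> dict:
    """Decode the fixed 20-byte IPv4 header (options are not parsed)."""
    (vers_hlen, tos, total_len, ident, flags_frag,
     ttl, proto, cksum, src, dst) = struct.unpack("!BBHHHBBH4s4s", data[:20])
    return {
        "version": vers_hlen >> 4,
        "header_bytes": (vers_hlen & 0x0F) * 4,   # Hlen counts 32-bit words
        "tos": tos,
        "total_length": total_len,
        "identification": ident,
        "flags": flags_frag >> 13,                # top 3 bits
        "fragment_offset": flags_frag & 0x1FFF,   # low 13 bits
        "ttl": ttl,
        "protocol": proto,                        # e.g. ICMP=1, TCP=6, UDP=17
        "checksum": cksum,
        "src": str(ipaddress.IPv4Address(src)),
        "dst": str(ipaddress.IPv4Address(dst)),
    }


# Made-up header: 84-byte TCP datagram, TTL 64, checksum field left as zero.
sample = bytes.fromhex("4500" "0054" "0000" "4000" "4006" "0000"
                       "c0a80001" "c0a80002")
print(parse_ipv4_header(sample))
```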

IP Datagram Format (cont.)

  • Type of Service (TOS): now being used for QoS
  • Total length: length of the datagram in bytes, includes header and data
  • Time to live (TTL): specifies how long the datagram is allowed to remain in the internet
      • Routers decrement by 1
      • When TTL = 0 the router discards the datagram
      • Prevents infinite loops
  • Protocol: specifies the format of the data area
      • Protocol numbers are administered by a central authority to guarantee agreement, e.g. ICMP=1, TCP=6, UDP=17 …
  • Source & destination IP address (32 bits each): contain the IP address of the sender and the intended recipient
  • Options (variable length): mainly used to record a route, or timestamps, or specify routing

Internet Class-based addresses

An Address looks like 19216822123

  • Class A: large number of hosts, few networks
      0nnnnnnn hhhhhhhh hhhhhhhh hhhhhhhh
      • 7 network bits (0 and 127 reserved, so 126 networks), 24 host bits (> 16M hosts/net)
      • Initial byte 1-127 (decimal)
  • Class B: medium number of hosts and networks
      10nnnnnn nnnnnnnn hhhhhhhh hhhhhhhh
      • 16384 class B networks, 65534 hosts/network
      • Initial byte 128-191 (decimal)
  • Class C: large number of small networks
      110nnnnn nnnnnnnn nnnnnnnn hhhhhhhh
      • 2097152 networks, 254 hosts/network
      • Initial byte 192-223 (decimal)
  • Class D: Multicast (see RFC 1112)
      1110nnnn nnnnnnnn nnnnnnnn hhhhhhhh
      • Initial byte 224-239 (decimal)
  • Class E: Reserved
      • Initial byte 240-255 (decimal)
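The class of an address follows from its initial byte alone, which the short Python sketch below illustrates. The helper name `address_class` and the sample addresses are assumptions for this example; class E is taken here as everything above the class D range (initial byte 240 and up), which is also an assumption of the sketch.

```python
def address_class(address: str) -> str:
    """Return the historical class (A to E) of a dotted-decimal IPv4 address."""
    first = int(address.split(".")[0])
    if not 0 <= first <= 255:
        raise ValueError(f"bad initial byte in {address!r}")
    if first < 128:    # leading bit 0
        return "A"
    if first < 192:    # leading bits 10
        return "B"
    if first < 224:    # leading bits 110
        return "C"
    if first < 240:    # leading bits 1110
        return "D"
    return "E"         # leading bits 1111


for addr in ("10.1.2.3", "172.16.0.1", "192.168.0.1", "224.0.0.1"):
    print(addr, address_class(addr))
```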

The Transport Layer 4: UDP

UDP provides:
  • Connection less service over IP
      • No setup / teardown
      • One packet at a time
  • Minimal overhead: high performance
  • Best effort delivery
  • It is unreliable: a packet may be lost, duplicated, or delivered out of order
  • The application is responsible for: data reliability, flow control, error handling

UDP Datagram format

[Figure: Frame header | IP header | UDP header | Application data | FCS. The UDP header is 8 bytes:]

  Row 1: Source port (bits 0-15), Destination port (16-31)
  Row 2: UDP message len (0-15), Checksum (opt) (16-31)

  • Source/destination port: port numbers identify the sending & receiving processes
      • Port number & IP address allow any application on the Internet to be uniquely identified
      • Ports can be static or dynamic
          • Static (< 1024): assigned centrally, known as well known ports
          • Dynamic
  • Message length: in bytes, includes the UDP header and data (min 8, max 65535)
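The 8-byte header described above is simple enough to build and decode by hand. The following Python sketch does both; the helper names `build_udp_header` and `parse_udp_header`, the port numbers and the payload are assumptions for illustration, and the optional checksum is simply left at zero rather than computed.

```python
import struct


def build_udp_header(src_port: int, dst_port: int, payload: bytes,
                     checksum: int = 0) -> bytes:
    """Pack the four 16-bit UDP header fields in network byte order."""
    length = 8 + len(payload)            # message length covers header and data
    if length > 65535:
        raise ValueError("UDP message too long")
    return struct.pack("!HHHH", src_port, dst_port, length, checksum)


def parse_udp_header(datagram: bytes) -> dict:
    """Decode the first 8 bytes of a UDP datagram."""
    src, dst, length, cksum = struct.unpack("!HHHH", datagram[:8])
    return {"src_port": src, "dst_port": dst, "length": length, "checksum": cksum}


payload = b"hello"
datagram = build_udp_header(5001, 53, payload) + payload
print(parse_udp_header(datagram))
```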

The Transport Layer 4: TCP

TCP (RFC 793, RFC 1122) provides:
  • Connection orientated service over IP
      • During setup the two ends agree on details
      • Explicit teardown
      • Multiple connections allowed
  • Reliable end-to-end Byte Stream delivery over an unreliable network
  • It takes care of: lost packets, duplicated packets, out of order packets
  • TCP provides: data buffering, flow control, error detection & handling, and limits network congestion

The TCP Segment Format

[Figure: Frame header | IP header | TCP header | Application data | FCS. The TCP header is 20 bytes without options, laid out in 32-bit rows (bits 0 to 31):]

  Row 1: Source port (bits 0-15), Destination port (16-31)
  Row 2: Sequence number
  Row 3: Acknowledgement number
  Row 4: Hlen (0-3), Resv (4-9), Code (10-15), Window (16-31)
  Row 5: Checksum (0-15), Urgent ptr (16-31)
  Then: Options (if any), Padding

TCP Segment Format (cont.)

  • Source/Dest port: TCP port numbers to ID applications at both ends of the connection
  • Sequence number: first byte in the segment from the sender's byte stream
  • Acknowledgement: identifies the number of the byte the sender of this segment expects to receive next
  • Code: used to determine segment purpose, e.g. SYN, ACK, FIN, URG
  • Window: advertises how much data this station is willing to accept. Can depend on buffer space remaining
  • Options: used for window scaling, SACK, timestamps, maximum segment size etc.
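As an illustration of the fields listed above, this Python sketch decodes the fixed 20-byte TCP header, including the Code bits. The function name `parse_tcp_header`, the dictionary keys and the sample values are assumptions for this example; options are not parsed. The flag bit order (FIN in the lowest bit, URG in the sixth) follows RFC 793.

```python
import struct

# Code bits from least significant upward, as defined in RFC 793.
FLAG_NAMES = ("FIN", "SYN", "RST", "PSH", "ACK", "URG")


def parse_tcp_header(segment: bytes) -> dict:
    """Decode the fixed 20-byte TCP header (options are not parsed)."""
    (src, dst, seq, ack, hlen_code,
     window, cksum, urgent) = struct.unpack("!HHIIHHHH", segment[:20])
    return {
        "src_port": src,
        "dst_port": dst,
        "seq": seq,
        "ack": ack,
        "header_bytes": (hlen_code >> 12) * 4,   # Hlen counts 32-bit words
        "flags": [n for i, n in enumerate(FLAG_NAMES) if hlen_code & (1 << i)],
        "window": window,
        "checksum": cksum,
        "urgent_ptr": urgent,
    }


# Made-up SYN+ACK segment header: Hlen = 5 words, Code = 0x12.
sample = struct.pack("!HHIIHHHH", 5001, 80, 1024, 2048, 0x5012, 65535, 0, 0)
print(parse_tcp_header(sample))
```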

TCP: providing reliability

  • Positive acknowledgement (ACK) of each received segment
      • Sender keeps a record of each segment sent
      • Sender awaits an ACK: "I am ready to receive byte 2048 and beyond"
      • Sender starts a timer when it sends a segment, so it can re-transmit
  • Inefficient: the sender has to wait

[Figure: time sequence between Sender and Receiver. Segment n (Sequence 1024, Length 1024) is answered by Ack 2048 one RTT later; segment n+1 (Sequence 2048, Length 1024) is answered by Ack 3072 after a further RTT.]

Flow Control: Sender - Congestion Window

  • Uses the congestion window, cwnd, a sliding window, to control the data flow
      • Byte count giving the highest byte that can be sent without an ACK
      • Transmit buffer size and advertised receive buffer size are important
      • An ACK gives the next sequence no. to receive AND the available space in the receive buffer
      • Timer kept for each packet

[Figure: the sender's byte stream, divided into: data sent and ACKed; sent data, buffered waiting for an ACK; unsent data that may be transmitted immediately; and data to be sent, waiting for the window to open (the application writes here). The TCP cwnd slides: a received ACK advances the trailing edge, the receiver's advertised window advances the leading edge, and the sending host advances a marker as data is transmitted.]

Flow Control: Receiver - Lost Data

  • If new data is received with a sequence number ≠ the next byte expected, a duplicate ACK is sent with the expected sequence number

[Figure: the receiver's byte stream, divided into: data given to the application (the application reads here); ACKed but not given to user; lost data; and received but not ACKed. The last ACK given marks the next byte expected (the expected sequence no.). The window slides, and the receiver's advertised window advances the leading edge.]
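The duplicate-ACK rule above can be sketched in a few lines of Python. This is a simplified model, not a TCP implementation: the function name `cumulative_acks` and the arrival pattern are assumptions, segments are assumed never to overlap, and every arriving segment is assumed to trigger an ACK immediately.

```python
def cumulative_acks(segments, first_expected):
    """Return the ACK number sent after each arriving (seq, length) segment."""
    expected = first_expected
    buffered = {}                      # out-of-order segments: seq -> length
    acks = []
    for seq, length in segments:
        if seq == expected:
            expected += length
            # Data buffered earlier may now be contiguous with the new data.
            while expected in buffered:
                expected += buffered.pop(expected)
        elif seq > expected:
            buffered[seq] = length     # hole before it: hold it, re-ACK the hole
        acks.append(expected)          # a repeated value here is a duplicate ACK
    return acks


# Segment 2048 is delayed: two later segments arrive first.
arrivals = [(1024, 1024), (3072, 1024), (4096, 1024), (2048, 1024)]
print(cumulative_acks(arrivals, first_expected=1024))
```

In this run the ACK for byte 2048 is repeated while the hole is open, then jumps past all the buffered data once the missing segment arrives.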

How it works: TCP Slowstart

  • Probe the network: get a rough estimate of the optimal congestion window size
  • The larger the window size, the higher the throughput
      Throughput = Window size / Round-trip Time
  • Exponentially increase the congestion window size until a packet is lost
      • cwnd initially 1 MTU, then increased by 1 MTU for each ACK received
      • Send 1st packet, get 1 ACK, increase cwnd to 2
      • Send 2 packets, get 2 ACKs, increase cwnd to 4
      • Time to reach cwnd size W = RTT × log2(W)
      • Rate doubles each RTT

[Figure: CWND against time. Slow start: exponential increase. Congestion avoidance: linear increase. On packet loss: retransmit. On timeout: slow start again.]
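The doubling behaviour described above can be checked with a tiny Python sketch that counts the round trips slow start needs to reach a given window. The function name `slow_start_rtts`, the target window and the RTT value are assumptions for illustration; losses and the ssthresh limit are ignored.

```python
import math


def slow_start_rtts(target_segments: int) -> int:
    """Count round trips for cwnd to grow from 1 segment to at least the target."""
    cwnd, rtts = 1, 0
    while cwnd < target_segments:
        cwnd *= 2          # one extra segment per ACK doubles cwnd every RTT
        rtts += 1
    return rtts


target = 1024              # segments, assumed for the example
rtt_s = 0.1                # assumed round-trip time in seconds
rounds = slow_start_rtts(target)
print(rounds, "round trips, log2(W) =", math.log2(target))
print("time to reach the window is about", rounds * rtt_s, "s")
```

For a window that is a power of two the count equals log2(W) exactly; otherwise it is the next whole number of round trips above it.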

How it works: TCP Congestion Avoidance

  • Additive increase: starting from the rough estimate, linearly increase the congestion window size to probe for additional available bandwidth
      • cwnd increased by about 1 MTU per RTT (1/cwnd of a segment for each ACK): linear increase in rate
  • TCP takes packet loss as an indication of congestion
  • Multiplicative decrease: cut the congestion window size aggressively if a packet is lost
      • Standard TCP reduces cwnd by a factor of 0.5
  • The slow start to congestion avoidance transition is determined by ssthresh

[Figure: CWND against time. Slow start: exponential increase. Congestion avoidance: linear increase. On packet loss: retransmit. On timeout: slow start again.]
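The sawtooth that results from additive increase and multiplicative decrease can be reproduced with a deliberately simplified Python model. Everything here is an assumption made for illustration: the function name `aimd_trace`, cwnd counted in whole segments per RTT, losses given as a set of RTT indices, a loss handled by halving without a timeout, and a floor of 2 segments on ssthresh.

```python
def aimd_trace(rtts: int, ssthresh: float, loss_at: set) -> list:
    """Return cwnd (in segments) at the start of each RTT."""
    cwnd = 1.0
    trace = []
    for rtt in range(rtts):
        trace.append(cwnd)
        if rtt in loss_at:
            ssthresh = max(cwnd / 2, 2.0)    # multiplicative decrease
            cwnd = ssthresh
        elif cwnd < ssthresh:
            cwnd = min(cwnd * 2, ssthresh)   # slow start: double per RTT
        else:
            cwnd += 1.0                      # congestion avoidance: +1 segment per RTT
    return trace


print(aimd_trace(rtts=10, ssthresh=8, loss_at={6}))
```

The trace grows exponentially up to ssthresh, then linearly until the loss, after which it restarts the linear climb from half the window.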

TCP Fast Retransmit & Recovery

  • Duplicate ACKs are due to lost segments or segments out of order
  • Fast Retransmit: if the receiver transmits 3 duplicate ACKs (i.e. it received 3 additional segments without getting the one expected), the sender:
      • Sends the missing segment
      • Sets ssthresh to 0.5 × cwnd, so it enters the congestion avoidance phase
      • Sets cwnd = (0.5 × cwnd + 3), the 3 for the 3 dup ACKs
      • Increases cwnd by 1 segment when it gets further duplicate ACKs
      • Keeps sending new data if allowed by cwnd
      • Sets cwnd to half the original value on a new ACK
      • Has no need to go into "slow start" again
  • At steady state, CWND oscillates around the optimal window size
  • With a retransmission timeout, slow start is triggered again

[Figure: CWND against time. Slow start: exponential increase. Congestion avoidance: linear increase. On packet loss: retransmit. On timeout: slow start again.]

TCP Simple Tuning: Filling the Pipe

  • Remember, TCP has to hold a copy of the data in flight
  • The optimal (TCP buffer) window size depends on:
      • Bandwidth end to end, i.e. min(BW of links), AKA bottleneck bandwidth
      • Round Trip Time (RTT)
  • The number of bytes in flight to fill the entire path is the Bandwidth × Delay Product: BDP = RTT × BW
  • Tuning the window to the BDP can increase the achieved bandwidth by orders of magnitude
  • Windows are also used for flow control

[Figure: time sequence between Sender and Receiver over one RTT, with segments going out and an ACK coming back. Segment time on wire = bits in segment / BW.]
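A quick calculation shows why the window matters so much on long, fast paths. The Python sketch below computes the bandwidth-delay product and the throughput ceiling imposed by a given window; the helper names and the example figures (a 1 Gbit/s bottleneck, a 120 ms RTT, a 64 KiB window) are assumptions chosen for illustration, not measurements from the talk.

```python
def bdp_bytes(bandwidth_bps: float, rtt_s: float) -> float:
    """Bytes that must be in flight to fill the path: BDP = RTT x BW."""
    return bandwidth_bps * rtt_s / 8


def window_limited_throughput_bps(window_bytes: float, rtt_s: float) -> float:
    """Best-case throughput when at most one window is sent per round trip."""
    return window_bytes * 8 / rtt_s


bw = 1e9          # assumed bottleneck bandwidth: 1 Gbit/s
rtt = 0.120       # assumed round-trip time: 120 ms
print(f"BDP: {bdp_bytes(bw, rtt) / 1e6:.1f} MB")
print(f"64 KiB window: {window_limited_throughput_bps(65536, rtt) / 1e6:.2f} Mbit/s")
print(f"BDP-sized window: {window_limited_throughput_bps(bdp_bytes(bw, rtt), rtt) / 1e6:.0f} Mbit/s")
```

Under these assumed figures the small window caps the flow at a few Mbit/s, while a window equal to the BDP lets it fill the 1 Gbit/s path.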

Congestion control: ACK clocking

More Information

  • Lectures, tutorials etc. on TCP/IP:
      • wwwnvccvaushomejoneytcp_iphtm
      • wwwcspdxedu~jrbtcpiplectureshtml
      • wwwraleighibmcomcgi-binbookmgrBOOKSEZ306200CCONTENTS
      • wwwciscocomunivercdcctddocproductiaabucentri4userscf4ap1htm
      • wwwcisohio-stateeduhtbinrfcrfc1180html
      • wwwjbmelectronicscomtcphtm
  • Encyclopaedia: httpwwwfreesoftorgCIEindexhtm
  • TCP/IP Resources: wwwprivateorgiltcpip_rlhtml
  • Understanding IP addresses: httpwww3comcomsolutionsen_USncs501302html
  • Configuring TCP (RFC 1122): ftpnicmeriteduinternetdocumentsrfcrfc1122txt
  • Assigned protocols, ports etc. (RFC 1010): httpwwwesnetpubrfcsrfc1010txt & /etc/protocols

Any Questions?

Backup Slides

More Information: Some URLs

  • UKLight web site: httpwwwuklightacuk
  • MB-NG project web site: httpwwwmb-ngnet
  • DataTAG project web site: httpwwwdatatagorg
  • UDPmon / TCPmon kit + writeup: httpwwwhepmanacuk~richnet
  • Motherboard and NIC Tests:
      • httpwwwhepmanacuk~richnetnicGigEth_tests_Bostonppt & httpdatatagwebcernchdatatagpfldnet2003
      • "Performance of 1 and 10 Gigabit Ethernet Cards with Server Quality Motherboards", FGCS Special issue 2004, http wwwhepmanacuk~rich
  • TCP tuning information may be found at: httpwwwncnenlanrnetdocumentationfaqperformancehtml & httpwwwpscedunetworkingperf_tunehtml
  • TCP stack comparisons: "Evaluation of Advanced TCP Stacks on Fast Long-Distance Production Networks", Journal of Grid Computing 2004
  • PFLDnet: httpwwwens-lyonfrLIPRESOpfldnet2005
  • Dante PERT: httpwwwgeant2netservershownav00d00h002

tcpdump / tcptrace

  • tcpdump: dump all TCP header information for a specified source/destination
      • ftpftpeelblgov
  • tcptrace: format tcpdump output for analysis using xplot
      • httpwwwtcptraceorg
  • NLANR TCP Testrig: nice wrapper for the tcpdump and tcptrace tools
      • httpwwwncnenlanrnetTCPtestrig

Sample use:
    tcpdump -s 100 -w /tmp/tcpdump.out host hostname
    tcptrace -Sl /tmp/tcpdump.out
    xplot /tmp/a2b_tsg.xpl

tcptrace and xplot

  • X axis is time
  • Y axis is sequence number
  • The slope of this curve gives the throughput over time
  • The xplot tool makes it easy to zoom in

Zoomed In View

  • Green Line: tracks the ACK values received from the receiver
  • Yellow Line: tracks the receive window advertised from the receiver
  • Green Ticks: track the duplicate ACKs received
  • Yellow Ticks: track the window advertisements that were the same as the last advertisement
  • White Arrows: represent segments sent
  • Red Arrows (R): represent retransmitted segments

TCP Slow Start

Page 5: Summer School, Brasov, Romania, July 2005, R. Hughes-Jones Manchester 1 TCP/IP and Other Transports for High Bandwidth Applications Back to Basics Richard

Summer School Brasov Romania July 2005 R Hughes-Jones Manchester 5

The Layering Principle Encapsulation

Each protocol layer N adds a Header to the data unit from layer N+1 Header contains control information

Layer 7 Applicationuser processes

Layer 6 Presentationdata interpretation code transformation

Layer 5 SessionConnection negotiation control

Layer 4 TransportEnd-2-end data transfer amp integrityPacket sequencing flow control

Layer 3 NetworkAddressing RoutingPacket sequencing flow control

Layer 2 Data LinkPacket assemblydisassembly Transmission control Error checking

Layer 1 PhysicalElectrical Optical Mechanical

DH App data FCSNH TH PHSH

App data NH TH PHSH

App data TH PHSH

App data PHSH

App data PH

App data

Bits on the ldquowirerdquo

Frame

Packet

Segment

Summer School Brasov Romania July 2005 R Hughes-Jones Manchester 6

What do the Layers do Transport Layer acts as a go-between for the user and network

Provides end-to-end data movement amp control Gives the level of reliabilityintegrity need by the application Can ensure a reliable service (which network layer cannot)

eg assigns sequence numbers to identify ldquolostrdquo packets Network Layer deals with logical addressing amp the transmission of

packets mechanism for routing Data Link Layer provides the synchronization and error checking for

the data transmitted over a single physical link (may ensure correct delivery of frames) 1048708Going down fits packets from the network layer above into

frames 1048708Going up Groups bits from the physical layer into frames

Physical Layer concerned with the transmission of individual bits

Summer School Brasov Romania July 2005 R Hughes-Jones Manchester 7

How do the ldquoIPrdquo Protocols fit together

Application

( Presentation

Session)

Transport

Network

Data Link

Physical

FileTransferProtocol

(FTP) RFC 559

Simple MailTransfer Protocol(SMTP) RFC 821

TELNETRFC 854

TFTP RFC 783

NFSRFC 1024 1057

and 1094

SNMPRFC 1157

Transmission Control Protocol (TCP)

RFC 793

User Datagram Protocol (UDP)

RFC 768

Address ResolutionProtocols

ARP RFC 826RARP RFC 903

Internet ProtocolIP

RFC 791

Internet ControlMessage Protocol

(ICMP) RFC 792

Ethernet Token Ring ISDN FDDI SMDS ATM SDHSONET xDSL

Transmission Mode

TP Copper Fibre Optic Satellite Microwave DWDM CWDM etc

Network Interface Cards

RoutingOSPF BGP

ssh

HTTP POP3IMAP

DNS

DNS

ping

traceroute

Summer School Brasov Romania July 2005 R Hughes-Jones Manchester 8

Some of the ldquoIPrdquo Protocols Transmission Control Protocol TCP provides application programs

access to the network using a reliable connection-oriented transport layer service

User Datagram Protocol UDP provides unreliable connection-less delivery service using the IP protocol to transport messages between machines It adds the ability to distinguish among multiple destinations on a single host computer

Internet Protocol IP receives datagrams from the upper-layer software and transmits it to the destination host based upon a best effort connection-less delivery service

Internet Control Message Protocol ICMP allows internet routers to transmit error messages and test messages

Internet Group Message Protocol IGMP is used with multicast to send UDP datagrams to multiple hosts

Address Resolution Protocol ARP translates between the 32 bit IP address and a 48 bit LAN address

Reverse Address Resolution Protocol RARP translates between the 48 bit LAN address and the 32 bit IP address

Summer School Brasov Romania July 2005 R Hughes-Jones Manchester 9

The Physical Layer 1 Ethernet

Summer School Brasov Romania July 2005 R Hughes-Jones Manchester 10

The Link Layer 2 Ethernet Frame

Preamble which is comprised of 56 bits of alternating 0s and 1s The preamble provides all the nodes on the network a signal against which to synchronize

Media Access Control (MAC) AddressEvery Ethernet network card has built into its hardware a unique six-octet (48-bit) hexadecimal number that differentiates it from all other Ethernet cards in the universe The DA and SA define the path across the link

Start Frame delimiter which marks the start of a frame The start frame delimiter is 8 bits long with the pattern10101011

Data the reason the frame exists MTU Maximum Transport Unit

Frame Check Sequence to protect the frame contents

LengthType field two octets longIf the value =lt 1500 (0x05dc hex) indicates the length of dataIf the value gt 1500 indicates network-layer protocol ldquoEthernet Typesrdquo

Frame header IP Datagram FCS

12 bytes

Inter Frame Gap

Summer School Brasov Romania July 2005 R Hughes-Jones Manchester 11

The Link Layer Ethernet VLANs

VLANS are logical networks built over the same physical cable plant Distinguishes Ethernet frames betweentheir logical networks using VLANheader

VLAN is defined by the use of value 0x8100 in the Type field location

The next two octets are composed of the following three fields

User Priority field This field is 3 bits in length and is used to define the priority of the Ethernet frame This is utilized to define and deliver a class of service

Canonical format indicator This is 1 bit in length Just donrsquot ask

VLAN Identifier field This field is 12 bits in length and contains the VLAN identifier (VID) of this frame

The original LengthType field will then follow the inserted VLAN tag

Summer School Brasov Romania July 2005 R Hughes-Jones Manchester 12

The Network Layer 3 IP IP Layer properties

Provides best effort delivery It is unreliable

Packet may be lostDuplicatedOut of order

Connection less Provides logical addresses Provides routing Demultiplex data on protocol number

Summer School Brasov Romania July 2005 R Hughes-Jones Manchester 13

The Internet datagram

31HlenVers Type of serv Total length

0 8 16

Identification Flags

244

Fragment offset

19

TTL Protocol Header Checksum

Source IP address

Destination IP address

IP Options (if any) Padding

20 Bytes

Frame header Transport FCSIP header

Summer School Brasov Romania July 2005 R Hughes-Jones Manchester 14

IP Datagram Format (cont) Type of Service ndash TOS

now being used for QoS Total length length of datagram

in bytes includes header and data Time to live ndash TTL specifies how long datagram is

allowed to remain in internet Routers decrement by 1 When TTL = 0 router discards datagram Prevents infinite loops

Protocol specifies the format of the data area Protocol numbers administered by central authority to guarantee

agreement eg ICMP=1 TCP=6 UDP=17 hellip Source amp destination IP address (32 bits each) contain

IP address of sender and intended recipient Options (variable length) Mainly used to record a route

or timestamps or specify routing

HlenVers TOS Total length

Identification Flags Fragment offset

TTL Protocol Header Checksum

Source IP address

Destination IP address

IP Options (if any) Padding

HlenVers TOS Total length

Identification Flags Fragment offset

TTL Protocol Header Checksum

Source IP address

Destination IP address

IP Options (if any) Padding

Summer School Brasov Romania July 2005 R Hughes-Jones Manchester 15

Internet Class-based addresses An Address looks like 19216822123

Class A large number of hosts few networks 0nnnnnnn hhhhhhhh hhhhhhhh hhhhhhhh

7 network bits (0 and 127 reserved so 126 networks) 24 host bits (gt 16M hostsnet)

Initial byte 1-127 (decimal) Class B medium number of hosts and networks

10nnnnnn nnnnnnnn hhhhhhhh hhhhhhhh16384 class B networks 65534 hostsnetworkInitial byte 128-191 (decimal)

Class C large number of small networks 110nnnnn nnnnnnnn nnnnnnnn hhhhhhhh

2097152 networks 254 hostsnetworkInitial byte 192-223 (decimal)

Class D Multicast (See RFC 1112) 1110nnnn nnnnnnnn nnnnnnnn hhhhhhhh

Initial byte 224-239 (decimal) Class E Reserved

Initial byte 248-255 (decimal)

Summer School Brasov Romania July 2005 R Hughes-Jones Manchester 16

The Transport Layer 4 UDP UDP Provides

Connection less service over IPNo setup teardownOne packet at a time

Minimal overhead ndash high performance Provides best effort delivery It is unreliable

Packet may be lostDuplicatedOut of order

Application is responsible for Data reliabilityFlow controlError handling

Summer School Brasov Romania July 2005 R Hughes-Jones Manchester 17

UDP Datagram format

Sourcedestination port port numbers identify sending amp receiving processes

Port number amp IP address allow any application on Internet to be uniquely identified

Ports can be static or dynamic

Static (lt 1024) assigned centrally known as well known ports

Dynamic

Message length in bytes includes the UDP header and data (min 8 max 65535)

8 16 3124

Source port Destination port

UDP message len Checksum (opt)

0

Frame header Application data FCSIP header UDP header

8 Bytes

Summer School Brasov Romania July 2005 R Hughes-Jones Manchester 18

The Transport Layer 4 TCP TCP RFC 768 RFC 1122 Provides

Connection orientated service over IPDuring setup the two ends agree on detailsExplicit teardownMultiple connections allowed

Reliable end-to-end Byte Stream delivery over unreliable network It takes care of

Lost packets Duplicated packetsOut of order packets

TCP provides Data buffering Flow controlError detection amp handlingLimits network congestion

Summer School Brasov Romania July 2005 R Hughes-Jones Manchester 19

Code

Source port Destination port

Sequence number

0 8 16 3124

Acknowledgement number

4

Hlen

10

Resv Window

Urgent ptrChecksum

Options (if any) Padding

The TCP Segment FormatFrame header Application data FCSIP header TCP header

20 Bytes

Summer School Brasov Romania July 2005 R Hughes-Jones Manchester 20

TCP Segment Format ndash cont SourceDest port TCP port numbers to ID applications

at both ends of connection Sequence number First byte in segment from senderrsquos

byte stream Acknowledgement identifies the number of the byte the

sender of this segment expects to receive next Code used to determine segment purpose eg SYN

ACK FIN URG Window Advertises how much data this station is willing

to accept Can depend on buffer space remaining Options used for window scaling

SACK timestamps maximum segment size etc

Code

Source port Destination port

Sequence number

Acknowledgement number

Hlen Resv Window

Urgent ptrChecksum

Options (if any) Padding

Summer School Brasov Romania July 2005 R Hughes-Jones Manchester 21

TCP ndash providing reliability Positive acknowledgement (ACK) of each received segment

Sender keeps record of each segment sent Sender awaits an ACK ndash ldquoI am ready to receive byte 2048 and beyondrdquo Sender starts timer when it sends segment ndash so can re-transmit

Segment n

ACK of Segment nRTT

Time

Sender Receiver

Sequence 1024Length 1024

Ack 2048

Segment n+1

ACK of Segment n +1RTT

Sequence 2048Length 1024

Ack 3072

Inefficient ndash sender has to wait

Summer School Brasov Romania July 2005 R Hughes-Jones Manchester 22

Flow Control Sender ndash Congestion Window Uses Congestion window cwnd a sliding window to control the data flow

Byte count giving highest byte that can be sent with out an ACK Transmit buffer size and Advertised Receive buffer size important ACK gives next sequence no to receive AND

The available space in the receive buffer Timer kept for each packet

Unsent Datamay be transmitted immediately

Sent Databuffered waiting ACK

TCP Cwnd slides Data to be sentwaiting for windowto openApplication writes here

Data sent and ACKed

Sending hostadvances markeras data transmitted

Received ACKadvances trailing edge

Receiverrsquos advertisedwindow advances leading edge

Summer School Brasov Romania July 2005 R Hughes-Jones Manchester 23

Flow Control Receiver ndash Lost Data

Received butnot ACKed

ACKed but not given to user

Window slides

Lost data

Data given to application

Last ACK givenNext byte expectedExpected sequence no

Receiverrsquos advertisedwindow advances leading edge

Application reads here

If new data is received with a sequence number ne next byte expected Duplicate ACK is send with the expected sequence number

Summer School Brasov Romania July 2005 R Hughes-Jones Manchester 24

How it works TCP Slowstart Probe the network - get a rough estimate of the optimal congestion window size The larger the window size the higher the throughput

Throughput = Window size Round-trip Time exponentially increase the congestion window size until a packet is lost

cwnd initially 1 MTU then increased by 1 MTU for each ACK received Send 1st packet get 1 ACK increase cwnd to 2 Send 2 packets get 2 ACKs inc cwnd to 4Time to reach cwnd size W = RTTlog2 (W)

Rate doubles each RTT

CWND

slow start exponential

increase

congestion avoidance linear increase

packet loss

time

retransmit slow start

again

timeout

Summer School Brasov Romania July 2005 R Hughes-Jones Manchester 25

additive increase starting from the rough estimate linearly increase the congestion window size to probe for additional available bandwidth cwnd increased by 1 MTU for each ACK ndash linear increase in rate

TCP takes packet loss as indication of congestion multiplicative decrease cut the congestion window size

aggressively if a packet is lost Standard TCP reduces cwnd by 05 Slow start to Congestion avoidance transition determined by ssthresh

CWND

slow start exponential

increase

congestion avoidance linear increase

packet loss

time

retransmit slow start

again

timeout

How it works TCP Congestion Avoidance

Summer School Brasov Romania July 2005 R Hughes-Jones Manchester 26

TCP Fast Retransmit & Recovery

  • Duplicate ACKs are due to lost segments or segments arriving out of order
  • Fast Retransmit: if the receiver transmits 3 duplicate ACKs (i.e. it received 3 additional segments without getting the one expected), the sender:
      • Sends the missing segment
      • Sets ssthresh to 0.5 × cwnd, so it enters the congestion avoidance phase
      • Sets cwnd = (0.5 × cwnd + 3), the 3 accounting for the 3 dup ACKs
      • Increases cwnd by 1 segment for each further duplicate ACK
      • Keeps sending new data if allowed by cwnd
      • Sets cwnd to half the original value on a new ACK
      • Has no need to go into "slow start" again
  • At steady state, cwnd oscillates around the optimal window size
  • With a retransmission timeout, slow start is triggered again

[Figure: cwnd versus time, showing slow start (exponential increase), congestion avoidance (linear increase), packet loss, retransmit timeout, and slow start again]
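
The sender-side bookkeeping listed on this slide can be sketched as a small state machine. The class and attribute names are hypothetical, cwnd is counted in segments, and nothing is actually transmitted; the sketch only tracks how cwnd and ssthresh move.

```python
class FastRecoverySender:
    """Tracks cwnd/ssthresh through fast retransmit and recovery."""

    def __init__(self, cwnd: float) -> None:
        self.cwnd = cwnd
        self.ssthresh = float("inf")
        self.dup_acks = 0
        self.in_recovery = False
        self.retransmits = 0

    def on_duplicate_ack(self) -> None:
        self.dup_acks += 1
        if self.dup_acks == 3 and not self.in_recovery:
            self.retransmits += 1             # send the missing segment
            self.ssthresh = 0.5 * self.cwnd   # halve the threshold
            self.cwnd = self.ssthresh + 3     # +3 for the 3 dup ACKs
            self.in_recovery = True
        elif self.in_recovery:
            self.cwnd += 1                    # each further dup ACK

    def on_new_ack(self) -> None:
        if self.in_recovery:
            self.cwnd = self.ssthresh         # half the original value
            self.in_recovery = False
        self.dup_acks = 0


if __name__ == "__main__":
    sender = FastRecoverySender(cwnd=20.0)
    for _ in range(4):
        sender.on_duplicate_ack()
    print(sender.cwnd, sender.ssthresh)  # window inflated during recovery
    sender.on_new_ack()
    print(sender.cwnd)                   # deflated, no slow start needed
```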

Summer School Brasov Romania July 2005 R Hughes-Jones Manchester 27

TCP Simple Tuning - Filling the Pipe

  • Remember, TCP has to hold a copy of data in flight
  • Optimal (TCP buffer) window size depends on:
      • Bandwidth end to end, i.e. min(BW of links), AKA bottleneck bandwidth
      • Round Trip Time (RTT)
  • The number of bytes in flight to fill the entire path:
      • Bandwidth × Delay Product: BDP = RTT × BW
      • Can increase bandwidth by orders of magnitude
  • Windows are also used for flow control

[Figure: sender/receiver time diagram showing the RTT, the returning ACK, and segment time on wire = bits in segment / BW]
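
As a worked example of BDP = RTT × BW, the sketch below sizes the window for a path and, conversely, shows the ceiling a given window imposes. The function names and the example figures (1 Gbit/s bottleneck, 100 ms RTT, 64 KiB window) are assumptions chosen for illustration.

```python
def bdp_bytes(bottleneck_bps: float, rtt_s: float) -> float:
    """Bytes that must be in flight to fill the path: BDP = RTT * BW."""
    return bottleneck_bps * rtt_s / 8


def window_limited_bps(window_bytes: float, rtt_s: float) -> float:
    """Throughput ceiling for a fixed window: window size / RTT."""
    return window_bytes * 8 / rtt_s


if __name__ == "__main__":
    rtt = 0.1  # hypothetical 100 ms round trip
    print("window needed for 1 Gbit/s:", bdp_bytes(1e9, rtt), "bytes")
    print("ceiling with a 64 KiB window:", window_limited_bps(65536, rtt), "bit/s")
```

With these assumed numbers the path needs about 12.5 Mbytes in flight, while a 64 KiB window caps the transfer at roughly 5 Mbit/s, which is where the orders-of-magnitude gain from tuning comes from.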

Summer School Brasov Romania July 2005 R Hughes-Jones Manchester 28

Congestion control: ACK clocking

Summer School Brasov Romania July 2005 R Hughes-Jones Manchester 29

More Information

  • Lectures, tutorials etc. on TCP/IP:
      • www.nvcc.va.us/home/joney/tcp_ip.htm
      • www.cs.pdx.edu/~jrb/tcpip.lectures.html
      • www.raleigh.ibm.com/cgi-bin/bookmgr/BOOKS/EZ306200/CCONTENTS
      • www.cisco.com/univercd/cc/td/doc/product/iaabu/centri4/user/scf4ap1.htm
      • www.cis.ohio-state.edu/htbin/rfc/rfc1180.html
      • www.jbmelectronics.com/tcp.htm
  • Encyclopaedia: http://www.freesoft.org/CIE/index.htm
  • TCP/IP Resources: www.private.org.il/tcpip_rl.html
  • Understanding IP addresses: http://www.3com.com/solutions/en_US/ncs/501302.html
  • Configuring TCP (RFC 1122): ftp://nic.merit.edu/internet/documents/rfc/rfc1122.txt
  • Assigned protocols, ports etc. (RFC 1010): http://www.es.net/pub/rfcs/rfc1010.txt & /etc/protocols

Summer School Brasov Romania July 2005 R Hughes-Jones Manchester 30

Any Questions?

Summer School Brasov Romania July 2005 R Hughes-Jones Manchester 31

Backup Slides

Summer School Brasov Romania July 2005 R Hughes-Jones Manchester 32

More Information: Some URLs

  • UKLight web site: http://www.uklight.ac.uk
  • MB-NG project web site: http://www.mb-ng.net/
  • DataTAG project web site: http://www.datatag.org/
  • UDPmon / TCPmon kit + writeup: http://www.hep.man.ac.uk/~rich/net
  • Motherboard and NIC Tests:
      • http://www.hep.man.ac.uk/~rich/net/nic/GigEth_tests_Boston.ppt & http://datatag.web.cern.ch/datatag/pfldnet2003/
      • "Performance of 1 and 10 Gigabit Ethernet Cards with Server Quality Motherboards", FGCS Special issue 2004, http://www.hep.man.ac.uk/~rich/
  • TCP tuning information may be found at: http://www.ncne.nlanr.net/documentation/faq/performance.html & http://www.psc.edu/networking/perf_tune.html
  • TCP stack comparisons: "Evaluation of Advanced TCP Stacks on Fast Long-Distance Production Networks", Journal of Grid Computing 2004
  • PFLDnet: http://www.ens-lyon.fr/LIP/RESO/pfldnet2005/
  • Dante PERT: http://www.geant2.net/server/show/nav.00d00h002

Summer School Brasov Romania July 2005 R Hughes-Jones Manchester 33

tcpdump / tcptrace

  • tcpdump: dump all TCP header information for a specified source/destination
      • ftp://ftp.ee.lbl.gov/
  • tcptrace: format tcpdump output for analysis using xplot
      • http://www.tcptrace.org/
      • NLANR TCP Testrig: nice wrapper for tcpdump and tcptrace tools
          • http://www.ncne.nlanr.net/TCP/testrig/
  • Sample use:
      tcpdump -s 100 -w /tmp/tcpdump.out host hostname
      tcptrace -Sl /tmp/tcpdump.out
      xplot /tmp/a2b_tsg.xpl

Summer School Brasov Romania July 2005 R Hughes-Jones Manchester 34

tcptrace and xplot

  • X axis is time
  • Y axis is sequence number
  • The slope of this curve gives the throughput over time
  • The xplot tool makes it easy to zoom in
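
The slope-equals-throughput reading of the time/sequence plot can be reproduced numerically. This sketch does not read tcptrace output; it assumes the (time, sequence number) points have already been extracted into a list, and the function name and sample values are hypothetical.

```python
from typing import List, Tuple


def throughput_bps(samples: List[Tuple[float, int]]) -> float:
    """Average slope of sequence number (bytes) against time (s), in bit/s.

    samples must be sorted by time and span a non-zero interval.
    """
    (t0, seq0), (t1, seq1) = samples[0], samples[-1]
    return (seq1 - seq0) * 8 / (t1 - t0)


if __name__ == "__main__":
    # Hypothetical trace: 125,000 bytes acknowledged over one second.
    trace = [(0.0, 0), (0.5, 60_000), (1.0, 125_000)]
    print(throughput_bps(trace), "bit/s")
```

Applying the same function to a short slice of the trace gives the local slope, which is what zooming in with xplot shows visually.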

Summer School Brasov Romania July 2005 R Hughes-Jones Manchester 35

Zoomed In View

  • Green Line: tracks the ACK values received from the receiver
  • Yellow Line: tracks the receive window advertised by the receiver
  • Green Ticks: track the duplicate ACKs received
  • Yellow Ticks: track the window advertisements that were the same as the last advertisement
  • White Arrows: represent segments sent
  • Red Arrows (R): represent retransmitted segments

Summer School Brasov Romania July 2005 R Hughes-Jones Manchester 36

TCP Slow Start

  • TCP/IP and Other Transports for High Bandwidth Applications: Back to Basics
  • Slide 2
  • Slide 3
  • What is a Protocol Stack
  • The Layering Principle
  • What do the Layers do
  • How do the "IP" Protocols fit together
  • Some of the "IP" Protocols
  • The Physical Layer 1 Ethernet
  • The Link Layer 2 Ethernet Frame
  • Slide 11
  • The Network Layer 3 IP
  • The Internet datagram
  • IP Datagram Format (cont)
  • Internet Class-based addresses
  • The Transport Layer 4 UDP
  • UDP Datagram format
  • The Transport Layer 4 TCP
  • The TCP Segment Format
  • TCP Segment Format - cont
  • TCP - providing reliability
  • Flow Control Sender - Congestion Window
  • Flow Control Receiver - Lost Data
  • How it works TCP Slowstart
  • Slide 25
  • TCP Fast Retransmit & Recovery
  • TCP Simple Tuning - Filling the Pipe
  • Congestion control ACK clocking
  • More Information
  • Slide 30
  • Slide 31
  • More Information Some URLs
  • tcpdump tcptrace
  • tcptrace and xplot
  • Zoomed In View
  • TCP Slow Start

Summer School Brasov Romania July 2005 R Hughes-Jones Manchester 6

What do the Layers do?

  • Transport Layer: acts as a go-between for the user and network
      • Provides end-to-end data movement & control
      • Gives the level of reliability/integrity needed by the application
      • Can ensure a reliable service (which the network layer cannot), e.g. assigns sequence numbers to identify "lost" packets
  • Network Layer: deals with logical addressing & the transmission of packets; provides the mechanism for routing
  • Data Link Layer: provides the synchronization and error checking for the data transmitted over a single physical link (may ensure correct delivery of frames)
      • Going down: fits packets from the network layer above into frames
      • Going up: groups bits from the physical layer into frames
  • Physical Layer: concerned with the transmission of individual bits

Summer School Brasov Romania July 2005 R Hughes-Jones Manchester 7

How do the "IP" Protocols fit together?

  • Application (Presentation, Session):
      • File Transfer Protocol (FTP) RFC 959
      • Simple Mail Transfer Protocol (SMTP) RFC 821
      • TELNET RFC 854
      • TFTP RFC 783
      • NFS RFC 1014, 1057 and 1094
      • SNMP RFC 1157
  • Transport:
      • Transmission Control Protocol (TCP) RFC 793
      • User Datagram Protocol (UDP) RFC 768
  • Network:
      • Internet Protocol (IP) RFC 791
      • Internet Control Message Protocol (ICMP) RFC 792
      • Address Resolution Protocols: ARP RFC 826, RARP RFC 903
  • Data Link: Ethernet, Token Ring, ISDN, FDDI, SMDS, ATM, SDH/SONET, xDSL; Network Interface Cards
  • Physical: Transmission Mode: TP, Copper, Fibre Optic, Satellite, Microwave, DWDM, CWDM etc.
  • Also shown: Routing (OSPF, BGP), ssh, HTTP, POP3 / IMAP, DNS, ping, traceroute

Summer School Brasov Romania July 2005 R Hughes-Jones Manchester 8

Some of the "IP" Protocols

  • Transmission Control Protocol (TCP): provides application programs with access to the network using a reliable, connection-oriented transport layer service
  • User Datagram Protocol (UDP): provides an unreliable, connectionless delivery service using the IP protocol to transport messages between machines. It adds the ability to distinguish among multiple destinations on a single host computer
  • Internet Protocol (IP): receives datagrams from the upper-layer software and transmits them to the destination host, based upon a best effort, connectionless delivery service
  • Internet Control Message Protocol (ICMP): allows internet routers to transmit error messages and test messages
  • Internet Group Management Protocol (IGMP): is used with multicast to send UDP datagrams to multiple hosts
  • Address Resolution Protocol (ARP): translates between the 32-bit IP address and a 48-bit LAN address
  • Reverse Address Resolution Protocol (RARP): translates between the 48-bit LAN address and the 32-bit IP address

Summer School Brasov Romania July 2005 R Hughes-Jones Manchester 9

The Physical Layer 1: Ethernet

Summer School Brasov Romania July 2005 R Hughes-Jones Manchester 10

The Link Layer 2: Ethernet Frame

  • Preamble: 56 bits of alternating 0s and 1s. The preamble provides all the nodes on the network with a signal against which to synchronize
  • Start Frame Delimiter: marks the start of a frame. It is 8 bits long with the pattern 10101011
  • Media Access Control (MAC) Address: every Ethernet network card has built into its hardware a unique six-octet (48-bit) hexadecimal number that differentiates it from all other Ethernet cards in the universe. The destination address (DA) and source address (SA) define the path across the link
  • Length/Type field: two octets long
      • If the value ≤ 1500 (0x05dc hex), it indicates the length of the data
      • If the value > 1500, it indicates the network-layer protocol ("Ethernet Types")
  • Data: the reason the frame exists. MTU: Maximum Transmission Unit
  • Frame Check Sequence: protects the frame contents

[Figure: frame layout: Frame header | IP Datagram | FCS, followed by a 12-byte Inter Frame Gap]
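
The Length/Type rule from this slide can be illustrated by unpacking the first 14 bytes of a frame. This is a sketch under the assumption that the frame is seen as captured in software, i.e. without the preamble, start frame delimiter or FCS; the function name and the sample frame are invented for the example.

```python
import struct


def parse_ethernet_header(frame: bytes) -> dict:
    """Split DA, SA and the Length/Type field out of a captured frame."""
    dst, src, length_type = struct.unpack("!6s6sH", frame[:14])
    return {
        "dst": ":".join(f"{b:02x}" for b in dst),
        "src": ":".join(f"{b:02x}" for b in src),
        # At most 1500 (0x05dc): a data length; larger: an Ethernet Type.
        "kind": "length" if length_type <= 1500 else "type",
        "value": length_type,
    }


if __name__ == "__main__":
    # Hypothetical broadcast frame carrying IP (Ethernet Type 0x0800).
    frame = bytes.fromhex("ffffffffffff" "001122334455" "0800") + bytes(46)
    print(parse_ethernet_header(frame))
```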

Summer School Brasov Romania July 2005 R Hughes-Jones Manchester 11

The Link Layer: Ethernet VLANs

  • VLANs are logical networks built over the same physical cable plant
  • Ethernet frames are assigned to their logical networks using a VLAN header
  • A VLAN frame is identified by the value 0x8100 in the Type field location
  • The next two octets are composed of the following three fields:
      • User Priority field: 3 bits in length, used to define the priority of the Ethernet frame. This is used to define and deliver a class of service
      • Canonical format indicator: 1 bit in length. Just don't ask
      • VLAN Identifier field: 12 bits in length, contains the VLAN identifier (VID) of this frame
  • The original Length/Type field then follows the inserted VLAN tag
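
The 3 + 1 + 12 bit split of the two octets after 0x8100 can be sketched with shifts and masks. The function name and the sample frames are assumptions for illustration, and the frame is again assumed to start at the destination address.

```python
import struct

VLAN_TPID = 0x8100  # value found in the Type field location of a tagged frame


def parse_vlan_tag(frame: bytes):
    """Return the VLAN fields of a tagged frame, or None if it is untagged."""
    (tpid,) = struct.unpack("!H", frame[12:14])
    if tpid != VLAN_TPID:
        return None
    tci, length_type = struct.unpack("!HH", frame[14:18])
    return {
        "priority": tci >> 13,        # 3-bit User Priority
        "cfi": (tci >> 12) & 0x1,     # 1-bit Canonical format indicator
        "vid": tci & 0x0FFF,          # 12-bit VLAN identifier
        "length_type": length_type,   # original Length/Type field
    }


if __name__ == "__main__":
    # Hypothetical frame: priority 5, VID 100, carrying IP (0x0800).
    tagged = bytes(12) + bytes.fromhex("8100" "a064" "0800") + bytes(46)
    print(parse_vlan_tag(tagged))
```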

Summer School Brasov Romania July 2005 R Hughes-Jones Manchester 12

The Network Layer 3: IP

  • IP Layer properties:
      • Provides best effort delivery
      • It is unreliable: packets may be lost, duplicated, or delivered out of order
      • Connectionless
      • Provides logical addresses
      • Provides routing
      • Demultiplexes data on protocol number

Summer School Brasov Romania July 2005 R Hughes-Jones Manchester 13

The Internet datagram

  Bit:  0      4      8               16      19      24           31
        Vers | Hlen | Type of serv  | Total length
        Identification              | Flags | Fragment offset
        TTL         | Protocol      | Header Checksum
        Source IP address
        Destination IP address
        IP Options (if any)                          | Padding

IP header: 20 Bytes

[Figure: frame layout: Frame header | IP header | Transport | FCS]

Summer School Brasov Romania July 2005 R Hughes-Jones Manchester 14

IP Datagram Format (cont)

  • Type of Service (TOS): now being used for QoS
  • Total length: length of the datagram in bytes, includes header and data
  • Time to live (TTL): specifies how long the datagram is allowed to remain in the internet
      • Routers decrement by 1
      • When TTL = 0 the router discards the datagram
      • Prevents infinite loops
  • Protocol: specifies the format of the data area
      • Protocol numbers are administered by a central authority to guarantee agreement, e.g. ICMP=1, TCP=6, UDP=17 ...
  • Source & destination IP address (32 bits each): contain the IP address of the sender and intended recipient
  • Options (variable length): mainly used to record a route, or timestamps, or specify routing

[Figure: IP header layout: Vers, Hlen, TOS, Total length; Identification, Flags, Fragment offset; TTL, Protocol, Header Checksum; Source IP address; Destination IP address; IP Options (if any), Padding]
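
The header fields described above can be pulled out of the first 20 bytes of a datagram with Python's struct module. This is a sketch only: the function name and sample header are made up, options are ignored, and the checksum is reported but not verified.

```python
import struct

PROTOCOL_NAMES = {1: "ICMP", 6: "TCP", 17: "UDP"}


def parse_ipv4_header(packet: bytes) -> dict:
    """Decode the fixed 20-byte part of an IPv4 header."""
    (ver_hlen, tos, total_length, ident, flags_frag,
     ttl, protocol, checksum, src, dst) = struct.unpack(
        "!BBHHHBBH4s4s", packet[:20])
    return {
        "version": ver_hlen >> 4,
        "header_bytes": (ver_hlen & 0x0F) * 4,  # Hlen counts 32-bit words
        "tos": tos,
        "total_length": total_length,           # header plus data, in bytes
        "identification": ident,
        "flags": flags_frag >> 13,
        "fragment_offset": flags_frag & 0x1FFF,
        "ttl": ttl,
        "protocol": PROTOCOL_NAMES.get(protocol, str(protocol)),
        "checksum": checksum,
        "src": ".".join(str(b) for b in src),
        "dst": ".".join(str(b) for b in dst),
    }


if __name__ == "__main__":
    # Hypothetical header: a UDP datagram from 192.168.0.1 to 192.168.0.199.
    header = bytes.fromhex("45000073000040004011b861c0a80001c0a800c7")
    print(parse_ipv4_header(header))
```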

Summer School Brasov Romania July 2005 R Hughes-Jones Manchester 15

Internet Class-based addresses

  • An address looks like: 192.168.22.123
  • Class A: large number of hosts, few networks
      • 0nnnnnnn hhhhhhhh hhhhhhhh hhhhhhhh
      • 7 network bits (0 and 127 reserved, so 126 networks), 24 host bits (> 16M hosts/net)
      • Initial byte 1-126 (decimal)
  • Class B: medium number of hosts and networks
      • 10nnnnnn nnnnnnnn hhhhhhhh hhhhhhhh
      • 16,384 class B networks, 65,534 hosts/network
      • Initial byte 128-191 (decimal)
  • Class C: large number of small networks
      • 110nnnnn nnnnnnnn nnnnnnnn hhhhhhhh
      • 2,097,152 networks, 254 hosts/network
      • Initial byte 192-223 (decimal)
  • Class D: Multicast (see RFC 1112)
      • 1110nnnn nnnnnnnn nnnnnnnn hhhhhhhh
      • Initial byte 224-239 (decimal)
  • Class E: Reserved
      • Initial byte 240-255 (decimal)
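
Because the class is fixed by the leading bits of the first byte, it can be read off with a handful of comparisons. The function name is hypothetical, and the sketch treats every initial byte from 240 upward (leading bits 1111) as class E.

```python
def address_class(address: str) -> str:
    """Return the class (A-E) of a dotted-decimal IPv4 address."""
    first = int(address.split(".")[0])
    if not 0 <= first <= 255:
        raise ValueError(f"bad first byte in {address!r}")
    if first < 128:   # leading bit  0
        return "A"
    if first < 192:   # leading bits 10
        return "B"
    if first < 224:   # leading bits 110
        return "C"
    if first < 240:   # leading bits 1110
        return "D"
    return "E"        # leading bits 1111


if __name__ == "__main__":
    for addr in ("10.1.2.3", "172.16.0.1", "192.168.1.1", "224.0.0.1"):
        print(addr, address_class(addr))
```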

Summer School Brasov Romania July 2005 R Hughes-Jones Manchester 16

The Transport Layer 4: UDP

  • UDP provides:
      • Connectionless service over IP
          • No setup or teardown
          • One packet at a time
      • Minimal overhead - high performance
      • Best effort delivery
  • It is unreliable:
      • Packets may be lost
      • Duplicated
      • Out of order
  • Application is responsible for:
      • Data reliability
      • Flow control
      • Error handling

Summer School Brasov Romania July 2005 R Hughes-Jones Manchester 17

UDP Datagram format

  Bit:  0            8            16           24          31
        Source port               | Destination port
        UDP message len           | Checksum (opt)

UDP header: 8 Bytes

[Figure: frame layout: Frame header | IP header | UDP header | Application data | FCS]

  • Source/destination port: port numbers identify the sending & receiving processes
      • Port number & IP address allow any application on the Internet to be uniquely identified
      • Ports can be static or dynamic
          • Static (< 1024): assigned centrally, known as well-known ports
          • Dynamic
  • Message length: in bytes, includes the UDP header and data (min 8, max 65,535)
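
The four 16-bit fields of the UDP header can be unpacked in one call. The sketch below is illustrative only: the function name and the sample datagram are invented, and the optional checksum is returned without being checked.

```python
import struct


def parse_udp_header(datagram: bytes) -> dict:
    """Decode the 8-byte UDP header and slice out the payload."""
    src_port, dst_port, length, checksum = struct.unpack("!HHHH", datagram[:8])
    return {
        "src_port": src_port,
        "dst_port": dst_port,
        "length": length,                       # header + data, minimum 8
        "checksum": checksum,                   # 0 means "not computed"
        "dst_is_well_known": dst_port < 1024,   # static, centrally assigned
        "payload": datagram[8:length],
    }


if __name__ == "__main__":
    # Hypothetical datagram from a dynamic port to port 53 with 5 data bytes.
    datagram = struct.pack("!HHHH", 40000, 53, 8 + 5, 0) + b"hello"
    print(parse_udp_header(datagram))
```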

Summer School Brasov Romania July 2005 R Hughes-Jones Manchester 18

The Transport Layer 4: TCP

  • TCP (RFC 793, RFC 1122) provides:
      • Connection-oriented service over IP
          • During setup the two ends agree on details
          • Explicit teardown
          • Multiple connections allowed
      • Reliable end-to-end byte stream delivery over an unreliable network
      • It takes care of:
          • Lost packets
          • Duplicated packets
          • Out-of-order packets
  • TCP provides:
      • Data buffering
      • Flow control
      • Error detection & handling
      • Limits network congestion

Summer School Brasov Romania July 2005 R Hughes-Jones Manchester 19

The TCP Segment Format

  Bit:  0      4        10       16              24          31
        Source port              | Destination port
        Sequence number
        Acknowledgement number
        Hlen | Resv   | Code     | Window
        Checksum                 | Urgent ptr
        Options (if any)                         | Padding

TCP header: 20 Bytes

[Figure: frame layout: Frame header | IP header | TCP header | Application data | FCS]

Summer School Brasov Romania July 2005 R Hughes-Jones Manchester 20

TCP Segment Format - cont

  • Source/Dest port: TCP port numbers to ID applications at both ends of the connection
  • Sequence number: first byte in the segment from the sender's byte stream
  • Acknowledgement: identifies the number of the byte the sender of this segment expects to receive next
  • Code: used to determine segment purpose, e.g. SYN, ACK, FIN, URG
  • Window: advertises how much data this station is willing to accept; can depend on buffer space remaining
  • Options: used for window scaling, SACK, timestamps, maximum segment size etc.

[Figure: TCP header layout: Source port, Destination port; Sequence number; Acknowledgement number; Hlen, Resv, Code, Window; Checksum, Urgent ptr; Options (if any), Padding]
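
The fields listed on this slide map directly onto the fixed 20-byte TCP header. The following sketch decodes them; the function name and sample segment are assumptions, options are not parsed, and PSH and RST are included alongside the four Code bits the slide names.

```python
import struct

# Code bits, most significant first.
FLAG_BITS = (("URG", 0x20), ("ACK", 0x10), ("PSH", 0x08),
             ("RST", 0x04), ("SYN", 0x02), ("FIN", 0x01))


def parse_tcp_header(segment: bytes) -> dict:
    """Decode the fixed 20-byte part of a TCP header."""
    (src_port, dst_port, seq, ack, hlen_code,
     window, checksum, urgent) = struct.unpack("!HHIIHHHH", segment[:20])
    return {
        "src_port": src_port,
        "dst_port": dst_port,
        "sequence": seq,                         # first byte in this segment
        "acknowledgement": ack,                  # next byte expected
        "header_bytes": (hlen_code >> 12) * 4,   # Hlen counts 32-bit words
        "flags": [name for name, bit in FLAG_BITS if hlen_code & bit],
        "window": window,                        # bytes the sender will accept
        "checksum": checksum,
        "urgent_ptr": urgent,
    }


if __name__ == "__main__":
    # Hypothetical SYN+ACK segment with no options.
    segment = struct.pack("!HHIIHHHH", 1234, 80, 1024, 2048,
                          (5 << 12) | 0x12, 65535, 0, 0)
    print(parse_tcp_header(segment))
```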

Summer School Brasov Romania July 2005 R Hughes-Jones Manchester 21

TCP - providing reliability

  • Positive acknowledgement (ACK) of each received segment
      • Sender keeps a record of each segment sent
      • Sender awaits an ACK: "I am ready to receive byte 2048 and beyond"
      • Sender starts a timer when it sends a segment, so it can re-transmit
  • Inefficient: the sender has to wait

[Figure: sender/receiver time diagram. Segment n (Sequence 1024, Length 1024) is answered one RTT later by the ACK of segment n (Ack 2048); segment n+1 (Sequence 2048, Length 1024) is answered by the ACK of segment n+1 (Ack 3072)]
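
The numbers in the diagram follow from two small relations, sketched here with hypothetical function names: the ACK carries the sequence number plus the segment length, and sending one segment per round trip caps the rate at segment size / RTT. The 100 ms RTT is an assumed value.

```python
def ack_for(sequence: int, length: int) -> int:
    """ACK number: the next byte the receiver is ready to receive."""
    return sequence + length


def stop_and_wait_bps(segment_bytes: int, rtt_s: float) -> float:
    """Rate when the sender waits for each ACK before sending again."""
    return segment_bytes * 8 / rtt_s


if __name__ == "__main__":
    print(ack_for(1024, 1024))   # ACK of segment n
    print(ack_for(2048, 1024))   # ACK of segment n+1
    # Hypothetical 100 ms RTT with the 1024-byte segments from the diagram.
    print(stop_and_wait_bps(1024, 0.1), "bit/s")
```

At an assumed 100 ms RTT this is only about 82 kbit/s regardless of link speed, which is why a window of several segments in flight is needed.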

Summer School Brasov Romania July 2005 R Hughes-Jones Manchester 22

Flow Control: Sender - Congestion Window

  • Uses the congestion window, cwnd, a sliding window to control the data flow
      • Byte count giving the highest byte that can be sent without an ACK
      • Transmit buffer size and advertised receive buffer size are important
      • ACK gives the next sequence number to receive AND the available space in the receive buffer
      • Timer kept for each packet

[Figure: the sender's transmit buffer, divided into: data sent and ACKed; sent data, buffered waiting for an ACK; unsent data that may be transmitted immediately; and data to be sent, waiting for the window to open. The application writes at the far end. The TCP cwnd slides: a received ACK advances the trailing edge, the receiver's advertised window advances the leading edge, and the sending host advances a marker as data is transmitted]

Summer School Brasov Romania July 2005 R Hughes-Jones Manchester 23

Flow Control: Receiver - Lost Data

[Figure: the receiver's buffer, divided into: data given to the application; data ACKed but not given to the user; lost data; and data received but not ACKed. The application reads at one end; the last ACK given marks the next byte expected (the expected sequence number). The window slides, and the receiver's advertised window advances the leading edge]

  • If new data is received with a sequence number ≠ next byte expected, a duplicate ACK is sent with the expected sequence number
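
The duplicate-ACK behaviour can be sketched as a tiny receiver model. The class name is hypothetical, buffer limits and the advertised window are ignored, and segments are assumed never to overlap; the point is only that the ACK number stays at the expected byte until the gap is filled.

```python
class Receiver:
    """Returns the ACK number generated for each arriving segment."""

    def __init__(self, initial_sequence: int) -> None:
        self.expected = initial_sequence   # next byte expected
        self.out_of_order = {}             # sequence -> length, held back

    def on_segment(self, sequence: int, length: int) -> int:
        if sequence == self.expected:
            self.expected += length
            # Release buffered segments that are now contiguous.
            while self.expected in self.out_of_order:
                self.expected += self.out_of_order.pop(self.expected)
        elif sequence > self.expected:
            self.out_of_order[sequence] = length   # gap: data was lost
        return self.expected   # unchanged value means a duplicate ACK


if __name__ == "__main__":
    rx = Receiver(initial_sequence=1024)
    # Hypothetical arrival order: the segment starting at 2048 is lost
    # and only arrives last, as a retransmission.
    for seq in (1024, 3072, 4096, 5120, 2048):
        print(seq, "->", rx.on_segment(seq, 1024))
```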

Summer School Brasov Romania July 2005 R Hughes-Jones Manchester 24

Page 7: Summer School, Brasov, Romania, July 2005, R. Hughes-Jones Manchester 1 TCP/IP and Other Transports for High Bandwidth Applications Back to Basics Richard

Summer School Brasov Romania July 2005 R Hughes-Jones Manchester 7

How do the ldquoIPrdquo Protocols fit together

Application

( Presentation

Session)

Transport

Network

Data Link

Physical

FileTransferProtocol

(FTP) RFC 559

Simple MailTransfer Protocol(SMTP) RFC 821

TELNETRFC 854

TFTP RFC 783

NFSRFC 1024 1057

and 1094

SNMPRFC 1157

Transmission Control Protocol (TCP)

RFC 793

User Datagram Protocol (UDP)

RFC 768

Address ResolutionProtocols

ARP RFC 826RARP RFC 903

Internet ProtocolIP

RFC 791

Internet ControlMessage Protocol

(ICMP) RFC 792

Ethernet Token Ring ISDN FDDI SMDS ATM SDHSONET xDSL

Transmission Mode

TP Copper Fibre Optic Satellite Microwave DWDM CWDM etc

Network Interface Cards

RoutingOSPF BGP

ssh

HTTP POP3IMAP

DNS

DNS

ping

traceroute

Summer School Brasov Romania July 2005 R Hughes-Jones Manchester 8

Some of the ldquoIPrdquo Protocols Transmission Control Protocol TCP provides application programs

access to the network using a reliable connection-oriented transport layer service

User Datagram Protocol UDP provides unreliable connection-less delivery service using the IP protocol to transport messages between machines It adds the ability to distinguish among multiple destinations on a single host computer

Internet Protocol IP receives datagrams from the upper-layer software and transmits it to the destination host based upon a best effort connection-less delivery service

Internet Control Message Protocol ICMP allows internet routers to transmit error messages and test messages

Internet Group Message Protocol IGMP is used with multicast to send UDP datagrams to multiple hosts

Address Resolution Protocol ARP translates between the 32 bit IP address and a 48 bit LAN address

Reverse Address Resolution Protocol RARP translates between the 48 bit LAN address and the 32 bit IP address

Summer School Brasov Romania July 2005 R Hughes-Jones Manchester 9

The Physical Layer 1 Ethernet

Summer School Brasov Romania July 2005 R Hughes-Jones Manchester 10

The Link Layer 2 Ethernet Frame

Preamble which is comprised of 56 bits of alternating 0s and 1s The preamble provides all the nodes on the network a signal against which to synchronize

Media Access Control (MAC) AddressEvery Ethernet network card has built into its hardware a unique six-octet (48-bit) hexadecimal number that differentiates it from all other Ethernet cards in the universe The DA and SA define the path across the link

Start Frame delimiter which marks the start of a frame The start frame delimiter is 8 bits long with the pattern10101011

Data the reason the frame exists MTU Maximum Transport Unit

Frame Check Sequence to protect the frame contents

LengthType field two octets longIf the value =lt 1500 (0x05dc hex) indicates the length of dataIf the value gt 1500 indicates network-layer protocol ldquoEthernet Typesrdquo

Frame header IP Datagram FCS

12 bytes

Inter Frame Gap

Summer School Brasov Romania July 2005 R Hughes-Jones Manchester 11

The Link Layer Ethernet VLANs

VLANS are logical networks built over the same physical cable plant Distinguishes Ethernet frames betweentheir logical networks using VLANheader

VLAN is defined by the use of value 0x8100 in the Type field location

The next two octets are composed of the following three fields

User Priority field This field is 3 bits in length and is used to define the priority of the Ethernet frame This is utilized to define and deliver a class of service

Canonical format indicator This is 1 bit in length Just donrsquot ask

VLAN Identifier field This field is 12 bits in length and contains the VLAN identifier (VID) of this frame

The original LengthType field will then follow the inserted VLAN tag

Summer School Brasov Romania July 2005 R Hughes-Jones Manchester 12

The Network Layer 3 IP IP Layer properties

Provides best effort delivery It is unreliable

Packet may be lostDuplicatedOut of order

Connection less Provides logical addresses Provides routing Demultiplex data on protocol number

Summer School Brasov Romania July 2005 R Hughes-Jones Manchester 13

The Internet datagram

31HlenVers Type of serv Total length

0 8 16

Identification Flags

244

Fragment offset

19

TTL Protocol Header Checksum

Source IP address

Destination IP address

IP Options (if any) Padding

20 Bytes

Frame header Transport FCSIP header

Summer School Brasov Romania July 2005 R Hughes-Jones Manchester 14

IP Datagram Format (cont) Type of Service ndash TOS

now being used for QoS Total length length of datagram

in bytes includes header and data Time to live ndash TTL specifies how long datagram is

allowed to remain in internet Routers decrement by 1 When TTL = 0 router discards datagram Prevents infinite loops

Protocol specifies the format of the data area Protocol numbers administered by central authority to guarantee

agreement eg ICMP=1 TCP=6 UDP=17 hellip Source amp destination IP address (32 bits each) contain

IP address of sender and intended recipient Options (variable length) Mainly used to record a route

or timestamps or specify routing

HlenVers TOS Total length

Identification Flags Fragment offset

TTL Protocol Header Checksum

Source IP address

Destination IP address

IP Options (if any) Padding

HlenVers TOS Total length

Identification Flags Fragment offset

TTL Protocol Header Checksum

Source IP address

Destination IP address

IP Options (if any) Padding

Summer School Brasov Romania July 2005 R Hughes-Jones Manchester 15

Internet Class-based addresses An Address looks like 19216822123

Class A large number of hosts few networks 0nnnnnnn hhhhhhhh hhhhhhhh hhhhhhhh

7 network bits (0 and 127 reserved so 126 networks) 24 host bits (gt 16M hostsnet)

Initial byte 1-127 (decimal) Class B medium number of hosts and networks

10nnnnnn nnnnnnnn hhhhhhhh hhhhhhhh16384 class B networks 65534 hostsnetworkInitial byte 128-191 (decimal)

Class C large number of small networks 110nnnnn nnnnnnnn nnnnnnnn hhhhhhhh

2097152 networks 254 hostsnetworkInitial byte 192-223 (decimal)

Class D Multicast (See RFC 1112) 1110nnnn nnnnnnnn nnnnnnnn hhhhhhhh

Initial byte 224-239 (decimal) Class E Reserved

Initial byte 248-255 (decimal)

Summer School Brasov Romania July 2005 R Hughes-Jones Manchester 16

The Transport Layer 4 UDP UDP Provides

Connection less service over IPNo setup teardownOne packet at a time

Minimal overhead ndash high performance Provides best effort delivery It is unreliable

Packet may be lostDuplicatedOut of order

Application is responsible for Data reliabilityFlow controlError handling

Summer School Brasov Romania July 2005 R Hughes-Jones Manchester 17

UDP Datagram format

Sourcedestination port port numbers identify sending amp receiving processes

Port number amp IP address allow any application on Internet to be uniquely identified

Ports can be static or dynamic

Static (lt 1024) assigned centrally known as well known ports

Dynamic

Message length in bytes includes the UDP header and data (min 8 max 65535)

8 16 3124

Source port Destination port

UDP message len Checksum (opt)

0

Frame header Application data FCSIP header UDP header

8 Bytes

Summer School Brasov Romania July 2005 R Hughes-Jones Manchester 18

The Transport Layer 4 TCP TCP RFC 768 RFC 1122 Provides

Connection orientated service over IPDuring setup the two ends agree on detailsExplicit teardownMultiple connections allowed

Reliable end-to-end Byte Stream delivery over unreliable network It takes care of

Lost packets Duplicated packetsOut of order packets

TCP provides Data buffering Flow controlError detection amp handlingLimits network congestion

Summer School Brasov Romania July 2005 R Hughes-Jones Manchester 19

Code

Source port Destination port

Sequence number

0 8 16 3124

Acknowledgement number

4

Hlen

10

Resv Window

Urgent ptrChecksum

Options (if any) Padding

The TCP Segment FormatFrame header Application data FCSIP header TCP header

20 Bytes

TCP Segment Format - cont.

• Source/Dest port: TCP port numbers to ID applications at both ends of the connection
• Sequence number: first byte in segment from sender's byte stream
• Acknowledgement: identifies the number of the byte the sender of this segment expects to receive next
• Code: used to determine segment purpose, e.g. SYN, ACK, FIN, URG
• Window: advertises how much data this station is willing to accept. Can depend on buffer space remaining
• Options: used for window scaling, SACK, timestamps, maximum segment size etc.

TCP - providing reliability

• Positive acknowledgement (ACK) of each received segment
  • Sender keeps a record of each segment sent
  • Sender awaits an ACK: "I am ready to receive byte 2048 and beyond"
  • Sender starts a timer when it sends a segment, so it can re-transmit

Figure: time-sequence diagram between Sender and Receiver. Segment n (Sequence 1024, Length 1024) is answered by Ack 2048 one RTT later; segment n+1 (Sequence 2048, Length 1024) is answered by Ack 3072 after a further RTT.

• Inefficient - sender has to wait
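To put a number on "inefficient", here is a small sketch of the throughput of this one-segment-per-RTT scheme. The function name and the example figures (1024-byte segments, 100 ms RTT, a 1 Gbit/s link) are assumptions chosen for illustration, not values from the talk.

```python
def stop_and_wait_throughput_bps(segment_bytes: int, rtt_s: float, link_bps: float) -> float:
    """Throughput when only one segment may be outstanding at a time.

    Each cycle costs the time to put the segment on the wire plus one
    round-trip time waiting for its ACK.
    """
    segment_bits = segment_bytes * 8
    time_on_wire_s = segment_bits / link_bps
    return segment_bits / (time_on_wire_s + rtt_s)


if __name__ == "__main__":
    rate = stop_and_wait_throughput_bps(1024, rtt_s=0.1, link_bps=1e9)
    print(f"{rate / 1e3:.1f} kbit/s on a 1 Gbit/s link")
```

With these assumed numbers the sender achieves roughly 82 kbit/s, a tiny fraction of the link rate, which is why a window of several segments in flight is needed.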

Flow Control: Sender - Congestion Window

• Uses the congestion window, cwnd, a sliding window to control the data flow
  • Byte count giving the highest byte that can be sent without an ACK
  • Transmit buffer size and advertised receive buffer size are important
  • ACK gives the next sequence number to receive AND the available space in the receive buffer
  • Timer kept for each packet

Figure: the sender's sliding window over the byte stream. Regions, from oldest to newest data:
  • Data sent and ACKed
  • Sent data, buffered waiting for ACK
  • Unsent data, may be transmitted immediately
  • Data to be sent, waiting for the window to open (application writes here)
Annotations: the TCP cwnd slides; a received ACK advances the trailing edge; the receiver's advertised window advances the leading edge; the sending host advances its marker as data is transmitted.
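The bookkeeping in the figure can be sketched with a few counters. This is a simplified model under stated assumptions: the class and field names are hypothetical, the window is taken as the smaller of cwnd and the receiver's advertised window, and retransmission timers are left out.

```python
from dataclasses import dataclass


@dataclass
class SenderWindow:
    cwnd: int         # congestion window in bytes
    rwnd: int         # receiver's advertised window in bytes
    snd_una: int = 0  # oldest unacknowledged byte (trailing edge)
    snd_nxt: int = 0  # next byte to transmit (the sender's marker)

    def usable(self) -> int:
        """Bytes that may be transmitted immediately."""
        in_flight = self.snd_nxt - self.snd_una
        return max(0, min(self.cwnd, self.rwnd) - in_flight)

    def send(self, nbytes: int) -> int:
        """Transmit up to nbytes; returns how many bytes actually went out."""
        sent = min(nbytes, self.usable())
        self.snd_nxt += sent  # marker advances as data is transmitted
        return sent

    def on_ack(self, ack_no: int, advertised: int) -> None:
        """An ACK advances the trailing edge and refreshes the advertised window."""
        if ack_no > self.snd_una:
            self.snd_una = min(ack_no, self.snd_nxt)
        self.rwnd = advertised
```

Data that does not fit in the usable window stays in the transmit buffer until an ACK opens the window again.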

Flow Control: Receiver - Lost Data

Figure: the receiver's window over the byte stream. Regions, from oldest to newest data:
  • Data given to application (application reads here)
  • ACKed but not given to user
  • Lost data
  • Received but not ACKed
Annotations: last ACK given = next byte expected = expected sequence number; the window slides; the receiver's advertised window advances the leading edge.

• If new data is received with a sequence number ≠ the next byte expected, a duplicate ACK is sent with the expected sequence number
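The duplicate-ACK rule can be sketched as a receiver that always replies with the next byte it expects. This is an illustrative model: the class name is hypothetical, it buffers out-of-order segments by sequence number, and it ignores window limits and overlapping segments.

```python
class Receiver:
    """Cumulative-ACK receiver: every reply names the next byte expected."""

    def __init__(self, first_seq: int = 0) -> None:
        self.expected = first_seq
        self.buffered = {}  # sequence number -> length of out-of-order segment

    def on_segment(self, seq: int, length: int) -> int:
        """Process one segment and return the ACK number to send back."""
        if seq == self.expected:
            self.expected += length
            # A filled gap may release segments that arrived earlier.
            while self.expected in self.buffered:
                self.expected += self.buffered.pop(self.expected)
        elif seq > self.expected:
            # Out of order: hold the data; the ACK below is a duplicate.
            self.buffered[seq] = length
        return self.expected
```

When the missing segment finally arrives, a single ACK jumps forward past all the buffered data.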

How it works: TCP Slowstart

• Probe the network - get a rough estimate of the optimal congestion window size
• The larger the window size, the higher the throughput
  • Throughput = Window size / Round-trip Time
• Exponentially increase the congestion window size until a packet is lost
  • cwnd initially 1 MTU, then increased by 1 MTU for each ACK received
  • Send 1st packet, get 1 ACK, increase cwnd to 2
  • Send 2 packets, get 2 ACKs, increase cwnd to 4
  • Time to reach cwnd size W (in MTUs) = RTT × log2(W)
  • Rate doubles each RTT

Figure: CWND versus time, showing slow start (exponential increase), congestion avoidance (linear increase), a packet loss followed by a retransmit, and a timeout followed by slow start again.
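A short sketch of the doubling, counting cwnd in whole MTUs and assuming no loss and one ACK per segment. The function name is illustrative.

```python
import math


def slow_start_rtts(target_mtus: int) -> int:
    """Round trips needed for cwnd to grow from 1 MTU to at least target_mtus."""
    cwnd = 1
    rtts = 0
    while cwnd < target_mtus:
        # cwnd segments are sent, cwnd ACKs return, each ACK adds 1 MTU,
        # so the window doubles every round trip.
        cwnd += cwnd
        rtts += 1
    return rtts


if __name__ == "__main__":
    for w in (2, 64, 1024):
        print(w, slow_start_rtts(w), math.ceil(math.log2(w)))
```

The loop count agrees with the RTT × log2(W) rule on the slide, rounded up to a whole number of round trips.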

How it works: TCP Congestion Avoidance

• Additive increase: starting from the rough estimate, linearly increase the congestion window size to probe for additional available bandwidth
  • cwnd increased by about 1 MTU per RTT (1/cwnd of an MTU for each ACK, with cwnd counted in MTUs) - linear increase in rate
• TCP takes packet loss as an indication of congestion
• Multiplicative decrease: cut the congestion window size aggressively if a packet is lost
  • Standard TCP reduces cwnd by a factor of 0.5
• Slow start to congestion avoidance transition determined by ssthresh

Figure: CWND versus time, showing slow start (exponential increase), congestion avoidance (linear increase), a packet loss followed by a retransmit, and a timeout followed by slow start again.
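The sawtooth in the figure can be reproduced with a per-RTT model. This is a sketch under simplifying assumptions: cwnd is counted in MTUs, grows by 1 MTU per RTT in congestion avoidance, and a loss halves it without a timeout. The function name and the floor of 2 MTUs on ssthresh are choices made for this illustration.

```python
def aimd_trace(rtts: int, loss_at: set, cwnd: float = 1.0, ssthresh: float = 64.0) -> list:
    """Return cwnd (in MTUs) at the start of each round trip."""
    trace = []
    for t in range(rtts):
        trace.append(cwnd)
        if t in loss_at:
            # Multiplicative decrease: halve the window on a loss.
            ssthresh = max(cwnd / 2, 2.0)
            cwnd = ssthresh
        elif cwnd < ssthresh:
            # Slow start: double each RTT until ssthresh is reached.
            cwnd = min(cwnd * 2, ssthresh)
        else:
            # Congestion avoidance: additive increase of 1 MTU per RTT.
            cwnd += 1.0
    return trace


if __name__ == "__main__":
    print(aimd_trace(10, loss_at={6}, ssthresh=8.0))
```

The slow linear climb after each halving is what limits standard TCP on long, fast paths.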

TCP Fast Retransmit & Recovery

• Duplicate ACKs are due to lost segments or segments out of order
• Fast Retransmit: if the sender receives 3 duplicate ACKs (i.e. the receiver got 3 additional segments without getting the one expected)
  • Send the missing segment
  • Set ssthresh to 0.5 × cwnd - so enter congestion avoidance phase
  • Set cwnd = (0.5 × cwnd + 3) - the 3 dup ACKs
  • Increase cwnd by 1 segment for each further duplicate ACK
  • Keep sending new data if allowed by cwnd
  • Set cwnd to half the original value on a new ACK
  • No need to go into "slow start" again
• At steady state, CWND oscillates around the optimal window size
• With a retransmission timeout, slow start is triggered again

Figure: CWND versus time, showing slow start (exponential increase), congestion avoidance (linear increase), a packet loss followed by a retransmit, and a timeout followed by slow start again.
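The fast retransmit and recovery steps can be sketched as a small state machine. This is an illustration only: the class and attribute names are hypothetical, cwnd is counted in segments, window growth on ordinary new ACKs is deliberately left out, and the retransmission itself is just recorded in a list.

```python
class FastRecoverySender:
    DUP_ACK_THRESHOLD = 3

    def __init__(self, cwnd: float, ssthresh: float) -> None:
        self.cwnd = cwnd
        self.ssthresh = ssthresh
        self.last_ack = 0
        self.dup_acks = 0
        self.in_recovery = False
        self.retransmitted = []  # ACK numbers whose segment was resent

    def on_ack(self, ack_no: int) -> None:
        if ack_no == self.last_ack:
            self.dup_acks += 1
            if self.in_recovery:
                # Each further duplicate ACK means a segment has left the network.
                self.cwnd += 1
            elif self.dup_acks == self.DUP_ACK_THRESHOLD:
                self.retransmitted.append(ack_no)  # send the missing segment
                self.ssthresh = max(self.cwnd / 2, 2)
                self.cwnd = self.ssthresh + 3      # the 3 duplicate ACKs
                self.in_recovery = True
        else:
            if self.in_recovery:
                self.cwnd = self.ssthresh          # half the original value
                self.in_recovery = False
            self.last_ack = ack_no
            self.dup_acks = 0

    def on_timeout(self) -> None:
        """A retransmission timeout falls back to slow start."""
        self.ssthresh = max(self.cwnd / 2, 2)
        self.cwnd = 1
        self.dup_acks = 0
        self.in_recovery = False
```

The contrast between the new-ACK branch and on_timeout is the point of the slide: recovery resumes at half the window, while a timeout restarts from 1 segment.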

TCP Simple Tuning - Filling the Pipe

• Remember, TCP has to hold a copy of data in flight
• Optimal (TCP buffer) window size depends on:
  • Bandwidth end to end, i.e. min(BW of links), AKA bottleneck bandwidth
  • Round Trip Time (RTT)
• The number of bytes in flight to fill the entire path:
  • Bandwidth × Delay Product: BDP = RTT × BW
  • Sizing the window to the BDP can increase the achieved bandwidth by orders of magnitude
• Windows also used for flow control

Figure: time-sequence diagram between Sender and Receiver showing the RTT, the returning ACK, and segment time on wire = bits in segment / BW.
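A worked example of the BDP arithmetic. The path parameters (1 Gbit/s bottleneck, 150 ms RTT) and the 64 KiB comparison window are assumptions picked for illustration.

```python
def bdp_bytes(bandwidth_bps: float, rtt_s: float) -> float:
    """Bandwidth × delay product: bytes that must be in flight to fill the path."""
    return bandwidth_bps * rtt_s / 8


def window_limited_throughput_bps(window_bytes: int, rtt_s: float) -> float:
    """Best case throughput when at most one window is sent per round trip."""
    return window_bytes * 8 / rtt_s


if __name__ == "__main__":
    rtt = 0.150       # seconds, assumed long-distance path
    bottleneck = 1e9  # bit/s, assumed bottleneck bandwidth
    print(f"BDP = {bdp_bytes(bottleneck, rtt) / 1e6:.2f} MByte")
    small = window_limited_throughput_bps(64 * 1024, rtt)
    print(f"64 KiB window gives about {small / 1e6:.1f} Mbit/s")
```

Under these assumptions the path needs close to 19 MByte in flight, while a 64 KiB window caps the flow at a few Mbit/s, a difference of more than two orders of magnitude.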

Congestion control: ACK clocking

More Information

• Lectures, tutorials etc. on TCP/IP:
  • www.nvcc.va.us/home/joney/tcp_ip.htm
  • www.cs.pdx.edu/~jrb/tcpip/lectures.html
  • www.raleigh.ibm.com/cgi-bin/bookmgr/BOOKS/EZ306200/CCONTENTS
  • www.cisco.com/univercd/cc/td/doc/product/iaabu/centri4/user/scf4ap1.htm
  • www.cis.ohio-state.edu/htbin/rfc/rfc1180.html
  • www.jbmelectronics.com/tcp.htm
• Encyclopaedia: http://www.freesoft.org/CIE/index.htm
• TCP/IP Resources: www.private.org.il/tcpip_rl.html
• Understanding IP addresses: http://www.3com.com/solutions/en_US/ncs/501302.html
• Configuring TCP (RFC 1122): ftp://nic.merit.edu/internet/documents/rfc/rfc1122.txt
• Assigned protocols, ports etc. (RFC 1010): http://www.es.net/pub/rfcs/rfc1010.txt & /etc/protocols

Any Questions?

Backup Slides

More Information: Some URLs

• UKLight web site: http://www.uklight.ac.uk
• MB-NG project web site: http://www.mb-ng.net/
• DataTAG project web site: http://www.datatag.org/
• UDPmon / TCPmon kit + writeup: http://www.hep.man.ac.uk/~rich/net
• Motherboard and NIC Tests:
  • http://www.hep.man.ac.uk/~rich/net/nic/GigEth_tests_Boston.ppt & http://datatag.web.cern.ch/datatag/pfldnet2003/
  • "Performance of 1 and 10 Gigabit Ethernet Cards with Server Quality Motherboards", FGCS Special issue 2004, http://www.hep.man.ac.uk/~rich/
• TCP tuning information may be found at: http://www.ncne.nlanr.net/documentation/faq/performance.html & http://www.psc.edu/networking/perf_tune.html
• TCP stack comparisons: "Evaluation of Advanced TCP Stacks on Fast Long-Distance Production Networks", Journal of Grid Computing 2004
• PFLDnet: http://www.ens-lyon.fr/LIP/RESO/pfldnet2005/
• Dante PERT: http://www.geant2.net/server/show/nav.00d00h002

tcpdump / tcptrace

• tcpdump: dump all TCP header information for a specified source/destination
  • ftp://ftp.ee.lbl.gov/
• tcptrace: format tcpdump output for analysis using xplot
  • http://www.tcptrace.org/
• NLANR TCP Testrig: nice wrapper for tcpdump and tcptrace tools
  • http://www.ncne.nlanr.net/TCP/testrig/
• Sample use:
    tcpdump -s 100 -w /tmp/tcpdump.out host hostname
    tcptrace -Sl /tmp/tcpdump.out
    xplot /tmp/a2b_tsg.xpl

tcptrace and xplot

• X axis is time
• Y axis is sequence number
• The slope of this curve gives the throughput over time
• The xplot tool makes it easy to zoom in

Zoomed In View

• Green Line: ACK values received from the receiver
• Yellow Line: tracks the receive window advertised from the receiver
• Green Ticks: track the duplicate ACKs received
• Yellow Ticks: track the window advertisements that were the same as the last advertisement
• White Arrows: represent segments sent
• Red Arrows (R): represent retransmitted segments

TCP Slow Start

  • TCP/IP and Other Transports for High Bandwidth Applications: Back to Basics
  • Slide 2
  • Slide 3
  • What is a Protocol Stack
  • The Layering Principle
  • What do the Layers do
  • How do the "IP" Protocols fit together
  • Some of the "IP" Protocols
  • The Physical Layer 1: Ethernet
  • The Link Layer 2: Ethernet Frame
  • Slide 11
  • The Network Layer 3: IP
  • The Internet datagram
  • IP Datagram Format (cont.)
  • Internet Class-based addresses
  • The Transport Layer 4: UDP
  • UDP Datagram format
  • The Transport Layer 4: TCP
  • The TCP Segment Format
  • TCP Segment Format - cont.
  • TCP - providing reliability
  • Flow Control: Sender - Congestion Window
  • Flow Control: Receiver - Lost Data
  • How it works: TCP Slowstart
  • Slide 25
  • TCP Fast Retransmit & Recovery
  • TCP Simple Tuning - Filling the Pipe
  • Congestion control: ACK clocking
  • More Information
  • Slide 30
  • Slide 31
  • More Information: Some URLs
  • tcpdump / tcptrace
  • tcptrace and xplot
  • Zoomed In View
  • TCP Slow Start

Some of the "IP" Protocols

• Transmission Control Protocol (TCP): provides application programs access to the network using a reliable, connection-oriented transport layer service
• User Datagram Protocol (UDP): provides an unreliable, connection-less delivery service using the IP protocol to transport messages between machines. It adds the ability to distinguish among multiple destinations on a single host computer
• Internet Protocol (IP): receives datagrams from the upper-layer software and transmits them to the destination host, based upon a best effort, connection-less delivery service
• Internet Control Message Protocol (ICMP): allows internet routers to transmit error messages and test messages
• Internet Group Management Protocol (IGMP): used with multicast to send UDP datagrams to multiple hosts
• Address Resolution Protocol (ARP): translates between the 32 bit IP address and a 48 bit LAN address
• Reverse Address Resolution Protocol (RARP): translates between the 48 bit LAN address and the 32 bit IP address

The Physical Layer 1: Ethernet

The Link Layer 2: Ethernet Frame

Fields, in order on the wire:
• Preamble: 56 bits of alternating 0s and 1s. The preamble gives all the nodes on the network a signal against which to synchronize
• Start Frame Delimiter: marks the start of a frame. It is 8 bits long with the pattern 10101011
• Media Access Control (MAC) addresses: every Ethernet network card has built into its hardware a unique six-octet (48-bit) number, usually written in hexadecimal, that differentiates it from all other Ethernet cards in the universe. The DA and SA define the path across the link
• Length/Type field: two octets long
  • If the value ≤ 1500 (0x05DC) it indicates the length of the data
  • If the value > 1500 it indicates the network-layer protocol ("Ethernet Types", which start at 0x0600)
• Data: the reason the frame exists. MTU: Maximum Transmission Unit
• Frame Check Sequence: protects the frame contents
• Inter Frame Gap: 12 bytes

Encapsulation: Frame header | IP Datagram | FCS
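The Length/Type rule can be shown with a short parser. This sketch assumes the frame is given as captured by software, i.e. starting at the destination address with the preamble and start frame delimiter already stripped; the function name and returned keys are illustrative.

```python
import struct


def format_mac(raw: bytes) -> str:
    """Render a 6-octet MAC address in the usual colon-separated hex form."""
    return ":".join(f"{octet:02x}" for octet in raw)


def parse_ethernet_header(frame: bytes) -> dict:
    """Decode DA, SA and the Length/Type field from the first 14 bytes."""
    dst, src, length_type = struct.unpack_from("!6s6sH", frame)
    header = {"dst": format_mac(dst), "src": format_mac(src)}
    if length_type <= 1500:
        header["length"] = length_type     # 802.3 style: length of the data
    elif length_type >= 0x0600:
        header["ethertype"] = length_type  # e.g. 0x0800 for IP
    else:
        header["undefined"] = length_type  # values 1501 to 1535 are not assigned
    return header
```

A software capture normally omits the frame check sequence as well, so only the 14 header bytes are examined here.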

The Link Layer: Ethernet VLANs

• VLANs are logical networks built over the same physical cable plant
• Ethernet frames are distinguished between their logical networks using the VLAN header
• A VLAN is defined by the use of value 0x8100 in the Type field location
• The next two octets are composed of the following three fields:
  • User Priority field: 3 bits in length, used to define the priority of the Ethernet frame. This is utilized to define and deliver a class of service
  • Canonical format indicator: 1 bit in length. Just don't ask
  • VLAN Identifier field: 12 bits in length, contains the VLAN identifier (VID) of this frame
• The original Length/Type field then follows the inserted VLAN tag
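The three fields in the two octets after 0x8100 can be separated with shifts and masks. This is a minimal sketch that assumes the frame starts at the destination address, so the Type field location is at byte offset 12; the function name is illustrative.

```python
import struct

VLAN_TYPE = 0x8100


def parse_vlan_tag(frame: bytes):
    """Return the VLAN tag fields, or None if the frame is untagged."""
    tag_type, control = struct.unpack_from("!HH", frame, 12)
    if tag_type != VLAN_TYPE:
        return None
    # The original Length/Type field follows the inserted tag.
    (length_type,) = struct.unpack_from("!H", frame, 16)
    return {
        "priority": control >> 13,    # 3-bit user priority
        "cfi": (control >> 12) & 0x1, # 1-bit canonical format indicator
        "vid": control & 0x0FFF,      # 12-bit VLAN identifier
        "length_type": length_type,
    }
```

The 12-bit identifier allows values from 0 to 4095.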

The Network Layer 3: IP

IP Layer properties:
• Provides best effort delivery
• It is unreliable
  • Packets may be lost, duplicated or delivered out of order
• Connection-less
• Provides logical addresses
• Provides routing
• Demultiplexes data on protocol number

The Internet datagram

IP header layout (20 bytes without options; bit positions 0, 4, 8, 16, 19, 24, 31):
  Vers | Hlen | Type of serv | Total length
  Identification | Flags | Fragment offset
  TTL | Protocol | Header Checksum
  Source IP address
  Destination IP address
  IP Options (if any) | Padding

Encapsulation: Frame header | IP header | Transport | FCS
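The fixed 20-byte IP header can be decoded field by field in the same way as the layout above. This is an illustrative sketch: the function name and dictionary keys are made up for the example, options are not parsed, and the checksum is returned without being verified.

```python
import ipaddress
import struct

# Vers/Hlen, TOS, total length, identification, flags/fragment offset,
# TTL, protocol, header checksum, source address, destination address.
IPV4_FIXED = struct.Struct("!BBHHHBBH4s4s")

PROTOCOL_NAMES = {1: "ICMP", 6: "TCP", 17: "UDP"}


def parse_ipv4_header(packet: bytes) -> dict:
    """Decode the fixed part of an IPv4 header at the start of a packet."""
    (vers_hlen, tos, total_length, identification, flags_frag,
     ttl, protocol, checksum, src, dst) = IPV4_FIXED.unpack_from(packet)
    return {
        "version": vers_hlen >> 4,
        "header_bytes": (vers_hlen & 0x0F) * 4,  # Hlen counts 32-bit words
        "tos": tos,
        "total_length": total_length,
        "identification": identification,
        "flags": flags_frag >> 13,
        "fragment_offset": flags_frag & 0x1FFF,
        "ttl": ttl,
        "protocol": PROTOCOL_NAMES.get(protocol, protocol),
        "checksum": checksum,
        "src": str(ipaddress.IPv4Address(src)),
        "dst": str(ipaddress.IPv4Address(dst)),
    }
```

The protocol field is what the IP layer uses to demultiplex the data to TCP, UDP or ICMP.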

IP Datagram Format (cont.)

• Type of Service (TOS): now being used for QoS
• Total length: length of datagram in bytes, includes header and data
• Time to live (TTL): specifies how long the datagram is allowed to remain in the internet
  • Routers decrement by 1
  • When TTL = 0 the router discards the datagram
  • Prevents infinite loops
• Protocol: specifies the format of the data area
  • Protocol numbers administered by a central authority to guarantee agreement, e.g. ICMP=1, TCP=6, UDP=17 ...
• Source & destination IP address (32 bits each): contain the IP address of the sender and intended recipient
• Options (variable length): mainly used to record a route, or timestamps, or specify routing

Internet Class-based addresses

An address is written in dotted decimal and looks like 192.168.22.123

• Class A: large number of hosts, few networks
  • 0nnnnnnn hhhhhhhh hhhhhhhh hhhhhhhh
  • 7 network bits (0 and 127 reserved, so 126 networks), 24 host bits (> 16M hosts/net)
  • Initial byte 1-127 (decimal)
• Class B: medium number of hosts and networks
  • 10nnnnnn nnnnnnnn hhhhhhhh hhhhhhhh
  • 16384 class B networks, 65534 hosts/network
  • Initial byte 128-191 (decimal)
• Class C: large number of small networks
  • 110nnnnn nnnnnnnn nnnnnnnn hhhhhhhh
  • 2097152 networks, 254 hosts/network
  • Initial byte 192-223 (decimal)
• Class D: Multicast (see RFC 1112)
  • 1110nnnn nnnnnnnn nnnnnnnn hhhhhhhh
  • Initial byte 224-239 (decimal)
• Class E: Reserved
  • Initial byte 240-255 (decimal)
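Because the class is determined by the leading bits, the first byte alone is enough to classify an address. The sketch below uses the conventional boundaries 128, 192, 224 and 240, which correspond to the bit patterns 0, 10, 110, 1110 and 1111; the function name and the sample addresses are illustrative.

```python
def address_class(address: str) -> str:
    """Return the class (A to E) of a dotted-decimal IPv4 address."""
    first_byte = int(address.split(".")[0])
    if not 0 <= first_byte <= 255:
        raise ValueError(f"not a valid first byte: {first_byte}")
    if first_byte < 128:  # leading bit 0
        return "A"
    if first_byte < 192:  # leading bits 10
        return "B"
    if first_byte < 224:  # leading bits 110
        return "C"
    if first_byte < 240:  # leading bits 1110
        return "D"
    return "E"            # leading bits 1111


if __name__ == "__main__":
    for addr in ("10.1.2.3", "172.16.0.1", "192.168.22.123", "224.0.0.1"):
        print(addr, address_class(addr))
```

The sketch does not treat the reserved class A values 0 and 127 specially; it only reports the class implied by the leading bits.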

The Transport Layer 4: UDP

UDP provides:
• Connection-less service over IP
  • No setup or teardown
  • One packet at a time
• Minimal overhead - high performance
• Provides best effort delivery
• It is unreliable
  • Packets may be lost, duplicated or delivered out of order
• Application is responsible for:
  • Data reliability
  • Flow control
  • Error handling

Page 9: Summer School, Brasov, Romania, July 2005, R. Hughes-Jones Manchester 1 TCP/IP and Other Transports for High Bandwidth Applications Back to Basics Richard

Summer School Brasov Romania July 2005 R Hughes-Jones Manchester 9

The Physical Layer 1 Ethernet

Summer School Brasov Romania July 2005 R Hughes-Jones Manchester 10

The Link Layer 2 Ethernet Frame

Preamble which is comprised of 56 bits of alternating 0s and 1s The preamble provides all the nodes on the network a signal against which to synchronize

Media Access Control (MAC) AddressEvery Ethernet network card has built into its hardware a unique six-octet (48-bit) hexadecimal number that differentiates it from all other Ethernet cards in the universe The DA and SA define the path across the link

Start Frame delimiter which marks the start of a frame The start frame delimiter is 8 bits long with the pattern10101011

Data the reason the frame exists MTU Maximum Transport Unit

Frame Check Sequence to protect the frame contents

LengthType field two octets longIf the value =lt 1500 (0x05dc hex) indicates the length of dataIf the value gt 1500 indicates network-layer protocol ldquoEthernet Typesrdquo

Frame header IP Datagram FCS

12 bytes

Inter Frame Gap

Summer School Brasov Romania July 2005 R Hughes-Jones Manchester 11

The Link Layer Ethernet VLANs

VLANS are logical networks built over the same physical cable plant Distinguishes Ethernet frames betweentheir logical networks using VLANheader

VLAN is defined by the use of value 0x8100 in the Type field location

The next two octets are composed of the following three fields

User Priority field This field is 3 bits in length and is used to define the priority of the Ethernet frame This is utilized to define and deliver a class of service

Canonical format indicator This is 1 bit in length Just donrsquot ask

VLAN Identifier field This field is 12 bits in length and contains the VLAN identifier (VID) of this frame

The original LengthType field will then follow the inserted VLAN tag

Summer School Brasov Romania July 2005 R Hughes-Jones Manchester 12

The Network Layer 3 IP IP Layer properties

Provides best effort delivery It is unreliable

Packet may be lostDuplicatedOut of order

Connection less Provides logical addresses Provides routing Demultiplex data on protocol number

Summer School Brasov Romania July 2005 R Hughes-Jones Manchester 13

The Internet datagram

31HlenVers Type of serv Total length

0 8 16

Identification Flags

244

Fragment offset

19

TTL Protocol Header Checksum

Source IP address

Destination IP address

IP Options (if any) Padding

20 Bytes

Frame header Transport FCSIP header

Summer School Brasov Romania July 2005 R Hughes-Jones Manchester 14

IP Datagram Format (cont) Type of Service ndash TOS

now being used for QoS Total length length of datagram

in bytes includes header and data Time to live ndash TTL specifies how long datagram is

allowed to remain in internet Routers decrement by 1 When TTL = 0 router discards datagram Prevents infinite loops

Protocol specifies the format of the data area Protocol numbers administered by central authority to guarantee

agreement eg ICMP=1 TCP=6 UDP=17 hellip Source amp destination IP address (32 bits each) contain

IP address of sender and intended recipient Options (variable length) Mainly used to record a route

or timestamps or specify routing

HlenVers TOS Total length

Identification Flags Fragment offset

TTL Protocol Header Checksum

Source IP address

Destination IP address

IP Options (if any) Padding

HlenVers TOS Total length

Identification Flags Fragment offset

TTL Protocol Header Checksum

Source IP address

Destination IP address

IP Options (if any) Padding

Summer School Brasov Romania July 2005 R Hughes-Jones Manchester 15

Internet Class-based addresses An Address looks like 19216822123

Class A large number of hosts few networks 0nnnnnnn hhhhhhhh hhhhhhhh hhhhhhhh

7 network bits (0 and 127 reserved so 126 networks) 24 host bits (gt 16M hostsnet)

Initial byte 1-127 (decimal) Class B medium number of hosts and networks

10nnnnnn nnnnnnnn hhhhhhhh hhhhhhhh16384 class B networks 65534 hostsnetworkInitial byte 128-191 (decimal)

Class C large number of small networks 110nnnnn nnnnnnnn nnnnnnnn hhhhhhhh

2097152 networks 254 hostsnetworkInitial byte 192-223 (decimal)

Class D Multicast (See RFC 1112) 1110nnnn nnnnnnnn nnnnnnnn hhhhhhhh

Initial byte 224-239 (decimal) Class E Reserved

Initial byte 248-255 (decimal)

Summer School Brasov Romania July 2005 R Hughes-Jones Manchester 16

The Transport Layer 4 UDP UDP Provides

Connection less service over IPNo setup teardownOne packet at a time

Minimal overhead ndash high performance Provides best effort delivery It is unreliable

Packet may be lostDuplicatedOut of order

Application is responsible for Data reliabilityFlow controlError handling

Summer School Brasov Romania July 2005 R Hughes-Jones Manchester 17

UDP Datagram format

Sourcedestination port port numbers identify sending amp receiving processes

Port number amp IP address allow any application on Internet to be uniquely identified

Ports can be static or dynamic

Static (lt 1024) assigned centrally known as well known ports

Dynamic

Message length in bytes includes the UDP header and data (min 8 max 65535)

8 16 3124

Source port Destination port

UDP message len Checksum (opt)

0

Frame header Application data FCSIP header UDP header

8 Bytes

Summer School Brasov Romania July 2005 R Hughes-Jones Manchester 18

The Transport Layer 4 TCP TCP RFC 768 RFC 1122 Provides

Connection orientated service over IPDuring setup the two ends agree on detailsExplicit teardownMultiple connections allowed

Reliable end-to-end Byte Stream delivery over unreliable network It takes care of

Lost packets Duplicated packetsOut of order packets

TCP provides Data buffering Flow controlError detection amp handlingLimits network congestion

Summer School Brasov Romania July 2005 R Hughes-Jones Manchester 19

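The TCP Segment Format

TCP header layout (20 bytes without options, bit positions 0-31):
  Bits 0-15                                  Bits 16-31
  Source port                                Destination port
  Sequence number (bits 0-31)
  Acknowledgement number (bits 0-31)
  Hlen (0-3) | Resv (4-9) | Code (10-15)     Window
  Checksum                                   Urgent ptr
  Options (if any) | Padding

Encapsulation: Frame header | IP header | TCP header | Application data | FCS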

Summer School Brasov Romania July 2005 R Hughes-Jones Manchester 20

TCP Segment Format (cont.)

Source/dest port: TCP port numbers that identify the applications at both ends of the connection
Sequence number: first byte in this segment from the sender's byte stream
Acknowledgement: the number of the byte that the sender of this segment expects to receive next
Code: used to determine the segment's purpose, e.g. SYN, ACK, FIN, URG
Window: advertises how much data this station is willing to accept; can depend on the buffer space remaining
Options: used for window scaling, SACK, timestamps, maximum segment size, etc.
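The fixed 20-byte part of the header can be decoded as follows. This is a sketch under the layout described above (4-bit Hlen, 6 reserved bits, 6 code bits); the function name parse_tcp_header and the dictionary keys are hypothetical, and options are not decoded.

```python
import struct

# ports (2 x 16 bits), sequence and acknowledgement numbers (2 x 32 bits),
# Hlen/Resv/Code (16 bits), window, checksum, urgent pointer (3 x 16 bits)
TCP_HEADER = struct.Struct("!HHIIHHHH")

# Code bits, most significant first
FLAG_NAMES = [(0x20, "URG"), (0x10, "ACK"), (0x08, "PSH"),
              (0x04, "RST"), (0x02, "SYN"), (0x01, "FIN")]


def parse_tcp_header(segment):
    """Decode the fixed part of a TCP header from raw bytes."""
    (src, dst, seq, ack, hlen_code,
     window, checksum, urgent) = TCP_HEADER.unpack_from(segment)
    header_len = (hlen_code >> 12) * 4  # Hlen counts 32-bit words
    code = hlen_code & 0x3F             # low 6 bits are the code field
    return {
        "src_port": src,
        "dst_port": dst,
        "seq": seq,
        "ack": ack,
        "header_len": header_len,
        "flags": [name for bit, name in FLAG_NAMES if code & bit],
        "window": window,
        "checksum": checksum,
        "urgent_ptr": urgent,
    }
```

A header_len greater than 20 indicates that options follow the fixed fields.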

Summer School Brasov Romania July 2005 R Hughes-Jones Manchester 21

TCP - providing reliability

Positive acknowledgement (ACK) of each received segment
  Sender keeps a record of each segment sent
  Sender awaits an ACK: "I am ready to receive byte 2048 and beyond"
  Sender starts a timer when it sends a segment, so it can retransmit

[Figure: time-sequence diagram between Sender and Receiver. Segment n (Sequence 1024, Length 1024) is answered by the ACK of segment n (Ack 2048) one RTT later; segment n+1 (Sequence 2048, Length 1024) is answered by the ACK of segment n+1 (Ack 3072) one RTT later.]

Inefficient: the sender has to wait
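The sequence and ACK numbers in the diagram, and the cost of waiting, can be reproduced with a short sketch. The function names are hypothetical, and the model assumes no loss and exactly one segment in flight per RTT.

```python
def stop_and_wait(first_seq, segment_len, n_segments):
    """Yield (sequence number, expected ACK) for each segment, one per RTT."""
    seq = first_seq
    for _ in range(n_segments):
        ack = seq + segment_len  # the ACK names the next byte the receiver expects
        yield seq, ack
        seq = ack                # the next segment starts where the ACK points


def stop_and_wait_rate_bps(segment_len, rtt_s):
    """Upper bound on throughput when only one segment is sent per RTT."""
    return segment_len * 8 / rtt_s
```

For example, list(stop_and_wait(1024, 1024, 2)) gives the two exchanges in the diagram, and a 1024-byte segment with an assumed 100 ms RTT is limited to roughly 82 kbit/s however fast the link is.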

Summer School Brasov Romania July 2005 R Hughes-Jones Manchester 22

Flow Control: Sender - Congestion Window

Uses the congestion window, cwnd, a sliding window to control the data flow
  Byte count giving the highest byte that can be sent without an ACK
  Transmit buffer size and advertised receive buffer size are important
  An ACK gives the next sequence number to receive AND the available space in the receive buffer
  A timer is kept for each packet

[Figure: the sender's sliding window over the byte stream. Labelled regions: data sent and ACKed; sent data, buffered waiting for an ACK; unsent data that may be transmitted immediately; data to be sent, waiting for the window to open (the application writes here). The TCP cwnd slides: a received ACK advances the trailing edge, the receiver's advertised window advances the leading edge, and the sending host advances a marker as data is transmitted.]
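The bookkeeping behind the sliding window can be sketched as one function. The names snd_una (oldest unacknowledged byte) and snd_nxt (next byte to send) are conventional TCP variable names rather than terms from the slide, and taking the minimum of cwnd and the advertised window is the usual rule, stated here as an assumption.

```python
def usable_window(snd_una, snd_nxt, cwnd, rwnd):
    """Bytes the sender may still transmit without waiting for another ACK.

    snd_una: sequence number of the oldest unacknowledged byte (trailing edge)
    snd_nxt: sequence number of the next byte to be sent (the moving marker)
    cwnd:    congestion window in bytes
    rwnd:    receiver's advertised window in bytes
    """
    in_flight = snd_nxt - snd_una      # sent data, buffered waiting for an ACK
    window = min(cwnd, rwnd)           # the tighter of the two limits applies
    return max(0, window - in_flight)  # never negative if the window shrinks
```

With snd_una=1024, snd_nxt=4096, cwnd=8192 and rwnd=4096, 3072 bytes are in flight and 1024 more may be sent immediately.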

Summer School Brasov Romania July 2005 R Hughes-Jones Manchester 23

Flow Control: Receiver - Lost Data

[Figure: the receiver's sliding window over the byte stream. Labelled regions: data given to the application (the application reads here); data ACKed but not yet given to the user; lost data; data received but not ACKed. The "last ACK given" marker is the next byte expected (the expected sequence number). The window slides, and the receiver's advertised window advances the leading edge.]

If new data is received with a sequence number not equal to the next byte expected, a duplicate ACK is sent with the expected sequence number.
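The duplicate ACK rule can be made concrete with a small receiver model. This is a sketch: receiver_acks is a hypothetical name, segments are (sequence number, length) pairs, and the model assumes the receiver buffers out-of-order data and ACKs every arrival.

```python
def receiver_acks(segments, next_expected):
    """Return the ACK number sent in response to each arriving segment."""
    buffered = {}  # out-of-order segments, keyed by sequence number
    acks = []
    for seq, length in segments:
        if seq == next_expected:
            next_expected += length
            # Buffered segments that are now contiguous can be delivered too.
            while next_expected in buffered:
                next_expected += buffered.pop(next_expected)
        elif seq > next_expected:
            buffered[seq] = length  # a gap exists: hold this data for later
        # Out-of-order or old data leaves next_expected unchanged,
        # so the ACK sent below repeats the previous one: a duplicate ACK.
        acks.append(next_expected)
    return acks
```

If the segment at 2048 is delayed behind those at 3072 and 4096, the receiver answers 2048 three times and then jumps to 5120 once the gap is filled.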

Summer School Brasov Romania July 2005 R Hughes-Jones Manchester 24

How it works: TCP Slowstart

Probe the network: get a rough estimate of the optimal congestion window size
The larger the window size, the higher the throughput
  Throughput = Window size / Round-trip time
Exponentially increase the congestion window size until a packet is lost
  cwnd is initially 1 MTU, then increased by 1 MTU for each ACK received
  Send 1st packet, get 1 ACK, increase cwnd to 2
  Send 2 packets, get 2 ACKs, increase cwnd to 4
  Time to reach cwnd size W = RTT * log2(W)
  Rate doubles each RTT

[Figure: CWND versus time, showing slow start (exponential increase), congestion avoidance (linear increase), packet loss, retransmit timeout, and slow start again.]
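The doubling per RTT and the log2(W) ramp-up time can be checked with a short sketch. The function names are hypothetical, cwnd is counted in segments, and the model assumes no loss and one ACK for every segment sent.

```python
import math


def slow_start_rtts(target_window):
    """Count the RTTs slow start needs to grow cwnd from 1 to target_window."""
    cwnd = 1
    rtts = 0
    while cwnd < target_window:
        cwnd += cwnd  # one ACK per segment in flight, each ACK adds one segment
        rtts += 1
    return rtts


def throughput_bps(window_bytes, rtt_s):
    """Throughput = window size / round-trip time, in bits per second."""
    return window_bytes * 8 / rtt_s
```

For a target of 1024 segments the loop takes 10 RTTs, which matches math.log2(1024); for targets that are not powers of two the count is log2(W) rounded up.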

Summer School Brasov Romania July 2005 R Hughes-Jones Manchester 25

How it works: TCP Congestion Avoidance

Additive increase: starting from the rough estimate, linearly increase the congestion window size to probe for additional available bandwidth
  cwnd increased by 1 MTU for each RTT: linear increase in rate
TCP takes packet loss as an indication of congestion
Multiplicative decrease: cut the congestion window size aggressively if a packet is lost
  Standard TCP reduces cwnd by a factor of 0.5 (it halves the window)
Slow start to congestion avoidance transition is determined by ssthresh

[Figure: CWND versus time, showing slow start (exponential increase), congestion avoidance (linear increase), packet loss, retransmit timeout, and slow start again.]
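Additive increase and multiplicative decrease together give the familiar sawtooth. A minimal sketch, assuming cwnd is counted in segments, grows by one segment per loss-free RTT, and is halved in any RTT in which a loss is detected; the function name aimd_trace and the loss schedule are hypothetical.

```python
def aimd_trace(rtts, loss_rtts, start_cwnd, decrease=0.5):
    """Return cwnd (in segments) after each RTT under AIMD.

    loss_rtts is the set of RTT indices in which a loss is detected.
    """
    cwnd = float(start_cwnd)
    trace = []
    for rtt in range(rtts):
        if rtt in loss_rtts:
            cwnd = max(1.0, cwnd * decrease)  # multiplicative decrease
        else:
            cwnd += 1.0                       # additive increase, 1 segment per RTT
        trace.append(cwnd)
    return trace
```

Starting at 10 segments with a loss in the third RTT, the window climbs to 12, drops to 6, and then needs six further loss-free RTTs to get back to 12.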

Summer School Brasov Romania July 2005 R Hughes-Jones Manchester 26

TCP Fast Retransmit & Recovery

Duplicate ACKs are due to lost segments or segments arriving out of order
Fast Retransmit: if the sender receives 3 duplicate ACKs (i.e. the receiver got 3 additional segments without getting the one expected)
  Send the missing segment
  Set ssthresh to 0.5 * cwnd, so enter the congestion avoidance phase
  Set cwnd = (0.5 * cwnd + 3), where the 3 accounts for the 3 dup ACKs
  Increase cwnd by 1 segment for each further duplicate ACK
  Keep sending new data if allowed by cwnd
  Set cwnd to half its original value on a new ACK
  No need to go into "slow start" again

At steady state, CWND oscillates around the optimal window size
With a retransmission timeout, slow start is triggered again

[Figure: CWND versus time, showing slow start (exponential increase), congestion avoidance (linear increase), packet loss, retransmit timeout, and slow start again.]
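The window arithmetic in these steps can be followed in a small state machine. This is a sketch in the style of TCP Reno, not a full implementation: the class name RenoSender is hypothetical, cwnd is counted in segments, normal window growth on new ACKs is omitted, and halving ssthresh on a timeout is an assumption taken from standard TCP rather than from the slide.

```python
class RenoSender:
    """Tracks cwnd and ssthresh (in segments) through fast retransmit/recovery."""

    def __init__(self, cwnd, ssthresh):
        self.cwnd = cwnd
        self.ssthresh = ssthresh
        self.dup_acks = 0
        self.in_recovery = False
        self.retransmits = 0

    def on_duplicate_ack(self):
        self.dup_acks += 1
        if self.in_recovery:
            self.cwnd += 1  # each further dup ACK means a segment left the network
        elif self.dup_acks == 3:
            self.ssthresh = max(2, self.cwnd // 2)
            self.cwnd = self.ssthresh + 3  # the 3 dup ACKs already received
            self.in_recovery = True
            self.retransmits += 1          # resend the missing segment now

    def on_new_ack(self):
        if self.in_recovery:
            self.cwnd = self.ssthresh      # half the original window, no slow start
            self.in_recovery = False
        self.dup_acks = 0

    def on_timeout(self):
        self.ssthresh = max(2, self.cwnd // 2)  # assumption: standard TCP behaviour
        self.cwnd = 1                           # slow start is triggered again
        self.dup_acks = 0
        self.in_recovery = False
```

Starting from cwnd = 20, three duplicate ACKs give ssthresh = 10 and cwnd = 13; the new ACK that ends recovery leaves cwnd at 10, whereas a timeout would drop it to 1.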

Summer School Brasov Romania July 2005 R Hughes-Jones Manchester 27

TCP Simple Tuning - Filling the Pipe

Remember, TCP has to hold a copy of the data in flight
Optimal (TCP buffer) window size depends on:
  Bandwidth end to end, i.e. min(BW of links), also known as the bottleneck bandwidth
  Round Trip Time (RTT)
The number of bytes in flight needed to fill the entire path:
  Bandwidth*Delay Product: BDP = RTT * BW
  Matching the window to the BDP can increase bandwidth by orders of magnitude
Windows are also used for flow control

[Figure: time-sequence diagram between Sender and Receiver showing segments in flight, the returning ACK, and the RTT. Segment time on wire = bits in segment / BW]
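A worked calculation shows why the buffer size matters so much on long paths. This is a sketch: the function names are hypothetical, and the 1 Gbit/s bottleneck, 100 ms RTT and 64 KiB default window are illustrative values, not measurements from the talk.

```python
def bdp_bytes(bandwidth_bps, rtt_s):
    """Bandwidth*Delay Product: bytes in flight needed to fill the path."""
    return bandwidth_bps * rtt_s / 8


def window_limited_rate_bps(window_bytes, rtt_s):
    """Best possible rate when at most window_bytes can be in flight."""
    return window_bytes * 8 / rtt_s


def segment_time_s(segment_bytes, bandwidth_bps):
    """Time one segment occupies the wire: bits in segment / BW."""
    return segment_bytes * 8 / bandwidth_bps


# Illustrative path: 1 Gbit/s bottleneck, 100 ms round trip
needed = bdp_bytes(1e9, 0.1)                       # window needed to fill the pipe
default = window_limited_rate_bps(64 * 1024, 0.1)  # rate with a 64 KiB window
```

Under these assumptions the path needs about 12.5 Mbytes in flight, while a 64 KiB window caps the transfer at roughly 5 Mbit/s, more than two orders of magnitude below the link rate.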

Summer School Brasov Romania July 2005 R Hughes-Jones Manchester 28

Congestion control: ACK clocking

Summer School Brasov Romania July 2005 R Hughes-Jones Manchester 29

More Information

Lectures, tutorials etc. on TCP/IP:
  wwwnvccvaushomejoneytcp_iphtm
  wwwcspdxedu~jrbtcpiplectureshtml
  wwwraleighibmcomcgi-binbookmgrBOOKSEZ306200CCONTENTS
  wwwciscocomunivercdcctddocproductiaabucentri4userscf4ap1htm
  wwwcisohio-stateeduhtbinrfcrfc1180html
  wwwjbmelectronicscomtcphtm
Encyclopaedia: httpwwwfreesoftorgCIEindexhtm
TCP/IP resources: wwwprivateorgiltcpip_rlhtml
Understanding IP addresses: httpwww3comcomsolutionsen_USncs501302html
Configuring TCP (RFC 1122): ftpnicmeriteduinternetdocumentsrfcrfc1122txt
Assigned protocols, ports etc. (RFC 1010): httpwwwesnetpubrfcsrfc1010txt and /etc/protocols

Summer School Brasov Romania July 2005 R Hughes-Jones Manchester 30

Any Questions?

Summer School Brasov Romania July 2005 R Hughes-Jones Manchester 31

Backup Slides

Summer School Brasov Romania July 2005 R Hughes-Jones Manchester 32

More Information: Some URLs

UKLight web site: httpwwwuklightacuk
MB-NG project web site: httpwwwmb-ngnet
DataTAG project web site: httpwwwdatatagorg
UDPmon / TCPmon kit + writeup: httpwwwhepmanacuk~richnet
Motherboard and NIC tests:
  httpwwwhepmanacuk~richnetnicGigEth_tests_Bostonppt and httpdatatagwebcernchdatatagpfldnet2003
  "Performance of 1 and 10 Gigabit Ethernet Cards with Server Quality Motherboards", FGCS Special issue 2004, http wwwhepmanacuk~rich
TCP tuning information may be found at:
  httpwwwncnenlanrnetdocumentationfaqperformancehtml and httpwwwpscedunetworkingperf_tunehtml
TCP stack comparisons:
  "Evaluation of Advanced TCP Stacks on Fast Long-Distance Production Networks", Journal of Grid Computing 2004
PFLDnet: httpwwwens-lyonfrLIPRESOpfldnet2005
Dante PERT: httpwwwgeant2netservershownav00d00h002

Summer School Brasov Romania July 2005 R Hughes-Jones Manchester 33

tcpdump / tcptrace

tcpdump: dump all TCP header information for a specified source/destination
  ftpftpeelblgov
tcptrace: format tcpdump output for analysis using xplot
  httpwwwtcptraceorg
NLANR TCP Testrig: nice wrapper for the tcpdump and tcptrace tools
  httpwwwncnenlanrnetTCPtestrig

Sample use:
  tcpdump -s 100 -w /tmp/tcpdump.out host hostname
  tcptrace -Sl /tmp/tcpdump.out
  xplot /tmp/a2b_tsg.xpl

Summer School Brasov Romania July 2005 R Hughes-Jones Manchester 34

tcptrace and xplot

X axis is time
Y axis is sequence number
The slope of this curve gives the throughput over time
The xplot tool makes it easy to zoom in

Summer School Brasov Romania July 2005 R Hughes-Jones Manchester 35

Zoomed In View

Green line: ACK values received from the receiver
Yellow line: tracks the receive window advertised by the receiver
Green ticks: track the duplicate ACKs received
Yellow ticks: track the window advertisements that were the same as the last advertisement
White arrows: represent segments sent
Red arrows (R): represent retransmitted segments

Summer School Brasov Romania July 2005 R Hughes-Jones Manchester 36

TCP Slow Start

Summer School Brasov Romania July 2005 R Hughes-Jones Manchester 16

The Transport Layer 4 UDP UDP Provides

Connection less service over IPNo setup teardownOne packet at a time

Minimal overhead ndash high performance Provides best effort delivery It is unreliable

Packet may be lostDuplicatedOut of order

Application is responsible for Data reliabilityFlow controlError handling

Summer School Brasov Romania July 2005 R Hughes-Jones Manchester 17

UDP Datagram format

Sourcedestination port port numbers identify sending amp receiving processes

Port number amp IP address allow any application on Internet to be uniquely identified

Ports can be static or dynamic

Static (lt 1024) assigned centrally known as well known ports

Dynamic

Message length in bytes includes the UDP header and data (min 8 max 65535)

8 16 3124

Source port Destination port

UDP message len Checksum (opt)

0

Frame header Application data FCSIP header UDP header

8 Bytes

Summer School Brasov Romania July 2005 R Hughes-Jones Manchester 18

The Transport Layer 4 TCP TCP RFC 768 RFC 1122 Provides

Connection orientated service over IPDuring setup the two ends agree on detailsExplicit teardownMultiple connections allowed

Reliable end-to-end Byte Stream delivery over unreliable network It takes care of

Lost packets Duplicated packetsOut of order packets

TCP provides Data buffering Flow controlError detection amp handlingLimits network congestion

Summer School Brasov Romania July 2005 R Hughes-Jones Manchester 19

Code

Source port Destination port

Sequence number

0 8 16 3124

Acknowledgement number

4

Hlen

10

Resv Window

Urgent ptrChecksum

Options (if any) Padding

The TCP Segment FormatFrame header Application data FCSIP header TCP header

20 Bytes

Summer School Brasov Romania July 2005 R Hughes-Jones Manchester 20

TCP Segment Format ndash cont SourceDest port TCP port numbers to ID applications

at both ends of connection Sequence number First byte in segment from senderrsquos

byte stream Acknowledgement identifies the number of the byte the

sender of this segment expects to receive next Code used to determine segment purpose eg SYN

ACK FIN URG Window Advertises how much data this station is willing

to accept Can depend on buffer space remaining Options used for window scaling

SACK timestamps maximum segment size etc

Code

Source port Destination port

Sequence number

Acknowledgement number

Hlen Resv Window

Urgent ptrChecksum

Options (if any) Padding

Summer School Brasov Romania July 2005 R Hughes-Jones Manchester 21

TCP ndash providing reliability Positive acknowledgement (ACK) of each received segment

Sender keeps record of each segment sent Sender awaits an ACK ndash ldquoI am ready to receive byte 2048 and beyondrdquo Sender starts timer when it sends segment ndash so can re-transmit

Segment n

ACK of Segment nRTT

Time

Sender Receiver

Sequence 1024Length 1024

Ack 2048

Segment n+1

ACK of Segment n +1RTT

Sequence 2048Length 1024

Ack 3072

Inefficient ndash sender has to wait

Summer School Brasov Romania July 2005 R Hughes-Jones Manchester 22

Flow Control Sender ndash Congestion Window Uses Congestion window cwnd a sliding window to control the data flow

Byte count giving highest byte that can be sent with out an ACK Transmit buffer size and Advertised Receive buffer size important ACK gives next sequence no to receive AND

The available space in the receive buffer Timer kept for each packet

Unsent Datamay be transmitted immediately

Sent Databuffered waiting ACK

TCP Cwnd slides Data to be sentwaiting for windowto openApplication writes here

Data sent and ACKed

Sending hostadvances markeras data transmitted

Received ACKadvances trailing edge

Receiverrsquos advertisedwindow advances leading edge

Summer School Brasov Romania July 2005 R Hughes-Jones Manchester 23

Flow Control Receiver ndash Lost Data

Received butnot ACKed

ACKed but not given to user

Window slides

Lost data

Data given to application

Last ACK givenNext byte expectedExpected sequence no

Receiverrsquos advertisedwindow advances leading edge

Application reads here

If new data is received with a sequence number ne next byte expected Duplicate ACK is send with the expected sequence number

Summer School Brasov Romania July 2005 R Hughes-Jones Manchester 24

How it works TCP Slowstart Probe the network - get a rough estimate of the optimal congestion window size The larger the window size the higher the throughput

Throughput = Window size Round-trip Time exponentially increase the congestion window size until a packet is lost

cwnd initially 1 MTU then increased by 1 MTU for each ACK received Send 1st packet get 1 ACK increase cwnd to 2 Send 2 packets get 2 ACKs inc cwnd to 4Time to reach cwnd size W = RTTlog2 (W)

Rate doubles each RTT

CWND

slow start exponential

increase

congestion avoidance linear increase

packet loss

time

retransmit slow start

again

timeout

Summer School Brasov Romania July 2005 R Hughes-Jones Manchester 25

additive increase starting from the rough estimate linearly increase the congestion window size to probe for additional available bandwidth cwnd increased by 1 MTU for each ACK ndash linear increase in rate

TCP takes packet loss as indication of congestion multiplicative decrease cut the congestion window size

aggressively if a packet is lost Standard TCP reduces cwnd by 05 Slow start to Congestion avoidance transition determined by ssthresh

CWND

slow start exponential

increase

congestion avoidance linear increase

packet loss

time

retransmit slow start

again

timeout

How it works TCP Congestion Avoidance

Summer School Brasov Romania July 2005 R Hughes-Jones Manchester 26

TCP Fast Retransmit amp Recovery Duplicate ACKs are due to lost segments or segments out of order Fast Retransmit If the sender transmits 3 duplicate ACKs

(ie it received 3 additional segments without getting the one expected) Send the missing segment

Set ssthresh to 05cwnd ndash so enter congestion avoidance phaseSet cwnd = (05cwnd +3 ) ndash the 3 dup ACKs Increase cwnd by 1 segment when get duplicate ACKs Keep sending new data if allowed by cwndSet cwnd to half original value on new ACK

no need to go into ldquoslow startrdquo again

At steady state CWND oscillates around the optimal window size With a retransmission timeout slow start is triggered again

CWND

slow start exponential

increase

congestion avoidance linear increase

packet loss

time

retransmit slow start

again

timeoutCWND

slow start exponential

increase

congestion avoidance linear increase

packet loss

time

retransmit slow start

again

timeout

Summer School Brasov Romania July 2005 R Hughes-Jones Manchester 27

TCP Simple Tuning - Filling the Pipe Remember TCP has to hold a copy of data in flight Optimal (TCP buffer) window size depends on

Bandwidth end to end ie min(BWlinks) AKA bottleneck bandwidth

Round Trip Time (RTT)

The number of bytes in flight to fill the entire path BandwidthDelay Product BDP = RTTBW Can increase bandwidth by

orders of magnitude

Windows also used for flow controlRTT

Time

Sender Receiver

ACK

Segment time on wire = bits in segmentBW

Summer School Brasov Romania July 2005 R Hughes-Jones Manchester 28

Congestion control ACK clocking

Summer School Brasov Romania July 2005 R Hughes-Jones Manchester 29

Lectures tutorials etc on TCPIP wwwnvccvaushomejoneytcp_iphtm wwwcspdxedu~jrbtcpiplectureshtml wwwraleighibmcomcgi-binbookmgrBOOKSEZ306200CCONTENTS wwwciscocomunivercdcctddocproductiaabucentri4userscf4ap1htm wwwcisohio-stateeduhtbinrfcrfc1180html wwwjbmelectronicscomtcphtm

Encylopaedia httpwwwfreesoftorgCIEindexhtm

TCPIP Resources wwwprivateorgiltcpip_rlhtml

Understanding IP addresses httpwww3comcomsolutionsen_USncs501302html

Configuring TCP (RFC 1122) ftpnicmeriteduinternetdocumentsrfcrfc1122txt

Assigned protocols ports etc (RFC 1010) httpwwwesnetpubrfcsrfc1010txt amp etcprotocols

More Information

Summer School Brasov Romania July 2005 R Hughes-Jones Manchester 30

Any Questions

Summer School Brasov Romania July 2005 R Hughes-Jones Manchester 31

Backup Slides

Summer School Brasov Romania July 2005 R Hughes-Jones Manchester 32

More Information Some URLs UKLight web site httpwwwuklightacuk MB-NG project web site httpwwwmb-ngnet DataTAG project web site httpwwwdatatagorg UDPmon TCPmon kit + writeup

httpwwwhepmanacuk~richnet Motherboard and NIC Tests

httpwwwhepmanacuk~richnetnicGigEth_tests_Bostonpptamp httpdatatagwebcernchdatatagpfldnet2003 ldquoPerformance of 1 and 10 Gigabit Ethernet Cards with Server Quality Motherboardsrdquo FGCS Special issue 2004 http wwwhepmanacuk~rich

TCP tuning information may be found athttpwwwncnenlanrnetdocumentationfaqperformancehtml amp httpwwwpscedunetworkingperf_tunehtml

TCP stack comparisonsldquoEvaluation of Advanced TCP Stacks on Fast Long-Distance Production Networksrdquo Journal of Grid Computing 2004

PFLDnet httpwwwens-lyonfrLIPRESOpfldnet2005 Dante PERT httpwwwgeant2netservershownav00d00h002

Summer School Brasov Romania July 2005 R Hughes-Jones Manchester 33

tcpdump tcptrace tcpdump dump all TCP header information for a specified

sourcedestination ftpftpeelblgov

tcptrace format tcpdump output for analysis using xplot httpwwwtcptraceorg NLANR TCP Testrig Nice wrapper for tcpdump and tcptrace tools

httpwwwncnenlanrnetTCPtestrig

Sample use tcpdump -s 100 -w tmptcpdumpout host hostname tcptrace -Sl tmptcpdumpout xplot tmpa2b_tsgxpl

Summer School Brasov Romania July 2005 R Hughes-Jones Manchester 34

tcptrace and xplot X axis is time Y axis is sequence number the slope of this curve gives the throughput over time xplot tool make it easy to zoom in

Summer School Brasov Romania July 2005 R Hughes-Jones Manchester 35

Zoomed In View Green Line ACK values received from the receiver Yellow Line tracks the receive window advertised from the receiver Green Ticks track the duplicate ACKs received Yellow Ticks track the window advertisements that were the same as the

last advertisement White Arrows represent segments sent Red Arrows (R) represent retransmitted segments

Summer School Brasov Romania July 2005 R Hughes-Jones Manchester 36

TCP Slow Start

  • TCPIP and Other Transports for High Bandwidth Applications Back to Basics
  • Slide 2
  • Slide 3
  • What is a Protocol Stack
  • The Layering Principle
  • What do the Layers do
  • How do the ldquoIPrdquo Protocols fit together
  • Some of the ldquoIPrdquo Protocols
  • The Physical Layer 1 Ethernet
  • The Link Layer 2 Ethernet Frame
  • Slide 11
  • The Network Layer 3 IP
  • The Internet datagram
  • IP Datagram Format (cont)
  • Internet Class-based addresses
  • The Transport Layer 4 UDP
  • UDP Datagram format
  • The Transport Layer 4 TCP
  • The TCP Segment Format
  • TCP Segment Format ndash cont
  • TCP ndash providing reliability
  • Flow Control Sender ndash Congestion Window
  • Flow Control Receiver ndash Lost Data
  • How it works TCP Slowstart
  • Slide 25
  • TCP Fast Retransmit amp Recovery
  • TCP Simple Tuning - Filling the Pipe
  • Congestion control ACK clocking
  • More Information
  • Slide 30
  • Slide 31
  • More Information Some URLs
  • tcpdump tcptrace
  • tcptrace and xplot
  • Zoomed In View
  • TCP Slow Start
Page 11: Summer School, Brasov, Romania, July 2005, R. Hughes-Jones Manchester 1 TCP/IP and Other Transports for High Bandwidth Applications Back to Basics Richard

Summer School Brasov Romania July 2005 R Hughes-Jones Manchester 11

The Link Layer Ethernet VLANs

VLANS are logical networks built over the same physical cable plant Distinguishes Ethernet frames betweentheir logical networks using VLANheader

VLAN is defined by the use of value 0x8100 in the Type field location

The next two octets are composed of the following three fields

User Priority field This field is 3 bits in length and is used to define the priority of the Ethernet frame This is utilized to define and deliver a class of service

Canonical format indicator This is 1 bit in length Just donrsquot ask

VLAN Identifier field This field is 12 bits in length and contains the VLAN identifier (VID) of this frame

The original LengthType field will then follow the inserted VLAN tag

Summer School Brasov Romania July 2005 R Hughes-Jones Manchester 12

The Network Layer 3 IP IP Layer properties

Provides best effort delivery It is unreliable

Packet may be lostDuplicatedOut of order

Connection less Provides logical addresses Provides routing Demultiplex data on protocol number

Summer School Brasov Romania July 2005 R Hughes-Jones Manchester 13

The Internet datagram

31HlenVers Type of serv Total length

0 8 16

Identification Flags

244

Fragment offset

19

TTL Protocol Header Checksum

Source IP address

Destination IP address

IP Options (if any) Padding

20 Bytes

Frame header Transport FCSIP header

Summer School Brasov Romania July 2005 R Hughes-Jones Manchester 14

IP Datagram Format (cont) Type of Service ndash TOS

now being used for QoS Total length length of datagram

in bytes includes header and data Time to live ndash TTL specifies how long datagram is

allowed to remain in internet Routers decrement by 1 When TTL = 0 router discards datagram Prevents infinite loops

Protocol specifies the format of the data area Protocol numbers administered by central authority to guarantee

agreement eg ICMP=1 TCP=6 UDP=17 hellip Source amp destination IP address (32 bits each) contain

IP address of sender and intended recipient Options (variable length) Mainly used to record a route

or timestamps or specify routing

HlenVers TOS Total length

Identification Flags Fragment offset

TTL Protocol Header Checksum

Source IP address

Destination IP address

IP Options (if any) Padding

HlenVers TOS Total length

Identification Flags Fragment offset

TTL Protocol Header Checksum

Source IP address

Destination IP address

IP Options (if any) Padding

Summer School Brasov Romania July 2005 R Hughes-Jones Manchester 15

Internet Class-based addresses An Address looks like 19216822123

Class A large number of hosts few networks 0nnnnnnn hhhhhhhh hhhhhhhh hhhhhhhh

7 network bits (0 and 127 reserved so 126 networks) 24 host bits (gt 16M hostsnet)

Initial byte 1-127 (decimal) Class B medium number of hosts and networks

10nnnnnn nnnnnnnn hhhhhhhh hhhhhhhh16384 class B networks 65534 hostsnetworkInitial byte 128-191 (decimal)

Class C large number of small networks 110nnnnn nnnnnnnn nnnnnnnn hhhhhhhh

2097152 networks 254 hostsnetworkInitial byte 192-223 (decimal)

Class D Multicast (See RFC 1112) 1110nnnn nnnnnnnn nnnnnnnn hhhhhhhh

Initial byte 224-239 (decimal) Class E Reserved

Initial byte 248-255 (decimal)

Summer School Brasov Romania July 2005 R Hughes-Jones Manchester 16

The Transport Layer 4 UDP UDP Provides

Connection less service over IPNo setup teardownOne packet at a time

Minimal overhead ndash high performance Provides best effort delivery It is unreliable

Packet may be lostDuplicatedOut of order

Application is responsible for Data reliabilityFlow controlError handling

Summer School Brasov Romania July 2005 R Hughes-Jones Manchester 17

UDP Datagram format

Sourcedestination port port numbers identify sending amp receiving processes

Port number amp IP address allow any application on Internet to be uniquely identified

Ports can be static or dynamic

Static (lt 1024) assigned centrally known as well known ports

Dynamic

Message length in bytes includes the UDP header and data (min 8 max 65535)

8 16 3124

Source port Destination port

UDP message len Checksum (opt)

0

Frame header Application data FCSIP header UDP header

8 Bytes

Summer School Brasov Romania July 2005 R Hughes-Jones Manchester 18

The Transport Layer 4 TCP TCP RFC 768 RFC 1122 Provides

Connection orientated service over IPDuring setup the two ends agree on detailsExplicit teardownMultiple connections allowed

Reliable end-to-end Byte Stream delivery over unreliable network It takes care of

Lost packets Duplicated packetsOut of order packets

TCP provides Data buffering Flow controlError detection amp handlingLimits network congestion

The Network Layer 3: IP

IP layer properties:
  • Provides best-effort delivery
  • It is unreliable: a packet may be lost, duplicated, or delivered out of order
  • Connectionless
  • Provides logical addresses
  • Provides routing
  • Demultiplexes data on protocol number

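The Internet datagram

IP header layout (32-bit words, bit offsets in parentheses; 20 bytes without options):
  • Word 1: Vers (0-3), Hlen (4-7), Type of service (8-15), Total length (16-31)
  • Word 2: Identification (0-15), Flags (16-18), Fragment offset (19-31)
  • Word 3: TTL (0-7), Protocol (8-15), Header checksum (16-31)
  • Word 4: Source IP address
  • Word 5: Destination IP address
  • IP options (if any), padded to a 32-bit boundary

Position in the frame: Frame header | IP header | Transport | FCS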

IP Datagram Format (cont)
  • Type of Service (TOS): now being used for QoS
  • Total length: length of the datagram in bytes, including header and data
  • Time to live (TTL): specifies how long the datagram is allowed to remain in the internet. Each router decrements it by 1; when TTL = 0 the router discards the datagram. This prevents infinite loops.
  • Protocol: specifies the format of the data area. Protocol numbers are administered by a central authority to guarantee agreement, e.g. ICMP=1, TCP=6, UDP=17, ...
  • Source & destination IP address (32 bits each): the IP addresses of the sender and the intended recipient
  • Options (variable length): mainly used to record a route or timestamps, or to specify routing
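The field list above maps directly onto a fixed 20-byte structure, so it can be decoded with a single unpack call. The following is a minimal sketch, not part of the original slides: the function name `parse_ipv4_header` and the hand-built `sample` header are illustrative assumptions, and options and checksum verification are ignored.

```python
import socket
import struct


def parse_ipv4_header(raw: bytes) -> dict:
    """Decode the fixed 20-byte part of an IPv4 header (options are ignored)."""
    if len(raw) < 20:
        raise ValueError("an IPv4 header is at least 20 bytes long")
    # Network byte order: Vers/Hlen, TOS, Total length, Identification,
    # Flags/Fragment offset, TTL, Protocol, Header checksum, Source, Destination
    (ver_hlen, tos, total_length, ident, flags_frag,
     ttl, protocol, checksum, src, dst) = struct.unpack("!BBHHHBBH4s4s", raw[:20])
    return {
        "version": ver_hlen >> 4,
        "header_bytes": (ver_hlen & 0x0F) * 4,   # Hlen counts 32-bit words
        "tos": tos,
        "total_length": total_length,
        "identification": ident,
        "flags": flags_frag >> 13,               # top 3 bits
        "fragment_offset": flags_frag & 0x1FFF,  # low 13 bits
        "ttl": ttl,
        "protocol": protocol,                    # e.g. ICMP=1, TCP=6, UDP=17
        "header_checksum": checksum,
        "source": socket.inet_ntoa(src),
        "destination": socket.inet_ntoa(dst),
    }


# Hypothetical header built by hand for illustration (checksum left at 0).
sample = struct.pack("!BBHHHBBH4s4s", 0x45, 0, 40, 1, 0, 64, 6, 0,
                     socket.inet_aton("10.0.0.1"), socket.inet_aton("10.0.0.2"))
header = parse_ipv4_header(sample)
```

Here `header["protocol"]` is the value a receiving host would use to demultiplex the payload to TCP, UDP or ICMP.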

Internet Class-based addresses

An address looks like 192.168.22.123

  • Class A: large number of hosts, few networks
      0nnnnnnn hhhhhhhh hhhhhhhh hhhhhhhh
      7 network bits (0 and 127 reserved, so 126 networks); 24 host bits (> 16M hosts/net)
      Initial byte 1-126 (decimal)
  • Class B: medium number of hosts and networks
      10nnnnnn nnnnnnnn hhhhhhhh hhhhhhhh
      16384 class B networks; 65534 hosts/network
      Initial byte 128-191 (decimal)
  • Class C: large number of small networks
      110nnnnn nnnnnnnn nnnnnnnn hhhhhhhh
      2097152 networks; 254 hosts/network
      Initial byte 192-223 (decimal)
  • Class D: Multicast (see RFC 1112)
      1110 followed by a 28-bit multicast group address
      Initial byte 224-239 (decimal)
  • Class E: Reserved
      Initial byte 240-255 (decimal)
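Because the class is encoded in the leading bits, the initial byte alone decides it. A small sketch of that lookup follows; the function name `address_class` is an assumption for illustration, and the ranges are the standard classful ones (Class E taken as 240-255).

```python
def address_class(address: str) -> str:
    """Return the classful address class (A-E) of a dotted-decimal IPv4 address."""
    first = int(address.split(".")[0])
    if not 0 <= first <= 255:
        raise ValueError("first byte must be in the range 0-255")
    if first < 128:      # leading bit  0
        return "A"
    if first < 192:      # leading bits 10
        return "B"
    if first < 224:      # leading bits 110
        return "C"
    if first < 240:      # leading bits 1110 (multicast)
        return "D"
    return "E"           # leading bits 1111 (reserved)
```

For example, `address_class("192.168.22.123")` falls in the 192-223 range and so is reported as class C.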

The Transport Layer 4: UDP

UDP provides:
  • Connectionless service over IP: no setup or teardown, one packet at a time
  • Minimal overhead: high performance
  • Best-effort delivery; it is unreliable: a packet may be lost, duplicated, or delivered out of order

The application is responsible for:
  • Data reliability
  • Flow control
  • Error handling

UDP Datagram format
  • Source/destination port: port numbers identify the sending & receiving processes
  • Port number & IP address together allow any application on the Internet to be uniquely identified
  • Ports can be static or dynamic
      • Static (< 1024): assigned centrally, known as well-known ports
      • Dynamic
  • Message length: in bytes, includes the UDP header and data (min 8, max 65535)

UDP header layout (8 bytes, bit offsets in parentheses):
  • Word 1: Source port (0-15), Destination port (16-31)
  • Word 2: UDP message length (0-15), Checksum, optional (16-31)

Position in the frame: Frame header | IP header | UDP header | Application data | FCS
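The 8-byte header is simple enough to build and decode by hand. The sketch below is illustrative only: `build_udp_datagram` and `parse_udp_datagram` are hypothetical helper names, and the checksum is left at 0, which over IPv4 means "not computed".

```python
import struct

UDP_HEADER = struct.Struct("!HHHH")  # source port, destination port, length, checksum


def build_udp_datagram(src_port: int, dst_port: int, payload: bytes) -> bytes:
    """Prefix the payload with an 8-byte UDP header (checksum left at 0)."""
    length = UDP_HEADER.size + len(payload)   # length covers header and data
    if length > 65535:
        raise ValueError("UDP message length is limited to 65535 bytes")
    return UDP_HEADER.pack(src_port, dst_port, length, 0) + payload


def parse_udp_datagram(datagram: bytes):
    """Return (source port, destination port, length, checksum, payload)."""
    src_port, dst_port, length, checksum = UDP_HEADER.unpack(datagram[:UDP_HEADER.size])
    return src_port, dst_port, length, checksum, datagram[UDP_HEADER.size:length]
```

An empty payload gives the minimum message length of 8, matching the "min 8" on the slide.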

The Transport Layer 4: TCP

TCP (RFC 793, RFC 1122) provides:
  • Connection-oriented service over IP: during setup the two ends agree on details; explicit teardown; multiple connections allowed
  • Reliable end-to-end byte stream delivery over an unreliable network. It takes care of lost packets, duplicated packets, and out-of-order packets.

TCP also provides:
  • Data buffering
  • Flow control
  • Error detection & handling
  • Limits on network congestion

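The TCP Segment Format

TCP header layout (32-bit words, bit offsets in parentheses; 20 bytes without options):
  • Word 1: Source port (0-15), Destination port (16-31)
  • Word 2: Sequence number
  • Word 3: Acknowledgement number
  • Word 4: Hlen (0-3), Resv (4-9), Code (10-15), Window (16-31)
  • Word 5: Checksum (0-15), Urgent ptr (16-31)
  • Options (if any), padded to a 32-bit boundary

Position in the frame: Frame header | IP header | TCP header | Application data | FCS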

TCP Segment Format (cont)
  • Source/Dest port: TCP port numbers to identify the applications at both ends of the connection
  • Sequence number: the first byte in the segment from the sender's byte stream
  • Acknowledgement: identifies the number of the byte the sender of this segment expects to receive next
  • Code: used to determine the segment purpose, e.g. SYN, ACK, FIN, URG
  • Window: advertises how much data this station is willing to accept; can depend on the buffer space remaining
  • Options: used for window scaling, SACK, timestamps, maximum segment size, etc.
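As with IP, the fixed part of the TCP header can be decoded in one step. This is a hedged sketch rather than anything from the talk: `parse_tcp_header` and `FLAG_BITS` are illustrative names, only the six classic Code bits are decoded, and options are skipped.

```python
import struct

# The six classic Code bits, least significant first.
FLAG_BITS = {"FIN": 0x01, "SYN": 0x02, "RST": 0x04,
             "PSH": 0x08, "ACK": 0x10, "URG": 0x20}


def parse_tcp_header(raw: bytes) -> dict:
    """Decode the fixed 20-byte part of a TCP header (options are ignored)."""
    if len(raw) < 20:
        raise ValueError("a TCP header is at least 20 bytes long")
    (src_port, dst_port, seq, ack, hlen_code,
     window, checksum, urgent) = struct.unpack("!HHIIHHHH", raw[:20])
    code = hlen_code & 0x3F
    return {
        "source_port": src_port,
        "destination_port": dst_port,
        "sequence": seq,                        # first byte carried in this segment
        "acknowledgement": ack,                 # next byte the sender expects to receive
        "header_bytes": (hlen_code >> 12) * 4,  # Hlen counts 32-bit words
        "flags": [name for name, bit in FLAG_BITS.items() if code & bit],
        "window": window,
        "checksum": checksum,
        "urgent_pointer": urgent,
    }
```

Feeding it a segment with the SYN and ACK bits set returns `["SYN", "ACK"]` in the `flags` entry, which is how the second step of connection setup would appear.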

TCP: providing reliability
  • Positive acknowledgement (ACK) of each received segment
  • Sender keeps a record of each segment sent
  • Sender awaits an ACK: "I am ready to receive byte 2048 and beyond"
  • Sender starts a timer when it sends a segment, so it can re-transmit

[Figure: sender/receiver time diagram. Segment n (Sequence 1024, Length 1024) is answered by Ack 2048 after one RTT; segment n+1 (Sequence 2048, Length 1024) is answered by Ack 3072 after another RTT.]

Inefficient: the sender has to wait
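To see why waiting for each ACK is inefficient, note that only one segment can be delivered per round trip. The sketch below puts a number on that; the function name is an assumption, and the 100 ms RTT in the usage note is an illustrative value, not a figure from the slides.

```python
def stop_and_wait_throughput_bps(segment_bytes: int, rtt_s: float) -> float:
    """Throughput in bit/s when one segment is sent and ACKed per round trip."""
    if rtt_s <= 0:
        raise ValueError("RTT must be positive")
    return segment_bytes * 8 / rtt_s


# The 1024-byte segments of the diagram, with an assumed 100 ms round-trip time.
rate = stop_and_wait_throughput_bps(1024, 0.100)
```

With these assumed numbers the rate is roughly 82 kbit/s regardless of how fast the link is, which is the motivation for the sliding window that follows.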

Flow Control: Sender - Congestion Window
  • Uses the congestion window, cwnd, a sliding window to control the data flow
  • Byte count giving the highest byte that can be sent without an ACK
  • Transmit buffer size and advertised receive buffer size are important
  • An ACK gives the next sequence number to receive AND the available space in the receive buffer
  • A timer is kept for each packet

[Figure: the sender's byte stream with the TCP cwnd sliding over it. Regions, oldest first: data sent and ACKed; sent data, buffered waiting for an ACK; unsent data that may be transmitted immediately; data to be sent, waiting for the window to open (the application writes here). A received ACK advances the trailing edge, the sending host advances a marker as data is transmitted, and the receiver's advertised window advances the leading edge.]
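The regions in the diagram can be turned into a one-line calculation of how much the sender may transmit right now. This is a simplified sketch under the usual assumption that the effective window is the smaller of cwnd and the receiver's advertised window; the function and parameter names are illustrative.

```python
def usable_window(cwnd: int, advertised_window: int,
                  last_byte_sent: int, last_byte_acked: int) -> int:
    """Bytes the sender may still transmit without waiting for another ACK."""
    in_flight = last_byte_sent - last_byte_acked       # sent data, buffered waiting for ACK
    effective_window = min(cwnd, advertised_window)    # the tighter of the two limits
    return max(0, effective_window - in_flight)
```

For instance, with cwnd = 8192, an advertised window of 4096 and 1024 bytes already in flight, 3072 bytes may be sent immediately; a new ACK or a larger advertisement opens the window further.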

Flow Control: Receiver - Lost Data

[Figure: the receiver's byte stream with the window sliding over it. Regions, oldest first: data given to the application (the application reads here); ACKed but not given to the user; lost data; received but not ACKed. Markers: last ACK given, next byte expected (expected sequence number). The receiver's advertised window advances the leading edge.]

  • If new data is received with a sequence number ≠ the next byte expected, a duplicate ACK is sent with the expected sequence number
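The duplicate-ACK rule can be illustrated with a toy receiver that buffers out-of-order segments and always ACKs the next byte it expects. This is a simplified sketch, assuming one ACK per arriving segment and no delayed ACKs or SACK; `receiver_acks` is a hypothetical name.

```python
def receiver_acks(segments, first_expected):
    """Return the ACK number sent for each arriving (sequence, length) segment."""
    expected = first_expected
    buffered = {}            # received but not yet ACKed: sequence -> length
    acks = []
    for seq, length in segments:
        if seq == expected:
            expected += length
            # Data that was waiting behind the gap can now be acknowledged too.
            while expected in buffered:
                expected += buffered.pop(expected)
        elif seq > expected:
            buffered[seq] = length   # hole in the stream: keep the data, repeat the old ACK
        acks.append(expected)        # always ACK the next byte expected
    return acks


# The segment starting at byte 2048 is delayed behind three later segments.
arrivals = [(1024, 1024), (3072, 1024), (4096, 1024), (5120, 1024), (2048, 1024)]
acks = receiver_acks(arrivals, first_expected=1024)
```

In this run the receiver repeats Ack 2048 three times while the gap is open, then jumps to Ack 6144 once the missing segment arrives.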

How it works: TCP Slowstart
  • Probe the network: get a rough estimate of the optimal congestion window size
  • The larger the window size, the higher the throughput: Throughput = Window size / Round-trip Time
  • Exponentially increase the congestion window size until a packet is lost
  • cwnd is initially 1 MTU, then increased by 1 MTU for each ACK received
      • Send 1st packet, get 1 ACK, increase cwnd to 2
      • Send 2 packets, get 2 ACKs, increase cwnd to 4
  • Time to reach cwnd size W = RTT * log2(W)
  • Rate doubles each RTT

[Figure: CWND versus time, showing slow start (exponential increase), congestion avoidance (linear increase), packet loss, retransmit timeout, and slow start again.]
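The doubling per round trip, and the RTT * log2(W) estimate, can be checked with a few lines of simulation. This is a minimal sketch assuming no loss and one ACK per segment, with cwnd counted in segments; `slow_start_rounds` is an illustrative name.

```python
def slow_start_rounds(target_window: int):
    """Count the round trips slow start needs to grow cwnd from 1 to target_window."""
    cwnd = 1                 # congestion window, in segments
    rounds = 0
    history = [cwnd]
    while cwnd < target_window:
        acks_this_round = cwnd       # every segment sent is ACKed once
        cwnd += acks_this_round      # +1 segment per ACK, so cwnd doubles per RTT
        rounds += 1
        history.append(cwnd)
    return rounds, history
```

For a target window of 64 segments the loop takes 6 rounds, which agrees with log2(64) = 6 round-trip times.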

How it works: TCP Congestion Avoidance
  • Additive increase: starting from the rough estimate, linearly increase the congestion window size to probe for additional available bandwidth
      • cwnd is increased by about 1 MTU for each RTT's worth of ACKs: a linear increase in rate
  • TCP takes packet loss as an indication of congestion
  • Multiplicative decrease: cut the congestion window size aggressively if a packet is lost
      • Standard TCP reduces cwnd by a factor of 0.5
  • The transition from slow start to congestion avoidance is determined by ssthresh

[Figure: CWND versus time, showing slow start (exponential increase), congestion avoidance (linear increase), packet loss, retransmit timeout, and slow start again.]
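Additive increase and multiplicative decrease together give the familiar sawtooth. The sketch below is a toy model, assuming cwnd is counted in segments, grows by one segment per loss-free RTT, and is halved in any RTT in which a loss is detected; the function name and the `loss_at` parameter are illustrative.

```python
def congestion_avoidance(cwnd: float, rtts: int, loss_at=()):
    """Return cwnd after each RTT under additive increase, multiplicative decrease."""
    history = []
    for rtt in range(rtts):
        if rtt in loss_at:
            cwnd = max(1.0, cwnd * 0.5)   # multiplicative decrease on packet loss
        else:
            cwnd += 1.0                   # additive increase: one segment per RTT
        history.append(cwnd)
    return history
```

Starting from 10 segments with a loss in the third RTT, the window goes 11, 12, 6, 7, 8: a slow linear climb after each sharp cut.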

TCP Fast Retransmit & Recovery
  • Duplicate ACKs are due to lost segments or segments out of order
  • Fast Retransmit: if the sender receives 3 duplicate ACKs (i.e. the receiver got 3 additional segments without getting the one expected), it sends the missing segment
      • Set ssthresh to 0.5*cwnd, so enter the congestion avoidance phase
      • Set cwnd = (0.5*cwnd + 3), the 3 accounting for the 3 dup ACKs
      • Increase cwnd by 1 segment for each further duplicate ACK
      • Keep sending new data if allowed by cwnd
      • Set cwnd to half the original value on a new ACK
      • No need to go into "slow start" again
  • At steady state, CWND oscillates around the optimal window size
  • With a retransmission timeout, slow start is triggered again

[Figure: CWND versus time, showing slow start (exponential increase), congestion avoidance (linear increase), packet loss, retransmit timeout, and slow start again.]
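The steps listed above amount to a small state machine on the sender. The following sketch follows those steps under simplifying assumptions: cwnd and ssthresh are counted in whole segments, ssthresh never drops below 2, and the class and method names (`RenoSender`, `on_duplicate_ack`, and so on) are hypothetical.

```python
class RenoSender:
    """Toy model of fast retransmit and fast recovery; windows are in segments."""

    def __init__(self, cwnd: int, ssthresh: int):
        self.cwnd = cwnd
        self.ssthresh = ssthresh
        self.dup_acks = 0
        self.in_recovery = False
        self.retransmissions = 0

    def on_duplicate_ack(self):
        self.dup_acks += 1
        if self.in_recovery:
            self.cwnd += 1                    # each further dup ACK: one more segment has left the network
        elif self.dup_acks == 3:
            self.ssthresh = max(2, self.cwnd // 2)
            self.cwnd = self.ssthresh + 3     # the 3 segments signalled by the dup ACKs
            self.in_recovery = True
            self.retransmissions += 1         # send the missing segment now, no timeout needed

    def on_new_ack(self):
        if self.in_recovery:
            self.cwnd = self.ssthresh         # half the original value; continue in congestion avoidance
            self.in_recovery = False
        self.dup_acks = 0

    def on_timeout(self):
        self.ssthresh = max(2, self.cwnd // 2)
        self.cwnd = 1                         # retransmission timeout: slow start again
        self.dup_acks = 0
        self.in_recovery = False
```

Starting from cwnd = 20, three duplicate ACKs give ssthresh = 10 and cwnd = 13; the next new ACK deflates cwnd to 10 without a return to slow start, whereas a timeout would reset it to 1.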

TCP Simple Tuning - Filling the Pipe
  • Remember, TCP has to hold a copy of the data in flight
  • The optimal (TCP buffer) window size depends on:
      • Bandwidth end to end, i.e. min(BW of the links), AKA the bottleneck bandwidth
      • Round Trip Time (RTT)
  • The number of bytes in flight needed to fill the entire path is the Bandwidth*Delay Product: BDP = RTT * BW
  • Setting the window to the BDP can increase the achieved bandwidth by orders of magnitude
  • Windows are also used for flow control

[Figure: sender/receiver time diagram showing a segment, its ACK, and the RTT.]

Segment time on wire = bits in segment / BW
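The three formulas on this slide are easy to evaluate for a concrete path. This is a small sketch with illustrative function names; the 1 Gbit/s bottleneck, 120 ms RTT and 64 KiB window used in the usage note are assumed example values, not measurements from the talk.

```python
def bdp_bytes(bandwidth_bps: float, rtt_s: float) -> float:
    """Bandwidth*Delay Product: bytes that must be in flight to fill the path."""
    return bandwidth_bps * rtt_s / 8


def window_limited_throughput_bps(window_bytes: float, rtt_s: float) -> float:
    """Throughput = Window size / Round-trip Time, in bit/s."""
    return window_bytes * 8 / rtt_s


def segment_time_on_wire_s(segment_bytes: int, bandwidth_bps: float) -> float:
    """Segment time on wire = bits in segment / BW."""
    return segment_bytes * 8 / bandwidth_bps
```

With the assumed 1 Gbit/s and 120 ms, `bdp_bytes(1e9, 0.120)` is about 15 MB, while a 64 KiB window on the same path is limited to roughly 4.4 Mbit/s; that gap is the "orders of magnitude" the slide refers to.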

Congestion control: ACK clocking

More Information
  • Lectures, tutorials etc. on TCP/IP:
      • www.nvcc.va.us/home/joney/tcp_ip.htm
      • www.cs.pdx.edu/~jrb/tcpip.lectures.html
      • www.raleigh.ibm.com/cgi-bin/bookmgr/BOOKS/EZ306200/CCONTENTS
      • www.cisco.com/univercd/cc/td/doc/product/iaabu/centri4/user/scf4ap1.htm
      • www.cis.ohio-state.edu/htbin/rfc/rfc1180.html
      • www.jbmelectronics.com/tcp.htm
  • Encyclopaedia: http://www.freesoft.org/CIE/index.htm
  • TCP/IP Resources: www.private.org.il/tcpip_rl.html
  • Understanding IP addresses: http://www.3com.com/solutions/en_US/ncs/501302.html
  • Configuring TCP (RFC 1122): ftp://nic.merit.edu/internet/documents/rfc/rfc1122.txt
  • Assigned protocols, ports etc. (RFC 1010): http://www.es.net/pub/rfcs/rfc1010.txt & /etc/protocols

Any Questions?

Backup Slides

More Information: Some URLs
  • UKLight web site: http://www.uklight.ac.uk
  • MB-NG project web site: http://www.mb-ng.net/
  • DataTAG project web site: http://www.datatag.org/
  • UDPmon / TCPmon kit + writeup: http://www.hep.man.ac.uk/~rich/net
  • Motherboard and NIC Tests:
      • http://www.hep.man.ac.uk/~rich/net/nic/GigEth_tests_Boston.ppt & http://datatag.web.cern.ch/datatag/pfldnet2003/
      • "Performance of 1 and 10 Gigabit Ethernet Cards with Server Quality Motherboards", FGCS Special issue 2004, http://www.hep.man.ac.uk/~rich/
  • TCP tuning information may be found at: http://www.ncne.nlanr.net/documentation/faq/performance.html & http://www.psc.edu/networking/perf_tune.html
  • TCP stack comparisons: "Evaluation of Advanced TCP Stacks on Fast Long-Distance Production Networks", Journal of Grid Computing 2004
  • PFLDnet: http://www.ens-lyon.fr/LIP/RESO/pfldnet2005/
  • Dante PERT: http://www.geant2.net/server/show/nav.00d00h002

tcpdump / tcptrace
  • tcpdump: dump all TCP header information for a specified source/destination (ftp://ftp.ee.lbl.gov)
  • tcptrace: format tcpdump output for analysis using xplot (http://www.tcptrace.org/)
  • NLANR TCP Testrig: nice wrapper for the tcpdump and tcptrace tools (http://www.ncne.nlanr.net/TCP/testrig/)

Sample use:
    tcpdump -s 100 -w /tmp/tcpdump.out host hostname
    tcptrace -Sl /tmp/tcpdump.out
    xplot /tmp/a2b_tsg.xpl

tcptrace and xplot
  • X axis is time
  • Y axis is sequence number
  • The slope of this curve gives the throughput over time
  • The xplot tool makes it easy to zoom in
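Reading throughput off the slope of a time-sequence plot is just a difference quotient. The sketch below assumes the trace has already been reduced to (time in seconds, sequence number in bytes) pairs; the function name and the sample values are illustrative, and this is not how tcptrace itself is implemented.

```python
def throughput_bps(samples):
    """Average throughput in bit/s from (time_s, sequence_number) samples."""
    (t_first, seq_first), (t_last, seq_last) = samples[0], samples[-1]
    if t_last <= t_first:
        raise ValueError("samples must span a positive time interval")
    # Slope of the time-sequence curve: bytes per second, converted to bits.
    return (seq_last - seq_first) * 8 / (t_last - t_first)


# Hypothetical trace: 125000 bytes acknowledged over one second.
trace = [(0.0, 0), (0.5, 60000), (1.0, 125000)]
```

Applying the function to a short slice of the samples rather than the whole trace gives the local slope, which is what zooming in with xplot shows visually.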

Zoomed In View
  • Green line: ACK values received from the receiver
  • Yellow line: tracks the receive window advertised by the receiver
  • Green ticks: track the duplicate ACKs received
  • Yellow ticks: track the window advertisements that were the same as the last advertisement
  • White arrows: represent segments sent
  • Red arrows (R): represent retransmitted segments

TCP Slow Start


Zoomed In View Green Line ACK values received from the receiver Yellow Line tracks the receive window advertised from the receiver Green Ticks track the duplicate ACKs received Yellow Ticks track the window advertisements that were the same as the

last advertisement White Arrows represent segments sent Red Arrows (R) represent retransmitted segments

Summer School Brasov Romania July 2005 R Hughes-Jones Manchester 36

TCP Slow Start

  • TCPIP and Other Transports for High Bandwidth Applications Back to Basics
  • Slide 2
  • Slide 3
  • What is a Protocol Stack
  • The Layering Principle
  • What do the Layers do
  • How do the ldquoIPrdquo Protocols fit together
  • Some of the ldquoIPrdquo Protocols
  • The Physical Layer 1 Ethernet
  • The Link Layer 2 Ethernet Frame
  • Slide 11
  • The Network Layer 3 IP
  • The Internet datagram
  • IP Datagram Format (cont)
  • Internet Class-based addresses
  • The Transport Layer 4 UDP
  • UDP Datagram format
  • The Transport Layer 4 TCP
  • The TCP Segment Format
  • TCP Segment Format ndash cont
  • TCP ndash providing reliability
  • Flow Control Sender ndash Congestion Window
  • Flow Control Receiver ndash Lost Data
  • How it works TCP Slowstart
  • Slide 25
  • TCP Fast Retransmit amp Recovery
  • TCP Simple Tuning - Filling the Pipe
  • Congestion control ACK clocking
  • More Information
  • Slide 30
  • Slide 31
  • More Information Some URLs
  • tcpdump tcptrace
  • tcptrace and xplot
  • Zoomed In View
  • TCP Slow Start

IP Datagram Format (cont.)

  • Type of Service (TOS): now being used for QoS
  • Total length: length of the datagram in bytes, includes header and data
  • Time to live (TTL): specifies how long the datagram is allowed to remain in the internet
      • Routers decrement it by 1
      • When TTL = 0 the router discards the datagram
      • Prevents infinite loops
  • Protocol: specifies the format of the data area
      • Protocol numbers are administered by a central authority to guarantee agreement, e.g. ICMP=1, TCP=6, UDP=17, ...
  • Source & destination IP address (32 bits each): contain the IP address of the sender and the intended recipient
  • Options (variable length): mainly used to record a route or timestamps, or to specify routing

[Figure: IP header layout. Vers | Hlen | TOS | Total length; Identification | Flags | Fragment offset; TTL | Protocol | Header Checksum; Source IP address; Destination IP address; IP Options (if any) | Padding]
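As an illustration of the header fields listed on this slide, the fixed 20-byte part of an IPv4 header can be decoded with Python's standard `struct` module. This is a minimal sketch, not a full parser: options are ignored, and the function name and the example addresses and values are made up for the demonstration.

```python
import socket
import struct

def parse_ipv4_header(raw: bytes) -> dict:
    """Decode the fixed 20-byte part of an IPv4 header (options are ignored)."""
    (vers_hlen, tos, total_length, ident, flags_frag,
     ttl, protocol, checksum, src, dst) = struct.unpack("!BBHHHBBH4s4s", raw[:20])
    return {
        "version": vers_hlen >> 4,                  # top 4 bits: Vers
        "header_len_bytes": (vers_hlen & 0x0F) * 4, # Hlen counts 32-bit words
        "tos": tos,
        "total_length": total_length,               # header + data, in bytes
        "ttl": ttl,
        "protocol": protocol,                       # e.g. 1=ICMP, 6=TCP, 17=UDP
        "src": socket.inet_ntoa(src),
        "dst": socket.inet_ntoa(dst),
    }

# Hand-built header with hypothetical values: IPv4, 5-word header, TTL 64, UDP payload.
example = struct.pack("!BBHHHBBH4s4s", 0x45, 0, 28, 0, 0, 64, 17, 0,
                      socket.inet_aton("10.0.0.1"), socket.inet_aton("10.0.0.2"))
header = parse_ipv4_header(example)
```

The `!` prefix selects network (big-endian) byte order, which is how the fields are carried on the wire.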


Internet Class-based addresses

  • An address looks like 192.168.22.123
  • Class A: large number of hosts, few networks
      • 0nnnnnnn hhhhhhhh hhhhhhhh hhhhhhhh
      • 7 network bits (0 and 127 reserved, so 126 networks), 24 host bits (> 16M hosts/net)
      • Initial byte 1-127 (decimal)
  • Class B: medium number of hosts and networks
      • 10nnnnnn nnnnnnnn hhhhhhhh hhhhhhhh
      • 16384 class B networks, 65534 hosts/network
      • Initial byte 128-191 (decimal)
  • Class C: large number of small networks
      • 110nnnnn nnnnnnnn nnnnnnnn hhhhhhhh
      • 2097152 networks, 254 hosts/network
      • Initial byte 192-223 (decimal)
  • Class D: multicast (see RFC 1112)
      • 1110nnnn nnnnnnnn nnnnnnnn hhhhhhhh
      • Initial byte 224-239 (decimal)
  • Class E: reserved
      • Initial byte 240-255 (decimal)
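The class of an address follows directly from the leading bits of its first byte, so it can be read off with a few comparisons. The sketch below is illustrative only: the function name is hypothetical, and it takes Class E to be every leading byte from 240 upward (the 1111 bit prefix).

```python
def address_class(dotted: str) -> str:
    """Return the classful category (A-E) of a dotted-decimal IPv4 address."""
    first = int(dotted.split(".")[0])
    if not 0 <= first <= 255:
        raise ValueError(f"first byte out of range: {first}")
    if first < 128:   # leading bit  0
        return "A"
    if first < 192:   # leading bits 10
        return "B"
    if first < 224:   # leading bits 110
        return "C"
    if first < 240:   # leading bits 1110
        return "D"
    return "E"        # leading bits 1111

classes = [address_class(a) for a in
           ("10.1.2.3", "172.16.0.1", "192.168.22.123", "224.0.0.1", "250.0.0.1")]
```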


The Transport Layer 4: UDP

  • UDP provides a connectionless service over IP
      • No setup or teardown
      • One packet at a time
  • Minimal overhead, high performance
  • Provides best-effort delivery
  • It is unreliable; a packet may be
      • Lost
      • Duplicated
      • Out of order
  • The application is responsible for
      • Data reliability
      • Flow control
      • Error handling


UDP Datagram format

  • Source/destination port: port numbers identify the sending & receiving processes
      • Port number & IP address allow any application on the Internet to be uniquely identified
      • Ports can be static or dynamic
          • Static (< 1024): assigned centrally, known as well-known ports
          • Dynamic
  • Message length: in bytes, includes the UDP header and data (min 8, max 65535)

[Figure: UDP header, 8 bytes, drawn 32 bits wide (bits 0 to 31). Source port | Destination port; UDP message len | Checksum (opt). On the wire: Frame header | IP header | UDP header | Application data | FCS]
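The 8-byte header described above is four 16-bit fields, which makes it easy to build and decode by hand. The following is a sketch under simplifying assumptions: the helper names are hypothetical, and the checksum is left at 0 (meaning "not computed", which is permitted for UDP over IPv4) rather than calculated over the pseudo-header.

```python
import struct

# source port, destination port, message length, checksum: 4 x 16 bits = 8 bytes
UDP_HEADER = struct.Struct("!HHHH")

def build_udp_datagram(src_port: int, dst_port: int, payload: bytes) -> bytes:
    """Prefix the payload with a UDP header; the length covers header + data."""
    length = UDP_HEADER.size + len(payload)
    if length > 65535:
        raise ValueError("UDP message length must fit in 16 bits")
    return UDP_HEADER.pack(src_port, dst_port, length, 0) + payload

def parse_udp_datagram(datagram: bytes):
    """Return (source port, destination port, message length, payload)."""
    src, dst, length, _checksum = UDP_HEADER.unpack(datagram[:UDP_HEADER.size])
    return src, dst, length, datagram[UDP_HEADER.size:length]

datagram = build_udp_datagram(5001, 5002, b"hello")
fields = parse_udp_datagram(datagram)
```

An empty payload gives the minimum message length of 8 quoted on the slide.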


The Transport Layer 4: TCP

  • TCP (RFC 793, RFC 1122) provides a connection-oriented service over IP
      • During setup the two ends agree on details
      • Explicit teardown
      • Multiple connections allowed
  • Reliable end-to-end byte stream delivery over an unreliable network
  • It takes care of
      • Lost packets
      • Duplicated packets
      • Out-of-order packets
  • TCP provides
      • Data buffering
      • Flow control
      • Error detection & handling
      • Limits network congestion


The TCP Segment Format

[Figure: TCP header, 20 bytes without options, drawn 32 bits wide (bits 0 to 31). Source port | Destination port; Sequence number; Acknowledgement number; Hlen | Resv | Code | Window; Checksum | Urgent ptr; Options (if any) | Padding. On the wire: Frame header | IP header | TCP header | Application data | FCS]


TCP Segment Format (cont.)

  • Source/Dest port: TCP port numbers to ID applications at both ends of the connection
  • Sequence number: first byte in the segment from the sender's byte stream
  • Acknowledgement: identifies the number of the byte the sender of this segment expects to receive next
  • Code: used to determine segment purpose, e.g. SYN, ACK, FIN, URG
  • Window: advertises how much data this station is willing to accept; can depend on buffer space remaining
  • Options: used for window scaling, SACK, timestamps, maximum segment size, etc.

[Figure: TCP header layout, as on the previous slide]
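To make the field descriptions concrete, the fixed 20-byte TCP header can be unpacked in the same way as any other binary record. This is a rough sketch: the function name, the flag table and the example port and sequence values are illustrative assumptions, and options are not decoded.

```python
import struct

# Code bits in the low 6 bits of the 16-bit Hlen/Resv/Code word.
FLAG_BITS = {"FIN": 0x01, "SYN": 0x02, "RST": 0x04,
             "PSH": 0x08, "ACK": 0x10, "URG": 0x20}

def parse_tcp_header(raw: bytes) -> dict:
    """Decode the fixed 20-byte part of a TCP header (options are ignored)."""
    (src_port, dst_port, seq, ack, hlen_code,
     window, checksum, urgent) = struct.unpack("!HHIIHHHH", raw[:20])
    return {
        "src_port": src_port,
        "dst_port": dst_port,
        "seq": seq,                                # first byte carried in this segment
        "ack": ack,                                # next byte expected from the peer
        "header_len_bytes": (hlen_code >> 12) * 4, # Hlen counts 32-bit words
        "flags": sorted(n for n, bit in FLAG_BITS.items() if hlen_code & bit),
        "window": window,                          # advertised receive window
    }

# Hypothetical SYN+ACK segment: Hlen = 5 words, Code = SYN | ACK.
example = struct.pack("!HHIIHHHH", 5001, 80, 1024, 2048,
                      (5 << 12) | 0x12, 65535, 0, 0)
tcp = parse_tcp_header(example)
```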


TCP - providing reliability

  • Positive acknowledgement (ACK) of each received segment
      • Sender keeps a record of each segment sent
      • Sender awaits an ACK: "I am ready to receive byte 2048 and beyond"
      • Sender starts a timer when it sends a segment, so it can re-transmit
  • Inefficient: the sender has to wait

[Figure: time sequence between Sender and Receiver. Segment n (Sequence 1024, Length 1024) is answered by Ack 2048 one RTT later; Segment n+1 (Sequence 2048, Length 1024) is answered by Ack 3072 after a further RTT]


Flow Control: Sender - Congestion Window

  • Uses the congestion window, cwnd, a sliding window to control the data flow
      • Byte count giving the highest byte that can be sent without an ACK
      • Transmit buffer size and advertised receive buffer size are important
      • ACK gives the next sequence number to receive AND the available space in the receive buffer
      • Timer kept for each packet

[Figure: sender buffer as the TCP cwnd slides. Regions, oldest to newest: data sent and ACKed; sent data, buffered waiting for ACK; unsent data that may be transmitted immediately; data to be sent, waiting for the window to open (the application writes here). A received ACK advances the trailing edge; the receiver's advertised window advances the leading edge; the sending host advances a marker as data is transmitted]


Flow Control: Receiver - Lost Data

  • If new data is received with a sequence number ≠ the next byte expected, a duplicate ACK is sent with the expected sequence number

[Figure: receiver buffer as the window slides. Regions: data given to application (the application reads here); ACKed but not given to user; lost data; received but not ACKed. The last ACK given marks the next byte expected (the expected sequence number); the receiver's advertised window advances the leading edge]
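The duplicate-ACK rule can be traced with a small model of the receiver. This is only a sketch under simplifying assumptions: segments are given as hypothetical (sequence number, length) pairs, out-of-order data is buffered rather than dropped, and every arriving segment triggers exactly one ACK.

```python
def receiver_acks(segments, first_expected=0):
    """Return the ACK number sent in response to each arriving (seq, length)."""
    expected = first_expected
    buffered = {}                    # out-of-order data held back: seq -> length
    acks = []
    for seq, length in segments:
        if seq == expected:
            expected += length
            # Data buffered beyond the gap may now be contiguous.
            while expected in buffered:
                expected += buffered.pop(expected)
        elif seq > expected:
            buffered[seq] = length   # gap: keep the data, ACK does not advance
        acks.append(expected)        # cumulative ACK = next byte expected
    return acks

# The segment starting at byte 2048 is lost and arrives last as a retransmission.
arrivals = [(0, 1024), (1024, 1024), (3072, 1024),
            (4096, 1024), (5120, 1024), (2048, 1024)]
acks = receiver_acks(arrivals)
```

In this trace the three segments received after the gap each produce a duplicate ACK for byte 2048, and the retransmission lets the ACK jump past all the buffered data.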


How it works: TCP Slowstart

  • Probe the network - get a rough estimate of the optimal congestion window size
  • The larger the window size, the higher the throughput
      • Throughput = Window size / Round-trip Time
  • Exponentially increase the congestion window size until a packet is lost
      • cwnd initially 1 MTU, then increased by 1 MTU for each ACK received
      • Send 1st packet, get 1 ACK, increase cwnd to 2
      • Send 2 packets, get 2 ACKs, increase cwnd to 4
      • Time to reach cwnd size W = RTT * log2(W)
      • Rate doubles each RTT

[Figure: CWND against time. Slow start: exponential increase; congestion avoidance: linear increase; packet loss; retransmit timeout: slow start again]
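The doubling behaviour of slow start can be checked with a few lines of arithmetic. The sketch below counts cwnd in whole segments and assumes an idealised path: no loss, and every segment in flight is ACKed within one RTT. The function name is hypothetical.

```python
import math

def slow_start_rtts(target_segments: int) -> int:
    """Round trips needed for cwnd to grow from 1 segment to the target."""
    cwnd = 1
    rtts = 0
    while cwnd < target_segments:
        # One ACK returns per segment in flight and each ACK adds one segment,
        # so cwnd doubles every round trip.
        cwnd += cwnd
        rtts += 1
    return rtts

rtts_to_1024 = slow_start_rtts(1024)
```

With a window of 1024 segments this gives 10 round trips, matching RTT * log2(W) when time is measured in units of RTT.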


How it works: TCP Congestion Avoidance

  • Additive increase: starting from the rough estimate, linearly increase the congestion window size to probe for additional available bandwidth
      • cwnd increased by about 1 MTU per RTT (1/cwnd of an MTU for each ACK) - linear increase in rate
  • TCP takes packet loss as an indication of congestion
  • Multiplicative decrease: cut the congestion window size aggressively if a packet is lost
      • Standard TCP reduces cwnd by a factor of 0.5
  • Slow start to congestion avoidance transition is determined by ssthresh

[Figure: CWND against time. Slow start: exponential increase; congestion avoidance: linear increase; packet loss; retransmit timeout: slow start again]
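The resulting sawtooth can be reproduced with a toy model of additive increase and multiplicative decrease. This is a sketch, not a TCP implementation: cwnd is counted in segments, it grows by one segment per loss-free RTT, and the round trips in which a loss is detected are supplied by hand as a hypothetical `loss_at` set.

```python
def aimd_trace(rtts: int, loss_at: set, start_cwnd: float = 10.0) -> list:
    """cwnd (in segments) at the end of each RTT under AIMD."""
    cwnd = start_cwnd
    trace = []
    for t in range(rtts):
        if t in loss_at:
            cwnd = max(1.0, cwnd * 0.5)   # multiplicative decrease on loss
        else:
            cwnd += 1.0                   # additive increase: +1 segment per RTT
        trace.append(cwnd)
    return trace

trace = aimd_trace(6, loss_at={3})
```

Here the window climbs from 10 to 13 segments, is halved to 6.5 when the loss is seen, and then resumes its linear climb.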


TCP Fast Retransmit & Recovery

  • Duplicate ACKs are due to lost segments or segments out of order
  • Fast Retransmit: if the sender receives 3 duplicate ACKs (i.e. the receiver got 3 additional segments without getting the one expected)
      • Send the missing segment
      • Set ssthresh to 0.5*cwnd, so enter the congestion avoidance phase
      • Set cwnd = (0.5*cwnd + 3), the 3 accounting for the 3 dup ACKs
      • Increase cwnd by 1 segment for each further duplicate ACK
      • Keep sending new data if allowed by cwnd
      • Set cwnd to half the original value on a new ACK
      • No need to go into "slow start" again
  • At steady state, CWND oscillates around the optimal window size
  • With a retransmission timeout, slow start is triggered again

[Figure: CWND against time. Slow start: exponential increase; congestion avoidance: linear increase; packet loss; retransmit timeout: slow start again]
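The window bookkeeping in these steps can be written as a small state machine. The class below is a simplified sketch of Reno-style fast retransmit and fast recovery: cwnd is counted in segments, the class and attribute names are invented for illustration, and details found in real stacks (such as a lower bound on ssthresh) are left out.

```python
class RenoSender:
    """Tracks cwnd/ssthresh through fast retransmit and fast recovery."""

    def __init__(self, cwnd: float):
        self.cwnd = float(cwnd)
        self.ssthresh = None
        self.dup_acks = 0
        self.in_recovery = False
        self.retransmits = 0

    def on_duplicate_ack(self):
        self.dup_acks += 1
        if self.in_recovery:
            self.cwnd += 1                   # each further dup ACK inflates cwnd
        elif self.dup_acks == 3:
            self.retransmits += 1            # send the missing segment
            self.ssthresh = self.cwnd * 0.5
            self.cwnd = self.ssthresh + 3    # +3 for the 3 dup ACKs
            self.in_recovery = True

    def on_new_ack(self):
        if self.in_recovery:
            self.cwnd = self.ssthresh        # half the original value
            self.in_recovery = False         # continue in congestion avoidance
        self.dup_acks = 0

sender = RenoSender(cwnd=20)
for _ in range(3):
    sender.on_duplicate_ack()
cwnd_after_third_dup = sender.cwnd
for _ in range(2):
    sender.on_duplicate_ack()
cwnd_inflated = sender.cwnd
sender.on_new_ack()
cwnd_after_recovery = sender.cwnd
```

Starting from 20 segments, the window goes to 13 on the third duplicate ACK, inflates to 15 with two more, and settles at 10 once new data is acknowledged, without returning to slow start.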


TCP Simple Tuning - Filling the Pipe

  • Remember, TCP has to hold a copy of the data in flight
  • Optimal (TCP buffer) window size depends on:
      • Bandwidth end to end, i.e. min(BW of the links), AKA the bottleneck bandwidth
      • Round Trip Time (RTT)
  • The number of bytes in flight needed to fill the entire path:
      • Bandwidth*Delay Product: BDP = RTT*BW
      • Sizing the window to the BDP can increase the achieved bandwidth by orders of magnitude
  • Windows are also used for flow control

[Figure: time sequence between Sender and Receiver showing segments and returning ACKs over one RTT; segment time on wire = bits in segment / BW]
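A quick calculation shows why the window size matters so much on long, fast paths. The figures used below (a 1 Gbit/s bottleneck, a 100 ms RTT and a 64 KiB window) are hypothetical example values chosen for illustration, and the function names are not from the slides.

```python
def bdp_bytes(bandwidth_bits_per_s: float, rtt_s: float) -> float:
    """Bandwidth*Delay Product: bytes that must be in flight to fill the path."""
    return bandwidth_bits_per_s * rtt_s / 8

def window_limited_throughput(window_bytes: float, rtt_s: float) -> float:
    """Best-case throughput in bit/s when at most one window is sent per RTT."""
    return window_bytes * 8 / rtt_s

# Hypothetical path: 1 Gbit/s bottleneck, 100 ms round-trip time.
needed_window = bdp_bytes(1e9, 0.100)
# The same path with a 64 KiB window.
small_window_rate = window_limited_throughput(64 * 1024, 0.100)
```

Under these assumptions the path needs about 12.5 Mbytes in flight, while a 64 KiB window caps the transfer at roughly 5 Mbit/s, around 200 times below the link rate.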


Congestion control: ACK clocking


More Information

  • Lectures, tutorials etc. on TCP/IP:
      • www.nvcc.va.us/home/joney/tcp_ip.htm
      • www.cs.pdx.edu/~jrb/tcpip.lectures.html
      • www.raleigh.ibm.com/cgi-bin/bookmgr/BOOKS/EZ306200/CCONTENTS
      • www.cisco.com/univercd/cc/td/doc/product/iaabu/centri4/user/scf4ap1.htm
      • www.cis.ohio-state.edu/htbin/rfc/rfc1180.html
      • www.jbmelectronics.com/tcp.htm
  • Encyclopaedia: http://www.freesoft.org/CIE/index.htm
  • TCP/IP Resources: www.private.org.il/tcpip_rl.html
  • Understanding IP addresses: http://www.3com.com/solutions/en_US/ncs/501302.html
  • Configuring TCP (RFC 1122): ftp://nic.merit.edu/internet/documents/rfc/rfc1122.txt
  • Assigned protocols, ports etc. (RFC 1010): http://www.es.net/pub/rfcs/rfc1010.txt & /etc/protocols


Any Questions?


Backup Slides


More Information: Some URLs

  • UKLight web site: http://www.uklight.ac.uk
  • MB-NG project web site: http://www.mb-ng.net/
  • DataTAG project web site: http://www.datatag.org/
  • UDPmon / TCPmon kit + writeup: http://www.hep.man.ac.uk/~rich/net
  • Motherboard and NIC Tests:
      • http://www.hep.man.ac.uk/~rich/net/nic/GigEth_tests_Boston.ppt & http://datatag.web.cern.ch/datatag/pfldnet2003/
      • "Performance of 1 and 10 Gigabit Ethernet Cards with Server Quality Motherboards", FGCS Special issue 2004, http://www.hep.man.ac.uk/~rich/
  • TCP tuning information may be found at: http://www.ncne.nlanr.net/documentation/faq/performance.html & http://www.psc.edu/networking/perf_tune.html
  • TCP stack comparisons: "Evaluation of Advanced TCP Stacks on Fast Long-Distance Production Networks", Journal of Grid Computing 2004
  • PFLDnet: http://www.ens-lyon.fr/LIP/RESO/pfldnet2005/
  • Dante PERT: http://www.geant2.net/server/show/nav.00d00h002


tcpdump / tcptrace

  • tcpdump: dump all TCP header information for a specified source/destination
      • ftp://ftp.ee.lbl.gov/
  • tcptrace: format tcpdump output for analysis using xplot
      • http://www.tcptrace.org/
  • NLANR TCP Testrig: nice wrapper for the tcpdump and tcptrace tools
      • http://www.ncne.nlanr.net/TCP/testrig/
  • Sample use:
        tcpdump -s 100 -w /tmp/tcpdump.out host hostname
        tcptrace -Sl /tmp/tcpdump.out
        xplot /tmp/a2b_tsg.xpl


tcptrace and xplot

  • X axis is time
  • Y axis is sequence number
  • The slope of this curve gives the throughput over time
  • The xplot tool makes it easy to zoom in


Zoomed In View

  • Green Line: ACK values received from the receiver
  • Yellow Line: tracks the receive window advertised from the receiver
  • Green Ticks: track the duplicate ACKs received
  • Yellow Ticks: track the window advertisements that were the same as the last advertisement
  • White Arrows: represent segments sent
  • Red Arrows (R): represent retransmitted segments


TCP Slow Start

  • TCP/IP and Other Transports for High Bandwidth Applications: Back to Basics
  • Slide 2
  • Slide 3
  • What is a Protocol Stack
  • The Layering Principle
  • What do the Layers do
  • How do the "IP" Protocols fit together
  • Some of the "IP" Protocols
  • The Physical Layer 1 Ethernet
  • The Link Layer 2 Ethernet Frame
  • Slide 11
  • The Network Layer 3 IP
  • The Internet datagram
  • IP Datagram Format (cont)
  • Internet Class-based addresses
  • The Transport Layer 4 UDP
  • UDP Datagram format
  • The Transport Layer 4 TCP
  • The TCP Segment Format
  • TCP Segment Format - cont
  • TCP - providing reliability
  • Flow Control Sender - Congestion Window
  • Flow Control Receiver - Lost Data
  • How it works TCP Slowstart
  • Slide 25
  • TCP Fast Retransmit & Recovery
  • TCP Simple Tuning - Filling the Pipe
  • Congestion control ACK clocking
  • More Information
  • Slide 30
  • Slide 31
  • More Information Some URLs
  • tcpdump tcptrace
  • tcptrace and xplot
  • Zoomed In View
  • TCP Slow Start
Page 16: Summer School, Brasov, Romania, July 2005, R. Hughes-Jones Manchester 1 TCP/IP and Other Transports for High Bandwidth Applications Back to Basics Richard

Summer School Brasov Romania July 2005 R Hughes-Jones Manchester 16

The Transport Layer 4 UDP UDP Provides

Connection less service over IPNo setup teardownOne packet at a time

Minimal overhead ndash high performance Provides best effort delivery It is unreliable

Packet may be lostDuplicatedOut of order

Application is responsible for Data reliabilityFlow controlError handling

Summer School Brasov Romania July 2005 R Hughes-Jones Manchester 17

UDP Datagram format

Sourcedestination port port numbers identify sending amp receiving processes

Port number amp IP address allow any application on Internet to be uniquely identified

Ports can be static or dynamic

Static (lt 1024) assigned centrally known as well known ports

Dynamic

Message length in bytes includes the UDP header and data (min 8 max 65535)

8 16 3124

Source port Destination port

UDP message len Checksum (opt)

0

Frame header Application data FCSIP header UDP header

8 Bytes

Summer School Brasov Romania July 2005 R Hughes-Jones Manchester 18

The Transport Layer 4 TCP TCP RFC 768 RFC 1122 Provides

Connection orientated service over IPDuring setup the two ends agree on detailsExplicit teardownMultiple connections allowed

Reliable end-to-end Byte Stream delivery over unreliable network It takes care of

Lost packets Duplicated packetsOut of order packets

TCP provides Data buffering Flow controlError detection amp handlingLimits network congestion

Summer School Brasov Romania July 2005 R Hughes-Jones Manchester 19

Code

Source port Destination port

Sequence number

0 8 16 3124

Acknowledgement number

4

Hlen

10

Resv Window

Urgent ptrChecksum

Options (if any) Padding

The TCP Segment FormatFrame header Application data FCSIP header TCP header

20 Bytes

Summer School Brasov Romania July 2005 R Hughes-Jones Manchester 20

TCP Segment Format ndash cont SourceDest port TCP port numbers to ID applications

at both ends of connection Sequence number First byte in segment from senderrsquos

byte stream Acknowledgement identifies the number of the byte the

sender of this segment expects to receive next Code used to determine segment purpose eg SYN

ACK FIN URG Window Advertises how much data this station is willing

to accept Can depend on buffer space remaining Options used for window scaling

SACK timestamps maximum segment size etc

Code

Source port Destination port

Sequence number

Acknowledgement number

Hlen Resv Window

Urgent ptrChecksum

Options (if any) Padding

Summer School Brasov Romania July 2005 R Hughes-Jones Manchester 21

TCP ndash providing reliability Positive acknowledgement (ACK) of each received segment

Sender keeps record of each segment sent Sender awaits an ACK ndash ldquoI am ready to receive byte 2048 and beyondrdquo Sender starts timer when it sends segment ndash so can re-transmit

Segment n

ACK of Segment nRTT

Time

Sender Receiver

Sequence 1024Length 1024

Ack 2048

Segment n+1

ACK of Segment n +1RTT

Sequence 2048Length 1024

Ack 3072

Inefficient ndash sender has to wait

Summer School Brasov Romania July 2005 R Hughes-Jones Manchester 22

Flow Control: Sender - Congestion Window

  • Uses the congestion window, cwnd, a sliding window to control the data flow
      • Byte count giving the highest byte that can be sent without an ACK
      • Transmit buffer size and advertised receive buffer size are important
  • ACK gives the next sequence number to receive AND the available space in the receive buffer
  • Timer kept for each packet

[Figure: the sender's byte stream under the sliding TCP cwnd. Regions shown: data sent and ACKed; sent data, buffered waiting for an ACK; unsent data that may be transmitted immediately; data to be sent, waiting for the window to open (the application writes here). A received ACK advances the trailing edge, the receiver's advertised window advances the leading edge, and the sending host advances a marker as data is transmitted.]
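The window bookkeeping can be sketched as below. The rule that the sender is limited by the smaller of cwnd and the receiver's advertised window is standard TCP behaviour rather than something spelled out on the slide, and the function and parameter names are illustrative.

```python
def usable_window(cwnd: int, advertised: int,
                  last_byte_sent: int, last_byte_acked: int) -> int:
    """Bytes the sender may still transmit before it must wait for an ACK."""
    in_flight = last_byte_sent - last_byte_acked   # sent data, buffered awaiting ACK
    return max(0, min(cwnd, advertised) - in_flight)

# 2048 bytes in flight, receiver advertises less than cwnd allows:
print(usable_window(cwnd=8192, advertised=4096,
                    last_byte_sent=3072, last_byte_acked=1024))
# More in flight than the window permits: nothing more may be sent.
print(usable_window(cwnd=2048, advertised=4096,
                    last_byte_sent=5000, last_byte_acked=1000))
```

An arriving ACK raises `last_byte_acked` (the trailing edge) and may carry a new `advertised` value, both of which reopen the window.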

Summer School Brasov Romania July 2005 R Hughes-Jones Manchester 23

Flow Control: Receiver - Lost Data

[Figure: the receiver's byte stream as the window slides. Regions shown: data given to the application (the application reads here); data ACKed but not yet given to the user; lost data; data received but not ACKed. The last ACK given marks the next byte expected (the expected sequence number), and the receiver's advertised window advances the leading edge.]

  • If new data is received with a sequence number ≠ the next byte expected, a duplicate ACK is sent carrying the expected sequence number
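The duplicate-ACK rule can be illustrated with a toy receiver. This is a simplified sketch: it assumes cumulative ACKs only (no SACK), never runs out of buffer, and ignores old duplicates; the class name `Receiver` and the segment sizes are illustrative.

```python
class Receiver:
    """Toy TCP receiver that returns the cumulative ACK for each arrival."""

    def __init__(self, next_expected: int = 0):
        self.next_expected = next_expected
        self.out_of_order = {}          # sequence number -> length

    def receive(self, seq: int, length: int) -> int:
        if seq == self.next_expected:
            self.next_expected += length
            # Deliver any buffered segments that are now contiguous.
            while self.next_expected in self.out_of_order:
                self.next_expected += self.out_of_order.pop(self.next_expected)
        elif seq > self.next_expected:
            self.out_of_order[seq] = length   # received but cannot be ACKed yet
        return self.next_expected             # repeated value = duplicate ACK

rx = Receiver()
# The segment at 1024 is lost at first and arrives last (retransmitted).
arrivals = [(0, 1024), (2048, 1024), (3072, 1024), (4096, 1024), (1024, 1024)]
acks = [rx.receive(seq, length) for seq, length in arrivals]
print(acks)
```

The three repeated ACK values in the middle are the duplicate ACKs that the sender later uses to detect the loss.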

Summer School Brasov Romania July 2005 R Hughes-Jones Manchester 24

How it works: TCP Slowstart

  • Probe the network - get a rough estimate of the optimal congestion window size
  • The larger the window size, the higher the throughput
      • Throughput = Window size / Round-trip Time
  • Exponentially increase the congestion window size until a packet is lost
      • cwnd is initially 1 MTU, then increased by 1 MTU for each ACK received
      • Send 1st packet, get 1 ACK, increase cwnd to 2
      • Send 2 packets, get 2 ACKs, increase cwnd to 4
      • Time to reach cwnd size W = RTT × log2(W)
      • Rate doubles each RTT

[Figure: CWND versus time, showing slow start (exponential increase), congestion avoidance (linear increase), a packet loss followed by a retransmit, and a timeout followed by slow start again.]
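The doubling rule and the RTT × log2(W) estimate can be checked with a few lines. This is a sketch that assumes every segment is ACKed individually and nothing is lost; cwnd is counted in segments and the function names are illustrative.

```python
import math

def slow_start_rounds(target_segments: int) -> int:
    """Round trips needed for cwnd to grow from 1 segment to the target."""
    cwnd, rounds = 1, 0
    while cwnd < target_segments:
        cwnd += cwnd        # one ACK per segment in flight, each adds 1 segment
        rounds += 1
    return rounds

def time_to_reach_s(target_segments: int, rtt_s: float) -> float:
    """Closed form from the slide: RTT x log2(W)."""
    return rtt_s * math.log2(target_segments)

print(slow_start_rounds(4))            # 1 -> 2 -> 4
print(slow_start_rounds(1024))
print(time_to_reach_s(1024, 0.1))      # hypothetical 100 ms RTT
```

For window sizes that are powers of two, the simulated round count and log2(W) agree exactly.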

Summer School Brasov Romania July 2005 R Hughes-Jones Manchester 25

How it works: TCP Congestion Avoidance

  • Additive increase: starting from the rough estimate, linearly increase the congestion window size to probe for additional available bandwidth
      • cwnd increased by 1 MTU for each RTT - linear increase in rate
  • TCP takes packet loss as an indication of congestion
  • Multiplicative decrease: cut the congestion window size aggressively if a packet is lost
      • Standard TCP reduces cwnd by a factor of 0.5
  • Slow start to congestion avoidance transition determined by ssthresh

[Figure: CWND versus time, showing slow start (exponential increase), congestion avoidance (linear increase), a packet loss followed by a retransmit, and a timeout followed by slow start again.]
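The saw-tooth behaviour can be sketched per round trip as below. This is a simplified model, not the author's code: cwnd is counted in segments, a loss is assumed to be detected without a timeout (so cwnd is halved rather than reset to 1), and `aimd_trace` and its parameters are illustrative names.

```python
def aimd_trace(rounds: int, ssthresh: float, loss_rounds: set,
               start_cwnd: float = 1.0) -> list:
    """Return cwnd (in segments) at the start of each RTT round."""
    cwnd = start_cwnd
    trace = []
    for r in range(rounds):
        trace.append(cwnd)
        if r in loss_rounds:
            ssthresh = max(cwnd * 0.5, 2.0)   # multiplicative decrease
            cwnd = ssthresh
        elif cwnd < ssthresh:
            cwnd = min(cwnd * 2, ssthresh)    # slow start: double per RTT
        else:
            cwnd += 1.0                       # additive increase: +1 per RTT
    return trace

# Hypothetical run: ssthresh of 8 segments, one loss in round 5.
print(aimd_trace(rounds=8, ssthresh=8, loss_rounds={5}))
```

The trace rises exponentially to ssthresh, then by one segment per round, and restarts the linear climb from half the window after the loss.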

Summer School Brasov Romania July 2005 R Hughes-Jones Manchester 26

TCP Fast Retransmit & Recovery

  • Duplicate ACKs are due to lost segments or segments out of order
  • Fast Retransmit: if the sender receives 3 duplicate ACKs (i.e. the receiver got 3 additional segments without getting the one expected)
      • Send the missing segment
      • Set ssthresh to 0.5 × cwnd - so enter the congestion avoidance phase
      • Set cwnd = (0.5 × cwnd + 3) (the +3 is for the 3 dup ACKs)
      • Increase cwnd by 1 segment for each further duplicate ACK
      • Keep sending new data if allowed by cwnd
      • Set cwnd to half the original value on a new ACK
      • No need to go into "slow start" again
  • At steady state, CWND oscillates around the optimal window size
  • With a retransmission timeout, slow start is triggered again

[Figure: CWND versus time, showing slow start (exponential increase), congestion avoidance (linear increase), a packet loss followed by a retransmit, and a timeout followed by slow start again.]
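The steps listed on the slide can be followed in a small state machine. This is a minimal sketch under stated assumptions: cwnd and ssthresh are counted in segments, timeouts are not modelled, and the class `FastRecoverySender` with its attributes is an illustrative name, not an API from any TCP stack.

```python
class FastRecoverySender:
    """Tracks cwnd through fast retransmit and fast recovery (in segments)."""
    DUP_THRESHOLD = 3

    def __init__(self, cwnd: float, ssthresh: float):
        self.cwnd = cwnd
        self.ssthresh = ssthresh
        self.last_ack = 0
        self.dup_acks = 0
        self.in_recovery = False
        self.retransmitted = []          # sequence numbers resent

    def on_ack(self, ack: int) -> None:
        if ack == self.last_ack:                     # duplicate ACK
            self.dup_acks += 1
            if self.in_recovery:
                self.cwnd += 1                       # inflate per further dup ACK
            elif self.dup_acks == self.DUP_THRESHOLD:
                self.retransmitted.append(ack)       # send the missing segment
                self.ssthresh = self.cwnd / 2
                self.cwnd = self.ssthresh + 3        # +3 for the 3 dup ACKs
                self.in_recovery = True
        else:                                        # new ACK
            if self.in_recovery:
                self.cwnd = self.ssthresh            # half the original value
                self.in_recovery = False
            self.last_ack = ack
            self.dup_acks = 0

s = FastRecoverySender(cwnd=16, ssthresh=64)
for ack in [10, 10, 10, 10, 10, 20]:   # one new ACK, four dup ACKs, then a new ACK
    s.on_ack(ack)
    print(ack, s.cwnd, s.in_recovery)
```

After the final new ACK the sender continues in congestion avoidance at half its original window instead of returning to slow start.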

Summer School Brasov Romania July 2005 R Hughes-Jones Manchester 27

TCP Simple Tuning - Filling the Pipe

  • Remember, TCP has to hold a copy of data in flight
  • Optimal (TCP buffer) window size depends on:
      • Bandwidth end to end, i.e. min(BW of links), AKA the bottleneck bandwidth
      • Round Trip Time (RTT)
  • The number of bytes in flight to fill the entire path: Bandwidth × Delay Product, BDP = RTT × BW
  • Can increase bandwidth by orders of magnitude
  • Windows are also used for flow control

[Figure: sender/receiver time sequence showing a segment and its ACK one RTT later. Segment time on wire = bits in segment / BW.]
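A worked sketch of the formulas on this slide. The 1 Gbit/s bottleneck, 150 ms RTT and 64 KiB default window are hypothetical numbers chosen only for illustration, and the function names are not from the talk.

```python
def bdp_bytes(bandwidth_bps: float, rtt_s: float) -> float:
    """Bandwidth x Delay Product: bytes in flight needed to fill the path."""
    return bandwidth_bps * rtt_s / 8

def window_limited_throughput_bps(window_bytes: float, rtt_s: float) -> float:
    """Throughput = Window size / Round-trip Time, in bits per second."""
    return window_bytes * 8 / rtt_s

def segment_time_on_wire_s(segment_bytes: int, bandwidth_bps: float) -> float:
    """Segment time on wire = bits in segment / BW."""
    return segment_bytes * 8 / bandwidth_bps

bw, rtt = 1e9, 0.150                       # assumed bottleneck and RTT
print("BDP bytes:", bdp_bytes(bw, rtt))
print("64 KiB window, bit/s:", window_limited_throughput_bps(64 * 1024, rtt))
print("1500-byte segment, s:", segment_time_on_wire_s(1500, bw))
```

On such a path the buffer needed to fill the pipe is a few hundred times larger than a 64 KiB window, which is where the orders-of-magnitude gain from tuning comes from.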

Summer School Brasov Romania July 2005 R Hughes-Jones Manchester 28

Congestion control ACK clocking

Summer School Brasov Romania July 2005 R Hughes-Jones Manchester 29

More Information

  • Lectures, tutorials etc. on TCP/IP:
      • www.nvcc.va.us/home/joney/tcp_ip.htm
      • www.cs.pdx.edu/~jrb/tcpip.lectures.html
      • www.raleigh.ibm.com/cgi-bin/bookmgr/BOOKS/EZ306200/CCONTENTS
      • www.cisco.com/univercd/cc/td/doc/product/iaabu/centri4/user/scf4ap1.htm
      • www.cis.ohio-state.edu/htbin/rfc/rfc1180.html
      • www.jbmelectronics.com/tcp.htm
  • Encyclopaedia: http://www.freesoft.org/CIE/index.htm
  • TCP/IP Resources: www.private.org.il/tcpip_rl.html
  • Understanding IP addresses: http://www.3com.com/solutions/en_US/ncs/501302.html
  • Configuring TCP (RFC 1122): ftp://nic.merit.edu/internet/documents/rfc/rfc1122.txt
  • Assigned protocols, ports etc. (RFC 1010): http://www.es.net/pub/rfcs/rfc1010.txt & /etc/protocols

Summer School Brasov Romania July 2005 R Hughes-Jones Manchester 30

Any Questions

Summer School Brasov Romania July 2005 R Hughes-Jones Manchester 31

Backup Slides

Summer School Brasov Romania July 2005 R Hughes-Jones Manchester 32

More Information: Some URLs

  • UKLight web site: http://www.uklight.ac.uk
  • MB-NG project web site: http://www.mb-ng.net/
  • DataTAG project web site: http://www.datatag.org/
  • UDPmon / TCPmon kit + writeup: http://www.hep.man.ac.uk/~rich/net
  • Motherboard and NIC Tests:
      • http://www.hep.man.ac.uk/~rich/net/nic/GigEth_tests_Boston.ppt & http://datatag.web.cern.ch/datatag/pfldnet2003/
      • "Performance of 1 and 10 Gigabit Ethernet Cards with Server Quality Motherboards", FGCS Special issue 2004, http://www.hep.man.ac.uk/~rich/
  • TCP tuning information may be found at: http://www.ncne.nlanr.net/documentation/faq/performance.html & http://www.psc.edu/networking/perf_tune.html
  • TCP stack comparisons: "Evaluation of Advanced TCP Stacks on Fast Long-Distance Production Networks", Journal of Grid Computing 2004
  • PFLDnet: http://www.ens-lyon.fr/LIP/RESO/pfldnet2005/
  • Dante PERT: http://www.geant2.net/server/show/nav.00d00h002

Summer School Brasov Romania July 2005 R Hughes-Jones Manchester 33

tcpdump / tcptrace

  • tcpdump: dump all TCP header information for a specified source/destination
      • ftp://ftp.ee.lbl.gov
  • tcptrace: format tcpdump output for analysis using xplot
      • http://www.tcptrace.org/
  • NLANR TCP Testrig: nice wrapper for the tcpdump and tcptrace tools
      • http://www.ncne.nlanr.net/TCP/testrig/
  • Sample use:
      tcpdump -s 100 -w /tmp/tcpdump.out host hostname
      tcptrace -Sl /tmp/tcpdump.out
      xplot /tmp/a2b_tsg.xpl

Summer School Brasov Romania July 2005 R Hughes-Jones Manchester 34

tcptrace and xplot

  • X axis is time
  • Y axis is sequence number
  • The slope of this curve gives the throughput over time
  • The xplot tool makes it easy to zoom in
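The slope-equals-throughput reading of the time-sequence plot can be reproduced numerically. The sketch below assumes a list of (time in seconds, sequence number in bytes) samples that has already been extracted from a trace; the sample values and the function name `throughput_bps` are hypothetical.

```python
def throughput_bps(samples: list) -> list:
    """Slope between consecutive (time_s, sequence_number) points, in bits/s."""
    rates = []
    for (t0, s0), (t1, s1) in zip(samples, samples[1:]):
        rates.append((s1 - s0) * 8 / (t1 - t0))
    return rates

# Hypothetical points read off a time-sequence graph.
points = [(0.0, 0), (1.0, 125_000), (2.0, 375_000)]
print(throughput_bps(points))
```

A steeper segment of the curve corresponds to a higher rate, and a flat segment shows the sender stalled.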

Summer School Brasov Romania July 2005 R Hughes-Jones Manchester 35

Zoomed In View

  • Green Line: tracks the ACK values received from the receiver
  • Yellow Line: tracks the receive window advertised by the receiver
  • Green Ticks: track the duplicate ACKs received
  • Yellow Ticks: track the window advertisements that were the same as the last advertisement
  • White Arrows: represent segments sent
  • Red Arrows (R): represent retransmitted segments

Summer School Brasov Romania July 2005 R Hughes-Jones Manchester 36

TCP Slow Start

  • TCP/IP and Other Transports for High Bandwidth Applications: Back to Basics
  • Slide 2
  • Slide 3
  • What is a Protocol Stack
  • The Layering Principle
  • What do the Layers do
  • How do the "IP" Protocols fit together
  • Some of the "IP" Protocols
  • The Physical Layer 1 Ethernet
  • The Link Layer 2 Ethernet Frame
  • Slide 11
  • The Network Layer 3 IP
  • The Internet datagram
  • IP Datagram Format (cont)
  • Internet Class-based addresses
  • The Transport Layer 4 UDP
  • UDP Datagram format
  • The Transport Layer 4 TCP
  • The TCP Segment Format
  • TCP Segment Format - cont
  • TCP - providing reliability
  • Flow Control Sender - Congestion Window
  • Flow Control Receiver - Lost Data
  • How it works TCP Slowstart
  • Slide 25
  • TCP Fast Retransmit & Recovery
  • TCP Simple Tuning - Filling the Pipe
  • Congestion control ACK clocking
  • More Information
  • Slide 30
  • Slide 31
  • More Information Some URLs
  • tcpdump tcptrace
  • tcptrace and xplot
  • Zoomed In View
  • TCP Slow Start
Page 17: Summer School, Brasov, Romania, July 2005, R. Hughes-Jones Manchester 1 TCP/IP and Other Transports for High Bandwidth Applications Back to Basics Richard

Summer School Brasov Romania July 2005 R Hughes-Jones Manchester 17

UDP Datagram format

Encapsulation: Frame header | IP header | UDP header | Application data | FCS

UDP header (8 bytes), bit positions 0-31:
  Source port (bits 0-15)     | Destination port (bits 16-31)
  UDP message len (bits 0-15) | Checksum (opt) (bits 16-31)

  • Source/destination port: port numbers identify the sending & receiving processes
      • Port number & IP address allow any application on the Internet to be uniquely identified
      • Ports can be static or dynamic
          • Static (< 1024): assigned centrally, known as well-known ports
          • Dynamic
  • Message length: in bytes, includes the UDP header and data (min 8, max 65535)
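The four 16-bit header fields can be packed and unpacked as follows. This is a sketch only: the checksum is left at 0, which in UDP over IPv4 means "not computed", and the helper names and port numbers are illustrative.

```python
import struct

def build_udp_header(src_port: int, dst_port: int, payload: bytes,
                     checksum: int = 0) -> bytes:
    """Pack the 8-byte UDP header; the length covers header plus data."""
    length = 8 + len(payload)
    if length > 65535:
        raise ValueError("UDP message length is limited to 65535 bytes")
    return struct.pack("!HHHH", src_port, dst_port, length, checksum)

def parse_udp_header(datagram: bytes) -> dict:
    """Decode the 8-byte UDP header at the start of `datagram`."""
    src, dst, length, checksum = struct.unpack("!HHHH", datagram[:8])
    return {"src_port": src, "dst_port": dst,
            "length": length, "checksum": checksum}

payload = b"hello"
datagram = build_udp_header(40000, 53, payload) + payload   # hypothetical ports
print(parse_udp_header(datagram))
```

An empty payload gives the minimum message length of 8 bytes quoted on the slide.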

Page 19: Summer School, Brasov, Romania, July 2005, R. Hughes-Jones Manchester 1 TCP/IP and Other Transports for High Bandwidth Applications Back to Basics Richard

Summer School Brasov Romania July 2005 R Hughes-Jones Manchester 19

Code

Source port Destination port

Sequence number

0 8 16 3124

Acknowledgement number

4

Hlen

10

Resv Window

Urgent ptrChecksum

Options (if any) Padding

The TCP Segment FormatFrame header Application data FCSIP header TCP header

20 Bytes

Summer School Brasov Romania July 2005 R Hughes-Jones Manchester 20

TCP Segment Format ndash cont SourceDest port TCP port numbers to ID applications

at both ends of connection Sequence number First byte in segment from senderrsquos

byte stream Acknowledgement identifies the number of the byte the

sender of this segment expects to receive next Code used to determine segment purpose eg SYN

ACK FIN URG Window Advertises how much data this station is willing

to accept Can depend on buffer space remaining Options used for window scaling

SACK timestamps maximum segment size etc

Code

Source port Destination port

Sequence number

Acknowledgement number

Hlen Resv Window

Urgent ptrChecksum

Options (if any) Padding

Summer School Brasov Romania July 2005 R Hughes-Jones Manchester 21

TCP ndash providing reliability Positive acknowledgement (ACK) of each received segment

Sender keeps record of each segment sent Sender awaits an ACK ndash ldquoI am ready to receive byte 2048 and beyondrdquo Sender starts timer when it sends segment ndash so can re-transmit

Segment n

ACK of Segment nRTT

Time

Sender Receiver

Sequence 1024Length 1024

Ack 2048

Segment n+1

ACK of Segment n +1RTT

Sequence 2048Length 1024

Ack 3072

Inefficient ndash sender has to wait

Summer School Brasov Romania July 2005 R Hughes-Jones Manchester 22

Flow Control: Sender - Congestion Window
Uses the congestion window, cwnd, a sliding window to control the data flow
  Byte count giving the highest byte that can be sent without an ACK
  Transmit buffer size and advertised receive buffer size are important
  ACK gives the next sequence number to receive AND the available space in the receive buffer
  Timer kept for each packet
[Figure: the sender's byte stream with the TCP cwnd sliding along it. Regions, oldest to newest: data sent and ACKed; sent data, buffered waiting for ACK; unsent data that may be transmitted immediately; data to be sent, waiting for the window to open (the application writes here). A received ACK advances the trailing edge, the receiver's advertised window advances the leading edge, and the sending host advances a marker as data is transmitted.]
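One way to picture the sender's bookkeeping is the sketch below. It assumes the usual rule that the sender is limited by the smaller of cwnd and the receiver's advertised window; the slide does not spell that rule out, and the function and parameter names are illustrative.

```python
def usable_window(cwnd: int, rwnd: int, last_byte_sent: int, last_byte_acked: int) -> int:
    """Bytes the sender may still transmit right now."""
    in_flight = last_byte_sent - last_byte_acked  # sent data waiting for an ACK
    return max(0, min(cwnd, rwnd) - in_flight)

# Hypothetical state: 2048 bytes in flight, receiver advertising 4096 bytes.
print(usable_window(cwnd=8192, rwnd=4096, last_byte_sent=3072, last_byte_acked=1024))
```

When an ACK arrives, `last_byte_acked` moves forward (the trailing edge) and the result grows; a larger advertised window moves the leading edge.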

Summer School Brasov Romania July 2005 R Hughes-Jones Manchester 23

Flow Control: Receiver - Lost Data
If new data is received with a sequence number not equal to the next byte expected, a duplicate ACK is sent with the expected sequence number
[Figure: the receiver's byte stream with the window sliding along it. Regions: data given to the application (the application reads here); ACKed but not given to user; lost data; received but not ACKed. The last ACK given marks the next byte expected (the expected sequence number), and the receiver's advertised window advances the leading edge.]
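The duplicate-ACK behaviour can be sketched with a toy receiver. This assumes the receiver buffers out-of-order segments and always replies with a cumulative ACK; the function name and the arrival pattern are made up for illustration.

```python
def receiver_acks(segments, first_expected):
    """Return the ACK number sent in response to each arriving (seq, length) segment."""
    expected = first_expected
    buffered = {}  # out-of-order segments held until the gap is filled
    acks = []
    for seq, length in segments:
        if seq == expected:
            expected += length
            # Deliver any buffered segments that are now contiguous.
            while expected in buffered:
                expected += buffered.pop(expected)
        elif seq > expected:
            buffered[seq] = length
        acks.append(expected)  # cumulative ACK; a repeated value is a duplicate ACK
    return acks

# The segment starting at byte 2048 is delayed and arrives last.
arrivals = [(1024, 1024), (3072, 1024), (4096, 1024), (5120, 1024), (2048, 1024)]
print(receiver_acks(arrivals, 1024))  # [2048, 2048, 2048, 2048, 6144]
```

The three repeated 2048 values are the duplicate ACKs; once the missing segment arrives, the ACK jumps to cover everything that was buffered.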

Summer School Brasov Romania July 2005 R Hughes-Jones Manchester 24

How it works: TCP Slowstart
Probe the network - get a rough estimate of the optimal congestion window size
The larger the window size, the higher the throughput
  Throughput = Window size / Round-trip Time
Exponentially increase the congestion window size until a packet is lost
  cwnd initially 1 MTU, then increased by 1 MTU for each ACK received
  Send 1st packet, get 1 ACK, increase cwnd to 2
  Send 2 packets, get 2 ACKs, increase cwnd to 4
  Time to reach cwnd size W = RTT * log2(W)
  Rate doubles each RTT
[Figure: CWND against time. Slow start: exponential increase; congestion avoidance: linear increase; packet loss; retransmit timeout: slow start again.]
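The doubling rule can be checked with a short loop. This is a sketch under idealised assumptions (one ACK per segment, no loss, no delayed ACKs, cwnd counted in segments); the function name is illustrative.

```python
import math

def slow_start_rounds(target_segments: int) -> int:
    """Count RTTs for cwnd to grow from 1 segment to at least target_segments."""
    cwnd, rounds = 1, 0
    while cwnd < target_segments:
        cwnd += cwnd  # one ACK per segment sent, +1 segment per ACK
        rounds += 1
    return rounds

W = 1024
print(slow_start_rounds(W), math.log2(W))  # 10 10.0
```

Reaching a window of 1024 segments takes 10 round trips, matching RTT * log2(W).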

Summer School Brasov Romania July 2005 R Hughes-Jones Manchester 25

How it works: TCP Congestion Avoidance
Additive increase: starting from the rough estimate, linearly increase the congestion window size to probe for additional available bandwidth
  cwnd increased by about 1 MTU per RTT (each ACK adds only a fraction of an MTU), giving a linear increase in rate
TCP takes packet loss as an indication of congestion
Multiplicative decrease: cut the congestion window size aggressively if a packet is lost
  Standard TCP reduces cwnd by a factor of 0.5
Slow start to congestion avoidance transition determined by ssthresh
[Figure: CWND against time. Slow start: exponential increase; congestion avoidance: linear increase; packet loss; retransmit timeout: slow start again.]
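A minimal sketch of additive increase and multiplicative decrease, assuming cwnd is counted in segments, grows by roughly one segment per round trip, and is halved on a loss. The function name and the loss pattern are illustrative, not taken from the talk.

```python
def aimd_trace(cwnd: float, rtts: int, loss_at: set) -> list:
    """cwnd (in segments) at the end of each RTT under AIMD."""
    trace = []
    for rtt in range(rtts):
        if rtt in loss_at:
            cwnd = max(1.0, cwnd * 0.5)  # multiplicative decrease
        else:
            cwnd += 1.0                  # additive increase: +1 segment per RTT
        trace.append(cwnd)
    return trace

print(aimd_trace(10.0, 8, {4}))
# [11.0, 12.0, 13.0, 14.0, 7.0, 8.0, 9.0, 10.0]
```

After the single loss it takes as many round trips as segments were cut to climb back, which is what makes recovery slow on long, fast paths.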

Summer School Brasov Romania July 2005 R Hughes-Jones Manchester 26

TCP Fast Retransmit & Recovery
Duplicate ACKs are due to lost segments or segments out of order
Fast Retransmit: if the sender receives 3 duplicate ACKs (i.e. the receiver got 3 additional segments without getting the one expected)
  Send the missing segment
  Set ssthresh to 0.5*cwnd, so enter congestion avoidance phase
  Set cwnd = (0.5*cwnd + 3), the 3 accounting for the 3 dup ACKs
  Increase cwnd by 1 segment for each further duplicate ACK
  Keep sending new data if allowed by cwnd
  Set cwnd to half the original value on a new ACK
  No need to go into "slow start" again
At steady state, CWND oscillates around the optimal window size
With a retransmission timeout, slow start is triggered again
[Figure: CWND against time. Slow start: exponential increase; congestion avoidance: linear increase; packet loss; retransmit timeout: slow start again.]
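The steps above can be sketched as a tiny state machine. It is a simplification: cwnd is counted in whole segments, window growth outside recovery is omitted, and the names (`on_ack`, the `state` keys) are assumptions made for this illustration.

```python
DUP_ACK_THRESHOLD = 3

def on_ack(state: dict, ack_no: int) -> str:
    """Update a minimal sender state for one incoming ACK; return the action taken."""
    if ack_no == state["last_ack"]:
        state["dup_acks"] += 1
        if state["dup_acks"] == DUP_ACK_THRESHOLD:
            state["ssthresh"] = max(2, state["cwnd"] // 2)
            state["cwnd"] = state["ssthresh"] + 3  # the 3 dup ACKs
            state["in_recovery"] = True
            return "fast retransmit"
        if state["in_recovery"]:
            state["cwnd"] += 1  # each further dup ACK inflates the window
            return "inflate"
        return "duplicate"
    # New data acknowledged.
    state["last_ack"] = ack_no
    state["dup_acks"] = 0
    if state["in_recovery"]:
        state["cwnd"] = state["ssthresh"]  # half the original value
        state["in_recovery"] = False
        return "exit recovery"
    return "new ack"

state = {"cwnd": 20, "ssthresh": 64, "last_ack": 2048, "dup_acks": 0, "in_recovery": False}
actions = [on_ack(state, a) for a in (2048, 2048, 2048, 2048, 6144)]
print(actions, state["cwnd"])
```

Starting from 20 segments, the third duplicate ACK triggers the retransmit, and the new ACK leaves cwnd at 10 segments without returning to slow start.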

Summer School Brasov Romania July 2005 R Hughes-Jones Manchester 27

TCP Simple Tuning - Filling the Pipe
Remember, TCP has to hold a copy of data in flight
Optimal (TCP buffer) window size depends on:
  Bandwidth end to end, i.e. min(BW of links), AKA bottleneck bandwidth
  Round Trip Time (RTT)
The number of bytes in flight to fill the entire path:
  Bandwidth*Delay Product, BDP = RTT * BW
  Tuning the window to the BDP can increase throughput by orders of magnitude
Windows are also used for flow control
Segment time on wire = bits in segment / BW
[Figure: time-sequence diagram between Sender and Receiver showing segments sent, the returning ACK and the RTT.]
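A worked example of the bandwidth-delay product, using assumed figures (a 1 Gbit/s bottleneck and a 150 ms round trip) that are not taken from the slide. The function names are illustrative.

```python
def bdp_bytes(bandwidth_bps: float, rtt_s: float) -> float:
    """Bytes that must be in flight to keep the path full."""
    return bandwidth_bps * rtt_s / 8

def max_throughput_bps(window_bytes: int, rtt_s: float) -> float:
    """Upper bound on throughput for a given window: window / RTT."""
    return window_bytes * 8 / rtt_s

print(bdp_bytes(1e9, 0.150))             # assumed 1 Gbit/s path, 150 ms RTT
print(max_throughput_bps(65535, 0.150))  # untuned 64 KB window on the same path
```

On these assumptions the pipe holds roughly 18.75 MB, while a 64 KB window caps the flow at about 3.5 Mbit/s, a gap of more than two orders of magnitude.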

Summer School Brasov Romania July 2005 R Hughes-Jones Manchester 28

Congestion control ACK clocking

Summer School Brasov Romania July 2005 R Hughes-Jones Manchester 29

More Information
Lectures, tutorials etc. on TCP/IP:
  wwwnvccvaushomejoneytcp_iphtm
  wwwcspdxedu~jrbtcpiplectureshtml
  wwwraleighibmcomcgi-binbookmgrBOOKSEZ306200CCONTENTS
  wwwciscocomunivercdcctddocproductiaabucentri4userscf4ap1htm
  wwwcisohio-stateeduhtbinrfcrfc1180html
  wwwjbmelectronicscomtcphtm
Encyclopaedia: httpwwwfreesoftorgCIEindexhtm
TCP/IP Resources: wwwprivateorgiltcpip_rlhtml
Understanding IP addresses: httpwww3comcomsolutionsen_USncs501302html
Configuring TCP (RFC 1122): ftpnicmeriteduinternetdocumentsrfcrfc1122txt
Assigned protocols, ports etc. (RFC 1010): httpwwwesnetpubrfcsrfc1010txt & /etc/protocols

Summer School Brasov Romania July 2005 R Hughes-Jones Manchester 30

Any Questions

Summer School Brasov Romania July 2005 R Hughes-Jones Manchester 31

Backup Slides

Summer School Brasov Romania July 2005 R Hughes-Jones Manchester 32

More Information: Some URLs
UKLight web site: http://www.uklight.ac.uk
MB-NG project web site: http://www.mb-ng.net
DataTAG project web site: http://www.datatag.org
UDPmon / TCPmon kit + writeup: http://www.hep.man.ac.uk/~rich/net
Motherboard and NIC Tests:
  http://www.hep.man.ac.uk/~rich/net/nic/GigEth_tests_Boston.ppt & httpdatatagwebcernchdatatagpfldnet2003
  "Performance of 1 and 10 Gigabit Ethernet Cards with Server Quality Motherboards", FGCS Special issue 2004, http://www.hep.man.ac.uk/~rich
TCP tuning information may be found at: httpwwwncnenlanrnetdocumentationfaqperformancehtml & httpwwwpscedunetworkingperf_tunehtml
TCP stack comparisons: "Evaluation of Advanced TCP Stacks on Fast Long-Distance Production Networks", Journal of Grid Computing 2004
PFLDnet: httpwwwens-lyonfrLIPRESOpfldnet2005
Dante PERT: httpwwwgeant2netservershownav00d00h002

Summer School Brasov Romania July 2005 R Hughes-Jones Manchester 33

tcpdump / tcptrace
tcpdump: dump all TCP header information for a specified source/destination
  ftp://ftp.ee.lbl.gov
tcptrace: format tcpdump output for analysis using xplot
  http://www.tcptrace.org
NLANR TCP Testrig: nice wrapper for tcpdump and tcptrace tools
  httpwwwncnenlanrnetTCPtestrig
Sample use:
  tcpdump -s 100 -w /tmp/tcpdump.out host hostname
  tcptrace -Sl /tmp/tcpdump.out
  xplot /tmp/a2b_tsg.xpl

Summer School Brasov Romania July 2005 R Hughes-Jones Manchester 34

tcptrace and xplot
X axis is time
Y axis is sequence number
The slope of this curve gives the throughput over time
The xplot tool makes it easy to zoom in
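Reading throughput off the slope amounts to the calculation below. The sample points are invented for illustration and the function name is an assumption; real values would come from the time-sequence plot.

```python
def throughput_bps(samples):
    """Average throughput between the first and last (time_s, sequence_number) samples."""
    (t0, s0), (t1, s1) = samples[0], samples[-1]
    return (s1 - s0) * 8 / (t1 - t0)

# Hypothetical points read from a time-sequence graph.
trace = [(0.0, 1024), (0.5, 626024), (1.0, 1251024)]
print(throughput_bps(trace))  # 10000000.0, i.e. 10 Mbit/s
```

A steeper curve means more sequence numbers covered per second; a flat stretch means the sender was stalled.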

Summer School Brasov Romania July 2005 R Hughes-Jones Manchester 35

Zoomed In View
Green Line: tracks the ACK values received from the receiver
Yellow Line: tracks the receive window advertised from the receiver
Green Ticks: track the duplicate ACKs received
Yellow Ticks: track the window advertisements that were the same as the last advertisement
White Arrows: represent segments sent
Red Arrows (R): represent retransmitted segments

Summer School Brasov Romania July 2005 R Hughes-Jones Manchester 36

TCP Slow Start

Page 22: Summer School, Brasov, Romania, July 2005, R. Hughes-Jones Manchester 1 TCP/IP and Other Transports for High Bandwidth Applications Back to Basics Richard

Summer School Brasov Romania July 2005 R Hughes-Jones Manchester 22

Flow Control Sender ndash Congestion Window Uses Congestion window cwnd a sliding window to control the data flow

Byte count giving highest byte that can be sent with out an ACK Transmit buffer size and Advertised Receive buffer size important ACK gives next sequence no to receive AND

The available space in the receive buffer Timer kept for each packet

Unsent Datamay be transmitted immediately

Sent Databuffered waiting ACK

TCP Cwnd slides Data to be sentwaiting for windowto openApplication writes here

Data sent and ACKed

Sending hostadvances markeras data transmitted

Received ACKadvances trailing edge

Receiverrsquos advertisedwindow advances leading edge

Summer School Brasov Romania July 2005 R Hughes-Jones Manchester 23

Flow Control Receiver ndash Lost Data

Received butnot ACKed

ACKed but not given to user

Window slides

Lost data

Data given to application

Last ACK givenNext byte expectedExpected sequence no

Receiverrsquos advertisedwindow advances leading edge

Application reads here

If new data is received with a sequence number ne next byte expected Duplicate ACK is send with the expected sequence number

Summer School Brasov Romania July 2005 R Hughes-Jones Manchester 24

How it works TCP Slowstart Probe the network - get a rough estimate of the optimal congestion window size The larger the window size the higher the throughput

Throughput = Window size Round-trip Time exponentially increase the congestion window size until a packet is lost

cwnd initially 1 MTU then increased by 1 MTU for each ACK received Send 1st packet get 1 ACK increase cwnd to 2 Send 2 packets get 2 ACKs inc cwnd to 4Time to reach cwnd size W = RTTlog2 (W)

Rate doubles each RTT

CWND

slow start exponential

increase

congestion avoidance linear increase

packet loss

time

retransmit slow start

again

timeout

Summer School Brasov Romania July 2005 R Hughes-Jones Manchester 25

additive increase starting from the rough estimate linearly increase the congestion window size to probe for additional available bandwidth cwnd increased by 1 MTU for each ACK ndash linear increase in rate

TCP takes packet loss as indication of congestion multiplicative decrease cut the congestion window size

aggressively if a packet is lost Standard TCP reduces cwnd by 05 Slow start to Congestion avoidance transition determined by ssthresh

CWND

slow start exponential

increase

congestion avoidance linear increase

packet loss

time

retransmit slow start

again

timeout

How it works TCP Congestion Avoidance

Summer School Brasov Romania July 2005 R Hughes-Jones Manchester 26

TCP Fast Retransmit amp Recovery Duplicate ACKs are due to lost segments or segments out of order Fast Retransmit If the sender transmits 3 duplicate ACKs

(ie it received 3 additional segments without getting the one expected) Send the missing segment

Set ssthresh to 05cwnd ndash so enter congestion avoidance phaseSet cwnd = (05cwnd +3 ) ndash the 3 dup ACKs Increase cwnd by 1 segment when get duplicate ACKs Keep sending new data if allowed by cwndSet cwnd to half original value on new ACK

no need to go into ldquoslow startrdquo again

At steady state CWND oscillates around the optimal window size With a retransmission timeout slow start is triggered again

CWND

slow start exponential

increase

congestion avoidance linear increase

packet loss

time

retransmit slow start

again

timeoutCWND

slow start exponential

increase

congestion avoidance linear increase

packet loss

time

retransmit slow start

again

timeout

Summer School Brasov Romania July 2005 R Hughes-Jones Manchester 27

TCP Simple Tuning - Filling the Pipe Remember TCP has to hold a copy of data in flight Optimal (TCP buffer) window size depends on

Bandwidth end to end ie min(BWlinks) AKA bottleneck bandwidth

Round Trip Time (RTT)

The number of bytes in flight to fill the entire path BandwidthDelay Product BDP = RTTBW Can increase bandwidth by

orders of magnitude

Windows also used for flow controlRTT

Time

Sender Receiver

ACK

Segment time on wire = bits in segmentBW

Summer School Brasov Romania July 2005 R Hughes-Jones Manchester 28

Congestion control ACK clocking

Summer School Brasov Romania July 2005 R Hughes-Jones Manchester 29

Lectures tutorials etc on TCPIP wwwnvccvaushomejoneytcp_iphtm wwwcspdxedu~jrbtcpiplectureshtml wwwraleighibmcomcgi-binbookmgrBOOKSEZ306200CCONTENTS wwwciscocomunivercdcctddocproductiaabucentri4userscf4ap1htm wwwcisohio-stateeduhtbinrfcrfc1180html wwwjbmelectronicscomtcphtm

Encylopaedia httpwwwfreesoftorgCIEindexhtm

TCPIP Resources wwwprivateorgiltcpip_rlhtml

Understanding IP addresses httpwww3comcomsolutionsen_USncs501302html

Configuring TCP (RFC 1122) ftpnicmeriteduinternetdocumentsrfcrfc1122txt

Assigned protocols ports etc (RFC 1010) httpwwwesnetpubrfcsrfc1010txt amp etcprotocols

More Information

Summer School Brasov Romania July 2005 R Hughes-Jones Manchester 30

Any Questions

Summer School Brasov Romania July 2005 R Hughes-Jones Manchester 31

Backup Slides

Summer School Brasov Romania July 2005 R Hughes-Jones Manchester 32

More Information: Some URLs

  • UKLight web site: http://www.uklight.ac.uk

  • MB-NG project web site: http://www.mb-ng.net

  • DataTAG project web site: http://www.datatag.org

  • UDPmon / TCPmon kit + writeup: http://www.hep.man.ac.uk/~rich/net

  • Motherboard and NIC Tests: http://www.hep.man.ac.uk/~rich/net/nic/GigEth_tests_Boston.ppt & http://datatag.web.cern.ch/datatag/pfldnet2003
    "Performance of 1 and 10 Gigabit Ethernet Cards with Server Quality Motherboards", FGCS Special issue 2004, http://www.hep.man.ac.uk/~rich/

  • TCP tuning information may be found at: http://www.ncne.nlanr.net/documentation/faq/performance.html & http://www.psc.edu/networking/perf_tune.html

  • TCP stack comparisons: "Evaluation of Advanced TCP Stacks on Fast Long-Distance Production Networks", Journal of Grid Computing 2004

  • PFLDnet: http://www.ens-lyon.fr/LIP/RESO/pfldnet2005

  • Dante PERT: http://www.geant2.net/server/show/nav.00d00h002

tcpdump / tcptrace

  • tcpdump: dump all TCP header information for a specified source/destination (ftp://ftp.ee.lbl.gov)

  • tcptrace: format tcpdump output for analysis using xplot (http://www.tcptrace.org)

  • NLANR TCP Testrig: a nice wrapper for the tcpdump and tcptrace tools (http://www.ncne.nlanr.net/TCP/testrig)

Sample use:

    tcpdump -s 100 -w /tmp/tcpdump.out host hostname
    tcptrace -Sl /tmp/tcpdump.out
    xplot /tmp/a2b_tsg.xpl

tcptrace and xplot

  • X axis is time

  • Y axis is sequence number

  • The slope of this curve gives the throughput over time

  • The xplot tool makes it easy to zoom in
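To illustrate how the slope of a time/sequence-number plot translates into throughput, here is a minimal sketch. The trace points are hypothetical and are not tcptrace output; the function simply takes the slope between consecutive (time, sequence number) samples.

```python
def throughput_bps(samples):
    """Return per-interval throughput in bit/s from (time_s, seq_bytes) points."""
    rates = []
    for (t0, s0), (t1, s1) in zip(samples, samples[1:]):
        if t1 <= t0:
            raise ValueError("timestamps must be strictly increasing")
        # Slope of the curve: bytes advanced per second, converted to bits.
        rates.append((s1 - s0) * 8.0 / (t1 - t0))
    return rates


if __name__ == "__main__":
    # Hypothetical points: (time in seconds, highest sequence number in bytes)
    trace = [(0.0, 0), (0.5, 1_000_000), (1.0, 3_000_000)]
    for rate in throughput_bps(trace):
        print(f"{rate / 1e6:.1f} Mbit/s")
```

A steeper segment of the curve means a higher rate; a flat segment means the sender was stalled, for example while waiting for a retransmission timeout.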

Zoomed In View

  • Green Line: ACK values received from the receiver

  • Yellow Line: tracks the receive window advertised from the receiver

  • Green Ticks: track the duplicate ACKs received

  • Yellow Ticks: track the window advertisements that were the same as the last advertisement

  • White Arrows: represent segments sent

  • Red Arrows (R): represent retransmitted segments

TCP Slow Start


Flow Control: Receiver - Lost Data

[Diagram: the receiver's sliding window over the byte stream. Regions: data given to application; ACKed but not given to user; lost data; received but not ACKed. Markers: application reads here; last ACK given / next byte expected / expected sequence no.; the receiver's advertised window advances the leading edge; window slides]

If new data is received with a sequence number ≠ the next byte expected, a duplicate ACK is sent with the expected sequence number.
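The duplicate-ACK rule can be sketched with a toy receiver. This is a simplified model under stated assumptions (cumulative ACKs only, no SACK, no window limits); the class and method names are hypothetical.

```python
class Receiver:
    """Toy receiver that tracks the next byte expected (cumulative ACK)."""

    def __init__(self, initial_seq=0):
        self.next_expected = initial_seq
        self.out_of_order = {}  # seq -> length of segments held for later

    def on_segment(self, seq, length):
        """Process one segment and return the ACK number to send back."""
        if seq == self.next_expected:
            self.next_expected += length
            # Absorb any buffered segments that are now contiguous.
            while self.next_expected in self.out_of_order:
                self.next_expected += self.out_of_order.pop(self.next_expected)
        elif seq > self.next_expected:
            # Gap detected: hold the data. The ACK returned below repeats
            # the expected sequence number, i.e. it is a duplicate ACK.
            self.out_of_order[seq] = length
        return self.next_expected


if __name__ == "__main__":
    rx = Receiver()
    # The segment starting at byte 1000 is lost and arrives last.
    for seq in (0, 2000, 3000, 4000, 1000):
        print(f"got seq {seq:5d} -> ACK {rx.on_segment(seq, 1000)}")
```

In this run the three segments after the gap each produce the same ACK number, which is the signal the sender uses for fast retransmit.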

How it works: TCP Slowstart

  • Probe the network to get a rough estimate of the optimal congestion window size

  • The larger the window size, the higher the throughput: Throughput = Window size / Round-trip Time

  • Exponentially increase the congestion window size until a packet is lost
      cwnd is initially 1 MTU, then increased by 1 MTU for each ACK received
      Send 1st packet, get 1 ACK, increase cwnd to 2
      Send 2 packets, get 2 ACKs, increase cwnd to 4
      Time to reach cwnd size W = RTT × log2(W)
      Rate doubles each RTT

[Figure: CWND versus time, showing slow start (exponential increase), congestion avoidance (linear increase), packet loss, and timeout followed by retransmit and slow start again]
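The doubling behaviour of slow start can be checked with a small simulation. It is a sketch under simplifying assumptions: cwnd is counted in MTU-sized segments, every segment is ACKed individually (no delayed ACKs), and there is no loss. The function name is hypothetical.

```python
def slow_start_rounds(target_window, initial_cwnd=1):
    """Return the cwnd (in MTU-sized segments) at the start of each RTT."""
    cwnd = initial_cwnd
    history = [cwnd]
    while cwnd < target_window:
        acks = cwnd    # one ACK comes back per segment sent this round trip
        cwnd += acks   # +1 MTU per ACK, so cwnd doubles each RTT
        history.append(cwnd)
    return history


if __name__ == "__main__":
    history = slow_start_rounds(64)
    print("cwnd per RTT:", history)
    print("round trips needed:", len(history) - 1)
```

For a target window of 64 segments the loop takes 6 round trips, matching RTT × log2(W) with W = 64.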

How it works: TCP Congestion Avoidance

  • Additive increase: starting from the rough estimate, linearly increase the congestion window size to probe for additional available bandwidth
      cwnd is increased by about 1 MTU per RTT (roughly 1/cwnd of an MTU for each ACK, with cwnd counted in MTUs), giving a linear increase in rate

  • TCP takes packet loss as an indication of congestion

  • Multiplicative decrease: cut the congestion window size aggressively if a packet is lost
      Standard TCP reduces cwnd by a factor of 0.5

  • The slow start to congestion avoidance transition is determined by ssthresh

[Figure: CWND versus time, showing slow start (exponential increase), congestion avoidance (linear increase), packet loss, and timeout followed by retransmit and slow start again]
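The additive increase, multiplicative decrease pattern can be sketched as follows. This is an illustrative model, not a TCP implementation: the increase is modelled as 1 MTU per round trip, as in standard TCP, losses are injected at chosen round trips, and the starting cwnd of 10 MTUs is an assumption.

```python
def aimd(rounds, loss_rounds, cwnd=10.0, mtu=1.0):
    """Return cwnd (in MTUs) after each RTT under AIMD."""
    history = []
    for rtt in range(rounds):
        if rtt in loss_rounds:
            cwnd = max(cwnd * 0.5, mtu)  # multiplicative decrease: halve on loss
        else:
            cwnd += mtu                  # additive increase: about +1 MTU per RTT
        history.append(cwnd)
    return history


if __name__ == "__main__":
    # Hypothetical run: a single loss detected in round trip 3.
    print(aimd(6, loss_rounds={3}))
```

Plotting the returned list against the round-trip index gives the familiar sawtooth: a slow linear climb, then a sharp drop at each loss.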

TCP Fast Retransmit & Recovery

  • Duplicate ACKs are due to lost segments or segments out of order

  • Fast Retransmit: if the sender receives 3 duplicate ACKs (i.e. the receiver got 3 additional segments without getting the one expected):
      Send the missing segment
      Set ssthresh to 0.5*cwnd, so enter the congestion avoidance phase
      Set cwnd = (0.5*cwnd + 3), the 3 accounting for the 3 dup ACKs
      Increase cwnd by 1 segment for each further duplicate ACK
      Keep sending new data if allowed by cwnd
      Set cwnd to half the original value on a new ACK
      No need to go into "slow start" again

  • At steady state, CWND oscillates around the optimal window size

  • With a retransmission timeout, slow start is triggered again

[Figure: CWND versus time, showing slow start (exponential increase), congestion avoidance (linear increase), packet loss, and timeout followed by retransmit and slow start again]
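The fast retransmit and recovery steps can be sketched as a small state machine. This is a simplified model under stated assumptions: cwnd and ssthresh are counted in segments, the halving uses cwnd rather than the amount of data in flight, and actual segment transmission is not modelled. The class and attribute names are hypothetical.

```python
class FastRecoverySender:
    """Toy sender state; cwnd and ssthresh are counted in segments."""

    DUP_ACK_THRESHOLD = 3

    def __init__(self, cwnd):
        self.cwnd = float(cwnd)
        self.ssthresh = float("inf")
        self.dup_acks = 0
        self.in_recovery = False
        self.retransmissions = 0

    def on_duplicate_ack(self):
        self.dup_acks += 1
        if self.in_recovery:
            self.cwnd += 1                   # inflate by 1 segment per extra dup ACK
        elif self.dup_acks == self.DUP_ACK_THRESHOLD:
            self.retransmissions += 1        # fast retransmit of the missing segment
            self.ssthresh = max(self.cwnd / 2, 2)
            self.cwnd = self.ssthresh + 3    # the 3 accounts for the 3 dup ACKs
            self.in_recovery = True

    def on_new_ack(self):
        if self.in_recovery:
            self.cwnd = self.ssthresh        # deflate to half the original value
            self.in_recovery = False
        self.dup_acks = 0


if __name__ == "__main__":
    sender = FastRecoverySender(cwnd=20)
    for _ in range(4):
        sender.on_duplicate_ack()
    print("during recovery:", sender.cwnd, "ssthresh:", sender.ssthresh)
    sender.on_new_ack()
    print("after new ACK:", sender.cwnd)
```

The sender leaves recovery with cwnd equal to ssthresh, so it continues in congestion avoidance instead of restarting from slow start.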

Page 27: Summer School, Brasov, Romania, July 2005, R. Hughes-Jones Manchester 1 TCP/IP and Other Transports for High Bandwidth Applications Back to Basics Richard

Summer School Brasov Romania July 2005 R Hughes-Jones Manchester 27

TCP Simple Tuning - Filling the Pipe Remember TCP has to hold a copy of data in flight Optimal (TCP buffer) window size depends on

Bandwidth end to end ie min(BWlinks) AKA bottleneck bandwidth

Round Trip Time (RTT)

The number of bytes in flight to fill the entire path BandwidthDelay Product BDP = RTTBW Can increase bandwidth by

orders of magnitude

Windows also used for flow controlRTT

Time

Sender Receiver

ACK

Segment time on wire = bits in segmentBW

Summer School Brasov Romania July 2005 R Hughes-Jones Manchester 28

Congestion control ACK clocking

Summer School Brasov Romania July 2005 R Hughes-Jones Manchester 29

Lectures tutorials etc on TCPIP wwwnvccvaushomejoneytcp_iphtm wwwcspdxedu~jrbtcpiplectureshtml wwwraleighibmcomcgi-binbookmgrBOOKSEZ306200CCONTENTS wwwciscocomunivercdcctddocproductiaabucentri4userscf4ap1htm wwwcisohio-stateeduhtbinrfcrfc1180html wwwjbmelectronicscomtcphtm

Encylopaedia httpwwwfreesoftorgCIEindexhtm

TCPIP Resources wwwprivateorgiltcpip_rlhtml

Understanding IP addresses httpwww3comcomsolutionsen_USncs501302html

Configuring TCP (RFC 1122) ftpnicmeriteduinternetdocumentsrfcrfc1122txt

Assigned protocols ports etc (RFC 1010) httpwwwesnetpubrfcsrfc1010txt amp etcprotocols

More Information

Summer School Brasov Romania July 2005 R Hughes-Jones Manchester 30

Any Questions

Summer School Brasov Romania July 2005 R Hughes-Jones Manchester 31

Backup Slides

Summer School Brasov Romania July 2005 R Hughes-Jones Manchester 32

More Information: Some URLs

UKLight web site: http://www.uklight.ac.uk
MB-NG project web site: http://www.mb-ng.net/
DataTAG project web site: http://www.datatag.org/
UDPmon / TCPmon kit + writeup: http://www.hep.man.ac.uk/~rich/net
Motherboard and NIC Tests:
  http://www.hep.man.ac.uk/~rich/net/nic/GigEth_tests_Boston.ppt & http://datatag.web.cern.ch/datatag/pfldnet2003/
  "Performance of 1 and 10 Gigabit Ethernet Cards with Server Quality Motherboards", FGCS Special issue 2004, http://www.hep.man.ac.uk/~rich/
TCP tuning information may be found at:
  http://www.ncne.nlanr.net/documentation/faq/performance.html & http://www.psc.edu/networking/perf_tune.html
TCP stack comparisons: "Evaluation of Advanced TCP Stacks on Fast Long-Distance Production Networks", Journal of Grid Computing 2004
PFLDnet: http://www.ens-lyon.fr/LIP/RESO/pfldnet2005/
Dante PERT: http://www.geant2.net/server/show/nav.00d00h002

tcpdump / tcptrace

tcpdump: dump all TCP header information for a specified source/destination
  ftp://ftp.ee.lbl.gov/
tcptrace: format tcpdump output for analysis using xplot
  http://www.tcptrace.org/
NLANR TCP Testrig: nice wrapper for tcpdump and tcptrace tools
  http://www.ncne.nlanr.net/TCP/testrig/
Sample use:
  tcpdump -s 100 -w /tmp/tcpdump.out host hostname
  tcptrace -Sl /tmp/tcpdump.out
  xplot /tmp/a2b_tsg.xpl

tcptrace and xplot

X axis is time.
Y axis is sequence number.
The slope of this curve gives the throughput over time.
The xplot tool makes it easy to zoom in.
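
To make the "slope gives the throughput" reading concrete, here is a minimal sketch that takes (time, sequence number) points such as those plotted in a time-sequence graph and returns the slope of each interval in bit/s. The function name and the sample data are hypothetical, and the sketch assumes sequence numbers count bytes and do not wrap around.

```python
from typing import List, Tuple


def throughput_bps(samples: List[Tuple[float, int]]) -> List[float]:
    """Slope between consecutive (time_s, sequence_number_bytes) points, in bit/s."""
    rates = []
    for (t0, s0), (t1, s1) in zip(samples, samples[1:]):
        dt = t1 - t0
        if dt <= 0:
            continue  # skip points whose timestamp does not advance
        rates.append((s1 - s0) * 8 / dt)
    return rates


if __name__ == "__main__":
    # Hypothetical trace: 125000 bytes in the first second, 250000 in the next.
    trace = [(0.0, 0), (1.0, 125000), (2.0, 375000)]
    for rate in throughput_bps(trace):
        print(f"{rate / 1e6:.1f} Mbit/s")
```

A steeper segment of the curve therefore corresponds to a higher transfer rate, and a flat segment to a stall.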

Zoomed In View

Green Line: ACK values received from the receiver
Yellow Line: tracks the receive window advertised from the receiver
Green Ticks: track the duplicate ACKs received
Yellow Ticks: track the window advertisements that were the same as the last advertisement
White Arrows: represent segments sent
Red Arrows (R): represent retransmitted segments

TCP Slow Start

  • TCP/IP and Other Transports for High Bandwidth Applications Back to Basics
  • Slide 2
  • Slide 3
  • What is a Protocol Stack
  • The Layering Principle
  • What do the Layers do
  • How do the "IP" Protocols fit together
  • Some of the "IP" Protocols
  • The Physical Layer 1 Ethernet
  • The Link Layer 2 Ethernet Frame
  • Slide 11
  • The Network Layer 3 IP
  • The Internet datagram
  • IP Datagram Format (cont)
  • Internet Class-based addresses
  • The Transport Layer 4 UDP
  • UDP Datagram format
  • The Transport Layer 4 TCP
  • The TCP Segment Format
  • TCP Segment Format - cont
  • TCP - providing reliability
  • Flow Control Sender - Congestion Window
  • Flow Control Receiver - Lost Data
  • How it works TCP Slowstart
  • Slide 25
  • TCP Fast Retransmit & Recovery
  • TCP Simple Tuning - Filling the Pipe
  • Congestion control ACK clocking
  • More Information
  • Slide 30
  • Slide 31
  • More Information Some URLs
  • tcpdump tcptrace
  • tcptrace and xplot
  • Zoomed In View
  • TCP Slow Start