
Solving TCP Incast (and more) With Aggressive TCP Timeouts

Vijay Vasudevan, Amar Phanishayee, Hiral Shah, Elie Krevat, David Andersen, Greg Ganger, Garth Gibson, Brian Mueller*

Carnegie Mellon University, *Panasas Inc.

PDL Retreat 2009

2

Cluster-based Storage Systems

[Diagram: a client connected to storage servers through a commodity Ethernet switch]

Ethernet: 1-10 Gbps

Round Trip Time (RTT): 10-100µs

3

Cluster-based Storage Systems

[Diagram: synchronized read. The client sends requests through the switch to 4 storage servers; each server holds one Server Request Unit (SRU) of the data block. The servers' responses fan in at the client's switch port, and the client sends the next batch of requests only after the entire data block (SRUs 1-4) has arrived.]

4

Synchronized Read Setup

• Test on an Ethernet-based storage cluster

• Client performs synchronized reads

• Increase # of servers involved in the transfer

• Data block size is fixed (FS read)

• TCP used as the data transfer protocol
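To make the barrier-synchronized access pattern concrete, below is a minimal client sketch in C. It is not the test harness used in these experiments, and the server addresses, port, and SRU size are hypothetical placeholders. The client sends a request to every server first, then blocks until it has read a full SRU from each one, so all servers transmit into the switch at roughly the same time, which is the fan-in that triggers incast.

```c
/* Minimal synchronized-read client sketch (illustration only).
 * Assumptions: each server listens on PORT and, upon receiving a
 * one-byte request, sends back SRU_SIZE bytes. Addresses, port, and
 * SRU size are made-up placeholders, not the paper's harness. */
#include <arpa/inet.h>
#include <netinet/in.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/socket.h>
#include <unistd.h>

#define NUM_SERVERS 4
#define SRU_SIZE    (256 * 1024)   /* per-server share of the data block */
#define PORT        9000

static const char *servers[NUM_SERVERS] = {
    "10.0.0.1", "10.0.0.2", "10.0.0.3", "10.0.0.4"
};

int main(void)
{
    int fd[NUM_SERVERS];
    char *buf = malloc(SRU_SIZE);

    /* Connect to every storage server up front. */
    for (int i = 0; i < NUM_SERVERS; i++) {
        struct sockaddr_in addr = { .sin_family = AF_INET,
                                    .sin_port   = htons(PORT) };
        inet_pton(AF_INET, servers[i], &addr.sin_addr);
        fd[i] = socket(AF_INET, SOCK_STREAM, 0);
        if (fd[i] < 0 ||
            connect(fd[i], (struct sockaddr *)&addr, sizeof(addr)) < 0) {
            perror("connect");
            return 1;
        }
    }

    for (int block = 0; block < 100; block++) {
        /* Issue the whole batch of requests first, so every server
         * starts sending its SRU at about the same time. */
        for (int i = 0; i < NUM_SERVERS; i++)
            write(fd[i], "R", 1);

        /* Barrier: the next batch is not requested until *all* servers
         * have delivered their SRU, so one timed-out flow idles the
         * client's link. */
        for (int i = 0; i < NUM_SERVERS; i++) {
            size_t got = 0;
            while (got < SRU_SIZE) {
                ssize_t n = read(fd[i], buf + got, SRU_SIZE - got);
                if (n <= 0) { perror("read"); return 1; }
                got += (size_t)n;
            }
        }
    }
    free(buf);
    return 0;
}
```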

5

TCP Throughput Collapse

[Graph: goodput vs. number of servers; throughput collapses sharply as servers are added. Cluster setup: 1 Gbps Ethernet, unmodified TCP, S50 switch, 1 MB block size.]

• TCP Incast
• Cause of throughput collapse: coarse-grained TCP timeouts

6

Solution: µsecond TCP + no minRTO

[Graph: throughput (Mbps) vs. number of servers, comparing unmodified TCP with our solution. Unmodified TCP collapses as more servers are added; our solution sustains high throughput.]

High throughput for up to 47 servers

Simulation scales to thousands of servers

7

Overview

• Problem: Coarse-grained TCP timeouts (200ms) too expensive for datacenter applications

• Solution: microsecond-granularity timeouts
  • Improves datacenter application throughput & latency
  • Also safe for use in the wide-area (Internet)

8

Outline

• Overview

Why are TCP timeouts expensive?

• How do coarse-grained timeouts affect apps?

• Solution: Microsecond TCP Retransmissions

• Is the solution safe?

9

TCP: data-driven loss recovery

[Diagram: sender transmits packets 1-5; packet 2 is lost. As packets 3, 4, and 5 arrive, the receiver returns duplicate ACKs for packet 1. After 3 duplicate ACKs the sender infers that packet 2 was probably lost and retransmits it immediately; the receiver then ACKs packet 5.]

In datacenters, data-driven recovery completes within microseconds of a loss.

10

TCP: timeout-driven loss recovery

[Diagram: sender transmits packets 1-5, but only packet 1 reaches the receiver, which ACKs it. With no duplicate ACKs to trigger data-driven recovery, the sender must wait for the Retransmission Timeout (RTO) before retransmitting.]

Timeouts are expensive (milliseconds to recover after a loss)

11

TCP: Loss recovery comparison

[Diagram: side-by-side comparison. Left: data-driven recovery, where duplicate ACKs let the sender retransmit packet 2 immediately and receive Ack 5. Right: timeout-driven recovery, where the sender must wait out the Retransmission Timeout (RTO) before retransmitting.]

Timeout-driven recovery is slow (ms)

Data-driven recovery is super fast (µs) in datacenters
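As a rough illustration of the two recovery paths, here is a small, self-contained C sketch (not the kernel's actual code) of the sender-side logic: duplicate ACKs trigger an immediate retransmission after the third duplicate, while a tail loss that produces no duplicate ACKs can only be repaired when the retransmission timer fires.

```c
/* Illustrative sketch of TCP's two loss-recovery paths. */
#include <stdbool.h>
#include <stdio.h>

#define DUPACK_THRESHOLD 3

struct sender {
    unsigned last_ack;   /* highest cumulative ACK seen            */
    unsigned dup_acks;   /* duplicate ACKs for that sequence number */
};

/* Data-driven path: called for every incoming ACK. Returns true when the
 * sender should fast-retransmit the first unacknowledged packet. */
bool on_ack(struct sender *s, unsigned ack)
{
    if (ack > s->last_ack) {          /* new data acknowledged */
        s->last_ack = ack;
        s->dup_acks = 0;
        return false;
    }
    /* Same cumulative ACK again: a later packet arrived out of order,
     * so an earlier one is probably missing. */
    return ++s->dup_acks >= DUPACK_THRESHOLD;
}

/* Timeout-driven path: called only when the retransmission timer fires.
 * With a 200 ms minRTO this is roughly a thousand 100 µs datacenter RTTs. */
void on_rto(struct sender *s)
{
    printf("RTO fired: retransmit packet %u after a long idle period\n",
           s->last_ack + 1);
}

int main(void)
{
    struct sender s = { .last_ack = 1, .dup_acks = 0 };

    /* Packet 2 lost; packets 3, 4, 5 each elicit another "Ack 1". */
    for (int i = 0; i < 3; i++)
        if (on_ack(&s, 1))
            printf("3 duplicate ACKs: fast-retransmit packet %u\n",
                   s.last_ack + 1);

    /* If instead the tail of the window is lost, no duplicate ACKs arrive
     * and only the timer can recover: */
    on_rto(&s);
    return 0;
}
```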

12

RTO Estimation and Minimum Bound

• Jacobson's TCP RTO estimator:
  RTO_estimated = SRTT + (4 × RTTVAR)

• Actual RTO = max(minRTO, RTO_estimated)

• Minimum RTO bound (minRTO) = 200ms
  • TCP timer granularity
  • Safety (Allman99)

• minRTO (200ms) >> datacenter RTT (100µs)
• 1 TCP timeout lasts 1000 datacenter RTTs!
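A small worked sketch of the formulas above in C, using illustrative datacenter numbers (SRTT = 100 µs, RTTVAR = 10 µs; example values, not measurements from this work). With the 200 ms floor, the computed RTO is dominated entirely by minRTO.

```c
/* Sketch of the RTO computation described above (values in microseconds). */
#include <stdio.h>

/* RTO_estimated = SRTT + 4 * RTTVAR, clamped below by minRTO. */
static double rto(double srtt_us, double rttvar_us, double min_rto_us)
{
    double estimated = srtt_us + 4.0 * rttvar_us;
    return estimated > min_rto_us ? estimated : min_rto_us;
}

int main(void)
{
    double srtt = 100.0, rttvar = 10.0;   /* example datacenter values */

    /* Standard 200 ms floor: the 140 µs estimate is ignored, and a single
     * timeout idles the flow for on the order of a thousand datacenter RTTs. */
    printf("minRTO=200ms -> RTO = %.0f us\n", rto(srtt, rttvar, 200000.0));

    /* No floor with microsecond tracking: RTO follows the actual RTT. */
    printf("minRTO=0     -> RTO = %.0f us\n", rto(srtt, rttvar, 0.0));
    return 0;
}
```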

13

Outline

• Overview

• Why are TCP timeouts expensive?

How do coarse-grained timeouts affect apps?

• Solution: Microsecond TCP Retransmissions

• Is the solution safe?

14

Single Flow TCP Request-Response

[Diagram: a client sends a request through the switch to a server; the server's response is dropped. The client receives the retransmitted response only after the server's 200 ms retransmission timeout expires.]

15

Apps Sensitive to 200ms Timeouts

• Single flow request-response
  • Latency-sensitive applications

• Barrier-synchronized workloads
  • Parallel Cluster File Systems
    – Throughput-intensive
  • Search: multi-server queries
    – Latency-sensitive

16

Link Idle Time Due To Timeouts

[Diagram: synchronized read where the response from server 4 is dropped. Servers 1-3 finish sending their Server Request Units (SRUs), and the client's link then sits idle until server 4's retransmission timeout fires and the response is resent.]

17

Client Link Utilization

[Graph: client link utilization over time; the link sits idle for the 200 ms timeout period.]

18

200ms timeouts → Throughput Collapse

• [Nagle04] called this Incast
  • Provided application-level solutions
  • Cause of throughput collapse: TCP timeouts

• [FAST08] Search for network-level solutions to TCP Incast

[Graph: goodput vs. number of servers, showing the collapse. Cluster setup: 1 Gbps Ethernet, 200 ms minRTO, S50 switch, 1 MB block size.]

19-22

Results from our previous work (FAST08)

Network-level solutions and their results / conclusions:

• Increase switch buffer size: delays throughput collapse, but collapse is still inevitable; expensive

• Alternate TCP implementations (avoiding timeouts, aggressive data-driven recovery, disabling slow start): throughput collapse inevitable because timeouts are inevitable (complete window loss is a common case)

• Ethernet flow control: effective, but of limited use (works only for simple topologies; causes head-of-line blocking)

• Reducing minRTO (in simulation): very effective, but raises implementation concerns (µs timers for the OS and TCP) and safety concerns

23

Outline

• Overview

• Why are TCP timeouts expensive?

• How do coarse-grained timeouts affect apps?

Solution: Microsecond TCP Retransmissions
• ... and eliminate minRTO

• Is the solution safe?

24

µsecond Retransmission Timeouts (RTO)

RTO = max( minRTO, f(RTT) )

• minRTO: currently 200ms. Lower it to 200µs? Remove it entirely (0)?
• f(RTT): RTT is currently tracked in milliseconds. Instead, track RTT in microseconds.

25

Lowering minRTO to 1ms

• Lower minRTO to as low a value as possible without changing the timers or the TCP implementation

• Simple one-line change to Linux

• Uses low-resolution 1ms kernel timers
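For reference, a sketch of what such a one-line change looks like. This is illustrative, not the authors' exact patch: in Linux kernels of that era the 200 ms floor comes from the TCP_RTO_MIN macro in include/net/tcp.h, defined in jiffies relative to the timer tick rate HZ, so lowering the floor to roughly 1 ms (one jiffy with HZ = 1000) is an edit to that single #define. The snippet below only evaluates the before/after expressions so it can be compiled and run on its own.

```c
/* Illustrative sketch of the minRTO floor as expressed in Linux
 * (include/net/tcp.h); exact names and values vary by kernel version. */
#include <stdio.h>

#define HZ 1000   /* typical timer tick rate; one jiffy = 1 ms */

#define TCP_RTO_MIN_STOCK   ((unsigned)(HZ / 5))     /* 200 jiffies = 200 ms */
#define TCP_RTO_MIN_LOWERED ((unsigned)(HZ / 1000))  /*   1 jiffy   =   1 ms */

int main(void)
{
    /* In the kernel, the change is editing the one #define from HZ/5 to
     * HZ/1000; the low-resolution jiffy timer still limits how far the
     * floor can be lowered, hence the microsecond timers that follow. */
    printf("stock minRTO:   %u ms\n", TCP_RTO_MIN_STOCK);
    printf("lowered minRTO: %u ms\n", TCP_RTO_MIN_LOWERED);
    return 0;
}
```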

26

Default minRTO: Throughput Collapse

[Graph: goodput vs. number of servers for unmodified TCP (200 ms minRTO), showing throughput collapse.]

27

Lowering minRTO to 1ms helps

[Graph: goodput vs. number of servers for unmodified TCP (200 ms minRTO) and for a 1 ms minRTO. The 1 ms floor helps, but millisecond retransmissions are not enough.]

28

Requirements for µsecond RTO

• TCP must track RTT in microseconds
  • Modify internal data structures
  • Reuse the TCP timestamp option

• Efficient high-resolution kernel timers
  • Use the HPET for efficient interrupt signaling
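The kernel changes themselves are not shown on the slides. As a loose, user-space illustration of what microsecond-granularity timers look like (the actual implementation modifies the kernel's TCP retransmission timer on top of its high-resolution timer infrastructure), the sketch below arms a one-shot 200 µs timer with Linux's timerfd interface, a timeout that a jiffy-based (1-10 ms) timer wheel cannot express.

```c
/* User-space illustration of a microsecond-granularity one-shot timer.
 * The paper's change lives inside the kernel's TCP retransmission timer;
 * this only shows that sub-millisecond timeouts need high-resolution
 * timers rather than jiffy-based ones. */
#include <stdint.h>
#include <stdio.h>
#include <sys/timerfd.h>
#include <unistd.h>

int main(void)
{
    int tfd = timerfd_create(CLOCK_MONOTONIC, 0);
    if (tfd < 0) { perror("timerfd_create"); return 1; }

    /* Arm a one-shot timeout of 200 microseconds (200,000 ns). */
    struct itimerspec its = {
        .it_value = { .tv_sec = 0, .tv_nsec = 200 * 1000 },
    };
    if (timerfd_settime(tfd, 0, &its, NULL) < 0) {
        perror("timerfd_settime");
        return 1;
    }

    /* Block until the timer fires, then read the expiration count. */
    uint64_t expirations;
    if (read(tfd, &expirations, sizeof(expirations)) != sizeof(expirations)) {
        perror("read");
        return 1;
    }
    printf("200 us timer fired (%llu expiration)\n",
           (unsigned long long)expirations);

    close(tfd);
    return 0;
}
```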

29

Solution: µsecond TCP + no minRTO

[Graph: goodput vs. number of servers for unmodified TCP (200 ms minRTO), 1 ms minRTO, and microsecond TCP with no minRTO.]

High throughput for up to 47 servers

30

Simulation: Scaling to thousands

[Graph: simulated throughput as the number of servers scales to thousands. Parameters: block size = 80 MB, buffer = 32 KB, RTT = 20 µs.]

31

Synchronized Retransmissions At Scale

Simultaneous retransmissions → successive timeouts

Successive RTO = RTO × 2^backoff

32

Simulation: Scaling to thousands

Desynchronize retransmissions to scale further

Successive RTO = (RTO + rand(0.5) × RTO) × 2^backoff

For use within datacenters only
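A small sketch of the two backoff rules, reading the slide's rand(0.5) as a uniform random value in [0, 0.5] (my interpretation of the notation): the randomized variant spreads out retransmissions from flows that timed out together, so they stop colliding on successive attempts.

```c
/* Sketch comparing standard exponential backoff with the randomized
 * (desynchronized) variant above. rand(0.5) is read as a uniform value
 * in [0, 0.5]; units are microseconds. */
#include <stdio.h>
#include <stdlib.h>

/* Standard: every flow that timed out together retransmits together. */
static double backoff(double base_rto_us, int attempts)
{
    return base_rto_us * (double)(1u << attempts);   /* RTO * 2^backoff */
}

/* Desynchronized: add up to 50% random slack before doubling, so flows
 * sharing a bottleneck drift apart on successive retries. */
static double backoff_random(double base_rto_us, int attempts)
{
    double slack = ((double)rand() / RAND_MAX) * 0.5;      /* rand(0.5) */
    return base_rto_us * (1.0 + slack) * (double)(1u << attempts);
}

int main(void)
{
    double base = 200.0;   /* example base RTO of 200 us */
    for (int k = 0; k < 4; k++)
        printf("attempt %d: fixed %.0f us, randomized %.0f us\n",
               k, backoff(base, k), backoff_random(base, k));
    return 0;
}
```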

33

Outline

• Overview

• Why are TCP timeouts expensive?

• The Incast Workload

• Solution: Microsecond TCP Retransmissions

Is the solution safe?
• Interaction with Delayed-ACK within datacenters
• Performance in the wide-area

34

Delayed-ACK (for RTO > 40ms)

Delayed ACK: an optimization to reduce the number of ACKs sent

[Diagram, three cases. (1) A single segment arrives: the receiver holds the ACK for up to 40 ms before sending Ack 1. (2) Two segments arrive: the receiver immediately acknowledges both with Ack 2. (3) A segment arrives out of order: the receiver sends an ACK for the last in-order data (here, Ack 0) immediately rather than delaying.]

35

µsecond RTO and Delayed-ACK

Premature timeout: the sender's RTO fires before the receiver's delayed ACK

[Diagram, two cases. RTO < 40 ms: the sender times out and needlessly retransmits packet 1 while the receiver is still holding its delayed ACK. RTO > 40 ms: the delayed ACK (sent after at most 40 ms) arrives before the retransmission timer fires, so there is no spurious retransmission.]
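One practical way to sidestep this interaction, not shown on these slides, is to disable delayed ACKs on the receiving socket. The hedged sketch below uses Linux's TCP_QUICKACK socket option; this is a per-socket mitigation an application can apply, not the kernel change described in this talk, and quickack mode is not sticky, so real code typically re-enables it around receives.

```c
/* Sketch: ask the receiver to ACK immediately instead of delaying up to
 * 40 ms. TCP_QUICKACK is a real Linux option but is not permanent; the
 * kernel may leave quickack mode, so applications re-set it as needed. */
#include <netinet/in.h>
#include <netinet/tcp.h>
#include <stdio.h>
#include <sys/socket.h>

static int disable_delayed_ack(int sockfd)
{
    int one = 1;
    if (setsockopt(sockfd, IPPROTO_TCP, TCP_QUICKACK, &one, sizeof(one)) < 0) {
        perror("setsockopt(TCP_QUICKACK)");
        return -1;
    }
    return 0;
}

int main(void)
{
    int fd = socket(AF_INET, SOCK_STREAM, 0);
    if (fd < 0 || disable_delayed_ack(fd) < 0)
        return 1;
    /* ... connect()/recv() as usual; re-set TCP_QUICKACK after receives ... */
    return 0;
}
```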

36

Impact of Delayed-ACK

37

Is it safe for the wide-area?

• Stability: could we cause congestion collapse?
  • No: wide-area RTOs are in the 10s to 100s of ms
  • No: timeouts result in rediscovering link capacity (slowing down the rate of transfer)

• Performance: do we time out unnecessarily?
  • [Allman99] Reducing minRTO increases the chance of premature timeouts
    – Premature timeouts slow the transfer rate
  • Today: detect and recover from premature timeouts
  • Wide-area experiments to determine the performance impact

38

Wide-area Experiment

Do microsecond timeouts harm wide-area throughput?

[Diagram: wide-area experiment. Hosts running microsecond TCP with no minRTO and hosts running standard TCP participate in a BitTorrent swarm of seeds and clients across the Internet.]

39

Wide-area Experiment: Results

No noticeable difference in throughput

40

Conclusion

• Microsecond granularity TCP timeouts (with no minRTO) improve datacenter application response time and throughput

• Safe for wide-area communication

• Linux patch: http://www.cs.cmu.edu/~vrv/incast/

• Code (simulation, cluster) and scripts: http://www.cs.cmu.edu/~amarp/dist/incast/incast_1.1.tar.gz
