
Power and Performance Considerations in the Data Center

Partha Kundu, Sr. Distinguished Engineer, Corporate CTO Office

 

Keynote
Intl. Green Computing Conf. 2013, 6/27/2013

Data center computing is a new paradigm!


Outline of talk

• Power & Energy in Data Centers
• Performance at Scale
• Network Variability & Performance
• Conclusions




• Power delivery and cooling overheads are quantified in the PUE metric.
• Historically, cooling has been the most significant source of energy inefficiency.

Data Center Energy breakdown (Source: ASHRAE 2007)
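As a quick illustration of the PUE arithmetic (total facility power divided by the power delivered to IT equipment), here is a minimal sketch with assumed, purely hypothetical facility numbers:

# Minimal PUE illustration; all facility numbers below are assumed for illustration.
it_power_kw = 1000.0          # IT equipment load (servers, storage, network gear)
cooling_kw = 500.0            # cooling overhead
power_delivery_kw = 150.0     # UPS and power-distribution losses

total_facility_kw = it_power_kw + cooling_kw + power_delivery_kw
pue = total_facility_kw / it_power_kw
print(f"PUE = {pue:.2f}")     # 1.65 for these assumed numbers; 1.0 would be ideal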


• PUE has been falling over the years.
• But PUE is also frequently reported incorrectly.

Source: Google


Energy Efficiency

Most of the time, server load is around 30%.

But a server is least energy efficient in its most common operating region!

Source: Barroso, Hölzle: The Datacenter as a Computer, Morgan & Claypool Publishers, 2009

Servers are never completely idle.



Dynamic Power Range

The CPU power component (peak and idle) in servers has come down over the years.

Dynamic power range:
• CPU power range is about 3x for servers
• DRAM range is about 2x
• Disk and networking are < 1.2x

Disks and network switches need to learn from the CPU's power-proportionality gains.

Source: Barroso, Hölzle: The Datacenter as a Computer, Morgan & Claypool Publishers, 2009
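To see why a narrow dynamic range hurts, a common first-order approximation is a linear power model between idle and peak. The wattages below are hypothetical (roughly the 3x CPU range quoted above), and "relative efficiency" is work per watt normalized to a fully loaded server:

def server_power(util, p_idle, p_peak):
    """Assumed linear power model: idle power plus a utilization-proportional part."""
    return p_idle + (p_peak - p_idle) * util

p_idle, p_peak = 100.0, 300.0                    # hypothetical watts, ~3x dynamic range
for util in (0.1, 0.3, 0.5, 1.0):
    power = server_power(util, p_idle, p_peak)
    relative_efficiency = (util / power) / (1.0 / p_peak)   # work per watt vs. full load
    print(f"util={util:.0%}  power={power:.0f} W  relative efficiency={relative_efficiency:.2f}")

At the common 30% operating point this simple model already puts the server at barely half the efficiency it has at full load, which is the slide's point.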



Energy Proportionality

Goal: achieve the best energy efficiency (~80%) in the common operating regions (20–30% load).

Challenges to proportionality:
• Most proportionality tricks used in embedded/mobile devices are unusable in the data center because of huge activation penalties.
• The distributed structure of data and applications does not allow powering down during periods of low use.
• Disk drives spin more than 50% of the time even when there is no activity.
• SSDs help, but beware of wear-leveling and garbage-collection overheads.


Dynamic Resource Requirements in the Data Center

Intra-server variation (TPC-H, log scale); inter-server variation (rendering farm)

[Charts: server memory allocation per TPC-H query (Q1–Q12, log scale from 0.1 MB to 100 GB), and memory allocation over time across the servers of a rendering farm.]

Huge variations even within a single application running in a large cluster.




 

   


[Diagram: conventional blade systems, where each blade pairs its CPUs with its own dedicated DIMMs across a shared backplane.]

Conventional blade systems

Motivating disaggregated memory*
*Lim et al.: Disaggregated Memory for Expansion and Sharing in Blade Servers, ISCA 2009


Disaggregated Memory

⇒ Break CPU–memory co-location
⇒ Leverage fast, shared communication fabrics

Blade systems with disaggregated memory

[Diagram: blade systems with disaggregated memory. Each blade keeps fewer local DIMMs, while a shared memory blade full of DIMMs hangs off the backplane and provides extra capacity to all blades.]



The authors claim:
• Up to 8x improvement in memory-constrained environments
• 80+% improvement in performance per dollar
• 3x consolidation
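A rough way to see the trade-off is an average-access-latency model: local DIMMs serve most accesses, and the shared memory blade across the backplane serves the rest. The latencies and hit fractions below are assumptions for illustration, not measurements from the paper:

def avg_access_ns(local_fraction, local_ns=100.0, remote_ns=2000.0):
    """Average memory access latency when `local_fraction` of accesses hit local DIMMs."""
    return local_fraction * local_ns + (1.0 - local_fraction) * remote_ns

for local_fraction in (1.00, 0.99, 0.95, 0.90):
    print(f"{local_fraction:.0%} local -> {avg_access_ns(local_fraction):.0f} ns average")

As long as the large majority of accesses stay on the local DIMMs, the remote blade adds capacity at a modest average-latency cost.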


Performance at Scale



A Typical Facebook Page

Modern pages have many components.


Creating a Page

[Diagram: a page request flows from the Internet into the datacenter network, where the front end fans it out to backend services (News Feed, Search, Ads, Chat), each made up of many servers.]

User-facing online services: two common underlying themes

1. Soft real-time nature
• Online services have SLAs baked into their operation.
• Example: a 300 ms response time for 99.9% of requests.
• Impact of breached SLAs: for Amazon, an extra 100 ms costs 1% in sales.

2. Partition-aggregate workflow
• [Diagram: a request enters at a root aggregator, fans out through intermediate aggregators to workers, and the responses are aggregated back up.]
• Parallelism when handling a query; parallelism when serving a page.

It’s  all  very  embarrassingly  parallel!  


Excerpt from "The Tail at Scale" (Communications of the ACM, February 2013):

systems with shared computational resources exhibit performance fluctuations beyond the control of application developers. Google has therefore found it advantageous to develop tail-tolerant techniques that mask or work around temporary latency pathologies, instead of trying to eliminate them altogether. We separate these techniques into two main classes: The first corresponds to within-request immediate-response techniques that operate at a time scale of tens of milliseconds, before longer-term techniques have a chance to react. The second consists of cross-request long-term adaptations that perform on a time scale of tens of seconds to minutes and are meant to mask the effect of longer-term phenomena.

Within Request Short-Term Adaptations

A broad class of Web services deploy multiple replicas of data items to provide additional throughput capacity and maintain availability in the presence of failures. This approach is particularly effective when most requests operate on largely read-only, loosely consistent datasets; an example is a spelling-correction service that has its model updated once a day while handling thousands of correction requests per second. Similarly, distributed file systems may have multiple replicas of a given data chunk that can all be used to service read requests. The techniques here show how replication can also be used to reduce latency variability within a single higher-level request:

Hedged requests. A simple way to curb latency variability is to issue the same request to multiple replicas and use the results from whichever replica responds first. We term such requests "hedged requests" because a client first sends one request to the replica believed to be the most appropriate, but then falls back on sending a secondary request after some brief delay. The client cancels remaining outstanding requests once the first result is received. Although naive implementations of this technique typically add unacceptable additional load, many variations exist that give most of the latency-reduction effects while increasing load only modestly.

One such approach is to defer sending a secondary request until the first request has been outstanding for more than the 95th-percentile expected latency for this class of requests. This approach limits the additional load to approximately 5% while substantially shortening the latency tail. The technique works because the source of latency is often not inherent in the particular request but rather due to other forms of interference. For example, in a Google benchmark that reads the values for 1,000 keys stored in a BigTable table distributed across 100 different servers, sending a hedging request after a 10ms delay reduces the 99.9th-percentile latency for retrieving all 1,000 values from 1,800ms to 74ms while sending just 2% more requests. The overhead of hedged requests can be further reduced by tagging them as lower priority than the primary requests.
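A minimal sketch of this pattern in Python, assuming a hypothetical fetch_from_replica RPC and a 10 ms 95th-percentile budget: the client gives the preferred replica the full budget, then issues a single backup request and cancels whichever copy is still outstanding once one of them answers.

import asyncio

async def fetch_from_replica(replica, key):
    """Placeholder for an RPC to one replica; returns the value stored under `key`."""
    ...

async def hedged_get(replicas, key, p95_delay_s=0.010):
    primary = asyncio.create_task(fetch_from_replica(replicas[0], key))
    done, _ = await asyncio.wait({primary}, timeout=p95_delay_s)
    if done:
        return primary.result()                      # fast path: no hedge needed
    # Primary is slow: hedge with one secondary request, use whichever finishes first.
    secondary = asyncio.create_task(fetch_from_replica(replicas[1], key))
    done, pending = await asyncio.wait({primary, secondary},
                                       return_when=asyncio.FIRST_COMPLETED)
    for task in pending:
        task.cancel()                                # cancel the remaining outstanding request
    return done.pop().result()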

Tied requests. The hedged-requests technique also has a window of vulnerability in which multiple servers can execute the same request unnecessarily. That extra work can be capped by waiting for the 95th-percentile expected latency before issuing the hedged request, but this approach limits the benefits to only a small fraction of requests. Permitting more aggressive use of hedged requests with moderate resource consumption requires faster cancellation of requests.

A common source of variability is queueing delays on the server before a request begins execution. For many services, once a request is actually scheduled and begins execution, the variability of its completion time goes down substantially. Mitzenmacher [10] said allowing a client to choose between two servers based on queue lengths at enqueue time exponentially improves load-balancing performance over a uniform random scheme.

[Figure: probability of a one-second service-level response time as the system scales and the frequency of server-level high-latency outliers varies (1 in 100, 1 in 1,000, 1 in 10,000). X-axis: number of servers (1 to 2,000); Y-axis: P(service latency > 1s). With 1-in-100 outliers the probability reaches 0.63 at a 100-server fan-out; even with 1-in-10,000 outliers it reaches 0.18 at 2,000 servers.]
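The curves follow from simple probability: if each server independently produces a high-latency outlier with probability p, a request that must wait for N leaves is slow whenever any one of them is, so P = 1 - (1 - p)^N. The two annotated points are easy to reproduce:

def p_slow(n_servers, p_outlier):
    # Probability that at least one of n_servers parallel leaf requests is an outlier.
    return 1.0 - (1.0 - p_outlier) ** n_servers

print(f"{p_slow(100, 1/100):.2f}")      # ~0.63: 1-in-100 outliers, 100-way fan-out
print(f"{p_slow(2000, 1/10_000):.2f}")  # ~0.18: 1-in-10,000 outliers, 2,000-way fan-out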

Table 1. Individual-leaf-request finishing times for a large fan-out service tree (measured from the root node of the tree).

                                     50%ile latency   95%ile latency   99%ile latency
One random leaf finishes                  1ms              5ms             10ms
95% of all leaf requests finish          12ms             32ms             70ms
100% of all leaf requests finish         40ms             87ms            140ms


The last 5% of requests double the total response time.

Until it's not!

Source: The Tail at Scale, Dean, Communications of the ACM, February 2013

Consider a system with:
• Median latency 10ms



Causes of Performance Variability

• Shared resources: local shared resources, global shared resources
• Maintenance activities: logging, data reconstruction, garbage collection
• Queuing: network, server queuing
• Power & energy management: activation delay


Reducing Variability

• Differentiated service classes: assign SLAs at the application and component level.
• Multiple queues (see the sketch below).
• Reduce head-of-line blocking: mitigate by breaking work into sub-flows and allowing interleaved processing of flows.
• Manage disruptions: synchronize housekeeping and system maintenance.
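As a minimal sketch of the first two items, a strict-priority scheduler can drain tighter-SLA classes first. The class names and priority numbers here are illustrative assumptions, not part of any particular system:

import heapq
import itertools

class ClassedQueue:
    """Multiple logical queues folded into one heap, ordered by service class then arrival."""
    def __init__(self):
        self._heap = []
        self._arrival = itertools.count()    # preserves FIFO order within a class

    def enqueue(self, request, service_class):
        # Lower number = higher priority (e.g., 0 = interactive, 2 = batch).
        heapq.heappush(self._heap, (service_class, next(self._arrival), request))

    def dequeue(self):
        _, _, request = heapq.heappop(self._heap)
        return request

q = ClassedQueue()
q.enqueue("user-facing query", service_class=0)
q.enqueue("log-shipping chunk", service_class=2)
q.enqueue("index update", service_class=1)
print(q.dequeue())   # the user-facing query is served first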


Tolerating Variability

Within-request short-term adaptations:
• Hedged requests: wait the "95th-percentile time" before launching (redundant) requests to replica processing agents.
• Tied requests: enqueue the request in multiple replicas but cancel the duplicates when one copy starts executing (see the sketch below).

Cross-request long-term adaptations (e.g., for load imbalance):
• Micro-partitioning: the unit of compute is further partitioned into small grains for better load balance and recovery from faults.
• Selective replication: replicate dynamically to handle burst traffic.
• Latency-induced probation: record slow machines and temporarily disbar them from the active pool.
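A minimal sketch of the tied-request idea, using a hypothetical Replica API: the same request is enqueued at two replicas, each copy tagged with its counterpart, and whichever copy starts executing first tells the other replica to drop its queued copy.

class Replica:
    def __init__(self, name):
        self.name = name
        self.queue = []                            # queued (request_id, peer) pairs

    def enqueue(self, request_id, peer):
        self.queue.append((request_id, peer))

    def cancel(self, request_id):
        self.queue = [(rid, p) for rid, p in self.queue if rid != request_id]

    def start_next(self):
        request_id, peer = self.queue.pop(0)
        peer.cancel(request_id)                    # cancellation happens at the start of execution
        return request_id

a, b = Replica("a"), Replica("b")
a.enqueue("req-42", peer=b)                        # tied copies of the same request
b.enqueue("req-42", peer=a)
a.start_next()                                     # replica a wins the race
print(len(b.queue))                                # 0: replica b dropped its copy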


Network Variability & Performance



Causes of Latency Variability

• Switch buffers: buffer occupancy increases both median and tail latencies.
• Fairness: serving everyone equally can be harmful.
• Incast collapse: TCP timeouts.
• Single-path transport: poor network utilization, path congestion.


Data center traffic characterization

All links use Ethernet as a physical-layer protocol, with a mix of copper and fiber cabling. All switches below each pair of access routers form a single layer-2 domain, typically connecting several thousand servers. To limit overheads (e.g., packet flooding and ARP broadcasts) and to isolate different services or logical server groups (e.g., email, search, web front ends, web back ends), servers are partitioned into virtual LANs (VLANs). Unfortunately, this conventional design suffers from three fundamental limitations:

Limited server-to-server capacity: As we go up the hierarchy, we are confronted with steep technical and financial barriers in sustaining high bandwidth. Thus, as traffic moves up through the layers of switches and routers, the over-subscription ratio increases rapidly. For example, servers typically have 1:1 over-subscription to other servers in the same rack, that is, they can communicate at the full rate of their interfaces (e.g., 1 Gbps), while up-links from ToRs, and especially paths through the highest layer of the tree, are heavily oversubscribed. This large over-subscription factor fragments the server pool by preventing idle servers from being assigned to overloaded services, and it severely limits the entire data center's performance.

Fragmentation of resources: As the cost and performance of communication depend on distance in the hierarchy, the conventional design encourages service planners to cluster servers nearby in the hierarchy. Moreover, spreading a service outside a single layer-2 domain frequently requires reconfiguring IP addresses and VLAN trunks, since the IP addresses used by servers are topologically determined by the access routers above them. The result is a high turnaround time for such reconfiguration. Today's designs avoid this reconfiguration lag by wasting resources; the plentiful spare capacity throughout the data center is often effectively reserved by individual services (and not shared), so that each service can scale out to nearby servers to respond rapidly to demand spikes or to failures. Despite this, we have observed instances when the growing resource needs of one service have forced data center operations to evict other services from nearby servers, incurring significant cost and disruption.

Poor reliability and utilization: Above the ToR, the basic resilience model is 1:1, i.e., the network is provisioned such that if an aggregation switch or access router fails, there must be sufficient remaining idle capacity on a counterpart device to carry the load. This forces each device and link to be run at no more than about half of its maximum utilization. Further, multiple paths either do not exist or aren't effectively utilized. Within a layer-2 domain, the Spanning Tree Protocol causes only a single path to be used even when multiple paths between switches exist. In the layer-3 portion, Equal Cost Multipath (ECMP), when turned on, can use multiple paths to a destination if paths of the same cost are available. However, the conventional topology offers at most two paths.

3. MEASUREMENTS & IMPLICATIONS

To design VL2, we first needed to understand the data center environment in which it would operate. Interviews with architects, developers, and operators led to the objectives described earlier, but developing the mechanisms on which to build the network requires a quantitative understanding of the traffic matrix (who sends how much data to whom and when?) and churn (how often does the state of the network change due to changes in demand or switch/link failures and recoveries, etc.?). We analyze these aspects by studying production data centers of a large cloud service provider and use the results to justify our design choices as well as the workloads used to stress the VL2 testbed.

[Figure: PDF and CDF of flow size and of total bytes, for flow sizes spanning roughly 1 byte to 10^12 bytes. Mice are numerous: 99% of flows are smaller than 100MB. However, most bytes are in flows between 100MB and 1GB.]

Our measurement studies found two key results with implications for the network design. First, the traffic patterns inside a data center are highly divergent (even a large set of representative traffic matrices only loosely covers the actual traffic matrices seen), and they change rapidly and unpredictably. Second, the hierarchical topology is intrinsically unreliable: even with huge effort and expense to increase the reliability of the network devices close to the top of the hierarchy, we still see failures on those devices resulting in significant downtimes.

3.1 Data-Center Traffic Analysis

Analysis of NetFlow and SNMP data from our data centers reveals several macroscopic trends. First, the ratio of traffic volume between servers in our data centers to traffic entering/leaving our data centers is currently around 4:1 (excluding CDN applications). Second, data-center computation is focused where high-speed access to data on memory or disk is fast and cheap. Although data is distributed across multiple data centers, intense computation and communication on data does not straddle data centers due to the cost of long-haul links. Third, the demand for bandwidth between servers inside a data center is growing faster than the demand for bandwidth to external hosts. Fourth, the network is a bottleneck to computation: we frequently see ToR switches whose uplinks are highly utilized.

To uncover the exact nature of traffic inside a data center, we instrumented a highly utilized 1,500-node cluster in a data center that supports data mining on petabytes of data. The servers are distributed roughly evenly across the ToR switches, which are connected hierarchically. We collected socket-level event logs from all machines over two months.

3.2 Flow Distribution Analysis

Distribution of flow sizes: The figure above illustrates the nature of flows within the monitored data center. The flow-size statistics show that the majority of flows are small (a few KB); most of these small flows are hellos and meta-data requests to the distributed file system. To examine longer flows, we compute a statistic termed total bytes by weighting each flow size by its number of bytes. Total bytes tells us, for a random byte, the distribution of the flow size it belongs to. Almost all the bytes in the data center are transported in flows whose lengths vary from about 100MB to about 1GB. The mode at around 100MB springs from the fact that the distributed file system breaks long files into roughly 100MB chunks. Importantly, flows over a few GB are rare.
Source: VL2: A Scalable and Flexible Data Center Network, Greenberg et al., SIGCOMM 2009

99% of flows are mice, but the elephants consume most of the network bandwidth!


Data center traffic

Query traffic:
• Partition/aggregate (request, response)

Background traffic:
• Short messages [50KB-1MB] (coordination, control state)
• Large flows [1MB-50MB] (data update)


Data center traffic: balancing the conflicting goals of high bandwidth and low latency

• Partition/aggregate (query)
• Short messages [50KB-1MB] (coordination, control state)
• Large flows [1MB-50MB] (data update)


TCP Buffer Requirement at the switch

• Bandwidth-delay product rule of thumb: a single flow needs B = C×RTT of switch buffering for 100% throughput.
• B < C×RTT: throughput loss!
• B > C×RTT: more latency!
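For a sense of scale, here is the rule of thumb evaluated with assumed datacenter numbers (a 10 Gbps port and a 100 µs round-trip time):

link_gbps = 10.0     # assumed switch port speed
rtt_us = 100.0       # assumed intra-datacenter round-trip time

bdp_bytes = (link_gbps * 1e9 / 8) * (rtt_us * 1e-6)
print(f"C x RTT = {bdp_bytes / 1024:.0f} KB of buffering")   # about 122 KB per flow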



To lower the buffering requirements, we must reduce sending-rate variations.


DCTCP*: Algorithm

Switch side:
–  Mark packets (ECN) when the queue length exceeds a threshold K.

Sender side:
–  Maintain a running average of the fraction of packets marked (α). Each RTT:

       F = (# of marked ACKs) / (total # of ACKs)
       α ← (1 − g)·α + g·F

–  Adaptive window decrease:

       W ← (1 − α/2)·W

   Note: the decrease factor is between 1 and 2.

[Diagram: switch queue of size B with marking threshold K; packets are marked when the queue is above K and not marked below it.]

*Data Center TCP (DCTCP), SIGCOMM 2010, New Delhi, India, August 31, 2010.
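A minimal sketch of the sender-side update above; the gain g, the ACK counts, and the starting window are assumptions, and a real DCTCP implementation applies the window cut once per window of data:

class DctcpSender:
    def __init__(self, cwnd_bytes, g=1.0 / 16.0):
        self.cwnd = cwnd_bytes
        self.g = g                # EWMA gain for the marking estimate
        self.alpha = 0.0          # running estimate of the fraction of marked packets

    def on_rtt_end(self, acks_total, acks_marked):
        if acks_total == 0:
            return
        f = acks_marked / acks_total                       # F for this RTT
        self.alpha = (1 - self.g) * self.alpha + self.g * f
        if acks_marked:                                    # congestion was signalled
            self.cwnd = self.cwnd * (1 - self.alpha / 2)   # gentle, proportional decrease

sender = DctcpSender(cwnd_bytes=64_000)
sender.on_rtt_end(acks_total=100, acks_marked=20)          # 20% of packets were ECN-marked
print(round(sender.alpha, 3), round(sender.cwnd))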


DCTCP vs TCP

Setup: Windows 7, Broadcom 1 Gbps switch. Scenario: 2 long-lived flows, K = 30 KB.

[Plot: switch queue length (KBytes) over time (seconds) for 2 TCP flows vs. 2 DCTCP flows (K = 20); DCTCP keeps the queue much shorter and more stable than TCP.]


TCP Incast Collapse

Source: Nagle et al., The Panasas ActiveScale Storage Cluster – Delivering Scalable High Bandwidth Storage, SC2004

Pre-conditions:
1. Low-latency, shallow-buffer network switches
2. "Barrier synch" operations
3. Servers returning small data items



New  Cluster  Based  Storage  System  


The incast application pattern overfills switch buffers.



Solution: TCP with µs-RTO*
*Safe and Effective Fine-grained TCP Retransmissions for Datacenter Communication, Vasudevan et al., SIGCOMM 2009

• Little adverse effect on WAN traffic
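A back-of-the-envelope view of why the minimum RTO matters for incast: if a barrier-synchronized block read suffers even one timeout, the stall dominates the transfer time under a 200 ms minimum RTO but not under a microsecond-scale one. The block size, link speed, and single-timeout assumption are illustrative only:

def effective_goodput_mbps(block_kb, link_gbps, rto_s, timeouts=1):
    transfer_s = (block_kb * 1024 * 8) / (link_gbps * 1e9)   # time to move the block
    total_s = transfer_s + timeouts * rto_s                  # plus the timeout stall
    return (block_kb * 1024 * 8) / total_s / 1e6

print(f"{effective_goodput_mbps(256, 1.0, 200e-3):.1f} Mbps with a 200 ms min RTO")
print(f"{effective_goodput_mbps(256, 1.0, 200e-6):.1f} Mbps with a 200 us min RTO")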



Incast collapse: an unsolved problem at scale*
*Understanding TCP Incast Throughput Collapse in Datacenter Networks, Griffith et al., WREN 2009

The solution space is complex:
• Network conditions can impact RTT
• Switch buffer management strategies matter
• Goodput can be unstable with load and the number of senders



Today's transport protocols are deadline-agnostic and strive for fairness.

User-facing online services have cascading deadlines:
• Application SLAs
• Cascading SLAs: SLAs for components at each level of the hierarchy
• Network SLAs: deadlines on communications between components

[Diagram: a request fans out from an aggregator through intermediate aggregators to workers; the top-level SLA (200 ms) cascades into tighter component deadlines (100 ms, 45 ms) and, at the network level, per-flow deadlines of a few tens of milliseconds down to a few milliseconds.]

Flow deadlines: a flow is useful if and only if it satisfies its deadline.


Pitfalls of Fair Sharing

Two flows share a link: flow f1 has a 20 ms deadline and flow f2 a 40 ms deadline.

Status quo (fair sharing): flows f1 and f2 get a fair share of the bandwidth, and flow f1 misses its deadline (an incomplete response to the user).

Deadline-aware: flows get bandwidth in accordance with their deadlines, and deadline awareness ensures both flows satisfy their deadlines.

The case for unfair sharing: fair sharing is bad because
• network bandwidth is wasted carrying a flow that is no longer useable, and
• application throughput and the user experience suffer.


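A worked version of the slide's example, assuming each flow needs 20 ms of the full link: fair sharing finishes both flows at 40 ms, so f1 misses its 20 ms deadline, while an earliest-deadline-first schedule lets both flows meet their deadlines.

flows = [("f1", 20.0, 20.0), ("f2", 20.0, 40.0)]   # (name, ms of full-link time needed, deadline ms)

# Fair sharing: both flows progress at half rate, so each takes twice its solo time.
for name, need_ms, deadline_ms in flows:
    finish = need_ms * len(flows)
    verdict = "meets" if finish <= deadline_ms else "misses"
    print(f"fair share    : {name} finishes at {finish:.0f} ms ({verdict} its {deadline_ms:.0f} ms deadline)")

# Deadline-aware: serve flows earliest-deadline-first at the full link rate.
t = 0.0
for name, need_ms, deadline_ms in sorted(flows, key=lambda f: f[2]):
    t += need_ms
    verdict = "meets" if t <= deadline_ms else "misses"
    print(f"deadline-aware: {name} finishes at {t:.0f} ms ({verdict} its {deadline_ms:.0f} ms deadline)")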


The case for flow quenching: consider 6 flows, each with a 30 ms deadline, and insufficient bandwidth to satisfy all deadlines. With fair sharing, all flows miss the deadline (an empty response to the user).


D3* Overview

Notation: s = flow size, d = deadline, RRQ = Rate Request, α = allocated rate.

1. The application exposes (s, d).
2. Desired rate: r = s/d.
3. Routers along the path allocate rates (α) based on traffic load, appending them to the rate request as it travels from sender to receiver (RRQ | α1, then RRQ | α1, α2).
4. Sending rate for the next RTT: sr = min(α1, α2).
5. Send data at rate sr.
6. One of the packets contains an updated RRQ based on the remaining flow size and deadline.

*Better Never than Late, Wilson et al., SIGCOMM 2011


D3 Router Rate Control

The router needs to track the sum of desired rates (demand), the available capacity, and the fair share (fs), using only three aggregate values (no per-flow state):
• Allocations: A = Σα
• Demand: D = Σr
• Number of flows: N

Notation: r = desired rate, α = allocated rate, C = link capacity, RRQ = Rate Request; available capacity = C − A.

Rate allocation, on each RRQ (which carries r along with the previous rt-1 and αt-1, and returns αt):
    A = A − αt-1
    D = D − rt-1 + rt
    if available capacity > D:  αt = rt + fs
    A = A + αt
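The sketch below follows the slide's loop; the fair-share computation and the behaviour when demand exceeds capacity are not spelled out on the slide, so those parts are assumptions rather than the algorithm from the D3 paper:

class D3Router:
    def __init__(self, link_capacity):
        self.C = link_capacity   # link capacity
        self.A = 0.0             # sum of rates currently allocated
        self.D = 0.0             # sum of desired rates (demand)
        self.N = 0               # number of flows traversing the link

    def on_rate_request(self, r_t, r_prev=0.0, alpha_prev=0.0, new_flow=False):
        """Handle one RRQ carrying the flow's previous allocation and desired rates."""
        if new_flow:
            self.N += 1
        self.A -= alpha_prev                     # return the previous allocation
        self.D += r_t - r_prev                   # update aggregate demand
        available = self.C - self.A
        fs = max(available - self.D, 0.0) / max(self.N, 1)   # assumed: equal share of spare capacity
        if available > self.D:
            alpha_t = r_t + fs                   # desired (deadline) rate plus a fair share
        else:
            alpha_t = min(r_t, available)        # capacity-constrained case (assumption)
        self.A += alpha_t
        return alpha_t

router = D3Router(link_capacity=1000.0)          # e.g., a 1000 Mbps link
print(router.on_rate_request(r_t=200.0, new_flow=True))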


D3 with Flow Quenching / Results

Terminate "useless" flows when:
–  the desired rate exceeds the link capacity, or
–  the deadline has expired.
This allows for graceful degradation of performance.


Network-Layer Load Balancers

• Expected to support the single-path assumption.
• Common approach: hash flows to paths, which does not consider flow size or sending rate.
• Results in uneven load spreading, leading to hotspots and increased queuing delays.

The single-path assumption restricts the ability to agilely balance load.
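A minimal sketch of hash-based flow-to-path assignment: the ECMP-style hash looks only at a flow's header fields, so how much traffic lands on each path depends on which flows happen to collide, not on their sizes. The flows and path count below are hypothetical:

import hashlib

def pick_path(five_tuple, n_paths):
    digest = hashlib.md5("|".join(map(str, five_tuple)).encode()).hexdigest()
    return int(digest, 16) % n_paths

# Hypothetical flows: (src, dst, sport, dport) and an approximate size in MB.
flows = [(("10.0.0.1", "10.0.1.9", 5001, 80), 900),    # elephant
         (("10.0.0.2", "10.0.1.9", 5002, 80), 800),    # elephant
         (("10.0.0.3", "10.0.1.9", 5003, 80), 1),      # mouse
         (("10.0.0.4", "10.0.1.9", 5004, 80), 1)]      # mouse

load_mb = [0] * 4
for five_tuple, size_mb in flows:
    load_mb[pick_path(five_tuple, n_paths=4)] += size_mb
print(load_mb)   # per-path load is determined by the hash alone, not by flow size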


Recent Proposals

• Reduce packet drops
  – by cross-flow learning [DCTCP] or explicit flow scheduling [D3]
  – maintain the single-path assumption
• Adaptively move traffic
  – by creating subflows [MPTCP] or periodically remapping flows [Hedera]
  – not sufficiently agile to support short flows



Conclusions  

• Data center infrastructure efficiencies are improving.

• Energy efficiency among non-CPU components is poor.

• Parallelism at scale is great, until the point where long tail latencies reverse the gains.

• Network latencies remain subject to legacy protocol and transport mechanisms.


Thank you!