measuring and optimizing tail latency · 2017-10-23
TRANSCRIPT
Measuring and Optimizing Tail Latency
Kathryn S. McKinley (Google), Xi Yang, Stephen M. Blackburn, Md Haque, Sameh Elnikety, Yuxiong He, Ricardo Bianchini
Tail Latency Matters
Two-second slowdown reduced revenue/user by 4.3%. [Eric Schurman, Bing]
400-millisecond delay decreased searches/user by 0.59%. [Jake Brutlag, Google]
TOP PRIORITY
Photo: Google / Connie Zhou
Servers in US datacenters
[Chart: servers in millions (0-20), 2006-2020, by category: unbranded 2+ sockets, unbranded 1 socket, branded 2+ sockets, branded 1 socket]
*Shehabi et al., United States Data Center Energy Usage Report, Lawrence Berkeley, 2016.
Electricity in US datacenters
[Chart: billion kWh / year (0-200), 2000-2020; scenarios: 2010 trend, current trend, better, bigger, B+B, best practices, BP + bigger; annotation: actual, $1 to $6 billion]
*Shehabi et al., United States Data Center Energy Usage Report, Lawrence Berkeley, 2016.
Datacenter economics quick facts*
~$500,000: cost of a small datacenter
~3,000,000: US datacenters in 2016
~$1.5 trillion: US capital investment to date
~$3,000,000,000: kWh dollars / year
~$30,000,000: savings from 1% less work (lots more by not building a datacenter)
*Shehabi et al., United States Data Center Energy Usage Report, Lawrence Berkeley, 2016.
Tail latency: TOP PRIORITY
Efficiency: TOP PRIORITY
Both?!
Server architecture
client → aggregator → workers
Characteristics of interactive services
[CDF: percentage of requests (0-100) vs latency (ms, 0-100)]
Bursty, diurnal. The CDF changes slowly. The slowest server dictates the tail. Orders of magnitude difference between average & tail (99th percentile).
What is in the tail?
[CDF: percentage of requests vs latency (ms), with the tail marked '?']
Roadmap
Diagnosing the tail with continuous profiling
Noise: systems are not perfect
Queuing: too much load is bad, but so is over-provisioning
Work: many requests are long
Insights: use the CDF off line; long requests reveal themselves, treat them specially
Simplified life of a request
[Diagram: a request fans out across workers, each running an application on a Java VM on an OS/VM, and a response returns]
Prior state of the art: Dick Sites, Google. https://www.youtube.com/watch?v=QBu2Ae8-8LM
Request profiling
Prior approach (✗): hand-instrument the system; with a 1% on-line budget, sample, but tails are rare…; off-line schematics; have insight; improve the system.
This approach: automated instrumentation; a 1% on-line budget allows continuous on-line profiling; off-line schematics; have insight; improve the system + on-line optimization.
Automated cycle-level on-line profiling
Insight: hardware & software generate signals. Hardware signals: performance counters. Software signals: memory locations (counters, tags).
SHIM design [ISCA'15 (Top Picks HM), ATC'16]
Observe global state from another core
LLC misses per cycle:
while true:
    for counter in (LLC_misses, cycles):
        buf[i++] = readCounter(counter)
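A minimal runnable sketch of that loop in Python, assuming a hypothetical read_counter() that returns the current value of a hardware performance counter (the real SHIM polls counters from a dedicated observer core):

    # Sketch of a SHIM-style observer loop; read_counter() is a
    # hypothetical stand-in for a hardware performance-counter read.
    LLC_MISSES, CYCLES = "llc_misses", "cycles"   # illustrative counter IDs
    BUF_SIZE = 1 << 20

    def observe(read_counter):
        """Log (LLC misses, cycles) pairs into a ring buffer forever;
        LLC misses per cycle comes from consecutive pairs, off line."""
        buf = [0] * BUF_SIZE
        i = 0
        while True:
            for counter in (LLC_MISSES, CYCLES):
                buf[i % BUF_SIZE] = read_counter(counter)
                i += 1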
Observe local state with SMT hardware
[Strip charts: HT1 IPC, Core IPC, and HT2 SHIM IPC, each on a 0-4 scale]
HT1 IPC = Core IPC - HT2 SHIM IPC
while true:
    for counter in (HT2_SHIM, Core, cycles):
        buf[i++] = readCounter(counter)
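A sketch of the subtraction above in Python; the per-core counters see work from both hyperthreads, so the observer subtracts its own contribution (names are illustrative):

    def ht1_ipc(core_retired, ht2_retired, cycles):
        """HT1 IPC = Core IPC - HT2 SHIM IPC.

        core_retired counts instructions retired by both hyperthreads
        over the interval; subtracting the observer's own retired
        instructions (HT2, running SHIM) leaves the application
        thread's (HT1) instructions per cycle."""
        return core_retired / cycles - ht2_retired / cycles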
Correlate hardware & software events
[Strip charts: HT1 IPC, Core IPC, and HT2 SHIM IPC (0-4) while HT1 runs methods A(), B(), C()]
while true:
    for counter in (HT2_SHIM, Core, cycles):
        buf[i++] = readCounter(counter)
    tid = thread on HT1
    buf[i++] = tid.method
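A sketch of one combined sample; current_tag() is a hypothetical read of the (thread id, method) pair the HT1 application publishes to a shared memory location:

    def take_sample(read_counter, current_tag):
        """One SHIM sample correlating hardware and software signals."""
        ht2_retired = read_counter("ht2_shim_retired")  # observer's own work
        core_retired = read_counter("core_retired")     # both hyperthreads
        cycles = read_counter("cycles")
        tid, method = current_tag()   # software signal written by HT1
        return (cycles, core_retired, ht2_retired, tid, method)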
Fidelity
[Histogram: % of samples (log scale) vs IPC (log scale); raw samples]
[Timeline: counter samples (R0, C0) … (R3, C3) yield IPC1, IPC2, IPC3; one interval is invalid (✗)]
IPC = (Rt - Rt-1) / (Ct - Ct-1)
Counters: C = cycles, R = retired instructions
Problem: samples are not atomic
Solution: use the clock as ground truth. Bracket each sample with clock reads: Cs0 R0 C0 Ce0, Cs1 R1 C1 Ce1, Cs2 R2 C2 Ce2, Cs3 R3 C3 Ce3.
CPC = (Cet - Cet-1) / (Cst - Cst-1); this should be 1!
[Timeline: CPC1 = 1.0 +/- 1% (✓), CPC2 = 1.0 +/- 1% (✓), CPC3 != 1.0 +/- 1% (✗)]
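A sketch of the fidelity filter using the slide's symbols; each sample is a (Cs, R, C, Ce) tuple, and pairs whose CPC strays from 1.0 were not effectively atomic:

    def filtered_ipc(samples, tol=0.01):
        """IPC per interval, keeping only effectively atomic sample pairs.

        samples: (Cs, R, C, Ce) tuples: wall clock before the reads,
        retired instructions, cycles, wall clock after the reads.
        CPC = dCe/dCs should be 1.0; interrupts, preemption, or
        migration between reads push it away, so drop that interval."""
        ipcs = []
        for (cs0, r0, c0, ce0), (cs1, r1, c1, ce1) in zip(samples, samples[1:]):
            cpc = (ce1 - ce0) / (cs1 - cs0)
            if abs(cpc - 1.0) <= tol:               # e.g. CPC in [0.99, 1.01]
                ipcs.append((r1 - r0) / (c1 - c0))  # IPC = dR / dC
        return ipcs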
Filtering Lusearch IPC samples
[Histograms: % of samples (log scale) vs IPC and CPC (log scale): raw IPC, raw CPC, filtered IPC, and filtered CPC in [0.99, 1.01]]
IPC of individual methods in Lucene
[Bar chart: IPC (0-1.6) for the top 10 methods (74% of total execution time), sampled at default 1 KHz, maximum 100 KHz, and SHIM 10 MHz]
Overheads from the other core
[Bar chart: method and loop IDs vs overhead normalized to without SHIM, for 30-cycle and 1213-cycle sampling periods]
Overheads come from write invalidations.
3 MHz: 1+ order of magnitude over the interrupt-based 'maximum'. 113 MHz: 3+ orders of magnitude over the interrupt-based 'maximum'.
Understanding Tail Latency
SHIM signals
Requests: thread ids; request id (configure); time stamps, PC
System threads: thread ids; time stamp, PC
All requests
[Chart: latency (ms, 0-120) across request groups from the slowest 1% to the fastest 1%, showing client latency and average queueing time]
The tail: the longest 200 requests
[Stacked chart: latency (ms, 0-120) for the top 200 requests, decomposed into network & network queueing time, idle time, CPU time, and dispatch queueing time]
Network & other (network imperfections) and idle time (OS imperfections) are noise; CPU work (long requests) and queuing at the worker (overload) are not noise.
Insight: long requests reveal themselves, regardless of the cause.
Noise: replicate & reissue [The Tail at Scale, Dean & Barroso, CACM'13]
[CDF: percentage of requests vs latency (ms), marking where 10% and 5% of requests would be reissued, and the noise region]
Reissue all requests? Use the CDF off line for cost & potential. Fixed issue time: reissue any request still outstanding at a fixed time (sketch below).
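A minimal sketch of a fixed-issue-time policy, assuming a hypothetical send_request() callable that queries a replica; t_reissue is read off the CDF offline (the 95th percentile reissues ~5%, the 90th ~10%):

    import concurrent.futures

    def request_with_reissue(send_request, t_reissue):
        """Send a request; if no reply within t_reissue seconds,
        reissue to a replica and return whichever response is first."""
        with concurrent.futures.ThreadPoolExecutor(max_workers=2) as pool:
            futures = {pool.submit(send_request)}
            done, _ = concurrent.futures.wait(futures, timeout=t_reissue)
            if not done:                      # still outstanding: reissue
                futures.add(pool.submit(send_request))
                done, _ = concurrent.futures.wait(
                    futures, return_when=concurrent.futures.FIRST_COMPLETED)
            return done.pop().result()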
Probabilistic reissue [Optimal Reissue Policies for Reducing Tail Latencies, Kaler, He, & Elnikety, SPAA'17]
[CDF: percentage of requests vs latency (ms), marking 1-3% reissued with probability p, the 5% deterministic reissue point, and the noise region]
Adding randomness to reissue makes a single, earlier reissue time d (vs n reissue times) optimal. The probability is proportional to the reissue budget & the noise in the tail (sketch below).
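A sketch of the single-time probabilistic rule; d and p are the slide's symbols, with p sized so that p times the fraction of requests still outstanding at d matches the 1-3% budget:

    import random

    def should_reissue(elapsed, d, p):
        """Single-time probabilistic reissue (after Kaler et al., SPAA'17):
        rather than deterministically duplicating everything still
        outstanding at a late time, duplicate at an earlier time d with
        probability p."""
        return elapsed >= d and random.random() < p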
Single-R probabilistic reissue [Optimal Reissue Policies for Reducing Tail Latencies, Kaler, He, & Elnikety, SPAA'17]
Work: speed up the tail efficiently
[CDF: percentage of requests vs latency (ms); the work in the tail highlighted]
Judicious parallelism [ASPLOS’15]
DVFS faster on the tail [DISC’14, MICRO’17]
Asymmetric multicore [DISC’14, MICRO’17]
Work: parallelism
Parallelism historically for throughput; the idea: parallelism for tail latency.
Queuing theory: optimizing average latency maximizes throughput, but not the tail! Shortening the tail reduces queuing latency.
Parallelism
Insight: long requests reveal themselves.
Approach: incrementally add parallelism to long requests (the tail) based on request progress & load.
Parallelism historically for throughput; the idea: parallelism for tail latency.
Few-to-Many Dynamic Parallelism [ASPLOS’15]
[Chart: tail latency (ms, 300-1500) vs Lucene RPS (30-48) for Sequential, 4-way, and fixed intervals of 20, 100, and 500 ms]
Few-to-Many at fixed delay d: add a thread every d ms.
A long delay is good at high load; a short delay is good at low load. Best at all loads?
Offline:
Profile the sequential & parallel demand distributions and the efficiency of parallelism.
Choose the maximum target parallelism to utilize available hardware resources.
Exhaustively explore parallelism: given a set of time intervals t & load, find the best tail latency & parallelism.
Interval table and online self-scheduling (intervals at 0, 50, 100 ms; parallelism chosen by load |requests| and request age; a sketch follows below):
|requests| ≤ 2: @0, parallelism = 3
|requests| = 3: @0, parallelism = 1; @50, parallelism = 3
|requests| 4-6: @50, parallelism = 1; @100, parallelism = 3
|requests| ≥ 7: @exit, parallelism = 1; @100, parallelism = 3
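A sketch of the online lookup in Python; the table mirrors the slide's example, but the bucket boundaries and values are illustrative, and the real table comes from the offline exhaustive search:

    INTERVALS_MS = [0, 50, 100]   # interval start times from the offline search

    TABLE = {                     # load bucket -> parallelism per interval
        "<=2": [3, 3, 3],
        "==3": [1, 3, 3],
        "4-6": [1, 1, 3],
        ">=7": [1, 1, 3],
    }

    def load_bucket(outstanding):
        """Bucket the number of outstanding requests as in the table."""
        if outstanding <= 2: return "<=2"
        if outstanding == 3: return "==3"
        if outstanding <= 6: return "4-6"
        return ">=7"

    def target_parallelism(elapsed_ms, outstanding):
        """Threads a request of age elapsed_ms should use right now:
        requests add parallelism as they age, and stay sequential
        longer when the system is loaded."""
        interval = max(i for i, t in enumerate(INTERVALS_MS) if elapsed_ms >= t)
        return TABLE[load_bucket(outstanding)][interval]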
Evaluation: 2x8-core 64-bit 2.3 GHz Xeon, 64 GB
[Chart: tail latency (ms, 300-1500) vs requests per second (30-48), Few-to-Many vs Sequential: 21% fewer servers, or reduce the tail by 28%]
Work: speed up the tail efficiently
[CDF: percentage of requests vs latency (ms); work region highlighted]
Judicious parallelism [ASPLOS'15] ✔
Work: speed up the tail efficiently
[CDF: percentage of requests vs latency (ms); all requests @ 2.3 GHz]
DVFS faster on the tail [DISC'14, MICRO'17]
Asymmetric multicore (AMP) [DISC'14, MICRO'17]
Speed up the tail efficiently
[CDF: percentage of requests vs latency (ms), with regions labeled 2.3 GHz and 0.5 GHz]
DVFS faster on the tail [DISC'14, MICRO'17]
+ available in servers today
Speed up the tail efficiently
[CDF: percentage of requests vs latency (ms)]
DVFS faster on the tail [MICRO'17]: + available in servers today
Asymmetric multicore (AMP) [DISC'14, MICRO'17]: + much more energy efficient; + hyper-threading is a form; - core competition
Adaptive Slow-to-Fast framework
Slow-to-fast migration is optimal [ICAC'13].
Goal: minimize energy consumption and satisfy a tail latency target.
Challenges: when to migrate? What if the core speed is not available?
Insight: use the big core just enough: th + (l99 - th) / sp ≤ target, where th is the threshold time on the slow core, l99 the 99th-percentile demand, and sp the big-core speedup. Migrate oldest first, and migrate early under load!
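A sketch of that condition, solving th + (l99 - th)/sp ≤ target for the largest slow-core threshold th (symbols as on the slide; the helper names are mine):

    def max_slow_core_time(l99, speedup, target):
        """Largest th satisfying th + (l99 - th) / speedup <= target:
        a request migrated to the big core after th still meets the
        target even if its demand is at the 99th percentile. Assumes
        speedup > 1 and target >= l99 / speedup."""
        return (target - l99 / speedup) / (1.0 - 1.0 / speedup)

    def should_migrate(elapsed, l99, speedup, target):
        """Migrate the oldest request once it has run this long slowly."""
        return elapsed >= max_slow_core_time(l99, speedup, target)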
52
Controller design
[Block diagram: the target and measured tail latency produce an error; the controller adjusts the threshold that drives the system]
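One controller step, sketched as a simple proportional controller; the gain and structure are illustrative, not the paper's exact design:

    def controller_step(threshold, target_l99, measured_l99, gain=0.5):
        """Adjust the migration threshold from the tail-latency error:
        if the measured 99th percentile exceeds the target, lower the
        threshold so requests migrate to the fast core sooner; with
        slack, raise it to keep requests on the slow core and save
        energy."""
        error = target_l99 - measured_l99
        return max(0.0, threshold + gain * error)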
[Two CDFs: percentage of requests vs latency (ms), showing the distribution shifting with load]
Policies
All-cores: Pegasus adjusts all-core frequencies for load [Towards Energy Proportional…, Google & Stanford, ISCA'14].
Per-core approaches: TL minimizes tail latency using a 0 threshold; EETL is energy efficient with a target tail latency.
Lucene with DVFS on Broadwell
[Charts: 99th-percentile latency (ms, 100-225) vs RPS (0-125), and normalized average energy (0.75-1.5) vs RPS, for TL, Pegasus, and EETL]
Lucene on emulated AMP and DVFS
[Charts: 99th-percentile latency (ms, 100-240) vs RPS (0-150), and normalized average energy (0-3) vs RPS, for TL_DVFS, Pegasus, EETL_DVFS, TL_AMP, and EETL_AMP]
Tail latency & efficiency: BOTH!
Efficiency at scale for interactive workloads
Diagnosing the tail with continuous profiling
Noise: replicate; systems are not perfect
Queuing: not today!
Work: judicious use of resources on long requests
The request latency CDF is a powerful tool. Tail efficiency ≠ average or throughput. Hardware heterogeneity.
Thank you
Heterogeneous hardware dominates homogeneous hardware for throughput, performance, and energy with a fixed power budget & variable request demand. Slow-to-Fast sacrifices the average a bit to reduce energy & tail latency.
Requirements pull for heterogeneity! [DISC’14, ICAC’13, submission]
Hardware heterogeneity: opportunity & challenge
Processors: big, little, custom
Memory: DDR, NVM, flash, PIM, paired
Parallelism
[Chart: 99th-percentile latency (ms, 0-1500) vs Lucene RPS (0-50), Sequential vs 4-way: 4-way improves at low load, degrades at high load]
Parallelism historically for throughput; the idea: parallelism for tail latency.
Software & hardware
Lucene: open-source enterprise search; Wikipedia English, a 10 GB index of 33 million pages; 10k queries from Lucene nightly tests.
Bing: web search with one Index Serving Node (ISN); 160 GB web index on SSD, 17 GB cache; 30k Bing user queries.
Hardware: 2x8-core 64-bit 2.3 GHz Xeon, 64 GB, Windows; 15 request servers, 1 core issues requests; target parallelism = 24 threads.
Policies
Sequential, N-way: a single degree of parallelism for each request.
Adaptive: select the parallelism degree when a request starts, using system load [EUROSYS'13].
Request Clairvoyant: parallelizes long requests via perfect prediction of the tail.
FM (Few-to-Many): incrementally add parallelism.
Load variation
Alternate between high & low load: FM adapts to bursts with low variance.
[Chart: tail latency (ms, 0-1600) vs Lucene RPS alternating low, low, high, high, for Sequential, 2-way, 4-way, and FM]
Fewer servers: total cost of ownership
[Charts: tail latency (ms, 300-1500) vs Lucene RPS (30-48): FM vs Sequential saves 21%; FM vs Adaptive saves 9%]