Measuring and Optimizing Tail Latency · 2017-10-23


Page 1: Measuring and Optimizing Tail Latency · 2017-10-23

Measuring and Optimizing Tail Latency Kathryn S McKinley, Google Xi Yang, Stephen M Blackburn, Md Haque, Sameh Elnikety, Yuxiong He, Ricardo Bianchini

Page 2:
Page 3:

Tail Latency Matters

Two second slowdown reduced revenue/user by 4.3%. [Eric Schurman, Bing]

400 millisecond delay decreased searches/user by 0.59%. [Jack Brutlag, Google]

TOP PRIORITY

Page 4:

Photo: Google / Connie Zhou

Page 5:

Servers in US datacenters

[Figure: servers in millions (0-20), 2006-2020, by category: Unbranded 2+ sockets, Unbranded 1 socket, Branded 2+ sockets, Branded 1 socket]

*Shehabi et al., United States Data Center Energy Usage Report, Lawrence Berkeley, 2016.

Page 6:

Electricity in US datacenters

[Figure: billion kWh / year (0-200), 2000-2020, scenarios: 2010 trend, Current trend, Better, Bigger, B+B, Best Practices, BP + Bigger; actual savings: $1 to $6 billion]

*Shehabi et al., United States Data Center Energy Usage Report, Lawrence Berkeley, 2016.

Page 7:

Datacenter economics quick facts*

~ $500,000 cost of a small datacenter
~ 3,000,000 US datacenters in 2016
~ $1.5 trillion US capital investment to date
~ $3,000,000,000 KW dollars / year
~ $30,000,000 savings from 1% less work, and lots more by not building a datacenter

*Shehabi et al., United States Data Center Energy Usage Report, Lawrence Berkeley, 2016.

Page 8:


Efficiency TOP PRIORITY

Page 9:

Tail Latency


Efficiency TOP PRIORITY

Page 10:

Tail Latency


Efficiency BOTH ?!

Page 11:


Server architecture

aggregator

workers

client

Page 12:


Characteristics of interactive services

[Figure: CDF, percentage of requests (0-100) vs latency (ms, 0-100), labeled LC]

Bursty, diurnal. CDF changes slowly. Slowest server dictates the tail. Orders of magnitude difference between average & tail (99th %ile).
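The gap the slide describes between average and tail is easy to see with a tiny empirical-CDF sketch; the exponential latency model and the nearest-rank percentile are illustrative assumptions, not the talk's data:

```python
import random

def percentile(samples, p):
    """Nearest-rank p-th percentile of a list of latencies."""
    s = sorted(samples)
    return s[min(len(s) - 1, int(p / 100.0 * len(s)))]

random.seed(0)
# Synthetic latencies with a ~5 ms mean (illustrative, not the talk's data).
lat = [random.expovariate(1 / 5.0) for _ in range(10_000)]
avg = sum(lat) / len(lat)
# For heavy-ish tails the 99th percentile sits far above the average.
assert percentile(lat, 99) > 3 * avg
```

Reading the 99th percentile off the CDF like this, offline, is exactly the "use the CDF off line" insight from the roadmap slide.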

Page 13:


What is in the tail?

[Figure: CDF, percentage of requests vs latency (ms), with the tail region marked '?']

Page 14:

Roadmap

Diagnosing the tail with continuous profiling
Noise: systems are not perfect
Queuing: too much load is bad, but so is over-provisioning
Work: many requests are long

Insights: use the CDF off line. Long requests reveal themselves; treat them specially.


Page 15:


application

worker OS / VM

Java VM

application

worker OS / VM

Java VM

application

worker OS / VM

Java VM

Simplified life of a request

request response

Page 16:

Prior state of the art: Dick Sites, Google https://www.youtube.com/watch?v=QBu2Ae8-8LM


Page 17:

@ Google


Hand instrument the system → 1% on-line budget, sample (but tails are rare…) → off-line schematics → have insight → improve the system

Page 18:

Request profiling


Hand instrument the system → 1% on-line budget, sample (but tails are rare…) → off-line schematics → have insight → improve the system

Page 19:

Request profiling


Automated instrumentation → 1% on-line budget, continuous on-line profiling → off-line schematics → have insight → improve the system + on-line optimization

✗ Hand instrument the system → 1% on-line budget, sample (but tails are rare…) → off-line schematics → have insight → improve the system

Page 20:

Automated cycle-level on-line profiling

Insight: hardware & software generate signals [ISCA'15 (Top Picks HM), ATC'16]

[Figure: hardware signals (performance counters ✓) and software signals (memory locations ✓), read via counters and tags]

Page 21:

SHIM Design ISCA’15 (Top Picks HM), ATC’16


Page 22:

Observe global state from another core

[Figure: LLC misses per cycle over time, observed from the profiling core]

while (true):
    for counter in [LLC_misses, cycles]:
        buf[i++] = readCounter(counter)
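The observer loop on this slide can be sketched in Python; `FakeCounters` is a stand-in invented here for real per-core performance-counter reads:

```python
class FakeCounters:
    """Hypothetical stand-in for hardware counter reads (e.g. rdpmc)."""
    def __init__(self):
        self.cycles = 0
        self.llc_misses = 0
    def tick(self):
        # Pretend the observed core retires 100 cycles and 3 LLC misses.
        self.cycles += 100
        self.llc_misses += 3

def observe(counters, n):
    """SHIM-style polling: record (llc_misses, cycles) snapshots back to back."""
    buf = []
    for _ in range(n):
        counters.tick()  # stands in for the observed core doing work
        buf.append((counters.llc_misses, counters.cycles))
    return buf

def misses_per_cycle(buf):
    """Differentiate consecutive snapshots, like the slide's per-sample ratios."""
    return [(m1 - m0) / (c1 - c0) for (m0, c0), (m1, c1) in zip(buf, buf[1:])]

buf = observe(FakeCounters(), 5)
assert all(r == 0.03 for r in misses_per_cycle(buf))
```

The point of the design is that the observing core polls in a tight loop, so the sample rate is orders of magnitude higher than interrupt-driven profilers allow.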

Page 23:

Observe local state with SMT hardware

[Figure: HT1 and HT2 SMT threads; HT1 IPC, Core IPC, and HT2 SHIM IPC, each on a 0-4 scale]

HT1 IPC = Core IPC − HT2 SHIM IPC

while (true):
    for counter in [HT2SHIM, Core, Cycles]:
        buf[i++] = readCounter(counter)
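The subtraction on this slide, recovering the profiled thread's IPC from whole-core counters, is a one-liner; the sample values below are made up:

```python
def ht1_ipc_samples(core_ipc, ht2_shim_ipc):
    """On an SMT core, the profiled thread's IPC is the core total minus
    the IPC SHIM itself consumed on the sibling hyperthread."""
    return [c - s for c, s in zip(core_ipc, ht2_shim_ipc)]

# Made-up samples: core retires 2.0 then 1.5 IPC; SHIM uses 0.5 then 0.25.
assert ht1_ipc_samples([2.0, 1.5], [0.5, 0.25]) == [1.5, 1.25]
```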

Page 24:

Correlate hardware & software events

[Figure: HT1 runs methods A(), B(), C() while HT2 runs SHIM; HT1 IPC, Core IPC, and HT2 SHIM IPC sampled over time]

while (true):
    for counter in [HT2SHIM, Core, cycles]:
        buf[i++] = readCounter(counter)
    tid = thread on HT1
    buf[i++] = tid.method

Page 25:

Fidelity


Page 26:

Raw samples

[Figure: % of samples (log scale) vs IPC (log scale)]

Page 27:

Problem: samples are not atomic

[Figure: timeline of counter samples R0C0 … R3C3 with IPC1, IPC2, IPC3 between them; one interval invalid]

IPC = (Rt − Rt-1) / (Ct − Ct-1)

Counters: C = cycles, R = retired instructions
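The IPC formula above, differentiating consecutive (retired instructions, cycles) snapshots, can be sketched as:

```python
def ipc_series(samples):
    """samples: list of (retired_instructions, cycles) counter snapshots.
    IPC_t = (R_t - R_{t-1}) / (C_t - C_{t-1})."""
    return [(r1 - r0) / (c1 - c0)
            for (r0, c0), (r1, c1) in zip(samples, samples[1:])]

# 200 instructions in 100 cycles, then 50 instructions in 100 cycles.
assert ipc_series([(0, 0), (200, 100), (250, 200)]) == [2.0, 0.5]
```

Because the two counter reads of a snapshot are not atomic, an interrupt between them corrupts one interval, which is exactly the problem the next slide's clock check catches.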

Page 28:

Solution: use the clock as ground truth

[Figure: timeline of samples Cs0 R0 C0 Ce0 … Cs3 R3 C3 Ce3 with IPC1, IPC2, IPC3]

CPC = (Cet − Cet-1) / (Cst − Cst-1): this should be 1!
CPC1 = 1.0 ± 1% ✓   CPC2 = 1.0 ± 1% ✓   CPC3 ≠ 1.0 ± 1% ✗
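The clock-as-ground-truth check can be sketched as a filter over sample tuples; the tuple layout (Cs, R, C, Ce) and the 1% tolerance follow the slide, while the sample values are invented:

```python
def filter_by_cpc(samples, tol=0.01):
    """Each sample is (Cs, R, C, Ce): wall clock before the reads, retired
    instructions, cycle counter, wall clock after the reads. Between two
    samples, CPC = dCe / dCs should be ~1.0; when it is not, the reads were
    torn apart (e.g. by an interrupt) and that interval's IPC is discarded."""
    kept = []
    for (cs0, r0, c0, ce0), (cs1, r1, c1, ce1) in zip(samples, samples[1:]):
        cpc = (ce1 - ce0) / (cs1 - cs0)
        if abs(cpc - 1.0) <= tol:
            kept.append((r1 - r0) / (c1 - c0))  # IPC of a trusted interval
    return kept

samples = [(0, 0, 0, 0),
           (100, 150, 100, 100),   # CPC = 1.0 -> trusted, IPC = 1.5
           (300, 250, 150, 320)]   # CPC = 1.1 -> torn sample, dropped
assert filter_by_cpc(samples) == [1.5]
```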

Page 29:

Filtering Lusearch IPC samples

[Figure: % of samples (log scale) vs IPC (log scale): raw IPC, raw CPC, filtered IPC, filtered CPC in [0.99, 1.01]]

Page 30:

IPC of individual methods in Lucene

[Figure: IPC of the top 10 methods (74% of total execution time), sampled at the default 1 KHz, the maximum 100 KHz, and SHIM at 10 MHz]

Page 31:

Overheads from other core

[Figure: overhead normalized to execution without SHIM, by method and loop IDs, for 30-cycle and 1213-cycle sampling periods]

Overheads from write invalidations: at 3 MHz, 1+ order of magnitude over the interrupt 'maximum'; at 113 MHz, 3+ orders of magnitude over the interrupt 'maximum'

Page 32:

Understanding Tail Latency


Page 33:

SHIM signals

Requests: thread ids; request id (configured); time stamps, PC
System threads: thread ids; time stamp, PC

Page 34:

All requests

[Figure: latency (ms, 0-120) vs request groups, from the slowest 1% to the fastest 1%: client latency and average queueing time]

Page 35:

The Tail: longest 200 requests

[Figure: latency (ms, 0-120) of the top 200 requests, broken into network and network queueing time, idle time, CPU time, and dispatch queueing time]

Noise: network & other (network imperfections), idle (OS imperfections). Not noise: CPU work (long requests), queuing at worker (overload).

Page 36:

Insight: long requests reveal themselves, regardless of the cause.

Page 37:

Noise: replicate & reissue. The Tail at Scale, Dean & Barroso, CACM'13

[Figure: CDF, percentage of requests vs latency (ms); the noise region in the tail; fixed issue time with 10% and 5% reissued]

All requests? Use the CDF for cost & potential.
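A minimal Monte-Carlo sketch of fixed-issue-time reissue (the "Tail at Scale" hedging idea): any request still outstanding at the reissue time gets a replica, and the client takes whichever finishes first. The exponential latency model, deadline, and seed are illustrative assumptions:

```python
import random

def reissue_tail(base, t_issue, frac=1.0, seed=1):
    """Requests still running at t_issue are reissued (with probability
    `frac`); the effective latency becomes min(original, t_issue + a fresh
    draw from the same empirical latency distribution)."""
    rng = random.Random(seed)
    out = []
    for lat in base:
        if lat > t_issue and rng.random() < frac:
            lat = min(lat, t_issue + rng.choice(base))
        out.append(lat)
    return out

def p99(xs):
    return sorted(xs)[int(0.99 * len(xs))]

rng = random.Random(0)
base = [rng.expovariate(1 / 10.0) for _ in range(5000)]
# Reissue the ~10% of requests that outlive the 90th percentile.
hedged = reissue_tail(base, t_issue=sorted(base)[int(0.9 * len(base))])
assert p99(hedged) < p99(base)
```

This is the cost/potential trade the CDF exposes: a fixed issue time at the 90th percentile caps the reissue budget at about 10% of requests.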

Page 38:

Probabilistic reissue. Optimal Reissue Policies for Reducing Tail Latencies, Kaler, He, & Elnikety, SPAA'17

[Figure: CDF, percentage of requests vs latency (ms); the noise region in the tail; 1-3% reissued with probability p vs 5% reissued]

Adding randomness to reissue makes one earlier reissue time d (vs n) optimal. Probability is proportional to the reissue budget & the noise in the tail.
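The "probability proportional to reissue budget" relationship can be sketched directly: if a fraction P(latency > d) of requests survive past the reissue time d, reissuing each survivor with probability p costs p · P(latency > d) extra work. The i.i.d. latency assumption and the numbers are illustrative:

```python
def p_for_budget(base, d, budget):
    """Choose the reissue probability p that spends a given budget:
    expected reissue fraction = p * P(latency > d), so p = budget / P(latency > d).
    `base` is an i.i.d. sample of latencies (an assumption for this sketch)."""
    survivors = sum(1 for x in base if x > d) / len(base)
    return min(1.0, budget / survivors)

# If half the requests outlive d, a 5% budget buys p = 0.1 per survivor.
base = [1.0] * 50 + [10.0] * 50
assert p_for_budget(base, 5.0, 0.05) == 0.1
```

Moving d earlier raises P(latency > d), so p must shrink to hold the budget, which is the knob the SPAA'17 policy optimizes.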

Page 39:

Single R: probabilistic reissue. Optimal Reissue Policies for Reducing Tail Latencies, Kaler, He, & Elnikety, SPAA'17

Page 40:


Work: speed up the tail efficiently

[Figure: CDF, percentage of requests vs latency (ms); the work region in the tail]

Judicious parallelism [ASPLOS’15]

DVFS faster on the tail [DISC’14, MICRO’17]

Asymmetric multicore [DISC’14, MICRO’17]

Page 41:

Work: parallelism

Parallelism historically for throughput. Idea: parallelism for tail latency.

[Figure: worker cores]

Page 42:

Queuing theory: optimizing average latency maximizes throughput, but not the tail! Shortening the tail reduces queuing latency.
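One way to make the queuing-theory point concrete is the textbook M/M/1 model, where response time is exponential with rate μ − λ; the model and rates below are an illustration, not the talk's workload:

```python
import math

def mm1_percentile(lam, mu, p):
    """M/M/1 response time is Exponential(mu - lam), so the p-th
    percentile is -ln(1 - p) / (mu - lam)."""
    assert lam < mu, "queue must be stable"
    return -math.log(1.0 - p) / (mu - lam)

# Speeding up service (raising mu's headroom over lam) shrinks queueing
# delay for every request, including the 99th percentile:
t99_slow = mm1_percentile(lam=90.0, mu=100.0, p=0.99)
t99_fast = mm1_percentile(lam=90.0, mu=110.0, p=0.99)
assert t99_fast < t99_slow
```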

Page 43:

Few-to-Many Dynamic Parallelism [ASPLOS'15]

Insight: long requests reveal themselves.
Approach: incrementally add parallelism to long requests (the tail) based on request progress & load.

Parallelism historically for throughput. Idea: parallelism for tail latency.

Page 44:

Few to Many at fixed delay d: add a thread every d ms

[Figure: tail latency (ms) vs Lucene RPS (30-48) for Sequential, 4-way, and fixed intervals of 20 ms, 100 ms, and 500 ms]

Long delay is good at high load; short delay is good at low load. Best at all loads?
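The fixed-delay schedule, one more thread every d ms up to a cap, is a small function; the cap of 4 used in the checks is arbitrary:

```python
def parallelism_at(elapsed_ms, d_ms, max_threads):
    """Few-to-Many at a fixed delay d: a request starts sequential and gains
    one thread every d_ms, capped at the target parallelism. (A sketch; the
    paper chooses its interval schedule offline per load level.)"""
    return min(max_threads, 1 + int(elapsed_ms // d_ms))

assert parallelism_at(0, 20, 4) == 1     # short requests stay sequential
assert parallelism_at(45, 20, 4) == 3    # long requests ramp up
assert parallelism_at(500, 20, 4) == 4   # capped at the target
```

Short requests finish before the first interval expires and never pay for parallelism, which is why the scheme spends threads only on the tail.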

Page 45:

Offline

Profiles: sequential & parallel demand distribution; efficiency of parallelism
Choose maximum target parallelism: utilize available hardware resources
Exhaustively explore parallelism: given a set of time intervals t & load, find the best tail latency & parallelism

Interval table

Page 46:

Online self scheduling

|requests| and parallelism schedule (interval 0 = 0 ms; intervals 1, 2 = 50, 100 ms):

≤2: @0 parallelism=3
3: @0 parallelism=1; @50 parallelism=3
4-6: @50 parallelism=1; @100 parallelism=3
≥7: @exit parallelism=1; @100 parallelism=3
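A sketch of table-driven self-scheduling: each request looks up its parallelism from the current load bucket and its own elapsed time. The bucket names and thresholds below are invented stand-ins for the offline-computed interval table:

```python
# Hypothetical interval table: load bucket -> list of (elapsed_ms, degree)
# steps. Low load parallelizes immediately; high load holds back longer.
TABLE = {
    "low":  [(0, 3)],
    "mid":  [(0, 1), (50, 3)],
    "high": [(0, 1), (100, 3)],
}

def degree(load_bucket, elapsed_ms):
    """A request consults the table with the current load and its own
    elapsed time, taking the last step whose time it has passed."""
    deg = 1
    for t, d in TABLE[load_bucket]:
        if elapsed_ms >= t:
            deg = d
    return deg

assert degree("low", 0) == 3     # parallelize right away when idle
assert degree("high", 60) == 1   # under load, stay sequential longer
assert degree("high", 120) == 3  # but a genuinely long request still ramps
```

The key property is that the decision is purely local: a request needs only a clock and the queue length, no prediction of its own length.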

Page 47:

Evaluation 2x8 64 bit 2.3 GHz Xeon, 64 GB

[Figure: tail latency (ms) vs requests per second, Few to Many vs Sequential: 21% fewer servers, or reduce the tail by 28%]

Page 48:


Work: speed up the tail efficiently

[Figure: CDF, percentage of requests vs latency (ms); the work region in the tail]

Judicious parallelism [ASPLOS’15]

Page 49:


Work: speed up the tail efficiently

[Figure: CDF, percentage of requests vs latency (ms); the work region in the tail]

DVFS faster on the tail [DISC’14, MICRO’17]

Asymmetric multicore (AMP) [DISC’14, MICRO’17]

all requests @ 2.3 GHz

Page 50:


Speed up the tail efficiently

[Figure: CDF, percentage of requests vs latency (ms); the bulk of requests run at 0.5 GHz, the tail at 2.3 GHz]

DVFS faster on the tail [DISC’14, MICRO’17]

+ available in servers today

Page 51:


Speed up the tail efficiently

0

0.5

1

1.5

2

2.5

3

3.5

4

4.5

5

0 20 40 60 80 100

0

20

40

60

80

100

Perc

enta

ge o

f re

quest

s

Latency (ms)

DVFS faster on the tail [MICRO’17]

+ available in servers today Asymmetric multicore (AMP) [DISC’14, MICRO’17]

+ much more energy efficient; + hyper-threading is a form; − core competition

[Figure: asymmetric multicore with big and little cores]

Page 52:

Adaptive Slow to Fast Framework

Slow to fast migration is optimal [ICAC’13]

Goal Minimize energy consumption and satisfy a tail latency target

Challenges When to migrate? What if the core speed is not available?

Insight: use the big core just enough: th + (l99 − th)/sp ≤ target. Migrate oldest first, and migrate early under load!
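Solving the slide's inequality th + (l99 − th)/sp ≤ target for the migration threshold th gives a closed form, sketched here with made-up numbers:

```python
def big_core_threshold(l99, speedup, target):
    """Latest time th at which a request must migrate to the big core so that
    even a 99th-percentile request meets the tail-latency target:
    th + (l99 - th) / sp <= target  =>  th <= (target * sp - l99) / (sp - 1)."""
    assert speedup > 1
    return (target * speedup - l99) / (speedup - 1)

# E.g. l99 = 100 ms on the small core, big core 2x faster, 80 ms target:
# run up to 60 ms on the small core, then the remaining 40 ms takes 20 ms.
assert big_core_threshold(l99=100.0, speedup=2.0, target=80.0) == 60.0
```

A larger th keeps more short requests entirely on the small core, which is where the energy savings come from.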


Page 53:

Controller design


[Diagram: feedback loop: target → error → controller → threshold → system → tail latency]
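A minimal proportional-controller sketch matching the block diagram; the gain and units are assumptions:

```python
def step(threshold, tail_ms, target_ms, gain=0.5):
    """One controller step: when the measured tail is over target, lower the
    migration threshold so requests reach the fast core sooner; when under,
    raise it to save energy. Gain of 0.5 is an invented example value."""
    error = target_ms - tail_ms
    return max(0.0, threshold + gain * error)

th = 60.0
th = step(th, tail_ms=90.0, target_ms=80.0)   # over target -> migrate earlier
assert th == 55.0
th = step(th, tail_ms=70.0, target_ms=80.0)   # under target -> relax
assert th == 60.0
```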

[Figures: two CDFs of percentage of requests vs latency (ms); load shifts the distribution]

Page 54:

Policies

All-cores: Pegasus adjusts all-core frequencies for load [Towards Energy Proportional…, Google & Stanford, ISCA'14]

Per-core approaches: TL minimizes tail latency using a 0 threshold; EETL is energy efficient with a target tail latency

Page 55:

Lucene with DVFS on Broadwell

[Figures: 99th percentile latency (ms) vs RPS, and normalized average energy vs RPS, for TL, Pegasus, and EETL]

Page 56:

Lucene on emulated AMP and DVFS

[Figures: 99th percentile latency (ms) vs RPS, and normalized average energy vs RPS, for TL_DVFS, Pegasus, EETL_DVFS, TL_AMP, and EETL_AMP]

Page 57:

Tail Latency


Efficiency BOTH !

Page 58:

Efficiency at scale for interactive workloads

Diagnosing the tail with continuous profiling
Noise: replicate; systems are not perfect
Queuing: not today!
Work: judicious use of resources on long requests

Request latency CDF is a powerful tool
Tail efficiency ≠ average or throughput
Hardware heterogeneity


Thank you

Page 59:

Heterogeneous hardware dominates homogeneous hardware for throughput, performance, and energy with a fixed power budget & variable request demand. Slow-to-Fast: sacrifice the average a bit to reduce energy & tail latency.


Requirements pull for heterogeneity! [DISC’14, ICAC’13, submission]

Page 60:


Page 61:

Hardware heterogeneity: opportunity & challenge

[Figure: processors (big, little, custom) and memory (DDR, NVM, flash, PIM, paired)]

Page 62:

Parallelism

[Figure: latency (ms) vs Lucene RPS, Sequential 99th vs 4-way 99th: parallelism improves the tail at low load, degrades it at high load]

Parallelism historically for throughput. Idea: parallelism for tail latency.

Page 63:

Software & hardware

Lucene: open source enterprise search; Wikipedia English, 10 GB index of 33 million pages; 10k queries from Lucene nightly tests

Bing: web search with one Index Serving Node (ISN); 160 GB web index on SSD, 17 GB cache; 30k Bing user queries

Hardware: 2x8 64-bit 2.3 GHz Xeon, 64 GB, Windows; 15 request servers, 1 core issues requests; target parallelism = 24 threads


Page 64:

Policies: Sequential. N way: a single degree of parallelism for each request. Adaptive: select the parallelism degree when a request starts, using system load [EUROSYS'13]. Request Clairvoyant: parallelizes long requests via perfect prediction of the tail. FM: Few to Many, incrementally add parallelism.


Page 65:

Load variation

Alternate between high & low load. FM adapts to bursts with low variance.

[Figure: tail latency (ms, 0-1600) vs Lucene RPS alternating Low, Low, High, High, for Sequential, 2-way, 4-way, and FM]

Page 66:

Fewer servers: Total Cost of ownership

[Figures: tail latency (ms, 300-1500) vs Lucene RPS (30-48). FM vs Sequential: 21% fewer servers. FM vs Adaptive: 9%]