measuring and optimizing tail latency · 2017-10-23
TRANSCRIPT
Measuring and Optimizing Tail Latency
Kathryn S. McKinley (Google), Xi Yang, Stephen M. Blackburn, Md Haque, Sameh Elnikety, Yuxiong He, Ricardo Bianchini
Tail Latency Matters
Two-second slowdown reduced revenue/user by 4.3%. [Eric Schurman, Bing]
400-millisecond delay decreased searches/user by 0.59%. [Jake Brutlag, Google]
TOP PRIORITY
Photo: Google / Connie Zhou
Servers in US datacenters
[Chart: servers in millions (0-20), 2006-2020, by category: unbranded 2+ sockets, unbranded 1 socket, branded 2+ sockets, branded 1 socket]
*Shehabi et al., United States Data Center Energy Usage Report, Lawrence Berkeley, 2016.
Electricity in US datacenters
[Chart: billion kWh / year (0-200), 2000-2020; scenarios: 2010 trend, current trend, better, bigger, B+B, best practices, BP + bigger; annotation: actual, $1 to $6 billion]
*Shehabi et al., United States Data Center Energy Usage Report, Lawrence Berkeley, 2016.
Datacenter economics quick facts*
~$500,000: cost of a small datacenter
~3,000,000: US datacenters in 2016
~$1.5 trillion: US capital investment to date
~$3,000,000,000: kWh dollars / year
~$30,000,000: savings from 1% less work (lots more by not building a datacenter)
*Shehabi et al., United States Data Center Energy Usage Report, Lawrence Berkeley, 2016.
Tail latency: TOP PRIORITY
Efficiency: TOP PRIORITY
Both?!
Server architecture
client → aggregator → workers
Characteristics of interactive services
[CDF: percentage of requests (0-100) vs latency (ms, 0-100)]
Bursty, diurnal. The CDF changes slowly. The slowest server dictates the tail. Orders of magnitude difference between average & tail (99th percentile).
What is in the tail?
[CDF: percentage of requests vs latency (ms), with the tail marked '?']
Roadmap
Diagnosing the tail with continuous profiling
Noise: systems are not perfect
Queuing: too much load is bad, but so is over-provisioning
Work: many requests are long
Insights: use the CDF off line; long requests reveal themselves, treat them specially
Simplified life of a request
[Diagram: a request fans out across workers, each running an application on a Java VM on an OS/VM, and a response returns]
Prior state of the art: Dick Sites, Google. https://www.youtube.com/watch?v=QBu2Ae8-8LM
Request profiling
Prior approach (✗): hand-instrument the system; with a 1% on-line budget, sample, but tails are rare…; off-line schematics; have insight; improve the system.
This approach: automated instrumentation; a 1% on-line budget allows continuous on-line profiling; off-line schematics; have insight; improve the system + on-line optimization.
Automated cycle-level on-line profiling
Insight: hardware & software generate signals. Hardware signals: performance counters. Software signals: memory locations (counters, tags).
SHIM design [ISCA'15 (Top Picks HM), ATC'16]
Observe global state from another core
LLC misses per cycle:
while true:
    for counter in (LLC_misses, cycles):
        buf[i++] = readCounter(counter)
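A minimal runnable sketch of that loop in Python, assuming a hypothetical read_counter() that returns the current value of a hardware performance counter (the real SHIM polls counters from a dedicated observer core):

    # Sketch of a SHIM-style observer loop; read_counter() is a
    # hypothetical stand-in for a hardware performance-counter read.
    LLC_MISSES, CYCLES = "llc_misses", "cycles"   # illustrative counter IDs
    BUF_SIZE = 1 << 20

    def observe(read_counter):
        """Log (LLC misses, cycles) pairs into a ring buffer forever;
        LLC misses per cycle comes from consecutive pairs, off line."""
        buf = [0] * BUF_SIZE
        i = 0
        while True:
            for counter in (LLC_MISSES, CYCLES):
                buf[i % BUF_SIZE] = read_counter(counter)
                i += 1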
Observe local state with SMT hardware
[Strip charts: HT1 IPC, Core IPC, and HT2 SHIM IPC, each on a 0-4 scale]
HT1 IPC = Core IPC - HT2 SHIM IPC
while true:
    for counter in (HT2_SHIM, Core, cycles):
        buf[i++] = readCounter(counter)
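A sketch of the subtraction above in Python; the per-core counters see work from both hyperthreads, so the observer subtracts its own contribution (names are illustrative):

    def ht1_ipc(core_retired, ht2_retired, cycles):
        """HT1 IPC = Core IPC - HT2 SHIM IPC.

        core_retired counts instructions retired by both hyperthreads
        over the interval; subtracting the observer's own retired
        instructions (HT2, running SHIM) leaves the application
        thread's (HT1) instructions per cycle."""
        return core_retired / cycles - ht2_retired / cycles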
Correlate hardware & software events
[Strip charts: HT1 IPC, Core IPC, and HT2 SHIM IPC (0-4) while HT1 runs methods A(), B(), C()]
while true:
    for counter in (HT2_SHIM, Core, cycles):
        buf[i++] = readCounter(counter)
    tid = thread on HT1
    buf[i++] = tid.method
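A sketch of one combined sample; current_tag() is a hypothetical read of the (thread id, method) pair the HT1 application publishes to a shared memory location:

    def take_sample(read_counter, current_tag):
        """One SHIM sample correlating hardware and software signals."""
        ht2_retired = read_counter("ht2_shim_retired")  # observer's own work
        core_retired = read_counter("core_retired")     # both hyperthreads
        cycles = read_counter("cycles")
        tid, method = current_tag()   # software signal written by HT1
        return (cycles, core_retired, ht2_retired, tid, method)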
Fidelity
[Histogram: % of samples (log scale) vs IPC (log scale); raw samples]
[Timeline: counter samples (R0, C0) … (R3, C3) yield IPC1, IPC2, IPC3; one interval is invalid (✗)]
IPC = (Rt - Rt-1) / (Ct - Ct-1)
Counters: C = cycles, R = retired instructions
Problem: samples are not atomic
Solution: use the clock as ground truth. Bracket each sample with clock reads: Cs0 R0 C0 Ce0, Cs1 R1 C1 Ce1, Cs2 R2 C2 Ce2, Cs3 R3 C3 Ce3.
CPC = (Cet - Cet-1) / (Cst - Cst-1); this should be 1!
[Timeline: CPC1 = 1.0 +/- 1% (✓), CPC2 = 1.0 +/- 1% (✓), CPC3 != 1.0 +/- 1% (✗)]
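A sketch of the fidelity filter using the slide's symbols; each sample is a (Cs, R, C, Ce) tuple, and pairs whose CPC strays from 1.0 were not effectively atomic:

    def filtered_ipc(samples, tol=0.01):
        """IPC per interval, keeping only effectively atomic sample pairs.

        samples: (Cs, R, C, Ce) tuples: wall clock before the reads,
        retired instructions, cycles, wall clock after the reads.
        CPC = dCe/dCs should be 1.0; interrupts, preemption, or
        migration between reads push it away, so drop that interval."""
        ipcs = []
        for (cs0, r0, c0, ce0), (cs1, r1, c1, ce1) in zip(samples, samples[1:]):
            cpc = (ce1 - ce0) / (cs1 - cs0)
            if abs(cpc - 1.0) <= tol:               # e.g. CPC in [0.99, 1.01]
                ipcs.append((r1 - r0) / (c1 - c0))  # IPC = dR / dC
        return ipcs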
Filtering Lusearch IPC samples
[Histograms: % of samples (log scale) vs IPC and CPC (log scale): raw IPC, raw CPC, filtered IPC, and filtered CPC in [0.99, 1.01]]
IPC of individual methods in Lucene
[Bar chart: IPC (0-1.6) for the top 10 methods (74% of total execution time), sampled at default 1 KHz, maximum 100 KHz, and SHIM 10 MHz]
Overheads from the other core
[Bar chart: method and loop IDs vs overhead normalized to without SHIM, for 30-cycle and 1213-cycle sampling periods]
Overheads come from write invalidations.
3 MHz: 1+ order of magnitude over the interrupt-based 'maximum'. 113 MHz: 3+ orders of magnitude over the interrupt-based 'maximum'.
Understanding Tail Latency
SHIM signals
Requests: thread ids; request id (configure); time stamps, PC
System threads: thread ids; time stamp, PC
All requests
[Chart: latency (ms, 0-120) across request groups from the slowest 1% to the fastest 1%, showing client latency and average queueing time]
The tail: the longest 200 requests
[Stacked chart: latency (ms, 0-120) for the top 200 requests, decomposed into network & network queueing time, idle time, CPU time, and dispatch queueing time]
Network & other (network imperfections) and idle time (OS imperfections) are noise; CPU work (long requests) and queuing at the worker (overload) are not noise.
Insight: long requests reveal themselves, regardless of the cause.
Noise: replicate & reissue [The Tail at Scale, Dean & Barroso, CACM'13]
[CDF: percentage of requests vs latency (ms), marking where 10% and 5% of requests would be reissued, and the noise region]
Reissue all requests? Use the CDF off line for cost & potential. Fixed issue time: reissue any request still outstanding at a fixed time (sketch below).
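A minimal sketch of a fixed-issue-time policy, assuming a hypothetical send_request() callable that queries a replica; t_reissue is read off the CDF offline (the 95th percentile reissues ~5%, the 90th ~10%):

    import concurrent.futures

    def request_with_reissue(send_request, t_reissue):
        """Send a request; if no reply within t_reissue seconds,
        reissue to a replica and return whichever response is first."""
        with concurrent.futures.ThreadPoolExecutor(max_workers=2) as pool:
            futures = {pool.submit(send_request)}
            done, _ = concurrent.futures.wait(futures, timeout=t_reissue)
            if not done:                      # still outstanding: reissue
                futures.add(pool.submit(send_request))
                done, _ = concurrent.futures.wait(
                    futures, return_when=concurrent.futures.FIRST_COMPLETED)
            return done.pop().result()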
Probabilistic reissue [Optimal Reissue Policies for Reducing Tail Latencies, Kaler, He, & Elnikety, SPAA'17]
[CDF: percentage of requests vs latency (ms), marking 1-3% reissued with probability p, the 5% deterministic reissue point, and the noise region]
Adding randomness to reissue makes a single, earlier reissue time d (vs n reissue times) optimal. The probability is proportional to the reissue budget & the noise in the tail (sketch below).
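A sketch of the single-time probabilistic rule; d and p are the slide's symbols, with p sized so that p times the fraction of requests still outstanding at d matches the 1-3% budget:

    import random

    def should_reissue(elapsed, d, p):
        """Single-time probabilistic reissue (after Kaler et al., SPAA'17):
        rather than deterministically duplicating everything still
        outstanding at a late time, duplicate at an earlier time d with
        probability p."""
        return elapsed >= d and random.random() < p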
Single-R probabilistic reissue [Optimal Reissue Policies for Reducing Tail Latencies, Kaler, He, & Elnikety, SPAA'17]
Work: speed up the tail efficiently
[CDF: percentage of requests vs latency (ms); the work in the tail highlighted]
Judicious parallelism [ASPLOS’15]
DVFS faster on the tail [DISC’14, MICRO’17]
Asymmetric multicore [DISC’14, MICRO’17]
Work: parallelism
Parallelism historically for throughput; the idea: parallelism for tail latency.
Queuing theory: optimizing average latency maximizes throughput, but not the tail! Shortening the tail reduces queuing latency.
Parallelism
Insight: long requests reveal themselves.
Approach: incrementally add parallelism to long requests (the tail) based on request progress & load.
Parallelism historically for throughput; the idea: parallelism for tail latency.
Few-to-Many Dynamic Parallelism [ASPLOS’15]
[Chart: tail latency (ms, 300-1500) vs Lucene RPS (30-48) for Sequential, 4-way, and fixed intervals of 20, 100, and 500 ms]
Few-to-Many at fixed delay d: add a thread every d ms.
A long delay is good at high load; a short delay is good at low load. Best at all loads?
Offline:
Profile the sequential & parallel demand distributions and the efficiency of parallelism.
Choose the maximum target parallelism to utilize available hardware resources.
Exhaustively explore parallelism: given a set of time intervals t & load, find the best tail latency & parallelism.
Interval table and online self-scheduling (intervals at 0, 50, 100 ms; parallelism chosen by load |requests| and request age; a sketch follows below):
|requests| ≤ 2: @0, parallelism = 3
|requests| = 3: @0, parallelism = 1; @50, parallelism = 3
|requests| 4-6: @50, parallelism = 1; @100, parallelism = 3
|requests| ≥ 7: @exit, parallelism = 1; @100, parallelism = 3
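A sketch of the online lookup in Python; the table mirrors the slide's example, but the bucket boundaries and values are illustrative, and the real table comes from the offline exhaustive search:

    INTERVALS_MS = [0, 50, 100]   # interval start times from the offline search

    TABLE = {                     # load bucket -> parallelism per interval
        "<=2": [3, 3, 3],
        "==3": [1, 3, 3],
        "4-6": [1, 1, 3],
        ">=7": [1, 1, 3],
    }

    def load_bucket(outstanding):
        """Bucket the number of outstanding requests as in the table."""
        if outstanding <= 2: return "<=2"
        if outstanding == 3: return "==3"
        if outstanding <= 6: return "4-6"
        return ">=7"

    def target_parallelism(elapsed_ms, outstanding):
        """Threads a request of age elapsed_ms should use right now:
        requests add parallelism as they age, and stay sequential
        longer when the system is loaded."""
        interval = max(i for i, t in enumerate(INTERVALS_MS) if elapsed_ms >= t)
        return TABLE[load_bucket(outstanding)][interval]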
Evaluation: 2x8-core 64-bit 2.3 GHz Xeon, 64 GB
[Chart: tail latency (ms, 300-1500) vs requests per second (30-48), Few-to-Many vs Sequential: 21% fewer servers, or reduce the tail by 28%]
Work: speed up the tail efficiently
[CDF: percentage of requests vs latency (ms); work region highlighted]
Judicious parallelism [ASPLOS'15] ✔
Work: speed up the tail efficiently
[CDF: percentage of requests vs latency (ms); all requests @ 2.3 GHz]
DVFS faster on the tail [DISC'14, MICRO'17]
Asymmetric multicore (AMP) [DISC'14, MICRO'17]
Speed up the tail efficiently
[CDF: percentage of requests vs latency (ms), with regions labeled 2.3 GHz and 0.5 GHz]
DVFS faster on the tail [DISC'14, MICRO'17]
+ available in servers today
Speed up the tail efficiently
[CDF: percentage of requests vs latency (ms)]
DVFS faster on the tail [MICRO'17]: + available in servers today
Asymmetric multicore (AMP) [DISC'14, MICRO'17]: + much more energy efficient; + hyper-threading is a form; - core competition
Adaptive Slow-to-Fast framework
Slow-to-fast migration is optimal [ICAC'13].
Goal: minimize energy consumption and satisfy a tail latency target.
Challenges: when to migrate? What if the core speed is not available?
Insight: use the big core just enough: th + (l99 - th) / sp ≤ target, where th is the threshold time on the slow core, l99 the 99th-percentile demand, and sp the big-core speedup. Migrate oldest first, and migrate early under load!
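A sketch of that condition, solving th + (l99 - th)/sp ≤ target for the largest slow-core threshold th (symbols as on the slide; the helper names are mine):

    def max_slow_core_time(l99, speedup, target):
        """Largest th satisfying th + (l99 - th) / speedup <= target:
        a request migrated to the big core after th still meets the
        target even if its demand is at the 99th percentile. Assumes
        speedup > 1 and target >= l99 / speedup."""
        return (target - l99 / speedup) / (1.0 - 1.0 / speedup)

    def should_migrate(elapsed, l99, speedup, target):
        """Migrate the oldest request once it has run this long slowly."""
        return elapsed >= max_slow_core_time(l99, speedup, target)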
52
Controller design
[Block diagram: the target and measured tail latency produce an error; the controller adjusts the threshold that drives the system]
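One controller step, sketched as a simple proportional controller; the gain and structure are illustrative, not the paper's exact design:

    def controller_step(threshold, target_l99, measured_l99, gain=0.5):
        """Adjust the migration threshold from the tail-latency error:
        if the measured 99th percentile exceeds the target, lower the
        threshold so requests migrate to the fast core sooner; with
        slack, raise it to keep requests on the slow core and save
        energy."""
        error = target_l99 - measured_l99
        return max(0.0, threshold + gain * error)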
[Two CDFs: percentage of requests vs latency (ms), showing the distribution shifting with load]
Policies
All-cores: Pegasus adjusts all-core frequencies for load [Towards Energy Proportional…, Google & Stanford, ISCA'14].
Per-core approaches: TL minimizes tail latency using a 0 threshold; EETL is energy efficient with a target tail latency.
Lucene with DVFS on Broadwell
[Charts: 99th-percentile latency (ms, 100-225) vs RPS (0-125), and normalized average energy (0.75-1.5) vs RPS, for TL, Pegasus, and EETL]
Lucene on emulated AMP and DVFS
[Charts: 99th-percentile latency (ms, 100-240) vs RPS (0-150), and normalized average energy (0-3) vs RPS, for TL_DVFS, Pegasus, EETL_DVFS, TL_AMP, and EETL_AMP]
Tail latency & efficiency: BOTH!
Efficiency at scale for interactive workloads
Diagnosing the tail with continuous profiling
Noise: replicate; systems are not perfect
Queuing: not today!
Work: judicious use of resources on long requests
The request latency CDF is a powerful tool. Tail efficiency ≠ average or throughput. Hardware heterogeneity.
Thank you
Heterogeneous hardware dominates homogeneous hardware for throughput, performance, and energy with a fixed power budget & variable request demand. Slow-to-Fast sacrifices the average a bit to reduce energy & tail latency.
Requirements pull for heterogeneity! [DISC’14, ICAC’13, submission]
Hardware heterogeneity: opportunity & challenge
Processors: big, little, custom
Memory: DDR, NVM, flash, PIM, paired
Parallelism
[Chart: 99th-percentile latency (ms, 0-1500) vs Lucene RPS (0-50), Sequential vs 4-way: 4-way improves at low load, degrades at high load]
Parallelism historically for throughput; the idea: parallelism for tail latency.
Software & hardware
Lucene: open-source enterprise search; Wikipedia English, a 10 GB index of 33 million pages; 10k queries from Lucene nightly tests.
Bing: web search with one Index Serving Node (ISN); 160 GB web index on SSD, 17 GB cache; 30k Bing user queries.
Hardware: 2x8-core 64-bit 2.3 GHz Xeon, 64 GB, Windows; 15 request servers, 1 core issues requests; target parallelism = 24 threads.
Policies
Sequential, N-way: a single degree of parallelism for each request.
Adaptive: select the parallelism degree when a request starts, using system load [EUROSYS'13].
Request Clairvoyant: parallelizes long requests via perfect prediction of the tail.
FM (Few-to-Many): incrementally add parallelism.
Load variation
Alternate between high & low load: FM adapts to bursts with low variance.
[Chart: tail latency (ms, 0-1600) vs Lucene RPS alternating low, low, high, high, for Sequential, 2-way, 4-way, and FM]
Fewer servers: total cost of ownership
[Charts: tail latency (ms, 300-1500) vs Lucene RPS (30-48): FM vs Sequential saves 21%; FM vs Adaptive saves 9%]