TRANSCRIPT
![Page 1: Daniel S. Berger Benjamin Berg Timothy Zhu Microsoft ... · Daniel S. Berger Benjamin Berg Timothy Zhu Carnegie Mellon University Pennsylvania State University USENIX OSDI, 10/8/18](https://reader031.vdocuments.us/reader031/viewer/2022022717/5c399a7509d3f23f308c5da6/html5/thumbnails/1.jpg)
RobinHood: Tail Latency-Aware Caching
Dynamically Reallocating from Cache-Rich to Cache-Poor
Daniel S. Berger (Carnegie Mellon University), Benjamin Berg (Carnegie Mellon University), Timothy Zhu (Pennsylvania State University), Siddhartha Sen (Microsoft Research), Mor Harchol-Balter (Carnegie Mellon University)
USENIX OSDI, 10/8/18
![Page 2: Daniel S. Berger Benjamin Berg Timothy Zhu Microsoft ... · Daniel S. Berger Benjamin Berg Timothy Zhu Carnegie Mellon University Pennsylvania State University USENIX OSDI, 10/8/18](https://reader031.vdocuments.us/reader031/viewer/2022022717/5c399a7509d3f23f308c5da6/html5/thumbnails/2.jpg)
Typical Web Architecture
[Figure: a user request reaches the aggregation server, which sends backend queries to backends such as Recom., Products, Ads, ...; bars show each query's latency]
Request latency = max of query latencies
![Page 3: Daniel S. Berger Benjamin Berg Timothy Zhu Microsoft ... · Daniel S. Berger Benjamin Berg Timothy Zhu Carnegie Mellon University Pennsylvania State University USENIX OSDI, 10/8/18](https://reader031.vdocuments.us/reader031/viewer/2022022717/5c399a7509d3f23f308c5da6/html5/thumbnails/3.jpg)
Typical Web Architecture
Goal: minimize 99th-percentile (P99) request latency
[Figure: a user request reaches the aggregation server, which sends backend queries to backends such as Recom., Products, Ads, ...; bars show each query's latency]
Request latency = max of query latencies
![Page 4: Daniel S. Berger Benjamin Berg Timothy Zhu Microsoft ... · Daniel S. Berger Benjamin Berg Timothy Zhu Carnegie Mellon University Pennsylvania State University USENIX OSDI, 10/8/18](https://reader031.vdocuments.us/reader031/viewer/2022022717/5c399a7509d3f23f308c5da6/html5/thumbnails/4.jpg)
What Causes High P99 Request Latency?
Observations at xbox.com (3/2018):
[Figure: architecture diagram (user request, aggregation server, backend queries to Recom., Products, Ads, ...) with plots of query latency and request latency]
Better load balancing? Elastically scale backends?
Partially implemented
![Page 5: Daniel S. Berger Benjamin Berg Timothy Zhu Microsoft ... · Daniel S. Berger Benjamin Berg Timothy Zhu Carnegie Mellon University Pennsylvania State University USENIX OSDI, 10/8/18](https://reader031.vdocuments.us/reader031/viewer/2022022717/5c399a7509d3f23f308c5da6/html5/thumbnails/5.jpg)
What Else Can We Do?
[Figure: architecture diagram (user request, aggregation server with cache, backend queries to Recom., Products, Ads, ...)]
Aggregation Cache: Currently shared among queries to all backends
Can we use the aggregation cache to reduce the P99 request latency?
Observations at xbox.com (3/2018):
[Figure: plots of query latency and request latency]
![Page 6: Daniel S. Berger Benjamin Berg Timothy Zhu Microsoft ... · Daniel S. Berger Benjamin Berg Timothy Zhu Carnegie Mellon University Pennsylvania State University USENIX OSDI, 10/8/18](https://reader031.vdocuments.us/reader031/viewer/2022022717/5c399a7509d3f23f308c5da6/html5/thumbnails/6.jpg)
Can We Use Caching to Reduce the P99?
State-of-the-art caching systems focus on hit ratio and fairness, not the P99
“Caching layers do not directly address tail latency, aside from configurations where the entire working set can reside in a cache.”
Belief: No
[Figure: a cache in front of backend B; a cache hit takes 1 ms, a miss served by B takes 100 ms]
90% hits, 10% misses: P99 = 100 ms
95% hits, 5% misses: P99 = 100 ms
![Page 7: Daniel S. Berger Benjamin Berg Timothy Zhu Microsoft ... · Daniel S. Berger Benjamin Berg Timothy Zhu Carnegie Mellon University Pennsylvania State University USENIX OSDI, 10/8/18](https://reader031.vdocuments.us/reader031/viewer/2022022717/5c399a7509d3f23f308c5da6/html5/thumbnails/7.jpg)
Can We Use Caching to Reduce the P99?
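Belief: No
But: latency is not a constant
[Figure: a cache in front of backend B; a cache hit takes 1 ms; a miss served by B takes 50 ms in 85% of cases and 500 ms in 15% of cases]
90% hits, 10% misses: P99 = 500 ms
95% hits, 5% misses: P99 = 50 ms
Caching can reduce P99 request latency!
Effectiveness in web architecture?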
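The P99 figures on these two slides can be checked with a short calculation. The sketch below treats a cached query as a discrete latency distribution (1 ms on a hit, the backend's latency on a miss) and returns the smallest latency whose cumulative probability reaches 99%. The function names are illustrative, not from the talk.

```python
def percentile_of_mixture(outcomes, p):
    """outcomes: list of (latency_ms, probability) pairs summing to 1.

    Returns the smallest latency whose cumulative probability is >= p percent.
    """
    cumulative = 0.0
    for latency, prob in sorted(outcomes):
        cumulative += prob
        if cumulative >= p / 100 - 1e-12:  # small tolerance for float rounding
            return latency
    return max(latency for latency, _ in outcomes)


def cached_latency(hit_ratio, hit_ms, miss_outcomes):
    """Latency distribution of a query served by a cache in front of a backend."""
    miss_ratio = 1.0 - hit_ratio
    return [(hit_ms, hit_ratio)] + [(ms, miss_ratio * q) for ms, q in miss_outcomes]


constant_backend = [(100, 1.0)]               # every miss takes 100 ms
variable_backend = [(50, 0.85), (500, 0.15)]  # misses take 50 ms or 500 ms

for hit_ratio in (0.90, 0.95):
    p99_constant = percentile_of_mixture(cached_latency(hit_ratio, 1, constant_backend), 99)
    p99_variable = percentile_of_mixture(cached_latency(hit_ratio, 1, variable_backend), 99)
    print(hit_ratio, p99_constant, p99_variable)
# 0.9 100 500
# 0.95 100 50
```

With a constant 100 ms miss latency, any miss ratio above 1% leaves the P99 at 100 ms. With variable miss latency, raising the hit ratio from 90% to 95% cuts the share of 500 ms queries from 1.5% to 0.75%, which pushes them out of the P99.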
![Page 8: Daniel S. Berger Benjamin Berg Timothy Zhu Microsoft ... · Daniel S. Berger Benjamin Berg Timothy Zhu Carnegie Mellon University Pennsylvania State University USENIX OSDI, 10/8/18](https://reader031.vdocuments.us/reader031/viewer/2022022717/5c399a7509d3f23f308c5da6/html5/thumbnails/8.jpg)
RobinHood: Key Idea
Observations for xbox.com (3/2018):
During load spike:
[Figure: architecture diagram (user request, aggregation server with cache, backend queries to Recom., Products, Ads, ...)]
RobinHood: more cache ⇒ less load ⇒ much lower P99
![Page 9: Daniel S. Berger Benjamin Berg Timothy Zhu Microsoft ... · Daniel S. Berger Benjamin Berg Timothy Zhu Carnegie Mellon University Pennsylvania State University USENIX OSDI, 10/8/18](https://reader031.vdocuments.us/reader031/viewer/2022022717/5c399a7509d3f23f308c5da6/html5/thumbnails/9.jpg)
RobinHood: Key Idea
Observations for xbox.com (3/2018):
During load spike:
[Figure: architecture diagram (user request, aggregation server with cache, backend queries to Recom., Products, Ads, ...)]
Dynamic Cache Partitions: Products, Recom., Ads
![Page 10: Daniel S. Berger Benjamin Berg Timothy Zhu Microsoft ... · Daniel S. Berger Benjamin Berg Timothy Zhu Carnegie Mellon University Pennsylvania State University USENIX OSDI, 10/8/18](https://reader031.vdocuments.us/reader031/viewer/2022022717/5c399a7509d3f23f308c5da6/html5/thumbnails/10.jpg)
RobinHood: Key Idea
Observations for xbox.com (3/2018):
During load spike:
[Figure: architecture diagram (user request, aggregation server with cache, backend queries to Recom., Products, Ads, ...)]
Dynamic Cache Partitions: Recom., Ads, Products
![Page 15: Daniel S. Berger Benjamin Berg Timothy Zhu Microsoft ... · Daniel S. Berger Benjamin Berg Timothy Zhu Carnegie Mellon University Pennsylvania State University USENIX OSDI, 10/8/18](https://reader031.vdocuments.us/reader031/viewer/2022022717/5c399a7509d3f23f308c5da6/html5/thumbnails/15.jpg)
The RobinHood Caching System
[Figure: RobinHood Cache]
- Scalable in # backends, # aggregation servers
- Dynamically partition the aggregation cache
- First caching system to minimize request P99
- Deployable on off-the-shelf software stack
![Page 16: Daniel S. Berger Benjamin Berg Timothy Zhu Microsoft ... · Daniel S. Berger Benjamin Berg Timothy Zhu Carnegie Mellon University Pennsylvania State University USENIX OSDI, 10/8/18](https://reader031.vdocuments.us/reader031/viewer/2022022717/5c399a7509d3f23f308c5da6/html5/thumbnails/16.jpg)
How to Repartition the Cache?
Every 5 seconds: RobinHood taxes everyone 1%
How to redistribute the tax?
First idea: give cache to high-latency backends
Recall: not all requests are the same
[Figure: cache and backends; user requests split 99.5% / 0.5%, with one backend labeled "High latency"]
Small effect on request P99
![Page 17: Daniel S. Berger Benjamin Berg Timothy Zhu Microsoft ... · Daniel S. Berger Benjamin Berg Timothy Zhu Carnegie Mellon University Pennsylvania State University USENIX OSDI, 10/8/18](https://reader031.vdocuments.us/reader031/viewer/2022022717/5c399a7509d3f23f308c5da6/html5/thumbnails/17.jpg)
How to Repartition the Cache?
Every 5 seconds: RobinHood taxes everyone 1%
How to redistribute the tax?
RobinHood: find the cause of high request P99
[Figure: request latency distribution from P0 to P100, with P99 marked]
Who “blocked” this request?
![Page 18: Daniel S. Berger Benjamin Berg Timothy Zhu Microsoft ... · Daniel S. Berger Benjamin Berg Timothy Zhu Carnegie Mellon University Pennsylvania State University USENIX OSDI, 10/8/18](https://reader031.vdocuments.us/reader031/viewer/2022022717/5c399a7509d3f23f308c5da6/html5/thumbnails/18.jpg)
How to Repartition the Cache?
Every 5 seconds: RobinHood taxes everyone 1%
How to redistribute the tax?
RobinHood: find the cause of high request P99
[Figure: request latency distribution from P0 to P100, with P99 marked]
Who “blocked” these requests?
⇒ Track “request blocking count” (RBC) for each backend
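A minimal sketch of how a request blocking count could be computed from logged requests. It assumes that the requests "around P99" are those whose latency falls between the P98.5 and P99.5 of all request latencies, and that the backend that "blocked" a request is the one with its slowest query. The window bounds, the function names, and the synthetic data are assumptions for illustration.

```python
import math
from collections import Counter


def nearest_rank(values, p):
    """p-th percentile (0 < p <= 100) using the nearest-rank method."""
    ordered = sorted(values)
    return ordered[math.ceil(p * len(ordered) / 100) - 1]


def request_blocking_counts(requests, low_pct=98.5, high_pct=99.5):
    """requests: list of dicts mapping backend name -> query latency (ms)."""
    latencies = [max(r.values()) for r in requests]  # request latency = max query latency
    low = nearest_rank(latencies, low_pct)
    high = nearest_rank(latencies, high_pct)
    rbc = Counter()
    for r, latency in zip(requests, latencies):
        if low <= latency <= high:       # request sits in the window around P99
            blocker = max(r, key=r.get)  # backend with the slowest query
            rbc[blocker] += 1
    return rbc


# Synthetic trace: request i has latency 100 + i ms; Ads is the slowest
# backend for every 4th request, Recom. for all others.
requests = []
for i in range(1000):
    slow = 100 + i
    if i % 4 == 0:
        requests.append({"Products": 5, "Ads": slow, "Recom.": 20})
    else:
        requests.append({"Products": 5, "Ads": 10, "Recom.": slow})

rbc = request_blocking_counts(requests)
print(rbc["Recom."], rbc["Ads"], rbc["Products"])  # 8 3 0
```

Only the 11 requests inside the window contribute, so a backend that is slow but rarely on the critical path of tail requests accumulates a low count.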
![Page 19: Daniel S. Berger Benjamin Berg Timothy Zhu Microsoft ... · Daniel S. Berger Benjamin Berg Timothy Zhu Carnegie Mellon University Pennsylvania State University USENIX OSDI, 10/8/18](https://reader031.vdocuments.us/reader031/viewer/2022022717/5c399a7509d3f23f308c5da6/html5/thumbnails/19.jpg)
RobinHood Architecture
[Figure: aggregation server with cache and RH-control, in front of the backends]
RobinHood Controller (RH-control)
- ingests RBC
- calculates / enforces cache allocation
- not on request path
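The controller's allocation step can be sketched as follows, assuming that the 1% tax is taken from each backend's current partition and that the collected space is handed back in proportion to each backend's RBC. The proportional rule, the function name, and the partition sizes are assumptions for illustration, not a statement of the deployed implementation.

```python
def reallocate(allocation_mb, rbc, tax_rate=0.01):
    """Tax every partition by tax_rate and redistribute the pool by RBC share.

    allocation_mb: dict backend -> current cache partition size (MB)
    rbc:           dict backend -> request blocking count from the last interval
    """
    total_rbc = sum(rbc.get(b, 0) for b in allocation_mb)
    if total_rbc == 0:
        return dict(allocation_mb)  # no blocking observed: leave partitions unchanged

    taxed = {b: size * (1 - tax_rate) for b, size in allocation_mb.items()}
    pool = sum(allocation_mb.values()) - sum(taxed.values())
    return {
        b: taxed[b] + pool * rbc.get(b, 0) / total_rbc
        for b in allocation_mb
    }


# Hypothetical starting point: three equal partitions of 1000 MB.
allocation = {"Products": 1000.0, "Recom.": 1000.0, "Ads": 1000.0}
new_allocation = reallocate(allocation, {"Recom.": 8, "Ads": 2})
print({b: round(mb, 1) for b, mb in new_allocation.items()})
# {'Products': 990.0, 'Recom.': 1014.0, 'Ads': 996.0}
```

The total cache size is preserved. Repeating this step every interval moves space gradually from backends that rarely block requests to those that often do, and because it runs outside the request path it adds no latency to user requests.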
![Page 20: Daniel S. Berger Benjamin Berg Timothy Zhu Microsoft ... · Daniel S. Berger Benjamin Berg Timothy Zhu Carnegie Mellon University Pennsylvania State University USENIX OSDI, 10/8/18](https://reader031.vdocuments.us/reader031/viewer/2022022717/5c399a7509d3f23f308c5da6/html5/thumbnails/20.jpg)
RobinHood Architecture
In practice many Ag. servers ⇒ one RH-control per Ag. server
[Figure: several Ag. servers, each with its own cache and RH-control, in front of the backends]
Distributed RobinHood:
- Local decisions
- Local measurements
- Pooled measurements
Challenge: insufficient # tail data points
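To illustrate why tail data points become scarce with purely local measurements, the sketch below counts how many samples lie at or above the P99 on one server versus in the pooled data. The 16 servers, 50 samples per server, and the synthetic latency values are all assumptions chosen to keep the arithmetic easy to follow.

```python
import math


def nearest_rank(values, p):
    """p-th percentile (0 < p <= 100) using the nearest-rank method."""
    ordered = sorted(values)
    return ordered[math.ceil(p * len(ordered) / 100) - 1]


def tail_sample_count(latencies, p=99):
    """Number of samples at or above the p-th percentile."""
    threshold = nearest_rank(latencies, p)
    return sum(1 for x in latencies if x >= threshold)


# Hypothetical measurement interval: 16 servers with 50 request latencies each.
local = {f"ag-{k}": [16 * j + k for j in range(50)] for k in range(16)}
pooled = [x for samples in local.values() for x in samples]

local_tail = tail_sample_count(local["ag-0"])
pooled_tail = tail_sample_count(pooled)
print(local_tail, pooled_tail)  # 1 9
```

Each server alone sees a single tail sample per interval, which makes any tail-based count noisy, while the pooled data contains several.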
![Page 21: Daniel S. Berger Benjamin Berg Timothy Zhu Microsoft ... · Daniel S. Berger Benjamin Berg Timothy Zhu Carnegie Mellon University Pennsylvania State University USENIX OSDI, 10/8/18](https://reader031.vdocuments.us/reader031/viewer/2022022717/5c399a7509d3f23f308c5da6/html5/thumbnails/21.jpg)
Experimental Setup
[Figure: request generator; 16x Ag. servers, each with cache and RH-control; 20x backends]
Request generator: replay production trace for 4 hours, 200k queries/second
32 GB cache size
Backends: MySQL (I/O bound), Matrix Multiply (CPU bound), K-V Store (CPU bound)
⇒ Emulate query latency spikes
![Page 22: Daniel S. Berger Benjamin Berg Timothy Zhu Microsoft ... · Daniel S. Berger Benjamin Berg Timothy Zhu Carnegie Mellon University Pennsylvania State University USENIX OSDI, 10/8/18](https://reader031.vdocuments.us/reader031/viewer/2022022717/5c399a7509d3f23f308c5da6/html5/thumbnails/22.jpg)
Evaluation Results: P99 Request Latency
[Figure: request P99 latency [ms] over time for each system]
- RobinHood [our proposal]
- Balance Query Latencies [Hyperbolic, ATC ’17]
- Original MS System [OneRF]
- Maximize Overall Hit Ratio [Cliffhanger, NSDI ’16]
![Page 23: Daniel S. Berger Benjamin Berg Timothy Zhu Microsoft ... · Daniel S. Berger Benjamin Berg Timothy Zhu Carnegie Mellon University Pennsylvania State University USENIX OSDI, 10/8/18](https://reader031.vdocuments.us/reader031/viewer/2022022717/5c399a7509d3f23f308c5da6/html5/thumbnails/23.jpg)
What Makes RobinHood so Effective?
[Figure: request P99 latency [ms], RobinHood [our proposal] vs. Original MS System [OneRF]]
The RobinHood tradeoff:
- Sacrifice performance of some backends → up to 2.5x higher latency
- Reduce latency of bottleneck backends → typically 4x lower latency
⇒ Reduced request latency
![Page 24: Daniel S. Berger Benjamin Berg Timothy Zhu Microsoft ... · Daniel S. Berger Benjamin Berg Timothy Zhu Carnegie Mellon University Pennsylvania State University USENIX OSDI, 10/8/18](https://reader031.vdocuments.us/reader031/viewer/2022022717/5c399a7509d3f23f308c5da6/html5/thumbnails/24.jpg)
Conclusions
Is it possible to use caches to improve the request P99?
Yes! Huge reduction in P99 spikes and SLO violations. ⇒ Use cache as load balancers: “RBC load metric”.
Feasibility in production systems?
Yes! Built using off-the-shelf software stack. Works orthogonally to existing load balancing and data/quality tradeoffs.
Is this the optimal solution? End of this project?
No! There’s a lot to do. Need to consider the effect of other request structures.
Poster #31