Learning in situ: a randomized experiment in video streaming
https://puffer.stanford.edu
Francis Y. Yan, Hudson Ayers, Chenzhi Zhu†,Sadjad Fouladi, James Hong, Keyi Zhang,
Philip Levis, Keith Winstein
Stanford University, †Tsinghua University
October 22, 2019
Francis Y. Yan (Stanford) October 22, 2019 1 / 32
Outline
• Networked systems present unique challenges for machine learning
• Puffer: a live TV streaming website we built to conduct a randomized experiment
• Fugu: an adaptive bitrate (ABR) algorithm that robustly outperforms other schemes by learning in situ (on data from the real deployment environment, Puffer)
Unique challenges for ML in networking
• We don’t know how to emulate the Internet very accurately
• Mismatch between training environment (simulator, emulator, or testbed) and testing environment (the Internet)
• Internet has too much variability and heavy tails
Challenge 1: We don’t know how to emulate the Internet
Figure: Best effort to emulate a wireless path between Nepal and AWS India. Mean error: 19.1%.
Challenge 2: Mismatch between training and testing environments
Indigo in emulation:
Figure: Power of schemes over emulated networks with varying link rates and 50 ms minimum one-way delay. The schemes are split into two graphs for clarity.
Challenge 2: Mismatch between training and testing environments
Indigo in real life:
Figure: Pantheon result (March 27, 2019, China to AWS Korea), P5985
Challenge 3: Internet is highly variable and heavy-tailed
We will show, using our randomized experiment:
• Only 3% of the eligible streams had any stalls
• With 1.75 years of data for each ABR scheme, the width of the 95% confidence interval on a scheme's mean stall ratio is between ±10% and ±17% of the mean value
• Two identical schemes will see considerable variation in average performance
- ...until a substantial amount of data is assembled
Figure: Throughput distribution on FCC traces and in the real world.
High level takeaways
• Networked systems present unique challenges for machine learning
- Training algorithms in emulation: disappointing real-world results
- Evaluating algorithms in emulation: not predictive of real-world results
- Running in real life: requires a substantial amount of data to reduce statistical uncertainty
• Our solution: combining classical control with a learned network predictor, trained with supervised learning in situ on data from the real deployment environment
- It robustly outperforms existing schemes in practice
Puffer: a video streaming website running a randomized experiment
• Live TV streaming website (https://puffer.stanford.edu)
• Approved by Stanford lawyers, opened to public December 2018
• Randomizes sessions to different algorithms
• Goal: realistic testbed and learning environment for research in
- congestion control
- throughput prediction
- adaptive bitrate (ABR)
Algorithms that affect video streaming
• Congestion control: when to send each packet
• Throughput prediction: how fast can server send in near future?
• Adaptive bitrate (ABR): what version of each upcoming “chunk” to send
Demo
Media coverage
More press articles
Tutorial of building a Fire TV app for Puffer
Google ad for “tv streaming”
Reddit ad
Puffer experiment
• Starting from the beginning of 2019, we streamed 14.2 years of video to 55,897 users using 61,682 unique IP addresses
• About 7 months were spent on the "primary experiment": a randomized trial comparing our ABR algorithm with other schemes (MPC, RobustMPC, Pensieve, and BBA)
Experimental-flow diagram in CONSORT format
• 337,170 sessions underwent randomization: 1,595,356 streams, 56,262 unique IPs, 26.8 client-years of data
• 97,068 sessions were excluded (158,077 streams, 3.2 client-years of data):
◦ 53,631 streams were assigned CUBIC
◦ 103,446 streams were assigned experimental algorithms for portions of the study duration
• Sessions assigned to each scheme:
◦ Fugu: 47,958 sessions, 233,190 streams
◦ MPC-HM: 48,703 sessions, 238,651 streams
◦ RobustMPC-HM: 48,082 sessions, 236,120 streams
◦ Pensieve: 47,584 sessions, 229,851 streams
◦ BBA: 47,775 sessions, 231,694 streams
• Streams excluded per scheme (did not begin playing / watch time less than 4 s / stalled from a slow video decoder):
◦ Fugu: 139,981 excluded (55,301 / 84,640 / 40)
◦ MPC-HM: 144,832 excluded (56,845 / 87,958 / 29)
◦ RobustMPC-HM: 144,586 excluded (57,119 / 87,426 / 41)
◦ Pensieve: 138,899 excluded (59,450 / 79,435 / 14)
◦ BBA: 142,407 excluded (55,182 / 87,200 / 24, plus 1 that sent contradictory data)
• Streams truncated because of a loss of contact: Fugu 2,683; MPC-HM 2,655; RobustMPC-HM 2,391; Pensieve 2,599; BBA 2,520
• Streams considered: Fugu 93,209 (1.9 client-years); MPC-HM 93,819 (1.7); RobustMPC-HM 91,534 (1.7); Pensieve 90,952 (1.6); BBA 89,287 (1.7)
• Total: 458,801 streams were considered, 8.5 client-years of data
◦ 2.7 client-days spent in startup
◦ 5.1 client-days spent stalled
◦ 8.5 client-years spent playing
Puffer experiment
• Of the 458,801 streams in primary analysis, only 15,788 (3%) of streams had any stalls
- mirroring the ratio (7%) reported by Google
• With 1.75 years of data for each ABR scheme, the width of the 95% confidence interval on a scheme's mean stall ratio is between ±10% and ±17% of the mean value
- comparable to the magnitude of total benefit reported by prior work based on traces or real-world experiments lasting hours or days
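To make the statistical point concrete, here is a bootstrap sketch of how wide the confidence interval on a mean stall ratio can be when the data are heavy-tailed. The stream data below are synthetic and purely illustrative, not Puffer's:

```python
import random

random.seed(0)

# Synthetic, heavy-tailed stall ratios: most streams never stall,
# a few stall badly. Illustrative only -- NOT Puffer's data.
streams = [0.0] * 1940 + [random.expovariate(1 / 0.05) for _ in range(60)]

def mean(xs):
    return sum(xs) / len(xs)

# Bootstrap the 95% confidence interval on the mean stall ratio.
boot = sorted(mean(random.choices(streams, k=len(streams)))
              for _ in range(500))
lo, hi = boot[12], boot[487]          # ~2.5th and ~97.5th percentiles
m = mean(streams)
half_width = 100 * (hi - lo) / 2 / m  # CI half-width as a percent of the mean
```

Even with thousands of streams, the rare heavy stalls keep the relative half-width large, which is why a scheme comparison needs client-years of data.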
Fugu: an ABR algorithm trained in situ
• Objective of Fugu: select video chunks to maximize cumulative QoE over a finite horizon
Figure: lattice of upcoming chunks: 10 versions (1080p-22, 1080p-24, 720p-20, ..., 240p-26) for each of the next 5 chunks (5-step lookahead).
Fugu: an ABR algorithm trained in situ
• QoE: +video quality, −quality variation, −rebuffering
• max Σ QoE_i = Σ (SSIM_i − λ|SSIM_i − SSIM_{i−1}| − μ · Rebuffer_i)
Figure: lattice of upcoming chunks: 10 versions (1080p-22, 1080p-24, 720p-20, ..., 240p-26) for each of the next 5 chunks (5-step lookahead).
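The QoE objective translates directly into code. λ and μ are operator-chosen weights; the values below are placeholders, not Puffer's actual settings:

```python
def plan_qoe(ssims, rebuffers, prev_ssim, lam=1.0, mu=100.0):
    """Sum QoE over a plan: quality, minus quality variation, minus rebuffering.

    ssims:     SSIM (dB) of each chunk in the plan
    rebuffers: rebuffer time (s) each chunk would incur
    prev_ssim: SSIM of the chunk played just before the plan
    lam, mu:   penalty weights (placeholder values, not from the talk)
    """
    total, last = 0.0, prev_ssim
    for ssim, rebuf in zip(ssims, rebuffers):
        total += ssim - lam * abs(ssim - last) - mu * rebuf
        last = ssim
    return total

# A steady 16 dB plan with no stalls beats a jumpy one of the same average.
steady = plan_qoe([16, 16, 16, 16, 16], [0] * 5, prev_ssim=16)
jumpy = plan_qoe([18, 12, 18, 12, 18], [0] * 5, prev_ssim=16)
# steady > jumpy: the variation penalty rewards consistent quality
```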
Fugu: an ABR algorithm trained in situ
• Given a plan of the next 5 chunks to send, and assuming the transmission time of each chunk is known, we can calculate
Σ QoE_i = Σ (SSIM_i − λ|SSIM_i − SSIM_{i−1}| − μ · Rebuffer_i)
Figure: playback buffer level over time, annotated with transmission time, chunk length, rebuffer, and the 1 s/s drain rate.
Fugu: an ABR algorithm trained in situ
Remaining problems:
• How do we estimate the unknown transmission time of each given chunk?
- the only uncertainty in the control
• How do we compute the optimal plan to maximize QoE?
- more efficiently than exhaustive search (10^5 combinations)
• How do we follow the optimal plan?
- send 5 chunks and recompute the optimal plan?
Solving problem 1: Transmission Time Predictor (TTP)
• Neural network predicts "how long would each chunk take?"
• Input:
- sizes and transmission times of past 8 chunks
- low-level TCP statistics (min RTT, RTT, CWND, packets in flight, delivery rate)
- size of the chunk to be transmitted (vs. a throughput predictor)
• Output:
- probability distribution over the transmission time (vs. point estimate)
• Training: supervised learning in situ on real data from Puffer
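A schematic of the TTP's interface: a single randomly initialized softmax layer stands in for the real (deeper, trained) neural network, and the bin edges, feature values, and units are all made up for illustration:

```python
import math
import random

random.seed(0)

# Discretized transmission-time bins (seconds); edges are illustrative.
BINS = [0.25, 0.5, 1.0, 2.0, 4.0]

# Input: 8 past (size, transmission time) pairs, 5 TCP stats, candidate size.
N_IN = 8 * 2 + 5 + 1

# A single softmax layer standing in for the real neural network.
W = [[random.gauss(0, 0.1) for _ in range(N_IN)] for _ in range(len(BINS))]

def ttp_predict(past_chunks, tcp_stats, chunk_size):
    """Return Pr[transmission time falls in each bin] for a candidate chunk."""
    x = [v for pair in past_chunks for v in pair] + tcp_stats + [chunk_size]
    logits = [sum(w * v for w, v in zip(row, x)) for row in W]
    peak = max(logits)
    exps = [math.exp(l - peak) for l in logits]  # numerically stable softmax
    z = sum(exps)
    return [e / z for e in exps]

past = [(2.0, 0.4)] * 8            # MB, seconds (made-up history)
tcp = [0.02, 0.03, 10, 5, 6.0]     # min RTT, RTT, CWND, in flight, delivery rate
dist = ttp_predict(past, tcp, chunk_size=2.5)
```

The key interface point is that the output is a full distribution over bins rather than a single throughput estimate, which is what the controller's expectation needs.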
Solving problem 2: Value Iteration
• A well-known technique for solving a Markov decision process (MDP)
• Let v*_i(B_i, K_{i−1}) denote the maximum expected sum of QoE achievable in the lookahead horizon, given buffer level B_i and previously sent chunk K_{i−1}
• The optimal plan can be computed with dynamic programming:

v*_i(B_i, K_{i−1}) = max over K_i^s of { Σ_{T_i} Pr[T̂(K_i^s) = T_i] · (QoE(K_i^s, K_{i−1}) + v*_{i+1}(B_{i+1}, K_i^s)) }
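The recursion can be sketched as a toy backward induction over the 5-step horizon. The buffer is coarsely discretized for memoization; the versions, SSIM values, weights, and transmission-time distributions are all made-up stand-ins for the TTP's output:

```python
CHUNK_LEN = 2.0                     # seconds of video per chunk
HORIZON = 5                         # 5-step lookahead
LAM, MU = 1.0, 100.0                # placeholder penalty weights
VERSIONS = {"240p": 10.0, "720p": 14.0, "1080p": 16.0}  # SSIM in dB (made up)
# Made-up Pr[transmission time] per version, standing in for the TTP output.
TX_DIST = {"240p": {0.2: 0.9, 1.0: 0.1},
           "720p": {0.5: 0.7, 2.0: 0.3},
           "1080p": {1.0: 0.5, 3.0: 0.5}}

def step_qoe(ssim, prev_ssim, rebuf):
    return ssim - LAM * abs(ssim - prev_ssim) - MU * rebuf

def v(i, buf, prev_ssim, memo={}):
    """Maximum expected QoE-to-go from step i with buffer level buf (s)."""
    if i == HORIZON:
        return 0.0
    key = (i, round(buf, 1), prev_ssim)   # coarse buffer discretization
    if key in memo:
        return memo[key]
    best = float("-inf")
    for name, ssim in VERSIONS.items():   # max over the next chunk's version
        expected = 0.0
        for t, p in TX_DIST[name].items():
            rebuf = max(0.0, t - buf)               # stall if the buffer drains
            nbuf = max(0.0, buf - t) + CHUNK_LEN    # buffer after chunk arrives
            expected += p * (step_qoe(ssim, prev_ssim, rebuf)
                             + v(i + 1, nbuf, ssim))
        best = max(best, expected)
    memo[key] = best
    return best

plan_value = v(0, buf=5.0, prev_ssim=14.0)
```

Memoizing on the discretized buffer level is what keeps this polynomial instead of the 10^5-way exhaustive search over plans.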
Solving problem 3: Model Predictive Control (MPC)
• Send only one chunk following the optimal plan
• Replan before sending the next chunk to mitigate accumulation of errors
Figure: lattice of upcoming chunks (10 versions per chunk, 5-step lookahead); only the first chunk of the plan is sent before replanning.
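The replanning loop reduces to a small control skeleton. All three hooks below are hypothetical placeholders for the server's real planning, sending, and measurement code:

```python
def mpc_abr(plan_fn, send_fn, observe_fn, n_chunks):
    """Model predictive control: compute a 5-chunk plan, commit to one chunk.

    plan_fn(state)   -> list of versions for the next 5 chunks (hypothetical)
    send_fn(version) -> transmit one chunk (hypothetical)
    observe_fn()     -> fresh state: buffer level, history, TCP stats (hypothetical)
    """
    state = observe_fn()
    for _ in range(n_chunks):
        plan = plan_fn(state)   # optimal plan for the current state...
        send_fn(plan[0])        # ...but send only its first chunk
        state = observe_fn()    # then replan with updated information

# Minimal usage with dummy hooks:
sent = []
mpc_abr(plan_fn=lambda s: ["720p"] * 5,
        send_fn=sent.append,
        observe_fn=lambda: {"buffer": 5.0},
        n_chunks=3)
```

Committing to only the first chunk of each plan is what keeps prediction errors from compounding over the horizon.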
Fugu: an ABR algorithm trained in situ
Figure: Fugu's system design. The Puffer video server contains an MPC controller that performs model-based control: it queries the Transmission Time Predictor for bitrate selection and feeds back state updates. Data aggregation drives daily training, which updates the model in situ.
SSIM vs. Stalls (458,801 streams, 8.5 years of data)
Figure: average SSIM (dB) vs. time spent stalled (%) for Fugu, MPC-HM, RobustMPC-HM, Pensieve, and BBA; better QoE means higher SSIM and less time stalled.
SSIM vs. Stalls (< 6 Mbps only, 100,500 streams, 1.3 years of data)
Figure: average SSIM (dB) vs. time spent stalled (%) for Fugu, MPC-HM, RobustMPC-HM, Pensieve, and BBA on slow (< 6 Mbps) connections.
First-chunk SSIM vs. Startup delay (cold start)
Figure: average first-chunk SSIM (dB) vs. startup delay (s) for Fugu, MPC-HM, RobustMPC-HM, Pensieve, and BBA. TTP's use of low-level TCP statistics boosts initial quality.
Mismatch between emulation and real world
Figure: performance in emulation on FCC traces (average SSIM in dB vs. time spent stalled in %) for Fugu, MPC-HM, RobustMPC-HM, Pensieve, and BBA.
Figure: Puffer results during Jan.–Apr. 2019, with the same schemes plus an emulation-trained Fugu.
Users randomly assigned to Fugu watched 10%–20% longer
Figure: CCDF of total time on video player (minutes). Means [95% CI]: Fugu 32.6 ± 1.1 min, MPC-HM 27.9 ± 0.9, RobustMPC-HM 27.4 ± 0.9, Pensieve 28.5 ± 0.9, BBA 29.6 ± 1.0.
Ablation study of Fugu’s TTP
Takeaways
• Networked systems present unique challenges for machine learning
- Training algorithms in emulation: disappointing real-world results
- Evaluating algorithms in emulation: not predictive of real-world results
- Running in real life: requires a substantial amount of data to reduce statistical uncertainty
• Our solution: combining classical control with a learned network predictor, trained with supervised learning in situ on data from the real deployment environment
- It robustly outperforms existing schemes in practice
• We are opening Puffer to the research community for others to develop and deploy congestion control and ABR algorithms on real traffic.
Francis Y. Yan, [email protected]