Learning in situ: a randomized experiment in video streaming
https://puffer.stanford.edu
Francis Y. Yan, Hudson Ayers, Chenzhi Zhu†,Sadjad Fouladi, James Hong, Keyi Zhang,
Philip Levis, Keith Winstein
Stanford University, †Tsinghua University
October 22, 2019
Francis Y. Yan (Stanford) October 22, 2019 1 / 32
Outline
• Networked systems present unique challenges for machine learning
• Puffer: a live TV streaming website we built to conduct a randomized experiment
• Fugu: an adaptive bitrate (ABR) algorithm that robustly outperforms other schemes by learning in situ (on data from the real deployment environment, Puffer)
Unique challenges for ML in networking
• We don’t know how to emulate the Internet very accurately
• Mismatch between training environment (simulator, emulator, or testbed) and testing environment (the Internet)
• Internet has too much variability and heavy tails
Challenge 1: We don’t know how to emulate the Internet
Figure: Best effort to emulate a wireless path between Nepal and AWS India. Mean error: 19.1%.
Challenge 2: Mismatch between training and testing environments
Indigo in emulation:
Figure: Power of schemes over emulated networks with varying link rates and 50 ms minimum one-way delay. The schemes are split into two graphs for clarity.
Challenge 2: Mismatch between training and testing environments
Indigo in real life:
Figure: Pantheon result (March 27, 2019, China to AWS Korea), P5985
Challenge 3: Internet is highly variable and heavy-tailed
We will show, using our randomized experiment:
• Only 3% of the eligible streams had any stalls
• With 1.75 years of data for each ABR scheme, the width of the 95% confidence interval on a scheme's mean stall ratio is between ±10% and ±17% of the mean value
• Two identical schemes will see considerable variation in average performance
- ...until a substantial amount of data is assembled
Figure: Throughput distribution on FCC traces and in the real world.
High level takeaways
• Networked systems present unique challenges for machine learning
- Training algorithms in emulation: disappointing real-world results
- Evaluating algorithms in emulation: not predictive of real-world results
- Running in real life: requires a substantial amount of data to reduce statistical uncertainty
• Our solution: combining classical control with a learned network predictor, trained with supervised learning in situ on data from the real deployment environment
- It robustly outperforms existing schemes in practice
Puffer: a video streaming website running a randomized experiment
• Live TV streaming website (https://puffer.stanford.edu)
• Approved by Stanford lawyers, opened to public December 2018
• Randomizes sessions to different algorithms
• Goal: realistic testbed and learning environment for research in
- congestion control
- throughput prediction
- adaptive bitrate (ABR)
Algorithms that affect video streaming
• Congestion control: when to send each packet
• Throughput prediction: how fast can server send in near future?
• Adaptive bitrate (ABR): what version of each upcoming “chunk” to send
Demo
Media coverage
More press articles
Tutorial of building a Fire TV app for Puffer
Google ad for “tv streaming”
Reddit ad
Puffer experiment
• Starting from the beginning of 2019, we streamed 14.2 years of video to 55,897 users using 61,682 unique IP addresses
• About 7 months were spent on the "primary experiment": a randomized trial comparing our ABR algorithm with other schemes (MPC, RobustMPC, Pensieve, and BBA)
Experimental-flow diagram in CONSORT format
• 337,170 sessions underwent randomization: 1,595,356 streams, 56,262 unique IPs, 26.8 client-years of data
• 97,068 sessions were excluded (158,077 streams, 3.2 client-years of data):
◦ 53,631 streams were assigned CUBIC
◦ 103,446 streams were assigned experimental algorithms for portions of the study duration
• Sessions assigned to each scheme:
◦ Fugu: 47,958 sessions, 233,190 streams
◦ MPC-HM: 48,703 sessions, 238,651 streams
◦ RobustMPC-HM: 48,082 sessions, 236,120 streams
◦ Pensieve: 47,584 sessions, 229,851 streams
◦ BBA: 47,775 sessions, 231,694 streams
• Streams excluded per scheme (did not begin playing / watch time less than 4 s / stalled from a slow video decoder):
◦ Fugu: 139,981 excluded (55,301 / 84,640 / 40)
◦ MPC-HM: 144,832 excluded (56,845 / 87,958 / 29)
◦ RobustMPC-HM: 144,586 excluded (57,119 / 87,426 / 41)
◦ Pensieve: 138,899 excluded (59,450 / 79,435 / 14)
◦ BBA: 142,407 excluded (55,182 / 87,200 / 24, plus 1 that sent contradictory data)
• Streams truncated because of a loss of contact: Fugu 2,683; MPC-HM 2,655; RobustMPC-HM 2,391; Pensieve 2,599; BBA 2,520
• Streams considered: Fugu 93,209 (1.9 client-years); MPC-HM 93,819 (1.7); RobustMPC-HM 91,534 (1.7); Pensieve 90,952 (1.6); BBA 89,287 (1.7)
• Total: 458,801 streams were considered, 8.5 client-years of data
◦ 2.7 client-days spent in startup
◦ 5.1 client-days spent stalled
◦ 8.5 client-years spent playing
Puffer experiment
• Of the 458,801 streams in primary analysis, only 15,788 (3%) of streams had any stalls
- mirroring the ratio (7%) reported by Google
• With 1.75 years of data for each ABR scheme, the width of the 95% confidence interval on a scheme's mean stall ratio is between ±10% and ±17% of the mean value
- comparable to the magnitude of total benefit reported by prior work based on traces or real-world experiments lasting hours or days
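To make the statistical point concrete, here is a bootstrap sketch of how wide the confidence interval on a mean stall ratio can be when the data are heavy-tailed. The stream data below are synthetic and purely illustrative, not Puffer's:

```python
import random

random.seed(0)

# Synthetic, heavy-tailed stall ratios: most streams never stall,
# a few stall badly. Illustrative only -- NOT Puffer's data.
streams = [0.0] * 1940 + [random.expovariate(1 / 0.05) for _ in range(60)]

def mean(xs):
    return sum(xs) / len(xs)

# Bootstrap the 95% confidence interval on the mean stall ratio.
boot = sorted(mean(random.choices(streams, k=len(streams)))
              for _ in range(500))
lo, hi = boot[12], boot[487]          # ~2.5th and ~97.5th percentiles
m = mean(streams)
half_width = 100 * (hi - lo) / 2 / m  # CI half-width as a percent of the mean
```

Even with thousands of streams, the rare heavy stalls keep the relative half-width large, which is why a scheme comparison needs client-years of data.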
Fugu: an ABR algorithm trained in situ
• Objective of Fugu: select video chunks to maximize cumulative QoE over a finite horizon
Figure: lattice of upcoming chunks: 10 versions (1080p-22, 1080p-24, 720p-20, ..., 240p-26) for each of the next 5 chunks (5-step lookahead).
Fugu: an ABR algorithm trained in situ
• QoE: +video quality, −quality variation, −rebuffering
• max Σ QoE_i = Σ (SSIM_i − λ|SSIM_i − SSIM_{i−1}| − μ · Rebuffer_i)
Figure: lattice of upcoming chunks: 10 versions (1080p-22, 1080p-24, 720p-20, ..., 240p-26) for each of the next 5 chunks (5-step lookahead).
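The QoE objective translates directly into code. λ and μ are operator-chosen weights; the values below are placeholders, not Puffer's actual settings:

```python
def plan_qoe(ssims, rebuffers, prev_ssim, lam=1.0, mu=100.0):
    """Sum QoE over a plan: quality, minus quality variation, minus rebuffering.

    ssims:     SSIM (dB) of each chunk in the plan
    rebuffers: rebuffer time (s) each chunk would incur
    prev_ssim: SSIM of the chunk played just before the plan
    lam, mu:   penalty weights (placeholder values, not from the talk)
    """
    total, last = 0.0, prev_ssim
    for ssim, rebuf in zip(ssims, rebuffers):
        total += ssim - lam * abs(ssim - last) - mu * rebuf
        last = ssim
    return total

# A steady 16 dB plan with no stalls beats a jumpy one of the same average.
steady = plan_qoe([16, 16, 16, 16, 16], [0] * 5, prev_ssim=16)
jumpy = plan_qoe([18, 12, 18, 12, 18], [0] * 5, prev_ssim=16)
# steady > jumpy: the variation penalty rewards consistent quality
```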
Fugu: an ABR algorithm trained in situ
• Given a plan of the next 5 chunks to send, and assuming the transmission time of each chunk is known, we can calculate
Σ QoE_i = Σ (SSIM_i − λ|SSIM_i − SSIM_{i−1}| − μ · Rebuffer_i)
Figure: playback buffer level over time, annotated with transmission time, chunk length, rebuffer, and the 1 s/s drain rate.
Fugu: an ABR algorithm trained in situ
Remaining problems:
• How do we estimate the unknown transmission time of each given chunk?
- the only uncertainty in the control
• How do we compute the optimal plan to maximize QoE?
- more efficiently than exhaustive search (10^5 combinations)
• How do we follow the optimal plan?
- send 5 chunks and recompute the optimal plan?
Solving problem 1: Transmission Time Predictor (TTP)
• Neural network predicts "how long would each chunk take?"
• Input:
- sizes and transmission times of past 8 chunks
- low-level TCP statistics (min RTT, RTT, CWND, packets in flight, delivery rate)
- size of the chunk to be transmitted (vs. a throughput predictor)
• Output:
- probability distribution over the transmission time (vs. point estimate)
• Training: supervised learning in situ on real data from Puffer
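A schematic of the TTP's interface: a single randomly initialized softmax layer stands in for the real (deeper, trained) neural network, and the bin edges, feature values, and units are all made up for illustration:

```python
import math
import random

random.seed(0)

# Discretized transmission-time bins (seconds); edges are illustrative.
BINS = [0.25, 0.5, 1.0, 2.0, 4.0]

# Input: 8 past (size, transmission time) pairs, 5 TCP stats, candidate size.
N_IN = 8 * 2 + 5 + 1

# A single softmax layer standing in for the real neural network.
W = [[random.gauss(0, 0.1) for _ in range(N_IN)] for _ in range(len(BINS))]

def ttp_predict(past_chunks, tcp_stats, chunk_size):
    """Return Pr[transmission time falls in each bin] for a candidate chunk."""
    x = [v for pair in past_chunks for v in pair] + tcp_stats + [chunk_size]
    logits = [sum(w * v for w, v in zip(row, x)) for row in W]
    peak = max(logits)
    exps = [math.exp(l - peak) for l in logits]  # numerically stable softmax
    z = sum(exps)
    return [e / z for e in exps]

past = [(2.0, 0.4)] * 8            # MB, seconds (made-up history)
tcp = [0.02, 0.03, 10, 5, 6.0]     # min RTT, RTT, CWND, in flight, delivery rate
dist = ttp_predict(past, tcp, chunk_size=2.5)
```

The key interface point is that the output is a full distribution over bins rather than a single throughput estimate, which is what the controller's expectation needs.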
Solving problem 2: Value Iteration
• A well-known technique for solving a Markov decision process (MDP)
• Let v*_i(B_i, K_{i−1}) denote the maximum expected sum of QoE achievable in the lookahead horizon, given buffer level B_i and previously sent chunk K_{i−1}
• The optimal plan can be computed with dynamic programming:

v*_i(B_i, K_{i−1}) = max over K_i^s of { Σ_{T_i} Pr[T̂(K_i^s) = T_i] · (QoE(K_i^s, K_{i−1}) + v*_{i+1}(B_{i+1}, K_i^s)) }
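The recursion can be sketched as a toy backward induction over the 5-step horizon. The buffer is coarsely discretized for memoization; the versions, SSIM values, weights, and transmission-time distributions are all made-up stand-ins for the TTP's output:

```python
CHUNK_LEN = 2.0                     # seconds of video per chunk
HORIZON = 5                         # 5-step lookahead
LAM, MU = 1.0, 100.0                # placeholder penalty weights
VERSIONS = {"240p": 10.0, "720p": 14.0, "1080p": 16.0}  # SSIM in dB (made up)
# Made-up Pr[transmission time] per version, standing in for the TTP output.
TX_DIST = {"240p": {0.2: 0.9, 1.0: 0.1},
           "720p": {0.5: 0.7, 2.0: 0.3},
           "1080p": {1.0: 0.5, 3.0: 0.5}}

def step_qoe(ssim, prev_ssim, rebuf):
    return ssim - LAM * abs(ssim - prev_ssim) - MU * rebuf

def v(i, buf, prev_ssim, memo={}):
    """Maximum expected QoE-to-go from step i with buffer level buf (s)."""
    if i == HORIZON:
        return 0.0
    key = (i, round(buf, 1), prev_ssim)   # coarse buffer discretization
    if key in memo:
        return memo[key]
    best = float("-inf")
    for name, ssim in VERSIONS.items():   # max over the next chunk's version
        expected = 0.0
        for t, p in TX_DIST[name].items():
            rebuf = max(0.0, t - buf)               # stall if the buffer drains
            nbuf = max(0.0, buf - t) + CHUNK_LEN    # buffer after chunk arrives
            expected += p * (step_qoe(ssim, prev_ssim, rebuf)
                             + v(i + 1, nbuf, ssim))
        best = max(best, expected)
    memo[key] = best
    return best

plan_value = v(0, buf=5.0, prev_ssim=14.0)
```

Memoizing on the discretized buffer level is what keeps this polynomial instead of the 10^5-way exhaustive search over plans.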
Solving problem 3: Model Predictive Control (MPC)
• Send only one chunk following the optimal plan
• Replan before sending the next chunk to mitigate accumulation of errors
Figure: lattice of upcoming chunks (10 versions per chunk, 5-step lookahead); only the first chunk of the plan is sent before replanning.
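The replanning loop reduces to a small control skeleton. All three hooks below are hypothetical placeholders for the server's real planning, sending, and measurement code:

```python
def mpc_abr(plan_fn, send_fn, observe_fn, n_chunks):
    """Model predictive control: compute a 5-chunk plan, commit to one chunk.

    plan_fn(state)   -> list of versions for the next 5 chunks (hypothetical)
    send_fn(version) -> transmit one chunk (hypothetical)
    observe_fn()     -> fresh state: buffer level, history, TCP stats (hypothetical)
    """
    state = observe_fn()
    for _ in range(n_chunks):
        plan = plan_fn(state)   # optimal plan for the current state...
        send_fn(plan[0])        # ...but send only its first chunk
        state = observe_fn()    # then replan with updated information

# Minimal usage with dummy hooks:
sent = []
mpc_abr(plan_fn=lambda s: ["720p"] * 5,
        send_fn=sent.append,
        observe_fn=lambda: {"buffer": 5.0},
        n_chunks=3)
```

Committing to only the first chunk of each plan is what keeps prediction errors from compounding over the horizon.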
Fugu: an ABR algorithm trained in situ
Figure: Fugu's system design. The Puffer video server contains an MPC controller that performs model-based control: it queries the Transmission Time Predictor for bitrate selection and feeds back state updates. Data aggregation drives daily training, which updates the model in situ.
SSIM vs. Stalls (458,801 streams, 8.5 years of data)
Figure: average SSIM (dB) vs. time spent stalled (%) for Fugu, MPC-HM, RobustMPC-HM, Pensieve, and BBA; better QoE means higher SSIM and less time stalled.
SSIM vs. Stalls (< 6 Mbps only, 100,500 streams, 1.3 years of data)
Figure: average SSIM (dB) vs. time spent stalled (%) for Fugu, MPC-HM, RobustMPC-HM, Pensieve, and BBA on slow (< 6 Mbps) connections.
First-chunk SSIM vs. Startup delay (cold start)
Figure: average first-chunk SSIM (dB) vs. startup delay (s) for Fugu, MPC-HM, RobustMPC-HM, Pensieve, and BBA. TTP's use of low-level TCP statistics boosts initial quality.
Mismatch between emulation and real world
Figure: performance in emulation on FCC traces (average SSIM in dB vs. time spent stalled in %) for Fugu, MPC-HM, RobustMPC-HM, Pensieve, and BBA.
Figure: Puffer results during Jan.–Apr. 2019, with the same schemes plus an emulation-trained Fugu.
Users randomly assigned to Fugu watched 10%–20% longer
Figure: CCDF of total time on video player (minutes). Means [95% CI]: Fugu 32.6 ± 1.1 min, MPC-HM 27.9 ± 0.9, RobustMPC-HM 27.4 ± 0.9, Pensieve 28.5 ± 0.9, BBA 29.6 ± 1.0.
Ablation study of Fugu’s TTP
Takeaways
• Networked systems present unique challenges for machine learning
- Training algorithms in emulation: disappointing real-world results
- Evaluating algorithms in emulation: not predictive of real-world results
- Running in real life: requires a substantial amount of data to reduce statistical uncertainty
• Our solution: combining classical control with a learned network predictor, trained with supervised learning in situ on data from the real deployment environment
- It robustly outperforms existing schemes in practice
• We are opening Puffer to the research community for others to develop and deploy congestion control and ABR algorithms on real traffic.
Francis Y. Yan, [email protected]