TRANSCRIPT
1
Joshua Reich, Oren Laadan, Eli Brosh, Alex Sherman, Vishal Misra,
Jason Nieh, and Dan Rubenstein
VMTorrent: Scalable P2P Virtual Machine Streaming
2
VM Basics
• VM: software implementation of computer
• Implementation stored in VM image
• VM runs on VMM
  – Virtualizes HW
  – Accesses image
[Figure: VM running on a VMM, which accesses the VM image]
3
Where is Image Stored?
[Figure: VM running on a VMM, accessing the VM image]
4
Traditionally: Local Storage
[Figure: VM image on local storage, accessed by the VMM]
5
IaaS Cloud: on Network Storage
[Figure: VM image on network storage, accessed by the VMM]
6
Can Be Primary
[Figure: VMM accesses the VM image directly on network storage via NFS/iSCSI]
e.g., OpenStack Glance, Amazon EC2/S3, vSphere network storage
7
Or Secondary
[Figure: VM image downloaded from network storage to local storage, then accessed by the VMM]
e.g., Amazon EC2/EBS, vSphere local storage
8
Either Way, No Problem Here
[Figure: a single VM image served from network storage]
9
Here?
[Figure: many VMs pulling images from network storage]
Bottleneck!
10
Lots of Unique VM Images
[Figure: network storage holding many distinct VM images]
54,784 unique images on EC2 alone*
*http://thecloudmarket.com/stats#/totals, 06 Dec 2012
11
Unpredictable Demand
• Lots of customers
• Spot-pricing
• Cloud-bursting
12
Don’t Just Take My Word
• “The challenge for IT teams will be finding way to deal with the bandwidth strain during peak demand - for instance when hundreds or thousands of users log on to a virtual desktop at the start of the day - while staying within an acceptable budget” 1
• “scale limits are due to simultaneous loading rather than total number of nodes” 2
• Developer proposals to replace or supplement the VM launch architecture for greater scalability 3
1. http://www.zdnet.com/why-so-many-businesses-arent-ready-for-virtual-desktops-7000008229/?s_cid=e539
2. http://www.openstack.org/blog/2011/12/openstack-deployments-abound-at-austin-meetup-129
3. https://blueprints.launchpad.net/nova/+spec/xenserver-bittorrent-images
13
Challenge: VM Launch in IaaS
• Minimize delay in VM execution
  – Starting from the time the launch request arrives
• For lots of instances (scale!)
14
Naive Scaling Approaches
• Multicast
  – Setup, configuration, maintenance, etc. 1
  – ACK implosion
  – “multicast traffic saturated the CPU on [Etsy] core switches causing all of Etsy to be unreachable“ 2
1. [El-Sayed et al., 2003; Hosseini et al., 2007]
2. http://codeascraft.etsy.com/2012/01/23/solr-bittorrent-index-replication
15
Naive Scaling Approaches
• P2P bulk data download (e.g., BitTorrent)
  – Files are big (wastes bandwidth)
  – Must wait until the whole file is available (wastes time)
  – Network storage as primary? Must store a multi-GB image in RAM!
16
Both Miss Big Opportunity
VM image access is
• Sparse: most of the image doesn’t need to be transferred
• Gradual: can start w/ just a couple of blocks
17
VMTorrent Contributions
• Architecture
  – Make (scalable) streaming possible: decouple data delivery from presentation
  – Make scalable streaming effective: profile-based image streaming techniques
• Understanding / Validation
  – Modeling for VM image streaming
  – Prototype & evaluation (not highly optimized)
18
Talk
• Make (scalable) streaming possible: Decouple data delivery from presentation
• Make scalable streaming effective: Profile-based image streaming techniques
• VMTorrent Prototype & Evaluation
(Modeling along the way)
19
Decoupling Data Delivery from Presentation
(Making Streaming Possible)
20
Generic Virtualization Architecture
[Figure: VM on VMM on host hardware; the VMM reaches the VM image through the FS]
• Virtual Machine Monitor virtualizes hardware
• Conducts I/O to image through file system
21
Cloud Virtualization Architecture
[Figure: VM / VMM / hardware stack with a network backend between the FS and network storage]
Network backend used
• Either to download the image
• Or to access it via a remote FS
22
VMTorrent Virtualization Architecture
[Figure: a custom FS sits between the VMM and the network backend]
• Introduce custom file system
• Divide image into pieces
• But provide appearance of complete image to VMM
23
Decoupling Delivery from Presentation
[Figure: custom FS holding pieces 0-8 of the image]
VMM attempts to read piece 1. Piece 1 is present, so the read completes.
24
Decoupling Delivery from Presentation
[Figure: piece 0 missing from the custom FS]
VMM attempts to read piece 0. Piece 0 isn’t local, so the read stalls; the VMM waits for the I/O to complete and the VM stalls.
25
Decoupling Delivery from Presentation
[Figure: request path from the custom FS to the network backend]
The FS requests the piece from the backend; the backend requests it from the network.
26
Decoupling Delivery from Presentation
[Figure: piece 0 delivered and filled into the custom FS]
Later, the network delivers piece 0. The custom FS receives it and updates the piece; the read completes and the VMM resumes the VM’s execution (sketched below).
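A minimal sketch of this stall-and-resume read path in C++ (the prototype’s FS is actually C on FUSE; the PieceStore name, piece bitmap, and locking scheme are illustrative assumptions, not the paper’s code):

```cpp
// Sketch: presentation (VMM reads) decoupled from delivery (pieces
// arriving from the network backend). All names are illustrative.
#include <condition_variable>
#include <cstring>
#include <mutex>
#include <vector>

class PieceStore {
public:
    // Assumes image_size is a multiple of piece_size (last piece padded).
    PieceStore(size_t image_size, size_t piece_size)
        : piece_size_(piece_size),
          data_(image_size),
          present_((image_size + piece_size - 1) / piece_size, false) {}

    // VMM read path: blocks only if a needed piece is not yet local.
    void read(size_t offset, size_t len, char* out) {
        size_t first = offset / piece_size_;
        size_t last = (offset + len - 1) / piece_size_;
        std::unique_lock<std::mutex> lock(m_);
        for (size_t p = first; p <= last; ++p) {
            if (!present_[p])
                request_from_backend(p);  // demand request, high priority
            cv_.wait(lock, [&] { return bool(present_[p]); });  // VM stalls here
        }
        std::memcpy(out, data_.data() + offset, len);  // read completes
    }

    // Called by the network backend when a piece arrives.
    void deliver(size_t piece, const char* bytes) {
        std::lock_guard<std::mutex> lock(m_);
        std::memcpy(data_.data() + piece * piece_size_, bytes, piece_size_);
        present_[piece] = true;
        cv_.notify_all();  // wake stalled reads; the VMM resumes the VM
    }

private:
    void request_from_backend(size_t piece);  // hands off to backend thread

    size_t piece_size_;
    std::vector<char> data_;
    std::vector<bool> present_;  // which pieces are local
    std::mutex m_;
    std::condition_variable cv_;
};
```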
27
Decoupling Improves Performance
[Figure: VMTorrent stack serving as primary storage]
Primary storage: no waiting for the image download to complete.
28
Decoupling Improves Performance
[Figure: writes and re-reads (marked X) no longer cross the network]
Secondary storage: no more writes or re-reads over the network, as with a remote FS.
29
But Doesn’t Scale
Assuming a single server, the time to download a single piece is

t = W + S / (r_net / n)

• W: wait time for first bit
• S: piece size
• r_net: network speed
• n: # of clients

Transfer time: each client gets r_net / n of the server’s bandwidth.
30
Read Time Grows Linearly w/ n
Assuming a single server, the time to download a single piece is

t = W + n * S / r_net

• W: wait time for first bit
• S: piece size
• r_net: network speed
• n: # of clients

Transfer time is linear in n.
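A quick worked example (the piece size and client count are illustrative assumptions; 100 Mbps matches the Emulab testbed used later): the transfer term alone grows from roughly 20 ms for one client to roughly 2 s for a hundred.

```latex
% Illustrative numbers, not from the paper: S = 256 KB (about 2.1 Mbit), r_net = 100 Mbps.
t - W = \frac{n \cdot S}{r_\mathrm{net}}
      \approx \frac{1 \times 2.1\,\mathrm{Mbit}}{100\,\mathrm{Mbps}} \approx 21\,\mathrm{ms} \quad (n = 1)
\qquad\text{vs.}\qquad
\frac{100 \times 2.1\,\mathrm{Mbit}}{100\,\mathrm{Mbps}} \approx 2.1\,\mathrm{s} \quad (n = 100).
```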
31
This Scenario
[Figure: the client-server demand-download scenario analyzed above, labeled “csd”]
32
Decoupling Enables P2P Backend
Alleviates the network storage bottleneck.
[Figure: the network backend is replaced by a P2P manager that holds its own copy of pieces 0-8]
• Exchange pieces w/ swarm
• P2P copy must remain pristine
33
Space Efficient
[Figure: the custom FS points into the P2P manager’s image; written pieces 6 and 7 are private copies]
• FS uses pointers to the P2P image
• FS does copy-on-write (sketched below)
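A sketch of that copy-on-write indirection (illustrative C++, not the prototype’s code): reads follow a per-piece pointer into the pristine P2P copy until the first write, which allocates a private copy so the swarm’s data is never modified.

```cpp
// Copy-on-write view over the pristine P2P image. Illustrative only.
#include <cstring>
#include <memory>
#include <vector>

class CowImage {
public:
    CowImage(const char* p2p_image, size_t n_pieces, size_t piece_size)
        : p2p_(p2p_image), piece_size_(piece_size), priv_(n_pieces) {}

    // Read one whole piece: the private copy if it exists, else the P2P copy.
    const char* read_piece(size_t p) const {
        return priv_[p] ? priv_[p].get() : p2p_ + p * piece_size_;
    }

    // First write to a piece copies it out of the P2P image (copy-on-write),
    // so the pristine copy stays valid for serving other peers.
    void write(size_t p, size_t off, const char* buf, size_t len) {
        if (!priv_[p]) {
            priv_[p] = std::make_unique<char[]>(piece_size_);
            std::memcpy(priv_[p].get(), p2p_ + p * piece_size_, piece_size_);
        }
        std::memcpy(priv_[p].get() + off, buf, len);
    }

private:
    const char* p2p_;   // pristine image owned by the P2P manager
    size_t piece_size_;
    std::vector<std::unique_ptr<char[]>> priv_;  // copy-on-write pieces
};
```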
34
Minimizing Stall Time
[Figure: a demand request for piece 4 (“4?”) propagates from the custom FS through the P2P manager to the swarm, and the piece (“4!”) flows back]
Non-local piece accesses trigger high-priority requests (sketched below).
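How a demand miss might be escalated, sketched against a hypothetical P2P-manager interface (the prototype uses libtorrent, whose real API differs; the enum and interface here are made up):

```cpp
// Hypothetical priority levels for piece requests. High is a demand miss,
// Medium is profile-driven prefetch (slide 45); Low is an assumed default.
enum class Priority { Low = 0, Medium = 1, High = 2 };

struct P2PManager {
    // Ask the swarm for a piece at the given priority; High preempts
    // Medium, which preempts Low.
    virtual void request(size_t piece, Priority prio) = 0;
    virtual ~P2PManager() = default;
};

// Called from the FS read path when a piece is not local: a demand
// access always outranks speculative prefetching.
void on_demand_miss(P2PManager& p2p, size_t piece) {
    p2p.request(piece, Priority::High);
}
```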
35
P2P Helps
Now, the time to download a single piece is

t = W(d) + S / r_net

• W(d): wait time for first bit, as a function of piece diversity d
• S: piece size
• r_net: network speed
• n: # of peers

The wait is a function of diversity; the transfer time is independent of n.
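With the same illustrative numbers as the earlier example (S = 256 KB, r_net = 100 Mbps), the transfer term is now fixed at about 21 ms regardless of n; all of the scaling pressure moves into the wait term W(d).

```latex
% Same assumed numbers as the earlier client-server example.
\frac{S}{r_\mathrm{net}} \approx \frac{2.1\,\mathrm{Mbit}}{100\,\mathrm{Mbps}} \approx 21\,\mathrm{ms}
\quad \text{for any number of peers } n.
```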
36
High Diversity → Swarm Efficiency
37
Low Diversity → Little Benefit
Nothing to share
38
P2P Helps, But Not Enough
All peers request the same pieces at the same time.

t = W(d) + S / r_net

Low piece diversity → long wait (gets worse as n grows) → long download times
39
This Scenario
[Figure: the profile-less, demand-driven P2P scenario analyzed above, labeled “p2pd”]
40
Profile-based Image Streaming Techniques
(Making Streaming Effective)
41
How to Increase Diversity?
Need to fetch pieces that are
• Rare: not yet demanded by many peers
• Useful: likely to be used by some peer
Profiling
• Need useful pieces
• But only a small % of the VM image is accessed
• Need to know which pieces are accessed
• Also when (needed later for piece selection)
42
43
Build Profile
• One profile for each VM/workload
• Run one or more times (even online)
• Use FS to track
  – Which pieces are accessed
  – When pieces are accessed
• Entries w/ average appearance time, piece index, and frequency (see the sketch below)
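One plausible shape for a profile entry and its construction, averaging first-access times across training runs (illustrative C++; the field names and in-memory format are assumptions, not the paper’s):

```cpp
// A profile is a list of (piece, average appearance time, frequency)
// entries built from one or more recorded runs. Names are illustrative.
#include <algorithm>
#include <map>
#include <vector>

struct ProfileEntry {
    size_t piece;       // piece index
    double avg_time_s;  // average time of first access across runs
    size_t freq;        // # of runs in which the piece was accessed
};

std::vector<ProfileEntry> build_profile(
    // Each run maps piece index -> time (seconds) of its first access.
    const std::vector<std::map<size_t, double>>& runs) {
    std::map<size_t, ProfileEntry> acc;
    for (const auto& run : runs)
        for (const auto& [piece, t] : run) {
            auto& e = acc[piece];
            e.piece = piece;
            e.avg_time_s += t;  // sum now, divide by freq below
            e.freq += 1;
        }
    std::vector<ProfileEntry> profile;
    for (auto& [unused, e] : acc) {
        e.avg_time_s /= e.freq;
        profile.push_back(e);
    }
    // Sorted by expected appearance time, ready for window-based selection.
    std::sort(profile.begin(), profile.end(),
              [](const ProfileEntry& a, const ProfileEntry& b) {
                  return a.avg_time_s < b.avg_time_s;
              });
    return profile;
}
```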
44
Piece Selection
• Want pieces not yet demanded by many
• Don’t know piece distribution in swarm
• Guess: others behave like self
• Profile gives an estimate of when pieces are likely needed
45
Piece Selection Heuristic
• Randomly (rarest-first) pick one of the first k pieces in the predicted playback window
• Fetch w/ medium priority (demand requests win); sketched below
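A sketch of that heuristic (illustrative C++; k, the window logic, and helper names are assumptions): take the not-yet-local pieces whose profiled times fall at the front of the predicted playback window, and pick one uniformly at random as a stand-in for rarest-first, since the true swarm distribution is unknown.

```cpp
// Window-randomized piece selection over a time-sorted profile.
// ProfileEntry is the illustrative struct from build_profile() above.
#include <cstdint>
#include <random>
#include <vector>

size_t select_piece(const std::vector<ProfileEntry>& profile,  // time-sorted
                    const std::vector<bool>& present,          // local pieces
                    double playhead_s,  // current position in the workload
                    size_t k,           // size of the window head
                    std::mt19937& rng) {
    // Gather the first k pieces predicted to be needed from here on
    // that we don't already have.
    std::vector<size_t> window;
    for (const auto& e : profile) {
        if (e.avg_time_s >= playhead_s && !present[e.piece]) {
            window.push_back(e.piece);
            if (window.size() == k) break;
        }
    }
    if (window.empty()) return SIZE_MAX;  // nothing left to prefetch
    // A random choice among the k candidates approximates rarest-first
    // without knowing the swarm's actual piece distribution.
    std::uniform_int_distribution<size_t> pick(0, window.size() - 1);
    return window[pick(rng)];  // fetched at Medium priority; demand wins
}
```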
46
Profile-based Prefetching
• Increases diversity
• Helps even w/ no peers (when the ideal access rate exceeds the network rate)
47
Obtain Full P2P Benefit
Profile-based, window-randomized prefetch:

t = W(d) + S / r_net

High piece diversity → short wait (shouldn’t grow much w/ n) → quick piece download
48
Full VMTorrent Architecture
[Figure: custom FS (with copy-on-write pieces) backed by a P2P manager that prefetches according to the profile and exchanges pieces with the swarm; scenario labeled “p2pp”]
49
Prototype
50
VMTorrent Prototype
[Figure: prototype instantiation of the full architecture, exchanging pieces with a BitTorrent swarm]
• Custom FS: custom C, using FUSE
• P2P manager: custom C++ & libtorrent
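The FUSE half of the slide could look roughly like this (a minimal sketch against the FUSE 2.x C API, written as C++ for consistency with the other sketches; the image name, size handling, and PieceStore hookup are assumptions, and error handling is elided):

```cpp
// Minimal FUSE shim exposing the streamed image as one ordinary file,
// so an unmodified VMM just open()s and read()s it. Sketch only.
#define FUSE_USE_VERSION 26
#include <fuse.h>
#include <cerrno>
#include <cstring>
#include <sys/stat.h>

static const char* kImageName = "/image.raw";  // assumed file name
static size_t g_image_size;                    // set at mount time

static int vmt_getattr(const char* path, struct stat* st) {
    std::memset(st, 0, sizeof(*st));
    if (std::strcmp(path, "/") == 0) {
        st->st_mode = S_IFDIR | 0755;
        st->st_nlink = 2;
        return 0;
    }
    if (std::strcmp(path, kImageName) == 0) {
        st->st_mode = S_IFREG | 0644;
        st->st_nlink = 1;
        st->st_size = g_image_size;  // full size, even before delivery
        return 0;
    }
    return -ENOENT;
}

static int vmt_read(const char* path, char* buf, size_t size,
                    off_t off, struct fuse_file_info*) {
    (void)path;
    // Delegate to the stall-and-resume read path sketched earlier; the
    // call blocks until the needed pieces have been delivered:
    // piece_store().read(static_cast<size_t>(off), size, buf);  // hypothetical hookup
    std::memset(buf, 0, size);  // placeholder so the sketch is self-contained
    return static_cast<int>(size);
}

int main(int argc, char* argv[]) {
    struct fuse_operations ops = {};
    ops.getattr = vmt_getattr;
    ops.read = vmt_read;
    return fuse_main(argc, argv, &ops, nullptr);
}
```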
51
Evaluation Setup
52
Testbeds
• Emulab [White et al., 2002]
  – Instances on 100 dedicated hardware nodes
  – 100 Mbps LAN
• VICCI [Peterson et al., 2011]
  – Instances on 64 vserver hardware-node slices
  – 1 Gbps LAN
53
VMs
54
Workloads
• Short VDI-like tasks
• Some CPU-intensive, some I/O-intensive
55
Assessment
• Measured total runtime
  – Launch through shutdown
  – (Easy to measure)
• Normalized against memory-cached execution
  – Ideal runtime for that set of hardware
  – Allows easy cross-comparison
    • Different VM/workload combinations
    • Different hardware platforms
56
Evaluation
57
100 Mbps Scaling
[Plot: 100 Mbps scaling results; annotation: “Starting to increase”]
58
Due to Decreased Diversity
# of peers increases → more demand requests to the seed → less opportunity to build diversity → longer to reach max swarming efficiency + lower max
We optimized too much for the single-instance case!
(letting demand requests take precedence)
61
(Some) Future Work
• Piece selection for better diversity
• Improved profiling
• DC-specific optimizations
Current work is already orders of magnitude better than the state of the art.
62
Demo (video omitted for space)
63
See Paper for More Details
• Modeling
  – Playback process dynamics
  – Buffering (for prefetch)
  – Full characterization of r, incorporating the impact of centralized and distributed models on W
  – Other elided details
• Plus
  – More architectural discussion!
  – Lots more experimental results!
64
Summary
• Scalable VM launching is needed
• VMTorrent addresses this by
  – Decoupling data delivery from presentation
  – Profile-based VM image streaming
• Straightforward techniques and implementation; no special optimizations for the DC
• Performance much better than the state of the art
  – Hardware evaluation on multiple testbeds
  – As predicted by modeling