
  • VU Research Portal

    Fast and scalable virtual machine deployment

    Razavi, K.

    2015

    Document version: Publisher's PDF, also known as Version of Record

    Link to publication in VU Research Portal

    Citation for published version (APA): Razavi, K. (2015). Fast and scalable virtual machine deployment.

    General rights: Copyright and moral rights for the publications made accessible in the public portal are retained by the authors and/or other copyright owners, and it is a condition of accessing publications that users recognise and abide by the legal requirements associated with these rights.

    • Users may download and print one copy of any publication from the public portal for the purpose of private study or research.
    • You may not further distribute the material or use it for any profit-making activity or commercial gain.
    • You may freely distribute the URL identifying the publication in the public portal.

    Take down policy: If you believe that this document breaches copyright, please contact us providing details, and we will remove access to the work immediately and investigate your claim.

    E-mail address: [email protected]

    Download date: 02. Apr. 2021

    https://research.vu.nl/en/publications/8113de21-38a4-4f37-9f7a-5e12648ee4ad

  • Fast and Scalable Virtual Machine Deployment

    Ph.D. Thesis

    Kaveh Razavi

    VU University Amsterdam, 2015

  • This work is partially funded by the FP7 Programme of the European Commission in the context of the Contrail project under Grant Agreement FP7-ICT-257438, and by the Dutch public-private research community COMMIT/.

    This work was carried out in the ASCI graduate school. ASCI dissertation series number 338.

    Copyright © 2015 by Kaveh Razavi.

    ISBN 978-94-6259-833-1

    Cover design by Sarah Bagheri. Printed by Ipskamp Drukkers.

  • VRIJE UNIVERSITEIT

    FAST AND SCALABLE VIRTUAL MACHINE DEPLOYMENT

    ACADEMISCH PROEFSCHRIFT

    ter verkrijging van de graad Doctor aan de Vrije Universiteit Amsterdam, op gezag van de rector magnificus

    prof.dr. F.A. van der Duyn Schouten, in het openbaar te verdedigen

    ten overstaan van de promotiecommissie van de Faculteit der Exacte Wetenschappen

    op donderdag 5 november 2015 om 11.45 uur in de aula van de universiteit,

    De Boelelaan 1105

    door

    KAVEH RAZAVI

    geboren te Shiraz, Iran

  • promotor: prof. dr. ir. H.E. Bal
    copromotor: Dr.-Ing. habil. T. Kielmann

  • Examiners: Dr. P. Costa, Microsoft Research
    Prof. Dr. B. Freisleben, Philipps-Universität Marburg
    Dr. C. Giuffrida, VU University Amsterdam
    Dr. P. Grosso, Universiteit van Amsterdam
    Prof. Dr. G. Pierre, Université de Rennes 1

  • “Perfection is reached not when there is nothing left to add, but when there is nothing left to take away.”

    Antoine de Saint-Exupéry

  • Acknowledgements

    I am indebted to many people for my professional development and an inspiring social environment during the course of my PhD. I would like to thank them here.

    First and foremost, I would like to express my eternal gratitude to my supervisor, Thilo Kielmann. I very much enjoyed working alongside him and to me that was the most important factor of my PhD life. He spent countless hours helping me make sense of my research and ensuring that I was always on the right track. I admire his critical and principled attitude towards research, and his joyful sense of humour. Working with him, I learned a lot about writing scientific articles, and he probably learned a lot about the idiosyncrasies of a Persian writing in English!

    My promotor, Henri Bal, has created a tight-knit group of people who are genuinely interested in large-scale computing research and I am proud to have been a part of it. He always supported my research and provided valuable feedback on my work, for which I am very grateful. Henri's efforts revolving around the DAS-4/DAS-5 clusters were decisive not only for my research, but for that of many of my colleagues as well. Kees Verstoep, our amazing DAS system administrator, made it possible for me to smoothly run low-level system experiments on the shared DAS clusters. Without his help, it would have been much harder, if not impossible, to finish the experiments in time for the deadlines.

    I would like to especially thank the members of my thesis committee for their valuable time and feedback on a draft of this dissertation, which helped me improve it further. They also provided insights on possible future directions based on my work.

    I had the pleasure of interning with Microsoft Research Cambridge in the summers of 2014 and 2015, thanks to Paolo Costa. He invited me to work with him during the summer of 2014 on a prototype network stack, and he facilitated my second internship with Sergey Legtchenko and Ant Rowstron to work on building a cost-efficient storage rack. I really enjoyed the high-quality research environment during these summers and, apart from the aforementioned, I learned a great deal from the other members of the Systems and Networking group, namely Aleksandar Dragojevic, Austin Donnely, Dushyanth Narayanan, Eno Thereska, Greg O'shea, Hitesh Ballani, Hugh Williams, Parisa Jalili, and Richard Black.

    I was very lucky to have the opportunity to work closely with Guillaume Pierre and Renato Figueiredo while they were at the VU, resulting in a fruitful international collaboration. As part of being a PhD student, I had the pleasure of working with a number of brilliant master students as well, namely Ana Ion, Genc Tato, and Gerrit van der Kolk. They helped me with my research and I learned a great deal about teaching and collaboration.

    My dear officemates at the VU, Alex Uta, Ana Oprescu, and Stefania Costache, made the office environment jolly with humour and sweet with Romanian delicacies. I had lots of interesting discussions with my colleagues at the Computer Systems group over lunches, coffee breaks, and borrels, namely Aggelos Oikonomopoulos, Alessio Sclocco, Andrei Bacs, Ben van Werkhoven, Claudio Martella, David van Moolenbroek, Eric Bosman, Hamid Bazoubandi, Herbert Bos, Ismail El Helw, Istvan Haller, Lionel Sambuc, Lucian Cojocar, Pieter Hijma, Remco Vermeulen, Spyros Voulgaris, Stefan Vijzelaar, Victor van der Veen, and Vladimir Bozdog.

    There are two important people at the VU who helped me figure out solutions to many problems other than the ones that happen inside computers. Our kind secretary, Caroline Waij, has always been there with helpful answers to my many VU-related questions. Special thanks to Marike ten Hoorn from the personnel department, who helped me a lot with complicated bureaucratic matters.

    I am very lucky to have some of my brilliant colleagues in my social life as well. Dirk Vogt and Albana Gaba made me feel welcome when I initially started at the VU; thank you guys, and all the joy with the little baby girl to come. I very much enjoyed the many pizza and game nights organized by Cristiano Giuffrida and Laura Ferranti and the Taiko drumming lessons that we took together. Ben Gras, the villager or the werewolf? It is always fun with him around. Christian Roth, with whom I shared many ups and downs during the last few months of our PhDs, has always been good company to delve into various discussions, from virtual reality to selfish genomes! I would also like to mention my good (and now remote) friends, Alen Stojanov, Alina Kuznetsova, Animesh Trivedi, Antonio Barresi, Darko Makreshanski, Florian Landolt, George Chatzopoulos, Jana Giceva, Juho Ojala, Ozan Kaya, and Sam Whitlock, with whom I have shared many interesting moments in different places around Europe.

    Last but not least, I would like to thank the people who are closest to my heart. Maman, be harjayi ke residam tahala, va be harjayi ke beresam, bekhatere zahmataye to boode. Baba, kheyli delam barat tang shode, mersi ke mano injoori ke hastam tarbiat kardi. Kamran o Raha, khoda kone ke saalaye dige bishtar palooye ham bashim, kheyli doosetoon daram. Sarah, daset dard nakone ke coveramo design kardi, kheyli khoshkele! Hani, pishiye man, nemidoonam chejoori azat tashakkor konam. Bedoone to hich kodom az in karaye sakhto be in asoonia nemitoonestam anjam bedam! Mersi ke az khodam bishtar be karam etemad dari. Kheyli dooset daram va kheyli khosh shansam ke peydat kardam.

    Kaveh Razavi
    Cambridge, UK, August 2015

  • Contents

    Acknowledgements

    Contents

    List of Figures

    List of Tables

    Publications

    1 General Introduction
        1.1 Deployment of Virtual Machines
        1.2 Storage of Virtual Machine Images
        1.3 Structure of This Dissertation
        1.4 Summary of Contributions

    2 Scalable VM Deployment Using VM Image Caches
        2.1 Introduction
        2.2 Scalability of on-demand transfers
            2.2.1 Single VM image
            2.2.2 Many VM images
            2.2.3 Boot working set size
        2.3 VM image caches
            2.3.1 VM image chaining
            2.3.2 Cache creation
            2.3.3 Caching medium
            2.3.4 Cache-aware cloud scheduler
        2.4 Implementation
            2.4.1 QCOW2 image format
            2.4.2 Block drivers in QEMU
            2.4.3 Cache extension
            2.4.4 Chaining cache images with qemu-img
        2.5 Evaluation
            2.5.1 Cache creation
            2.5.2 Cache quota
            2.5.3 Scaling
        2.6 Cache Placement
        2.7 Background and related work
            2.7.1 Efficient VM image transfer
            2.7.2 Efficient VM migration
            2.7.3 Caching storage data
            2.7.4 Virtualized disk performance
        2.8 Conclusions

    3 Squirrel: Scatter Hoarding VM Image Contents on IaaS Compute Nodes
        3.1 Introduction
        3.2 Background
            3.2.1 VM image cache chaining
            3.2.2 Compression efficiency
            3.2.3 Summary and discussion
        3.3 System Architecture
            3.3.1 Squirrel
            3.3.2 Register
            3.3.3 Boot
            3.3.4 Deregister
            3.3.5 Offline propagation
        3.4 Evaluation
            3.4.1 Dataset information
            3.4.2 Cache volume efficiency
            3.4.3 Scalability
            3.4.4 Network transfer size
            3.4.5 Summary
        3.5 Related work
            3.5.1 Storing VM image contents on compute nodes
            3.5.2 Scalable distribution of VM image contents
            3.5.3 Compressing VM image contents
        3.6 Conclusions


    4 Scalable Local Deduplication for Virtual Machine Images
        4.1 Introduction
        4.2 Background and Motivation
            4.2.1 Temporal similarity
            4.2.2 Block repetition
            4.2.3 Spatial locality
            4.2.4 Discussion
        4.3 Local Deduplication with Nuts
            4.3.1 Fingerprinting
            4.3.2 Identifying repeating hashes
            4.3.3 Merging hashes
            4.3.4 Garbage collection
            4.3.5 Integrating Nuts
            4.3.6 Summary
        4.4 Evaluation
            4.4.1 Local deduplication and temporal locality
            4.4.2 Hash table size
            4.4.3 Execution time
            4.4.4 Benefits of hash-merging
            4.4.5 Discussion
        4.5 Related Work
            4.5.1 Scalable deduplication
            4.5.2 Deduplication of VM images
        4.6 Conclusions

    5 Prebaked µVMs: Scalable, Instant VM Startup for IaaS Clouds
        5.1 Introduction
        5.2 Booting virtual machines
        5.3 µVMs and the VM bakery
            5.3.1 µVM
            5.3.2 VM Bakery
            5.3.3 Host-side Caching of µVMs
            5.3.4 Security Considerations
        5.4 Implementation
            5.4.1 Supporting µVMs in QEMU/KVM
            5.4.2 µVM Storage Overlay
            5.4.3 VM Bakery
            5.4.4 Host-side Caching of µVMs with Squirrel
            5.4.5 Encountered Issues
        5.5 Evaluation
            5.5.1 VM Startup Times
            5.5.2 Startup Times of Compressed µVMs
            5.5.3 Storage Scalability
            5.5.4 Summary
        5.6 Related Work
            5.6.1 Dynamic Resource Allocation to VMs
            5.6.2 Fast VM Startup
            5.6.3 Host-side Caching
        5.7 Conclusions

    6 General Conclusions
        6.1 Summary of Results
        6.2 Conclusions
        6.3 Directions for Future Research

    References

    Summary

    Samenvatting

  • List of Figures

    1.1 On-demand, or lazy transfer of a VM image.
    1.2 Consumed storage after VM image specialization.
    1.3 Deployment time of optimized ConPaaS VM Image.

    2.1 Copy-on-write with on-demand transfers in action.
    2.2 Scaling one VM image to many nodes.
    2.3 Scaling many VM images to many nodes.
    2.4 VM image cache architecture.
    2.5 VM image cache creation.
    2.6 QCOW2 format's layout.
    2.7 VM image cache on a compute node.
    2.8 Cache creation overhead with increasing cache quota.
    2.9 Observed traffic at the storage node with increasing cache quota.
    2.10 Final arrangement for cache creation.
    2.11 Caching a single VM image at compute nodes.
    2.12 Caching many VM images at the compute nodes.
    2.13 Caching a VM image on the storage node's memory.
    2.14 Caching many VM images on the storage node's memory.

    3.1 The introduction of VM image caches.
    3.2 Compression ratio of VM images and caches with dedup and gzip6.
    3.3 Compression ratio of VM image caches with different routines.
    3.4 Combined compression ratio of VM images and caches.
    3.5 Squirrel architecture diagram.
    3.6 Squirrel VM image registration workflow.
    3.7 Booting a VM with Squirrel's ccVolume.
    3.8 Disk consumption with deduplication and compression.
    3.9 Deduplication table size on disk.
    3.10 Memory consumption for deduplication tables.
    3.11 Booting performance from compressed storage.
    3.12 Cross-similarity of VM images and caches.
    3.13 ZFS resource consumption with iterative addition.
    3.14 Disk consumption curve-fitting quality.
    3.15 Extrapolation of disk consumption.
    3.16 Memory consumption curve-fitting quality.
    3.17 Extrapolation of memory consumption.
    3.18 Network transfer size with Squirrel.

    4.1 Deduplication ratios with top digests.
    4.2 (a) Nuts' operation. (b) Nuts' pipeline.
    4.3 Nuts' sequence splitting mechanism.
    4.4 Global and local deduplication ratios of VM images and caches.
    4.5 Average number of merged hashes with different cluster sizes.
    4.6 The importance of temporal ordering for local deduplication.
    4.7 Comparison of hash table size for local and global deduplication.
    4.8 Nuts' execution times of global and local deduplication.
    4.9 Hash-merging and the size of deduplication plan.

    5.1 Normal VM start up.
    5.2 VM Bakery: VM startup using µVM and resource hot-plugging.
    5.3 Deduplication ratios of µVMs in KVM and raw formats.
    5.4 Creation and reuse of µVM storage overlays.
    5.5 Squirrel architecture diagram.
    5.6 The organization of a µVM from a host's point of view.
    5.5 VM startup time using µVMs of various OSes.
    5.6 µVM startup times over a gzip compressed and deduplicated ZFS.
    5.7 The effect of various compressions on µVMs vs. boot caches.

  • List of Tables

    1.1 Interesting VM image configurations for ConPaaS.

    2.1 Read working set size of various VMs during boot.
    2.2 Cache quota necessary for various VM images.

    3.1 Attained storage efficiency with 128 KB block size.
    3.2 OS diversity in Windows Azure and Amazon EC2.
    3.3 RMSE of various curves that estimate disk consumption.
    3.4 RMSE of various curves that estimate memory consumption.

    4.1 A snippet of one file recipe with digest counters.
    4.2 Profit margin of using local deduplication.

    5.1 Booting time of various operating systems.


  • Publications

    Reducing VM Startup Time and Storage Costs by VM Image Content Consolidation, Kaveh Razavi, Liviu Mihai Razorea, and Thilo Kielmann. In Proceedings of the Euro-Par Workshops 2013 (DIHC 2013), Aachen, Germany, August 2013.

    Scalable Virtual Machine Deployment Using VM Image Caches, Kaveh Razavi and Thilo Kielmann. In Proceedings of the 25th International Conference on High Performance Computing, Networking, Storage and Analysis (SC 2013), Denver, USA, November 2013.

    Squirrel: Scatter Hoarding VM Image Contents on IaaS Compute Nodes, Kaveh Razavi, Ana Ion, and Thilo Kielmann. In Proceedings of the 23rd International Symposium on High-Performance Parallel and Distributed Computing (HPDC 2014), Vancouver, Canada, June 2014.

    Kangaroo: A Tenant-Centric Software-Defined Cloud Infrastructure, Kaveh Razavi, Ana Ion, Genc Tato, Kyuho Jeong, Renato Figueiredo, Guillaume Pierre, and Thilo Kielmann. In Proceedings of the 3rd International Conference on Cloud Engineering (IC2E 2015), Tempe, USA, March 2015.

    Scaling VM Deployment in an Open Source Cloud Stack, Kaveh Razavi, Stefania Costache, Andrea Gardiman, Kees Verstoep, and Thilo Kielmann. In Proceedings of the 6th Workshop on Scientific Cloud Computing (ScienceCloud 2015), co-located with HPDC 2015, Portland, Oregon, USA, June 2015.

    Prebaked µVMs: Scalable, Instant VM Startup for IaaS Clouds, Kaveh Razavi, Gerrit van der Kolk, and Thilo Kielmann. In Proceedings of the 35th International Conference on Distributed Computing Systems (ICDCS 2015), Columbus, USA, June/July 2015.


    CAIN: Silently Breaking ASLR in the Cloud, Antonio Barresi, Kaveh Razavi, Mathias Payer, and Thomas R. Gross. In Proceedings of the 9th USENIX Workshop on Offensive Technologies (WOOT 2015), Washington, D.C., USA, August 2015.

    R2C2: A Network Stack for Rack-scale Computers, Paolo Costa, Hitesh Ballani, Kaveh Razavi, and Ian Kash. In Proceedings of the ACM SIGCOMM Conference on Data Communication (SIGCOMM 2015), London, UK, August 2015.

    Nested Clouds: A Common Ground for Tenants and Providers?, Stefania Costache, Ana-Maria Oprescu, Kaveh Razavi, and Thilo Kielmann. Under submission.

    Scalable Local Deduplication for Virtual Machine Image Storage, Kaveh Razavi and Thilo Kielmann. Under submission.

  • 1 General Introduction

    Complete machine virtualization is becoming increasingly common in today's data centers, termed Infrastructure-as-a-Service (IaaS). It allows data centers to sell unused resources (e.g., compute cycles), while providing isolation between different users. A user of an IaaS cloud, referred to as a tenant, allocates resources by deploying virtual machines (VMs) on top of the data center's physical resources. Tenants can dynamically scale out their resources by adding new VMs, or scale in by removing unused VMs. This unprecedented flexibility opens up many interesting possibilities for tenant-controlled resource management [2, 44, 59, 72, 100], leading to increased adoption of IaaS clouds [6]. With the increasing number of tenants, and the diversity of their applications, scalability of the cloud infrastructure becomes a primary concern.

    Before the IaaS era, program binaries were the abstraction used for executing code on a certain computing resource in a cluster of machines. In contrast, to execute an application on an IaaS cloud, the tenant has to provide a VM image that contains all the execution dependencies, typically a complete operating system (OS) plus application-specific software. The VM image is attached to a VM as a (virtual) disk. During startup, a VM reads the necessary information from this disk to boot the OS before executing the intended application. Given this description, one can see the VM image as a new abstraction for executing code on computing resources.

    One benefit of this model, compared to traditional clusters or grids, is that tenants are in complete control of what runs in their VMs, down to the OS kernel. This opens up new possibilities for the tenants, including VM specialization, flexible software versioning, and reproducibility, to name a few [53, 57, 60, 63, 94]. Tenants, embracing this possibility, often register many VM images, each specialized in its own way according to the needs of their applications. As a result, it is not uncommon for public IaaS providers to store tens of thousands of public VM images [98], and possibly hundreds of thousands of private ones. VM images are often large, on the order of gigabytes if not tens of gigabytes, and their abundance makes it impossible to store all of them on the hosts that run the VMs. Hence, to start a VM on a host, the VM image often needs to be copied, fully or in part, from a remote location (i.e., a storage server).

    As mentioned earlier, IaaS clouds are becoming increasingly popular. In this dissertation, we are concerned with the scalability aspects of VM images. More specifically, we study the scalability of their transfer over the network to the compute hosts, and the scalability of their storage on the storage servers, or on the compute hosts when they are cached. We show that the current state of the art suffers from severe scalability bottlenecks in both the network and storage dimensions. We then design and implement techniques for resolving these scalability problems, and evaluate them at a large scale. Further, we show that VM images are not the right abstraction when very fast VM startup is necessary. Autoscaling systems [34] and interactive high-performance computing [50, 103] are examples of applications that require fast VM startup. We propose a new abstraction, and show through a prototype implementation and large-scale evaluation that this abstraction satisfies the requirement for fast VM startup. Overall, the solutions presented in this dissertation provide an answer to the following primary research question:

    RQ: How can we build a system that achieves fast and scalable VM deployment?

    In the rest of this introduction, we formulate secondary research questions based on our primary research question in Section 1.1 and Section 1.2, and describe the structure of this dissertation in Section 1.3. Finally, we discuss the concrete contributions of this thesis in Section 1.4.

    1.1 Deployment of Virtual Machines

    Whenever a tenant requests the creation of a new VM on the provider's infrastructure, the cloud scheduler needs to find a suitable host with free resources for the requested VM. Scheduling at a large scale, if not done properly, can take a long time. Previous work has addressed this issue by means of hierarchical [33] or fully decentralized scheduling [92]. We have also shown that it is possible to improve an existing centralized scheduler to make it scalable [99]. In this dissertation, we focus on what happens after the scheduling decision is made.

    After a suitable host is determined for a new VM, before starting the VM, the requested VM image needs to be made accessible to the VM on the chosen host. The most straightforward approach copies the VM image from a storage location to the host before starting the VM. In fact, this is the same approach that was initially taken by Amazon EC2¹. When a tenant requests a new VM on Amazon EC2, the compressed VM image, stored on Amazon S3², is transferred in its entirety over the network to the selected host, decompressed, and then a new VM is started using this VM image. We refer to this approach as eager transfer of a VM image. The deployment time of eager transfer correlates directly with the size of the VM image.

    Figure 1.1: On-demand, or lazy transfer of a VM image. [Diagram: a VM (KVM) on the compute node writes to a local CoW image; reads missing from the CoW image are fetched from the base image on the storage node.]

    An alternative deployment approach, which we refer to as on-demand or lazy transfer, starts the VM without copying its VM image first. The VM is started using an (initially) empty VM image that acts as a placeholder for the original remote VM image. The write requests from the VM are written to this placeholder VM image in a copy-on-write manner. The read requests from the VM, if not already available in the copy-on-write VM image, are resolved by reading data blocks on demand over the network from the storage location hosting the original VM image. Figure 1.1 visualizes this approach with a compute node hosting a VM, and a storage node hosting the VM image. Note that during deployment, potentially a very small fraction of the VM image is required for booting the OS and starting the desired application; hence lazy transfer requires little data compared to eager transfer, at the cost of a per-request network delay. Amazon EBS-backed³ VM images, for example, rely on this approach for the deployment of VMs. The deployment time of on-demand transfers is independent of the size of the VM images due to their lazy nature.

    To understand the deployment times of VMs, we have conducted a number of experiments on Amazon EC2 using the VM image of ConPaaS [86], an open-source Platform-as-a-Service. We reduced the set of software packages in the VM image to the minimum necessary for the different configurations listed in Table 1.1. We walked the dependency trees of the software packages required for each service in Table 1.1 to calculate the set of packages necessary for the execution of the intended services. We then removed the remaining packages from the VM image, before registering the optimized VM image to Amazon S3 and EBS. Please refer to [97] for further details on the VM image optimization by means of reducing the software packages, and on the experiments.

    ¹ Amazon Elastic Compute Cloud, or Amazon EC2, is the earliest IaaS provider, active since 2006 [12].
    ² Amazon Simple Storage Service, or Amazon S3, is an elastic blob-based storage service.
    ³ Amazon Elastic Block Storage, or Amazon EBS, is an elastic block-based storage service.

    Table 1.1: Interesting VM image configurations for ConPaaS.

        Config.    List of services                  Use-case
        Complete   All services                      General-purpose
        Core       No services                       N/A
        C1         XtreemFS                          Scalable storage
        C2         Hadoop                            Scalable MapReduce
        C3         PHP, MySQL                        Web application
        C4         PHP, MySQL, Scalaris              Web application with caching
        C5         PHP, MySQL, Scalaris, XtreemFS    Web application with caching & storage
        C6         HTCondor, XtreemFS                High-throughput computing & storage

    Figure 1.2: Consumed storage after VM image optimization and specialization. [Bar chart: raw disk size and compressed S3 size, in MB, for the Complete, Core, and C1-C6 configurations.]

    Figure 1.2 shows the raw and compressed disk size of the VM images (VM images are always stored in compressed form on S3) using the configurations of Table 1.1. These results show that eager transfer does not scale with a large number of VM deployment requests: in the case of ConPaaS, even with optimization and specialization of its VM image, each VM deployment needs to read between 200 MB and 300 MB over the network from S3. At the time of our experiments (April 2013), we observed around 120 Mb/s of data transfer bandwidth from S3. More optimistically, given a storage/network solution that provides 10 Gb/s of bandwidth (two orders of magnitude faster), with eager transfers only five concurrent VM deployments would consume one full second of the available network bandwidth, which is very limiting for a public IaaS provider. Further, not every tenant optimizes her VM images, and hence in reality the transfer size will be larger on average.
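    As a quick sanity check, this back-of-the-envelope estimate can be reproduced directly; the 250 MB per deployment (the midpoint of the 200-300 MB range above) and the 10 Gb/s link are the figures assumed in the text, not new measurements.

        # Back-of-the-envelope check: how many eager VM deployments saturate the link?
        # Assumes 250 MB read per deployment and a 10 Gb/s storage/network solution.
        link_bits_per_s = 10 * 10**9
        bytes_per_deployment = 250 * 10**6

        deployments_per_s = link_bits_per_s / (bytes_per_deployment * 8)
        print(f"{deployments_per_s:.0f} eager deployments per second saturate a 10 Gb/s link")
        # -> 5, i.e., five concurrent eager transfers consume one full second of bandwidth.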

    Figure 1.3: Deployment time of the ConPaaS VM image optimized to different configurations. [Bar chart: deployment time in seconds for the S3-backed configurations (Complete, Core, C1-C6) and EBS, broken down into network transfer, decompression, and EC2 scheduling plus OS boot.]

    Figure 1.3 shows the deployment time of a single ConPaaS VM, decomposed into the transfer time of the VM image, the decompression of the VM image, and the EC2 scheduling time along with the boot time of the VM. We can make a number of observations. First, with eager transfer (S3-based VM images), even with various optimizations and specialization of the VM image, a large amount of time and resources is still spent on transferring and decompressing the VM image. Second, with lazy (on-demand) transfer (EBS-based VM images), the deployment time is independent of the VM image size, hence we only report the number for the complete set of services. Interestingly, despite the fact that the amount of data transferred during lazy deployment is smaller than with eager transfer, the per-request delays caused by on-demand reads over the network result in a deployment time similar to an eager, complete VM image transfer. Third, when network transfers are minimal (e.g., the S3/Core case), the booting time of the VM's OS becomes the dominant part of the deployment.

    Given these observations in a single VM scenario, and the back-of-the-envelope calculation regarding the scalability of eager transfers, we formulate the following research questions:

    RQ1: How scalable are lazy transfers when deploying many VMs?

    RQ2: Is it possible to improve “lazy transfer” deployment time by eliminating delays caused by reading from the network?

    RQ3: Can we avoid booting the VM's OS in response to VM deployment requests in order to reduce VM deployment times?

    The answer to RQ1 will determine whether using lazy transfers alone is enough for building a scalable IaaS provider. The answers to RQ2 and RQ3 will shed some light on how to build an infrastructure that supports elastic cloud applications. For example, an autoscaling system should react differently to changes in the system load if a VM deployment takes a few seconds instead of a few minutes (if not tens of minutes). If it takes a long time to deploy VMs, the autoscaling system should overprovision the number of VMs, causing latent costs of leasing additional, standby VMs that were supposed to be avoided by using the cloud infrastructure in the first place. Long deployment times also hamper the usability of clouds for interactive applications that use VMs in the backend, such as interactive high-performance computing applications. An infrastructure with proper support for elasticity eradicates the need for overprovisioning, and improves the cloud usability in such scenarios.

    We now discuss scalability concerns with respect to the storage of VM images, and formulate additional research questions.

    1.2 Storage of Virtual Machine Images

    As mentioned before, one of the benefits of using VM images is providing the tenants with the flexibility of customizing and specializing what runs in their VMs. This flexibility, however, is a double-edged sword and results in huge amounts of data duplicates and storage wastage: VM images are often large (up to tens of gigabytes), they are based on only a few popular OSes that are widely adopted, and, unlike our aggressive optimizations discussed in Section 1.1, most of the tenants' customizations only change a small part of the VM images.

    The abundance of VM images, and their large size, create two problems: their storage at the storage servers becomes costly when their number goes into the tens of thousands, and their caching at the compute hosts, possibly in response to RQ2, becomes inefficient. Standard data deduplication techniques [31] have previously been applied to improve the storage efficiency of VM images [46, 47, 71, 123]. Data deduplication removes data duplicates by calculating digests over data blocks in the VM images, and storing only one copy of the data block for each unique digest. These digests are often stored in an in-memory hash table for fast lookup whenever new data needs to be deduplicated.
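    To make the mechanism concrete, the following is a minimal sketch of block-level deduplication over a raw image file: fixed-size blocks are hashed and one copy of each unique block is kept in an in-memory table. It only illustrates the idea and is not the implementation of the systems cited above; the image file name in the commented usage is hypothetical.

        import hashlib

        def dedup_image(path, block_size=4096):
            """Hash fixed-size blocks and keep one copy per unique digest.
            Returns the unique-block store and the per-block recipe of digests."""
            store = {}    # digest -> block contents (one copy per unique block)
            recipe = []   # ordered digests needed to reconstruct the original image
            with open(path, "rb") as image:
                while True:
                    block = image.read(block_size)
                    if not block:
                        break
                    digest = hashlib.sha256(block).digest()
                    store.setdefault(digest, block)   # store the block only once
                    recipe.append(digest)
            return store, recipe

        # Hypothetical usage:
        # store, recipe = dedup_image("debian-6.0.7.raw")
        # print(f"deduplication ratio: {len(recipe) / len(store):.2f}x")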

    There is an inherent trade-off when deduplicating data at the block level: assuming a larger block size, the total number of digests becomes smaller, but so does the probability of finding duplicates. Thus, this trade-off makes it possible to decide the deduplication ratio based on the amount of available memory, since the digests are kept in memory. To improve the storage efficiency of VM images further, it is possible to apply standard compression algorithms (e.g., gzip) at the block level, but the interaction of compression with deduplication is not well understood. Further, in the case of caching VM images at the compute hosts, the implications of applying both deduplication and compression on the deployment times should be investigated. We therefore formulate two additional research questions:

    RQ4: What is the efficiency of deduplication when combined with standard compression on VM images?

    RQ5: What is the (negative) impact of deduplication and/or compression on the deployment times of VMs?

    The response to RQ4 will tell us whether it makes sense to apply compression on deduplicated VM images, and if so, at which block granularity. The response to RQ5 will quantify the overhead of compressing VM images on the deployment times of VMs, and will show whether additional work is necessary to reduce this overhead.

    Scaling the number of VM images raises concerns about the scalability of deduplication. A back-of-the-envelope calculation shows that with, e.g., 1 TB of unique data and a block size of 4 KB, we will have about 270 million digests. With a perfect hash table, assuming a modest hash-table entry size of 50 bytes (e.g., when digesting with SHA-256, the 32-byte digest alone is the key), we will need around 13 GB of main memory just for the hash table. We will show later that these numbers match the requirements of today's average-sized data centers. Hence, we need a more scalable solution for today's larger, or the future's smaller, data centers that require deduplication for efficient storage of VM images. Further, if deduplication on the compute hosts is desired, its memory overhead should be minimal due to the monetary value of that memory (i.e., it is rented out as part of the VMs). Hence, we formulate our final research question:

    RQ6: Is it possible to create a more scalable deduplication scheme by exploiting unique features of VM images?

    If the answer to this question is positive, the proposed solution should give the data centers better trade-offs for deciding the granularity of the deduplication.
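    The back-of-the-envelope numbers above are easy to recompute; the inputs below (1 TB of unique data, 4 KB blocks, 50-byte hash-table entries) are exactly the assumptions stated in the text, not measurements.

        # Memory needed for a global in-memory deduplication table.
        unique_bytes = 1 * 2**40   # 1 TB of unique data
        block_size = 4 * 2**10     # 4 KB blocks
        entry_bytes = 50           # assumed bytes per hash-table entry

        digests = unique_bytes // block_size
        table_bytes = digests * entry_bytes
        print(f"{digests / 1e6:.0f} million digests -> {table_bytes / 1e9:.1f} GB for the hash table")
        # -> about 268 million digests and roughly 13 GB of main memory.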

    1.3 Structure of This Dissertation

    Parts of this introductory chapter, specifically the experiments presented in Figure 1.3 and Figure 1.2, are taken from a workshop paper published in the proceedings of the Euro-Par Workshops 2013 (DIHC '13) [97].

    In Chapter 2, we first address RQ1 by showing that on-demand lazy transfers do not scale. We solve this problem, also known as the “boot storm” problem, by means of host-side block-level caching. We introduce the notion of the “OS boot working set”, and show that these sets are orders of magnitude smaller than the VM images. Due to their small size, these sets make good candidates for caching, hence our chosen name: VM image caches. By storing these caches on a storage medium closer to the VMs (e.g., at the hosts), we can significantly reduce the load on both the network and the storage solutions used for delivering VM images, to achieve scalable VM deployment. Our solution addresses RQ2 by removing the per-transfer network delay of lazy VM deployments. The work presented in this chapter has been published as a full paper in the proceedings of the 25th International Conference on High Performance Computing, Networking, Storage and Analysis (SC '13) [96].

    In Chapter 3, we present Squirrel, a storage system designed to store large quantities of VM image caches at the hosts. We study the effect of deduplication (i.e., a common technique to remove duplicate data) combined with compression (e.g., gzip) on 607 VM images and VM image caches of Windows Azure⁴, a public data center, and show that Squirrel can store a significant number of caches with a very moderate amount of resources at the hosts, without affecting the starting time of VMs, effectively addressing RQ4 and RQ5. Squirrel shows that it is possible to keep warm caches available all the time without any elaborate cache or VM placement strategy. The work presented in this chapter has been published as a full paper in the proceedings of the 23rd International Symposium on High-Performance Parallel and Distributed Computing (HPDC '14) [98].

    In Chapter 4, we present Nuts, a scalable local deduplication system for VM images and their caches. Current deduplication systems (e.g., ZFS) perform global deduplication by keeping all digests in main memory in order to find every possible duplicate. To reduce the main memory requirements of global deduplication, our local deduplication scheme finds duplicates within a subset of the VM images at a time. We discuss unique properties of VM images and their caches that make them good candidates for local deduplication. Using the 1011 VM images and VM image caches from the Windows Azure repository⁵, we show that Nuts can achieve more than 80% of the efficiency of global deduplication, while reducing the main memory requirements by up to 35.3x. The low-overhead nature of Nuts makes it a perfect deduplication engine for caching architectures such as Squirrel, or in general for efficient storage of VM images without requiring access to large amounts of main memory. The work presented in this chapter has recently been submitted for publication.

    In Chapter 5, we describe a technique to further reduce the VM deployment time, sometimes to under a second. With a caching system such as Squirrel that eradicates network and storage bottlenecks, the bulk of the VM startup time is spent booting the OS. We introduce the µVM, a snapshot of an entire OS with minimal resources (e.g., core, memory, etc.) taken at the time when it has finished booting. We describe the VM Bakery, a service that extends a resumed µVM to a size requested by the tenant via hot-plugging resources. By modifying Squirrel to serve µVMs instead of VM image caches, we show that we can deploy VMs in under one second on average using a standard ext4 file system. Storing 1011 µVMs from the public VM images of Windows Azure on a deduplicated and compressed ZFS file system amounts to 50 GB of disk space, while maintaining deployment times of 2.8 seconds on average. µVMs address RQ3 by offering an alternative abstraction for VM deployment. The work presented in this chapter has been published as a full paper in the proceedings of the 35th IEEE International Conference on Distributed Computing Systems (ICDCS '15) [101].

    ⁴ This repository includes the public VM images registered by the tenants from April to November 2013.
    ⁵ This repository includes the public VM images registered by the tenants from April 2013 to June 2014.

    Finally, in Chapter 6, we discuss the conclusions of the work presented in this dissertation and hint at possible directions for future research.

    1.4 Summary of Contributions

    The contributions of this dissertation are as follows:

  • Chapter 2 introduces the boot working set of a VM, and shows that caching this set resolves the “boot storm” problem, allowing for scalable VM deployments.

  • Chapter 3 shows that applying deduplication and compression allows for scalable storage of the boot working sets, removing the need for elaborate cache replacement policies.

  • Chapter 4 introduces a new scheme for scalable deduplication by exploiting temporal similarity between the contents in the VM images and the boot working sets.

  • Chapter 5 shows that it is possible to achieve instant VM startup by moving the slow booting of the guest OS out of the deployment path without sacrificing generality.

  • 2 Scalable VM Deployment Using VM Image Caches

    Abstract

    In IaaS clouds, VM startup times are frequently perceived as slow, negatively impacting both the dynamic scaling of web applications and the startup of high-performance computing applications consisting of many VM nodes. A significant part of the startup time is due to the large transfers of VM image content from a storage node to the actual compute nodes, even when copy-on-write schemes are used. We have observed that only a tiny part of the VM image is needed for the VM to be able to start up. Based on this observation, we propose using small caches for VM images to overcome the VM startup bottlenecks. We have implemented such caches as an extension to KVM/QEMU. Our evaluation with up to 64 VMs shows that using our caches reduces the time needed for simultaneous VM startups to that of a single VM.

    The contents of this chapter have been originally published in the proceedings of the 25th International Conference on High Performance Computing, Networking, Storage and Analysis (SC '13), and have been slightly modified to improve readability.


    2.1 Introduction

    With the advent of public Infrastructure-as-a-Service (IaaS) clouds like Amazon EC2 or Rackspace, the use of virtualized operating systems, “virtual machines,” has become widespread. Also in privately owned computing environments such as compute clusters, the use of virtual machines (VMs) is gaining popularity due to benefits like elastic machine allocation, user-controlled software installations, and the possibility to reduce energy footprints by consolidating multiple VMs onto a single physical machine.

    The promise of elastic computing is the instantaneous creation of virtual machines, according to the needs of an application or web service. In practice, however, users face VM startup times of several minutes, along with high variability, depending on the actual system load. Two major factors contribute to VM startup times: the resource selection process by the cloud middleware (Amazon EC2, OpenNebula, OpenStack, etc.), and the actual VM boot time, including the transfer of the VM image to the selected compute node.

    While we have noticed that there is room for improvement in the resource selection process, e.g. in our own OpenNebula deployment, that problem is beyond the scope of this chapter. Here, we focus on reducing the transfer time for VM images from the storage node to the compute nodes. In particular, we study the scalability of VM image transfers with respect to simultaneous VM startups, from a single VM image or from many VM images.

    Our initial study shows that state-of-the-art, on-demand transfers (“copy-on-write”) cannot sustain performance for 4 or more simultaneous image transfers when using a 1 Gb Ethernet connection to the storage node. When using 32 Gb InfiniBand (IB) instead, the simultaneous startup of up to 64 machines (the size of our cluster) can be done in constant time, as long as all machines boot from the same VM image. When increasing the number of VM images used, the storage node itself becomes the bottleneck, and startup times rise linearly with the number of images.

    While investigating this problem, we have observed that during the boot process, virtual machines actually read only a small fraction (we have seen up to 200 MB) of the total VM image, which is typically sized at several GB. Based on this observation, we propose to use VM image caches to mask the actual transfer bottlenecks. Depending on the location of the bottleneck, VM image caches can be placed either on the disks of the compute nodes (when the network is the bottleneck, e.g. with 1 Gb Ethernet), or in the main memory of the storage node (when disk access is the bottleneck, e.g. with an IB network).

    We have implemented our VM image caches as an extension to KVM/QEMU. (A similar extension could be implemented for Xen as well.) As such, our caching scheme is independent of the cloud middleware in use and could be deployed on a wide range of cloud infrastructures. We have evaluated our caching scheme on our DAS-4 cluster [27] at VU Amsterdam. Our results show that using our caches, with either network, reduces the time needed for (up to 64) simultaneous VM startups to the time needed for booting a single VM.

    This chapter is organized as follows. In Section 2.2, we demonstrate the limited scalability of on-demand VM image transfers. In Section 2.3, we present the design and in Section 2.4 the implementation of VM image caches to overcome these limitations. The results of our experimental evaluation are shown in Section 2.5. Based on the results of our evaluation, we recommend cache placement strategies in Section 2.6. Section 2.7 discusses related work; in Section 2.8 we conclude.

    2.2 Scalability of on-demand transfers

    The simplest way of deploying a VM image on a compute node is to copy the image onto the compute node before booting the VM from it. As VM images typically comprise one or more GB of data, this approach obviously is slow and easily consumes large amounts of network bandwidth in between the storage node and the compute nodes.

    The current state of the art is to reduce the amount of data transferred to those blocks of the VM image that are actually needed during the boot process, called on-demand transfers. With on-demand transfers, VM writes go to a second, copy-on-write (CoW) image. VM reads, if not already in the CoW image, come from the original VM image (base), accessed through a remote file-system like NFS. In this scheme, the base VM image is read-only and can be shared simultaneously by an arbitrary number of nodes.

    QCOW2 [61] is an example of an image format that supports CoW images. The base image can be of any supported format. The read and write granularity of QCOW2 is defined by QCOW2's cluster size, with a default value of 64 KB. Figure 2.1 shows the operation of QCOW2's CoW mechanism. QEMU's implementation of QCOW2 is used by KVM and partly by Xen, two commonly used virtual machine monitors (VMMs) in public and private IaaS clouds.

    Figure 2.1: Copy-on-write with on-demand transfers in action. The VM writes go to a local CoW image, and reads are fetched from a Base image over a remote file-system like NFS. [Diagram: storage node holding the base image; compute node running the VM (KVM) with its local CoW image.]

    Using QCOW2 with a remote base image is a good example of CoW with on-demand transfers. This approach significantly reduces the booting delay and the pressure on the network by avoiding the need for a complete image transfer in the beginning. On-demand transfers, however, can have scalability problems of their own, which we are going to discuss in the remainder of this section.
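    For concreteness, such a copy-on-write overlay over a remote base image can be created with stock qemu-img; the paths below and the (default) 64 KB cluster size are illustrative assumptions rather than the exact setup of our experiments.

        import subprocess

        # Hypothetical paths: the base image is exported read-only over NFS by the
        # storage node; the CoW overlay lives on the compute node's local disk.
        base = "/mnt/nfs/images/centos-6.3.qcow2"
        overlay = "/var/lib/vms/vm42.qcow2"

        # Create an initially empty QCOW2 overlay backed by the remote base image.
        # Reads missing from the overlay fall through to the base; writes stay local.
        subprocess.check_call([
            "qemu-img", "create", "-f", "qcow2",
            "-o", f"backing_file={base},backing_fmt=qcow2,cluster_size=65536",
            overlay,
        ])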

    2.2.1 Single VM image

    In a single VM image scenario, the content of one image needs to be transferred to many compute nodes on demand. This is a common case either for a popular image in public clouds, or for high-performance computations with many worker nodes of the same type, as with parameter sweep applications [79].

    Figure 2.2: Booting time of a CentOS Linux VM on many compute nodes simultaneously using a single VM image. The reads are fetched from a remote base image and the writes go to a local CoW image. [Plot: booting time in seconds versus the number of nodes (1 to 64), for QCOW2 over 1 GbE and over 32 Gb InfiniBand.]

    Figure 2.2 shows the booting time of a CentOS Linux VM on compute nodes¹. When there are more than four concurrent boots, the booting time increases linearly with the number of nodes in the case of a 1 GbE network, suggesting that the network is becoming the bottleneck. This result already shows the need for some sort of efficient caching of the VM image at compute nodes. In the case of a 32 Gb IB network, the booting time remains constant, suggesting that booting these CentOS VMs is not saturating the network.

    2.2.2 Many VM images

    In this scenario, many VM images need to be transferred to many compute nodes. This is a common case for public IaaS clouds, where many users may boot different VM images simultaneously.

    ¹ Machine details of the experiments presented in this section are explained along with our evaluation in Section 2.5.

    Figure 2.3: Booting time of a CentOS Linux VM on many compute nodes simultaneously using different numbers of VM images. The reads are fetched from the assigned remote base image and the writes go to a local CoW image. [Plot: booting time in seconds on 64 nodes versus the number of VM images (1 to 64), for QCOW2 over 1 GbE and over 32 Gb InfiniBand.]

    Figure 2.3 shows the booting time of 64 CentOS Linux VMs, scaling the number of VM images used. (For this test, we have created 64 identical but independent copies of the CentOS image.) Regardless of the network speed, as the VMs use more independent images, the booting time increases significantly. This is due to the disk queueing delay at the storage node, suggesting the use of caches in between the storage-node disk and the compute-node memory.

    Two possible locations for such caches are the storage node's memory or the compute nodes' disks. We will investigate both in the next section. Before doing so, we verify the feasibility of such caches by analyzing their required sizes.

    2.2.3 Boot working set size

The basic idea of using VM image caches is to keep those parts of the image in the cache that are required for booting the VM, leaving accesses to the remaining parts of the VM image for the actual service or application runtime, a period with much weaker performance requirements towards the VM image storage node. For this purpose, we have measured the amount of data that is read from the base image for three different VM images. The results are in Table 2.1 and suggest that with a modestly sized cache, it is possible to boot many VMs while avoiding the potential bottlenecks described earlier in this section. From these values we can conclude that a VM image cache entry would need to be on the order of 250 MB (providing some margin). This size is small enough to build caches for multiple VM images, either on the disks of the compute nodes or in the main memory of the storage node. In Section 2.3, we present our design of such a caching mechanism.


    Table 2.1: Read working set size of various VM images for booting their VMs.

VM image               Size of unique reads
CentOS 6.3             85.2 MB
Debian 6.0.7           24.9 MB
Windows Server 2012    195.8 MB

    2.3 VM image caches

We will now present the design of our VM image caching scheme. We begin by summarizing the underlying, fundamental requirements.

The first requirement for the cache is that it is a VM image itself. The cache can then be created and stored on any desired medium (i.e., disk or memory) at any desired location (i.e., storage node or compute node). This also means that the cache is standalone: a VM can start booting using it. In the case of missing data, however, the cache should be able to recurse to the base image.

The second requirement is support for a quota. If the caching medium is a scarce resource like memory, a quota makes sure that the cache does not consume too much of it. Further, it provides fine-grained resource accounting of the cache per VM image.

The third requirement is immutability with respect to the base image. An immutable cache, once created, can be reused many times in the future as long as the base image remains unchanged.

    2.3.1 VM image chaining


    Figure 2.4: The new architecture with a VM image cache in between the base and CoW images.

To support these requirements, we introduce an intermediate image between the base image and the CoW image, called the VM image cache.

Figure 2.4 shows what the image chain looks like with a cache. The cache is a VM image by definition, and with enough data blocks, a VM can boot using it. Since the cache image is separate from the CoW image, it is possible to enforce a quota on it, satisfying the second requirement. To make the cache immutable with respect to the base image, we only write data that comes from the base image into the cache. (All writes coming from the VM itself go to the CoW image.)

    2.3.2 Cache creation


Figure 2.5: The process of creating the cache. Every read from the base image incurs an additional write to the cache.

We now describe how we populate the cache with data. The first time a VM boots, an empty cache is created. Figure 2.5 shows the process of warming the cache. Every read that is fetched from the base image is also copied into the cache (copy-on-read, or CoR). The first n blocks of data are stored in the cache until the quota is reached or the VM no longer needs data to be fetched from the base image. This CoR caching strategy ensures that the blocks needed for the booting process will be available in the cache.

The caches can be created in a variety of ways. The system can boot a sample VM upon registration of a new VM image to create its cache. It is also possible to create the cache lazily when a VM is booted from its image for the first time.

    2.3.3 Caching medium

Another design decision is the caching medium. Since the cache is a VM image by itself, it is possible to store it at the compute or storage nodes. The quota allows fine-grained control over how many resources should be dedicated to the cache image.

In the case of a slower network, caching on the compute nodes is an interesting option to reduce the load on the network. The small size of these cache images makes it possible to store many of them within a modest amount of disk resources at the compute nodes.

Another interesting medium is the storage node's memory, to compensate for the limited performance of its disk(s). The read requests coming from different VMs are mostly random in nature, and rotational disks do not handle this well. Memory (or a solid-state drive) provides much better performance under a random-access workload, but is scarcer in terms of capacity. The small size of our caches uses this capacity effectively under the VM boot workload. Furthermore, it is possible to store many of these caches in the storage node's memory. This is not possible with normal VM images, because their size is usually on the order of gigabytes. In Section 2.6, we compare various cache placement strategies based on the results of our evaluation.

    2.3.4 Cache-aware cloud scheduler

We discuss design considerations for a cache-aware cloud scheduler. Cloud schedulers are designed with various orthogonal goals. As an example, OpenNebula [65], an off-the-shelf cloud stack, has the following options for its scheduler:

• Packing: tries to minimize the number of nodes in use by packing the VMs on the same host.

• Striping: tries to allocate VMs to nodes in a striping fashion in order to provide maximum available resources for VMs.

• Load-aware mapping: tries to allocate VMs to the nodes with the lowest load in order to provide maximum available resources for VMs.

One goal of a cache-aware scheduler should be the allocation of VMs to nodes with an existing warm cache. This heuristic can be used in conjunction with any of the above strategies. Another task of a cache-aware scheduler should be the eviction of caches whenever the allocated cache space is full and a new cache must be created. This can follow a policy such as LRU at the node or cloud level. Further discussion of this topic is out of the scope of this chapter and is left for future work.
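As a sketch of how this warm-cache preference could be layered on top of an existing placement policy (our own illustration in C; the types and the fallback function are assumptions, not part of OpenNebula or any other cloud stack):

    #include <stdbool.h>
    #include <stddef.h>
    #include <stdio.h>

    /* Minimal sketch: prefer nodes that already hold a warm cache for the
     * requested VM image, otherwise defer to the stack's existing policy
     * (packing, striping, load-aware mapping, ...). */
    typedef struct Node {
        int  id;
        bool has_warm_cache;   /* warm cache for the requested VM image? */
    } Node;

    typedef const Node *(*fallback_fn)(const Node *nodes, size_t n);

    static const Node *pick_node(const Node *nodes, size_t n, fallback_fn fallback)
    {
        for (size_t i = 0; i < n; i++)
            if (nodes[i].has_warm_cache)
                return &nodes[i];
        return fallback(nodes, n);
    }

    /* Trivial stand-in for the underlying strategy: pick the first node. */
    static const Node *first_node(const Node *nodes, size_t n)
    {
        return n ? &nodes[0] : NULL;
    }

    int main(void)
    {
        Node nodes[] = { { 0, false }, { 1, true }, { 2, false } };
        const Node *chosen = pick_node(nodes, 3, first_node);
        printf("scheduling the VM on node %d\n", chosen->id);
        return 0;
    }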

    2.4 Implementation

We have implemented the VM image caches as an extension to the QCOW2 block driver of QEMU. Before explaining our solution, we first take a look at the QCOW2 image format and at block drivers in QEMU.

    2.4.1 QCOW2 image format

A QCOW2 image is a self-contained file with a number of meta-data structures followed by the actual data in clusters. The meta-data structures hold information about image parameters (e.g., size) and help with the translation of virtual block addresses (VBAs) to physical block addresses (PBAs).



Figure 2.6: A simplified QCOW2 image layout. The choice of cluster size determines the size of the L1-table and the number of data cluster pointers in L2-tables.

Figure 2.6 shows a simplified structure of a QCOW2 image. The first meta-data structure found in a QCOW2 file is the QCowHeader. Among other fields, it includes the cluster size, the image size, the backing file (if any), and an offset to the L1-table. QCOW2 uses a two-level look-up system with level 1 (L1) and level 2 (L2) tables. For a look-up operation, first the high n bits of the 64-bit VBA are used as an offset into the L1-table to find the corresponding L2-table. Then the next m bits are used as an offset within the L2-table to find the corresponding cluster offset within the image file. The remaining bits (the cluster bits, d) are used as an offset within the cluster. L2-tables also occupy one cluster. This means that, given the cluster size, it is easy to calculate n and m. For example, with the default cluster size of 64 KB (16 bits):

    d = 16 bits
    m = 16 − 3 (8-byte table entries) = 13 bits
    n = 64 − (16 + 13) = 35 bits
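To make the address split concrete, the following short C program (our own illustration, not QEMU source) computes the three indices for a given cluster size; table entries are 8 bytes, so an L2-table occupying one cluster holds 2^(cluster_bits − 3) pointers:

    #include <stdint.h>
    #include <stdio.h>

    /* Split a 64-bit virtual block address into its L1 index (n bits),
     * L2 index (m bits), and in-cluster offset (d bits). */
    static void split_vba(uint64_t vba, unsigned cluster_bits)
    {
        unsigned l2_bits = cluster_bits - 3;                    /* m */
        uint64_t offset  = vba & ((1ULL << cluster_bits) - 1);  /* low d bits  */
        uint64_t l2_idx  = (vba >> cluster_bits) & ((1ULL << l2_bits) - 1);
        uint64_t l1_idx  = vba >> (cluster_bits + l2_bits);     /* high n bits */

        printf("cluster_bits=%2u  l1_idx=%llu  l2_idx=%llu  offset=%llu\n",
               cluster_bits, (unsigned long long)l1_idx,
               (unsigned long long)l2_idx, (unsigned long long)offset);
    }

    int main(void)
    {
        uint64_t vba = 0x123456789ULL;
        split_vba(vba, 16);  /* 64 KB clusters: the QCOW2 default     */
        split_vba(vba, 9);   /* 512 B clusters: used for cache images */
        return 0;
    }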

The actual size of the L1-table depends on the number of L2-tables. The number of L2-tables depends on the image and cluster size. More details on the QCOW2 format can be found in [61].

    2.4.2 Block drivers in QEMU

Like any other block driver, QCOW2 needs to implement certain functions to be exported as a block driver in QEMU. The most relevant of these functions are create, open, close, read, and write. These functions are then used in two applications that come with the QEMU/KVM suite: qemu-img and qemu-kvm.

qemu-img is used for creating and/or manipulating virtual disk images. As an example, when creating a new QCOW2 image, qemu-img is invoked with the relevant parameters such as the image size or, in the case of QCOW2, the cluster size or the path to the backing file (if any). This information is then passed to the create function of the QCOW2 driver to prepare the requested image file.

qemu-kvm provides virtualization and device emulation. One of the emulated devices is the disk controller. When a VM is running under qemu-kvm, all its read and write requests to the disk controller are handled by the relevant block driver. In this case, the VM is completely unaware of the underlying block driver functionalities.

Once the caching mechanism is included in the QCOW2 block driver, qemu-kvm uses it seamlessly. qemu-img, however, must be invoked with the relevant arguments for creating and/or manipulating the cache images. We explain this further in Section 2.4.4.

    2.4.3 Cache extension

To support cache images, we needed to add two more fields to the QCowHeader of the QCOW2 image. These new 8-byte fields define the quota and the current size of the cache. It was not possible to reuse the size field of the QCowHeader, since it has to be the same as the base image's: the CoW image can in theory have the same size as the base image, and we decided not to propagate the differentiation between the CoW and the cache image throughout the source.
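A minimal sketch of the two added fields is shown below; the struct and field names are our own for illustration. In the actual patch the values are carried as a new extension to the QCowHeader rather than as literal header members, which is what preserves backward compatibility with normal QCOW2 images:

    #include <stdint.h>

    /* Illustrative names only (not QEMU source). */
    typedef struct QCowCacheExtension {
        uint64_t cache_quota;  /* maximum number of bytes the cache image may hold */
        uint64_t cache_size;   /* number of bytes currently stored in the cache    */
    } QCowCacheExtension;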

We now describe our modifications to the QCOW2 functions to support caches:

create: If the quota passed to the create function is not zero, it is assumed that the new image will be used as a cache. The create function then stores the quota and the current size of the cache (= size of the header and initial tables) as part of a new extension to the QCowHeader in the image file. Implementing these new fields as an extension ensures backward compatibility with normal QCOW2 images.

open: When opening a QCOW2 image, it is checked for our new caching extension. If the extension is detected, the two size fields are read into QEMU's main QCOW2 data structure and the image is treated as a cache image.

read: When we get a read on the cache image, two scenarios are possible. Either the data exists in the cache (warm cache), or the data needs to be fetched from the base image (cold cache). In the first scenario, the data is read from the cache image and returned to the image requesting it (the CoW image). In the second scenario, we recurse to the base image for the data. Once the data is available, we write it to the cache before returning it to the CoW image. We may get a space error when trying to write to the cache because the quota is full. In this case, we stop writing to the cache for future cold reads.

write: In our design, described in Section 2.3, the cache image is protected from writes coming from the VM. The only writes that the cache image receives are for warming it up (with data from the base image). Whenever we see a write to the cache, we check whether there is enough space left by looking at the quota and the currently used size. If there is enough space, we write the data to the cache and update the currently used size. If not, we return a space error that is handled by the read function described above. (A simplified sketch of this read/write interplay is given after the description of close below.)

close: When closing a QCOW2 image, if the cache quota field is present (i.e., it is a cache image), the (new) current size of the cache is written back to the image file.
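The following self-contained C sketch (our own illustration; the names and the in-memory "images" do not correspond to QEMU's internal block-driver API) summarizes the copy-on-read behavior of the read path together with the quota check of the write path:

    #include <stdbool.h>
    #include <stdint.h>
    #include <stdio.h>
    #include <string.h>

    #define IMG_SIZE 4096u   /* toy image: 8 clusters of 512 bytes */
    #define CLUSTER   512u

    typedef struct {
        uint8_t  data[IMG_SIZE];
        bool     present[IMG_SIZE / CLUSTER];
        uint32_t quota;         /* maximum bytes the cache may hold */
        uint32_t used;          /* bytes currently cached           */
        bool     stop_filling;  /* set once the quota is full       */
    } CacheImage;

    static uint8_t base_image[IMG_SIZE];  /* stands in for the remote base image */

    /* Warm the cache with data fetched from the base image; writes coming
     * from the VM itself never reach this function (they go to the CoW image). */
    static bool cache_append(CacheImage *c, uint32_t cluster)
    {
        if (c->used + CLUSTER > c->quota)
            return false;                              /* space error: quota full */
        memcpy(&c->data[cluster * CLUSTER],
               &base_image[cluster * CLUSTER], CLUSTER);
        c->present[cluster] = true;
        c->used += CLUSTER;
        return true;
    }

    /* Read one cluster: serve it from the cache if warm, otherwise recurse to
     * the base image and copy the result into the cache (copy-on-read). */
    static void cache_read(CacheImage *c, uint32_t cluster, uint8_t *buf)
    {
        if (!c->present[cluster]) {                    /* cold: fetch from base */
            if (!c->stop_filling && !cache_append(c, cluster))
                c->stop_filling = true;                /* stop warming from now on */
            memcpy(buf, &base_image[cluster * CLUSTER], CLUSTER);
            return;
        }
        memcpy(buf, &c->data[cluster * CLUSTER], CLUSTER);  /* warm hit */
    }

    int main(void)
    {
        CacheImage cache = { .quota = 2 * CLUSTER };
        uint8_t buf[CLUSTER];
        memset(base_image, 0xAB, sizeof(base_image));

        for (uint32_t i = 0; i < IMG_SIZE / CLUSTER; i++)
            cache_read(&cache, i, buf);
        printf("cached %u of %u bytes (quota %u)\n",
               cache.used, IMG_SIZE, cache.quota);
        return 0;
    }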

Other than these modifications to the QCOW2 block driver, we also had to change the permission flags of QEMU when opening an image. The default flag for backing images is read-only, and the cache image is used as a backing image for the CoW image. The cache image, however, needs write permission, at least at its creation time. It is not known at opening time whether an image is a cache image or a base image. To address this problem, we first open the backing image with read and write permissions, and then, if we detect that the image is not a cache image, we re-open it with read-only permission.

Since the QCOW2 driver already has the concept of recursion for its CoW feature, our introduction of cache images required minimal changes to QEMU's QCOW2 block driver. A complete patch against the original QEMU/KVM modifies about a hundred and fifty lines of code. With this design, we achieve both backward compatibility with QCOW2 and massive code reuse.

    2.4.4 Chaining cache images with qemu-img

With normal QCOW2 operation, first a CoW image is created using qemu-img. The base image is given to qemu-img as the CoW image's backing file. After that, a VM is started with qemu-kvm pointing to the CoW image as its boot disk.

With cache images, there is an extra step involved in creating the cache image. First, qemu-img is invoked with a cache quota, pointing to the base image as its backing file. This step creates the cache image. Second, qemu-img is invoked with no cache quota, pointing to the cache image as its backing file. This step creates the CoW image. The VM can now be started with the CoW image as its boot disk. With a warm cache, there is obviously no need to invoke qemu-img for creating the cache.
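The two-step chain creation could then look roughly as follows. This is a hypothetical sketch: the -b backing-file option is standard qemu-img, but the cache-quota option name ("cache_quota") and the file names are placeholders, since the exact flag added by the patched qemu-img is not given here:

    #include <stdlib.h>

    int main(void)
    {
        /* Step 1: create the cache image, backed by the base image
         * (cache quota of roughly 200 MB in this example).          */
        system("qemu-img create -f qcow2 -b base.qcow2 "
               "-o cache_quota=209715200 cache.qcow2");

        /* Step 2: create the CoW image, backed by the cache image. */
        system("qemu-img create -f qcow2 -b cache.qcow2 cow.qcow2");

        /* The VM is then started with cow.qcow2 as its boot disk; with an
         * already warm cache, step 1 is skipped.                          */
        return 0;
    }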

One of the benefits of our approach, as just discussed, is its simplicity for chaining cache disks. This makes it ideal for integration with any cloud stack that already supports QCOW2.


    2.5 Evaluation

We have conducted an extensive experimental evaluation of the VM image caching mechanism. We used the DAS4/VU [27] cluster as our evaluation testbed. Each standard DAS4/VU node is equipped with dual quad-core Intel E5620 CPUs running at 2.4 GHz, 24 GB of memory, and two Western Digital SATA 3.0-Gbps/7200-RPM/1-TB disks in a software RAID-0 configuration. The nodes are connected using commodity 1 Gb/s Ethernet and a premium Quad Data Rate (QDR) InfiniBand network providing a theoretical peak of 32 Gb/s.

Our experiments use up to 65 of these nodes, one of them acting as the storage node and up to 64 others as compute nodes. This is a common setup for small-scale private clouds. The storage node runs an off-the-shelf NFS server; the compute nodes mount the NFS location. We have tuned the NFS rwsize to 64 KB (the default cluster size of QCOW2), as the default NFS rwsize of 1 MB does not match well with the small read requests during boot time. This rwsize has been used for all experiments. In addition, we use Linux tmpfs and tmpfs exports to back (remote) files with memory when necessary.

For all the experiments described below, we have used a default installation of CentOS 6.3 as our VM image. The other images mentioned (Debian Linux and Windows Server) have only been used for estimating their cache size requirements.

We are mostly interested in the boot time of virtual machines. We measure the boot time as the time from invoking KVM to start the VM until the VM (automatically) connects back to a given port as soon as it has completed its boot process.
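A hypothetical measurement harness along these lines could be the following C program (our own sketch, not the tooling used in the experiments); it listens on a TCP port, starts the timer when the VM would be launched, and stops it when the VM connects back:

    #include <arpa/inet.h>
    #include <netinet/in.h>
    #include <stdio.h>
    #include <sys/socket.h>
    #include <time.h>
    #include <unistd.h>

    int main(void)
    {
        int srv = socket(AF_INET, SOCK_STREAM, 0);
        struct sockaddr_in addr = { .sin_family = AF_INET,
                                    .sin_port = htons(12345),
                                    .sin_addr.s_addr = htonl(INADDR_ANY) };
        bind(srv, (struct sockaddr *)&addr, sizeof(addr));
        listen(srv, 1);

        struct timespec start, end;
        clock_gettime(CLOCK_MONOTONIC, &start);
        /* ... fork/exec qemu-kvm with the CoW image as boot disk here ... */

        int conn = accept(srv, NULL, NULL);   /* the VM connects back once booted */
        clock_gettime(CLOCK_MONOTONIC, &end);
        printf("boot time: %.1f s\n",
               (end.tv_sec - start.tv_sec) + (end.tv_nsec - start.tv_nsec) / 1e9);
        close(conn);
        close(srv);
        return 0;
    }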


Figure 2.7: Caching the VM image on the compute node. The cache is created in the memory of the compute node to avoid slowing the VM down due to expensive writes. With a warm cache, there is no need to go to the network anymore.


    2.5.1 Cache creation

[Plot: booting time (seconds) versus cache size (MB); series: QCOW2, cold cache on disk, cold cache in memory, warm cache.]

    Figure 2.8: Cache creation overhead with increasing cache quota.

We study the performance of cache creation, the effect of the cache quota, and the reduction in storage node transfers with a warm cache. All these experiments use one storage node and one compute node. The base image is on an NFS export on the storage node, and the cache is created at the compute node. We use the 1 GbE network in these experiments. (The results are similar for the 32 Gb InfiniBand and are omitted for brevity.)

Figure 2.8 shows booting times with increasing cache quota, which controls the amount of data that can be stored in a cache image. The boot times with a warm cache are roughly the same as with the original QCOW2 mechanism, as expected. With a cold cache, however, writing into the cache file during boot time significantly slows down the boot process, due to delays from slow, synchronous writes to the cache image. To circumvent this problem, we create the cache in memory, such that the cache write operations do not delay the reads from the booting VM. This reduces the cache creation overhead to a negligible amount. When creating the cache in memory, the cache still needs to be written to disk. We delay this write until after the VM has been shut down, taking it out of the critical path for booting. Due to the small size of the cache, the transfer to disk takes less than one second. The booting time for the cold cache creation slightly decreases when moving from 80 MB to 120 MB, which is likely due to some reuse within the half-warm cache.

Figure 2.9 shows the observed traffic at the storage node. With a warm cache, we see less traffic with a bigger cache quota. An interesting observation is that a cold cache with the default QCOW2 cluster size of 64 KB causes more traffic than the original QCOW2.


[Plot: transfer size from the storage node (MB) versus cache size (MB); series: QCOW2, cold and warm cache with 64 KB clusters, cold and warm cache with 512 B clusters.]

    Figure 2.9: Observed traffic at the storage node with increasing cache quota.

Investigating further revealed that this is because small writes to the cache need to fetch more data from the base image to meet the cluster granularity. Reducing the cache cluster size to the minimum of 512 bytes (the sector size) circumvented a potentially unscalable cold cache.
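To illustrate the effect with our own numbers (not measured in the text): with 64 KB cache clusters, caching even a single 512-byte read forces a full 64 KB cluster to be fetched from the base image, up to 128 times the data the VM actually requested; with 512-byte cache clusters, the cache fetches only what was read.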

[Plot: booting time (seconds) and transfer size from the storage node (MB) versus cache size (MB), with the cache in memory and a 512 B cache cluster size; series: QCOW2, cold cache, and warm cache (boot time and transfer size for each).]

    Figure 2.10: Final arrangement for cache creation.

Figure 2.10 shows the observed performance (boot time and data transfer size) with the cold and warm cache when the cluster size is set to 512 bytes. The results show that with a careful choice of the cache cluster size and placement of the cold cache in memory, it is possible to make cache creation scalable with near-zero overhead. The same arrangement, also shown in Figure 2.7, is used for the scalability benchmarks in the rest of this section.

According to our discussion of the QCOW2 image format in Section 2.4.1, a smaller cluster size results in more frequent lookups and more L2-table entries for the cache image. As shown in Figure 2.10, the frequency of lookups does not affect the booting time, since most reads during boot are small and need a lookup anyway. For a cache quota of 200 MB, only 3.1 MB is necessary for L2-tables. Thus, we believe the smaller cluster size for the cache image is justified.
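As a sanity check (our own arithmetic, not taken from the measurements): with 512-byte clusters, an L2-table occupies 512 bytes and holds 512 / 8 = 64 pointers, so each L2-table maps 64 × 512 B = 32 KB of data. A 200 MB cache therefore needs about 200 MB / 32 KB = 6,400 L2-tables, or 6,400 × 512 B ≈ 3.2 MB of table metadata, in line with the 3.1 MB reported above.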

    2.5.2 Cache quota

    Table 2.2: Cache quota necessary for various VM images.

VM image               Warm cache size
CentOS 6.3             93 MB
Debian 6.0.7           40 MB
Windows Server 2012    201 MB

Table 2.2 shows the necessary cache quota for various VM images with a cache cluster size of 512 bytes. From Figure 2.10 it is clear that CentOS 6.3 needs a cache size of about 90 MB. The created cache image has about the same size on the file system. A Debian image, taken from the services image of an open-source PaaS [86], creates a cache image of 40 MB. A Windows Server needs a substantially bigger cache image for the boot workload. The numbers in Table 2.2 are slightly bigger than the read working set sizes shown in Table 2.1. The difference is caused by the metadata added by QCOW2 at various locations of the VM image file.

    2.5.3 Scaling

The microbenchmarks presented so far provide a good understanding of the individual caching behavior. We now turn to the primary concern of this work: scalability.

    Scaling nodes

As shown in Section 2.2, on-demand transfers have a scalability problem with commodity networks when booting one VM image over many compute nodes. We show that this can be resolved by our proposed cache images. In this experiment, 64 compute nodes start a VM from the same VM image simultaneously.

Figure 2.11 shows the average booting time of the VMs. With a cold cache, it takes about the same time to boot the VMs as with the original QCOW2. With a warm cache, the booting time over many compute nodes is similar to that of a single VM. These results suggest that the caches are effective in resolving the network bottleneck.


[Plot: booting time (seconds) versus number of nodes (1–64) over 1 GbE; series: QCOW2, cold cache, warm cache.]

Figure 2.11: Caching a single VM image at the compute nodes over 1 GbE.

In a real-life scenario, we do not expect all the nodes to start from a cold or a warm cache. Depending on the cloud node scheduler, some of the nodes may start from a cold cache and some from a warm cache. A cache-aware scheduler should always prefer the nodes with a warm cache. Studying a cache-aware node scheduler is left for future work. Regardless of the node allocations, the nodes with a warm cache contribute to reducing the network load on the storage node(s). (We do not, however, present quantitative results for such mixed scenarios in this chapter.)

    Scaling VM images

Networks are getting much faster than they used to be, and disk-based storage is not catching up. We have shown in Section 2.2 that the disks at the storage node become a severe scalability bottleneck when many VMs are booted from many VM images simultaneously. In this set of experiments, we show how caches can help address this problem. For all points in the graphs, 64 nodes boot VMs while sharing a varying number of VM images. The caching setup is the same as in Figure 2.7, where compute nodes store the caches on their local disks.

Figure 2.12 shows the effect of caching the VM images on the disks of the compute nodes. For the 1 GbE network, with a single VM image, the difference between warm caches and QCOW2 is the cost of the network bottleneck. This is the same bottleneck that we observed in Figure 2.11 at 64 nodes. Starting from 16 VM images, the storage node's disk becomes the primary source of the scalability bottleneck. Caching at the compute nodes avoids both scalability bottlenecks, at the network and at the storage node's disk. For the 32 Gb IB network, the caching avoids the scalability bottleneck of the storage node's disk. Since the network is not a scalability bottleneck in this scenario, the difference in booting time with more than a single VM image is only due to the bottleneck at the storage node's disks.


[Two plots: booting time (seconds) versus number of VM images (1–64) on 64 nodes, over 1 GbE and over 32 Gb IB; series: QCOW2, cold cache, warm cache.]

Figure 2.12: Caching many VM images at the compute nodes' disks over the two different networks.

Since the storage node's disks prove to be a more severe scalability problem than the network, another attractive caching strategy is caching in the memory of the storage node. Figure 2.13 shows a possible setup where the caches are created on the compute nodes and then transferred back to the storage node's memory before being used as a warm cache.

In this set of experiments, we have added the time of the cache transfers to the booting time with the cold cache, to reflect the fact that the cache image transfers are now a necessary part of the system. When VM images are shared between VMs, only one of the VMs creates and transfers the cache back to the storage node, while the other VMs just proceed with normal QCOW2.

Figure 2.14 shows the results over the two networks. In the case of the 1 GbE network, this caching strategy does not solve the network scalability bottleneck, but it does solve that of the storage node's disks.



Figure 2.13: Caching the VM image in the memory of the storage node. The cache is created in the memory of the compute node and then transferred to the memory of the storage node. With a warm cache in memory, there is no need to go to the storage disk anymore.

With the cold cache, the booting delay is slightly higher with 64 nodes, due to the transfer time. In the case of the 32 Gb IB network, the only scalability bottleneck is resolved without any overhead.

With caches in the storage node's memory, there needs to be a mechanism that decides on eviction from the cache pool. Strategies similar to the ones discussed in Section 2.3.4 can be applied here.

In the next section, we further discuss cache placement considerations based on the evaluation of this section.

    2.6 Cache Placement

In the previous section, we showed that VM image caches can be placed either on the compute nodes' disks or in the storage node's memory.

There are certain advantages to placing the caches in the storage node's memory, compared to the compute nodes' disks:


[Two plots: booting time (seconds) versus number of VM images (1–64) on 64 nodes, over 1 GbE and over 32 Gb IB; series: QCOW2, cold cache, warm cache.]

Figure 2.14: Caching many VM images in the storage node's memory over the two different networks.

    • The compute nodes do not need to reserve any disk space for the caches.

• There are fewer security concerns regarding the content of the VM images cached at any time on any compute node.

• The storage node's memory is efficiently used for the task at hand: transferring blocks of VM image data.

• A cache-aware scheduler can treat all compute nodes equally, as VM image caches are centrally available at the storage node.

Thus, in scenarios where the network is fast enough to handle on-demand transfers of many simultaneous VM startups, using solely the storage node's memory for placing the caches is the superior solution.


The only remaining problem is the question of whether a given network can handle such a workload. If this is not the case, caching on the compute nodes' disks is one possibility. Caching only on the compute nodes' disks, however, still leaves the possibility that the storage node's disks become a bottleneck in a multi-VM-image scenario. To address this, we recommend using caches at both the storage and the compute nodes. Algorithm 1 chains to the cache at the proper location, or creates one if necessary.

Algorithm 1: Chaining to a proper VM image cache.
    Input: Compute node C, Storage node S, VM image Base
    Output: A VM image to be chained to a CoW image
    if CacheBase exists in C then
        return CacheBase;
    end
    if CacheBase exists in S then
        if CacheBase is on disk then
            Copy CacheBase to tmpfs;
        end
        Create NewCacheBase on C;
        Chain NewCacheBase to CacheBase;
        return NewCacheBase;
    end
    Create CacheBase on C;
    Chain CacheBase to Base;
    Copy CacheBase to S on VM shutdown;
    return CacheBase;

We assume that there is a cache eviction policy, as described in Section 2.3.4, that removes caches when the reserved cache space is full. Algorithm 1 prefers chaining to a local cache (if it exists) to avoid the network as much as possible. If the cache does not exist at the compute node, it tries to create one while chaining to another cache in the storage node's memory, avoiding the storage node's disks.

Since, with a fast network, random access to remote memory can be faster than to the local disk, a VM might boot faster from the storage node's memory. This might conflict with Algorithm 1. We have investigated the severity of this effect with our machine setup from Section 2.5, using a CentOS VM image. Our results show at most a 1% difference in startup times between a cache on the compute node's disk and one in the storage node's memory. We hence consider the difference to be negligible.


    2.7 Background and related work

We divide the research related to VM image caches into four overlapping categories:

1. Efficient VM image transfer deals with the problem of moving VM images from storage nodes to compute nodes.

2. Efficient VM migration deals with the efficient migration of VMs, along with their state, from one node to another. The goal is to reduce downtime and improve the user-perceived experience.

3. Caching storage data covers the mature field of caching data in storage layers. We focus on the work relevant to VM images.

4. Virtualized disk performance discusses the advances in virtualized storage that directly apply to VM images.

Below, we discuss each category separately. Whenever appropriate, we make comparisons to ou