Cloud Experiences

DESCRIPTION

Sanger Institute's experiences with the cloud. Given at Green Datacentre & Cloud Control 2011.

TRANSCRIPT

  • Cloud Experiences
    • Guy Coates
    • Wellcome Trust Sanger Institute
    • [email_address]

The Sanger Institute

  • Funded by Wellcome Trust.
    • 2nd largest research charity in the world.
    • ~700 employees.
    • Based in Hinxton Genome Campus, Cambridge, UK.
  • Large-scale genomic research.
    • Sequenced 1/3 of the human genome (largest single contributor).
    • We have active cancer, malaria, pathogen and genomic variation / human health studies.
  • All data is made publicly available.
    • Websites, FTP, direct database access, programmatic APIs.

DNA Sequencing

  • [slide shows a wall of raw sequence text]
  • 250 million * 75-108 base fragments against the human genome (3 Gbases).

Moore's Law

  • Compute/disk doubles every 18 months.
  • Sequencing doubles every 12 months.

Economic Trends:

  • The Human Genome Project:
    • 13 years.
    • 23 labs.
    • $500 million.
  • A human genome today:
    • 3 days.
    • 1 machine.
    • $8,000.
  • Trend will continue:
    • $500 genome is probable within 3-5 years.

The scary graph

  • Peak yearly capillary sequencing: 30 Gbase
  • Current weekly sequencing: 6,000 Gbase

Our Science

UK10K Project

  • Decode the genomes of 10,000 people in the UK.
  • Will improve the understanding of human genetic variation and disease.

Genome Research Limited: "Wellcome Trust launches study of 10,000 human genomes in UK"; 24 June 2010. www.sanger.ac.uk/about/press/2010/100624-uk10k.html

New scale, new insights . . . to common disease

    • Coronary heart disease
    • Hypertension
    • Bipolar disorder
    • Arthritis
    • Obesity
    • Diabetes (types I and II)
    • Breast cancer
    • Malaria
    • Tuberculosis

Cancer Genome Project

  • Cancer is a disease caused by abnormalities in a cell's genome.

Detailed Changes:

  • Sequencing hundreds of cancer samples.
  • First comprehensive look at cancer genomes:
    • Lung cancer
    • Malignant melanoma
    • Breast cancer
  • Identify driver mutations for:
    • Improved diagnostics
    • Development of novel therapies
    • Targeting of existing therapeutics

"Lung cancer and melanoma laid bare"; 16 December 2009. www.sanger.ac.uk/about/press/2009/091216.html

IT Challenges

Managing Growth

  • Analysing the data takes a lot of compute and disk space
    • Finished sequence is the start of the problem, not the end.
  • Growth of compute & storage
    • Storage / compute doubles every 12 months.
      • 2010: ~12 PB raw.
  • Moore's law will not save us (a projection sketch follows below).
  • The $1,000 genome*
    • *Informatics not included.
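
To see why Moore's law cannot keep up, here is a minimal back-of-the-envelope sketch (not from the talk): it projects the ~12 PB of raw data quoted for 2010 forward using the doubling rates from the slides above; the 5-year horizon and everything else are my assumptions.

```python
# Illustrative projection only: storage demand doubling every 12 months vs.
# Moore's-law capacity doubling every 18 months, both starting from the
# ~12 PB raw figure quoted for 2010.
start_pb = 12.0

for years in range(6):
    demand = start_pb * 2 ** years             # doubles every 12 months
    capacity = start_pb * 2 ** (years / 1.5)   # doubles every 18 months
    print(f"2010+{years}: demand ~{demand:6.0f} PB vs Moore's-law ~{capacity:6.0f} PB")
```

After five years the projected demand (~384 PB) is already roughly three times what Moore's-law scaling would provide, which is the point of the slide.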

Sequencing data flow: [flow diagram: Sequencer → Processing/QC → Comparative analysis datastore → Internet; per-genome data volumes: raw data (10 TB), sequence (500 GB), alignments (200 GB), variation data (1 GB), features (3 MB); structured data (databases) and unstructured data (flat files)]

Data centre

  • 4 x 250 m² data centres.
    • 2-4 kW / m² cooling.
    • 1.8 MW power draw.
    • 1.5 PUE.
  • Overhead aircon, power and networking.
    • Allows counter-current cooling.
    • Focus on power- and space-efficient storage and compute.
  • Technology Refresh.
    • 1 data centre is an empty shell.
      • Rotate into the empty room every 4 years and refurb.
    • Fallow Field principle.

Our HPC Infrastructure

  • Compute
    • 8,500 cores.
    • 10 GigE / 1 GigE networking.
  • High-performance storage
    • 1.5 PB of DDN 9000 & 10000 storage.
    • Lustre filesystem.
  • LSF queuing system

Ensembl

  • Data visualisation / mining web services.
    • www.ensembl.org
    • Provides web / programmatic interfaces to genomic data (example sketch below).
    • 10k visitors / 126k page views per day.
  • Compute pipeline (HPTC workload)
    • Takes a raw genome and runs it through a compute pipeline to find genes and other features of interest.
    • Ensembl at Sanger/EBI provides automated analysis for 51 vertebrate genomes.
    • Software is open source (Apache license); data is free to download.
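
As a concrete illustration of the "programmatic interfaces" bullet, the sketch below queries Ensembl's present-day REST service. The REST API postdates this 2011 talk, and the endpoint, example gene symbol and response fields are my assumptions about the public service, not something taken from the slides.

```python
# Hedged example: look up a gene by symbol via rest.ensembl.org and print the
# genomic coordinates it reports. BRCA2 is just an example symbol.
import json
import urllib.request

url = ("https://rest.ensembl.org/lookup/symbol/homo_sapiens/BRCA2"
       "?content-type=application/json")
with urllib.request.urlopen(url) as resp:
    gene = json.load(resp)

print(gene["display_name"], gene["seq_region_name"], gene["start"], gene["end"])
```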

Sequencing data flow: [the earlier flow diagram, now annotated: the HPC compute pipeline handles processing/QC and comparative analysis; the web / database infrastructure serves the structured data (databases) and unstructured data (flat files) to the Internet]

Annotation: [slide shows a wall of raw sequence text, before and after annotation]

Why Cloud?

Web Services

  • Ensembl has a worldwide audience.
  • Historically, web site performance was not great, especially for non-European institutes.
    • Pages were quite heavyweight.
    • Not properly cached, etc.
  • The web team spent a lot of time re-designing the code to make it more streamlined.
    • Greatly improved performance.
  • Coding can only get you so far.
    • 150-240 ms round-trip time from Europe to the US.
    • We need a set of geographically dispersed mirrors (a small measurement sketch follows below).
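
A rough way to see the effect of geography (not from the talk): time a full page fetch from wherever you are. The URL is real; the sample count and timing method are arbitrary choices of mine.

```python
# Time a few HTTPS fetches of the Ensembl front page and report the median.
import time
import urllib.request

URL = "https://www.ensembl.org"

samples = []
for _ in range(3):
    start = time.perf_counter()
    with urllib.request.urlopen(URL, timeout=30) as resp:
        resp.read()
    samples.append(time.perf_counter() - start)

print(f"median front-page fetch: {sorted(samples)[1] * 1000:.0f} ms")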

Colocation

  • Real machines in a co-lo facility in California.
    • Traditional mirror.
  • Hardware was initially configured on site.
    • 16 servers, SAN storage, SAN switches, SAN management appliance, Ethernet switches, firewall, out-of-band management etc.
  • Shipped to the co-lo for installation.
    • Sent a person to California for 3 weeks.
    • Spent 1 week getting stuff into/out of customs.
      • ****ing FCC paperwork!
  • Additional infrastructure work.
    • VPN between UK and US.
  • Incredibly time consuming.
    • Really don't want to end up having to send someone on a plane to the US to fix things.

Cloud Opportunities

  • We wanted more mirrors.
    • US East Coast, Asia-Pacific.
  • Investigations into AWS were already ongoing.
  • Many people would like to run the Ensembl webcode to visualise their own data.
    • Non-trivial for the non-expert user.
      • MySQL, Apache, Perl.
  • Can we distribute AMIs instead?
    • Ready to run.
  • Can we eat our own dog food?
    • Run the mirror site from the AMIs? (A launch sketch follows below.)
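
A minimal sketch of the "ready to run" idea, assuming a pre-built AMI and present-day boto3 tooling (neither of which comes from the talk); the AMI ID, instance type and key name are placeholders.

```python
# Launch one instance from a hypothetical Ensembl webcode AMI with boto3.
import boto3

ec2 = boto3.resource("ec2", region_name="us-east-1")

instances = ec2.create_instances(
    ImageId="ami-00000000",     # placeholder AMI ID
    InstanceType="m5.large",    # arbitrary instance type
    KeyName="my-keypair",       # placeholder SSH key pair
    MinCount=1,
    MaxCount=1,
)
print("launched", instances[0].id)
```

A real mirror would additionally need its database volumes, security groups and VPN configuration, as the following slides describe.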

What we actually did: [diagram: Sanger connected to AWS over a VPN]

Building a mirror on AWS

  • Application development was required
    • Significant code changes required to make the webcode mirror aware.
      • Mostly done for the original co-location site.
  • Some software development / sysadmin work needed.
    • Preparation of OS images, software stack configuration.
    • VPN configuration.
  • A significant amount of tuning was required.
    • Initial MySQL performance was pretty bad, especially for the large Ensembl databases (~1 TB).
    • Lots of people run Apache/MySQL on AWS, so there is a good amount of best practice available.

Traffic

Is it cost effective?

  • Lots of misleading cost statements made about cloud.
    • "Our analysis only cost $500."
    • "CPU is only $0.085 / hr."
  • What are we comparing against?
    • Doing the analysis once? Continually?
    • Buying a $2,000 server?
    • Leasing a $2,000 server for 3 years?
    • Using $150 of time at your local supercomputing facility?
    • Buying a $2,000 server but having to build a $1M datacentre to put it in?
  • Requires the dreaded Total Cost of Ownership (TCO) calculation.
    • hardware + power + cooling + facilities + admin/developers etc.
      • Incredibly hard to do.

Breakdown:

  • Comparing costs to the real Co-lo
    • Power and cooling costs are all included.
    • Admin costs are the same, so we can ignore them.
      • Same people responsible for both.
  • Cost for the co-location facility:
    • $120,000 hardware + $51,000 / yr colo.
    • $91,000 per year (3-year hardware lifetime).
  • Cost for the AWS site:
    • $84,000 per year.
  • We can run 3 mirrors for ~90% of the cost of 1 mirror (the arithmetic is sketched below).
  • It is not free!
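
A worked version of the numbers above, as a minimal sketch; the only assumption added here is that the $84,000/yr AWS figure covers the whole multi-mirror deployment, which is how the "3 mirrors for ~90% of the cost of 1" comparison appears to be drawn.

```python
# Reproduce the cost comparison from the slide.
colo_hardware = 120_000       # one-off hardware spend ($)
colo_lifetime_years = 3
colo_fee_per_year = 51_000    # co-location facility fee ($/yr)

colo_per_year = colo_hardware / colo_lifetime_years + colo_fee_per_year
aws_per_year = 84_000         # quoted AWS cost ($/yr)

print(f"co-lo mirror:   ${colo_per_year:,.0f}/yr")   # $91,000/yr
print(f"AWS deployment: ${aws_per_year:,.0f}/yr, "
      f"{aws_per_year / colo_per_year:.0%} of one co-lo mirror")  # ~92%
```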

Advantages

  • No physical hardware.
    • Work can start as soon as we enter our credit card numbers...
    • No US customs, FedEx, etc.
  • Less hardware:
    • No Firewalls, SAN management appliances etc.
  • Much simpler management infrastructure.
      • AWS gives you out-of-band management for free.
      • No hardware issues.
  • Easy path for growth.
    • No space constraints.
      • No need to get tin decommissioned /re-installed at Co-lo.
    • Add more machines until we run out of cash.

Downsides

  • Underestimated the time it would take to make the web-code mirror-ready.
    • Not a cloud specific problem, but something to be aware of when you take big applications and move them outside your home institution.
  • Curation of software images takes time.
    • Regular releases of new data and code.
    • The Ensembl team now has a dedicated person responsible for the cloud.
    • Somebody has to look after the systems.
  • Management overhead does not necessarily go down.
    • But it does change.

Going forward

  • Change code to remove all dependencies on Sanger.
    • Full DR capability.
  • Make the AMIs publicly available.
    • Today we have MySQL servers + data.
      • Data generously hosted on Amazon public datasets.
    • Allow users to simply run their own sites (a data-attachment sketch follows below).
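
A hedged sketch of how an end user might attach a copy of the hosted data to their own instance, assuming a publicly shared EBS snapshot and present-day boto3 calls; the snapshot ID, region, availability zone, instance ID and device name are placeholders.

```python
# Create a volume from a (hypothetical) public dataset snapshot and attach it.
import boto3

ec2 = boto3.client("ec2", region_name="us-east-1")   # placeholder region

vol = ec2.create_volume(SnapshotId="snap-00000000",   # placeholder snapshot
                        AvailabilityZone="us-east-1a")

# Wait for the volume, then attach it to the user's own mirror instance.
ec2.get_waiter("volume_available").wait(VolumeIds=[vol["VolumeId"]])
ec2.attach_volume(VolumeId=vol["VolumeId"],
                  InstanceId="i-0123456789abcdef0",   # placeholder instance ID
                  Device="/dev/sdf")
```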

HPC Workloads

Why HPC in the Cloud?

  • We already have a data-centre.
    • Not seeking to replace our existing infrastructure.
    • Not cost effective.
  • But: long lead times for installing kit.
    • ~3-6 months from idea to going live.
    • Longer than the science can wait.
    • The ability to burst capacity might be useful.
  • Test environments.
    • Test at scale.
    • Large clusters for a short amount of time.

Distributing analysis tools

  • Sequencing is becoming a commodity.
  • Informatics / analysis tools need to become a commodity too.
  • Currently requires a significant amount of domain knowledge.
    • Complicated software installs, relational databases etc.
  • Goal:
    • Researcher with no IT knowledge can take their sequence data, upload it to AWS, get it analysed and view the results.

Life Sciences HPC Workloads: [chart classifying workloads (modelling/docking, simulation, genomics) along two axes: tightly coupled (MPI) vs embarrassingly parallel, and CPU-bound vs IO-bound]

Our Workload

  • Embarrassingly Parallel.
    • Lots of single-threaded jobs.
    • 10,000s of jobs.
    • Core algorithms in C.
    • Perl pipeline manager to generate and manage the workflow.
    • Batch scheduler to execute jobs on nodes (an array-job sketch follows below).
    • MySQL database to hold results & state.
  • Moderate memory sizes.
    • 3 GB/core
  • IO bound.
    • Fast parallel filesystems.
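
To make the workload shape concrete, here is a minimal sketch of submitting that kind of job array to LSF (the scheduler named earlier in the talk); the job name, chunk count and analysis command are hypothetical, and a real pipeline would generate them from its MySQL state database.

```python
# Submit a 10,000-task LSF job array; each task reads its index from LSB_JOBINDEX.
import subprocess

n_chunks = 10_000   # one single-threaded job per input chunk

subprocess.run([
    "bsub",
    "-J", f"align[1-{n_chunks}]",            # job array: name and index range
    "-o", "logs/align.%J.%I.out",            # per-task log (%J = job ID, %I = index)
    "analyse_chunk --chunk $LSB_JOBINDEX",   # placeholder analysis command
], check=True)
```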

Life Sciences HPC Workloads: [the workload chart, shown again]

Different Architectures: [diagram contrasting a conventional HPC cluster (CPUs on a fat network with a POSIX global filesystem and a batch scheduler) against a cloud-style cluster (CPUs on a thin network with local storage and S3; Hadoop?)]

Careful choice of problem:

  • Choose a simple part of the pipeline
    • Re-factor all the code that expects a global filesystem and make it use S3 (a sketch follows below).
  • Why not use Hadoop?
    • Production code that works nicely inside Sanger.
    • Vast effort to port the code, for little benefit.
    • Questions about stability for multi-user systems internally.
  • Build a self-assembling HPC cluster.
    • Code which spins up AWS images that self-assemble into an HPC cluster and batch scheduler.
  • Cloud allows you to simplify.
    • Sanger compute cluster is shared.
      • Lots of complexity in ensuring applications/users play nicely together.
    • AWS clusters are unique to a user/application.
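
A minimal sketch of the S3 refactor mentioned above, assuming each job works on local scratch disk; the bucket layout and the analyse() placeholder are illustrative, not Sanger's actual pipeline.

```python
# Each job: pull its input chunk from S3, process it locally, push results back.
import boto3

s3 = boto3.client("s3", region_name="us-east-1")   # placeholder region
BUCKET = "example-pipeline-bucket"                 # hypothetical bucket

def analyse(src: str, dst: str) -> None:
    # Stand-in for the real C analysis code (normally run as a subprocess).
    with open(src, "rb") as fin, open(dst, "wb") as fout:
        fout.write(fin.read())

def run_chunk(chunk_id: int) -> None:
    local_in = f"/tmp/chunk-{chunk_id}.in"
    local_out = f"/tmp/chunk-{chunk_id}.out"
    # What used to be a read from the POSIX global filesystem...
    s3.download_file(BUCKET, f"input/chunk-{chunk_id}", local_in)
    analyse(local_in, local_out)
    # ...and what used to be a write to shared storage.
    s3.upload_file(local_out, BUCKET, f"output/chunk-{chunk_id}")
```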

The real problem: the Internet

  • Data transfer rates (gridFTP/FDT via our 2 Gbit/s site link).
    • Cambridge -> EC2 East Coast: 12 Mbytes/s (96 Mbits/s).
    • Cambridge -> EC2 Dublin: 25 Mbytes/s (200 Mbits/s).
    • 11 hours to move 1 TB to Dublin.
    • 23 hours to move 1 TB to the East Coast (worked out below).
  • What speed should we get?
    • Once we leave JANET (UK academic network) finding out what the connectivity is and what we should expect is almost impossible.
  • Do you have fast enough disks at each end to keep the network full?
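
The transfer times quoted above follow directly from size / rate; a quick check (assuming decimal terabytes):

```python
# time = size / rate, for the two measured transfer rates.
TB = 1e12  # bytes

for dest, mb_per_s in [("EC2 Dublin", 25), ("EC2 East Coast", 12)]:
    hours = TB / (mb_per_s * 1e6) / 3600
    print(f"1 TB to {dest} at {mb_per_s} MB/s: ~{hours:.0f} h")
# ~11 h to Dublin, ~23 h to the East Coast, matching the slide.
```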

Networking

  • How do we improve data transfers across the public internet?
    • CERN approach: don't.
    • 10 Gbit dedicated network between CERN and the T1 centres.
  • Can it work for cloud?
    • Buy dedicated bandwidth to a provider.
      • Ties you in.
      • Should they pay?
  • What happens when you want to move?

Summary

  • Moving existing HPC applications is painful.
  • Small-data / high-CPU applications work really well.
  • Large-data applications, less well.

Data Security

Are you allowed to put data on the cloud?

  • Default policy:
    • Our data is confidential/important/critical to our business.
    • We must keep our data on our computers.
    • Apart from when we outsource it already.

Reasons to be optimistic:

  • Most (all?) data security issues can be dealt with.
    • But the devil is in the details.
    • Data can be put on the cloud, if care is taken.
  • It is probably more secure there than in your own data-centre.
    • Can you match AWS data availability guarantees?
  • Are cloud providers different from any other organisation you outsource to?

Outstanding Issues

  • Audit and compliance:
    • If you need IP agreements above your provider's standard T&Cs, how do you push them through?
  • Geographical boundaries mean little in the cloud.
    • Data can be replicated across national boundaries without the end user being aware.
  • Moving personally identifiable data outside of the EU is potentially problematic.
    • (Can be problematic within the EU; privacy laws are not as harmonised as you might think.)
    • More sequencing experiments are trying to link with phenotype data (i.e. personally identifiable medical records).

Private Cloud to the rescue?

  • Can we do something different?

Traditional Collaboration: [diagram: a DCC (sequencing centre + archive) linked to several sequencing centres, each running its own IT]

Dark Archives

  • Storing data in an archive is not particularly useful.
    • You need to be able to access the data and do something useful with it.
  • Data in current archives is dark.
    • You can put/get data, but cannot compute across it.
    • Is data in an inaccessible archive really useful?

Private Cloud Collaborations: [diagram: sequencing centres sharing private cloud IaaS / SaaS platforms]

Private Cloud

  • Advantages:
    • Small organisations leverage the expertise of big IT organisations.
    • Academia tends to be linked by fast research networks.
      • Moving data is easier (move compute to the data via VMs).
    • Consortium will be signed up to data-access agreements.
      • Simplifies data governance.
  • Problems:
    • Big change in funding model.
    • Are big centres set up to provide private cloud services?
      • Selling services is hard if you are a charity.
    • Can we do it as well as the big internet companies?

Summary

  • Cloud is a useful tool.
    • Will not replace our local IT infrastructure.
  • Porting existing applications can be hard.
    • Do not underestimate time / people.
  • Still need IT staff.
    • End up doing different things.

Acknowledgements

  • Sanger
    • Phil Butcher, James Beal, Pete Clapham, Simon Kelley, Gen-Tao Chiang
    • Steve Searle, Jan-Hinnerk Vogel, Bronwen Aken
  • EBI
    • Glenn Proctor, Steve Keenan