Cloud Experiences

DESCRIPTION

Sanger Institute's experiences with the cloud. Given at Green Datacentre & Cloud Control 2011.

TRANSCRIPT

  • Cloud Experiences
    • Guy Coates
    • Wellcome Trust Sanger Institute
    • [email_address]

The Sanger Institute

  • Funded by Wellcome Trust.
    • 2nd largest research charity in the world.
    • ~700 employees.
    • Based in Hinxton Genome Campus, Cambridge, UK.
  • Large-scale genomic research.
    • Sequenced 1/3 of the human genome (largest single contributor).
    • We have active cancer, malaria, pathogen and genomic variation / human health studies.
  • All data is made publicly available.
    • Websites, FTP, direct database access, programmatic APIs.

DNA Sequencing

  • [slide shows a wall of raw sequence text]
  • 250 million * 75-108 base fragments against the human genome (3 Gbases).

Moore's Law

  • Compute/disk doubles every 18 months.
  • Sequencing doubles every 12 months.

Economic Trends:

  • The Human Genome Project:
    • 13 years.
    • 23 labs.
    • $500 million.
  • A human genome today:
    • 3 days.
    • 1 machine.
    • $8,000.
  • Trend will continue:
    • $500 genome is probable within 3-5 years.

The scary graph

  • Peak yearly capillary sequencing: 30 Gbase
  • Current weekly sequencing: 6,000 Gbase

Our Science

UK10K Project

  • Decode the genomes of 10,000 people in the UK.
  • Will improve the understanding of human genetic variation and disease.

Genome Research Limited: "Wellcome Trust launches study of 10,000 human genomes in UK"; 24 June 2010. www.sanger.ac.uk/about/press/2010/100624-uk10k.html

New scale, new insights . . . to common disease

    • Coronary heart disease
    • Hypertension
    • Bipolar disorder
    • Arthritis
    • Obesity
    • Diabetes (types I and II)
    • Breast cancer
    • Malaria
    • Tuberculosis

Cancer Genome Project

  • Cancer is a disease caused by abnormalities in a cell's genome.

Detailed Changes:

  • Sequencing hundreds of cancer samples.
  • First comprehensive look at cancer genomes:
    • Lung cancer
    • Malignant melanoma
    • Breast cancer
  • Identify driver mutations for:
    • Improved diagnostics
    • Development of novel therapies
    • Targeting of existing therapeutics

"Lung cancer and melanoma laid bare"; 16 December 2009. www.sanger.ac.uk/about/press/2009/091216.html

IT Challenges

Managing Growth

  • Analysing the data takes a lot of compute and disk space
    • Finished sequence is the start of the problem, not the end.
  • Growth of compute & storage
    • Storage / compute doubles every 12 months.
      • 2010: ~12 PB raw.
  • Moore's law will not save us (a projection sketch follows below).
  • The $1,000 genome*
    • *Informatics not included.
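
To see why Moore's law cannot keep up, here is a minimal back-of-the-envelope sketch (not from the talk): it projects the ~12 PB of raw data quoted for 2010 forward using the doubling rates from the slides above; the 5-year horizon and everything else are my assumptions.

```python
# Illustrative projection only: storage demand doubling every 12 months vs.
# Moore's-law capacity doubling every 18 months, both starting from the
# ~12 PB raw figure quoted for 2010.
start_pb = 12.0

for years in range(6):
    demand = start_pb * 2 ** years             # doubles every 12 months
    capacity = start_pb * 2 ** (years / 1.5)   # doubles every 18 months
    print(f"2010+{years}: demand ~{demand:6.0f} PB vs Moore's-law ~{capacity:6.0f} PB")
```

After five years the projected demand (~384 PB) is already roughly three times what Moore's-law scaling would provide, which is the point of the slide.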

Sequencing data flow: [flow diagram: Sequencer → Processing/QC → Comparative analysis datastore → Internet; per-genome data volumes: raw data (10 TB), sequence (500 GB), alignments (200 GB), variation data (1 GB), features (3 MB); structured data (databases) and unstructured data (flat files)]

Data centre

  • 4 x 250 m² data centres.
    • 2-4 kW / m² cooling.
    • 1.8 MW power draw.
    • 1.5 PUE.
  • Overhead aircon, power and networking.
    • Allows counter-current cooling.
    • Focus on power- and space-efficient storage and compute.
  • Technology Refresh.
    • 1 data centre is an empty shell.
      • Rotate into the empty room every 4 years and refurb.
    • Fallow Field principle.

Our HPC Infrastructure

  • Compute
    • 8,500 cores.
    • 10 GigE / 1 GigE networking.
  • High-performance storage
    • 1.5 PB of DDN 9000 & 10000 storage.
    • Lustre filesystem.
  • LSF queuing system

Ensembl

  • Data visualisation / mining web services.
    • www.ensembl.org
    • Provides web / programmatic interfaces to genomic data (example sketch below).
    • 10k visitors / 126k page views per day.
  • Compute pipeline (HPTC workload)
    • Takes a raw genome and runs it through a compute pipeline to find genes and other features of interest.
    • Ensembl at Sanger/EBI provides automated analysis for 51 vertebrate genomes.
    • Software is open source (Apache license); data is free to download.
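
As a concrete illustration of the "programmatic interfaces" bullet, the sketch below queries Ensembl's present-day REST service. The REST API postdates this 2011 talk, and the endpoint, example gene symbol and response fields are my assumptions about the public service, not something taken from the slides.

```python
# Hedged example: look up a gene by symbol via rest.ensembl.org and print the
# genomic coordinates it reports. BRCA2 is just an example symbol.
import json
import urllib.request

url = ("https://rest.ensembl.org/lookup/symbol/homo_sapiens/BRCA2"
       "?content-type=application/json")
with urllib.request.urlopen(url) as resp:
    gene = json.load(resp)

print(gene["display_name"], gene["seq_region_name"], gene["start"], gene["end"])
```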

Sequencing data flow: [the earlier flow diagram, now annotated: the HPC compute pipeline handles processing/QC and comparative analysis; the web / database infrastructure serves the structured data (databases) and unstructured data (flat files) to the Internet]

Annotation: [slide shows a wall of raw sequence text, before and after annotation]

Why Cloud?

Web Services

  • Ensembl has a worldwide audience.
  • Historically, web site performance was not great, especially for non-European institutes.
    • Pages were quite heavyweight.
    • Not properly cached, etc.
  • The web team spent a lot of time re-designing the code to make it more streamlined.
    • Greatly improved performance.
  • Coding can only get you so far.
    • 150-240 ms round-trip time from Europe to the US.
    • We need a set of geographically dispersed mirrors (a small measurement sketch follows below).
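
A rough way to see the effect of geography (not from the talk): time a full page fetch from wherever you are. The URL is real; the sample count and timing method are arbitrary choices of mine.

```python
# Time a few HTTPS fetches of the Ensembl front page and report the median.
import time
import urllib.request

URL = "https://www.ensembl.org"

samples = []
for _ in range(3):
    start = time.perf_counter()
    with urllib.request.urlopen(URL, timeout=30) as resp:
        resp.read()
    samples.append(time.perf_counter() - start)

print(f"median front-page fetch: {sorted(samples)[1] * 1000:.0f} ms")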

Colocation

  • Real machines in a co-lo facility in California.
    • Traditional mirror.
  • Hardware was initially configured on site.
    • 16 servers, SAN storage, SAN switches, SAN management appliance, Ethernet switches, firewall, out-of-band management etc.
  • Shipped to the co-lo for installation.
    • Sent a person to California for 3 weeks.
    • Spent 1 week getting stuff into/out of customs.
      • ****ing FCC paperwork!
  • Additional infrastructure work.
    • VPN between UK and US.
  • Incredibly time consuming.
    • Really don't want to end up having to send someone on a plane to the US to fix things.

Cloud Opportunities

  • We wanted more mirrors.
    • US East Coast, Asia-Pacific.
  • Investigations into AWS were already ongoing.
  • Many people would like to run the Ensembl webcode to visualise their own data.
    • Non-trivial for the non-expert user.
      • MySQL, Apache, Perl.
  • Can we distribute AMIs instead?
    • Ready to run.
  • Can we eat our own dog food?
    • Run the mirror site from the AMIs? (A launch sketch follows below.)
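
A minimal sketch of the "ready to run" idea, assuming a pre-built AMI and present-day boto3 tooling (neither of which comes from the talk); the AMI ID, instance type and key name are placeholders.

```python
# Launch one instance from a hypothetical Ensembl webcode AMI with boto3.
import boto3

ec2 = boto3.resource("ec2", region_name="us-east-1")

instances = ec2.create_instances(
    ImageId="ami-00000000",     # placeholder AMI ID
    InstanceType="m5.large",    # arbitrary instance type
    KeyName="my-keypair",       # placeholder SSH key pair
    MinCount=1,
    MaxCount=1,
)
print("launched", instances[0].id)
```

A real mirror would additionally need its database volumes, security groups and VPN configuration, as the following slides describe.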

What we actually did: [diagram: Sanger connected to AWS over a VPN]

Building a mirror on AWS

  • Application development was required
    • Significant code changes required to make the webcode mirror aware.
      • Mostly done for the original co-location site.
  • Some software development / sysadmin work needed.
    • Preparation of OS images, software stack configuration.
    • VPN configuration.
  • A significant amount of tuning was required.
    • Initial MySQL performance was pretty bad, especially for the large Ensembl databases (~1 TB).
    • Lots of people run Apache/MySQL on AWS, so there is a good amount of best practice available.

Traffic

Is it cost effective?

  • Lots of misleading cost statements made about cloud.
    • "Our analysis only cost $500."
    • "CPU is only $0.085 / hr."
  • What are we comparing against?
    • Doing the analysis once? Continually?
    • Buying a $2,000 server?
    • Leasing a $2,000 server for 3 years?
    • Using $150 of time at your local supercomputing facility?
    • Buying a $2,000 server but having to build a $1M datacentre to put it in?
  • Requires the dreaded Total Cost of Ownership (TCO) calculation.
    • hardware + power + cooling + facilities + admin/developers etc.
      • Incredibly hard to do.

Breakdown:

  • Comparing costs to the real Co-lo
    • Power and cooling costs are all included.
    • Admin costs are the same, so we can ignore them.
      • Same people responsible for both.
  • Cost for the co-location facility:
    • $120,000 hardware + $51,000 / yr colo.
    • $91,000 per year (3-year hardware lifetime).
  • Cost for the AWS site:
    • $84,000 per year.
  • We can run 3 mirrors for ~90% of the cost of 1 mirror (the arithmetic is sketched below).
  • It is not free!
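
A worked version of the numbers above, as a minimal sketch; the only assumption added here is that the $84,000/yr AWS figure covers the whole multi-mirror deployment, which is how the "3 mirrors for ~90% of the cost of 1" comparison appears to be drawn.

```python
# Reproduce the cost comparison from the slide.
colo_hardware = 120_000       # one-off hardware spend ($)
colo_lifetime_years = 3
colo_fee_per_year = 51_000    # co-location facility fee ($/yr)

colo_per_year = colo_hardware / colo_lifetime_years + colo_fee_per_year
aws_per_year = 84_000         # quoted AWS cost ($/yr)

print(f"co-lo mirror:   ${colo_per_year:,.0f}/yr")   # $91,000/yr
print(f"AWS deployment: ${aws_per_year:,.0f}/yr, "
      f"{aws_per_year / colo_per_year:.0%} of one co-lo mirror")  # ~92%
```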

Advantages

  • No physical hardware.
    • Work can start as soon as we enter our credit card numbers...
    • No US customs, FedEx, etc.
  • Less hardware:
    • No Firewalls, SAN management appliances etc.
  • Much simpler management infrastructure.
      • AWS gives you out-of-band management for free.
      • No hardware issues.
  • Easy path for growth.
    • No space constraints.
      • No need to get tin decommissioned /re-installed at Co-lo.
    • Add more machines until we run out of cash.

Downsides

  • Underestimated the time it would take to make the web-code mirror-ready.
    • Not a cloud specific problem, but something to be aware of when you take big applications and move them outside your home institution.
  • Curation of software images takes time.
    • Regular releases of new data and code.
    • The Ensembl team now has a dedicated person responsible for the cloud.
    • Somebody has to look after the systems.
  • Management overhead does not necessarily go down.
    • But it does change.

Going forward

  • Change code to remove all dependencies on Sanger.
    • Full DR capability.
  • Make the AMIs publicly available.
    • Today we have MySQL servers + data.
      • Data generously hosted on Amazon public datasets.
    • Allow users to simply run their own sites (a data-attachment sketch follows below).
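
A hedged sketch of how an end user might attach a copy of the hosted data to their own instance, assuming a publicly shared EBS snapshot and present-day boto3 calls; the snapshot ID, region, availability zone, instance ID and device name are placeholders.

```python
# Create a volume from a (hypothetical) public dataset snapshot and attach it.
import boto3

ec2 = boto3.client("ec2", region_name="us-east-1")   # placeholder region

vol = ec2.create_volume(SnapshotId="snap-00000000",   # placeholder snapshot
                        AvailabilityZone="us-east-1a")

# Wait for the volume, then attach it to the user's own mirror instance.
ec2.get_waiter("volume_available").wait(VolumeIds=[vol["VolumeId"]])
ec2.attach_volume(VolumeId=vol["VolumeId"],
                  InstanceId="i-0123456789abcdef0",   # placeholder instance ID
                  Device="/dev/sdf")
```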

HPC Workloads

Why HPC in the Cloud?

  • We already have a data-centre.
    • Not seeking to replace our existing infrastructure.
    • Not cost effective.
  • But: long lead times for installing kit.
    • ~3-6 months from idea to going live.
    • Longer than the science can wait.
    • The ability to burst capacity might be useful.
  • Test environments.
    • Test at scale.
    • Large clusters for a short amount of time.

Distributing analysis tools

  • Sequencing is becoming a commodity.
  • Informatics / analysis tools need to become a commodity too.
  • Currently requires a significant amount of domain knowledge.
    • Complicated software installs, relational databases etc.
  • Goal:
    • Researcher with no IT knowledge can take their sequence data, upload it to AWS, get it analysed and view the results.

Life Sciences HPC Workloads: [chart classifying workloads (modelling/docking, simulation, genomics) along two axes: tightly coupled (MPI) vs embarrassingly parallel, and CPU-bound vs IO-bound]

Our Workload

  • Embarrassingly Parallel.
    • Lots of single-threaded jobs.
    • 10,000s of jobs.
    • Core algorithms in C.
    • Perl pipeline manager to generate and manage the workflow.
    • Batch scheduler to execute jobs on nodes (an array-job sketch follows below).
    • MySQL database to hold results & state.
  • Moderate memory sizes.
    • 3 GB/core
  • IO bound.
    • Fast parallel filesystems.
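
To make the workload shape concrete, here is a minimal sketch of submitting that kind of job array to LSF (the scheduler named earlier in the talk); the job name, chunk count and analysis command are hypothetical, and a real pipeline would generate them from its MySQL state database.

```python
# Submit a 10,000-task LSF job array; each task reads its index from LSB_JOBINDEX.
import subprocess

n_chunks = 10_000   # one single-threaded job per input chunk

subprocess.run([
    "bsub",
    "-J", f"align[1-{n_chunks}]",            # job array: name and index range
    "-o", "logs/align.%J.%I.out",            # per-task log (%J = job ID, %I = index)
    "analyse_chunk --chunk $LSB_JOBINDEX",   # placeholder analysis command
], check=True)
```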

Life Sciences HPC Workloads: [the workload chart, shown again]

Different Architectures: [diagram contrasting a conventional HPC cluster (CPUs on a fat network with a POSIX global filesystem and a batch scheduler) against a cloud-style cluster (CPUs on a thin network with local storage and S3; Hadoop?)]

Careful choice of problem:

  • Choose a simple part of the pipeline
    • Re-factor all the code that expects a global filesystem and make it use S3 (a sketch follows below).
  • Why not use Hadoop?
    • Production code that works nicely inside Sanger.
    • Vast effort to port the code, for little benefit.
    • Questions about stability for multi-user systems internally.
  • Build a self-assembling HPC cluster.
    • Code which spins up AWS images that self-assemble into an HPC cluster and batch scheduler.
  • Cloud allows you to simplify.
    • Sanger compute cluster is shared.
      • Lots of complexity in ensuring applications/users play nicely together.
    • AWS clusters are unique to a user/application.
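
A minimal sketch of the S3 refactor mentioned above, assuming each job works on local scratch disk; the bucket layout and the analyse() placeholder are illustrative, not Sanger's actual pipeline.

```python
# Each job: pull its input chunk from S3, process it locally, push results back.
import boto3

s3 = boto3.client("s3", region_name="us-east-1")   # placeholder region
BUCKET = "example-pipeline-bucket"                 # hypothetical bucket

def analyse(src: str, dst: str) -> None:
    # Stand-in for the real C analysis code (normally run as a subprocess).
    with open(src, "rb") as fin, open(dst, "wb") as fout:
        fout.write(fin.read())

def run_chunk(chunk_id: int) -> None:
    local_in = f"/tmp/chunk-{chunk_id}.in"
    local_out = f"/tmp/chunk-{chunk_id}.out"
    # What used to be a read from the POSIX global filesystem...
    s3.download_file(BUCKET, f"input/chunk-{chunk_id}", local_in)
    analyse(local_in, local_out)
    # ...and what used to be a write to shared storage.
    s3.upload_file(local_out, BUCKET, f"output/chunk-{chunk_id}")
```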

The real problem: the Internet

  • Data transfer rates (gridFTP/FDT via our 2 Gbit/s site link).
    • Cambridge -> EC2 East Coast: 12 Mbytes/s (96 Mbits/s).
    • Cambridge -> EC2 Dublin: 25 Mbytes/s (200 Mbits/s).
    • 11 hours to move 1 TB to Dublin.
    • 23 hours to move 1 TB to the East Coast (worked out below).
  • What speed should we get?
    • Once we leave JANET (UK academic network) finding out what the connectivity is and what we should expect is almost impossible.
  • Do you have fast enough disks at each end to keep the network full?
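
The transfer times quoted above follow directly from size / rate; a quick check (assuming decimal terabytes):

```python
# time = size / rate, for the two measured transfer rates.
TB = 1e12  # bytes

for dest, mb_per_s in [("EC2 Dublin", 25), ("EC2 East Coast", 12)]:
    hours = TB / (mb_per_s * 1e6) / 3600
    print(f"1 TB to {dest} at {mb_per_s} MB/s: ~{hours:.0f} h")
# ~11 h to Dublin, ~23 h to the East Coast, matching the slide.
```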

Networking

  • How do we improve data transfers across the public internet?
    • CERN approach: don't.
    • 10 Gbit dedicated network between CERN and the T1 centres.
  • Can it work for cloud?
    • Buy dedicated bandwidth to a provider.
      • Ties you in.
      • Should they pay?
  • What happens when you want to move?

Summary

  • Moving existing HPC applications is painful.
  • Small-data / high-CPU applications work really well.
  • Large-data applications, less well.

Data Security

Are you allowed to put data on the cloud?

  • Default policy:
    • Our data is confidential/important/critical to our business.
    • We must keep our data on our computers.
    • Apart from when we outsource it already.

Reasons to be optimistic:

  • Most (all?) data security issues can be dealt with.
    • But the devil is in the details.
    • Data can be put on the cloud, if care is taken.
  • It is probably more secure there than in your own data-centre.
    • Can you match AWS data availability guarantees?
  • Are cloud providers different from any other organisation you outsource to?

Outstanding Issues

  • Audit and compliance:
    • If you need IP agreements above your provider's standard T&Cs, how do you push them through?
  • Geographical boundaries mean little in the cloud.
    • Data can be replicated across national boundaries without the end user being aware.
  • Moving personally identifiable data outside of the EU is potentially problematic.
    • (Can be problematic within the EU; privacy laws are not as harmonised as you might think.)
    • More sequencing experiments are trying to link with phenotype data (i.e. personally identifiable medical records).

Private Cloud to the rescue?

  • Can we do something different?

Traditional Collaboration: [diagram: a DCC (sequencing centre + archive) linked to several sequencing centres, each running its own IT]

Dark Archives

  • Storing data in an archive is not particularly useful.
    • You need to be able to access the data and do something useful with it.
  • Data in current archives is dark.
    • You can put/get data, but cannot compute across it.
    • Is data in an inaccessible archive really useful?

Private Cloud Collaborations: [diagram: sequencing centres sharing private cloud IaaS / SaaS platforms]

Private Cloud

  • Advantages:
    • Small organisations leverage the expertise of big IT organisations.
    • Academia tends to be linked by fast research networks.
      • Moving data is easier (move compute to the data via VMs).
    • Consortium will be signed up to data-access agreements.
      • Simplifies data governance.
  • Problems:
    • Big change in funding model.
    • Are big centres set up to provide private cloud services?
      • Selling services is hard if you are a charity.
    • Can we do it as well as the big internet companies?

Summary

  • Cloud is a useful tool.
    • Will not replace our local IT infrastructure.
  • Porting existing applications can be hard.
    • Do not underestimate time / people.
  • Still need IT staff.
    • End up doing different things.

Acknowledgements

  • Sanger
    • Phil Butcher, James Beal, Pete Clapham, Simon Kelley, Gen-Tao Chiang
    • Steve Searle, Jan-Hinnerk Vogel, Bronwen Aken
  • EBI
    • Glenn Proctor, Steve Keenan