
Clouds, Grids and Data

Guy Coates

Wellcome Trust Sanger Institute

[email protected]

The Sanger Institute

Funded by Wellcome Trust.

2nd largest research charity in the world.

~700 employees.

Based in Hinxton Genome Campus, Cambridge, UK.

Large scale genomic research.

Sequenced 1/3 of the human genome (largest single contributor).

We have active cancer, malaria, pathogen and genomic variation / human health studies.

All data is made publicly available.

Websites, FTP, direct database access, programmatic APIs.

Shared data archives

Past Collaborations

[Diagram: several sequencing centres feeding data into a central Data Coordination Centre (DCC).]

Future Collaborations

[Diagram: sequencing centres 1, 2A, 2B and 3 linked by federated access.]

Collaborations are short term: 18 months-3 years.

Genomics Data

Intensities / raw data (2TB)

Alignments (200 GB)

Sequence + quality data (500 GB)

Variation data (1GB)

Individual features (3MB)

Data size per genome spans this range: the bulk is unstructured data (flat files), while the smaller, processed end is structured data (databases).

Structured data can already be shared via DAS, BioMart etc.

Unstructured data: ?

Sharing Unstructured data

Large data volumes, flat files.

Federated access. Data is not going to be in one place.

A single institute will have data distributed for DR / worldwide access. Some parts of the data may be on cloud stores.

Controlled access. Many archives will be public.

Some will have patient identifiable data.

Plan for it now.

iRODS

iRODS: Integrated Rule-Oriented Data System. Produced by the DICE Group (Data Intensive Cyber Environments) at U. North Carolina, Chapel Hill.

Successor to SRB. SRB is used by the High-Energy Physics (HEP) community: 20 PB/year of LHC data.

HEP community has lots of lessons learned that we can benefit from.

Promising glue layer to pull archives together.

iRODS

[Diagram: iRODS architecture. ICAT catalogue database; rule engine implementing policies; iRODS servers holding data on disk, in databases and in S3; user interfaces via WebDAV, icommands and FUSE.]

Useful Features

Efficient: copes with PBs of data and 100,000,000+ files.

Fast parallel data transfers across local and wide area network links.

Extensible: the system can be linked to external services, e.g. external databases holding metadata or external authentication systems.

Federated: physically and logically separated iRODS installs can be federated.

Allows a user at institute A to seamlessly access data at institute B in a controlled manner (see the sketch below).
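As a concrete illustration of federated access, here is a minimal sketch using the python-irodsclient library; the hostnames, zone names, credentials and paths are placeholders, not details from the talk.

```python
# Minimal sketch: reading a data object from a federated iRODS zone with
# python-irodsclient. Hostnames, zone names, credentials and paths below are
# illustrative placeholders only.
from irods.session import iRODSSession

# Connect to the local zone at "institute A".
with iRODSSession(host="irods.institute-a.example.org", port=1247,
                  user="alice", password="secret", zone="instituteA") as session:
    # A path under the remote zone's name reaches data held at "institute B",
    # provided the two zones have been federated by their administrators.
    remote_path = "/instituteB/home/shared/run42/alignments.bam"
    obj = session.data_objects.get(remote_path)
    print(obj.name, obj.size)

    # Stream the first block of the remote object through the local zone.
    with obj.open("r") as f:
        header = f.read(1024)
```

The point is that once two zones are federated, a path under the remote zone's name resolves transparently; the client code at institute A does not need to change.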

What are we doing with it?

Piloting it for internal use. Help groups keep track of their data.

Move files between different storage pools: fast scratch space, warehouse disk, offsite DR centre.

Link metadata back to our LIMS / tracking databases.

We need to share data with other institutions. Public data is easy: FTP/HTTP.

Controlled data is hard:

Encrypt files and place on private FTP dropboxes.

Cumbersome to manage and insecure.

Ports trivially to the cloud. Build with federation from day 1.

Software knows about S3 storage layers.

Identity management

Which identity management system to use for controlled access?

Culture shock.

Lots of solutions: OpenID, Shibboleth (ASPIS), Globus/X.509, etc.

What features are important? How much security?

Single sign on?

Delegated authentication?

Finding consensus will be hard.

Cloud Archives

Dark Archives

Storing data in an archive is not particularly useful. You need to be able to access the data and do something useful with it.

Data in current archives is dark. You can put/get data, but you cannot compute across it.

Is data in an inaccessible archive really useful?

Last week's bombshell

We want to run our pipeline across 100 TB of data currently in EGA/SRA.

We will need to de-stage the data to Sanger, and then run the compute: an extra 0.5 PB of storage and 1,000 cores of compute.

3 month lead time.

~$1.5M capex.

Elephant in the room

Network speeds

Moving large amounts of data across the public internet is hard.

Data transfer rates (GridFTP/FDT):

Cambridge → EC2 East Coast: 12 Mbyte/s (96 Mbit/s)

NCBI → Sanger: 15 Mbyte/s (120 Mbit/s)

Oxford → Sanger: 60 Mbyte/s (480 Mbit/s)

77 days to pull down 100 TB from NCBI.

20 days to pull down 100 TB from Oxford (the arithmetic is sketched below).
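The transfer times above follow directly from the quoted rates; a quick back-of-envelope check:

```python
# Back-of-envelope check of the transfer times quoted above.
def days_to_transfer(volume_tb, rate_mbyte_per_s):
    """Days needed to move volume_tb terabytes at a sustained rate in Mbyte/s."""
    seconds = (volume_tb * 1e12) / (rate_mbyte_per_s * 1e6)
    return seconds / 86400

print(days_to_transfer(100, 15))   # NCBI -> Sanger at 15 Mbyte/s: ~77 days
print(days_to_transfer(100, 60))   # Oxford -> Sanger at 60 Mbyte/s: ~19 days
```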

Can we use the CERN model? Lay dedicated 10 Gbit lines between Geneva and the 10 T1 centres.

Collaborations are too fluid: 1.5-3 years vs 15 years for the LHC.

Cloud / Computable archives

Can we move the compute to the data? Upload the workload onto VMs.

Put VMs on compute that is attached to the data.

[Diagram: workload VMs placed onto CPUs directly attached to each data store.]

Proto-example: SSAHA trace search

The trace database (~30 TB) is hashed into a hash table (~320 GB).

1. Hash the database.

2. Distribute the hash across machines (CPUs).

3. Run queries in parallel across the machines (a toy sketch of these three steps follows).
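A toy sketch of those three steps: hash, distribute, query in parallel. This is illustrative only, not the real SSAHA implementation; the k-mer size, partition count and sequences are chosen purely for demonstration.

```python
# Toy sketch of the SSAHA-style idea above, not the real implementation:
# 1. hash the database into k-mer -> positions, 2. partition the hash across
# workers, 3. answer a query by asking every partition in parallel.
from collections import defaultdict
from multiprocessing import Pool

K = 12          # illustrative k-mer size
PARTITIONS = 4  # stands in for "machines"

def build_partitions(sequences):
    """Split the k-mer index across PARTITIONS shards by hashing the k-mer."""
    shards = [defaultdict(list) for _ in range(PARTITIONS)]
    for seq_id, seq in sequences.items():
        for i in range(len(seq) - K + 1):
            kmer = seq[i:i + K]
            shards[hash(kmer) % PARTITIONS][kmer].append((seq_id, i))
    return shards

def query_shard(args):
    """Each worker scans its own shard for the query's k-mers."""
    shard, query = args
    hits = []
    for i in range(len(query) - K + 1):
        hits.extend(shard.get(query[i:i + K], []))
    return hits

if __name__ == "__main__":
    db = {"trace1": "ACGTACGTGGCTAGCTAGGACGTTACGTAGCT",
          "trace2": "TTGACGTACGTGGCTAGCTAGGACGTTACGTA"}
    shards = build_partitions(db)
    query = "ACGTGGCTAGCTAGG"
    with Pool(PARTITIONS) as pool:
        results = pool.map(query_shard, [(s, query) for s in shards])
    print([hit for part in results for hit in part])
```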

Practical Hurdles

Where does it live?

Most of us are funded to hold data, not to fund everyone else's compute costs too. We now need to budget for raw compute power as well as disk.

Implement virtualisation infrastructure, billing etc. Are you legally allowed to charge?

Who underwrites it if nobody actually uses your service?

Strongly implies the data has to be held with a commercial provider.

Networking:

We still need to get data in. Fixing the internet is not going to be cost effective for us.

Fixing the internet may be cost effective for big cloud providers: it is core to their business model.

All we need to do is get data into Amazon, and then everyone else can get the data from there.

Do we invest in fast links to Amazon? It changes the business dynamic.

We have effectively tied ourselves to a single provider.

Compute architecture

[Diagram: traditional HPC architecture (CPUs on a fat network with a POSIX global filesystem and a batch scheduler) vs cloud architecture (CPUs on a thin network with local storage per node and Hadoop/S3 data stores).]

Architecture

Our existing pipelines do not port well to clouds: they expect a POSIX shared filesystem.

Re-writing apps to use S3 or Hadoop/HDFS is a real hurdle (see the sketch below). Do we fork existing apps into HPTC and cloud streams?
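To make the hurdle concrete, here is a sketch of the same read done against a POSIX filesystem and against an object store. boto3 and the bucket/path names are assumptions for illustration, not tooling named in the talk.

```python
# Sketch of the porting hurdle described above. The POSIX version "just works"
# on a shared filesystem; the cloud version has to be rewritten against an
# object store (boto3 and the bucket/key names here are illustrative).
import boto3

def read_alignments_posix(path="/lustre/project/run42/alignments.bam"):
    # Existing pipelines assume they can open a path on a global filesystem.
    with open(path, "rb") as f:
        return f.read()

def read_alignments_s3(bucket="example-genomics-archive",
                       key="run42/alignments.bam"):
    # The cloud port has to fetch whole objects (or byte ranges) over HTTP.
    s3 = boto3.client("s3")
    response = s3.get_object(Bucket=bucket, Key=key)
    return response["Body"].read()
```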

New apps: how do you run them internally? Build a cloud?

Am I being a reactionary old fart? 15 years ago, clusters of PCs were not real supercomputers.

...then Beowulf took over the world.

Big difference: porting applications between the two architectures was easy.

Will the market provide traditional compute clusters in the cloud?

Summary

Good tools are available for building federated data archives.

The challenge is computing across the data at scale.

Network infrastructure and cloud architectures still problematic.

Acknowledgements

Phil Butcher

ISG Team: James Beal

Gen-Tao Chiang

Pete Clapham

Simon Kelley

1000 Genomes Project: Thomas Keane

Jim Stalker

Cancer Genome Project: Adam Butler

John Teague

Backup

Other cloud projects

Virtual Colo.

Ensembl website. Access from outside Europe has been slow.

Built a mirror in a US West Coast commercial co-lo. ~25% of total traffic uses the West Coast mirror.

We would like to extend mirrors to other parts of the world, e.g. the US East Coast.

Building a mirror inside Amazon: a LAMP stack.

Common workload.

Not (technically) challenging; the issue is management overhead.

Cost comparisons will be interesting.

Ensembl / Annotation

[Slide: a stretch of raw genomic sequence (TCCTCTCTTTATT...) shown as an example of unannotated data.]

HPTC workloads on the cloud

There are going to be lots of new genomes that need annotating. Small labs have limited informatics / systems experience: typically postdocs/PhD students who have a real job to do.

Getting the Ensembl pipeline up and running takes a lot of domain expertise.

We have already done all the hard work of installing the software and tuning it. Can we package up the pipeline and put it in the cloud?

Goal: the end user should simply be able to upload their data, insert their credit-card number, and press GO (sketched below).
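A sketch of what that "press GO" flow might look like from the user's side, assuming an AWS-style setup with boto3; the AMI ID, bucket name, instance type and the run-annotation-pipeline command are all hypothetical, and billing is left out entirely.

```python
# Illustrative sketch only: "upload your data and press GO" from the user's
# side with boto3. The AMI ID, bucket and instance type are placeholders, and
# run-annotation-pipeline is a hypothetical command baked into the image.
import boto3

def press_go(local_reads="reads.fastq.gz",
             bucket="example-annotation-uploads",
             ami="ami-0123456789abcdef0"):
    # 1. Upload the user's data to an object store the pipeline can see.
    s3 = boto3.client("s3")
    s3.upload_file(local_reads, bucket, local_reads)

    # 2. Start a pre-built pipeline image pointing at the uploaded data.
    ec2 = boto3.client("ec2")
    ec2.run_instances(
        ImageId=ami, InstanceType="m5.4xlarge", MinCount=1, MaxCount=1,
        UserData=f"#!/bin/bash\nrun-annotation-pipeline s3://{bucket}/{local_reads}\n",
    )
```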