Post on 16-Apr-2017
Clouds, Grids and Data
Guy Coates
Wellcome Trust Sanger Institute
gmpc@sanger.ac.uk
The Sanger Institute
Funded by Wellcome Trust.
2nd largest research charity in the world.
~700 employees.
Based in Hinxton Genome Campus, Cambridge, UK.
Large scale genomic research.
Sequenced 1/3 of the human genome (largest single contributor).
We have active cancer, malaria, pathogen and genomic variation / human health studies.
All data is made publicly available.
Websites, FTP, direct database access, programmatic APIs.
Shared data archives
Past Collaborations
[Diagram: multiple sequencing centres feeding data into a central sequencing centre + DCC]
Future Collaborations
[Diagram: Sequencing Centres 1, 2A, 2B and 3 linked by federated access]
Collaborations are short term: 18 months-3 years.
Genomics Data
Data size per genome:
Intensities / raw data (2 TB)
Alignments (200 GB)
Sequence + quality data (500 GB)
Variation data (1 GB)
Individual features (3 MB)
[Diagram: spectrum from unstructured data (flat files) to structured data (databases); structured data is served via DAS, bioMART etc., but there is no equivalent ("?") for unstructured data]
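The tiers above span nine orders of magnitude per genome. A back-of-the-envelope sketch (tier names and sizes are taken from the slide; the totalling helper is purely illustrative) sums them:

```python
# Per-genome data tiers, as listed on the slide, in bytes.
TIERS_BYTES = {
    "intensities / raw data": 2 * 10**12,    # 2 TB
    "alignments": 200 * 10**9,               # 200 GB
    "sequence + quality data": 500 * 10**9,  # 500 GB
    "variation data": 1 * 10**9,             # 1 GB
    "individual features": 3 * 10**6,        # 3 MB
}

total = sum(TIERS_BYTES.values())
print(f"total per genome: {total / 10**12:.2f} TB")  # ~2.70 TB
```

Almost all of that ~2.7 TB is the unstructured, flat-file end of the spectrum, which is exactly the part with no established sharing mechanism.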
Sharing Unstructured data
Large data volumes, flat files.
Federated access. Data is not going to be in one place.
A single institute will have data distributed for DR / worldwide access. Some parts of the data may be on cloud stores.
Controlled access. Many archives will be public.
Some will have patient-identifiable data.
Plan for it now.
iRODS
iRODS: Integrated Rule-Oriented Data System. Produced by the DICE Group (Data Intensive Cyber Environments) at U. North Carolina, Chapel Hill.
Successor to SRB. SRB is used by the High-Energy Physics (HEP) community: 20 PB/year of LHC data.
The HEP community has lots of lessons learned that we can benefit from.
Promising glue layer to pull archives together.
iRODS
[Diagram: iRODS architecture — an ICAT catalogue database and a rule engine implementing policies, in front of iRODS servers holding data on disk, in databases and in S3; user interfaces include WebDAV, icommands and FUSE]
Useful Features
Efficient: copes with PBs of data and 100M+ files.
Fast parallel data transfers across local and wide-area network links.
Extensible: the system can be linked to external services, e.g. external databases holding metadata, or external authentication systems.
Federated: physically and logically separate iRODS installs can be federated.
Allows a user at institute A to seamlessly access data at institute B in a controlled manner.
What are we doing with it?
Piloting it for internal use. Helps groups keep track of their data.
Move files between different storage pools: fast scratch space, warehouse disk, offsite DR centre.
Link metadata back to our LIMS/tracking databases.
We need to share data with other institutions. Public data is easy: FTP/HTTP.
Controlled data is hard:
Encrypt files and place on private FTP dropboxes.
Cumbersome to manage and insecure.
iRODS ports trivially to the cloud. Build with federation from day 1.
The software knows about S3 storage layers.
Identity management
Which identity management system to use for controlled access?
Culture shock.
Lots of solutions: OpenID, Shibboleth (ASPIS), Globus/X.509, etc.
What features are important? How much security?
Single sign-on?
Delegated authentication?
Finding consensus will be hard.
Cloud Archives
Dark Archives
Storing data in an archive is not particularly useful. You need to be able to access the data and do something useful with it.
Data in current archives is dark. You can put/get data, but cannot compute across it.
Is data in an inaccessible archive really useful?
Last week's bombshell
We want to run our pipeline across 100 TB of data currently in EGA/SRA.
We will need to de-stage the data to Sanger and then run the compute. Extra 0.5 PB of storage, 1,000 cores of compute.
3 month lead time.
~$1.5M capex.
Elephant in the room
Network speeds
Moving large amounts of data across the public internet is hard.
Data transfer rates (gridFTP/FDT):
Cambridge → EC2 East Coast: 12 MByte/s (96 Mbit/s)
NCBI → Sanger: 15 MByte/s (120 Mbit/s)
Oxford → Sanger: 60 MByte/s (480 Mbit/s)
77 days to pull down 100TB from NCBI.
20 days to pull down 100TB from Oxford.
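These timings follow directly from the measured rates. A small sketch (assuming sustained throughput and 86,400 seconds per day; the helper name is illustrative) reproduces the slide's numbers:

```python
# Back-of-the-envelope transfer times for the rates measured above.
def transfer_days(data_bytes, rate_bytes_per_s):
    """Days to move data_bytes at a sustained transfer rate."""
    return data_bytes / rate_bytes_per_s / 86_400  # 86,400 seconds per day

HUNDRED_TB = 100 * 10**12  # the 100 TB of EGA/SRA data

print(round(transfer_days(HUNDRED_TB, 15 * 10**6)))  # NCBI -> Sanger at 15 MB/s: 77
print(round(transfer_days(HUNDRED_TB, 60 * 10**6)))  # Oxford -> Sanger at 60 MB/s: 19
```

The 60 MB/s Oxford link works out to ~19.3 days, i.e. the slide's ~20 days; the 12 MB/s EC2 link would take over three months.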
Can we use the CERN model? Lay dedicated 10 Gbit lines between Geneva and the 10 T1 centres.
Our collaborations are too fluid: 1.5-3 years vs 15 years for the LHC.
Cloud / Computable archives
Can we move the compute to the data? Upload the workload onto VMs.
Put the VMs on compute that is attached to the data.
[Diagram: a VM moved from local CPUs to CPUs attached to the remote data store]
Proto-Example:
Ssaha trace search
[Diagram: a ~30 TB trace database hashed into a 320 GB hash table]
1. Hash the database.
2. Distribute the hash table across machines.
3. Run the query in parallel.
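The three steps above can be sketched as a scatter/gather over hash-table shards. This is a toy illustration of the idea, not SSAHA's actual implementation; `partition` and `query` are hypothetical names, and each "shard" stands in for the slice held on one machine:

```python
from collections import defaultdict

def partition(hash_table, n_nodes):
    """Scatter a k-mer -> positions hash table across n_nodes shards by key hash."""
    shards = [dict() for _ in range(n_nodes)]
    for kmer, positions in hash_table.items():
        shards[hash(kmer) % n_nodes][kmer] = positions
    return shards

def query(shards, kmers):
    """Gather: each shard answers for the k-mers it holds; merge the hits."""
    hits = defaultdict(list)
    for shard in shards:  # in practice, each shard is queried on its own machine, in parallel
        for kmer in kmers:
            hits[kmer].extend(shard.get(kmer, []))
    return dict(hits)

# Toy 4-node run over a tiny "database":
table = {"ACGT": [10, 42], "TTAG": [7], "GGCC": [99]}
shards = partition(table, 4)
print(query(shards, ["ACGT", "TTAG", "AAAA"]))
```

Because the 320 GB table is split across machines, no single node needs to hold it all, and each query only touches the shard that owns the k-mer.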
Practical Hurdles
Where does it live?
Most of us are funded to hold data, not to fund everyone else's compute costs too. We now need to budget for raw compute power as well as disk.
Implement virtualisation infrastructure, billing etc. Are you legally allowed to charge?
Who underwrites it if nobody actually uses your service?
Strongly implies data has to be held on a commercial provider.
Networking:
We still need to get data in. Fixing the internet is not going to be cost-effective for us.
Fixing the internet may be cost-effective for big cloud providers. It is core to their business model.
All we need to do is get data into Amazon, and then everyone else can get the data from there.
Do we invest in fast links to Amazon? It changes the business dynamic.
We have effectively tied ourselves to a single provider.
Compute architecture
[Diagram: two architectures compared — CPUs on a fat network sharing a POSIX global filesystem with a batch scheduler, vs CPUs with local storage on a thin network running Hadoop/S3; each backed by a data store]
Architecture
Our existing pipelines do not port well to clouds. They expect a POSIX shared filesystem.
Re-writing apps to use S3 or Hadoop/HDFS is a real hurdle. Fork existing apps into HPTC and cloud streams?
New apps: how do you run them internally? Build a cloud?
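One way to soften the porting problem is to hide the storage backend behind a small I/O shim, so pipeline code never hard-codes POSIX paths. This is an illustrative sketch of that design, not the Sanger's actual approach; `open_input` and the `s3://` branch are hypothetical:

```python
import os
import tempfile

def open_input(uri):
    """Return a binary file-like object for a POSIX path or an object-store URI."""
    if uri.startswith("s3://"):
        # Placeholder: a real port would fetch the object via the store's API
        # and return a file-like stream here.
        raise NotImplementedError("object-store backend not wired up in this sketch")
    return open(uri, "rb")  # POSIX path: a plain open()

def count_records(stream):
    """Toy pipeline step: count FASTA-style '>' headers in a stream."""
    return sum(1 for line in stream if line.startswith(b">"))

# Demo against a local (POSIX) file:
with tempfile.NamedTemporaryFile("wb", delete=False) as f:
    f.write(b">seq1\nACGT\n>seq2\nTTAG\n")
    path = f.name
with open_input(path) as fh:
    print(count_records(fh))  # 2
os.remove(path)
```

The pipeline step only ever sees a file-like object, so the same code could run on a shared filesystem cluster or, once the object-store branch is implemented, in the cloud — without forking the application.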
Am I being a reactionary old fart? 15 years ago, clusters of PCs were not real supercomputers.
...then Beowulf took over the world.
Big difference: porting applications between the two architectures was easy.
Will the market provide traditional compute clusters in the cloud?
Summary
Good tools are available for building federated data archives.
The challenge is computing across the data at scale.
Network infrastructure and cloud architectures still problematic.
Acknowledgements
Phil Butcher
ISG Team: James Beal
Gen-Tao Chiang
Pete Clapham
Simon Kelley
1k Genomes Project: Thomas Keane
Jim Stalker
Cancer Genome Project: Adam Butler
John Teague
Backup
Other cloud projects
Virtual Colo.
Ensembl website. Access from outside Europe has been slow.
Built a mirror in a US West Coast commercial co-lo; ~25% of total traffic uses the West Coast mirror.
We would like to extend mirrors to other parts of the world, e.g. the US East Coast.
Building a mirror inside Amazon. LAMP stack.
Common workload.
Not (technically) challenging. Management overhead.
Cost comparisons will be interesting.
Ensembl / Annotation
[Slide: ~780 bp of raw, unannotated genomic sequence]
HPTC workloads on the cloud
There are going to be lots of new genomes that need annotating. Small labs have limited informatics / systems experience; typically postdocs/PhD students who have a real job to do.
Getting the Ensembl pipeline up and running takes a lot of domain expertise.
We have already done all the hard work of installing the software and tuning it. Can we package up the pipeline and put it in the cloud?
Goal: End user should simply be able to upload their data, insert their credit-card number, and press GO.