Sharing Data: Sanger Experiences

DESCRIPTION
Sharing large amounts of data is easier said than done. This talk gives an overview of our experiences doing big-data science over wide-area networks.

TRANSCRIPT
- 1. Data Sharing: Sanger Experiences Guy Coates Wellcome Trust Sanger Institute [email_address]
2. Background
- Moving large amounts of data:
3. Cloud Experiments
- Moving data to the cloud
- Production Pipelines
- Moving data to EBI
- Do we need to move this data at all?
4. Cloud Experiments
- Can we move some Solexa image files to AWS and run our processing pipeline?
5. Answer: No.
- Moving the data took much longer than processing it.
6. First attempt: 14 Mbit/s out of a 2 Gbit/s link.
7. Do some reading:
- http://fasterdata.es.net
8. Department of Energy Office of Science.
- Covers all of the technical bits and pieces required to make wide-area transfers go fast.
9. Getting better:
- Use the right tools:
- Use WAN tools: gridFTP/FDT/Aspera, not rsync/ssh.
10. Tune your TCP stack.
- Data transfer rates:
- Cambridge -> EC2 East Coast: 12 Mbytes/s (96 Mbits/s)
11. Cambridge -> EC2 Dublin: 25 Mbytes/s (200 Mbits/s)
- What speed should we get?
- Once we leave JANET (the UK academic network), finding out what the connectivity is and what we should expect is almost impossible.
- How do we fix the broken bits in the middle?
- Finding the person responsible for a broken router on the internet is hard.
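The slow first attempt has a simple arithmetic explanation: a single TCP stream can never move faster than its window size divided by the round-trip time, which is why the fasterdata.es.net advice starts with buffer tuning. A minimal sketch of that arithmetic, assuming illustrative RTT and window values (not measured figures from the talk):

```python
# Back-of-envelope TCP throughput maths, per the fasterdata.es.net advice.
# The 90 ms RTT and 64 KB window are illustrative assumptions.

def max_throughput_mbits(window_bytes: float, rtt_s: float) -> float:
    """Best-case single-stream TCP throughput: window / round-trip time."""
    return window_bytes * 8 / rtt_s / 1e6

def required_window_bytes(target_mbits: float, rtt_s: float) -> float:
    """Bandwidth-delay product: the buffer needed to fill the pipe."""
    return target_mbits * 1e6 / 8 * rtt_s

# A default 64 KB window over a ~90 ms transatlantic path caps one stream
# far below the link rate, regardless of how fast the link is:
print(round(max_throughput_mbits(64 * 1024, 0.090), 1))    # ~5.8 Mbit/s

# Filling a 2 Gbit/s link at 90 ms RTT needs roughly 22.5 MB of TCP buffer:
print(round(required_window_bytes(2000, 0.090) / 1e6, 1))  # ~22.5 MB
```

This is the same reasoning behind the WAN tools on the slide: gridFTP and FDT also run many streams in parallel, so each stream needs a smaller window.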
12. What about the Physicists?
- The LHC moves 20 PBytes/year across the internet to their processing sites.
- Not really.
13. Dedicated 10GigE networking between CERN and the 10 Tier 1 centres.
- Even with dedicated paths, it is still hard.
- Multiple telcos involved, even for a point-to-point link.
14. Constant monitoring / bandwidth tests to ensure it stays working.
15. See HEPIX talks for gory details.
16. We need bigger networks:
- A fast network is fundamental to moving data.
17. Is it the only thing we need to do?
18. Sanger Production Pipeline
- Provides a nice example of moving large amounts of data in real life.
19. Sequencing data flow: Sanger Sequencer -> Analysis/alignment -> Internal repository -> EGA/SRA (EBI)
20. Data movement between Sanger/EBI
- This should be easy...
- We are on the same campus.
21. 10 Gbit/s (1.2 Gbyte/s) link between EBI and Sanger.
22. We share a data-centre.
23. Physically near, so we do not need to worry about WAN issues.
24. It is not just networks:
- Speed will only be as fast as the slowest link.
25. Speed was not a design point for our holding area.
- $ per TB was the overriding design goal, not speed.
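The "slowest link" point can be made concrete: the end-to-end rate of a chained path (disk, server, firewall, network) is the minimum of its stage rates, so a cheap holding area caps the whole transfer. A small sketch with assumed, illustrative stage rates (only the 10 Gbit/s figure comes from the talk):

```python
# End-to-end throughput of a serial path is bounded by its slowest stage.
# All rates except the network link are illustrative assumptions.
stages_mbytes_per_s = {
    "disk": 60,        # $/TB-optimised holding area (assumed)
    "server": 400,     # assumed
    "firewall": 120,   # assumed
    "network": 1200,   # 10 Gbit/s link ~= 1.2 Gbyte/s, as on the slide
}

bottleneck = min(stages_mbytes_per_s, key=stages_mbytes_per_s.get)
effective = stages_mbytes_per_s[bottleneck]
print(bottleneck, effective)  # disk 60
```

With these numbers the 10 Gbit/s link runs at 5% utilisation: upgrading anything other than the disk changes nothing.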
(Diagram: the EBI and Sanger ends each chain disk, network, firewall and server, linked across the internet.)
26. Organisational issues:
- Data movement was not considered until after Sanger/EBI started building the systems.
- Hard to do fast data transfers if your disk subsystem is not up to the job.
- Expectation management:
- How fast should I be able to move data?
- Good communication.
- Multi-institute teams.
27. Need to take end-to-end ownership across institutions.
- Application led:
- Nobody cares about raw data rates; they care how fast their application can move data.
28. Need application developers and sys-admins to work together.
- This needs to be in place before you start projects!
29. Do we need to move the data?
30. Centralised data. (Diagram: sequencing centres all feeding a central Sequencing Centre + DCC.)
31. Example Problem:
- We want to run our pipeline across 100 TB of data currently in EGA/SRA.
32. We will need to de-stage the data to Sanger, and then run the compute.
- Extra 0.5 PB of storage, 1000 cores of compute.
33. 3 month lead time.
34. ~$1.5M capex.
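A rough timing sketch shows why de-staging 100 TB is painful even before the capex, using the transfer rates quoted earlier in the talk (25 Mbytes/s to EC2 Dublin over the WAN; 1.2 Gbytes/s on the campus link):

```python
# Time to de-stage 100 TB at the rates quoted earlier in the talk.
def days_to_move(tbytes: float, rate_mbytes_s: float) -> float:
    """Transfer time in days for `tbytes` TB at `rate_mbytes_s` MByte/s."""
    return tbytes * 1e12 / (rate_mbytes_s * 1e6) / 86400

print(round(days_to_move(100, 25)))    # ~46 days at WAN rates
print(round(days_to_move(100, 1200)))  # ~1 day on the 10 Gbit/s campus link
```

Even the best case assumes the disks at both ends can sustain the link rate, which the previous slides show is rarely true.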
35. Federation: a better way.
- Collaborations are short term: 18 months to 3 years.
- (Diagram: sequencing centres sharing data through federated access rather than a central archive.)
36. Federation software: data size per genome runs from large unstructured data (flat files) down to small structured data (databases); iRODS (data grid software) covers the unstructured end, BioMart the structured end.
- Intensities / raw data (2 TB)
- Alignments (200 GB)
- Sequence + quality data (500 GB)
- Variation data (1 GB)
- Individual features (3 MB)
37. Cloud / Computable archives
- Can we move the compute to the data?
- Upload workload onto VMs.
38. Put VMs on compute that is attached to the data. (Diagram: CPUs attached directly to the data, with the VM running alongside.)
39. Summary
- We need fast network links.
40. We need cross-site teams who can troubleshoot all potential trouble spots.
41. Teams need application & systems people.
42. Acknowledgements:
- The HEPIX Community.
- http://www.hepix.org
- Team ISG:
- James Beal
43. Gen-Tao Chiang
44. Pete Clapham
45. Simon Kelley