irods/ddn user group 20140908 sanger
DESCRIPTION
My presentation to the UCL-hosted DDN/iRODS user group, held on 8/9 September 2014.
TRANSCRIPT
About the Institute
Funded by Wellcome Trust.
2nd largest research charity in the world. ~700 employees.
Large scale genomic research.
Sequenced 1/3 of the human genome (largest single contributor).
We have active cancer, malaria, pathogen and genomic variation studies.
All data is made publicly available.
Websites, FTP, direct database access, programmatic APIs.
3 About Us
The Sanger Institute: A Little Background
1997 (yeast genome completed)
2001 (first draft of human genome; Sanger upped contribution to 1/3)
2003 (first mouse genome draft; malarial parasite sequence completed)
2005 (WTCCC established)
2008 (start of 1000 Genomes Project)
2010 (completion of 1000 Genomes; start of UK10K study)
Sequence till 2011
Original Design Brief
Image credit: Ryan Raffa, ryanraffa.com
Design Brief
Is the data safe?
Can the scientists find their data?
Image credit: searchengineland.com
Path of least surprise
Image credit: betanews.com
Minimal Maintenance
Image credit: failblog.cheezburger.com
Current Setup
Metadata Heavy Usage
• Users query and access data largely from local compute clusters
• Users access iRODS locally via the CLI
• Metadata is largely provided on creation with Baton APIs via pipelines
Example attribute fields:
attribute: library
attribute: total_reads
attribute: type
attribute: lane
attribute: is_paired_read
attribute: study_accession_number
attribute: library_id
attribute: sample_accession_number
attribute: sample_public_name
attribute: manual_qc
attribute: tag
attribute: sample_common_name
attribute: md5
attribute: tag_index
attribute: study_title
attribute: study_id
attribute: reference
attribute: sample
attribute: target
attribute: sample_id
attribute: id_run
attribute: study
attribute: alignment
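As a rough illustration of how a user on a compute cluster might query these attributes from the CLI, the sketch below builds an argument list for the iRODS `imeta qu -d` command, which searches data objects by attribute/value pairs joined with `and`. The helper name `build_imeta_query` and the example values are hypothetical and not taken from the talk; the command is only printed here, not run.

```python
import shlex


def build_imeta_query(avus):
    """Return an argv list for `imeta qu -d` matching ALL given attribute/value pairs.

    `avus` maps attribute names (e.g. "study_id") to the values to match.
    This helper is hypothetical; it only assembles the command line.
    """
    if not avus:
        raise ValueError("at least one attribute/value pair is required")
    argv = ["imeta", "qu", "-d"]
    for index, (attribute, value) in enumerate(avus.items()):
        if index > 0:
            argv.append("and")  # imeta joins conditions with 'and'
        argv += [attribute, "=", str(value)]
    return argv


# Hypothetical values, chosen purely for illustration.
query = build_imeta_query({"study_id": 1234, "manual_qc": 1})
print(shlex.join(query))
```

The resulting list could be handed to `subprocess.run` on a host with the icommands installed and an authenticated session.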
Current Deployment
Replication between data centre ‘rooms’ (direct, not queued)
One resource set as default for incoming objects,
migration via cron script as it fills up
Checksum via iput strongly encouraged
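To make the checksum point concrete: `iput -K` asks iRODS to verify the checksum on upload, and a client can also compute its own digest beforehand, for example to compare against an `md5` metadata attribute later. The following is a minimal sketch of that client-side step, assuming MD5 is the digest in use; the function name `md5_of_file` is hypothetical.

```python
import hashlib


def md5_of_file(path, chunk_size=1024 * 1024):
    """Return the hex MD5 digest of a file, reading it in fixed-size chunks.

    Chunked reading keeps memory use flat for large sequencing files.
    """
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()
```

A pipeline could record this value before calling `iput -K` and compare it with the checksum stored for the object afterwards.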
Current Logical Design
Sanger1 (Portal Zone)
/seq
green red
/humgen
green red
Portal provides kerberised access
Federation using head zone accounts
The Future
Image copyright: flyinglow.ca
Future Logical Design
Sanger1 (Portal Zone)
/seq
green red Orange?
/humgen
Red green Orange?
Orange will be offsite at Infinity
Why the change?
Everyone has an airport, why is that noteworthy?
• 10g via JANET between sites
• Tested on AWS first (successfully!)
• Using Oracle failover (HA tnsnames entry to RAC clusters)
• Build here and ship to new site
• ~3 PB to transfer
• Need to have local replicas
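A back-of-the-envelope calculation may help explain why building locally and shipping is attractive. The sketch below assumes "10g" means a 10 Gbit/s link, uses decimal petabytes, and treats the `efficiency` parameter as a hypothetical knob for protocol and contention overhead; none of these figures beyond "~3PB" and "10g" come from the slides.

```python
def transfer_days(size_bytes, link_bits_per_second, efficiency=1.0):
    """Estimate days needed to move `size_bytes` over a link.

    `efficiency` is the assumed fraction of line rate actually achieved.
    """
    seconds = size_bytes * 8 / (link_bits_per_second * efficiency)
    return seconds / 86400  # seconds per day


PETABYTE = 10**15        # decimal petabyte (assumption)
TEN_GBIT = 10 * 10**9    # 10 Gbit/s (assumed meaning of "10g")

print(round(transfer_days(3 * PETABYTE, TEN_GBIT), 1))        # 27.8
print(round(transfer_days(3 * PETABYTE, TEN_GBIT, 0.5), 1))   # 55.6
```

Even at full line rate the copy takes roughly four weeks, and about eight weeks at half that rate, before counting any other traffic on the link.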
Image credit: diy.despair.com
The spindles spin, but does the data(base)?
[Figure: iRODS upload of genotyping data archive, started 2014-04-24 10:50. x-axis: hours after start (0 to 160). y-axes: files uploaded per hour (0 to 4000) and mean files uploaded per minute (0 to 80). Annotations: 5 pm Friday, 9 am Monday; mean = 55.6/min.]
21 Experience
Our block storage deployment
Your mileage may vary
‘Wait, WHAT?’
Databases are good at data, right?
Can you spot the optimisation time point?
Database Tuning FTW
Features we’re not using.. yet
Image credit: Melissa Penta; mydigitalmind.com
More rules-based notification, e.g. notifying PIs on access to restricted data
PAM authentication instead of Kerberos
iDrop
(metadata query non trivial ATM)
29 Features
Features we want
Image credit: www.paperspencils.com
Object store plugins/integration
• Caching plugin thoughts
• Streaming files
• Local cache space
• Managing local cache
• Multi site, esp updates
• Integration with Vendors
• Replica count?
• Site/geographical awareness
• metrics, metrics, metrics! (also reliability, manageability, low cost..)
Instrumentation
More like this, pls
Oracle MySQL
What our users are doing
Source: projectcartoon.com
Serapis (REST API, python, RabbitMQ, MongoDB)
Baton
Stuff they haven’t told us
36 Users
Thank you!
Acknowledgements: Dr Pete Clapham
John Constable Informatics Support Group
[email protected] @kript