European Life Sciences Infrastructure for Biological Information
www.elixir-europe.org
Rafael C Jimenez, ELIXIR CTO
25 September 2014
Proteomics repositories integration using EUDAT resources
Data submissions
Submissions
Data repository: raw data, processed data, metadata
Search
Integration
Noble WS, MacCoss MJ (2012) Computational and Statistical Analysis of Protein Mass Spectrometry Data. PLoS Comput Biol 8(1): e1002296. doi:10.1371/journal.pcbi.1002296
http://www.ploscompbiol.org/article/info:doi/10.1371/journal.pcbi.1002296
Overview of shotgun proteomics data production
MKKKNIYSIRKLGVG
IASVTLGTLLISG
GVTPAANAAQHD
FYQVLNMPNLNADQ
RNGFIQSLK
DDPSQSANVKLN
Peptide sequences
Raw data, processed data, metadata
Data examples
Raw data, processed data, metadata
DNA
Human
Liver
Mitochondria
W. Smith
…
Peptide
Mouse
Heart
Nucleus
J. Heinz
…
LPISASHSSK…
TTGTTATCCG…
… … …
Proteomics data in PRIDE
~85% raw data
ProteomeXchange data flow (figure labels):
Submission: metadata / manuscript, raw data*, results
Receiving repositories: PASSEL (SRM data), PRIDE (MS/MS data)
ProteomeCentral
Dissemination: journals, UniProt/NeXtProt, PeptideAtlas, GPMDB, other DBs
Legend: researcher's results, reprocessed results, raw data*, metadata
ProteomeXchange
Vizcaíno et al., Nature Biotechnology, 2014
• Framework to enable standard data submission and
dissemination pipelines between the main existing
proteomics resources.
Martens et al., Proteomics, 2005
Vizcaíno et al., NAR, 2013
PRIDE (PRoteomics IDEntifications) database
Mass spectrometry
Origin:
152 USA
108 Germany
67 United Kingdom
53 Switzerland
48 Netherlands
42 China
42 Canada
41 France
36 Spain
33 Belgium
25 Australia
23 Sweden
17 Japan
16 Denmark
13 Norway
12 Finland
12 India
12 Taiwan
10 Italy
9 Republic of Korea
8 Austria
8 Ireland
8 Brazil
7 Singapore
5 Israel
5 Russia
…
Type:
273 PRIDE complete
501 PRIDE partial
47 PeptideAtlas/PASSEL complete
Access:
38.3% PRIDE public
5.3% PASSEL public
56% PRIDE private
0.4% PASSEL private
Data volume:
Total: >40 TB
Number of all files: >120,000
PXD000320-324: ~5 TB
PXD000065: ~1.4 TB
Top species (studied in at least 8 datasets):
381 Homo sapiens
100 Mus musculus
31 Arabidopsis thaliana
26 Saccharomyces cerevisiae
16 Escherichia coli
14 Rattus norvegicus
12 Mycobacterium tuberculosis
11 Drosophila melanogaster
~ 215 species in total
Submissions/year:
2012: 102
2013: 527
2014: 192
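The counts on this slide can be cross-checked against each other. The sketch below only re-adds the figures quoted above; the variable names are illustrative and not part of the original slides.

```python
# Figures copied from the slide; no new data is introduced here.
dataset_types = {
    "PRIDE complete": 273,
    "PRIDE partial": 501,
    "PeptideAtlas/PASSEL complete": 47,
}
submissions_per_year = {2012: 102, 2013: 527, 2014: 192}
access_share_percent = {
    "PRIDE public": 38.3,
    "PASSEL public": 5.3,
    "PRIDE private": 56.0,
    "PASSEL private": 0.4,
}

# Totals by submission type and by year should describe the same set of datasets.
total_by_type = sum(dataset_types.values())
total_by_year = sum(submissions_per_year.values())

# Access shares should add up to 100%; round to absorb floating-point noise.
total_share = round(sum(access_share_percent.values()), 1)
public_share = round(
    access_share_percent["PRIDE public"] + access_share_percent["PASSEL public"], 1
)

print(total_by_type, total_by_year, total_share, public_share)
```

Both totals come to 821 datasets, the access shares sum to 100%, and the public share across PRIDE and PASSEL is 43.6%.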
Pilot evolution
• Use EUDAT
  • Replication of ELIXIR data in EUDAT data centers
  • Delegation of ELIXIR data in EUDAT data centers
• Adopt EUDAT
  • Replication of ELIXIR data in ELIXIR data centers using EUDAT technology
Replication of ELIXIR data in EUDAT data centers
Central repository: metadata, results, raw data
Data storage centers: metadata, raw data
Figure: ProteomeXchange data flow (Vizcaíno et al., Nature Biotechnology, 2014), with an additional raw data* flow from the receiving repositories PASSEL (SRM data) and PRIDE (MS/MS data).
Delegation of ELIXIR data in EUDAT data centers
Central repository: metadata, results, raw data
Data storage centers: metadata, raw data
Figure: ProteomeXchange data flow (Vizcaíno et al., Nature Biotechnology, 2014), with an additional raw data* flow from the receiving repositories PRIDE (MS/MS data) and PASSEL (SRM data).
Replication of ELIXIR data in ELIXIR data centers using EUDAT technology
National proteomics centers: metadata, results, raw data
Central repository: metadata, results, raw data
Plans
National proteomics centers: metadata, results, raw data
Central repository: metadata, results, raw data
Data storage centers: metadata, raw data
1.- ELIXIR replication
2.- EUDAT replication
Plans
National proteomics centers: metadata, results, raw data
Central repository: metadata, results, raw data
Data storage centers: metadata, raw data
3.- delegation
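The three planned steps can be sketched as operations on the holdings of each site. This is a minimal illustration under stated assumptions, not the pilot's implementation: the `Site` class, the `replicate` and `delegate` helpers, and the idea that delegation leaves a PID reference behind at the central repository are all hypothetical readings of the diagrams.

```python
from dataclasses import dataclass, field


@dataclass
class Site:
    name: str
    holdings: set = field(default_factory=set)


def replicate(source, target, kinds):
    """Copy the given kinds of data to target; source keeps its own copies."""
    missing = set(kinds) - source.holdings
    if missing:
        raise ValueError(f"{source.name} does not hold: {sorted(missing)}")
    target.holdings |= set(kinds)


def delegate(source, target, kind):
    """Hand one kind of data over to target, leaving only a reference (assumed: a PID)."""
    replicate(source, target, {kind})
    source.holdings.discard(kind)
    source.holdings.add(f"PID -> {kind} @ {target.name}")


national = Site("national proteomics center", {"metadata", "results", "raw data"})
central = Site("central repository")
storage = Site("data storage center")

replicate(national, central, {"metadata", "results", "raw data"})  # 1.- ELIXIR replication
replicate(central, storage, {"metadata", "raw data"})              # 2.- EUDAT replication
delegate(central, storage, "raw data")                             # 3.- delegation

for site in (national, central, storage):
    print(site.name, sorted(site.holdings))
```

Replication leaves the source untouched, whereas delegation, as modelled here, removes the raw data from the central repository and keeps a pointer to the storage center.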
ELIXIR Pilot action
EUDAT services
File sharing model
Diagram labels: CSC, BILS, Site B, Site C, PRIDE (EMBL-EBI); EUDAT CDI; ELIXIR; B2SAFE (shown four times)
Pilot – EUDAT adoption: ELIXIR replication
Diagram labels: CSC, BILS, Site B, Site C, PRIDE (EMBL-EBI); EUDAT CDI; ELIXIR; B2SAFE (shown four times)
Central repository: metadata, results, raw data
National proteomics centers: metadata, results, raw data
PIDs
Diagram labels: ELIXIR community center, ELIXIR data center 1, EUDAT data center 1; CSC, PRIDE, BILS
Status
• BILS
  • Migrating from the existing Swestore dCache to iRODS
  • Testing compatibility with B2SAFE
  • Latest iRODS not compatible with B2SAFE?
• PRIDE
  • iRODS service installed
  • B2SAFE module has been deployed at EMBL-EBI (PRIDE)
  • Test B2SAFE replication PRIDE -> CSC
  • DOI for datasets
  • PID for dataset files
  • Web service to associate datasets with dataset files
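The last three items (a DOI per dataset, a PID per file, and a service linking the two) can be sketched as a small lookup structure. This is only an in-memory illustration: the class names, the accession, the identifier strings and the file names below are all hypothetical, and no real DOI, Handle or EPIC API is called.

```python
from dataclasses import dataclass, field


@dataclass
class DatasetRecord:
    accession: str                                  # dataset accession (hypothetical here)
    doi: str                                        # dataset-level identifier
    file_pids: dict = field(default_factory=dict)   # file name -> file-level PID


class DatasetRegistry:
    """In-memory stand-in for a web service that links datasets to their files."""

    def __init__(self):
        self._by_doi = {}
        self._doi_by_pid = {}

    def register_dataset(self, accession, doi):
        if doi in self._by_doi:
            raise ValueError(f"DOI already registered: {doi}")
        self._by_doi[doi] = DatasetRecord(accession, doi)

    def add_file(self, doi, file_name, pid):
        if pid in self._doi_by_pid:
            raise ValueError(f"PID already assigned: {pid}")
        self._by_doi[doi].file_pids[file_name] = pid
        self._doi_by_pid[pid] = doi

    def files_for(self, doi):
        """Dataset DOI -> {file name: PID}."""
        return dict(self._by_doi[doi].file_pids)

    def dataset_for(self, pid):
        """File PID -> DOI of the dataset it belongs to."""
        return self._doi_by_pid[pid]


# Placeholder identifiers, invented for illustration only.
registry = DatasetRegistry()
registry.register_dataset("PXD-EXAMPLE", "doi:example/PXD-EXAMPLE")
registry.add_file("doi:example/PXD-EXAMPLE", "run01.raw", "pid:example/0001")
registry.add_file("doi:example/PXD-EXAMPLE", "run02.raw", "pid:example/0002")

print(registry.files_for("doi:example/PXD-EXAMPLE"))
print(registry.dataset_for("pid:example/0002"))
```

Keeping the lookup in both directions means a replicated file found at a storage center can be traced back to its dataset, and a dataset can list the files that make it up.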
Status
In progress
  • Handle System registration
  • Test requests of EPIC/EUDAT identifiers
Open questions
  • BILS local PIDs?
  • Sync back from PRIDE to BILS for modifications/additions at PRIDE?
  • Data push or pull model?
  • Replication of processed data requires previous validation
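The validation requirement in the last open question can be sketched as a filter applied before replication. This is a minimal sketch under assumptions: the `DataFile` type, the file names, and the rule that raw data and metadata need no prior validation are illustrative, and the push-or-pull question is left open.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DataFile:
    name: str
    kind: str                # "raw", "processed" or "metadata"
    validated: bool = False


def eligible_for_replication(files):
    """Keep raw data and metadata; keep processed data only once it is validated."""
    return [f for f in files if f.kind != "processed" or f.validated]


# Hypothetical submission, for illustration only.
submission = [
    DataFile("run01.raw", "raw"),
    DataFile("run01.mzid", "processed", validated=True),
    DataFile("run02.mzid", "processed"),
    DataFile("experiment.xml", "metadata"),
]

to_replicate = [f.name for f in eligible_for_replication(submission)]
print(to_replicate)
```

The same filter would apply whether the central repository pushes files or a data center pulls them; only the side that calls it changes.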
Participants
EUDAT/CSC
• Jani Heikkinen
• Damien Lecarpentier
EMBL-EBI/systems
• Andy Jenkinson
• Steven Newhouse
EMBL-EBI/PRIDE
• Juan Antonio Vizcaíno
• Henning Hermjakob
BILS
• Mikael Borg
• Fredrik Levander
• Bengt Persson
ELIXIR Hub
• Rafael C Jimenez
Thank you for your attention
Delegation of raw data
Submissions
Data repository: processed data, metadata, PID
Search
Integration
National proteomics centers: metadata, results, raw data
Central repository: metadata, results, raw data
Data storage centers: metadata, raw data