proteomics data standards
TRANSCRIPT
EMBL-EBI Now and in the Future
Introduction to the PSI standard data formats
Dr. Juan Antonio Vizcaíno
Proteomics Team Leader
EMBL-EBI
Hinxton, Cambridge, UK
Juan A. [email protected]
Proteomics Bioinformatics Course 2016
Hinxton, 8 December 2016
Overview
A couple of slides about the need for data standards
The Proteomics Standards Initiative
Existing data standards
Connection with ProteomeXchange and IMEx
Standards are needed in real life: also in bioinformatics
With a small number of standards, converters are feasible
Data standards are needed
Taken from Biocomicals, http://biocomicals.blogspot.com
Mass Spectrometry (MS)-based proteomics
Many different workflows -> Many different data types -> Need for several data standards.
Discovery mode:
Bottom-up proteomics
Data-dependent acquisition (DDA)
Data-independent acquisition (DIA)
Top-down proteomics
Targeted mode: SRM (Selected Reaction Monitoring)
HUPO Proteomics Standards Initiative
Develops data standards for proteomics: both data representation and annotation standards.
Involves data producers, database providers, software producers, publishers, and everyone who wants to be involved.
Active workgroups: MI, MS, PI, Mod and the new QC.
Inter-group activities: MIAPE and controlled vocabularies.
Started in 2002, so some experience already.
One annual meeting in March-April, plus regular phone calls.
Close interaction with the metabolomics community (MSI).
http://www.psidev.info
PSI Deliverables
Minimum information (MIAPE) specifications: Format-independent specification of minimum information guidelines.
Formats: Usually XML-based (but also tab-delimited files), capable of representing the relevant Minimum Information, plus additional detailed data for the domain.
Controlled vocabularies: Usually an OBO-style hierarchical controlled vocabulary precisely defining the metadata that are encoded in the formats.
Databases and Tools: Foster open software implementations to make the standards truly useful.
Community interaction to ensure deposition of data in public repositories.
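To make the "OBO-style hierarchical controlled vocabulary" deliverable more concrete, here is a minimal sketch of reading [Term] stanzas into a dictionary. The stanza text is a hand-written approximation of PSI-MS entries (an assumption for illustration, not copied from the released vocabulary file), and parse_obo_terms is a hypothetical helper, not part of any PSI library.

```python
# Hand-written approximation of two PSI-MS term stanzas (illustrative only).
OBO_TEXT = """\
[Term]
id: MS:1000511
name: ms level
is_a: MS:1000499 ! spectrum attribute

[Term]
id: MS:1000514
name: m/z array
is_a: MS:1000513 ! binary data array
"""


def parse_obo_terms(text):
    """Return {term_id: {"name": str, "is_a": [parent ids]}} for each [Term] stanza."""
    terms = {}
    current = None
    for raw in text.splitlines():
        line = raw.strip()
        if line == "[Term]":
            current = {"name": None, "is_a": []}
            continue
        if line.startswith("["):
            # Any other stanza type (for example [Typedef]) is skipped.
            current = None
            continue
        if not line or current is None or ": " not in line:
            continue
        tag, value = line.split(": ", 1)
        # Drop the trailing "! human readable name" comment, if present.
        value = value.split(" ! ", 1)[0].strip()
        if tag == "id":
            terms[value] = current
        elif tag == "name":
            current["name"] = value
        elif tag == "is_a":
            current["is_a"].append(value)
    return terms


terms = parse_obo_terms(OBO_TEXT)
print(terms["MS:1000511"]["name"])
print(terms["MS:1000514"]["is_a"])
```

The is_a links are what make the vocabulary hierarchical: following them from a term leads to progressively more general parents.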
PSI MS Controlled Vocabulary
Mayer et al., Database, 2013
~2,600 terms by October 2016
The Ontology Lookup Service (OLS)
http://www.ebi.ac.uk/ontology-lookup/
MIAPE guidelines
Minimum Information About a Proteomics Experiment guidelines.
Set of experimental and technical metadata that are needed to make one experiment reproducible.
They cover different aspects: mass spectrometry, informatics (identification and quantification), particular techniques, etc.
Published since 2008, but their adoption has been limited
The typical dilemma
Data standards need to be stable to promote adoption
Proteomics standards need to evolve very rapidly:
Data is inherently very complex
Experimental techniques are evolving all the time
Existing data standards in proteomics
MS data: mzML (also used in MS metabolomics).
Protein and peptide identification: mzIdentML.
Peptide and protein quantification: mzQuantML.
SRM transitions (for targeted proteomics): TraML.
Molecular interactions: PSI MI XML and MITAB.
mzTab: identification and quantification results for peptides, proteins and small molecules (also used in MS metabolomics).
www.psidev.info
Current PSI Standard File Formats for MS
Data formats for mass spectra data
Binary data
XML-based files: mzData, mzXML, mzML
Peak lists: .dta, .pkl, .mgf, .ms2
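For comparison with the XML-based formats, a peak list is little more than text. The sketch below reads a Mascot Generic Format (.mgf) style block under simplifying assumptions: one spectrum per BEGIN IONS/END IONS pair, KEY=VALUE header lines, and whitespace-separated m/z and intensity pairs. The spectrum values are invented, and parse_mgf is a hypothetical helper rather than an existing library call.

```python
# Invented toy spectrum in MGF style (values are not real data).
MGF_TEXT = """\
BEGIN IONS
TITLE=example spectrum 1
PEPMASS=500.25 12000
CHARGE=2+
100.1 50.0
200.2 150.5
END IONS
"""


def parse_mgf(text):
    """Return a list of spectra, each with a 'params' dict and a 'peaks' list."""
    spectra = []
    current = None
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue
        if line == "BEGIN IONS":
            current = {"params": {}, "peaks": []}
        elif line == "END IONS":
            if current is not None:
                spectra.append(current)
            current = None
        elif current is not None:
            if "=" in line and not line[0].isdigit():
                # Header line such as TITLE=... or CHARGE=2+
                key, value = line.split("=", 1)
                current["params"][key] = value
            else:
                # Peak line: m/z followed by intensity
                fields = line.split()
                current["peaks"].append((float(fields[0]), float(fields[1])))
    return spectra


for spectrum in parse_mgf(MGF_TEXT):
    print(spectrum["params"]["TITLE"], len(spectrum["peaks"]))
```

Note how little metadata survives in such a file: instrument settings, units and processing steps have no standard place, which is part of the motivation for mzML.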
An example of a success story: mzML
A data format for the storage and exchange of MS output files
Designed by merging the best aspects of both mzData and mzXML
Developed with full participation of academic researchers, hardware and software vendors
Expected to replace mzXML and mzData, but not expected to completely replace vendor binary formats
Captures spectra (raw data or peak lists), chromatograms and related metadata
Version 1.0 released in June 2008, v1.1 released in June 2009
Many implementations already exist
Version 1.2 with enhanced compression considered for the near future.
Martens et al., MCP, 2011
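One detail that helps when reading mzML by hand: the m/z and intensity arrays are stored as base64 text holding little-endian 32-bit or 64-bit floats, optionally zlib-compressed. The sketch below shows that round trip using only the Python standard library. The function names are hypothetical, and a real reader would take the precision and compression flags from the cvParam entries of each binary data array rather than from function arguments.

```python
import base64
import struct
import zlib


def encode_array(values, compress=True):
    """Pack floats as little-endian 64-bit values, optionally zlib-compress, then base64."""
    raw = struct.pack("<%dd" % len(values), *values)
    if compress:
        raw = zlib.compress(raw)
    return base64.b64encode(raw).decode("ascii")


def decode_array(text, compressed=True, precision=64):
    """Reverse of encode_array; precision is 64 or 32 bits per value."""
    raw = base64.b64decode(text)
    if compressed:
        raw = zlib.decompress(raw)
    size, code = (8, "d") if precision == 64 else (4, "f")
    count = len(raw) // size
    return list(struct.unpack("<%d%s" % (count, code), raw))


mz_values = [100.5, 200.25, 300.125]
encoded = encode_array(mz_values)
print(encoded)
print(decode_array(encoded))
```

This is also why the "enhanced compression" mentioned for a future version matters: the binary arrays dominate file size.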
The important note here is that all the vendors are on board, and that mzML is considered an export format by instrument vendors and an import format by commercial software vendors. Some commercial systems (like Indigos), and some academic systems like TPP have shown that it is possible to use XML formats as a native format, but this is not a requirement for the standard to be effective. It is expected that data analysis and archival software will tend to be the early adopters of the specification since it helps them the most.
An example of a success story: mzML
The most popular search engines support mzML
Many parser libraries available
Conversion from raw files into mzML
http://www.psidev.info/mzml_1_0_0
Application of mzML to metabolomics
Data formats for output from search engines
mzIdentML, Mascot .dat, SEQUEST .out, SpectrumMill .spo, pep.xml, prot.xml
Only qualitative data!
mzIdentML: peptide and protein identifications
Overview
XML-based data standard for peptide and protein identifications, e.g. following database search and protein inference.
Sections for all PSMs, proteins/protein groups inferred, protocols/parameters etc.
Timeline:
Original 1.0 version in Aug 2009.
Version 1.1 stable (Aug 2011).
Manuscript published in MCP in 2012*.
2012-2016: Improving support for protein grouping, multiple search engines, pre-fractionation approaches and de novo sequencing.
Now firmly embedded as part of the ProteomeXchange submission process, and supported by lots of external software.
* Jones, A. R., Eisenacher, M., Mayer, G., Kohlbacher, O., et al., The mzIdentML data standard for mass spectrometry-based proteomics results. Molecular & Cellular Proteomics 2012, 11, M111.014381.
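As a rough illustration of the "sections for all PSMs" idea, the sketch below pulls peptide-spectrum matches out of a heavily trimmed mzIdentML 1.1 fragment with the standard library XML parser. The fragment is hand-written and deliberately incomplete (it would not validate against the schema), the identifiers are invented, and list_psms is a hypothetical helper, not the API of jmzIdentML or any other library.

```python
import xml.etree.ElementTree as ET

# Namespace used by mzIdentML 1.1 documents.
NS = {"mzid": "http://psidev.info/psi/pi/mzIdentML/1.1"}

# Hand-written, incomplete fragment with invented identifiers.
MZID_SNIPPET = """\
<MzIdentML xmlns="http://psidev.info/psi/pi/mzIdentML/1.1">
  <DataCollection>
    <AnalysisData>
      <SpectrumIdentificationList id="SIL_1">
        <SpectrumIdentificationResult id="SIR_1" spectrumID="index=0" spectraData_ref="SD_1">
          <SpectrumIdentificationItem id="SII_1" rank="1" chargeState="2"
              peptide_ref="PEP_1" experimentalMassToCharge="500.25"
              calculatedMassToCharge="500.24" passThreshold="true"/>
        </SpectrumIdentificationResult>
      </SpectrumIdentificationList>
    </AnalysisData>
  </DataCollection>
</MzIdentML>
"""


def list_psms(xml_text):
    """Return one dict per SpectrumIdentificationItem found in the document."""
    root = ET.fromstring(xml_text)
    psms = []
    for result in root.iterfind(".//mzid:SpectrumIdentificationResult", NS):
        for item in result.iterfind("mzid:SpectrumIdentificationItem", NS):
            psms.append({
                "spectrum": result.get("spectrumID"),
                "peptide_ref": item.get("peptide_ref"),
                "rank": int(item.get("rank")),
                "charge": int(item.get("chargeState")),
                "passed": item.get("passThreshold") == "true",
            })
    return psms


for psm in list_psms(MZID_SNIPPET):
    print(psm)
```

For real files, which can be very large, a streaming parser or one of the dedicated libraries is a better choice than loading the whole document at once.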
Data standard for peptide and protein identification data
mzIdentML 1.1 (2011-2012)
mzIdentML 1.2 (2017)
New support for: cross-linking approaches, peptide-level scores, modification localization scores, proteogenomics approaches
Improved support for: protein inference, pre-fractionation, de novo sequencing, spectral library searches
Increasingly supported by the most-used proteomics software and databases
jmzIdentML, mzid Library, ms-data-core-api, MyriMatch, ProteoAnnotator, PIA, ProCon
Wide variety of quantification techniques
mzQuantML: Standard for quantitative data
Overview
XML-based standard for quantification data following use of quant software
Can report tables of data (QuantLayers): columns are StudyVariables, Assays or Ratios; rows are ProteinGroups, Proteins or Peptides
Can also capture 2D coordinates of quantified regions in LC-MS (Features)
Timeline
Version 1.0 rc-1 submitted to the PSI process in October 2011
Version 1.0 rc-2 in June 2012
Re-submitted to the PSI process in October 2012
Completed the PSI process in Feb 2013: version 1.0 release
Supports label-free (intensity), label-free (spectral counting), MS2 tag techniques (e.g. iTRAQ) and MS1 label techniques (e.g. SILAC)
Schema is fixed, with each technique defined by separate semantic rules, implemented in validator software
Manuscript published in MCP in summer 2013*
Updated to support SRM as a new technique** (version 1.0.1 just submitted to the document process).
*Walzer et al. MCP 2013 Aug;12(8):2332-40. doi: 10.1074/mcp.O113.028506
**Qi et al. PROTEOMICS, 2015
The last addition: mzTab
Aims and concept
To provide a simple and efficient way of exchanging results from MS approaches.
Simpler summary report of the experimental results:
Peptides and proteins identified in a given experimental setting
Small molecules identified
Reported quantification values
Technical and biological metadata
Easier to parse and use by the research community, systems biologists as well as providers of knowledge bases.
It can be used by non-experts in bioinformatics.
It does not aim to replace mzIdentML and mzQuantML.
mzTab - Sections
Griss et al., MCP, 2014
Metadata section - Example
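Because every mzTab line starts with a three-letter prefix naming its section, the file can be split into sections with very little code. The sketch below assumes the usual prefixes (MTD for metadata, PRH/PRT for the protein header and rows, PEH/PEP for peptides, PSH/PSM for PSMs, SMH/SML for small molecules, COM for comments). The toy content is invented and far from a complete valid file, and read_mztab is a hypothetical helper, not the jmzTab API.

```python
# Invented toy content; a real mzTab file carries many more mandatory fields.
MZTAB_TEXT = "\n".join([
    "COM\tHand-written toy example, not a complete valid file",
    "MTD\tmzTab-version\t1.0.0",
    "MTD\tmzTab-mode\tSummary",
    "MTD\tmzTab-type\tIdentification",
    "",
    "PRH\taccession\tdescription\tbest_search_engine_score[1]",
    "PRT\tP12345\tExample protein\t0.01",
    "PRT\tQ67890\tAnother protein\t0.02",
])

# Header prefix -> prefix of the data rows that follow it.
HEADER_FOR = {"PRH": "PRT", "PEH": "PEP", "PSH": "PSM", "SMH": "SML"}


def read_mztab(text):
    """Return (metadata dict, {row_prefix: [row dicts]})."""
    metadata = {}
    headers = {}
    tables = {}
    for line in text.splitlines():
        if not line.strip():
            continue
        fields = line.split("\t")
        prefix = fields[0]
        if prefix == "COM":
            continue
        if prefix == "MTD":
            metadata[fields[1]] = fields[2]
        elif prefix in HEADER_FOR:
            row_prefix = HEADER_FOR[prefix]
            headers[row_prefix] = fields[1:]
            tables[row_prefix] = []
        elif prefix in headers:
            tables[prefix].append(dict(zip(headers[prefix], fields[1:])))
    return metadata, tables


metadata, tables = read_mztab(MZTAB_TEXT)
print(metadata)
print(tables["PRT"])
```

The same file opens directly in a spreadsheet, which is the point of the format: results that non-experts in bioinformatics can use.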
mzTab: current implementations
jmzTab (Java API). Manuscript published in the journal Proteomics.
mzTab Validator, PRIDE XML to mzTab converter (PRIDE team).
Mascot (Matrix Science) exporter.
MaxQuant: exporter in beta is available.
ProteomeDiscoverer: exporter in beta.
OpenMS (version 1.10).
mzTab ongoing development in metabolomics
More detailed modelling of MS metabolomics data (version 1.1):
Led by A. Jones (Liverpool University)
Extension from one to three sections for metabolomics.
Also applicable to lipidomics data.
Software will also be extended to support the new version.
http://www.cosmos-fp7.eu/
Unify exchange of transitions with TraML
PSI's TraML (Transitions Markup Language)
Format for encoding SRM/MRM transitions
Version 1.0.0 now released and published in MCP (Deutsch et al. 2012)
[Figure: TraML as the exchange format linking journal articles, transitions databases, Excel sheets, SRM analysis software and instruments]
TraML software implementations
Java library for working with TraML files
Aims:
- command line & simple GUI
- TraML to TSV, TSV to TraML
- TSV vendor formats from TSQ, QTRAP5500, AgilentQQQ
Published: Helsens et al., JPR, 2011
PSI document process
Every data standard has to undergo a thorough review process.
In practice, two review processes happen in parallel: the PSI review and the manuscript review.
Data standard publications
mzML (data standard for MS data): Martens et al., MCP, 2011
mzIdentML (standard for peptide/protein IDs): Jones et al., MCP, 2012
TraML (for SRM transitions): Deutsch et al., MCP, 2012
mzQuantML (for quantitative data): Walzer et al., MCP, 2013
mzTab (peptide/protein ID and quantification): Griss et al., MCP, 2014
Some updates already going on (e.g. mzIdentML 1.2)
Importance of making software available
jmzML (https://github.com/PRIDE-Utilities/jmzml): Cote et al., Proteomics, 2009
jmzIdentML (https://github.com/PRIDE-Utilities/jmzidentML): Reisinger et al., Proteomics, 2012
jmzReader (https://github.com/PRIDE-Utilities/jmzReader): Griss et al., Proteomics, 2012
jmzQuantML (https://github.com/UKQIDA/jmzquantml): Qi et al., Proteomics, 2014
jmzTab (https://github.com/PRIDE-Utilities/jmzTab): Xu et al., Proteomics, 2014
ms-data-core-api (https://github.com/PRIDE-Utilities/ms-data-core-api): Perez-Riverol et al., Bioinformatics, 2015
PSI promotes implementations. The reference libraries are always open source and can be used by anyone!
Under development: Proteogenomics related formats
Two formats are under development: proBed and proBAM.
Same overall objective: to map identified peptides to genome coordinates.
Different level of detail:
proBed is tab-delimited and simpler, based on the original BED format. Lower level of detail.
proBAM is based on the original SAM/BAM formats, widely used in genomics. Much higher level of detail.
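To show what "mapping identified peptides to genome coordinates" looks like in a BED-style file, the sketch below writes one peptide as a standard 12-column BED line. It covers only the generic BED columns; proBed adds further proteomics-specific columns that are not modelled here. The coordinates and peptide sequence are invented, and peptide_to_bed12 is a hypothetical helper.

```python
def peptide_to_bed12(chrom, blocks, name, strand, score=0):
    """Build a 12-column BED line for a peptide.

    blocks: sorted list of (start, end) genomic intervals, 0-based and
    end-exclusive, one per exon segment the peptide's codons map to.
    """
    chrom_start = blocks[0][0]
    chrom_end = blocks[-1][1]
    block_sizes = ",".join(str(end - start) for start, end in blocks)
    # Block starts are relative to chrom_start, as BED requires.
    block_starts = ",".join(str(start - chrom_start) for start, _ in blocks)
    fields = [
        chrom, chrom_start, chrom_end, name, score, strand,
        chrom_start, chrom_end,  # thickStart, thickEnd: the whole peptide
        "0",                     # itemRgb: no specific colour
        len(blocks), block_sizes, block_starts,
    ]
    return "\t".join(str(field) for field in fields)


# Invented example: a 9-residue peptide (27 nucleotides) spanning a splice junction.
line = peptide_to_bed12("chr1", [(1000, 1012), (2000, 2015)], "PEPTIDEKR", "+")
print(line)
```

A peptide that crosses an exon boundary becomes two blocks, which is how a genome browser can draw it across the intron.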
Provide your own data to genome browsers
And also protein-protein interactions
PSI-MI XML: XML-based format
Version 2.5 is the working version
Version 3.0 under development
MITAB: tab-delimited format
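The tab-delimited MITAB flavour is simple enough to read with a few lines of code. The sketch below assumes the 15-column MITAB 2.5 layout, where each row is one binary interaction and controlled vocabulary fields look like psi-mi:"MI:0018"(two hybrid). The short column labels are my own, the example row is invented (accessions, author and publication identifier are placeholders), and parse_mitab_line is a hypothetical helper.

```python
import re

# Short labels (my own naming) for the 15 MITAB 2.5 columns, in order.
MITAB25_COLUMNS = [
    "id_a", "id_b", "alt_ids_a", "alt_ids_b", "aliases_a", "aliases_b",
    "detection_method", "first_author", "publication", "taxid_a", "taxid_b",
    "interaction_type", "source_db", "interaction_id", "confidence",
]

# Matches controlled vocabulary fields such as psi-mi:"MI:0018"(two hybrid)
CV_FIELD = re.compile(r'^psi-mi:"(MI:\d+)"\((.*)\)$')

# Invented example row; "-" marks an empty column.
EXAMPLE_LINE = "\t".join([
    "uniprotkb:P12345", "uniprotkb:Q67890", "-", "-", "-", "-",
    'psi-mi:"MI:0018"(two hybrid)', "Smith et al. (2010)", "pubmed:12345678",
    "taxid:9606(human)", "taxid:9606(human)",
    'psi-mi:"MI:0915"(physical association)', 'psi-mi:"MI:0469"(IntAct)',
    "intact:EBI-0000000", "-",
])


def parse_mitab_line(line):
    """Return a dict keyed by MITAB25_COLUMNS, with CV fields split into accession and name."""
    values = line.rstrip("\n").split("\t")
    record = dict(zip(MITAB25_COLUMNS, values))
    for key in ("detection_method", "interaction_type"):
        match = CV_FIELD.match(record[key])
        if match:
            record[key] = {"accession": match.group(1), "name": match.group(2)}
    return record


record = parse_mitab_line(EXAMPLE_LINE)
print(record["id_a"], record["id_b"], record["detection_method"])
```

The trade-off mirrors mzTab versus mzIdentML: the tabular form is easy to filter and load, while the XML form carries the full experimental detail.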
The premise behind the test is the activation of downstream reporter gene(s) by the binding of a transcription factor onto an upstream activating sequence (UAS). For two-hybrid screening, the transcription factor is split into two separate fragments, called the binding domain (BD) and activating domain (AD). The BD is the domain responsible for binding to the UAS and the AD is the domain responsible for the activation of transcription.
Overview of the two-hybrid assay, checking for interactions between two proteins, called here Bait and Prey.
A. The Gal4 transcription factor gene produces a two-domain protein (BD and AD), which is essential for transcription of the reporter gene (LacZ).
B, C. Two fusion proteins are prepared: Gal4BD+Bait and Gal4AD+Prey. Neither of them alone is usually sufficient to initiate transcription of the reporter gene.
D. When both fusion proteins are produced and the Bait part of the first interacts with the Prey part of the second, transcription of the reporter gene occurs.
ProteomeXchange: A Global, distributed proteomics database
Goal: Development of a framework to allow standard data submission and dissemination pipelines between the main existing proteomics repositories.
Repositories: PASSEL (SRM data), PRIDE (MS/MS data), MassIVE (MS/MS data), jPOST (MS/MS data; new in 2016)
Figure labels: Raw, ID/Q, Meta
Mandatory raw data deposition since July 2015
http://www.proteomexchange.org
Vizcaíno et al., Nat Biotechnol, 2014
Deutsch et al., NAR, 2017, in press
MBInfo
The IMEx Consortium (www.imexconsortium.org)
Orchard et al., Nat Methods, 2012
Do you want to learn more?
Questions?