9/6/2016 | 3
Current state of proteomics standardization and (C-)HPP data quality guidelines
DTL focus meeting on data integration, standards and FAIR principles in proteomics
Department of Pharmacy, Analytical Biochemistry
Péter Horvatovich
First guideline of C-HPP
Paik YK, et al., Standard Guidelines for the Chromosome-Centric Human Proteome Project, PMID 22443261.
The key to making real headway on the HPP is to agree on a common, shared, globally acceptable “big data” language
Slide from Mark Baker
ProteomeXchange
Individual lab-based MS data
PRIDE
MassIVE
GPMdb
PASSEL
PeptideAtlas
neXtProt
HPP Metrics
Human Protein Atlas
HPP Publications
HPP Guidelines
† neXtProt PE1-5 classifications PE1 =PE2 =PE3 =PE4 =PE5 =
The Human Proteome Project Workflow
Slide from Mark Baker
PE level                                neXtProt 18/09/2013 version   %      neXtProt 12/02/2016 version   %
PE1  Evidence at Protein Level          15,649                        77.7   16,518                        82.4
PE2  Evidence at Transcript Level only   3,576                        17.7    2,290                        11.4
PE3  Inferred from Homology                198                         1.0      565                         2.8
PE4  Predicted                              94                         0.5       94                         0.5
PE5  Uncertain                             635                         3.2      588                         2.9
TOTAL                                   20,152                       100     20,055                       100
HPP/neXtProt protein existence data from 2013-2016
the missing proteins
Slide from Mark Baker
Metrics Used by HPP Teams
› Initial 2013 definition of “missing” was “no protein level data or insufficient documentation for ID” (PE2+PE3+PE4+PE5)
› In 2014, revised to PE2+PE3+PE4, as PE5 proteins are considered dubious
Slide from Mark Baker
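The two definitions of “missing” can be applied to the neXtProt counts from the protein existence table earlier in the transcript. The sketch below is only an illustration of that arithmetic; the names `PE_COUNTS`, `DEFINITION_2013`, `DEFINITION_2014` and `missing_proteins` are made up for this example and do not come from the HPP or neXtProt.

```python
# neXtProt protein-existence counts copied from the table in the transcript,
# keyed by the release dates as they are printed there.
PE_COUNTS = {
    "18/09/2013": {"PE1": 15649, "PE2": 3576, "PE3": 198, "PE4": 94, "PE5": 635},
    "12/02/2016": {"PE1": 16518, "PE2": 2290, "PE3": 565, "PE4": 94, "PE5": 588},
}

# The two definitions of "missing" quoted on the slide.
DEFINITION_2013 = ("PE2", "PE3", "PE4", "PE5")  # initial 2013 definition
DEFINITION_2014 = ("PE2", "PE3", "PE4")         # 2014 revision, PE5 excluded


def missing_proteins(counts, levels):
    """Sum the entries of one release over the given PE levels."""
    return sum(counts[level] for level in levels)


for release, counts in PE_COUNTS.items():
    print(
        release,
        "2013 definition:", missing_proteins(counts, DEFINITION_2013),
        "2014 definition:", missing_proteins(counts, DEFINITION_2014),
    )
```

Under the revised definition the 2016 release leaves 2,949 missing proteins, against 3,868 for the 2013 release counted the same way.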
A new protein existence viewer
https://search.nextprot.org/view/statistics/protein-existence
Slide from Lydie Lane
1. Failure to distinguish discriminating (proteotypic) from non-discriminating peptides
2. Inclusion of many low-quality MS spectra
3. Use of short peptides (peptides containing < 7 aa)
4. Use of older database builds
Testing 2014 Claims of Credible MS evidence for 108/200 ORs
Slide from Mark Baker
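Points 1 and 3 of this list lend themselves to a simple filter: keep only peptides of at least 7 residues that map to a single protein. The following is a minimal sketch under that assumption, not the procedure used for the reanalysis; the peptide sequences, the protein labels `PROT_A`/`PROT_B` and the function `usable_peptides` are hypothetical. Spectrum quality (point 2) and database version (point 4) are not covered.

```python
MIN_LENGTH = 7  # threshold from the slide: peptides shorter than 7 aa are not used


def usable_peptides(peptide_to_proteins, min_length=MIN_LENGTH):
    """Keep peptides that are long enough and map to exactly one protein."""
    return {
        peptide: next(iter(proteins))
        for peptide, proteins in peptide_to_proteins.items()
        if len(peptide) >= min_length and len(proteins) == 1
    }


# Hypothetical peptide-to-protein mapping, for illustration only.
example = {
    "AAAGK": {"PROT_A"},                # too short (5 aa)
    "LLSEVEELK": {"PROT_A", "PROT_B"},  # shared between two proteins
    "VTDLNQYER": {"PROT_A"},            # long enough and unique
}

print(usable_peptides(example))
```

Only the third peptide survives: the first fails the length threshold and the second cannot discriminate between the two proteins it maps to.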
133 million PSMs
1 million distinct peptides
14,000 canonical proteins
0.00009 PSM FDR
0.0002 Peptide FDR
0.01 Protein FDR
Only peptides ≥ 7 AA
[Chart: share of proteins, axis from 0% to 100%; 70% marked]
Human peptides in PeptideAtlas 2014-08
Slide from Eric Deutsch
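To get a feel for what the three FDR values mean in absolute terms, each rate can be multiplied by the size of its level. This is a rough back-of-the-envelope sketch using the rounded figures from the slide; the names `LEVELS` and `expected_false` are invented here, and the products are only approximate because the inputs are rounded.

```python
# (number of identifications, FDR) per level, as rounded on the slide.
LEVELS = {
    "PSM": (133_000_000, 0.00009),
    "peptide": (1_000_000, 0.0002),
    "protein": (14_000, 0.01),
}


def expected_false(total, fdr):
    """Approximate number of false identifications implied by an FDR."""
    return total * fdr


for name, (total, fdr) in LEVELS.items():
    print(f"{name}: about {expected_false(total, fdr):,.0f} expected false identifications")
```

A very small PSM-level FDR still corresponds to roughly twelve thousand false PSMs, which is why the rate has to be controlled separately at the peptide and protein levels.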
Only 2 of neXtProt’s 473 olfactory receptors are canonical in PeptideAtlas
Olfactory receptors in PeptideAtlas
Slide from Eric Deutsch
Which protein does the peptide implicate?
Spectrum originally identified to: GYIVAAVVK
But a better and exact match is: GYIAVAVVK
But this latter sequence is not in our reference proteome.
Which is why it was not identified correctly.
Is it olfactory receptor OR5A2? (no other corroborating evidence)
…GIVSVLVVLISYGYIVAAVVKISSATGRTKAFSTCASH…
             GYIAVAVVK
Or is it serotransferrin (0.5 million PSMs)?
…SDNCEDTPEAGYFAIAVVKKSASDLTWDNLKGKKS…
           GYIAVAVVK
I → V: dbSNP:rs2692696 is in our reference proteome from UniProt
F → I: not in our reference proteome. Not in neXtProt.
But this protein has many SNPs, and this may be the explanation
Slide from Eric Deutsch
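The comparison on this slide can be reproduced by sliding the peptide along each protein fragment and counting mismatches. The sketch below assumes a plain position-by-position comparison (no mass or isobaric-residue handling); `best_window` is a helper written for this example, and the two fragments are the ones shown on the slide.

```python
PEPTIDE = "GYIAVAVVK"  # the better, exact match for the spectrum

# Sequence fragments as shown on the slide.
OR5A2 = "GIVSVLVVLISYGYIVAAVVKISSATGRTKAFSTCASH"
SEROTRANSFERRIN = "SDNCEDTPEAGYFAIAVVKKSASDLTWDNLKGKKS"


def best_window(fragment, peptide):
    """Return (mismatch count, window, substitutions) for the closest window.

    Each substitution is (1-based position, residue in protein, residue in peptide).
    """
    best = None
    for start in range(len(fragment) - len(peptide) + 1):
        window = fragment[start:start + len(peptide)]
        subs = [
            (i + 1, a, b)
            for i, (a, b) in enumerate(zip(window, peptide))
            if a != b
        ]
        if best is None or len(subs) < best[0]:
            best = (len(subs), window, subs)
    return best


for name, fragment in (("OR5A2", OR5A2), ("serotransferrin", SEROTRANSFERRIN)):
    print(name, best_window(fragment, PEPTIDE))
```

Both fragments are two substitutions away from the observed peptide. For serotransferrin these are F to I at position 3 and I to V at position 5, the two changes discussed on the slide; the sequence distance alone therefore cannot decide between the two proteins.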
Q9H255 = OR51E2
But GPMdb does have this one. This is the only OR that Ron Beavis thinks is legitimate. But only observed with a single peptide (many times) (in one sample that PeptideAtlas doesn’t have).
Ron Beavis: If you check a little closer, the older gene symbol for OR51E2 is PSGR, a prostate-specific G-coupled receptor protein (Cancer Res. 2000 Dec 1;60(23):6568-72). So, I'd actually suggest that this is a true identification and that interpreting the "OR" in the gene name as being literally true is the problem.
Slide from Eric Deutsch
Growth of Human Proteome with Large Datasets from 2014-2015
Note Savitski/Kuster reanalysis of Wilhelm et al: 14,741 proteins identified, MCP 2015
Slide from Gilbert S. Omenn
PMID 27490519
Latest HPP Guideline
HUPO: MIAPE PSI
Journals:
- Journal of Proteome Research
- Molecular and Cellular Proteomics
- Proteomics Clinical Applications
NIH-NCI: proteogenomics guideline
HPP 1.0: data deposition at ProteomeXchange, FDR at PSM, peptide and protein levels
HPP 2.0: MS data interpretation
Juan A. Vizcaíno
13th HUPO World Congress, Madrid, 5 October 2014
Manuscript detailing the process
Ternent et al., Proteomics, 2014
http://www.proteomexchange.org/submission
Example dataset:
PXD000764
- Title: “Discovery of new CSF biomarkers for meningitis in children”
- 12 runs: 4 controls and 8 infected samples
- Identification and quantification data
PX Data workflow for MS/MS data
1. Mass spectrometer output files: raw data (binary files) or peak list spectra in a standardized format (mzML, mzXML).
2. Result files:
a. Complete submissions: Result files can be converted to PRIDE XML or the mzIdentML data standard.
b. Partial submissions: For workflows not yet supported by PRIDE, search engine output files will be stored and provided in their original form.
3. Metadata: Sufficiently detailed description of sample origin, workflow, instrumentation, submitter.
4. Other files: Optional files:
a. QUANT: Quantification related results
b. PEAK: Peak list files
c. GEL: Gel images
d. OTHER: Any other file type
e. FASTA
f. SP_LIBRARY
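The distinction between complete and partial submissions in point 2 comes down to whether the result files are in PRIDE XML or mzIdentML. A minimal sketch of that rule, assuming a simple in-memory description of the submitted files: the class `SubmittedFile`, the function `submission_type`, the category labels `RAW`, `RESULT` and `SEARCH`, and all file names are hypothetical and not part of any ProteomeXchange tool.

```python
from dataclasses import dataclass

# Result formats that, per the slide, make a submission "complete".
STANDARD_RESULT_FORMATS = {"PRIDE XML", "mzIdentML"}


@dataclass(frozen=True)
class SubmittedFile:
    name: str         # hypothetical file name
    category: str     # e.g. "RAW", "RESULT", "SEARCH", "PEAK", "QUANT" (assumed labels)
    file_format: str  # e.g. "mzIdentML", "PRIDE XML", "mzML", "native"


def submission_type(files):
    """Return "complete" if any result file uses a standard format, else "partial"."""
    has_standard_result = any(
        f.category == "RESULT" and f.file_format in STANDARD_RESULT_FORMATS
        for f in files
    )
    return "complete" if has_standard_result else "partial"


complete = [
    SubmittedFile("run01.raw", "RAW", "native"),
    SubmittedFile("run01.mzid", "RESULT", "mzIdentML"),
]
partial = [
    SubmittedFile("run01.raw", "RAW", "native"),
    SubmittedFile("run01.search_output", "SEARCH", "native"),
]

print(submission_type(complete), submission_type(partial))
```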
[Diagram labels: Published, Raw files, Other files]
Complete vs Partial submissions: experimental metadata
Complete Partial
General experimental metadata about the projects is similar.
However, assay-level information in partial submissions is less detailed.
Complete
Partial
Complete vs Partial submissions: processed results
For complete submissions, the spectra can be connected to the processed identification results, and both can be visualized.
Complete submissions using mzIdentML
[Diagram labels: Search engine results + MS files; Search engines; mzIdentML]
- Mascot
- MSGF+
- MyriMatch and related tools from D. Tabb’s lab
- OpenMS
- PEAKS
- ProCon (ProteomeDiscoverer, Sequest)
- Scaffold
- TPP via the idConvert tool (ProteoWizard)
- ProteinPilot (planned by the end of 2014)
- Others: library for X!Tandem conversion, lab internal pipelines, …
An increasing number of tools support export to mzIdentML 1.1
- Referenced spectral files need to be submitted as well (all open formats are supported).
Updated list: http://www.psidev.info/tools-implementing-mzIdentML#.
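Because the referenced spectral files have to be submitted together with the mzIdentML file, it can help to list what a result file points to before uploading. The sketch below assumes mzIdentML 1.1, where each input spectrum file is declared in a `SpectraData` element with a `location` attribute; the embedded XML is a stripped-down fragment rather than a valid document, and the helper names and file names are hypothetical.

```python
import xml.etree.ElementTree as ET

# Namespace of mzIdentML 1.1 documents.
MZID_NS = {"mzid": "http://psidev.info/psi/pi/mzIdentML/1.1"}

# Stripped-down fragment for illustration; a real file contains much more.
EXAMPLE = """<MzIdentML xmlns="http://psidev.info/psi/pi/mzIdentML/1.1">
  <DataCollection>
    <Inputs>
      <SpectraData id="SD_1" location="file:///data/run01.mzML"/>
      <SpectraData id="SD_2" location="file:///data/run02.mzML"/>
    </Inputs>
  </DataCollection>
</MzIdentML>"""


def referenced_spectra(mzid_text):
    """Return the file names of all spectra files referenced by the document."""
    root = ET.fromstring(mzid_text)
    return [
        element.get("location").rsplit("/", 1)[-1]
        for element in root.iterfind(".//mzid:SpectraData", MZID_NS)
    ]


def missing_spectra(mzid_text, submitted_names):
    """Return referenced spectra files that are absent from the submission."""
    return [name for name in referenced_spectra(mzid_text) if name not in submitted_names]


print(referenced_spectra(EXAMPLE))
print(missing_spectra(EXAMPLE, {"run01.mzML"}))
```

Matching on the bare file name is a simplification: the `location` values are often absolute paths from the machine that ran the search.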
Tools | ‘RESULT’ file generation | Final ‘RESULT’ file
Mascot, ProteinPilot, Scaffold, PEAKS, MSGF+, others | Now: native file export | mzIdentML ‘RESULT’ + spectra files
Manual Inspection of Extraordinary Claims
› Reviewers and readers (and authors) need to see this:
Slide from Eric Deutsch
Manual Inspection of Extraordinary Claims
› Reviewers and readers should not see this:
› This is what false positives look like
Slide from Eric Deutsch