Background
Spectral library searching
Spectral library searching is an alternative approach to traditional sequence database searching for peptide inference from MS/MS spectra. It is particularly suited for targeted proteomics2 applications, in which one seeks not to discover novel peptides, but to find and study expected peptides in the sample.
Previously observed and confidently identified peptide MS/MS spectra are collected and catalogued in data repositories, such as PeptideAtlas.
Repeated observations of the same peptide ion are combined to create a “consensus spectrum” of that particular peptide ion.
Searchable spectral libraries of consensus spectra are built for fast indexed searching.
During searching, each unknown query spectrum is compared to candidate library spectra one by one; high spectral similarity indicates positive identification.
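The search step described above can be sketched as follows. This is a minimal illustration, not the SpectraST implementation: the 1 m/z bin width, the 3 m/z precursor tolerance, the normalized dot product as the similarity measure, and all names (`bin_peaks`, `dot_product`, `search`, the example peptide labels) are assumptions made for the example.

```python
import bisect
import math

def bin_peaks(peaks, bin_width=1.0):
    """Sum intensities into m/z bins and scale the vector to unit length."""
    binned = {}
    for mz, intensity in peaks:
        key = int(mz / bin_width)
        binned[key] = binned.get(key, 0.0) + intensity
    norm = math.sqrt(sum(v * v for v in binned.values()))
    if norm == 0.0:
        return {}
    return {k: v / norm for k, v in binned.items()}

def dot_product(peaks_a, peaks_b, bin_width=1.0):
    """Normalized dot product (0 to 1) between two peak lists."""
    a = bin_peaks(peaks_a, bin_width)
    b = bin_peaks(peaks_b, bin_width)
    return sum(v * b.get(k, 0.0) for k, v in a.items())

def search(query, library, tolerance=3.0):
    """Compare the query to every library entry whose precursor m/z lies
    within the tolerance. The library must be sorted by precursor m/z.
    Returns (score, name) pairs, best hit first."""
    mzs = [entry["precursor_mz"] for entry in library]
    lo = bisect.bisect_left(mzs, query["precursor_mz"] - tolerance)
    hi = bisect.bisect_right(mzs, query["precursor_mz"] + tolerance)
    hits = [(dot_product(query["peaks"], entry["peaks"]), entry["name"])
            for entry in library[lo:hi]]
    return sorted(hits, reverse=True)

# Hypothetical library of three consensus spectra, sorted by precursor m/z.
library = [
    {"name": "PEPTIDEA/2", "precursor_mz": 500.3,
     "peaks": [(100.1, 10.0), (200.2, 50.0), (300.3, 20.0)]},
    {"name": "PEPTIDEB/2", "precursor_mz": 501.0,
     "peaks": [(150.1, 30.0), (250.2, 5.0)]},
    {"name": "PEPTIDEC/2", "precursor_mz": 700.0,
     "peaks": [(100.1, 10.0), (200.2, 50.0), (300.3, 20.0)]},
]
query = {"precursor_mz": 500.4,
         "peaks": [(100.2, 11.0), (200.1, 48.0), (300.4, 19.0)]}
hits = search(query, library)
for score, name in hits:
    print(f"{name}: dot = {score:.3f}")
```

Only the two entries near the query precursor m/z are scored, which is where the reduced search space comes from; the third entry is never compared even though its peaks would match.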
Quality filters
Three types of questionable spectra in the resulting consensus spectral library are subject to quality filters:
Spectra that have look-alikes in the library with non-homologous peptide IDs
• Often sequence-search false positives.
• Could also lead to false negatives when look-alikes end up as second hits, artificially depressing the delta score.
Spectra with many unexplained large peaks
• Often sequence-search false positives.
• Even if true, often not representative of the peptide ion due to contamination, leading to false positives.
Spectra with only one replicate (singletons)
• Often sequence-search false positives, especially if a large enough pool of spectra is compiled.
• No consensus is made; raw spectra are of poorer quality.
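The three filters can be sketched as flags on library entries. This is an illustration under stated assumptions, not SpectraST's actual criteria: the cutoffs (`dot_cutoff`, `unexplained_cutoff`), the precomputed `unexplained_fraction` field, and the shortcut of treating any two different peptide sequences as non-homologous are all hypothetical. The levels Q0, Q1 and Q2 follow the quality levels tabulated on this poster.

```python
import math
from itertools import combinations

def cosine(a, b):
    """Cosine similarity of two {m/z bin: intensity} dictionaries."""
    num = sum(v * b.get(k, 0.0) for k, v in a.items())
    den = (math.sqrt(sum(v * v for v in a.values()))
           * math.sqrt(sum(v * v for v in b.values())))
    return num / den if den else 0.0

def quality_flags(library, dot_cutoff=0.7, unexplained_cutoff=0.5):
    """Return {entry name: set of flags}. Cutoffs are illustrative only."""
    flags = {entry["name"]: set() for entry in library}
    for entry in library:
        if entry["n_replicates"] == 1:
            flags[entry["name"]].add("singleton")
        if entry["unexplained_fraction"] > unexplained_cutoff:
            flags[entry["name"]].add("impure")
    # Assumption: different sequences count as conflicting IDs; a real
    # filter would first test the two peptides for homology.
    for a, b in combinations(library, 2):
        if (a["peptide"] != b["peptide"]
                and cosine(a["peaks"], b["peaks"]) >= dot_cutoff):
            flags[a["name"]].add("conflicting_look_alike")
            flags[b["name"]].add("conflicting_look_alike")
    return flags

def apply_filter(library, level):
    """Q0 keeps everything; Q1 drops impure spectra and conflicting
    look-alikes; Q2 additionally drops singletons."""
    dropped = {
        "Q0": set(),
        "Q1": {"impure", "conflicting_look_alike"},
        "Q2": {"impure", "conflicting_look_alike", "singleton"},
    }[level]
    flags = quality_flags(library)
    return [e for e in library if not (flags[e["name"]] & dropped)]

# Hypothetical five-entry library.
example_library = [
    {"name": "AAA/2", "peptide": "AAA", "n_replicates": 6,
     "unexplained_fraction": 0.1, "peaks": {100: 10.0, 200: 50.0, 300: 20.0}},
    {"name": "BBB/3", "peptide": "BBB", "n_replicates": 1,
     "unexplained_fraction": 0.1, "peaks": {100: 11.0, 200: 48.0, 300: 19.0}},
    {"name": "CCC/2", "peptide": "CCC", "n_replicates": 2,
     "unexplained_fraction": 0.8, "peaks": {400: 5.0, 500: 9.0}},
    {"name": "DDD/2", "peptide": "DDD", "n_replicates": 1,
     "unexplained_fraction": 0.0, "peaks": {600: 7.0, 700: 3.0}},
    {"name": "EEE/2", "peptide": "EEE", "n_replicates": 4,
     "unexplained_fraction": 0.2, "peaks": {800: 7.0, 900: 3.0}},
]
for level in ("Q0", "Q1", "Q2"):
    kept = [e["name"] for e in apply_filter(example_library, level)]
    print(level, kept)
```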
Development of a Spectral Library Building Tool and Re-Analysis of Human Plasma PeptideAtlas Datasets using Spectral Searching
Henry Lam1, Eric Deutsch1, James S. Eddes1, Jimmy K. Eng1,2, Nichole King1, Steve Stein3, Ruedi Aebersold1,4
1Institute for Systems Biology, Seattle, WA
2Fred Hutchinson Cancer Research Center, Seattle, WA
3National Institute of Standards and Technology, Gaithersburg, MD
4Institute of Molecular Systems Biology, ETH Zurich, Zurich, Switzerland
[Figure: An example of a confidently matched query spectrum in spectral searching. Panel labels: Library spectra; Query spectra.]
References
1. Spectral searching and SpectraST: Lam H, Deutsch EW, Eddes JS, Eng JK, King NL, Stein SE, Aebersold R. Development and validation of a spectral library searching method for peptide identification from MS/MS. Proteomics 7(5) (2007).
2. Targeted proteomics: Kuster B, Schirle M, Mallick P, Aebersold R. Scoring proteomes with proteotypic peptide probes. Nature Reviews Molecular Cell Biology 6(7), 577-583 (2005).
3. SEQUEST: Eng JK, McCormack AL, Yates JR III. An approach to correlate tandem mass spectral data of peptides with amino acid sequences in a protein database. Journal of the American Society for Mass Spectrometry 5(11), 976-989 (1994).
4. Trans-Proteomic Pipeline: Keller A, Eng J, Zhang N, Li X-J, Aebersold R. A uniform proteomics MS/MS analysis platform utilizing open XML file formats. Molecular Systems Biology 1, 17 (2005).
5. Mascot: Perkins DN, Pappin DJ, Creasy DM, Cottrell JS. Probability-based protein identification by searching sequence databases using mass spectrometry data. Electrophoresis 20(18), 3551-3567 (1999).
6. X!Tandem: Craig R, Beavis RC. TANDEM: matching proteins with mass spectra. Bioinformatics 20, 1466-1467 (2004).
7. Human Plasma PeptideAtlas: Deutsch EW, Eng JK, Zhang H, King NL, Nesvizhskii AI, Lin B, Lee H, Yi EC, Ossola R, Aebersold R. Human Plasma PeptideAtlas. Proteomics 5(13), 3497-3500 (2005).
8. PeptideProphet: Keller A, Nesvizhskii AI, Kolker E, Aebersold R. Empirical statistical model to estimate the accuracy of peptide identifications made by MS/MS and database search. Analytical Chemistry 74(20), 5383-5392 (2002).
This work is supported by the National Heart, Lung and Blood Institute, National Institutes of Health, under contract No. N01-HV-28179.
Motivation
There has been a recent surge of interest in using spectral library searching as an alternative to sequence searching for identification of peptide MS/MS spectra. In spectral library searching, unknown MS/MS spectra are searched against a carefully compiled library of previously observed and identified peptide MS/MS spectra; a good spectral match indicates a correct identification. This method was shown to offer significant speed gain and superior sensitivity and accuracy compared to traditional sequence searching.1 However, it relies on the availability of a high-quality spectral library covering the proteome or subproteome of interest. Although public spectral libraries, such as those under development at the National Institute of Standards and Technology, are rapidly maturing, a ready-to-deploy tool for creating custom spectral libraries is highly desirable for applications for which suitable public spectral libraries are not yet available. Such a tool would enable individual researchers to build spectral libraries of their own proteomes or subproteomes of interest, as a means of organizing and condensing previous data and facilitating future research.
Methods
Software development: SpectraST (v.3.0)
• Written in C++, with Linux and Windows versions available (http://tools.proteomecenter.org/)
• Performs both spectral library searching1 and spectral library creation
• Minimal resource requirements: can be run on a typical personal computer
• Open-source and readily customizable
• Integrated with the Trans-Proteomic Pipeline (TPP) software suite4 for full workflow support and easy adaptation
• Requires no relational database backend
Consensus spectrum creation
• Pool replicate spectra (identified to the same peptide ion) of high confidence: a PeptideProphet probability cutoff of 0.9 is used
• Remove dissimilar replicates: only the largest cluster of similar spectra is kept
• Align peaks: slightly m/z-shifted peaks from different replicates are aligned
• Peak voting: only peaks present in a majority of replicates are included
• Intensity averaging: intensities of eligible peaks are averaged, weighted by each replicate's signal-to-noise ratio
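The pooling, alignment, voting and averaging steps above can be sketched as follows. This is a simplified illustration rather than the SpectraST algorithm: peak alignment is approximated by rounding m/z values into fixed-width bins, the removal of dissimilar replicates is omitted, a peak's weighted average is taken only over the replicates that contain it, and the names `build_consensus`, `bin_width`, `probability` and `snr` are assumptions made for the example.

```python
from collections import defaultdict

def build_consensus(replicates, min_probability=0.9, bin_width=0.5):
    """Build a consensus peak list from replicate spectra of one peptide ion.

    Each replicate is a dict with 'probability' (identification confidence),
    'snr' (signal-to-noise ratio, > 0) and 'peaks' (list of (m/z, intensity)).
    """
    # Pool only the confidently identified replicates.
    pooled = [r for r in replicates if r["probability"] >= min_probability]
    if not pooled:
        return []

    votes = defaultdict(int)           # replicates containing each bin
    weighted_sum = defaultdict(float)  # sum of snr * intensity per bin
    weight_total = defaultdict(float)  # sum of snr per bin

    for rep in pooled:
        # Align peaks: slightly shifted m/z values fall into the same bin.
        binned = defaultdict(float)
        for mz, intensity in rep["peaks"]:
            binned[round(mz / bin_width)] += intensity
        for key, intensity in binned.items():
            votes[key] += 1
            weighted_sum[key] += rep["snr"] * intensity
            weight_total[key] += rep["snr"]

    # Peak voting: keep bins seen in a strict majority of replicates, then
    # report the signal-to-noise-weighted average intensity.
    majority = len(pooled) / 2
    return [(key * bin_width, weighted_sum[key] / weight_total[key])
            for key in sorted(votes) if votes[key] > majority]

# Hypothetical replicates; the last one fails the 0.9 probability cutoff.
replicates = [
    {"probability": 0.99, "snr": 10.0,
     "peaks": [(100.0, 100.0), (200.1, 50.0), (350.0, 5.0)]},
    {"probability": 0.95, "snr": 30.0,
     "peaks": [(100.1, 80.0), (200.0, 70.0)]},
    {"probability": 0.92, "snr": 20.0,
     "peaks": [(99.9, 90.0), (200.2, 60.0), (410.0, 8.0)]},
    {"probability": 0.50, "snr": 50.0,
     "peaks": [(100.0, 1.0), (500.0, 99.0)]},
]
consensus = build_consensus(replicates)
for mz, intensity in consensus:
    print(f"{mz:.1f}\t{intensity:.2f}")
```

In this example the peaks near m/z 350 and 410 each appear in only one of the three pooled replicates, so voting removes them, which is the kind of noise reduction the peak count statistics in the Results refer to.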
Results
Building a spectral library from the Human Plasma PeptideAtlas7 (http://www.peptideatlas.org/)
• 22 datasets (including 7 from HUPO Plasma Proteome Project)
• 1.4 million spectra positively identified with P > 0.9 (SEQUEST/PeptideProphet8) from a total of 14 million spectra
• Over 30,000 distinct peptide ions among positive identifications
• Library building by SpectraST takes about 3 days of CPU time
• The number of peaks is reduced by more than a factor of 3 during consensus creation, indicating effective noise removal
• Different levels of quality filter stringency were investigated
[Figure: workflow diagram. Search results (.pepXML) from SEQUEST/Mascot5/X!Tandem6 with PeptideProphet, together with spectra (.mzXML), enter SpectraST through raw spectra import to give a library (.splib) with its .spidx and .pepidx index files. SpectraST library manipulation on existing libraries (.splib):
• Union/Intersection
• Filter based on criteria
• Consensus creation
• Quality filter
The result is a new library (.splib) with .spidx and .pepidx files.]
[Figure: peptide identifications (e.g. CVDAGQAK, TTSGGANK, IPGSGQGAR, DGGGENSRQPWHIVK) from Dataset 1, Dataset 2 and Dataset 3 in the data repository are combined into a library of consensus spectra, with a precursor m/z index (791, 792, 842, 907, …).]
[Figure: example library spectra]
• Similar spectra with conflicting IDs (Dot = 0.87): GVM147NAVNNVNNVIAAAFK/2 (6 replicates) and FFTAICDMVAWLGYTPYKVTY/3 (1 replicate)
• ALVLIAFAQYLQQC160PFEDHVK/3 (100 replicates)
• AVDLLFFTDESGDSR/2 (2 replicates): possibly correctly identified, but impure spectrum
• DFFTPNLFLK/3 (1 replicate): false-positive impure spectrum
Re-analysis of the Human Plasma PeptideAtlas datasets by spectral searching
• All datasets used to build the spectral library were re-searched by SpectraST against the library
• Dramatic increase (over 60%) in positively identified spectra with P > 0.9 (SpectraST/PeptideProphet)
• Extra identifications are mostly lower-quality spectra previously missed by sequence searching
• Library searching by SpectraST takes about 3 days of CPU time
Conclusions
• The library-searching tool SpectraST is extended to allow users to build custom spectral libraries.
• The consensus spectrum creation algorithm enables the reduction of noise and spurious peaks, resulting in high-quality, representative spectra.
• A spectral library has been built from the entire Human Plasma PeptideAtlas, and is now available.
• The Human Plasma PeptideAtlas datasets were re-analyzed by searching against this library. Compared to sequence searching by SEQUEST, SpectraST identified over 60% more spectra at the same probability threshold, with much improved sensitivities and false discovery rates.
• Quality filters are shown to have a significant impact on the performance of the search.
[Figure: sensitivity (0.5 to 1) versus false discovery rate (0 to 0.5) for SEQUEST, SpectraST (Q0), SpectraST (Q1) and SpectraST (Q2).]
[Figure: number of positive IDs (millions) for SEQUEST, SpectraST (Q0), SpectraST (Q1) and SpectraST (Q2). Bar labels: 1.385 for SEQUEST; 2.268 (+64%), 2.302 (+66%) and 2.094 (+51%) for SpectraST.]
Quality level: number of spectra remaining
Q0 (No filter): 37,428
Q1 (Removed impure spectra; spectra with look-alikes having conflicting IDs): 30,517
Q2 (Removed impure spectra; spectra with look-alikes having conflicting IDs; singleton spectra): 20,315
[Figure: peak reduction ratio, R (0 to 6), versus number of replicates used to create the consensus (2-3, 4-9, 10-19, 20+).]
R = (Ave. # peaks in replicates) / (# peaks in consensus)
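The ratio R defined above is simple arithmetic; a small sketch makes the definition concrete. The function name `peak_reduction_ratio` and the peak counts in the example are hypothetical and are not taken from the poster's data.

```python
def peak_reduction_ratio(replicate_peak_counts, consensus_peak_count):
    """R = (average number of peaks in the replicates) / (peaks in consensus)."""
    if not replicate_peak_counts or consensus_peak_count <= 0:
        raise ValueError("need at least one replicate and a non-empty consensus")
    average = sum(replicate_peak_counts) / len(replicate_peak_counts)
    return average / consensus_peak_count

# Three hypothetical replicates averaging 300 peaks, consensus with 100 peaks.
example_ratio = peak_reduction_ratio([300, 280, 320], 100)
print(example_ratio)
```

A value above 1 means the consensus spectrum has fewer peaks than a typical replicate that went into it.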
Advantages of spectral searching over traditional sequence searching1
Smaller search space. Only peptide ions known to be observed and identified are included in the library.
More precise scoring. Peak intensities are naturally accounted for and all spectral features, including uncommon fragments, are used for similarity scoring.
Vast improvement in speed. A reduced search space and a simpler scoring algorithm yield a typical speed gain of 500-1000X (compared to SEQUEST3).
Higher confidence in identifications. The more precise scoring allows better separation of good and bad hits, leading to much improved sensitivity and false discovery rates.
SpectraST library creation operations
An example of consensus spectrum creation
Building searchable spectral libraries