
Background

Spectral library searching

Spectral library searching is an alternative approach to traditional sequence database searching for peptide inference from MS/MS spectra. It is particularly suited for targeted proteomics2 applications, in which one seeks not to discover novel peptides, but to find and study expected peptides in the sample.

Previously observed and confidently identified peptide MS/MS spectra are collected and catalogued in data repositories, such as PeptideAtlas.

Repeated observations of the same peptide ion are combined to create a “consensus spectrum” of that particular peptide ion.

Searchable spectral libraries of consensus spectra are built for fast indexed searching.

During searching, each unknown query spectrum is compared to candidate library spectra one by one; high spectral similarity indicates positive identification.
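The comparison step above can be sketched as follows. This is a minimal illustration under stated assumptions, not SpectraST's actual scoring: the bin width, the square-root intensity scaling, the precursor m/z tolerance, and all function names are hypothetical choices made for the example. Candidates are selected from a library sorted by precursor m/z and ranked by a normalized dot product.

```python
import bisect
import math

def bin_spectrum(peaks, bin_width=1.0):
    """Collapse (m/z, intensity) peaks into a sparse, unit-length vector."""
    bins = {}
    for mz, intensity in peaks:
        key = int(mz / bin_width)
        # Square-root scaling (an assumption) damps dominant peaks.
        bins[key] = bins.get(key, 0.0) + math.sqrt(intensity)
    norm = math.sqrt(sum(v * v for v in bins.values()))
    return {k: v / norm for k, v in bins.items()} if norm > 0 else {}

def dot_product(a, b):
    """Dot product of two unit-length sparse vectors (0 = no shared bins, 1 = identical)."""
    return sum(v * b.get(k, 0.0) for k, v in a.items())

def search(query_mz, query_peaks, library, tolerance=3.0):
    """Rank library entries near the query precursor m/z by spectral similarity.

    `library` is a list of (precursor_mz, peptide_ion, peaks) tuples,
    sorted by precursor m/z so that candidates can be found by bisection.
    """
    mzs = [entry[0] for entry in library]
    lo = bisect.bisect_left(mzs, query_mz - tolerance)
    hi = bisect.bisect_right(mzs, query_mz + tolerance)
    query = bin_spectrum(query_peaks)
    hits = [(dot_product(query, bin_spectrum(peaks)), ion)
            for _, ion, peaks in library[lo:hi]]
    return sorted(hits, reverse=True)
```

Calling `search` returns (dot product, peptide ion) pairs with the best match first; the gap between the first and second hit is what a delta score would measure.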

Quality filters

Three types of questionable spectra in the resulting consensus spectral library are subject to quality filters:

Spectra that have look-alikes in the library with non-homologous peptide IDs

• Often sequence-search false positives.

• Could also lead to false negatives when look-alikes end up as second hits, artificially depressing the delta score.

Spectra with many unexplained large peaks

• Often sequence-search false positives.

• Even if correctly identified, often not representative of the peptide ion due to contamination, leading to false positives.

Spectra with only one replicate (singletons)

• Often sequence-search false positives, especially if a large enough pool of spectra is compiled.

• No consensus is made; raw spectra are of poorer quality.
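The three filter criteria could be expressed roughly as below. This is a hedged sketch, not the SpectraST implementation: the class, the function names, both cutoffs, and the precomputed `unexplained_fraction` field are assumptions for illustration. It also treats any two different peptide sequences as conflicting, without a homology check, and removes both members of a look-alike pair.

```python
from dataclasses import dataclass, field

@dataclass
class LibraryEntry:
    peptide_ion: str             # sequence/charge, e.g. "AVDLLFFTDESGDSR/2"
    num_replicates: int
    unexplained_fraction: float  # share of large peaks left unannotated (assumed precomputed)
    flags: set = field(default_factory=set)

def flag_entries(entries, similarity, dot_cutoff=0.8, impure_cutoff=0.4):
    """Attach quality flags; `similarity(a, b)` returns a dot product in [0, 1]."""
    for entry in entries:
        if entry.num_replicates == 1:
            entry.flags.add("singleton")
        if entry.unexplained_fraction > impure_cutoff:
            entry.flags.add("impure")
    for i, a in enumerate(entries):
        for b in entries[i + 1:]:
            # Simplification: different sequences count as conflicting IDs;
            # no homology check is attempted here.
            same_peptide = a.peptide_ion.split("/")[0] == b.peptide_ion.split("/")[0]
            if not same_peptide and similarity(a, b) >= dot_cutoff:
                a.flags.add("conflicting_look_alike")
                b.flags.add("conflicting_look_alike")
    return entries

def apply_quality_level(entries, level):
    """Level 0 keeps everything; 1 drops impure and look-alike entries; 2 also drops singletons."""
    removed = set()
    if level >= 1:
        removed |= {"impure", "conflicting_look_alike"}
    if level >= 2:
        removed.add("singleton")
    return [e for e in entries if not (e.flags & removed)]
```

Separating flagging from filtering lets one flagged library be exported at several stringency levels without recomputing the pairwise similarities.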

Development of a Spectral Library Building Tool and Re-Analysis of Human Plasma PeptideAtlas Datasets using Spectral Searching

Henry Lam1, Eric Deutsch1, James S. Eddes1, Jimmy K. Eng1,2, Nichole King1, Steve Stein3, Ruedi Aebersold1,4

1Institute for Systems Biology, Seattle, WA

2Fred Hutchinson Cancer Research Center, Seattle, WA

3National Institute of Standards and Technology, Gaithersburg, MD

4Institute of Molecular Systems Biology, ETH Zurich, Zurich, Switzerland

[Figure: An example of a confidently matched query spectrum in spectral searching. Labels: Library spectra; Query spectra.]

References

1. Spectral searching and SpectraST: Lam H, Deutsch EW, Eddes JS, Eng JK, King NL, Stein SE, Aebersold R. Development and validation of a spectral library searching method for peptide identification from MS/MS. Proteomics 7(5), (2007).

2. Targeted proteomics: Kuster B, Schirle M, Mallick P, Aebersold R. Scoring proteomes with proteotypic peptide probes. Nature Reviews Molecular Cell Biology 6(7), 577-583 (2005).

3. SEQUEST: Eng JK, McCormack AL, Yates JR III. An approach to correlate tandem mass spectral data of peptides with amino acid sequences in a protein database. Journal of the American Society for Mass Spectrometry 5(11), 976-989 (1994).

4. Trans-Proteomic Pipeline: Keller A, Eng J, Zhang N, Li X-J, Aebersold R. A uniform proteomics MS/MS analysis platform utilizing open XML file formats. Molecular Systems Biology 1, 17 (2005).

5. Mascot: Perkins DN, Pappin DJ, Creasy DM, Cottrell JS. Probability-based protein identification by searching sequence databases using mass spectrometry data. Electrophoresis 20(18), 3551-3567 (1999).

6. X!Tandem: Craig R, Beavis RC. TANDEM: matching proteins with tandem mass spectra. Bioinformatics 20, 1466-1467 (2004).

7. Human Plasma PeptideAtlas: Deutsch EW, Eng JK, Zhang H, King NL, Nesvizhskii AI, Lin B, Lee H, Yi EC, Ossola R, Aebersold R. Human Plasma PeptideAtlas. Proteomics 5(13), 3497-3500 (2005).

8. PeptideProphet: Keller A, Nesvizhskii AI, Kolker E, Aebersold R. Empirical statistical model to estimate the accuracy of peptide identifications made by MS/MS and database search. Analytical Chemistry 74(20), 5383-5392 (2002).

This work is supported by the National Heart, Lung and Blood Institute, National Institutes of Health, under contract No. N01-HV-28179.

Motivation

There has been a recent surge of interest in using spectral library searching as an alternative to sequence searching for identification of peptide MS/MS spectra. In spectral library searching, unknown MS/MS spectra are searched against a carefully compiled library of previously observed and identified peptide MS/MS spectra; a good spectral match indicates a correct identification. This method was shown to offer significant speed gain and superior sensitivity and accuracy compared to traditional sequence searching.1 However, it relies on the availability of a high-quality spectral library covering the proteome or subproteome of interest. Although public spectral libraries, such as those under development at the National Institute of Standards and Technology, are rapidly maturing, a ready-to-deploy tool to create custom spectral libraries is highly desirable for applications for which suitable public spectral libraries are not yet available. This would enable individual researchers to build spectral libraries of their own proteomes or subproteomes of interest, as a means of organizing and condensing previous data and facilitating future research.

Methods

Software development: SpectraST (v.3.0)

• Written in C++, with Linux and Windows versions available (http://tools.proteomecenter.org/)

• Performs both spectral library searching1 and spectral library creation

• Minimal resource requirement: can be run on typical PCs

• Open-source and readily customizable

• Integrated with the Trans-Proteomic Pipeline (TPP) software suite4 for full workflow support and easy adaptation

• Requires no relational database backend

Consensus spectrum creation

• Pool replicate spectra (identified to the same peptide ion) of high confidence: a PeptideProphet probability cutoff of 0.9 is used.

• Remove dissimilar replicates: only the largest cluster of similar spectra is kept.

• Align peaks: slightly m/z-shifted peaks from different replicates are aligned.

• Peak voting: only peaks present in a majority of replicates are included.

• Intensity averaging: intensities of eligible peaks are averaged, weighted by each replicate's signal-to-noise ratio.
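The alignment, voting, and averaging steps can be sketched as below. This is an illustrative approximation rather than SpectraST's algorithm: the function name, the gap-based peak grouping, the m/z tolerance, and the choice to let replicates lacking a peak contribute zero intensity are all assumptions. Pooling and removal of dissimilar replicates are assumed to have been done already.

```python
def build_consensus(replicates, weights, mz_tolerance=0.5, min_fraction=0.5):
    """Merge replicate peak lists into one consensus peak list.

    replicates: list of peak lists, each a list of (m/z, intensity) tuples.
    weights: one weight per replicate (e.g. its signal-to-noise ratio).
    """
    # Tag every peak with its replicate index and sort by m/z.
    tagged = sorted((mz, inten, idx)
                    for idx, peaks in enumerate(replicates)
                    for mz, inten in peaks)

    # Align: start a new group when the gap to the previous peak exceeds the tolerance.
    groups, current = [], []
    for peak in tagged:
        if current and peak[0] - current[-1][0] > mz_tolerance:
            groups.append(current)
            current = []
        current.append(peak)
    if current:
        groups.append(current)

    total_weight = sum(weights)
    consensus = []
    for group in groups:
        voters = {idx for _, _, idx in group}
        # Peak voting: keep only peaks seen in more than min_fraction of replicates.
        if len(voters) <= min_fraction * len(replicates):
            continue
        group_weight = sum(weights[idx] for _, _, idx in group)
        mz = sum(m * weights[idx] for m, _, idx in group) / group_weight
        # Assumption: replicates lacking the peak contribute zero intensity.
        intensity = sum(i * weights[idx] for _, i, idx in group) / total_weight
        consensus.append((mz, intensity))
    return consensus
```

With three replicates, a peak seen in only one of them fails the vote and is dropped, which is how noise peaks are removed while reproducible fragments survive.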

Results

Building a spectral library from the Human Plasma PeptideAtlas7 (http://www.peptideatlas.org/)

• 22 datasets (including 7 from HUPO Plasma Proteome Project)

• 1.4 million spectra positively identified with P > 0.9 (SEQUEST/PeptideProphet8) from a total of 14 million spectra

• Over 30,000 distinct peptide ions among positive identifications

• Library building by SpectraST takes about 3 days of CPU time

• The number of peaks is reduced by more than a factor of 3 during consensus creation, indicating effective noise removal

• Different levels of quality filter stringency were investigated

[Figure: SpectraST workflow diagram. Labels: SEQUEST/Mascot5/X!Tandem6; PeptideProphet; search results (.pepXML); spectra (.mzXML); raw spectra import by SpectraST; library (.splib) with .spidx and .pepidx index files; SpectraST library manipulation (union/intersection, filter based on criteria, consensus creation, quality filter) producing further libraries (.splib, .spidx, .pepidx).]

[Figure: Schematic of a data repository and the resulting spectral library. Labels: Data repository (Dataset 1, Dataset 2, Dataset 3); peptide sequences CVDAGQAKDGGGENSRQPWHIVK, TTSGLADK, IPGSGQGAR, DGGGENSRQPWHIVK, CVDAGQAK, TTSGGANK, TTSGGANKQPWHIVK, …; Library of consensus spectra; Precursor m/z index (791, 792, 842, 907, …).]

[Figure: Example spectra for the quality filters. Panel titles: Similar spectra with conflicting IDs (Dot = 0.87); Possibly correctly identified, but impure spectrum; False-positive impure spectrum. Spectra shown: GVM147NAVNNVNNVIAAAFK/2 (6 replicates); FFTAICDMVAWLGYTPYKVTY/3 (1 replicate); ALVLIAFAQYLQQC160PFEDHVK/3 (100 replicates); AVDLLFFTDESGDSR/2 (2 replicates); DFFTPNLFLK/3 (1 replicate).]

Re-analysis of the Human Plasma PeptideAtlas datasets by spectral searching

• All datasets used to build the spectral library were re-searched by SpectraST against the library

• Dramatic increase (over 60%) in positively identified spectra with P > 0.9 (SpectraST/PeptideProphet)

• Extra identifications are mostly lower quality spectra previously missed by sequence searching

• Library searching by SpectraST takes about 3 days of CPU time

Conclusions

• The library-searching tool SpectraST is extended to allow users to build custom spectral libraries.

• The consensus spectrum creation algorithm reduces noise and spurious peaks, resulting in high-quality, representative spectra.

• A spectral library has been built from the entire Human Plasma PeptideAtlas, and is now available.

• The Human Plasma PeptideAtlas datasets were re-analyzed by searching against this library. Compared to sequence searching by SEQUEST, SpectraST identified over 60% more spectra at the same probability threshold, with much improved sensitivities and false discovery rates.

• Quality filters are shown to have a significant impact on the performance of the search.

[Figure: Sensitivity (0.5 to 1) versus False Discovery Rate (0 to 0.5) for SEQUEST, SpectraST (Q0), SpectraST (Q1), and SpectraST (Q2).]

[Figure: Number of positive IDs (millions, axis 0.0 to 2.5) for SEQUEST, SpectraST (Q0), SpectraST (Q1), and SpectraST (Q2). Data labels as extracted: 1.385; 2.268 (+64%); 2.302 (+66%); 2.094 (+51%).]

Quality level: number of spectra remaining

• Q0 (No filter): 37,428

• Q1 (Removed impure spectra; spectra with look-alikes having conflicting IDs): 30,517

• Q2 (Removed impure spectra; spectra with look-alikes having conflicting IDs; singleton spectra): 20,315

[Figure: Peak reduction ratio, R (axis 0 to 6), versus number of replicates used to create the consensus (2-3, 4-9, 10-19, 20+).]

R = (Ave. # peaks in replicates) / (# peaks in consensus)
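As a small worked example of the ratio R defined above, the function below computes it from peak lists. The function name and the input layout (each spectrum as a list of peaks) are assumptions for illustration only.

```python
def peak_reduction_ratio(replicates, consensus):
    """R = (average number of peaks per replicate) / (number of peaks in the consensus)."""
    average_peaks = sum(len(peaks) for peaks in replicates) / len(replicates)
    return average_peaks / len(consensus)
```

For instance, replicates with 120, 90, and 150 peaks average 120 peaks; a consensus of 40 peaks then gives R = 3.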

Advantages of spectral searching over traditional sequence searching1

Smaller search space. Only peptide ions known to be observed and identified are included in the library.

More precise scoring. Peak intensities are naturally accounted for and all spectral features, including uncommon fragments, are used for similarity scoring.

Vast improvement in speed. A reduced search space and a simpler scoring algorithm yield a typical speed gain of 500-1000X (compared to SEQUEST3).

Higher confidence in identifications. The more precise scoring allows better separation of good and bad hits, leading to much improved sensitivity and false discovery rates.

[Figure titles: SpectraST library creation operations; An example of consensus spectrum creation; Building searchable spectral libraries.]
