

Background

Spectral library searching

Spectral library searching is an alternative approach to traditional sequence database searching for peptide inference from MS/MS spectra. It is particularly suited for targeted proteomics [2] applications, in which one seeks not to discover novel peptides, but to find and study expected peptides in the sample.

Previously observed and confidently identified peptide MS/MS spectra are collected and catalogued in data repositories, such as PeptideAtlas.

Repeated observations of the same peptide ion are combined to create a “consensus spectrum” of that particular peptide ion.

Searchable spectral libraries of consensus spectra are built for fast indexed searching.

During searching, each unknown query spectrum is compared to candidate library spectra one by one; high spectral similarity indicates positive identification.
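The search step described above can be illustrated with a minimal sketch. It is not SpectraST's actual implementation: the function names, the 1 m/z binning, the 3.0 m/z precursor tolerance, and the use of a plain normalized dot product are all assumptions chosen for illustration. Candidates are selected from a library sorted by precursor m/z, then ranked by spectral similarity.

```python
import bisect
import math

def normalize(peaks, bin_width=1.0):
    """Bin (m/z, intensity) peaks and scale the binned vector to unit length."""
    binned = {}
    for mz, intensity in peaks:
        key = int(mz / bin_width)  # hypothetical fixed-width binning
        binned[key] = binned.get(key, 0.0) + intensity
    norm = math.sqrt(sum(v * v for v in binned.values()))
    return {k: v / norm for k, v in binned.items()} if norm > 0 else {}

def dot_product(a, b):
    """Similarity of two unit-length binned spectra (1.0 means identical)."""
    return sum(v * b.get(k, 0.0) for k, v in a.items())

def search(query_mz, query_peaks, library, tolerance=3.0):
    """Rank candidate library spectra for one query spectrum.

    library: list of (precursor_mz, peptide_ion, peaks), sorted by precursor_mz.
    Returns (similarity, peptide_ion) tuples, best match first.
    """
    mzs = [entry[0] for entry in library]
    lo = bisect.bisect_left(mzs, query_mz - tolerance)   # first candidate
    hi = bisect.bisect_right(mzs, query_mz + tolerance)  # one past the last
    q = normalize(query_peaks)
    hits = [(dot_product(q, normalize(peaks)), ion) for _, ion, peaks in library[lo:hi]]
    return sorted(hits, reverse=True)
```

Because the library is sorted by precursor m/z, only the entries inside the tolerance window are scored, which is what keeps the search space small.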

Quality filters

Three types of questionable spectra in the resulting consensus spectral library are subject to quality filters:

Spectra that have look-alikes in the library with non-homologous peptide IDs
• Often sequence-search false positives.
• Can also lead to false negatives when look-alikes end up as second hits, artificially depressing the delta score.

Spectra with many unexplained large peaks
• Often sequence-search false positives.
• Even if correctly identified, often not representative of the peptide ion because of contamination, leading to false positives.

Spectra with only one replicate (singletons)
• Often sequence-search false positives, especially if a large enough pool of spectra is compiled.
• No consensus is made; raw spectra are of poorer quality.
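The three filters can be sketched as a single pass over the library. This is only an illustration under stated assumptions: the `LibraryEntry` fields, the `max_unexplained` threshold of 0.4, and the idea that look-alikes and unexplained peaks have already been measured per entry are all hypothetical. The levels follow the poster's Q0/Q1/Q2 naming, where Q1 removes impure spectra and spectra with conflicting look-alikes, and Q2 additionally removes singletons.

```python
from dataclasses import dataclass

@dataclass
class LibraryEntry:
    peptide_ion: str
    n_replicates: int
    frac_unexplained: float      # hypothetical: share of large peaks left unassigned
    n_conflicting_lookalikes: int  # hypothetical: similar spectra with non-homologous IDs

def apply_quality_filter(entries, level, max_unexplained=0.4):
    """Return the entries that survive quality level 0 (no filter), 1, or 2."""
    kept = []
    for e in entries:
        impure = e.frac_unexplained > max_unexplained
        has_lookalike = e.n_conflicting_lookalikes > 0
        if level >= 1 and (impure or has_lookalike):
            continue  # Q1 and above: drop impure spectra and conflicting look-alikes
        if level >= 2 and e.n_replicates < 2:
            continue  # Q2: also drop singletons
        kept.append(e)
    return kept
```

Each stricter level is a subset of the previous one, so the library can only shrink as the stringency increases.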

Development of a Spectral Library Building Tool and Re-Analysis of Human Plasma PeptideAtlas Datasets using Spectral Searching

Henry Lam¹, Eric Deutsch¹, James S. Eddes¹, Jimmy K. Eng¹,², Nichole King¹, Steve Stein³, Ruedi Aebersold¹,⁴

¹ Institute for Systems Biology, Seattle, WA
² Fred Hutchinson Cancer Research Center, Seattle, WA
³ National Institute of Standards and Technology, Gaithersburg, MD
⁴ Institute of Molecular Systems Biology, ETH Zurich, Zurich, Switzerland

[Figure: An example of a confidently matched query spectrum in spectral searching, showing the library spectrum and the query spectrum.]

References

1. Spectral searching and SpectraST: Lam H, Deutsch EW, Eddes JS, Eng JK, King NL, Stein SE, Aebersold R. Development and validation of a spectral library searching method for peptide identification from MS/MS. Proteomics 7(5) (2007).
2. Targeted proteomics: Kuster B, Schirle M, Mallick P, Aebersold R. Scoring proteomes with proteotypic peptide probes. Nature Reviews Molecular Cell Biology 6(7), 577-583 (2005).
3. SEQUEST: Eng JK, McCormack AL, Yates JR III. An approach to correlate tandem mass spectral data of peptides with amino acid sequences in a protein database. Journal of the American Society for Mass Spectrometry 5(11), 976-989 (1994).
4. Trans-Proteomic Pipeline: Keller A, Eng J, Zhang N, Li X-J, Aebersold R. A uniform proteomics MS/MS analysis platform utilizing open XML file formats. Molecular Systems Biology 1, 17 (2005).
5. Mascot: Perkins DN, Pappin DJ, Creasy DM, Cottrell JS. Probability-based protein identification by searching sequence databases using mass spectrometry data. Electrophoresis 20(18), 3551-3567 (1999).
6. X!Tandem: Craig R, Beavis RC. TANDEM: matching proteins with mass spectra. Bioinformatics 20, 1466-1467 (2004).
7. Human Plasma PeptideAtlas: Deutsch EW, Eng JK, Zhang H, King NL, Nesvizhskii AI, Lin B, Lee H, Yi EC, Ossola R, Aebersold R. Human Plasma PeptideAtlas. Proteomics 5(13), 3497-3500 (2005).
8. PeptideProphet: Keller A, Nesvizhskii AI, Kolker E, Aebersold R. Empirical statistical model to estimate the accuracy of peptide identifications made by MS/MS and database search. Analytical Chemistry 74(20), 5383-5392 (2002).

This work is supported by the National Heart, Lung and Blood Institute, National Institutes of Health, under contract No. N01-HV-28179.

Motivation

There has been a recent surge of interest in using spectral library searching as an alternative to sequence searching for identification of peptide MS/MS spectra. In spectral library searching, unknown MS/MS spectra are searched against a carefully compiled library of previously observed and identified peptide MS/MS spectra; a good spectral match indicates a correct identification. This method was shown to offer a significant speed gain and superior sensitivity and accuracy compared to traditional sequence searching [1]. However, it relies on the availability of a high-quality spectral library covering the proteome or subproteome of interest. Although public spectral libraries, such as those under development at the National Institute of Standards and Technology, are rapidly maturing, a ready-to-deploy tool to create custom spectral libraries is highly desirable for applications for which suitable public spectral libraries are not yet available. Such a tool would enable individual researchers to build spectral libraries of their own proteomes or subproteomes of interest, as a means of organizing and condensing previous data and of facilitating future research.

Methods

Software development: SpectraST (v.3.0)

• Written in C++, with Linux and Windows versions available (http://tools.proteomecenter.org/)
• Performs both spectral library searching [1] and spectral library creation
• Minimal resource requirements; can be run on a typical personal computer
• Open-source and readily customizable
• Integrated with the Trans-Proteomic Pipeline (TPP) software suite [4] for full workflow support and easy adaptation
• Requires no relational database backend

Consensus spectrum creation

Pool replicate spectra (identified to the same peptide ion) of high confidence
• A PeptideProphet probability cutoff of 0.9 is used

Remove dissimilar replicates
• Only the largest cluster of similar spectra is kept

Align peaks
• Slightly m/z-shifted peaks from different replicates are aligned

Peak voting
• Only peaks present in a majority of replicates are included

Intensity averaging
• Intensities of eligible peaks are averaged, weighted by each replicate's signal-to-noise ratio
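The pooling, alignment, voting, and averaging steps above can be sketched as follows. This is a simplified illustration, not SpectraST's code: the dictionary layout of a replicate, the fixed-width m/z binning used as a stand-in for peak alignment, and averaging only over the replicates that contain the peak are assumptions. The removal of dissimilar replicates by clustering is omitted for brevity, and signal-to-noise ratios are assumed to be positive.

```python
from collections import defaultdict

def build_consensus(replicates, min_prob=0.9, bin_width=1.0):
    """Build a consensus peak list from replicate spectra of one peptide ion.

    replicates: list of dicts with keys 'prob' (identification probability),
    'snr' (signal-to-noise ratio) and 'peaks' [(mz, intensity), ...].
    """
    # Step 1: pool only confidently identified replicates.
    pooled = [r for r in replicates if r["prob"] > min_prob]
    if not pooled:
        return []

    # Step 2: align peaks by m/z bin, keeping each replicate's strongest peak per bin.
    bins = defaultdict(list)  # bin key -> [(mz, intensity, snr)], one per replicate
    for rep in pooled:
        strongest = {}
        for mz, intensity in rep["peaks"]:
            key = round(mz / bin_width)
            if key not in strongest or intensity > strongest[key][1]:
                strongest[key] = (mz, intensity)
        for key, (mz, intensity) in strongest.items():
            bins[key].append((mz, intensity, rep["snr"]))

    # Steps 3 and 4: peak voting, then S/N-weighted averaging of m/z and intensity.
    consensus = []
    for key in sorted(bins):
        votes = bins[key]
        if len(votes) * 2 <= len(pooled):
            continue  # not present in a strict majority of replicates
        total_snr = sum(snr for _, _, snr in votes)
        mz = sum(m * snr for m, _, snr in votes) / total_snr
        intensity = sum(i * snr for _, i, snr in votes) / total_snr
        consensus.append((mz, intensity))
    return consensus
```

Fixed-width binning can split a peak that falls near a bin boundary; a tolerance-based matching of neighbouring peaks would avoid that at the cost of more code.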

Results

Building a spectral library from the Human Plasma PeptideAtlas [7] (http://www.peptideatlas.org/)

• 22 datasets (including 7 from the HUPO Plasma Proteome Project)
• 1.4 million spectra positively identified with P > 0.9 (SEQUEST/PeptideProphet [8]) from a total of 14 million spectra
• Over 30,000 distinct peptide ions among positive identifications
• Library building by SpectraST takes about 3 days of CPU time
• The number of peaks is reduced by more than a factor of 3 during consensus creation, indicating effective noise removal
• Different levels of quality filter stringency were investigated

[Figure: SpectraST workflow diagram. Raw spectra import: search results (.pepXML) from SEQUEST/Mascot [5]/X!Tandem [6], validated by PeptideProphet, and spectra (.mzXML) are imported by SpectraST into a library (.splib) with its index files (.spidx, .pepidx). Library manipulation: SpectraST applies union/intersection, filtering based on criteria, consensus creation, and quality filtering to libraries (.splib), producing a new library (.splib, .spidx, .pepidx).]

[Figure: Replicate identifications from a data repository (Dataset 1, Dataset 2, Dataset 3; e.g., CVDAGQAK, TTSGGANK, QPWHIVK) are combined into a library of consensus spectra (CVDAGQAK, DGGGENSR, QPWHIVK, TTSGLADK, IPGSGQGAR, …) with a precursor m/z index (791, 792, 842, 907).]

[Figure: Example library spectra. Panel labels: "Similar spectra with conflicting IDs" (Dot = 0.87); "Possibly correctly identified, but impure spectrum"; "False-positive impure spectrum". Spectra shown: GVM147NAVNNVNNVIAAAFK/2 (6 replicates); FFTAICDMVAWLGYTPYKVTY/3 (1 replicate); ALVLIAFAQYLQQC160PFEDHVK/3 (100 replicates); AVDLLFFTDESGDSR/2 (2 replicates); DFFTPNLFLK/3 (1 replicate).]

Re-analysis of the Human Plasma PeptideAtlas datasets by spectral searching

• All datasets used to build the spectral library were re-searched by SpectraST against the library
• Dramatic increase (over 60%) in positively identified spectra with P > 0.9 (SpectraST/PeptideProphet)
• Extra identifications are mostly lower-quality spectra previously missed by sequence searching
• Library searching by SpectraST takes about 3 days of CPU time

Conclusions

• The library-searching tool SpectraST is extended to allow users to build custom spectral libraries.
• The consensus spectrum creation algorithm reduces noise and spurious peaks, resulting in high-quality, representative spectra.
• A spectral library has been built from the entire Human Plasma PeptideAtlas, and is now available.
• The Human Plasma PeptideAtlas datasets were re-analyzed by searching against this library. Compared to sequence searching by SEQUEST, SpectraST identified over 60% more spectra at the same probability threshold, with much improved sensitivities and false discovery rates.
• Quality filters are shown to have a significant impact on the performance of the search.

[Figure: Sensitivity (0.5 to 1) versus false discovery rate (0 to 0.5) for SEQUEST, SpectraST (Q0), SpectraST (Q1), and SpectraST (Q2).]

[Figure: Number of positive IDs (millions) for SEQUEST, SpectraST (Q0), SpectraST (Q1), and SpectraST (Q2). SEQUEST: 1.385; SpectraST bars: 2.268 (+64%), 2.302 (+66%), 2.094 (+51%).]
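The percentages on the bar chart are consistent with a relative gain over the SEQUEST count of 1.385 million positive IDs. A short check, with a hypothetical helper name, reproduces them from the plotted values:

```python
def relative_gain(n_new, n_baseline):
    """Fractional increase of n_new over n_baseline."""
    return (n_new - n_baseline) / n_baseline

sequest = 1.385  # positive IDs in millions, from the bar chart
for n in (2.268, 2.302, 2.094):
    # Prints the SpectraST count next to its gain over SEQUEST as a signed percentage.
    print(f"{n:.3f} M: {relative_gain(n, sequest):+.0%}")
```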

Quality level | Number of spectra remaining
Q0 (No filter) | 37,428
Q1 (Removed impure spectra; spectra with look-alikes having conflicting IDs) | 30,517
Q2 (Removed impure spectra; spectra with look-alikes having conflicting IDs; singleton spectra) | 20,315

[Figure: Peak reduction ratio, R (0 to 6), versus the number of replicates used to create the consensus (2-3, 4-9, 10-19, 20+).]

R = (Ave. # peaks in replicates) / (# peaks in consensus)
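As a small worked sketch of this definition, the ratio can be computed per consensus spectrum and grouped by the replicate bins used on the plot's x-axis. The function names are hypothetical, and the peak counts in the usage note are made-up numbers, not values from the poster.

```python
def peak_reduction_ratio(replicate_peak_counts, consensus_peak_count):
    """R = (average number of peaks in the replicates) / (peaks in the consensus)."""
    average = sum(replicate_peak_counts) / len(replicate_peak_counts)
    return average / consensus_peak_count

def replicate_bin(n_replicates):
    """Map a replicate count to the bins on the plot's x-axis."""
    if n_replicates >= 20:
        return "20+"
    if n_replicates >= 10:
        return "10-19"
    if n_replicates >= 4:
        return "4-9"
    if n_replicates >= 2:
        return "2-3"
    return None  # singletons: no consensus is made
```

For example, three replicates with 150, 120, and 90 peaks that yield a 40-peak consensus give R = 120 / 40 = 3.0 and fall in the "2-3" bin.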

Advantages of spectral searching over traditional sequence searching [1]

• Smaller search space. Only peptide ions known to be observed and identified are included in the library.
• More precise scoring. Peak intensities are naturally accounted for and all spectral features, including uncommon fragments, are used for similarity scoring.
• Vast improvement in speed. A reduced search space and a simpler scoring algorithm yield a typical speed gain of 500-1000X (compared to SEQUEST [3]).
• Higher confidence in identifications. The more precise scoring allows better separation of good and bad hits, leading to much improved sensitivity and false discovery rates.

[Figure captions: SpectraST library creation operations; An example of consensus spectrum creation; Building searchable spectral libraries]