Post on 21-Dec-2015
Building and Using Libraries of Peptide Ion Fragmentation Spectra
S.E. Stein, L.E. Kilpatrick, M. Mautner, P. Neta, J. Roth
National Institute of Standards and Technology, Gaithersburg, MD/Charleston, SC
Overview: MS/MS spectra that serve to identify peptides by sequence/spectrum matching can be of value for more reliably identifying those peptides in later studies. Our work involves the development of methods for refining and reusing information contained in these spectra to create mass spectral reference libraries, not dissimilar from those routinely employed in other fields of mass spectrometry.
Method: Spectra that identify peptides by current sequence/spectrum matching methods are first added to an archive along with associated information. For each identified peptide ion that has been identified more than once, an annotated ‘consensus spectrum’ is derived from the spectra used to make those identifications. Consensus spectra are then subjected to spectrum/sequence consistency tests to assign a measure of identification reliability and remove false positive identifications. This employs sequence/spectrum correlations reported in the literature and found in our work as well as other quantities found to be correlated with identification reliability. Examples are the similarity of original spectra, unassigned abundance in the consensus spectrum and scores of the original sequence/spectrum match. Related consensus spectra are intercompared to find further errors and finally combined to form an annotated, searchable reference library.
Building The Library
1) Extract spectra and analysis information for reliable peptide identifications from spectrum/sequence matches.
2) Create a ‘consensus spectrum’ from all spectra assigned to a single peptide ion.
3) For each consensus spectrum, perform a spectrum/sequence consistency check. Also examine reliable single-hit identifications.
4) Assign reliability measures to each spectrum and build library.
Ways of Using the Library
Identification Confirmation (post-processing): Confirm/reject peptides identified by sequence search programs
Find peptides (pre-processing): Search all spectra against library, then search unmatched peptides using sequence search and other methods
Create library of ‘unidentified spectra’ from their consensus spectra: Compare to identified peptides, use de novo methods, find biomarkers, …
Target identification: Internal standards, biomarkers, target proteins
Transmit peptide analysis information: Difficult-to-identify, unusual, manually identified, special meaning
Spectrum Variability: Effective peptide identification by matching spectra in a reference library requires that spectra are reproducible. The degree of variability has been measured using sets of spectra identifying the same precursor ion. Variations typical of ion trap spectra are shown below.
Most spectra identified by sequence search methods have dot products above 0.7.
[Plot: S/N (Max/Median) versus Dot Product]
[Plot: Number of spectra (#Spectra) versus Dot Product]
Next Steps
Enhance QA/QC. Goal: Distinguish unexpected fragmentations from misidentifications
Add spectra from reliable singly identified peptides to Library
Build comprehensive Libraries
Test and optimize spectrum matching algorithms
Provide annotated libraries for mass spectrometry data systems
[Flow diagram labels: Spectrum/Sequence Consistency; All Data for an Identified Peptide Ion; Annotated Reference Spectrum; Reference Archive; Reference Library; Spectra, Sequence, Confirmatory Information; Consensus Spectrum; Extract Common Features; Fragmentation Rules; Analysis; LC-MS/MS Results]
1) Source Spectra
a) Online repositories
- PeptideAtlas
- the Global Proteome Machine
- Open Proteomics Database
- NCRR Proteomics Resource
b) Collaborators/Contributors
- NIH/LNT (Markey/Geer/Kowalik…)
- Institute for Systems Biology (Nesvizhskii/King/Aebersold/…)
- theGPM (Beavis)
- Blueprint Initiative (Hogue)
c) NIST Measurements
- Single Protein Digests/IT, 3Q
3) Spectrum/Sequence Consistency
Find likelihood that a consensus spectrum originated from assigned sequence. Each factor serves to refine probability.
Factors identified as discriminating (a and b are illustrated below):
a) Match with theoretical spectrum: based on individual amino acid fragmentation behavior
b) Fraction of unassigned abundance: abundance for peaks not originating from a known fragmentation path
c) Y/B ion correlations and ratio of Y/B ion abundance sums
d) Unexpected major peaks formally consistent with rules: include tests for reasonableness of neutral losses
e) Amino acid specific rules (Proline, Glycine, Aspartic Acid, …)
f) Predicted/observed fragment ion charge state abundance ratios
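Factor b) can be illustrated with a minimal sketch. It assumes that "known fragmentation paths" are limited to singly charged b and y ions of an unmodified peptide, which is far narrower than the rules described on this poster (no neutral losses, no multiply charged fragments, no amino acid specific rules). The function names, the 0.5 m/z tolerance and the example peptide are hypothetical choices for illustration only.

```python
# Monoisotopic residue masses (Da) for the 20 standard amino acids.
RESIDUE_MASS = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276,
    "V": 99.06841, "T": 101.04768, "C": 103.00919, "L": 113.08406,
    "I": 113.08406, "N": 114.04293, "D": 115.02694, "Q": 128.05858,
    "K": 128.09496, "E": 129.04259, "M": 131.04049, "H": 137.05891,
    "F": 147.06841, "R": 156.10111, "Y": 163.06333, "W": 186.07931,
}
PROTON = 1.007276
WATER = 18.010565


def by_ions(sequence):
    """Singly charged b and y fragment m/z values for an unmodified peptide."""
    masses = [RESIDUE_MASS[aa] for aa in sequence]
    ions = []
    for cut in range(1, len(masses)):
        ions.append(sum(masses[:cut]) + PROTON)          # b ion (N-terminal part)
        ions.append(sum(masses[cut:]) + WATER + PROTON)  # y ion (C-terminal part)
    return ions


def unassigned_fraction(peaks, sequence, tolerance=0.5):
    """Fraction of total abundance in peaks that match no b or y ion.

    peaks is a list of (mz, abundance) pairs; tolerance is in m/z units
    (an assumed value, not taken from the poster).
    """
    theoretical = by_ions(sequence)
    total = sum(abundance for _, abundance in peaks)
    if total == 0:
        return 0.0
    unassigned = sum(
        abundance
        for mz, abundance in peaks
        if not any(abs(mz - t) <= tolerance for t in theoretical)
    )
    return unassigned / total


# Hypothetical spectrum for the tripeptide GAK: three assigned peaks, one not.
example_peaks = [(129.1, 30.0), (147.1, 100.0), (218.1, 50.0), (300.0, 20.0)]
print(unassigned_fraction(example_peaks, "GAK"))
```

In this toy case only the peak at m/z 300.0 is unexplained, so one tenth of the abundance is unassigned. A higher fraction would, per the poster, point toward an incorrect identification.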
4) Overall Identification Confidence [under development]
Influential factors employed:
- Spectrum/Sequence Consistency (described above)
- Original Score (degree of sequence match)
- Peptide sequence re-identifications: number of different spectra, experiments, modifications, charge states
- Number of peptides per protein
Occurrence Distribution: Most peptide identifications are made multiple times.
2) Build Consensus Spectra
Reject noise and spurious signals / Measure and report variability
- Identify matching m/z peaks in input spectra.
- Find and exclude outlier spectra (use dot product of each pair). When multiple sources are available, limit the number of spectra from a single source.
- Omit peaks occurring in ½ or fewer of spectra. Consider only spectra with sufficiently high S/N to have generated the peak of interest.
- Compute average abundance, report variance.
- Create annotated spectrum. Include information such as spectrum origin, retention, median dot product using all peaks and only consensus peaks, fraction of abundance not at consensus m/z, …
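The peak-matching, majority and averaging steps above can be sketched as follows. This is a simplified illustration, not the procedure used to build the library: it omits outlier rejection, per-source limits and the S/N condition on which spectra could have generated a peak, and it assumes a fixed m/z tolerance and base-peak normalization. The function name `build_consensus` and its parameters are hypothetical.

```python
from statistics import mean, pvariance


def build_consensus(spectra, mz_tolerance=0.5):
    """Derive consensus peaks from replicate spectra.

    spectra: list of spectra, each a list of (mz, abundance) pairs.
    Returns a list of (mean_mz, mean_abundance, variance) tuples for peaks
    found in more than half of the input spectra.
    """
    # Scale each spectrum so its base peak is 100, and tag peaks by spectrum.
    tagged = []
    for index, peaks in enumerate(spectra):
        base = max((abundance for _, abundance in peaks), default=0.0)
        if base <= 0:
            continue
        tagged.extend((mz, 100.0 * abundance / base, index) for mz, abundance in peaks)
    tagged.sort()

    # Group peaks whose m/z values lie within the tolerance of their neighbor.
    groups = []
    for peak in tagged:
        if groups and peak[0] - groups[-1][-1][0] <= mz_tolerance:
            groups[-1].append(peak)
        else:
            groups.append([peak])

    consensus = []
    for group in groups:
        # Sum abundance per contributing spectrum.
        per_spectrum = {}
        for _, abundance, index in group:
            per_spectrum[index] = per_spectrum.get(index, 0.0) + abundance
        # Omit peaks occurring in half or fewer of the spectra.
        if 2 * len(per_spectrum) <= len(spectra):
            continue
        values = list(per_spectrum.values())
        mean_mz = mean([mz for mz, _, _ in group])
        consensus.append((round(mean_mz, 3), mean(values), pvariance(values)))
    return consensus


# Three hypothetical replicates; the peak at m/z 350.0 appears only once.
replicates = [
    [(175.1, 20.0), (288.2, 100.0), (401.3, 50.0)],
    [(175.2, 30.0), (288.2, 100.0), (350.0, 8.0)],
    [(175.1, 25.0), (288.1, 100.0), (401.3, 60.0)],
]
for peak in build_consensus(replicates):
    print(peak)
```

Here the average is taken over the spectra that contain the peak, which is one possible reading of the averaging step. The reported variance is the quantity a library annotation could carry to describe peak reproducibility.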
Sample Application: Mycobacterium smegmatis
[data from the Open Proteomics Database/R.Wang, 27 sets of experiments]
- Created library of 2739 Peptide Ion Consensus Spectra
- 95% of identified spectra matched peptides identified by other spectra
In one series of LC-MS/MS experiments:
1551 Spectra were identified by sequence search engine
1527 Of the above spectra were re-matched by consensus library searching
1067 Spectra not identified by sequence search were identified by library search
948 Different peptides were originally identified by sequence search engine
924 Peptides were re-matched
332 Peptides not identified by sequence search in this series were identified
24 Peptides not re-matched were rich in false positives
983 Consensus spectra of spectra unmatched by library search were derived
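The counts above imply re-match rates and gains that are not stated explicitly. The short calculation below derives them from the reported numbers only; the variable names are illustrative.

```python
# Counts reported for one series of LC-MS/MS experiments.
identified_spectra = 1551   # identified by the sequence search engine
rematched_spectra = 1527    # of those, re-matched by library search
new_spectra = 1067          # identified only by library search
identified_peptides = 948   # distinct peptides from sequence search
rematched_peptides = 924    # of those, re-matched by library search
new_peptides = 332          # peptides found only by library search

spectrum_rematch_rate = rematched_spectra / identified_spectra
peptide_rematch_rate = rematched_peptides / identified_peptides
spectrum_gain = new_spectra / identified_spectra
peptide_gain = new_peptides / identified_peptides
peptides_not_rematched = identified_peptides - rematched_peptides

print(f"Spectra re-matched:       {spectrum_rematch_rate:.1%}")  # 98.5%
print(f"Peptides re-matched:      {peptide_rematch_rate:.1%}")   # 97.5%
print(f"Additional spectra found: {spectrum_gain:.1%}")          # 68.8%
print(f"Additional peptides found: {peptide_gain:.1%}")          # 35.0%
print(f"Peptides not re-matched:  {peptides_not_rematched}")     # 24
```

The difference of 24 peptides agrees with the count of peptides not re-matched given above.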
Please Recycle Your Spectra!
- Your MS/MS Data Files contain valuable, reusable information that may be very helpful to others (and maybe even you) after recycling.
- Please let us know if we may recycle them ([email protected] / 301-975-2505). Raw data files are fine, with or without identifications.
- Sources of original spectra are cited as part of consensus spectrum annotation (if you wish).
- We also encourage submission to on-line spectral repositories (www.peptideatlas.org, www.thegpm.org, bioinformatics.icmb.utexas.edu/OPD/)
Library Construction Flow Diagram
Spectrum Matching Algorithms
Algorithms: Measures of spectrum matching have been adapted from algorithms used for electron ionization spectra. Peaks are weighted by their significance:
- Reduce significance of common impurity ions (e.g., neutral loss from parent ion)
- Adjust Y/B weighting for instrument and sequence
- Reduce weight for uncertain and isotopic peaks
- Use confidence of library spectrum
Speed: Straightforward indexing leads to very fast identification (<< 1 sec) even for very large libraries.
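One way such indexing and weighting could work is sketched below. It assumes the library is indexed by precursor m/z (the poster does not say which index is used) and that the only weighting applied is a reduced weight for peaks just below the precursor, standing in for neutral losses from the parent ion. The class `SpectralLibrary`, the window and weight values, and the entry names are all hypothetical.

```python
import bisect
import math


def weighted_vector(peaks, precursor_mz, loss_window=20.0, loss_weight=0.2):
    """Bin peaks at unit m/z, down-weighting peaks near the precursor."""
    vector = {}
    for mz, abundance in peaks:
        near_parent = precursor_mz - loss_window <= mz <= precursor_mz + 1.0
        weight = loss_weight if near_parent else 1.0
        key = int(round(mz))
        vector[key] = vector.get(key, 0.0) + weight * abundance
    return vector


def cosine(a, b):
    """Dot product of two sparse vectors after normalization."""
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    if norm == 0:
        return 0.0
    return sum(v * b.get(k, 0.0) for k, v in a.items()) / norm


class SpectralLibrary:
    """Library entries sorted by precursor m/z for binary-search lookup."""

    def __init__(self, entries):
        # entries: list of (precursor_mz, name, peaks)
        self.entries = sorted(entries, key=lambda entry: entry[0])
        self.precursors = [entry[0] for entry in self.entries]

    def candidates(self, precursor_mz, tolerance=1.5):
        low = bisect.bisect_left(self.precursors, precursor_mz - tolerance)
        high = bisect.bisect_right(self.precursors, precursor_mz + tolerance)
        return self.entries[low:high]

    def search(self, precursor_mz, peaks, tolerance=1.5):
        """Return (name, score) of the best candidate, or None."""
        query = weighted_vector(peaks, precursor_mz)
        best = None
        for lib_mz, name, lib_peaks in self.candidates(precursor_mz, tolerance):
            score = cosine(query, weighted_vector(lib_peaks, lib_mz))
            if best is None or score > best[1]:
                best = (name, score)
        return best


library = SpectralLibrary([
    (450.2, "PEPTIDE_A/2", [(175.1, 20.0), (288.2, 100.0), (401.3, 55.0)]),
    (450.9, "PEPTIDE_B/2", [(147.1, 100.0), (260.2, 40.0), (389.2, 70.0)]),
    (600.3, "PEPTIDE_C/2", [(175.1, 50.0), (700.4, 100.0)]),
])
query_peaks = [(175.1, 25.0), (288.2, 90.0), (401.3, 60.0), (441.5, 80.0)]
print(library.search(450.5, query_peaks))
```

Because only entries inside the precursor window are scored, the cost of a search depends on the number of candidates rather than on the size of the library, which is consistent with the sub-second search times noted above.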
[Figure panels: Spectrum 1, Spectrum 2, Consensus Spectrum]
A ‘consensus spectrum’ is composed of peaks present in the majority of spectra in which the peak could have been generated.
Replicate spectra at moderate to low S/N commonly show spurious peaks from neutral losses by impurity ions of uncertain origin as well as seemingly random noise spikes.
The similarity of a pair of spectra originating from the same peptide ion is measured by their ‘dot product’, where spectra are expressed as normalized vectors. Here, S/N is measured as the ratio of maximum to median abundance in a typical spectrum (this assumes most peaks are noise).
Above S/N of 40, spectra are quite reproducible (0.7 or better median dot product).
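The two measures used in this section can be written down compactly. The sketch below assumes unit-width m/z bins and unscaled abundances; the poster does not specify the binning or any abundance scaling, so these are illustrative choices, as are the function names.

```python
import math
from statistics import median


def bin_spectrum(peaks, bin_width=1.0):
    """Sum abundances into m/z bins; peaks is a list of (mz, abundance)."""
    binned = {}
    for mz, abundance in peaks:
        key = int(round(mz / bin_width))
        binned[key] = binned.get(key, 0.0) + abundance
    return binned


def dot_product(peaks_a, peaks_b, bin_width=1.0):
    """Dot product of two spectra expressed as normalized vectors.

    Returns 1.0 for identical spectra and 0.0 for spectra with no shared bins.
    """
    a = bin_spectrum(peaks_a, bin_width)
    b = bin_spectrum(peaks_b, bin_width)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    shared = sum(v * b[k] for k, v in a.items() if k in b)
    return shared / (norm_a * norm_b)


def signal_to_noise(peaks):
    """Maximum abundance divided by median abundance (most peaks assumed noise)."""
    abundances = [abundance for _, abundance in peaks if abundance > 0]
    if not abundances:
        return 0.0
    return max(abundances) / median(abundances)


# Two hypothetical replicate spectra of the same peptide ion.
spectrum_1 = [(175.1, 20.0), (288.2, 100.0), (401.3, 55.0), (502.3, 5.0)]
spectrum_2 = [(175.1, 25.0), (288.2, 90.0), (401.3, 60.0), (615.4, 4.0)]
print(dot_product(spectrum_1, spectrum_2))
print(signal_to_noise(spectrum_1))
```

The toy spectra have only four peaks each, so their max/median ratio is much lower than for real spectra, where the many low-abundance peaks pull the median down.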
Distribution of Number of Identifications Per Identified Peptide for D. radiodurans (spectra from NCRR, http://ncrr.pnl.gov/data/) and M. smegmatis (Open Proteomics Database)
Pairs of spectra identified as originating from a given peptide ion have been compared to each other – variations in abundance have been found to depend primarily on the level of signal/noise in the measured spectra.
Organism          Different Ions    Source Spectra
Human                     14,698            84,288
D. radiodurans             6,888           115,972
Yeast                      5,879            43,344
M. smegmatis               2,739            30,718
E. coli                    1,259             5,861
Mixed (GPM)               66,882           587,199
These plots use D. radiodurans spectra from NCRR; false identifications were obtained by searching Human sequences. True and false identifications had the same average sequence ‘score’.
Correct identifications have lower fractions of unassigned abundance than incorrect identifications
True positive spectra match theoretical spectra better than false positive spectra.
These plots illustrate quantitative relations for factors a) and b) that can aid in the separation of true and false positive identifications
In two well-studied cases, about 5% of identifications were made by only one spectrum.
Roughly 10% of identifications were made more than 100 times
Approximately 1/3 of peptides identified were identified only once (not shown). These will be separately processed, requiring higher scores for acceptance in library.
Consensus Spectra Derived