ASMS 2005 Protein Identification by Database Searching John Cottrell Matrix Science


ASMS 2005

Protein Identification by Database Searching

John Cottrell
Matrix Science


Three ways to use mass spectrometry data for protein ID:

1. Peptide Mass Fingerprint
   A set of peptide molecular weights from an enzyme digest of a protein


Henzel, W. J., Billeci, T. M., Stults, J. T., Wong, S. C., Grimley, C. and Watanabe, C. (1993). Proc Natl Acad Sci USA 90, 5011-5.

James, P., Quadroni, M., Carafoli, E. and Gonnet, G. (1993). Biochem Biophys Res Commun 195, 58-64.

Mann, M., Hojrup, P. and Roepstorff, P. (1993). Biol Mass Spectrom 22, 338-45.

Pappin, D. J. C., Hojrup, P. and Bleasby, A. J. (1993). Curr. Biol. 3, 327-32.

Yates, J. R., 3rd, Speicher, S., Griffin, P. R. and Hunkapiller, T. (1993). Anal Biochem 214, 397-408.

1993


Peptide Mass Fingerprint

• Fast, simple analysis
• High sensitivity
• Need database of protein sequences, not ESTs or genomic DNA
• Sequence (or close homolog) must be present in database
• Not good for mixtures, especially a minor component.
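A minimal sketch of the peptide mass fingerprint idea, assuming trypsin as the enzyme (cleave after K or R, but not before P), no missed cleavages, no modifications, and a fixed mass tolerance. The function names, the toy database and the simple match count are illustrative assumptions only; they are not the scoring used by Mascot or any other search engine.

```python
# Monoisotopic residue masses (Da) for the 20 standard amino acids.
RESIDUE_MASS = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276,
    "V": 99.06841, "T": 101.04768, "C": 103.00919, "L": 113.08406,
    "I": 113.08406, "N": 114.04293, "D": 115.02694, "Q": 128.05858,
    "K": 128.09496, "E": 129.04259, "M": 131.04049, "H": 137.05891,
    "F": 147.06841, "R": 156.10111, "Y": 163.06333, "W": 186.07931,
}
WATER = 18.01056  # added once per peptide for the free N- and C-termini


def tryptic_digest(protein):
    """Cleave after K or R unless the next residue is P (no missed cleavages)."""
    peptides, start = [], 0
    for i, aa in enumerate(protein):
        nxt = protein[i + 1] if i + 1 < len(protein) else ""
        if aa in "KR" and nxt != "P":
            peptides.append(protein[start:i + 1])
            start = i + 1
    if start < len(protein):
        peptides.append(protein[start:])
    return peptides


def peptide_mass(peptide):
    """Neutral monoisotopic mass of an unmodified peptide."""
    return sum(RESIDUE_MASS[aa] for aa in peptide) + WATER


def count_matches(observed, protein, tol=0.5):
    """Number of observed masses within `tol` Da of any calculated peptide mass."""
    calculated = [peptide_mass(p) for p in tryptic_digest(protein)]
    return sum(any(abs(m - c) <= tol for c in calculated) for m in observed)


def rank_proteins(observed, database, tol=0.5):
    """Rank database entries by match count (hypothetical stand-in for a real score)."""
    scores = {name: count_matches(observed, seq, tol) for name, seq in database.items()}
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)


# Toy database with made-up sequences, for illustration only.
database = {"P1": "AGKCDERLMNK", "P2": "GGGGKAAAAR"}
print(rank_proteins([274.16, 521.19], database))
```

Counting matches is the simplest possible ranking; it illustrates why the sequence must be present in the database and why a minor component of a mixture, contributing few masses, is easily missed.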


[Figure: peptide backbone H–(NH–CHR–CO)4–OH with side chains R1 to R4, showing the cleavage sites that give the N-terminal fragment ions a1 b1 c1, a2 b2 c2, a3 b3 c3 and the C-terminal fragment ions x3 y3 z3, x2 y2 z2, x1 y1 z1 (charge H+)]

Roepstorff, P. and Fohlman, J. (1984). Proposal for a common nomenclature for sequence ions in mass spectra of peptides. Biomed Mass Spectrom 11, 601.
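As an illustration of this nomenclature, the sketch below calculates singly charged b and y ion m/z values for an unmodified peptide. It assumes monoisotopic residue masses and covers only the b and y series; the function names are hypothetical.

```python
# Monoisotopic residue masses (Da) for the 20 standard amino acids.
RESIDUE_MASS = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276,
    "V": 99.06841, "T": 101.04768, "C": 103.00919, "L": 113.08406,
    "I": 113.08406, "N": 114.04293, "D": 115.02694, "Q": 128.05858,
    "K": 128.09496, "E": 129.04259, "M": 131.04049, "H": 137.05891,
    "F": 147.06841, "R": 156.10111, "Y": 163.06333, "W": 186.07931,
}
WATER = 18.01056
PROTON = 1.00728


def b_ions(peptide):
    """Singly charged b ions b1..b(n-1): N-terminal residue sum plus a proton."""
    total, series = 0.0, []
    for aa in peptide[:-1]:
        total += RESIDUE_MASS[aa]
        series.append(total + PROTON)
    return series


def y_ions(peptide):
    """Singly charged y ions y1..y(n-1): C-terminal residue sum plus water plus a proton."""
    total, series = 0.0, []
    for aa in reversed(peptide[1:]):
        total += RESIDUE_MASS[aa]
        series.append(total + WATER + PROTON)
    return series


for name, series in (("b", b_ions("AGK")), ("y", y_ions("AGK"))):
    print(name, [round(mz, 4) for mz in series])
```

A useful sanity check is that complementary fragments b(i) and y(n-i) sum to the neutral peptide mass plus two protons.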


Three ways to use mass spectrometry data for protein ID:

1. Peptide Mass Fingerprint
   A set of peptide molecular weights from an enzyme digest of a protein

2. Sequence Query
   Mass values combined with amino acid sequence or composition data


Mann, M. and Wilm, M. (1994). Error-tolerant identification of peptides in sequence databases by peptide sequence tags. Anal Chem 66, 4390-9.


[Figure: MS/MS spectrum annotated with a sequence tag; labels surviving from the figure: TA, G, 913.2, 1278.3]


Sequence Tag

• Rapid search times
• Error tolerant
• Requires interpretation
• Requires high quality data.
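A rough sketch of how a sequence tag can filter candidate peptides, assuming the tag was read from a singly charged b ion ladder, so the residues preceding the tag must sum to the m/z where the tag starts. This simplified, strict version does not implement the error-tolerant matching of the published method, and the function name and tolerance are assumptions.

```python
# Monoisotopic residue masses (Da) for the 20 standard amino acids.
RESIDUE_MASS = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276,
    "V": 99.06841, "T": 101.04768, "C": 103.00919, "L": 113.08406,
    "I": 113.08406, "N": 114.04293, "D": 115.02694, "Q": 128.05858,
    "K": 128.09496, "E": 129.04259, "M": 131.04049, "H": 137.05891,
    "F": 147.06841, "R": 156.10111, "Y": 163.06333, "W": 186.07931,
}
PROTON = 1.00728


def matches_tag(peptide, start_mz, tag, tol=0.5):
    """True if `tag` occurs in `peptide` and the residues before it
    give a singly charged b ion within `tol` of `start_mz`."""
    pos = peptide.find(tag)
    while pos != -1:
        prefix_mz = sum(RESIDUE_MASS[aa] for aa in peptide[:pos]) + PROTON
        if abs(prefix_mz - start_mz) <= tol:
            return True
        pos = peptide.find(tag, pos + 1)  # try the next occurrence
    return False


# Made-up candidate peptides: both contain "TAG", only one at the right mass offset.
candidates = ["AGTAGK", "TAGAGK"]
print([p for p in candidates if matches_tag(p, 129.07, "TAG")])
```

Because the test reduces to a substring search plus one mass comparison, it is cheap to run over a whole database, but it depends on the tag having been interpreted correctly from good data.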


Three ways to use mass spectrometry data for protein ID:

1. Peptide Mass Fingerprint
   A set of peptide molecular weights from an enzyme digest of a protein

2. Sequence Query
   Mass values combined with amino acid sequence or composition data

3. MS/MS Ions Search
   MS/MS data from a single peptide or from a complete LC-MS/MS run


Eng, J. K., McCormack, A. L. and Yates, J. R., 3rd (1994). An approach to correlate tandem mass spectral data of peptides with amino acid sequences in a protein database. J. Am. Soc. Mass Spectrom. 5, 976-89.


MS/MS Ions Search

• Easily automated for high throughput
• Get matches from marginal data
• Can be slow
  – No enzyme
  – Lots of variable modifications
  – Large database
  – Large dataset
• Peptide identification, proteins by inference.


MS/MS matching identifies peptides, not proteins

• Grouping peptide matches into protein matches is an arbitrary procedure

Protein A: Peptide 1, Peptide 2, Peptide 3
Protein B: Peptide 1, Peptide 3
Protein C: Peptide 2

• If match peptides 1, 2 and 3 from 2D gel spot, Mascot will prefer Protein A (Occam’s razor)

• But, could easily have been mixture of B and C.
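The Occam's razor preference can be sketched as a greedy minimal-set selection: repeatedly pick the protein that explains the most still-unexplained peptides. This is a generic parsimony heuristic written for illustration, not Mascot's actual grouping procedure; the function name is hypothetical.

```python
def minimal_protein_set(protein_to_peptides, observed):
    """Greedy set cover: choose proteins until every observed peptide is explained."""
    remaining = set(observed)
    chosen = []
    while remaining:
        # Protein explaining the most unexplained peptides (first one wins ties).
        best = max(protein_to_peptides,
                   key=lambda p: len(protein_to_peptides[p] & remaining))
        covered = protein_to_peptides[best] & remaining
        if not covered:
            break  # some peptides cannot be explained by any protein
        chosen.append(best)
        remaining -= covered
    return chosen


proteins = {
    "Protein A": {1, 2, 3},
    "Protein B": {1, 3},
    "Protein C": {2},
}
print(minimal_protein_set(proteins, {1, 2, 3}))
```

With Protein A in the database the heuristic reports it alone; remove it and the same peptides are explained by B plus C. The peptide evidence is identical in both cases, which is why the grouping is described as arbitrary.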


BLAST / FASTA

• Sequence against sequence
• Can be used to find weak / distant similarity
• Can make gapped alignments

MS-based ID

• Mass & intensity values against sequence
• Looking for identity or near identity
• Generally, short peptides


What is probability based scoring?

We compute the probability that the observed match between the experimental data and mass values calculated from a candidate protein or peptide sequence is a random event.

The ‘correct’ match, which is not a random event, has a very low probability.


Why is probability based scoring important?

• Human (even expert) judgement is subjective and can be unreliable.


Why is probability based scoring important?

• Human (even expert) judgement is subjective and can be unreliable
• Standard, statistical tests of significance can be applied to the results.


Standard significance tests can be applied to results

• Mascot score is -10Log10(P), where P is absolute probability that observed match is random event
• If we make 50,000 trials, a 1 in 20 significance threshold is
  – -10Log10(1 / (20 x 50,000)) = 60 … “identity”
• If data quality is poor, this may not be achievable. If match is clearly an outlier, also report a lower, empirical threshold
  – … “homology”
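The threshold arithmetic above can be reproduced in a few lines. The function names are illustrative, and the calculation assumes, as on the slide, that the number of trials is the number of candidate sequences tested and that the significance level is simply divided by it.

```python
import math


def score_from_probability(p):
    """Score defined as -10 * log10(P), P being the probability of a random match."""
    return -10 * math.log10(p)


def identity_threshold(num_trials, significance=1 / 20):
    """Score a match must exceed to be significant after `num_trials` comparisons."""
    return score_from_probability(significance / num_trials)


# 50,000 trials at 1 in 20 significance gives a threshold of about 60.
print(round(identity_threshold(50_000), 2))
```

Every tenfold increase in the number of trials raises the threshold by 10, which is why the same spectrum can be significant against a small database and not significant against a large one.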


Why is probability based scoring important?

• Human (even expert) judgement is subjective and can be unreliable
• Standard, statistical tests of significance can be applied to the results
• Arbitrary scoring schemes are susceptible to false positives.


Major proteomics study published in Nature, 2002

• 11,381 peptides
• 2,415 proteins
• Matches to fully non-tryptic peptides discarded
• Overall fraction of semi-tryptic peptides 34%
• For proteins identified using
  – 1 peptide: 63% semi-tryptic
  – 2 peptides: 54% semi-tryptic
  – 3 peptides: 46% semi-tryptic.
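For reference, the tryptic, semi-tryptic and non-tryptic categories can be made concrete with a small classifier. The sketch assumes the usual definitions (both ends, one end, or neither end consistent with cleavage after K or R, with the protein termini counting as valid ends) and ignores the proline exception; the function name is hypothetical and this is not the procedure used in the cited study.

```python
def cleavage_class(protein, start, end):
    """Classify the peptide protein[start:end] by how many of its ends are tryptic."""
    # N-terminal end is tryptic if it is the protein N-terminus or follows K/R.
    n_term_ok = start == 0 or protein[start - 1] in "KR"
    # C-terminal end is tryptic if it is the protein C-terminus or the peptide ends in K/R.
    c_term_ok = end == len(protein) or protein[end - 1] in "KR"
    labels = {2: "tryptic", 1: "semi-tryptic", 0: "non-tryptic"}
    return labels[int(n_term_ok) + int(c_term_ok)]


protein = "AGKCDERLMNK"  # made-up sequence for illustration
for start, end in [(3, 7), (4, 7), (4, 6)]:
    print(protein[start:end], cleavage_class(protein, start, end))
```

Allowing one non-specific end greatly enlarges the set of candidate peptides, so random matches are more likely to be semi-tryptic. A semi-tryptic fraction that rises as the number of peptides per protein falls is consistent with single-peptide identifications containing more false positives.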


Can we calculate a probability that the match is correct?

• Maybe, if it is a test sample and you know what the answer should be
• If the sample is an unknown, then you have to define “correct” very carefully:
  – The best match in the database?
  – The best match out of all possible peptides?
  – The peptide sequence that is uniquely and completely defined by the MS data?
  – A statistically unlikely match?


[Figure: four example matches, labelled Expect 1.8E-5, Expect 9.2E-4, Expect 0.037 and Expect 4.0]
