Statistical Significance for Peptide Identification by Tandem Mass Spectrometry

Nathan Edwards
Center for Bioinformatics and Computational Biology
University of Maryland, College Park

Mass Spectrometry for Proteomics

• Measure mass of many (bio)molecules simultaneously
  • High bandwidth

• Mass is an intrinsic property of all (bio)molecules
  • No prior knowledge required

• Measure mass of many molecules simultaneously
  • ...but not too many, abundance bias

• Mass is an intrinsic property of all (bio)molecules
  • ...but need a reference to compare to

High Bandwidth

[Figure: mass spectrum; x-axis m/z (250 to 1000), y-axis % Intensity (0 to 100)]


Mass is fundamental!

• Mass spectrometry has been around since the turn of the 20th century...
  • ...why is MS-based proteomics so new?

• Ionization methods
  • MALDI, Electrospray

• Protein chemistry & automation
  • Chromatography, Gels, Computers

• Protein sequence databases
  • A reference for comparison

Sample Preparation for Peptide Identification

Enzymatic Digest and Fractionation

Single Stage MS

[Figure: single-stage MS spectrum; x-axis m/z]

Tandem Mass Spectrometry (MS/MS)

Precursor selection

[Figure: MS spectrum with one precursor selected; x-axis m/z]

Tandem Mass Spectrometry (MS/MS)

Precursor selection + collision-induced dissociation (CID)

[Figure: MS spectrum and the MS/MS spectrum of the selected precursor; x-axis m/z]

Peptide Fragmentation

N-terminus                                                  C-terminus
H…-HN-CH(R_{i-1})-CO-NH-CH(R_i)-CO-NH-CH(R_{i+1})-CO-…OH
   AA residue i-1    AA residue i   AA residue i+1

Peptides consist of amino-acids arranged in a linear backbone.

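[Figure: backbone fragmentation, -HN-CH(R_i)-CO-NH-CH(R_{i+1})-CO-NH-, with side chains R’ and R”; cleavage along the backbone gives b ions (b_i, b_{i+1}) and y ions (y_{n-i}, y_{n-i-1})]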

Peptide: S-G-F-L-E-E-D-E-L-K

  MW  ion  b fragment | y fragment   ion    MW
  88  b1   S | GFLEEDELK             y9   1080
 145  b2   SG | FLEEDELK             y8   1022
 292  b3   SGF | LEEDELK             y7    875
 405  b4   SGFL | EEDELK             y6    762
 534  b5   SGFLE | EDELK             y5    633
 663  b6   SGFLEE | DELK             y4    504
 778  b7   SGFLEED | ELK             y3    389
 907  b8   SGFLEEDE | LK             y2    260
1020  b9   SGFLEEDEL | K             y1    147

[Figure: MS/MS spectrum; x-axis m/z (250 to 1000), y-axis % Intensity (0 to 100)]

b ions: S 88, G 145, F 292, L 405, E 534, E 663, D 778, E 907, L 1020, K 1166
y ions: 147, 260, 389, 504, 633, 762, 875, 1022, 1080, 1166

[Figure: the same spectrum and b/y ion ladders as on the previous slide, with peaks labeled b3 to b9 and y2 to y9]

Peptide Identification

• For each (likely) peptide sequence
  1. Compute fragment masses
  2. Compare with spectrum
  3. Retain those that match well

• Peptide sequences from protein sequence databases
  • Swiss-Prot, IPI, NCBI’s nr, ...

• Automated, high-throughput peptide identification in complex mixtures
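The three steps above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the algorithm of any particular search engine: the shared-peak-count score, the 0.5 Da tolerance, the `min_score` cutoff, the candidate list, and all function names are hypothetical, and only singly charged b and y ions are generated.

```python
# Monoisotopic residue masses (Da) for the amino acids used in this example.
RESIDUE_MASS = {
    "S": 87.03203, "G": 57.02147, "F": 147.06842, "L": 113.08407,
    "E": 129.04260, "D": 115.02695, "K": 128.09497, "A": 71.03712,
}
WATER, PROTON = 18.01056, 1.00728  # assumed constants (Da)


def fragment_masses(peptide):
    """Step 1: singly charged b- and y-ion m/z values for a peptide."""
    masses = [RESIDUE_MASS[aa] for aa in peptide]
    b = [sum(masses[:i]) + PROTON for i in range(1, len(masses))]
    y = [sum(masses[i:]) + WATER + PROTON for i in range(1, len(masses))]
    return b + y


def shared_peak_count(peptide, peaks, tol=0.5):
    """Step 2: count spectrum peaks within tol Da of some fragment mass."""
    frags = fragment_masses(peptide)
    return sum(any(abs(p - f) <= tol for f in frags) for p in peaks)


def identify(peaks, candidates, min_score=5):
    """Step 3: retain candidates that match well, best score first."""
    scored = [(shared_peak_count(c, peaks), c) for c in candidates]
    return sorted([sc for sc in scored if sc[0] >= min_score], reverse=True)


# Hypothetical spectrum: fragments of SGFLEEDELK with some peaks missing.
spectrum = fragment_masses("SGFLEEDELK")[2:15]
hits = identify(spectrum, ["SGFLEEDELK", "KLEDEELFGS", "AAGFLEEDLK"])
print(hits)
```

A real search would draw its candidates from a sequence database and would usually restrict them to peptides whose mass matches the precursor.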

High Quality Peptide Identification: E-value < 10^-8

http://codon.umiacs.umd.edu:8891/thegpm-cgi/peptide.pl?path=/tandem/archive/GPM00300000340.3.xml&uid=53361&label=AAAACKOM&homolog=AAAACKOM&id=895.1.1&proex=-1

Moderate quality peptide identification: E-value < 10^-3

Amino-Acid Molecular Weights

Amino acid          Residue MW     Amino acid          Residue MW
A  Alanine            71.03712     M  Methionine        131.04049
C  Cysteine          103.00919     N  Asparagine        114.04293
D  Aspartic acid     115.02695     P  Proline            97.05277
E  Glutamic acid     129.04260     Q  Glutamine         128.05858
F  Phenylalanine     147.06842     R  Arginine          156.10112
G  Glycine            57.02147     S  Serine             87.03203
H  Histidine         137.05891     T  Threonine         101.04768
I  Isoleucine        113.08407     V  Valine             99.06842
K  Lysine            128.09497     W  Tryptophan        186.07932
L  Leucine           113.08407     Y  Tyrosine          163.06333
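As a worked check, the residue masses in the table can be summed to give the b- and y-ion ladder of the example peptide S-G-F-L-E-E-D-E-L-K. The sketch below assumes singly charged ions, a proton mass of 1.00728 Da, and a water mass of 18.01056 Da; those two constants and the function names are not from the slides. The computed monoisotopic values agree with the integer masses shown earlier to within about half a dalton.

```python
# Monoisotopic residue masses (Da), copied from the table above.
RESIDUE_MASS = {
    "A": 71.03712, "C": 103.00919, "D": 115.02695, "E": 129.04260,
    "F": 147.06842, "G": 57.02147, "H": 137.05891, "I": 113.08407,
    "K": 128.09497, "L": 113.08407, "M": 131.04049, "N": 114.04293,
    "P": 97.05277, "Q": 128.05858, "R": 156.10112, "S": 87.03203,
    "T": 101.04768, "V": 99.06842, "W": 186.07932, "Y": 163.06333,
}
WATER = 18.01056   # H2O carried by y ions (assumed constant)
PROTON = 1.00728   # charge carrier for singly charged ions (assumed constant)


def b_ions(peptide):
    """m/z of singly charged b ions b1..b(n-1): N-terminal prefixes."""
    total, out = PROTON, []
    for aa in peptide[:-1]:
        total += RESIDUE_MASS[aa]
        out.append(total)
    return out


def y_ions(peptide):
    """m/z of singly charged y ions y1..y(n-1): C-terminal suffixes."""
    total, out = WATER + PROTON, []
    for aa in reversed(peptide[1:]):
        total += RESIDUE_MASS[aa]
        out.append(total)
    return out


peptide = "SGFLEEDELK"
for i, (b, y) in enumerate(zip(b_ions(peptide), y_ions(peptide)), start=1):
    print(f"b{i} {b:8.2f}   y{i} {y:8.2f}")
```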

• Peptide fragmentation by CID is poorly understood

• MS/MS spectra represent incomplete information about amino-acid sequence
  • I/L, K/Q, GG/N, …

• Correct identifications don’t come with a certificate!
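The ambiguous pairs listed above follow directly from the residue masses. A short calculation, using the residue masses from the amino-acid table (the helper name `mass_gap` is hypothetical), shows how small the gaps are:

```python
# Residue masses (Da) for the ambiguous cases named above.
MASS = {"I": 113.08407, "L": 113.08407, "K": 128.09497,
        "Q": 128.05858, "G": 57.02147, "N": 114.04293}


def mass_gap(a, b):
    """Absolute mass difference (Da) between two residue strings."""
    return abs(sum(MASS[x] for x in a) - sum(MASS[x] for x in b))


for a, b in [("I", "L"), ("K", "Q"), ("GG", "N")]:
    print(f"{a} vs {b}: {mass_gap(a, b):.5f} Da")
```

I and L have identical mass, K and Q differ by roughly 0.036 Da, and two glycines are indistinguishable from one asparagine at the precision of the table, so a fragment mass alone cannot separate these alternatives unless the mass tolerance is tighter than the gap.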

• High-throughput workflows demand we analyze all spectra, all the time.

• Spectra may not contain enough information to be interpreted correctly
  • …bad static on a cell phone

• Peptides may not match our assumptions
  • …it’s all Greek to me

• “Don’t know” is an acceptable answer!



• Rank the best peptide identifications

• Is the top ranked peptide correct?

• Incorrect peptide has best score
  • Correct peptide is missing?
  • Potential for incorrect conclusion
  • What score ensures no incorrect peptides?

• Correct peptide has weak score
  • Insufficient fragmentation, poor score
  • Potential for weakened conclusion
  • What score ensures we find all correct peptides?

Statistical Significance

• Can’t prove particular identifications are right or wrong...
  • ...need to know fragmentation in advance!

• A minimal standard for identification scores...
  • ...better than guessing.
  • p-value, E-value, statistical significance


Pin the tail on the donkey…

Probability Concepts

Throwing darts
• One at a time
• Blindfolded

Uniform distribution? Independent? Identically distributed?

Pr [ Dart hits 20 ] = 0.05

Throwing darts
• One at a time
• Blindfolded
• Three darts

Pr [Hitting 20 3 times] = 0.05 * 0.05 * 0.05 = 0.000125

Pr [Hit 20 at least twice] = 0.007125 + 0.000125 = 0.00725

Times 20 is hit   Probability
0                 0.857375
1                 0.135375
2                 0.007125
3                 0.000125
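The three-dart probabilities above are binomial probabilities with n = 3 and p = 0.05. A minimal sketch that reproduces them (the function name `binom_pmf` is an illustrative choice):

```python
from math import comb


def binom_pmf(k, n, p):
    """Probability of exactly k hits in n independent throws."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)


p_hit = 0.05  # Pr[dart hits 20], from the slide
pmf = [binom_pmf(k, 3, p_hit) for k in range(4)]
at_least_twice = pmf[2] + pmf[3]

for k, prob in enumerate(pmf):
    print(f"{k} times: {prob:.6f}")
print(f"at least twice: {at_least_twice:.6f}")
```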

[Figure: bar chart of the probability of hitting 20 exactly 0, 1, 2, or 3 times with three darts: 0.857375, 0.135375, 0.007125, 0.000125]

Throwing darts
• One at a time
• Blindfolded
• 100 darts

Pr [Hitting 20 3 times] = 0.139575

Pr [Hit 20 at least twice] = 0.9629188

Times 20 is hit   Probability
0                 0.005920
1                 0.031160
2                 0.081181
3                 0.139575

Probability Concepts

[Figure: histogram of rbinom(10000, 100, 0.05); x-axis number of hits (0 to 15), y-axis Frequency (0 to 1500)]
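The histogram comes from the R call `rbinom(10000, 100, 0.05)`: 10000 repetitions of throwing 100 darts, each hitting 20 with probability 0.05. A rough Python stand-in is sketched below; the `rbinom` helper is a hypothetical re-implementation written for this example and the seed is arbitrary. It compares the simulated frequencies with the exact binomial probabilities.

```python
import random
from collections import Counter
from math import comb


def rbinom(trials, n, p, rng):
    """Draw `trials` binomial counts: hits among n throws with hit prob p."""
    return [sum(rng.random() < p for _ in range(n)) for _ in range(trials)]


rng = random.Random(0)  # fixed seed, chosen arbitrarily
draws = rbinom(10000, 100, 0.05, rng)
freq = Counter(draws)

for k in range(16):
    exact = comb(100, k) * 0.05**k * 0.95 ** (100 - k)
    print(f"{k:2d}  simulated {freq[k] / len(draws):.4f}  exact {exact:.4f}")
```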

Match Score

• Dartboard represents the mass range of the spectrum

• Peaks of a spectrum are “slices”
  • Width of slice corresponds to mass tolerance

• Darts represent
  • random masses
  • masses of fragments of a random peptide
  • masses of peptides of a random protein
  • masses of biomarkers from a random class

• How many darts do we get to throw?

Match Score

[Figure: spectrum with peaks labeled 270, 330, 550, 580, 755, and 870; x-axis m/z (250 to 1000), y-axis % Intensity (0 to 100)]

• What is the probability that we match at least 5 peaks?

Match Score

• Number of matched peaks ~ Binomial( n , p ) ≈ Poisson( n p ), for small p and large n
• Pr [ Match ≥ s peaks ] is the upper tail of this distribution

p is the probability of a random mass / peak match, n is the number of darts (fragments in our answer)
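A small numerical sketch of the two tail probabilities follows. The peak count, tolerance, mass range, and number of fragments are hypothetical values chosen for illustration, and p is derived on the assumption that the tolerance windows around the peaks do not overlap and that random masses are uniform over the mass range.

```python
from math import comb, exp, factorial


def binomial_tail(s, n, p):
    """Pr[at least s of n random masses match a peak], exact binomial."""
    return sum(comb(n, k) * p**k * (1 - p) ** (n - k) for k in range(s, n + 1))


def poisson_tail(s, lam):
    """Pr[at least s matches] under the Poisson approximation with mean lam."""
    return 1.0 - sum(exp(-lam) * lam**k / factorial(k) for k in range(s))


# Hypothetical numbers: 6 peaks, +/- 2 Da tolerance, 1000 Da mass range,
# so a uniformly random mass lands in some peak's window with p = 6 * 4 / 1000.
p = 6 * 4 / 1000
n = 18  # darts, e.g. 9 b ions + 9 y ions of a 10-residue peptide

print("binomial:", binomial_tail(5, n, p))
print("poisson: ", poisson_tail(5, n * p))
```

With n this small the Poisson value is only a rough approximation far out in the tail, which is consistent with the "small p and large n" condition above.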

Match Score

Theoretical distribution
• Used by OMSSA
• Proposed, in various forms, by many.

• Probability of random mass / peak match
  • IID (independent, identically distributed)
  • Based on match tolerance

Match Score

Theoretical distribution assumptions
• Each dart is independent
  • Peaks are not “related”

• Each dart is identically distributed
  • Chance of random mass / peak match is the same for all peaks

Tournament Size

[Figure: four panels (100, 1000, 10000, and 100000 people) showing the distribution of the number of 20’s hit with 100 darts; axis label “100 Darts, # 20’s”]

Tournament Size

[Figure: the same four panels (100, 1000, 10000, and 100000 people), zoomed to the upper tail, 10 to 18 on the “100 Darts, # 20’s” axis and 0 to 50 on the other axis]

Number of Trials

• Tournament size == number of trials
  • Number of peptides tried
  • Related to sequence database size

• Probability that at least one random match score is ≥ s
  • 1 – Pr [ all match scores < s ]
  • 1 – Pr [ match score < s ]^Trials (*)
  • Assumes IID!

• Expect value (E-value)
  • E = Trials * Pr [ match ≥ s ]
  • Corresponds to Bonferroni bound on (*)
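The relationship between (*) and the E-value can be checked numerically. In the sketch below the single-trial probability is a hypothetical value and the function names are illustrative. The E-value never falls below the probability (*), and the two nearly coincide while E is small.

```python
def prob_any_match(p_single, trials):
    """(*): Pr[at least one of `trials` IID random matches scores >= s]."""
    return 1.0 - (1.0 - p_single) ** trials


def expect_value(p_single, trials):
    """E-value: expected number of random matches scoring >= s."""
    return trials * p_single


p_single = 1e-6  # hypothetical Pr[match >= s] for one random peptide
for trials in (100, 10_000, 1_000_000):
    print(trials, prob_any_match(p_single, trials), expect_value(p_single, trials))
```

For a fixed score threshold, a larger database (more trials) makes a chance match more likely, which is why the same raw score is less significant against a larger sequence database.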


Better Dart Throwers


Better Random Models

• Comparison with completely random model isn’t really fair

• Match scores for real spectra with real peptides obey rules

• Even incorrect peptides match with non-random structure!

• Want to generate random fragment masses (darts) that behave more like the real thing:
  • Some fragments are more likely than others
  • Some fragments depend on others

• Theoretical models can only incorporate this structure to a limited extent.

• Generate random peptides
  • Real looking fragment masses
  • No theoretical model!
  • Must use empirical distribution
  • Usually require they have the correct precursor mass

• Score function can model anything we like!

Fenyo & Beavis, Anal. Chem., 2003

• Truly random peptides don’t look much like real peptides

• Just use peptides from the sequence database!

• Caveats:
  • Correct peptide (non-random) may be included
  • Peptides are not independent

• Reverse sequence avoids only the first problem
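With an empirical distribution, the significance of a score is read off as the fraction of "random" peptide scores that are at least as good. The sketch below is one simple way to do that; the score list is made up, the function name is hypothetical, and the +1 correction in numerator and denominator is a common convention added here as an assumption, not something stated on the slides.

```python
def empirical_p_value(null_scores, s):
    """Fraction of null scores that are >= s, with a +1 correction
    so that the estimate is never exactly zero."""
    hits = sum(score >= s for score in null_scores)
    return (hits + 1) / (len(null_scores) + 1)


# Hypothetical match scores of one spectrum against 12 database peptides.
null_scores = [2, 3, 1, 4, 2, 5, 3, 2, 6, 1, 3, 2]

print(empirical_p_value(null_scores, 5))    # 2 of the 12 scores are >= 5
print(empirical_p_value(null_scores, 100))  # no score this high was observed
```

The smallest value this estimate can return is 1 / (N + 1) for N sampled peptides, so scores beyond the best observed random score cannot be resolved from the samples alone.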

Extrapolating from the Empirical Distribution

• Often, the empirical shape is consistent with a theoretical model

Geer et al., J. Proteome Research, 2004; Fenyo & Beavis, Anal. Chem., 2003

False Positive Rate Estimation

• Each spectrum is a chance to be right, wrong, or inconclusive.
  • How many decisions are wrong?

• Given identification criteria:
  • SEQUEST Xcorr, E-value, Score, etc., plus...
  • ...threshold

• Use “decoy” sequences
  • random, reverse, cross-species
  • Identifications must be incorrect!

• # FP in real search = # hits in decoy search
  • Need same size database, or rate conversion

• FP Rate: # decoy hits / # real hits

• FP Rate: 2 x # decoy hits / (# real hits + # decoy hits)
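Both rate formulas are simple ratios of hit counts above the chosen threshold. The sketch below reads the first as applying when real and decoy databases are searched separately and the second as applying when they are searched together as one combined database; that reading, the function names, and the counts are assumptions made for illustration.

```python
def fp_rate_separate(decoy_hits, real_hits):
    """Decoy and real databases of the same size searched separately."""
    return decoy_hits / real_hits


def fp_rate_combined(decoy_hits, real_hits):
    """Real and decoy sequences searched together: each decoy hit is
    assumed to stand for one false hit hidden among the real hits."""
    return 2 * decoy_hits / (real_hits + decoy_hits)


# Hypothetical counts of hits passing the score threshold.
print(fp_rate_separate(25, 1000))  # 25 decoy hits vs 1000 real hits
print(fp_rate_combined(25, 975))   # 25 decoy + 975 real hits in one search
```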

• A form of statistical significance
  • In “theory”, E-value and a FP rate are the same.
  • Search engine independent

• Easy to implement
  • Assumes a single threshold for all spectra
  • Spectrum/Peptide Identification scores are not iid!...
  • ...but E-values, in principle, are.

Peptide Prophet

• From the Institute for Systems Biology
  • Keller et al., Anal. Chem. 2002

• Re-analysis of SEQUEST results

• Spectra are trials
  • Assumes that many of the spectra are not correctly identified


Peptide Prophet

Distribution of spectral scores in the results

Keller et al., Anal. Chem. 2002

Peptide Prophet

• Assumes a bimodal distribution of scores, with a particular shape

• Ignores database size
  • …but it is included implicitly

• Like empirical distribution for peptide sampling, can be applied to any score function
  • Can be applied to any search engine’s results
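The idea of a bimodal score distribution can be illustrated with a two-component mixture: one component for incorrect identifications, one for correct ones, and Bayes' rule to turn a score into a probability of being correct. The sketch below uses two Gaussians with made-up means, standard deviations, and mixing fraction purely as placeholders; these are not the component shapes or parameters that Peptide Prophet fits, and the fitting of the mixture to the observed scores is omitted.

```python
from math import exp, pi, sqrt


def normal_pdf(x, mean, sd):
    """Density of a normal distribution at x."""
    return exp(-0.5 * ((x - mean) / sd) ** 2) / (sd * sqrt(2 * pi))


def prob_correct(score, frac_correct, correct=(3.0, 1.0), incorrect=(0.0, 1.0)):
    """Posterior Pr[correct | score] for a two-component mixture.
    `correct` and `incorrect` are (mean, sd) of each score component."""
    f1 = frac_correct * normal_pdf(score, *correct)
    f0 = (1 - frac_correct) * normal_pdf(score, *incorrect)
    return f1 / (f1 + f0)


# Hypothetical setting: 20% of spectra are correctly identified.
for score in (0.0, 1.5, 3.0):
    print(score, round(prob_correct(score, frac_correct=0.2), 3))
```

Because the mixing fraction enters the posterior, the same score earns a lower probability when few of the spectra in the data set are correctly identified.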

Peptide Prophet

• Caveats
  • Are spectra scores sampled from the same distribution?
  • Are there enough correct identifications for the second peak?
  • Are spectra independent observations?
  • Are distributions appropriately shaped?

• Huge improvement over raw SEQUEST results


Peptides to Proteins

Nesvizhskii et al., Anal. Chem. 2003

• A peptide sequence may occur in many different protein sequences
  • Variants, paralogues, protein families

• Separation, digestion and ionization are not well understood

• Proteins in sequence database are extremely non-random, and very dependent


Publication Guidelines

1. Computational parameters
  • Spectral processing
  • Sequence database
  • Search program
  • Statistical analysis

2. Number of peptides per protein
  • Each peptide sequence counts once!
  • Multiple forms of the same peptide count once!
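The counting rule in guideline 2 amounts to counting distinct peptide sequences per protein. A minimal sketch follows; the records are invented, and treating different charge states and modification states as "forms" of one peptide is an interpretation made for this example.

```python
from collections import defaultdict

# Hypothetical identifications: (protein, peptide sequence, modification, charge).
identifications = [
    ("P1", "SGFLEEDELK", None, 2),
    ("P1", "SGFLEEDELK", None, 3),       # same peptide, different charge
    ("P1", "AMDEK", "oxidation(M)", 2),
    ("P1", "AMDEK", None, 2),            # same peptide, unmodified form
    ("P2", "LVEEK", None, 2),
]


def peptides_per_protein(ids):
    """Count distinct peptide sequences per protein, ignoring
    modification state and charge so each sequence counts once."""
    seqs = defaultdict(set)
    for protein, sequence, _mod, _charge in ids:
        seqs[protein].add(sequence)
    return {protein: len(s) for protein, s in seqs.items()}


print(peptides_per_protein(identifications))
```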

3. Single-peptide proteins must be explicitly justified by
  • Peptide sequence
  • N and C terminal amino-acids
  • Precursor mass and charge
  • Peptide Scores
  • Multiple forms of the peptide counted once!

4. Biological conclusions based on single-peptide proteins must show the spectrum

5. More stringent requirements for PMF data analysis
  • Similar to that for tandem mass spectra

6. Management of protein redundancy
  • Peptides identified from a different species?

7. Spectra submission encouraged


Summary

• Could guessing be as effective as a search?

• More guesses improve the best guess

• Better guessers help us be more discriminating

• Going from peptides to proteins is not as simple as it seems

• Publication guidelines reflect sound statistical principles.
