Statistical Significance for Peptide Identification by Tandem Mass Spectrometry Nathan Edwards Center for Bioinformatics and Computational Biology University of Maryland, College Park


Post on 17-Jan-2016


Page 1: Statistical Significance for Peptide Identification by Tandem Mass Spectrometry

Statistical Significance for Peptide Identification by Tandem Mass Spectrometry

Nathan Edwards
Center for Bioinformatics and Computational Biology
University of Maryland, College Park

Page 2: Statistical Significance for Peptide Identification by Tandem Mass Spectrometry

2

Mass Spectrometry for Proteomics

• Measure mass of many (bio)molecules simultaneously
  • High bandwidth

• Mass is an intrinsic property of all (bio)molecules
  • No prior knowledge required

Page 3: Statistical Significance for Peptide Identification by Tandem Mass Spectrometry

3

Mass Spectrometry for Proteomics

• Measure mass of many molecules simultaneously
  • ...but not too many, abundance bias

• Mass is an intrinsic property of all (bio)molecules
  • ...but need a reference to compare to

Page 4: Statistical Significance for Peptide Identification by Tandem Mass Spectrometry

4

High Bandwidth

[Figure: mass spectrum; x-axis m/z, 0 to 1000; y-axis % intensity, 0 to 100]

Page 5: Statistical Significance for Peptide Identification by Tandem Mass Spectrometry

5

Mass is fundamental!

Page 6: Statistical Significance for Peptide Identification by Tandem Mass Spectrometry

6

Mass Spectrometry for Proteomics

• Mass spectrometry has been around since the turn of the 20th century...
  • ...why is MS-based proteomics so new?

• Ionization methods
  • MALDI, Electrospray

• Protein chemistry & automation
  • Chromatography, Gels, Computers

• Protein sequence databases
  • A reference for comparison

Page 7: Statistical Significance for Peptide Identification by Tandem Mass Spectrometry

7

Sample Preparation for Peptide Identification

Enzymatic Digest and Fractionation

Page 8: Statistical Significance for Peptide Identification by Tandem Mass Spectrometry

8

Single Stage MS

[Figure: single-stage MS schematic; the resulting spectrum is plotted against m/z]

Page 9: Statistical Significance for Peptide Identification by Tandem Mass Spectrometry

9

Tandem Mass Spectrometry (MS/MS)

Precursor selection

[Figure: tandem MS schematic with two spectra, each plotted against m/z]

Page 10: Statistical Significance for Peptide Identification by Tandem Mass Spectrometry

10

Tandem Mass Spectrometry (MS/MS)

Precursor selection + collision induced dissociation (CID)

[Figure: tandem MS schematic; the MS and MS/MS spectra are each plotted against m/z]

Page 11: Statistical Significance for Peptide Identification by Tandem Mass Spectrometry

11

Peptide Fragmentation

N-terminus: H…-HN-CH(R_{i-1})-CO-NH-CH(R_i)-CO-NH-CH(R_{i+1})-CO-…OH :C-terminus

Side chains R_{i-1}, R_i, R_{i+1} belong to AA residues i-1, i, i+1.

Peptides consist of amino acids arranged in a linear backbone.

Page 12: Statistical Significance for Peptide Identification by Tandem Mass Spectrometry

12

Peptide Fragmentation

Page 13: Statistical Significance for Peptide Identification by Tandem Mass Spectrometry

13

Peptide Fragmentation

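[Figure: backbone -HN-CH(R_i)-CO-NH-CH(R_{i+1})-CO-NH- with cleavage sites marked; cleavage after residue i gives b_i and y_{n-i}, cleavage after residue i+1 gives b_{i+1} and y_{n-i-1}]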

Page 14: Statistical Significance for Peptide Identification by Tandem Mass Spectrometry

14

Peptide Fragmentation

Peptide: S-G-F-L-E-E-D-E-L-K

MW    ion   N-terminal part   C-terminal part   ion   MW
88    b1    S                 GFLEEDELK         y9    1080
145   b2    SG                FLEEDELK          y8    1022
292   b3    SGF               LEEDELK           y7    875
405   b4    SGFL              EEDELK            y6    762
534   b5    SGFLE             EDELK             y5    633
663   b6    SGFLEE            DELK              y4    504
778   b7    SGFLEED           ELK               y3    389
907   b8    SGFLEEDE          LK                y2    260
1020  b9    SGFLEEDEL         K                 y1    147
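The fragment masses for this peptide can be recomputed from residue masses. The following is a minimal sketch, assuming singly charged ions, the monoisotopic residue masses from the amino-acid table later in this talk, and standard values for the proton and water masses (those two constants and the function name `fragment_ions` are not from the slides).

```python
# Monoisotopic residue masses (Da) for the residues in the example peptide,
# copied from the amino-acid table in this talk.
RESIDUE_MASS = {
    "S": 87.03203, "G": 57.02147, "F": 147.06842, "L": 113.08407,
    "E": 129.04260, "D": 115.02695, "K": 128.09497,
}
PROTON = 1.00728   # assumed proton mass (Da), not given on the slides
WATER = 18.01056   # assumed mass of H2O (Da), not given on the slides


def fragment_ions(peptide):
    """Return (b, y) lists of singly charged fragment m/z values.

    b[i-1] is the b_i ion (first i residues) and y[i-1] is the y_i ion
    (last i residues). Only internal cleavages are listed, so each list
    has len(peptide) - 1 entries.
    """
    masses = [RESIDUE_MASS[aa] for aa in peptide]
    n = len(masses)
    b = [sum(masses[:i]) + PROTON for i in range(1, n)]
    y = [sum(masses[n - i:]) + WATER + PROTON for i in range(1, n)]
    return b, y


b_ions, y_ions = fragment_ions("SGFLEEDELK")
for i, (b, y) in enumerate(zip(b_ions, y_ions), start=1):
    print(f"b{i} {b:9.3f}   y{i} {y:9.3f}")
```

The computed values agree with the integer masses on the slide to within 1 Da; the slide's exact rounding convention is not stated.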

Page 15: Statistical Significance for Peptide Identification by Tandem Mass Spectrometry

15

Peptide Fragmentation

[Figure: MS/MS spectrum; x-axis m/z, 0 to 1000; y-axis % intensity, 0 to 100]

b ions: S 88, G 145, F 292, L 405, E 534, E 663, D 778, E 907, L 1020, K 1166

y ions: 147, 260, 389, 504, 633, 762, 875, 1022, 1080, 1166

Page 16: Statistical Significance for Peptide Identification by Tandem Mass Spectrometry

16

Peptide Fragmentation

[Figure: the same MS/MS spectrum (m/z, 0 to 1000, against % intensity) with peaks annotated y2 to y9 and b3 to b9]

b ions: S 88, G 145, F 292, L 405, E 534, E 663, D 778, E 907, L 1020, K 1166

y ions: 147, 260, 389, 504, 633, 762, 875, 1022, 1080, 1166

Page 17: Statistical Significance for Peptide Identification by Tandem Mass Spectrometry

17

Peptide Identification

• For each (likely) peptide sequence
  1. Compute fragment masses
  2. Compare with spectrum
  3. Retain those that match well

• Peptide sequences from protein sequence databases
  • Swiss-Prot, IPI, NCBI’s nr, ...

• Automated, high-throughput peptide identification in complex mixtures
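The three-step loop above can be sketched as follows. This is an illustration only: the score is a plain count of fragments that fall within a mass tolerance of some peak, the 0.5 Da tolerance is an assumed value, and the peak list and both candidates are made up. Real search engines use richer scores.

```python
def count_matches(fragment_mz, peak_mz, tolerance=0.5):
    """Count theoretical fragments within `tolerance` (Da) of any observed peak."""
    return sum(
        any(abs(f - p) <= tolerance for p in peak_mz)
        for f in fragment_mz
    )


def rank_candidates(candidates, peak_mz, tolerance=0.5):
    """Score every candidate and return (score, name) pairs, best first.

    `candidates` maps a peptide name to its list of theoretical fragment m/z.
    """
    scored = [(count_matches(frags, peak_mz, tolerance), name)
              for name, frags in candidates.items()]
    return sorted(scored, reverse=True)


# Hypothetical observed peaks and two hypothetical candidate peptides.
peaks = [147.1, 260.2, 405.2, 534.3, 762.4, 875.4]
candidates = {
    "candidate_A": [88.0, 145.1, 292.1, 405.2, 534.3, 147.1, 260.2, 762.4],
    "candidate_B": [72.0, 201.1, 348.2, 405.6, 520.0, 147.1, 300.3, 700.1],
}
ranking = rank_candidates(candidates, peaks)
print(ranking)
```

Note that the lower-ranked candidate still matches some peaks by chance, which is the reason a raw match count needs a significance estimate.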

Page 19: Statistical Significance for Peptide Identification by Tandem Mass Spectrometry

19

Moderate quality peptide identification: E-value < 10^-3

Page 20: Statistical Significance for Peptide Identification by Tandem Mass Spectrometry

20

Amino-Acid Molecular Weights

Amino-Acid          Residue MW    Amino-Acid          Residue MW
A  Alanine           71.03712     M  Methionine       131.04049
C  Cysteine         103.00919     N  Asparagine       114.04293
D  Aspartic acid    115.02695     P  Proline           97.05277
E  Glutamic acid    129.04260     Q  Glutamine        128.05858
F  Phenylalanine    147.06842     R  Arginine         156.10112
G  Glycine           57.02147     S  Serine            87.03203
H  Histidine        137.05891     T  Threonine        101.04768
I  Isoleucine       113.08407     V  Valine            99.06842
K  Lysine           128.09497     W  Tryptophan       186.07932
L  Leucine          113.08407     Y  Tyrosine         163.06333

Page 21: Statistical Significance for Peptide Identification by Tandem Mass Spectrometry

21

Peptide Identification

• Peptide fragmentation by CID is poorly understood

• MS/MS spectra represent incomplete information about amino-acid sequence
  • I/L, K/Q, GG/N, …

• Correct identifications don’t come with a certificate!
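The ambiguities listed above follow directly from the residue masses in the amino-acid table. A small sketch that prints the mass gaps (the helper name `mass_gap` is illustrative):

```python
# Residue masses (Da) copied from the amino-acid table in this talk.
MASS = {
    "I": 113.08407, "L": 113.08407,
    "K": 128.09497, "Q": 128.05858,
    "G": 57.02147, "N": 114.04293,
}


def mass_gap(a, b):
    """Absolute mass difference (Da) between two residue strings."""
    return abs(sum(MASS[x] for x in a) - sum(MASS[x] for x in b))


for a, b in [("I", "L"), ("K", "Q"), ("GG", "N")]:
    print(f"{a} vs {b}: {mass_gap(a, b):.5f} Da")
```

I and L have identical masses, K and Q differ by about 0.036 Da, and GG and N differ only at the last tabulated decimal, so a fragment match tolerance of a few tenths of a dalton cannot separate any of these pairs.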

Page 22: Statistical Significance for Peptide Identification by Tandem Mass Spectrometry

22

Peptide Identification

• High-throughput workflows demand we analyze all spectra, all the time.

• Spectra may not contain enough information to be interpreted correctly
  • …bad static on a cell phone

• Peptides may not match our assumptions
  • …it’s all Greek to me

• “Don’t know” is an acceptable answer!

Page 23: Statistical Significance for Peptide Identification by Tandem Mass Spectrometry

23

Peptide Identification

• Rank the best peptide identifications

• Is the top ranked peptide correct?


Page 26: Statistical Significance for Peptide Identification by Tandem Mass Spectrometry

26

Peptide Identification

• Incorrect peptide has best score
  • Correct peptide is missing?
  • Potential for incorrect conclusion
  • What score ensures no incorrect peptides?

• Correct peptide has weak score
  • Insufficient fragmentation, poor score
  • Potential for weakened conclusion
  • What score ensures we find all correct peptides?

Page 27: Statistical Significance for Peptide Identification by Tandem Mass Spectrometry

27

Statistical Significance

• Can’t prove particular identifications are right or wrong...
  • ...need to know fragmentation in advance!

• A minimal standard for identification scores...
  • ...better than guessing.
  • p-value, E-value, statistical significance

Page 28: Statistical Significance for Peptide Identification by Tandem Mass Spectrometry

28

Pin the tail on the donkey…

Page 29: Statistical Significance for Peptide Identification by Tandem Mass Spectrometry

29

Probability Concepts

Throwing darts
  • One at a time
  • Blindfolded

Uniform distribution?
Independent?
Identically distributed?

Pr [ Dart hits 20 ] = 0.05

Page 30: Statistical Significance for Peptide Identification by Tandem Mass Spectrometry

30

Probability Concepts

Throwing darts
  • One at a time
  • Blindfolded
  • Three darts

Pr [Hitting 20 3 times] = 0.05 * 0.05 * 0.05

Pr [Hit 20 at least twice] = 0.007125 + 0.000125

0 times 0.857375

1 times 0.135375

2 times 0.007125

3 times 0.000125
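The table above is the binomial distribution with n = 3 darts and hit probability p = 0.05. A minimal sketch that reproduces it (the function name `binom_pmf` is illustrative):

```python
from math import comb


def binom_pmf(k, n, p):
    """Probability of exactly k hits in n independent throws, each hitting with probability p."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)


p_hit = 0.05
three_darts = [binom_pmf(k, 3, p_hit) for k in range(4)]
at_least_twice = three_darts[2] + three_darts[3]

for k, prob in enumerate(three_darts):
    print(f"{k} times {prob:.6f}")
print(f"at least twice {at_least_twice:.6f}")
```

The same function with n = 100 gives the distribution for a hundred darts.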

Page 31: Statistical Significance for Peptide Identification by Tandem Mass Spectrometry

31

Probability Concepts

[Figure: bar chart of the probability of hitting 20 exactly 0, 1, 2 or 3 times with three darts: 0.857375, 0.135375, 0.007125, 0.000125; y-axis from 0 to 1]

Page 32: Statistical Significance for Peptide Identification by Tandem Mass Spectrometry

32

Probability Concepts

Throwing darts
  • One at a time
  • Blindfolded
  • 100 darts

Pr [Hitting 20 3 times] = 0.139575

Pr [Hit 20 at least twice] = 0.9629188

0 times 0.005920

1 times 0.031160

2 times 0.081181

3 times 0.139575

Page 33: Statistical Significance for Peptide Identification by Tandem Mass Spectrometry

33

Probability Concepts

[Figure: histogram of rbinom(10000, 100, 0.05); x-axis: number of hits, 0 to 15; y-axis: frequency, 0 to 1500]
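The histogram on this slide comes from R's `rbinom(10000, 100, 0.05)`: 10000 simulated players, each throwing 100 darts with a 0.05 chance of hitting 20. A rough Python stand-in is sketched below; the local `rbinom` helper imitates the R call and the seed is arbitrary.

```python
import random
from collections import Counter


def rbinom(n_draws, size, prob, rng):
    """Draw n_draws counts, each the number of successes in `size` Bernoulli(prob) trials."""
    return [sum(rng.random() < prob for _ in range(size)) for _ in range(n_draws)]


rng = random.Random(0)  # fixed seed so the run is repeatable
counts = rbinom(10000, 100, 0.05, rng)

# Text histogram: one row per observed number of 20s.
hist = Counter(counts)
for k in sorted(hist):
    print(f"{k:2d} {hist[k]:5d} {'#' * (hist[k] // 50)}")

mean_hits = sum(counts) / len(counts)
frac_at_least_two = sum(c >= 2 for c in counts) / len(counts)
print(f"mean hits {mean_hits:.2f}, fraction with >= 2 hits {frac_at_least_two:.3f}")
```

The simulated fraction with at least two hits should land close to the exact value 0.9629188 from the 100-dart slide.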

Page 34: Statistical Significance for Peptide Identification by Tandem Mass Spectrometry

34

Match Score

• Dartboard represents the mass range of the spectrum

• Peaks of a spectrum are “slices”
  • Width of slice corresponds to mass tolerance

• Darts represent
  • random masses
  • masses of fragments of a random peptide
  • masses of peptides of a random protein
  • masses of biomarkers from a random class

• How many darts do we get to throw?

Page 35: Statistical Significance for Peptide Identification by Tandem Mass Spectrometry

35

Match Score

[Figure: spectrum; x-axis m/z, 0 to 1000; y-axis % intensity, 0 to 100; the values 270, 330, 550, 580, 755 and 870 are marked]

• What is the probability that we match at least 5 peaks?

Page 36: Statistical Significance for Peptide Identification by Tandem Mass Spectrometry

36

Match Score

• Pr [ Match ≥ s peaks ] = upper tail of Binomial( p , n ) ≈ upper tail of Poisson( p n ), for small p and large n

p is the probability of a random mass / peak match,
n is the number of darts (fragments in our answer)
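A small sketch of the two tail probabilities, checked against the dart example (n = 100 darts, p = 0.05, at least 2 hits). For a real spectrum, p and n would come from the match tolerance and the number of fragments; the function names are illustrative.

```python
from math import comb, exp, factorial


def binom_tail(s, n, p):
    """Pr[at least s matches] when the match count is Binomial(n, p)."""
    return sum(comb(n, k) * p**k * (1 - p) ** (n - k) for k in range(s, n + 1))


def poisson_tail(s, lam):
    """Pr[at least s matches] when the match count is Poisson(lam)."""
    return 1.0 - sum(exp(-lam) * lam**k / factorial(k) for k in range(s))


n_darts, p_hit, s = 100, 0.05, 2
exact = binom_tail(s, n_darts, p_hit)
approx = poisson_tail(s, n_darts * p_hit)
print(f"binomial tail {exact:.6f}, Poisson tail {approx:.6f}")
```

Here the Poisson value is within about 0.004 of the binomial one; the approximation degrades when n is small or p is not small.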

Page 37: Statistical Significance for Peptide Identification by Tandem Mass Spectrometry

37

Match Score

Theoretical distribution
  • Used by OMSSA
  • Proposed, in various forms, by many.

• Probability of random mass / peak match
  • IID (independent, identically distributed)
  • Based on match tolerance

Page 38: Statistical Significance for Peptide Identification by Tandem Mass Spectrometry

38

Match Score

Theoretical distribution assumptions

• Each dart is independent
  • Peaks are not “related”

• Each dart is identically distributed
  • Chance of random mass / peak match is the same for all peaks

Page 39: Statistical Significance for Peptide Identification by Tandem Mass Spectrometry

39

Tournament Size

[Figure: four histograms of the number of 20s hit with 100 darts, for tournaments of 100, 1000, 10000 and 100000 people; x-axes from 0 to 12 or 0 to 15; y-axes from 0.00 to 0.15]

Page 40: Statistical Significance for Peptide Identification by Tandem Mass Spectrometry

40

Tournament Size

[Figure: upper tails of four histograms of the number of 20s hit with 100 darts, for tournaments of 100, 1000, 10000 and 100000 people; x-axes from 10 to 18; y-axes from 0 to 50]

Page 41: Statistical Significance for Peptide Identification by Tandem Mass Spectrometry

41

Number of Trials

• Tournament size == number of trials
  • Number of peptides tried
  • Related to sequence database size

• Probability that a random match score is ≥ s
  • 1 – Pr [ all match scores < s ]
  • 1 – Pr [ match score < s ]^Trials (*)
  • Assumes IID!

• Expect value
  • E = Trials * Pr [ match ≥ s ]
  • Corresponds to Bonferroni bound on (*)

Page 42: Statistical Significance for Peptide Identification by Tandem Mass Spectrometry

42

Better Dart Throwers

Page 43: Statistical Significance for Peptide Identification by Tandem Mass Spectrometry

43

Better Random Models

• Comparison with completely random model isn’t really fair

• Match scores for real spectra with real peptides obey rules

• Even incorrect peptides match with non-random structure!

Page 44: Statistical Significance for Peptide Identification by Tandem Mass Spectrometry

44

Better Random Models

• Want to generate random fragment masses (darts) that behave more like the real thing:
  • Some fragments are more likely than others
  • Some fragments depend on others

• Theoretical models can only incorporate this structure to a limited extent.

Page 45: Statistical Significance for Peptide Identification by Tandem Mass Spectrometry

45

Better Random Models

• Generate random peptides
  • Real looking fragment masses
  • No theoretical model!
  • Must use empirical distribution
  • Usually require they have the correct precursor mass

• Score function can model anything we like!

Page 46: Statistical Significance for Peptide Identification by Tandem Mass Spectrometry

46

Better Random Models

Fenyo & Beavis, Anal. Chem., 2003

Page 47: Statistical Significance for Peptide Identification by Tandem Mass Spectrometry

47

Better Random Models

Fenyo & Beavis, Anal. Chem., 2003

Page 48: Statistical Significance for Peptide Identification by Tandem Mass Spectrometry

48

Better Random Models

• Truly random peptides don’t look much like real peptides

• Just use peptides from the sequence database!

• Caveats:
  • Correct peptide (non-random) may be included
  • Peptides are not independent

• Reverse sequence avoids only the first problem
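A reversed-sequence database is easy to build. The sketch below uses a made-up protein, a hypothetical `REV_` naming prefix, and a simplified digest that cuts after every K or R; the digestion rule is an assumption, since the slides only say "enzymatic digest".

```python
def reverse_database(proteins):
    """Return decoy entries: every protein sequence reversed, with a prefixed name."""
    return {"REV_" + name: seq[::-1] for name, seq in proteins.items()}


def digest(seq):
    """Simplified digest: cut after every K or R (an assumed rule for illustration)."""
    peptides, start = [], 0
    for i, aa in enumerate(seq):
        if aa in "KR":
            peptides.append(seq[start:i + 1])
            start = i + 1
    if start < len(seq):
        peptides.append(seq[start:])
    return peptides


targets = {"protA": "MKSGFLEEDELKAVR"}  # hypothetical protein
decoys = reverse_database(targets)

target_peptides = set(digest(targets["protA"]))
decoy_peptides = set(digest(decoys["REV_protA"]))
print(sorted(target_peptides))
print(sorted(decoy_peptides))
print("shared:", sorted(target_peptides & decoy_peptides))
```

The correct peptide SGFLEEDELK no longer appears among the decoy peptides, but the decoys keep the same length and amino-acid composition as the originals, so they are still not independent draws.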

Page 49: Statistical Significance for Peptide Identification by Tandem Mass Spectrometry

49

Extrapolating from the Empirical Distribution

• Often, the empirical shape is consistent with a theoretical model

Geer et al., J. Proteome Research, 2004; Fenyo & Beavis, Anal. Chem., 2003

Page 50: Statistical Significance for Peptide Identification by Tandem Mass Spectrometry

50

False Positive Rate Estimation

• Each spectrum is a chance to be right, wrong, or inconclusive.
  • How many decisions are wrong?

• Given identification criteria:
  • SEQUEST Xcorr, E-value, Score, etc., plus...
  • ...threshold

• Use “decoy” sequences
  • random, reverse, cross-species
  • Identifications must be incorrect!

Page 51: Statistical Significance for Peptide Identification by Tandem Mass Spectrometry

51

False Positive Rate Estimation

• # FP in real search = # hits in decoy search
  • Need same size database, or rate conversion

• FP Rate: (# decoy hits) / (# real hits)

• FP Rate: (2 x # decoy hits) / (# real hits + # decoy hits)
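Both estimates are one-line ratios. In the sketch below, the first form is read as applying to a separate decoy search of the same size, and the second to counting real and decoy hits together above one threshold; that reading of the second formula is an assumption, and the hit counts are made up.

```python
def fp_rate_separate(decoy_hits, real_hits):
    """FP rate = # decoy hits / # real hits."""
    if real_hits == 0:
        raise ValueError("no real hits above threshold")
    return decoy_hits / real_hits


def fp_rate_combined(decoy_hits, real_hits):
    """FP rate = 2 x # decoy hits / (# real hits + # decoy hits)."""
    total = real_hits + decoy_hits
    if total == 0:
        raise ValueError("no hits above threshold")
    return 2 * decoy_hits / total


# Hypothetical counts of identifications passing the score threshold.
real_hits, decoy_hits = 1000, 25
rate_separate = fp_rate_separate(decoy_hits, real_hits)
rate_combined = fp_rate_combined(decoy_hits, real_hits)
print(f"separate: {rate_separate:.4f}  combined: {rate_combined:.4f}")
```

The factor of 2 in the second form counts each decoy hit as standing for one unseen false hit among the real sequences.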

Page 52: Statistical Significance for Peptide Identification by Tandem Mass Spectrometry

52

False Positive Rate Estimation

• A form of statistical significance
  • In “theory”, E-value and a FP rate are the same.
  • Search engine independent

• Easy to implement
  • Assumes a single threshold for all spectra
  • Spectrum/Peptide Identification scores are not iid!...
  • ...but E-values, in principle, are.

Page 53: Statistical Significance for Peptide Identification by Tandem Mass Spectrometry

53

Peptide Prophet

• From the Institute for Systems Biology
  • Keller et al., Anal. Chem. 2002

• Re-analysis of SEQUEST results

• Spectra are trials
  • Assumes that many of the spectra are not correctly identified

Page 54: Statistical Significance for Peptide Identification by Tandem Mass Spectrometry

54

Peptide Prophet

Distribution of spectral scores in the results

Keller et al., Anal. Chem. 2002

Page 55: Statistical Significance for Peptide Identification by Tandem Mass Spectrometry

55

Peptide Prophet

• Assumes a bimodal distribution of scores, with a particular shape

• Ignores database size
  • …but it is included implicitly

• Like empirical distribution for peptide sampling, can be applied to any score function
  • Can be applied to any search engine’s results
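The bimodal idea can be illustrated with a toy mixture fit. The sketch below is not PeptideProphet's actual model: it assumes both score populations are Gaussian, runs a basic EM fit on synthetic scores, and then reports the posterior probability that a score belongs to the higher ("correct") component. All names, the seed, and the synthetic data are illustrative.

```python
import math
import random


def normal_pdf(x, mu, sd):
    z = (x - mu) / sd
    return math.exp(-0.5 * z * z) / (sd * math.sqrt(2.0 * math.pi))


def fit_two_gaussians(scores, iterations=200):
    """Fit a two-component 1-D Gaussian mixture by EM (assumed shapes, toy model)."""
    ordered = sorted(scores)
    n = len(ordered)
    # Start the two means at the 10th and 90th percentiles.
    mu_lo, mu_hi = ordered[n // 10], ordered[(9 * n) // 10]
    sd_lo = sd_hi = max(1e-3, (ordered[-1] - ordered[0]) / 4.0)
    w_hi = 0.5
    for _ in range(iterations):
        # E-step: responsibility of the high component for each score.
        resp = []
        for x in scores:
            hi = w_hi * normal_pdf(x, mu_hi, sd_hi)
            lo = (1.0 - w_hi) * normal_pdf(x, mu_lo, sd_lo)
            resp.append(hi / (hi + lo))
        # M-step: re-estimate weight, means and standard deviations.
        n_hi = sum(resp)
        n_lo = n - n_hi
        mu_hi = sum(r * x for r, x in zip(resp, scores)) / n_hi
        mu_lo = sum((1.0 - r) * x for r, x in zip(resp, scores)) / n_lo
        var_hi = sum(r * (x - mu_hi) ** 2 for r, x in zip(resp, scores)) / n_hi
        var_lo = sum((1.0 - r) * (x - mu_lo) ** 2 for r, x in zip(resp, scores)) / n_lo
        sd_hi = max(1e-3, math.sqrt(var_hi))
        sd_lo = max(1e-3, math.sqrt(var_lo))
        w_hi = n_hi / n
    return {"w_hi": w_hi, "mu_lo": mu_lo, "sd_lo": sd_lo, "mu_hi": mu_hi, "sd_hi": sd_hi}


def prob_correct(x, m):
    """Posterior probability that score x comes from the high component."""
    hi = m["w_hi"] * normal_pdf(x, m["mu_hi"], m["sd_hi"])
    lo = (1.0 - m["w_hi"]) * normal_pdf(x, m["mu_lo"], m["sd_lo"])
    return hi / (hi + lo)


# Synthetic scores: 300 "incorrect" around 0 and 100 "correct" around 5.
rng = random.Random(1)
scores = [rng.gauss(0.0, 1.0) for _ in range(300)] + [rng.gauss(5.0, 1.0) for _ in range(100)]
model = fit_two_gaussians(scores)
print(model)
print(f"P(correct | score=6) = {prob_correct(6.0, model):.3f}")
```

Because the fit uses only the observed score distribution, database size never appears explicitly; it enters only through where the "incorrect" mode sits.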

Page 56: Statistical Significance for Peptide Identification by Tandem Mass Spectrometry

56

Peptide Prophet

• Caveats
  • Are spectra scores sampled from the same distribution?
  • Are there enough correct identifications for second peak?
  • Are spectra independent observations?
  • Are distributions appropriately shaped?

• Huge improvement over raw SEQUEST results

Page 57: Statistical Significance for Peptide Identification by Tandem Mass Spectrometry

57

Peptides to Proteins

Nesvizhskii et al., Anal. Chem. 2003

Page 58: Statistical Significance for Peptide Identification by Tandem Mass Spectrometry

58

Peptides to Proteins

Page 59: Statistical Significance for Peptide Identification by Tandem Mass Spectrometry

59

Peptides to Proteins

• A peptide sequence may occur in many different protein sequences
  • Variants, paralogues, protein families

• Separation, digestion and ionization are not well understood

• Proteins in sequence database are extremely non-random, and very dependent

Page 60: Statistical Significance for Peptide Identification by Tandem Mass Spectrometry

60

Publication Guidelines

Page 61: Statistical Significance for Peptide Identification by Tandem Mass Spectrometry

61

Publication Guidelines

1. Computational parameters
  • Spectral processing
  • Sequence database
  • Search program
  • Statistical analysis

2. Number of peptides per protein
  • Each peptide sequence counts once!
  • Multiple forms of the same peptide count once!
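The counting rule in item 2 amounts to counting distinct peptide sequences per protein, ignoring charge state and modification. A minimal sketch with made-up identifications (protein names, peptides, and the tuple layout are all hypothetical):

```python
from collections import defaultdict

# Hypothetical identifications: (protein, peptide sequence, charge, modification).
identifications = [
    ("P1", "SGFLEEDELK", 2, None),
    ("P1", "SGFLEEDELK", 3, None),           # same peptide, different charge
    ("P1", "AVMDDFAAFVEK", 2, "oxidation"),
    ("P1", "AVMDDFAAFVEK", 2, None),         # same peptide, unmodified form
    ("P2", "LVNELTEFAK", 2, None),
]


def distinct_peptides_per_protein(ids):
    """Count each peptide sequence once per protein, whatever its charge or modification."""
    sequences = defaultdict(set)
    for protein, sequence, _charge, _modification in ids:
        sequences[protein].add(sequence)
    return {protein: len(seqs) for protein, seqs in sequences.items()}


peptide_counts = distinct_peptides_per_protein(identifications)
single_peptide_proteins = sorted(p for p, c in peptide_counts.items() if c == 1)
print(peptide_counts)
print("single-peptide proteins:", single_peptide_proteins)
```

Proteins that end up in `single_peptide_proteins` are the ones the guidelines ask to be justified explicitly.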

Page 62: Statistical Significance for Peptide Identification by Tandem Mass Spectrometry

62

Publication Guidelines

3. Single-peptide proteins must be explicitly justified by
  • Peptide sequence
  • N and C terminal amino-acids
  • Precursor mass and charge
  • Peptide Scores
  • Multiple forms of the peptide counted once!

4. Biological conclusions based on single-peptide proteins must show the spectrum

Page 63: Statistical Significance for Peptide Identification by Tandem Mass Spectrometry

63

Publication Guidelines

5. More stringent requirements for PMF data analysis
  • Similar to that for tandem mass spectra

6. Management of protein redundancy
  • Peptides identified from a different species?

7. Spectra submission encouraged

Page 64: Statistical Significance for Peptide Identification by Tandem Mass Spectrometry

64

Summary

• Could guessing be as effective as a search?

• More guesses improves the best guess

• Better guessers help us be more discriminating

• Going from peptides to proteins is not as simple as it seems

• Publication guidelines reflect sound statistical principles.