mass spectrometry-based proteomics


Upload: jarrod-morrow

Post on 31-Dec-2015


Page 1: Mass Spectrometry-based Proteomics

1

Mass Spectrometry-based Proteomics

Xuehua Shen

(Adapted from slides with textbook)

Page 2: Mass Spectrometry-based Proteomics

2

Outline

• Motivation of proteomics

• Mass spectrometry-based proteomics

• Instrumentation of mass spectrometry

• De novo sequencing algorithm

• Database search

• Algorithms of real software (e.g., sequence tags)

Page 3: Mass Spectrometry-based Proteomics

3

Motivation

• Proteins are the working units of the cell

– The number of known genes is far smaller than the number of expressed protein forms

– Proteins are directly related to cellular processes and diseases

[Figure: DNA (~30,000 human genes) → mRNA (>100,000 RNA messages) → protein (>1,000,000 distinct protein forms); sources of diversity: SNPs, alternative splicing, post-translational modification]

Page 4: Mass Spectrometry-based Proteomics

4

Tools for Proteomics

• Edman degradation reaction

• NMR (Nuclear Magnetic Resonance)

• X-ray crystallography

• Protein array

• Mass Spectrometry

Page 5: Mass Spectrometry-based Proteomics

5

Mass Spectrometry-based Proteomics

• Primary sequence (sequencing, identification)

• Post-translational modification (PTM) (characterization)

• Quantitative proteomics (quantification)

• Protein-protein interaction

Page 6: Mass Spectrometry-based Proteomics

6

Page 7: Mass Spectrometry-based Proteomics

7

Components of Mass Spectrometer

• Ion source (ESI and MALDI)

• Mass analyzer (ion traps, TOF, Quadrupole, FT, etc.)

– Mass-to-charge ratio (m/z)

• Ion detector

Page 8: Mass Spectrometry-based Proteomics

8

Peptide and Intact Protein

• Peptide: a fragment of a protein

• Some enzymes, e.g. trypsin, cleave proteins into peptides.

• Some techniques introduce intact proteins into the mass spectrometer.

Page 9: Mass Spectrometry-based Proteomics

9

Peptide Fragmentation

• Peptides tend to fragment along the backbone.

• Fragments can also lose neutral chemical groups such as NH3 and H2O.

[Figure: protonated (H+) peptide backbone H-…-HN-CH(Ri-1)-CO-NH-CH(Ri)-CO-NH-CH(Ri+1)-CO-…-OH, running from the N-terminus to the C-terminus, fragmented by collision-induced dissociation]

Page 10: Mass Spectrometry-based Proteomics

10

Ideal Mass Spectrum

Page 11: Mass Spectrometry-based Proteomics

11

Real Mass Spectrum

Page 12: Mass Spectrometry-based Proteomics

12

N- and C-terminal Peptides

[Figure: N-terminal peptides and C-terminal peptides]

Page 13: Mass Spectrometry-based Proteomics

13

Terminal peptides and ion types

Peptide: mass (Da) 57 + 97 + 147 + 114 = 415

Peptide without H2O: mass (Da) 57 + 97 + 147 + 114 – 18 = 397
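The arithmetic on this slide can be reproduced with a minimal sketch. The residue letters G, P, F, N are an assumption (the slide gives only the nominal masses 57, 97, 147, 114), and `residue_sum` is a hypothetical helper name.

```python
# Nominal (integer) residue masses in Da. The letters are assumed to be the
# residues behind the slide's numbers; the slide itself shows only the masses.
RESIDUE_MASS = {"G": 57, "P": 97, "F": 147, "N": 114}
WATER = 18  # nominal mass of H2O


def residue_sum(peptide):
    """Sum of the nominal residue masses of a peptide string."""
    return sum(RESIDUE_MASS[aa] for aa in peptide)


total = residue_sum("GPFN")
print(total)          # 415
print(total - WATER)  # 397
```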

Page 14: Mass Spectrometry-based Proteomics

14

N- and C-terminal Peptides

[Figure: N-terminal peptides and C-terminal peptides labeled with their masses; masses shown: 57, 71, 154, 185, 301, 332, 415, 429, 486]


Page 17: Mass Spectrometry-based Proteomics

17

N- and C-terminal Peptides

[Figure: fragment masses 57, 71, 154, 185, 301, 332, 415, 429, 486]

Problem:

Reconstruct the peptide from the set of masses of its fragments.
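A small sketch of this problem on the slide's numbers, assuming nominal integer residue masses and that 486 is the parent mass. The function name `readings` is hypothetical. The code walks from mass 0 to the parent mass, taking one amino-acid-sized step at a time through the observed masses.

```python
# One representative letter per nominal residue mass (Da).
# 113 stands for L or I, and 128 for K or Q, which cannot be told apart here.
MASS_TO_RESIDUE = {
    57: "G", 71: "A", 87: "S", 97: "P", 99: "V", 101: "T", 103: "C",
    113: "L", 114: "N", 115: "D", 128: "K", 129: "E", 131: "M",
    137: "H", 147: "F", 156: "R", 163: "Y", 186: "W",
}


def readings(masses, parent_mass):
    """Enumerate every residue string that walks from 0 to parent_mass
    through the observed masses, one amino acid per step."""
    nodes = sorted(set(masses) | {0, parent_mass})
    results = []

    def extend(current, prefix):
        if current == parent_mass:
            results.append(prefix)
            return
        for nxt in nodes:
            residue = MASS_TO_RESIDUE.get(nxt - current)
            if residue is not None:
                extend(nxt, prefix + residue)

    extend(0, "")
    return sorted(results)


fragment_masses = [415, 486, 301, 154, 57, 71, 185, 332, 429]
print(readings(fragment_masses, 486))
```

Because the set mixes N-terminal and C-terminal fragment masses, more than one reading survives (for example GPFNA and its reverse ANFPG), which is why a scoring step is needed to choose among candidates.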

Page 18: Mass Spectrometry-based Proteomics

18

Mass Spectra

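[Figure: mass spectrum of the peptide G V D L K on a mass axis starting at 0; peak spacings of 57 Da = ‘G’ and 99 Da = ‘V’; the peptide is read as G V D L K in one direction and K L D V G in the other; an H2O loss is marked]

• The peaks in the mass spectrum:

– Prefix and suffix fragments

– Fragments with neutral losses (-H2O, -NH3)

– Noise and missing peaks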
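As an illustration, the prefix and suffix fragment masses of G V D L K can be listed with a short sketch. It assumes nominal integer residue masses and ignores ion-type offsets, neutral losses and noise; `prefix_masses` and `suffix_masses` are hypothetical helper names.

```python
# Nominal (integer) residue masses in Da for the residues of GVDLK.
RESIDUE_MASS = {"G": 57, "V": 99, "D": 115, "L": 113, "K": 128}


def prefix_masses(peptide):
    """Cumulative residue masses read from the N-terminus."""
    out, running = [], 0
    for aa in peptide:
        running += RESIDUE_MASS[aa]
        out.append(running)
    return out


def suffix_masses(peptide):
    """Cumulative residue masses read from the C-terminus."""
    return prefix_masses(peptide[::-1])


print(prefix_masses("GVDLK"))  # [57, 156, 271, 384, 512]
print(suffix_masses("GVDLK"))  # [128, 241, 356, 455, 512]
```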

Page 19: Mass Spectrometry-based Proteomics

19

Protein Identification with MS/MS

[Figure: MS/MS turns the peptide G V D L K into a spectrum (intensity vs. mass); peptide identification goes from the spectrum back to the peptide]

Page 20: Mass Spectrometry-based Proteomics

20

Protein Identification by Tandem Mass Spectrometry

[Figure: a sequence enters the MS/MS instrument, which produces a tandem mass spectrum (S# 1708, RT 54.47, full ms2 of 638.00, range 165.00-1925.00; relative abundance vs. m/z, with labeled peaks including 588.1, 687.3, 850.3, 949.4 and 1048.6). The spectrum is then interpreted either de novo (Sherenga) or by database search (Sequest).]

Page 21: Mass Spectrometry-based Proteomics

21

De Novo vs. Database Search

[Figure: the same MS/MS spectrum (relative abundance vs. m/z) and two routes from it to the peptide AVGELTK]

De Novo: the sequence is read directly from the spectrum.

Database Search: the spectrum is compared, by mass and score, against a database of known peptides:

MDERHILNM, KLQWVCSDL, PTYWASDL, ENQIKRSACVM, TLACHGGEM, NGALPQWRT, HLLERTKMNVV, GGPASSDA, GGLITGMQSD, MQPLMNWE, ALKIIMNVRT, AVGELTK, HEWAILF, GHNLWAMNAC, GVFGSVLRA, EKLNKAATYIN..

Page 22: Mass Spectrometry-based Proteomics

22

Pros and Cons of de novo Sequencing

• Advantages:

– Finds sequences that are not necessarily in the database.

– An additional similarity search step using these sequences may identify related proteins in the database.

• Disadvantages:

– Requires higher-quality data.

– Often contains errors.

Page 23: Mass Spectrometry-based Proteomics

23

Current Status

• Protein sequencing remains an open problem, whether by de novo sequencing or by database search

• The algorithms that follow deal only with simplified (ideal) spectra

• Some algorithms combine de novo sequencing and database search

Page 24: Mass Spectrometry-based Proteomics

24

Outline

• Motivation of proteomics

• Mass spectrometry-based proteomics

• Instrumentation of mass spectrometry

• De novo sequencing

• Database search

• Algorithms of real software (e.g., sequence tags)

Page 25: Mass Spectrometry-based Proteomics

25

De novo Peptide Sequencing

[Figure: MS/MS spectrum (relative abundance vs. m/z) → sequence]

Page 26: Mass Spectrometry-based Proteomics

26

Peptide Sequencing Problem

Goal: Find a peptide with maximal match between an experimental and theoretical spectrum.

Input:

– S: experimental spectrum

– Δ: set of possible ion types

– m: parent mass

Output:

– P: a peptide of mass m whose theoretical spectrum best matches the experimental spectrum S

Page 27: Mass Spectrometry-based Proteomics

27

Procedure of De Novo Sequencing

• Build spectrum graph

– How to create vertices (from masses)

– How to create edges (from mass differences)

• Find best path or rank paths of spectrum graph

– How to find candidate paths

– How to score paths

Page 28: Mass Spectrometry-based Proteomics

28

From Sequence to Spectrum

[Figure: b-ion peaks for the prefixes of the peptide S E Q U E N C E; axis: mass/charge (m/z)]

Page 29: Mass Spectrometry-based Proteomics

29

From Sequence to Spectrum (cont.)

[Figure: a-ion peaks for the prefixes of the peptide S E Q U E N C E; axis: mass/charge (m/z)]

Page 30: Mass Spectrometry-based Proteomics

30

From Sequence to Spectrum (cont.)

[Figure: peptide S E Q U E N C E; axis: mass/charge (m/z)]

The a-ions are the b-ions shifted by a fixed mass offset.

Page 31: Mass Spectrometry-based Proteomics

31

From Sequence to Spectrum (cont.)

[Figure: y-ion peaks for the suffixes of the peptide, read as E C N E U Q E S; axis: mass/charge (m/z)]

Page 32: Mass Spectrometry-based Proteomics

32

From Sequence to Spectrum (cont.)

[Figure: spectrum; axes: intensity vs. mass/charge (m/z)]


Page 34: Mass Spectrometry-based Proteomics

34

From Sequence to Spectrum (cont.)

[Figure: spectrum with noise peaks; axis: mass/charge (m/z)]

Page 35: Mass Spectrometry-based Proteomics

35

MS/MS Spectrum

[Figure: MS/MS spectrum; axes: intensity vs. mass/charge (m/z)]

Page 36: Mass Spectrometry-based Proteomics

36

Some Mass Differences between Peaks Correspond to Amino Acids

[Figure: MS/MS spectrum in which mass differences between peaks are labeled with the amino acids s, e, q, u, e, n, c, e]

Page 37: Mass Spectrometry-based Proteomics

37

Now decoding from spectrum to sequence…?

Build spectrum graph

Page 38: Mass Spectrometry-based Proteomics

38

Vertices of Spectrum Graph

• Vertices are generated by reverse shifts corresponding to ion types Δ = {δ1, δ2, …, δk}

• Every mass s in an MS/MS spectrum generates k vertices V(s) = {s+δ1, s+δ2, …, s+δk} corresponding to potential N-terminal peptides

• Vertices of the spectrum graph:

{initial vertex} ∪ V(s1) ∪ V(s2) ∪ ... ∪ V(sm) ∪ {terminal vertex}
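The vertex construction can be sketched as follows. The peak list and the shift values 0 and 18 are illustrative placeholders rather than calibrated ion-type offsets, and `spectrum_graph_vertices` is a hypothetical name.

```python
def spectrum_graph_vertices(peaks, shifts, parent_mass):
    """Return the sorted vertex set {0} ∪ V(s1) ∪ ... ∪ V(sm) ∪ {parent_mass},
    where V(s) = {s + delta for delta in shifts}."""
    vertices = {0, parent_mass}  # initial and terminal vertex
    for s in peaks:
        for delta in shifts:
            v = s + delta
            if 0 < v < parent_mass:  # keep only masses inside the peptide
                vertices.add(v)
    return sorted(vertices)


# Hypothetical input: three peaks, two reverse shifts (none, and +18 to undo
# a water loss). 136 + 18 and 154 + 0 collapse onto the same vertex.
print(spectrum_graph_vertices([57, 136, 154], [0, 18], 486))
# [0, 57, 75, 136, 154, 172, 486]
```

Using a set means that two peaks explained by different ion types of the same partial peptide collapse into a single vertex.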

Page 39: Mass Spectrometry-based Proteomics

39

Reverse Shifts

Shift in H2O+NH3

Shift in H2O

Page 40: Mass Spectrometry-based Proteomics

40

Edges of Spectrum Graph

• Two vertices with a mass difference corresponding to an amino acid A:

– Connect them with an edge labeled by A

• Gap edges for di- and tri-peptides

– Potential sequence tag method (covered later)
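A minimal sketch of the edge rule, assuming nominal integer masses and exact matching (a real implementation would allow a mass tolerance). Gap edges for di- and tri-peptides are left out, and `spectrum_graph_edges` is a hypothetical name.

```python
# One representative letter per nominal residue mass (Da).
# 113 stands for L or I, and 128 for K or Q.
MASS_TO_RESIDUE = {
    57: "G", 71: "A", 87: "S", 97: "P", 99: "V", 101: "T", 103: "C",
    113: "L", 114: "N", 115: "D", 128: "K", 129: "E", 131: "M",
    137: "H", 147: "F", 156: "R", 163: "Y", 186: "W",
}


def spectrum_graph_edges(vertices):
    """Connect u -> v with label A whenever v - u equals the mass of A."""
    ordered = sorted(vertices)
    edges = []
    for i, u in enumerate(ordered):
        for v in ordered[i + 1:]:
            residue = MASS_TO_RESIDUE.get(v - u)
            if residue is not None:
                edges.append((u, v, residue))
    return edges


for edge in spectrum_graph_edges([0, 57, 154, 301, 415, 486]):
    print(edge)
```

Since every edge goes from a smaller mass to a larger one, the result is a directed acyclic graph.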

Page 41: Mass Spectrometry-based Proteomics

41

Best Path of Spectrum Graph

• How to find candidate paths

• There are many paths, how to find the correct one?

• We need scoring to evaluate paths

Page 42: Mass Spectrometry-based Proteomics

42

Find Candidate Paths

• Heuristic: find a path with the maximum number of edges

• Longest path problem in a DAG

• DFS (Depth First Search)
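The longest-path heuristic can be sketched with dynamic programming instead of DFS. This works because every edge runs from a smaller mass to a larger one, so sorting the edges by source mass gives a valid processing order. The edge list and the name `longest_path` are hypothetical.

```python
def longest_path(edges, source, sink):
    """Return (edge_count, label_string) for a path from source to sink with
    the maximum number of edges, or None if the sink is unreachable.

    edges: list of (u, v, label) with u < v.
    """
    best = {source: (0, "")}
    # Sorting by u guarantees every edge into u is handled before edges out of u.
    for u, v, label in sorted(edges):
        if u not in best:
            continue
        candidate = (best[u][0] + 1, best[u][1] + label)
        if v not in best or candidate[0] > best[v][0]:
            best[v] = candidate
    return best.get(sink)


# G (57) + A (71) has the same nominal mass as K (128), so two paths reach 128.
edges = [(0, 57, "G"), (57, 128, "A"), (0, 128, "K"), (128, 185, "G")]
print(longest_path(edges, 0, 185))  # (3, 'GAG')
```

The example shows why edge count alone is only a heuristic: it prefers GAG over KG without looking at how well either reading explains the spectrum.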

Page 43: Mass Spectrometry-based Proteomics

43

Path Score

• p(P,S) = probability that peptide P produces spectrum S= {s1,s2,…sq}

• p(P, s) = the probability that peptide P generates a peak s

• Scoring = computing probabilities

Page 44: Mass Spectrometry-based Proteomics

44

Finding Optimal Paths in the Spectrum Graph

• For a given MS/MS spectrum S, find a peptide P’ maximizing p(P,S) over all possible peptides P:

$$p(P', S) = \max_{P} p(P, S)$$

• Peptides = paths in the spectrum graph

• P’ = the optimal path in the spectrum graph

• Some software ranks paths

Page 45: Mass Spectrometry-based Proteomics

45

Ions and Probabilities

• A peptide has all k peaks with probability

$$\prod_{i=1}^{k} q_i$$

• and no peaks with probability

$$\prod_{i=1}^{k} (1 - q_i)$$

• A peptide also produces ``random noise'' with uniform probability qR in any position.
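A quick numerical illustration of the two products. The probabilities in `q` are made-up values, not learned ion-type probabilities.

```python
import math

# Hypothetical ion-type probabilities q_1..q_4 (illustrative values only).
q = [0.7, 0.5, 0.3, 0.2]

p_all_peaks = math.prod(q)                      # every ion type is observed
p_no_peaks = math.prod(1 - qi for qi in q)      # no ion type is observed

print(round(p_all_peaks, 3))  # 0.021
print(round(p_no_peaks, 3))   # 0.084
```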

Page 46: Mass Spectrometry-based Proteomics

46

Ratio Test Scoring for Partial Peptides

• Incorporates premiums for observed ions and penalties for missing ions.

• Example: for k=4, assume that for a partial peptide P’ we only see ions δ1,δ2,δ4.

The score is calculated as:

$$\frac{q_1}{q_R} \cdot \frac{q_2}{q_R} \cdot \frac{1 - q_3}{1 - q_R} \cdot \frac{q_4}{q_R}$$
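The k=4 example can be evaluated with a short sketch. The probability values are assumptions chosen for illustration, and `ratio_score` is a hypothetical name.

```python
def ratio_score(q, q_r, observed):
    """Ratio-test score of one partial peptide.

    q        : ion-type probabilities q_1..q_k
    q_r      : probability of a random noise peak
    observed : set of 1-based ion-type indices whose peaks were seen
    """
    score = 1.0
    for i, qi in enumerate(q, start=1):
        if i in observed:
            score *= qi / q_r              # premium for an observed ion
        else:
            score *= (1 - qi) / (1 - q_r)  # penalty for a missing ion
    return score


# Ions 1, 2 and 4 are seen, ion 3 is missing (values are illustrative).
print(ratio_score([0.7, 0.5, 0.3, 0.2], 0.1, {1, 2, 4}))
```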

Page 47: Mass Spectrometry-based Proteomics

47

Why Not Sequence De Novo?

• De novo sequencing is still not very accurate!

• Fewer than 30% of the peptides sequenced were completely correct!

Algorithm                             Amino Acid Accuracy   Whole Peptide Accuracy
Lutefisk (Taylor and Johnson, 1997)   0.566                 0.189
SHERENGA (Dancik et al., 1999)        0.690                 0.289
Peaks (Ma et al., 2003)               0.673                 0.246
PepNovo (Frank and Pevzner, 2005)     0.727                 0.296

Page 48: Mass Spectrometry-based Proteomics

48

Thank you !

The End


Page 50: Mass Spectrometry-based Proteomics

50

De Novo vs. Database Search: A Paradox

• De novo algorithms are much faster, even though their search space is much larger!

• A database search scans all peptides in the search space to find the best one.

• De novo sequencing eliminates the need to scan all peptides by modeling the problem as a graph search.

Why not sequence de novo?

Page 51: Mass Spectrometry-based Proteomics

51

Outline

• Motivation of proteomics

• Mass spectrometry-based proteomics

• Instrumentation: Mass Spectrometry

• De novo sequencing algorithm

• Database search

• Algorithms of real software (e.g., sequence tags)

Page 52: Mass Spectrometry-based Proteomics

52

Peptide Identification Problem

Goal: Find a peptide from the database with maximal match between an experimental and theoretical spectrum.

Input:

– S: experimental spectrum

– database of peptides

– Δ: set of possible ion types

– m: parent mass

Output:

– A peptide of mass m from the database whose theoretical spectrum best matches the experimental spectrum S
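A toy version of this problem, under strong simplifying assumptions: nominal integer masses, a theoretical spectrum made only of prefix and suffix residue sums (no ion-type offsets), and a plain count of shared peaks as the match score. The function names and the three-peptide database are hypothetical.

```python
# Nominal (integer) residue masses in Da.
RESIDUE_MASS = {
    "G": 57, "A": 71, "S": 87, "P": 97, "V": 99, "T": 101, "C": 103,
    "L": 113, "I": 113, "N": 114, "D": 115, "Q": 128, "K": 128, "E": 129,
    "M": 131, "H": 137, "F": 147, "R": 156, "Y": 163, "W": 186,
}


def theoretical_spectrum(peptide):
    """Prefix and suffix residue-mass sums of all proper fragments."""
    masses = [RESIDUE_MASS[aa] for aa in peptide]
    total, running, spectrum = sum(masses), 0, set()
    for m in masses[:-1]:
        running += m
        spectrum.add(running)          # prefix fragment
        spectrum.add(total - running)  # complementary suffix fragment
    return spectrum


def database_search(spectrum, parent_mass, database):
    """Return (peptide, shared_peak_count) for the best-matching database
    peptide of the given parent mass, or None if no peptide has that mass."""
    observed, best = set(spectrum), None
    for peptide in database:
        if sum(RESIDUE_MASS[aa] for aa in peptide) != parent_mass:
            continue  # parent-mass filter
        score = len(theoretical_spectrum(peptide) & observed)
        if best is None or score > best[1]:
            best = (peptide, score)
    return best


spectrum = [57, 154, 301, 415, 71, 185, 332, 429]
print(database_search(spectrum, 486, ["GPFKG", "AVGELTK", "GPFNA"]))
# ('GPFNA', 8)
```

Unlike de novo sequencing, the loop visits every candidate in the database, so the running time grows with the database size.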

Page 53: Mass Spectrometry-based Proteomics

53

MS/MS Database Search

Database search in mass spectrometry has been very successful in identifying already known proteins.

An experimental spectrum can be compared with the theoretical spectra of database peptides to find the best fit.

SEQUEST (Yates et al., 1995)

But reliable identification of modified peptides is a much more difficult problem.

Page 54: Mass Spectrometry-based Proteomics

54

Post-Translational Modifications

Proteins are involved in cellular signaling and metabolic regulation.

They are subject to a large number of biological modifications.

Almost all protein sequences are post-translationally modified and 200 types of modifications of amino acid residues are known.

Page 55: Mass Spectrometry-based Proteomics

55

Examples of Post-Translational Modification

Post-translational modifications increase the number of “letters” in amino acid alphabet and lead to a combinatorial explosion in both database search and de novo approaches.

Page 56: Mass Spectrometry-based Proteomics

56

Search for Modified Peptides: Virtual Database Approach

Yates et al., 1995: an exhaustive search in a virtual database of all modified peptides.

Exhaustive search leads to a large combinatorial problem, even for a small set of modification types.

Problem (Yates et al., 1995): extend the virtual database approach to a large set of modifications.

Page 57: Mass Spectrometry-based Proteomics

57

Exhaustive Search for Modified Peptides

• YFDSTDYNMAK

• 2^5 = 32 possibilities, with 2 types of modifications (phosphorylation? oxidation?)

• For each peptide, generate all modifications.

• Score each modification.
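The count of 32 can be checked by enumeration. This sketch assumes that phosphorylation may occur on S, T and Y (nominal shift +80 Da) and oxidation on M (nominal shift +16 Da), which gives five candidate sites in YFDSTDYNMAK. `modified_variants` is a hypothetical name.

```python
from itertools import product

# Assumed modifiable residues and nominal mass shifts in Da:
# phosphorylation on S/T/Y (+80), oxidation on M (+16).
MOD_SHIFT = {"S": 80, "T": 80, "Y": 80, "M": 16}


def modified_variants(peptide):
    """Yield one dict {position: mass_shift} per modified form of the peptide,
    including the unmodified form (empty dict)."""
    sites = [i for i, aa in enumerate(peptide) if aa in MOD_SHIFT]
    for choice in product([False, True], repeat=len(sites)):
        yield {i: MOD_SHIFT[peptide[i]] for i, on in zip(sites, choice) if on}


variants = list(modified_variants("YFDSTDYNMAK"))
print(len(variants))  # 32
```

Each additional modifiable site doubles the number of variants to score, which is the combinatorial explosion the slide refers to.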

Page 58: Mass Spectrometry-based Proteomics

58

Modified Peptide Identification Problem

Goal: Find a modified peptide from the database with maximal match between an experimental and theoretical spectrum.

Input:

– S: experimental spectrum

– database of peptides

– Δ: set of possible ion types

– m: parent mass

– Parameter k (# of mutations/modifications)

Output:

– A peptide of mass m that is at most k mutations/modifications apart from a database peptide and whose theoretical spectrum best matches the experimental spectrum S

Page 59: Mass Spectrometry-based Proteomics

59

Peptide Identification Problem: Challenge

Very similar peptides may have very different spectra!

Goal: Define a notion of spectral similarity that correlates well with the sequence similarity.

If peptides are a few mutations/modifications apart, the spectral similarity between their spectra should be high.

Page 60: Mass Spectrometry-based Proteomics

60

Spectrum Alignment

• See 8.14 and 8.15 in the textbook for one algorithm

• Complicated for real spectra

Page 61: Mass Spectrometry-based Proteomics

61

Quality Measure of Mass Spectrometer

• Sensitivity

• Mass accuracy

• Resolution

• Dynamic range

Page 62: Mass Spectrometry-based Proteomics

62

Ion Types

• Some masses correspond to fragment ions, others are just random noise

• Knowing ion types Δ = {δ1, δ2, …, δk} lets us distinguish fragment ions from noise

• We can learn ion types δi and their probabilities qi by analyzing a large test sample of annotated spectra.

Page 63: Mass Spectrometry-based Proteomics

65

Database Search: Sequence Analysis vs. MS/MS Analysis

Sequence analysis:

similar peptides (that are a few mutations apart) have similar sequences

MS/MS analysis:

similar peptides (that are a few mutations apart) have dissimilar spectra

Page 64: Mass Spectrometry-based Proteomics

66

Deficiency of the Shared Peaks Count

Shared peaks count (SPC): intuitive measure of spectral similarity.

Problem: SPC diminishes very quickly as the number of mutations increases.

Only a small portion of correlations between the spectra of mutated peptides is captured by SPC.
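A small sketch of the shared peaks count on idealized spectra. It assumes nominal integer masses and a theoretical spectrum made only of prefix and suffix residue sums. The peptides GPFNA and GVFNA (one substitution apart) are illustrative, and the function names are hypothetical.

```python
# Nominal (integer) residue masses in Da for the residues used below.
RESIDUE_MASS = {"G": 57, "A": 71, "P": 97, "V": 99, "N": 114, "F": 147}


def theoretical_spectrum(peptide):
    """Prefix and suffix residue-mass sums of all proper fragments."""
    masses = [RESIDUE_MASS[aa] for aa in peptide]
    total, running, spectrum = sum(masses), 0, set()
    for m in masses[:-1]:
        running += m
        spectrum.add(running)
        spectrum.add(total - running)
    return spectrum


def shared_peaks_count(spectrum_a, spectrum_b):
    """Number of masses present in both spectra."""
    return len(set(spectrum_a) & set(spectrum_b))


original = theoretical_spectrum("GPFNA")
mutated = theoretical_spectrum("GVFNA")  # single substitution P -> V

print(shared_peaks_count(original, original))  # 8
print(shared_peaks_count(original, mutated))   # 4
```

A single substitution shifts every fragment that contains the changed residue, so half of the shared peaks are already lost after one mutation in this example.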

Page 65: Mass Spectrometry-based Proteomics

67

Ions and Probabilities

• Tandem mass spectrometry is characterized by a set of ion types {δ1,δ2,..,δk} and their probabilities {q1,...,qk}

• δi-ions of a partial peptide are produced independently with probabilities qi

Page 66: Mass Spectrometry-based Proteomics

68

De Novo vs. Database Search:

• The database of all peptides is huge ≈ O(20^n).

• The database of all known peptides is much smaller ≈ O(10^8).

• However, de novo algorithms can be much faster, even though their search space is much larger!

• A database search scans all peptides in the database of known peptides to find the best one.

• De novo sequencing eliminates the need to scan the database of all peptides by modeling the problem as a graph search.

Page 67: Mass Spectrometry-based Proteomics

69

Probabilistic Model

• For a position t that represents ion type δj of Ti, the probability p(t, P, S) that peptide P produces a peak at position t is:

$$p(t, P, S) = \begin{cases} q_j & \text{if a peak is generated at position } t \\ 1 - q_j & \text{otherwise} \end{cases}$$

• Similarly, the probability that P produces a random noise peak at t is:

$$p_R(t) = \begin{cases} q_R & \text{if a peak is generated at position } t \\ 1 - q_R & \text{otherwise} \end{cases}$$

Page 68: Mass Spectrometry-based Proteomics

70

Probabilistic Score

• For a peptide P with n amino acids, the score for the whole peptide is expressed by the following ratio test:

$$\frac{p(P, S)}{p_R(S)} = \prod_{i=1}^{n} \prod_{j=1}^{k} \frac{p(t_{i\delta_j}, P, S)}{p_R(t_{i\delta_j})}$$
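The double product is usually easier to handle as a sum of logarithms. The sketch below assumes the observations are given as one row of booleans per partial peptide i, with one entry per ion type j. The probability values and the name `log_ratio_score` are illustrative assumptions.

```python
import math


def log_ratio_score(observed, q, q_r):
    """Log of the ratio-test score for a whole peptide.

    observed : list of rows, one per partial peptide i; observed[i][j] is True
               if a peak was seen at the position of ion type j
    q        : ion-type probabilities q_1..q_k
    q_r      : probability of a random noise peak
    """
    total = 0.0
    for row in observed:
        for seen, qj in zip(row, q):
            if seen:
                total += math.log(qj / q_r)
            else:
                total += math.log((1 - qj) / (1 - q_r))
    return total


# Two partial peptides, four ion types (all values are illustrative).
observed = [[True, True, False, True],
            [True, False, False, False]]
print(log_ratio_score(observed, [0.7, 0.5, 0.3, 0.2], 0.1))
```

Working in log space avoids numerical underflow when n and k are large, and it turns the score into a sum of per-position terms.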

Page 69: Mass Spectrometry-based Proteomics

71

Peak Score

• For a position t that represents ion type δj:

$$p(P, s_t) = \begin{cases} q_j & \text{if a peak is generated at } t \\ 1 - q_j & \text{otherwise} \end{cases}$$

Page 70: Mass Spectrometry-based Proteomics

72

Peak Score (cont.)

• For a position t that is not associated with an ion type:

$$p_R(P, s_t) = \begin{cases} q_R & \text{if a peak is generated at } t \\ 1 - q_R & \text{otherwise} \end{cases}$$

• qR = the probability of a noisy peak that does not correspond to any ion type