proteomics via mass spectrometry (a bioinformatics perspective)

148
Wi’07 Bafna Proteomics via Mass Spectrometry (a bioinformatics perspective) Vineet Bafna www.cse.ucsd.edu/~vbafna

Upload: xantha-buck

Post on 31-Dec-2015

51 views

Category:

Documents


0 download

DESCRIPTION

Proteomics via Mass Spectrometry (a bioinformatics perspective). Vineet Bafna www.cse.ucsd.edu/~vbafna. Nobel Citation 2002. Nobel Citation, 2002. Proteomics via MS. Enzymatic Digestion (Trypsin) +. Fractionation. Q: Sufficient to identify peptides?. Peptide MS. - PowerPoint PPT Presentation

TRANSCRIPT

Page 1: Proteomics via Mass Spectrometry  (a bioinformatics perspective)

Wi’07 Bafna

Proteomics via Mass Spectrometry

(a bioinformatics perspective)

Vineet Bafnawww.cse.ucsd.edu/~vbafna

Page 2: Proteomics via Mass Spectrometry  (a bioinformatics perspective)

Wi’07 Bafna

Nobel Citation 2002

Page 3: Proteomics via Mass Spectrometry  (a bioinformatics perspective)

Wi’07 Bafna

Nobel Citation, 2002

Page 4: Proteomics via Mass Spectrometry  (a bioinformatics perspective)

Wi’07 Bafna

Proteomics via MS

Enzymatic Digestion (Trypsin)

+Fractionation

Q: Sufficient to identify peptides?

Page 5: Proteomics via Mass Spectrometry  (a bioinformatics perspective)

Wi’07 Bafna

Peptide MS

• Instrument software usually detects peaks, and computes features (peak, area, m/z…)

m/z

Page 6: Proteomics via Mass Spectrometry  (a bioinformatics perspective)

Wi’07 Bafna

Single Stage MS

MassSpectrometry

Page 7: Proteomics via Mass Spectrometry  (a bioinformatics perspective)

Wi’07 Bafna

MS versus Micro-array

cDNA

sample

Protein/Peptide?

sample

• Unlike micro-array, peptide id is not trivial at the end of the MS experiment!

• Identification is an important part of pre-processing

Page 8: Proteomics via Mass Spectrometry  (a bioinformatics perspective)

Wi’07 Bafna

MS based proteomics

• Identification– Identify all the proteins in the proteome, specific

organelles, specific pathways, complexes…

• Quantitation– Is a protein differentially-expressed in certain

conditions?

• Others– Protein 3D structure, protein protein interactions,…

We will consider an informatics-centered perspective

Page 9: Proteomics via Mass Spectrometry  (a bioinformatics perspective)

Wi’07 Bafna

Protein Identification

• The preferred mode is through tandem mass spectrometry of peptides.

• Is identifying peptides sufficient? • Rough probability for co-occurrence of a 15-aa

peptide?

With higher accuracy instruments, it may be possible to do intact proteins as well.

20−15 ⋅(50 ⋅106) ≅10−11

Page 10: Proteomics via Mass Spectrometry  (a bioinformatics perspective)

Wi’07 Bafna

Tandem MS of peptides

Secondary Fragmentation

Ionized parent peptide

Page 11: Proteomics via Mass Spectrometry  (a bioinformatics perspective)

Wi’07 Bafna

The peptide backbone

H...-HN-CH-CO-NH-CH-CO-NH-CH-CO-…OH

Ri-1 Ri Ri+1

AA residuei-1 AA residuei AA residuei+1

N-terminus C-terminus

The peptide backbone breaks to formfragments with characteristic masses.

Page 12: Proteomics via Mass Spectrometry  (a bioinformatics perspective)

Wi’07 Bafna

Ionization

The peptide backbone breaks to formfragments with characteristic masses.

Ionized parent peptide

H...-HN-CH-CO-NH-CH-CO-NH-CH-CO-…OH

Ri-1 Ri Ri+1

AA residuei-1 AA residuei AA residuei+1

N-terminus C-terminus

H+

Page 13: Proteomics via Mass Spectrometry  (a bioinformatics perspective)

Wi’07 Bafna

Fragment ion generation

H...-HN-CH-CO NH-CH-CO-NH-CH-CO-…OH

Ri-1Ri Ri+1

AA residuei-1 AA residuei AA residuei+1

N-terminus C-terminus

The peptide backbone breaks to formfragments with characteristic masses.

Ionized peptide fragment

H+

Page 14: Proteomics via Mass Spectrometry  (a bioinformatics perspective)

Wi’07 Bafna

Tandem MS for Peptide ID

147K

1166L

260

1020E

389

907D

504

778E

633

663E

762

534L

875

405F

1022

292G

1080

145S

1166

88

y ions

b ions

100

0250 500 750 1000

[M+2H]2+

m/z

% I

nte

nsit

y

Page 15: Proteomics via Mass Spectrometry  (a bioinformatics perspective)

Wi’07 Bafna

Peak Assignment

147K

1166L

260

1020E

389

907D

504

778E

633

663E

762

534L

875

405F

1022

292G

1080

145S

1166

88

y ions

b ions

100

0250 500 750 1000

y2 y3 y4

y5

y6

y7

b3b4 b5 b8 b9

[M+2H]2+

b6 b7 y9

y8

m/z

% I

nte

nsit

y Peak assignment impliesSequence (Residue tag) Reconstruction!

Page 16: Proteomics via Mass Spectrometry  (a bioinformatics perspective)

Wi’07 Bafna

Ion types, and offsets

• P = prefix residue mass• S = Suffix residue mass• b-ions = P+1

– (NH2-CHR-CO-..-NH-CHR-CO(+))• y-ions = S+19

– (NH3(+)-CHR-CO-..NH-CHR-COOH)• a-ions = P-27, and so on..

H...-HN-CH-CO-NH-CH-CO-NH-CH-CO-…OH

Ri-1 Ri Ri+1

AA residuei-1 AA residuei AA residuei+1

N-terminus C-terminus

H+

Page 17: Proteomics via Mass Spectrometry  (a bioinformatics perspective)

Wi’07 Bafna

MS Quiz:

• Why aren’t all tandem MS peaks of the same intensity?

• Do the intensities for a peptide vary from spectrum to spectrum?

Page 18: Proteomics via Mass Spectrometry  (a bioinformatics perspective)

Wi’07 Bafna

Database Searching for peptide ID• For every peptide from a

database– Reject if it has the wrong

mass, else:– Generate a hypothetical

spectrum– Compute a correlation

between observed and experimental spectra

– Choose the best• Database searching is

very powerful and is the de facto standard for MS.

– Sequest, Mascot, Inspect, and many others

…SARLSQETFSDLWKLLPENNVLSPLP….

Page 19: Proteomics via Mass Spectrometry  (a bioinformatics perspective)

Wi’07 Bafna

So what’s new?

• The Id picture is very simplistic. Only 20-30% of spectra are conclusively identified.

• Many reasons:– Spectra are noisy.– Databases are incomplete. Sometimes, we need to

do a de novo interpretation– Post-translational modifications.– Instrument performance is critical.

• The algorithms for identification must be sensitive to these issues.

• We present a systematic look at identification software.

Page 20: Proteomics via Mass Spectrometry  (a bioinformatics perspective)

Wi’07 Bafna

Modules for Peptide Id

• Interpretation (D)• Input Spectrum• Output: all that can be

extracted from the spectrum (peptides/tags/parent mass/charge)

• Indexing/Filtering• Input: Db (set of peptides)• Output: pre-processing of the

database, peptide subset.• Scoring

• Input; peptide set, spectrum• Output: ranked list of scores

• Validation• Significance of the top hit.

I/F

S VD

Db

Page 21: Proteomics via Mass Spectrometry  (a bioinformatics perspective)

Wi’07 Bafna

De novo interpretation of mass spectra

• The so called de novo algorithms focus exclusively on the D module.

• There is no database (I/F).• Limited scoring and validation• Important when no database exists!

– Also important for db search

I/F

S VD

Page 22: Proteomics via Mass Spectrometry  (a bioinformatics perspective)

Wi’07 Bafna

De Novo Interpretation: Example

S G E K0 88 145 274 402 b-ions

420 333 276 147 0 y-ions

b

y

y

2

100 500400300200

M/Z

b

1

1

2

Page 23: Proteomics via Mass Spectrometry  (a bioinformatics perspective)

Wi’07 Bafna

The simplest case

• Suppose only (and all) the prefix ions were visible. Would identification be easy?

• We have two problems:– There is a mix of b and y ions. Separating them is critical!– Other ions besides b,y, including neutral losses, noise and

so on. We need to account for them.

S G E K0 88 145 274 402 b-ions

420 333 276 147 0 y-ions

88145

402

274

SG

EK

Page 24: Proteomics via Mass Spectrometry  (a bioinformatics perspective)

Wi’07 Bafna

• Separating b-, and y-ions is solved using a combinatorial formulation (forbidden pairs)

• Separating b,y from all others is solved using a statistical approach.

• Together, they form the basis for a de novo sequencer.

Page 25: Proteomics via Mass Spectrometry  (a bioinformatics perspective)

Wi’07 Bafna

De Novo Interpretation: Example

S G E K0 88 145 274 402 b-ions

420 333 276 147 0 y-ions

b

y

y

2

100 500400300200

M/Z

b

1

1

2

Ion Offsetsb=P+1y=S+19=M-P+19

Page 26: Proteomics via Mass Spectrometry  (a bioinformatics perspective)

Wi’07 Bafna

Computing possible prefixes

• We know the parent mass M=401.• Consider a mass value 88• Assume that it is a b-ion, or a y-ion• If b-ion, it corresponds to a prefix of the peptide with residue

mass 88-1 = 87.• If y-ion, y=M-P+19.

– Therefore the prefix has mass • P=M-y+19= 401-88+19=332

• Compute all possible Prefix Residue Masses (PRM) for all ions.

Page 27: Proteomics via Mass Spectrometry  (a bioinformatics perspective)

Wi’07 Bafna

Putative Prefix Masses

Prefix Mass

M=401 b y88 87 332145 144 275147 146 273276 275 144

S G E K0 87 144 273 401

• Only a subset of the prefix masses are correct.

• The correct mass values form a ladder of amino-acid residues

Page 28: Proteomics via Mass Spectrometry  (a bioinformatics perspective)

Wi’07 Bafna

Spectral Graph

• Each prefix residue mass (PRM) corresponds to a node.

• Two nodes are connected by an edge if the mass difference is a residue mass.87 144G

Page 29: Proteomics via Mass Spectrometry  (a bioinformatics perspective)

Wi’07 Bafna

Spectral Graph

• Each peak, when assigned to a prefix/suffix ion type generates a unique prefix residue mass.

• Spectral graph: – Each node u defines a putative prefix residue M(u).– (u,v) in E if M(v)-M(u) is the residue mass of an a.a. (tag)

or 0.– Paths in the spectral graph correspond to a interpretation

300100

401

200

0

S G E K

27387 146144 275 332

Page 30: Proteomics via Mass Spectrometry  (a bioinformatics perspective)

Wi’07 Bafna

Re-defining de novo interpretation

• Find a subset of nodes in spectral graph s.t.– 0, M are included– Each peak contributes at most one node (interpretation)(*)– Each adjacent pair (when sorted by mass) is connected by an edge

(valid residue mass)– An appropriate objective function (ex: the number of peaks

interpreted) is maximized

300100

401

200

0

S G E K

27387 146144 275 332

87 144G

Page 31: Proteomics via Mass Spectrometry  (a bioinformatics perspective)

Wi’07 Bafna

Two problems

• Too many nodes.– A. Only a small fraction correspond to b/y ions (leading

to true PRMs).– B. Even if the b/y ions were correctly predicted, each

peak generates multiple possibilities, only one of which is correct. We need to find a path that uses each peak only once (algorithmic problem).

– In general, the forbidden pairs problem is NP-hard

300100

401

200

0

S G E K

27387 146144 275 332

Page 32: Proteomics via Mass Spectrometry  (a bioinformatics perspective)

Wi’07 Bafna

However,..

• The b,y ions have a special non-interleaving property

• Consider pairs (b1,y1), (b2,y2)– Note that b1+y1 = b2+y2

– If (b1 < b2), then y1 > y2

Page 33: Proteomics via Mass Spectrometry  (a bioinformatics perspective)

Wi’07 Bafna

Non-Intersecting Forbidden pairs

300100 4002000

S G E K• If we consider only b,y ions, ‘forbidden’ node pairs are non-

intersecting, • The de novo problem can be solved efficiently using a

dynamic programming technique.

87 332

Page 34: Proteomics via Mass Spectrometry  (a bioinformatics perspective)

Wi’07 Bafna

The forbidden pairs method

• There may be many paths that avoid forbidden pairs.

• We choose a path that maximizes an objective function, – EX: the number of peaks interpreted – Here we assume a function , which gives a

score to a PRM. The score captures the likelihood that the PRM is correct.

Page 35: Proteomics via Mass Spectrometry  (a bioinformatics perspective)

Wi’07 Bafna

The forbidden pairs method

• Sort the PRMs according to increasing mass values.

• For each node u, f(u) represents the forbidden pair

• Let m(u) denote the mass value of the PRM.

300100 4002000 87 332

u f(u)

Page 36: Proteomics via Mass Spectrometry  (a bioinformatics perspective)

Wi’07 Bafna

D.P. for forbidden pairs

• Consider all pairs u,v– m[u] <= M/2, m[v] >M/2

• Define S(u,v) as the best score of a forbidden pair path from 0->u, v->M

• Is it sufficient to compute S(u,v) for all u,v?

300100 4002000 87 332

u v

Page 37: Proteomics via Mass Spectrometry  (a bioinformatics perspective)

Wi’07 Bafna

D.P. for forbidden pairs

• Note that the best interpretation is given by

max((u,v )∈E ) S(u,v)

300100 4002000 87 332

u v

Page 38: Proteomics via Mass Spectrometry  (a bioinformatics perspective)

Wi’07 Bafna

D.P. for forbidden pairs

• Denote the forbidden pair of node v by f(v). – What is f(f(v))?

• Note that we have one of two cases.1. Either u < f(v) (and f(u) > v)2. Or, u > f(v) (and f(u) < v)

• Case 1.– Extend v, do not touch f(u)

300100 4002000u f(u)

v

S(u,v) = max w:(v,w )∈E

w≠ f (u)

⎝ ⎜

⎠ ⎟S(u,w) + δ(v)

w

Page 39: Proteomics via Mass Spectrometry  (a bioinformatics perspective)

Wi’07 Bafna

The complete algorithm

for all u /*increasing mass values from 0 to M/2 */for all v /*decreasing mass values from M to M/2 */

if (u > f[v])

else if (u < f[v])

If (u,v)E /*maxI is the score of the best interpretation*/

maxI = max {maxI,S[u,v]}

S[u,v] = max (w,u)∈E

w≠ f (v )

⎝ ⎜

⎠ ⎟S[w,v] + δ(u)

S[u,v] = max (v,w )∈E

w≠ f (u)

⎝ ⎜

⎠ ⎟S[u,w] + δ(v)

Page 40: Proteomics via Mass Spectrometry  (a bioinformatics perspective)

Wi’07 Bafna

De Novo: Second issue

• Given only b,y ions, a forbidden pairs path will solve the problem.

• However, recall that there are MANY other ion types.

– Typical length of peptide: 15– Typical # peaks? 50-150?– #b/y ions?– Most ions are “Other”

• a ions, neutral losses, isotopic peaks….

Page 41: Proteomics via Mass Spectrometry  (a bioinformatics perspective)

Wi’07 Bafna

De novo: Weighting nodes in Spectrum Graph

• Factors determining if the ion is b or y– Intensity– Support ions

• b- and y-ions are the most likely ions to lose water/ammonia– Isotopic peaks

Page 42: Proteomics via Mass Spectrometry  (a bioinformatics perspective)

Wi’07 Bafna

Offset frequency function

• b, and y-ions show offsets due to neutral losses

Page 43: Proteomics via Mass Spectrometry  (a bioinformatics perspective)

Wi’07 Bafna

De novo: Weighting nodes

• A probabilistic network to model support ions (Pepnovo)

Page 44: Proteomics via Mass Spectrometry  (a bioinformatics perspective)

Wi’07 Bafna

De Novo Interpretation Summary

• The main challenge is to separate b/y ions from everything else (weighting nodes), and separating the prefix ions from the suffix ions (Forbidden Pairs).

• As always, the abstract idea must be supplemented with many details.

– Noise peaks, incomplete fragmentation– A PRM is first scored on its likelihood of being correct, and

the forbidden pair method is applied subsequently.

Page 45: Proteomics via Mass Spectrometry  (a bioinformatics perspective)

Wi’07 Bafna

Db search versus de novo interpretation

Filter ValidationScore

De novo

Db55M peptides

1. Traditional db search simply have the scoring module.

2. De novo is useful when the peptide is not in the database, but not as accurate.

3. It can be thought of as a database search over a much larger database.

4. PT modifications change the picture .

Page 46: Proteomics via Mass Spectrometry  (a bioinformatics perspective)

Wi’07 Bafna

Filtering

Filter ValidationScoreextension

De novo

Db55M peptides

CandidatePeptides (700)

1. Db indexing/filtering is a key mechanism for reducing the search space

Page 47: Proteomics via Mass Spectrometry  (a bioinformatics perspective)

Wi’07 Bafna

Filtering

• Define a filter as a computational tool that rapidly screens a database, removing much of it but retaining the true peptide.

• Can you suggest commonly used filters?

1. Parent mass2. Trypsin digested peptides

Page 48: Proteomics via Mass Spectrometry  (a bioinformatics perspective)

Wi’07 Bafna

Parent Mass filter

• Sort all peptides in the database by their parent mass.

• Search only the peptides that are within some mass tolerance.

• The filter does not work when you have modifications.

Page 49: Proteomics via Mass Spectrometry  (a bioinformatics perspective)

Wi’07 Bafna

The dynamic nature of the proteome

• The proteome of the cell is changing

• Various extra-cellular, and other signals activate pathways of proteins.

• A key mechanism of protein activation is PT modification

• These pathways may lead to other genes being switched on or off

• Mass Spectrometry is key to probing the proteome

Page 50: Proteomics via Mass Spectrometry  (a bioinformatics perspective)

Wi’07 Bafna

Db search for putatively modified peptides.

• Ex:YFDSTDYNMAK

• 25=32 possibilities, with 2 types of modifications!

• In contrast, de novo search space does not change significantly.

Phosphorylation?

oxidation

For each peptide, generate all mods.Score each modificationIs parent mass still a good filter?

Page 51: Proteomics via Mass Spectrometry  (a bioinformatics perspective)

Wi’07 Bafna

Enzymatic digestion rules as a filter

• Consider only tryptic peptides– Trypsin cleaves after R,K (not if RP, or KP)

• Tryptic peptide filters may not be very effective– Missed cleavage– End-point degradation– Endogenous peptide activity

Page 52: Proteomics via Mass Spectrometry  (a bioinformatics perspective)

Wi’07 Bafna

Example of non-tryptic peptide analysis

0

5 0

1 0 0

1 5 0

2 0 0

2 5 0

0 1 0 2 0 3 0 4 0 5 0 6 0 7 0 8 0 9 0 1 0 0 1 1 0 1 2 0 1 3 0 1 4 0 1 5 0

R e s i d u e n u m b e r

Endpoints

N o n - c o v e r e d a n d n o u p s t r e a m c o v e r a g e N o n - c o v e r e d

• Experiment: a massive oversampling of a proteome (14M spectra of a prokaryotic genome)

• Plot the absolute postion of the most N-terminal peptide (not-necessarily tryptic) (Gupta et al., 2007)

• Two peaks are seen, at position 2, and position ~22

Page 53: Proteomics via Mass Spectrometry  (a bioinformatics perspective)

Wi’07 Bafna

Signal Peptide discovery

Page 54: Proteomics via Mass Spectrometry  (a bioinformatics perspective)

Wi’07 Bafna

PT modifications/processing

• The problem was not intractable, but impractical.

• Identifications of modified peptides was not routine, and is left to specialized cases.

• Better filtering technology makes it practical to explore modifications on a large scale.

• Increase in time is modest with increasing number of modifications.

• The technology can only improve. Should be possible to improve speed by another order of magnitude.

Page 55: Proteomics via Mass Spectrometry  (a bioinformatics perspective)

Wi’07 Bafna

We will filter databases via a trick from sequence searching

Page 56: Proteomics via Mass Spectrometry  (a bioinformatics perspective)

Wi’07 Bafna

Sequence Search Basics

• Q: Given k words (si has length li), and a database of size n, find all matches to these words in the database string.

• How fast can this be done?

1:POTATO2:POTASSIUM3:TASTE

P O T A S T P O T A T O

dictionary

database

Page 57: Proteomics via Mass Spectrometry  (a bioinformatics perspective)

Wi’07 Bafna

Dict. Matching & string matching

• How fast can you do it, if you only had one word of length m?– Trivial algorithm O(nm) time– Pre-processing O(m), Search O(n) time.

• Dictionary matching– Trivial algorithm (l1+l2+l3…)n

– Using a keyword tree, lpn (lp is the length of the longest pattern)

– Aho-Corasick: O(n) after preprocessing O(l1+l2..)

• We will consider the most general case

Page 58: Proteomics via Mass Spectrometry  (a bioinformatics perspective)

Wi’07 Bafna

Sequence tag filters

Basics1. De novo sequencing can be used to get

partial sequence information from the spectra.

2. Exact matching for sequence is fast. 3. We can search a database (size n) with k

substrings (of any length) in time that is proportional to n, but independent of the size or number of substrings.

• Aho-Corasick trie data-structure.

Page 59: Proteomics via Mass Spectrometry  (a bioinformatics perspective)

Wi’07 Bafna

Dictionary matching

Y A K

SN

N

F

F

AT

YFAKYFNSFNTA

…..Y F R A Y F N T A…..

• In each step, either f, or l is incremented. Total time is 2n, independent of automaton size.

f l

Page 60: Proteomics via Mass Spectrometry  (a bioinformatics perspective)

Wi’07 Bafna

Tag-based filtering

• A tag is a short peptide with a prefix and suffix mass• Efficient: An average tripeptide tag matches Swiss-Prot

~700 times• Tagging is related to de novo sequencing yet different.• Objective: Compute a subset of short strings, at least

one of which must be correct. Longer tags=> better filter.

• Analogy: Using tags to search the proteome is similar to moving from full Smith-Waterman alignment to BLAST

Page 61: Proteomics via Mass Spectrometry  (a bioinformatics perspective)

Wi’07 Bafna

Tag generation

WR

A

C

VG

E

K

DW

LP

T

L T

TAG Prefix Mass

AVG 0.0

WTD 120.2

PET 211.4

• Using local paths in the spectrum graph, construct peptide tags.• Use the top ten tags to filter the database• Tagging is related to de novo sequencing yet different.• Objective: Compute a subset of short strings, at least one of which must be correct.

Longer tags=> better filter.

Page 62: Proteomics via Mass Spectrometry  (a bioinformatics perspective)

Wi’07 Bafna

Tag-based search

• Tags (from multiple spectra) are used to construct a trie

• For each string match, attempt to extend, matching the prefix and suffix mass using flanking sequence (and PTMs)

• Retain best matches for detailed scoring

Page 63: Proteomics via Mass Spectrometry  (a bioinformatics perspective)

Wi’07 Bafna

Tag based search (dictionary matching)

YFDDSTSTDTDYYNM

Y

M

F D

N

Y

M

F D

N

…..YFDSTGSGIFDESTMTKTYFDSTDYNMAK….

De novo trie

scan

Page 64: Proteomics via Mass Spectrometry  (a bioinformatics perspective)

Wi’07 Bafna

Speed & Sensitivity

• FilteredPeptides: # peptides that pass the filter

• Time: Scan time + filteredpeptides * Scoring-time

• For sequence tag filters, the scan time can be amortized out, by combining scan for many spectra all at once.– Build one automaton from multiple spectra

• Thus, filter efficiency is key to speed. • Filter sensitivity is also important

Page 65: Proteomics via Mass Spectrometry  (a bioinformatics perspective)

Wi’07 Bafna

• Given: – tag with prefix and suffix masses <mP> xyz <mS>

– match in the database

• Compute if a suffix and prefix match with allowable modifications. How fast can this be done?

• Compute a candidate peptide with most likely attachment point.

Fast Extension

xyz

<mP>xyz<mS>

Page 66: Proteomics via Mass Spectrometry  (a bioinformatics perspective)

Wi’07 Bafna

Filtering

• Database filtering is a critical (and relatively unexplored) strategy for MS searches.

• De novo sequencing to get tags is an effective strategy

• Are other forms of filtering possible?– Alg. Question: given a spectrum, find all

peptides that match a subset of theoretical fragments. How quickly can you do that?

Page 67: Proteomics via Mass Spectrometry  (a bioinformatics perspective)

Wi’07 Bafna

Overview

Filter SignificanceScoreextension

De novo

Db55M peptides

CandidatePeptides (700)

Page 68: Proteomics via Mass Spectrometry  (a bioinformatics perspective)

Wi’07 Bafna

Scoring

• Input:– Candidate peptide with attached

modifications– Spectrum

• Output:– Score function:– Key: the score must normalize for length, as

variable modifications can change peptide length.

Page 69: Proteomics via Mass Spectrometry  (a bioinformatics perspective)

Wi’07 Bafna

Score function

• Score is a log-odds function. • Numerator: Prob. That the spectrum was

generated by n theoretical fragments from the peptide.

• Denominator: Probability that the spectrum was generated by n randomly generated peaks

Page 70: Proteomics via Mass Spectrometry  (a bioinformatics perspective)

Wi’07 Bafna

Probabilistic Scoring

Empirically computed

Depends upon tolerance

Theoretical fragments

Peaksi-th peak

j-th frag.

Page 71: Proteomics via Mass Spectrometry  (a bioinformatics perspective)

Wi’07 Bafna

Overview

Filter ValidationScoreextension

De novo

Db55M peptides

CandidatePeptides (700)

Page 72: Proteomics via Mass Spectrometry  (a bioinformatics perspective)

Wi’07 Bafna

P-value computation

• The score function allows us to rank peptides. • The top scoring one may not be the correct

one.• We consider a collection of +ve (top scoring

correct one), and -ve spectra.• Consider a bunch of other scores that help

separate the +ve from the -ve.

Page 73: Proteomics via Mass Spectrometry  (a bioinformatics perspective)

Wi’07 Bafna

Quality scores and p-values

• Features:• Score S: as computed• Explained Intensity I: fraction of total intensity

explained by annotated peaks.• Explained peaks: fraction of top 25 peaks annotated.• b-y score B: fraction of b+y ions annotated• -score : difference between the best and second

best• Choose a final score as a (linear) combination

of the features. The weights can be trained using a discriminative strategy.

Page 74: Proteomics via Mass Spectrometry  (a bioinformatics perspective)

Wi’07 Bafna

Separating power of features

Page 75: Proteomics via Mass Spectrometry  (a bioinformatics perspective)

Wi’07 Bafna

Computing confidence values

• Discriminative training of feature weights is used to maximize the separation.

• The distribution of -ve example scores can supply a confidence value.

Page 76: Proteomics via Mass Spectrometry  (a bioinformatics perspective)

Wi’07 Bafna

Common ID Tools

• Sequest– De Novo + Filtering: Parent Mass, Enzyme specificity– Scoring: Simple cross-correlation of theoretical and experimental

peaks– Validation: Xcorr (based on difference between best and second

best)• Mascot: Similar• MS-BLAST

– Uses 3rd party de novo prediction– Filtering using Blast code– Limited scoring– Validation at the protein level

Page 77: Proteomics via Mass Spectrometry  (a bioinformatics perspective)

Wi’07 Bafna

Identification summary

• While MS technologies are important for proteomics in general, they are the key technology for identification.

• They can probe the proteome dynamically, and help identify mutations and modifications.

• The algorithms are continually improved by improved versions of one or more modules.

Page 78: Proteomics via Mass Spectrometry  (a bioinformatics perspective)

Wi’07 Bafna

Modifications

• While we discussed modification in general, identification and validation of modifications remains an important theme.

• Larger modifications are interesting in themselves (such that glycan chains in glycosylation), and might be identified using MS

ABRF delta mass resource

Page 79: Proteomics via Mass Spectrometry  (a bioinformatics perspective)

Wi’07 Bafna

Isotopic Profiles

Page 80: Proteomics via Mass Spectrometry  (a bioinformatics perspective)

Wi’07 Bafna

Mass Measurement?

• VAPEEHPVLLTEAPLNPK (Mol. Mass=1953)

Page 81: Proteomics via Mass Spectrometry  (a bioinformatics perspective)

Wi’07 Bafna

Mass-Charge ratio

• The charge is due to a proton (H+) which has a mass of ~1 Da

• The X-axis is (M+Z)/Z– Z=1 implies that peak is at M+1– Z=2 implies that peak is at (M+2)/2

• M=1000, Z=2, peak position is at 501

– Suppose you see a peak at 501. Is the mass 500, or is it 1000?

Page 82: Proteomics via Mass Spectrometry  (a bioinformatics perspective)

Wi’07 Bafna

Isotopic peaks

• Ex: Consider peptide SAM• Mass = 308.12802 • You should see:

• Instead, you see

308.13

308.13 310.13

Page 83: Proteomics via Mass Spectrometry  (a bioinformatics perspective)

Wi’07 Bafna

Isotopes

• C-12 is the most common. Suppose C-13 occurs with probability 1%

• EX: SAM – Composition: C11 H22 N3 O5 S1

• What is the probability that you will see a single C-13?

11

1

⎝ ⎜

⎠ ⎟⋅0.01⋅ 0.99( )

10≅ 0.1

308 309

MS spectrum for SAM

Page 84: Proteomics via Mass Spectrometry  (a bioinformatics perspective)

Wi’07 Bafna

All atoms have isotopes

• Isotopes of atoms– O16,18, C-12,13, S32,34….– Each isotope has a frequency of occurrence

• If a molecule (peptide) has a single copy of C-13, that will shift its peak by 1 Da

• With multiple copies of a peptide, we have a distribution of intensities over a range of masses (Isotopic profile).

• How can you compute the isotopic profile of a peak?

Page 85: Proteomics via Mass Spectrometry  (a bioinformatics perspective)

Wi’07 Bafna

Isotope Calculation

• Denote:– Nc : number of carbon atoms in the peptide

– Pc : probability of occurrence of C-13 (~1%)

– Then

Pr[Peak at M] =NC

0 ⎛ ⎝ ⎜

⎞ ⎠ ⎟pc

0 1− pc( )NC

Pr[Peak at M +1] =NC

1 ⎛ ⎝ ⎜

⎞ ⎠ ⎟pc

1 1− pc( )NC −1

+1

Nc=50

+1

Nc=200

Page 86: Proteomics via Mass Spectrometry  (a bioinformatics perspective)

Wi’07 Bafna

Isotope Calculation Example

• Suppose we consider Nitrogen, and Carbon• NN: number of Nitrogen atoms• PN: probability of occurrence of N-15• Pr(peak at M)• Pr(peak at M+1)?• Pr(peak at M+2)?

Pr[Peak at M] =NC

0 ⎛ ⎝ ⎜

⎞ ⎠ ⎟pc

0 1− pc( )NC NN

0 ⎛ ⎝ ⎜

⎞ ⎠ ⎟pN

0 1− pN( )NN

Pr[Peak at M +1] =NC

1 ⎛ ⎝ ⎜

⎞ ⎠ ⎟pc

1 1− pc( )NC −1 NN

0 ⎛ ⎝ ⎜

⎞ ⎠ ⎟pN

0 1− pN( )NN

+NC

0 ⎛ ⎝ ⎜

⎞ ⎠ ⎟pc

0 1− pc( )NC NN

1 ⎛ ⎝ ⎜

⎞ ⎠ ⎟pN

1 1− pN( )NN −1

How do we generalize? How can we handle Oxygen (O-16,18)?

Page 87: Proteomics via Mass Spectrometry  (a bioinformatics perspective)

Wi’07 Bafna

General isotope computation

• Definition:– Let pi,a be the abundance of the isotope with

mass i Da above the least mass– Ex: P0,C : abundance of C-12, P2,O: O-18 etc.

• Characteristic polynomial

• Prob{M+i}: coefficient of xi in (x) (a binomial convolution)

φ(x) = p0,a + p1,a x + p2,a x 2 +L( )a

∏Na

Page 88: Proteomics via Mass Spectrometry  (a bioinformatics perspective)

Wi’07 Bafna

Quiz

• How can you determine the charge on a peptide?

Difference between the first and second isotope peak is 1/Z

Proposal: Given a mass, predict a composition, and the isotopic profile Do a ‘goodness of fit’ test to isolate the peaks corresponding to the isotope Compute the difference

Page 89: Proteomics via Mass Spectrometry  (a bioinformatics perspective)

Wi’07 Bafna

Isotopic Profile Application

• In DxMS, hydrogen atoms are exchanged with deuterium• The rate of exchange indicates how buried the peptide is (in

folded state)• Consider the observed characteristic polynomial of the isotope

profile t1, t2, at various time points. Then

• The estimates of p1,H can be obtained by a deconvolution• Such estimates at various time points should give the rate of

incorporation of Deuterium, and therefore, the accessibility.€

φt2(x) = φt1

(x)(p0,H + p1,H x)N H

Page 90: Proteomics via Mass Spectrometry  (a bioinformatics perspective)

Wi’07 Bafna

Quantitation via MS

Page 91: Proteomics via Mass Spectrometry  (a bioinformatics perspective)

Wi’07 Bafna

Quantitation

• The intensity of the peak depends upon– Abundance, ionization potential, substrate

etc.• Two peptides with the same abundance

can have very different intensities.• Assumption: relative abundance can be

measured by comparing the ratio of a peptide in 2 samples.

Page 92: Proteomics via Mass Spectrometry  (a bioinformatics perspective)

Wi’07 Bafna

Quantitation issues

• The two samples might be from a complex mixture. How do we identify identical peptides in two samples?

• In micro-array this is possible because the cDNA is spotted in a precise location? Can we have a ‘location’ for proteins/peptides

• MS based quantitation must be coupled with quantitation

Page 93: Proteomics via Mass Spectrometry  (a bioinformatics perspective)

Wi’07 Bafna

2D Gel based separation

• Intact proteins are used• Iso-electric focusing and

Mol Weight used to separate.

• Intensity of spot is used to measure abundance.

• Problems: All 3 measurements are not that precise (low reproducibility)

• Labor intensive• Many proteins do not get

separated.

Page 94: Proteomics via Mass Spectrometry  (a bioinformatics perspective)

Wi’07 Bafna

LC-MS based separation

• As the peptides elute (separated by physiochemical properties), spectra is acquired.

HPLC ESI TOF Spectrum (scan)

p1p2

pnp4

p3

Page 95: Proteomics via Mass Spectrometry  (a bioinformatics perspective)

Wi’07 Bafna

LC-MS Maps

timem/z

IPeptide 2

Peptide 1

x x x xx x x x x x

x x x xx x x x x x

time

m/z

Peptide 2 elution• A peptide/feature can be

labeled with the triple (M,T,I):– monoisotopic M/Z,

centroid retention time, and intensity

• An LC-MS map is a collection of features

Page 96: Proteomics via Mass Spectrometry  (a bioinformatics perspective)

Wi’07 Bafna

Peptide Features

Isotope pattern Elution profile

Peptide (feature)

Capture ALL peaks belonging to a peptide for quantification !

Page 97: Proteomics via Mass Spectrometry  (a bioinformatics perspective)

Wi’07 Bafna

Relative abundance using MS

• Differential Isotope labeling (ICAT/SILAC)• External standards (AQUA)• Direct Map comparison

Page 98: Proteomics via Mass Spectrometry  (a bioinformatics perspective)

Wi’07 Bafna

ICAT

• Introduced by Aebersold• ICAT reagent is attached to particular amino-acids (Cys)• Affinity purification leads to simplification of complex mixture

“diseased”

Cell state 1

Cell state 2

“Normal”

Label proteins with heavy ICAT

Label proteins with light ICAT

Combine

Fractionate protein prep - membrane - cytosolic

Proteolysis

Isolate ICAT- labeled peptides

Nat. Biotechnol. 17: 994-999,1999

Page 99: Proteomics via Mass Spectrometry  (a bioinformatics perspective)

Wi’07 Bafna

Differential analysis using ICAT

ICAT pairs atknown distance

diseased

normal

Page 100: Proteomics via Mass Spectrometry  (a bioinformatics perspective)

Wi’07 Bafna

ICAT issues

• The tag is heavy, and decreases the dynamic range of the measurements.

• The tag might break off• Only Cysteine containing peptides are

retrieved Non-specific binding to strepdavidin

Page 101: Proteomics via Mass Spectrometry  (a bioinformatics perspective)

Wi’07 Bafna

Serum ICAT data

Page 102: Proteomics via Mass Spectrometry  (a bioinformatics perspective)

Wi’07 Bafna

Serum ICAT data

8

0

2224

3032

3840

46

16

• Instead of pairs, we see entire clusters at 0, +8,+16,+22

• What is going on?

Page 103: Proteomics via Mass Spectrometry  (a bioinformatics perspective)

Wi’07 Bafna

SILAC

• A different isotope labeling strategy• Mammalian cells do not ‘manufacture’ all

amino-acids. Where do they come from?• Labeled amino-acids are added to amino-acid

deficient culture, and are incorporated into all proteins as they are synthesized

• No chemical labeling or affinity purification is performed.

• Leucine can be used (10% abundance vs 2% for Cys)

Page 104: Proteomics via Mass Spectrometry  (a bioinformatics perspective)

Wi’07 Bafna

SILAC vs ICAT

• Leucine is higher abundance than Cys

• No affinity tagging done

• Fragmentation patterns for the two peptides are identical

– Identification is easier

Ong et al. MCP, 2002

Page 105: Proteomics via Mass Spectrometry  (a bioinformatics perspective)

Wi’07 Bafna

Incorporation of Leu-d3 at various time points

• Doubling time of the cells is 24 hrs.

• Peptide = VAPEEHPVLLTEAPLNPK

• What is the charge on the peptide?

Page 106: Proteomics via Mass Spectrometry  (a bioinformatics perspective)

Wi’07 Bafna

Quantitation on controlled mixtures

Page 107: Proteomics via Mass Spectrometry  (a bioinformatics perspective)

Wi’07 Bafna

Identification

• MS/MS of differentially labeled peptides

Page 108: Proteomics via Mass Spectrometry  (a bioinformatics perspective)

Wi’07 Bafna

Questions

• The quantitation of peptides also depends upon the amount mixed right in the beginning (what if you mixed more of one sample?)

• How can you control for such errors?• What happens when you see singletons

(unpaired features)?– In ICAT, it can be a non-Cys peptide (chemical noise), OR– Under-expressed Cys-peptide– How can you differentiate between the two cases?

Page 109: Proteomics via Mass Spectrometry  (a bioinformatics perspective)

Wi’07 Bafna

Map comparison

• The retention time and M/Z are a signature for the peptide.

• Can we use this signature to compare the intensity of the same peptide in two samples?

T

M/Z

M/Z

Page 110: Proteomics via Mass Spectrometry  (a bioinformatics perspective)

Wi’07 Bafna

Map 1 (normal) Map 2 (diseased)

Map Comparison for Quantification

Page 111: Proteomics via Mass Spectrometry  (a bioinformatics perspective)

Wi’07 Bafna

Data reduction (feature detection)

Features

• Each feature is represented by – Monoisotopic M/Z, centroid retention time, aggregate

intensity

Page 112: Proteomics via Mass Spectrometry  (a bioinformatics perspective)

Wi’07 Bafna

Comparison of features across maps

• Hard to reduce features to single spots• Matching paired features is critical• M/Z is accurate, but time is not. A time scaling might be

necessary

Page 113: Proteomics via Mass Spectrometry  (a bioinformatics perspective)

Wi’07 Bafna

Time scaling: Alignment

• Each time scan is a vector of intensities.

• Two scans in different runs can be scored for similarity (using a dot product)

• Compute an alignment to match scans against each other.

• Advantage: does not rely on feature detection.

• Disadvantage: Might not handle affine shifts in time scaling, but is better for local shifts

Page 114: Proteomics via Mass Spectrometry  (a bioinformatics perspective)

Wi’07 Bafna

Spectral comparison

• The dot product is the standard technique for comparing two spectra

• Bin the peaks by partitioning the masses. Each mass bin gets an aggregate intensity corresponding to the peaks that fall in it.

• The normalized dot-product of the resulting vectors measures the similarity

partitions

Vector v1

v1

v2

Page 115: Proteomics via Mass Spectrometry  (a bioinformatics perspective)

Wi’07 Bafna

Basic geometry

• What is ||x||2 ?• What is x/||x||• Dot product x=(x1,x2)

y

xT y = x1y1 + x2y2

= || x ||⋅ || y || cosθx cosθy + || x ||⋅ || y || sin(θx )sin(θy )|| x ||⋅ || y || cos(θx −θy )

Page 116: Proteomics via Mass Spectrometry  (a bioinformatics perspective)

Wi’07 Bafna

Matching runs

• Compute a matching of the vectors with the best score.

• What should the indel penalties be like?

v1

w1

v2

vj

wi

wiT vj

Page 117: Proteomics via Mass Spectrometry  (a bioinformatics perspective)

Wi’07 Bafna

Time scaling: Approach 2 (geometric matching)

• Match features based on M/Z, and (loose) time matching. Objective f (t1-t2)2

• Let t2’ = a t2 + b. Select a,b so as to minimize f (t1-t’2)2

Page 118: Proteomics via Mass Spectrometry  (a bioinformatics perspective)

Wi’07 Bafna

Geometric Matching

• Pair up everything with identical m/z – Very loose constraints on time.

• Give every pair a cost equal to the time difference• Iterate over all (a,b), to find one that mimimizes the

minimum cost matching.

Page 119: Proteomics via Mass Spectrometry  (a bioinformatics perspective)

Wi’07 Bafna

Comparing the different approaches

• Time scaling via Geometric matching depends critically upon reliable feature identification.

• Time scaling via spectral comparison depends upon coordinated elution of peptide features

Page 120: Proteomics via Mass Spectrometry  (a bioinformatics perspective)

Wi’07 Bafna

Quantitation Summary

• Once we have features matched across runs, we have data identical to microarrays .

• Features can be ‘identified’ in separate MS2 experiments

• Feature detection, LC-MS mapping/ICAT decouple the identification from the quantitation

• The difficulty of producing such data makes it a challenging problem for bioinformatics

run

feature

intensity(Identification via MS2)

Page 121: Proteomics via Mass Spectrometry  (a bioinformatics perspective)

Wi’07 Bafna

Other applications of MS

Page 122: Proteomics via Mass Spectrometry  (a bioinformatics perspective)

Wi’07 Bafna

MS application: Protein-protein interaction

• Proteins combine to form functional complexes.

• An antibody is a special kind of protein that can recognize a specific protein

• Use an antibody to recognize a protein in a complex. Isolate & Purify the complex that binds to the antibody.

• Identify all the proteins in the complex via mass spectrometry.

Page 123: Proteomics via Mass Spectrometry  (a bioinformatics perspective)

Wi’07 Bafna

MS application:Protein Structure

• Use chemical cross-linkers to link spatially proximal residues.

• Denature and digest the protein. Identify the cross-linked peptides. This provides extra structural constraints which help predict structure.

Page 124: Proteomics via Mass Spectrometry  (a bioinformatics perspective)

Wi’07 Bafna

Cross-linking

• Cross-links are ‘fixed’ length polymers that bind to amino-acids.

• How can they help predict structure?

• Protocol– Cross-link native protein– Denature, digest– MS/MS (identify cross-

linked peptides)• Potentially valuable, but

not widely used

Page 125: Proteomics via Mass Spectrometry  (a bioinformatics perspective)

Wi’07 Bafna

Identifying Cross-linked peptides

• Identify all peptide pairs, whose mass explains the parent mass.

• Given a list of peptide pairs, find the pair, and the linked position that best explains the MS2 data.

• What is the number of possible candidate pairs.

• Fragmentation in the presence of linkers is poorly understood

• How do you separate cross-linked peptides from singly linked, and non-cross-linked peptides?

Page 126: Proteomics via Mass Spectrometry  (a bioinformatics perspective)

Wi’07 Bafna

Overlap peptides and shotgun based identification

Page 127: Proteomics via Mass Spectrometry  (a bioinformatics perspective)

Wi’07 Bafna

Motivation

Database search of MS/MS spectra works well whenever the protein sequence is known and the protein is not modified/mutated

BUT:– Not all protein sequences are available

• Examples include proteins from snake and scorpion venom

• Integrilin, a successful blood clot prevention drug distributed by Millenium, was derived from rattlesnake venom

• Sequences are still determined by Edman sequencing– Database search is likely to fail if the analyzed protein

contains unexpected post-translational modifications or mutations.

Page 128: Proteomics via Mass Spectrometry  (a bioinformatics perspective)

Wi’07 Bafna

Shotgun Assembly

How to use the redundant information in MS/MS spectra from overlapping peptides to construct a de-novo interpretation

of the protein sequence?

Non-specific proteases or sets of proteases with different specificities result in very rich digestion patterns:

Page 129: Proteomics via Mass Spectrometry  (a bioinformatics perspective)

Wi’07 Bafna

Unanticipated modifications

Page 130: Proteomics via Mass Spectrometry  (a bioinformatics perspective)

Wi’07 Bafna

MS-Alignment

• Dynamic Programming can be used to capture mass-offsets (putative PTMs).

• In large data-sets, true PTMs should be over-represented.

Tsur et al.’ Nat. Bio. 2005

Page 131: Proteomics via Mass Spectrometry  (a bioinformatics perspective)

Wi’07 Bafna

PTM Frequency MatrixA C D E F G H I K L M N P Q R S T V W Y

10 1 1 3 2 0 4 0 0 2 1 1 0 7 2 3 5 4 3 0 011 7 2 0 1 1 6 0 1 1 1 1 2 0 0 0 7 2 1 0 112 2 2 1 1 6 3 1 0 5 2 3 1 0 5 0 2 1 3 0 013 4 0 3 2 0 5 1 4 4 5 3 1 0 2 2 5 4 2 2 114 12 2 5 16 1 12 0 7 ## 4 43 3 5 2 2 13 4 20 0 315 5 0 3 2 2 8 0 7 16 7 18 5 3 4 1 6 1 12 0 116 6 0 20 63 2 21 5 73 2 63 ## 14 10 18 2 7 8 10 ## 617 5 0 7 8 3 9 2 18 2 23 ## 32 5 18 0 8 2 5 29 318 0 3 3 3 3 10 1 6 4 9 15 7 43 5 1 5 3 5 2 119 2 0 0 3 0 7 1 3 9 4 3 3 7 1 0 7 1 12 4 020 3 3 0 3 2 5 0 2 3 3 1 3 2 1 0 4 0 0 0 021 8 0 1 3 0 3 0 1 1 5 0 1 1 4 2 2 1 2 1 222 12 0 25 25 15 39 8 20 5 25 1 29 7 27 1 39 27 22 1 1323 1 1 3 6 0 14 0 3 4 5 0 26 6 2 1 3 2 0 0 224 0 0 1 2 0 6 1 4 1 2 0 0 1 1 4 6 1 3 0 125 1 6 2 0 1 4 1 0 3 8 0 0 0 4 1 6 0 8 0 026 1 2 1 2 2 5 0 1 0 4 0 2 4 3 1 6 2 3 0 127 2 1 1 2 0 4 0 2 3 3 1 1 1 3 0 2 2 3 1 028 5 5 2 2 4 1 1 12 ## 29 2 2 4 1 6 40 1 56 0 229 4 0 0 1 4 3 2 4 28 5 1 6 1 0 3 11 1 9 0 130 6 1 1 3 3 20 0 6 13 1 2 3 2 13 1 6 4 4 1 131 3 3 4 1 4 6 1 8 8 9 7 1 2 19 2 7 8 6 0 032 5 3 0 0 4 7 1 2 1 5 ## 3 1 4 9 1 2 6 43 033 1 1 0 1 2 6 1 1 3 9 33 2 0 4 2 6 1 1 8 334 5 0 2 2 2 8 4 7 9 19 3 7 1 5 4 4 1 0 0 235 0 1 0 2 1 7 0 1 5 2 1 2 2 1 0 2 2 3 0 2

50,000 spectra from a sample of IKKb were searched in blind mode, and identifications with p-value <0.05 were retained

Cell shading indicates the number of annotations with modification (, a)

Page 132: Proteomics via Mass Spectrometry  (a bioinformatics perspective)

Wi’07 Bafna

PTM Frequency Matrix

A C D E F G H I K L M N P Q R S T V W Y10 1 1 3 2 0 4 0 0 2 1 1 0 7 2 3 5 4 3 0 011 7 2 0 1 1 6 0 1 1 1 1 2 0 0 0 7 2 1 0 112 2 2 1 1 6 3 1 0 5 2 3 1 0 5 0 2 1 3 0 013 4 0 3 2 0 5 1 4 4 5 3 1 0 2 2 5 4 2 2 114 12 2 5 16 1 12 0 7 ## 4 43 3 5 2 2 13 4 20 0 315 5 0 3 2 2 8 0 7 16 7 18 5 3 4 1 6 1 12 0 116 6 0 20 63 2 21 5 73 2 63 ## 14 10 18 2 7 8 10 ## 617 5 0 7 8 3 9 2 18 2 23 ## 32 5 18 0 8 2 5 29 318 0 3 3 3 3 10 1 6 4 9 15 7 43 5 1 5 3 5 2 119 2 0 0 3 0 7 1 3 9 4 3 3 7 1 0 7 1 12 4 020 3 3 0 3 2 5 0 2 3 3 1 3 2 1 0 4 0 0 0 021 8 0 1 3 0 3 0 1 1 5 0 1 1 4 2 2 1 2 1 222 12 0 25 25 15 39 8 20 5 25 1 29 7 27 1 39 27 22 1 1323 1 1 3 6 0 14 0 3 4 5 0 26 6 2 1 3 2 0 0 224 0 0 1 2 0 6 1 4 1 2 0 0 1 1 4 6 1 3 0 125 1 6 2 0 1 4 1 0 3 8 0 0 0 4 1 6 0 8 0 026 1 2 1 2 2 5 0 1 0 4 0 2 4 3 1 6 2 3 0 127 2 1 1 2 0 4 0 2 3 3 1 1 1 3 0 2 2 3 1 028 5 5 2 2 4 1 1 12 ## 29 2 2 4 1 6 40 1 56 0 229 4 0 0 1 4 3 2 4 28 5 1 6 1 0 3 11 1 9 0 130 6 1 1 3 3 20 0 6 13 1 2 3 2 13 1 6 4 4 1 131 3 3 4 1 4 6 1 8 8 9 7 1 2 19 2 7 8 6 0 032 5 3 0 0 4 7 1 2 1 5 ## 3 1 4 9 1 2 6 43 033 1 1 0 1 2 6 1 1 3 9 33 2 0 4 2 6 1 1 8 334 5 0 2 2 2 8 4 7 9 19 3 7 1 5 4 4 1 0 0 235 0 1 0 2 1 7 0 1 5 2 1 2 2 1 0 2 2 3 0 2

Oxidation

OxidationMethylation

Sodium

Double oxidationDimethylation

Page 133: Proteomics via Mass Spectrometry  (a bioinformatics perspective)

Wi’07 Bafna

PTM selection: Output

a SpectraM,W 16 803non-specific 1 355C 71 332M,W 32 248N 1 225K 28 184non-specific 22 176K,M 14 154E,D,P 53 130T,E,D -18 117L 156 92V 28 56I 16 49K -57 46S 28 30L 17 27M,W 38 23C 76 22non-specific 2 22M -2 21I 44 20L 54 19

Page 134: Proteomics via Mass Spectrometry  (a bioinformatics perspective)

Wi’07 Bafna

Filtering -correct annotations

A C D E F G H I K L M N P Q R S T V W Y10 1 1 3 2 0 4 0 0 2 1 1 0 7 2 3 5 4 3 0 011 7 2 0 1 1 6 0 1 1 1 1 2 0 0 0 7 2 1 0 112 2 2 1 1 6 3 1 0 5 2 3 1 0 5 0 2 1 3 0 013 4 0 3 2 0 5 1 4 4 5 3 1 0 2 2 5 4 2 2 114 12 2 5 16 1 12 0 7 ## 4 43 3 5 2 2 13 4 20 0 315 5 0 3 2 2 8 0 7 16 7 18 5 3 4 1 6 1 12 0 116 6 0 20 63 2 21 5 73 2 63 ## 14 10 18 2 7 8 10 ## 617 5 0 7 8 3 9 2 18 2 23 ## 32 5 18 0 8 2 5 29 318 0 3 3 3 3 10 1 6 4 9 15 7 43 5 1 5 3 5 2 119 2 0 0 3 0 7 1 3 9 4 3 3 7 1 0 7 1 12 4 020 3 3 0 3 2 5 0 2 3 3 1 3 2 1 0 4 0 0 0 021 8 0 1 3 0 3 0 1 1 5 0 1 1 4 2 2 1 2 1 222 12 0 25 25 15 39 8 20 5 25 1 29 7 27 1 39 27 22 1 1323 1 1 3 6 0 14 0 3 4 5 0 26 6 2 1 3 2 0 0 224 0 0 1 2 0 6 1 4 1 2 0 0 1 1 4 6 1 3 0 125 1 6 2 0 1 4 1 0 3 8 0 0 0 4 1 6 0 8 0 026 1 2 1 2 2 5 0 1 0 4 0 2 4 3 1 6 2 3 0 127 2 1 1 2 0 4 0 2 3 3 1 1 1 3 0 2 2 3 1 028 5 5 2 2 4 1 1 12 ## 29 2 2 4 1 6 40 1 56 0 229 4 0 0 1 4 3 2 4 28 5 1 6 1 0 3 11 1 9 0 130 6 1 1 3 3 20 0 6 13 1 2 3 2 13 1 6 4 4 1 131 3 3 4 1 4 6 1 8 8 9 7 1 2 19 2 7 8 6 0 032 5 3 0 0 4 7 1 2 1 5 ## 3 1 4 9 1 2 6 43 033 1 1 0 1 2 6 1 1 3 9 33 2 0 4 2 6 1 1 8 334 5 0 2 2 2 8 4 7 9 19 3 7 1 5 4 4 1 0 0 235 0 1 0 2 1 7 0 1 5 2 1 2 2 1 0 2 2 3 0 2

M+17 (from oxidized methionine, incorrectmass)A+14 (from

methylated lysine,incorrectplacement)

Page 135: Proteomics via Mass Spectrometry  (a bioinformatics perspective)

Wi’07 Bafna

Overlapping peptides

Page 136: Proteomics via Mass Spectrometry  (a bioinformatics perspective)

Wi’07 Bafna

Features for robust PTM identification

• A number of features are used to validate mass-shifts obtained from MS-alignment, including

– evidence from overlapping peptides, – number of spectra. – Delta scores from other possibilities, – Evidence from multiple sites, de-novo identifications etc.– These features are used to train an SVM.

• Validation is done at spectrum, peptide, and site levels• Final searches performed using a fixed false discovery rates

on a random database.

Page 137: Proteomics via Mass Spectrometry  (a bioinformatics perspective)

Wi’07 Bafna

Blind Search of a lens data-set

Human lenses are a rich source of modified proteins:

– Lens proteins do not turnover, but accumulate modifications over time

– Hundreds of papers in the last 20 years reporting crystallin modifications

– Yates‘ lab pioneered use of various proteases for high-throughput PTM validation/discovery in lenses (MacCoss et al, 2003)

– Sample from a 93-year old patient

– Wilmarth et al. (Jnl. Prot. Res.,06)

Page 138: Proteomics via Mass Spectrometry  (a bioinformatics perspective)

Wi’07 Bafna

Modifications on cataractous lensLocation Modification

MassPutative annotation

S,T -18 dehydrationQ -17 deamidationW -2 cross-linkingH 14 methylation

M,W 16 oxidationS,H 28 double methylation

N-term 42 acetylationN-term 43 carbamylation

K,non-terminal 43 carbamylationW 44 carboxylationR 55 unknownK 58 carboxymethylationK 72 carboxyethylation

Location Modification mass

Type Putative annotation Comment

M -48 Chem. artifact loss of methane sulfenic acid reported on same siteW 4 PTM kynurenine reported in cataractous lensesS 30/73 unknown unknownW 32 PTM formylkynurenine reported in cataractous lenses

N-term 57 unknown carboxyamidomethylation In-vivo N-term modification?N-term 271 unknown unknown

Table 1: Rediscovered all modifications previously identified by database search except deamidation on N/Q (+1 Da).

Table 2: Identified 6 new modification events

Page 139: Proteomics via Mass Spectrometry  (a bioinformatics perspective)

Wi’07 Bafna

Lens: Unknown modifications

• Unknown modification R+55 was found on both data-sets, and assigned to overlapping sets of sites.

• Some evidence for Q+161 (glycosylation?) in a few spectra

Page 140: Proteomics via Mass Spectrometry  (a bioinformatics perspective)

Wi’07 Bafna

Spectra for putative R+55 modification

Page 141: Proteomics via Mass Spectrometry  (a bioinformatics perspective)

Wi’07 Bafna

2. Selecting modifications (summary)

• A ‘strength in numbers’ approach: The more spectra, the better.

– A recent unpublished search with ~18M spectra reveals ~500 modified sites, with a 2% FDR

• Overlapping peptides are strong evidence (incorrect matches unlikely to overlap)

• These and other features are being used in an automated tool for identifying modified sites.

Page 142: Proteomics via Mass Spectrometry  (a bioinformatics perspective)

Wi’07 Bafna

Eukaryotic genes

• A eukaryotic gene has a complex structure.• A gene may be transcribed (translated) into many alternative isoforms.• A ‘typical’ exon is ~150bp. A ‘typical’ tryptic peptide is ‘15’ aa. What is

the probability that a typical peptide crosses a splice junction?

ATG5’ UTR

intron exon

3’ UTR

Translation start

Page 143: Proteomics via Mass Spectrometry  (a bioinformatics perspective)

Wi’07 Bafna

Searching spliced-exon graph

• Instead of searching 6 frame translations, we search a compact representation of putative exons, and exon-pairs!

• Each putative exon is a node. Splicing and SNPs are represented by edges

• The tag based search extends efficiently to spliced-exon graphs.

Page 144: Proteomics via Mass Spectrometry  (a bioinformatics perspective)

Wi’07 Bafna

Genomic Search Results

• ~18M spectra from a kidney cell line were searched against a human splice-exon graph.

• Validation of 39,000 exons and 11,000 introns

• Novel or extended exons in 16 genes, confirm translation of 224 hypothetical proteins.

• Discover over 40 alternative splicing events.

• 308 coding SNPs.

Tanner et al., Genome Research

Page 145: Proteomics via Mass Spectrometry  (a bioinformatics perspective)

Wi’07 Bafna

Retinoblastoma gene (novel 3’ Exon)

Page 146: Proteomics via Mass Spectrometry  (a bioinformatics perspective)

Wi’07 Bafna

EX: Novel Exons

intron

Page 147: Proteomics via Mass Spectrometry  (a bioinformatics perspective)

Wi’07 Bafna

Validating hypothetical protein

Page 148: Proteomics via Mass Spectrometry  (a bioinformatics perspective)

Wi’07 Bafna

Conclusion

• Key technology for proteomics• Leading technology for protein identification,

PT modifications, and protein level quantitation

• Applications to protein structure, interactions, pathways and other proteomic problems