Novel Peptide Identification using ESTs and Genomic Sequence

Nathan Edwards
Center for Bioinformatics and Computational Biology
University of Maryland, College Park

Post on 21-Jan-2016

Mass Spectrometry for Proteomics

• Measure mass of many (bio)molecules simultaneously
  • High bandwidth

• Mass is an intrinsic property of all (bio)molecules
  • No prior knowledge required

Mass Spectrometry for Proteomics

• Measure mass of many molecules simultaneously
  • ...but not too many, abundance bias

• Mass is an intrinsic property of all (bio)molecules
  • ...but need a reference to compare to

Mass Spectrometry for Proteomics

• Mass spectrometry has been around since the turn of the century...
  • ...why is MS Proteomics so new?

• Ionization methods
  • MALDI, Electrospray

• Protein chemistry & automation
  • Chromatography, Gels, Computers

• Protein sequence databases
  • A reference for comparison

Microorganism Identification by MALDI Mass Spectrometry

• Direct observation of microorganism biomarkers in the field.

• Peaks represent masses of abundant proteins.

• Statistical models assess identification significance.

[Figure: MALDI mass spectrum, B. anthracis]

Key Principles

• Protein mass from protein sequence
  • No introns, few PTMs

• Specificity of single mass is very weak
  • Statistical significance from many peaks

• Not all proteins are equally likely to be observed
  • Ribosomal proteins, SASPs
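The first principle above (protein mass from protein sequence) can be sketched in a few lines. This is a minimal illustration, not the speaker's implementation: the name `average_mass` is hypothetical, the table holds standard average residue masses in daltons, and PTMs and N-terminal processing are ignored.

```python
# Standard average residue masses in daltons (amino acid minus one water).
AVG_RESIDUE_MASS = {
    "G": 57.0519, "A": 71.0788, "S": 87.0782, "P": 97.1167, "V": 99.1326,
    "T": 101.1051, "C": 103.1388, "L": 113.1594, "I": 113.1594, "N": 114.1038,
    "D": 115.0886, "Q": 128.1307, "K": 128.1741, "E": 129.1155, "M": 131.1926,
    "H": 137.1411, "F": 147.1766, "R": 156.1875, "Y": 163.1760, "W": 186.2132,
}
WATER = 18.0153  # one water per chain (free N- and C-termini)


def average_mass(sequence):
    """Average mass of an unmodified protein chain (hypothetical helper)."""
    return sum(AVG_RESIDUE_MASS[aa] for aa in sequence) + WATER


if __name__ == "__main__":
    # Toy sequence for illustration only, not a real protein.
    print(round(average_mass("MARN"), 2))
```

Because many proteins share nearly the same mass, a single computed mass identifies little on its own, which is why the slide relies on many peaks for significance.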

Rapid Microorganism Identification Database (www.RMIDb.org)

• Protein Sequences
  • 5.3M (1.9M)

• Species
  • ~15K

• GenBank, RefSeq, CMR, Swiss-Prot, TrEMBL

Informatics Issues

• Need good species / strain annotation
  • B. anthracis vs B. thuringiensis

• Need correct protein sequence
  • B. anthracis Sterne α/β SASP
    • RefSeq/Gb: MVMARN... (7442 Da)
    • CMR: MARN... (7211 Da)

• Need chemistry-based protein classification
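A quick arithmetic check of the SASP example above: the two annotations differ by an N-terminal "MV", and the standard average residue masses of Met and Val add up to roughly the 231 Da gap between the two quoted masses. The variable names are illustrative, and the quoted masses are assumed to be rounded to the nearest dalton.

```python
# Standard average residue masses in daltons.
MET = 131.1926
VAL = 99.1326

# Mass added by the two extra N-terminal residues in the RefSeq/GenBank entry.
predicted_shift = MET + VAL

# Difference between the two masses quoted on the slide.
reported_shift = 7442 - 7211

print(f"predicted {predicted_shift:.2f} Da, reported {reported_shift} Da")
```

A start-site disagreement of two residues is therefore enough to move the expected peak by about 231 Da, far outside typical mass tolerance.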

Sample Preparation for Peptide Identification

Enzymatic Digest and Fractionation
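The digest step can be mimicked in silico. The slide does not name the enzyme; the sketch below assumes trypsin with the commonly used rule (cleave after K or R unless the next residue is P). The function name `tryptic_digest` and the `missed_cleavages` parameter are hypothetical.

```python
def tryptic_digest(protein, missed_cleavages=0):
    """In-silico digest, assuming trypsin: cut after K/R unless followed by P."""
    if not protein:
        return []
    # Cut positions, including both ends of the chain.
    sites = [0]
    for i in range(len(protein) - 1):
        if protein[i] in "KR" and protein[i + 1] != "P":
            sites.append(i + 1)
    sites.append(len(protein))

    peptides = []
    for i in range(len(sites) - 1):
        # Join up to `missed_cleavages` adjacent fragments.
        last = min(i + 1 + missed_cleavages, len(sites) - 1)
        for j in range(i + 1, last + 1):
            peptides.append(protein[sites[i]:sites[j]])
    return peptides


if __name__ == "__main__":
    # Toy sequence for illustration only.
    print(tryptic_digest("AKRPGKC"))
```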

Single Stage MS

[Figure: single-stage MS spectrum; axis: m/z]

Tandem Mass Spectrometry (MS/MS)

Precursor selection

[Figure: MS spectra; axis: m/z]

Tandem Mass Spectrometry (MS/MS)

Precursor selection + collision-induced dissociation (CID)

[Figure: MS and MS/MS spectra; axis: m/z]

Peptide Identification

• For each (likely) peptide sequence:
  1. Compute fragment masses
  2. Compare with spectrum
  3. Retain those that match well

• Peptide sequences from protein sequence databases
  • Swiss-Prot, IPI, NCBI’s nr, ...

• Automated, high-throughput peptide identification in complex mixtures
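The three-step loop above (compute fragment masses, compare with spectrum, retain good matches) can be sketched as follows. This is a toy under stated assumptions, not how any particular search engine scores: it assumes singly charged b- and y-ions, standard monoisotopic residue masses, and a simple shared-peak count; `fragment_masses`, `count_matches`, `identify`, the tolerance, and the threshold are all hypothetical.

```python
# Standard monoisotopic residue masses in daltons.
MONO = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276, "V": 99.06841,
    "T": 101.04768, "C": 103.00919, "L": 113.08406, "I": 113.08406,
    "N": 114.04293, "D": 115.02694, "Q": 128.05858, "K": 128.09496,
    "E": 129.04259, "M": 131.04049, "H": 137.05891, "F": 147.06841,
    "R": 156.10111, "Y": 163.06333, "W": 186.07931,
}
WATER = 18.010565
PROTON = 1.007276


def fragment_masses(peptide):
    """Step 1: singly charged b- and y-ion m/z values (assumed ion model)."""
    residues = [MONO[aa] for aa in peptide]
    b_ions, y_ions = [], []
    total = 0.0
    for mass in residues[:-1]:            # prefixes give b-ions
        total += mass
        b_ions.append(total + PROTON)
    total = 0.0
    for mass in reversed(residues[1:]):   # suffixes give y-ions
        total += mass
        y_ions.append(total + WATER + PROTON)
    return b_ions + y_ions


def count_matches(peptide, peaks, tol=0.5):
    """Step 2: count predicted fragments with an observed peak within tol."""
    return sum(
        1 for f in fragment_masses(peptide)
        if any(abs(f - p) <= tol for p in peaks)
    )


def identify(candidates, peaks, tol=0.5, min_matches=3):
    """Step 3: keep candidates that match well, best first."""
    scored = [(count_matches(c, peaks, tol), c) for c in candidates]
    return sorted((s for s in scored if s[0] >= min_matches), reverse=True)


if __name__ == "__main__":
    # Synthetic spectrum generated from a toy peptide, for illustration only.
    spectrum = fragment_masses("GAK")
    print(identify(["GAR", "GAK"], spectrum))
```

The key point for the rest of the talk is that `candidates` comes from a sequence database: a peptide that is absent from the database can never be returned, however good its spectrum.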

Why don’t we see more novel peptides?

• Tandem mass spectrometry doesn’t discriminate against novel peptides...

...but protein sequence databases do!

• Searching traditional protein sequence databases biases the results towards well-understood protein isoforms!

What goes missing?

• Known coding SNPs

• Novel coding mutations

• Alternative splicing isoforms

• Alternative translation start-sites

• Microexons

• Alternative translation frames

Why should we care?

• Alternative splicing is the norm!
  • Only 20-25K human genes
  • Each gene makes many proteins

• Proteins have clinical implications
  • Biomarker discovery

• Evidence for SNPs and alternative splicing stops with transcription
  • Genomic assays, ESTs, mRNA sequence.
  • Little hard evidence for translation start site

Novel Frame

Searching ESTs

• Proposed long ago:
  • Yates, Eng, and McCormack; Anal Chem, ’95.

• Now:
  • Protein sequences are sufficient for protein identification
  • Computationally expensive/infeasible
  • Difficult to interpret

• Make EST searching feasible for routine searching to discover novel peptides.

Searching Expressed Sequence Tags (ESTs)

Pros
• No introns!
• Primary splicing evidence for annotation pipelines
• Evidence for dbSNP
• Often derived from clinical cancer samples

Cons
• No frame
• Large (8Gb)
• “Untrusted” by annotation pipelines
• Highly redundant
• Nucleotide error rate ~ 1%
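The ~1% nucleotide error rate in the list above matters at the peptide level. As a rough illustration, assuming errors are independent and uniform across positions (a simplification not stated on the slide), the chance that the 90 nucleotides behind an amino-acid 30-mer are all correct is only about 40%. The function name is hypothetical.

```python
def error_free_probability(n_nucleotides, error_rate=0.01):
    """Probability that a stretch has no sequencing errors (independence assumed)."""
    return (1.0 - error_rate) ** n_nucleotides


# An amino-acid 30-mer is encoded by 3 * 30 = 90 nucleotides.
p_30mer = error_free_probability(3 * 30)
print(f"{p_30mer:.3f}")
```

Some errors are synonymous and leave the translation unchanged, so this understates the fraction of correct 30-mers, but it shows why a 30-mer seen in only one EST deserves suspicion.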

Compressed EST Peptide Sequence Database

• For all ESTs mapped to a UniGene gene:
  • Six-frame translation
  • Eliminate ORFs < 30 amino-acids
  • Eliminate amino-acid 30-mers observed once
  • Compress to C2 FASTA database

• Complete, Correct for amino-acid 30-mers

• Gene-centric peptide sequence database:
  • Size: < 3% of naïve enumeration, 20774 FASTA entries
  • Running time: ~ 1% of naïve enumeration search
  • E-values: ~ 2% of naïve enumeration search results
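The first three pipeline steps above can be sketched as below. This is an illustration under assumptions, not the actual pipeline: an ORF is taken to be any stop-to-stop stretch, the standard genetic code is used, codons containing ambiguous bases become "X", and the names `six_frame_orfs` and `repeated_kmers` are hypothetical. The final compression step is not shown.

```python
from collections import Counter
from itertools import product

# Standard genetic code, codons enumerated in TCAG order.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): a for c, a in zip(product(BASES, repeat=3), AMINO)}


def reverse_complement(dna):
    return dna.translate(str.maketrans("ACGT", "TGCA"))[::-1]


def six_frame_orfs(dna, min_len=30):
    """Translate all six frames and keep stop-free stretches >= min_len."""
    orfs = []
    for strand in (dna, reverse_complement(dna)):
        for frame in range(3):
            protein = "".join(
                CODON_TABLE.get(strand[i:i + 3], "X")
                for i in range(frame, len(strand) - 2, 3)
            )
            orfs.extend(orf for orf in protein.split("*") if len(orf) >= min_len)
    return orfs


def repeated_kmers(orfs, k=30):
    """Amino-acid k-mers seen at least twice across all ORFs."""
    counts = Counter(
        orf[i:i + k] for orf in orfs for i in range(len(orf) - k + 1)
    )
    return {kmer for kmer, n in counts.items() if n >= 2}


if __name__ == "__main__":
    # Toy input with small thresholds so the output is readable.
    print(six_frame_orfs("ATGGCCTAA", min_len=2))
    print(sorted(repeated_kmers(["ACDEF", "CDEFG"], k=3)))
```

Requiring each 30-mer to be seen at least twice is what filters out most sequencing errors while keeping every peptide supported by two or more ESTs.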

SBH-graph

ACDEFGI, ACDEFACG, DEFGEFGI
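For readers without the figure, an SBH-graph (a de Bruijn-style graph) over the three example sequences can be built as below. The slides do not state the k used in the drawing; k = 3 is an assumption chosen to keep the toy small, and `sbh_graph` is a hypothetical name. Nodes are (k-1)-mers, each k-mer is an edge, and the count records how often the k-mer occurs.

```python
from collections import Counter


def sbh_graph(sequences, k):
    """Map each edge (prefix (k-1)-mer, suffix (k-1)-mer) to its k-mer count."""
    edges = Counter()
    for seq in sequences:
        for i in range(len(seq) - k + 1):
            kmer = seq[i:i + k]
            edges[(kmer[:-1], kmer[1:])] += 1
    return edges


if __name__ == "__main__":
    graph = sbh_graph(["ACDEFGI", "ACDEFACG", "DEFGEFGI"], k=3)
    for (src, dst), count in sorted(graph.items()):
        print(f"{src} -> {dst}  (k-mer {src + dst[-1]}, count {count})")
```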

Compressed SBH-graph

ACDEFGI, ACDEFACG, DEFGEFGI
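Compression can be sketched as merging every non-branching chain of edges into one edge labeled with the sequence it spells. This is one common way to compress such graphs and may differ in detail from the construction used in the talk; k = 3 and the name `compressed_edges` are assumptions, and cycles made up entirely of non-branching nodes are skipped for brevity.

```python
from collections import Counter, defaultdict


def compressed_edges(sequences, k):
    """Labels of the edges left after merging non-branching chains."""
    # Distinct edges of the uncompressed graph: (k-1)-mer -> (k-1)-mer.
    pairs = {
        (seq[i:i + k - 1], seq[i + 1:i + k])
        for seq in sequences
        for i in range(len(seq) - k + 1)
    }
    out, indegree = defaultdict(list), Counter()
    for src, dst in pairs:
        out[src].append(dst)
        indegree[dst] += 1
    nodes = {n for pair in pairs for n in pair}

    def is_internal(node):
        # Exactly one way in and one way out: safe to merge through.
        return indegree[node] == 1 and len(out[node]) == 1

    labels = []
    for src in nodes:
        if is_internal(src):
            continue  # chains are started only from branching or end nodes
        for dst in out[src]:
            label = src + dst[-1]
            while is_internal(dst):
                dst = out[dst][0]
                label += dst[-1]
            labels.append(label)
    return sorted(labels)


if __name__ == "__main__":
    print(compressed_edges(["ACDEFGI", "ACDEFACG", "DEFGEFGI"], k=3))
```

With these assumptions the ten k-mer edges of the toy example collapse to six labeled edges.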

Sequence Databases & CSBH-graphs

• Original sequences correspond to paths

ACDEFGI, ACDEFACG, DEFGEFGI

Sequence Databases & CSBH-graphs

• All k-mers represented by an edge have the same count

[Figure: compressed SBH-graph with edge counts 2, 2, 1, 2, 1]

cSBH-graphs

• Quickly determine the k-mers that occur at least twice

[Figure: compressed SBH-graph with edge counts 2, 2, 1, 2]

Compressed-SBH-graph

ACDEFGI

[Figure: compressed SBH-graph with edge counts 2, 2, 1, 2]

Compressed EST Database

• Gene-centric compressed EST peptide sequence database
  • 20,774 sequence entries
  • ~8Gb vs 223 Mb
  • ~35 fold compression

• 22 hours becomes 15 minutes
  • E-values improve by similar factor!

• Makes routine EST searching feasible
  • Search ESTs instead of IPI?
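The quoted figures can be checked with simple arithmetic. The only assumption is the unit conversion (1 Gb taken as 1000 Mb); the variable names are illustrative.

```python
raw_size_mb = 8 * 1000        # "~8Gb", assuming 1 Gb = 1000 Mb
compressed_size_mb = 223

before_minutes = 22 * 60      # 22 hours
after_minutes = 15

compression = raw_size_mb / compressed_size_mb
speedup = before_minutes / after_minutes
time_fraction = after_minutes / before_minutes

print(f"compression: {compression:.1f}x")
print(f"speed-up: {speedup:.0f}x ({time_fraction:.1%} of the original time)")
```

The size ratio comes out close to the quoted ~35 fold, and the search time drops to roughly 1% of the original.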

Back to the lab...

• Current LC/MS/MS workflows identify a few peptides per protein
  • ...not sufficient for protein isoforms

• Need to raise the sequence coverage to (say) 80%
  • ...protein separation prior to LC/MS/MS analysis

• Potential for database of splice sites of (functional) proteins!
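Sequence coverage, as used above, is the fraction of a protein's residues that fall inside at least one identified peptide. A minimal sketch, assuming exact substring matches and a hypothetical function name:

```python
def sequence_coverage(protein, peptides):
    """Fraction of residues covered by at least one identified peptide."""
    if not protein:
        return 0.0
    covered = [False] * len(protein)
    for peptide in peptides:
        start = protein.find(peptide)
        while start != -1:                      # mark every occurrence
            for i in range(start, start + len(peptide)):
                covered[i] = True
            start = protein.find(peptide, start + 1)
    return sum(covered) / len(protein)


if __name__ == "__main__":
    # Toy sequence and peptides for illustration only.
    print(sequence_coverage("ACDEFGHIKL", ["ACD", "GHI"]))
```

With only a few peptides per protein, most residues stay uncovered, so isoform-specific regions such as splice junctions are usually not observed.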

Conclusions

• Good informatics gets the most out of proteomics data

• Proteomics may be useful for genome annotation

• Peptides identify more than just proteins

• Compressed peptide sequence databases make routine EST searching feasible

Acknowledgements

• Chau-Wen Tseng, Xue Wu
  • UMCP Computer Science

• Catherine Fenselau
  • UMCP Biochemistry

• Calibrant Biosystems

• PeptideAtlas, HUPO PPP, X!Tandem

• Funding: National Cancer Institute