L529 - Presentation PROTEOMICS - Yogita Mantri -Arvind Gopu 11/10/2003



Introduction – What is Proteomics?

“The identification, characterization and quantification of all proteins involved in a particular pathway, organelle, cell, tissue, organ or organism that can be studied in concert to provide accurate and comprehensive data about that system.”

http://www.inproteomics.com/prodef.html


Central lesson from eukaryotic genome projects

Evolutionary complexity is not primarily determined by increasing the number of genes, but by increasing variation on the level of the synthesized proteins.

This is achieved by generating MULTIPLE proteins from only ONE gene, e.g. by:
- different combinations of exons through alternative splicing
- post-translational protein processing (e.g. cleavage of pro-peptides)
- post-translational protein modifications (e.g. acetylation, glycosylation)

Modified central dogma: DNA --> RNA --> protein(s)

It is important to perform analyses on the level of gene PRODUCTS.


Key advantage of proteomics
Researchers work on the level of gene products and deal with genes that are really expressed to give a detectable PRODUCT, not genes that are merely "expressed" in the sense that they produce a detectable mRNA, where it is not clear whether there is a gene product or not.

Key limitation of proteomics
Usually, only a fraction of the proteins synthesized can be detected in a proteomics experiment, whereas the expression of ALL genes can be monitored in a whole-genome array experiment.

Key prerequisite of proteomics
A genome sequence for the investigated organism, or at least a collection of many cDNA sequences, is required.


Experimental Background

Mass Spectrometry


What is Mass Spec?

- Analytical tool measuring the molecular weight (MW) of a sample
- Only picomolar concentrations required
- Accurate to within 0.01% of the molecular weight of the sample, and to within 5 ppm for small organic molecules
- For an Mr of 40 kDa, this corresponds to a 4 Da error
- This means it can detect amino acid substitutions / post-translational modifications
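
The accuracy figures above can be checked with a little arithmetic. The sketch below is only an illustration: the function names and the 300 Da small molecule are assumptions, not values from the slides.

```python
def absolute_error_da(mass_da: float, relative_accuracy: float) -> float:
    """Absolute mass error in Da for a fractional (not percent) accuracy."""
    return mass_da * relative_accuracy


def ppm_to_fraction(ppm: float) -> float:
    """Convert parts-per-million to a plain fraction."""
    return ppm * 1e-6


# 0.01 % of a 40 kDa protein, as quoted on the slide
protein_error = absolute_error_da(40_000.0, 0.01 / 100)

# 5 ppm on a hypothetical 300 Da small organic molecule (assumed value)
small_molecule_error = absolute_error_da(300.0, ppm_to_fraction(5))

print(f"40 kDa protein at 0.01 %: +/- {protein_error:.1f} Da")
print(f"300 Da molecule at 5 ppm: +/- {small_molecule_error:.4f} Da")
```

For comparison, a phosphorylation adds roughly 80 Da, which is well outside the 4 Da window for a 40 kDa protein.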


What sort of info is returned?

Structural information can be generated
- Particularly using tandem mass spectrometers
- Fragment sample & analyse products
- Useful for peptide & oligonucleotide sequencing
- Plus identification of individual compounds in complex mixtures


How does a Mass Spectrometer work?

3 fundamental parts: the ionisation source, the analyser, the detector

- Samples are easier to manipulate if ionised
- Separation in the analyser according to mass-to-charge ratios (m/z)
- Detection of separated ions and their relative abundance
- Signals sent to the data system and formatted in an m/z spectrum
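
Since the analyser separates ions by m/z rather than by mass, it may help to see how the two relate. This is a minimal sketch that assumes positive-mode ionisation by protonation (each charge comes from one added proton); the function names are illustrative.

```python
PROTON_MASS = 1.007276  # Da, mass of a proton


def mz(neutral_mass: float, charge: int) -> float:
    """m/z of a molecule that carries `charge` extra protons."""
    if charge <= 0:
        raise ValueError("charge must be a positive integer")
    return (neutral_mass + charge * PROTON_MASS) / charge


def neutral_mass_from_mz(mz_value: float, charge: int) -> float:
    """Invert mz(): recover the neutral mass from an observed m/z."""
    return charge * (mz_value - PROTON_MASS)


# The same 1000 Da peptide appears at different m/z for each charge state
for z in (1, 2, 3):
    print(z, round(mz(1000.0, z), 4))
```

The same molecule therefore shows up at several positions in the spectrum, one per charge state.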


Simplified Schematic

The analyser, detector and ionisation source are under high vacuum to allow unhindered movement of ions

Operation is under complete data system control


Schematic of a typical TOF-MS/MS


Sample Introduction & Ionisation

Direct into the ionisation source or via chromatography for component separation (HPLC, GC, capillary electrophoresis)

Ionisation can produce positively charged ions (for proteins) or negatively charged ions (for saccharides and oligonucleotides)


Ionisation methods
- Atmospheric Pressure Chemical Ionisation (APCI)
- Chemical Ionisation (CI)
- Electron Impact (EI)
- Electrospray Ionisation (ESI)
- Fast Atom Bombardment (FAB)
- Field Desorption / Field Ionisation (FD/FI)
- Matrix Assisted Laser Desorption Ionisation (MALDI) (Clemmer Group)
- Thermospray Ionisation


Detection & Recording of Ions

Detector monitors ion current, amplifies it and then transmits signal to data system

Common detectors: photomultiplier, electron multiplier, micro-channel plate


Mass spectrometry is a very powerful method to analyse the structure of organic compounds, but suffers from 3 major limitations:

- Compounds cannot be characterised without clean samples
- The technique cannot provide sensitive and selective analysis of complex mixtures
- For big molecules like peptides, spectra are very complex and very difficult to interpret


Tandem MS or MS/MS has 2 mass spectrometers in series.

The first mass spectrometer (MS1) is used to SELECT, from the primary ions, those of a particular m/z value, which then pass into the Fragmentation Region. The ion selected by MS1 is the parent ion and can be a molecular ion resulting from the primary fragmentation. DISSOCIATION occurs in the fragmentation region. The daughter ions are analysed in the second spectrometer (MS2). In effect, MS1 can be viewed as an ion source for MS2.

[Schematic: MS1, fragmentation region, MS2 in series]
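
The selection step performed by MS1 can be pictured as a filter on m/z. The sketch below is a simplification under stated assumptions: ions are plain (m/z, intensity) pairs, the isolation window is a hypothetical parameter, and the dissociation itself is not modelled.

```python
from typing import List, Tuple

Ion = Tuple[float, float]  # (m/z, intensity)


def select_precursor(ions: List[Ion], target_mz: float, window: float = 0.5) -> List[Ion]:
    """Keep only ions whose m/z lies within +/- window of the target m/z."""
    return [ion for ion in ions if abs(ion[0] - target_mz) <= window]


# Hypothetical primary ions leaving the ionisation source
primary_ions = [(445.12, 1500.0), (501.01, 8200.0), (501.51, 4100.0), (733.40, 900.0)]

# MS1 passes only the chosen parent ion on to the fragmentation region
selected = select_precursor(primary_ions, target_mz=501.01, window=0.25)
print(selected)
```

Everything outside the window is discarded, which is why MS1 can be thought of as an ion source for MS2.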


Peptide Sequencing
- Peptides of 2.5 kDa or less give the best data
- Protein sample often taken from 2-D gels and digested
- A protein digest can be analysed as an entire mix
- Initial MS spectrum showing the Mr of all components in the digest (peptide map) may be enough for a database search and identification
- Peptides are fragmented along the amino acid backbone in tandem mass spectrometry
- Some peptides generate enough info for a full sequence; others only generate partial sequences of 4-5 amino acids
- Often this “tag” sequence is sufficient for database identification
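
Fragmentation along the backbone produces ladders of prefix and suffix fragments. The sketch below computes singly charged b-ion (prefix) and y-ion (suffix) m/z values from standard monoisotopic residue masses; the b/y naming is standard practice rather than something the slides spell out, and the function name is illustrative.

```python
# Monoisotopic residue masses in Da
RESIDUE_MASS = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276,
    "V": 99.06841, "T": 101.04768, "C": 103.00919, "L": 113.08406,
    "I": 113.08406, "N": 114.04293, "D": 115.02694, "Q": 128.05858,
    "K": 128.09496, "E": 129.04259, "M": 131.04049, "H": 137.05891,
    "F": 147.06841, "R": 156.10111, "Y": 163.06333, "W": 186.07931,
}
WATER = 18.010565   # Da
PROTON = 1.007276   # Da


def fragment_ions(peptide: str):
    """Return (b_ions, y_ions) as singly charged m/z lists.

    b_ions[k] is the prefix of length k + 1; y_ions[k] is the
    complementary suffix, so y_ions runs from longest to shortest.
    """
    masses = [RESIDUE_MASS[aa] for aa in peptide]
    b_ions, y_ions = [], []
    for i in range(1, len(masses)):
        b_ions.append(sum(masses[:i]) + PROTON)
        y_ions.append(sum(masses[i:]) + WATER + PROTON)
    return b_ions, y_ions


b, y = fragment_ions("GAS")
print([round(m, 4) for m in b])
print([round(m, 4) for m in y])
```

Consecutive peaks in one ladder differ by exactly one residue mass, which is what makes reading a partial sequence "tag" possible.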


Data Analysis


Common Data Analysis - Pipeline


Issue #1 (Relatively Minor?)

Diverse set of Mass Spectrometers…
- More flexibility, BUT ...
- Different data formats
- Limited data analysis possible
- Exchange of RAW datasets and creation of public repositories for the data/software? Not easy, if not impossible


Work Around for Issue #1?

To get around this problem:
- Convert to ASCII text - speed and loss of precision can be an issue
- Other formats specific to this field
- A lot of XML-based file formats seem to be floating around
- Of course, using an XML format (for example) gives rise to an additional level of complexity -- parsers, formatters, etc.
- It does add flexibility between data formats
- Indexing techniques are used to speed up access
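
The loss of precision from an ASCII conversion depends on how many decimal places are written out. A rough sketch, with a made-up m/z value and illustrative helper names:

```python
def ascii_round_trip(value: float, decimals: int) -> float:
    """Write a value as fixed-point text and parse it back."""
    return float(f"{value:.{decimals}f}")


def ppm_error(true_value: float, stored_value: float) -> float:
    """Relative error of the stored value in parts per million."""
    return abs(stored_value - true_value) / true_value * 1e6


mz_value = 501.007276  # hypothetical measurement

for decimals in (1, 2, 4, 6):
    stored = ascii_round_trip(mz_value, decimals)
    print(decimals, stored, round(ppm_error(mz_value, stored), 3), "ppm")
```

With only two decimals the rounding error alone is already several ppm, comparable to the instrument accuracy quoted earlier for small molecules.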


Issue #2 (Much bigger!)

- Data Size
- Higher Dimensionality
- The combination is even deadlier!

More detail in a minute … Before that … The LC/MSMS spectrum data looks like this:

LC | Drift | TOF | Intensity (i.e., 3-D + Intensity)


Issue #2 (Continued…)

As a first step in data analysis: find peaks in the LC/MSMS data

- "Peaks" is kind of a misnomer. Center of mass (or something like that) is a better term.
- Illustrates the inherent non-uniformity within proteomics circles
- Easier said than done, as we found out!

Let us start with the simpler case of finding peaks in 2-D data – a little more complicated than 1-D …


Peak Finding – 2-D data

http://www.cs.nott.ac.uk/~gxk/aim/notes/hillclimbing.doc
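
Hill climbing, the technique the link above describes, can be sketched on a small 2-D intensity grid. This is an illustrative version only: the grid values are invented and the 8-neighbour move rule is an assumption.

```python
from typing import List, Tuple


def hill_climb(grid: List[List[float]], start: Tuple[int, int]) -> Tuple[int, int]:
    """Move to the highest of the 8 neighbours until none is strictly higher."""
    rows, cols = len(grid), len(grid[0])
    r, c = start
    while True:
        best = (r, c)
        for dr in (-1, 0, 1):
            for dc in (-1, 0, 1):
                nr, nc = r + dr, c + dc
                inside = 0 <= nr < rows and 0 <= nc < cols
                if inside and grid[nr][nc] > grid[best[0]][best[1]]:
                    best = (nr, nc)
        if best == (r, c):
            return best  # local maximum reached
        r, c = best


grid = [
    [1, 2, 1, 0, 0],
    [2, 5, 2, 0, 1],
    [1, 2, 1, 2, 3],
    [0, 0, 2, 7, 3],
    [0, 1, 3, 3, 2],
]

print(hill_climb(grid, (0, 0)))
print(hill_climb(grid, (4, 4)))
```

Different starting points end on different local maxima, so finding all peaks needs many starts, which hints at why the problem gets harder in more dimensions.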


Peak Finding - Higher Dimensions?

As mentioned earlier, the data is of the form: LC | Drift | TOF | Intensity (i.e., 3-D + Intensity)

Add the huge data size to this and you get a sense of how difficult the problem is


Some Possible Solutions

Solutions we thought about:
- Find peaks using a brute force approach
  - Not computationally feasible in terms of time and memory
- Squeeze 3-D data into 2-D, find peaks and then work backwards
  - This is the algorithm implemented by Frank - one of the IU Chemistry folks
- Use existing implementations of graph functions available in packages (for example: LEDA) to preprocess the data and then find peaks on the smaller data set


Our Peak Finding Algorithm

- Used the LEDA package for C++
- Specifically, made use of its O(n log n) implementation of the Delaunay Triangulation neighbor-finding algorithm in 3-D space
- Once neighbors were found, did a brute force peak finding step
- How good were our results?
- More details? Take a look at our summer presentation at Chemistry

Sample of the data … What does it look like?
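
The "find neighbors, then check for a local maximum" idea can be illustrated in a few lines. Note the swap: this sketch does not use LEDA or a Delaunay triangulation; it finds neighbors with a plain O(n^2) distance check, which is exactly the cost the triangulation is meant to avoid. The points, the radius, and the assumption that the three axes are already scaled to comparable units are all hypothetical.

```python
from math import dist
from typing import List, Tuple

Point = Tuple[float, float, float, float]  # (LC, drift, TOF, intensity)


def find_peaks(points: List[Point], radius: float) -> List[Point]:
    """Return points with no strictly more intense neighbour within `radius`."""
    peaks = []
    for i, p in enumerate(points):
        is_peak = True
        for j, q in enumerate(points):
            if i != j and dist(p[:3], q[:3]) <= radius and q[3] > p[3]:
                is_peak = False
                break
        if is_peak:
            peaks.append(p)
    return peaks


points = [
    (1.0, 1.0, 1.0, 10.0),
    (1.0, 1.0, 2.0, 50.0),
    (1.0, 2.0, 2.0, 20.0),
    (5.0, 5.0, 5.0, 5.0),
    (5.0, 5.0, 6.0, 30.0),
]

print(find_peaks(points, radius=1.5))
```

Replacing the inner loop with a precomputed neighbor list (for example from a triangulation) leaves the peak test itself unchanged.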


Peptide Assignment

Find sequence of amino acids that can generate the list of masses seen in the tandem MS scan.

Many different strategies:
- Searching MS/MS spectra against a sequence database (Sequest, Mascot, etc.)
- De novo sequencing (no database!)
- Hybrid
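
To give a feel for the de novo strategy, the sketch below reads a sequence off an idealised, complete, noise-free ladder of singly charged b-ion peaks by matching the gaps between consecutive peaks to residue masses. Real spectra are far messier; the residue table is a deliberately small subset (isoleucine is omitted because it has the same mass as leucine), and the tolerance is an assumed value.

```python
# Subset of monoisotopic residue masses in Da (illustrative, not complete)
RESIDUE_MASS = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276,
    "V": 99.06841, "T": 101.04768, "L": 113.08406, "K": 128.09496,
    "E": 129.04259, "F": 147.06841, "R": 156.10111,
}
PROTON = 1.007276  # Da


def read_b_ladder(b_ions, tolerance=0.01):
    """Translate gaps between consecutive b-ion peaks into residues.

    A gap that matches no single residue (e.g. a missing peak) becomes '?'.
    """
    sequence = []
    previous = PROTON  # an empty prefix would sit at the proton mass
    for peak in sorted(b_ions):
        gap = peak - previous
        matches = [aa for aa, mass in RESIDUE_MASS.items()
                   if abs(mass - gap) <= tolerance]
        sequence.append(matches[0] if len(matches) == 1 else "?")
        previous = peak
    return "".join(sequence)


print(read_b_ladder([58.028736, 129.065846, 216.097876]))
print(read_b_ladder([58.028736, 216.097876]))  # one peak missing
```

A database search works in the opposite direction: it predicts ladders like this for candidate peptides and scores how well they match the observed spectrum.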


Scoring Peptide Sequences

- Multiple search engines are available, e.g. Sequest and Mascot
- They use different scoring algorithms
- Search outputs are not comparable
- Search outputs usually require expert validation …


An example of scoring system: SCOPE

- Probabilistic model for scoring tandem MS spectra against a peptide database
- Two-stage model
- Uses dynamic programming
- Incorporates fragment ion probabilities, noisy spectra and instrument measurement error
- Details: http://bioinformatics.oupjournals.org/cgi/screenpdf/17/suppl_1/S13.pdf (Scoring Spectra section)


Peptide Validation

Validate peptide assignments made during the database search step.

Obviously, the method used should be standardized and independent of the experimental and computational methods used


Manual Validation

Filtering by database search scores

Problems:
- Filtering criteria vary among researchers
- Error rates are unknown
- Possible only on very small datasets


Model Based Validation?

Empirical Statistical Model to estimate accuracy … Anal. Chem 2002, 74, 5383 – 5392

Employs Expectation Maximization and Machine Learning techniques

Learns to distinguish correct from incorrect database search results


Model Based Validation – EM algorithm

Each peptide assignment evaluated w.r.t. all other assignments including incorrect ones

Denote correct and incorrect assignments as (+) and (-), and the search scores as x_1, x_2, …, x_s. Then:

\[
P(+ \mid x_1, x_2, \ldots, x_s) = \frac{P(x_1, x_2, \ldots, x_s \mid +)\, P(+)}{\sum_{i \in \{+,-\}} P(x_1, x_2, \ldots, x_s \mid i)\, P(i)}
\]


Model Based Validation – EM algorithm (Continued …)

Replace the search scores with a discriminant function F:

\[
P(+ \mid F) = \frac{P(F \mid +)\, P(+)}{\sum_{i \in \{+,-\}} P(F \mid i)\, P(i)}
\]

- Bunch of probabilistic parameters considered
- Ended up approximating the distributions with Gaussian and Gamma distributions (more details are out of the scope of this presentation; please refer to the paper)
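
The posterior P(+ | F) can be evaluated directly once the two score distributions are fixed. The sketch below assumes a Gaussian for correct assignments and a Gamma for incorrect ones; the slide does not say which distribution goes with which class, and every parameter value here is made up. In the EM procedure these parameters and the prior would be re-estimated from the data in each iteration, which is not shown.

```python
import math


def gaussian_pdf(x: float, mean: float, sd: float) -> float:
    """Density of a normal distribution."""
    return math.exp(-0.5 * ((x - mean) / sd) ** 2) / (sd * math.sqrt(2 * math.pi))


def gamma_pdf(x: float, shape: float, scale: float) -> float:
    """Density of a gamma distribution (zero for non-positive x)."""
    if x <= 0:
        return 0.0
    return x ** (shape - 1) * math.exp(-x / scale) / (math.gamma(shape) * scale ** shape)


def posterior_correct(f: float, prior_correct: float,
                      mean: float, sd: float,
                      shape: float, scale: float) -> float:
    """P(+ | F) by Bayes' rule with two classes, (+) and (-)."""
    numerator = gaussian_pdf(f, mean, sd) * prior_correct
    denominator = numerator + gamma_pdf(f, shape, scale) * (1.0 - prior_correct)
    return numerator / denominator if denominator > 0 else 0.0


# Hypothetical parameters: correct ~ N(4, 1), incorrect ~ Gamma(2, 0.5), P(+) = 0.3
params = dict(prior_correct=0.3, mean=4.0, sd=1.0, shape=2.0, scale=0.5)

for f in (0.5, 2.0, 5.0):
    print(f, round(posterior_correct(f, **params), 4))
```

A high discriminant score pushes the posterior towards 1 and a low score towards 0, which replaces a hand-picked score cutoff with a probability.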


Example of Automated Validation

An example: ProteinProphet
- Computes probabilities that peptides assigned to MS/MS spectra are correct
- Learns distributions of search scores and peptide properties among correct and incorrect results
- The computed probabilities are claimed to be a true measure of confidence!
- Combines probabilities of peptides assigned to MS/MS spectra to compute the probability that the corresponding proteins are present in the sample
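
One simple way to combine peptide probabilities into a protein probability is to treat each peptide identification as independent evidence, so the protein is absent only if every assignment is wrong. This is a simplified sketch under that independence assumption, not a description of what the tool actually implements; the function name and inputs are illustrative.

```python
from math import prod
from typing import Iterable


def protein_probability(peptide_probs: Iterable[float]) -> float:
    """P(protein present) = 1 - product of (1 - p_i) over its peptides."""
    return 1.0 - prod(1.0 - p for p in peptide_probs)


# Two hypothetical peptides assigned to the same protein
print(protein_probability([0.9, 0.5]))

# A single weak peptide gives a weak protein call
print(protein_probability([0.2]))
```

Several moderately confident peptides can therefore add up to a confident protein identification.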


Interpretation

Assign a biological meaning to the output of the pipeline


Current Issues and Challenges


Slide adapted from http://www.ciphergen.com/tech_doc11.2.html

After Proteomics…..

Functional Genomics

ProteinChipTM.


Limitations of Proteomics

Experimental limitations: Large-scale protein analysis is difficult because:
- Proteins are fragile
- They can exist in multiple isoforms
- There is no protein equivalent of PCR for amplification of a small sample


Data Analysis Limitations:
- Data contains a lot of noise that is difficult to separate from the actual signal. This results in computing resources being wasted on searching for unlikely spectra.
- Database searches for matching spectra only give scores, so manual intervention is still necessary to eliminate false positives.


Biomedical limitations

- In practice, it is very difficult to trace the complete progression of a disease.
- Hence, using proteomics to monitor the biochemistry of a disease is like using a still camera to record a football match.
- Case of breast cancer research:

http://www.mcponline.org/cgi/reprint/2/5/281.pdf


References and Further Reading


- Explains the whole process nicely (article): http://swehsc.pharmacy.arizona.edu/analysis/Proteomics_News.htm
- Mascot home page, help section: http://www.matrixscience.com/help_index.html
- Presentation about MS/MS data: http://sashimi.sourceforge.net/extra/oral.pdf
- http://www.genetik.uni-bielefeld.de/D1E33C76A7CCA010AAD3B435B51404EE/Genome_Research_WS2002_03/stunde_ws0203_10.pd
- http://bmbus6.leeds.ac.uk/mres/5130/MassSpectrometry.ppt
- Some info about drug discovery, economic issues and such: http://monod.uwaterloo.ca/cs798/proteomics.pdf
- Paper on interpreting MS/MS data: http://chem-ncms.unl.edu/asms2003/kurt.pdf
- How to estimate correctness of MS/MS predictions (EM): http://www.proteomecenter.org/PDFs/Keller.AnalChem.02.pdf


- http://www.nature.com/cgi-taf/DynaPage.taf?file=/nbt/journal/v21/n3/full/nbt0303-221.html
- http://www.esainc.com/MolecularProteomics/molecular_proteomics.htm
- Others: http://genome.ucsd.edu/classes/be202/ppt/11
- Delaunay Triangulation: http://almond.srv.cs.cmu.edu/afs/cs/project/quake/public/www/triangle.delaunay.html
- SCOPE paper (screen PDF): http://bioinformatics.oupjournals.org/cgi/screenpdf/17/suppl_1/S13.pdf


Internet sites

- www.astbury.leeds.ac.uk/Facil/MStut/mstutorial.htm (Dr Alison E. Ashcroft at Leeds)
- www.asms.org (The American Society for Mass Spectrometry)
- www.spectroscopynow.com (Base Peak)

Mass Spec tools:
- www.expasy.ch/tools/#proteome
- http://prowl.rockefeller.edu
- www.mann.embl-heidelberg.de


Bibliography

Internet sites :

• http://www.google.com
• http://www.bmss.org.uk/what_is/whatis.html
• http://www.duke.edu/~mdfeezor/NSHome/inform/msms1.html
• http://www.astbury.leeds.ac.uk/Facil/MStut/mstutorial.htm
• http://ms.mc.vanderbilt.edu/tutorials/ms/3.htm
• http://www.garvan.unsw.edu.au/public/corthals/book/IPMS.html
• http://www.micromass.co.uk/basics/Glossary.html


Ionization Methods

Further Reading

1. For MALDI beginner: http://www.srsmaldi.com/Maldi/Guide.html
2. For MALDI lab user: http://www.srsmaldi.com/Maldi/Lab.html
3. For MALDI tutorial: http://ms.mc.vanderbilt.edu/tutorials/maldi/maldi-ie_files/frame.htm
4. Ionization Methods 1: http://www.jeol.com/ms/docs/ionize.html
5. Ionization Methods 2: http://www.waters.com/Waters_Website/Applications/lcms/lcms_itq.htm


SELDI Web sites:

• Molecular Analytical Systems (MAS): http://www.seldi.org/
• Manufacturers of ProteinChip(R): http://www.ciphergen.com/