informatik anwendungen in der genom- und proteomforschung

133
funded by the German Federal Ministry for Education and Research Informatik Anwendungen in der Genom- und Proteomforschung Knut Reinert FU Berlin Mai 2006 HU Berlin, Adlershof

Upload: others

Post on 13-Jan-2022

3 views

Category:

Documents


0 download

TRANSCRIPT

Page 1: Informatik Anwendungen in der Genom- und Proteomforschung

funded by the German Federal Ministry for Education and Research

InformatikAnwendungen in der Genom-

und ProteomforschungKnut Reinert

FU Berlin

Mai 2006HU Berlin, Adlershof

Page 2: Informatik Anwendungen in der Genom- und Proteomforschung

Berlin Center for Genome Based Bioinformatics Algorithmische Bioinformatik, Institut für Informatik

Bioinformatics in the media

Page 3: Informatik Anwendungen in der Genom- und Proteomforschung

Berlin Center for Genome Based Bioinformatics Algorithmische Bioinformatik, Institut für Informatik

Topics in Bioinformatics

Gene finding:•Improve prediction of coding and regulatory regions•Comparing multiple genomes is promising

Some pictures with the courtesy of the MPI für Informatik, Saarbrücken

Page 4: Informatik Anwendungen in der Genom- und Proteomforschung

Berlin Center for Genome Based Bioinformatics Algorithmische Bioinformatik, Institut für Informatik

Topics in Bioinformatics

Find drugs that alter or inhibit the function of the target molecule.Searching data bases helps to find suitable candidates and reduce side effects.

Molcular therapieof deseases:protein-protein docking

Methotrexate bindet an dihydrofolate reductase

Page 5: Informatik Anwendungen in der Genom- und Proteomforschung

Berlin Center for Genome Based Bioinformatics Algorithmische Bioinformatik, Institut für Informatik

Topics in Bioinformatics

Identifying SNPs allows:• the association of patterns of genetic diversity to diseases • the association of genetic patterns to drug tolerance

Identifying SNPs(or other polymorphisms)

TTCGGGTTGTTCGGGTTGGGACGTGCACTAAACGTGCACTAACCTCGTCG

GACGTGCACTAAATCGCGCAACTGACGTGCACTAAATCGCGCAACTG

GGGTTGGGGTTGCCACGTGCACTAAATCGCGCAACTGACGTGCACTAAATCGCGCAACTG

TTCGGGTTGTTCGGGTTGGGACGTGACGTGTTACTAAATCGCGCAACTGACTAAATCGCGCAACTG

Page 6: Informatik Anwendungen in der Genom- und Proteomforschung

Berlin Center for Genome Based Bioinformatics Algorithmische Bioinformatik, Institut für Informatik

Inhalte der Bioinformatik

Page 7: Informatik Anwendungen in der Genom- und Proteomforschung

Berlin Center for Genome Based Bioinformatics Algorithmische Bioinformatik, Institut für Informatik

Topics in Bioinformatics

Problem (Project together with Charité):Within the first year 8-10% of the patients lose thetransplant.After 10 years about 50% have lost the transplant.Diagnosis is invasiv and leads sometimes to loss of craft.

Goal:Analyse urin samples of patients and detect as early as possible diagnostic markers to counteract craft loss.

Page 8: Informatik Anwendungen in der Genom- und Proteomforschung

Berlin Center for Genome Based Bioinformatics Algorithmische Bioinformatik, Institut für Informatik

Topics in Bioinformatics

Automated measurement methods leadto terabytes of data (Prof. Schlüter)

Page 9: Informatik Anwendungen in der Genom- und Proteomforschung

Berlin Center for Genome Based Bioinformatics Algorithmische Bioinformatik, Institut für Informatik

Mass spectrometry

Page 10: Informatik Anwendungen in der Genom- und Proteomforschung

Berlin Center for Genome Based Bioinformatics Algorithmische Bioinformatik, Institut für Informatik

Mass spectrometry

Page 11: Informatik Anwendungen in der Genom- und Proteomforschung

Berlin Center for Genome Based Bioinformatics Algorithmische Bioinformatik, Institut für Informatik

Mass spectrometry

0 250 500 750 1000m/z

EVAAFAQFGSDLDASTK

3x increased expression

Page 12: Informatik Anwendungen in der Genom- und Proteomforschung

Berlin Center for Genome Based Bioinformatics Algorithmische Bioinformatik, Institut für Informatik

Bioinformatics in Berlin

Max-Planck Institutefor Molecular Genetics

HU BerlinUniversity of

Applied Sciences

Page 13: Informatik Anwendungen in der Genom- und Proteomforschung

Berlin Center for Genome Based Bioinformatics Algorithmische Bioinformatik, Institut für Informatik

Overview

• Motivation for OpenMS• Bioinformatics Issues in quantitative

Proteomics– Signal Processing– Feature Finding– Map Mapping– Differential Quantitation– Identification, Clustering– Software Engineering, Databases

Page 14: Informatik Anwendungen in der Genom- und Proteomforschung

Berlin Center for Genome Based Bioinformatics Algorithmische Bioinformatik, Institut für Informatik

Genomics

Page 15: Informatik Anwendungen in der Genom- und Proteomforschung

Berlin Center for Genome Based Bioinformatics Algorithmische Bioinformatik, Institut für Informatik

Genomics vs. Proteomics

Same genome...

...different proteomes

Page 16: Informatik Anwendungen in der Genom- und Proteomforschung

Berlin Center for Genome Based Bioinformatics Algorithmische Bioinformatik, Institut für Informatik

Genomics vs. Proteomics

Genomics ProteomicsGenome is rather

static

~ 30 k genes

Established, fully automated technology

(capillary sequencer)

Proteome is dynamic

(age, tissue, what youhad for lunch)

Up to 2000 k Proteins

Emerging technology

(MS, HPLC/MS, protein chips)

Page 17: Informatik Anwendungen in der Genom- und Proteomforschung

Berlin Center for Genome Based Bioinformatics Algorithmische Bioinformatik, Institut für Informatik

• Transcription and translation are heavily regulated

• Protein expression levels are not static

• mRNA levels and protein levels often not correlated

• Contradictory results from seemingly similar methods

• RNA chips

• DNA chips

• gene disruption

• knock out

ProteinDNA mRNATranscription Translation

From Genes to Proteins

Anderson et al., Electrophoresis (1998), 19, 1853-61

Page 18: Informatik Anwendungen in der Genom- und Proteomforschung

Berlin Center for Genome Based Bioinformatics Algorithmische Bioinformatik, Institut für Informatik

TransportPosttranslational modification

Regulation

Exogeneous parameters(stress, temperature,

drugs, metabolites, ...)

Degradation

Folding

Proteins = End Products?

Page 19: Informatik Anwendungen in der Genom- und Proteomforschung

Berlin Center for Genome Based Bioinformatics Algorithmische Bioinformatik, Institut für Informatik

Definition of proteomics

• Proteomics [n]: the branch of genetics that

studies the full set of proteins encoded by a

genome(www.hyperdictionary.com)

• Proteomics can be defined as the qualitative

and quantitative comparison of proteomes under

different conditions to further unravel

biological processes.(www.expasy.org)

Page 20: Informatik Anwendungen in der Genom- und Proteomforschung

Berlin Center for Genome Based Bioinformatics Algorithmische Bioinformatik, Institut für Informatik

Target ID and biologicalnetworks

Page 21: Informatik Anwendungen in der Genom- und Proteomforschung

Berlin Center for Genome Based Bioinformatics Algorithmische Bioinformatik, Institut für Informatik

TrialBiol.Data

Drug Discovery Pipeline

Accelerated by

Bioinformatics and

Computational Chemistry

Target ID Lead ID Optimization Approval

Page 22: Informatik Anwendungen in der Genom- und Proteomforschung

Berlin Center for Genome Based Bioinformatics Algorithmische Bioinformatik, Institut für Informatik

TrialBiol.Data

Drug Discovery Pipeline

Accelerated by

Bioinformatics and

Computational Chemistry

Target ID Lead ID Optimization Approval

Page 23: Informatik Anwendungen in der Genom- und Proteomforschung

Berlin Center for Genome Based Bioinformatics Algorithmische Bioinformatik, Institut für Informatik

Application fields

• Diagnostics: Find relevant patterns in one-or two-dimensional LC measurements

• Time series: Analyze the temporal behaviour in a time series experiment

• Quantitative Measurements: Determineabsolute content of peptides usingadditive method (Myoglobin, Gliadin)

Page 24: Informatik Anwendungen in der Genom- und Proteomforschung

Berlin Center for Genome Based Bioinformatics Algorithmische Bioinformatik, Institut für Informatik

Differential Analysis in Proteomics

All techniques share the following steps:

1. Separation of proteins/peptides.

Typical techniques: 2D PAGE, HPLC, DC, etc.

2. Assignment of proteins/peptides for relative

quantitation

3. (Relative) Quantitation

4. Identification of peptides/proteins

Page 25: Informatik Anwendungen in der Genom- und Proteomforschung

Berlin Center for Genome Based Bioinformatics Algorithmische Bioinformatik, Institut für Informatik

Separation (2D-PAGE)

Page 26: Informatik Anwendungen in der Genom- und Proteomforschung

Berlin Center for Genome Based Bioinformatics Algorithmische Bioinformatik, Institut für Informatik

Separation (ShotgunProteomics)

• Key idea– Separation of whole proteins possible but

difficult, hence digestion preferred– Separate peptides– Identify proteins through peptides

MG

M

K

N

VQWEDSLG

G L

L

VWGM

GE G

A

I

H

R

V ED V A G G Q

E VL

FL K

TPH EGEL

K

F D K F K

H L K

ES D ME K KHA S E D L K

A

T HN GVL

T

LGGILK

KFGE L GQ

P VI K

Q SAHG LH E A E L T P H A T K I

Q

VL

Q

S

Y E

AE

LF

K

II

S

RFA L E L

GD

F

P G

A

H

M

Q

S

G

DA

A

K

N MD AA K

Y K

Peptid-digest

digestion

Proteins

Separation

M G L S D G E W Q L V L N V W G K

H P G D F G A D A Q G A M S K

Y L E F I S E A I I Q V L Q S K

G H H E A E L T P A Q S H A T K

V E A D V A G H G Q E V L RI

S E D E M K

A S E D L K

A L E L F R

E L G F Q G

N D M A A K

I P V K

H L K

F D K

L F K

F K

Y K

H K

K

G H P E T L KE

Page 27: Informatik Anwendungen in der Genom- und Proteomforschung

Berlin Center for Genome Based Bioinformatics Algorithmische Bioinformatik, Institut für Informatik

Separation 1Different peptides have different retention time

IonizationPeptide receivesz charge units

Separation 2Detector measures m/z

HPLC ESI TOF Spectrum(scan)

HPLC-MS Analysis

RT

I

Page 28: Informatik Anwendungen in der Genom- und Proteomforschung

Berlin Center for Genome Based Bioinformatics Algorithmische Bioinformatik, Institut für Informatik

Single stage MS scan

Massspectrometry

Measures mass/charge ratio of

ionized peptidesfor a short period of

time

m/z

Inte

nsit

y

Peak intensity in scan corresponds to amount present, but intensities

are not comparable!

Page 29: Informatik Anwendungen in der Genom- und Proteomforschung

Berlin Center for Genome Based Bioinformatics Algorithmische Bioinformatik, Institut für Informatik

Maps

Page 30: Informatik Anwendungen in der Genom- und Proteomforschung

Berlin Center for Genome Based Bioinformatics Algorithmische Bioinformatik, Institut für Informatik

Maps

Page 31: Informatik Anwendungen in der Genom- und Proteomforschung

Berlin Center for Genome Based Bioinformatics Algorithmische Bioinformatik, Institut für Informatik

Maps

Page 32: Informatik Anwendungen in der Genom- und Proteomforschung

Berlin Center for Genome Based Bioinformatics Algorithmische Bioinformatik, Institut für Informatik

Maps

Page 33: Informatik Anwendungen in der Genom- und Proteomforschung

Berlin Center for Genome Based Bioinformatics Algorithmische Bioinformatik, Institut für Informatik

Maps

Page 34: Informatik Anwendungen in der Genom- und Proteomforschung

Berlin Center for Genome Based Bioinformatics Algorithmische Bioinformatik, Institut für Informatik

Maps

Page 35: Informatik Anwendungen in der Genom- und Proteomforschung

Berlin Center for Genome Based Bioinformatics Algorithmische Bioinformatik, Institut für Informatik

Maps

Page 36: Informatik Anwendungen in der Genom- und Proteomforschung

Berlin Center for Genome Based Bioinformatics Algorithmische Bioinformatik, Institut für Informatik

Maps

Page 37: Informatik Anwendungen in der Genom- und Proteomforschung

Berlin Center for Genome Based Bioinformatics Algorithmische Bioinformatik, Institut für Informatik

Maps

Page 38: Informatik Anwendungen in der Genom- und Proteomforschung

Berlin Center for Genome Based Bioinformatics Algorithmische Bioinformatik, Institut für Informatik

Maps

Page 39: Informatik Anwendungen in der Genom- und Proteomforschung

Berlin Center for Genome Based Bioinformatics Algorithmische Bioinformatik, Institut für Informatik

Maps

Page 40: Informatik Anwendungen in der Genom- und Proteomforschung

Berlin Center for Genome Based Bioinformatics Algorithmische Bioinformatik, Institut für Informatik

Maps

Page 41: Informatik Anwendungen in der Genom- und Proteomforschung

Berlin Center for Genome Based Bioinformatics Algorithmische Bioinformatik, Institut für Informatik

Maps

0 250 500 750 1000m/z

EVAAFAQFGSDLDASTK

15 nmol/µl, 3x overexpressed, …

Page 42: Informatik Anwendungen in der Genom- und Proteomforschung

Berlin Center for Genome Based Bioinformatics Algorithmische Bioinformatik, Institut für Informatik

Differential Analysis

Two common basic approaches:• Isotope labeling (e.g. ICAT, MeCAT, SILAC,…)• Direct Differential Quantification (DDQ)

ICAT pairs atknown distance

diseased

normal

Page 43: Informatik Anwendungen in der Genom- und Proteomforschung

Berlin Center for Genome Based Bioinformatics Algorithmische Bioinformatik, Institut für Informatik

Isotope Labeling (ICAT)

Heavy and light ICAT reagents are 8 Dalton apart

Page 44: Informatik Anwendungen in der Genom- und Proteomforschung

Berlin Center for Genome Based Bioinformatics Algorithmische Bioinformatik, Institut für Informatik

Isotope Labeling (ICAT)

“diseased”

Cell state 1

Cell state 2

“Normal”

Label proteins with heavy ICAT

Label proteins with light ICAT

Combine

Fractionate protein prep- membrane - cytosolic Proteolysis

Isolate ICAT-labeled peptides

Nat. Biotechnol. 17: 994-999,1999

Heavy and light ICAT reagent 8 Dalton apart

Page 45: Informatik Anwendungen in der Genom- und Proteomforschung

Berlin Center for Genome Based Bioinformatics Algorithmische Bioinformatik, Institut für Informatik

Map 1 (normal) Map 2 (diseased)

Differential Analysis

Two common basic approaches:• Isotope tagging (e.g. ICAT, MeCAT)• Direct Differential Quantitation (DDQ)

diseased normal

Page 46: Informatik Anwendungen in der Genom- und Proteomforschung

Berlin Center for Genome Based Bioinformatics Algorithmische Bioinformatik, Institut für Informatik

What’s in a Map?

• Retention time (RT) for each scan

• Peptide mass/charge ratio (m/z) (usually within ~20 ppm)

• Intensity (I)

⇒ use m/z, RT to identify peptides

⇒ use I to quantify peptides

(relative quantitation only!)

Maps become HUGE (108 Peaks)!

Page 47: Informatik Anwendungen in der Genom- und Proteomforschung

Berlin Center for Genome Based Bioinformatics Algorithmische Bioinformatik, Institut für Informatik

Raw Data

FilteredRaw Data

Map

AnnotatedMaps

Proteomics Data Flow

HPLC/MSSample

Sig.Proc.

Data Reduction

Diff.Anal.

IdentificationDifferentially

ExpressedProteins

Page 48: Informatik Anwendungen in der Genom- und Proteomforschung

Berlin Center for Genome Based Bioinformatics Algorithmische Bioinformatik, Institut für Informatik

Raw Data

FilteredRaw Data

Map

AnnotatedMaps

Proteomics Data Flow

HPLC/MSSample

Sig.Proc.

Data Reduction

Diff.Anal.

IdentificationDifferentially

ExpressedProteins

10 GB

1 GB50 MB

50 MB 1 kB

Page 49: Informatik Anwendungen in der Genom- und Proteomforschung

Berlin Center for Genome Based Bioinformatics Algorithmische Bioinformatik, Institut für Informatik

HPLC/MS – Becoming the Standard?

Advantages of HPLC/MS over 2D PAGE1. Easier automation

– no robots– no gel handling

2. Better separation power– Monolithic columns– Multi-dimensional chromatography

3. Simpler coupling to MS

Page 50: Informatik Anwendungen in der Genom- und Proteomforschung

Berlin Center for Genome Based Bioinformatics Algorithmische Bioinformatik, Institut für Informatik

HPLC/MS – Becoming the Standard?

Disadvantages of HPLC/MS• Proteins are chopped up• Quantitation difficult• Huge amount of data (10 GiB/run)• Data hard to manage/interpret

⇒ Need for Informatics!

Page 51: Informatik Anwendungen in der Genom- und Proteomforschung

Berlin Center for Genome Based Bioinformatics Algorithmische Bioinformatik, Institut für Informatik

Overview

• Motivation for OpenMS• Bioinformatics Issues in quantitative

Proteomics– Signal Processing– Feature Finding– Map Mapping– Differential Quantitation– Identification, Clustering– Software Engineering, Databases

Page 52: Informatik Anwendungen in der Genom- und Proteomforschung

Berlin Center for Genome Based Bioinformatics Algorithmische Bioinformatik, Institut für Informatik

Signal Processing

Issues• Baseline reduction• Noise filtering• Calibration⇒ Increase reliability of data

m/z

inte

nsity

m/zin

tens

itym/z

inte

nsity

m/z

inte

nsity

+ + =

signal baseline noise raw datam/z

inte

nsity

m/z

inte

nsity

Eva

Lang

e

Page 53: Informatik Anwendungen in der Genom- und Proteomforschung

Berlin Center for Genome Based Bioinformatics Algorithmische Bioinformatik, Institut für Informatik

Signal Processing

m/z

inte

nsity

m/zin

tens

itym/z

inte

nsity

m/z

inte

nsity

+ + =

signal baseline noise raw data

Eva

Lang

e

Feature MapRaw Map Stick Map

Page 54: Informatik Anwendungen in der Genom- und Proteomforschung

Berlin Center for Genome Based Bioinformatics Algorithmische Bioinformatik, Institut für Informatik

Baseline Filtering

TopHat filtering determines baseline

200 400 600 800 1000 1200 1400 1600 1800 2000

-2000

0

2000

4000

6000

8000

10000

12000

14000Shots

200 400 600 800 1000 1200 1400 1600 1800 2000

1

1.5

2

2.5

3

3.5x 104 Shots

Eva Lange

Page 55: Informatik Anwendungen in der Genom- und Proteomforschung

Berlin Center for Genome Based Bioinformatics Algorithmische Bioinformatik, Institut für Informatik

Stick conversion aka Peakpicking

Issues

• „peak picking“ – Identify peaks

• „stick conversion“ – Integrate peaks to sticks

⇒ Reduce amount of data by factor of 10 – 100

m/z

inte

nsity

raw datam/z

inte

nsity

sticks

Eva

Lang

e

Page 56: Informatik Anwendungen in der Genom- und Proteomforschung

Berlin Center for Genome Based Bioinformatics Algorithmische Bioinformatik, Institut für Informatik

m/z

inte

nsity

raw datam/z

inte

nsity

sticks

Eva

Lang

e

Feature MapRaw Map Stick Map

Stick conversion aka Peakpicking

Page 57: Informatik Anwendungen in der Genom- und Proteomforschung

Berlin Center for Genome Based Bioinformatics Algorithmische Bioinformatik, Institut für Informatik

Peak picking in m/z dimensionMain objectives of a peak picking algorithm:•precise peak positions •run in real time

m/z

inte

nsity

m/z

inte

nsity

Main difficulties of a peak picking algorithm:•considerable asymmetry of peaks

•convolution of isotopic peaks

Page 58: Informatik Anwendungen in der Genom- und Proteomforschung

Berlin Center for Genome Based Bioinformatics Algorithmische Bioinformatik, Institut für Informatik

Three step algorithm

1. Peak detection2. Extract important peak parameter:

• centroid• area• height• fwhm• asymmetric peak shape• signal/noise ratio

3. Optimize the peak parameter (optional)

Page 59: Informatik Anwendungen in der Genom- und Proteomforschung

Berlin Center for Genome Based Bioinformatics Algorithmische Bioinformatik, Institut für Informatik

2230 2234 2238 2242 2246 2250

2230 2234 2238 2242 2246 2250

2230 2234 2238 2242 2246 2250

2230 2234 2238 2242 2246 2250

A

B

C

D

Idea: Split the signal into different frequency ranges

(length scales)

Method: Continuous Wavelet Transformation (CWT)

1. Peak detection

Page 60: Informatik Anwendungen in der Genom- und Proteomforschung

Berlin Center for Genome Based Bioinformatics Algorithmische Bioinformatik, Institut für Informatik

CWT is the sum over all positionsb of the signal multiplied byscaled, shifted versions of the wavelet

CWT breaks up the signal intoshifted and scaled versions of the original (mother) wavelet

Continuous wavelet transformation

m/zb

,)(')(1),(W dxa

bxxsa

abs ∫+∞

∞−

−= ϕϕ

Page 61: Informatik Anwendungen in der Genom- und Proteomforschung

Berlin Center for Genome Based Bioinformatics Algorithmische Bioinformatik, Institut für Informatik

1. Peak detection

Workflow 1. Compute the wavelettransform

2. Search for a peak‘smaximum

3. Search for peak-endpoints

4. Estimate the centroid

5. Determine the height

Page 62: Informatik Anwendungen in der Genom- und Proteomforschung

Berlin Center for Genome Based Bioinformatics Algorithmische Bioinformatik, Institut für Informatik

1. Compute the peak positions

Compute the CWT:

•use the marr wavelet

•predefined scale a (should correspond to the typical peakwidth)

maximum position in the CWT is a good first estimate of the maximum in the data even for asymmetric peaks

convolution can be computed very efficiently with pre-tabulated values for the wavelet

Page 63: Informatik Anwendungen in der Genom- und Proteomforschung

Berlin Center for Genome Based Bioinformatics Algorithmische Bioinformatik, Institut für Informatik

2. Extract important peak parameter

20

,,2

))(cosh()(0 xxw

hxSech xwh−

=

Fit functions similar to real raw mass peaks

Page 64: Informatik Anwendungen in der Genom- und Proteomforschung

Berlin Center for Genome Based Bioinformatics Algorithmische Bioinformatik, Institut für Informatik

Workflow

2. Extract important peakparameter

1. Estimate the peak‘sleft area

2. Fit a sech2 to the left

3. Estimate the peak‘sright area

4. Fit a sech2 to the right

Asymmetric peak shape

Page 65: Informatik Anwendungen in der Genom- und Proteomforschung

Berlin Center for Genome Based Bioinformatics Algorithmische Bioinformatik, Institut für Informatik

2. Extract important peak parameter

Fitting asymmetric lorentzian and sech2 functions to a rawpeak using the peak‘s :

•maximum position

•height

•area

introduces a smoothing effect

compute the peak‘s width from its area in constant time

yield very good approximations to the original peak shape

Page 66: Informatik Anwendungen in der Genom- und Proteomforschung

Berlin Center for Genome Based Bioinformatics Algorithmische Bioinformatik, Institut für Informatik

for all mass spectra s in experiment dobox_list = splitSpectrum(s)for all boxes b in box_list do

peak_list = []wb = continuousWaveletTransform(b)while getNextMaximumPosition(wb,b,x0) do

(xl,xr) = searchForPeakEndPoints(b,x0)c = estimateCentroid(xl,xr)h = intensity(x0)

(Al,Ar) = integrateAreas(xl,xr)f = fitPeakShape(Al,Ar,x0,h)

push(f,peak_list)removeRawData(xl,xr,b)wb = continuousWaveletTransform(b)

end whileoptimizePeakParameter(peak_list,b)

end forend for

Peak picking algorithm

Page 67: Informatik Anwendungen in der Genom- und Proteomforschung

Berlin Center for Genome Based Bioinformatics Algorithmische Bioinformatik, Institut für Informatik

Peak picking algorithmfor all mass spectra s in experiment do

box_list = splitSpectrum(s)for all boxes b in box_list do

peak_list = []wb = continuousWaveletTransform(b)while getNextMaximumPosition(wb,b,x0) do

(xl,xr) = searchForPeakEndPoints(b,x0)c = estimateCentroid(xl,xr)h = intensity(x0)

(Al,Ar) = integrateAreas(xl,xr)f = fitPeakShape(Al,Ar,x0,h)

push(f,peak_list)removeRawData(xl,xr,b)wb = continuousWaveletTransform(b)

end whileoptimizePeakParameter(peak_list,b)

end forend for

Page 68: Informatik Anwendungen in der Genom- und Proteomforschung

Berlin Center for Genome Based Bioinformatics Algorithmische Bioinformatik, Institut für Informatik

for all mass spectra s in experiment dobox_list = splitSpectrum(s)for all boxes b in box_list do

peak_list = []wb = continuousWaveletTransform(b)while getNextMaximumPosition(wb,b,x0) do

(xl,xr) = searchForPeakEndPoints(b,x0)c = estimateCentroid(xl,xr)h = intensity(x0)

(Al,Ar) = integrateAreas(xl,xr)f = fitPeakShape(Al,Ar,x0,h)

push(f,peak_list)removeRawData(xl,xr,b)wb = continuousWaveletTransform(b)

end whileoptimizePeakParameter(peak_list,b)

end forend for

Peak picking algorithm

Page 69: Informatik Anwendungen in der Genom- und Proteomforschung

Berlin Center for Genome Based Bioinformatics Algorithmische Bioinformatik, Institut für Informatik

for all mass spectra s in experiment dobox_list = splitSpectrum(s)for all boxes b in box_list do

peak_list = []wb = continuousWaveletTransform(b)while getNextMaximumPosition(wb,b,x0) do

(xl,xr) = searchForPeakEndPoints(b,x0)c = estimateCentroid(xl,xr)h = intensity(x0)

(Al,Ar) = integrateAreas(xl,xr)f = fitPeakShape(Al,Ar,x0,h)

push(f,peak_list)removeRawData(xl,xr,b)wb = continuousWaveletTransform(b)

end whileoptimizePeakParameter(peak_list,b)

end forend for

Peak picking algorithm

Page 70: Informatik Anwendungen in der Genom- und Proteomforschung

Berlin Center for Genome Based Bioinformatics Algorithmische Bioinformatik, Institut für Informatik

for all mass spectra s in experiment dobox_list = splitSpectrum(s)for all boxes b in box_list do

peak_list = []wb = continuousWaveletTransform(b)while getNextMaximumPosition(wb,b,x0) do

(xl,xr) = searchForPeakEndPoints(b,x0)c = estimateCentroid(xl,xr)h = intensity(x0)

(Al,Ar) = integrateAreas(xl,xr)f = fitPeakShape(Al,Ar,x0,h)

push(f,peak_list)removeRawData(xl,xr,b)wb = continuousWaveletTransform(b)

end whileoptimizePeakParameter(peak_list,b)

end forend for

Peak picking algorithm

Page 71: Informatik Anwendungen in der Genom- und Proteomforschung

Berlin Center for Genome Based Bioinformatics Algorithmische Bioinformatik, Institut für Informatik

Evaluation of a peak picking method

Compute accurate:• centroid• area• height

test against a set of known composition

Results are heavily affected by:• quality of experimental data• calibration method

Page 72: Informatik Anwendungen in der Genom- und Proteomforschung

Berlin Center for Genome Based Bioinformatics Algorithmische Bioinformatik, Institut für Informatik

Data set I: peptide mix (peptide standards mix #P2693 from Sigma Aldrich) of nine known peptides.

Measuring method: HPLC-MS using a quadrupole ion trap mass spectrometer equipped with an electrospray ion source in full scan mode (m/z 500-1500)

low resolution (Δm=0.2) and several overlapping isotope patterns

Evaluation of our peak picking method

Page 73: Informatik Anwendungen in der Genom- und Proteomforschung

Berlin Center for Genome Based Bioinformatics Algorithmische Bioinformatik, Institut für Informatik

Evaluation of our peak picking methodEvaluation:• compare with vendor supplied Bruker Data-

Analysis software (3.2) on the same spectra• no sophisticated calibration, we only allow a

constant mass offset to keep the number of fit parameters as small as possible

Performance criteria:1. number of the discovered and separated isotopic

patterns (given by at least three consecutive peaks) of each peptide in an expected retention time interval

2. average relative error of the monoisotopic mass

Page 74: Informatik Anwendungen in der Genom- und Proteomforschung

Berlin Center for Genome Based Bioinformatics Algorithmische Bioinformatik, Institut für Informatik

591 591.5 592 592.5 593 593.5 594 594.5

10x 104

#occOpenMS Apexzmasstheo

01321620.8151I

1821349.7360H

0721183.5730G

2321061.5614F

2311084.4379E

5811007.4365D

15191574.2257C

29291573.3071B

A 19221556.2693

Results: LC-ESI-ion trap data (Full scan mode (m/z 500-1500))

Data set I (Peptide mix of nine known peptides)

Relative error of centroids:

21ppm, 1.2ppm, 35ppm, 16ppm

OpenMSApex

Page 75: Informatik Anwendungen in der Genom- und Proteomforschung

Berlin Center for Genome Based Bioinformatics Algorithmische Bioinformatik, Institut für Informatik

#occOpenMS Apexzmasstheo

01321620.8151I

1821349.7360H

0721183.5730G

2321061.5614F

2311084.4379E

5811007.4365D

15191574.2257C

29291573.3071B

A 19221556.2693

Results: LC-ESI-ion trap data (Full scan mode (m/z 500-1500))

Data set I (Peptide mix of nine known peptides)

591 591.5 592 592.5 593 593.5 594 594.5

10x 104

Relative error of centroids:

21ppm, 1.2ppm, 35ppm, 16ppm

OpenMSApex

Page 76: Informatik Anwendungen in der Genom- und Proteomforschung

Berlin Center for Genome Based Bioinformatics Algorithmische Bioinformatik, Institut für Informatik

#occOpenMS Apexzmasstheo

01321620.8151I

1821349.7360H

0721183.5730G

2321061.5614F

2311084.4379E

5811007.4365D

15191574.2257C

29291573.3071B

A 19221556.2693

Results: LC-ESI-ion trap data (Full scan mode (m/z 500-1500))

Data set I (Peptide mix of nine known peptides)

591 591.5 592 592.5 593 593.5 594 594.5

10x 104

Relative error of centroids:

21ppm, 1.2ppm, 35ppm, 16ppm

OpenMSApex

Page 77: Informatik Anwendungen in der Genom- und Proteomforschung

Berlin Center for Genome Based Bioinformatics Algorithmische Bioinformatik, Institut für Informatik

rel. err. [ppm]

OpenMS Apexzmasstheo

-3721620.8151I

132821349.7360H

-1521183.5730G

645621061.5614F

121811084.4379E

942511007.4365D

21441574.2257C

16161573.3071B

A 39311556.2693

Results: LC-ESI-ion trap data (Full scan mode (m/z 500-1500))

Data set I (Peptide mix of nine known peptides)

591 591.5 592 592.5 593 593.5 594 594.5

10x 104

Relative error of centroids:

21ppm, 1.2ppm, 35ppm, 16ppm

OpenMSApex

Page 78: Informatik Anwendungen in der Genom- und Proteomforschung

Berlin Center for Genome Based Bioinformatics Algorithmische Bioinformatik, Institut für Informatik

rel. err. [ppm]

OpenMS Apexzmasstheo

-3721620.8151I

132821349.7360H

-1521183.5730G

645621061.5614F

121811084.4379E

942511007.4365D

21441574.2257C

16161573.3071B

A 39311556.2693

Results: LC-ESI-ion trap data (Full scan mode (m/z 500-1500))

Data set I (Peptide mix of nine known peptides)

591 591.5 592 592.5 593 593.5 594 594.5

10x 104

Relative error of centroids:

21ppm, 1.2ppm, 35ppm, 16ppm

OpenMSApex

Page 79: Informatik Anwendungen in der Genom- und Proteomforschung

Berlin Center for Genome Based Bioinformatics Algorithmische Bioinformatik, Institut für Informatik

rel. err. [ppm]

OpenMS Apexzmasstheo

-3721620.8151I

132821349.7360H

-1521183.5730G

645621061.5614F

121811084.4379E

942511007.4365D

21441574.2257C

16161573.3071B

A 39311556.2693

Results: LC-ESI-ion trap data (Full scan mode (m/z 500-1500))

Data set I (Peptide mix of nine known peptides)

591 591.5 592 592.5 593 593.5 594 594.5

10x 104

Relative error of centroids:

21ppm, 1.2ppm, 35ppm, 16ppm

OpenMSApex

Page 80: Informatik Anwendungen in der Genom- und Proteomforschung

Berlin Center for Genome Based Bioinformatics Algorithmische Bioinformatik, Institut für Informatik

Sanity check of our peak picking methodData set II: tryptic digest of bovine serum albumin (BSA from Sigma Aldrich).

Measuring method: MALDI-TOF using a Ultraflex II Lift mass spectrometer operated in the reflectron mode and using Panorama delayed ion extraction (m/z 500-5000)

Page 81: Informatik Anwendungen in der Genom- und Proteomforschung

Berlin Center for Genome Based Bioinformatics Algorithmische Bioinformatik, Institut für Informatik

Sanity check of our peak picking method

Evaluation:• compare with vendor supplied Bruker FlexAnalysis

software on the same spectra• no sophisticated calibration

Performance criterion:1. compute the sequence coverage using the

determined peak list as a MASCOT peptide mass fingerprinting query

Page 82: Informatik Anwendungen in der Genom- und Proteomforschung

Berlin Center for Genome Based Bioinformatics Algorithmische Bioinformatik, Institut für Informatik

80 - 93

Sequence coverageOpenMS Bruker

44%52% - 67%

Results: MALDI-TOF data Data set II (tryptic digest of bovine serum)

rel. err. [ppm]

OpenMS Bruker

95

Page 83: Informatik Anwendungen in der Genom- und Proteomforschung

Berlin Center for Genome Based Bioinformatics Algorithmische Bioinformatik, Institut für Informatik

80 - 93

Sequence coverageOpenMS Bruker

44%52% - 67%

Results: MALDI-TOF data Data set II (tryptic digest of bovine serum)

rel. err. [ppm]

OpenMS Bruker

95

Page 84: Informatik Anwendungen in der Genom- und Proteomforschung

Berlin Center for Genome Based Bioinformatics Algorithmische Bioinformatik, Institut für Informatik

9580 - 93

Sequence coverageOpenMS Bruker

44%52% - 67%

Results: MALDI-TOF data Data set II (tryptic digest of bovine serum)

rel. err. [ppm]

OpenMS Bruker

Page 85: Informatik Anwendungen in der Genom- und Proteomforschung

Berlin Center for Genome Based Bioinformatics Algorithmische Bioinformatik, Institut für Informatik

Sequence coverageOpenMS Bruker

44%52% - 67%

Results: MALDI-TOF data Data set II (tryptic digest of bovine serum)

rel. err. [ppm]

OpenMS Bruker

9580 - 93

Page 86: Informatik Anwendungen in der Genom- und Proteomforschung

Berlin Center for Genome Based Bioinformatics Algorithmische Bioinformatik, Institut für Informatik

Sequence coverageOpenMS Bruker

44%52% - 67%

9580 - 93

Results: MALDI-TOF data Data set II (tryptic digest of bovine serum)

rel. err. [ppm]

OpenMS Bruker

Page 87: Informatik Anwendungen in der Genom- und Proteomforschung

Berlin Center for Genome Based Bioinformatics Algorithmische Bioinformatik, Institut für Informatik

Overview

• Motivation for OpenMS• Bioinformatics Issues in quantitative

Proteomics– Signal Processing– Feature Finding– Map Mapping– Differential Quantitation– Identification, Clustering– Software Engineering, Databases

Page 88: Informatik Anwendungen in der Genom- und Proteomforschung

Berlin Center for Genome Based Bioinformatics Algorithmische Bioinformatik, Institut für Informatik

Feature Finding

Isotope pattern Elution profile

Peptide (feature)

Capture ALL peaks belonging to a peptide for quantitation!

Page 89: Informatik Anwendungen in der Genom- und Proteomforschung

Berlin Center for Genome Based Bioinformatics Algorithmische Bioinformatik, Institut für Informatik

Feature Finding

• A two-dimensional model has to beadjusted to the raw data

Page 90: Informatik Anwendungen in der Genom- und Proteomforschung

Berlin Center for Genome Based Bioinformatics Algorithmische Bioinformatik, Institut für Informatik

Feature Finding …

Attributes of a feature:

mass-to-charge ratio

retention time

intensity, volume

quality

… … …

Page 91: Informatik Anwendungen in der Genom- und Proteomforschung

Berlin Center for Genome Based Bioinformatics Algorithmische Bioinformatik, Institut für Informatik

Feature Finding

m/z rt

feature model isotope pattern elution profile+=

Page 92: Informatik Anwendungen in der Genom- und Proteomforschung

Berlin Center for Genome Based Bioinformatics Algorithmische Bioinformatik, Institut für Informatik

Isotope Patterns

• Natural isotopes occur with well-known abundances

• Peak intensities determined by binomialconvolution

• Depend on molecular formula of peptide

12C 98.90% 13C 1.10%14N 99.63% 15N 0.37%16O 99.76% 17O 0.04% 18O 0.20%1H 99.98% 2H 0.02%

Page 93: Informatik Anwendungen in der Genom- und Proteomforschung

Berlin Center for Genome Based Bioinformatics Algorithmische Bioinformatik, Institut für Informatik

Randomly sample peptides from protein database and compute average composition. Table with average isotope pattern for specified

peptide mass

m [Da] P(k=0)

P(k=1)

P(k=2)

P(k=3)

P(k=4)

1000 0.55 0.30 0.10 0.02 0.00

2000 0.30 0.33 0.21 0.09 0.03

3000 0.17 0.28 0.25 0.15 0.08

4000 0.09 0.20 0.24 0.19 0.12

Estimating the Isotope Pattern

Page 94: Informatik Anwendungen in der Genom- und Proteomforschung

Berlin Center for Genome Based Bioinformatics Algorithmische Bioinformatik, Institut für Informatik

Feature Models: m/z

• Isotope patterns

0

0.01

0.02

0.03

0.04

0.05

0.06

0.07

0.08

0.09

-2 -1 0 1 2 3 4 5 6

relati

ve in

tensit

y

isotopic rank

Effect of the smoothing width on an averagine isotope pattern at mass 1350

stdev = 0.10.20.30.6

Page 95: Informatik Anwendungen in der Genom- und Proteomforschung

Berlin Center for Genome Based Bioinformatics Algorithmische Bioinformatik, Institut für Informatik

Feature Models: rt

• Elution profiles– Currently modeled by a normal

distribution, other shapes possible

Page 96: Informatik Anwendungen in der Genom- und Proteomforschung

Berlin Center for Genome Based Bioinformatics Algorithmische Bioinformatik, Institut für Informatik

Peak aggregation to features yields• 1,000-fold compression in data volume• reduction to meaningful entity: a peptide• accurate quantitation: integration of all related peaks

Peak Aggregation

FeaturesRT

Page 97: Informatik Anwendungen in der Genom- und Proteomforschung

Berlin Center for Genome Based Bioinformatics Algorithmische Bioinformatik, Institut für Informatik

Peak Aggregation

Features

Feature MapRaw Map Stick Map

Clemens GröplOle Schulz-Trieglaff

Jens Joachim

Page 98: Informatik Anwendungen in der Genom- und Proteomforschung

Berlin Center for Genome Based Bioinformatics Algorithmische Bioinformatik, Institut für Informatik

Overview Data Reduction

• Noise/baseline filtering

• Peak picking, integration• Peak aggregation

– De-isotoping: joining peaks from sameisotope pattern

– De-charging: joining features from samepeptide, different charge states

⇒ Aggregated MapReduction in Data Volume: up to 104

Page 99: Informatik Anwendungen in der Genom- und Proteomforschung

Berlin Center for Genome Based Bioinformatics Algorithmische Bioinformatik, Institut für Informatik

Overview

• Motivation for OpenMS• Bioinformatics Issues in quantitative

Proteomics– Signal Processing– Feature Finding– Map Mapping– Differential Quantitation– Identification, Clustering– Software Engineering, Databases

• Subsequent talks give more details

Page 100: Informatik Anwendungen in der Genom- und Proteomforschung

Berlin Center for Genome Based Bioinformatics Algorithmische Bioinformatik, Institut für Informatik

Map Mapping – At All Levels

Data reduction Mapping

Raw Map 1 Raw Map 2

Stick Map 1 Stick Map 2

Feature Map 1 Feature Map 2 Feature Map n…

Page 101: Informatik Anwendungen in der Genom- und Proteomforschung

Berlin Center for Genome Based Bioinformatics Algorithmische Bioinformatik, Institut für Informatik

Mapping of Raw Maps

Page 102: Informatik Anwendungen in der Genom- und Proteomforschung

Berlin Center for Genome Based Bioinformatics Algorithmische Bioinformatik, Institut für Informatik

Mapping Discrete Peaks

Page 103: Informatik Anwendungen in der Genom- und Proteomforschung

Berlin Center for Genome Based Bioinformatics Algorithmische Bioinformatik, Institut für Informatik

Geometric Hashing

• Consider difference vectors from features in map 1 (“+”) to features in map 2 (“×”)

• Restrict difference vector to reasonable differences along RT and m/z dimensions

• Hash these vectors into bins (“2D histogram”)

• In sufficiently similar maps, the optimal translation will be the most frequent one

Page 104: Informatik Anwendungen in der Genom- und Proteomforschung

Berlin Center for Genome Based Bioinformatics Algorithmische Bioinformatik, Institut für Informatik

Finding the Offset Vector

Clemens GröplEva Lange

Page 105: Informatik Anwendungen in der Genom- und Proteomforschung

Berlin Center for Genome Based Bioinformatics Algorithmische Bioinformatik, Institut für Informatik

Overview

• Motivation for OpenMS• Bioinformatics Issues in quantitative

Proteomics– Signal Processing– Feature Finding– Map Mapping– Differential Quantitation– Identification, Clustering– Software Engineering, Databases

• Subsequent talks give more details

Page 106: Informatik Anwendungen in der Genom- und Proteomforschung

Berlin Center for Genome Based Bioinformatics Algorithmische Bioinformatik, Institut für Informatik

Differential Analysis

• Identify „differential peptides“• Depends on quantitation technique

– Labeling techniques: Find pairs within same map

– Unlabeled (DDQ): Assign matching pairs across maps

• Usually done manually• In HT settings this has to be done

automatically!• Reduction to features helps enormously

Page 107: Informatik Anwendungen in der Genom- und Proteomforschung

Berlin Center for Genome Based Bioinformatics Algorithmische Bioinformatik, Institut für Informatik

Assuming charge z and n Cys in a peptide,check for pairs (8n)/z Thomson apart in ONE MAP

ICAT Pair Identification

Page 108: Informatik Anwendungen in der Genom- und Proteomforschung

Berlin Center for Genome Based Bioinformatics Algorithmische Bioinformatik, Institut für Informatik

Assign pairs across two or more mapsRT of peptide may vary between maps⇒ compute suitable mapping

(linear function usually suffices)

DDQ – Peptide Assignment

Page 109: Informatik Anwendungen in der Genom- und Proteomforschung

Berlin Center for Genome Based Bioinformatics Algorithmische Bioinformatik, Institut für Informatik

Multidimensional Search

• Use efficient data structures for range searches (e.g. Kd-trees or range trees)

• Employ memory management for cached range queries to allow for large data sets (e.g. large stick maps)

• Making use of robust and efficient implementations of these data structures in CGAL (Computational Geometry Algorithms Library)

• CGAL provides standard data structures for d-dimensional computational geometry

Page 110: Informatik Anwendungen in der Genom- und Proteomforschung

Berlin Center for Genome Based Bioinformatics Algorithmische Bioinformatik, Institut für Informatik

Map Normalization

• Intensities will vary between maps, even for the same sample (e.g. volume error during injection)

• Usually this results in a constant ratio of feature intensities

• Map normalization is required to determine this ratio for accurate relative quantitation

• Assuming a sufficient similarity between two maps, the majority of features will be identical, only few will be differ in concentration between the two samples

Page 111: Informatik Anwendungen in der Genom- und Proteomforschung

Berlin Center for Genome Based Bioinformatics Algorithmische Bioinformatik, Institut für Informatik

Run several experiments on the same sample

• Too time-consuming for manual assignment

• Automatic assignment increases reliability

1. Features that show up most of the time

are real

2. Features that do not are probably noise

3. Allows statements about statistical

significance

Improve Sensitivity and Specificity

Page 112: Informatik Anwendungen in der Genom- und Proteomforschung

Berlin Center for Genome Based Bioinformatics Algorithmische Bioinformatik, Institut für Informatik

feature 1 feature 2 feature 3 feature n………

feature 1 feature 2 feature 3 feature n………

Compute scaling between pairs of maps, group them

Normalize and group measurements

Page 113: Informatik Anwendungen in der Genom- und Proteomforschung

Berlin Center for Genome Based Bioinformatics Algorithmische Bioinformatik, Institut für Informatik

Statistical analysis of differential patterns

Marker in patient Marker in control

Ole Schulz-TrieglaffTim Conrad

Page 114: Informatik Anwendungen in der Genom- und Proteomforschung

Berlin Center for Genome Based Bioinformatics Algorithmische Bioinformatik, Institut für Informatik

Overview

• Motivation for OpenMS• Bioinformatics Issues in quantitative

Proteomics– Signal Processing– Feature Finding– Map Mapping– Differential Quantitation– Identification, Clustering– Software Engineering, Databases

Page 115: Informatik Anwendungen in der Genom- und Proteomforschung

Berlin Center for Genome Based Bioinformatics Algorithmische Bioinformatik, Institut für Informatik

Peptide Identification (MS-MS)

Secondary fragmentation

Fragment Spectrum

Page 116: Informatik Anwendungen in der Genom- und Proteomforschung

Berlin Center for Genome Based Bioinformatics Algorithmische Bioinformatik, Institut für Informatik

Peptide Fragmentation

H...-HN-CH-CO-NH-CH-CO-NH-CH-CO-…OH

Ri-1 Ri Ri+1

AA residuei-1 AA residuei AA residuei+1

N-terminus C-terminus

The peptide backbone breaks to formfragments with characteristic masses.

Page 117: Informatik Anwendungen in der Genom- und Proteomforschung

Berlin Center for Genome Based Bioinformatics Algorithmische Bioinformatik, Institut für Informatik

Peptide Fragmentation

H...-HN-CH-CO-NH-CH-CO-NH-CH-CO-…OH

Ri-1 Ri Ri+1

AA residuei-1 AA residuei AA residuei+1

N-terminus C-terminus

The peptide backbone breaks to formfragments with characteristic masses.

Ionized parent peptide

H+

Page 118: Informatik Anwendungen in der Genom- und Proteomforschung

Berlin Center for Genome Based Bioinformatics Algorithmische Bioinformatik, Institut für Informatik

Peptide Fragmentation

H...-HN-CH-CO NH-CH-CO-NH-CH-CO-…OH

Ri-1 Ri Ri+1

AA residuei-1AA residuei AA residuei+1

N-terminus C-terminus

The peptide backbone breaks to formfragments with characteristic masses.

Ionized peptide fragment

H+

Page 119: Informatik Anwendungen in der Genom- und Proteomforschung

Berlin Center for Genome Based Bioinformatics Algorithmische Bioinformatik, Institut für Informatik

Peptide Ions in Spectrum

K1166

L1020

E907

D778

E663

E534

L405

F292

G145

S88 b ions

100

0250 500 750 1000 m/z

% In

tens

ity

147260389504633762875102210801166 y ions

Page 120: Informatik Anwendungen in der Genom- und Proteomforschung

Berlin Center for Genome Based Bioinformatics Algorithmische Bioinformatik, Institut für Informatik

Peptide Ions in Spectrum

147K

1166L

260

1020E

389

907D

504

778E

633

663E

762

534L

875

405F

1022

292G

1080

145S

1166

88

y ions

b ions

100

0250 500 750 1000

y2 y3 y4

y5

y6

y7

b3b4 b5 b8 b9

[M+2H]2+

b6 b7 y9y8

m/z

% In

tens

ity

KLEDEELFGS

Page 121: Informatik Anwendungen in der Genom- und Proteomforschung

Berlin Center for Genome Based Bioinformatics Algorithmische Bioinformatik, Institut für Informatik

What’s the problem?

-HN-CH-CO-NH-CH-CO-NH-

Ri CH-R’

bibi+1

yn-iyn-i-1

R”i+1

i+1

Peptide fragmentation possibilities (ion types)

Page 122: Informatik Anwendungen in der Genom- und Proteomforschung

Berlin Center for Genome Based Bioinformatics Algorithmische Bioinformatik, Institut für Informatik

What’s the problem?

-HN-CH-CO-NH-CH-CO-NH-

Ri CH-R’ai

bici

xn-iyn-i

zn-i

yn-i-1

bi+1R”

di+1

vn-i wn-i

i+1

i+1

low energy fragments high energy fragments

Peptide fragmentation possibilities (ion types)

Page 123: Informatik Anwendungen in der Genom- und Proteomforschung

Berlin Center for Genome Based Bioinformatics Algorithmische Bioinformatik, Institut für Informatik

Identification – SCOPE

Score P-values sequence

IDidentified fragments

Vineet Bafna, Nathan Edwards, Proc. ISMB 2001

Page 124: Informatik Anwendungen in der Genom- und Proteomforschung

Berlin Center for Genome Based Bioinformatics Algorithmische Bioinformatik, Institut für Informatik

Overview

• Motivation for OpenMS• Bioinformatics Issues in quantitative

Proteomics– Signal Processing– Feature Finding– Map Mapping– Differential Quantitation– Identification, Clustering– Software Engineering, Databases

Page 125: Informatik Anwendungen in der Genom- und Proteomforschung

Berlin Center for Genome Based Bioinformatics Algorithmische Bioinformatik, Institut für Informatik

OpenMS

• Software framework for shotgun proteomics• ISO/ANSI C++ compliant• Features

– Open Source (LGPL)– Data structures & algorithms

• d-dimensional core• Signal processing for raw data conversion

– Data import/export in standard formats– Database storage of all data

(using an extended PEDRO/MIAPE model)– DB-driven workflow management– Visualization for spectra, maps

Page 126: Informatik Anwendungen in der Genom- und Proteomforschung

Berlin Center for Genome Based Bioinformatics Algorithmische Bioinformatik, Institut für Informatik

MDI Window

1D Window

SpecView – 1D Window

DB

Page 127: Informatik Anwendungen in der Genom- und Proteomforschung

Berlin Center for Genome Based Bioinformatics Algorithmische Bioinformatik, Institut für Informatik

SpecView – 3D Window

• Visualization of raw, stick, and feature maps

• Rapid interactivezoom in/out through efficientquad treeimplementation

• Data import from files or direct visualization from the DB

Page 128: Informatik Anwendungen in der Genom- und Proteomforschung

Berlin Center for Genome Based Bioinformatics Algorithmische Bioinformatik, Institut für Informatik

Clustering/ID View

Page 129: Informatik Anwendungen in der Genom- und Proteomforschung

Berlin Center for Genome Based Bioinformatics Algorithmische Bioinformatik, Institut für Informatik

Visualization Code Example// create application with a single widgetQApplication app(argc, argv);Spectrum1DWidget* wid = new Spectrum1DWidget;app.setMainWidget(wid);

// load data from DBDBAdapter dba;dba.connectDB("OpenMS", "<user>", "<password>");dba.executeQuery("SELECT id FROM PeakList … ", true);wid->setDataSet(dba.objects()[0]);

// execute applicationwid->show();return app.exec();

Page 130: Informatik Anwendungen in der Genom- und Proteomforschung

Berlin Center for Genome Based Bioinformatics Algorithmische Bioinformatik, Institut für Informatik

Acknowledgements

Dr. Clemens GröplEva Lange, Tim Conrad,Ole Schulz-Trieglaff(Algorithmische Bioinformatik,FU Berlin)

Prof. Hartmut Schlüter(Universitätsmedizin Berlin, Charité)

Prof. Dr. Oliver KohlbacherMarc Sturm, Andreas BertschJens Joachim(SBS/WSI, Tübingen)

Andreas Hildebrandt(Uni Saarbrücken)

Prof. Dr. Christian HuberBettina Mayr et al.(Instr. Analytik & Bioanalytik,Univ. des Saarlandes, Saarbrücken)

Dr. Albert Sickmann(Virchow-Zentrum, Würzburg)

Herbert ThieleJens Decker(Bruker Daltonics, Bremen)

Dr. Christoph Klein(IRMM, Geel; now IHCP Ispra)

Page 131: Informatik Anwendungen in der Genom- und Proteomforschung

Berlin Center for Genome Based Bioinformatics Algorithmische Bioinformatik, Institut für Informatik

Überblick Assemblierung

1. Überblick : DNA Sequenzierung

2. Überblick : BAC-by-BAC Assembler

3. Überblick : WGS Assembler4. Überblick :

Compartmentalized Assembler5. Details der CSA Pipeline

Page 132: Informatik Anwendungen in der Genom- und Proteomforschung

Berlin Center for Genome Based Bioinformatics Algorithmische Bioinformatik, Institut für Informatik

Danke für die Aufmerksamkeit !!!

Page 133: Informatik Anwendungen in der Genom- und Proteomforschung

Berlin Center for Genome Based Bioinformatics Algorithmische Bioinformatik, Institut für Informatik

DNA DNA SequenzSequenz

GRGRÖÖSSENAUSWAHLSSENAUSWAHL(LIBRARY)(LIBRARY)

AUFBRECHENAUFBRECHEN

Shotgun DNA Sequenzierung (Technologie)

CLONENCLONEN

VectorVector

End Reads(Mates)End Reads(Mates)

SEQUENZIERENSEQUENZIEREN

PrimerPrimer

DNA Sequenzierung