quantitative analysis of proteomics mass spectrometry data€¦ · complete analysis pipeline for...
TRANSCRIPT
Quantitative Analysis of ProteomicsMass Spectrometry Data
Sebastian Gibb and Korbinian StrimmerIMISE, University of Leipzig
EMS 2013Budapest
21 July 2013
Sebastian Gibb and Korbinian Strimmer, Analysis of Proteomics Data, 21/7/2013 1
Inhalt
1 IntroductionPersonalized MedicineWhy Proteomics?Statistical Challenges
2 Preprocessing Mass Spectrometry DataMALDIquant SoftwareWorkflow
Data import, smoothing, baseline correction, calibration, peakdetection, peak alignment, peak binning, intensity matrix
3 High-Level AnalysisDichotomizationClassification and Ranking with Binary PredictorsExample from Cancer Proteomics
4 Conclusion and Outlook
Sebastian Gibb and Korbinian Strimmer, Analysis of Proteomics Data, 21/7/2013 2
I. Introduction
Sebastian Gibb and Korbinian Strimmer, Analysis of Proteomics Data, 21/7/2013 3
Predicting Disease
Sebastian Gibb and Korbinian Strimmer, Analysis of Proteomics Data, 21/7/2013 4
Personalized Medicine
Medication and tests approved for personalized medicine inGermany:
from: Nicola Siegmund-Schultze. 2011. Deutsches Arzteblatt
108:1904-1909Sebastian Gibb and Korbinian Strimmer, Analysis of Proteomics Data, 21/7/2013 5
Biomarker Discovery
General aim of personalized medicine:
“the right drug for the right person”
Biomarkers are essential for:
� Classifying heterogeneous subtypes of disease
� Pinpoint precise diagnosis
� Targeted therapy
� Understanding the provenance of disease
Sebastian Gibb and Korbinian Strimmer, Analysis of Proteomics Data, 21/7/2013 6
Data from High-Throughput Platforms
Major techologies to collect genomic high-througput data:Microarrays and
Next-Generation Sequencing
Sebastian Gibb and Korbinian Strimmer, Analysis of Proteomics Data, 21/7/2013 7
Proteomics
Microarrays and NGS allow to measure concentration of RNAexpressions → Transcriptomics
However, in many clinical settings (e.g. clinical diagnostics) one isoften interested in expression of proteins, peptides andamino acids.
Advantage: direct biological and medical interpretation!
Systematic study of proteins and related products → Proteomics
Sebastian Gibb and Korbinian Strimmer, Analysis of Proteomics Data, 21/7/2013 8
Proteomic Mass Spectrometry
High-throughput technology for measuring proteins:Mass Spectrometry
2000 4000 6000 8000 10000
050
100
150
peak detection
mass
inte
nsity
●●●●●●●●●●●
●●●●●
●
●●●●●●●●
●
●●●●
●
●
●
●
●●
●
●
●●●
●
●●●●●●●●
●●●●●●●●●●●●●●●
●
●
●●
●●●
●
●●
●
●●●●●●
●
●
●
●●
●
●●●●●●●●●●
●
●●
●
●●●●●●
●●
●●
●●●●●●
●
●
●●●●●●●●●●
●
●
●
●●
●
●●
●●
●●●●
●●●●●●●●●
●●●
●●●
●
●
●
●
●●
●
●
●
●●●
●
●
●
●
●
●●●●●●●●
●●●●●
●
●
●●●●●●●●●●
●●●
●
●
●●●●●●●●
●
●●●●●●●●●●
●●●
●
●●●
●
●●●●●●●●●●●
●
●●●●●●●●●
●●
●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●
●
●●●
●●●●●
●
●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●
●
●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●
●
●●●●●●●
●
●●●
●
●
●
●
●
●
●
●●●
●
●●●●●●●
●●●●●●●
●●●●
●
●
●●
●●
●
●●
●
●●●
●
●
●
●
●●●●
●
●
●
●●●●●
●●
●●
●●●●●●
●
●
●●●
●
●
●
●
●●
●
●●
●●
●●●●
●● ●
●●
●
●
●
●
●●
●
●
●
●●●
●
●
●
●
●
●●●●● ●
●
●●●●●●
●
●
●●●
●
● ●●
●
●●
●
● ●
●
●●
●●● ●●●
●
●●●
●
●
●●
1466.05
1545.781
1617.063
2105.552
2932.711
3158.47
3192.143
3263.158
3883.32
3955.47
3972.7474091.227
4210.261
4267.43
4282.581
4643.858
5064.7995904.643
7765.216
9288.693
accepted peaksrejected peaksnoise threshold
Sebastian Gibb and Korbinian Strimmer, Analysis of Proteomics Data, 21/7/2013 9
MALDI Technology
© Bruker Daltonics
MALDI: Matrix-Assisted LaserDesorption/Ionization
Allows the analysis of large organic moleculessuch as proteins.(note “matrix” is biochemical jargon referringto carrier material that helps the proteinionization).
Sebastian Gibb and Korbinian Strimmer, Analysis of Proteomics Data, 21/7/2013 10
MALDI-TOF Principle
Ion Source: MALDI
Matrix-Assisted LaserDesorption/Ionization
Mass Analyzer: TOF
Time Of Flight (t ∝√
mq )
Detector
QuantityMeasurement
Abb. 3.14.; S. 67; “Biochemie & Pathobiochemie”, Loffler G., 8. Auflage (2007), Springer Medizin Verlag
Sebastian Gibb and Korbinian Strimmer, Analysis of Proteomics Data, 21/7/2013 11
MALDI Benefits
Advantages:
� Reliable well established technology (many variants exists)
� Often combined with other experimental techniques (gels,imaging etc).
� Very cheap to run compared with other high-throughputtechnologies (i.e. many replications possible)
Sebastian Gibb and Korbinian Strimmer, Analysis of Proteomics Data, 21/7/2013 12
Statistical Challenges
Analysis of proteomics mass spectrometric data is morecomplicated than that of gene expression!Overview of the major statistical challenges summarized by Morriset al (2007):Preprocessing:
� Removal of systematic biases in the data
� Peak identification
� Peak alignment across spectra
� Quantification and calibration of relative peak intensities
Develop suitable methods for multivariate analysis:
� Differential protein expression
� Classification and prediction
� Feature selection
In addition, many considerations from transcriptomics also apply here (suchhigh dimensionality, sparse models etc).
Sebastian Gibb and Korbinian Strimmer, Analysis of Proteomics Data, 21/7/2013 13
II. Preprocessing Mass Spectrometry Data
Sebastian Gibb and Korbinian Strimmer, Analysis of Proteomics Data, 21/7/2013 14
MALDIquant Software
In the form of MALDIquant we have developed a completeprocessing pipeline for proteomics data in R.Motivation:
� Only relatively few open source software solutions availableand very few for the R platform.http://strimmerlab.org/notes/mass-spectrometry.html
� No MALDI-TOF package fitting our needs for clinicaldiagnostics.
� Necessity of handling both technical and biological replicates.
� Nonlinear Peak alignment necessary for many spectra.
� Modular and easy to customize analysis routines.
� Testbed for studying new methodologies (e.g. forquantification).
Sebastian Gibb and Korbinian Strimmer, Analysis of Proteomics Data, 21/7/2013 15
MALDIquant Family
MALDIquant
Complete analysis pipeline for MALDI-TOF and other 2D massspectrometry data.
MALDIquantForeign
Import raw data (txt, tab, csv, Bruker Daltonics fid,Ciphergen XML, mzXML, mzML).
Export into common formats (txt, tab, csv, msd, mzML).
Sebastian Gibb and Korbinian Strimmer, Analysis of Proteomics Data, 21/7/2013 16
MALDIquant Family
MALDIquant
single spectrum multiple spectra
peak alignment/peak binning
calibration/normalization
smoothing
baseline correction
peak detection
classification
sda
randomForest
...
pamr
smoothing
baseline correction
peak detection
raw dataMALDIquantForeign
fid
csv
mzML
mzXML
txt, tab
Sebastian Gibb and Korbinian Strimmer, Analysis of Proteomics Data, 21/7/2013 17
Data Import
Raw Data
Data Import
Smoothing
Baseline Correction
Intensity Calibration
Peak Detection
Peak Alignment
Peak Binning
Intensity Matrix
Post Processing
Result
MALDIquant automatically recognizes many massspectrometry data formats (including native Brukerfiles).
## load MALDIquant
library("MALDIquant")
## load MALDIquantForeign
library("MALDIquantForeign")
spectra <- import("/data/ms/raw")
Sebastian Gibb and Korbinian Strimmer, Analysis of Proteomics Data, 21/7/2013 18
Data Import
Raw Data
Data Import
Smoothing
Baseline Correction
Intensity Calibration
Peak Detection
Peak Alignment
Peak Binning
Intensity Matrix
Post Processing
Result
Here is a raw example spectrum from the Fiedler et al.(2009) cancer data set.
plot(spectra[[1]])
2000 4000 6000 8000 10000
0e+
004e
+04
8e+
04
Pankreas_HB_L_061019_G10.M19
/data/set A − discovery leipzig/control/Pankreas_HB_L_061019_G10/0_m19/1/1SLin/fid
mass
inte
nsity
Sebastian Gibb and Korbinian Strimmer, Analysis of Proteomics Data, 21/7/2013 18
Variance Stabilizing/Smoothing
Raw Data
Data Import
Smoothing
Baseline Correction
Intensity Calibration
Peak Detection
Peak Alignment
Peak Binning
Intensity Matrix
Post Processing
Result
spectra <- transformIntensity(spectra, sqrt)
plot(spectra[[1]])
2000 4000 6000 8000 10000
050
100
150
200
250
300
Pankreas_HB_L_061019_G10.M19
/data/set A − discovery leipzig/control/Pankreas_HB_L_061019_G10/0_m19/1/1SLin/fid
mass
inte
nsity
Sebastian Gibb and Korbinian Strimmer, Analysis of Proteomics Data, 21/7/2013 19
Variance Stabilizing/Smoothing
Raw Data
Data Import
Smoothing
Baseline Correction
Intensity Calibration
Peak Detection
Peak Alignment
Peak Binning
Intensity Matrix
Post Processing
Result
movingAverage <- function(y) {return(filter(y, rep(1, 5)/5, sides = 2))
}spectra <- transformIntensity(spectra, movingAverage)
2000 4000 6000 8000 10000
050
100
150
200
250
300
Pankreas_HB_L_061019_G10.M19
/data/set A − discovery leipzig/control/Pankreas_HB_L_061019_G10/0_m19/1/1SLin/fid
mass
inte
nsity
Sebastian Gibb and Korbinian Strimmer, Analysis of Proteomics Data, 21/7/2013 19
Baseline Correction
Raw Data
Data Import
Smoothing
Baseline Correction
Intensity Calibration
Peak Detection
Peak Alignment
Peak Binning
Intensity Matrix
Post Processing
Result
For correct quantification of peak intensities it isnecessary to conduct baseline correction. Thisaccounts for systematic bias, such as matrix effects.
MALDIquant implements a number of correctionalgorithms, including the SNIP approach by Ryan etal (1988) and the TopHat filter.
Sebastian Gibb and Korbinian Strimmer, Analysis of Proteomics Data, 21/7/2013 20
Baseline Correction
Raw Data
Data Import
Smoothing
Baseline Correction
Intensity Calibration
Peak Detection
Peak Alignment
Peak Binning
Intensity Matrix
Post Processing
Result
2000 4000 6000 8000 10000
050
100
150
200
250
300
Median Baseline
/data/set A − discovery leipzig/control/Pankreas_HB_L_061019_G10/0_m19/1/1SLin/fid
mass
inte
nsity
Sebastian Gibb and Korbinian Strimmer, Analysis of Proteomics Data, 21/7/2013 21
Baseline Correction
Raw Data
Data Import
Smoothing
Baseline Correction
Intensity Calibration
Peak Detection
Peak Alignment
Peak Binning
Intensity Matrix
Post Processing
Result
1200 1250 1300 1350 1400 1450 1500
050
100
150
200
250
300
Median Baseline
/data/set A − discovery leipzig/control/Pankreas_HB_L_061019_G10/0_m19/1/1SLin/fid
mass
inte
nsity
Sebastian Gibb and Korbinian Strimmer, Analysis of Proteomics Data, 21/7/2013 21
Baseline Correction
Raw Data
Data Import
Smoothing
Baseline Correction
Intensity Calibration
Peak Detection
Peak Alignment
Peak Binning
Intensity Matrix
Post Processing
Result
2000 4000 6000 8000 10000
050
100
150
200
250
300
Convex Hull Baseline
/data/set A − discovery leipzig/control/Pankreas_HB_L_061019_G10/0_m19/1/1SLin/fid
mass
inte
nsity
Sebastian Gibb and Korbinian Strimmer, Analysis of Proteomics Data, 21/7/2013 21
Baseline Correction
Raw Data
Data Import
Smoothing
Baseline Correction
Intensity Calibration
Peak Detection
Peak Alignment
Peak Binning
Intensity Matrix
Post Processing
Result
2000 4000 6000 8000 10000
050
100
150
200
250
300
TopHat Baseline
/data/set A − discovery leipzig/control/Pankreas_HB_L_061019_G10/0_m19/1/1SLin/fid
mass
inte
nsity
Sebastian Gibb and Korbinian Strimmer, Analysis of Proteomics Data, 21/7/2013 21
Baseline Correction
Raw Data
Data Import
Smoothing
Baseline Correction
Intensity Calibration
Peak Detection
Peak Alignment
Peak Binning
Intensity Matrix
Post Processing
Result
2000 4000 6000 8000 10000
050
100
150
200
250
300
SNIP Baseline
/data/set A − discovery leipzig/control/Pankreas_HB_L_061019_G10/0_m19/1/1SLin/fid
mass
inte
nsity
C. G. Ryan, E. Clayton, W. L. Griffin, S. H. Sie, and D. R. Cousens. SNIP, a statistics-sensitive backgroundtreatment for the quantitative analysis of PIXE spectra in geoscience applications. Nucl. Instrument. Meth. B, 34:396–402, 1988
Sebastian Gibb and Korbinian Strimmer, Analysis of Proteomics Data, 21/7/2013 21
Baseline Correction
Raw Data
Data Import
Smoothing
Baseline Correction
Intensity Calibration
Peak Detection
Peak Alignment
Peak Binning
Intensity Matrix
Post Processing
Result
spectra <- removeBaseline(spectra, method = "SNIP")
plot(spectra[[1]])
2000 4000 6000 8000 10000
050
100
150
200
250
Pankreas_HB_L_061019_G10.M19
/data/set A − discovery leipzig/control/Pankreas_HB_L_061019_G10/0_m19/1/1SLin/fid
mass
inte
nsity
C. G. Ryan, E. Clayton, W. L. Griffin, S. H. Sie, and D. R. Cousens. SNIP, a statistics-sensitive backgroundtreatment for the quantitative analysis of PIXE spectra in geoscience applications. Nucl. Instrument. Meth. B, 34:396–402, 1988
Sebastian Gibb and Korbinian Strimmer, Analysis of Proteomics Data, 21/7/2013 21
Intensity Calibration
Raw Data
Data Import
Smoothing
Baseline Correction
Intensity Calibration
Peak Detection
Peak Alignment
Peak Binning
Intensity Matrix
Post Processing
Result
As with gene expression data, mass spectrometry dataneed to be calibrated. Several methods are available(TIC, Median, probabilistic quotient normalization).
spectra <- standardizeTotalIonCurrent(spectra)
plot(spectra[[1]])
2000 4000 6000 8000 10000
0e+
004e
−04
8e−
04
Pankreas_HB_L_061019_G10.M19
/data/set A − discovery leipzig/control/Pankreas_HB_L_061019_G10/0_m19/1/1SLin/fid
mass
inte
nsity
Sebastian Gibb and Korbinian Strimmer, Analysis of Proteomics Data, 21/7/2013 22
Peak Detection
Raw Data
Data Import
Smoothing
Baseline Correction
Intensity Calibration
Peak Detection
Peak Alignment
Peak Binning
Intensity Matrix
Post Processing
Result
3800 3900 4000 4100 4200 4300 4400 4500
0.00
000
0.00
005
0.00
010
0.00
015
0.00
020
0.00
025
0.00
030
Pankreas_HB_L_061019_G10.M19
/data/set A − discovery leipzig/control/Pankreas_HB_L_061019_G10/0_m19/1/1SLin/fid
mass
inte
nsity
●●● ● ●●●●
●●
●
●
●●●●●●●●●●●
●●
● ●
●
● ●●●
●
●
●●
●
●●●
●●●●●●●
● ●●●●
●
●
●●●
●
●
●●●●● ●
● ●
●
●●
●
●
●
●
● ●
●●
3875
3882.9
4091
4194.2
4209.9
accepted maximarejected maximanoise threshold
Sebastian Gibb and Korbinian Strimmer, Analysis of Proteomics Data, 21/7/2013 23
Peak Detection
Raw Data
Data Import
Smoothing
Baseline Correction
Intensity Calibration
Peak Detection
Peak Alignment
Peak Binning
Intensity Matrix
Post Processing
Result
peaks <- detectPeaks(spectra, SNR = 3)
plot(spectra[[1]])
points(peaks[[1]])
2000 4000 6000 8000 10000
0e+
004e
−04
8e−
04
Pankreas_HB_L_061019_G10.M19
/data/set A − discovery leipzig/control/Pankreas_HB_L_061019_G10/0_m19/1/1SLin/fid
mass
inte
nsity
●
●
●
●
●●●●●●
●
●●
●
●●●●●
●
●●●●●●●
●
●
●●●
●
●●
●●
●
●●●●●●●●●●
●● ●●● ●●●
●
●●
●
●
●●
●
●
●
●
●
●
●● ●●
●
●●
●
●
●●●●
● ●
●
● ●●
●●
●
● ●
●
●● ●●●
●●●●●
●
●●
●
●
●●
●●●
● ●● ●
●
●● ●●
●● ●●
●
●●
Sebastian Gibb and Korbinian Strimmer, Analysis of Proteomics Data, 21/7/2013 23
Peak Alignment
Raw Data
Data Import
Smoothing
Baseline Correction
Intensity Calibration
Peak Detection
Peak Alignment
Peak Binning
Intensity Matrix
Post Processing
Result
Peak alignment accross multiple spectra is a verychallenging issue in mass spectrometry analysis:
� there are many peaks that occur only in fewspectra
� non-linear mass shifts requires calibration alongx-axis
� peak mapping needed for comparison andidentification of markers
Sebastian Gibb and Korbinian Strimmer, Analysis of Proteomics Data, 21/7/2013 24
Peak Alignment/The Problem
Raw Data
Data Import
Smoothing
Baseline Correction
Intensity Calibration
Peak Detection
Peak Alignment
Peak Binning
Intensity Matrix
Post Processing
Result
4180 4190 4200 4210 4220 4230 4240
0e+
001e
−04
2e−
043e
−04
4e−
04
Pankreas_HB_L_061019_G10.M20
mass
inte
nsity
9200 9250 9300 9350 9400
0e+
001e
−04
2e−
043e
−04
4e−
04
Pankreas_HB_L_061019_G10.M20
mass
inte
nsity
Sebastian Gibb and Korbinian Strimmer, Analysis of Proteomics Data, 21/7/2013 25
Peak Alignment/Possible Solutions
Raw Data
Data Import
Smoothing
Baseline Correction
Intensity Calibration
Peak Detection
Peak Alignment
Peak Binning
Intensity Matrix
Post Processing
Result
DTW[Clifford et al. 2009]
very slow & high memory usagem1 ∗m2: ((5∗104)2 ∗48Bytes ≈ 120GB)
COW[Veselkov et al. 2009]
slow, must run segment-wise to overco-me nonlinear shift
PTW[Bloemberg et al. 2010]
fast (≈ 3 sec/sample on an [email protected] GHz,4 GB RAM)
. . .
Sebastian Gibb and Korbinian Strimmer, Analysis of Proteomics Data, 21/7/2013 26
Peak Alignment/Our Approach
Raw Data
Data Import
Smoothing
Baseline Correction
Intensity Calibration
Peak Detection
Peak Alignment
Peak Binning
Intensity Matrix
Post Processing
Result
inspired by [Wang et al. 2010]
reference peaks: landmark peaks ⇒ occur in mostspectra
inspired by [He et al. 2011]
� peak matching (choose highest peak in range)
� mass vs diff plot → estimate warping function
Bing Wang, Aiqin Fang, John Heim, Bogdan Bogdanov, Scott Pugh, Mark Libardoni, and Xiang Zhang. Disco:Distance and spectrum correlation optimization alignment for two-dimensional gas chromatographytime-of-flight mass spectrometry-based metabolomics. Analytical Chemistry, 82(12):5069–5081, 2010
Q P He, J Wang, J A Mobley, J Richman, and W E Grizzle. Self-calibrated warping for mass spectra alignment.Cancer Inform, 10:65–82, 2011
Sebastian Gibb and Korbinian Strimmer, Analysis of Proteomics Data, 21/7/2013 27
Peak Alignment/Warping Functions
Raw Data
Data Import
Smoothing
Baseline Correction
Intensity Calibration
Peak Detection
Peak Alignment
Peak Binning
Intensity Matrix
Post Processing
Result
warpingFunctions <- determineWarpingFunctions(peaks)
●●●
●●●●●
●●
●●●●●● ●●
● ●
●●●●●●●●●●●●
●●
●
●●
●
●● ● ●
●
● ●
●
●
●
●●●● ●
●●
●
●
●
●
●
2000 4000 6000 8000
−6
−4
−2
02
46
sample 1 vs reference(matched peaks: 60/60)
/data/set A − discovery leipzig/control/Pankreas_HB_L_061019_G10/0_m20/1/1SLin/fid
mass
diffe
renc
e
●●●●●●●●
●●●●●
●
●●●●
●
●●●●●●●● ●
●
●● ● ● ●●●
●
●● ●
●●
● ●
●
● ●●●
●●
● ● ●
●
●● ●
●
2000 4000 6000 8000
−6
−4
−2
02
46
sample 2 vs reference(matched peaks: 59/60)
/data/set B − discovery heidelberg/control/Pankreas_HB_L_061019_A6/0_a12/1/1SLin/fid
mass
diffe
renc
e
●●●●
●●●●
●●●
●●●●● ●
●● ●●●●●●● ●
●
●●● ●
●
●●●
●
●●●
● ● ●● ●●
●
● ●●
●●
●
●
●
2000 4000 6000 8000
−6
−4
−2
02
46
sample 3 vs reference(matched peaks: 55/60)
/data/set B − discovery heidelberg/tumor/Pankreas_HB_L_061019_C4/0_f8/1/1SLin/fid
mass
diffe
renc
e
●●●
●●●●●
●●●●●
●
●● ●●● ●
●●●●●●●●
●
●
●● ● ●
●
●
●
●●
● ●
●
●
●
●
●
●●●
●
●
●
●
●
●
● ● ●
2000 4000 6000 8000
−6
−4
−2
02
46
sample 4 vs reference(matched peaks: 58/60)
/data/set B − discovery heidelberg/tumor/Pankreas_HB_L_061019_D9/0_g17/1/1SLin/fid
mass
diffe
renc
e
Sebastian Gibb and Korbinian Strimmer, Analysis of Proteomics Data, 21/7/2013 28
Peak Alignment/Warping Functions
Raw Data
Data Import
Smoothing
Baseline Correction
Intensity Calibration
Peak Detection
Peak Alignment
Peak Binning
Intensity Matrix
Post Processing
Result
spectra <- warpMassSpectra(spectra, warpingFunctions)
peaks <- warpMassPeaks(peaks, warpingFunctions)
4180 4190 4200 4210 4220 4230 4240
0e+
002e
−04
4e−
04
Pankreas_HB_L_061019_G10.M20
mass
inte
nsity
9200 9250 9300 9350 9400
0e+
002e
−04
4e−
04
Pankreas_HB_L_061019_G10.M20
mass
inte
nsity
Sebastian Gibb and Korbinian Strimmer, Analysis of Proteomics Data, 21/7/2013 28
Peak Alignment/Comparison
Raw Data
Data Import
Smoothing
Baseline Correction
Intensity Calibration
Peak Detection
Peak Alignment
Peak Binning
Intensity Matrix
Post Processing
Result
4180 4190 4200 4210 4220 4230 4240
unwarped
mass
4180 4190 4200 4210 4220 4230 4240
ptw (Bloemberg et al 2010)
mass
inte
nsity
4180 4190 4200 4210 4220 4230 4240
MALDIquant
mass
inte
nsity
9200 9250 9300 9350 9400
unwarped
mass
9200 9250 9300 9350 9400
ptw (Bloemberg et al 2010)
mass
inte
nsity
9200 9250 9300 9350 9400
MALDIquant
mass
inte
nsity
Sebastian Gibb and Korbinian Strimmer, Analysis of Proteomics Data, 21/7/2013 29
Peak Binning
Raw Data
Data Import
Smoothing
Baseline Correction
Intensity Calibration
Peak Detection
Peak Alignment
Peak Binning
Intensity Matrix
Post Processing
Result
4180 4190 4200 4210 4220 4230 4240
0e+
002e
−04
4e−
046e
−04
Pankreas_HB_L_061019_G10.M20
mass
inte
nsity
4192.521 4209.884194.572 4209.842
4193.52 4210.0384192.082 4209.853
9200 9250 9300 9350 9400
0e+
002e
−04
4e−
046e
−04
Pankreas_HB_L_061019_G10.M20
mass
inte
nsity
9290.529291.7069290.7179290.585
Sebastian Gibb and Korbinian Strimmer, Analysis of Proteomics Data, 21/7/2013 30
Peak Binning
Raw Data
Data Import
Smoothing
Baseline Correction
Intensity Calibration
Peak Detection
Peak Alignment
Peak Binning
Intensity Matrix
Post Processing
Result
peaks <- binPeaks(peaks)
4180 4190 4200 4210 4220 4230 4240
0e+
002e
−04
4e−
046e
−04
Pankreas_HB_L_061019_G10.M20
mass
inte
nsity
4193.666 4209.9394193.666 4209.9394193.666 4209.9394193.666 4209.939
9200 9250 9300 9350 9400
0e+
002e
−04
4e−
046e
−04
Pankreas_HB_L_061019_G10.M20
mass
inte
nsity
9290.5529290.5529290.5529290.552
Sebastian Gibb and Korbinian Strimmer, Analysis of Proteomics Data, 21/7/2013 30
Feature Matrix
Raw Data
Data Import
Smoothing
Baseline Correction
Intensity Calibration
Peak Detection
Peak Alignment
Peak Binning
Feature Matrix
Post Processing
Result
featureMatrix <- intensityMatrix(peaks)
featureMatrix[1:12, 1:2]
## 1011.67823313295 1020.66971843048
## [1,] 3.587257e-05 0.0002467926
## [2,] 3.170674e-05 0.0002550081
## [3,] 3.517940e-05 0.0002428846
## [4,] 3.430174e-05 0.0002563571
## [5,] 3.122426e-05 0.0004472052
## [6,] NA 0.0004505502
## [7,] 2.680547e-05 0.0004240763
## [8,] 3.484823e-05 0.0003559640
## [9,] 4.327525e-05 0.0001619205
## [10,] 3.357397e-05 0.0001527801
## [11,] 4.160095e-05 0.0002912183
## [12,] 3.848561e-05 0.0002911327
Sebastian Gibb and Korbinian Strimmer, Analysis of Proteomics Data, 21/7/2013 31
Workflow Summary
Raw Data
Data Import
Smoothing
Baseline Correction
Intensity Calibration
Peak Detection
Peak Alignment
Peak Binning
Intensity Matrix
Post Processing
Result
## load libraries
library("MALDIquant")
library("MALDIquantForeign")
## load data
spectra <- import("/data/ms/raw")
## run spectrum based workflow
spectra <- transformIntensity(spectra, sqrt)
spectra <- transformIntensity(spectra, movingAverage)
spectra <- removeBaseline(spectra)
spectra <- standardizeTotalIonCurrent(spectra)
peaks <- detectPeaks(spectra)
## run peak based workflow
warpingFunctions <- determineWarpingFunctions(peaks)
peaks <- warpMassPeaks(peaks)
peaks <- binPeaks(peaks)
featureMatrix <- intensityMatrix(peaks)
## use featureMatrix for further analysis
Sebastian Gibb and Korbinian Strimmer, Analysis of Proteomics Data, 21/7/2013 32
MALDIquant GUI
Sebastian Gibb and Korbinian Strimmer, Analysis of Proteomics Data, 21/7/2013 33
III. High-Level Analysis
Sebastian Gibb and Korbinian Strimmer, Analysis of Proteomics Data, 21/7/2013 34
Multivariate Analysis of Omics Data
In the last decade a multitude of statistical methods have beendeveloped for high-dimensional gene expression data. For example:
� regularized t-scores (moderated t, shrinkage t) for generanking,
� regularized regression and classification (e.g. shrinkagediscriminant analysis),
� sparse models, structured models, latent class models, and
� high-dimensional testing (e.g. Higher Criticism or FDR)
Can we use these techniques also for mass spectrometry data?
Sebastian Gibb and Korbinian Strimmer, Analysis of Proteomics Data, 21/7/2013 35
Peaks as Biomarkers
Mass spectrometric peaks contain both binary and continuousinformation:
1 peak may be present or absent (NA in intensity matrix)
2 intensity of peak may be up or down regulated
Both properties need to be taken into account!Thus, usual methods from gene expression data are generally notappropriate! (But nonetheless used in practice.)
Sebastian Gibb and Korbinian Strimmer, Analysis of Proteomics Data, 21/7/2013 36
Solution: Dichotomization
Local peak thresholding (Tibshirani et al 2004) uses apeak-specific thresholding rule Ithresh(m) to dichotomize spectraldata:
1 peak intensity I (m) > Ithresh(m)→ 1
2 peak intensity I (m) ≤ Ithresh(m)→ 0
This allows to take account both of absent/present as well asup/down regulated peaks.The thresholding rule is estimated from the data by maximizingthe separation between the groups (as measured by a variableimportance criterion).
→ for mass spectrometry data we need categorical data analysis!
Sebastian Gibb and Korbinian Strimmer, Analysis of Proteomics Data, 21/7/2013 37
Classification and Ranking with Binary Predictors
Large-scale analysis of binary data is routine in machine learning,especially in text mining.
We suggest using similar techniques as in text mining for theanalysis of mass spectrometry data, in particular:
� multivariate Bernoulli and related models for classification,
� corresponding variable importance measures (typically basedon KL entropy) for peak ranking, and
� regularized inference for application in high-dimensionalsettings.
Sebastian Gibb and Korbinian Strimmer, Analysis of Proteomics Data, 21/7/2013 38
Multivariate Bernoulli Independence Rule
One of the most popular approaches and highly effective approachfor classification of gene expression data is PAM (Tibshirani et al2003), a variant of shrinkage diagonal discriminant analysis.
We use the same idea for mass spectrometry data:
� assume “diagonal” multivariate Bernoulli distributions foreach group, Xk = (X1, . . . ,Xd) ∼ Bd(µk) withPr(x|µk) =
∏di=1 Pr(xi |µi ,k)
� Bayes rule gives discriminant function dk ∝ log Pr(k|x).
� Regularized training of discriminant rule in situations withd ≥ n.
This MVB independence rule, while very simply, has shown to behighly effective (e.g. Park 2009).
Sebastian Gibb and Korbinian Strimmer, Analysis of Proteomics Data, 21/7/2013 39
Pancreatic Cancer Proteomics Study
G.M. Fiedler et al. 2009. Serum peptidome profiling revealedplatelet factor 4 as a potential discriminating peptide associatedwith pancreatic cancer. Clinical Cancer Research 11: 3812–3819.
� Large study conducted in Leipzig and Heidelberg
� 120 participants (60 healthy vs. 60 cancer)
� 4 technical replicates per sample
� 480 MALDI spectra
Sebastian Gibb and Korbinian Strimmer, Analysis of Proteomics Data, 21/7/2013 40
Preprocessing and Peak Filtering
� Number of detected peaks per spectrum (of 480): between134 and 257
� After merging of technical replicates, keeping only peaks thatoccur in all 4 measured spectra: between 52 and 146
� Keeping only peaks that occur with 75 % frequency in eachgroup (cancer and control): between 23 and 56
Sebastian Gibb and Korbinian Strimmer, Analysis of Proteomics Data, 21/7/2013 41
Comparison of Rankings
Shrinkage t vs. categorical ranking (top 10 peaks):
Ranking metric discrete
1 4494.71 4494.712 2755.41 8937.023 4250.76 4467.814 8131.39 5945.425 2022.73 2022.736 1627.86 1866.077 3920.23 4250.768 8144.25 2755.419 5945.42 5906.0610 5266.03 2953.13
Sebastian Gibb and Korbinian Strimmer, Analysis of Proteomics Data, 21/7/2013 42
Clustering of Dichotomized Data
Dichotomized data exhibits near perfect split in healthy andcancer patients!
0.0
0.2
0.4
0.6
0.8
1.0
A: H
C1
A: H
C2
A: H
C62
A: H
C8
A: H
C64
A: H
C50
A: H
C67
A: H
C49
A: H
C11
A: H
C33
A: H
C66
A: H
C56
A: H
C54
A: H
C55
A: H
C12
0A
: HC
118
A: H
C11
9A
: HC
57A
: HC
59A
: HT
424
A: H
T26
2A
: HT
417
A: H
T15
1A
: HT
150
A: H
T32
1A
: HT
429
A: H
T39
3A
: HT
121
A: H
T21
2A
: HT
425
A: H
T40
2A
: HT
419
A: H
T41
6A
: HT
208
A: H
T41
3A
: HT
161
A: H
T12
0A
: HT
438
A: H
C12
2A
: HT
410
Sebastian Gibb and Korbinian Strimmer, Analysis of Proteomics Data, 21/7/2013 43
IV. Conclusion and Outlook
Sebastian Gibb and Korbinian Strimmer, Analysis of Proteomics Data, 21/7/2013 44
Summary
� We have developed MALDIquant, a comprehensive Rpackage for analysis of mass spectrometry data, with focuson clinical diagnostics.
� MALDIquant is easy to use yet incorporates state-of-the artmethods for preprocessing, calibration, quantification andvisualization
� Multivariate analysis is best done on dichotomized peakdata, using regularized categorical data analysis.
Sebastian Gibb and Korbinian Strimmer, Analysis of Proteomics Data, 21/7/2013 45
Outlook: IMS
Imaging Mass Spectrometry (IMS):combining MS data with spatial information
Sebastian Gibb and Korbinian Strimmer, Analysis of Proteomics Data, 21/7/2013 46
Many Thanks for Your Interest!
Sebastian Gibb and Korbinian Strimmer, Analysis of Proteomics Data, 21/7/2013 47
Availability
http://strimmerlab.org/software/maldiquant/
## get newest MALDIquant/MALDIquantForeign directly from CRAN
install.packages(c("MALDIquant", "MALDIquantForeign"))
S. Gibb and K. Strimmer. 2012. MALDIquant: a versatile Rpackage for the analysis of mass spectrometric data.Bioinformatics 28:2270-2271.
Sebastian Gibb and Korbinian Strimmer, Analysis of Proteomics Data, 21/7/2013 48