metabolomics: data acquisition, pre-processing and quality control

14-2-2013

1

Metabolomics: data acquisition, preprocessing & quality control

Theo Reijmers,

Analytical BioSciences, Leiden University

Barcelona, 14-02-2013

Figure: metabolite classes: carbohydrates, amino acids, coenzymes (vitamins), hormones, nucleotides and lipids

The metabolome

• Metabolites: chemical compounds with low molecular weight

• Many chemical classes, with different chemical properties (different from proteomics)

• Large differences in abundance

Figure: the metabolome along three axes: mass < 1500 Da, polarity log P –6 to 14, concentration dynamic range 10^9

Figure: coverage of the metabolome (mass, polarity and concentration range) by analytical approaches: targeted, NMR, LC-MS, custom, global screen

Analytical strategies: 1H NMR

Advantages

• Straightforward sample preparation

• High sample throughput (robotic control)

• Chemical shifts stable (if pH kept constant)

• Quantification without standards

• Highly repeatable and reproducible

• Very valuable for identification of isolated metabolites

Disadvantages

• Limited sensitivity

• Identification in complex mixtures rather difficult

Analytical strategies: LC-MS and GC-MS

• Chromatography: separation of compounds in sample

• Mass spectrometry: detection of ions based on mass-to-charge ratio (m/z)

Chromatography

Types of interaction:

A. Surface adsorption

B. Solvent partitioning

C. Ion exchange

Separation of chemical compounds based on chemical properties

Figure: interaction types A, B and C, and the resulting chromatogram

Mass spectrometer

Figure: ionisation -> mass analyser -> detector

separation of charged particles in the gas phase

separation based on mass-to-charge ratio (m/z)

LC-MS vs GC-MS

Liquid C-MS

Advantages:

•Fast

•Efficient

•Sensitive

•Wide range of compounds

Disadvantages:

•Unstable*

•Sensitivity compound dependent

•Ion suppression gives rubbish data

•Relative quantification (if no authentic standard is available)

Gas C-MS

Advantages:

• Highly reproducible retention times

• Sensitive detection for all metabolites

• Characteristic mass fingerprint (identification!)

Disadvantages:

• Derivatization is needed to include polar analytes

*About as stable as a chocolate teapot in a heatwave. (Wilson 2009)

Demonstration & Competence Lab

• Applying technology developed in core in associate projects with industry, academia, clinics, knowledge institutes

• Validation and implementation of metabolomics platforms

• QA/QC system/error model per metabolite

• Clinical & preclinical studies (projects with partners)

• >15 000 samples/year

• > 2000 metabolites

• Identification pipeline

• Training & hands-on-workshops

Platforms

• Lipid analysis by LC-MS (ca. 300 individual compounds)

• Amine analysis by LC-MS/MS (ca. 120 compounds)

• Oxylipin analysis (ca. 140 compounds)

• Global profiling by RP-LC-MS (ca. 450 compounds identified)

• Global profiling by GC-MS (ca. 150 compounds)

• Global profiling by CE-MS (ca. 300 compounds)

• And more under development

Large metabolomics measurement series, DCL

• IOP biomarkers for healthy aging (matching project LUMC and NCHA, Netherlands Centre for Healthy Aging)

– ±2500 samples, 28 batches

– Measurement time ±28 weeks

• Dutch Twin Register (Nederlands Tweeling Register, NTR)

– ±3000 samples, 31 batches

– Measurement time ±30 weeks

• DiOGenes: Diet, Obesity and Genes (NMC associate project, N & H cluster)

– ±2000 samples, 27 batches

– Measurement time ±14 weeks

Measurement Design

• Randomization, replication & blocking of measurements

• Inclusion of compounds & samples to monitor (& eventually correct for) quality

– Internal Standards

– Calibration samples

– Quality Control (QC) samples

– Replicate samples (technical & analytical)

– Blanks

– System suitability samples

– Transfer samples

Typical sample sequence list

Order Name Id Level Batch Preparation Injection isSample isSST isQC isdQC isBlank isCal isOutlier isSuspect Comment

1 Blank Blank 0 5 1 1




5 dSST.C2 dSST.C2 2 5 1 1

6 SST.C2 SST.C2 2 5 1 1

7 dQC dQC 4 5 1 1

8 QC QC 4 5 1 1

9 P5.C6.a C6 6 5 1 1

10 P5.C7.a C7 7 5 1 1

11 P5.C0.a C0 0 5 1 1

12 P5.C1.a C1 1 5 1 1

13 P5.C4.a C4 4 5 1 1

14 P5.C5.a C5 5 5 1 1

15 P5.C2.a C2 2 5 1 1

16 P5.C3.a C3 3 5 1 1

17 P5.C1 0543_090.3.01.0 4 5 1 1

18 P5.D1 0546_094.3.01.0 4 5 1 1

19 P5.E1 0550_076.3.01.0 4 5 1 1

20 QC QC 4 5 1 1


22 dQC QC 4 5 1 1 1

23 P5.F1 0553_015.3.15.0 4 5 1 1

24 P5.G1 0555_097.3.01.0 4 5 1 1

25 P5.H1 0556_097.3.01.1 4 5 1 1 1 There might be something wrong here

26 P5.A2 0559_077.3.05.0 4 5 1 1

27 P5.B2 0561_103.3.01.1 4 5 1 1 1 Something wrong here

28 P5.C2 0563_103.3.01.0 4 5 1 1

29 P5.D2 0564_093.3.03.0 4 5 1 1

30 P5.E2 0570_095.3.01.0 4 5 1 1

31 P5. bE1 0550_076.3.01.0 4 5 2 1

32 P5. bA7 0631_057.3.09.0 4 5 2 1

33 QC QC 4 5 1 1


35 dQC dQC 4 5 1 1

36 P5.F2 0571_105.3.04.0 4 5 1 1

37 P5.G2 0573_105.3.03.0 4 5 1 1

38 P5.H2 0574_099.3.02.0 4 5 1 1

39 P5.A3 0575_099.3.01.0 4 5 1 1

40 P5.B3 0577_099.3.03.0 4 5 1 1

41 P5.C3 0578_099.3.01.1 4 5 1 1

42 P5.D3 0581_096.3.01.0 4 5 1 1

43 P5.E3 0582_101.3.01.0 4 5 1 1

44 P5.F3 0584_123.3.01.0 4 5 1 1

45 P5.G3 0585_085.3.01.0 4 5 1 1

46 QC QC 4 5 1 1


48 dQC dQC 4 5 1 1

49 P5.H3 0587_085.3.01.1 4 5 1 1

50 P5.A4 0589_095.3.01.1 4 5 1 1

51 P5.B4 0590_105.3.01.0 4 5 1 1

52 P5.C4 0591_105.3.02.0 4 5 1 1

53 P5.D4 0593_077.3.12.1 4 5 1 1

54 P5.E4 0594_077.3.12.0 4 5 1 1

55 P5. bF9 0664_130.3.20.1 4 5 2 1

56 P5. bF10 0678_118.3.01.0 4 5 2 1

57 P5.F4 0597_117.3.02.1 4 5 1 1

58 P5.G4 0598_117.3.02.0 4 5 1 1

59 QC QC 4 5 1 1


61 dQC dQC 4 5 1 1

62 P5.H4 0599_117.3.01.1 4 5 1 1

63 P5.A5 0600_117.3.01.0 4 5 1 1

64 P5.B5 0603_098.3.04.0 4 5 1 1

65 P5.C5 0604_098.3.02.0 4 5 1 1

66 P5.D5 0605_098.3.01.0 4 5 1 1

67 P5.E5 0606_098.3.01.1 4 5 1 1

68 P5. bB3 0577_099.3.03.0 4 5 2 1

69 P5. bH3 0587_085.3.01.1 4 5 2 1

70 P5.F5 0607_015.3.16.0 4 5 1 1

71 P5.G5 0608_078.3.02.0 4 5 1 1

72 QC QC 4 5 1 1


74 dQC dQC 4 5 1 1

75 P5.H5 0609_078.3.03.0 4 5 1 1

76 P5.A6 0611_078.3.01.0 4 5 1 1

77 P5.B6 0612_088.3.02.0 4 5 1 1

78 P5.C6 0613_088.3.01.0 4 5 1 1

79 P5.D6 0616_085.3.02.0 4 5 1 1

80 P5.E6 0618_094.3.05.0 4 5 1 1

81 P5. bE6 0618_094.3.05.0 4 5 2 1

82 P5. bB10 0673_107.3.05.0 4 5 2 1

83 P5. bG1 0555_097.3.01.0 4 5 2 1

84 P5. bC4 0591_105.3.02.0 4 5 2 1

85 QC QC 4 5 1 1


87 dQC dQC 4 5 1 1

88 P5.C3.b C3 3 5 1 1

89 P5.C7.b C7 7 5 1 1

90 P5.C2.b C2 2 5 1 1

91 P5.C6.b C6 6 5 1 1

92 P5.C5.b C5 5 5 1 1

93 P5.C4.b C4 4 5 1 1

94 P5.C0.b C0 0 5 1 1

95 P5.C1.b C1 1 5 1 1

96 P5.F6 0620_107.3.01.0 4 5 1 1

97 P5.G6 0629_092.3.01.1 4 5 1 1

98 P5.H6 0630_092.3.01.0 4 5 1 1

99 QC QC 4 5 1 1


101 dQC dQC 4 5 1 1

102 P5.A7 0631_057.3.09.0 4 5 1 1

103 P5.B7 0632_057.3.09.1 4 5 1 1

104 P5.C7 0634_091.3.01.0 4 5 1 1

105 P5.D7 0635_015.3.17.0 4 5 1 1

106 P5.E7 0638_072.3.01.0 4 5 1 1

107 P5.F7 0639_066.3.03.0 4 5 1 1

108 P5.G7 0640_066.3.03.1 4 5 1 1

109 P5.H7 0642_109.3.02.0 4 5 1 1

110 P5.A8 0643_109.3.01.0 4 5 1 1

111 P5.B8 0646_110.3.06.1 4 5 1 1

112 QC QC 4 5 1 1


114 dQC dQC 4 5 1 1

115 P5.C8 0647_110.3.01.0 4 5 1 1

116 P5.D8 0648_110.3.03.1 4 5 1 1

117 P5.E8 0649_110.3.03.0 4 5 1 1

118 P5.F8 0650_110.3.06.0 4 5 1 1

119 P5. bH6 0630_092.3.01.0 4 5 2 1

120 P5. bF11 0689_065.3.22.0 4 5 2 1

121 P5.G8 0651_110.3.02.0 4 5 1 1

122 P5.H8 0655_108.3.01.1 4 5 1 1

123 P5.A9 0656_108.3.01.0 4 5 1 1

124 P5.B9 0658_111.3.01.0 4 5 1 1

125 QC QC 4 5 1 1


127 dQC dQC 4 5 1 1

128 P5.C9 0659_111.3.02.0 4 5 1 1

129 P5.D9 0661_128.3.01.0 4 5 1 1

130 P5. bF4 0597_117.3.02.1 4 5 2 1

131 P5. bC10 0675_129.3.01.1 4 5 2 1

132 P5.E9 0663_130.3.20.0 4 5 1 1

133 P5.F9 0664_130.3.20.1 4 5 1 1

134 P5.G9 0665_130.3.19.1 4 5 1 1

135 P5.H9 0666_130.3.19.0 4 5 1 1

136 P5.A10 0668_097.3.10.0 4 5 1 1

137 P5.B10 0673_107.3.05.0 4 5 1 1

138 QC QC 4 5 1 1


140 dQC dQC 4 5 1 1

141 P5. bB5 0603_098.3.04.0 4 5 2 1

142 P5.C10 0675_129.3.01.1 4 5 1 1

143 P5.D10 0676_129.3.01.0 4 5 1 1

144 P5.E10 0677_118.3.01.1 4 5 1 1

145 P5.F10 0678_118.3.01.0 4 5 1 1

146 P5.G10 0681_118.3.02.0 4 5 1 1

147 P5. bH10 0683_078.3.05.0 4 5 2 1

148 P5. bD4 0593_077.3.12.1 4 5 2 1 1 Only Integrated for TGs

149 P5.H10 0683_078.3.05.0 4 5 1 1

150 P5.A11 0684_065.3.27.0 4 5 1 1

151 QC QC 4 5 1 1


153 dQC dQC 4 5 1 1

154 P5.B11 0685_065.3.28.0 4 5 1 1

155 P5.C11 0686_065.3.29.0 4 5 1 1

156 P5.D11 0687_065.3.26.0 4 5 1 1

157 P5.E11 0688_065.3.30.0 4 5 1 1

158 P5.F11 0689_065.3.22.0 4 5 1 1

159 P5.G11 0690_065.3.20.0 4 5 1 1

160 P5.H11 0691_065.3.24.0 4 5 1 1

161 P5.A12 0693_065.3.23.0 4 5 1 1

162 P5.B12 0694_065.3.25.0 4 5 1 1

163 P5.C12 0696_112.3.04.0 4 5 1 1

164 QC QC 4 5 1 1


166 dQC dQC 4 5 1 1

167 P5.D12 0697_112.3.04.1 4 5 1 1

168 P5.E12 0699_072.3.02.1 4 5 1 1

169 P5.F12 0692_065.3.21.0 4 5 1 1

170 P5.C0.c C0 0 5 1 1

171 P5.C2.c C2 2 5 1 1

172 P5.C4.c C4 4 5 1 1

173 P5.C6.c C6 6 5 1 1

174 P5.C5.c C5 5 5 1 1

175 P5.C3.c C3 3 5 1 1

176 P5.C7.c C7 7 5 1 1

177 P5.C1.c C1 1 5 1 1

178 P5. bH7 0642_109.3.02.0 4 5 2 1

179 QC QC 4 5 1 1




QC-blank-(dummy) QC sequence at regular intervals

Calibration blocks at regular intervals

Running samples

Technical samples: system cleaning, testing and equilibrating.

Possible outliers are flagged and, if confirmed, ignored
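The layout of such a sequence list (randomized study samples with QC injections at regular intervals) can be sketched in a few lines. This is only an illustration: the function name `build_run_sequence`, the interval of 10 samples and the sample labels are assumptions, and calibration blocks, dummy QCs and replicates are left out.

```python
import random

def build_run_sequence(sample_ids, qc_interval=10, seed=0):
    """Return an injection order: randomized samples with QC injections in between."""
    rng = random.Random(seed)          # fixed seed keeps the sequence reproducible
    order = list(sample_ids)
    rng.shuffle(order)                 # randomize the study sample order
    sequence = ["Blank", "QC"]         # start with a blank and a QC injection
    for i, sample_id in enumerate(order, start=1):
        sequence.append(sample_id)
        # insert a QC after every qc_interval samples, except at the very end
        if i % qc_interval == 0 and i < len(order):
            sequence.append("QC")
    sequence.append("QC")              # close the batch with a QC injection
    return sequence

# hypothetical sample labels, for illustration only
sample_ids = [f"S{i:02d}" for i in range(25)]
sequence = build_run_sequence(sample_ids, qc_interval=10, seed=1)
```

With 25 samples and an interval of 10 this gives one blank, four QC injections and all 25 samples in random order.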


Data Acquisition, LC-MS & GC-MS

For one chemical compound, the pattern is approximately the product of a component-specific mass profile and the abundance at a certain retention time

Component specific mass profile:

LC-MS: natural isotopes + adducts (soft ionization)

GC-MS: fragments (hard ionization)
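The statement that the pattern of one compound is a product of a mass profile and an abundance over retention time can be written as an outer product. The sketch below uses a made-up Gaussian elution profile and a made-up three-channel mass profile (both are assumptions, not measured data) and checks that the resulting data matrix has rank one.

```python
import numpy as np

def compound_pattern(elution_profile, mass_profile):
    """Scans x mass-channels pattern of one compound (outer product)."""
    return np.outer(elution_profile, mass_profile)

# hypothetical elution profile: Gaussian peak at retention time 5
retention_time = np.linspace(0.0, 10.0, 101)
elution_profile = np.exp(-0.5 * ((retention_time - 5.0) / 0.5) ** 2)

# hypothetical mass profile: monoisotopic peak plus two weaker isotope peaks
mass_profile = np.array([1.0, 0.3, 0.05])

pattern = compound_pattern(elution_profile, mass_profile)
pattern_rank = np.linalg.matrix_rank(pattern)   # a single compound gives rank 1
```

Every column of `pattern` is the same elution profile, scaled by the relative intensity of that mass channel.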

Figure: intensity vs retention time (chromatographic profile) and intensity vs m/z (mass profile) for one compound

Raw Data, LC-MS

• Huge amount of data

~1000s of mass spectra (retention time scans)

~10 000s of ion chromatograms

~1 000 000s of (m/z, retention time) pairs

For each sample!

• Complex data

- Noise (detector noise and chemical noise), spikes, background

- Concentration differences between compounds are large, and therefore so are the intensity differences

Figure: number of mass channels selected for processing vs scan number

Preprocessing, LC-MS

• Targeted platforms: vendor preprocessing software

– Expert knowledge => optimized settings

• Untargeted platforms: in-house developed preprocessing software

– Conversion of manufacturer formats to common formats (e.g. ‘netcdf’ & ‘mzxml’)

– Centroiding and binning

– Baseline correction

– Alignment

– Peak extraction (asks for an estimate of noise level)

– Matching of peaks over samples

• Result: feature/peak/compound list

– m/z & rt: peak area

Centroiding

Figure: raw vs centroided mass spectrum

m/z shifts within a sample

Small m/z shifts, probably due to the centroid sampling mode of the MS spectra and mass fluctuations during recording

Binning

• Binning algorithm: sum intensities within predefined bins = mass ranges

• Definition of bins is a challenge, mostly related to the mass resolution (e.g. resolution = 10 000 -> define bin 100.00 – 100.01)

• When done incorrectly -> large influence on peak extraction steps
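The binning step described above (summing intensities within predefined mass ranges) can be sketched with NumPy. The function name `bin_spectrum`, the fixed bin width and the example values are assumptions for illustration; the bin width of 0.01 follows the slide's example of resolution 10 000 at m/z 100.

```python
import numpy as np

def bin_spectrum(mz, intensity, mz_min, mz_max, bin_width):
    """Sum intensities within fixed-width m/z bins."""
    n_bins = int(round((mz_max - mz_min) / bin_width))
    edges = mz_min + bin_width * np.arange(n_bins + 1)
    # np.histogram with weights sums the intensities that fall in each bin
    summed, _ = np.histogram(mz, bins=edges, weights=intensity)
    return edges, summed

# hypothetical centroided data points around m/z 100
mz = np.array([100.002, 100.007, 100.013, 100.019])
intensity = np.array([10.0, 5.0, 2.0, 1.0])
edges, summed = bin_spectrum(mz, intensity, mz_min=100.00, mz_max=100.02, bin_width=0.01)
```

A signal whose m/z values straddle a bin edge would be split over two bins, which is one way a poor bin definition can disturb the later peak extraction.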


Background correction

Figure: total ion chromatogram (TIC), raw and background corrected

Retention time alignment

Figure: chromatograms over the full retention time range and in detail

Alignment algorithms

• Dynamic Time Warping (DTW)

– Time point by time point mapping (dynamic programming)

• Correlation Optimized Warping (COW)

– Piecewise linear, segments instead of individual time points (dyn. progr.)

• (Semi)-Parametric Warping (PTW, Eilers)

– Global, nonlinear (parametric transfer function estimation)

Figure: COW applied to a target dataset and a dataset to align

- optimization of correlation between the two pieces of each dataset

- does not allow large retention time variation (determined by the slack parameter t)
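As an illustration of the first algorithm in the list, the sketch below implements a textbook form of dynamic time warping on two one-dimensional signals. It is a minimal version (absolute difference as local cost, no slope or window constraints), not the implementation used for the data shown here; the function name `dtw_path` and the toy signals are assumptions.

```python
import numpy as np

def dtw_path(target, signal):
    """Textbook DTW: cumulative cost by dynamic programming, then backtracking."""
    n, m = len(target), len(signal)
    cost = np.full((n + 1, m + 1), np.inf)
    cost[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            local = abs(target[i - 1] - signal[j - 1])
            cost[i, j] = local + min(cost[i - 1, j - 1], cost[i - 1, j], cost[i, j - 1])
    # backtrack from the end to recover the time point by time point mapping
    i, j = n, m
    path = [(i - 1, j - 1)]
    while (i, j) != (1, 1):
        candidates = [(i - 1, j - 1), (i - 1, j), (i, j - 1)]
        i, j = min(candidates, key=lambda step: cost[step])
        path.append((i - 1, j - 1))
    path.reverse()
    return cost[n, m], path

# toy chromatograms: the same peak, shifted by one time point
target = [0.0, 0.0, 1.0, 0.0, 0.0]
signal = [0.0, 1.0, 0.0, 0.0, 0.0]
distance, path = dtw_path(target, signal)
```

The returned path maps the peak at index 2 of the target onto the peak at index 1 of the shifted signal, so the remaining distance is zero.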

Figure: detail of the chromatograms before and after warping

Peak/Feature extraction and peak integration

• XCMS http://metlin.scripps.edu/xcms/index.php

• MetAlign http://www.wageningenur.nl/en/show/MetAlign-1.htm

• TNO-DECO Jellema et al, Chemom. Intell. Lab. Syst. 104 (2010) 132-139

• MZExtract van der Kloet et al, submitted

TNO-DECO

Works with GC-MS and not too complex LC-MS

Decomposes experimental data into the product of pure mass spectra and concentration profiles of all compounds in the sample

Advantages:

- Result is combined mass spectrum (identification!!)

- All samples analyzed at once

Problems / issues:

- Least squares (abundant compounds have large influence on result)

- Noise level estimation

- Correct binning essential

Jellema, Chemo. Intel. Lab. Systems (2010) 104 132-139.
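The decomposition into mass spectra and concentration profiles can be illustrated with a single least squares step. The sketch below is not TNO-DECO itself: it assumes the pure spectra are already known and only estimates the concentration profiles, whereas a full deconvolution estimates both. The function name `estimate_profiles` and all numbers are made up for illustration.

```python
import numpy as np

def estimate_profiles(data, spectra):
    """Least squares concentration profiles, given pure spectra.

    data:    scans x mass channels
    spectra: compounds x mass channels (assumed known here)
    """
    coeffs, *_ = np.linalg.lstsq(spectra.T, data.T, rcond=None)
    return coeffs.T   # scans x compounds

# two hypothetical overlapping compounds, the second ten times less abundant
scans = np.arange(20)
true_profiles = np.column_stack([
    np.exp(-0.5 * ((scans - 8) / 2.0) ** 2),
    0.1 * np.exp(-0.5 * ((scans - 11) / 2.0) ** 2),
])
spectra = np.array([
    [1.0, 0.5, 0.0, 0.1],
    [0.0, 0.2, 1.0, 0.4],
])
data = true_profiles @ spectra            # noise-free bilinear data
estimated_profiles = estimate_profiles(data, spectra)
```

On noise-free data the profiles are recovered exactly. With noise, the least squares criterion is dominated by the abundant compound, which is the issue listed above.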


Deconvolution

Figure: deconvolution of LC-MS data, showing the baseline corrected data, the reconstructed signal, the extracted chromatographic profiles and the extracted mass spectra (rt: 13.9818, 14.3868, 14.5777, 14.769)

MZExtract

Per sample:

• Feature extraction of recalibrated and centroided data (in-house)

• Integration of features (areas)

• Grouping of features to feature-sets (enrichment step -> knowledge based: isotopes, adducts)

Over samples:

• Match feature-sets

Advantage of two-step approach: fully scalable solution (parallel implementation)

van der Kloet, submitted.

Grouping related features within a single sample

No retention time window necessary to match features (only isotopic patterns or other known relations, e.g. adducts)
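Grouping by isotopic pattern can be sketched as chaining m/z values whose spacing matches the 13C isotope spacing. This is a hypothetical simplification and not the MZExtract algorithm: the function name `group_isotope_sets`, the tolerance and the example m/z values are assumptions, only singly charged 13C spacing is handled, and adducts and intensity checks are ignored.

```python
C13_SPACING = 1.003355   # approximate 13C - 12C mass difference in Da

def group_isotope_sets(mz_values, charge=1, tol=0.005):
    """Chain m/z values whose spacing matches the 13C isotope spacing."""
    remaining = sorted(mz_values)
    feature_sets = []
    while remaining:
        current = [remaining.pop(0)]       # start a new set at the lowest m/z
        found = True
        while found:
            found = False
            expected = current[-1] + C13_SPACING / charge
            for mz in remaining:
                if abs(mz - expected) <= tol:
                    current.append(mz)     # next isotope peak found
                    remaining.remove(mz)
                    found = True
                    break
        feature_sets.append(current)
    return feature_sets

# hypothetical features from one sample
features_mz = [760.585, 761.588, 762.592, 184.073]
feature_sets = group_isotope_sets(features_mz)
```

A feature without matching isotope peaks ends up as a set of one, comparable to the list of ‘single’ features mentioned in the validation.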


Validation

Target list from MassHunter (Agilent) used to locate 174 known targets.

– Mass window -> resolution 10 000

– RT window -> +/- 10 seconds

– 171 were found

– 3 missing targets: no isotopic patterns were detected (they were found in the list of ‘single’ features)

About 3200 unknown feature-sets

How to validate unknown feature-sets? Here: selection based on QC presence

Comparable: 1175 feature-sets

Low abundant: 366 feature-sets
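Locating known targets with a mass window and a retention time window can be sketched as below. The interpretation of the mass window as m/z divided by the resolution is an assumption, as are the function name `match_targets`, the target names and all numbers; the vendor software may define its windows differently.

```python
def match_targets(targets, features, resolution=10_000, rt_window=10.0):
    """Match targets to features within a mass window and an RT window.

    targets:  {name: (mz, rt_seconds)}
    features: list of (mz, rt_seconds, area)
    """
    matches = {}
    for name, (target_mz, target_rt) in targets.items():
        mz_tol = target_mz / resolution   # assumed mass window from the resolution
        hits = [
            f for f in features
            if abs(f[0] - target_mz) <= mz_tol and abs(f[1] - target_rt) <= rt_window
        ]
        if hits:
            # keep the hit closest in m/z
            matches[name] = min(hits, key=lambda f: abs(f[0] - target_mz))
    return matches

# hypothetical targets and extracted features
targets = {"target_A": (496.340, 300.0), "target_B": (760.585, 600.0)}
features = [(496.350, 305.0, 1.0e5), (760.700, 602.0, 2.0e4)]
matches = match_targets(targets, features)
```

Here `target_A` is found, while the feature near `target_B` falls outside the mass window and is left unmatched.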


PLS-DA, selectivity ratio*, to quantify the discriminatory ability of the variables

The low abundant feature-sets do contain biologically relevant information!

The most important feature-set is an unknown!

*Anal. Chem. 2009, 81, 2581–2590

Quality Assessment

• Make use of all additional measured compounds and samples

– Internal Standards

– Replicates

– Blanks

– Quality Control samples

• Quality Assessment => QC report (in-house)


Part of a measurement run

Figure: response vs measurement order for QC samples, study samples and replicate study samples

QC report overview table

Compound  N   Mean    Std     RSD QC  RSD reps  p-value  Diffs
CholE02   58  0.0298  0.0079  26.4%   21.4%     0.000    (2-1,3-1,3-2,4-2,4-3)
CholE04   46  0.0240  0.0124  51.9%   40.6%
CholE05   58  0.0120  0.0024  20.4%   19.1%     0.000    (2-1,3-1,4-1,3-2,4-3)
CholE06   58  0.0085  0.0021  24.7%   19.5%     0.000    (3-1,3-2,4-3)
DG02      58  0.0049  0.0011  23.4%   22.7%     0.000    (2-1,3-1,4-1,3-2,4-2,4-3)
LPC01     58  0.0183  0.0009  4.7%    4.8%      0.000    (4-1,4-2,4-3)
LPC02     58  0.0130  0.0015  11.7%   11.5%     0.000    (2-1,3-1,4-1)
LPC03     58  0.0101  0.0010  9.5%    12.1%     0.360
LPC04     58  0.0436  0.0019  4.4%    5.4%      0.000    (2-1,4-1,3-2,4-3)
LPC05     58  1.8684  0.1259  6.7%    6.8%      0.000    (2-1,3-1,4-1,3-2,4-2,4-3)
LPC07     58  0.0109  0.0007  6.1%    6.4%      0.004    (4-2)
LPC08     58  0.6096  0.0141  2.3%    3.2%      0.000    (2-1,3-1,4-1,3-2,4-2,4-3)
LPC09     58  0.4170  0.0200  4.8%    4.8%      0.000    (3-1,4-1,3-2,4-2,4-3)
LPC10     58  0.6625  0.0976  14.7%   13.8%     0.000    (2-1,3-1,4-1,3-2,4-2,4-3)
LPC11     58  0.0394  0.0446  113.1%  57.6%     0.000    (2-1,3-2,4-2,4-3)
LPC12     58  0.1126  0.0024  2.1%    3.6%      0.000    (2-1,3-1,3-2,4-2,4-3)
LPC13     58  0.0425  0.0049  11.5%   9.8%      0.000    (3-1,4-1,3-2,4-2)
LPC14     58  0.0311  0.0010  3.3%    3.7%      0.000    (2-1,3-1,4-2,4-3)
LPC16     58  0.0064  0.0016  24.9%   28.7%     0.000    (4-1,3-2,4-2,4-3)
LPC17     58  0.0033  0.0010  32.0%   36.4%     0.000    (3-1,4-1,3-2,4-2,4-3)
LPE02     58  0.0303  0.0056  18.6%   19.4%     0.000    (2-1,4-1,3-2,4-2,4-3)
LPE04     43  0.0034  0.0011  33.1%   21.9%
PC01      58  0.0832  0.0105  12.6%   12.5%     0.000    (4-1,4-2,4-3)
PC02      58  0.3333  0.0151  4.5%    4.6%      0.000    (2-1,4-1,4-2,4-3)
PC03      58  0.2238  0.0077  3.4%    3.7%      0.000    (2-1,3-1,4-1,4-2,4-3)
PC04      58  0.1257  0.0040  3.1%    4.8%      0.000    (3-1,4-1,3-2,4-3)
PC05      58  0.0674  0.0248  36.8%   35.9%     0.000    (2-1,3-1,4-1,3-2,4-3)
PC06      58  0.0667  0.0084  12.7%   10.1%     0.000    (2-1,4-1,3-2,4-3)
PC07      58  0.0225  0.0026  11.5%   14.2%     0.000    (2-1,3-1,4-1,4-2,4-3)

ANOVA for batch to batch variation

RSD values for

• QC samples

• Replicate samples (independent validation)
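The RSD values in such a report are the standard deviation expressed as a percentage of the mean. A minimal sketch, assuming the sample standard deviation (ddof=1) is used; the function name `rsd_percent` and the example responses are made up.

```python
import numpy as np

def rsd_percent(values):
    """Relative standard deviation in percent (sample standard deviation)."""
    values = np.asarray(values, dtype=float)
    return 100.0 * values.std(ddof=1) / values.mean()

# hypothetical QC responses for one compound
qc_responses = [9.0, 10.0, 11.0]
rsd_qc = rsd_percent(qc_responses)   # mean 10, standard deviation 1
```

The same function applied to the replicate samples gives the independent RSD used for validation.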


Uncorrected Peak areas

QC samples only

Figure: QC samples, ratio and (uncorrected) area; RSDQC = 25.8%

Internal standard

RSDQC=25.8%

Internal Standard Corrected data

RSDQC=20.6%
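Internal standard correction is usually the ratio of the analyte peak area to the peak area of its internal standard. The sketch below uses made-up areas with an idealized drift that affects both compounds equally, which is an assumption; the function name `istd_correct` is also hypothetical.

```python
import numpy as np

def istd_correct(analyte_area, istd_area):
    """Ratio of analyte peak area to internal standard peak area."""
    return np.asarray(analyte_area, dtype=float) / np.asarray(istd_area, dtype=float)

# hypothetical signal drift over four injections, shared by analyte and standard
drift = np.array([1.0, 0.9, 0.8, 0.7])
analyte_area = 200.0 * drift
istd_area = 100.0 * drift
ratio = istd_correct(analyte_area, istd_area)
```

In this idealized case the drift cancels completely. In real data the analyte and the standard do not respond identically, so part of the variation remains, as the RSDQC values above show.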


Intra and Inter batch variation

• Analytical Column ‘aging’

• Analytical Column replacement

• Eluent ‘refills’ and small variations

• Instrument malfunction/breakdown

– Etc…

Intra and Inter batch correction

• Instead of just monitoring QC sample responses, use them to correct variation

QC correction

Figure: response vs measurement order for QC samples and study samples, with a penalized smoother through the QC samples

Figure: response vs measurement order, before and after QC correction

Van der Kloet et al., Journal of Proteome Research 2009
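A QC-based correction of this kind can be sketched with a penalized least squares smoother. The version below is a Whittaker-type smoother with a second-difference penalty, fitted only to the QC injections; this is a stand-in chosen for illustration, and the smoother in the cited paper may differ. The function names, the penalty `lam` and the simulated drift are assumptions.

```python
import numpy as np

def qc_trend(response, is_qc, lam=100.0):
    """Penalized smoother through the QC responses, evaluated at every injection."""
    y = np.asarray(response, dtype=float)
    w = np.asarray(is_qc, dtype=float)          # weight 1 for QC, 0 for other samples
    n = len(y)
    D = np.diff(np.eye(n), n=2, axis=0)         # second-difference operator
    A = np.diag(w) + lam * D.T @ D
    return np.linalg.solve(A, w * y)

def qc_correct(response, is_qc, lam=100.0):
    """Divide by the QC trend and rescale to the median QC response."""
    y = np.asarray(response, dtype=float)
    qc_mask = np.asarray(is_qc, dtype=bool)
    trend = qc_trend(y, qc_mask, lam)
    return y / trend * np.median(y[qc_mask])

# simulated batch: linear signal loss, QC injected every fifth position
order = np.arange(11)
is_qc = order % 5 == 0
true_level = np.where(is_qc, 100.0, 80.0)
response = true_level * (1.0 - 0.03 * order)
corrected = qc_correct(response, is_qc)
```

After correction the QC injections share one level and the study samples are no longer affected by the simulated drift.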


ISTD/QC corrected data

RSDQC=4.1%

RSDreplicates=10.0%

All samples


All batches

Correction charts

RSDQC

RSDReplicates


Scores plot based upon 93 lipids: uncorrected area

Figure: scores plot based on 93 components (peak area), PC 1 (39.3%) vs PC 2 (14%), with batches 1 to 4 and QC samples marked

Differences between batches.

Clear trends in QC samples.

Scores plot based upon 93 lipids: ISTD correction

Figure: scores plot based on 93 components (ISTD correction), PC 1 (21.3%) vs PC 2 (14.8%), with batches 1 to 4 and QC samples marked

Smaller differences between batches.

Spread in QC samples greatly reduced. However, batch to batch differences remain present.

Scores plot based upon 93 lipids

Figure: scores plot based on 93 components with RSDqc < 0.15 and RSDreps < 0.15, PC 1 (22.9%) vs PC 2 (14.7%), with batches 1 to 4 and QC samples marked

Combining data in systems biology

Comprehensive view of patient, animal, …: e.g. combine genomics, proteomics & metabolomics data

-> Data integration / fusion: joining data from different measurement approaches, same objects

Increase power of statistical analyses: combine e.g. metabolomics batch datasets

-> ‘Equating’ (*): making data comparable from the same measurement approach, different objects

Figure: datasets 1 and 2 drawn as blocks of objects by variables

*Equating is a psychometrical term

Why not just concatenate datasets?

• ‘Omics data are typically batch data

• Metabolomics is often not quantitative

-> datasets not comparable

• Calibration model transfer would be a solution but often no full calibration models can be made!*

*Sangster et al, The Analyst 2006 (131): 1075-1078

Figure: can datasets 1 and 2 (objects by variables) simply be stacked?

A proposed approach: QC samples

Correction for structural differences between series using quality control (QC) samples (pooled samples or representative samples)*

(picture from reference below)

*van der Greef et al, J Proteome Res 2007 (6): 1540-1559

Problem with QC sample approach

• Rationale: make medians of QC data equal for all series

• Unwanted side-effect: inflation of variation in rest of data

Figure: MAD per lipid compound for series 1, series 2 uncorrected and series 2 QC-corrected, showing inflation of MAD in series 2 relative to series 1

MAD: median absolute deviation (robust SD)
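The side-effect can be made concrete with a small numerical sketch. It assumes the QC medians are made equal by one multiplicative factor per series, which is an assumption about the correction; the function names and all values are made up.

```python
import numpy as np

def mad(values):
    """Median absolute deviation."""
    values = np.asarray(values, dtype=float)
    return np.median(np.abs(values - np.median(values)))

def qc_median_scale(series, qc_series, qc_reference):
    """Scale a series so that its QC median equals the reference QC median."""
    factor = np.median(qc_reference) / np.median(qc_series)
    return np.asarray(series, dtype=float) * factor, factor

# hypothetical QC responses in two series, and study samples of series 2
qc_series_1 = [10.0, 10.5, 9.5]
qc_series_2 = [5.0, 5.2, 4.8]
samples_2 = [4.0, 6.0, 8.0]

scaled_2, factor = qc_median_scale(samples_2, qc_series_2, qc_series_1)
mad_before = mad(samples_2)
mad_after = mad(scaled_2)
```

Scaling by a factor above one multiplies the spread of the study samples by the same factor, which is one way the MAD of the corrected series can become inflated.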

Alternative solution: equating

• Combination of data from different measurement series

• …in studies with limited number of internal standards (typically metabolomics!)

• …or even from different studies

• General: enables maximal flexibility in subsequent data analysis on combined datasets

Figure: datasets 1 and 2 stacked over objects, sharing the same variables

Illustration: LC–MS data

• 182 (54 + 128) healthy participants (Netherlands Twin Register)*

• Blood samples (overnight fasting)

• Plasma analyzed with liquid chromatography–MS method for lipids

-> Target list for 59 lipids: LPC / PC / SPM / ChE / TG

-> Data per lipid corrected for class-specific internal standard

Measured in two series: year 1 (Y1) N=54 + year 2 (Y2) N=128

*Draisma et al, OMICS 2008: 17–31

PCA scores before equating

Data mean-centered prior to PCA

Figure: PCA scores of Y1 and Y2 before equating

Univariate quantile equating

• Quantiles: values marking boundaries between regular intervals of the cumulative distribution function (CDF)

• Example: 54 data values and associated CDF

Figure: CDF of the 54 values in steps of 1/54, marking the 0.48 quantile, the 0.50 quantile (= median) and the 0.52 quantile

Univariate quantile equating

Average values of corresponding quantiles

Figure: CDF Y1 and CDF Y2; at CDF(x) = 0.50 the two series give x = 1.81 and x = 2.64

Data from: Frisby & Clatworthy, Perception 1975: 173-178

Quantile equating

Algorithm:

1. Number of quantiles = min {N1, N2, …}

2. Average values of corresponding quantiles by projection onto unit vector $(1/\sqrt{n}, \ldots, 1/\sqrt{n})$

3. Substitute averaged values for original values belonging to each quantile

Often applied for quantile normalization (*) of gene arrays, between arrays (objects) over probes (variables)

*Bolstad et al, Bioinformatics 2003: 185–193

Figure: projection onto the unit vector (averaging)
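The three steps of the algorithm can be sketched for one variable measured in several series. This is a simplified illustration and not the published implementation: the plotting positions (i + 0.5)/N, the linear interpolation between quantiles and the handling of ties are assumptions, as are the function name `quantile_equate` and the example values.

```python
import numpy as np

def quantile_equate(series_list):
    """Univariate quantile equating of several series of one variable."""
    # step 1: number of quantiles = size of the smallest series
    n_q = min(len(s) for s in series_list)
    probs = (np.arange(n_q) + 0.5) / n_q
    # step 2: quantile values per series, then averaged over the series
    quantiles = []
    for s in series_list:
        s_sorted = np.sort(np.asarray(s, dtype=float))
        positions = (np.arange(len(s_sorted)) + 0.5) / len(s_sorted)
        quantiles.append(np.interp(probs, positions, s_sorted))
    reference = np.mean(quantiles, axis=0)
    # step 3: replace each value by the averaged value of its quantile
    equated = []
    for s in series_list:
        s = np.asarray(s, dtype=float)
        ranks = np.argsort(np.argsort(s))       # rank of each value, ties broken arbitrarily
        position = (ranks + 0.5) / len(s)
        equated.append(np.interp(position, probs, reference))
    return equated

# hypothetical values of one lipid in two series
y1 = [1.0, 2.0, 3.0, 4.0]
y2 = [10.0, 30.0, 20.0, 40.0]
y1_eq, y2_eq = quantile_equate([y1, y2])
```

After equating, both series have the same distribution, while the rank order of the objects within each series is kept.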

Example univariate quantile equating

Figure: distributions of Y1 and Y2 before and after equating, with Q-Q plot (CDF Y2 vs CDF Y1)

PCA scores after equating LC–MS data

Data mean-centered prior to PCA

Figure: PCA scores of Y1 (red) and Y2 (black), before and after equating

Y1–Y2 similarity in PCA score space*

• location: Mahalanobis’ D2

• variance: Box’s M statistic

• direction: PCA loadings

*Jouan-Rimbaud et al, Chemom Intell Lab Syst 1998: 129-144

Y1–Y2 similarity in PCA score space

All parameters: 0 = ‘dissimilar’, 1 = ‘similar’

Figure: location, variance and direction parameters before and after equating

Jouan-Rimbaud et al, Chemom Intell Lab Syst (1998) 129-144

Effects on clustering results

No equating, Y1–Y2 datasets combined: obvious between-series effect

Figure: clustering result with Y1 and Y2 labelled

Draisma et al, Anal Chem (2010) 82 1039-1046

Effects on clustering results

After quantile equating, Y1–Y2 datasets combined: Y1–Y2 effect removed

Biological information extractable from combined dataset

Figure: clustering result with ♂ and ♀ labelled

Draisma et al, Anal Chem (2010) 82 1039-1046

Conclusions

• ‘Garbage in = Garbage out’ so try to control data quality as much as possible

• Proper measurement design allows separation of unwanted experimental variation from biological variation (IS, QCs, replicates)

• Preprocessing: trade off between data quality, speed (automation) and completeness (number of features)

• Road to high quality data is balanced mix of data acquisition and data processing

Acknowledgements

• DCL

– Jorne Troost

– Evelyne Steenvoorden

– Shanna Shi

– Faisa Galud

– Rob Vreeken

– Amy Harms

– Raymond Ramakers

– Irina Paliukovich

– Adrie Dane

• LACDR

– Frans van der Kloet

– Katrin Strassbourgh

– Vanessa Gonzalez

– Margriet Hendriks

– Harmen Draisma

– Thomas Hankemeier
