Post on 26-Sep-2020

Analysis of Mass Spectrometry Data: Problems and Tools

Johan Carlson
Johan.Carlson@ltu.se

Div. of Systems and Interaction
Dept. of Computer Science, Electrical and Space Engineering
Lulea University of Technology
SE-971 87 Lulea
Sweden

Today’s menu

Background
  Mass spectrometry
  Traditional multivariate data analysis
  Problems

Tools
  Pre-processing
  Traditional analysis, re-visited
  Problems?
  Alternative analysis strategy

Future challenges

Mass spectrometry (MS)

An analytical technique that measures the mass-to-charge ratio of charged particles.

Used for:
  Determining the masses of particles.
  Determining the elemental composition of a sample or molecule.
  Revealing the chemical structures of molecules and compounds.

Works by ionizing chemical compounds to generate charged molecules or molecule fragments and measuring their mass-to-charge ratios.

Mass spectrometry (MS)

A sample is loaded onto the MS instrument and undergoes vaporization.

The components of the sample are ionized by one of a variety of methods (e.g., by impacting them with an electron beam), which results in the formation of charged particles (ions).

The ions are separated according to their mass-to-charge ratio in an analyzer by electromagnetic fields.

The ions are detected, usually by a quantitative method.

The ion signal is processed into mass spectra.

Mass spectrometry (MS)

(Source: wikipedia.org)

Mass spectrometry (MS)

The output:

A vector of mass-to-charge values, i.e. the locations of the peaks in the mass spectrum.

Abundance values, i.e. the peaks themselves, representing the relative abundance ("amount") of each ion in the sample.

This has some implications (causing problems!), but let’s leave these for now.

The locations of the peaks (i.e. the corresponding mass values) give information about which types of molecules are present.

The magnitudes of the peaks give information about the relative amount of each molecule.
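As a minimal sketch of this output format, a spectrum can be held as two equal-length arrays, one for peak locations and one for abundances. The numbers and variable names below are made up for illustration and do not come from any instrument software.

```python
import numpy as np

# Hypothetical peak list (illustrative values only, not measured data).
mz = np.array([192.1, 198.2, 206.1, 212.2, 226.2])      # peak locations, m/z [Da]
counts = np.array([120.0, 300.0, 90.0, 450.0, 240.0])   # raw ion signal per peak

# Relative abundance: scale so that the largest peak equals 100.
abundance = 100.0 * counts / counts.max()

# The location of the largest peak indicates which ion dominates;
# the abundances indicate how much of each ion there is relative to it.
base_peak_mz = mz[np.argmax(abundance)]
```

Note that each spectrum carries its own `mz` vector, so two spectra need not have the same length or the same peak locations.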

The next slide shows an example of a mass spectrum for a reasonably simple peptide.

Mass spectrometry (MS)

(Source: wikipedia.org)

Mass spectrometry (MS)

For more complex mixtures, the mass spectra become more difficult to interpret.

The following example of a mass spectrum of a crude oil sample is taken from:
J. E. Carlson, J. R. Gasson, T. Barth, and I. Eide, "Extracting Homologous Series from Mass Spectrometry Data by Projection on Predefined Vectors", Chemom. Intell. Lab. Syst., Vol. 114, pp. 36–43, 2012.

Mass spectrometry (MS)

[Figure: mass spectrum of a crude oil sample. Horizontal axis: m/z [Da], 100 to 600. Vertical axis: normalised abundance, 0 to 50. Labelled peaks: 198.2, 212.2, 226.2.]

Traditional multivariate analysis

Purpose: Reveal underlying patterns in large data sets.

Example: Look at a set of mass spectra from 10 different oil samples. How are these different?

Tool of choice (among chemists): Principal Component Analysis (PCA).

So, let’s first look at what PCA is!

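Principal component analysis (PCA)

Assume we make an observation $\mathbf{x}_m$, where

$$\mathbf{x}_m = \begin{bmatrix} x_1 & x_2 & \cdots & x_N \end{bmatrix}^T,$$

and $x_1, x_2, \ldots, x_N$ are measured quantities for $N$ different variables.

If we have $M$ such multivariate observations, we can store these in a matrix $\mathbf{X}$ as

$$\mathbf{X} = \begin{bmatrix} \mathbf{x}_1^T \\ \mathbf{x}_2^T \\ \vdots \\ \mathbf{x}_M^T \end{bmatrix}$$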

Principal component analysis (PCA)

Now let’s factor $\mathbf{X}$ as

$$\mathbf{X} = \mathbf{T}\mathbf{P}^T,$$

where the columns of $\mathbf{P}$ are the normalized eigenvectors of $\mathbf{X}^T\mathbf{X}$, i.e. a new basis for the row space of $\mathbf{X}$ constructed from the eigenvectors of the sample covariance matrix of our $M$ observations (in $N$ variables), assuming the columns of $\mathbf{X}$ have been mean-centered. The rows of $\mathbf{T}$ are then the coordinates of the observations in this new basis.

Furthermore, let the eigenvectors be sorted so that the first column of $\mathbf{P}$ is the eigenvector corresponding to the largest eigenvalue, the second column corresponds to the second largest eigenvalue, and so on.
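The factorization can be sketched in a few lines of NumPy. This is one straightforward way to compute it, assuming the observations are stored as rows and the data are mean-centered first; the function name `pca` and the synthetic data are illustrative only.

```python
import numpy as np

def pca(X):
    """Return scores T, loadings P and eigenvalues for a data matrix X (M x N)."""
    Xc = X - X.mean(axis=0)               # mean-center each variable (column)
    C = Xc.T @ Xc / (Xc.shape[0] - 1)     # N x N sample covariance matrix
    eigvals, P = np.linalg.eigh(C)        # symmetric solver, ascending eigenvalues
    order = np.argsort(eigvals)[::-1]     # re-sort, largest eigenvalue first
    eigvals, P = eigvals[order], P[:, order]
    T = Xc @ P                            # coordinates (scores) in the new basis
    return T, P, eigvals

rng = np.random.default_rng(0)            # fixed seed, chosen arbitrarily
X = rng.normal(size=(50, 4))              # synthetic data: 50 observations, 4 variables
T, P, eigvals = pca(X)
```

Because the columns of `P` are orthonormal, `T @ P.T` reproduces the centered data exactly, and the sample variance of each score column equals the corresponding eigenvalue.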

WHY IS THIS GOOD?

Principal component analysis (PCA)

Example

Let’s look at a two-dimensional case, where $x_1$ and $x_2$ are the components of observations drawn from a two-dimensional Gaussian random variable with covariance matrix

$$\mathbf{R} = \begin{bmatrix} 10 & 1.5 \\ 1.5 & 0.5 \end{bmatrix}$$
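To make the example concrete, the sketch below takes the eigendecomposition of this covariance matrix and rotates randomly drawn samples into the eigenvector coordinates. The sample size and random seed are arbitrary choices, not taken from the slides.

```python
import numpy as np

R = np.array([[10.0, 1.5],
              [1.5, 0.5]])                 # covariance matrix from the example

eigvals, P = np.linalg.eigh(R)             # eigenvalues in ascending order
eigvals, P = eigvals[::-1], P[:, ::-1]     # reorder, largest first

explained = eigvals / eigvals.sum()        # fraction of total variance per component

# Draw samples and rotate them into the principal-component coordinates.
rng = np.random.default_rng(1)             # fixed seed, chosen arbitrarily
x = rng.multivariate_normal(mean=[0.0, 0.0], cov=R, size=2000)
t = x @ P
```

For this matrix the eigenvalues are approximately 10.23 and 0.27, so the first principal component alone accounts for roughly 97 % of the total variance, and the rotated coordinates in `t` are nearly uncorrelated.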

Principal component analysis (PCA)

[Figure: scatter plot of the samples, $x_1$ (horizontal axis, -10 to 10) versus $x_2$ (vertical axis, -2 to 2).]

Principal component analysis (PCA)

In essence, PCA is a rotation of the coordinate system.

The axes of the new system describe directions in which we have large experimental variation.

If there are strong correlations in the original data, we can therefore reduce the dimensionality by discarding PCs with minimum loss of information (actually optimal, in the least-squares sense).
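The least-squares statement can be checked numerically. The sketch below uses synthetic, strongly correlated data (an assumption made for illustration) and computes the principal directions through the singular value decomposition, which is an equivalent route to the eigenvectors of the covariance matrix.

```python
import numpy as np

rng = np.random.default_rng(2)             # fixed seed, chosen arbitrarily
# Synthetic, strongly correlated data: 200 observations of 5 variables
# driven by only 2 underlying factors plus a little noise.
factors = rng.normal(size=(200, 2))
mixing = rng.normal(size=(2, 5))
X = factors @ mixing + 0.01 * rng.normal(size=(200, 5))
Xc = X - X.mean(axis=0)

# SVD of the centered data: the rows of Vt are the principal directions.
U, s, Vt = np.linalg.svd(Xc, full_matrices=False)

k = 2                                      # number of PCs to keep
P = Vt[:k].T                               # N x k loadings
T = Xc @ P                                 # M x k scores
X_hat = T @ P.T                            # rank-k reconstruction

residual = np.sum((Xc - X_hat) ** 2)       # squared reconstruction error
discarded = np.sum(s[k:] ** 2)             # energy in the discarded components
```

Here `residual` equals `discarded`: the squared error of the reduced representation is exactly the energy of the discarded components, which is the smallest error any rank-k approximation can achieve.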

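Principal component analysis (PCA)

[Figure: the data in the principal-component coordinate system, $p_1$ (horizontal axis, -10 to 10) versus $p_2$ (vertical axis, -1 to 1).]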

Problems with mass spectrometry data

Example

Assume we have mass spectra of 10 different crude oils (five replicates of each).

A PCA should be able to reveal differences between these.

So, let’s store mass spectra from the samples as rows of our matrix $\mathbf{X}$ (columns then represent mass/charge values).

Large variations between oil samples should show up, and similar oils should group together.

Let’s try it!

Problems with mass spectrometry data

[Figure: PCA score plot for the oil spectra, $p_1$ (horizontal axis, -100 to 40) versus $p_2$ (vertical axis, -50 to 50).]

Problems with mass spectrometry data

It doesn’t work! Why?

Problem no. 1:
In PCA, we assume that all columns represent different variables, but that these variables are the same for all rows.

The MS data are non-uniformly sampled, meaning that we obtain pairs of mass/charge values and abundance values only where there are peaks.

So, if we store all the data in one matrix, a given column does not represent the same thing for the different spectra.

Problem no. 2:
Uncertainties in the instrument cause peak locations to shift slightly.

So, even for replicate experiments, the mass/charge values will not be the same.
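Both problems show up already with two small peak lists. The values below are made up for illustration, and the tolerance `tol` is an assumed instrument uncertainty rather than a figure from the slides.

```python
import numpy as np

# Two hypothetical replicate peak lists (made-up numbers).
mz_a = np.array([192.10, 198.20, 212.20, 226.20])
mz_b = np.array([192.13, 206.08, 212.17, 226.24])   # one peak differs, the rest are shifted

# Problem 1: treating the k-th peak of every spectrum as "variable k"
# compares peaks that may lie at completely different m/z values.
column_mismatch = np.abs(mz_a - mz_b)

# Problem 2: exact matching of m/z values fails even for peaks
# that clearly belong to the same ion.
exact_matches = np.intersect1d(mz_a, mz_b)

# Matching within a tolerance recovers the shared peaks.
tol = 0.05
close = np.abs(mz_a[:, None] - mz_b[None, :]) <= tol
shared = mz_a[close.any(axis=1)]
```

In this toy case the second column would pair 198.20 with 206.08, no m/z values match exactly, and three of the four peaks in `mz_a` have a partner in `mz_b` once the tolerance is taken into account.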

Is PCA doomed?

Pre-processing of MS data

It appears as if we need to do some pre-processing of the spectra before PCA can be applied.

1. Re-sampling of mass spectra so that they share one common mass/charge vector.

2. Taking the uncertainty of the instrument into account, i.e. aligning peaks from different spectra that can be assumed to have the same mass/charge value.
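One simple way to carry out both steps is sketched below: all peak locations are pooled, peaks closer together than a tolerance are grouped, and each group becomes one column of the data matrix. This is an illustrative approach under stated assumptions, not necessarily the procedure used for the results on these slides; the function name `align_spectra`, the tolerance, and the peak values are hypothetical.

```python
import numpy as np

def align_spectra(spectra, tol=0.05):
    """Place peak lists on one common m/z axis.

    spectra: list of (mz, abundance) array pairs, one pair per spectrum.
    tol:     assumed instrument uncertainty in m/z (hypothetical value).
    Returns the common m/z vector and a matrix X with one row per spectrum.
    """
    all_mz = np.sort(np.concatenate([mz for mz, _ in spectra]))
    # Start a new group wherever the gap to the previous peak exceeds tol.
    group_id = np.concatenate([[0], np.cumsum(np.diff(all_mz) > tol)])
    n_groups = group_id[-1] + 1
    # Common m/z value of each group: the mean of its members.
    common_mz = np.bincount(group_id, weights=all_mz) / np.bincount(group_id)

    X = np.zeros((len(spectra), n_groups))
    for row, (mz, abundance) in enumerate(spectra):
        # Assign each peak to the nearest common m/z value.
        col = np.argmin(np.abs(mz[:, None] - common_mz[None, :]), axis=1)
        np.add.at(X[row], col, abundance)   # sum peaks that land in the same group
    return common_mz, X

spectra = [
    (np.array([192.10, 198.20, 212.20]), np.array([10.0, 40.0, 25.0])),
    (np.array([192.13, 212.17, 226.24]), np.array([12.0, 22.0, 30.0])),
]
common_mz, X = align_spectra(spectra)
```

After this step every column of `X` refers to the same mass/charge value in all spectra, with zeros where a spectrum has no peak. Grouping by gaps can chain together long runs of closely spaced peaks, so a real implementation would need a more careful alignment rule.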

So, if we do this (takes a bit of programming...), what do we get?

Traditional PCA, re-visited

[Figure: PCA score plot after pre-processing, $p_1$ (horizontal axis, -60 to 80) versus $p_2$ (vertical axis, -70 to 30).]

Traditional PCA, re-visited

So, it appears we have overcome the main problems. Now:

Replicate experiments on the same oils group together.

Oils with different chemical compositions are separated.

Oils with different, but somewhat similar properties, appear closer to each other in the plot.

Remaining problem:

The underlying chemical properties are hard to find. The new representation reveals patterns, but these are hard to interpret.

Alternative analysis strategy

Let’s go back to the example mass spectrum:

[Figure: mass spectrum of a crude oil sample. Horizontal axis: m/z [Da], 100 to 600. Vertical axis: normalised abundance, 0 to 50. Labelled peaks: 198.2, 212.2, 226.2.]

Alternative analysis strategy

Observation

It appears there are series of peaks, separated by a fixed mass/charge value.

A separation of n × 14 would mean the molecule has n extra CH2 groups.
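As a small illustration of this observation, the sketch below picks out the peaks that lie a whole number of 14-unit steps above a chosen starting mass, within a tolerance. The function name `series_members`, the tolerance, and the peak list are hypothetical.

```python
import numpy as np

def series_members(mz, start, spacing=14.0, tol=0.1):
    """Return (indices, n) for peaks lying at start + n*spacing within tol."""
    n = np.rint((mz - start) / spacing)              # nearest whole number of CH2 steps
    hit = (np.abs(mz - (start + n * spacing)) <= tol) & (n >= 0)
    return np.flatnonzero(hit), n[hit].astype(int)

# Hypothetical peak locations (made-up values for illustration).
mz = np.array([190.1, 192.1, 198.1, 206.1, 212.1, 220.1, 226.1])
idx, n = series_members(mz, start=192.1)
```

With these values the series starting at 192.1 contains the peaks at 192.1, 206.1 and 220.1, i.e. n = 0, 1, 2. The spacing is left as a parameter because 14 is the nominal value; the monoisotopic mass of CH2 is about 14.016 Da.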

Idea

Could we analyze the spectra in terms of series like these instead of eigenvectors of the covariance matrix (PCA)?

How would we take uncertainties of the instrument into account when doing this?

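Alternative analysis strategy

Let’s construct a new orthonormal basis for the spectra.

[Figure: basis vectors $u_1$, $u_2$, $u_3$ plotted against m/z [Da]. Each basis vector consists of windows of one peak width; the axis marks a starting mass m, offsets from it in multiples of four peak widths, and the repeat at m + 14.]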

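Alternative analysis strategy

We can now project our spectra onto this new basis, by

$$\mathbf{T} = \mathbf{U}^T \mathbf{X},$$

where the columns of $\mathbf{U}$ are the basis vectors from the previous slide and $\mathbf{T}$ contains the scores obtained by the projection, i.e. "how much of each basis function is present in each of the spectra". For this product to be defined, the spectra must here be stored as the columns of $\mathbf{X}$.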
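The construction and the projection can be sketched as follows. The rectangular window shape, the grid spacing, the peak width and the starting masses are all assumptions made for illustration; the slides only state that the basis is orthonormal and built from series of peaks with a given width.

```python
import numpy as np

def series_basis(mz_axis, starts, spacing=14.0, width=0.2):
    """Build one unit-norm basis vector per homologous series.

    Each vector is 1 inside windows of the given width centered on
    start + n*spacing and 0 elsewhere, then scaled to unit length.
    """
    U = np.zeros((mz_axis.size, len(starts)))
    for i, start in enumerate(starts):
        offset = (mz_axis - start) % spacing             # position within one period
        dist = np.minimum(offset, spacing - offset)      # distance to nearest series peak
        U[dist <= width / 2, i] = 1.0
        U[:, i] /= np.linalg.norm(U[:, i])
    return U

mz_axis = np.arange(190.0, 230.0, 0.05)                  # common m/z axis (assumed grid)
U = series_basis(mz_axis, starts=[192.1, 198.1, 190.1])

# Two synthetic spectra stored as columns of X: the first contains only
# peaks of the first series, the second only peaks of the second series.
X = np.zeros((mz_axis.size, 2))
X[:, 0] = 3.0 * U[:, 0]
X[:, 1] = 5.0 * U[:, 1]

T = U.T @ X                                              # scores, one column per spectrum
```

Because the windows of different series do not overlap, the columns of `U` are orthogonal, and after normalization `U.T @ U` is the identity. Each column of `T` then states how strongly each series is present in the corresponding spectrum.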

Alternative analysis strategy

[Figure: score plots for the oil samples in four panels, (a) to (d), showing combinations of the scores $t_1$, $t_2$ and $t_3$. Legend: F01o, F02t, F03t, F04t, F05t, F06o, F07o, F08t, F09t, F10t.]

Alternative analysis strategy

Observations

Replicates of the same oil group nicely.

Oils with similar chemical composition appear close to each other.

Chemically different oils will be separated.

So far we can see the same things as with PCA. So, what else?

Let’s look at the corresponding basis vectors.

Alternative analysis strategy

[Figure: the three basis vectors plotted against m/z [Da] over the range 190 to 230 (vertical axis 0 to 0.2), with peaks at:
  $u_1$: 192.1 ± 0.1, 206.1 ± 0.1, 220.1 ± 0.1
  $u_2$: 198.1 ± 0.1, 212.1 ± 0.1, 226.1 ± 0.1
  $u_3$: 190.1 ± 0.1, 204.1 ± 0.1, 218.1 ± 0.1]

Alternative analysis strategy

Observations

Looking at the mass/charge values of the peaks, a trained chemist can determine which chemical compound class these sequences correspond to.

In other words, in addition to being able to discriminate between chemically different oil samples, we can also interpret which types of chemical compounds cause this difference.

Future challenges

How can we model the original spectra based on this new analysis method?

How can we model how changing a process variable (in the preparation of the oil) will affect the composition?

We still need to develop various diagnostic and visualization tools to aid the chemist in the analysis of the results.

Thank you!
