Post on 04-Jan-2016

BIOSYST-MeBioS www.biw.kuleuven.be

The Potential of Functional Data Analysis for Chemometrics

Dirk De Becker, Wouter Saeys, Bart De Ketelaere and Paul Darius

The Potential of FDA for Chemometrics

Introduction to FDA

Introduction to Chemometrics

Using FDA in chemometrics

For prediction

For Analysis Of Variance

Conclusions

What is Functional Data Analysis?

Developed by Ramsay & Silverman (1997)

Analyse Data

By approximating it

Using some kind of functional basis

Mainly for longitudinal data

High correlation between neighbouring datapoints

Why use FDA?

Treat data as a single entity instead of as individual observations

Make a function of your data

Derivatives

Reduce the amount of data

Noise -> smoothing

Impose some known properties on the data

Monotonicity, non-negativeness, smoothness, ...
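The points above can be illustrated with a small sketch. It assumes SciPy's `make_lsq_spline` as the fitting routine and uses a synthetic noisy sine as a stand-in for a measured curve; the number of knots and the noise level are arbitrary choices for illustration, not values from the talk.

```python
import numpy as np
from scipy.interpolate import make_lsq_spline

rng = np.random.default_rng(0)

# Synthetic stand-in for one measured curve: 200 noisy samples of a smooth signal.
x = np.linspace(0.0, 1.0, 200)
y_true = np.sin(2 * np.pi * x)
y = y_true + rng.normal(scale=0.05, size=x.size)

# Cubic B-spline basis (degree 3) with 6 interior knots gives 10 coefficients.
degree = 3
interior = np.linspace(0.0, 1.0, 8)[1:-1]
knots = np.r_[[x[0]] * (degree + 1), interior, [x[-1]] * (degree + 1)]
curve = make_lsq_spline(x, y, knots, k=degree)

smoothed = curve(x)                        # least-squares fit averages out noise
first_derivative = curve.derivative()(x)   # derivatives come for free once a function exists

print(x.size, "samples ->", curve.c.size, "coefficients")
```

The 200 observations are replaced by 10 coefficients, and the fitted function can be evaluated or differentiated anywhere on the interval.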

Basis Functions?

Polynomials: 1, t, t², t³, ...

Fourier: 1, sin(ωt), cos(ωt), sin(2ωt), cos(2ωt), ...

Splines

Wavelets

Depends on your data
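As a minimal sketch of what "approximating data with a basis" means in practice, the code below builds polynomial and Fourier basis matrices and fits a curve by ordinary least squares. The function names and the synthetic signal are hypothetical and chosen only for illustration.

```python
import numpy as np

def polynomial_basis(t, n_terms):
    """Columns 1, t, t^2, ..., t^(n_terms - 1)."""
    return np.vander(t, n_terms, increasing=True)

def fourier_basis(t, n_harmonics, omega):
    """Columns 1, sin(wt), cos(wt), ..., sin(n*wt), cos(n*wt)."""
    cols = [np.ones_like(t)]
    for k in range(1, n_harmonics + 1):
        cols.append(np.sin(k * omega * t))
        cols.append(np.cos(k * omega * t))
    return np.column_stack(cols)

# Synthetic periodic signal that lies exactly in the span of the Fourier basis.
t = np.linspace(0.0, 1.0, 200)
y = 2.0 + np.sin(2 * np.pi * t) - 0.5 * np.cos(4 * np.pi * t)

Phi = fourier_basis(t, n_harmonics=2, omega=2 * np.pi)
coef, *_ = np.linalg.lstsq(Phi, y, rcond=None)   # basis coefficients
y_hat = Phi @ coef                               # reconstructed curve

P = polynomial_basis(t, n_terms=4)               # 1, t, t^2, t^3
```

For a periodic signal the Fourier basis reproduces the curve with five coefficients, which is the sense in which the choice of basis depends on the data.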

Chemometrics

Measure optical properties of material

Transmission or reflection of light

At a large number of wavelengths

Use these properties to predict something else

Why Chemometrics?

Fast

Cheap

Non-destructive

Environment-friendly

Classical methods

Ignore correlation between neighbouring wavelengths

FDA in chemometrics

NIR spectra

Absorption peaks

Width and height

Basis: B-splines

~ shape of absorption peaks

Preserve the vicinity constraint

Spline Functions

Piecewise polynomials of order m, joined at the knots

Fast evaluation

Continuity of derivatives

Up to order m-2

At the L interior knots

Degrees of freedom: L + m

Flexible
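The count of L + m degrees of freedom can be checked numerically. The sketch below assumes SciPy's `BSpline` class and a clamped knot vector (boundary knots repeated m times); the helper name `bspline_design_matrix` is hypothetical.

```python
import numpy as np
from scipy.interpolate import BSpline

def bspline_design_matrix(x, interior_knots, order, lo, hi):
    """Evaluate every B-spline basis function of the given order at the points x."""
    degree = order - 1
    knots = np.r_[[lo] * order, interior_knots, [hi] * order]
    n_basis = len(knots) - order          # equals L + m
    # Identity coefficients make each output column one basis function.
    return BSpline(knots, np.eye(n_basis), degree)(x)

x = np.linspace(0.0, 1.0, 101)
interior = np.linspace(0.0, 1.0, 7)[1:-1]   # L = 5 interior knots
B = bspline_design_matrix(x, interior, order=4, lo=0.0, hi=1.0)

print(B.shape)   # (101, 9): L + m = 5 + 4 basis functions
```

Each row of `B` sums to one, a general property of B-splines on a clamped knot vector.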

Constructing a spline basis

Order

What to use the model for

Mostly cubic splines (order 4)

Number and position of knots

Use enough

Look at the data

Beware of overfitting

Position of knots

More variation -> more knots

[Figure: two panels plotting "values" (1 to 5) against an axis running from 0 to 2000. Panel titles: "54 knots, equally spaced" and "54 knots, tuned".]
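The effect of knot placement can be sketched with a synthetic curve. The example below assumes SciPy's `make_lsq_spline`; the curve, the knot positions and the count of 11 interior knots (rather than the 54 used in the panels) are hypothetical values picked to keep the illustration small.

```python
import numpy as np
from scipy.interpolate import make_lsq_spline

x = np.linspace(0.0, 2000.0, 2001)
# Hypothetical curve: gentle baseline plus one narrow peak at 1000.
y = 1.0 + 0.0005 * x + 3.0 * np.exp(-0.5 * ((x - 1000.0) / 20.0) ** 2)

def fit_rmse(interior_knots, degree=3):
    """Least-squares cubic spline fit; returns the root-mean-square residual."""
    knots = np.r_[[x[0]] * (degree + 1), interior_knots, [x[-1]] * (degree + 1)]
    spline = make_lsq_spline(x, y, knots, k=degree)
    return float(np.sqrt(np.mean((spline(x) - y) ** 2)))

# Same number of interior knots, placed in two different ways.
equal_knots = np.linspace(0.0, 2000.0, 13)[1:-1]
tuned_knots = np.array([500.0, 940.0, 960.0, 980.0, 990.0, 1000.0,
                        1010.0, 1020.0, 1040.0, 1060.0, 1500.0])

rmse_equal = fit_rmse(equal_knots)
rmse_tuned = fit_rmse(tuned_knots)
print(f"equally spaced: {rmse_equal:.4f}, tuned: {rmse_tuned:.4f}")
```

With the same number of knots, concentrating them where the curve varies most gives the smaller residual.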

B-spline approximation

FDA for prediction

Functional regression models

P-Spline Regression (Marx and Eilers)

Non-Parametric Functional Data Analysis (Ferraty and Vieu)

Functional Regression Models

Project spectra to spline basis

Apply Multivariate Linear Regression to the spline coefficients

Great reduction in system complexity

Natural shape of absorption peaks is used
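The two steps above can be sketched as follows. This is an illustration on synthetic spectra, not the authors' implementation: the wavelength axis, peak shape, noise level and the 22 basis functions are assumptions, and the regression step is plain least squares with an intercept.

```python
import numpy as np
from scipy.interpolate import BSpline

rng = np.random.default_rng(1)

# Synthetic spectra: one absorption peak whose height depends on the property y.
wl = np.linspace(0.0, 1.0, 300)                   # normalised wavelength axis
n = 120
y = rng.uniform(5.0, 15.0, n)                     # hypothetical reference values
peak = np.exp(-0.5 * ((wl - 0.4) / 0.05) ** 2)
X = (0.2 + 0.1 * wl) + 0.05 * np.outer(y, peak) + rng.normal(scale=0.01, size=(n, wl.size))

# Step 1: project every spectrum onto a cubic B-spline basis.
order = 4
interior = np.linspace(0.0, 1.0, 20)[1:-1]        # 18 interior knots
knots = np.r_[[0.0] * order, interior, [1.0] * order]
n_basis = len(knots) - order                      # 22 basis functions
Phi = BSpline(knots, np.eye(n_basis), order - 1)(wl)
C = np.linalg.lstsq(Phi, X.T, rcond=None)[0].T    # (n, 22) spline coefficients

# Step 2: linear regression of y on the spline coefficients.
train, test = np.arange(0, 80), np.arange(80, n)
A_train = np.column_stack([np.ones(train.size), C[train]])
b = np.linalg.lstsq(A_train, y[train], rcond=None)[0]
y_pred = np.column_stack([np.ones(test.size), C[test]]) @ b

rmsep = float(np.sqrt(np.mean((y_pred - y[test]) ** 2)))
print(f"{wl.size} wavelengths -> {n_basis} coefficients, RMSEP = {rmsep:.3f}")
```

The regression works on 22 coefficients per spectrum instead of 300 wavelengths, which is the reduction in complexity the slide refers to.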

Functional Regression Models: case study

420 samples of hog manure

Reflectance spectra

Total nitrogen (TN) and dry matter (DM) content

PLS and Functional Regression applied

Functional Regression: case study (cont'd)

Functional Regression: case study: results

Dry matter content

            FDA       PLS       # B-splines   # latent variables
Dataset 1   10.4069   10.3282   22            6
Dataset 2    9.9084   10.565    20            6
Dataset 3   10.4921   10.4857   22            6
Dataset 4   10.4533   10.3236   22            6
Dataset 5    9.1203   10.6019   23            6

Total nitrogen content

            FDA       PLS       # B-splines   # latent variables
Dataset 1   1.1922    1.2603    25            6
Dataset 2   1.1582    1.1826    25            6
Dataset 3   1.1806    1.2325    25            6
Dataset 4   1.253     1.2852    25            6
Dataset 5   1.1562    1.2664    25            6

P-Spline Regression (PSR)

By Marx and Eilers

Construct $\beta$ with B-splines: $\beta = B\alpha$

Use roughness parameter $\lambda$ on $D\alpha$

Minimize $S = \|y - XB\alpha\|^2 + \lambda \|D\alpha\|^2$

Full spectra are used for regression
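A minimal sketch of the penalised least-squares step follows. It assumes that D is a second-order difference matrix, as is common for P-splines (the slide does not state the order), omits an intercept, and uses synthetic data; `bspline_basis`, `psr_fit` and `psr_objective` are hypothetical helper names.

```python
import numpy as np
from scipy.interpolate import BSpline

def bspline_basis(grid, n_basis, order=4):
    """B-spline basis matrix on `grid` with equally spaced knots."""
    lo, hi = grid[0], grid[-1]
    interior = np.linspace(lo, hi, n_basis - order + 2)[1:-1]
    knots = np.r_[[lo] * order, interior, [hi] * order]
    return BSpline(knots, np.eye(n_basis), order - 1)(grid)

def psr_objective(X, y, B, alpha, lam, diff_order=2):
    """S = |y - X B alpha|^2 + lam * |D alpha|^2."""
    D = np.diff(np.eye(B.shape[1]), n=diff_order, axis=0)
    return float(np.sum((y - X @ B @ alpha) ** 2) + lam * np.sum((D @ alpha) ** 2))

def psr_fit(X, y, B, lam, diff_order=2):
    """Closed-form minimiser of S; returns alpha and beta = B alpha."""
    U = X @ B
    D = np.diff(np.eye(B.shape[1]), n=diff_order, axis=0)
    alpha = np.linalg.solve(U.T @ U + lam * D.T @ D, U.T @ y)
    return alpha, B @ alpha

# Synthetic data: smooth true coefficient curve, random "spectra".
rng = np.random.default_rng(2)
grid = np.linspace(0.0, 1.0, 100)
beta_true = np.exp(-0.5 * ((grid - 0.5) / 0.15) ** 2)
X = rng.normal(size=(90, grid.size))
y = X @ beta_true + rng.normal(scale=0.1, size=90)

B = bspline_basis(grid, n_basis=12)
alpha, beta_hat = psr_fit(X[:60], y[:60], B, lam=0.001)
rmsep = float(np.sqrt(np.mean((X[60:] @ beta_hat - y[60:]) ** 2)))
print(f"RMSEP on held-out samples: {rmsep:.3f}")
```

The regression still uses every wavelength in X; only the coefficient curve is constrained to be a smooth combination of B-splines.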

P-Spline Regression: case study

121 samples of seed pills

y is % humidity

PLS: RMSEP = 1.19

PSR: RMSEP = 1.115

# B-spline coefficients = 7

λ = 0.001
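For reference, RMSEP is the root mean squared error of prediction on samples that were not used for calibration. The sketch below defines it with a hypothetical helper `rmsep` and works out the relative difference between the two values quoted above.

```python
import numpy as np

def rmsep(y_true, y_pred):
    """Root mean squared error of prediction."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))

# Relative difference between the two RMSEP values quoted on the slide.
rmsep_pls, rmsep_psr = 1.19, 1.115
relative_gain = (rmsep_pls - rmsep_psr) / rmsep_pls
print(f"{relative_gain:.1%}")   # 6.3%
```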

Non-Parametric Functional Data Analysis

By F. Ferraty and P. Vieu

No regression model is involved

Prediction by applying local kernel functions in function space

No good results so far
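The idea of kernel prediction in function space can be sketched with a Nadaraya-Watson estimator that weights training curves by their distance to the new curve. This is a simplified stand-in, not Ferraty and Vieu's method as such: the Gaussian kernel, the discretised L2 distance, the bandwidth and the synthetic curves are all assumptions.

```python
import numpy as np

def functional_kernel_predict(X_train, y_train, X_new, bandwidth):
    """Kernel-weighted average of y_train, with weights based on curve-to-curve distance."""
    diff = X_new[:, None, :] - X_train[None, :, :]
    dist = np.sqrt(np.mean(diff ** 2, axis=2))          # discretised L2 distance
    weights = np.exp(-0.5 * (dist / bandwidth) ** 2)    # Gaussian kernel
    return (weights @ y_train) / weights.sum(axis=1)

# Synthetic curves whose amplitude equals the response.
rng = np.random.default_rng(4)
grid = np.linspace(0.0, 1.0, 50)
y_train = rng.uniform(0.0, 1.0, 40)
X_train = np.outer(y_train, np.sin(np.pi * grid))

y_new = np.array([0.25, 0.75])
X_new = np.outer(y_new, np.sin(np.pi * grid))
y_pred = functional_kernel_predict(X_train, y_train, X_new, bandwidth=0.05)
```

No parametric regression model is fitted: the prediction for a new curve is a local average of the responses of similar training curves.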

FDA in Anova setting: FANOVA

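ANOVA:

“Study the relation between a response variable and one or more explanatory variables”

$x_{ig}(\lambda) = \mu(\lambda) + \alpha_g(\lambda) + \varepsilon_{ig}(\lambda)$

$\mu(\lambda)$ is the overall mean

$\alpha_g(\lambda)$ are the effects of belonging to a group g

$\varepsilon_{ig}(\lambda)$ are residuals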

FANOVA: theory

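Constraint: $\sum_g \alpha_g(\lambda) = 0$ for all $\lambda$ in the measured range

Introduce $Z$ and $\beta = [\mu, \alpha_1, \ldots, \alpha_g]^T$ so that $x(\lambda) = Z\beta(\lambda) + \varepsilon(\lambda)$

Introduce functional aspect: $x(\lambda) = C\phi(\lambda)$ and $\beta(\lambda) = B\phi(\lambda)$, with $\phi(\lambda)$ the vector of basis functions

Constraint: introduce $Z^*$, $C^*$, $x^*$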

FANOVA: goal and solution

Goal: estimate $B$ from $C$

Solution: $\hat{B} = (Z^{*T} Z^*)^{-1} Z^{*T} C^*$
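The least-squares solution can be sketched numerically for a one-way layout. The starred matrices are assumed here to be Z and C augmented with one extra row that encodes the zero-sum constraint on the group effects, a common textbook construction; the slide does not spell this out, and the data are synthetic.

```python
import numpy as np

rng = np.random.default_rng(5)
n_groups, n_rep, n_coef = 3, 5, 6
groups = np.repeat(np.arange(n_groups), n_rep)

# C: one row of basis coefficients per sample (hypothetical values).
group_level = rng.normal(size=(n_groups, n_coef))
C = group_level[groups] + rng.normal(scale=0.1, size=(groups.size, n_coef))

# Z: a column of ones for mu, then one indicator column per group.
Z = np.column_stack([np.ones(groups.size), np.eye(n_groups)[groups]])

# Assumed augmentation: an extra row that enforces sum_g alpha_g = 0.
constraint = np.r_[0.0, np.ones(n_groups)]
Z_star = np.vstack([Z, constraint])
C_star = np.vstack([C, np.zeros(n_coef)])

# B_hat = (Z*^T Z*)^(-1) Z*^T C*
B_hat = np.linalg.solve(Z_star.T @ Z_star, Z_star.T @ C_star)
mu_hat, alpha_hat = B_hat[0], B_hat[1:]
```

Without the extra row, Z has linearly dependent columns and the normal equations are singular; the augmented system has a unique solution in which the group effects sum to zero in every coefficient.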

FANOVA: significance testing

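Locally: $\mathrm{MSE}(\lambda) = \frac{1}{df(\mathrm{error})} \sum_{ig} [x_{ig}(\lambda) - Z_{ig}\hat{\beta}(\lambda)]^2$

Globally: a single statistic $M$, the supremum over $\lambda$ of $\mathrm{Contrast}(\lambda)$ scaled by $\mathrm{MSE}(\lambda)$ and $C$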
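The local and global quantities can be sketched for a one-way layout as below. The pointwise MSE follows the usual residual-sum-of-squares definition; the scaling of the contrast by the square root of a constant times the MSE is a standard t-type choice assumed here for illustration, not necessarily the exact form used in the talk. Data and the contrast vector are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(3)
n_groups, n_rep, n_wl = 3, 8, 50
groups = np.repeat(np.arange(n_groups), n_rep)
wl = np.linspace(0.0, 1.0, n_wl)

# Synthetic curves: the last group differs from the others around the middle of the range.
effect = np.zeros((n_groups, n_wl))
effect[2] = 0.5 * np.exp(-0.5 * ((wl - 0.5) / 0.1) ** 2)
X = 1.0 + effect[groups] + rng.normal(scale=0.1, size=(groups.size, n_wl))

# Fitted values of the one-way model are the group mean curves.
group_means = np.vstack([X[groups == g].mean(axis=0) for g in range(n_groups)])
residuals = X - group_means[groups]
df_error = X.shape[0] - n_groups
mse = (residuals ** 2).sum(axis=0) / df_error       # local: one MSE per wavelength

# Contrast "last group minus second group", scaled pointwise (assumed t-type scaling).
c = np.array([0.0, -1.0, 1.0])
contrast = c @ group_means
scale = np.sum(c ** 2 / n_rep)
t_curve = contrast / np.sqrt(scale * mse)
M = float(np.max(np.abs(t_curve)))                  # global: supremum over wavelengths
```

The curve `t_curve` shows where along the wavelength axis the groups differ, while `M` summarises the whole range in one number.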

FANOVA: case study

Spectra of manure

4 types of animals: dairy, beef, calf, hog

3 ambient temperatures: 4°C, 12°C, 20°C

3 sample temperatures: 4°C, 12°C, 20°C

9 replicates

=> 324 samples

Model: $I_{ijkl}(\lambda) = T_i(\lambda) + A_j(\lambda) + S_k(\lambda) + \varepsilon_{ijkl}(\lambda)$

$i \in [1,4]$, $j \in [1,3]$, $k \in [1,3]$, $l \in [1,9]$
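The size of the experiment and a matching design matrix can be sketched as follows. The dummy coding and the helper `one_hot` are illustrative assumptions; only the factor levels and the replicate count come from the slide.

```python
import itertools
import numpy as np

animals = ["dairy", "beef", "calf", "hog"]
ambient_temps = [4, 12, 20]      # degrees Celsius
sample_temps = [4, 12, 20]       # degrees Celsius
replicates = range(1, 10)

# Full factorial design: every combination of the three factors, 9 replicates each.
runs = list(itertools.product(animals, ambient_temps, sample_temps, replicates))
print(len(runs))   # 324

def one_hot(values, levels):
    """Indicator columns, one per level."""
    return np.array([[float(v == level) for level in levels] for v in values])

# One block of indicator columns per factor (main effects only).
Z = np.hstack([
    one_hot([r[0] for r in runs], animals),
    one_hot([r[1] for r in runs], ambient_temps),
    one_hot([r[2] for r in runs], sample_temps),
])
```

The three indicator blocks each sum to a column of ones, so `Z` has 10 columns but rank 8, which is why constraints on the effects are needed before the model can be estimated.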

FANOVA: case study (cont'd)

FANOVA: case study (cont'd)

Conclusions

Splines are a good basis for fitting spectral data

Using FDA, it is possible to include the vicinity constraint in prediction models in chemometrics

FANOVA is a good tool to explore the variance in spectral data
