BIOSYST-MeBioS www.biw.kuleuven.be
The Potential of Functional Data Analysis for Chemometrics
Dirk De Becker, Wouter Saeys, Bart De Ketelaere and Paul Darius
The Potential of FDA for Chemometrics
Introduction to FDA
Introduction to Chemometrics
Using FDA in chemometrics
For prediction
For Analysis Of Variance
Conclusions
What is Functional Data Analysis?
Developed by Ramsay & Silverman (1997)
Analyse Data
By approximating it
Using some kind of functional basis
Mainly for longitudinal data
High correlation between neighbouring datapoints
Why use FDA?
Treat the data as a single entity rather than as individual observations
Make a function of your data
Derivatives
Reduce the amount of data
Noise -> smoothing
Impose some known properties on the data
Monotonicity, non-negativity, smoothness, ...
Basis Functions?
Polynomials: 1, t, t², t³, ...
Fourier: 1, sin(ωt), cos(ωt), sin(2ωt), cos(2ωt), ...
Splines
Wavelets
Depends on your data
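As a rough illustration of what "approximating data with a functional basis" means, the sketch below builds the polynomial and Fourier bases listed above and fits a test signal by least squares. The function names, the test signal and the choice ω = 2π are assumptions made for this example only.

```python
import numpy as np

def polynomial_basis(t, n_terms):
    # Columns 1, t, t^2, ... (hypothetical helper for this sketch)
    return np.vander(t, n_terms, increasing=True)

def fourier_basis(t, n_harmonics, omega):
    # Columns 1, sin(wt), cos(wt), sin(2wt), cos(2wt), ...
    cols = [np.ones_like(t)]
    for k in range(1, n_harmonics + 1):
        cols.append(np.sin(k * omega * t))
        cols.append(np.cos(k * omega * t))
    return np.column_stack(cols)

# Made-up periodic signal on [0, 1]
t = np.linspace(0.0, 1.0, 200)
signal = 2.0 + np.sin(2 * np.pi * t) - 0.5 * np.cos(4 * np.pi * t)

# Least-squares coefficients of the signal in the Fourier basis
Phi = fourier_basis(t, n_harmonics=2, omega=2 * np.pi)
coef, *_ = np.linalg.lstsq(Phi, signal, rcond=None)
fit = Phi @ coef  # the functional approximation evaluated on the grid
```

Here 200 observations are summarised by five coefficients, which is the data reduction mentioned earlier; a periodic basis suits this signal, while spectra with local peaks usually call for a local basis.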
Chemometrics
Measure optical properties of material
Transmission or reflection of light
At a large number of wavelengths
Use these properties to predict something else
Why Chemometrics?
Fast
Cheap
Non-destructive
Environment-friendly
Classical methods
Ignore correlation between neighbouring wavelengths
FDA in chemometrics
NIR spectra
Absorption peaks
Width and height
Basis: B-splines
Resemble the shape of absorption peaks
Preserve the vicinity constraint
Spline Functions
Piecewise polynomials of order m, joined at the knots
Fast evaluation
Continuity of derivatives
Up to order m-2
At the L interior knots
Degrees of freedom: L + m
Flexible
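The degrees-of-freedom count can be checked numerically. The sketch below evaluates a B-spline basis with the Cox-de Boor recursion, using NumPy only; the helper name `bspline_basis`, the clamped knot vector and the 0 to 2000 range are assumptions for illustration, not the authors' implementation.

```python
import numpy as np

def bspline_basis(x, interior_knots, order, lo, hi):
    """Cox-de Boor recursion; returns an array of shape (len(x), L + order)."""
    x = np.asarray(x, dtype=float)
    # Clamped knot vector: each boundary knot is repeated `order` times.
    t = np.concatenate([[lo] * order, interior_knots, [hi] * order])
    n0 = len(t) - 1
    # Order 1 (piecewise constant): indicators of the half-open knot intervals.
    B = np.zeros((len(x), n0))
    for j in range(n0):
        B[:, j] = (t[j] <= x) & (x < t[j + 1])
    B[x == hi, n0 - order] = 1.0  # assign the right endpoint to the last interval
    # Raise the polynomial degree one step at a time.
    for k in range(1, order):
        Bk = np.zeros((len(x), n0 - k))
        for j in range(n0 - k):
            d1, d2 = t[j + k] - t[j], t[j + k + 1] - t[j + 1]
            if d1 > 0:
                Bk[:, j] += (x - t[j]) / d1 * B[:, j]
            if d2 > 0:
                Bk[:, j] += (t[j + k + 1] - x) / d2 * B[:, j + 1]
        B = Bk
    return B

x = np.linspace(0.0, 2000.0, 401)
interior = np.linspace(0.0, 2000.0, 12)[1:-1]  # L = 10 interior knots
m = 4                                          # cubic splines
Phi = bspline_basis(x, interior, m, 0.0, 2000.0)
n_basis = Phi.shape[1]                         # L + m basis functions
```

Each row of `Phi` is non-negative and sums to one, and each basis function is non-zero on at most m adjacent knot intervals, which is what keeps evaluation fast.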
Constructing a spline basis
Order
Depends on what the model is used for
Mostly cubic splines (order 4)
Number and position of knots
Use enough
Look at the data
Beware of overfitting
Position of knots
More variation -> more knots
[Figure: two panels plotting values (1 to 5) over the range 0 to 2000: "54 knots, equally spaced" and "54 knots, tuned"]
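One possible way to put more knots where the data vary more is to space them according to the curvature of a reference curve. The heuristic below is an assumption chosen for illustration (the slides do not say how the knots were tuned); `tuned_knots`, the `floor` parameter and the synthetic peak are all hypothetical.

```python
import numpy as np

def tuned_knots(x, y, n_knots, floor=0.05):
    # Knot density proportional to |second derivative|, plus a small floor
    # so that flat regions still receive some knots.
    curvature = np.abs(np.gradient(np.gradient(y, x), x))
    density = curvature / curvature.max() + floor
    cdf = np.cumsum(density)
    cdf = (cdf - cdf[0]) / (cdf[-1] - cdf[0])
    # Interior knots sit at equal steps of the cumulative density.
    targets = np.linspace(0.0, 1.0, n_knots + 2)[1:-1]
    return np.interp(targets, cdf, x)

x = np.linspace(0.0, 2000.0, 2001)
y = np.exp(-0.5 * ((x - 1500.0) / 30.0) ** 2)   # one narrow synthetic peak

equal = np.linspace(0.0, 2000.0, 56)[1:-1]      # 54 equally spaced knots
tuned = tuned_knots(x, y, 54)                   # 54 curvature-driven knots
n_near_peak_equal = np.sum((equal > 1400) & (equal < 1600))
n_near_peak_tuned = np.sum((tuned > 1400) & (tuned < 1600))
```

With the same budget of 54 knots, the tuned placement concentrates far more of them around the peak than the equally spaced one does.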
B-spline approximation
FDA for prediction
Functional regression models
P-Spline Regression (Marx and Eilers)
Non-Parametric Functional Data Analysis (Ferraty and Vieu)
Functional Regression Models
Project spectra to spline basis
Apply Multivariate Linear Regression to the spline coefficients
Great reduction in system complexity
Natural shape of absorption peaks is used
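The two steps above can be sketched on synthetic data as follows. Everything here is an assumption for illustration: the wavelength grid, the single Gaussian absorption peak whose height is the response, the noise level, the 20 interior knots, and the `bspline_basis` helper (a plain Cox-de Boor recursion).

```python
import numpy as np

def bspline_basis(x, interior_knots, order, lo, hi):
    """Cox-de Boor recursion; returns an array of shape (len(x), L + order)."""
    x = np.asarray(x, dtype=float)
    t = np.concatenate([[lo] * order, interior_knots, [hi] * order])
    n0 = len(t) - 1
    B = np.zeros((len(x), n0))
    for j in range(n0):
        B[:, j] = (t[j] <= x) & (x < t[j + 1])
    B[x == hi, n0 - order] = 1.0  # right endpoint belongs to the last interval
    for k in range(1, order):
        Bk = np.zeros((len(x), n0 - k))
        for j in range(n0 - k):
            d1, d2 = t[j + k] - t[j], t[j + k + 1] - t[j + 1]
            if d1 > 0:
                Bk[:, j] += (x - t[j]) / d1 * B[:, j]
            if d2 > 0:
                Bk[:, j] += (t[j + k + 1] - x) / d2 * B[:, j + 1]
        B = Bk
    return B

# Synthetic spectra: one peak whose height is the quantity to predict
rng = np.random.default_rng(0)
wavelengths = np.linspace(400.0, 1600.0, 200)
heights = rng.uniform(1.0, 2.0, size=80)
peak = np.exp(-0.5 * ((wavelengths - 1000.0) / 80.0) ** 2)
spectra = heights[:, None] * peak[None, :] + rng.normal(0.0, 0.01, (80, 200))

# Step 1: project every spectrum onto a cubic B-spline basis
interior = np.linspace(400.0, 1600.0, 22)[1:-1]               # 20 interior knots
Phi = bspline_basis(wavelengths, interior, 4, 400.0, 1600.0)  # (200, 24)
coefs = np.linalg.lstsq(Phi, spectra.T, rcond=None)[0].T      # (80, 24)

# Step 2: linear regression of the response on the spline coefficients
A_train = np.column_stack([np.ones(60), coefs[:60]])
w = np.linalg.lstsq(A_train, heights[:60], rcond=None)[0]
A_test = np.column_stack([np.ones(20), coefs[60:]])
pred = A_test @ w
rmsep = np.sqrt(np.mean((pred - heights[60:]) ** 2))
```

The regression works on 24 coefficients instead of 200 wavelengths, which is the reduction in system complexity referred to above.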
Functional Regression Models: case study
420 samples of hog manure
Reflectance spectra
Total nitrogen (TN) and dry matter (DM) content
PLS and Functional Regression applied
Functional Regression: case study (cont'd)
Functional Regression: case study: results
Dry matter content
             FDA       PLS       # B-splines   # latent variables
Dataset 1    10.4069   10.3282   22            6
Dataset 2     9.9084   10.565    20            6
Dataset 3    10.4921   10.4857   22            6
Dataset 4    10.4533   10.3236   22            6
Dataset 5     9.1203   10.6019   23            6
Total nitrogen content
             FDA       PLS       # B-splines   # latent variables
Dataset 1    1.1922    1.2603    25            6
Dataset 2    1.1582    1.1826    25            6
Dataset 3    1.1806    1.2325    25            6
Dataset 4    1.253     1.2852    25            6
Dataset 5    1.1562    1.2664    25            6
P-Spline Regression (PSR)
By Marx and Eilers
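Construct $\beta$ with B-splines: $\beta = B\alpha$
Use roughness parameter $\lambda$ on the differences $D\alpha$
Minimize $S = \lVert y - XB\alpha \rVert^2 + \lambda \lVert D\alpha \rVert^2$
Full spectra are used for regression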
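A minimal sketch of penalised signal regression in the spirit of Marx and Eilers, assuming the criterion has the usual form S = |y - XBα|² + λ|Dα|² with D a second-order difference matrix. The data are synthetic, and `fit_psr` and `bspline_basis` are hypothetical helper names; 7 B-spline coefficients are used here only because that is the number reported in the case study.

```python
import numpy as np

def bspline_basis(x, interior_knots, order, lo, hi):
    """Cox-de Boor recursion; returns an array of shape (len(x), L + order)."""
    x = np.asarray(x, dtype=float)
    t = np.concatenate([[lo] * order, interior_knots, [hi] * order])
    n0 = len(t) - 1
    B = np.zeros((len(x), n0))
    for j in range(n0):
        B[:, j] = (t[j] <= x) & (x < t[j + 1])
    B[x == hi, n0 - order] = 1.0  # right endpoint belongs to the last interval
    for k in range(1, order):
        Bk = np.zeros((len(x), n0 - k))
        for j in range(n0 - k):
            d1, d2 = t[j + k] - t[j], t[j + k + 1] - t[j + 1]
            if d1 > 0:
                Bk[:, j] += (x - t[j]) / d1 * B[:, j]
            if d2 > 0:
                Bk[:, j] += (t[j + k + 1] - x) / d2 * B[:, j + 1]
        B = Bk
    return B

def fit_psr(X, y, B, D, lam):
    # Solve the penalised normal equations (U'U + lam * D'D) alpha = U'y
    U = X @ B
    alpha = np.linalg.solve(U.T @ U + lam * D.T @ D, U.T @ y)
    return alpha, B @ alpha  # spline coefficients and regression curve beta

# Synthetic full spectra and response
rng = np.random.default_rng(1)
wl = np.linspace(0.0, 1.0, 100)
X = rng.normal(size=(60, 100))
y = X @ np.sin(2 * np.pi * wl) + rng.normal(0.0, 0.1, 60)

B = bspline_basis(wl, np.linspace(0.0, 1.0, 5)[1:-1], 4, 0.0, 1.0)  # (100, 7)
D = np.diff(np.eye(B.shape[1]), n=2, axis=0)                        # (5, 7)
alpha, beta = fit_psr(X, y, B, D, lam=0.001)
alpha_smooth, _ = fit_psr(X, y, B, D, lam=1e4)
```

Increasing λ can only reduce the roughness |Dα|² of the fitted coefficients, so λ trades fidelity to the data against smoothness of the regression curve.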
P-Spline Regression: case study
121 samples of seed pills
y is % humidity
PLS: RMSEP = 1.19
PSR: RMSEP = 1.115
# B-spline coefficients = 7
λ = 0.001
Non-Parametric Functional Data Analysis
By F. Ferraty and P. Vieu
No regression model is involved
Prediction by applying local kernel functions in function space
No good results so far
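The idea of kernel prediction in function space can be sketched with a Nadaraya-Watson estimator: the prediction for a new curve is a weighted average of the training responses, with weights that decay with the distance between curves. The Gaussian kernel, the discretised L2 distance and the name `kernel_predict` are simplifying assumptions; Ferraty and Vieu's own estimators use other kernels and semi-metrics.

```python
import numpy as np

def kernel_predict(X_train, y_train, X_new, bandwidth):
    # Discretised L2 distance between every new curve and every training curve
    diff = X_new[:, None, :] - X_train[None, :, :]
    dist = np.sqrt((diff ** 2).mean(axis=2))
    # Gaussian kernel weights, then a locally weighted average of the responses
    weights = np.exp(-0.5 * (dist / bandwidth) ** 2)
    return (weights @ y_train) / weights.sum(axis=1)

# Synthetic curves: sine waves whose amplitude is the response
t = np.linspace(0.0, 1.0, 100)
amplitudes = np.linspace(1.0, 2.0, 21)
X_train = amplitudes[:, None] * np.sin(2 * np.pi * t)[None, :]
y_train = amplitudes

X_new = 1.5 * np.sin(2 * np.pi * t)[None, :]
pred = kernel_predict(X_train, y_train, X_new, bandwidth=0.05)
```

No parametric link between curve and response is estimated; the bandwidth and the choice of distance carry all the modelling decisions.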
FDA in Anova setting: FANOVA
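ANOVA:
“Study the relation between a response variable and one or more explanatory variables”
$x_{ig}(\lambda) = \mu(\lambda) + \alpha_g(\lambda) + \varepsilon_{ig}(\lambda)$
$\mu(\lambda)$ is the overall mean
$\alpha_g(\lambda)$ are the effects of belonging to a group g
$\varepsilon_{ig}(\lambda)$ are residuals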
FANOVA: theory
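Constraint: $\sum_g \alpha_g(\lambda) = 0$, $\lambda \in [\lambda_1, \lambda_m]$
Introduce $Z$ and $\beta = [\mu, \alpha_1, \ldots, \alpha_g]^T$ so that $x(\lambda) = Z\beta(\lambda) + \varepsilon(\lambda)$
Introduce functional aspect: $x(\lambda) = C\phi(\lambda)$, $\beta(\lambda) = B\phi(\lambda)$
Constraint: introduce $Z^*$, $C^*$, $x^*$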
FANOVA: goal and solution
Goal: estimate $B$ from $C$
Solution: $\hat{B} = (Z^{*T} Z^*)^{-1} Z^{*T} C^*$
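A small numerical sketch of this estimate, assuming the solution is the least-squares formula B̂ = (Z*ᵀZ*)⁻¹Z*ᵀC*, where Z* is the design matrix with one extra row encoding the sum-to-zero constraint and C* is the coefficient matrix with a matching row of zeros. The group sizes, the number of basis functions and the synthetic coefficients are assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(2)
n_groups, n_per_group, n_basis = 4, 9, 12
groups = np.repeat(np.arange(n_groups), n_per_group)

# Synthetic basis coefficients C: overall mean + group effect + noise
true_mu = rng.normal(size=n_basis)
true_alpha = rng.normal(size=(n_groups, n_basis))
true_alpha -= true_alpha.mean(axis=0)           # effects sum to zero
C = true_mu + true_alpha[groups] + rng.normal(0.0, 0.05, (groups.size, n_basis))

# Design matrix Z: intercept column plus one indicator column per group
Z = np.column_stack([np.ones(groups.size), np.eye(n_groups)[groups]])

# Augment with the constraint row [0, 1, ..., 1] and a zero row in C
Z_star = np.vstack([Z, np.r_[0.0, np.ones(n_groups)]])
C_star = np.vstack([C, np.zeros(n_basis)])

# B_hat = (Z*' Z*)^-1 Z*' C*
B_hat = np.linalg.solve(Z_star.T @ Z_star, Z_star.T @ C_star)
mu_hat, alpha_hat = B_hat[0], B_hat[1:]
```

Without the extra row, Z'Z is singular because the indicator columns add up to the intercept column; the constraint row is what makes the inverse exist.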
FANOVA: significance testing
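Locally: $\mathrm{MSE}(\lambda) = \frac{1}{df(\mathrm{error})} \sum_{ig} [x_{ig}(\lambda) - Z_{ig}\hat{\beta}(\lambda)]^2$
Globally: $M = \sup_{\lambda} \mathrm{Contrast}(\lambda) / \mathrm{MSE}_C(\lambda)$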
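The local and global views can be illustrated as follows: a mean squared error is computed at every wavelength, and a contrast between two groups is standardised and summarised by its maximum over the spectrum. The standardisation used here (the textbook t-statistic for a difference of two group means) is an assumption, as are the synthetic data; the normalisation on the slide may differ.

```python
import numpy as np

rng = np.random.default_rng(3)
n_groups, n_per_group, n_wl = 3, 20, 50
sigma = 0.2
groups = np.repeat(np.arange(n_groups), n_per_group)

# Synthetic curves: groups 0 and 1 share a mean, group 2 drifts upwards
group_means = np.stack([np.zeros(n_wl), np.zeros(n_wl), np.linspace(0.0, 1.0, n_wl)])
x = group_means[groups] + rng.normal(0.0, sigma, (groups.size, n_wl))

# Local: residual mean squared error at each wavelength
fitted = np.stack([x[groups == g].mean(axis=0) for g in range(n_groups)])[groups]
df_error = groups.size - n_groups
mse = ((x - fitted) ** 2).sum(axis=0) / df_error

# Contrast curve: group 2 minus group 0, standardised pointwise (assumed form)
contrast = x[groups == 2].mean(axis=0) - x[groups == 0].mean(axis=0)
t_curve = contrast / np.sqrt(mse * (2.0 / n_per_group))

# Global: a single number, the largest standardised contrast over all wavelengths
M = np.max(np.abs(t_curve))
```

A reference distribution for M is not derived here; in practice it is often obtained by permuting the group labels.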
FANOVA: case study
Spectra of manure
4 types of animals: dairy, beef, calf, hog
3 ambient temperatures: 4°C, 12°C, 20°C
3 sample temperatures: 4°C, 12°C, 20°C
9 replicates
=> 324 samples
Model: $I_{ijkl}(\lambda) = T_i(\lambda) + A_j(\lambda) + S_k(\lambda) + \varepsilon_{ijkl}(\lambda)$
$i \in [1,4]$, $j \in [1,3]$, $k \in [1,3]$, $l \in [1,9]$
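The sample count and the design matrix for this layout can be reproduced in a few lines. The sketch assumes a full factorial design with main effects only; the helper `indicators` and the order of the columns are assumptions for illustration.

```python
import itertools
import numpy as np

animals = ["dairy", "beef", "calf", "hog"]
ambient_temps = [4, 12, 20]    # degrees Celsius
sample_temps = [4, 12, 20]     # degrees Celsius
replicates = range(1, 10)

# Every combination of animal type, ambient temperature, sample temperature, replicate
design = list(itertools.product(animals, ambient_temps, sample_temps, replicates))

def indicators(values, levels):
    # One 0/1 column per level (hypothetical helper)
    return np.array([[float(v == lev) for lev in levels] for v in values])

Z = np.hstack([
    indicators([d[0] for d in design], animals),
    indicators([d[1] for d in design], ambient_temps),
    indicators([d[2] for d in design], sample_temps),
])
rank = np.linalg.matrix_rank(Z)
```

`Z` has 10 columns but rank 8, because each block of indicator columns sums to one; this is why constraints on the effects are needed before the model can be estimated.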
FANOVA: case study (cont'd)
Conclusions
Splines are a good basis for fitting spectral data
Using FDA, it is possible to include the vicinity constraint in chemometric prediction models
FANOVA is a good tool to explore the variance in spectral data