Pls Tutorial (transcript; slides dated 12/9/2013, transcribed 7/26/2019)

Partial Least Squares
A tutorial
Lutgarde Buydens
Partial Least Squares
- Multivariate regression
- Multiple Linear Regression (MLR)
- Principal Component Regression (PCR)
- Partial Least Squares (PLS)
- Validation
- Preprocessing
Multivariate Regression

X (n × p) and Y (n × k):
Rows: cases, observations (analytical observations of different samples, experimental runs, persons, …)
Columns: variables, classes, tags (p: spectral variables, analytical measurements; k: class information, concentration, …)
X: independent variables (will always be available)
Y: dependent variables (to be predicted later from X)

[Figure: raw data, absorbance vs. wavenumber (cm-1), 2000–14000]
Y = f(X): predict Y from X
MLR: Multiple Linear Regression
PCR: Principal Component Regression
PLS: Partial Least Squares
From Univariate to Multiple Linear Regression (MLR)

Univariate least squares regression:
y = b0 + b1·x1 + e
b0: intercept, b1: slope

Multiple Linear Regression:
y = b0 + b1·x1 + b2·x2 + … + bp·xp + e
E = Y − Ŷ; the least squares fit maximizes r(y, ŷ)

[Figure: regression line in (x, y) and regression plane in (x1, x2, y)]
MLR: Multiple Linear Regression

y = b0 + b1·x1 + b2·x2 + … + bp·xp + e
E = Y − Ŷ

In matrix notation (a column of ones in X absorbs the intercept b0, giving p + 1 coefficients):
y(n×1) = X(n×p) b(p×1) + e(n×1)
Y(n×k) = X(n×p) B(p×k) + E(n×k)   (several dependent variables)

Least squares solution:
b = (XᵀX)⁻¹Xᵀy
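The least squares solution can be sketched numerically. A minimal illustration with made-up numbers (NumPy's linear solver stands in for forming the explicit inverse):

```python
import numpy as np

# Toy data: n = 5 cases, p = 2 variables (values are illustrative only).
X = np.array([[1.0, 2.0],
              [2.0, 1.0],
              [3.0, 4.0],
              [4.0, 3.0],
              [5.0, 5.0]])
y = np.array([5.1, 4.0, 11.2, 10.1, 15.0])

# A column of ones absorbs the intercept b0, giving p + 1 coefficients.
X1 = np.column_stack([np.ones(len(X)), X])

# Normal equations b = (X'X)^-1 X'y, solved without forming the inverse.
b = np.linalg.solve(X1.T @ X1, X1.T @ y)
y_hat = X1 @ b   # fitted values
```

The least squares residuals y − ŷ are orthogonal to every column of X, which is a quick sanity check on any implementation.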
MLR: Multiple Linear Regression

Disadvantages: the inverse (XᵀX)⁻¹ requires
- uncorrelated X-variables
- n ≥ p + 1

When r(x1, x2) → 1 the problem is ill-conditioned: the regression "fits a plane through a line"!
[Figure: y regressed on two nearly collinear variables x1 and x2]
Example: two nearly identical data sets

     Set A            Set B
  x1     x2        x1     x2        y
 3.76   3.67      3.76   3.67    11.29
-2.91  -2.87     -2.91  -2.87    -8.09
 0.21   0.23      0.23   0.23     2.19
 5.55   5.49      5.55   5.49    19.09
 3.25   3.23      3.25   3.23    10.33
-0.99  -1.01     -0.99  -1.01    -1.89

Model: y = b1·x1 + b2·x2 + e

        Set A              Set B
        b1      b2         b1     b2
MLR    10.3   -6.92       2.96   0.28
       R² = 0.98          R² = 0.98

The sets differ in a single value of x1 (0.21 vs. 0.23), yet the MLR coefficients change completely.
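This instability is easy to reproduce. A sketch with synthetic data in the spirit of the Set A / Set B example (not the exact table values):

```python
import numpy as np

# Synthetic stand-in for the Set A / Set B experiment: x2 is almost a
# copy of x1, so (X'X) is nearly singular.
rng = np.random.default_rng(0)
x1 = rng.normal(size=20)
x2 = x1 + 1e-3 * rng.normal(size=20)        # r(x1, x2) -> 1
y = x1 + x2 + 0.01 * rng.normal(size=20)    # true b1 = b2 = 1

def mlr(x1, x2, y):
    X = np.column_stack([x1, x2])
    return np.linalg.lstsq(X, y, rcond=None)[0]

b_A = mlr(x1, x2, y)

# "Set B": nudge one single x-value, as in the 0.21 -> 0.23 change above.
x1_b = x1.copy()
x1_b[2] += 2e-3
b_B = mlr(x1_b, x2, y)

# b_A and b_B can be completely different, yet both fits are excellent:
# only the stable sum b1 + b2 is well determined.
```

The individual coefficients wander along the ill-conditioned direction (b1 − b2), while their sum, which drives the predictions, stays near 2.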
MLR: Multiple Linear Regression

Disadvantages: (XᵀX)⁻¹ requires uncorrelated X-variables and n ≥ p + 1, but X is often wide (p much larger than n).

Remedies:
- Dimension reduction / variable selection
- Latent variables (PCR, PLS)
PCR: Principal Component Regression

Step 1: Perform PCA on the original X.
Step 2: Use the orthogonal PC scores T (n rows, a columns) as independent variables in an MLR model for y.
Step 3: Calculate the b-coefficients (b0, b1, …, bp) from the a-coefficients.

[Diagram: X (n × p) → PCA → T (n × a) → MLR → y; Step 3 maps the a-coefficients back to b0 … bp]
PCR: Principal Component Regression

Dimension reduction: use the scores (projections) on latent variables that explain maximal variance in X.
[Figure: PC1 direction in the space of x1, x2, …, xp]
PCR: Principal Component Regression

Step 0: Mean-center X.
Step 1: Perform PCA: X = TPᵀ, keeping a components: X* = (TPᵀ)*.
Step 2: Perform MLR on the scores: Y = TA, so A = (TᵀT)⁻¹TᵀY.
Step 3: Calculate B for the original variables: Y = X*B = (TPᵀ)B, so A = PᵀB and B = PA (the loadings are orthonormal, PᵀP = I).
Calculate the intercepts b0 from the means of y and X.

The MLR is performed on the reconstructed X* = (TPᵀ)*.
PCR: Principal Component Regression

Optimal number of PCs: calculate the cross-validation RMSE for different numbers of PCs:
RMSECV = √( Σᵢ (yᵢ − ŷᵢ)² / n )
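Steps 0 to 3 can be sketched in a few lines of NumPy. This is an illustration, not a library implementation; `pcr_fit` is a made-up name:

```python
import numpy as np

def pcr_fit(X, y, a):
    """PCR with a components: PCA on mean-centered X, MLR on the
    scores, then back-transformation of the coefficients (Steps 0-3)."""
    x_mean, y_mean = X.mean(axis=0), y.mean()
    Xc = X - x_mean                                   # Step 0: mean-center
    U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
    T = U[:, :a] * s[:a]                              # Step 1: scores, X = T P'
    P = Vt[:a].T                                      # loadings (orthonormal)
    A = np.linalg.solve(T.T @ T, T.T @ (y - y_mean))  # Step 2: A = (T'T)^-1 T'y
    b = P @ A                                         # Step 3: B = P A
    b0 = y_mean - x_mean @ b                          # intercept from the means
    return b0, b

# With all components kept, PCR reproduces the MLR fit exactly.
rng = np.random.default_rng(1)
X = rng.normal(size=(30, 6))
y = X @ np.array([1.0, -2.0, 0.5, 0.0, 0.0, 0.0]) + 3.0
b0, b = pcr_fit(X, y, a=6)
```

In practice a is chosen smaller than p (via RMSECV, as above), which is exactly what stabilises the regression for collinear X.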
PLS: Partial Least Squares Regression

[Diagram:
Phase 1: X (n × p) and Y (n × k) → PLS → scores T (n × a)
Phase 2: MLR of y on T → a-coefficients a1 … aa
Phase 3: back-transformation to b0, b1, …, bp]
PLS: Partial Least Squares Regression

Projection to Latent Structures:
PCR uses PCs: maximizes the variance in X.
PLS uses LVs: maximizes the covariance(X, y) = varX · vary · cor(X, y)
[Figure: PC1 (PCR) vs. LV1 = w (PLS) in the space of x1, x2, …, xp]
PLS: Partial Least Squares Regression

Sequential algorithm: latent variables and their scores are calculated sequentially.

Phase 1: Calculate the new independent variables (T).
Step 0: Mean-center X.
Step 1: Calculate w1, the LV that maximizes covariance(X, Y), from an SVD of XᵀY:
(XᵀY)(p×k) = W(p×a) D(a×a) Zᵀ(a×k);  w1 = 1st column of W
[Figure: direction w1 in the space of x1, x2, …, xp]
PLS: Partial Least Squares Regression

Step 1 (cont.): w1 = 1st column of W from the SVD (XᵀY)(p×k) = W D Zᵀ.
Step 2: Calculate t1, the scores (projections) of X on w1:
t(n×1) = X(n×p) w(p×1)
[Figure: projection of the data onto w in the space of x1, x2, …, xp]
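Steps 0 to 2 can be sketched for a single response y (`pls_first_lv` is an illustrative name, not a library function):

```python
import numpy as np

def pls_first_lv(X, y):
    """First PLS weight vector and its scores (Steps 0-2 above),
    sketched for a single dependent variable y."""
    Xc = X - X.mean(axis=0)              # Step 0: mean-center
    yc = y - y.mean()
    xty = (Xc.T @ yc).reshape(-1, 1)     # X'Y, here a p x 1 matrix
    W, d, Zt = np.linalg.svd(xty, full_matrices=False)  # X'Y = W D Z'
    w1 = W[:, 0]                         # Step 1: 1st column of W, ||w1|| = 1
    t1 = Xc @ w1                         # Step 2: scores of X on w1
    return w1, t1

rng = np.random.default_rng(2)
X = rng.normal(size=(25, 4))
y = X[:, 0] + 2.0 * X[:, 1] + 0.01 * rng.normal(size=25)
w1, t1 = pls_first_lv(X, y)
```

For a single y the SVD collapses to normalising Xᵀy, which makes the "maximal covariance with y" interpretation of w1 explicit.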
Optimal number of LVs: calculate the cross-validation RMSE for different numbers of LVs:
RMSECV = √( Σᵢ (yᵢ − ŷᵢ)² / n )
PLS: Partial Least Squares Regression. The example revisited:

     Set A            Set B
  x1     x2        x1     x2        y
 3.76   3.67      3.76   3.67    11.29
-2.91  -2.87     -2.91  -2.87    -8.09
 0.21   0.23      0.23   0.23     2.19
 5.55   5.49      5.55   5.49    19.09
 3.25   3.23      3.25   3.23    10.33
-0.99  -1.01     -0.99  -1.01    -1.89

Model: y = b1·x1 + b2·x2 + e

        Set A              Set B
        b1      b2         b1     b2
MLR    10.3   -6.92       2.96   0.28
PCR    1.60    1.62       1.60   1.62
PLS    1.60    1.62       1.60   1.62

Unlike MLR, PCR and PLS give the same stable coefficients for both data sets.
MLR, PCR, PLS: VALIDATION

Estimating prediction error.
Basic principle: test how well your model works with new data it has not seen yet!
Common measure for prediction error: the RMSEP, computed on samples not used to build the model.
A Biased Approach

The prediction error of the samples the model was built on is biased!
Those samples were also used to build the model, so the model is biased towards accurate prediction of these specific samples.
Validation: Basic Principle

Test how well your model works with new data it has not seen yet!
Split the data into a training and a test set. Several ways:
- One large test set
- Leave one out and repeat (LOO)
- Leave n objects out and repeat (LnO)
- …
Apply the entire model procedure on the test set.
Validation

[Diagram: full data set → training set + test set.
Build the model (b0 … bp) on the training set; predict y for the test set → RMSEP.]
Remark: for the final model, use the whole data set.
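The split-and-assess scheme in the diagram can be sketched with plain MLR standing in as the model (synthetic data; RMSEP is computed only on the held-out test rows):

```python
import numpy as np

rng = np.random.default_rng(3)
X = rng.normal(size=(40, 3))
y = X @ np.array([1.0, 0.5, -1.0]) + 2.0 + 0.1 * rng.normal(size=40)

# Split the full data set into a training set and a test set.
idx = rng.permutation(len(y))
train, test = idx[:30], idx[30:]

# Build the model (b0 ... bp) on the training set only.
X1 = np.column_stack([np.ones(len(train)), X[train]])
b = np.linalg.lstsq(X1, y[train], rcond=None)[0]

# Predict the unseen test samples and compute RMSEP.
y_hat = np.column_stack([np.ones(len(test)), X[test]]) @ b
rmsep = np.sqrt(np.mean((y[test] - y_hat) ** 2))
```

Because the test rows never enter the fit, RMSEP here estimates the error on new data rather than the (biased) error on the training samples.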
Training and Test Sets

Split into a training and a test set:
- The test set should be representative of the training set.
- A random choice is often the best.
- Check for extremely unlucky divisions.
- Apply the whole procedure on the test and validation sets.
Cross-Validation

Simplest case: Leave-One-Out (LOO; segment = 1 sample). Normally 10–20% is left out (LnO).
Remark: for the final model, use the whole data set.
Cross-validation: an example
The data
Cross-validation: an example

- Split the data into a training set and a validation set.
- Build a model on the training set and predict the validation samples.
- Split the data again into a (new) training set and validation set.
- Repeat until all samples have been in the validation set once.
- Common: Leave-One-Out (LOO).
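The LOO loop above can be sketched directly (MLR as the model, synthetic data; `loo_rmsecv` is an illustrative name):

```python
import numpy as np

def loo_rmsecv(X, y):
    """Leave-One-Out cross-validation (sketch): each sample is
    predicted once by a model built on all the other samples."""
    n = len(y)
    errors = np.empty(n)
    for i in range(n):
        keep = np.arange(n) != i                 # leave sample i out
        X1 = np.column_stack([np.ones(keep.sum()), X[keep]])
        b = np.linalg.lstsq(X1, y[keep], rcond=None)[0]
        y_hat = np.concatenate([[1.0], X[i]]) @ b
        errors[i] = y[i] - y_hat
    return np.sqrt(np.mean(errors ** 2))

rng = np.random.default_rng(4)
X = rng.normal(size=(20, 2))
y = X @ np.array([1.0, -1.0]) + 0.05 * rng.normal(size=20)
rmsecv = loo_rmsecv(X, y)
```

Every sample ends up in the "validation set" exactly once, which is the defining property of the procedure.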
Cross-validation: a warning

Data: 13 × 5 = 65 NIR spectra (1102 wavelengths).
13 samples with different compositions of NaOH, NaOCl and Na2CO3; each sample measured at 5 temperatures.

Composition   NaOH (wt%)   NaOCl (wt%)   Na2CO3 (wt%)   Temperature (°C)
 1              18.99         0             0            15 21 27 34 40
 2               9.15         9.99          0.15         15 21 27 34 40
 3              15.01         0             4.01         15 21 27 34 40
 4               9.34         5.96          3.97         15 21 27 34 40
 …
13              16.02         2.01          1.00         15 21 27 34 40

[Diagram: X is 65 × 1102; the 65 rows come from 13 samples measured at 5 temperatures. When cross-validating, leave the whole SAMPLE out (all 5 of its spectra), not single spectra.]
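The pitfall can be made concrete: the split must keep all five spectra of a chemical sample together, otherwise near-copies of every validation spectrum remain in the training set and the error estimate is far too optimistic. A sketch with the shapes from the slide (`leave_sample_out_splits` is a made-up helper):

```python
import numpy as np

def leave_sample_out_splits(groups):
    """Yield (train_mask, val_mask) pairs that leave one whole SAMPLE
    (all of its replicate spectra) out at a time."""
    for s in np.unique(groups):
        val = groups == s
        yield ~val, val

# 13 samples x 5 temperatures = 65 rows, as on the slide; in the real
# data each row would be a 1102-point spectrum.
n_samples, n_temp = 13, 5
groups = np.repeat(np.arange(n_samples), n_temp)   # sample id per row

splits = list(leave_sample_out_splits(groups))
# Every validation set holds the 5 spectra of exactly one sample.
```

Leaving out single rows instead would validate each spectrum against four almost identical spectra of the same sample.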
Selection of the Number of LVs

Through validation: choose the number of LVs that results in the model with the lowest prediction error.
The test set used to assess the final model cannot be used for this!
Divide the training set further: cross-validation.

[Diagram: full data set → training set + test set.
1) Determine the number of LVs with a separate test set.
2) Build the model (b0 … bp); predict y → RMSEP.]
Remark: for the final model, use the whole data set.
Double Cross-Validation

[Diagram: full data set → training set, with two nested cross-validations (CV1, CV2).
1) Determine the number of LVs: CV inner loop.
2) Build the model and estimate the error: CV outer loop → b0 … bp, predict y → RMSEP.]
Remark: for the final model, use the whole data set.
Double Cross-Validation

The data: split the data into a training set and a validation set.
The validation set is used later to assess model performance!
Double Cross-Validation

Apply cross-validation on the rest: split the training set into a (new) training set and a test set.
Fit models with 1, 2, 3, … LVs in every split and choose the number of LVs with the lowest RMSECV.
Cross-validation: an example (continued)

Repeat the procedure until all samples have been in the validation set once.
Double Cross-Validation

In this way:
- The number of LVs is determined using samples not used to build the model.
- The prediction error is also determined using samples the model has not seen before.
Remark: for the final model, use the whole data set.
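The two nested loops can be sketched compactly (LOO in both loops; PCR stands in for PLS to keep the example short; all names and data are illustrative):

```python
import numpy as np

def pcr_predict(Xtr, ytr, Xte, a):
    """Fit PCR with a components on (Xtr, ytr) and predict Xte."""
    xm, ym = Xtr.mean(axis=0), ytr.mean()
    U, s, Vt = np.linalg.svd(Xtr - xm, full_matrices=False)
    P = Vt[:a].T
    T = (Xtr - xm) @ P
    A = np.linalg.solve(T.T @ T, T.T @ (ytr - ym))
    return ym + (Xte - xm) @ (P @ A)

def double_cv_rmsep(X, y, max_a=4):
    """Outer LOO loop: estimate the prediction error on a held-out
    sample. Inner LOO loop: pick the number of components using only
    the remaining samples."""
    n = len(y)
    errors = np.empty(n)
    for i in range(n):                       # outer loop: hold out sample i
        rest = np.arange(n) != i
        Xr, yr = X[rest], y[rest]
        inner = []
        for a in range(1, max_a + 1):        # inner loop: RMSECV per a
            e = [yr[j] - pcr_predict(np.delete(Xr, j, axis=0),
                                     np.delete(yr, j), Xr[j:j + 1], a)[0]
                 for j in range(len(yr))]
            inner.append(np.sqrt(np.mean(np.square(e))))
        best_a = 1 + int(np.argmin(inner))   # lowest inner RMSECV
        errors[i] = y[i] - pcr_predict(Xr, yr, X[i:i + 1], best_a)[0]
    return np.sqrt(np.mean(errors ** 2))

rng = np.random.default_rng(6)
X = rng.normal(size=(15, 4))
y = X[:, 0] - X[:, 1] + 0.05 * rng.normal(size=15)
rmsep = double_cv_rmsep(X, y)
```

The held-out sample of the outer loop never influences the choice of the number of components, which is the whole point of the double scheme.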
Raw + Mean-Centered Data

[Figure left: raw data, absorbance (a.u.) vs. wavenumber (cm-1), 2000–14000]
[Figure right: mean-centered data, absorbance (a.u.) vs. wavenumber (cm-1)]
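Mean-centering, the "Step 0" used throughout, is one operation on the spectra matrix. A sketch with random stand-in spectra:

```python
import numpy as np

# Random stand-ins for 10 spectra measured at 100 wavenumbers.
rng = np.random.default_rng(7)
spectra = 1.0 + rng.random((10, 100))

# Subtract the mean spectrum (column-wise mean): every wavenumber
# then varies around zero, as in the right-hand figure.
mean_spectrum = spectra.mean(axis=0)
centered = spectra - mean_spectrum
```

After centering, the column means are exactly zero, so PCR/PLS model only the variation around the mean spectrum.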
PLS: an example

[Figure: RMSECV vs. number of LVs (1–10) for the prediction of NaOH]
Regression Coefficients

[Figure left: raw data, absorbance (a.u.) vs. wavenumber (cm-1)]
[Figure right: regression coefficients vs. wavenumber (cm-1)]
True vs. Predicted

[Figure: predicted NaOH vs. true NaOH (range 8–20)]
Why Pre-Processing? Data Artefacts

[Figure: original spectrum with offset, slope and scatter artefacts; intensity (a.u.) vs. wavelength (cm-1)]

- Baseline correction, alignment, scatter correction, noise removal
- Scaling, normalisation, transformation
- …
- Other: missing values, outliers
[Figure: simulated spectra, intensity (a.u.) vs. wavelength (a.u.), 0–1600: original, offset, offset + slope, multiplicative, and offset + slope + multiplicative artefacts]
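The artefacts in the figure are easy to simulate on a toy spectrum (a sketch; the Gaussian band and the artefact sizes are made up), and a first-order polynomial fit illustrates simple detrending for the offset + slope case:

```python
import numpy as np

# Toy "spectrum": one Gaussian band on a wavelength axis 0..1600 (a.u.).
x = np.linspace(0.0, 1600.0, 400)
original = np.exp(-0.5 * ((x - 800.0) / 120.0) ** 2)

offset = original + 0.1                 # constant baseline offset
slope = original + 0.1 + 2e-4 * x       # offset + linear slope
multiplicative = 1.3 * original         # multiplicative (scatter-like)

# Detrending: fit a straight line to the artefact-laden spectrum and
# subtract it. This targets offset and slope; multiplicative/scatter
# effects need methods such as SNV or MSC instead.
coef = np.polyfit(x, slope, 1)
detrended = slope - np.polyval(coef, x)
```

Matching the correction to the artefact type is exactly what the pre-processing steps on the next slides organise.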
Pre-Processing Methods

4914 combinations, all reasonable:

STEP 1: (7x) BASELINE
- No baseline correction
- (3x) Detrending, polynomial order 2, 3, 4
- (2x) Derivatisation (1st, 2nd)
- AsLS

STEP 2: (10x) SCATTER
- No scatter correction
- (4x) Scaling: mean, median, max, L2 norm
- SNV
- (3x) RNV (15, 25, 35)%
- MSC

STEP 3: (10x) NOISE
- No noise removal
- (9x) Savitzky-Golay smoothing (window: 5, 9, 11 points; order: 2, 3, 4)

STEP 4: (7x) SCALING & TRANSFORMATION
- Mean centering
- Autoscaling
- Range scaling
- Pareto scaling
- Poisson scaling
- Level scaling
- Log transformation
Supervised Pre-Processing Methods

OSC and DOSC, combined with no noise removal and the scaling/transformation options (mean centering, autoscaling, range scaling, Pareto scaling, Poisson scaling, level scaling, log scaling).
Pre-Processing Results

[Figure: classification accuracy (%) vs. complexity of the model (number of LVs), with the raw-data result marked]
J. Engel et al., TrAC 2013
SOFTWARE

- PLS Toolbox (Eigenvector Inc.), www.eigenvector.com: for use in MATLAB (or standalone!)
- XLSTAT-PLS (XLSTAT), www.xlstat.com: for use in Microsoft Excel
- Package pls for R: free software, http://cran.r-project.org