gaps in our knowledge? a central problem in data modelling and how to get round it rob harrison...
Post on 20-Dec-2015
215 Views
Preview:
TRANSCRIPT
GAPS IN OUR KNOWLEDGE?
a central problem in data modelling and how to get round
itRob Harrison
AC&SE
Agenda sampling
density & distribution representation
accuracy vs generality regularization
trading off accuracy and generality cross validation
what happens when we can’t see
The Data Modelling Problem
y= f(x) z=y+e multivariate & non-linear
measurement errors {xi, zi} i = 1:N zi = f(xi)+ei infer behaviour everywhere from a
few examples little or no prior information on f(x)
A Simple Example
one cycle of a sine wave “well-spaced” samples enough data (N = 6) noise-free
0 0.2 0.4 0.6 0.8 1-0.5
0
0.5
1
1.5y
x
0 0.2 0.4 0.6 0.8 1-0.5
0
0.5
1
1.5y
x
0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1-0.5
0
0.5
1
1.5
“observational” data
0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9-0.4
-0.2
0
0.2
0.4
0.6
0.8
1
1.2
1.4
1.6
0 0.2 0.4 0.6 0.8 1-0.5
0
0.5
1
1.5y
x
0 0.2 0.4 0.6 0.8 1-0.5
0
0.5
1
1.5y
x
0 0.2 0.4 0.6 0.8 1-0.5
0
0.5
1
1.5y
x
Sparsely Sampled
two well-spaced samples
0 0.2 0.4 0.6 0.8 1-0.5
0
0.5
1
1.5y
x
0 0.2 0.4 0.6 0.8 1-0.2
0
0.2
0.4
0.6
0.8
1
1.2y
x
0 0.2 0.4 0.6 0.8 1-0.4
-0.2
0
0.2
0.4
0.6
0.8
1
1.2
1.4y
x
Densely Sampled
200 well-spaced samples
0 0.2 0.4 0.6 0.8 1-0.5
0
0.5
1
1.5y
x
Large N
200 poorly-spaced samples
0 0.2 0.4 0.6 0.8 1-0.5
0
0.5
1
1.5y
x
What’s So Hard?
the gaps get more (well-spaced) data
lack of prior knowledge can’t see
dimension!!
Two Dimensions
same sampling density (N = 62) well-spaced?
-3-2
-10
12
3
-2
0
2
-5
0
5
x
Peaks
y
-3-2
-10
12
3
-2
0
2
-6
-4
-2
0
2
4
6
8
-3-2
-10
12
3
-2
0
2
-6
-4
-2
0
2
4
6
8
x
Peaks
y
Dimensionality
lose ability to see the shape of f(x) try it in 13-D
number of samples exponential in d if N OK in 1-D, Nd needed in d-D
how do we know if “well-spaced”? how can we sample where the action is? observational vs experimental data!
Generic Structure
use e.g. a power series Stone-Weierstrass other bases e.g. Fourier series …
y = a5x5 + a4x4 + a3x3 + …+ a1x + a0
= 5 & six samples – no error!
0 0.2 0.4 0.6 0.8 1-1
-0.5
0
0.5
1
1.5
2y
x
PROBLEM SOLVED!
Generic Structure
use e.g. a power series Stone-Weierstrass other bases e.g. Fourier series …
y = a5x5 + a4x4 + a3x3 + …+ a1x + a0
= 5 & six samples – no error! > 5 still no error but …
0 0.2 0.4 0.6 0.8 1-1
-0.5
0
0.5
1
1.5
poor inter-sample behaviour
how can we know without looking?
Generic Structure
use e.g. a power series Stone-Weierstrass other bases e.g. Fourier series …
y = a5x5 + a4x4 + a3x3 + …+ a1x + a0
= 5 & six samples – no error! > 5 still no error but … measurement error
0 0.2 0.4 0.6 0.8 1-1
-0.5
0
0.5
1
1.5
Generic Structure
use e.g. a power series Stone-Weierstrass other bases e.g. Fourier series …
y = a5x5 + a4x4 + a3x3 + …+ a1x + a0
= 5 & six samples – no error! > 5 still no error but … measurement error model is as complex as data
Curse Of Dimension we can still use the idea but … in 2-D we get 21 terms
direct and cross products in d-D we get (d+)!/d!! e.g. transform a 16x16 bitmap by =3
polynomial and get ~ 3 million terms sample size / distribution practical for “small” problems
Other Basis Functions
Gaussian radial basis functions additional design choices
how many?, where?, how wide? adaptive sigmoidal basis functions
how many?
0 0.2 0.4 0.6 0.8 1-0.5
0
0.5
1
1.5
0 0.2 0.4 0.6 0.8 1-1
-0.5
0
0.5
1
1.5
zero error?
Overfitting (sample data)
rough components
0 0.2 0.4 0.6 0.8 1-0.5
0
0.5
1
1.5
0 0.2 0.4 0.6 0.8 1-400
-200
0
200
400
Underfitting (sample data)
over-smooth components
0 0.2 0.4 0.6 0.8 1-0.5
0
0.5
1
1.5
0 0.2 0.4 0.6 0.8 1-1
-0.5
0
0.5
1
v. small error
“just right” components
Goldilocks
Restricting “Flexibility” can we use data to tell the
estimator how to behave? regularization/penalization
penalize roughness e.g. SSE + Q
use potentially complex structure data constrains where it can Q constrains elsewhere
0 0.2 0.4 0.6 0.8 1-2
0
2
4
0 0.2 0.4 0.6 0.8 1-2
0
2
4
four narrow grbfs
RMSE = 0.80
0 0.2 0.4 0.6 0.8 1-2
-1
0
1
2
0 0.2 0.4 0.6 0.8 1-2
-1
0
1
four narrow grbfs + penalty
RMSE = 0.24
How Well Are We Doing?
so far we know answer for d > 2 we are “blind” what’s happening between samples? test on new samples in between
VALIDATION compute a performance index
GENERALIZATION
0 0.2 0.4 0.6 0.8 1-0.5
0
0.5
1
1.5
2
Hold-out Method
keep back P% for testing
wasteful
sample dependent
Training RMSE 0.23Testing RMSE 0.38
Cross Validation leave-one-out CV
train on all but one test that one repeat N times compute performance
m-fold CV divide sample into m non-overlapping sets proceed as above
all data used for training and testing more work but realistic performance estimates
used to choose “hyper-parameters” e.g. , number, width …
Conclusion gaps in our data cause most of our
problems noise
more data only part of the answer distribution / parsimony
restricting flexibility helps if no evidence to contrary
cross validation is a window into d-D estimates inter-sample behaviour
top related