gaps in our knowledge? a central problem in data modelling and how to get round it rob harrison...

44
GAPS IN OUR KNOWLEDGE? a central problem in data modelling and how to get round it Rob Harrison AC&SE

Post on 20-Dec-2015

215 views

Category:

Documents


0 download

TRANSCRIPT

Page 1: GAPS IN OUR KNOWLEDGE? a central problem in data modelling and how to get round it Rob Harrison AC&SE

GAPS IN OUR KNOWLEDGE?

a central problem in data modelling and how to get round

itRob Harrison

AC&SE

Page 2: GAPS IN OUR KNOWLEDGE? a central problem in data modelling and how to get round it Rob Harrison AC&SE

Agenda sampling

density & distribution representation

accuracy vs generality regularization

trading off accuracy and generality cross validation

what happens when we can’t see

Page 3: GAPS IN OUR KNOWLEDGE? a central problem in data modelling and how to get round it Rob Harrison AC&SE

The Data Modelling Problem

y= f(x) z=y+e multivariate & non-linear

measurement errors {xi, zi} i = 1:N zi = f(xi)+ei infer behaviour everywhere from a

few examples little or no prior information on f(x)

Page 4: GAPS IN OUR KNOWLEDGE? a central problem in data modelling and how to get round it Rob Harrison AC&SE

A Simple Example

one cycle of a sine wave “well-spaced” samples enough data (N = 6) noise-free

Page 5: GAPS IN OUR KNOWLEDGE? a central problem in data modelling and how to get round it Rob Harrison AC&SE

0 0.2 0.4 0.6 0.8 1-0.5

0

0.5

1

1.5y

x

Page 6: GAPS IN OUR KNOWLEDGE? a central problem in data modelling and how to get round it Rob Harrison AC&SE

0 0.2 0.4 0.6 0.8 1-0.5

0

0.5

1

1.5y

x

Page 7: GAPS IN OUR KNOWLEDGE? a central problem in data modelling and how to get round it Rob Harrison AC&SE

0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1-0.5

0

0.5

1

1.5

“observational” data

Page 8: GAPS IN OUR KNOWLEDGE? a central problem in data modelling and how to get round it Rob Harrison AC&SE

0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9-0.4

-0.2

0

0.2

0.4

0.6

0.8

1

1.2

1.4

1.6

Page 9: GAPS IN OUR KNOWLEDGE? a central problem in data modelling and how to get round it Rob Harrison AC&SE

0 0.2 0.4 0.6 0.8 1-0.5

0

0.5

1

1.5y

x

Page 10: GAPS IN OUR KNOWLEDGE? a central problem in data modelling and how to get round it Rob Harrison AC&SE

0 0.2 0.4 0.6 0.8 1-0.5

0

0.5

1

1.5y

x

Page 11: GAPS IN OUR KNOWLEDGE? a central problem in data modelling and how to get round it Rob Harrison AC&SE

0 0.2 0.4 0.6 0.8 1-0.5

0

0.5

1

1.5y

x

Page 12: GAPS IN OUR KNOWLEDGE? a central problem in data modelling and how to get round it Rob Harrison AC&SE

Sparsely Sampled

two well-spaced samples

Page 13: GAPS IN OUR KNOWLEDGE? a central problem in data modelling and how to get round it Rob Harrison AC&SE

0 0.2 0.4 0.6 0.8 1-0.5

0

0.5

1

1.5y

x

Page 14: GAPS IN OUR KNOWLEDGE? a central problem in data modelling and how to get round it Rob Harrison AC&SE

0 0.2 0.4 0.6 0.8 1-0.2

0

0.2

0.4

0.6

0.8

1

1.2y

x

Page 15: GAPS IN OUR KNOWLEDGE? a central problem in data modelling and how to get round it Rob Harrison AC&SE

0 0.2 0.4 0.6 0.8 1-0.4

-0.2

0

0.2

0.4

0.6

0.8

1

1.2

1.4y

x

Page 16: GAPS IN OUR KNOWLEDGE? a central problem in data modelling and how to get round it Rob Harrison AC&SE

Densely Sampled

200 well-spaced samples

Page 17: GAPS IN OUR KNOWLEDGE? a central problem in data modelling and how to get round it Rob Harrison AC&SE

0 0.2 0.4 0.6 0.8 1-0.5

0

0.5

1

1.5y

x

Page 18: GAPS IN OUR KNOWLEDGE? a central problem in data modelling and how to get round it Rob Harrison AC&SE

Large N

200 poorly-spaced samples

Page 19: GAPS IN OUR KNOWLEDGE? a central problem in data modelling and how to get round it Rob Harrison AC&SE

0 0.2 0.4 0.6 0.8 1-0.5

0

0.5

1

1.5y

x

Page 20: GAPS IN OUR KNOWLEDGE? a central problem in data modelling and how to get round it Rob Harrison AC&SE

What’s So Hard?

the gaps get more (well-spaced) data

lack of prior knowledge can’t see

dimension!!

Page 21: GAPS IN OUR KNOWLEDGE? a central problem in data modelling and how to get round it Rob Harrison AC&SE

Two Dimensions

same sampling density (N = 62) well-spaced?

Page 22: GAPS IN OUR KNOWLEDGE? a central problem in data modelling and how to get round it Rob Harrison AC&SE

-3-2

-10

12

3

-2

0

2

-5

0

5

x

Peaks

y

Page 23: GAPS IN OUR KNOWLEDGE? a central problem in data modelling and how to get round it Rob Harrison AC&SE

-3-2

-10

12

3

-2

0

2

-6

-4

-2

0

2

4

6

8

Page 24: GAPS IN OUR KNOWLEDGE? a central problem in data modelling and how to get round it Rob Harrison AC&SE

-3-2

-10

12

3

-2

0

2

-6

-4

-2

0

2

4

6

8

x

Peaks

y

Page 25: GAPS IN OUR KNOWLEDGE? a central problem in data modelling and how to get round it Rob Harrison AC&SE

Dimensionality

lose ability to see the shape of f(x) try it in 13-D

number of samples exponential in d if N OK in 1-D, Nd needed in d-D

how do we know if “well-spaced”? how can we sample where the action is? observational vs experimental data!

Page 26: GAPS IN OUR KNOWLEDGE? a central problem in data modelling and how to get round it Rob Harrison AC&SE

Generic Structure

use e.g. a power series Stone-Weierstrass other bases e.g. Fourier series …

y = a5x5 + a4x4 + a3x3 + …+ a1x + a0

= 5 & six samples – no error!

Page 27: GAPS IN OUR KNOWLEDGE? a central problem in data modelling and how to get round it Rob Harrison AC&SE

0 0.2 0.4 0.6 0.8 1-1

-0.5

0

0.5

1

1.5

2y

x

PROBLEM SOLVED!

Page 28: GAPS IN OUR KNOWLEDGE? a central problem in data modelling and how to get round it Rob Harrison AC&SE

Generic Structure

use e.g. a power series Stone-Weierstrass other bases e.g. Fourier series …

y = a5x5 + a4x4 + a3x3 + …+ a1x + a0

= 5 & six samples – no error! > 5 still no error but …

Page 29: GAPS IN OUR KNOWLEDGE? a central problem in data modelling and how to get round it Rob Harrison AC&SE

0 0.2 0.4 0.6 0.8 1-1

-0.5

0

0.5

1

1.5

poor inter-sample behaviour

how can we know without looking?

Page 30: GAPS IN OUR KNOWLEDGE? a central problem in data modelling and how to get round it Rob Harrison AC&SE

Generic Structure

use e.g. a power series Stone-Weierstrass other bases e.g. Fourier series …

y = a5x5 + a4x4 + a3x3 + …+ a1x + a0

= 5 & six samples – no error! > 5 still no error but … measurement error

Page 31: GAPS IN OUR KNOWLEDGE? a central problem in data modelling and how to get round it Rob Harrison AC&SE

0 0.2 0.4 0.6 0.8 1-1

-0.5

0

0.5

1

1.5

Page 32: GAPS IN OUR KNOWLEDGE? a central problem in data modelling and how to get round it Rob Harrison AC&SE

Generic Structure

use e.g. a power series Stone-Weierstrass other bases e.g. Fourier series …

y = a5x5 + a4x4 + a3x3 + …+ a1x + a0

= 5 & six samples – no error! > 5 still no error but … measurement error model is as complex as data

Page 33: GAPS IN OUR KNOWLEDGE? a central problem in data modelling and how to get round it Rob Harrison AC&SE

Curse Of Dimension we can still use the idea but … in 2-D we get 21 terms

direct and cross products in d-D we get (d+)!/d!! e.g. transform a 16x16 bitmap by =3

polynomial and get ~ 3 million terms sample size / distribution practical for “small” problems

Page 34: GAPS IN OUR KNOWLEDGE? a central problem in data modelling and how to get round it Rob Harrison AC&SE

Other Basis Functions

Gaussian radial basis functions additional design choices

how many?, where?, how wide? adaptive sigmoidal basis functions

how many?

Page 35: GAPS IN OUR KNOWLEDGE? a central problem in data modelling and how to get round it Rob Harrison AC&SE

0 0.2 0.4 0.6 0.8 1-0.5

0

0.5

1

1.5

0 0.2 0.4 0.6 0.8 1-1

-0.5

0

0.5

1

1.5

zero error?

Overfitting (sample data)

rough components

Page 36: GAPS IN OUR KNOWLEDGE? a central problem in data modelling and how to get round it Rob Harrison AC&SE

0 0.2 0.4 0.6 0.8 1-0.5

0

0.5

1

1.5

0 0.2 0.4 0.6 0.8 1-400

-200

0

200

400

Underfitting (sample data)

over-smooth components

Page 37: GAPS IN OUR KNOWLEDGE? a central problem in data modelling and how to get round it Rob Harrison AC&SE

0 0.2 0.4 0.6 0.8 1-0.5

0

0.5

1

1.5

0 0.2 0.4 0.6 0.8 1-1

-0.5

0

0.5

1

v. small error

“just right” components

Goldilocks

Page 38: GAPS IN OUR KNOWLEDGE? a central problem in data modelling and how to get round it Rob Harrison AC&SE

Restricting “Flexibility” can we use data to tell the

estimator how to behave? regularization/penalization

penalize roughness e.g. SSE + Q

use potentially complex structure data constrains where it can Q constrains elsewhere

Page 39: GAPS IN OUR KNOWLEDGE? a central problem in data modelling and how to get round it Rob Harrison AC&SE

0 0.2 0.4 0.6 0.8 1-2

0

2

4

0 0.2 0.4 0.6 0.8 1-2

0

2

4

four narrow grbfs

RMSE = 0.80

Page 40: GAPS IN OUR KNOWLEDGE? a central problem in data modelling and how to get round it Rob Harrison AC&SE

0 0.2 0.4 0.6 0.8 1-2

-1

0

1

2

0 0.2 0.4 0.6 0.8 1-2

-1

0

1

four narrow grbfs + penalty

RMSE = 0.24

Page 41: GAPS IN OUR KNOWLEDGE? a central problem in data modelling and how to get round it Rob Harrison AC&SE

How Well Are We Doing?

so far we know answer for d > 2 we are “blind” what’s happening between samples? test on new samples in between

VALIDATION compute a performance index

GENERALIZATION

Page 42: GAPS IN OUR KNOWLEDGE? a central problem in data modelling and how to get round it Rob Harrison AC&SE

0 0.2 0.4 0.6 0.8 1-0.5

0

0.5

1

1.5

2

Hold-out Method

keep back P% for testing

wasteful

sample dependent

Training RMSE 0.23Testing RMSE 0.38

Page 43: GAPS IN OUR KNOWLEDGE? a central problem in data modelling and how to get round it Rob Harrison AC&SE

Cross Validation leave-one-out CV

train on all but one test that one repeat N times compute performance

m-fold CV divide sample into m non-overlapping sets proceed as above

all data used for training and testing more work but realistic performance estimates

used to choose “hyper-parameters” e.g. , number, width …

Page 44: GAPS IN OUR KNOWLEDGE? a central problem in data modelling and how to get round it Rob Harrison AC&SE

Conclusion gaps in our data cause most of our

problems noise

more data only part of the answer distribution / parsimony

restricting flexibility helps if no evidence to contrary

cross validation is a window into d-D estimates inter-sample behaviour