
Post on 14-Aug-2021


There are three types of lies: lies, damned lies and statistics

Benjamin Disraeli

- British prime minister (Tory).

William Gladstone

- Defeated Disraeli in the general election of 1868.
- President of the Royal Statistical Society 1867-1869.

Another Disraeli quote

. . . That question is this: Is man an ape or an angel? I, my lord, I am on the side of the angels. I repudiate with indignation and abhorrence those new fangled theories.

(Oxford Diocesan Conference 25/11/1864)

A rational approach to uncertainty?

[Figure: Global temperature. Temperature anomaly (C) against year, 1850 to 2000.]

[Figure: Atmospheric CO2. CO2 (ppm) against year, 1850 to 2000.]

Absorption spectra

Is abstraction the problem?

Baker & Bellis, 1993, Animal Behaviour

[Figure: scatterplot matrix of count, prop.partner and time.ipc.]

The Baker and Bellis Analysis

[Figure: a sequence of panels plotting count and residuals (rsd) against prop.partner and time.ipc.]

Baker and Bellis Conclusions

- At the end of the process they asked whether the apparent straight line relationships were stronger than could plausibly have arisen by chance.

- On this basis they concluded that there is evidence for count declining with proportion of time spent together.

- Time since last copulation seemed not to play a detectable role.

- But they also collected another dataset . . .
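
The question in the first bullet, whether an apparent straight line relationship is stronger than chance could plausibly produce, can be illustrated with a permutation test. This is a minimal sketch only: the slides do not say which test Baker and Bellis used, the data below are synthetic, and the names `slope` and `permutation_p_value` are hypothetical.

```python
import random


def slope(x, y):
    """Least-squares slope of y regressed on x."""
    n = len(x)
    mx = sum(x) / n
    my = sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    return sxy / sxx


def permutation_p_value(x, y, n_perm=2000, seed=1):
    """Share of shuffled data sets whose |slope| is at least the observed one."""
    rng = random.Random(seed)
    observed = abs(slope(x, y))
    shuffled = list(y)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(shuffled)  # break any real link between x and y
        if abs(slope(x, shuffled)) >= observed:
            hits += 1
    return (hits + 1) / (n_perm + 1)


# Synthetic illustration only (not the Baker and Bellis data).
rng = random.Random(0)
x = [i / 29 for i in range(30)]
y_trend = [400 - 200 * xi + rng.gauss(0, 60) for xi in x]  # declining response
y_noise = [300 + rng.gauss(0, 60) for _ in x]              # no relationship

p_trend = permutation_p_value(x, y_trend)
p_noise = permutation_p_value(x, y_noise)
```

A small `p_trend` says that shuffled data rarely produce a slope as steep as the observed one. It says nothing about whether a straight line was the right shape to look for.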

[Figure: scatterplot matrix of count, f.age, f.height, f.weight, m.age, m.height, m.weight and m.vol.]

More conclusions. . .

- Going through the same process as with the first data set leads to the conclusion that only female weight is linearly related to count.

- But a careful look at the residuals shows that this conclusion is completely dependent on a single data point with very low sperm count.

- Re-do the analysis without this datum, and only volume matters.

- Actually it’s the same subjects in both datasets, and we can match up the volumes with the first dataset.

- Repeating the first analysis with volume added leads to the dull conclusion that there is only any evidence for a linear relationship between count and volume.

- This result has limited marketing potential.
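
One simple way to see how much a fitted line depends on a single datum is to refit it with each point left out in turn. The sketch below uses made-up numbers (not the study data) in which one extreme point creates the whole apparent relationship. The function names are hypothetical.

```python
def fit_line(x, y):
    """Return (intercept, slope) of the least-squares line."""
    n = len(x)
    mx = sum(x) / n
    my = sum(y) / n
    sxx = sum((a - mx) ** 2 for a in x)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    b = sxy / sxx
    return my - b * mx, b


def leave_one_out_slopes(x, y):
    """Slope of the refitted line with each observation dropped in turn."""
    slopes = []
    for i in range(len(x)):
        xs = x[:i] + x[i + 1:]
        ys = y[:i] + y[i + 1:]
        slopes.append(fit_line(xs, ys)[1])
    return slopes


# Made-up numbers: a flat response, apart from one extreme point.
x = [1.0, 2.0, 3.0, 4.0, 5.0, 10.0]
y = [5.0, 5.0, 5.0, 5.0, 5.0, 1.0]

full_slope = fit_line(x, y)[1]    # clearly negative
loo = leave_one_out_slopes(x, y)  # the last entry drops the extreme point
```

Here every refit keeps a negative slope except the one that omits the extreme point, where the slope vanishes. That is the pattern the residual check in the slides is meant to catch.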

But why straight lines anyway?

[Figure: scatterplot matrix of count, prop.partner and time.ipc.]

Smoothing

1. What if the relationship between the residuals and a variable does not look like a straight line?

2. Why not let it be a smooth curve, instead?

[Figure: estimated smooth s(prop.partner,1.07) against prop.partner, and estimated smooth s(time.ipc,1.77) against time.ipc.]

How to choose the best fit curve?

- Take a bendy strip of wood.
- Hook it up to the data points with springs.
- The result is a spline.

[Figure: wear against size, with a spline through the data.]
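
In symbols, the strip-and-springs picture corresponds to the standard cubic smoothing spline criterion. The slides do not write it down, so the notation here (data $(x_i, y_i)$, curve $f$, stiffness parameter $\lambda$) is an assumption. The first term plays the role of the energy stored in the springs, the second the bending energy of the strip.

```latex
\hat{f} \;=\; \arg\min_{f} \;
  \sum_{i=1}^{n} \bigl( y_i - f(x_i) \bigr)^2
  \;+\; \lambda \int f''(x)^2 \, dx
```

A large $\lambda$ gives a stiff strip and a nearly straight fit. A small $\lambda$ lets the curve follow the points closely.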

Splines are controllable

- Changing the flexibility of the spline changes the curve.

[Figure: four panels of wear against size, each showing a spline of different flexibility.]

- Splines can be described mathematically, in a way that is easy to work with.
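
To make "easy to work with" concrete, here is a minimal sketch of a discrete analogue of the bendy strip: fitted values at evenly spaced points are chosen to minimise squared misfit plus `lam` times the sum of squared second differences. This is a Whittaker-type smoother swapped in for illustration, not a true cubic spline and not the software used for the slides. The data and function names are made up.

```python
def second_difference_penalty(n):
    """Return the n x n matrix D'D, where D takes second differences."""
    p = [[0.0] * n for _ in range(n)]
    for k in range(n - 2):
        idx = (k, k + 1, k + 2)
        coef = (1.0, -2.0, 1.0)
        for i, ci in zip(idx, coef):
            for j, cj in zip(idx, coef):
                p[i][j] += ci * cj
    return p


def solve(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [bi] for row, bi in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        s = sum(m[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (m[r][n] - s) / m[r][r]
    return x


def smooth(y, lam):
    """Minimise sum (y_i - f_i)^2 + lam * sum (second differences of f)^2."""
    n = len(y)
    a = second_difference_penalty(n)
    for i in range(n):
        for j in range(n):
            a[i][j] *= lam
        a[i][i] += 1.0  # a = I + lam * D'D
    return solve(a, y)


def roughness(f):
    """Sum of squared second differences: the 'bending' of the fit."""
    return sum((f[i] - 2 * f[i + 1] + f[i + 2]) ** 2 for i in range(len(f) - 2))


# Made-up, evenly spaced observations.
y = [2.1, 2.9, 2.4, 3.6, 3.1, 4.2, 3.5, 4.4, 4.0, 4.6]
stiff = smooth(y, 1e6)    # very stiff strip: almost a straight line
medium = smooth(y, 1.0)
floppy = smooth(y, 1e-6)  # very flexible strip: almost interpolates the data
```

The whole fit reduces to one linear solve, and a single number, `lam`, moves the result between a straight line and an interpolating wiggle.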

Smooth surfaces: thin plate splines

- For smooth surfaces there are several options.
- We can replace the bendy strip with a bendy sheet . . .

[Figure: four perspective plots of the linear predictor as a surface over x and z.]

More smooth surfaces: tensor product splines

- Or we can make a surface from a lattice of bendy strips.
- The strips should usually have different degrees of flexibility in the two directions.

[Figure: perspective plot of a surface f(x,z) over x and z.]

Yet more smooth surfaces: soap films

- For smoothing within oddly shaped areas, it can help to replace bendy sheets/strips with a soap film.
- This avoids smoothing across the area boundary.

[Figure: six map panels, latitude (44.0 to 46.5) against longitude (58.0 to 60.5).]

How flexible should the spline be?

- Mathematically, all these ways of describing a surface have the degree of smoothness controlled by just one or two numbers . . .

- . . . which must be chosen. How?

[Figure: three panels of y against x with fitted curves, titled "λ too high", "λ about right" and "λ too low".]
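
The slides leave the "How?" open at this point. One common answer, shown here purely as an assumption about what could be done, is to pick the smoothing parameter that predicts left-out data best. The sketch uses a discrete second-difference smoother on synthetic data and the standard leave-one-out shortcut for linear smoothers. All names and numbers are illustrative.

```python
import math
import random


def invert(a):
    """Invert a square matrix by Gauss-Jordan elimination with pivoting."""
    n = len(a)
    m = [row[:] + [1.0 if i == j else 0.0 for j in range(n)]
         for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        scale = m[col][col]
        m[col] = [v / scale for v in m[col]]
        for r in range(n):
            if r != col and m[r][col] != 0.0:
                factor = m[r][col]
                m[r] = [vr - factor * vc for vr, vc in zip(m[r], m[col])]
    return [row[n:] for row in m]


def influence_matrix(n, lam):
    """A = (I + lam * D'D)^-1, so that the fitted values are A y."""
    a = [[0.0] * n for _ in range(n)]
    for k in range(n - 2):
        idx, coef = (k, k + 1, k + 2), (1.0, -2.0, 1.0)
        for i, ci in zip(idx, coef):
            for j, cj in zip(idx, coef):
                a[i][j] += lam * ci * cj
    for i in range(n):
        a[i][i] += 1.0
    return invert(a)


def loocv_score(y, lam):
    """Mean squared leave-one-out prediction error of the smoother."""
    n = len(y)
    a = influence_matrix(n, lam)
    total = 0.0
    for i in range(n):
        fitted = sum(a[i][j] * y[j] for j in range(n))
        # Shortcut: the leave-one-out residual is residual / (1 - A_ii).
        total += ((y[i] - fitted) / (1.0 - a[i][i])) ** 2
    return total / n


# Synthetic data: one period of a sine wave plus noise.
rng = random.Random(3)
n = 40
y = [math.sin(2 * math.pi * i / (n - 1)) + rng.gauss(0, 0.3) for i in range(n)]

grid = [10.0 ** k for k in range(-3, 5)]
scores = {lam: loocv_score(y, lam) for lam in grid}
best = min(scores, key=scores.get)
```

Very small values of `lam` chase the noise and very large ones flatten the signal, so the score is lowest somewhere in between, matching the "about right" panel.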

Cleaning up a brain scan

[Figure: medFPQ brain image, plotted over X and Y.]

- Model log FPQ as a smooth surface, represented using a thin plate spline.

- Springs attaching the plate to the data have strength dependent on the height of the plate.

Smoothed version

[Figure: smoothed brain image (linear predictor), plotted over X and Y.]

Is Cairo getting hotter?

[Figure: temperature (F) against time (days), 0 to 3000.]

- A model . . .
  - The temperature varies smoothly with day of year.
  - There might be an additional smooth long-term trend in temperature.
  - The small-scale day-to-day fluctuations are probably correlated between one day and the next.
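
Written out, the three bullets describe an additive model with correlated errors. The version below is a sketch under assumptions the slides do not state: the symbols $f_1$, $f_2$, $\phi$ and $\sigma^2$ are introduced here, and a first-order autoregressive (AR(1)) process is just one simple way to represent "correlated between one day and the next".

```latex
\begin{aligned}
\text{temperature}_i &= f_1(\text{day.of.year}_i) + f_2(\text{time}_i) + e_i, \\
e_i &= \phi \, e_{i-1} + \epsilon_i, \qquad
\epsilon_i \sim N(0, \sigma^2) \ \text{independently}.
\end{aligned}
```

Here $f_1$ is the smooth seasonal cycle, $f_2$ the smooth long-term trend, and the question "Is Cairo getting hotter?" becomes a question about the shape of $f_2$.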

Yes it is.

[Figure: estimated smooth s(day.of.year,8.52) against day.of.year, and estimated smooth s(time,1.35) against time.]

Predicting octane rating

[Figure: spectrum for a sample with octane = 85.3, log(1/R) against wavelength (nm), 1000 to 1600.]

- How can we predict the octane rating from the spectrum?

Octane prediction model

[Figure: spectrum for a sample with octane = 85.3 (blue) with the coefficient curve (red), log(1/R) against wavelength (nm), 1000 to 1600.]

- Model: octane rating is a constant plus the average value of the red curve multiplied by the spectrum (blue).

- Need to estimate the red curve.
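
The prediction rule in the first bullet is short enough to write out directly. In this sketch the function name, the argument names and every number are hypothetical. In practice the coefficient (red) curve would be estimated from data rather than typed in.

```python
def predict_octane(intercept, coef_curve, spectrum):
    """Constant plus the average over wavelengths of coef_curve * spectrum.

    coef_curve and spectrum are values on the same wavelength grid.
    """
    if len(coef_curve) != len(spectrum):
        raise ValueError("coef_curve and spectrum must share a wavelength grid")
    products = [f * s for f, s in zip(coef_curve, spectrum)]
    return intercept + sum(products) / len(products)


# Made-up values on a three-point wavelength grid, for illustration only.
example = predict_octane(80.0, [1.0, 2.0, 3.0], [0.5, 0.5, 1.0])
```

With these toy inputs the products are 0.5, 1.0 and 3.0, their average is 1.5, and `example` is 81.5.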

Octane prediction fit

[Figure: estimated function s(nm,7.9):NIR against nm, 1000 to 1600, and measured against fitted octane, 84 to 89.]

Diabetic Retinopathy Study

[Figure: scatterplot matrix of ret, bmi, gly and dur.]

- The model is that the probability of retinopathy is related to a sum of smooth curves depending on bmi, gly and dur, plus smooth surfaces depending on bmi & gly, gly & dur . . .
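
One way to write this model down is shown below, with assumptions flagged. The slides only say the probability is "related to" the sum, so the logit link, the intercept $\alpha$ and the function labels $f_1, \ldots, f_5$ are introduced here for illustration. The trailing dots stand for the remaining terms that the slide itself elides.

```latex
\operatorname{logit} \Pr(\text{ret}_i = 1) \;=\;
  \alpha + f_1(\text{bmi}_i) + f_2(\text{gly}_i) + f_3(\text{dur}_i)
  + f_4(\text{bmi}_i, \text{gly}_i) + f_5(\text{gly}_i, \text{dur}_i) + \cdots
```

Each $f$ is a smooth curve or surface whose flexibility is estimated from the data.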

Diabetic Retinopathy Results

[Figure: estimated smooths s(dur,3.26), s(gly,1) and s(bmi,2.67), and estimated surfaces te(dur,gly,0), te(dur,bmi,0) and te(gly,bmi,2.5).]

Diabetic Retinopathy Results II

[Figure: the linear predictor as a surface over bmi and gly, shown as perspective and contour plots; red/green are +/− s.e.]

cran.r-project.org

Picture Credits

- Gladstone and Disraeli are from the House of Commons web site.
- The 1921 Eugenics conference logo is from en.wikipedia.org/wiki/File:Eugenics congress logo.png
- The Gates of Auschwitz are from oncampus.richmond.edu/academics/education/projects/webquests/holocaust/images/arbeit macht frei.jpg
- Hogarth’s South Sea Bubble can be found at www.library.hbs.edu/hc/ssb/images/using-top.jpg, but I’ve lost where I found the one shown.
- The absorption spectrum figure is from www.te-software.co.nz/blog/augie auer.htm
- Reproductions of Picasso’s Les Demoiselles d’Avignon are available from many sites. The one shown is possibly from www.enjoyart.com/library/featured artists/pablopicasso/large/Bmcgaw-P591.jpg
- The cover of Sperm Wars was taken from www.amazon.co.uk.

Data Credits

- The Global CO2 and temperature data are from www.cru.uea.ac.uk/cru/data/temperature/ and the Scripps Institute CO2 research group.
- The Aral Sea CO2 data are from the SeaWifs satellite.
- For full credits for the Cairo and Brain Scan data, see R package gamair.
- The octane data are available in R package pls.
- The Retinopathy data are available in R package gss.
