model assumptions and truth in statisticsucakche/presentations/modelass0115.pdf · cluster analysis...

Statistics and PhilosophyMathematical models and reality

Frequentist probabilitiesThe Bayes-frequentist controversy

Cluster analysis and truth

Model Assumptions and Truth in Statistics

Christian Hennig

4 February 2015

Christian Hennig Model Assumptions and Truth in Statistics




1. Statistics and Philosophy

Statistics is about getting knowledge from data.

What can we know from data?

How can we find out?





Issues◮ What is probability? - frequentist vs. Bayesian vs. ?

Observers see what happens but probabilityis about “what could have happenedother than what happened”.







◮ Assessing evidence - testing hypotheses,p-values, probabilities of hypotheses. . .








often confused with interpretations of probability,but Bayesian methods can work with frequentistprobabilities.








often confused with interpretations of probability,but Bayesian methods can work with frequentistprobabilities.

◮ Modelling and model assumptions -where does the model come from?How can assumptions be assessed?





Issues◮ Statistics vs. Data Analysis - our own “PoS vs. STS”.

Data Analysis is about much more than probability modelsand classical inference.Prediction, exploratory analysis, data compression,pattern recognition, data visualisation, stability,. . .







Data Analysts: Only use statistical models where they help,often other methods are better.

Statisticians: St. models improve pretty much everything.(And we have experimental design.)







Data Analysts: Only use statistical models where they help,often other methods are better.

Statisticians: St. models improve pretty much everything.(And we have experimental design.)

(Parts of the field claimed by Machine Learning now,role of “black box prediction machines”?)





Issues◮ Objectivity vs. subjectivity -

automatic/standardised decisions vs. researcher’sdecisions







◮ Measuring quality of methodology -assumed truth (what is the truth we want to find)?prediction error? replication? interpretability?







◮ Measuring quality of methodology -assumed truth (what is the truth we want to find)?prediction error? replication? interpretability?

◮ . . . and more - theory of measurement,role of visualisation, statistics communication. . .





Overview

1. Statistics and Philosophy

2. Mathematical models and reality

3. Frequentist probabilities

4. The Bayes-frequentist controversy

5. Cluster analysis and truth

6. Conclusion





2. Mathematical models and reality - some data

Teacher A results

Marks out of 100

Fre

quen

cy

0 20 40 60 80 100

01

23

45

10 20 30 40 50 60 70 80 90 100

Teacher B results

Marks out of 100

Fre

quen

cy

0 20 40 60 80 100

05

1015

10 20 30 40 50 60 70 80 90 100





Standard model:

x1, . . . , xn ∼ N (µ1, σ21) i.i.d.,

y1, . . . , ym ∼ N (µ2, σ22) i.i.d.

Estimate µ1, µ2 by means, x̄1 = 58.6, x̄2 = 56.9,teacher A students do better, but not significantly so(p = 0.455 under µ1 = µ2).





Standard model:

x1, . . . , xn ∼ N (µ1, σ21) i.i.d.,

y1, . . . , ym ∼ N (µ2, σ22) i.i.d.

Estimate µ1, µ2 by means, x̄1 = 58.6, x̄2 = 56.9,teacher A students do better, but not significantly so(p = 0.455 under µ1 = µ2).

What do the model assumptions mean,and do they make sense?





With fitted distributions N (µ1, σ21),N (µ2, σ

22)

Teacher A results

Marks out of 100

Fre

quen

cy

0 20 40 60 80 100

01

23

45

10 20 30 40 50 60 70 80 90 100

Teacher B results

Marks out of 100

Fre

quen

cy

0 20 40 60 80 100

05

1015

10 20 30 40 50 60 70 80 90 100





We actually know that the model assumptions are violated.The normal distribution has an unlimited value range,and produces integer values with probability zero.Actually for such reasons we know thatno data observed by humans can ever be “truly normal”.

But how much harm does this do?





General mathematical modelling

Identification of items of perceived realitywith mathematical objectsand interpretation of results of mathematical operationsin terms of the items.

“Trivial” manipulation of counts and measurements,deterministic and stochastic models,analytical and computer models,data fitting and mechanistic models.





Constructivist view (H., 2010, Foundations of Science)





Implications for mathematical modelling◮ Science: establishing agreement in open exchange.





Implications for mathematical modelling◮ Science: establishing agreement in open exchange.◮ Mathematics is about creating a system that makes

absolute agreement possible(but only within mathematics).







◮ Mathematical modelling is not about how things are,but about how we think and communicate about them.(Models communicate and change views of reality.)








◮ It cannot be formally analysed how formal models arerelated to informal reality.








◮ It cannot be formally analysed how formal models arerelated to informal reality.

◮ Pragmatist attitude: what do we get out of it?





Realism?

Models are about personal and social perceptions.Science: can check models againstobservations from agreed measurement procedures.





Realism?

Models are about personal and social perceptions.Science: can check models againstobservations from agreed measurement procedures.

Compatible with Chang’s Active Scientific Realism :“I take reality as whatever is not subject to one’s will, andknowledge as an ability to act without being frustrated byresistance from reality.”We acknowledge reality’s “resistance”and that this is what science is about -without accessing “truth”about observer-independent reality.





Antony Gormley - Allotment





What do the model assumptions mean?Independent repetitionsCan frequentist model assumptions be checked?The goodness-of-fit paradoxApproximately true?“Frequentism-as-model”

3. Frequentist probabilities (e.g., von Mises)3.1 What do the model assumptions mean?For example,

x1, . . . , xn ∼ N (µ1, σ21) i.i.d.,

y1, . . . , ym ∼ N (µ2, σ22) i.i.d.






What do the model assumptions mean?

“We think of the situation as . . . ”◮ Potentially infinite repetition (of experimental conditions)◮ P(A): relative frequency limit of occurrence of A

(e.g., normal distribution is defined by P(A) ∀A.)






What do the model assumptions mean?

“We think of the situation as . . . ”◮ Potentially infinite repetition (of experimental conditions)◮ P(A): relative frequency limit of occurrence of A

(e.g., normal distribution is defined by P(A) ∀A.)

This is obviously an idealisation -and what constitutes a “repetition”?

“Whatever can be distinguished cannot be identical.”(B. de Finetti)






3.2 Independent repetitions (“i.i.d.”)

Frequentism relies on “repetitions of experiments”,e.g. results from different patients in clinical study,different students in the exam.

“x1, . . . , xn ∼ N (µ1, σ21) i.i.d.” defines

a probability distribution for the full dataset,e.g., P(x1 ∈ A1, x2 ∈ A2) = P(x1 ∈ A1)P(x2 ∈ A2).






3.2 Independent repetitions (“i.i.d.”)

Frequentism relies on “repetitions of experiments”,e.g. results from different patients in clinical study,different students in the exam.

“x1, . . . , xn ∼ N (µ1, σ21) i.i.d.” defines

a probability distribution for the full dataset,e.g., P(x1 ∈ A1, x2 ∈ A2) = P(x1 ∈ A1)P(x2 ∈ A2).

“i.i.d.” is defined in terms of probabilities.

Probabilities cannot be defined in terms ofi.i.d. repetitions.






In order to define “i.i.d.” sequences, i.i.d. repetitions anddefining repetitions are required on different levels.

x1 x2 x3 x4 x5 ... xn ...






In order to define “i.i.d.” sequences, i.i.d. repetitions anddefining repetitions are required on different levels.

x1 x2 x3 x4 x5 ... xn ...

x21 x22 x23 x24 x25 ... x2n ...

x31 x32 x33 x34 x35 ... x3n ...

x41 x42 x43 x44 x45 ... x4n ...






In practice, there’s only one level of repetition.

The effective sample size for(in-)dependence assumptions is usually 1.

But independent repetition of some kindis always required in order to learn from data.






In practice, there’s only one level of repetition.

The effective sample size for(in-)dependence assumptions is usually 1.

But independent repetition of some kindis always required in order to learn from data.

Independent repetition is constructed by conscious decisionto ignore potential dependencies and differences.






3.3 Can frequentist model assumptions be checked?

Goodness-of-fit/misspecification tests (Mayo & Spanos)If something modelled as very unlikely happens,the model is interpreted to be falsified.








For example, could “falsify” independencefrom data where students at same tabletend to have similar results.








For example, could “falsify” independencefrom data where students at same tabletend to have similar results.

Implicitly assumes tables to be i.i.d.






Dependence (between what’s “close”) can only be foundby contrast to independence between what’s further away.

If everything’s dependent on everything else in same manner(or in irregularly different manners),this cannot be found.






3.4 The goodness-of-fit paradox (H, 2007)

Checking the model assumptions violates them automaticallybecause the possibility of unlikely eventsis constitutive part of the models.






3.5 “Approximately true?”

Model assumptions are violated -is it good enough if they are “approximately true”?






“The model assumptions hold approximately” -precise meaning could only be:“We assume Q ∈ P and there is a true P so thatminQ∈P d(P,Q) is small.”






“The model assumptions hold approximately” -precise meaning could only be:“We assume Q ∈ P and there is a true P so thatminQ∈P d(P,Q) is small.”

Dissimilarity measure d is required.

Distances for data analysis (L. Davies):d(P,Q) small ⇔ difficult to distinguish data from P,Q.






Problem: In every small d -neighbourhood of Qthere are distributions that behave very differently.






Problem: In every small d -neighbourhood of Qthere are distributions that behave very differently.

Q = N (µ, σ2) has expectation µ.Expectation of P may be anywhere or not exist.Likelihood ratio of Q and P may take any value.






If P 6= N (µ, σ2), what is estimated estimating µ?

Median, expectation, mode are identical for N (µ, σ2),but different for asymmetric P.








Robustness theory: what characteristics of distributionsdon’t change much in d -neighbourhoods?








Robustness theory: what characteristics of distributionsdon’t change much in d -neighbourhoods?

In any case, cannot rule out too irregular distributions(such as complex dependence patterns).Some assumptions need to be imposed without checking.

Should at least reflect our perception.






3.6 “Frequentism-as-model”, a new perspective

(Frequentist) models are still useful to. . .◮ communicate researcher’s perception of situation,◮ inspire methodology,◮ check quality of methodology

in situations with known (made up) truth.(Tukey, L. Davies)

◮ give an idea of potential variation to be expected.






Re-formulation of “model checking” :Find out whether data could leadstatistical method (derived from model) astray(requires statistical theory/simulations).







Some violations are not problematic(e.g., g.o.f.-testing is rarely problematic,discreteness of data assumed normal doesn’t hurt much).







Some violations are not problematic(e.g., g.o.f.-testing is rarely problematic,discreteness of data assumed normal doesn’t hurt much).

Depends on aim of analysis (researcher’s construct)which features of model are important.

Major motivation of methodology should come fromdesired interpretation of results.






Students from same year: clearly not independent(but may not show dependence pattern),“identical” only by ignoring background(iid is constructed by selective ignorance).

(That’s the view the model carries.)






Teacher A results

Marks out of 100

Fre

quen

cy

0 20 40 60 80 100

01

23

45

10 20 30 40 50 60 70 80 90 100

Teacher B results

Marks out of 100

Fre

quen

cy

0 20 40 60 80 100

05

1015

10 20 30 40 50 60 70 80 90 100

Means: 58.6 (A), 56.9 (B)Medians: 58 (A), 59 (B).






Teacher A results

Marks out of 100

Fre

quen

cy

0 20 40 60 80 1000

12

34

510 20 30 40 50 60 70 80 90 100

Teacher B results

Marks out of 100

Fre

quen

cy

0 20 40 60 80 100

05

1015

10 20 30 40 50 60 70 80 90 100

Can rejectequality of distributions (Kolmogorov-Smirnov p = 0.012),neither means nor medians differ significantly.






Teacher A results

Marks out of 100

Fre

quen

cy

0 20 40 60 80 100

01

23

45

10 20 30 40 50 60 70 80 90 100

Teacher B results

Marks out of 100

Fre

quen

cy

0 20 40 60 80 100

05

1015

10 20 30 40 50 60 70 80 90 100

Lower outliers in B suggest median.






Teacher A results

Marks out of 100

Fre

quen

cy

0 20 40 60 80 100

01

23

45

10 20 30 40 50 60 70 80 90 100

Teacher B results

Marks out of 100

Fre

quen

cy

0 20 40 60 80 100

05

1015

10 20 30 40 50 60 70 80 90 100

Favour mean: observations are not erroneous,so all observations should contribute!?






Teacher A results

Marks out of 100

Fre

quen

cy0 20 40 60 80 100

01

23

45

10 20 30 40 50 60 70 80 90 100

Teacher B results

Marks out of 100

Fre

quen

cy

0 20 40 60 80 100

05

1015

10 20 30 40 50 60 70 80 90 100

But it may not be seen as very relevanthow bad the clearly failed students exactly are.(Would be different for upper outliers.)






Teacher A results

Marks out of 100

Fre

quen

cy

0 20 40 60 80 100

01

23

45

10 20 30 40 50 60 70 80 90 100

Teacher B results

Marks out of 100

Fre

quen

cy

0 20 40 60 80 100

05

1015

10 20 30 40 50 60 70 80 90 100

Could look at pass (≥ 50) rate (B clearly better)and other meaningful statistics (rank sum).






Teacher A results

Marks out of 100

Fre

quen

cy

0 20 40 60 80 100

01

23

45

10 20 30 40 50 60 70 80 90 100

Teacher B results

Marks out of 100

Fre

quen

cy

0 20 40 60 80 100

05

1015

10 20 30 40 50 60 70 80 90 100

“Which aspect is of interest?”/meaning of datadominates “Which model is close to the truth.”






Data analytic approach:evaluated several characteristics,got more comprehensive picture.






Data analytic approach:evaluated several characteristics,got more comprehensive picture.

Warning:I ran several significance tests,but this compromises test error probabilities.

Don’t necessarily expect that“significant different pass rate” will reproduce.

“Exploratory” interpretation of tests.

May check on independent data.






p-values : highly controversial!I use them routinely in exploratory mannerfor checking whether what I spotcan be explained by random variation alone.






p-values : highly controversial!I use them routinely in exploratory mannerfor checking whether what I spotcan be explained by random variation alone.

Relying on p-values as dominating toolto “secure” scientific findings is dangerous.

It’s not the idea that is at faultbut people want too strong results too easily.





Bayesian interpretations of probabilityExchangeability

4. The Bayes-frequentist controversy

Long-standing controversy about foundations of statistics:Should a different probability interpretation be applied?

Can problems with frequentism be resolved by(subjective or objective) Bayesian approaches?






Bayes’s theorem:

P(H1|data) =P(data|H1)P(H1)∫

H∈HP(data|H)P(H)

Needed: sampling distribution P(data|H),prior P(H) on set of sampling models H.






4.1 Bayesian interpretations of probability◮ Subjective Bayes (de Finetti) Probabilities measure

individual’s strength of belief in future outcomes,formalised as betting rates (epistemic probabilities).

◮ Objective Bayes (Jaynes) Similar, but believe thatobjectively rational strengths of belief can be found.

◮ Hidden frequentist/ “falsificationist” (Gelman)Use Bayesian methodologywith sampling distribution interpreted frequentist.

Only the latter can check models against data.






The big question: Where does the prior come from?

Objective Bayesians want either non-informative orstrong “objective” justification from background -rarely available.








“Falsificationist”: could usefrequentist idea even for prior (“all similar studies”);could make sense for course mark databut needs strong backing with past data.








“Falsificationist”: could usefrequentist idea even for prior (“all similar studies”);could make sense for course mark databut needs strong backing with past data.

Subjective impact cannot be suppressed.






4.2 Exchangeability

If you observe x1, . . . , xn, future probabilitiesdon’t depend on order of observations.

It’s essential for most Bayesian data analysis(at some level) - constructs “Bayesian repetition”.

De Finetti’s theorem:Exchangeability ⇔ P(data) =

∫H∈H

P(data|H)P(H)with i.i.d. model P(data|H).






Exchangeability constructs “Bayesian repetition”;similar role as i.i.d. for frequentists.






Exchangeability constructs “Bayesian repetition”;similar role as i.i.d. for frequentists.

Exchangeability implies thatP{1} in the next go doesn’t depend onwhether you observe0,0,1,0,1,1,1,0,0,1,0,1,1,0,1,1,0,0,1,0,1,0,0,1 or

0,0,1,1,1,1,1,1,1,1,1,1,1,1,0,0,0,0,0,0,0,0,0,0.

Seems counterintuitive as epistemic assumption,rather conscious decision to ignore deviations (as iid).






From my point of view,the major philosophical problemswith Bayesian statisticsare about the same as with frequentism.







General problems of mathematical modelling,“creation” of repetition by i.i.d./exchangeability.







General problems of mathematical modelling,“creation” of repetition by i.i.d./exchangeability.

Bayes & frequentismformalise different concepts of probability(data generating mechanism vs. epistemic)that can make sense in different situations.





5. Cluster analysis and truth

Cluster analysis: methods to group data(“unsupervised classification”).





−20 −15 −10 −5 0

−10

−8

−6

−4

−2

0

dc 1

dc 2

−4 −2 0 2 4

−4

−2

02

4

x

y

var 1

−0.4 −0.2 0.0 0.2 0.4 −0.4 −0.2 0.0 0.2

−0.

4−

0.2

0.0

0.2

0.4

−0.

4−

0.2

0.0

0.2

0.4

var 2

var 3

−0.

20.

00.

20.

4

−0.4 −0.2 0.0 0.2 0.4

−0.

4−

0.2

0.0

0.2

−0.2 0.0 0.2 0.4

var 4

81 82 66 70 68 73 76 65 63 61 67 72 74 37 75 71 69 64 41 62 46 54 47 51 45 38 42 48 56 58 43 57 60 49 55 50 39 40 53 44 36 59 52 8 12 30 26 16 32 5 4 9 19 28 3 23 20 17 11 6 18 31 25 10 35 34 29 7 1 15 14 2 13 21 22 27 24 33 80 79 77 78 171

170

173

172 90 85 93 84 83 88 87 86 89 91 92 94 95 96 98 102

101

100 97 104 99 103

106

105

168

169

117

151

136

156

116

138

148

166

167

133

121

165

131

149

142

120

111

127

130

112

135

126

119

122

155

109

115

145

125

152

124

154

158

108

107

114

144

143

157

132

146

159

150

110

137

113

123

141

164

139

118

128

134

140

162

153

129

147

160

161

163

234

226

235

236

231

233

229

228

222

230

227

232

224

223

225

208

182

177

198

197

185

199

219

189

183

181

220

216

205

194

178

200

176

193

179

195

215

190

188

175

191

201

184

203

186

211

210

209

217

204

206

207

213

196

180

187

202

212

192

218

214

174

221

818266706873766563616772743775716964416246544751453842485658435760495550394053443659528123026163254919283232017116183125103534297115142132122272433807977781711701731729085938483888786899192949596981021011009710499103106105168169117151136156116138148166167133121165131149142120111127130112135126119122155109115145125152124154158108107114144143157132146159150110137113123141164139118128134140162153129147160161163234226235236231233229228222230227232224223225208182177198197185199219189183181220216205194178200176193179195215190188175191201184203186211210209217204206207213196180187202212192218214174221





What are the “true” clusters?

Need to connect substantial requirements tocharacteristics of cluster analysis methods.

Need a formal definition to measure method quality.





Intuitive clusters? (Often in literature)

−4 −2 0 2 4

−4

−2

02

4

x

y





Intuitive clusters?

−4 −2 0 2 4

−4

−2

02

4

x

y

−4 −2 0 2 4

−4

−2

02

4

x

y

−4 −2 0 2 4

−4

−2

02

4

x

y

. . . but may be inappropriate, e.g.,if small within-cluster distances required.





(Normal) mixture components in mixture distribution?

−5 0 5 10

0.00

0.05

0.10

0.15

0.20

0.25

0.30

x

dens

ity





(Normal) mixture components in mixture distribution?

−5 0 5 10

0.00

0.05

0.10

0.15

0.20

0.25

0.30

x

dens

ity

−4 −2 0 2 4 6 8

0.00

0.05

0.10

0.15

x

0.48

* d

norm

(x, 0

, 1)

+ 0

.48

* dn

orm

(x, 5

, 1)

+ 0

.04

* dn

orm

(x,

2

.5, 1

)

. . . but may not “cluster”.





Sometimes we know “true” clusters, don’t we?

M

M

M

MM

M

M

M

M

M MM

M

MM

MM

M

M

M M

MMM

MMM M

M

M

M

M

M

MM

MMM

M

MM

MM

M

M

M

M

M

M

M

F

F

F FF

F

FF

F

F

FF

F

F

F

F

FF

F

F

F

F

F

F

FF

F

F

F

F

F

FF

F

F

F

F

F

FF

F

F

F

F

F

F

FF

F

F

MMM

MM

M

M

M

M

MM

MM

MM

M M

M

M

MMMM

M

M

M

M

M

M

M

M

M

M

MM

M MM

M

M

M

M

M

MM

M

MMM

M

F

FF

F

F

FF

F

F

F

F

F

FF

FF

FFFF

F

F

F

F

F

F

F

F

F

F

F

F

F

FF

F

F

F

F

F

F

F

F

FF

F

F

F

F

F

−20 −15 −10 −5 0

−10

−8

−6

−4

−2

0

dc 1

dc 2

(These look quite “unclustery”.)





Actually we know some more:

B

B

B

BB

B

B

B

B

B BB

B

BB

BB

B

B

B B

BBB

BBB B

B

B

B

B

B

BB

BBB

B

BB

BB

B

B

B

B

B

B

B

B

B

B BB

B

BB

B

B

BB

B

B

B

B

BB

B

B

B

B

B

B

BB

B

B

B

B

B

BB

B

B

B

B

B

BB

B

B

B

B

B

B

BB

B

B

OOO

OO

O

O

O

O

OO

OO

OO

O O

O

O

OOOO

O

O

O

O

O

O

O

O

O

O

OO

O OO

O

O

O

O

O

OO

O

OOO

O

O

OO

O

O

OO

O

O

O

O

O

OO

OO

OOOO

O

O

O

O

O

O

O

O

O

O

O

O

O

OO

O

O

O

O

O

O

O

O

OO

O

O

O

O

O

−20 −15 −10 −5 0

−10

−8

−6

−4

−2

0

dc 1

dc 2

In reality there may be many valid classificationsand only one may be known to us.





In literature, methods are advertised asfinding the “natural” clusters, assumed unique.





In literature, methods are advertised asfinding the “natural” clusters, assumed unique.

Literature is not open about researcher’s needto decide what kind of clusters are requiredin given application.Suggests data can make all the decisions.





Researcher needs to “construct” cluster concept:by what kind of principle should clusters be separated?

(Small within-cluster distances, separation,probability mixture components, centroids etc.)

E.g., need to specify how the idea of a “species”is connected to genetic or phenotype measurements.





6. Conclusion◮ Foundations of statistics are quite shaky

for those who hope to discover “objective truth”.◮ Acknowledge what has to be decided subjectively

(hopefully well informed).◮ Acknowledge basic problems and limits

of mathematical modelling.◮ Acknowledge need for assumptions

that cannot be checked.◮ Motivate chosen method and (model) checks

from desired interpretation.


model assumptions and truth in statisticsucakche/presentations/modelass0115.pdf · cluster analysis...

Documents