
1

Theory of Regression

2

The Course

• 16 (or so) lessons– Some flexibility

• Depends how we feel• What we get through

3

Part I: Theory of Regression1. Models in statistics2. Models with more than one parameter:

regression3. Why regression?4. Samples to populations5. Introducing multiple regression6. More on multiple regression

4

Part 2: Application of regression7. Categorical predictor variables8. Assumptions in regression analysis9. Issues in regression analysis10.Non-linear regression11.Moderators (interactions) in regression12.Mediation and path analysisPart 3: Advanced Types of Regression13.Logistic Regression14.Poisson Regression15. Introducing SEM16. Introducing longitudinal multilevel models

5

House Rules

• Jeremy must remember– Not to talk too fast

• If you don’t understand– Ask – Any time

• If you think I’m wrong– Ask. (I’m not always right)

6

Learning New Techniques

• Best kind of data to learn a new technique– Data that you know well, and understand

• Your own data– In computer labs (esp later on)– Use your own data if you like

• My data – I’ll provide you with– Simple examples, small sample sizes

• Conceptually simple (even silly)

7

Computer Programs

• SPSS– Mostly

• Excel – For calculations

• GPower• Stata (if you like)• R (because it’s flexible and free)• Mplus (SEM, ML?)• AMOS (if you like)

8

9

10

Lesson 1: Models in statistics

Models, parsimony, error, mean, OLS estimators

11

What is a Model?

12

What is a model?

• Representation– Of reality– Not reality

• Model aeroplane represents a real aeroplane– If model aeroplane = real

aeroplane, it isn’t a model

13

• Statistics is about modelling– Representing and simplifying

• Sifting – What is important from what is not

important

• Parsimony – In statistical models we seek parsimony
– Parsimony = simplicity

14

Parsimony in Science

• A model should be:– 1: able to explain a lot– 2: use as few concepts as possible

• More it explains– The more you get

• Fewer concepts– The lower the price

• Is it worth paying a higher price for a better model?

15

A Simple Model

• Height of five individuals– 1.40m– 1.55m– 1.80m– 1.62m– 1.63m

• These are our DATA

16

A Little Notation

Y    The (vector of) data that we are modelling
Yi   The ith observation in our data.

Example: Y = (4, 5, 6, 7, 8), so Y2 = 5.

17

Greek letters represent the true value in the population.

β (beta)     Parameters in our model (population value)
β0           The value of the first parameter of our model in the population.
βj           The value of the jth parameter of our model, in the population.
ε (epsilon)  The error in the population model.

18

Normal letters represent the values in our sample. These are sample statistics, which are used to estimate population parameters.

b   A parameter in our model (a sample statistic)

e The error in our sample.

Y The data in our sample which we are trying to model.

19

Symbols on top change the meaning.

Y    The data in our sample which we are trying to model (repeated).
Ŷi   The estimated (predicted) value of Y for the ith case.
Ȳ    The mean of Y.

20

So β̂1 = b1.

I will use b1 (because it is easier to type)

21

• Not always that simple– some texts and computer programs

use

b = the parameter estimate (as we have used)

β (beta) = the standardised parameter estimate

SPSS does this.

22

A capital letter is the set (vector) of parameters/statistics

B Set of all parameters (b0, b1, b2, b3 … bp)

Rules are not used very consistently (even by me). Don't assume you know what someone means, without checking.

23

• We want a model – To represent those data

• Model 1:– 1.40m, 1.55m, 1.80m, 1.62m, 1.63m– Not a model

• A copy

– VERY unparsimonious

• Data: 5 statistics• Model: 5 statistics

– No improvement

24

• Model 2:– The mean (arithmetic mean)– A one parameter model

Ŷ = b0 = Ȳ = ΣYi / n   (summing over the i = 1 … n cases)

25

• Which, because we are lazy, can be written as

Ȳ = ΣY / n

26

The Mean as a Model

27

The (Arithmetic) Mean

• We all know the mean– The ‘average’– Learned about it at school– Forget (didn’t know) about how clever the

mean is

• The mean is:– An Ordinary Least Squares (OLS) estimator– Best Linear Unbiased Estimator (BLUE)

28

Mean as OLS Estimator

• Going back a step or two• MODEL was a representation of DATA

– We said we want a model that explains a lot– How much does a model explain?

DATA = MODEL + ERRORERROR = DATA - MODEL

– We want a model with as little ERROR as possible

29

• What is error?

Data (Y)   Model (b0 = mean)   Error (e)
1.63       1.60                 0.03
1.62       1.60                 0.02
1.80       1.60                 0.20
1.55       1.60                -0.05
1.40       1.60                -0.20

30

• How can we calculate the ‘amount’ of error?

• Sum of errors

ERROR = Σei = Σ(Yi − Ŷ) = Σ(Yi − b0)
      = 0.20 − 0.05 − 0.20 + 0.02 + 0.03 = 0

31

– 0 implies no ERROR

• Not the case

– Knowledge about ERROR is useful

• As we shall see later

32

• Sum of absolute errors– Ignore signs

ERROR = Σ|ei| = Σ|Yi − Ŷ| = Σ|Yi − b0|
      = 0.20 + 0.05 + 0.20 + 0.02 + 0.03 = 0.50

33

• Are small and large errors equivalent?– One error of 4– Four errors of 1

– The same?

– What happens with different data?• Y = (2, 2, 5)

– b0 = 2

– Not very representative

• Y = (2, 2, 4, 4)– b0 = any value from 2 - 4

– Indeterminate• There are an infinite number of solutions which would

satisfy our criteria for minimum error

34

• Sum of squared errors (SSE)

ERROR = Σei² = Σ(Yi − Ŷ)² = Σ(Yi − b0)²
      = 0.20² + 0.05² + 0.20² + 0.02² + 0.03² = 0.08

35

• Determinate– Always gives one answer

• If we minimise SSE– Get the mean

• Shown in graph– SSE plotted against b0

– Min value of SSE occurs when

– b0 = mean

36

[Figure: SSE plotted against candidate values of b0 (1.0 to 2.0). The curve reaches its minimum where b0 equals the mean.]
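A minimal sketch of that point (mine, not from the slides; plain Python): compute the SSE of the five heights for any candidate b0, and compare the mean with another value.

```python
# Heights used as the DATA example in Lesson 1.
heights = [1.40, 1.55, 1.80, 1.62, 1.63]

def sse(b0, data):
    """Sum of squared errors when the model is a single constant b0."""
    return sum((y - b0) ** 2 for y in data)

mean = sum(heights) / len(heights)        # 1.60
print(round(sse(mean, heights), 3))       # about 0.084 -- the minimum
print(round(sse(1.55, heights), 3))       # any other b0 gives a larger SSE
```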

37

The Mean as an OLS Estimate

38

Mean as OLS Estimate

• The mean is an Ordinary Least Squares (OLS) estimate– As are lots of other things

• This is exciting because– OLS estimators are BLUE– Best Linear Unbiased Estimators– Proven with Gauss-Markov Theorem

• Which we won’t worry about

39

BLUE Estimators• Best
– Minimum variance (of all possible unbiased estimators)
– Narrower distribution than other estimators• e.g. median, mode
• Linear– Linear predictions– For the mean– Linear (straight, flat) line
Ŷ = Ȳ

40

• Unbiased– Centred around true (population) values– Expected value = population value– Minimum is biased.

• Minimum in samples > minimum in population

• Estimators– Errrmm… they are estimators

• Also consistent– Sample approaches infinity, get closer

to population values– Variance shrinks

41

SSE and the Standard Deviation

• Tying up a loose end

SSE = Σ(Yi − Ŷ)²

σ = √( Σ(Yi − Ŷ)² / N )

s = √( Σ(Yi − Ŷ)² / (N − 1) )

42

• SSE closely related to SD• Sample standard deviation – s

– Biased estimator of population SD

• Population standard deviation - – Need to know the mean to calculate SD

• Reduces N by 1• Hence divide by N-1, not N

– Like losing one df

43

Proof

• That the mean minimises SSE– Not that difficult– As statistical proofs go

• Available in– Maxwell and Delaney – Designing

experiments and analysing data– Judd and McClelland – Data Analysis

(out of print?)

44

What’s a df?

• The number of parameters free to vary– When one is fixed

• Term comes from engineering– Movement available to structures

45

0 df: no variation available.
1 df: fix one corner, and the shape is fixed.

46

Back to the Data

• Mean has 5 (N) df– 1st moment
• Variance (SD) has N – 1 df– Mean has been fixed– 2nd moment– Can think of as the amount cases vary away from the mean

47

While we are at it …

• Skewness has N – 2 df– 3rd moment
• Kurtosis has N – 3 df– 4th moment

48

Parsimony and df

• Number of df remaining– Measure of parsimony

• Model which contained all the data– Has 0 df

– Not a parsimonious model

• Normal distribution– Can be described in terms of mean and standard deviation
• 2 parameters
– (z with 0 parameters)

49

Summary of Lesson 1

• Statistics is about modelling DATA– Models have parameters– Fewer parameters, more parsimony, better

• Models need to minimise ERROR– Best model, least ERROR – Depends on how we define ERROR – If we define error as sum of squared

deviations from predicted value– Mean is best MODEL

50

51

52

Lesson 2: Models with one more parameter -

regression

53

In Lesson 1 we said …

• Use a model to predict and describe data– Mean is a simple, one parameter model

54

More Models

Slopes and Intercepts

55

More Models• The mean is OK

– As far as it goes– It just doesn’t go very far– Very simple prediction, uses very little

information• We often have more information

than that– We want to use more information

than that

56

House Prices• In the UK, two of the largest

lenders (Halifax and Nationwide) compile house price indices– Predict the price of a house– Examine effect of different

circumstances

• Look at change in prices– Guides legislation

• E.g. interest rates, town planning

57

Predicting House Prices

Beds   £ (000s)
1      77
2      74
1      88
3      62
5      90
5      136
2      35
5      134
4      138
1      55

58

One Parameter Model• The mean

Ŷ = b0 = Ȳ = 88.9,   SSE = 11806.9

“How much is that house worth?” “£88,900” – we use 1 df to say that.

59

Adding More Parameters• We have more information than

this– We might as well use it– Add a linear function of number of

bedrooms (x1)

Ŷ = b0 + b1x1

60

Alternative Expression

• Estimate of Y (expected value of Y):  Ŷ = b0 + b1x1
• Value of Y:  Yi = b0 + b1xi1 + ei

61

Estimating the Model• We can estimate this model in four different,

equivalent ways– Provides more than one way of thinking about it

1. Estimating the slope which minimises SSE2. Examining the proportional reduction in

SSE3. Calculating the covariance 4. Looking at the efficiency of the predictions

62

Estimate the Slope to Minimise SSE

63

Estimate the Slope

• Stage 1– Draw a scatterplot– x-axis at mean

• Not at zero

• Mark errors on it– Called ‘residuals’– Sum and square these to find SSE

64

[Figure: scatterplot of price (£000s) against number of bedrooms, with a horizontal line drawn at the mean.]

65

[Figure: the same scatterplot, with the residuals from the mean marked on it.]

66

• Add another slope to the chart– Redraw residuals– Recalculate SSE– Move the line around to find slope

which minimises SSE

• Find the slope

67

• First attempt:

68

• Any straight line can be defined with two parameters– The location (height) of the slope

• b0

– Sometimes called

– The gradient of the slope • b1

69

• Gradient

1 unit

b1 units

70

• Height

b0 units

71

• Height• If we fix slope to zero

– Height becomes mean

– Hence mean is b0

• Height is defined as the point that the slope hits the y-axis– The constant– The y-intercept

72

• Why the constant?– b0x0

– Where x0 is 1.00 for every case• i.e. x0 is constant

• Implicit in SPSS– Some packages

force you to make it explicit

– (Later on we’ll need to make it explicit)

beds (x1)   x0   £ (000s)
1           1    77
2           1    74
1           1    88
3           1    62
5           1    90
5           1    136
2           1    35
5           1    134
4           1    138
1           1    55

73

• Why the intercept?– Where the regression line intercepts

the y-axis– Sometimes called y-intercept

74

Finding the Slope

• How do we find the values of b0 and b1?– To start with, we jiggle the values to find the best estimates which minimise SSE– Iterative approach

• Computer intensive – used to matter, doesn’t really any more

• (With fast computers and sensible search algorithms – more on that later)

75

• Start with– b0=88.9 (mean)– b1=10 (nice round number)

• SSE = 14948 – worse than it was

– b0=86.9, b1=10, SSE=13828– b0=66.9, b1=10, SSE=7029– b0=56.9, b1=10, SSE=6628– b0=46.9, b1=10, SSE=8228– b0=51.9, b1=10, SSE=7178– b0=51.9, b1=12, SSE=6179– b0=46.9, b1=14, SSE=5957– ……..

76

• Quite a long time later– b0 = 46.000372

– b1 = 14.79182

– SSE = 5921

• Gives the position of the – Regression line (or)– Line of best fit

• Better than guessing

• Not necessarily the only method– But it is OLS, so it is the best (it is

BLUE)
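A short sketch (mine, not from the slides; plain Python): the closed-form OLS solution lands on the same values the iterative search was crawling towards.

```python
# House-price data from the example (price in £000s).
beds  = [1, 2, 1, 3, 5, 5, 2, 5, 4, 1]
price = [77, 74, 88, 62, 90, 136, 35, 134, 138, 55]

n = len(beds)
mx = sum(beds) / n                    # 2.9
my = sum(price) / n                   # 88.9
sxy = sum((x - mx) * (y - my) for x, y in zip(beds, price))
sxx = sum((x - mx) ** 2 for x in beds)

b1 = sxy / sxx                        # slope
b0 = my - b1 * mx                     # intercept
sse = sum((y - (b0 + b1 * x)) ** 2 for x, y in zip(beds, price))
print(round(b0, 2), round(b1, 2), round(sse))   # 46.0 14.79 5921
```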

77

[Figure: price (£000s) plotted against number of bedrooms, showing the actual prices and the predicted prices from the regression line.]

78

• We now know – A house with no bedrooms is worth

£46,000 (??!)– Adding a bedroom adds £15,000

• Told us two things– Don’t extrapolate to meaningless

values of x-axis– Constant is not necessarily useful

• It is necessary to estimate the equation

79

Standardised Regression Line

• One big but:– Scale dependent

• Values change – £ to €, inflation

• Scales change– £, £000, £00?

• Need to deal with this

80

• Don’t express in ‘raw’ units– Express in SD units

– x1=1.72

– y=36.21

• b1 = 14.79

• We increase x1 by 1, and Ŷ increases by 14.79

SDsSDs 408.0)21.36/79.14(79.14

81

• Similarly, 1 unit of x1 = 1/1.72 SDs of x1
– Increase x1 by 1 SD (1.72 units)
– Ŷ increases by 14.79 × 1.72 units
• Put them both together:

β1 = b1 × σx1 / σy

82

• The standardised regression line– Change (in SDs) in Ŷ associated with a

change of 1 SD in x1

• A different route to the same answer– Standardise both variables (divide by

SD)– Find line of best fit

β1 = (14.79 × 1.72) / 36.21 = 0.706

83

• The standardised regression line has a special name

The Correlation Coefficient(r)

(r stands for ‘regression’, but more on that later)

• Correlation coefficient is a standardised regression slope– Relative change, in terms of SDs

84

Proportional Reduction in Error

85

Proportional Reduction in Error

• We might be interested in the level of improvement of the model– How much less error (as proportion)

do we have– Proportional Reduction in Error (PRE)

• Mean only– Error(model 0) = 11806

• Mean + slope– Error(model 1) = 5921

86

PRE = (ERROR(0) − ERROR(1)) / ERROR(0) = 1 − ERROR(1) / ERROR(0)

PRE = 1 − 5921 / 11806 = 0.4984

87

• But we squared all the errors in the first place – So we could take the square root– (It’s a shoddy excuse, but it makes the

point)

√0.4984 = 0.706

• This is the correlation coefficient• Correlation coefficient is the square

root of the proportion of variance explained

88

Standardised Covariance

89

Standardised Covariance• We are still iterating

– Need a ‘closed-form’– Equation to solve to get the parameter

estimates

• Answer is a standardised covariance– A variable has variance– Amount of ‘differentness’

• We have used SSE so far

90

• SSE varies with N– Higher N, higher SSE

• Divide by N– Gives SSE per person– (Actually N – 1, we have lost a df to

the mean)• The variance• Same as SD2

– We thought of SSE as a scattergram • Y plotted against X

– (repeated image follows)

91

[Figure: the price-against-bedrooms scatterplot, repeated.]

92

• Or we could plot Y against Y– Axes meet at the mean (88.9)– Draw a square for each point– Calculate an area for each square– Sum the areas

• Sum of areas – SSE

• Sum of areas divided by N– Variance

93

Plot of Y against Y

[Figure: Y (price) plotted against Y, axes running 0 to 180 and meeting at the mean (88.9).]

94

[Figure: the same Y-against-Y plot, with a square drawn for each point.]

Draw Squares

35 – 88.9 = -53.9

35 – 88.9 = -53.9

138 – 88.9 = 40.1

138 – 88.9 = 40.1

Area = -53.9 x -53.9

= 2905.21

Area = 40.1 x 40.1= 1608.1

95

• What if we do the same procedure– Instead of Y against Y– Y against X

• Draw rectangles (not squares)• Sum the area• Divide by N - 1• This gives us the variance of x with

y– The Covariance – Shortened to Cov(x, y)

96

97

55 − 88.9 = −33.9;  1 − 3 = −2;  Area = (−33.9) × (−2) = 67.8

138 − 88.9 = 49.1;  4 − 3 = 1;  Area = 49.1 × 1 = 49.1

98

• More formally (and easily)• We can state what we are doing as

an equation – Where Cov(x, y) is the covariance

• Cov(x,y)=44.2• What do points in different sectors

do to the covariance?

Cov(x, y) = Σ(x − x̄)(y − ȳ) / (N − 1)

99

• Problem with the covariance– Tells us about two things– The variance of X and Y– The covariance

• Need to standardise it– Like the slope

• Two ways to standardise the covariance– Standardise the variables first

• Subtract from mean and divide by SD

– Standardise the covariance afterwards

100

• First approach– Much more computationally

expensive• Too much like hard work to do by hand

– Need to standardise every value• Second approach

– Much easier– Standardise the final value only

• Need the combined variance– Multiply two variances– Find square root (were multiplied in

first place)

101

• Standardised covariance

r = Cov(x, y) / √( Var(x) × Var(y) ) = 44.2 / √(2.99 × 1311) = 0.706

102

• The correlation coefficient– A standardised covariance is a

correlation coefficient

r = Covariance / √( variance × variance )

103

• Expanded …

r = [ Σ(x − x̄)(y − ȳ) / (N − 1) ] / √( [Σ(x − x̄)² / (N − 1)] × [Σ(y − ȳ)² / (N − 1)] )

104

• This means …– We now have a closed form equation

to calculate the correlation– Which is the standardised slope– Which we can use to calculate the

unstandardised slope

105

We know that:  r = b1 × σx1 / σy

We know that:  b1 = r × σy / σx1

106

• So value of b1 is the same as the iterative approach

b1 = r × σy / σx1

b1 = 0.706 × 36.21 / 1.72 = 14.79

107

• The intercept– Just while we are at it

• The variables are centred at zero– We subtracted the mean from both

variables– Intercept is zero, because the axes

cross at the mean

108

• Add mean of y to the constant– Adjusts for centring y

• Subtract mean of x– But not the whole mean of x– Need to correct it for the slope

c = ȳ − b1x̄1

c = 88.9 − 14.79 × 2.9 = 46.00

• Naturally, the same

109

Accuracy of Prediction

110

One More (Last One)• We have one more way to

calculate the correlation– Looking at the accuracy of the

prediction

• Use the parameters– b0 and b1

– To calculate a predicted value for each case

111

• Plot actual price against predicted price– From the

model

Beds   Actual Price   Predicted Price
1      77             60.80
2      74             75.59
1      88             60.80
3      62             90.38
5      90             119.96
5      136            119.96
2      35             75.59
5      134            119.96
4      138            105.17
1      55             60.80

112

[Figure: predicted value plotted against actual value.]

113

• r = 0.706– The correlation

• Seems a futile thing to do– And at this stage, it is– But later on, we will see why

114

Some More Formulae• For hand calculation

• Point biserial

r = Σxy / √( Σx² × Σy² )   (x and y as deviations from their means)

rpb = (M1 − M0)√(PQ) / sdy

115

• Phi (φ)– Used for 2 dichotomous variables

Vote P Vote Q

Homeowner A: 19 B: 54

Not homeowner C: 60 D:53

rφ = (BC − AD) / √( (A+B)(C+D)(A+C)(B+D) )

116

• Problem with the phi correlation– Unless Px= Py (or Px = 1 – Py)

• Maximum (absolute) value is < 1.00• Tetrachoric can be used

• Rank (Spearman) correlation– Used where data are ranked

rs = 1 − 6Σd² / ( n(n² − 1) )
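A quick sketch (my own helper functions, plain Python) of the two hand-calculation formulas above: phi from a 2×2 table and Spearman's rank correlation.

```python
import math

def phi(a, b, c, d):
    """Phi coefficient for a 2x2 table with cells A, B, C, D."""
    return (b * c - a * d) / math.sqrt((a + b) * (c + d) * (a + c) * (b + d))

def spearman(rank_x, rank_y):
    """Spearman correlation from two lists of ranks (no ties)."""
    n = len(rank_x)
    d2 = sum((rx - ry) ** 2 for rx, ry in zip(rank_x, rank_y))
    return 1 - 6 * d2 / (n * (n ** 2 - 1))

print(round(phi(19, 54, 60, 53), 3))         # the voting/homeowner table above
print(spearman([1, 2, 3, 4], [1, 3, 2, 4]))  # toy ranks: 0.8
```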

117

Summary• Mean is an OLS estimate

– OLS estimates are BLUE

• Regression line– Best prediction of DV from IV– OLS estimate (like mean)

• Standardised regression line– A correlation

118

• Four ways to think about a correlation– 1. Standardised regression line– 2. Proportional Reduction in Error (PRE)– 3. Standardised covariance– 4. Accuracy of prediction– (a worked sketch of all four follows)
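A sketch (plain Python, my variable names) computing r for the house-price data in the four ways just listed; all four come out at about 0.706.

```python
import math

beds  = [1, 2, 1, 3, 5, 5, 2, 5, 4, 1]
price = [77, 74, 88, 62, 90, 136, 35, 134, 138, 55]
n = len(beds)
mx, my = sum(beds) / n, sum(price) / n
sd = lambda v, m: math.sqrt(sum((x - m) ** 2 for x in v) / (n - 1))
sdx, sdy = sd(beds, mx), sd(price, my)

b1 = sum((x - mx) * (y - my) for x, y in zip(beds, price)) / \
     sum((x - mx) ** 2 for x in beds)
b0 = my - b1 * mx

# 1. standardised regression slope
r1 = b1 * sdx / sdy
# 2. square root of the proportional reduction in error
sse0 = sum((y - my) ** 2 for y in price)
sse1 = sum((y - (b0 + b1 * x)) ** 2 for x, y in zip(beds, price))
r2 = math.sqrt(1 - sse1 / sse0)
# 3. standardised covariance
cov = sum((x - mx) * (y - my) for x, y in zip(beds, price)) / (n - 1)
r3 = cov / (sdx * sdy)
# 4. accuracy of prediction: correlate Y with the predicted values
pred = [b0 + b1 * x for x in beds]
mp = sum(pred) / n
r4 = (sum((p - mp) * (y - my) for p, y in zip(pred, price)) / (n - 1)) / \
     (sd(pred, mp) * sdy)

print([round(r, 3) for r in (r1, r2, r3, r4)])   # all about 0.706
```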

119

120

121

Lesson 3: Why Regression?

A little aside, where we look at why regression has such a

curious name.

122

Regression

The or an act of regression; reversion; return towards the

mean; return to an earlier stage of development, as in an adult’s or an adolescent’s behaving like a child(From Latin gradi, to go)

• So why name a statistical technique which is about prediction and explanation?

123

• Francis Galton – Charles Darwin’s cousin– Studying heritability

• Tall fathers have shorter sons• Short fathers have taller sons

– ‘Filial regression toward mediocrity’ – Regression to the mean

124

• Galton thought this was biological fact– Evolutionary basis?

• Then did the analysis backward– Tall sons have shorter fathers– Short sons have taller fathers

• Regression to the mean– Not biological fact, statistical artefact

125

Other Examples

• Secrist (1933): The Triumph of Mediocrity in Business

• Second albums often tend to not be as good as first

• Sequel to a film is not as good as the first one

• ‘Curse of Athletics Weekly’• Parents think that punishing bad

behaviour works, but rewarding good behaviour doesn’t

126

Pair Link Diagram

• An alternative to a scatterplot

x y

127

r=1.00

[Pair link diagram: each case's x is joined by a line to its y.]

128


r=0.00

129

From Regression to Correlation

• Where do we predict an individual’s score on y will be, based on their score on x?– Depends on the correlation

• r = 1.00 – we know exactly where they will be

• r = 0.00 – we have no idea• r = 0.50 – we have some idea

130

x y

r=1.00

Starts here

Will end up here

131

x y

r=0.00

Could end anywhere here

Starts here

132

r=0.50

x y

Starts here

Probably end

somewhere here

133

Galton Squeeze Diagram

• Don’t show individuals– Show groups of individuals, from the

same (or similar) starting point– Shows regression to the mean

134

x y

r=0.00

Group starts here

Ends here

Group starts here

135

x y

r=0.50

136

x y

r=1.00

137

x y

1 unit r units

• Correlation is amount of regression that doesn’t occur

138

x y

• No regression• r=1.00

139

• Some regression• r=0.50

x y

140

r=0.00

x y

• Lots (maximum) regression• r=0.00

141

Formula

ẑy = r × zx
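A small simulation sketch (assumptions mine, plain Python): with standardised scores and a true correlation r, the predicted ẑy is r × zx, so a group that is extreme on x is expected to finish nearer the mean on y – regression to the mean.

```python
import math
import random

random.seed(1)
r, n = 0.5, 100_000
pairs = []
for _ in range(n):
    zx = random.gauss(0, 1)
    zy = r * zx + math.sqrt(1 - r ** 2) * random.gauss(0, 1)
    pairs.append((zx, zy))

# Take the group that starts out extreme on x (more than 1.5 SDs above the mean)
tall = [(zx, zy) for zx, zy in pairs if zx > 1.5]
mean_zx = sum(zx for zx, _ in tall) / len(tall)
mean_zy = sum(zy for _, zy in tall) / len(tall)
print(round(mean_zx, 2), round(mean_zy, 2))   # mean zy is about r times mean zx
```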

142

Conclusion

• Regression towards mean is statistical necessity

regression = perfection – correlation• Very non-intuitive• Interest in regression and correlation

– From examining the extent of regression towards mean

– By Pearson – worked with Galton– Stuck with curious name

• See also Paper B3

143

144

145

Lesson 4: Samples to Populations – Standard Errors and Statistical

Significance

146

The Problem• In Social Sciences

– We investigate samples• Theoretically

– Randomly taken from a specified population

– Every member has an equal chance of being sampled

– Sampling one member does not alter the chances of sampling another

• Not the case in (say) physics, biology, etc.

147

Population

• But it’s the population that we are interested in– Not the sample– Population statistic represented with

Greek letter– Hat means ‘estimate’

μ̂x = x̄        β̂ = b

148

• Sample statistics (e.g. mean) estimate population parameters

• Want to know– Likely size of the parameter– If it is > 0

149

Sampling Distribution

• We need to know the sampling distribution of a parameter estimate– How much does it vary from sample to

sample

• If we make some assumptions– We can know the sampling distribution

of many statistics– Start with the mean

150

Sampling Distribution of the Mean

• Given– Normal distribution– Random sample– Continuous data

• Mean has a known sampling distribution– Repeatedly sampling will give a known

distribution of means– Centred around the true (population)

mean (μ)

151

Analysis Example: Memory• Difference in memory for different

words

– 10 participants given a list of 30 words to learn, and then tested

– Two types of word

• Abstract: e.g. love, justice

• Concrete: e.g. carrot, table

152

Concrete   Abstract   Diff (x)
12         4          8
11         7          4
4          6          -2
9          12         -3
8          6          2
12         10         2
9          8          1
8          5          3
12         10         2
8          4          4

N = 10,   x̄ = 2.1,   σx = 3.11

153

Confidence Intervals

• This means– If we know the mean in our sample– We can estimate where the mean in the population (μ) is likely to be

• Using– The standard error (se) of the mean– Represents the standard deviation of

the sampling distribution of the mean

154

[Normal curve: 1 SD either side of the mean contains 68%; almost 2 SDs contain 95%.]

155

• We know the sampling distribution of the mean– t distributed– Normal with large N (>30)
• We know the range within which means from other samples will fall– Therefore the likely range of μ

se(x̄) = σx / √n

156

• Two implications of equation– Increasing N decreases SE

• But only a bit

– Decreasing SD decreases SE • Calculate Confidence Intervals

– From standard errors• 95% is a standard level of CI

– 95% of samples the true mean will lie within the 95% CIs

– In large samples: 95% CI = 1.96 SE– In smaller samples: depends on t

distribution (df=N-1=9)

157

x̄ = 2.1,   σx = 3.11,   N = 10

se(x̄) = σx / √n = 3.11 / √10 = 0.98

158

95% CI = x̄ ± 2.26 × 0.98 = 2.1 ± 2.22

CI lower = −0.12,   CI upper = 4.32
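A sketch (scipy assumed available; not part of the slides) reproducing the memory example: standard error, 95% CI, and the t test of the mean difference against 0.

```python
import math
from scipy import stats

diff = [8, 4, -2, -3, 2, 2, 1, 3, 2, 4]       # concrete minus abstract
n = len(diff)
mean = sum(diff) / n                           # 2.1
sd = math.sqrt(sum((d - mean) ** 2 for d in diff) / (n - 1))   # 3.11
se = sd / math.sqrt(n)                         # 0.98

tcrit = stats.t.ppf(0.975, df=n - 1)           # 2.26
print(mean - tcrit * se, mean + tcrit * se)    # about -0.12 to 4.32

t = mean / se                                  # 2.14
p = 2 * stats.t.sf(abs(t), df=n - 1)           # about 0.06
print(round(t, 2), round(p, 3))
```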

159

What is a CI?

• (For 95% CI): • 95% chance that the true

(population) value lies within the confidence interval?

• 95% of samples, true mean will land within the confidence interval?

160

Significance Test

• Probability that μ is a certain value– Almost always 0
• Doesn't have to be though
• We want to test the hypothesis that the difference is equal to 0– i.e. find the probability of this difference occurring in our sample IF μ = 0– (Not the same as the probability that μ = 0)

161

• Calculate SE, and then t– t has a known sampling distribution– Can test probability that a certain

value is included

t = x̄ / se(x̄) = 2.1 / 0.98 = 2.14,   p = 0.061

162

Other Parameter Estimates

• Same approach– Prediction, slope, intercept, predicted

values– At this point, prediction and slope are

the same• Won’t be later on

• We will look at one predictor only– More complicated with > 1

163

Testing the Degree of Prediction

• Prediction is correlation of Y with Ŷ– The correlation – when we have one IV

• Use F, rather than t• Started with SSE for the mean only

– This is SStotal

– Divide this into SSresidual

– SSregression

• SStot = SSreg + SSres

164

F = (SSreg / df1) / (SSres / df2),   df1 = k,   df2 = N − k − 1

165

• Back to the house prices– Original SSE (SStotal) = 11806

– SSresidual = 5921• What is left after our model

– SSregression = 11806 – 5921 = 5885• What our model explains

• Slope = 14.79• Intercept = 46.0• r = 0.706

166

F = (SSreg / df1) / (SSres / df2)
  = (5885 / 1) / (5921 / (10 − 1 − 1)) = 7.95

df1 = k = 1,   df2 = N − k − 1 = 8

167

• F = 7.95, df = 1, 8, p = 0.02– Can reject H0

• H0: Prediction is not better than chance

– A significant effect
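A quick sketch (scipy assumed; not part of the slides) of how that F and p come out of the sums of squares:

```python
from scipy import stats

ss_total, ss_res = 11806.0, 5921.0
ss_reg = ss_total - ss_res           # 5885
k, n = 1, 10
df1, df2 = k, n - k - 1              # 1 and 8

F = (ss_reg / df1) / (ss_res / df2)  # about 7.95
p = stats.f.sf(F, df1, df2)          # about 0.02
print(round(F, 2), round(p, 3))
```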

168

Statistical Significance:What does a p-value (really)

mean?

169

A Quiz

• Six questions, each true or false• Write down your answers (if you like)

• An experiment has been done. Carried out perfectly. All assumptions perfectly satisfied. Absolutely no problems.

• P = 0.01– Which of the following can we say?

170

1. You have absolutely disproved the null hypothesis (that is, there is no difference between the population means).

171

2. You have found the probability of the null hypothesis being true.

172

3. You have absolutely proved your experimental hypothesis (that there is a difference between the population means).

173

4. You can deduce the probability of the experimental hypothesis being true.

174

5. You know, if you decide to reject the null hypothesis, the probability that you are making the wrong decision.

175

6. You have a reliable experimental finding in the sense that if, hypothetically, the experiment were repeated a great number of times, you would obtain a significant result on 99% of occasions.

176

OK, What is a p-value

• Cohen (1994)“[a p-value] does not tell us what we

want to know, and we so much want to know what we want to

know that, out of desperation, we nevertheless believe it does” (p

997).

177

OK, What is a p-value

• Sorry, didn’t answer the question• It’s The probability of obtaining a

result as or more extreme than the result we have in the study, given that the null hypothesis is true

• Not probability the null hypothesis is true

178

A Bit of Notation

• Not because we like notation– But we have to say a lot less

• Probability – P• Null hypothesis is true – H• Result (data) – D• Given - |

179

What’s a P Value

• P(D|H)– Probability of the data occurring if the

null hypothesis is true• Not• P(H|D)

– Probability that the null hypothesis is true, given that we have the data = p(H)

• P(H|D) ≠ P(D|H)

180

• What is probability you are prime minister– Given that you are british– P(M|B)– Very low

• What is probability you are British– Given you are prime minister– P(B|M)– Very high

• P(M|B) ≠ P(B|M)

181

• There’s been a murder– Someone bumped off a statto for talking

too much

• The police have DNA• The police have your DNA

– They match(!)

• DNA matches 1 in 1,000,000 people• What’s the probability you didn’t do

the murder, given the DNA match (H|D)

182

• Police say:– P(D|H) = 1/1,000,000

• Luckily, you have Jeremy on your defence team

• We say:– P(D|H) ≠ P(H|D)

• Probability that someone matches the DNA, who didn’t do the murder– Incredibly high

183

Back to the Questions

• Haller and Kraus (2002)– Asked those questions of groups in

Germany– Psychology Students– Psychology lecturers and professors

(who didn’t teach stats)– Psychology lecturers and professors

(who did teach stats)

184

1. You have absolutely disproved the null hypothesis (that is, there is no difference between the population means).

• True• 34% of students • 15% of professors/lecturers, • 10% of professors/lecturers teaching

statistics

• False• We have found evidence against

the null hypothesis

185

2. You have found the probability of the null hypothesis being true.

– 32% of students – 26% of professors/lecturers– 17% of professors/lecturers teaching

statistics

• False• We don’t know

186

3. You have absolutely proved your experimental hypothesis (that there is a difference between the population means).

– 20% of students – 13% of professors/lecturers– 10% of professors/lecturers teaching

statistics

• False

187

4. You can deduce the probability of the experimental hypothesis being true.

– 59% of students– 33% of professors/lecturers– 33% of professors/lecturers teaching

statistics

• False

188

5. You know, if you decide to reject the null hypothesis, the probability that you are making the wrong decision.

• 68% of students• 67% of professors/lecturers• 73% of professors/lecturers teaching statistics

• False• Can be worked out

– P(replication)

189

6. You have a reliable experimental finding in the sense that if, hypothetically, the experiment were repeated a great number of times, you would obtain a significant result on 99% of occasions.

– 41% of students– 49% of professors/lecturers– 37% of professors/lecturers teaching statistics • False• Another tricky one

– It can be worked out

190

One Last Quiz

• I carry out a study– All assumptions perfectly satisfied– Random sample from population– I find p = 0.05

• You replicate the study exactly– What is probability you find p < 0.05?

191

• I carry out a study– All assumptions perfectly satisfied– Random sample from population– I find p = 0.01

• You replicate the study exactly– What is probability you find p < 0.05?

192

• Significance testing creates boundaries and gaps where none exist.

• Significance testing means that we find it hard to build upon knowledge– we don’t get an accumulation of

knowledge

193

• Yates (1951) "the emphasis given to formal tests of

significance ... has resulted in ... an undue concentration of effort by mathematical statisticians on investigations of tests of

significance applicable to problems which are of little or no practical importance ... and ... it has caused scientific research workers to pay undue attention to the

results of the tests of significance ... and too little to the estimates of the magnitude

of the effects they are investigating

194

Testing the Slope

• Same idea as with the mean– Estimate 95% CI of slope– Estimate significance of difference

from a value (usually 0)

• Need to know the sd of the slope– Similar to SD of the mean

195

sy.x = √( Σ(Y − Ŷ)² / (N − k − 1) ) = √( SSres / (N − k − 1) )

sy.x = √(5921 / 8) = 27.2

196

• Similar to equation for SD of mean• Then we need standard error

- Similar (ish)• When we have standard error

– Can go on to 95% CI– Significance of difference

197

se(by.x) = sy.x / √( Σ(x − x̄)² ) = 27.2 / √26.9 = 5.24

198

• Confidence Limits• 95% CI– t dist with N − k − 1 = 8 df gives 2.31– CI = 14.8 ± 2.31 × 5.24 = 14.8 ± 12.1
• 95% confidence limits: 14.8 − 12.1 = 2.7 and 14.8 + 12.1 = 26.9

199

• Significance of difference from zero– i.e. probability of getting this result if β = 0
• Not the probability that β = 0

t = b / se(b) = 2.81,   df = N − k − 1 = 8,   p = 0.02

• This probability is (of course) the same as the value for the prediction
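A sketch (plain Python plus scipy, assumptions mine) of the slope's standard error, 95% CI and t test, using the quantities from the house-price model:

```python
import math
from scipy import stats

ss_res, n, k = 5921.0, 10, 1
s_yx = math.sqrt(ss_res / (n - k - 1))        # 27.2
sxx = 26.9                                    # sum of (x - mean(x))^2 for beds
se_b = s_yx / math.sqrt(sxx)                  # 5.24

b1 = 14.79
tcrit = stats.t.ppf(0.975, df=n - k - 1)      # 2.31
print(b1 - tcrit * se_b, b1 + tcrit * se_b)   # about 2.7 to 26.9

t = b1 / se_b                                 # about 2.8
p = 2 * stats.t.sf(t, df=n - k - 1)           # about 0.02
print(round(t, 2), round(p, 2))
```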

200

Testing the Standardised Slope (Correlation)

• Correlation is bounded between –1 and +1– Does not have symmetrical distribution,

except around 0• Need to transform it

– Fisher z’ transformation – approximately normal

z′ = 0.5 [ ln(1 + r) − ln(1 − r) ]

SEz′ = 1 / √(n − 3)

201

• 95% CIs – 0.879 – 1.96 * 0.38 = 0.13– 0.879 + 1.96 * 0.38 = 1.62

z′ = 0.5 [ ln(1 + 0.706) − ln(1 − 0.706) ] = 0.879

SEz′ = 1 / √(10 − 3) = 0.38

202

• Transform back to correlation

• 95% CIs = 0.13 to 0.92• Very wide

– Small sample size– Maybe that’s why CIs are not

reported?

r = ( e^(2z′) − 1 ) / ( e^(2z′) + 1 )

203

Using Excel

• Functions in excel– Fisher() – to carry out Fisher

transformation– Fisherinv() – to transform back to

correlation
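The same transformation in a short Python sketch (mine; Excel's FISHER() and FISHERINV() do the equivalent job):

```python
import math

r, n = 0.706, 10
z = 0.5 * (math.log(1 + r) - math.log(1 - r))   # 0.879
se = 1 / math.sqrt(n - 3)                       # about 0.38

lo_z, hi_z = z - 1.96 * se, z + 1.96 * se       # about 0.13 and 1.62
back = lambda z: (math.exp(2 * z) - 1) / (math.exp(2 * z) + 1)
print(round(back(lo_z), 2), round(back(hi_z), 2))   # close to the 0.13 and 0.92 quoted above
```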

204

The Others

• Same ideas for calculation of CIs and SEs for – Predicted score– Gives expected range of values given

X

• Same for intercept– But we have probably had enough

205

Lesson 5: Introducing Multiple Regression

206

Residuals• We said

Y = b0 + b1x1

• We could have saidYi = b0 + b1xi1 + ei

• We ignored the i on the Y• And we ignored the ei

– It’s called error, after all• But it isn’t just error

– Trying to tell us something

207

What Error Tells Us

• Error tells us that a case has a different score for Y than we predict– There is something about that case

• Called the residual– What is left over, after the model

• Contains information– Something is making the residual 0– But what?

208

[Figure: price against number of bedrooms with actual and predicted prices; one house sits well above the line (swimming pool?) and one well below (unpleasant neighbours?).]

209

• The residual (+ the mean) is the value of Y

If all cases were equal on X• It is the value of Y, controlling for

X

• Other words:– Holding constant– Partialling– Residualising– Conditioned on

210

Pred61766190

12012076

12010561

Adj. Value1059062

11711973

129755695

Res-16

2-272830

-1641

-14-33

6

Beds £ (000s)1 772 741 883 625 905 1362 355 1344 1381 55

211

• Sometimes adjustment is enough on its own– Measure performance against criteria

• Teenage pregnancy rate– Measure pregnancy and abortion rate in areas– Control for socio-economic deprivation, and

anything else important– See which areas have lower teenage

pregnancy and abortion rate, given same level of deprivation

• Value added education tables– Measure school performance– Control for initial intake

212

Control?• In experimental research

– Use experimental control– e.g. same conditions, materials, time

of day, accurate measures, random assignment to conditions

• In non-experimental research– Can’t use experimental control– Use statistical control instead

213

Analysis of Residuals

• What predicts differences in crime rate – After controlling for socio-economic

deprivation– Number of police?– Crime prevention schemes?– Rural/Urban proportions?– Something else

• This is what regression is about

214

• Exam performance– Consider number of books a student

read (books)– Number of lectures (max 20) a

student attended (attend)

• Books and attend as IV, grade as DV

215

First 10 cases:

Books   Attend   Grade
0       9        45
1       15       57
0       10       45
2       16       51
4       10       65
4       20       88
1       11       44
4       20       87
3       15       89
0       15       59

216

• Use books as IV– R=0.492, F=12.1, df=1, 28, p=0.001

– b0=52.1, b1=5.7

– (Intercept makes sense)

• Use attend as IV– R=0.482, F=11.5, df=1, 28, p=0.002

– b0=37.0, b1=1.9

– (Intercept makes less sense)

217

[Figure: grade (30–100) plotted against books read (0–5).]

218

[Figure: grade (30–100) plotted against lectures attended (5–21).]

219

Problem• Use R2 to give proportion of shared

variance– Books = 24%– Attend = 23%

• So we have explained 24% + 23% = 47% of the variance– NO!!!!!

220

• Correlation of books and attend is (unsurprisingly) not zero– Some of the variance that books shares

with grade, is also shared by attend

• Look at the correlation matrix

          BOOKS   ATTEND   GRADE
BOOKS     1
ATTEND    0.44    1
GRADE     0.49    0.48     1

221

• I have access to 2 cars• My wife has access to 2 cars

– We have access to four cars?– No. We need to know how many of my

2 cars are the same cars as her 2 cars• Similarly with regression

– But we can do this with the residuals– Residuals are what is left after (say)

books– See of residual variance is explained

by attend– Can use this new residual variance to

calculate SSres, SStotal and SSreg

222

• Well. Almost.– This would give us correct values for

SS– Would not be correct for slopes, etc

• Assumes that the variables have a causal priority– Why should attend have to take what

is left from books?– Why should books have to take what

is left by attend?

• Use OLS again

223

• Simultaneously estimate 2 parameters– b1 and b2

– Y = b0 + b1x1 + b2x2

– x1 and x2 are IVs

• Not trying to fit a line any more– Trying to fit a plane

• Can solve iteratively– Closed form equations better– But they are unwieldy

224

x1

x2

y

3D scatterplot(2points only)

225

x1

x2

y

b0

b1

b2

226

(Really) Ridiculous Equations

b1 = [ Σ(x2 − x̄2)² Σ(x1 − x̄1)(y − ȳ) − Σ(x1 − x̄1)(x2 − x̄2) Σ(x2 − x̄2)(y − ȳ) ]
     / [ Σ(x1 − x̄1)² Σ(x2 − x̄2)² − ( Σ(x1 − x̄1)(x2 − x̄2) )² ]

b2 = [ Σ(x1 − x̄1)² Σ(x2 − x̄2)(y − ȳ) − Σ(x1 − x̄1)(x2 − x̄2) Σ(x1 − x̄1)(y − ȳ) ]
     / [ Σ(x1 − x̄1)² Σ(x2 − x̄2)² − ( Σ(x1 − x̄1)(x2 − x̄2) )² ]

b0 = ȳ − b1x̄1 − b2x̄2

227

• The good news– There is an easier way

• The bad news– It involves matrix algebra

• The good news– We don’t really need to know how to

do it

• The bad news – We need to know it exists

228

A Quick Guide to Matrix Algebra

(I will never make you do it again)

229

Very Quick Guide to Matrix Algebra

• Why?– Matrices make life much easier in

multivariate statistics– Some things simply cannot be done

without them– Some things are much easier with them

• If you can manipulate matrices– you can specify calculations v. easily– e.g. AA’ = sum of squares of a column

• Doesn’t matter how long the column

230

• A scalar is a numberA scalar: 4

• A vector is a row or column of numbers

11

5

A row vector:

A column vector:

7842

231

• A vector is described as rows x columns

– Is a 1 4 vector

– Is a 2 1 vector– A number (scalar) is a 1 1 vector

7842

11

5

232

• A matrix is a rectangle, described as rows x columns

87251

35754

87562

• Is a 3 x 5 matrix

• Matrices are referred to with bold capitals

- A is a matrix

233

• Correlation matrices and covariance matrices are special – They are square and symmetrical– Correlation matrix of books, attend

and grade

( 1.00   0.44   0.49 )
( 0.44   1.00   0.48 )
( 0.49   0.48   1.00 )

234

• Another special matrix is the identity matrix I– A square matrix, with 1 in the

diagonal and 0 in the off-diagonal

I = ( 1 0 0 0 )
    ( 0 1 0 0 )
    ( 0 0 1 0 )
    ( 0 0 0 1 )

– Note that this is a correlation matrix, with correlations all = 0

235

Matrix Operations

• Transposition– A matrix is transposed by putting it

on its side – Transpose of A is A’

6

5

7

'

657

A

A

236

• Matrix multiplication– A matrix can be multiplied by a scalar,

a vector or a matrix– Not commutative– AB BA– To multiply AB

• Number of rows in A must equal number of columns in B

237

• Matrix by vector

141

99

33

4

3

2

231917

13117

532

238

• Matrix by matrix

( a b ) ( e f )  =  ( ae + bg   af + bh )
( c d ) ( g h )     ( ce + dg   cf + dh )

( 2 3 ) ( 2 3 )  =  ( 4 + 12    6 + 15 )  =  ( 16  21 )
( 5 7 ) ( 4 5 )     ( 10 + 28   15 + 35 )    ( 38  50 )

239

• Multiplying by the identity matrix– Has no effect – Like multiplying by 1

( 2 3 ) ( 1 0 )  =  ( 2 3 )
( 5 7 ) ( 0 1 )     ( 5 7 )

AI = A

240

• The inverse of J is: 1/J• J x 1/J = 1• Same with matrices

– Matrices have an inverse– Inverse of A is A-1

– AA-1=I

• Inverting matrices is dull– We will do it once– But first, we must calculate the

determinant

241

• The determinant of A is |A|• Determinants are important in

statistics– (more so than the other matrix

algebra)

• We will do a 2x2 – Much more difficult for larger

matrices

242

A = ( a b )        |A| = ad − cb
    ( c d )

A = ( 1.0  0.3 )   |A| = 1 × 1 − 0.3 × 0.3 = 0.91
    ( 0.3  1.0 )

243

• Determinants are important because– Needs to be above zero for regression

to work– Zero or negative determinant of a

correlation/covariance matrix means something wrong with the data• Linear redundancy

• Described as:– Not positive definite– Singular (if determinant is zero)

• In different error messages

244

• Next, the adjoint

A = ( a b )        adj A = (  d  −b )
    ( c d )                ( −c   a )

• Now:  A⁻¹ = (1 / |A|) × adj A

245

• Find A-1

A = ( 1.0  0.3 )      |A| = 0.91
    ( 0.3  1.0 )

A⁻¹ = (1 / 0.91) ( 1.0  −0.3 )  =  (  1.10  −0.33 )
                 ( −0.3  1.0 )     ( −0.33   1.10 )
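A numpy sketch (mine) of the same 2×2 example: determinant, inverse, and the check that A times its inverse gives the identity matrix.

```python
import numpy as np

A = np.array([[1.0, 0.3],
              [0.3, 1.0]])

print(np.linalg.det(A))             # about 0.91
print(np.linalg.inv(A).round(2))    # [[ 1.1  -0.33] [-0.33  1.1 ]]
print(np.allclose(A @ np.linalg.inv(A), np.eye(2)))   # True
```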

246

Matrix Algebra with Correlation Matrices

247

Determinants

• Determinant of a correlation matrix– The volume of ‘space’ taken up by

the (hyper) sphere that contains all of the points

1.0 0.0

0.0 1.0

1.0

A

A

248

1.0 0.0

0.0 1.0

1.0

A

A

X X

X

X X

249

1.0 1.0

1.0 1.0

0.0

A

A

X

X

X

250

Negative Determinant

• Points take up less than no space– Correlation matrix cannot exist – Non-positive definite matrix

251

Sometimes Obvious

A = ( 1.0  1.2 )      |A| = −0.44
    ( 1.2  1.0 )

252

Sometimes Obvious (If You Think)

A = ( 1     0.9   0.9 )
    ( 0.9   1    −0.9 )      |A| = −2.88
    ( 0.9  −0.9   1   )

253

Sometimes No Idea

1.00 0.76 0.40

0.76 1 0.30

0.40 0.30 1

A

0.01A 1.00 0.75 0.40

0.75 1 0.30

0.40 0.30 1

A

0.0075A

254

Multiple R for Each Variable

• Diagonal of inverse of correlation matrix– Used to calculate multiple R

– Call elements aij

R²i.123…k = 1 − 1 / aii

255

Regression Weights

• Where i is DV• j is IV

bij = −aij / aii

256

Back to the Good News• We can calculate the standardised parameters as

B = Rxx⁻¹ Rxy

• Where – B is the vector of regression weights– Rxx⁻¹ is the inverse of the correlation matrix of the independent (x) variables– Rxy is the vector of correlations of the x variables with y– Now do exercise 3.2 (a numpy sketch follows)
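A numpy sketch (mine, not the exercise answer sheet) of B = Rxx⁻¹ Rxy, using the books/attend/grade correlations quoted earlier (0.44, 0.49, 0.48):

```python
import numpy as np

Rxx = np.array([[1.00, 0.44],
                [0.44, 1.00]])       # correlations among the IVs
Rxy = np.array([0.49, 0.48])         # correlations of each IV with grade

B = np.linalg.inv(Rxx) @ Rxy         # standardised regression weights
print(B.round(3))                    # roughly [0.35, 0.33]
```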

257

One More Thing

• The whole regression equation can be described with matrices– very simply

Y = XB + E

258

• Where– Y = vector of DV– X = matrix of IVs– B = vector of coefficients

• Go all the way back to our example

259

Written out for the ten cases (columns of X: constant x0, books, attend):

Grade (Y)   x0   Books   Attend        b        e
45          1    0       9             b0       e1
57          1    1       15            b1       e2
45          1    0       10            b2       e3
51          1    2       16                     e4
65          1    4       10                     e5
88          1    4       20                     e6
44          1    1       11                     e7
87          1    4       20                     e8
89          1    3       15                     e9
59          1    0       15                     e10

Y = Xb + e

260

59

89

87

44

88

65

51

45

57

45

1501

1531

2041

1111

2041

1041

1621

1001

511

901

10

9

8

7

6

5

4

3

2

1

2

1

0

e

e

e

e

e

e

e

e

e

e

b

b

b

The constant – literally a constant. Could be any number, but it is most

convenient to make it 1. Used to ‘capture’ the

intercept.

261

59

89

87

44

88

65

51

45

57

45

1501

1531

2041

1111

2041

1041

1621

1001

511

901

10

9

8

7

6

5

4

3

2

1

2

1

0

e

e

e

e

e

e

e

e

e

e

b

b

bThe matrix of values for IVs (books and

attend)

262

59

89

87

44

88

65

51

45

57

45

1501

1531

2041

1111

2041

1041

1621

1001

511

901

10

9

8

7

6

5

4

3

2

1

2

1

0

e

e

e

e

e

e

e

e

e

e

b

b

bThe parameter

estimates. We are trying to find the

best values of these.

263

59

89

87

44

88

65

51

45

57

45

1501

1531

2041

1111

2041

1041

1621

1001

511

901

10

9

8

7

6

5

4

3

2

1

2

1

0

e

e

e

e

e

e

e

e

e

e

b

b

b

Error. We are trying to minimise this

264

59

89

87

44

88

65

51

45

57

45

1501

1531

2041

1111

2041

1041

1621

1001

511

901

10

9

8

7

6

5

4

3

2

1

2

1

0

e

e

e

e

e

e

e

e

e

e

b

b

b

The DV - grade

265

• Y=BX+E• Simple way of representing as many IVs

as you likeY = b0x0 + b1x1 + b2x2 + b3x3 + b4x4 + b5x5 + e

2

1

5

4

3

2

1

0

524232221202

514131211101

e

e

b

b

b

b

b

b

xxxxxx

xxxxxx

266

exbxbxb

e

e

b

b

b

b

b

b

xxxxxx

xxxxxx

kk

...1100

2

1

5

4

3

2

1

0

524232221202

514131211101

267

Generalises to Multivariate Case

• Y=BX+E• Y, B and E

– Matrices, not vectors

• Goes beyond this course– (Do Jacques Tacq’s course for more)– (Or read his book)

268

269

270

271

Lesson 6: More on Multiple Regression

272

Parameter Estimates

• Parameter estimates (b1, b2 … bk) were standardised – Because we analysed a correlation

matrix

• Represent the correlation of each IV with the DV– When all other IVs are held constant

273

• Can also be unstandardised• Unstandardised represent the unit

change in the DV associated with a 1 unit change in the IV– When all the other variables are held

constant• Parameters have standard errors

associated with them– As with one IV– Hence t-test, and associated

probability can be calculated• Trickier than with one IV

274

Standard Error of Regression Coefficient

• Standardised is easier

– R2i is the value of R2 when all other

predictors are used as predictors of that variable

• Note that if R2i = 0, the equation is the same as

for previous

iRkn

RSE Y

i 2

2

1

1

1

1

275

Multiple R

• The degree of prediction– R (or Multiple R) – No longer equal to b

• R2 Might be equal to the sum of squares of B– Only if all x’s are uncorrelated

276

In Terms of Variance• Can also think of this in terms of

variance explained.– Each IV explains some variance in the

DV– The IVs share some of their variance

• Can’t share the same variance twice

277

The total variance of Y

= 1

Variance in Y accounted for by

x1

rx1y2 = 0.36

Variance in Y accounted for by

x2

rx2y2 = 0.36

278

• In this model– R2 = ryx1

2 + ryx22

– R2 = 0.36 + 0.36 = 0.72– R = 0.72 = 0.85

• But– If x1 and x2 are correlated

– No longer the case

279

The total variance of Y

= 1

Variance in Y accounted for by

x2

rx2y2 = 0.36

Variance in Y accounted for by

x1

rx1y2 = 0.36

Variance shared between x1 and x2

(not equal to rx1x2)

280

• So– We can no longer sum the r2

– Need to sum them, and subtract the shared variance – i.e. the correlation

• But– It’s not the correlation between them– It’s the correlation between them as a

proportion of the variance of Y

• Two different ways

281

• If rx1x2 = 0

– rxy = bx1

– Equivalent to ryx12 + ryx2

2

21 212

yxyx rbrbR

• Based on estimates

282

• rx1x2 = 0

– Equivalent to ryx12 + ryx2

2

2

222

21

212121

1

2

xx

xxyxyxyxyx

r

rrrrrR

• Based on correlations

283

• Can also be calculated using methods we have seen– Based on PRE– Based on correlation with prediction

• Same procedure with >2 IVs

284

Adjusted R2

• R2 is an overestimate of population value of R2

– Any x will not correlate 0 with Y– Any variation away from 0 increases R– Variation from 0 more pronounced

with lower N• Need to correct R2

– Adjusted R2

285

• 1 – R2

– Proportion of unexplained variance– We multiple this by an adjustment

• More variables – greater adjustment• More people – less adjustment

Adj. R² = 1 − (1 − R²) × (N − 1) / (N − k − 1)

• Calculation of Adj. R² (a sketch follows)
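A sketch (mine, plain Python) of the adjustment: more predictors (k) shrink R² more, more cases (N) shrink it less.

```python
def adjusted_r2(r2, n, k):
    """Adjusted R-squared: 1 - (1 - R^2)(N - 1)/(N - k - 1)."""
    return 1 - (1 - r2) * (n - 1) / (n - k - 1)

print(round(adjusted_r2(0.50, 20, 3), 3))   # 0.406 -- modest shrinkage
print(round(adjusted_r2(0.50, 10, 8), 3))   # -3.5  -- drastic (it can even go negative)
```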

286

Shrunken R2

• Some authors treat shrunken and adjusted R2 as the same thing– Others don’t

287

The adjustment factor (N − 1) / (N − k − 1):

N = 20, k = 3:   19 / 16 = 1.1875
N = 10, k = 8:   9 / 1 = 9
N = 10, k = 3:   9 / 6 = 1.5

288

Extra Bits

• Some stranger things that can happen

– Counter-intuitive

289

• Can be hard to understand– Very counter-intuitive

• Definition– An independent variable which

increases the size of the parameters associated with other independent variables above the size of their correlations

Suppressor variables

290

• An example (based on Horst, 1941)– Success of trainee pilots

– Mechanical ability (x1), verbal ability (x2), success (y)

• Correlation matrixMech Verb Success

Mech 1 0.5 0.3Verb 0.5 1 0

Success 0.3 0 1

291

– Mechanical ability correlates 0.3 with success

– Verbal ability correlates 0.0 with success

– What will the parameter estimates be?

– (Don’t look ahead until you have had a guess)

292

• Mechanical ability– b = 0.4– Larger than r!

• Verbal ability – b = -0.2– Smaller than r!!

• So what is happening?– You need verbal ability to do the test– Not related to mechanical ability

• Measure of mechanical ability is contaminated by verbal ability
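A numpy sketch (mine) that reproduces those numbers straight from the correlation matrix above, using B = Rxx⁻¹ Rxy:

```python
import numpy as np

Rxx = np.array([[1.0, 0.5],
                [0.5, 1.0]])     # mechanical with verbal
Rxy = np.array([0.3, 0.0])       # each IV with success

print(np.linalg.inv(Rxx) @ Rxy)  # [ 0.4 -0.2]: mech rises above its r, verbal goes negative
```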

293

• High mech, low verbal– High mech

• This is positive

– Low verbal • Negative, because we are talking about

standardised scores• Your mech is really high – you did well on

the mechanical test, without being good at the words

• High mech, high verbal– Well, you had a head start on mech,

because of verbal, and need to be brought down a bit

294

Another suppressor?x1 x2 y

x1 1 0.5 0.3x2 0.5 1 0.2y 0.3 0.2 1

b1 = b2 =

295

Another suppressor?

x1 x2 yx1 1 0.5 0.3x2 0.5 1 0.2y 0.3 0.2 1

b1 =0.26 b2 = -0.06

296

And another?x1 x2 y

x1 1 0.5 0.3x2 0.5 1 -0.2y 0.3 -0.2 1

b1 = b2 =

297

And another?

x1 x2 yx1 1 0.5 0.3x2 0.5 1 -0.2y 0.3 -0.2 1

b1 = 0.53 b2 = -0.47

298

One more?x1 x2 y

x1 1 -0.5 0.3x2 -0.5 1 0.2y 0.3 0.2 1

b1 = b2 =

299

One more?

x1 x2 yx1 1 -0.5 0.3x2 -0.5 1 0.2y 0.3 0.2 1

b1 = 0.53 b2 = 0.47

300

• Suppression happens when two opposing forces are happening together– And have opposite effects

• Don’t throw away your IVs,– Just because they are uncorrelated with the

DV

• Be careful in interpretation of regression estimates– Really need the correlations too, to interpret

what is going on– Cannot compare between studies with

different IVs

301

Standardised Estimates > 1

• Correlations are bounded -1.00 ≤ r ≤ +1.00

– We think of standardised regression estimates as being similarly bounded• But they are not

– Can go >1.00, <-1.00– R cannot, because that is a proportion

of variance

302

• Three measures of ability– Mechanical ability, verbal ability 1,

verbal ability 2– Score on science exam

Mech Verbal1 Verbal2 ScoresMech 1 0.1 0.1 0.6

Verbal1 0.1 1 0.9 0.6Verbal2 0.1 0.9 1 0.3Scores 0.6 0.6 0.3 1

–Before reading on, what are the parameter estimates?

303

• Mechanical– About where we expect

• Verbal 1– Very high

• Verbal 2– Very low

Mech 0.56Verbal1 1.71Verbal2 -1.29

304

• What is going on– It’s a suppressor again– An independent variable which

increases the size of the parameters associated with other independent variables above the size of their correlations

• Verbal 1 and verbal 2 are correlated so highly– They need to cancel each other out

305

Variable Selection

• What are the appropriate independent variables to use in a model?– Depends what you are trying to do

• Multiple regression has two separate uses– Prediction– Explanation

306

• Prediction – What will happen

in the future?– Emphasis on

practical application

– Variables selected (more) empirically

– Value free

• Explanation– Why did

something happen?

– Emphasis on understanding phenomena

– Variables selected theoretically

– Not value free

307

• Visiting the doctor– Precedes suicide attempts– Predicts suicide

• Does not explain suicide

• More on causality later on …• Which are appropriate variables

– To collect data on?– To include in analysis?– Decision needs to be based on theoretical

knowledge of the behaviour of those variables– Statistical analysis of those variables (later)

• Unless you didn’t collect the data

– Common sense (not a useful thing to say)

308

Variable Entry Techniques

• Entry-wise– All variables entered simultaneously

• Hierarchical– Variables entered in a predetermined

order• Stepwise

– Variables entered according to change in R2

– Actually a family of techniques

309

• Entrywise– All variables entered simultaneously– All treated equally

• Hierarchical– Entered in a theoretically determined

order– Change in R2 is assessed, and tested for

significance– e.g. sex and age

• Should not be treated equally with other variables

• Sex and age MUST be first

– Confused with hierarchical linear modelling

310

• Stepwise– Variables entered empirically– Variable which increases R2 the most

goes first• Then the next …

– Variables which have no effect can be removed from the equation

• Example– IVs: Sex, age, extroversion, – DV: Car – how long someone spends

looking after their car

311

SEX AGE EXTRO CARSEX 1.00 -0.05 0.40 0.66AGE -0.05 1.00 0.40 0.23EXTRO 0.40 0.40 1.00 0.67CAR 0.66 0.23 0.67 1.00

• Correlation Matrix

312

• Entrywise analysis– r2 = 0.64

b pSEX 0.49 <0.01AGE 0.08 0.46EXTRO 0.44 <0.01

313

• Stepwise Analysis– Data determines the order– Model 1: Extroversion, R2 = 0.450– Model 2: Extroversion + Sex, R2 =

0.633

b pEXTRO 0.48 <0.01

SEX 0.47 <0.01

314

• Hierarchical analysis– Theory determines the order– Model 1: Sex + Age, R2 = 0.510– Model 2: S, A + E, R2 = 0.638– Change in R2 = 0.128, p = 0.001

SEX    0.49   <0.01
AGE    0.08   0.46
EXTRO  0.44   <0.01

315

• Which is the best model?– Entrywise – OK– Stepwise – excluded age

• Did have a (small) effect

– Hierarchical• The change in R2 gives the best estimate

of the importance of extroversion

• Other problems with stepwise– F and df are wrong (cheats with df)– Unstable results

• Small changes (sampling variance) – large differences in models

316

– Uses a lot of paper– Don’t use a stepwise procedure to

pack your suitcase

317

Is Stepwise Always Evil?• Yes• All right, no• Research goal is predictive

(technological)– Not explanatory (scientific)– What happens, not why

• N is large – 40 people per predictor, Cohen, Cohen,

Aiken, West (2003)• Cross validation takes place

318

A quick note on R2

R2 is sometimes regarded as the ‘fit’ of a regression model– Bad idea

• If good fit is required – maximise R2

– Leads to entering variables which do not make theoretical sense

319

Critique of Multiple Regression

• Goertzel (2002)– “Myths of murder and multiple

regression”– Skeptical Inquirer (Paper B1)

• Econometrics and regression are ‘junk science’– Multiple regression models (in US)– Used to guide social policy

320

More Guns, Less Crime

– (controlling for other factors)• Lott and Mustard: A 1% increase in

gun ownership– 3.3% decrease in murder rates

• But: – More guns in rural Southern US– More crime in urban North (crack

cocaine epidemic at time of data)

321

Executions Cut Crime

• No difference between crimes in states in US with or without death penalty

• Ehrlich (1975) controlled all variables that affect crime rates– Death penalty had an effect in reducing crime rate

• No statistical way to decide who’s right

322

Legalised Abortion

• Donohue and Levitt (1999)– Legalised abortion in 1970’s cut crime in

1990’s

• Lott and Whitley (2001)– “Legalising abortion decreased murder

rates by … 0.5 to 7 per cent.”

• It’s impossible to model these data– Controlling for other historical events– Crack cocaine (again)

323

Another Critique

• Berk (2003)– Regression analysis: a constructive critique

(Sage)

• Three cheers for regression– As a descriptive technique

• Two cheers for regression– As an inferential technique

• One cheer for regression– As a causal analysis

324

Is Regression Useless?

• Do regression carefully– Don’t go beyond data which you have

a strong theoretical understanding of

• Validate models– Where possible, validate predictive

power of models in other areas, times, groups• Particularly important with stepwise

325

Lesson 7: Categorical Independent Variables

326

Introduction

327

Introduction

• So far, just looked at continuous independent variables

• Also possible to use categorical (nominal, qualitative) independent variables– e.g. Sex; Job; Religion; Region; Type

(of anything)• Usually analysed with

t-test/ANOVA

328

Historical Note• But these (t-test/ANOVA) are

special cases of regression analysis– Aspects of General Linear Models

(GLMs)• So why treat them differently?

– Fisher’s fault– Computers’ fault

• Regression, as we have seen, is computationally difficult– Matrix inversion and multiplication– Unfeasible, without a computer

329

• In the special cases where:• You have one categorical IV• Your IVs are uncorrelated

– It is much easier to do it by partitioning of sums of squares

• These cases– Very rare in ‘applied’ research– Very common in ‘experimental’

research• Fisher worked at Rothamsted agricultural

research station• Never have problems manipulating

wheat, pigs, cabbages, etc

330

• In psychology– Led to a split between ‘experimental’

psychologists and ‘correlational’ psychologists

– Experimental psychologists (until recently) would not think in terms of continuous variables

• Still (too) common to dichotomise a variable– Too difficult to analyse it properly– Equivalent to discarding 1/3 of your

data

331

The Approach

332

The Approach

• Recode the nominal variable – Into one, or more, variables to represent

that variable

• Names are slightly confusing– Some texts talk of ‘dummy coding’ to refer

to all of these techniques– Some (most) refer to ‘dummy coding’ to

refer to one of them– Most have more than one name

333

• If a variable has g possible categories it is represented by g-1 variables

• Simplest case: – Smokes: Yes or No– Variable 1 represents ‘Yes’– Variable 2 is redundant

• If it isn’t yes, it’s no

334

The Techniques

335

• We will examine two coding schemes– Dummy coding

• For two groups• For >2 groups

– Effect coding• For >2 groups

• Look at analysis of change– Equivalent to ANCOVA– Pretest-posttest designs

336

Dummy Coding – 2 Groups• Also called simple coding by SPSS• A categorical variable with two groups• One group chosen as a reference group

– The other group is represented in a variable

• e.g. 2 groups: Experimental (Group 1) and Control (Group 0)– Control is the reference group– Dummy variable represents experimental

group• Call this variable ‘group1’

337

• For variable 'group1' – 1 = 'Yes', 0 = 'No'

Original Category

New Variable

Exp 1Con 0

338

• Some data• Group is x, score is y

Control Group

Experimental Group

Experiment 1 10 10Experiment 2 10 20Experiment 3 10 30

339

• Control Group = 0– Intercept = Score on Y when x = 0– Intercept = mean of control group

• Experimental Group = 1– b = change in Y when x increases 1

unit– b = difference between experimental

group and control group
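A minimal sketch (Python with numpy; the individual scores are invented so that the group means match Experiment 3 in the table) showing that with 0/1 dummy coding the intercept recovers the control-group mean and the slope recovers the group difference:

import numpy as np

y = np.array([8.0, 12.0, 28.0, 32.0])        # control scores (mean 10), experimental scores (mean 30)
group1 = np.array([0, 0, 1, 1])              # dummy variable: 0 = control, 1 = experimental

slope, intercept = np.polyfit(group1, y, 1)  # OLS fit of y on the dummy variable
print(intercept)    # 10.0 -> mean of the control group
print(slope)        # 20.0 -> experimental mean minus control mean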

340

0

5

10

15

20

25

30

35

Control Group Experimental Group

Experiment 1 Experiment 2 Experiment 3

Gradient of slope represents

difference between means

341

Dummy Coding – 3+ Groups

• With three groups the approach is the similar

• g = 3, therefore g-1 = 2 variables needed

• 3 Groups– Control – Experimental Group 1– Experimental Group 2

342

• Recoded into two variables– Note – do not need a 3rd variable

• If we are not in group 1 or group 2 MUST be in control group

• 3rd variable would add no information• (What would happen to determinant?)

Original category   Gp1   Gp2
  Con                0     0
  Gp1                1     0
  Gp2                0     1

343

• F and associated p
  – Tests H0 that mean(g1) = mean(g2) = mean(g3)
• b1 and b2 and associated p-values
  – Test the difference between each experimental group and the control group
• To test the difference between the experimental groups
  – Need to rerun the analysis

344

• One more complication– Have now run multiple comparisons– Increases α – i.e. the probability of a type I error

• Need to correct for this– Bonferroni correction– Multiply given p-values by two/three

(depending how many comparisons were made)

345

Effect Coding

• Usually used for 3+ groups• Compares each group (except the

reference group) to the mean of all groups– Dummy coding compares each group to the

reference group.

• Example with 5 groups– 1 group selected as reference group

• Group 5

346

• Each group (except reference) has a variable– 1 if the individual is in that group– 0 if not– -1 if in reference group

group   group_1   group_2   group_3   group_4
  1        1         0         0         0
  2        0         1         0         0
  3        0         0         1         0
  4        0         0         0         1
  5       -1        -1        -1        -1

347

Examples

• Dummy coding and Effect Coding• Group 1 chosen as reference group

each time
• Data:
  Group   Mean    SD
  1       52.40   4.60
  2       56.30   5.70
  3       60.10   5.00
  Total   56.27   5.88

348

• Dummy
  Group   dummy2   dummy3
  1         0        0
  2         1        0
  3         0        1

• Effect
  Group   effect2   effect3
  1        -1        -1
  2         1         0
  3         0         1

349

• Dummy: R = 0.543, F = 5.7, df = 2, 27, p = 0.009
  b0 = 52.4
  b1 = 3.9, p = 0.100
  b2 = 7.7, p = 0.002

• Effect: R = 0.543, F = 5.7, df = 2, 27, p = 0.009
  b0 = 56.27
  b1 = 0.03, p = 0.980
  b2 = 3.8, p = 0.007

Dummy coding:
  b0 = mean(g1)
  b1 = mean(g2) – mean(g1)
  b2 = mean(g3) – mean(g1)

Effect coding:
  b0 = G (the grand mean)
  b1 = mean(g2) – G
  b2 = mean(g3) – G
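A minimal sketch (Python with numpy; the scores are simulated, so the estimates will only approximately match the slide) showing that the two coding schemes reproduce these interpretations:

import numpy as np

rng = np.random.default_rng(1)
g = np.repeat([1, 2, 3], 10)                         # group membership
y = np.concatenate([rng.normal(52.4, 4.6, 10),       # group means near 52.4, 56.3, 60.1
                    rng.normal(56.3, 5.7, 10),
                    rng.normal(60.1, 5.0, 10)])

# dummy coding: group 1 is the reference group
X_dummy = np.column_stack([np.ones_like(y), g == 2, g == 3]).astype(float)
b_dummy, *_ = np.linalg.lstsq(X_dummy, y, rcond=None)
# b_dummy[0] ~ mean(g1); b_dummy[1] ~ mean(g2) - mean(g1); b_dummy[2] ~ mean(g3) - mean(g1)

# effect coding: group 1 scores -1 on both coded variables
e2 = np.where(g == 2, 1.0, np.where(g == 1, -1.0, 0.0))
e3 = np.where(g == 3, 1.0, np.where(g == 1, -1.0, 0.0))
b_effect, *_ = np.linalg.lstsq(np.column_stack([np.ones_like(y), e2, e3]), y, rcond=None)
# b_effect[0] ~ grand mean; b_effect[1] ~ mean(g2) - grand mean; b_effect[2] ~ mean(g3) - grand mean

print(np.round(b_dummy, 2), np.round(b_effect, 2))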

350

In SPSS• SPSS provides two equivalent

procedures for regression– Regression (which we have been using)– GLM (which we haven’t)

• GLM will:– Automatically code categorical variables– Automatically calculate interaction terms

• GLM won’t:– Give standardised effects– Give hierarchical R2 p-values– Allow you to not understand

351

ANCOVA and Regression

352

• Test– (Which is a trick; but it’s designed to

make you think about it)

• Use employee data.sav– Compare the pay rise (difference

between salbegin and salary)– For ethnic minority and non-minority

staff• What do you find?

353

ANCOVA and Regression

• Dummy coding approach has one special use– In ANCOVA, for the analysis of change

• Pre-test post-test experimental design– Control group and (one or more)

experimental groups– Tempting to use difference score + t-test /

mixed design ANOVA– Inappropriate

354

• Salivary cortisol levels– Used as a measure of stress– Not absolute level, but change in level

over day may be interesting

• Test at: 9.00am, 9.00pm• Two groups

– High stress group (cancer biopsy) • Group 1

– Low stress group (no biopsy)• Group 0

355

• Correlation of AM and PM = 0.493 (p=0.008)

• Has there been a significant difference in the rate of change of salivary cortisol?– 3 different approaches

              AM     PM     Diff
High Stress   20.1    6.8   13.3
Low Stress    22.3   11.8   10.5

356

• Approach 1 – find the differences, do a t-test– t = 1.31, df=26, p=0.203

• Approach 2 – mixed ANOVA, look for interaction effect– F = 1.71, df = 1, 26, p = 0.203– F = t2

• Approach 3 – regression (ANCOVA) based approach

357

– IVs: AM and group– DV: PM

– b1 (group) = 3.59, standardised b1=0.432, p = 0.01

• Why is the regression approach better?– The other two approaches took the

difference– Assumes that r = 1.00– Any difference from r = 1.00 and you add

error variance• Subtracting error is the same as adding error
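A hedged sketch (Python; the cortisol values are simulated, so the numbers are illustrative only) of approach 1 (difference score plus t-test) against approach 3 (ANCOVA as regression):

import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
n = 14                                                 # cases per group
group = np.repeat([0, 1], n)                           # 0 = low stress, 1 = high stress
am = rng.normal(21, 4, 2 * n)                          # 9.00am cortisol
pm = 0.5 * am - 4 * group + rng.normal(0, 3, 2 * n)    # 9.00pm cortisol; the group lowers PM level

# Approach 1: take the difference, then a t-test
diff = am - pm
t, p_t = stats.ttest_ind(diff[group == 1], diff[group == 0])

# Approach 3: ANCOVA as regression -- DV is PM, IVs are AM and group
X = np.column_stack([np.ones(2 * n), am, group])
b, *_ = np.linalg.lstsq(X, pm, rcond=None)
print(t, p_t)     # difference-score test
print(b[2])       # adjusted group difference in PM, controlling for AM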

358

• Using regression– Ensures that all the variance that is

subtracted is true– Reduces the error variance

• Two effects– Adjusts the means

• Compensates for differences between groups

– Removes error variance

359

In SPSS• SPSS automates all of this

– But you have to understand it, to know what it is doing

• Use Analyse, GLM, Univariate ANOVA

360

Outcome here

Categorical predictors here

Continuous predictors here

Click options

361

Select parameter estimates

362

More on Change

• If difference score is correlated with either pre-test or post-test– Subtraction fails to remove the

difference between the scores– If two scores are uncorrelated

• Difference will be correlated with both• Failure to control

– Equal SDs, r = 0• Correlation of change and pre-score

=0.707

363

Even More on Change

• A topic of surprising complexity– What I said about difference scores

isn’t always true• Lord’s paradox – it depends on the

precise question you want to answer

– Collins and Horn (1993). Best methods for the analysis of change

– Collins and Sayer (2001). New methods for the analysis of change.

364

Lesson 8: Assumptions in Regression Analysis

365

The Assumptions1. The distribution of residuals is normal (at

each value of the dependent variable).2. The variance of the residuals for every

set of values for the independent variable is equal.

• violation is called heteroscedasticity.3. The error term is additive

• no interactions.

4. At every value of the dependent variable the expected (mean) value of the residuals is zero

• No non-linear relationships

366

5. The expected correlation between residuals, for any two cases, is 0.

• The independence assumption (lack of autocorrelation)

6. All independent variables are uncorrelated with the error term.

7. No independent variables are a perfect linear function of other independent variables (no perfect multicollinearity)

8. The mean of the error term is zero.

367

What are we going to do …

• Deal with some of these assumptions in some detail

• Deal with others in passing only– look at them again later on

368

Assumption 1: The Distribution of Residuals is Normal at Every Value of the Dependent Variable

369

Look at Normal Distributions

• A normal distribution– symmetrical, bell-shaped (so they

say)

370

What can go wrong?

• Skew– non-symmetricality– one tail longer than the other

• Kurtosis– too flat or too peaked– kurtosed

• Outliers– Individual cases which are far from the

distribution

371

Effects on the Mean

• Skew– biases the mean, in direction of skew

• Kurtosis– mean not biased– standard deviation is– and hence standard errors, and

significance tests

372

Examining Univariate Distributions

• Histograms• Boxplots• P-P plots• Calculation based methods

373

Histograms

• A and B [histograms]

374

• C and D [histograms]

375

• E & F [histograms]

376


Histograms can be tricky ….

377

Boxplots

378

P-P Plots

• A & B [P-P plots]

379

• C & D [P-P plots]

380

• E & F [P-P plots]

381

• Skew and Kurtosis statistics• Outlier detection statistics

Calculation Based

382

Skew and Kurtosis Statistics

• Normal distribution– skew = 0– kurtosis = 0

• Two methods for calculation– Fisher’s and Pearson’s– Very similar answers

• Associated standard error– can be used for significance of departure from

normality– not actually very useful

• With N above about 400 the tests are almost never non-significant – the data never count as "normal"

383

     Skewness   SE Skew   Kurtosis   SE Kurt
A    -0.12      0.172     -0.084     0.342
B     0.271     0.172      0.265     0.342
C     0.454     0.172      1.885     0.342
D     0.117     0.172     -1.081     0.342
E     2.106     0.172      5.75      0.342
F     0.171     0.172     -0.21      0.342
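A small sketch (Python with scipy; the data are simulated and N = 200 is an assumption, so the numbers will not match the table) of how these statistics and their conventional standard errors can be computed:

import numpy as np
from scipy import stats

rng = np.random.default_rng(42)
x = rng.gamma(shape=2.0, scale=1.0, size=200)      # a positively skewed example variable

skew = stats.skew(x)                               # skewness (0 for a normal distribution)
kurt = stats.kurtosis(x)                           # excess kurtosis (0 for a normal distribution)

n = len(x)
se_skew = np.sqrt(6 * n * (n - 1) / ((n - 2) * (n + 1) * (n + 3)))     # usual SE of skewness
se_kurt = 2 * se_skew * np.sqrt((n**2 - 1) / ((n - 3) * (n + 5)))      # usual SE of kurtosis

print(round(skew, 3), round(se_skew, 3), round(kurt, 3), round(se_kurt, 3))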

384

Outlier Detection

• Calculate distance from mean– z-score (number of standard deviations)– deleted z-score

• that case biased the mean, so remove it

– Look up expected distance from mean• 1% 3+ SDs

• Calculate influence – how much effect did that case have on the

mean?

385

Non-Normality in Regression

386

Effects on OLS Estimates

• The mean is an OLS estimate• The regression line is an OLS

estimate• Lack of normality

– biases the position of the regression slope

– makes the standard errors wrong• probability values attached to statistical

significance wrong

387

Checks on Normality

• Check residuals are normally distributed– SPSS will draw histogram and p-p plot

of residuals

• Use regression diagnostics– Lots of them– Most aren’t very interesting

388

Regression Diagnostics

• Residuals– standardised, unstandardised, studentised,

deleted, studentised-deleted– look for cases > |3| (?)

• Influence statistics– Look for the effect a case has– If we remove that case, do we get a

different answer?– DFBeta, Standardised DFBeta

• changes in b

389

– DfFit, Standardised DfFit• change in predicted value

– Covariance ratio• Ratio of the determinants of the

covariance matrices, with and without the case

• Distances– measures of ‘distance’ from the

centroid– some include IV, some don’t

390

More on Residuals

• Residuals are trickier than you might have imagined

• Raw residuals– OK

• Standardised residuals – Residuals divided by SD

se = √( Σe² / (n – k – 1) )

391

Leverage

• But– That SD is wrong– Variance of the residuals is not equal

• Those further from the centroid on the predictors have higher variance

• Need a measure of this

• Distance from the centroid is leverage, or h (or sometimes hii)

• One predictor– Easy

392

• Minimum hi is 1/n, the maximum is 1

• Except– SPSS uses standardised leverage - h*

• It doesn’t tell you this, it just uses it

hi = 1/n + (xi – x̄)² / Σ(x – x̄)²

393

• Minimum 0, maximum (N – 1)/N

h*i = hi – 1/n = (xi – x̄)² / Σ(x – x̄)²

394

• Multiple predictors– Calculate the hat matrix (H)– Leverage values are the diagonals of

this matrix

– Where X is the augmented matrix of predictors (i.e. matrix that includes the constant)

– Hence leverage hii – element ii of H

H = X (X'X)^-1 X'

395

• Example of calculation of hat matrix

[Worked example: X is the augmented matrix (a column of 1s plus the predictor values); H = X (X'X)^-1 X' is then a square matrix, and its diagonal elements (e.g. 0.318, 0.273, 0.236, …) are the leverages.]
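A minimal numpy sketch (the x values are invented for illustration) of the hat-matrix calculation, checked against the one-predictor leverage formula given above:

import numpy as np

x = np.array([15, 20, 25, 30, 35, 40, 45, 50, 55, 60, 65], dtype=float)   # illustrative predictor
X = np.column_stack([np.ones_like(x), x])       # augmented matrix: constant plus predictor

H = X @ np.linalg.inv(X.T @ X) @ X.T            # hat matrix
h = np.diag(H)                                  # leverages h_ii

# the one-predictor formula gives the same values
h_check = 1 / len(x) + (x - x.mean())**2 / np.sum((x - x.mean())**2)
print(np.allclose(h, h_check))                  # True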

396

Standardised / Studentised

• Now we can calculate the standardised residuals – SPSS calls them studentised residuals– Also called internally studentised

residuals

e'i = ei / ( se × √(1 – hi) )

397

Deleted Studentised Residuals

• Studentised residuals do not have a known distribution– Cannot use them for inference

• Deleted studentised residuals– Externally studentised residuals– Jackknifed residuals

• Distributed as t• With df = N – k – 1
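A small follow-on sketch (numpy; simulated y on an illustrative x, and the leave-one-out variance formula is the usual one) computing internally and externally studentised residuals:

import numpy as np

rng = np.random.default_rng(0)
x = np.linspace(15, 65, 11)                       # illustrative predictor
X = np.column_stack([np.ones_like(x), x])         # constant plus predictor
y = 2 + 0.5 * x + rng.normal(0, 1, len(x))        # illustrative outcome

h = np.diag(X @ np.linalg.inv(X.T @ X) @ X.T)     # leverages
b, *_ = np.linalg.lstsq(X, y, rcond=None)
e = y - X @ b                                     # raw residuals
n, k = len(y), X.shape[1] - 1                     # k = number of predictors

s_e = np.sqrt(np.sum(e**2) / (n - k - 1))
r_internal = e / (s_e * np.sqrt(1 - h))           # internally studentised (SPSS's 'studentised')

# externally studentised (deleted): the residual variance is re-estimated without each case
s2_deleted = (np.sum(e**2) - e**2 / (1 - h)) / (n - k - 2)
r_external = e / np.sqrt(s2_deleted * (1 - h))    # follows a t distribution
print(np.round(r_internal, 2), np.round(r_external, 2))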

398

Testing Significance

• We can calculate the probability of a residual – Is it sampled from the same

population

• BUT– Massive type I error rate– Bonferroni correct it

• Multiply p value by N

399

Bivariate Normality

• We didn’t just say “residuals normally distributed”

• We said “at every value of the dependent variables”

• Two variables can be normally distributed – univariate,– but not bivariate

400

• Couple’s IQs– male and female

[Histograms of FEMALE and MALE IQ scores]

–Seem reasonably normal

401

• But wait!!

[Scatterplot of MALE IQ against FEMALE IQ]

402

• When we look at bivariate normality– not normal – there is an outlier

• So plot X against Y• OK for bivariate

– but – may be a multivariate outlier– Need to draw graph in 3+ dimensions– can’t draw a graph in 3 dimensions

• But we can look at the residuals instead …

403

• IQ: histogram of the residuals [figure]

404

Multivariate Outliers …

• Will be explored later in the exercises

• So we move on …

405

What to do about Non-Normality

• Skew and Kurtosis– Skew – much easier to deal with– Kurtosis – less serious anyway

• Transform data– removes skew– positive skew – log transform– negative skew - square

406

Transformation

• May need to transform IV and/or DV– More often DV

• time, income, symptoms (e.g. depression) all positively skewed

– can cause non-linear effects (more later) if only one is transformed

– alters interpretation of unstandardised parameter

– May alter meaning of variable– May add / remove non-linear and moderator

effects

407

• Change measures– increase sensitivity at ranges

• avoiding floor and ceiling effects

• Outliers– Can be tricky– Why did the outlier occur?

• Error? Delete them.• Weird person? Probably delete them• Normal person? Tricky.

408

– You are trying to model a process• is the data point ‘outside’ the process• e.g. lottery winners, when looking at

salary• yawn, when looking at reaction time

– Which is better?• A good model, which explains 99% of

your data?• A poor model, which explains all of it

• Pedhazur and Schmelkin (1991)– analyse the data twice

409

• We will spend much less time on the other 6 assumptions

• Can do exercise 8.1.

410

Assumption 2: The variance of the residuals for every

set of values for the independent variable is

equal.

411

Heteroscedasticity

• This assumption is about heteroscedasticity of the residuals– Hetero = different– Scedastic = scattered

• We don’t want heteroscedasticity– we want our data to be

homoscedastic

• Draw a scatterplot to investigate

412

[Scatterplot of MALE IQ against FEMALE IQ]

413

• Only works with one IV– need every combination of IVs

• Easy to get – use predicted values– use residuals there

• Plot predicted values against residuals– or standardised residuals– or deleted residuals– or standardised deleted residuals– or studentised residuals

• A bit like turning the scatterplot on its side

414

Good – no heteroscedasticity

[Plot of residuals against predicted values]

415

Bad – heteroscedasticity

[Plot of residuals against predicted values]

416

Testing Heteroscedasticity

• White’s test– Not automatic in SPSS (is in SAS)– Luckily, not hard to do1. Do regression, save residuals.2. Square residuals3. Square IVs4. Calculate interactions of IVs

– e.g. x1•x2, x1•x3, x2 • x3

417

5. Run regression using – squared residuals as DV– IVs, squared IVs, and interactions as IVs

6. Test statistic = N x R2

– Distributed as χ2

– Df = k (for second regression)

• Use education and salbegin to predict salary (employee data.sav)

– R2 = 0.113, N = 474, χ2 = 53.5, df = 5, p < 0.0001
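A hand-rolled version of these steps (Python/numpy, with scipy for the chi-square p-value; the data are simulated with deliberately heteroscedastic errors, so the statistics will not match the employee-data example):

import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
n = 474
x1 = rng.normal(13, 3, n)                                   # e.g. years of education
x2 = rng.normal(17000, 8000, n)                             # e.g. beginning salary
y = 1000 + 1500 * x1 + 1.5 * x2 + rng.normal(0, 1, n) * (500 + 0.2 * np.abs(x2))

def fit_r2(X, y):
    """OLS fit; returns R-squared and the residuals."""
    X = np.column_stack([np.ones(len(y)), X])
    b, *_ = np.linalg.lstsq(X, y, rcond=None)
    e = y - X @ b
    return 1 - e.var() / y.var(), e

# steps 1-2: original regression, then square the residuals
_, e = fit_r2(np.column_stack([x1, x2]), y)
e2 = e**2

# steps 3-5: regress the squared residuals on the IVs, squared IVs and their interaction
Z = np.column_stack([x1, x2, x1**2, x2**2, x1 * x2])
r2_aux, _ = fit_r2(Z, e2)

white = n * r2_aux                                          # step 6: N x R2
print(white, stats.chi2.sf(white, df=Z.shape[1]))           # chi-square test, df = k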

418

Plot of Pred and Res

[Plot of regression standardized residuals against regression standardized predicted values]

419

Magnitude of Heteroscedasticity

• Chop data into “slices”– 5 slices, based on X (or predicted

score)• Done in SPSS

– Calculate variance of each slice– Check ratio of smallest to largest– Less than 10:1

• OK

420

The Visual Bander• New in SPSS 12

421

• Variances of the 5 groups

• We have a problem– 3 / 0.2 ~= 15

422

Dealing with Heteroscedasticity

• Use Huber-White estimates– Very easy in Stata– Fiddly in SPSS – bit of a hack

• Use Complex samples1. Create a new variable where all

cases are equal to 1, call it const2. Use Complex Samples, Prepare for

Analysis3. Create a plan file

423

4. Sample weight is const5. Finish6. Use Complex Samples, GLM7. Use plan file created, and set up

model as in GLM(More on complex samples later)

In Stata, do regression as normal, and click “robust”.

424

Heteroscedasticity – Implications and Meanings

Implications• What happens as a result of

heteroscedasticity?– Parameter estimates are correct

• not biased

– Standard errors (hence p-values) are incorrect

425

However …

• If there is no skew in predicted scores– P-values a tiny bit wrong

• If skewed,– P-values very wrong

• Can do exercise

426

Meaning• What is heteroscedasticity trying to

tell us?– Our model is wrong – it is misspecified– Something important is happening

that we have not accounted for

• e.g. amount of money given to charity (given)– depends on:

• earnings • degree of importance person assigns to

the charity (import)

427

• Do the regression analysis– R2 = 0.60, F=31.4, df=2, 37, p < 0.001

• seems quite good

– b0 = 0.24, p=0.97

– b1 = 0.71, p < 0.001

– b2 = 0.23, p = 0.031

• White’s test– 2 = 18.6, df=5, p=0.002

• The plot of predicted values against residuals …

428

• Plot shows heteroscedastic relationship

429

• Which means …– the effects of the variables are not

additive – If you think that what a charity does

is important• you might give more money• how much more depends on how much

money you have

430

[Scatterplot of GIVEN against IMPORT, with separate points for High and Low Earnings]

431

• One more thing about heteroscedasticity

– it is the equivalent of homogeneity of variance in ANOVA/t-tests

432

Assumption 3: The Error Term is Additive

433

Additivity

• What heteroscedasticity shows you– effects of variables need to be additive

• Heteroscedasticity doesn’t always show it to you– can test for it, but hard work– (same as homogeneity of covariance

assumption in ANCOVA)

• Have to know it from your theory• A specification error

434

Additivity and Theory• Two IVs

– Alcohol has sedative effect• A bit makes you a bit tired• A lot makes you very tired

– Some painkillers have sedative effect• A bit makes you a bit tired• A lot makes you very tired

– A bit of alcohol and a bit of painkiller doesn’t make you very tired

– Effects multiply together, don’t add together

435

• If you don’t test for it– It’s very hard to know that it will

happen

• So many possible non-additive effects– Cannot test for all of them– Can test for obvious

• In medicine– Choose to test for salient non-additive

effects– e.g. sex, race

436

Assumption 4: At every value of the dependent variable the expected (mean) value of the

residuals is zero

437

Linearity

• Relationships between variables should be linear – best represented by a straight line

• Not a very common problem in social sciences– except economics– measures are not sufficiently accurate to

make a difference • R2 too low• unlike, say, physics

438

• Relationship between speed of travel and fuel used

[Plot of Fuel used against Speed]

439

• R2 = 0.938– looks pretty good– know speed, make a good prediction

of fuel

• BUT– look at the chart– if we know speed we can make a

perfect prediction of fuel used– R2 should be 1.00

440

Detecting Non-Linearity

• Residual plot– just like heteroscedasticity

• Using this example– very, very obvious– usually pretty obvious

441

Residual plot

442

Linearity: A Case of Additivity

• Linearity = additivity along the range of the IV

• Jeremy rides his bicycle harder– Increase in speed depends on current speed– Not additive, multiplicative– MacCallum and Mar (1995). Distinguishing

between moderator and quadratic effects in multiple regression. Psychological Bulletin.

443

Assumption 5: The expected correlation between

residuals, for any two cases, is 0.

The independence assumption (lack of autocorrelation)

444

Independence Assumption

• Also: lack of autocorrelation• Tricky one

– often ignored– exists for almost all tests

• All cases should be independent of one another– knowing the value of one case should not

tell you anything about the value of other cases

445

How is it Detected?

• Can be difficult– need some clever statistics

(multilevel models)

• Better off avoiding situations where it arises

• Residual Plots• Durbin-Watson Test

446

Residual Plots

• Were data collected in time order?– If so plot ID number against the

residuals– Look for any pattern

• Test for linear relationship• Non-linear relationship• Heteroscedasticity

447

[Plot of residuals against Participant Number]

448

How does it arise?

Two main ways• time-series analyses

– When cases are time periods• weather on Tuesday and weather on Wednesday

correlated• inflation 1972, inflation 1973 are correlated

• clusters of cases– patients treated by three doctors– children from different classes– people assessed in groups

449

Why does it matter?• Standard errors can be wrong

– therefore significance tests can be wrong

• Parameter estimates can be wrong– really, really wrong– from positive to negative

• An example– students do an exam (on statistics)– choose one of three questions

• IV: time• DV: grade

450

[Scatterplot of Grade against Time]

•Result, with line of best fit

451

• Result shows that– people who spent longer in the exam,

achieve better grades

• BUT …– we haven’t considered which question

people answered– we might have violated the

independence assumption• DV will be autocorrelated

• Look again– with questions marked

452

• Now somewhat different

[Scatterplot of Grade against Time, with the points marked by Question (1, 2, 3)]

453

• Now, people that spent longer got lower grades– questions differed in difficulty– do a hard one, get better grade– if you can do it, you can do it quickly

• Very difficult to analyse well– need multilevel models

454

Durbin Watson Test

• Not well implemented in SPSS• Depends on the order of the data

– Reorder the data, get a different result

• Doesn’t give statistical significance of the test

455

Assumption 6: All independent variables are

uncorrelated with the error term.

456

Uncorrelated with the Error Term

• A curious assumption– by definition, the residuals are

uncorrelated with the independent variables (try it and see, if you like)

• It is really an assumption about the DV– the DV (once the IVs' effects have been removed) must have no influence back on the IVs

457

• Problem in economics– Demand increases supply– Supply increases wages– Higher wages increase demand

• OLS estimates will be (badly) biased in this case– need a different estimation procedure– two-stage least squares

• simultaneous equation modelling

458

Assumption 7: No independent variables are a perfect linear function

of other independent variables

no perfect multicollinearity

459

No Perfect Multicollinearity

• IVs must not be linear functions of one another– matrix of correlations of IVs is not positive

definite– cannot be inverted– analysis cannot proceed

• Have seen this with – age, age start, time working– also occurs with subscale and total

460

• Large amounts of collinearity– a problem (as we shall see)

sometimes– not an assumption

461

Assumption 8: The mean of the error term is zero.

You will like this one.

462

Mean of the Error Term = 0

• Mean of the residuals = 0• That is what the constant is for

– if the mean of the error term deviates from zero, the constant soaks it up

Y = β0 + β1x1 + ε

Y = (β0 + 3) + β1x1 + (ε – 3)

- note, Greek letters because we are talking about population values

463

• Can do regression without the constant– Usually a bad idea– E.g R2 = 0.995, p < 0.001

• Looks good

464

[Scatterplot of y against x1]

465

466

Lesson 9: Issues in Regression Analysis

Things that alter the interpretation of the regression equation

467

The Four Issues

• Causality• Sample sizes• Collinearity• Measurement error

468

Causality

469

What is a Cause?

• Debate about definition of cause– some statistics (and philosophy)

books try to avoid it completely– We are not going into depth

• just going to show why it is hard

• Two dimensions of cause– Ultimate versus proximal cause– Determinate versus probabilistic

470

Proximal versus Ultimate• Why am I here?

– I walked here because – This is the location of the class

because – Eric Tanenbaum asked me because – (I don’t know)– because I was in my office when he

rang because – I am a lecturer at York because – I saw an advert in the paper because

471

– I exist because– My parents met because – My father had a job …

• Proximal cause– the direct and immediate cause of

something• Ultimate cause

– the thing that started the process off– I fell off my bicycle because of the

bump– I fell off because I was going too fast

472

Determinate versus Probabilistic Cause

• Why did I fall off my bicycle?– I was going too fast– But every time I ride too fast, I don’t

fall off– Probabilistic cause

• Why did my tyre go flat?– A nail was stuck in my tyre– Every time a nail sticks in my tyre,

the tyre goes flat– Deterministic cause

473

• Can get into trouble by mixing them together– Eating deep fried Mars Bars and doing

no exercise are causes of heart disease

– “My Grandad ate three deep fried Mars Bars every day, and the most exercise he ever got was when he walked to the shop next door to buy one”

– (Deliberately?) confusing deterministic and probabilistic causes

474

Criteria for Causation

• Association• Direction of Influence• Isolation

475

Association

• Correlation does not mean causation– we all know

• But– Causation does mean correlation

• Need to show that two things are related– may be correlation– my be regression when controlling for third

(or more) factor

476

• Relationship between price and sales– suppliers may be cunning– when people want it more

• stick the price up

           Price   Demand   Sales
  Price    1       0.6      0
  Demand   0.6     1        0.6
  Sales    0       0.6      1

– So – no relationship between price and sales

477

– Until (or course) we control for demand

– b1 (Price) = -0.56

– b2 (Demand) = 0.94

• But which variables do we enter?

478

Direction of Influence• Relationship between A and B

– three possible processes

A → B   (A causes B)
B → A   (B causes A)
A ← C → B   (C causes A and B)

479

• How do we establish the direction of influence?– Longitudinally?

Barometer drops → Storm

– Now if we could just get that barometer needle to stay where it is …

• Where the role of theory comes in (more on this later)

480

Isolation

• Isolate the dependent variable from all other influences– as experimenters try to do

• Cannot do this– can statistically isolate the effect– using multiple regression

481

Role of Theory

• Strong theory is crucial to making causal statements

• Fisher said: to make causal statements “make your theories elaborate.”– don’t rely purely on statistical analysis

• Need strong theory to guide analyses– what critics of non-experimental

research don’t understand

482

• S.J. Gould – a critic– says correlate price of petrol and his

age, for the last 10 years– find a correlation– Ha! (He says) that doesn’t mean

there is a causal link– Of course not! (We say).

• No social scientist would do that analysis without first thinking (very hard) about the possible causal relations between the variables of interest

• Would control for time, prices, etc …

483

• Atkinson, et al. (1996)– relationship between college grades

and number of hours worked– negative correlation– Need to control for other variables –

ability, intelligence

• Gould says “Most correlations are non-causal” (1982, p243)– Of course!!!!

484

I drink a lot of beer

120 non-causal correlations

16 causal relations

karaoke

jokes (about statistics)

vomit

toilet

headache

sleeping

equations (beermat)

laugh

thirsty

fried breakfast

no beer

curry

chips

falling over

lose keys

curtains closed

485

• Abelson (1995) elaborates on this– ‘method of signatures’

• A collection of correlations relating to the process– the ‘signature’ of the process

• e.g. tobacco smoking and lung cancer– can we account for all of these

findings with any other theory?

486

1. The longer a person has smoked cigarettes, the greater the risk of cancer.

2. The more cigarettes a person smokes over a given time period, the greater the risk of cancer.

3. People who stop smoking have lower cancer rates than do those who keep smoking.

4. Smoker’s cancers tend to occur in the lungs, and be of a particular type.

5. Smokers have elevated rates of other diseases.6. People who smoke cigars or pipes, and do not

usually inhale, have abnormally high rates of lip cancer.

7. Smokers of filter-tipped cigarettes have lower cancer rates than other cigarette smokers.

8. Non-smokers who live with smokers have elevated cancer rates.

(Abelson, 1995: 183-184)

487

– In addition, should be no anomalous correlations• If smokers had more fallen arches than

non-smokers, not consistent with theory

• Failure to use theory to select appropriate variables– specification error– e.g. in previous example– Predict wealth from price and sales

• increase price, price increases• Increase sales, price increases

488

• Sometimes these are indicators of the process– e.g. barometer – stopping the needle

won’t help– e.g. inflation? Indicator or cause?

489

No Causation without Experimentation

• Blatantly untrue– I don’t doubt that the sun shining

makes us warm• Why the aversion?

– Pearl (2000) says problem is no mathematical operator

– No one realised that you needed one– Until you build a robot

490

AI and Causality

• A robot needs to make judgements about causality

• Needs to have a mathematical representation of causality– Suddenly, a problem!– Doesn’t exist

• Most operators are non-directional• Causality is directional

491

Sample Sizes

“How many subjects does it take to run a regression

analysis?”

492

Introduction

• Social scientists don’t worry enough about the sample size required– “Why didn’t you get a significant result?”– “I didn’t have a large enough sample”

• Not a common answer

• More recently awareness of sample size is increasing– use too few – no point doing the research– use too many – waste their time

493

• Research funding bodies• Ethical review panels

– both become more interested in sample size calculations

• We will look at two approaches– Rules of thumb (quite quickly)– Power Analysis (more slowly)

494

Rules of Thumb

• Lots of simple rules of thumb exist– 10 cases per IV– >100 cases– Green (1991) more sophisticated

• To test significance of R2 – N = 50 + 8k• To test sig of slopes, N = 104 + k

• Rules of thumb don’t take into account all the information that we have– Power analysis does

495

Power Analysis

Introducing Power Analysis• Hypothesis test

– tells us the probability of a result of that magnitude occurring, if the null hypothesis is correct (i.e. there is no effect in the population)

• Doesn’t tell us– the probability of that result, if the

null hypothesis is false

496

• According to Cohen (1982) all null hypotheses are false– everything that might have an effect,

does have an effect• it is just that the effect is often very tiny

497

Type I Errors

• Type I error is false rejection of H0

• Probability of making a type I error

– – the significance value cut-off • usually 0.05 (by convention)

• Always this value• Not affected by

– sample size– type of test

498

Type II errors• Type II error is false acceptance of

the null hypothesis– Much, much trickier

• We think we have some idea– we almost certainly don’t

• Example– I do an experiment (random

sampling, all assumptions perfectly satisfied)

– I find p = 0.05

499

– You repeat the experiment exactly• different random sample from same

population

– What is probability you will find p < 0.05?– ………………– Another experiment, I find p = 0.01– Probability you find p < 0.05?– ………………

• Very hard to work out– not intuitive – need to understand non-central sampling

distributions (more in a minute)

500

• Probability of type II error = beta () – same as population regression

parameter (to be confusing)

• Power = 1 – Beta– Probability of getting a significant

result

501

                                        State of the World
                                        H0 True                     H0 false
Research Findings                       (no effect to be found)     (effect to be found)
H0 true (we find no effect – p > 0.05)                              Type II error, p = β
H0 false (we find an effect – p < 0.05) Type I error, p = α         power = 1 – β

502

• Four parameters in power analysis– α – prob. of Type I error– β – prob. of Type II error (power = 1 – β

)– Effect size – size of effect in

population– N

• Know any three, can calculate the fourth– Look at them one at a time

503

• Probability of Type I error– Usually set to 0.05– Somewhat arbitrary

• sometimes adjusted because of circumstances

– rarely because of power analysis

– May want to adjust it, based on power analysis

504

• β – Probability of type II error– Power (probability of finding a result) = 1 – β– Standard is 80%

• Some argue for 90%

– Implication that Type I error is 4 times more serious than type II error• adjust ratio with compromise power

analysis

505

• Effect size in the population– Most problematic to determine– Three ways1. What effect size would be useful to

find? • R2 = 0.01 - no use (probably)

2. Base it on previous research– what have other people found?

3. Use Cohen’s conventions– small R2 = 0.02– medium R2 = 0.13– large R2 = 0.26

506

– Effect size usually measured as f2

– For R2

f2 = R2 / (1 – R2)

507

– For (standardised) slopes

– Where sr2 is the contribution to the variance accounted for by the variable of interest

– i.e. sr2 = R2 (with variable) – R2 (without)• change in R2 in hierarchical regression

f2 = sr2 / (1 – R2)
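A sketch of the calculation in Python/scipy. The noncentrality convention here is an assumption – it uses Cohen's λ = f2 × (u + v + 1), whereas some programs use λ = f2 × N – so treat the numbers as illustrative:

from scipy import stats

def power_r2_test(n, k, f2, alpha=0.05):
    """Power of the overall R2 (F) test with k predictors and sample size n."""
    u, v = k, n - k - 1                       # numerator and denominator df
    lam = f2 * (u + v + 1)                    # noncentrality parameter (Cohen's convention)
    f_crit = stats.f.isf(alpha, u, v)         # critical F under the null
    return stats.ncf.sf(f_crit, u, v, lam)    # probability the noncentral F exceeds it

f2 = 0.13 / (1 - 0.13)                        # 'medium' effect (R2 = 0.13)
for n in (50, 80, 110):
    print(n, round(power_r2_test(n, k=3, f2=f2), 2))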

508

• N – the sample size– usually use other three parameters to

determine this– sometimes adjust other parameters

() based on this– e.g. You can have 50 participants. No

more.

509

Doing power analysis• With power analysis program

– SamplePower, GPower, Nquery

• With SPSS MANOVA– using non-central distribution

functions– Uses MANOVA syntax

• Relies on the fact you can do anything with MANOVA

• Paper B4

510

Underpowered Studies

• Research in the social sciences is often underpowered– Why?– See Paper B11 – “the persistence of

underpowered studies”

511

Extra Reading

• Power traditionally focuses on p values– What about CIs?– Paper B8 – “Obtaining regression

coefficients that are accurate, not simply significant”

512

Collinearity

513

Collinearity as Issue and Assumption

• Collinearity (multicollinearity) – the extent to which the independent

variables are (multiply) correlated• If R2 for any IV, using other IVs = 1.00

– perfect collinearity– variable is linear sum of other variables– regression will not proceed – (SPSS will arbitrarily throw out a variable)

514

• R2 < 1.00, but high– other problems may arise

• Four things to look at in collinearity– meaning– implications– detection– actions

515

Meaning of Collinearity

• Literally ‘co-linearity’– lying along the same line

• Perfect collinearity– when some IVs predict another– Total = S1 + S2 + S3 + S4– S1 = Total – (S2 + S3 + S4)– rare

516

• Less than perfect– when some IVs are close to predicting – correlations between IVs are high

(usually, but not always)

517

Implications

• Effects the stability of the parameter estimates– and so the standard errors of the

parameter estimates– and so the significance

• Because– shared variance, which the regression

procedure doesn’t know where to put

518

• Red cars have more accidents than other coloured cars– because of the effect of being in a red

car?– because of the kind of person that drives

a red car?• we don’t know

– No way to distinguish between these three:

Accidents = 1 × colour + 0 × person
Accidents = 0 × colour + 1 × person

Accidents = 0.5 x colour + 0.5 x person

519

• Sex differences– due to genetics?– due to upbringing?– (almost) perfect collinearity

• statistically impossible to tell

520

• When collinearity is less than perfect– increases variability of estimates

between samples– estimates are unstable– reflected in the variances, and hence

standard errors

521

Detecting Collinearity

• Look at the parameter estimates– large standardised parameter

estimates (>0.3?), which are not significant • be suspicious

• Run a series of regressions– each IV as DV– all other IVs as IVs

• for each IV

522

• Sounds like hard work?– SPSS does it for us!

• Ask for collinearity diagnostics– Tolerance – calculated for every IV

Tolerance = 1 – R2

VIF = 1 / Tolerance

– Variance Inflation Factor• its square root is the factor by which the s.e. has been inflated
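A sketch of the same diagnostics computed directly (Python/numpy; the three predictors are simulated, with x1 and x2 deliberately correlated):

import numpy as np

def tolerance_and_vif(X):
    """Tolerance and VIF for each column of X (predictors only, no constant):
    each predictor is regressed on all the others."""
    n, k = X.shape
    out = []
    for j in range(k):
        A = np.column_stack([np.ones(n), np.delete(X, j, axis=1)])
        b, *_ = np.linalg.lstsq(A, X[:, j], rcond=None)
        e = X[:, j] - A @ b
        r2 = 1 - e.var() / X[:, j].var()
        tol = 1 - r2
        out.append((tol, 1 / tol))
    return out

rng = np.random.default_rng(3)
x1 = rng.normal(size=200)
x2 = 0.9 * x1 + rng.normal(scale=0.5, size=200)     # highly correlated with x1
x3 = rng.normal(size=200)
print(tolerance_and_vif(np.column_stack([x1, x2, x3])))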

523

Actions

What you can do about collinearity“no quick fix” (Fox, 1991)

1. Get new data• avoids the problem• address the question in a different

way• e.g. find people who have been

raised as the ‘wrong’ gender• exist, but rare

• Not a very useful suggestion

524

2. Collect more data• not different data, more data• collinearity increases standard error

(se)• se decreases as N increases

• get a bigger N

3. Remove / Combine variables• If an IV correlates highly with other IVs• Not telling us much new• If you have two (or more) IVs which are

very similar• e.g. 2 measures of depression, socio-

economic status, achievement, etc

525

• sum them, average them, remove one

• Many measures• use principal components analysis to

reduce them

4. Use stepwise regression (or some flavour of it)
• See previous comments• Can be useful in a theoretical vacuum
5. Ridge regression• not very useful• behaves weirdly

526

Measurement Error

527

What is Measurement Error

• In social science, it is unlikely that we measure any variable perfectly– measurement error represents this

imperfection• We assume that we have a true

score – T

• A measure of that score– x

528

• just like a regression equation– standardise the parameters– T is the reliability

• the amount of variance in x which comes from T

• but, like a regression equation– assume that e is random and has mean of

zero– more on that later

x = T + e

529

Simple Effects of Measurement Error

• Lowers the measured correlation

– between two variables

• Real correlation– true scores (x* and y*)

• Measured correlation– measured scores (x and y)

530

[Path diagram: the true scores x* and y* are correlated (rx*y*); each is measured with error (e) as x and y, with reliabilities rxx and ryy; the observed correlation between x and y is rxy]

531

• Attenuation of correlation
  rxy = rx*y* × √(rxx × ryy)
• Attenuation-corrected correlation
  rx*y* = rxy / √(rxx × ryy)

532

• Example
  rxx = 0.7, ryy = 0.8, rxy = 0.3
  rx*y* = 0.3 / √(0.7 × 0.8) = 0.40
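The same calculation as a tiny Python function (the 0.40 simply reproduces the example above):

import math

def disattenuate(r_xy, r_xx, r_yy):
    """Attenuation-corrected (true-score) correlation."""
    return r_xy / math.sqrt(r_xx * r_yy)

print(round(disattenuate(0.3, 0.7, 0.8), 2))   # 0.40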

533

Complex Effects of Measurement Error

• Really horribly complex• Measurement error reduces

correlations– reduces estimate of – reducing one estimate

• increases others

– because of effects of control– combined with effects of suppressor

variables– exercise to examine this

534

Dealing with Measurement Error

• Attenuation correction– very dangerous– not recommended

• Avoid in the first place– use reliable measures– don’t discard information

• don’t categorise• Age: 10-20, 21-30, 31-40 …

535

Complications

• Assume measurement error is – additive– linear

• Additive– e.g. weight – people may under-report /

over-report at the extremes

• Linear– particularly the case when using proxy

variables

536

• e.g. proxy measures– Want to know effort on childcare,

count number of children• 1st child is more effort than last

– Want to know financial status, count income• 1st £10 much greater effect on financial

status than the 1000th.

537

Lesson 10: Non-Linear Analysis in Regression

538

Introduction

• Non-linear effect occurs – when the effect of one independent

variable– is not consistent across the range of

the IV

• Assumption is violated– expected value of residuals = 0– no longer the case

539

Some Examples

540

[Plot: Skill against Experience – a learning curve]

541

[Plot: Performance against Arousal – the Yerkes-Dodson law of arousal]

542

[Plot: Enthusiasm levels over a lesson on regression – from enthusiastic at 0 hours to suicidal at 3.5 hours]

543

• Learning – line changed direction once

• Yerkes-Dodson– line changed direction once

• Enthusiasm– line changed direction twice

544

Everything is Non-Linear

• Every relationship we look at is non-linear, for two reasons– Exam results cannot keep increasing

with reading more books• Linear in the range we examine

– For small departures from linearity• Cannot detect the difference• Non-parsimonious solution

545

Non-Linear Transformations

546

Bending the Line

• Non-linear regression is hard– We cheat, and linearise the data

• Do linear regression

Transformations• We need to transform the data

– rather than estimating a curved line• which would be very difficult• may not work with OLS

– we can take a straight line, and bend it– or take a curved line, and straighten it

• back to linear (OLS) regression

547

• We still do linear regression– Linear in the parameters

– Y = b1x + b2x2 + …

• Can do non-linear regression– Non-linear in the parameters

– e.g. Y = b1x + b2^x2 + … (a parameter appears in an exponent)

• Much trickier– Statistical theory either breaks down

OR becomes harder

548

• Linear transformations– multiply by a constant– add a constant– change the slope and the intercept

549

[Plot of y = x, y = 2x and y = x + 3]

550

• Linear transformations are no use– alter the slope and intercept– don’t alter the standardised

parameter estimate

• Non-linear transformation– will bend the slope– quadratic transformation

y = x2

– one change of direction

551

– Cubic transformationy = x2 + x3

– two changes of direction

552

y=0 + 0.1x + 1x2

Quadratic Transformation

553

Square Root Transformation

y = 20 + -3x + 5√x

554

Cubic Transformation


y = 3 - 4x + 2x2 - 0.2x3

555

Logarithmic Transformation

y = 1 + 0.1x + 10log(x)

556

Inverse Transformation

y = 20 -10x + 8(1/x)

557

• To estimate a non-linear regression– we don’t actually estimate anything

non-linear– we transform the x-variable to a non-

linear version– can estimate that straight line– represents the curve– we don’t bend the line, we stretch the

space around the line, and make it flat

558

Detecting Non-linearity

559

Draw a Scatterplot

• Draw a scatterplot of y plotted against x– see if it looks a bit non-linear– e.g. Anscombe’s data– e.g. Education and beginning salary

• from bank data• drawn in SPSS• with line of best fit

560

• Anscombe (1973)– constructed a set of datasets– show the importance of graphs in

regression/correlation

• For each dataset

  N                              11
  Mean of x                      9
  Mean of y                      7.5
  Equation of regression line    y = 3 + 0.5x
  Sum of squares (X – mean)      110
  Correlation coefficient        0.82
  R2                             0.67

561

562

563

564

565

A Real Example

• Starting salary and years of education– From employee data.sav

566

[Scatterplot of Beginning Salary against Educational Level (years), with the regression line: in some ranges the expected value of the error (residual) is > 0, in others it is < 0]

567

Use Residual Plot

• Scatterplot is only good for one variable– use the residual plot (that we used for

heteroscedasticity)

• Good for many variables

568

• We want– points to lie in a nice straight sausage

569

• We don’t want– a nasty bent sausage

570


• Educational level and starting salary

571

Carrying Out Non-Linear Regression

572

Linear Transformation

• Linear transformation doesn’t change– interpretation of slope– standardised slope– se, t, or p of slope– R2

• Can change– effect of a transformation

573

• Actually more complex– with some transformations can add a

constant with no effect (e.g. quadratic)

• With others does have an effect– inverse, log

• Sometimes it is necessary to add a constant– negative numbers have no square

root– 0 has no log

574

Education and Salary

Linear Regression• Saw previously that the assumption of

expected errors = 0 was violated• Anyway …

– R2 = 0.401, F=315, df = 1, 472, p < 0.001

– salbegin = -6290 + 1727 educ– Standardised

• b1 (educ) = 0.633

– Both parameters make sense

575

Non-linear Effect• Compute new variable

– quadratic– educ2 = educ2

• Add this variable to the equation– R2 = 0.585, p < 0.001– salbegin = 46263 + -6542 educ + 310

educ2

• slightly curious

– Standardised• b1 (educ) = -2.4• b2 (educ2) = 3.1

– What is going on?

576

• Collinearity– is what is going on– Correlation of educ and educ2

• r = 0.990

– Regression equation becomes difficult (impossible?) to interpret

• Need hierarchical regression– what is the change in R2

– is that change significant?– R2 (change) = 0.184, p < 0.001
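A small numpy sketch of the hierarchical step (the education and salary values are simulated, so the R2 figures will not match the employee data):

import numpy as np

rng = np.random.default_rng(1)
educ = rng.integers(8, 21, 474).astype(float)                     # years of education
salbegin = 20000 - 3000 * educ + 200 * educ**2 + rng.normal(0, 3000, 474)

def r2(X, y):
    X = np.column_stack([np.ones(len(y)), X])                     # add the constant
    b, *_ = np.linalg.lstsq(X, y, rcond=None)
    return 1 - (y - X @ b).var() / y.var()

r2_linear = r2(educ, salbegin)
r2_quadratic = r2(np.column_stack([educ, educ**2]), salbegin)
print(r2_quadratic - r2_linear)                                    # change in R2 for the quadratic term
print(np.corrcoef(educ, educ**2)[0, 1])                            # the collinearity: r close to 0.99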

577

Cubic Effect• While we are at it, let’s look at the

cubic effect– R2 (change) = 0.004, p = 0.045– 19138 + 103 e + -206 e2 + 12

e3

– Standardised:b1(e) = 0.04

b2(e2) = -2.04

b3(e3) = 2.71

578

Fourth Power• Keep going while we are ahead

– won’t run• ???

• Collinearity is the culprit– Tolerance (educ4) = 0.000005– VIF = 215555

• Matrix of correlations of IVs is not positive definite– cannot be inverted

579

Interpretation• Tricky, given that parameter

estimates are a bit nonsensical• Two methods• 1: Use R2 change

– Save predicted values• or calculate predicted values to plot line

of best fit

– Save them from equation– Plot against IV

580

[Plot of predicted Beginning Salary against Education (Years) for the linear, quadratic and cubic fits]

581

• Differentiate with respect to e• We said: s = 19138 + 103 e + -206 e2 + 12

e3

– but first we will simplify it to quadratic s = 46263 + -6542 e + 310 e2

• ds/de = -6542 + 310 × 2 × e

582

Education   Slope
  9          -962
  10         -342
  11          278
  12          898
  13         1518
  14         2138
  15         2758
  16         3378
  17         3998
  18         4618
  19         5238
  20         5858

One year of education at the higher end of the scale is worth more than one year at the lower end of the scale (MBA versus GCSE).

583

• Differentiate the cubic: s = 19138 + 103 e + -206 e2 + 12 e3
  ds/de = 103 – 2 × 206 × e + 3 × 12 × e2

• Can calculate slopes for quadratic and cubic at different values

584

Education   Slope (Quad)   Slope (Cub)
  9           -962           -689
  10          -342           -417
  11           278            -73
  12           898            343
  13          1518            831
  14          2138           1391
  15          2758           2023
  16          3378           2727
  17          3998           3503
  18          4618           4351
  19          5238           5271
  20          5858           6263

585

A Quick Note on Differentiation

• For y = xp

– dy/dx = pxp-1

• For equations such as y =b1x + b2xP

dy/dx = b1 + b2pxp-1

• y = 3x + 4x2

– dy/dx = 3 + 4 • 2x

586

• y = b1x + b2x2 + b3x3

– dy/dx = b1 + b2 • 2x + b3 • 3 • x2

• y = 4x + 5x2 + 6x3

• dy/dx = 4 + 5 • 2 • x + 6 • 3 • x2

• Many functions are simple to differentiate– Not all though

587

Automatic Differentiation

• If you – Don’t know how to differentiate– Can’t be bothered to look up the

function

• Can use automatic differentiation software– e.g. GRAD (freeware)
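If you would rather not differentiate by hand, a symbolic algebra package gets you to the same derivatives. A sketch using Python's sympy (one option among several; GRAD itself works differently):

import sympy as sp

e = sp.symbols('e')
s = 19138 + 103*e - 206*e**2 + 12*e**3            # the cubic equation from the earlier slides
ds_de = sp.diff(s, e)                             # 36*e**2 - 412*e + 103
print(ds_de)
print([ds_de.subs(e, v) for v in range(9, 21)])   # slopes at 9 to 20 years of education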

588

589

Lesson 11: Logistic Regression

Dichotomous/Nominal Dependent Variables

590

Introduction

• Often in social sciences, we have a dichotomous/nominal DV– we will look at dichotomous first, then a

quick look at multinomial

• Dichotomous DV• e.g.

– guilty/not guilty– pass/fail– won/lost– Alive/dead (used in medicine)

591

Why Won’t OLS Do?

592

Example: Passing a Test

• Test for bus drivers– pass/fail– we might be interested in degrees of pass fail

• a company which trains them will not• fail means ‘pay for them to take it again’

• Develop a selection procedure– Two predictor variables– Score – Score on an aptitude test– Exp – Relevant prior experience (months)

593

• 1st ten cases
  Score   Exp   Pass
  5        6     0
  1       15     0
  1       12     0
  4        6     0
  1       15     1
  1        6     0
  4       16     1
  1       10     1
  3       12     0
  4       26     1

594

• DV – pass (1 = Yes, 0 = No)

• Just consider score first– Carry out regression– Score as IV, Pass as DV– R2 = 0.097, F = 4.1, df = 1, 48, p =

0.028.

– b0 = 0.190

– b1 = 0.110, p=0.028• Seems OK

595

• Or does it? …• 1st Problem – pp plot of residuals

[P-P plot of expected against observed cumulative probability for the residuals]

596

• 2nd problem - residual plot

597

• Problems 1 and 2– strange distributions of residuals– parameter estimates may be wrong– standard errors will certainly be

wrong

598

• 3rd problem – interpretation– I score 2 on aptitude. – Pass = 0.190 + 0.110 × 2 = 0.41– I score 8 on the test– Pass = 0.190 + 0.110 × 8 = 1.07

• Seems OK, but– What does it mean?– Cannot score 0.41 or 1.07

• can only score 0 or 1

• Cannot be interpreted– need a different approach

599

A Different ApproachLogistic Regression

600

Logit Transformation

• In lesson 10, transformed IVs– now transform the DV

• Need a transformation which gives us– graduated scores (between 0 and 1)– No upper limit

• we can’t predict someone will pass twice

– No lower limit• you can’t do worse than fail

601

Step 1: Convert to Probability

• First, stop talking about values– talk about probability– for each value of score, calculate

probability of pass

• Solves the problem of graduated scales

602

Score            1     2     3     4     5
Fail    N        7     5     6     4     2
        P        0.7   0.5   0.6   0.4   0.2
Pass    N        3     5     4     6     8
        P        0.3   0.5   0.4   0.6   0.8

The probability of failure given a score of 1 is 0.7; the probability of passing given a score of 5 is 0.8.

603

This is better• Now a score of 0.41 has a meaning

– a 0.41 probability of pass

• But a score of 1.07 has no meaning– cannot have a probability > 1 (or < 0)– Need another transformation

604

Step 2: Convert to Odds-Ratio

Need to remove upper limit• Convert to odds• Odds, as used by betting shops

– 5:1, 1:2• Slightly different from odds in speech

– a 1 in 2 chance– odds are 1:1 (evens)– 50%

605

• Odds ratio = (number of times it happened) / (number of times it didn’t happen)

odds ratio = p(event) / p(not event) = p(event) / (1 – p(event))

• p = 0.8: odds = 0.8 / 0.2 = 4

– equivalent to 4:1 (odds on)

– 4 times out of five

• p = 0.2: odds = 0.2 / 0.8 = 0.25

– equivalent to 1:4 (4:1 against)

– 1 time out of five

607

• Now we have solved the upper bound problem– we can interpret 1.07, 2.07,

1000000.07

• But we still have the zero problem

– we cannot interpret predicted scores less than zero

608

Step 3: The Log

• Log10 of a number (x) is the power that 10 must be raised to, to give x: 10^log10(x) = x
• log(10) = 1

• log(100) = 2

• log(1000) = 3

609

• log(1) = 0• log(0.1) = -1• log(0.00001) = -5

610

Natural Logs and e

• Don’t use log10

– Use loge

• Natural log, ln• Has some desirable properties, that

log10 doesn’t– For us– If y = ln(x) + c– dy/dx = 1/x– Not true for any other logarithm

611

• Be careful – calculators and stats packages are not consistent when they use log– Sometimes log10, sometimes loge

– Can prove embarrassing (a friend told me)

612

Take the natural log of the odds ratio

• Goes from –∞ to +∞– can interpret any predicted value

613

Putting them all together

• Logit transformation– log-odds ratio– not bounded at zero or one

614

Score              1      2      3      4      5
Fail     N         7      5      6      4      2
         P         0.7    0.5    0.6    0.4    0.2
Pass     N         3      5      4      6      8
         P         0.3    0.5    0.4    0.6    0.8
Odds (fail)        2.33   1.00   1.50   0.67   0.25
log(odds fail)     0.85   0.00   0.41   -0.41  -1.39
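The three steps as a few lines of numpy (the probabilities are taken straight from the table above):

import numpy as np

p_fail = np.array([0.7, 0.5, 0.6, 0.4, 0.2])     # P(fail) for scores 1 to 5

odds = p_fail / (1 - p_fail)                      # step 2: odds
logit = np.log(odds)                              # step 3: natural log of the odds

print(np.round(odds, 2))      # [2.33 1.   1.5  0.67 0.25]
print(np.round(logit, 2))     # [ 0.85  0.    0.41 -0.41 -1.39]

# and back again: p = odds / (1 + odds) = 1 / (1 + exp(-logit))
print(1 / (1 + np.exp(-logit)))                   # recovers p_fail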

615

[Plot of probability against logit – an S-shaped curve. Probability gets closer to zero, but never reaches it, as the logit goes down.]

616

• Hooray! Problem solved, lesson over– errrmmm… almost

• Because we are now using log-odds ratio, we can’t use OLS– we need a new technique, called

Maximum Likelihood (ML) to estimate the parameters

617

Parameter Estimation using ML

ML tries to find estimates of model parameters that are most likely to give rise to the pattern of observations in the sample data

• All gets a bit complicated– OLS is a special case of ML– the mean is an ML estimator

618

• Don’t have closed form equations– must be solved iteratively– estimates parameters that are most

likely to give rise to the patterns observed in the data

– by maximising the likelihood function (LF)

• We aren’t going to worry about this– except to note that sometimes, the

estimates do not converge• ML cannot find a solution

619

Interpreting Output

Using SPSS• Overall fit for:

– step (only used for stepwise)– block (for hierarchical)– model (always)– in our model, all are the same– 2=4.9, df=1, p=0.025

• F test

620

Omnibus Tests of Model Coefficients

4.990 1 .025

4.990 1 .025

4.990 1 .025

Step

Block

Model

Step 1

Chi-square df Sig.

621

• Model summary– -2LL (=2/N)– Cox & Snell R2

– Nagelkerke R2

– Different versions of R2

• No real R2 in logistic regression• should be considered ‘pseudo R2’

622

Model Summary
Step   -2 Log likelihood   Cox & Snell R Square   Nagelkerke R Square
1      64.245              .095                   .127

623

• Classification Table– predictions of model– based on cut-off of 0.5 (by default)– predicted values x actual values

624

Classification Table(a)
                                 Predicted PASS        Percentage
Observed                         0         1           Correct
Step 1   PASS   0                18        8           69.2
                1                12        12          50.0
         Overall Percentage                            60.0
a. The cut value is .500

625

Model parameters• B

– Change in the logged odds associated with a change of 1 unit in IV

– just like OLS regression– difficult to interpret

• SE (B)– Standard error– Multiply by 1.96 to get 95% CIs

626

Variables in the Equation
Step 1(a)          B        S.E.    Wald
  SCORE            -.467    .219    4.566
  Constant         1.314    .714    3.390
a. Variable(s) entered on step 1: SCORE.

Variables in the Equation
Step 1(a)          Sig.     Exp(B)   95.0% C.I. for EXP(B)
                                     Lower     Upper
  score            .386     1.263    .744      2.143
  Constant         .199     .323
a. Variable(s) entered on step 1: score.

627

• Constant – i.e. score = 0– B = 1.314– Exp(B) = eB = e1.314 = 3.720– OR = 3.720, p = 1 – (1 / (OR + 1))

= 1 – (1 / (3.720 + 1))– p = 0.788

628

• Score 1– Constant b = 1.314– Score B = -0.467– Exp(1.314 – 0.467) = Exp(0.847)

= 2.332– OR = 2.332– p = 1 – (1 / (2.332 + 1))

= 0.699
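The same arithmetic as a couple of lines of Python (b0 and b1 are simply the two coefficients quoted above):

import math

b0, b1 = 1.314, -0.467              # constant and SCORE coefficient from the output

def p_pass(score):
    """Predicted probability of passing for a given aptitude score."""
    odds = math.exp(b0 + b1 * score)
    return odds / (1 + odds)

print(round(p_pass(0), 3))          # 0.788
print(round(p_pass(1), 3))          # 0.7 (i.e. 0.699...)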

629

Standard Errors and CIs

• SPSS gives– B, SE B, exp(B) by default– Can work out 95% CI from standard

error– B ± 1.96 x SE(B)– Or ask for it in options

• Symmetrical in B– Non-symmetrical (sometimes very) in

exp(B)

630

Variables in the Equation
                 B        S.E.    Exp(B)   95.0% C.I. for EXP(B)
                                           Lower     Upper
  SCORE          -.467    .219    .627     .408      .962
  Constant       1.314    .714    3.720
a. Variable(s) entered on step 1: SCORE.

631

• The odds of passing the test are multiplied by 0.63 (95% CI = 0.408 to 0.962; p = 0.033) for every additional point on the aptitude test.

632

More on Standard Errors

• In OLS regression– If a variable is added in a hierarchical fashion– The p-value associated with the change in R2

is the same as the p-value of the variable– Not the case in logistic regression

• In our data 0.025 and 0.033

• Wald standard errors– mean the p-value for the estimate is wrong – too high– (CIs still correct)

633

• Two estimates use slightly different information– P-value says “what if no effect”– CI says “what if this effect”

• Variance depends on the hypothesised ratio of the number of people in the two groups

• Can calculate likelihood ratio based p-values– If you can be bothered– Some packages provide them

automatically

634

Probit Regression

• Very similar to logistic– much more complex initial

transformation (to normal distribution)– Very similar results to logistic

(multiplied by 1.7)

• In SPSS:– A bit weird

• Probit regression available through menus

635

– But requires data structured differently

• However– Ordinal logistic regression is

equivalent to binary logistic• If outcome is binary

– SPSS gives option of probit

636

Results
                        Estimate   SE      P
Logistic (binary)
  Score                 0.288      0.301   0.339
  Exp                   0.147      0.073   0.043
Logistic (ordinal)
  Score                 0.288      0.301   0.339
  Exp                   0.147      0.073   0.043
Probit (via ordinal)
  Score                 0.191      0.178   0.282
  Exp                   0.090      0.042   0.033

637

Differentiating Between Probit and Logistic

• Depends on shape of the error term– Normal or logistic– Graphs are very similar to each other

• Could distinguish quality of fit– Given enormous sample size

• Logistic = probit x 1.7– Actually 1.6998

• Probit advantage– Understand the distribution

• Logistic advantage– Much simpler to get back to the probability

638

[Plot of the cumulative Normal (probit) and Logistic curves – they are very similar]

639

Infinite Parameters

• Non-convergence can happen because of infinite parameters– Insoluble model

• Three kinds:• Complete separation

– The groups are completely distinct• Pass group all score more than 10• Fail group all score less than 10

640

• Quasi-complete separation– Separation with some overlap

• Pass group all score 10 or more• Fail group all score 10 or less

• Both cases:– No convergence

• Close to this– Curious estimates– Curious standard errors

641

• Categorical Predictors– Can cause separation– Esp. if correlated

• Need people in every cell

                      Male                     Female
                      White     Non-White      White     Non-White
Below Poverty Line
Above Poverty Line

642

Logistic Regression and Diagnosis

• Logistic regression can be used for diagnostic tests– For every score

• Calculate probability that result is positive• Calculate proportion of people with that score (or

lower) who have a positive result

• Calculate c statistic– Measure of discriminative power– the %age of all possible pairs of cases (one with the outcome, one without) in which the model gives the higher probability to the case that actually had the outcome

643

– Perfect c-statistic = 1.0– Random c-statistic = 0.5

• SPSS doesn’t do it automatically– But easy to do

• Save probabilities– Use Graphs, ROC Curve– Test variable: predicted probability– State variable: outcome
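A sketch of the c statistic computed directly (Python/numpy; the outcome and predicted probabilities here are invented for illustration):

import numpy as np

def c_statistic(y, p):
    """c statistic (area under the ROC curve): the proportion of all
    outcome/non-outcome pairs in which the case with the outcome gets the
    higher predicted probability (ties count as a half)."""
    y, p = np.asarray(y), np.asarray(p)
    diff = p[y == 1][:, None] - p[y == 0][None, :]      # every pairwise comparison
    return (np.sum(diff > 0) + 0.5 * np.sum(diff == 0)) / diff.size

y = np.array([0, 0, 1, 0, 1, 1, 0, 1])                  # observed outcome
p = np.array([0.2, 0.4, 0.3, 0.1, 0.8, 0.7, 0.5, 0.9])  # model's predicted probabilities
print(c_statistic(y, p))                                # 1.0 = perfect, 0.5 = random; here 0.875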

644

Sensitivity and Specificity

• Sensitivity:– Probability of saying someone has a positive result, if they do: p(pos | pos)
• Specificity:– Probability of saying someone has a negative result, if they do: p(neg | neg)

645

Calculating Sens and Spec

• For each value– Calculate

• proportion of minority earning less – p(m)• proportion of non-minority earning less –

p(w)

– Sensitivity (value)• P(m)

646

Salary P(minority)

10 .39

20 .31

30 .23

40 .17

50 .12

60 .09

70 .06

80 .04

90 .03

647

Using Bank Data

• Predict minority group, using salary (000s)– Logit(minority) = -0.044 + (-0.039) × salary

• Find actual proportions

648

[ROC curve: sensitivity plotted against 1 - specificity]

Diagonal segments are produced by ties.

ROC Curve

Area under curve is c-statistic

649

More Advanced Techniques

• Multinomial Logistic Regression – more than two categories in DV– same procedure– one category chosen as reference group

• odds of being in category other than reference

• Polytomous Logit Universal Models (PLUM)– Ordinal multinomial logistic regression– For ordinal outcome variables

650

Final Thoughts

• Logistic Regression can be extended– dummy variables– non-linear effects– interactions (even though we don’t

cover them until the next lesson)• Same issues as OLS

– collinearity– outliers

651

652

653

Lesson 12: Mediation and Path Analysis

654

Introduction

• Moderator– Level of one variable influences effect of

another variable

• Mediator– One variable influences another via a third

variable

• All relationships are really mediated– are we interested in the mediators?– can we make the process more explicit

655

• In examples with bank:
education → beginning salary

• Why?– What is the process?– Are we making assumptions about

the process?– Should we test those assumptions?

656

[Diagram: education → beginning salary, via job skills, expectations, negotiating skills, and kudos for the bank]

657

Direct and Indirect Influences

X may affect Y in two ways• Directly – X has a direct (causal)

influence on Y– (or maybe mediated by other

variables)

• Indirectly – X affects Y via a mediating variable - M

658

• e.g. how does going to the pub affect comprehension on a Summer school course– on, say, regression

[Diagram: Having fun in pub in evening → not reading books on regression → less knowledge. Anything here?]

659

[Diagram: Having fun in pub in evening → not reading books on regression → less knowledge, with fatigue added as a second path. Still needed?]

660

• Mediators needed– to cope with more sophisticated

theory in social sciences– make explicit assumptions made

about processes– examine direct and indirect influences

661

Detecting Mediation

662

4 Steps
From Baron and Kenny (1986)
• To establish that the effect of X on Y is mediated by M:
1. Show that X predicts Y
2. Show that X predicts M
3. Show that M predicts Y, controlling for X
4. If the effect of X controlling for M is zero, M is a complete mediator of the relationship
• (3 and 4 in same analysis)
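A minimal sketch of the four steps in R (simulated data, using the variable names of the example that follows; the coefficients will not reproduce the slides' values):

  set.seed(3)
  enjoy <- rnorm(200)
  buy   <- 0.9 * enjoy + rnorm(200)
  read  <- 0.3 * enjoy + 0.4 * buy + rnorm(200)

  step1  <- lm(read ~ enjoy)          # 1: X predicts Y
  step2  <- lm(buy ~ enjoy)           # 2: X predicts M
  step34 <- lm(read ~ enjoy + buy)    # 3: M predicts Y, controlling for X
  summary(step34)                     # 4: is the enjoy coefficient now zero?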

663

Example: Book habits

Enjoy Books → Buy books → Read Books

664

Three Variables

• Enjoy– How much an individual enjoys books

• Buy– How many books an individual buys

(in a year)• Read

– How many books an individual reads (in a year)

665

        ENJOY   BUY     READ
ENJOY   1.00    0.64    0.73
BUY     0.64    1.00    0.75
READ    0.73    0.75    1.00

666

• The Theory

enjoy → buy → read

667

• Step 1
1. Show that X (enjoy) predicts Y (read)
– b1 = 0.487, p < 0.001
– standardised b1 = 0.732
– OK

668

2. Show that X (enjoy) predicts M (buy)

– b1 = 0.974, p < 0.001

– standardised b1 = 0.643

– OK

669

3. Show that M (buy) predicts Y (read), controlling for X (enjoy)
– b = 0.206, p < 0.001
– standardised b = 0.469
– OK

670

4. If effect of X controlling for M is zero, M is complete mediator of the relationship

– (Same as analysis for step 3.)

– b2 = 0.287, p = 0.001

– standardised b2 = 0.431

– Hmmmm…• Significant, therefore not a complete

mediator

671

[Path diagram: enjoy → buy = 0.974 (from step 2); buy → read = 0.206 (from step 3); enjoy → read = 0.287 (step 4)]

672

The Mediation Coefficient

• Amount of mediation = Step 1 – Step 4 = 0.487 – 0.287 = 0.200
• OR
• Step 2 × Step 3 = 0.974 × 0.206 = 0.200

673

SE of Mediator

• sa = se(a)
• sb = se(b)

[Path diagram: enjoy → buy, path a (from step 2); buy → read, path b (from step 3)]

674

• Sobel test– Standard error of mediation

coefficient can be calculated

a = 0.974, sa = 0.189
b = 0.206, sb = 0.054

se(ab) = √(b²sa² + a²sb² − sa²sb²)

675

• Indirect effect = 0.200– se = 0.056– t =3.52, p = 0.001

• Online Sobel test:http://www.unc.edu/~preacher/sobel/sobel.htm– (Won’t be there for long; probably will be

somewhere else)
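A minimal R sketch of the calculation (implementing the standard error formula as reconstructed above, and plugging in the a, b, sa and sb from the previous slide):

  sobel <- function(a, b, sa, sb) {
    ab <- a * b                                        # indirect effect
    se <- sqrt(b^2 * sa^2 + a^2 * sb^2 - sa^2 * sb^2)  # SE of the indirect effect
    z  <- ab / se
    c(indirect = ab, se = se, z = z, p = 2 * pnorm(-abs(z)))
  }
  sobel(a = 0.974, b = 0.206, sa = 0.189, sb = 0.054)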

676

A Note on Power

• Recently– Move in methodological literature away from

this conventional approach– Problems of power:– Several tests, all of which must be significant

• Type I error rate = 0.05 * 0.05 = 0.0025• Must affect power

– Bootstrapping suggested as alternative• See Paper B7, A4, B9• B21 for SPSS syntax

677

678

679

Lesson 13: Moderators in Regression

“different slopes for different folks”

680

Introduction

• Moderator relationships have many different names– interactions (from ANOVA)– multiplicative– non-linear (just confusing)– non-additive

• All talking about the same thing

681

A moderated relationship occurs • when the effect of one variable

depends upon the level of another variable

682

• Hang on …– That seems very like a nonlinear

relationship– Moderator

• Effect of one variable depends on level of another

– Non-linear• Effect of one variable depends on level of itself

• Where there is collinearity– Can be hard to distinguish between them– Paper in handbook (B5)– Should (usually) compare effect sizes

683

• e.g. How much it hurts when I drop a computer on my foot depends on– x1: how much alcohol I have drunk

– x2: how high the computer was dropped from

– but if x1 is high enough

– x2 will have no effect

684

• e.g. Likelihood of injury in a car accident– depends on

– x1: speed of car

– x2: if I was wearing a seatbelt

– but if x1 is low enough

– x2 will have no effect

685

[Graph: Injury against Speed (mph), with separate lines for Seatbelt and No Seatbelt]

686

• e.g. number of words (from a list) I can remember– depends on

– x1: type of words (abstract, e.g. ‘justice’, or concrete, e.g. ‘carrot’)

– x2: Method of testing (recognition – i.e. multiple choice, or free recall)

– but if using recognition

– x1: will not make a difference

687

• We looked at three kinds of moderator

• alcohol x height = pain– continuous x continuous

• speed x seatbelt = injury– continuous x categorical

• word type x test type– categorical x categorical

• We will look at them in reverse order

688

How do we know to look for moderators?

Theoretical rationale• Often the most powerful• Many theories predict additive/linear

effects– Fewer predict moderator effects

Presence of heteroscedasticity• Clue there may be a moderated

relationship missing

689

Two Categorical Predictors

690

Data• 2 IVs– word type (concrete [1], abstract [2])– test method (recog [1], recall [2])

• 20 Participants in one of four groups– 1, 1– 1, 2– 2, 1– 2, 2

• 5 per group• lesson12.1.sav

691

                Concrete   Abstract   Total
Recog   Mean    15.40      15.20      15.30
        SD      2.19       2.59       2.26
Recall  Mean    15.60      6.60       11.10
        SD      1.67       7.44       6.95
Total   Mean    15.50      10.90      13.20
        SD      1.84       6.94       5.47

692

• Graph of means
[Graph: mean score for the two levels of TEST (1.00, 2.00), with separate lines for WORDS (1.00, 2.00)]

693

ANOVA Results

• Standard way to analyse these data would be to use ANOVA– Words: F=6.1, df=1, 16, p=0.025– Test: F=5.1, df=1, 16, p=0.039– Words x Test: F=5.6, df=1, 16,

p=0.031

694

Procedure for Testing

1: Convert to effect coding
• dummy coding can be used instead, though collinearity is less of an issue with effect coding
• doesn’t make any difference to substantive interpretation
2: Calculate interaction term
• In ANOVA the interaction is automatic
• In regression we create an interaction variable

695

• Interaction term (wxt) – multiply the effect coded variables together

word   test   wxt
-1     -1     1
1      -1     -1
-1     1      -1
1      1      1

696

3: Carry out regression• Hierarchical

– linear effects first– interaction effect in next block
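A minimal R sketch of steps 1–3 (assuming a data frame d holding the lesson12.1.sav variables word and test, coded 1 and 2, and the outcome score):

  d$w   <- ifelse(d$word == 1, -1, 1)        # effect-coded word type
  d$t   <- ifelse(d$test == 1, -1, 1)        # effect-coded test method
  d$wxt <- d$w * d$t                         # interaction term

  block1 <- lm(score ~ w + t, data = d)      # linear effects first
  block2 <- lm(score ~ w + t + wxt, data = d)
  anova(block1, block2)                      # test of the change in R2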

697

• b0=13.2• b1 (words) = -2.3, p=0.025• b2 (test) = -2.1, p=0.039• b3 (words x test) = -2.2, p=0.031• Might need to use change in R2 to

test sig of interaction, because of collinearity

What do these mean?• b0 (intercept) = predicted value of Y

(score) when all X = 0– i.e. the central point

698

• b0 = 13.2– grand mean

• b1 = -2.3– distance from grand to mean for two

word types– 13.2 – (-2.3) = 15.5– 13.2 + (-2.3) = 10.9

        Concrete   Abstract   Total
Recog   15.40      15.20      15.30
Recall  15.60      6.60       11.10
Total   15.50      10.90      13.20

699

• b2 = -2.1– distance from grand mean to recog

and recall means

• b3 = -2.2

– to understand b3 we need to look at predictions from the equation without this term

Score = 13.2 + (-2.3) w + (-2.1) t

700

Score = 13.2 + (-2.3) w + (-2.1) t• So for each group we can calculate

an expected value

701

W   T        Word   Test   Expected Value
C   Recog    -1     -1     13.2 + (-2.3)(-1) + (-2.1)(-1)
C   Recall   -1     1      13.2 + (-2.3)(-1) + (-2.1)(1)
A   Recog    1      -1     13.2 + (-2.3)(1) + (-2.1)(-1)
A   Recall   1      1      13.2 + (-2.3)(1) + (-2.1)(1)

b1 = -2.3, b2 = -2.1

702

• The exciting part comes when we look at the differences between the actual value and the value in the 2 IV model

W   T        Word   Test   Exp    Actual Value
C   Recog    -1     -1     17.6   15.4
C   Recall   -1     1      13.4   15.6
A   Recog    1      -1     13.0   15.2
A   Recall   1      1      8.8    6.6

703

• Each difference = 2.2 (or –2.2)

• The value of b3 was –2.2– the interaction term is the correction

required to the slope when the second IV is included

704

• Examine the slope for word type

[Graph: mean score against test type, Recog (-1) and Recall (1)]
Gradient = (11.1 - 15.3) / 2 = -2.1

705

• Add the slopes for two test groups

[Graph: mean score against test type, Recog (-1) and Recall (1), with three lines]
Both word groups: gradient = -2.1
Abstract: (6.6 - 15.2)/2 = -4.3
Concrete: (15.6 - 15.4)/2 = 0.1

706

b associated with interaction • the change in slope, away from the

average, associated with a 1 unit change in the moderating variable

OR• Half the difference in the slopes

707

• Another way to look at it
Y = 13.2 + (-2.3)w + (-2.1)t + (-2.2)wt
• Examine concrete words group (w = -1)– substitute values into the equation
Y(concrete) = 13.2 + (-2.3)(-1) + (-2.1)t + (-2.2)(-1)t
Y(concrete) = 13.2 + 2.3 + (-2.1)t + 2.2t
Y(concrete) = 15.5 + 0.1t
• The effect of changing test type for concrete words (the slope, which is half the actual difference)

708

Why go to all that effort? Why not do ANOVA in the first place?

1. That is what ANOVA actually does• if it can handle an unbalanced design

(i.e. different numbers of people in each group)

• Helps to understand what can be done with ANOVA

• SPSS uses regression to do ANOVA

2. Helps to clarify more complex cases

• as we shall see

709

Categorical x Continuous

710

Note on Dichotomisation

• Very common to see people dichotomise a variable– Makes the analysis easier– Very bad idea

• Paper B6

711

Data

A chain of 60 supermarkets • examining the relationship between

profitability, shop size, and local competition

• 2 IVs– shop size– comp (local competition, 0=no, 1=yes)

• DV– profit

712

• Data, ‘lesson 12.2.sav’

Shopsize   Comp   Profit
4          1      23
10         1      25
7          0      19
10         0      9
10         1      18
29         1      33
12         0      17
6          1      20
14         0      21
62         0      8

713

1st Analysis

Two IVs• R2=0.367, df=2, 57, p < 0.001• Unstandardised estimates

– b1 (shopsize) = 0.083 (p=0.001)

– b2 (comp) = 5.883 (p<0.001)

• Standardised estimates– b1 (shopsize) = 0.356

– b2 (comp) = 0.448

714

• Suspicions– Presence of competition is likely to

have an effect– Residual plot shows a little

heteroscedasticity

[Residual plot]

715

Procedure for Testing

• Very similar to last time– convert ‘comp’ to effect coding– -1 = No competition– 1 = competition– Compute interaction term

• comp (effect coded) x size

– Hierarchical regression

716

Result

• Unstandardised estimates– b1 (shopsize) = 0.071 (p=0.006)

– b2 (comp) = -1.67 (p = 0.506)

– b3 (sxc) = -0.050 (p=0.050)

• Standardised estimates– b1 (shopsize) = 0.306

– b2 (comp) = -0.127

– b3 (sxc) = -0.389

717

• comp now non-significant– shows importance of hierarchical– it obviously is important

718

Interpretation

• Draw graph with lines of best fit– drawn automatically by SPSS

• Interpret equation by substitution of values– evaluate effects of

• size• competition

719

[Graph: Profit against Shopsize, with lines of best fit for Competition, No competition, and All Shops]

720

• Effects of size– in presence and absence of competition– (can ignore the constant)

Y = 0.071x1 + (-1.67)x2 + (-0.050)x1x2

– Competition present (x2 = 1)
Y = 0.071x1 + (-1.67)(1) + (-0.050)x1(1)
Y = 0.071x1 + (-1.67) + (-0.050)x1
Y = 0.021x1 + (-1.67)

721

Y = 0.071x1 + (-1.67)x2 + (-0.050)x1x2

– Competition absent (x2 = -1)
Y = 0.071x1 + (-1.67)(-1) + (-0.050)x1(-1)
Y = 0.071x1 + 0.050x1 + 1.67
Y = 0.121x1 (+ 1.67)

722

Two Continuous Variables

723

Data

• Bank Employees– only using clerical staff– 363 cases– predicting starting salary– previous experience– age– age x experience

724

• Correlation matrix – only one significant

            LOGSB   AGESTART   PREVEXP
LOGSB       1.00    -0.09      0.08
AGESTART    -0.09   1.00       0.77
PREVEXP     0.08    0.77       1.00

725

Initial Estimates (no moderator)• (standardised)

– R2 = 0.061, p<0.001– Age at start = -0.37, p<0.001– Previous experience = 0.36, p<0.001

• Suppressing each other– Age and experience compensate for

one another– Older, with no experience, bad– Younger, with experience, good

726

The Procedure

• Very similar to previous– create multiplicative interaction term– BUT

• Need to eliminate effects of means – cause massive collinearity

• and SDs– cause one variable to dominate the

interaction term• By standardising

727

• To standardise x, – subtract mean, and divide by SD– re-expresses x in terms of distance

from the mean, in SDs– ie z-scores

• Hint: automatic in SPSS in Descriptives

• Create interaction term of age and exp– axe = z(age) z(exp)

728

• Hierarchical regression– two linear effects first– moderator effect in second– hint: it is often easier to interpret if

standardised versions of all variables are used
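A minimal R sketch of the whole procedure (assuming a data frame bank with agestart, prevexp and logsb, as in this example):

  bank$z_age <- as.numeric(scale(bank$agestart))   # z-scores
  bank$z_exp <- as.numeric(scale(bank$prevexp))
  bank$axe   <- bank$z_age * bank$z_exp            # interaction term

  block1 <- lm(logsb ~ z_age + z_exp, data = bank)
  block2 <- lm(logsb ~ z_age + z_exp + axe, data = bank)
  anova(block1, block2)                            # change in R2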

729

• Change in R2

– 0.085, p<0.001

• Estimates (standardised)– b1 (exp) = 0.104

– b2 (agestart) = -0.54

– b3 (age x exp) = -0.54

730

Interpretation 1: Pick-a-Point

• Graph is tricky– can’t have two continuous variables– Choose specific points (pick-a-point)

• Graph the line of best fit of one variable at others

– Two ways to pick a point• 1: Choose high (z = +1), medium (z = 0)

and low (z = -1)• Choose ‘sensible’ values – age 20, 50,

80?

731

• We know:
– Y = 0.10e + (-0.54)a + (-0.54)ae
– Where a = agestart, and e = experience
• We can rewrite this as:
– Y = (0.10e) + (-0.54)a + (-0.54)ea
– Take a out of the brackets
– Y = (0.10e) + (-0.54 + (-0.54)e)a
• The bracketed terms are the simple intercept and simple slope
– β0 = (0.10e)
– β1 = (-0.54 + (-0.54)e)
– Y = β0 + β1a

732

• Pick any value of e, and we know the slope for a– Standardised, so it’s easy
• e = -1:  β0 = (-1)(0.10) = -0.10;  β1a = (-0.54 + (-1)(-0.54))a = 0.00a
• e = 0:   β0 = (0)(0.10) = 0;  β1a = (-0.54 + (0)(-0.54))a = -0.54a
• e = 1:   β0 = (1)(0.10) = 0.10;  β1a = (-0.54 + (1)(-0.54))a = -1.08a
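The same pick-a-point arithmetic as a minimal R sketch (using the standardised estimates quoted above):

  b_e <- 0.10; b_a <- -0.54; b_ae <- -0.54
  for (e in c(-1, 0, 1)) {
    cat("e =", e,
        " simple intercept =", b_e * e,
        " simple slope for a =", b_a + b_ae * e, "\n")
  }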

733

Graph the Three Lines

[Graph: log(salary) against age, with lines for e = -1, e = 0 and e = 1]

734

Interpretation 2: P-Values and CIs

• Second way – Newer, rarely done

• Calculate CIs of the slope – At any point

• Calculate p-value– At any point

• Give ranges of significance

735

What do you need?

• The variance and covariance of the estimates– SPSS doesn’t provide estimates for

intercept– Need to do it manually

• In options, exclude intercept– Create intercept – c = 1– Use it in the regression

736

• Enter information into web page:– www.unc.edu/~preacher/interact/acov.htm

– (Again, may not be around for long)

• Get results• Calculations in Bauer and Curran

(in press: Multivariate Behavioral Research)– Paper B13

737

[Graph: MLR 2-way interaction plot – Y against X, with lines at three values of the moderator, CVz1(1), CVz1(2), CVz1(3)]

738

Areas of Significance

[Graph: simple slope against experience, with confidence bands showing the areas of significance]

739

• 2 complications– 1: Constant differed– 2: DV was logged, hence non-linear

• effect of 1 unit depends on where the unit is

– Can use SPSS to do graphs showing lines of best fit for different groups

– See paper A2

740

Finally …

741

Unlimited Moderators

• Moderator effects are not limited to – 2 variables– linear effects

742

Three Interacting Variables

• Age, Sex, Exp• Block 1

– Age, Sex, Exp

• Block 2– Age x Sex, Age x Exp, Sex x Exp

• Block 3– Age x Sex x Exp

743

• Results– All two way interactions significant– Three way not significant– Effect of Age depends on sex– Effect of experience depends on sex– Size of the age x experience

interaction does not depend on sex (phew!)

744

Moderated Non-Linear Relationships

• Enter non-linear effect• Enter non-linear effect x moderator

– if significant indicates degree of non-linearity differs by moderator

745

746

Modelling Counts: Poisson Regression

Lesson 14

747

Counts and the Poisson Distribution

• Von Bortkiewicz (1898)– Numbers of Prussian

soldiers kicked to death by horses

Deaths   Frequency
0        109
1        65
2        22
3        3
4        1
5        0

748

• The data fitted a Poisson probability distribution– When counts of events occur, poisson

distribution is common– E.g. papers published by researchers, police

arrests, number of murders, ship accidents

• Common approach– Log transform and treat as normal

• Problems– Censored at 0– Integers only allowed– Heteroscedasticity

749

The Poisson Distribution

[Graph: Poisson probability distributions for counts 0 to 17, with means 0.5, 8 and 14]

750

p(y | x) = exp(-μ) μ^y / y!

751

• Where:– y is the count– μ is the mean of the Poisson distribution
• In a Poisson distribution– The mean = the variance (hence the heteroscedasticity issue)

p(y | x) = exp(-μ) μ^y / y!
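A quick check of the formula against R's built-in Poisson density (the values here are chosen just for illustration):

  mu <- 1.4; y <- 2
  exp(-mu) * mu^y / factorial(y)   # by hand: about 0.24
  dpois(y, lambda = mu)            # same value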

752

Poisson Regression in SPSS

• Not directly available– SPSS can be tweaked to do it in three ways:– General loglinear model (genlog)– Non-linear regression (CNLR)

• Bootstrapped p-values only

– Both are quite tricky

• SPSS 15 adds generalized linear models, a third and more direct option

753

Example Using Genlog

• Number of shark bites on different colour surfboards– 100 surfboards, 50

red, 50 blue

• Weight cases by bites

• Analyse, Loglinear, General– Colour is factor

[Bar chart: frequency of 0 to 4 bites, for blue and red surfboards]

754

Results

Correspondence Between Parameters and Terms of the Design

Parameter   Aliased   Term
1                     Constant
2                     [COLOUR = 1]
3           x         [COLOUR = 2]

Note: 'x' indicates an aliased (or a redundant) parameter. These parameters are set to zero.

755

                                      Asymptotic 95% CI
Param   Est.     SE      Z-value      Lower    Upper
1       4.1190   .1275   32.30        3.87     4.37
2       -.5495   .2108   -2.61        -.96     -.14
3       .0000    .       .            .        .

• Note: Intercept (param 1) is curious

• Param 2 is the difference in the means

756

SPSS: Continuous Predictors

• Bleedin’ nightmare• http://www.spss.com/tech/

answer/details.cfm?tech_tan_id=100006204

757

Poisson Regression in Stata

• SPSS will save a Stata file• Open it in Stata• Statistics, Count outcomes, Poisson

regression

758

Poisson Regression in R

• R is a freeware program– Similar to SPlus– www.r-project.org

• Steep learning curve to start with• Much nicer to do Poisson (and other)

regression analysishttp://www.stat.lsa.umich.edu/~faraway/book/

http://www.jeremymiles.co.uk/regressionbook/extras/appendix2/R/

759

• Commands in R• Stage 1: enter data

– colour <- c(1, 0, 1, 0, 1, 0 … 1)– bites <- c(3, 1, 0, 0, … )

• Run analysis– p1 <- glm(bites ~ colour, family = poisson)

• Get results– summary.glm(p1)

760

R Results

Coefficients:

Estimate Std. Error z value Pr(>|z|)

(Intercept) -0.3567 0.1686 -2.115 0.03441 *

colour 0.5555 0.2116 2.625 0.00866 **

• Results for colour– Same as SPSS– For intercept different (weird SPSS)

761

Predicted Values

• Need to get exponential of parameter estimates – Like logistic regression

• Exp(0.555) = 1.74– You are likely to be bitten by a shark

1.74 times more often with a red surfboard

762

Checking Assumptions

• Was it really Poisson distributed?– For Poisson, μ = σ²
• As mean increases, variance should also increase
– Residuals should be random• Overdispersion is a common problem• Too many zeroes
• For blue: σ² = exp(-0.3567) = 1.42
• For red: σ² = exp(-0.3567 + 0.555) = 2.48

763

• Strictly:

p(y | x) = exp(-μ) μ^y / y!

– for each case i, with estimated mean μ̂i:
p(yi | xi) = exp(-μ̂i) μ̂i^yi / yi!

764

Compare Predicted with Actual Distributions

[Bar charts: expected and actual probabilities of 0 to 4 bites, for blue and for red surfboards]

765

Overdispersion
• Problem in Poisson regression
– Too many zeroes
• Causes– χ2 inflation– Standard error deflation
• Hence p-values too low
– Higher type I error rate
• Solution– Negative binomial regression (a sketch follows below)
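A minimal sketch of that solution in R, reusing the bites and colour vectors from the R example above (glm.nb comes from the MASS package, which ships with R):

  library(MASS)
  nb1 <- glm.nb(bites ~ colour)    # negative binomial regression
  summary(nb1)                     # the theta parameter captures the overdispersion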

766

Using R

• R can read an SPSS file – But you have to ask it nicely

• Click Packages menu, Load package, choose “Foreign”

• Click File, Change Dir– Change to the folder that contains

your data

767

More on R• R uses objects

– To place something into an object use <- – X <- Y

• Puts Y into X

• Function is read.spss()– Mydata <- read.spss(“spssfilename.sav”)

• Variables are then referred to as Mydata$VAR1– Note 1: R is case sensitive– Note 2: SPSS variable name in capitals

768

GLM in R

• Command– glm(outcome ~ pred1 + pred2 + … +

predk [,family = familyname])– If no familyname, default is OLS

• Use binomial for logistic, poisson for poisson

• Output is a GLM object– You need to give this a name– my1stglm <- glm(outcome ~ pred1 +

pred2 + … + predk [,family = familyname])
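Putting the pieces together, a minimal sketch (the file name and variable names are made up for illustration):

  library(foreign)
  mydata <- read.spss("myfile.sav", to.data.frame = TRUE)

  my1stglm <- glm(BITES ~ COLOUR, family = poisson, data = mydata)
  summary(my1stglm)
  exp(coef(my1stglm))              # back to the original (count) scale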

769

• Then need to explore the result– summary(my1stglm)

• To explore what it means– Need to plot regressions

• Easiest is to use Excel

770

771

Introducing Structural Equation Modelling

Lesson 15

772

Introduction

• Related to regression analysis– All (OLS) regression can be

considered as a special case of SEM

• Power comes from adding restrictions to the model

• SEM is a system of equations– Estimate those equations

773

Regression as SEM

• Grades example– Grade = constant + books + attend +

error• Looks like a regression equation

– Also– Books correlated with attend– Explicit modelling of error
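For comparison, the same model written as SEM in R's lavaan package (a minimal sketch – lavaan is not one of the programs used in this course, and a data frame grades with GRADE, BOOKS and ATTEND is assumed):

  library(lavaan)
  model <- 'GRADE ~ BOOKS + ATTEND'                   # error term added automatically
  fit <- sem(model, data = grades, fixed.x = FALSE)   # also estimate the BOOKS-ATTEND covariance
  summary(fit, standardized = TRUE)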

774

Path Diagram

• System of equations are usefully represented in a path diagram

x Measured variable

e unmeasured variable

regression

correlation

775

Path Diagram for Regression

[Path diagram: Books and Attend (correlated) with arrows to Grade, and error with an arrow to Grade]
Must usually explicitly model error
Must explicitly model correlation

776

Results

• Unstandardised
[Path diagram, unstandardised estimates: BOOKS (variance 2.00) → GRADE 4.04; ATTEND (variance 17.84) → GRADE 1.28; BOOKS ↔ ATTEND covariance 2.65; e (variance 1.00) → GRADE 13.52]

777

Standardised
[Path diagram, standardised estimates: BOOKS → GRADE .35; ATTEND → GRADE .33; BOOKS ↔ ATTEND .44; e → GRADE .82]

778

Table
                    Estimate   S.E.   C.R.   P      St. Est.
GRADE <-- BOOKS     4.04       1.71   2.36   0.02   0.35
GRADE <-- ATTEND    1.28       0.57   2.25   0.03   0.33
GRADE <-- e         13.52      1.53   8.83   0.00   0.82
GRADE               37.38      7.54   4.96   0.00

Coefficients (dependent variable: GRADE)
              B       Std. Error   Beta   Sig.
(Constant)    37.38   7.74                .00
BOOKS         4.04    1.75         .35    .03
ATTEND        1.28    .59          .33    .04

779

So What Was the Point?

• Regression is a special case• Lots of other cases• Power of SEM

– Power to add restrictions to the model• Restrict parameters

– To zero– To the value of other parameters– To 1

780

Restrictions

• Questions– Is a parameter really necessary?– Are a set of parameters necessary?– Are parameters equal

• Each restriction adds 1 df– Test of model with χ2

781

The χ2 Test

• Can the model proposed have generated the data?– Test of significance of difference of

model and data– Statistically significant result

• Bad

– Theoretically driven• Start with model• Don’t start with data

782

Regression Again

• Both estimates restricted to zero
[Path diagram: BOOKS and ATTEND with their paths to GRADE fixed to zero; error e (0, 1) → GRADE]

783

• Two restrictions– 2 df for χ2 test– χ2 = 15.9, p = 0.0003

• This test is (asymptotically) equivalent to the F test in regression– We still haven’t got any further

784

Multivariate Regression

[Path diagram: x1 and x2 predicting y1, y2 and y3]

785

[Path diagram: x1 and x2, with y1, y2 and y3]
Test of all x’s on all y’s (6 restrictions = 6 df)

786

Test of all x1 on all y’s (3 restrictions)
[Path diagram: the paths from x1 to y1, y2 and y3 removed]

787

[Path diagram: x1 and x2, with y1, y2 and y3]
Test of all x1 on all y1 (3 restrictions)

788

[Path diagram: x1 and x2, with y1, y2 and y3]
Test of all 3 partial correlations between y’s, controlling for x’s (3 restrictions)

789

Path Analysis and SEM

• More complex models – can add more restrictions– E.g. mediator

model

• 1 restriction– No path from

enjoy -> read

[Path diagram: ENJOY → BUY → READ, with error terms e_buy and e_read, and no direct ENJOY → READ path]

790

Result

• χ2 = 10.9, 1 df, p = 0.001• Not a complete mediator

– Additional path is required

791

Multiple Groups

• Same model– Different people

• Equality constraints between groups– Means, correlations, variances,

regression estimates– E.g. males and females

792

Multiple Groups Example

• Age• Severity of psoriasis

– SEVE – in emotional areas• Hands, face, forearm

– SEVNONE – in non-emotional areas– Anxiety– Depression

793

Correlations, SEX = female (n = 110); Sig. (2-tailed) in brackets

            AGE           SEVE          SEVNONE       GHQ_A         GHQ_D
AGE         1             -.270 (.004)  -.248 (.009)  .017 (.859)   .035 (.717)
SEVE        -.270 (.004)  1             .665 (.000)   .045 (.639)   .075 (.436)
SEVNONE     -.248 (.009)  .665 (.000)   1             .109 (.255)   .096 (.316)
GHQ_A       .017 (.859)   .045 (.639)   .109 (.255)   1             .782 (.000)
GHQ_D       .035 (.717)   .075 (.436)   .096 (.316)   .782 (.000)   1

794

Correlations, SEX = male (n = 79); Sig. (2-tailed) in brackets

            AGE           SEVE          SEVNONE       GHQ_A         GHQ_D
AGE         1             -.243 (.031)  -.116 (.310)  -.195 (.085)  -.190 (.094)
SEVE        -.243 (.031)  1             .671 (.000)   .456 (.000)   .453 (.000)
SEVNONE     -.116 (.310)  .671 (.000)   1             .210 (.063)   .232 (.040)
GHQ_A       -.195 (.085)  .456 (.000)   .210 (.063)   1             .800 (.000)
GHQ_D       -.190 (.094)  .453 (.000)   .232 (.040)   .800 (.000)   1

795

Model

[Path diagram: AGE → SEVE and SEVNONE (with errors e_s and e_sn); SEVE and SEVNONE → Dep and Anx (with errors E_d and e_a)]

796

Females
[Path diagram: standardised estimates for females]

797

Males
[Path diagram: standardised estimates for males]

798

Constraint

• sevnone -> dep– Constrained to be equal for males and females
• 1 restriction, 1 df– χ2 = 1.3 – not significant
• 4 restrictions– the 2 severity measures -> anx & dep

799

• 4 restrictions, 4 df– χ2 test, p = 0.014

• Parameters are not equal

800

Missing Data: The big advantage

• SEM programs tend to deal with missing data – Multiple imputation– Full Information (Direct) Maximum

Likelihood• Asymptotically equivalent

• Data can be MAR, not just MCAR

801

Power: A Smaller Advantage

• Power for regression gets tricky with large models

• With SEM power is (relatively) easy– It’s all based on chi-square– Paper B14

802

Lesson 16: Dealing with clustered data & longitudinal

models

803

The Independence Assumption

• In Lesson 8 we talked about independence – The residual of any one case should not tell

you about the residual of any other case

• Particularly problematic when:– Data are clustered on the predictor variable

• E.g. predictor is household size, cases are members of family

• E.g. Predictor is doctor training, outcome is patients of doctor

– Data are longitudinal• Have people measured over time

– It’s the same person!

804

Clusters of Cases

• Problem with cluster (group) randomised studies– Or group effects

• Use Huber-White sandwich estimator– Tell it about the groups– Correction is made– Use complex samples in SPSS

805

Complex Samples

• As with Huber-White for heteroscedasticity– Add a variable that tells it about the clusters– Put it into clusters

• Run GLM– As before

• Warning:– Need about 20 clusters for solutions to be

stable
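One way to make the same correction outside SPSS, as a minimal R sketch (using the sandwich and lmtest packages; a data frame d with cost, arm and the cluster variable week is assumed):

  library(sandwich)
  library(lmtest)
  fit <- lm(cost ~ arm, data = d)
  coeftest(fit, vcov = vcovCL(fit, cluster = ~ week))   # cluster-robust (sandwich) SEs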

806

Example

• People randomised by week to one of two forms of triage– Compare the total cost of treating each

• Ignore clustering– Difference is £2.40 per person, with 95%

confidence intervals £0.58 to £4.22, p =0.010

• Include clustering– Difference is still £2.40, with 95% CIs -£0.85 to £5.65, and p = 0.141.
• Ignoring clustering led to a type I error

807

Longitudinal Research

• For comparing repeated measures– Clusters are people– Can model the

repeated measures over time

• Data are usually short and fat

ID   V1   V2   V3   V4
1    2    3    4    7
2    3    6    8    4
3    2    5    7    5

808

Converting Data

• Change data to tall and thin

• Use Data, Restructure in SPSS

• Clusters are ID

ID   V   X
1    1   2
1    2   3
1    3   4
1    4   7
2    1   3
2    2   6
2    3   8
2    4   4
3    1   2
3    2   5
3    3   7
3    4   5

809

(Simple) Example

• Use employee data.sav– Compare beginning salary and salary– Would normally use paired samples t-

test

• Difference = $17,403, 95% CIs $16,427.407, $18,379.555

810

Restructure the Data

• Do it again– With data tall and thin

• Complex GLM with Time as factor– ID as cluster

• Difference = $17,403, 95% CIs = $16,427.407, $18,379.555

ID   Time   Cash
1    1      $18,750
1    2      $21,450
2    1      $12,000
2    2      $21,900
3    1      $13,200
3    2      $45,000

811

Interesting …

• That wasn’t very interesting– What is more interesting is when we

have multiple measurements of the same people

• Can plot and assess trajectories over time

812

Single Person Trajectory

[Graph: one person’s measurements plotted against time]

813

Multiple Trajectories: What’s the Mean and SD?

[Graph: many individual trajectories plotted against time]

814

Complex Trajectories

• An event occurs– Can have two effects:– A jump in the value– A change in the slope

• Event doesn’t have to happen at the same time for each person– Doesn’t have to happen at all

815

[Diagram: trajectory with slope 1; when the event occurs there is a jump, then slope 2]

816

Parameterising

Time   Event   Time2   Outcome
1      0       0       12
2      0       0       13
3      0       0       14
4      0       0       15
5      0       0       16
6      1       0       10
7      1       1       9
8      1       2       8
9      1       3       7

817

Draw the Line

What are the parameter estimates?
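One way to answer that is a minimal R sketch, fitting the parameterisation in the table above with an ordinary regression:

  time    <- 1:9
  event   <- c(0, 0, 0, 0, 0, 1, 1, 1, 1)
  time2   <- c(0, 0, 0, 0, 0, 0, 1, 2, 3)
  outcome <- c(12, 13, 14, 15, 16, 10, 9, 8, 7)
  coef(lm(outcome ~ time + event + time2))
  # intercept 11, slope 1 before the event, a jump of -7 at the event,
  # and a change of -2 in the slope afterwards (so slope 2 is -1)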

818

Main Effects and Interactions

• Main effects – Intercept differences

• Moderator effects– Slope differences

819

Multilevel Models

• Fixed versus random effects– Fixed effects are fixed across

individuals (or clusters)– Random effects have variance

• Levels– Level 1 – individual measurement

occasions– Level 2 – higher order clusters

820

More on Levels• NHS direct study

– Level 1 units: …………….– Level 2 units: ……………

• Widowhood food study– Level 1 units ……………– Level 2 units ……………

821

More Flexibility

• Three levels:– Level 1: measurements– Level 2: people– Level 3: schools

822

More Effects

• Variances and covariances of effects

• Level 1 and level 2 residuals– Makes R2 difficult to talk about

• Outcome variable– Yij

• The score of the ith person in the jth group

823

Y      i    j
2.3    1    1
3.2    2    1
4.5    3    1
4.8    1    2
7.2    2    2
3.1    3    2
1.6    4    2

824

Notation• Notation gets a bit horrid

– Varies a lot between books and programs

• We used to have b0 and b1

– If fixed, that’s fine– If random, each person has their own

intercept and slope

825

Standard Errors

• Intercept has standard errors• Slopes have standard errors• Random effects have variances

– Those variances have standard errors• Is there statistically significant variation

between higher level units (people)?• OR• Is everyone the same?

826

Programs

• Since version 12– Can do this in SPSS– Can’t do anything really clever

• Menus– Completely unusable– Have to use syntax

827

SPSS Syntax

MIXED relfd with time
  /fixed = time
  /random = intercept time | subject(id) covtype(un)
  /print = solution.
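The same model in R's lme4 package, as a minimal sketch (an alternative to MIXED, not part of the course; a data frame d in tall-and-thin form with relfd, time and id is assumed):

  library(lme4)
  m1 <- lmer(relfd ~ time + (1 + time | id), data = d)   # random intercept and slope,
  summary(m1)                                            # with an unstructured covariance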

828

SPSS Syntax

• MIXED• relfd with time

OutcomeContinuous

predictor

829

SPSS Syntax

• MIXED• relfd with time• /fixed = time

Must specify effect as fixed first

830

SPSS Syntax

• MIXED• relfd with time• /fixed = time• /random = intercept time |

subject (id) covtype(un)Specify random

effects

Intercept and time are random

SPSS assumes that your level 2 units are subjects, and needs to know the id

variable

831

SPSS Syntax

• MIXED
• relfd with time
• /fixed = time
• /random = intercept time | subject(id) covtype(un)

Covariance matrix of random effects is unstructured. (Alternatives: id – identity, or vc – variance components.)

832

SPSS Syntax

• MIXED• relfd with time• fixed = time• /random = intercept time |

subject (id) covtype(un)• /print = solution.

Print the answer

833

The Output
• Information criteria– We’ll come back

Information Criteria (dependent variable: relfd)
-2 Restricted Log Likelihood             64899.758
Akaike's Information Criterion (AIC)     64907.758
Hurvich and Tsai's Criterion (AICC)      64907.763
Bozdogan's Criterion (CAIC)              64940.134
Schwarz's Bayesian Criterion (BIC)       64936.134

The information criteria are displayed in smaller-is-better forms.

834

Fixed Effects

• Not useful here, useful for interactions

Type III Tests of Fixed Effects (dependent variable: relfd)

Source      Numerator df   Denominator df   F          Sig.
Intercept   1              741              3251.877   .000
time        1              741.000          2.550      .111

835

Estimates of Fixed Effects

• Interpreted as regression equation

Estimates of Fixed Effects (dependent variable: relfd)

Parameter   Estimate   Std. Error   t        Sig.   95% CI Lower   95% CI Upper
Intercept   21.90      .38          57.025   .000   21.15          22.66
time        -.06       .04          -1.597   .111   -.14           .01

836

Covariance Parameters

Estimates of Covariance Parameters (dependent variable: relfd)

Parameter                            Estimate   Std. Error
Residual                             64.11577   1.0526353
Intercept + time [subject = id]
  UN (1,1)                           85.16791   5.7003732
  UN (2,1)                           -4.53179   .5067146
  UN (2,2)                           .7678319   .0636116

837

Change Covtype to VC

• We know that this is wrong– The covariance of the effects was

statistically significant– Can also see if it was wrong by

comparing information criteria• We have removed a parameter from

the model– Model is worse– Model is more parsimonious

• Is it much worse, given the increase in parsimony?

838

Information Criteria (smaller-is-better forms)

                                         UN Model      VC Model
-2 Restricted Log Likelihood             64899.758     65041.891
Akaike's Information Criterion (AIC)     64907.758     65047.891
Hurvich and Tsai's Criterion (AICC)      64907.763     65047.894
Bozdogan's Criterion (CAIC)              64940.134     65072.173
Schwarz's Bayesian Criterion (BIC)       64936.134     65069.173

Lower is better.

839

Adding Bits

• So far, all a bit dull• We want some more predictors, to

make it more exciting– E.g. female– Add: relfd with time female /fixed = time female time*female

• What does the interaction term represent?

840

Extending Models

• Models can be extended– Any kind of regression can be used

• Logistic, multinomial, Poisson, etc

– More levels• Children within classes within schools• Measures within people within classes within prisons

– Multiple membership / cross classified models• Children within households and classes, but

households not nested within class

• Need a different program– E.g. MlwiN

841

MlwiN Example (very quickly)

842

Books

Singer, JD and Willett, JB (2003). Applied Longitudinal Data Analysis: Modeling Change and Event Occurrence. Oxford, Oxford University Press.

Examples at:http://www.ats.ucla.edu/stat/SPSS/examples/alda/default.htm

843

The End
