1
Theory of Regression
2
The Course
• 16 (or so) lessons– Some flexibility
• Depends how we feel• What we get through
3
Part 1: Theory of Regression
1. Models in statistics
2. Models with more than one parameter: regression
3. Why regression?
4. Samples to populations
5. Introducing multiple regression
6. More on multiple regression
4
Part 2: Application of Regression
7. Categorical predictor variables
8. Assumptions in regression analysis
9. Issues in regression analysis
10. Non-linear regression
11. Moderators (interactions) in regression
12. Mediation and path analysis
Part 3: Advanced Types of Regression
13. Logistic regression
14. Poisson regression
15. Introducing SEM
16. Introducing longitudinal multilevel models
5
House Rules
• Jeremy must remember– Not to talk too fast
• If you don’t understand– Ask – Any time
• If you think I’m wrong– Ask. (I’m not always right)
6
Learning New Techniques
• Best kind of data to learn a new technique– Data that you know well, and understand
• Your own data– In computer labs (esp later on)– Use your own data if you like
• My data – I’ll provide you with– Simple examples, small sample sizes
• Conceptually simple (even silly)
7
Computer Programs
• SPSS– Mostly
• Excel – For calculations
• G*Power
• Stata (if you like)
• R (because it’s flexible and free)
• Mplus (SEM, ML?)
• AMOS (if you like)
8
9
10
Lesson 1: Models in statistics
Models, parsimony, error, mean, OLS estimators
11
What is a Model?
12
What is a model?
• Representation– Of reality– Not reality
• A model aeroplane represents a real aeroplane – if the model aeroplane = the real aeroplane, it isn’t a model
13
• Statistics is about modelling – representing and simplifying
• Sifting – what is important from what is not important
• Parsimony – in statistical models we seek parsimony – parsimony = simplicity
14
Parsimony in Science
• A model should be:– 1: able to explain a lot– 2: use as few concepts as possible
• More it explains– The more you get
• Fewer concepts– The lower the price
• Is it worth paying a higher price for a better model?
15
A Simple Model
• Height of five individuals– 1.40m– 1.55m– 1.80m– 1.62m– 1.63m
• These are our DATA
16
A Little Notation
Y    The (vector of) data that we are modelling.
Yi   The ith observation in our data.

For example, if Y = (4, 5, 6, 7, 8), then Y2 = 5.
17
Greek letters represent the true value in the population.
β (beta)   Parameters in our model (population values).
β0   The value of the first parameter of our model, in the population.
βj   The value of the jth parameter of our model, in the population.
ε (epsilon)   The error in the population model.
18
Normal letters represent the values in our sample. These are sample statistics, which are used to estimate population parameters.
b   A parameter in our model (a sample statistic).
e The error in our sample.
Y The data in our sample which we are trying to model.
19
Symbols on top change the meaning.
Y    The data in our sample which we are trying to model (repeated).
Ŷi   The estimated value of Y, for the ith case.
Ȳ    The mean of Y.
20
So β̂1 = b1.
I will use b1 (because it is easier to type).
21
• Not always that simple – some texts and computer programs use:
b = the parameter estimate (as we have used)
β (beta) = the standardised parameter estimate
SPSS does this.
22
A capital letter is the set (vector) of parameters/statistics
B Set of all parameters (b0, b1, b2, b3 … bp)
Rules are not used very consistently (even by me).Don’t assume you know what someone means, without checking.
23
• We want a model – To represent those data
• Model 1:– 1.40m, 1.55m, 1.80m, 1.62m, 1.63m– Not a model
• A copy
– VERY unparsimonious
• Data: 5 statistics• Model: 5 statistics
– No improvement
24
• Model 2:– The mean (arithmetic mean)– A one parameter model

  Ŷi = b0 = ΣYi / n = Ȳ
25
• Which, because we are lazy, can be written as

  Ȳ = ΣY / n
26
The Mean as a Model
27
The (Arithmetic) Mean
• We all know the mean– The ‘average’– Learned about it at school– Forget (didn’t know) about how clever the
mean is
• The mean is:– An Ordinary Least Squares (OLS) estimator– Best Linear Unbiased Estimator (BLUE)
28
Mean as OLS Estimator
• Going back a step or two• MODEL was a representation of DATA
– We said we want a model that explains a lot– How much does a model explain?
DATA = MODEL + ERROR
ERROR = DATA − MODEL
– We want a model with as little ERROR as possible
29
• What is error?

Data (Y) | Model (b0 = mean) | Error (e)
1.40     | 1.60              | −0.20
1.55     | 1.60              | −0.05
1.80     | 1.60              | 0.20
1.62     | 1.60              | 0.02
1.63     | 1.60              | 0.03
30
• How can we calculate the ‘amount’ of error?
• Sum of errors

  ERROR = Σei = Σ(Yi − Ŷ) = Σ(Yi − b0)
        = −0.20 − 0.05 + 0.20 + 0.02 + 0.03 = 0
31
– 0 implies no ERROR
• Not the case
– Knowledge about ERROR is useful
• As we shall see later
32
• Sum of absolute errors – ignore signs

  ERROR = Σ|ei| = Σ|Yi − Ŷ| = Σ|Yi − b0|
        = 0.20 + 0.05 + 0.20 + 0.02 + 0.03 = 0.50
33
• Are small and large errors equivalent?– One error of 4– Four errors of 1
– The same?
– What happens with different data?• Y = (2, 2, 5)
– b0 = 2
– Not very representative
• Y = (2, 2, 4, 4)– b0 = any value from 2 - 4
– Indeterminate• There are an infinite number of solutions which would
satisfy our criteria for minimum error
34
• Sum of squared errors (SSE)

  ERROR = Σei² = Σ(Yi − Ŷ)² = Σ(Yi − b0)²
        = (−0.20)² + (−0.05)² + 0.20² + 0.02² + 0.03² = 0.08
35
• Determinate– Always gives one answer
• If we minimise SSE– Get the mean
• Shown in graph– SSE plotted against b0
– Min value of SSE occurs when
– b0 = mean
36
[Figure: SSE plotted against candidate values of b0 (1.0 to 2.0). The curve is a parabola whose minimum falls at b0 = 1.60, the mean.]
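A quick numerical check of this (a minimal R sketch, using the five heights from earlier):

  y <- c(1.40, 1.55, 1.80, 1.62, 1.63)     # our DATA

  sse <- function(b0) sum((y - b0)^2)      # error for a one-parameter model

  optimise(sse, interval = c(1, 2))$minimum   # 1.60
  mean(y)                                     # also 1.60 - the mean minimises SSE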
37
The Mean as an OLS Estimate
38
Mean as OLS Estimate
• The mean is an Ordinary Least Squares (OLS) estimate– As are lots of other things
• This is exciting because– OLS estimators are BLUE– Best Linear Unbiased Estimators– Proven with Gauss-Markov Theorem
• Which we won’t worry about
39
BLUE Estimators
• Best
– Minimum variance (of all possible unbiased estimators)
– Narrower sampling distribution than other estimators • e.g. median, mode
• Linear – Linear predictions – For the mean, a linear (straight, flat) line:

  Ŷ = Ȳ
40
• Unbiased– Centred around true (population) values– Expected value = population value– Minimum is biased.
• Minimum in samples > minimum in population
• Estimators– Errrmm… they are estimators
• Also consistent – As the sample size approaches infinity, the estimate gets closer to the population value – The sampling variance shrinks
41
SSE and the Standard Deviation
• Tying up a loose end
  SSE = Σ(Yi − Ŷ)²

  s = √( Σ(Yi − Ŷ)² / n )        (biased)

  s = √( Σ(Yi − Ŷ)² / (n − 1) )  (corrected)
42
• SSE closely related to SD
• Sample standard deviation, s – a biased estimator of the population SD if we divide by N
• Population standard deviation, σ – We need to know the mean to calculate the SD
• Using the sample mean reduces N by 1 – hence divide by N − 1, not N – like losing one df
43
Proof
• That the mean minimises SSE– Not that difficult– As statistical proofs go
• Available in– Maxwell and Delaney – Designing
experiments and analysing data– Judd and McClelland – Data Analysis
(out of print?)
44
What’s a df?
• The number of parameters free to vary– When one is fixed
• Term comes from engineering– Movement available to structures
45
[Diagrams: with 0 df there is no variation available; with 1 df, fix one corner and the shape is fixed.]
46
Back to the Data
• Mean has 5 (N) df – 1st moment
• SD has N − 1 df – Mean has been fixed – 2nd moment – Can think of it as the amount cases vary away from the mean
47
While we are at it …
• Skewness has N − 2 df – 3rd moment
• Kurtosis has N − 3 df – 4th moment
48
Parsimony and df
• Number of df remaining– Measure of parsimony
• Model which contained all the data– Has 0 df
– Not a parsimonious model
• Normal distribution – Can be described in terms of mean and SD
• 2 parameters
– (z-distribution: 0 parameters)
49
Summary of Lesson 1
• Statistics is about modelling DATA– Models have parameters– Fewer parameters, more parsimony, better
• Models need to minimise ERROR– Best model, least ERROR – Depends on how we define ERROR – If we define error as sum of squared
deviations from predicted value– Mean is best MODEL
50
51
52
Lesson 2: Models with one more parameter -
regression
53
In Lesson 1 we said …
• Use a model to predict and describe data– Mean is a simple, one parameter model
54
More Models
Slopes and Intercepts
55
More Models• The mean is OK
– As far as it goes– It just doesn’t go very far– Very simple prediction, uses very little
information• We often have more information
than that– We want to use more information
than that
56
House Prices• In the UK, two of the largest
lenders (Halifax and Nationwide) compile house price indices– Predict the price of a house– Examine effect of different
circumstances
• Look at change in prices– Guides legislation
• E.g. interest rates, town planning
57
Predicting House Prices
Beds | £ (000s)
1    | 77
2    | 74
1    | 88
3    | 62
5    | 90
5    | 136
2    | 35
5    | 134
4    | 138
1    | 55
58
One Parameter Model
• The mean

  Ŷ = b0 = Ȳ = 88.9,  SSE = 11806.9

“How much is that house worth?” “£88,900” – we use 1 df to say that.
59
Adding More Parameters• We have more information than
this– We might as well use it– Add a linear function of number of
bedrooms (x1)
  Ŷ = b0 + b1x1
60
Alternative Expression
• Estimate of Y (expected value of Y):

  Ŷ = b0 + b1x1

• Value of Y:

  Yi = b0 + b1xi1 + ei
61
Estimating the Model
• We can estimate this model in four different, equivalent ways – provides more than one way of thinking about it:
1. Estimating the slope which minimises SSE
2. Examining the proportional reduction in SSE
3. Calculating the covariance
4. Looking at the efficiency of the predictions
62
Estimate the Slope to Minimise SSE
63
Estimate the Slope
• Stage 1– Draw a scatterplot– x-axis at mean
• Not at zero
• Mark errors on it– Called ‘residuals’– Sum and square these to find SSE
64
[Scatterplot: price (£000s, 0–160) against bedrooms (1.5–5.5), with a horizontal line at the mean.]
65
[The same scatterplot with the residuals from the mean drawn in; these are squared and summed to give SSE.]
66
• Add another slope to the chart– Redraw residuals– Recalculate SSE– Move the line around to find slope
which minimises SSE
• Find the slope
67
• First attempt:
68
• Any straight line can be defined with two parameters
– The location (height) of the line • b0 • Sometimes called the constant or intercept
– The gradient of the slope • b1
69
• Gradient
[Diagram: for every 1 unit along the x-axis, the line rises b1 units.]
70
• Height
[Diagram: the line sits b0 units up the y-axis.]
71
• Height• If we fix slope to zero
– Height becomes mean
– Hence mean is b0
• Height is defined as the point that the slope hits the y-axis– The constant– The y-intercept
72
• Why the constant?– b0x0
– Where x0 is 1.00 for every case• i.e. x0 is constant
• Implicit in SPSS– Some packages
force you to make it explicit
– (Later on we’ll need to make it explicit)
beds (x1) | x0 | £ (000s)
1 | 1 | 77
2 | 1 | 74
1 | 1 | 88
3 | 1 | 62
5 | 1 | 90
5 | 1 | 136
2 | 1 | 35
5 | 1 | 134
4 | 1 | 138
1 | 1 | 55
73
• Why the intercept?– Where the regression line intercepts
the y-axis– Sometimes called y-intercept
74
Finding the Slope
• How do we find the values of b0 and b1? – To start with, we jiggle the values to find the best estimates which minimise SSE – An iterative approach
• Computer intensive – used to matter, doesn’t really any more
• (With fast computers and sensible search algorithms – more on that later)
75
• Start with– b0=88.9 (mean)– b1=10 (nice round number)
• SSE = 14948 – worse than it was
– b0=86.9, b1=10, SSE=13828
– b0=66.9, b1=10, SSE=7029
– b0=56.9, b1=10, SSE=6628
– b0=46.9, b1=10, SSE=8228
– b0=51.9, b1=10, SSE=7178
– b0=51.9, b1=12, SSE=6179
– b0=46.9, b1=14, SSE=5957
– …
76
• Quite a long time later– b0 = 46.000372
– b1 = 14.79182
– SSE = 5921
• Gives the position of the – Regression line (or)– Line of best fit
• Better than guessing
• Not necessarily the only method– But it is OLS, so it is the best (it is
BLUE)
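This jiggling can be automated; a minimal sketch in R with the house-price data (optim() does the iterative search):

  beds  <- c(1, 2, 1, 3, 5, 5, 2, 5, 4, 1)
  price <- c(77, 74, 88, 62, 90, 136, 35, 134, 138, 55)   # in £000s

  sse <- function(b) sum((price - (b[1] + b[2] * beds))^2)   # b = c(b0, b1)

  fit <- optim(c(88.9, 10), sse)   # start at the mean and a round number
  fit$par     # about 46.0 and 14.8
  fit$value   # SSE about 5921

  coef(lm(price ~ beds))   # the closed-form OLS answer, for comparison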
77
[Scatterplot: price (£000s) against number of bedrooms, showing the actual prices and the predicted prices along the regression line.]
78
• We now know – A house with no bedrooms is worth
£46,000 (??!)– Adding a bedroom adds £15,000
• Told us two things– Don’t extrapolate to meaningless
values of x-axis– Constant is not necessarily useful
• It is necessary to estimate the equation
79
Standardised Regression Line
• One big but:– Scale dependent
• Values change – £ to €, inflation
• Scales change– £, £000, £00?
• Need to deal with this
80
• Don’t express the slope in ‘raw’ units – express it in SD units
– SD of x1 = 1.72
– SD of y = 36.21
• b1 = 14.79
• If we increase x1 by 1 unit, Ŷ increases by 14.79 units

  14.79 units = (14.79 / 36.21) SDs = 0.408 SDs
81
• Similarly, 1 unit of x1 = 1/1.72 SDs of x1
– So increasing x1 by 1 SD means increasing it by 1.72 units
– Ŷ then increases by 14.79 × 1.72 = 25.44 units
• Put them both together – express that change in SDs of y:

  β1 = b1 × (σx1 / σy)
82
• The standardised regression line – the change (in SDs) in Ŷ associated with a change of 1 SD in x1
• A different route to the same answer – standardise both variables (divide by their SDs) – then find the line of best fit

  β1 = 14.79 × 1.72 / 36.21 = 0.706
83
• The standardised regression line has a special name
The Correlation Coefficient(r)
(r stands for ‘regression’, but more on that later)
• Correlation coefficient is a standardised regression slope– Relative change, in terms of SDs
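Checking the equivalence numerically (a minimal R sketch, continuing with beds and price from the sketch above; sd() uses the N − 1 denominator, matching the SDs quoted):

  b1 <- coef(lm(price ~ beds))[2]   # 14.79

  b1 * sd(beds) / sd(price)   # 0.706 - the standardised slope
  cor(beds, price)            # 0.706 - the correlation coefficient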
84
Proportional Reduction in Error
85
Proportional Reduction in Error
• We might be interested in the level of improvement of the model– How much less error (as proportion)
do we have– Proportional Reduction in Error (PRE)
• Mean only– Error(model 0) = 11806
• Mean + slope– Error(model 1) = 5921
86
  PRE = (ERROR(0) − ERROR(1)) / ERROR(0)
      = 1 − ERROR(1) / ERROR(0)
      = 1 − 5921 / 11806
      = 0.4984
87
• But we squared all the errors in the first place – so we could take the square root – (it’s a shoddy excuse, but it makes the point)

  √0.4984 = 0.706

• This is the correlation coefficient
• The correlation coefficient is the square root of the proportion of variance explained
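A small R sketch of the PRE calculation, using the same house-price model:

  error0 <- sum((price - mean(price))^2)     # ERROR(0): mean-only model, 11807
  error1 <- sum(resid(lm(price ~ beds))^2)   # ERROR(1): mean + slope, 5921

  (error0 - error1) / error0         # PRE = 0.498
  sqrt((error0 - error1) / error0)   # 0.706 - the correlation again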
88
Standardised Covariance
89
Standardised Covariance• We are still iterating
– Need a ‘closed-form’– Equation to solve to get the parameter
estimates
• Answer is a standardised covariance– A variable has variance– Amount of ‘differentness’
• We have used SSE so far
90
• SSE varies with N– Higher N, higher SSE
• Divide by N– Gives SSE per person– (Actually N – 1, we have lost a df to
the mean)• The variance• Same as SD2
– We thought of SSE as a scattergram • Y plotted against X
– (repeated image follows)
91
[Scatterplot repeated: price against bedrooms.]
92
• Or we could plot Y against Y– Axes meet at the mean (88.9)– Draw a square for each point– Calculate an area for each square– Sum the areas
• Sum of areas – SSE
• Sum of areas divided by N– Variance
93
Plot of Y against Y
[Scatterplot: Y plotted against Y, both axes 0–180; the points lie on the diagonal.]
94
[The same Y-against-Y plot with a square drawn at each point, e.g.:
35 − 88.9 = −53.9 on both axes: area = (−53.9) × (−53.9) = 2905.21
138 − 88.9 = 49.1 on both axes: area = 49.1 × 49.1 = 2410.81]
95
• What if we do the same procedure– Instead of Y against Y– Y against X
• Draw rectangles (not squares)• Sum the area• Divide by N - 1• This gives us the variance of x with
y– The Covariance – Shortened to Cov(x, y)
96
97
[Scatterplot of Y against X with a rectangle drawn for each point, e.g.:
55 − 88.9 = −33.9 and 1 − 3 = −2: area = (−33.9) × (−2) = 67.8
138 − 88.9 = 49.1 and 4 − 3 = 1: area = 49.1 × 1 = 49.1]
98
• More formally (and easily), we can state what we are doing as an equation, where Cov(x, y) is the covariance:

  Cov(x, y) = Σ(x − x̄)(y − ȳ) / (N − 1)

• Cov(x, y) = 44.2
• What do points in different sectors do to the covariance?
99
• Problem with the covariance– Tells us about two things– The variance of X and Y– The covariance
• Need to standardise it– Like the slope
• Two ways to standardise the covariance– Standardise the variables first
• Subtract from mean and divide by SD
– Standardise the covariance afterwards
100
• First approach– Much more computationally
expensive• Too much like hard work to do by hand
– Need to standardise every value• Second approach
– Much easier– Standardise the final value only
• Need the combined variance – multiply the two variances – and take the square root (because two variances were multiplied in the first place)
101
• Standardised covariance

  r = Cov(x, y) / √(Var(x) × Var(y))
    = 44.2 / √(2.99 × 1311.9)
    = 0.706
102
• The correlation coefficient – a standardised covariance is a correlation coefficient:

  r = covariance / √(variance × variance)
103
• Expanded …

  r = [ Σ(x − x̄)(y − ȳ) / (N − 1) ] / √[ (Σ(x − x̄)² / (N − 1)) × (Σ(y − ȳ)² / (N − 1)) ]
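The closed form is one line in R (a sketch; cov(), var() and cor() all use the N − 1 denominator):

  cov(beds, price)                                  # 44.2
  cov(beds, price) / sqrt(var(beds) * var(price))   # 0.706
  cor(beds, price)                                  # the same: 0.706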
104
• This means …– We now have a closed form equation
to calculate the correlation– Which is the standardised slope– Which we can use to calculate the
unstandardised slope
105
We know that:

  r = b1 × (σx1 / σy)

so:

  b1 = r × (σy / σx1)
106
• So the value of b1 is the same as from the iterative approach:

  b1 = r × (σy / σx1)
     = 0.706 × 36.21 / 1.72
     = 14.79
107
• The intercept – just while we are at it
• The variables are centred at zero – we subtracted the mean from both variables – the intercept is then zero, because the axes cross at the means
108
• Add the mean of y to the constant – adjusts for centring y
• Subtract the mean of x – but not the whole mean of x – it needs to be corrected for the slope:

  c = ȳ − b1x̄
    = 88.9 − 14.79 × 2.9
    = 46.00

• Naturally, the same
109
Accuracy of Prediction
110
One More (Last One)• We have one more way to
calculate the correlation– Looking at the accuracy of the
prediction
• Use the parameters– b0 and b1
– To calculate a predicted value for each case
111
• Plot actual price against predicted price – from the model

Beds | Actual Price | Predicted Price
1 | 77  | 60.80
2 | 74  | 75.59
1 | 88  | 60.80
3 | 62  | 90.38
5 | 90  | 119.96
5 | 136 | 119.96
2 | 35  | 75.59
5 | 134 | 119.96
4 | 138 | 105.17
1 | 55  | 60.80
112
[Scatterplot: predicted value against actual value.]
113
• r = 0.706– The correlation
• Seems a futile thing to do– And at this stage, it is– But later on, we will see why
114
Some More Formulae
• For hand calculation (x and y as deviations from their means):

  r = Σxy / √(Σx² × Σy²)

• Point biserial:

  r = (M1 − M0) × √(PQ) / sdy
115
• Phi (φ) – used for 2 dichotomous variables

              | Vote P | Vote Q
Homeowner     | A: 19  | B: 54
Not homeowner | C: 60  | D: 53

  rφ = (BC − AD) / √((A+B)(C+D)(A+C)(B+D))
116
• Problem with the phi correlation – unless Px = Py (or Px = 1 − Py) • the maximum (absolute) value is < 1.00 • the tetrachoric correlation can be used instead
• Rank (Spearman) correlation – used where data are ranked:

  r = 1 − 6Σd² / (n(n² − 1))
117
Summary• Mean is an OLS estimate
– OLS estimates are BLUE
• Regression line– Best prediction of DV from IV– OLS estimate (like mean)
• Standardised regression line– A correlation
118
• Four ways to think about a correlation– 1. Standardised regression line– 2. Proportional Reduction in Error
(PRE)– 3. Standardised covariance– 4. Accuracy of prediction
119
120
121
Lesson 3: Why Regression?
A little aside, where we look at why regression has such a
curious name.
122
Regression
The act or an instance of regression; reversion; return towards the
mean; return to an earlier stage of development, as in an adult’s or an adolescent’s behaving like a child(From Latin gradi, to go)
• So why give this name to a statistical technique which is about prediction and explanation?
123
• Francis Galton – Charles Darwin’s cousin– Studying heritability
• Tall fathers have shorter sons• Short fathers have taller sons
– ‘Filial regression toward mediocrity’ – Regression to the mean
124
• Galton thought this was biological fact– Evolutionary basis?
• Then did the analysis backward– Tall sons have shorter fathers– Short sons have taller fathers
• Regression to the mean– Not biological fact, statistical artefact
125
Other Examples
• Secrist (1933): The Triumph of Mediocrity in Business
• Second albums often tend to not be as good as first
• Sequel to a film is not as good as the first one
• ‘Curse of Athletics Weekly’• Parents think that punishing bad
behaviour works, but rewarding good behaviour doesn’t
126
Pair Link Diagram
• An alternative to a scatterplot
[Diagram: two vertical axes, x and y; each case is a line joining its x score to its y score.]
127
[Pair link diagram, r = 1.00: all the lines are parallel.]
128
[Pair link diagram, r = 0.00: the lines cross haphazardly.]
129
From Regression to Correlation
• Where do we predict an individual’s score on y will be, based on their score on x?– Depends on the correlation
• r = 1.00 – we know exactly where they will be
• r = 0.00 – we have no idea• r = 0.50 – we have some idea
130
[Diagram, r = 1.00: a case starts at a point on x and will end up at the same relative point on y.]
131
[Diagram, r = 0.00: a case starts at a point on x and could end up anywhere on y.]
132
[Diagram, r = 0.50: a case starts at a point on x and will probably end up somewhere near the middle of y.]
133
Galton Squeeze Diagram
• Don’t show individuals– Show groups of individuals, from the
same (or similar) starting point– Shows regression to the mean
134
[Galton squeeze diagram, r = 0.00: groups start at the extremes of x and all end at the mean of y.]
135
[Galton squeeze diagram, r = 0.50: groups move halfway towards the mean.]
136
[Galton squeeze diagram, r = 1.00: groups stay where they started.]
137
[Diagram: for every 1 unit on x, a case moves r units on y.]
• Correlation is the amount of regression that doesn’t occur
138
[Diagrams: no regression, r = 1.00; some regression, r = 0.50; lots of (maximum) regression, r = 0.00.]
141
Formula

  ẑy = r × zx
142
Conclusion
• Regression towards the mean is a statistical necessity

  regression = perfection − correlation

• Very non-intuitive
• Interest in regression and correlation came from examining the extent of regression towards the mean – by Pearson, who worked with Galton – and the technique was stuck with the curious name
• See also Paper B3
143
144
145
Lesson 4: Samples to Populations – Standard Errors and Statistical
Significance
146
The Problem• In Social Sciences
– We investigate samples• Theoretically
– Randomly taken from a specified population
– Every member has an equal chance of being sampled
– Sampling one member does not alter the chances of sampling another
• Not the case in (say) physics, biology, etc.
147
Population
• But it’s the population that we are interested in – not the sample
• Population statistics are represented with Greek letters – a hat means ‘estimate’:

  b = β̂,  x̄ = μ̂x
148
• Sample statistics (e.g. mean) estimate population parameters
• Want to know– Likely size of the parameter– If it is > 0
149
Sampling Distribution
• We need to know the sampling distribution of a parameter estimate– How much does it vary from sample to
sample
• If we make some assumptions– We can know the sampling distribution
of many statistics– Start with the mean
150
Sampling Distribution of the Mean
• Given– Normal distribution– Random sample– Continuous data
• Mean has a known sampling distribution – repeatedly sampling will give a known distribution of means – centred around the true (population) mean (μ)
151
Analysis Example: Memory• Difference in memory for different
words
– 10 participants given a list of 30 words to learn, and then tested
– Two types of word
• Abstract: e.g. love, justice
• Concrete: e.g. carrot, table
152
Concrete | Abstract | Diff (x)
12 | 4  | 8
11 | 7  | 4
4  | 6  | −2
9  | 12 | −3
8  | 6  | 2
12 | 10 | 2
9  | 8  | 1
8  | 5  | 3
12 | 10 | 2
8  | 4  | 4

N = 10,  x̄ = 2.1,  s = 3.11
153
Confidence Intervals
• This means– If we know the mean in our sample– We can estimate where the mean in
the population (μ) is likely to be
• Using– The standard error (se) of the mean– Represents the standard deviation of
the sampling distribution of the mean
154
[Normal curve: 1 SD either side of the mean contains 68%; almost 2 SDs contain 95%.]
155
• We know the sampling distribution of the mean – t distributed – normal with large N (>30)
• So we know the range within which means from other samples will fall – and therefore the likely range of μ:

  se(x̄) = s / √n
156
• Two implications of equation– Increasing N decreases SE
• But only a bit
– Decreasing SD decreases SE • Calculate Confidence Intervals
– From standard errors• 95% is a standard level of CI
– 95% of samples the true mean will lie within the 95% CIs
– In large samples: 95% CI = 1.96 SE– In smaller samples: depends on t
distribution (df=N-1=9)
157
  x̄ = 2.1,  s = 3.11,  N = 10

  se(x̄) = 3.11 / √10 = 0.98
158
  95% CI = 2.26 × 0.98 = 2.22

  2.1 − 2.22 ≤ μ ≤ 2.1 + 2.22
  −0.12 ≤ μ ≤ 4.32
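R reproduces all of these numbers in one call (a sketch using the difference scores from the memory example):

  diffs <- c(8, 4, -2, -3, 2, 2, 1, 3, 2, 4)

  t.test(diffs)   # mean = 2.1, 95% CI -0.12 to 4.32, t = 2.14, df = 9, p = 0.06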
159
What is a CI?
• (For 95% CI): • 95% chance that the true
(population) value lies within the confidence interval?
• 95% of samples, true mean will land within the confidence interval?
160
Significance Test
• Probability that μ is a certain value – almost always 0 • Doesn’t have to be though
• We want to test the hypothesis that the difference is equal to 0 – i.e. find the probability of this difference occurring in our sample IF μ = 0 – (Not the same as the probability that μ = 0)
161
• Calculate the SE, and then t – t has a known sampling distribution – we can test the probability that a certain value is included:

  t = x̄ / se(x̄)
    = 2.1 / 0.98
    = 2.14,  p = 0.061
162
Other Parameter Estimates
• Same approach– Prediction, slope, intercept, predicted
values– At this point, prediction and slope are
the same• Won’t be later on
• We will look at one predictor only– More complicated with > 1
163
Testing the Degree of Prediction
• Prediction is correlation of Y with Ŷ– The correlation – when we have one IV
• Use F, rather than t• Started with SSE for the mean only
– This is SStotal
– Divide this into SSresidual
– SSregression
• SStot = SSreg + SSres
164
  F = (SSreg / df1) / (SSres / df2)

  df1 = k,  df2 = N − k − 1
165
• Back to the house prices– Original SSE (SStotal) = 11806
– SSresidual = 5921• What is left after our model
– SSregression = 11806 – 5921 = 5885• What our model explains
• Slope = 14.79• Intercept = 46.0• r = 0.706
166
  F = (SSreg / df1) / (SSres / df2)
    = (5885 / 1) / (5921 / (10 − 1 − 1))
    = 7.95

  df1 = k = 1,  df2 = N − k − 1 = 8
167
• F = 7.95, df = 1, 8, p = 0.02– Can reject H0
• H0: Prediction is not better than chance
– A significant effect
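In R the same test drops out of anova() and summary() (a sketch, continuing the house-price example):

  fit <- lm(price ~ beds)

  anova(fit)     # SSreg = 5885, SSres = 5921, F = 7.95 on 1 and 8 df, p = 0.02
  summary(fit)   # b0 = 46.0, b1 = 14.79, R-squared = 0.498 (R = 0.706)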
168
Statistical Significance:What does a p-value (really)
mean?
169
A Quiz
• Six questions, each true or false• Write down your answers (if you like)
• An experiment has been done. Carried out perfectly. All assumptions perfectly satisfied. Absolutely no problems.
• P = 0.01– Which of the following can we say?
170
1. You have absolutely disproved the null hypothesis (that is, there is no difference between the population means).
171
2. You have found the probability of the null hypothesis being true.
172
3. You have absolutely proved your experimental hypothesis (that there is a difference between the population means).
173
4. You can deduce the probability of the experimental hypothesis being true.
174
5. You know, if you decide to reject the null hypothesis, the probability that you are making the wrong decision.
175
6. You have a reliable experimental finding in the sense that if, hypothetically, the experiment were repeated a great number of times, you would obtain a significant result on 99% of occasions.
176
OK, What is a p-value
• Cohen (1994)“[a p-value] does not tell us what we
want to know, and we so much want to know what we want to
know that, out of desperation, we nevertheless believe it does” (p
997).
177
OK, What is a p-value
• Sorry, that didn’t answer the question
• It’s the probability of obtaining a result as extreme as, or more extreme than, the result we have in the study, given that the null hypothesis is true
• Not the probability that the null hypothesis is true
178
A Bit of Notation
• Not because we like notation– But we have to say a lot less
• Probability – P• Null hypothesis is true – H• Result (data) – D• Given - |
179
What’s a P Value
• P(D|H) – the probability of the data occurring if the null hypothesis is true
• Not
• P(H|D) – the probability that the null hypothesis is true, given that we have the data
• P(H|D) ≠ P(D|H)
180
• What is the probability you are prime minister – given that you are British – P(M|B) – very low
• What is probability you are British– Given you are prime minister– P(B|M)– Very high
• P(M|B) ≠ P(B|M)
181
• There’s been a murder– Someone bumped off a statto for talking
too much
• The police have DNA• The police have your DNA
– They match(!)
• DNA matches 1 in 1,000,000 people• What’s the probability you didn’t do
the murder, given the DNA match (H|D)
182
• Police say:– P(D|H) = 1/1,000,000
• Luckily, you have Jeremy on your defence team
• We say:– P(D|H) ≠ P(H|D)
• Probability that someone matches the DNA, who didn’t do the murder– Incredibly high
183
Back to the Questions
• Haller and Kraus (2002)– Asked those questions of groups in
Germany– Psychology Students– Psychology lecturers and professors
(who didn’t teach stats)– Psychology lecturers and professors
(who did teach stats)
184
1. You have absolutely disproved the null hypothesis (that is, there is no difference between the population means).
• True• 34% of students • 15% of professors/lecturers, • 10% of professors/lecturers teaching
statistics
• False• We have found evidence against
the null hypothesis
185
2. You have found the probability of the null hypothesis being true.
– 32% of students – 26% of professors/lecturers– 17% of professors/lecturers teaching
statistics
• False• We don’t know
186
3. You have absolutely proved your experimental hypothesis (that there is a difference between the population means).
– 20% of students – 13% of professors/lecturers– 10% of professors/lecturers teaching
statistics
• False
187
4. You can deduce the probability of the experimental hypothesis being true.
– 59% of students– 33% of professors/lecturers– 33% of professors/lecturers teaching
statistics
• False
188
5. You know, if you decide to reject the null hypothesis, the probability that you are making the wrong decision.
• 68% of students• 67% of professors/lecturers• 73% of professors/lecturers
teaching statistics
• False• Can be worked out
– P(replication)
189
6. You have a reliable experimental finding in the sense that if, hypothetically, the experiment were repeated a great number of times, you would obtain a significant result on 99% of occasions.
– 41% of students– 49% of professors/lecturers– 37% of professors/lecturers
teaching statistics • False• Another tricky one
– It can be worked out
190
One Last Quiz
• I carry out a study– All assumptions perfectly satisfied– Random sample from population– I find p = 0.05
• You replicate the study exactly– What is probability you find p < 0.05?
191
• I carry out a study– All assumptions perfectly satisfied– Random sample from population– I find p = 0.01
• You replicate the study exactly– What is probability you find p < 0.05?
192
• Significance testing creates boundaries and gaps where none exist.
• Significance testing means that we find it hard to build upon knowledge– we don’t get an accumulation of
knowledge
193
• Yates (1951) "the emphasis given to formal tests of
significance ... has resulted in ... an undue concentration of effort by mathematical statisticians on investigations of tests of
significance applicable to problems which are of little or no practical importance ... and ... it has caused scientific research workers to pay undue attention to the
results of the tests of significance ... and too little to the estimates of the magnitude
of the effects they are investigating
194
Testing the Slope
• Same idea as with the mean– Estimate 95% CI of slope– Estimate significance of difference
from a value (usually 0)
• Need to know the sd of the slope– Similar to SD of the mean
195
  s(y.x) = √( Σ(Y − Ŷ)² / (N − k − 1) )
         = √( SSres / (N − k − 1) )
         = √( 5921 / 8 ) = 27.2
196
• Similar to equation for SD of mean• Then we need standard error
- Similar (ish)• When we have standard error
– Can go on to 95% CI– Significance of difference
197
  se(b) = s(y.x) / √( Σ(x − x̄)² )
        = 27.2 / √26.9 = 5.24
198
• Confidence limits
• 95% CI – t distribution with N − k − 1 = 8 df gives t = 2.31 – CI = 5.24 × 2.31 = 12.1
• 95% confidence limits:

  14.8 − 12.1 ≤ β ≤ 14.8 + 12.1
  2.7 ≤ β ≤ 26.9
199
• Significance of difference from zero – i.e. the probability of getting this result if β = 0
• Not the probability that β = 0

  t = b / se(b) = 14.79 / 5.24 = 2.82

  df = N − k − 1 = 8,  p = 0.02

• This probability is (of course) the same as the value for the prediction
200
Testing the Standardised Slope (Correlation)
• The correlation is bounded between −1 and +1 – it does not have a symmetrical sampling distribution, except around 0
• Need to transform it – Fisher z′ transformation – approximately normal:

  z′ = 0.5 × [ln(1 + r) − ln(1 − r)]

  SE(z′) = 1 / √(n − 3)
201
  z′ = 0.5 × [ln(1 + 0.706) − ln(1 − 0.706)] = 0.879

  SE(z′) = 1 / √(10 − 3) = 0.38

• 95% CIs:
– 0.879 − 1.96 × 0.38 = 0.13
– 0.879 + 1.96 × 0.38 = 1.62
202
• Transform back to a correlation:

  r = (e^(2z′) − 1) / (e^(2z′) + 1)

• 95% CIs = 0.13 to 0.92
• Very wide – small sample size – maybe that’s why CIs are not reported?
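In R this takes a few lines (a sketch; atanh() and tanh() are the Fisher transformation and its inverse, the equivalents of the Excel functions mentioned on the next slide):

  r <- 0.706; n <- 10

  z  <- atanh(r)          # Fisher z' = 0.879
  se <- 1 / sqrt(n - 3)   # 0.38

  tanh(z + c(-1.96, 1.96) * se)   # back-transformed 95% CI: 0.13 to 0.92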
203
Using Excel
• Functions in excel– Fisher() – to carry out Fisher
transformation– Fisherinv() – to transform back to
correlation
204
The Others
• Same ideas for calculation of CIs and SEs for – Predicted score– Gives expected range of values given
X
• Same for intercept– But we have probably had enough
205
Lesson 5: Introducing Multiple Regression
206
Residuals• We said
Y = b0 + b1x1
• We could have saidYi = b0 + b1xi1 + ei
• We ignored the i on the Y• And we ignored the ei
– It’s called error, after all• But it isn’t just error
– Trying to tell us something
207
What Error Tells Us
• Error tells us that a case has a different score for Y than we predict– There is something about that case
• Called the residual– What is left over, after the model
• Contains information – something is making the residual ≠ 0 – but what?
208
[Scatterplot: price against number of bedrooms with the regression line; one house far above the line is annotated ‘swimming pool’, one far below ‘unpleasant neighbours’.]
209
• The residual (+ the mean) is the value of Y
If all cases were equal on X• It is the value of Y, controlling for
X
• Other words:– Holding constant– Partialling– Residualising– Conditioned on
210
Beds | £ (000s) | Pred | Res | Adj. Value
1 | 77  | 61  | 16  | 105
2 | 74  | 76  | −2  | 87
1 | 88  | 61  | 27  | 116
3 | 62  | 90  | −28 | 61
5 | 90  | 120 | −30 | 59
5 | 136 | 120 | 16  | 105
2 | 35  | 76  | −41 | 48
5 | 134 | 120 | 14  | 103
4 | 138 | 105 | 33  | 122
1 | 55  | 61  | −6  | 83

(Adjusted value = residual + mean: the price adjusted to the average number of bedrooms.)
211
• Sometimes adjustment is enough on its own– Measure performance against criteria
• Teenage pregnancy rate– Measure pregnancy and abortion rate in areas– Control for socio-economic deprivation, and
anything else important– See which areas have lower teenage
pregnancy and abortion rate, given same level of deprivation
• Value added education tables– Measure school performance– Control for initial intake
212
Control?• In experimental research
– Use experimental control– e.g. same conditions, materials, time
of day, accurate measures, random assignment to conditions
• In non-experimental research– Can’t use experimental control– Use statistical control instead
213
Analysis of Residuals
• What predicts differences in crime rate – After controlling for socio-economic
deprivation– Number of police?– Crime prevention schemes?– Rural/Urban proportions?– Something else
• This is what regression is about
214
• Exam performance– Consider number of books a student
read (books)– Number of lectures (max 20) a
student attended (attend)
• Books and attend as IV, grade as DV
215
First 10 cases:

Books | Attend | Grade
0 | 9  | 45
1 | 15 | 57
0 | 10 | 45
2 | 16 | 51
4 | 10 | 65
4 | 20 | 88
1 | 11 | 44
4 | 20 | 87
3 | 15 | 89
0 | 15 | 59
216
• Use books as IV – R=0.492, F=12.1, df=1, 28, p=0.001
– b0=52.1, b1=5.7
– (Intercept makes sense)
• Use attend as IV – R=0.482, F=11.5, df=1, 28, p=0.002
– b0=37.0, b1=1.9
– (Intercept makes less sense)
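A sketch of these models in R, using the ten cases above (the slides’ statistics come from the full sample of 30, so exact values will differ):

  books  <- c(0, 1, 0, 2, 4, 4, 1, 4, 3, 0)
  attend <- c(9, 15, 10, 16, 10, 20, 11, 20, 15, 15)
  grade  <- c(45, 57, 45, 51, 65, 88, 44, 87, 89, 59)

  summary(lm(grade ~ books))            # books alone
  summary(lm(grade ~ attend))           # attend alone
  summary(lm(grade ~ books + attend))   # both: R^2 is NOT the two r^2 summed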
217
[Scatterplots with fitted lines: Grade (30–100) against Books (0–5), and Grade against Attend (5–21).]
219
Problem• Use R2 to give proportion of shared
variance– Books = 24%– Attend = 23%
• So we have explained 24% + 23% = 47% of the variance– NO!!!!!
220
• The correlation of books and attend is (unsurprisingly) not zero – some of the variance that books shares with grade is also shared by attend
• Look at the correlation matrix:

       | BOOKS | ATTEND | GRADE
BOOKS  | 1     |        |
ATTEND | 0.44  | 1      |
GRADE  | 0.49  | 0.48   | 1
221
• I have access to 2 cars• My wife has access to 2 cars
– We have access to four cars?– No. We need to know how many of my
2 cars are the same cars as her 2 cars• Similarly with regression
– But we can do this with the residuals– Residuals are what is left after (say)
books– See of residual variance is explained
by attend– Can use this new residual variance to
calculate SSres, SStotal and SSreg
222
• Well. Almost.– This would give us correct values for
SS– Would not be correct for slopes, etc
• Assumes that the variables have a causal priority– Why should attend have to take what
is left from books?– Why should books have to take what
is left by attend?
• Use OLS again
223
• Simultaneously estimate 2 parameters– b1 and b2
– Y = b0 + b1x1 + b2x2
– x1 and x2 are IVs
• Not trying to fit a line any more– Trying to fit a plane
• Can solve iteratively– Closed form equations better– But they are unwieldy
224
[3D scatterplot: y against x1 and x2 (2 points only); the fitted regression plane has intercept b0 and slopes b1 and b2.]
226
(Really) Ridiculous Equations

With x1, x2 and y expressed as deviations from their means:

  b1 = [ Σx2² × Σx1y − Σx1x2 × Σx2y ] / [ Σx1² × Σx2² − (Σx1x2)² ]

  b2 = [ Σx1² × Σx2y − Σx1x2 × Σx1y ] / [ Σx1² × Σx2² − (Σx1x2)² ]

  b0 = ȳ − b1x̄1 − b2x̄2
227
• The good news– There is an easier way
• The bad news– It involves matrix algebra
• The good news– We don’t really need to know how to
do it
• The bad news – We need to know it exists
228
A Quick Guide to Matrix Algebra
(I will never make you do it again)
229
Very Quick Guide to Matrix Algebra
• Why?– Matrices make life much easier in
multivariate statistics– Some things simply cannot be done
without them– Some things are much easier with them
• If you can manipulate matrices– you can specify calculations v. easily– e.g. AA’ = sum of squares of a column
• Doesn’t matter how long the column
230
• A scalar is a number
A scalar: 4
• A vector is a row or column of numbers
A row vector: [2 4 8 7]
A column vector: [5; 11]
231
• A vector is described as rows × columns
– [2 4 8 7] is a 1 × 4 vector
– [5; 11] is a 2 × 1 vector
– A number (scalar) is a 1 × 1 vector
232
• A matrix is a rectangle of numbers, described as rows × columns

  [2 6 5 7 8]
  [4 5 7 5 3]
  [1 5 2 7 8]

• is a 3 × 5 matrix
• Matrices are referred to with bold capitals – A is a matrix
233
• Correlation matrices and covariance matrices are special – they are square and symmetrical
• Correlation matrix of books, attend and grade:

  [1.00 0.44 0.49]
  [0.44 1.00 0.48]
  [0.49 0.48 1.00]
234
• Another special matrix is the identity matrix, I – a square matrix with 1s on the diagonal and 0s off the diagonal:

  I = [1 0 0 0]
      [0 1 0 0]
      [0 0 1 0]
      [0 0 0 1]

– Note that this is a correlation matrix, with all correlations = 0
235
Matrix Operations
• Transposition – a matrix is transposed by putting it on its side – the transpose of A is A′

  A = [7 5 6],   A′ = [7; 5; 6]
• Matrix multiplication– A matrix can be multiplied by a scalar,
a vector or a matrix– Not commutative– AB BA– To multiply AB
• Number of rows in A must equal number of columns in B
237
• Matrix by vector
141
99
33
4
3
2
231917
13117
532
238
• Matrix by matrix
5038
2116
35152810
156124
54
32
75
32
dhcfdgce
bhafcfae
hg
fe
dc
ba
239
• Multiplying by the identity matrix– Has no effect – Like multiplying by 1
75
32
10
01
75
32
AAI
240
• The inverse of J is: 1/J• J x 1/J = 1• Same with matrices
– Matrices have an inverse– Inverse of A is A-1
– AA-1=I
• Inverting matrices is dull– We will do it once– But first, we must calculate the
determinant
241
• The determinant of A is |A|• Determinants are important in
statistics– (more so than the other matrix
algebra)
• We will do a 2x2 – Much more difficult for larger
matrices
242
  A = [a b]      |A| = ad − bc
      [c d]

  A = [1.0 0.3]  |A| = 1 × 1 − 0.3 × 0.3 = 0.91
      [0.3 1.0]
243
• Determinants are important because– Needs to be above zero for regression
to work– Zero or negative determinant of a
correlation/covariance matrix means something wrong with the data• Linear redundancy
• Described as:– Not positive definite– Singular (if determinant is zero)
• In different error messages
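In R, det() does the check (a small sketch; the second matrix has a deliberately impossible ‘correlation’ of 1.2):

  R_ok  <- matrix(c(1, 0.3, 0.3, 1), nrow = 2)
  det(R_ok)    # 0.91 - fine

  R_bad <- matrix(c(1, 1.2, 1.2, 1), nrow = 2)
  det(R_bad)   # -0.44 - negative: not a possible correlation matrix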
244
• Next, the adjoint:

  A = [a b]     adj A = [ d −b]
      [c d]             [−c  a]

• Now:

  A⁻¹ = (1 / |A|) × adj A
245
• Find A⁻¹:

  A = [1.0 0.3]    |A| = 0.91
      [0.3 1.0]

  A⁻¹ = (1 / 0.91) × [ 1.0 −0.3]  =  [ 1.10 −0.33]
                     [−0.3  1.0]     [−0.33  1.10]
246
Matrix Algebra with Correlation Matrices
247
Determinants
• Determinant of a correlation matrix– The volume of ‘space’ taken up by
the (hyper) sphere that contains all of the points
  A = [1.0 0.0]    |A| = 1.0
      [0.0 1.0]
248
  A = [1.0 0.0]    |A| = 1.0
      [0.0 1.0]

[Scatterplot: uncorrelated points are scattered over the whole space.]
249
  A = [1.0 1.0]    |A| = 0.0
      [1.0 1.0]

[Scatterplot: perfectly correlated points lie on a straight line – they take up no space.]
250
Negative Determinant
• Points take up less than no space– Correlation matrix cannot exist – Non-positive definite matrix
251
Sometimes Obvious
  A = [1.0 1.2]    |A| = 1 − 1.2 × 1.2 = −0.44
      [1.2 1.0]
252
Sometimes Obvious (If You Think)
  A = [1.0 0.9 0.9]
      [0.9 1.0 0.9]
      [0.9 0.9 1.0]

  |A| = 0.028
253
Sometimes No Idea
  A = [1.00 0.76 0.40]    |A| = −0.01
      [0.76 1.00 0.30]
      [0.40 0.30 1.00]

  A = [1.00 0.75 0.40]    |A| = 0.0075
      [0.75 1.00 0.30]
      [0.40 0.30 1.00]
254
Multiple R for Each Variable
• The diagonal of the inverse of the correlation matrix is used to calculate multiple R – call its elements aii:

  R²(i.123…k) = 1 − 1 / aii
255
Regression Weights
• Where i is the DV and j is an IV:

  b(ij) = −aij / aii
256
Back to the Good News
• We can calculate the standardised parameters as

  B = Rxx⁻¹ × Rxy

• Where
– B is the vector of regression weights
– Rxx⁻¹ is the inverse of the correlation matrix of the independent (x) variables
– Rxy is the vector of correlations of the x variables with y
– Now do exercise 3.2
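A minimal R sketch of B = Rxx⁻¹ × Rxy, using the books/attend/grade correlations from above:

  Rxx <- matrix(c(1, 0.44,
                  0.44, 1), nrow = 2)   # correlations among the IVs
  Rxy <- c(0.49, 0.48)                  # correlations of each IV with grade

  solve(Rxx) %*% Rxy   # standardised weights: books 0.35, attend 0.33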
257
One More Thing
• The whole regression equation can be described with matrices – very simply:

  Y = XB + E

258
• Where – Y = vector of DV scores – X = matrix of IVs – B = vector of coefficients – E = vector of errors
• Go all the way back to our example
259
  [45]   [1 0  9]          [e1 ]
  [57]   [1 1 15]          [e2 ]
  [45]   [1 0 10]          [e3 ]
  [51]   [1 2 16]   [b0]   [e4 ]
  [65] = [1 4 10] × [b1] + [e5 ]
  [88]   [1 4 20]   [b2]   [e6 ]
  [44]   [1 1 11]          [e7 ]
  [87]   [1 4 20]          [e8 ]
  [89]   [1 3 15]          [e9 ]
  [59]   [1 0 15]          [e10]
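The OLS solution for B has the standard closed form B = (X′X)⁻¹X′Y (not shown on the slides, but it is what the computer does); a minimal R sketch with the same ten cases:

  X <- cbind(1, books, attend)   # first column is the constant, x0
  Y <- grade

  B <- solve(t(X) %*% X) %*% t(X) %*% Y   # b0, b1, b2
  E <- Y - X %*% B                        # the errors we have minimised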
260
[The same matrix equation, repeated, highlighting the first column of X: the constant – literally a constant. It could be any number, but it is most convenient to make it 1. It is used to ‘capture’ the intercept.]
261
[The same equation again, highlighting the matrix of values for the IVs (books and attend).]
262
[The same equation again, highlighting the parameter estimates – we are trying to find the best values of these.]
263
[The same equation again, highlighting the error vector – we are trying to minimise this.]
264
[The same equation again, highlighting the DV – grade.]
265
• Y = XB + E
• A simple way of representing as many IVs as you like:

  Y = b0x0 + b1x1 + b2x2 + b3x3 + b4x4 + b5x5 + e

[Written as matrices: each row of X holds one case’s values on x0 … x5, B stacks b0 … b5, and E holds the errors.]
266
[The same matrix expression, generalised to k IVs:]

  Y = b0x0 + b1x1 + … + bkxk + e
267
Generalises to Multivariate Case
• Y = XB + E • Y, B and E
– Matrices, not vectors
• Goes beyond this course– (Do Jacques Tacq’s course for more)– (Or read his book)
268
269
270
271
Lesson 6: More on Multiple Regression
272
Parameter Estimates
• Parameter estimates (b1, b2 … bk) were standardised – Because we analysed a correlation
matrix
• Represent the correlation of each IV with the DV– When all other IVs are held constant
273
• Can also be unstandardised• Unstandardised represent the unit
change in the DV associated with a 1 unit change in the IV– When all the other variables are held
constant• Parameters have standard errors
associated with them– As with one IV– Hence t-test, and associated
probability can be calculated• Trickier than with one IV
274
Standard Error of a Regression Coefficient
• The standardised one is easier – R²i is the value of R² when all the other predictors are used as predictors of that variable:

  SE(βi) = √[ (1 − R²Y) / ((N − k − 1) × (1 − R²i)) ]

• Note that if R²i = 0, the equation is the same as for the single-predictor case
275
Multiple R
• The degree of prediction– R (or Multiple R) – No longer equal to b
• R2 Might be equal to the sum of squares of B– Only if all x’s are uncorrelated
276
In Terms of Variance• Can also think of this in terms of
variance explained.– Each IV explains some variance in the
DV– The IVs share some of their variance
• Can’t share the same variance twice
277
[Venn diagram: the total variance of Y = 1; x1 accounts for r²x1y = 0.36 and x2 accounts for r²x2y = 0.36, with no overlap between x1 and x2.]
278
• In this model:
– R² = r²yx1 + r²yx2
– R² = 0.36 + 0.36 = 0.72, so R = √0.72 = 0.85
• But – if x1 and x2 are correlated – this is no longer the case
279
[Venn diagram: the same, but now x1 and x2 overlap – part of the variance in Y is shared between x1 and x2 (the overlap is not equal to rx1x2).]
280
• So– We can no longer sum the r2
– Need to sum them, and subtract the shared variance – i.e. the correlation
• But– It’s not the correlation between them– It’s the correlation between them as a
proportion of the variance of Y
• Two different ways
281
• Based on the estimates:

  R² = b1 × r(yx1) + b2 × r(yx2)

• If r(x1x2) = 0 – r = b – and this is equivalent to r²yx1 + r²yx2
282
• Based on the correlations:

  R² = [ r²yx1 + r²yx2 − 2 × r(yx1) × r(yx2) × r(x1x2) ] / (1 − r²x1x2)

• When r(x1x2) = 0, this too reduces to r²yx1 + r²yx2
283
• Can also be calculated using methods we have seen– Based on PRE– Based on correlation with prediction
• Same procedure with >2 IVs
284
Adjusted R2
• R2 is an overestimate of population value of R2
– Any x will not correlate 0 with Y– Any variation away from 0 increases R– Variation from 0 more pronounced
with lower N• Need to correct R2
– Adjusted R2
285
• 1 − R² – the proportion of unexplained variance – we multiply this by an adjustment
• More variables – greater adjustment • More people – less adjustment
• Calculation of Adj. R²:

  Adj. R² = 1 − (1 − R²) × (N − 1) / (N − k − 1)
286
Shrunken R2
• Some authors treat shrunken and adjusted R2 as the same thing– Others don’t
287
The adjustment factor (N − 1) / (N − k − 1):

  N = 20, k = 3:  (20 − 1) / (20 − 3 − 1) = 19/16 = 1.1875
  N = 10, k = 8:  (10 − 1) / (10 − 8 − 1) = 9/1 = 9
  N = 10, k = 3:  (10 − 1) / (10 − 3 − 1) = 9/6 = 1.5
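As a quick R sketch of the adjustment:

  adj_r2 <- function(r2, n, k) 1 - (1 - r2) * (n - 1) / (n - k - 1)

  adj_r2(0.50, n = 20, k = 3)   # 0.41  - mild adjustment (factor 1.1875)
  adj_r2(0.50, n = 10, k = 8)   # -3.50 - drastic adjustment (factor 9)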
288
Extra Bits
• Some stranger things that can happen
– Counter-intuitive
289
• Can be hard to understand– Very counter-intuitive
• Definition– An independent variable which
increases the size of the parameters associated with other independent variables above the size of their correlations
Suppressor variables
290
• An example (based on Horst, 1941)– Success of trainee pilots
– Mechanical ability (x1), verbal ability (x2), success (y)
• Correlation matrix:

        | Mech | Verb | Success
Mech    | 1    | 0.5  | 0.3
Verb    | 0.5  | 1    | 0
Success | 0.3  | 0    | 1
291
– Mechanical ability correlates 0.3 with success
– Verbal ability correlates 0.0 with success
– What will the parameter estimates be?
– (Don’t look ahead until you have had a guess)
292
• Mechanical ability– b = 0.4– Larger than r!
• Verbal ability – b = -0.2– Smaller than r!!
• So what is happening?– You need verbal ability to do the test– Not related to mechanical ability
• Measure of mechanical ability is contaminated by verbal ability
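These b values can be checked with the matrix formula from Lesson 5 (an R sketch):

  Rxx <- matrix(c(1, 0.5, 0.5, 1), nrow = 2)

  solve(Rxx) %*% c(0.3, 0.0)   # mech 0.4, verbal -0.2: suppression
  solve(Rxx) %*% c(0.3, 0.2)   # 0.27 and 0.07: the next example - no suppression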
293
• High mech, low verbal– High mech
• This is positive
– Low verbal • Negative, because we are talking about
standardised scores• Your mech is really high – you did well on
the mechanical test, without being good at the words
• High mech, high verbal– Well, you had a head start on mech,
because of verbal, and need to be brought down a bit
294
Another suppressor?

   | x1  | x2  | y
x1 | 1   | 0.5 | 0.3
x2 | 0.5 | 1   | 0.2
y  | 0.3 | 0.2 | 1

b1 = ?  b2 = ?
295
Another suppressor?

   | x1  | x2  | y
x1 | 1   | 0.5 | 0.3
x2 | 0.5 | 1   | 0.2
y  | 0.3 | 0.2 | 1

b1 = 0.27,  b2 = 0.07 – both below their correlations, so no suppression
296
And another?

   | x1  | x2  | y
x1 | 1   | 0.5 | 0.3
x2 | 0.5 | 1   | −0.2
y  | 0.3 | −0.2 | 1

b1 = ?  b2 = ?
297
And another?

   | x1  | x2  | y
x1 | 1   | 0.5 | 0.3
x2 | 0.5 | 1   | −0.2
y  | 0.3 | −0.2 | 1

b1 = 0.53,  b2 = −0.47
298
One more?

   | x1   | x2   | y
x1 | 1    | −0.5 | 0.3
x2 | −0.5 | 1    | 0.2
y  | 0.3  | 0.2  | 1

b1 = ?  b2 = ?
299
One more?

   | x1   | x2   | y
x1 | 1    | −0.5 | 0.3
x2 | −0.5 | 1    | 0.2
y  | 0.3  | 0.2  | 1

b1 = 0.53,  b2 = 0.47
300
• Suppression happens when two opposing forces are happening together– And have opposite effects
• Don’t throw away your IVs,– Just because they are uncorrelated with the
DV
• Be careful in interpretation of regression estimates– Really need the correlations too, to interpret
what is going on– Cannot compare between studies with
different IVs
301
Standardised Estimates > 1
• Correlations are bounded: −1.00 ≤ r ≤ +1.00
– We think of standardised regression estimates as being similarly bounded • But they are not
– They can go > 1.00 or < −1.00 – R cannot, because R² is a proportion of variance
302
• Three measures of ability – mechanical ability, verbal ability 1, verbal ability 2 – and score on a science exam:

        | Mech | Verbal1 | Verbal2 | Scores
Mech    | 1    | 0.1     | 0.1     | 0.6
Verbal1 | 0.1  | 1       | 0.9     | 0.6
Verbal2 | 0.1  | 0.9     | 1       | 0.3
Scores  | 0.6  | 0.6     | 0.3     | 1

– Before reading on, what are the parameter estimates?
303
• Mechanical – about where we expect: b = 0.56
• Verbal 1 – very high: b = 1.71
• Verbal 2 – very low: b = −1.29
304
• What is going on– It’s a suppressor again– An independent variable which
increases the size of the parameters associated with other independent variables above the size of their correlations
• Verbal 1 and verbal 2 are correlated so highly– They need to cancel each other out
305
Variable Selection
• What are the appropriate independent variables to use in a model?– Depends what you are trying to do
• Multiple regression has two separate uses– Prediction– Explanation
306
• Prediction – What will happen
in the future?– Emphasis on
practical application
– Variables selected (more) empirically
– Value free
• Explanation– Why did
something happen?
– Emphasis on understanding phenomena
– Variables selected theoretically
– Not value free
307
• Visiting the doctor– Precedes suicide attempts– Predicts suicide
• Does not explain suicide
• More on causality later on …• Which are appropriate variables
– To collect data on?– To include in analysis?– Decision needs to be based on theoretical
knowledge of the behaviour of those variables– Statistical analysis of those variables (later)
• Unless you didn’t collect the data
– Common sense (not a useful thing to say)
308
Variable Entry Techniques
• Entry-wise– All variables entered simultaneously
• Hierarchical– Variables entered in a predetermined
order• Stepwise
– Variables entered according to change in R2
– Actually a family of techniques
309
• Entrywise– All variables entered simultaneously– All treated equally
• Hierarchical– Entered in a theoretically determined
order– Change in R2 is assessed, and tested for
significance– e.g. sex and age
• Should not be treated equally with other variables
• Sex and age MUST be first
– Confused with hierarchical linear modelling
310
• Stepwise– Variables entered empirically– Variable which increases R2 the most
goes first• Then the next …
– Variables which have no effect can be removed from the equation
• Example– IVs: Sex, age, extroversion, – DV: Car – how long someone spends
looking after their car
311
• Correlation Matrix:

      | SEX   | AGE   | EXTRO | CAR
SEX   | 1.00  | −0.05 | 0.40  | 0.66
AGE   | −0.05 | 1.00  | 0.40  | 0.23
EXTRO | 0.40  | 0.40  | 1.00  | 0.67
CAR   | 0.66  | 0.23  | 0.67  | 1.00
312
• Entrywise analysis – R² = 0.64

      | b    | p
SEX   | 0.49 | <0.01
AGE   | 0.08 | 0.46
EXTRO | 0.44 | <0.01
313
• Stepwise analysis – the data determine the order
– Model 1: Extroversion, R² = 0.450
– Model 2: Extroversion + Sex, R² = 0.633

      | b    | p
EXTRO | 0.48 | <0.01
SEX   | 0.47 | <0.01
314
• Hierarchical analysis – theory determines the order
– Model 1: Sex + Age, R² = 0.510
– Model 2: Sex + Age + Extroversion, R² = 0.638
– Change in R² = 0.128, p = 0.001

      | b    | p
SEX   | 0.49 | <0.01
AGE   | 0.08 | 0.46
EXTRO | 0.44 | <0.01
315
• Which is the best model?– Entrywise – OK– Stepwise – excluded age
• Did have a (small) effect
– Hierarchical• The change in R2 gives the best estimate
of the importance of extroversion
• Other problems with stepwise– F and df are wrong (cheats with df)– Unstable results
• Small changes (sampling variance) – large differences in models
316
– Uses a lot of paper– Don’t use a stepwise procedure to
pack your suitcase
317
Is Stepwise Always Evil?• Yes• All right, no• Research goal is predictive
(technological)– Not explanatory (scientific)– What happens, not why
• N is large – 40 people per predictor, Cohen, Cohen,
Aiken, West (2003)• Cross validation takes place
318
A quick note on R2
R2 is sometimes regarded as the ‘fit’ of a regression model– Bad idea
• If good fit is required – maximise R2
– Leads to entering variables which do not make theoretical sense
319
Critique of Multiple Regression
• Goertzel (2002)– “Myths of murder and multiple
regression”– Skeptical Inquirer (Paper B1)
• Econometrics and regression are ‘junk science’– Multiple regression models (in US)– Used to guide social policy
320
More Guns, Less Crime
– (controlling for other factors)• Lott and Mustard: A 1% increase in
gun ownership– 3.3% decrease in murder rates
• But: – More guns in rural Southern US– More crime in urban North (crack
cocaine epidemic at time of data)
321
Executions Cut Crime
• No difference between crimes in states in US with or without death penalty
• Ehrlich (1975) controlled all the variables that affect crime rates – the death penalty had an effect in reducing
crime rate
• No statistical way to decide who’s right
322
Legalised Abortion
• Donohue and Levitt (1999)– Legalised abortion in 1970’s cut crime in
1990’s
• Lott and Whitley (2001)– “Legalising abortion decreased murder
rates by … 0.5 to 7 per cent.”
• It’s impossible to model these data– Controlling for other historical events– Crack cocaine (again)
323
Another Critique
• Berk (2003)– Regression analysis: a constructive critique
(Sage)
• Three cheers for regression– As a descriptive technique
• Two cheers for regression– As an inferential technique
• One cheer for regression– As a causal analysis
324
Is Regression Useless?
• Do regression carefully– Don’t go beyond data which you have
a strong theoretical understanding of
• Validate models– Where possible, validate predictive
power of models in other areas, times, groups• Particularly important with stepwise
325
Lesson 7: Categorical Independent Variables
326
Introduction
327
Introduction
• So far, just looked at continuous independent variables
• Also possible to use categorical (nominal, qualitative) independent variables– e.g. Sex; Job; Religion; Region; Type
(of anything)• Usually analysed with
t-test/ANOVA
328
Historical Note• But these (t-test/ANOVA) are
special cases of regression analysis– Aspects of General Linear Models
(GLMs)• So why treat them differently?
– Fisher’s fault– Computers’ fault
• Regression, as we have seen, is computationally difficult– Matrix inversion and multiplication– Unfeasible, without a computer
329
• In the special cases where:• You have one categorical IV• Your IVs are uncorrelated
– It is much easier to do it by partitioning of sums of squares
• These cases– Very rare in ‘applied’ research– Very common in ‘experimental’
research• Fisher worked at Rothamsted agricultural
research station• Never have problems manipulating
wheat, pigs, cabbages, etc
330
• In psychology– Led to a split between ‘experimental’
psychologists and ‘correlational’ psychologists
– Experimental psychologists (until recently) would not think in terms of continuous variables
• Still (too) common to dichotomise a variable– Too difficult to analyse it properly– Equivalent to discarding 1/3 of your
data
331
The Approach
332
The Approach
• Recode the nominal variable – Into one, or more, variables to represent
that variable
• Names are slightly confusing– Some texts talk of ‘dummy coding’ to refer
to all of these techniques– Some (most) refer to ‘dummy coding’ to
refer to one of them– Most have more than one name
333
• If a variable has g possible categories it is represented by g-1 variables
• Simplest case: – Smokes: Yes or No– Variable 1 represents ‘Yes’– Variable 2 is redundant
• If it isn’t yes, it’s no
334
The Techniques
335
• We will examine two coding schemes– Dummy coding
• For two groups• For >2 groups
– Effect coding• For >2 groups
• Look at analysis of change– Equivalent to ANCOVA– Pretest-posttest designs
336
Dummy Coding – 2 Groups• Also called simple coding by SPSS• A categorical variable with two groups• One group chosen as a reference group
– The other group is represented in a variable
• e.g. 2 groups: Experimental (Group 1) and Control (Group 0)– Control is the reference group– Dummy variable represents experimental
group• Call this variable ‘group1’
337
• For variable ‘group1’ – 1 = ‘Yes’, 0 = ‘No’
Original Category | New Variable
Exp | 1
Con | 0
338
• Some data – group is x, score is y:

             | Control Group | Experimental Group
Experiment 1 | 10 | 10
Experiment 2 | 10 | 20
Experiment 3 | 10 | 30
339
• Control Group = 0– Intercept = Score on Y when x = 0– Intercept = mean of control group
• Experimental Group = 1– b = change in Y when x increases 1
unit– b = difference between experimental
group and control group
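A sketch of this in R (the six scores are hypothetical, chosen to give group means of 10 and 20 as in Experiment 2):

  group1 <- c(0, 0, 0, 1, 1, 1)        # dummy variable: control = 0, experimental = 1
  score  <- c(9, 10, 11, 19, 20, 21)   # hypothetical scores, means 10 and 20

  coef(lm(score ~ group1))
  # intercept = 10 = control-group mean; slope = 10 = difference between means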
340
[Chart: lines from the control-group mean to each experimental-group mean for Experiments 1–3; the gradient of each slope represents the difference between the means.]
341
Dummy Coding – 3+ Groups
• With three groups the approach is the similar
• g = 3, therefore g-1 = 2 variables needed
• 3 Groups– Control – Experimental Group 1– Experimental Group 2
342
• Recoded into two variables– Note – do not need a 3rd variable
• If we are not in group 1 or group 2 MUST be in control group
• 3rd variable would add no information• (What would happen to determinant?)
Original Category | Gp1 | Gp2
Con | 0 | 0
Gp1 | 1 | 0
Gp2 | 0 | 1
343
• F and its associated p – tests H0 that μ(g1) = μ(g2) = μ(g3)
• b1 and b2 and their associated p-values – test the difference between each experimental group and the control group
• To test the difference between the experimental groups – need to rerun the analysis
344
• One more complication – we have now run multiple comparisons – this increases α, i.e. the probability of a Type I error
• Need to correct for this – Bonferroni correction – multiply the given p-values by two/three (depending how many comparisons were made)
345
Effect Coding
• Usually used for 3+ groups• Compares each group (except the
reference group) to the mean of all groups– Dummy coding compares each group to the
reference group.
• Example with 5 groups– 1 group selected as reference group
• Group 5
346
• Each group (except the reference group) has a variable – 1 if the individual is in that group, 0 if not, −1 if in the reference group:

group | group_1 | group_2 | group_3 | group_4
1 | 1  | 0  | 0  | 0
2 | 0  | 1  | 0  | 0
3 | 0  | 0  | 1  | 0
4 | 0  | 0  | 0  | 1
5 | −1 | −1 | −1 | −1
347
Examples
• Dummy coding and effect coding, with Group 1 chosen as the reference group each time
• Data:

Group | Mean  | SD
1     | 52.40 | 4.60
2     | 56.30 | 5.70
3     | 60.10 | 5.00
Total | 56.27 | 5.88
348
• Dummy:

Group | dummy2 | dummy3
1 | 0 | 0
2 | 1 | 0
3 | 0 | 1

• Effect:

Group | effect2 | effect3
1 | −1 | −1
2 | 1  | 0
3 | 0  | 1
349
Dummy:
R = 0.543, F = 5.7, df = 2, 27, p = 0.009
b0 = 52.4            (b0 = mean of g1)
b1 = 3.9, p = 0.100  (b1 = mean g2 − mean g1)
b2 = 7.7, p = 0.002  (b2 = mean g3 − mean g1)

Effect:
R = 0.543, F = 5.7, df = 2, 27, p = 0.009
b0 = 56.27           (b0 = grand mean G)
b1 = 0.03, p = 0.980 (b1 = mean g2 − G)
b2 = 3.8, p = 0.007  (b2 = mean g3 − G)
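R builds either coding automatically from a factor (a sketch; g is a hypothetical three-level grouping factor):

  g <- factor(rep(1:3, each = 10))

  contr.treatment(3)   # dummy coding: intercept = mean of the reference group
  contr.sum(3)         # effect coding: intercept = grand mean
                       # (note: contr.sum takes the LAST level as the -1 reference)
  contrasts(g) <- contr.treatment(3)   # then use g in lm() as usual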
350
In SPSS• SPSS provides two equivalent
procedures for regression– Regression (which we have been using)– GLM (which we haven’t)
• GLM will:– Automatically code categorical variables– Automatically calculate interaction terms
• GLM won’t:– Give standardised effects– Give hierarchical R2 p-values– Allow you to not understand
351
ANCOVA and Regression
352
• Test– (Which is a trick; but it’s designed to
make you think about it)
• Use employee data.sav– Compare the pay rise (difference
between salbegin and salary)– For ethnic minority and non-minority
staff• What do you find?
353
ANCOVA and Regression
• Dummy coding approach has one special use– In ANCOVA, for the analysis of change
• Pre-test post-test experimental design– Control group and (one or more)
experimental groups– Tempting to use difference score + t-test /
mixed design ANOVA– Inappropriate
354
• Salivary cortisol levels– Used as a measure of stress– Not absolute level, but change in level
over day may be interesting
• Test at: 9.00am, 9.00pm• Two groups
– High stress group (cancer biopsy) • Group 1
– Low stress group (no biopsy)• Group 0
355
• Correlation of AM and PM = 0.493 (p=0.008)
• Has there been a significant difference in the rate of change of salivary cortisol?– 3 different approaches
              AM     PM     Diff
High Stress   20.1   6.8    13.3
Low Stress    22.3   11.8   10.5
356
• Approach 1 – find the differences, do a t-test– t = 1.31, df=26, p=0.203
• Approach 2 – mixed ANOVA, look for interaction effect– F = 1.71, df = 1, 26, p = 0.203– F = t2
• Approach 3 – regression (ANCOVA) based approach
357
– IVs: AM and group– DV: PM
– b1 (group) = 3.59, standardised b1=0.432, p = 0.01
• Why is the regression approach better?– The other two approaches took the
difference– Assumes that r = 1.00– Any difference from r = 1.00 and you add
error variance• Subtracting error is the same as adding error
358
• Using regression– Ensures that all the variance that is
subtracted is true– Reduces the error variance
• Two effects– Adjusts the means
• Compensates for differences between groups
– Removes error variance
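A minimal sketch of the regression (ANCOVA) approach in R, with invented cortisol-like data rather than the course file:

# ANCOVA-style analysis of change: baseline as covariate (invented data)
set.seed(1)
AM    <- rnorm(28, mean = 21, sd = 4)             # 9.00am cortisol
group <- rep(c(0, 1), each = 14)                  # 0 = low stress, 1 = high stress
PM    <- 0.5 * AM - 3 * group + rnorm(28, 0, 3)   # 9.00pm cortisol
m <- lm(PM ~ AM + group)                          # IVs: AM and group; DV: PM
summary(m)$coefficients["group", ]                # adjusted group difference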
359
In SPSS• SPSS automates all of this
– But you have to understand it, to know what it is doing
• Use Analyse, GLM, Univariate ANOVA
360
[Screenshot of the GLM Univariate dialog, annotated: outcome here; categorical predictors here; continuous predictors here; click Options]
361
[Screenshot of the Options dialog: select Parameter estimates]
362
More on Change
• If the difference score is correlated with either pre-test or post-test
– Subtraction fails to remove the difference between the scores
• If the two scores are uncorrelated
– The difference will be correlated with both – a failure to control
– With equal SDs and r = 0, the correlation of change and pre-score = 0.707
363
Even More on Change
• A topic of surprising complexity– What I said about difference scores
isn’t always true• Lord’s paradox – it depends on the
precise question you want to answer
– Collins and Horn (1993). Best methods for the analysis of change
– Collins and Sayer (2001). New methods for the analysis of change.
364
Lesson 8: Assumptions in Regression Analysis
365
The Assumptions
1. The distribution of residuals is normal (at each value of the dependent variable).
2. The variance of the residuals for every set of values for the independent variable is equal.
   • violation is called heteroscedasticity.
3. The error term is additive
   • no interactions.
4. At every value of the dependent variable the expected (mean) value of the residuals is zero
   • no non-linear relationships
366
5. The expected correlation between residuals, for any two cases, is 0.
• The independence assumption (lack of autocorrelation)
6. All independent variables are uncorrelated with the error term.
7. No independent variables are a perfect linear function of other independent variables (no perfect multicollinearity)
8. The mean of the error term is zero.
367
What are we going to do …
• Deal with some of these assumptions in some detail
• Deal with others in passing only– look at them again later on
368
Assumption 1: The Distribution of Residuals is Normal at Every Value of the Dependent Variable
369
Look at Normal Distributions
• A normal distribution– symmetrical, bell-shaped (so they
say)
370
What can go wrong?
• Skew– non-symmetricality– one tail longer than the other
• Kurtosis– too flat or too peaked– kurtosed
• Outliers– Individual cases which are far from the
distribution
371
Effects on the Mean
• Skew– biases the mean, in direction of skew
• Kurtosis– mean not biased– standard deviation is– and hence standard errors, and
significance tests
372
Examining Univariate Distributions
• Histograms• Boxplots• P-P plots• Calculation based methods
373
Histograms
• A and B – [histograms of two distributions]
374
• C and D – [histograms of two distributions]
375
• E & F – [histograms of two distributions]
376
Histograms can be tricky ….
377
Boxplots
378
P-P Plots
• A & B – [P-P plots]
379
• C & D – [P-P plots]
380
• E & F – [P-P plots]
381
Calculation Based
• Skew and Kurtosis statistics
• Outlier detection statistics
382
Skew and Kurtosis Statistics
• Normal distribution– skew = 0– kurtosis = 0
• Two methods for calculation– Fisher’s and Pearson’s– Very similar answers
• Associated standard error– can be used for significance of departure from
normality– not actually very useful
• Never normal above N = 400
383
     Skewness   SE Skew   Kurtosis   SE Kurt
A    -0.120     0.172     -0.084     0.342
B    0.271      0.172     0.265      0.342
C    0.454      0.172     1.885      0.342
D    0.117      0.172     -1.081     0.342
E    2.106      0.172     5.750      0.342
F    0.171      0.172     -0.210     0.342
384
Outlier Detection
• Calculate distance from mean– z-score (number of standard deviations)– deleted z-score
• that case biased the mean, so remove it
– Look up the expected distance from the mean
• e.g. fewer than 1% of cases expected beyond ±3 SDs
• Calculate influence – how much effect did that case have on the
mean?
385
Non-Normality in Regression
386
Effects on OLS Estimates
• The mean is an OLS estimate• The regression line is an OLS
estimate• Lack of normality
– biases the position of the regression slope
– makes the standard errors wrong• probability values attached to statistical
significance wrong
387
Checks on Normality
• Check residuals are normally distributed– SPSS will draw histogram and p-p plot
of residuals
• Use regression diagnostics– Lots of them– Most aren’t very interesting
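For readers outside SPSS, a quick sketch of the same checks in R (using a built-in dataset, not the course data):

m <- lm(dist ~ speed, data = cars)   # any fitted regression
hist(resid(m))                       # histogram of residuals
qqnorm(resid(m)); qqline(resid(m))   # Q-Q plot, a close cousin of the P-P plot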
388
Regression Diagnostics
• Residuals– standardised, unstandardised, studentised,
deleted, studentised-deleted– look for cases > |3| (?)
• Influence statistics– Look for the effect a case has– If we remove that case, do we get a
different answer?– DFBeta, Standardised DFBeta
• changes in b
389
– DfFit, Standardised DfFit• change in predicted value
– Covariance ratio• Ratio of the determinants of the
covariance matrices, with and without the case
• Distances– measures of ‘distance’ from the
centroid– some include IV, some don’t
390
More on Residuals
• Residuals are trickier than you might have imagined
• Raw residuals– OK
• Standardised residuals – Residuals divided by SD
se = √( Σe² / (n − k − 1) )
391
Leverage
• But– That SD is wrong– Variance of the residuals is not equal
• Those further from the centroid on the predictors have higher variance
• Need a measure of this
• Distance from the centroid is leverage, or h (or sometimes hii)
• One predictor– Easy
392
• Minimum hi is 1/n, the maximum is 1
• Except– SPSS uses standardised leverage - h*
• It doesn’t tell you this, it just uses it
hi = 1/n + (xi − x̄)² / Σ(x − x̄)²
393
• Minimum 0, maximum (N − 1)/N

h*i = hi − 1/n
h*i = (xi − x̄)² / Σ(x − x̄)²
394
• Multiple predictors– Calculate the hat matrix (H)– Leverage values are the diagonals of
this matrix
– Where X is the augmented matrix of predictors (i.e. matrix that includes the constant)
– Hence leverage hii – element ii of H
H = X(X′X)⁻¹X′
395
• Example of calculation of hat matrix
[Worked numerical example: with an augmented X of rows (1, 15), (1, 20), …, (1, 65), H = X(X′X)⁻¹X′; its diagonal holds the leverages, e.g. 0.318, 0.273, 0.236, …]
396
Standardised / Studentised
• Now we can calculate the standardised residuals – SPSS calls them studentised residuals– Also called internally studentised
residuals
e′i = ei / ( se √(1 − hi) )
397
Deleted Studentised Residuals
• Studentised residuals do not have a known distribution– Cannot use them for inference
• Deleted studentised residuals– Externally studentised residuals– Jackknifed residuals
• Distributed as t• With df = N – k – 1
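A short R sketch of these quantities (base R functions; illustrative data):

m <- lm(dist ~ speed, data = cars)
hatvalues(m)   # leverage, h_ii (diagonal of the hat matrix)
rstandard(m)   # internally studentised residuals
rstudent(m)    # externally (deleted) studentised residuals, t-distributed
# Bonferroni: multiply each residual's p-value by N before judging significance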
398
Testing Significance
• We can calculate the probability of a residual – Is it sampled from the same
population
• BUT– Massive type I error rate– Bonferroni correct it
• Multiply p value by N
399
Bivariate Normality
• We didn’t just say “residuals normally distributed”
• We said “at every value of the dependent variables”
• Two variables can be normally distributed – univariate,– but not bivariate
400
• Couple’s IQs– male and female
[Histograms of FEMALE and MALE IQs (60–140)]
–Seem reasonably normal
401
• But wait!!
[Scatterplot of MALE against FEMALE IQ]
402
• When we look at bivariate normality– not normal – there is an outlier
• So plot X against Y• OK for bivariate
– but – may be a multivariate outlier
– would need a graph in 3+ dimensions, which we can’t draw
• But we can look at the residuals instead …
403
• IQ: histogram of residuals
[Histogram of residuals from the couples’ IQ regression]
404
Multivariate Outliers …
• Will be explored later in the exercises
• So we move on …
405
What to do about Non-Normality
• Skew and Kurtosis– Skew – much easier to deal with– Kurtosis – less serious anyway
• Transform data– removes skew– positive skew – log transform– negative skew - square
406
Transformation
• May need to transform IV and/or DV– More often DV
• time, income, symptoms (e.g. depression) all positively skewed
– can cause non-linear effects (more later) if only one is transformed
– alters interpretation of unstandardised parameter
– May alter meaning of variable– May add / remove non-linear and moderator
effects
407
• Change measures– increase sensitivity at ranges
• avoiding floor and ceiling effects
• Outliers– Can be tricky– Why did the outlier occur?
• Error? Delete them.• Weird person? Probably delete them• Normal person? Tricky.
408
– You are trying to model a process• is the data point ‘outside’ the process• e.g. lottery winners, when looking at
salary• yawn, when looking at reaction time
– Which is better?• A good model, which explains 99% of
your data?• A poor model, which explains all of it
• Pedhazur and Schmelkin (1991)– analyse the data twice
409
• We will spend much less time on the other 6 assumptions
• Can do exercise 8.1.
410
Assumption 2: The variance of the residuals for every
set of values for the independent variable is
equal.
411
Heteroscedasticity
• This assumption is about heteroscedasticity of the residuals
– Hetero = different
– Scedastic = scattered
• We don’t want heteroscedasticity– we want our data to be
homoscedastic
• Draw a scatterplot to investigate
412
[Scatterplot of MALE against FEMALE IQ]
413
• Only works with one IV– need every combination of IVs
• Easy to get – use predicted values– use residuals there
• Plot predicted values against residuals– or standardised residuals– or deleted residuals– or standardised deleted residuals– or studentised residuals
• A bit like turning the scatterplot on its side
414
Good – no heteroscedasticity
[Plot of residuals against predicted values]
415
Bad – heteroscedasticity
[Plot of residuals against predicted values]
416
Testing Heteroscedasticity
• White’s test
– Not automatic in SPSS (is in SAS)
– Luckily, not hard to do
1. Do regression, save residuals
2. Square residuals
3. Square IVs
4. Calculate interactions of IVs
– e.g. x1•x2, x1•x3, x2•x3
417
5. Run regression using – squared residuals as DV– IVs, squared IVs, and interactions as IVs
6. Test statistic = N × R²
– Distributed as χ²
– df = k (for the second regression)
• Use education and salbegin to predict salary (employee data.sav)
– R² = 0.113, N = 474, χ² = 53.5, df = 5, p < 0.0001
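A hedged R sketch of the same six steps (one predictor for brevity, so there are no interaction terms; illustrative data, not the employee file):

m   <- lm(dist ~ speed, data = cars)             # 1. regression, save residuals
e2  <- resid(m)^2                                # 2. square the residuals
aux <- lm(e2 ~ speed + I(speed^2), data = cars)  # 3-5. auxiliary regression
stat <- nobs(aux) * summary(aux)$r.squared       # 6. N x R-squared
pchisq(stat, df = 2, lower.tail = FALSE)         # chi-square p-value, df = k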
418
Plot of Pred and Res
[Scatterplot of regression standardised residuals against standardised predicted values]
419
Magnitude of Heteroscedasticity
• Chop data into “slices”– 5 slices, based on X (or predicted
score)• Done in SPSS
– Calculate variance of each slice– Check ratio of smallest to largest– Less than 10:1
• OK
420
The Visual Bander• New in SPSS 12
421
• Variances of the 5 groups
• We have a problem– 3 / 0.2 ~= 15
422
Dealing with Heteroscedasticity
• Use Huber-White estimates– Very easy in Stata– Fiddly in SPSS – bit of a hack
• Use Complex samples1. Create a new variable where all
cases are equal to 1, call it const2. Use Complex Samples, Prepare for
Analysis3. Create a plan file
423
4. Sample weight is const5. Finish6. Use Complex Samples, GLM7. Use plan file created, and set up
model as in GLM(More on complex samples later)
In Stata, do regression as normal, and click “robust”.
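In R the equivalent is also a one-liner – a sketch assuming the sandwich and lmtest packages are installed:

library(sandwich); library(lmtest)
m <- lm(dist ~ speed, data = cars)
coeftest(m, vcov. = vcovHC(m, type = "HC1"))   # Huber-White (robust) SEs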
424
Heteroscedasticity – Implications and Meanings
Implications• What happens as a result of
heteroscedasticity?– Parameter estimates are correct
• not biased
– Standard errors (hence p-values) are incorrect
425
However …
• If there is no skew in predicted scores– P-values a tiny bit wrong
• If skewed,– P-values very wrong
• Can do exercise
426
Meaning• What is heteroscedasticity trying to
tell us?– Our model is wrong – it is misspecified– Something important is happening
that we have not accounted for
• e.g. amount of money given to charity (given)– depends on:
• earnings • degree of importance person assigns to
the charity (import)
427
• Do the regression analysis– R2 = 0.60, F=31.4, df=2, 37, p < 0.001
• seems quite good
– b0 = 0.24, p=0.97
– b1 = 0.71, p < 0.001
– b2 = 0.23, p = 0.031
• White’s test– 2 = 18.6, df=5, p=0.002
• The plot of predicted values against residuals …
428
• Plot shows heteroscedastic relationship
429
• Which means …– the effects of the variables are not
additive – If you think that what a charity does
is important• you might give more money• how much more depends on how much
money you have
430
[Scatterplot of GIVEN against IMPORT, with separate lines for High and Low Earnings]
431
• One more thing about heteroscedasticity
– it is the equivalent of homogeneity of variance in ANOVA/t-tests
432
Assumption 3: The Error Term is Additive
433
Additivity
• What heteroscedasticity shows you– effects of variables need to be additive
• Heteroscedasticity doesn’t always show it to you– can test for it, but hard work– (same as homogeneity of covariance
assumption in ANCOVA)
• Have to know it from your theory• A specification error
434
Additivity and Theory• Two IVs
– Alcohol has sedative effect• A bit makes you a bit tired• A lot makes you very tired
– Some painkillers have sedative effect• A bit makes you a bit tired• A lot makes you very tired
– A bit of alcohol and a bit of painkiller doesn’t make you very tired
– Effects multiply together, don’t add together
435
• If you don’t test for it– It’s very hard to know that it will
happen
• So many possible non-additive effects– Cannot test for all of them– Can test for obvious
• In medicine– Choose to test for salient non-additive
effects– e.g. sex, race
436
Assumption 4: At every value of the dependent variable the expected (mean) value of the
residuals is zero
437
Linearity
• Relationships between variables should be linear – best represented by a straight line
• Not a very common problem in social sciences– except economics– measures are not sufficiently accurate to
make a difference • R2 too low• unlike, say, physics
438
• Relationship between speed of travel and fuel used
[Plot of Fuel against Speed]
439
• R2 = 0.938– looks pretty good– know speed, make a good prediction
of fuel
• BUT– look at the chart– if we know speed we can make a
perfect prediction of fuel used– R2 should be 1.00
440
Detecting Non-Linearity
• Residual plot– just like heteroscedasticity
• Using this example– very, very obvious– usually pretty obvious
441
Residual plot
442
Linearity: A Case of Additivity
• Linearity = additivity along the range of the IV
• Jeremy rides his bicycle harder– Increase in speed depends on current speed– Not additive, multiplicative– MacCallum and Mar (1995). Distinguishing
between moderator and quadratic effects in multiple regression. Psychological Bulletin.
443
Assumption 5: The expected correlation between
residuals, for any two cases, is 0.
The independence assumption (lack of autocorrelation)
444
Independence Assumption
• Also: lack of autocorrelation• Tricky one
– often ignored– exists for almost all tests
• All cases should be independent of one another– knowing the value of one case should not
tell you anything about the value of other cases
445
How is it Detected?
• Can be difficult– need some clever statistics
(multilevel models)
• Better off avoiding situations where it arises
• Residual Plots• Durbin-Watson Test
446
Residual Plots
• Were data collected in time order?– If so plot ID number against the
residuals– Look for any pattern
• Test for linear relationship• Non-linear relationship• Heteroscedasticity
447
[Plot of residuals against participant number (0–40)]
448
How does it arise?
Two main ways• time-series analyses
– When cases are time periods• weather on Tuesday and weather on Wednesday
correlated• inflation 1972, inflation 1973 are correlated
• clusters of cases– patients treated by three doctors– children from different classes– people assessed in groups
449
Why does it matter?• Standard errors can be wrong
– therefore significance tests can be wrong
• Parameter estimates can be wrong– really, really wrong– from positive to negative
• An example– students do an exam (on statistics)– choose one of three questions
• IV: time• DV: grade
450
• Result, with line of best fit:
[Scatterplot of Grade against Time]
451
• Result shows that– people who spent longer in the exam,
achieve better grades
• BUT …– we haven’t considered which question
people answered– we might have violated the
independence assumption• DV will be autocorrelated
• Look again– with questions marked
452
• Now somewhat different
[Scatterplot of Grade against Time, with points marked by Question (1, 2, 3)]
453
• Now, people that spent longer got lower grades– questions differed in difficulty– do a hard one, get better grade– if you can do it, you can do it quickly
• Very difficult to analyse well– need multilevel models
454
Durbin Watson Test
• Not well implemented in SPSS• Depends on the order of the data
– Reorder the data, get a different result
• Doesn’t give statistical significance of the test
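The R implementation avoids both problems – a sketch using the lmtest package (assumed installed):

library(lmtest)
m <- lm(dist ~ speed, data = cars)
dwtest(m)   # Durbin-Watson statistic, with a p-value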
455
Assumption 6: All independent variables are
uncorrelated with the error term.
456
Uncorrelated with the Error Term
• A curious assumption
– by definition, the residuals in our sample are uncorrelated with the independent variables (try it and see, if you like)
• The assumption is really about the population error term
– the unmeasured influences on the DV (what is left when the IVs have been removed) must be uncorrelated with the IVs
457
• Problem in economics– Demand increases supply– Supply increases wages– Higher wages increase demand
• OLS estimates will be (badly) biased in this case– need a different estimation procedure– two-stage least squares
• simultaneous equation modelling
458
Assumption 7: No independent variables are a perfect linear function
of other independent variables
no perfect multicollinearity
459
No Perfect Multicollinearity
• IVs must not be linear functions of one another– matrix of correlations of IVs is not positive
definite– cannot be inverted– analysis cannot proceed
• Have seen this with – age, age start, time working– also occurs with subscale and total
460
• Large amounts of collinearity– a problem (as we shall see)
sometimes– not an assumption
461
Assumption 8: The mean of the error term is zero.
You will like this one.
462
Mean of the Error Term = 0
• Mean of the residuals = 0• That is what the constant is for
– if the mean of the error term deviates from zero, the constant soaks it up
Y = β0 + β1x1 + ε
Y = (β0 + 3) + β1x1 + (ε − 3)

– note, Greek letters because we are talking about population values
463
• Can do regression without the constant– Usually a bad idea– E.g R2 = 0.995, p < 0.001
• Looks good
464
[Scatterplot of y against x1]
465
466
Lesson 9: Issues in Regression Analysis
Things that alter the interpretation of the regression equation
467
The Four Issues
• Causality• Sample sizes• Collinearity• Measurement error
468
Causality
469
What is a Cause?
• Debate about definition of cause– some statistics (and philosophy)
books try to avoid it completely– We are not going into depth
• just going to show why it is hard
• Two dimensions of cause– Ultimate versus proximal cause– Determinate versus probabilistic
470
Proximal versus Ultimate• Why am I here?
– I walked here because – This is the location of the class
because – Eric Tanenbaum asked me because – (I don’t know)– because I was in my office when he
rang because – I am a lecturer at York because – I saw an advert in the paper because
471
– I exist because– My parents met because – My father had a job …
• Proximal cause– the direct and immediate cause of
something• Ultimate cause
– the thing that started the process off– I fell off my bicycle because of the
bump– I fell off because I was going too fast
472
Determinate versus Probabilistic Cause
• Why did I fall off my bicycle?– I was going too fast– But every time I ride too fast, I don’t
fall off– Probabilistic cause
• Why did my tyre go flat?– A nail was stuck in my tyre– Every time a nail sticks in my tyre,
the tyre goes flat– Deterministic cause
473
• Can get into trouble by mixing them together– Eating deep fried Mars Bars and doing
no exercise are causes of heart disease
– “My Grandad ate three deep fried Mars Bars every day, and the most exercise he ever got was when he walked to the shop next door to buy one”
– (Deliberately?) confusing deterministic and probabilistic causes
474
Criteria for Causation
• Association• Direction of Influence• Isolation
475
Association
• Correlation does not mean causation– we all know
• But– Causation does mean correlation
• Need to show that two things are related– may be correlation– my be regression when controlling for third
(or more) factor
476
• Relationship between price and sales– suppliers may be cunning– when people want it more
• stick the price up

         Price   Demand   Sales
Price    1       0.6      0
Demand   0.6     1        0.6
Sales    0       0.6      1
– So – no relationship between price and sales
477
– Until (of course) we control for
– b1 (Price) = -0.56
– b2 (Demand) = 0.94
• But which variables do we enter?
478
Direction of Influence• Relationship between A and B
– three possible processes
A → B (A causes B)
A ← B (B causes A)
A ← C → B (C causes A & B)
479
• How do we establish the direction of influence?– Longitudinally?
Barometer drops → Storm
– Now if we could just get that barometer needle to stay where it is …
• Where the role of theory comes in (more on this later)
480
Isolation
• Isolate the dependent variable from all other influences– as experimenters try to do
• Cannot do this– can statistically isolate the effect– using multiple regression
481
Role of Theory
• Strong theory is crucial to making causal statements
• Fisher said: to make causal statements “make your theories elaborate.”– don’t rely purely on statistical analysis
• Need strong theory to guide analyses– what critics of non-experimental
research don’t understand
482
• S.J. Gould – a critic– says correlate price of petrol and his
age, for the last 10 years– find a correlation– Ha! (He says) that doesn’t mean
there is a causal link– Of course not! (We say).
• No social scientist would do that analysis without first thinking (very hard) about the possible causal relations between the variables of interest
• Would control for time, prices, etc …
483
• Atkinson, et al. (1996)– relationship between college grades
and number of hours worked– negative correlation– Need to control for other variables –
ability, intelligence
• Gould says “Most correlations are non-causal” (1982, p243)– Of course!!!!
484
[Diagram: ‘I drink a lot of beer’ causes 16 outcomes – karaoke, jokes (about statistics), vomit, toilet, headache, sleeping, equations (beermat), laugh, thirsty, fried breakfast, no beer, curry, chips, falling over, lose keys, curtains closed – giving 16 causal relations but 120 non-causal correlations among the outcomes]
485
• Abelson (1995) elaborates on this– ‘method of signatures’
• A collection of correlations relating to the process– the ‘signature’ of the process
• e.g. tobacco smoking and lung cancer– can we account for all of these
findings with any other theory?
486
1. The longer a person has smoked cigarettes, the greater the risk of cancer.
2. The more cigarettes a person smokes over a given time period, the greater the risk of cancer.
3. People who stop smoking have lower cancer rates than do those who keep smoking.
4. Smoker’s cancers tend to occur in the lungs, and be of a particular type.
5. Smokers have elevated rates of other diseases.
6. People who smoke cigars or pipes, and do not usually inhale, have abnormally high rates of lip cancer.
7. Smokers of filter-tipped cigarettes have lower cancer rates than other cigarette smokers.
8. Non-smokers who live with smokers have elevated cancer rates.
(Abelson, 1995: 183-184)
487
– In addition, there should be no anomalous correlations
• If smokers had more fallen arches than non-smokers, that would not be consistent with the theory
• Failure to use theory to select appropriate variables
– specification error
– e.g. in the previous example: predict wealth from price and sales
• increase price, wealth increases
• increase sales, wealth increases
488
• Sometimes these are indicators of the process– e.g. barometer – stopping the needle
won’t help– e.g. inflation? Indicator or cause?
489
No Causation without Experimentation
• Blatantly untrue– I don’t doubt that the sun shining
makes us warm• Why the aversion?
– Pearl (2000) says problem is no mathematical operator
– No one realised that you needed one– Until you build a robot
490
AI and Causality
• A robot needs to make judgements about causality
• Needs to have a mathematical representation of causality– Suddenly, a problem!– Doesn’t exist
• Most operators are non-directional• Causality is directional
491
Sample Sizes
“How many subjects does it take to run a regression
analysis?”
492
Introduction
• Social scientists don’t worry enough about the sample size required– “Why didn’t you get a significant result?”– “I didn’t have a large enough sample”
• Not a common answer
• More recently awareness of sample size is increasing– use too few – no point doing the research– use too many – waste their time
493
• Research funding bodies• Ethical review panels
– both become more interested in sample size calculations
• We will look at two approaches– Rules of thumb (quite quickly)– Power Analysis (more slowly)
494
Rules of Thumb
• Lots of simple rules of thumb exist– 10 cases per IV– >100 cases– Green (1991) more sophisticated
• To test significance of R2 – N = 50 + 8k• To test sig of slopes, N = 104 + k
• Rules of thumb don’t take into account all the information that we have– Power analysis does
495
Power Analysis
Introducing Power Analysis• Hypothesis test
– tells us the probability of a result of that magnitude occurring, if the null hypothesis is correct (i.e. there is no effect in the population)
• Doesn’t tell us– the probability of that result, if the
null hypothesis is false
496
• According to Cohen (1982) all null hypotheses are false– everything that might have an effect,
does have an effect• it is just that the effect is often very tiny
497
Type I Errors
• Type I error is false rejection of H0
• Probability of making a type I error
– α – the significance value cut-off
• usually 0.05 (by convention)
• Always this value
• Not affected by
– sample size
– type of test
498
Type II errors• Type II error is false acceptance of
the null hypothesis– Much, much trickier
• We think we have some idea– we almost certainly don’t
• Example– I do an experiment (random
sampling, all assumptions perfectly satisfied)
– I find p = 0.05
499
– You repeat the experiment exactly• different random sample from same
population
– What is probability you will find p < 0.05?– ………………– Another experiment, I find p = 0.01– Probability you find p < 0.05?– ………………
• Very hard to work out– not intuitive – need to understand non-central sampling
distributions (more in a minute)
500
• Probability of type II error = beta (β)
– the same symbol as a population regression parameter (to be confusing)
• Power = 1 − β
– Probability of getting a significant result
501
                                            State of the World
                                            H0 True                H0 false
Research Findings                           (no effect to be found) (effect to be found)
H0 true (we find no effect – p > 0.05)      Correct                Type II error, p = β
H0 false (we find an effect – p < 0.05)     Type I error, p = α    Correct, power = 1 − β
502
• Four parameters in power analysis
– α – prob. of Type I error
– β – prob. of Type II error (power = 1 − β)
– Effect size – size of effect in population
– N
• Know any three, can calculate the fourth
– Look at them one at a time
503
• Probability of Type I error– Usually set to 0.05– Somewhat arbitrary
• sometimes adjusted because of circumstances
– rarely because of power analysis
– May want to adjust it, based on power analysis
504
• β – Probability of type II error
– Power (probability of finding a result) = 1 − β
– Standard is 80%
• Some argue for 90%
– Implication is that a Type I error is 4 times more serious than a Type II error
• adjust the ratio with compromise power analysis
505
• Effect size in the population– Most problematic to determine– Three ways1. What effect size would be useful to
find? • R2 = 0.01 - no use (probably)
2. Base it on previous research– what have other people found?
3. Use Cohen’s conventions– small R2 = 0.02– medium R2 = 0.13– large R2 = 0.26
506
– Effect size usually measured as f²
– For R²:

f² = R² / (1 − R²)
507
– For (standardised) slopes:

f² = sr²i / (1 − R²)

– Where sr²i is the contribution to the variance accounted for by the variable of interest
– i.e. sr²i = R² (with variable) − R² (without)
• change in R² in hierarchical regression
508
• N – the sample size– usually use other three parameters to
determine this– sometimes adjust other parameters
() based on this– e.g. You can have 50 participants. No
more.
509
Doing power analysis• With power analysis program
– SamplePower, GPower, Nquery
• With SPSS MANOVA– using non-central distribution
functions– Uses MANOVA syntax
• Relies on the fact you can do anything with MANOVA
• Paper B4
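A sketch of the same calculation in R with the pwr package (assumed installed; u = 3 predictors is purely illustrative):

library(pwr)
f2 <- 0.13 / (1 - 0.13)   # 'medium' effect, R-squared = 0.13
pwr.f2.test(u = 3, f2 = f2, sig.level = 0.05, power = 0.80)
# returns v, the denominator df; required sample size N = v + u + 1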
510
Underpowered Studies
• Research in the social sciences is often underpowered– Why?– See Paper B11 – “the persistence of
underpowered studies”
511
Extra Reading
• Power traditionally focuses on p values– What about CIs?– Paper B8 – “Obtaining regression
coefficients that are accurate, not simply significant”
512
Collinearity
513
Collinearity as Issue and Assumption
• Collinearity (multicollinearity) – the extent to which the independent
variables are (multiply) correlated• If R2 for any IV, using other IVs = 1.00
– perfect collinearity– variable is linear sum of other variables– regression will not proceed – (SPSS will arbitrarily throw out a variable)
514
• R2 < 1.00, but high– other problems may arise
• Four things to look at in collinearity– meaning– implications– detection– actions
515
Meaning of Collinearity
• Literally ‘co-linearity’– lying along the same line
• Perfect collinearity– when some IVs predict another– Total = S1 + S2 + S3 + S4– S1 = Total – (S2 + S3 + S4)– rare
516
• Less than perfect– when some IVs are close to predicting – correlations between IVs are high
(usually, but not always)
517
Implications
• Effects the stability of the parameter estimates– and so the standard errors of the
parameter estimates– and so the significance
• Because– shared variance, which the regression
procedure doesn’t know where to put
518
• Red cars have more accidents than other coloured cars– because of the effect of being in a red
car?– because of the kind of person that drives
a red car?• we don’t know
– No way to distinguish between these three:
Accidents = 1 x colour + 0 x personAccidents = 0 x colour + 1 x person
Accidents = 0.5 x colour + 0.5 x person
519
• Sex differences– due to genetics?– due to upbringing?– (almost) perfect collinearity
• statistically impossible to tell
520
• When collinearity is less than perfect– increases variability of estimates
between samples– estimates are unstable– reflected in the variances, and hence
standard errors
521
Detecting Collinearity
• Look at the parameter estimates– large standardised parameter
estimates (>0.3?), which are not significant • be suspicious
• Run a series of regressions– each IV as DV– all other IVs as IVs
• for each IV
522
• Sounds like hard work?– SPSS does it for us!
• Ask for collinearity diagnostics– Tolerance – calculated for every IV
21Tolerance -R
Tolerance1
VIF
– Variance Inflation Factor• sq. root of amount s.e. has been
increased
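In R the same diagnostics come from the car package – a sketch, assuming it is installed:

library(car)
m <- lm(mpg ~ wt + hp + disp, data = mtcars)   # illustrative model
vif(m)       # variance inflation factors
1 / vif(m)   # tolerances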
523
Actions
What you can do about collinearity“no quick fix” (Fox, 1991)
1. Get new data• avoids the problem• address the question in a different
way• e.g. find people who have been
raised as the ‘wrong’ gender• exist, but rare
• Not a very useful suggestion
524
2. Collect more data• not different data, more data• collinearity increases standard error
(se)• se decreases as N increases
• get a bigger N
3. Remove / Combine variables• If an IV correlates highly with other IVs• Not telling us much new• If you have two (or more) IVs which are
very similar• e.g. 2 measures of depression, socio-
economic status, achievement, etc
525
• sum them, average them, remove one
• Many measures• use principal components analysis to
reduce them
4. Use stepwise regression (or some flavour of)
• See previous comments
• Can be useful in a theoretical vacuum
5. Ridge regression
• not very useful
• behaves weirdly
526
Measurement Error
527
What is Measurement Error
• In social science, it is unlikely that we measure any variable perfectly– measurement error represents this
imperfection• We assume that we have a true
score – T
• A measure of that score– x
528
• just like a regression equation
– standardise the parameters
– β (the standardised parameter for T) is the reliability
• the amount of variance in x which comes from T
• but, like a regression equation– assume that e is random and has mean of
zero– more on that later
x = βT + e
529
Simple Effects of Measurement Error
• Lowers the measured correlation
– between two variables
• Real correlation– true scores (x* and y*)
• Measured correlation– measured scores (x and y)
530
[Path diagram: true scores x* and y* are linked by the true correlation rx*y*; x* → x with reliability rxx and error e; y* → y with reliability ryy and error e; the measured correlation of x and y is rxy]
531
• Attenuation of correlation:

rxy = rx*y* × √(rxx × ryy)

• Attenuation-corrected correlation:

rx*y* = rxy / √(rxx × ryy)
532
• Example

rxx = 0.7
ryy = 0.8
rxy = 0.3

rx*y* = rxy / √(rxx × ryy) = 0.3 / √(0.7 × 0.8) = 0.40
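The correction is simple arithmetic – e.g. in R:

rxx <- 0.7; ryy <- 0.8; rxy <- 0.3
rxy / sqrt(rxx * ryy)   # attenuation-corrected correlation = 0.40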
533
Complex Effects of Measurement Error
• Really horribly complex• Measurement error reduces
correlations
– reduces the estimate of β
– reducing one estimate
• increases others
– because of effects of control– combined with effects of suppressor
variables– exercise to examine this
534
Dealing with Measurement Error
• Attenuation correction– very dangerous– not recommended
• Avoid in the first place– use reliable measures– don’t discard information
• don’t categorise• Age: 10-20, 21-30, 31-40 …
535
Complications
• Assume measurement error is – additive– linear
• Additive– e.g. weight – people may under-report /
over-report at the extremes
• Linear– particularly the case when using proxy
variables
536
• e.g. proxy measures– Want to know effort on childcare,
count number of children• 1st child is more effort than last
– Want to know financial status, count income• 1st £10 much greater effect on financial
status than the 1000th.
537
Lesson 10: Non-Linear Analysis in Regression
538
Introduction
• Non-linear effect occurs – when the effect of one independent
variable– is not consistent across the range of
the IV
• Assumption is violated– expected value of residuals = 0– no longer the case
539
Some Examples
540
[Plot: Skill against Experience – A Learning Curve]
541
[Plot: Performance against Arousal – Yerkes-Dodson Law of Arousal]
542
[Plot: Enthusiasm (from ‘Suicidal’ to ‘Enthusiastic’) against Time (0–3.5) – Enthusiasm Levels over a Lesson on Regression]
543
• Learning – line changed direction once
• Yerkes-Dodson– line changed direction once
• Enthusiasm– line changed direction twice
544
Everything is Non-Linear
• Every relationship we look at is non-linear, for two reasons– Exam results cannot keep increasing
with reading more books• Linear in the range we examine
– For small departures from linearity• Cannot detect the difference• Non-parsimonious solution
545
Non-Linear Transformations
546
Bending the Line
• Non-linear regression is hard– We cheat, and linearise the data
• Do linear regression
Transformations• We need to transform the data
– rather than estimating a curved line• which would be very difficult• may not work with OLS
– we can take a straight line, and bend it– or take a curved line, and straighten it
• back to linear (OLS) regression
547
• We still do linear regression– Linear in the parameters
– Y = b1x + b2x2 + …
• Can do non-linear regression– Non-linear in the parameters
– e.g. Y = b1x + b2^x2 + … (a parameter appears in an exponent)
• Much trickier– Statistical theory either breaks down
OR becomes harder
548
• Linear transformations– multiply by a constant– add a constant– change the slope and the intercept
549
[Plot: the lines y = x, y = 2x and y = x + 3]
550
• Linear transformations are no use– alter the slope and intercept– don’t alter the standardised
parameter estimate
• Non-linear transformation– will bend the slope– quadratic transformation
y = x2
– one change of direction
551
– Cubic transformationy = x2 + x3
– two changes of direction
552
y=0 + 0.1x + 1x2
Quadratic Transformation
553
Square Root Transformation
y = 20 − 3x + 5√x
554
Cubic Transformation
y = 3 − 4x + 2x2 − 0.2x3
555
Logarithmic Transformation
y = 1 + 0.1x + 10log(x)
556
Inverse Transformation
y = 20 -10x + 8(1/x)
557
• To estimate a non-linear regression– we don’t actually estimate anything
non-linear– we transform the x-variable to a non-
linear version– can estimate that straight line– represents the curve– we don’t bend the line, we stretch the
space around the line, and make it flat
558
Detecting Non-linearity
559
Draw a Scatterplot
• Draw a scatterplot of y plotted against x– see if it looks a bit non-linear– e.g. Anscombe’s data– e.g. Education and beginning salary
• from bank data• drawn in SPSS• with line of best fit
560
• Anscombe (1973)– constructed a set of datasets– show the importance of graphs in
regression/correlation
• For each dataset
N                               11
Mean of x                       9
Mean of y                       7.5
Equation of regression line     y = 3 + 0.5x
Sum of squares (X − mean)       110
Correlation coefficient         0.82
R2                              0.67
561
562
563
564
565
A Real Example
• Starting salary and years of education– From employee data.sav
566
[Scatterplot of Beginning Salary against Educational Level (years), annotated: in some ranges the expected value of the error (residual) is > 0, in others < 0]
567
Use Residual Plot
• Scatterplot is only good for one variable– use the residual plot (that we used for
heteroscedasticity)
• Good for many variables
568
• We want– points to lie in a nice straight sausage
569
• We don’t want– a nasty bent sausage
570
[Plot of residuals against predicted values]
• Educational level and starting salary
571
Carrying Out Non-Linear Regression
572
Linear Transformation
• Linear transformation doesn’t change– interpretation of slope– standardised slope– se, t, or p of slope– R2
• Can change– effect of a transformation
573
• Actually more complex– with some transformations can add a
constant with no effect (e.g. quadratic)
• With others does have an effect– inverse, log
• Sometimes it is necessary to add a constant– negative numbers have no square
root– 0 has no log
574
Education and Salary
Linear Regression• Saw previously that the assumption of
expected errors = 0 was violated• Anyway …
– R2 = 0.401, F=315, df = 1, 472, p < 0.001
– salbegin = -6290 + 1727 educ– Standardised
• b1 (educ) = 0.633
– Both parameters make sense
575
Non-linear Effect
• Compute new variable
– quadratic: educ2 = educ × educ
• Add this variable to the equation– R2 = 0.585, p < 0.001– salbegin = 46263 + -6542 educ + 310
educ2
• slightly curious
– Standardised• b1 (educ) = -2.4• b2 (educ2) = 3.1
– What is going on?
576
• Collinearity– is what is going on– Correlation of educ and educ2
• r = 0.990
– Regression equation becomes difficult (impossible?) to interpret
• Need hierarchical regression– what is the change in R2
– is that change significant?– R2 (change) = 0.184, p < 0.001
577
Cubic Effect• While we are at it, let’s look at the
cubic effect
– R2 (change) = 0.004, p = 0.045
– s = 19138 + 103e − 206e2 + 12e3
– Standardised:b1(e) = 0.04
b2(e2) = -2.04
b3(e3) = 2.71
578
Fourth Power• Keep going while we are ahead
– won’t run• ???
• Collinearity is the culprit– Tolerance (educ4) = 0.000005– VIF = 215555
• Matrix of correlations of IVs is not positive definite– cannot be inverted
579
Interpretation• Tricky, given that parameter
estimates are a bit nonsensical• Two methods• 1: Use R2 change
– Save predicted values• or calculate predicted values to plot line
of best fit
– Save them from equation– Plot against IV
580
[Plot of predicted Beginning Salary against Education (Years, 8–22): linear, quadratic and cubic fits]
581
• Differentiate with respect to e
• We said: s = 19138 + 103e − 206e2 + 12e3
– but first we will simplify it to the quadratic: s = 46263 − 6542e + 310e2
• ds/de = −6542 + 310 × 2 × e = −6542 + 620e
582
Education   Slope
9           -962
10          -342
11          278
12          898
13          1518
14          2138
15          2758
16          3378
17          3998
18          4618
19          5238
20          5858

1 year of education at the higher end of the scale is worth more than 1 year at the lower end of the scale. MBA versus GCSE.
583
• Differentiate the cubic:
s = 19138 + 103e − 206e2 + 12e3
ds/de = 103 − 206 × 2 × e + 12 × 3 × e2
• Can calculate slopes for quadratic and cubic at different values
584
Education   Slope (Quad)   Slope (Cub)
9           -962           -689
10          -342           -417
11          278            -73
12          898            343
13          1518           831
14          2138           1391
15          2758           2023
16          3378           2727
17          3998           3503
18          4618           4351
19          5238           5271
20          5858           6263
585
A Quick Note on Differentiation
• For y = x^p
– dy/dx = px^(p−1)
• For equations such as y = b1x + b2x^p
– dy/dx = b1 + b2px^(p−1)
• y = 3x + 4x2
– dy/dx = 3 + 4 • 2x
586
• y = b1x + b2x2 + b3x3
– dy/dx = b1 + b2 • 2x + b3 • 3 • x2
• y = 4x + 5x2 + 6x3
• dy/dx = 4 + 5 • 2 • x + 6 • 3 • x2
• Many functions are simple to differentiate– Not all though
587
Automatic Differentiation
• If you – Don’t know how to differentiate– Can’t be bothered to look up the
function
• Can use automatic differentiation software– e.g. GRAD (freeware)
588
589
Lesson 11: Logistic Regression
Dichotomous/Nominal Dependent Variables
590
Introduction
• Often in social sciences, we have a dichotomous/nominal DV– we will look at dichotomous first, then a
quick look at multinomial
• Dichotomous DV• e.g.
– guilty/not guilty– pass/fail– won/lost– Alive/dead (used in medicine)
591
Why Won’t OLS Do?
592
Example: Passing a Test
• Test for bus drivers– pass/fail– we might be interested in degrees of pass fail
• a company which trains them will not• fail means ‘pay for them to take it again’
• Develop a selection procedure– Two predictor variables– Score – Score on an aptitude test– Exp – Relevant prior experience (months)
593
• 1st ten cases:

Score   Exp   Pass
5       6     0
1       15    0
1       12    0
4       6     0
1       15    1
1       6     0
4       16    1
1       10    1
3       12    0
4       26    1
594
• DV – pass (1 = Yes, 0 = No)
• Just consider score first– Carry out regression– Score as IV, Pass as DV– R2 = 0.097, F = 4.1, df = 1, 48, p =
0.028.
– b0 = 0.190
– b1 = 0.110, p=0.028• Seems OK
595
• Or does it? …• 1st Problem – pp plot of residuals
[P-P plot of residuals: Expected Cum Prob against Observed Cum Prob]
596
• 2nd problem - residual plot
597
• Problems 1 and 2– strange distributions of residuals– parameter estimates may be wrong– standard errors will certainly be
wrong
598
• 3rd problem – interpretation
– I score 2 on aptitude
– Pass = 0.190 + 0.110 × 2 = 0.41
– I score 8 on the test
– Pass = 0.190 + 0.110 × 8 = 1.07
• Seems OK, but– What does it mean?– Cannot score 0.41 or 1.07
• can only score 0 or 1
• Cannot be interpreted– need a different approach
599
A Different ApproachLogistic Regression
600
Logit Transformation
• In lesson 10, transformed IVs– now transform the DV
• Need a transformation which gives us– graduated scores (between 0 and 1)– No upper limit
• we can’t predict someone will pass twice
– No lower limit• you can’t do worse than fail
601
Step 1: Convert to Probability
• First, stop talking about values– talk about probability– for each value of score, calculate
probability of pass
• Solves the problem of graduated scales
602
Score       1     2     3     4     5
Fail   N    7     5     6     4     2
       P    0.7   0.5   0.6   0.4   0.2
Pass   N    3     5     4     6     8
       P    0.3   0.5   0.4   0.6   0.8

The probability of failure given a score of 1 is 0.7; the probability of passing given a score of 5 is 0.8.
603
This is better• Now a score of 0.41 has a meaning
– a 0.41 probability of pass
• But a score of 1.07 has no meaning– cannot have a probability > 1 (or < 0)– Need another transformation
604
Step 2: Convert to Odds-Ratio
Need to remove upper limit• Convert to odds• Odds, as used by betting shops
– 5:1, 1:2• Slightly different from odds in speech
– a 1 in 2 chance– odds are 1:1 (evens)– 50%
605
• Odds ratio = (number of times it happened) / (number of times it didn’t happen)
odds ratio = p(event) / p(not event) = p(event) / (1 − p(event))
606
• p = 0.8: odds = 0.8 / 0.2 = 4
– equivalent to 4:1 (odds on)
– 4 times out of five
• p = 0.2: odds = 0.2 / 0.8 = 0.25
– equivalent to 1:4 (4:1 against)
– 1 time out of five
607
• Now we have solved the upper bound problem– we can interpret 1.07, 2.07,
1000000.07
• But we still have the zero problem
– we cannot interpret predicted scores less than zero
608
Step 3: The Log
• Log10 of a number (x) is the power to which 10 must be raised to give x:
10^log10(x) = x
• log(10) = 1
• log(100) = 2
• log(1000) = 3
609
• log(1) = 0• log(0.1) = -1• log(0.00001) = -5
610
Natural Logs and e
• Don’t use log10
– Use loge
• Natural log, ln• Has some desirable properties, that
log10 doesn’t– For us– If y = ln(x) + c– dy/dx = 1/x– Not true for any other logarithm
611
• Be careful – calculators and stats packages are not consistent when they use log– Sometimes log10, sometimes loge
– Can prove embarrassing (a friend told me)
612
Take the natural log of the odds ratio
• Goes from - +– can interpret any predicted value
613
Putting them all together
• Logit transformation– log-odds ratio– not bounded at zero or one
614
Score            1      2      3      4      5
Fail   N         7      5      6      4      2
       P         0.7    0.5    0.6    0.4    0.2
Pass   N         3      5      4      6      8
       P         0.3    0.5    0.4    0.6    0.8
Odds (Fail)      2.33   1.00   1.50   0.67   0.25
log(odds) Fail   0.85   0.00   0.41   -0.41  -1.39
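The whole chain of transformations is two lines of R (using the fail probabilities from the table):

p <- c(0.7, 0.5, 0.6, 0.4, 0.2)   # P(fail) for scores 1-5
log(p / (1 - p))                  # 0.85  0.00  0.41  -0.41  -1.39
qlogis(p)                         # the same logits in one call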
615
[Plot of probability (0–1) against logit (−3.5 to +3.5): an S-shaped curve. Probability gets closer to zero, but never reaches it, as the logit goes down]
616
• Hooray! Problem solved, lesson over– errrmmm… almost
• Because we are now using log-odds ratio, we can’t use OLS– we need a new technique, called
Maximum Likelihood (ML) to estimate the parameters
617
Parameter Estimation using ML
ML tries to find estimates of model parameters that are most likely to give rise to the pattern of observations in the sample data
• All gets a bit complicated– OLS is a special case of ML– the mean is an ML estimator
618
• Don’t have closed form equations– must be solved iteratively– estimates parameters that are most
likely to give rise to the patterns observed in the data
– by maximising the likelihood function (LF)
• We aren’t going to worry about this– except to note that sometimes, the
estimates do not converge• ML cannot find a solution
619
Interpreting Output
Using SPSS
• Overall fit for:
– step (only used for stepwise)
– block (for hierarchical)
– model (always)
– in our model, all are the same
– χ² = 4.9, df = 1, p = 0.025
• analogous to the F test in OLS
620
Omnibus Tests of Model Coefficients

                 Chi-square   df   Sig.
Step 1   Step    4.990        1    .025
         Block   4.990        1    .025
         Model   4.990        1    .025
621
• Model summary
– -2LL (the deviance)
– Cox & Snell R2
– Nagelkerke R2
– Different versions of R2
• No real R2 in logistic regression• should be considered ‘pseudo R2’
622
Model Summary

Step   -2 Log likelihood   Cox & Snell R Square   Nagelkerke R Square
1      64.245              .095                   .127
623
• Classification Table– predictions of model– based on cut-off of 0.5 (by default)– predicted values x actual values
624
Classification Table (the cut value is .500)

                           Predicted PASS
Observed            0      1      Percentage Correct
Step 1   PASS  0    18     8      69.2
         PASS  1    12     12     50.0
Overall Percentage                60.0
625
Model parameters• B
– Change in the logged odds associated with a change of 1 unit in IV
– just like OLS regression– difficult to interpret
• SE (B)– Standard error– Multiply by 1.96 to get 95% CIs
626
Variables in the Equation (Step 1; variable entered: SCORE)

            B       S.E.    Wald
SCORE       -.467   .219    4.566
Constant    1.314   .714    3.390

Variables in the Equation (Step 1; variable entered: score)

            Sig.    Exp(B)   95.0% C.I. for Exp(B): Lower   Upper
score       .386    1.263    .744    2.143
Constant    .199    .323
627
• Constant – i.e. score = 0
– B = 1.314
– Exp(B) = e^B = e^1.314 = 3.720
– OR = 3.720
– p = 1 − (1 / (OR + 1)) = 1 − (1 / (3.720 + 1)) = 0.788
628
• Score = 1
– Constant B = 1.314
– Score B = −0.467
– Exp(1.314 − 0.467) = Exp(0.847) = 2.332
– OR = 2.332
– p = 1 − (1 / (2.332 + 1)) = 0.699
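A sketch of the model in R, using only the first ten cases listed earlier (so these estimates will not match the slides exactly):

score <- c(5, 1, 1, 4, 1, 1, 4, 1, 3, 4)
pass  <- c(0, 0, 0, 0, 1, 0, 1, 1, 0, 1)
m <- glm(pass ~ score, family = binomial)   # ML estimation of the logit model
coef(m)                                     # B: log-odds coefficients
exp(coef(m))                                # Exp(B): odds ratios
plogis(sum(coef(m) * c(1, 2)))              # predicted P(pass) for score = 2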
629
Standard Errors and CIs
• SPSS gives– B, SE B, exp(B) by default– Can work out 95% CI from standard
error– B ± 1.96 x SE(B)– Or ask for it in options
• Symmetrical in B– Non-symmetrical (sometimes very) in
exp(B)
630
Variables in the Equation (variable entered on step 1: SCORE)

            B       S.E.    Exp(B)   95.0% C.I. for Exp(B): Lower   Upper
SCORE       -.467   .219    .627     .408    .962
Constant    1.314   .714    3.720
631
• The odds of passing the test are multiplied by 0.63 (95% CIs = 0.408, 0.962; p = 0.033) for every additional point on the aptitude test.
632
More on Standard Errors
• In OLS regression– If a variable is added in a hierarchical fashion– The p-value associated with the change in R2
is the same as the p-value of the variable– Not the case in logistic regression
• In our data 0.025 and 0.033
• Wald standard errors
– make the p-values of the estimates wrong – too high
– (CIs still correct)
633
• Two estimates use slightly different information– P-value says “what if no effect”– CI says “what if this effect”
• Variance depends on the hypothesised ratio of the number of people in the two groups
• Can calculate likelihood ratio based p-values– If you can be bothered– Some packages provide them
automatically
634
Probit Regression
• Very similar to logistic– much more complex initial
transformation (to normal distribution)– Very similar results to logistic
(multiplied by 1.7)
• In SPSS:– A bit weird
• Probit regression available through menus
635
– But requires data structured differently
• However– Ordinal logistic regression is
equivalent to binary logistic• If outcome is binary
– SPSS gives option of probit
636
Results

                     Variable   Estimate   SE      P
Logistic (binary)    Score      0.288      0.301   0.339
                     Exp        0.147      0.073   0.043
Logistic (ordinal)   Score      0.288      0.301   0.339
                     Exp        0.147      0.073   0.043
Probit               Score      0.191      0.178   0.282
                     Exp        0.090      0.042   0.033
637
Differentiating Between Probit and Logistic
• Depends on shape of the error term– Normal or logistic– Graphs are very similar to each other
• Could distinguish quality of fit– Given enormous sample size
• Logistic = probit x 1.7– Actually 1.6998
• Probit advantage– Understand the distribution
• Logistic advantage– Much simpler to get back to the probability
638
[Plot of the cumulative normal (probit) and logistic curves from −3 to +3: almost indistinguishable]
639
Infinite Parameters
• Non-convergence can happen because of infinite parameters– Insoluble model
• Three kinds:• Complete separation
– The groups are completely distinct• Pass group all score more than 10• Fail group all score less than 10
640
• Quasi-complete separation– Separation with some overlap
• Pass group all score 10 or more• Fail group all score 10 or less
• Both cases:– No convergence
• Close to this– Curious estimates– Curious standard errors
641
• Categorical Predictors– Can cause separation– Esp. if correlated
• Need people in every cell
                     Male                  Female
                     White   Non-White    White   Non-White
Below Poverty Line
Above Poverty Line
642
Logistic Regression and Diagnosis
• Logistic regression can be used for diagnostic tests– For every score
• Calculate probability that result is positive• Calculate proportion of people with that score (or
lower) who have a positive result
• Calculate c statistic– Measure of discriminative power– %age of all possible cases, where the model
gives a higher probability to a correct case than to an incorrect case
643
– Perfect c-statistic = 1.0– Random c-statistic = 0.5
• SPSS doesn’t do it automatically– But easy to do
• Save probabilities– Use Graphs, ROC Curve– Test variable: predicted probability– State variable: outcome
644
Sensitivity and Specificity
• Sensitivity:– Probability of saying someone has a
positive result – • If they do: p(pos)|pos
• Specificity– Probability of saying someone has a
negative result• If they do: p(neg)|neg
645
Calculating Sens and Spec
• For each value– Calculate
• proportion of minority earning less – p(m)• proportion of non-minority earning less –
p(w)
– Sensitivity (value)• P(m)
646
Salary P(minority)
10 .39
20 .31
30 .23
40 .17
50 .12
60 .09
70 .06
80 .04
90 .03
647
Using Bank Data
• Predict minority group, using salary (000s)
– logit(minority) = −0.044 − 0.039 × salary
• Find actual proportions
648
ROC Curve
[Plot of Sensitivity against 1 − Specificity. Diagonal segments are produced by ties. The area under the curve is the c-statistic]
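A sketch of the same calculation in R with the pROC package (assumed installed; toy data from the earlier logistic sketch):

library(pROC)
score <- c(5, 1, 1, 4, 1, 1, 4, 1, 3, 4)
pass  <- c(0, 0, 0, 0, 1, 0, 1, 1, 0, 1)
m <- glm(pass ~ score, family = binomial)
r <- roc(pass, fitted(m))   # sensitivity against 1 - specificity
auc(r)                      # area under the curve = c-statistic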
649
More Advanced Techniques
• Multinomial Logistic Regression more than two categories in DV– same procedure– one category chosen as reference group
• odds of being in category other than reference
• Polytomous Logit Universal Models (PLUM)– Ordinal multinomial logistic regression– For ordinal outcome variables
650
Final Thoughts
• Logistic Regression can be extended– dummy variables– non-linear effects– interactions (even though we don’t
cover them until the next lesson)• Same issues as OLS
– collinearity– outliers
651
652
653
Lesson 12: Mediation and Path Analysis
654
Introduction
• Moderator– Level of one variable influences effect of
another variable
• Mediator– One variable influences another via a third
variable
• All relationships are really mediated– are we interested in the mediators?– can we make the process more explicit
655
• In examples with bank
education → beginning salary
• Why?– What is the process?– Are we making assumptions about
the process?– Should we test those assumptions?
656
[Diagram: education → beginning salary, with candidate mediators in between: job skills, expectations, negotiating skills, kudos for bank]
657
Direct and Indirect Influences
X may affect Y in two ways• Directly – X has a direct (causal)
influence on Y– (or maybe mediated by other
variables)
• Indirectly – X affects Y via a mediating variable - M
658
• e.g. how does going to the pub effect comprehension on a Summer school course– on, say, regression
[Diagram: Having fun in pub in evening → not reading books on regression → less knowledge. ‘Anything here?’ asks whether there is also a direct path]
659
[Diagram: as above, but adding fun in pub → fatigue → less knowledge. ‘Still needed?’ asks whether the direct path remains]
660
• Mediators needed– to cope with more sophisticated
theory in social sciences– make explicit assumptions made
about processes– examine direct and indirect influences
661
Detecting Mediation
662
4 StepsFrom Baron and Kenny (1986)• To establish that the effect of X on Y is
mediated by M1. Show that X predicts Y2. Show that X predicts M3. Show that M predicts Y, controlling for X4. If effect of X controlling for M is zero, M
is complete mediator of the relationship• (3 and 4 in same analysis)
663
Example: Book habits
[Diagram: Enjoy Books → Buy books → Read Books]
664
Three Variables
• Enjoy– How much an individual enjoys books
• Buy– How many books an individual buys
(in a year)• Read
– How many books an individual reads (in a year)
665
        ENJOY   BUY    READ
ENJOY   1.00    0.64   0.73
BUY     0.64    1.00   0.75
READ    0.73    0.75   1.00
666
• The Theory
enjoy → buy → read
667
• Step 1
1. Show that X (enjoy) predicts Y
(read)– b1 = 0.487, p < 0.001
– standardised b1 = 0.732
– OK
668
2. Show that X (enjoy) predicts M (buy)
– b1 = 0.974, p < 0.001
– standardised b1 = 0.643
– OK
669
3. Show that M (buy) predicts Y (read), controlling for X (enjoy)
– b1 = 0.469, p < 0.001
– standardised b1 = 0.206
– OK
670
4. If effect of X controlling for M is zero, M is complete mediator of the relationship
– (Same as analysis for step 3.)
– b2 = 0.287, p = 0.001
– standardised b2 = 0.431
– Hmmmm…• Significant, therefore not a complete
mediator
671
[Path diagram: enjoy → buy = 0.974 (from step 2); buy → read = 0.206 (from step 3); enjoy → read (direct) = 0.287 (step 4)]
672
The Mediation Coefficient
• Amount of mediation = Step 1 − Step 4 = 0.487 − 0.287 = 0.200
• OR: Step 2 × Step 3 = 0.974 × 0.206 = 0.200
673
SE of Mediator
• sa = se(a)
• sb = se(b)
[Path diagram: enjoy → buy = a (from step 2); buy → read = b (from step 3)]
674
• Sobel test
– Standard error of the mediation coefficient can be calculated:

se(ab) = √( a²·sb² + b²·sa² − sa²·sb² )

a = 0.974, sa = 0.189
b = 0.206, sb = 0.054
675
• Indirect effect = 0.200
– se = 0.056
– t = 3.52, p = 0.001
• Online Sobel test: http://www.unc.edu/~preacher/sobel/sobel.htm
– (Won’t be there for long; probably will be somewhere else)
676
A Note on Power
• Recently
– Move in the methodological literature away from this conventional approach
– Problems of power: several tests, all of which must be significant
• Joint Type I error rate = 0.05 x 0.05 = 0.0025
• Must affect power
– Bootstrapping suggested as alternative
• See Papers B7, A4, B9
• B21 for SPSS syntax
677
678
679
Lesson 13: Moderators in Regression
“different slopes for different folks”
680
Introduction
• Moderator relationships have many different names
– interactions (from ANOVA)
– multiplicative
– non-linear (just confusing)
– non-additive
• All talking about the same thing
681
A moderated relationship occurs • when the effect of one variable
depends upon the level of another variable
682
• Hang on …
– That seems very like a non-linear relationship
– Moderator
• Effect of one variable depends on level of another
– Non-linear
• Effect of one variable depends on level of itself
• Where there is collinearity
– Can be hard to distinguish between them
– Paper in handbook (B5)
– Should (usually) compare effect sizes
683
• e.g. How much it hurts when I drop a computer on my foot depends on
– x1: how much alcohol I have drunk
– x2: how high the computer was dropped from
– but if x1 is high enough, x2 will have no effect
684
• e.g. Likelihood of injury in a car accident depends on
– x1: speed of car
– x2: if I was wearing a seatbelt
– but if x1 is low enough, x2 will have no effect
685
[Graph: injury against speed (mph, 5–45), separate lines for seatbelt and no seatbelt]
686
• e.g. number of words (from a list) I can remember depends on
– x1: type of words (abstract, e.g. ‘justice’, or concrete, e.g. ‘carrot’)
– x2: method of testing (recognition – i.e. multiple choice – or free recall)
– but if using recognition, x1 will not make a difference
687
• We looked at three kinds of moderator
• alcohol x height = pain
– continuous x continuous
• speed x seatbelt = injury
– continuous x categorical
• word type x test type
– categorical x categorical
• We will look at them in reverse order
688
How do we know to look for moderators?
Theoretical rationale
• Often the most powerful
• Many theories predict additive/linear effects
– Fewer predict moderator effects
Presence of heteroscedasticity
• Clue there may be a moderated relationship missing
689
Two Categorical Predictors
690
Data
• 2 IVs
– word type (concrete [1], abstract [2])
– test method (recog [1], recall [2])
• 20 participants, in one of four groups
– 1,1 / 1,2 / 2,1 / 2,2
• 5 per group
• lesson12.1.sav
691
              Concrete  Abstract  Total
Recog   Mean   15.40     15.20    15.30
        SD      2.19      2.59     2.26
Recall  Mean   15.60      6.60    11.10
        SD      1.67      7.44     6.95
Total   Mean   15.50     10.90    13.20
        SD      1.84      6.94     5.47
692
• Graph of means
[Graph: mean score by test (1.00, 2.00), separate lines for each word type (1.00, 2.00)]
693
ANOVA Results
• Standard way to analyse these data would be to use ANOVA
– Words: F=6.1, df=1, 16, p=0.025
– Test: F=5.1, df=1, 16, p=0.039
– Words x Test: F=5.6, df=1, 16, p=0.031
694
Procedure for Testing
1: Convert to effect coding
• can use dummy coding, though collinearity is less of an issue with effect coding
• doesn’t make any difference to substantive interpretation
2: Calculate interaction term
• In ANOVA the interaction is automatic
• In regression we create an interaction variable
695
• Interaction term (wxt)
– multiply effect coded variables together

word  test  wxt
 -1    -1    1
  1    -1   -1
 -1     1   -1
  1     1    1
696
3: Carry out regression
• Hierarchical
– linear effects first
– interaction effect in the next block (see the R sketch below)
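A minimal sketch in R, assuming the 20 cases sit in a data frame d with word and test already effect coded as -1/1 (the names are hypothetical):

d$wxt <- d$word * d$test                        # interaction term: product of effect codes
m1 <- lm(score ~ word + test, data = d)         # block 1: linear effects
m2 <- lm(score ~ word + test + wxt, data = d)   # block 2: add the interaction
anova(m1, m2)                                   # change in R2 test for the interaction
summary(m2)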
697
• b0 = 13.2
• b1 (words) = -2.3, p=0.025
• b2 (test) = -2.1, p=0.039
• b3 (words x test) = -2.2, p=0.031
• Might need to use change in R2 to test sig of interaction, because of collinearity
What do these mean?
• b0 (intercept) = predicted value of Y (score) when all X = 0
– i.e. the central point
698
• b0 = 13.2
– grand mean
• b1 = -2.3
– distance from grand mean to the mean for the two word types
– 13.2 – (-2.3) = 15.5
– 13.2 + (-2.3) = 10.9

         Concrete  Abstract  Total
Recog    15.40     15.20     15.30
Recall   15.60      6.60     11.10
Total    15.50     10.90     13.20
699
• b2 = -2.1
– distance from grand mean to the recog and recall means
• b3 = -2.2
– to understand b3 we need to look at predictions from the equation without this term
Score = 13.2 + (-2.3)w + (-2.1)t
700
Score = 13.2 + (-2.3)w + (-2.1)t
• So for each group we can calculate an expected value
701
W  T     Word  Test  Expected Value
C  Cog   -1    -1    13.2 + (-2.3)(-1) + (-2.1)(-1) = 17.6
C  Call  -1     1    13.2 + (-2.3)(-1) + (-2.1)(1)  = 13.4
A  Cog    1    -1    13.2 + (-2.3)(1)  + (-2.1)(-1) = 13.0
A  Call   1     1    13.2 + (-2.3)(1)  + (-2.1)(1)  =  8.8

b1 = -2.3, b2 = -2.1
702
• The exciting part comes when we look at the differences between the actual value and the value in the 2 IV model

W  T     Word  Test  Exp    Actual
C  Cog   -1    -1    17.6   15.4
C  Call  -1     1    13.4   15.6
A  Cog    1    -1    13.0   15.2
A  Call   1     1     8.8    6.6
703
• Each difference = 2.2 (or –2.2)
• The value of b3 was –2.2
– the interaction term is the correction required to the slope when the second IV is included
704
• Examine the slope across test type (both word groups combined)
[Graph: mean score against test type, Recog (-1) to Recall (1)]
Gradient = (11.1 - 15.3) / 2 = -2.1
705
• Add the slopes for the two word groups
[Graph: mean score against test type, Recog (-1) to Recall (1), separate line per word group]
Both word groups: gradient = -2.1
Abstract: (6.6 - 15.2)/2 = -4.3
Concrete: (15.6 - 15.4)/2 = 0.1
706
b associated with interaction
• the change in slope, away from the average, associated with a 1 unit change in the moderating variable
• OR
• Half the difference in the slopes
707
• Another way to look at it
Y = 13.2 + (-2.3)w + (-2.1)t + (-2.2)wt
• Examine the concrete words group (w = -1)
– substitute values into the equation
Y(concrete) = 13.2 + (-2.3)(-1) + (-2.1)t + (-2.2)(-1)t
Y(concrete) = 13.2 + 2.3 + (-2.1)t + 2.2t
Y(concrete) = 15.5 + 0.1t
• The effect of changing test type for concrete words (the slope, which is half the actual difference)
708
Why go to all that effort? Why not do ANOVA in the first place?
1. That is what ANOVA actually does
• if it can handle an unbalanced design (i.e. different numbers of people in each group)
• Helps to understand what can be done with ANOVA
• SPSS uses regression to do ANOVA
2. Helps to clarify more complex cases
• as we shall see
709
Categorical x Continuous
710
Note on Dichotomisation
• Very common to see people dichotomise a variable
– Makes the analysis easier
– Very bad idea
• Paper B6
711
Data
A chain of 60 supermarkets
• examining the relationship between profitability, shop size, and local competition
• 2 IVs
– shopsize
– comp (local competition, 0=no, 1=yes)
• DV
– profit
712
• Data, ‘lesson 12.2.sav’

Shopsize  Comp  Profit
   4       1      23
  10       1      25
   7       0      19
  10       0       9
  10       1      18
  29       1      33
  12       0      17
   6       1      20
  14       0      21
  62       0       8
713
1st Analysis
Two IVs
• R2 = 0.367, df = 2, 57, p < 0.001
• Unstandardised estimates
– b1 (shopsize) = 0.083 (p=0.001)
– b2 (comp) = 5.883 (p<0.001)
• Standardised estimates
– b1 (shopsize) = 0.356
– b2 (comp) = 0.448
714
• Suspicions
– Presence of competition is likely to have an effect
– Residual plot shows a little heteroscedasticity
[Residual plot: standardised residuals against standardised predicted values]
715
Procedure for Testing
• Very similar to last time
– convert ‘comp’ to effect coding
– -1 = no competition
– 1 = competition
– Compute interaction term
• comp (effect coded) x size
– Hierarchical regression (sketched below)
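A minimal sketch in R, assuming the data have been read into a data frame shops (the name is hypothetical):

shops$compE <- ifelse(shops$comp == 1, 1, -1)   # effect coding: 1 = competition, -1 = none
shops$sxc   <- shops$shopsize * shops$compE     # interaction term
m1 <- lm(profit ~ shopsize + compE, data = shops)
m2 <- lm(profit ~ shopsize + compE + sxc, data = shops)
anova(m1, m2)                                   # test of the moderator effect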
716
Result
• Unstandardised estimates
– b1 (shopsize) = 0.071 (p=0.006)
– b2 (comp) = -1.67 (p=0.506)
– b3 (sxc) = -0.050 (p=0.050)
• Standardised estimates
– b1 (shopsize) = 0.306
– b2 (comp) = -0.127
– b3 (sxc) = -0.389
717
• comp now non-significant
– shows the importance of the hierarchical approach
– it obviously is important
718
Interpretation
• Draw graph with lines of best fit
– drawn automatically by SPSS
• Interpret equation by substitution of values
– evaluate effects of
• size
• competition
719
[Graph: profit (0–40) against shopsize (0–100), with lines of best fit for competition, no competition, and all shops]
720
• Effects of size
– in presence and absence of competition
– (can ignore the constant)
Y = 0.071x1 + (-1.67)x2 + (-0.050)x1x2
– Competition present (x2 = 1):
Y = 0.071x1 + (-1.67) + (-0.050)x1
Y = 0.021x1 – 1.67
721
Y = 0.071x1 + (-1.67)x2 + (-0.050)x1x2
– Competition absent (x2 = -1):
Y = 0.071x1 + 1.67 + 0.050x1
Y = 0.121x1 + 1.67
722
Two Continuous Variables
723
Data
• Bank Employees
– only using clerical staff
– 363 cases
– predicting starting salary from:
– previous experience
– age
– age x experience
724
• Correlation matrix
– only one correlation significant

           LOGSB  AGESTART  PREVEXP
LOGSB       1.00    -0.09     0.08
AGESTART   -0.09     1.00     0.77
PREVEXP     0.08     0.77     1.00
725
Initial Estimates (no moderator)
• (standardised)
– R2 = 0.061, p<0.001
– Age at start = -0.37, p<0.001
– Previous experience = 0.36, p<0.001
• Suppressing each other
– Age and experience compensate for one another
– Older, with no experience: bad
– Younger, with experience: good
726
The Procedure
• Very similar to previous
– create multiplicative interaction term
– BUT
• Need to eliminate effects of means
– they cause massive collinearity
• and SDs
– they cause one variable to dominate the interaction term
• By standardising
727
• To standardise x,
– subtract the mean, and divide by the SD
– re-expresses x in terms of distance from the mean, in SDs
– i.e. z-scores
• Hint: automatic in SPSS in Descriptives
• Create interaction term of age and exp
– axe = z(age) x z(exp)
728
• Hierarchical regression
– two linear effects first
– moderator effect in second
– hint: it is often easier to interpret if standardised versions of all variables are used (sketched below)
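A minimal sketch in R, assuming a data frame bank with hypothetical variable names agestart, prevexp and logsb:

bank$zage <- scale(bank$agestart)[, 1]   # standardise: mean 0, SD 1
bank$zexp <- scale(bank$prevexp)[, 1]
bank$axe  <- bank$zage * bank$zexp       # interaction of the standardised variables
m1 <- lm(logsb ~ zage + zexp, data = bank)
m2 <- lm(logsb ~ zage + zexp + axe, data = bank)
anova(m1, m2)                            # change in R2 for the moderator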
729
• Change in R2
– 0.085, p<0.001
• Estimates (standardised)
– b1 (exp) = 0.104
– b2 (agestart) = -0.54
– b3 (age x exp) = -0.54
730
Interpretation 1: Pick-a-Point
• Graph is tricky
– can’t have two continuous variables
– Choose specific points (pick-a-point)
• Graph the line of best fit of one variable at values of the other
– Two ways to pick a point
• 1: Choose high (z = +1), medium (z = 0) and low (z = -1)
• 2: Choose ‘sensible’ values – age 20, 50, 80?
731
• We know:
– Y = 0.10e + (-0.54)a + (-0.54)ae
– Where a = agestart, and e = experience
• We can rewrite this as:
– Y = (0.10e) + (-0.54)a + (-0.54)ea
– Take a out of the brackets
– Y = (0.10e) + (-0.54 + (-0.54)e)a
• Bracketed terms are the simple intercept and simple slope
– β0 = 0.10e
– β1 = -0.54 + (-0.54)e
– Y = β0 + β1a
732
• Pick any value of e, and we know the slope for a
– Standardised, so it’s easy
• e = -1: β0 = -0.10; β1 = (-0.54 + (-1)(-0.54)) = 0.00
• e = 0: β0 = 0; β1 = -0.54
• e = 1: β0 = 0.10; β1 = (-0.54 + (1)(-0.54)) = -1.08
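The same arithmetic as a small R sketch, using the coefficients from the slides:

b_exp <- 0.10; b_age <- -0.54; b_int <- -0.54
e <- c(-1, 0, 1)                        # low, medium, high experience
simple_intercept <- b_exp * e
simple_slope     <- b_age + b_int * e   # slope for age at each level of e
cbind(e, simple_intercept, simple_slope)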
733
Graph the Three Lines
[Graph: log(salary) against age, with lines of best fit for e = -1, e = 0 and e = 1]
734
Interpretation 2: P-Values and CIs
• Second way
– Newer, rarely done
• Calculate CIs of the slope
– At any point
• Calculate p-value
– At any point
• Give ranges of significance
735
What do you need?
• The variances and covariances of the estimates
– SPSS doesn’t provide estimates for the intercept
– Need to do it manually
• In options, exclude the intercept
– Create an intercept variable – c = 1
– Use it in the regression
736
• Enter information into web page:
– www.unc.edu/~preacher/interact/acov.htm
– (Again, may not be around for long)
• Get results
• Calculations in Bauer and Curran (in press: Multivariate Behavioral Research)
– Paper B13
737
[Graph: MLR 2-Way Interaction Plot – Y against X, with lines for three values of the moderator]
738
Areas of Significance
[Graph: confidence bands around the simple slope across the range of experience]
739
• 2 complications
– 1: Constant differed
– 2: DV was logged, hence non-linear
• effect of 1 unit depends on where the unit is
– Can use SPSS to do graphs showing lines of best fit for different groups
– See paper A2
740
Finally …
741
Unlimited Moderators
• Moderator effects are not limited to
– 2 variables
– linear effects
742
Three Interacting Variables
• Age, Sex, Exp
• Block 1
– Age, Sex, Exp
• Block 2
– Age x Sex, Age x Exp, Sex x Exp
• Block 3
– Age x Sex x Exp (sketched in R below)
743
• Results
– All two-way interactions significant
– Three-way not significant
– Effect of age depends on sex
– Effect of experience depends on sex
– Size of the age x experience interaction does not depend on sex (phew!)
744
Moderated Non-Linear Relationships
• Enter the non-linear effect
• Enter non-linear effect x moderator
– if significant, indicates the degree of non-linearity differs by moderator
745
746
Modelling Counts: Poisson Regression
Lesson 14
747
Counts and the Poisson Distribution
• Von Bortkiewicz (1898)
– Numbers of Prussian soldiers kicked to death by horses

Deaths  Frequency
0       109
1        65
2        22
3         3
4         1
5         0

[Bar chart of these frequencies]
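A quick check in R: the mean of these data is (0x109 + 1x65 + 2x22 + 3x3 + 4x1) / 200 = 0.61, and Poisson probabilities with that mean reproduce the observed frequencies closely:

deaths <- 0:5
freq   <- c(109, 65, 22, 3, 1, 0)
lambda <- sum(deaths * freq) / sum(freq)     # mean count = 0.61
round(sum(freq) * dpois(deaths, lambda), 1)  # expected counts: roughly 109, 66, 20, 4, 1, 0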
748
• The data fitted a Poisson probability distribution
– When counts of events occur, the Poisson distribution is common
– E.g. papers published by researchers, police arrests, number of murders, ship accidents
• Common approach
– Log transform and treat as normal
• Problems
– Censored at 0
– Integers only allowed
– Heteroscedasticity
749
The Poisson Distribution
[Graph: Poisson probabilities for counts 0–17, for several values of the mean]
750
p(y|x) = exp(-μ) μ^y / y!
751
• Where:
– y is the count
– μ is the mean of the Poisson distribution
• In a Poisson distribution
– The mean = the variance (hence the heteroscedasticity issue)

p(y|x) = exp(-μ) μ^y / y!
752
Poisson Regression in SPSS
• Not directly available
– SPSS can be tweaked to do it in three ways:
– General loglinear model (genlog)
– Non-linear regression (CNLR)
• Bootstrapped p-values only
– Both are quite tricky
• SPSS 15,
753
Example Using Genlog
• Number of shark bites on different colour surfboards
– 100 surfboards: 50 red, 50 blue
• Weight cases by bites
• Analyse, Loglinear, General
– Colour is factor
[Bar chart: frequency of boards with 0–4 bites, for blue and red]
754
Results
Correspondence Between Parameters and Terms of the Design

Parameter  Aliased  Term
1                   Constant
2                   [COLOUR = 1]
3          x        [COLOUR = 2]

Note: 'x' indicates an aliased (or redundant) parameter. These parameters are set to zero.
755
                                   Asymptotic 95% CI
Param  Est.    SE     Z-value     Lower  Upper
1      4.1190  .1275  32.30       3.87   4.37
2      -.5495  .2108  -2.61       -.96   -.14
3      .0000   .      .           .      .

• Note: Intercept (param 1) is curious
• Param 2 is the difference in the means
756
SPSS: Continuous Predictors
• Bleedin’ nightmare
• http://www.spss.com/tech/answer/details.cfm?tech_tan_id=100006204
757
Poisson Regression in Stata
• SPSS will save a Stata file
• Open it in Stata
• Statistics, Count outcomes, Poisson regression
758
Poisson Regression in R
• R is a freeware program
– Similar to S-Plus
– www.r-project.org
• Steep learning curve to start with
• Much nicer for doing Poisson (and other) regression analysis
http://www.stat.lsa.umich.edu/~faraway/book/
http://www.jeremymiles.co.uk/regressionbook/extras/appendix2/R/
759
• Commands in R• Stage 1: enter data
– colour <- c(1, 0, 1, 0, 1, 0 … 1)– bites <- c(3, 1, 0, 0, … )
• Run analysis– p1 <- glm(bites ~ colour, family = poisson)
• Get results– summary.glm(p1)
760
R Results
Coefficients:
            Estimate  Std. Error  z value  Pr(>|z|)
(Intercept) -0.3567   0.1686      -2.115   0.03441 *
colour       0.5555   0.2116       2.625   0.00866 **

• Results for colour
– Same as SPSS
• For the intercept, different (weird SPSS parameterisation)
761
Predicted Values
• Need to take the exponential of parameter estimates
– Like logistic regression
• Exp(0.555) = 1.74
– You are likely to be bitten by a shark 1.74 times more often with a red surfboard
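In R this is a one-liner, using the p1 model fitted earlier:

exp(coef(p1))   # exp(intercept) = mean count when colour = 0; exp(colour) = 1.74, the rate ratio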
762
Checking Assumptions
• Was it really Poisson distributed?
– For Poisson, σ² = μ
• As the mean increases, the variance should also increase
– Residuals should be random
• Overdispersion is a common problem
• Too many zeroes
• For blue: μ = exp(-0.3567) = 0.70
• For red: μ = exp(-0.3567 + 0.555) = 1.22
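A quick way to eyeball the mean = variance assumption in R, using the bites and colour vectors from before:

tapply(bites, colour, mean)   # mean bites per colour group
tapply(bites, colour, var)    # should be close to the means if Poisson holds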
763
• Strictly, each case has its own predicted mean:

p(y_i | x_i) = exp(-μ̂_i) μ̂_i^(y_i) / y_i!
764
Compare Predicted with Actual Distributions
[Charts: expected versus actual probabilities of 0–4 bites, for blue and for red surfboards]
765
Overdispersion
• Problem in Poisson regression
– Too many zeroes
• Causes
– χ² inflation
– Standard error deflation
• Hence p-values too low
– Higher Type I error rate
• Solution
– Negative binomial regression
766
Using R
• R can read an SPSS file
– But you have to ask it nicely
• Click the Packages menu, Load package, choose “foreign”
• Click File, Change dir
– Change to the folder that contains your data
767
More on R
• R uses objects
– To place something into an object use <-
– X <- Y
• Puts Y into X
• The function is read.spss()
– Mydata <- read.spss(“spssfilename.sav”)
• Variables are then referred to as Mydata$VAR1
– Note 1: R is case sensitive
– Note 2: SPSS variable names come back in capitals
768
GLM in R
• Command
– glm(outcome ~ pred1 + pred2 + … + predk [, family = familyname])
– If no familyname, default is OLS
• Use binomial for logistic, poisson for Poisson
• Output is a GLM object
– You need to give this a name
– my1stglm <- glm(outcome ~ pred1 + pred2 + … + predk [, family = familyname])
769
• Then need to explore the result
– summary(my1stglm)
• To explore what it means
– Need to plot regressions
• Easiest is to use Excel
770
771
Introducing Structural Equation Modelling
Lesson 15
772
Introduction
• Related to regression analysis
– All (OLS) regression can be considered as a special case of SEM
• Power comes from adding restrictions to the model
• SEM is a system of equations
– Estimate those equations
773
Regression as SEM
• Grades example
– Grade = constant + books + attend + error
• Looks like a regression equation
– Also:
– Books correlated with attend
– Explicit modelling of error
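The slides below use AMOS; purely as an aside, the same model can be sketched in R with the lavaan package (not part of the course; grades is a hypothetical data frame):

library(lavaan)
model <- '
  grade ~ books + attend    # regression paths
  books ~~ attend           # explicit correlation between predictors
'
fit <- sem(model, data = grades)
summary(fit, standardized = TRUE)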
774
Path Diagram
• The system of equations is usefully represented in a path diagram
[Key: rectangle = measured variable; ellipse (e) = unmeasured variable; single-headed arrow = regression; double-headed arrow = correlation]
775
Path Diagram for Regression
[Path diagram: Books and Attend (correlated) → Grade, with an error term on Grade]
• Must usually explicitly model error
• Must explicitly model correlation
776
Results
• Unstandardised
[Path diagram: BOOKS → GRADE 4.04; ATTEND → GRADE 1.28; BOOKS ↔ ATTEND 2.65; variances: BOOKS 2.00, ATTEND 17.84; e → GRADE 13.52, with var(e) fixed at 1.00]
777
Standardised
[Path diagram: BOOKS → GRADE .35; ATTEND → GRADE .33; BOOKS ↔ ATTEND .44; e → GRADE .82]
778
Table

                  Estimate  S.E.  C.R.  P     St. Est.
GRADE <-- BOOKS   4.04      1.71  2.36  0.02  0.35
GRADE <-- ATTEND  1.28      0.57  2.25  0.03  0.33
GRADE <-- e       13.52     1.53  8.83  0.00  0.82
GRADE (constant)  37.38     7.54  4.96  0.00

SPSS Coefficients (Dependent Variable: GRADE)

             B      Std. Error  Beta  Sig.
(Constant)   37.38  7.74              .00
BOOKS        4.04   1.75        .35   .03
ATTEND       1.28   .59         .33   .04
779
So What Was the Point?
• Regression is a special case
• Lots of other cases
• Power of SEM
– Power to add restrictions to the model
• Restrict parameters
– To zero
– To the value of other parameters
– To 1
780
Restrictions
• Questions
– Is a parameter really necessary?
– Are a set of parameters necessary?
– Are parameters equal?
• Each restriction adds 1 df
– Test of model with χ²
781
The χ² Test
• Can the proposed model have generated the data?
– Test of significance of the difference between model and data
– Statistically significant result
• Bad
– Theoretically driven
• Start with the model
• Don’t start with the data
782
Regression Again
• Both estimates restricted to zero
[Path diagram: BOOKS → GRADE and ATTEND → GRADE both fixed at 0; error term (0, 1) retained]
783
• Two restrictions
– 2 df for the χ² test
– χ² = 15.9, p = 0.0003
• This test is (asymptotically) equivalent to the F test in regression
– We still haven’t got any further
784
Multivariate Regression
[Path diagram: predictors x1 and x2, outcomes y1, y2 and y3; all paths and covariances free]
785
[Same diagram] Test of all x’s on all y’s (6 restrictions = 6 df)
786
[Same diagram] Test of x1 on all y’s (3 restrictions)
787
[Same diagram] Test of all x1 on all y1 (3 restrictions)
788
[Same diagram] Test of all 3 partial correlations between y’s, controlling for x’s (3 restrictions)
789
Path Analysis and SEM
• More complex models
– can add more restrictions
– E.g. mediator model
• 1 restriction
– No path from enjoy → read
[Path diagram: ENJOY → BUY → READ, with errors e_buy and e_read; no direct ENJOY → READ path]
790
Result
• χ² = 10.9, 1 df, p = 0.001
• Not a complete mediator
– An additional path is required
791
Multiple Groups
• Same model
– Different people
• Equality constraints between groups
– Means, correlations, variances, regression estimates
– E.g. males and females
792
Multiple Groups Example
• Age
• Severity of psoriasis
– SEVE – in emotional areas
• Hands, face, forearm
– SEVNONE – in non-emotional areas
• Anxiety
• Depression
793
Correlations (females, n = 110)

          AGE    SEVE   SEVNONE  GHQ_A  GHQ_D
AGE       1      -.270  -.248    .017   .035
SEVE      -.270  1      .665     .045   .075
SEVNONE   -.248  .665   1        .109   .096
GHQ_A     .017   .045   .109     1      .782
GHQ_D     .035   .075   .096     .782   1
794
Correlations (males, n = 79)

          AGE    SEVE   SEVNONE  GHQ_A  GHQ_D
AGE       1      -.243  -.116    -.195  -.190
SEVE      -.243  1      .671     .456   .453
SEVNONE   -.116  .671   1        .210   .232
GHQ_A     -.195  .456   .210     1      .800
GHQ_D     -.190  .453   .232     .800   1
795
Model
[Path diagram: AGE → SEVE and AGE → SEVNONE; SEVE and SEVNONE → Dep and Anx; errors e_s, e_sn, e_d, e_a]
796
Females
[Path diagram, standardised estimates for females: AGE → SEVE -.27, AGE → SEVNONE -.25; all four severity → Dep/Anx paths small (-.04 to .15); Dep and Anx residuals correlate .78]
797
Males
[Path diagram, standardised estimates for males: AGE → SEVE -.24, AGE → SEVNONE -.12; SEVE → Dep/Anx around .5 (.52, .55), SEVNONE → Dep/Anx small negative (-.12, -.17); Dep and Anx residuals correlate .74]
798
Constraint
• sevnone → dep
– Constrained to be equal for males and females
• 1 restriction, 1 df
– χ² = 1.3 – not significant
• 4 restrictions
– 2 severities → anx & dep
799
• 4 restrictions, 4 df
– χ² = 1.3, p = 0.014
• Parameters are not equal
800
Missing Data: The big advantage
• SEM programs tend to deal with missing data
– Multiple imputation
– Full Information (Direct) Maximum Likelihood
• Asymptotically equivalent
• Data can be MAR, not just MCAR
801
Power: A Smaller Advantage
• Power for regression gets tricky with large models
• With SEM power is (relatively) easy
– It’s all based on chi-square
– Paper B14
802
Lesson 16: Dealing with clustered data & longitudinal models
803
The Independence Assumption
• In Lesson 8 we talked about independence
– The residual of any one case should not tell you about the residual of any other case
• Particularly problematic when:
– Data are clustered on the predictor variable
• E.g. predictor is household size, cases are members of the family
• E.g. predictor is doctor training, outcome is the doctor’s patients
– Data are longitudinal
• Have people measured over time
– It’s the same person!
804
Clusters of Cases
• Problem with cluster (group) randomised studies
– Or group effects
• Use the Huber-White sandwich estimator
– Tell it about the groups
– Correction is made
– Use Complex Samples in SPSS
805
Complex Samples
• As with Huber-White for heteroscedasticity
– Add a variable that tells it about the clusters
– Put it into clusters
• Run GLM
– As before
• Warning:
– Need about 20 clusters for solutions to be stable
806
Example
• People randomised by week to one of two forms of triage
– Compare the total cost of treating each
• Ignore clustering
– Difference is £2.40 per person, with 95% confidence interval £0.58 to £4.22, p = 0.010
• Include clustering
– Difference is still £2.40, with 95% CI -£0.85 to £5.65, and p = 0.141
• Ignoring clustering led to a Type I error
807
Longitudinal Research
• For comparing repeated measures
– Clusters are people
– Can model the repeated measures over time
• Data are usually short and fat

ID  V1  V2  V3  V4
1   2   3   4   7
2   3   6   8   4
3   2   5   7   5
808
Converting Data
• Change data to tall and thin
• Use Data, Restructure in SPSS
• Clusters are ID

ID  V  X
1   1  2
1   2  3
1   3  4
1   4  7
2   1  3
2   2  6
2   3  8
2   4  4
3   1  2
3   2  5
3   3  7
3   4  5
809
(Simple) Example
• Use employee data.sav
– Compare beginning salary and salary
– Would normally use a paired samples t-test
• Difference = $17,403, 95% CIs $16,427.407 to $18,379.555
810
Restructure the Data
• Do it again
– With data tall and thin
• Complex GLM with Time as factor
– ID as cluster
• Difference = $17,403, 95% CIs = $16,427.407 to $18,379.555 – the same answer

ID  Time  Cash
1   1     $18,750
1   2     $21,450
2   1     $12,000
2   2     $21,900
3   1     $13,200
3   2     $45,000
811
Interesting …
• That wasn’t very interesting
– What is more interesting is when we have multiple measurements of the same people
• Can plot and assess trajectories over time
812
Single Person Trajectory
[Graph: one person’s measurements plotted against time]
813
Multiple Trajectories: What’s the Mean and SD?
[Graph: several individual trajectories plotted against time]
814
Complex Trajectories
• An event occurs
– Can have two effects:
– A jump in the value
– A change in the slope
• The event doesn’t have to happen at the same time for each person
– Doesn’t have to happen at all
815
[Diagram: trajectory with Slope 1 before the event, a Jump at the point the event occurs, and Slope 2 afterwards]
816
Parameterising

Time  Event  Time2  Outcome
1     0      0      12
2     0      0      13
3     0      0      14
4     0      0      15
5     0      0      16
6     1      0      10
7     1      1      9
8     1      2      8
9     1      3      7
817
Draw the Line
What are the parameter estimates?
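Because the table above is deterministic, the estimates can be read off directly, or checked with a quick R fit (assuming the columns sit in a data frame d):

m <- lm(outcome ~ time + event + time2, data = d)
coef(m)   # intercept 11, time 1, event -7, time2 -2:
          # Y = 11 + 1*time - 7*event - 2*time2 reproduces every row exactly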
818
Main Effects and Interactions
• Main effects
– Intercept differences
• Moderator effects
– Slope differences
819
Multilevel Models
• Fixed versus random effects
– Fixed effects are fixed across individuals (or clusters)
– Random effects have variance
• Levels
– Level 1 – individual measurement occasions
– Level 2 – higher order clusters
820
More on Levels
• NHS direct study
– Level 1 units: …………….
– Level 2 units: ……………
• Widowhood food study
– Level 1 units: ……………
– Level 2 units: ……………
821
More Flexibility
• Three levels:
– Level 1: measurements
– Level 2: people
– Level 3: schools
822
More Effects
• Variances and covariances of effects
• Level 1 and level 2 residuals
– Makes R2 difficult to talk about
• Outcome variable
– Yij
• The score of the ith person in the jth group
823
Y    i  j
2.3  1  1
3.2  2  1
4.5  3  1
4.8  1  2
7.2  2  2
3.1  3  2
1.6  4  2
824
Notation
• Notation gets a bit horrid
– Varies a lot between books and programs
• We used to have b0 and b1
– If fixed, that’s fine
– If random, each person has their own intercept and slope
825
Standard Errors
• The intercept has a standard error
• Slopes have standard errors
• Random effects have variances
– Those variances have standard errors
• Is there statistically significant variation between higher level units (people)?
• OR
• Is everyone the same?
826
Programs
• Since version 12
– Can do this in SPSS
– Can’t do anything really clever
• Menus
– Completely unusable
– Have to use syntax
827
SPSS Syntax
• MIXED
• relfd with time
• /fixed = time
• /random = intercept time | subject(id) covtype(un)
• /print = solution.
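For comparison, a sketch of the same model in R's lme4 package (not covered in the course; assumes the tall-and-thin data in a data frame d with variables relfd, time and id):

library(lme4)
# fixed effect of time; random intercept and slope per id, with unstructured covariance
m <- lmer(relfd ~ time + (1 + time | id), data = d)
summary(m)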
828
SPSS Syntax
• MIXED
• relfd with time
– relfd is the (continuous) outcome; time is a continuous predictor
829
SPSS Syntax
• MIXED
• relfd with time
• /fixed = time
– Must specify the effect as fixed first
830
SPSS Syntax
• MIXED
• relfd with time
• /fixed = time
• /random = intercept time | subject(id) covtype(un)
– Specify the random effects: intercept and time are random
– SPSS assumes that your level 2 units are subjects, and needs to know the id variable
831
SPSS Syntax
• MIXED
• relfd with time
• /fixed = time
• /random = intercept time | subject(id) covtype(un)
– Covariance matrix of random effects is unstructured. (Alternatives are id – identity – or vc – variance components.)
832
SPSS Syntax
• MIXED
• relfd with time
• /fixed = time
• /random = intercept time | subject(id) covtype(un)
• /print = solution.
– Print the answer
833
The Output
• Information criteria
– We’ll come back to these

Information Criteria (smaller is better; dependent variable: relfd)
-2 Restricted Log Likelihood          64899.758
Akaike's Information Criterion (AIC)  64907.758
Hurvich and Tsai's Criterion (AICC)   64907.763
Bozdogan's Criterion (CAIC)           64940.134
Schwarz's Bayesian Criterion (BIC)    64936.134
834
Fixed Effects
• Not useful here; useful for interactions

Type III Tests of Fixed Effects (dependent variable: relfd)
Source     Numerator df  Denominator df  F         Sig.
Intercept  1             741             3251.877  .000
time       1             741.000         2.550     .111
835
Estimates of Fixed Effects
• Interpreted as a regression equation

Estimates of Fixed Effects (dependent variable: relfd)
Parameter  Estimate  Std. Error  df   t       Sig.  95% CI (Lower, Upper)
Intercept  21.90     .38         741  57.025  .000  21.15, 22.66
time       -.06      .04         741  -1.597  .111  -.14, .01
836
Covariance Parameters

Estimates of Covariance Parameters (dependent variable: relfd)
Parameter                                Estimate  Std. Error
Residual                                 64.11577  1.0526353
Intercept + time [subject = id] UN(1,1)  85.16791  5.7003732
                                UN(2,1)  -4.53179  .5067146
                                UN(2,2)  .7678319  .0636116
837
Change Covtype to VC
• We know that this is wrong
– The covariance of the effects was statistically significant
– Can also see it was wrong by comparing information criteria
• We have removed a parameter from the model
– Model is worse
– Model is more parsimonious
• Is it much worse, given the increase in parsimony?
838
Information Criteria (smaller is better)

                                      UN Model   VC Model
-2 Restricted Log Likelihood          64899.758  65041.891
Akaike's Information Criterion (AIC)  64907.758  65047.891
Hurvich and Tsai's Criterion (AICC)   64907.763  65047.894
Bozdogan's Criterion (CAIC)           64940.134  65072.173
Schwarz's Bayesian Criterion (BIC)    64936.134  65069.173

Lower is better.
839
Adding Bits
• So far, all a bit dull
• We want some more predictors, to make it more exciting
– E.g. female
– Add:
relfd with time female
/fixed = time female time*female
• What does the interaction term represent?
840
Extending Models
• Models can be extended
– Any kind of regression can be used
• Logistic, multinomial, Poisson, etc.
– More levels
• Children within classes within schools
• Measures within people within classes within prisons
– Multiple membership / cross-classified models
• Children within households and classes, but households not nested within class
• Need a different program
– E.g. MLwiN
841
MLwiN Example (very quickly)
842
Books
Singer, J. D. and Willett, J. B. (2003). Applied Longitudinal Data Analysis: Modeling Change and Event Occurrence. Oxford: Oxford University Press.
Examples at: http://www.ats.ucla.edu/stat/SPSS/examples/alda/default.htm
843
The End