SOC 206 Lecture 1
Statistics, Causation, Simple Regression
Context of Discovery vs. Context of Justification
Famous distinction by Hans Reichenbach.
Discovery: How do we come up with ideas?
Justification: How can we demonstrate that they are true?
I. Statistics is a language
Theoretical ideas can be represented:
- Verbally: "Culture creates and reinforces power relations"
- Visually: Culture → Power
- Mathematically: P = f(C, e)
Any language is a tool of both discovery and justification. Statistics is more a tool of justification (hypothesis testing, prediction); it is more limited as a tool of discovery (data mining, inductive statistics such as factor and cluster analysis), constrained by its inflexibility.
Statistics allow us to process a huge amount of standardized and comparable pieces of information
Qualitative (Clinical) vs. Quantitative (Statistical) Judgment
More than a hundred studies have compared the two (Grove et al. 2001; Dawes et al. 1989), covering college admission, medical and psychiatric diagnosis, credit assessment, criminal recidivism, job performance, etc.
In the overwhelming majority of cases statistical judgment was better:
- even when expert judges had more information
- even when experts were informed of the statistical prediction
- even when the statistical model was "inappropriate" but the coefficients had the right sign and unit size
Reasons:
- limited cognitive capacities
- common cognitive errors (e.g., overemphasis of recent experience, confirmation bias, ignoring base rates, human prejudice, etc.)
- separation of the judgment and its outcome
- self-fulfilling prophecy
- selection bias
All of these apply to the qualitative vs. quantitative distinction in social science methodology.
Models are simplified and explicit representations of reality.
Example (Lave and March 1993): Friendships on campus
Observation: students tend to have friends in adjacent quarters.
Question: what could produce (cause) this pattern?
Hypothesis: students request to be housed close to their friends.
Implication: we should not find the same pattern for freshmen.
Assumption: freshmen rarely know their college mates.
Finding: the same pattern holds for freshmen, so the hypothesis is wrong.
New hypothesis: students befriend others close by.
What is the exact process? Can you generalize and broaden the context? To all colleges in the US? Beyond colleges in the US? Beyond the US?
What are the implications of your theory?
Causal models
Friendship ties are caused by something (e.g. physical proximity)
Causation is an asymmetric relationship between two things: the cause and the effect.
John Stuart Mill's 3 main criteria of causation (A System of Logic, Book III, Chapters V-VIII):
1. Empirical association: statistics is strong in revealing this
2. Appropriate time order: statistics often assumes this
3. Non-spuriousness (excluding other forms of causation): statistics uses multivariate models to establish this
Verbal representation of causality: narratives.
Visual: Cause (X) → Effect (Y), e.g. Proximity → Friendship
Mathematical: Y = f(X, e)
Y = f(X), where f is a function, e.g. Y = 2X or Y = e^(1/ln(34X + .5X²))
The simplest function: linear – the change in Y is constant as X changes
i | # of Chocolate Bars (X) | Cost Paid $ (Y)
1 | 0 | 0
2 | 1 | 2
3 | 2 | 4
4 | 3 | 6
5 | 4 | 8
6 | 5 | 10

Price of the chocolate bar = $2. Cost = f(count): Y = 2X, or Yi = 2Xi.
i | # of Chocolate Bars (X) | Cost Paid $ (Y)
1 | 0 | 1
2 | 1 | 3
3 | 2 | 5
4 | 3 | 7
5 | 4 | 9
6 | 5 | 11

Price of the chocolate bar = $1 entry fee + $2 per bar.
Price = f(count): Y = 1 + 2X, or Yi = 1 + 2Xi, where a = intercept and b = slope.
Yi = a + bXi, with a = 1 and b = 2:
Y1 = a + bX1: 1 = 1 + 2*0
Y2 = a + bX2: 3 = 1 + 2*1
…………..
Yn = a + bXn
Deterministic linear function
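The deterministic linear functions above can be written directly in code. This is a minimal Python sketch (Python and the function names are not part of the lecture) of the two chocolate-bar examples:

```python
# A deterministic linear function in code: the two chocolate-bar examples.
# Y = 2X (price only) and Y = 1 + 2X ($1 entry fee plus $2 per bar).
def cost_no_fee(bars):
    return 2 * bars          # Yi = 2Xi

def cost_with_fee(bars):
    return 1 + 2 * bars      # Yi = 1 + 2Xi, with a = 1 (intercept), b = 2 (slope)

print([cost_no_fee(x) for x in range(6)])    # [0, 2, 4, 6, 8, 10]
print([cost_with_fee(x) for x in range(6)])  # [1, 3, 5, 7, 9, 11]
```

Because the function is deterministic, there is no error term: every X maps to exactly one Y.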
Case Summaries

  | NAME  | AGE | INCOME
1 | Tom   | 30  | 19
2 | Ben   | 30  | 23
3 | Jane  | 40  | 26
4 | Steve | 40  | 30
5 | Cathy | 37  | 27
6 | Diane | 51  | 31
N Total: 6, 6, 6
(Limited to first 100 cases.)
[Scatterplot: INCOME (0 to 36) by AGE (0 to 60); points labeled Tom, Ben, Jane, Steve, Cathy, Diane]
The slope in terms of the correlation and the standard deviations:
b = rxy · Sy/Sx = rxy · √var(y) / √var(x)
 i   | Yi income | Xi age | Yi-Ȳ        | Xi-X̄        | (Xi-X̄)²         | (Xi-X̄)(Yi-Ȳ)   | (Yi-Ȳ)²
 1   | 19        | 30     | 19-26 = -7  | 30-38 = -8  | (-8)*(-8) = 64  | (-8)*(-7) = 56 | (-7)*(-7) = 49
 2   | 23        | 30     | 23-26 = -3  | 30-38 = -8  | (-8)*(-8) = 64  | (-8)*(-3) = 24 | (-3)*(-3) = 9
 3   | 26        | 40     | 26-26 = 0   | 40-38 = 2   | 2*2 = 4         | 0*2 = 0        | 0*0 = 0
 4   | 30        | 40     | 30-26 = 4   | 40-38 = 2   | 2*2 = 4         | 4*2 = 8        | 4*4 = 16
 5   | 27        | 37     | 27-26 = 1   | 37-38 = -1  | 1*1 = 1         | 1*(-1) = -1    | 1*1 = 1
 6   | 31        | 51     | 31-26 = 5   | 51-38 = 13  | 13*13 = 169     | 5*13 = 65      | 5*5 = 25
 Σ   | 156       | 228    | 0           | 0           | 306             | 152            | 100
Mean | 26        | 38
b=152/306=0.4967
Incomei=a+0.4967*Agei+ei
a = ?  a = Ȳ - b·X̄ = 26 - 0.4967*38
Incomei=7.1254+0.4967*Agei+ei
Yi=7.1254+0.4967*Xi+ ei
7.1254 is the value of Y when X = 0 (income at age 0); +0.4967 is the unit change in Y for a one-unit change in X (the income change for each year increase in age).
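The hand computation above can be checked with a short Python sketch (Python is not part of the lecture; variable names are mine). It reproduces b = Σ(Xi-X̄)(Yi-Ȳ)/Σ(Xi-X̄)² and a = Ȳ - b·X̄ from the six cases:

```python
# Least-squares slope and intercept from deviation sums,
# using the six-case age/income example from the lecture.
ages = [30, 30, 40, 40, 37, 51]      # Xi
incomes = [19, 23, 26, 30, 27, 31]   # Yi

n = len(ages)
x_bar = sum(ages) / n                # 38
y_bar = sum(incomes) / n             # 26

s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(ages, incomes))  # 152
s_xx = sum((x - x_bar) ** 2 for x in ages)                            # 306

b = s_xy / s_xx          # slope: 152/306 ≈ 0.4967
a = y_bar - b * x_bar    # intercept ≈ 7.1242 (the slide's 7.1254 comes
                         # from plugging in the rounded slope 0.4967)
print(round(b, 4), round(a, 4))
```

Note that the unrounded intercept, 7.124183, matches the regression output shown later in the lecture.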
How good is our model? Our measure is the Residual Sum of Squares, which we also call the Sum of Squared Error (SSE).

 i | Yi (observed) | Pred(Yi) = a + bXi (calculated) | ei = Yi - Pred(Yi) (residual/error) | ei² = ei*ei (squared residual/error)
 1 | 19 | 22.026 | -3.0261 | 9.1573
 2 | 23 | 22.026 |  0.9739 | 0.9485
 3 | 26 | 26.993 | -0.9935 | 0.9870
 4 | 30 | 26.993 |  3.0065 | 9.0390
 5 | 27 | 25.503 |  1.4967 | 2.2401
 6 | 31 | 32.458 | -1.4575 | 2.1243
 Σ |    |        |  0      | 24.4962

Is the SSE of Σei² = 24.4962 a lot or a little? Compared to what?
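As a check, this Python sketch (not part of the lecture) computes the residuals and the SSE for the six cases from the fitted line:

```python
# Residual sum of squares (SSE) for the fitted line Income_i = a + b*Age_i,
# using the six lecture cases.
ages = [30, 30, 40, 40, 37, 51]
incomes = [19, 23, 26, 30, 27, 31]

n = len(ages)
x_bar, y_bar = sum(ages) / n, sum(incomes) / n
s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(ages, incomes))
s_xx = sum((x - x_bar) ** 2 for x in ages)
b = s_xy / s_xx
a = y_bar - b * x_bar

residuals = [y - (a + b * x) for x, y in zip(ages, incomes)]
sse = sum(e ** 2 for e in residuals)   # Σei² ≈ 24.497
print(round(sse, 4))
```

The residuals sum to zero by construction; only their squares accumulate.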
[Scatterplot: INCOME by AGE with the fitted regression line; points labeled Tom, Ben, Jane, Steve, Cathy, Diane]
Bob, 18 years old and making $30K, is added.
Keeping Bob but dropping Tom.
Now Tom has become an outlier (like Bob).
In small samples, individual cases (or a small set of cases) can influence where the regression line goes.
Sampling variance of the slope:
var(b) = σ² · 1/Σ(xi - x̄)²
Can we generalize? Intercept in the population: α Slope in the population: β
Do we have a probability (random) sample? If yes, we can proceed.
Are the coefficients significantly different from 0? Is α ≠ 0? Is β ≠ 0? Is R-square significantly different from 0? Is R² ≠ 0?
Both a (intercept in the sample) and b (slope in the sample) have a probability distribution and so does R-square.
Suppose we take many random samples of N=6 from this population. Each time we will get an intercept and a slope.
http://lstat.kuleuven.be/java
We get a sampling distribution with the following characteristics:
1. It has a normal (bell) shape.
2. Its expected value is the population or true value: E(a) = α; E(b) = β.
3. The standard deviation of the sampling distribution (the standard error) can be calculated for b and for a.
σ² = Σεi²/N is the Mean Squared Error (Mean Residual Sum of Squares), where εi is the distance between observation i and the TRUE regression line.
Because we don't know the TRUE regression line, we can only estimate εi. Our best guess is ei. So our estimate of σ² is s² = Σei²/(N-2).
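The idea of a sampling distribution can be demonstrated with a small simulation sketch (entirely hypothetical: the population values α = 7, β = 0.5 and the noise level are assumed for the demo, not taken from the lecture). Drawing many random samples of N = 6 and refitting the slope each time shows that E(b) = β:

```python
# Simulation sketch: draw many random samples from a hypothetical population
# Y = 7 + 0.5*X + noise, and watch the sampling distribution of b center
# on the true slope β = 0.5.
import random

random.seed(206)
alpha, beta = 7.0, 0.5   # "true" population parameters (assumed for the demo)

def sample_slope(n=6):
    xs = [random.uniform(20, 55) for _ in range(n)]             # ages
    ys = [alpha + beta * x + random.gauss(0, 2.5) for x in xs]  # incomes + error
    x_bar, y_bar = sum(xs) / n, sum(ys) / n
    s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    s_xx = sum((x - x_bar) ** 2 for x in xs)
    return s_xy / s_xx

slopes = [sample_slope() for _ in range(5000)]
print(sum(slopes) / len(slopes))   # close to 0.5: E(b) = β
```

Plotting `slopes` as a histogram would show the bell shape the slide describes.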
s.e.(b) = √var(b)
s.e.(a) = √var(a)
var(a) = σ² · (1/N + X̄²/Σ(xi - x̄)²)
s.e.(a) = √(24.5/(6-2)) · √(1/6 + 38²/306) = 2.475 · 2.2103 = 5.470
Testing if α ≠ 0: t = (a - α)/s.e.(a) = (7.124 - 0)/5.470 = 1.302, d.f. = n - 2 = 4.
Testing if β ≠ 0: t = (b - β)/s.e.(b) = (.497 - 0)/.141 = 3.511, d.f. = n - 2 = 4.
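The standard errors and t-statistics above can be reproduced in a Python sketch (not part of the lecture), following the formulas s² = SSE/(N-2), s.e.(b) = √(s²/Σ(x-x̄)²), and s.e.(a) = √(s²·(1/N + x̄²/Σ(x-x̄)²)):

```python
import math

# Standard errors and t-statistics for the six-case age/income regression.
ages = [30, 30, 40, 40, 37, 51]
incomes = [19, 23, 26, 30, 27, 31]

n = len(ages)
x_bar, y_bar = sum(ages) / n, sum(incomes) / n
s_xx = sum((x - x_bar) ** 2 for x in ages)
s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(ages, incomes))
b = s_xy / s_xx
a = y_bar - b * x_bar

sse = sum((y - (a + b * x)) ** 2 for x, y in zip(ages, incomes))
s2 = sse / (n - 2)                                   # estimate of σ², d.f. = 4

se_b = math.sqrt(s2 / s_xx)                          # ≈ 0.141
se_a = math.sqrt(s2 * (1 / n + x_bar ** 2 / s_xx))   # ≈ 5.470
t_b, t_a = b / se_b, a / se_a                        # ≈ 3.51 and ≈ 1.30
print(round(se_b, 3), round(se_a, 3), round(t_b, 2), round(t_a, 2))
```

These match the Coef., Std. Err., and t columns of the regression output below.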
Income000 |    Coef. | Std. Err. |    t | P>|t| | [95% Conf. Interval]
Age       |  .496732 |  .1414697 | 3.51 | 0.025 |  .1039492   .8895148
_cons     | 7.124183 |  5.469958 | 1.30 | 0.263 | -8.062854   22.31122
s.e.(b) = √(24.5/(6-2)) / √306 = .141
To evaluate this we use the ANalysis Of VAriance (ANOVA) table
Source   |        SS | df |         MS     Number of obs = 6
Model    | 75.503268 |  1 |  75.503268     F(1, 4)       = 12.33
Residual | 24.496732 |  4 | 6.12418301     Prob > F      = 0.0246
Total    |       100 |  5 |         20     R-squared     = 0.7550
                                           Adj R-squared = 0.6938
                                           Root MSE      = 2.4747
We calculate the F-statistic: F(reg d.f., res d.f.) = (RegSS/Reg d.f.) / (SSE/Res d.f.),
where Reg d.f. = k (the number of independent variables) and Res d.f. = N - k - 1.
F = (75.503/1)/(24.497/(6-1-1)) = 12.329, d.f. = 1, 4. In a simple regression F is the squared value of the t for the slope: 3.511² = 12.327 (the discrepancy is due to rounding). The F distribution is a relative of the t distribution; both are based on the normal distribution.
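The ANOVA decomposition Total SS = Model SS + Residual SS, and the fact that F = t² for the slope in simple regression, can be verified with a Python sketch (not part of the lecture):

```python
import math

# F-statistic from the ANOVA decomposition for the six-case regression;
# in simple regression F equals the squared t of the slope.
ages = [30, 30, 40, 40, 37, 51]
incomes = [19, 23, 26, 30, 27, 31]

n = len(ages)
x_bar, y_bar = sum(ages) / n, sum(incomes) / n
s_xx = sum((x - x_bar) ** 2 for x in ages)
b = sum((x - x_bar) * (y - y_bar) for x, y in zip(ages, incomes)) / s_xx
a = y_bar - b * x_bar

sse = sum((y - (a + b * x)) ** 2 for x, y in zip(ages, incomes))  # Residual SS
tss = sum((y - y_bar) ** 2 for y in incomes)                      # Total SS = 100
reg_ss = tss - sse                                                # Model SS ≈ 75.503

f = (reg_ss / 1) / (sse / (n - 1 - 1))          # F(1, 4) ≈ 12.33
t_b = b / math.sqrt((sse / (n - 2)) / s_xx)     # t for the slope ≈ 3.51
print(round(f, 2), round(t_b ** 2, 2))
```

The two printed values agree up to floating-point precision, confirming F = t².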
Verbal: Despite the fact that many see schools as the ultimate vehicle of social mobility, schools reproduce social inequalities by denying high-quality public education to the poor.
Visual: Family Income → School Quality
Statistical: School Quality = f(Family Income, e)
Academic Performance Index (API) in California Public Schools in 2006 as a Function of the Percent of Students Receiving Subsidized Meals

[Histogram: density of API13, ranging from roughly 200 to 1000]
Variable |   Obs |     Mean | Std. Dev. | Min | Max
API13    | 10242 | 784.2502 |  102.2748 | 311 | 999
[Scatterplot: API13 (200 to 1000) by MEALS (0 to 100)]
Source   |         SS |    df |         MS     Number of obs = 10242
Model    | 23852172.8 |     1 | 23852172.8     F(1, 10240)   = 2933.18
Residual | 83270168.8 | 10240 | 8131.85243     Prob > F      = 0.0000
Total    |  107122342 | 10241 | 10460.1447     R-squared     = 0.2227
                                               Adj R-squared = 0.2226
                                               Root MSE      = 90.177

API13 |     Coef. | Std. Err. |      t | P>|t| | [95% Conf. Interval]
MEALS | -1.730451 |  .0319514 | -54.16 | 0.000 | -1.793082  -1.66782
_cons |  885.6367 |  2.073267 | 427.17 | 0.000 |  881.5727  889.7008
Suppose we eliminate the natural metric of the variables and turn them into Z-scores:

Zxi = (Xi - X̄)/Sx   (Z score for X)
Zyi = (Yi - Ȳ)/Sy   (Z score for Y)

The slope will be different, because now everything is measured in standard deviations. It will tell you that "Y will change that many standard deviations by one standard deviation change in X." It is called the standardized regression coefficient, a.k.a. path coefficient, a.k.a. beta weight or beta coefficient.
There is no intercept in a standardized regression:

Zyi = a*·Zxi + ei   (a* = standardized slope)

We multiply both sides of the equation by Zxi:

Zxi·Zyi = a*·Zxi·Zxi + Zxi·ei

We do that for each case 1st, 2nd, … nth.
Zx1·Zy1 = a*·Zx1·Zx1 + Zx1·e1
Zx2·Zy2 = a*·Zx2·Zx2 + Zx2·e2
…………..
Zxn·Zyn = a*·Zxn·Zxn + Zxn·en

Summing the equations and dividing by n, we get the average cross-products of Z-scores, which are correlations:

(1/n)·ΣZxi·Zyi = a*·(1/n)·ΣZxi·Zxi + (1/n)·ΣZxi·ei

This is the normal equation: on one side there is a correlation, on the other side the path coefficient and correlations:

rXY = a*·rXX + rXe

Because rXX = 1 and rXe = 0, the final normal equation is:

rXY = a*

And this is how you get the metric (unstandardized) slope coefficient from the path coefficient:

b = (Sy/Sx)·rXY
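The standardized-regression algebra can be verified numerically with a Python sketch (not part of the lecture), using the six-case age/income data: the slope of Zy on Zx equals rXY, and the metric slope is recovered as b = (Sy/Sx)·rXY:

```python
import math

# Standardized (beta) coefficient: z-score both variables, regress,
# and recover the metric slope b = (Sy/Sx) * r_xy.
ages = [30, 30, 40, 40, 37, 51]
incomes = [19, 23, 26, 30, 27, 31]

n = len(ages)
x_bar, y_bar = sum(ages) / n, sum(incomes) / n
s_xx = sum((x - x_bar) ** 2 for x in ages)
s_yy = sum((y - y_bar) ** 2 for y in incomes)
s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(ages, incomes))

sx = math.sqrt(s_xx / (n - 1))                 # sample SD of X
sy = math.sqrt(s_yy / (n - 1))                 # sample SD of Y
r = s_xy / math.sqrt(s_xx * s_yy)              # correlation ≈ 0.8689

zx = [(x - x_bar) / sx for x in ages]          # Z scores for X
zy = [(y - y_bar) / sy for y in incomes]       # Z scores for Y
beta = sum(u * v for u, v in zip(zx, zy)) / sum(u * u for u in zx)

b = (sy / sx) * beta                           # metric slope ≈ 0.4967
print(round(r, 4), round(beta, 4), round(b, 4))
```

The printed `beta` equals `r` exactly, as the normal equation derivation shows.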
. correlate API13 MEALS, means
(obs=10242)
Variable |     Mean | Std. Dev. | Min | Max
API13    | 784.2502 |  102.2748 | 311 | 999
MEALS    | 58.58963 |  27.88903 |   0 | 100

         |   API13   MEALS
API13    |  1.0000
MEALS    | -0.4719  1.0000
. regress API13 MEALS, beta
Source   |         SS |    df |         MS     Number of obs = 10242
Model    | 23852172.8 |     1 | 23852172.8     F(1, 10240)   = 2933.18
Residual | 83270168.8 | 10240 | 8131.85243     Prob > F      = 0.0000
Total    |  107122342 | 10241 | 10460.1447     R-squared     = 0.2227
                                               Adj R-squared = 0.2226
                                               Root MSE      = 90.177

API13 |     Coef. | Std. Err. |      t | P>|t| |      Beta
MEALS | -1.730451 |  .0319514 | -54.16 | 0.000 | -.4718717
_cons |  885.6367 |  2.073267 | 427.17 | 0.000 |         .
b= [102.2748/27.88903] *-.4718717=-1.730451
a= 784.2502 –(-1.730451)* 58.58963= 885.6367
[Residual plot: residuals of API13 (-600 to 200) by MEALS (0 to 100)]
. estat hettest
Breusch-Pagan / Cook-Weisberg test for heteroskedasticity
Ho: Constant variance
Variables: fitted values of API13
chi2(1) = 120.54 Prob > chi2 = 0.0000
. regress API13 MEALS, vce(hc3) beta
Linear regression                    Number of obs = 10242
                                     F(1, 10240)   = 3091.00
                                     Prob > F      = 0.0000
                                     R-squared     = 0.2227
                                     Root MSE      = 90.177

      |           | Robust HC3
API13 |     Coef. | Std. Err. |      t | P>|t| |      Beta
MEALS | -1.730451 |   .031125 | -55.60 | 0.000 | -.4718717
_cons |  885.6367 |  2.152182 | 411.51 | 0.000 |         .

The Standard Error is corrected to make it robust against the violation of the homoscedasticity assumption.
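To see what the robust correction does, here is a sketch of the HC3 standard error for the slope, computed by hand (Python is not part of the lecture; the six-case age/income data are used here for illustration, whereas Stata's output above uses the full API dataset). HC3 replaces the constant σ² with each case's squared residual, inflated by its leverage:

```python
import math

# Sketch of an HC3 (heteroskedasticity-robust) standard error for the slope
# in simple regression, computed on the six-case example.
ages = [30, 30, 40, 40, 37, 51]
incomes = [19, 23, 26, 30, 27, 31]

n = len(ages)
x_bar, y_bar = sum(ages) / n, sum(incomes) / n
s_xx = sum((x - x_bar) ** 2 for x in ages)
b = sum((x - x_bar) * (y - y_bar) for x, y in zip(ages, incomes)) / s_xx
a = y_bar - b * x_bar
resid = [y - (a + b * x) for x, y in zip(ages, incomes)]

# classical s.e.(b) for comparison
se_classic = math.sqrt((sum(e ** 2 for e in resid) / (n - 2)) / s_xx)

# leverage of each case: h_i = 1/n + (x_i - x̄)²/Σ(x - x̄)²
h = [1 / n + (x - x_bar) ** 2 / s_xx for x in ages]

# HC3 inflates each squared residual by 1/(1 - h_i)²
var_b_hc3 = sum(((x - x_bar) * e / (1 - hi)) ** 2
                for x, e, hi in zip(ages, resid, h)) / s_xx ** 2
se_b_hc3 = math.sqrt(var_b_hc3)
print(round(se_classic, 4), round(se_b_hc3, 4))
```

In this tiny sample the robust standard error is noticeably larger than the classical one, because the influential case (Diane) has both a high leverage and a nonzero residual.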