Post on 13-Jan-2016
1
Causal Data Mining
Richard Scheines
Dept. of Philosophy, Machine Learning, &
Human-Computer Interaction
Carnegie Mellon
2
Causal Graphs
Causal Graph G = {V,E}
Each edge X → Y represents a direct causal claim:
X is a direct cause of Y relative to V
Example (Chicken Pox):
Exposure → Rash
Exposure → Infection → Rash
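A minimal sketch of how the two example graphs might be represented in Python. The dictionary layout and the helper name `direct_causes` are illustrative assumptions, not something from the slides. It shows why directness is relative to V: once Infection is added to the variable set, Exposure is no longer a direct cause of Rash.

```python
# Each graph maps a variable to the set of its direct effects (children).
# Graph over V = {Exposure, Rash}
coarse = {"Exposure": {"Rash"}, "Rash": set()}
# Graph over V = {Exposure, Infection, Rash}
fine = {"Exposure": {"Infection"}, "Infection": {"Rash"}, "Rash": set()}


def direct_causes(graph, effect):
    """Return the variables that have an edge into `effect` (its parents)."""
    return {cause for cause, children in graph.items() if effect in children}


print(direct_causes(coarse, "Rash"))  # direct causes of Rash relative to {Exposure, Rash}
print(direct_causes(fine, "Rash"))    # direct causes of Rash once Infection is in V
```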
3
Causal Bayes Networks
Smoking [0,1] → Yellow Fingers [0,1]
Smoking [0,1] → Lung Cancer [0,1]
P(S = 0) = .7
P(S = 1) = .3
P(YF = 0 | S = 0) = .99    P(LC = 0 | S = 0) = .95
P(YF = 1 | S = 0) = .01    P(LC = 1 | S = 0) = .05
P(YF = 0 | S = 1) = .20    P(LC = 0 | S = 1) = .80
P(YF = 1 | S = 1) = .80    P(LC = 1 | S = 1) = .20
P(S, YF, LC) = P(S) P(YF | S) P(LC | S)
The Joint Distribution Factors According to the Causal Graph,
i.e., with the product taken over all X in V:
P(V) = ∏ P(X | Immediate Causes of(X))
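The factorization can be checked numerically. The sketch below uses the probabilities from this slide; the variable and function names are illustrative assumptions. It builds the joint distribution P(S, YF, LC) from the three factors and confirms that the eight joint probabilities sum to 1.

```python
from itertools import product

# Probabilities from the slide. Conditional tables are keyed by (child value, S value).
p_s = {0: 0.7, 1: 0.3}
p_yf_given_s = {(0, 0): 0.99, (1, 0): 0.01, (0, 1): 0.20, (1, 1): 0.80}
p_lc_given_s = {(0, 0): 0.95, (1, 0): 0.05, (0, 1): 0.80, (1, 1): 0.20}


def joint(s, yf, lc):
    """P(S, YF, LC) = P(S) * P(YF | S) * P(LC | S)."""
    return p_s[s] * p_yf_given_s[(yf, s)] * p_lc_given_s[(lc, s)]


# Sum the joint over all eight combinations of binary values.
total = sum(joint(s, yf, lc) for s, yf, lc in product((0, 1), repeat=3))
print(total)
print(joint(1, 1, 1))  # smoker with yellow fingers and lung cancer
```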
4
Structural Equation Models
• Structural Equations: One Equation for each variable V in the graph:
V = f(parents(V), error_V)
For SEM (linear regression), f is a linear function
• Statistical Constraints: Joint Distribution over the Error terms
Causal Graph: Income ← Education → Longevity
5
Structural Equation Models
Equations:
Education = ε_ed
Income = β1 Education + ε_Income
Longevity = β2 Education + ε_Longevity
Statistical Constraints: (ε_ed, ε_Income, ε_Longevity) ~ N(0, Σ²)
Σ² diagonal, no variance is zero
Causal Graph: Income ← Education → Longevity
SEM Graph (path diagram): Education → Income (coefficient β1), Education → Longevity (coefficient β2), with error terms ε_Income and ε_Longevity
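A small simulation can make the SEM on this slide concrete. This is only a sketch under stated assumptions: the coefficient values 2.0 and 0.5, the unit error variances, the sample size, and all names are hypothetical choices for illustration, not values from the slides. Data are generated from the three structural equations with independent normal errors, and each Education coefficient is then recovered by ordinary least squares.

```python
import random


def ols_slope(x, y):
    """Least-squares slope of y on x: cov(x, y) / var(x)."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov_xy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    return cov_xy / var_x


random.seed(0)
BETA_INCOME, BETA_LONGEVITY = 2.0, 0.5  # hypothetical coefficients
N = 20000                               # hypothetical sample size

# Structural equations with independent N(0, 1) error terms.
education = [random.gauss(0, 1) for _ in range(N)]
income = [BETA_INCOME * e + random.gauss(0, 1) for e in education]
longevity = [BETA_LONGEVITY * e + random.gauss(0, 1) for e in education]

est_income = ols_slope(education, income)
est_longevity = ols_slope(education, longevity)
print(round(est_income, 2), round(est_longevity, 2))
```

Because the error terms are independent of Education, the regression slopes should land close to the coefficients used to generate the data.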
6
Tetrad 4: Demo
www.phil.cmu.edu/projects/tetrad
7
Causal Datamining in Ed. Research
1. Collect Raw Data
2. Build Meaningful Variables
3. Constrain Model Space with Background Knowledge
4. Search for Models
5. Estimate and Test
6. Interpret
8
CSR Online
Are Online students learning as much?
What features of online behavior matter?
9
CSR Online
Are Online students learning as much?
Raw Data: Pitt 2001, 87 students
For everyone: Pre-test, Recitation attendance, Final exam
For Online students (logged): voluntary question attempts, online quizzes, requests to print modules
10
CSR Online
Build Meaningful Variables:
1. Online [0,1]
2. Pre-test [%]
3. Recitation Attendance [%]
4. Final Exam [%]
11
CSR Online
Data: Correlation Matrix (corrs.dat, N=83)
          Pre    Online   Rec     Final
Pre      1.0
Online    .023   1.0
Rec      -.004  -.255    1.0
Final     .287   .182     .297   1.0
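Constraint-based searches such as PC work from (conditional) correlations like these. As a hedged illustration, the sketch below applies the standard first-order partial correlation formula to the matrix above to get the correlation between Online and Final controlling for Rec. The dictionary layout and function names are assumptions made for this example, and no significance test is included.

```python
import math

# Lower triangle of the correlation matrix from the slide.
corr = {
    ("Pre", "Online"): 0.023,
    ("Pre", "Rec"): -0.004,
    ("Online", "Rec"): -0.255,
    ("Pre", "Final"): 0.287,
    ("Online", "Final"): 0.182,
    ("Rec", "Final"): 0.297,
}


def r(a, b):
    """Look up a correlation regardless of the order of the two names."""
    if a == b:
        return 1.0
    return corr[(a, b)] if (a, b) in corr else corr[(b, a)]


def partial_corr(x, y, z):
    """First-order partial correlation of x and y given z."""
    numerator = r(x, y) - r(x, z) * r(y, z)
    denominator = math.sqrt((1 - r(x, z) ** 2) * (1 - r(y, z) ** 2))
    return numerator / denominator


online_final_given_rec = partial_corr("Online", "Final", "Rec")
print(round(online_final_given_rec, 3))
```

Controlling for recitation attendance raises the Online-Final association above the raw .182, which is consistent with online students attending fewer recitations.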
12
CSR Online
Background Knowledge:
Temporal Tiers:
1. Online, Pre
2. Rec
3. Final
13
CSR Online
Model Search:
No latents (patterns – with PC or GES)
- no time order : 729 models
- temporal tiers: 96 models
With Latents (PAGs – with FCI search)
- no time order : 4,096
- temporal tiers: 2,916
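One reading that reproduces the two no-latent counts is to treat each of the 6 variable pairs independently: a pair has no edge, an edge one way, or an edge the other way, and temporal tiers forbid edges pointing from a later tier to an earlier one. The sketch below enumerates this space by brute force. This counting scheme is an assumption about how the slide's numbers arise; it does not check acyclicity or group graphs into equivalence classes, and it does not attempt the counts for the latent-variable case.

```python
from itertools import combinations, product

variables = ["Online", "Pre", "Rec", "Final"]
tier = {"Online": 1, "Pre": 1, "Rec": 2, "Final": 3}  # temporal tiers from the slides

pairs = list(combinations(variables, 2))  # the 6 unordered pairs
STATES = ("none", "forward", "backward")  # forward: a -> b, backward: b -> a


def respects_tiers(assignment):
    """True if no edge points from a later tier to an earlier tier."""
    for (a, b), state in zip(pairs, assignment):
        if state == "forward" and tier[a] > tier[b]:
            return False
        if state == "backward" and tier[b] > tier[a]:
            return False
    return True


all_assignments = list(product(STATES, repeat=len(pairs)))
tiered = [a for a in all_assignments if respects_tiers(a)]
print(len(all_assignments), len(tiered))
```

Under this reading the tier constraints leave three options only for the Online/Pre pair (same tier) and two options for each of the other five pairs.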
14
Tetrad Demo
Online vs. Lecture
Data file: corrs.dat
15
Estimate and Test: Results
• Model fit excellent
• Online students attended 10% fewer recitations
• Each recitation gives an increase of 2% on the final exam
• Online students did half a standard deviation better than lecture students (p = .059)
[Path diagram: nodes Online, Pre-test (%), Recitation Attendance (%), Final Exam (%); edge coefficients shown: .22, 5.3, .23, -10]