Post on 13-Jan-2016

1

Causal Data Mining

Richard Scheines

Dept. of Philosophy, Machine Learning, & Human-Computer Interaction

Carnegie Mellon

2

Causal Graphs

Causal Graph G = {V, E}. Each edge X → Y represents a direct causal claim:

X is a direct cause of Y relative to V

Example (Chicken Pox):

Exposure → Rash

Exposure → Infection → Rash

3

Causal Bayes Networks

Graph: Smoking [0,1] → Yellow Fingers [0,1]; Smoking [0,1] → Lung Cancer [0,1]

P(S = 0) = .7
P(S = 1) = .3

P(YF = 0 | S = 0) = .99    P(LC = 0 | S = 0) = .95
P(YF = 1 | S = 0) = .01    P(LC = 1 | S = 0) = .05
P(YF = 0 | S = 1) = .20    P(LC = 0 | S = 1) = .80
P(YF = 1 | S = 1) = .80    P(LC = 1 | S = 1) = .20

P(S, YF, LC) = P(S) P(YF | S) P(LC | S)

The joint distribution factors according to the causal graph, i.e., as a product over all X in V:

P(V) = ∏_{X ∈ V} P(X | Immediate Causes of(X))
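The factorization can be checked numerically. The following is a minimal Python sketch that rebuilds the joint distribution from the three tables on this slide; the dictionary layout and variable names are illustrative choices, not part of the original slides.

```python
from itertools import product

# Conditional probability tables copied from the slide.
p_s = {0: 0.7, 1: 0.3}
# Keys are (child value, value of S).
p_yf_given_s = {(0, 0): 0.99, (1, 0): 0.01, (0, 1): 0.20, (1, 1): 0.80}
p_lc_given_s = {(0, 0): 0.95, (1, 0): 0.05, (0, 1): 0.80, (1, 1): 0.20}


def joint(s, yf, lc):
    """P(S, YF, LC) = P(S) P(YF | S) P(LC | S)."""
    return p_s[s] * p_yf_given_s[(yf, s)] * p_lc_given_s[(lc, s)]


# Full joint table over the eight value combinations.
table = {(s, yf, lc): joint(s, yf, lc) for s, yf, lc in product((0, 1), repeat=3)}
total = sum(table.values())

# Marginal and conditional probabilities read off the joint table.
p_lc1 = sum(p for (s, yf, lc), p in table.items() if lc == 1)
p_yf1 = sum(p for (s, yf, lc), p in table.items() if yf == 1)
p_lc1_and_yf1 = sum(p for (s, yf, lc), p in table.items() if lc == 1 and yf == 1)
p_lc1_given_yf1 = p_lc1_and_yf1 / p_yf1

print(f"sum of joint     = {total:.3f}")
print(f"P(LC=1)          = {p_lc1:.3f}")
print(f"P(LC=1 | YF=1)   = {p_lc1_given_yf1:.3f}")
```

Under these tables, observing yellow fingers raises the probability of lung cancer even though the graph contains no edge between them; the association runs entirely through Smoking.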

4

Structural Equation Models

• Structural Equations: one equation for each variable V in the graph:

V = f(parents(V), error_V)

For a linear SEM (linear regression), f is a linear function.

• Statistical Constraints: joint distribution over the error terms

Causal Graph: Education → Income, Education → Longevity

5

Structural Equation Models

Equations:

Education = ε_ed

Income = β1 Education + ε_Income

Longevity = β2 Education + ε_Longevity

Statistical Constraints: (ε_ed, ε_Income, ε_Longevity) ~ N(0, Σ²)

Σ² diagonal, no variance is zero

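Causal Graph: Education → Income, Education → Longevity

SEM Graph (path diagram): Education → Income with coefficient β1, Education → Longevity with coefficient β2, plus an error term pointing into each of Income and Longevity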
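To make the structural equations concrete, here is a small simulation sketch in Python with NumPy. The coefficient values, the unit error variances, the sample size, and the seed are all hypothetical choices for illustration; the slides do not give numbers for this model.

```python
import numpy as np

rng = np.random.default_rng(0)   # fixed seed so the run is repeatable
n = 20_000                       # hypothetical sample size
beta1, beta2 = 0.8, 0.5          # hypothetical path coefficients

# Independent Gaussian error terms (diagonal covariance, no zero variance).
eps_ed = rng.normal(0.0, 1.0, n)
eps_income = rng.normal(0.0, 1.0, n)
eps_longevity = rng.normal(0.0, 1.0, n)

# One structural equation per variable in the graph.
education = eps_ed
income = beta1 * education + eps_income
longevity = beta2 * education + eps_longevity


def residual(y, x):
    """Remove the least-squares linear contribution of x from y."""
    slope = np.cov(x, y)[0, 1] / np.var(x, ddof=1)
    return y - slope * x


# Income and Longevity are correlated only through their common cause.
corr_marginal = np.corrcoef(income, longevity)[0, 1]
corr_partial = np.corrcoef(residual(income, education),
                           residual(longevity, education))[0, 1]

# Regressing Income on Education recovers the path coefficient.
beta1_hat = np.cov(education, income)[0, 1] / np.var(education, ddof=1)

print(f"corr(Income, Longevity)             = {corr_marginal:.3f}")
print(f"corr(Income, Longevity | Education) = {corr_partial:.3f}")
print(f"estimated beta1                     = {beta1_hat:.3f}")
```

In this sketch the marginal correlation between Income and Longevity is clearly positive, while the correlation that remains after regressing both on Education is close to zero, which is what the graph implies.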

6

Tetrad 4: Demo

www.phil.cmu.edu/projects/tetrad

7

Causal Data Mining in Ed. Research

1. Collect Raw Data

2. Build Meaningful Variables

3. Constrain Model Space with Background Knowledge

4. Search for Models

5. Estimate and Test

6. Interpret

8

CSR Online

Are Online students learning as much?

What features of online behavior matter?

9

CSR Online

Are Online students learning as much?

Raw Data: Pitt 2001, 87 students

For everyone: pre-test, recitation attendance, final exam

For online students (logged): voluntary question attempts, online quizzes, requests to print modules

10

CSR Online

Build Meaningful Variables:

1. Online [0,1]

2. Pre-test [%]

3. Recitation Attendance [%]

4. Final Exam [%]

11

CSR Online

Data: Correlation Matrix (corrs.dat, N=83)

         Pre     Online   Rec     Final
Pre      1.0
Online   .023    1.0
Rec      -.004   -.255    1.0
Final    .287    .182     .297    1.0
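Constraint-based searches such as PC typically work from (partial) correlations like these. The sketch below, in Python with NumPy, computes a partial correlation from the matrix above and a Fisher z statistic for testing whether it vanishes. The helper names `partial_corr` and `fisher_z` are illustrative, and this is not a description of Tetrad's internal code.

```python
import math
import numpy as np

names = ["Pre", "Online", "Rec", "Final"]
n_samples = 83

# Full symmetric correlation matrix built from the lower triangle on the slide.
R = np.array([
    [1.0,    0.023, -0.004, 0.287],
    [0.023,  1.0,   -0.255, 0.182],
    [-0.004, -0.255, 1.0,   0.297],
    [0.287,  0.182,  0.297, 1.0],
])


def partial_corr(x, y, given=()):
    """Partial correlation of x and y given a set of variables, via the precision matrix."""
    idx = [names.index(v) for v in (x, y, *given)]
    precision = np.linalg.inv(R[np.ix_(idx, idx)])
    return -precision[0, 1] / math.sqrt(precision[0, 0] * precision[1, 1])


def fisher_z(r, n, k):
    """Fisher z statistic for a partial correlation r with k conditioning variables."""
    return math.sqrt(n - k - 3) * math.atanh(r)


r_pre_online = partial_corr("Pre", "Online")
r_online_final = partial_corr("Online", "Final")
r_online_final_given_rec = partial_corr("Online", "Final", given=("Rec",))

print(f"r(Pre, Online)         = {r_pre_online:.3f}, "
      f"z = {fisher_z(r_pre_online, n_samples, 0):.2f}")
print(f"r(Online, Final)       = {r_online_final:.3f}")
print(f"r(Online, Final | Rec) = {r_online_final_given_rec:.3f}, "
      f"z = {fisher_z(r_online_final_given_rec, n_samples, 1):.2f}")
```

With these numbers, conditioning on Rec raises the Online and Final correlation from .182 to roughly .28, since online students attended fewer recitations and attendance is positively related to the final exam.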

12

CSR Online

Background Knowledge:

Temporal Tiers:

1. Online, Pre

2. Rec

3. Final
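One way to turn temporal tiers into a search constraint is to forbid any edge whose cause sits in a later tier than its effect. The Python sketch below illustrates that rule; the data structures and the `is_forbidden` helper are hypothetical and do not reflect Tetrad's own knowledge format.

```python
from itertools import permutations

# Temporal tiers from the slide, earliest first.
tiers = [["Online", "Pre"], ["Rec"], ["Final"]]
tier_of = {var: i for i, tier in enumerate(tiers) for var in tier}


def is_forbidden(cause, effect):
    """An edge is ruled out when the cause comes from a later tier than the effect."""
    return tier_of[cause] > tier_of[effect]


variables = list(tier_of)
allowed = [(a, b) for a, b in permutations(variables, 2) if not is_forbidden(a, b)]
forbidden = [(a, b) for a, b in permutations(variables, 2) if is_forbidden(a, b)]

for cause, effect in forbidden:
    print(f"forbidden: {cause} -> {effect}")
print(f"{len(allowed)} directed edges remain possible")
```

Edges within a tier (here, between Online and Pre) stay unconstrained in both directions under this rule.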

13

CSR Online

Model Search:

No latents (patterns – with PC or GES)

- no time order: 729 models

- temporal tiers: 96 models

With latents (PAGs – with FCI search)

- no time order: 4,096 models

- temporal tiers: 2,916 models
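The two no-latent counts can be reproduced with a simple per-pair counting argument, sketched in Python below. It is an assumption that the slide's figures were derived this way: the count treats each pair of variables as absent, directed one way, or directed the other way, and does not check acyclicity. The with-latents counts are not reproduced here.

```python
from itertools import combinations

# Temporal tiers from the background-knowledge slide, earliest first.
tiers = [["Online", "Pre"], ["Rec"], ["Final"]]
tier_of = {var: i for i, tier in enumerate(tiers) for var in tier}
pairs = list(combinations(tier_of, 2))   # the 6 unordered pairs of 4 variables


def count_models(use_tiers):
    """Multiply the number of edge options available for each pair of variables."""
    total = 1
    for a, b in pairs:
        cross_tier = tier_of[a] != tier_of[b]
        if use_tiers and cross_tier:
            total *= 2   # absent, or directed from the earlier tier to the later one
        else:
            total *= 3   # absent, a -> b, or b -> a
    return total


count_no_order = count_models(use_tiers=False)
count_with_tiers = count_models(use_tiers=True)

print(f"no time order:  {count_no_order} models")
print(f"temporal tiers: {count_with_tiers} models")
```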

14

Tetrad Demo

Online vs. Lecture

Data file: corrs.dat

15

Estimate and Test: Results

• Model fit excellent

• Online students attended 10% fewer recitations

• Each recitation gives an increase of 2% on the final exam

• Online students did half a standard deviation better than lecture students (p = .059)

[Figure: estimated path model over Online, Pre-test (%), Recitation Attendance (%), and Final Exam (%); edge coefficients shown: .22, 5.3, .23, -10]

16

References

• An Introduction to Causal Inference, (1997), R. Scheines, in Causality in Crisis?, V. McKim and S. Turner (eds.), Univ. of Notre Dame Press, pp. 185-200.

• Causation, Prediction, and Search, 2nd Edition, (2000), by P. Spirtes, C. Glymour, and R. Scheines, MIT Press

• Causality: Models, Reasoning, and Inference, (2000), Judea Pearl, Cambridge Univ. Press

• “Causal Inference,” (2004), P. Spirtes, R. Scheines, C. Glymour, T. Richardson, and C. Meek, in Handbook of Quantitative Methodology in the Social Sciences, ed. David Kaplan, Sage Publications, pp. 447-478

• Computation, Causation, & Discovery (1999), edited by C. Glymour and G. Cooper, MIT Press
