Integrating Bayesian Networks and Simpson’s Paradox in Data Mining

Alex Freitas, University of Kent

Ken McGarry, University of Sunderland

Post on 02-Dec-2014

Outline of the Talk

Introduction to Knowledge Discovery & Data Mining

Constructing Bayesian networks from data

Simpson’s paradox

Proposed method for integrating Bayesian networks and Simpson’s paradox

Conclusions

Introduction

Data Mining consists of extracting patterns from data, and it is the core step of a knowledge discovery process

[Figure: knowledge discovery pipeline. Data (example records: 22, M, 30K; 26, F, 55K; …) → pre-processing → data mining → post-processing → interesting patterns (example: IF (salary = high) THEN (credit = good))]

The Knowledge Discovery Process – a popular definition

“Knowledge Discovery in Databases is the non-trivial process of identifying valid, novel, potentially useful, and ultimately understandable patterns in data” (Fayyad et al. 1996)

Focus on the quality of discovered patterns

– independent of the data mining algorithm

This definition is often quoted, but rarely taken seriously in practice

– A lot of research on discovering valid, accurate patterns

– Little research on discovering potentially useful patterns

Criteria to Evaluate the “Interestingness” of Discovered Patterns

useful

novel, surprising

comprehensible

valid (accurate)

[Figure: the four criteria above arranged along two axes, “Amount of Research” and “Difficulty of measurement”]

On the difficulty of discovering surprising patterns in data mining

A focus on maximizing accuracy leads to very accurate but useless rules, e.g. (Brin et al. 1997), on census data:

– IF (person is pregnant) THEN (gender is female)

– IF (age ≤ 5) THEN (employed = no)

(Tsumoto 2000) extracted 29,050 rules from a medical dataset. Out of these, just 220 (less than 1%) were considered interesting or surprising to the user

Motivation for Integrating Bayesian Networks and Simpson’s Paradox

[Figure: Bayesian network example with nodes A, B, C, D and edges A → C, B → C, C → D]

A Bayesian network represents potentially causal patterns, which tend to be more useful for intelligent decision making

However, algorithms for constructing Bayesian networks from data were not designed to discover surprising patterns

Simpson’s paradox is surprising by nature

Causality + Surprisingness tends to improve Usefulness

Constructing Bayesian Networks from Data

Methods based on conditional independence tests

– Not scalable to datasets with many variables (attributes)

Methods based on search guided by a scoring function

– Iteratively create candidate solutions (Bayesian networks) and evaluate the quality of each created network using a scoring function, until a stopping criterion is satisfied

– Sequential methods consider a single candidate solution at a time

– Population-based methods consider many candidate solutions at a time

Examples of sequential methods

– The B algorithm starts with an empty network and at each iteration adds, to the current candidate solution, the edge that maximizes the value of the scoring function

– The K2 algorithm requires that the variables be ordered and that the user specify a parameter: the maximum number of parents of each variable in the network to be constructed

Both are greedy methods (local search), which offer no guarantee of finding the optimal network

Population-based methods are global search methods, but they are stochastic, so again they offer no guarantee of finding the optimal network

Limitations of methods for constructing Bayesian networks from data (1)

Theoretical limitation (best possible algorithm & data)

Bayesian networks are Independence maps (I-maps) of the true probability distribution

– Every independence between variables represented in the network is an actual independence in the true probability distribution

– Dependences between variables represented in the network are not guaranteed to be actual dependences in the true probability distribution

Limitations of methods for constructing Bayesian networks from data (2)

Practical limitations

The problem of constructing the optimal net is too complex in large datasets, so we have to use methods which do not guarantee the discovery of the optimal net

Sampling variation and/or noisy data may mislead the Bayesian network construction method, further contributing to the discovery of a sub-optimal net

Simpson’s Paradox (Pearl 2000)

Overall         E (recovered)   ¬E (not recov.)   Total   Recov. Rate
Drug (C)        20              20                40      50%
No Drug (¬C)    16              24                40      40%
Total           36              44                80

Males           E (recovered)   ¬E (not recov.)   Total   Recov. Rate
Drug (C)        18              12                30      60%
No Drug (¬C)    7               3                 10      70%
Total           25              15                40

Females         E (recovered)   ¬E (not recov.)   Total   Recov. Rate
Drug (C)        2               8                 10      20%
No Drug (¬C)    9               21                30      30%
Total           11              29                40
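The reversal in the three tables above can be checked with the law of total probability. The short derivation below is only an illustration; M and F are labels introduced here for the male and female subgroups, and every number is taken from the tables.

```latex
\begin{align*}
P(E \mid C)      &= P(E \mid C, M)\,P(M \mid C) + P(E \mid C, F)\,P(F \mid C) \\
                 &= 0.60 \cdot \tfrac{30}{40} + 0.20 \cdot \tfrac{10}{40} = 0.45 + 0.05 = 0.50 \\
P(E \mid \neg C) &= P(E \mid \neg C, M)\,P(M \mid \neg C) + P(E \mid \neg C, F)\,P(F \mid \neg C) \\
                 &= 0.70 \cdot \tfrac{10}{40} + 0.30 \cdot \tfrac{30}{40} = 0.175 + 0.225 = 0.40
\end{align*}
```

Within each gender the drug group recovers less often, but 75% of the drug group is male (the subgroup with the higher recovery rate) against only 25% of the no-drug group. That difference in weights is enough to flip the pooled comparison.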

Simpson’s Paradox as a Surprising Pattern

Event C (“cause”) increases the probability of event E (“effect”) in a given population but, at the same time, decreases the probability of E in every subpopulation

There is no paradox in terms of probability theory; it only looks like a “paradox” under a causal interpretation

– Gender is a confounder variable in the previous example

Although Simpson’s paradox is known by statisticians, occurrences of the paradox are surprising to users

There are algorithms that systematically find instances of the paradox in data and rank them in decreasing order of surprisingness (Fabris & Freitas 2006)
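The definition on this slide can be turned into a small check. The following is a minimal sketch, not the algorithm of Fabris & Freitas (2006): the function names and the data layout are assumptions made for illustration, and it only tests whether a reversal occurs, without ranking by surprisingness.

```python
from typing import Dict, Tuple

# counts[stratum][cause_present] = (effect_count, no_effect_count)
# Hypothetical layout chosen for this sketch; every cell is assumed non-empty.
Counts = Dict[str, Dict[bool, Tuple[int, int]]]


def rate(effect: int, no_effect: int) -> float:
    """Fraction of cases in which the effect occurred."""
    return effect / (effect + no_effect)


def pooled(counts: Counts, cause_present: bool) -> Tuple[int, int]:
    """Sum the counts of one cause group over all strata."""
    effect = sum(c[cause_present][0] for c in counts.values())
    no_effect = sum(c[cause_present][1] for c in counts.values())
    return effect, no_effect


def is_simpson_reversal(counts: Counts) -> bool:
    """True if the pooled association has one sign and every stratum the opposite sign."""
    overall = rate(*pooled(counts, True)) - rate(*pooled(counts, False))
    per_stratum = [rate(*c[True]) - rate(*c[False]) for c in counts.values()]
    if overall > 0:
        return all(d < 0 for d in per_stratum)
    if overall < 0:
        return all(d > 0 for d in per_stratum)
    return False


# Counts copied from the drug example (Pearl 2000) shown earlier in the talk.
drug_trial: Counts = {
    "male": {True: (18, 12), False: (7, 3)},
    "female": {True: (2, 8), False: (9, 21)},
}
no_reversal: Counts = {
    "male": {True: (6, 4), False: (5, 5)},
    "female": {True: (4, 6), False: (3, 7)},
}
```

Calling `is_simpson_reversal(drug_trial)` flags the drug example, while `no_reversal` is a made-up control in which the pooled and per-stratum associations agree.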

The proposed method for integrating Bayesian networks and Simpson’s paradox

Basic idea:

– In a Bayesian network, the dependence denoted by edge C → E can be spurious, i.e., due to a confounding variable F (for the previously discussed reasons)

Two approaches exploring this basic idea

First Approach: paradox detection before network construction

First, run an algorithm that detects occurrences of Simpson’s paradox in data (Fabris & Freitas 2006)

– Produces a paradox list PL

Modify Bayesian network construction algorithms to take this list into account, biasing the algorithms against including network edges involving the paradox

Consider a potential dependence represented by the edge C → E, where C is the apparent cause of effect E

– If variables C, E are associated in an occurrence of Simpson’s paradox in PL, the algorithm is biased against including edge C → E in the network

Consider a greedy algorithm that starts with an empty network and adds one edge to the network at a time, guided by a scoring function

FOR EACH candidate edge A → B
    compute the score of the network if A → B is added to the network
    penalize score if there is an occurrence of the paradox in list PL involving the pair of variables A, B    (proposed extension)
SELECT edge with highest score and add it to the network
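The pseudocode above could be sketched in Python as follows. This is an illustration under stated assumptions, not the authors' implementation (the talk lists implementation as future research): `score_fn`, `penalty`, the toy weights, and the stopping rule (stop when no edge improves the penalized score) are all hypothetical choices made here.

```python
from itertools import permutations
from typing import Callable, FrozenSet, Sequence, Set, Tuple

Edge = Tuple[str, str]  # (parent, child)


def creates_cycle(edges: Set[Edge], new_edge: Edge) -> bool:
    """Return True if adding new_edge (parent -> child) would close a directed cycle."""
    parent, child = new_edge
    stack, seen = [child], set()
    while stack:
        node = stack.pop()
        if node == parent:
            return True  # parent is reachable from child, so the new edge closes a loop
        if node not in seen:
            seen.add(node)
            stack.extend(c for p, c in edges if p == node)
    return False


def penalized_score(edges: Set[Edge], score_fn: Callable[[Set[Edge]], float],
                    paradox_pairs: Set[FrozenSet[str]], penalty: float) -> float:
    """Network score minus a fixed penalty per edge whose variable pair is in PL."""
    hits = sum(1 for edge in edges if frozenset(edge) in paradox_pairs)
    return score_fn(edges) - penalty * hits


def greedy_construct(variables: Sequence[str], score_fn: Callable[[Set[Edge]], float],
                     paradox_pairs: Set[FrozenSet[str]], penalty: float) -> Set[Edge]:
    """Start from an empty network and add the best-scoring acyclic edge each round."""
    edges: Set[Edge] = set()
    current = penalized_score(edges, score_fn, paradox_pairs, penalty)
    while True:
        best_edge, best_score = None, current
        for candidate in permutations(variables, 2):
            if candidate in edges or creates_cycle(edges, candidate):
                continue
            score = penalized_score(edges | {candidate}, score_fn, paradox_pairs, penalty)
            if score > best_score:
                best_edge, best_score = candidate, score
        if best_edge is None:  # assumed stopping criterion: no edge improves the score
            return edges
        edges.add(best_edge)
        current = best_score


# Toy scoring function (hypothetical): listed edges add their weight, others cost 1.
TOY_WEIGHTS = {("A", "C"): 3.0, ("B", "C"): 2.0, ("A", "B"): 1.0}


def toy_score(edges: Set[Edge]) -> float:
    return sum(TOY_WEIGHTS.get(edge, -1.0) for edge in edges)


PL = {frozenset({"A", "B"})}  # paradox list: the pair (A, B) shows Simpson's paradox
network = greedy_construct(["A", "B", "C"], toy_score, PL, penalty=5.0)
```

With the penalty, the edge A → B is left out of `network`; calling the same function with `penalty=0.0` adds it. A real scoring function would be computed from data rather than from a fixed weight table.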

The same basic kind of extension can be applied to an Estimation of Distribution Algorithm (EDA)

– An EDA is a population-based evolutionary algorithm

– It evaluates a complete candidate solution (network) at once

FOR EACH candidate solution in the population
    compute the score of the network represented by the candidate solution
    penalize score in proportion to the number of paradox occurrences in list PL that are associated with direct dependences A → B in the network    (proposed extension)
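Only the evaluation step of the loop above is sketched here; sampling new candidates from a probabilistic model, which a full EDA also needs, is left out. The names `penalized_fitness`, `select_best`, `penalty_weight`, and the toy weights are assumptions introduced for illustration.

```python
from typing import Callable, FrozenSet, List, Set, Tuple

Edge = Tuple[str, str]      # (parent, child)
Network = FrozenSet[Edge]   # a complete candidate solution


def penalized_fitness(network: Network, score_fn: Callable[[Network], float],
                      paradox_pairs: Set[FrozenSet[str]], penalty_weight: float) -> float:
    """Score of the whole network, reduced in proportion to its paradox-related edges."""
    hits = sum(1 for edge in network if frozenset(edge) in paradox_pairs)
    return score_fn(network) - penalty_weight * hits


def select_best(population: List[Network], score_fn: Callable[[Network], float],
                paradox_pairs: Set[FrozenSet[str]], penalty_weight: float,
                keep: int) -> List[Network]:
    """Rank every candidate by penalized fitness and keep the top `keep` of them."""
    ranked = sorted(
        population,
        key=lambda net: penalized_fitness(net, score_fn, paradox_pairs, penalty_weight),
        reverse=True,
    )
    return ranked[:keep]


# Toy scoring function (hypothetical): sum of fixed edge weights.
EDGE_WEIGHTS = {("A", "B"): 4.0, ("A", "C"): 3.0, ("B", "C"): 1.0}


def toy_score(network: Network) -> float:
    return sum(EDGE_WEIGHTS.get(edge, 0.0) for edge in network)


paradox_list = {frozenset({"A", "B"})}
n1 = frozenset({("A", "B")})
n2 = frozenset({("A", "C")})
n3 = frozenset({("A", "B"), ("B", "C")})
n4 = frozenset({("A", "C"), ("B", "C")})
survivors = select_best([n1, n2, n3, n4], toy_score, paradox_list,
                        penalty_weight=2.5, keep=2)
```

In this toy population the two networks containing A → B have the highest raw scores among networks of their size, yet the penalty moves them below the networks that avoid the paradox pair, so `survivors` holds `n4` and `n2`.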

Second Approach: paradox detection after network construction

First, construct a Bayesian network from data

Use the network to “prune” the search space for the Simpson’s paradox detection algorithm

The algorithm will focus its search on the pairs of variables for which there is a direct dependence (i.e., an edge A → B) in the Bayesian network

For each pair of such variables, the algorithm will try to find a third variable that acts as a confounder between those two variables
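The second approach could be sketched as below. This is an illustration only: the record layout, the function names, and the restriction to boolean cause and effect variables are assumptions made here, and the network is supplied as a plain edge list rather than learned from data.

```python
from typing import Dict, Hashable, List, Optional, Sequence, Tuple

Record = Dict[str, Hashable]


def rate_difference(records: List[Record], cause: str, effect: str) -> Optional[float]:
    """P(effect | cause) - P(effect | not cause), or None if a group is empty."""
    with_cause = [r for r in records if r[cause]]
    without_cause = [r for r in records if not r[cause]]
    if not with_cause or not without_cause:
        return None
    p_with = sum(1 for r in with_cause if r[effect]) / len(with_cause)
    p_without = sum(1 for r in without_cause if r[effect]) / len(without_cause)
    return p_with - p_without


def find_confounders(records: List[Record], edges: Sequence[Tuple[str, str]],
                     variables: Sequence[str]) -> List[Tuple[str, str, str]]:
    """For each network edge only, look for a third variable that reverses the association."""
    found = []
    for cause, effect in edges:  # search space pruned to direct dependences
        overall = rate_difference(records, cause, effect)
        if not overall:  # undefined or exactly zero: nothing to reverse
            continue
        for third in variables:
            if third in (cause, effect):
                continue
            strata: Dict[Hashable, List[Record]] = {}
            for r in records:
                strata.setdefault(r[third], []).append(r)
            diffs = [rate_difference(s, cause, effect) for s in strata.values()]
            if len(diffs) >= 2 and all(d is not None and d * overall < 0 for d in diffs):
                found.append((cause, effect, third))
    return found


def rows(gender: str, drug: bool, recovered: bool, n: int) -> List[Record]:
    """Expand one cell of the contingency table into n individual records."""
    return [{"gender": gender, "drug": drug, "recovered": recovered} for _ in range(n)]


# Records rebuilt from the drug example (Pearl 2000) shown earlier in the talk.
records = (rows("M", True, True, 18) + rows("M", True, False, 12)
           + rows("M", False, True, 7) + rows("M", False, False, 3)
           + rows("F", True, True, 2) + rows("F", True, False, 8)
           + rows("F", False, True, 9) + rows("F", False, False, 21))

network_edges = [("drug", "recovered")]  # assumed output of a network construction step
result = find_confounders(records, network_edges, ["gender", "drug", "recovered"])
```

Here `result` reports gender as a confounder for the edge drug → recovered. Restricting the outer loop to network edges is what shrinks the search from all variable pairs to the pairs the network marks as directly dependent.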

Variables considered by Simpson’s paradox detection algorithm, given the Bayesian net

[Figure: Bayesian net with nodes A, B, C, D and edges A → C, B → C, C → D]

Cause   Effect   Is there a confounder?
A       C        ?
B       C        ?
C       D        ?

A paradox occurrence involving the above pairs of cause and effect variables would be even more surprising to the user, due to the structure of the network

Limitation of the proposed integration method

It is possible that the data does not contain any occurrence of Simpson’s paradox

– In this case the usefulness of the method is limited

Even if the algorithm does not find any paradox occurrence, this result is to some extent useful:

– it gives us increased confidence that the dependences represented in the network are true dependences, rather than spurious ones

– This additional test complements (rather than replaces) conventional methods for evaluating Bayesian networks

Conclusions

We proposed a method for integrating two very different kinds of algorithm in data mining

– Algorithms for constructing Bayesian networks: discover potentially causal, more useful patterns

– Algorithms for detecting Simpson’s paradox: discover surprising patterns, potentially more useful

The hope is that this combines the “best of both worlds”, increasing the chance of discovering patterns useful for intelligent decision making by the user

Future research: computational implementation of the proposed method and analysis of results

Any questions?

Thanks for listening!