
Part II: Graphical models

Challenges of probabilistic models

• Specifying well-defined probabilistic models with many variables is hard (for modelers)

• Representing probability distributions over those variables is hard (for computers/learners)

• Computing quantities using those distributions is hard (for computers/learners)

Representing structured distributions

Four random variables:

X1 coin toss produces heads

X2 pencil levitates

X3 friend has psychic powers

X4 friend has two-headed coin

Domain

{0,1} × {0,1} × {0,1} × {0,1}

Joint distribution: a probability for each of the 16 assignments 0000, 0001, ..., 1111

• Requires 15 numbers to specify the probability of all values (x1, x2, x3, x4)

– N binary variables require 2^N − 1 numbers

• Similar cost when computing conditional probabilities

P(x2, x3, x4 | x1 = 1) = P(x1 = 1, x2, x3, x4) / P(x1 = 1)
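To make the cost concrete, here is a minimal sketch in Python (the probabilities are random placeholders, not values from the tutorial): the full joint over four binary variables is a 16-entry table with 15 free parameters, and conditioning means summing over that table.

```python
# A full joint over four binary variables: one entry per assignment,
# 2^4 = 16 entries, 15 free numbers once normalization is imposed.
import itertools
import random

random.seed(0)
weights = [random.random() for _ in range(16)]
total = sum(weights)
joint = {xs: w / total
         for xs, w in zip(itertools.product([0, 1], repeat=4), weights)}

# P(x2, x3, x4 | x1 = 1) = P(x1 = 1, x2, x3, x4) / P(x1 = 1):
# both numerator and denominator require summing over the table.
p_x1 = sum(p for xs, p in joint.items() if xs[0] == 1)
conditional = {xs[1:]: p / p_x1 for xs, p in joint.items() if xs[0] == 1}
print(sum(conditional.values()))  # 1.0
```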

How can we use fewer numbers?

Four random variables:

X1 coin toss produces heads

X2 coin toss produces heads

X3 coin toss produces heads

X4 coin toss produces heads

Domain

{0,1} × {0,1} × {0,1} × {0,1}

Statistical independence

• Two random variables X1 and X2 are independent if

P(x1|x2) = P(x1)

– e.g., coin flips: P(x1 = H | x2 = H) = P(x1 = H) = 0.5

• Independence makes it easier to represent and work with probability distributions

• We can exploit the product rule:

P(x1, x2, x3, x4) = P(x1 | x2, x3, x4) P(x2 | x3, x4) P(x3 | x4) P(x4)

If x1, x2, x3, and x4 are all independent:

P(x1, x2, x3, x4) = P(x1) P(x2) P(x3) P(x4)
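As a quick sketch of the saving (Python, using the fair-coin value 0.5 from the example): under full independence the joint over four binary variables needs only one number per variable.

```python
# Four independent coin flips: 4 numbers instead of 2^4 - 1 = 15.
p_heads = [0.5, 0.5, 0.5, 0.5]

def joint_prob(xs):
    """P(x1, x2, x3, x4) = P(x1) P(x2) P(x3) P(x4) under independence."""
    prob = 1.0
    for x, p in zip(xs, p_heads):
        prob *= p if x == 1 else 1 - p
    return prob

print(joint_prob((1, 0, 1, 1)))  # 0.0625
```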

Expressing independence

• Statistical independence is the key to efficient probabilistic representation and computation

• This has led to the development of languages for indicating dependencies among variables

• Some of the most popular languages are based on “graphical models”

Part II: Graphical models

• Introduction to graphical models
– representation and inference

• Causal graphical models
– causality
– learning about causal relationships

• Graphical models and cognitive science
– uses of graphical models
– an example: causal induction


Graphical models

• Express the probabilistic dependency structure among a set of variables (Pearl, 1988)

• Consist of
– a set of nodes, corresponding to variables
– a set of edges, indicating dependency
– a set of functions defined on the graph that specify a probability distribution


Undirected graphical models

• Consist of
– a set of nodes
– a set of edges
– a potential for each clique, multiplied together to yield the distribution over variables

• Examples
– statistical physics: Ising model, spin glasses
– early neural networks (e.g., Boltzmann machines)

[Figure: undirected graph over X1, ..., X5]

Directed graphical models

[Figure: directed acyclic graph over X1, ..., X5]

• Consist of
– a set of nodes
– a set of edges
– a conditional probability distribution for each node, conditioned on its parents, multiplied together to yield the distribution over variables

• Constrained to directed acyclic graphs (DAGs)
• Called Bayesian networks or Bayes nets

Bayesian networks and Bayes

• Two different problems
– Bayesian statistics is a method of inference
– Bayesian networks are a form of representation

• There is no necessary connection
– many users of Bayesian networks rely upon frequentist statistical methods
– many Bayesian inferences cannot be easily represented using Bayesian networks

Properties of Bayesian networks

• Efficient representation and inference
– exploiting dependency structure makes it easier to represent and compute with probabilities

• Explaining away
– a pattern of probabilistic reasoning characteristic of Bayesian networks, especially in their early use in AI


Efficient representation and inference

Four random variables:

X1 coin toss produces heads

X2 pencil levitates

X3 friend has psychic powers

X4 friend has two-headed coin

[Figure: Bayes net with edges X4 → X1, X3 → X1, and X3 → X2, with CPDs P(x4), P(x3), P(x1|x3, x4), P(x2|x3)]

The Markov assumption

Every node is conditionally independent of its non-descendants, given its parents

P(xi | xi+1, ..., xk) = P(xi | Pa(Xi))

where Pa(Xi) is the set of parents of Xi

P(x1, ..., xk) = ∏i=1..k P(xi | Pa(Xi))    (via the product rule)

Efficient representation and inference

Four random variables:

X1 coin toss produces heads

X2 pencil levitates

X3 friend has psychic powers

X4 friend has two-headed coin

P(x1, x2, x3, x4) = P(x1|x3, x4) P(x2|x3) P(x3) P(x4)

Numbers needed: 4 + 2 + 1 + 1 = 8 (vs 15 for the full joint)
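A sketch of the factorized representation in code (the CPT values are invented for illustration; only the parameter count matters):

```python
# Conditional probability tables for the network above; each table needs
# one number per parent configuration (binary variables).
cpts = {
    "P(x4)":       {(): 0.2},                        # 1 number
    "P(x3)":       {(): 0.1},                        # 1 number
    "P(x2|x3)":    {(0,): 0.01, (1,): 0.9},          # 2 numbers
    "P(x1|x3,x4)": {(0, 0): 0.5, (0, 1): 1.0,
                    (1, 0): 0.9, (1, 1): 1.0},       # 4 numbers
}
print(sum(len(t) for t in cpts.values()))  # 8, vs 15 for the full joint
```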

Reading a Bayesian network

• The structure of a Bayes net can be read as the generative process behind a distribution

• Gives the joint probability distribution over variables obtained by sampling each variable conditioned on its parents

Reading a Bayesian network

Four random variables:

X1 coin toss produces heads

X2 pencil levitates

X3 friend has psychic powers

X4 friend has two-headed coin


P(x1, x2, x3, x4) = P(x1|x3, x4)P(x2|x3)P(x3)P(x4)

Reading a Bayesian network

• The structure of a Bayes net can be read as the generative process behind a distribution

• Gives the joint probability distribution over variables obtained by sampling each variable conditioned on its parents

• Simple rules for determining whether two variables are dependent or independent

• Independence makes inference more efficient

Computing with Bayes nets


P(x1, x2, x3, x4) = P(x1|x3, x4)P(x2|x3)P(x3)P(x4)

P(x2, x3, x4 | x1 = 1) = P(x1 = 1, x2, x3, x4) / P(x1 = 1)

Computing with Bayes nets


P(x1, x2, x3, x4) = P(x1|x3, x4)P(x2|x3)P(x3)P(x4)

P(x1 = 1) = Σ_{x2, x3, x4} P(x1 = 1, x2, x3, x4)    (a sum over 8 values)

Computing with Bayes nets


P(x1, x2, x3, x4) = P(x1|x3, x4)P(x2|x3)P(x3)P(x4)

P(x1 = 1) = Σ_{x2, x3, x4} P(x1 = 1 | x3, x4) P(x2 | x3) P(x3) P(x4)

Computing with Bayes nets


P(x1, x2, x3, x4) = P(x1|x3, x4)P(x2|x3)P(x3)P(x4)

P(x1 = 1) = Σ_{x3, x4} P(x1 = 1 | x3, x4) P(x3) P(x4)    (a sum over just 4 values: Σ_{x2} P(x2 | x3) = 1, so x2 drops out)
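The same computation as a sketch in Python (reusing the invented CPT values from above): brute-force summation over 8 assignments versus the structured sum over 4, after x2 is marginalized out analytically.

```python
import itertools

p_x3, p_x4 = 0.1, 0.2                       # invented priors
p_x2 = {0: 0.01, 1: 0.9}                    # P(x2 = 1 | x3)
p_x1 = {(0, 0): 0.5, (0, 1): 1.0,
        (1, 0): 0.9, (1, 1): 1.0}           # P(x1 = 1 | x3, x4)

def pr(p, x):
    """Probability of binary outcome x when P(X = 1) = p."""
    return p if x == 1 else 1 - p

# Brute force: sum the factorized joint over all 8 values of (x2, x3, x4).
brute = sum(p_x1[x3, x4] * pr(p_x2[x3], x2) * pr(p_x3, x3) * pr(p_x4, x4)
            for x2, x3, x4 in itertools.product([0, 1], repeat=3))

# Structured: sum_x2 P(x2 | x3) = 1, so only (x3, x4) remain -- 4 terms.
structured = sum(p_x1[x3, x4] * pr(p_x3, x3) * pr(p_x4, x4)
                 for x3, x4 in itertools.product([0, 1], repeat=2))

print(brute, structured)  # identical values
```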

Computing with Bayes nets

• Inference algorithms for Bayesian networks exploit dependency structure

• Message-passing algorithms
– “belief propagation” passes simple messages between nodes; exact for tree-structured networks

• More general inference algorithms
– exact: “junction tree”
– approximate: Monte Carlo schemes (see Part IV)

Properties of Bayesian networks

• Efficient representation and inference
– exploiting dependency structure makes it easier to represent and compute with probabilities

• Explaining away
– a pattern of probabilistic reasoning characteristic of Bayesian networks, especially in their early use in AI

Explaining away

[Figure: Rain → Grass Wet ← Sprinkler]

Assume grass will be wet if and only if it rained last night, or if the sprinklers were left on:

P(W = w | R, S) = 1 if R = r or S = s
P(W = w | R, S) = 0 if R = ¬r and S = ¬s

P(R, S, W) = P(R) P(S) P(W | R, S)

Explaining away

Compute probability it rained last night, given that the grass is wet:

P(r | w) = P(w | r) P(r) / P(w)

Explaining away

Compute probability it rained last night, given that the grass is wet:

P(r | w) = P(w | r) P(r) / Σ_{r′, s′} P(w | r′, s′) P(r′, s′)

Explaining away

Compute probability it rained last night, given that the grass is wet:

P(r | w) = P(r) / [P(r, s) + P(r, ¬s) + P(¬r, s)]

(since P(w | r) = 1, and the denominator keeps only the cases in which the grass is wet)

Explaining away

Compute probability it rained last night, given that the grass is wet:

P(r | w) = P(r) / [P(r) + P(¬r, s)]

Explaining away

Compute probability it rained last night, given that the grass is wet:

P(r | w) = P(r) / [P(r) + P(¬r) P(s)] > P(r)

(the denominator lies between P(s) and 1)

Explaining away

Compute probability it rained last night, given that the grass is wet and the sprinklers were left on:

P(r | w, s) = P(w | r, s) P(r | s) / P(w | s)

Both P(w | r, s) and P(w | s) equal 1.

Explaining away

Compute probability it rained last night, given that the grass is wet and the sprinklers were left on:

P(r | w, s) = P(r | s) = P(r)

Explaining away

P(r | w, s) = P(r | s) = P(r), whereas P(r | w) = P(r) / [P(r) + P(¬r) P(s)] > P(r)

Learning that the sprinklers were on “discounts” rain back to its prior probability.
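A quick numerical check of this derivation (a minimal sketch in Python; the priors P(r) = 0.3 and P(s) = 0.5 are invented):

```python
p_r, p_s = 0.3, 0.5  # assumed priors for rain and sprinkler

# P(r | w) = P(r) / (P(r) + P(not-r) P(s))  -- raised above the prior
p_r_given_w = p_r / (p_r + (1 - p_r) * p_s)

# P(r | w, s) = P(r | s) = P(r)  -- discounted back to the prior
p_r_given_ws = p_r

print(p_r, p_r_given_w, p_r_given_ws)  # 0.3, 0.4615..., 0.3
```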

Contrast w/ production system

[Figure: Rain → Grass Wet ← Sprinkler]

• Formulate IF-THEN rules:
– IF Rain THEN Wet
– IF Wet THEN Rain
– IF Wet AND NOT Sprinkler THEN Rain

• Rules do not distinguish directions of inference
• Requires a combinatorial explosion of rules

Contrast w/ spreading activation

[Figure: Rain → Grass Wet ← Sprinkler]

• Excitatory links: Rain → Wet, Sprinkler → Wet
• Observing rain, Wet becomes more active
• Observing grass wet, Rain and Sprinkler become more active
• Observing grass wet and sprinkler, Rain cannot become less active: no explaining away!

Contrast w/ spreading activation

[Figure: Rain → Grass Wet ← Sprinkler, plus an inhibitory Rain–Sprinkler link]

• Excitatory links: Rain → Wet, Sprinkler → Wet
• Inhibitory link: Rain–Sprinkler
• Observing grass wet, Rain and Sprinkler become more active
• Observing grass wet and sprinkler, Rain becomes less active: explaining away

Contrast w/ spreading activation

[Figure: Rain, Sprinkler, and Burst pipe all linked to Grass Wet]

• Each new variable requires more inhibitory connections
• Not modular
– whether a connection exists depends on what other connections exist
– big holism problem
– combinatorial explosion

Contrast w/ spreading activation

[Figure: interactive activation network (McClelland & Rumelhart, 1981)]

Graphical models

• Capture dependency structure in distributions

• Provide an efficient means of representing and reasoning with probabilities

• Allow kinds of inference that are problematic for other representations: explaining away
– hard to capture in a production system
– more natural than with spreading activation

Part II: Graphical models

• Introduction to graphical models
– representation and inference

• Causal graphical models
– causality
– learning about causal relationships

• Graphical models and cognitive science
– uses of graphical models
– an example: causal induction

Causal graphical models

• Graphical models represent statistical dependencies among variables (i.e., correlations)
– can answer questions about observations

• Causal graphical models represent causal dependencies among variables (Pearl, 2000)
– express underlying causal structure
– can answer questions about both observations and interventions (actions upon a variable)


Bayesian networks

Four random variables:

X1 coin toss produces heads

X2 pencil levitates

X3 friend has psychic powers

X4 friend has two-headed coin

[Figure: Bayes net with edges X4 → X1, X3 → X1, and X3 → X2, with CPDs P(x4), P(x3), P(x1|x3, x4), P(x2|x3)]

Nodes: variables
Links: dependency

Each node has a conditional probability distribution

Data: observations of x1, ..., x4

Causal Bayesian networks

Four random variables:

X1 coin toss produces heads

X2 pencil levitates

X3 friend has psychic powers

X4 friend has two-headed coin

[Figure: the same network, with the links now read causally]

Nodes: variables
Links: causality

Each node has a conditional probability distribution

Data: observations of and interventions on x1, ..., x4

Interventions

Four random variables:

X1 coin toss produces heads

X2 pencil levitates

X3 friend has psychic powers

X4 friend has two-headed coin

[Figure: the same network with an intervention on X2 (“hold down pencil”): the link X3 → X2 is cut]

Cut all incoming links for the node that we intervene on

Compute probabilities with the “mutilated” Bayes net
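A sketch of that graph surgery in code, with invented CPT values: observing x2 = 1 (the pencil levitates) raises the probability of x3 (psychic powers), while intervening on x2 (holding the pencil down or lifting it by hand) leaves x3 at its prior.

```python
import itertools

p_x3, p_x4 = 0.1, 0.2  # invented priors

def p_x2_given_x3(x2, x3):
    p = 0.9 if x3 == 1 else 0.01   # levitation is likely only given powers
    return p if x2 == 1 else 1 - p

def p_x1_given(x1, x3, x4):
    p = 1.0 if x4 == 1 else (0.9 if x3 == 1 else 0.5)
    return p if x1 == 1 else 1 - p

def joint(x1, x2, x3, x4, do_x2=None):
    """Joint probability; under do(x2), the link X3 -> X2 is cut."""
    if do_x2 is None:
        f2 = p_x2_given_x3(x2, x3)
    else:
        f2 = 1.0 if x2 == do_x2 else 0.0  # mutilated network
    f3 = p_x3 if x3 == 1 else 1 - p_x3
    f4 = p_x4 if x4 == 1 else 1 - p_x4
    return p_x1_given(x1, x3, x4) * f2 * f3 * f4

def p_psychic_given_levitation(do=None):
    num = sum(joint(x1, 1, 1, x4, do_x2=do)
              for x1, x4 in itertools.product([0, 1], repeat=2))
    den = sum(joint(x1, 1, x3, x4, do_x2=do)
              for x1, x3, x4 in itertools.product([0, 1], repeat=3))
    return num / den

print(p_psychic_given_levitation())      # observation: ~0.91
print(p_psychic_given_levitation(do=1))  # intervention: prior, 0.1
```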

Learning causal graphical models

• Strength: how strong is a relationship?
• Structure: does a relationship exist?

[Figure: h1 = B → E ← C versus h0 = B → E]

Causal structure vs. causal strength

• Strength: how strong is a relationship?
– requires defining the nature of the relationship

[Figure: h1 = B → E ← C with strength parameters w0, w1; h0 = B → E with strength w0]

Parameterization

• Structures: h1 = B → E ← C; h0 = B → E

• Generic parameterization:

 C B | h1: P(E = 1 | C, B) | h0: P(E = 1 | C, B)
 0 0 |        p00          |        p0
 1 0 |        p10          |        p0
 0 1 |        p01          |        p1
 1 1 |        p11          |        p1

Parameterization

• Structures: h1 = B → E ← C; h0 = B → E
• w0, w1: strength parameters for B, C

• Linear parameterization:

 C B | h1: P(E = 1 | C, B) | h0: P(E = 1 | C, B)
 0 0 |         0           |        0
 1 0 |         w1          |        0
 0 1 |         w0          |        w0
 1 1 |      w1 + w0        |        w0

Parameterization

• Structures: h1 = B → E ← C; h0 = B → E
• w0, w1: strength parameters for B, C

• “Noisy-OR” parameterization:

 C B | h1: P(E = 1 | C, B) | h0: P(E = 1 | C, B)
 0 0 |         0           |        0
 1 0 |         w1          |        0
 0 1 |         w0          |        w0
 1 1 |  w1 + w0 − w1 w0    |        w0
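A one-function sketch of the noisy-OR CPD, reproducing the h1 column of the table (the w0 and w1 values are arbitrary):

```python
def p_effect_noisy_or(b, c, w0, w1):
    """P(E = 1 | B = b, C = c) = 1 - (1 - w0)^b (1 - w1)^c."""
    return 1 - (1 - w0)**b * (1 - w1)**c

# Rows of the table: 0, w1, w0, w1 + w0 - w1*w0
for c, b in [(0, 0), (1, 0), (0, 1), (1, 1)]:
    print(c, b, p_effect_noisy_or(b, c, w0=0.3, w1=0.6))
```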

Parameter estimation

• Maximum likelihood estimation:

maximize ∏i P(bi, ci, ei; w0, w1)

• Bayesian methods: as in Part I
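A sketch of maximum likelihood estimation by grid search under the noisy-OR parameterization; the six (b, c, e) trials below are invented.

```python
import itertools

# Invented data: (b, c, e) per trial; the background b is always present.
data = [(1, 1, 1), (1, 1, 1), (1, 1, 0), (1, 0, 1), (1, 0, 0), (1, 0, 0)]

def p_e(b, c, w0, w1):
    return 1 - (1 - w0)**b * (1 - w1)**c  # noisy-OR

def likelihood(w0, w1):
    prob = 1.0
    for b, c, e in data:
        p = p_e(b, c, w0, w1)
        prob *= p if e == 1 else 1 - p
    return prob

grid = [i / 100 for i in range(101)]
w0_hat, w1_hat = max(itertools.product(grid, grid),
                     key=lambda w: likelihood(*w))
print(w0_hat, w1_hat)  # roughly 0.33 and 0.5 for these counts
```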

Causal structure vs. causal strength

• Structure: does a relationship exist?

[Figure: h1 = B → E ← C versus h0 = B → E]

Approaches to structure learning

• Constraint-based:
– establish dependencies from statistical tests (e.g., χ2)
– deduce structure from the dependencies
(Pearl, 2000; Spirtes et al., 1993)

[Figure: edges among B, C, and E introduced step by step as dependencies are detected]

Attempts to reduce an inductive problem to a deductive problem

• Bayesian:
– compute the posterior probability of structures given the observed data: P(h | data) ∝ P(data | h) P(h)
– compare P(h1 | data) with P(h0 | data)
(Heckerman, 1998; Friedman, 1999)

Bayesian Occam’s Razor

[Figure: P(d | h) plotted over all possible data sets d, for h0 (no relationship) and h1 (relationship)]

For any model h, Σd P(d | h) = 1

Causal graphical models

• Extend graphical models to deal with interventions as well as observations

• Respecting the direction of causality results in efficient representation and inference

• Two steps in learning causal models
– strength: parameter estimation
– structure: structure learning

Part II: Graphical models

• Introduction to graphical models
– representation and inference

• Causal graphical models
– causality
– learning about causal relationships

• Graphical models and cognitive science
– uses of graphical models
– an example: causal induction

Uses of graphical models

• Understanding existing cognitive models
– e.g., neural network models

• Representation and reasoning
– a way to address holism in induction (cf. Fodor)

• Defining generative models
– mixture models, language models (see Part IV)

• Modeling human causal reasoning

Human causal reasoning

• How do people reason about interventions? (Gopnik, Glymour, Sobel, Schulz, Kushnir & Danks, 2004; Lagnado & Sloman, 2004; Sloman & Lagnado, 2005; Steyvers, Tenenbaum, Wagenmakers & Blum, 2003)

• How do people learn about causal relationships?
– parameter estimation (Shanks, 1995; Cheng, 1997)
– constraint-based models (Glymour, 2001)
– Bayesian structure learning (Steyvers et al., 2003; Griffiths & Tenenbaum, 2005)

Causation from contingencies

“Does C cause E?” (rate on a scale from 0 to 100)

                 E present (e+)   E absent (e−)
C present (c+)         a                b
C absent (c−)          c                d

Two models of causal judgment

• ΔP (Jenkins & Ward, 1965):

ΔP ≡ P(e+ | c+) − P(e+ | c−) = a/(a + b) − c/(c + d)

• Power PC (Cheng, 1997):

power ≡ ΔP / (1 − P(e+ | c−))
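Both models as a short Python helper over the counts a, b, c, d (the example table is invented):

```python
def delta_p(a, b, c, d):
    """Delta-P = P(e+ | c+) - P(e+ | c-) = a/(a+b) - c/(c+d)."""
    return a / (a + b) - c / (c + d)

def causal_power(a, b, c, d):
    """Power PC (Cheng, 1997): Delta-P / (1 - P(e+ | c-))."""
    return delta_p(a, b, c, d) / (1 - c / (c + d))

print(delta_p(6, 2, 2, 6))       # 0.5
print(causal_power(6, 2, 2, 6))  # 0.666...
```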

Buehner and Cheng (1997)

[Figure: human judgments (“People”) plotted against ΔP and causal power for contingencies with ΔP from 0.00 to 1.00. Three patterns stand out: constant ΔP with changing judgments; constant causal power with changing judgments; and ΔP = 0 with changing judgments]

Causal structure vs. causal strength

• Strength: how strong is a relationship?
• Structure: does a relationship exist?

[Figure: h1 = B → E ← C with strength parameters w0, w1; h0 = B → E with strength w0]

Causal strength

• Assume the structure h1 = B → E ← C

• ΔP and causal power are maximum likelihood estimates of the strength parameter w1, under different parameterizations for P(E | B, C):
– linear → ΔP
– noisy-OR → causal power

Causal structure

• Hypotheses: h1 = B → E ← C; h0 = B → E

• Bayesian causal inference:

support = P(d | h1) / P(d | h0)

P(d | h1) = ∫0^1 ∫0^1 P(d | w0, w1) p(w0, w1 | h1) dw0 dw1

P(d | h0) = ∫0^1 P(d | w0) p(w0 | h0) dw0

The likelihood ratio (Bayes factor) gives the evidence in favor of h1.
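A sketch of causal support by numerical integration on a grid, assuming uniform priors over the strength parameters and the noisy-OR parameterization for h1 (the contingency counts are invented):

```python
import numpy as np

def likelihood(w0, w1, a, b, c, d):
    """P(d | w0, w1) under noisy-OR: P(e+|c+) = w1 + w0 - w1 w0, P(e+|c-) = w0."""
    p1 = w1 + w0 - w1 * w0
    p0 = w0
    return p1**a * (1 - p1)**b * p0**c * (1 - p0)**d

a, b, c, d = 6, 2, 2, 6             # invented contingency counts
w = np.linspace(0.005, 0.995, 200)  # grid over strength values in (0, 1)

w0, w1 = np.meshgrid(w, w)
p_d_h1 = likelihood(w0, w1, a, b, c, d).mean()  # average = uniform prior
p_d_h0 = likelihood(w, 0.0, a, b, c, d).mean()  # h0: w1 fixed at 0

print(np.log(p_d_h1 / p_d_h0))  # log Bayes factor in favor of h1
```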

[Figure: model fits to the Buehner and Cheng (1997) data: ΔP (r = 0.89), Power (r = 0.88), Support (r = 0.97)]

The importance of parameterization

• Noisy-OR incorporates mechanism assumptions:
– generativity: causes increase the probability of their effects
– each cause is sufficient to produce the effect
– causes act via independent mechanisms
(Cheng, 1997)

• Consider other models:
– statistical dependence: χ2 test
– generic parameterization (cf. Anderson, 1990)

[Figure: human judgments (“People”) compared with Support (noisy-OR), χ2, and Support (generic)]

Generativity is essential

• The predictions result from a “ceiling effect”
– ceiling effects only matter if you believe a cause increases the probability of an effect

[Figure: support (0–100 scale) for the ΔP = 0 contingencies P(e+|c+) = P(e+|c−) = 8/8, 6/8, 4/8, 2/8, 0/8]

Blicket detector (Dave Sobel, Alison Gopnik, and colleagues)

See this? It’s a blicket machine. Blickets make it go.

Let's put this one on the machine.

Oooh, it’s a blicket!

“Backwards blocking” (Sobel, Tenenbaum & Gopnik, 2004)

[Figure: AB trial, then A trial]

– Two objects: A and B
– Trial 1: A and B on detector – detector active
– Trial 2: A on detector – detector active
– 4-year-olds judge whether each object is a blicket
• A: a blicket (100% say yes)
• B: probably not a blicket (34% say yes)

Possible hypotheses

[Figure: the candidate causal graphs over A, B, and E: no links; A → E; B → E; both A → E and B → E]

Bayesian inference

• Evaluating causal models in light of data:

P(hi | d) = P(d | hi) P(hi) / Σj P(d | hj) P(hj)

• Inferring a particular causal relation:

P(A → E | d) = Σ_{hj ∈ H} P(A → E | hj) P(hj | d)

Bayesian inference

With a uniform prior on hypotheses, and the generic parameterization:

[Figure: probability of being a blicket, essentially equal for A and B (0.32 and 0.32, then 0.34 and 0.34)]

Modeling backwards blocking

Assume…

• Links can only exist from blocks to detectors

• Blocks are blickets with prior probability q

• Blickets always activate detectors, but detectors never activate on their own
– deterministic noisy-OR, with wi = 1 and w0 = 0

Hypotheses (does A → E? does B → E?): h00 (neither), h01 (B → E), h10 (A → E), h11 (both)

                      h00   h01   h10   h11
P(E=1 | A=0, B=0):     0     0     0     0
P(E=1 | A=1, B=0):     0     0     1     1
P(E=1 | A=0, B=1):     0     1     0     1
P(E=1 | A=1, B=1):     0     1     1     1

Priors: P(h00) = (1 − q)^2, P(h01) = (1 − q) q, P(h10) = q (1 − q), P(h11) = q^2

Before any evidence: P(B → E | d) = P(h01) + P(h11) = q

After the AB trial (A and B on the detector, detector active), h00 is eliminated:

P(B → E | d) = [P(h01) + P(h11)] / [P(h01) + P(h10) + P(h11)] = q / [q + q(1 − q)] = 1 / (2 − q)

After the A trial (A alone on the detector, detector active), h01 is eliminated as well:

                      h01   h10   h11
P(E=1 | A=1, B=0):     0     1     1
P(E=1 | A=1, B=1):     1     1     1

P(B → E | d) = P(h11) / [P(h10) + P(h11)] = q^2 / [q(1 − q) + q^2] = q

B's probability of being a blicket falls back to the prior q, even though B was never tested alone: backwards blocking.
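The whole model as a short sketch (q = 0.4 is an arbitrary prior; the update simply zeroes out hypotheses inconsistent with each trial, which is what the deterministic detector licenses):

```python
q = 0.4  # assumed prior probability that a block is a blicket

# Hypotheses (A -> E?, B -> E?) with their prior probabilities.
prior = {(0, 0): (1 - q)**2, (1, 0): q * (1 - q),
         (0, 1): (1 - q) * q, (1, 1): q**2}

def detector_on(h, a_on, b_on):
    """Deterministic noisy-OR with w = 1, w0 = 0: active iff a blicket is on."""
    return bool((h[0] and a_on) or (h[1] and b_on))

def update(posterior, a_on, b_on, active=True):
    """Zero out hypotheses inconsistent with the trial, then renormalize."""
    new = {h: p for h, p in posterior.items()
           if detector_on(h, a_on, b_on) == active}
    z = sum(new.values())
    return {h: p / z for h, p in new.items()}

def p_b_is_blicket(posterior):
    return sum(p for h, p in posterior.items() if h[1] == 1)

post = prior
print(p_b_is_blicket(post))                 # prior: q = 0.4
post = update(post, a_on=True, b_on=True)   # AB trial, detector active
print(p_b_is_blicket(post))                 # 1 / (2 - q) = 0.625
post = update(post, a_on=True, b_on=False)  # A trial, detector active
print(p_b_is_blicket(post))                 # back to q = 0.4
```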

Manipulating prior probability(Tenenbaum, Sobel, Griffiths, & Gopnik, submitted)

[Figure: blicket judgments for A and B at the initial stage, after the AB trial, and after the A trial]

Summary

• Graphical models provide solutions to many of the challenges of probabilistic models
– defining structured distributions
– representing distributions over many variables
– efficiently computing probabilities

• Causal graphical models provide tools for defining rational models of human causal reasoning and learning
