visualizing inference in large bayesian networks (ucsd m.sc. project)
DESCRIPTION
In this project we address the challenge of viewing and using Bayesian networks as their structural size and complexity grow. We introduce two new visualization methods, inference diffs and relevance filtering, to enable visual analysis of information flow in these networks and direct comparison of two evidence configurations simultaneously. We implement and discuss the performance of these visualization methods on two modestly large networks built from real-world data.
TRANSCRIPT
M.Sc. Project UCSD 2013
Clifford Champion
<cchampio@cs>
Adviser: Prof. Charles Elkan
VISUALIZING INFERENCE
IN LARGE BAYESIAN
NETWORKS
December 10th, 2013
What and Why
Designing the Visualization
Implementation and Results
B-Vis, F-AI
Traffic and Census data sets
Conclusion and Q&A
OUTLINE
WHAT AND WHY
2002: the indexed size of the internet was about 167 TB.
2002: > 330 TB of human-generated email was created.
2010: 50 billion user photos stored in Facebook .
2010: 130 TB of logs generated daily by Facebook.
2010: 2.5 PB of Walmart customer and transaction data .
2013: Over 50 GB of Tweets created daily on Twitter.
2013: eBay stores 40 PB dedicated to “deep” analysis.
“BIG DATA”
HOW DO WE USE ALL THIS DATA?
Image credit: http://commons.wikimedia.org/wiki/User:Shervinafshar
To quote Edward Tufte
“often the most effective way to describe, explore, and summarize a
set of numbers – even a very large set – is to look
at pictures of those numbers” (emphasis added)
“data graphics can be both the simplest
[and] most powerful of methods”
Visualizations help reveal interesting facts
and abstract relationships
Impossible or inefficient if using tabular data alone
In software applications, visualizations are a navigational tool
DATA VISUALIZATION WILL BE ESSENTIAL
Bayesian networks can be an important tool for “big data”
“Information flow” in Bayesian networks can be an opaque
concept
D-separation alone is not informative enough
Is there more beneath the surface?
Visualizing Bayesian networks well has been a largely
neglected goal
WHY THIS PROJECT?
A graphical model of random variables
On a scale of 0 to 1, how likely is rain today? (e.g.)
The edges of the graph define conditional (in)dependencies
between variables (nodes)
Can represent statistical, causal, and/or latent variables
What is the life expectancy of a non-smoker living in South America?
A car that won’t start can be caused by a dead battery. But being late
to work won’t cause a car to not start.
HMMs and topic clustering
Queryable: evidence goes in, new beliefs come out
If we know Winter=TRUE, what do we believe of Rain=TRUE?
BAYESIAN NETWORKS:
IN A NUTSHELL
Every random variable Y has a conditional probability
distribution P(Y | X1, …, Xm(Y)), given its m(Y) parents.
For our purposes, stored as a conditional probability table (CPT).
If Y has no parents, its probability distribution simplifies to P(Y).
Marginal distributions, e.g. P(Y) or P(Z), are
easily recovered/computed.
To create a Bayesian network, you train it (machine learning)
and/or hand-craft it (expert interviews)
BAYESIAN NETWORKS:
IN A NUTSHELL
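A minimal sketch of CPT storage and marginal recovery, in Python for illustration (the project code itself is F#); the Winter/Rain variables and all probabilities here are toy values, not from a trained model:

```python
# Prior for a parentless variable: P(Winter)
p_winter = {True: 0.4, False: 0.6}

# CPT for Rain given its single parent Winter: P(Rain | Winter)
p_rain_given_winter = {
    True:  {True: 0.7, False: 0.3},   # row for Winter=True
    False: {True: 0.2, False: 0.8},   # row for Winter=False
}

def marginal_rain():
    """Recover the marginal P(Rain) by summing out the parent:
    P(Rain=r) = sum over w of P(Rain=r | Winter=w) * P(Winter=w)."""
    return {
        r: sum(p_rain_given_winter[w][r] * p_winter[w] for w in p_winter)
        for r in (True, False)
    }

print(marginal_rain())  # P(Rain=True) = 0.7*0.4 + 0.2*0.6 = 0.4
```

With a single parent the sum has only two terms; in general the marginal sums over all parent permutations in the CPT.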
VISUAL DESIGN
Top-down causal ordering
Regularly the unstated choice for small networks
Difficult to satisfy in large/complex networks
Edge crossings are avoided
Also difficult to satisfy in
large/complex networks
THE STATE OF THE ART
Image credit: Koller, Daphne, and Nir Friedman. Probabilistic graphical
models: principles and techniques. The MIT Press, 2009.
THE STATE OF THE ART
Conditional probability tables
Michele Cossalter, Ole J. Mengshoel, and Ted Selker. “Visualizing and Understanding Large-Scale Bayesian Networks.” The AAAI-11 Workshop
on Scalable Integration of Analytics and Visualization, 2011.
CPT heatmaps
A natural representation only for variables with a single parent.
THE STATE OF THE ART
Chiang, Chih-Hung, et al. Visualizing Graphical Probabilistic Models. Technical Report 2005-017, UML CS, 2005.
Marginal distributions via embedded bar charts
THE STATE OF THE ART
Bayes Server (http://bayesserver.com)
Marginal distributions via shading
Binary variables only
Parent influence via hue blending
At most 2 parents, maybe 3
THE STATE OF THE ART
Zapata-Rivera, Juan-Diego, Eric Neufeld, and Jim E. Greer.
"Visualization of Bayesian belief networks." Proceedings of IEEE
Visualization’99, Late Breaking Hot Topics. 1999.
Williams, Lloyd, and Robert St Amant. "A Visualization Technique
for Bayesian Modeling." Proc. of IUI. Vol. 6. 2006.
Partition and fish-eye
User-driven
THE STATE OF THE ART
Sundararajan, Priya Krishnan, Ole J. Mengshoel, and Ted Selker. "Multi-focus and multi-window techniques for interactive network
exploration." IS&T/SPIE Electronic Imaging. International Society for Optics and Photonics, 2013.
Design before code
Wireframes and mockups produce high-quality “what-if’s”
General principles:
Maximize the “data-ink ratio” (Tufte)
Don’t distort the data, don’t mislead the viewer (Tufte)
Maximize readability and cleanliness (me)
Goals specific to Bayesian networks
Present the “basics” clearly and conveniently
The effects of evidence should be stupidly obvious
Should scale to large networks
Shoshin (初心, “beginner’s mind”)
PHILOSOPHIES & OBJECTIVES
Photo credit: zeze57@flickr
What are the variables of the model?
What is the structure of the model?
STRUCTURE
STRUCTURE
Low contrast (no information beyond strokes/shading)
Single, capital letter for variable names;
subscripts if needed
Legend ordering matches
the layout’s vertical order
Structural view is
zoomable, scrollable
STRUCTURE
What are the event spaces?
EVENT SPACES
Each event space receives a
color mapping
Categorical spaces jump between
contrasting hues
Ordered spaces step through
similar hues
Legend is augmented accordingly
EVENT SPACES
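The hue rules above can be sketched as follows, in Python for illustration; the hue band, saturation, and value used here are assumptions, not B-Vis's actual palette:

```python
import colorsys

def event_space_colors(n, ordered):
    """Assign one RGB color per value of an n-valued event space.

    Ordered spaces step through similar, nearby hues (a narrow band),
    so adjacent values read as a progression. Categorical spaces jump
    between contrasting hues spread around the full hue circle.
    Illustrative parameters only."""
    if ordered:
        # Narrow band of similar hues (here: cyan toward blue)
        hues = [0.55 + 0.25 * i / max(n - 1, 1) for i in range(n)]
    else:
        # Maximally spread, contrasting hues
        hues = [i / n for i in range(n)]
    return [colorsys.hsv_to_rgb(h % 1.0, 0.8, 0.9) for h in hues]

categorical_palette = event_space_colors(4, ordered=False)
ordered_palette = event_space_colors(4, ordered=True)
```

The same palette is reused everywhere a variable's event space appears, which is what makes the later pie/ring comparisons readable.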
What are the probability distributions captured in the model?
DISTRIBUTIONS
DISTRIBUTIONS
Distributions are embedded into each node
Pie chart slices are proportional to marginal probability masses
DISTRIBUTIONS
P(A)
P(V)
P(T1)
P(X)
P(T2)
What about evidence?
What about the effects of evidence?
SEEING INFERENCE
SEEING INFERENCE
Evidence nodes receive a black border
Query (non-evidence) nodes’ embedded
distributions updated to reflect
posterior distributions
SEEING INFERENCE
Let V=v
P(T1 | V=v, A=a)
P(X | V=v, A=a)
P(T2 | V=v, A=a)
Let A=a
What just happened? Difficult to see the change.
We need a way to perform a direct comparison.
Let E1 and E2 be evidence sets*, e.g. E1 = (A=a) and E2 = (A=b, V=v)
Compute the posterior distributions separately
Visualize the posterior distributions together
Inspired by code diffs in software engineering
* The word “set” is being abused here.
SEEING INFERENCE
AN “INFERENCE DIFF”
Inner “pie” is posterior for E1
Outer “ring” is posterior for E2
Seeing the difference
Consistent event space coloring
Consistent event space ordering
Changes in area and color mass
easy to spot
Evidence in E1 and E2
distinguished by black borders
around pie and/or ring
AN “INFERENCE DIFF”
Evidence
in E1
Evidence
in E2
P(X | E2)   P(X | E1)
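The diff itself is just a pairing of two posteriors over a shared event space, one drawn as the inner pie (E1) and one as the outer ring (E2). A minimal Python sketch, with hypothetical posteriors for a variable X:

```python
def inference_diff(posterior_e1, posterior_e2):
    """Pair two posteriors over the same event space so they can be
    rendered as inner pie (E1) and outer ring (E2); also report the
    per-value change for quick inspection. Illustrative sketch only."""
    assert posterior_e1.keys() == posterior_e2.keys()
    return {
        v: (posterior_e1[v], posterior_e2[v], posterior_e2[v] - posterior_e1[v])
        for v in posterior_e1
    }

# Hypothetical posteriors for X under evidence sets E1 and E2
p_x_e1 = {'low': 0.5, 'medium': 0.3, 'high': 0.2}
p_x_e2 = {'low': 0.2, 'medium': 0.3, 'high': 0.5}

diff = inference_diff(p_x_e1, p_x_e2)
# diff['high'] pairs (P(high|E1), P(high|E2), change): mass moved toward 'high'
```

Because both rings use the consistent event-space coloring and ordering, a shift in mass shows up directly as a mismatch in colored arc lengths.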
What if there are too many variables?
What about when I don’t know which variables to look for?
SCALING TO LARGER NETWORKS
RELEVANCE FILTERING
RELEVANCE FILTERING
Emphasize the variables with most change; diminish the rest
Use KL divergence to quantify change, on [0, +∞)
Call the top C% most changed
variables the “relevant” variables
Shrink & fade nodes of
irrelevant variables
Shorten and fade edges with
irrelevant variables
Reduces the canvas size needed
Facilitates discovery in
large models
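The scoring-and-filtering step can be sketched in Python as below; the epsilon smoothing, direction of the divergence, and tie-breaking are assumptions, not necessarily what B-Vis does:

```python
import math

def kl_divergence(p, q, eps=1e-12):
    """D_KL(p || q) between two distributions over a shared discrete
    event space; ranges over [0, +inf). eps guards against log(0)."""
    return sum(p[v] * math.log((p[v] + eps) / (q[v] + eps)) for v in p)

def relevant_variables(posteriors_e1, posteriors_e2, top_fraction=0.2):
    """Rank variables by how much their posterior changed between the
    two evidence sets, and call the top fraction 'relevant'. The rest
    would be shrunk and faded in the view. Sketch only."""
    ranked = sorted(
        posteriors_e1,
        key=lambda x: kl_divergence(posteriors_e1[x], posteriors_e2[x]),
        reverse=True,
    )
    keep = max(1, round(top_fraction * len(ranked)))
    return set(ranked[:keep])
```

With top_fraction=0.2 this reproduces the "top 20%" setting used in the traffic and census examples that follow.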
IMPLEMENTATION AND
RESULTS
Structure and CPT learning (F#)
Structure space search using edge operations and BIC scoring
Dynamic programming algorithm, memoizing local scores
OO Design: Types defined for network, random variable, distribution, event space, etc.
Immutable type design made life easier and computations faster
Inference (F#)
Approximate inference using Markov Chain Monte Carlo / Gibbs sampling
Visualization Tool (C#)
Adopted a variation of the Model-View-Controller paradigm
Independent threads for learning, layout, inference, and UI/rendering
Used Microsoft WPF for vector graphics and user-input handling
Used open-source Graph# for Sugiyama graph layout
All source shared at https://github.com/duckmaestro/F-AI
SOFTWARE IMPLEMENTATION
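The inference step above uses MCMC/Gibbs sampling. A compact Python sketch of the idea (the actual implementation is F#, and this toy Cloudy → Rain → WetGrass chain with made-up CPTs is not one of the project's networks):

```python
import random

# Toy chain Cloudy -> Rain -> WetGrass, with illustrative CPTs
p_cloudy = {True: 0.5, False: 0.5}
p_rain = {True: {True: 0.8, False: 0.2},    # P(Rain | Cloudy)
          False: {True: 0.1, False: 0.9}}
p_wet = {True: {True: 0.9, False: 0.1},     # P(WetGrass | Rain)
         False: {True: 0.2, False: 0.8}}

def gibbs(n_samples=20000, burn_in=2000, seed=0):
    """Estimate P(Rain=True | WetGrass=True) by Gibbs sampling: hold the
    evidence fixed and repeatedly resample each free variable from its
    full conditional (its own CPT times the CPTs of its children)."""
    rng = random.Random(seed)
    cloudy, rain = True, True   # arbitrary start; WetGrass=True is evidence
    hits = 0
    for i in range(n_samples):
        # Resample Cloudy with weight P(c) * P(rain | c)
        w = {c: p_cloudy[c] * p_rain[c][rain] for c in (True, False)}
        cloudy = rng.random() < w[True] / (w[True] + w[False])
        # Resample Rain with weight P(r | cloudy) * P(WetGrass=True | r)
        w = {r: p_rain[cloudy][r] * p_wet[r][True] for r in (True, False)}
        rain = rng.random() < w[True] / (w[True] + w[False])
        if i >= burn_in and rain:
            hits += 1
    return hits / (n_samples - burn_in)
```

For these toy CPTs the exact posterior works out to 0.405 / 0.515 ≈ 0.786, so the estimate should land near that after burn-in.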
Traffic flow measurements from the San Francisco bay area
highway system
32 sensor locations
4 discretized values of traffic flow amount from low to high
4415 examples
Acquired by Krause and Guestrin; reprocessed by Shahaf et al.
Model Training
Entire data set used
Uniform Dirichlet prior
Parent limit of 2
EXAMPLE: SAN FRANCISCO TRAFFIC
NETWORK
TRAFFIC NETWORK
“Traffic” Bayesian network
visualized in B-Vis.
Must zoom out a bit to see
entire model of this size.
Pictured with no evidence
configured.
INFERENCE DIFF OF TRAFFIC NETWORK
E1 = (empty)
E2 = (A4=‘medium’)
Relevance filtering: top 20%.
Reduction in overall space
requirement allows us to see
entire structure even while
zoomed in.
The impact of this evidence
diminishes as it propagates
through this model.
1990 U.S. Census
68 attributes
Each attribute has categorical or discretized values
Each attribute has between 2 and 17 values
2.4 million examples
Discretized by Meek et al.; hosted by UCI Machine Learning
Repository
Model Training
First 10,000 randomly chosen examples
Uniform Dirichlet prior
Parent limit of 3
EXAMPLE: 1990 U.S. CENSUS
CENSUS NETWORK
“Census” Bayesian network
visualized in B-Vis.
Must zoom quite far out to
see the entire model on screen.
INFERENCE DIFF OF CENSUS NETWORK
E1 = (empty)
E2 = (‘income4’=‘yes’)
‘income4’
TOP 20% RELEVANT VARIABLES
‘income4’ := Interest, dividends, or rental income in prior year.
Relevant variables (in no special order):
year of immigration ('immigr')
place of birth ('pob')
Hispanic heritage ('hispanic')
relationship to the homeowner ('relat2')
whether living in a subfamily ('subfam1')
number of subfamilies ('subfam2')
whether working on a farm ('income3')
whether the individual served in the military during no major war or conflict ('othrserv')
their ancestry ('ancstry2')
their means of transportation to work ('means')
their status in the job market ('avail')
employment status of parents ('remplpar')
‘income4’
LET’S DRILL DOWN
Let’s inspect
‘means’:
Increased
likelihood of no
daily commute
Decreased
likelihood of
bike, train, and
other non-auto
means of
commute
INFORMATION FLOW SURPRISE
Parents of ‘income4’ were
irrelevant in this inference
diff: ‘income6’, ‘rpincome’,
‘rearning’.
Relevance filtering reveals
that, in general, the greatest
impact of evidence can be
“far away”: a snowball effect.
‘income4’
‘rpincome’
‘income6’
‘rearning’
PARTING THOUGHTS
Finding the greatest difference between two medical
treatments – a Bayesian network as a causal model
E1 = ( Age=38,
HasConditionX=True,
do(MedicationA=True),
do(MedicationB=False) )
E2 = ( Age=38,
HasConditionX=True,
do(MedicationA=False),
do(MedicationB=True) )
An inference diff with relevance filtering could clearly
and visually present the greatest expected differences in
prognosis and side-effects.
OTHER USES OF INFERENCE DIFFS
Layout stability is important
Better layout algorithms exist, and may be
customizable with relevance filtering in mind
Unused visual modalities have untapped potential
Node shape, additional evidence set rings, etc.
Large event spaces / continuous event spaces
Adaptive color space folding? / Density pie chart?
Alternative measures of “relevance”
User-specifiable event space value importance
Other graphical models
Dynamic Bayesian networks? Conditional random fields?
CHALLENGES & FUTURE WORK
Visualizations can help reveal insights
Visualizations can communicate dense information efficiently
We introduced inference diffs for direct comparison of
posterior beliefs in Bayesian networks
We extended inference diffs with relevance filtering
to assist users in locating interesting phenomena in large networks
IN CONCLUSION
Thanks! Q & A
APPENDIX
UX question: is there an easy way to assign evidence?
Radial drag-drop menu
Keeps with pie chart motif
Drag outside inward to
assign evidence
Drag inside outward to
remove evidence
ASSIGNING EVIDENCE
Dropping on the
ring assigns
evidence in E2
Dropping on the
center assigns
evidence in E1
CONDITIONAL PROBABILITY TABLES
Use vertical space, not horizontal space
Event space color mapping reused
for probability masses and for parent
permutations
DISCOVERY AND NAVIGATION
Goldfarb, Doron, et al. "Art History on Wikipedia, a Macroscopic Observation."arXiv preprint arXiv:1304.5629 (2013).