
Probabilistic Programming; Ways Forward

Frank Wood

Outline

• What is probabilistic programming?

• What are the goals of the field?

• What are some challenges?

• Where are we now?

• Ways forward…

What is probabilistic programming?

An Emerging Field

[Venn diagram — probabilistic programming at the intersection of:
ML: Algorithms & Applications
STATS: Inference & Theory
PL: Compilers, Semantics, Analysis]

Conceptualization

CS: Parameters → Program → Output

Probabilistic Programming / Statistics: Parameters → Program → Observations

p(θ|x) = p(x|θ) p(θ) / p(x)

Operative Definition

"Probabilistic programs are usual functional or imperative programs with two added constructs:

(1) the ability to draw values at random from distributions, and

(2) the ability to condition values of variables in a program via observations.”

— Gordon et al., 2014
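For concreteness, a minimal Anglican-style sketch using both constructs (the query itself is illustrative, modeled on the examples later in this talk rather than taken from the slide):

(defquery two-constructs []
  (let [bias (sample (beta 1 1))]   ; (1) draw a value at random from a distribution
    (observe (flip bias) true)      ; (2) condition on an observed value
    (predict :bias bias)))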

What are the goals of probabilistic programming?

Increased Productivity

[Scatter plot: lines of Anglican code (log scale) vs. lines of Matlab/Java code, annotated with (fn [x] (logb 1.04 (+ 1 x))); points: HPYP [Wood 2007], DDPMO [Neiswanger et al 2014], PDIA [Pfau 2010], collapsed LDA, DP conjugate mixture; other labels: log/lin, p(·|data)]

Automatic Inference

Programming Language Representation / Abstraction Layer

Inference Engine(s)

Models

CARON ET AL.

This lack of consistency is shared by other models based on the Polya urn construction (Zhu et al., 2005; Ahmed and Xing, 2008; Blei and Frazier, 2011). Blei and Frazier (2011) provide a detailed discussion on this issue and describe cases where one should or should not bother about it.

It is possible to define a slightly modified version of our model that is consistent under marginalisation, at the expense of an additional set of latent variables. This is described in Appendix C.

3.2 Stationary Models for Cluster Locations

To ensure we obtain a first-order stationary Pitman-Yor process mixture model, we also need to satisfy (B). This can be easily achieved if, for k ∈ I(m_t^t),

U_{k,t} ∼ p(·|U_{k,t−1}) if k ∈ I(m_{t−1}^t), and U_{k,t} ∼ H otherwise,

where H is the invariant distribution of the Markov transition kernel p(·|·). In the time series literature, many approaches are available to build such transition kernels based on copulas (Joe, 1997) or Gibbs sampling techniques (Pitt and Walker, 2005).

Combining the stationary Pitman-Yor and cluster locations models, we can summarize the full model by the following Bayesian network in Figure 1. It can also be summarized using a Chinese restaurant metaphor (see Figure 2).

Figure 1: A representation of the time-varying Pitman-Yor process mixture as a directed graphical model, representing conditional independencies between variables. All assignment variables and observations at time t are denoted c_t and z_t, respectively.

3.3 Properties of the Models

Under the uniform deletion model, the number A_t = Σ_i m_{i,t−1}^t of alive allocation variables at time t can be written as

A_t = Σ_{j=1}^{t−1} Σ_{k=1}^{n} X_{j,k}

[Graphical model figure: nodes r_0…r_T, s_0…s_T, y_1…y_T with parameters c_0, c_1, γ, π_m, φ_m, θ_m, H_r, H_y]

Gaussian Mixture Model


Figure: From left to right, graphical models for a finite Gaussian mixture model (GMM), a Bayesian GMM, and an infinite GMM.

c_i | π ∼ Discrete(π)
y_i | c_i = k, Θ ∼ Gaussian(·|θ_k)
π | α ∼ Dirichlet(·| α/K, …, α/K)
Θ ∼ G_0

Wood (University of Oxford) Unsupervised Machine Learning January, 2014 16 / 19

Latent Dirichlet Allocation

[Plate diagram: nodes w_di, z_di, θ_d, φ_k, β; plates d = 1…D, i = 1…N_d, k = 1…K]

Figure 1. Graphical model for the LDA model.

Lecture LDA

LDA is a hierarchical model used to model text documents. Each document is modeled as a mixture of topics. Each topic is defined as a distribution over the words in the vocabulary. Here, we will denote by K the number of topics in the model. We use D to indicate the number of documents, M to denote the number of words in the vocabulary, and N_d to denote the number of words in document d. We will assume that the words have been translated to the set of integers {1, …, M} through the use of a static dictionary. This is for convenience only and the integer mapping will contain no semantic information. The generative model for the D documents can be thought of as sequentially drawing a topic mixture θ_d for each document independently from a Dir_K(α·1) distribution, where Dir_K(γ) is a Dirichlet distribution over the K-dimensional simplex with parameters [γ_1, γ_2, …, γ_K]. Each of the K topics {φ_k}_{k=1}^K is drawn independently from Dir_M(β·1). Then, for each of the i = 1…N_d words in document d, an assignment variable z_di is drawn from Mult(θ_d). Conditional on the assignment variable z_di, word i in document d, denoted w_di, is drawn independently from Mult(φ_{z_di}). The graphical model for the process can be seen in Figure 1.

The model is parameterized by the vector-valued parameters {θ_d}_{d=1}^D and {φ_k}_{k=1}^K, the parameters {z_di}_{d=1…D, i=1…N_d}, and the scalar positive parameters α and β. The model is formally written as:

θ_d ∼ Dir_K(α·1)
φ_k ∼ Dir_M(β·1)
z_di ∼ Mult(θ_d)
w_di ∼ Mult(φ_{z_di})

Equivalently (with Discrete in place of Mult):

θ_d ∼ Dir_K(α·1)
φ_k ∼ Dir_M(β·1)
z_di ∼ Discrete(θ_d)
w_di ∼ Discrete(φ_{z_di})

Wood (University of Oxford) Unsupervised Machine Learning January, 2014 15 / 19

What are some challenges?

Challenges

• Unbounded recursion

• Equality and continuous variables

Unbounded Recursion

(defn geometric
  "generates geometrically distributed values in {0,1,2,...}"
  ([p] (geometric p 0))
  ([p n] (if (sample (flip p))
           n
           (geometric p (+ n 1)))))

[State diagram: from each state n = 0, 1, 2, …, return n with probability p or advance to n+1 with probability 1−p]
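The recursion above induces the geometric distribution: the program returns n exactly when the first n flips fail and flip n+1 succeeds, so

P(N = n) = (1 − p)^n · p,  n = 0, 1, 2, …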

Defining Distributions

(defm pick-a-stick [stick v l k]
  ;; picks a stick given a stick generator,
  ;; given a value v ~ uniform-continuous(0,1);
  ;; should be called with l = 0.0, k = 1
  (let [u (+ l (stick k))]
    (if (> u v)
      k
      (pick-a-stick stick v u (+ k 1)))))

[Diagram: stick lengths (stick 1), (stick 2), (stick 3), …; v (sample (uniform-continuous 0 1)) falls in one of the segments]
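For reference, the stick-breaking construction this helper walks over: with break proportions b_k (for a Dirichlet process, b_k ∼ Beta(1, α), as in dirichlet-process-breaking-rule later in these slides), stick k has length

(stick k) = b_k · ∏_{j=1}^{k−1} (1 − b_j),

and pick-a-stick returns the index of the segment that v lands in.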

Semantics and Termination

(defn p []
  (if (sample (flip 0.5))
    1
    (if (sample (flip 0.5))
      (p)
      (infinite-loop))))

(def infinite-loop #(loop [] (recur)))

[Computation tree: each call returns 1 with probability 0.5, recurses with probability 0.25, or diverges with probability 0.25]

p(x = 1) = Σ_{n=1}^∞ (1/2)^(2n−1) = 2/3 ?

Equality and Continuous Variables

Why are your probabilistic programming systems anti-equality?

(defquery bayes-net []
  (let [is-cloudy  (sample (flip 0.5))
        is-raining (cond (= is-cloudy true)  (sample (flip 0.8))
                         (= is-cloudy false) (sample (flip 0.2)))
        sprinkler  (cond (= is-cloudy true)  (sample (flip 0.1))
                         (= is-cloudy false) (sample (flip 0.5)))
        wet-grass  (cond (and (= sprinkler true) (= is-raining true))   (sample (flip 0.99))
                         (and (= sprinkler false) (= is-raining false)) (sample (flip 0.0))
                         (or (= sprinkler true) (= is-raining true))    (sample (flip 0.9)))]
    (observe (= wet-grass true))
    (predict :s (hash-map :is-cloudy is-cloudy :is-raining is-raining :sprinkler sprinkler))))

Equality

(defquery bayes-net []
  (let [is-cloudy  (sample (flip 0.5))
        is-raining (cond (= is-cloudy true)  (sample (flip 0.8))
                         (= is-cloudy false) (sample (flip 0.2)))
        sprinkler  (cond (= is-cloudy true)  (sample (flip 0.1))
                         (= is-cloudy false) (sample (flip 0.5)))
        wet-grass  (cond (and (= sprinkler true) (= is-raining true))   (sample (flip 0.99))
                         (and (= sprinkler false) (= is-raining false)) (sample (flip 0.0))
                         (or (= sprinkler true) (= is-raining true))    (sample (flip 0.9)))]
    (observe (dirac wet-grass) true)
    (predict :s (hash-map :is-cloudy is-cloudy :is-raining is-raining :sprinkler sprinkler))))

p(x|y = o) ∝ δ(y − o) p(x, y) = p(x, y = o)

Dirac Observe

(defquery bayes-net []
  (let [is-cloudy  (sample (flip 0.5))
        is-raining (cond (= is-cloudy true)  (sample (flip 0.8))
                         (= is-cloudy false) (sample (flip 0.2)))
        sprinkler  (cond (= is-cloudy true)  (sample (flip 0.1))
                         (= is-cloudy false) (sample (flip 0.5)))
        wet-grass  (cond (and (= sprinkler true) (= is-raining true))   (sample (flip 0.99))
                         (and (= sprinkler false) (= is-raining false)) (sample (flip 0.0))
                         (or (= sprinkler true) (= is-raining true))    (sample (flip 0.9)))]
    (observe (normal 0.0 tolerance) (d wet-grass true))
    (predict :s (hash-map :is-cloudy is-cloudy :is-raining is-raining :sprinkler sprinkler))))

ABC Observe

p(x|y = o) ∝ p(d(y, o)) p(x, y)

(defquery bayes-net []
  (let [is-cloudy  (sample (flip 0.5))
        is-raining (cond (= is-cloudy true)  (sample (flip 0.8))
                         (= is-cloudy false) (sample (flip 0.2)))
        sprinkler  (cond (= is-cloudy true)  (sample (flip 0.1))
                         (= is-cloudy false) (sample (flip 0.5)))
        wet-grass  (cond (and (= sprinkler true) (= is-raining true))   (flip 0.99)
                         (and (= sprinkler false) (= is-raining false)) (flip 0.0)
                         (or (= sprinkler true) (= is-raining true))    (flip 0.9))]
    (observe wet-grass true)
    (predict :s (hash-map :is-cloudy is-cloudy :is-raining is-raining :sprinkler sprinkler))))

Noisy Observe

p(x|y = o) ∝ p(o|x) p(x)

Continuous Variables

(defquery unknown-mean []
  (let [sigma (sqrt 2)
        mu    (marsaglia-normal 1 5)]
    (observe (normal mu sigma) 9)
    (observe (normal mu sigma) 8)
    (predict :mu mu)))
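As a sanity check on what inference should recover here (standard Gaussian conjugacy, assuming marsaglia-normal 1 5 encodes a Normal prior on mu with mean 1 and variance 5, and the likelihood variance is sigma² = 2):

posterior precision = 1/5 + 2/2 = 1.2
posterior mean = (1 · 1/5 + (9 + 8)/2) / 1.2 = 8.7 / 1.2 = 7.25

so mu | y ∼ Normal(7.25, variance ≈ 0.83).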

Measure Theoretic Challenges

(defquery which-nationality [gpa]
  (let [nationality   (sample (categorical [["USA" 0.25] ["India" 0.75]]))
        simulated_gpa (if (= nationality "USA") (american-gpa) (indian-gpa))]
    (observe (dirac simulated_gpa) gpa)
    (predict :nationality nationality)))

p(nationality = "USA" | gpa = 4.0) = ?

The “Indian GPA problem”

American GPA Distribution [0,4]

(defn american-gpa []
  (if (sample (flip 0.95))
    (* 4 (sample (beta 8 2)))
    (if (sample (flip 0.85)) 4.0 0.0)))

Indian GPA Distribution [0,10]

(defn indian-gpa []
  (if (sample (flip 0.99))
    (* 10 (sample (beta 5 5)))
    (if (sample (flip 0.1)) 0.0 10.0)))

Mixed GPA Distribution

(defn student-gpa []
  (if (sample (flip 0.25))
    (american-gpa)
    (indian-gpa)))

Measure Theoretic Challenges

(defquery which-nationality [gpa tolerance]
  (let [nationality   (sample (categorical [["USA" 0.25] ["India" 0.75]]))
        simulated_gpa (if (= nationality "USA") (american-gpa) (indian-gpa))]
    (observe (normal simulated_gpa tolerance) gpa)
    (predict :nationality nationality)))

p(nationality = "USA" | gpa = 4.0) = ?

The “Indian GPA problem” by Russell
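One standard reading of why this example is hard (the slides pose it as a question): american-gpa places an atom of probability mass at exactly 4.0 while indian-gpa is continuous there, so under exact conditioning on gpa = 4.0 the atom dominates and p(nationality = "USA" | gpa = 4.0) = 1; the tolerance-based observe above instead mixes density and point-mass contributions and can give a noticeably different answer.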

Where are we now?

[Timeline figure, 1990–2010: probabilistic programming systems grouped by community (PL, ML/STATS, AI) — HANSEI, IBAL, Figaro, Church, webChurch, Venture, Anglican, Probabilistic-C, BUGS, WinBUGS, JAGS, STAN, LibBi, infer.NET, Factorie, BLOG, Prism, Prolog, KMP, ProbLog, Simula; annotations: "Discrete RV's Only", "Bounded Recursion"]

Ways forward...

Trace Probability

• observe data points y_1, y_2, …
• internal random choices x_1 = {x_11, x_12, x_13}, x_2 = {x_21, x_22}, etc.
• simulate from f(x_n | x_1:n−1) by running the program forward
• weight execution traces by g(y_n | x_1:n)

p(y_1:N, x_1:N) = ∏_{n=1}^N g(y_n | x_1:n) f(x_n | x_1:n−1)
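To tie this notation to a concrete program: in the unknown-mean query above, the internal random choices x are the draws made inside marsaglia-normal (so f is the prior over mu they induce), and the two observe statements contribute g(y_1 = 9 | x) and g(y_2 = 8 | x), each a Normal(mu, sqrt 2) density; the weight of a trace is the product of those two likelihood terms.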

[State-space diagram: latent states x_1, x_2, x_3, … with transitions f(x_n | x_1:n−1) and observations y_1, y_2, y_3, … with likelihoods g(y_n | x_1:n)]

SMC

Iteratively (n = 1, 2, …):
- simulate
- weight
- resample

[Animation: particles advance to each observe]

Intuitively
- run
- wait
- fork

SMC for Probabilistic Programming

[Diagram: threads, observe delimiter, continuations]

SMC Inner Loop

[Diagram: particles advancing through observes n, n+1, …]
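To make the simulate / weight / resample loop concrete, here is a minimal plain-Clojure sketch of the resampling step (not Anglican's actual machinery; the particle representation and the resample name are illustrative):

;; Each particle is a map {:value v :weight w}. At an observe barrier we
;; draw n particles with replacement, with probability proportional to
;; :weight, and reset the weights to 1 before continuing the simulation.
(defn resample [particles n]
  (let [total (reduce + (map :weight particles))
        pick  (fn []
                (loop [u  (* (rand) total)
                       ps particles]
                  (let [p (first ps)]
                    (if (or (empty? (rest ps)) (< u (:weight p)))
                      p
                      (recur (- u (:weight p)) (rest ps))))))]
    (mapv #(assoc % :weight 1.0) (repeatedly n pick))))

For example, (resample [{:value :a :weight 0.1} {:value :b :weight 0.9}] 4) returns four particles, on average mostly copies of :b, each carrying weight 1.0.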

• Sequential Monte Carlo is now a building block for other inference techniques

• Particle MCMC
  - PIMH: "particle independent Metropolis-Hastings"
  - iCSMC: "iterated conditional SMC"

[Animation: sweeps s = 1, 2, 3]

[Andrieu, Doucet, Holenstein 2010]

[W., van de Meent, Mansinghka 2014]

SMC slowed down for clarity

SMC Parallelism Bottleneck

Asynchronously

- simulate
- weight
- branch

Particle Cascade

Paige, W., Doucet, Teh; NIPS 2014


The particle cascade provides an unbiased estimator of the marginal likelihood, whose variance decreases at rate 1/K_0 in the number of initial particles K_0. Define

p̂(y_0:n) := (1/K_0) Σ_{k=1}^{K_n} W_n^k

Theorem: For any K_0 ≥ 1 and n ≥ 0, E[p̂(y_0:n)] = p(y_0:n).

Theorem: For any n ≥ 0, there exists a constant a_n such that V[p̂(y_0:n)] < a_n / K_0.

Theoretical Properties

Conclusion

Bubble Up

Inference

Probabilistic Programming Language

Models

Applications

Probabilistic Programming System

Thank You

• Questions?

• Funding: DARPA, Amazon, Microsoft

Opportunities

• Parallelism

“Asynchronous Anytime Sequential Monte Carlo” [Paige, W., Doucet, Teh NIPS 2014]

• Backwards passing “Particle Gibbs with Ancestor Sampling for Probabilistic Programs” [van de Meent, Yang, Mansinghka, W. AISTATS 2015]

• Search “Maximum a Posteriori Estimation by Search in Probabilistic Models” [Tolpin, W., SOCS, 2015]

• Adaptation “Output-Sensitive Adaptive Metropolis-Hastings for Probabilistic Programs” [Tolpin, van de Meent, Paige, W ; in submission]

• Novel proposals “Adaptive PMCMC” [Paige, W.; in submission]

Probabilistic-C

z_0 ∼ Discrete([1/K, …, 1/K])
z_n | z_{n−1} ∼ Discrete(T_{z_{n−1}})
y_n | z_n ∼ Normal(μ_{z_n}, σ²)

Paige & W.; ICML 2014

How can you participate?

Ways to Participate

• Contribute applications

• https://bitbucket.org/fwood/anglican-examples

• Contribute inference algorithms

• https://bitbucket.org/dtolpin/embang

An Analogy

Automatic Differentiation : Supervised Learning
Probabilistic Programming : Unsupervised Learning

General Purpose Inference

(defquery sat-solver [N formula]
  "explores an N-dimensional universe for worlds that satisfy the formula"
  (let [state (repeatedly N (fn [] (sample (flip 0.5))))]
    (observe (dirac (formula state)) true)
    (predict :state state)))

(defdist dirac
  "Dirac distribution"
  [x] []
  (sample [this] x)
  (observe [this value]
    (if (= x value) 0.0 NegInf)))

(defm satisfiable-3cnf-formula [state]
  (let [v (fn [i] (nth state i))]
    (and (or (v 0) (not (v 1)) (not (v 2)))
         (or (not (v 0)) (v 1) (v 2))
         (or (not (v 0)) (not (v 1)) (not (v 2))))))

General Purpose Inference

(defquery md5-inverse [L md5str]
  "conditional distribution of strings that map to the same MD5 hashed string"
  (let [mesg (sample (string-generative-model L))]
    (observe (dirac md5str) (md5 mesg))
    (predict :message mesg)))


Not Sum-Product: Bayesian HMM

Suppose the transition matrix is unknown: T_k ∼ Dirichlet(α_k)

Paige & W.; ICML 2014

Range of Effectiveness

[Timeline figure, 1990–2010: the same systems as in the earlier timeline, grouped by community (PL, ML/STATS, AI)]

Continuous Variables

(defm marsaglia-normal [mean var]
  (let [d (uniform-continuous -1.0 1.0)
        x (sample d)
        y (sample d)
        s (+ (* x x) (* y y))]
    (if (< s 1)
      (+ mean (* (sqrt var)
                 (* x (sqrt (* -2 (/ (log s) s))))))
      (marsaglia-normal mean var))))

Scalability: Particle Count

• Comparison across particle-based inference approaches: raw speed of drawing samples

Unbounded Recursion

Expressivity Efficiency

Credits

• Code highlighting: http://hilite.me

Forward Inference (SMC)

[Animation: particles / continuations advance to each observe]

Bayesian Nonparametrics

(defm pick-a-stick [stick v l k]
  ;; picks a stick given a stick generator,
  ;; given a value v ~ uniform-continuous(0,1);
  ;; should be called with l = 0.0, k = 1
  (let [u (+ l (stick k))]
    (if (> u v)
      k
      (pick-a-stick stick v u (+ k 1)))))

(defm remaining [b k]
  (if (<= k 0)
    1
    (* (- 1 (b k)) (remaining b (- k 1)))))

(defm polya [stick]
  ;; given a stick generating function,
  ;; polya returns a function that samples
  ;; stick indexes from the stick lengths
  (let [uc01 (uniform-continuous 0 1)]
    (fn []
      (let [v (sample uc01)]
        (pick-a-stick stick v 0.0 1)))))

(defm dirichlet-process-breaking-rule [alpha k]
  (sample (beta 1.0 alpha)))

(defm stick [breaking-rule]
  ;; given a breaking-rule function which
  ;; returns a value between 1 and 0 for a
  ;; stick index k, returns a function that
  ;; returns the stick length for index k
  (let [b (mem breaking-rule)]
    (fn [k]
      (if (< 0 k)
        (* (b k) (remaining b (- k 1)))
        0))))

[Diagram: stick lengths (stick 1), (stick 2), (stick 3), …; v (sample (uniform-continuous 0 1))]

Syntax & Implementation Considerations

• Embedded vs. Standalone

• Imperative vs. functional

• Lisp vs. Python vs. C vs.
