TEMPORAL PROBABILISTIC MODELS
MOTIVATION
Observing a stream of data:
- Monitoring (of people, computer systems, etc.)
- Surveillance, tracking
- Finance & economics
- Science
Questions: modeling & forecasting, unobserved variables
TIME SERIES MODELING
Time occurs in steps t = 0, 1, 2, ...
- A time step can be seconds, days, years, etc.
- State variable Xt, t = 0, 1, 2, ...
- For partially observed problems, we see observations Ot, t = 1, 2, ... and do not see the X's
- The X's are hidden variables (aka latent variables)
MODELING TIME
The arrow of time: causality? Bayesian networks to the rescue.
Causes -> Effects
PROBABILISTIC MODELING
For now, assume the fully observable case.
What parents should each variable have?
[Diagram: candidate networks over X0, X1, X2, X3]
MARKOV ASSUMPTION
Assume Xt+k is independent of all Xi for i < t:
P(Xt+k | X0,...,Xt+k-1) = P(Xt+k | Xt,...,Xt+k-1)
This defines a k-th order Markov chain.
[Diagrams: chains over X0...X3 of order 0, 1, 2, and 3]
1ST ORDER MARKOV CHAIN
Markov chains of order k > 1 can be converted into 1st order MCs [left as an exercise].
So, without loss of generality, "MC" refers to a 1st order MC.
[Diagram: X0 -> X1 -> X2 -> X3]
INFERENCE IN MC
What independence relationships can we read from the BN?
[Diagram: X0 -> X1 -> X2 -> X3]
Observing X1 makes X0 independent of X2, X3, ...
P(Xt|Xt-1) is known as the transition model.
INFERENCE IN MC
Prediction: what is the probability of a future state?
P(Xt) = Σ_{x0,...,xt-1} P(x0,...,xt-1, Xt)
      = Σ_{x0,...,xt-1} P(x0) Π_{i=1..t} P(xi|xi-1)
      = Σ_{xt-1} P(Xt|xt-1) P(xt-1)
The distribution "blurs" over time and approaches a stationary distribution as t grows, which limits prediction power. The rate of blurring is known as the mixing time.
The last line gives an incremental approach: update P(Xt) from P(Xt-1) one step at a time.
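The incremental prediction step can be sketched in a few lines of Python. Everything here (the two-state chain, the function name `predict`, the numbers) is illustrative, not from the slides:

```python
# Incremental Markov-chain prediction: repeatedly apply
# P(X_t) = sum_{x_{t-1}} P(X_t | x_{t-1}) P(x_{t-1}).

def predict(prior, transition, steps):
    """prior[i] = P(X_0 = i); transition[i][j] = P(X_{t+1} = j | X_t = i)."""
    p = list(prior)
    for _ in range(steps):
        p = [sum(p[i] * transition[i][j] for i in range(len(p)))
             for j in range(len(transition[0]))]
    return p

# Two-state chain: the distribution "blurs" toward the stationary
# distribution (here [0.5, 0.5]) as t grows.
T = [[0.7, 0.3],
     [0.3, 0.7]]
print(predict([1.0, 0.0], T, 1))    # [0.7, 0.3]
print(predict([1.0, 0.0], T, 50))   # ~[0.5, 0.5]: mixing has washed out the start state
```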
HOW DOES THE MARKOV ASSUMPTION AFFECT THE CHOICE OF STATE?
Suppose we're tracking a point (x,y) in 2D. What if the point is...
- A momentumless particle subject to thermal vibration?
- A particle with velocity?
- A particle with intent, like a person?
HOW DOES THE MARKOV ASSUMPTION AFFECT THE CHOICE OF STATE?
Suppose the point is the position of our robot, and we observe velocity and intent. What if:
- Terrain conditions affect speed?
- Battery level affects speed?
- Position is noisy, e.g. from GPS?
IS THE MARKOV ASSUMPTION APPROPRIATE FOR:
- A car on a slippery road?
- Sales of toothpaste?
- The stock market?
HISTORY DEPENDENCE
In Markov models, the state must be chosen so that the future is independent of history given the current state.
Often this requires adding variables that cannot be directly observed.
PARTIAL OBSERVABILITY
Hidden Markov Model (HMM)
[Diagram: hidden state variables X0 -> X1 -> X2 -> X3, each Xt emitting an observed variable Ot]
P(Ot|Xt) is called the observation model (or sensor model).
INFERENCE IN HMMS
Four queries: filtering, prediction, smoothing (aka hindsight), and most likely explanation.
[Diagram: HMM X0...X3 with observations O1...O3]
FILTERING
The name comes from signal processing.
P(Xt|o1:t) = Σ_{xt-1} P(xt-1|o1:t-1) P(Xt|xt-1, ot)
where, by Bayes' rule and the Markov property,
P(Xt|xt-1, ot) = P(ot|xt-1, Xt) P(Xt|xt-1) / P(ot|xt-1)
              = α P(ot|Xt) P(Xt|xt-1)
[Diagram: HMM with observations; query variable Xt]
FILTERING
Putting it together:
P(Xt|o1:t) = α Σ_{xt-1} P(xt-1|o1:t-1) P(ot|Xt) P(Xt|xt-1)
This is the forward recursion: if we keep track of P(Xt|o1:t), we get O(1) updates for all t!
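The forward recursion can be sketched as follows. The two-state transition and observation tables are assumed for illustration (umbrella-world-style numbers, not anything from the slides):

```python
# Forward recursion for a discrete HMM: maintain P(X_t | o_{1:t})
# and update it in constant time per step.

def normalize(p):
    s = sum(p)
    return [x / s for x in p]

def forward_step(belief, transition, obs_model, obs):
    """belief[i] = P(x_{t-1} = i | o_{1:t-1});
    transition[i][j] = P(X_t = j | x_{t-1} = i);
    obs_model[j][obs] = P(o_t = obs | X_t = j)."""
    n = len(transition[0])
    predicted = [sum(belief[i] * transition[i][j] for i in range(len(belief)))
                 for j in range(n)]
    return normalize([obs_model[j][obs] * predicted[j] for j in range(n)])

# state 0 = rain, state 1 = no rain; observation 0 = umbrella seen.
T = [[0.7, 0.3], [0.3, 0.7]]
O = [[0.9, 0.1], [0.2, 0.8]]     # O[state][observation]
belief = [0.5, 0.5]
for o in [0, 0]:                  # umbrella seen on two consecutive days
    belief = forward_step(belief, T, O, o)
print(belief)                     # P(rain) rises toward ~0.88
```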
INFERENCE IN HMMS Filtering Prediction Smoothing, aka hindsight Most likely explanation
X0 X1 X2 X3
O1 O2 O3
Query
PREDICTION
P(Xt+k|o1:t) in 2 steps: compute P(Xt|o1:t), then propagate through P(Xt+k|Xt).
Filter, then predict as with a standard MC.
SMOOTHING
P(Xk|o1:t) for k < t:
P(Xk|o1:k, ok+1:t) = P(ok+1:t|Xk, o1:k) P(Xk|o1:k) / P(ok+1:t|o1:k)
                   = α P(ok+1:t|Xk) P(Xk|o1:k)
[Diagram: HMM X0...X3 with observations O1...O3; query Xk]
The second factor is obtained by standard filtering to time k.
SMOOTHING
Computing the backward term P(ok+1:t|Xk):
P(ok+1:t|Xk) = Σ_{xk+1} P(ok+1:t|Xk, xk+1) P(xk+1|Xk)
             = Σ_{xk+1} P(ok+1:t|xk+1) P(xk+1|Xk)
             = Σ_{xk+1} P(ok+2:t|xk+1) P(ok+1|xk+1) P(xk+1|Xk)
[Diagram: HMM X0...X3 with observations O1...O3]
Given the next state, what's the probability of the remaining observation sequence? This is a backward recursion.
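Combining the forward and backward recursions gives a simple smoother. The model numbers below are assumed for illustration (same toy two-state model as the filtering example; `smooth` is a made-up name):

```python
# Forward-backward smoothing for a discrete HMM:
# P(X_k | o_{1:t}) = alpha * P(o_{k+1:t} | X_k) * P(X_k | o_{1:k}).

def normalize(p):
    s = sum(p)
    return [x / s for x in p]

def smooth(prior, transition, obs_model, observations):
    n = len(prior)
    # Forward pass: f[k] = P(X_k | o_{1:k}); f[0] is the prior.
    f = [list(prior)]
    for o in observations:
        pred = [sum(f[-1][i] * transition[i][j] for i in range(n)) for j in range(n)]
        f.append(normalize([obs_model[j][o] * pred[j] for j in range(n)]))
    # Backward pass: b holds P(o_{k+1:t} | X_k), initially all ones.
    b = [1.0] * n
    smoothed = [None] * len(observations)
    for k in range(len(observations) - 1, -1, -1):
        smoothed[k] = normalize([f[k + 1][j] * b[j] for j in range(n)])
        b = [sum(transition[i][j] * obs_model[j][observations[k]] * b[j]
                 for j in range(n)) for i in range(n)]
    return smoothed

T = [[0.7, 0.3], [0.3, 0.7]]
O = [[0.9, 0.1], [0.2, 0.8]]
print(smooth([0.5, 0.5], T, O, [0, 0]))
```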
INFERENCE IN HMMS
Most likely explanation: the query returns a path through state space x0, ..., x3.
MLE: VITERBI ALGORITHM
Recursive computation of the maximum likelihood over paths to each xt in Val(Xt):
mt(Xt) = max_{x1:t-1} P(x1,...,xt-1, Xt | o1:t)
       = α P(ot|Xt) max_{xt-1} P(Xt|xt-1) mt-1(xt-1)
The previous ML state is argmax_{xt-1} P(Xt|xt-1) mt-1(xt-1).
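The recursion, plus back-pointers to recover the path, might look like this. The model numbers are the same illustrative two-state tables as above; `viterbi` and its arguments are invented names:

```python
# Viterbi: m_t(x) = max over x_{1:t-1} of P(x_{1:t-1}, x | o_{1:t}),
# computed left to right, with back-pointers for path recovery.

def viterbi(prior, transition, obs_model, observations):
    n = len(prior)
    # Initialize with the first observation.
    m = [prior[j] * obs_model[j][observations[0]] for j in range(n)]
    back = []
    for o in observations[1:]:
        prev, nxt = [], []
        for j in range(n):
            best = max(range(n), key=lambda i: m[i] * transition[i][j])
            prev.append(best)                  # previous ML state for j
            nxt.append(m[best] * transition[best][j] * obs_model[j][o])
        back.append(prev)
        m = nxt
    # Trace back from the best final state.
    path = [max(range(n), key=lambda j: m[j])]
    for prev in reversed(back):
        path.append(prev[path[-1]])
    return list(reversed(path))

# state 0 = rain, state 1 = no rain; observation 0 = umbrella seen.
T = [[0.7, 0.3], [0.3, 0.7]]
O = [[0.9, 0.1], [0.2, 0.8]]
print(viterbi([0.5, 0.5], T, O, [0, 0, 1, 0, 0]))  # [0, 0, 1, 0, 0]
```

For the observation sequence umbrella, umbrella, no umbrella, umbrella, umbrella, the most likely path switches to "no rain" exactly on the middle day.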
APPLICATIONS OF HMMS IN NLP
Speech recognition:
- Hidden: phones (e.g., ah, eh, ee, th, r)
- Observed: noisy acoustic features (produced by signal processing)
PHONE OBSERVATION MODELS
[Diagram: Phone_t -> signal processing -> acoustic features Features_t, e.g. (24, 13, 3, 59)]
The model is defined to be robust over variations in accent, speed, pitch, and noise.
PHONE TRANSITION MODELS
[Diagram: Phone_t -> Phone_t+1, with Phone_t emitting Features_t]
Good models will capture (among other things): pronunciation of words, subphone structure, and coarticulation effects.
Triphone models = order-3 Markov chain.
WORD SEGMENTATION
Words run together when pronounced.
N-gram models: unigrams P(wi), bigrams P(wi|wi-1), trigrams P(wi|wi-1, wi-2)
Random 20-word samples from R&N using n-gram models:
- "Logical are as confusion a may right tries agent goal the was diesel more object then information-gathering search is"
- "Planning purely diagnostic expert systems are very similar computational approach would be represented compactly using tic tac toe a predicate"
- "Planning and scheduling are integrated the success of naive bayes model is just a possible prior source by that time"
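A toy bigram sampler along these lines can be written in a few lines; the corpus and every name here are invented for illustration:

```python
# Train a bigram model P(w_i | w_{i-1}) by counting, then sample a
# word sequence from it.

import random
from collections import defaultdict

def train_bigrams(words):
    counts = defaultdict(lambda: defaultdict(int))
    for prev, cur in zip(words, words[1:]):
        counts[prev][cur] += 1
    return counts

def sample_next(counts, prev, rng):
    choices = list(counts[prev].items())
    return rng.choices([w for w, _ in choices],
                       weights=[c for _, c in choices])[0]

corpus = "the agent plans the search and the agent acts".split()
model = train_bigrams(corpus)
rng = random.Random(0)
w = "the"
out = [w]
for _ in range(5):
    if not model[w]:          # dead end: no observed successor
        break
    w = sample_next(model, w, rng)
    out.append(w)
print(" ".join(out))          # locally plausible, globally meaningless
```

Like the R&N samples above, every adjacent pair is plausible but the whole sequence need not make sense.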
TRICKS TO IMPROVE RECOGNITION
- Narrow the number of variables: digits, yes/no, phone tree
- Train with real user data
Real story: "Yes ma'am"
KALMAN FILTERING
In a nutshell:
- Efficient filtering in continuous state spaces
- Gaussian transition and observation models
- Ubiquitous for tracking with noisy sensors, e.g. radar, GPS, cameras
HIDDEN MARKOV MODEL FOR ROBOT LOCALIZATION
Use observations to get a better idea of where the robot is at time t.
[Diagram: hidden state variables X0 -> X1 -> X2 -> X3 with observed variables z1, z2, z3]
Predict - observe - predict - observe...
LINEAR GAUSSIAN TRANSITION MODEL
Consider position and velocity xt, vt and time step h.
Without noise:
xt+1 = xt + h vt
vt+1 = vt
With Gaussian noise of standard deviation σ1:
P(xt+1|xt) ∝ exp(-(xt+1 - (xt + h vt))² / (2 σ1²))
i.e., xt+1 ~ N(xt + h vt, σ1)
LINEAR GAUSSIAN TRANSITION MODEL
If the prior on position is Gaussian, the distribution after the transition is also Gaussian: the mean shifts by vh and the variances add.
N(μ, σ²) becomes N(μ + vh, σ² + σ1²)
LINEAR GAUSSIAN OBSERVATION MODEL
Position observation zt with Gaussian noise of standard deviation σ2:
zt ~ N(xt, σ2)
LINEAR GAUSSIAN OBSERVATION MODEL
If the prior on position is Gaussian, then the posterior is also Gaussian:
μ ← (σ² z + σ2² μ) / (σ² + σ2²)
σ² ← σ² σ2² / (σ² + σ2²)
[Figure: position prior, observation probability, and posterior probability]
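The 1D updates above can be sketched directly. The helper names are hypothetical, and note that variances (not standard deviations) are passed around:

```python
# 1D Kalman updates: a transition that shifts the mean by v*h and adds
# variance var1, then an observation update that combines prior
# N(mu, var) with a measurement z of variance var2.

def predict_1d(mu, var, v, h, var1):
    return mu + v * h, var + var1

def update_1d(mu, var, z, var2):
    new_mu = (var * z + var2 * mu) / (var + var2)   # precision-weighted mean
    new_var = var * var2 / (var + var2)             # always shrinks
    return new_mu, new_var

mu, var = 0.0, 1.0
mu, var = predict_1d(mu, var, v=1.0, h=1.0, var1=0.5)   # -> N(1.0, 1.5)
mu, var = update_1d(mu, var, z=2.0, var2=1.5)           # equal variances: mean halfway
print(mu, var)                                          # 1.5 0.75
```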
MULTIVARIATE CASE
Transition matrix F with covariance Σx; observation matrix H with covariance Σz.
μt+1 = F μt + Kt+1 (zt+1 - H F μt)
Σt+1 = (I - Kt+1 H)(F Σt Fᵀ + Σx)
where
Kt+1 = (F Σt Fᵀ + Σx) Hᵀ (H (F Σt Fᵀ + Σx) Hᵀ + Σz)⁻¹
Got that memorized?
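One multivariate step can be written out with small pure-Python matrix helpers. The 2D position-velocity example and all numbers are hypothetical, and the hand-coded 2x2 inverse restricts this sketch to 2D observations:

```python
# One multivariate Kalman filter step: predict, compute the gain K,
# then update the mean and covariance.

def mat_mul(A, B):
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]

def mat_add(A, B):
    return [[a + b for a, b in zip(ra, rb)] for ra, rb in zip(A, B)]

def mat_sub(A, B):
    return [[a - b for a, b in zip(ra, rb)] for ra, rb in zip(A, B)]

def transpose(A):
    return [list(r) for r in zip(*A)]

def inv_2x2(A):
    (a, b), (c, d) = A
    det = a * d - b * c
    return [[d / det, -b / det], [-c / det, a / det]]

def kalman_step(mu, Sigma, z, F, Sx, H, Sz):
    # Predict: mu_pred = F mu, P = F Sigma F^T + Sx
    mu_pred = mat_mul(F, mu)
    P = mat_add(mat_mul(mat_mul(F, Sigma), transpose(F)), Sx)
    # Gain: K = P H^T (H P H^T + Sz)^-1
    S = mat_add(mat_mul(mat_mul(H, P), transpose(H)), Sz)
    K = mat_mul(mat_mul(P, transpose(H)), inv_2x2(S))
    # Update: mu' = mu_pred + K (z - H mu_pred),  Sigma' = (I - K H) P
    innov = mat_sub(z, mat_mul(H, mu_pred))
    mu_new = mat_add(mu_pred, mat_mul(K, innov))
    KH = mat_mul(K, H)
    I = [[1.0 if i == j else 0.0 for j in range(len(KH))] for i in range(len(KH))]
    return mu_new, mat_mul(mat_sub(I, KH), P)

F = [[1.0, 1.0], [0.0, 1.0]]          # position-velocity model, h = 1
Sx = [[0.1, 0.0], [0.0, 0.1]]
H = [[1.0, 0.0], [0.0, 1.0]]          # observe both components
Sz = [[0.5, 0.0], [0.0, 0.5]]
mu, Sigma = [[0.0], [1.0]], [[1.0, 0.0], [0.0, 1.0]]
mu, Sigma = kalman_step(mu, Sigma, [[1.2], [0.9]], F, Sx, H, Sz)
print(mu)   # pulled from the prediction (1, 1) toward the measurement (1.2, 0.9)
```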
PROPERTIES OF KALMAN FILTER
- Optimal Bayesian estimate for linear Gaussian transition/observation models
- Needs estimates of the covariances... model identification is necessary
- Extends to nonlinear systems:
  - Extended Kalman Filter: linearize the models
  - Unscented Kalman Filter: pass points through the nonlinear model to reconstruct a Gaussian
  Both work as long as the systems aren't too nonlinear.
NON-GAUSSIAN DISTRIBUTIONS
Gaussian distributions are a single "lump".
[Figure: a multimodal distribution vs. the unimodal Kalman filter estimate]
NON-GAUSSIAN DISTRIBUTIONS
Integrating continuous and discrete states: splitting on a binary choice.
[Figure: a distribution splitting into "up" and "down" branches]
EXAMPLE: FAILURE DETECTION
Consider a battery meter sensor:
- Battery = true level of battery
- BMeter = sensor reading
Failure modes:
- Transient failure: the sensor sends garbage at time t, e.g. 5555500555...
- Persistent failure: the sensor is broken and sends garbage forever, e.g. 5555500000...
DYNAMIC BAYESIAN NETWORK
[Diagram: Battery_t-1 -> Battery_t, with Battery_t -> BMeter_t]
BMeter_t ~ N(Battery_t, σ)
(Think of this structure "unrolled" forever...)
DYNAMIC BAYESIAN NETWORK
[Diagram: Battery_t-1 -> Battery_t -> BMeter_t]
Transient failure model:
BMeter_t ~ N(Battery_t, σ)
P(BMeter_t = 0 | Battery_t = 5) = 0.03
RESULTS ON TRANSIENT FAILURE
[Plot: E(Battery_t) over time as a transient failure occurs; curves without and with the transient failure model; the meter reads 55555005555...]
RESULTS ON PERSISTENT FAILURE
[Plot: E(Battery_t) over time as a persistent failure occurs, using only the transient failure model; the meter reads 5555500000...]
PERSISTENT FAILURE MODEL
[Diagram: Battery_t-1 -> Battery_t -> BMeter_t, plus Broken_t-1 -> Broken_t with Broken_t -> BMeter_t]
BMeter_t ~ N(Battery_t, σ)
P(BMeter_t = 0 | Battery_t = 5) = 0.03
P(BMeter_t = 0 | Broken_t) = 1
This is an example of a Dynamic Bayesian Network (DBN).
RESULTS ON PERSISTENT FAILURE
[Plot: E(Battery_t) over time as a persistent failure occurs; curves with the transient model and with the persistent failure model; the meter reads 5555500000...]
HOW TO PERFORM INFERENCE ON A DBN?
Exact inference on the "unrolled" BN:
- Variable elimination: eliminate old time steps
- But after a few time steps, all variables in the state space become dependent! The sparsity structure is lost.
Approximate inference: particle filtering
PARTICLE FILTERING (AKA SEQUENTIAL MONTE CARLO)
- Represent distributions as a set of particles
- Applicable to non-Gaussian, high-dimensional distributions
- Convenient implementations
- Widely used in vision and robotics
PARTICLE REPRESENTATION
Bel(xt) = {(wk, xk)}
- wk are weights, xk are state hypotheses
- Weights sum to 1
- Approximates the underlying distribution
[Figure: a distribution approximated by particles, with a weighted resampling step]
PARTICLE FILTERING
Represent the distribution at time t as a set of N "particles" St^1, ..., St^N.
Repeat for t = 0, 1, 2, ...:
- Sample S[i] from P(Xt+1 | Xt = St^i) for all i
- Compute weights w[i] = P(e | Xt+1 = S[i]) for all i
- Sample St+1^i from S[.] according to the weights w[.]
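The loop above can be sketched on a toy version of the failure-detection example. Only the 0.03 transient-failure probability comes from the slides; the other numbers and all names are assumed for illustration:

```python
# Particle filter on a one-bit state: Broken in {0, 1}.
# Sample-weight-resample, exactly as in the loop above.

import random

def particle_filter_step(particles, transition_sample, obs_weight, obs, rng):
    # 1. Sampling step: propagate each particle through the transition model.
    proposed = [transition_sample(p, rng) for p in particles]
    # 2. Compute weights from the observation likelihood.
    weights = [obs_weight(p, obs) for p in proposed]
    # 3. Weighted resampling.
    return rng.choices(proposed, weights=weights, k=len(particles))

def transition_sample(broken, rng):
    if broken:
        return 1                                  # once broken, stays broken
    return 1 if rng.random() < 0.01 else 0        # small chance of breaking

def obs_weight(broken, meter):
    if broken:
        return 1.0 if meter == 0 else 0.0         # broken sensor always reads 0
    return 0.03 if meter == 0 else 0.97           # transient-failure model

rng = random.Random(0)
particles = [0] * 1000
for meter in [5, 5, 0, 0, 0]:     # persistent failure after two good readings
    particles = particle_filter_step(particles, transition_sample,
                                     obs_weight, meter, rng)
print(sum(particles) / len(particles))   # estimated P(Broken): close to 1
```

After three consecutive zero readings, nearly all surviving particles have Broken = 1, mirroring the persistent-failure result above.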
BATTERY EXAMPLE
[Diagram sequence: the DBN Battery_t-1 -> Battery_t -> BMeter_t with Broken_t-1 -> Broken_t, particles drawn at each step]
Walking through the particle filter:
1. Sampling step: propagate each particle through the transition model.
2. Suppose we now observe BMeter = 0. For each sample, P(BMeter = 0 | sample) is 0.03 if the sensor is working and 1 if it is broken.
3. Compute weights (drawn as particle size).
4. Weighted resampling.
5. Sampling step again.
6. Now observe BMeter_t = 5 and compute weights; broken-sensor samples get weight 0.
7. Weighted resample.
APPLICATIONS OF PARTICLE FILTERING IN ROBOTICS
Simultaneous Localization and Mapping (SLAM):
- Observations: laser rangefinder
- State variables: position, walls
SIMULTANEOUS LOCALIZATION AND MAPPING (SLAM)
Mobile robots:
- Odometry: locally accurate, but drifts significantly over time
- Vision / ladar / sonar: inaccurate locally, but tied to a global reference frame
Combine the two.
State: (robot pose, map)
Observations: (sensor input)
COUPLE OF PLUGS
- CSCI B553
- CSCI B659: Principles of Intelligent Robot Motion, http://cs.indiana.edu/classes/b659-hauserk
- CSCI B657: Computer Vision, David Crandall / Chen Yu
NEXT TIME
Learning distributions from data. Read R&N 20.1-3.
MLE: VITERBI ALGORITHM
Recursive computation of the maximum likelihood over paths to each xt in Val(Xt):
mt(Xt) = max_{x1:t-1} P(x1,...,xt-1, Xt | o1:t)
       = α P(ot|Xt) max_{xt-1} P(Xt|xt-1) mt-1(xt-1)
The previous ML state is argmax_{xt-1} P(Xt|xt-1) mt-1(xt-1).
Does this sound familiar?
MLE: VITERBI ALGORITHM
Do the "logarithm trick":
log mt(Xt) = log α P(ot|Xt) + max_{xt-1} [ log P(Xt|xt-1) + log mt-1(xt-1) ]
View:
- log α P(ot|Xt) as a reward
- log P(Xt|xt-1) as a cost
- log mt(Xt) as a value function
This is a Bellman equation.