
Sharing Features among Dynamical Systems with Beta Processes

Emily B. Fox
Electrical Engineering & Computer Science, Massachusetts Institute of Technology
[email protected]

Erik B. Sudderth
Computer Science, Brown University
[email protected]

Michael I. Jordan
Electrical Engineering & Computer Science and Statistics, University of California, Berkeley
[email protected]

Alan S. Willsky
Electrical Engineering & Computer Science, Massachusetts Institute of Technology
[email protected]

Abstract

We propose a Bayesian nonparametric approach to the problem of modeling related time series. Using a beta process prior, our approach is based on the discovery of a set of latent dynamical behaviors that are shared among multiple time series. The size of the set and the sharing pattern are both inferred from data. We develop an efficient Markov chain Monte Carlo inference method that is based on the Indian buffet process representation of the predictive distribution of the beta process. In particular, our approach uses the sum-product algorithm to efficiently compute Metropolis-Hastings acceptance probabilities, and explores new dynamical behaviors via birth/death proposals. We validate our sampling algorithm using several synthetic datasets, and also demonstrate promising results on unsupervised segmentation of visual motion capture data.

1 Introduction

In many applications, one would like to discover and model dynamical behaviors which are shared among several related time series. For example, consider video or motion capture data depicting multiple people performing a number of related tasks. By jointly modeling such sequences, we may more robustly estimate representative dynamic models, and also uncover interesting relationships among activities. We specifically focus on time series where behaviors can be individually modeled via temporally independent or linear dynamical systems, and where transitions between behaviors are approximately Markovian. Examples of such Markov jump processes include the hidden Markov model (HMM), switching vector autoregressive (VAR) process, and switching linear dynamical system (SLDS). These models have proven useful in such diverse fields as speech recognition, econometrics, remote target tracking, and human motion capture. Our approach envisions a large library of behaviors, and each time series or object exhibits a subset of these behaviors. We then seek a framework for discovering the set of dynamic behaviors that each object exhibits. We particularly aim to allow flexibility in the number of total and sequence-specific behaviors, and encourage objects to share similar subsets of the large set of possible behaviors.

One can represent the set of behaviors an object exhibits via an associated list of features. A standard featural representation for N objects, with a library of K features, employs an N × K binary matrix F = {f_{ik}}. Setting f_{ik} = 1 implies that object i exhibits feature k. Our desiderata motivate a Bayesian nonparametric approach based on the beta process [10, 22], allowing for infinitely many potential features.


Integrating over the latent beta process induces a predictive distribution on features known as the Indian buffet process (IBP) [9]. Given a feature set sampled from the IBP, our model reduces to a collection of Bayesian HMMs (or SLDS) with partially shared parameters.

Other recent approaches to Bayesian nonparametric representations of time series include the HDP-HMM [2, 4, 5, 21] and the infinite factorial HMM [24]. These models are quite different from our framework: the HDP-HMM does not select a subset of behaviors for a given time series, but assumes that all time series share the same set of behaviors and switch among them in exactly the same manner. The infinite factorial HMM models a single time series with emissions dependent on a potentially infinite-dimensional feature that evolves with independent Markov dynamics. Our work focuses on modeling multiple time series and on capturing dynamical modes that are shared among the series.

Our results are obtained via an efficient and exact Markov chain Monte Carlo (MCMC) inference algorithm. In particular, we exploit the finite dynamical system induced by a fixed set of features to efficiently compute acceptance probabilities, and reversible jump birth and death proposals to explore new features. We validate our sampling algorithm using several synthetic datasets, and also demonstrate promising unsupervised segmentation of data from the CMU motion capture database [23].

2 Binary Features and Beta Processes

The beta process is a completely random measure [12]: draws are discrete with probability one, and realizations on disjoint sets are independent random variables. Consider a probability space Θ, and let B_0 denote a finite base measure on Θ with total mass B_0(Θ) = α. Assuming B_0 is absolutely continuous, we define the following Lévy measure on the product space [0, 1] × Θ:

ν(dω, dθ) = c ω^{−1} (1 − ω)^{c−1} dω B_0(dθ).  (1)

Here, c > 0 is a concentration parameter; we denote such a beta process by BP(c, B_0). A draw B ∼ BP(c, B_0) is then described by

B = \sum_{k=1}^{∞} ω_k δ_{θ_k},  (2)

where (ω_1, θ_1), (ω_2, θ_2), . . . are the set of atoms in a realization of a nonhomogeneous Poisson process with rate measure ν. If there are atoms in B_0, then these are treated separately; see [22]. The beta process is conjugate to a class of Bernoulli processes [22], denoted by BeP(B), which provide our sought-for featural representation. A realization X_i ∼ BeP(B), with B an atomic measure, is a collection of unit-mass atoms on Θ located at some subset of the atoms in B. In particular, f_{ik} ∼ Bernoulli(ω_k) is sampled independently for each atom θ_k in Eq. (2), and then X_i = \sum_k f_{ik} δ_{θ_k}.

In many applications, we interpret the atom locations θ_k as a shared set of global features. A Bernoulli process realization X_i then determines the subset of features allocated to object i:

B | B_0, c ∼ BP(c, B_0)
X_i | B ∼ BeP(B),  i = 1, . . . , N.  (3)

Because beta process priors are conjugate to the Bernoulli process [22], the posterior distribution given N samples X_i ∼ BeP(B) is a beta process with updated parameters:

B | X_1, . . . , X_N, B_0, c ∼ BP(c + N, \frac{c}{c+N} B_0 + \sum_{k=1}^{K_+} \frac{m_k}{c+N} δ_{θ_k}).  (4)

Here, m_k denotes the number of objects X_i which select the kth feature θ_k. For simplicity, we have reordered the feature indices to list the K_+ features used by at least one object first.

Computationally, Bernoulli process realizations X_i are often summarized by an infinite vector of binary indicator variables f_i = [f_{i1}, f_{i2}, . . .], where f_{ik} = 1 if and only if object i exhibits feature k. As shown by Thibaux and Jordan [22], marginalizing over the beta process measure B, and taking c = 1, provides a predictive distribution on indicators known as the Indian buffet process (IBP) of Griffiths and Ghahramani [9]. The IBP is a culinary metaphor inspired by the Chinese restaurant process, which is itself the predictive distribution on partitions induced by the Dirichlet process [21]. The Indian buffet consists of an infinitely long buffet line of dishes, or features. The first arriving customer, or object, chooses Poisson(α) dishes. Each subsequent customer i selects a previously tasted dish k with probability m_k/i, proportional to the number of previous customers m_k to sample it, and also samples Poisson(α/i) new dishes.
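To make the buffet metaphor concrete, the following is a minimal Python sketch of the IBP predictive distribution just described; the function name and interface are our own, not the paper's.

```python
import numpy as np

def sample_ibp(num_objects, alpha, rng=np.random.default_rng(0)):
    """Draw a binary feature matrix F from the IBP predictive distribution."""
    dish_counts = []                            # m_k: customers per dish so far
    rows = []
    for i in range(1, num_objects + 1):
        # A previously tasted dish k is chosen with probability m_k / i ...
        row = [rng.random() < m / i for m in dish_counts]
        # ... and Poisson(alpha / i) brand-new dishes are also sampled.
        n_new = rng.poisson(alpha / i)
        row += [True] * n_new
        dish_counts = [m + f for m, f in zip(dish_counts, row)] + [1] * n_new
        rows.append(row)
    F = np.zeros((num_objects, len(dish_counts)), dtype=int)
    for i, row in enumerate(rows):
        F[i, :len(row)] = row
    return F

F = sample_ibp(num_objects=5, alpha=10.0)       # e.g., features per object: F.sum(axis=1)
```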


3 Describing Multiple Time Series with Beta Processes

Assume we have a set of N objects, each of whose dynamics is described by a switching vector autoregressive (VAR) process, with switches occurring according to a discrete-time Markov process. Such autoregressive HMMs (AR-HMMs) provide a simpler, but often equally effective, alternative to SLDS [17]. Let y_t^{(i)} represent the observation vector of the ith object at time t, and z_t^{(i)} the latent dynamical mode. Assuming an order r switching VAR process, denoted by VAR(r), we have

z_t^{(i)} ∼ π^{(i)}_{z^{(i)}_{t−1}}  (5)

y_t^{(i)} = \sum_{j=1}^{r} A_{j,z^{(i)}_t} y^{(i)}_{t−j} + e^{(i)}_t(z^{(i)}_t) ≜ A_{z^{(i)}_t} ỹ^{(i)}_t + e^{(i)}_t(z^{(i)}_t),  (6)

where e^{(i)}_t(k) ∼ N(0, Σ_k), A_k = [A_{1,k} . . . A_{r,k}], and ỹ^{(i)}_t = [y^{(i)T}_{t−1} . . . y^{(i)T}_{t−r}]^T. The standard HMM with Gaussian emissions arises as a special case of this model when A_k = 0 for all k. We refer to these VAR processes, with parameters θ_k = {A_k, Σ_k}, as behaviors, and use a beta process prior to couple the dynamic behaviors exhibited by different objects or sequences.
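As a sanity check on the model class, here is a minimal simulation of the switching VAR(r) process of Eqs. (5)-(6); the function name and parameterization are ours, not the paper's.

```python
import numpy as np

def simulate_ar_hmm(pi, A, Sigma, T, r=1, rng=np.random.default_rng(1)):
    """Simulate Eqs. (5)-(6) with d-dimensional observations.

    pi:    (K, K) Markov transition matrix over modes
    A:     (K, d, d*r) stacked lag matrices A_k = [A_{1,k} ... A_{r,k}]
    Sigma: (K, d, d) mode-specific noise covariances
    """
    K, d = A.shape[0], A.shape[1]
    z = np.zeros(T, dtype=int)
    y = np.zeros((T, d))
    for t in range(T):
        if t > 0:
            z[t] = rng.choice(K, p=pi[z[t - 1]])            # Eq. (5)
        # Lagged vector y~_t = [y_{t-1}^T ... y_{t-r}^T]^T (zeros before t = 0).
        lags = [y[t - j] if t - j >= 0 else np.zeros(d) for j in range(1, r + 1)]
        y_tilde = np.concatenate(lags)
        e = rng.multivariate_normal(np.zeros(d), Sigma[z[t]])
        y[t] = A[z[t]] @ y_tilde + e                        # Eq. (6)
    return z, y
```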

As in Sec. 2, let f_i be a vector of binary indicator variables, where f_{ik} denotes whether object i exhibits behavior k for some t ∈ {1, . . . , T_i}. Given f_i, we define a feature-constrained transition distribution π^{(i)} = {π^{(i)}_k}, which governs the ith object's Markov transitions among its set of dynamic behaviors. In particular, motivated by the fact that a Dirichlet-distributed probability mass function can be interpreted as a normalized collection of gamma-distributed random variables, for each object i we define a doubly infinite collection of random variables:

η^{(i)}_{jk} | γ, κ ∼ Gamma(γ + κδ(j, k), 1),  (7)

where δ(j, k) indicates the Kronecker delta function. We denote this collection of transition variables by η^{(i)}, and use them to define object-specific, feature-constrained transition distributions:

π^{(i)}_j = \frac{[η^{(i)}_{j1}  η^{(i)}_{j2}  . . .] ⊗ f_i}{\sum_{k | f_{ik}=1} η^{(i)}_{jk}}.  (8)

Here, ⊗ denotes the element-wise vector product. This construction defines π^{(i)}_j over the full set of positive integers, but assigns positive mass only at indices k where f_{ik} = 1.

The preceding generative process can be equivalently represented via a sample π̃^{(i)}_j from a finite Dirichlet distribution of dimension K_i = \sum_k f_{ik}, containing the non-zero entries of π^{(i)}_j:

π̃^{(i)}_j | f_i, γ, κ ∼ Dir([γ, . . . , γ, γ + κ, γ, . . . , γ]).  (9)

The κ hyperparameter places extra expected mass on the component of π^{(i)}_j corresponding to a self-transition π^{(i)}_{jj}, analogously to the sticky hyperparameter of Fox et al. [4]. We refer to this model, which is summarized in Fig. 1, as the beta process autoregressive HMM (BP-AR-HMM).
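The mask-and-normalize construction of Eqs. (7)-(8) is easy to mirror in code; this sketch (our own naming) samples the gamma transition variables and forms the feature-constrained rows.

```python
import numpy as np

def feature_constrained_transitions(f_i, gamma, kappa, rng=np.random.default_rng(2)):
    """Sample pi^(i) per Eqs. (7)-(8): gamma draws, masked by f_i, row-normalized.

    f_i: binary feature vector of length K for object i.
    """
    f_i = np.asarray(f_i, dtype=float)
    K = f_i.size
    eta = rng.gamma(gamma + kappa * np.eye(K))   # Eq. (7), unit scale
    masked = eta * f_i[None, :]                  # Eq. (8): elementwise product with f_i
    return masked / masked.sum(axis=1, keepdims=True)

pi = feature_constrained_transitions(f_i=[1, 0, 1, 1], gamma=1.0, kappa=10.0)
```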

4 MCMC Methods for Posterior Inference

We have developed an MCMC method which alternates between resampling binary feature assignments given observations and dynamical parameters, and dynamical parameters given observations and features. The sampler interleaves Metropolis-Hastings (MH) and Gibbs sampling updates, which are sometimes simplified by appropriate auxiliary variables. We leverage the fact that fixed feature assignments instantiate a set of finite AR-HMMs, for which dynamic programming can be used to efficiently compute marginal likelihoods. Our novel approach to resampling the potentially infinite set of object-specific features employs incremental "birth" and "death" proposals, improving on previous exact samplers for IBP models with non-conjugate likelihoods.

4.1 Sampling binary feature assignments

Let F^{−ik} denote the set of all binary feature indicators excluding f_{ik}, and K^{−i}_+ be the number of behaviors currently instantiated by objects other than i.



Figure 1: Graphical model of the BP-AR-HMM. The beta process distributed measure B | B_0 ∼ BP(1, B_0) is represented by its masses ω_k and locations θ_k, as in Eq. (2). The features are then conditionally independent draws f_{ik} | ω_k ∼ Bernoulli(ω_k), and are used to define feature-constrained transition distributions π^{(i)}_j | f_i, γ, κ ∼ Dir([γ, . . . , γ, γ + κ, γ, . . . ] ⊗ f_i). The switching VAR dynamics are as in Eq. (6).

For notational simplicity, we assume that these behaviors are indexed by {1, . . . , K^{−i}_+}. Given the ith object's observation sequence y^{(i)}_{1:T_i}, transition variables η^{(i)} = η^{(i)}_{1:K^{−i}_+, 1:K^{−i}_+}, and shared dynamic parameters θ_{1:K^{−i}_+}, feature indicators f_{ik} for currently used features k ∈ {1, . . . , K^{−i}_+} have the following posterior distribution:

p(f_{ik} | F^{−ik}, y^{(i)}_{1:T_i}, η^{(i)}, θ_{1:K^{−i}_+}, α) ∝ p(f_{ik} | F^{−ik}, α) p(y^{(i)}_{1:T_i} | f_i, η^{(i)}, θ_{1:K^{−i}_+}).  (10)

Here, the IBP prior implies that p(f_{ik} = 1 | F^{−ik}, α) = m^{−i}_k / N, where m^{−i}_k denotes the number of objects other than object i that exhibit behavior k. In evaluating this expression, we have exploited the exchangeability of the IBP [9], which follows directly from the beta process construction [22].

For binary random variables, MH proposals can mix faster [6] and have greater statistical efficiency [14] than standard Gibbs samplers. To update f_{ik} given F^{−ik}, we thus use the posterior of Eq. (10) to evaluate a MH proposal which flips f_{ik} to the complement f̄ of its current value f:

f_{ik} ∼ ρ(f̄ | f) δ(f_{ik}, f̄) + (1 − ρ(f̄ | f)) δ(f_{ik}, f),

ρ(f̄ | f) = min{ \frac{p(f_{ik} = f̄ | F^{−ik}, y^{(i)}_{1:T_i}, η^{(i)}, θ_{1:K^{−i}_+}, α)}{p(f_{ik} = f | F^{−ik}, y^{(i)}_{1:T_i}, η^{(i)}, θ_{1:K^{−i}_+}, α)}, 1 }.  (11)

To compute likelihoods, we combine f_i and η^{(i)} to construct feature-constrained transition distributions π^{(i)}_j as in Eq. (8), and apply the sum-product message passing algorithm [19].
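The marginal likelihoods in Eq. (11) come from the sum-product (forward) recursion over the finite AR-HMM induced by f_i; the flip ratio then multiplies the likelihood ratio by the IBP prior odds. A minimal log-space sketch, with our own naming and per-step log-likelihoods assumed precomputed:

```python
import numpy as np

def log_marginal_likelihood(log_lik, pi):
    """Forward (sum-product) recursion for log p(y_{1:T} | pi, theta).

    log_lik: (T, K) entries log p(y_t | z_t = k, theta_k)
    pi:      (K, K) feature-constrained transition matrix (rows sum to 1)
    Assumes a uniform initial-mode distribution, a detail not fixed here.
    """
    T, K = log_lik.shape
    log_alpha = -np.log(K) + log_lik[0]
    for t in range(1, T):
        m = log_alpha.max()
        # alpha_t(k) = [sum_j alpha_{t-1}(j) pi[j, k]] * lik_t(k), in log space
        log_alpha = np.log(np.exp(log_alpha - m) @ pi) + m + log_lik[t]
    m = log_alpha.max()
    return m + np.log(np.exp(log_alpha - m).sum())
```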

An alternative approach is needed to resample the Poisson(α/N) "unique" features associated only with object i. Let K_+ = K^{−i}_+ + n_i, where n_i is the number of features unique to object i, and define f_{−i} = f_{i,1:K^{−i}_+} and f_{+i} = f_{i,K^{−i}_++1:K_+}. The posterior distribution over n_i is then given by

p(n_i | f_i, y^{(i)}_{1:T_i}, η^{(i)}, θ_{1:K^{−i}_+}, α) ∝ \frac{(α/N)^{n_i} e^{−α/N}}{n_i!} ∫∫ p(y^{(i)}_{1:T_i} | f_{−i}, f_{+i} = 1, η^{(i)}, η_+, θ_{1:K^{−i}_+}, θ_+) dB_0(θ_+) dH(η_+),  (12)

where H is the gamma prior on transition variables, θ_+ = θ_{K^{−i}_++1:K_+} are the parameters of unique features, and η_+ are transition parameters η^{(i)}_{jk} to or from unique features j, k ∈ {K^{−i}_+ + 1 : K_+}. Exact evaluation of this integral is intractable due to dependencies induced by the AR-HMMs.

One early approach to approximate Gibbs sampling in non-conjugate IBP models relies on a finite truncation [7]. Meeds et al. [15] instead consider independent Metropolis proposals which replace the existing unique features by n′_i ∼ Poisson(α/N) new features, with corresponding parameters θ′_+ drawn from the prior. For high-dimensional models like that considered in this paper, however, moves proposing large numbers of unique features have low acceptance rates. Thus, mixing rates are greatly affected by the beta process hyperparameter α. We instead develop a "birth and death" reversible jump MCMC (RJMCMC) sampler [8], which proposes to either add a single new feature,


or eliminate one of the existing features in f_{+i}. Some previous work has applied RJMCMC to finite binary feature models [3, 27], but not to the IBP. Our proposal distribution factors as follows:

q(f′_{+i}, θ′_+, η′_+ | f_{+i}, θ_+, η_+) = q_f(f′_{+i} | f_{+i}) q_θ(θ′_+ | f′_{+i}, f_{+i}, θ_+) q_η(η′_+ | f′_{+i}, f_{+i}, η_+).  (13)

Let n_i = \sum_k f_{+ik}. The feature proposal q_f(· | ·) encodes the probabilities of birth and death moves: a new feature is created with probability 0.5, and each of the n_i existing features is deleted with probability 0.5/n_i. For parameters, we define our proposal using the generative model:

q_θ(θ′_+ | f′_{+i}, f_{+i}, θ_+) = { b_0(θ′_{+,n_i+1}) \prod_{k=1}^{n_i} δ_{θ_{+k}}(θ′_{+k}),  birth of feature n_i + 1;
                                    \prod_{k ≠ ℓ} δ_{θ_{+k}}(θ′_{+k}),  death of feature ℓ,  (14)

where b_0 is the density associated with α^{−1} B_0. The distribution q_η(· | ·) is defined similarly, but using the gamma prior on transition variables of Eq. (7). The MH acceptance probability is then

ρ(f′_{+i}, θ′_+, η′_+ | f_{+i}, θ_+, η_+) = min{ r(f′_{+i}, θ′_+, η′_+ | f_{+i}, θ_+, η_+), 1 }.  (15)

Canceling parameter proposals with corresponding prior terms, the acceptance ratio r(· | ·) equals

\frac{p(y^{(i)}_{1:T_i} | [f_{−i} f′_{+i}], θ_{1:K^{−i}_+}, θ′_+, η^{(i)}, η′_+) \, Poisson(n′_i | α/N) \, q_f(f_{+i} | f′_{+i})}{p(y^{(i)}_{1:T_i} | [f_{−i} f_{+i}], θ_{1:K^{−i}_+}, θ_+, η^{(i)}, η_+) \, Poisson(n_i | α/N) \, q_f(f′_{+i} | f_{+i})},  (16)

with n′_i = \sum_k f′_{+ik}. Because our birth and death proposals do not modify the values of existing parameters, the Jacobian term normally arising in RJMCMC algorithms simply equals one.
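A schematic of one birth/death move is sketched below, under our own simplified interface: log_lik_fn and prior_draw are hypothetical helpers, and we assume n_i ≥ 1 so that both move types are available.

```python
import numpy as np

def birth_death_move(f_plus, log_lik_fn, prior_draw, alpha, N, rng=np.random.default_rng(3)):
    """One RJMCMC birth/death step on object i's unique features (Eqs. 13-16).

    f_plus:     list of unique features, each holding its (theta_+, eta_+) draw
    log_lik_fn: log p(y^(i) | features) for a candidate unique-feature list
    prior_draw: samples a new feature's parameters from b_0 and H
    """
    n_i = len(f_plus)                           # assume n_i >= 1 here
    if rng.random() < 0.5:                      # birth with probability 0.5
        proposal = f_plus + [prior_draw(rng)]
        n_new = n_i + 1
        log_qf = np.log(0.5 / n_new) - np.log(0.5)   # reverse death / forward birth
    else:                                       # death: delete a feature uniformly
        ell = rng.integers(n_i)
        proposal = f_plus[:ell] + f_plus[ell + 1:]
        n_new = n_i - 1
        log_qf = np.log(0.5) - np.log(0.5 / n_i)     # reverse birth / forward death
    # Poisson(alpha/N) ratio on the unique-feature count; parameter proposals
    # cancel with their priors (Eq. 16) and the Jacobian equals one.
    log_prior = ((n_new - n_i) * np.log(alpha / N)
                 + np.log(np.arange(1, n_i + 1)).sum()      # + log n_i!
                 - np.log(np.arange(1, n_new + 1)).sum())   # - log n_new!
    log_r = log_lik_fn(proposal) - log_lik_fn(f_plus) + log_prior + log_qf
    return proposal if np.log(rng.random()) < min(log_r, 0.0) else f_plus
```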

4.2 Sampling dynamic parameters and transition variables

Posterior updates to transition variables η^{(i)} and shared dynamic parameters θ_k are greatly simplified if we instantiate the mode sequences z^{(i)}_{1:T_i} for each object i. We treat these mode sequences as auxiliary variables: they are sampled given the current MCMC state, conditioned on when resampling model parameters, and then discarded for subsequent updates of feature assignments f_i.

Given feature-constrained transition distributions π^{(i)} and dynamic parameters {θ_k}, along with the observation sequence y^{(i)}_{1:T_i}, we jointly sample the mode sequence z^{(i)}_{1:T_i} by computing backward messages m_{t+1,t}(z^{(i)}_t) ∝ p(y^{(i)}_{t+1:T_i} | z^{(i)}_t, ỹ^{(i)}_t, π^{(i)}, {θ_k}), and then recursively sampling each z^{(i)}_t:

z^{(i)}_t | z^{(i)}_{t−1}, y^{(i)}_{1:T_i}, π^{(i)}, {θ_k} ∼ π^{(i)}_{z^{(i)}_{t−1}}(z^{(i)}_t) \, N(y^{(i)}_t; A_{z^{(i)}_t} ỹ^{(i)}_t, Σ_{z^{(i)}_t}) \, m_{t+1,t}(z^{(i)}_t).  (17)
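A log-space sketch of this backward-filtering, forward-sampling step (our naming; the initial-mode distribution is assumed uniform, a detail the excerpt does not specify):

```python
import numpy as np

def sample_modes(log_lik, pi, rng=np.random.default_rng(4)):
    """Backward message passing, then forward sampling of z_{1:T} as in Eq. (17).

    log_lik: (T, K) entries log N(y_t; A_k y~_t, Sigma_k)
    pi:      (K, K) feature-constrained transition matrix
    """
    T, K = log_lik.shape
    log_msg = np.zeros((T, K))                 # log m_{t+1,t}; final message = 0
    for t in range(T - 2, -1, -1):
        b = log_lik[t + 1] + log_msg[t + 1]
        m = b.max()
        log_msg[t] = np.log(pi @ np.exp(b - m)) + m
    z = np.zeros(T, dtype=int)
    log_p = log_lik[0] + log_msg[0]            # uniform initial mode (assumption)
    p = np.exp(log_p - log_p.max())
    z[0] = rng.choice(K, p=p / p.sum())
    for t in range(1, T):
        log_p = np.log(pi[z[t - 1]]) + log_lik[t] + log_msg[t]
        p = np.exp(log_p - log_p.max())
        z[t] = rng.choice(K, p=p / p.sum())
    return z
```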

Because Dirichlet priors are conjugate to multinomial observations z^{(i)}_{1:T}, the posterior of π^{(i)}_j is

π^{(i)}_j | f_i, z^{(i)}_{1:T}, γ, κ ∼ Dir([γ + n^{(i)}_{j1}, . . . , γ + n^{(i)}_{j,j−1}, γ + κ + n^{(i)}_{jj}, γ + n^{(i)}_{j,j+1}, . . . ] ⊗ f_i).  (18)

Here, n^{(i)}_{jk} denotes the number of transitions from mode j to k in z^{(i)}_{1:T}. Since the mode sequence z^{(i)}_{1:T} is generated from feature-constrained transition distributions, n^{(i)}_{jk} is zero for any k such that f_{ik} = 0. Thus, to arrive at the posterior of Eq. (18), we only update η^{(i)}_{jk} for instantiated features:

η^{(i)}_{jk} | z^{(i)}_{1:T}, γ, κ ∼ Gamma(γ + κδ(j, k) + n^{(i)}_{jk}, 1),  k ∈ {ℓ | f_{iℓ} = 1}.  (19)

We now turn to posterior updates for dynamic parameters. We place a conjugate matrix-normal inverse-Wishart (MNIW) prior [26] on {A_k, Σ_k}, comprised of an inverse-Wishart prior IW(S_0, n_0) on Σ_k and a matrix-normal prior MN(A_k; M, Σ_k, K) on A_k given Σ_k. We consider the following sufficient statistics based on the sets Y_k = {y^{(i)}_t | z^{(i)}_t = k} and Ỹ_k = {ỹ^{(i)}_t | z^{(i)}_t = k} of observations and lagged observations, respectively, associated with behavior k:

S^{(k)}_{ỹỹ} = \sum_{(t,i)|z^{(i)}_t=k} ỹ^{(i)}_t ỹ^{(i)T}_t + K,    S^{(k)}_{yỹ} = \sum_{(t,i)|z^{(i)}_t=k} y^{(i)}_t ỹ^{(i)T}_t + MK,

S^{(k)}_{yy} = \sum_{(t,i)|z^{(i)}_t=k} y^{(i)}_t y^{(i)T}_t + MKM^T,    S^{(k)}_{y|ỹ} = S^{(k)}_{yy} − S^{(k)}_{yỹ} S^{−(k)}_{ỹỹ} S^{(k)T}_{yỹ}.

Following Fox et al. [5], the posterior can then be shown to equal

A_k | Σ_k, Y_k ∼ MN(A_k; S^{(k)}_{yỹ} S^{−(k)}_{ỹỹ}, Σ_k, S^{(k)}_{ỹỹ}),    Σ_k | Y_k ∼ IW(S^{(k)}_{y|ỹ} + S_0, |Y_k| + n_0).
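These sufficient statistics translate directly into code. This sketch (our own naming, with Y and Ytil as column-stacked observations and lagged observations for mode k) returns the posterior MNIW hyperparameters:

```python
import numpy as np

def mniw_posterior(Y, Ytil, M, K, S0, n0):
    """Sufficient statistics and MNIW posterior hyperparameters for behavior k.

    Y:    (d, n) observations y_t with z_t = k, stacked as columns
    Ytil: (d*r, n) matching lagged observations y~_t
    M, K: matrix-normal prior mean and column precision; S0, n0: IW prior
    """
    S_tt = Ytil @ Ytil.T + K                  # S^(k)_{y~y~}
    S_yt = Y @ Ytil.T + M @ K                 # S^(k)_{y y~}
    S_yy = Y @ Y.T + M @ K @ M.T              # S^(k)_{yy}
    S_tt_inv = np.linalg.inv(S_tt)
    S_cond = S_yy - S_yt @ S_tt_inv @ S_yt.T  # S^(k)_{y|y~}
    # A_k | Sigma_k ~ MN(S_yt S_tt^{-1}, Sigma_k, S_tt); Sigma_k ~ IW(S_cond + S0, n + n0)
    return S_yt @ S_tt_inv, S_tt, S_cond + S0, Y.shape[1] + n0
```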



Figure 2: (a) Observation sequences for each of 5 switching AR(1) time series, colored by true mode sequence and offset for clarity. (b) True feature matrix (top) of the five objects and estimated feature matrix (bottom) averaged over 10,000 MCMC samples taken from 100 trials, using every 10th sample. White indicates active features. The estimated feature matrices are produced from mode sequences mapped to the ground-truth labels according to the minimum Hamming distance metric, and selecting modes with more than 2% of the object's observations.

4.3 Sampling the beta process and Dirichlet transition hyperparameters

We additionally place priors on the Dirichlet hyperparameters γ and κ, as well as the beta process parameter α. Let F = {f_i}. As derived in [9], p(F | α) can be expressed as

p(F | α) ∝ α^{K_+} exp(−α \sum_{n=1}^{N} \frac{1}{n}),  (20)

where, as before, K_+ is the number of unique features activated in F. As in [7], we place a conjugate Gamma(a_α, b_α) prior on α, which leads to the following posterior distribution:

p(α | F, a_α, b_α) ∝ p(F | α) p(α | a_α, b_α) ∝ Gamma(a_α + K_+, b_α + \sum_{n=1}^{N} \frac{1}{n}).  (21)

Transition hyperparameters are assigned similar priors γ ∼ Gamma(a_γ, b_γ), κ ∼ Gamma(a_κ, b_κ). Because the generative process of Eq. (7) is non-conjugate, we rely on MH steps which iteratively resample γ given κ, and κ given γ. Each sub-step uses a gamma proposal distribution q(· | ·) with fixed variance σ^2_γ or σ^2_κ, and mean equal to the current hyperparameter value. To update γ given κ, the acceptance probability is min{r(γ′ | γ), 1}, where r(γ′ | γ) is defined to equal

\frac{p(γ′ | κ, π, F) q(γ | γ′)}{p(γ | κ, π, F) q(γ′ | γ)} = \frac{p(π | γ′, κ, F) p(γ′) q(γ | γ′)}{p(π | γ, κ, F) p(γ) q(γ′ | γ)} = \frac{f(γ′) Γ(ϑ) e^{−γ′ b_γ} γ^{ϑ′−ϑ−a_γ} σ^{2ϑ}_γ}{f(γ) Γ(ϑ′) e^{−γ b_γ} γ′^{ϑ−ϑ′−a_γ} σ^{2ϑ′}_γ}.

Here, ϑ = γ^2/σ^2_γ, ϑ′ = γ′^2/σ^2_γ, and f(γ) = \prod_i \frac{Γ(γ K_i + κ)^{K_i}}{Γ(γ)^{K_i^2 − K_i} Γ(γ + κ)^{K_i}} \prod_{(j,k)=1}^{K_i} π^{(i) \, γ+κδ(k,j)−1}_{kj}. The MH sub-step for resampling κ given γ is similar, but with an appropriately redefined f(κ).
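A sketch of one such MH sub-step for γ, using a gamma proposal with mean at the current value and fixed variance; log_f is a hypothetical callable evaluating log f(γ) as defined above, and all names are our own.

```python
import math
import numpy as np

def resample_gamma(gamma, log_f, a_g, b_g, sigma2, rng=np.random.default_rng(5)):
    """One MH sub-step for gamma given kappa (Sec. 4.3)."""
    def log_q(x, mean):                            # gamma proposal density, log scale
        shape, rate = mean**2 / sigma2, mean / sigma2
        return (shape * math.log(rate) - math.lgamma(shape)
                + (shape - 1) * math.log(x) - rate * x)

    shape = gamma**2 / sigma2
    gamma_new = rng.gamma(shape, sigma2 / gamma)   # mean = gamma, variance = sigma2
    log_r = (log_f(gamma_new) - log_f(gamma)                       # likelihood term
             + (a_g - 1) * math.log(gamma_new / gamma)             # Gamma(a, b) prior
             - b_g * (gamma_new - gamma)
             + log_q(gamma, gamma_new) - log_q(gamma_new, gamma))  # proposal correction
    return gamma_new if math.log(rng.random()) < min(log_r, 0.0) else gamma
```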

5 Synthetic Experiments

To test the ability of the BP-AR-HMM to discover shared dynamics, we generated five time series that switched between AR(1) models

y^{(i)}_t = a_{z^{(i)}_t} y^{(i)}_{t−1} + e^{(i)}_t(z^{(i)}_t)  (22)

with a_k ∈ {−0.8, −0.6, −0.4, −0.2, 0, 0.2, 0.4, 0.6, 0.8} and process noise covariance Σ_k drawn from an IW(0.5, 3) prior. The object-specific features, shown in Fig. 2(b), were sampled from a truncated IBP [9] using α = 10 and then used to generate the observation sequences of Fig. 2(a). The resulting feature matrix estimated over 10,000 MCMC samples is shown in Fig. 2. Comparing to the true feature matrix, we see that our model is indeed able to discover most of the underlying latent structure of the time series, despite the challenging setting defined by the close AR coefficients.

One might propose, as an alternative to the BP-AR-HMM, using an architecture based on the hierarchical Dirichlet process of [21]; specifically, we could use the HDP-AR-HMMs of [5] tied together with a shared set of transition and dynamic parameters. To demonstrate the difference between these models, we generated data for three switching AR(1) processes. The first two objects, with four times the data points of the third, switched between dynamical modes defined by a_k ∈ {−0.8, −0.4, 0.8}, and the third object used a_k ∈ {−0.3, 0.8}. The results shown in Fig. 3 indicate that the multiple HDP-AR-HMM model typically describes the third object using a_k ∈ {−0.4, 0.8}, since this assignment better matches the parameters defined by the other (lengthy) time series. These results reiterate that the feature model emphasizes choosing behaviors rather than assuming all objects are performing minor variations of the same dynamics.



Figure 3: (a)-(b) The 10th, 50th, and 90th Hamming distance quantiles for object 3 over 1000 trials for the HDP-AR-HMMs and BP-AR-HMM, respectively. (c)-(d) Examples of typical segmentations into behavior modes for the three objects at Gibbs iteration 1000 for the two models (top = estimate, bottom = truth).


Figure 4: Each skeleton plot displays the trajectory of a learned contiguous segment of more than 2 seconds. To reduce the number of plots, we preprocessed the data to bridge segments separated by fewer than 300 msec. The boxes group segments categorized under the same feature label, with the color indicating the true feature label. Skeleton rendering done by modifications to Neil Lawrence's Matlab MoCap toolbox [13].


For the experiments above, we placed a Gamma(1, 1) prior on α and γ, and a Gamma(100, 1) prior on κ. The gamma proposals used σ^2_γ = 1 and σ^2_κ = 100, while the MNIW prior was given M = 0, K = 0.1 * I_d, n_0 = d + 2, and S_0 set to 0.75 times the empirical variance of the joint set of first-difference observations. At initialization, each time series was segmented into five contiguous blocks, with feature labels unique to that sequence.

6 Motion Capture Experiments

The linear dynamical system is a common model for describing simple human motion [11], and the more complicated SLDS has been successfully applied to the problem of human motion synthesis, classification, and visual tracking [17, 18]. Other approaches develop non-linear dynamical models using Gaussian processes [25] or based on a collection of binary latent features [20]. However, there has been little effort in jointly segmenting and identifying common dynamic behaviors amongst a set of multiple motion capture (MoCap) recordings of people performing various tasks. The BP-AR-HMM provides an ideal way of handling this problem. One benefit of the proposed model, versus the standard SLDS, is that it does not rely on manually specifying the set of possible behaviors.



Figure 5: (a) MoCap feature matrices associated with BP-AR-HMM (top-left) and HDP-AR-HMM (top-right) estimated sequences over iterations 15,000 to 20,000, and MAP assignment of the GMM (bottom-left) and HMM (bottom-right) using first-difference observations and 12 clusters/states. (b) Hamming distance versus number of GMM clusters / HMM states on raw observations (blue/green) and first-difference observations (red/cyan), with the BP- and HDP-AR-HMM segmentations (black) and true feature count (magenta) shown for comparison. Results are for the most likely of 10 EM initializations using Murphy's HMM Matlab toolbox [16].

As an illustrative example, we examined a set of six CMU MoCap exercise routines [23], three from Subject 13 and three from Subject 14. Each of these routines used some combination of the following motion categories: running in place, jumping jacks, arm circles, side twists, knee raises, squats, punching, up and down, two variants of toe touches, arch over, and a reach-out stretch.

From the set of 62 position and joint angles, we selected 12 measurements deemed most informative for the gross motor behaviors we wish to capture: one body torso position, two waist angles, one neck angle, one set of right and left (R/L) shoulder angles, the R/L elbow angles, one set of R/L hip angles, and one set of R/L ankle angles. The MoCap data are recorded at 120 fps, and we block-average the data using non-overlapping windows of 12 frames (see the sketch below). Using these measurements, the prior distributions were set exactly as in the synthetic data experiments, except the scale matrix S_0 of the MNIW prior, which was set to 5 times the empirical covariance of the first-difference observations. This allows more variability in the observed behaviors. We ran 25 chains of the sampler for 20,000 iterations and then examined the chain whose segmentation minimized the expected Hamming distance to the set of segmentations from all chains over iterations 15,000 to 20,000. Future work includes developing split-merge proposals to further improve mixing rates in high dimensions.
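For reference, the 12-frame block-averaging step amounts to the following sketch (our naming); we assume trailing frames that do not fill a window are dropped, a detail the paper does not state.

```python
import numpy as np

def block_average(data, window=12):
    """Block-average 120 fps MoCap measurements over non-overlapping windows.

    data: (T, d) array of joint-angle measurements; incomplete trailing
    windows are dropped (an assumption, not specified in the paper).
    """
    T = (data.shape[0] // window) * window
    return data[:T].reshape(-1, window, data.shape[1]).mean(axis=1)
```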

The resulting MCMC sample is displayed in Fig. 4 and in the supplemental video available online. Although some behaviors are merged or split, the overall performance shows a clear ability to find common motions. The split behaviors shown in green and yellow can be attributed to the two subjects performing the same motion in a distinct manner (e.g., knee raises in combination with upper body motion or not, running with hands in or out of sync with knees, etc.). We compare our performance both to the HDP-AR-HMM and to the Gaussian mixture model (GMM) method of Barbic et al. [1] using EM initialized with k-means. Barbic et al. [1] also present an approach based on probabilistic PCA, but this method focuses primarily on change-point detection rather than behavior clustering. As further comparisons, we look at a GMM on first-difference observations, and an HMM on both data sets. The results of Fig. 5(b) demonstrate that the BP-AR-HMM provides more accurate frame labels than any of these alternative approaches over a wide range of mixture model settings. In Fig. 5(a), we additionally see that the BP-AR-HMM provides a superior ability to discover the shared feature structure.

7 Discussion

Utilizing the beta process, we developed a coherent Bayesian nonparametric framework for discovering dynamical features common to multiple time series. This formulation allows for object-specific variability in how the dynamical behaviors are used. We additionally developed a novel exact sampling algorithm for non-conjugate beta process models. The utility of our BP-AR-HMM was demonstrated both on synthetic data, and on a set of MoCap sequences where we showed performance exceeding that of alternative methods. Although we focused on switching VAR processes, our approach could be equally well applied to a wide range of other switching dynamical systems.

Acknowledgments

This work was supported in part by MURIs funded through AFOSR Grant FA9550-06-1-0324 and ARO Grant W911NF-06-1-0076.


References

[1] J. Barbic, A. Safonova, J.-Y. Pan, C. Faloutsos, J.K. Hodgins, and N.S. Pollard. Segmenting motion capture data into distinct behaviors. In Proc. Graphics Interface, pages 185-194, 2004.

[2] M.J. Beal, Z. Ghahramani, and C.E. Rasmussen. The infinite hidden Markov model. In Advances in Neural Information Processing Systems, volume 14, pages 577-584, 2002.

[3] A.C. Courville, N. Daw, G.J. Gordon, and D.S. Touretzky. Model uncertainty in classical conditioning. In Advances in Neural Information Processing Systems, volume 16, pages 977-984, 2004.

[4] E.B. Fox, E.B. Sudderth, M.I. Jordan, and A.S. Willsky. An HDP-HMM for systems with state persistence. In Proc. International Conference on Machine Learning, July 2008.

[5] E.B. Fox, E.B. Sudderth, M.I. Jordan, and A.S. Willsky. Nonparametric Bayesian learning of switching dynamical systems. In Advances in Neural Information Processing Systems, volume 21, pages 457-464, 2009.

[6] A. Frigessi, P. Di Stefano, C.R. Hwang, and S.J. Sheu. Convergence rates of the Gibbs sampler, the Metropolis algorithm and other single-site updating dynamics. Journal of the Royal Statistical Society, Series B, pages 205-219, 1993.

[7] D. Gorur, F. Jakel, and C.E. Rasmussen. A choice model with infinitely many latent features. In Proc. International Conference on Machine Learning, June 2006.

[8] P.J. Green. Reversible jump Markov chain Monte Carlo computation and Bayesian model determination. Biometrika, 82(4):711-732, 1995.

[9] T.L. Griffiths and Z. Ghahramani. Infinite latent feature models and the Indian buffet process. Gatsby Computational Neuroscience Unit, Technical Report #2005-001, 2005.

[10] N.L. Hjort. Nonparametric Bayes estimators based on beta processes in models for life history data. The Annals of Statistics, pages 1259-1294, 1990.

[11] E. Hsu, K. Pulli, and J. Popovic. Style translation for human motion. In SIGGRAPH, pages 1082-1089, 2005.

[12] J.F.C. Kingman. Completely random measures. Pacific Journal of Mathematics, 21(1):59-78, 1967.

[13] N. Lawrence. MATLAB motion capture toolbox. http://www.cs.man.ac.uk/~neill/mocap/.

[14] J.S. Liu. Peskun's theorem and a modified discrete-state Gibbs sampler. Biometrika, 83(3):681-682, 1996.

[15] E. Meeds, Z. Ghahramani, R.M. Neal, and S.T. Roweis. Modeling dyadic data with binary latent factors. In Advances in Neural Information Processing Systems, volume 19, pages 977-984, 2007.

[16] K.P. Murphy. Hidden Markov model (HMM) toolbox for MATLAB. http://www.cs.ubc.ca/~murphyk/Software/HMM/hmm.html.

[17] V. Pavlovic, J.M. Rehg, T.J. Cham, and K.P. Murphy. A dynamic Bayesian network approach to figure tracking using learned dynamic models. In Proc. International Conference on Computer Vision, September 1999.

[18] V. Pavlovic, J.M. Rehg, and J. MacCormick. Learning switching linear models of human motion. In Advances in Neural Information Processing Systems, volume 13, pages 981-987, 2001.

[19] L.R. Rabiner. A tutorial on hidden Markov models and selected applications in speech recognition. Proceedings of the IEEE, 77(2):257-286, 1989.

[20] G.W. Taylor, G.E. Hinton, and S.T. Roweis. Modeling human motion using binary latent variables. In Advances in Neural Information Processing Systems, volume 19, pages 1345-1352, 2007.

[21] Y.W. Teh, M.I. Jordan, M.J. Beal, and D.M. Blei. Hierarchical Dirichlet processes. Journal of the American Statistical Association, 101(476):1566-1581, 2006.

[22] R. Thibaux and M.I. Jordan. Hierarchical beta processes and the Indian buffet process. In Proc. International Conference on Artificial Intelligence and Statistics, volume 11, 2007.

[23] Carnegie Mellon University. Graphics lab motion capture database. http://mocap.cs.cmu.edu/.

[24] J. Van Gael, Y.W. Teh, and Z. Ghahramani. The infinite factorial hidden Markov model. In Advances in Neural Information Processing Systems, volume 21, pages 1697-1704, 2009.

[25] J.M. Wang, D.J. Fleet, and A. Hertzmann. Gaussian process dynamical models for human motion. IEEE Transactions on Pattern Analysis and Machine Intelligence, 30(2):283-298, 2008.

[26] M. West and J. Harrison. Bayesian Forecasting and Dynamic Models. Springer, 1997.

[27] F. Wood, T.L. Griffiths, and Z. Ghahramani. A non-parametric Bayesian method for inferring hidden causes. In Proc. Conference on Uncertainty in Artificial Intelligence, volume 22, 2006.
