Latent Tree Models Part III: Learning Algorithms

Latent Tree Models Part III: Learning Algorithms
Nevin L. Zhang, Dept. of Computer Science & Engineering, The Hong Kong Univ. of Sci. & Tech.
http://www.cse.ust.hk/~lzhang
AAAI 2014 Tutorial


Page 1: Latent Tree Models Part III: Learning Algorithms

Latent Tree Models
Part III: Learning Algorithms

Nevin L. Zhang
Dept. of Computer Science & Engineering
The Hong Kong Univ. of Sci. & Tech.
http://www.cse.ust.hk/~lzhang

AAAI 2014 Tutorial

Page 2: Latent Tree Models Part III: Learning Algorithms

AAAI 2014 Tutorial Nevin L. Zhang HKUST

2

To determine:
1. Number of latent variables
2. Cardinality of each latent variable
3. Model structure
4. Probability distributions

Learning Latent Tree Models

Model selection: items 1-3. Parameter estimation: item 4.

Page 3: Latent Tree Models Part III: Learning Algorithms

AAAI 2014 Tutorial Nevin L. Zhang HKUST

3

Run the interactive program “LightBulbIllustration.jar”, which illustrates the possibility of inferring latent variables and latent structures from observed co-occurrence patterns.

Light Bulb Illustration

Page 4: Latent Tree Models Part III: Learning Algorithms

AAAI 2014 Tutorial Nevin L. Zhang HKUST

4

Part III: Learning Algorithms

Introduction

Search-based algorithms

Algorithms based on variable clustering

Distance-based algorithms

Empirical comparisons

Spectral methods for parameter estimation

Page 5: Latent Tree Models Part III: Learning Algorithms

AAAI 2014 Tutorial Nevin L. Zhang HKUST

5

A search algorithm explores the space of regular models guided by a scoring function:
Start with an initial model.
Iterate until the model score ceases to increase:
Modify the current model in various ways to generate a list of candidate models.
Evaluate the candidate models using the scoring function.
Pick the best candidate model.

What scoring function to use? How do we evaluate candidate models? This is the model selection problem.

Search Algorithms
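To make the search loop concrete, here is a minimal Python sketch of generic score-guided hill climbing. The callables generate_candidates and score are hypothetical placeholders for the operator set and scoring function (e.g., BIC) discussed on the following slides, not part of the tutorial.

```python
def hill_climb(initial_model, generate_candidates, score):
    """Generic greedy search over model space guided by a scoring function.

    generate_candidates(model) -> list of candidate models (hypothetical callable)
    score(model) -> float, higher is better (e.g., the BIC score)
    """
    current = initial_model
    current_score = score(current)
    while True:
        # Modify the current model in various ways to generate candidate models.
        candidates = generate_candidates(current)
        if not candidates:
            return current
        # Evaluate the candidates and pick the best one.
        best = max(candidates, key=score)
        best_score = score(best)
        # Stop when the model score ceases to increase.
        if best_score <= current_score:
            return current
        current, current_score = best, best_score
```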

Page 6: Latent Tree Models Part III: Learning Algorithms

AAAI 2014 Tutorial Nevin L. Zhang HKUST

6

Bayesian score: posterior probability P(m|D)

P(m|D) = P(m)P(D|m) / P(D)

= P(m) ∫ P(D|m, θ) P(θ|m) dθ / P(D)

BIC score: a large-sample approximation of the Bayesian score:
BIC(m|D) = log P(D|m, θ*) − (d/2) log N

d: number of free parameters; N: sample size; θ*: MLE of θ, estimated using the EM algorithm.
The likelihood term of BIC measures how well the model fits the data; the second term is a penalty for model complexity.

The use of the BIC score indicates that we are looking for a model that fits the data well and, at the same time, is not overly complex.

Model Selection Criteria
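As a concrete reading of the BIC formula above, here is a small helper; the function and argument names are illustrative, not from the tutorial.

```python
import math

def bic_score(loglik_mle: float, num_free_params: int, sample_size: int) -> float:
    """BIC(m|D) = log P(D|m, theta*) - (d/2) * log N."""
    return loglik_mle - 0.5 * num_free_params * math.log(sample_size)

# Example: maximized log-likelihood -1200.0, 35 free parameters, 500 samples.
print(bic_score(-1200.0, 35, 500))
```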

Page 7: Latent Tree Models Part III: Learning Algorithms

AAAI 2014 Tutorial Nevin L. Zhang HKUST

7

AIC (Akaike, 1974): AIC(m|D) = log P(D|m, θ*) − d

Holdout likelihood: split the data into a training set and a validation set. Model parameters are estimated on the training set; model quality is measured by the likelihood on the validation set.

Cross-validation: too expensive.

Model Selection Criteria

Page 8: Latent Tree Models Part III: Learning Algorithms

AAAI 2014 Tutorial Nevin L. Zhang HKUST

8

Search Algorithms

Double hill climbing (DHC) (Zhang 2002, 2004): 7 manifest variables

Single hill climbing (SHC) (Zhang and Kocka 2004): 12 manifest variables

Heuristic SHC (HSHC) (Zhang and Kocka 2004): 50 manifest variables

EAST (Chen et al. 2011): 100+ manifest variables

Page 9: Latent Tree Models Part III: Learning Algorithms

AAAI 2014 Tutorial Nevin L. Zhang HKUST

9

Two search procedures: one for the model structure, and one for the cardinalities of the latent variables.

Very inefficient. Tested only on data sets with 7 or fewer variables. (Zhang 2004)

DHC was tested on synthetic and real-world data sets with BIC, AIC, and holdout likelihood respectively. The best models were found when BIC was used, so subsequent work is based on BIC.

Double Hill Climbing (DHC)

Page 10: Latent Tree Models Part III: Learning Algorithms

AAAI 2014 Tutorial Nevin L. Zhang HKUST

10

Determines both model structure and cardinalities of latent variables using a single search procedure.

Uses five search operators: node introduction (NI), node deletion (ND), node relocation (NR), state introduction (SI), and state deletion (SD).

Single Hill Climbing (SHC)

Page 11: Latent Tree Models Part III: Learning Algorithms

AAAI 2014 Tutorial Nevin L. Zhang HKUST

11

NI involves a latent variable Y and some of its neighbors

It introduces a new node Y’ to mediate Y and the neighbors.

The cardinality of Y’ is set at |Y|

Example: Y2 is introduced to mediate Y1 and its neighbors X1 and X2. The cardinality of Y2 is set at |Y1|.

Node Introduction (NI)

Page 12: Latent Tree Models Part III: Learning Algorithms

AAAI 2014 Tutorial Nevin L. Zhang HKUST

12

NR involves a latent variable Y, a neighbor Z of Y, and another neighbor Y’ of Y that is also a latent variable.

It relocates Z from Y to Y’.

Example: X3 is relocated from Y1 to Y2

Node Relocation (NR)

Page 13: Latent Tree Models Part III: Learning Algorithms

AAAI 2014 Tutorial Nevin L. Zhang HKUST

13

ND involves a latent variable Y and a neighbor Y’ of Y that is also a latent variable.

It removes Y and reconnects the other neighbors of Y to Y’.

Example: Y2 is removed, and its other neighbors are reconnected to Y1.

Node Deletion

Page 14: Latent Tree Models Part III: Learning Algorithms

AAAI 2014 Tutorial Nevin L. Zhang HKUST

14

State introduction (SI) Increase the number of states of a latent variable by 1

State deletion (SD) Reduce the number of states of a latent variable by 1.

State Introduction/Deletion

Page 15: Latent Tree Models Part III: Learning Algorithms

AAAI 2014 Tutorial Nevin L. Zhang HKUST

15

Single Hill Climbing (SHC):
Start with an initial model (LCM).
At each step:
Construct all possible candidate models using NI, ND, NR, SI and SD.
Evaluate them one by one.
Pick the best one.

Still inefficient: tested only on data with no more than 12 variables.

Reason: too many candidate models, and it is too expensive to run EM on all of them.

Page 16: Latent Tree Models Part III: Learning Algorithms

AAAI 2014 Tutorial Nevin L. Zhang HKUST

16

Scale up SHC

Idea 1: Restrict NI to involve only two neighbors of the latent variable it operates on.

The EAST Algorithm

Page 17: Latent Tree Models Part III: Learning Algorithms

With this restriction, how do we then go from the model on the left to the model on the right?

First apply NI, and then NR

Each NI operation is followed by NR operations to compensate for the restriction on NI.

Reachability


Page 18: Latent Tree Models Part III: Learning Algorithms

AAAI 2014 Tutorial Nevin L. Zhang HKUST

18

Idea 2: Reduce the number of candidate models by not using ALL the operators at once. How?

BIC: BIC(m|D) = log P(D|m, θ*) − (d/2) log N

Improve the two terms alternately.

NI and SI improve the likelihood term: let m’ be obtained from m using NI or SI. Then m’ includes m and hence has a higher maximized likelihood:

log P(D|m’, θ’*) ≥ log P(D|m, θ*)

SD and ND reduce the penalty term.

Page 19: Latent Tree Models Part III: Learning Algorithms

AAAI 2014 Tutorial Nevin L. Zhang HKUST

19

The EAST Algorithm (Chen et al. AIJ 2011)
1. Start with a simple initial model.
2. Repeat until the model score ceases to improve:

Expansion: search with node introduction (NI) and state introduction (SI). Each NI operation is followed by NR operations to compensate for the restriction on NI (see Slide 17).

Adjustment: search with NR.

Simplification: search with node deletion (ND) and state deletion (SD).

EAST: Expansion, Adjustment, Simplification until Termination
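The control flow described above can be sketched as follows. The helper search_with stands in for EAST's inner operator searches (including the NR compensation after NI) and score for the BIC computation; both are hypothetical placeholders, so this is a sketch of the loop structure rather than the authors' implementation.

```python
def east(initial_model, score, search_with):
    """Sketch of EAST: Expansion, Adjustment, Simplification, until Termination.

    search_with(model, operators) -> best model found by searching with the
    given operators (hypothetical helper; for NI it is assumed to apply the
    compensating NR operations as well).
    """
    current = initial_model
    while True:
        previous_score = score(current)
        # Expansion: grow the model with NI and SI.
        current = search_with(current, operators=("NI", "SI"))
        # Adjustment: relocate nodes with NR.
        current = search_with(current, operators=("NR",))
        # Simplification: shrink the model with ND and SD.
        current = search_with(current, operators=("ND", "SD"))
        # Terminate when the model score stops improving.
        if score(current) <= previous_score:
            return current
```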

Page 20: Latent Tree Models Part III: Learning Algorithms

Idea 3: Parameter Value Inheritance
m: current model; m’: candidate model generated by applying a search operator to m. The two models share many parameters:

m: (θ1, θ2); m’: (θ1, λ2)

When evaluating m’, inherit the values of the shared parameters θ1 from m, and estimate only the new parameters λ2:

λ2* = arg max_{λ2} log P(D|m’, θ1, λ2)

Page 21: Latent Tree Models Part III: Learning Algorithms

AAAI 2014 Tutorial Nevin L. Zhang HKUST

21

Avoid Local Optimum at the Expansion Phase

NI: Increases structure complexity. SI: Increases variable complexity. Key Issue at the expansion phase:

Tradeoff between structure complexity and variable complexity

Page 22: Latent Tree Models Part III: Learning Algorithms

AAAI 2014 Tutorial Nevin L. Zhang HKUST

22

NI and SI are of different granularities. For example, with p = 100: SI adds 101 more parameters, while NI adds 2 more parameters. There is a huge disparity in granularity.

The penalty term in BIC is insufficient to handle this, so SI is always preferred initially, leading to a quick increase in variable complexity and to local optima in model score.

Operation Granularity

Page 23: Latent Tree Models Part III: Learning Algorithms

AAAI 2014 Tutorial Nevin L. Zhang HKUST

23

EAST does not use BIC when choosing between candidate models produced by NI and SI.

Instead, it uses the cost-effectiveness principle: select the candidate model with the highest improvement ratio, i.e., the increase in model score per unit increase in model complexity. The denominator is larger for operations that increase the number of model parameters more.

This can be justified using the likelihood ratio test (LRT): the principle picks the candidate model that gives the strongest evidence to reject the null model (the current model) according to the LRT.

Dealing with Operation Granularity
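A small illustration of the cost-effectiveness principle, assuming the improvement ratio is the change in BIC divided by the change in the number of free parameters; the accessor functions are hypothetical.

```python
def improvement_ratio(bic_new: float, bic_old: float, d_new: int, d_old: int) -> float:
    """Increase in model score per unit increase in model complexity."""
    return (bic_new - bic_old) / (d_new - d_old)

def pick_expansion_candidate(current, candidates, bic, num_params):
    """Among NI/SI candidates (which all add parameters), pick the one with the
    highest improvement ratio over the current model.

    bic(model) and num_params(model) are hypothetical accessors.
    """
    return max(
        candidates,
        key=lambda m: improvement_ratio(bic(m), bic(current),
                                        num_params(m), num_params(current)),
    )
```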

Page 24: Latent Tree Models Part III: Learning Algorithms

The alternative model includes the null model, and hence fits the data better than the null model.

Whether it fits significantly better is determined by the p-value of the difference D (the likelihood-ratio statistic), which approximately follows a chi-squared distribution with d2 − d1 degrees of freedom.

Likelihood Ratio Test (LRT) (Wikipedia)

Page 25: Latent Tree Models Part III: Learning Algorithms

The required D value for a given p-value increases roughly linearly with d2 − d1.

The ratio D / (d2 − d1) is closely related to the p-value: it is a measure of the strength of evidence in favor of the alternative model.

Likelihood Ratio Test (LRT)

Page 26: Latent Tree Models Part III: Learning Algorithms

AAAI 2014 Tutorial Nevin L. Zhang HKUST

26

Likelihood Ratio Test & Improvement Ratio

The improvement ratio can be written as [log P(D|m’, θ’*) − log P(D|m, θ*)] / (d2 − d1) − (log N)/2. The second term is constant, and the first term is exactly ½ · D / (d2 − d1). Loosely speaking, the cost-effectiveness principle therefore picks the candidate model that gives the strongest evidence to reject the null model (the current model) according to the LRT.

Page 27: Latent Tree Models Part III: Learning Algorithms

Search Process on Danish Beer Data

Page 28: Latent Tree Models Part III: Learning Algorithms

EAST has been used on medical survey data: a few dozen variables, hundreds to thousands of observations, and settings where model quality is important. (Xu et al. 2013)

Page 29: Latent Tree Models Part III: Learning Algorithms

AAAI 2014 Tutorial Nevin L. Zhang HKUST

29

Part III: Learning Algorithms

Introduction

Search-based algorithms

Algorithms based on variable clustering

Distance-based algorithms

Empirical comparisons

Spectral methods for parameter estimation

Page 30: Latent Tree Models Part III: Learning Algorithms

AAAI 2014 Tutorial Nevin L. Zhang HKUST

30

Key idea: group variables into clusters and introduce a latent variable for each cluster.

For discrete variables, mutual information is used as the similarity measure.

Algorithms: BIN-G (Harmeling & Williams, PAMI 2011); bridged-islands (BI) algorithm (Liu et al. MLJ, 2013)

Algorithms Based on Variable Clustering

Page 31: Latent Tree Models Part III: Learning Algorithms

AAAI 2014 Tutorial Nevin L. Zhang HKUST

31

Learns binary tree models.
L ← all observed variables
Loop:
Remove from L the pair of variables with the highest mutual information.
Introduce a new latent variable for the pair.
Add the new latent variable to L.

The BIN-G Algorithm
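A structural sketch of the BIN-G loop under simplifying assumptions: mutual_information and learn_lcm are hypothetical helpers (the latter standing in for the EM/BIC procedure on the next slide), and latent variables are represented by plain names.

```python
from itertools import combinations

def bin_g(observed, mutual_information, learn_lcm):
    """Sketch of BIN-G: repeatedly merge the two most correlated active
    variables under a new latent variable until a single root remains.

    mutual_information(x, y) -> float   (hypothetical helper)
    learn_lcm(x, y) -> new latent variable for the pair (hypothetical helper)
    """
    active = list(observed)
    edges = []                       # (parent, child) edges of the binary tree
    while len(active) > 1:
        # Remove from the active list the pair with the highest mutual information.
        x, y = max(combinations(active, 2), key=lambda p: mutual_information(*p))
        latent = learn_lcm(x, y)     # introduce a new latent variable
        edges += [(latent, x), (latent, y)]
        active.remove(x)
        active.remove(y)
        active.append(latent)        # add the new latent variable to the list
    return edges
```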

Page 32: Latent Tree Models Part III: Learning Algorithms

AAAI 2014 Tutorial Nevin L. Zhang HKUST

32

Learn an LCM: determine the cardinality of the new latent variable and the parameters.
Let |H1| = 1 and increase it gradually until termination.
At each step, run EM to optimize the model parameters and calculate the BIC score.
Return the LCM with the highest BIC score.

Determine the MI between the new latent variable and the others: convert the new latent variable into an observed one via imputation (hard assignment), then calculate MI(H1; X3), MI(H1; X4), etc.

Two Issues

NOTE: if some latent variables have cardinality 1, they can be removed from the model, resulting in a forest instead of a tree.

Page 33: Latent Tree Models Part III: Learning Algorithms

Result of BIN-G on subset of 20 Newsgroups Dataset

Page 34: Latent Tree Models Part III: Learning Algorithms

AAAI 2014 Tutorial Nevin L. Zhang HKUST

34

Learns non-binary trees.

Partitions all observed variables into clusters, with some clusters having >2 variables

Introduces a latent variable for each variable cluster

Links up the latent variables to get a global tree model

The result is a flat latent tree model, in the sense that each latent variable is directly connected to at least one observed variable.

The BI Algorithm

Page 35: Latent Tree Models Part III: Learning Algorithms

AAAI 2014 Tutorial Nevin L. Zhang HKUST

35

Identify a cluster of variables such that (1) the variables in the cluster are closely correlated, and (2) the correlations can be properly modeled using a single latent variable. Such a cluster is called a uni-dimensional (UD) cluster.

Remove the cluster and repeat the process. Eventually we obtain a partition of the observed variables.

BI Step 1: Partition the Observed Variables

Page 36: Latent Tree Models Part III: Learning Algorithms

AAAI 2014 Tutorial Nevin L. Zhang HKUST

36

Sketch of the algorithm for identifying the first variable cluster:
L ← all observed variables
S ← the pair of variables with the highest mutual information
Loop:
X ← the variable in L with the highest MI with S
S ← S ∪ {X}, L ← L \ {X}
Perform the uni-dimensionality test on S.
If the test fails, stop the loop and pick the first cluster of variables.

Obtaining First Variable Cluster
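A sketch of this island-growing loop, assuming hypothetical helpers: mutual_information for pairwise MI, and ud_test for the uni-dimensionality test described on the next slides (returning whether the test passes and, if not, the two-cluster split suggested by the two-latent-variable model).

```python
from itertools import combinations

def first_cluster(observed, mutual_information, ud_test):
    """Sketch of identifying the first uni-dimensional cluster in BI."""
    remaining = list(observed)
    # Seed: the pair of variables with the highest mutual information.
    seed = max(combinations(remaining, 2), key=lambda p: mutual_information(*p))
    cluster = list(seed)
    for v in seed:
        remaining.remove(v)

    while remaining:
        # Add the remaining variable with the highest MI with the current cluster.
        x = max(remaining, key=lambda v: max(mutual_information(v, c) for c in cluster))
        cluster.append(x)
        remaining.remove(x)
        passes, split = ud_test(cluster)   # hypothetical UD-test helper
        if not passes:
            # Keep the sub-cluster that contains the two seed variables.
            return next((part for part in split if all(s in part for s in seed)), cluster)
    return cluster
```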

Page 37: Latent Tree Models Part III: Learning Algorithms

AAAI 2014 Tutorial Nevin L. Zhang HKUST

37

Test whether the correlations among the variables in a set S can be properly modeled using a single latent variable.

Example: S = {X1, X2, X3, X4, X5}. Learn two models:
m1: the best LCM, i.e., the LTM with one latent variable
m2: the best LTM with two latent variables
This can be done using EAST.

The UD-test passes if and only if m2 is not significantly better than m1 (see the Bayes factor criterion on the following slides).

Uni-Dimensionality (UD) Test

If the use of two latent variables does not give a significantly better model, then the use of one latent variable is appropriate.

Page 38: Latent Tree Models Part III: Learning Algorithms

AAAI 2014 Tutorial Nevin L. Zhang HKUST

38

Unlike in a likelihood-ratio test, the models do not need to be nested. The strength of evidence in favor of M2 depends on the value of the Bayes factor K.

Bayes Factor (Wikipedia)

Page 39: Latent Tree Models Part III: Learning Algorithms

AAAI 2014 Tutorial Nevin L. Zhang HKUST

39

The statistic U = BIC(m2|D) − BIC(m1|D) is a large-sample approximation of the log Bayes factor log K. The strength of evidence in favor of two latent variables depends on U.

In the UD-test, we usually set a threshold on U: we conclude a single latent variable if there is no strong evidence for more than one latent variable.

Bayes Factor and UD-Test (Wikipedia)

Page 40: Latent Tree Models Part III: Learning Algorithms

AAAI 2014 Tutorial Nevin L. Zhang HKUST

40

Initially, S = {X1, X2}. X3 and X4 are added to S, and the UD-test passes. Next, X5 is added.

With S = {X1, X2, X3, X4, X5}, m2 is significantly better than m1, so the UD-test fails.

m2 gives two uni-dimensional (UD) clusters. The first variable cluster is {X1, X2, X4}, picked because it contains the initial variables X1 and X2.

UD-Test and Variable Cluster

Page 41: Latent Tree Models Part III: Learning Algorithms

AAAI 2014 Tutorial Nevin L. Zhang HKUST

41

Introduce a latent variable for each UD cluster. Optimize the cardinalities of the latent variables and the parameters.

BI Step 2: Latent Variable Introduction

Page 42: Latent Tree Models Part III: Learning Algorithms

AAAI 2014 Tutorial Nevin L. Zhang HKUST

42

Bridging the “islands” using Chow and Liu’s algorithm (1968):
Estimate the joint distribution of each pair of latent variables Y and Y’, where m and m’ are the LCMs that contain Y and Y’ respectively.
Calculate MI(Y; Y’).
Find the maximum spanning tree over the MI values.

BI Step 3: Link up Latent Variables
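The bridging step is a maximum spanning tree over the pairwise MI estimates between latent variables. Below is a minimal Prim-style sketch; the representation of the MI values as a dict keyed by frozensets is an assumption for illustration.

```python
def maximum_spanning_tree(nodes, mi):
    """Prim-style maximum spanning tree.

    nodes: list of latent variable names
    mi:    dict mapping frozenset({y, y2}) -> estimated MI(y; y2)
    Returns a list of tree edges (y, y2).
    """
    in_tree = {nodes[0]}
    edges = []
    while len(in_tree) < len(nodes):
        # Among edges leaving the current tree, add the one with the largest MI.
        y, y2 = max(
            ((a, b) for a in in_tree for b in nodes if b not in in_tree),
            key=lambda e: mi[frozenset(e)],
        )
        edges.append((y, y2))
        in_tree.add(y2)
    return edges

# Toy example with three latent "islands".
mi = {frozenset(p): v for p, v in [(("Y1", "Y2"), 0.30), (("Y1", "Y3"), 0.05), (("Y2", "Y3"), 0.20)]}
print(maximum_spanning_tree(["Y1", "Y2", "Y3"], mi))   # [('Y1', 'Y2'), ('Y2', 'Y3')]
```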

Page 43: Latent Tree Models Part III: Learning Algorithms

AAAI 2014 Tutorial Nevin L. Zhang HKUST

43

Improvement based on global considerations:
Run EM to optimize the parameters of the whole model.
For each latent variable Y and each observed variable X, calculate the joint distribution of Y and X under the model.
Re-estimate MI(Y; X) based on the above distribution.
Let Y* be the latent variable with the highest MI(Y; X). If Y* is not currently the neighbor of X, make it so.

BI Step 4: Global Adjustment

Page 44: Latent Tree Models Part III: Learning Algorithms

Result of BI on subset of 20 Newsgroups Dataset

Page 45: Latent Tree Models Part III: Learning Algorithms

AAAI 2014 Tutorial Nevin L. Zhang HKUST

45

Part III: Learning Algorithms

Introduction

Search-based algorithms

Algorithms based on variable clustering

Distance-based algorithms

Empirical comparisons

Spectral methods for parameter estimation

Page 46: Latent Tree Models Part III: Learning Algorithms

AAAI 2014 Tutorial Nevin L. Zhang HKUST

46

Define a distance between variables that is additive over trees. Estimate the distances between observed variables from data. Infer the model structure from those distance estimates.

Assumptions: the latent variables have equal cardinality, and it is known; in some cases, it equals the cardinality of the observed variables. Alternatively, all variables are continuous.

Focus on two algorithms: recursive grouping (Choi et al, JMLR 2011) and neighbor joining (Saitou & Nei, 1987; Studier and Keppler, 1988).

Distance-Based Algorithms

Slides based on Choi et al, 2010: www.ece.nus.edu.sg/stfpage/vtan/latentTree_slides.pdf

Page 47: Latent Tree Models Part III: Learning Algorithms

AAAI 2014 Tutorial Nevin L. Zhang HKUST

47

Information distance between two discrete variables Xi and Xj (Lake 1994):
d_ij = −log ( |det P_{Xi,Xj}| / sqrt( det(diag P_{Xi}) · det(diag P_{Xj}) ) )
where P_{Xi,Xj} is the joint probability matrix and diag P_{Xi}, diag P_{Xj} are the diagonal matrices of the marginals.

Information Distance

When both variables are binary, this reduces to d_ij = −log |ρ_ij|, where ρ_ij is the correlation coefficient between Xi and Xj.
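A small numerical check of this definition (a sketch assuming the determinant-based form above), using numpy:

```python
import numpy as np

def information_distance(joint: np.ndarray) -> float:
    """Information distance between two discrete variables (Lake 1994),
    computed from their joint probability matrix P[x, y]:
        d = -log( |det P_xy| / sqrt(det diag(P_x) * det diag(P_y)) )
    """
    px = joint.sum(axis=1)                       # marginal of the row variable
    py = joint.sum(axis=0)                       # marginal of the column variable
    num = abs(np.linalg.det(joint))
    den = np.sqrt(np.prod(px) * np.prod(py))     # det of a diagonal matrix = product of entries
    return -np.log(num / den)

# Binary example: for binary variables this reduces to -log |correlation|.
joint = np.array([[0.4, 0.1],
                  [0.1, 0.4]])
print(information_distance(joint))   # equals -log(0.6), since the correlation is 0.6
```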

Page 48: Latent Tree Models Part III: Learning Algorithms

AAAI 2014 Tutorial Nevin L. Zhang HKUST

48

Information distances are additive on trees (Erdos, Szekely, Steel, & Warnow, 1999): for any two nodes k and l in the tree, d_kl equals the sum of d_ij over the edges (i, j) on the path between k and l.

Additivity of Information Distance on Trees

Page 49: Latent Tree Models Part III: Learning Algorithms

AAAI 2014 Tutorial Nevin L. Zhang HKUST

49

Testing Node Relationships

If j is a leaf and i is its parent, then by additivity d_jk = d_ij + d_ik for every other node k, so d_jk − d_ik = d_ij. This implies the difference is a constant: it does not change with k.

The equality does not hold if j is not a leaf or i is not the parent of j.

Page 50: Latent Tree Models Part III: Learning Algorithms

AAAI 2014 Tutorial Nevin L. Zhang HKUST

50

Testing Node Relationships

If i and j are leaves with the same parent, the difference d_jk − d_ik is again a constant that does not change with k, and it lies strictly between −d_ij and d_ij.

This property allows us to determine which leaf nodes are siblings.

Page 51: Latent Tree Models Part III: Learning Algorithms

AAAI 2014 Tutorial Nevin L. Zhang HKUST

51

RG is an algorithm that determines model structure using the two properties mentioned earlier.

We explain RG with an example. Assume the data are generated by the following model.

The data contain no information about the latent variables. Task: recover the model structure.

The Recursive Grouping (RG) Algorithm

Page 52: Latent Tree Models Part III: Learning Algorithms

AAAI 2014 Tutorial Nevin L. Zhang HKUST

52

Step 1: Estimate from data the information distance between each pair of observed variables.

Step 2: Identify (leaf, parent-of-leaf) pairs and (leaf, leaf) sibling pairs. For each pair i, j:
If d_jk − d_ik = c (a constant) for all k ≠ i, j:
If c = d_ij, then j is a leaf and i is its parent.
If −d_ij < c < d_ij, then i and j are leaves and siblings.

Recursive Grouping
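A sketch of the constant-difference test from Step 2, with distances stored in a plain dict keyed by ordered pairs and a numerical tolerance added for estimated distances (both assumptions for illustration).

```python
def classify_pair(i, j, nodes, d, tol=1e-6):
    """Classify the relationship between nodes i and j from additive distances.

    d[(i, j)]: information distance (all ordered pairs assumed present).
    Uses the test: phi(k) = d[(j, k)] - d[(i, k)] is constant over k, and
      phi =  d[(i, j)]             -> j is a leaf and i is its parent
      phi = -d[(i, j)]             -> i is a leaf and j is its parent
      -d[(i, j)] < phi < d[(i, j)] -> i and j are leaf siblings
    """
    others = [k for k in nodes if k not in (i, j)]
    if not others:
        return "undetermined"
    phis = [d[(j, k)] - d[(i, k)] for k in others]
    if max(phis) - min(phis) > tol:          # difference is not constant in k
        return "neither"
    phi = phis[0]
    if abs(phi - d[(i, j)]) <= tol:
        return "j is a leaf, i is its parent"
    if abs(phi + d[(i, j)]) <= tol:
        return "i is a leaf, j is its parent"
    if -d[(i, j)] < phi < d[(i, j)]:
        return "leaf siblings"
    return "neither"
```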

Page 53: Latent Tree Models Part III: Learning Algorithms

AAAI 2014 Tutorial Nevin L. Zhang HKUST

53

Step 3: Introduce a hidden parent node for each sibling group without a parent.

Recursive Grouping

NOTE: No need to determine model parameters here.

Page 54: Latent Tree Models Part III: Learning Algorithms

AAAI 2014 Tutorial Nevin L. Zhang HKUST

54

Step 4. Compute the information distance for new hidden nodes.

Recursive Grouping

Page 55: Latent Tree Models Part III: Learning Algorithms

AAAI 2014 Tutorial Nevin L. Zhang HKUST

55

Step 5. Remove the identified child nodes and repeat Steps 2-4.

Parameters of the final model can be determined using EM if needed.

Recursive Grouping

Page 56: Latent Tree Models Part III: Learning Algorithms

AAAI 2014 Tutorial Nevin L. Zhang HKUST

56

Making Recursive Grouping more efficient

Step 1: Construct Chow-Liu tree over observed variables only based on information distance, i.e. maximum spanning tree

CL Recursive Grouping (CLRG)

Page 57: Latent Tree Models Part III: Learning Algorithms

AAAI 2014 Tutorial Nevin L. Zhang HKUST

57

Step 2: Select an internal node and its neighbors, and apply the recursive-grouping (RG) algorithm. (Much cheaper)

Step 3: Replace the output of RG with the sub-tree spanning the neighborhood.

CL Recursive Grouping (CLRG)

Page 58: Latent Tree Models Part III: Learning Algorithms

AAAI 2014 Tutorial Nevin L. Zhang HKUST

58

Repeat Steps 2-3 until all internal nodes are operated on.

Theorem: Both RG and CLRG are consistent

CL Recursive Grouping (CLRG)

Page 59: Latent Tree Models Part III: Learning Algorithms

Result of CLRG on subset of 20 Newsgroups Dataset

Page 60: Latent Tree Models Part III: Learning Algorithms

AAAI 2014 Tutorial Nevin L. Zhang HKUST

60

Choi et al (2011) assume that all observed and latent variables have equal cardinality, so that the information distance can be defined.

The assumption is relaxed by Song et al (2011) and Wang et al (2013): latent variables can have fewer states than observed ones, but they still need to have equal cardinality among themselves.

New definition of information distance based on the pseudo-determinant, where σ_s(A) denotes the s-th largest singular value of matrix A and k is the rank of the joint probability matrix P_XY.

An Extension

Page 61: Latent Tree Models Part III: Learning Algorithms

AAAI 2014 Tutorial Nevin L. Zhang HKUST

61

Gaussian distributions: d_ij = −log |ρ_ij|, where ρ_ij is the correlation coefficient between Xi and Xj.

Non-Gaussian distributions (Song et al, NIPS 2011): a generalized distance obtained via kernel embeddings of distributions.

Information Distance for Continuous Variables

Page 62: Latent Tree Models Part III: Learning Algorithms

AAAI 2014 Tutorial Nevin L. Zhang HKUST

62

Neighbor joining: another method to infer model structure using tree-additive distances.

Originally developed for phylogenetic trees (Saitou & Nei, 1987).

Key observation: the closest pair might not be siblings.

In the example, d_AB is the smallest distance, but those two leaves are not neighbors.

Page 63: Latent Tree Models Part III: Learning Algorithms

AAAI 2014 Tutorial Nevin L. Zhang HKUST

63

Neighbor joining. Let L be the set of leaf nodes.

Define: r_i = 1/(|L| − 2) · Σ_{k in L} d_ik
D_ij = d_ij − (r_i + r_j)

Theorem: a pair of leaves i, j with minimal D_ij are siblings.

Page 64: Latent Tree Models Part III: Learning Algorithms

AAAI 2014 Tutorial Nevin L. Zhang HKUST

64

Neighbor joining: initialization
Define T to be the set of leaf nodes; make the list L of active nodes equal to T.

Loop until |L| = 2:
Find the two nodes i and j for which D_ij is minimal.
Combine them to form a new node k, with

d_km = 1/2 (d_im + d_jm − d_ij) for all m in L

d_ik = 1/2 (d_ij + r_i − r_j)

d_jk = d_ij − d_ik

Remove i and j from L and add node k.

The result is a binary tree.
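A compact sketch of the procedure above, with the distance matrix stored as a dict keyed by frozensets and new internal nodes given generated names (both conventions assumed for illustration).

```python
import itertools

def neighbor_joining(leaves, dist):
    """Sketch of neighbor joining (Saitou & Nei 1987).

    leaves: list of leaf names
    dist:   dict with dist[frozenset({i, j})] = additive distance d_ij
    Returns a list of (parent, child) edges; internal nodes are named 'Z1', 'Z2', ...
    """
    d = dict(dist)
    active = list(leaves)
    edges = []
    counter = itertools.count(1)

    def dd(i, j):
        return d[frozenset((i, j))]

    while len(active) > 2:
        n = len(active)
        r = {i: sum(dd(i, m) for m in active if m != i) / (n - 2) for i in active}
        # The pair minimizing D_ij = d_ij - (r_i + r_j) are siblings.
        i, j = min(itertools.combinations(active, 2),
                   key=lambda p: dd(*p) - r[p[0]] - r[p[1]])
        k = "Z{}".format(next(counter))              # new internal node
        d_ik = 0.5 * (dd(i, j) + r[i] - r[j])
        # Distances from the new node to the remaining active nodes.
        for m in active:
            if m not in (i, j):
                d[frozenset((k, m))] = 0.5 * (dd(i, m) + dd(j, m) - dd(i, j))
        d[frozenset((j, k))] = dd(i, j) - d_ik
        d[frozenset((i, k))] = d_ik
        edges += [(k, i), (k, j)]
        active.remove(i)
        active.remove(j)
        active.append(k)

    if len(active) == 2:                             # connect the last two active nodes
        edges.append((active[0], active[1]))
    return edges
```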

Page 65: Latent Tree Models Part III: Learning Algorithms

AAAI 2014 Tutorial Nevin L. Zhang HKUST

65

Example

Page 66: Latent Tree Models Part III: Learning Algorithms

AAAI 2014 Tutorial Nevin L. Zhang HKUST

66

Making Neighbor Joining more efficient

Step 1: Construct Chow-Liu tree over observed variables only based on information distance, i.e. maximum spanning tree

Step 2: Select an internal node and its neighbors, and apply the NJ algorithm. (Much cheaper)

Step 3: Replace the output of NJ with the sub-tree spanning the neighborhood.

Repeat Steps 2-3 until all internal nodes are operated on.

CL Neighbor Joining (CLNJ)

Page 67: Latent Tree Models Part III: Learning Algorithms

Result of CLNJ on subset of 20 Newsgroups Dataset

Page 68: Latent Tree Models Part III: Learning Algorithms

AAAI 2014 Tutorial Nevin L. Zhang HKUST

68

Originally developed for phylogenetic tree reconstruction (John et al, Journal of Algorithms, 2003)

Idea:

First determine the structures among quartets (groups of 4) of observed variables

Then combine the results to obtain a global model of all the observed variables.

If two nodes are not siblings in a quartet model, they cannot be siblings in the global model.

For general LTMs: Chen et al, PGM 2006; Anandkumar et al, NIPS 2011; Mossel et al, 2011.

Quartet-Based Algorithms

Page 69: Latent Tree Models Part III: Learning Algorithms

AAAI 2014 Tutorial Nevin L. Zhang HKUST

69

Part III: Learning Algorithms

Introduction

Search-based algorithms

Algorithms based on variable clustering

Distance-based algorithms

Empirical comparisons

Spectral methods for parameter estimation

Page 70: Latent Tree Models Part III: Learning Algorithms

AAAI 2014 Tutorial Nevin L. Zhang HKUST

70

Algorithms compared:
Search-based algorithm: EAST (Chen et al, AIJ 2011)
Variable clustering-based algorithms: BIN (Harmeling & Williams, PAMI 2011); BI (Liu et al. MLJ, 2013)
Distance-based algorithms: CLRG (Choi et al, JMLR 2011); CLNJ (Saitou & Nei, 1987)

Data: synthetic data and real-world data

Measurements: running time and model quality

Empirical Comparisons

Page 71: Latent Tree Models Part III: Learning Algorithms

AAAI 2014 Tutorial Nevin L. Zhang HKUST

71

4-complete model (M4C): every latent node has 4 neighbors; all variables are binary; parameter values are randomly generated such that the normalized MI between each pair of neighbors is between 0.05 and 0.75.

Generative Models

Page 72: Latent Tree Models Part III: Learning Algorithms

AAAI 2014 Tutorial Nevin L. Zhang HKUST

72

M4CF: obtained from M4C by adding more variables so that each latent node has 3 observed neighbors. It is a flat model.

Other models and the total number of observed variables

Generative Models

Page 73: Latent Tree Models Part III: Learning Algorithms

AAAI 2014 Tutorial Nevin L. Zhang HKUST

73

Synthetic data: 5,000 training samples and 5,000 testing samples; no information on the latent variables.

Evaluation criterion for the distribution: m0 is the generative model and m is the learned model. The empirical KL divergence on the testing data (the smaller the better) is the average over the test cases of log P(d|m0) − log P(d|m); its second term is the hold-out likelihood of m (the larger the better).

Synthetic Data and Evaluation Criteria

Page 74: Latent Tree Models Part III: Learning Algorithms

Evaluation Criteria: Structure

Example: for the two models on the right,
m0: Y2-Y1: X1X2X3 | X4X5X6X7; Y1-Y3: X1X2X3X4 | X5X6X7; X1-Y2: X1 | X2X3X4X5X6X7; ...
m: Y2-Y1: X1X2 | X3X4X5X6X7; Y1-Y3: X1X2X3X4 | X5X6X7; X1-Y2: X1 | X2X3X4X5X6X7; ...
dRF = (1 + 1)/2 = 1. (Not defined for forests.)
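Assuming dRF here denotes the Robinson-Foulds distance over the bipartitions induced by the tree edges (consistent with the example), a minimal sketch:

```python
def robinson_foulds(partitions_true, partitions_learned):
    """Robinson-Foulds distance between two sets of edge-induced bipartitions.

    Each bipartition is represented by a frozenset of the observed variables on
    one side of an edge (a convention assumed here for illustration).
    """
    only_true = partitions_true - partitions_learned
    only_learned = partitions_learned - partitions_true
    return (len(only_true) + len(only_learned)) / 2

# Example matching the slide: each model has exactly one bipartition the other lacks.
m0 = {frozenset({"X1", "X2", "X3"}), frozenset({"X1", "X2", "X3", "X4"}), frozenset({"X1"})}
m = {frozenset({"X1", "X2"}), frozenset({"X1", "X2", "X3", "X4"}), frozenset({"X1"})}
print(robinson_foulds(m0, m))   # 1.0
```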

Page 75: Latent Tree Models Part III: Learning Algorithms

AAAI 2014 Tutorial Nevin L. Zhang HKUST

75

EAST was too slow on data sets with more than 100 attributes.

CLRG was the fastest, followed by CLNJ, BIN and BI.

Running Times (Seconds)

Page 76: Latent Tree Models Part III: Learning Algorithms

AAAI 2014 Tutorial Nevin L. Zhang HKUST

76

Flat generative models

Non-flat generative models

EAST found the best models on data with dozens of attributes; BI found the best models on data with hundreds of attributes. BIN is the worst. (No RF values for BIN because it produces forests, not trees.)

Model Quality

Page 77: Latent Tree Models Part III: Learning Algorithms

AAAI 2014 Tutorial Nevin L. Zhang HKUST

77

Make the latent variables have different cardinalities in the generative models: 3 for those at levels 1 and 3, and 2 for those at level 2.

All algorithms perform worse than before. EAST and BI still found the best models. CLRG and CLNJ are especially bad on M7CF1.

They assume all latent variables have equal cardinality.

When latent variables have different cardinalities …

Page 78: Latent Tree Models Part III: Learning Algorithms

AAAI 2014 Tutorial Nevin L. Zhang HKUST

78

Data

Evaluation criteria: the BIC score on the training set measures model fit; the loglikelihood on the testing set (hold-out likelihood) measures how well the learned model predicts unseen data.

Real-World Data Sets

Page 79: Latent Tree Models Part III: Learning Algorithms

AAAI 2014 Tutorial Nevin L. Zhang HKUST

79

CLNJ and CLRG not applicable to Coil-42 and Alarm because different attributes have different cardinalities.

EAST did not finish on News-100 and WebKB within 60 hours

CLRG was the fastest, followed by CLNJ, BIN and BI.

Running Times (seconds)

Page 80: Latent Tree Models Part III: Learning Algorithms

AAAI 2014 Tutorial Nevin L. Zhang HKUST

80

EAST and BI found the best models; BIN found the worst. The structures of the models obtained on News-100 by BIN, BI, CLRG, and CLNJ are shown earlier. BI introduced fewer latent variables; its model is more “compact”.

Model Quality

Page 81: Latent Tree Models Part III: Learning Algorithms

AAAI 2014 Tutorial Nevin L. Zhang HKUST

81

EAST found the best models on the data sets it could manage: fewer than 100 observed variables, with hundreds to thousands of observations.

BI found best models on data sets with hundreds of observed variables, and was slower than BIN, CLRG, and CLNJ.

BIN found the worst models.

CLRG and CLNJ are not applicable when the observed variables do NOT have equal cardinality.

Summary

Page 82: Latent Tree Models Part III: Learning Algorithms

AAAI 2014 Tutorial Nevin L. Zhang HKUST

82

Part III: Learning Algorithms

Introduction

Search-based algorithms

Algorithms based on variable clustering

Distance-based algorithms

Empirical comparisons

Spectral methods for parameter estimation

Page 83: Latent Tree Models Part III: Learning Algorithms

AAAI 2014 Tutorial Nevin L. Zhang HKUST

83

The Expectation-Maximization (EM) algorithm (Dempster et al 1977):
Start with an initial guess.
Iterate until termination: improve the current parameter values by maximizing the expected likelihood.
Can be trapped at local maxima.

Spectral methods (Anandkumar et al., 2012):
Get empirical distributions of 2 or 3 observed variables.
Relate them to model parameters and determine the model parameters accordingly.
Need a large sample size for robust estimates.

Parameter Estimation
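To make the EM side concrete, here is a minimal EM sketch for the simplest latent tree model, a latent class model with one latent variable and binary observed variables; it is an illustrative implementation under those assumptions, not the tutorial's code.

```python
import numpy as np

def em_lcm(data, k, iters=100, seed=0):
    """EM for a latent class model: latent H with k states, binary observed columns.

    data: (N, p) array of 0/1 values.
    Returns (pi, theta) with pi[h] = P(H=h) and theta[h, j] = P(X_j=1 | H=h).
    """
    rng = np.random.default_rng(seed)
    n, p = data.shape
    pi = np.full(k, 1.0 / k)                       # initial guess
    theta = rng.uniform(0.25, 0.75, size=(k, p))

    for _ in range(iters):
        # E-step: responsibilities P(H=h | x_i) for every data case.
        log_post = (data @ np.log(theta).T
                    + (1 - data) @ np.log(1 - theta).T
                    + np.log(pi))
        log_post -= log_post.max(axis=1, keepdims=True)   # numerical stabilization
        resp = np.exp(log_post)
        resp /= resp.sum(axis=1, keepdims=True)

        # M-step: re-estimate parameters from expected counts
        # (this maximizes the expected likelihood).
        nk = resp.sum(axis=0)
        pi = (nk + 1e-6) / (n + k * 1e-6)
        theta = (resp.T @ data + 1e-6) / (nk[:, None] + 2e-6)  # tiny smoothing
    return pi, theta

# Toy usage on random binary data.
X = (np.random.default_rng(1).random((200, 4)) < 0.5).astype(float)
print(em_lcm(X, k=2)[0])
```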

Page 84: Latent Tree Models Part III: Learning Algorithms

AAAI 2014 Tutorial Nevin L. Zhang HKUST

84

A Markov random field (MRF) is an undirected graph with potentials: non-negative functions associated with edges and hyper-edges.

To eliminate a variable X in an MRF: multiply all potentials involving X, then obtain a new potential by summing X out of the product. Example: elimination of B and E.

Markov Random Field over Graphs

Page 85: Latent Tree Models Part III: Learning Algorithms

AAAI 2014 Tutorial Nevin L. Zhang HKUST

85

Lower-case letters denote values of variables; e.g., a and b are values of A and B.

We write potentials over pairs of variables as matrices: the value at the a-th row and b-th column is the potential evaluated at (a, b).

Elimination can then be rewritten as matrix multiplication.

Matrix Representation of Potentials and Elimination

Page 86: Latent Tree Models Part III: Learning Algorithms

AAAI 2014 Tutorial Nevin L. Zhang HKUST

86

For a fixed value b of B, we similarly use vector notation: a column vector whose entries are the potential values with B fixed at b. Analogous notations are defined for the other variables and potentials.

The elimination equation can then be rewritten in this matrix-vector form.

Matrix Representation of Potentials and Elimination

Page 87: Latent Tree Models Part III: Learning Algorithms

AAAI 2014 Tutorial Nevin L. Zhang HKUST

87

Equality (Anandkumar et al., 2012). H: latent; A, B, C: observed; all variables have equal cardinality. For a given value b of B,

M_{A|H} · diag(P(B=b|H=1), P(B=b|H=2), …) · M_{A|H}^{-1} = P_{A,b,C} · P_{A,C}^{-1}    (1)

where M_{A|H} = [P(A=a|H=h)], P_{A,b,C} = [P(A=a, B=b, C=c)] for the fixed value b, and P_{A,C} = [P(A=a, C=c)]. The elements of the diagonal matrix are P(B=b|H=1), P(B=b|H=2), …

Notes: the matrices on the right-hand side can be estimated from data. The eigenvalues of the matrix on the left are model parameters: P(B=b|H=1), P(B=b|H=2), …

Spectral Method for Parameter Estimation

Page 88: Latent Tree Models Part III: Learning Algorithms

AAAI 2014 Tutorial Nevin L. Zhang HKUST

88

Parameter estimation:
Get the empirical distributions P(A, B, C) and P(A, C) from data.
For each value b of B, form the matrix on the right-hand side.
Find the eigenvalues of the matrix (spectral method). They are the elements of P_{b|H}, i.e., P(B=b|H=1), P(B=b|H=2), …

Notes: similarly, we can estimate the other parameters. The technique can be used on LTMs with more than one latent variable and where the observed variables have higher cardinality than the latent variables.

Spectral Method for Parameter Estimation
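A numerical illustration of the identity above under the stated assumptions (a single latent variable H with observed children A, B, C, all with the same cardinality): the eigenvalues of P_{A,b,C} P_{A,C}^{-1} recover P(B=b|H=h) up to the ordering of the hidden states. In practice the joints would be empirical estimates, which is why a large sample size is needed for robust results.

```python
import numpy as np

rng = np.random.default_rng(0)
k = 3                                           # common cardinality of H, A, B, C

# Ground-truth parameters of a star model H -> A, B, C (rows sum to 1 per hidden state).
prior = rng.dirichlet(np.ones(k))               # P(H)
A_H = rng.dirichlet(np.ones(k), size=k)         # A_H[h, a] = P(A=a | H=h)
B_H = rng.dirichlet(np.ones(k), size=k)
C_H = rng.dirichlet(np.ones(k), size=k)

# Exact joints (in practice these would be empirical estimates from data).
P_AC = np.einsum("h,ha,hc->ac", prior, A_H, C_H)
P_AbC = [np.einsum("h,ha,hc->ac", prior * B_H[:, b], A_H, C_H) for b in range(k)]

# For each b, the eigenvalues of P_{A,b,C} P_{A,C}^{-1} are {P(B=b | H=h)}.
for b in range(k):
    eig = np.linalg.eigvals(P_AbC[b] @ np.linalg.inv(P_AC))
    print(sorted(np.real(eig)), sorted(B_H[:, b]))   # same multiset, up to state ordering
```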

Page 89: Latent Tree Models Part III: Learning Algorithms

AAAI 2014 Tutorial Nevin L. Zhang HKUST

89

Start with a generalized MRF: some potentials might have negative values.

Eliminate C, and then H’.

Further eliminating H, we get P(A, B, A’). For a value b of B,

Derivation of Equation (1)

Page 90: Latent Tree Models Part III: Learning Algorithms

AAAI 2014 Tutorial Nevin L. Zhang HKUST

90

Start with MRF

Eliminate H and H’ Let B=b

Next, eliminate C.

Putting together, we get

Derivation of Equation (1)

Page 91: Latent Tree Models Part III: Learning Algorithms

AAAI 2014 Tutorial Nevin L. Zhang HKUST

91

In general, we cannot determine P(A, B, C, D) from P(A, B, D), P(B, D), P(B, C, D).

It is possible in LTMs (Parikh et al., 2011).

This can be used to estimate joint probabilities without explicit parameter estimation: get the empirical distributions of 2 or 3 observed variables from data, then calculate the joint probability of a particular value assignment of ALL observed variables from them using the relationship.

Another Equation of Similar Flavor

Page 92: Latent Tree Models Part III: Learning Algorithms

AAAI 2014 Tutorial Nevin L. Zhang HKUST

92

Matrix Representation of Potentials

Observed: A, B, C, D; latent: H, G. All variables have equal cardinality.

Page 93: Latent Tree Models Part III: Learning Algorithms

AAAI 2014 Tutorial Nevin L. Zhang HKUST

93

P(A, B, C, D) is not changed during the transformations.

Transformations

Page 94: Latent Tree Models Part III: Learning Algorithms

AAAI 2014 Tutorial Nevin L. Zhang HKUST

94

Transformations

Eliminate: H’, G’, and {H, G}

From the last MRF, we get

The joint of 4 variables is determined from joints of 3 and 2 observed variables.

Page 95: Latent Tree Models Part III: Learning Algorithms

AAAI 2014 Tutorial Nevin L. Zhang HKUST

95

The technique can be used on trees with more than 4 observed variables, and when the observed variables have higher cardinality than the latent variables.

Notes

Page 96: Latent Tree Models Part III: Learning Algorithms

AAAI 2014 Tutorial Nevin L. Zhang HKUST

96

Equation that relates model parameters to joint distributions of 2 or 3 observed variables

Equation that relates joint distributions of 4 or more observed variables to joint distributions of 2 or 3 observed variables.

A large sample size is needed for robust results.

Summary