
Graduate Center/City University of New York
University of Helsinki

FINDING OPTIMAL BAYESIAN NETWORK STRUCTURES WITH CONSTRAINTS LEARNED FROM DATA
Xiannian Fan, Brandon Malone and Changhe Yuan

Several recent algorithms for learning Bayesian network structures first calculate potentially optimal parent sets (POPS) for all variables and then use various optimization techniques to find a set of POPS, one for each variable, that constitutes an optimal network structure. This paper makes the observation that there is useful information implicit in the POPS. Specifically, the POPS of a variable constrain its parent candidates. Moreover, the parent candidates of all variables together give a directed cyclic graph, which often decomposes into a set of strongly connected components (SCCs). Each SCC corresponds to a smaller subproblem which can be solved independently of the others. Our results show that solving the constrained subproblems significantly improves the efficiency and scalability of heuristic search-based structure learning algorithms. Further, we show that by considering only the top p POPS of each variable, we quickly find provably very high quality networks for large datasets.

Bayesian Network Structure Learning with Graph Search

Selected References
1. de Campos, C. P. and Ji, Q. Efficient learning of Bayesian networks using constraints. JMLR, 2011.
2. Tian, J. A branch-and-bound algorithm for MDL learning Bayesian networks. In UAI '00, 2000.
3. Yuan, C.; Malone, B.; and Wu, X. Learning optimal Bayesian networks using A* search. In IJCAI '11, 2011.
4. Yuan, C. and Malone, B. An improved admissible heuristic for learning optimal Bayesian networks. In UAI '12, 2012.

Acknowledgements
This research was supported by NSF grants IIS-0953723, IIS-1219114 and the Academy of Finland (COIN, 251170).

Software: http://url.cs.qc.cuny.edu/software/URLearning.html

Bayesian Network Structure Learning
Representation. Joint probability distribution over a set of variables.
Structure. DAG storing conditional dependencies.

• Vertices correspond to variables.
• Edges indicate relationships among variables.

Parameters. Conditional probability distributions.
Learning. Find the network with the minimal score for complete dataset D. We often omit D for brevity.

The score decomposes over the variables: Score(network) = Σ_i Score(X_i | PA_i), where Score(X_i | PA_i) is called the local score.
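To make the decomposition concrete, here is a minimal sketch (the local scores and structure are hypothetical, not taken from the poster) that evaluates a network by summing the local scores of the chosen parent sets:

```python
# Sketch: network score as the sum of local scores, assuming a
# precomputed local-score table (hypothetical values).
local_score = {
    ("A", frozenset()): 10.2,
    ("B", frozenset({"A"})): 7.5,
    ("C", frozenset({"A", "B"})): 12.1,
}

# A candidate structure: variable -> chosen parent set.
structure = {
    "A": frozenset(),
    "B": frozenset({"A"}),
    "C": frozenset({"A", "B"}),
}

def total_score(structure, local_score):
    """Score(network) = sum of Score(X_i | PA_i)."""
    return sum(local_score[(x, parents)] for x, parents in structure.items())

print(total_score(structure, local_score))  # 29.8
```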

Graph Search Formulation
The dynamic programming formulation can be visualized as a search through an order graph.

The Order Graph
Calculation. Score(U), best subnetwork for U.
Node. Score(U) for U.
Successor. Add X as a leaf to U.
Path. Induces an ordering on variables.
Size. 2^n nodes, one for each subset.

Admissible Heuristic Search Formulation
Start Node. Top node, {}.
Goal Node. Bottom node, V.
Shortest Path. Corresponds to optimal structure.
g(U). Score(U).
h(U). Relax acyclicity.
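As a minimal sketch (assuming a POPS table pops[x] of (parent_set, score) pairs; hypothetical names, not the authors' code), the relaxed heuristic lets every variable outside U take its best parent set while ignoring acyclicity, so it never overestimates the remaining cost:

```python
def h(U, variables, pops):
    """Relaxed-acyclicity heuristic: each variable not yet in U takes its
    best-scoring POPS, regardless of whether the result stays acyclic."""
    return sum(
        min(score for _parents, score in pops[x])
        for x in variables - U
    )
```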

Dynamic Programming
Intuition. All DAGs must have a leaf. Optimal networks for a single variable are trivial. Recursively add new leaves and select optimal parents until all variables have been added. All orderings have to be considered.

Recurrences.

Begin with a single variable.

Pick one variable as leaf. Find its optimal parents.

Pick another leaf. Find its optimal parents from the variables already added.

Continue picking leaves and finding optimal parents until all variables have been added; a sketch of this recurrence follows.
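A minimal sketch of this recurrence over the order graph, under the same assumed POPS table as above (an illustration of the recurrence, not the authors' implementation):

```python
from itertools import combinations

def best_score_as_leaf(x, allowed, pops):
    """Best local score for x when its parents must be drawn from `allowed`.
    Assumes each variable's POPS include the empty parent set, so the min exists."""
    return min(score for parents, score in pops[x] if parents <= allowed)

def dp_order_graph(variables, pops):
    """Score(U) = min over X in U of [ Score(U minus {X}) + BestScore(X, U minus {X}) ]."""
    variables = frozenset(variables)
    score = {frozenset(): 0.0}
    for size in range(1, len(variables) + 1):
        for U in map(frozenset, combinations(variables, size)):
            score[U] = min(
                score[U - {x}] + best_score_as_leaf(x, U - {x}, pops)
                for x in U
            )
    return score[variables]  # score of an optimal network over all variables
```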

Potentially Optimal Parent Sets (POPS)
While the local scores are defined for all 2^(n-1) possible parent sets for each variable, this number is greatly reduced by pruning parent sets that are provably never optimal (Tian 2000; de Campos and Ji 2011).

We refer to the above pruning as lossless score pruning because it is guaranteed not to remove the optimal network from consideration. We refer to the scores remaining after pruning as potentially optimal parent sets (POPS). Denote the set of POPS for variable X_i as P_i.
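A minimal sketch of this lossless pruning rule, assuming a full table of local scores (lower is better; hypothetical data): a parent set is discarded whenever one of its proper subsets scores at least as well.

```python
def prune_to_pops(score_table):
    """score_table maps frozenset(parents) -> local score (lower is better).
    Keep a parent set only if no proper subset scores at least as well."""
    pops = {}
    for parents, s in score_table.items():
        dominated = any(
            other < parents and score_table[other] <= s  # proper subset, equal or better
            for other in score_table
        )
        if not dominated:
            pops[parents] = s
    return pops
```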

Table 1: The POPS for a six-variable problem. The ith row shows P_i.

POPS Constraints Pruning

Observation: The constraints seem to help benchmark network datasets more than UCI datasets.
Observation: Even for very small values of p, the top-p POPS constraint results in networks provably very close to the globally optimal solution.

Motivation: Not all variables can possibly be ancestors of the others.
Previous technique: Consider all variable orderings anyway.
Shortcomings: Exponential increase in the number of paths in the search space.
Contribution: Construct the parent relation graph and find its SCCs; divide the problem into independent subproblems based on the SCCs.

Figure 2: The parent relation graph.

Figure 3: Order graphs after applying the POPS constraints. (a) The order graph after applying the POPS constraints once. (b) The order graph after recursively applying the POPS constraints on the second subproblem.

We collect all potential parent–child relations from the POPS to obtain the parent relation graph.
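A minimal sketch of this construction, assuming the same POPS table format as above and using networkx for the SCC computation (both are illustrative choices, not the poster's software):

```python
import networkx as nx

def parent_relation_graph(pops):
    """Add an edge parent -> child whenever parent appears in some POPS of child."""
    g = nx.DiGraph()
    g.add_nodes_from(pops)
    for child, parent_sets in pops.items():
        for parents, _score in parent_sets:
            for parent in parents:
                g.add_edge(parent, child)
    return g

def pops_subproblems(pops):
    """Each SCC of the parent relation graph is an independent subproblem,
    listed in the topological order of the component graph."""
    g = parent_relation_graph(pops)
    condensation = nx.condensation(g)  # component graph (a DAG over the SCCs)
    return [set(condensation.nodes[c]["members"])
            for c in nx.topological_sort(condensation)]
```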

Recursive POPS Constraints Pruning
Selecting the parents for one of the variables has the effect of removing that variable from the parent relation graph. After removing it, the remaining variables may split into smaller SCCs, and the resulting smaller subproblems can be solved recursively. Figure 3(b) shows an example.

Top-p POPS Constraints
Motivation: Despite POPS constraints pruning, some problems remain difficult.
Previous technique: AWA* has been used to find bounded-optimality solutions.
Shortcomings: AWA* does not give any explicit tradeoff between complexity and optimality.
Contribution: Lossy score pruning gives a more principled way to control the tradeoff; we create the parent relation graph using only the best p POPS of each variable and discard POPS not compatible with this graph.

Rather than constructing the parent relation graph by aggregating all of the POPS, we can instead create the graph by considering only the best p POPS for each variable.
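A minimal sketch under the same assumptions: keep only each variable's best p POPS when building the parent relation graph, then drop POPS that violate the resulting ancestor ordering (pops_subproblems refers to the sketch above; all names are illustrative, not the authors' code):

```python
def top_p_pops(pops, p):
    """Keep only the best p (lowest-score) parent sets per variable."""
    return {x: sorted(ps, key=lambda item: item[1])[:p] for x, ps in pops.items()}

def apply_top_p_constraint(pops, p):
    """Build the parent relation graph from the top-p POPS, then discard any
    POPS whose parents are not allowed ancestors under the component ordering."""
    components = pops_subproblems(top_p_pops(pops, p))
    # Allowed ancestors of x: variables in x's component or in an earlier one.
    allowed, seen = {}, set()
    for comp in components:
        seen |= comp
        for x in comp:
            allowed[x] = seen - {x}
    return {
        x: [(parents, s) for parents, s in ps if parents <= allowed[x]]
        for x, ps in pops.items()
    }
```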

Experimental Results

We extract the strongly connected components (SCCs) of the parent relation graph. The SCCs form the component graph (Cormen et al. 2001), giving the ancestor constraints (which we call the POPS constraints). Each SCC corresponds to a smaller subproblem which can be solved independently of the others.

Figure 4: Running Time (in seconds) of A* with three different constraint settings: No Constraint, POPS Constraint and Recursive POPS Constraint.

Figure 5: The behavior of the Hailfinder dataset under the top-p POPS constraint as p varies.

Figure 1: Order Graph for 4 variables

Data shown in Figure 4 (running time in seconds):

Dataset  | No Constraint | POPS Constraint | Recursive POPS Constraint
Autos    | 46.62         | 20.93           | 26.76
Soybean  | FAIL          | 435.65          | 511.55
Alarm    | 76.51         | 6.47            | 4.06
Barley   | FAIL          | 2.51            | 1.28