


Experiments
We used our optimal frontier breadth-first search algorithm to learn an optimal Bayesian network over the 23-variable data set and compared it to a greedy search, used previously [Yu]. Figures 2 and 3 show the learned networks.

Our Optimal Search Formulation
As suggested by Equation 2, learning an optimal Bayesian network consists of three phases, which we formulate as search problems.

Calculating Scores
Goal. Calculate MDL(X|U), which is the score of X using U as parents
Representation. AD-tree [Moore]
Search Strategy. Depth-first
AD Node. Records with U = u
Vary Node. Records with U = u, X = x
Successor. Instantiate a new X
Storage. Written to disk

Figure. An example AD-tree over two binary variables A and B, showing AD nodes N_u and Vary nodes N_{x,u}.

Optimal Learning with Dynamic Programming
In the case of a ChIP-Seq dataset, we do not know the relationships among the variables. Therefore, we must learn them. Singh and Moore [2005] proposed a dynamic programming algorithm to learn an optimal Bayesian network which minimizes the MDL score. The figure below shows the intuition behind the algorithm. Equation 2 expresses this recursively. Silander and Myllymaki [2006] refined the algorithm by reversing the process.

ChIP-Seq
We can measure the presence of a particular histone modification in cells using chromatin immunoprecipitation followed by high-throughput sequencing (ChIP-Seq). The figure below shows the ChIP-Seq process.

The Epigenetic Code
The central dogma of molecular biology (roughly) states that DNA is transcribed into RNA, which is translated into proteins. Proteins perform many of the functions in the body. We have the same DNA in most of our cells, yet they perform quite different functions. One reason for this differentiation lies in the epigenetic code.

When DNA forms chromosomes, it packs together very tightly into a structure called chromatin.  The DNA coils around a group of eight proteins called histones. Figure 1 summarizes chromatin packaging.

The histone proteins include a tail domain that is susceptible to a large number of post-translational modifications affecting the attraction between histones. The attraction can increase between histones, tightening the surrounding chromatin and suppressing expression. Chromatin can also loosen, increasing expression.

The combination of present modifications determines the effect on the chromatin structure. Some histone modifications affect the likelihood of other modifications.  The epigenetic code [Jaenisch] proposes that the combination of histone modifications, as well as other features such as the presence of transcription factor binding sites, serves as a type of message to present and future generations of cells about regulation.

Selected References
Jaenisch, R. & Bird, A. (2003). "Epigenetic regulation of gene expression: how the genome integrates intrinsic and environmental signals." Nature Genetics 33: 245-254.

Schwarz, G. (1978). "Estimating the Dimension of a Model." The Annals of Statistics 6(2): 461-464.

Barski, A., S. Cuddapah, et al. (2007). "High-resolution profiling of histone methylations in the human genome." Cell 129(4): 823-837.

Singh, A. P. and A. W. Moore (2005). Finding optimal Bayesian networks by dynamic programming (Technical Report). Carnegie Mellon University: 05-106.

Silander, T. and P. Myllymaki (2006). A simple approach for finding the globally optimal Bayesian network structure. Proceedings of the 22nd Annual Conference on Uncertainty in Artificial Intelligence (UAI-06), AUAI Press.

Yu, H., S. Zhu, et al. (2008). "Inferring causal relationships among different histone modifications and gene expression." Genome Research 18(8): 1314-1324.

Yuan, C., Malone, B. & Wu, X. (2011). Learning Optimal Bayesian Networks using A* Search. Proceedings of the 22nd International Joint Conference on Artificial Intelligence.

Seq-ing the Epigenetic Code with Exact Bayesian Network Structure Learning
Brandon M. Malone (1,2), Changhe Yuan (1), Eric Hansen (1) and Susan M. Bridges (1,2)

(1) Department of Computer Science & Engineering, Mississippi State University
(2) Institute for Genomics, Biocomputing and Biotechnology, Mississippi State University


Abstract
The epigenetic code [Jaenisch] hypothesis proposes that patterns of post-translational modifications to the histone core proteins, the presence of transcription factor binding sites and other genomic features influence expression of associated DNA. Chromatin immunoprecipitation (ChIP) followed by high-throughput sequencing (ChIP-Seq) is frequently used to characterize these features at a genome-wide scale. Previous studies [Yu] have used approximation techniques to learn relationships among them. In this work, we apply a novel exact Bayesian network learning algorithm to learn a network structure which identifies regulatory relationships among a set of epigenetic features in human CD4 cells [Barski]. Comparison to networks learned using greedy methods reveals that our network identifies more biologically relevant relationships. By applying an exact, optimal learning algorithm instead of an approximate, greedy algorithm, the relationships we learn are unaffected by sources of uncertainty stemming from the structure learning algorithm.

Bayesian Networks
Representation. Joint probability distribution over a set of variables
Structure. Directed acyclic graph storing conditional dependencies.

• Vertices correspond to variables.• Edges indicate relationships among variables.

Parameters. Conditional probability tables quantifying relationships
Scoring. Minimum Description Length (MDL) [Schwarz], Equation 1
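The MDL score used here can be made concrete with a small sketch. This is a minimal illustration of the standard MDL local score (a conditional log-likelihood term plus a parameter-count penalty), not the authors' implementation; the `data` layout is an assumption.

```python
import math
from collections import Counter

def mdl(data, x, parents):
    """MDL(X | U) = N*H(X | U) + (log2 N / 2) * K(X | U).
    `data` is assumed to be a list of dicts: variable name -> discrete value."""
    n = len(data)
    joint = Counter((tuple(r[p] for p in parents), r[x]) for r in data)
    marg = Counter(tuple(r[p] for p in parents) for r in data)
    # Conditional log-likelihood term N*H(X | U), from the empirical counts.
    ll = -sum(c * math.log2(c / marg[u]) for (u, _), c in joint.items())
    # K(X | U): (|X| - 1) free parameters per parent configuration.
    states = lambda v: len({r[v] for r in data})
    k = (states(x) - 1) * math.prod(states(p) for p in parents)
    return ll + (math.log2(n) / 2) * k
```

Lower scores are better: a parent set pays off only when the drop in the entropy term outweighs the added penalty.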

Acknowledgments
This material is based on work supported by the National Science Foundation under Grants No. NSF EPS-0903787 and NSF IIS-0953723.

The ChIP-Seq process [Illumina]:
1. Raw DNA is sheared into pieces around 200 bp in length.
2. Pieces are immunoprecipitated against an antibody to extract desired pieces.
3. The remaining pieces of DNA are sequenced.
4. The sequenced DNA is mapped back to the genome.

Figure. An example network over the variables Pol II, H3K36me, H3K9ac, H3K27me3, H3K4me3 and Expr, redrawn as leaves are removed one at a time.

The optimal Bayesian network structure is a DAG, so it has a leaf variable with no children.

Remove that leaf and its edges from the network.

The remaining subnetwork is also a DAG, so it has a leaf.

Recursively find optimal leaves until an empty subnetwork remains.
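The leaf-removal intuition above is the dynamic programming recurrence of Equation 2: the best network over a variable set U is found by trying every variable in U as its leaf. A minimal Python sketch (not the authors' implementation; `local_scores` is an assumed table mapping each variable to precomputed scores for every candidate parent set):

```python
from itertools import combinations

def best_score(candidates, x, local_scores):
    """BestScore(candidates, x): best local score for x over all parent subsets."""
    return min(local_scores[x][frozenset(p)]
               for r in range(len(candidates) + 1)
               for p in combinations(candidates, r))

def optimal_network_score(variables, local_scores):
    """Score(U) for every subset U, built up one layer at a time."""
    score = {frozenset(): 0.0}
    for size in range(1, len(variables) + 1):
        for U in map(frozenset, combinations(variables, size)):
            # Try each X in U as the leaf of an optimal subnetwork over U.
            score[U] = min(score[U - {X}] +
                           best_score(U - {X}, X, local_scores)
                           for X in U)
    return score[frozenset(variables)]
```

Storing Score(U) for all 2^n subsets is the memory cost that the frontier breadth-first search reduces.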

Frontier Breadth-first Branch and Bound Search
The order graph has a very regular structure. The successors for a node in layer l always appear in layer l+1. This observation allows us to keep only two layers in memory rather than all n. Furthermore, we can calculate how good a particular node can possibly be. If this is worse than a known bound, we safely disregard it. If optimality is not needed, we can disregard even more nodes to reduce running time.

Data Set and Preprocessing
Raw Data. 30 human ChIP-Seq experiments [Barski]
Cellular Environment. CD4 cells (specialized white blood cells)
Normalization. Linear regression against an IgG control data set
Discretization. Clustered genes using MDL for each experiment
Processed Data Set. A numeric array of length 30 for each gene
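The normalization step is only named, not specified, so the following is a plausible illustration rather than the authors' procedure: regress each experiment's per-gene signal on the IgG control and keep the residuals as the normalized values.

```python
def normalize(signal, igg):
    """Residuals of an ordinary least-squares fit of signal on the IgG control.
    Illustrative assumption only; the poster does not detail this step."""
    n = len(signal)
    mx, my = sum(igg) / n, sum(signal) / n
    sxx = sum((x - mx) ** 2 for x in igg)
    sxy = sum((x - mx) * (y - my) for x, y in zip(igg, signal))
    slope = sxy / sxx
    intercept = my - slope * mx
    # Remove the component of the signal explained by the control.
    return [y - (slope * x + intercept) for x, y in zip(igg, signal)]
```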

Results and Discussion
We focused on the transcription factor binding site for CTCF, known to play a role in the regulation of many elements. We expect CTCF to be an ancestor of important regulatory elements. In our network, CTCF is a parent of the five most highly connected regulatory elements. The approximate algorithm identified four parents and three children of intermediate degree for CTCF.

Identifying Optimal Parent Sets
Goal. Calculate BestScore(U, X), which selects the best parents of X from U
Representation. Sorted score arrays and bit arrays
Search Strategy. On demand
Successor. Use bit operators to find scores consistent with U \ {Y}
Score. scores[firstBit(usable(X))]
Storage. Arrays and bit sets

Learning Optimal Subnetworks
Goal. Calculate Score(U), which is the best subnetwork for variables U
Representation. Order graph [Yuan]
Search Strategy. Breadth-first
Node. Score(U) for some U
Successor. Use X as a leaf of U
Score. Score(U) + BestScore(U, X)
Storage. Hash table or written to disk

Expand(U)
  for each X not in U
    newScore = U.score + BestScore(U, X)
    succ = get(U ∪ {X})
    if newScore < succ.score
      put(U ∪ {X}, newScore)
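The Expand step above can be wrapped into a runnable sketch of the frontier strategy (Python; `best_score(U, X)` is assumed to be the precomputed BestScore lookup, and the branch-and-bound pruning is omitted for brevity):

```python
def frontier_search(variables, best_score):
    """Expand the order graph one layer at a time, keeping only the current
    layer and its successors in memory instead of the whole graph."""
    all_vars = frozenset(variables)
    current = {frozenset(): 0.0}          # layer l, starting from the empty set
    for _ in range(len(all_vars)):
        successors = {}                   # layer l + 1
        for U, u_score in current.items():
            for X in all_vars - U:        # Expand(U): try each X as a leaf
                new_score = u_score + best_score(U, X)
                succ = U | {X}
                if new_score < successors.get(succ, float('inf')):
                    successors[succ] = new_score
        current = successors              # layer l can now be discarded
    return current[all_vars]
```

A node would be pruned here by comparing how good it can possibly be against a known bound, as described under Frontier Breadth-first Branch and Bound Search.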

Figure 1. Chromatin packaging and histones. (http://themedicalbiochemistrypage.org/)

Equations
(1) MDL(X | U) = H(X | U) + (log N / 2) · K(X | U), where H(X | U) is the conditional entropy of X given U over the N records and K(X | U) = (|X| - 1) · Π_{Y ∈ U} |Y| counts the free parameters; the network score is the sum of MDL(X_i | PA_i) over all variables.
(2) Score(U) = min_{X ∈ U} [ Score(U \ {X}) + BestScore(U \ {X}, X) ], where BestScore(U, X) = min_{P ⊆ U} MDL(X | P).

Figure 2. Learned structure with our optimal algorithm.

Figure 3. Learned structure with a standard greedy algorithm.

Conclusions
We presented a frontier breadth-first search algorithm for learning optimal Bayesian networks that improves the memory complexity from O(2^n) to O(C(n, n/2)). Provably optimal solutions allow us to focus on interpreting the results. We learned the optimal structure of a network of epigenetic features; it included more biologically meaningful relationships than structures learned with greedy search.

parents   {1,2}  {2}  {1}  {1,3}  {3}  { }  {2,3}
scores      8    10   11    12    13   15    20
uses[1]     X          X     X
usable      X     X    X     X     X    X     X     (initially)
usable            X                X    X     X     (after usable & ~uses[1])

Calculate and sort all of the scores for a variable.

Mark which scores use each variable (n-1 such bit arrays for each variable).

Initially, a variable can use all scores. The first is optimal.

When X is used as a leaf, find the usable parent scores with (usable & ~uses[X]). The first set bit is optimal.
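The sorted-score and bit-set lookup can be sketched directly from the example table (a minimal illustration, not the authors' code; bit i corresponds to the i-th sorted score):

```python
# Scores for one variable, sorted ascending, with their parent sets.
parents = [{1, 2}, {2}, {1}, {1, 3}, {3}, set(), {2, 3}]
scores = [8, 10, 11, 12, 13, 15, 20]

# uses[Y]: bitmask with bit i set when parents[i] contains Y.
uses = {Y: sum(1 << i for i, p in enumerate(parents) if Y in p)
        for Y in (1, 2, 3)}

def best_score_index(excluded, n=len(parents)):
    """Index of the first usable score after masking out every entry
    that uses an excluded variable: usable & ~uses[Y]."""
    usable = (1 << n) - 1                       # initially all scores usable
    for Y in excluded:
        usable &= ~uses[Y]
    return (usable & -usable).bit_length() - 1  # position of first set bit

best_score_index(())    # -> 0: parents {1,2}, score 8
best_score_index((1,))  # -> 1: parents {2}, score 10
```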

Figure. The order graph over four variables: layer 0 contains ϕ; layer 1 contains {1}, {2}, {3}, {4}; layer 2 contains {1,2}, {1,3}, {1,4}, {2,3}, {2,4}, {3,4}; layer 3 contains {1,2,3}, {1,2,4}, {1,3,4}, {2,3,4}; layer 4 contains {1,2,3,4}.