Compositional Mining of Biological Data
Naren Ramakrishnan, T.M. Murali
Department of Computer Science
Virginia Tech, VA 24061
Motivation
● Increasing categories of functional screens
– Microarrays
– Deletion mutants
– RNAi
Motivation
● Increasing forms of interaction data
– PPI, ChIP-on-chip, genetic, metabolic, ...
Motivation
● Increasing portfolios of pathways
“Chaining” Inferences
● Module Networks
– Regulators “X” regulate genes “Y” under conditions “Z”
(Segal et al., Nature Genetics, 2003)
“Chaining” Inferences
● Connectivity Map
– Perturbagens “X” mimic/suppress disease “Y” through action of genes “Z”
(Lamb et al., Science, 2006)
Are we there yet?
● Different scientists, different perspectives
– Multitude of approaches to data reduction
● What is needed
– SQL : Database querying :: ??? : Database mining
Compositional Data Mining
● A way to compose simpler algorithms ...
– Redescription mining
– Biclustering
● ... to support complex analytical functions
● Not a data mining program
– But a data mining program generator!
Two simple primitives
● Redescription mining
– Mines within a “domain”
● Biclustering
– Mines across two domains
What are redescriptions?
A shift-of-vocabulary or a different way of communicating a given piece of
information.
Redescriptions: Toy Example
Redescription Mining
● Given
– a collection of objects (countries, genes)
– a collection of descriptors
● Find
– subsets that can be defined in at least two ways
An example redescription
(Countries with land area > 3,000,000 square miles) MINUS (Tourist destinations in the Americas)
defines the same set as
(Permanent members of the UN Security Council) AND (Countries with a history of communism)
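The example above can be checked mechanically. The following is a minimal sketch, assuming descriptors are plain Python sets and that candidate expressions are limited to single descriptors, pairwise intersections, and pairwise differences. The function names and the toy memberships are illustrative assumptions, not the talk's dataset or algorithm.

```python
from itertools import combinations, permutations

# Toy memberships, assumed for illustration only.
descriptors = {
    "area > 3M sq mi": {"Russia", "Canada", "USA", "China", "Brazil"},
    "tourist destination in the Americas": {"Canada", "USA", "Brazil", "Mexico"},
    "permanent UNSC member": {"Russia", "China", "USA", "UK", "France"},
    "history of communism": {"Russia", "China", "Cuba", "Vietnam"},
}

def expressions(descs):
    """Yield (label, descriptor names used, object set) for simple set expressions."""
    for name, objs in descs.items():
        yield name, {name}, frozenset(objs)
    for a, b in combinations(descs, 2):
        yield f"({a}) AND ({b})", {a, b}, frozenset(descs[a] & descs[b])
    for a, b in permutations(descs, 2):
        yield f"({a}) MINUS ({b})", {a, b}, frozenset(descs[a] - descs[b])

def exact_redescriptions(descs):
    """Brute force: pairs of expressions over disjoint descriptors with equal, non-empty sets."""
    exprs = [e for e in expressions(descs) if e[2]]
    found = []
    for (l1, used1, s1), (l2, used2, s2) in combinations(exprs, 2):
        if s1 == s2 and not (used1 & used2):
            found.append((l1, l2, set(s1)))
    return found

redescriptions = exact_redescriptions(descriptors)
for left, right, objs in redescriptions:
    print(left, "<=>", right, "defines", sorted(objs))
```

Requiring the two expressions to use disjoint descriptors avoids trivial matches such as a descriptor paired with itself. Brute force is only workable at toy scale.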
More on redescriptions
● Can restrict expressions
– To be of a certain syntactic form
● Can allow approximate redescriptions
– Jaccard's coefficient = |X ∩ Y| / |X ∪ Y|
● Can require statistical significance
– According to set overlap distributions
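A short sketch of both measures follows. The Jaccard function matches the formula on the slide. The significance test is an assumption: the slides only say "set overlap distributions", so this uses a hypergeometric tail (the chance that two random sets of the same sizes overlap at least as much). The names `jaccard` and `overlap_p_value` are hypothetical.

```python
from math import comb

def jaccard(x, y):
    """Jaccard's coefficient |X ∩ Y| / |X ∪ Y|; defined as 0.0 for two empty sets."""
    union = x | y
    return len(x & y) / len(union) if union else 0.0

def overlap_p_value(x, y, universe_size):
    """Assumed null model: Y is a random subset of the universe with |Y| elements.
    Returns P(overlap >= observed) under the hypergeometric distribution."""
    k, nx, ny = len(x & y), len(x), len(y)
    total = comb(universe_size, ny)
    tail = sum(
        comb(nx, i) * comb(universe_size - nx, ny - i)
        for i in range(k, min(nx, ny) + 1)
    )
    return tail / total

x = {1, 2, 3, 4}
y = {3, 4, 5}
score = jaccard(x, y)                # 2 shared out of 5 distinct objects
p = overlap_p_value(x, y, 10)        # universe of 10 objects
```

A redescription would then be kept only if its Jaccard score is above a chosen threshold and its p-value is below a chosen cutoff; both cutoffs are user choices not specified in the slides.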
Applications in Bioinformatics
● (Gene) descriptors galore!
– Genes localized in the mitochondrion
– Genes up-expressed >= 2 fold in heat stress
– Genes encoding proteins in the immunoglobulin complex
– Genes involved in glucose biosynthesis
– Genes handpicked by Prof. Genie
– Genes clustered by your favorite algorithm
Redescriptions: Application to Environmental Stress in Yeast
● Descriptors over approx. 300 yeast ORFs
A redescription
What redescriptions offer
● A way to bridge vocabularies
– Uniformity of modeling descriptors
● Conceptual clustering
– Uses one set of descriptors to define another
● Automatic determination of mutually reinforcing features
– Without explicit training data
Biclustering
Simultaneously identify sets of entities from two domains that exhibit concerted behavior.
Biclusters: Toy Example
More on biclusters
● Can mine approximate biclusters
– “Dense” instead of “all 1s”
● Can require statistical significance
– According to set overlap distributions
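To make the "all 1s" versus "dense" distinction concrete, here is a minimal sketch on a binary matrix whose rows and columns stand for the two domains. The brute-force enumeration is an assumption chosen for readability, not the biclustering algorithm used in CDM, and the matrix is toy data.

```python
from itertools import combinations

def density(matrix, rows, cols):
    """Fraction of 1s in the submatrix selected by rows and cols."""
    cells = [matrix[r][c] for r in rows for c in cols]
    return sum(cells) / len(cells) if cells else 0.0

def exact_biclusters(matrix, min_rows=2, min_cols=2):
    """Enumerate closed all-1s biclusters by brute force over column subsets.
    Exponential in the number of columns, so only suitable for toy inputs."""
    n_cols = len(matrix[0])
    found = set()
    for size in range(min_cols, n_cols + 1):
        for cols in combinations(range(n_cols), size):
            rows = tuple(r for r, row in enumerate(matrix) if all(row[c] for c in cols))
            if len(rows) < min_rows:
                continue
            # Close the column set: keep every column that is all 1s on these rows.
            closed = tuple(c for c in range(n_cols) if all(matrix[r][c] for r in rows))
            found.add((rows, closed))
    return sorted(found)

# Toy binary matrix (illustrative only).
m = [
    [1, 1, 0, 1],
    [1, 1, 0, 0],
    [0, 1, 1, 1],
    [1, 1, 0, 1],
]
biclusters = exact_biclusters(m)
near_miss = density(m, (0, 1, 3), (0, 1, 3))  # one 0 among nine cells
```

An approximate miner would accept a block such as `near_miss` whenever its density clears a chosen threshold, even though it is not an exact bicluster.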
Biclustering: Transcriptional regulation in S. cerevisiae
● Two datasets: Growth of S. cerevisiae cells in rich medium and under exposure to rapamycin
● What are the differences between the activated transcriptional regulatory network under these two conditions?
Computed biclusters
Combinatorial control by RTG3 and GLN3
Recap
● Redescriptions
– Map descriptors within a domain (e.g., genes to genes)
● Biclusters
– Map descriptors across domains (e.g., TFs to genes)
● Key idea: can arbitrarily compose these
– To bridge diverse domains
CDM: Desiccation tolerance in C. elegans
● Question: Find a set of genes to knock down, via RNAi, so as to confer improved desiccation tolerance in C. elegans
● Available data:
– Genes X TFs
– Genes X Phenotypes
CDM: Desiccation tolerance in C. elegans
Two biclusters joined at the Gene interface
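A minimal sketch of this composition, assuming each bicluster is given as a pair of sets and that two biclusters are joined when their gene sets overlap strongly enough. The function name, the Jaccard threshold, and all identifiers such as `tf1` and `g1` are hypothetical placeholders, not results from the talk.

```python
def jaccard(x, y):
    """Jaccard's coefficient of two sets; 0.0 when both are empty."""
    union = x | y
    return len(x & y) / len(union) if union else 0.0

def join_at_genes(tf_gene_biclusters, gene_pheno_biclusters, min_jaccard=0.5):
    """Chain (TFs, genes) biclusters to (genes, phenotypes) biclusters
    whenever the two gene sets overlap with Jaccard >= min_jaccard."""
    chains = []
    for tfs, genes_a in tf_gene_biclusters:
        for genes_b, phenos in gene_pheno_biclusters:
            score = jaccard(genes_a, genes_b)
            if score >= min_jaccard:
                chains.append((tfs, genes_a & genes_b, phenos, score))
    return chains

# Placeholder biclusters (assumed for illustration).
tf_gene = [({"tf1", "tf2"}, {"g1", "g2", "g3"}), ({"tf3"}, {"g7", "g8"})]
gene_pheno = [({"g2", "g3", "g4"}, {"pheno1"}), ({"g9"}, {"pheno2"})]

chains = join_at_genes(tf_gene, gene_pheno)
for tfs, genes, phenos, score in chains:
    print(sorted(tfs), "->", sorted(genes), "->", sorted(phenos), f"(J={score:.2f})")
```

The shared genes in each chain are the candidates that connect a set of regulators to a phenotype, which is the kind of answer the knock-down question asks for.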
CDM: Aging in worms and flies
● Question: analyze similarities in gene expression programs underlying aging in C. elegans and D. melanogaster
● Available data:
– Worm age X Worm genes (exp. values)
– Worm genes X Fly genes (homology)
– Fly age X Fly genes (exp. values)
CDM: Aging in worms and flies
Three biclusters related by two redescriptions
CDM Software Architecture
● Data Model Compiler
● Data Mining Plan Generator
● Visualization Interfaces
Data Model Compiler
● From a specification of
– a database schema (SQL DDLs)
● Automatically generate
– a database schema for CDM
– redescription/biclustering algorithm interfaces
Data Mining Plan Generator
● Compile a request
– for connections between biological domains
● Into
– A composition of redescriptions and biclusters
● Research issues
– Set-based versus tuple-based joins
– Hard versus soft joins
– Use “query flocks” to organize related queries
Visualization Interfaces
● Three-tiered interface
– Bicluster level view
– Set view
– Tuple (individual) view
CDM Software Architecture
Case studies
● Storytelling in PubMed abstracts
● Yeast functional genomics
● Small molecule-gene-disease modeling
Biological storytelling
Study metabolic arrest/recovery across organisms of diverse complexity
Storytelling as CDM
● Compose only redescriptions
– No biclusters
● Do not use set constructions
– Just given descriptors
● Goal:
– Relate dissimilar entities through compositions of similarities
Storytelling is sort of like ...
● the MorphWord puzzle
– PURE
– PORE
– POLE
– POLL
– POOL
– WOOL
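The puzzle can be solved with a breadth-first search in which two words are neighbors when they differ in exactly one letter. This is a minimal sketch of the analogy only; the word list (including the distractors CURE and COOL) and the function names are assumptions for illustration.

```python
from collections import deque

def one_letter_apart(a, b):
    """True when two equal-length words differ in exactly one position."""
    return len(a) == len(b) and sum(x != y for x, y in zip(a, b)) == 1

def shortest_ladder(start, goal, words):
    """Breadth-first search for a shortest chain of one-letter changes."""
    parent = {start: None}
    queue = deque([start])
    while queue:
        word = queue.popleft()
        if word == goal:
            path = []
            while word is not None:
                path.append(word)
                word = parent[word]
            return path[::-1]
        for cand in words:
            if cand not in parent and one_letter_apart(word, cand):
                parent[cand] = word
                queue.append(cand)
    return None  # no chain connects start to goal

words = ["PURE", "PORE", "POLE", "POLL", "POOL", "WOOL", "CURE", "COOL"]
ladder = shortest_ladder("PURE", "WOOL", words)
print(" -> ".join(ladder))
```

In storytelling, words become documents and "one letter apart" becomes "sufficiently similar", so a story is a chain of small similarity steps linking two dissimilar endpoints.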
Example storytelling task
● Connect
– L. Garczarek, N. Ramakrishnan, D. Kumar, R.F. Helm, and M. Potts, Global cross-over points in the genome responses of Synechocystis sp. PCC 6803, to dehydration, UV-irradiation, and other stresses, under communication to BMC Microbiology, 2007.
● To
– M.B. Roth and T. Nystul, Buying time in suspended animation, Scientific American, Vol. 292, No. 6, pages 48-55, June 2005.
Spinning a story ...
● From
– L. Garczarek, N. Ramakrishnan, D. Kumar, R.F. Helm, and M. Potts, Global cross-over points in the genome responses of Synechocystis sp. PCC 6803, to dehydration, UV-irradiation, and other stresses, under communication to BMC Microbiology, 2007.
● To
– L. Schmitt and R. Tampe, Structure and mechanism of ABC transporters, Current Opinion in Structural Biology, Vol. 14, No. 4, pages 426-431, Aug 2004.
Link: CBS Domains
Spinning a story ...
● From
– L. Schmitt and R. Tampe, Structure and mechanism of ABC transporters, Current Opinion in Structural Biology, Vol. 14, No. 4, pages 426-431, Aug 2004.
● To
– J.W. Scott, S.A. Hawley, K.A. Green, M. Anis, G. Stewart, G.A. Scullion, D.G. Norman, and D.G. Hardie, CBS domains form energy-sensing modules whose binding of adenosine ligands is disrupted by disease mutations, Journal of Clinical Investigation, Vol. 113, No. 2, pages 182-184, Jan 2004.
Link: Molecular complexes of CBS Domains
Spinning a story ...
● From
– J.W. Scott, S.A. Hawley, K.A. Green, M. Anis, G. Stewart, G.A. Scullion, D.G. Norman, and D.G. Hardie, CBS domains form energy-sensing modules whose binding of adenosine ligands is disrupted by disease mutations, Journal of Clinical Investigation, Vol. 113, No. 2, pages 182-184, Jan 2004.
● To
– C. Tang, X. Li and J. Du, Hydrogen sulfide as a new endogenous gaseous transmitter in the cardiovascular system, Current Vascular Pharmacology, Vol. 4, No. 1, pages 17-22, Jan 2006.
Link: Ligands bound to CBS Domains
Spinning a story ...
● From
– C. Tang, X. Li and J. Du, Hydrogen sulfide as a new endogenous gaseous transmitter in the cardiovascular system, Current Vascular Pharmacology, Vol. 4, No. 1, pages 17-22, Jan 2006.
● To
– M.B. Roth and T. Nystul, Buying time in suspended animation, Scientific American, Vol. 292, No. 6, pages 48-55, June 2005.
Link: Hydrogen sulphide
Storytelling on System X
● Distributed indexing and similarity search
● Bidirectional pursuing of “leads”
● Simulations for significance testing
Stories about storytelling
Biological storytelling
● Given
– 18 extra-cellular molecules: CD38, CXCL1, IFN-gamma, IGF-1, IL-13, IL-1beta, IL-24, IL-6, IL-8, MMP, etc.
– 1 intra-cellular molecule: (poly)ADP-ribose
● Find
– Chains of redescriptions between abstracts discussing these molecules
Biological storytelling
● Document seed set
– Retrieve 203,872 documents; remove review papers
– Label 4757 documents with molecules (4737+20)
● Document modeling for similarity search
– 96,218 terms after stemming & stopword removal
– Weighted TFIDF (for doc length)
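A minimal sketch of TF-IDF document vectors with cosine similarity follows. It assumes one common weighting (term frequency divided by document length, times log inverse document frequency, then unit-length normalization); the slides do not give the exact weighting used, and the token lists are toy placeholders rather than PubMed data.

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    """docs: list of token lists, assumed already stemmed with stopwords removed.
    Returns one unit-length {term: weight} dict per document."""
    n = len(docs)
    df = Counter(term for doc in docs for term in set(doc))
    vectors = []
    for doc in docs:
        tf = Counter(doc)
        # Dividing by len(doc) compensates for document length.
        vec = {t: (c / len(doc)) * math.log(n / df[t]) for t, c in tf.items()}
        norm = math.sqrt(sum(w * w for w in vec.values()))
        vectors.append({t: w / norm for t, w in vec.items()} if norm else vec)
    return vectors

def cosine(u, v):
    """Cosine similarity of two unit-length sparse vectors."""
    return sum(w * v.get(t, 0.0) for t, w in u.items())

# Toy token lists (illustrative only).
docs = [
    ["cbs", "domain", "ligand"],
    ["cbs", "domain", "energy"],
    ["hydrogen", "sulfide", "transmitter"],
]
v0, v1, v2 = tfidf_vectors(docs)
sim_related = cosine(v0, v1)
sim_unrelated = cosine(v0, v2)
```

Terms that occur in every document get weight zero under this scheme, so similarity is driven by the rarer, more informative terms.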
Biological storytelling
● Storytelling algorithm tradeoffs
– Higher similarity versus shorter stories
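The tradeoff can be seen in a small experiment: raising the minimum similarity required between consecutive documents removes shortcuts, so the shortest story gets longer or disappears. This sketch assumes documents are term sets compared with Jaccard's coefficient and searched breadth-first; the document ids, terms, and function names are hypothetical.

```python
from collections import deque

def jaccard(x, y):
    """Jaccard's coefficient of two sets; 0.0 when both are empty."""
    union = x | y
    return len(x & y) / len(union) if union else 0.0

def shortest_story(docs, start, goal, min_sim):
    """docs: {doc_id: set of terms}. Returns a shortest chain of doc ids in which
    consecutive documents have similarity >= min_sim, or None if none exists."""
    parent = {start: None}
    queue = deque([start])
    while queue:
        cur = queue.popleft()
        if cur == goal:
            path = []
            while cur is not None:
                path.append(cur)
                cur = parent[cur]
            return path[::-1]
        for other in docs:
            if other not in parent and jaccard(docs[cur], docs[other]) >= min_sim:
                parent[other] = cur
                queue.append(other)
    return None

# Toy documents (illustrative only).
docs = {
    "A": {"a", "b", "c", "d"},
    "B": {"b", "c", "d", "e"},
    "C": {"d", "e", "f", "g"},
}
loose = shortest_story(docs, "A", "C", min_sim=0.1)    # weak direct link allowed
strict = shortest_story(docs, "A", "C", min_sim=0.3)   # must go through B
too_strict = shortest_story(docs, "A", "C", min_sim=0.5)
```

Here the loose threshold yields a two-document story, the stricter one a three-document story, and the strictest finds no story at all.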
Biological storytelling
● Basic statistics
– Most popular hub: PubMed ID 8064725, “Altered poly(ADP-ribose) metabolism in family members of patients with systemic lupus erythematosus”
– Second most popular hub: PubMed ID 2684169, “Two types of antibodies inhibiting interleukin-2 production by normal lymphocytes in patients with systemic lupus erythematosus”
Biological storytelling
● Frequent episode mining
– Mining novellas
– e.g., PubMed ID 16430457 -> ... -> 1386861
● Story compression
– Reduce novellas to single symbol
– Identify and remove frequently reused subpaths
● Story summarization
– Tile sentences using sentence cohesion check
The StoryGrapher
Available for demo/download at https://bioinformatics.cs.vt.edu/storytelling/
Sentence-tiled story
Yet to do...
● Model
– Cell types and cell lines
● Account for
– “artificial enrichment” for certain methodologies
● Address
– Author bias
– Messiness of information integration
Status of CDM
● Implemented using open source software
– Parallel implementations of key algorithms and significance calculations
● Many instantiations underway
– VIGEN (Virginia Center for Genomics)
– VBI (Virginia Bioinformatics Institute)
● We welcome collaborations!
Acknowledgements
● BIO faculty
– Rich Helm
– Malcolm Potts
● CS students
– Joe Gresock
– Deept Kumar
– Greg Grothaus
– Srinivas Santhanam
– Mahima Gopalakrishnan
– Anthony McNevin
Thank you!
● Contact info:
– Naren Ramakrishnan, [email protected], http://people.cs.vt.edu/~naren
– T.M. Murali, [email protected], http://people.cs.vt.edu/~murali