Big Data Network Genomics
Network Inference and Perturbation to Study Chemical-Mediated Cancer Induction
Stefano Monti
Section of Computational BioMedicine, Boston University School of Medicine
Biostatistics, BUSPH
Bioinformatics Program, BU
Graduate Program in Genetics & Genomics, BU
Broad Institute of MIT & Harvard
Abstract
Development and application of novel methods of network inference and differential analysis from multiple genomic data types toward the elucidation of a chemical's mechanism(s) of cancer induction
Abstract (generalization)
Development and application of novel methods of network inference and differential analysis from high-dimensional data types toward the elucidation of functionally relevant modules
Domain-specific terms generalized: “multiple genomic data types” becomes “high-dimensional data types”; “a chemical's mechanism(s) of cancer induction” becomes “functionally relevant modules”
The Motivating Problem
Goals: Development of “Carcinogenicity Biomarker(s)”
Chemical → Carcinogenicity Prediction Model → Carcinogen / Non-carcinogen
Understand why: pathways affected, driver alterations, biomarkers, …
Manuscript under Review
[Figure: gene-by-sample expression matrix (gene1, gene2, …, gene7, …), with samples grouped into Non-carcinogens and Carcinogens]
To generate this ‘matrix’, 100,000s of experiments need to be performed and 1,000s of controls generated
In Progress: high-throughput data generation
384-well plate
100,000s profiles
Phase I: 24 plates (liver and lung), ~200 compounds, ~10,000 profiles
Future plans …
Phase II: more tissue types (breast, prostate, etc.), more compounds (~1,500), mixtures, 100,000s of profiles
Phase III: iPSC-derived cells & 3D cultures, “personalized exposure” models
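As a quick consistency check on the Phase I numbers, the arithmetic can be sketched as follows. It assumes every well of every 384-well plate yields one profile, which the slides do not state; the per-compound figure is only an average and says nothing about how wells are actually split among doses, replicates, tissues, and controls.

```python
# Back-of-the-envelope check of the Phase I scale (assumption: one profile per well).
wells_per_plate = 384
phase1_plates = 24
phase1_compounds = 200  # "~200 compounds" on the slide

phase1_wells = wells_per_plate * phase1_plates           # 9216, i.e. roughly 10,000 profiles
wells_per_compound = phase1_wells / phase1_compounds     # average wells available per compound

print(f"Phase I wells: {phase1_wells}")
print(f"Average wells per compound: {wells_per_compound:.1f}")
```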
Generalization of the Motivating Problem
Comparison of a control state to multiple perturbation states: standard approaches of gene-based differential analysis might miss salient (aggregate) differences
High-dimensional data (1000s of ‘features’): usually representable as 2D [10K x 1K] matrices
Large sample size for the ‘control state’: ≥1000 observations
Small sample size for each of the ‘perturbation states’: ~10-100 observations/perturbation
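The point about gene-based analysis missing aggregate differences can be illustrated with a small simulation. This is a toy sketch on synthetic data, not the method used in the project: the sample sizes mirror the slide (about 1,000 controls, tens of perturbed samples), while the gene count, the latent-factor model, and the helper names `welch_t` and `mean_abs_offdiag_corr` are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
n_control, n_perturbed, n_genes = 1000, 50, 20

# Control state: all genes load on one latent factor, so they are strongly co-expressed.
latent = rng.normal(size=(n_control, 1))
control = 0.9 * latent + np.sqrt(1 - 0.9**2) * rng.normal(size=(n_control, n_genes))

# Perturbed state: same per-gene mean (0) and variance (1), but no co-expression.
perturbed = rng.normal(size=(n_perturbed, n_genes))

def welch_t(a, b):
    """Per-gene Welch t-statistic between two sample-by-gene matrices."""
    se = np.sqrt(a.var(axis=0, ddof=1) / len(a) + b.var(axis=0, ddof=1) / len(b))
    return (a.mean(axis=0) - b.mean(axis=0)) / se

def mean_abs_offdiag_corr(x):
    """Average absolute pairwise correlation between genes (columns of x)."""
    c = np.corrcoef(x, rowvar=False)
    off_diagonal = c[~np.eye(c.shape[0], dtype=bool)]
    return np.abs(off_diagonal).mean()

t_stats = welch_t(control, perturbed)
conn_control = mean_abs_offdiag_corr(control)
conn_perturbed = mean_abs_offdiag_corr(perturbed)

print("largest |t| across genes:", np.abs(t_stats).max())
print("mean |correlation|, control:", conn_control)
print("mean |correlation|, perturbed:", conn_perturbed)
```

In this setup no single gene changes its mean or variance, so the gene-wise statistics stay unremarkable, while the co-expression structure collapses. That kind of difference is what a network-level comparison is meant to capture.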
Generalization of the Motivating Problem: an example
The Connectivity Map/LINCS project: expression profiling of chemical/genetic perturbations
• >10,000 compounds (most FDA approved drugs)
• ~5,000 genetic perturbations (RNAi, CRISPR)
• 18 cell types, multiple doses, time-points
> 1,000,000 profiles
Main Goal: Drug Discovery
Approach Overview
[Figure: a Wild-Type Network is built (network construction), split into annotated modules Module 1, Module 2, …, Module p (module identification), and each module is compared across Compound 1, Compound 2, …, Compound n in terms of loss/gain of connectivity (module/network comparison)]
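The modules-by-compounds view of connectivity loss and gain can be sketched as below. This is a simplified illustration under stated assumptions: "connectivity" is taken to be the mean absolute pairwise correlation within a module, modules are given as lists of gene indices, and the names `module_connectivity`, `connectivity_change`, and `simulate` are hypothetical. The slides do not specify the actual connectivity measure.

```python
import numpy as np

def module_connectivity(x, genes):
    """Mean absolute pairwise correlation among the given gene columns of x."""
    c = np.corrcoef(x[:, genes], rowvar=False)
    upper = np.triu_indices(len(genes), k=1)
    return np.abs(c[upper]).mean()

def connectivity_change(control, perturbations, modules):
    """Return compound names and a modules-by-compounds matrix of differences.

    Positive entries mean gain of connectivity relative to control, negative mean loss.
    """
    names = list(perturbations)
    delta = np.zeros((len(modules), len(names)))
    for i, genes in enumerate(modules):
        baseline = module_connectivity(control, genes)
        for j, name in enumerate(names):
            delta[i, j] = module_connectivity(perturbations[name], genes) - baseline
    return names, delta

# Toy data: two modules of five genes each (synthetic, for illustration only).
rng = np.random.default_rng(1)
modules = [np.arange(0, 5), np.arange(5, 10)]

def simulate(n, strengths):
    """Simulate n samples; strengths[i] sets the co-expression strength of module i."""
    blocks = []
    for s in strengths:
        latent = rng.normal(size=(n, 1))
        blocks.append(s * latent + np.sqrt(1 - s**2) * rng.normal(size=(n, 5)))
    return np.hstack(blocks)

control = simulate(1000, [0.9, 0.9])
perturbations = {
    "compound_1": simulate(100, [0.9, 0.0]),  # second module loses connectivity
    "compound_2": simulate(100, [0.9, 0.9]),  # no change in either module
}
names, delta = connectivity_change(control, perturbations, modules)
print(names)
print(np.round(delta, 2))
```

Rows of `delta` correspond to modules and columns to compounds, matching the layout of the overview figure.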
Approach Details: networks’ construction
Correlation networks: clustering vs. topology-based ‘module’ identification
Gaussian models: inverse covariance matrix, partial correlations
Correlation networks + “scale-free transformations”: mostly for comparison w/ existing methods
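Two of the construction options above can be sketched in a few lines. The first function derives partial correlations from the inverse covariance (precision) matrix of a Gaussian model; the small ridge term is an assumption added here for numerical stability. The second applies a power transformation to absolute correlations, which is one common reading of a "scale-free transformation" (soft thresholding, as in weighted co-expression analysis); whether that is the transformation the authors intend is an assumption, as are the function names and the default `beta=6`.

```python
import numpy as np

def partial_correlations(x, ridge=1e-3):
    """Partial correlation matrix from a ridge-regularized inverse covariance."""
    cov = np.cov(x, rowvar=False)
    cov = cov + ridge * np.eye(cov.shape[0])  # assumption: small ridge keeps cov invertible
    precision = np.linalg.inv(cov)
    d = np.sqrt(np.diag(precision))
    pcor = -precision / np.outer(d, d)        # rho_ij = -p_ij / sqrt(p_ii * p_jj)
    np.fill_diagonal(pcor, 1.0)
    return pcor

def soft_threshold_adjacency(x, beta=6):
    """Weighted correlation network: adjacency a_ij = |r_ij| ** beta, zero diagonal."""
    r = np.corrcoef(x, rowvar=False)
    adj = np.abs(r) ** beta
    np.fill_diagonal(adj, 0.0)
    return adj

# Toy chain g0 -> g1 -> g2: g0 and g2 are correlated only through g1.
rng = np.random.default_rng(2)
n = 5000
g0 = rng.normal(size=n)
g1 = 0.8 * g0 + 0.6 * rng.normal(size=n)
g2 = 0.8 * g1 + 0.6 * rng.normal(size=n)
x = np.column_stack([g0, g1, g2])

marginal = np.corrcoef(x, rowvar=False)
partial = partial_correlations(x)
adjacency = soft_threshold_adjacency(x)

print("marginal corr(g0, g2):", round(marginal[0, 2], 2))
print("partial corr(g0, g2 | g1):", round(partial[0, 2], 2))
```

In the toy chain the marginal correlation between `g0` and `g2` is large, but the partial correlation is close to zero, which is why the Gaussian model yields a sparser network of direct dependencies than a plain correlation network.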
Approach Details: networks’ comparison
Covariance matrices comparison
Probabilistic model selection: Bayes factor
Network topology: Diffusion State Distance (M. Crovella) and related
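For the topology-based option, a minimal sketch of a k-step Diffusion State Distance is given below. It assumes an undirected, connected, unweighted graph supplied as an adjacency matrix, and uses the finite-step definition: each node is described by the expected number of visits to every node by a k-step random walk started from it, and two nodes are compared by the L1 distance between those vectors. The function name and the default `k=10` are illustrative choices, not taken from the slides.

```python
import numpy as np

def diffusion_state_distance(adj, k=10):
    """Pairwise k-step Diffusion State Distance for an undirected, connected graph.

    he[u, i] is the expected number of visits to node i by a k-step random walk
    started at u (step 0 included); DSD(u, v) is the L1 distance between rows u and v.
    """
    adj = np.asarray(adj, dtype=float)
    transition = adj / adj.sum(axis=1, keepdims=True)  # row-stochastic random-walk matrix
    n = adj.shape[0]
    he = np.eye(n)    # step 0: each walk sits at its start node
    step = np.eye(n)
    for _ in range(k):
        step = step @ transition
        he += step
    diff = he[:, None, :] - he[None, :, :]
    return np.abs(diff).sum(axis=2)

# Toy graph: node 0 is a hub linked to 1, 2, 3; node 3 is also linked to 4.
adj = np.array([
    [0, 1, 1, 1, 0],
    [1, 0, 0, 0, 0],
    [1, 0, 0, 0, 0],
    [1, 0, 0, 0, 1],
    [0, 0, 0, 1, 0],
])
dsd = diffusion_state_distance(adj, k=10)
print(np.round(dsd, 2))
```

Nodes 1 and 2 have the same single neighbor, so their walks coincide after the first step and their distance is exactly 2, contributed only by the differing start positions.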
The Data
Gene expression profiles: networks’ inference
Protein-protein interaction: networks’ priors
“Cell painting” profiles: networks’ annotation
100K samples
10K features (genes)
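To give a sense of what these dimensions imply computationally, the sketch below estimates dense storage sizes. The choice of 4-byte (float32) values is an assumption; the slides give only the sample and feature counts.

```python
# Rough storage estimates for the stated data dimensions (assumption: dense float32).
n_samples = 100_000
n_genes = 10_000
bytes_per_value = 4

expression_gb = n_samples * n_genes * bytes_per_value / 1e9   # samples-by-genes matrix
correlation_gb = n_genes * n_genes * bytes_per_value / 1e9    # genes-by-genes matrix
n_gene_pairs = n_genes * (n_genes - 1) // 2                   # candidate network edges

print(f"expression matrix: {expression_gb:.1f} GB")
print(f"gene-gene correlation matrix: {correlation_gb:.1f} GB")
print(f"distinct gene pairs: {n_gene_pairs:,}")
```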
Deliverables
Computational Toolbox: network inference and visualization; module (i.e., sub-network) identification/comparison; network/module-based clustering/annotation
Analysis and cataloguing of chemical perturbations: chemicals’ putative mechanisms of action; interpretable carcinogenicity predictor(s)
A sandbox for researchers to develop and test new methods: richly annotated multi-type data; domain expertise to evaluate relevance/usefulness
Preliminary results for pursuit of further funding
The Team
Stefano Monti, Ph.D. (Assoc. Professor): Computational Biology, Cancer Genomics, Machine Learning (Bayesian Networks)
Paola Sebastiani, Ph.D. (Professor): Biostatistics, Genetics/Genomics, Bayesian Graphical Models
Mark Crovella, Ph.D. (Professor): Computer Science, Network Analysis
Simon Kasif (Professor): Computational Biology, Systems Biology, Machine Learning
Francesca Mulas, Ph.D. (Post-doctoral Fellow): Computational Biology/Bioinformatics, Computer Science
Daniel Gusenleitner, M.S. (Ph.D. student): Bioinformatics, Computer Science, Machine Learning
“Background” Team
BU-SRP: David Ozonoff, Basra Komal, Heather Henry (NIEHS)
Evans Foundation - ARC: Katya Ravid, Robin MacDonald
NTP/NIEHS: Scott Auerbach, Ray Tice
Broad Institute: Aravind Subramanian, Xiaodong Lu, Todd Golub, cMAP team
BU CBM/Bioinformatics/SPH: David Sherr (co-PI), Daniel Gusenleitner, Jessalyn Ubellacker, Tisha Meila, Harold Gomez, Yuxiang Tan, Liye Zhang, Elizabeth Moses, Teresa Wang, Marc Lenburg, Avi Spira
The End