
Big Data Network Genomics

Network Inference and Perturbation to Study Chemical-Mediated Cancer Induction

Stefano Monti

Section of Computational BioMedicine, Boston University School of Medicine

Biostatistics, BUSPH

Bioinformatics Program, BU

Graduate Program in Genetics & Genomics, BU

Broad Institute of MIT & Harvard


Abstract

Development and application of novel methods of network inference and differential analysis from multiple genomic data types toward the elucidation of a chemical's mechanism(s) of cancer induction

Abstract (generalization)

Development and application of novel methods of network inference and differential analysis from high-dimensional data types toward the elucidation of functionally relevant modules

Domain-specific terms: “high-dimensional data types”, “functionally relevant modules”

The Motivating Problem

Goals: Development of “Carcinogenicity Biomarker(s)”

Carcinogenicity Prediction Model

Chemical

Carcinogen

Non-carcinogen

Pathways affected; driver alterations; biomarkers; …

Understand Why

Manuscript under Review

Goals: Development of “Carcinogenicity Biomarker(s)”

Carcinogenicity Prediction Model

Chemical

Carcinogen

Non-carcinogen

Non-carcinogens Carcinogens

gene1 gene2 gene3 gene4 gene5 gene6 gene7

…

To generate this ‘matrix’, 100,000s of experiments need to be performed and 1,000s of controls generated

In Progress: high-throughput data generation

384-well plate

100,000s profiles

Phase I: 24 plates (liver and lung); ~200 compounds; ~10,000 profiles

Future plans …

Phase II: more tissue types (breast, prostate, etc.); more compounds (~1,500); mixtures; 100,000s profiles

Phase III: iPSC-derived cells & 3D cultures; “personalized exposure” models

Generalization of the Motivating Problem

Comparison of a control state to multiple perturbation states

Standard approaches of gene-based differential analysis might miss salient (aggregate) differences

High-dimensional data (1000s of ‘features’): usually representable as 2D [10K x 1K] matrices

Large sample size for the ‘control state’: ≥1000 observations

Small sample size for each of the ‘perturbation states’: ~10-100 observations/perturbation
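To make the point about gene-based differential analysis concrete, here is a minimal sketch on simulated data (not the project's data). It assumes a toy two-gene setting in which a perturbation removes the correlation between the genes without shifting either mean; the sample sizes mirror the large-control / small-perturbation design described above. All names (`simulate`, `welch_t`) are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

def simulate(n, rho):
    """Draw n samples of two genes with zero mean, unit variance, correlation rho."""
    cov = np.array([[1.0, rho], [rho, 1.0]])
    return rng.multivariate_normal(mean=[0.0, 0.0], cov=cov, size=n)

def welch_t(a, b):
    """Per-gene Welch t statistic for a difference in means (columns = genes)."""
    va, vb = a.var(axis=0, ddof=1), b.var(axis=0, ddof=1)
    return (a.mean(axis=0) - b.mean(axis=0)) / np.sqrt(va / len(a) + vb / len(b))

control = simulate(1000, rho=0.8)   # large 'control state' sample
perturbed = simulate(50, rho=0.0)   # small 'perturbation state' sample

# Gene-by-gene comparison: the means are identical by construction,
# so the t statistics should be unremarkable.
t_stats = welch_t(control, perturbed)

# Pairwise (aggregate) comparison: the gene-gene correlation differs strongly.
r_control = np.corrcoef(control, rowvar=False)[0, 1]
r_perturbed = np.corrcoef(perturbed, rowvar=False)[0, 1]

print("per-gene t statistics:", np.round(t_stats, 2))
print("correlation, control vs. perturbed:", round(r_control, 2), round(r_perturbed, 2))
```

In this toy case a per-gene test has nothing to detect, while the change in co-expression is large; network-level comparison targets this kind of difference.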

Generalization of the Motivating Problem: an example

The Connectivity Map/LINCS project: expression profiling of chemical/genetic perturbations

• >10,000 compounds (most FDA approved drugs)

• ~5,000 genetic perturbations (RNAi, CRISPR)

• 18 cell types, multiple doses, time-points

> 1,000,000 profiles

Main Goal: Drug Discovery

Approach Overview

[Figure: matrix of modules (Module 1, Module 2, …, Module p) by compounds (Compound 1, Compound 2, …, Compound n); color scale: loss/gain of connectivity; side panels: module annotation and the wild-type network]

Approach Overview

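[Figure: matrix of modules (Module 1, Module 2, …, Module p) by compounds (Compound 1, Compound 2, …, Compound n); color scale: loss/gain of connectivity; side panels: module annotation and the wild-type network]

Steps: network construction; module identification; module/network comparison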
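One way to read the module-by-compound summary is as a per-module change in connectivity between the control (wild-type) data and each compound's data. The sketch below is an illustration only, assuming that "connectivity" is measured as the mean absolute pairwise correlation within a module; the slides do not specify the measure, and the function names are hypothetical.

```python
import numpy as np

def module_connectivity(X, modules):
    """Mean absolute pairwise correlation within each module.

    X: samples x genes array.
    modules: dict mapping module name -> list of gene column indices
             (each module needs at least two genes).
    """
    corr = np.corrcoef(X, rowvar=False)
    result = {}
    for name, idx in modules.items():
        sub = np.abs(corr[np.ix_(idx, idx)])
        upper = np.triu_indices(len(idx), k=1)  # each gene pair once, no diagonal
        result[name] = float(sub[upper].mean())
    return result

def connectivity_change(control, perturbed, modules):
    """Positive values = gain of connectivity, negative = loss, vs. control."""
    base = module_connectivity(control, modules)
    pert = module_connectivity(perturbed, modules)
    return {name: pert[name] - base[name] for name in modules}

# Toy usage: module "M1" is co-expressed in control and decoupled by the compound.
rng = np.random.default_rng(1)
factor = rng.normal(size=(500, 1))
control = factor + 0.3 * rng.normal(size=(500, 3))
perturbed = rng.normal(size=(100, 3))
change = connectivity_change(control, perturbed, {"M1": [0, 1, 2]})
print(change)
```

Repeating this for every compound would fill one column of the matrix per compound, with one row per module.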

Approach Details: networks’ construction

Correlation networks: clustering vs. topology-based ‘module’ identification

Gaussian models: inverse covariance matrix, partial correlations

Correlation networks + “scale-free transformations”: mostly for comparison w/ existing methods
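The relationship between the inverse covariance matrix and partial correlations can be sketched as follows. For a precision matrix P (the inverse covariance), the partial correlation between variables i and j given all others is -P[i, j] / sqrt(P[i, i] * P[j, j]). The small ridge term is an assumption added here for numerical stability, not something stated in the slides. The second function interprets “scale-free transformations” as soft-thresholding of correlations (raising absolute correlations to a power `beta`), which is also an assumption.

```python
import numpy as np

def partial_correlations(X, ridge=1e-3):
    """Partial correlation matrix from the (ridge-regularized) inverse covariance.

    X: samples x variables array. ridge is an assumed stabilizing constant.
    """
    S = np.cov(X, rowvar=False)
    S = S + ridge * np.eye(S.shape[0])
    P = np.linalg.inv(S)                 # precision (inverse covariance) matrix
    d = np.sqrt(np.diag(P))
    pcor = -P / np.outer(d, d)
    np.fill_diagonal(pcor, 1.0)
    return pcor

def soft_threshold_adjacency(X, beta=6):
    """Weighted adjacency |corr|**beta with a zero diagonal (assumed transformation)."""
    corr = np.corrcoef(X, rowvar=False)
    A = np.abs(corr) ** beta
    np.fill_diagonal(A, 0.0)
    return A

# Toy chain x -> y -> z: x and z are marginally correlated,
# but conditionally independent given y.
rng = np.random.default_rng(2)
x = rng.normal(size=5000)
y = x + rng.normal(size=5000)
z = y + rng.normal(size=5000)
X = np.column_stack([x, y, z])

pcor = partial_correlations(X)
marginal = np.corrcoef(X, rowvar=False)
adjacency = soft_threshold_adjacency(X)
print("marginal corr(x, z):", round(marginal[0, 2], 2))
print("partial corr(x, z | y):", round(pcor[0, 2], 2))
```

In the chain example the marginal correlation links x and z, while the partial correlation is close to zero, which is why Gaussian models can give sparser, more direct networks than plain correlation.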

Approach Details: networks’ comparison

Covariance matrices comparison

Probabilistic model selection: Bayes factor

Network topology: Diffusion State Distance (M. Crovella) and related
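As a rough illustration of the topology-based option, the sketch below computes a finite-step version of the Diffusion State Distance on a small unweighted graph. It assumes the commonly described definition: for each node u, a vector He(u) of expected visit counts to every node by a random walk of k steps started at u (counting the start), with the distance between u and v taken as the L1 norm of He(u) - He(v). The function name and the default `k` are hypothetical, and the dense n x n x n intermediate makes this suitable for small graphs only.

```python
import numpy as np

def diffusion_state_distance(A, k=5):
    """Finite-k Diffusion State Distance between all node pairs.

    A: symmetric adjacency matrix with no isolated nodes.
    k: number of random-walk steps (assumed default).
    """
    A = np.asarray(A, dtype=float)
    degree = A.sum(axis=1)
    if np.any(degree == 0):
        raise ValueError("graph must not contain isolated nodes")
    P = A / degree[:, None]          # random-walk transition matrix
    n = A.shape[0]
    He = np.eye(n)                   # step 0: the walk is at its start node
    Pt = np.eye(n)
    for _ in range(k):
        Pt = Pt @ P
        He += Pt                     # accumulate expected visits at each step
    diff = He[:, None, :] - He[None, :, :]
    return np.abs(diff).sum(axis=2)  # pairwise L1 distances

# Toy usage: a path graph 0 - 1 - 2 - 3.
path = np.array([
    [0, 1, 0, 0],
    [1, 0, 1, 0],
    [0, 1, 0, 1],
    [0, 0, 1, 0],
])
D = diffusion_state_distance(path, k=5)
print(np.round(D, 2))
```

The same distance matrix could be computed on a control network and on a perturbed network, then compared.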

The Data

Gene expression profiles: networks’ inference

Protein-protein interaction: networks’ priors

“Cell painting” profiles: networks’ annotation

100K samples

10K features (genes)

Deliverables

Computational Toolbox

• Network inference and visualization

• Module (i.e., sub-network) identification/comparison

• Network/module-based clustering/annotation

Analysis and cataloguing of chemical perturbations

• Chemicals’ putative mechanisms of action

• Interpretable carcinogenicity predictor(s)

A sandbox for researchers to develop and test new methods

• Richly annotated multi-type data

• Domain expertise to evaluate relevance/usefulness

Preliminary results for pursuit of further funding

The Team

Stefano Monti, Ph.D. (Assoc. Professor): Computational Biology, Cancer Genomics, Machine Learning (Bayesian Networks)

Paola Sebastiani, Ph.D. (Professor): Biostatistics, Genetics/Genomics, Bayesian Graphical Models

Mark Crovella, Ph.D. (Professor): Computer Science, Network Analysis

Simon Kasif (Professor): Computational Biology, Systems Biology, Machine Learning

Francesca Mulas, Ph.D. (Post-doctoral Fellow): Computational Biology/Bioinformatics, Computer Science

Daniel Gusenleitner, M.S. (Ph.D. student): Bioinformatics, Computer Science, Machine Learning

“Background” Team

BU-SRP: David Ozonoff, Basra Komal, Heather Henry (NIEHS)

Evans Foundation - ARC: Katya Ravid, Robin MacDonald

NTP/NIEHS: Scott Auerbach, Ray Tice

Broad Institute: Aravind Subramanian, Xiaodong Lu, Todd Golub, cMAP team

BU CBM/Bioinformatics/SPH: David Sherr (co-PI), Daniel Gusenleitner, Jessalyn Ubellacker

Tisha Meila, Harold Gomez, Yuxiang Tan, Liye Zhang

Elizabeth Moses, Teresa Wang, Marc Lenburg, Avi Spira

The End
