stephen friend complex traits: genomics and computational approaches 2012-02-23

If the physicists do it and the software engineers do it, why can’t we do it?

Moving beyond linear investigations, both of the science and of how we work

Integrating layers of omics data and building models using compute spaces capable of enabling models to be evolved by teams of teams

Stephen Friend MD PhD

Sage Bionetworks (Non-Profit Organization) Seattle/ Beijing/ Amsterdam

February 23, 2012

So what is the problem?

Most approved therapies were assumed to be monotherapies for diseases representing homogeneous populations

Our existing disease models often assume pathway knowledge sufficient to infer correct therapies

Familiar but Incomplete

Reality: Overlapping Pathways

The value of appropriate representations/ maps

Equipment capable of generating massive amounts of data

“Data Intensive” Science- Fourth Scientific Paradigm

Open Information System

IT Interoperability

Host evolving computational models in a “Compute Space”

WHY NOT USE “DATA INTENSIVE” SCIENCE

TO BUILD BETTER DISEASE MAPS?

What will it take to understand disease?

DNA RNA PROTEIN (dark matter)

MOVING BEYOND ALTERED COMPONENT LISTS

2002 Can one build a “causal” model?

Preliminary Probabilistic Models- Rosetta /Schadt

Gene symbol | Gene name | Variance of OFPM explained by gene expression* | Mouse model | Source
Zfp90 | Zinc finger protein 90 | 68% | tg | Constructed using BAC transgenics
Gas7 | Growth arrest specific 7 | 68% | tg | Constructed using BAC transgenics
Gpx3 | Glutathione peroxidase 3 | 61% | tg | Provided by Prof. Oleg Mirochnitchenko (University of Medicine and Dentistry at New Jersey, NJ) [12]
Lactb | Lactamase beta | 52% | tg | Constructed using BAC transgenics
Me1 | Malic enzyme 1 | 52% | ko | Naturally occurring KO
Gyk | Glycerol kinase | 46% | ko | Provided by Dr. Katrina Dipple (UCLA) [13]
Lpl | Lipoprotein lipase | 46% | ko | Provided by Dr. Ira Goldberg (Columbia University, NY) [11]
C3ar1 | Complement component 3a receptor 1 | 46% | ko | Purchased from Deltagen, CA
Tgfbr2 | Transforming growth factor beta receptor 2 | 39% | ko | Purchased from Deltagen, CA

Networks facilitate direct identification of genes that are causal for disease

Evolutionarily tolerated weak spots

Nat Genet (2005) 205:370

DIVERSE POWERFUL USE OF MODELS AND NETWORKS

  50 network papers   http://sagebase.org/research/resources.php

List of Influential Papers in Network Modeling

(Eric Schadt)

Score Card for Medical Sciences: “Data Intensive” Science, the Fourth Scientific Paradigm

Equipment capable of generating massive amounts of data: A-

Open Information System: D-

IT Interoperability: D

Host evolving computational models in a “Compute Space”: F

We still conduct much clinical research as if we were “hunter-gatherers”, not sharing

TENURE FEUDAL STATES

Clinical/genomic data are accessible but minimally usable

Little incentive to annotate and curate data for other scientists to use

Mathematical models of disease are not built to be reproduced or versioned by others

Lack of standard forms for future rights and consents

Lack of data standards

Sage Mission

Sage Bionetworks is a non-profit organization with a vision to create a “commons” where integrative bionetworks are evolved by contributor scientists with a shared vision to accelerate the elimination of human disease

Sagebase.org

Data Repository

Discovery Platform

Building Disease Maps

Commons Pilots

Sage Bionetworks Collaborators

  Pharma Partners   Merck, Pfizer, Takeda, AstraZeneca, Amgen, Johnson & Johnson


  Foundations   Kauffman, CHDI, Gates Foundation

  Government   NIH, LSDF, NCI

  Academic   Levy (Framingham)   Rosengren (Lund)   Krauss (CHORI)

  Federation   Ideker, Califano, Nolan, Schadt

[Figure: modules across four breast cancer datasets. Panels: A) Miller, 159 samples; B) Christos, 189 samples; C) NKI, 295 samples; D) Wang, 286 samples; E) Super modules. Module labels: cell cycle, pre-mRNA, ECM, immune response, blood vessel.]

Zhang B et al., Towards a global picture of breast cancer (manuscript).


NKI: N Engl J Med. 2002 Dec 19;347(25):1999.

Wang: Lancet. 2005 Feb 19-25;365(9460):671.

Miller: Breast Cancer Res. 2005;7(6):R953.

Christos: J Natl Cancer Inst. 2006 15;98(4):262.

Model of Breast Cancer: Co-expression JUN ZHU

What is this?

Bayesian networks enriched in inflammation genes correlated with disease severity in pre-frontal cortex of 250 Alzheimer’s patients.

What does it mean?

Inflammation in AD is an interactive multi-pathway system. More broadly, network structure organizes complex disease effects into coherent sub-systems and can prioritize key genes.

Are you joking?

Gene validation shows novel key drivers increase Abeta uptake and decrease neurite length through an ROS burst. (highly relevant to AD pathology)

CHRIS GAITERI - ALZHEIMER’S

Elias Chaibub Neto (1), Aimee T. Broman (2), Mark P. Keller (2), Alan D. Attie (2), Bin Zhang (1), Jun Zhu (1), Brian S. Yandell (2)

(1) Sage Bionetworks, Seattle, WA USA; (2) University of Wisconsin-Madison, Madison, WI USA

Causal Model Selection Hypothesis Tests in Systems Genetics

Abstract

Current efforts in systems genetics have focused on the development of statistical approaches aiming to disentangle causal relationships among molecular phenotypes in segregating populations. Model selection criteria, such as the AIC and BIC, have been widely used for this purpose, in spite of being unable to quantify the uncertainty associated with the model selection call. Here we propose three novel hypothesis tests to perform model selection among models representing distinct causal relationships. We focus on models composed of pairs of phenotypes and use their common QTL to determine which phenotype has a causal effect on the other, or whether the phenotypes are not causally related and are only statistically associated. Our hypothesis tests are fully analytical and avoid the use of computationally expensive permutation or re-sampling strategies. They adapt and extend Vuong's (and Clarke’s) model selection test to the comparison of four possibly misspecified models, handling the full range of possible causal relationships among a pair of phenotypes. We evaluate the performance of our tests against the AIC, BIC and a published causality inference test in simulation studies. Furthermore, we compare the precision of the causal predictions made by the methods using biologically validated causal relationships extracted from a database of 247 knockout experiments in yeast. Overall, our model selection hypothesis tests achieve higher precision than the alternative methods at the expense of reduced statistical power.

Vuong’s Model Selection Test

Vuong's test derives from the Kullback-Leibler Information Criterion (KLIC).

Let $h_0(y \mid x)$ represent the true model.

Consider the parametric family of conditional models $\{ f(y \mid x; \phi) : \phi \in \Phi \}$.

Then $\mathrm{KLIC}(h_0, f) = E_0[\log h_0(y \mid x)] - E_0[\log f(y \mid x; \phi)]$,

where the expectation $E_0$ is computed w.r.t. $h_0(y, x)$, and $\phi^*$ is the parameter value that minimizes $\mathrm{KLIC}(h_0, f)$.

Consider two models: $f_1 \equiv f_1(y \mid x; \phi_1^*)$ and $f_2 \equiv f_2(y \mid x; \phi_2^*)$.

Model $f_1$ is a better approximation of $h_0$ than $f_2$ if and only if

$\mathrm{KLIC}(h_0, f_1) < \mathrm{KLIC}(h_0, f_2) \iff E_0[\log f_1] > E_0[\log f_2]$.

Let $LR_{12} = \log f_1 - \log f_2$. Then we test

$H_0: E_0[LR_{12}] = 0$, $H_1: E_0[LR_{12}] > 0$, $H_2: E_0[LR_{12}] < 0$.

The quantity $E_0[LR_{12}]$ is unknown, but the sample mean and variance of

$\widehat{LR}_{12,i} = \log \hat{f}_{1,i} - \log \hat{f}_{2,i}$, where $\hat{f}_1 \equiv f_1(y \mid x; \hat{\phi}_1)$ and $\hat{\phi}_1$ is the ML estimate of $\phi_1$,

converge a.s. to $E_0[LR_{12}]$ and $\mathrm{Var}_0[LR_{12}] = \sigma_{12.12}$.

Let $\widehat{LR}_{12} = \sum_{i=1}^{n} \widehat{LR}_{12,i}$; then under $H_0$

$(n \, \hat{\sigma}_{12.12})^{-1/2} \, \widehat{LR}_{12} \to_d N(0, 1)$.

If different models have different dimensions we consider

$\widehat{LR}^*_{12} = \widehat{LR}_{12} - D_{12}$,

where $D_{12}$ represents a difference of AIC or BIC penalties, and adopt the test statistic

$Z_{12} = (n \, \hat{\sigma}_{12.12})^{-1/2} \, \widehat{LR}^*_{12}$.
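A minimal Python sketch of the statistic above, assuming the per-observation log-likelihoods of the two fitted models are already available as lists. The function names `vuong_z` and `normal_sf` and the argument `penalty_diff` (standing in for D12) are illustrative and are not taken from the poster.

```python
import math


def normal_sf(z):
    """Upper-tail probability P(Z >= z) for a standard normal variable."""
    return 0.5 * math.erfc(z / math.sqrt(2.0))


def vuong_z(loglik1, loglik2, penalty_diff=0.0):
    """Vuong-type statistic Z12 from per-observation log-likelihoods.

    loglik1[i] and loglik2[i] are the fitted log-densities of models 1 and 2
    at observation i. penalty_diff plays the role of D12 (assumed supplied by
    the caller as a difference of AIC or BIC penalties).
    """
    n = len(loglik1)
    lr = [a - b for a, b in zip(loglik1, loglik2)]   # pointwise log-likelihood ratios
    mean = sum(lr) / n
    var = sum((v - mean) ** 2 for v in lr) / n       # sample variance of the ratios
    lr_star = sum(lr) - penalty_diff                 # penalized summed ratio
    return lr_star / math.sqrt(n * var)


# Usage: a large positive z favours model 1, a large negative z favours model 2.
z12 = vuong_z([-1.0, -1.2, -0.8, -1.1], [-1.5, -1.4, -1.6, -1.3])
p12 = normal_sf(z12)  # one-sided p-value for H1: E0[LR12] > 0
```

The variance here divides by n rather than n - 1; that choice is an assumption of this sketch and makes no difference asymptotically.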

Clarke’s Model Selection Test

Represents a non-parametric version of Vuong’s test.

Vuong’s null: the mean log-likelihood ratio is 0.

Clarke’s null: the median log-likelihood ratio is 0.

Paired sign test on log-likelihood scores:

Scores: $(\widehat{LR}_{12,1}, \widehat{LR}_{12,2}, \widehat{LR}_{12,3}, \widehat{LR}_{12,4}, \widehat{LR}_{12,5}, \ldots, \widehat{LR}_{12,n})$

Signs: ( + , − , + , + , − , … , + )

Let $T_{12}$ = {# of positive signs}. Then under Clarke’s null

$T_{12} \sim \mathrm{Binomial}(n, 1/2)$.
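The sign test can be sketched in a few lines of Python. This is an illustration under the assumption that the per-observation log-likelihood ratios are given as a list; the name `clarke_test` and the treatment of exact zeros as non-positive are choices made for this sketch.

```python
import math


def clarke_test(lr_scores):
    """Sign test on per-observation log-likelihood ratios.

    Returns (T12, p) where T12 counts the positive scores and p is the
    one-sided tail probability P(T >= T12) under Binomial(n, 1/2).
    Exact zeros are counted as non-positive (an assumption of this sketch).
    """
    n = len(lr_scores)
    t12 = sum(1 for v in lr_scores if v > 0)
    tail = sum(math.comb(n, k) for k in range(t12, n + 1))
    return t12, tail / 2 ** n


# Usage: four of six scores are positive, so p = (15 + 6 + 1) / 64.
t12, p = clarke_test([0.3, -0.1, 0.5, 0.2, -0.4, 0.7])
```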

Causal Model Selection Tests (CMST)

In our applications we consider four models: M1, M2, M3 and M4.

We derive intersection-union tests based on six separate Vuong (Clarke) tests:

f1 vs f2 , f1 vs f3 , f1 vs f4 , f2 vs f3 , f2 vs f4 , f3 vs f4

We propose three distinct CMST tests: (1) parametric, (2) non-parametric, and (3) joint-parametric CMST tests.

Parametric CMST:

H0: model M1 is not closer to the true model than M2, M3 or M4.

H1: model M1 is closer to the true model than M2, M3 and M4.

$H_0: \{ E_0[LR_{12}] = 0 \} \cup \{ E_0[LR_{13}] = 0 \} \cup \{ E_0[LR_{14}] = 0 \}$

$H_1: \{ E_0[LR_{12}] > 0 \} \cap \{ E_0[LR_{13}] > 0 \} \cap \{ E_0[LR_{14}] > 0 \}$

The rejection region and p-value for this IU-test are given by:

$\min\{z_{12}, z_{13}, z_{14}\} > c_\alpha$, $p_1 = \max\{p_{12}, p_{13}, p_{14}\}$.
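The intersection-union rule can be illustrated as follows. This is a sketch that assumes the three pairwise z statistics have already been computed and that each p-value is the one-sided standard normal tail; `iu_test` is a hypothetical helper name.

```python
import math


def normal_sf(z):
    """Upper-tail probability P(Z >= z) for a standard normal variable."""
    return 0.5 * math.erfc(z / math.sqrt(2.0))


def iu_test(z_stats, alpha=0.05):
    """Intersection-union test over pairwise statistics.

    The overall p-value is the largest pairwise p-value, which corresponds
    to the smallest z statistic. M1 is selected only if every pairwise
    comparison favours it at level alpha.
    """
    p_values = [normal_sf(z) for z in z_stats]
    p_overall = max(p_values)
    return p_overall, p_overall <= alpha


# Usage: one weak comparison (z14 = 1.0) is enough to block the call.
p1, reject = iu_test([2.5, 3.0, 1.0])
```

Taking the maximum p-value is what makes the test conservative: the weakest of the three comparisons decides the outcome.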

Non-parametric CMST:

Analogous to the parametric CMST, with Vuong’s tests replaced by Clarke’s tests.

Joint parametric CMST:

A simple application of separate Vuong tests overlooks the dependency among the test statistics.

Let $S_1$ represent the sample covariance matrix of $\widehat{LR}_{12,i}$, $\widehat{LR}_{13,i}$ and $\widehat{LR}_{14,i}$.

Under regularity conditions, $S_1$ converges a.s. to $\Sigma_1$.

It follows from the MCT and Slutsky’s theorem that when

$( E_0[LR_{12}], E_0[LR_{13}], E_0[LR_{14}] )^T = ( 0, 0, 0 )^T$

we have that

$Z_1 = n^{-1/2} \, \mathrm{diag}(S_1)^{-1/2} \, \widehat{LR}_1 \to_d N_3(0, \rho_1)$,

where $\widehat{LR}_1 = ( \widehat{LR}_{12}, \widehat{LR}_{13}, \widehat{LR}_{14} )^T$ and $\rho_1 = \mathrm{diag}(S_1)^{-1/2} \, \Sigma_1 \, \mathrm{diag}(S_1)^{-1/2}$.

We consider the hypotheses

$H_0: \min\{ E_0[LR_{12}], E_0[LR_{13}], E_0[LR_{14}] \} \le 0$

$H_1: \min\{ E_0[LR_{12}], E_0[LR_{13}], E_0[LR_{14}] \} > 0$

and adopt the test statistic $W_1 = \min\{Z_1\}$. The p-value is computed as

$P(W_1 \ge w_1) = P(Z_{12} \ge w_1, Z_{13} \ge w_1, Z_{14} \ge w_1)$.
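The joint tail probability above has no simple closed form. The sketch below approximates it by Monte Carlo sampling from a trivariate normal with a given correlation matrix. This sampling approach is a stand-in chosen for illustration, not the numerical method used on the poster, and the names `cholesky` and `joint_cmst_pvalue` are hypothetical.

```python
import math
import random


def cholesky(matrix):
    """Lower-triangular Cholesky factor of a small positive-definite matrix."""
    n = len(matrix)
    low = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            s = sum(low[i][k] * low[j][k] for k in range(j))
            if i == j:
                low[i][j] = math.sqrt(matrix[i][i] - s)
            else:
                low[i][j] = (matrix[i][j] - s) / low[j][j]
    return low


def joint_cmst_pvalue(w1, corr, draws=20000, seed=0):
    """Monte Carlo estimate of P(Z12 >= w1, Z13 >= w1, Z14 >= w1), Z ~ N(0, corr).

    corr stands for the correlation matrix of the three statistics; it must
    be positive definite. The estimate carries sampling error of order
    1/sqrt(draws).
    """
    rng = random.Random(seed)
    low = cholesky(corr)
    dim = len(corr)
    hits = 0
    for _ in range(draws):
        e = [rng.gauss(0.0, 1.0) for _ in range(dim)]
        z = [sum(low[i][k] * e[k] for k in range(i + 1)) for i in range(dim)]
        if min(z) >= w1:
            hits += 1
    return hits / draws


# Usage with an assumed equicorrelated matrix (correlation 0.5).
rho = [[1.0, 0.5, 0.5], [0.5, 1.0, 0.5], [0.5, 0.5, 1.0]]
p_joint = joint_cmst_pvalue(1.5, rho)
```

Accounting for positive correlation among the statistics gives a larger joint tail probability than treating them as independent would, which is why the dependency matters for the p-value.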

Simulation Study

We conducted a simulation study generating data from the models on the Figure below.

The results are shown below:

Yeast Data Analysis

We analyzed the yeast genetical genomics data set from Brem and Kruglyak (2005).

We evaluated the precision of the causal predictions made by the methods using validated causal relationships extracted from a data-base of 247 knock-out experiments (Hughes 2000, Zhu 2008).

In total, 46 of the ko-genes showed significant eQTLs, and we tested a total of 4,928 ko-gene/putative target gene relations.

Pairwise Causal Models

Given a pair of phenotypes, Y1 and Y2, that co-map to the same quantitative trait loci, Q, we consider the following models:

Conclusions

Advantages of the Causal Model Selection Tests:

1- Fully analytical hypothesis tests that avoid the use of computationally expensive permutation or re-sampling techniques.

2- Achieve better controlled type I error rates.

3- Achieve higher precision rates.

Main disadvantage: lower statistical power.

ELIAS NETO

Causal Model Selection Hypothesis Tests in Systems Genetics

The Schadt et al. (2005) approach was based on penalized likelihood model selection, where we simply select the model with the best score.

The proposed hypothesis test allows us to attach a p-value to the selected model and, in this way, allows the quantification of the uncertainty associated with the model selection call.

The proposed tests are fully analytical and avoid computationally expensive permutation and re-sampling techniques.

ELIAS NETO

A multi-tissue immune-driven theory of weight loss

[Figure labels: liver, adipose, hypothalamus; fatty acids; macrophage/inflammation; leptin signaling; phagocytosis-induced lipolysis; M1 macrophage]

ZHI WANG

RULES GOVERN

PLATFORM

NEW MAPS

PLATFORM: Sage Platform and Infrastructure Builders (Academic, Biotech and Industry IT Partners...)

PILOTS = PROJECTS FOR COMMONS: Data Sharing Commons Pilots (Federation, CCSB, Inspire2Live....)

Why not share clinical/genomic data and model building in the ways currently used by the software industry (the power of tracking workflows and versioning)?

Leveraging Existing Technologies

Taverna

Addama

tranSMART

Sage Bionetworks Synapse project

Watch What I Do, Not What I Say

Reduce, Reuse, Recycle

Most of the People You Need to Work with Don’t Work with You

My Other Computer is Cloudera, Amazon, Google

Sage Metagenomics Project

•  > 10k genomic and expression standardized datasets indexed in SCR
•  Error detection, normalization in mG
•  Access raw or processed data via download or API in downstream analysis
•  Building towards open, continuous community curation

Processed Data (S3)

Sage Metagenomics using Amazon Simple Workflow

Full case study at http://aws.amazon.com/swf/testimonials/swfsagebio/

Amazon SWF and Synapse

•  Maintains state of analysis
•  Tracks step execution
•  Logs workflow history
•  Dispatches work to Amazon or remote worker nodes
•  Efficiently matches job size to hardware
•  Provides error handling and recovery

•  Hosts raw and processed data for further reuse in public or private projects

•  Provides visibility into intermediate results and algorithmic details

•  Allows programmatic access to data; integration with R

•  Provides standard terminologies for annotations

•  Search across data sets
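The workflow responsibilities listed above (keeping state, tracking step execution, logging history, and recovering from errors) can be illustrated with a toy runner in plain Python. This is a conceptual sketch only: it does not use the Amazon SWF or Synapse APIs, and `run_workflow` and its arguments are hypothetical names.

```python
def run_workflow(steps, data, max_retries=2):
    """Run named steps in order and record every attempt in a history log.

    steps is a list of (name, function) pairs; each function takes the
    current state and returns the next state. A failing step is retried up
    to max_retries times before the workflow is marked as failed.
    """
    history = []
    state = data
    for name, func in steps:
        for attempt in range(1, max_retries + 2):
            try:
                state = func(state)
                history.append((name, attempt, "completed"))
                break
            except Exception as err:
                history.append((name, attempt, f"failed: {err}"))
        else:
            # Every attempt raised, so stop and report the last good state.
            return state, history, "failed"
    return state, history, "completed"


def center(values):
    """Example step: subtract the mean from every value."""
    mean = sum(values) / len(values)
    return [v - mean for v in values]


# Usage: a one-step workflow over a small numeric vector.
final_state, log, status = run_workflow([("center", center)], [1.0, 2.0, 3.0])
```

Because the history records each attempt, an analysis can be audited or resumed from the last completed step, which is the property the slide attributes to workflow tracking.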

Synapse Roadmap

Timeline: Q1-2012, Q2-2012, Q3-2012, Q4-2012, Q1-2013, Q2-2013, Q3-2013, Q4-2013

Releases: Internal Alpha, Public Beta Testing, Synapse 1.0, Synapse 1.5, Future

Synapse Platform Functionality

•  Data Repository
•  Projects and security
•  R integration
•  Analysis provenance
•  Search
•  Controlled Vocabularies
•  Governance of restricted data
•  Workflow templates
•  Publishing figures
•  Wiki & collaboration tools
•  Integrated management of cloud resources
•  Social networking
•  User-customized dashboards
•  R Studio integration
•  Curation tool integration

Data / Analysis Capabilities

•  40+ manually curated clinical studies
•  8000+ GEO / ArrayExpress datasets
•  Clinical, genomic, compound sensitivity
•  Bioconductor and custom R analysis
•  TCGA
•  METABRIC breast cancer challenge
•  Predictive modeling workflows
•  Automated processing of common genomics platforms
•  TBD: Integrations with other visualization and analysis packages

INTEROPERABILITY


Genome Pattern CYTOSCAPE tranSMART I2B2

SYNAPSE

Open Network Biology is an open access journal that publishes articles relating to predictive, network-based models of living systems linked to the corresponding coherent data sets upon which the models are based. In addition to articles describing these large data sets, the journal also welcomes submissions of original research, software and methods, along with reviews and commentary, relevant to the emerging field of network biology.

Submit your manuscript and benefit from:
•  High visibility for articles through unrestricted online access
•  Free article redistribution under a Creative Commons attribution license
•  No limits on article length, additional files, colour figures or movies
•  Rapid, immediate open access publication on acceptance
•  An integrated repository for network model data and code

Now accepting submissions

Editor-in-Chief Eric Schadt (USA)

www.opennetworkbiology.com

Five Pilots involving Sage Bionetworks

CTCAP, Arch2POCM, The Federation, Portable Legal Consent, Sage Congress Project

RULES GOVERN

PLATFORM

NEW MAPS

Clinical Trial Comparator Arm Partnership (CTCAP)

  Description: Collate, Annotate, Curate and Host Clinical Trial Data with Genomic Information from the Comparator Arms of Industry and Foundation Sponsored Clinical Trials: Building a Site for Sharing Data and Models to evolve better Disease Maps.

  Public-Private Partnership of leading pharmaceutical companies, clinical trial groups and researchers.

  Neutral Conveners: Sage Bionetworks and Genetic Alliance [nonprofits].

  Initiative to share existing trial data (molecular and clinical) from non-proprietary comparator and placebo arms to create powerful new tool for drug development.

Started Sept 2010

Sharing and jointly analyzing clinical/genomic data will maximize clinical impact and enable discovery

•  Graphic of curated to qced to models

Arch2POCM

Restructuring the Precompetitive Space for Drug Discovery

How to potentially De-Risk High-Risk Therapeutic Areas

Arch2POCM: scale and scope

•  Proposed Goal: Initiate 2 programs. One for Oncology/Epigenetics/Immunology. One for Neuroscience/Schizophrenia/Autism. Each program will have 8 drug discovery projects (targets), ramped up over a period of 2 years

–  It is envisioned that Arch2POCM’s funding partners will select targets that are judged as slightly too risky to be pursued at the top of pharma’s portfolio, but that have significant scientific potential that could benefit from Arch2POCM’s crowdsourcing effort

•  These will be executed over a period of 5 years making a total of 16 drug discovery projects

–  Projected pipeline attrition by Year 5 (assuming 12 targets loaded in early discovery)

•  30% will enter Phase 1
•  20% will deliver Ph 2 POCM data

The Federation

2008 2009 2010 2011

How can we accelerate the pace of scientific discovery?

Ways to move beyond “traditional” collaborations?

Intra-lab vs Inter-lab Communication

Colrain/ Industrial PPPs Academic Unions

(Nolan and Haussler)

sage federation: model of biological age

[Figure: Predicted Age (liver expression) plotted against Chronological Age (years). Regions labeled Faster Aging and Slower Aging; the Age Differential is related to Clinical Association (Gender, BMI, Disease), Genotype Association, and Gene Pathway Expression.]

Reproducible science==shareable science

Sweave: combines programmatic analysis with narrative

Friedrich Leisch. Sweave: Dynamic generation of statistical reports using literate data analysis. In Wolfgang Härdle and Bernd Rönz, editors, Compstat 2002 – Proceedings in Computational Statistics, pages 575-580. Physica Verlag, Heidelberg, 2002. ISBN 3-7908-1517-9

Federated Aging Project: Combining analysis + narrative

=Sweave Vignette Sage Lab

Califano Lab Ideker Lab

Shared Data Repository

JIRA: Source code repository & wiki

R code + narrative

PDF(plots + text + code snippets)

Data objects

HTML

Submitted Paper

[Figure: ranked predictive features per compound (#1, #2, #3). Features shown: TP53 mut, CDKN2A copy, MDM2 expr, HGF expr, CML lineage, EGFR mut, ERBB2 expr, BRAF mut, NRAS mut, KRAS mut.]

Can the approach make new discoveries?

For 11/12 compounds, the #1 predictive feature in an unbiased analysis corresponds to the known stratifier of sensitivity


Vaske, et al.

Presentation outline

Currently: mRNA, copy number, somatic mutations (36 cancer-related genes)

In progress: targeted exon sequencing, epigenetics, microRNA, lncRNA, phospho-tyrosine kinase, metabolites

Molecular characterization (1,000 cell lines)

Viability screens (500 cell lines, 24 compounds)

Small molecule screen

Cancer cell line encyclopedia

TCGA /ICGC Molecular characterization (50 tumor types)

  genomics   transcriptomics   epigenetics

Clinical data

Predictive model

1) Predicting drug response from cancer cell lines

2) Future approaches: network-based predictors and multi-task learning

3) Standardized workflows for data management, versioning and method comparison

Transfer learning

Network / pathway prior information

Vaske, et al.

1)  Data management APIs to load standardized objects, e.g. R ExpressionSets (Matt Furia):

# Load CCLE feature and response data by entity ID
ccleFeatureData <- getEntity(ccleFeatureDataId)
ccleResponseData <- getEntity(ccleResponseDataId)

# Load TCGA feature and response data by entity ID
tcgaFeatureData <- getEntity(tcgaFeatureDataId)
tcgaResponseData <- getEntity(tcgaResponseDataId)

[Figure: Observed Data = Systematic Variation + Random Variation. Normalization: remove the influence of adjustment variables on data.]

2)  Automated, standardized workflows for curation and QC of large-scale datasets (Brig Mecham).

A)  TCGA: Automated cloud-based processing.

B)  GEO / ArrayExpress: Normalization workflows, curation of phenotype using standard ontologies.

C)  Additional studies with genetic and phenotypic data in Sage repository (e.g. CCLE and Sanger cell line datasets)

3)  Pluggable API to implement predictive modeling algorithms.

A)  Support for all commonly used machine learning methods (for automated benchmarking against new methods)

B)  Pluggable custom methods as R classes implementing customTrain() and customPredict() methods.

–  Can be arbitrarily complex (e.g. pathway and other priors)

–  Support for parallelization in foreach loops.

[Diagram: custom model 1, custom model 2, ..., custom model N]

4)  Statistical performance assessment across models.

5)  Output of candidate biomarkers and feature evaluation (e.g. GSEA, pathway analysis)

6)  Experimental follow-up on top predictions (TBD). E.g. for cell lines: medium throughput suppressor / enhancer screens of drug sensitivity for knockdown / overexpression of predicted biomarkers.
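The pluggable-model idea and the cross-model performance assessment can be sketched as follows. The slides describe R classes with customTrain() and customPredict(); the version below is a Python analogue written only for illustration. The class names, the `custom_train` / `custom_predict` method names, and the `benchmark` helper are all hypothetical and are not the Synapse API.

```python
from abc import ABC, abstractmethod


class PredictiveModel(ABC):
    """Interface every pluggable model implements (Python analogue of the R classes)."""

    @abstractmethod
    def custom_train(self, features, response):
        """Fit the model to training features and responses."""

    @abstractmethod
    def custom_predict(self, features):
        """Return one prediction per row of features."""


class MeanModel(PredictiveModel):
    """Baseline that always predicts the mean training response."""

    def custom_train(self, features, response):
        self.mean = sum(response) / len(response)

    def custom_predict(self, features):
        return [self.mean for _ in features]


class OneFeatureLinearModel(PredictiveModel):
    """Least-squares line on the first feature of each row."""

    def custom_train(self, features, response):
        x = [row[0] for row in features]
        mx, my = sum(x) / len(x), sum(response) / len(response)
        sxy = sum((a - mx) * (b - my) for a, b in zip(x, response))
        sxx = sum((a - mx) ** 2 for a in x)
        self.slope = sxy / sxx
        self.intercept = my - self.slope * mx

    def custom_predict(self, features):
        return [self.intercept + self.slope * row[0] for row in features]


def benchmark(models, train_x, train_y, test_x, test_y):
    """Train every model on the same data and report test mean squared error."""
    scores = {}
    for name, model in models.items():
        model.custom_train(train_x, train_y)
        pred = model.custom_predict(test_x)
        scores[name] = sum((p - y) ** 2 for p, y in zip(pred, test_y)) / len(test_y)
    return scores


# Usage: compare a baseline against a linear model on toy data (y = 2x + 1).
scores = benchmark(
    {"mean": MeanModel(), "linear": OneFeatureLinearModel()},
    [[0.0], [1.0], [2.0], [3.0]], [1.0, 3.0, 5.0, 7.0],
    [[4.0], [5.0]], [9.0, 11.0],
)
```

Keeping every model behind the same two methods is what allows new methods to be benchmarked automatically against existing ones on identical data.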

Portable Legal Consent

(Activating Patients)

John Wilbanks

weconsent.us

Sage Congress Project April 20 2012

RealNames Parkinson’s Project

Revisiting Breast Cancer Prognosis

Fanconi’s Anemia

(Responders Competitions - IBM-DREAM)

Networking Disease Model Building
