An introduction to quantitative biology and R

David Quigley

Diller Comprehensive Cancer Center, UCSF
Institute for Cancer Research, University of Oslo

Molecular biology 20 years ago

Suzuki Med Mol Morph 2010; Oh PNAS 1996; Mao Genes Dev 2004

qualitative methods
small-scale quantitative tests


Molecular biology now

Nik-Zainal Cell 2012; Fullwood Nature 2009; CGAN Nature 2012

qualitative methods
small-scale quantitative tests
large-scale quantitative analysis: microarrays, *seq methods, cell phenotype screens...


Many studies use both approaches

Hypothesis-generating:
  Normalize breast tumor expression data (three cohorts)
  Call genotypes from SNP arrays
  Calculate association between genotype and expression genome-wide (eQTL)
  Identify an interesting candidate
  Validate eQTL in independent cohort

Hypothesis-driven:
  Methylation analysis at MRPS30 promoter
  In vitro (two cell lines):
    ChIP-PCR +/- estrogen
    qPCR
    sequencing


Common quantitative techniques

Gene Expression
  transcription
  splicing
  methylation
  protein binding (ChIP-seq)

Genomics and Genetics
  de novo assembly
  DNA copy number (CNV and tumors)
  germline variant analysis
  tumor variant analysis
    SNPs, indels, translocation
    analysis of tumor clonality

Challenges

requires statistical sophistication
  in study design
  in interpretation

many data points
  1,000 to 1,000,000 measurements per sample
  many false positives which look like great stories

software becomes part of the experiment
divide between engineering and biology in culture & thinking
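The false-positive problem can be made concrete with a small simulation. The sketch below is only an illustration under assumed numbers (18,000 tests with no true signal and a 0.05 threshold); the Benjamini-Hochberg adjustment via `p.adjust` is one common correction and is not taken from the studies cited in these slides.

```r
# Simulate p-values for 18,000 genes when NO gene is truly associated.
set.seed(1)
n.tests = 18000
p.values = runif(n.tests)   # under the null, p-values are uniform on [0, 1]

# Naive threshold: about 5% of tests (roughly 900 expected) pass by chance.
n.naive = sum(p.values < 0.05)

# Benjamini-Hochberg adjustment controls the false discovery rate.
p.adjusted = p.adjust(p.values, method = "BH")
n.adjusted = sum(p.adjusted < 0.05)

print(c(naive = n.naive, adjusted = n.adjusted))
```

Every "hit" in the naive count is a false positive by construction, which is why an uncorrected genome-wide scan always produces some great-looking stories.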


Schillebeeckx Nature Biotech. 2013

Wet lab and quantitative skills: better job prospects


What approaches are used to analyze quantitative data?

Choosing a tool

Cost
Learning curve
Ease of use
Flexibility (closed to open-ended)
Software ecosystem (none to extensive)


Traditional programming languages
Python, C++, Java, others

can solve any computable problem
can create the fastest tools
free
requires programming expertise
complex to write and test
high effort


Specialized single-purpose programs

command line tools
  academic research
  type commands at a prompt or run scripts
  PLINK, bowtie, GATK, bedtools

GUI (point and click)
  commercial software for a vendor’s platform
  slick, opaque, hard/impossible to automate


Commercial statistics programs

STATA, SPSS, GraphPad, others

1) Load one dataset
2) Select analysis by clicking on a GUI
3) Generate a report

May have a built-in language
Very mature tools for traditional biostatistics
Not free

Web-based tools

Galaxy
  string together pre-defined analysis steps
  very easy to use
  reproducible


R: a “software environment”

Using R is like writing and using software.
Traditionally, biologists did not do this.

R was written by statisticians to be a free replica of another language called “S”.


Why is R popular?

Flexible, open-ended, open-source

Large library of packages
  package: easy-to-use published methods, like a Qiagen kit

Free!


You use R by typing at the prompt

There is no pull-down menu of statistical commands


What’s good about this approach?

chain analyses
work with multiple datasets
use packages of code
easy to reproduce
runs on anything
makes sense to computer programmers


What’s hard about this approach?

hard to get started
cryptic commands
built-in help is amusingly unhelpful


packages: collections of R functions
a package is a collection of R code that solves a specific task

limma: microarray normalization and analysis
samr: differential expression
impute: dealing with missing data

downloaded for free from a central repository


Bioconductor
Curated collection of R packages
Microarrays, aCGH, sequence analysis, advanced statistics, graphics, lots more


Learning R data types by comparing them to Excel spreadsheets

Comparing Excel and R

Excel
  Easy tasks are easy
  Non-trivial tasks impossible or expensive
  No paper trail
  Mangles gene names
  Plots look terrible

R
  Easy jobs are hard at first
  Non-trivial things are possible
  Easy to make a paper trail
  Biostatistics researchers publish tools in R
  Can create publication-ready plots


Organizing data in Excel

Each subject has a row.
Each column has a feature of your subjects.


R calls the data points variables

variables
numbers and characters (letters, words)

numbers: 2.6, 4
characters: “Flopsy”, “white, brown paws”


R calls the columns vectors

vectors
ordered collections of a variable

name: [“Flopsy”, “Mopsy”, “Cottontail”, “Peter”]
age: [2.5, 2.6, 2.5, 4]
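In R itself, the two columns above could be written as follows. This is a minimal sketch; the variable names `name` and `age` simply mirror the column labels on the slide.

```r
# c() combines values into a vector; all elements share one type.
name = c("Flopsy", "Mopsy", "Cottontail", "Peter")
age = c(2.5, 2.6, 2.5, 4)

length(age)        # number of elements in the vector
age[4]             # the fourth element
name[age > 2.5]    # names of the rabbits older than 2.5
```

The last line shows why vectors are convenient: a comparison on one column can be used to pick elements out of another.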


R calls the data set a data frame

data frame
a list of vectors (columns) that have names
elements can be read and written by row & column


I can slice and dice the data frame
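A short sketch of reading, slicing, and writing a data frame, reusing the rabbit example. The data frame name `rabbits` and the subset name `old.rabbits` are made up for illustration.

```r
# Build a data frame from two named vectors (columns).
rabbits = data.frame(
  name = c("Flopsy", "Mopsy", "Cottontail", "Peter"),
  age = c(2.5, 2.6, 2.5, 4),
  stringsAsFactors = FALSE
)

rabbits$age            # one column, returned as a vector
rabbits[2, ]           # the second row (one subject)
rabbits[2, "name"]     # a single element, by row and column

# Slice: keep only the rows where age is greater than 2.5.
old.rabbits = rabbits[rabbits$age > 2.5, ]

# Write: change one element in place.
rabbits[1, "age"] = 3
```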


Tell R to do things using functions
function_name( details about how to do it )

generate a sequence from 1 to 5 counting by 0.5
parameters for seq are named from, to, and by
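The call described above would look something like this. Storing the result in `my.data` is an assumption, chosen to match the name used on the following slides.

```r
# Generate the numbers from 1 to 5 in steps of 0.5, naming each parameter.
my.data = seq(from = 1, to = 5, by = 0.5)
print(my.data)
```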



report the mean of my.data. Result of one function is fed into another one.
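A sketch of both ideas, assuming `my.data` holds the sequence from 1 to 5 in steps of 0.5:

```r
my.data = seq(from = 1, to = 5, by = 0.5)

# Report the mean of my.data.
data.mean = mean(my.data)

# The result of one function (seq) is fed directly into another (mean).
nested.mean = mean(seq(from = 1, to = 5, by = 0.5))

print(data.mean)
```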



define a new function that adds 2 to whatever it’s passed

compare to original value of my.data
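One way to write such a function is sketched below. The names `add.two` and `shifted` are hypothetical, and `my.data` is assumed to hold the sequence from the earlier slide.

```r
# Define a new function that adds 2 to whatever it is passed.
add.two = function(x) {
  x + 2
}

my.data = seq(from = 1, to = 5, by = 0.5)
shifted = add.two(my.data)

# The function returns a new vector; my.data itself is unchanged.
print(shifted)
print(my.data)
```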


Walk-through of a straightforward analysis

Primary data from METABRIC study

gene expression
TP53 sequence

1,400 samples from 5 hospitals

Is there an association between breast cancer subtype and TP53 mutation?


Tasks

Normalize data
  batch effects
  unwanted inter-sample variation

Identify outliers

Test for associations between p53 and subtype


Quantile Normalization (limma)

Force every array to have the same distribution of expression intensities

> library(limma)
> raw = read.table('raw_extract.txt', ...)
> raw.normalized = normalizeQuantiles( as.matrix( raw ) )
> normalized = log2( raw.normalized )
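To show what quantile normalization does under the hood, here is a base-R sketch on simulated intensities. It is not limma's implementation (which handles ties and missing values more carefully); the function name `quantile.normalize` and the simulated arrays are made up for illustration.

```r
# Quantile-normalize the columns (arrays) of a numeric matrix.
quantile.normalize = function(m) {
  # Rank of each value within its own column.
  ranks = apply(m, 2, rank, ties.method = "first")
  # Sort each column, then average across columns at each rank.
  sorted = apply(m, 2, sort)
  target = rowMeans(sorted)
  # Replace each value with the target value for its rank.
  normalized = apply(ranks, 2, function(r) target[r])
  dimnames(normalized) = dimnames(m)
  normalized
}

# Two simulated arrays with very different intensity distributions.
set.seed(1)
raw = cbind(array1 = rexp(100, rate = 1), array2 = rexp(100, rate = 0.2))
normalized = quantile.normalize(raw)

# After normalization the two arrays contain exactly the same set of values.
summary(normalized)
```

Within each array the ordering of genes is preserved; only the distribution of values is forced to be shared.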


Identify batch effects in microarrays

Principal Components Analysis
Identify the axes of maximal (strongest) variation in a matrix

[Figure: scatter plot of gene 1 vs. gene 2 with the first principal component drawn through the points; in a second panel the same axes separate group A from group B]


PCA identifies a batch effect

[Figure: samples plotted on the first principal component vs. the second principal component; hospital 3 (yellow) separates from the other samples]

> my.pca = prcomp( t( expression.data ) )
> plot( my.pca$x[, 1:2], ... )
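The sketch below reproduces the idea on simulated data rather than the METABRIC arrays. The matrix sizes, the `hospital` labels, and the size of the shift applied to hospital 3 are all assumptions chosen to make the batch effect visible.

```r
set.seed(1)
n.genes = 200
n.samples = 30
hospital = rep(c("hospital1", "hospital2", "hospital3"), each = 10)

# Rows are genes and columns are samples.
expression.data = matrix(rnorm(n.genes * n.samples), nrow = n.genes)

# Simulated batch effect: every gene reads higher in hospital 3 samples.
in.h3 = hospital == "hospital3"
expression.data[, in.h3] = expression.data[, in.h3] + 2

# prcomp expects samples in rows, so transpose first.
my.pca = prcomp(t(expression.data))
scores = my.pca$x[, 1:2]

# Color each sample by hospital; hospital 3 is drawn in gold.
sample.colors = c("blue", "red", "gold")[as.integer(factor(hospital))]
plot(scores, col = sample.colors, pch = 19,
     xlab = "first principal component",
     ylab = "second principal component")
```

Because the shift affects all genes at once, it dominates the variation in the matrix and shows up on the first principal component.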


Batch correction reduces bias (ComBat)

[Figure: samples plotted on the first principal component vs. the second principal component after batch correction]

ComBat (available in the Bioconductor sva package) reduces user-defined batch effects


Molecular subtypes of breast carcinoma, defined by gene expression

Luminal A, N=507
Luminal B, N=379
Her2, N=161
Basal, N=234

ER status

> sa = read.table('patients.txt', ...)
> tumor.counts = table( sa$ER.status, sa$PAM50Subtype )
(convert counts to percentages)
> barplot( c( tumor.counts[1], tumor.counts[2] ), col=c("red","green"), ... )
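The "convert counts to percentages" step is not shown on the slide; one possible way to do it is with `prop.table`. The eight patients below are invented purely so that the sketch runs, and only the column names `ER.status` and `PAM50Subtype` come from the slide.

```r
# Toy annotation table (invented values, NOT the METABRIC data).
sa = data.frame(
  ER.status = c("positive", "positive", "positive", "negative",
                "positive", "negative", "negative", "negative"),
  PAM50Subtype = c("LumA", "LumA", "LumB", "LumB",
                   "Her2", "Her2", "Basal", "Basal"),
  stringsAsFactors = FALSE
)

# Count tumors by ER status (rows) and subtype (columns).
tumor.counts = table(sa$ER.status, sa$PAM50Subtype)

# Convert counts to percentages within each subtype (each column sums to 100).
tumor.percent = 100 * prop.table(tumor.counts, margin = 2)

# Stacked bars: one bar per subtype, split by ER status.
barplot(tumor.percent, col = c("red", "green"),
        legend.text = rownames(tumor.percent), ylab = "% of tumors")
```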


Find interactions: TP53 and subtype

Fit a linear model:
> fitted.model = lm( dependent ~ independent )

Perform Analysis of Variance:
> anova( fitted.model )

general form of my analysis:
> anova( lm( gene.expression ~ PAM * TP53 ) )

18,000 genes
PAM: {LumA, LumB, Her2, Basal}
TP53: {mutant, WT}
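Here is a sketch of the general form on one simulated gene. The group sizes and the planted effect (higher expression only in TP53-WT Basal tumors) are assumptions made for illustration; only the variable names `gene.expression`, `PAM`, and `TP53` come from the slide.

```r
set.seed(1)
n.per.group = 25

# Four subtypes, each with 25 WT and 25 mutant tumors.
PAM = factor(rep(c("LumA", "LumB", "Her2", "Basal"), each = 2 * n.per.group))
TP53 = factor(rep(rep(c("WT", "mutant"), each = n.per.group), times = 4))

# Simulated expression: noise everywhere, plus a shift in Basal WT only.
gene.expression = rnorm(length(PAM), mean = 8, sd = 1)
is.basal.wt = PAM == "Basal" & TP53 == "WT"
gene.expression[is.basal.wt] = gene.expression[is.basal.wt] + 2

# PAM * TP53 expands to PAM + TP53 + PAM:TP53 (the interaction term).
fitted.model = lm(gene.expression ~ PAM * TP53)
anova.table = anova(fitted.model)
print(anova.table)

# The p-value for the interaction is stored in the "PAM:TP53" row.
interaction.p = anova.table["PAM:TP53", "Pr(>F)"]
```

A small interaction p-value means the effect of TP53 status on expression depends on the subtype, which is the pattern planted in the simulation.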


Automate with loops
Calculate anova for 18,000 genes by looping through each gene and storing the result.

> n_genes = 18000
> result = rep( 0, n_genes )
> for( counter in 1:n_genes ){ result[counter] = anova(...) }

repeat 18,000 times
sort results
identify significant interaction
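A runnable sketch of the loop on simulated data. It uses 50 genes instead of 18,000 so that it finishes quickly, stores the interaction p-value for each gene (an assumption about what the elided `anova(...)` call keeps), and plants a signal in gene 7. The matrix name `expression.matrix` is hypothetical.

```r
set.seed(1)
n_genes = 50          # 18,000 in the real analysis
n.per.group = 10
PAM = factor(rep(c("LumA", "LumB", "Her2", "Basal"), each = 2 * n.per.group))
TP53 = factor(rep(rep(c("WT", "mutant"), each = n.per.group), times = 4))

# Rows are genes, columns are tumors; pure noise except for gene 7.
expression.matrix = matrix(rnorm(n_genes * length(PAM)), nrow = n_genes)
is.basal.wt = PAM == "Basal" & TP53 == "WT"
expression.matrix[7, is.basal.wt] = expression.matrix[7, is.basal.wt] + 3

# Loop through each gene and store the interaction p-value.
result = rep(0, n_genes)
for (counter in 1:n_genes) {
  fitted.model = lm(expression.matrix[counter, ] ~ PAM * TP53)
  result[counter] = anova(fitted.model)["PAM:TP53", "Pr(>F)"]
}

# Sort: gene indices ordered from smallest to largest p-value.
ranked = order(result)
head(ranked)
```

With thousands of genes, the sorted p-values would still need a multiple-testing correction before any single gene is called significant.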


Immune infiltration in TP53-WT Basal

[Figure: CD3E log2 expression by infiltration level (absent, mild, severe)]


Does p53 have a role in immune surveillance?

Next steps: getting help and learning more

online forums: expert help for free

all of bioinformatics
Nextgen sequencing
statistics


UCSF resources

Library classes and information
Formal courses (BMI, Biostatistics)
Cores (Computational Biology, Genomics)
QGDG monthly methods discussion group


Online classes and blogs

Free courses on data analysis
  http://jhudatascience.org
  simplystatistics.org
  Coursera etc...

Good tutorials on sequence analysis
  http://evomics.org/learning


Reproducible research? You mean, there’s another kind?

Replicate a wet lab experiment

detailed protocols (not printed in the methods)
extensive optimization
reagents that might be unique or hard to get
techniques that require years of experience


Replicate a dry lab experiment

published algorithms (if novel)
published source code
  sometimes “available from the authors”
well-specified input and deterministic output
no reagents
  Okay, maybe a supercomputer or cloud

How hard could it be?


Many chances to make honest errors

Bookkeeping errors
  Transposed column headers
  Out-of-date/changed annotations
  Off-by-one
  Misunderstood sample labels

Batch effects
Cryptic cohort stratification
Inappropriate analytical methods
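Some bookkeeping errors can be caught with a few defensive checks before any analysis. The sketch below uses a tiny invented expression matrix and annotation table whose rows are deliberately in a different order; the names `expression.data`, `sa`, and `sample.id` are assumptions for illustration.

```r
# Invented data: 3 genes x 4 samples, annotation rows in a different order.
expression.data = matrix(1:12, nrow = 3,
  dimnames = list(c("geneA", "geneB", "geneC"), c("s1", "s2", "s3", "s4")))
sa = data.frame(
  sample.id = c("s3", "s1", "s4", "s2"),
  subtype = c("LumA", "Basal", "Her2", "LumB"),
  stringsAsFactors = FALSE
)

# Check 1: both tables describe exactly the same set of samples.
stopifnot(setequal(colnames(expression.data), sa$sample.id))

# Reorder the annotation rows to match the expression columns.
sa = sa[match(colnames(expression.data), sa$sample.id), ]

# Check 2: labels now line up position by position (no off-by-one).
stopifnot(identical(colnames(expression.data), sa$sample.id))
```

`stopifnot` halts the script as soon as a check fails, so a mislabeled or shifted table cannot silently flow into the downstream statistics.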


eQTL differences between ethnic cohorts

Claim:
Many genetic loci associated with gene expression differ between Asian and western people


poor study design
batch processing effects

2003-2004: processed European samples
2005-2006: processed Asian samples

Processing year perfectly confounded with ethnicity.
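Perfect confounding of this kind is easy to detect by cross-tabulating the biological variable against the processing batch. The eight samples below are invented to mirror the design described above; the column names are hypothetical.

```r
# Invented sample sheet mirroring the design: one ethnicity per time window.
samples = data.frame(
  ethnicity = rep(c("European", "Asian"), each = 4),
  processing.year = c(2003, 2003, 2004, 2004, 2005, 2005, 2006, 2006),
  stringsAsFactors = FALSE
)
samples$batch = ifelse(samples$processing.year <= 2004, "2003-2004", "2005-2006")

# Cross-tabulate ethnicity against processing batch.
design = table(samples$ethnicity, samples$batch)
print(design)

# Perfect confounding: each row and each column has only one non-zero cell,
# so no analysis can separate the batch effect from the ethnicity effect.
confounded = all(rowSums(design > 0) == 1) && all(colSums(design > 0) == 1)
print(confounded)
```

A balanced design would instead have samples from every group in every batch, so that the table has no empty cells.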

