An introduction to quantitative biology and R


Upload: geneva

Post on 08-Jan-2016

TRANSCRIPT

Page 1: An introduction to  quantitative biology and R

An introduction to quantitative biology and R

David Quigley

[email protected]
Helen Diller Comprehensive Cancer Center, UCSF
Institute for Cancer Research, University of Oslo

Page 2: An introduction to  quantitative biology and R

Molecular biology 20 years ago

Suzuki, Med Mol Morph 2010; Oh, PNAS 1996; Mao, Genes Dev 2004

qualitative methods
small-scale quantitative tests

Page 3: An introduction to  quantitative biology and R

Molecular biology now

Nik-Zainal, Cell 2012; Fullwood, Nature 2009; CGAN, Nature 2012

qualitative methods
small-scale quantitative tests
large-scale quantitative analysis
  microarrays, *seq methods, cell phenotype screens...

Page 4: An introduction to  quantitative biology and R

Hypothesis-generating:
  Normalize breast tumor expression data (three cohorts)
  Call genotypes from SNP arrays
  Calculate association between genotype and expression genome-wide (eQTL)
  Identify an interesting candidate
  Validate eQTL in independent cohort

Hypothesis-driven:
  Methylation analysis at MRPS30 promoter
  In vitro (two cell lines):
    ChIP-PCR +/- estrogen
    qPCR
    sequencing

Many studies use both approaches

Page 5: An introduction to  quantitative biology and R

Common quantitative techniques

Gene Expression
  transcription
  splicing
  methylation
  protein binding (ChIP-seq)

Genomics and Genetics
  de novo assembly
  DNA copy number (CNV and tumors)
  germline variant analysis
  tumor variant analysis
    SNPs, indels, translocation
  analysis of tumor clonality

Page 6: An introduction to  quantitative biology and R

Challenges

requires statistical sophistication
  in study design
  in interpretation

many data points
  1,000 to 1,000,000 measurements per sample
  many false positives which look like great stories

software becomes part of the experiment
divide between engineering, biology culture & thinking

Page 7: An introduction to  quantitative biology and R

Schillebeeckx Nature Biotech. 2013

Wet lab and quantitative skills: better job prospects


Page 8: An introduction to  quantitative biology and R

What approaches are used to analyze quantitative data?

Page 9: An introduction to  quantitative biology and R

Choosing a tool

Cost
Learning curve
Ease of use
Flexibility (closed to open-ended)
Software ecosystem (none to extensive)

Page 10: An introduction to  quantitative biology and R

Traditional programming languages
Python, C++, Java, others

can solve any computable problem
creates the fastest tools
free
requires programming expertise
complex to write and test
high effort

Page 11: An introduction to  quantitative biology and R

Specialized single-purpose programs

command line tools
  academic research
  type commands at a prompt or run scripts
  PLINK, bowtie, GATK, bedtools

GUI (point and click)
  commercial software for a vendor’s platform
  slick, opaque, hard/impossible to automate

Page 12: An introduction to  quantitative biology and R

Commercial statistics programs

STATA, SPSS, GraphPad, others

1) Load one dataset
2) Select analysis by clicking on a GUI
3) Generate a report

May have a built-in language
Very mature tools for traditional biostatistics
Not free

Page 13: An introduction to  quantitative biology and R

Web-based tools

Galaxy
  string together pre-defined analysis steps
  very easy to use
  reproducible

Page 14: An introduction to  quantitative biology and R

Using R is like writing and using software

Traditionally, biologists did not do this.

R was written by statisticians to be a free replica of another language called “S”.

R: a “software environment”

Page 15: An introduction to  quantitative biology and R

Why is R popular?

Flexible, open-ended, open-source

Large library of packages
  package: easy-to-use published methods, like a Qiagen kit

Free!

Page 16: An introduction to  quantitative biology and R

You use R by typing at the prompt

There is no pull-down menu of statistical commands


Page 17: An introduction to  quantitative biology and R

What’s good about this approach?

chain analyses
work with multiple datasets
use packages of code
easy to reproduce
runs on anything
makes sense to computer programmers

Page 18: An introduction to  quantitative biology and R

What’s hard about this approach?

hard to get started
cryptic commands
built-in help is amusingly unhelpful

Page 19: An introduction to  quantitative biology and R

packages: collections of R functions
a collection of R code that solves a specific task

limma: microarray normalization and analysis
samr: differential expression
impute: dealing with missing data

downloaded for free from a central repository

Page 20: An introduction to  quantitative biology and R

Bioconductor
Curated collection of R packages
Microarrays, aCGH, sequence analysis, advanced statistics, graphics, lots more

Page 21: An introduction to  quantitative biology and R

Learning R data types by comparing them to Excel spreadsheets

Page 22: An introduction to  quantitative biology and R

Comparing Excel and R

Excel
  Easy tasks are easy
  Non-trivial tasks impossible or expensive
  No paper trail
  Mangles gene names
  Plots look terrible

Page 23: An introduction to  quantitative biology and R

Comparing Excel and R

Excel
  Easy tasks are easy
  Non-trivial tasks impossible or expensive
  No paper trail
  Mangles gene names
  Plots look terrible

R
  Easy jobs are hard at first
  Non-trivial things are possible
  Easy to make a paper trail
  Biostatistics researchers publish tools in R
  Can create publication-ready plots

Page 24: An introduction to  quantitative biology and R

Organizing data in Excel

Each subject has a row.
Each column has a feature of your subjects.

Page 25: An introduction to  quantitative biology and R

R calls the data points variables

variables
  numbers and characters (letters, words)

numbers: 2.6, 4
characters: “Flopsy”, “white, brown paws”
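A minimal sketch of the two kinds of variable named on this slide, typed as they would be at the R prompt. The variable names `age`, `name`, and `coat` are made up for illustration; the values are the slide's own.

```r
# Hypothetical variable names; the values come from the slide
age = 2.6                      # a number
name = "Flopsy"                # a character string
coat = "white, brown paws"     # the comma inside the quotes is part of the string

class(age)     # reports the type of a number
class(name)    # reports the type of a character string
```

Character values are always written inside quotes; without them R would look for a variable called Flopsy.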

Page 26: An introduction to  quantitative biology and R

R calls the columns vectors

vectors
  ordered collections of a variable

name: [“Flopsy”, “Mopsy”, “Cottontail”, “Peter”]
age: [2.5, 2.6, 2.5, 4]
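The two columns above could be written in R roughly as follows. This is a sketch using the base function `c()`, which combines values into a vector; the square-bracket notation on the slide is shorthand, not R syntax.

```r
# c() combines values into a vector; every element has the same type
name = c("Flopsy", "Mopsy", "Cottontail", "Peter")
age = c(2.5, 2.6, 2.5, 4)

length(name)   # number of elements in the vector
age[4]         # the fourth element (R counts from 1)
mean(age)      # many functions work on a whole vector at once
```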

Page 27: An introduction to  quantitative biology and R

R calls the data set a data frame

data frame
  a list of vectors (columns) that have names
  elements can be read and written by row & column
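As an illustration, the rabbit columns from the previous slide can be bundled into a data frame. The name `rabbits` is an assumption made for this sketch.

```r
# Build a data frame from two named columns (hypothetical name: rabbits)
rabbits = data.frame(
  name = c("Flopsy", "Mopsy", "Cottontail", "Peter"),
  age = c(2.5, 2.6, 2.5, 4),
  stringsAsFactors = FALSE   # keep names as plain character strings
)

dim(rabbits)               # number of rows, number of columns
names(rabbits)             # the column names
rabbits$age                # read one column by name
rabbits[2, "age"]          # read one element by row and column
rabbits[2, "age"] = 2.7    # write one element by row and column
```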

Page 28: An introduction to  quantitative biology and R

I can slice and dice the data frame

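The slide's own commands were shown as an image and are not recoverable, so the following is only a sketch of common ways to slice a data frame, reusing the hypothetical `rabbits` example.

```r
rabbits = data.frame(
  name = c("Flopsy", "Mopsy", "Cottontail", "Peter"),
  age = c(2.5, 2.6, 2.5, 4),
  stringsAsFactors = FALSE
)

first.two = rabbits[1:2, ]               # rows 1 and 2, all columns
ages.only = rabbits[, "age"]             # all rows, one column
older = rabbits[rabbits$age > 2.5, ]     # only rows that pass a test
by.age = rabbits[order(rabbits$age), ]   # rows reordered by age
```

The pattern is always `data[rows, columns]`; leaving one side of the comma empty means "all of them".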

Page 29: An introduction to  quantitative biology and R

Tell R to do things using functions
function_name( details about how to do it )

generate sequence from 1 to 5 counting by 0.5
parameters for seq are named from, to, and by
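The call described above would look something like this; the variable name `half.steps` is made up for the sketch.

```r
# Name each parameter explicitly
half.steps = seq(from = 1, to = 5, by = 0.5)
half.steps

# The same call, matching parameters by position instead of by name
seq(1, 5, 0.5)

length(half.steps)   # how many values were generated
```

Naming the parameters is longer to type but makes the call easier to read later.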

Page 30: An introduction to  quantitative biology and R

Tell R to do things using functions
function_name( details about how to do it )

report the mean of my.data.
Result of one function is fed into another one.
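A small sketch of both ideas. The slide does not show what `my.data` contains, so here it is assumed to be the sequence from the previous slide.

```r
# Assumed contents of my.data (not shown on the slide)
my.data = seq(from = 1, to = 5, by = 0.5)

mean(my.data)                              # report the mean

# Feed the result of one function straight into another
mean(seq(from = 1, to = 5, by = 0.5))      # seq() runs first, then mean()
round(mean(my.data), digits = 1)           # mean() runs first, then round()
```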

Page 31: An introduction to  quantitative biology and R

Tell R to do things using functions
function_name( details about how to do it )

define a new function that adds 2 to whatever it’s passed
compare to original value of my.data
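One way to write such a function is sketched below. The function name `add.two` and the contents of `my.data` are assumptions, since the slide's code was an image.

```r
# Define a function that adds 2 to whatever it is passed
add.two = function(x) {
  x + 2    # the last value computed is what the function returns
}

my.data = c(1, 2, 3)              # assumed example values
my.data.plus = add.two(my.data)

my.data.plus   # each element is 2 larger
my.data        # the original is unchanged
```

Calling a function does not modify its input; the result has to be assigned to a name to be kept.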

Page 32: An introduction to  quantitative biology and R

Walk through a straightforward analysis

Page 33: An introduction to  quantitative biology and R

Primary data from METABRIC study

gene expression
TP53 sequence
1,400 samples from 5 hospitals

Is there an association between breast cancer subtype and TP53 mutation?

Page 34: An introduction to  quantitative biology and R

Tasks

Normalize data
  batch effects
  unwanted inter-sample variation

Identify outliers

associations between p53 and subtype

Page 35: An introduction to  quantitative biology and R

Quantile Normalization (limma)

Force every array to have the same distribution of expression intensities

> library(limma)
> raw = read.table('raw_extract.txt', ...)
> # limma's quantile normalization function is normalizeQuantiles();
> # normalize.quantiles() belongs to the preprocessCore package
> raw.normalized = normalizeQuantiles( as.matrix( raw ) )
> normalized = log2( raw.normalized )
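To show what "the same distribution" means, here is a base-R sketch of the idea behind quantile normalization. It is an illustration on simulated data, not limma's implementation: the function name `quantile.normalize` is made up, ties are broken arbitrarily, and missing values are not handled.

```r
# Sketch of quantile normalization (hypothetical helper, not limma's code)
quantile.normalize = function(x) {
  # x: numeric matrix, genes in rows, arrays in columns, no missing values
  ranks = apply(x, 2, rank, ties.method = "first")   # rank of each value within its array
  sorted = apply(x, 2, sort)                         # each array sorted low to high
  target = rowMeans(sorted)                          # average value at each rank
  normalized = apply(ranks, 2, function(r) target[r])  # replace each value by the target at its rank
  dimnames(normalized) = dimnames(x)
  normalized
}

# Simulated intensities for two arrays with different distributions
set.seed(1)
raw = cbind(array1 = rexp(100, rate = 1),
            array2 = rexp(100, rate = 0.5) + 3)
norm = quantile.normalize(raw)

summary(norm[, "array1"])
summary(norm[, "array2"])    # same summary as array1 after normalization
```

After normalization every array contains exactly the same set of values; only the order of genes within each array differs.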

Page 36: An introduction to  quantitative biology and R

Identify batch effects in microarrays

Principal Components Analysis
Identify strongest variation in a matrix

[Figure: scatter plot of samples; axes gene 1 and gene 2]

Page 37: An introduction to  quantitative biology and R

Identify batch effects in microarrays

Principal Components Analysis
Identify axes of maximal variation in a matrix

[Figure: scatter plot of samples; axes gene 1 and gene 2, with the first principal component drawn through the points]

Page 38: An introduction to  quantitative biology and R

Identify batch effects in microarrays

Principal Components Analysis
Identify strongest variation in a matrix

[Figure: two scatter plots of samples; axes gene 1 and gene 2; points labeled group A and group B]

Page 39: An introduction to  quantitative biology and R

PCA identifies a batch effect

[Figure: samples plotted on the first and second principal components; hospital 3 shown in yellow]

> my.pca = prcomp( t( expression.data ) )
> plot( my.pca$x[ , 1:2 ], ... )   # sample scores on the first two components
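The following sketch reproduces the idea on simulated data. Everything here is invented for illustration: the matrix sizes, the hospital labels, and the artificial shift added to hospital 3 are assumptions, not the METABRIC data.

```r
set.seed(42)
n.genes = 200
n.samples = 30
hospital = rep(c("h1", "h2", "h3"), each = 10)   # made-up batch labels

# Simulated expression matrix: genes in rows, samples in columns
expression.data = matrix(rnorm(n.genes * n.samples), nrow = n.genes)

# Mimic a batch effect: shift half of the genes in hospital 3 samples
shifted = 1:100
expression.data[shifted, hospital == "h3"] =
  expression.data[shifted, hospital == "h3"] + 3

# prcomp() expects samples in rows, so transpose first
my.pca = prcomp(t(expression.data))
scores = my.pca$x[, 1:2]

plot(scores,
     col = ifelse(hospital == "h3", "gold", "grey40"), pch = 19,
     xlab = "first principal component",
     ylab = "second principal component")
```

If samples from one hospital cluster apart from the rest along a leading component, as the simulated hospital 3 does here, that is a hint that processing batch drives a large share of the variation.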

Page 40: An introduction to  quantitative biology and R

batch correction reduces bias (ComBat)

[Figure: samples plotted on the first and second principal components after batch correction]

ComBat package reduces user-defined batch effects
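ComBat itself fits an empirical Bayes model and is distributed in the Bioconductor `sva` package. The sketch below is deliberately much simpler and is not ComBat: it only centers each gene within each batch, to illustrate what "reducing a user-defined batch effect" means. The function name `center.by.batch` and the simulated data are assumptions.

```r
# Simplified batch adjustment (per-batch gene centering), NOT ComBat
center.by.batch = function(x, batch) {
  # x: genes in rows, samples in columns; batch: one label per column
  corrected = x
  for (b in unique(batch)) {
    in.batch = batch == b
    batch.means = rowMeans(x[, in.batch, drop = FALSE])
    corrected[, in.batch] = x[, in.batch, drop = FALSE] - batch.means
  }
  corrected + rowMeans(x)   # put each gene's overall mean back
}

set.seed(3)
batch = rep(c("a", "b", "c"), each = 4)
x = matrix(rnorm(50 * 12), nrow = 50)
x[, batch == "c"] = x[, batch == "c"] + 2     # simulated batch shift

corrected = center.by.batch(x, batch)
```

Centering like this also removes any real biological difference that happens to line up with batch, which is one reason to prefer a purpose-built method and a balanced study design.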

Page 41: An introduction to  quantitative biology and R

Molecular subtypes of breast carcinoma, defined by gene expression

[Figure: ER status within each subtype: Luminal A (N=507), Luminal B (N=379), Her2 (N=161), Basal (N=234)]

> sa = read.table('patients.txt', ...)
> tumor.counts = table( sa$ER.status, sa$PAM50Subtype )
(convert counts to percentages)
> barplot( c( tumor.counts[1], tumor.counts[2] ), col=c("red","green"), ... )
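The "convert counts to percentages" step is not shown on the slide. One possible way to do it uses base R's `prop.table()`, sketched here on a tiny invented sample sheet; the eight patients and the variable name `tumor.percent` are made up.

```r
# Invented stand-in for patients.txt
sa = data.frame(
  ER.status = c("pos", "pos", "neg", "pos", "neg", "neg", "pos", "neg"),
  PAM50Subtype = c("LumA", "LumA", "LumA", "LumB", "Basal", "Basal", "Her2", "Her2"),
  stringsAsFactors = FALSE
)

# Rows are ER status, columns are subtype
tumor.counts = table(sa$ER.status, sa$PAM50Subtype)

# margin = 2 divides each column by its own total
tumor.percent = 100 * prop.table(tumor.counts, margin = 2)

barplot(tumor.percent, col = c("red", "green"),
        legend.text = rownames(tumor.percent),
        ylab = "% of tumors")
```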

Page 42: An introduction to  quantitative biology and R

Find interactions: TP53 and subtype

Fit a linear model:
> fitted.model = lm( dependent ~ independent )

Perform Analysis of Variance:
> anova( fitted.model )

general form of my analysis:
> anova( lm( gene.expression ~ PAM * TP53 ) )

18,000 genes
PAM: {LumA, LumB, Her2, Basal}
TP53: {mutant, WT}
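A worked sketch of the general form on one simulated gene. The sample size, the random labels, and the planted effect (higher expression only in TP53-WT Basal tumors) are all invented for illustration.

```r
set.seed(7)
n = 200
PAM = factor(sample(c("LumA", "LumB", "Her2", "Basal"), n, replace = TRUE))
TP53 = factor(sample(c("WT", "mutant"), n, replace = TRUE))

# Simulated gene: raised only when subtype is Basal AND TP53 is WT
gene.expression = rnorm(n) + ifelse(PAM == "Basal" & TP53 == "WT", 2, 0)

# PAM * TP53 expands to PAM + TP53 + PAM:TP53 (the interaction term)
fitted.model = lm(gene.expression ~ PAM * TP53)
result = anova(fitted.model)
result

# P-value for the interaction row of the ANOVA table
p.interaction = result["PAM:TP53", "Pr(>F)"]
p.interaction
```

The interaction row asks whether the effect of TP53 status on expression depends on subtype, which is the question posed on this slide.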

Page 43: An introduction to  quantitative biology and R

Automate with loops
Calculate anova for 18,000 genes by looping through each gene and storing result.

> n_genes = 18000
> result = rep( 0, n_genes )
> for( counter in 1:n_genes ){ result[counter] = anova(...) }

sort results
identify significant interaction

repeat 18,000 times
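A runnable sketch of the loop on simulated data, with 50 genes standing in for 18,000. The data, the planted interaction in the first gene, and the choice of Benjamini-Hochberg adjustment via `p.adjust()` are assumptions; the slide does not say how significance was assessed.

```r
set.seed(11)
n_genes = 50        # small stand-in for 18,000
n_samples = 120
PAM = factor(sample(c("LumA", "LumB", "Her2", "Basal"), n_samples, replace = TRUE))
TP53 = factor(sample(c("WT", "mutant"), n_samples, replace = TRUE))

expression.data = matrix(rnorm(n_genes * n_samples), nrow = n_genes,
                         dimnames = list(paste0("gene", 1:n_genes), NULL))

# Plant an interaction in gene1 only
hit = PAM == "Basal" & TP53 == "WT"
expression.data[1, hit] = expression.data[1, hit] + 3

# Store one interaction P-value per gene
result = rep(NA_real_, n_genes)
names(result) = rownames(expression.data)
for (counter in 1:n_genes) {
  y = expression.data[counter, ]
  fit = lm(y ~ PAM * TP53)
  result[counter] = anova(fit)["PAM:TP53", "Pr(>F)"]
}

sorted = sort(result)                        # smallest P-values first
adjusted = p.adjust(sorted, method = "BH")   # correct for testing many genes
head(adjusted)
```

With thousands of tests, some small P-values appear by chance alone, so an adjustment for multiple testing is one way to guard against the false positives mentioned under Challenges.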

Page 44: An introduction to  quantitative biology and R

Immune infiltration in TP53-WT Basal

[Figure: CD3E log2 expression by infiltration level (absent, mild, severe)]

Does p53 have a role in immune surveillance?

Page 45: An introduction to  quantitative biology and R

Next steps: getting help and learning more

Page 46: An introduction to  quantitative biology and R

online forums: expert help for free

all of bioinformatics


Page 47: An introduction to  quantitative biology and R

online forums: expert help for free

all of bioinformatics
Nextgen sequencing

Page 48: An introduction to  quantitative biology and R

online forums: expert help for free

all of bioinformatics
Nextgen sequencing
statistics

Page 49: An introduction to  quantitative biology and R

UCSF resources

Library classes and information
Formal courses (BMI, Biostatistics)
Cores (Computational Biology, Genomics)
QGDG monthly methods discussion group

Page 50: An introduction to  quantitative biology and R

Online classes and blogs

Free courses on data analysis
  http://jhudatascience.org
  simplystatistics.org
  Coursera etc...

Good tutorials on sequence analysis
  http://evomics.org/learning

Page 51: An introduction to  quantitative biology and R

Reproducible research? You mean, there’s another kind?

Page 52: An introduction to  quantitative biology and R

Replicate a wet lab experiment

detailed protocols (not printed in the methods)
extensive optimization
reagents that might be unique or hard to get
techniques that require years of experience

Page 53: An introduction to  quantitative biology and R

Replicate a dry lab experiment

published algorithms (if novel)
published source code
  sometimes “available from the authors”
well-specified input and deterministic output
no reagents
  Okay, maybe a supercomputer or cloud

How hard could it be?

Page 54: An introduction to  quantitative biology and R

Many chances to make honest errors

Bookkeeping errors
  Transposed column headers
  Out-of-date/changed annotations
  Off-by-one
  Misunderstood sample labels

Batch effects
Cryptic cohort stratification
Inappropriate analytical methods

Page 55: An introduction to  quantitative biology and R

eQTL differences between ethnic cohorts

Claim:
Many genetic loci associated with gene expression differ between Asian, western people

Page 56: An introduction to  quantitative biology and R

poor study design
batch processing effects

2003-2004: processed European samples
2005-2006: processed Asian samples

Processing year perfectly confounded with ethnicity.

Claim:
Many genetic loci associated with gene expression differ between Asian, western people
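A quick cross-tabulation is one way to spot this kind of confounding before any analysis is run. The sample sheet below is invented to mirror the slide's design; the column names and the eight samples are assumptions.

```r
# Invented sample sheet mirroring the design on this slide
samples = data.frame(
  ethnicity = rep(c("European", "Asian"), each = 4),
  year = c(2003, 2004, 2003, 2004, 2005, 2006, 2005, 2006)
)
samples$period = ifelse(samples$year <= 2004, "2003-2004", "2005-2006")

# Cross-tabulate the variable of interest against the processing batch
design = table(samples$ethnicity, samples$period)
design

# Perfect confounding: every row and every column has one non-empty cell
confounded = all(rowSums(design > 0) == 1) && all(colSums(design > 0) == 1)
confounded
```

When the table looks like this, no statistical method can tell an ethnicity effect from a processing-year effect, because the two labels carry identical information.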