An introduction to quantitative biology and R
David Quigley
[email protected]
Helen Diller Comprehensive Cancer Center, UCSF
Institute for Cancer Research, University of Oslo
Molecular biology 20 years ago
qualitative methods
small-scale quantitative tests
(Figures: Suzuki, Med Mol Morph 2010; Oh, PNAS 1996; Mao, Genes Dev 2004)
Molecular biology now
qualitative methods
small-scale quantitative tests
large-scale quantitative analysis
  microarrays, *seq methods, cell phenotype screens...
(Figures: Nik-Zainal, Cell 2012; Fullwood, Nature 2009; CGAN, Nature 2012)
Hypothesis-generating:
  Normalize breast tumor expression data (three cohorts)
  Call genotypes from SNP arrays
  Calculate association between genotype and expression genome-wide (eQTL)
  Identify an interesting candidate
  Validate eQTL in independent cohort
Hypothesis-driven:
  Methylation analysis at MRPS30 promoter
  In vitro (two cell lines):
    ChIP-PCR +/- estrogen
    qPCR
    sequencing
Many studies use both approaches
Common quantitative techniques
Gene Expression
  transcription
  splicing
  methylation
  protein binding (ChIP-seq)
Genomics and Genetics
  de novo assembly
  DNA copy number (CNV and tumors)
  germline variant analysis
  tumor variant analysis
    SNPs, indels, translocation
  analysis of tumor clonality
Challenges
requires statistical sophistication
  in study design
  in interpretation
many data points
  1,000 to 1,000,000 measurements per sample
  many false positives which look like great stories
software becomes part of the experiment
divide between engineering, biology culture & thinking
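The point about false positives can be illustrated with a small simulation. This is only a sketch on invented data: the number of measurements (10,000), the group size (10) and the 0.05 cutoff are assumptions chosen for illustration, not values from the talk. Both groups are pure noise, yet roughly 5% of the tests come out "significant".

```r
set.seed(42)
n_measurements <- 10000   # assumed: e.g. probes on one array platform
n_per_group <- 10         # assumed group size

# Pure noise: both groups are drawn from the same distribution,
# so every "hit" below is a false positive.
group_a <- matrix(rnorm(n_measurements * n_per_group), nrow = n_measurements)
group_b <- matrix(rnorm(n_measurements * n_per_group), nrow = n_measurements)

# One t-test per measurement (row).
p_values <- sapply(seq_len(n_measurements), function(i) {
  t.test(group_a[i, ], group_b[i, ])$p.value
})

n_false_positives <- sum(p_values < 0.05)
cat("tests with p < 0.05 in pure noise:", n_false_positives, "\n")
```

Each of those hits would look like a candidate gene if the number of tests were ignored.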
Schillebeeckx Nature Biotech. 2013
Wet lab and quantitative skills: better job prospects
What approaches are used to analyze quantitative data?
Choosing a tool
  Cost
  Learning curve
  Ease of use
  Flexibility (closed to open-ended)
  Software ecosystem (none to extensive)
Traditional programming languages
Python, C++, Java, others
  can solve any computable problem
  creates the fastest tools
  free
  requires programming expertise
  complex to write and test
  high effort
Specialized single-purpose programs
command line tools
  academic research
  type commands at a prompt or run scripts
  PLINK, bowtie, GATK, bedtools
GUI (point and click)
  commercial software for a vendor’s platform
  slick, opaque, hard/impossible to automate
Commercial statistics programs
STATA, SPSS, GraphPad, others
1) Load one dataset
2) Select analysis by clicking on a GUI
3) Generate a report
May have a built-in language
Very mature tools for traditional biostatistics
Not free
Web-based tools
Galaxy
  string together pre-defined analysis steps
  very easy to use
  reproducible
R: a “software environment”
Using R is like writing and using software.
Traditionally, biologists did not do this.
R was written by statisticians to be a free replica of another language called “S”.
Why is R popular?
Flexible, open-ended, open-source
Large library of packages
  package: easy-to-use published methods, like a Qiagen kit
Free!
You use R by typing at the prompt
There is no pull-down menu of statistical commands
What’s good about this approach?
chain analyses
work with multiple datasets
use packages of code
easy to reproduce
runs on anything
makes sense to computer programmers
What’s hard about this approach?
hard to get started
cryptic commands
built-in help is amusingly unhelpful
packages: collections of R functions
collection of R code that solves a specific task
  limma: microarray normalization and analysis
  samr: differential expression
  impute: dealing with missing data
downloaded for free from a central repository
Bioconductor
Curated collection of R packages
Microarrays, aCGH, sequence analysis, advanced statistics, graphics, lots more
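A hedged sketch of how packages are typically checked for and installed today. The talk does not show installation commands. `BiocManager` is the current Bioconductor installer and is an assumption about the reader's setup, not something taken from the slides. The install lines are commented out because they need a network connection.

```r
# Report which of the packages named on the slides are already installed.
wanted <- c("limma", "samr", "impute")
installed <- vapply(wanted, requireNamespace, logical(1), quietly = TRUE)
print(installed)

# Installation (uncomment to run; requires internet access):
# install.packages("BiocManager")        # once per R installation
# BiocManager::install(wanted)           # installs Bioconductor and CRAN packages

# After installation, a package is loaded into the session with library():
# library(limma)
```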
Learning R data types by comparing them to Excel spreadsheets
Comparing Excel and R
Excel
  Easy tasks are easy
  Non-trivial tasks impossible or expensive
  No paper trail
  Mangles gene names
  Plots look terrible
R
  Easy jobs are hard at first
  Non-trivial things are possible
  Easy to make a paper trail
  Biostatistics researchers publish tools in R
  Can create publication-ready plots
Organizing data in Excel
Each subject has a row.
Each column has a feature of your subjects.
R calls the data points variables
variables
  numbers and characters (letters, words)
  numbers: 2.6, 4
  characters: “Flopsy”, “white, brown paws”
R calls the columns vectors
vectors
  ordered collections of a variable
  name: [“Flopsy”, “Mopsy”, “Cottontail”, “Peter”]
  age: [2.5, 2.6, 2.5, 4]
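As a minimal sketch, the two vectors above could be typed into R like this. The bracket notation on the slide is shorthand; in R, vectors are built with `c()`.

```r
# Character and numeric vectors, built with c() ("combine").
name <- c("Flopsy", "Mopsy", "Cottontail", "Peter")
age  <- c(2.5, 2.6, 2.5, 4)

class(name)    # type of the elements: "character"
class(age)     # type of the elements: "numeric"
length(age)    # number of elements
age[4]         # fourth element (R counts from 1)
mean(age)      # functions work on the whole vector at once
```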
R calls the data set a data frame
data frame
  a list of vectors (columns) that have names
  elements can be read and written by row & column
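A small sketch of a data frame built from the rabbit vectors shown earlier. The object name `rabbits` is invented for this example.

```r
rabbits <- data.frame(
  name = c("Flopsy", "Mopsy", "Cottontail", "Peter"),
  age  = c(2.5, 2.6, 2.5, 4),
  stringsAsFactors = FALSE   # keep names as plain character strings
)

str(rabbits)               # one row per subject, one named column per feature
rabbits$age                # a column is a vector
rabbits[2, "age"]          # read one element by row and column
rabbits[2, "age"] <- 2.7   # write one element by row and column
```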
I can slice and dice the data frame
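The original slide showed its examples as an image that did not survive in this transcript. The lines below are a stand-in sketch of common ways to slice a data frame, using a hypothetical `rabbits` data frame built from the vectors on the earlier slides.

```r
rabbits <- data.frame(
  name = c("Flopsy", "Mopsy", "Cottontail", "Peter"),
  age  = c(2.5, 2.6, 2.5, 4),
  stringsAsFactors = FALSE
)

rabbits[1, ]                     # first row, all columns
rabbits[, "name"]                # one column, all rows
rabbits[rabbits$age > 2.5, ]     # rows that satisfy a condition
rabbits[order(rabbits$age), ]    # rows sorted by age

# subset() expresses the same kind of filter by column name.
older <- subset(rabbits, age > 2.5, select = name)
older
```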
Tell R to do things using functions
function_name( details about how to do it )
generate sequence from 1 to 5 counting by 0.5
parameters for seq are named from, to, and by
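The call described above would look roughly like this (a sketch; the slide's own code was an image):

```r
# Named parameters make the call self-documenting.
seq(from = 1, to = 5, by = 0.5)

# Parameters can also be matched by position, in the order from, to, by.
seq(1, 5, 0.5)
```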
Tell R to do things using functions
function_name( details about how to do it )
report the mean of my.data. Result of one function is fed into another one.
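A sketch of what this might look like. The contents of `my.data` are an assumption (the sequence from the previous slide); the talk does not say what it holds.

```r
my.data <- seq(from = 1, to = 5, by = 0.5)   # assumed contents of my.data

mean(my.data)                                # report the mean

# The result of one function is fed directly into another one:
mean(seq(from = 1, to = 5, by = 0.5))
round(mean(log2(my.data)), digits = 2)
```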
Tell R to do things using functions
function_name( details about how to do it )
define a new function that adds 2 to whatever it’s passed
compare to original value of my.data
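A minimal sketch of such a function. The name `add.two` and the contents of `my.data` are invented for illustration.

```r
my.data <- seq(from = 1, to = 5, by = 0.5)   # assumed contents of my.data

# Define a new function that adds 2 to whatever it is passed.
add.two <- function(x) {
  x + 2
}

add.two(my.data)   # every element is 2 larger
my.data            # the original values are unchanged
```

The function returns a new vector, so `my.data` only changes if the result is assigned back to it.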
Walk-through of a straightforward analysis
Primary data from METABRIC study
  gene expression
  TP53 sequence
  1,400 samples from 5 hospitals
Is there an association between breast cancer subtype and TP53 mutation?
Tasks
Normalize data
  batch effects
  unwanted inter-sample variation
Identify outliers
associations between p53 and subtype
Quantile Normalization (limma)
Force every array to have the same distribution of expression intensities
> library(limma)
> raw = read.table('raw_extract.txt', ...)
> raw.normalized = normalizeQuantiles( as.matrix( raw ) )   # limma's quantile normalization works on a matrix
> normalized = log2( raw.normalized )
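To show what quantile normalization does, here is a base-R sketch on a toy matrix. It is not limma's implementation: the function name `quantile_normalize` and the data are invented, and ties are broken by position, which is simpler than what the package does.

```r
# Toy matrix: rows are genes (probes), columns are arrays (samples).
raw <- matrix(c(5, 2, 3, 4,
                4, 1, 4, 2,
                3, 4, 6, 8),
              nrow = 4,
              dimnames = list(paste0("gene", 1:4), paste0("array", 1:3)))

quantile_normalize <- function(x) {
  # Rank of every value within its own array (column).
  ranks <- apply(x, 2, rank, ties.method = "first")
  # Reference distribution: mean of the k-th smallest value across arrays.
  reference <- rowMeans(apply(x, 2, sort))
  # Replace each value by the reference value at its rank.
  normalized <- apply(ranks, 2, function(r) reference[r])
  dimnames(normalized) <- dimnames(x)
  normalized
}

normalized <- quantile_normalize(raw)
normalized
```

After this step every array contains exactly the same set of values, only assigned to different genes.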
Identify batch effects in microarrays
Principal Components Analysis
Identify axes of maximal variation in a matrix
[Figure: scatter plots with axes gene 1 and gene 2; annotations: first principal component; group A, group B]
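PCA identifies a batch effect
[Figure: samples plotted on the first and second principal components; label: hospital 3 (yellow)]
> my.pca = prcomp( t( expression.data ) )   # samples must be in rows, so transpose
> plot( my.pca$x[, 1:2], ... )   # scatter of the first two components; plot( my.pca ) alone draws the variances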
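A self-contained sketch of the same idea on simulated data. The batch labels, the number of genes and the size of the shift are all invented; only `prcomp` and the transposition follow the slide.

```r
set.seed(1)
n_genes <- 200
n_samples <- 30
# Hypothetical batches: the last 10 samples come from "hospital3".
batch <- rep(c("hospital1", "hospital2", "hospital3"), each = 10)

# Genes in rows, samples in columns, as in expression.data on the slide.
expression.data <- matrix(rnorm(n_genes * n_samples), nrow = n_genes)
# Simulated batch effect: shift half of the genes in the hospital3 samples.
is_h3 <- batch == "hospital3"
expression.data[1:100, is_h3] <- expression.data[1:100, is_h3] + 3

my.pca <- prcomp(t(expression.data))   # samples in rows
scores <- my.pca$x[, 1:2]              # coordinates on the first two components

plot(scores,
     col = ifelse(is_h3, "orange", "grey40"), pch = 19,
     xlab = "first principal component",
     ylab = "second principal component")
```

In this simulation the hospital3 samples separate from the rest along the first component, which is the pattern that flags a batch effect.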
Batch correction reduces bias (ComBat)
[Figure: samples plotted on the first and second principal components]
ComBat package reduces user-defined batch effects
Molecular subtypes of breast carcinoma, defined by gene expression
Luminal A: N=507
Luminal B: N=379
Her2: N=161
Basal: N=234
ER status
> sa = read.table('patients.txt', ...)
> tumor.counts = table( sa$ER.status, sa$PAM50Subtype )
(convert counts to percentages)
> barplot( c( tumor.counts[1], tumor.counts[2] ), col=c("red","green"), ... )
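The "convert counts to percentages" step is not shown on the slide. One common way is `prop.table`, sketched here on a made-up annotation table. The eight patients below are invented and only the column names follow the slide.

```r
# Hypothetical patient annotations; the real values live in patients.txt.
sa <- data.frame(
  ER.status    = c("pos", "pos", "pos", "neg", "pos", "neg", "neg", "neg"),
  PAM50Subtype = c("LumA", "LumA", "LumB", "LumB", "Her2", "Her2", "Basal", "Basal")
)

tumor.counts <- table(sa$ER.status, sa$PAM50Subtype)

# margin = 2 scales each column, so every subtype sums to 100%.
tumor.percent <- 100 * prop.table(tumor.counts, margin = 2)
tumor.percent

barplot(tumor.percent, col = c("red", "green"),
        legend.text = rownames(tumor.percent), ylab = "% of tumors")
```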
Find interactions: TP53 and subtype
Fit a linear model:
> fitted.model = lm( dependent ~ independent )
Perform Analysis of Variance:
> anova( fitted.model )
General form of my analysis:
> anova( lm( gene.expression ~ PAM * TP53 ) )
18,000 genes
PAM: {LumA, LumB, Her2, Basal}
TP53: {mutant, WT}
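A runnable sketch of this model for a single simulated gene. The data are invented: expression is raised only in TP53-WT Basal tumors, so the `PAM:TP53` interaction term should stand out. The sample size and effect size are assumptions.

```r
set.seed(2)
n <- 200
PAM  <- factor(sample(c("LumA", "LumB", "Her2", "Basal"), n, replace = TRUE))
TP53 <- factor(sample(c("WT", "mutant"), n, replace = TRUE),
               levels = c("WT", "mutant"))

# Simulated gene: higher expression only when subtype is Basal AND TP53 is WT.
gene.expression <- rnorm(n) + 2 * (PAM == "Basal" & TP53 == "WT")

# PAM * TP53 expands to PAM + TP53 + PAM:TP53 (main effects plus interaction).
fitted.model <- lm(gene.expression ~ PAM * TP53)
result <- anova(fitted.model)
print(result)

# P-value for the interaction row of the ANOVA table.
interaction.p <- result["PAM:TP53", "Pr(>F)"]
interaction.p
```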
Automate with loops
Calculate anova for 18,000 genes by looping through each gene and storing the result.
> n_genes = 18000
> result = rep( 0, n_genes )
> for( counter in 1:n_genes ){ result[counter] = anova(...) }
repeat 18,000 times
sort results
identify significant interaction
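The loop above elides what is stored for each gene. One plausible version, sketched on simulated data, keeps the interaction p-value and then adjusts for the number of tests. The matrix `expression`, the 50 genes standing in for 18,000, the planted effect in `gene1` and the use of Benjamini-Hochberg adjustment are all assumptions, not details from the talk.

```r
set.seed(3)
n_genes <- 50        # stand-in for the 18,000 genes on the slide
n_samples <- 120
PAM  <- factor(sample(c("LumA", "LumB", "Her2", "Basal"), n_samples, replace = TRUE))
TP53 <- factor(sample(c("WT", "mutant"), n_samples, replace = TRUE))

expression <- matrix(rnorm(n_genes * n_samples), nrow = n_genes,
                     dimnames = list(paste0("gene", 1:n_genes), NULL))
# Plant one true interaction so there is something to find.
expression["gene1", ] <- expression["gene1", ] + 3 * (PAM == "Basal" & TP53 == "WT")

result <- rep(NA_real_, n_genes)
names(result) <- rownames(expression)
for (counter in seq_len(n_genes)) {
  fit <- lm(expression[counter, ] ~ PAM * TP53)
  # Store only the p-value of the interaction term.
  result[counter] <- anova(fit)["PAM:TP53", "Pr(>F)"]
}

# Sort, and correct for having run one test per gene.
adjusted <- p.adjust(result, method = "BH")
head(sort(adjusted))
```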
Immune infiltration in TP53-WT Basal
[Figure: CD3E log2 expression (two panels) by infiltration: absent, mild, severe]
Does p53 have a role in immune surveillance?
Next steps: getting help and learning more
online forums: expert help for free
  all of bioinformatics
  Nextgen sequencing
  statistics
UCSF resources
Library classes and information
Formal courses (BMI, Biostatistics)
Cores (Computational Biology, Genomics)
QGDG monthly methods discussion group
Online classes and blogs
Free courses on data analysis
  http://jhudatascience.org
  simplystatistics.org
  Coursera etc...
Good tutorials on sequence analysis
  http://evomics.org/learning
Reproducible research? You mean, there’s another kind?
Replicate a wet lab experiment
detailed protocols (not printed in the methods)
extensive optimization
reagents that might be unique or hard to get
techniques that require years of experience
Replicate a dry lab experiment
published algorithms (if novel)
published source code
  sometimes “available from the authors”
well-specified input and deterministic output
no reagents
  Okay, maybe a supercomputer or cloud
How hard could it be?
Many chances to make honest errors
Bookkeeping errors
  Transposed column headers
  Out-of-date/changed annotations
  Off-by-one
  Misunderstood sample labels
Batch effects
Cryptic cohort stratification
Inappropriate analytical methods
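Several of these bookkeeping errors can be caught by checking labels in code instead of trusting file order. A sketch with invented data follows; the object names `expression` and `annotation` and the sample IDs are hypothetical.

```r
# Hypothetical expression matrix (genes x samples) and sample annotation table.
expression <- matrix(1:12, nrow = 3,
                     dimnames = list(c("TP53", "CD3E", "MRPS30"),
                                     c("s1", "s2", "s3", "s4")))
annotation <- data.frame(sample  = c("s2", "s1", "s4", "s3"),
                         subtype = c("LumA", "Basal", "Her2", "LumB"),
                         stringsAsFactors = FALSE)

# Never assume two files list samples in the same order: match by name.
idx <- match(colnames(expression), annotation$sample)
stopifnot(!anyNA(idx))          # every sample must have an annotation row
annotation <- annotation[idx, ]

# After reordering, the labels must line up exactly, or the script stops.
stopifnot(identical(colnames(expression), annotation$sample))
annotation
```

A failed `stopifnot` halts the analysis at the point of the mismatch, which is much cheaper than discovering the mix-up after publication.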
eQTL differences between ethnic cohorts
Claim:
Many genetic loci associated with gene expression differ between Asian and Western people
Claim:
Many genetic loci associated with gene expression differ between Asian and Western people
poor study design
batch processing effects
  2003-2004: processed European samples
  2005-2006: processed Asian samples
Processing year perfectly confounded with ethnicity.
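This kind of confounding is easy to check for before any analysis by cross-tabulating the variable of interest against the processing batch. A sketch on an invented sample sheet that mirrors the design described above (the eight samples are made up):

```r
# Hypothetical sample sheet: cohort and year of processing for each sample.
samples <- data.frame(
  ethnicity = rep(c("European", "Asian"), each = 4),
  year      = c(2003, 2003, 2004, 2004, 2005, 2005, 2006, 2006)
)

design <- table(samples$ethnicity, samples$year)
print(design)

# Perfectly confounded: every processing year contains only one group,
# so a year effect cannot be separated from an ethnicity effect.
perfectly_confounded <- all(colSums(design > 0) == 1)
perfectly_confounded
```

A balanced design would instead show both groups in every processing year.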