Functional Genomics and Bioconductor
Software for the analysis of data from cell-based assays
Functional Profiling
• each protein has one or several specific function(s) in the cell
• for a large part of the proteins the function is still unclear
• some functional information may be found in the protein structure or through homology
• the context/cellular environment is important for the function: study function within that context
Functional Profiling: Identification of Disease Genes
[Figure labels: 21,000+ human cDNAs (~genes), genome-wide; microarray study (cancer vs. normal, in vitro); disease-associated genes; cellular assay (in vivo); "hot" candidates]
How to infer a protein's function
+/- perturbation -> phenotype
but: phenotype ≠ function!
Design of cell-based assays
• means to monitor effect of perturbation
- expression or activation state of key regulatory proteins (fluorescence reader, FACS, automated microscope)
• means to monitor perturbation (beneficial but not mandatory)
- expression of fluorescence protein tag
• system to willfully manipulate expression level of certain genes in cells
- up-regulation (transfection of expression vectors)
- down-regulation (RNA interference)
RNAi as a loss of function perturbator
• gene-sequence specific reagents (e.g. siRNAs)
• easy to make for any gene (there are caveats...)
[Figure: in living cells, gene -> (transcription) -> mRNA -> (translation) -> protein; RNAi leads to degradation of the mRNA]
What is a phenotype: it all depends on the assay
Any cellular process can be probed.
- (de-)activation of a signaling pathway
- cell differentiation
- changes in the cell cycle dynamics
- morphological changes
- activation of apoptosis
Similarly, for organisms (e.g. fly embryos, worms)
Phenotypes can be registered at various levels of detail
- yes/no alternative
- single quantitative variable
- tuple of quantitative variables
- image
- time course
Monitoring Tools
• Plate reader: 96 or 384 well, 1…4 measurements per well
• FACS: 4…8 measurements per cell, thousands of cells per well
• Automated microscopy: unlimited
Bioconductor packages for cell-based assays
• cellHTS (Ligia Bras, M. Boutros): genome-wide screens with scalar (or low-dimensional) read-out; data management, normalization, quality assessment, visualization, hit scoring, reproducibility, publication; raw data -> annotated hit list
• prada (Florian Hahne); flowCore et al. (B. Ellis, P. Haaland, N. Le Meur, F. Hahne): flow cytometry; data management
• EBImage (O. Sklyar): image processing and analysis; construction of feature extraction workflows for large sets of similar images
• imageHTS (O. Sklyar, F. Fuchs, M. Boutros) (scheduled for release 2.1): web-based presentation of high-content screening data and results
Visualization: plate plots
visualization of results
plate plots as graphical representation of experimental entities
• false color coding for concise display of numeric outcomes from statistical analyses
• HTML image map allows for hyperlinking to include further information for each well
[Figure: example plate plots showing quantitative values (cell number), qualitative values, additional information, and replicates]
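A plate plot starts from arranging per-well values on the plate grid. The following is a minimal Python sketch of that step, not cellHTS code. It assumes a 384-well plate (16 rows × 24 columns) with values stored in row-major order; the names `well_id` and `to_plate_grid` are hypothetical.

```python
import string

def well_id(index, n_cols=24):
    """Map a 0-based, row-major well index to an ID such as 'A01'."""
    row, col = divmod(index, n_cols)
    return f"{string.ascii_uppercase[row]}{col + 1:02d}"

def to_plate_grid(values, n_rows=16, n_cols=24):
    """Reshape a flat, row-major list of well values into plate rows."""
    if len(values) != n_rows * n_cols:
        raise ValueError("number of values does not match the plate format")
    return [values[r * n_cols:(r + 1) * n_cols] for r in range(n_rows)]

# Made-up data: one value per well of a 384-well plate.
values = list(range(384))
grid = to_plate_grid(values)
print(well_id(0), well_id(25), well_id(383))
print(len(grid), "rows x", len(grid[0]), "columns")
```

Each cell of `grid` can then be mapped to a false color, and `well_id` gives the label to attach to the corresponding region of an image map.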
The cellHTS package
Bioconductor package for the analysis of cell-based high-throughput screening (HTS) assays
genome-wide screens with scalar (or low-dimensional) read-out
Manage all data and metadata relevant for interpreting a cell-based screen
Data cleaning, preprocessing, primary statistical analysis
Raw data -> annotated hit list
Boutros, Bras, Huber. Analysis of cell-based RNAi screens. Genome Biology (2006)
The cellHTS package: workflow
The cellHTS package
Quality report rendered in HTML
per plate quality assessment
• Dynamic range
• Distribution of the intensity values for each replicate
• Scatterplot between replicates and correlation coefficient
• Plate plots for individual replicates and for standard deviation between replicates
per experiment quality assessment
• Boxplots grouped by plate
• Distribution of the signal in the control wells, Z'-factor
whole screen visualization
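Z'-factor
$$Z' = 1 - \frac{3(\sigma_p + \sigma_n)}{|\mu_p - \mu_n|}$$
µ_p, σ_p: mean and standard deviation of the positive controls; µ_n, σ_n: mean and standard deviation of the negative controls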
Zhang JH, Chung TD, Oldenburg KR, "A Simple Statistical Parameter for Use in Evaluation and Validation of High Throughput Screening Assays." J Biomol Screen. 1999;4(2):67-73.
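The Z'-factor can be computed directly from the control wells. This is a small Python sketch of the formula, not the cellHTS implementation. The control values are made up, and the use of the sample standard deviation is an assumption.

```python
from statistics import mean, stdev

def z_prime(pos, neg):
    """Z' = 1 - 3 * (sd_pos + sd_neg) / |mean_pos - mean_neg|."""
    separation = abs(mean(pos) - mean(neg))
    if separation == 0:
        raise ValueError("control means coincide; Z' is undefined")
    return 1 - 3 * (stdev(pos) + stdev(neg)) / separation

# Made-up control intensities for one plate.
pos_controls = [10.0, 12.0, 11.0, 9.0]
neg_controls = [100.0, 104.0, 98.0, 102.0]
print(round(z_prime(pos_controls, neg_controls), 3))
```

Values close to 1 mean the two control distributions are well separated relative to their spread.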
Plate to plate variability
[Figure: luminescence by (384-well) plate ID]
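Normalization: Plate effects
x_ki: value of the k-th well on the i-th plate
Percent of control
$$x'_{ki} = \frac{x_{ki}}{\mu_i^{pos}} \times 100$$
Normalized percent inhibition
$$x'_{ki} = \frac{\mu_i^{pos} - x_{ki}}{\mu_i^{pos} - \mu_i^{neg}} \times 100$$
z-score
$$x'_{ki} = \frac{x_{ki} - \mu_i}{\sigma_i}$$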
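The three per-plate normalizations can be written out as follows. This is a Python sketch under stated assumptions, not cellHTS code: each function works on the wells of one plate, the control means are plain arithmetic means, and all values are made up.

```python
from statistics import mean, stdev

def percent_of_control(x, pos):
    """x' = x / mean(pos) * 100, per plate."""
    mu_pos = mean(pos)
    return [v / mu_pos * 100 for v in x]

def normalized_percent_inhibition(x, pos, neg):
    """x' = (mean(pos) - x) / (mean(pos) - mean(neg)) * 100, per plate."""
    mu_pos, mu_neg = mean(pos), mean(neg)
    return [(mu_pos - v) / (mu_pos - mu_neg) * 100 for v in x]

def z_score(x):
    """x' = (x - mean(x)) / sd(x), using all wells passed in."""
    mu, sigma = mean(x), stdev(x)
    return [(v - mu) / sigma for v in x]

# Made-up values for a single plate.
plate = [50.0, 100.0, 150.0]
pos = [190.0, 210.0]
neg = [90.0, 110.0]
print(percent_of_control(plate, pos))
print(normalized_percent_inhibition(plate, pos, neg))
print(z_score(plate))
```

The first two depend on the controls working well on every plate; the z-score uses only the wells themselves.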
Spatial normalization
B-score: two-way median polish
$$x'_{rci} = \frac{x_{rci} - (\hat{\mu}_i + \hat{R}_{ri} + \hat{C}_{ci})}{\mathrm{MAD}_i}$$
r-th row, c-th column, i-th plate
[Figure: plate before and after normalization, with the fitted row and column effects]
Malo et al., Nat. Biotech. 2006
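The B-score divides the residuals of a two-way median polish by their MAD. Below is a compact Python sketch of that idea, using Tukey's median polish with a fixed number of sweeps. The MAD scaling constant 1.4826, the iteration count and the function names are assumptions, not taken from the slides.

```python
from statistics import median

def median_polish(plate, n_iter=10):
    """Two-way median polish of a plate given as a list of rows.

    Returns (overall, row_effects, col_effects, residuals) with
    plate[r][c] == overall + row_effects[r] + col_effects[c] + residuals[r][c].
    """
    n_rows, n_cols = len(plate), len(plate[0])
    resid = [list(row) for row in plate]
    overall, row_eff, col_eff = 0.0, [0.0] * n_rows, [0.0] * n_cols
    for _ in range(n_iter):
        for r in range(n_rows):              # sweep out row medians
            m = median(resid[r])
            row_eff[r] += m
            resid[r] = [v - m for v in resid[r]]
        m = median(col_eff)                  # re-center column effects
        overall += m
        col_eff = [v - m for v in col_eff]
        for c in range(n_cols):              # sweep out column medians
            m = median(resid[r][c] for r in range(n_rows))
            col_eff[c] += m
            for r in range(n_rows):
                resid[r][c] -= m
        m = median(row_eff)                  # re-center row effects
        overall += m
        row_eff = [v - m for v in row_eff]
    return overall, row_eff, col_eff, resid

def b_score(plate):
    """Median-polish residuals divided by their (scaled) MAD."""
    _, _, _, resid = median_polish(plate)
    flat = [v for row in resid for v in row]
    med = median(flat)
    scale = 1.4826 * median(abs(v - med) for v in flat)  # assumed scaling
    if scale == 0:
        raise ValueError("MAD of residuals is zero; B-score is undefined")
    return [[v / scale for v in row] for row in resid]

# Made-up 3 x 3 plate.
noisy = [[1.0, 5.0, 2.0], [4.0, 3.0, 9.0], [7.0, 8.0, 6.0]]
for row in b_score(noisy):
    print([round(v, 2) for v in row])
```

Using medians rather than means keeps a few strong hits from distorting the fitted row and column effects.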
Normalization and library design
HEK293 cells, viability screen, Boutros Lab, DKFZ
Plate 26
proteasome subunits or components;
ATP/GTP-binding site motifs
ribosomal proteins
like-Sm nucleoproteins and ribosomal proteins
Normalization problem… Too many hits
How to estimate the normalization parameters?
From which data points:
• Based on the intensities of the controls, if they work uniformly well across all plates
• Based on the intensities of the samples: invoke assumptions such as "most genes have no effect", or "same distribution of effect sizes"
Which estimator:
• mean vs median vs shorth
• standard deviation vs MAD vs IQR
In the best case, it doesn't matter. No universally optimal answer, it depends on the data.
Estimators of location
[Figure: histogram of x with the mean, median, shorth and half.range.mode marked]
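To illustrate how the estimators differ on skewed data, here is a Python sketch comparing mean, median and shorth, and standard deviation against MAD. The shorth is taken as the mean of the shortest interval containing half of the data. The half-sample size `n // 2 + 1`, the tie handling and the MAD constant are assumptions, and the data are made up.

```python
from statistics import mean, median, stdev

def shorth(values):
    """Mean of the points in the shortest interval holding half of the data."""
    xs = sorted(values)
    n = len(xs)
    k = n // 2 + 1  # assumed half-sample size
    # Start of the narrowest window of k consecutive sorted points.
    start = min(range(n - k + 1), key=lambda i: xs[i + k - 1] - xs[i])
    return mean(xs[start:start + k])

def mad(values):
    """Median absolute deviation, scaled by 1.4826 (assumed convention)."""
    med = median(values)
    return 1.4826 * median(abs(v - med) for v in values)

# Made-up sample: a bulk near 1 plus a few strong effects.
data = [1.0, 1.1, 0.9, 1.2, 1.0, 0.8, 9.0, 10.0, 11.0]
print("mean  ", round(mean(data), 2))
print("median", round(median(data), 2))
print("shorth", round(shorth(data), 2))
print("sd    ", round(stdev(data), 2))
print("MAD   ", round(mad(data), 2))
```

On this sample the mean and the standard deviation are dominated by the three large values, while median, shorth and MAD describe the bulk.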
4 different siRNAs per gene
[Figure: panels labeled by well position (e.g. G03, H13, I17, A04)]
FACS: fluorescence activated cell sorting (= flow cytometry)
[Figure: flow cytometer schematic with laser, light scatter detector and fluorescence detectors]
• measures fluorescence intensities as well as morphological parameters on the basis of light emission
• offers single cell resolution
• robust, reliable, flexible
flowCore package: overview
package flowCore contains data structures and functionality for flow cytometry data
• data import
• data management
• data preprocessing
- transformation
- filtering
• flow-specific procedures
- gating
associated packages:
• flowViz: visualization
• flowQ: quality assessment
• flowUtils: utilities
compatibility with other software by following the standardized description of flow cytometry data
Data import and data structures
• FCS 3.0 files
- standardized storage format for FACS data
- contains fluorescence values in data segment, wealth of meta data in text segment
- can be imported into R
• flowFrame: R internal representation of data from one FCS file
- raw data matrix
- list of meta data
• flowSet: R internal representation of data from several FCS files (e.g. one 96 well plate)
Software implementation
flowFrame (components: data, description, parameters):
read.FCS(file)            construct flowFrame from FCS file
exprs(cytoFrame)          get/set data matrix
description(cytoFrame)    get meta data
plot(cytoFrame)           smoothed scatter plot, histogram
cytoFrame[1,]             subsetting
…
flowSet (components: phenoData, frames):
phenoData(flowSet)        get experiment meta data
fsApply(flowSet, foo)     apply function to each frame
flowSet[1:3]              subsetting to flowSet
flowSet[[1]]              subsetting to flowFrame
…
Gating/Filtering
Gate: region in multidimensional space defining the filtering operation of a subset of the cell population
• rectangle gates
• polygon gates
• ellipsoid gates
• data-driven gates
interactive drawing
gate arithmetic on two gates G1 and G2: G1 ∪ G2, G1 ∩ G2, G1 \ G2, G1 ∆ G2
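Gate arithmetic maps directly onto set operations once each gate is stored as the set of events it selects. The Python sketch below illustrates this with two rectangle gates. It is not the flowCore API; `rectangle_gate` and the event coordinates are hypothetical.

```python
def rectangle_gate(events, x_range, y_range):
    """Return the indices of the events that fall inside the rectangle."""
    (x_lo, x_hi), (y_lo, y_hi) = x_range, y_range
    return {i for i, (x, y) in enumerate(events)
            if x_lo <= x <= x_hi and y_lo <= y <= y_hi}

# Made-up events: (parameter 1, parameter 2) per cell.
events = [(1, 1), (2, 5), (4, 4), (6, 2), (8, 8)]
g1 = rectangle_gate(events, (0, 5), (0, 5))
g2 = rectangle_gate(events, (3, 9), (0, 5))

print("G1 ∪ G2:", sorted(g1 | g2))   # in either gate
print("G1 ∩ G2:", sorted(g1 & g2))   # in both gates
print("G1 \\ G2:", sorted(g1 - g2))  # in G1 but not in G2
print("G1 ∆ G2:", sorted(g1 ^ g2))   # in exactly one gate
```

Storing indices rather than copies of the data keeps combined gates cheap and lets the same result be applied to every measured parameter.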
Data-driven filtering: kmeansFilter
[Figure: cells assigned to clusters k1, k2, k3]
Data-driven filtering: FSC/SSC preprocessing
• distinction on basis of morphological properties (cell size, granularity)
• variation between experiments
• dynamic determination
• assumption: bivariate normal distribution
• robust fitting
• discarding cells that do not lie within some given boundary of this distribution
• shape and location of main distribution can be used for quality control
[Figure: scatter plot of cell size vs. granularity; legend: density of distribution, discarded cells, X = midpoint of distribution]
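The FSC/SSC preprocessing step can be sketched in Python as follows. This is not the flowCore implementation. A few rounds of fit, trim and refit stand in for the robust fit named on the slide, and that substitution is an assumption. The boundary is the ellipse expected to cover 95% of a bivariate normal, which is also an assumed choice. The squared Mahalanobis distance of such data follows a chi-square distribution with 2 degrees of freedom, so the cutoff is -2·ln(1 - coverage).

```python
import math
import random

def fit_bivariate_normal(points):
    """Return (mean_x, mean_y, var_x, var_y, cov_xy) of 2-D points."""
    n = len(points)
    mx = sum(x for x, _ in points) / n
    my = sum(y for _, y in points) / n
    sxx = sum((x - mx) ** 2 for x, _ in points) / (n - 1)
    syy = sum((y - my) ** 2 for _, y in points) / (n - 1)
    sxy = sum((x - mx) * (y - my) for x, y in points) / (n - 1)
    return mx, my, sxx, syy, sxy

def mahalanobis_sq(point, fit):
    """Squared Mahalanobis distance of a point from the fitted distribution."""
    mx, my, sxx, syy, sxy = fit
    dx, dy = point[0] - mx, point[1] - my
    det = sxx * syy - sxy ** 2
    return (syy * dx * dx - 2 * sxy * dx * dy + sxx * dy * dy) / det

def filter_main_population(points, coverage=0.95, n_refits=3):
    """Keep the cells inside the coverage ellipse of the fitted distribution."""
    threshold = -2.0 * math.log(1.0 - coverage)  # chi-square(2) quantile
    kept = list(points)
    for _ in range(n_refits):
        fit = fit_bivariate_normal(kept)         # refit on the kept cells
        kept = [p for p in points if mahalanobis_sq(p, fit) <= threshold]
    return kept, fit

# Made-up data: one main population plus low-FSC/low-SSC debris.
random.seed(1)
main = [(random.gauss(200, 20), random.gauss(100, 15)) for _ in range(500)]
debris = [(20.0 + i, 10.0 + i) for i in range(20)]
kept, fit = filter_main_population(main + debris)
print(len(kept), "of", len(main) + len(debris), "cells kept")
print("midpoint:", round(fit[0], 1), round(fit[1], 1))
```

The fitted midpoint and covariance returned in `fit` are the quantities one would track across wells for quality control.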
A typical phenotype
scatter plot of two measurement parameters: phenotype against level of perturbation
[Figure: parameter 2 (phenotype) plotted against parameter 1 (perturbation); example panels show activation and inhibition]
parameter correlation
cell size correlates with fluorescent intensities (FL1, FL4)
induces spurious correlations in the data
$$x_{total} = \alpha + \beta s + x_{specific}$$
s: cell size (FSC)
x_total: measured fluorescence
x_specific: actual fluorescence emitted by dye
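Given the model x_total = α + β·s + x_specific, one way to remove the size effect is to regress the measured fluorescence on cell size and keep the residuals. The Python sketch below uses ordinary least squares, which is an assumption since the slides do not state the fitting method. The residuals recover x_specific only up to an additive constant. All values are made up.

```python
def fit_line(s, x):
    """Least-squares intercept and slope of x regressed on s."""
    n = len(s)
    s_mean = sum(s) / n
    x_mean = sum(x) / n
    beta = (sum((si - s_mean) * (xi - x_mean) for si, xi in zip(s, x))
            / sum((si - s_mean) ** 2 for si in s))
    alpha = x_mean - beta * s_mean
    return alpha, beta

def remove_size_effect(s, x_total):
    """Estimate x_specific as x_total minus the fitted alpha + beta * s."""
    alpha, beta = fit_line(s, x_total)
    return [xi - (alpha + beta * si) for si, xi in zip(s, x_total)]

# Made-up cells: size s, and fluorescence built as 5 + 0.1 * s + specific.
s = [100.0, 200.0, 300.0, 400.0]
specific = [2.0, -2.0, -2.0, 2.0]
x_total = [5.0 + 0.1 * si + xs for si, xs in zip(s, specific)]
print(fit_line(s, x_total))
print(remove_size_effect(s, x_total))
```

Applying the same correction to each fluorescence channel removes the correlation between channels that is due to cell size alone.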
Modeling of phenotype
• robust fitting of smoothed local regression function:
$$y = y_0 + m(x - x_0) + \varepsilon$$
y: response (phenotype)
x: perturbation signal
m: smooth function
$\hat{m}'(x_t)$: robust estimator of the slope of m at point $x_t$
$$\hat{\delta} = \hat{m}'(x_t)$$
• z-score as dimension-less measure of effect: ratio of estimated slope $\hat{\delta}$ at point $x_t$ and assay-wide scale parameter $\delta_0$
$$z = \frac{\hat{\delta}}{\delta_0}$$
[Figure: three example fits with z = 18.1, z = 0.4, z = -40.2]
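As a rough illustration of the effect score, the Python sketch below estimates the local slope at a point x_t with a kernel-weighted linear fit and divides it by a scale parameter. The tricube weights, the bandwidth, the non-robust least-squares fit and a user-supplied δ0 are all assumptions; the slides call for a robust fit and do not say how δ0 is obtained.

```python
def local_slope(x, y, x_t, bandwidth):
    """Slope at x_t of a tricube-weighted local linear fit (not robust)."""
    weights = []
    for xi in x:
        u = abs(xi - x_t) / bandwidth
        weights.append((1 - u ** 3) ** 3 if u < 1 else 0.0)
    sw = sum(weights)
    xm = sum(w * xi for w, xi in zip(weights, x)) / sw
    ym = sum(w * yi for w, yi in zip(weights, y)) / sw
    sxy = sum(w * (xi - xm) * (yi - ym) for w, xi, yi in zip(weights, x, y))
    sxx = sum(w * (xi - xm) ** 2 for w, xi in zip(weights, x))
    return sxy / sxx

def effect_z(x, y, x_t, bandwidth, delta0):
    """z = estimated slope at x_t divided by the scale parameter delta0."""
    return local_slope(x, y, x_t, bandwidth) / delta0

# Made-up data: the phenotype rises linearly with the perturbation signal.
x = [float(i) for i in range(11)]
y = [3.0 * xi for xi in x]
print(effect_z(x, y, x_t=5.0, bandwidth=4.0, delta0=0.5))
```

A positive z corresponds to activation and a negative z to inhibition; values near zero indicate no detectable effect.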
Acknowledgements
Wolfgang Huber, Annemarie Poustka, Stefan Wiemann
Dorit Arlt, Meher Majety, Mamatha Sauermann, Christian Schmidt
Andreas Buneß, Markus Ruschhaupt, Heiko Rosenfelder
Alex Mehrle, Dirk Ledwinka, Tim Beissbarth, Achim Tresch
Michael Boutros, Ligia Bras, Florian Fuchs, Thomas Horn, Dierk Ingelfinger, Sandra Steinbrink
Robert Gentleman, Nolwenn Le Meur, Byron Ellis, Perry Haaland