WORKSHOP
SPOTTED 2-channel ARRAYS
DATA PROCESSING AND QUALITY CONTROL
Eugenia Migliavacca and Mauro Delorenzi,
ISREC, December 11, 2003
AIMS
Discussion
Information
Introduction to the use of the webpage for automated normalization
interface between experimentalists and analysts
feedback
resource allocation
Acknowledgments
some slides originally provided by:
Terry Speed (Berkeley / WEHI)
Sandrine Dudoit (Berkeley)
Yee Hwa Yang (Berkeley)
Natalie Thorne (WEHI)
Otto Hagenbuechle
Eugenia Migliavacca
Darlene Goldstein
and others
RNA ISOLATION
(AMPLIFICATION) AND LABELING WITH FLUORO-DYES
Preparation
Hybridisation
Binding labelled samples (targets) to complementary probes on a slide
[Figure labels: Mix; cover slip; Hybridise for 5-12 hours; Wash]
Scanning
Adjust scanner parameters; the following can frequently be adapted:
1. excitation wave (laser) intensity
2. "gain" (amplification) of the photon detection system
Human 10K cDNA Array
How to extract data ?
How to recognize problems ?
Part of the image of one channel, false-coloured on a scale running from white (very high) and red (high), through yellow and green (medium), to blue (low) and black.
Scanner's Spots
Steps of a Microarray Experiment
RNA preparation and Labeling
Hybridisation
Slide scanning
Image analysis
Normalization
Data for further analysis
Why perform an experiment ?
What is the aim ?
Which conclusions do you want to reach ?
first: DESIGN !
mRNA abundance
rRNA 80%
tRNA
mRNA 1%
1-50
50-500
500+
approx. 300'000 mRNA molecules/cell
approx. 10-20'000 different genes
What do you want to measure ?
RNA mass: different in different cells
Relative vs Absolute changes
200'000 mRNA molecules/cell: 200 for gene X (0.1%)
400'000 mRNA molecules/cell: 400 for gene X (0.1%)
Is gene X differentially expressed ?
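The two cells above can be compared with a few lines of arithmetic. This is only a sketch using the numbers on the slide; the function name is made up for illustration.

```python
def relative_abundance(gene_copies, total_mrna):
    """Fraction of a cell's mRNA molecules that come from one gene."""
    return gene_copies / total_mrna

# Numbers from the slide: (total mRNA molecules per cell, copies of gene X)
cell_a = (200_000, 200)
cell_b = (400_000, 400)

rel_a = relative_abundance(cell_a[1], cell_a[0])
rel_b = relative_abundance(cell_b[1], cell_b[0])

# Absolute change: copies per cell. Relative change: share of the mRNA pool.
absolute_fold_change = cell_b[1] / cell_a[1]
relative_fold_change = rel_b / rel_a

print(f"absolute: {absolute_fold_change:.1f}-fold, relative: {relative_fold_change:.1f}-fold")
```

Gene X doubles in absolute terms but is unchanged relative to the total mRNA, so the answer depends on which of the two quantities the experiment is designed to measure.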
Steps of a Microarray Experiment
RNA preparation and Labeling
Hybridisation
Slide scanning: 16-bit TIFF files
Image analysis: (Rfg, Rbg), (Gfg, Gbg), etc
Normalization: R, G, M, A, etc
Data for further analysis
What is needed for high quality data ?
Which are the critical steps ?
Steps of a Microarray Experiment
RNA preparation and Labeling: spike-in RNA in known concentrations and ratios
Hybridisation
Slide scanning: adjust / balance channels approximately; avoid saturation
Image analysis
Normalization: check normalized and unnormalized data of experimental RNA and of spiked RNA
Data for further analysis
Why avoid saturation ?
Why balance channels ?
Why perform "normalization" ?
What to check before and after normalization ?
Why calculate ratios ?
Why calculate log ratios ?
Aim: Gene Expression Data
Gene expression data on p genes for n samples
M = Log2( Red intensity / Green intensity)
Rows: genes; columns: slides.

Gene   slide 1   slide 2   slide 3   slide 4   slide 5   …
1        0.46      0.30      0.80      1.51      0.90    ...
2       -0.10      0.49      0.24      0.06      0.46    ...
3        0.15      0.74      0.04      0.10      0.20    ...
4       -0.45     -1.03     -0.79     -0.56     -0.32    ...
5       -0.06      1.06      1.35      1.09     -1.09    ...

Gene expression level of gene 5 in slide 4: 1.09
These values are conventionally displayed on a red (>0) yellow (0) green (<0) scale.
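As a small illustration, the table can be held as a genes-by-slides matrix and looked up by gene and slide number. The helper names below are hypothetical; only the values and the colour convention come from the slide.

```python
# Log-ratio matrix M from the table: rows are genes 1-5, columns are slides 1-5.
M = [
    [0.46, 0.30, 0.80, 1.51, 0.90],
    [-0.10, 0.49, 0.24, 0.06, 0.46],
    [0.15, 0.74, 0.04, 0.10, 0.20],
    [-0.45, -1.03, -0.79, -0.56, -0.32],
    [-0.06, 1.06, 1.35, 1.09, -1.09],
]

def log_ratio(gene, slide):
    """Look up M using the 1-based gene and slide numbers of the table."""
    return M[gene - 1][slide - 1]

def display_colour(m):
    """Conventional display colour: red for M > 0, green for M < 0, yellow for M = 0."""
    if m > 0:
        return "red"
    if m < 0:
        return "green"
    return "yellow"

m = log_ratio(gene=5, slide=4)
# 2 ** M converts the log-ratio back to a red/green fold change.
print(m, display_colour(m), round(2 ** m, 2))
```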
Objectives for high quality
Important aspects include:
• Tentatively separating
  • systematic sources of variation ("artefacts"), that bias the results,
  • from random sources of variation ("noise"), that hide the truth.
• Removing the former as well as possible and quantifying the latter
Only if this is done can we hope to reach good quality and make valid statements about the confidence in the results
Typical Statistical Approach
Measured value = real value + systematic errors + noise
Corrected value = real value + noise
• Analysis of Corrected value => (unbiased) CONCLUSIONS
• Estimation of Noise => quality of CONCLUSIONS, statistical significance (level of confidence) of the conclusions
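The additive model above can be tried out on simulated data. The sketch below is not taken from the slides: the real value, the size of the systematic error, the noise level and the use of control measurements with a known real value of 0 are all assumptions chosen for illustration.

```python
import random
import statistics

rng = random.Random(1)  # fixed seed so the run is reproducible

REAL_VALUE = 1.0        # hypothetical real log-ratio of one gene
SYSTEMATIC_ERROR = 0.3  # hypothetical constant bias (for example a dye effect)
NOISE_SD = 0.2          # hypothetical standard deviation of the random noise

def measure(real_value):
    """Measured value = real value + systematic error + noise."""
    return real_value + SYSTEMATIC_ERROR + rng.gauss(0.0, NOISE_SD)

# Assumption: controls whose real value is known to be 0 reveal the bias.
controls = [measure(0.0) for _ in range(200)]
estimated_bias = statistics.mean(controls)

measured = [measure(REAL_VALUE) for _ in range(200)]
corrected = [m - estimated_bias for m in measured]  # real value + noise

estimate = statistics.mean(corrected)               # basis for the conclusions
noise_sd = statistics.stdev(corrected)              # estimate of the noise
standard_error = noise_sd / len(corrected) ** 0.5   # confidence in the estimate

print(f"uncorrected mean: {statistics.mean(measured):.2f}")
print(f"corrected mean:   {estimate:.2f} +/- {standard_error:.2f}")
```

The uncorrected mean stays offset by the systematic error however many measurements are taken, while the corrected mean approaches the real value and the noise estimate says how far to trust it.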
Step 1: a) Background Correction b) Calculation of (log) ratios
Image Analysis => Rfg ; Rbg ; Gfg ; Gbg (fg = foreground, bg = background).
For each spot on the slide calculate:
Red intensity = R = Rfg - Rbg
Green intensity = G = Gfg - Gbg
M = Log2( Red intensity / Green intensity)
Subtraction of background values (additive background model, with the background assumed to be locally constant …)
Sources of background: probe unspecifically sticking on slide, irregular / dirty slide surface, dust, and noise / errors in the scanner measurement
Not included: real cross-hybridisation and unspecific hybridisation to the probe
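A minimal sketch of this step for a single spot follows. The intensity values are invented, and returning None when a background-corrected intensity is not positive is an assumption of this sketch, not a rule from the slides.

```python
import math

def log_ratio(rfg, rbg, gfg, gbg):
    """Return M = log2(R / G) after background subtraction, or None if undefined."""
    r = rfg - rbg  # red intensity
    g = gfg - gbg  # green intensity
    if r <= 0 or g <= 0:
        # The log-ratio is undefined here; how to treat such spots is a separate choice.
        return None
    return math.log2(r / g)

# Hypothetical spot: R = 1000, G = 500, so M = 1.0 (red is twice green)
print(log_ratio(rfg=1200, rbg=200, gfg=700, gbg=200))

# Hypothetical weak spot: red foreground below its background
print(log_ratio(rfg=150, rbg=200, gfg=700, gbg=200))
```

Weak spots are where subtraction bites: a foreground close to the background gives a small or negative difference, and the log-ratio becomes unstable or undefined.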
Comment to Background Correction
Subtraction of background has frequently been shown not to improve performance: while making the average of many measurements closer to the true values (reduced bias or systematic error), it causes higher variability (lower reproducibility)
Which is better ?
A. High variance - Unbiased Estimator
when you take many measurements: the average will be closer to the true value more frequently
B. Low variance - slightly biased Estimator
when you take one or a few measurements: the estimate will be closer to the true value more frequently
[Figure labels: average; single meas.]
DAF Microarrays 2002: we preferred no subtraction; this should be re-evaluated with the Agilent scanner (and GenePix IAS)
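The trade-off between the two estimators can be checked with a small simulation. The bias and standard deviations below are hypothetical values picked to make the effect visible; they are not measurements from any scanner.

```python
import random

TRUE_VALUE = 0.0

def fraction_a_closer(n_measurements, trials=1000, seed=0):
    """Fraction of trials in which the average of A lies closer to the truth than that of B."""
    rng = random.Random(seed)
    a_wins = 0
    for _ in range(trials):
        # A: unbiased but noisy. B: slightly biased but precise. (Hypothetical parameters.)
        mean_a = sum(rng.gauss(TRUE_VALUE, 1.0) for _ in range(n_measurements)) / n_measurements
        mean_b = sum(rng.gauss(TRUE_VALUE + 0.2, 0.3) for _ in range(n_measurements)) / n_measurements
        if abs(mean_a - TRUE_VALUE) < abs(mean_b - TRUE_VALUE):
            a_wins += 1
    return a_wins / trials

for n in (1, 100):
    print(f"n = {n:3d}: A closer to the truth in {fraction_a_closer(n):.0%} of trials")
```

With a single measurement the low-variance estimator B wins most of the time. With many measurements the noise averages out, the bias of B remains, and A wins.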
A reminder on logarithms
A numerical example
M = log(R/G) = log R - log G
A = (log R + log G) / 2
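A short numerical sketch of M and A follows, assuming base-2 logarithms as in the definition of M elsewhere in these slides. The intensity pairs are invented.

```python
import math

def m_and_a(r, g):
    """Return (M, A) for red and green intensities, using base-2 logarithms."""
    log_r = math.log2(r)
    log_g = math.log2(g)
    m = log_r - log_g        # log-ratio: which channel is stronger, and by how much
    a = (log_r + log_g) / 2  # average log-intensity: how bright the spot is overall
    return m, a

# Hypothetical (R, G) pairs: 2-fold up, unchanged, 2-fold down, 4-fold up
for r, g in [(2000, 1000), (1000, 1000), (1000, 2000), (1024, 256)]:
    m, a = m_and_a(r, g)
    print(f"R = {r:5d}  G = {g:5d}  M = {m:+.2f}  A = {a:.2f}")
```

On the log scale a 2-fold increase and a 2-fold decrease are symmetric (M = +1 and M = -1), whereas the raw ratios 2 and 0.5 are not.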
Step 2: An M vs A (MVA) Plot
[Figure labels: Positive controls (spotted in varying concentrations); Negative controls; blanks; Lowess curve]
Why use an M vs A plot ?
1. Logs stretch out region we are most interested in.
2. Can more clearly see features of the data such as intensity dependent variation, and dye-bias.
3. Differentially expressed genes more easily identified.
4. Intuitive interpretation
S1.n. Control Slide: Dye Effect, Spread.
MVA plot: looking at data
Lowess curve
Spot identifier
Step 3: Normalisation - global median centering
• Assumption: Changes roughly symmetric
• First panel: smooth density of log2G and log2R.
• Second panel: M vs A plot with median put to zero
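Global median centering amounts to subtracting one number, the median of M over the slide, from every spot. A minimal sketch with invented M values:

```python
import statistics

def median_centre(m_values):
    """Shift all log-ratios so that their median becomes zero."""
    med = statistics.median(m_values)
    return [m - med for m in m_values]

# Hypothetical log-ratios from one slide, shifted upwards by a dye effect
m_values = [0.9, 0.4, 0.5, 0.6, 2.1, 0.3, 0.5]
centred = median_centre(m_values)

print(statistics.median(m_values))   # median before centering
print(statistics.median(centred))    # median after centering
```

Because the same shift is applied at every intensity, this step cannot remove a bias that changes with A.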
common median
Step 4: Normalisation - lowess - local median centering
• Assumption: changes roughly symmetric at all intensities.
What is this normalization doing?
Local regression
• Classical (global) regression: fits a single line to the entire set of points
• Local regression: draws a curve through noisy data by smoothing
• Lowess (LOcally WEighted Scatterplot Smoothing) is a type of local regression
• Can correct for both print-tip and intensity-dependent bias with lowess fits to the data within print-tip groups
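The idea of local regression can be sketched in plain Python. This is a simplified stand-in for lowess, not the full algorithm: it fits a tricube-weighted straight line around each point but omits the robustness iterations, and the function names, the `frac` default and the data are assumptions made for illustration.

```python
def tricube(u):
    """Tricube weight: 1 at u = 0, falling smoothly to 0 at |u| >= 1."""
    u = abs(u)
    return (1 - u ** 3) ** 3 if u < 1 else 0.0

def local_fit(a_values, m_values, frac=0.4):
    """Fitted M at each A from a weighted straight-line fit to the nearest points."""
    n = len(a_values)
    k = min(n, max(3, round(frac * n)))  # number of neighbours used per local fit
    fitted = []
    for a0 in a_values:
        dists = sorted(abs(a - a0) for a in a_values)
        h = dists[k - 1] or 1.0          # bandwidth; fallback avoids division by zero
        w = [tricube((a - a0) / h) for a in a_values]
        sw = sum(w)                      # at least 1, from the point itself
        mean_a = sum(wi * a for wi, a in zip(w, a_values)) / sw
        mean_m = sum(wi * m for wi, m in zip(w, m_values)) / sw
        sxx = sum(wi * (a - mean_a) ** 2 for wi, a in zip(w, a_values))
        sxy = sum(wi * (a - mean_a) * (m - mean_m)
                  for wi, a, m in zip(w, a_values, m_values))
        slope = sxy / sxx if sxx > 0 else 0.0
        fitted.append(mean_m + slope * (a0 - mean_a))
    return fitted

def local_normalise(a_values, m_values, frac=0.4):
    """Subtract the local fit from M, removing intensity-dependent bias."""
    return [m - f for m, f in zip(m_values, local_fit(a_values, m_values, frac))]

# Hypothetical slide with a purely intensity-dependent dye bias and no real changes
a_values = [6 + 0.5 * i for i in range(20)]
m_values = [0.8 - 0.05 * a for a in a_values]
normalised = local_normalise(a_values, m_values)
print(max(abs(m) for m in normalised))
```

To correct print-tip effects as well, the same function would be applied separately to the spots of each print-tip group.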
Local regression illustrated
Lowess line
Step 5: Normalisation - spatial corrections
• After within slide global lowess normalization.
• Likely to be a spatial effect.
[Figure axes: Print-tip groups; Log-ratios]
Normalization between groups (ctd)
• After print-tip location- and scale- normalization.
[Figure axes: Print-tip groups; Log-ratios]
normalized values look nice, but .....
Effects of Location Normalisation (example)
Before
After
Boxplots of log ratios by pin group
Lowess lines through points from each pin group
Identifying sub-array effects
Step 6: Rescaling (Spread-Normalisation)
Assumption:
All (print-tip-)groups should have the same spread in M
True ratio is μij, where i represents different (print-tip-)groups and j represents different spots. Observed is Mij, where Mij = ai * log(μij)
Robust estimate of ai is: ai = MADi / (geometric mean of the MADs of all groups)
Corrected values are calculated as: Mij / ai
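One way to carry out the rescaling is sketched below. It assumes that the scale factor of each group is estimated as the group's MAD (median absolute deviation) divided by the geometric mean of the MADs of all groups, in line with the scale adjustment described under "Which normalization to use?". The function names and data are hypothetical.

```python
import statistics

def mad(values):
    """Median absolute deviation from the median: a robust measure of spread."""
    values = list(values)
    med = statistics.median(values)
    return statistics.median(abs(v - med) for v in values)

def rescale_groups(groups):
    """Rescale M within each print-tip group so that all groups share the same MAD.

    groups maps a group id to a list of location-normalised M values.
    Assumes every group has a MAD greater than zero.
    """
    mads = {g: mad(vals) for g, vals in groups.items()}
    geo_mean = statistics.geometric_mean(mads.values())
    scale = {g: mads[g] / geo_mean for g in groups}      # estimate of a_i
    return {g: [m / scale[g] for m in vals] for g, vals in groups.items()}

# Two hypothetical print-tip groups with the same centre but different spread
groups = {
    1: [-1.0, -0.5, 0.0, 0.5, 1.0],
    2: [-4.0, -2.0, 0.0, 2.0, 4.0],
}
rescaled = rescale_groups(groups)
for g, vals in rescaled.items():
    print(g, [round(v, 2) for v in vals], "MAD =", round(mad(vals), 2))
```

Dividing by the geometric mean keeps the overall scale of the slide unchanged while equalising the spread between groups.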
Illustration: print-tip-group - Normalisation
Assumption: For every print group: changes roughly symmetric at all intensities.
Glass Slide
Array of bound cDNA probes
4x4 blocks = 16 pin groups
Which normalization to use?
Case 1: A few genes that are likely to change and / or a random large collection of genes (expect as many up as down):
Each slide per se:
– Location: print-tip-group lowess normalization.
– Scale: for all print-tip-groups, adjust MAD to equal the geometric mean of the MADs of all print-tip-groups.
Case 2: Non-random gene collection and / or many genes do change appreciably:
– USE DYE-SWAP APPROACH
– Self-normalization: take the difference of the two log-ratios.
– Check using controls or known information.
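Self-normalization with a dye-swap pair can be sketched as follows. The sketch assumes that each spot carries the same dye bias on both slides and that the difference is halved to return to the scale of a single log-ratio; the halving, the function name and the numbers are assumptions made here.

```python
def dye_swap_self_normalise(m_forward, m_swapped):
    """Combine a dye-swap pair spot by spot: half the difference of the two log-ratios."""
    return [(mf - ms) / 2 for mf, ms in zip(m_forward, m_swapped)]

# Hypothetical data: real log-ratios and a spot-specific dye bias
real_log_ratio = [1.0, 0.0, -0.5]
dye_bias = [0.3, 0.2, 0.4]

# On the swapped slide the real signal changes sign, the dye bias does not.
m_forward = [t + b for t, b in zip(real_log_ratio, dye_bias)]
m_swapped = [-t + b for t, b in zip(real_log_ratio, dye_bias)]

print(dye_swap_self_normalise(m_forward, m_swapped))
```

The bias cancels in the difference without any assumption about how many genes change, which is why this approach suits Case 2.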
MVA plots: what to look at ?
How to use the spikes ?
Points:
signal intensity
background
saturation
homogeneity, normalizability
problem diagnosis
Webpage
How to use the plots ?
Use of the different options
Quality control before normalization (?)
Choice of normalization
END
questions