Canadian Bioinformatics Workshops

www.bioinformatics.ca

Lecture 8: Microarrays II: Data Analysis

MBP1010

Dr. Paul C. Boutros
Winter 2014

DEPARTMENT OF MEDICAL BIOPHYSICS

This workshop includes material originally developed by Drs. Raphael Gottardo, Sohrab Shah, Boris Steipe and others

Aegeus, King of Athens, consulting the Delphic Oracle. High Classical (~430 BCE)

Lecture 8: Microarrays Part II bioinformatics.ca

Course Overview

• Lecture 1: What is Statistics? Introduction to R
• Lecture 2: Univariate Analyses I: continuous
• Lecture 3: Univariate Analyses II: discrete
• Lecture 4: Multivariate Analyses I: specialized models
• Lecture 5: Multivariate Analyses II: general models
• Lecture 6: Sequence Analysis
• Lecture 7: Microarray Analysis I: Pre-Processing
• Lecture 8: Microarray Analysis II: Multiple-Testing
• Lecture 9: Machine-Learning
• Final Exam (written)


House Rules

• Cell phones to silent

• No side conversations

• Hands up for questions


Topics For This Week

• Examples

• Attendance

• Pre-Processing

• QA/QC

• Microarray-Specific Statistics

• ProbeSet remapping

• Organizing –omics studies


Example #1

You are conducting a study of osteosarcomas using mouse models. You are using a strain of mice that is naturally susceptible to these tumours at a frequency of ~20%. You are studying two transgenic lines, one of which has a deletion of a putative tumour suppressor (TS), the other of which has an amplification of a putative oncogene (OG). Tumour penetrance in these two lines is 100%. Your hypothesis: tumours in mice lacking TS will be smaller than those in mice with amplification of OG, as assessed by post-mortem volume measurements of the primary tumour. Your data:

TS (cm3): 3.9, 7.1, 3.1, 4.4, 5.0

OG (cm3): 5.2, 1.9, 5.0, 6.1, 4.5, 4.8


Example #2

You are conducting a study of osteosarcomas using mouse models. You are studying transgenic animals with deletion of a tumour suppressor (TS), or with amplification of an oncogene (OG). You consider the penetrance of tumours in a set of 8 different mouse strains. Your hypothesis: some mouse strains lead to bigger tumours than others when OG is amplified, considering only animals in which tumours form. You measure tumour volume in mm3 using calipers.

Strain 1 (mm3)916983

Strain 2 (mm3)2017071



Strain 5 (weeks)11

53859

Strain 6 (mm3)6

6063


Strain 8 (mm3)100105121


Example #3

You are conducting a study of osteosarcomas using mouse models. You are using a strain of mice that is naturally susceptible to these tumours at a frequency of ~20%. You are studying two transgenic lines, one of which has a deletion of a putative tumour suppressor (TS), the other of which has an amplification of a putative oncogene (OG). Tumour penetrance in these two lines is 100%. Your hypothesis: mice lacking TS are less likely to respond to a novel targeted therapeutic (DX) than wildtype animals, as assessed by molecular imaging:

TS (imaging response): Yes, No, Yes, Yes, No

WT (imaging response): Yes, Yes, Yes, Yes, No, Yes
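One candidate analysis for Example #3 is a one-sided Fisher's exact test on the 2x2 table of line versus response. The sketch below is only an illustration of that choice, not a model answer; the function name `fisher_one_sided` is hypothetical and only the Python standard library is used.

```python
from math import comb

def fisher_one_sided(a, b, c, d):
    """P(X <= a) for the 2x2 table [[a, b], [c, d]] with all margins fixed.

    a, b = responders / non-responders in group 1 (TS)
    c, d = responders / non-responders in group 2 (WT)
    """
    row1, col1, n = a + b, a + c, a + b + c + d
    lowest = max(0, row1 + col1 - n)  # smallest feasible count in cell a
    total = comb(n, row1)
    favourable = sum(comb(col1, k) * comb(n - col1, row1 - k)
                     for k in range(lowest, a + 1))
    return favourable / total

ts = ["Yes", "No", "Yes", "Yes", "No"]
wt = ["Yes", "Yes", "Yes", "Yes", "No", "Yes"]
a, b = ts.count("Yes"), ts.count("No")
c, d = wt.count("Yes"), wt.count("No")

# One-sided: are TS mice LESS likely to respond than wildtype?
p_value = fisher_one_sided(a, b, c, d)
print(f"TS {a}/{a + b} respond, WT {c}/{c + d} respond, one-sided p = {p_value:.3f}")
```

With only 11 animals the test has very little power, which is part of what the example is meant to make you think about.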


Example #4

You are conducting a study of osteosarcomas using mouse models. You are using a strain of mice that is naturally susceptible to these tumours at a frequency of ~20%. You are studying two transgenic lines, one of which has a deletion of a putative tumour suppressor (TS), the other of which has an amplification of a putative oncogene (OG). Based on your previous data, you now hypothesize that mice lacking TS will show a similar molecular response to DX as those with amplification of OG. You use microarrays to study 20,000 genes in each line, and identify the following genes as changed between drug-treated and vehicle-treated:

TS (DX-responsive genes): MYC, KRAS, CD53, CDH1, FBW1, SEPT7, MUC1, MUC3, MUC9, RNF3

OG (DX-responsive genes): MYC, KRAS, CD53, CDH1, MUC1, MARCH1, PTEN, IDH3, ESR2, RHEB, CTCF, STK11, MLL3, KEAP1, NFE2L2, ARID1A
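One way to ask whether the two gene lists in Example #4 overlap more than chance would predict is a hypergeometric tail probability. This is a sketch under the assumption that both lists were drawn from the same 20,000 genes and that every gene was equally likely to be called; `overlap_p_value` is a hypothetical helper name.

```python
from math import comb

ts_genes = {"MYC", "KRAS", "CD53", "CDH1", "FBW1", "SEPT7",
            "MUC1", "MUC3", "MUC9", "RNF3"}
og_genes = {"MYC", "KRAS", "CD53", "CDH1", "MUC1", "MARCH1", "PTEN", "IDH3",
            "ESR2", "RHEB", "CTCF", "STK11", "MLL3", "KEAP1", "NFE2L2", "ARID1A"}
n_genes = 20000  # genes measured on the array

def overlap_p_value(k, n_total, n_a, n_b):
    """P(overlap >= k) when n_b genes are drawn at random from n_total,
    of which n_a belong to the other list (hypergeometric upper tail)."""
    total = comb(n_total, n_b)
    tail = sum(comb(n_a, i) * comb(n_total - n_a, n_b - i)
               for i in range(k, min(n_a, n_b) + 1))
    return tail / total

shared = ts_genes & og_genes
expected = len(ts_genes) * len(og_genes) / n_genes
p_overlap = overlap_p_value(len(shared), n_genes, len(ts_genes), len(og_genes))
print(f"shared: {sorted(shared)}")
print(f"expected overlap by chance: {expected:.3f}, p = {p_overlap:.2e}")
```

Note that a significant overlap is not the same thing as a "similar response": 5 of 10 TS genes are shared, but 11 of 16 OG genes are not.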


Example #5

You are conducting a study of osteosarcomas using mouse models. You are using a strain of mice naturally susceptible to these tumours at ~20% penetrance. You are studying two transgenic lines, one with deletion of a tumour suppressor (TS), the other with amplification of an oncogene (OG). Tumour penetrance in these is 100%. Your hypothesis: tumour size differs between lines, but the comparison is confounded by differences in the age of the animals. Your data:

TS (cm3): 3.9 (17 weeks), 7.1 (15 weeks), 3.1 (15 weeks), 4.4 (22 weeks), 5.0 (22 weeks)

OG (cm3): 5.2 (17 weeks), 1.9 (9 weeks), 5.0 (15 weeks), 6.1 (15 weeks), 4.5 (21 weeks), 4.8 (20 weeks)

Wildtype (cm3): 1.1 (9 weeks), 1.5 (10 weeks), 2.1 (15 weeks), 2.5 (15 weeks), 0.3 (17 weeks), 2.2 (21 weeks)


Example #6

You are conducting a study of osteosarcomas using mouse models. You are using a strain of mice that is naturally susceptible to these tumours at a frequency of ~20%. You are studying two transgenic lines, one of which has a deletion of a putative tumour suppressor (TS), the other of which has an amplification of a putative oncogene (OG). Tumour penetrance in these two lines is 100%. Your hypothesis: mice lacking TS will acquire tumours sooner than wildtype mice. You test the mice weekly using ultrasound imaging. Your data:

TS (week of tumour): 4, 7, 7, 6, 5

OG (week of tumour): 3, 9, 3, 2, 4, 3








Summary Point #1:

Microarray data is analyzed with a pipeline of sequential algorithms.

This pipeline defines the standard workflow for microarray experiments.


[Figure: general microarray analysis workflow. Boxes: Quantitation (Cy3, Cy5), Spot Quality, Background, Intra-Array, Inter-Array, Significance Testing, Clustering, Spot List, Integration (?)]


Summary Point #2:

This is an active research area.


Summary Point #3:

These basic steps hold true for all microarray platforms and types.


What Is BioConductor?

“Bioconductor is an open source, open development software project to provide tools for the analysis and comprehension of high-throughput genomic data.”

- BioConductor website

The vast majority of our analyses will use BioConductor code, but there are clearly non-BioConductor approaches.

I’ve outlined the general workflow.

Each technology and application has its own unique characteristics to consider.


Let’s Define an Affymetrix-Specific Workflow


[Figure: the general workflow (Quantitation, Cy3/Cy5, Spot Quality, Background, Intra-Array, Inter-Array, Significance Testing, Clustering, Spot List, Integration?) annotated for Affymetrix arrays]

Annotations on the figure:

• Quantitation is done according to Affymetrix defaults with minimal user intervention.
• One-channel array
• Typically ignored
• Single-channel array, so one simultaneous normalization procedure


Let’s Collapse This a Bit And Re-Phrase Things


[Figure: collapsed workflow. Boxes: .CEL Files, Background, Normalization, ProbeSet Annotation, Statistics, Clustering, Spot List, Integration (?)]


First let’s go Back to Pre-Processing

What exactly is pre-processing (aka normalization)?


Why do we do it?


Sources of Technical Noise

Where does technical noise come from?


More Sources of Technical Noise


Any step in the experimental pipeline can introduce artifactual noise

• Array design
• Array manufacturing
• Sample quality
• Sample identity (sequence effects?)
• Sample processing
• Hybridization conditions (ozone?)
• Scanner settings

Pre-Processing tries to remove these systematic effects


Important Note

Pre-processing is never a substitute for good experimental design. This is not a course on statistical design, but a few basic principles should be mentioned.

Always try to balance experimental groups.

Biological replicates are preferable to technical replicates.

If processing samples identically is not possible, include controls for processing-effects.




Affymetrix Pre-Processing Steps

1. Background Correction

2. Normalization

3. Probe-Specific Adjustment

4. Summarizing multiple Probes into a single ProbeSet

Let’s look at two common approaches


Introducing Two Major Affymetrix Pre-Processing Methods

• The two most commonly used methods are:
  • RMA = Robust Multi-array Average
  • MAS5 = Microarray Suite version 5

• MAS5 has strengths & weaknesses
  • Sacrifices precision for accuracy
  • Can easily be used in clinical settings

• RMA has strengths & weaknesses
  • Sacrifices accuracy for precision
  • Challenging to integrate multiple studies
  • Reduces variance (critical for small-n studies)

• Both are well accepted by journals and reviewers, perhaps RMA a bit more so. We’ll talk about some of the mathematics later on in this course.


Approach #1: MAS5

• Affymetrix put significant effort into developing good data pre-processing approaches

• MAS5 was an attempt to develop a “standard” technique for 3’ expression arrays

• The flaws of MAS5 led to an influx of research in this area.

• The algorithm is best-described in an Affymetrix white-paper, and is actually quite challenging to reproduce exactly in R.


MAS5 Model

Observations = True Signal + Random Noise + Probe Effects

Assumptions?


MAS5: Background & Noise

Background

• Divide chip into zones

• Select lowest 2% intensity values

• stdev of those values is zone variability

• Background at any location is the sum of all zone backgrounds, weighted by 1/((distance^2) + fudge factor)

Noise

• Using same zones as above

• Select lowest 2% background

• stdev of those values is zone noise

• Noise at any location is the sum of all zone noise, weighted as above

• From http://www.affymetrix.com/support/technical/whitepapers/sadd_whitepaper.pdf


MAS5: Adjusted Intensity

A = Intensity minus background, the final value should be > noise.

• A: adjusted intensity
• I: measured intensity
• b: background
• NoiseFrac: default 0.5 (another fudge factor)

And the value should always be >= 0.5 (log issues; another fudge factor)
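The rule above can be sketched in a few lines. This is one reading of the slide, assuming that the measured intensity is first floored at 0.5 and that the background-subtracted value is then floored at NoiseFrac times the local noise; the exact formula should be checked against the Affymetrix whitepaper. The function and argument names are hypothetical.

```python
def adjusted_intensity(intensity, background, noise, noise_frac=0.5, floor=0.5):
    """Background-subtracted intensity that never drops below a noise-based floor.

    Assumed form: A = max(max(I, floor) - b, noise_frac * noise)
    """
    shifted = max(intensity, floor)  # keeps later log2 transforms well-defined
    return max(shifted - background, noise_frac * noise)

# Bright probe: plain background subtraction applies.
print(adjusted_intensity(intensity=100.0, background=30.0, noise=4.0))
# Dim probe: subtraction would leave 1.0, so the noise floor (0.5 * 4.0) wins.
print(adjusted_intensity(intensity=31.0, background=30.0, noise=4.0))
```

The floor matters because everything downstream works on a log scale, where zero or negative intensities are undefined.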



MAS5: Ideal Mismatch

Because Sometimes MM > PM



MAS5: Signal

Value for each probe:

Modified mean of probe values:

Scaling Factor (Sc default 500)

• Tbi = Tukey Biweight (mean estimate, resistant to outliers)
• TrimMean = Mean less top and bottom 2%

ReportedValue(i) = nf * sf * 2^(SignalLogValue_i)

For Signal, nf = 1
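The two summaries named above can be sketched as follows. This is a simplified illustration, not the Affymetrix implementation: the one-step biweight uses tuning constants c = 5 and epsilon = 0.0001 as assumptions, the trimmed mean ignores fractional trimming, and sf is assumed to be Sc divided by the trimmed mean of the linear-scale signals. All function names are hypothetical.

```python
from statistics import mean, median

def tukey_biweight(values, c=5.0, eps=1e-4):
    """One-step Tukey biweight: a mean that down-weights points far from the median."""
    m = median(values)
    s = median([abs(v - m) for v in values])  # median absolute deviation
    weights = []
    for v in values:
        u = (v - m) / (c * s + eps)
        weights.append((1 - u * u) ** 2 if abs(u) <= 1 else 0.0)
    return sum(w * v for w, v in zip(weights, values)) / sum(weights)

def trimmed_mean(values, trim=0.02):
    """Mean after dropping the lowest and highest `trim` fraction of values."""
    ordered = sorted(values)
    k = int(len(ordered) * trim)
    return mean(ordered[k:len(ordered) - k])

def reported_values(signal_log_values, sc=500.0, nf=1.0):
    """ReportedValue(i) = nf * sf * 2^(SignalLogValue_i), with sf = Sc / TrimMean."""
    linear = [2 ** s for s in signal_log_values]
    sf = sc / trimmed_mean(linear)
    return [nf * sf * x for x in linear]

# Hypothetical log2 probe values for one ProbeSet; the last probe is an outlier.
probe_values = [7.9, 8.0, 8.1, 8.2, 12.0]
signal_log_value = tukey_biweight(probe_values)
print(f"plain mean {mean(probe_values):.2f}, biweight {signal_log_value:.2f}")

# Hypothetical SignalLogValues for four ProbeSets on one array.
scaled = reported_values([signal_log_value, 6.5, 9.2, 10.1])
print(f"mean signal after scaling: {mean(scaled):.1f}")
```

The outlying probe gets zero weight, so the biweight stays near 8 while the plain mean is pulled upward; the scaling step then forces every array to the same (trimmed) mean signal.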


What is RMA?

RMA = Robust Multi-array Average

Why do we use a “robust” method?

Robust summaries improve on the standard ones by down-weighting outliers and leaving their effects visible in the residuals.

Why do we use “array”?

To put each chip’s values in the context of a set of similar values.


What is RMA?

Assumes all the chips have the same background distribution

Does not use the mismatch probe (MM) data from the microarray experiments

It is a log scale linear additive model

Why?


What is RMA?

Mismatch probes (MM) definitely have information - about both signal and noise - but using it without adding more noise is a challenge

We should be able to improve the background correction using MM, without having the noise level blow up: topic of current research (GCRMA)

Ignoring MM decreases accuracy but increases precision


Methodology

Quantile Normalization – the goal of this method is to make the distribution of probe intensities for each array in a set of arrays the same. This method is motivated by the idea that a Q-Q plot shows that the distribution of two data vectors is the same if the plot is a straight diagonal line and not the same if it is anything else.
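The idea can be sketched in plain Python: sort each array, average across arrays at each rank to get a reference distribution, then give every probe the reference value for its rank. This is a minimal illustration with hypothetical intensities; ties are broken arbitrarily here, whereas production implementations handle them more carefully.

```python
def quantile_normalize(arrays):
    """Force every array (a list of probe intensities) onto the same distribution.

    arrays: list of equal-length lists, one per microarray.
    Returns new lists; the inputs are not modified.
    """
    n_probes = len(arrays[0])
    sorted_arrays = [sorted(a) for a in arrays]
    # Reference distribution: mean of the k-th smallest value across arrays.
    reference = [sum(column) / len(arrays) for column in zip(*sorted_arrays)]
    normalized = []
    for a in arrays:
        order = sorted(range(n_probes), key=lambda i: a[i])  # probe indices by rank
        out = [0.0] * n_probes
        for rank, probe_index in enumerate(order):
            out[probe_index] = reference[rank]
        normalized.append(out)
    return normalized

# Three hypothetical arrays measuring the same four probes.
raw = [
    [5.0, 2.0, 3.0, 4.0],
    [4.0, 1.0, 3.5, 2.0],
    [3.0, 4.0, 6.0, 8.0],
]
normalized = quantile_normalize(raw)
for row in normalized:
    print([round(v, 2) for v in row])
```

After normalization every array contains exactly the same set of values, so a Q-Q plot of any two arrays falls on the diagonal; only the assignment of values to probes differs.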



Methodology

Summarization: combining multiple probe intensities of each probeset to produce expression values

An additive linear model is fit to the normalized data to obtain an expression measure for each probeset on each GeneChip

Yij = aj + βi + εij


Methodology


Yij denotes the background-corrected, normalized, log2-transformed probe value for the ith GeneChip and the jth probe within the probeset, i.e. log2(PM - BG) after normalization

εij is the random error term

aj is the probe affinity of the jth probe

βi is the chip effect for the ith GeneChip (log scale expression level)


Methodology


Estimate aj (probe affinity) and βi (chip effect) using a robust method:

• Tukey’s Median polish (quick) - fits iteratively, successively removing row and column medians, and accumulating the terms, until the process stabilizes. The residuals are what is left at the end
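Median polish is short enough to write out. The sketch below follows the description above (alternately sweep out row and column medians until nothing changes) on a hypothetical probeset of three chips by four probes with one outlying probe value; the function name and the data are assumptions made for illustration.

```python
from statistics import median

def median_polish(matrix, max_iter=10, tol=1e-6):
    """Fit value[i][j] = overall + row_eff[i] + col_eff[j] + residual[i][j].

    Rows are GeneChips (i), columns are probes (j), values are log2 intensities.
    """
    resid = [row[:] for row in matrix]
    n_rows, n_cols = len(resid), len(resid[0])
    overall, row_eff, col_eff = 0.0, [0.0] * n_rows, [0.0] * n_cols
    for _ in range(max_iter):
        # Sweep row medians out of the residuals into the row effects.
        row_med = [median(r) for r in resid]
        for i in range(n_rows):
            row_eff[i] += row_med[i]
            resid[i] = [v - row_med[i] for v in resid[i]]
        shift = median(col_eff)
        col_eff = [c - shift for c in col_eff]
        overall += shift
        # Sweep column medians out of the residuals into the column effects.
        col_med = [median([resid[i][j] for i in range(n_rows)]) for j in range(n_cols)]
        for j in range(n_cols):
            col_eff[j] += col_med[j]
            for i in range(n_rows):
                resid[i][j] -= col_med[j]
        shift = median(row_eff)
        row_eff = [r - shift for r in row_eff]
        overall += shift
        if max(abs(m) for m in row_med + col_med) < tol:
            break
    return overall, row_eff, col_eff, resid

# Additive data (chip effects 8, 9, 10) with one corrupted probe on chip 2.
probeset = [
    [7.0, 8.0, 9.0, 8.5],
    [8.0, 9.0, 15.0, 9.5],
    [9.0, 10.0, 11.0, 10.5],
]
overall, chip_eff, probe_eff, resid = median_polish(probeset)
expression = [overall + b for b in chip_eff]  # one value per chip
print("expression per chip:", expression)
print("largest residual:", max(max(abs(v) for v in r) for r in resid))
```

The corrupted probe ends up in the residuals rather than in the chip effect, which is the sense in which the summary is robust.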


RMA vs. MAS5

• RMA sacrifices accuracy for precision

• RMA is generally not appropriate for clinical settings

• RMA provides higher sensitivity/specificity in some tests

• RMA reduces variance (critical for small-n studies)

• RMA is better accepted by journals and reviewers








One key detail has been omitted so far:

How do we know if our pre-processing actually worked?


Can we determine how well our pre-processing worked?

Or if our data looks good?


Let’s See Some “Bad” Data


Those Three Were From A Spike-In Experiment Done by Affymetrix


Those Last Three Were From An Experiment We Did On Rat Liver Samples


Were Those Bad Samples?

• Lots of evident spatial artifacts

• But in practice all samples were carried forward into analysis

• And validation (RT-PCR) confirmed the overall study results for many genes


Eye-ball Assessments Are Hard

• A couple of useful tricks:

• Look at the distributions
  • Did quantile normalization work (for RMA)?

• Look at the inter-sample correlations
  • Is one sample a strong outlier?

• Look at the 3’ to 5’ trend across a ProbeSet

I know of no accepted, systematic QA/QC methods


Distributions (Raw)


Distributions (normalized)


Inter-Sample Correlations


3’ to 5’ Signal Trend


What Do You Do If You Find a Bad Array?

• Repeat it?

• Drop the sample?

• Include it but account for the “noise” in another way?


In This Case

• We excluded a series of outlier samples

• We believed these samples had been badly degraded because they were derived from FFPE blocks


Final Distribution


Final Heatmap








T-tests

• What are the assumptions of the t-test?

• When would you feel comfortable using a t-test?


T-Test Alternative: Wilcoxon Rank-Sum

• Also called:
  • U-test
  • Mann-Whitney (U) test

• Some argue that for continuous microarray data there is rarely a good reason to use this test:
  • Low n: tests of normality are not very powerful
  • High n: the central limit theorem provides support

• If the data are normally distributed, the asymptotic relative efficiency of this test compared with the t-test is about 0.95


T-Test Alternative: Moderated Statistics

• A series of highly complex methods based on Bayesian statistical methodologies

• Gordon Smyth’s limma R package is by far the most widely used implementation of this technique

This term (the gene-wise variance estimate in the t-statistic) is “shrunk” by borrowing power across all genes. This increases effective power.
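For reference, the shrinkage can be written out. The notation below follows the usual presentation of the limma model and is not taken from the slide: s_g^2 is the ordinary variance estimate for gene g with d_g degrees of freedom, s_0^2 and d_0 are a prior variance and prior degrees of freedom estimated from all genes together, \hat{\beta}_g is the estimated log fold-change, and v_g is the unscaled variance factor from the design.

```latex
\tilde{s}_g^{2} = \frac{d_0\, s_0^{2} + d_g\, s_g^{2}}{d_0 + d_g},
\qquad
\tilde{t}_g = \frac{\hat{\beta}_g}{\tilde{s}_g \sqrt{v_g}}
```

The moderated statistic is compared against a t-distribution with d_0 + d_g degrees of freedom rather than d_g, which is where the extra effective power comes from when n is small.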


T-Test Alternative: Permutation Tests

• SAM is the classic method
  • Most people suggest not using SAM today

• Empirically estimate the null distribution

[Figure: start with many samples, randomly sample, iterate]
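A permutation test can be sketched on the tumour volumes from Example #1. Because that data set is tiny, the sketch enumerates all 462 possible relabelings instead of randomly sampling them, so it is an exact variant of the procedure described above rather than SAM or any published method. The function name is hypothetical.

```python
from itertools import combinations
from statistics import mean

ts = [3.9, 7.1, 3.1, 4.4, 5.0]        # TS tumour volumes (cm3), Example #1
og = [5.2, 1.9, 5.0, 6.1, 4.5, 4.8]   # OG tumour volumes (cm3), Example #1

def permutation_test(group_a, group_b):
    """One-sided exact permutation test of mean(a) - mean(b).

    Returns (observed difference, p-value, number of relabelings), where the
    p-value is the fraction of relabelings with a difference <= the observed one.
    """
    pooled = group_a + group_b
    n_a, n = len(group_a), len(pooled)
    total = sum(pooled)
    observed = mean(group_a) - mean(group_b)
    as_extreme = n_relabelings = 0
    for indices in combinations(range(n), n_a):  # every way to label n_a animals "a"
        sum_a = sum(pooled[i] for i in indices)
        diff = sum_a / n_a - (total - sum_a) / (n - n_a)
        n_relabelings += 1
        if diff <= observed + 1e-12:  # small tolerance for floating-point ties
            as_extreme += 1
    return observed, as_extreme / n_relabelings, n_relabelings

observed, p_value, n_relabelings = permutation_test(ts, og)
print(f"observed difference (TS - OG): {observed:.3f} cm3")
print(f"one-sided p over {n_relabelings} relabelings: {p_value:.3f}")
```

With thousands of genes and more samples, full enumeration is not feasible, which is why the labels are randomly sampled and the procedure iterated.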


Problems with Significance Testing

• What happens if there are NO changes?

• Imagine:
  • You analyzed 1,000 clinical samples
  • 20,000 genes in the genome
  • P < 0.05

• What if… somebody comes and randomizes all your data?


You had a lot of Data

• 20,000 genes / array
• 1,000 patients
• 20,000,000 data points
• All randomized: genes are mixed up together, patients are mixed together

What happens if you analyze this data?

There should be NO real hits anymore!


What will you actually find?

Array: 20,000 genes

Threshold: p < 0.05

20,000 x 0.05 = 1000 False Positives

This is called “multiple testing”.

There is a solution
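The arithmetic is easy to check by simulation. The sketch relies on one standard fact: when there is no real effect, p-values are uniformly distributed between 0 and 1, so random numbers can stand in for the 20,000 tests on randomized data. The seed is arbitrary.

```python
import random

random.seed(1010)  # arbitrary seed so the run is repeatable

n_genes = 20000
alpha = 0.05

# One null p-value per gene: nothing is truly changed.
p_values = [random.random() for _ in range(n_genes)]
false_positives = sum(p < alpha for p in p_values)

print(f"expected false positives: {n_genes * alpha:.0f}")
print(f"observed false positives: {false_positives}")
```

The observed count lands close to 1,000 on every run, and every one of those "hits" is false.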


A “false-discovery rate adjustment” (FDR) for multiple testing considers all 20,000 p-values simultaneously

In this experiment, lots of low p-values, so we can use this to “adjust” the p-values so we can find the true hits.

[Figure: p-value plot with labels “P-Value” and “Expected Value”; axis from 0% to 20%]


In this experiment, NO enrichment for low p-values, so no more hits than expected randomly.

This is what you get from randomized data…
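The slides do not say which FDR adjustment is meant; a common choice is the Benjamini-Hochberg procedure, sketched here as an assumption. Each p-value is multiplied by (number of tests / rank), and the results are then made monotone from the largest p-value downwards.

```python
def benjamini_hochberg(p_values):
    """Return Benjamini-Hochberg adjusted p-values in the original order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])  # indices, smallest p first
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value to the smallest, enforcing monotonicity.
    for rank in range(m, 0, -1):
        index = order[rank - 1]
        running_min = min(running_min, p_values[index] * m / rank)
        adjusted[index] = running_min
    return adjusted

# Four hypothetical genes.
p_values = [0.03, 0.001, 0.5, 0.01]
adjusted = benjamini_hochberg(p_values)
for p, q in zip(p_values, adjusted):
    print(f"p = {p:<6} adjusted = {q:.3f}")
```

When the data contain many small p-values the adjusted values stay small; for randomized data, where p-values are uniform, almost nothing survives the adjustment.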








The Mask Production Makes Affymetrix Designs Expensive To Change

Photolithographic mask


But… there are multiple probes per gene


We Can Change Those Mappings!

[Figure: hybridized chip]


CDF File

• Chip Definition File

• This file maps Probes (positions) into ProbeSets

• We can update those mappings
  • Ignore deprecated or cross-hybridizing probes
  • Merge multiple probes that recognize the same gene
  • Account for entirely new genes that were not known at the time of array-design
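The remapping logic can be illustrated with dictionaries. Everything in this sketch is hypothetical: the probe IDs, ProbeSet names and gene names are invented, and real CDF files are binary or structured text files that would be read with dedicated tools rather than typed in by hand.

```python
from collections import defaultdict

# Hypothetical original CDF: ProbeSet name -> probes assigned at design time.
original_cdf = {
    "1000_at": ["p1", "p2", "p3"],
    "1001_at": ["p4", "p5"],
    "1002_at": ["p6", "p7"],
}

# Hypothetical result of re-aligning each probe against a current transcriptome.
probe_to_genes = {
    "p1": ["GENE_A"],
    "p2": ["GENE_A"],
    "p3": ["GENE_A", "GENE_B"],  # cross-hybridizing: matches two genes
    "p4": ["GENE_A"],            # same gene as 1000_at, so the sets get merged
    "p5": [],                    # deprecated: no longer matches anything
    "p6": ["GENE_C"],
    "p7": ["GENE_C"],
}

def remap_probes(cdf, alignments):
    """Build gene-level ProbeSets, keeping only probes that match exactly one gene."""
    remapped = defaultdict(list)
    for probes in cdf.values():
        for probe in probes:
            genes = alignments.get(probe, [])
            if len(genes) == 1:
                remapped[genes[0]].append(probe)
    return dict(remapped)

remapped_cdf = remap_probes(original_cdf, probe_to_genes)
kept = sum(len(probes) for probes in remapped_cdf.values())
total = sum(len(probes) for probes in original_cdf.values())
print(remapped_cdf)
print(f"kept {kept} of {total} probes")
```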


Sequence Mappings Are Slow

• Requires aligning millions of 25 bp probes against the transcriptome and identifying the best match for each

• Fortunately, other groups have done this for us, and regularly update their mappings


Many Probes Are Lost


But There Is Also A Major Benefit

Increased validation rates using RT-PCR (~10%)

Sandberg et al., BMC Bioinformatics, 2007








What Are The Outputs of A Microarray Study?

• Primary Data
  • Raw image (.DAT file)
  • Quantitation (.CEL file)

• Secondary Data
  • Normalized data (usually an ASCII text file)
  • QA/QC plots

• Tertiary Data
  • Statistical analyses
  • Global visualization (e.g. heatmaps)
  • Downstream analyses (e.g. pathway, dataset-integration)

These files can be 10s of GB for a typical Affy study


How Do You Organize These Data?

/data/

I recommend you put things on a fast, backed-up network drive

/data/Project

Organize data by project

/data/Project/raw
/data/Project/QAQC
/data/Project/pre-processing
/data/Project/statistical
/data/Project/pathway

Create separate directories for each analysis
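Creating the same layout for every project is easy to script. The sketch below builds the tree in a temporary directory so that it can be run safely anywhere; in practice the root would be the backed-up network drive (for example /data). The helper name `create_project_tree` is hypothetical.

```python
import tempfile
from pathlib import Path

# One sub-directory per analysis, as listed above.
ANALYSES = ["raw", "QAQC", "pre-processing", "statistical", "pathway"]

def create_project_tree(data_root, project):
    """Create <data_root>/<project>/<analysis> for each analysis; safe to re-run."""
    project_dir = Path(data_root) / project
    for name in ANALYSES:
        (project_dir / name).mkdir(parents=True, exist_ok=True)
    return project_dir

# Temporary stand-in for the real data root.
data_root = tempfile.mkdtemp()
project_dir = create_project_tree(data_root, "Project")
created = sorted(p.name for p in project_dir.iterdir())
print(project_dir)
print(created)
```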


How Do You Organize The Scripts?

I recommend you write a separate script for each analysis, and put those in a standardized (backed-up!) location, mirroring the directory structure and naming of your dataset directories.

Some sub-structure here is often useful:

/scripts/Project/pre-processing.R
/scripts/Project/statistical-univariate.R
/scripts/Project/statistical-multivariate.R
/scripts/Project/pathway/GOMiner.R
/scripts/Project/pathway/Reactome.R
/scripts/Project/integration/mRNA+CNV.R
/scripts/Project/integration/public-data.R


Why Many Small Scripts?

• Monolithic scripts are hard to maintain
  • Easier to make errors
    • Accidentally re-using the same variable name
  • Harder to debug
  • Harder for somebody else to learn

• Small scripts are more flexible
  • Quicker to modify/re-run a small part of your analysis
  • Easier to re-use the same code on another dataset

• This is akin to the “unix” mindset of systems design


What To Save?

• Everything!!
  • All QA/QC plots (common reviewer request)
  • All pre-processed data (needed for GEO uploads)
  • Gene-wise statistical analyses
    • Not just the statistically-significant genes
    • Collapse all analyses into one file, though
  • All plots/etc

• Using clear filenames is critical

• Disk-space is not usually a critical concern here
  • Your raw data will be much larger than your output!


Most Important Points

• Do not delete things:
  • Keep all old versions of your scripts by including the date in the filename (or using source-control)
  • Version output files by date
  • I have needed to go back to analyses done 7 years prior!

• Make regular (weekly) backups:
  • Try to pass this work off to professional sysadmins
  • External hard-drives/USBs are okay if you cannot get access to network drives, but try to automate
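Date-stamped filenames are simple to generate consistently. The naming pattern below (ISO date first, then a descriptive stem) is only one possible convention and is an assumption, as are the helper name and the example filename; ISO dates have the advantage that an alphabetical listing is also chronological.

```python
from datetime import date

def versioned_name(stem, extension, when=None):
    """Build a filename like 2014-02-27_Project_statistical-univariate.txt."""
    when = when or date.today()
    return f"{when.isoformat()}_{stem}.{extension}"

# Fixed date for a reproducible example; omit `when` to use today's date.
example = versioned_name("Project_statistical-univariate", "txt", date(2014, 2, 27))
print(example)
print(versioned_name("Project_QAQC-heatmap", "png"))
```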


Course Overview

• Lecture 1: What is Statistics? Introduction to R
• Lecture 2: Univariate Analyses I: continuous
• Lecture 3: Univariate Analyses II: discrete
• Lecture 4: Multivariate Analyses I: specialized models
• Lecture 5: Multivariate Analyses II: general models
• Lecture 6: Sequence Analysis
• Lecture 7: Microarray Analysis I: Pre-Processing
• Lecture 8: Microarray Analysis II: Multiple-Testing
• Lecture 9: Machine-Learning
• Final Exam (written)
