QC and pre-processing of microarray data Lars Eijssen - BiGCaT Bioinformatics

Upload: michael-walton

Post on 29-Jan-2016


Page 1: QC and pre-processing of microarray data Lars Eijssen - BiGCaT Bioinformatics

QC and pre-processing of microarray data

Lars Eijssen - BiGCaT Bioinformatics


FHML - 2

Contents

Background on quality control (QC) and (further) data pre-processing

Application of an automated workflow for Affymetrix data
– Settings
– Illustration on data sets
– Interpretation of outcome

Introduction to the afternoon session and the data set to be used


BACKGROUND


Proper quality control (QC)

• Ensures validity of study results

• Is pivotal in -omics research
– Hard to judge quality by eye

• Several tables and images assist in judging quality

Here we focus on QC of gene expression arrays


Data analysis overview

[Figure: analysis flow from samples (untreated control versus exposed to compound) to results]

Microarray scans → (image analysis) → Raw data → (quality control, further pre-processing) → Normalised data → (statistical analysis) → List of regulated genes → (pattern analysis, pathway analysis, literature data) → Results

Further pre-processing:
• Background correction
• Normalisation

Slide based on a slide from J. Pennings, RIVM, NL


QC and pre-processing

• Ensure signal comparability within each array
– Stains on the array
– Gradient over the array

• Ensure comparable signals between all arrays
– Degraded / low quality sample
– Failed hybridisation
– Too low or high overall intensity

• Some effects can be corrected for, others require removal of data from the set


QC for one and two channel microarrays

• The principles are similar for both types of arrays

• But the details are different

• In two channel arrays QC is a bit more complex
– Each spot consists of two measurements, not one
– Dye effect

• I will further discuss QC later in this talk, focusing on one channel arrays (Affymetrix chips)


Dye bias

[Figure panels: foreground intensity; background intensity]


Red and green foreground intensity

For two channel arrays, it is relevant to check whether effects cancel out between channels


Pre-processing: background correction

• Background signal needs to be corrected for
– For example, signal from remaining non-hybridised mRNA

• Three types of background
– Overall slide background
– Local slide background
– Specific background
  • For example cross-hybridisation, which can be corrected for using mismatch probes (in the case of Affymetrix chips)
  • Mismatch probes are also used to make present/marginal/absent calls
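As an illustration of local background correction, here is a minimal Python sketch. It assumes per-spot foreground and local background intensities are available as plain lists. The function name, the example values and the floor of 1.0 are hypothetical, and plain subtraction is only one of several possible correction schemes.

```python
def background_correct(foreground, background, floor=1.0):
    """Subtract the local background from each spot's foreground intensity.

    Values that would drop below `floor` are clamped to `floor`
    (an assumed choice, not a standard).
    """
    if len(foreground) != len(background):
        raise ValueError("foreground and background must have the same length")
    return [max(fg - bg, floor) for fg, bg in zip(foreground, background)]


# Hypothetical intensities for three spots
foreground = [520.0, 130.0, 95.0]
local_background = [100.0, 110.0, 120.0]

corrected = background_correct(foreground, local_background)
print(corrected)  # [420.0, 20.0, 1.0]
```

The clamp keeps every corrected intensity positive, so the values can still be log-transformed later on.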


Pre-processing: normalisation

• After discarding bad arrays and spots, the remaining within- and between-array differences that are not related to the biology need to be corrected for

• The procedure is cyclic
– Several QC plots are made before and after normalisation
– Whether normalisation can correct an artifact may influence the decision to discard or not
– After data selection, the complete QC should be run again
  • Some aberrations may have been masked by larger ones


Log transformation

• Generally, the intensities are first 2log-transformed (log base 2)
– The distribution of the logged intensities is more ‘normal’ than on the original scale

• After logging and normalisation one can compute the difference in means (‘logFC’) between experimental groups
– The difference is easier to handle statistically

• 2^logFC corresponds to the fold change (ratio) on the original scale
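The computation described above can be sketched in a few lines of Python. The intensities and function names are hypothetical, and a single probeset measured on two arrays per group is assumed. Note that 2^logFC computed this way is the ratio of the geometric means of the original intensities.

```python
import math
from statistics import mean


def log2_values(intensities):
    """Log2-transform a list of positive intensities."""
    return [math.log2(x) for x in intensities]


def log_fc(treated, control):
    """Difference in mean log2 intensity (treated minus control)."""
    return mean(log2_values(treated)) - mean(log2_values(control))


# Hypothetical raw intensities for one probeset
control = [100.0, 200.0]
treated = [400.0, 800.0]

lfc = log_fc(treated, control)
fold_change = 2 ** lfc  # back on the original scale
print(round(lfc, 6), round(fold_change, 6))
```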


The log Fold Change

• The logFC ‘spreads out’ the data and offers symmetry

• ‘raw’ ratio (FC): ½, 1, 2

• log ratio (logFC): 2log of ½, 1, 2
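The symmetry can be checked numerically. This small Python sketch uses an arbitrary, illustrative set of fold changes and shows that halving and doubling end up at the same distance from zero on the log scale.

```python
import math

# Illustrative fold changes: two down-regulated, unchanged, two up-regulated
fold_changes = [0.25, 0.5, 1.0, 2.0, 4.0]

# Map each raw ratio to its log2 ratio
log_ratios = {fc: math.log2(fc) for fc in fold_changes}

for fc, lfc in log_ratios.items():
    print(f"FC = {fc:<5} logFC = {lfc:+.1f}")
```

On the raw scale all down-regulation is squeezed between 0 and 1, whereas on the log scale a two-fold decrease (-1) mirrors a two-fold increase (+1).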


Spotted and Affymetrix arrays

Spotted arrays
– Either one or two channel
– Spot-level QC often included
– Also often parts of arrays are flagged
– Each gene is measured by only one or two probes on the array

Affymetrix chips (main focus in the remainder of the talk)
– Always one channel
  • No dye effect
– No spot-level QC is taken into account
– No flagging of local aberrations
– Each gene is measured by a probeset of probes spread randomly over the array


Pre-processing for Affymetrix chips

• A specific extra step is summarisation of probe values into one value for each probeset

• Well-known methods for pre-processing Affymetrix chips

– MAS5.0 (uses mismatch intensities)

– RMA (Robust Multiarray Average, does not use mismatches)
  • Includes both background correction and (quantile) normalisation

– GC-RMA (like RMA, but also takes into account GC content)

– dChip (model-based)

– For Exon ST and Gene ST arrays, only RMA can be used of the methods above (another option is PLIER, which uses an error model)
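To make the quantile normalisation step mentioned under RMA more concrete, here is a minimal pure-Python sketch. It illustrates only the quantile step, not RMA as a whole (no background correction or probe summarisation). The data are hypothetical, and ties are broken by position, which is a simplification.

```python
from statistics import mean


def quantile_normalise(arrays):
    """Give every array the same distribution of values.

    `arrays` is a list of equal-length lists, one list per array.
    The k-th smallest value of each array is replaced by the mean of
    the k-th smallest values over all arrays.
    """
    n = len(arrays[0])
    if any(len(a) != n for a in arrays):
        raise ValueError("all arrays must have the same number of probes")

    sorted_arrays = [sorted(a) for a in arrays]
    # Reference distribution: mean per rank across arrays
    reference = [mean(s[k] for s in sorted_arrays) for k in range(n)]

    normalised = []
    for a in arrays:
        order = sorted(range(n), key=lambda i: a[i])  # probe indices, low to high
        out = [0.0] * n
        for rank, idx in enumerate(order):
            out[idx] = reference[rank]
        normalised.append(out)
    return normalised


# Two hypothetical arrays with three probes each
arrays = [[5.0, 2.0, 3.0], [4.0, 1.0, 6.0]]
normalised = quantile_normalise(arrays)
print(normalised)  # [[5.5, 1.5, 3.5], [3.5, 1.5, 5.5]]
```

After this step every array contains exactly the same set of values; only the order of the probes within each array differs.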


Custom CDF files

• Affymetrix provides annotations for their probesets (CDF file)

• When these get outdated, one can of course update probeset annotations

• But it may be even better to:
– disassemble these sets into the separate probes
– reannotate the probes
– reassemble these into new, different probesets

• This is exactly what custom CDF files do

• Note that reassembled probesets do not necessarily contain the same number of probes anymore
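The disassemble / reannotate / reassemble idea can be sketched as a simple regrouping. Everything in this example is hypothetical (probe IDs, gene labels, function name). It assumes the reannotation step has already produced a probe-to-gene mapping, with None for probes that no longer match any gene.

```python
from collections import defaultdict


def reassemble_probesets(probe_to_gene):
    """Group individual probes into new probesets, one per annotated gene.

    Probes without a gene assignment (None) are left out.
    """
    probesets = defaultdict(list)
    for probe_id, gene_id in probe_to_gene.items():
        if gene_id is not None:
            probesets[gene_id].append(probe_id)
    return dict(probesets)


# Hypothetical outcome of reannotating five probes
probe_to_gene = {
    "probe_001": "GENE_A",
    "probe_002": "GENE_A",
    "probe_003": "GENE_B",
    "probe_004": None,  # no longer maps to a known gene
    "probe_005": "GENE_A",
}

custom_probesets = reassemble_probesets(probe_to_gene)
for gene, probes in custom_probesets.items():
    print(gene, len(probes), probes)
```

The new probesets have different sizes (three probes and one probe here), which is the point made in the last bullet above.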


BrainArray CDF files¹

• Reannotation based on one of several genome databases

• IDs are created as follows: the ID of the gene the probeset refers to, followed by ‘_at’ to resemble an Affymetrix ID
– For example: ENSG00000139618_at

• When using these annotations in other tools, you have to remove the ‘_at’ suffix in order to get recognisable IDs
– Note that when using Entrez Gene, the ID is composed of a number (the Entrez Gene ID) followed by ‘_at’, and as such looks exactly like a normal Affymetrix ID, but IT IS NOT

¹ http://arrayanalysis.mbni.med.umich.edu/arrayanalysis.html
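Removing the ‘_at’ suffix is a one-liner in most languages; below is a small Python sketch. The function name and the numeric example ID are hypothetical. It should only be applied to IDs that come from a custom CDF, since stripping the suffix from a genuine Affymetrix probeset ID would not yield a gene ID.

```python
def strip_at_suffix(probeset_id):
    """Return the gene ID embedded in a custom CDF probeset ID."""
    suffix = "_at"
    if probeset_id.endswith(suffix):
        return probeset_id[: -len(suffix)]
    return probeset_id  # leave IDs without the suffix untouched


# One Ensembl-based ID from the slide and one hypothetical Entrez-style ID
custom_ids = ["ENSG00000139618_at", "12345_at"]
gene_ids = [strip_at_suffix(i) for i in custom_ids]
print(gene_ids)  # ['ENSG00000139618', '12345']
```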


Low intensity filtering

• Low intensity spots are more subject to noise

• Filtering can be done at a later stage

[Figure: difference between groups plotted against average intensity, before and after filtering]
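A low intensity filter can be as simple as a cut-off on the average log intensity per probeset. The sketch below assumes log2-scale values stored in a dictionary; the threshold of 6.0, the probeset names and the function name are arbitrary, hypothetical choices.

```python
from statistics import mean


def filter_low_intensity(expression, threshold):
    """Keep probesets whose average log2 intensity is at least `threshold`.

    `expression` maps a probeset ID to its log2 intensities across arrays.
    """
    return {
        probeset: values
        for probeset, values in expression.items()
        if mean(values) >= threshold
    }


# Hypothetical log2 intensities on three arrays
expression = {
    "probeset_1": [4.1, 4.8, 5.0],    # low, noisy signal
    "probeset_2": [9.2, 9.9, 10.1],
    "probeset_3": [6.0, 6.5, 7.0],
}

kept = filter_low_intensity(expression, threshold=6.0)
print(sorted(kept))  # ['probeset_2', 'probeset_3']
```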


AN AUTOMATED WORKFLOW


ArrayAnalysis.org

[Figure: ArrayAnalysis.org setup, showing the local machine, the web server and the calculation server]


http://www.arrayanalysis.org


Table and images of QC statistics

Affymetrix criteria:
– Sample prep controls: Lys < Phe < Thr < Dap; Lys present
– Beta-actin 3’/5’ ≤ 3
– GAPDH 3’/5’ ≤ 1.25
– Hybridisation controls: BioB < BioC < BioD < CreX; BioB present
– Percentage present within 10%
– Background within 20 units
– Scaling factors within 3-fold from the average

In the table, red and blue indicate whether criteria are fulfilled
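Several of the numeric criteria listed above are easy to check programmatically. The Python sketch below is not the workflow's own code: the field names, the example values and the reading of "within 10%" and "within 20 units" as the spread (maximum minus minimum) over all arrays are assumptions made for illustration.

```python
from statistics import mean


def check_affy_criteria(qc):
    """Evaluate some of the numeric QC criteria.

    `qc` maps an array name to a dict with the (hypothetical) keys
    'actin_3_5', 'gapdh_3_5', 'scaling_factor', 'pct_present', 'background'.
    Returns per-array flags and data-set level flags.
    """
    mean_sf = mean(a["scaling_factor"] for a in qc.values())
    per_array = {}
    for name, a in qc.items():
        per_array[name] = {
            "actin_ok": a["actin_3_5"] <= 3.0,
            "gapdh_ok": a["gapdh_3_5"] <= 1.25,
            # within 3-fold of the average scaling factor
            "scaling_ok": mean_sf / 3.0 <= a["scaling_factor"] <= mean_sf * 3.0,
        }
    pct = [a["pct_present"] for a in qc.values()]
    bg = [a["background"] for a in qc.values()]
    dataset = {
        "pct_present_ok": max(pct) - min(pct) <= 10.0,  # assumed: spread in percentage points
        "background_ok": max(bg) - min(bg) <= 20.0,     # assumed: spread in intensity units
    }
    return per_array, dataset


# Hypothetical QC statistics for three arrays
qc = {
    "A1": {"actin_3_5": 1.2, "gapdh_3_5": 1.0, "scaling_factor": 1.0,
           "pct_present": 45.0, "background": 50.0},
    "A2": {"actin_3_5": 3.5, "gapdh_3_5": 1.1, "scaling_factor": 1.4,
           "pct_present": 40.0, "background": 65.0},
    "A3": {"actin_3_5": 1.5, "gapdh_3_5": 1.3, "scaling_factor": 0.9,
           "pct_present": 43.0, "background": 85.0},
}

per_array, dataset = check_affy_criteria(qc)
for name, flags in per_array.items():
    print(name, flags)
print(dataset)
```

In this made-up example A2 fails the beta-actin criterion, A3 fails the GAPDH criterion, and the background values differ by more than 20 units across the set.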

The images are taken from other data sets than the one you will be using

Outcome of the workflow


RNA degradation plot and density plot


Boxplots


Virtual (spatial) images and MA plots


NUSE and RLE plot


Array correlation plot


Clustering and PCA plots


Perspectives

• Future relevance of Affymetrix chips?

• Data repositories / comparative research

• The workflow is also available for local installation in R

• We will soon include a module for statistical analysis (and processing of other data types)


Quality Control (QC) of Microarrays

• Nature, 2005


Project members

Lars Eijssen, Magali Jaillard, Michiel Adriaens, Philip de Groot, Chris Evelo

Thanks to:


THE AFTERNOON SESSION AND THE DATA SET


The afternoon session

• In the afternoon session, you will be performing QC and pre-processing yourself

• You will follow a stepwise guide available online at http://www.bigcat.unimaas.nl/wiki/index.php/PET_course_2011

• You will use an Affymetrix data set and make use of arrayanalysis.org*

* For normalisation you will use a GenePattern module, as the tool you will use for statistical analysis (finding which genes are different) requires this input


NuGOExpressionFileCreator


Short description of the data set (1)

• Microarray experiments have to be uploaded to online repositories such as Gene Expression Omnibus (GEO, NCBI) or ArrayExpress (AE, EBI) upon publication

• We will use a published¹ data set available from AE

¹ Toxicogenomics of subchronic hexachlorobenzene exposure in Brown Norway rats. Ezendam J, Staedtler F, Pennings J, et al. Environ Health Perspect 112(7):782-91


Short description of the data set (2)

• Hexachlorobenzene (HCB) is a persistent pollutant that is toxic to the liver, neurons, and the reproductive and immune systems

• In this study, Brown Norway rats were fed a diet supplemented with HCB doses of 0, 150, or 450 mg/kg

• Spleen, mesenteric lymph nodes (MLN), thymus, blood, liver, and kidney were analyzed using the Affymetrix rat RGU-34A GeneChip microarray
– 13-17 arrays per tissue, max 6 per concentration

• We will be primarily considering the liver data (17 arrays)