
Microarray Data Analysis of Illumina Data Using R/Bioconductor

Reddy Gali, Ph.D.
rgali@hms.harvard.edu
submit-c3-bioinformatics@rt.med.harvard.edu

http://catalyst.harvard.edu

Agenda

• Introduction to microarrays
• Workflow of a gene expression microarray experiment
• Microarray experimental design
• Public microarray databases
• Microarray preprocessing - Quality control and diagnostic analysis

Agenda

• Introduction to R/Bioconductor
• Installation of R and Bioconductor packages
• General data analysis and strategies
• Data analysis using lumi package
• Data analysis using limma package

Workflow of Gene Expression

Biological question

Experimental design

Tissue / sample preparation

Extraction of total RNA

Probe amplification & labeling

Microarray hybridization & processing

Image analysis

Data analysis: Expression measures - Normalization - Statistical filtering - Clustering - Pathway analysis

Biological Verification

Pitfalls of Microarray Experiment

• Gene expression changes detected by microarray analysis cannot be validated by other methods

- Inadequate design

- Data quality is low

- Statistical approach is not adequate

- Expression level of gene is below detection limit

- Change in gene expression is small

- Microarray detection probe is not specific or not sensitive

Questions usually asked

• What kind of technology or microarrays do I have to use?
• How many replicates do I need?
• What is a real replicate?
• Do I need statistical advice?
• Should I do technical replicates?
• Should I pool my samples?
• How do I analyze my dataset?
• What software should I use?

Design of Microarray Experiment

• Replicates
• Goal, resources, technology, quality, design and analysis
• Two-fold change: 3 replicates
• Smaller change: 5 replicates
• Technical replicates and biological replicates
• Sample pooling
• Amount of sample
• Replicates of pooled sample
• No way to find variance between samples
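The replicate guidance above reflects a general trade-off: the smaller the expression change, the more replicates are needed to detect it. The sketch below illustrates this with base R's `power.t.test`. It assumes a simple per-probe two-sample t-test on log2 values, a within-group standard deviation of 0.25 log2 units, 80% power and a 5% significance level with no multiple-testing correction. All of these numbers are illustrative assumptions, not values taken from the slides.

```r
# Illustrative assumptions (not from the slides): per-probe two-sample t-test
# on log2 expression values with a within-group SD of 0.25 log2 units.
assumed_sd <- 0.25

# Replicates per group needed to detect a two-fold change (log2 difference of 1)
n_two_fold <- power.t.test(delta = log2(2), sd = assumed_sd,
                           sig.level = 0.05, power = 0.8)$n

# Replicates per group needed to detect a smaller, 1.5-fold change
n_1_5_fold <- power.t.test(delta = log2(1.5), sd = assumed_sd,
                           sig.level = 0.05, power = 0.8)$n

# Round up to whole arrays per group
ceiling(c(two_fold = n_two_fold, fold_1.5 = n_1_5_fold))
```

Real experiments test thousands of probes at once, so the required numbers are usually higher than a single-test calculation suggests.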

Gene Expression Omnibus- GEO

Public Microarray Databases

• BodyMap - http://bodymap.ims.u-tokyo.ac.jp/
• SMD - http://genome-www5.stanford.edu/
• RIKEN - http://read.gsc.riken.go.jp/
• MGI - http://www.informatics.jax.org/
• GEO - http://www.ncbi.nlm.nih.gov/geo/
• CIBEX - http://cibex.nig.ac.jp/index.jsp
• ArrayExpress - http://www.ebi.ac.uk/microarray-as/ae/

Microarray Platforms

• Agilent Microarrays 60-mer format

• Codelink Bioarrays 30-mer format

• Affymetrix GeneChips 25-mer format

• Illumina Beadchips

• NimbleGen 60-mer format

Illumina Bead Array Technology

Silica Beads

Each bead is covered with hundreds of thousands of copies of a specific oligonucleotide

Some Facts

• Each bead carries copies of probes with, on average, 30 replicates of every bead type per array

• Around 10^5 copies of a particular DNA sequence of interest are covalently attached to each bead

• DNA sequences (oligonucleotides) attached to the beads are 75 base pairs in length, with 25 base pairs used for decoding and 50 base pairs used for target hybridization

• A pool of different bead types is created, beads of the same type having the same probe sequence attached

Box Plots of unnormalized data

Raw vs Normalized data

Raw Data Normalized Data

Histograms of unnormalized data

Why Normalize

• It adjusts the individual hybridization intensities to balance them appropriately so that meaningful biological comparisons can be made.

• Unequal quantities of starting RNA
• Differences in labeling or detection efficiencies between the fluorescent dyes used
• Systematic biases in the measured expression levels
• Sample preparation
• Variability in hybridization
• Spatial effects
• Scanner settings
• Experimenter bias
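One common way to balance intensities across arrays is quantile normalization, which is also the normalization method named for the lumi workflow later in this deck. The base R sketch below shows the idea on simulated data. The function name `quantile_normalize` and the simulated matrix are hypothetical, and this is a minimal illustration rather than the implementation used inside lumi.

```r
# Minimal quantile normalization sketch (hypothetical helper, base R only).
# x: numeric matrix with probes in rows and samples (arrays) in columns.
quantile_normalize <- function(x) {
  # Rank of each probe within its own sample; ties broken by order of appearance
  ranks <- apply(x, 2, rank, ties.method = "first")
  # Sort each sample, then average across samples to get a reference distribution
  sorted <- apply(x, 2, sort)
  target <- rowMeans(sorted)
  # Replace each value with the reference value that has the same rank
  normalized <- apply(ranks, 2, function(r) target[r])
  dimnames(normalized) <- dimnames(x)
  normalized
}

# Simulated intensities: 10 probes on 4 arrays with different overall scales
set.seed(1)
raw <- matrix(rexp(40), ncol = 4) * rep(c(1, 2, 0.5, 3), each = 10)
norm <- quantile_normalize(raw)

# After normalization every array has the same distribution of values
boxplot(norm, main = "Quantile-normalized (simulated)")
```

The design choice here is that every array is forced onto one shared distribution, which removes array-wide scale differences but assumes the true distributions are similar across samples.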

Free Software – Data analysis

• Bioconductor: an open source and open development software project to provide tools for the analysis and comprehension of genomic data.

• TMEV 4.0: an application that allows the viewing of processed microarray slide representations and the identification of genes and expression patterns of interest.

R / Bioconductor

• R and Bioconductor packages
• R (http://cran.r-project.org/) is a comprehensive statistical environment and programming language for professional data analysis and graphical display.
• Bioconductor (http://www.bioconductor.org/) is an open source and open development software project for the analysis of microarray, sequence and genome data.
• More than 300 Bioconductor packages.
• http://faculty.ucr.edu/~tgirke/Documents/R_BioCond/R_BioCondManual.html

R / Bioconductor - Installation

Preparing R for analysis
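A possible setup for the packages used in the following slides is sketched below. It assumes a current R installation, where Bioconductor packages are installed through the `BiocManager` package from CRAN; the original slides may have shown a different procedure, and whether `lumi` is available depends on the Bioconductor release in use.

```r
# Install BiocManager from CRAN if it is not already available
if (!requireNamespace("BiocManager", quietly = TRUE)) {
  install.packages("BiocManager")
}

# Install the two Bioconductor packages used in this workshop
BiocManager::install(c("lumi", "limma"))

# Load the packages into the current session
library(lumi)
library(limma)
```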

Analysis using lumi R package

- Loading data into R/Bioconductor

> lumi.Rdata <- lumiR('worshop_data.csv')

- Summary of the loaded data

> lumi.Rdata

- Quality control of loaded data

> summary(lumi.Rdata, 'QC')

> density(lumi.Rdata)

> boxplot(lumi.Rdata)

> MAplot(lumi.Rdata)

> plot(lumi.Rdata, what='sampleRelation')

> plot(lumi.Rdata, what='cv')

> plot(lumi.Rdata, what='outlier')
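To make the MA plot in the QC step above more concrete: for a pair of arrays, M is the log2 ratio of the two intensities and A is their average log2 intensity. The base R sketch below computes both on simulated data. The two intensity vectors are hypothetical and are not read from the workshop file.

```r
# Simulated raw intensities for two arrays (hypothetical data)
set.seed(1)
array1 <- rlnorm(1000, meanlog = 7, sdlog = 1)
# Array 2 is made systematically brighter to mimic a scale difference
array2 <- array1 * rlnorm(1000, meanlog = 0.2, sdlog = 0.1)

M <- log2(array2) - log2(array1)        # log ratio between the two arrays
A <- (log2(array1) + log2(array2)) / 2  # average log intensity

# Points centered away from M = 0 indicate a systematic difference between arrays
plot(A, M, pch = ".", main = "MA plot (simulated)")
abline(h = 0, col = "red")
```

In unnormalized data the cloud of points often sits above or below the M = 0 line, which is one of the signs that normalization is needed.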

Variance Stabilization

> lumi.Tdata <- lumiT(lumi.Rdata)

> lumi.VSdata <- plotVST(lumi.Tdata)

Normalization

> lumi.Ndata <- lumiN(lumi.Tdata)

Or Do all the default preprocessing in one step

> lumi.N.Q <- lumiExpresso(lumi.Rdata)

- Background correction: bgAdjust
- Variance stabilizing transform method: vst
- Normalization method: quantile
- Perform all the QC again

> summary(lumi.Ndata, 'QC')

Differential expression

• >design <- model.matrix(~ -1 + factor(c(1, 1, 1,1, 2, 2, 2,2)))

• >colnames(design) = c("control","affected")

• >fit <- lmFit(lumi.Ndata, design)

• >cont.matrix <- makeContrasts(signature = affected - control,levels=design)

• >fit2 <- contrasts.fit(fit, cont.matrix)

• >ebFit <- eBayes(fit2)

• >results <- topTable(ebFit, number=100, sort.by="B", resort.by="M")

• >print(results)

• >write.table(topTable(ebFit, coef=1, adjust="fdr", sort.by="B", number=25000), file="results.xls", row.names=FALSE, sep="\t")
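The design matrix and contrast used above can be inspected with base R alone. The sketch below builds the same two-group design and fits an ordinary linear model for a single simulated probe, showing that the `affected - control` contrast is the difference of the two group means. The expression values are hypothetical; limma's `lmFit` fits this kind of model for every probe, and `eBayes` then moderates the variances, which this sketch does not attempt to reproduce.

```r
# Same layout as in the slides: 4 control arrays followed by 4 affected arrays
groups <- factor(c(1, 1, 1, 1, 2, 2, 2, 2))
design <- model.matrix(~ -1 + groups)
colnames(design) <- c("control", "affected")
design  # one indicator column per group, no intercept

# Hypothetical log-scale expression values for one probe
set.seed(1)
y <- c(rnorm(4, mean = 8), rnorm(4, mean = 9))

# Ordinary least squares fit: each coefficient is the mean of its group
coefs <- lm.fit(design, y)$coefficients

# The contrast "affected - control" is the estimated log fold change
contrast <- unname(coefs["affected"] - coefs["control"])
contrast
```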


Reddy Gali, Ph.D.
rgali@hms.harvard.edu
Phone: 617 432 7471

Thank you
