Hyun Seok Kim, Ph.D.
Assistant Professor,
Severance Biomedical Research Institute, Yonsei University College of Medicine
Lecture 10. Microarray and RNA-seq
: Identification of Differentially Expressed Genes (DEG)
MES7594-01 Genome Informatics I (2015 Spring)
Gene expression
What is a microarray?
• Platforms
– Glass slides (cDNA array)
– Chips (Affymetrix)
– Glass beads (Illumina)
• Tens of thousands of oligonucleotide (or cDNA) probes are fixed on the surface of the platform.
• Microarrays can detect and quantify
– mRNA
– microRNA
– SNP
– LOH
– CNV
…
[Figure: microarray platforms (cDNA, Affymetrix, Illumina)]
Questions of Interest
o Determine steady-state gene expression levels of a sample at whole-transcriptome scale.
o Identify differentially expressed genes between samples.
o Identify differentially regulated pathways or protein complexes.
Affymetrix GeneChip for mRNA quantification
About Affy GeneChip platform
• Probes (25 mers) are synthesized on a chip using a photolithographic manufacturing process.
• At each x, y location of a GeneChip, millions of copies of a particular oligonucleotide are synthesized.
• Each gene is represented by a unique set of probe pairs: a perfect match (PM) probe and a mismatch (MM) probe. The MM probe helps increase the specificity of the PM signal.
Affymetrix GeneChip for mRNA quantification
About Affy workflow
• Isolate total RNA (need biological replicates)
• Sample amplification and labeling
• Sample injected into microarray
• Probe array hybridization, washing
• Probe array scanning and intensity quantification
• Intensity translated into nucleic acid abundance
Illumina BeadArray for mRNA quantification
• Each bead carries one type of oligo, with thousands of copies of that oligo per bead.
• Beads are deposited in wells on glass slides. The identity of each bead is then decoded in a separate step that uses proprietary technology.
Beadchip platform
Affy vs Illumina
Affymetrix GeneChip | Illumina BeadArray
25 mer | Longer oligo
Probe synthesized on chips | Bead technology
Multiple probes/probeset | Single probe
Multiple probes/transcript | Multiple probes/transcript
.dat, .cel, .cdf, .chp file types | Image file processed by Bead Studio
Normalization by MAS5, RMA, GC-RMA etc. | Normalization by average, quantile, RSN etc.
TXT output for downstream analysis | TXT output for downstream analysis
Annotations can be updated | Annotations can be updated
Adapted from Dr. Chandran@pitt
RNA sequencing
Samples of interest: Condition 1 (normal colon), Condition 2 (colon tumor)
• Isolate RNAs
• Generate cDNA, fragment, size select, add linkers
• Sequence ends: 100s of millions of paired reads, 10s of billions of bases of sequence
• Map to genome, transcriptome, and predicted exon junctions
• Downstream analysis
Adapted from Canadian Bioinformatics Workshop
Pros and Cons of RNA-seq (versus microarray)
Pros
• More powerful in detecting lowly expressed genes
• Detects splicing variants and fusion transcripts
• Measures allele-specific expression
• Discovers mutations
Cons
• Biased toward highly expressed genes (e.g. ribosomal, mitochondrial genes)
• More complicated analysis workflow (mapping to reference genome)
• More expensive: e.g. HiSeq 2500, 2 x 100 bp, 4 Gb -> $700/sample (vs. < $500/array)
RNA-seq: Experimental Design
• Single end read: one read sequenced from one end of each sample cDNA insert
• Paired end read: two reads (one from each end) sequenced from each sample cDNA insert. Better for mapping reads across repetitive regions and for detecting fusions and novel transcripts.
RNA-seq workflows
• Sequencing: obtain raw data (fastq format)
• Quality control (optional): FASTX
• Workflow 1: tophat2 (align) -> cufflinks (transcript assembly) -> cuffmerge (merge assemblies) -> cuffdiff (DEGs)
• Workflow 2: bowtie2 (align) -> HTSeq-count (count by gene) -> edgeR or DESeq (DEGs)
• Fusion detection (optional): “chimerascan” or “defuse”
Normalization method (old)
• RPKM: Reads per kilobase per million mapped reads (single end)
• RPKM = (10^9 * C) / (N * L)
  C = number of reads mapped to a gene
  N = total mapped reads in the experiment
  L = exon length in base pairs for a gene (with L in base pairs, the 10^9 factor converts to per kilobase and per million reads)
• RPKM measure is inconsistent among samples.
• FPKM: Fragments per kilobase per million mapped fragments (paired end).
• RPKM- and FPKM-based DEG discovery is affected by gene length (no longer recommended).
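As a rough illustration of the RPKM formula, the sketch below computes it for one gene. It assumes L is given in base pairs, which is the convention under which the 10^9 factor yields reads per kilobase per million mapped reads. The function name and the numbers are hypothetical and are not taken from the lecture data.

```python
def rpkm(read_count, total_mapped_reads, exon_length_bp):
    """Reads per kilobase of exon per million mapped reads.

    read_count         -- C, reads mapped to the gene
    total_mapped_reads -- N, total mapped reads in the experiment
    exon_length_bp     -- L, summed exon length of the gene in base pairs (assumed unit)
    """
    return (1e9 * read_count) / (total_mapped_reads * exon_length_bp)


# Hypothetical example: 500 reads on a 2,000 bp gene, 20 million mapped reads.
example_rpkm = rpkm(read_count=500, total_mapped_reads=20_000_000, exon_length_bp=2_000)
print(example_rpkm)
```

FPKM uses the same arithmetic with fragments (read pairs) counted in place of single reads.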
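[Figure: microarray vs. RNA-seq; Poisson vs. negative binomial. Adapted from EMBL]
Log-transformed microarray intensities are continuous values and are usually modeled as approximately normal. RNA-seq data are read counts. A Poisson model, in which the variance equals the mean, describes pure sampling noise, but RNA-seq counts across biological replicates do not follow it. In RNA-seq, genes with high mean counts (either because they are long or because they are highly expressed) tend to show more variance between samples than genes with low mean counts, and more variance than a Poisson model predicts. Thus the data fit a negative binomial distribution. edgeR and DESeq identify DEGs based on the negative binomial distribution.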
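The difference between the two count models can be sketched numerically. The example below assumes the common parameterization in which the negative binomial variance is mean + dispersion * mean^2. The mean and dispersion values are hypothetical, and the simulation only illustrates the shape of the variance, not how edgeR or DESeq estimate it.

```python
import numpy as np


def poisson_variance(mean):
    # Under a Poisson model the variance equals the mean.
    return mean


def negative_binomial_variance(mean, dispersion):
    # Assumed parameterization: extra variance grows with the square of the mean.
    return mean + dispersion * mean ** 2


rng = np.random.default_rng(0)
mean, dispersion = 100.0, 0.1  # hypothetical values for one gene

# NumPy parameterizes the negative binomial by (n, p); convert from (mean, dispersion).
n = 1.0 / dispersion
p = n / (n + mean)

poisson_counts = rng.poisson(mean, size=100_000)
nb_counts = rng.negative_binomial(n, p, size=100_000)

print("Poisson  mean/var:", poisson_counts.mean(), poisson_counts.var())
print("Neg.bin. mean/var:", nb_counts.mean(), nb_counts.var())
```

Both simulated genes have the same mean, but the negative binomial counts are far more variable, which is the overdispersion seen between biological replicates.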
A case study for practice session
• GEO accession number: GSE41588
GEO entry of GSE41588
Two platforms: Affy and HiSeq
Raw files: CEL, count files
Matrix file: processed data
Data Pre-processing
• Affy produces CEL format files as raw data.
• A CEL file contains the feature quantifications.
• In a CEL file the probes are still spread over the chip.
• Values still need to be summarized to probe set level; for example 90525_at = 250 units
• Probe set: a collection of probes designed to interrogate a given sequence
CEL file to TXT file
• In going from .CEL to .TXT file to generate signal values, the multiple probes within a probe set are “averaged” to produce a single value for that gene/transcript.
• The CEL files must first be normalized to account for technical variation between the arrays.
Robust Multi-array Average (RMA)
1. Background adjust PM values from .CEL files.
2. Take the base-2 log of each background-adjusted PM intensity.
3. Quantile normalize values from step 2 across all GeneChips.
4. Perform median polish separately for each probe set with rows indexed by GeneChip and columns indexed by probe.
5. For each row, find the average of the fitted values from step 4 to use as probe-set-specific expression measures for each GeneChip. -> .TXT files
Log Transformation
Reasons for working with log-transformed intensities
• Spreads features more evenly across the intensity range
• Makes variability more constant across the intensity range
• Makes the distribution of intensities and errors closer to normal
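A small numeric sketch of the first point: on the raw scale a two-fold increase and a two-fold decrease are asymmetric, while on the log2 scale they are symmetric around zero. The intensity values are hypothetical.

```python
import numpy as np

# Hypothetical raw intensities spanning a wide range.
raw_intensities = np.array([50.0, 100.0, 200.0, 400.0, 6400.0])
log_intensities = np.log2(raw_intensities)

# Two-fold up and two-fold down relative to a baseline of 100.
baseline = 100.0
raw_up, raw_down = 200.0 - baseline, 50.0 - baseline  # +100 vs. -50 on the raw scale
log_up = np.log2(200.0 / baseline)                    # log2 fold change, two-fold up
log_down = np.log2(50.0 / baseline)                   # log2 fold change, two-fold down

print("raw differences:", raw_up, raw_down)
print("log2 fold changes:", log_up, log_down)
print("log2 intensities:", log_intensities)
```

On the log2 scale each doubling adds one unit, so the highest value no longer dominates the range.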
How to normalize?
• Many methods
– Median scaling: median intensity for all chips should be the same
– Known genes: housekeeping or invariant genes
– Quantile normalization: RMA (Robust Multiarray Averaging), GC-RMA
– Normalization method may differ depending on array platform
– (Reading materials) GC-RMA: Wu et al. (2004), JASA, 99, 909-917. RMA: Irizarry et al. (2003), Nuc Acids Res, 31, e15.
[Figure: raw data vs. after normalization]
RMA: Quantile Normalization
1. After background adjustment, find the smallest log2(PM) on each chip.
2. Average the values from step 1.
3. Replace each value in step 1 with the average computed in step 2.
4. Repeat steps 1 through 3 for the second smallest values, third smallest values,..., largest values.
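The four steps above can be sketched with NumPy. This is a minimal illustration, assuming a matrix with probes in rows and chips in columns and no special handling of tied values. The function name and the toy data are hypothetical.

```python
import numpy as np


def quantile_normalize(values):
    """Quantile normalize a 2-D array (rows = probes, columns = chips)."""
    values = np.asarray(values, dtype=float)
    # Sort order of each chip (column), smallest value first.
    order = np.argsort(values, axis=0)
    sorted_values = np.take_along_axis(values, order, axis=0)
    # Steps 1-2: average the k-th smallest values across chips, for every rank k.
    rank_means = sorted_values.mean(axis=1)
    # Steps 3-4: replace each value with the average for its rank.
    normalized = np.empty_like(values)
    np.put_along_axis(normalized, order, rank_means[:, None], axis=0)
    return normalized


# Hypothetical log2(PM) values: 3 probes on 2 chips.
log_pm = np.array([[2.0, 8.0],
                   [5.0, 4.0],
                   [3.0, 14.0]])
normalized_pm = quantile_normalize(log_pm)
print(normalized_pm)
```

After normalization every chip contains the same set of values, so the chips share one intensity distribution while each probe keeps its rank within its chip.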
RMA: Median Polish
• For a given probe set with J probe pairs, let y_ij denote the background-adjusted, base-2-logged, and quantile-normalized value for GeneChip i and probe j.
• Assume y_ij = μ_i + α_j + e_ij, where α_1 + α_2 + ... + α_J = 0.
– μ_i: gene expression of the probe set on GeneChip i
– α_j: probe affinity effect for the jth probe in the probe set
– e_ij: residual for the jth probe on the ith GeneChip
• Perform Tukey’s Median Polish on the matrix of y_ij values with y_ij in the ith row and jth column.
• μ_i from median polish is the probe-set-specific measure of expression for GeneChip i after correcting for array effect and probe effect.
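A minimal sketch of median polish for one probe set follows, assuming rows are GeneChips and columns are probes. The summary takes the row average of the fitted values (data minus residuals), as in step 5 of the RMA outline. The function names, the iteration limit, and the toy matrix are hypothetical, and this is not the implementation used by any particular RMA package.

```python
import numpy as np


def median_polish_residuals(y, max_iter=10, tol=1e-6):
    """Tukey's median polish; returns the residuals e_ij (rows = chips, columns = probes)."""
    residuals = np.asarray(y, dtype=float).copy()
    for _ in range(max_iter):
        # Sweep out row (chip) medians, then column (probe) medians.
        row_medians = np.median(residuals, axis=1, keepdims=True)
        residuals -= row_medians
        col_medians = np.median(residuals, axis=0, keepdims=True)
        residuals -= col_medians
        # Stop once a sweep no longer changes anything noticeably.
        if np.abs(row_medians).max() < tol and np.abs(col_medians).max() < tol:
            break
    return residuals


def summarize_probe_set(y):
    """Per-chip expression: row average of the fitted values (y minus residuals)."""
    y = np.asarray(y, dtype=float)
    fitted = y - median_polish_residuals(y)
    return fitted.mean(axis=1)


# Hypothetical probe set: 3 chips x 3 probes, with one outlying probe value (30.0).
probe_set = np.array([[7.0, 8.0, 9.0],
                      [9.0, 10.0, 30.0],
                      [11.0, 12.0, 13.0]])
expression = summarize_probe_set(probe_set)
print("median polish summary:", expression)
print("plain row means:      ", probe_set.mean(axis=1))
```

In this toy case the outlier ends up entirely in the residuals, so the summary for the second chip is not pulled upward the way a plain row mean is.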
Differentially Expressed Genes (DEG)
• Criteria for DEG discovery
- Amount of difference: fold change, signal-to-noise ratio
- Statistical significance: p-value, false discovery rate (FDR), odds ratio
• Statistical Methods
- Parametric: t-test
- Non-parametric: Wilcoxon rank-sum test
- Significance Analysis of Microarrays (SAM; permutation based)
- Empirical Bayesian (Linear Models for Microarray Data, limma; Affy data)
- ANOVA (multiple factors; e.g. two different strains +/- drug)
• Multiplicity of testing: p-value adjustments
- Methods: FDR, Bonferroni, etc.
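The two p-value adjustments named above can be sketched as follows. "FDR" is taken here to mean the Benjamini-Hochberg procedure, which is an assumption because the slide does not name a specific FDR method. The function names and p-values are hypothetical.

```python
import numpy as np


def bonferroni(p_values):
    """Bonferroni adjustment: multiply each p-value by the number of tests, cap at 1."""
    p = np.asarray(p_values, dtype=float)
    return np.minimum(p * p.size, 1.0)


def benjamini_hochberg(p_values):
    """Benjamini-Hochberg adjusted p-values (one common way to control the FDR)."""
    p = np.asarray(p_values, dtype=float)
    m = p.size
    order = np.argsort(p)
    # Scale the k-th smallest p-value by m / k.
    scaled = p[order] * m / np.arange(1, m + 1)
    # Enforce monotonicity, working from the largest p-value downward.
    adjusted_sorted = np.minimum.accumulate(scaled[::-1])[::-1]
    adjusted = np.empty(m)
    adjusted[order] = np.minimum(adjusted_sorted, 1.0)
    return adjusted


# Hypothetical raw p-values for four genes.
raw_p = [0.01, 0.04, 0.03, 0.005]
bonferroni_p = bonferroni(raw_p)
fdr_p = benjamini_hochberg(raw_p)
print("Bonferroni:", bonferroni_p)
print("BH FDR:    ", fdr_p)
```

Bonferroni is the more conservative of the two, which matters when thousands of genes are tested at once.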
Limma & Empirical Bayesian
• Limma is an R package to find DEGs
• It uses linear models
- Fitted to normalized intensities for each gene given a series of arrays
- Design matrix: indicates which RNA samples have been applied to each array
- Contrast matrix: specifies which comparisons you would like to make between the RNA samples
- Can be used to compare two or more groups
• Assumption: normal distribution
• Uses empirical Bayesian analysis to improve power in small sample sizes
- Borrowing information across genes
• Output: p-values (adjusted for multiple testing)
Moderated/Bayesian t-test
• Ordinary t-test is testing for differences in means between two groups given the variability within each group
• Moderated/Bayesian t-test: rather than estimating within-group variability over and over again for each gene, pool the information from many similar genes.
• Advantage: reduces the occurrence of accidentally large t-statistics caused by accidentally small within-group variance estimates.
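The effect of pooling variance information can be sketched with a simplified moderated t-statistic. Each gene's pooled variance is shrunk toward a prior variance, weighted by a prior degrees of freedom. In limma these two prior quantities are estimated from all genes by empirical Bayes (Smyth 2004); here they are simply passed in as hypothetical numbers, so this is an illustration of the shrinkage idea rather than limma's procedure.

```python
import numpy as np


def ordinary_and_moderated_t(group_a, group_b, prior_df, prior_var):
    """Two-group t-statistics per gene (rows = genes, columns = samples, log2 scale).

    prior_df and prior_var are assumed inputs; limma estimates them from the data.
    """
    a = np.asarray(group_a, dtype=float)
    b = np.asarray(group_b, dtype=float)
    n_a, n_b = a.shape[1], b.shape[1]
    df = n_a + n_b - 2
    diff = a.mean(axis=1) - b.mean(axis=1)
    # Ordinary pooled within-group variance, estimated gene by gene.
    pooled_var = ((n_a - 1) * a.var(axis=1, ddof=1) + (n_b - 1) * b.var(axis=1, ddof=1)) / df
    scale = np.sqrt(1.0 / n_a + 1.0 / n_b)
    ordinary_t = diff / (np.sqrt(pooled_var) * scale)
    # Shrink each gene's variance toward the prior variance.
    shrunk_var = (prior_df * prior_var + df * pooled_var) / (prior_df + df)
    moderated_t = diff / (np.sqrt(shrunk_var) * scale)
    return ordinary_t, moderated_t


# Hypothetical data: both genes differ by 0.5 log2 units between groups,
# but gene 0 happens to have a very small within-group variance.
treated = np.array([[8.00, 8.01, 7.99],
                    [8.00, 8.50, 7.50]])
control = np.array([[7.50, 7.51, 7.49],
                    [7.50, 8.00, 7.00]])
t_ordinary, t_moderated = ordinary_and_moderated_t(treated, control, prior_df=4.0, prior_var=0.1)
print("ordinary t: ", t_ordinary)
print("moderated t:", t_moderated)
```

The gene with the accidentally small variance gets a very large ordinary t-statistic, and shrinkage pulls it back toward a value comparable with the other gene.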
Further reading
• RNA-seq normalization: Dillies M-A et al. Briefings in Bioinformatics, 2012, 14, 671-683.
• Limma & eBayes: Smyth GK. Statistical Applications in Genetics and Molecular Biology, 2004, 3 (1), article 3.
Course homepage: http://wiki.tgilab.org/MES7594
Notice