carlo colantuoni carlo@illuminatobiotech
![Page 1: Carlo Colantuoni carlo@illuminatobiotech](https://reader035.vdocuments.us/reader035/viewer/2022081419/56815a22550346895dc765a1/html5/thumbnails/1.jpg)
Summer Inst. Of Epidemiology and Biostatistics, 2010:
Gene Expression Data Analysis
1:30pm – 5:00pm in Room W2015
Carlo Colantuoni
http://www.illuminatobiotech.com/GEA2010/GEA2010.htm
![Page 2: Carlo Colantuoni carlo@illuminatobiotech](https://reader035.vdocuments.us/reader035/viewer/2022081419/56815a22550346895dc765a1/html5/thumbnails/2.jpg)
Class Outline
• Basic Biology & Gene Expression Analysis Technology
• Data Preprocessing, Normalization, & QC
• Measures of Differential Expression
• Multiple Comparison Problem
• Clustering and Classification
• The R Statistical Language and Bioconductor
• GRADES – independent project with Affymetrix data.
http://www.illuminatobiotech.com/GEA2010/GEA2010.htm
![Page 3: Carlo Colantuoni carlo@illuminatobiotech](https://reader035.vdocuments.us/reader035/viewer/2022081419/56815a22550346895dc765a1/html5/thumbnails/3.jpg)
Class Outline - Detailed
• Basic Biology & Gene Expression Analysis Technology
– The Biology of Our Genome & Transcriptome
– Genome and Transcriptome Structure & Databases
– Gene Expression & Microarray Technology
• Data Preprocessing, Normalization, & QC
– Intensity Comparison & Ratio vs. Intensity Plots (log transformation)
– Background correction (PM-MM, RMA, GCRMA)
– Global Mean Normalization
– Loess Normalization
– Quantile Normalization (RMA & GCRMA)
– Quality Control: Batches, plates, pins, hybs, washes, and other artifacts
– Quality Control: PCA and MDS for dimension reduction
– SVA: Surrogate Variable Analysis
• Measures of Differential Expression
– Basic Statistical Concepts
– T-tests and Associated Problems
– Significance analysis in microarrays (SAM) [ & Empirical Bayes]
– Complex ANOVA’s (limma package in R)
• Multiple Comparison Problem
– Bonferroni
– False Discovery Rate Analysis (FDR)
• Differential Expression of Functional Gene Groups
– Functional Annotation of the Genome
– Hypergeometric test?, χ², KS, pDens, Wilcoxon Rank Sum
– Gene Set Enrichment Analysis (GSEA)
– Parametric Analysis of Gene Set Enrichment (PAGE)
– geneSetTest
– Notes on Experimental Design
• Clustering and Classification
– Hierarchical clustering
– K-means
– Classification
   • LDA (PAM), kNN, Random Forests
   • Cross-Validation
• Additional Topics
   • eQTL (expression + SNPs)
   • Next-Gen Sequencing data: RNAseq, ChIPseq
   • Epigenetics?
– The R Statistical Language: http://www.r-project.org/
– Bioconductor: http://www.bioconductor.org/docs/install/
– Affymetrix data processing example
![Page 4: Carlo Colantuoni carlo@illuminatobiotech](https://reader035.vdocuments.us/reader035/viewer/2022081419/56815a22550346895dc765a1/html5/thumbnails/4.jpg)
DAY #3:
Measures of Differential Expression:
• Review of basic statistical concepts
• T-tests and associated problems
• Significance analysis in microarrays (SAM) (Empirical Bayes)
• Complex ANOVA’s (“limma” package in R)
Multiple Comparison Problem:
• Bonferroni
• FDR
Differential Expression of Functional Gene Groups
Notes on Experimental Design
![Page 5: Carlo Colantuoni carlo@illuminatobiotech](https://reader035.vdocuments.us/reader035/viewer/2022081419/56815a22550346895dc765a1/html5/thumbnails/5.jpg)
Slides from Rob Scharpf
![Page 6: Carlo Colantuoni carlo@illuminatobiotech](https://reader035.vdocuments.us/reader035/viewer/2022081419/56815a22550346895dc765a1/html5/thumbnails/6.jpg)
Fold-Change? T-Statistics?
Some genes are more variable than others
![Page 7: Carlo Colantuoni carlo@illuminatobiotech](https://reader035.vdocuments.us/reader035/viewer/2022081419/56815a22550346895dc765a1/html5/thumbnails/7.jpg)
Slides from Rob Scharpf
![Page 8: Carlo Colantuoni carlo@illuminatobiotech](https://reader035.vdocuments.us/reader035/viewer/2022081419/56815a22550346895dc765a1/html5/thumbnails/8.jpg)
Slides from Rob Scharpf
![Page 9: Carlo Colantuoni carlo@illuminatobiotech](https://reader035.vdocuments.us/reader035/viewer/2022081419/56815a22550346895dc765a1/html5/thumbnails/9.jpg)
Slides from Rob Scharpf
![Page 10: Carlo Colantuoni carlo@illuminatobiotech](https://reader035.vdocuments.us/reader035/viewer/2022081419/56815a22550346895dc765a1/html5/thumbnails/10.jpg)
Slides from Rob Scharpf
![Page 11: Carlo Colantuoni carlo@illuminatobiotech](https://reader035.vdocuments.us/reader035/viewer/2022081419/56815a22550346895dc765a1/html5/thumbnails/11.jpg)
Slides from Rob Scharpf
![Page 12: Carlo Colantuoni carlo@illuminatobiotech](https://reader035.vdocuments.us/reader035/viewer/2022081419/56815a22550346895dc765a1/html5/thumbnails/12.jpg)
Slides from Rob Scharpf
![Page 13: Carlo Colantuoni carlo@illuminatobiotech](https://reader035.vdocuments.us/reader035/viewer/2022081419/56815a22550346895dc765a1/html5/thumbnails/13.jpg)
Slides from Rob Scharpf
X1-X2 is normally distributed if X1 and X2 are normally distributed – is this the case in microarray data?
![Page 14: Carlo Colantuoni carlo@illuminatobiotech](https://reader035.vdocuments.us/reader035/viewer/2022081419/56815a22550346895dc765a1/html5/thumbnails/14.jpg)
Problem 1: T-statistic not t-distributed.
Implication: p-values/inference incorrect
![Page 15: Carlo Colantuoni carlo@illuminatobiotech](https://reader035.vdocuments.us/reader035/viewer/2022081419/56815a22550346895dc765a1/html5/thumbnails/15.jpg)
P-values by permutation
• The assumptions used to derive the statistics often do not hold closely enough to yield useful p-values (e.g. when t-statistics are not t-distributed).
• An alternative is to use permutations.
![Page 16: Carlo Colantuoni carlo@illuminatobiotech](https://reader035.vdocuments.us/reader035/viewer/2022081419/56815a22550346895dc765a1/html5/thumbnails/16.jpg)
p-values by permutations
We focus on one gene only. For the bth iteration, b = 1, …, B:
1. Permute the n data points for the gene (x). The first n1 are referred to as “treatments”, the second n2 as “controls”.
2. For this gene, calculate the corresponding two-sample t-statistic, tb.
After all the B permutations are done:
3. p = # { b: |tb| ≥ |tobserved| } / B
4. This does not yet address the issue of multiple tests!
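The permutation steps above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names, the choice of a Welch (unequal-variance) t-statistic, the number of permutations, and the toy expression values are all made up for the example and are not taken from the course materials.

```python
import random
from math import sqrt
from statistics import mean, variance

def two_sample_t(a, b):
    """Two-sample t-statistic (Welch form, an assumption of this sketch)."""
    se = sqrt(variance(a) / len(a) + variance(b) / len(b))
    return (mean(a) - mean(b)) / se

def permutation_p_value(treatments, controls, n_perm=2000, seed=0):
    """Fraction of permutations whose |t| is at least the observed |t|."""
    rng = random.Random(seed)
    t_observed = two_sample_t(treatments, controls)
    pooled = list(treatments) + list(controls)
    n1 = len(treatments)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(pooled)                      # step 1: permute the n data points
        t_b = two_sample_t(pooled[:n1], pooled[n1:])  # step 2: recompute t
        if abs(t_b) >= abs(t_observed):
            hits += 1
    return hits / n_perm                         # step 3

# Hypothetical log-expression values for one gene
treatments = [8.1, 8.4, 7.9, 8.6, 8.3, 8.2]
controls = [7.0, 7.3, 6.8, 7.1, 7.4, 6.9]
p = permutation_p_value(treatments, controls)
```

With only a handful of arrays per group the number of distinct permutations is small, which limits how small such a p-value can get.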
![Page 17: Carlo Colantuoni carlo@illuminatobiotech](https://reader035.vdocuments.us/reader035/viewer/2022081419/56815a22550346895dc765a1/html5/thumbnails/17.jpg)
The volcano plot shows, for a particular test, negative log p-value against the effect size (M).
Another problem with t-tests
![Page 18: Carlo Colantuoni carlo@illuminatobiotech](https://reader035.vdocuments.us/reader035/viewer/2022081419/56815a22550346895dc765a1/html5/thumbnails/18.jpg)
Remember this?
![Page 19: Carlo Colantuoni carlo@illuminatobiotech](https://reader035.vdocuments.us/reader035/viewer/2022081419/56815a22550346895dc765a1/html5/thumbnails/19.jpg)
Problem 2: t-statistic bigger for genes with smaller standard error estimates.
Implication: Ranking might not be optimal
![Page 20: Carlo Colantuoni carlo@illuminatobiotech](https://reader035.vdocuments.us/reader035/viewer/2022081419/56815a22550346895dc765a1/html5/thumbnails/20.jpg)
Problem 2
• With small sample sizes (low N), SD estimates are unstable
• Solutions:
– Significance Analysis in Microarrays (SAM)
– Empirical Bayes methods and Stein estimators
![Page 21: Carlo Colantuoni carlo@illuminatobiotech](https://reader035.vdocuments.us/reader035/viewer/2022081419/56815a22550346895dc765a1/html5/thumbnails/21.jpg)
Significance analysis in microarrays (SAM)
• A clever adaptation of the t-ratio to borrow information across genes
• Implemented in Bioconductor in the siggenes package
Significance analysis of microarrays applied to the ionizing radiation response, Tusher et al., PNAS 2001
![Page 22: Carlo Colantuoni carlo@illuminatobiotech](https://reader035.vdocuments.us/reader035/viewer/2022081419/56815a22550346895dc765a1/html5/thumbnails/22.jpg)
SAM d-statistic
• For gene i:
$d_i = \dfrac{\bar{y}_i - \bar{x}_i}{s_i + s_0}$
$\bar{y}_i$ = mean of sample 1
$\bar{x}_i$ = mean of sample 2
$s_i$ = Standard deviation of repeated measurements for gene i
$s_0$ = Exchangeability factor estimated using all genes
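A small Python sketch may help show what the constant s0 in the denominator does. It is not the siggenes implementation: the function name is hypothetical, s_i is taken here to be the pooled standard error of the difference in means (an assumption), and s0 is simply passed in rather than estimated from all genes.

```python
from math import sqrt
from statistics import mean, variance

def sam_d(sample1, sample2, s0):
    """d = (mean1 - mean2) / (s_i + s0); s0 = 0 gives an ordinary pooled t-statistic."""
    n1, n2 = len(sample1), len(sample2)
    pooled_var = ((n1 - 1) * variance(sample1) + (n2 - 1) * variance(sample2)) / (n1 + n2 - 2)
    s_i = sqrt(pooled_var * (1 / n1 + 1 / n2))  # assumed form of the gene-specific scatter
    return (mean(sample1) - mean(sample2)) / (s_i + s0)

# Hypothetical gene A: tiny difference, tiny variance
gene_a = ([5.00, 5.01, 5.02], [4.90, 4.91, 4.92])
# Hypothetical gene B: large difference, larger variance
gene_b = ([8.0, 9.0, 10.0], [5.0, 6.0, 7.0])

t_a, t_b = sam_d(*gene_a, s0=0.0), sam_d(*gene_b, s0=0.0)
d_a, d_b = sam_d(*gene_a, s0=0.2), sam_d(*gene_b, s0=0.2)
```

With s0 = 0 gene A outranks gene B only because its variance estimate is very small; with a positive s0 the ranking reverses.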
![Page 23: Carlo Colantuoni carlo@illuminatobiotech](https://reader035.vdocuments.us/reader035/viewer/2022081419/56815a22550346895dc765a1/html5/thumbnails/23.jpg)
Minimize the average CV across all genes.
![Page 24: Carlo Colantuoni carlo@illuminatobiotech](https://reader035.vdocuments.us/reader035/viewer/2022081419/56815a22550346895dc765a1/html5/thumbnails/24.jpg)
Scatter plots of relative difference (d) vs standard deviation (s) of repeated expression measurements
Random fluctuations in the data, measured by balanced permutations (for cell line 1 and 2)
Relative difference for a permutation of the data that was balanced between cell lines 1 and 2.
A fix for this problem:
![Page 25: Carlo Colantuoni carlo@illuminatobiotech](https://reader035.vdocuments.us/reader035/viewer/2022081419/56815a22550346895dc765a1/html5/thumbnails/25.jpg)
• An advantage of having tens of thousands of genes is that we can try to learn about typical standard deviations by looking at all genes
• Empirical Bayes gives us a formal way of doing this
• “Shrinkage” of variance estimates toward a “prior”: moderated t-statistics – eliminates extreme stats due to small variances.
• Implemented in the limma package in R. In addition, limma provides methods for more complex experimental designs beyond simple, two-sample designs.
eBayes: Borrowing Strength
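The shrinkage idea can be illustrated with the posterior variance used for moderated t-statistics, a weighted average of the gene's own variance and a prior variance. In the sketch below the prior variance and prior degrees of freedom are hypothetical inputs; limma estimates them from all genes, which this example does not attempt.

```python
from math import sqrt

def moderated_variance(s2, df, s2_prior, df_prior):
    """Shrink a gene's sample variance s2 toward a prior variance."""
    return (df_prior * s2_prior + df * s2) / (df_prior + df)

def moderated_t(diff, s2, df, n1, n2, s2_prior, df_prior):
    """t-statistic computed with the shrunken variance."""
    s2_post = moderated_variance(s2, df, s2_prior, df_prior)
    return diff / sqrt(s2_post * (1 / n1 + 1 / n2))

# Hypothetical gene with a very small variance estimate from 3 + 3 arrays
diff, s2, df, n1, n2 = 0.1, 0.0001, 4, 3, 3
ordinary_t = diff / sqrt(s2 * (1 / n1 + 1 / n2))
shrunk_t = moderated_t(diff, s2, df, n1, n2, s2_prior=0.5, df_prior=4)
```

The ordinary t-statistic is extreme only because s2 happens to be tiny; after shrinkage the statistic is far more modest.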
![Page 26: Carlo Colantuoni carlo@illuminatobiotech](https://reader035.vdocuments.us/reader035/viewer/2022081419/56815a22550346895dc765a1/html5/thumbnails/26.jpg)
The Multiple Comparison Problem
(some slides courtesy of John Storey)
![Page 27: Carlo Colantuoni carlo@illuminatobiotech](https://reader035.vdocuments.us/reader035/viewer/2022081419/56815a22550346895dc765a1/html5/thumbnails/27.jpg)
Hypothesis Testing
• Test for each gene:
Null Hypothesis: no differential expression.
• Two types of errors can be committed
– Type I error or false positive (say that a gene is differentially expressed when it is not, i.e., reject a true null hypothesis).
– Type II error or false negative (fail to identify a truly differentially expressed gene, i.e., fail to reject a false null hypothesis).
![Page 28: Carlo Colantuoni carlo@illuminatobiotech](https://reader035.vdocuments.us/reader035/viewer/2022081419/56815a22550346895dc765a1/html5/thumbnails/28.jpg)
Hypothesis testing
• Once you have a given score for each gene, how do you decide on a cut-off?
• p-values are most common.
• How do we decide on a cut-off when we are looking at many 1000’s of “tests”?
• Are 0.05 and 0.01 appropriate? How many false positives would we get if we applied these cut-offs to long lists of genes?
![Page 29: Carlo Colantuoni carlo@illuminatobiotech](https://reader035.vdocuments.us/reader035/viewer/2022081419/56815a22550346895dc765a1/html5/thumbnails/29.jpg)
Multiple Comparison Problem
• Even if we have good approximations of our p-values, we still face the multiple comparison problem.
• When performing many independent tests, p-values no longer have the same interpretation.
![Page 30: Carlo Colantuoni carlo@illuminatobiotech](https://reader035.vdocuments.us/reader035/viewer/2022081419/56815a22550346895dc765a1/html5/thumbnails/30.jpg)
Bonferroni Procedure
α = 0.05
# Tests = 1000
α = 0.05 / 1000 = 0.00005
or
p = p * 1000
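The two equivalent forms of the correction (divide the threshold by the number of tests, or multiply each p-value by it) can be written as a short Python sketch. The function name and the example p-values are hypothetical.

```python
def bonferroni(p_values, alpha=0.05):
    """Return Bonferroni-adjusted p-values and significance calls."""
    m = len(p_values)
    adjusted = [min(1.0, p * m) for p in p_values]     # p = p * (# tests), capped at 1
    significant = [p <= alpha / m for p in p_values]   # compare to alpha / (# tests)
    return adjusted, significant

# 1000 hypothetical tests: three small p-values and 997 unremarkable ones
p_values = [0.00001, 0.0004, 0.03] + [0.5] * 997
adjusted, significant = bonferroni(p_values)
```

Only the first p-value falls below 0.05 / 1000 = 0.00005; the second, which would look convincing in a single test, does not survive the correction.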
![Page 31: Carlo Colantuoni carlo@illuminatobiotech](https://reader035.vdocuments.us/reader035/viewer/2022081419/56815a22550346895dc765a1/html5/thumbnails/31.jpg)
Bonferroni Procedure
Too conservative.
How else can we interpret many 1000’s of observed statistics?
Instead of evaluating each statistic individually, can we assess a list of statistics: FDR (Benjamini & Hochberg 1995)
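As a point of comparison with Bonferroni, here is a minimal Python sketch of the Benjamini-Hochberg step-up procedure. The function name and the ten example p-values are made up for illustration.

```python
def benjamini_hochberg(p_values, fdr=0.05):
    """Return a list of booleans: True where the hypothesis is rejected at the given FDR."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    k = 0
    for rank, i in enumerate(order, start=1):
        # keep the largest rank whose p-value is under its own threshold
        if p_values[i] <= rank * fdr / m:
            k = rank
    rejected = [False] * m
    for i in order[:k]:
        rejected[i] = True
    return rejected

p_values = [0.001, 0.008, 0.039, 0.041, 0.042, 0.06, 0.074, 0.205, 0.212, 0.216]
rejected = benjamini_hochberg(p_values)
n_bonferroni = sum(p <= 0.05 / len(p_values) for p in p_values)
```

On this list the step-up procedure rejects the two smallest p-values, while the Bonferroni threshold of 0.005 rejects only one.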
![Page 32: Carlo Colantuoni carlo@illuminatobiotech](https://reader035.vdocuments.us/reader035/viewer/2022081419/56815a22550346895dc765a1/html5/thumbnails/32.jpg)
FDR
• Given a cut-off statistic, FDR gives us an estimate of the proportion of hits in our list of differentially expressed genes that are false.
Null = Equivalent Expression; Alternative = Differential Expression
![Page 33: Carlo Colantuoni carlo@illuminatobiotech](https://reader035.vdocuments.us/reader035/viewer/2022081419/56815a22550346895dc765a1/html5/thumbnails/33.jpg)
False Discovery Rate
• The “false discovery rate” measures the proportion of false positives among all genes called significant:
$\mathrm{FDR} = \dfrac{\#\,\text{false positives}}{\#\,\text{called significant}}$
• This is usually appropriate because one wants to find as many truly differentially expressed genes as possible with relatively few false positives
• The false discovery rate gives an estimate of the rate at which further biological verification will result in dead-ends
![Page 34: Carlo Colantuoni carlo@illuminatobiotech](https://reader035.vdocuments.us/reader035/viewer/2022081419/56815a22550346895dc765a1/html5/thumbnails/34.jpg)
Distribution of Statistics
[Figure: Distribution of Observed (black) and Permuted (red+blue) Correlations (r), N=90. x-axis: Correlation (r); y-axis: Density. Curves: Permuted, Observed.]
![Page 35: Carlo Colantuoni carlo@illuminatobiotech](https://reader035.vdocuments.us/reader035/viewer/2022081419/56815a22550346895dc765a1/html5/thumbnails/35.jpg)
Distribution of Statistics
[Figure: Distribution of Observed (black) and Permuted (red+blue) Correlations (r), left tail from r = -0.45 to r = -0.30. x-axis: Correlation (r); y-axis: Density. Curves: Permuted, Observed.]
FDR = False Pos. / Total Pos. = Permuted / Observed
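The ratio of permuted to observed counts beyond a cut-off can be sketched as follows. This is an illustration only: the function name, the two-sided cut-off, and the tiny lists of correlations are assumptions, and a real analysis would use statistics for thousands of genes and many permutations.

```python
from statistics import mean

def permutation_fdr(observed, permuted_sets, cutoff):
    """Estimate FDR as (average # permuted statistics beyond cutoff) / (# observed beyond cutoff)."""
    n_called = sum(1 for s in observed if abs(s) >= cutoff)
    if n_called == 0:
        return 0.0
    # each permuted set holds one statistic per gene from one permutation of the labels
    expected_false = mean(sum(1 for s in perm if abs(s) >= cutoff) for perm in permuted_sets)
    return min(1.0, expected_false / n_called)

observed = [0.45, -0.41, 0.38, 0.10, -0.05, 0.20, 0.02, -0.36]
permuted_sets = [
    [0.36, 0.10, -0.20, 0.05, -0.12, 0.30, 0.01, -0.08],
    [0.22, -0.15, 0.18, 0.03, -0.31, 0.09, 0.11, -0.02],
]
fdr_estimate = permutation_fdr(observed, permuted_sets, cutoff=0.35)
```

Here four observed statistics pass the cut-off and the permutations yield half a statistic beyond it on average, so the estimate is 0.5 / 4.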
![Page 36: Carlo Colantuoni carlo@illuminatobiotech](https://reader035.vdocuments.us/reader035/viewer/2022081419/56815a22550346895dc765a1/html5/thumbnails/36.jpg)
Distribution of p-values
[Figure: Distribution of p-values from Observed (Black) and Permuted Data, N=90. x-axis: p-value; y-axis: Density. Curves: Permuted, Observed.]
![Page 37: Carlo Colantuoni carlo@illuminatobiotech](https://reader035.vdocuments.us/reader035/viewer/2022081419/56815a22550346895dc765a1/html5/thumbnails/37.jpg)
SAM produces a modified T-statistic (d), and has an approach to the multiple comparison problem.
![Page 38: Carlo Colantuoni carlo@illuminatobiotech](https://reader035.vdocuments.us/reader035/viewer/2022081419/56815a22550346895dc765a1/html5/thumbnails/38.jpg)
Scatter plots of relative difference (d) vs standard deviation (s) of repeated expression measurements
Random fluctuations in the data, measured by balanced permutations (for cell line 1 and 2)
Relative difference for a permutation of the data that was balanced between cell lines 1 and 2.
A fix for this problem:
![Page 39: Carlo Colantuoni carlo@illuminatobiotech](https://reader035.vdocuments.us/reader035/viewer/2022081419/56815a22550346895dc765a1/html5/thumbnails/39.jpg)
Selected genes: Beyond expected distribution
![Page 40: Carlo Colantuoni carlo@illuminatobiotech](https://reader035.vdocuments.us/reader035/viewer/2022081419/56815a22550346895dc765a1/html5/thumbnails/40.jpg)
FDR = False Positives/Total Positive Calls
This FDR analysis requires enough samples in each condition to estimate a statistic for each gene: observed statistic distribution.
And enough samples in each condition to permute many times and recalculate this statistic: null statistic distribution.
What if we don’t have this?
![Page 41: Carlo Colantuoni carlo@illuminatobiotech](https://reader035.vdocuments.us/reader035/viewer/2022081419/56815a22550346895dc765a1/html5/thumbnails/41.jpg)
FDR = 0.05
Beyond ±0.9
![Page 42: Carlo Colantuoni carlo@illuminatobiotech](https://reader035.vdocuments.us/reader035/viewer/2022081419/56815a22550346895dc765a1/html5/thumbnails/42.jpg)
FDR = 0.05
Beyond ±0.9
![Page 43: Carlo Colantuoni carlo@illuminatobiotech](https://reader035.vdocuments.us/reader035/viewer/2022081419/56815a22550346895dc765a1/html5/thumbnails/43.jpg)
![Page 44: Carlo Colantuoni carlo@illuminatobiotech](https://reader035.vdocuments.us/reader035/viewer/2022081419/56815a22550346895dc765a1/html5/thumbnails/44.jpg)
![Page 45: Carlo Colantuoni carlo@illuminatobiotech](https://reader035.vdocuments.us/reader035/viewer/2022081419/56815a22550346895dc765a1/html5/thumbnails/45.jpg)
False Positive Rate versus False Discovery Rate
• False positive rate is the rate at which truly null genes are called significant
$\mathrm{FPR} \approx \dfrac{\#\,\text{false positives}}{\#\,\text{truly null}}$
• False discovery rate is the rate at which significant genes are truly null
$\mathrm{FDR} \approx \dfrac{\#\,\text{false positives}}{\#\,\text{called significant}}$
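A worked example with made-up counts shows how different the two rates can be on the same gene list. The numbers below (10,000 genes, 9,500 truly null, a p < 0.05 cut-off, 400 true positives detected) are hypothetical.

```python
def fpr_and_fdr(n_truly_null, n_false_positives, n_true_positives):
    """FPR = false positives / truly null; FDR = false positives / called significant."""
    fpr = n_false_positives / n_truly_null
    fdr = n_false_positives / (n_false_positives + n_true_positives)
    return fpr, fdr

n_genes = 10_000
n_truly_null = 9_500
n_false_positives = round(0.05 * n_truly_null)   # about 5% of null genes pass p < 0.05
n_true_positives = 400                           # assumed: 400 of the 500 real changes detected

fpr, fdr = fpr_and_fdr(n_truly_null, n_false_positives, n_true_positives)
```

The false positive rate is 5% by construction, yet more than half of the genes called significant in this scenario are false discoveries.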
![Page 46: Carlo Colantuoni carlo@illuminatobiotech](https://reader035.vdocuments.us/reader035/viewer/2022081419/56815a22550346895dc765a1/html5/thumbnails/46.jpg)
False Positive Rate and P-values
• The p-value is a measure of significance in terms of the false positive rate (aka Type I error rate)
• P-value is defined to be the minimum false positive rate at which the statistic can be called significant
• Can be described as the probability a truly null statistic is “as or more extreme” than the observed one
![Page 47: Carlo Colantuoni carlo@illuminatobiotech](https://reader035.vdocuments.us/reader035/viewer/2022081419/56815a22550346895dc765a1/html5/thumbnails/47.jpg)
False Discovery Rate and Q-values
• The q-value is a measure of significance in terms of the false discovery rate
• Q-value is defined to be the minimum false discovery rate at which the statistic can be called significant
• Can be described as the probability a statistic “as or more extreme” is truly null
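The "minimum false discovery rate at which the statistic can be called significant" can be computed directly from a list of p-values. The sketch below makes a simplifying assumption: the proportion of truly null genes is taken to be 1, which gives Benjamini-Hochberg adjusted p-values rather than q-values with an estimated null proportion. The function name and example values are hypothetical.

```python
def q_values(p_values):
    """Minimum FDR at which each test is called significant (null proportion assumed to be 1)."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    q = [0.0] * m
    running_min = 1.0
    # walk from the largest p-value down so each q is the minimum over all looser cut-offs
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * m / rank)
        q[i] = running_min
    return q

q = q_values([0.01, 0.02, 0.03, 0.5])
```

Note that the three smallest p-values share the same q-value here, because calling the third one significant already implies an estimated FDR of 0.04 for the whole list of three.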
![Page 48: Carlo Colantuoni carlo@illuminatobiotech](https://reader035.vdocuments.us/reader035/viewer/2022081419/56815a22550346895dc765a1/html5/thumbnails/48.jpg)
Power and Sample Size Calculations are Hard
• Need to specify:
– α (Type I error rate, false positives) or FDR
– σ (stdev: will be sample- and gene-specific)
– Effect size (how do we estimate?)
– Power (1 − β, β = Type II error rate)
– Sample Size
• Some papers:
– Mueller, Parmigiani et al. JASA (2004)
– Rich Simon’s group Biostatistics (2005)
– Tibshirani. A simple method for assessing sample sizes in microarray experiments. BMC Bioinformatics. 2006 Mar 2;7:106.
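For a single gene, the quantities listed above are tied together by the textbook normal-approximation formula for a two-sample comparison, n per group = 2 σ² (z₁₋α/₂ + z₁₋β)² / δ². The Python sketch below applies it; it is not one of the methods from the cited papers, and the effect size and standard deviation are hypothetical inputs that would be gene-specific in practice.

```python
from math import ceil
from statistics import NormalDist

def samples_per_group(effect_size, sd, alpha=0.05, power=0.8):
    """Normal-approximation sample size per group for a two-sided two-sample test."""
    z = NormalDist()
    z_alpha = z.inv_cdf(1 - alpha / 2)
    z_beta = z.inv_cdf(power)
    return ceil(2 * (sd * (z_alpha + z_beta) / effect_size) ** 2)

# One log2 unit of difference with sd of one log2 unit
n_single_test = samples_per_group(effect_size=1.0, sd=1.0)
# Same gene, but with a much stricter per-test alpha to allow for many tests
n_strict = samples_per_group(effect_size=1.0, sd=1.0, alpha=0.001)
```

Tightening α to account for thousands of tests raises the required sample size considerably, which is part of why these calculations are hard for microarray experiments.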
![Page 49: Carlo Colantuoni carlo@illuminatobiotech](https://reader035.vdocuments.us/reader035/viewer/2022081419/56815a22550346895dc765a1/html5/thumbnails/49.jpg)
![Page 50: Carlo Colantuoni carlo@illuminatobiotech](https://reader035.vdocuments.us/reader035/viewer/2022081419/56815a22550346895dc765a1/html5/thumbnails/50.jpg)
Beyond Individual Genes: Functional Gene Groups
• Borrow statistical power across entire dataset
• Integrate preexisting biological knowledge
[Figure: Correlation of Age with Gene Expression. Distribution of Observed (black) and Permuted (red+blue) Correlations (r). x-axis: Correlation (r); y-axis: Density.]
![Page 51: Carlo Colantuoni carlo@illuminatobiotech](https://reader035.vdocuments.us/reader035/viewer/2022081419/56815a22550346895dc765a1/html5/thumbnails/51.jpg)
Functional Annotation of Lists of Genes
• KEGG
• PFAM
• SWISS-PROT
• GO
• DRAGON
• DAVID/EASE
• MatchMiner
• BioConductor (R)
![Page 52: Carlo Colantuoni carlo@illuminatobiotech](https://reader035.vdocuments.us/reader035/viewer/2022081419/56815a22550346895dc765a1/html5/thumbnails/52.jpg)
![Page 53: Carlo Colantuoni carlo@illuminatobiotech](https://reader035.vdocuments.us/reader035/viewer/2022081419/56815a22550346895dc765a1/html5/thumbnails/53.jpg)
![Page 54: Carlo Colantuoni carlo@illuminatobiotech](https://reader035.vdocuments.us/reader035/viewer/2022081419/56815a22550346895dc765a1/html5/thumbnails/54.jpg)
![Page 55: Carlo Colantuoni carlo@illuminatobiotech](https://reader035.vdocuments.us/reader035/viewer/2022081419/56815a22550346895dc765a1/html5/thumbnails/55.jpg)
![Page 56: Carlo Colantuoni carlo@illuminatobiotech](https://reader035.vdocuments.us/reader035/viewer/2022081419/56815a22550346895dc765a1/html5/thumbnails/56.jpg)
![Page 57: Carlo Colantuoni carlo@illuminatobiotech](https://reader035.vdocuments.us/reader035/viewer/2022081419/56815a22550346895dc765a1/html5/thumbnails/57.jpg)
![Page 58: Carlo Colantuoni carlo@illuminatobiotech](https://reader035.vdocuments.us/reader035/viewer/2022081419/56815a22550346895dc765a1/html5/thumbnails/58.jpg)
![Page 59: Carlo Colantuoni carlo@illuminatobiotech](https://reader035.vdocuments.us/reader035/viewer/2022081419/56815a22550346895dc765a1/html5/thumbnails/59.jpg)
Gene Cross-Referencing and Gene Annotation Tools In BioConductor
(in the R statistical language)
annotate package
Microarray-specific “metadata” packages
DB-specific “metadata” packages
AnnBuilder package
![Page 60: Carlo Colantuoni carlo@illuminatobiotech](https://reader035.vdocuments.us/reader035/viewer/2022081419/56815a22550346895dc765a1/html5/thumbnails/60.jpg)
Annotation Tools In BioConductor: annotate package
Functions for accessing data in metadata packages.
Functions for accessing NCBI databases.
Functions for assembling HTML tables.
![Page 61: Carlo Colantuoni carlo@illuminatobiotech](https://reader035.vdocuments.us/reader035/viewer/2022081419/56815a22550346895dc765a1/html5/thumbnails/61.jpg)
Annotation Tools In BioConductor: Annotation for Commercial Microarrays
Array-specific metadata packages
![Page 62: Carlo Colantuoni carlo@illuminatobiotech](https://reader035.vdocuments.us/reader035/viewer/2022081419/56815a22550346895dc765a1/html5/thumbnails/62.jpg)
Annotation Tools In BioConductor: Functional Annotation with other DB’s
GO metadata package
![Page 63: Carlo Colantuoni carlo@illuminatobiotech](https://reader035.vdocuments.us/reader035/viewer/2022081419/56815a22550346895dc765a1/html5/thumbnails/63.jpg)
Annotation Tools In BioConductor: Functional Annotation with other DB’s
KEGG metadata package
![Page 64: Carlo Colantuoni carlo@illuminatobiotech](https://reader035.vdocuments.us/reader035/viewer/2022081419/56815a22550346895dc765a1/html5/thumbnails/64.jpg)
Is there enrichment in our list of differentially expressed genes for a particular functional gene group or pathway?
Threshold Enrichment: One Way of Assessing Differential Expression of Functional Gene Groups
![Page 65: Carlo Colantuoni carlo@illuminatobiotech](https://reader035.vdocuments.us/reader035/viewer/2022081419/56815a22550346895dc765a1/html5/thumbnails/65.jpg)
Threshold Enrichment
[Figure: Correlation of Age with Gene Expression. Distribution of Observed (black) and Permuted (red+blue) Correlations (r). x-axis: Correlation (r); y-axis: Density.]
![Page 66: Carlo Colantuoni carlo@illuminatobiotech](https://reader035.vdocuments.us/reader035/viewer/2022081419/56815a22550346895dc765a1/html5/thumbnails/66.jpg)
Threshold Enrichment: One Way of Assessing Differential Expression of Functional Gene Groups
![Page 67: Carlo Colantuoni carlo@illuminatobiotech](https://reader035.vdocuments.us/reader035/viewer/2022081419/56815a22550346895dc765a1/html5/thumbnails/67.jpg)
Threshold Enrichment: One Way of Assessing Differential Expression of Functional Gene Groups
The argument lower.tail will indicate if you are looking for over- or under- representation of differentially expressed genes within a particular functional group (using lower.tail=F for over-representation).
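The over-representation question can be made concrete with a hypergeometric upper-tail probability. The slide refers to an R function with a lower.tail argument; the Python sketch below computes the same kind of tail probability directly from binomial coefficients. The function and argument names are hypothetical, and the counts are a toy example.

```python
from math import comb

def hypergeom_upper_tail(k, group_size, n_genes, n_selected):
    """P(X >= k): chance of seeing at least k group members among n_selected genes
    drawn at random from n_genes, of which group_size belong to the functional group."""
    total = comb(n_genes, n_selected)
    upper = min(group_size, n_selected)
    favorable = sum(
        comb(group_size, x) * comb(n_genes - group_size, n_selected - x)
        for x in range(k, upper + 1)
    )
    return favorable / total

# Toy example: 10 genes on the array, 4 in the group, 3 called significant, 2 of them in the group
p_over = hypergeom_upper_tail(k=2, group_size=4, n_genes=10, n_selected=3)
```

In R, phyper with lower.tail=FALSE returns P(X > q), so the equivalent call for "at least k" would pass k - 1 as the first argument; that detail is easy to get wrong.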
![Page 68: Carlo Colantuoni carlo@illuminatobiotech](https://reader035.vdocuments.us/reader035/viewer/2022081419/56815a22550346895dc765a1/html5/thumbnails/68.jpg)
Can we use more of our data than Threshold Enrichment (that only uses the top of our gene list)?
![Page 69: Carlo Colantuoni carlo@illuminatobiotech](https://reader035.vdocuments.us/reader035/viewer/2022081419/56815a22550346895dc765a1/html5/thumbnails/69.jpg)
[Figure: Correlation of Age with Gene Expression. Distribution of Observed (black) and Permuted (red+blue) Correlations (r). x-axis: Correlation (r); y-axis: Density.]
• Beyond threshold enrichment
![Page 70: Carlo Colantuoni carlo@illuminatobiotech](https://reader035.vdocuments.us/reader035/viewer/2022081419/56815a22550346895dc765a1/html5/thumbnails/70.jpg)
EXP#1
Swiss-Prot
PFAM
KEGG
Functional Gene Subgroups within An Experiment
![Page 71: Carlo Colantuoni carlo@illuminatobiotech](https://reader035.vdocuments.us/reader035/viewer/2022081419/56815a22550346895dc765a1/html5/thumbnails/71.jpg)
Statistics for Analysis of Differential Expression of Gene Subgroups
Is THIS …
… Different from THIS?
![Page 72: Carlo Colantuoni carlo@illuminatobiotech](https://reader035.vdocuments.us/reader035/viewer/2022081419/56815a22550346895dc765a1/html5/thumbnails/72.jpg)
Over-Expression of a Group of Functionally Related Genes
p<7.42e-08
T statistic
![Page 73: Carlo Colantuoni carlo@illuminatobiotech](https://reader035.vdocuments.us/reader035/viewer/2022081419/56815a22550346895dc765a1/html5/thumbnails/73.jpg)
Statistical Tests:
• χ²
• Kolmogorov-Smirnov
• Product of Probabilities
• GSEA
• PAGE
• geneSetTest (Wilcoxon rank sum)
Is THIS …
… Different from THIS?
Conceptually Distinct from Threshold Enrichment and the Hypergeometric test!
![Page 74: Carlo Colantuoni carlo@illuminatobiotech](https://reader035.vdocuments.us/reader035/viewer/2022081419/56815a22550346895dc765a1/html5/thumbnails/74.jpg)
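$\chi^2 = \sum_{\text{histogram bins}} D$, i.e. χ² is the sum of D values, where:
$D = \dfrac{(O - E)^2}{E}$
(O = observed count in a bin, E = expected count in that bin)
All Genes
Subset of Interest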
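The binned χ² comparison can be sketched in Python as follows. Several things here are assumptions for illustration: the expected count E for each bin is taken from the all-genes histogram scaled to the size of the subset, the bin edges are chosen by hand, and the statistics are toy values.

```python
def histogram_counts(values, edges):
    """Count values per bin; bins are [edge, next_edge), with the last bin closed on the right."""
    counts = [0] * (len(edges) - 1)
    for v in values:
        for b in range(len(edges) - 1):
            is_last = b == len(edges) - 2
            if edges[b] <= v < edges[b + 1] or (is_last and v == edges[-1]):
                counts[b] += 1
                break
    return counts

def chi_square_vs_all(all_stats, subset_stats, edges):
    """Sum over bins of (O - E)^2 / E, with E derived from the all-genes distribution."""
    all_counts = histogram_counts(all_stats, edges)
    observed = histogram_counts(subset_stats, edges)
    n_all, n_subset = sum(all_counts), sum(observed)
    chi2 = 0.0
    for o, a in zip(observed, all_counts):
        e = n_subset * a / n_all
        if e > 0:                      # skip bins that are empty for all genes
            chi2 += (o - e) ** 2 / e
    return chi2

all_stats = [-1.5] * 2 + [-0.5] * 4 + [0.5] * 8 + [1.5] * 2
subset_stats = [0.5, 0.5, 1.5, 1.5]
chi2 = chi_square_vs_all(all_stats, subset_stats, edges=[-2, -1, 0, 1, 2])
```

The result depends on the choice of bins, which is one reason to compare this test with the alternatives listed on the previous slide.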
![Page 75: Carlo Colantuoni carlo@illuminatobiotech](https://reader035.vdocuments.us/reader035/viewer/2022081419/56815a22550346895dc765a1/html5/thumbnails/75.jpg)
All Genes
Subset of Interest
Kolmogorov-Smirnov
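The Kolmogorov-Smirnov statistic is the largest vertical gap between two empirical cumulative distributions. A minimal Python sketch, with a hypothetical function name and toy values, computes that gap for a subset of statistics against a reference set; it does not compute a p-value.

```python
from bisect import bisect_right

def ks_statistic(sample_a, sample_b):
    """Maximum absolute difference between the two empirical CDFs."""
    a, b = sorted(sample_a), sorted(sample_b)
    d = 0.0
    for x in sorted(set(a + b)):
        cdf_a = bisect_right(a, x) / len(a)   # fraction of a at or below x
        cdf_b = bisect_right(b, x) / len(b)
        d = max(d, abs(cdf_a - cdf_b))
    return d

d_shifted = ks_statistic([1, 2, 3, 4], [3, 4, 5, 6])
d_same = ks_statistic([1, 2, 3, 4], [1, 2, 3, 4])
```

Because it looks at the whole cumulative distribution, the statistic responds to shifts in location as well as to changes in spread or shape.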
![Page 76: Carlo Colantuoni carlo@illuminatobiotech](https://reader035.vdocuments.us/reader035/viewer/2022081419/56815a22550346895dc765a1/html5/thumbnails/76.jpg)
All Genes
Subset of Interest
Product of Individual Probabilities
![Page 77: Carlo Colantuoni carlo@illuminatobiotech](https://reader035.vdocuments.us/reader035/viewer/2022081419/56815a22550346895dc765a1/html5/thumbnails/77.jpg)
What shape/type of distributions would each of these tests be sensitive to?
All statistics
Statistics from gene subgroup
![Page 78: Carlo Colantuoni carlo@illuminatobiotech](https://reader035.vdocuments.us/reader035/viewer/2022081419/56815a22550346895dc765a1/html5/thumbnails/78.jpg)
Gene Set Enrichment Analysis (GSEA)
Subramanian et al, 2005 PNAS
![Page 79: Carlo Colantuoni carlo@illuminatobiotech](https://reader035.vdocuments.us/reader035/viewer/2022081419/56815a22550346895dc765a1/html5/thumbnails/79.jpg)
Gene Set Enrichment Analysis (GSEA)
![Page 80: Carlo Colantuoni carlo@illuminatobiotech](https://reader035.vdocuments.us/reader035/viewer/2022081419/56815a22550346895dc765a1/html5/thumbnails/80.jpg)
Gene Set Enrichment Analysis (GSEA)
![Page 81: Carlo Colantuoni carlo@illuminatobiotech](https://reader035.vdocuments.us/reader035/viewer/2022081419/56815a22550346895dc765a1/html5/thumbnails/81.jpg)
Gene Set Enrichment Analysis (GSEA)
![Page 82: Carlo Colantuoni carlo@illuminatobiotech](https://reader035.vdocuments.us/reader035/viewer/2022081419/56815a22550346895dc765a1/html5/thumbnails/82.jpg)
Gene Set Enrichment Analysis (GSEA)
![Page 83: Carlo Colantuoni carlo@illuminatobiotech](https://reader035.vdocuments.us/reader035/viewer/2022081419/56815a22550346895dc765a1/html5/thumbnails/83.jpg)
Gene Set Enrichment Analysis (GSEA)
![Page 84: Carlo Colantuoni carlo@illuminatobiotech](https://reader035.vdocuments.us/reader035/viewer/2022081419/56815a22550346895dc765a1/html5/thumbnails/84.jpg)
Parametric Analysis of Gene Set Enrichment (PAGE)
Kim et al, 2005 BMC Bioinformatics
![Page 85: Carlo Colantuoni carlo@illuminatobiotech](https://reader035.vdocuments.us/reader035/viewer/2022081419/56815a22550346895dc765a1/html5/thumbnails/85.jpg)
Parametric Analysis of Gene Set Enrichment (PAGE)
![Page 86: Carlo Colantuoni carlo@illuminatobiotech](https://reader035.vdocuments.us/reader035/viewer/2022081419/56815a22550346895dc765a1/html5/thumbnails/86.jpg)
![Page 87: Carlo Colantuoni carlo@illuminatobiotech](https://reader035.vdocuments.us/reader035/viewer/2022081419/56815a22550346895dc765a1/html5/thumbnails/87.jpg)
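$Z = \dfrac{S_m - \mu}{\sigma / m^{0.5}}$
(S_m = mean of the statistics in the gene set; μ and σ = mean and standard deviation of the statistics across all genes; m = number of genes in the set)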
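The Z score above is straightforward to compute. In the Python sketch below the function name, the use of the population standard deviation for σ, the two-sided normal p-value, and the toy scores are all assumptions made for illustration.

```python
from math import sqrt
from statistics import NormalDist, mean, pstdev

def page_z(set_scores, all_scores):
    """Z = (S_m - mu) / (sigma / sqrt(m)) with a two-sided normal p-value."""
    mu = mean(all_scores)
    sigma = pstdev(all_scores)       # population SD over all genes (assumed)
    m = len(set_scores)
    z = (mean(set_scores) - mu) / (sigma / sqrt(m))
    p = 2 * (1 - NormalDist().cdf(abs(z)))
    return z, p

all_scores = [-2.0, -1.0, 0.0, 1.0, 2.0]   # e.g. log fold changes for every gene
set_scores = [1.0, 2.0]                    # the same scores for genes in one set
z, p = page_z(set_scores, all_scores)
```

The normal approximation relies on the gene set being reasonably large, so a two-gene set like this one is only useful for checking the arithmetic.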
![Page 88: Carlo Colantuoni carlo@illuminatobiotech](https://reader035.vdocuments.us/reader035/viewer/2022081419/56815a22550346895dc765a1/html5/thumbnails/88.jpg)
geneSetTest (limma): A simple method in Bioconductor
Test whether a set of genes is enriched for differential expression.
Usage: geneSetTest(selected, statistics, alternative="mixed", type="auto", ranks.only=TRUE, nsim=10000)
The test statistic used for the gene-set-test is the mean of the statistics in the set. If ranks.only is TRUE, then only the ranks of the statistics are used. In this case the p-value is obtained from a Wilcoxon test. If ranks.only is FALSE, then the p-value is obtained by simulation using nsim randomly selected sets of genes.
Argument: alternative = “mixed” or “either”: fundamentally different questions.
![Page 89: Carlo Colantuoni carlo@illuminatobiotech](https://reader035.vdocuments.us/reader035/viewer/2022081419/56815a22550346895dc765a1/html5/thumbnails/89.jpg)
Wilcoxon test
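A rank-based gene set test compares the ranks of the statistics inside the set with those of all other genes. The Python sketch below is not the limma implementation: it uses average ranks for ties, a normal approximation without tie or continuity correction, and hypothetical function names and data.

```python
from math import sqrt
from statistics import NormalDist

def average_ranks(values):
    """Ranks starting at 1, with tied values sharing their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = shared
        i = j + 1
    return ranks

def wilcoxon_rank_sum(set_stats, other_stats):
    """Rank sum W of the set, its z-score, and a two-sided normal-approximation p-value."""
    n1, n2 = len(set_stats), len(other_stats)
    ranks = average_ranks(list(set_stats) + list(other_stats))
    w = sum(ranks[:n1])
    expected = n1 * (n1 + n2 + 1) / 2
    sd = sqrt(n1 * n2 * (n1 + n2 + 1) / 12)
    z = (w - expected) / sd
    return w, z, 2 * (1 - NormalDist().cdf(abs(z)))

w, z, p = wilcoxon_rank_sum([5.0, 6.0, 7.0], [1.0, 2.0, 3.0, 4.0])
```

Only the ordering of the statistics matters here, so a few extreme values in the set carry no more weight than moderately high ones.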
![Page 90: Carlo Colantuoni carlo@illuminatobiotech](https://reader035.vdocuments.us/reader035/viewer/2022081419/56815a22550346895dc765a1/html5/thumbnails/90.jpg)
What shape/type of distributions would each of these tests be sensitive to?
All statistics
Statistics from gene subgroup
![Page 91: Carlo Colantuoni carlo@illuminatobiotech](https://reader035.vdocuments.us/reader035/viewer/2022081419/56815a22550346895dc765a1/html5/thumbnails/91.jpg)
Analysis of Gene Networks
![Page 92: Carlo Colantuoni carlo@illuminatobiotech](https://reader035.vdocuments.us/reader035/viewer/2022081419/56815a22550346895dc765a1/html5/thumbnails/92.jpg)
Large Protein Interaction Network
Network Regulated in Sample #1
![Page 93: Carlo Colantuoni carlo@illuminatobiotech](https://reader035.vdocuments.us/reader035/viewer/2022081419/56815a22550346895dc765a1/html5/thumbnails/93.jpg)
Network Regulated in Sample #1
Network Regulated in Sample #2
Large Protein Interaction Network
![Page 94: Carlo Colantuoni carlo@illuminatobiotech](https://reader035.vdocuments.us/reader035/viewer/2022081419/56815a22550346895dc765a1/html5/thumbnails/94.jpg)
Network Regulated in Sample #1
Network Regulated in Sample #2
Network Regulated in Sample #3
Large Protein Interaction Network
![Page 95: Carlo Colantuoni carlo@illuminatobiotech](https://reader035.vdocuments.us/reader035/viewer/2022081419/56815a22550346895dc765a1/html5/thumbnails/95.jpg)
Network of Interest
Network Regulated in Sample #1
Network Regulated in Sample #2
Network Regulated in Sample #3
Large Protein Interaction Network
![Page 96: Carlo Colantuoni carlo@illuminatobiotech](https://reader035.vdocuments.us/reader035/viewer/2022081419/56815a22550346895dc765a1/html5/thumbnails/96.jpg)
Additional Notes on Experimental Design
![Page 97: Carlo Colantuoni carlo@illuminatobiotech](https://reader035.vdocuments.us/reader035/viewer/2022081419/56815a22550346895dc765a1/html5/thumbnails/97.jpg)
Old-School Experimental Design: Randomization
![Page 98: Carlo Colantuoni carlo@illuminatobiotech](https://reader035.vdocuments.us/reader035/viewer/2022081419/56815a22550346895dc765a1/html5/thumbnails/98.jpg)
Dissection of tissue
RNA Isolation
Amplification
Probe labelling
Hybridization
Biological Replicates
Technical Replicates
Replicates in a mouse model:
![Page 99: Carlo Colantuoni carlo@illuminatobiotech](https://reader035.vdocuments.us/reader035/viewer/2022081419/56815a22550346895dc765a1/html5/thumbnails/99.jpg)
Common question in experimental design
• Should I pool mRNA samples across subjects in an effort to reduce the effect of biological variability (or cost)?
![Page 100: Carlo Colantuoni carlo@illuminatobiotech](https://reader035.vdocuments.us/reader035/viewer/2022081419/56815a22550346895dc765a1/html5/thumbnails/100.jpg)
Two simple designs
• The following two designs have roughly the same cost:
– 3 individuals, 3 arrays
– Pool of three individuals, 3 technical replicates
• To a statistician the second design seems obviously worse. But, I found it hard to convince many biologists of this.
– 3 pools of 3 animals on individual arrays?
![Page 101: Carlo Colantuoni carlo@illuminatobiotech](https://reader035.vdocuments.us/reader035/viewer/2022081419/56815a22550346895dc765a1/html5/thumbnails/101.jpg)
Cons of Pooling Everything
• You cannot measure within-class variation
• Therefore, no population inference possible
• Mathematical averaging is an alternative way of reducing variance.
• Pooling may have non-linear effects
• You cannot take the log before you average: E[log(X+Y)] ≠ E[log(X)] + E[log(Y)]
• You cannot detect outliers
*If the measurements are independent and identically distributed
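The point about logs and averaging can be checked numerically. In the sketch below, physical pooling is modeled (as an assumption) as averaging the raw intensities of the individuals before taking the log, whereas arraying individuals separately lets you average their log intensities. The intensities are made up.

```python
from math import log2
from statistics import mean

def log_of_pooled(intensities):
    """Log of the averaged raw signal, a simple model of a pooled sample."""
    return log2(mean(intensities))

def mean_of_logs(intensities):
    """Average of per-individual log intensities (separate arrays)."""
    return mean(log2(x) for x in intensities)

individuals = [100.0, 200.0, 1600.0]   # one individual expresses the gene much more highly
pooled_value = log_of_pooled(individuals)
averaged_value = mean_of_logs(individuals)
```

The pooled value is pulled toward the single high individual, and because only one number comes out of the pool, there is no way to see that one sample was unusual.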
![Page 102: Carlo Colantuoni carlo@illuminatobiotech](https://reader035.vdocuments.us/reader035/viewer/2022081419/56815a22550346895dc765a1/html5/thumbnails/102.jpg)
Cons specific to microarrays
• Different genes have dramatically different biological variances.
• Not measuring this variance will result in genes with larger biological variance having a better chance of being considered more important
![Page 103: Carlo Colantuoni carlo@illuminatobiotech](https://reader035.vdocuments.us/reader035/viewer/2022081419/56815a22550346895dc765a1/html5/thumbnails/103.jpg)
Higher variance: larger fold change
We compute fold change for each gene (Y axis)
From 12 individuals we estimate gene specific variance (X axis)
If we pool we never see this variance.
![Page 104: Carlo Colantuoni carlo@illuminatobiotech](https://reader035.vdocuments.us/reader035/viewer/2022081419/56815a22550346895dc765a1/html5/thumbnails/104.jpg)
Remember this?
![Page 105: Carlo Colantuoni carlo@illuminatobiotech](https://reader035.vdocuments.us/reader035/viewer/2022081419/56815a22550346895dc765a1/html5/thumbnails/105.jpg)
Useful Books:
“Statistical analysis of gene expression microarray data” – Speed
“Analysis of gene expression data” – Parmigiani
“Bioinformatics and computational biology solutions using R” – Irizarry