From the UC Davis Proteomics 2014 Summer Workshop (www.proteomics.ucdavis.edu), by Blythe Durbin-Johnson, Ph.D.
![Page 1: Some statistical concepts relevant to proteomics data analysis](https://reader035.vdocuments.us/reader035/viewer/2022062701/55424c795503468d0a8b4631/html5/thumbnails/1.jpg)
Some Statistical Concepts Relevant to Proteomics Data Analysis
Blythe Durbin-Johnson, Ph.D.
August 7, 2014
![Page 2: Some statistical concepts relevant to proteomics data analysis](https://reader035.vdocuments.us/reader035/viewer/2022062701/55424c795503468d0a8b4631/html5/thumbnails/2.jpg)
Topics
• Hypothesis Testing, p-values, power
• Comparing two groups
• More complicated experimental designs
• Models for count data
• The log transformation
• Graphics
![Page 3: Some statistical concepts relevant to proteomics data analysis](https://reader035.vdocuments.us/reader035/viewer/2022062701/55424c795503468d0a8b4631/html5/thumbnails/3.jpg)
Topics
• General concepts, not formulas
• No equations
• Assuming you will be doing any analysis with a computer
![Page 4: Some statistical concepts relevant to proteomics data analysis](https://reader035.vdocuments.us/reader035/viewer/2022062701/55424c795503468d0a8b4631/html5/thumbnails/4.jpg)
HYPOTHESIS TESTING, P-VALUES, AND POWER
![Page 5: Some statistical concepts relevant to proteomics data analysis](https://reader035.vdocuments.us/reader035/viewer/2022062701/55424c795503468d0a8b4631/html5/thumbnails/5.jpg)
Hypothesis Testing
• Test “null hypothesis” of no effect against “alternative hypothesis”
• Calculate test statistic, reject null if test statistic is large relative to what one would expect under the null distribution
![Page 6: Some statistical concepts relevant to proteomics data analysis](https://reader035.vdocuments.us/reader035/viewer/2022062701/55424c795503468d0a8b4631/html5/thumbnails/6.jpg)
P-Values
• P-value = probability of seeing a test statistic as large or larger than your test statistic when the null hypothesis is true
• Typically reject null if P < 0.05
– This is purely a historical convention
– Nothing magic happens at the P = 0.05 threshold
![Page 7: Some statistical concepts relevant to proteomics data analysis](https://reader035.vdocuments.us/reader035/viewer/2022062701/55424c795503468d0a8b4631/html5/thumbnails/7.jpg)
A P-Value is NOT
• …the probability that the null hypothesis is true
• …the probability that an experiment will not be replicated
• …a direct measure of the size or importance of an effect
• …a measure of biological/clinical significance
![Page 8: Some statistical concepts relevant to proteomics data analysis](https://reader035.vdocuments.us/reader035/viewer/2022062701/55424c795503468d0a8b4631/html5/thumbnails/8.jpg)
Power
• Power = probability of rejecting null hypothesis for a given effect size
• Depends on:
– Effect size (difference between groups)
– Sample size
– Amount of variability in data
– Hypothesis test being used
– How “significance” is defined
![Page 9: Some statistical concepts relevant to proteomics data analysis](https://reader035.vdocuments.us/reader035/viewer/2022062701/55424c795503468d0a8b4631/html5/thumbnails/9.jpg)
Power and P-Values
• Under the null hypothesis, p-values are uniformly distributed between 0 and 1
– Expect 5% to be less than 0.05, on average
• Under alternatives, there is a higher probability of smaller p-values (higher power), but you can still theoretically get any p-value between 0 and 1
![Page 10: Some statistical concepts relevant to proteomics data analysis](https://reader035.vdocuments.us/reader035/viewer/2022062701/55424c795503468d0a8b4631/html5/thumbnails/10.jpg)
Power Example
• Simulate two groups of normally distributed data with means 0, 0.5, 1, and 2 standard deviations apart
• Conduct two-sample t-test
• Repeat 5000 times, look at distribution of p-values
• Repeat for various sample sizes
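The simulation described on this slide can be sketched in a few lines of Python. This is a minimal sketch, not the code behind the original figures: to stay within the standard library it swaps in a z-test that treats the standard deviation as known (1.0) in place of the two-sample t-test, and the function names, the default of 2000 repetitions, and the seed are all illustrative assumptions.

```python
import random
from statistics import NormalDist, mean


def z_test_p_value(x, y, sd=1.0):
    """Two-sided p-value for a difference in means, treating sd as known.

    Assumption: this z-test stands in for the two-sample t-test of the slide.
    """
    se = sd * (1 / len(x) + 1 / len(y)) ** 0.5
    z = (mean(x) - mean(y)) / se
    return 2 * (1 - NormalDist().cdf(abs(z)))


def simulate_power(effect, n, reps=2000, alpha=0.05, seed=1):
    """Fraction of simulated experiments with p < alpha (estimated power)."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(reps):
        x = [rng.gauss(0.0, 1.0) for _ in range(n)]     # group 1
        y = [rng.gauss(effect, 1.0) for _ in range(n)]  # group 2, shifted
        if z_test_p_value(x, y) < alpha:
            hits += 1
    return hits / reps


if __name__ == "__main__":
    for n in (3, 10, 30):
        row = [round(simulate_power(d, n), 3) for d in (0, 0.5, 1, 2)]
        print(f"n = {n:2d} per group, power at effects 0, 0.5, 1, 2 SD: {row}")
```

With an effect of 0 the rejection rate should hover near 0.05, and it should climb toward 1 as the effect size or the sample size grows.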
![Page 11: Some statistical concepts relevant to proteomics data analysis](https://reader035.vdocuments.us/reader035/viewer/2022062701/55424c795503468d0a8b4631/html5/thumbnails/11.jpg)
![Page 12: Some statistical concepts relevant to proteomics data analysis](https://reader035.vdocuments.us/reader035/viewer/2022062701/55424c795503468d0a8b4631/html5/thumbnails/12.jpg)
![Page 13: Some statistical concepts relevant to proteomics data analysis](https://reader035.vdocuments.us/reader035/viewer/2022062701/55424c795503468d0a8b4631/html5/thumbnails/13.jpg)
COMPARING TWO GROUPS
![Page 14: Some statistical concepts relevant to proteomics data analysis](https://reader035.vdocuments.us/reader035/viewer/2022062701/55424c795503468d0a8b4631/html5/thumbnails/14.jpg)
Two-Sample T-Test
• Compares means of two groups
• Does NOT explicitly require normally distributed (Gaussian) data, unless sample sizes are small
• Surprisingly robust under a wide range of conditions, but data skewness is a problem
• Generally use version NOT assuming equal variances
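For readers who want to see what the unequal-variance version computes, here is a minimal sketch of the Welch t statistic and its approximate (Welch-Satterthwaite) degrees of freedom, using only the standard library. The function name `welch_t` is an illustrative assumption. Turning the statistic into a p-value needs the t distribution, which the standard library lacks; in practice a statistics package would do the whole test (for example, SciPy's `ttest_ind` with `equal_var=False`).

```python
from statistics import mean, variance


def welch_t(x, y):
    """Return (t statistic, approximate degrees of freedom) for two groups.

    Each group contributes its own variance, so equal variances are not assumed.
    """
    vx = variance(x) / len(x)  # squared standard error of mean(x)
    vy = variance(y) / len(y)  # squared standard error of mean(y)
    t = (mean(x) - mean(y)) / (vx + vy) ** 0.5
    df = (vx + vy) ** 2 / (vx ** 2 / (len(x) - 1) + vy ** 2 / (len(y) - 1))
    return t, df


if __name__ == "__main__":
    t, df = welch_t([1, 2, 3, 4, 5], [2, 4, 6, 8, 10])
    print(f"t = {t:.3f}, df = {df:.2f}")
```

When the two groups have the same size and the same sample variance, the degrees of freedom reduce to the familiar 2n - 2.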
![Page 15: Some statistical concepts relevant to proteomics data analysis](https://reader035.vdocuments.us/reader035/viewer/2022062701/55424c795503468d0a8b4631/html5/thumbnails/15.jpg)
Wilcoxon Rank-Sum Test
• Non-parametric test
• General test of location shift
• (Not actually comparing medians unless you make additional assumptions)
• Less powerful than t-test when t-test assumptions are satisfied
![Page 16: Some statistical concepts relevant to proteomics data analysis](https://reader035.vdocuments.us/reader035/viewer/2022062701/55424c795503468d0a8b4631/html5/thumbnails/16.jpg)
Permutation Tests
• Can get permutation-based p-values for any test statistic
• Perform e.g. t-test on original data
• Randomly declare samples to be “control” or “treatment”, do t-test on permuted data
• Repeat many times
• Compare original test statistic to permutation “null” distribution
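The procedure above can be sketched as follows. This is a minimal illustration under stated assumptions: it uses the absolute difference in group means as the test statistic (the slide's t statistic would work the same way), the function name is hypothetical, and the "+1" in the numerator and denominator is a common convention that counts the observed labeling as one of the permutations.

```python
import random
from statistics import mean


def permutation_p_value(control, treatment, reps=5000, seed=1):
    """Monte Carlo permutation p-value for a difference in means (two-sided)."""
    rng = random.Random(seed)
    observed = abs(mean(treatment) - mean(control))
    pooled = list(control) + list(treatment)
    n_control = len(control)
    extreme = 0
    for _ in range(reps):
        rng.shuffle(pooled)  # randomly relabel samples
        stat = abs(mean(pooled[n_control:]) - mean(pooled[:n_control]))
        if stat >= observed:
            extreme += 1
    # Count the observed labeling itself so the p-value is never exactly 0.
    return (extreme + 1) / (reps + 1)


if __name__ == "__main__":
    control = [5.1, 4.8, 5.3, 5.0, 4.9, 5.2]
    treatment = [6.9, 7.2, 7.1, 6.8, 7.0, 7.3]
    print(permutation_p_value(control, treatment))
```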
![Page 17: Some statistical concepts relevant to proteomics data analysis](https://reader035.vdocuments.us/reader035/viewer/2022062701/55424c795503468d0a8b4631/html5/thumbnails/17.jpg)
Permutation Tests
• Permutation tests don’t work for very small sample sizes
• With n = 3 in each of two groups, there are only 20 possible permutations
• Smallest possible p-value is 0.07
• Recommend at least 6 samples per group, more if adjusting for multiple testing
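The count of possible relabelings is easy to check by brute force. The sketch below enumerates every way of choosing which samples are "treatment" and computes an exact permutation p-value for the absolute difference in means; the function name and the toy data are assumptions for illustration. With 3 samples per group there are only 20 relabelings, so even perfectly separated groups cannot produce a p-value below the usual 0.05 threshold: the floor is 1/20 for a one-sided test and 2/20 for the two-sided statistic used here.

```python
from itertools import combinations
from math import comb
from statistics import mean


def exact_permutation_p(control, treatment):
    """Exact two-sided permutation p-value; returns (p_value, n_relabelings)."""
    pooled = list(control) + list(treatment)
    k = len(treatment)
    observed = abs(mean(treatment) - mean(control))
    total = extreme = 0
    for idx in combinations(range(len(pooled)), k):
        chosen = set(idx)
        t = [pooled[i] for i in chosen]
        c = [pooled[i] for i in range(len(pooled)) if i not in chosen]
        total += 1
        if abs(mean(t) - mean(c)) >= observed:
            extreme += 1
    return extreme / total, total


if __name__ == "__main__":
    p, total = exact_permutation_p([1, 2, 3], [10, 11, 12])
    print(total, p)            # relabelings with n = 3 per group
    print(comb(12, 6))         # relabelings with n = 6 per group
```

With 6 samples per group the number of relabelings rises to 924, which leaves much more room for small p-values.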
![Page 18: Some statistical concepts relevant to proteomics data analysis](https://reader035.vdocuments.us/reader035/viewer/2022062701/55424c795503468d0a8b4631/html5/thumbnails/18.jpg)
BEYOND TWO GROUPS
![Page 19: Some statistical concepts relevant to proteomics data analysis](https://reader035.vdocuments.us/reader035/viewer/2022062701/55424c795503468d0a8b4631/html5/thumbnails/19.jpg)
Examples of More Complicated Designs
• Compare protein expression in three bacterial strains
– One-way ANOVA
• Compare expression in two tomato genotypes under two conditions
– Two-way ANOVA
• Compare protein expression in matched hair samples from three different body regions
– Mixed effects model
![Page 20: Some statistical concepts relevant to proteomics data analysis](https://reader035.vdocuments.us/reader035/viewer/2022062701/55424c795503468d0a8b4631/html5/thumbnails/20.jpg)
One-Way ANOVA
• Compare group means
• Extension of two-sample t-test to more than 2 groups
• Except: Generally assume equal variances
• P-value from an ANOVA F-test is from “global” test of any differences among groups
• Need to do post-hoc testing (e.g. Tukey) to get pairwise differences
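To make the "global" F-test concrete, here is a minimal sketch of the one-way ANOVA F statistic: between-group variability divided by pooled within-group variability. The function name and example data are assumptions. The sketch stops at the statistic and its degrees of freedom; a p-value requires the F distribution, which is best left to a statistics package.

```python
from statistics import mean


def one_way_anova_f(groups):
    """Return (F, df_between, df_within) for a list of groups of values."""
    all_values = [v for g in groups for v in g]
    grand_mean = mean(all_values)
    group_means = [mean(g) for g in groups]
    k, n = len(groups), len(all_values)
    # Variation of group means around the grand mean.
    ss_between = sum(len(g) * (m - grand_mean) ** 2
                     for g, m in zip(groups, group_means))
    # Variation of observations around their own group mean (pooled).
    ss_within = sum((v - m) ** 2
                    for g, m in zip(groups, group_means) for v in g)
    df_between, df_within = k - 1, n - k
    f = (ss_between / df_between) / (ss_within / df_within)
    return f, df_between, df_within


if __name__ == "__main__":
    strains = [[1, 2, 3], [2, 3, 4], [5, 6, 7]]  # three hypothetical strains
    print(one_way_anova_f(strains))
```

A large F says only that some group differs; it does not say which one, which is why post-hoc pairwise testing is needed.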
![Page 21: Some statistical concepts relevant to proteomics data analysis](https://reader035.vdocuments.us/reader035/viewer/2022062701/55424c795503468d0a8b4631/html5/thumbnails/21.jpg)
Two-Way ANOVA
• Analyze two experimental factors at the same time
– E.g. genotype and treatment
• More power for main effects than in separate analyses
• Can look at interaction of experimental factors
• Can also analyze three, four or more factors
– But: Define your questions well!
![Page 22: Some statistical concepts relevant to proteomics data analysis](https://reader035.vdocuments.us/reader035/viewer/2022062701/55424c795503468d0a8b4631/html5/thumbnails/22.jpg)
Mixed Effects Models
• Advanced topic!
• Be aware of when these are required, then get help
![Page 23: Some statistical concepts relevant to proteomics data analysis](https://reader035.vdocuments.us/reader035/viewer/2022062701/55424c795503468d0a8b4631/html5/thumbnails/23.jpg)
Mixed Effects Models
• Used for longitudinal or repeated-measures studies
– Same subject observed over time
– Matched samples from same subject
– Subjects from same family
– Anytime there may be correlation among samples
![Page 24: Some statistical concepts relevant to proteomics data analysis](https://reader035.vdocuments.us/reader035/viewer/2022062701/55424c795503468d0a8b4631/html5/thumbnails/24.jpg)
Mixed Effects Models
• Modify ANOVA model to include “random effect” for subject, family, etc.
• This accounts for within-subject or within-family correlation
• If you don’t do this, you will greatly underestimate the variability in the data, and your p-values will be too small
![Page 25: Some statistical concepts relevant to proteomics data analysis](https://reader035.vdocuments.us/reader035/viewer/2022062701/55424c795503468d0a8b4631/html5/thumbnails/25.jpg)
Models for Count Data
• Counts (esp. small counts) often require special models
• “Count” means 0, 1, 2, 3, …
![Page 26: Some statistical concepts relevant to proteomics data analysis](https://reader035.vdocuments.us/reader035/viewer/2022062701/55424c795503468d0a8b4631/html5/thumbnails/26.jpg)
Models for Count Data
• Poisson model often used for count data
• Assumes data come from Poisson distribution
• Poisson model assumes mean = variance
• Too restrictive!
![Page 27: Some statistical concepts relevant to proteomics data analysis](https://reader035.vdocuments.us/reader035/viewer/2022062701/55424c795503468d0a8b4631/html5/thumbnails/27.jpg)
Models for Count Data
• “Quasipoisson” model allows variance to be proportional to mean
• Allows for overdispersion
• Why “quasi”?
– There’s no “quasipoisson” distribution
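A quick way to see whether the Poisson assumption is too restrictive for a set of counts is to compare the sample variance with the sample mean. The sketch below assumes the simplest setting, one group of replicate counts with a common mean, where the usual Pearson-based quasipoisson dispersion estimate reduces to variance divided by mean. The function name and the example counts are hypothetical.

```python
from statistics import mean, variance


def dispersion(counts):
    """Variance-to-mean ratio: about 1 for Poisson data, > 1 if overdispersed."""
    return variance(counts) / mean(counts)


if __name__ == "__main__":
    spectral_counts = [0, 1, 0, 30, 2, 15]  # hypothetical replicate counts
    print(dispersion(spectral_counts))
```

A ratio well above 1 suggests overdispersion, which is the situation the quasipoisson model is meant to accommodate.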
![Page 28: Some statistical concepts relevant to proteomics data analysis](https://reader035.vdocuments.us/reader035/viewer/2022062701/55424c795503468d0a8b4631/html5/thumbnails/28.jpg)
Models for Count Data
• Negative binomial model also allows overdispersion
• Variance is quadratic function of mean
• Can be derived as mixture of Poisson distributions
• May be more conservative than quasipoisson model (Leitch, 2012)
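The three count models differ mainly in how the variance is tied to the mean. The summary below uses the standard textbook forms; the symbols are assumptions not used on the slides: μ is the mean, φ is the quasipoisson dispersion parameter, and θ is the negative binomial size parameter.

```latex
\begin{align*}
\text{Poisson:} \quad & \operatorname{Var}(Y) = \mu \\
\text{Quasipoisson:} \quad & \operatorname{Var}(Y) = \phi\,\mu \\
\text{Negative binomial:} \quad & \operatorname{Var}(Y) = \mu + \mu^{2}/\theta
\end{align*}
```

Setting φ = 1, or letting θ grow without bound, recovers the Poisson case.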
![Page 29: Some statistical concepts relevant to proteomics data analysis](https://reader035.vdocuments.us/reader035/viewer/2022062701/55424c795503468d0a8b4631/html5/thumbnails/29.jpg)
Models for Count Data
• Complex experimental designs can be analyzed with quasipoisson or negative binomial models
– Generalized linear models
• Model parameter estimates are log fold changes
![Page 30: Some statistical concepts relevant to proteomics data analysis](https://reader035.vdocuments.us/reader035/viewer/2022062701/55424c795503468d0a8b4631/html5/thumbnails/30.jpg)
The Log Transformation
[Figure: histogram of raw intensities (x-axis: Intensity, y-axis: Frequency)]
• Intensity data are often skewed
• Skewness causes problems for t-tests and ANOVA
![Page 31: Some statistical concepts relevant to proteomics data analysis](https://reader035.vdocuments.us/reader035/viewer/2022062701/55424c795503468d0a8b4631/html5/thumbnails/31.jpg)
The Log Transformation
• The log transformation can fix skewness
• Doesn’t matter what base you use
• Parameter estimates from ANOVA on logged data are log FC’s
[Figure: histogram of log-transformed intensities (x-axis: ln(Intensity), y-axis: Frequency)]
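The link between logged data and log fold changes can be shown with a small sketch: the difference between group means of logged intensities is a log fold change, and switching the base of the logarithm only rescales it by a constant. The function name and the intensity values are assumptions for illustration.

```python
import math
from statistics import mean


def log_fold_change(treatment, control, log=math.log2):
    """Difference in mean log intensity between two groups."""
    return mean([log(v) for v in treatment]) - mean([log(v) for v in control])


if __name__ == "__main__":
    control = [1.0e7, 2.0e7]     # hypothetical raw intensities
    treatment = [4.0e7, 8.0e7]   # four times higher
    lfc2 = log_fold_change(treatment, control)            # base 2
    lfc_e = log_fold_change(treatment, control, math.log)  # natural log
    print(lfc2, lfc_e / math.log(2))  # same value after rescaling
```

Base 2 is a common choice because a log fold change of 1 then corresponds to a doubling.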
![Page 32: Some statistical concepts relevant to proteomics data analysis](https://reader035.vdocuments.us/reader035/viewer/2022062701/55424c795503468d0a8b4631/html5/thumbnails/32.jpg)
MULTIPLE TESTING
![Page 33: Some statistical concepts relevant to proteomics data analysis](https://reader035.vdocuments.us/reader035/viewer/2022062701/55424c795503468d0a8b4631/html5/thumbnails/33.jpg)
Multiple Testing Example
• Patient samples treated with different radiation doses and observed over time
• Illumina microarray experiment, 16,801 genes used in analysis
• Four replicates per patient/time/dose
• All samples used in this example were replicates from same patient, untreated
• T-tests gene by gene comparing replicates 1 and 3 to replicates 2 and 4
• 196 genes with P < 0.05
![Page 34: Some statistical concepts relevant to proteomics data analysis](https://reader035.vdocuments.us/reader035/viewer/2022062701/55424c795503468d0a8b4631/html5/thumbnails/34.jpg)
Multiple Testing Example
• Entered list of genes with P < 0.05 into DAVID’s functional annotation tool
– http://david.abcc.ncifcrf.gov
• Overrepresented terms (P < 0.05) included disease mutation, mutagenesis site, and 79 others
• If you were doing radiation research, would you be excited about this?
![Page 35: Some statistical concepts relevant to proteomics data analysis](https://reader035.vdocuments.us/reader035/viewer/2022062701/55424c795503468d0a8b4631/html5/thumbnails/35.jpg)
Multiple Testing Example
• We know there is no difference between the “groups”
• What is going on?
![Page 36: Some statistical concepts relevant to proteomics data analysis](https://reader035.vdocuments.us/reader035/viewer/2022062701/55424c795503468d0a8b4631/html5/thumbnails/36.jpg)
Multiple Testing Example
• Expect P < 0.05 about 5% of the time under null hypothesis
• (We see 196/16801 ≈ 1.2% of genes with P < 0.05, but our data aren’t perfectly normal and our p-values are correlated)
• When conducting multiple tests, need to make adjustments to avoid spurious results
![Page 37: Some statistical concepts relevant to proteomics data analysis](https://reader035.vdocuments.us/reader035/viewer/2022062701/55424c795503468d0a8b4631/html5/thumbnails/37.jpg)
Multiple Testing
Familywise Error Rate

| | Number declared non-significant | Number declared significant | Total |
|---|---|---|---|
| True null hypotheses | U | V | m0 |
| False null hypotheses | T | S | m - m0 |
| Total | m - R | R | m |

FWER = P(V ≥ 1)
FWER = Probability of ANY false positives
![Page 38: Some statistical concepts relevant to proteomics data analysis](https://reader035.vdocuments.us/reader035/viewer/2022062701/55424c795503468d0a8b4631/html5/thumbnails/38.jpg)
Multiple Testing
One way of controlling FWER: set α’ = α/m, where m is the number of tests (Bonferroni correction)
Problems:
1. Very conservative, even for FWER control.
2. Is the FWER really what we want to control?
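A short sketch shows why a correction is needed and what Bonferroni does. It rests on one loudly stated assumption: the tests are treated as independent, which real proteomics or microarray data generally are not. Under that assumption the chance of at least one false positive among the tests (m in the table of outcomes) is 1 - (1 - α)^m. The function names are hypothetical.

```python
def fwer_independent(alpha, m):
    """P(at least one false positive) for m independent true-null tests."""
    return 1 - (1 - alpha) ** m


def bonferroni_threshold(alpha, m):
    """Per-test significance cutoff under the Bonferroni correction."""
    return alpha / m


def bonferroni_reject(p_values, alpha=0.05):
    """True where a p-value passes the Bonferroni-corrected cutoff."""
    cutoff = bonferroni_threshold(alpha, len(p_values))
    return [p <= cutoff for p in p_values]


if __name__ == "__main__":
    print(fwer_independent(0.05, 10))             # already about 0.40
    print(bonferroni_threshold(0.05, 16801))      # very small per-test cutoff
    print(bonferroni_reject([0.001, 0.02, 0.04, 0.3]))
```

The per-test cutoff shrinks in proportion to the number of tests, which is what makes the correction so conservative for thousands of proteins or genes.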
![Page 39: Some statistical concepts relevant to proteomics data analysis](https://reader035.vdocuments.us/reader035/viewer/2022062701/55424c795503468d0a8b4631/html5/thumbnails/39.jpg)
Multiple Testing
False Discovery Rate (FDR)
FDR = E[V/R]

| | Number declared non-significant | Number declared significant | Total |
|---|---|---|---|
| True null hypotheses | U | V | m0 |
| False null hypotheses | T | S | m - m0 |
| Total | m - R | R | m |

(Benjamini and Hochberg, 1995)
![Page 40: Some statistical concepts relevant to proteomics data analysis](https://reader035.vdocuments.us/reader035/viewer/2022062701/55424c795503468d0a8b4631/html5/thumbnails/40.jpg)
Multiple Testing
False Discovery Rate (FDR)
FDR = E[V/R] (control this)
FWER = P(V ≥ 1) (not this)
(Benjamini and Hochberg, 1995)
![Page 41: Some statistical concepts relevant to proteomics data analysis](https://reader035.vdocuments.us/reader035/viewer/2022062701/55424c795503468d0a8b4631/html5/thumbnails/41.jpg)
Multiple Testing
• False Discovery Rate-controlling procedure (Benjamini and Hochberg, 1995):
1. Sort p-values from smallest to largest (1 to m), let k be the rank
2. Select a desired FDR α
3. Find the largest rank k’ where P(k) ≤ (k/m)*α
4. Null hypotheses 1 through k’ are rejected
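The four steps above translate almost line for line into code. The following is a minimal sketch with a hypothetical function name; it returns a reject/keep decision for each p-value in its original order.

```python
def benjamini_hochberg(p_values, alpha=0.1):
    """Benjamini-Hochberg step-up procedure; returns a list of booleans."""
    m = len(p_values)
    # Step 1: rank the p-values from smallest to largest.
    order = sorted(range(m), key=lambda i: p_values[i])
    # Step 3: find the largest rank k' with P(k) <= (k / m) * alpha.
    k_max = 0
    for rank, i in enumerate(order, start=1):
        if p_values[i] <= rank / m * alpha:
            k_max = rank
    # Step 4: reject the hypotheses with ranks 1 through k'.
    rejected = [False] * m
    for i in order[:k_max]:
        rejected[i] = True
    return rejected


if __name__ == "__main__":
    print(benjamini_hochberg([0.02, 0.021, 0.03, 0.5], alpha=0.05))
```

In the usage example the smallest p-value (0.02) misses its own cutoff of 0.0125, yet it is still rejected because a larger rank passes; that is the effect of taking the largest qualifying rank rather than stopping at the first failure.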
![Page 42: Some statistical concepts relevant to proteomics data analysis](https://reader035.vdocuments.us/reader035/viewer/2022062701/55424c795503468d0a8b4631/html5/thumbnails/42.jpg)
Multiple Testing
• Note that the protein with the smallest p-value is still tested using α/m (like Bonferroni)
• Significance cutoff gets less stringent as the rank increases
• The number of proteins included in the analysis still matters
• Filtering can help (but don’t filter based on treatment/group membership)
![Page 43: Some statistical concepts relevant to proteomics data analysis](https://reader035.vdocuments.us/reader035/viewer/2022062701/55424c795503468d0a8b4631/html5/thumbnails/43.jpg)
Multiple Testing Example (Revisited)
• Recall example of testing differential expression between 2 pairs of replicates in a microarray experiment
• No genes are differentially expressed at FDR-level 0.1
![Page 44: Some statistical concepts relevant to proteomics data analysis](https://reader035.vdocuments.us/reader035/viewer/2022062701/55424c795503468d0a8b4631/html5/thumbnails/44.jpg)
Graphics
• Displaying data from individual proteins
– Barplots
– Boxplots
– Dotplots
• Displaying data from multiple proteins
– Multidimensional scaling plots
– Hierarchical clustering
– Heatmaps
![Page 45: Some statistical concepts relevant to proteomics data analysis](https://reader035.vdocuments.us/reader035/viewer/2022062701/55424c795503468d0a8b4631/html5/thumbnails/45.jpg)
Barplots
[Figure: example barplot; bar height = mean, error bar = mean + 1 standard error]
![Page 46: Some statistical concepts relevant to proteomics data analysis](https://reader035.vdocuments.us/reader035/viewer/2022062701/55424c795503468d0a8b4631/html5/thumbnails/46.jpg)
Barplots
• Shows mean and standard error of mean (or 95% CI)
• Poor information-to-ink ratio
• Can be misleading for skewed data
• Commonly used, easily interpreted
![Page 47: Some statistical concepts relevant to proteomics data analysis](https://reader035.vdocuments.us/reader035/viewer/2022062701/55424c795503468d0a8b4631/html5/thumbnails/47.jpg)
Boxplots
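[Figure: example boxplot with annotations]
• Center line: median
• Top of box: 75th percentile
• Bottom of box: 25th percentile
• Upper whisker: largest data point that is less than 1.5 IQR from edge of box
• Lower whisker: smallest data point that is less than 1.5 IQR from edge of box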
![Page 48: Some statistical concepts relevant to proteomics data analysis](https://reader035.vdocuments.us/reader035/viewer/2022062701/55424c795503468d0a8b4631/html5/thumbnails/48.jpg)
Boxplots
• Non-parametric data display
• Lots of information given
• Less commonly used than barplots, may require explanation
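Since boxplots may need explaining, it can help to compute the plotted quantities by hand. The sketch below derives the box, the whiskers, and any points beyond them from the 1.5 IQR rule. The function name is hypothetical, and the quartile convention (`method="inclusive"` in Python's `statistics.quantiles`) is an assumption; plotting software varies in how it interpolates quartiles.

```python
from statistics import quantiles


def boxplot_stats(values):
    """Quartiles, whisker ends, and outliers under the 1.5 IQR rule."""
    q1, median, q3 = quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    lower_fence, upper_fence = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    inside = [v for v in values if lower_fence <= v <= upper_fence]
    return {
        "q1": q1,
        "median": median,
        "q3": q3,
        "whisker_low": min(inside),    # smallest point within 1.5 IQR of box
        "whisker_high": max(inside),   # largest point within 1.5 IQR of box
        "outliers": sorted(v for v in values
                           if v < lower_fence or v > upper_fence),
    }


if __name__ == "__main__":
    print(boxplot_stats([1, 2, 3, 4, 5, 6, 7, 8, 9, 50]))
```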
![Page 49: Some statistical concepts relevant to proteomics data analysis](https://reader035.vdocuments.us/reader035/viewer/2022062701/55424c795503468d0a8b4631/html5/thumbnails/49.jpg)
Dotplots
[Figure: example dotplot; horizontal line = mean, points = actual data]
![Page 50: Some statistical concepts relevant to proteomics data analysis](https://reader035.vdocuments.us/reader035/viewer/2022062701/55424c795503468d0a8b4631/html5/thumbnails/50.jpg)
Dotplots
• Great way to display small data sets (n < 10)
• Shows mean, all data points
• Unwieldy for larger sample sizes
• Beware of overlapping points
![Page 51: Some statistical concepts relevant to proteomics data analysis](https://reader035.vdocuments.us/reader035/viewer/2022062701/55424c795503468d0a8b4631/html5/thumbnails/51.jpg)
Multidimensional Scaling Plots
• Distance matrix = all pairwise distances between samples
• MDS takes distance matrix, recreates data in two dimensions while preserving distances
• Useful diagnostic plot
• PCA is special case of MDS
• Many ways to define distance
![Page 52: Some statistical concepts relevant to proteomics data analysis](https://reader035.vdocuments.us/reader035/viewer/2022062701/55424c795503468d0a8b4631/html5/thumbnails/52.jpg)
A “Good” MDS Plot
http://statlab.bio5.org/foswiki/pub/Main/RBioconductorWorkshop2012/Day6_demo.pdf
![Page 53: Some statistical concepts relevant to proteomics data analysis](https://reader035.vdocuments.us/reader035/viewer/2022062701/55424c795503468d0a8b4631/html5/thumbnails/53.jpg)
A “Bad” MDS Plot
![Page 54: Some statistical concepts relevant to proteomics data analysis](https://reader035.vdocuments.us/reader035/viewer/2022062701/55424c795503468d0a8b4631/html5/thumbnails/54.jpg)
Hierarchical Clustering
• Hierarchical clustering starts by treating each sample as its own cluster
• The “closest” clusters are merged successively until only one cluster remains
• Produces tree with series of nested clusterings rather than one set of clusters
• Plots of these trees are called “dendrograms”
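The merging process described above can be sketched in pure Python. Two choices here are assumptions, since the slides do not specify them: Euclidean distance between samples, and single linkage (the distance between two clusters is the distance between their closest members). The function names are hypothetical, and the brute-force search is meant for illustration only, not for large data sets.

```python
import math


def euclidean(a, b):
    """Euclidean distance between two samples (sequences of numbers)."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def single_linkage(samples):
    """Return the merge history as a list of (cluster_a, cluster_b, distance).

    Clusters are tuples of sample indices; the history defines the dendrogram.
    """
    clusters = [(i,) for i in range(len(samples))]  # each sample on its own
    merges = []
    while len(clusters) > 1:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                d = min(euclidean(samples[a], samples[b])
                        for a in clusters[i] for b in clusters[j])
                if best is None or d < best[0]:
                    best = (d, i, j)
        d, i, j = best
        merges.append((clusters[i], clusters[j], d))
        merged = clusters[i] + clusters[j]
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)]
        clusters.append(merged)
    return merges


if __name__ == "__main__":
    samples = [(0, 0), (0, 1), (5, 5), (5, 6), (20, 20)]
    for a, b, d in single_linkage(samples):
        print(a, b, round(d, 2))
```

Each entry in the merge history corresponds to one junction in the dendrogram, with the merge distance giving its height.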
![Page 55: Some statistical concepts relevant to proteomics data analysis](https://reader035.vdocuments.us/reader035/viewer/2022062701/55424c795503468d0a8b4631/html5/thumbnails/55.jpg)
Hierarchical Clustering
![Page 56: Some statistical concepts relevant to proteomics data analysis](https://reader035.vdocuments.us/reader035/viewer/2022062701/55424c795503468d0a8b4631/html5/thumbnails/56.jpg)
Heat Maps
• Data are plotted with color corresponding to numeric value
• Dendrograms of rows (genes) and columns (samples) displayed on sides
• Rows/columns are reordered by their means; this tends to create blocks of color
![Page 57: Some statistical concepts relevant to proteomics data analysis](https://reader035.vdocuments.us/reader035/viewer/2022062701/55424c795503468d0a8b4631/html5/thumbnails/57.jpg)
Conclusions
• Take p-values with a grain of salt
– Not significant ≠ no difference
• Be aware of multiple testing issues
– Use FDR adjustment when doing thousands of tests
• Good experimental design is just as important in ‘omics as anywhere else