Canadian Bioinformatics Workshops
DESCRIPTION
Canadian Bioinformatics Workshops. www.bioinformatics.ca. Lecture 2: Univariate Analyses: Continuous Data. MBP1010. Dr. Paul C. Boutros, Winter 2014.
TRANSCRIPT
![Page 1: Canadian Bioinformatics Workshops](https://reader035.vdocuments.us/reader035/viewer/2022062301/568147c0550346895db5031a/html5/thumbnails/1.jpg)
Canadian Bioinformatics Workshops
www.bioinformatics.ca
![Page 2: Canadian Bioinformatics Workshops](https://reader035.vdocuments.us/reader035/viewer/2022062301/568147c0550346895db5031a/html5/thumbnails/2.jpg)
![Page 3: Canadian Bioinformatics Workshops](https://reader035.vdocuments.us/reader035/viewer/2022062301/568147c0550346895db5031a/html5/thumbnails/3.jpg)
Lecture 2
Univariate Analyses: Continuous Data
MBP1010
Dr. Paul C. Boutros
Winter 2014
Department of Medical Biophysics
This workshop includes material originally developed by Drs. Raphael Gottardo, Sohrab Shah, Boris Steipe and others.
† Aegeus, King of Athens, consulting the Delphic Oracle. High Classical (~430 BCE)
![Page 4: Canadian Bioinformatics Workshops](https://reader035.vdocuments.us/reader035/viewer/2022062301/568147c0550346895db5031a/html5/thumbnails/4.jpg)
Lecture 2: Univariate Analyses I: Continuous Data bioinformatics.ca
Course Overview
• Lecture 1: What is Statistics? Introduction to R
• Lecture 2: Univariate Analyses I: continuous
• Lecture 3: Univariate Analyses II: discrete
• Lecture 4: Multivariate Analyses I: specialized models
• Lecture 5: Multivariate Analyses II: general models
• Lecture 6: Sequence Analysis
• Lecture 7: Microarray Analysis I: Pre-Processing
• Lecture 8: Microarray Analysis II: Multiple-Testing
• Lecture 9: Machine-Learning
• Final Exam (written)
![Page 5: Canadian Bioinformatics Workshops](https://reader035.vdocuments.us/reader035/viewer/2022062301/568147c0550346895db5031a/html5/thumbnails/5.jpg)
How Will You Be Graded?
• 9% Participation: 1% per week
• 56% Assignments: 8 x 7% each
• 35% Final Examination: in-class
• Each individual will get their own, unique assignment
• Assignments will all be in R, and will be graded according to computational correctness only (i.e. does your R script yield the correct result when run)
• Final Exam will include multiple-choice and written answers
![Page 6: Canadian Bioinformatics Workshops](https://reader035.vdocuments.us/reader035/viewer/2022062301/568147c0550346895db5031a/html5/thumbnails/6.jpg)
Course Information Updates
• Website will have up-to-date information, lecture notes, sample source-code from class, etc.
• http://medbio.utoronto.ca/students/courses/mbp1010/mbp_1010.html
• Tutorials are Thursdays 13:00-15:00 in 4-204 TMDT
• Next week we will be switching lecture and tutorial:
• Tutorial: January 20
• Lecture: January 23
• Assignment #1 was delayed because of registration issues
• Email [email protected] with your student ID and we will email back your personal assignment
![Page 7: Canadian Bioinformatics Workshops](https://reader035.vdocuments.us/reader035/viewer/2022062301/568147c0550346895db5031a/html5/thumbnails/7.jpg)
House Rules
• Cell phones to silent
• No side conversations
• Hands up for questions
• Others?
![Page 8: Canadian Bioinformatics Workshops](https://reader035.vdocuments.us/reader035/viewer/2022062301/568147c0550346895db5031a/html5/thumbnails/8.jpg)
Review From Last Week
Population vs. Sample
All MBP Students = Population
MBP Students in 1010 = Sample
How do you report statistical information?
P-value, variance, effect-size, sample-size, test
Why don’t we use Excel/spreadsheets?
Spreadsheet errors, reproducibility, wrong results
![Page 9: Canadian Bioinformatics Workshops](https://reader035.vdocuments.us/reader035/viewer/2022062301/568147c0550346895db5031a/html5/thumbnails/9.jpg)
Topics For This Week
• Introduction to continuous data & probability distributions
• Slightly boring, but necessary!
• Attendance
• Common continuous univariate analyses
• Correlations
• ceRNAs
![Page 10: Canadian Bioinformatics Workshops](https://reader035.vdocuments.us/reader035/viewer/2022062301/568147c0550346895db5031a/html5/thumbnails/10.jpg)
Continuous vs. Discrete Data
• Definitions?
• Examples of discrete data in biological studies?
• Why does it matter in the first place?
• Areas of discrete mathematics:
• Combinatorics
• Graph Theory
• Discrete Probability Theory (Dice, Cards)
• Number Theory
![Page 11: Canadian Bioinformatics Workshops](https://reader035.vdocuments.us/reader035/viewer/2022062301/568147c0550346895db5031a/html5/thumbnails/11.jpg)
Exploring Data
• When teaching (or learning new procedures) we usually prefer to work with synthetic data.
• Synthetic data has the advantage that we know what the outcome of the analysis should be.
• Typically one would create values according to a function and then add noise.
• R has several functions to create sequences of values – or you can write your own ...
0:10;
seq(0, pi, 5*pi/180);
rep(1:3, each=3, times=2);
for (i in 1:10) { print(i*i); }
![Page 12: Canadian Bioinformatics Workshops](https://reader035.vdocuments.us/reader035/viewer/2022062301/568147c0550346895db5031a/html5/thumbnails/12.jpg)
Synthetic data
Explore functions and noise.
Function ...
Noise ...
Noisy Function ...
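The function-plus-noise recipe can be sketched as follows. This is a minimal illustration, not the code used in class: the sine function, the noise level `sd = 0.2` and the seed are arbitrary assumptions.

```r
set.seed(1)                                    # arbitrary seed, for reproducibility only
x <- seq(0, 2 * pi, length.out = 100)          # evenly spaced input values
f <- sin(x)                                    # the noise-free function (illustrative choice)
noise <- rnorm(length(x), mean = 0, sd = 0.2)  # Gaussian noise (sd chosen arbitrarily)
y <- f + noise                                 # the noisy function

plot(x, y, xlab = "x", ylab = "y")             # noisy observations as points
lines(x, f, col = 2, lwd = 2)                  # underlying function in red
```

Because the true function is known, any later analysis of `y` can be checked against `f`.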
![Page 13: Canadian Bioinformatics Workshops](https://reader035.vdocuments.us/reader035/viewer/2022062301/568147c0550346895db5031a/html5/thumbnails/13.jpg)
Probability Distributions
Normal distribution $N(\mu, \sigma^2)$: $\mu$ is the mean and $\sigma^2$ is the variance.
Extremely important because of the Central Limit Theorem: if a random variable is the sum of a large number of small, independent random variables, it will be approximately normally distributed.
x <- seq(-4, 4, 0.1)
f <- dnorm(x, mean=0, sd=1)
plot(x, f, xlab="x", ylab="density", lwd=5, type="l")
The area under the curve between 0 and 2 is the probability of observing a value between 0 and 2.
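The area between 0 and 2 can be computed directly. The sketch below uses the cumulative distribution function `pnorm()` and, as a cross-check, numerical integration of the density with `integrate()`; both are base R functions.

```r
# Probability that a standard normal value falls between 0 and 2
p_cdf <- pnorm(2, mean = 0, sd = 1) - pnorm(0, mean = 0, sd = 1)

# Same quantity by numerically integrating the density over [0, 2]
p_int <- integrate(dnorm, lower = 0, upper = 2)$value

print(c(cdf = p_cdf, integral = p_int))
```

Both approaches give roughly 0.477, i.e. a little under half of all values.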
![Page 14: Canadian Bioinformatics Workshops](https://reader035.vdocuments.us/reader035/viewer/2022062301/568147c0550346895db5031a/html5/thumbnails/14.jpg)
Probability Distributions
Task: Explore line parameters.
![Page 15: Canadian Bioinformatics Workshops](https://reader035.vdocuments.us/reader035/viewer/2022062301/568147c0550346895db5031a/html5/thumbnails/15.jpg)
Probability Distributions
Random sampling: Generate 100 observations from a N(0,1)
set.seed(100)
x <- rnorm(100, mean=0, sd=1)
hist(x)
lines(seq(-3,3,0.1), 50*dnorm(seq(-3,3,0.1)), col="red")
Histograms can be used to estimate densities!
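An alternative to rescaling the density curve by hand is to draw the histogram on the density scale. This is a sketch of that variant, not the slide's original code; `freq = FALSE` is a standard argument of `hist()`.

```r
set.seed(100)
x <- rnorm(100, mean = 0, sd = 1)

h <- hist(x, freq = FALSE)               # y-axis is density, so the bar areas sum to 1
grid <- seq(-3, 3, 0.1)
lines(grid, dnorm(grid), col = "red")    # theoretical N(0,1) density, no rescaling needed

total_area <- sum(h$density * diff(h$breaks))   # total area of the bars
print(total_area)
```

On this scale the histogram and the theoretical curve are directly comparable, whatever the sample size or bin width.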
![Page 16: Canadian Bioinformatics Workshops](https://reader035.vdocuments.us/reader035/viewer/2022062301/568147c0550346895db5031a/html5/thumbnails/16.jpg)
Quantiles
(Theoretical) Quantiles:
The p-quantile has the property that there is a probability p of getting a value less than or equal to it.
The 50% quantile is called the median.
90% of the probability (area under the curve) is to the left of the red vertical line.
q90 <- qnorm(0.90, mean = 0, sd = 1);
x <- seq(-4, 4, 0.1);
f <- dnorm(x, mean=0, sd=1);
plot(x, f, xlab="x", ylab="density", type="l", lwd=5);
abline(v=q90, col=2, lwd=5);
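The quantile function `qnorm()` and the cumulative distribution function `pnorm()` are inverses of each other, which a short check makes concrete:

```r
q90 <- qnorm(0.90, mean = 0, sd = 1)   # value with 90% of the probability to its left
p   <- pnorm(q90, mean = 0, sd = 1)    # maps the quantile back to a probability
med <- qnorm(0.50)                     # 50% quantile (median) of N(0,1)

print(c(q90 = q90, p = p, median = med))
```

For the standard normal distribution the 90% quantile is about 1.28 and the median is 0.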
![Page 17: Canadian Bioinformatics Workshops](https://reader035.vdocuments.us/reader035/viewer/2022062301/568147c0550346895db5031a/html5/thumbnails/17.jpg)
Descriptive Statistics
Empirical Quantiles:
The p-quantile has the property that a fraction p of the observations are less than or equal to it.
Empirical quantiles can be easily obtained in R.
> set.seed(100);
> x <- rnorm(100, mean=0, sd=1);
> quantile(x);
        0%        25%        50%        75%       100%
-2.2719255 -0.6088466 -0.0594199  0.6558911  2.5819589
> quantile(x, probs=c(0.1, 0.2, 0.9));
       10%        20%        90%
-1.1744996 -0.8267067  1.3834892
![Page 18: Canadian Bioinformatics Workshops](https://reader035.vdocuments.us/reader035/viewer/2022062301/568147c0550346895db5031a/html5/thumbnails/18.jpg)
Descriptive Statistics
We often need to quickly 'quantify' a data set, and this can be done using a set of summary statistics (mean, median, variance, standard deviation).
> mean(x);
[1] 0.002912563
> median(x);
[1] -0.0594199
> IQR(x);
[1] 1.264738
> var(x);
[1] 1.04185
> summary(x);
     Min.   1st Qu.    Median      Mean   3rd Qu.      Max.
-2.272000 -0.608800 -0.059420  0.002913  0.655900  2.582000
Exercise: what are the units of variance and standard deviation?
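One way to approach the exercise is to change the measurement unit and see how each statistic responds. The heights below are made-up values used only for illustration.

```r
height_m  <- c(1.62, 1.75, 1.80, 1.68, 1.71)  # made-up heights in metres
height_cm <- height_m * 100                   # the same data in centimetres

var_ratio <- var(height_cm) / var(height_m)   # variance scales with the square of the unit
sd_ratio  <- sd(height_cm) / sd(height_m)     # standard deviation scales with the unit itself

print(c(var_ratio = var_ratio, sd_ratio = sd_ratio))
```

A factor of 100 in the data becomes a factor of 10000 in the variance, which indicates that variance is expressed in squared units while the standard deviation keeps the units of the data.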
![Page 19: Canadian Bioinformatics Workshops](https://reader035.vdocuments.us/reader035/viewer/2022062301/568147c0550346895db5031a/html5/thumbnails/19.jpg)
Boxplot
Descriptive statistics can be intuitively summarized in a boxplot.
> boxplot(x)
Figure labels: 75% quantile, median, 25% quantile; the box spans the IQR; the whiskers extend up to 1.5 x IQR beyond the box.
Points more than 1.5 x IQR above the 75% quantile or below the 25% quantile are considered "outliers".
IQR = interquartile range = 75% quantile – 25% quantile
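The 1.5 x IQR rule can be applied by hand. This sketch uses `quantile()` with R's default interpolation; `boxplot()` itself computes the box from hinges (`fivenum()`), so its limits can differ slightly on small samples.

```r
set.seed(100)
x <- rnorm(100, mean = 0, sd = 1)

q   <- quantile(x, probs = c(0.25, 0.75))   # 25% and 75% quantiles
iqr <- unname(q[2] - q[1])                  # interquartile range
lower_fence <- unname(q[1]) - 1.5 * iqr     # anything below this is flagged
upper_fence <- unname(q[2]) + 1.5 * iqr     # anything above this is flagged

outliers <- x[x < lower_fence | x > upper_fence]
print(outliers)
```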
![Page 20: Canadian Bioinformatics Workshops](https://reader035.vdocuments.us/reader035/viewer/2022062301/568147c0550346895db5031a/html5/thumbnails/20.jpg)
Violinplot
Internal structure of a data-vector can be made visible in a violin plot. The principle is the same as for a boxplot, but a width is calculated from a smoothed histogram.
library(ggplot2)
X <- data.frame(x = x)   # assumption: X is a data frame holding the vector x
p <- ggplot(X, aes(1, x))
p + geom_violin()
![Page 21: Canadian Bioinformatics Workshops](https://reader035.vdocuments.us/reader035/viewer/2022062301/568147c0550346895db5031a/html5/thumbnails/21.jpg)
Plotting data in R
Task: Explore types of plots.
![Page 22: Canadian Bioinformatics Workshops](https://reader035.vdocuments.us/reader035/viewer/2022062301/568147c0550346895db5031a/html5/thumbnails/22.jpg)
QQ–plot
One of the first things we may ask about data is whether it deviates from an expectation, e.g. to be normally distributed.
The quantile-quantile plot provides a way to visually verify this.
The QQ-plot shows the theoretical quantiles versus the empirical quantiles. If the distribution assumed (theoretical one) is indeed the correct one, we should observe a straight line.
R provides qqnorm() and qqplot().
![Page 23: Canadian Bioinformatics Workshops](https://reader035.vdocuments.us/reader035/viewer/2022062301/568147c0550346895db5031a/html5/thumbnails/23.jpg)
QQ–plot: sample vs. Normal
Only valid for the normal distribution!
qqnorm(x)
qqline(x, col=2)
![Page 24: Canadian Bioinformatics Workshops](https://reader035.vdocuments.us/reader035/viewer/2022062301/568147c0550346895db5031a/html5/thumbnails/24.jpg)
QQ–plot: sample vs. Normal
Clearly the t distribution with two degrees of freedom is not Normal.
set.seed(100)
t <- rt(100, df=2)
qqnorm(t)
qqline(t, col=2)
![Page 25: Canadian Bioinformatics Workshops](https://reader035.vdocuments.us/reader035/viewer/2022062301/568147c0550346895db5031a/html5/thumbnails/25.jpg)
QQ–plot
Verify the CLT.
set.seed(101)
generateVariates <- function(n) {
  Nvar <- 10000                      # number of uniform variates summed per output value
  Vout <- numeric(n)                 # preallocate the result vector
  for (i in 1:n) {
    x <- runif(Nvar, -0.01, 0.01)
    Vout[i] <- sum(x)                # each output value is a sum of many small variates
  }
  return(Vout)
}
x <- generateVariates(1000)
qqnorm(x)
qqline(x, col=2)                     # qqline() takes a single sample
![Page 26: Canadian Bioinformatics Workshops](https://reader035.vdocuments.us/reader035/viewer/2022062301/568147c0550346895db5031a/html5/thumbnails/26.jpg)
QQ–plot: sample vs. sample
Comparing two samples: are their distributions the same?
... or ...
compare a sample vs. a synthetic dataset.
set.seed(100)
x <- rt(100, df=2)
y <- rnorm(100, mean=0, sd=1)
qqplot(x, y)
Exercise: try different values of df for rt() and compare the vectors.
![Page 27: Canadian Bioinformatics Workshops](https://reader035.vdocuments.us/reader035/viewer/2022062301/568147c0550346895db5031a/html5/thumbnails/27.jpg)
Boxplots
The boxplot function can be used to display several variables at a time.
boxplot(gvhdCD3p)
Exercise: Interpret this plot.
![Page 28: Canadian Bioinformatics Workshops](https://reader035.vdocuments.us/reader035/viewer/2022062301/568147c0550346895db5031a/html5/thumbnails/28.jpg)
Attendance Break
![Page 29: Canadian Bioinformatics Workshops](https://reader035.vdocuments.us/reader035/viewer/2022062301/568147c0550346895db5031a/html5/thumbnails/29.jpg)
Hypothesis Testing
Hypothesis testing is confirmatory data analysis, in contrast to exploratory data analysis.
Concepts:
Null and Alternative Hypothesis
Region of acceptance / rejection and critical value
Error types
p-value
Significance level
Power of a test (1 - false negative rate)
![Page 30: Canadian Bioinformatics Workshops](https://reader035.vdocuments.us/reader035/viewer/2022062301/568147c0550346895db5031a/html5/thumbnails/30.jpg)
Null Hypothesis / Alternative Hypothesis
The null hypothesis H0 states that nothing of consequence is apparent in the data distribution. The data corresponds to our expectation. We learn nothing new.
The alternative hypothesis H1 states that some effect is apparent in the data distribution. The data is different from our expectation. We need to account for something new. Not in all cases will this result in a new model, but a new model always begins with the observation that the old model is inadequate.
Don’t think about this too much!
![Page 31: Canadian Bioinformatics Workshops](https://reader035.vdocuments.us/reader035/viewer/2022062301/568147c0550346895db5031a/html5/thumbnails/31.jpg)
Test types
... common types of tests
A Z–test compares a sample mean with a normal distribution.
A t–test compares a sample mean with a t-distribution and thus relaxes the requirements on normality for the sample.
Chi–squared tests analyze whether samples are drawn from the same distribution.
F-tests analyze the variance of populations (ANOVA).
Nonparametric tests can be applied if we have no reasonable model from which to derive a distribution for the null hypothesis.
![Page 32: Canadian Bioinformatics Workshops](https://reader035.vdocuments.us/reader035/viewer/2022062301/568147c0550346895db5031a/html5/thumbnails/32.jpg)
Error Types

| Decision | Truth: H0 | Truth: H1 |
|---|---|---|
| Accept H0 | 1 - α | β ("False negative", "Type II error") |
| Reject H0 | α ("False positive", "Type I error") | 1 - β (“Power”, “Sensitivity”) |
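The false-positive rate can be checked by simulation: when both groups are drawn from the same distribution, H0 is true, so every rejection is a Type I error. The sketch below is illustrative only; the seed, group size and number of simulations are arbitrary assumptions.

```r
set.seed(1)                      # arbitrary seed
alpha  <- 0.05                   # significance level
n_sims <- 2000                   # number of simulated experiments (arbitrary)

p_values <- replicate(n_sims, {
  a <- rnorm(20)                 # both groups come from the same N(0,1),
  b <- rnorm(20)                 # so the null hypothesis is true
  t.test(a, b)$p.value
})

type1_rate <- mean(p_values < alpha)   # fraction of (false) rejections
print(type1_rate)
```

The observed rate should land close to the chosen significance level.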
![Page 33: Canadian Bioinformatics Workshops](https://reader035.vdocuments.us/reader035/viewer/2022062301/568147c0550346895db5031a/html5/thumbnails/33.jpg)
What is a p–value?
a) A measure of how much evidence we have against the alternative hypothesis.
b) The probability of making an error.
c) Something that biologists want to be below 0.05.
d) The probability of observing a value as extreme or more extreme by chance alone.
e) All of the above.
![Page 34: Canadian Bioinformatics Workshops](https://reader035.vdocuments.us/reader035/viewer/2022062301/568147c0550346895db5031a/html5/thumbnails/34.jpg)
Distributional Assumptions
• A parametric test makes assumptions about the underlying distribution of the data.
• A non-parametric test makes no assumptions about the underlying distribution, but may make other assumptions!
![Page 35: Canadian Bioinformatics Workshops](https://reader035.vdocuments.us/reader035/viewer/2022062301/568147c0550346895db5031a/html5/thumbnails/35.jpg)
Most Common Statistical Test: The T-Test
A Z–test compares a sample mean with a normal distribution.
A t–test compares a sample mean with a t-distribution and thus relaxes the requirements on normality for the sample.
Nonparametric tests can be applied if we have no reasonable model from which to derive a distribution for the null hypothesis.
One-Sample vs. Two-Sample
One-Sided vs. Two-Sided
Heteroscedastic vs. Homoscedastic
![Page 36: Canadian Bioinformatics Workshops](https://reader035.vdocuments.us/reader035/viewer/2022062301/568147c0550346895db5031a/html5/thumbnails/36.jpg)
Two-Sample t–test
Test if the means of two distributions are the same.
The data $y_{i1}, \ldots, y_{in}$ are independent and normally distributed with mean $\mu_i$ and variance $\sigma^2$, $N(\mu_i, \sigma^2)$, where $i = 1, 2$.
In addition, we assume that the data in the two groups are independent and that the variance is the same.
![Page 37: Canadian Bioinformatics Workshops](https://reader035.vdocuments.us/reader035/viewer/2022062301/568147c0550346895db5031a/html5/thumbnails/37.jpg)
two–sample t–test
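The formula on this slide did not survive extraction. Assuming it showed the standard pooled-variance statistic, $t = (\bar{y}_1 - \bar{y}_2) / (s_p \sqrt{1/n_1 + 1/n_2})$ with $s_p^2 = ((n_1 - 1) s_1^2 + (n_2 - 1) s_2^2) / (n_1 + n_2 - 2)$, the sketch below computes it by hand and compares it with `t.test(..., var.equal = TRUE)`. The simulated data and seed are illustrative assumptions.

```r
set.seed(1)                               # arbitrary seed
y1 <- rnorm(25, mean = 10, sd = 1)        # group 1 (simulated)
y2 <- rnorm(25, mean = 11, sd = 1)        # group 2 (simulated)

n1 <- length(y1)
n2 <- length(y2)
# Pooled variance estimate under the equal-variance assumption
sp2 <- ((n1 - 1) * var(y1) + (n2 - 1) * var(y2)) / (n1 + n2 - 2)
# Difference in means divided by its standard error
t_manual <- (mean(y1) - mean(y2)) / sqrt(sp2 * (1 / n1 + 1 / n2))

tt <- t.test(y1, y2, var.equal = TRUE)    # classical two-sample t-test
print(c(manual = t_manual, t.test = unname(tt$statistic)))
```

The statistic is compared against a t-distribution with $n_1 + n_2 - 2$ degrees of freedom.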
![Page 38: Canadian Bioinformatics Workshops](https://reader035.vdocuments.us/reader035/viewer/2022062301/568147c0550346895db5031a/html5/thumbnails/38.jpg)
t–test assumptions
Normality: The data need to be sampled from a normal distribution. If not, one can use a transformation or a non-parametric test. If the sample size is large enough (n>30), the t-test will work just fine (CLT).
Independence: Usually satisfied. If not independent, more complex modeling is required.
Independence between groups: In the two-sample t-test, the groups need to be independent. If not, one can sometimes use a paired t-test instead.
Equal variances: If the variances are not equal in the two groups, use Welch's t-test (default in R).
How Do We Test These?
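One possible set of checks is sketched below. The Shapiro-Wilk test (`shapiro.test()`) is only one common choice for assessing normality and is an assumption here, not something the slides name; the F-test for equal variances is `var.test()`. The data are simulated for illustration.

```r
set.seed(1)                                   # arbitrary seed
a <- rnorm(30, mean = 10, sd = 1)             # simulated group a
b <- rnorm(30, mean = 11, sd = 2)             # simulated group b, larger spread

norm_a <- shapiro.test(a)                     # normality check, one group at a time
norm_b <- shapiro.test(b)
qqnorm(a); qqline(a, col = 2)                 # visual normality check
var_ab <- var.test(a, b)                      # F-test for equal variances

welch  <- t.test(a, b)                        # Welch's t-test (R default)
pooled <- t.test(a, b, var.equal = TRUE)      # equal-variance (homoscedastic) t-test

print(c(shapiro_a = norm_a$p.value,
        shapiro_b = norm_b$p.value,
        f_test    = var_ab$p.value))
```

Independence usually cannot be tested from the numbers alone; it follows from how the experiment was designed.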
![Page 39: Canadian Bioinformatics Workshops](https://reader035.vdocuments.us/reader035/viewer/2022062301/568147c0550346895db5031a/html5/thumbnails/39.jpg)
Non–parametric tests
Non-parametric tests constitute a flexible alternative to t-tests if you don't have a model of the distribution.
In cases where a parametric test would be appropriate, non-parametric tests have less power.
Several non-parametric alternatives exist, e.g. the Wilcoxon and Mann-Whitney tests.
![Page 40: Canadian Bioinformatics Workshops](https://reader035.vdocuments.us/reader035/viewer/2022062301/568147c0550346895db5031a/html5/thumbnails/40.jpg)
Wilcoxon test principle
Consider two random distributions with 25 samples each and slightly different means.
set.seed(53)
n <- 25
M <- matrix(nrow = n+n, ncol=2)
for (i in 1:n) {
  M[i,1] <- rnorm(1, 10, 1)
  M[i,2] <- 1
  M[i+n,1] <- rnorm(1, 11, 1)
  M[i+n,2] <- 2
}
plot(M[,1], col=M[,2])
![Page 41: Canadian Bioinformatics Workshops](https://reader035.vdocuments.us/reader035/viewer/2022062301/568147c0550346895db5031a/html5/thumbnails/41.jpg)
Wilcoxon test principle
o <- order(M[,1])
plot(M[o,1], col=M[o,2])
For each observation in a, count the number of observations in b that have a smaller rank.
The sum of these counts is the test statistic.
wilcox.test(M[1:n,1], M[(1:n)+n,1])
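The counting rule can be written out directly and compared with the `W` statistic reported by `wilcox.test()`. This sketch draws its own two samples, so the values differ from those in the matrix `M` above.

```r
set.seed(53)
n <- 25
a <- rnorm(n, mean = 10, sd = 1)    # first sample
b <- rnorm(n, mean = 11, sd = 1)    # second sample, slightly higher mean

# For every pair (a_i, b_j), record whether a_i is larger than b_j, then sum
W_manual <- sum(outer(a, b, ">"))

W_r <- unname(wilcox.test(a, b)$statistic)
print(c(manual = W_manual, wilcox.test = W_r))
```

With continuous data there are no ties, so the pair count and R's statistic agree exactly.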
![Page 42: Canadian Bioinformatics Workshops](https://reader035.vdocuments.us/reader035/viewer/2022062301/568147c0550346895db5031a/html5/thumbnails/42.jpg)
Flow-Chart For Two-Sample Tests
Is data sampled from a normally-distributed population?
• Yes: Equal variance (F-test)?
  • Yes: Homoscedastic t-test
  • No: Heteroscedastic t-test
• No: Sufficient n for CLT (>30)?
  • Yes: Equal variance (F-test)? (continue as above)
  • No: Wilcoxon U-test
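One reading of the flow chart can be written as a small decision function. The function name `choose_two_sample_test` and its arguments are hypothetical, introduced only for this illustration; in practice the three inputs would come from checks such as a QQ-plot and an F-test.

```r
# Hypothetical helper: returns the name of the test suggested by the flow chart
choose_two_sample_test <- function(is_normal, n, equal_variance) {
  # Not normal and too few observations for the CLT: fall back to a rank test
  if (!is_normal && n <= 30) {
    return("Wilcoxon U-test")
  }
  # Otherwise a t-test applies; the variance check picks the variant
  if (equal_variance) "homoscedastic t-test" else "heteroscedastic t-test"
}

print(choose_two_sample_test(is_normal = TRUE,  n = 10, equal_variance = TRUE))
print(choose_two_sample_test(is_normal = FALSE, n = 50, equal_variance = FALSE))
print(choose_two_sample_test(is_normal = FALSE, n = 10, equal_variance = TRUE))
```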
![Page 43: Canadian Bioinformatics Workshops](https://reader035.vdocuments.us/reader035/viewer/2022062301/568147c0550346895db5031a/html5/thumbnails/43.jpg)
Power, error rates and decision
Power calculation in R:
> power.t.test(n = 5, delta = 1, sd=2, alternative="two.sided", type="one.sample")
     One-sample t test power calculation
              n = 5
          delta = 1
             sd = 2
      sig.level = 0.05
          power = 0.1384528
    alternative = two.sided
Other tests are available – see ??power.
![Page 44: Canadian Bioinformatics Workshops](https://reader035.vdocuments.us/reader035/viewer/2022062301/568147c0550346895db5031a/html5/thumbnails/44.jpg)
Power, error rates and decision
Figure labels: μ0, μ1; Pr(False Positive) = Pr(Type I error); Pr(False Negative) = Pr(Type II error)
Let’s Try Some Power Analyses in R
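As a starting point, `power.t.test()` can also be run in the other direction: leave out `n`, supply the desired power, and it solves for the sample size. The 80% target below is a conventional choice and an assumption of this sketch.

```r
# Power for the design shown earlier: n = 5, effect size delta = 1, sd = 2
p5 <- power.t.test(n = 5, delta = 1, sd = 2,
                   alternative = "two.sided", type = "one.sample")

# Sample size needed to reach 80% power for the same effect size
need <- power.t.test(delta = 1, sd = 2, power = 0.8,
                     alternative = "two.sided", type = "one.sample")
n_required <- ceiling(need$n)      # round up to whole observations

print(c(power_at_n5 = p5$power, n_for_80pct = n_required))
```

Exactly one of `n`, `delta`, `sd`, `power` and `sig.level` must be left unspecified; that is the quantity the function computes.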
![Page 45: Canadian Bioinformatics Workshops](https://reader035.vdocuments.us/reader035/viewer/2022062301/568147c0550346895db5031a/html5/thumbnails/45.jpg)
Problem
When we measure more than one variable for each member of a population, a scatter plot may show us that the values are not completely independent: there is e.g. a trend for one variable to increase as the other increases.
Regression analyses the dependence.
Examples:
• Height vs. weight
• Gene dosage vs. expression level
• Survival analysis: probability of death vs. age
![Page 46: Canadian Bioinformatics Workshops](https://reader035.vdocuments.us/reader035/viewer/2022062301/568147c0550346895db5031a/html5/thumbnails/46.jpg)
Correlation
When one variable depends on the other, the variables are to some degree correlated.
(Note: correlation need not imply causality.)
In R, the function cov() measures covariance and cor() measures the Pearson coefficient of correlation (a normalized measure of covariance).
Pearson's coefficient of correlation values range from -1 to 1, with 0 indicating no correlation.
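The phrase "normalized measure of covariance" can be made concrete: dividing the covariance by the product of the two standard deviations gives the Pearson coefficient. The data below are simulated for illustration.

```r
set.seed(1)                         # arbitrary seed
x <- rnorm(50)
y <- 0.5 * x + rnorm(50)            # y depends partly on x (illustrative relationship)

r_manual  <- cov(x, y) / (sd(x) * sd(y))   # covariance normalized by both spreads
r_builtin <- cor(x, y)                     # Pearson coefficient from base R

print(c(manual = r_manual, builtin = r_builtin))
```

The normalization removes the units of x and y, which is why the coefficient is always between -1 and 1.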
![Page 47: Canadian Bioinformatics Workshops](https://reader035.vdocuments.us/reader035/viewer/2022062301/568147c0550346895db5031a/html5/thumbnails/47.jpg)
Pearson's Coefficient of Correlation
How to interpret the correlation coefficient:
Explore varying degrees of randomness ...
> x<-rnorm(50)
> r <- 0.99;
> y <- (r * x) + ((1-r) * rnorm(50));
> plot(x,y); cor(x,y)
[1] 0.9999666
![Page 48: Canadian Bioinformatics Workshops](https://reader035.vdocuments.us/reader035/viewer/2022062301/568147c0550346895db5031a/html5/thumbnails/48.jpg)
Pearson's Coefficient of Correlation
Varying degrees of randomness ...
> x<-rnorm(50)
> r <- 0.8;
> y <- (r * x) + ((1-r) * rnorm(50));
> plot(x,y); cor(x,y)
[1] 0.9661111
![Page 49: Canadian Bioinformatics Workshops](https://reader035.vdocuments.us/reader035/viewer/2022062301/568147c0550346895db5031a/html5/thumbnails/49.jpg)
Pearson's Coefficient of Correlation
Varying degrees of randomness ...
> x<-rnorm(50)
> r <- 0.4;
> y <- (r * x) + ((1-r) * rnorm(50));
> plot(x,y); cor(x,y)
[1] 0.6652423
![Page 50: Canadian Bioinformatics Workshops](https://reader035.vdocuments.us/reader035/viewer/2022062301/568147c0550346895db5031a/html5/thumbnails/50.jpg)
Pearson's Coefficient of Correlation
Varying degrees of randomness ...
> x<-rnorm(50)
> r <- 0.01;
> y <- (r * x) + ((1-r) * rnorm(50));
> plot(x,y); cor(x,y)
[1] 0.01232522
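Note that the mixing weight `r` in these examples is not itself the correlation. For `y = r*x + (1-r)*noise` with independent standard normal `x` and noise, Cov(x, y) = r and Var(y) = r² + (1-r)², so the expected correlation is r / sqrt(r² + (1-r)²). The sketch below evaluates this for the values of `r` used on the slides and checks one case with a large simulated sample; the helper name `expected_cor` is introduced here for illustration.

```r
# Expected Pearson correlation for y = r*x + (1-r)*noise (x, noise ~ independent N(0,1))
expected_cor <- function(r) r / sqrt(r^2 + (1 - r)^2)

r_values <- c(0.99, 0.8, 0.4, 0.01)
print(round(expected_cor(r_values), 4))

# Check by simulation with a large sample (seed and size are arbitrary)
set.seed(1)
x <- rnorm(1e5)
r <- 0.8
y <- (r * x) + ((1 - r) * rnorm(1e5))
print(cor(x, y))
```

This explains why r = 0.8 produced a correlation near 0.97 and r = 0.4 one well above 0.4; with only 50 points, individual runs scatter around the expected value.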
![Page 51: Canadian Bioinformatics Workshops](https://reader035.vdocuments.us/reader035/viewer/2022062301/568147c0550346895db5031a/html5/thumbnails/51.jpg)
Pearson's Coefficient of Correlation
Non-linear relationships ...
> x<-runif(50,-1,1)
> r <- 0.9
> # periodic ...
> y <- (r * cos(x*pi)) + ((1-r) * rnorm(50))
> plot(x,y); cor(x,y)
[1] 0.3438495
![Page 52: Canadian Bioinformatics Workshops](https://reader035.vdocuments.us/reader035/viewer/2022062301/568147c0550346895db5031a/html5/thumbnails/52.jpg)
Pearson's Coefficient of Correlation
Non-linear relationships ...
> x<-runif(50,-1,1)
> r <- 0.9
> # polynomial ...
> y <- (r * x*x) + ((1-r) * rnorm(50))
> plot(x,y); cor(x,y)
[1] -0.5024503
![Page 53: Canadian Bioinformatics Workshops](https://reader035.vdocuments.us/reader035/viewer/2022062301/568147c0550346895db5031a/html5/thumbnails/53.jpg)
Pearson's Coefficient of Correlation
Non-linear relationships ...
> x<-runif(50,-1,1)
> r <- 0.9
> # exponential
> y <- (r * exp(5*x)) + ((1-r) * rnorm(50))
> plot(x,y); cor(x,y)
[1] 0.6334732
![Page 54: Canadian Bioinformatics Workshops](https://reader035.vdocuments.us/reader035/viewer/2022062301/568147c0550346895db5031a/html5/thumbnails/54.jpg)
Pearson's Coefficient of Correlation
Non-linear relationships ...
> x<-runif(50,-1,1)
> r <- 0.9
> # circular ...
> a <- (r * cos(x*pi)) + ((1-r) * rnorm(50))
> b <- (r * sin(x*pi)) + ((1-r) * rnorm(50))
> plot(a,b); cor(a,b)
[1] 0.04531711
![Page 55: Canadian Bioinformatics Workshops](https://reader035.vdocuments.us/reader035/viewer/2022062301/568147c0550346895db5031a/html5/thumbnails/55.jpg)
Correlation coefficient
![Page 56: Canadian Bioinformatics Workshops](https://reader035.vdocuments.us/reader035/viewer/2022062301/568147c0550346895db5031a/html5/thumbnails/56.jpg)
When Do We Use Statistics?
• Ubiquitous in modern biology
• Every class I will show a use of statistics in a (very, very) recent Nature paper.
January 9, 2014
![Page 57: Canadian Bioinformatics Workshops](https://reader035.vdocuments.us/reader035/viewer/2022062301/568147c0550346895db5031a/html5/thumbnails/57.jpg)
Non-Small Cell Lung Cancer 101
Lung Cancer
Non-Small Cell Small Cell
Large Cell (and others)
Squamous Cell Carcinomas
Adenocarcinomas
80% of lung cancer
15% 5-year survival
![Page 58: Canadian Bioinformatics Workshops](https://reader035.vdocuments.us/reader035/viewer/2022062301/568147c0550346895db5031a/html5/thumbnails/58.jpg)
Non-Small Cell Lung Cancer 102
Stage I: Local Tumour Only (IA = small tumour; IB = large tumour)
Stage II: Local Lymph Nodes
Stage III: Distal Lymph Nodes
Stage IV: Metastasis
![Page 59: Canadian Bioinformatics Workshops](https://reader035.vdocuments.us/reader035/viewer/2022062301/568147c0550346895db5031a/html5/thumbnails/59.jpg)
General Idea: HMGA2 is a ceRNA
What are ceRNAs?
Salmena et al. Cell 2011
![Page 60: Canadian Bioinformatics Workshops](https://reader035.vdocuments.us/reader035/viewer/2022062301/568147c0550346895db5031a/html5/thumbnails/60.jpg)
Test Multiple Constructs for Activity
![Page 61: Canadian Bioinformatics Workshops](https://reader035.vdocuments.us/reader035/viewer/2022062301/568147c0550346895db5031a/html5/thumbnails/61.jpg)
What Statistical Analysis Did They Do?
• No information given in main text!
• Figure legend says: “Values are technical triplicates, have been performed independently three times, and represent mean +/- standard deviation (s.d.) with propagated error.”
• In supplementary they say: “Unless otherwise specified, statistical significance was assessed by the Student’s t-test”
• So, what would you do differently?