
Lecture 8 2014

Quality control of high throughput biological data

and Statistical testing for large biological data

Anja Bråthen Kristoffersen


Introduction

• There are many sources of variability and bias in high-throughput biological experiments.

• These make it difficult to distinguish between biological differences and experimental noise.

• Raw data can be very misleading.

• We will look at design, transformation and normalization methods to reduce technical noise.

Statistical bioinformatics 3


Randomization of experiments

• Ensure that you will not have any systematic biases:

– Distribute the biological groups in a balanced way

– Divide into batches of the same size, limited by the capacity at each step

• Randomize and balance according to the biology that you are interested in

– Make random numbers by using the function sample() in R

– E.g. draw 10 numbers between 1 and 10 without replacement:

> sample(10, 10, replace = FALSE)
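
A minimal sketch of how sample() could be used to build a balanced, randomized plan. The group names, the number of samples and the number of batches are assumptions made for illustration; they are not from the lecture.

```r
set.seed(2014)  # fixed seed so the plan can be reproduced

# Hypothetical experiment: 6 control and 6 treated samples, 3 batches
group <- rep(c("control", "treated"), each = 6)
n_batches <- 3

# Within each biological group, deal the samples evenly over the batches
# in random order, so every batch gets the same number from each group
batch <- integer(length(group))
for (g in unique(group)) {
  idx <- which(group == g)
  batch[idx] <- sample(rep(seq_len(n_batches), length.out = length(idx)))
}

plan <- data.frame(sample = seq_along(group), group = group, batch = batch)
balance <- table(plan$group, plan$batch)
print(balance)
```

Randomizing within each group, rather than over all samples at once, is what keeps the batches balanced with respect to the biology.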


Experimental plan: an example

11 January 2014, Statistical bioinformatics


Samples color coded according to biology


Samples color coded according to labeling date


Precautions

• Experimental methods should be standardized across the same experiment

– ideally across all experiments

• Multiple biological replicates make it possible to account for individual variability.

• If possible, use multiple technical replicates

– Partition the same sample into multiple runs or even multiple machines

• In the end, the data should be precise, accurate, and directly comparable to other data.


Statistical goals in quality control analysis

• Examine the distributional properties of the data and assess their quality

– Goal 1: to examine whether the data are appropriate for any subsequent analysis → outlier detection

– Goal 2: to investigate the variability and relationships among different samples and replicates


The goal

• Most statistical analyses rely on well-behaved distributions.

– Skewed distribution → data transformation

– Heterogeneous variance → variance-stabilizing transformation

• e.g. power transformations

– Outliers and noise → robust statistics, e.g.

• the median is robust while the mean is not

– Data from different experiments should be comparable

• Data normalization


Example: effect of outliers

Figure panels: a random set of 10 points; the same set with one more point added.

Even one outlier can change your whole idea about the data, if you are not careful!
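
The effect can be reproduced with a small simulation. This is only a sketch: the 10 points are drawn independently from a standard normal distribution and the added point at (10, 10) is a hypothetical outlier, not the data shown on the slide.

```r
set.seed(1)
x <- rnorm(10)
y <- rnorm(10)            # independent of x, so there is no real relationship

# Add a single extreme point (hypothetical outlier at (10, 10))
x_out <- c(x, 10)
y_out <- c(y, 10)

cor_before <- cor(x, y)
cor_after  <- cor(x_out, y_out)

# The mean is pulled towards the outlier much more than the median
shift_mean   <- abs(mean(x_out) - mean(x))
shift_median <- abs(median(x_out) - median(x))

print(c(cor_before = cor_before, cor_after = cor_after))
print(c(shift_mean = shift_mean, shift_median = shift_median))
```

A single point is enough to create a strong apparent correlation, and it moves the mean far more than the median.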


Motivation for Normalization

• Assume you do an experiment and find a negative correlation. Then you combine this with the results from your colleague, who used a different reference:

Figure panels: your results; combined results.

This is called Simpson's Paradox, and it can ruin your whole day!
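
A small simulated illustration of the paradox. The two data sets, the offset of 5 standing in for the "different reference", and all variable names are assumptions made for this sketch.

```r
set.seed(42)
n <- 50

# Your experiment: a negative relationship between x and y
x1 <- rnorm(n, mean = 0)
y1 <- -x1 + rnorm(n, sd = 0.3)

# Colleague's experiment: the same negative relationship, but a different
# reference shifts both variables by a constant (hypothetical offset of 5)
x2 <- rnorm(n, mean = 5)
y2 <- -(x2 - 5) + 5 + rnorm(n, sd = 0.3)

cor_yours     <- cor(x1, y1)
cor_colleague <- cor(x2, y2)
cor_combined  <- cor(c(x1, x2), c(y1, y2))

# Centering each data set before combining removes the offset
cor_centered <- cor(c(x1 - mean(x1), x2 - mean(x2)),
                    c(y1 - mean(y1), y2 - mean(y2)))

print(c(yours = cor_yours, colleague = cor_colleague,
        combined = cor_combined, centered = cor_centered))
```

Both experiments show a negative correlation on their own, the naive combination shows a positive one, and centering each data set first recovers the negative relationship.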


Descriptive Statistics - Box-plot

Box plot annotations:

• Median

• 25% quantile and 75% quantile (the edges of the box)

• IQR = 75% quantile − 25% quantile = interquartile range

• Whiskers: up to 1.5 × IQR beyond each edge of the box

• Everything above or below the whiskers is considered an outlier

x <- rnorm(100, mean=0, sd=1)

boxplot(x)
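
The outlier rule drawn by boxplot() can also be computed by hand. A minimal sketch, assuming one artificial extreme value (5) is appended to the simulated data so that there is something to flag:

```r
set.seed(1)
x <- c(rnorm(100, mean = 0, sd = 1), 5)  # one artificial extreme value added

# boxplot() draws the box at the "hinges" returned by fivenum(), which are
# close to (but not always identical to) the 25% and 75% quantiles
five  <- fivenum(x)             # min, lower hinge, median, upper hinge, max
iqr   <- five[4] - five[2]
lower <- five[2] - 1.5 * iqr
upper <- five[4] + 1.5 * iqr

# Points outside [lower, upper] are the ones plotted individually
outliers <- x[x < lower | x > upper]
print(outliers)
```

The same points are returned in the out element of boxplot.stats(x).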


Transformations (log)

$y = e^{x}, \quad x = \ln(y)$

$y = 10^{x}, \quad x = \log_{10}(y)$

$y = 2^{x}, \quad x = \log_{2}(y)$

• Log transformations cannot handle negative values (or zeros)

• They minimize the impact of extreme values

• The log2 transformation helps you easily identify doublings or halvings in ratios.


Biexponential transformation

• Arcsinh

$y = \frac{e^{x} - e^{-x}}{2}, \quad x = \ln\left(y + \sqrt{y^{2} + 1}\right)$

• Logicle

$y = a e^{b(x - w)} - c e^{-d(x - w)} + f$

• The logicle transform is similar to arcsinh but with more parameters in the transformation.
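
A short check of the arcsinh formulas in R, using the base function asinh(). The input values are arbitrary and chosen to include zero and negative numbers, which a plain log transformation cannot handle.

```r
y <- c(-1000, -10, -1, 0, 1, 10, 1000)

x_builtin <- asinh(y)                    # base R inverse hyperbolic sine
x_formula <- log(y + sqrt(y^2 + 1))      # the closed form x = ln(y + sqrt(y^2 + 1))

# Back-transform with y = (e^x - e^-x) / 2, i.e. sinh(x)
y_back <- (exp(x_builtin) - exp(-x_builtin)) / 2

print(cbind(y, x_builtin, x_formula, y_back))
```

For large |y| the transform behaves like a signed logarithm, while it is close to linear around zero.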


Example: histogram and boxplot


log - transformation

• The data are highly (positively) skewed

– Lots of small values with a few very large values.

• We need to transform this into a well-behaved distribution.

– Ideally something like a Gaussian.

• Log transformation is generally used for positively skewed data.

– Use log2(X)

– More intuitive: each whole number is a twofold change (adding 1 on the log2 scale corresponds to multiplying the raw value by 2)
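
A sketch of the effect on simulated data. The log-normal sample and its parameters are assumptions standing in for the lecture's data set, which is not reproduced here.

```r
set.seed(1)
# Simulated positively skewed data (log-normal); not the lecture's data set
x <- rlnorm(1000, meanlog = 5, sdlog = 1)
x_log2 <- log2(x)

# Histogram and boxplot before and after the transformation
op <- par(mfrow = c(2, 2))
hist(x, main = "Raw"); boxplot(x, main = "Raw")
hist(x_log2, main = "log2"); boxplot(x_log2, main = "log2")
par(op)

# Mean and median agree much better after the transformation
gap_raw  <- (mean(x) - median(x)) / sd(x)
gap_log2 <- (mean(x_log2) - median(x_log2)) / sd(x_log2)

# A doubling of the raw value is a step of 1 on the log2 scale
step <- log2(2 * 100) - log2(100)
```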


Example: histogram and boxplot, after log transformation


QQ-plot

• The QQ-plot shows the theoretical quantiles versus the empirical quantiles. If the assumed (theoretical) distribution is indeed the correct one, we should observe a straight line, with gradient equal to 1 when both sets of quantiles are on the same scale.


QQ-plot


DLBCLpatientDataNEW.txt

• http://llmpp.nih.gov/DLBCL/

• The data are already normalized


summary(dat[,7:12])


boxplot(dat[,7:12])


x <- c(dat[,7], dat[,8], dat[,9], dat[,10], dat[,11], dat[,12])

hist(x)


qqnorm(x)

qqline(x)


Data Normalization

• Normalization allows us to handle several datasets of different origin and use them together.

– Remember Simpson's Paradox!

• There are several standard methods:

– Shifting: add a constant to all data points, shifting the mean.

• Called centering if the constant added is −µ

– Scaling: multiply the data points by a scaling factor based on some reference mean, $x_{ref}$, where $\bar{x}_j$ is taken to be the mean of sample $j$:

$x'_{ij} = x_{ij} \frac{x_{ref}}{\bar{x}_j}$

– Quantile normalization: match the quantiles of two distributions
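
Shifting and scaling can be sketched in a few lines of R. The simulated matrix, the sample names, and the choice of the first sample's mean as the reference mean are assumptions made for illustration.

```r
set.seed(1)
# Hypothetical data: rows are features i, columns are samples j
X <- cbind(s1 = rnorm(100, mean = 5), s2 = rnorm(100, mean = 8))

col_means <- colMeans(X)

# Shifting (centering): add -mean to every value in a column
X_centered <- sweep(X, 2, col_means, "-")

# Scaling: x'_ij = x_ij * x_ref / mean_j, with x_ref assumed to be the
# mean of the first sample
x_ref <- col_means[["s1"]]
X_scaled <- sweep(X, 2, x_ref / col_means, "*")

print(colMeans(X_centered))
print(colMeans(X_scaled))
```

After centering every column has mean zero; after scaling every column has the reference mean.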


Quantile normalization

If you have a reference distribution:

• Sort your data.

• For any value in your data, find its rank among all other data points, and calculate the probability that X < x (here rank(x) counts from the largest value, so the largest value has rank 1):

$P(X < x) = 1 - \frac{\mathrm{rank}(x)}{n}$

• Look up the value for that probability in the reference cumulative distribution function (CDF).

• Replace your value with the reference value at the same quantile.
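
The steps can be sketched as follows. This sketch makes two assumptions that are not from the lecture: the reference distribution is a standard normal, and probabilities are computed from ascending ranks as (rank − 0.5)/n, which plays the same role as the rank-based probability in the steps but avoids probabilities of exactly 0 or 1.

```r
set.seed(2)
x <- rexp(200)   # simulated skewed data to be normalized

# Empirical probability for each value, from ascending ranks; the 0.5
# offset keeps every probability strictly between 0 and 1
p <- (rank(x) - 0.5) / length(x)

# Look up each probability in the reference distribution (assumed here
# to be standard normal) and replace the value with that quantile
x_norm <- qnorm(p)

summary(x_norm)
```

The ordering of the data is unchanged; only the spacing between the values is replaced by that of the reference distribution.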


Mean average plot (MA plot)

• An XY scatter plot often leads to seeing biased error patterns

• There is a mathematical bias when a regression-based normalization is used

• MA transformation: A = (X1 + X2)/2 and M = X1 − X2


MA plot

• The MA plot in the example shows bias.

• Typically, you want a distribution centered on M = 0.


MA plot, baseline correction

• The distribution can be corrected by finding and removing the baseline of the MA plot.

– Locally weighted scatterplot smoothing (LOESS)

– Problem: the intensity values are nonlinearly transformed by the normalization, so linear relationships such as fold changes are not completely conserved.
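
A minimal sketch of a LOESS baseline correction using the base R function loess(). The two simulated channels, which are assumed to be log2 intensities, and the artificial intensity-dependent bias are assumptions; they are not the lecture's example data.

```r
set.seed(3)
n <- 500
# Simulated log2 intensities for two channels (not the lecture's data);
# X1 carries an artificial intensity-dependent bias
truth <- runif(n, min = 6, max = 14)
X1 <- truth + 0.025 * (truth - 10)^2 + rnorm(n, sd = 0.2)
X2 <- truth + rnorm(n, sd = 0.2)

A <- (X1 + X2) / 2   # average intensity
M <- X1 - X2         # difference (log ratio)

# Estimate the baseline of M as a smooth function of A and subtract it
fit <- loess(M ~ A)
M_corrected <- M - fitted(fit)

# MA plot with the fitted baseline (red) and the target M = 0 (dashed)
plot(A, M, pch = 20, col = "grey")
lines(sort(A), fitted(fit)[order(A)], col = "red", lwd = 2)
abline(h = 0, lty = 2)
```

Plotting M_corrected against A should give a cloud centered on M = 0 across the whole intensity range.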


Statistical testing and large datasets


Sensitivity, specificity, FPR, FNR and FDR

Rows: true disease status. Columns: test result.

Disease         Negative (testedN)                    Positive (testedP)
Negative (N)    Correct                               False positive (FP), type I error
Positive (P)    False negative (FN), type II error    Correct

False positive rate: $FPR = \frac{FP}{N} = 1 - \text{specificity}$

False negative rate: $FNR = \frac{FN}{P} = 1 - \text{sensitivity}$

False discovery rate: $FDR = \frac{FP}{\text{testedP}}$
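
A worked example of the three rates in R. The counts are hypothetical and chosen only to make the arithmetic easy to follow.

```r
# Hypothetical counts for 1000 tested items
TP <- 80; FN <- 20    # truly positive: P = TP + FN
FP <- 30; TN <- 870   # truly negative: N = FP + TN

P <- TP + FN
N <- FP + TN
testedP <- TP + FP    # everything the test called positive

sensitivity <- TP / P
specificity <- TN / N

FPR <- FP / N         # false positive rate
FNR <- FN / P         # false negative rate
FDR <- FP / testedP   # false discovery rate

print(c(FPR = FPR, FNR = FNR, FDR = FDR))
```

Note that FPR and FDR share the numerator FP but have different denominators: a test with a small FPR can still have a large FDR when true positives are rare.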


Plot of FPR vs. 1 − FNR for a statistical test

• You need to know the true positives and the true negatives. The plot is easily made in R using the package ROCR.

Receiver Operating Characteristic (ROC) curve:

# pvalue and truePN are vectors of the same length: truePN holds the true
# class (positive or negative) of each data point, and pvalue is the
# p-value calculated for that data point.
install.packages("ROCR")  # only needed once
library(ROCR)
# Note: ROCR treats higher scores as more "positive". If small p-values
# indicate positives, pass 1 - pvalue instead of pvalue.
pred <- prediction(pvalue, truePN)
# y axis: true positive rate (sensitivity = 1 - FNR), x axis: FPR
perf <- performance(pred, "tpr", "fpr")
plot(perf)


Multiple hypothesis testing

• A test is designed to have a known expected proportion of incorrectly rejected null hypotheses; most often this level is 5%.

• When many tests are done, the probability of falsely rejecting at least one null hypothesis increases; hence we correct the p-values according to how many tests are done.


Example: 10 000 genes

• Q: is gene g, g = 1, …, 10 000, differentially expressed?

• This gives 10 000 null hypotheses: $H_0^1, H_0^2, \ldots, H_0^{10000}$

– $H_0^1$: gene 1 is not differentially expressed

– …

• Assume: no genes are differentially expressed

– $H_0^g$ is true for all g

• Significance level α = 0.01

– The probability of incorrectly concluding that any one gene is differentially expressed is 0.01, so we expect 0.01 × 10 000 = 100 wrong rejections of $H_0^g$.
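
The expected number of wrong rejections can be checked by simulation. A sketch under assumed conditions: 5 + 5 samples per gene, standard normal data with no true differences, and an equal-variance two-sample t-test for each gene.

```r
set.seed(1)
n_genes <- 10000
alpha <- 0.01

# Null data: 5 + 5 samples per gene, no true differences (assumed design)
X <- matrix(rnorm(n_genes * 10), nrow = n_genes)

# One two-sample t-test per gene (row)
p <- apply(X, 1, function(r) t.test(r[1:5], r[6:10], var.equal = TRUE)$p.value)

# Every rejection here is a false positive, because all nulls are true
n_false <- sum(p < alpha)
print(n_false)   # on average about alpha * n_genes = 100
```

The exact count varies from run to run, but it stays near 100 even though no gene is truly differentially expressed.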


Need to control the risk of false positives (type I errors)

• Corrected p-value:

– The original p-values do not tell the full story.

– Instead of using the original p-values for decision making, we should use corrected ones.


Different correction methods

• Bonferroni (1935)

– Just multiply all the p-values by the number of tests

– Too conservative

• need a very small p-value to reject $H_0$

• gives very little power

• Methods that control the family-wise error rate (FWER).

• Methods that control the false discovery rate (FDR).


Family-Wise Error Rate (FWER)

• Control type I errors at a level α: Pr(FP ≥ 1) < α

– Control the probability of making any false positive call at the desired significance level

– Conservative methods such as the Bonferroni correction

• Multiply each p-value by the number of tests done (e.g. genes); equivalently, divide the significance level α by the number of tests

– Other similar but less conservative methods are:

• Sidak

• Bonferroni-Holm

• Westfall & Young

• Use one of these if you are most afraid of getting stuff on your significant list that should not have been there


False Discovery Rate (FDR)

• Calculate the expected proportion of type I errors among the rejected hypotheses:

– FDR = E(# false positive predictions / # total positive predictions)

• Control the proportion of false positive calls among all positive calls at the desired significance level

• Technique that applies to a set of p-values

– Benjamini & Hochberg

– Different newer variants of Benjamini & Hochberg

• Use one of these if you are most afraid of missing out on interesting stuff


help(p.adjust)
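
A short sketch of p.adjust() on simulated p-values. The mixture of 9000 null p-values and 1000 p-values from true effects, and the Beta distribution used to generate the latter, are assumptions made for illustration.

```r
set.seed(1)
# Hypothetical mixture: 9000 null p-values and 1000 from true effects
p <- c(runif(9000), rbeta(1000, 0.1, 5))

p_bonf <- p.adjust(p, method = "bonferroni")  # controls the FWER
p_holm <- p.adjust(p, method = "holm")        # controls the FWER
p_bh   <- p.adjust(p, method = "BH")          # controls the FDR

# Number of rejections at level 0.05 for each kind of p-value
n_sig <- c(raw        = sum(p < 0.05),
           bonferroni = sum(p_bonf < 0.05),
           holm       = sum(p_holm < 0.05),
           BH         = sum(p_bh < 0.05))
print(n_sig)
```

Bonferroni is the most conservative, Holm rejects at least as many hypotheses as Bonferroni, and Benjamini & Hochberg rejects at least as many as Holm, in exchange for controlling the FDR rather than the FWER.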


False discovery rate (FDR)

2014.03.05 42