comp. genomics recitation 10 4/7/09 differential expression detection

Comp. Genomics

Recitation 104/7/09Differential expression detection

Outline

• Clustering vs. Differential expression• Fold change• T-test• Multiple testing• FDR/SAM• Mann-Whitney• Examples

Microarray preliminaries

• General input: A matrix of probes (sequences) and intensities

• We assume the hard work is over:• Probes are assigned to genes• The data is properly (?) normalized• We have an expression matrix

• Rows correspond to genes• Columns correpond to conditions

Microarray analysis

• Common scenarios:• We tested the behavior of genes across

several time points• We test a large number of different

condtions• Clustering is the solution

• We compared a small number of conditions (2) and have multiple replicates for each condition

• E.g., we took blood expression in 10 sick and 10 individuals • Differential expression analysis

Identification of differential genes

• The most basic experimental design: comparison between 2 conditions – ‘treatment’ vs. control• More complex: sick/treatment/control

• The goal: identify genes that are differentially expressed in the examined conditions

• Number of replicates is usually low (n=2-4)• Statistics are important

Slides: Rani Elkon

Approaches for identification of differential genes

1.1. Fold ChangeFold Change2.T-test3.SAM

1. Fold Change

• Consider genes whose mean expression level was change by at least 1.75-2 fold as differential genes

• Pros:• Very simple!

• Cons:• Usually no estimation of false positive

rate is provided• Biased to genes with low expression level• Ignores the variability of gene levels over

replicates.

Fold Change limit – Biased to low expression levels

Determine ‘floor’ cut-off and set all expression levels below it to this floor level

Fold Change limit – ignores variability over replicates

• We need a score that ‘punishes’ genes with high variability over replicates

1.Fold Change

2.2. T-testT-test3.SAM


2. T-test

• Compute a t-score for each gene

mc, mt – mean levels in Control and Treatment

Sc2, St

2 – variance estimates in Control and Treatment

nc, nt – number of replicates in in Control and Treatment

T-test

• The t-score is good because it is a results of a well known statistical hypothesis testing

• If we assume the sample is normally distributed (unknown variance) and compare two hypotheses:• H0 – All the measurements come from the same

distribution

• H1 – All the measurements come from different normal distributions

• In this case a p-value can be derived for every t-score

T-test

• Set cut-off for p-value (α=0.01) and consider all genes with p-value < α as differential genes

Multiple Testing

• Pg associated with the t-score tg is the probability for obtaining by random a t-score that is at least as extreme as tg.

• Multiplicity problem: thousands of genes are tested simultaneously (all the genes on the array!)

• Simple example:• 10,000 genes on a chip• not a single one is differentially expressed

(everything is random)• α=0.01• 10000x0.01 = 100100 genes are expected to have

a p-value < 0.01 just by chance.

Multiple testing

• Individual p–values of e.g. 0.01 no longer correspond to significant findings.

• Need to adjust for multiple testing when assessing the statistical significance of findings

• Actually this is a somewhat common problem in statistics

Multiple Testing

• Simple solution (Bonferroni): consider as differential genes only those with p-value < (α/N)• N: number of tests• α=0.01, N=10,000: cut-off=0.000001

• Ensure very low probability for having any false positive genes (less than α)

• Advantage: very clean list of differential genes• Limit: the list usually contains very few genes…

unacceptable high rate of false negatives

FDR correction (Benjamini & Hochberg)

• False Discovery Rate• In high-throughput studies certain

proportion of false positives is tolerable• Control the expected proportion of false

positives among the genes declared as differential (q=10%).

• Scheme:• Rank genes according to their p-vals: p(1)<p(2)…

<p(N)

• Consider as differential the top k genes, wherek = max{i: p(i)< i*(q/N)}

1.Fold Change2.T-test

3.3. SAMSAM


3. SAM (Tusher, Tibshirani & Chu)

• ‘Significance Analysis of Microarray’• Limit of analytical FDR approach:

assumes that the tests are independent

• In the microarray context, the expression levels of some genes are highly correlated → unreliable FDR estimate

• SAM uses permutations to get an ‘empirical’ estimate for the FDR of the reported differential genes

SAM

• Scheme:• Compute for each gene a statistic that measures

its relative expression difference in control vs ‘treatment’ (t-score or a variant)

• Rank the genes according to their ‘difference score’

• Set a cut off (d0) and consider all genes above it as differential (Nd)

• Permute the condition labels, and count how many genes got score above d0 (Np)

• Repeat on many (all possible) permutations and count (Npj)

• estimate FDR as the proportion: Average(Npj)/Nd

Permutation on condition labels

D scor

e

G1 e11

e12

e13

e14

e15

e16

e17

e18

d1

G2 e21

e22

e23

e24

e25

e26

e27

e28

d2

G3 e31

e32

e33

e34

e35

e36

e37

e38

d3

d1p1

d2p1

d3p1

d1p2

d2p2

d3p2

BACK

SAM example

• Ionizing radiation response experiment

• After setting the threshold:• 46 genes found significant

• 36 permutations• 8.4 genes on average pass the

threshold

• False discovery rate is 18%

Mann-Whitney/Wilcoxon

• In general normality assumption of t-test is problematic

• Aparametric statistics are very useful in many bioinfo related problem

• Assume nothing about the distribution of the samples

• Less powerful (more false negatives, but less false positives)


• MW/Wilcoxon test for two samples:• H0 – The medians of both distributions

are the same

• H1 – The medians of the distributions are different

• Assumes:• The two samples are independent• The observations can ordered (ordinal)


• Computes a U-score whose distribution is known under H0 (& can be approximated by normal distribution in large samples)• Arrange all the observations into a single

ranked series• Add up the ranks in sample 1. The sum of

ranks in sample 2 follows by calculation, since the sum of all the ranks equals N(N+1)/2

• U-score:

comp. genomics recitation 10 4/7/09 differential expression detection

Documents

behavior of genes

thousands of genes

differential genespros

expression matrixrows

blood expression

fold changeconsider

tscore tg

low expression levelignores