
Microarray data analysis

David A. McClellan, Ph.D.
Introduction to Bioinformatics

Brigham Young University
Dept. Integrative Biology

25 January 2006

Inferential statistics

Inferential statistics are used to make inferences about a population from a sample.

Hypothesis testing is a common form of inferential statistics. A null hypothesis is stated, such as: “There is no difference in signal intensity for the gene expression measurements in normal and diseased samples.” The alternative hypothesis is that there is a difference.

We use a test statistic to decide whether to reject the null hypothesis or fail to reject it. For many applications, the significance level is set to 0.05, so the null hypothesis is rejected when p < 0.05.

Page 199


The t-test is a commonly used statistical test for assessing the difference in mean values between two groups.

$$t = \frac{\bar{x}_1 - \bar{x}_2}{s_{\bar{x}_1 - \bar{x}_2}} = \frac{\text{difference between mean values}}{\text{variability (noise)}}$$

Here the denominator is the standard error of the difference between the two means.

Questions

• Is the sample size (n) adequate?
• Are the data normally distributed?
• Is the variance of the data known?
• Is the variance the same in the two groups?
• Is it appropriate to set the significance level to p < 0.05?

Page 199
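As an illustration of the signal-to-noise idea behind the t statistic, the sketch below computes t for two unpaired groups. It assumes Welch's form, which does not require equal variances in the two groups; the function name `welch_t` and the intensity values are hypothetical, not taken from the lecture.

```python
import math
from statistics import mean, variance

def welch_t(sample1, sample2):
    """Return (t, degrees of freedom) for Welch's unequal-variance t-test."""
    n1, n2 = len(sample1), len(sample2)
    # Squared standard error of each mean: sample variance / sample size
    v1 = variance(sample1) / n1
    v2 = variance(sample2) / n2
    # Signal (difference between means) divided by noise (standard error)
    t = (mean(sample1) - mean(sample2)) / math.sqrt(v1 + v2)
    # Welch-Satterthwaite approximation to the degrees of freedom
    df = (v1 + v2) ** 2 / (v1 ** 2 / (n1 - 1) + v2 ** 2 / (n2 - 1))
    return t, df

# Hypothetical signal intensities for one gene (assumed values)
normal = [5.1, 4.9, 5.3, 5.0, 5.2]
diseased = [6.0, 6.4, 5.9, 6.2, 6.1]
t, df = welch_t(normal, diseased)
print(f"t = {t:.2f}, df = {df:.1f}")
```

Turning t into a p-value requires the t distribution with `df` degrees of freedom, which a statistics package or a printed table would supply.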


Paradigm                      Parametric test    Nonparametric test
Compare two unpaired groups   Unpaired t-test    Mann-Whitney test
Compare two paired groups     Paired t-test      Wilcoxon test
Compare 3 or more groups      ANOVA

Page 198-200

ANOVA

ANalysis Of VAriance

ANOVA calculates the probability that several conditions all come from the same distribution

Parametric vs. Nonparametric

Parametric tests (e.g. t-tests and ANOVAs) assume that the data are sampled from a normal distribution

Nonparametric tests do not make assumptions about the population distribution – they rank the outcome variable from low to high and analyze the ranks

Mann-Whitney test (a two-sample rank test)

Actual measurements are not employed; the ranks of the measurements are used instead

n1 and n2 are the number of observations in samples 1 and 2, and R1 is the sum of the ranks of the observations in sample 1

$$U = n_1 n_2 + \frac{n_1 (n_1 + 1)}{2} - R_1$$
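The U statistic can be sketched in a few lines, using n1, n2, and R1 as defined above. Giving tied values the average of their ranks is an assumption of this sketch, and the helper names and sample values are hypothetical.

```python
def average_ranks(values):
    """Rank values from low to high (1-based); ties share the mean rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j to cover the whole run of tied values
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2 + 1  # mean of the 1-based positions i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = shared
        i = j + 1
    return ranks

def mann_whitney_u(sample1, sample2):
    """U = n1*n2 + n1*(n1 + 1)/2 - R1, with R1 the rank sum of sample 1."""
    n1, n2 = len(sample1), len(sample2)
    ranks = average_ranks(list(sample1) + list(sample2))
    r1 = sum(ranks[:n1])
    return n1 * n2 + n1 * (n1 + 1) / 2 - r1

# Hypothetical measurements (assumed values)
low_group = [1.0, 2.0, 3.0]
high_group = [4.0, 5.0, 6.0]
u = mann_whitney_u(low_group, high_group)
print(u)
```

The computed U is then compared against a critical value from a Mann-Whitney table for the given n1 and n2.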

Mann-Whitney example

Mann-Whitney table

Wilcoxon paired-sample test

A nonparametric analogue to the paired-sample t-test, just as the Mann-Whitney test is a nonparametric procedure analogous to the unpaired-sample t-test
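A rough sketch of the signed-rank calculation follows. It assumes the usual conventions of discarding zero differences and averaging tied ranks; the function name and the paired values are hypothetical.

```python
def wilcoxon_signed_rank(before, after):
    """Return (T_plus, T_minus), the rank sums of positive and negative differences."""
    # Paired differences; pairs with no change are dropped
    diffs = [a - b for b, a in zip(before, after) if a != b]
    abs_sorted = sorted(abs(d) for d in diffs)

    def rank(value):
        # Average 1-based rank of an absolute difference (handles ties)
        first = abs_sorted.index(value) + 1
        count = abs_sorted.count(value)
        return first + (count - 1) / 2

    t_plus = sum(rank(abs(d)) for d in diffs if d > 0)
    t_minus = sum(rank(abs(d)) for d in diffs if d < 0)
    return t_plus, t_minus

# Hypothetical paired measurements on the same five samples (assumed values)
before = [10.0, 12.0, 9.0, 11.0, 14.0]
after = [12.0, 11.0, 13.0, 11.0, 17.0]
t_plus, t_minus = wilcoxon_signed_rank(before, after)
print(t_plus, t_minus)
```

The smaller of the two rank sums is the value usually looked up in a Wilcoxon table.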

Wilcoxon example

Wilcoxon table


Is it appropriate to set the significance level to p < 0.05?

If you hypothesize that a specific gene is up-regulated, you can set the probability value to 0.05.

You might measure the expression of 10,000 genes and hope that any of them are up- or down-regulated. But you can expect 5% of them (500 genes) to appear regulated at the p < 0.05 level by chance alone. To account for the thousands of repeated measurements you are making, some researchers apply a Bonferroni correction. The level for statistical significance is divided by the number of measurements, e.g. the criterion becomes:

$p < 0.05 / 10{,}000$, or $p < 5 \times 10^{-6}$
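The arithmetic above can be checked with a short sketch. The function names and the three p-values are hypothetical; only the 0.05 level and the 10,000 measurements come from the slide.

```python
def bonferroni_threshold(alpha, n_tests):
    """Per-test significance level after Bonferroni correction."""
    return alpha / n_tests

def significant_genes(p_values, alpha, n_tests):
    """Names of genes whose p-value passes the corrected threshold."""
    threshold = bonferroni_threshold(alpha, n_tests)
    return [gene for gene, p in p_values.items() if p < threshold]

alpha, n_genes = 0.05, 10_000
# Number of genes expected to pass p < 0.05 by chance alone
expected_by_chance = alpha * n_genes
threshold = bonferroni_threshold(alpha, n_genes)
print(f"expected by chance: {expected_by_chance:.0f}, threshold: {threshold:.0e}")

# Hypothetical p-values for three of the 10,000 genes (assumed values)
p_values = {"gene1": 0.04, "gene2": 0.001, "gene3": 1e-7}
hits = significant_genes(p_values, alpha, n_genes)
print(hits)
```

Only the gene with p = 1e-7 survives the corrected criterion, which shows how strict the Bonferroni correction is when thousands of genes are tested.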

Page 199

Significance analysis of microarrays (SAM)

SAM
-- an Excel plug-in
-- URL: www-stat.stanford.edu/~tibs/SAM
-- modified t-test
-- adjustable false discovery rate

[Figure: SAM plot with axes labeled "expected" and "observed"; up-regulated and down-regulated genes are marked.]

Page 202

Descriptive statistics

Microarray data are high-dimensional: many thousands of measurements are made from a small number of samples.

Descriptive (exploratory) statistics help you to find meaningful patterns in the data.

A first step is to arrange the data in a matrix. Next, use a distance metric to define the relatedness of the different data points. Two commonly used distance metrics are:

-- Euclidean distance
-- Pearson coefficient of correlation

Page 203

Euclidean Distance

Pearson Correlation Coefficient
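The two metrics can be sketched as follows, using their standard textbook definitions. The function names and the two expression profiles are hypothetical.

```python
import math

def euclidean_distance(x, y):
    """Straight-line distance between two expression profiles."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def pearson_correlation(x, y):
    """Pearson coefficient of correlation between two expression profiles."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    covariation = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    spread_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    spread_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return covariation / (spread_x * spread_y)

# Two hypothetical genes measured in four samples (assumed values);
# gene_b has the same pattern as gene_a at twice the level
gene_a = [1.0, 2.0, 3.0, 4.0]
gene_b = [2.0, 4.0, 6.0, 8.0]
distance = euclidean_distance(gene_a, gene_b)
r = pearson_correlation(gene_a, gene_b)
print(f"Euclidean distance = {distance:.3f}, Pearson r = {r:.3f}")
```

The two profiles are far apart by Euclidean distance yet perfectly correlated, so the choice of metric decides whether genes are grouped by absolute level or by the shape of their expression pattern.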

Descriptive statistics: clustering

Clustering algorithms offer useful visual descriptions of microarray data.

Genes may be clustered, or samples, or both.

We will next describe hierarchical clustering. This may be agglomerative (building up the branches of a tree, beginning with the two most closely related objects) or divisive (building the tree by finding the most dissimilar objects first).

In each case, we end up with a tree having branches and nodes.

Agglomerative clustering

[Figure, built up over successive slides: objects a, b, c, d, and e are joined step by step along an axis numbered 0 to 4. The clusters appear in the order a,b; d,e; c,d,e; a,b,c,d,e.]

…tree is constructed

Page 206
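The agglomerative procedure can be sketched as below. Single linkage (distance between the closest members of two clusters) and the one-dimensional coordinates for a to e are assumptions of this sketch, chosen only for illustration.

```python
def agglomerative(points):
    """Cluster named 1-D points bottom-up; return the clusters in merge order."""
    clusters = [(name,) for name in points]
    merges = []

    def linkage(c1, c2):
        # Single linkage: distance between the two closest members
        return min(abs(points[p] - points[q]) for p in c1 for q in c2)

    while len(clusters) > 1:
        # Find the pair of clusters that are most closely related
        pairs = [(linkage(clusters[i], clusters[j]), i, j)
                 for i in range(len(clusters))
                 for j in range(i + 1, len(clusters))]
        _, i, j = min(pairs)
        merged = clusters[i] + clusters[j]
        # Replace the two joined clusters with their union
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)]
        clusters.append(merged)
        merges.append(merged)
    return merges

# Hypothetical positions of the five objects (assumed values)
points = {"a": 0.0, "b": 1.0, "c": 5.0, "d": 7.0, "e": 8.5}
merges = agglomerative(points)
for step, cluster in enumerate(merges, start=1):
    print(step, ",".join(cluster))
```

With these coordinates the merges occur in the order a,b; d,e; c,d,e; a,b,c,d,e. Other linkage rules (complete or average linkage) can give a different tree for the same data.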

Divisive clustering

[Figure, built up over successive slides: starting from the single cluster a,b,c,d,e, clusters are split step by step along an axis numbered 4 to 0. The clusters appear in the order a,b,c,d,e; c,d,e; d,e; a,b; and finally the individual objects a, b, c, d, e.]

…tree is constructed

Page 206

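[Figure: the same tree read in both directions. Agglomerative clustering runs from the individual objects a, b, c, d, e to the single cluster a,b,c,d,e; divisive clustering runs in the opposite direction. Each direction has its own step axis numbered 0 to 4.]

Page 206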

Page 207

Cluster and TreeView

clustering, PCA, SOM, K-means

Two-way clustering of genes (y-axis) and cell lines (x-axis) (Alizadeh et al., 2000)

Page 208

Self-Organizing Maps (SOM)

To download GeneCluster:

http://www.genome.wi.mit.edu/MPR/software.html

SOMs are unsupervised neural net algorithms that identify coregulated genes

Two pre-processing steps are essential before applying SOMs:

1. Variation Filtering:

Data are passed through a variation filter to eliminate those genes showing no significant change in expression across the k samples. This step is needed to prevent nodes from being attracted to large sets of invariant genes.

2. Normalization:

The expression level of each gene is normalized across experiments. This focuses attention on the 'shape' of expression patterns rather than absolute levels of expression.
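The two pre-processing steps might look like the sketch below. The filter criterion (a minimum range across samples) and the normalization to mean 0 and standard deviation 1 are assumptions of this sketch, not necessarily what GeneCluster does; the gene names and values are hypothetical.

```python
from statistics import mean, pstdev

def variation_filter(expression, min_range):
    """Keep genes whose expression range across samples is at least min_range."""
    return {gene: values for gene, values in expression.items()
            if max(values) - min(values) >= min_range}

def normalize(values):
    """Rescale one gene's profile to mean 0 and standard deviation 1."""
    m, s = mean(values), pstdev(values)
    return [(x - m) / s for x in values]

# Hypothetical expression levels across k = 3 samples (assumed values)
expression = {
    "geneA": [10.0, 20.0, 30.0],
    "geneB": [5.0, 5.0, 5.0],        # invariant, removed by the filter
    "geneC": [100.0, 200.0, 300.0],  # same shape as geneA, higher level
}
filtered = variation_filter(expression, min_range=1.0)
normalized = {gene: normalize(values) for gene, values in filtered.items()}
for gene, profile in normalized.items():
    print(gene, [round(x, 3) for x in profile])
```

After normalization geneA and geneC have identical profiles, which is the sense in which attention shifts to the shape of the pattern rather than the absolute level. Filtering first also avoids dividing by a zero standard deviation for invariant genes.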

Principal components analysis (PCA)

An exploratory technique used to reduce the dimensionality of the data set to 2D or 3D.

For a matrix of m genes x n samples, create a new covariance matrix of size n x n.

Thus, transform a large number of variables into a smaller number of uncorrelated variables called principal components (PCs).

Page 211
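A minimal sketch of the idea: build the n x n covariance matrix from an m genes x n samples matrix, then find its leading eigenvector, which is the first principal component. Power iteration is used here only to keep the example free of external libraries; it is a stand-in for a full eigendecomposition, and the function names and data are hypothetical.

```python
import math

def covariance_matrix(data):
    """data: m rows (genes) x n columns (samples). Returns the n x n covariance."""
    m, n = len(data), len(data[0])
    col_means = [sum(row[j] for row in data) / m for j in range(n)]
    centered = [[row[j] - col_means[j] for j in range(n)] for row in data]
    return [[sum(r[i] * r[j] for r in centered) / (m - 1) for j in range(n)]
            for i in range(n)]

def first_principal_component(cov, iterations=200):
    """Leading eigenvector and eigenvalue of cov, found by power iteration.

    Assumes cov is not all zeros and that the starting vector is not
    orthogonal to the leading eigenvector.
    """
    n = len(cov)
    v = [1.0 / math.sqrt(n)] * n
    for _ in range(iterations):
        w = [sum(cov[i][j] * v[j] for j in range(n)) for i in range(n)]
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    eigenvalue = sum(v[i] * sum(cov[i][j] * v[j] for j in range(n))
                     for i in range(n))
    return v, eigenvalue

# Hypothetical matrix of 4 genes x 2 samples (assumed values)
data = [[1.0, 2.0], [2.0, 4.0], [3.0, 6.0], [4.0, 8.0]]
cov = covariance_matrix(data)
pc1, eigenvalue = first_principal_component(cov)
total_variance = sum(cov[i][i] for i in range(len(cov)))
explained = eigenvalue / total_variance
print(f"PC1 direction = {[round(x, 3) for x in pc1]}, explained = {explained:.0%}")
```

The ratio of an eigenvalue to the total variance is the percentage attached to a principal component axis. In this toy data set the two samples are perfectly correlated, so a single component captures all of the variance.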

[Figure: PCA plot of samples C1 to C4, N2 to N4, and P1 to P4. Axes: principal component axis #1 (87%), principal component axis #2 (10%), PC#3 (1%). Legend: Lead (P), Sodium (N), Control (C).]

Principal components analysis (PCA), an exploratory technique that reduces data dimensionality, distinguishes lead-exposed from control cell lines

Principal components analysis (PCA): objectives

• to reduce dimensionality

• to determine the linear combination of variables

• to choose the most useful variables (features)

• to visualize multidimensional data

• to identify groups of objects (e.g. genes/samples)

• to identify outliers

Page 211

http://www.okstate.edu/artsci/botany/ordinate/PCA.htm

Chr 21

Use of PCA to demonstrate increased levels of gene expression from Down syndrome (trisomy 21) brain
