
Vasant Honavar, Computational Systems Biology, ISU, Spring 2008

Iowa State University
Department of Computer Science
Center for Computational Intelligence, Learning, and Discovery
Bioinformatics and Computational Biology Program

Vasant Honavar
Artificial Intelligence Research Laboratory
[email protected]

www.cs.iastate.edu/~honavar/
www.cs.iastate.edu/~honavar/aigroup.html
www.cild.iastate.edu
www.bcb.iastate.edu
www.igert.iastate.edu

Gene expression data analysis





Generic features of microarray technologies

• Probe: Biochemical agent that finds or complements a specific sequence of DNA, RNA, or protein from a test sample, e.g., cDNA amplified from a vector stored in a bacterial clone, or oligonucleotides cloned in a fluid medium.

• Array: The method for placing the probes on a medium (e.g., glass, silicon) using robotic spotting, photolithography, etc.

• Sample Probe: Mechanism for preparing RNA from test samples: RNA is extracted, mRNA is selected using polydeoxythymidine (poly-dT) to bind the polyadenine (poly-A) tail, and mRNA is copied into cDNA using labeled nucleotides.

• Assay: Mechanism for transducing gene expression into something easily measurable e.g., hybridization

• Readout: Measuring the transduced signal e.g., colored dyes, radioactive labels, etc.





Observing the transcriptome

• Focused Experimental Approaches

• Northern Blotting Analysis

• Real time PCR (quantitative or semi-quantitative)

• High throughput approaches:

• Microarray expression profiling

• cDNA

• Affymetrix

• Serial analysis of gene expression (SAGE)

• Massively Parallel Signature Sequencing (MPSS)






Observing the transcriptome

Focused Experimental Approaches

Northern Blotting Analysis

Real time PCR (quantitative or semi-quantitative)

High throughput Approaches:

Closed System Profiling:

Microarray expression profiling

Open System Profiling:

Serial analysis of gene expression (SAGE)

Massively Parallel Signature Sequencing (MPSS)





Limit of Detection: 1 in 30,000 transcripts

~ 20 transcripts/cell

Red – increase of Cy5 sample transcripts

Green – increase of Cy3 sample transcripts

Yellow – equal abundance





Experimental overview:

Cell population A and cell population B
→ RNA extraction
→ Reverse transcription
→ Klenow label incorporation (Sample A labelled with Cy5 dye, Sample B labelled with Cy3 dye)
→ Hybridisation
→ Washing
→ Scan Cy5 channel, scan Cy3 channel
→ “Overlay images”
→ Quantify pixel intensities.






Platform        Substrate   Probe                    Elements
Isotope         Nylon       cDNA (300-900 nt)
Two-colour      Glass       cDNA or oligo (80 nt)    500 – 11,000 elements
Affymetrix      Silicon     oligo (20 nt)            22,000 elements
Tissue arrays   Glass       Tissue discs (20-150)





Affymetrix GeneChip®

Limit of detection: 1 in 100,000 transcripts

~ 5 transcripts/cell





http://www.affymetrix.com






Affy Gene Expression Arrays

Array                      Transcripts/Genes
Arabidopsis Genome         24,000
C. elegans Genome          22,500
Drosophila Genome          18,500
E. coli Genome             20,366
Human Genome U133 Plus     47,000
Mouse Genome               39,000
Yeast Genome               5,841 (S. cerevisiae) & 5,031 (S. pombe)
Rat Genome                 30,000
Zebrafish                  14,900
Plasmodium/Anopheles       4,300 (P. falciparum) & 14,900 (A. gambiae)
Barley                     25,500
Soybean                    37,500 + 23,300 pathogen
Grape                      15,700
Canine                     21,700
Bovine                     23,000
B. subtilis                5,000
S. aureus                  3,300 ORFs
Xenopus                    14,400





Microarray and Gene Chip Approaches

Advantages:
• Rapid
• Method and data analysis well described and supported
• Robust
• Convenient for directed and focused studies

Disadvantages:
• Closed system approach
• Difficult to correlate with absolute transcript number
• Sensitive to alternative splicing ambiguities





Serial Analysis of Gene Expression (SAGE)

Velculescu et al., Science 1995

A transcript (new or novel) can be recognised by a small subset (e.g. 14) of its nucleotides – a tag

Linking tags allows for rapid sequencing.

Open system for transcript profiling

[Figure: a 14 nt TAG on a transcript with a poly-A tail (AAAAAAAAA – 3’); tags are linked (TAG TAG TAG …) and the linked tags are sequenced, e.g.
AGCTTGAACCGTGACATCATGGCCATTGGCCCCAATTGAGACAGTGAGTTCAATGC]

Modified SAGE methods

LongSAGE (21 nt)

SAGE-lite, micro-SAGE, mini-SAGE

RASL/DASL methods (5’ and 3’ Tags)






SAGE

Advantages:
• Potential ‘open’ system method: new transcripts can be identified
• Accuracy of unambiguous transcript observation
• Digital output of data
• Quantitative and qualitative information

Disadvantages:
• Characterising novel transcripts from short tag sequences is often computationally difficult
• Tag specificity (tag length recently increased to 21 bp)
• Length of tags can vary (restriction enzyme activity varies with temperature)
• A subset of transcripts do not contain the enzyme recognition sequence
• Sensitive to a subset of alternative splice variants





[Figure: microarray analysis workflow. Biological question → Experimental design, platform choice → Microarray experiment → Image analysis (16-bit TIFF files; (Rspot, Rbkg), (Gspot, Gbkg)) → Normalization → Statistical analysis, clustering, classification, pattern discovery, data mining (with sample attributes) → Biological verification and interpretation]





Analysis

Liver: 47,000 x 2 x 2 datapoints = 188,000
Brain: 47,000 x 2 x 2 datapoints = 188,000
Lymphocyte: 47,000 x 2 x 2 datapoints = 188,000





Analysis

Basic problem: Given a large dataset with experimental and biological noise, find
• Robust patterns (common themes or differences)
• Similarities or differences between samples





Gene expression analysis

Which transcripts are different?

What are the patterns?

Liver Brain Lymphocytes





Fold changes

• cDNA data are typically transformed from raw intensities into log intensities before further analysis
• This ensures
– Even spread of features across the intensity range
– Variability constant across intensity levels
– Approximately normal distribution of experimental errors
– Approximately normal distribution of intensities (after log transformation)
• Ratio of raw Cy3 and Cy5 intensities is transformed into the difference between the log2 of the corresponding values
– Two-fold up-regulated genes ∼ log ratio of 1
– Two-fold down-regulated genes ∼ log ratio of -1
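The log transform described above can be sketched in a few lines. This is a minimal illustration, assuming the ratio is taken as Cy5 over Cy3 (the orientation is a convention, not something the slide fixes); the function name `log_ratio` and the intensity values are hypothetical.

```python
import math

def log_ratio(cy5, cy3):
    """Return log2(cy5 / cy3), computed as a difference of log2 intensities."""
    return math.log2(cy5) - math.log2(cy3)

# Hypothetical raw intensities for three spots
up = log_ratio(2000.0, 1000.0)    # two-fold up-regulated
down = log_ratio(500.0, 1000.0)   # two-fold down-regulated
same = log_ratio(1000.0, 1000.0)  # unchanged

print(up, down, same)
```

Up- and down-regulation of the same magnitude become symmetric around zero, which is one practical reason for working with log ratios rather than raw ratios.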






Fold changes - some words of caution

• Expression measurements from the same RNA samples do not always agree perfectly
• Even when the replicate measurements are highly correlated (Pearson correlation coefficient close to 1), the observed fold changes may not agree

Example:
• Replicate 1:
– measurements of g1 and g2 under two conditions (c1, c2): (1.0, 2.0), (500, 1000)
– Fold change from c1 to c2 = 2 for both g1 and g2!
• Replicate 2:
– measurements of g1 and g2 under two conditions (c1, c2): (1.5, 2.0), (500.5, 1000)
– Fold change from c1 to c2 = 1.33 for g1 and 1.998 for g2!
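The example can be checked numerically. The sketch below uses the four measurements from each replicate as given on the slide; the helper names `fold_change` and `pearson` are hypothetical, and the correlation is computed over the four values of each replicate, which is one reasonable reading of "replicate measurements are highly correlated".

```python
import math

def fold_change(c1, c2):
    """Fold change going from condition c1 to condition c2."""
    return c2 / c1

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

# (c1, c2) measurements for genes g1 and g2, as on the slide
replicate1 = {"g1": (1.0, 2.0), "g2": (500.0, 1000.0)}
replicate2 = {"g1": (1.5, 2.0), "g2": (500.5, 1000.0)}

flat1 = [v for g in ("g1", "g2") for v in replicate1[g]]
flat2 = [v for g in ("g1", "g2") for v in replicate2[g]]
r = pearson(flat1, flat2)

fc1 = {g: fold_change(*replicate1[g]) for g in replicate1}
fc2 = {g: fold_change(*replicate2[g]) for g in replicate2}
print(r, fc1, fc2)
```

The same absolute measurement error (0.5 units) barely affects the highly expressed gene g2 but changes the fold change of the weakly expressed gene g1 substantially.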





Sources of noise: Biological, Inter-chip

• Expression levels depend on
– Intracellular factors (e.g., the stage of the cell cycle)
– Extrinsic factors (e.g., signals from other cells)

• Expression level depends on complex interaction of activators and inhibitors

• Even in tissue culture dishes (with a single cell type) RNA levels within each cell can vary

• The overall expression level is taken from RNA collected from a pool of cells





Normalization – why do we need it?

           array 1     array 2
gene1      3308        4947.5
gene2      2334        3155.5
gene3      2518        3738
gene4      8882.5      18937
gene5      5041        12956.5
gene6      7314.5      19013.5
gene7      3508.5      8164
gene8      2183        5121.5
gene9      4790        8082
gene10     1645.5      1794.5
gene11     1772        1963
gene12     1802.5      2186.5
gene13     14846       35811
gene14     9986        25293
gene15     11640.5     21508
gene16     3860        6530
average    5339.5      11200.09

Intensities in array 2 are intrinsically larger than array 1






Experimental, intra-chip noise

• Different labeling efficiency
• Different hybridization time or hybridization condition
• Different scanning sensitivity

Many techniques have been developed to factor out these sources of variation from cDNA, Affy, and other data





Normalization

• Informally, normalization refers to transformations applied to data from experiments to render them comparable to one another
– In a probabilistic or statistical sense
  • e.g. normalize using mean or median of expression values for each dye
– Taking quantitative account of assumptions about how the data were generated

Example:
• Array preparations are susceptible to variations due to differences in dye properties, dye incorporation, or scanning
• The goal of normalization is to factor out these variations





Normalization

• Normalization of a collection of data sets is carried out with respect to a reference data set

• Common choices of reference:
– Data from one particular experiment (reference)
– A postulated distribution of expression values
  • Average expression ratio = 1
  • Total mRNA is approximately constant across the two samples





Example: Global Normalisation to factor out the differences due to dyes used

• Global Normalisation methods assume the intensities of the two dyes, R and G, are related by a constant factor k:

$$R = k \cdot G$$

• Taking logs, the log ratio is shifted by a constant $c = \log_2 k$:

$$\log_2 (R/G) \;\rightarrow\; \log_2 (R/G) - c = \log_2 \big(R/(kG)\big)$$

• How to find c?
– Regression analysis: assumes that most genes are NOT differentially expressed across the two conditions
– Housekeeping genes: use mRNA that have a known effect on the two dyes under all experimental conditions





Normalization: Linear regression of Cy5 against Cy3

• If the two channels behave similarly, the scatter plot of Cy5 against Cy3 should have a slope of 1 and intercept of 0

• Non zero intercept means the response of one channel is consistently higher than the other

• A slope greater than 1 implies one channel responds more strongly than the other at higher intensities

• A deviation from straight line suggests nonlinear effects
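The slope and intercept discussed above come from an ordinary least-squares fit. The following is a minimal sketch on synthetic data; the function name `fit_line` and the intensity values are hypothetical, and the Cy5 values are constructed with a known slope (1.2) and intercept (50) so that the fit can be checked.

```python
def fit_line(x, y):
    """Ordinary least-squares fit y = slope * x + intercept."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((xi - mx) ** 2 for xi in x)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = my - slope * mx
    return slope, intercept

# Synthetic channel intensities: Cy5 responds 1.2x as strongly, plus an offset of 50
cy3 = [100.0, 200.0, 400.0, 800.0, 1600.0]
cy5 = [1.2 * g + 50.0 for g in cy3]

slope, intercept = fit_line(cy3, cy5)
print(slope, intercept)
```

A fitted slope away from 1 or an intercept away from 0 indicates the kind of channel imbalance that normalization is meant to remove.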

Two Cy5 versus Cy3 scatter plots





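Modeling variations across arrays

$\theta_{gs}$: the underlying expression level; $x_{gs}$: the observed intensity (gene $g$, array $s$)

$$\text{array 1:}\quad x_{g1} = f_1(\theta_{g1})$$

$$\text{array 2:}\quad x_{g2} = f_2(\theta_{g2})$$

$$\vdots$$

$$\text{array } S\text{:}\quad x_{gS} = f_S(\theta_{gS})$$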





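Normalization: constant scaling

• Distributions on each array are scaled to have identical mean.

• Suppose array 1 is the reference array; then

$$x'_{gs} = x_{gs} \cdot \frac{\bar{x}_{\bullet 1}}{\bar{x}_{\bullet s}}$$

where $\bar{x}_{\bullet s}$ is the mean intensity over all genes on array $s$.

• Applied in MAS commercial software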
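Constant scaling can be tried on the 16-gene table from the "Normalization – why do we need it?" slide. The sketch below takes array 1 as the reference and rescales array 2 so that both have the same mean; the function names `mean` and `scale_to_reference` are hypothetical.

```python
# Intensities of gene1..gene16 from the earlier two-array table
array1 = [3308, 2334, 2518, 8882.5, 5041, 7314.5, 3508.5, 2183,
          4790, 1645.5, 1772, 1802.5, 14846, 9986, 11640.5, 3860]
array2 = [4947.5, 3155.5, 3738, 18937, 12956.5, 19013.5, 8164, 5121.5,
          8082, 1794.5, 1963, 2186.5, 35811, 25293, 21508, 6530]

def mean(values):
    return sum(values) / len(values)

def scale_to_reference(reference, target):
    """Multiply every target intensity by mean(reference) / mean(target)."""
    factor = mean(reference) / mean(target)
    return [x * factor for x in target]

scaled2 = scale_to_reference(array1, array2)
print(mean(array1), mean(array2), mean(scaled2))
```

After scaling, the two arrays share the same mean, but any intensity-dependent difference between them is left untouched, which motivates the non-linear methods that follow.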





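Normalization by Constant scaling: Underlying reasoning

Assumption:

$$\text{array 1:}\quad x_{g1} = f_1(\theta_{g1}) = \alpha_1 \theta_{g1}$$

$$\text{array 2:}\quad x_{g2} = f_2(\theta_{g2}) = \alpha_2 \theta_{g2}$$

$$\text{array } S\text{:}\quad x_{gS} = f_S(\theta_{gS}) = \alpha_S \theta_{gS}$$

• Suppose array 1 is the reference array; the normalized intensities are

$$x'_{gs} = \alpha_1 \theta_{gs} = \frac{\alpha_1}{\alpha_s}\, x_{gs}, \qquad s = 2, \ldots, S$$

• $\alpha_s$ is not estimable, but $\alpha_1/\alpha_s$ can be estimated as $\bar{x}_{\bullet 1}/\bar{x}_{\bullet s}$ if the overall expression levels are roughly the same ($\bar{\theta}_{\bullet 1} = \bar{\theta}_{\bullet s}$)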





Local Normalizations

• The dye factor is dependent on:
– Spot intensity ($A = \tfrac{1}{2}\log_2(RG)$)
– Location on the array

• Local normalisation methods:
– Intensity dependent
– Print-tip group





Intensity dependent

• Visualise the effect: M-A plot

• Correction of the intensity dependent variations:

Picture from: http://www.stat.berkeley.edu/users/terry/zarray/Html/index.html

print-tip effect





Print-tip Group

• Different experiments may use different printing set-ups:
– Layout of the tips in the print-head of the arrayer
– Differences in the length or opening of the tips
– Deformation

• Print-tip normalisation is simply (print-tip + A)-dependent normalisation










Beyond linear regression

• Average-intensity-dependent normalization:
– Robust nonlinear regression (Lowess) applied on the whole genome (Speed et al)
– Select invariant genes computationally (rank-invariant method), then apply Lowess (Wong et al)










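Normalization: M-A plot

[Figure: M-A plot, with M on the vertical axis, A on the horizontal axis, and the line M = 0]

$$M = \log\left(\frac{x_{gs}}{x_{g1}}\right), \qquad A = \frac{\log\left(x_{g1} \cdot x_{gs}\right)}{2}$$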





Normalization

The M-A plot shows the need for non-linear normalization: the normalization factor is a function of the expression level.

For non-differentially expressed genes ($\theta_{gs} = \theta_{g1}$):

$$M = \log\left(\frac{x_{gs}}{x_{g1}}\right) = \log\left(\frac{\alpha_s \theta_{gs}}{\alpha_1 \theta_{g1}}\right) = \log\left(\frac{\alpha_s}{\alpha_1}\right) = \text{constant}$$





Normalization

$\hat{M} = \hat{f}(A)$: fit by the ‘Lowess’ function in S-Plus

Normalized log ratio:

$$\tilde{M} = M - \hat{M}$$

Replicate arrays: the same pool of sample is applied
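The fit-and-subtract step can be illustrated without S-Plus. The sketch below is not the S-Plus ‘Lowess’ routine: it is a simplified local linear smoother with tricube weights and no robustness iterations, written only to show the idea of estimating $\hat{M} = \hat{f}(A)$ and subtracting it. The function name `lowess_fit`, the `frac` parameter, and the synthetic data are assumptions.

```python
import math

def lowess_fit(a, m, frac=0.5):
    """Local linear fit of m on a with tricube weights (no robustness iterations)."""
    n = len(a)
    k = max(2, int(math.ceil(frac * n)))  # number of neighbours per local fit
    fitted = []
    for a0 in a:
        dists = sorted(abs(ai - a0) for ai in a)
        h = dists[k - 1] or 1e-12  # bandwidth: distance to the k-th nearest point
        w = [(1 - min(abs(ai - a0) / h, 1.0) ** 3) ** 3 for ai in a]
        sw = sum(w)
        ma = sum(wi * ai for wi, ai in zip(w, a)) / sw
        mm = sum(wi * mi for wi, mi in zip(w, m)) / sw
        sxx = sum(wi * (ai - ma) ** 2 for wi, ai in zip(w, a))
        sxy = sum(wi * (ai - ma) * (mi - mm) for wi, ai, mi in zip(w, a, m))
        slope = sxy / sxx if sxx > 0 else 0.0
        fitted.append(mm + slope * (a0 - ma))
    return fitted

# Synthetic replicate-style data: no true differential expression,
# but a dye bias in M that drifts with the intensity A
A = [6.0 + 0.5 * i for i in range(20)]
M = [0.3 - 0.1 * (ai - 10.0) for ai in A]

M_hat = lowess_fit(A, M)
M_tilde = [mi - fi for mi, fi in zip(M, M_hat)]  # normalized log ratios
print(max(abs(v) for v in M_tilde))
```

Because the synthetic bias here is a smooth function of A, the normalized log ratios collapse onto M = 0; with real data the residuals would retain both noise and genuine differential expression.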





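Non-linear scaling: Underlying reasoning

Assumption:

$$\text{array 1:}\quad x_{g1} = f_1(\theta_{g1}) = \alpha_1(\theta_{g1})\,\theta_{g1}$$

$$\text{array 2:}\quad x_{g2} = f_2(\theta_{g2}) = \alpha_2(\theta_{g2})\,\theta_{g2}$$

$$\text{array } S\text{:}\quad x_{gS} = f_S(\theta_{gS}) = \alpha_S(\theta_{gS})\,\theta_{gS}$$

Then

$$\log\left(\frac{x_{gs}}{x_{g1}}\right) = \log\left(\frac{\alpha_s(\theta_{gs})\,\theta_{gs}}{\alpha_1(\theta_{g1})\,\theta_{g1}}\right) = \log\left(\frac{\theta_{gs}}{\theta_{g1}}\right) + h(\theta_{g1}, \theta_{gs})$$

where $\log(\theta_{gs}/\theta_{g1})$ is the log relative expression level and

$$h(\theta_{g1}, \theta_{gs}) = \log\left(\frac{\alpha_s(\theta_{gs})}{\alpha_1(\theta_{g1})}\right)$$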





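Non-linear scaling: Underlying reasoning (cont’d)

Suppose we know the green genes are non-differentially expressed genes ($\theta_{gs} = \theta_{g1}$). For these genes

$$\log\left(\frac{x_{gs}}{x_{g1}}\right) = \log\left(\frac{\alpha_s(\theta_{gs})}{\alpha_1(\theta_{g1})}\right) = h(\theta_{g1}, \theta_{gs})$$

so $h$ can be estimated from them; call the estimate $\hat{h}$. The log relative expression level of every gene is then estimated as

$$\log\left(\frac{\hat{\theta}_{gs}}{\hat{\theta}_{g1}}\right) = \log\left(\frac{x_{gs}}{x_{g1}}\right) - \hat{h}(\theta_{g1}, \theta_{gs})$$

Knowing which genes have the same expression across the different conditions requires biological knowledge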





Normalization using Invariant set (dChip)

• Select a baseline array (default is the one with median average intensity).

• For each “treatment” array, identify a set of genes that have ranks conserved between the baseline and treatment array. This set of rank-invariant genes is considered non-differentially expressed.

• Each array is normalized against the baseline array by fitting a non-linear normalization curve to the invariant-gene set.

$$G = \left\{\, g : \left|\mathrm{rank}(x_{gs}) - \mathrm{rank}(x_{g1})\right| < d \;\;\&\;\; l < \mathrm{Rank}\!\left(\frac{x_{gs} + x_{g1}}{2}\right) < G - l \,\right\}$$

Tseng et al., 2001
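The selection rule can be sketched as follows. This is an illustration of the rank conditions only, not the dChip implementation: the function names `ranks` and `invariant_set`, the tie-breaking rule, the toy intensities, and the threshold values `d` and `l` are all assumptions.

```python
def ranks(values):
    """Rank of each value (1 = smallest); ties are broken by position."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0] * len(values)
    for rank, index in enumerate(order, start=1):
        result[index] = rank
    return result

def invariant_set(baseline, treatment, d, l):
    """Indices of genes whose rank changes by less than d between the arrays
    and whose average-intensity rank lies strictly between l and n - l."""
    n = len(baseline)
    rb, rt = ranks(baseline), ranks(treatment)
    avg_rank = ranks([(b + t) / 2 for b, t in zip(baseline, treatment)])
    return [g for g in range(n)
            if abs(rt[g] - rb[g]) < d and l < avg_rank[g] < n - l]

# Toy intensities for 8 genes; genes 2 and 5 change rank strongly
baseline = [10, 20, 30, 40, 50, 60, 70, 80]
treatment = [12, 22, 95, 41, 52, 15, 73, 85]

selected = invariant_set(baseline, treatment, d=2, l=1)
print(selected)
```

The second condition drops the very lowest and highest intensities, where rank comparisons are least reliable; the normalization curve would then be fitted only through the selected genes.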





Normalization using Invariant set (dChip)

Advantage:
• More robust than fitting with all genes as in loess
• Especially when the expression distributions in the arrays are very different





Undefined or missing expression values

Gene expression datasets often contain missing values
• The background and the signal give similar intensity
• Surface of the chip is not planar (cDNA chips)
• The probe is not properly fixed on the chip
• Hybridisation step didn’t work properly
• Probe was not washed away properly

Undefined values can be
• Discarded: leads to problems in data analysis
• Filled in:
– Replaced by zero
– Imputed using statistical methods for filling in missing data, e.g., replace missing value by average of row or column of the data matrix
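Row-average imputation, the simplest of the options listed above, might look like this. The sketch assumes missing values are encoded as `None` and that rows are genes; the function name `impute_row_mean` and the small matrix are hypothetical.

```python
def impute_row_mean(matrix):
    """Replace None entries by the mean of the observed values in the same row."""
    filled = []
    for row in matrix:
        observed = [v for v in row if v is not None]
        if not observed:
            raise ValueError("row has no observed values to average")
        row_mean = sum(observed) / len(observed)
        filled.append([row_mean if v is None else v for v in row])
    return filled

# Two genes measured on three arrays, each with one missing value
data = [[1.0, None, 3.0],
        [0.5, 0.7, None]]

filled = impute_row_mean(data)
print(filled)
```

Row means ignore any structure across genes; more elaborate imputation methods exist, but they are outside what the slide describes.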






Basic gene expression data analysis

Are some genes differentially expressed across two different conditions?
• Classical parametric & non-parametric statistical tests for hypothesis testing (already covered by Dan Nettleton)

Are there groups of genes that display similar expression profiles?
• Unsupervised clustering algorithms
– Agglomerative methods
– Divisive methods, e.g., spectral clustering
– Dimensionality reduction, e.g., PCA

What else do genes with similar expression profiles share?
• Pathway annotations, GO functions, shared regulatory sequences?





Basic gene expression data analysis

Global topological analysis of gene networks

Finding modules
• Clusters of genes with similar expression profiles and sharing some additional features (e.g., regulators)

Classifying tissue samples based on gene expression patterns
• Discriminant analysis, supervised learning





Modeling gene networks

• Differential equations
• Boolean networks and temporal Boolean networks
• Bayesian networks and dynamic Bayesian networks
• ..





Clustering – Similarity based methods

Are there groups of genes that display similar expression profiles? We cluster gene expression data to answer this question

• A clustering is a partitioning of data into meaningful groups called clusters.

• Cluster – a collection of objects that are “similar” to one another – e.g., genes that share similar expression patterns

• Clustering algorithms are used
– To understand the natural grouping or structure in a data set or get insight into data distribution





What we need to cluster data

• Data
• A distance measure or similarity measure to group instances together
• A metric for evaluating candidate clusters
• An algorithm to perform the clustering





Data in Metric Space

• Data are points in $\mathbb{R}^d$ or $\{0,1\}^d$
– E.g., gene expression measurements at d time points or d conditions

• Metric Space: dist(x,y) is a distance metric if it is
– Reflexive: dist(x,y) = 0 iff x = y
– Symmetric: dist(x,y) = dist(y,x)
– Subject to the triangle inequality: dist(x,y) ≤ dist(x,z) + dist(z,y)

[Figure: triangle with vertices x, y and z]





Examples of Distance Metrics

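• The distance between x = <x1,…,xn> and y = <y1,…,yn> is:

– $L_2$ norm (Euclidean distance):

$$\sqrt{(x_1 - y_1)^2 + \cdots + (x_n - y_n)^2}$$

– $L_1$ norm (Manhattan distance):

$$|x_1 - y_1| + \cdots + |x_n - y_n|$$

• Cosine similarity measure

$$\cos\theta = \frac{x \cdot y}{\|x\|\,\|y\|}$$

– More similar -> close to 1
– Less similar -> close to 0
– 1 − cos θ is a distance measure

[Figure: vectors x and y separated by angle θ, with the L1 and L2 distances between them]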
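The three measures are short to write down in code. This is a minimal sketch using plain Python lists as vectors; the function names and the example vectors are hypothetical.

```python
import math

def euclidean(x, y):
    """L2 norm of x - y."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def manhattan(x, y):
    """L1 norm of x - y."""
    return sum(abs(a - b) for a, b in zip(x, y))

def cosine_similarity(x, y):
    """Cosine of the angle between x and y."""
    dot = sum(a * b for a, b in zip(x, y))
    norm_x = math.sqrt(sum(a * a for a in x))
    norm_y = math.sqrt(sum(b * b for b in y))
    return dot / (norm_x * norm_y)

x = [1.0, 2.0, 2.0]
y = [2.0, 4.0, 4.0]  # same direction as x, twice the magnitude

print(euclidean(x, y), manhattan(x, y), cosine_similarity(x, y))
```

The two vectors point in the same direction, so their cosine similarity is at its maximum even though their Euclidean and Manhattan distances are non-zero.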





Measure the Quality of Clustering

• Similarity metric: (dis)similarity is expressed in terms of a distance function d(X,Y), which is typically a metric.

• The definitions of distance functions are usually different for interval-scaled, Boolean, categorical, ordinal and ratio variables.

• Weights may be associated with different variables based on applications and data semantics.

• It is not always easy to define similarity measures: similarity is in the eye of the beholder.





Type of data in clustering analysis

• Real valued variables

• Interval-scaled variables

• Binary variables

• Nominal, ordinal, and ratio variables

• Variables of mixed types






Similarity Between Objects

• Distances are normally used to measure the similarity or dissimilarity between two data objects

• Some popular ones include the Minkowski distance:

$$d(i,j) = \left(|x_{i1} - x_{j1}|^q + |x_{i2} - x_{j2}|^q + \cdots + |x_{ip} - x_{jp}|^q\right)^{1/q}$$

where i = (xi1, xi2, …, xip) and j = (xj1, xj2, …, xjp) are two p-dimensional data objects, and q is a positive integer

• If q = 1, d is the Manhattan distance:

$$d(i,j) = |x_{i1} - x_{j1}| + |x_{i2} - x_{j2}| + \cdots + |x_{ip} - x_{jp}|$$





Similarity and Dissimilarity Between Objects (Cont.)

• If q = 2, d is the Euclidean distance:

$$d(i,j) = \sqrt{|x_{i1} - x_{j1}|^2 + |x_{i2} - x_{j2}|^2 + \cdots + |x_{ip} - x_{jp}|^2}$$

– Properties
  • d(i,j) ≥ 0
  • d(i,i) = 0
  • d(i,j) = d(j,i)
  • d(i,j) ≤ d(i,k) + d(k,j)

• Also, one can use weighted distance, Pearson correlation, or other dissimilarity measures





Interval scaled variables

• Standardize data
– Calculate the mean absolute deviation:

$$s_f = \frac{1}{n}\left(|x_{1f} - m_f| + |x_{2f} - m_f| + \cdots + |x_{nf} - m_f|\right)$$

where

$$m_f = \frac{1}{n}\left(x_{1f} + x_{2f} + \cdots + x_{nf}\right)$$

– Calculate the standardized measurement (z-score):

$$z_{if} = \frac{x_{if} - m_f}{s_f}$$

• Then compute distance in the usual way
• Using mean absolute deviation is more robust than using standard deviation
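A short sketch of this standardization for a single variable f, assuming the values are held in a plain list; the function name `standardize` and the sample values are hypothetical.

```python
def standardize(values):
    """z-scores using the mean absolute deviation s_f instead of the standard deviation."""
    n = len(values)
    m = sum(values) / n                      # m_f: mean of the variable
    s = sum(abs(v - m) for v in values) / n  # s_f: mean absolute deviation
    return [(v - m) / s for v in values]

z = standardize([2.0, 4.0, 6.0, 8.0])
print(z)
```

Deviations enter $s_f$ without being squared, so a single outlier inflates the scale less than it would inflate a standard deviation.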





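Similarity between Binary Variables

• Example

Name   Gender  Fever  Cough  Test-1  Test-2  Test-3  Test-4
Jack   M       Y      N      Y       N       N       N
Mary   F       Y      N      Y       N       Y       N
Jim    M       Y      Y      N       N       N       N

$$s(\text{Jack}, \text{Mary}) = \frac{s(M,F) + s(Y,Y) + \cdots + s(N,N)}{7} = \frac{5}{7}$$

$$s(\text{Jack}, \text{Jim}) = \;?\qquad s(\text{Jim}, \text{Mary}) = \;?$$

where s(a,b) = 1 if the two values match and 0 otherwise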
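The similarity in the example is the fraction of attributes on which two records agree. A minimal sketch, using the three records from the table; the dictionary `patients` and the function name `simple_matching` are hypothetical.

```python
from fractions import Fraction

# Gender, Fever, Cough, Test-1, Test-2, Test-3, Test-4
patients = {
    "Jack": ["M", "Y", "N", "Y", "N", "N", "N"],
    "Mary": ["F", "Y", "N", "Y", "N", "Y", "N"],
    "Jim":  ["M", "Y", "Y", "N", "N", "N", "N"],
}

def simple_matching(a, b):
    """Fraction of attributes on which the two records agree."""
    matches = sum(1 for u, v in zip(a, b) if u == v)
    return Fraction(matches, len(a))

s_jack_mary = simple_matching(patients["Jack"], patients["Mary"])
s_jack_jim = simple_matching(patients["Jack"], patients["Jim"])
s_jim_mary = simple_matching(patients["Jim"], patients["Mary"])
print(s_jack_mary, s_jack_jim, s_jim_mary)
```

Exact fractions are used so that the results can be compared directly with the hand calculation on the slide.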





Nominal Variables

• A generalization of the binary variable in that it can take more than 2 states, e.g., red, yellow, blue, green

• Method 1: Simple matching

– m: # of matches, n: total # of variables

$$d(i,j) = \frac{n - m}{n}$$

• Method 2: Binarize nominal variables

– create a new binary variable for each of the M nominal states





Ordinal Variables

• An ordinal variable can be discrete or continuous
• Order is important, e.g., rank
• Can be treated like interval-scaled
– replace $x_{if}$ by its rank $r_{if} \in \{1, \ldots, M_f\}$
– standardize the value to fall in [0, 1]:

$$z_{if} = \frac{r_{if} - 1}{M_f - 1}$$

– compute the dissimilarity using methods for interval-scaled variables





Ratio-Scaled Variables

• Ratio-scaled variables: a positive measurement on a nonlinear scale, approximately at exponential scale, such as $Ae^{Bt}$ or $Ae^{-Bt}$

• Methods:

– treat them like interval-scaled variables: not a good choice, because the scale can be distorted

– apply logarithmic transformation

  • $y_{if} = \log(x_{if})$

– treat them as continuous ordinal data and treat their rank as interval-scaled





Example – Clustering Gene expression Data

• Treat expression data for a gene across multiple time points or multiple conditions as a multidimensional vector

• Decide on a distance metric to compare the vectors.
– Plenty to choose from…
  • Pearson correlation, Euclidean Distance, Manhattan Distance etc.

• Each metric has different properties and can reveal different features of the data





Similar expression

Expression Vectors

• Each gene is represented by a vector whose coordinates are its values (log(ratio)) in each experiment
– x = log(ratio) in experiment 1
– y = log(ratio) in experiment 2
– z = log(ratio) in experiment 3
– etc.





Digression – Data Normalization

• Normalization is an attempt to correct for systematic bias in data.

• Normalization allows you to compare data from one array to another.

• In practice we do not always understand the data, so inevitably some biology will be removed too (or at least not revealed)





Tumor

Pool of Cell Lines

Differential labeling efficiency of dyes

Different amounts of starting material.

Different amounts of RNA in each channel

Differential efficiency of hybridization over slide surface.

Differential efficiency of scanning in each channel.





Such biases have consequences:
• Plotting the frequency of un-normalized intensities reveals the differential effect between the two channels.





How do we deal with this?

Normalization:
• Typically, we assume that the average gene does not change
• You need to understand your data, to know if that is an appropriate assumption or not
• The number of ‘reporters’ (clones or genes) you are assaying will affect this





Normalization

Effect on log ratios:

[Figure: two histograms of log-ratios (frequency, 0 to 700, against log-ratio), one un-normalized and one normalized]





Total Intensity Normalization

• For those spots that are thought to be well measured, calculate mean or median log ratio.

• Use this as a normalization factor to adjust all log ratios.

• Equivalent to assuming same total intensity in both channels.

• Current software:
– provides two simple methods for selection of well measured spots: pixel-by-pixel regression, and foreground over background intensity
– calculates normalized values for all channel 2 measurements, and ratios
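The centering step described in the first two bullets can be sketched as follows. The sketch assumes log ratios are already computed and that a boolean flag marks the well-measured spots; how that flag is derived (pixel-by-pixel regression or foreground over background) is not shown. The function name `normalize_log_ratios` and the values are hypothetical.

```python
import statistics

def normalize_log_ratios(log_ratios, well_measured):
    """Subtract the median log ratio of the well-measured spots from every log ratio."""
    trusted = [lr for lr, ok in zip(log_ratios, well_measured) if ok]
    factor = statistics.median(trusted)
    return [lr - factor for lr in log_ratios], factor

log_ratios = [0.9, 1.1, 1.0, 3.0, -2.0]
well_measured = [True, True, True, False, False]

normalized, factor = normalize_log_ratios(log_ratios, well_measured)
print(factor, normalized)
```

The normalization factor is estimated from the trusted spots only, but it is applied to all spots, as the second bullet states.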





Normalization by Subset

• Housekeeping genes
– Calculate normalization based on biologically determined stable genes.
– Not always valid; even very stable genes can respond to some conditions.
• Spiking or doping controls
– Calculate based on introduced DNA species.
– Requires careful measurement of total DNA in each channel.





Distance Metrics

• Distances are measured “between” expression vectors
• Distance metrics define the way we measure distances
• Many different ways to measure distance:
– Euclidean distance
– Pearson correlation coefficient(s)
– Manhattan distance
– Mutual information
– Kendall’s Tau
• Each has different properties and can reveal different features of the data





Euclidean distance

• The Euclidean distance metric detects similar vectors by identifying those that are closest in space.

• In this example, A and C are closest to one another.

[Figure: scatter plot of Gene A, Gene B and Gene C; horizontal axis ARRAY 1, both axes from 0 to 2.5]





Pearson correlation

• The Pearson correlation disregards the magnitude of the vectors but instead compares their directions.

• In this example, Gene A and Gene B have the same slope, so would be most similar to each other

• Note, however, that Pearson correlation is not a true similarity function

[Figure: scatter plot of Gene A, Gene B and Gene C; horizontal axis ARRAY 1, both axes from 0 to 2.5]





Distance Metric: Pearson vs. Euclidean

[Figure: expression profiles of genes A, B and C across arrays 1 to 8; vertical axis from -1.5 to 1.5]

• By Euclidean distance, A and B are most similar.
• By Pearson correlation, A and C are most similar.
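The disagreement between the two measures is easy to reproduce. The profiles below are hypothetical (they are not the values behind the figure): C has exactly the shape of A at one fifth of the amplitude, while B stays close to A in magnitude but has a different shape.

```python
import math

def euclidean(x, y):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def pearson(x, y):
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

# Hypothetical log-ratio profiles over five arrays
A = [-1.0, -0.5, 0.0, 0.5, 1.0]
B = [-0.9, -0.6, 0.3, 0.2, 0.9]   # close to A in value, different shape
C = [0.2 * a for a in A]          # same shape as A, smaller amplitude

print(euclidean(A, B), euclidean(A, C))
print(pearson(A, B), pearson(A, C))
```

Euclidean distance pairs A with B, whereas Pearson correlation pairs A with C, so the choice of measure decides which genes end up in the same cluster.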






More general measures

• Mutual information
– Model the genes as random variables that take different values under different conditions
– Mutual information between random variables is a measure of how much information one random variable conveys about another
– Correlation can be shown to be a special case of mutual information (when we assume that the underlying distributions can be specified using the first and the second moments, e.g., Gaussian)





Clustering methods

• Grouping methods, e.g., k-means
• Hierarchical methods
• Partitioning methods, e.g., spectral clustering
• Model-based clustering methods (e.g., mixture densities)





What is a Good Clustering?

• A good clustering method will produce clusters where
– the intra-cluster distance is small
– the inter-cluster distance is large





Partition Based Clustering Algorithms

• Given a set of data points S = {X1 … XN}, the task of the clustering algorithm is to find K non-overlapping clusters C = {C1 … CK} so that
– Each data point is assigned to a unique cluster
– Within-cluster distance T(C) is minimized
– Between-cluster distance D(C) is maximized

• The particular choices of T(C) and D(C) affect the shape of the clusters





Examples of Scoring Functions

Define cluster centers:

$$r_k = \frac{1}{|C_k|} \sum_{X \in C_k} X$$

Tightness:

$$T(C) = \sum_{k=1}^{K} \sum_{X \in C_k} d(X, r_k)^2$$

Between cluster distance:

$$D(C) = \sum_{1 \le j < k \le K} d(r_j, r_k)^2$$

Overall quality of clusters C:

$$Q(C) = \frac{D(C)}{T(C)}$$





Why care about “clustering”?

[Figure: expression matrix of genes (Gene 1, Gene 2, …, Gene N) by experiments (E1, E2, E3), shown before and after the genes are reordered by clustering]

• Discover functional relation
• Assign function to genes with unknown functions
• Find which gene controls which other genes





The k-means algorithm

• Assume the data live in a Euclidean space.
• Assume we want k clusters.
• Assume we start with randomly located cluster centers
• The algorithm alternates between:
– Assignment step: Assign each data point to the closest cluster.
– Refitting step: Move each cluster center to the center of gravity of the data assigned to it.

[Figure: assignments and refitted means]





K-means clustering

Randomly initialize the cluster centers
Until clusters stop changing do
{
    Assign points to clusters whose centers are the closest
    Update cluster centers
}
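The loop above translates almost line by line into code. The sketch below is one possible rendering, not a reference implementation: it takes the initial centers as an argument instead of drawing them at random, so that the run is reproducible, and it leaves a center where it is if no point is assigned to it. The names `sq_dist` and `kmeans` and the toy points are assumptions.

```python
def sq_dist(p, q):
    """Squared Euclidean distance between two points given as tuples."""
    return sum((a - b) ** 2 for a, b in zip(p, q))

def kmeans(points, initial_centers, max_iter=100):
    centers = [tuple(c) for c in initial_centers]
    assignment = None
    for _ in range(max_iter):
        # Assignment step: index of the closest center for every point
        new_assignment = [
            min(range(len(centers)), key=lambda k: sq_dist(p, centers[k]))
            for p in points
        ]
        if new_assignment == assignment:  # clusters stopped changing
            break
        assignment = new_assignment
        # Update step: move each center to the mean of its members
        for k in range(len(centers)):
            members = [p for p, a in zip(points, assignment) if a == k]
            if members:
                centers[k] = tuple(sum(c) / len(members) for c in zip(*members))
    return centers, assignment

points = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0),
          (10.0, 10.0), (10.0, 11.0), (11.0, 10.0)]
centers, assignment = kmeans(points, [(0.0, 0.0), (10.0, 10.0)])
print(centers, assignment)
```

In practice the initial centers would be chosen at random, as the pseudocode states, and the algorithm would be run from several starting points.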





Why K-means converges

• Whenever an assignment is changed, the sum of squared distances of data points from their assigned cluster centers is reduced.

• Whenever a cluster center is moved, the sum of squared distances of the data points from their currently assigned cluster centers is reduced.

• If the assignments do not change in the assignment step, we have converged.






K-means with Euclidean distance

• Clusters defined by Euclidean distance are invariant to translations and rotations in feature space, but not invariant to scaling of features

• Solution: standardize the data: translate and scale the features so that all of the features have zero mean and unit variance

• Some implementations of k-means may do this by default

• Caution: standardization may not always be desirable!





Standardization of data may not always be desirable!

Simulated data, 2-means without standardization

Simulated data, 2-means with standardization





K-means: Idea

• Represent the data set in terms of K clusters, each of which is summarized by a prototype $\mu_k$

• Each data point $X_i$ is assigned to one of K clusters

– Represented by responsibilities $r_{ik} \in \{0, 1\}$ such that

$$\sum_{k=1}^{K} r_{ik} = 1$$

• Example: 4 data points and 3 clusters





K-means: Loss function

• Loss function: the sum of squared distances from each data point to its assigned prototype

$$J = \sum_{i=1}^{n} \sum_{k=1}^{K} r_{ik}\, (X_i - \mu_k)^T (X_i - \mu_k)$$

where $r_{ik}$ are the responsibilities, $X_i$ the data, and $\mu_k$ the prototypes





Minimizing the loss Function

• Problem
– If prototypes known, can assign responsibilities.
– If responsibilities known, can compute optimal prototypes.
• We minimize the loss function by an iterative procedure akin to expectation maximization
• Other ways to minimize the loss function include a merge-split approach





Minimizing the loss Function – Expectation Maximization

• E-step: Fix $\mu_k$ and minimize J w.r.t. $r_{ik}$
– Assign each data point to its nearest prototype
• M-step: Fix values for $r_{ik}$ and minimize J w.r.t. $\mu_k$
– With a bit of algebra, we can show that this is achieved by setting

$$\mu_k = \frac{\sum_{i} r_{ik} X_i}{\sum_{i} r_{ik}}$$

– Each prototype is set to the mean of the data points in that cluster.
• Convergence is guaranteed since there is only a finite number of possible settings for the responsibilities, and at each iteration J can only improve or stay the same
• E-M is susceptible to local minima: it is good to run the algorithm with several different initializations of cluster centers






The Cost Function after each E and M step





How to Choose K?

• In some cases, K may be known
• In general, K is unknown.
• The loss function J decreases with increasing K
• Idea: Assume that K* is the right choice for K

– When K<K* some of the clusters must be split further to get the clusters to correspond to natural groups

– When K>K* some natural groups are split across multiple clusters

– When K<K* the cost function decreases rapidly, and when K>K* the cost function ceases to change substantially





How to Choose K?

• The Gap statistic (Tibshirani et al) provides a more principled way of choosing K.

K=2






Gap statistic

$$D_k = \sum_{i \in C_k} \sum_{j \in C_k} \|X_i - X_j\|^2 = 2 n_k \sum_{i \in C_k} \|X_i - \mu_k\|^2$$

Diameter of the k-th cluster; $n_k$ = number of data points assigned to the k-th cluster

$$W_K = \sum_{k=1}^{K} \frac{1}{2 n_k} D_k$$

A measure of the compactness of the clusters
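The two quantities, and the identity that links the pairwise sum to the sum of squared distances from the cluster mean, can be checked on a small example. The function names and the two toy clusters are hypothetical; the pairwise sum runs over all ordered pairs, which is what makes the factor $2 n_k$ appear.

```python
def sq_dist(p, q):
    return sum((a - b) ** 2 for a, b in zip(p, q))

def diameter(cluster):
    """D_k: sum of squared distances over all ordered pairs of points in the cluster."""
    return sum(sq_dist(a, b) for a in cluster for b in cluster)

def within_ss(cluster):
    """Sum of squared distances from each point to the cluster mean."""
    n = len(cluster)
    mu = tuple(sum(c) / n for c in zip(*cluster))
    return sum(sq_dist(p, mu) for p in cluster)

def compactness(clusters):
    """W_K: sum over clusters of D_k / (2 n_k)."""
    return sum(diameter(c) / (2 * len(c)) for c in clusters)

c1 = [(0.0, 0.0), (2.0, 0.0), (0.0, 2.0), (2.0, 2.0)]
c2 = [(10.0, 10.0), (12.0, 10.0)]

print(diameter(c1), 2 * len(c1) * within_ss(c1))
print(compactness([c1, c2]))
```

Because of the identity, $W_K$ is simply the pooled within-cluster sum of squares, the same quantity that k-means minimizes.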





Computing the Gap Statistic

for b = 1 to B
    From the n data points, obtain a Monte Carlo sample X1b, X2b, …, Xnb
for k = 1 to K
    Cluster the observations into k groups and compute log Wk
    for b = 1 to B
        Cluster the b-th M.C. sample into k groups and compute log Wkb
    Compute

    $$Gap(k) = \frac{1}{B} \sum_{b=1}^{B} \log W_{kb} - \log W_k$$

    Compute sd(k), the s.d. of {log Wkb}, b = 1,…,B
    Set the total s.e.

    $$s_k = \sqrt{1 + 1/B} \cdot sd(k)$$

Find the smallest k such that

$$Gap(k) \ge Gap(k+1) - s_{k+1}$$





Applying gap statistic to choose the number of clusters in DNA Microarray Data

6834 genes

64 human tumor






Notice the gap plot at k = 2 and 6





Other criteria for guiding the choice of K

• Calinski and Harabasz ’74

• Krzanowski and Lai ’85

$$KL(k) = \left| \frac{(k-1)^{2/p}\, W_{k-1} - k^{2/p}\, W_k}{k^{2/p}\, W_k - (k+1)^{2/p}\, W_{k+1}} \right|$$

• Hartigan ’75

$$H(k) = \left( \frac{W_k}{W_{k+1}} - 1 \right) (n - k - 1)$$

• Kaufman and Rousseeuw ’90 (silhouette)

k





Initializing K-means

• K-means converges to a local optimum.
• Clusters produced will depend on the initialization.
• Some heuristics
– Randomly pick K points as prototypes.
– A greedy strategy: pick prototype $\mu_{k+1}$ so that it is farthest from prototypes $\mu_1, \mu_2, \ldots, \mu_k$.
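The greedy strategy can be sketched as follows. "Farthest from the prototypes" is read here as maximizing the distance to the nearest already-chosen prototype, which is one common interpretation and an assumption on top of the slide; the function name `farthest_first`, the choice of the first prototype, and the toy points are also hypothetical.

```python
def sq_dist(p, q):
    return sum((a - b) ** 2 for a, b in zip(p, q))

def farthest_first(points, k, first=0):
    """Greedy initialization: start from points[first], then repeatedly add the
    point whose nearest already-chosen prototype is farthest away."""
    prototypes = [points[first]]
    while len(prototypes) < k:
        next_point = max(
            points,
            key=lambda p: min(sq_dist(p, c) for c in prototypes),
        )
        prototypes.append(next_point)
    return prototypes

points = [(0.0, 0.0), (1.0, 0.0), (10.0, 0.0), (10.0, 1.0), (5.0, 8.0)]
init = farthest_first(points, k=3)
print(init)
```

The resulting prototypes are spread across the data, which tends to avoid starting two centers inside the same natural group, although it is sensitive to outliers.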






Local minima

• There is nothing to prevent k-means getting stuck at local minima.

• We could try many random starting points

• We could try non-local split-and-merge moves: Simultaneously merge two nearby clusters and split a big cluster into two.

A bad local optimum





Soft k-means

• K-means clustering can result in large changes in cluster assignments in response to small changes in data

• Instead of making hard assignments of data points to clusters, we can make soft assignments. One cluster may have a responsibility of .7 for a data point and another may have a responsibility of .3.
– Allows a cluster to use more information about the data in the refitting step.
– What happens to our convergence guarantee?
– How do we decide on the soft assignments?





Soft assignment

• If a data point is exactly halfway between two clusters, each cluster should obviously have the same responsibility for it.

• The responsibilities of all the clusters for one data point should add to 1.

• A sensible softness function is the entropy of the responsibilities. Maximizing the entropy is like saying: be as uncertain as you can be about which cluster has responsibility

$$H_j = \sum_{i=1}^{k} p_{ij} \log \frac{1}{p_{ij}}$$

• H_j: entropy of the responsibilities for data point j

• k: number of clusters

• p_ij: responsibility of cluster i for data point j





The soft assignment step

• Choose assignments to optimize the trade-off between two terms:

– Minimize the squared distance of the data point to the cluster centers (weighted by responsibility)

– Maximize the entropy of the responsibilities

$$Cost_j = \sum_{i=1}^{k} p_{ij} \|\mathbf{x}_j - \boldsymbol{\mu}_i\|^2 - \sum_{i=1}^{k} p_{ij} \log \frac{1}{p_{ij}}$$

• Cost_j: cost of the assignments for data point j

• p_ij: responsibility of cluster i for data point j

• μ_i: location of cluster i

• x_j: location of data point j





Soft k means clustering

• How do we find the set of responsibility values that minimizes the cost and sums to 1?

$$Cost_j = \sum_{i=1}^{k} p_{ij} \|\mathbf{x}_j - \boldsymbol{\mu}_i\|^2 - \sum_{i=1}^{k} p_{ij} \log \frac{1}{p_{ij}}$$

• The optimal solution is to make the responsibilities proportional to the exponentiated negative squared distances:

$$p_{ij} = \frac{e^{-\|\mathbf{x}_j - \boldsymbol{\mu}_i\|^2}}{\sum_{m} e^{-\|\mathbf{x}_j - \boldsymbol{\mu}_m\|^2}}$$
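The responsibility formula is a softmax over negative squared distances, which can be sketched as below. The function name `responsibilities` and the array layout (clusters along rows, data points along columns, matching the p_ij indexing) are choices made for this illustration.

```python
import numpy as np

def responsibilities(X, mu):
    """p[i, j] = responsibility of cluster i for data point j; columns sum to 1."""
    d2 = ((mu[:, None, :] - X[None, :, :]) ** 2).sum(axis=2)   # shape (k, N)
    logits = -d2
    logits = logits - logits.max(axis=0, keepdims=True)        # avoid underflow in exp
    p = np.exp(logits)
    return p / p.sum(axis=0, keepdims=True)

X = np.array([[0.0], [1.0], [2.0]])      # three 1-D data points
mu = np.array([[0.0], [2.0]])            # two cluster centers
p = responsibilities(X, mu)
```

The middle point lies exactly halfway between the two centers, so both clusters receive responsibility 0.5 for it, as the soft-assignment requirements demand.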





The re-fitting step

• Weight each data point by the responsibility that the cluster has for it.

• Move the mean of the cluster to the center of gravity of the responsibility-weighted data.

$$\boldsymbol{\mu}_i = \frac{\sum_{j=1}^{N} p_{ij}\, \mathbf{x}_j}{\sum_{j=1}^{N} p_{ij}}$$

• j: index over data points

• i: index over Gaussians
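The re-fitting step is a responsibility-weighted average. A minimal sketch, with the hypothetical name `refit_means` and the same (clusters × points) layout for the responsibilities p_ij:

```python
import numpy as np

def refit_means(X, p):
    """mu[i] = sum_j p[i, j] * x_j / sum_j p[i, j]."""
    return (p @ X) / p.sum(axis=1, keepdims=True)

X = np.array([[0.0], [1.0], [4.0]])
p = np.array([[1.0, 0.5, 0.0],      # responsibilities of cluster 0
              [0.0, 0.5, 1.0]])     # responsibilities of cluster 1
mu = refit_means(X, p)              # cluster 0 -> 1/3, cluster 1 -> 3.0
```

The shared middle point pulls both means toward it, but only with half weight; with hard assignments it would count fully for one cluster and not at all for the other.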





Some difficulties with soft k-means

• Not invariant under linear transformations of the data space (scaling, rotating, elongation)

• Looks for spherical clusters: It would be good to allow different shapes for different clusters.

• Sometimes it is better to cluster by using low-density regions to define the boundaries between clusters rather than using high-density regions to define the centers of clusters.





Limitations of K-means

• K-means assumes spherical clusters and equal probabilities for each cluster.

– In reality, clusters may not be spherical

– Solution: Gaussian mixture models

• Clusters can change arbitrarily with different values of K

– As K is increased, cluster memberships change in an arbitrary way; the clusters are not necessarily nested

– Solution: hierarchical clustering

• K-means is sensitive to outliers.

– Solution: use a different loss function.

• K-means works poorly on non-convex clusters.

– Solution: spectral clustering





Gaussian Mixture Models

• Gaussian mixtures:

$$p(\mathbf{x}) = \sum_{k=1}^{K} p(\mathbf{x} \mid c_k, \theta_k)\, \alpha_k$$

Each mixture component is a multidimensional Gaussian with its own mean μk and covariance “shape” Σk

e.g., K=2, 1-dim: {θ, α} = {μ1, σ1, μ2, σ2, α1}





General form of mixture models

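$$p(\mathbf{x}) = \sum_{k=1}^{K} p(\mathbf{x}, c_k) = \sum_{k=1}^{K} p(\mathbf{x} \mid c_k)\, p(c_k) = \sum_{k=1}^{K} p(\mathbf{x} \mid c_k, \theta_k)\, \alpha_k$$

• p(x | c_k, θ_k): component model k

• θ_k: parameters of component k

• α_k: weight of component k

[Figure: two panels plotting p(x) against x, titled "Component Models" and "Mixture Model"]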





Mixture density

$$P(x \mid \theta) = \sum_{k=1}^{K} \underbrace{P(x \mid c_k, \theta_k)}_{\text{component densities}} \cdot \underbrace{P(c_k)}_{\text{mixing parameters}}, \qquad \text{where } \theta = (\theta_1, \theta_2, \ldots, \theta_K)^t$$

• Task

– To use samples drawn from this mixture density to estimate the unknown parameter vector θ.

– Once θ is known, we can decompose the mixture into its components





Learning Mixtures from Data

• Consider fixed K

• e.g., unknown parameters Θ = {μ1, σ1, μ2, σ2, α1, α2}

• Given data D = {x1, …, xN}, we want to find the parameters Θ that “best fit” the data





Maximum Likelihood Principle

• Assume a probabilistic model

• Likelihood = p(data | parameters, model)

• Find the parameters that make the data most likely

$$L(\Theta) = p(D \mid \Theta) = \prod_{i=1}^{N} p(\mathbf{x}_i \mid \Theta)$$

which in the case of a mixture model reduces to

$$\prod_{i=1}^{N} \left[ \sum_{k=1}^{K} p(\mathbf{x}_i \mid c_k, \theta_k)\, \alpha_k \right]$$
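For a concrete feel, the mixture likelihood can be evaluated for a one-dimensional Gaussian mixture. The sketch below works with the log of the product (a sum of logs) to avoid numerical underflow; the function names and the toy data are illustrative assumptions.

```python
import numpy as np

def gaussian_pdf(x, mu, sigma):
    return np.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * np.sqrt(2 * np.pi))

def mixture_log_likelihood(x, mus, sigmas, alphas):
    """log L(Theta) = sum_i log( sum_k alpha_k * p(x_i | c_k, theta_k) )."""
    weighted = np.array([a * gaussian_pdf(x, m, s)
                         for m, s, a in zip(mus, sigmas, alphas)])   # shape (K, N)
    return np.log(weighted.sum(axis=0)).sum()

x = np.array([-2.1, -1.9, 2.0, 2.2])
good = mixture_log_likelihood(x, mus=[-2.0, 2.0], sigmas=[0.5, 0.5], alphas=[0.5, 0.5])
bad = mixture_log_likelihood(x, mus=[0.0, 0.0], sigmas=[0.5, 0.5], alphas=[0.5, 0.5])
```

Parameters that place the components on the two groups of points (`good`) give a much higher log likelihood than parameters that place both components at the origin (`bad`), which is exactly the preference the maximum likelihood principle encodes.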





Aside: maximum likelihood estimates





Maximum likelihood estimates

• When tossed, the thumbtack can land in one of two positions: Head or Tail

• We denote by θ the (unknown) probability P(H).

• Estimation task

– Given a sequence of toss samples x[1], x[2], …, x[M] we want to estimate the probabilities P(H) = θ and P(T) = 1 − θ

[Figure: the two landing positions of the thumbtack, Head and Tail]





Statistical parameter fitting

Consider i.i.d. samples x[1], x[2], …, x[M] such that

• The set of values that X can take is known

• Each is sampled from the same distribution

• Each is sampled independently of the rest

The task is to find a parameter Θ so that the data can be summarized by a probability P(x[j] | Θ).

• The parameters depend on the given family of probability distributions: multinomial, Gaussian, Poisson, etc.





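The Likelihood Function

How good is a particular θ?

It depends on how likely it is to generate the observed data

$$L(\theta : D) = P(D \mid \theta) = \prod_{m} P(x[m] \mid \theta)$$

The likelihood for the sequence H, T, T, H, H is

$$L(\theta : D) = \theta \cdot (1-\theta) \cdot (1-\theta) \cdot \theta \cdot \theta$$

[Figure: plot of L(θ : D) against θ for θ between 0 and 1]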





Likelihood function

• The likelihood function L(θ : D) provides a measure of relative preferences for various values of the parameter θ given a collection of observations D drawn from a distribution that is parameterized by fixed but unknown θ.

• L(θ : D) is the probability of the observed data D considered as a function of θ.

• Suppose data D is 5 heads out of 8 tosses. What is the likelihood function assuming that the observations were generated by a binomial distribution with an unknown but fixed parameter θ?

$$L(\theta : D) = \binom{8}{5}\, \theta^{5} (1-\theta)^{3}$$
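The binomial likelihood for 5 heads in 8 tosses is easy to evaluate directly. The sketch below (function name and grid resolution are arbitrary choices) tabulates it on a grid of θ values and picks the grid point with the largest likelihood.

```python
from math import comb

def binomial_likelihood(theta, heads=5, tosses=8):
    """L(theta : D) = C(tosses, heads) * theta^heads * (1 - theta)^(tosses - heads)."""
    return comb(tosses, heads) * theta**heads * (1 - theta) ** (tosses - heads)

grid = [i / 200 for i in range(201)]          # theta = 0, 0.005, ..., 1
best = max(grid, key=binomial_likelihood)     # grid point with the highest likelihood
```

The maximizing grid point is 0.625 = 5/8, the observed fraction of heads, and the likelihood at θ = 0.5 is 56/256 ≈ 0.219.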





Sufficient Statistics

• To compute the likelihood in the thumbtack example we only require NH and NT (the number of heads and the number of tails)

$$L(\theta : D) = \theta^{N_H} \cdot (1-\theta)^{N_T}$$

• NH and NT are sufficient statistics for the parameter θ that specifies the binomial distribution

• A statistic is simply a function of the data

• A sufficient statistic s for a parameter θ is a function that summarizes, from the data D, the relevant information s(D) needed to compute the likelihood L(θ : D).

• If s is a sufficient statistic for θ and s(D) = s(D’), then L(θ : D) = L(θ : D’)
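A small check makes the sufficiency claim concrete: two toss sequences with the same counts give the same likelihood, and that likelihood can be computed from the counts alone. The function names and the example sequences are illustrative.

```python
from math import isclose, prod

def likelihood(theta, tosses):
    """L(theta : D) as a product over the individual tosses ('H' or 'T')."""
    return prod(theta if t == "H" else 1 - theta for t in tosses)

def likelihood_from_counts(theta, n_heads, n_tails):
    """The same likelihood computed from the sufficient statistics only."""
    return theta**n_heads * (1 - theta) ** n_tails

D1 = "HTTHH"
D2 = "HHHTT"          # different order, same counts (N_H, N_T) = (3, 2)
same = isclose(likelihood(0.3, D1), likelihood(0.3, D2))
```

Once NH and NT are recorded, the original sequence can be discarded without losing any information about θ.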





Maximum Likelihood Estimation

• Main Idea: Learn parameters that maximize the likelihood function

• Maximum likelihood estimation

– Is intuitively appealing

– Is one of the most commonly used estimators in statistics

– Assumes that the parameter to be estimated is fixed, but unknown





Example: MLE for Binomial Data

• Applying the MLE principle we get

$$\hat{\theta} = \frac{N_H}{N_H + N_T}$$

• (Why?)

Example: (NH, NT) = (3, 2)

ML estimate is 3/5 = 0.6

[Figure: plot of L(θ : D) against θ for θ between 0 and 1]





MLE for Binomial data

$$L(\theta : D) = \theta^{N_H} (1-\theta)^{N_T}$$

$$\log L(\theta : D) = N_H \log \theta + N_T \log (1-\theta)$$

The likelihood is positive for all legitimate values of θ

So maximizing the likelihood is equivalent to maximizing its logarithm, i.e., the log likelihood

$$\frac{\partial}{\partial \theta} \log L(\theta : D) = 0 \quad \text{at extrema of } L(\theta : D)$$

$$\frac{\partial}{\partial \theta} \log L(\theta : D) = \frac{N_H}{\theta} - \frac{N_T}{1-\theta} = 0$$

$$\theta\, (N_H + N_T) = N_H$$

$$\theta_{ML} = \frac{N_H}{N_H + N_T}$$

Note that the likelihood is indeed maximized at θ = θML because in the neighborhood of θML, the value of the likelihood is smaller than it is at θ = θML





Maximum and curvature of likelihood around the maximum

• At the maximum, the derivative of the log likelihood is zero

• At the maximum, the second derivative is negative.

• The curvature of the log likelihood is defined as

$$I(\theta) = -\frac{\partial^2}{\partial \theta^2} \log L(\theta : D)$$

• Large observed curvature I(θML) at θ = θML is associated with a sharp peak, intuitively indicating less uncertainty about the maximum likelihood estimate

• I(θML) is called the (observed) Fisher information
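As a worked instance, take the binomial log likelihood log L(θ : D) = NH log θ + NT log(1 − θ) from the thumbtack example. Differentiating twice and evaluating at the ML estimate gives the following; the shorthand N = NH + NT is introduced here only for convenience.

```latex
\begin{align*}
\frac{\partial^2}{\partial\theta^2}\log L(\theta : D)
  &= -\frac{N_H}{\theta^2} - \frac{N_T}{(1-\theta)^2} \\
I(\theta_{ML})
  &= \frac{N_H}{\theta_{ML}^2} + \frac{N_T}{(1-\theta_{ML})^2}
   = \frac{N}{\theta_{ML}\,(1-\theta_{ML})},
  \qquad \theta_{ML} = \frac{N_H}{N}
\end{align*}
```

For (NH, NT) = (3, 2) this gives I(0.6) = 5 / (0.6 × 0.4) ≈ 20.8. Doubling both counts leaves θML unchanged but doubles the curvature, which matches the intuition that more data produce a sharper peak.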





Maximum Likelihood Estimate

ML estimate can be shown to be

• Asymptotically unbiased

$$\lim_{N \to \infty} E\left(\theta_{ML}^{N}\right) = \theta_{True}$$

• Asymptotically consistent: converges to the true value as the number of examples approaches infinity

$$\lim_{N \to \infty} \Pr\left\{ \left|\theta_{ML}^{N} - \theta_{True}\right| \le \varepsilon \right\} = 1, \qquad \lim_{N \to \infty} E\left(\theta_{ML}^{N} - \theta_{True}\right)^2 = 0$$

• Asymptotically efficient: achieves the lowest variance that any estimate can achieve for a training set of a certain size (satisfies the Cramer-Rao bound)





Maximum Likelihood Estimate

• ML estimate can be shown to be representationally invariant: if θML is an ML estimate of θ, and g(θ) is a function of θ, then g(θML) is an ML estimate of g(θ)

• When the number of samples is large, θML is approximately Gaussian-distributed with mean θTrue (the actual value of the parameter). This is a consequence of the central limit theorem: a sum of a large number of independent random variables is approximately Gaussian, and the ML estimate is related to such a sum

• We can use the likelihood ratio to reject the null hypothesis corresponding to θ = θ0 as unsupported by data if the ratio of the likelihoods evaluated at θ0 and at θML is small. (The ratio can be calibrated when the likelihood function is approximately quadratic)





The EM Algorithm

Dempster, Laird, and Rubin (1977)

• General framework for likelihood-based parameter estimation with hidden variables

• Start with initial guesses of parameters

• E-step: estimate memberships given parameters

• M-step: estimate parameters given memberships

• Repeat until convergence

• Converges to a (local) maximum of likelihood

• E-step and M-step are often computationally simple

• Generalizes to maximum a posteriori (with priors)





The Gaussian Distribution

• Multivariate Gaussian, parameterized by a mean and a covariance

• Maximum likelihood estimation
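For reference, the standard form of the multivariate Gaussian density and its maximum likelihood estimates can be written as below. The symbols d (dimension of x), N (number of data points) and x_n (the n-th data point) are notation assumed here; μ and Σ follow the mean and covariance used elsewhere in these slides.

```latex
\begin{align*}
\mathcal{N}(\mathbf{x} \mid \boldsymbol{\mu}, \Sigma)
  &= \frac{1}{(2\pi)^{d/2}\,|\Sigma|^{1/2}}
     \exp\!\left( -\tfrac{1}{2} (\mathbf{x}-\boldsymbol{\mu})^{T} \Sigma^{-1} (\mathbf{x}-\boldsymbol{\mu}) \right) \\
\hat{\boldsymbol{\mu}}_{ML} &= \frac{1}{N} \sum_{n=1}^{N} \mathbf{x}_n, \qquad
\hat{\Sigma}_{ML} = \frac{1}{N} \sum_{n=1}^{N}
     (\mathbf{x}_n - \hat{\boldsymbol{\mu}}_{ML})(\mathbf{x}_n - \hat{\boldsymbol{\mu}}_{ML})^{T}
\end{align*}
```

The ML estimates are simply the sample mean and the (1/N-normalized) sample covariance, which is what makes the single-Gaussian case a closed-form problem.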





Gaussian Mixture Model

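• Linear combination of Gaussians

$$p(\mathbf{x}) = \sum_{k=1}^{K} \alpha_k\, \mathcal{N}(\mathbf{x} \mid \boldsymbol{\mu}_k, \Sigma_k), \qquad \text{where } \sum_{k=1}^{K} \alpha_k = 1,\ \alpha_k \ge 0$$

• Parameters to be estimated: the weights αk, means μk, and covariances Σk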





Gaussian Mixture

• To generate a data point:

– First pick one of the components, with probability given by its mixing weight αk

– Then draw a sample from that component distribution

• Each data point is generated by one of K components; a latent variable is associated with each data point
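The two-stage generative process can be sketched for a one-dimensional mixture. The function name `sample_gmm` and the parameter values are illustrative; the array `z` holds the latent component index of each data point.

```python
import numpy as np

def sample_gmm(n, alphas, mus, sigmas, seed=0):
    """Draw n points from a 1-D Gaussian mixture; also return the latent labels."""
    rng = np.random.default_rng(seed)
    z = rng.choice(len(alphas), size=n, p=alphas)               # pick a component per point
    x = rng.normal(np.asarray(mus)[z], np.asarray(sigmas)[z])   # sample from that component
    return x, z

x, z = sample_gmm(1000, alphas=[0.3, 0.7], mus=[-5.0, 5.0], sigmas=[1.0, 1.0])
```

In a clustering problem only `x` is observed; the labels `z` are exactly the hidden information that mixture fitting has to work without.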





Gaussian Mixture

• Loss function: the negative log likelihood of the data.

– Equivalently, maximize the log likelihood.

• Without knowing values of latent variables, we have to maximize the incomplete log likelihood.

– Sum over components appears inside the logarithm, no closed-form solution.





Fitting the Gaussian Mixture

• Given the complete data set (the data points together with their latent component labels)

– Maximize the complete log likelihood.

– Trivial closed-form solution: fit each component to the corresponding set of data points.

– Observe that if all the mixing weights are equal and all the covariances are equal, then the complete log likelihood is exactly the loss function used in K-means.

• Need a procedure that would let us optimize the incomplete log likelihood by working with the (easier) complete log likelihood instead.





The Expectation-Maximization (EM) Algorithm

• E-step: for given parameter values we can compute the expected values of the latent variables (responsibilities of data points) using Bayes rule

– Note that the responsibilities take values in [0, 1] instead of {0, 1}, but we still have that they sum to 1 over the components





The EM Algorithm

• M-step: maximize the expected complete log likelihood

• Parameter update:
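In standard form, the E-step responsibilities and the M-step parameter updates for a Gaussian mixture can be written as follows. The symbol γ_ik (responsibility of component k for data point i) and the effective count N_k are notation assumed here; α_k, μ_k and Σ_k are the mixing weights, means and covariances used elsewhere in these slides.

```latex
\begin{align*}
\text{E-step:}\quad
\gamma_{ik} &= \frac{\alpha_k\, \mathcal{N}(\mathbf{x}_i \mid \boldsymbol{\mu}_k, \Sigma_k)}
                    {\sum_{m=1}^{K} \alpha_m\, \mathcal{N}(\mathbf{x}_i \mid \boldsymbol{\mu}_m, \Sigma_m)} \\
\text{M-step:}\quad
N_k &= \sum_{i=1}^{N} \gamma_{ik}, \qquad
\boldsymbol{\mu}_k = \frac{1}{N_k} \sum_{i=1}^{N} \gamma_{ik}\, \mathbf{x}_i, \\
\Sigma_k &= \frac{1}{N_k} \sum_{i=1}^{N} \gamma_{ik}\,
            (\mathbf{x}_i - \boldsymbol{\mu}_k)(\mathbf{x}_i - \boldsymbol{\mu}_k)^{T}, \qquad
\alpha_k = \frac{N_k}{N}
\end{align*}
```

The mean update has the same form as the re-fitting step of soft k-means; the mixture model additionally re-estimates a covariance and a weight for each component.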





The EM Algorithm

• Iterate E-step and M-step until the log likelihood of data does not increase any more.

– Converges to local optima.

– Need to restart algorithm with different initial guesses of parameters (as in K-means).

• Relation to K-means

– Consider GMM with common covariance.

– As the common variance tends to zero, the two methods coincide.
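The full EM loop can be sketched for a one-dimensional Gaussian mixture. This is a minimal illustration under stated assumptions: the function name `em_gmm_1d`, the stopping tolerance, the small floor on the standard deviations (to keep a component from collapsing onto a single point) and the synthetic data are all choices made here, not part of the slides.

```python
import numpy as np

def em_gmm_1d(x, mus, sigmas, alphas, n_iter=100, tol=1e-8):
    """EM for a 1-D Gaussian mixture; returns fitted parameters and log likelihood."""
    mus, sigmas, alphas = (np.array(v, dtype=float) for v in (mus, sigmas, alphas))
    prev = -np.inf
    for _ in range(n_iter):
        # E-step: responsibilities via Bayes rule, shape (K, N)
        z = (x - mus[:, None]) / sigmas[:, None]
        dens = alphas[:, None] * np.exp(-0.5 * z**2) / (sigmas[:, None] * np.sqrt(2 * np.pi))
        total = dens.sum(axis=0)
        gamma = dens / total
        ll = np.log(total).sum()            # log likelihood under the current parameters
        if ll - prev < tol:                 # stop when it no longer increases
            break
        prev = ll
        # M-step: responsibility-weighted re-estimation of the parameters
        Nk = gamma.sum(axis=1)
        mus = (gamma @ x) / Nk
        var = (gamma * (x - mus[:, None]) ** 2).sum(axis=1) / Nk
        sigmas = np.maximum(np.sqrt(var), 1e-3)
        alphas = Nk / len(x)
    return mus, sigmas, alphas, ll

rng = np.random.default_rng(0)
x = np.concatenate([rng.normal(-3.0, 1.0, 200), rng.normal(3.0, 1.0, 300)])
mus, sigmas, alphas, ll = em_gmm_1d(x, mus=[-1.0, 1.0], sigmas=[1.0, 1.0], alphas=[0.5, 0.5])
```

On this well-separated data the fitted means end up near −3 and 3 and the weights near 0.4 and 0.6. With overlapping components or a poor starting point the result depends on the initialization, which is why restarts are recommended.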


















K-means vs GMM

K-means:

• Loss function:

– Minimize sum of squared Euclidean distance.

• Can be optimized by an EM algorithm.

– E-step: assign points to clusters.

– M-step: optimize clusters.

– Performs hard assignment during E-step.

• Assumes spherical clusters with equal probability of a cluster.

GMM:

• Loss function:

– Minimize the negative log-likelihood.

• EM algorithm

– E-step: compute posterior probability of membership.

– M-step: optimize parameters.

– Performs soft assignment during E-step.

• Can be used for non-spherical clusters. Can generate clusters with different probabilities.





K-medoids

• K-means is not robust.

– Squared Euclidean distance emphasizes more distant points and is sensitive to outliers

• K-medoids

– Is less sensitive to outliers

– Needs only the dissimilarity matrix (so can work in settings where we do not have the raw data)

– Works in settings where the data do not reside in Euclidean space





K-medoids

• Restrict the prototypes to one of the data points assigned to the cluster.

• E-step: fix the prototypes and minimize w.r.t. the cluster assignments

– Assigns each data point to its nearest prototype

• M-step: fix the cluster assignments and minimize w.r.t. the prototypes.

Example

• Use L1 distance instead of squared Euclidean distance.

• Prototype is the median of points in a cluster.
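The alternation can be sketched directly on a dissimilarity matrix, which is all K-medoids needs. The function name `k_medoids`, the toy data with one outlier and the initial medoids are illustrative assumptions; the sketch also assumes the initial medoids are distinct points, so that no cluster becomes empty.

```python
import numpy as np

def k_medoids(D, medoids, n_iter=100):
    """Alternate assignment and medoid update on an n x n dissimilarity matrix D."""
    medoids = list(medoids)
    for _ in range(n_iter):
        labels = D[:, medoids].argmin(axis=1)            # E-step: nearest prototype
        new = []
        for c in range(len(medoids)):                    # M-step: best member per cluster
            members = np.flatnonzero(labels == c)
            within = D[np.ix_(members, members)].sum(axis=0)
            new.append(int(members[within.argmin()]))    # minimizes total dissimilarity
        if new == medoids:
            break
        medoids = new
    return medoids, labels

x = np.array([0.0, 1.0, 2.0, 10.0, 11.0, 12.0, 100.0])   # the last point is an outlier
D = np.abs(x[:, None] - x[None, :])                      # L1 dissimilarities
medoids, labels = k_medoids(D, medoids=[0, 1])
```

On this data the procedure settles on the points with values 1 and 11 as medoids. The outlier at 100 joins the second cluster but does not drag its prototype away, whereas the mean of that cluster would be pulled to about 33.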





Hierarchical Clustering

• Organize the clusters in a hierarchical way

• Produces a rooted (binary) tree (dendrogram)

[Figure: objects a, b, c, d, e grouped step by step (Step 0 to Step 4); the two directions are labeled agglomerative and divisive]





Hierarchical Clustering

• Two kinds of strategy

– Bottom-up (agglomerative): Recursively merge two groups with the smallest between-cluster dissimilarity (defined later on).

– Top-down (divisive): In each step, split a least coherent cluster (e.g. largest diameter); splitting a cluster is also a clustering problem (usually in a greedy fashion)





Hierarchical Clustering

• User can choose a cut through the hierarchy to represent the most natural division into clusters

– e.g., choose the cut where intergroup dissimilarity exceeds some threshold

[Figure: dendrogram over objects a, b, c, d, e with two cuts, labeled 3 and 2]





Hierarchical Clustering

• We have to measure the dissimilarity between two disjoint groups G and H; it is computed from the pairwise dissimilarities between members of G and members of H

– Single Linkage: tends to yield extended clusters.

– Complete Linkage: tends to yield spherical clusters.

– Group Average: tradeoff between the above. Not invariant under monotone increasing transform.
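Under the usual definitions, single linkage takes the smallest pairwise dissimilarity between the two groups, complete linkage the largest, and group average the mean over all pairs. A minimal sketch (the function name and toy data are illustrative):

```python
import numpy as np

def linkage_dissimilarity(D, G, H, method="single"):
    """Dissimilarity between disjoint index groups G and H, given pairwise matrix D."""
    block = D[np.ix_(G, H)]          # all pairwise dissimilarities between G and H
    if method == "single":
        return block.min()
    if method == "complete":
        return block.max()
    if method == "average":
        return block.mean()
    raise ValueError(f"unknown method: {method}")

x = np.array([0.0, 1.0, 5.0, 7.0])
D = np.abs(x[:, None] - x[None, :])
G, H = [0, 1], [2, 3]
single = linkage_dissimilarity(D, G, H, "single")        # 4.0
complete = linkage_dissimilarity(D, G, H, "complete")    # 7.0
average = linkage_dissimilarity(D, G, H, "average")      # 5.5
```

Single and complete linkage depend only on the ordering of the pairwise dissimilarities, which is why they are unaffected by monotone increasing transforms, while the group average is not.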





Hierarchical Clustering of Gene Expression Data

1. Calculate the distance between all genes. Find the smallest distance. If several pairs share the same distance, use a predetermined rule to decide between alternatives.

2. Fuse the two selected clusters to produce a new cluster that now contains at least two objects. Calculate the distance between the new cluster and all other clusters.

3. Repeat steps 1 and 2 until only a single cluster remains.

4. Draw a tree representing the results.

[Figure: genes G1 to G6 and the resulting tree, with leaves in the order G1, G6, G3, G5, G4, G2]
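Steps 1 to 4 can be sketched as a short agglomerative loop. The sketch assumes single linkage for the between-cluster distance and resolves ties by taking the first pair in index order (one possible "predetermined rule"); the function name, the gene labels and the one-dimensional toy "expression values" are illustrative. The tree is returned as a nested-parenthesis string rather than drawn.

```python
import numpy as np

def agglomerative(D, labels):
    """Single-linkage agglomerative clustering on a distance matrix D.
    Returns the list of merges and the final tree as a nested string."""
    clusters = {i: [i] for i in range(len(D))}
    names = dict(enumerate(labels))
    merges = []
    while len(clusters) > 1:
        keys = sorted(clusters)
        # Step 1: closest pair of clusters (ties go to the first pair in index order)
        d, a, b = min((D[np.ix_(clusters[a], clusters[b])].min(), a, b)
                      for i, a in enumerate(keys) for b in keys[i + 1:])
        # Step 2: fuse the pair; distances to the new cluster are recomputed from D
        merges.append((names[a], names[b], float(d)))
        clusters[a] = clusters[a] + clusters.pop(b)
        names[a] = f"({names[a]},{names.pop(b)})"
    return merges, names[next(iter(clusters))]

x = np.array([0.0, 0.5, 3.0, 3.2, 10.0])
D = np.abs(x[:, None] - x[None, :])
merges, tree = agglomerative(D, ["G1", "G2", "G3", "G4", "G5"])
```

Here G3 and G4 are fused first, then G1 and G2, then those two pairs, and finally G5 joins. Swapping `.min()` for `.max()` or `.mean()` in step 1 would give complete or group-average linkage.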





Example: Human Tumor Microarray Data

• 6830×64 matrix of real numbers.

• Rows correspond to genes, columns to tissue samples.

• Clustering rows (genes) can provide information about functions of genes with unknown function based on functions of genes with similar expression profiles

• Clustering columns (samples) can help identify disease profiles: tissues with similar disease should yield similar expression profiles. Note that the number of “samples” is smaller than the number of variables (genes)

[Figure: gene expression matrix]





Example: Human Tumor Microarray Data

• 6830×64 matrix of real numbers

• Hierarchical clustering of the microarray data

– Applied separately to rows and columns.

– Subtrees with tighter clusters placed on the left.

– Produces a more informative picture of genes and samples than the randomly ordered rows and columns.
