
Vasant Honavar, Computational Systems Biology, ISU, Spring 2008

Iowa State University
Department of Computer Science

Center for Computational Intelligence, Learning, and Discovery

Bioinformatics and Computational Biology Program

Vasant Honavar
Artificial Intelligence Research Laboratory

Department of Computer Science
Bioinformatics and Computational Biology Graduate Program
Center for Computational Intelligence, Learning, & Discovery

Iowa State University
[email protected]

www.cs.iastate.edu/~honavar/
www.cs.iastate.edu/~honavar/aigroup.html

www.cild.iastate.edu
www.bcb.iastate.edu
www.igert.iastate.edu

Gene expression data analysis


Generic features of microarray technologies

• Probe: Biochemical agent that finds or complements a specific sequence of DNA, RNA, or protein from a test sample, e.g., cDNA amplified from a vector stored in a bacterial clone, or oligonucleotides cloned in a fluid medium.

• Array: The method for placing the probes on a medium (e.g., glass, silicon) using robotic spotting, photolithography, etc.

• Sample Probe: Mechanism for preparing RNA from test samples: RNA is extracted, mRNA is selected using polydeoxythymidine (poly-dT) to bind the polyadenine (poly-A) tail, and the mRNA is copied into cDNA using labeled nucleotides.

• Assay: Mechanism for transducing gene expression into something easily measurable, e.g., hybridization.

• Readout: Measuring the transduced signal, e.g., colored dyes, radioactive labels, etc.


Observing the transcriptome

• Focused experimental approaches
  – Northern blotting analysis
  – Real-time PCR (quantitative or semi-quantitative)
• High-throughput approaches:
  – Microarray expression profiling
    • cDNA
    • Affymetrix
  – Serial analysis of gene expression (SAGE)
  – Massively Parallel Signature Sequencing (MPSS)


Observing the transcriptome

• Focused experimental approaches
  – Northern blotting analysis
  – Real-time PCR (quantitative or semi-quantitative)
• High-throughput approaches:
  – Closed system profiling:
    • Microarray expression profiling
  – Open system profiling:
    • Serial analysis of gene expression (SAGE)
    • Massively Parallel Signature Sequencing (MPSS)


Limit of Detection: 1 in 30,000 transcripts

~ 20 transcripts/cell

Red – increase of Cy5 sample transcripts

Green – increase of Cy3 sample transcripts

Yellow – equal abundance


Experimental overview:

• Cell population A and cell population B
• RNA extraction
• Reverse transcription
• Klenow label incorporation: sample A labelled with Cy5 dye, sample B labelled with Cy3 dye
• Hybridisation
• Washing
• Scan Cy5 channel; scan Cy3 channel
• “Overlay images”
• Quantify pixel intensities


• Isotope: nylon; cDNA (300-900 nt)
• Two-colour: glass; cDNA or oligo (80 nt); 500 – 11,000 elements
• Affymetrix: silicon; oligo (20 nt); 22,000 elements
• Tissue arrays: glass; tissue discs (20-150)


Affymetrix GeneChip®

Limit of detection: 1 in 100,000 transcripts

~ 5 transcripts/cell


http://www.affymetrix.com


Affy Gene Expression Arrays     Transcripts/Genes
Arabidopsis Genome              24,000
C. elegans Genome               22,500
Drosophila Genome               18,500
E. coli Genome                  20,366
Human Genome U133 Plus          47,000
Mouse Genome                    39,000
Yeast Genome                    5,841 (S. cerevisiae) & 5,031 (S. pombe)
Rat Genome                      30,000
Zebrafish                       14,900
Plasmodium/Anopheles            4,300 (P. falciparum) & 14,900 (A. gambiae)

Barley (25,500), Soybean (37,500 + 23,300 pathogen), Grape (15,700)
Canine (21,700), Bovine (23,000)
B. subtilis (5,000), S. aureus (3,300 ORFs), Xenopus (14,400)


Microarray and Gene Chip Approaches

Advantages:
• Rapid
• Method and data analysis well described and supported
• Robust
• Convenient for directed and focused studies

Disadvantages:
• Closed system approach
• Difficult to correlate with absolute transcript number
• Sensitive to alternative splicing ambiguities


Serial Analysis of Gene Expression (SAGE)
Velculescu et al., Science 1995

A transcript (known or novel) can be recognised by a small subset (e.g. 14) of its nucleotides – a tag

Linking tags allows for rapid sequencing.

Open system for transcript profiling

[Figure: a 14 nt tag (TAG) is taken from each polyadenylated transcript (AAAAAAAAA – 3’); the tags are linked (TAG TAG TAG) and the linked tags are sequenced, e.g.:]

AGCTTGAACCGTGACATCATGGCCATTGGCCCCAATTGAGACAGTGAGTTCAATGC

Modified SAGE methods

LongSAGE (21 nt)

SAGE-lite, micro-SAGE, mini-SAGE

RASL/DASL methods (5’ and 3’ Tags)
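
The tag idea can be illustrated with a short Python sketch. It assumes, purely for illustration, that a sequenced concatemer is a plain run of fixed-length 14 nt tags with no linker or anchoring-enzyme sites (real SAGE concatemers contain both). `split_tags` and `count_tags` are hypothetical helper names; the 56 nt sequence is the one shown on the SAGE slide.

```python
from collections import Counter

TAG_LENGTH = 14  # tag length quoted on the slide


def split_tags(concatemer, tag_length=TAG_LENGTH):
    """Cut a concatemer into consecutive fixed-length tags (assumed layout)."""
    usable = len(concatemer) - len(concatemer) % tag_length
    return [concatemer[i:i + tag_length] for i in range(0, usable, tag_length)]


def count_tags(concatemers):
    """Digital expression profile: how many times each tag is observed."""
    counts = Counter()
    for seq in concatemers:
        counts.update(split_tags(seq))
    return counts


sequence = "AGCTTGAACCGTGACATCATGGCCATTGGCCCCAATTGAGACAGTGAGTTCAATGC"
tags = split_tags(sequence)

# A second, shorter read that repeats the first tag, to show counting.
profile = count_tags([sequence, sequence[:TAG_LENGTH]])
```

Counting identical tags is what makes the SAGE readout "digital": the profile is a table of tag counts rather than an analogue intensity.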


SAGE

Advantages:
• Potential ‘open’ system method – new transcripts can be identified
• Accuracy of unambiguous transcript observation
• Digital output of data
• Quantitative and qualitative information

Disadvantages:
• Characterising novel transcripts from short tag sequences is often computationally difficult
• Tag specificity (tag length recently increased to 21 bp)
• Length of tags can vary (restriction enzyme activity varies with temperature)
• A subset of transcripts do not contain the enzyme recognition sequence
• Sensitive to a subset of alternative splice variants


[Figure: microarray analysis workflow. Biological question → experimental design and platform choice → microarray experiment → image analysis (16-bit TIFF files; (Rspot, Rbkg), (Gspot, Gbkg)) → normalization → data mining with sample attributes (clustering, pattern discovery, classification, statistical analysis) → biological verification and interpretation]


Analysis

[Figure: three samples (liver, brain, lymphocyte), each yielding 47,000 x 2 x 2 = 188,000 datapoints]


Analysis

Basic problem: Given a large dataset with experimental and biological noise, find
• Robust patterns (common themes or differences)
• Similarities or differences between samples


Gene expression analysis

Which transcripts are different?

What are the patterns?

Liver Brain Lymphocytes


Fold changes

• cDNA data are typically transformed from raw intensities into log intensities before further analysis
• This helps ensure
  – Even spread of features across the intensity range
  – Variability constant across intensity levels
  – Approximately normal distribution of experimental errors
  – Approximately normal distribution of intensities (after log transformation)
• Ratio of raw Cy3 and Cy5 intensities is transformed into the difference between the log2 of the corresponding values
  – Two-fold up-regulated genes ∼ log ratio of 1
  – Two-fold down-regulated genes ∼ log ratio of -1
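
A minimal sketch of the log-ratio transformation, assuming the ratio is taken as Cy5 over Cy3 (the slide does not say which channel is the numerator). `log2_ratio` is a hypothetical helper name and the intensities are made up.

```python
import math


def log2_ratio(cy5, cy3):
    """Log ratio as a difference of log2 intensities: log2(cy5) - log2(cy3)."""
    return math.log2(cy5) - math.log2(cy3)


up = log2_ratio(2000.0, 1000.0)    # two-fold up-regulated gene
down = log2_ratio(500.0, 1000.0)   # two-fold down-regulated gene
same = log2_ratio(750.0, 750.0)    # unchanged gene
```

On the log scale, up- and down-regulation by the same factor are symmetric around zero, which is not true of the raw ratios (2 versus 0.5).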


Fold changes - some words of caution

• Expression measurements from the same RNA samples do not always agree perfectly
• Even when the replicate measurements are highly correlated (Pearson correlation coefficient close to 1), the observed fold changes may not agree

Example:
• Replicate 1:
  – Measurements of g1 and g2 under two conditions (c1, c2): (1.0, 2.0), (500, 1000)
  – Fold change from c1 to c2 = 2 for both g1 and g2!
• Replicate 2:
  – Measurements of g1 and g2 under two conditions (c1, c2): (1.5, 2.0), (500.5, 1000)
  – Fold change from c1 to c2 = 1.33 for g1 and 1.998 for g2!
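
The example can be checked numerically. The sketch below uses only the four measurements per replicate given on the slide; `pearson` and `fold_changes` are hypothetical helper names, and the ordering of the values in each list is a choice made here for illustration.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)


def fold_changes(rep):
    """Fold change from c1 to c2 for g1 and for g2."""
    return rep[1] / rep[0], rep[3] / rep[2]


# Order within each list: (g1, c1), (g1, c2), (g2, c1), (g2, c2).
replicate1 = [1.0, 2.0, 500.0, 1000.0]
replicate2 = [1.5, 2.0, 500.5, 1000.0]

r = pearson(replicate1, replicate2)   # very close to 1
fc1 = fold_changes(replicate1)        # g1 and g2 both two-fold
fc2 = fold_changes(replicate2)        # g1 drops to about 1.33, g2 stays near 2
```

The same absolute disturbance (0.5 units) barely moves the correlation or the fold change of the high-intensity gene, but changes the fold change of the low-intensity gene substantially.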


Sources of noise: Biological, Inter-chip

• Expression levels depend on
  – Intracellular factors (the stage of the cell cycle)
  – Extrinsic factors (signals from other cells)

• Expression level depends on complex interaction of activators and inhibitors

• Even in tissue culture dishes (with a single cell type) RNA levels within each cell can vary

• The overall expression level is taken from RNA collected from a pool of cells


Normalization – why do we need it?

           array 1     array 2
gene1      3308        4947.5
gene2      2334        3155.5
gene3      2518        3738
gene4      8882.5      18937
gene5      5041        12956.5
gene6      7314.5      19013.5
gene7      3508.5      8164
gene8      2183        5121.5
gene9      4790        8082
gene10     1645.5      1794.5
gene11     1772        1963
gene12     1802.5      2186.5
gene13     14846       35811
gene14     9986        25293
gene15     11640.5     21508
gene16     3860        6530
average    5339.5      11200.09

Intensities in array 2 are intrinsically larger than array 1


Experimental, intra-chip noise

• Different labeling efficiency
• Different hybridization time or hybridization condition
• Different scanning sensitivity

Many techniques have been developed to factor out these sources of variation from cDNA, Affy, and other data


Normalization

• Informally, normalization refers to transformations applied to data from experiments to render them comparable to one another
  – In a probabilistic or statistical sense
    • e.g., normalize using the mean or median of expression values for each dye
  – Taking quantitative account of assumptions about how the data were generated

Example:
• Array preparations are susceptible to variations due to differences in dye properties, dye incorporation, or scanning
• The goal of normalization is to factor out these variations


Normalization

• Normalization of a collection of data sets is carried out with respect to a reference data set

• Common choices of reference:
  – Data from one particular experiment (reference)
  – A postulated distribution of expression values
    • Average expression ratio = 1
    • Total mRNA is approximately constant across the two samples


Example: Global Normalisation to factor out the differences due to dyes used

• Global Normalisation methods assume the two dyes R, G are related by a constant factor

• Taking logs

• How to find c?
  – Regression analysis – assumes that most genes are NOT differentially expressed across the two conditions
  – Housekeeping genes – use mRNAs that have a known effect on the two dyes under all experimental conditions
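
The equations on this slide did not survive extraction. A standard formulation consistent with the symbols used here (a constant factor relating R and G, and a log-scale offset c) is sketched below. The symbol k for the constant factor is taken from the later Global Normalization slide; the exact form shown on the original slide is an assumption.

```latex
R = k \cdot G
\quad\Longrightarrow\quad
\log_2 \frac{R}{G} \;\longrightarrow\; \log_2 \frac{R}{G} - c
  = \log_2 \frac{R}{k\,G},
\qquad c = \log_2 k .
```

Under this reading, "finding c" means estimating a single offset that centres the log ratios of the genes assumed not to change.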


Normalization: Linear regression of Cy5 against Cy3

• If the two channels behave similarly, the scatter plot of Cy5 against Cy3 should have a slope of 1 and intercept of 0

• A nonzero intercept means the response of one channel is consistently higher than the other

• A slope greater than 1 implies one channel responds more strongly than the other at higher intensities

• A deviation from a straight line suggests nonlinear effects

[Figure: two Cy5 versus Cy3 scatter plots]
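
The slope and intercept checks can be sketched with an ordinary least-squares fit. The intensities below are synthetic (made up so that Cy5 reads 20% higher than Cy3 with a constant offset of 50), and `fit_line` is a hypothetical helper; a real analysis would typically use a robust fit and restrict it to genes assumed not to be differentially expressed.

```python
def fit_line(x, y):
    """Ordinary least-squares fit y = intercept + slope * x."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((xi - mx) ** 2 for xi in x)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return slope, my - slope * mx


# Synthetic channel intensities (illustrative only).
cy3 = [100.0, 400.0, 900.0, 2500.0, 6400.0, 12000.0]
cy5 = [1.2 * g + 50.0 for g in cy3]

slope, intercept = fit_line(cy3, cy5)

# Undo the fitted linear relationship so Cy5 is on the Cy3 scale.
cy5_corrected = [(r - intercept) / slope for r in cy5]
```

After the correction, a scatter plot of `cy5_corrected` against `cy3` has slope 1 and intercept 0, which is the behaviour the slide describes for two channels that respond identically.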


Modeling variations across arrays

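$\theta_{gs}$: the underlying expression level; $x_{gs}$: the observed intensity.

$$
\begin{aligned}
\text{array 1:}\quad & x_{g1} = f_1(\theta_{g1}) \\
\text{array 2:}\quad & x_{g2} = f_2(\theta_{g2}) \\
& \;\vdots \\
\text{array } S\text{:}\quad & x_{gS} = f_S(\theta_{gS})
\end{aligned}
$$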


Normalization: constant scaling

• Distributions on each array are scaled to have identical mean.

• Applied in MAS commercial software

Suppose array 1 is the reference array. Then

$$
x'_{gs} = x_{gs} \cdot \frac{\bar{x}_1}{\bar{x}_s}
$$

where $\bar{x}_s$ is the mean intensity of array $s$.


Normalization by Constant scaling: Underlying reasoning

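Assumption:

$$
\begin{aligned}
\text{array 1:}\quad & x_{g1} = f_1(\theta_{g1}) = \alpha_1 \theta_{g1} \\
\text{array 2:}\quad & x_{g2} = f_2(\theta_{g2}) = \alpha_2 \theta_{g2} \\
& \;\vdots \\
\text{array } S\text{:}\quad & x_{gS} = f_S(\theta_{gS}) = \alpha_S \theta_{gS}
\end{aligned}
$$

Suppose array 1 is the reference array. $\alpha_s$ is not estimable, but $\alpha_1 / \alpha_s$ can be estimated as $\bar{x}_1 / \bar{x}_s$ if their overall expression levels are roughly the same ($\bar{\theta}_1 \approx \bar{\theta}_s$), $s = 2, \ldots, S$. Then

$$
x'_{gs} = x_{gs} \cdot \frac{\bar{x}_1}{\bar{x}_s} \approx \frac{\alpha_1}{\alpha_s}\, x_{gs} = \alpha_1 \theta_{gs}.
$$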
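
Constant scaling to a common mean can be sketched with the two arrays from the "why do we need it?" slide, taking array 1 as the reference. `mean` and `scale_to_reference` are hypothetical helper names; this illustrates the idea only and is not the MAS implementation.

```python
def mean(values):
    """Arithmetic mean of a list of intensities."""
    return sum(values) / len(values)


def scale_to_reference(array, reference):
    """Rescale `array` so that its mean equals the mean of `reference`."""
    factor = mean(reference) / mean(array)
    return [x * factor for x in array]


# Intensities for gene1..gene16, copied from the slide.
array1 = [3308, 2334, 2518, 8882.5, 5041, 7314.5, 3508.5, 2183,
          4790, 1645.5, 1772, 1802.5, 14846, 9986, 11640.5, 3860]
array2 = [4947.5, 3155.5, 3738, 18937, 12956.5, 19013.5, 8164, 5121.5,
          8082, 1794.5, 1963, 2186.5, 35811, 25293, 21508, 6530]

array2_scaled = scale_to_reference(array2, array1)  # array 1 is the reference
```

Every value on array 2 is multiplied by the same factor, so gene-to-gene ratios within the array are unchanged; only the overall level is brought in line with the reference.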


Local Normalizations

• The dye factor is dependent on:
  – Spot intensity (A=RG)
  – Location on the array

• Local normalisation methods:
  – Intensity dependent
  – Print-tip group


Intensity dependent

• Visualise the effect: M-A plot

• Correction of the intensity-dependent variations:

Picture from: http://www.stat.berkeley.edu/users/terry/zarray/Html/index.html

print-tip effect


Print-tip Group

• Different experiments may use different printing set-ups:
  – Layout of the tips in the print-head of the arrayer
  – Differences in the length or opening of the tips
  – Deformation

• Print-tip normalisation is simply (print-tip + A)-dependent normalisation


Global Normalization

• Global Normalization methods assume the two dyes are related by a constant factor k

• Taking logs


Beyond linear regression

• Average-intensity-dependent normalization:
  – Robust nonlinear regression (Lowess) applied on whole genome (Speed et al.)
  – Select invariant genes computationally (rank-invariant method), then apply Lowess (Wong et al.)


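Normalization: M-A plot

[Figure: M-A plot with M on the vertical axis, A on the horizontal axis, and the line M = 0 marked]

$$
M = \log\left(\frac{x_{gs}}{x_{g1}}\right), \qquad
A = \frac{\log\left(x_{g1} \cdot x_{gs}\right)}{2}
$$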


Normalization

M-A plot shows the need for non-linear normalization. The normalization factor is a function of the expression level.

For non-differentially expressed genes ($\theta_{gs} = \theta_{g1}$):

$$
M = \log\left(\frac{x_{gs}}{x_{g1}}\right)
  = \log\left(\frac{\alpha_s \theta_{gs}}{\alpha_1 \theta_{g1}}\right)
  = \log\left(\frac{\alpha_s}{\alpha_1}\right)
  = \text{constant}
$$


Normalization

$\hat{M} = \hat{f}(A)$, fit by the ‘Lowess’ function in S-Plus

Normalized log ratio:

$$
\tilde{M} = M - \hat{M}
$$

Replicate arrays: the same pool of sample is applied to each array
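
The intensity-dependent correction (fit M as a smooth function of A, then subtract the fit) can be sketched in plain Python. The smoother below is a simplified stand-in for Lowess: a single pass of tricube-weighted local linear regression, without the robustness iterations of the real procedure, and it is not the S-Plus function. Base-2 logs are assumed, the data are synthetic, and all function names are hypothetical.

```python
import math


def ma_values(x_ref, x_s):
    """M = log2(x_s / x_ref) and A = 0.5 * log2(x_ref * x_s) for each gene."""
    m = [math.log2(s / r) for r, s in zip(x_ref, x_s)]
    a = [0.5 * math.log2(r * s) for r, s in zip(x_ref, x_s)]
    return m, a


def tricube(u):
    """Tricube kernel weight for a scaled distance u >= 0."""
    return (1.0 - u ** 3) ** 3 if u < 1.0 else 0.0


def local_linear_fit(a, m, frac=0.5):
    """Estimate M_hat = f(A) at every A by weighted local linear regression."""
    n = len(a)
    k = max(3, math.ceil(frac * n))  # neighbours used in each local fit
    fitted = []
    for a0 in a:
        dists = sorted(abs(ai - a0) for ai in a)
        h = dists[min(k, n) - 1] * 1.0001 + 1e-12  # local bandwidth
        w = [tricube(abs(ai - a0) / h) for ai in a]
        sw = sum(w)
        swa = sum(wi * ai for wi, ai in zip(w, a))
        swm = sum(wi * mi for wi, mi in zip(w, m))
        swaa = sum(wi * ai * ai for wi, ai in zip(w, a))
        swam = sum(wi * ai * mi for wi, ai, mi in zip(w, a, m))
        denom = sw * swaa - swa * swa
        if abs(denom) < 1e-12:  # all weight sits on a single A value
            fitted.append(swm / sw)
        else:
            slope = (sw * swam - swa * swm) / denom
            fitted.append((swm - slope * swa) / sw + slope * a0)
    return fitted


def normalize_m(a, m, frac=0.5):
    """Normalized log ratio: M_tilde = M - M_hat."""
    m_hat = local_linear_fit(a, m, frac)
    return [mi - hi for mi, hi in zip(m, m_hat)]


# Synthetic example: an intensity-dependent bias M = 0.5 * A - 2, no true change.
a = [6.0 + 0.5 * i for i in range(20)]
m = [0.5 * ai - 2.0 for ai in a]
m_tilde = normalize_m(a, m)  # bias removed: values are numerically zero
```

With replicate arrays, where the same sample pool is hybridized to each array, every gene is expected to have a normalized log ratio near zero, so the residual spread of `m_tilde` is a direct check on the fit.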


Non-linear scaling: Underlying reasoning

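Assumption:

$$
\begin{aligned}
\text{array 1:}\quad & x_{g1} = f_1(\theta_{g1}) = \alpha_1(\theta_{g1})\,\theta_{g1} \\
\text{array 2:}\quad & x_{g2} = f_2(\theta_{g2}) = \alpha_2(\theta_{g2})\,\theta_{g2} \\
& \;\vdots \\
\text{array } S\text{:}\quad & x_{gS} = f_S(\theta_{gS}) = \alpha_S(\theta_{gS})\,\theta_{gS}
\end{aligned}
$$

$$
\log\left(\frac{x_{gs}}{x_{g1}}\right)
  = \log\left(\frac{\alpha_s(\theta_{gs})}{\alpha_1(\theta_{g1})} \cdot \frac{\theta_{gs}}{\theta_{g1}}\right)
  = \log\left(\frac{\theta_{gs}}{\theta_{g1}}\right) + h(\theta_{g1}, \theta_{gs}),
$$

where $h(\theta_{g1}, \theta_{gs}) = \log\left(\frac{\alpha_s(\theta_{gs})}{\alpha_1(\theta_{g1})}\right)$ and $\log\left(\frac{\theta_{gs}}{\theta_{g1}}\right)$ is the log relative expression level.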


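Non-linear scaling: Underlying reasoning (cont’d)

Suppose we know the green genes are non-differentially expressed genes. For these genes $\theta_{g1} = \theta_{gs}$, so

$$
\log\left(\frac{x_{gs}}{x_{g1}}\right)
  = \log\left(\frac{\theta_{gs}}{\theta_{g1}}\right) + h(\theta_{g1}, \theta_{gs})
  = h(\theta_{g1}, \theta_{gs}),
\qquad\text{where } h(\theta_{g1}, \theta_{gs}) = \log\left(\frac{\alpha_s(\theta_{gs})}{\alpha_1(\theta_{g1})}\right).
$$

An estimate $\hat{h}$ fitted on these genes gives the estimated log relative expression level:

$$
\widehat{\log\left(\frac{\theta_{gs}}{\theta_{g1}}\right)}
  = \log\left(\frac{x_{gs}}{x_{g1}}\right) - \hat{h}(\theta_{g1}, \theta_{gs}).
$$

Knowing which genes have the same expression across the different conditions requires biological knowledge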


Normalization using Invariant set (dChip)

• Select a baseline array (default is the one with median average intensity).

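• For each “treatment” array, identify a set of genes that have ranks conserved between the baseline and treatment array. This set of rank-invariant genes is considered to be non-differentially expressed.

• Each array is normalized against the baseline array by fitting a non-linear normalization curve through the invariant-gene set.

$$
\mathcal{G} = \left\{ g : \left|\operatorname{rank}(x_{gs}) - \operatorname{rank}(x_{g1})\right| < d
  \;\;\&\;\; l < \operatorname{Rank}\!\left(\frac{x_{g1} + x_{gs}}{2}\right) < G - l \right\}
$$

where array 1 is the baseline array, $G$ is the total number of genes, $d$ bounds the allowed rank difference, and $l$ excludes the genes of lowest and highest rank.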

Tseng et al., 2001
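
The rank-invariant selection rule can be sketched as follows: a gene is kept when its rank differs by less than d between the baseline and the treatment array, and the rank of its average intensity lies strictly between l and G - l. The thresholds and intensities are made up for illustration, `ranks` and `rank_invariant_set` are hypothetical helpers, and dChip's own iterative procedure is not reproduced here.

```python
def ranks(values):
    """1-based ranks of the values (ties broken by position)."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0] * len(values)
    for rank, i in enumerate(order, start=1):
        result[i] = rank
    return result


def rank_invariant_set(baseline, treatment, d, l):
    """Indices of genes whose ranks are conserved between the two arrays."""
    n = len(baseline)
    rb, rt = ranks(baseline), ranks(treatment)
    rmean = ranks([(b + t) / 2 for b, t in zip(baseline, treatment)])
    return [g for g in range(n)
            if abs(rt[g] - rb[g]) < d and l < rmean[g] < n - l]


# Toy intensities for eight genes (illustrative only).
baseline = [100, 200, 300, 400, 500, 600, 700, 800]
treatment = [110, 210, 900, 390, 520, 610, 150, 790]

invariant = rank_invariant_set(baseline, treatment, d=2, l=1)
```

Genes 2 and 6 change rank sharply and are excluded as likely differentially expressed; genes at the extremes of the intensity range are excluded by the l cutoff. A normalization curve would then be fitted through the remaining genes only.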


Normalization using Invariant set (dChip)

Advantage:
• More robust than fitting with all genes as in loess
• Especially when the expression distributions in the arrays are very different


Undefined or missing expression values

Gene expression datasets often contain missing values
• The background and the signal give similar intensity
• Surface of the chip is not planar (cDNA chips)
• The probe is not properly fixed on the chip
• Hybridisation step didn’t work properly
• Probe was not washed away properly

Undefined values can be
• Discarded: leads to problems in data analysis
• Filled in:
  – Replaced by zero
  – Imputed using statistical methods for filling in missing data, e.g., replace a missing value by the average of the row or column of the data matrix
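
Row-average imputation, the simplest of the fill-in options listed, can be sketched like this. The matrix is toy data with genes as rows and arrays as columns, `None` marks an undefined value, and `impute_row_mean` is a hypothetical helper name.

```python
def impute_row_mean(matrix):
    """Replace None entries with the mean of the observed values in the same row."""
    filled = []
    for row in matrix:
        observed = [v for v in row if v is not None]
        if not observed:  # nothing to average: leave the row untouched
            filled.append(list(row))
            continue
        row_mean = sum(observed) / len(observed)
        filled.append([row_mean if v is None else v for v in row])
    return filled


# Rows are genes, columns are arrays (toy data).
expression = [
    [1.0, None, 3.0],
    [None, None, 2.0],
    [None, None, None],
]
imputed = impute_row_mean(expression)
```

A gene with no observed values cannot be imputed from its own row, so it is left as missing; such rows are candidates for discarding or for column-based imputation instead.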


Basic gene expression analysis

[Figure: three samples (liver, brain, lymphocyte), each yielding 47,000 x 2 x 2 = 188,000 datapoints]


Basic gene expression data analysis

• Are some genes differentially expressed across two different conditions?
– Classical parametric & non-parametric statistical tests for hypothesis testing (already covered by Dan Nettleton)
• Are there groups of genes that display similar expression profiles?
– Unsupervised clustering algorithms
– Agglomerative methods
– Divisive methods, e.g., spectral clustering
– Dimensionality reduction, e.g., PCA
• What else do genes with similar expression profiles share?
– Pathway annotations, GO functions, shared regulatory sequences?


Basic gene expression data analysis

• Global topological analysis of gene networks
• Finding modules
– Clusters of genes with similar expression profiles that share some additional features (e.g., regulators)
• Classifying tissue samples based on gene expression patterns
– Discriminant analysis, supervised learning


Modeling gene networks

• Differential equations
• Boolean networks and temporal Boolean networks
• Bayesian networks and dynamic Bayesian networks
• ..


Clustering – Similarity based methods

Are there groups of genes that display similar expression profiles? We cluster gene expression data to answer this question.

• A clustering is a partitioning of data into meaningful groups called clusters.
• Cluster: a collection of objects that are “similar” to one another, e.g., genes that share similar expression patterns
• Clustering algorithms are used
– To understand the natural grouping or structure in a data set, or to get insight into the data distribution


What we need to cluster data

• Data
• A distance measure or similarity measure to group instances together
• A metric for evaluating candidate clusters
• An algorithm to perform the clustering


Data in Metric Space

• Data are points in $\mathbb{R}^d$ or $\{0,1\}^d$
– E.g., gene expression measurements at d time points or d conditions
• Metric space: dist(x,y) is a distance metric if it satisfies
– Reflexivity: dist(x,y) = 0 iff x = y
– Symmetry: dist(x,y) = dist(y,x)
– Triangle inequality: dist(x,y) ≤ dist(x,z) + dist(z,y)

[Figure: triangle with vertices x, y, z illustrating the triangle inequality]


Examples of Distance Metrics

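• The distance between $x = \langle x_1, \ldots, x_n \rangle$ and $y = \langle y_1, \ldots, y_n \rangle$ is:
– $L_2$ norm (Euclidean distance): $\sqrt{(x_1 - y_1)^2 + \cdots + (x_n - y_n)^2}$
– $L_1$ norm (Manhattan distance): $|x_1 - y_1| + \cdots + |x_n - y_n|$
• Cosine similarity measure: $\cos\theta = \dfrac{x \cdot y}{\|x\| \, \|y\|}$
– More similar -> close to 1
– Less similar -> close to 0
– $1 - \cos\theta$ is a distance measure

[Figure: points x and y with the $L_1$ and $L_2$ paths between them, and the angle θ between the two vectors]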
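The three measures on this slide can be written in a few lines of plain Python. This is only an illustrative sketch; the function names and the two test vectors are hypothetical, and the cosine version assumes neither vector is all zeros.

```python
import math

def euclidean(x, y):
    """L2 norm of the difference x - y."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def manhattan(x, y):
    """L1 norm of the difference x - y."""
    return sum(abs(a - b) for a, b in zip(x, y))

def cosine_distance(x, y):
    """1 - cos(theta), where theta is the angle between x and y (both non-zero)."""
    dot = sum(a * b for a, b in zip(x, y))
    norm_x = math.sqrt(sum(a * a for a in x))
    norm_y = math.sqrt(sum(b * b for b in y))
    return 1.0 - dot / (norm_x * norm_y)

# Two orthogonal vectors: the cosine similarity is 0, so the cosine distance is 1
x, y = [3.0, 0.0], [0.0, 4.0]
print(euclidean(x, y), manhattan(x, y), cosine_distance(x, y))
```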


Measure the Quality of Clustering

• Similarity metric: (dis)similarity is expressed in terms of a distance function d(X,Y), which is typically a metric.
• The definitions of distance functions are usually different for interval-scaled, Boolean, categorical, ordinal and ratio variables.
• Weights may be associated with different variables based on applications and data semantics.
• It is not always easy to define similarity measures: similarity is in the eye of the beholder.


Type of data in clustering analysis

• Real valued variables

• Interval-scaled variables

• Binary variables

• Nominal, ordinal, and ratio variables

• Variables of mixed types


Similarity Between Objects

• Distances are normally used to measure the similarity or dissimilarity between two data objects

• Some popular ones include the Minkowski distance:

$$d(i,j) = \left( |x_{i1} - x_{j1}|^q + |x_{i2} - x_{j2}|^q + \cdots + |x_{ip} - x_{jp}|^q \right)^{1/q}$$

where $i = (x_{i1}, x_{i2}, \ldots, x_{ip})$ and $j = (x_{j1}, x_{j2}, \ldots, x_{jp})$ are two p-dimensional data objects, and q is a positive integer

• If q = 1, d is the Manhattan distance:

$$d(i,j) = |x_{i1} - x_{j1}| + |x_{i2} - x_{j2}| + \cdots + |x_{ip} - x_{jp}|$$


Similarity and Dissimilarity Between Objects (Cont.)

• If q = 2, d is the Euclidean distance:

$$d(i,j) = \sqrt{|x_{i1} - x_{j1}|^2 + |x_{i2} - x_{j2}|^2 + \cdots + |x_{ip} - x_{jp}|^2}$$

– Properties
• d(i,j) ≥ 0
• d(i,i) = 0
• d(i,j) = d(j,i)
• d(i,j) ≤ d(i,k) + d(k,j)

• Also, one can use weighted distance, Pearson correlation, or other dissimilarity measures
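A short sketch of the Minkowski family may help show how the two special cases fall out of one formula. The function name `minkowski` is hypothetical; q is assumed to be a positive integer as on the slide.

```python
def minkowski(i, j, q):
    """Minkowski distance of order q between two equal-length vectors."""
    if q < 1:
        raise ValueError("q must be a positive integer")
    return sum(abs(a - b) ** q for a, b in zip(i, j)) ** (1.0 / q)

i, j = [0.0, 0.0], [3.0, 4.0]
d_manhattan = minkowski(i, j, 1)   # q = 1: sum of absolute differences
d_euclidean = minkowski(i, j, 2)   # q = 2: straight-line distance
```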


Interval scaled variables

• Standardize data
– Calculate the mean absolute deviation:

$$s_f = \frac{1}{n} \left( |x_{1f} - m_f| + |x_{2f} - m_f| + \cdots + |x_{nf} - m_f| \right)$$

where

$$m_f = \frac{1}{n} \left( x_{1f} + x_{2f} + \cdots + x_{nf} \right)$$

– Calculate the standardized measurement (z-score):

$$z_{if} = \frac{x_{if} - m_f}{s_f}$$

• Then compute distance in the usual way
• Using mean absolute deviation is more robust than using standard deviation
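The standardization above can be sketched with NumPy, assuming rows are objects i and columns are variables f. The name `standardize_mad` and the toy matrix are hypothetical, and a variable that is constant across objects (s_f = 0) would need separate handling.

```python
import numpy as np

def standardize_mad(X):
    """z_if = (x_if - m_f) / s_f, with s_f the mean absolute deviation of variable f."""
    X = np.asarray(X, dtype=float)
    m = X.mean(axis=0)                 # m_f: mean of each variable
    s = np.abs(X - m).mean(axis=0)     # s_f: mean absolute deviation of each variable
    return (X - m) / s

# Three objects measured on two variables with very different scales
X = np.array([[1.0, 10.0],
              [2.0, 20.0],
              [3.0, 60.0]])
Z = standardize_mad(X)
```

After standardization both columns are on a comparable scale, so neither variable dominates a Euclidean or Manhattan distance purely because of its units.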


Similarity between Binary Variables

• Example

Name  Gender  Fever  Cough  Test-1  Test-2  Test-3  Test-4
Jack  M       Y      N      Y       N       N       N
Mary  F       Y      N      Y       N       Y       N
Jim   M       Y      Y      N       N       N       N

$$s(\text{Jack}, \text{Mary}) = \frac{s(M,F) + s(Y,Y) + \cdots + s(N,N)}{7} = \frac{5}{7}$$

$$s(\text{Jack}, \text{Jim}) = \; ? \qquad s(\text{Jim}, \text{Mary}) = \; ?$$
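The example can be checked with a few lines of Python. This sketch assumes the per-attribute similarity is 1 when two values agree and 0 otherwise, which is consistent with the value 5/7 for Jack and Mary; the function name `simple_matching` is hypothetical.

```python
from fractions import Fraction

# Attribute order: Gender, Fever, Cough, Test-1, Test-2, Test-3, Test-4
records = {
    "Jack": ["M", "Y", "N", "Y", "N", "N", "N"],
    "Mary": ["F", "Y", "N", "Y", "N", "Y", "N"],
    "Jim":  ["M", "Y", "Y", "N", "N", "N", "N"],
}

def simple_matching(a, b):
    """Fraction of attributes on which the two records agree."""
    matches = sum(1 for u, v in zip(a, b) if u == v)
    return Fraction(matches, len(a))

for p, q in [("Jack", "Mary"), ("Jack", "Jim"), ("Jim", "Mary")]:
    print(p, q, simple_matching(records[p], records[q]))
```

Under this assumption Jack and Jim also agree on 5 of 7 attributes, while Jim and Mary agree on only 3 of 7.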


Nominal Variables

• A generalization of the binary variable in that it can take more than 2 states, e.g., red, yellow, blue, green

• Method 1: Simple matching

$$d(i,j) = \frac{n - m}{n}$$

– m: # of matches, n: total # of variables

• Method 2: Binarize nominal variables
– create a new binary variable for each of the M nominal states


Ordinal Variables

• An ordinal variable can be discrete or continuous
• Order is important, e.g., rank
• Can be treated like interval-scaled
– replace $x_{if}$ by its rank $r_{if} \in \{1, \ldots, M_f\}$
– standardize the value to fall in [0, 1]:

$$z_{if} = \frac{r_{if} - 1}{M_f - 1}$$

– compute the dissimilarity using methods for interval-scaled variables
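As a small illustration of the rank mapping, the sketch below converts ordered category labels to values in [0, 1]. The function name and the three-level scale are hypothetical, and at least two levels are assumed so that the denominator is non-zero.

```python
def ordinal_to_unit_interval(values, ordered_levels):
    """Map each value to z = (r - 1) / (M - 1), where r is its rank among M ordered levels."""
    rank = {level: r for r, level in enumerate(ordered_levels, start=1)}
    M = len(ordered_levels)
    return [(rank[v] - 1) / (M - 1) for v in values]

levels = ["low", "medium", "high"]          # hypothetical ordinal scale, lowest first
z = ordinal_to_unit_interval(["high", "low", "medium"], levels)
```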


Ratio-Scaled Variables

• Ratio-scaled variables: a positive measurement on a nonlinear scale, approximately exponential, such as $Ae^{Bt}$ or $Ae^{-Bt}$

• Methods:

– treat them like interval-scaled variables: not a good choice! (why? the scale can be distorted)

– apply a logarithmic transformation

• $y_{if} = \log(x_{if})$

– treat them as continuous ordinal data and treat their ranks as interval-scaled


Example – Clustering Gene expression Data

• Treat expression data for a gene across multiple time points or multiple conditions as a multidimensional vector

• Decide on a distance metric to compare the vectors.
– Plenty to choose from…
• Pearson correlation, Euclidean distance, Manhattan distance, etc.
• Each metric has different properties and can reveal different features of the data


Similar expression

Expression Vectors

• Each gene is represented by a vector whose coordinates are its values (log(ratio)) in each experiment
• x = log(ratio) in expt1
• y = log(ratio) in expt2
• z = log(ratio) in expt3
• etc.


Digression – Data Normalization

• Normalization is an attempt to correct for systematic bias in data.

• Normalization allows you to compare data from one array to another.

• In practice we do not always understand the data; inevitably some biology will be removed too (or at least not revealed)


[Figure: two-channel experiment with a Tumor sample and a Pool of Cell Lines]

Sources of systematic bias:
• Differential labeling efficiency of dyes
• Different amounts of starting material
• Different amounts of RNA in each channel
• Differential efficiency of hybridization over slide surface
• Differential efficiency of scanning in each channel


Such biases have consequences:
• Plotting the frequency of un-normalized intensities reveals the differential effect between the two channels.


How do we deal with this?

Normalization:
• Typically, we assume that the average gene does not change
• You need to understand your data to know whether that is an appropriate assumption
• The number of ‘reporters’ (clones or genes) you are assaying will affect this


Normalization


Effect on log ratios:

[Figure: histograms of frequency versus log-ratios for the un-normalized and the normalized data]


Total Intensity Normalization

• For those spots that are thought to be well measured, calculate mean or median log ratio.

• Use this as a normalization factor to adjust all log ratios.

• Equivalent to assuming same total intensity in both channels.

• Current software:
– provides two simple methods for selection of well measured spots: pixel-by-pixel regression, and foreground over background intensity
– calculates normalized values for all channel 2 measurements, and ratios
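The median-based adjustment described above can be sketched as follows. This is an illustration under stated assumptions, not the procedure of any particular software package: the log base (2), the direction of the ratio (channel 2 over channel 1), the function name, and the optional `well_measured` mask are all choices made here for the example.

```python
import numpy as np

def normalize_log_ratios(ch1, ch2, well_measured=None):
    """Subtract the median log ratio of the well-measured spots from every log ratio."""
    ch1 = np.asarray(ch1, dtype=float)
    ch2 = np.asarray(ch2, dtype=float)
    log_ratio = np.log2(ch2 / ch1)
    if well_measured is None:
        # Assumption: with no quality mask given, treat every spot as well measured
        well_measured = np.ones(log_ratio.shape, dtype=bool)
    factor = np.median(log_ratio[well_measured])   # normalization factor
    return log_ratio - factor

# Hypothetical intensities: channel 2 is uniformly twice as bright (pure channel bias)
ch1 = np.array([100.0, 200.0, 400.0, 800.0])
ch2 = 2.0 * ch1
normalized = normalize_log_ratios(ch1, ch2)
```

Here every raw log ratio equals 1, so after normalization all of them are 0, matching the assumption that the average gene does not change.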


Normalization by Subset

• Housekeeping genes
– Calculate normalization based on biologically determined stable genes.
– Not always valid; even very stable genes can respond to some conditions.
• Spiking or doping controls
– Calculate based on introduced DNA species.
– Requires careful measurement of total DNA in each channel.


Distance Metrics

• Distances are measured “between” expression vectors
• Distance metrics define the way we measure distances
• Many different ways to measure distance:
– Euclidean distance
– Pearson correlation coefficient(s)
– Manhattan distance
– Mutual information
– Kendall’s Tau

• Each has different properties and can reveal different features of the data


Euclidean distance

• The Euclidean distance metric detects similar vectors by identifying those that are closest in space.

• In this example, A and C are closest to one another.

[Figure: scatter plot of Gene A, Gene B, and Gene C; axis label ARRAY 1]


Pearson correlation

• The Pearson correlation disregards the magnitude of the vectors but instead compares their directions.

• In this example, Gene A and Gene B have the same slope, so would be most similar to each other

• Note, however, that Pearson correlation is not a true similarity function

[Figure: scatter plot of Gene A, Gene B, and Gene C; axis label ARRAY 1]


Distance Metric: Pearson vs. Euclidean

[Figure: expression profiles of genes A, B, and C across arrays 1 to 8, with values ranging from -1.5 to 1.5]

By Euclidean distance, A and B are most similar.
By Pearson correlation, A and C are most similar.
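The disagreement between the two measures is easy to reproduce. The profiles below are hypothetical and are not the ones plotted on the slide: B has the same magnitude as A but the opposite trend, while C follows the same trend as A at five times the amplitude.

```python
import numpy as np

# Hypothetical expression profiles over 8 arrays
a = np.array([0.1, 0.2, 0.1, 0.0, -0.1, -0.2, -0.1, 0.0])
b = -a          # small values close to A, but anti-correlated with it
c = 5.0 * a     # same shape as A, much larger magnitude

def euclid(u, v):
    """Euclidean distance between two profiles."""
    return float(np.linalg.norm(u - v))

def pearson(u, v):
    """Pearson correlation coefficient between two profiles."""
    return float(np.corrcoef(u, v)[0, 1])

print("Euclidean A-B:", euclid(a, b), " A-C:", euclid(a, c))
print("Pearson   A-B:", pearson(a, b), " A-C:", pearson(a, c))
```

With these profiles A is closer to B in Euclidean distance, yet A and C are perfectly correlated, so the choice of measure changes which genes would be grouped together.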


More general measures

• Mutual information
– Model the genes as random variables that take different values under different conditions
– Mutual information between random variables is a measure of how much information one random variable conveys about another
– Correlation can be shown to be a special case of mutual information (when we assume that the underlying distributions can be specified using the first and second moments, e.g., Gaussian)


Clustering methods

• Grouping methods, e.g., k-means
• Hierarchical methods
• Partitioning methods, e.g., spectral clustering
• Model-based clustering methods (e.g., mixture densities)


What is a Good Clustering?

• A good clustering method will produce clusters where
– the intra-cluster distance is small
– the inter-cluster distance is large


Partition Based Clustering Algorithms

• Given a set of data points $S = \{X_1, \ldots, X_N\}$, the task of the clustering algorithm is to find K non-overlapping clusters $C = \{C_1, \ldots, C_K\}$ so that
– Each data point is assigned to a unique cluster
– Within cluster distance T(C) is minimized
– Between cluster distance D(C) is maximized
• The particular choices of T(C) and D(C) affect the shape of the clusters


Examples of Scoring Functions

• Define cluster centers: $r_k = \dfrac{1}{|C_k|} \sum_{X \in C_k} X$
• Tightness: $T(C) = \sum_{k=1}^{K} \sum_{X \in C_k} d(X, r_k)^2$
• Between cluster distance: $D(C) = \sum_{1 \le j < k \le K} d(r_j, r_k)^2$
• Overall quality of clusters C: $Q(C) = \dfrac{T(C)}{D(C)}$
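To make the tightness and between-cluster scores concrete, the sketch below computes both for a given labeling, assuming d is the Euclidean distance. The function name `cluster_scores` and the four toy points are hypothetical.

```python
import numpy as np
from itertools import combinations

def cluster_scores(X, labels):
    """Return (T, D): within-cluster tightness and between-cluster distance."""
    X = np.asarray(X, dtype=float)
    labels = np.asarray(labels)
    # Cluster centers: mean of the points assigned to each cluster
    centers = {k: X[labels == k].mean(axis=0) for k in np.unique(labels)}
    # T: sum of squared distances from each point to its own cluster center
    T = sum(((X[labels == k] - r) ** 2).sum() for k, r in centers.items())
    # D: sum of squared distances over all pairs of cluster centers
    D = sum(((centers[j] - centers[k]) ** 2).sum() for j, k in combinations(centers, 2))
    return float(T), float(D)

X = np.array([[0.0, 0.0], [0.0, 2.0], [10.0, 0.0], [10.0, 2.0]])
T_good, D_good = cluster_scores(X, [0, 0, 1, 1])   # groups the nearby points together
T_bad, D_bad = cluster_scores(X, [0, 1, 0, 1])     # mixes the two natural groups
```

The natural grouping gives a small T and a large D, while the mixed labeling gives the opposite, which is the behavior a combined quality score is meant to capture.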


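Why care about “clustering”?

[Figure: two expression matrices with genes (Gene 1, Gene 2, …, Gene N) as rows and experiments (E1, E2, E3) as columns; in the second matrix the rows are reordered]

• Discover functional relation
• Assign function to genes with unknown functions
• Find which gene controls which other genes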


The k-means algorithm

• Assume the data lives in a Euclidean space.
• Assume we want k clusters.
• Assume we start with randomly located cluster centers.
• The algorithm alternates between:
– Assignment step: Assign each data point to the closest cluster.
– Refitting step: Move each cluster center to the center of gravity of the data assigned to it.

[Figure: two panels, “Assignments” and “Refitted means”]


K-means clustering

Randomly initialize the cluster centers
Until clusters stop changing do {
    Assign points to clusters whose centers are the closest
    Update cluster centers
}
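The loop above can be turned into a short NumPy sketch. Several details are assumptions made for the example rather than part of the slide: the initial centers are k distinct data points chosen at random, an empty cluster keeps its previous center, and `max_iter` caps the number of passes.

```python
import numpy as np

def kmeans(X, k, seed=0, max_iter=100):
    """Lloyd-style k-means; returns (labels, centers)."""
    X = np.asarray(X, dtype=float)
    rng = np.random.default_rng(seed)
    # Assumption: "random" initialization picks k distinct data points as centers
    centers = X[rng.choice(len(X), size=k, replace=False)]
    labels = None
    for _ in range(max_iter):
        # Assignment step: each point goes to the cluster with the closest center
        d2 = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        new_labels = d2.argmin(axis=1)
        if labels is not None and np.array_equal(new_labels, labels):
            break                      # clusters stopped changing
        labels = new_labels
        # Refitting step: move each center to the mean of its assigned points
        for j in range(k):
            members = X[labels == j]
            if len(members) > 0:
                centers[j] = members.mean(axis=0)
    return labels, centers

# Two well-separated hypothetical groups of expression vectors
X = np.array([[0.0, 0.0], [1.0, 0.0], [2.0, 0.0],
              [100.0, 0.0], [101.0, 0.0], [102.0, 0.0]])
labels, centers = kmeans(X, k=2)
```

Which group receives label 0 depends on the random initialization, so results are best compared by cluster membership rather than by label value.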


Why K-means converges

• Whenever an assignment is changed, the sum of squared distances of data points from their assigned cluster centers is reduced.

• Whenever a cluster center is moved, the sum of squared distances of the data points from their currently assigned cluster centers is reduced.

• If the assignments do not change in the assignment step, we have converged.


K-means with Euclidean distance

• Clusters defined by Euclidean distance are invariant to translations and rotations in feature space, but not invariant to scaling of features

• Solution: standardize the data: translate and scale the features so that all features have zero mean and unit variance

• Some implementations of k-means may do this by default

• Caution: standardization may not always be desirable!


Standardization of data may not always be desirable!

[Figure: simulated data clustered by 2-means, one panel without standardization and one panel with standardization]


K-means: Idea

• Represent the data set in terms of K clusters, each of which is summarized by a prototype $\mu_k$

• Each data point $X_i$ is assigned to one of K clusters

– Represented by responsibilities $r_{ik} \in \{0, 1\}$ such that $\sum_{k=1}^{K} r_{ik} = 1$ for every data point i

• Example: 4 data points and 3 clusters


K-means: Loss function

• Loss function: the sum of squared distances from each data point to its assigned prototype

$$J = \sum_{i=1}^{n} \sum_{k=1}^{K} r_{ik} \, (X_i - \mu_k)^T (X_i - \mu_k)$$

where $r_{ik}$ are the responsibilities, $X_i$ the data, and $\mu_k$ the prototypes
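The loss can be evaluated directly from a one-hot responsibility matrix. The sketch below uses hypothetical names (`kmeans_loss`, `R`, `mu`) and three toy points; each row of `R` contains a single 1 marking the cluster the point belongs to.

```python
import numpy as np

def kmeans_loss(X, R, mu):
    """J = sum_i sum_k r_ik * (X_i - mu_k)^T (X_i - mu_k)."""
    diff = X[:, None, :] - mu[None, :, :]              # shape (n, K, d)
    sq_dist = np.einsum("nkd,nkd->nk", diff, diff)     # squared distances, shape (n, K)
    return float((R * sq_dist).sum())

X = np.array([[0.0, 0.0], [2.0, 0.0], [10.0, 0.0]])   # data points X_i
mu = np.array([[1.0, 0.0], [10.0, 0.0]])              # prototypes mu_k
R = np.array([[1, 0],                                 # responsibilities r_ik (one-hot rows)
              [1, 0],
              [0, 1]])
J = kmeans_loss(X, R, mu)
```

Only the distance to the assigned prototype contributes for each point, because the other entries of its row in `R` are zero.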


Minimizing the loss Function

• Problem
– If prototypes known, can assign responsibilities.
– If responsibilities known, can compute optimal prototypes.
• We minimize the loss function by an iterative procedure akin to expectation maximization
• Other ways to minimize the loss function include a merge-split approach


Minimizing the loss function – Expectation Maximization

• E-step: Fix $\mu_k$ and minimize J w.r.t. $r_{ik}$
– Assign each data point to its nearest prototype
• M-step: Fix values for $r_{ik}$ and minimize J w.r.t. $\mu_k$
– With a bit of algebra, we can show that this is achieved by setting

$$\mu_k = \frac{\sum_{i=1}^{n} r_{ik} X_i}{\sum_{i=1}^{n} r_{ik}}$$

– Each prototype is set to the mean of the data points in that cluster.
• Convergence is guaranteed since there is only a finite number of possible settings for the responsibilities, and at each iteration J can only improve or stay the same
• E-M is susceptible to local minima: it is good to run the algorithm with several different initializations of cluster centers


The Cost Function after each E and M step


How to Choose K?

• In some cases, K may be known
• In general, K is unknown.
• The loss function J decreases with increasing K
• Idea: Assume that K* is the right choice for K
– When K<K* some of the clusters must be split further to get the clusters to correspond to natural groups
– When K>K* some natural groups are split across multiple clusters
– When K<K* the cost function decreases rapidly, and when K>K* the cost function ceases to change substantially


How to Choose K?

• The Gap statistic (Tibshirani et al.) provides a more principled way of choosing K.

[Figure: example labeled K=2]


Gap statistic

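$$D_k = \sum_{i \in C_k} \sum_{j \in C_k} \| X_i - X_j \|^2 = 2 n_k \sum_{i \in C_k} \| X_i - \mu_k \|^2$$

– $D_k$: diameter of the k-th cluster
– $n_k$: number of data points assigned to the k-th cluster

$$W_K = \sum_{k=1}^{K} \frac{1}{2 n_k} D_k$$

– $W_K$: a measure of compactness of the clusters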


Computing the Gap Statistic

for b = 1 to B
    From the n data points, obtain a Monte Carlo sample X_{1b}, X_{2b}, …, X_{nb}

for k = 1 to K
    Cluster the observations into k groups and compute log W_k
    for b = 1 to B
        Cluster the b-th M.C. sample into k groups and compute log W_{kb}
    Compute $\mathrm{Gap}(k) = \frac{1}{B} \sum_{b=1}^{B} \log W_{kb} - \log W_k$
    Compute sd(k), the s.d. of {log W_{kb}}, b = 1, …, B
    Set the total s.e. $s_k = \sqrt{1 + 1/B} \cdot \mathrm{sd}(k)$

Find the smallest k such that

$$\mathrm{Gap}(k) \ge \mathrm{Gap}(k+1) - s_{k+1}$$
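The final arithmetic of the procedure (Gap(k), s_k, and the selection rule) can be sketched as follows. This assumes the log W values have already been obtained by some clustering routine, which is not shown; the function names and the numbers in the usage lines are hypothetical.

```python
import math


def gap_summary(log_w, log_w_ref):
    """Return (gaps, s), both indexed by k - 1.

    log_w[k-1]     : log W_k for the observed data
    log_w_ref[k-1] : list of log W_kb over the B Monte Carlo samples
    """
    gaps, s = [], []
    for lw, ref in zip(log_w, log_w_ref):
        b = len(ref)
        mean_ref = sum(ref) / b
        gaps.append(mean_ref - lw)
        # Standard deviation of the reference values (divisor B).
        sd = math.sqrt(sum((r - mean_ref) ** 2 for r in ref) / b)
        s.append(math.sqrt(1 + 1 / b) * sd)
    return gaps, s


def choose_k(gaps, s):
    """Smallest k with Gap(k) >= Gap(k+1) - s_{k+1}; None if no k qualifies."""
    for i in range(len(gaps) - 1):
        if gaps[i] >= gaps[i + 1] - s[i + 1]:
            return i + 1  # convert list index to k
    return None


# Made-up log W values for k = 1, 2, 3 with B = 2 reference samples each.
gaps, s = gap_summary([5.0, 3.0, 2.8], [[5.1, 5.3], [4.0, 4.2], [3.7, 3.9]])
best_k = choose_k(gaps, s)
```

In practice B is much larger than 2; the tiny value here only keeps the arithmetic easy to follow by hand.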


Applying the gap statistic to choose the number of clusters in DNA microarray data

Data: 6834 genes, 64 human tumor samples


Notice the gap plot at k = 2 and 6


Other criteria for guiding the choice of K

• Calinski and Harabasz ’74

• Krzanowski and Lai ’85

$$KL(k) = \left| \frac{(k-1)^{2/p}\, W_{k-1} - k^{2/p}\, W_k}{k^{2/p}\, W_k - (k+1)^{2/p}\, W_{k+1}} \right|$$

• Hartigan ’75

$$H(k) = \left( \frac{W_k}{W_{k+1}} - 1 \right) (n - k - 1)$$

• Kaufman and Rousseeuw ’90 (silhouette)


Initializing K-means

• K-means converges to a local optimum.
• Clusters produced will depend on the initialization.
• Some heuristics
– Randomly pick K points as prototypes.
– A greedy strategy: pick prototype μ_{k+1} so that it is farthest from prototypes μ_1, μ_2, …, μ_k.
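The greedy strategy can be sketched as below. One assumption is made explicit here: "farthest from the prototypes" is read as maximizing the distance to the nearest already-chosen prototype, and the first prototype is simply a given data point. The function name is hypothetical.

```python
def sq_dist(a, b):
    """Squared Euclidean distance between two points given as sequences."""
    return sum((ai - bi) ** 2 for ai, bi in zip(a, b))


def farthest_first(points, k, first=0):
    """Pick k prototypes greedily; points[first] seeds the sequence."""
    prototypes = [points[first]]
    while len(prototypes) < k:
        # Distance from a point to the chosen set = distance to its nearest prototype.
        nxt = max(points, key=lambda p: min(sq_dist(p, m) for m in prototypes))
        prototypes.append(nxt)
    return prototypes


# Hypothetical 2-D points.
points = [(0, 0), (1, 0), (10, 0), (10, 1), (5, 8)]
prototypes = farthest_first(points, 3)
```

Because the choice is deterministic given the first prototype, this heuristic tends to pick outliers, which is one reason random restarts are still worth trying.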


Local minima

• There is nothing to prevent k-means getting stuck at local minima.

• We could try many random starting points

• We could try non-local split-and-merge moves: Simultaneously merge two nearby clusters and split a big cluster into two.

A bad local optimum


Soft k-means

• K-means clustering can result in large changes in cluster assignments in response to small changes in data.
• Instead of making hard assignments of data points to clusters, we can make soft assignments. One cluster may have a responsibility of .7 for a data point and another may have a responsibility of .3.
– Allows a cluster to use more information about the data in the refitting step.
– What happens to our convergence guarantee?
– How do we decide on the soft assignments?


Soft assignment

• If a data point is exactly halfway between two clusters, each cluster should obviously have the same responsibility for it.
• The responsibilities of all the clusters for one data point should add to 1.
• A sensible softness function is the entropy of the responsibilities. Maximizing the entropy is like saying: be as uncertain as you can be about which cluster has responsibility.

$$H_j = \sum_{i=1}^{k} p_{ij} \log \frac{1}{p_{ij}}$$

H_j: entropy of the responsibilities for data point j

k: number of clusters

p_ij: responsibility of cluster i for data point j
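As a small illustration of the entropy term, the sketch below evaluates H_j for a few responsibility vectors. The function name is an assumption, and the convention 0 · log(1/0) = 0 is applied explicitly.

```python
import math


def entropy(p):
    """H = sum_i p_i * log(1 / p_i), with 0 * log(1/0) taken as 0."""
    return sum(pi * math.log(1.0 / pi) for pi in p if pi > 0)


h_uniform = entropy([0.5, 0.5])   # maximal uncertainty for two clusters
h_soft = entropy([0.7, 0.3])      # the .7 / .3 split mentioned earlier
h_hard = entropy([1.0, 0.0])      # a hard assignment
```

The uniform split has the largest entropy and the hard assignment the smallest, which is why maximizing entropy pushes toward soft assignments.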


The soft assignment step

• Choose assignments to optimize the trade-off between two terms:
– Minimize the squared distance of the data point to the cluster centers (weighted by responsibility)
– Maximize the entropy of the responsibilities

$$\mathrm{Cost}_j = \sum_{i=1}^{k} p_{ij}\, \|x_j - \mu_i\|^2 - \sum_{i=1}^{k} p_{ij} \log \frac{1}{p_{ij}}$$

Cost_j: cost of the assignments for data point j

p_ij: responsibility of cluster i for data point j

μ_i: location of cluster i

x_j: location of data point j


Soft k-means clustering

• How do we find the set of responsibility values that minimizes the cost and sums to 1?

$$\mathrm{Cost}_j = \sum_{i=1}^{k} p_{ij}\, \|x_j - \mu_i\|^2 - \sum_{i=1}^{k} p_{ij} \log \frac{1}{p_{ij}}$$

• The optimal solution is to make the responsibilities proportional to the exponentiated negative squared distances:

$$p_{ij} = \frac{e^{-\|x_j - \mu_i\|^2}}{\sum_{m} e^{-\|x_j - \mu_m\|^2}}$$
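The responsibility formula is a softmax over negative squared distances. A minimal sketch follows; the function names are hypothetical, and subtracting the smallest distance before exponentiating is an added numerical-stability step that leaves the normalized result unchanged.

```python
import math


def sq_dist(a, b):
    """Squared Euclidean distance between two points given as sequences."""
    return sum((ai - bi) ** 2 for ai, bi in zip(a, b))


def responsibilities(x, centers):
    """p_i proportional to exp(-||x - mu_i||^2), normalized to sum to 1."""
    d = [sq_dist(x, mu) for mu in centers]
    d_min = min(d)  # shift for numerical stability; cancels after normalization
    w = [math.exp(-(di - d_min)) for di in d]
    total = sum(w)
    return [wi / total for wi in w]


# A point exactly halfway between two cluster centers.
p_half = responsibilities((1.0, 0.0), [(0.0, 0.0), (2.0, 0.0)])
# A point sitting on the first center.
p_near = responsibilities((0.0, 0.0), [(0.0, 0.0), (3.0, 0.0)])
```

The halfway point receives equal responsibility from both clusters, matching the requirement stated for soft assignments.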


The re-fitting step

• Weight each data point by the responsibility that the cluster has for it.
• Move the mean of the cluster to the center of gravity of the responsibility-weighted data.

$$\mu_i = \frac{\sum_{j=1}^{N} p_{ij}\, x_j}{\sum_{j=1}^{N} p_{ij}}$$

j: index over data points

i: index over Gaussians
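The update for μ_i is a weighted average, sketched below. The function name and the layout `resp[i][j]` (cluster i, data point j) are assumptions chosen to mirror the indices in the formula.

```python
def refit_means(points, resp):
    """Return the responsibility-weighted mean of the data for each cluster.

    points  : list of points (tuples of coordinates)
    resp[i] : responsibilities of cluster i for every data point j
    """
    dim = len(points[0])
    means = []
    for p_i in resp:
        total = sum(p_i)
        means.append(tuple(
            sum(p_ij * x[d] for p_ij, x in zip(p_i, points)) / total
            for d in range(dim)
        ))
    return means


# Two 1-D points and two clusters with hypothetical responsibilities.
means = refit_means([(0.0,), (10.0,)], [[0.75, 0.25], [0.25, 0.75]])
```

Each mean is pulled toward the points the cluster is most responsible for, but every point contributes something, unlike the hard-assignment update.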


Some difficulties with soft k-means

• Not invariant under linear transformations of the data space (scaling, rotation, elongation)

• Looks for spherical clusters: It would be good to allow different shapes for different clusters.

• Sometimes it is better to cluster by using low-density regions to define the boundaries between clusters rather than using high-density regions to define the centers of clusters.


Limitations of K-means

• K-means assumes spherical clusters and equal probabilities for each cluster.
– In reality, clusters may not be spherical.
– Solution: Gaussian mixture models
• Clusters can change arbitrarily with different values of K.
– As K is increased, cluster memberships change in an arbitrary way; the clusters are not necessarily nested.
– Solution: hierarchical clustering
• K-means is sensitive to outliers.
– Solution: use a different loss function.
• K-means works poorly on non-convex clusters.
– Solution: spectral clustering


Gaussian Mixture Models

• Gaussian mixtures:

$$p(x) = \sum_{k=1}^{K} p(x \mid c_k, \theta_k)\, \alpha_k$$

Each mixture component is a multidimensional Gaussian with its own mean μ_k and covariance “shape” Σ_k

e.g., K=2, 1-dim: {θ, α} = {μ1, σ1, μ2, σ2, α1}
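For the 1-dimensional case, the mixture density can be evaluated directly. The sketch below is illustrative only: the function names and the component values used in the usage line are hypothetical.

```python
import math


def gaussian_pdf(x, mu, sigma):
    """Density of a 1-D Gaussian with mean mu and standard deviation sigma."""
    z = (x - mu) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2 * math.pi))


def mixture_pdf(x, components):
    """components: list of (alpha_k, mu_k, sigma_k), with the alpha_k summing to 1."""
    return sum(a * gaussian_pdf(x, mu, s) for a, mu, s in components)


# Hypothetical K = 2 mixture; alpha_2 = 1 - alpha_1, so alpha_1 alone fixes the weights.
components = [(0.3, 0.0, 1.0), (0.7, 4.0, 0.5)]
density_at_4 = mixture_pdf(4.0, components)
```

Because the weights sum to 1 and each component integrates to 1, the mixture is itself a valid density.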


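General form of mixture models

$$p(x) = \sum_{k=1}^{K} p(x, c_k) = \sum_{k=1}^{K} p(x \mid c_k)\, p(c_k) = \sum_{k=1}^{K} p(x \mid c_k, \theta_k)\, \alpha_k$$

α_k: weight of component k

p(x | c_k, θ_k): component model k

θ_k: parameters of component k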


[Figure: two panels plotting p(x) against x, titled “Component Models” and “Mixture Model”.]


• Task
– To use samples drawn from this mixture density to estimate the unknown parameter vector θ.
– Once θ is known, we can decompose the mixture into its components.

Mixture density:

$$P(x \mid \theta) = \sum_{k=1}^{K} \underbrace{P(x \mid c_k, \theta_k)}_{\text{component densities}} \cdot \underbrace{P(c_k)}_{\text{mixing parameters}}, \quad \text{where } \theta = (\theta_1, \theta_2, \ldots, \theta_K)^t$$


Learning Mixtures from Data

• Consider fixed K

• e.g., unknown parameters Θ = {μ1, σ1, μ2, σ2, α1, α2}

• Given data D = {x1, …, xN}, we want to find the parameters Θ that “best fit” the data


Maximum Likelihood Principle

• Assume a probabilistic model
• Likelihood = p(data | parameters, model)
• Find the parameters that make the data most likely

$$L(\Theta) = p(D \mid \Theta) = \prod_{i=1}^{N} p(x_i \mid \Theta)$$

which in the case of a mixture model reduces to

$$L(\Theta) = \prod_{i=1}^{N} \left[ \sum_{k=1}^{K} p(x_i \mid c_k, \theta_k)\, \alpha_k \right]$$


Aside: maximum likelihood estimates


Maximum likelihood estimates

• When tossed, the thumbtack can land in one of two positions: Head or Tail

• We denote by θ the (unknown) probability P(H).
• Estimation task
– Given a sequence of toss samples x[1], x[2], …, x[M], we want to estimate the probabilities P(H) = θ and P(T) = 1 − θ

[Figure: the two landing positions of the thumbtack, Head and Tail.]


Statistical parameter fitting

Consider samples x[1], x[2], …, x[M] such that
• The set of values that X can take is known
• Each is sampled from the same distribution
• Each is sampled independently of the rest
(i.e., the samples are i.i.d.)

The task is to find a parameter Θ so that the data can be summarized by a probability P(x[j] | Θ).

• The parameters depend on the given family of probability distributions: multinomial, Gaussian, Poisson, etc.


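The Likelihood Function

How good is a particular θ? It depends on how likely it is to generate the observed data:

$$L(\theta : D) = P(D \mid \theta) = \prod_{m} P(x[m] \mid \theta)$$

The likelihood for the sequence H, T, T, H, H is

$$L(\theta : D) = \theta \cdot (1-\theta) \cdot (1-\theta) \cdot \theta \cdot \theta$$

[Figure: L(θ : D) plotted against θ over the interval 0 to 1.]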


Likelihood function

• The likelihood function L(θ : D) provides a measure of relative preferences for various values of the parameter θ given a collection of observations D drawn from a distribution that is parameterized by fixed but unknown θ.

• L(θ : D) is the probability of the observed data D considered as a function of θ.

• Suppose data D is 5 heads out of 8 tosses. What is the likelihood function assuming that the observations were generated by a binomial distribution with an unknown but fixed parameter θ?

$$L(\theta : D) = \binom{8}{5}\, \theta^{5} (1-\theta)^{3}$$


Sufficient Statistics

• To compute the likelihood in the thumbtack example we only require N_H and N_T (the number of heads and the number of tails):

$$L(\theta : D) = \theta^{N_H} \cdot (1-\theta)^{N_T}$$

• N_H and N_T are sufficient statistics for the parameter θ that specifies the binomial distribution.
• A statistic is simply a function of the data.
• A sufficient statistic s for a parameter θ is a function that summarizes from the data D the relevant information s(D) needed to compute the likelihood L(θ : D).
• If s is a sufficient statistic for θ and s(D) = s(D′), then L(θ : D) = L(θ : D′).
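A quick numerical check of the sufficiency claim: two toss sequences with the same counts give the same likelihood for any θ. The function names and the example sequences are illustrative assumptions.

```python
def likelihood(tosses, theta):
    """Product of per-toss probabilities; tosses is a string of 'H' and 'T'."""
    result = 1.0
    for t in tosses:
        result *= theta if t == "H" else (1.0 - theta)
    return result


def likelihood_from_counts(n_heads, n_tails, theta):
    """Likelihood computed from the sufficient statistics alone."""
    return theta ** n_heads * (1.0 - theta) ** n_tails


theta = 0.6
l_seq = likelihood("HTTHH", theta)      # the sequence H, T, T, H, H
l_other = likelihood("HHHTT", theta)    # a different order, same counts
l_counts = likelihood_from_counts(3, 2, theta)
```

All three values coincide up to floating-point rounding, since only N_H = 3 and N_T = 2 enter the computation.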


Maximum Likelihood Estimation

• Main Idea: Learn parameters that maximize the likelihood function

• Maximum likelihood estimation is
– Intuitively appealing
– One of the most commonly used estimators in statistics
– Based on the assumption that the parameter to be estimated is fixed, but unknown


Example: MLE for Binomial Data

• Applying the MLE principle we get

$$\hat{\theta} = \frac{N_H}{N_H + N_T}$$

• (Why?)

Example: (N_H, N_T) = (3, 2); the ML estimate is 3/5 = 0.6

[Figure: L(θ : D) plotted against θ over the interval 0 to 1.]


MLE for Binomial data

$$L(\theta : D) = \theta^{N_H} (1-\theta)^{N_T}$$

$$\log L(\theta : D) = N_H \log \theta + N_T \log (1-\theta)$$

The likelihood is positive for all legitimate values of θ.

So maximizing the likelihood is equivalent to maximizing its logarithm, i.e., the log likelihood.

$$\frac{\partial}{\partial \theta} \log L(\theta : D) = 0 \quad \text{at extrema of } L(\theta : D)$$

$$\frac{\partial}{\partial \theta} \log L(\theta : D) = \frac{N_H}{\theta} - \frac{N_T}{1-\theta} = 0$$

$$\theta\, (N_H + N_T) = N_H$$

$$\theta_{ML} = \frac{N_H}{N_H + N_T}$$

Note that the likelihood is indeed maximized at θ = θ_ML because in the neighborhood of θ_ML, the value of the likelihood is smaller than it is at θ = θ_ML.
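The closed-form estimate can be sanity-checked by brute force: evaluate the log likelihood on a grid over (0, 1) and take the best grid point. This is only a sketch; the grid resolution and function names are arbitrary choices.

```python
import math


def log_likelihood(theta, n_heads, n_tails):
    """N_H log(theta) + N_T log(1 - theta), defined for 0 < theta < 1."""
    return n_heads * math.log(theta) + n_tails * math.log(1.0 - theta)


def grid_argmax(n_heads, n_tails, steps=1000):
    """Best theta among i / steps for i = 1 .. steps - 1."""
    grid = [i / steps for i in range(1, steps)]
    return max(grid, key=lambda t: log_likelihood(t, n_heads, n_tails))


theta_grid = grid_argmax(3, 2)          # the (N_H, N_T) = (3, 2) example
theta_closed_form = 3 / (3 + 2)
```

The grid maximizer lands on the same value as N_H / (N_H + N_T), as the derivation predicts.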


Maximum and curvature of likelihood around the maximum

• At the maximum, the derivative of the log likelihood is zero.
• At the maximum, the second derivative is negative.
• The curvature of the log likelihood is defined as

$$I(\theta) = -\frac{\partial^2}{\partial \theta^2} \log L(\theta : D)$$

• Large observed curvature I(θ_ML) at θ = θ_ML is associated with a sharp peak, intuitively indicating less uncertainty about the maximum likelihood estimate.

• I(θ_ML) is called the (observed) Fisher information.
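For the binomial log likelihood N_H log θ + N_T log(1 − θ), differentiating twice gives the curvature in closed form. The sketch below evaluates it at the ML estimate for two sample sizes with the same proportion of heads; the function name and the second data set are assumptions made for illustration.

```python
def observed_information(theta, n_heads, n_tails):
    """Minus the second derivative of N_H log(theta) + N_T log(1 - theta)."""
    return n_heads / theta ** 2 + n_tails / (1.0 - theta) ** 2


# Same ML estimate (0.6), but ten times more tosses in the second case.
info_small = observed_information(0.6, 3, 2)
info_large = observed_information(0.6, 30, 20)
```

The larger sample gives a tenfold larger curvature, i.e. a sharper peak and less uncertainty about the estimate.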


Maximum Likelihood Estimate

ML estimate can be shown to be

• Asymptotically unbiased

$$\lim_{N \to \infty} E(\theta_{ML}) = \theta_{True}$$

• Asymptotically consistent: converges to the true value as the number of examples approaches infinity

$$\lim_{N \to \infty} \Pr\{\, \|\theta_{ML} - \theta_{True}\| \le \varepsilon \,\} = 1$$

$$\lim_{N \to \infty} E\left( \|\theta_{ML} - \theta_{True}\|^2 \right) = 0$$

• Asymptotically efficient: achieves the lowest variance that any estimate can achieve for a training set of a certain size (satisfies the Cramér-Rao bound)


Maximum Likelihood Estimate

• ML estimate can be shown to be representationally invariant: if θ_ML is an ML estimate of θ, and g(θ) is a function of θ, then g(θ_ML) is an ML estimate of g(θ).

• When the number of samples is large, the distribution of θ_ML is approximately Gaussian with mean θ_True (the actual value of the parameter). This is a consequence of the central limit theorem: a random variable that is a sum of a large number of random variables has an approximately Gaussian distribution, and the ML estimate is related to a sum of random variables.

• We can use the likelihood ratio to reject the null hypothesis corresponding to θ = θ_0 as unsupported by data if the ratio of the likelihoods evaluated at θ_0 and at θ_ML is small. (The ratio can be calibrated when the likelihood function is approximately quadratic.)


The EM Algorithm

Dempster, Laird, and Rubin (1977)

• General framework for likelihood-based parameter estimation with hidden variables
• Start with initial guesses of parameters
• E-step: estimate memberships given parameters
• M-step: estimate parameters given memberships
• Repeat until convergence
• Converges to a (local) maximum of likelihood
• E-step and M-step are often computationally simple
• Generalizes to maximum a posteriori (with priors)
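The outline above can be made concrete for a 1-dimensional Gaussian mixture. The following is a minimal sketch under stated assumptions, not the formulation used in the lecture: the function names, the fixed iteration count, the synthetic data, and the initial guesses are all hypothetical, and no safeguard against collapsing variances is included.

```python
import math
import random


def normal_pdf(x, mu, var):
    """Density of a 1-D Gaussian with mean mu and variance var."""
    return math.exp(-((x - mu) ** 2) / (2.0 * var)) / math.sqrt(2.0 * math.pi * var)


def em_gmm_1d(data, mus, variances, alphas, n_iter=50):
    """EM for a 1-D Gaussian mixture.

    Returns (mus, variances, alphas, trace); trace[t] is the log likelihood
    of the data under the parameters at the start of iteration t.
    """
    mus, variances, alphas = list(mus), list(variances), list(alphas)
    k, n = len(mus), len(data)
    trace = []
    for _ in range(n_iter):
        # E-step: memberships p(component i | x) by Bayes rule.
        resp, loglik = [], 0.0
        for x in data:
            joint = [alphas[i] * normal_pdf(x, mus[i], variances[i]) for i in range(k)]
            total = sum(joint)
            loglik += math.log(total)
            resp.append([j / total for j in joint])
        trace.append(loglik)
        # M-step: membership-weighted means, variances, and mixing weights.
        for i in range(k):
            n_i = sum(r[i] for r in resp)
            mus[i] = sum(r[i] * x for r, x in zip(resp, data)) / n_i
            variances[i] = sum(r[i] * (x - mus[i]) ** 2 for r, x in zip(resp, data)) / n_i
            alphas[i] = n_i / n
    return mus, variances, alphas, trace


# Synthetic data: two well-separated groups (assumed for illustration).
random.seed(0)
data = [random.gauss(0.0, 1.0) for _ in range(100)]
data += [random.gauss(6.0, 1.0) for _ in range(100)]
mus, variances, alphas, trace = em_gmm_1d(data, [1.0, 4.0], [1.0, 1.0], [0.5, 0.5])
```

The values in `trace` should never decrease from one iteration to the next, which is the convergence property listed above; a practical implementation would stop once the increase falls below a tolerance rather than run a fixed number of iterations.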


The Gaussian Distribution

• Multivariate Gaussian

• Maximum likelihood estimation

mean covariance


Gaussian Mixture Model

• Linear combination of Gaussians


where

parameters to be estimated



Gaussian Mixture

• To generate a data point:
– First pick one of the components with probability given by its mixing weight
– Then draw a sample from that component distribution
• Each data point is generated by one of K components; a latent variable is associated with each data point


Gaussian Mixture

• Loss function: The negative log likelihood of the data.
– Equivalently, maximize the log likelihood.
• Without knowing values of latent variables, we have to maximize the incomplete log likelihood.
– Sum over components appears inside the logarithm, no closed-form solution.


Fitting the Gaussian Mixture

• Given the complete data set
– Maximize the complete log likelihood.
– Trivial closed-form solution: fit each component to the corresponding set of data points.
– Observe that if all the mixing weights are equal and all the covariances are equal, then the complete log likelihood is exactly the loss function used in K-means.

• Need a procedure that would let us optimize the incomplete log likelihood by working with the (easier) complete log likelihood instead.


The Expectation-Maximization (EM) Algorithm

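• E-step: for given parameter values we can compute the expected values of the latent variables (responsibilities of data points), using Bayes rule.

– Note that the responsibilities take values in [0, 1] instead of {0, 1}, but we still have that they sum to 1 over the components for each data point.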


The EM Algorithm

• M-step: maximize the expected complete log likelihood

• Parameter update:


The EM Algorithm

• Iterate E-step and M-step until the log likelihood of data does not increase any more.
– Converges to a local optimum.
– Need to restart the algorithm with different initial guesses of the parameters (as in K-means).

• Relation to K-means
– Consider a GMM with common covariance.
– As the common variance tends to zero, the two methods coincide.


K-means vs GMM

K-means
• Loss function
– Minimize sum of squared Euclidean distance.
• Can be optimized by an EM algorithm.
– E-step: assign points to clusters.
– M-step: optimize clusters.
– Performs hard assignment during E-step.
• Assumes spherical clusters with equal probability of a cluster.

GMM
• Loss function
– Minimize the negative log-likelihood.
• EM algorithm
– E-step: Compute posterior probability of membership.
– M-step: Optimize parameters.
– Performs soft assignment during E-step.
• Can be used for non-spherical clusters. Can generate clusters with different probabilities.


K-medoids

• K-means is not robust.
– Squared Euclidean distance emphasizes more distant points and is sensitive to outliers
• K-medoids
– Is less sensitive to outliers
– Needs only the dissimilarity matrix (so can work in settings where we do not have the raw data)
– Works in settings where the data do not reside in Euclidean space


K-medoids

• Restrict the prototypes to one of the data points assigned to the cluster.

• E-step: Fix the prototypes and minimize w.r.t. the cluster assignments.
– Assigns each data point to its nearest prototype

• M-step: Fix the cluster assignments and minimize w.r.t. the prototypes.

Example
• Use L1 distance instead of squared Euclidean distance.
• Prototype is the median of points in a cluster.
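The two alternating steps can be sketched using only a dissimilarity matrix, which is the setting where K-medoids is most useful. This is an illustrative sketch: the function name, the stopping rule, and the toy data are assumptions, and it presumes that the initial medoids are distinct points so that no cluster becomes empty.

```python
def k_medoids(dist, medoids, max_iter=100):
    """Alternate assignment and medoid updates on a dissimilarity matrix.

    dist    : square matrix, dist[a][b] = dissimilarity between points a and b
    medoids : initial list of point indices used as prototypes
    """
    n = len(dist)
    medoids = list(medoids)
    assign = []
    for _ in range(max_iter):
        # E-step: assign each point to its nearest medoid.
        assign = [
            min(range(len(medoids)), key=lambda c: dist[j][medoids[c]])
            for j in range(n)
        ]
        # M-step: in each cluster, pick the member with the smallest total dissimilarity.
        new_medoids = []
        for c in range(len(medoids)):
            members = [j for j in range(n) if assign[j] == c]
            new_medoids.append(min(members, key=lambda m: sum(dist[m][j] for j in members)))
        if new_medoids == medoids:
            break  # prototypes unchanged, so assignments are stable
        medoids = new_medoids
    return medoids, assign


# Hypothetical 1-D values; only their pairwise L1 distances are passed in.
values = [0, 1, 2, 10, 11, 12]
dist = [[abs(a - b) for b in values] for a in values]
medoids, assign = k_medoids(dist, [0, 1])
```

Note that the raw `values` are never used inside `k_medoids`, so the same code would run on any dissimilarity matrix, including ones that do not come from Euclidean coordinates.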


Hierarchical Clustering

• Organize the clusters in a hierarchical way
• Produces a rooted (binary) tree (dendrogram)

[Figure: objects a, b, c, d, e; the agglomerative axis runs from Step 0 to Step 4 and the divisive axis runs in the opposite direction.]


Hierarchical Clustering

• Two kinds of strategy

– Bottom-up (agglomerative): Recursively merge two groups with the smallest between-cluster dissimilarity (defined later on).

– Top-down (divisive): In each step, split a least coherent cluster (e.g., the one with the largest diameter); splitting a cluster is also a clustering problem (usually solved in a greedy fashion).


Hierarchical Clustering

• User can choose a cut through the hierarchy to represent the most natural division into clusters
– e.g., choose the cut where intergroup dissimilarity exceeds some threshold

[Figure: tree over objects a, b, c, d, e (Step 0 to Step 4), annotated with 3 and 2.]


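Hierarchical Clustering

• Have to measure the dissimilarity between two disjoint groups G and H; it is computed from the pairwise dissimilarities between members of G and members of H.

– Single Linkage (smallest pairwise dissimilarity): tends to yield extended clusters.

– Complete Linkage (largest pairwise dissimilarity): tends to yield spherical clusters.

– Group Average (mean pairwise dissimilarity): tradeoff between the above. Not invariant under monotone increasing transformations of the dissimilarities.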
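The three group dissimilarities differ only in how they reduce the pairwise values to a single number. A minimal sketch, with hypothetical function names and toy data:

```python
def single_linkage(d, g, h):
    """Smallest pairwise dissimilarity between groups g and h (lists of indices)."""
    return min(d[i][j] for i in g for j in h)


def complete_linkage(d, g, h):
    """Largest pairwise dissimilarity between groups g and h."""
    return max(d[i][j] for i in g for j in h)


def group_average(d, g, h):
    """Mean pairwise dissimilarity between groups g and h."""
    return sum(d[i][j] for i in g for j in h) / (len(g) * len(h))


# Four hypothetical 1-D objects; d holds their pairwise absolute differences.
values = [0, 1, 5, 7]
d = [[abs(a - b) for b in values] for a in values]
g, h = [0, 1], [2, 3]
linkages = (single_linkage(d, g, h), complete_linkage(d, g, h), group_average(d, g, h))
```

Single and complete linkage depend only on the ordering of the dissimilarities, whereas the group average depends on their actual values, which is why only the latter changes under a monotone transformation.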


Hierarchical Clustering of Gene Expression Data

1. Calculate the distance between all pairs of genes. Find the smallest distance. If several pairs share the same smallest distance, use a predetermined rule to decide between alternatives.
2. Fuse the two selected clusters to produce a new cluster that now contains at least two objects. Calculate the distance between the new cluster and all other clusters.
3. Repeat steps 1 and 2 until only a single cluster remains.
4. Draw a tree representing the results.

[Figure: clustering tree over genes G1, G6, G3, G5, G4, G2.]
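Steps 1 to 3 can be sketched as a short loop over a distance matrix. The code below is a simple, unoptimized illustration: the function name is hypothetical, ties are broken by keeping the first pair found (one possible "predetermined rule"), and the tree of step 4 is represented only by the recorded merge history.

```python
def agglomerate(d, linkage=min):
    """Return the merge history as (cluster_a, cluster_b, distance) tuples.

    d       : symmetric matrix of pairwise distances between objects
    linkage : reduces the pairwise distances between two clusters to one
              number (min = single linkage, max = complete linkage)
    """
    clusters = [(i,) for i in range(len(d))]
    merges = []
    while len(clusters) > 1:
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                dist = linkage(d[i][j] for i in clusters[a] for j in clusters[b])
                # Strict '<' keeps the first pair found when distances tie.
                if best is None or dist < best[0]:
                    best = (dist, a, b)
        dist, a, b = best
        merges.append((clusters[a], clusters[b], dist))
        merged = clusters[a] + clusters[b]
        clusters = [c for idx, c in enumerate(clusters) if idx not in (a, b)] + [merged]
    return merges


# Four hypothetical genes summarized by a single expression value each.
values = [0, 1, 5, 7]
d = [[abs(a - b) for b in values] for a in values]
merges_single = agglomerate(d)                  # single linkage
merges_complete = agglomerate(d, linkage=max)   # complete linkage
```

The merge distances recorded in the history are the heights at which a dendrogram would join the corresponding branches.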


Example: Human Tumor Microarray Data

• 6830×64 matrix of real numbers.
• Rows correspond to genes, columns to tissue samples.
• Clustering rows (genes) can provide information about the functions of genes with unknown function, based on the functions of genes with similar expression profiles.
• Clustering columns (samples) can help identify disease profiles: tissues with similar disease should yield similar expression profiles. Note that the number of samples is smaller than the number of variables (genes).

[Figure: Gene expression matrix]


Example: Human Tumor Microarray Data

• 6830×64 matrix of real numbers
• Hierarchical clustering of the microarray data
– Applied separately to rows and columns.
– Subtrees with tighter clusters placed on the left.
– Produces a more informative picture of genes and samples than the randomly ordered rows and columns.