an introduction to microarray analysis

8/10/2019 An Introduction to Microarray Analysis

1/25

225

Microarray Data Analysis

ap er

An Introduction toMicroarray Data Analysis

. a an a u

Abstract

This chapter aims to provide an introduction to the analysis of gene expression data obtained

using microarray experiments. It has been divided into four sections. The rst section

provides basic concepts on the working of microarrays and describes the basic principlesbehind a microarray experiment. The second section deals with the representation and

extract on o n ormat on rom mages o ta ne rom m croarray exper ments. e t r

sect on a resses erent met o s or compar ng express on pro es o genes an a so

prov es an overv ew o erent met o s or c uster ng genes w t s m ar express on

pro es. e ast sect on ocuses on re at ng gene express on ata w t ot er o og ca

information; it will provide the readers with a feel for the kind of biological discoveries

one can make by integrating gene expression data with external information.

1. INTRODUCTION

Functional genomics involves the analysis of large datasets of information derived fromarious biological experiments. One such type of large-scale experiment involves monitoring

the expression levels of thousands of genes simultaneously under a particular condition,

called gene expression analysis. Microarray technology makes this possible and the quantity

o ata generate rom eac exper ment s enormous, war ng t e amount o ata generate

y genome sequenc ng pro ects. s c apter s a r e overv ew o t e as c concepts

nvo ve n a m croarray exper ment; t g ves a ee ng or w at t e ata actua y represents,

n w prov e n ormat on on t e var ous computat ona met o s t at one can emp oy

to er ve mean ng u resu ts rom suc exper ments.

1.1 What are microarrays and how do they work?

Microarray technology has become one of the indispensable tools that many biologists use

to monitor genome wide expression levels of genes in a given organism. A microarray is

typically a glass slide on to which DNA molecules are xed in an orderly manner at specic

locations called spots (or features). A microarray may contain thousands of spots and each

spot may conta n a ew m on cop es o ent ca mo ecu es t at un que y correspon

to a gene gure . e n a spot may e t er e genom c or s ort stretc o

o go-nuc eot e stran s t at correspon to a gene. e spots are pr nte on to t e g ass

s e y a ro ot or are synt es se y t e process o p oto t ograp y.

croarrays may e use to measure gene express on n many ways, ut one o t emost popular applications is to compare expression of a set of genes from a cell maintained

in a particular condition (condition A) to the same set of genes from a reference cell


2/25

Madan Babu

ma nta ne un er norma con t ons con t on . gure g ves a genera p cture o t e

exper menta steps nvo ve . rst, s extracte rom t e ce s. ext, mo ecu es

n t e extract are reverse transcr e nto c y us ng an enzyme reverse transcr ptasend nucleotides labelled with different uorescent dyes. For example, cDNA from cells

rown in condition A may be labelled with a red dye and from cells grown in condition B

with a green dye. Once the samples have been differentially labelled, they are allowed to

hybridize onto the same glass slide. At this point, any cDNA sequence in the sample will

hybridize to specic spots on the glass slide containing its complementary sequence. The

mount of cDNA bound to a spot will be directly proportional to the initial number of RNA

molecules present for that gene in both samples.

o ow ng t e y r zat on step, t e spots n t e y r ze m croarray are exc te y

aser an scanne at su ta e wave engt s to etect t e re an green yes. e amount

o uorescence em tte upon exc tat on correspon s to t e amount o oun nuc e c ac .or nstance, c rom con t on or a part cu ar gene was n greater a un ance t an

Figure 1. (A) A microarray may contain thousands of spots. Each spot contains many copies of the same DNAsequence that uniquely represents a gene from an organism. Spots are arranged in an orderly fashion into Pen-

groups. (B) Schematic of the experimental protocol to study differential expression of genes. The organism is

grown in two different conditions (a reference condition and a test condition). RNA is extracted from the two

cells, and is labelled with different dyes (red and green) during the synthesis of cDNA by reverse transcriptase.

Following this step, cDNA is hybridized onto the microarray slide, where each cDNA molecule representing a

gene will bind to the spot containing its complementary DNA sequence. The microarray slide is then excited

with a laser at suitable wavelengths to detect the red and green dyes. The nal image is stored as a le for further

analysis. Colour gure t: http://www.mrc-lmb.cam.ac.uk/genomes/madanm/microarray/.


3/25

227


that from condition B, one would nd the spot to be red. If it was the other way, the spot

would be green. If the gene was expressed to the same extent in both conditions, one would

nd the spot to be yellow, and if the gene was not expressed in both conditions, the spotwould be black. Thus, what is seen at the end of the experimental stage is an image of the

microarray, in which each spot that corresponds to a gene has an associated uorescence

a ue represent ng t e re at ve express on eve o t at gene.

2. OVERVIEW OF IMAGE PROCESSING, TRANSFORMATION AND

NORMALIZATION

2.1 Image processing and analysis

In the previous section, we saw that the relative expression level for each gene (population

of RNA in the two samples) can be stored as an image. The rst step in the analysis of

microarray data is to process this image. Most manufacturers of microarray scanners

provide their own software; however, it is important to understand how data is actually

being extracted from images, as this represents the primary data collection step and forms

the basis of any further analysis.

mage process ng nvo ves t e o ow ng steps:

. enti cation o t e spots an istinguis ing t em rom spurious signa s.

e m croarray s scanne o ow ng y r zat on an a mage e s norma y

enerated. Once image generation is completed, the image is analysed to identify spots.

In the case of microarrays, the spots are arranged in an orderly manner into sub-arraysr pen groups (Figure 1A), which makes spot identication straightforward. Most image

Figure 2. Zooming onto a spot on the microarray slide. The spot area and the background area are depicted by a blue

circle and a white box, respectively. A pixel in the spot area is also shown. Any pixel within the blue circle will be

treated as a signal from the spot. Pixels outside the blue circle but within the white box will be treated as a signal from

the background. One can see that the images are not perfect, as it is often the case, which leads to many problems

with spurious signals from dust particles, scratches, bright arrays, etc. This image was retrieved from Stanford

Microarray Database. Colour gure at: http://www.mrc-lmb.cam.ac.uk/genomes/madanm/microarray/.


4/25

Madan Babu

process ng so tware requ res t e user to spec y approx mate y w ere eac su -array

es an a so a t ona parameters re evant to t e spotte array. s n ormat on s

then used to identify regions that correspond to spots.

2. Determination of the spot area to be surveyed, determination of the local region to

stimate background hybridization.

After identifying regions that correspond to sub-arrays, an area within the sub-array

must be selected to get a measure of the spot signal and an estimate for background

ntensity (Figure 2). There are two methods to dene the spot signal. The rst method

s to use an area o a xe s ze t at s centre on t e centre o mass o t e spot. s

met o as an a vantage t at t s computat ona y ess expens ve, ut a sa vantage

e ng more error-prone n est mat ng spot ntens ty an ac groun ntens ty. n

ternat ve met o s to prec se y e ne t e oun ary or a spot an on y nc u e p xe s

t n t e oun ary. s met o as an a vantage t at t can g ve a etter est mate o

t e spot ntens ty, ut a so as a sa vantage o e ng computat ona y ntens ve antime-consuming.

. Reporting summary statistics and assigning spot intensity after subtracting for

ackground intensity.

Once the spot and background areas have been dened, a variety of summary statistics

or each spot in each channel (red and green channels) are reported. Typically, each

pixel (Figure 2) within the area is taken into account, and the mean, median, and total

a ues or t e ntens ty cons er ng a t e p xe s n t e e ne area are reporte or

ot t e spot an ac groun . ost approac es use t e spot me an va ue, w t t e

ac groun me an va ue su tracte rom t, as t e metr c to represent spot ntens ty.e me an ntens ty s a va ue w ere a t e measure p xe s ave ntens t es greater

t an t s va ue an t e ot er a o t e measure p xe s ave ntens t es ess t an

this value. The background subtracted median value approach has an advantage of

being relatively insensitive to a few pixels with anomalous uorescent values in one

r both channels, but has a disadvantage of being sensitive to misidentication of spot

nd background areas. The other method is to use total intensity values, which has an

dvantage of being insensitive to misidentication of spots (as few more pixels with

ero value in the background will not affect the total intensity), but has a disadvantage

f being prone to be skewed by a few pixels with extreme intensity values.

not er cons erat on n mage process ng s t e num er o p xe s to e nc u e or

measurement n t e spot mage. or many scanners, t e e au t p xe s ze s . s

means t at an average spot o ameter o w ave ~ p xe s. owever, or a

sma er spot ameter, t s etter to use a sma er p xe s ze to ensure enoug p xe s are

samp e . ost scanners now a ow t e p xe s ze o m. ven t oug us ng a sma er

pixel size increases our condence in the measurement, the only disadvantage is that the

image le size tends to be much bigger when compared with image le sizes created using

larger pixel sizes.

2.2 Expression ratios: the primary comparisonWe saw that the relative expression level for a gene can be measured as the amount of red or

reen light emitted after excitation. The most common metric used to relate this information


5/25

229


s ca e express on rat o. t s enote ere as kan e ne as:

=k

k

For each gene kon the array, where represents the spot intensity metric for the test

sample and Gk represents the spot intensity metric for the reference sample. As mentioned

bove, the spot intensity metric for each gene can be represented as a total intensity value

or a background subtracted median value. If we choose the median pixel value, then the

median expression ratio for a given spot is:

background

median

spot

median

background

median

spot

medianmedian

GG

RRT

=

where spotmedianR andbackgroundmedianR are the median intensity values for the spot and background

respectively, for the test sample.

2.3 Transformations of the expression ratio

e express on rat o s a re evant way o represent ng express on erences n a very

ntu t ve manner. or examp e, genes t at o not er n t e r express on eve w ave

n express on rat o o . owever, t s representat on may e un e p u w en one as to

represent up-regulation and down-regulation. For example, a gene that is up-regulated by a

factor of 4 has an expression ratio of 4 (R/G = 4G/G = 4). However, for the case where a gene

is down regulated by a factor of 4, the expression ratio becomes 0.25 (R/G = R/4R = 1/4).

Thus up-regulation is blown up and mapped between 1 and innity, whereas down-regulationis compressed and mapped between 0 and 1.

[ ] ,1regulationUp mapped

regulationDown mapped

To eliminate this inconsistency in the mapping interval, one can perform two kinds of

transformations of the expression ratio, namely, inverse transformation and logarithmic

trans ormat on.

Inverse or reciprocal transformation

e nverse or rec proca trans ormat on converts t e express on rat o nto a o -c ange,

w ere or genes w t an express on rat o o ess t an t e rec proca o t e express on

ratio is multiplied by -1. If the expression ratio is 1 then the fold change is equal to theexpression ratio. The advantage of such a transformation is that one can represent up-

regulation and down-regulation with a similar mapping interval.

=

=

=


6/25

Madan Babu

owever, t s met o a so as a pro em n t at t e mapp ng space s scont nuous etween

an + an ence ecomes a pro em n most mat emat ca ana yses ownstream o

this step.

Logarithmic transformation

A better transformation procedure is to take the logarithm base 2 value of the expression

ratio (i.e.log expression ratio)). This has the major advantage that it treats differential

up-regulation and down-regulation equally, and also has a continuous mapping space.

For example, if the expression ratio is 1, then log2 1) equals 0 represents no change in

expression. If the expression ratio is 4, then log2 4) equals +2 and for expression ratio of

og 1 equa s - . us, n t s trans ormat on t e mapp ng space s cont nuous an up-

regu at on an own-regu at on are compara e.

av ng exp a ne t e a vantages o us ng express on rat os as a metr c or gene

express on, t s ou a so e un erstoo t at t ere are sa vantages o us ng express on

rat os or trans ormat ons o t e rat os or ata ana ys s. ven t oug express on rat os canreveal patterns inherent in the data, they remove all information about absolute expression

levels of the genes. For example, genes that have R/G ratios of / nd / will end up

having the same expression ratio of 4, and associated problems will surface when one tries

to reliably identify differentially regulated genes.

2.4 Data normalization

In the last section, it was shown that expression ratios and their transformations is

reasona e measure to etect erent a y expresse genes. owever, w en one

compares t e express on eve s o genes t at s ou not c ange n t e two con t ons say,

ouse eep ng genes , w at one qu te o ten n s s t at an average express on rat o o sucenes ev ates rom . s may e ue to var ous reasons, or examp e, var at on cause y

erent a a e ng e c ency o t e two uorescent yes or erent amounts o start ng

mRNA material in the two samples. Thus, in the case of microarray experiments, as for

ny large-scale experiments, there are many sources of systematic variation that affect

measurements of gene expression levels.

Normalization is a term that is used to describe the process of eliminating such variations

to allow appropriate comparison of data obtained from the two samples. There are many

methods of normalization and discussing each one of them is beyond the scope of this

chapter.

The rst step in a normalization procedure is to choose a gene-set (which consists ofenes or w c express on eve s s ou not c ange un er t e con t ons stu e , t at

s t e express on rat o or a genes n t e gene-set s expecte to e . rom t at set, a

norma ization actor,w c s a num er t at accounts or t e var a ty seen n t e gene-

set, s ca cu ate . t s t en app e to t e ot er genes n t e m croarray exper ment. ne

s ou note t at t e norma zat on proce ure c anges t e ata, an s carr e out on y on

the background corrected values for each spot. Figure 3 shows expression data before and

fter the normalization procedure.

Total intensity normalization

The basic assumption in a total intensity normalization is that the total quantity of RNA forthe two samples is the same. Also assuming that the same number of molecules of RNA


7/25

231


from both samples hybridize to the microarray, the total hybridization intensities for the

ene-sets should be equal. So, a normalization factor can be calculated as:

=

==setgene

setgene

N

k

k

N

k

k

total

G

R

N

1

1

The intensities are now rescaled such that totalkk NGG ='

and kk RR ='

. The normalized

expression ratio becomes:

total

k

totalk

k

k

k

k N

T

NG

R

G

R

T === ''

'

Figure 3. Gene expression data before and after the normalization procedure. Note that before normalization the

image had many spots of different intensities, but after normalization only spots that are really different light up.

This image was kindly provided by N. Luscombe Colour gure t: http://www.mrc-lmb.cam.ac.uk/genomes/

madanm/microarray/.


8/25

Madan Babu

c s equ va ent to:

)(log-)(log)(log 22

'

2 totalkk NTT =

This now adjusts the ratio such that the mean ratio for the gene set is equal to 1.

Mean log centring

In this method, the basic assumption is that the mean log2(expression ratio) should be equal

to 0 for the gene-set. In this case, the normalization factor can be calculated as:

setgene

N

k

mlc

N

kG

kR

N

setgene

=

=1

2log

The intensities are now rescaled such that )2(' mlcN

kk GG = and kk RR =' . The normalized

expression ratio becomes:

mlcmlc N

k

N

k

k

k

kk

T

G

R

G

RT

2)2('

'' =

==

Which is equivalent to:

mlck

N

kk NTTTmlc -)(log)(2log-)(log)(log 222

'

2 ==

s a usts t e rat o suc t at t e mean og2 express on rat o or t e gene-set s equa

to .

Other normalization methods include: linear regression, Chens ratio statistics and

Lowess normalization. The next step following the normalization procedure is to lter

low intensity data using specic threshold or relative threshold imposed according to the

background intensity. If the experimental procedure included a replicate, averaging the

alues using the replicate data is the next step to be performed after data ltering. Finally,differentially expressed genes are identied. For an excellent review of normalization

procedures, ltering methods and averaging procedures using replicate data, please refer

to uac en us , an re erences t ere n.

3. ANALYSIS OF GENE EXPRESSION DATA

ne o t e reasons to carry out a m croarray exper ment s to mon tor t e express on eve o

enes at a genome sca e. atterns cou e er ve rom ana ys ng t e c ange n express on

of the genes, and new insights could be gained into the underlying biology. In this section,

basic terminologies, representations of the microarray data and the various methods by

which expression data can be analysed will be introduced.The processed data, after the normalization procedure, can then be represented in the form

of a matrix, often called gene expression matrix (Table 1A). Each row in the matrix corresponds


9/25

233


to a particular gene and each column could either correspond to an experimental condition

or a specic time point at which expression of the genes has been measured. The expression

levels for a gene across different experimental conditions are cumulatively called the gene

expression prole, and the expression levels for all genes under an experimental condition are

cumulatively called the sample expression prole. Once we have obtained the gene expressionmatrix (Table 1A), additional levels of annotation can be added either to the gene or to the

sample. For example, the function of the genes can be provided, or the additional details on

the biology of the sample may be provided, such as disease state or normal state.

epen ng on w et er t e annotat on s use or not, ana ys s o gene express on ata

can e c ass e nto two erent types, name y superv se or unsuperv se earn ng. n

t e case o a superv se earn ng, we o use t e annotat on o e t er t e gene or t e samp e,

n create c usters o genes or samp es n or er to ent y patterns t at are c aracter st c

for the cluster. For example, we could separate sample expression proles into disease

state and normal state groups, and then look for patterns that separate the sample prole

of the disease state from the sample prole of the normal state.

In the case of an unsupervised learning, the expression data is analysed to identify

patterns that can group genes or samples into clusters without the use of any form of

Table 1. A: Gene expression matrix that contains rows representing genes and columns representing particular

conditions. Each cell contains a value, given in arbitrary units, that reects the expression level of a gene under

a corresponding condition. B: Condition C4 is used as a reference and all other conditions are normalized with

respect to C4 to obtain expression ratios. C: In this table all expression ratios were converted into the log 2(expression ratio) values. This representation has an advantage of treating up-regulation and down-regulation on

comparable scales. D: Discrete values for the elements in Table 1.C. Genes with log 2 (expression ratio) values

greater than 1 were changed to 1, genes with values less than 1 were changed to 1. Any value between 1 and

1 was changed to 0.


10/25

Madan Babu

nnotat on. or examp e, genes w t s m ar express on pro es can e c ustere toget er

w t out t e use o any annotat on. owever, annotat on n ormat on may e ta en nto

ccount at a later stage to make meaningful biological inferences. Throughout this section,

set of genes or set of experimental conditions that have similar expression proles will be

referred to as a cluster. Thus, a cluster consists of objects with similar expression proles,

where an object may either refer to genes or samples.

3.1 Representation of gene expression data

To make any meaningful comparison or biological analysis, one should know what the

ata n t e gene express on matr x represents. xpress on ata can e represente n ve

erent ways, w c are escr e e ow:

so ute measurement

n t e case o an a so ute measurement, eac ce n t e matr x w represent t e express on

eve o t e gene n a stract un ts. ote t at t s not mean ng u to compare express on eve sof genes across two different conditions in absolute units, because the starting amounts of

mRNA could be different. Table 1A shows a sample gene expression matrix with each cell

containing the expression level in abstract units.

Relative measurement or expression ratio

In the case of a relative measurement or representations involving expression ratio, the

expression level of a gene in abstract units is normalized with respect to its expression in a

re erence con t on. s g ves t e express on rat o o t e gene n re at ve un ts. ote t at

n suc cases, a rat o o 000 00 ea to t e same resu t as40

10. us any n ormat on on

so ute measurement w e ost n suc a representat on, ut now mean ng u compar soncross erent con t ons can e ma e as ong as t e same re erence con t on s use to get

t e express on rat o. s ment one e ore, t s representat on oes not treat up-regu at on

nd down-regulation in a comparable manner. Table 1B shows the gene expression matrix

with each cell representing the expression ratio normalized with respect to a reference

condition.

og expression ratio

In the case of tables representing the log2(expression ratio) values, information on up-

regulation and down-regulation is captured and is mapped in a symmetric manner. For

example, 4-fold up-regulation maps to log2(4) = 2 and a 4-fold down-regulation maps toog = - . us, rom t s ta e t e o -c ange or a erent a y regu ate gene

un er any con t on can e eas y recogn se . a e s ows t e og2 express on rat o

a ues o t e genes un er erent con t ons.

iscrete va ues

Another way of representing information is to convert to discrete numbers the values in

the tables mentioned above. In the case of converting the absolute measurement to discrete

numbers, a binary expression matrix of 1 and 0 can be used, where 1 means that the gene is

expressed above a user dened threshold, and 0 means that the gene is expressed below this

threshold. In the case of making the relative expression tables or log2(expression ratio) tablesdiscrete, values can be divided into 3 classes, +1, 0 and 1, where +1 represents a gene that is

positively regulated, 0 represents a gene that is not differentially regulated and 1 represents


11/25

235


gene t at s represse . e process o ma ng t e va ues screte oses a ot o n ormat on,

ut s use u to ana yse express on pro es us ng a gor t ms t at cannot an e rea va ue

expression matrices, for example algorithms calculating mutual information between genes

or samples. Table 1D shows discrete values for the log (expression ratio) table.

Representation of expression proles as vectors

So far we have seen how individual cells in the gene expression matrix can be represented.

Similarly, an expression prole (of a gene or a sample) can be thought of as a vector and

can be represented in vector space. For example, an expression prole of a gene can be

cons ere as a vector n ndimensional space (where n is the number of conditions), and an

express on pro e o a samp e w t genes can e cons ere as a vector n m mens ona

space w ere m s t e num er o genes . n t e examp e g ven e ow, t e gene express on

matr x w t mgenes across ncon t ons s cons ere to e an mx nmatr x, w ere t e

express on va ue or gene i n con t on s enote as xj

=

mnmm

n

n

xxx

xxx

xxx

X

..

........

..

..

21

22221

11211

The expression prole of a geneican be represented as a row vector:

[ ]iniiii xxxxG ..,,,, 321=

The expression prole of a samplejcan be represented as a column vector:

=

mj

j

j

i

x

x

x

G..

2

1

n t e next sect on, we w see ow express on pro es t at are represente as vectors can

e use to compare ow s m ar or erent are t e pa rs o o ects remem er, an o ect

may re er to a gene or a samp e .

3.2 Distance measures

Analysis of gene expression data is primarily based on comparison of gene expression

proles or sample expression proles. In order to compare expression proles, we need a

measure to quantify how similar or dissimilar are the objects that are being considered. A

ariety of distance measures can be used to calculate similarity in expression proles andthese are discussed below.


12/25

Madan Babu

Euclidean distance

uc ean stance s one o t e common stance measures use to ca cu ate s m ar ty

between expression proles. The Euclidean distance between two vectors of dimension 2,

sayA a a and b , b can be calculated as:

2

22

2

11 )()(),( babaBADEuc +=

For instance two genes with expression proles in two conditions = 1,2 and = ,3 ,

the Euclidean distance can be calculated as:

2)32()21(),( 2221 =+=GGDEuc

us or genes w t express on ata ava a e or n con t ons, represente as

a1 an an = 1 n , uc ean stance can e ca cu ate as:

=

=n

i

iiEuc baBAD1

2)(),(

In other words, the Euclidean distance between two genes is the square root of the sum of

t e squares o t e stances etween t e va ues n eac con t on mens on .

more genera orm o t e uc ean stance s ca e t e in ows i istance

ca cu ate as:

p

n

i

p

iiMin baBAD =

=1

)(),(

A special case of the Minkowski distance when p=1 is called rectilinear distance. When

pplied to binary expression proles (i.e.expression levels changed to 1 and 0), it is called

amm ng stance.

Pearson correlation coefcient

ne o t e most common y use metr cs to measure s m ar ty etween express on pro es

s t e earson corre at on coe c ent sen t a . . ven t e express on

ratios for two genes under three conditionsA [a , a , a andB=[b , b b ], PCC can be

computed as follows:

Step1: Compute mean

33

321321 bbbbandaaa

a ++

=++

=


13/25

237


tep : ean centre express on pro es

),,(),,( 321321 bbbbbbBandaaaaaaA ==

Step3: Calculate PCC as the cosine of the angle between the mean-centred proles

BA

BAPCC

o=

Where,

=

=n

i

ii bbaaBA1

)()(o

=

=n

i

i aaA1

2)(

and

=

=n

i

i bbB1

2)(

The reason why we mean centre the expression proles is to make sure that we compare

shapes of the expression proles and not their magnitude. Mean centring maintains the

shape of the prole, but it changes the magnitude of the prole as shown in Figure 4.

va ue o essent a y means t at t e two genes ave s m ar express on pro es

n a va ue o means t at t e two genes ave exact y oppos te express on pro es. va ue

o means t at no re at ons p can e n erre etween t e express on pro es o genes. n

rea ty, va ues range rom to + . va ue . suggests t at t e genes e aves m ar y an a va ue - . suggests t at t e genes ave oppos te e av our. e

alue of 0.7 is an arbitrary cut-off, and in real cases this value can be chosen depending on

the dataset used. An example calculation is shown below:

Figure 4. Expression prole before and after mean centring. Note that after mean centring, the relative shapesof the expression proles are still maintained, but the magnitude changes. The graphs do not show the actual

result of PCC analysis.


14/25

Madan Babu

ons er two genes w t express on pro es = an = .

e can e ca cu ate as o ows:

a= 4.8 and b 9.6, the mean centred expression proles become:

A - . - . . . . an B = - . - . - . . .

Therefore,995.0

2.1658.36

6.77=

==

BA

BAPCC

o

Where,

77.6)()()()()( =++++= 9.44.22.41.20.60.23.6-1.87.6-3.8BAo

8.36)()()()()( 22222 =++++= 4.21.20.2-1.8-3.8A

165.29.42.4-0.6-3.6-7.6B =++++= 22222 )()()()()(

Rank correlation coefcient

an corre at on coe c ent s a stance measure t at oes not ta e nto account

the actual magnitude of the expression ratio in each condition, but takes into account the

rank of the expression ratio. For example, consider two genesA= 2, 3, , 15, 8 and

= 2, 7, 15, 25, 13 . When we consider the rank of the values for different conditions forene A, we get the following:

2 (rank = 1) < 3 (rank = 2) < 8 (rank = 3) < 9 (rank = 4) < 15 (rank = 5) which is equivalent

to = , , , .

m ar y, or gene , we get t e ran s or t e va ues or t e erent con t ons as:

ran = < ran = < ran = < ran = < ran = , w c s

equ va ent to = .

Rank correlation coefcient is the PCC calculated on the expression proles converted

into their rank proles. In the above case the two genes have exactly the same rank prole,

thus rank correlation coefcient becomes 1. However, PCC is not applicable when two

alues within a rank prole are repeated. In this case, the rank correlation coefcient can

be directly computed as:

=

=n

i

irank

nn

dBAD

12

2

)1(61),(

Where nis the number of conditions (dimension of the prole) and dis the difference

etween ran s or t e two genes at con t on i. n a vantage o s t at t s not

sens t ve to out ers n t e ata.


15/25

239


Mutual information

stance measure to compare genes w ose pro es ave een ma e screte can e

calculated using an entropy notion, called Shannons entropy. This measure gives us a metric

that is indicative of how much information from the expression prole of one gene can

be obtained to predict the behaviour of the other gene.

Consider the discrete expression proles for two genes,A= 1, 1, 0, 1, -1] and = 1

-1, 0, 1, -1 . We know that at any condition, the values that have been made discrete can

be 1, 0 or 1. Thus, the probability for each state to occur in the prole for the two genes

can be computed as follows:

Genes Probability

P(1) P(0) P(-1) P(1)+P(0)+P(-1)

A5

3

(3 occurrencesin 5 conditions)

51

(1 occurrencein 5 conditions)

51

(1 occurrencen 5 conditions)

( )1

5

113=

++

B5

2

(2 occurrences

in 5 conditions)

51

(1 occurrence

in 5 conditions)

52

(2 occurrences

n 5 conditions)

( )1

5

212=

++

rom t s ta e, t e annon s entropy or t e genes can e ca cu ate as:

=

=3

1

2log)(i

ii PPgeneH

Note that iruns from 1 to 3 because there are three possible states (1, 0 and 1).

371.1

51log

51

51log

51

53log

53(1 222 =)++= -H(A)

522.15

2log5

25

1log5

15

2log5

2(1 222 =)++= -H(B)

The next step in our calculation is to consider how often gene A and gene B have the same

state (1, 0, or -1) across given conditions. There are 9 possible pairwise combinations of

states, and they are calculated for our example in the following manner:

P(A,B) Occurrence P(A,B) Occurrence P(A,B) Occurrence

P(1,1) 52

P(0,1) 50

P(-1,1) 50

P(1,0) 50

P(0,0) 51

P(-1,0) 50

P(1,-1) 51

P(0,-1) 50

P(-1,-1) 51

e num er o con t ons n w c ot gene an gene ave t e r va ues equa to

over a con t ons s out o con t ons, an so on.


16/25

Madan Babu

not er parameter we w nee to ca cu ate mutua n ormat on s o nt entropy , :

=

=3

1,

2log),(

ji

ijij PPBAH

w en ot ian j n epen ent y run rom to , correspon ng to t e t ree states , an .

923.1

51log

51

51log

51

51log

51

52log

52(, 2222 =)+++= -1B)H(A

For the above example, the mutual information between the two expression proles, which

provides a measure of the similarity between the two genes can be calculated as:

970.0923.1522.1371.1),()()(),( =+=+= BAHBHAHBAM

n genera , t e g er t e mutua n ormat on score, t e more s m ar are t e two pro es.

owever, prec se state an consequent y, nterpretat on o t e o serve score wou

epen on t e num er o con t ons or w c measurements were ava a e. or our case

o con t ons, t e o ta ne score o . s g . ut rea er s a v se to consu t more

specialised sources for understanding of states associated with mutual information distance

measure (Shannon, 1949). One should note that a distance measure has to be chosen only

fter considering the data to be analysed and that there is no single distance measure that

is appropriate for all types of data.

3.3 Clustering methods

One of the goals of microarray data analysis is to cluster genes or samples with similarexpress on pro es toget er, to ma e mean ng u o og ca n erence a out t e set o genes

or samp es. uster ng s one o t e unsuperv se approac es to c ass y ata nto groups

o genes or samp es w t s m ar patterns t at are c aracter st c to t e group. uster ng

met o s can e erarc ca group ng o ects nto c usters an spec y ng re at ons ps

mong o ects n a c uster, resem ng a p y ogenet c tree or non- erarc ca group ng

Figure 5. An overview of the different clustering methods.


17/25

241


nto c usters w t out spec y ng re at ons ps etween o ects n a c uster as sc emat ca y

represente n gure . emem er, an o ect may re er to a gene or a samp e, an a c uster

refers to a set of objects that behave in a similar manner.

Hierarchical clustering

Hierarchical clustering may be agglomerative (starting with the assumption that each

object is a cluster and grouping similar objects into bigger clusters) or divisive (starting

from grouping all objects into one cluster and subsequently breaking the big cluster into

smaller clusters with similar properties). The basic idea behind agglomerative and divisive

hierarchical clustering is shown in Figure 6. There are many different types of clustering

met o s an a ew common y use ones are escr e e ow.

Figure 6. Schematic diagram showing the principle behind agglomerative and divisive clustering. The colour

code represents the log2 (expression ratio), where red represents up-regulation, green represents down-regulation,

and black represents no change in expression. In agglomerative clustering, genes that are similar to each other

are grouped together, and an average expression prole is calculated for the group by using the average linkage

algorithm. This step is performed iteratively until all genes are included into one cluster. In the case of divisive

clustering, the whole set of genes is considered as a single cluster and is broken down iteratively into sub-clusters

with similar expression proles until each cluster contains only one gene. This information can be represented

as a tree, where the terminal nodes represent genes and all branches represent different clusters. The distance

from the branch point provides a measure of the distance between two objects. This image was adapted from

Dopazo et al., (2001). Notice that the ordered matrix at the top is the actual product of either agglomerative ordivisive clustering, and genes A to E are given in the nal order for the simplicity of illustration; initially rows

corresponding to genes A to E could be arranged in any order and it is the task of the methods to arrange them

meaningfully. Colour gure t: http://www.mrc-lmb.cam.ac.uk/genomes/madanm/microarray/.


18/25

Madan Babu

Hierarchical clustering: agglomerative

In the case of a hierarchical agglomerative clustering, the objects are successively fused

until all the objects are included. For a hierarchical agglomerative clustering procedure,

each object is considered as a cluster. The rst step is the calculation of pairwise distance

measures for the objects to be clustered. Based on the pairwise distances between them,

o ects t at are s m ar to eac ot er are groupe nto c usters. ter t s s one, pa rw se

stances etween t e c usters are re-ca cu ate , an c usters t at are s m ar are groupe

toget er n an terat ve manner unt a t e o ects are nc u e nto a s ng e c uster. s

n ormat on can e represente as a en rogram, w ere t e stance rom t e ranc po nt

s n cat ve o t e stance etween t e two c usters or o ects.

Comparison of clusters with another cluster or an object can be carried out using four

pproaches (Figure 7).

Single linkage clustering (Minimum distance)

In single linkage clustering, distance between two clusters is calculated as the minimum

distance between all possible pairs of objects, one from each cluster. This method has

n advantage that it is insensitive to outliers. This method is also known as the nearest

ne g our n age.

omp ete in age c ustering aximum istance

n comp ete n age c uster ng, stance etween two c usters s ca cu ate as t e max mum

stance etween a poss e pa rs o o ects, one rom eac c uster. e sa vantage

of this method is that it is sensitive to outliers. This method is also known as the farthest

neighbour linkage.

Average linkage clusteringIn average linkage clustering, distance between two clusters is calculated as the average of

distances between all possible pairs of objects in the two clusters.

Figure 7. Different algorithms to nd distance between two clusters.


19/25

243


Centroid linkage clustering

In centroid linkage clustering, an average expression prole (called a centroid) is calculated

n two steps. rst, t e mean n eac mens on o t e express on pro es s ca cu ate or

o ects n a c uster. en, stance etween t e c usters s measure as t e stance

etween t e average express on pro es o t e two c usters.

n examp e o t e erarc ca agg omerat ve c uster ng us ng s ng e n age c uster ng

s s own n gure .

Hierarchical clustering: divisive

Hierarchical divisive clustering is the opposite of the agglomerative method, where the entire

set of objects is considered as a single cluster and is broken down into two or more clusters

that have similar expression proles. After this is done, each cluster is considered separately

nd the divisive process is repeated iteratively until all objects have been separated into

single objects. The division of objects into clusters on each iterative step may be decided

upon y pr nc pa component ana ys s w c eterm nes a vector t at separates g ven o ects.

s met o s ess popu ar t an agg omerat ve c uster ng, ut as success u y een use

n t e ana ys s o gene express on ata y on t a . .

Non-hierarchical clustering

One of the major criticisms of hierarchical clustering is that there is no compelling evidence

that a hierarchical structure best suits grouping of the expression proles. An alternative to

this method is a non-hierarchical clustering, which requires predetermination of the number

of clusters. Non-hierarchical clustering then groups existing objects into these predened

clusters rather than organizing them into a hierarchical structure.

Figure 8. An example of a hierarchical clustering using single linkage algorithm. Consider ve genes and the

distances between them as shown in the table. In the rst step, genes that are close to each other are grouped

together and the distances are re-calculated using the single linkage algorithm. This procedure is repeated until

all genes are grouped into one cluster. This information can be represented as a tree (shown to the right), where

the distance from the branch point reects the distance between genes or clusters. This image was adapted from

Causton et al.(2003).


20/25

Madan Babu

Non-hierarchical clustering: K-means

K-means is a popular non-hierarchical clustering method (Figure 9A). In K-means

c uster ng, t e rst step s to ar trar y group o ects nto a pre eterm ne num er o

c usters. e num er o c usters can e c osen ran om y or est mate y rst per orm ng

erarc ca c uster ng o t e ata. o ow ng t s step, an average express on pro e

centro s ca cu ate or eac c uster, t s s ca e n t a zat on. ext, n v ua o ects

re reattr ute rom one c uster to t e ot er epen ng on w c centro s c oser to t e

ene (or sample). This procedure of calculating the centroid for each cluster and re-grouping

objects closer to available centroids is performed in an iterative manner for a xed number

of times, or until convergence (state when composition of clusters remains unaltered by

further iterations). Typically, the number of iterations required to obtain stable clusters ranges

from 20,000 to 100,000. However, there is no guarantee that the clusters will converge.

This method has an advantage that it is scalable for large datasets.

Non-hierarchical clustering: Self Organizing Maps

e rgan z ng aps s wor n a manner s m ar to -means c uster ng gure .

n -means c uster ng, one c ooses t e num er o c usters to t t e ata, w ereas w t

t e rst step s to c oose t e num er an or entat on o t e c usters w t respect to eac

ot er. or examp e, a two- mens ona gr o no es w c may en up e ng c usters

Figure 9. A: The principle behind K-means clustering. Objects are grouped into a predened number of clusters

during the initialization step. Centroid for each cluster is calculated, and objects are re-grouped depending on how

close they are to available centroids. This step is performed iteratively until convergence or is performed for a

xed number of iterations to get nal clusters of objects. B: The principle behind SOMs. During the initialization

step, a grid of nodes is projected onto the expression space and each gene is assigned its closest node. Following

this step, one gene is chosen at random and the assigned node is moved towards it. The other nodes are moved

towards this gene depending on how close they are to the selected gene. This step is performed iteratively untilconvergence or is performed for a xed number of iterations to get a nal map of nodes.


21/25

245


cou e t e start ng po nt. e gr s pro ecte onto t e express on space, an eac o ect

s ass gne a no e t at s nearest to t t s s ca e n t a zat on. n t e next step, a ran om

object is chosen and the node (called a reference vector) which is in the neighbourhood

of the object is moved closer to it. The other nodes are moved to a small extent depending

on how close they are to the object chosen. In successive iterations, with randomly chosen

objects, the positions of the nodes are rened and the radius of neighbourhood becomes

conned. In this way, the grid of nodes (initially a two-dimensional grid) is deformed to

t the data. The advantage of this method, unlike K-means, is that SOM does not force the

number of clusters to be equal to the number of starting nodes in the chosen grid. This is

because some nodes may have no objects associated with them when the map is complete.

t er a vantages o nc u e prov ng n ormat on on t e s m ar ty etween t e

no es, an t e a ty o to pro uce re a e resu ts even w t no sy ata.

4. RELATING EXPRESSION DATA TO OTHER BIOLOGICAL INFORMATION

ene express on pro es can e n e to externa n ormat on to ga n ns g t nto o og caprocesses and to make new discoveries. Some of the possible questions that can be addressed

fter analysing gene expression data will be discussed in this section.

4.1 Predicting binding sites

It is reasonable to assume that genes with similar expression proles are regulated by the

same set of transcription factors. If this happens to be the case, then genes that have similar

expression proles should have similar transcription factor binding sites upstream of the

co ng sequence n t e . ar ous researc groups ave exp o te t s assumpt on.

razma t a . an ot ers ussema er et a ., ; on on t a ., ave stu e

t e occurrence o sequence patterns an scovere putat ve n ng s tes n t e promoterreg ons o genes t at are co-expresse . e steps nvo ve n suc stu es are t e o ow ng:

n a set o genes t at ave s m ar express on pro es. xtract promoter sequences

of the co-expressed genes. (3) Identify statistically over-represented sequence patterns. (4)

Assess quality of the discovered pattern using statistical signicance criteria.

4.2 Predicting protein interactions and protein functions

Integrating expression data with other external information, for example evolutionary

conservation of proteins, have been used to predict interacting proteins, protein complexes,

nd protein function. Work by Ge t al.(2001) and Jansen and Gerstein (2000) have shown

that genes with similar expression proles are more likely to encode proteins that interact.en t s n ormat on s com ne w t evo ut onary conservat on o prote ns, mean ng u

pre ct ons can e ma e. n a recent wor y e c mann an a an a u t was

s own t at prote ns t at are evo ut onar y conserve n yeast an worm an t at ave

s m ar express on pro es n ot organ sms ten to e a part o t e same sta e comp ex

or nteract p ys ca y. oort et a . ave a so s own t at t e enco e prote ns o

conserved, co-expressed gene pairs are highly likely to be part of the same pathway. Such

studies enable us to predict specic gene functions. The steps involved in such studies are

the following: (1) Identify co-expressed genes in the two studied organisms. (2) Identify

conserved (orthologous) proteins. (3) Find instances where conserved (orthologous) proteins

re co-expressed in both organisms. (4) Map information on protein interaction or metabolicpathway available for one organism to predict interacting proteins or function of the proteins

n t e ot er organ sm.


22/25

Madan Babu

4.3 Predicting functionally conserved modules

Genes that have similar expression proles often have related functions. Instead of studying

co-expresse pa rs o genes, one can v ew sets o co-expresse genes t at are nown to

nteract as a unct ona mo u e nvo ve n a part cu ar o og ca process a an a u t a .,

. s n ormat on, w en ntegrate w t t e evo ut onary conservat on o prote ns n

more t an two organ sms, prov es now e ge o t e s gn cance o t e unct ona mo u es

t at ave een conserve n evo ut on. tuart t a . ave a resse t s ssue n great

detail and have identied one evolutionarily conserved functional module which belongs

to an as yet unknown biological process. For other modules that are known to be involved

in previously well-studied biological process, new module members were discovered by

Stuart t al. (2003), providing clues about unknown candidates involved in the processes.

The steps involved in such studies are similar to those discussed in the previous section.

Instead of two organisms, one has to consider three or more organisms, and should also

ddress other issues related to identifying orthologous proteins.

4.4 Reverse-engineering of gene regulatory networks

ene express on ata can a so e use to n er regu atory re at ons ps. s approac s

nown as reverse eng neer ng o regu atory networ s. esearc y ega et a . an

ar ner t a . c ear y g g ts t at we are now n a goo pos t on to use express on

ata to ma e pre ct ons a out t e transcr pt ona regu ators or a g ven gene or sets o

enes. Segal et al.(2003) have developed a probabilistic model to identify modules of co-

regulated genes, their transcriptional regulators and conditions that inuence regulation.

This new knowledge allowed them to generate further hypotheses, which are experimentally

testable. Gardner t al.(2003) described a method to infer regulatory relationships, called

NIR (Network Identication by multiple Regression), which uses non-linear differentialequations to model regulatory networks. In this method, a model of connections between

enes in a network is inferred from measurements of system dynamics ( i.e.response of

enes an prote ns to pertur at ons .

5. WEBSITE REFERENCES, ACADEMIC SOFTWARE AND WEB

SUPPLEMENT

5.1 Website references

Some websites that provide a reference to various aspects of microarrays are given

below:

Portals:

http://ihome.cuhk.edu.hk/%7Eb400559/array.html

A comprehensive web portal on microarrays.

http://www.bioinformatics.vg/biolinks/bioinformatics/Microarrays.shtml

A web-portal on microarrays.

http://www.hgmp.mrc.ac.uk/GenomeWeb/nuc-genexp.html

A collection of gene expression and microarray links at the HGMP (Human Genome

Mapping Project).


23/25

247


Tutorials:

http://www.ucl.ac.uk/oncology/MicroCore/HTML_resource/tut_frameset.htm

A website that provides a tutorial on the various aspects of microarray data analysis

iscusse in t is c apter.

Software links:

http://genome-www5.stanford.edu/restech.html

This website provides a list of software available for microarray data analysis with a brief

escription o w at t e so tware oes an t e p at orm on w ic it can e run.

Microarray databases:

http://genome-www5.stanford.edu/

Stanford Microarray Database (SMD) contains raw and normalized data from microarray

experiments as we as t eir image es. a so provi es inter aces or ata retrieva ,

ana ysis an visua isation.

http://www.ebi.ac.uk/arrayexpress/

ArrayExpress public repository for microarray data at the EMBL-EBI.

http://info.med.yale.edu/microarray/

Yale Microarray Database (YMD)

http://www.ncbi.nlm.nih.gov/geo/

Gene Expression Omnibus at the NCBI, NIH.

Table 2. List of software available for academic use.


24/25

Madan Babu

5.2 Software available for non-commercial use

The list of software provided here is by no means exhaustive. The readers are urged to

isit the reference websites provided above to get a more comprehensive list of available

programs a e .

5.3 Supplementary material on the web

e we -supp ement s ava a e at:

http://www.mrc-lmb.cam.ac.uk/genomes/madanm/microarray/

It has PERL scripts to calculate the following statistics:

. Euclidean distance.

2. Pearson correlation coefcient.

. Rank correlation coefcient.

xpress on atasets or t e yeast genome rom o t a . an pe man et a .

re a so prov e .

Acknowledgements

wou e to t an ar e as au, an a an t e ot ers n t e group or rea ng

the chapter. I would also like to acknowledge the Medical Research Council, Cambridge

Commonwealth Trust and Trinity College, Cambridge for nancial support.

References

Alon, U., Barkai, N., Notterman, D.A., Gish, K., Ybarra, S., Mack, D., and Levine, A.J. (1999). Broad patterns ofgene expression revealed by clustering analysis of tumor and normal colon tissues probed by oligonucleotide

arrays. Proc. Natl. Acad. Sci. USA96, 6745-6750.

Brazma, A., Jonassen, I., Vilo, J., and Ukkonen, E. (1998). Predicting gene regulatory elements in silico on a

genomic scale. Genome Res.8, 1202-1215.

Bussemaker, H. J., Li, H., and Siggia, E. D. (2001). Regulatory element detection using correlation with expression.

at. enet. , 167-171.

Causton, H., Quackenbush, J., and Brazma, A. (2003). Microarray Gene Expression Data Analysis: A Beginners

Guide. (UK: Blackwell Science)

Cho, R. J., Campbell, M. J., Winzeler, E. A., Steinmetz, L., Conway, A., Wodicka, L., Wolfsberg, T. G., Gabrielian,

A. E., Landsman, D., Lockhart, D. J., and Davis, R. W. (1998). A genome-wide transcriptional analysis of the

mitotic cell cycle. Mol. Cell.2, 65-73.

Conlon, E. M., Liu, X. S., Lieb, J. D., and Liu, J. S. (2003). Integrating regulatory motif discovery and genome-

wide expression analysis. Proc. Natl. Acad. Sci. USA100, 3339-3344.Dopazo, J., Zanders, E., Dragoni, I., Amphlett, G., and Falciani, F. (2001). Methods and approaches in the analysis

of gene expression data. J. Immunol. Meth.250, 93-112.

Eisen, M. B., Spellman, P. T., Brown, P. O., and Botstein, D. (1998). Cluster analysis and display of genome-wide

expression patterns. Proc. Natl. Acad. Sci. USA95, 14863-14868.

Gardner, T. S., di Bernardo, D., Lorenz, D., and Collins, J. J. (2003). Inferring genetic networks and identifying

compound mode of action via expression proling. Science301, 102-105.

Ge, H., Liu, Z., Church, G. M., and Vidal, M. (2001). Correlation between transcriptome and interactome mapping

data from Saccharomyces cerevisiae. Nat. Genet.29, 482-486.

Jansen, R., and Gerstein, M. (2000). Analysis of the yeast transcriptome with structural and functional categories:

characterizing highly expressed proteins. Nucl. Acids Res.28, 1481-1488.

Luscombe, N. M., Royce, T. E., Bertone, P., Echols, N., Horak, C. E., Chang, J. T., Snyder, M., and Gerstein,

M. (2003). ExpressYourself: A modular platform for processing and visualizing microarray data. Nucl. Acids

Res.31, 3477-3482.

Madan Babu, M., Luscombe, N., Aravind, L., Gerstein, M., Teichmann, S.A. (2004). Structure and evolution of

transcriptional regulatory networks. Curr. Opin. Struct. Biol. 14, in press.


25/25


Quackenbush, J. (2002). Microarray data normalization and transformation. Nat. Genet.32, 496-501.

Segal, E., Shapira, M., Regev, A., Pe'er, D., Botstein, D., Koller, D., and Friedman, N. (2003). Module networks:

identifying regulatory modules and their condition-specic regulators from gene expression data. Nat. Genet.

34, 166-176.

Shannon, C. (1949). A mathematical theory of communication. Bell. Sys. Tech. J. 17, 379-423; 623-656.

Spellman, P. T., Sherlock, G., Zhang, M. Q., Iyer, V. R., Anders, K., Eisen, M. B., Brown, P. O., Botstein, D., and

Futcher, B. (1998). Comprehensive identication of cell cycle-regulated genes of the yeast Saccharomyces

cerevisiae by microarray hybridization. Mol. Biol. Cell9, 3273-3297.

Stuart, J. M., Segal, E., Koller, D., and Kim, S. K. (2003). A gene-coexpression network for global discovery of

conserved genetic modules. Science302, 249-255.

Teichmann, S. A., and Madan Babu, M. (2002). Conservation of gene co-regulation in prokaryotes and eukaryotes.

Trends Biotechnol.20, 407-410; discussion 410.

van Noort, V., Snel, B., and Huynen, M. A. (2003). Predicting gene function by conserved co-expression. Trends

Genet.19, 238-242.

an introduction to microarray analysis

Documents