an introduction to microarray analysis
TRANSCRIPT
-
8/10/2019 An Introduction to Microarray Analysis
1/25
225
Microarray Data Analysis
ap er
An Introduction toMicroarray Data Analysis
. a an a u
Abstract
This chapter aims to provide an introduction to the analysis of gene expression data obtained
using microarray experiments. It has been divided into four sections. The rst section
provides basic concepts on the working of microarrays and describes the basic principlesbehind a microarray experiment. The second section deals with the representation and
extract on o n ormat on rom mages o ta ne rom m croarray exper ments. e t r
sect on a resses erent met o s or compar ng express on pro es o genes an a so
prov es an overv ew o erent met o s or c uster ng genes w t s m ar express on
pro es. e ast sect on ocuses on re at ng gene express on ata w t ot er o og ca
information; it will provide the readers with a feel for the kind of biological discoveries
one can make by integrating gene expression data with external information.
1. INTRODUCTION
Functional genomics involves the analysis of large datasets of information derived fromarious biological experiments. One such type of large-scale experiment involves monitoring
the expression levels of thousands of genes simultaneously under a particular condition,
called gene expression analysis. Microarray technology makes this possible and the quantity
o ata generate rom eac exper ment s enormous, war ng t e amount o ata generate
y genome sequenc ng pro ects. s c apter s a r e overv ew o t e as c concepts
nvo ve n a m croarray exper ment; t g ves a ee ng or w at t e ata actua y represents,
n w prov e n ormat on on t e var ous computat ona met o s t at one can emp oy
to er ve mean ng u resu ts rom suc exper ments.
1.1 What are microarrays and how do they work?
Microarray technology has become one of the indispensable tools that many biologists use
to monitor genome wide expression levels of genes in a given organism. A microarray is
typically a glass slide on to which DNA molecules are xed in an orderly manner at specic
locations called spots (or features). A microarray may contain thousands of spots and each
spot may conta n a ew m on cop es o ent ca mo ecu es t at un que y correspon
to a gene gure . e n a spot may e t er e genom c or s ort stretc o
o go-nuc eot e stran s t at correspon to a gene. e spots are pr nte on to t e g ass
s e y a ro ot or are synt es se y t e process o p oto t ograp y.
croarrays may e use to measure gene express on n many ways, ut one o t emost popular applications is to compare expression of a set of genes from a cell maintained
in a particular condition (condition A) to the same set of genes from a reference cell
-
8/10/2019 An Introduction to Microarray Analysis
2/25
Madan Babu
ma nta ne un er norma con t ons con t on . gure g ves a genera p cture o t e
exper menta steps nvo ve . rst, s extracte rom t e ce s. ext, mo ecu es
n t e extract are reverse transcr e nto c y us ng an enzyme reverse transcr ptasend nucleotides labelled with different uorescent dyes. For example, cDNA from cells
rown in condition A may be labelled with a red dye and from cells grown in condition B
with a green dye. Once the samples have been differentially labelled, they are allowed to
hybridize onto the same glass slide. At this point, any cDNA sequence in the sample will
hybridize to specic spots on the glass slide containing its complementary sequence. The
mount of cDNA bound to a spot will be directly proportional to the initial number of RNA
molecules present for that gene in both samples.
o ow ng t e y r zat on step, t e spots n t e y r ze m croarray are exc te y
aser an scanne at su ta e wave engt s to etect t e re an green yes. e amount
o uorescence em tte upon exc tat on correspon s to t e amount o oun nuc e c ac .or nstance, c rom con t on or a part cu ar gene was n greater a un ance t an
Figure 1. (A) A microarray may contain thousands of spots. Each spot contains many copies of the same DNAsequence that uniquely represents a gene from an organism. Spots are arranged in an orderly fashion into Pen-
groups. (B) Schematic of the experimental protocol to study differential expression of genes. The organism is
grown in two different conditions (a reference condition and a test condition). RNA is extracted from the two
cells, and is labelled with different dyes (red and green) during the synthesis of cDNA by reverse transcriptase.
Following this step, cDNA is hybridized onto the microarray slide, where each cDNA molecule representing a
gene will bind to the spot containing its complementary DNA sequence. The microarray slide is then excited
with a laser at suitable wavelengths to detect the red and green dyes. The nal image is stored as a le for further
analysis. Colour gure t: http://www.mrc-lmb.cam.ac.uk/genomes/madanm/microarray/.
-
8/10/2019 An Introduction to Microarray Analysis
3/25
227
Microarray Data Analysis
that from condition B, one would nd the spot to be red. If it was the other way, the spot
would be green. If the gene was expressed to the same extent in both conditions, one would
nd the spot to be yellow, and if the gene was not expressed in both conditions, the spotwould be black. Thus, what is seen at the end of the experimental stage is an image of the
microarray, in which each spot that corresponds to a gene has an associated uorescence
a ue represent ng t e re at ve express on eve o t at gene.
2. OVERVIEW OF IMAGE PROCESSING, TRANSFORMATION AND
NORMALIZATION
2.1 Image processing and analysis
In the previous section, we saw that the relative expression level for each gene (population
of RNA in the two samples) can be stored as an image. The rst step in the analysis of
microarray data is to process this image. Most manufacturers of microarray scanners
provide their own software; however, it is important to understand how data is actually
being extracted from images, as this represents the primary data collection step and forms
the basis of any further analysis.
mage process ng nvo ves t e o ow ng steps:
. enti cation o t e spots an istinguis ing t em rom spurious signa s.
e m croarray s scanne o ow ng y r zat on an a mage e s norma y
enerated. Once image generation is completed, the image is analysed to identify spots.
In the case of microarrays, the spots are arranged in an orderly manner into sub-arraysr pen groups (Figure 1A), which makes spot identication straightforward. Most image
Figure 2. Zooming onto a spot on the microarray slide. The spot area and the background area are depicted by a blue
circle and a white box, respectively. A pixel in the spot area is also shown. Any pixel within the blue circle will be
treated as a signal from the spot. Pixels outside the blue circle but within the white box will be treated as a signal from
the background. One can see that the images are not perfect, as it is often the case, which leads to many problems
with spurious signals from dust particles, scratches, bright arrays, etc. This image was retrieved from Stanford
Microarray Database. Colour gure at: http://www.mrc-lmb.cam.ac.uk/genomes/madanm/microarray/.
-
8/10/2019 An Introduction to Microarray Analysis
4/25
Madan Babu
process ng so tware requ res t e user to spec y approx mate y w ere eac su -array
es an a so a t ona parameters re evant to t e spotte array. s n ormat on s
then used to identify regions that correspond to spots.
2. Determination of the spot area to be surveyed, determination of the local region to
stimate background hybridization.
After identifying regions that correspond to sub-arrays, an area within the sub-array
must be selected to get a measure of the spot signal and an estimate for background
ntensity (Figure 2). There are two methods to dene the spot signal. The rst method
s to use an area o a xe s ze t at s centre on t e centre o mass o t e spot. s
met o as an a vantage t at t s computat ona y ess expens ve, ut a sa vantage
e ng more error-prone n est mat ng spot ntens ty an ac groun ntens ty. n
ternat ve met o s to prec se y e ne t e oun ary or a spot an on y nc u e p xe s
t n t e oun ary. s met o as an a vantage t at t can g ve a etter est mate o
t e spot ntens ty, ut a so as a sa vantage o e ng computat ona y ntens ve antime-consuming.
. Reporting summary statistics and assigning spot intensity after subtracting for
ackground intensity.
Once the spot and background areas have been dened, a variety of summary statistics
or each spot in each channel (red and green channels) are reported. Typically, each
pixel (Figure 2) within the area is taken into account, and the mean, median, and total
a ues or t e ntens ty cons er ng a t e p xe s n t e e ne area are reporte or
ot t e spot an ac groun . ost approac es use t e spot me an va ue, w t t e
ac groun me an va ue su tracte rom t, as t e metr c to represent spot ntens ty.e me an ntens ty s a va ue w ere a t e measure p xe s ave ntens t es greater
t an t s va ue an t e ot er a o t e measure p xe s ave ntens t es ess t an
this value. The background subtracted median value approach has an advantage of
being relatively insensitive to a few pixels with anomalous uorescent values in one
r both channels, but has a disadvantage of being sensitive to misidentication of spot
nd background areas. The other method is to use total intensity values, which has an
dvantage of being insensitive to misidentication of spots (as few more pixels with
ero value in the background will not affect the total intensity), but has a disadvantage
f being prone to be skewed by a few pixels with extreme intensity values.
not er cons erat on n mage process ng s t e num er o p xe s to e nc u e or
measurement n t e spot mage. or many scanners, t e e au t p xe s ze s . s
means t at an average spot o ameter o w ave ~ p xe s. owever, or a
sma er spot ameter, t s etter to use a sma er p xe s ze to ensure enoug p xe s are
samp e . ost scanners now a ow t e p xe s ze o m. ven t oug us ng a sma er
pixel size increases our condence in the measurement, the only disadvantage is that the
image le size tends to be much bigger when compared with image le sizes created using
larger pixel sizes.
2.2 Expression ratios: the primary comparisonWe saw that the relative expression level for a gene can be measured as the amount of red or
reen light emitted after excitation. The most common metric used to relate this information
-
8/10/2019 An Introduction to Microarray Analysis
5/25
229
Microarray Data Analysis
s ca e express on rat o. t s enote ere as kan e ne as:
=k
k
For each gene kon the array, where represents the spot intensity metric for the test
sample and Gk represents the spot intensity metric for the reference sample. As mentioned
bove, the spot intensity metric for each gene can be represented as a total intensity value
or a background subtracted median value. If we choose the median pixel value, then the
median expression ratio for a given spot is:
background
median
spot
median
background
median
spot
medianmedian
GG
RRT
=
where spotmedianR andbackgroundmedianR are the median intensity values for the spot and background
respectively, for the test sample.
2.3 Transformations of the expression ratio
e express on rat o s a re evant way o represent ng express on erences n a very
ntu t ve manner. or examp e, genes t at o not er n t e r express on eve w ave
n express on rat o o . owever, t s representat on may e un e p u w en one as to
represent up-regulation and down-regulation. For example, a gene that is up-regulated by a
factor of 4 has an expression ratio of 4 (R/G = 4G/G = 4). However, for the case where a gene
is down regulated by a factor of 4, the expression ratio becomes 0.25 (R/G = R/4R = 1/4).
Thus up-regulation is blown up and mapped between 1 and innity, whereas down-regulationis compressed and mapped between 0 and 1.
[ ] ,1regulationUp mapped
regulationDown mapped
To eliminate this inconsistency in the mapping interval, one can perform two kinds of
transformations of the expression ratio, namely, inverse transformation and logarithmic
trans ormat on.
Inverse or reciprocal transformation
e nverse or rec proca trans ormat on converts t e express on rat o nto a o -c ange,
w ere or genes w t an express on rat o o ess t an t e rec proca o t e express on
ratio is multiplied by -1. If the expression ratio is 1 then the fold change is equal to theexpression ratio. The advantage of such a transformation is that one can represent up-
regulation and down-regulation with a similar mapping interval.
=
=
=
-
8/10/2019 An Introduction to Microarray Analysis
6/25
Madan Babu
owever, t s met o a so as a pro em n t at t e mapp ng space s scont nuous etween
an + an ence ecomes a pro em n most mat emat ca ana yses ownstream o
this step.
Logarithmic transformation
A better transformation procedure is to take the logarithm base 2 value of the expression
ratio (i.e.log expression ratio)). This has the major advantage that it treats differential
up-regulation and down-regulation equally, and also has a continuous mapping space.
For example, if the expression ratio is 1, then log2 1) equals 0 represents no change in
expression. If the expression ratio is 4, then log2 4) equals +2 and for expression ratio of
og 1 equa s - . us, n t s trans ormat on t e mapp ng space s cont nuous an up-
regu at on an own-regu at on are compara e.
av ng exp a ne t e a vantages o us ng express on rat os as a metr c or gene
express on, t s ou a so e un erstoo t at t ere are sa vantages o us ng express on
rat os or trans ormat ons o t e rat os or ata ana ys s. ven t oug express on rat os canreveal patterns inherent in the data, they remove all information about absolute expression
levels of the genes. For example, genes that have R/G ratios of / nd / will end up
having the same expression ratio of 4, and associated problems will surface when one tries
to reliably identify differentially regulated genes.
2.4 Data normalization
In the last section, it was shown that expression ratios and their transformations is
reasona e measure to etect erent a y expresse genes. owever, w en one
compares t e express on eve s o genes t at s ou not c ange n t e two con t ons say,
ouse eep ng genes , w at one qu te o ten n s s t at an average express on rat o o sucenes ev ates rom . s may e ue to var ous reasons, or examp e, var at on cause y
erent a a e ng e c ency o t e two uorescent yes or erent amounts o start ng
mRNA material in the two samples. Thus, in the case of microarray experiments, as for
ny large-scale experiments, there are many sources of systematic variation that affect
measurements of gene expression levels.
Normalization is a term that is used to describe the process of eliminating such variations
to allow appropriate comparison of data obtained from the two samples. There are many
methods of normalization and discussing each one of them is beyond the scope of this
chapter.
The rst step in a normalization procedure is to choose a gene-set (which consists ofenes or w c express on eve s s ou not c ange un er t e con t ons stu e , t at
s t e express on rat o or a genes n t e gene-set s expecte to e . rom t at set, a
norma ization actor,w c s a num er t at accounts or t e var a ty seen n t e gene-
set, s ca cu ate . t s t en app e to t e ot er genes n t e m croarray exper ment. ne
s ou note t at t e norma zat on proce ure c anges t e ata, an s carr e out on y on
the background corrected values for each spot. Figure 3 shows expression data before and
fter the normalization procedure.
Total intensity normalization
The basic assumption in a total intensity normalization is that the total quantity of RNA forthe two samples is the same. Also assuming that the same number of molecules of RNA
-
8/10/2019 An Introduction to Microarray Analysis
7/25
231
Microarray Data Analysis
from both samples hybridize to the microarray, the total hybridization intensities for the
ene-sets should be equal. So, a normalization factor can be calculated as:
=
==setgene
setgene
N
k
k
N
k
k
total
G
R
N
1
1
The intensities are now rescaled such that totalkk NGG ='
and kk RR ='
. The normalized
expression ratio becomes:
total
k
totalk
k
k
k
k N
T
NG
R
G
R
T === ''
'
Figure 3. Gene expression data before and after the normalization procedure. Note that before normalization the
image had many spots of different intensities, but after normalization only spots that are really different light up.
This image was kindly provided by N. Luscombe Colour gure t: http://www.mrc-lmb.cam.ac.uk/genomes/
madanm/microarray/.
-
8/10/2019 An Introduction to Microarray Analysis
8/25
Madan Babu
c s equ va ent to:
)(log-)(log)(log 22
'
2 totalkk NTT =
This now adjusts the ratio such that the mean ratio for the gene set is equal to 1.
Mean log centring
In this method, the basic assumption is that the mean log2(expression ratio) should be equal
to 0 for the gene-set. In this case, the normalization factor can be calculated as:
setgene
N
k
mlc
N
kG
kR
N
setgene
=
=1
2log
The intensities are now rescaled such that )2(' mlcN
kk GG = and kk RR =' . The normalized
expression ratio becomes:
mlcmlc N
k
N
k
k
k
kk
T
G
R
G
RT
2)2('
'' =
==
Which is equivalent to:
mlck
N
kk NTTTmlc -)(log)(2log-)(log)(log 222
'
2 ==
s a usts t e rat o suc t at t e mean og2 express on rat o or t e gene-set s equa
to .
Other normalization methods include: linear regression, Chens ratio statistics and
Lowess normalization. The next step following the normalization procedure is to lter
low intensity data using specic threshold or relative threshold imposed according to the
background intensity. If the experimental procedure included a replicate, averaging the
alues using the replicate data is the next step to be performed after data ltering. Finally,differentially expressed genes are identied. For an excellent review of normalization
procedures, ltering methods and averaging procedures using replicate data, please refer
to uac en us , an re erences t ere n.
3. ANALYSIS OF GENE EXPRESSION DATA
ne o t e reasons to carry out a m croarray exper ment s to mon tor t e express on eve o
enes at a genome sca e. atterns cou e er ve rom ana ys ng t e c ange n express on
of the genes, and new insights could be gained into the underlying biology. In this section,
basic terminologies, representations of the microarray data and the various methods by
which expression data can be analysed will be introduced.The processed data, after the normalization procedure, can then be represented in the form
of a matrix, often called gene expression matrix (Table 1A). Each row in the matrix corresponds
-
8/10/2019 An Introduction to Microarray Analysis
9/25
233
Microarray Data Analysis
to a particular gene and each column could either correspond to an experimental condition
or a specic time point at which expression of the genes has been measured. The expression
levels for a gene across different experimental conditions are cumulatively called the gene
expression prole, and the expression levels for all genes under an experimental condition are
cumulatively called the sample expression prole. Once we have obtained the gene expressionmatrix (Table 1A), additional levels of annotation can be added either to the gene or to the
sample. For example, the function of the genes can be provided, or the additional details on
the biology of the sample may be provided, such as disease state or normal state.
epen ng on w et er t e annotat on s use or not, ana ys s o gene express on ata
can e c ass e nto two erent types, name y superv se or unsuperv se earn ng. n
t e case o a superv se earn ng, we o use t e annotat on o e t er t e gene or t e samp e,
n create c usters o genes or samp es n or er to ent y patterns t at are c aracter st c
for the cluster. For example, we could separate sample expression proles into disease
state and normal state groups, and then look for patterns that separate the sample prole
of the disease state from the sample prole of the normal state.
In the case of an unsupervised learning, the expression data is analysed to identify
patterns that can group genes or samples into clusters without the use of any form of
Table 1. A: Gene expression matrix that contains rows representing genes and columns representing particular
conditions. Each cell contains a value, given in arbitrary units, that reects the expression level of a gene under
a corresponding condition. B: Condition C4 is used as a reference and all other conditions are normalized with
respect to C4 to obtain expression ratios. C: In this table all expression ratios were converted into the log 2(expression ratio) values. This representation has an advantage of treating up-regulation and down-regulation on
comparable scales. D: Discrete values for the elements in Table 1.C. Genes with log 2 (expression ratio) values
greater than 1 were changed to 1, genes with values less than 1 were changed to 1. Any value between 1 and
1 was changed to 0.
-
8/10/2019 An Introduction to Microarray Analysis
10/25
Madan Babu
nnotat on. or examp e, genes w t s m ar express on pro es can e c ustere toget er
w t out t e use o any annotat on. owever, annotat on n ormat on may e ta en nto
ccount at a later stage to make meaningful biological inferences. Throughout this section,
set of genes or set of experimental conditions that have similar expression proles will be
referred to as a cluster. Thus, a cluster consists of objects with similar expression proles,
where an object may either refer to genes or samples.
3.1 Representation of gene expression data
To make any meaningful comparison or biological analysis, one should know what the
ata n t e gene express on matr x represents. xpress on ata can e represente n ve
erent ways, w c are escr e e ow:
so ute measurement
n t e case o an a so ute measurement, eac ce n t e matr x w represent t e express on
eve o t e gene n a stract un ts. ote t at t s not mean ng u to compare express on eve sof genes across two different conditions in absolute units, because the starting amounts of
mRNA could be different. Table 1A shows a sample gene expression matrix with each cell
containing the expression level in abstract units.
Relative measurement or expression ratio
In the case of a relative measurement or representations involving expression ratio, the
expression level of a gene in abstract units is normalized with respect to its expression in a
re erence con t on. s g ves t e express on rat o o t e gene n re at ve un ts. ote t at
n suc cases, a rat o o 000 00 ea to t e same resu t as40
10. us any n ormat on on
so ute measurement w e ost n suc a representat on, ut now mean ng u compar soncross erent con t ons can e ma e as ong as t e same re erence con t on s use to get
t e express on rat o. s ment one e ore, t s representat on oes not treat up-regu at on
nd down-regulation in a comparable manner. Table 1B shows the gene expression matrix
with each cell representing the expression ratio normalized with respect to a reference
condition.
og expression ratio
In the case of tables representing the log2(expression ratio) values, information on up-
regulation and down-regulation is captured and is mapped in a symmetric manner. For
example, 4-fold up-regulation maps to log2(4) = 2 and a 4-fold down-regulation maps toog = - . us, rom t s ta e t e o -c ange or a erent a y regu ate gene
un er any con t on can e eas y recogn se . a e s ows t e og2 express on rat o
a ues o t e genes un er erent con t ons.
iscrete va ues
Another way of representing information is to convert to discrete numbers the values in
the tables mentioned above. In the case of converting the absolute measurement to discrete
numbers, a binary expression matrix of 1 and 0 can be used, where 1 means that the gene is
expressed above a user dened threshold, and 0 means that the gene is expressed below this
threshold. In the case of making the relative expression tables or log2(expression ratio) tablesdiscrete, values can be divided into 3 classes, +1, 0 and 1, where +1 represents a gene that is
positively regulated, 0 represents a gene that is not differentially regulated and 1 represents
-
8/10/2019 An Introduction to Microarray Analysis
11/25
235
Microarray Data Analysis
gene t at s represse . e process o ma ng t e va ues screte oses a ot o n ormat on,
ut s use u to ana yse express on pro es us ng a gor t ms t at cannot an e rea va ue
expression matrices, for example algorithms calculating mutual information between genes
or samples. Table 1D shows discrete values for the log (expression ratio) table.
Representation of expression proles as vectors
So far we have seen how individual cells in the gene expression matrix can be represented.
Similarly, an expression prole (of a gene or a sample) can be thought of as a vector and
can be represented in vector space. For example, an expression prole of a gene can be
cons ere as a vector n ndimensional space (where n is the number of conditions), and an
express on pro e o a samp e w t genes can e cons ere as a vector n m mens ona
space w ere m s t e num er o genes . n t e examp e g ven e ow, t e gene express on
matr x w t mgenes across ncon t ons s cons ere to e an mx nmatr x, w ere t e
express on va ue or gene i n con t on s enote as xj
=
mnmm
n
n
xxx
xxx
xxx
X
..
........
..
..
21
22221
11211
The expression prole of a geneican be represented as a row vector:
[ ]iniiii xxxxG ..,,,, 321=
The expression prole of a samplejcan be represented as a column vector:
=
mj
j
j
i
x
x
x
G..
2
1
n t e next sect on, we w see ow express on pro es t at are represente as vectors can
e use to compare ow s m ar or erent are t e pa rs o o ects remem er, an o ect
may re er to a gene or a samp e .
3.2 Distance measures
Analysis of gene expression data is primarily based on comparison of gene expression
proles or sample expression proles. In order to compare expression proles, we need a
measure to quantify how similar or dissimilar are the objects that are being considered. A
ariety of distance measures can be used to calculate similarity in expression proles andthese are discussed below.
-
8/10/2019 An Introduction to Microarray Analysis
12/25
Madan Babu
Euclidean distance
uc ean stance s one o t e common stance measures use to ca cu ate s m ar ty
between expression proles. The Euclidean distance between two vectors of dimension 2,
sayA a a and b , b can be calculated as:
2
22
2
11 )()(),( babaBADEuc +=
For instance two genes with expression proles in two conditions = 1,2 and = ,3 ,
the Euclidean distance can be calculated as:
2)32()21(),( 2221 =+=GGDEuc
us or genes w t express on ata ava a e or n con t ons, represente as
a1 an an = 1 n , uc ean stance can e ca cu ate as:
=
=n
i
iiEuc baBAD1
2)(),(
In other words, the Euclidean distance between two genes is the square root of the sum of
t e squares o t e stances etween t e va ues n eac con t on mens on .
more genera orm o t e uc ean stance s ca e t e in ows i istance
ca cu ate as:
p
n
i
p
iiMin baBAD =
=1
)(),(
A special case of the Minkowski distance when p=1 is called rectilinear distance. When
pplied to binary expression proles (i.e.expression levels changed to 1 and 0), it is called
amm ng stance.
Pearson correlation coefcient
ne o t e most common y use metr cs to measure s m ar ty etween express on pro es
s t e earson corre at on coe c ent sen t a . . ven t e express on
ratios for two genes under three conditionsA [a , a , a andB=[b , b b ], PCC can be
computed as follows:
Step1: Compute mean
33
321321 bbbbandaaa
a ++
=++
=
-
8/10/2019 An Introduction to Microarray Analysis
13/25
237
Microarray Data Analysis
tep : ean centre express on pro es
),,(),,( 321321 bbbbbbBandaaaaaaA ==
Step3: Calculate PCC as the cosine of the angle between the mean-centred proles
BA
BAPCC
o=
Where,
=
=n
i
ii bbaaBA1
)()(o
=
=n
i
i aaA1
2)(
and
=
=n
i
i bbB1
2)(
The reason why we mean centre the expression proles is to make sure that we compare
shapes of the expression proles and not their magnitude. Mean centring maintains the
shape of the prole, but it changes the magnitude of the prole as shown in Figure 4.
va ue o essent a y means t at t e two genes ave s m ar express on pro es
n a va ue o means t at t e two genes ave exact y oppos te express on pro es. va ue
o means t at no re at ons p can e n erre etween t e express on pro es o genes. n
rea ty, va ues range rom to + . va ue . suggests t at t e genes e aves m ar y an a va ue - . suggests t at t e genes ave oppos te e av our. e
alue of 0.7 is an arbitrary cut-off, and in real cases this value can be chosen depending on
the dataset used. An example calculation is shown below:
Figure 4. Expression prole before and after mean centring. Note that after mean centring, the relative shapesof the expression proles are still maintained, but the magnitude changes. The graphs do not show the actual
result of PCC analysis.
-
8/10/2019 An Introduction to Microarray Analysis
14/25
Madan Babu
ons er two genes w t express on pro es = an = .
e can e ca cu ate as o ows:
a= 4.8 and b 9.6, the mean centred expression proles become:
A - . - . . . . an B = - . - . - . . .
Therefore,995.0
2.1658.36
6.77=
==
BA
BAPCC
o
Where,
77.6)()()()()( =++++= 9.44.22.41.20.60.23.6-1.87.6-3.8BAo
8.36)()()()()( 22222 =++++= 4.21.20.2-1.8-3.8A
165.29.42.4-0.6-3.6-7.6B =++++= 22222 )()()()()(
Rank correlation coefcient
an corre at on coe c ent s a stance measure t at oes not ta e nto account
the actual magnitude of the expression ratio in each condition, but takes into account the
rank of the expression ratio. For example, consider two genesA= 2, 3, , 15, 8 and
= 2, 7, 15, 25, 13 . When we consider the rank of the values for different conditions forene A, we get the following:
2 (rank = 1) < 3 (rank = 2) < 8 (rank = 3) < 9 (rank = 4) < 15 (rank = 5) which is equivalent
to = , , , .
m ar y, or gene , we get t e ran s or t e va ues or t e erent con t ons as:
ran = < ran = < ran = < ran = < ran = , w c s
equ va ent to = .
Rank correlation coefcient is the PCC calculated on the expression proles converted
into their rank proles. In the above case the two genes have exactly the same rank prole,
thus rank correlation coefcient becomes 1. However, PCC is not applicable when two
alues within a rank prole are repeated. In this case, the rank correlation coefcient can
be directly computed as:
=
=n
i
irank
nn
dBAD
12
2
)1(61),(
Where nis the number of conditions (dimension of the prole) and dis the difference
etween ran s or t e two genes at con t on i. n a vantage o s t at t s not
sens t ve to out ers n t e ata.
-
8/10/2019 An Introduction to Microarray Analysis
15/25
239
Microarray Data Analysis
Mutual information
stance measure to compare genes w ose pro es ave een ma e screte can e
calculated using an entropy notion, called Shannons entropy. This measure gives us a metric
that is indicative of how much information from the expression prole of one gene can
be obtained to predict the behaviour of the other gene.
Consider the discrete expression proles for two genes,A= 1, 1, 0, 1, -1] and = 1
-1, 0, 1, -1 . We know that at any condition, the values that have been made discrete can
be 1, 0 or 1. Thus, the probability for each state to occur in the prole for the two genes
can be computed as follows:
Genes Probability
P(1) P(0) P(-1) P(1)+P(0)+P(-1)
A5
3
(3 occurrencesin 5 conditions)
51
(1 occurrencein 5 conditions)
51
(1 occurrencen 5 conditions)
( )1
5
113=
++
B5
2
(2 occurrences
in 5 conditions)
51
(1 occurrence
in 5 conditions)
52
(2 occurrences
n 5 conditions)
( )1
5
212=
++
rom t s ta e, t e annon s entropy or t e genes can e ca cu ate as:
=
=3
1
2log)(i
ii PPgeneH
Note that iruns from 1 to 3 because there are three possible states (1, 0 and 1).
371.1
51log
51
51log
51
53log
53(1 222 =)++= -H(A)
522.15
2log5
25
1log5
15
2log5
2(1 222 =)++= -H(B)
The next step in our calculation is to consider how often gene A and gene B have the same
state (1, 0, or -1) across given conditions. There are 9 possible pairwise combinations of
states, and they are calculated for our example in the following manner:
P(A,B) Occurrence P(A,B) Occurrence P(A,B) Occurrence
P(1,1) 52
P(0,1) 50
P(-1,1) 50
P(1,0) 50
P(0,0) 51
P(-1,0) 50
P(1,-1) 51
P(0,-1) 50
P(-1,-1) 51
e num er o con t ons n w c ot gene an gene ave t e r va ues equa to
over a con t ons s out o con t ons, an so on.
-
8/10/2019 An Introduction to Microarray Analysis
16/25
Madan Babu
not er parameter we w nee to ca cu ate mutua n ormat on s o nt entropy , :
=
=3
1,
2log),(
ji
ijij PPBAH
w en ot ian j n epen ent y run rom to , correspon ng to t e t ree states , an .
923.1
51log
51
51log
51
51log
51
52log
52(, 2222 =)+++= -1B)H(A
For the above example, the mutual information between the two expression proles, which
provides a measure of the similarity between the two genes can be calculated as:
970.0923.1522.1371.1),()()(),( =+=+= BAHBHAHBAM
n genera , t e g er t e mutua n ormat on score, t e more s m ar are t e two pro es.
owever, prec se state an consequent y, nterpretat on o t e o serve score wou
epen on t e num er o con t ons or w c measurements were ava a e. or our case
o con t ons, t e o ta ne score o . s g . ut rea er s a v se to consu t more
specialised sources for understanding of states associated with mutual information distance
measure (Shannon, 1949). One should note that a distance measure has to be chosen only
fter considering the data to be analysed and that there is no single distance measure that
is appropriate for all types of data.
3.3 Clustering methods
One of the goals of microarray data analysis is to cluster genes or samples with similarexpress on pro es toget er, to ma e mean ng u o og ca n erence a out t e set o genes
or samp es. uster ng s one o t e unsuperv se approac es to c ass y ata nto groups
o genes or samp es w t s m ar patterns t at are c aracter st c to t e group. uster ng
met o s can e erarc ca group ng o ects nto c usters an spec y ng re at ons ps
mong o ects n a c uster, resem ng a p y ogenet c tree or non- erarc ca group ng
Figure 5. An overview of the different clustering methods.
-
8/10/2019 An Introduction to Microarray Analysis
17/25
241
Microarray Data Analysis
nto c usters w t out spec y ng re at ons ps etween o ects n a c uster as sc emat ca y
represente n gure . emem er, an o ect may re er to a gene or a samp e, an a c uster
refers to a set of objects that behave in a similar manner.
Hierarchical clustering
Hierarchical clustering may be agglomerative (starting with the assumption that each
object is a cluster and grouping similar objects into bigger clusters) or divisive (starting
from grouping all objects into one cluster and subsequently breaking the big cluster into
smaller clusters with similar properties). The basic idea behind agglomerative and divisive
hierarchical clustering is shown in Figure 6. There are many different types of clustering
met o s an a ew common y use ones are escr e e ow.
Figure 6. Schematic diagram showing the principle behind agglomerative and divisive clustering. The colour
code represents the log2 (expression ratio), where red represents up-regulation, green represents down-regulation,
and black represents no change in expression. In agglomerative clustering, genes that are similar to each other
are grouped together, and an average expression prole is calculated for the group by using the average linkage
algorithm. This step is performed iteratively until all genes are included into one cluster. In the case of divisive
clustering, the whole set of genes is considered as a single cluster and is broken down iteratively into sub-clusters
with similar expression proles until each cluster contains only one gene. This information can be represented
as a tree, where the terminal nodes represent genes and all branches represent different clusters. The distance
from the branch point provides a measure of the distance between two objects. This image was adapted from
Dopazo et al., (2001). Notice that the ordered matrix at the top is the actual product of either agglomerative ordivisive clustering, and genes A to E are given in the nal order for the simplicity of illustration; initially rows
corresponding to genes A to E could be arranged in any order and it is the task of the methods to arrange them
meaningfully. Colour gure t: http://www.mrc-lmb.cam.ac.uk/genomes/madanm/microarray/.
-
8/10/2019 An Introduction to Microarray Analysis
18/25
Madan Babu
Hierarchical clustering: agglomerative
In the case of a hierarchical agglomerative clustering, the objects are successively fused
until all the objects are included. For a hierarchical agglomerative clustering procedure,
each object is considered as a cluster. The rst step is the calculation of pairwise distance
measures for the objects to be clustered. Based on the pairwise distances between them,
o ects t at are s m ar to eac ot er are groupe nto c usters. ter t s s one, pa rw se
stances etween t e c usters are re-ca cu ate , an c usters t at are s m ar are groupe
toget er n an terat ve manner unt a t e o ects are nc u e nto a s ng e c uster. s
n ormat on can e represente as a en rogram, w ere t e stance rom t e ranc po nt
s n cat ve o t e stance etween t e two c usters or o ects.
Comparison of clusters with another cluster or an object can be carried out using four
pproaches (Figure 7).
Single linkage clustering (Minimum distance)
In single linkage clustering, distance between two clusters is calculated as the minimum
distance between all possible pairs of objects, one from each cluster. This method has
n advantage that it is insensitive to outliers. This method is also known as the nearest
ne g our n age.
omp ete in age c ustering aximum istance
n comp ete n age c uster ng, stance etween two c usters s ca cu ate as t e max mum
stance etween a poss e pa rs o o ects, one rom eac c uster. e sa vantage
of this method is that it is sensitive to outliers. This method is also known as the farthest
neighbour linkage.
Average linkage clusteringIn average linkage clustering, distance between two clusters is calculated as the average of
distances between all possible pairs of objects in the two clusters.
Figure 7. Different algorithms to nd distance between two clusters.
-
8/10/2019 An Introduction to Microarray Analysis
19/25
243
Microarray Data Analysis
Centroid linkage clustering
In centroid linkage clustering, an average expression prole (called a centroid) is calculated
n two steps. rst, t e mean n eac mens on o t e express on pro es s ca cu ate or
o ects n a c uster. en, stance etween t e c usters s measure as t e stance
etween t e average express on pro es o t e two c usters.
n examp e o t e erarc ca agg omerat ve c uster ng us ng s ng e n age c uster ng
s s own n gure .
Hierarchical clustering: divisive
Hierarchical divisive clustering is the opposite of the agglomerative method, where the entire
set of objects is considered as a single cluster and is broken down into two or more clusters
that have similar expression proles. After this is done, each cluster is considered separately
nd the divisive process is repeated iteratively until all objects have been separated into
single objects. The division of objects into clusters on each iterative step may be decided
upon y pr nc pa component ana ys s w c eterm nes a vector t at separates g ven o ects.
s met o s ess popu ar t an agg omerat ve c uster ng, ut as success u y een use
n t e ana ys s o gene express on ata y on t a . .
Non-hierarchical clustering
One of the major criticisms of hierarchical clustering is that there is no compelling evidence
that a hierarchical structure best suits grouping of the expression proles. An alternative to
this method is a non-hierarchical clustering, which requires predetermination of the number
of clusters. Non-hierarchical clustering then groups existing objects into these predened
clusters rather than organizing them into a hierarchical structure.
Figure 8. An example of a hierarchical clustering using single linkage algorithm. Consider ve genes and the
distances between them as shown in the table. In the rst step, genes that are close to each other are grouped
together and the distances are re-calculated using the single linkage algorithm. This procedure is repeated until
all genes are grouped into one cluster. This information can be represented as a tree (shown to the right), where
the distance from the branch point reects the distance between genes or clusters. This image was adapted from
Causton et al.(2003).
-
8/10/2019 An Introduction to Microarray Analysis
20/25
Madan Babu
Non-hierarchical clustering: K-means
K-means is a popular non-hierarchical clustering method (Figure 9A). In K-means
c uster ng, t e rst step s to ar trar y group o ects nto a pre eterm ne num er o
c usters. e num er o c usters can e c osen ran om y or est mate y rst per orm ng
erarc ca c uster ng o t e ata. o ow ng t s step, an average express on pro e
centro s ca cu ate or eac c uster, t s s ca e n t a zat on. ext, n v ua o ects
re reattr ute rom one c uster to t e ot er epen ng on w c centro s c oser to t e
ene (or sample). This procedure of calculating the centroid for each cluster and re-grouping
objects closer to available centroids is performed in an iterative manner for a xed number
of times, or until convergence (state when composition of clusters remains unaltered by
further iterations). Typically, the number of iterations required to obtain stable clusters ranges
from 20,000 to 100,000. However, there is no guarantee that the clusters will converge.
This method has an advantage that it is scalable for large datasets.
Non-hierarchical clustering: Self Organizing Maps
e rgan z ng aps s wor n a manner s m ar to -means c uster ng gure .
n -means c uster ng, one c ooses t e num er o c usters to t t e ata, w ereas w t
t e rst step s to c oose t e num er an or entat on o t e c usters w t respect to eac
ot er. or examp e, a two- mens ona gr o no es w c may en up e ng c usters
Figure 9. A: The principle behind K-means clustering. Objects are grouped into a predened number of clusters
during the initialization step. Centroid for each cluster is calculated, and objects are re-grouped depending on how
close they are to available centroids. This step is performed iteratively until convergence or is performed for a
xed number of iterations to get nal clusters of objects. B: The principle behind SOMs. During the initialization
step, a grid of nodes is projected onto the expression space and each gene is assigned its closest node. Following
this step, one gene is chosen at random and the assigned node is moved towards it. The other nodes are moved
towards this gene depending on how close they are to the selected gene. This step is performed iteratively untilconvergence or is performed for a xed number of iterations to get a nal map of nodes.
-
8/10/2019 An Introduction to Microarray Analysis
21/25
245
Microarray Data Analysis
cou e t e start ng po nt. e gr s pro ecte onto t e express on space, an eac o ect
s ass gne a no e t at s nearest to t t s s ca e n t a zat on. n t e next step, a ran om
object is chosen and the node (called a reference vector) which is in the neighbourhood
of the object is moved closer to it. The other nodes are moved to a small extent depending
on how close they are to the object chosen. In successive iterations, with randomly chosen
objects, the positions of the nodes are rened and the radius of neighbourhood becomes
conned. In this way, the grid of nodes (initially a two-dimensional grid) is deformed to
t the data. The advantage of this method, unlike K-means, is that SOM does not force the
number of clusters to be equal to the number of starting nodes in the chosen grid. This is
because some nodes may have no objects associated with them when the map is complete.
t er a vantages o nc u e prov ng n ormat on on t e s m ar ty etween t e
no es, an t e a ty o to pro uce re a e resu ts even w t no sy ata.
4. RELATING EXPRESSION DATA TO OTHER BIOLOGICAL INFORMATION
ene express on pro es can e n e to externa n ormat on to ga n ns g t nto o og caprocesses and to make new discoveries. Some of the possible questions that can be addressed
fter analysing gene expression data will be discussed in this section.
4.1 Predicting binding sites
It is reasonable to assume that genes with similar expression proles are regulated by the
same set of transcription factors. If this happens to be the case, then genes that have similar
expression proles should have similar transcription factor binding sites upstream of the
co ng sequence n t e . ar ous researc groups ave exp o te t s assumpt on.
razma t a . an ot ers ussema er et a ., ; on on t a ., ave stu e
t e occurrence o sequence patterns an scovere putat ve n ng s tes n t e promoterreg ons o genes t at are co-expresse . e steps nvo ve n suc stu es are t e o ow ng:
n a set o genes t at ave s m ar express on pro es. xtract promoter sequences
of the co-expressed genes. (3) Identify statistically over-represented sequence patterns. (4)
Assess quality of the discovered pattern using statistical signicance criteria.
4.2 Predicting protein interactions and protein functions
Integrating expression data with other external information, for example evolutionary
conservation of proteins, have been used to predict interacting proteins, protein complexes,
nd protein function. Work by Ge t al.(2001) and Jansen and Gerstein (2000) have shown
that genes with similar expression proles are more likely to encode proteins that interact.en t s n ormat on s com ne w t evo ut onary conservat on o prote ns, mean ng u
pre ct ons can e ma e. n a recent wor y e c mann an a an a u t was
s own t at prote ns t at are evo ut onar y conserve n yeast an worm an t at ave
s m ar express on pro es n ot organ sms ten to e a part o t e same sta e comp ex
or nteract p ys ca y. oort et a . ave a so s own t at t e enco e prote ns o
conserved, co-expressed gene pairs are highly likely to be part of the same pathway. Such
studies enable us to predict specic gene functions. The steps involved in such studies are
the following: (1) Identify co-expressed genes in the two studied organisms. (2) Identify
conserved (orthologous) proteins. (3) Find instances where conserved (orthologous) proteins
re co-expressed in both organisms. (4) Map information on protein interaction or metabolicpathway available for one organism to predict interacting proteins or function of the proteins
n t e ot er organ sm.
-
8/10/2019 An Introduction to Microarray Analysis
22/25
Madan Babu
4.3 Predicting functionally conserved modules
Genes that have similar expression proles often have related functions. Instead of studying
co-expresse pa rs o genes, one can v ew sets o co-expresse genes t at are nown to
nteract as a unct ona mo u e nvo ve n a part cu ar o og ca process a an a u t a .,
. s n ormat on, w en ntegrate w t t e evo ut onary conservat on o prote ns n
more t an two organ sms, prov es now e ge o t e s gn cance o t e unct ona mo u es
t at ave een conserve n evo ut on. tuart t a . ave a resse t s ssue n great
detail and have identied one evolutionarily conserved functional module which belongs
to an as yet unknown biological process. For other modules that are known to be involved
in previously well-studied biological process, new module members were discovered by
Stuart t al. (2003), providing clues about unknown candidates involved in the processes.
The steps involved in such studies are similar to those discussed in the previous section.
Instead of two organisms, one has to consider three or more organisms, and should also
ddress other issues related to identifying orthologous proteins.
4.4 Reverse-engineering of gene regulatory networks
ene express on ata can a so e use to n er regu atory re at ons ps. s approac s
nown as reverse eng neer ng o regu atory networ s. esearc y ega et a . an
ar ner t a . c ear y g g ts t at we are now n a goo pos t on to use express on
ata to ma e pre ct ons a out t e transcr pt ona regu ators or a g ven gene or sets o
enes. Segal et al.(2003) have developed a probabilistic model to identify modules of co-
regulated genes, their transcriptional regulators and conditions that inuence regulation.
This new knowledge allowed them to generate further hypotheses, which are experimentally
testable. Gardner t al.(2003) described a method to infer regulatory relationships, called
NIR (Network Identication by multiple Regression), which uses non-linear differentialequations to model regulatory networks. In this method, a model of connections between
enes in a network is inferred from measurements of system dynamics ( i.e.response of
enes an prote ns to pertur at ons .
5. WEBSITE REFERENCES, ACADEMIC SOFTWARE AND WEB
SUPPLEMENT
5.1 Website references
Some websites that provide a reference to various aspects of microarrays are given
below:
Portals:
http://ihome.cuhk.edu.hk/%7Eb400559/array.html
A comprehensive web portal on microarrays.
http://www.bioinformatics.vg/biolinks/bioinformatics/Microarrays.shtml
A web-portal on microarrays.
http://www.hgmp.mrc.ac.uk/GenomeWeb/nuc-genexp.html
A collection of gene expression and microarray links at the HGMP (Human Genome
Mapping Project).
-
8/10/2019 An Introduction to Microarray Analysis
23/25
247
Microarray Data Analysis
Tutorials:
http://www.ucl.ac.uk/oncology/MicroCore/HTML_resource/tut_frameset.htm
A website that provides a tutorial on the various aspects of microarray data analysis
iscusse in t is c apter.
Software links:
http://genome-www5.stanford.edu/restech.html
This website provides a list of software available for microarray data analysis with a brief
escription o w at t e so tware oes an t e p at orm on w ic it can e run.
Microarray databases:
http://genome-www5.stanford.edu/
Stanford Microarray Database (SMD) contains raw and normalized data from microarray
experiments as we as t eir image es. a so provi es inter aces or ata retrieva ,
ana ysis an visua isation.
http://www.ebi.ac.uk/arrayexpress/
ArrayExpress public repository for microarray data at the EMBL-EBI.
http://info.med.yale.edu/microarray/
Yale Microarray Database (YMD)
http://www.ncbi.nlm.nih.gov/geo/
Gene Expression Omnibus at the NCBI, NIH.
Table 2. List of software available for academic use.
-
8/10/2019 An Introduction to Microarray Analysis
24/25
Madan Babu
5.2 Software available for non-commercial use
The list of software provided here is by no means exhaustive. The readers are urged to
isit the reference websites provided above to get a more comprehensive list of available
programs a e .
5.3 Supplementary material on the web
e we -supp ement s ava a e at:
http://www.mrc-lmb.cam.ac.uk/genomes/madanm/microarray/
It has PERL scripts to calculate the following statistics:
. Euclidean distance.
2. Pearson correlation coefcient.
. Rank correlation coefcient.
xpress on atasets or t e yeast genome rom o t a . an pe man et a .
re a so prov e .
Acknowledgements
wou e to t an ar e as au, an a an t e ot ers n t e group or rea ng
the chapter. I would also like to acknowledge the Medical Research Council, Cambridge
Commonwealth Trust and Trinity College, Cambridge for nancial support.
References
Alon, U., Barkai, N., Notterman, D.A., Gish, K., Ybarra, S., Mack, D., and Levine, A.J. (1999). Broad patterns ofgene expression revealed by clustering analysis of tumor and normal colon tissues probed by oligonucleotide
arrays. Proc. Natl. Acad. Sci. USA96, 6745-6750.
Brazma, A., Jonassen, I., Vilo, J., and Ukkonen, E. (1998). Predicting gene regulatory elements in silico on a
genomic scale. Genome Res.8, 1202-1215.
Bussemaker, H. J., Li, H., and Siggia, E. D. (2001). Regulatory element detection using correlation with expression.
at. enet. , 167-171.
Causton, H., Quackenbush, J., and Brazma, A. (2003). Microarray Gene Expression Data Analysis: A Beginners
Guide. (UK: Blackwell Science)
Cho, R. J., Campbell, M. J., Winzeler, E. A., Steinmetz, L., Conway, A., Wodicka, L., Wolfsberg, T. G., Gabrielian,
A. E., Landsman, D., Lockhart, D. J., and Davis, R. W. (1998). A genome-wide transcriptional analysis of the
mitotic cell cycle. Mol. Cell.2, 65-73.
Conlon, E. M., Liu, X. S., Lieb, J. D., and Liu, J. S. (2003). Integrating regulatory motif discovery and genome-
wide expression analysis. Proc. Natl. Acad. Sci. USA100, 3339-3344.Dopazo, J., Zanders, E., Dragoni, I., Amphlett, G., and Falciani, F. (2001). Methods and approaches in the analysis
of gene expression data. J. Immunol. Meth.250, 93-112.
Eisen, M. B., Spellman, P. T., Brown, P. O., and Botstein, D. (1998). Cluster analysis and display of genome-wide
expression patterns. Proc. Natl. Acad. Sci. USA95, 14863-14868.
Gardner, T. S., di Bernardo, D., Lorenz, D., and Collins, J. J. (2003). Inferring genetic networks and identifying
compound mode of action via expression proling. Science301, 102-105.
Ge, H., Liu, Z., Church, G. M., and Vidal, M. (2001). Correlation between transcriptome and interactome mapping
data from Saccharomyces cerevisiae. Nat. Genet.29, 482-486.
Jansen, R., and Gerstein, M. (2000). Analysis of the yeast transcriptome with structural and functional categories:
characterizing highly expressed proteins. Nucl. Acids Res.28, 1481-1488.
Luscombe, N. M., Royce, T. E., Bertone, P., Echols, N., Horak, C. E., Chang, J. T., Snyder, M., and Gerstein,
M. (2003). ExpressYourself: A modular platform for processing and visualizing microarray data. Nucl. Acids
Res.31, 3477-3482.
Madan Babu, M., Luscombe, N., Aravind, L., Gerstein, M., Teichmann, S.A. (2004). Structure and evolution of
transcriptional regulatory networks. Curr. Opin. Struct. Biol. 14, in press.
-
8/10/2019 An Introduction to Microarray Analysis
25/25
Microarray Data Analysis
Quackenbush, J. (2002). Microarray data normalization and transformation. Nat. Genet.32, 496-501.
Segal, E., Shapira, M., Regev, A., Pe'er, D., Botstein, D., Koller, D., and Friedman, N. (2003). Module networks:
identifying regulatory modules and their condition-specic regulators from gene expression data. Nat. Genet.
34, 166-176.
Shannon, C. (1949). A mathematical theory of communication. Bell. Sys. Tech. J. 17, 379-423; 623-656.
Spellman, P. T., Sherlock, G., Zhang, M. Q., Iyer, V. R., Anders, K., Eisen, M. B., Brown, P. O., Botstein, D., and
Futcher, B. (1998). Comprehensive identication of cell cycle-regulated genes of the yeast Saccharomyces
cerevisiae by microarray hybridization. Mol. Biol. Cell9, 3273-3297.
Stuart, J. M., Segal, E., Koller, D., and Kim, S. K. (2003). A gene-coexpression network for global discovery of
conserved genetic modules. Science302, 249-255.
Teichmann, S. A., and Madan Babu, M. (2002). Conservation of gene co-regulation in prokaryotes and eukaryotes.
Trends Biotechnol.20, 407-410; discussion 410.
van Noort, V., Snel, B., and Huynen, M. A. (2003). Predicting gene function by conserved co-expression. Trends
Genet.19, 238-242.