an introduction to microarray analysis

Upload: saurav-sarkar

Post on 02-Jun-2018

224 views

Category:

Documents


0 download

TRANSCRIPT

  • 8/10/2019 An Introduction to Microarray Analysis

    1/25

    225

    Microarray Data Analysis

    ap er

    An Introduction toMicroarray Data Analysis

    . a an a u

    Abstract

    This chapter aims to provide an introduction to the analysis of gene expression data obtained

    using microarray experiments. It has been divided into four sections. The rst section

    provides basic concepts on the working of microarrays and describes the basic principlesbehind a microarray experiment. The second section deals with the representation and

    extract on o n ormat on rom mages o ta ne rom m croarray exper ments. e t r

    sect on a resses erent met o s or compar ng express on pro es o genes an a so

    prov es an overv ew o erent met o s or c uster ng genes w t s m ar express on

    pro es. e ast sect on ocuses on re at ng gene express on ata w t ot er o og ca

    information; it will provide the readers with a feel for the kind of biological discoveries

    one can make by integrating gene expression data with external information.

    1. INTRODUCTION

    Functional genomics involves the analysis of large datasets of information derived fromarious biological experiments. One such type of large-scale experiment involves monitoring

    the expression levels of thousands of genes simultaneously under a particular condition,

    called gene expression analysis. Microarray technology makes this possible and the quantity

    o ata generate rom eac exper ment s enormous, war ng t e amount o ata generate

    y genome sequenc ng pro ects. s c apter s a r e overv ew o t e as c concepts

    nvo ve n a m croarray exper ment; t g ves a ee ng or w at t e ata actua y represents,

    n w prov e n ormat on on t e var ous computat ona met o s t at one can emp oy

    to er ve mean ng u resu ts rom suc exper ments.

    1.1 What are microarrays and how do they work?

    Microarray technology has become one of the indispensable tools that many biologists use

    to monitor genome wide expression levels of genes in a given organism. A microarray is

    typically a glass slide on to which DNA molecules are xed in an orderly manner at specic

    locations called spots (or features). A microarray may contain thousands of spots and each

    spot may conta n a ew m on cop es o ent ca mo ecu es t at un que y correspon

    to a gene gure . e n a spot may e t er e genom c or s ort stretc o

    o go-nuc eot e stran s t at correspon to a gene. e spots are pr nte on to t e g ass

    s e y a ro ot or are synt es se y t e process o p oto t ograp y.

    croarrays may e use to measure gene express on n many ways, ut one o t emost popular applications is to compare expression of a set of genes from a cell maintained

    in a particular condition (condition A) to the same set of genes from a reference cell

  • 8/10/2019 An Introduction to Microarray Analysis

    2/25

    Madan Babu

    ma nta ne un er norma con t ons con t on . gure g ves a genera p cture o t e

    exper menta steps nvo ve . rst, s extracte rom t e ce s. ext, mo ecu es

    n t e extract are reverse transcr e nto c y us ng an enzyme reverse transcr ptasend nucleotides labelled with different uorescent dyes. For example, cDNA from cells

    rown in condition A may be labelled with a red dye and from cells grown in condition B

    with a green dye. Once the samples have been differentially labelled, they are allowed to

    hybridize onto the same glass slide. At this point, any cDNA sequence in the sample will

    hybridize to specic spots on the glass slide containing its complementary sequence. The

    mount of cDNA bound to a spot will be directly proportional to the initial number of RNA

    molecules present for that gene in both samples.

    o ow ng t e y r zat on step, t e spots n t e y r ze m croarray are exc te y

    aser an scanne at su ta e wave engt s to etect t e re an green yes. e amount

    o uorescence em tte upon exc tat on correspon s to t e amount o oun nuc e c ac .or nstance, c rom con t on or a part cu ar gene was n greater a un ance t an

    Figure 1. (A) A microarray may contain thousands of spots. Each spot contains many copies of the same DNAsequence that uniquely represents a gene from an organism. Spots are arranged in an orderly fashion into Pen-

    groups. (B) Schematic of the experimental protocol to study differential expression of genes. The organism is

    grown in two different conditions (a reference condition and a test condition). RNA is extracted from the two

    cells, and is labelled with different dyes (red and green) during the synthesis of cDNA by reverse transcriptase.

    Following this step, cDNA is hybridized onto the microarray slide, where each cDNA molecule representing a

    gene will bind to the spot containing its complementary DNA sequence. The microarray slide is then excited

    with a laser at suitable wavelengths to detect the red and green dyes. The nal image is stored as a le for further

    analysis. Colour gure t: http://www.mrc-lmb.cam.ac.uk/genomes/madanm/microarray/.

  • 8/10/2019 An Introduction to Microarray Analysis

    3/25

    227

    Microarray Data Analysis

    that from condition B, one would nd the spot to be red. If it was the other way, the spot

    would be green. If the gene was expressed to the same extent in both conditions, one would

    nd the spot to be yellow, and if the gene was not expressed in both conditions, the spotwould be black. Thus, what is seen at the end of the experimental stage is an image of the

    microarray, in which each spot that corresponds to a gene has an associated uorescence

    a ue represent ng t e re at ve express on eve o t at gene.

    2. OVERVIEW OF IMAGE PROCESSING, TRANSFORMATION AND

    NORMALIZATION

    2.1 Image processing and analysis

    In the previous section, we saw that the relative expression level for each gene (population

    of RNA in the two samples) can be stored as an image. The rst step in the analysis of

    microarray data is to process this image. Most manufacturers of microarray scanners

    provide their own software; however, it is important to understand how data is actually

    being extracted from images, as this represents the primary data collection step and forms

    the basis of any further analysis.

    mage process ng nvo ves t e o ow ng steps:

    . enti cation o t e spots an istinguis ing t em rom spurious signa s.

    e m croarray s scanne o ow ng y r zat on an a mage e s norma y

    enerated. Once image generation is completed, the image is analysed to identify spots.

    In the case of microarrays, the spots are arranged in an orderly manner into sub-arraysr pen groups (Figure 1A), which makes spot identication straightforward. Most image

    Figure 2. Zooming onto a spot on the microarray slide. The spot area and the background area are depicted by a blue

    circle and a white box, respectively. A pixel in the spot area is also shown. Any pixel within the blue circle will be

    treated as a signal from the spot. Pixels outside the blue circle but within the white box will be treated as a signal from

    the background. One can see that the images are not perfect, as it is often the case, which leads to many problems

    with spurious signals from dust particles, scratches, bright arrays, etc. This image was retrieved from Stanford

    Microarray Database. Colour gure at: http://www.mrc-lmb.cam.ac.uk/genomes/madanm/microarray/.

  • 8/10/2019 An Introduction to Microarray Analysis

    4/25

    Madan Babu

    process ng so tware requ res t e user to spec y approx mate y w ere eac su -array

    es an a so a t ona parameters re evant to t e spotte array. s n ormat on s

    then used to identify regions that correspond to spots.

    2. Determination of the spot area to be surveyed, determination of the local region to

    stimate background hybridization.

    After identifying regions that correspond to sub-arrays, an area within the sub-array

    must be selected to get a measure of the spot signal and an estimate for background

    ntensity (Figure 2). There are two methods to dene the spot signal. The rst method

    s to use an area o a xe s ze t at s centre on t e centre o mass o t e spot. s

    met o as an a vantage t at t s computat ona y ess expens ve, ut a sa vantage

    e ng more error-prone n est mat ng spot ntens ty an ac groun ntens ty. n

    ternat ve met o s to prec se y e ne t e oun ary or a spot an on y nc u e p xe s

    t n t e oun ary. s met o as an a vantage t at t can g ve a etter est mate o

    t e spot ntens ty, ut a so as a sa vantage o e ng computat ona y ntens ve antime-consuming.

    . Reporting summary statistics and assigning spot intensity after subtracting for

    ackground intensity.

    Once the spot and background areas have been dened, a variety of summary statistics

    or each spot in each channel (red and green channels) are reported. Typically, each

    pixel (Figure 2) within the area is taken into account, and the mean, median, and total

    a ues or t e ntens ty cons er ng a t e p xe s n t e e ne area are reporte or

    ot t e spot an ac groun . ost approac es use t e spot me an va ue, w t t e

    ac groun me an va ue su tracte rom t, as t e metr c to represent spot ntens ty.e me an ntens ty s a va ue w ere a t e measure p xe s ave ntens t es greater

    t an t s va ue an t e ot er a o t e measure p xe s ave ntens t es ess t an

    this value. The background subtracted median value approach has an advantage of

    being relatively insensitive to a few pixels with anomalous uorescent values in one

    r both channels, but has a disadvantage of being sensitive to misidentication of spot

    nd background areas. The other method is to use total intensity values, which has an

    dvantage of being insensitive to misidentication of spots (as few more pixels with

    ero value in the background will not affect the total intensity), but has a disadvantage

    f being prone to be skewed by a few pixels with extreme intensity values.

    not er cons erat on n mage process ng s t e num er o p xe s to e nc u e or

    measurement n t e spot mage. or many scanners, t e e au t p xe s ze s . s

    means t at an average spot o ameter o w ave ~ p xe s. owever, or a

    sma er spot ameter, t s etter to use a sma er p xe s ze to ensure enoug p xe s are

    samp e . ost scanners now a ow t e p xe s ze o m. ven t oug us ng a sma er

    pixel size increases our condence in the measurement, the only disadvantage is that the

    image le size tends to be much bigger when compared with image le sizes created using

    larger pixel sizes.

    2.2 Expression ratios: the primary comparisonWe saw that the relative expression level for a gene can be measured as the amount of red or

    reen light emitted after excitation. The most common metric used to relate this information

  • 8/10/2019 An Introduction to Microarray Analysis

    5/25

    229

    Microarray Data Analysis

    s ca e express on rat o. t s enote ere as kan e ne as:

    =k

    k

    For each gene kon the array, where represents the spot intensity metric for the test

    sample and Gk represents the spot intensity metric for the reference sample. As mentioned

    bove, the spot intensity metric for each gene can be represented as a total intensity value

    or a background subtracted median value. If we choose the median pixel value, then the

    median expression ratio for a given spot is:

    background

    median

    spot

    median

    background

    median

    spot

    medianmedian

    GG

    RRT

    =

    where spotmedianR andbackgroundmedianR are the median intensity values for the spot and background

    respectively, for the test sample.

    2.3 Transformations of the expression ratio

    e express on rat o s a re evant way o represent ng express on erences n a very

    ntu t ve manner. or examp e, genes t at o not er n t e r express on eve w ave

    n express on rat o o . owever, t s representat on may e un e p u w en one as to

    represent up-regulation and down-regulation. For example, a gene that is up-regulated by a

    factor of 4 has an expression ratio of 4 (R/G = 4G/G = 4). However, for the case where a gene

    is down regulated by a factor of 4, the expression ratio becomes 0.25 (R/G = R/4R = 1/4).

    Thus up-regulation is blown up and mapped between 1 and innity, whereas down-regulationis compressed and mapped between 0 and 1.

    [ ] ,1regulationUp mapped

    regulationDown mapped

    To eliminate this inconsistency in the mapping interval, one can perform two kinds of

    transformations of the expression ratio, namely, inverse transformation and logarithmic

    trans ormat on.

    Inverse or reciprocal transformation

    e nverse or rec proca trans ormat on converts t e express on rat o nto a o -c ange,

    w ere or genes w t an express on rat o o ess t an t e rec proca o t e express on

    ratio is multiplied by -1. If the expression ratio is 1 then the fold change is equal to theexpression ratio. The advantage of such a transformation is that one can represent up-

    regulation and down-regulation with a similar mapping interval.

    =

    =

    =

  • 8/10/2019 An Introduction to Microarray Analysis

    6/25

    Madan Babu

    owever, t s met o a so as a pro em n t at t e mapp ng space s scont nuous etween

    an + an ence ecomes a pro em n most mat emat ca ana yses ownstream o

    this step.

    Logarithmic transformation

    A better transformation procedure is to take the logarithm base 2 value of the expression

    ratio (i.e.log expression ratio)). This has the major advantage that it treats differential

    up-regulation and down-regulation equally, and also has a continuous mapping space.

    For example, if the expression ratio is 1, then log2 1) equals 0 represents no change in

    expression. If the expression ratio is 4, then log2 4) equals +2 and for expression ratio of

    og 1 equa s - . us, n t s trans ormat on t e mapp ng space s cont nuous an up-

    regu at on an own-regu at on are compara e.

    av ng exp a ne t e a vantages o us ng express on rat os as a metr c or gene

    express on, t s ou a so e un erstoo t at t ere are sa vantages o us ng express on

    rat os or trans ormat ons o t e rat os or ata ana ys s. ven t oug express on rat os canreveal patterns inherent in the data, they remove all information about absolute expression

    levels of the genes. For example, genes that have R/G ratios of / nd / will end up

    having the same expression ratio of 4, and associated problems will surface when one tries

    to reliably identify differentially regulated genes.

    2.4 Data normalization

    In the last section, it was shown that expression ratios and their transformations is

    reasona e measure to etect erent a y expresse genes. owever, w en one

    compares t e express on eve s o genes t at s ou not c ange n t e two con t ons say,

    ouse eep ng genes , w at one qu te o ten n s s t at an average express on rat o o sucenes ev ates rom . s may e ue to var ous reasons, or examp e, var at on cause y

    erent a a e ng e c ency o t e two uorescent yes or erent amounts o start ng

    mRNA material in the two samples. Thus, in the case of microarray experiments, as for

    ny large-scale experiments, there are many sources of systematic variation that affect

    measurements of gene expression levels.

    Normalization is a term that is used to describe the process of eliminating such variations

    to allow appropriate comparison of data obtained from the two samples. There are many

    methods of normalization and discussing each one of them is beyond the scope of this

    chapter.

    The rst step in a normalization procedure is to choose a gene-set (which consists ofenes or w c express on eve s s ou not c ange un er t e con t ons stu e , t at

    s t e express on rat o or a genes n t e gene-set s expecte to e . rom t at set, a

    norma ization actor,w c s a num er t at accounts or t e var a ty seen n t e gene-

    set, s ca cu ate . t s t en app e to t e ot er genes n t e m croarray exper ment. ne

    s ou note t at t e norma zat on proce ure c anges t e ata, an s carr e out on y on

    the background corrected values for each spot. Figure 3 shows expression data before and

    fter the normalization procedure.

    Total intensity normalization

    The basic assumption in a total intensity normalization is that the total quantity of RNA forthe two samples is the same. Also assuming that the same number of molecules of RNA

  • 8/10/2019 An Introduction to Microarray Analysis

    7/25

    231

    Microarray Data Analysis

    from both samples hybridize to the microarray, the total hybridization intensities for the

    ene-sets should be equal. So, a normalization factor can be calculated as:

    =

    ==setgene

    setgene

    N

    k

    k

    N

    k

    k

    total

    G

    R

    N

    1

    1

    The intensities are now rescaled such that totalkk NGG ='

    and kk RR ='

    . The normalized

    expression ratio becomes:

    total

    k

    totalk

    k

    k

    k

    k N

    T

    NG

    R

    G

    R

    T === ''

    '

    Figure 3. Gene expression data before and after the normalization procedure. Note that before normalization the

    image had many spots of different intensities, but after normalization only spots that are really different light up.

    This image was kindly provided by N. Luscombe Colour gure t: http://www.mrc-lmb.cam.ac.uk/genomes/

    madanm/microarray/.

  • 8/10/2019 An Introduction to Microarray Analysis

    8/25

    Madan Babu

    c s equ va ent to:

    )(log-)(log)(log 22

    '

    2 totalkk NTT =

    This now adjusts the ratio such that the mean ratio for the gene set is equal to 1.

    Mean log centring

    In this method, the basic assumption is that the mean log2(expression ratio) should be equal

    to 0 for the gene-set. In this case, the normalization factor can be calculated as:

    setgene

    N

    k

    mlc

    N

    kG

    kR

    N

    setgene

    =

    =1

    2log

    The intensities are now rescaled such that )2(' mlcN

    kk GG = and kk RR =' . The normalized

    expression ratio becomes:

    mlcmlc N

    k

    N

    k

    k

    k

    kk

    T

    G

    R

    G

    RT

    2)2('

    '' =

    ==

    Which is equivalent to:

    mlck

    N

    kk NTTTmlc -)(log)(2log-)(log)(log 222

    '

    2 ==

    s a usts t e rat o suc t at t e mean og2 express on rat o or t e gene-set s equa

    to .

    Other normalization methods include: linear regression, Chens ratio statistics and

    Lowess normalization. The next step following the normalization procedure is to lter

    low intensity data using specic threshold or relative threshold imposed according to the

    background intensity. If the experimental procedure included a replicate, averaging the

    alues using the replicate data is the next step to be performed after data ltering. Finally,differentially expressed genes are identied. For an excellent review of normalization

    procedures, ltering methods and averaging procedures using replicate data, please refer

    to uac en us , an re erences t ere n.

    3. ANALYSIS OF GENE EXPRESSION DATA

    ne o t e reasons to carry out a m croarray exper ment s to mon tor t e express on eve o

    enes at a genome sca e. atterns cou e er ve rom ana ys ng t e c ange n express on

    of the genes, and new insights could be gained into the underlying biology. In this section,

    basic terminologies, representations of the microarray data and the various methods by

    which expression data can be analysed will be introduced.The processed data, after the normalization procedure, can then be represented in the form

    of a matrix, often called gene expression matrix (Table 1A). Each row in the matrix corresponds

  • 8/10/2019 An Introduction to Microarray Analysis

    9/25

    233

    Microarray Data Analysis

    to a particular gene and each column could either correspond to an experimental condition

    or a specic time point at which expression of the genes has been measured. The expression

    levels for a gene across different experimental conditions are cumulatively called the gene

    expression prole, and the expression levels for all genes under an experimental condition are

    cumulatively called the sample expression prole. Once we have obtained the gene expressionmatrix (Table 1A), additional levels of annotation can be added either to the gene or to the

    sample. For example, the function of the genes can be provided, or the additional details on

    the biology of the sample may be provided, such as disease state or normal state.

    epen ng on w et er t e annotat on s use or not, ana ys s o gene express on ata

    can e c ass e nto two erent types, name y superv se or unsuperv se earn ng. n

    t e case o a superv se earn ng, we o use t e annotat on o e t er t e gene or t e samp e,

    n create c usters o genes or samp es n or er to ent y patterns t at are c aracter st c

    for the cluster. For example, we could separate sample expression proles into disease

    state and normal state groups, and then look for patterns that separate the sample prole

    of the disease state from the sample prole of the normal state.

    In the case of an unsupervised learning, the expression data is analysed to identify

    patterns that can group genes or samples into clusters without the use of any form of

    Table 1. A: Gene expression matrix that contains rows representing genes and columns representing particular

    conditions. Each cell contains a value, given in arbitrary units, that reects the expression level of a gene under

    a corresponding condition. B: Condition C4 is used as a reference and all other conditions are normalized with

    respect to C4 to obtain expression ratios. C: In this table all expression ratios were converted into the log 2(expression ratio) values. This representation has an advantage of treating up-regulation and down-regulation on

    comparable scales. D: Discrete values for the elements in Table 1.C. Genes with log 2 (expression ratio) values

    greater than 1 were changed to 1, genes with values less than 1 were changed to 1. Any value between 1 and

    1 was changed to 0.

  • 8/10/2019 An Introduction to Microarray Analysis

    10/25

    Madan Babu

    nnotat on. or examp e, genes w t s m ar express on pro es can e c ustere toget er

    w t out t e use o any annotat on. owever, annotat on n ormat on may e ta en nto

    ccount at a later stage to make meaningful biological inferences. Throughout this section,

    set of genes or set of experimental conditions that have similar expression proles will be

    referred to as a cluster. Thus, a cluster consists of objects with similar expression proles,

    where an object may either refer to genes or samples.

    3.1 Representation of gene expression data

    To make any meaningful comparison or biological analysis, one should know what the

    ata n t e gene express on matr x represents. xpress on ata can e represente n ve

    erent ways, w c are escr e e ow:

    so ute measurement

    n t e case o an a so ute measurement, eac ce n t e matr x w represent t e express on

    eve o t e gene n a stract un ts. ote t at t s not mean ng u to compare express on eve sof genes across two different conditions in absolute units, because the starting amounts of

    mRNA could be different. Table 1A shows a sample gene expression matrix with each cell

    containing the expression level in abstract units.

    Relative measurement or expression ratio

    In the case of a relative measurement or representations involving expression ratio, the

    expression level of a gene in abstract units is normalized with respect to its expression in a

    re erence con t on. s g ves t e express on rat o o t e gene n re at ve un ts. ote t at

    n suc cases, a rat o o 000 00 ea to t e same resu t as40

    10. us any n ormat on on

    so ute measurement w e ost n suc a representat on, ut now mean ng u compar soncross erent con t ons can e ma e as ong as t e same re erence con t on s use to get

    t e express on rat o. s ment one e ore, t s representat on oes not treat up-regu at on

    nd down-regulation in a comparable manner. Table 1B shows the gene expression matrix

    with each cell representing the expression ratio normalized with respect to a reference

    condition.

    og expression ratio

    In the case of tables representing the log2(expression ratio) values, information on up-

    regulation and down-regulation is captured and is mapped in a symmetric manner. For

    example, 4-fold up-regulation maps to log2(4) = 2 and a 4-fold down-regulation maps toog = - . us, rom t s ta e t e o -c ange or a erent a y regu ate gene

    un er any con t on can e eas y recogn se . a e s ows t e og2 express on rat o

    a ues o t e genes un er erent con t ons.

    iscrete va ues

    Another way of representing information is to convert to discrete numbers the values in

    the tables mentioned above. In the case of converting the absolute measurement to discrete

    numbers, a binary expression matrix of 1 and 0 can be used, where 1 means that the gene is

    expressed above a user dened threshold, and 0 means that the gene is expressed below this

    threshold. In the case of making the relative expression tables or log2(expression ratio) tablesdiscrete, values can be divided into 3 classes, +1, 0 and 1, where +1 represents a gene that is

    positively regulated, 0 represents a gene that is not differentially regulated and 1 represents

  • 8/10/2019 An Introduction to Microarray Analysis

    11/25

    235

    Microarray Data Analysis

    gene t at s represse . e process o ma ng t e va ues screte oses a ot o n ormat on,

    ut s use u to ana yse express on pro es us ng a gor t ms t at cannot an e rea va ue

    expression matrices, for example algorithms calculating mutual information between genes

    or samples. Table 1D shows discrete values for the log (expression ratio) table.

    Representation of expression proles as vectors

    So far we have seen how individual cells in the gene expression matrix can be represented.

    Similarly, an expression prole (of a gene or a sample) can be thought of as a vector and

    can be represented in vector space. For example, an expression prole of a gene can be

    cons ere as a vector n ndimensional space (where n is the number of conditions), and an

    express on pro e o a samp e w t genes can e cons ere as a vector n m mens ona

    space w ere m s t e num er o genes . n t e examp e g ven e ow, t e gene express on

    matr x w t mgenes across ncon t ons s cons ere to e an mx nmatr x, w ere t e

    express on va ue or gene i n con t on s enote as xj

    =

    mnmm

    n

    n

    xxx

    xxx

    xxx

    X

    ..

    ........

    ..

    ..

    21

    22221

    11211

    The expression prole of a geneican be represented as a row vector:

    [ ]iniiii xxxxG ..,,,, 321=

    The expression prole of a samplejcan be represented as a column vector:

    =

    mj

    j

    j

    i

    x

    x

    x

    G..

    2

    1

    n t e next sect on, we w see ow express on pro es t at are represente as vectors can

    e use to compare ow s m ar or erent are t e pa rs o o ects remem er, an o ect

    may re er to a gene or a samp e .

    3.2 Distance measures

    Analysis of gene expression data is primarily based on comparison of gene expression

    proles or sample expression proles. In order to compare expression proles, we need a

    measure to quantify how similar or dissimilar are the objects that are being considered. A

    ariety of distance measures can be used to calculate similarity in expression proles andthese are discussed below.

  • 8/10/2019 An Introduction to Microarray Analysis

    12/25

    Madan Babu

    Euclidean distance

    uc ean stance s one o t e common stance measures use to ca cu ate s m ar ty

    between expression proles. The Euclidean distance between two vectors of dimension 2,

    sayA a a and b , b can be calculated as:

    2

    22

    2

    11 )()(),( babaBADEuc +=

    For instance two genes with expression proles in two conditions = 1,2 and = ,3 ,

    the Euclidean distance can be calculated as:

    2)32()21(),( 2221 =+=GGDEuc

    us or genes w t express on ata ava a e or n con t ons, represente as

    a1 an an = 1 n , uc ean stance can e ca cu ate as:

    =

    =n

    i

    iiEuc baBAD1

    2)(),(

    In other words, the Euclidean distance between two genes is the square root of the sum of

    t e squares o t e stances etween t e va ues n eac con t on mens on .

    more genera orm o t e uc ean stance s ca e t e in ows i istance

    ca cu ate as:

    p

    n

    i

    p

    iiMin baBAD =

    =1

    )(),(

    A special case of the Minkowski distance when p=1 is called rectilinear distance. When

    pplied to binary expression proles (i.e.expression levels changed to 1 and 0), it is called

    amm ng stance.

    Pearson correlation coefcient

    ne o t e most common y use metr cs to measure s m ar ty etween express on pro es

    s t e earson corre at on coe c ent sen t a . . ven t e express on

    ratios for two genes under three conditionsA [a , a , a andB=[b , b b ], PCC can be

    computed as follows:

    Step1: Compute mean

    33

    321321 bbbbandaaa

    a ++

    =++

    =

  • 8/10/2019 An Introduction to Microarray Analysis

    13/25

    237

    Microarray Data Analysis

    tep : ean centre express on pro es

    ),,(),,( 321321 bbbbbbBandaaaaaaA ==

    Step3: Calculate PCC as the cosine of the angle between the mean-centred proles

    BA

    BAPCC

    o=

    Where,

    =

    =n

    i

    ii bbaaBA1

    )()(o

    =

    =n

    i

    i aaA1

    2)(

    and

    =

    =n

    i

    i bbB1

    2)(

    The reason why we mean centre the expression proles is to make sure that we compare

    shapes of the expression proles and not their magnitude. Mean centring maintains the

    shape of the prole, but it changes the magnitude of the prole as shown in Figure 4.

    va ue o essent a y means t at t e two genes ave s m ar express on pro es

    n a va ue o means t at t e two genes ave exact y oppos te express on pro es. va ue

    o means t at no re at ons p can e n erre etween t e express on pro es o genes. n

    rea ty, va ues range rom to + . va ue . suggests t at t e genes e aves m ar y an a va ue - . suggests t at t e genes ave oppos te e av our. e

    alue of 0.7 is an arbitrary cut-off, and in real cases this value can be chosen depending on

    the dataset used. An example calculation is shown below:

    Figure 4. Expression prole before and after mean centring. Note that after mean centring, the relative shapesof the expression proles are still maintained, but the magnitude changes. The graphs do not show the actual

    result of PCC analysis.

  • 8/10/2019 An Introduction to Microarray Analysis

    14/25

    Madan Babu

    ons er two genes w t express on pro es = an = .

    e can e ca cu ate as o ows:

    a= 4.8 and b 9.6, the mean centred expression proles become:

    A - . - . . . . an B = - . - . - . . .

    Therefore,995.0

    2.1658.36

    6.77=

    ==

    BA

    BAPCC

    o

    Where,

    77.6)()()()()( =++++= 9.44.22.41.20.60.23.6-1.87.6-3.8BAo

    8.36)()()()()( 22222 =++++= 4.21.20.2-1.8-3.8A

    165.29.42.4-0.6-3.6-7.6B =++++= 22222 )()()()()(

    Rank correlation coefcient

    an corre at on coe c ent s a stance measure t at oes not ta e nto account

    the actual magnitude of the expression ratio in each condition, but takes into account the

    rank of the expression ratio. For example, consider two genesA= 2, 3, , 15, 8 and

    = 2, 7, 15, 25, 13 . When we consider the rank of the values for different conditions forene A, we get the following:

    2 (rank = 1) < 3 (rank = 2) < 8 (rank = 3) < 9 (rank = 4) < 15 (rank = 5) which is equivalent

    to = , , , .

    m ar y, or gene , we get t e ran s or t e va ues or t e erent con t ons as:

    ran = < ran = < ran = < ran = < ran = , w c s

    equ va ent to = .

    Rank correlation coefcient is the PCC calculated on the expression proles converted

    into their rank proles. In the above case the two genes have exactly the same rank prole,

    thus rank correlation coefcient becomes 1. However, PCC is not applicable when two

    alues within a rank prole are repeated. In this case, the rank correlation coefcient can

    be directly computed as:

    =

    =n

    i

    irank

    nn

    dBAD

    12

    2

    )1(61),(

    Where nis the number of conditions (dimension of the prole) and dis the difference

    etween ran s or t e two genes at con t on i. n a vantage o s t at t s not

    sens t ve to out ers n t e ata.

  • 8/10/2019 An Introduction to Microarray Analysis

    15/25

    239

    Microarray Data Analysis

    Mutual information

    stance measure to compare genes w ose pro es ave een ma e screte can e

    calculated using an entropy notion, called Shannons entropy. This measure gives us a metric

    that is indicative of how much information from the expression prole of one gene can

    be obtained to predict the behaviour of the other gene.

    Consider the discrete expression proles for two genes,A= 1, 1, 0, 1, -1] and = 1

    -1, 0, 1, -1 . We know that at any condition, the values that have been made discrete can

    be 1, 0 or 1. Thus, the probability for each state to occur in the prole for the two genes

    can be computed as follows:

    Genes Probability

    P(1) P(0) P(-1) P(1)+P(0)+P(-1)

    A5

    3

    (3 occurrencesin 5 conditions)

    51

    (1 occurrencein 5 conditions)

    51

    (1 occurrencen 5 conditions)

    ( )1

    5

    113=

    ++

    B5

    2

    (2 occurrences

    in 5 conditions)

    51

    (1 occurrence

    in 5 conditions)

    52

    (2 occurrences

    n 5 conditions)

    ( )1

    5

    212=

    ++

    rom t s ta e, t e annon s entropy or t e genes can e ca cu ate as:

    =

    =3

    1

    2log)(i

    ii PPgeneH

    Note that iruns from 1 to 3 because there are three possible states (1, 0 and 1).

    371.1

    51log

    51

    51log

    51

    53log

    53(1 222 =)++= -H(A)

    522.15

    2log5

    25

    1log5

    15

    2log5

    2(1 222 =)++= -H(B)

    The next step in our calculation is to consider how often gene A and gene B have the same

    state (1, 0, or -1) across given conditions. There are 9 possible pairwise combinations of

    states, and they are calculated for our example in the following manner:

    P(A,B) Occurrence P(A,B) Occurrence P(A,B) Occurrence

    P(1,1) 52

    P(0,1) 50

    P(-1,1) 50

    P(1,0) 50

    P(0,0) 51

    P(-1,0) 50

    P(1,-1) 51

    P(0,-1) 50

    P(-1,-1) 51

    e num er o con t ons n w c ot gene an gene ave t e r va ues equa to

    over a con t ons s out o con t ons, an so on.

  • 8/10/2019 An Introduction to Microarray Analysis

    16/25

    Madan Babu

    not er parameter we w nee to ca cu ate mutua n ormat on s o nt entropy , :

    =

    =3

    1,

    2log),(

    ji

    ijij PPBAH

    w en ot ian j n epen ent y run rom to , correspon ng to t e t ree states , an .

    923.1

    51log

    51

    51log

    51

    51log

    51

    52log

    52(, 2222 =)+++= -1B)H(A

    For the above example, the mutual information between the two expression proles, which

    provides a measure of the similarity between the two genes can be calculated as:

    970.0923.1522.1371.1),()()(),( =+=+= BAHBHAHBAM

    n genera , t e g er t e mutua n ormat on score, t e more s m ar are t e two pro es.

    owever, prec se state an consequent y, nterpretat on o t e o serve score wou

    epen on t e num er o con t ons or w c measurements were ava a e. or our case

    o con t ons, t e o ta ne score o . s g . ut rea er s a v se to consu t more

    specialised sources for understanding of states associated with mutual information distance

    measure (Shannon, 1949). One should note that a distance measure has to be chosen only

    fter considering the data to be analysed and that there is no single distance measure that

    is appropriate for all types of data.

    3.3 Clustering methods

    One of the goals of microarray data analysis is to cluster genes or samples with similarexpress on pro es toget er, to ma e mean ng u o og ca n erence a out t e set o genes

    or samp es. uster ng s one o t e unsuperv se approac es to c ass y ata nto groups

    o genes or samp es w t s m ar patterns t at are c aracter st c to t e group. uster ng

    met o s can e erarc ca group ng o ects nto c usters an spec y ng re at ons ps

    mong o ects n a c uster, resem ng a p y ogenet c tree or non- erarc ca group ng

    Figure 5. An overview of the different clustering methods.

  • 8/10/2019 An Introduction to Microarray Analysis

    17/25

    241

    Microarray Data Analysis

    nto c usters w t out spec y ng re at ons ps etween o ects n a c uster as sc emat ca y

    represente n gure . emem er, an o ect may re er to a gene or a samp e, an a c uster

    refers to a set of objects that behave in a similar manner.

    Hierarchical clustering

    Hierarchical clustering may be agglomerative (starting with the assumption that each

    object is a cluster and grouping similar objects into bigger clusters) or divisive (starting

    from grouping all objects into one cluster and subsequently breaking the big cluster into

    smaller clusters with similar properties). The basic idea behind agglomerative and divisive

    hierarchical clustering is shown in Figure 6. There are many different types of clustering

    met o s an a ew common y use ones are escr e e ow.

    Figure 6. Schematic diagram showing the principle behind agglomerative and divisive clustering. The colour

    code represents the log2 (expression ratio), where red represents up-regulation, green represents down-regulation,

    and black represents no change in expression. In agglomerative clustering, genes that are similar to each other

    are grouped together, and an average expression prole is calculated for the group by using the average linkage

    algorithm. This step is performed iteratively until all genes are included into one cluster. In the case of divisive

    clustering, the whole set of genes is considered as a single cluster and is broken down iteratively into sub-clusters

    with similar expression proles until each cluster contains only one gene. This information can be represented

    as a tree, where the terminal nodes represent genes and all branches represent different clusters. The distance

    from the branch point provides a measure of the distance between two objects. This image was adapted from

    Dopazo et al., (2001). Notice that the ordered matrix at the top is the actual product of either agglomerative ordivisive clustering, and genes A to E are given in the nal order for the simplicity of illustration; initially rows

    corresponding to genes A to E could be arranged in any order and it is the task of the methods to arrange them

    meaningfully. Colour gure t: http://www.mrc-lmb.cam.ac.uk/genomes/madanm/microarray/.

  • 8/10/2019 An Introduction to Microarray Analysis

    18/25

    Madan Babu

    Hierarchical clustering: agglomerative

    In the case of a hierarchical agglomerative clustering, the objects are successively fused

    until all the objects are included. For a hierarchical agglomerative clustering procedure,

    each object is considered as a cluster. The rst step is the calculation of pairwise distance

    measures for the objects to be clustered. Based on the pairwise distances between them,

    o ects t at are s m ar to eac ot er are groupe nto c usters. ter t s s one, pa rw se

    stances etween t e c usters are re-ca cu ate , an c usters t at are s m ar are groupe

    toget er n an terat ve manner unt a t e o ects are nc u e nto a s ng e c uster. s

    n ormat on can e represente as a en rogram, w ere t e stance rom t e ranc po nt

    s n cat ve o t e stance etween t e two c usters or o ects.

    Comparison of clusters with another cluster or an object can be carried out using four

    pproaches (Figure 7).

    Single linkage clustering (Minimum distance)

    In single linkage clustering, distance between two clusters is calculated as the minimum

    distance between all possible pairs of objects, one from each cluster. This method has

    n advantage that it is insensitive to outliers. This method is also known as the nearest

    ne g our n age.

    omp ete in age c ustering aximum istance

    n comp ete n age c uster ng, stance etween two c usters s ca cu ate as t e max mum

    stance etween a poss e pa rs o o ects, one rom eac c uster. e sa vantage

    of this method is that it is sensitive to outliers. This method is also known as the farthest

    neighbour linkage.

    Average linkage clusteringIn average linkage clustering, distance between two clusters is calculated as the average of

    distances between all possible pairs of objects in the two clusters.

    Figure 7. Different algorithms to nd distance between two clusters.

  • 8/10/2019 An Introduction to Microarray Analysis

    19/25

    243

    Microarray Data Analysis

    Centroid linkage clustering

    In centroid linkage clustering, an average expression prole (called a centroid) is calculated

    n two steps. rst, t e mean n eac mens on o t e express on pro es s ca cu ate or

    o ects n a c uster. en, stance etween t e c usters s measure as t e stance

    etween t e average express on pro es o t e two c usters.

    n examp e o t e erarc ca agg omerat ve c uster ng us ng s ng e n age c uster ng

    s s own n gure .

    Hierarchical clustering: divisive

    Hierarchical divisive clustering is the opposite of the agglomerative method, where the entire

    set of objects is considered as a single cluster and is broken down into two or more clusters

    that have similar expression proles. After this is done, each cluster is considered separately

    nd the divisive process is repeated iteratively until all objects have been separated into

    single objects. The division of objects into clusters on each iterative step may be decided

    upon y pr nc pa component ana ys s w c eterm nes a vector t at separates g ven o ects.

    s met o s ess popu ar t an agg omerat ve c uster ng, ut as success u y een use

    n t e ana ys s o gene express on ata y on t a . .

    Non-hierarchical clustering

    One of the major criticisms of hierarchical clustering is that there is no compelling evidence

    that a hierarchical structure best suits grouping of the expression proles. An alternative to

    this method is a non-hierarchical clustering, which requires predetermination of the number

    of clusters. Non-hierarchical clustering then groups existing objects into these predened

    clusters rather than organizing them into a hierarchical structure.

    Figure 8. An example of a hierarchical clustering using single linkage algorithm. Consider ve genes and the

    distances between them as shown in the table. In the rst step, genes that are close to each other are grouped

    together and the distances are re-calculated using the single linkage algorithm. This procedure is repeated until

    all genes are grouped into one cluster. This information can be represented as a tree (shown to the right), where

    the distance from the branch point reects the distance between genes or clusters. This image was adapted from

    Causton et al.(2003).

  • 8/10/2019 An Introduction to Microarray Analysis

    20/25

    Madan Babu

    Non-hierarchical clustering: K-means

    K-means is a popular non-hierarchical clustering method (Figure 9A). In K-means

    c uster ng, t e rst step s to ar trar y group o ects nto a pre eterm ne num er o

    c usters. e num er o c usters can e c osen ran om y or est mate y rst per orm ng

    erarc ca c uster ng o t e ata. o ow ng t s step, an average express on pro e

    centro s ca cu ate or eac c uster, t s s ca e n t a zat on. ext, n v ua o ects

    re reattr ute rom one c uster to t e ot er epen ng on w c centro s c oser to t e

    ene (or sample). This procedure of calculating the centroid for each cluster and re-grouping

    objects closer to available centroids is performed in an iterative manner for a xed number

    of times, or until convergence (state when composition of clusters remains unaltered by

    further iterations). Typically, the number of iterations required to obtain stable clusters ranges

    from 20,000 to 100,000. However, there is no guarantee that the clusters will converge.

    This method has an advantage that it is scalable for large datasets.

    Non-hierarchical clustering: Self Organizing Maps

    e rgan z ng aps s wor n a manner s m ar to -means c uster ng gure .

    n -means c uster ng, one c ooses t e num er o c usters to t t e ata, w ereas w t

    t e rst step s to c oose t e num er an or entat on o t e c usters w t respect to eac

    ot er. or examp e, a two- mens ona gr o no es w c may en up e ng c usters

    Figure 9. A: The principle behind K-means clustering. Objects are grouped into a predened number of clusters

    during the initialization step. Centroid for each cluster is calculated, and objects are re-grouped depending on how

    close they are to available centroids. This step is performed iteratively until convergence or is performed for a

    xed number of iterations to get nal clusters of objects. B: The principle behind SOMs. During the initialization

    step, a grid of nodes is projected onto the expression space and each gene is assigned its closest node. Following

    this step, one gene is chosen at random and the assigned node is moved towards it. The other nodes are moved

    towards this gene depending on how close they are to the selected gene. This step is performed iteratively untilconvergence or is performed for a xed number of iterations to get a nal map of nodes.

  • 8/10/2019 An Introduction to Microarray Analysis

    21/25

    245

    Microarray Data Analysis

    cou e t e start ng po nt. e gr s pro ecte onto t e express on space, an eac o ect

    s ass gne a no e t at s nearest to t t s s ca e n t a zat on. n t e next step, a ran om

    object is chosen and the node (called a reference vector) which is in the neighbourhood

    of the object is moved closer to it. The other nodes are moved to a small extent depending

    on how close they are to the object chosen. In successive iterations, with randomly chosen

    objects, the positions of the nodes are rened and the radius of neighbourhood becomes

    conned. In this way, the grid of nodes (initially a two-dimensional grid) is deformed to

    t the data. The advantage of this method, unlike K-means, is that SOM does not force the

    number of clusters to be equal to the number of starting nodes in the chosen grid. This is

    because some nodes may have no objects associated with them when the map is complete.

    t er a vantages o nc u e prov ng n ormat on on t e s m ar ty etween t e

    no es, an t e a ty o to pro uce re a e resu ts even w t no sy ata.

    4. RELATING EXPRESSION DATA TO OTHER BIOLOGICAL INFORMATION

    ene express on pro es can e n e to externa n ormat on to ga n ns g t nto o og caprocesses and to make new discoveries. Some of the possible questions that can be addressed

    fter analysing gene expression data will be discussed in this section.

    4.1 Predicting binding sites

    It is reasonable to assume that genes with similar expression proles are regulated by the

    same set of transcription factors. If this happens to be the case, then genes that have similar

    expression proles should have similar transcription factor binding sites upstream of the

    co ng sequence n t e . ar ous researc groups ave exp o te t s assumpt on.

    razma t a . an ot ers ussema er et a ., ; on on t a ., ave stu e

    t e occurrence o sequence patterns an scovere putat ve n ng s tes n t e promoterreg ons o genes t at are co-expresse . e steps nvo ve n suc stu es are t e o ow ng:

    n a set o genes t at ave s m ar express on pro es. xtract promoter sequences

    of the co-expressed genes. (3) Identify statistically over-represented sequence patterns. (4)

    Assess quality of the discovered pattern using statistical signicance criteria.

    4.2 Predicting protein interactions and protein functions

    Integrating expression data with other external information, for example evolutionary

    conservation of proteins, have been used to predict interacting proteins, protein complexes,

    nd protein function. Work by Ge t al.(2001) and Jansen and Gerstein (2000) have shown

    that genes with similar expression proles are more likely to encode proteins that interact.en t s n ormat on s com ne w t evo ut onary conservat on o prote ns, mean ng u

    pre ct ons can e ma e. n a recent wor y e c mann an a an a u t was

    s own t at prote ns t at are evo ut onar y conserve n yeast an worm an t at ave

    s m ar express on pro es n ot organ sms ten to e a part o t e same sta e comp ex

    or nteract p ys ca y. oort et a . ave a so s own t at t e enco e prote ns o

    conserved, co-expressed gene pairs are highly likely to be part of the same pathway. Such

    studies enable us to predict specic gene functions. The steps involved in such studies are

    the following: (1) Identify co-expressed genes in the two studied organisms. (2) Identify

    conserved (orthologous) proteins. (3) Find instances where conserved (orthologous) proteins

    re co-expressed in both organisms. (4) Map information on protein interaction or metabolicpathway available for one organism to predict interacting proteins or function of the proteins

    n t e ot er organ sm.

  • 8/10/2019 An Introduction to Microarray Analysis

    22/25

    Madan Babu

    4.3 Predicting functionally conserved modules

    Genes that have similar expression proles often have related functions. Instead of studying

    co-expresse pa rs o genes, one can v ew sets o co-expresse genes t at are nown to

    nteract as a unct ona mo u e nvo ve n a part cu ar o og ca process a an a u t a .,

    . s n ormat on, w en ntegrate w t t e evo ut onary conservat on o prote ns n

    more t an two organ sms, prov es now e ge o t e s gn cance o t e unct ona mo u es

    t at ave een conserve n evo ut on. tuart t a . ave a resse t s ssue n great

    detail and have identied one evolutionarily conserved functional module which belongs

    to an as yet unknown biological process. For other modules that are known to be involved

    in previously well-studied biological process, new module members were discovered by

    Stuart t al. (2003), providing clues about unknown candidates involved in the processes.

    The steps involved in such studies are similar to those discussed in the previous section.

    Instead of two organisms, one has to consider three or more organisms, and should also

    ddress other issues related to identifying orthologous proteins.

    4.4 Reverse-engineering of gene regulatory networks

    ene express on ata can a so e use to n er regu atory re at ons ps. s approac s

    nown as reverse eng neer ng o regu atory networ s. esearc y ega et a . an

    ar ner t a . c ear y g g ts t at we are now n a goo pos t on to use express on

    ata to ma e pre ct ons a out t e transcr pt ona regu ators or a g ven gene or sets o

    enes. Segal et al.(2003) have developed a probabilistic model to identify modules of co-

    regulated genes, their transcriptional regulators and conditions that inuence regulation.

    This new knowledge allowed them to generate further hypotheses, which are experimentally

    testable. Gardner t al.(2003) described a method to infer regulatory relationships, called

    NIR (Network Identication by multiple Regression), which uses non-linear differentialequations to model regulatory networks. In this method, a model of connections between

    enes in a network is inferred from measurements of system dynamics ( i.e.response of

    enes an prote ns to pertur at ons .

    5. WEBSITE REFERENCES, ACADEMIC SOFTWARE AND WEB

    SUPPLEMENT

    5.1 Website references

    Some websites that provide a reference to various aspects of microarrays are given

    below:

    Portals:

    http://ihome.cuhk.edu.hk/%7Eb400559/array.html

    A comprehensive web portal on microarrays.

    http://www.bioinformatics.vg/biolinks/bioinformatics/Microarrays.shtml

    A web-portal on microarrays.

    http://www.hgmp.mrc.ac.uk/GenomeWeb/nuc-genexp.html

    A collection of gene expression and microarray links at the HGMP (Human Genome

    Mapping Project).

  • 8/10/2019 An Introduction to Microarray Analysis

    23/25

    247

    Microarray Data Analysis

    Tutorials:

    http://www.ucl.ac.uk/oncology/MicroCore/HTML_resource/tut_frameset.htm

    A website that provides a tutorial on the various aspects of microarray data analysis

    iscusse in t is c apter.

    Software links:

    http://genome-www5.stanford.edu/restech.html

    This website provides a list of software available for microarray data analysis with a brief

    escription o w at t e so tware oes an t e p at orm on w ic it can e run.

    Microarray databases:

    http://genome-www5.stanford.edu/

    Stanford Microarray Database (SMD) contains raw and normalized data from microarray

    experiments as we as t eir image es. a so provi es inter aces or ata retrieva ,

    ana ysis an visua isation.

    http://www.ebi.ac.uk/arrayexpress/

    ArrayExpress public repository for microarray data at the EMBL-EBI.

    http://info.med.yale.edu/microarray/

    Yale Microarray Database (YMD)

    http://www.ncbi.nlm.nih.gov/geo/

    Gene Expression Omnibus at the NCBI, NIH.

    Table 2. List of software available for academic use.

  • 8/10/2019 An Introduction to Microarray Analysis

    24/25

    Madan Babu

    5.2 Software available for non-commercial use

    The list of software provided here is by no means exhaustive. The readers are urged to

    isit the reference websites provided above to get a more comprehensive list of available

    programs a e .

    5.3 Supplementary material on the web

    e we -supp ement s ava a e at:

    http://www.mrc-lmb.cam.ac.uk/genomes/madanm/microarray/

    It has PERL scripts to calculate the following statistics:

    . Euclidean distance.

    2. Pearson correlation coefcient.

    . Rank correlation coefcient.

    xpress on atasets or t e yeast genome rom o t a . an pe man et a .

    re a so prov e .

    Acknowledgements

    wou e to t an ar e as au, an a an t e ot ers n t e group or rea ng

    the chapter. I would also like to acknowledge the Medical Research Council, Cambridge

    Commonwealth Trust and Trinity College, Cambridge for nancial support.

    References

    Alon, U., Barkai, N., Notterman, D.A., Gish, K., Ybarra, S., Mack, D., and Levine, A.J. (1999). Broad patterns ofgene expression revealed by clustering analysis of tumor and normal colon tissues probed by oligonucleotide

    arrays. Proc. Natl. Acad. Sci. USA96, 6745-6750.

    Brazma, A., Jonassen, I., Vilo, J., and Ukkonen, E. (1998). Predicting gene regulatory elements in silico on a

    genomic scale. Genome Res.8, 1202-1215.

    Bussemaker, H. J., Li, H., and Siggia, E. D. (2001). Regulatory element detection using correlation with expression.

    at. enet. , 167-171.

    Causton, H., Quackenbush, J., and Brazma, A. (2003). Microarray Gene Expression Data Analysis: A Beginners

    Guide. (UK: Blackwell Science)

    Cho, R. J., Campbell, M. J., Winzeler, E. A., Steinmetz, L., Conway, A., Wodicka, L., Wolfsberg, T. G., Gabrielian,

    A. E., Landsman, D., Lockhart, D. J., and Davis, R. W. (1998). A genome-wide transcriptional analysis of the

    mitotic cell cycle. Mol. Cell.2, 65-73.

    Conlon, E. M., Liu, X. S., Lieb, J. D., and Liu, J. S. (2003). Integrating regulatory motif discovery and genome-

    wide expression analysis. Proc. Natl. Acad. Sci. USA100, 3339-3344.Dopazo, J., Zanders, E., Dragoni, I., Amphlett, G., and Falciani, F. (2001). Methods and approaches in the analysis

    of gene expression data. J. Immunol. Meth.250, 93-112.

    Eisen, M. B., Spellman, P. T., Brown, P. O., and Botstein, D. (1998). Cluster analysis and display of genome-wide

    expression patterns. Proc. Natl. Acad. Sci. USA95, 14863-14868.

    Gardner, T. S., di Bernardo, D., Lorenz, D., and Collins, J. J. (2003). Inferring genetic networks and identifying

    compound mode of action via expression proling. Science301, 102-105.

    Ge, H., Liu, Z., Church, G. M., and Vidal, M. (2001). Correlation between transcriptome and interactome mapping

    data from Saccharomyces cerevisiae. Nat. Genet.29, 482-486.

    Jansen, R., and Gerstein, M. (2000). Analysis of the yeast transcriptome with structural and functional categories:

    characterizing highly expressed proteins. Nucl. Acids Res.28, 1481-1488.

    Luscombe, N. M., Royce, T. E., Bertone, P., Echols, N., Horak, C. E., Chang, J. T., Snyder, M., and Gerstein,

    M. (2003). ExpressYourself: A modular platform for processing and visualizing microarray data. Nucl. Acids

    Res.31, 3477-3482.

    Madan Babu, M., Luscombe, N., Aravind, L., Gerstein, M., Teichmann, S.A. (2004). Structure and evolution of

    transcriptional regulatory networks. Curr. Opin. Struct. Biol. 14, in press.

  • 8/10/2019 An Introduction to Microarray Analysis

    25/25

    Microarray Data Analysis

    Quackenbush, J. (2002). Microarray data normalization and transformation. Nat. Genet.32, 496-501.

    Segal, E., Shapira, M., Regev, A., Pe'er, D., Botstein, D., Koller, D., and Friedman, N. (2003). Module networks:

    identifying regulatory modules and their condition-specic regulators from gene expression data. Nat. Genet.

    34, 166-176.

    Shannon, C. (1949). A mathematical theory of communication. Bell. Sys. Tech. J. 17, 379-423; 623-656.

    Spellman, P. T., Sherlock, G., Zhang, M. Q., Iyer, V. R., Anders, K., Eisen, M. B., Brown, P. O., Botstein, D., and

    Futcher, B. (1998). Comprehensive identication of cell cycle-regulated genes of the yeast Saccharomyces

    cerevisiae by microarray hybridization. Mol. Biol. Cell9, 3273-3297.

    Stuart, J. M., Segal, E., Koller, D., and Kim, S. K. (2003). A gene-coexpression network for global discovery of

    conserved genetic modules. Science302, 249-255.

    Teichmann, S. A., and Madan Babu, M. (2002). Conservation of gene co-regulation in prokaryotes and eukaryotes.

    Trends Biotechnol.20, 407-410; discussion 410.

    van Noort, V., Snel, B., and Huynen, M. A. (2003). Predicting gene function by conserved co-expression. Trends

    Genet.19, 238-242.