
Grundlagen der Bioinformatik, SS’08, D. Huson (this part by K. Nieselt) July 10, 2008

12 Microarrays (script by K. Nieselt)

There are many articles and books on this topic. These lectures are based on Kay Nieselt’s course on “Microarray Bioinformatics”. The following books are recommended reading:

• T. Speed, Statistical Analysis of Gene Expression Microarray Data, Chapman & Hall, 2003.

• D. Stekel, Microarray Bioinformatics, Cambridge University Press, 2004.

• Jones and Pevzner, An Introduction to Bioinformatics Algorithms, Chapter 8.

12.1 The Central Dogma of Biology and gene expression

The expression of genetic information in a DNA molecule takes place in two steps:

1. transcription: DNA → mRNA

2. translation: mRNA → protein

Gene expression is a highly complex and precisely regulated process that allows the cell to react dynamically to a changing environment as well as to its own changing needs. This mechanism acts both as an “on/off” switch to control which genes are expressed in a cell and as a “volume control” that increases or decreases the level of expression of particular genes as necessary.

12.2 What is a microarray?

Figure: two examples of microarrays.

Microarrays are devices that allow the expression of many thousands of genes to be measured in parallel.

They usually consist of a microscopic slide onto which DNA molecules have been chemically bonded.

From a biological sample, mRNA is extracted and labelled. It will hybridize to the DNA of the array via Watson-Crick duplex formation.

Microarray technology is used to understand fundamental aspects of growth and development as well as potential genetic causes of diseases.

A microarray experiment is a typical example of a high-throughput experiment. Basically it consists of three steps:


1. Material production: array design and array production

2. Data generation: preparation of tissue, mRNA isolation, cDNA labeling, hybridisation, scanning

3. Information retrieval: image analysis, data normalisation, advanced analyses.

The goal of a microarray experiment is to compare the expression of two or more cell types. Examples are:

• Analysis of tissue-specific gene expression

• Comparison of gene expression in healthy and tumor tissue

• Influence of environmental changes on expression

• Dependence of gene expression on cell cycle state

In addition, applications range from sequencing to so-called ChIP-on-Chip procedures.

Definition 12.2.1 A microarray is a tool for analyzing gene expression that consists of a small membrane or glass slide containing samples of many genes arranged in a regular pattern.

The following synonyms for microarrays are also used: chip, biochip, DNA-array, gene array.

Definition 12.2.2 The probe is the nucleic acid molecule on the chip that is known. The target is the free nucleic acid molecule in the solution that shall be identified. [1]

12.3 Types of microarrays

One distinguishes microarrays either by the

• type of probes present on the chip:

– cDNA

– oligos

• or by the type of production of the chips:

– spotting

– in-situ

12.3.1 Production of “spotted” microarrays

Production of spotted arrays: probes are attached to the slide in three main steps:

1. Generation of DNA probes (cDNA or oligos)

2. Printing of probes onto glass slide

3. Fixation of probes

[1] According to the convention of MIAME (Minimum Information About a Microarray Experiment), the DNA on the array is called reporter and the DNA in the solution is the hybridisation extract.


12.3.2 Production of in situ synthesis microarrays

For these microarrays, whose most prominent representative is the GeneChip by Affymetrix, the DNA of a gene (or EST) is not put onto the array; instead, oligonucleotides are synthesized directly on the chip. For this a photolithographic process is used that is very similar to the usual semiconductor chip production. Three different technologies are currently in use:

1. photo deprotection with masks: Affymetrix

2. photo deprotection without masks: Nimblegen

3. chemical deprotection: Agilent

Production of an Affymetrix array (figure; labels: light, mask, photo-chemically removable group, glass substrate, many repetitions).

12.4 DNA-microarray experiments

Independent of the type of chip, each DNA-microarray experiment consists of the following steps:

1. Preparation of the array

2. Extraction of tissues


3. Isolation of mRNA from the tissue(s)

4. Generation of cDNA/cRNA from mRNA

5. Generation of the hybridisation solution that contains the fluorescently labeled cDNA/cRNA (each target uses a different label)

6. Incubation of the hybridisation solution with the array

7. Scanning

8. Image analysis

9. Advanced analysis of the data

12.4.1 Scanning

Laser technology is used to detect the bound cDNA/cRNA as follows:

Exposed to laser excitation, the molecules emit light photons.

The more target DNA is bound the higher the fluorescence signal.

If a gene is highly expressed, many RNA molecules will stick to the probe, and thus the probe location will shine brightly when the laser hits it.

If a gene is expressed at a lower level, less RNA will stick to the probe, and by comparison, that probe location will be much dimmer when it is hit with the laser.

12.5 One-color versus dual-color

Spotted arrays allow comparative microarray experiments, in which the expression of two targets is measured simultaneously, while in situ produced arrays yield absolute experiments.

In the case of spotted arrays one also speaks of dual channel or dual color experiments, and in the case of in situ arrays one speaks of one channel or one color experiments.

12.6 From raw to primary data

Generally three steps are necessary for the image analysis:

1. Addressing: Assign location of spot center

Based on the gridding process the coordinates of each spot are assigned. The algorithms for this step need to be robust and reproducible.

2. Segmentation: Classification of a pixel into foreground (signal) or background pixel (noise)

3. Information extraction: Now numerical values are computed

For each spot on the array (and label if more than one is used) compute:

(a) mean signal intensity,

(b) mean background intensity,

(c) quality value.


Each of the two labels has a characteristic excitation wavelength. These must of course differ from the emission wavelengths.

For each label (channel) a scan is produced.

Then the measured intensities of the two channels are overlaid (compared) for each spot and a pseudocolored image is produced.

Usually red/green/yellow/black is used. This color choice symbolizes the choice of the labels, i.e. Cy3 (green) and Cy5 (red). If in a spot both channels have the same intensity, then the spot is colored yellow. If the intensity in the green channel is higher, then green is chosen, otherwise red. Black spots symbolize missing intensity.

Example:

12.6.1 Expression values of two-channel arrays

Although one assumes that light is detected only from cDNA molecules that hybridize with their complementary probes, light from other sources is detected as well. It may come from molecules that are bound to a wrong spot or unspecifically to the glass, or from light reflected by dust etc.

The signals from these sources are called the background signals of a scan. All in all, the raw product of the scan are the pixel intensities.

Let $F_{X,j}$ denote the set of foreground pixels in channel $X$ ($X = R$ for red, $X = G$ for green) of the $j$th probe (spot, gene). Similarly, let $B_{X,j}$ denote the set of background pixels in channel $X$ of the $j$th probe.

Let $r_i$ and $g_i$ be the intensity of pixel $i$ in the red and green channel, respectively. Furthermore, let $R_j^f$ and $G_j^f$ be the mean foreground signal of the $j$th spot in the red and green channel, respectively. Equivalently, let $R_j^b$ and $G_j^b$ be the mean background signal of the $j$th spot in the red and green channel, respectively.

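These are computed as

$$R_j^f = \frac{1}{|F_{R,j}|} \sum_{i \in F_{R,j}} r_i, \qquad G_j^f = \frac{1}{|F_{G,j}|} \sum_{i \in F_{G,j}} g_i$$

and

$$R_j^b = \frac{1}{|B_{R,j}|} \sum_{i \in B_{R,j}} r_i, \qquad G_j^b = \frac{1}{|B_{G,j}|} \sum_{i \in B_{G,j}} g_i$$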

Then, for the final expression value of a spot, the background signals are subtracted from the foreground signals:

$$R_j = R_j^f - R_j^b, \qquad G_j = G_j^f - G_j^b$$

Here care must be taken if $R_j^b > R_j^f$ and/or $G_j^b > G_j^f$. In this case, most image analysis programs return a “flagged” spot.

Finally, both expression values are combined into a ratio or log ratio (commonly base 2):

$$e(j) = \log_2\left(\frac{R_j}{G_j}\right)$$

Thus $e(j)$ is the log ratio expression value of the $j$th spot.
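The computation of the log ratio for a single spot can be sketched in a few lines of Python. This is a minimal illustration, not the procedure of any particular image analysis program: the function names and the toy pixel values are hypothetical, and flagging a spot whenever a background-corrected signal is zero or negative is an assumption (the text above only mentions the case where the background exceeds the foreground).

```python
import math

def mean(values):
    """Arithmetic mean of a non-empty list of pixel intensities."""
    return sum(values) / len(values)

def spot_log_ratio(red_fg, red_bg, green_fg, green_bg):
    """Return (e, flagged) for one spot.

    Each argument is a list of pixel intensities (foreground or
    background pixels of the red or green channel). e is
    log2(R_j / G_j), or None when the spot is flagged.
    """
    r = mean(red_fg) - mean(red_bg)      # R_j = R_j^f - R_j^b
    g = mean(green_fg) - mean(green_bg)  # G_j = G_j^f - G_j^b
    if r <= 0 or g <= 0:
        # Assumption: treat a non-positive corrected signal as unusable.
        return None, True
    return math.log2(r / g), False

# Toy spot: after background correction red is four times as bright as green.
e, flagged = spot_log_ratio(
    red_fg=[900, 1100, 1000], red_bg=[190, 210],
    green_fg=[400, 420, 380], green_bg=[195, 205],
)
print(e, flagged)
```

With base 2, a value of 1 corresponds to a twofold higher red signal and a value of -1 to a twofold higher green signal, so up- and down-regulation are treated symmetrically.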

12.6.2 Images of in situ-microarrays

The resulting picture of an in-situ array scan differs substantially from that of a spotted array. Here the spots are not circular but square, which makes the image analysis much easier. Example:


12.6.3 Expression values of one-channel arrays

The expression values for arrays with just one channel are computed similarly to those of the two-channel experiments. Here we will define e(j) to be either the absolute expression intensity or the log2 value of it.

12.7 The expression matrix

Now that we have defined an expression value of a gene in a single array experiment, we will turn to assembling all values of several array experiments into a common matrix.

Definition 12.7.1 The expression matrix of a microarray experiment consisting of p arrays, where each array has n genes, is an n × p matrix, where the (i, j)th cell contains the expression value of the ith gene on the jth hybridized array.

Let us denote the expression profile of the $i$th gene $g_i$ by $e(g_i)$, and the expression value of the $i$th gene in the $j$th experiment by $e(g_{ij})$.

Then we denote the mean expression of $g_i$ by

$$\bar{e}(g_i) = \frac{1}{p} \sum_{j=1}^{p} e(g_{ij}).$$
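As a small illustration of these definitions, the following sketch stores an expression matrix as a list of rows (one row per gene, one column per array) and computes the mean expression of each gene. The values and the helper names are made up for the example.

```python
# Rows are genes, columns are arrays (here n = 3 genes, p = 4 arrays).
expression_matrix = [
    [1.0, 2.0, 3.0, 2.0],    # expression profile of gene 1
    [-0.5, 0.0, 0.5, 0.0],   # expression profile of gene 2
    [4.0, 4.0, 4.0, 4.0],    # expression profile of gene 3
]

def gene_profile(matrix, i):
    """Expression profile of gene i (0-based row index)."""
    return matrix[i]

def sample_profile(matrix, j):
    """Expression values of all genes on array j (0-based column index)."""
    return [row[j] for row in matrix]

def mean_expression(matrix, i):
    """Mean of the expression values of gene i over all p arrays."""
    profile = gene_profile(matrix, i)
    return sum(profile) / len(profile)

means = [mean_expression(expression_matrix, i)
         for i in range(len(expression_matrix))]
print(means)
```

Reading the matrix by rows gives gene profiles, reading it by columns gives sample profiles; both views are used for clustering later on.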

12.8 Similarity and dissimilarity of expression data

In the following we will look at distance measures to compute the (dis)similarity of expression profiles. The computed (dis)similarity values will then be the input of clustering algorithms.

Using microarrays on a genome-wide level also aims to identify groups of genes or samples with similar expression profiles.

From the biological point of view the comparison of gene profiles is different from the comparison of sample profiles. From the mathematical point of view it is essentially the same.

12.8.1 Metrics and semi-metrics for expression data

Assume that we have an expression matrix with n genes and p arrays.

Similarity between two profiles is often measured in terms of the distance of two vectors in a high-dimensional (either n or p) space.

The most often used distance is the Euclidean distance:

$$d(x, y) = \sqrt{\sum_{i=1}^{p} (x_i - y_i)^2}$$

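A semi-metric measure is the Pearson correlation coefficient:

$$\rho(x, y) = \frac{\sum_{i=1}^{p} (x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_{i=1}^{p} (x_i - \bar{x})^2} \, \sqrt{\sum_{i=1}^{p} (y_i - \bar{y})^2}}$$

where $\bar{x}$ and $\bar{y}$ denote the mean values of the profiles $x$ and $y$.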

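We have ρ(x, y) ∈ [−1, 1]. Here ρ(x, y) = 1 implies perfect similarity (a perfect positive linear relationship), ρ(x, y) = −1 a perfect negative linear relationship, and ρ(x, y) = 0 no linear relationship (randomness).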

Examples of Pearson correlation coefficients:

The Pearson correlation coefficient is a similarity measure, thus one needs to transform it into a distance parameter:

$$d_\rho(x, y) = 1 - \rho(x, y)$$

Note that the Euclidean distance is not scale invariant: two profiles with the same shape (i.e. large Pearson correlation similarity score) but different magnitude will have a large Euclidean distance parameter and thus appear to be dissimilar.

In addition, Euclidean distance cannot detect negative correlations.

On the other hand, if the magnitude of change is of importance, then it is the appropriate distance measure.
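The difference between the two measures can be made concrete with a short sketch. The three toy profiles are invented for illustration: one is a scaled copy of the reference profile, the other its mirror image.

```python
import math

def euclidean(x, y):
    """Euclidean distance between two profiles of equal length."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def pearson(x, y):
    """Pearson correlation coefficient of two profiles."""
    mx = sum(x) / len(x)
    my = sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    # Note: undefined (division by zero) for a constant profile.
    return cov / (sx * sy)

def pearson_distance(x, y):
    """Distance derived from the correlation, ranging from 0 to 2."""
    return 1.0 - pearson(x, y)

x = [1.0, 2.0, 3.0, 4.0]
same_shape = [10.0, 20.0, 30.0, 40.0]   # same shape, ten times the magnitude
mirrored = [4.0, 3.0, 2.0, 1.0]         # perfectly anti-correlated

print(euclidean(x, same_shape), pearson_distance(x, same_shape))
print(euclidean(x, mirrored), pearson_distance(x, mirrored))
```

The scaled profile is far away in Euclidean terms but has a Pearson distance of (numerically almost) zero, while the mirrored profile reaches the maximal Pearson distance of 2.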


12.9 Clustering - Introduction

In gene expression analysis, clustering methods are often applied to analyse expression profiles. Generally we distinguish

• Unsupervised Clustering

• Supervised Clustering = Classification

While a classification analysis assigns objects to predefined groups / classes, cluster analysis computes groups of objects (which are here either genes or samples).

Unsupervised Clustering

• helps to identify genes that might be involved in the same functional process in the cell

• helps to identify and annotate unknown genes

• helps for example to identify subtypes of cancer

We distinguish two general types of cluster methods:

• partitioning methods

• hierarchical methods

Cluster analysis needs two “ingredients”:

• Distance measure

• Cluster algorithm: groups objects based on their distance with the goal to achieve small distances within the clusters and large distances between clusters.

12.9.1 k-means clustering

The goal of k-means clustering is to find a partition C of the set X into k (pre-chosen) clusters, such that a given measure for homogeneity is maximised. High homogeneity implies that elements in the same cluster are very similar.

k-means clustering belongs to the so-called partition clustering methods: The input set of elements is partitioned into disjoint clusters, such that each element belongs to exactly one cluster.

Algorithm 12.9.1 (k-means)

1. Choose k

2. Choose randomly k centers $\mu_1, \dots, \mu_k$ that are the mean values for the clusters

3. For each gene compute the nearest cluster center:

$$C(i) = \arg\min_{1 \le l \le k} d(x_i, \mu_l)^2$$

4. Compute the new mean for each cluster:

$$\mu_l = \frac{1}{|C_l|} \sum_{x_j \in C_l} x_j$$

where $C_l$ is the set of genes currently assigned to cluster $l$.

5. Repeat steps 3-4 until the algorithm converges
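The steps of the algorithm can be sketched as follows. This is a minimal illustration under several assumptions that the algorithm above leaves open: the initial centers are drawn from the input profiles themselves, convergence means that no assignment changes, an empty cluster keeps its previous center, and the `seed` and `max_iter` parameters are hypothetical additions for reproducibility.

```python
import random

def squared_distance(x, y):
    """Squared Euclidean distance between two profiles."""
    return sum((a - b) ** 2 for a, b in zip(x, y))

def centroid(points):
    """Component-wise mean of a non-empty list of profiles."""
    dim = len(points[0])
    return [sum(p[d] for p in points) / len(points) for d in range(dim)]

def k_means(profiles, k, seed=0, max_iter=100):
    """Return (assignment, centers) for the given profiles."""
    rng = random.Random(seed)
    # Step 2: pick k distinct input profiles as initial centers (assumption).
    centers = [list(p) for p in rng.sample(profiles, k)]
    assignment = None
    for _ in range(max_iter):
        # Step 3: assign every profile to its nearest center.
        new_assignment = [
            min(range(k), key=lambda l: squared_distance(x, centers[l]))
            for x in profiles
        ]
        if new_assignment == assignment:
            break  # Step 5: nothing changed, so the algorithm has converged.
        assignment = new_assignment
        # Step 4: recompute each center as the mean of its members.
        for l in range(k):
            members = [x for x, c in zip(profiles, assignment) if c == l]
            if members:  # an empty cluster keeps its old center
                centers[l] = centroid(members)
    return assignment, centers

# Toy data: two well-separated groups of three profiles each.
profiles = [[0.0, 0.1], [0.2, 0.0], [0.1, 0.2],
            [5.0, 5.1], [5.2, 4.9], [4.9, 5.0]]
labels, centers = k_means(profiles, k=2)
print(labels)
```

Because the result depends on the random initial centers, in practice the algorithm is often run several times with different starting points.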


Example:

The k-means method minimizes (at least locally) the total intravariance sum:

$$\sum_{l=1}^{k} \sum_{C(i)=l} d(x_i, \mu_l)^2$$

i.e. the sum of the squared distances between each gene expression profile and its respective cluster center. An important parameter for the method is the choice of k, the number of clusters. One way to optimize this choice is to run the algorithm several times with different values of k, compute the total intravariance sum each time, and plot the result.
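The total intravariance sum of a given clustering can be computed as in the following sketch. To keep the example deterministic, the cluster labels are fixed by hand instead of being produced by k-means; the function names and the one-dimensional toy data are assumptions made for illustration.

```python
def squared_distance(x, y):
    """Squared Euclidean distance between two profiles."""
    return sum((a - b) ** 2 for a, b in zip(x, y))

def centroid(points):
    """Component-wise mean of a non-empty list of profiles."""
    dim = len(points[0])
    return [sum(p[d] for p in points) / len(points) for d in range(dim)]

def total_intravariance(profiles, labels):
    """Sum of squared distances of all profiles to their cluster mean."""
    clusters = {}
    for x, label in zip(profiles, labels):
        clusters.setdefault(label, []).append(x)
    total = 0.0
    for members in clusters.values():
        mu = centroid(members)
        total += sum(squared_distance(x, mu) for x in members)
    return total

profiles = [[0.0], [1.0], [10.0], [11.0]]
one_cluster = total_intravariance(profiles, [0, 0, 0, 0])    # k = 1
two_clusters = total_intravariance(profiles, [0, 0, 1, 1])   # k = 2
four_clusters = total_intravariance(profiles, [0, 1, 2, 3])  # k = 4
print(one_cluster, two_clusters, four_clusters)
```

The sum can only decrease as k grows and reaches zero when every profile forms its own cluster, so the smallest value alone is not informative. One common way to read such a plot (an interpretation not stated in the text above) is to look for the k beyond which the sum stops dropping sharply, here k = 2.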

12.9.2 An application of k-means clustering

A k-means clustering was conducted with the data of the so-called Spellman microarray experiment (a yeast cell cycle experiment) [2].

In that experiment a yeast whole-genome expression experiment was conducted in order to test the hypothesis that genes might be regulated in a periodic manner coincident with the cell cycle.

For the clustering here only the cell cycle genes (about 800) of all approx. 6000 yeast genes were taken.

Different k-means clusterings computed using Mayday [3] for k = 4, 6, 8:

[2] Spellman et al., Comprehensive Identification of Cell Cycle-regulated Genes of the Yeast Saccharomyces cerevisiae by Microarray Hybridization. Mol Biol Cell 9, 1998.

[3] Microarray analysis software developed in the group of Dr. Nieselt.


12.9.3 Hierarchical clustering

The result of hierarchical clustering is a set of nested clusters, which can be visualized by means of a tree or dendrogram.

We distinguish two types of hierarchical cluster approaches:

• Bottom-up (agglomerative clustering)

• Top-down (divisive clustering)

The general outline of bottom-up hierarchical clustering:


• Initialisation: each object is one cluster

• Iteration: combine two clusters that have minimal distance

• Termination: one cluster that contains all objects


More detailed description of bottom-up hierarchical clustering:

Algorithm 12.9.2 (Bottom-Up Hierarchical Clustering)

for i = 1 to n do
    c_i = {x_i}
C = {c_1, . . . , c_n}
j = n + 1
while |C| > 1 do
    (c_a, c_b) = argmin over all pairs c_u, c_v in C with u ≠ v of d(c_u, c_v)
    c_j = c_a ∪ c_b
    C = (C − {c_a, c_b}) ∪ {c_j}
    j = j + 1

Popular clustering methods: Depending on how the distances between clusters are defined, this general algorithm gives rise to different concrete methods, including:

• Single Linkage (or Minimum Method, Nearest Neighbor):

d(k, i ∪ j) = min(d(k, i), d(j, k))

• Complete Linkage (or Maximum Method, Furthest Neighbor):

d(k, i ∪ j) = max(d(k, i), d(j, k))

• Average Linkage (UPGMA):

d(k, i ∪ j) = (ni · d(k, i) + nj · d(j, k))/(ni + nj), where ni and nj denote the number of objects in clusters i and j
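The general bottom-up algorithm with the three linkage rules can be sketched as follows. This is an illustrative, deliberately simple implementation: instead of updating distances with the formulas above, it recomputes the distance between two clusters from all pairs of their members, which gives the same values for single, complete and average linkage (UPGMA). Function names and the toy data are assumptions.

```python
import math
from itertools import combinations

def euclidean(x, y):
    """Euclidean distance between two profiles."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def linkage_distance(a, b, points, method):
    """Distance between clusters a and b, given as tuples of point indices."""
    pair_distances = [euclidean(points[i], points[j]) for i in a for j in b]
    if method == "single":
        return min(pair_distances)
    if method == "complete":
        return max(pair_distances)
    if method == "average":
        return sum(pair_distances) / len(pair_distances)
    raise ValueError(f"unknown linkage method: {method}")

def agglomerative(points, method="average"):
    """Return the list of merges as (cluster_a, cluster_b, distance)."""
    clusters = [(i,) for i in range(len(points))]  # each object is one cluster
    merges = []
    while len(clusters) > 1:
        # Find the pair of clusters with minimal linkage distance.
        a, b = min(
            combinations(clusters, 2),
            key=lambda pair: linkage_distance(pair[0], pair[1], points, method),
        )
        merges.append((a, b, linkage_distance(a, b, points, method)))
        # Replace the two clusters by their union.
        clusters = [c for c in clusters if c not in (a, b)] + [a + b]
    return merges

points = [[0.0], [1.0], [3.0], [7.0]]
for method in ("single", "complete", "average"):
    print(method, [d for _, _, d in agglomerative(points, method)])
```

The merge distances are the heights at which the branches join in the dendrogram. The sketch favours readability over speed; recomputing all pairwise distances in every iteration is far slower than the update formulas for large data sets.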


12.9.4 An application of hierarchical clustering

We consider nine genes from the yeast cell cycle experiment.

Result of complete linkage with Euclidean distances:

Result of UPGMA with Euclidean distances:

Result of single linkage with Euclidean distances:


12.9.5 Hierarchical clustering - Example

In the following example not genes but samples were clustered: these samples were from two different types of leukemia (AML and ALL) [4]. The clustering of the experiment profiles was based on the 150 genes that had the highest variance across all samples. It nicely depicts the distinction of the two leukemia types also on the transcriptomic level:

12.10 Visualisation of gene expression data

An important aspect of microarray data analysis is visualization. Visualization tools are primarily used to gain biologically important insights into the data.

There are a number of approaches to the problem of visualizing microarray data, ranging from viewing the raw image data and viewing profiles of genes across experiments to using one of the many scatter plot variants. In this section a short overview of common visualisation methods is given.

[4] From Golub et al., Science 286, 1999.


12.10.1 Box plot

The box plot visualizes a one-dimensional distribution. It is based on 5 numbers of a distribution: minimum, first quartile, median, third quartile and maximum.
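The five numbers behind a box plot can be computed with the Python standard library, as in the following sketch. Note that several conventions for computing quartiles exist; the `inclusive` method of `statistics.quantiles` used here is one possible choice, and the toy values are invented.

```python
import statistics

def five_number_summary(values):
    """Return (minimum, first quartile, median, third quartile, maximum)."""
    q1, median, q3 = statistics.quantiles(values, n=4, method="inclusive")
    return min(values), q1, median, q3, max(values)

# Toy log2 intensities of one array.
log_intensities = [6.0, 7.0, 8.0, 9.0, 10.0, 11.0, 12.0, 13.0, 14.0]
summary = five_number_summary(log_intensities)
print(summary)
```

Computing this summary for every array of an experiment and comparing the results side by side is the numerical counterpart of comparing the box plots of the arrays.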

A box plot is drawn as follows:

Box plots can be used to compare distributions (figure panels: normal, skewed, uniform).

The box plot is especially useful for the comparison of replicated array experiments.

12.10.2 Scatterplot

In a scatterplot one distribution is plotted against another one. Let log(X) and log(Y) denote the log-values of distributions X and Y. Then one plots log(Y) against log(X).

A typical application for dual-channel microarray data is to plot intensity values (log2) of the green channel against those of the red channel.


12.10.3 MA-Plot

In an MA-plot, rather than plotting Y against X or log(Y) against log(X), one plots

$$M = \log(Y/X) = \log(Y) - \log(X)$$

against

$$A = \frac{\log(X) + \log(Y)}{2}$$

For the two channels we thus get

$$M = \log_2(R/G) = \log_2 R - \log_2 G,$$

which is plotted against

$$A = \frac{\log_2 R + \log_2 G}{2}$$

The MA-plot is just the original scatter plot turned 45° clockwise with subsequent scaling. It is especially useful for the detection of intensity-dependent effects in the log-ratio.
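The following sketch computes M and A for a few spots and checks the rotation statement numerically. It assumes the scatter plot has log2 G on the x-axis and log2 R on the y-axis (i.e. X = G and Y = R in the formulas above); the intensity values and function names are made up.

```python
import math

def ma_values(red, green):
    """Return the lists (M, A) for background-corrected intensities."""
    m = [math.log2(r) - math.log2(g) for r, g in zip(red, green)]
    a = [(math.log2(r) + math.log2(g)) / 2 for r, g in zip(red, green)]
    return m, a

def rotate_clockwise_45(x, y):
    """Rotate the point (x, y) by 45 degrees clockwise around the origin."""
    s = math.sqrt(0.5)
    return (x + y) * s, (y - x) * s

red = [1024.0, 256.0, 64.0]
green = [256.0, 256.0, 256.0]
m, a = ma_values(red, green)
print(m, a)

# Rotating (log2 G, log2 R) and rescaling both axes reproduces (A, M).
for r, g, m_i, a_i in zip(red, green, m, a):
    x_rot, y_rot = rotate_clockwise_45(math.log2(g), math.log2(r))
    assert abs(a_i - x_rot / math.sqrt(2)) < 1e-12
    assert abs(m_i - y_rot * math.sqrt(2)) < 1e-12
```

The rotated x-coordinate is divided by √2 to obtain A and the rotated y-coordinate is multiplied by √2 to obtain M, which is the "subsequent scaling" mentioned above.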

Note that the A axis generally covers the range from 0 to 16, while the M (y-) axis is centered around 0 (zero, for equal ratio).


The above example shows the differences in incorporation of the label: here the molecules in the green channel have higher intensities than the respective ones in the red channel.

12.10.4 Heatmap

Heatmaps are among the most popular tools for microarray data visualization (Eisen, 1998). Heatmaps:

• Are also known as intensity or matrix plot

• Represent the data in the form of a table: typically genes are in the rows, experiments in the columns

• Each cell of the matrix is filled with a color representing the logarithmic expression ratio

• Use 3 colors, typically green → black → red
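A possible mapping from a log ratio to a green → black → red color can be sketched as follows. The direction (red for positive, green for negative log ratios), the linear interpolation and the `saturation` parameter are assumptions chosen for illustration; actual tools may use different scales.

```python
def heatmap_color(log_ratio, saturation=2.0):
    """Map a log ratio to an (R, G, B) tuple with components in 0..255.

    Values at or beyond +saturation are pure red, values at or beyond
    -saturation are pure green, and 0 is black.
    """
    # Clip to [-saturation, saturation], then scale to [-1, 1].
    scaled = max(-saturation, min(saturation, log_ratio)) / saturation
    level = round(255 * abs(scaled))
    if scaled > 0:
        return (level, 0, 0)   # positive log ratio: shades of red
    if scaled < 0:
        return (0, level, 0)   # negative log ratio: shades of green
    return (0, 0, 0)           # equal ratio: black

# One row of a heatmap: the colors of one gene across five experiments.
row = [heatmap_color(v) for v in [-3.0, -2.0, 0.0, 2.0, 5.0]]
print(row)
```

Clipping at a saturation value keeps a few extreme log ratios from compressing the color range of all other cells.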

Example:

12.10.5 Profile Plots

Profile plots show the expression profile (along the experiments):


12.10.6 Visualisation of clusters

Both profile plots as well as heatmaps are especially useful for the visualisation after clustering.

Plot either all profiles of each cluster or only the profile of the cluster representative.

Example: here is the result of a hierarchical clustering on the cell cycle experiment shown together with the associated heatmap.

12.11 Summary

Microarrays are used to measure expression levels in cells.

Clustering is used to detect common patterns of expression of genes.

Visualization of expression data is an important tool.

A main topic that we did not cover is how to normalize signals.

New sequencing technologies are poised to replace microarrays in many applications.
