analysis of gene expression. overview genome analysis tells us what genes are present, but before we...

Analysis of Gene Expression

Overview• Genome analysis tells us what genes are present, but before we can

determine the organism’s phenotype, we need to know how those genes are expressed: under what conditions, in what tissues, how much gene product is made, etc.

– Also, understanding and curing diseases is tied to the analysis of what genes are expressed in disease states.

• There is some progress in deciphering gene control signals, but right now it is more profitable to do experiments and use bioinformatic tools to analyze their results. Later, with some luck, we will be able to make better predictions about gene expression just from sequence data.

• What we are going to cover:– Microarray data (messenger RNA = “transcriptome”)– SAGE (mRNA)– 2-dimensional electrophoresis and mess spectrometry (proteins = “proteome”)– Data normalization: picking out signal from noise– Cluster analysis: how to decide which genes are co-expressed

Microarrays• DNA microarrays and DNA chips are essentially the

same thing: a set of DNA molecules attached to a solid substrate in an array of very small spots.

– Affymetrix is a company that sells microarray chips attached to a silicon substrate

– Many microarrays are homemade: DNA spotted onto glass microscope slides

• Microarrays work by hybridization: cDNA made from mRNA is labelled with a fluorescent tag, then hybridized with the array. After washing, only complementary sequences remain bound. A laser scanner excites each spot in turn, and the amount of fluorescence is read. The level of fluorescence is proportional to the amount of mRNA present in the original prep.

• Originally, cDNA from each gene was used to make the array, Later, synthetic oligonucleotides were used, and today, 50-60 nt synthetic oligonucleotides based on the gene sequences seem to be the standard.

• In most cases, RNAs from two different conditions are compared (experimental vs. control). The two cDNAs derived from the RNAs are labelled with Cy3, a green-fluorescing dye, and Cy5, a red-fluorescing dye.

– If the two RNAs are present in equal amounts, you get a yellow spot; otherwise red or green predominates.

SAGE • Serial Analysis of Gene

Expression• The basis of this technique is

that a gene can be uniquely identified using only a small (10-30 nt) piece from the 3’ end (which is not translated)

• These tags are extracted (from cDNA), then concatenated into long molecules that are amplified with PCR (or cloned) and sequenced.

• The number of times each tag appears is proportional to the amount of its mRNA present.

• Much SAGE data in NCBI.

Digital Differential Display

• DDD is based on data from EST experiments. The NCBI UniGene database combines ESTs for each gene separately. The proportion of ESTs from a given gene can be compared between experimental treatments.

• This is obviously limited to well-studied species.

RNA Seq• This is a new method, published in 2008. It is

probably the method of choice today for analyzing RNA content. Also called whole transcriptome shotgun sequencing.

• Very simple: isolate messenger RNA, break it into 200-300 base fragments, reverse transcribe, then perform large scale sequencing using 454, Illumina. Or other massively parallel sequencing technology.

– RNA sequences then compared to genomic sequences to find which gene is expressed and also exon boundaries

– Exon boundaries are a problem with very short reads: you might only have a few bases of overlap to one of the exons.

• As with all RNA methods, which RNAs are present depends on the tissue analyzed and external conditions like environmental stress or disease state.

• Get info on copy number over a much wider range than microarrays. Also detects SNPs.

Two-Dimensional Gel Electrophoresis• 2D gels are a way of separating proteins into

individual spots that can be individually analyzed.

– Proteins are first separated by their isoelectric point and then by their molecular weight.

– Individual protein spots can then be identified using mass spectrometry

• Isoelectric focusing separates proteins by their isoelectric point, the pH at which the net surface charge is zero.

– IEF uses a mixture of ampholytes, chemical compounds that contain both acidic and basic groups.

– When an electric field is applied, the ampholytes move to a position in the gel where their net charge is zero, and in the process they set up a pH gradient.

– Proteins also move down the pH gradient until they reach a pH where they have no net charge, their isoelectric point.

– Isoelectric focusing is thus an equilibrium process: proteins move to a stable position and stay there. (But in practical terms, the gradient breaks down over time).

SDS-PAGE

• SDS-PAGE is a method for separating proteins according to their molecular weight.

– SDS = sodium dodecyl sulfate (a.k.a. sodium lauryl sulfate), a detergent that unfolds proteins and coats them in charged molecules so that their charge to mass ratio is essentially identical.

• “Native” gel electrophoresis uses undenatured proteins, which vary greatly in charge to mass ratio.

– SDS denaturation isn’t perfect: some proteins behave anomalously,

– PAGE = polyacrylamide gel electrophoresis

2D gels• First, isoelectric focusing is performed on a

protein sample, running the proteins through a narrow tube or strip of acrylamide.

• Then the IEF gel is placed on top of an SDS gel, allowing the proteins to be separated by their molecular weight at right angles to the isoelectric point separation.

• Then the gel is stained with a general protein stain such as Coomassie Blue or silver stain.

– A Western blot involves transferring the separated proteins onto a membrane, where specific proteins can be identified by antibody binding.

• A couple of issues:– While a cell might contain up to 100,000 proteins,

at best only 3000 spots can be resolved.– Proteins expressed at a low level (such as

regulatory proteins) don’t show up well: spot size is proportional to the amount present in the cell

– Special techniques are needed for membrane proteins, which aren’t easily solubilized by the usual techniques.

– Comparing spots between 2D gels require image analysis software (and well-standardized running conditions).

What SDS used to mean• Students for a

Democratic Society, a left wing radical group from the 1960’s and early 1970’s. The Weathermen were an even more radical offshoot.

Mass Spectrometry

• The general principle of mass spectrometry is that if you ionize a group of atoms or molecules, you can separate them on the basis of charge to mass ratio, by accelerating them in an electric field in a vacuum.

– The original mass spectrometers were used to separate isotopes, based on slightly different masses.

• During the ionization process, proteins tend to break up in characteristic ways, producing “molecular ions” whose molecular weights can be measured very precisely.

• Since you are generally working with an already sequenced genome, you can predict the size of fragments that will be generated by any gene. Thus you can identify the gene product by matching the actual fragments with list of predicted fragments.

More Mass Spectrometry• For most protein work, the proteins are first

digested into small fragments (say , 5-10 amino acids), separated by HPLC (high performance liquid chromatography), and then run individually through the mass spec.

– Protein sequencing and older protein identification methods also start with proteolytic digestion

– Endopeptidases that digest at known sites are used, such as trypsin (cleaves after Lys or Arg) and chymotrypsin (cleaves after Phe, Trp, or Tyr).

• Ionizing the peptide needs to be done rather gently. One common technique is MALDI (matrix-assisted laser desorption/ionization). The proteins are mixed with the matrix molecules, which efficiently absorb the UV laser energy and encourage ionization of the proteins. When irradiated with the laser, they vaporize along with the protein, but their small size makes them easy to detect and ignore.

• Time-of-flight mass spectrometry is generally used (so the whole thing is MALDI-TOF). The moelcular ions are accelerated in an electric field, and the time it takes them to cross a chamber of known length is proportional to their mass 9actaully, charge to mass ratio). This technique works well for the wide range of sizes seen with peptides.

• Sample comparisons can be done by labeling one sample with a heavy, stable isotope such as 13C or 15N. The samples are mixed before 2D electrophoresis and they co-migrate on the gel. However, mass spec can easily resolve them.

Methods for Microarray Analysis• Microarray data is subject to a lot of potential errors. These fall into

3 main categories: replication, background subtraction, and data normalization.

• Replication of each experimental data point is essential. There is a lot of variation between spot intensities in a typical experiment, especially with home-created microarrays.

• The background fluorescence level needs to be subtracted from all data points. Since the background is not necessarily uniform, this can lead to spots with negative intensities (which can be set to zero).

• Data normalization means attempting to bring the variance of the expression level to a constant value. It has been observed that the variance tends to increase with stronger signals. A way to correct for that is to include a multiplicative error term as well as an additive error term in statistical calculations.

Expression Level Ratios• Most microarray experiments compare

2 conditions, using red and green dyes. Thus each gene sequence gives data that is a ratio of red to green.

– The problem is, when plotted on a regular linear graph, the distance between ½ and 1 is much smaller than the distance between 1 and 2, even though they express the same (but inverse) ratios.

– The solution is to take the base 2 logarithm of the red/green ratio. log2(x) = -log2(1/x), so increases and decreases give similar ranges.

• Similarly, the expression level can be expressed as the geometric mean of the red and green signals: The square root of red times green. However, taking the logarithm of this spreads the data out better.

• Other data manipulations can further improve appearances.

meangeometricRG _

Data Analysis• We now wish to tackle the problem of making sense of

masses of gene expression data. Thousands of genes tested under several conditions, at different time points, all going up and down in expression level: it is just too much to understand at a glance.

• The usual question is whether groups of genes show similar patterns of expression, suggesting a common regulatory pathway.

• The mathematical techniques we will discuss are various forms of cluster analysis, along with some approaches to optimization problems: how can you efficiently find the best solution to a problem when you can’t try all possibilities.

• First, we will look at a useful visualization technique, principal components analysis.

• centroid = the center position, or center of mass

Principal Components Analysis

• In typical microarray experiments, you make several measurements on the expression level of each gene. You would like to know whether you can separate out different groups of genes.

– There is also an issue of whether the data points are really independent of each other.

• PCA is a method of standardizing and rotating data so most of the variance can be plotted in 2 or 3 dimensions.

• I want to credit Michael Palmer of the Oklahoma State University Ecology program for a lot of the discussion below: it’s a hard topic to describe without resorting to matrix algebra.

• In this example, 3 experimental treatments (x1, x2, ad x3) have been examined with 25 or so parameters (genes). You want to know whether there are groups of genes with a common expression pattern among the 3 treatments.

• The first step is to normalize the data, by subtracting the mean and dividing by the standard deviation. This gives every measurement a mean of 0 and a variance of 1. The centroid of the whole data set is now at the origin.

• The data points are still in the same relative positions.

More PCA• The next step is to generate a new primary

axis, which maximizes the variance along its length. This is done by minimizing the sum of the squared distance from each point to this line. The axis goes through the centroid.

• Next, a second axis is drawn at right angles to the first one and also passing through the centroid. It maximizes all the remaining variance along its length.

• This process can be repeated until all variance has been accounted for, if you like.

• However, most PCA is used for data display, so 2 or 3 dimensions are all that is necessary. Usually, the percentage of the total variance that is accounted for by each axis is reported.

• All of this is done with matrix algebra. Specifically, linear combinations of the data are generated using eigenvalues and eigenvectors. But we aren’t going to discuss that here.

• Quite often, different groups of genes stand out.

Cluster Analysis: Distance Measures

• Before we can determine whether genes fall into co-expressed clusters, we need to define a way of quantitating how similarly they behave under different experimental treatments. That is, we need to define a measure of distance.

• A common method is to use the Euclidean distance. Recalling the Pythagorean Theorem, a2+b2=c2, the distance between 2 points in a plane is the square root of (x1-x2)2 + (y1-y2)2. This can be generalized for multiple dimensions, with each experimental treatment considered as a separate dimension.

– The Euclidean distance is easy to calculate, and in very wide usage, but it makes no sense: what do the results of different experiments have to do with distance in multidimensional space?

– Also, you get different distances when ratios are used compared to using log ratios.

N

iBiAiAB XXd

1

2,,

More Distance Measures• Measuring the degree with which two

genes are correlated, whether or not they are expressed to the same level, can be done with the Pearson correlation coefficient.

– For each experiment, the mean and standard deviation are calculated.

– then distances from each data point to the mean are calculated, and divided by the standard deviation.

– Then these numbers are multiplied to get the distance between 2 treatments, and summed over all the genes.

• There are lots more possibilities for measuring distance. It is worth noting that sometimes the distance measure can affect the final results, which should cause a bit of skepticism or anxiety, depending on your relationship with the data.

B

BBiN

i A

AAiAB s

XX

s

XX

Nr ,

1

,

1

1

Clustering Techniques• UPGMA is a commonly used simple

technique. It is an example of a hierarchical clustering technique, in which objects (experimental treatments) are sequentially combined into larger groups. (You can also sequentially partition a large group into successively smaller groups). The results of hierarchical clustering is usually a tree.

– There are a few common variants. Recall that when two groups were combined, UPGMA took the distance between the groups as the average distance between all members. Thus UPGMA is sometimes called average linkage clustering.

– Single linkage clustering makes the distance between 2 clusters as the minimum distance between any two members

– Complete linkage clustering uses the maximum distance between any 2 members.

K-means Clustering• K-means clustering is a partitional clustering technique.

You specify the number of clusters (k) you want, and the data points are divided up so that each point is assigned to the cluster whose centroid is closest.

– This is done by randomly assigning data points to clusters and calculating their centroids.

– Then, each data point’s distance to these centroids is calculated and the data points are re-assigned to the closet centroid.

– This process is repeated until no more changes in cluster assignment occur.

– You can try various values of k and test the results statistically, using a sum of squared distances to the centroids as an error function.

• Self-organizing maps (SOM) work very similarly to this, except that the centroids are linked to each other, and when you move one, they all shift a bit. How much they move depends on how close they are and how far into the clustering process you are: as tings proceed, there is less linkage between the centroids.

– You can visualize this as starting with a set of centroids arranged on a grid, then moving them around to minimize distances between individual data points and their nearest centroid.

Example of Clustering in a Microarray Experiment

• Each row is a gene, with the degree of redness proportional to the degree of expression.

• Genes have been rearranged to match the clusters.

• Clustering was done for genes and also for treatments

• The actual expression profile for one cluster is shown.

Example of Various Expression Profiles

• The data here were clustered using self-organizing maps, set to create 9 clusters.

• Note that some profiles are very similar, implying that it would be worth while to try fewer clusters.

Optimization Procedures• We have seen many techniques where it is impractical to try out all possible solutions.

– In computational complexity theory, it is generally possible to solve problems that work in “polynomial time”. That is, if there are N basic objects, the time and computer resources needed to solve the problem is proportional to N, N2, or N3, etc. (logN also fits in here). These problems are said to have O(N2) (etc.) proportionality.

– On the other hand, some problems grow exponentially with the number of objects to study. They have O(2N) proportionality. These require much more resources to solve, and are generally the source of problems of the “more than all the particles in the Universe” problems.

• This is where optimization techniques are used.

• More theory. Problems that can be solved in polynomial time are P problems. Problems that need exponential time are NP problems. (This is a bit of a generalization).

– There is a class of NP problems that need exponential time to solve, but any proposed solution can be checked in polynomial time. These are NP-complete problems. Most of the nasty bioinformatic problems are of this nature.

– There is a major debate in the computer science community about NP-complete problems: can some clever algorithm be found that will convert them to P problems (Is P = NP?)

– Theory says that if one NP-complete problem is converted to P, they all can be: one of the current Holy Grail issues in computer science.

– There is another class of problems, called NP-hard, which means they require at least as much time and resources to solve as the worst NP-complete problem.

Simulated Annealing• Simulated annealing is one of the simplest ways of optimizing a solution to a large

problem where an exhaustive search won’t work. Imagine the set of all possible solutions as having multiple local maxima. It is easy to get caught in a local maximum and completely miss better solutions.

• It is based on an analogy with slowly cooling a molten material to produce the best possible crystals:

– if you reduce the temperature quickly, many random defects appear as the individual molecules freeze into sub-optimum positions.

– However, if you reduce the temperature slowly, you give the molecules a chance to try many possibilities before settling into the best position. Even if a molecule has settled into place, the overall temperature is so high that it has a definite chance of leaving that position and trying another.

• In simulated annealing, you start with a possible solution and generate a similar, but randomly chosen solution. You replace the old solution with the new solution if it is better than the old one or (and this is very important) if it is only somewhat worse than the old one.

– A worse solution is assigned a probability of replacing the old solution based on how much worse it is and on the current “temperature”

– The process goes through many iterations, with the temperature slowly being lowered.– Note that we are trying to find the minimum energy state. For clustering, this means the cluster

membership that minimizes the sum of squared distances to the cluster centroids.

More Simulated Annealing• What happens is that in the beginning the

temperature is high, and the clusters change almost randomly. As time goes on and the temperature drops, the clusters change less and less.

• The “acceptance function” determines whether an new solution replaces the old one. It is usually based on the Boltzmann distribution, which describes the probability that a gas molecule has a specified energy at a given temperature.

– In practical terms, you first score the old solution and the new solution (i.e. sum of squared distances to the centroid of each cluster), and subtract them, giving the difference Δ.

– If the new solution is better than the old one, replace the old with the new.

– Otherwise, if the new solution is worse than the old one, you calculate the “energy probability” e-Δ/T.

– Choose a random number between 0 and 1. If the energy is greater than this random number, replace the old solution with the new one. If it’s less, keep the old solution.

)1,0(/ Re T

Genetic Algorithm• The genetic algorithm is another technique for finding an optimum

solution in situations where it is impossible to check all possibilities. • It is a subclass of evolutionary algorithms, based on ideas taken

from biology. The idea is that evolution has found solutions to many of these problems by random changes coupled with natural selection. – A few other subtypes, which we won’t otherwise discuss:

• Particle swarm optimization. Based on animal flocking behavior.• Ant colony optimization. Based on ants foraging, communicating by

pheromones, and forming a path.• Invasive weed optimization. Based on weeds finding suitable environments

to grow and reproduce.• Harmony search. Based on how musicians search for better harmonies.

Data Coding• Genetic algorithms start with a population of random solutions to a problem, arranged

in a linear array (i.e. like a chromosome). – Each chromosome is scored according to a fitness function

– These solutions undergo mutations (small changes), selection of individuals to be parents of the next generation, and recombination between pairs of individuals.

– Multiple iterations of this process are performed

– It is terminated when there is convergence on a common solution or after a fixed maximum number of generations.

• Encoding example. We have 10 objects that we wish to put into 3 groups. The 10 objects correspond to positions 1-10 in an array, and which group they are in is designated by the number at that array position.

1 2 3 4 5 6 7 8 9 10

A 1 2 2 1 3 1 3 2 2 1

B 2 3 2 3 1 1 2 3 1 2

C 3 1 1 3 1 2 1 2 1 3

D 1 1 3 1 2 3 1 2 1 2

Parent Selection

• Fitness. For clustering, the fitness function is usually sum of the squares of distances from each member to the centroid (the center position, or center of mass) of the cluster.

– Fitnesses can be converted to relative fitnesses by ranking them, with the chromosome (set of clusters) with the smallest sum of squared distances given a fitness of 1.0 and all others assigned proportionally lower relative fitness values.

• Selection of parents is generally done by favoring the best individuals of the current generation.

– All chromosomes should have a chance of being included in the next generation. A simple way to do this is the “Roulette Wheel” method.

• Assign each chromosome a segment of the wheel proportional to its relative fitness,

• then choose a random number on the wheel and take the chromosome corresponding to it.

– In order to not lose the best of the current generation, you can also directly copy (i.e. clone) the best member(s) of the current generation into the next generation. This is called “elitism”.

Recombination and Mutation• Once 2 parents are chosen, you need to produce offspring from them.

– You can choose to have 1, 2, or more offspring from each pair of parents, but you want to keep the number of chromosomes constant.

• Recombination (crossing over). For the coding method we are using, it is best to think of these chromosomes as circular, because the data points are randomly assigned to groups.

– So, we use 2 crossover points, chosen randomly.– Copy the data from parent 1 up to the first crossover point, then the data from parent 2 up

to the second crossover point, then finish with data from parent 1.• Mutation. After crossing over has generated new offspring, randomly change a

small number of data points form one cluster to another.

• Do this process of selecting parents, generating offspring, and mutating them until enough new generation members have been generated.

• Then, score them all for fitness and repeat the process.

analysis of gene expression. overview genome analysis tells us what genes are present, but before we...

Documents

gene sequences

given gene

gene product

microarrays dna microarrays

sage data

mrna present

coexpressed slide

microarray data messenger