tutorial: rma analysis using the microarray platform website … · tutorial: rma analysis using...

12
Tutorial: RMA Analysis using the Microarray Platform Website I Overview Objective of Tutorial This tutorial provides an introduction to data analysis using a data processing method known as RMA (Robust Multi-array Average). The tutorial outlines how to download data from the website, obtain RMA expression data and perform a simple 2-class comparison using fold change. The case study for the tutorial, described in more detail below, involves nine hybridizations: three conditions measured in triplicate. Concepts Illustrated Data Download - How to obtain RMA expression summary data online and a look at the format of this data. Class Comparison - Designating differentially expressed genes between two groups of samples by calculating the fold change for each gene. Please Note: The first step in any analysis should be a visualization of the data. In other words, array results within and between sample groups should be plotted against each other to look for arrays that stand out. This process is an essential analysis and quality control step. So, before proceeding, you should either have already looked at plots of the data (see the RMA plots description) or should contact the statistical staff at the centre. Case Study Design The motivation for the case study experiments is to study the homing of T-Cells in lung. Affymetrix murine MG-U74vA chips were used to monitor the expression of 12488 genes in three CD8+ T cell populations. All three populations are derived from BALB/c mice, but differ in exposure to anti-CD3 and anti-CD28: naïve cells (0h exposure), 48h exposure, and a HA210-219-specific CD8+ T cell clone. Nine arrays were performed in total: triplicates of each condition. Objective of Analysis Make a two-class comparison between the gene expression patterns in naive and 48 hour antibody exposed CD8+ T cells. Those genes that are differentially expressed between the two samples may play a role in the CD8+ T cell immune response.

Upload: ngoliem

Post on 21-Jun-2018

226 views

Category:

Documents


0 download

TRANSCRIPT

Tutorial: RMA Analysis using the Microarray Platform Website

I Overview

Objective of Tutorial

This tutorial provides an introduction to data analysis using a data processing method

known as RMA (Robust Multi-array Average). The tutorial outlines how to download

data from the website, obtain RMA expression data and perform a simple 2-class

comparison using fold change. The case study for the tutorial, described in more detail

below, involves nine hybridizations: three conditions measured in triplicate.

Concepts Illustrated

• Data Download - How to obtain RMA expression summary data online and a

look at the format of this data.

• Class Comparison - Designating differentially expressed genes between two

groups of samples by calculating the fold change for each gene.

Please Note: The first step in any analysis should be a visualization of the data. In other

words, array results within and between sample groups should be plotted against each

other to look for arrays that stand out. This process is an essential analysis and quality

control step. So, before proceeding, you should either have already looked at plots of the

data (see the RMA plots description) or should contact the statistical staff at the centre.

Case Study Design

The motivation for the case study experiments is to study the homing of T-Cells in lung.

Affymetrix murine MG-U74vA chips were used to monitor the expression of 12488

genes in three CD8+ T cell populations. All three populations are derived from BALB/c

mice, but differ in exposure to anti-CD3 and anti-CD28: naïve cells (0h exposure), 48h

exposure, and a HA210-219-specific CD8+ T cell clone. Nine arrays were performed in

total: triplicates of each condition.

Objective of Analysis

Make a two-class comparison between the gene expression patterns in naive and 48 hour

antibody exposed CD8+ T cells. Those genes that are differentially expressed between

the two samples may play a role in the CD8+ T cell immune response.

Sample Designation

The sample IDs appearing in the experiment and in this tutorial refer to the following

samples:

Sample Name on Project Page Name on Analysis Page

DDO001 DDO001_1_01441A

DDO002 DDO002_2_01441B

Naïve T cells

DDO003 DDO003_3_01441C

DDO004 DDO004_4_01442A

DDO005 DDO005_5_01442B

48hr Stimulated T cells

DDO006 DDO006_6_01442C

DDO007 DDO007_7_01443A

DDO008 DDO008_8_01443B

Antigen-specific T cell

Clone

DDO009 DDO009_9_01443C

References

For RMA references and a comparison to other expression summary methods, see

http://128.32.135.2/users/bolstad/ComputeRMAFAQ/ComputeRMAFAQ.html.

A MIAME description of the experiment and the raw data files can be obtained from:

https://genes.med.virginia.edu/public_data/Klaus_Ley/Klaus_Ley_Mouse%20immune%

20response%20study.html

The sample data described above was published in: Jain et al, Local-pooled-error test for

identifying differentially expressed genes with a small number of replicated microarrays.

Bioinformatics. 2003 Oct 12;19(15):1945-51.

http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?cmd=Retrieve&db=PubMed&list_uids=1

4555628&dopt=Abstract

(Please note: the method of analysis described in this publication is not related to the

RMA method).

Accessing the Tutorial

To start the tutorial, go to the Microarray Platform web page at

http://innovation.mcgill.ca/services/chip.php and click on Secure Web Interface. This

brings up the Gene Expression home page. At this point, follow the instructions in

Section II below.

Disclaimer

Although we in the Microarray Platform lab endeavour to keep tutorials up-to-date, it is

possible that minor changes to the website are not reflected in the screen shots. Hopefully

this will not cause difficulties in using the tutorial.

If you have any questions, please send them to amy.norris[at]mail.mcgill.ca or call 398-

3311 x00335.

II Data Download The first step towards downloading data is to access the project where that data is stored.

The Project page contains all the information for one project and is the gateway to the

Download Page.

To access the Project page:

• Go to the Gene Expression home page at

https://genomequebec.mcgill.ca/mp/home.cgi?logout=en

• Log in to the demo account using the user login demo and password genome.

This brings up the Client Dominique Demo page.

• Click on the project name Demo001 from the Projects list appearing at the

bottom of the page.

On the Project page, each sample’s success at proceeding through all stages of the

microarray hybridization experiment is indicated. These results are summarized in the top

right hand corner of the Project page.

To access the Download page, select Download from the Navigate section of the left-

hand menu.

Array data can be obtained in several different formats from the Download Page:

Excel/Genespring MAS5.0, R formatted MAS5.0, and RMA. These formats describe the

methods of performing background correction, normalization, and expression summary

as well as the output format. To understand what these data processing steps do requires

some knowledge of the Affymetrix GeneChip technology.

Affymetrix arrays

The expression of a target gene is represented by the calculated signal intensity of a probe

set on an Affymetrix GeneChip. Each probe set is comprised of 11 or 16 pairs of probes,

depending on the chip type. A pair of probes consists of a perfect match and a mismatch

25-mer oligonucleotide. The mismatch is identical to the perfect match with the

exception of the middle nucleotide (which is the complementary base to the perfect

match) and is intended to provide a indication of the degree of cross-hybridization for

each probe.

For more information on microarrays, please refer to the FAQ.

RMA

RMA stands for Robust Multiple-array Average. An advantage of this method over the

other expression summaries available for download from the Microarray Platform

website, is that normalization occurs at the probe level (rather than at the probeset level)

across all of the selected hybridizations (rather than for each array individually). For this

reason, it is essential that all chips that are to be included in the subsequent analysis

(usually the entire project) are downloaded together.

Note that the method works best when there are many chips in a project. Therefore, RMA

summaries can’t be downloaded for less than 6 chips from the Microarray Platform

website.

RMA uses a model-based background correction, quantile normalization and a robust

averaging expression summary method. Comparisons to other methods can be found at

the link below. Properties of the RMA summaries include:

• Intensity values range between 4 and 16

• Intensity values are in log (base 2) scale

• Only perfect match intensities are used

• Variance is smaller and relatively stable across the range of intensities

• Fold-changes are underestimated

For more details on RMA, please see Ben Bolstad’s RMA webpage:

http://128.32.135.2/users/bolstad/ComputeRMAFAQ/ComputeRMAFAQ.html.

The RMA output from the Download page is one .csv file that contains the intensity for

each probeset on each chip. The file can be opened in Excel or loaded into another

software program for analysis. A simple example of a fold-change calculation in Excel is

given in Section III.

How to Download

To download the expression summaries for the current project, click Download in the

Navigate section of the left-hand menu.

The default download format is RMA data. If you select another format from the

selection box, a list of chips will appear. With RMA there is no need to select chips,

because RMA normalisation is done on the whole project together. Simply click the

Download Data button.

When the RMA expression summary for the 6 chips is ready, a download notice will

come up. Choose to open the file. The data is stored in a .zip file, which requires a utility

such as (the freely available) WinZip to open.

Extract the .csv file from the .zip file. This file, which contains the RMA expression

intensity values, is a comma-delimited file that can be opened in a spreadsheet program

like Excel or in a text editor.

Annotation

For each Affymetrix probeset ID, the RMA data download file currently has three

annotation columns: Unigene ID, Gene Title, and, Gene Symbol. More annotation data

can be obtained from the Affymetrix Annotations page. To reach this page from the

Download page, click the back button in the browser, then select Affymetrix

Annotations in the Navigate menu.

III A Simple Analysis of RMA Download Data This section describes how to do a simple class comparison between the Naïve samples

and the 48hr treated samples (see Section I). The RMA expression summaries and Excel

are used to calculate the log ratio of expression between the two sets of samples. First the

average expression of each probeset within each treatment group is calculated. Then the

difference between the two averages is found. Because RMA values have been logged

(base 2), this difference is equivalent to the logged fold change:

All probesets with a fold change of 2 and greater are then identified to provide a list of

differentially expressed genes.

The following analysis is described using Excel on a Windows platform. The ideas

behind the analysis, however, are applicable to any operating system or program. Please

note that the example below is slightly outdated (there are only 6 chips in the project).

However, the principles are the same. Simply ignore the extra three chips that are

currently in the demo results.

Calculate Average Group Expression

Select cell 2H, as shown below. Now, select the formula button, fx, beside the data entry

field.

In the Insert Function window, select the Average function. Click OK.

The Function Arguments window appears. Select cells 2B, 2C and 2D. Number 1 in

the Function Arguments window now reads 2B:2D and the Formula result at the

bottom of the window shows the average value of these three cells.

Click OK. Cell 2H now contains the mean value of probeset 100001_at over the three

Naïve Tcell samples.

Now, place the cursor over the small square in the lower right-hand corner of 2H. The

cursor should become a small plus sign. Double click on this spot. Each row of column H

is filled with the average value of columns B-D in the same row. Type “Avg. Naïve” at

the top of column H.

Repeat the same process, in column I this time, to calculate the average of the 48hr

stimulated TCells (columns E-G).

Calculate Fold Change

The log fold change between the two conditions can be calculated by subtracting the

value in column H from that in column I for every probeset.

Select J2. Type “=I2-H2”, then Enter. This is an example of how to define a formula

without using the function building windows. The formula means “subtract the value in

H2 from the value in I2”.

The difference between the average 48hr treated intensity and average Naïve intensity has

now been calculated for probeset 10001_at. To calculate the fold change for all the other

probesets, first make sure that J2 is selected, then double-click on the lower right-hand

corner black square.

Type in “log2 fold-change” as the title of column J. Note that these fold changes are in

log2 scale, that is, a value of 1 in this column indicates a fold-change of 2 (2^1) and a

value of -1 indicates a fold-change of ½ (2^{-1}). The former implies that the probeset is

induced by 2-fold in the 48hr treated samples, while the latter implies that the probeset is

repressed 2-fold in the 48hr treated samples.

Identify Differentially Expressed Genes

Finally, it is useful to find all the probesets that are induced or repressed by a given fold

change. For this, a conditional statement is required.

Select 2K and click the formula button to bring up the formula window. Select the IF

function and click OK. In the Function Arguments window, enter “OR(J2 < -1, J2 >

1)”. Enter a value of 1 in the Value_if_true field and 0 in the Value_if_false field. Take

a moment to look at this function and try to decipher its meaning.

Our function specifies that if the value in J2 is either less than -1 or greater than 1, then

the reference cell should take on the value of 1, otherwise, it should have a value of 0.

Click OK. K2 now has a value of 0, since -0.90466 is neither less than -1 nor greater than

1. Click on the small square in the corner of this cell to repeat the formula for all cells in

the column.

Type “fold-change > 2?” as the header for column K. This column now contains a 1 for

all probesets that show a fold-change greater than 2 (either up or down with respect to the

48hr treated samples) and 0 otherwise.

At this point, many actions can be taken to investigate the results. For instance, the data

can be sorted by the results in column K in order to list of all the differentially expressed

genes together. To accomplish this, select Sort from the Data menu. In the Sort window,

choose fold-change > 2 as the Sort by column and Descending as the order.

Click OK. The first 1263 probesets in the list are now those that show a fold-change

greater than 2.

Feel free to try other fold-change limits as well. With RMA summaries, a fold-change

limit of 1.5 is reasonable (remember, with RMA data the fold-change is generally an

underestimate of the fold-change found with, for instance, RT-PCR). To search for a fold

change of 1.5, the function is “OR(J2 < -0.585, J2 > 0.585)”, since log2(1.5) is

approximately 0.585.

IV Concluding Remarks Thank you for taking the time to go through the Microarray Platform RMA tutorial. At

this point you can download RMA results and perform a simple class comparison in

Excel that results in a list of differentially expressed genes.

If you have any questions or comments, please feel free to contact André Ponton.

Good luck!