    6 Easy Differential Expression

    F. Hahne and W. Huber

    Abstract

    In this short exercise, we explore the most basic approach to the selection of differentially expressed genes between two classes: first, a nonspecific filtering step to remove probes for genes that appear not to be expressed; second, a probe-by-probe statistical test; and third, multiple testing correction. There are many variations and improvements to the procedure shown here, and you can learn more about these in Chapter 7.

    6.1 Example data

    For this chapter, we use the ALL data, which have been obtained in a microarray study of B- and T-cell leukemia. We want to find genes that are differentially expressed between two distinct types of B-cell leukemia.

    > library("Biobase")

    > library("genefilter")

    > library("ALL")

    > data("ALL")

    The data and the following steps with which we construct the subset of interest, ALL_bcrneg, are described in more detail in Chapter 1. Briefly, we select samples from B-cell lymphomas harboring the BCR/ABL translocation and from lymphomas with no observed cytogenetic abnormalities (NEG).

    > bcell = grep("^B", as.character(ALL$BT))

    > moltyp = which(as.character(ALL$mol.biol)

    %in% c("NEG", "BCR/ABL"))

    > ALL_bcrneg = ALL[, intersect(bcell, moltyp)]

    > ALL_bcrneg$mol.biol = factor(ALL_bcrneg$mol.biol)

    The last line in the code above is used to drop unused levels of the factor variable mol.biol.

    F. Hahne et al., Bioconductor Case Studies, DOI: 10.1007/978-0-387-77240-0_6, Springer Science+Business Media, LLC 2008

    6.2 Nonspecific filtering

    Between these two groups we should be able to detect substantial differences in gene expression. But first let us explore how nonspecific filtering can improve our analysis. To this end, we calculate the overall variability across arrays of each probe set, regardless of the sample labels. For this, we use the function rowSds, which calculates the standard deviation for each row. A reasonable alternative would be to calculate the interquartile range (IQR), for which we could employ the rowQ function from the genefilter package.

    > library("genefilter")

    > sds = rowSds(exprs(ALL_bcrneg))

    > sh = shorth(sds)

    > sh

    [1] 0.242

    We can plot the histogram of the distribution of sds; see Figure 6.1. The function shorth calculates the midpoint of the shorth (the shortest interval containing half of the data), and is in many cases a reasonable estimator of the peak of a distribution. Its value 0.242 is drawn as a dashed vertical line in Figure 6.1.

    > hist(sds, breaks=50, col="mistyrose", xlab="standard deviation")

    > abline(v=sh, col="blue", lwd=3, lty=2)
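    To make the definition of the shorth concrete, here is a minimal base R sketch. The function name shorth_midpoint and the toy vector x are our own illustrative assumptions. This is not the genefilter implementation, whose details (for example, the handling of ties) may differ.

```r
# Hypothetical helper: midpoint of the shortest interval that contains
# half of the data points. A simplified sketch, not genefilter::shorth.
shorth_midpoint <- function(x) {
  x <- sort(x[!is.na(x)])
  n <- length(x)
  k <- ceiling(n / 2)                      # points the interval must contain
  starts <- seq_len(n - k + 1)             # candidate left end points
  widths <- x[starts + k - 1] - x[starts]  # width of each candidate interval
  i <- which.min(widths)                   # shortest one
  (x[i] + x[i + k - 1]) / 2
}

# Toy data (made up): most values cluster near 0.2, with a long right tail.
x <- c(0.1, 0.2, 0.21, 0.22, 0.23, 0.9, 1.5, 3.0)
shorth_midpoint(x)
```

    Because the estimate depends only on the densest half of the data, the long right tail of the toy vector has no influence on it, which is the property we rely on when locating the peak of the sds distribution.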

    There are a large number of probe sets with very low variability. We can safely assume that we will not be able to infer differential expression for their target genes. The target genes of these probe sets may not be expressed in the samples, or the probe sets may lack the sensitivity to detect expression. Hence, let us discard those probe sets whose standard deviation is below the value of sh.

    > ALLsfilt = ALL_bcrneg[sds>=sh, ]

    > dim(exprs(ALLsfilt))

    [1] 8812 79
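    The filtering step itself does not depend on the ExpressionSet machinery. As a hedged sketch on simulated data, the same idea in base R might look as follows. The matrix m, the seed, and the use of the median as a stand-in cutoff are assumptions made for illustration and are not part of the ALL analysis. Here apply(m, 1, sd) plays the role of rowSds, and apply(m, 1, IQR) gives the interquartile range alternative.

```r
set.seed(1)

# Hypothetical data: 200 probe sets (rows) measured on 10 arrays (columns).
m <- matrix(rnorm(200 * 10), nrow = 200,
            dimnames = list(paste0("probe", 1:200), paste0("array", 1:10)))

# Per-row variability, computed without looking at any sample labels.
sds  <- apply(m, 1, sd)
iqrs <- apply(m, 1, IQR)   # alternative variability measure

# Stand-in cutoff (assumption): the median instead of the shorth.
cutoff <- median(sds)

# Keep only the rows whose standard deviation reaches the cutoff.
keep  <- sds >= cutoff
mfilt <- m[keep, , drop = FALSE]
dim(mfilt)
```

    Note that the class labels are never used here, which is what makes the filter nonspecific.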


    Figure 6.1. Histogram of sds.

    A related approach would be to discard all probe sets with consistently low expression values. The idea is similar: those probe sets most likely match transcripts whose expression we cannot detect anyway, and hence we need not test them for differential expression.

    A more comprehensive approach to nonspecific filtering of probe sets according to various criteria is provided by the function nsFilter from the Category package. That function's documentation, as well as an application of it in Chapter 1, are further references on this topic.

    To summarize, nonspecific filtering uses the biological knowledge that there exists a substantial fraction of probe sets in a microarray experiment that is not informative, either because the target gene is not expressed, or because the probe set lacks sensitivity. Using this knowledge in the analysis will, in general, improve the quality of the gene selection.

    6.3 Differential expression

    We can now perform probe-by-probe tests for differential expression (Dudoit et al., 2002). The function rowttests can deal with ExpressionSets. It uses the t-test, row by row, to detect significant differences in the location of the distribution of expression data of two groups of samples defined by a factor variable. In this case, we use the information about BCR/ABL mutation status in the column mol.biol of ALLsfilt's sample annotation as a grouping factor.


    > table(ALLsfilt$mol.biol)

    BCR/ABL NEG

    37 42

    > tt = rowttests(ALLsfilt, "mol.biol")

    > names(tt)

    [1] "statistic" "dm" "p.value"
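    To illustrate what a row-by-row t-test computes, the following base R sketch applies t.test to each row of a small matrix. The function row_t, the toy matrix, and the group labels are hypothetical. We assume a two-sample test with pooled variance (var.equal = TRUE), and we define dm here as the mean of the first factor level minus the mean of the second. This is a slow illustrative stand-in, not the rowttests implementation.

```r
# Hypothetical helper: two-sample t-test for every row of a matrix.
row_t <- function(m, fac) {
  fac <- factor(fac)
  stopifnot(nlevels(fac) == 2, length(fac) == ncol(m))
  res <- t(apply(m, 1, function(y) {
    tst <- t.test(y[fac == levels(fac)[1]],
                  y[fac == levels(fac)[2]],
                  var.equal = TRUE)          # pooled variance (assumption)
    c(statistic = unname(tst$statistic),
      dm        = unname(tst$estimate[1] - tst$estimate[2]),
      p.value   = tst$p.value)
  }))
  as.data.frame(res)
}

# Toy data (made up): one clearly shifted row and one row without any shift.
m <- rbind(shifted = c(1, 2, 3, 4, 5, 11, 12, 13, 14, 15),
           flat    = c(1, 2, 3, 4, 5,  1,  2,  3,  4,  5))
fac <- rep(c("A", "B"), each = 5)

res <- row_t(m, fac)
res
```

    The shifted row yields a large absolute t-statistic and a small p-value, whereas the flat row has a difference of means of zero.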

    Take a look at the histogram of the resulting p-values in the left panel of Figure 6.2.

    > hist(tt$p.value, breaks=50, col="mistyrose", xlab="p-value",

    main="Retained")

    We see a number of probe sets with very low p-values (which correspond to differentially expressed genes) and a whole range of insignificant p-values. This is more or less what we would expect. The expression of the majority of genes is not significantly shifted by the BCR/ABL mutation. To make sure that the nonspecific filtering did not throw away an undue amount of promising candidates, let us take a look at the p-values for those probe sets that we filtered out before. We can compute t-statistics for them as well and plot the histogram of p-values (right panel of Figure 6.2):

    > ALLsrest = ALL_bcrneg[sds<sh, ]

    > ttrest = rowttests(ALLsrest, "mol.biol")

    > hist(ttrest$p.value, breaks=50, col="lightblue",

    xlab="p-value", main="Removed")

    Figure 6.2. Histograms of p-values. The left panel ("Retained") shows those p-values retained after nonspecific filtering; the right panel ("Removed") shows those that were removed. In both panels the x-axis is the p-value and the y-axis is the frequency.


    Exercise 6.1

    Comment on the plot; do you think that the nonspecific filtering was appropriate?

    6.4 Multiple testing correction

    We use the p-values for ranking genes, and do not advocate interpreting them as true probabilities. Nevertheless, the results of a multiple testing adjustment can be informative for choosing selection cut-offs. Typically, in the setting of a single statistical test we consider the data as providing evidence against a given null hypothesis when it is sufficiently improbable that these data arise by chance if the null hypothesis is true. When repeatedly doing tests, we need to raise the bar for what we consider sufficiently improbable.

    For example, if we do 8812 tests of a null hypothesis that is actually true, using a significance level of 5%, then in 5% of the cases, that is, in about 441 tests, we can expect to reject the null hypothesis just by chance. Many approaches have been proposed to address this problem (Pollard et al., 2005); here we just discuss one that appears to be appropriate in many microarray-related contexts: the false discovery rate (FDR), that is, the expected proportion of false positives among the genes that are called differentially expressed. The procedure of Benjamini and Hochberg is implemented in the multtest package and we use the function mt.rawp2adjp for this purpose. (Note that a more formal treatment would need to take into account the multiple t-tests as well as the implicit testing of the nonspecific filtering.)

    > library("multtest")

    > mt = mt.rawp2adjp(tt$p.value, proc="BH")
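    The arithmetic behind the expected number of false positives, and the Benjamini-Hochberg adjustment itself, can be checked in base R. The sketch below uses p.adjust from the stats package rather than multtest. The p-values, the probe identifiers, and the helper bh_manual are made up for illustration.

```r
# Expected number of false rejections when all null hypotheses are true.
n_tests <- 8812
alpha   <- 0.05
expected_false_pos <- n_tests * alpha
expected_false_pos

# Toy raw p-values (made up) with hypothetical probe identifiers.
p   <- c(0.041, 0.001, 0.216, 0.008, 0.039, 0.205, 0.042, 0.060, 0.074, 0.212)
ids <- paste0("probe", seq_along(p))

# Hypothetical helper: Benjamini-Hochberg adjusted p-values by hand.
bh_manual <- function(p) {
  n  <- length(p)
  o  <- order(p, decreasing = TRUE)   # largest p-value first
  ro <- order(o)                      # permutation back to the input order
  # Scale the i-th smallest p-value by n / i, then enforce monotonicity.
  pmin(1, cummin(n / (n:1) * p[o]))[ro]
}

adj <- p.adjust(p, method = "BH")
all.equal(bh_manual(p), adj)

# The adjustment is monotone in p, so ranking by raw p-value gives the
# same order (up to ties) as ranking by adjusted p-value.
top3 <- ids[order(p)[1:3]]
top3
```

    The adjusted values are never smaller than the raw ones, which is how the procedure raises the bar when many tests are performed.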

    Finally, we can use the results of the t-tests to create a gene list containing the ten highest-ranking genes with respect to the adjusted p-value,

    > g = featureNames(ALLsfilt)[mt$index[1:10]]

    print their gene symbols,

    > library("hgu95av2.db")

    > links(hgu95av2SYMBOL[g])

         probe_id  symbol
    1     1635_at    ABL1
    2   1636_g_at    ABL1
    3     1674_at    YES1
    4    32434_at  MARCKS
    5    37015_at ALDH1A1
    6    37027_at   AHNAK
    7    39730_at    ABL1
    8  39837_s_at  ZNF467
    9    40202_at    KLF9
    10   40504_at    PON2

    and plot the data of the first one together with symbols indicating the value of the mol.biol variable:

    > mb = ALLsfilt$mol.biol

    > y = exprs(ALLsfilt)[g[1],]

    > ord = order(mb)

    > plot(y[ord], pch=c(1,16)[mb[ord]],col=c("black", "red")[mb[ord]],

    main=g[1], ylab=expression(log[2]~intensity),

    xlab="samples")

    The result is shown in Figure 6.3.

    Figure 6.3. The ALLsfilt data for the top differentially expressed probe set (plot title: 1636_g_at) across the 79 samples. The value of the mol.biol variable is indicated by the plot symbols. The x-axis shows the samples and the y-axis the log2 intensity.