
The R Journal
Volume 2/1, June 2010

A peer-reviewed, open-access publication of the R Foundation for Statistical Computing

Contents

Editorial ... 3

Contributed Research Articles

IsoGene: An R Package for Analyzing Dose-response Studies in Microarray Experiments ... 5
MCMC for Generalized Linear Mixed Models with glmmBUGS ... 13
Mapping and Measuring Country Shapes ... 18
tmvtnorm: A Package for the Truncated Multivariate Normal Distribution ... 25
neuralnet: Training of Neural Networks ... 30
glmperm: A Permutation of Regressor Residuals Test for Inference in Generalized Linear Models ... 39
Online Reproducible Research: An Application to Multivariate Analysis of Bacterial DNA Fingerprint Data ... 44
Two-sided Exact Tests and Matching Confidence Intervals for Discrete Data ... 53

Book Reviews

A Beginner’s Guide to R ... 59

News and Notes

Conference Review: The 2nd Chinese R Conference ... 60
Introducing NppToR: R Interaction for Notepad++ ... 62
Changes in R 2.10.1–2.11.1 ... 64
Changes on CRAN ... 72
News from the Bioconductor Project ... 85
R Foundation News ... 86


The Journal is a peer-reviewed publication of the R Foundation for Statistical Computing. Communications regarding this publication should be addressed to the editors. All articles are copyrighted by the respective authors.

Prospective authors will find detailed and up-to-date submission instructions on the Journal’s homepage.

Editor-in-Chief:
Peter Dalgaard
Center for Statistics
Copenhagen Business School
Solbjerg Plads 3
2000 Frederiksberg
Denmark

Editorial Board:
Vince Carey, Martyn Plummer, and Heather Turner.

Editor Programmer’s Niche:
Bill Venables

Editor Help Desk:
Uwe Ligges

Editor Book Reviews:
G. Jay Kerns
Department of Mathematics and Statistics
Youngstown State University
Youngstown, Ohio 44555-0002
[email protected]

R Journal Homepage:
http://journal.r-project.org/

Email of editors and editorial board:
[email protected]


Editorial

by Peter Dalgaard

Welcome to the 1st issue of the 2nd volume of The R Journal.

I am writing this after returning from the NORDSTAT 2010 conference on mathematical statistics in Voss, Norway, followed by co-teaching a course on Statistical Practice in Epidemiology in Tartu, Estonia.

In Voss, I had the honour of giving the opening lecture entitled “R: a success story with challenges”. I shall spare you the challenges here, but as part of the talk, I described the amazing success of R, and a show of hands in the audience revealed that only about 10% of the audience was not familiar with R. I also got to talk about the general role of free software in science, and I think my suggestion that closed-source software is “like a mathematician hiding his proofs” was taken quite well.

R 2.11.1 came out recently. The 2.11.x series displays the usual large number of additions and corrections to R, but if a single major landmark is to be pointed out, it must be the availability of a 64-bit version for Windows. This has certainly been long awaited, but it was held back by the lack of success with a free-software 64-bit toolchain (a port using a commercial toolchain was released by REvolution Computing in 2009), despite attempts since 2007. On January 4th this year, however, Gong Yu sent a message to the R-devel mailing list that he had succeeded in building R using a version of the MinGW-w64 tools. On January 9th, Brian Ripley reported that he was now able to build a version as well. During Winter and Spring this developed into almost full-blown platform support in time for the release of R 2.11.0 in April. Thanks go to Gong Yu and the “R Windows Trojka”, Brian Ripley, Duncan Murdoch, and Uwe Ligges, but the groundwork by the MinGW-w64 team should also be emphasized. The MinGW-w64 team leader, Kai Tietz, was also very helpful in the porting process.

The transition from R News to The R Journal was always about enhancing the journal’s scientific credibility, with the strategic goal of allowing researchers, especially young researchers, due credit for their work within computational statistics. The R Journal is now entering a consolidation phase, with a view to becoming a “listed journal”. To do so, we need to show that we have a solid scientific standing with good editorial standards, giving submissions fair treatment and being able to publish on time. Among other things, this has taught us the concept of the “healthy backlog”: You should not publish so quickly that there might be nothing to publish for the next issue!

We are still aiming at being a relatively fast-track publication, but it may be too much to promise publication of even uncontentious papers within the next two issues. The fact that we now require two reviewers on each submission is also bound to cause some delay.

Another obstacle to timely publication is that the entire work of producing a new issue is in the hands of the editorial board, and they are generally four quite busy people. It is not good if a submission turns out to require major copy editing of its LaTeX markup, and there is a new policy in place to require up-front submission of LaTeX sources and figures. For one thing, this allows reviewers to advise on the LaTeX if they can, but primarily it gives the editors more time to make sure that an accepted paper is in a state where it requires minimal copy editing before publication. We are now able to enlist student assistance to help with this. Longer-term, I hope that it will be possible to establish a front desk to handle submissions.

Finally, I would like to welcome our new Book Review editor, Jay Kerns. The first book review appears in this issue and several more are waiting in the wings.

CONTRIBUTED RESEARCH ARTICLES

IsoGene: An R Package for Analyzing Dose-response Studies in Microarray Experiments

by Setia Pramana, Dan Lin, Philippe Haldermans, Ziv Shkedy, Tobias Verbeke, Hinrich Göhlmann, An De Bondt, Willem Talloen and Luc Bijnens.

Abstract IsoGene is an R package for the analysis of dose-response microarray experiments to identify genes or subsets of genes with a monotone relationship between the gene expression and the doses. Several testing procedures (i.e., the likelihood ratio test, Williams, Marcus, the M, and the modified M) that take into account the order restriction of the means with respect to the increasing doses are implemented in the package. The inference is based on resampling methods, both permutations and the Significance Analysis of Microarrays (SAM).

Introduction

The exploration of dose-response relationships is important in drug discovery in the pharmaceutical industry. The response in this type of study can be either the efficacy of a treatment or the risk associated with exposure to a treatment. Primary concerns of such studies include establishing that a treatment has some effect and selecting a dose or doses that appear efficacious and safe (Pinheiro et al., 2006). In recent years, dose-response studies have been integrated with microarray technologies (Lin et al., 2010). Within the microarray setting, the response is gene expression measured at a certain dose level. The aim of such a study is usually to identify a subset of genes with expression levels that change with the experimented dose levels.

One of four main questions formulated in dose-response studies by Ruberg (1995a, 1995b) and Chuang-Stein and Agresti (1997) is whether there is any evidence of a drug effect. To answer this question, the null hypothesis of homogeneity of means (no dose effect) is tested against ordered alternatives. Lin et al. (2007, 2010) discussed several testing procedures used in dose-response studies of microarray experiments. Testing procedures which take into account the order restriction of means with respect to the increasing doses include Williams (Williams, 1971 and 1972), Marcus (Marcus, 1976), the likelihood ratio test (Barlow et al., 1972, and Robertson et al., 1988), the M statistic (Hu et al., 2005) and the modified M statistic (Lin et al., 2007).

To carry out the analysis of dose-response microarray experiments discussed by Lin et al. (2007, 2010), an R package called IsoGene has been developed. The IsoGene package implements the testing procedures described by Lin et al. (2007) to identify a subset of genes where a monotone relationship between gene expression and dose can be detected. In this package, the inference is based on resampling methods, both permutations (Ge et al., 2003) and the Significance Analysis of Microarrays (SAM, Tusher et al., 2001). To control the False Discovery Rate (FDR), the Benjamini-Hochberg (BH) procedure (Benjamini and Hochberg, 1995) is implemented.

This paper introduces the IsoGene package with background information about the methodology used for analysis and its main functions. Illustrative examples of analysis using this package are also provided.

Testing for Trend in Dose-response Microarray Experiments

In a microarray experiment, for each gene the following ANOVA model is considered:

Yij = µ(di) + εij,  i = 0,1,...,K,  j = 1,2,...,ni,  (1)

where Yij is the jth gene expression at the ith dose level, di (i = 0,1,...,K) are the K+1 dose levels, µ(di) is the mean gene expression at each dose level, and εij ∼ N(0, σ²). The dose levels d0,...,dK are strictly increasing.

The null hypothesis of homogeneity of means (no dose effect) is given by

H0 : µ(d0) = µ(d1) = · · · = µ(dK),  (2)

where µ(di) is the mean response at dose di, i = 0,...,K, with i = 0 indicating the control. The alternative hypotheses under the assumption of monotone increasing and decreasing trends of the means are specified, respectively, by

H1Up : µ(d0) ≤ µ(d1) ≤ · · · ≤ µ(dK),  (3)

H1Down : µ(d0) ≥ µ(d1) ≥ · · · ≥ µ(dK).  (4)

For testing H0 against H1Up or H1Down, estimation of the means under both the null and the alternative hypotheses is required. Under the null hypothesis, the estimator for the mean response µ is the overall sample mean. Let µ*0, µ*1, ..., µ*K be the maximum likelihood estimates of the means (at each dose level) under the order-restricted alternatives. Barlow et al. (1972) and Robertson et al. (1988) showed that µ*0, µ*1, ..., µ*K are given by the isotonic regression of the observed means.


In order to test H0 against H1Up or H1Down, Lin et al. (2007) discussed five testing procedures, shown in Figure 1. The likelihood ratio test (E²01) (Bartholomew, 1961, Barlow et al., 1972, and Robertson et al., 1988) uses the ratio between the variance under the null hypothesis (σ²H0) and the variance under the order-restricted alternatives (σ²H1):

Λ01^(2/N) = σ²H1 / σ²H0 = ∑ij (yij − µ*i)² / ∑ij (yij − µ̄)².  (5)

Here, µ̄ = ∑ij yij / ∑i ni is the overall mean and µ*i is the isotonic mean of the ith dose level. The null hypothesis is rejected for a "small" value of Λ01^(2/N). Hence, H0 is rejected for a large value of E²01 = 1 − Λ01^(2/N).

Test statistic                Formula
Likelihood Ratio Test (LRT)   E²01 = [∑ij(yij − µ̄)² − ∑ij(yij − µ*i)²] / ∑ij(yij − µ̄)²
Williams                      t = (µ*K − ȳ0)/s
Marcus                        t = (µ*K − µ*0)/s
M                             M = (µ*K − µ*0)/s̃
Modified M (M′)               M′ = (µ*K − µ*0)/s̃′

Figure 1: Test statistics for the trend test, where s = √(2 ∑i ∑j (yij − ȳi)²/(ni(n − K))), s̃ = √(∑i ∑j (yij − µ*i)²/(n − K)), s̃′ = √(∑i ∑j (yij − µ*i)²/(n − I)), the sums run over i = 0,...,K and j = 1,...,ni, and I is the unique number of isotonic means.

Williams (1971, 1972) proposed a step-down procedure to test for the dose effect. The tests are performed sequentially, from the comparison between the isotonic mean of the highest dose and the sample mean of the control down to the comparison between the isotonic mean of the lowest dose and the sample mean of the control. The procedure stops at the dose level where the null hypothesis (of no dose effect) is not rejected. The test statistic is shown in Figure 1, where ȳ0 is the sample mean at the first dose level (control), µ*i is the estimate for the mean at the ith dose level under the ordered alternative, ni is the number of replications at each dose level, and s² is an estimate of the variance. A few years later, Marcus (1976) proposed a modification to Williams’ procedure by replacing the sample mean of the control dose (ȳ0) with the isotonic mean of the control dose (µ*0).

Hu et al. (2005) proposed a test statistic (denoted by M) that is similar to Marcus’ statistic, but with the standard error estimator calculated under the ordered alternatives. This is in contrast to Williams’ and Marcus’ approaches, which use the within-group sum of squares under the unrestricted means. Moreover, Hu et al. (2005) used n − K as the degrees of freedom. However, the unique number of isotonic means is not fixed, but changes across genes. For that reason, Lin et al. (2007) proposed a modification to the standard error estimator used in the M statistic, taking n − I as the degrees of freedom, where I is the unique number of isotonic means for a given gene. The five test statistics discussed above are based on the isotonic regression of the observed means. Estimation of the isotonic regression requires knowledge of the direction of the trend (increasing/decreasing). In practice, the direction is not known in advance. Following Barlow et al. (1972), the IsoGene package calculates the likelihood of the isotonic trend for both directions and chooses the direction for which the likelihood is maximized.
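To make the mechanics concrete, the following is a minimal sketch of how the isotonic means enter the E²01 statistic for a single gene, using the pava() function from the Iso package (a dependency of IsoGene). The helper E2.single() is illustrative only and not part of the package, which performs this computation internally (e.g., in IsoGene1()):

library(Iso)   # pava() implements the pooled-adjacent-violators algorithm

E2.single <- function(y, dose) {
  ybar <- tapply(y, dose, mean)    # sample mean at each dose level
  ni   <- tapply(y, dose, length)  # replicates at each dose level
  # Isotonic means for an increasing and a decreasing trend; the
  # decreasing fit reuses pava() on the reversed means
  mu.up   <- pava(ybar, w = ni)
  mu.down <- rev(pava(rev(ybar), w = rev(ni)))
  # Residual sum of squares of the observations around a set of means
  idx <- match(dose, sort(unique(dose)))
  rss <- function(mu) sum((y - mu[idx])^2)
  # Choose the direction with the larger restricted likelihood,
  # i.e. the smaller residual sum of squares
  mu.star <- if (rss(mu.up) <= rss(mu.down)) mu.up else mu.down
  1 - rss(mu.star) / sum((y - mean(y))^2)  # E2_01 = 1 - Lambda_01^(2/N)
}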

The Significance Analysis of Microarrays procedure (SAM, Tusher et al., 2001) can also be adapted to the five test statistics described above. The generic algorithm of SAM discussed by Chu et al. (2001) is implemented in this package. For the t-type test statistics (i.e., Williams, Marcus, the M, and the M′), a fudge factor is added in the standard error of the mean difference. For example, the SAM regularized test statistic for the M′ is given by

M′SAM = (µ*K − µ*0) / (s̃′ + s0),  (6)

where s0 is the fudge factor, estimated from the percentiles of the standard errors of all genes which minimize the Coefficient of Variation (CV) of the Median Absolute Deviation (MAD) of the SAM regularized test statistics. For the F-type test statistic, such as E²01, the SAM regularized test statistic is defined by

E²01,SAM = √(σ²H0 − σ²H1) / (√σ²H0 + s0).  (7)

Multiplicity

In this study, a gene-by-gene analysis is carried out. When many hypotheses are tested, the probability of making a type I error increases sharply with the number of hypotheses tested. Hence, multiplicity adjustments need to be performed. Dudoit et al. (2003) and Dudoit and van der Laan (2008) provided extensive discussions of multiple adjustment procedures in genomics. Lin et al. (2007) compared several approaches for multiplicity adjustment and showed their application in dose-response microarray experiments. In the IsoGene package the inference is based on resampling-based methods. The Benjamini-Hochberg (BH) procedure is used to control the FDR; we use the definition of FDR from Benjamini and Hochberg (1995). The Significance Analysis of Microarrays (SAM, Tusher et al., 2001) approach is also considered, which estimates the FDR by using permutations, where the FDR is computed as the median of the number of falsely called genes divided by the number of genes called significant.
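For intuition, the BH step-up adjustment itself can be illustrated on a toy vector of raw p-values with the standard stats::p.adjust() function; within IsoGene the adjustment is wrapped by the IsoTestBH() function described below:

rawp <- c(0.001, 0.010, 0.020, 0.040, 0.300, 0.500)  # toy raw p-values
adjp <- p.adjust(rawp, method = "BH")                # BH step-up adjustment
which(adjp <= 0.05)                # genes declared significant at FDR 0.05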

Introduction to the IsoGene Package

The IsoGene package provides the estimates and p-values of the five test statistics for testing for monotone trends discussed in the previous section. The p-values are calculated based on a resampling procedure in which the distribution of the statistic under the null hypothesis is approximated using permutations.
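The principle can be sketched for a single gene: permute the dose labels, recompute the statistic, and take the proportion of permuted statistics at least as large as the observed one. The sketch below reuses the illustrative helper E2.single() defined earlier (not a package function); IsoRawp(), introduced next, implements this for all five statistics and all genes at once:

perm.pvalue <- function(y, dose, niter = 1000) {
  obs  <- E2.single(y, dose)                           # observed statistic
  perm <- replicate(niter, E2.single(y, sample(dose))) # permuted dose labels
  mean(perm >= obs)             # proportion at least as large as observed
}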

The main functions of the IsoGene package are IsoRawp() and IsoTestBH(), which calculate the raw p-values using permutations and adjust them using the Benjamini-Hochberg (BH-FDR, Benjamini and Hochberg, 1995) and Benjamini-Yekutieli (BY-FDR, Benjamini and Yekutieli, 2001) procedures. The IsoGene package also implements the Significance Analysis of Microarrays (SAM) via the function IsoTestSAM().

Function          Description
IsoRawp()         Calculates raw p-values for each test statistic using permutations
IsoTestBH()       Adjusts raw p-values of the five test statistics using the BH- or BY-FDR procedure
IsoGene1()        Calculates the values of the five test statistics for a single gene
IsoGenem()        Calculates the values of the five test statistics for all genes
IsoPlot()         Plots the data points, sample means at each dose, and an isotonic regression line (optional)
IsopvaluePlot()   Plots the pUp- and pDown-values of a gene for a given test
IsoBHplot()       Plots the raw p-values and the adjusted BH- and BY-FDR p-values
IsoGenemSAM()     Calculates the values of the five SAM regularized test statistics
IsoTestSAM()      Obtains the list of significant genes using the SAM procedure

Figure 2: The main IsoGene package functions.

The supporting functions are IsoGenem() and IsoGenemSAM(), which calculate the values of the five test statistics and of the five SAM regularized test statistics, respectively. The functions IsopvaluePlot(), IsoBHplot(), and IsoPlot() can be used to display the data and show the results of the testing procedures. A summary of the functions and their descriptions is presented in Figure 2.

The IsoGene package can be obtained from CRAN: http://cran.r-project.org/package=IsoGene. The IsoGene package requires the ff and Iso packages.
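Installation follows the usual CRAN mechanism, which also installs the dependencies:

> install.packages("IsoGene")
> library(IsoGene)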

Example 1: Data Exploration

To illustrate the analysis of dose-response microarray data using the IsoGene package, we use the dopamine data. The data were obtained from a pre-clinical evaluation study of an active compound (Göhlmann and Talloen, 2009). In this study the potency of the compound was investigated. The compound had 6 dose levels (0, 0.01, 0.04, 0.16, 0.63, 2.5 mg/kg) and each dose was given to four to five independent rats. The experiment was performed using Affymetrix whole-human genome chips. There are 26 chips/arrays and each chip/array contains 11,562 probe sets (for simplicity, we refer to the probe sets as genes). A subset of the dopamine data with 1000 genes is provided inside the package as example data. The complete data set can be obtained on request from the first author. For this paper, the analysis is based on the original data set (using all genes).

The example data is in an ExpressionSet object called dopamine. A more detailed explanation of the ExpressionSet class is given by Falcon et al. (2007). The object can be loaded into R with the following code:

> library(affy)
> library(IsoGene)
> data(dopamine)
> dopamine
ExpressionSet (storageMode: lockedEnvironment)
assayData: 11562 features, 26 samples
  element names: exprs
phenoData
  sampleNames: X1, X2, ..., X26 (26 total)
  varLabels and varMetadata description:
    dose: Dose Levels
featureData
  featureNames: 201_at, 202_at, ..., 11762_at
    (11562 total)
  fvarLabels and fvarMetadata description: none
experimentData: use 'experimentData(object)'

The IsoGene package requires the information of dose levels and gene expression as input. The dose levels and the log2-transformed gene intensities can be extracted using the following code:

> express <- data.frame(exprs(dopamine))
> dose <- pData(dopamine)$dose

For data exploration, the function IsoPlot() can be used to produce a scatterplot of the data. The function IsoPlot() has two options to specify the dose levels, i.e., type = "ordinal" or "continuous". By default, the function produces the plot with continuous dose levels and data points. To add an isotonic regression line, one can specify add.curve = TRUE.

Plots of the data and an isotonic regression line for one of the genes in the data set, with dose on the continuous and the ordinal scale, can be produced using the following code:

# Figure 3
> par(mfrow = c(1, 2))
> IsoPlot(dose, express[56, ], type = "continuous",
+   add.curve = TRUE)
> IsoPlot(dose, express[56, ], type = "ordinal",
+   add.curve = TRUE)

Figure 3: The plots produced by IsoPlot() for gene 256_at, with dose as a continuous variable (left panel) and dose as an ordinal variable (right panel; the real dose levels are presented on the x-axis). The data points are plotted as circles, the sample means as red crosses, and the fitted increasing isotonic regression model as a blue solid line.

Example 2: Resampling-based Multiple Testing

In this section, we illustrate an analysis to test for a monotone trend using the five statistics presented in the previous section with the IsoGene package. The function IsoRawp() is used to calculate raw p-values based on permutations. The dose levels, the data frame of the gene expression, and the number of permutations used to approximate the null distribution need to be specified in this function. Note that, since the permutations are generated randomly, one can use the function set.seed to obtain reproducible results. To calculate raw p-values for the dopamine data using 1000 permutations with seed 1234, we can use the following code:

> set.seed(1234)
> rawpvalue <- IsoRawp(dose, express,
+   niter = 1000)

The output object rawpvalue is a list with four components containing the p-values for the five test statistics: two-sided p-values, one-sided p-values, pUp-values and pDown-values. The code to extract the component containing the two-sided p-values for the first four genes is presented below. In a similar way, one can obtain the other components.

> twosided.pval <- rawpvalue[[2]]
> twosided.pval[1:4, ]
  Probe.ID    E2 Williams Marcus     M  ModM
1   201_at 1.000    0.770  0.996 0.918 0.946
2   202_at 0.764    0.788  0.714 0.674 0.700
3   203_at 0.154    0.094  0.122 0.120 0.134
4   204_at 0.218    0.484  0.378 0.324 0.348

Once the raw p-values are obtained, they need to be adjusted for multiple testing. The function IsoTestBH() is used to adjust the p-values while controlling the FDR. The raw p-values (e.g., rawp), the FDR level, the type of multiplicity adjustment (BH-FDR or BY-FDR), and the statistic need to be specified in the following way:

IsoTestBH(rawp, FDR = c(0.05, 0.1),
  type = c("BH", "BY"), stat = c("E2",
  "Williams", "Marcus", "M", "ModifM"))

IsoTestBH() produces a list of genes which have a significant increasing/decreasing trend. The following code can be used to adjust the two-sided p-values of the E²01 statistic using the BH-FDR adjustment:

> E2.BH <- IsoTestBH(twosided.pval,
+   FDR = 0.05, type = "BH", stat = "E2")
> dim(E2.BH)
[1] 250 4

The object E2.BH contains a list of significant genes along with the probe ID, the corresponding row number of each gene in the original data set, and the unadjusted (raw) and BH-FDR adjusted p-values of the E²01 test statistic. In the dopamine data, 250 genes are found to have a significant monotone trend using the likelihood ratio test (E²01) at the 0.05 FDR level. The first four significant genes are listed below:

> E2.BH[1:4, ]
  Probe.ID row.num raw p-values BH adj. p-values
1   256_at      56            0                0
2   260_at      60            0                0
3   280_at      80            0                0
4   283_at      83            0                0

One can visualize the number of significant findings for the BH-FDR and BY-FDR procedures for a given test statistic using the IsoBHPlot() function.

> # Figure 4
> IsoBHPlot(twosided.pval, FDR = 0.05, stat = "E2")

Figure 4 shows the raw p-values (solid blue line), the BH-FDR adjusted p-values (dotted and dashed red line) and the BY-FDR adjusted p-values (dashed green line) for the E²01.


Figure 4: Plot of unadjusted, BH-FDR and BY-FDR adjusted p-values for E²01.

Example 3: Significance Analysis of Dose-response Microarray Data (SAM)

The Significance Analysis of Microarrays (SAM) for testing for the dose-response relationship under order-restricted alternatives is implemented in the IsoGene package as well. The function IsoTestSAM() provides a list of significant genes based on the SAM procedure.

> IsoTestSAM(x, y, fudge = c("none", "pooled"),
+   niter = 100, FDR = 0.05, stat = c("E2",
+   "Williams", "Marcus", "M", "ModifM"))

The input for this function is the dose levels (x), the gene expression (y), the number of permutations (niter), the FDR level, the test statistic, and the option of using a fudge factor. The option fudge is used to specify the calculation of the fudge factor in the SAM regularized test statistic. If the option fudge = "pooled" is used, the fudge factor is calculated using the method described in the SAM manual (Chu et al., 2001). If we specify fudge = "none", no fudge factor is used.

The following code is used for analyzing the dopamine data using the SAM regularized M′ test statistic:

> set.seed(1235)
> SAM.ModifM <- IsoTestSAM(dose, express,
+   fudge = "pooled", niter = 100,
+   FDR = 0.05, stat = "ModifM")

The resulting object SAM.ModifM contains three components:

1. sign.genes1 contains a list of genes declared significant using the SAM procedure.

2. qqstat gives the SAM regularized test statistics obtained from permutations.

3. allfdr provides a delta table in the SAM procedure for the specified test statistic.

To extract the list of significant genes, one can do:

> SAM.ModifM.result <- SAM.ModifM[[1]]
> dim(SAM.ModifM.result)
[1] 151 6

The object SAM.ModifM.result contains a matrix with six columns: the probe IDs, the corresponding row numbers of the genes in the data set, the observed SAM regularized M′ test statistics, the q-values (obtained from the lowest False Discovery Rate at which the gene is called significant), the raw p-values obtained from the joint distribution of permutation test statistics for all of the genes as described by Storey and Tibshirani (2003), and the BH-FDR adjusted p-values.

For the dopamine data, 151 genes are found to have a significant monotone trend based on the SAM regularized M′ test statistic at the 0.05 FDR level. The SAM regularized M′ test statistic values, along with the q-values, for the first four significant genes are presented below.

> SAM.ModifM.result[1:4, ]
  Probe.ID row.num  stat.val qvalue     pvalue adj.pvalue
1  4199_at    3999 -4.142371 0.0000 3.4596e-06  0.0012903
2  4677_at    4477 -3.997728 0.0000 6.9192e-06  0.0022857
3  7896_at    7696 -3.699228 0.0000 1.9027e-05  0.0052380
4  9287_at    9087 -3.324213 0.0108 4.4974e-05  0.0101960

Note that genes are sorted in increasing order of the SAM regularized M′ test statistic values (i.e., stat.val).

In the IsoGene package, the function IsoSAMPlot() provides graphical output of the SAM procedure. This function requires the SAM regularized test statistic values and the delta table of the SAM procedure, which can be obtained from the object returned by the IsoTestSAM() function, in this example called SAM.ModifM. To extract these objects we can use the following code:

# Obtaining SAM regularized test statistics
qqstat <- SAM.ModifM[[2]]
# Obtaining delta table
allfdr <- SAM.ModifM[[3]]

We can also obtain qqstat and allfdr from the functions Isoqqstat() and Isoallfdr(), respectively. The calls to the two functions are:

Isoqqstat(x, y, fudge = c("none", "pooled"), niter)
Isoallfdr(qqstat, ddelta, stat = c("E2",
  "Williams", "Marcus", "M", "ModifM"))

Examples of using the functions Isoqqstat() and Isoallfdr() are as follows:


# Obtaining SAM regularized test statistics
> qqstat <- Isoqqstat(dose, express,
+   fudge = "pooled", niter = 100)
# Obtaining delta table
> allfdr <- Isoallfdr(qqstat, , stat = "ModifM")

Note that in Isoallfdr(), ddelta is left blank, with default values taken from the data, i.e., all the percentiles of the standard errors of the M′ test statistic.

The two approaches above give the same result. To produce the SAM plot for the SAM regularized M′ test statistic, we can then use the function IsoSAMPlot():

# Figure 5
> IsoSAMPlot(qqstat, allfdr, FDR = 0.05,
+   stat = "ModifM")

Figure 5: The SAM plots: a. plot of the FDR vs. ∆; b. plot of the number of significant genes vs. ∆; c. plot of the number of false positives vs. ∆; d. plot of the observed vs. expected test statistics.

Panel a of Figure 5 shows the FDR (either 50% or 90%, the latter more stringent) vs. ∆, from which the user can choose the ∆ value with the corresponding desired FDR. The FDR 50% and FDR 90% are obtained from, respectively, the median and the 90th percentile of the number of falsely called genes (number of false positives) divided by the number of genes called significant, estimated by using permutations (Chu et al., 2001). Panel b shows the number of significant genes vs. ∆, and panel c shows the number of false positives (obtained from either the 50th or the 90th percentile) vs. ∆. Finally, panel d shows the observed vs. the expected (obtained from permutations) test statistics, in which the red dots are those genes called differentially expressed.

Comparison of the Results of Resampling-based Multiple Testing and the SAM

In the previous sections we have illustrated the analysis for testing a monotone trend for dose-response microarray data using permutations and the SAM. The same approach can also be applied to obtain a list of significant genes based on other statistics and other multiplicity adjustments. Figure 6 presents the number of genes found to have a significant monotone trend using the five different statistics at the FDR level of 0.05 under the two approaches.

It can be seen from Figure 6 that the E²01 gives a higher number of significant genes than the other t-type statistics. Adding the fudge factor to the statistics leads to a smaller number of significant genes for the t-type test statistics under the SAM procedure as compared to the BH-FDR procedure.

                 Number of significant genes
Test statistic   BH-FDR   SAM   # common genes
E²01                250   279              200
Williams            186   104               95
Marcus              195   117              105
M                   209   158              142
M′                  203   151              134

Figure 6: Number of significant genes for each test statistic with BH-FDR and SAM.

In the inference based on the permutations (BH-FDR) and the SAM, most of the genes found by the five statistics are in common (see Figure 6). Plots of the samples and the isotonic trend for the four best genes found by all statistics under both the permutation and the SAM approaches are presented in Figure 7; all show a significant increasing monotone trend.

Discussion

We have shown the use of the IsoGene package for dose-response studies within a microarray setting along with motivating examples. For the analysis using the SAM procedure, it should be noted that the fudge factor in the E²01 is obtained based on the approach for the F-type test statistic discussed by Chu et al. (2001) and should be used with caution. The performance of such an adjustment, as compared to the t-type test statistics, has not yet been investigated in terms of power and control of the FDR. Therefore, it is advisable to use the fudge factor in the t-type test statistics, such as the M and modified M test statistics.


Figure 7: Plots of the samples (circles) and the isotonic trend (solid blue line) for the four best genes with a significant monotone trend (genes 256_at, 1360_at, 924_at, and 6992_at).

The calculation of the raw p-values based on permutations using the IsoRawp() function is computationally intensive. For example, analyzing 10,000 genes with 100 permutations takes around 30 minutes; with 1,000 permutations, the elapsed time increases around tenfold. Usually a large number of permutations is needed to approximate the null distribution of the test statistics, which increases the computation time. Considering the computation time, we also provide the p-values obtained from the joint distribution of the permutation statistics for all genes, which is implemented in the function IsoTestSAM(). This approach is sufficient to obtain small p-values using a small number of permutations. Alternatively, one can use the SAM procedure, which does not require a large number of permutations and takes less computation time. However, caution should be drawn to the comparison of the SAM procedure with and without the fudge factor.
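One way to gauge the cost before committing to a full run is to time a small pilot run with system.time() and extrapolate. This is only a sketch; actual timings depend on the machine, and the scaling is approximate:

# Time a pilot run on 100 genes with 100 permutations
t.pilot <- system.time(
  IsoRawp(dose, express[1:100, ], niter = 100)
)["elapsed"]
# Rough cost (in seconds) of the full run with 1000 permutations
t.pilot * (nrow(express) / 100) * (1000 / 100)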

A further development of this package will be a Graphical User Interface (GUI) using tcl/tk for users without knowledge of R programming.

Acknowledgement

Financial support from the IAP research network nr. P5/24 of the Belgian Government (Belgian Science Policy) is gratefully acknowledged.

Bibliography

R. Barlow, D. Bartholomew, M. Bremner, and H. Brunk. Statistical Inference Under Order Restrictions. New York: Wiley, 1972.

D. Bartholomew. Ordered tests in the analysis of variance. Biometrika, 48:325–332, 1961.

Y. Benjamini and Y. Hochberg. Controlling the false discovery rate: a practical and powerful approach to multiple testing. J. R. Statist. Soc. B, 57:289–300, 1995.

Y. Benjamini and D. Yekutieli. The control of the false discovery rate in multiple testing under dependency. Annals of Statistics, 29(4):1165–1188, 2001.

G. Chu, B. Narasimhan, R. Tibshirani, and V. Tusher. SAM "Significance Analysis of Microarrays" users guide and technical document, 2001. URL http://www-stat.stanford.edu/~tibs/SAM/.

C. Chuang-Stein and A. Agresti. Tutorial in biostatistics: A review of tests for detecting a monotone dose-response relationship with ordinal response data. Statistics in Medicine, 16:2599–2618, 1997.

S. Dudoit and M. J. van der Laan. Multiple Testing Procedures with Applications to Genomics. Springer, 2008.

S. Dudoit, J. P. Shaffer, and J. Boldrick. Multiple hypothesis testing in microarray experiments. Statistical Science, 18(1):71–103, 2003.

S. Falcon, M. Morgan, and R. Gentleman. An introduction to BioConductor’s ExpressionSet class. Bioconductor vignette, 2007. URL http://bioconductor.org/packages/2.3/bioc/vignettes/Biobase/inst/doc/ExpressionSetIntroduction.pdf.

Y. Ge, S. Dudoit, and T. P. Speed. Resampling-based multiple testing for microarray data analysis. Test, 12(1):1–77, 2003.

H. Göhlmann and W. Talloen. Gene Expression Studies Using Affymetrix Microarrays. Chapman & Hall/CRC, 2009.

J. Hu, M. Kapoor, W. Zhang, S. Hamilton, and K. Coombes. Analysis of dose response effects on gene expression data with comparison of two microarray platforms. Bioinformatics, 21(17):3524–3529, 2005.

D. Lin, Z. Shkedy, D. Yekutieli, T. Burzykowski, H. Göhlmann, A. De Bondt, T. Perera, T. Geerts, and L. Bijnens. Testing for trends in dose-response microarray experiments: A comparison of several testing procedures, multiplicity and resampling-based inference. Statistical Applications in Genetics and Molecular Biology, 6(1):Article 26, 2007.


D. Lin, Z. Shkedy, D. Yekutieli, D. Amaratunga, and L. Bijnens, editors. Modeling Dose-response Microarray Data in Early Drug Development Experiments Using R. Springer, (to be published in 2010).

R. Marcus. The powers of some tests of the equality of normal means against an ordered alternative. Biometrika, 63:177–183, 1976.

J. Pinheiro, F. Bretz, and M. Branson. Analysis of dose response studies: Modeling approaches. Pages 146–171 in N. Ting, editor, Dose Finding in Drug Development. Springer, New York, 2006.

T. Robertson, F. Wright, and R. Dykstra. Order Restricted Statistical Inference. Wiley, 1988.

S. Ruberg. Dose response studies. I. Some design considerations. J. Biopharm. Stat., 5(1):1–14, 1995a.

S. Ruberg. Dose response studies. II. Analysis and interpretation. J. Biopharm. Stat., 5(1):15–42, 1995b.

J. D. Storey and R. Tibshirani. Statistical significance for genomewide studies. Proceedings of the National Academy of Sciences, 100(16):9440–9445, 2003.

V. Tusher, R. Tibshirani, and G. Chu. Significance analysis of microarrays applied to the ionizing radiation response. Proceedings of the National Academy of Sciences, 98:5116–5121, 2001.

D. Williams. A test for differences between treatment means when several dose levels are compared with a zero dose control. Biometrics, 27:103–117, 1971.

D. Williams. The comparison of several dose levels with a zero dose control. Biometrics, 28:519–531, 1972.

Setia Pramana, Dan Lin, Philippe Haldermans and Ziv Shkedy
Interuniversity Institute for Biostatistics and Statistical Bioinformatics, CenStat, Universiteit Hasselt
Agoralaan 1, B3590 Diepenbeek, Belgium
[email protected]
[email protected]
[email protected]
[email protected]

Tobias Verbeke
OpenAnalytics BVBA
Kortestraat 30A, 2220 Heist-op-den-Berg, Belgium
[email protected]

Hinrich Göhlmann, An De Bondt, Willem Talloen and Luc Bijnens
Johnson and Johnson Pharmaceutical Research and Development, a division of Janssen Pharmaceutica
Beerse, Belgium
[email protected]
[email protected]
[email protected]
[email protected]


MCMC for Generalized Linear Mixed Models with glmmBUGS

by Patrick Brown and Lutong Zhou

Abstract The glmmBUGS package is a bridging tool between Generalized Linear Mixed Models (GLMMs) in R and the BUGS language. It provides a simple way of performing Bayesian inference using Markov Chain Monte Carlo (MCMC) methods, taking a model formula and data frame in R and writing a BUGS model file, data file, and initial values files. Functions are provided to reformat and summarize the BUGS results. A key aim of the package is to provide files and objects that can be modified prior to calling BUGS, giving users a platform for customizing and extending the models to accommodate a wide variety of analyses.

Introduction

There are two steps in model fitting that are much more time consuming in WinBUGS than in R. First, unbalanced multi-level data need to be formatted in a way that BUGS can handle, including changing categorical variables to indicator variables. Second, a BUGS model file must be written. The glmmBUGS package addresses these issues by allowing users to specify models as formulas in R, as they would in the glm function, and provides everything necessary for fitting the model with WinBUGS or OpenBUGS via the R2WinBUGS package (Sturtz et al., 2005).

The glmmBUGS package creates the necessary BUGS model file, starting value function, and suitably formatted data. Improved chain mixing is accomplished with a simple reparametrization and the use of sensible starting values. Functions are provided for formatting and summarizing the results. Although a variety of models can be implemented entirely within glmmBUGS, the package intends to provide a basic set of data and files for users to modify as necessary. This allows the full flexibility of BUGS model specification to be exploited, with much of the initial “grunt work” being taken care of by glmmBUGS.

Examples

Independent random effects

Consider the bacteria data from the MASS package:

> library(MASS)
> data(bacteria)
> head(bacteria)
  y ap hilo week  ID     trt
1 y  p   hi    0 X01 placebo
2 y  p   hi    2 X01 placebo
3 y  p   hi    4 X01 placebo
4 y  p   hi   11 X01 placebo
5 y  a   hi    0 X02   drug+
6 y  a   hi    2 X02   drug+

The variables to be considered are: y, the presence or absence of bacteria in a sample, coded as "y" and "n" respectively; week, the time of the observation; ID, the subject identifier; and trt, giving the treatment group as "placebo", "drug", or "drug+".

A generalized linear mixed model is applied to the data with:

Yij ∼ Bernoulli(pij)
logit(pij) = µ + xijβ + Vi
Vi ∼ iid N(0, σ²)   (1)

where: Yij indicates the presence or absence of bacteria of the ith person at week j; the covariates xij are week and indicator variables for treatment; pij denotes the probability of bacteria presence; Vi is the random effect for the ith patient, assumed to be normal with mean 0 and variance σ². To improve the mixing of the Markov chain, a reparametrized model is fitted with:

Yij ∼ Bernoulli(pij)
logit(pij) = Ri + wijγ
Ri ∼ N(µ + giα, σ²)   (2)

Here gi is the (indicator variable for) treatment group for subject i and wij is the week that observation j was taken. Note that the two models are the same, with Vi = Ri − µ − giα, β = (γ, α) and xij = (wij, gi). The model in (1) has strong negative dependence between the posterior samples of Vi and µ, whereas Ri and µ in (2) are largely independent.
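For illustration, restoring the original parametrization in (1) from posterior samples of the reparametrized model (2) amounts to one line of matrix arithmetic; the restoreParams() function described later automates this step, and all object names in this sketch are hypothetical:

# R.samples: S x n matrix of posterior samples of the R_i
#            (S samples, n subjects)
# mu:        length-S vector of samples of the intercept
# alpha:     S x p matrix of samples of the treatment effects
# g:         n x p matrix of treatment indicators g_i
V.samples <- R.samples - mu - alpha %*% t(g)  # V_i = R_i - mu - g_i alpha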

As BUGS only allows numeric data, and cannot have the '+' sign in variable names, the data are recoded as follows:

> bacterianew <- bacteria
> bacterianew$yInt = as.integer(bacterianew$y == "y")
> levels(bacterianew$trt) <- c("placebo",
+   "drug", "drugplus")

The primary function in the package is glmmBUGS, which does the preparatory work for fitting the model in (2) with:

> library(glmmBUGS)
> bacrag <- glmmBUGS(formula = yInt ~
+   trt + week, data = bacterianew,
+   effects = "ID", modelFile = "model.bug",
+   family = "bernoulli")

This specifies yInt as a Bernoulli-valued response, trt and week as fixed-effect covariates, and the "ID" column for the random effects. The result is a list with three elements:

ragged is a list containing the data to be passed toWinBUGS;

pql is the result from fitting the model with the glmmPQL function in the MASS package;

startingValues is a list of starting values for the parameters and random effects, which is obtained from the glmmPQL result.

In addition, two files are written to the working directory:

‘model.bug’ is the BUGS model file

‘getInits.R’ contains the R code for a function to generate random starting values.

To accommodate unbalanced designs, the data are stored as ragged arrays, as described in the section “Handling unbalanced datasets” of the WinBUGS manual (Spiegelhalter et al., 2004). The ragged result has a vector element "SID" indicating the starting position of each individual's observations in the dataset. The covariates are split into elements "XID" and "Xobservations" for the individual-level and observation-level covariates respectively:

> names(bacrag$ragged)

[1] "NID" "SID"[3] "yInt" "XID"[5] "Xobservations"

The model file consists of an outer loop over ID levels and an inner loop over observations. The details of how the model is implemented are best understood by examining the ‘model.bug’ file and consulting the WinBUGS manual.

At this stage the user is expected to modify the files and, if necessary, the data to refine the model, set appropriate priors, and ensure suitability of starting values. Before using the bugs function in the R2WinBUGS package to sample from the posterior, the starting value function getInits must be prepared. First, the file ‘getInits.R’ (which contains the source for getInits) should be edited and sourced into R. Next, the list containing the PQL-derived starting values must be assigned the name startingValues, as getInits will be accessing this object every time it is run.

> source("getInits.R")> startingValues = bacrag$startingValues> library(R2WinBUGS)

> bacResult = bugs(bacrag$ragged, getInits,
+   model.file = "model.bug", n.chain = 3,
+   n.iter = 2000, n.burnin = 100,
+   parameters.to.save = names(getInits()),
+   n.thin = 10)

Post WinBUGS commands

After running WinBUGS, a series of functions in the glmmBUGS package can be used to manipulate and check the simulation results. The restoreParams function is used to restore the original parametrization from (1) and assign the original names to the group or subject identifiers.

> bacParams = restoreParams(bacResult,
+   bacrag$ragged)
> names(bacParams)

[1] "intercept" "SDID" "deviance"[4] "RID" "betas"

The result is a set of posterior samples for µ (intercept), σ (SDID), all the Vi (RID), and β (betas). Posterior means and credible intervals are obtained with the summaryChain function.

> bacsummary = summaryChain(bacParams)
> names(bacsummary)

[1] "scalars" "RID" "betas"

> signif(bacsummary$betas[, c("mean",
+   "2.5%", "97.5%")], 3)

               mean   2.5%   97.5%
observations -0.159 -0.263 -0.0614
trtdrug      -1.500 -3.240  0.0164
trtdrugplus  -0.920 -2.740  0.6060

> bacsummary$scalars[, c("mean", "2.5%",
+   "97.5%")]

              mean     2.5%    97.5%
intercept 3.547556 2.243375 5.219125
SDID      1.597761 0.737010 2.716975

In order to check the convergence of the simulations, we use the checkChain function to produce trace plots of the intercept and σ parameters.

> checkChain(bacParams, c("intercept",
+   "SDID"))

(Trace plots of the posterior samples for the intercept and SDID parameters, as produced by checkChain.)


A spatial example

The ontario dataset contains expected and observed cases of molar cancer by census sub-division in Ontario, Canada.

> data(ontario)
> head(ontario)

         CSDUID observed logExpected
3501005 3501005       61    2.118865
3501011 3501011       50    1.971265
3501012 3501012      117    3.396444
3501020 3501020       65    1.919814
3501030 3501030       37    1.779957
3501042 3501042       16    1.182329

Consider the following model (see Rue and Held, 2005):

Yi ∼ Poisson(λiEi)
log(λi) = µ + Ui + Vi
Ui ∼ GMRF(σ²U)
Vi ∼ iid N(0, σ²V)    (3)

where Yi and Ei are the observed and expected number of cancer cases in census division i; λi is the cancer incidence rate; the Ui are spatial random effects, a Gaussian Markov random field with variance σ²U; and the Vi are spatially independent random effects with variance σ²V.

This model is reparametrised with Ri = Ui + µ + Vi in place of Vi. An adjacency matrix is required for the spatial random effect. This was computed from the spatial boundary files with the poly2nb function from the spdep package and is stored as the object popDataAdjMat. The model can be fitted with glmmBUGS as follows:

> data(popDataAdjMat)
> forBugs = glmmBUGS(formula = observed +
+   logExpected ~ 1, spatial = popDataAdjMat,
+   effects = "CSDUID", family = "poisson",
+   data = ontario)

Note that the expected counts are needed on the log scale, and are passed as a second argument on the left side of the model equation to denote their being offset parameters without a coefficient. The random effect is at the census sub-division level (CSDUID), with popDataAdjMat giving the dependency structure. Posterior samples can be generated as follows:

> startingValues = forBugs$startingValues
> source("getInits.R")
> onResult = bugs(forBugs$ragged, getInits,
+   model.file = "model.bug", n.chain = 3,
+   n.iter = 2000, n.burnin = 100,
+   parameters.to.save = names(getInits()),
+   n.thin = 10)
> ontarioParams = restoreParams(onResult,
+   forBugs$ragged)
> names(ontarioParams)

[1] "intercept" "SDCSDUID"[3] "SDCSDUIDSpatial" "deviance"[5] "RCSDUID" "RCSDUIDSpatial"[7] "FittedRateCSDUID"

There are posterior simulations for two variance parameters, σU (SDCSDUIDSpatial) and σV (SDCSDUID). Samples for the random effects are given for Ui (RCSDUIDSpatial), Ui + Vi (RCSDUID), and λi (FittedRateCSDUID). Trace plots are shown below for σU and σV, the standard deviations of the spatial and non-spatial random effects.

> checkChain(ontarioParams, c("SDCSDUIDSpatial",
+   "SDCSDUID"))

(Trace plots of the posterior samples for the SDCSDUIDSpatial and SDCSDUID parameters, as produced by checkChain.)

These trace plots show poor mixing, and the Markov chain should be re-run with longer chains and more thinning.

The posterior means for the relative risk λi for Toronto and Ottawa (Census Subdivisions 3506008 and 3520005) are extracted with:

> ontarioSummary = summaryChain(ontarioParams)
> ontarioSummary$FittedRateCSDUID[c("3506008",
+   "3520005"), "mean"]

    3506008     3520005 
0.119725007 0.008148696 

The posterior probability of each region having a relative risk twice the provincial average is computed with

> postProb = apply(ontarioParams$RCSDUID,
+   3, function(x) mean(x > log(2)))

The "RCSDUID" element of ontarioParams is a threedimensional array, with entry i, j,k being iteration iof chain j for region k. The apply statement com-putes the proportion of log-relative risks over log(2)for each region (the third dimension).

The results can be merged into the original spatial dataset, as described in the documentation for the diseasemapping package. Figure 1 shows a map of the posterior probabilities.


Figure 1: Posterior probabilities of Ontario census sub-divisions having twice the provincial average risk of molar cancer.

Additional features

The reparam argument to the glmmBUGS function reparametrizes the intercept, and a set of covariates for a new baseline group is provided. Recall that in the bacteria example the ragged object passed to WinBUGS divides covariates into observation and ID level effects: ragged$Xobservations is a vector containing the week variable, and ragged$XID is a matrix of two columns containing the indicator variables for the treatments. The reparam argument should accordingly be a list with one or more of the elements "observations", containing the week, or "ID", containing the two treatment indicators. Setting reparam=list(observations=2) would make week 2 the baseline week, rather than week zero. This argument is useful when using spline functions as covariates, as the intercept parameter which WinBUGS updates can be set to the curve evaluated in the middle of the dataset.
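As a hedged sketch (not run here), a call using this argument might look as follows, re-using the objects from the bacteria example; note that modelFile = "model.bug" would overwrite the earlier model file:

> bacrag2 <- glmmBUGS(formula = yInt ~
+   trt + week, data = bacterianew,
+   effects = "ID", modelFile = "model.bug",
+   family = "bernoulli",
+   reparam = list(observations = 2))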

The prefix argument allows a character string to precede all data and variable names. This can be used to combine multiple model files and datasets into a single BUGS model, with these files modified to allow dependence between the various components. For instance, one random effects model for blood pressure could be given prefix='blood', and a model for body mass given prefix='mass'. To allow for dependence between the individual-level random effects for blood pressure and mass, the two data sets and starting values produced by the glmmBUGS function can be combined, and the two model files concatenated and modified. Were it not for the prefix, some variable names would be identical in the two datasets and combining would not be possible.

Finally, multi-level models can be accommodated by passing multiple variable names in the effects argument. If the bacteria data were collected from students nested within classes nested within schools, the corresponding argument would be effects = c("school", "class", "ID").

Currently unimplemented features that would enhance the package include: random coefficients (specified with an lmer-type syntax); survival distributions (with "Surv" objects as responses); and geostatistical models for the random effects. As WinBUGS incorporates methods for all of the above, these tasks are very feasible, and offers of assistance from the community would be gratefully accepted.

Summary

There are a number of methods and R packages for performing Bayesian inference on generalized linear mixed models, including geoRglm (Christensen and Ribeiro, 2002), MCMCglmm (Hadfield, 2009) and glmmGibbs (Myles and Clayton, 2002). Particularly noteworthy is the INLA package, which avoids the use of MCMC by using Laplace approximations, as described in Rue et al. (2009). Despite this, many researchers working with R (including the authors) continue to use WinBUGS for model fitting. This is at least partly due to the flexibility of the BUGS language, which gives the user nearly full control over the specification of distributions and dependence structures. This package aims to retain this flexibility by providing files to be edited before model fitting is carried out, rather than providing a large number of options for, e.g., prior distributions. More model-specific packages such as INLA are entirely adequate (and perhaps preferable) for the two examples presented here. Those who wish to have full control over priors or build more complex models, for example multivariate or errors-in-covariate models, will likely appreciate the head start which the glmmBUGS package can provide for these tasks.

Acknowledgements

Patrick Brown holds a Discovery Grant from the Natural Sciences and Engineering Research Council of Canada.

Bibliography

O. Christensen and P. Ribeiro. geoRglm - a package for generalised linear spatial models. R News, 2(2):26–28, 2002. URL http://cran.R-project.org/doc/Rnews.

J. D. Hadfield. MCMC methods for multi-response generalised linear mixed models: The MCMCglmm R package, 2009. URL http://cran.r-project.org/web/packages/MCMCglmm/vignettes/Tutorial.pdf.

J. Myles and D. Clayton. Package GLMMGibbs, 2002. URL http://cran.r-project.org/src/contrib/Archive/GLMMGibbs.


H. Rue and L. Held. Gaussian Markov Random Fields: Theory and Applications, volume 104 of Monographs on Statistics and Applied Probability. Chapman & Hall, London, 2005.

H. Rue, S. Martino, and N. Chopin. Approximate Bayesian inference for latent Gaussian models by using integrated nested Laplace approximations. Journal of the Royal Statistical Society — Series B, 71(2):319–392, 2009.

D. Spiegelhalter, A. Thomas, N. Best, and D. Lunn. WinBUGS user manual, 2004. URL http://mathstat.helsinki.fi/openbugs/data/Docu/WinBugs%20Manual.html.

S. Sturtz, U. Ligges, and A. Gelman. R2WinBUGS: A package for running WinBUGS from R. Journal of Statistical Software, 12(3):1–16, 2005. URL http://www.jstatsoft.org.

Patrick Brown
Dalla Lana School of Public Health,
University of Toronto, and
Cancer Care Ontario
[email protected]

Lutong Zhou
Cancer Care Ontario
[email protected]


Mapping and Measuring Country Shapes
The cshapes Package

by Nils B. Weidmann and Kristian Skrede Gleditsch

Abstract The article introduces the cshapes R package, which includes our CShapes dataset of contemporary and historical country boundaries, as well as computational tools for computing geographical measures from these maps. We provide an overview of the need for considering spatial dependence in comparative research, how this requires appropriate historical maps, and detail how the associated R package cshapes can contribute to these ends. We illustrate the use of the package for drawing maps, computing spatial variables for countries, and generating weights matrices for spatial statistics.

Introduction

Our CShapes project (Weidmann et al., 2010) provides electronic maps of the international system for the period 1946–2008 in a standard Geographic Information System (GIS) format. Most existing GIS maps consider only contemporary country configurations and do not include information on changes in countries and borders over time. Beyond the value of more accurate historical maps in their own right for visualization, such maps can also help improve statistical analyses. Comparative research in the social sciences often considers country-level characteristics such as total population, economic performance, or the degree to which a country's political institutions can be characterized as democratic. Many theories relate such outcomes to other characteristics of individual countries, and typically ignore how possible interdependence between countries may also determine outcomes. However, over recent years there has been more attention to the spatial relationship between observations. For example, the level of democracy in a country can be strongly influenced by how democratic its neighbors are (Gleditsch, 2002). Similarly, conflict diffusion across national boundaries implies that the risk of civil war in a country will be partly determined by conflict in its immediate proximity (Gleditsch, 2007). Tobler's (1970, 236) First Law of Geography states that “everything is related to everything else, but near things are more related than distant things”. To the extent that this applies to the international system, we need, at a theoretical level, to specify the transnational mechanisms that cause these interdependencies, and, at an analytical level, our empirical techniques must properly incorporate spatial interdependence among observations.

The CShapes dataset and the R package cshapes presented in this article help address the second task. Assessing spatial dependence in the international system requires measures of the connections and distances between our units of analysis, i.e., states. There exist various data collections that provide different indicators of spatial interdependence: a contiguity dataset (Gochman, 1991), the minimum-distance database by Gleditsch and Ward (2001), a dataset of shared boundary lengths (Furlong and Gleditsch, 2003), and distances between capitals (Gleditsch, 2008). However, although all of these collections entail indicators from a map of the international system at a given point in time, these databases provide the final indicators in fixed form rather than directly code the information from the underlying maps.

Our CShapes project takes a different approach and provides electronic maps of the international system for the period 1946–2008 from which such geographical indicators can be coded directly. The R package cshapes allows computing important distance measures directly from these maps. This approach has several advantages over the existing spatial indicators mentioned above. First, it ensures that the computed indicators are comparable, since they all refer to the same geopolitical entities. Second, maintaining and updating the dataset becomes much more convenient: if a change occurs in the international system – such as for example the independence of Montenegro in 2006 – we only need to update the electronic maps and re-run the computations in order to get up-to-date distance measures as of 2006. Third, the availability of electronic maps opens up new possibilities for visualization of country-level variables.

In the remainder of this article we provide a brief overview of the underlying spatial dataset of country boundaries, describing our coding approach and the data format. We then turn to the cshapes package for R, and describe three main applications the package can be used for: (i) visualizing country data as maps, (ii) computing new spatial indicators for countries, and (iii) creating weights matrices for spatial regression models.

The CShapes Dataset

Coding Criteria

Creating the CShapes dataset requires us to define coding rules with respect to three questions: (i) what constitutes an independent state, (ii) what is its spatial extent, and (iii) what constitutes a territorial change to a state. The following paragraphs summarize our approach. For a more detailed discussion, we refer the reader to Weidmann et al. (2010).

What is a state? There are various criteria for defining states. Research in Political Science often relies on the list of independent states provided by the Correlates of War (COW) project (Correlates of War Project, 2008). This list relies on several criteria, including recognition by the UK and France, membership in the League of Nations and the United Nations, as well as other ad-hoc criteria. An alternative list of independent states by Gleditsch and Ward (GW) (1999) addresses some limitations of the COW list, and has also gained wide acceptance in the research community. As a result, we have made CShapes compatible with both the COW and GW state lists.

What is the spatial extent of a state? In many cases, the territory of a state comprises several (and possibly disjoint) regions with different status. For colonial states, an obvious distinction can be made between the “homeland” of a state and its dependent territories or colonies. Territory can also be disputed between states, such as for example the Golan Heights in the border region between Syria and Israel, or territory claimed may not be directly controlled or accepted by other states, as in the case of claims in Antarctica. As there are many such cases, it is often impossible to allocate particular territories to states. To avoid this difficulty, we only code the core territories of states, without dependent territories or disputed regions.

What constitutes a territorial change? Given the spatial extent of a state, we need to define the territorial changes that should be reflected in the dataset. Three types of changes are given in the dataset. First, new states can become independent, as in the case of the secession of Eritrea from Ethiopia in 1993. Second, states can merge, as in the case of the dissolution of the (Eastern) German Democratic Republic in 1990 and its accession to the (Western) German Federal Republic. Third, we may have territorial changes to a state that do not involve the emergence or disappearance of states.

CShapes includes all changes of the first two types. To keep the effort manageable, we only code changes of the third type that (i) affect the core territory of a state, and (ii) affect an area of at least 10,000 square kilometers. Our coding here is based on an existing dataset of territorial changes by Tir et al. (1998).

Implementation

Geographic data are typically stored in one of two formats: raster and vector data. Raster data represent continuous spatial quantities, such as territorial elevation, while vector data are used to capture discrete spatial entities (“features”), such as cities, rivers or lakes. There are three basic types of vector data: points, lines and polygons. The latter are the most appropriate for representing state boundaries. Apart from the spatial features themselves, vector data formats store non-spatial information related to them in an attribute table. For example, in a dataset of point features representing cities, we could store the name and population of each city in the attribute table.

Figure 1: CShapes vector data representation as polygons with lifetimes (start and end dates) in the attribute table (from Weidmann et al., 2010).

In CShapes, we use polygons to represent state boundaries at a given time, and supplement the temporal information for these polygons in the attribute table. This is done by assigning a lifetime (i.e., a start day and an end day) to each polygon. By filtering the “active” polygons for a given date, it is possible to get a snapshot of the international system for this date. Figure 1 illustrates the CShapes data representation with each polygon's lifetime in the attribute table. The attribute table also contains the country identifier according to the COW and GW lists, as well as the location of the state capital in latitude/longitude coordinates. This information is required for the calculation of capital distances between states (see below). The CShapes vector dataset is provided in the shapefile format (ESRI, 1998), which can be used directly in standard GIS applications. However, in order to provide a more accessible interface to the data and to enable additional computations, we have developed the cshapes R package introduced in the next section.

The cshapes R Package and Applications

The cshapes package for R has two main purposes: first, to make the CShapes data available for use within R, and second, to perform additional operations on the data, in particular weights matrix computation for spatial regression modeling. In the following, we introduce the basic functionality of the package and later demonstrate its use through three illustrative applications.


Accessing the Data

In R, basic support for spatial data is provided by the sp package, which includes a set of classes both for vector and raster data. Other packages add functionality to work with these classes, for example the maptools package for reading/writing shapefiles and drawing maps, or the spdep package for spatial regression. For an introduction to R's spatial features, we refer to the overview in Bivand (2006), and to Bivand et al. (2008) for a detailed introduction. cshapes draws on these packages in making the historical country boundaries available for use within R. The cshp() function is a convenience method to load the spatial dataset and returns a SpatialPolygonsDataFrame, which can be used to access and modify both the polygons and the attached attribute table. As an example to illustrate this, we first demonstrate how to extract country boundaries for a given point in time, and later use these data to visualize data, in this case the regime type of countries.

# load country borders as of 2002
> cmap.2002 <- cshp(date=as.Date("2002-1-1"))

We can now join the extracted polygons for this date with supplementary information about countries. The Polity IV database (Marshall and Jaggers, 2008) provides a summary indicator about regime type. The indicator ranges from -10 to 10, with low values corresponding to autocratic regimes, and high values corresponding to democratic systems. For the example below, the Polity dataset is downloaded and filtered to retain only the 2002 values. Furthermore, the polity.2002 data frame contains only two columns, the COW country identifier (ccode) and the regime type indicator (polity2). In order to match this information, we first need to match country identifiers from cshapes with those in the Polity dataset. As the Polity data is not available for all countries in 2002, we do not get a regime type value for each country in cmap.2002.

# download dataset, filter scores for 2002
# (read.spss() is provided by the foreign package)
> library(foreign)
> polity.file <- paste(tempdir(),
+   "p4v2008.sav", sep="/")
> download.file(
+   "http://www.systemicpeace.org/inscr/p4v2008.sav",
+   polity.file)
> polity <- read.spss(polity.file, to.data.frame=T)
> polity.2002 <- polity[polity$year==2002,]
> polity.2002 <- subset(polity.2002,
+   !is.na(polity2), select=c(ccode, polity2))

# match country identifiers from both datasets
> o <- match(cmap.2002$COWCODE, polity.2002$ccode)

# order polity dataset accordingly
> polity.2002 <- polity.2002[o,]

# set row names, required for spCbind function
> row.names(polity.2002) <- cmap.2002$FEATUREID

# append using spCbind
> cmap.2002.m <- spCbind(cmap.2002, polity.2002)

Drawing Maps

The dataset created above, with regime type scores for 2002 appended, can now be used for visualizing regime type across space. As described in Bivand et al. (2008, ch. 3), the classInt package is useful for creating maps with color shadings. The literature commonly distinguishes between three regime types: autocracies (Polity scores -10 through -7), anocracies (scores -6 through 6) and democracies (scores 7 through 10), a typology we employ for our example. Using the classIntervals() and findColours() functions, we divide regime type scores into three different classes and assign colors of different intensity, where more autocratic countries are displayed using darker colors. The following example shows the code necessary to generate the map in Figure 2.

# generate a color palette
> pal <- grey.colors(3, 0.25, 0.95)

# find the class intervals and colors
> breaks <- classIntervals(cmap.2002.m$polity2,
+   n=3, style="fixed", fixedBreaks=c(-10, -6, 7, 10))
> colors <- findColours(breaks, pal)

# create plot and add legend
> plot(cmap.2002.m, bty="n", col=colors)
> legend(x="bottomleft",
+   legend=c("Autocracy", "Anocracy", "Democracy"),
+   fill = attr(colors, "palette"),
+   bty = "n", title="Regime type")

Figure 2: Mapping regime type scores according to the Polity dataset for 2002.

Spatial Variables for Countries

As an example of spatial characteristics that can be calculated using cshapes, we consider the distance between a country's capital and its geographical center. The extent to which a country contains extensive peripheries or hinterlands is often held to be an important influence on state consolidation and the risk of conflict (Herbst, 2000; Buhaug et al., 2008). A simple measure of unfavorable geography can be derived from comparing the position of a state capital relative to the geographic center of a country. Figure 3 shows the borders of two African countries, the Democratic Republic of the Congo and Nigeria, as well as their capitals and their polygon centroids. In the Democratic Republic of the Congo (left), the capital Kinshasa is located in the far West of the country, which creates large hinterlands in the East. Nigeria (right) has a more balanced configuration, with the capital Abuja located close to the centroid.

Figure 3: Illustration of the measure of deviation between the geographic centroid of a polygon and the capital. Left: Democratic Republic of Congo, right: Nigeria (from Weidmann et al., 2010).

For a systematic statistical test, we compute the capital–centroid deviation, i.e. the distance from the capital to the polygon centroid, for each country/year. We do not show the complete regression analysis here, but focus on the computation of this indicator. We first obtain the centroids for all polygons using the coordinates() function in the sp package. We then apply a newly created function called distance() to each polygon. This function is essentially a wrapper around the spDistsN1 function from the sp package and computes the distance between each polygon's centroid and its capital. The capital longitude and latitude are obtained directly from the respective polygon's attributes (the CAPLONG and CAPLAT fields). Note that since the CShapes dataset is unprojected, we need to set the longlat=TRUE parameter in the spDistsN1 function in order to obtain great circle distances.

> cmap <- cshp()
> centroids <- coordinates(cmap)
> distance <- function(from.x, from.y, to.x, to.y) {
+   from <- matrix(c(from.x, from.y), ncol=2, nrow=1)
+   spDistsN1(from, c(to.x, to.y), longlat = TRUE)
+ }
> cmap$capcentdist <- mapply(distance, centroids[,1],
+   centroids[,2], cmap$CAPLONG, cmap$CAPLAT)

Computing Country Distances

Spatial statistical applications typically require that we first specify spatial structure. One way to represent this is by means of a matrix indicating adjacency or distance between any two observations i, j. A matrix of this kind can then be transformed into a weights matrix that assigns a higher weight to spatially close observations. As discussed above, there are many advantages to computing these distances directly from a spatial dataset such as CShapes, which is why we provide a dynamic way of distance computation in our package, rather than offering pre-computed distances. This also makes it possible to compute these matrices for arbitrary points in time.

With countries as the unit of analysis, distance matrices are typically based on one of three types of distance (Ward and Gleditsch, 2008). First, we can consider the distance between the capital cities of two countries. Second, rather than capitals, we can use the geographical center point of a country for distance computation. This center point or centroid is the country's geometric center of gravity. Third, we can measure the minimum distance between two countries, i.e. the distance between two points, one from each country, that are closest to each other. Gleditsch and Ward (2001) discuss the relative advantages of different measures in greater detail. Computing distances between predetermined points as in the first two alternatives is straightforward; we only need to take into account that the CShapes dataset is unprojected, which is why we compute great circle distances. Finding the minimum distance, however, is more complicated, as we need to consider the shape and position of two countries to find the closest points.

The distmatrix() function. cshapes computes distance matrices of all three types with its distmatrix() function. Capital and centroid distances are easy to compute, as there is only one point per polygon used for the computation. Obtaining the minimum distance is more difficult. Given two countries, we need to compute the distance of each point on one of the polygons to all lines in the other polygon. This procedure must be repeated for each pair of countries. Because of the computational costs of this procedure, the cshapes package implements three ways of speeding up this computation. First, we implement the distance computation in Java, using the rJava package for interfacing and the Java Topology Suite1 (JTS) for handling spatial data in Java. Second, we resort to computing distances between pairs of points in the two polygons, rather than between points and lines. Third, we reduce the number of points in a polygon, which results in fewer comparisons to be made.

1 http://www.vividsolutions.com/jts/jtshome.htm


The polygon simplification is readily implemented in JTS according to the algorithm proposed by Douglas and Peucker (1973). The accuracy of the simplification can be controlled by a threshold parameter that indicates the maximum allowed deviation of the simplified shape from the original one. Figure 4 shows an example of the shapes generated by the polygon simplification, using our default threshold (0.1).

Figure 4: Result of the polygon simplification procedure using the Douglas-Peucker algorithm. The shaded areas show the borders between Portugal and Spain according to the original, detailed dataset. The solid line shows these boundaries after the simplification process, where twisted segments have been replaced by straight ones, while preserving the basic shape of a country.

distmatrix() is the cshapes function to generate distance matrices. It returns a symmetric matrix D, where the entries di,j correspond to the distance between countries i and j in kilometers, and all di,i = 0. The rows and columns are labeled by the respective country identifier, according to the COW or GW coding system. Since the layout of the international system changes over time, the first parameter of the function specifies the date at which the matrix should be computed. The country coding system to be used is set by the useGW parameter, which defaults to the Gleditsch and Ward system. The distance type is specified by the type parameter, which has three possible values: capdist, centdist and mindist. Furthermore, the tolerance parameter sets the threshold for the Douglas-Peucker polygon simplification described above. This parameter only affects minimum distance computation. The default value of this parameter is 0.1, and a value of 0 disables polygon simplification.
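As a usage sketch based on these parameters (the dyad lookup is an illustrative assumption; COW codes 2 and 20 are commonly used for the USA and Canada):

> # capital distances under the COW coding system (sketch, not output)
> capmat <- distmatrix(as.Date("2002-1-1"),
+   type="capdist", useGW=FALSE)
> capmat["2", "20"]  # hypothetical lookup of one dyad

Below, we compare the execution time obtained with default polygon simplification to no simplification.2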

> system.time(dmat.simple <- distmatrix(
+   as.Date("2002-1-1"), type="mindist"))

   user  system elapsed 
223.286   0.842 224.523 

> system.time(dmat.full <- distmatrix(
+   as.Date("2002-1-1"), type="mindist",
+   tolerance=0))

    user   system  elapsed 
12452.40    48.67 12686.44 

> cor(as.vector(dmat.simple),
+   as.vector(dmat.full))
[1] 0.9999999

Because of the quadratic scaling of the minimum distance computation, the reduction in computational costs achieved by the simplification procedure is extremely high. Whereas default polygon simplification allows us to compute the complete matrix in less than four minutes, the same procedure on the full dataset takes about four hours. Still, the inaccuracies introduced by the simplification are very small; the computed distance matrices correlate at 0.999.

Computing spatial weights. Once a distance matrix has been computed, we can transform it to different weights matrices, depending on the respective application. Weights matrices based on spatial adjacency assign a weight of 1 to all entries with di,j ≤ t for a distance threshold t. Empirical research suggests that it is helpful for many applications to have a relatively inclusive threshold t (for example, 900 km) to determine relevant neighboring states, in order to prevent excluding states that are not directly contiguous yet in practice often have high levels of interaction (Gleditsch, 2002). Alternatively, we can transform a distance matrix into a continuous weights matrix by inverting the di,j. The following code illustrates how to generate weights from cshapes distance matrices.

> dmat <- distmatrix(as.Date("2002-1-1"),
+   type="mindist")

# adjacency matrix, 900 km threshold
> adjmat <- ifelse(dmat > 900, 0, 1)
> diag(adjmat) <- 0

# inverse distance
> invdmat <- 1/dmat
> diag(invdmat) <- 0

# save matrix for later use
> write.table(adjmat, "adjmat2002.txt",
+   row.names=T, col.names=T)

# load matrix
> adjmat <- read.table("adjmat2002.txt",
+   header=T, check.names=F)

Integration with spdep. spdep is one of R's most advanced packages for spatial statistical analysis.

2 Hardware: Macbook 2.4 GHz Intel Core 2 Duo, 4 GB RAM.


It relies on two basic representations for spatial structure: neighbor lists for discrete neighborhood relationships, and weights lists for storing spatial weights between observations. Using functions from the spdep package, it is possible to convert the matrices created by cshapes. The following code demonstrates how to do this.

# neighbor list from adjacency matrix:
# create a weights list first...
> adjmat.lw <- mat2listw(adjmat,
+   row.names=row.names(dmat))

# ... and retrieve neighbor list
> adjmat.nb <- adjmat.lw$neighbours

# weights list from weights matrix
> invdmat.lw <- mat2listw(invdmat,
+   row.names=row.names(invdmat))

The neighbor lists obtained from cshapes can then be used to fit spatial regression models using spdep.

Conclusions

Accurate historical maps are essential for comparative cross-national research, and for incorporating spatial dependence into applied empirical work. Our CShapes dataset provides such information in a standard GIS format for the period 1946–2008. In this paper, we have introduced our cshapes package that provides a series of tools for useful operations on these data within R. Our package supplements the existing valuable packages for spatial analysis in R, and we hope that the package may facilitate greater attention to spatial dependence in cross-national research.

Bibliography

R. S. Bivand. Implementing spatial data analysis software tools in R. Geographical Analysis, 38:23–40, 2006.

R. S. Bivand, E. J. Pebesma, and V. Gómez-Rubio. Applied Spatial Data Analysis with R. Springer, New York, 2008.

H. Buhaug, L.-E. Cederman, and J. K. Rød. Disaggregating ethnic conflict: A dyadic model of exclusion theory. International Organization, 62(3):531–551, 2008.

Correlates of War Project. State system membership list, v2008.1, 2008. Available online at http://correlatesofwar.org.

D. Douglas and T. Peucker. Algorithms for the reduction of the number of points required to represent a digitized line or its caricature. The Canadian Cartographer, 10(2):112–122, 1973.

ESRI. ESRI shapefile technical description, 1998. Available at http://esri.com/library/whitepapers/pdfs/shapefile.pdf.

K. Furlong and N. P. Gleditsch. The boundary dataset. Conflict Management and Peace Science, 29(1):93–117, 2003.

K. S. Gleditsch. All Politics is Local: The Diffusion of Conflict, Integration, and Democratization. University of Michigan Press, Ann Arbor, MI, 2002.

K. S. Gleditsch. Transnational dimensions of civil war. Journal of Peace Research, 44(3):293–309, 2007.

K. S. Gleditsch. Distances between capital cities, 2008. Available at http://privatewww.essex.ac.uk/~ksg/data-5.html.

K. S. Gleditsch and M. D. Ward. Measuring space: A minimum distance database and applications to international studies. Journal of Peace Research, 38(6):749–768, December 2001.

K. S. Gleditsch and M. D. Ward. Interstate system membership: A revised list of the independent states since 1816. International Interactions, 25:393–413, 1999.

C. S. Gochman. Interstate metrics: Conceptualizing, operationalizing, and measuring the geographic proximity of states since the Congress of Vienna. International Interactions, 17(1):93–112, 1991.

J. Herbst. States and Power in Africa: Comparative Lessons in Authority and Control. Princeton University Press, Princeton, NJ, 2000.

M. G. Marshall and K. Jaggers. Polity IV project: Political regime characteristics and transitions, 1800-2007, 2008. Available at http://www.systemicpeace.org/polity/polity4.htm.

J. Tir, P. Schafer, P. F. Diehl, and G. Goertz. Territorial changes, 1816-1996. Conflict Management and Peace Science, 16:89–97, 1998.

W. R. Tobler. A computer movie simulating urban growth in the Detroit region. Economic Geography, 46:234–240, 1970.

M. D. Ward and K. S. Gleditsch. Spatial Regression Models. Sage, Thousand Oaks, CA, 2008.

N. B. Weidmann, D. Kuse, and K. S. Gleditsch. The geography of the international system: The CShapes dataset. International Interactions, 36(1), 2010.


Nils B. Weidmann
Woodrow Wilson School
Princeton University
[email protected]

Kristian Skrede Gleditsch
Department of Government
University of Essex
[email protected]


tmvtnorm: A Package for the Truncated Multivariate Normal Distribution
by Stefan Wilhelm and B. G. Manjunath

Abstract In this article we present tmvtnorm, an R package implementation for the truncated multivariate normal distribution. We consider random number generation with rejection and Gibbs sampling, computation of marginal densities as well as computation of the mean and covariance of the truncated variables. This contribution brings together the latest research in this field and provides useful methods for both scholars and practitioners when working with truncated normal variables.

Introduction

The package mvtnorm is the first choice in R for dealing with the Multivariate Normal Distribution (Genz et al., 2009). However, frequently one or more variates in a multivariate normal setting x = (x1, . . . , xm)^T are subject to one-sided or two-sided truncation (ai ≤ xi ≤ bi). The package tmvtnorm (Wilhelm and Manjunath (2010)) is an extension of the mvtnorm package and provides methods necessary to work with the truncated multivariate normal case defined by TN(µ, Σ, a, b). The probability density function for TN(µ, Σ, a, b) can be expressed as:

f(x, µ, Σ, a, b) = exp{−(1/2)(x − µ)^T Σ^{−1} (x − µ)} / ∫_a^b exp{−(1/2)(x − µ)^T Σ^{−1} (x − µ)} dx

for a ≤ x ≤ b, and 0 otherwise. We will use the following bivariate example with µ = (0.5, 0.5)^T and covariance matrix

Σ = ( 1.0  0.8
      0.8  2.0 )

as well as lower and upper truncation points a = (−1, −∞)^T, b = (0.5, 4)^T; i.e. x1 is doubly truncated, while x2 is singly truncated. The setup in R code is:

> library(tmvtnorm)
> mu <- c(0.5, 0.5)
> sigma <- matrix(c(1, 0.8, 0.8, 2), 2, 2)
> a <- c(-1, -Inf)
> b <- c(0.5, 4)

The following tasks for dealing with the truncated multinormal distribution are described in this article:

• generation of random numbers
• computation of marginal densities
• computation of moments (mean vector, covariance matrix)
• estimation of parameters

Generation of random numbers

Rejection sampling

The first idea to generate variates from a truncated multivariate normal distribution is to draw from the untruncated distribution using rmvnorm() in the mvtnorm package and to accept only those samples inside the support region (i.e., rejection sampling). The different algorithms used to generate samples from the multivariate normal distribution have been presented for instance in Mi et al. (2009) and in Genz and Bretz (2009).

The following R code can be used to generate N = 10000 samples using rejection sampling:

> X <- rtmvnorm(n=10000, mean=mu,
>   sigma=sigma, lower=a, upper=b,
>   algorithm="rejection")

The parameter algorithm="rejection" is the default value and may be omitted.

This approach works well if the acceptance rate α is reasonably high. The method rtmvnorm() first calculates the acceptance rate α and then iteratively generates N/α draws in order to get the desired number of remaining samples N. This typically needs just a few iterations and is fast. Figure 1 shows random samples obtained by rejection sampling. Accepted samples are marked as black, discarded samples are shown in gray. In the example, the acceptance rate is α = 0.432.

> alpha <- pmvnorm(lower=a, upper=b, mean=mu,
>   sigma=sigma)
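For illustration only, a hand-rolled version of this rejection step might look as follows; this is a minimal sketch of the idea, not the package's actual implementation:

> # draw roughly N/alpha candidates and keep those inside [a, b]
> N <- 10000
> Y <- rmvnorm(ceiling(N/alpha), mean=mu, sigma=sigma)
> inside <- apply(Y, 1, function(y) all(y >= a & y <= b))
> X.manual <- Y[inside, ]  # about N rows on average; repeat if too few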

However, if only a small fraction of samples is accepted, as is the case in higher dimensions, the number of draws per sample can be quite substantial. Then rejection sampling becomes very inefficient and finally breaks down.

Gibbs sampling

The second approach to generating random samples is to use the Gibbs sampler, a Markov Chain Monte Carlo (MCMC) technique. The Gibbs sampler samples from conditional univariate distributions f(xi | x−i) = f(xi | x1, . . . , xi−1, xi+1, . . . , xm), which are themselves truncated univariate normal distributions (see for instance Horrace (2005) and Kotecha and Djuric (1999)). We implemented the algorithm presented in the paper of Kotecha and Djuric (1999) in the tmvtnorm package with minor improvements. Using algorithm="gibbs", users can sample with the Gibbs sampler.

> X <- rtmvnorm(n=10000, mean=mu,
>   sigma=sigma, lower=a, upper=b,
>   algorithm="gibbs")

Figure 1: Random samples from a truncated bivariate normal distribution generated by rtmvnorm().

Gibbs sampling produces a Markov chain which finally converges to a stationary target distribution. Generally, the convergence of MCMC has to be checked using diagnostic tools (see for instance the coda package from Plummer et al. (2009)). The starting value of the chain can be set as an optional parameter start.value. Like all MCMC methods, the first iterates of the chain will not have the exact target distribution and are strongly dependent on the start value. To reduce this kind of problem, the first iterations are considered as a burn-in period which a user might want to discard. Using burn.in.samples=100, one can drop the first 100 samples of a chain as burn-in samples.

The major advantage of Gibbs sampling is that it accepts all proposals and is therefore not affected by a poor acceptance rate α. As a major drawback, random samples produced by Gibbs sampling are not independent, but correlated. The degree of correlation depends on the covariance matrix Σ as well as on the dimensionality, and should be checked using autocorrelation or cross-correlation plots (e.g. acf() or ccf() in the stats package).

> acf(X)

Taking only a nonsuccessive subsequence of the Markov chain, say every k-th sample, can greatly reduce the autocorrelation among random points, as shown in the next code example. This technique, called "thinning", is available via the thinning=k argument:

> X2 <- rtmvnorm(n=10000, mean=mu,
>   sigma=sigma, lower=a, upper=b,
>   algorithm="gibbs", burn.in.samples=100,
>   thinning = 5)
> acf(X2)

While the samples X generated by Gibbs sampling exhibit cross-correlation for both variates up to lag 1, this correlation vanished in X2.

Table 1 shows a comparison of rejection sampling and Gibbs sampling in terms of speed. Both algorithms scale with the number of samples N and the number of dimensions m, but Gibbs sampling is independent of the acceptance rate α and even works for very small α. On the other hand, the Gibbs sampler scales linearly with thinning and requires extensive loops which are slow in an interpreted language like R. From package version 0.8, we reimplemented the Gibbs sampler in Fortran. This compiled code is now competitive with rejection sampling. The deprecated R code is still accessible in the package via algorithm="gibbsR".
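For example, the deprecated pure-R sampler can still be called with the same arguments as above for comparison:

> # same call as before, but using the deprecated pure-R Gibbs sampler
> X3 <- rtmvnorm(n=10000, mean=mu,
>   sigma=sigma, lower=a, upper=b,
>   algorithm="gibbsR")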

Figure 2: Contour plot for the bivariate truncated density function obtained by dtmvnorm().

Marginal densities

The joint density function f(x, µ, Σ, a, b) for the truncated variables can be computed using dtmvnorm(). Figure 2 shows a contour plot of the bivariate density in our example.


Algorithm                        Samples       m = 2                  m = 10                 m = 10
                                               Rs = ∏ i=1..2 (−1,1]   Rs = ∏ i=1..10 (−1,1]  Rs = ∏ i=1..10 (−4,−3]
                                               α = 0.56               α = 0.267              α = 5.6 · 10⁻⁶

Rejection Sampling               N = 10000       5600    0.19 sec       2670    0.66 sec          0         ∞
                                 N = 100000     56000    1.89 sec      26700    5.56 sec          1         ∞
Gibbs Sampling                   N = 10000      10000    0.75 sec      10000    3.47 sec      10000   3.44 sec
(R code, no thinning)            N = 100000    100000    7.18 sec     100000   34.58 sec     100000  34.58 sec
Gibbs Sampling                   N = 10000      10000    0.03 sec      10000    0.13 sec      10000   0.12 sec
(Fortran code, no thinning)      N = 100000    100000    0.22 sec     100000    1.25 sec     100000   1.21 sec
Gibbs Sampling                   N = 10000      10000    0.22 sec      10000    1.22 sec      10000   1.21 sec
(Fortran code and thinning=10)   N = 100000    100000    2.19 sec     100000   12.24 sec     100000  12.19 sec

Table 1: Comparison of rejection sampling and Gibbs sampling for 2-dimensional and 10-dimensional TN with support Rs = ∏ i=1..2 (−1,1], Rs = ∏ i=1..10 (−1,1] and Rs = ∏ i=1..10 (−4,−3] respectively: number of generated samples, acceptance rate α and computation time measured on an Intel® Core™ Duo CPU at 2.67 GHz.

However, more often marginal densities of one or two variables will be of interest to users. For multinormal variates the marginal distributions are again normal. This property does not hold for the truncated case: the marginals of truncated multinormal variates are not truncated normal in general. An explicit formula for calculating the one-dimensional marginal density f(xi) is given in the paper of Cartinhour (1990). We have implemented this algorithm in the method dtmvnorm.marginal(), which, for convenience, is wrapped in dtmvnorm() via a margin=i argument. Figure 3 shows the kernel density estimate and the one-dimensional marginal for both x1 and x2 calculated using the dtmvnorm(..., margin=i), i = 1, 2 call:

> x <- seq(-1, 0.5, by=0.1)
> fx <- dtmvnorm(x, mu, sigma,
>   lower=a, upper=b, margin=1)
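As a sketch, a comparison like the left panel of Figure 3 can be produced by overlaying this computed marginal on a kernel density estimate of the samples X generated earlier (the plot settings are our own choices):

> # kernel density estimate of the samples vs. the exact marginal
> plot(density(X[,1]), main="Marginal density x1")
> lines(x, fx, lty=2, lwd=2)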

The bivariate marginal density f(xq, xr) for xq and xr (m > 2, q ≠ r) is implemented based on the works of Leppard and Tallis (1989) and Manjunath and Wilhelm (2009) in the method dtmvnorm.marginal2() which, again for convenience, is just as good as dtmvnorm(..., margin=c(q,r)):

> fx <- dtmvnorm(x=c(0.5, 1),
>   mu, sigma,
>   lower=a, upper=b, margin=c(1,2))

The help page for this function in the package contains an example of how to produce a density contour plot for a trivariate setting similar to the one shown in Figure 2.
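A two-dimensional analogue is sketched below; it evaluates the bivariate density from our running example on a grid and assumes that dtmvnorm() accepts a matrix of points, one per row:

> # evaluate the truncated density on a grid and draw contour lines
> x1 <- seq(-2, 3, by=0.1)
> x2 <- seq(-3, 3, by=0.1)
> fgrid <- outer(x1, x2, function(u, v)
>   dtmvnorm(cbind(u, v), mu, sigma, lower=a, upper=b))
> contour(x1, x2, fgrid)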

Computation of moments

The computation of the first and second moments (mean vector µ* = E[x] and covariance matrix Σ* respectively) is not trivial for the truncated case, since they are obviously not the same as µ and Σ from the parametrization of TN(µ, Σ, a, b). The moment-generating function was given first by Tallis (1961), but his formulas treated only the special case of a singly-truncated distribution with a correlation matrix R. Later works, especially Lee (1979), generalized the computation of moments for a covariance matrix Σ and gave recurrence relations between moments, but did not give formulas for the moments of the double-truncated case. We presented the computation of moments for the general double-truncated case in a working paper (Manjunath and Wilhelm, 2009) and implemented the algorithm in the method mtmvnorm(), which can be used as

> moments <- mtmvnorm(mean=mu, sigma=sigma,
>   lower=a, upper=b)

The moment calculation for our example results in µ* = (−0.122, 0)^T and covariance matrix

Σ* = ( 0.165  0.131
       0.131  1.458 )

To compare these results with the sample moments, use

> colMeans(X)
> cov(X)

It can be seen in this example that truncation can significantly reduce the variance and change the covariance between variables.

Estimation

Unlike our example, in many settings µ and Σ are unknown and must be estimated from the data. Estimation of (µ, Σ) when truncation points a and b are known can typically be done by either Maximum-Likelihood, Instrumental Variables (Amemiya (1973), Amemiya (1974), Lee (1979), Lee (1983)) or in a Bayesian context (Griffiths (2002)). We are planning to include these estimation approaches in the package in future releases.


Figure 3: Marginal densities for x1 and x2 obtained from kernel density estimation from random samples generated by rtmvnorm() and from direct calculation using dtmvnorm(..., margin=1) and dtmvnorm(..., margin=2) respectively.

Summary

The package tmvtnorm provides methods for simulating from truncated multinormal distributions (rejection and Gibbs sampling), calculation of marginal densities, and also calculation of the mean and covariance of the truncated variables. We hope that many useRs will find this package useful when dealing with truncated normal variables.

Bibliography

T. Amemiya. Regression analysis when the dependent variable is truncated normal. Econometrica, 41:997–1016, 1973.

T. Amemiya. Multivariate regression and simultaneous equations models when the dependent variables are truncated normal. Econometrica, 42:999–1012, 1974.

J. Cartinhour. One-dimensional marginal density functions of a truncated multivariate normal density function. Communications in Statistics - Theory and Methods, 19:197–203, 1990.

A. Genz and F. Bretz. Computation of Multivariate Normal and t Probabilities, volume 195 of Lecture Notes in Statistics. Springer-Verlag, Heidelberg, 2009.

A. Genz, F. Bretz, T. Miwa, X. Mi, F. Leisch, F. Scheipl, and T. Hothorn. mvtnorm: Multivariate Normal and t Distributions, 2009. URL http://CRAN.R-project.org/package=mvtnorm. R package version 0.9-8.

W. Griffiths. A Gibbs' sampler for the parameters of a truncated multivariate normal distribution, 2002. URL http://ideas.repec.org/p/mlb/wpaper/856.html.

W. C. Horrace. Some results on the multivariate truncated normal distribution. Journal of Multivariate Analysis, 94:209–221, 2005.

J. H. Kotecha and P. M. Djuric. Gibbs sampling approach for generation of truncated multivariate Gaussian random variables. IEEE Computer Society, pages 1757–1760, 1999. http://dx.doi.org/10.1109/ICASSP.1999.756335.

L.-F. Lee. On the first and second moments of the truncated multi-normal distribution and a simple estimator. Economics Letters, 3:165–169, 1979.

L.-F. Lee. The determination of moments of the dou-bly truncated multivariate tobit model. EconomicsLetters, 11:245–250, 1983.

P. Leppard and G. M. Tallis. Evaluation of the meanand covariance of the truncated multinormal. Ap-plied Statistics, 38:543–553, 1989.

B. G. Manjunath and S. Wilhelm. Moments calcu-lation for the double truncated multivariate nor-mal density. Working Paper, 2009. URL http://ssrn.com/abstract=1472153.

X. Mi, T. Miwa, and T. Hothorn. mvtnorm: New nu-merical algorithm for multivariate normal proba-bilities. R Journal, 1:37–39, 2009.

The R Journal Vol. 2/1, June 2010 ISSN 2073-4859

Page 29: R journal 2010-1

CONTRIBUTED RESEARCH ARTICLES 29

M. Plummer, N. Best, K. Cowles, and K. Vines. coda:Output analysis and diagnostics for MCMC, 2009. Rpackage version 0.13-4.

G. M. Tallis. The moment generating function ofthe truncated multinormal distribution. Journal ofthe Royal Statistical Society, Series B, 23(1):223–229,1961.

S. Wilhelm and B. G. Manjunath. tmvtnorm: TruncatedMultivariate Normal Distribution, 2010. URL http://CRAN.R-project.org/package=tmvtnorm. Rpackage version 0.9-2.

Stefan Wilhelm
Department of Finance
University of Basel, Switzerland
[email protected]

B. G. Manjunath
Department of Mathematics
University of Siegen, Germany
[email protected]


neuralnet: Training of Neural Networks
by Frauke Günther and Stefan Fritsch

Abstract Artificial neural networks are applied in many situations. neuralnet is built to train multi-layer perceptrons in the context of regression analyses, i.e. to approximate functional relationships between covariates and response variables. Thus, neural networks are used as extensions of generalized linear models.
neuralnet is a very flexible package. The backpropagation algorithm and three versions of resilient backpropagation are implemented and it provides a custom choice of activation and error function. An arbitrary number of covariates and response variables as well as of hidden layers can theoretically be included.
The paper gives a brief introduction to multi-layer perceptrons and resilient backpropagation and demonstrates the application of neuralnet using the data set infert, which is contained in the R distribution.

Introduction

In many situations, the functional relationship between covariates (also known as input variables) and response variables (also known as output variables) is of great interest. For instance, when modeling complex diseases, potential risk factors and their effects on the disease are investigated to identify risk factors that can be used to develop prevention or intervention strategies. Artificial neural networks can be applied to approximate any complex functional relationship. Unlike generalized linear models (GLM, McCullagh and Nelder, 1983), it is not necessary to prespecify the type of relationship between covariates and response variables as, for instance, a linear combination. This makes artificial neural networks a valuable statistical tool. They are in particular direct extensions of GLMs and can be applied in a similar manner. Observed data are used to train the neural network and the neural network learns an approximation of the relationship by iteratively adapting its parameters.

The package neuralnet (Fritsch and Günther, 2008) contains a very flexible function to train feed-forward neural networks, i.e. to approximate a functional relationship in the above situation. It can theoretically handle an arbitrary number of covariates and response variables as well as of hidden layers and hidden neurons, even though the computational costs can increase exponentially with higher order of complexity. This can cause an early stop of the iteration process since the maximum number of iteration steps, which can be defined by the user, is reached before the algorithm converges. In addition, the package provides functions to visualize the results or in general to facilitate the usage of neural networks. For instance, the function compute can be applied to calculate predictions for new covariate combinations.

There are two other packages that deal with artificial neural networks at the moment: nnet (Venables and Ripley, 2002) and AMORE (Limas et al., 2007). nnet provides the opportunity to train feed-forward neural networks with traditional backpropagation, and in AMORE, the TAO robust neural network algorithm is implemented. neuralnet was built to train neural networks in the context of regression analyses. Thus, resilient backpropagation is used since this algorithm is still one of the fastest algorithms for this purpose (e.g. Schiffmann et al., 1994; Rocha et al., 2003; Kumar and Zhang, 2006; Almeida et al., 2010). Three different versions are implemented and the traditional backpropagation is included for comparison purposes. Due to the custom choice of activation and error function, the package is very flexible. The user is able to use several hidden layers, which can reduce the computational costs by including an extra hidden layer and hence reducing the number of neurons per layer. We successfully used this package to model complex diseases, i.e. different structures of biological gene-gene interactions (Günther et al., 2009). Summarizing, neuralnet closes a gap concerning the provided algorithms for training neural networks in R.

To facilitate the usage of this package for new users of artificial neural networks, a brief introduction to neural networks and the learning algorithms implemented in neuralnet is given before describing its application.

Multi-layer perceptrons

The package neuralnet focuses on multi-layer perceptrons (MLP, Bishop, 1995), which are well applicable when modeling functional relationships. The underlying structure of an MLP is a directed graph, i.e. it consists of vertices and directed edges, in this context called neurons and synapses. The neurons are organized in layers, which are usually fully connected by synapses. In neuralnet, a synapse can only connect to subsequent layers. The input layer consists of all covariates in separate neurons and the output layer consists of the response variables. The layers in between are referred to as hidden layers, as they are not directly observable. Input layer and hidden layers include a constant neuron relating to intercept synapses, i.e. synapses that are not directly influenced by any covariate. Figure 1 gives an example of a neural network with one hidden layer that consists of three hidden neurons. This neural network models the relationship between the two covariates A and B and the response variable Y. neuralnet theoretically allows the inclusion of arbitrary numbers of covariates and response variables. However, convergence difficulties can occur when a huge number of both covariates and response variables is used.

Figure 1: Example of a neural network with two input neurons (A and B), one output neuron (Y) and one hidden layer consisting of three hidden neurons.

To each of the synapses, a weight is attached indicating the effect of the corresponding neuron, and all data pass the neural network as signals. The signals are processed first by the so-called integration function combining all incoming signals and second by the so-called activation function transforming the output of the neuron.

The simplest multi-layer perceptron (also known as perceptron) consists of an input layer with n covariates and an output layer with one output neuron. It calculates the function

$$o(\mathbf{x}) = f\left(w_0 + \sum_{i=1}^{n} w_i x_i\right) = f\left(w_0 + \mathbf{w}^T \mathbf{x}\right),$$

where w_0 denotes the intercept, w = (w_1, ..., w_n) the vector consisting of all synaptic weights without the intercept, and x = (x_1, ..., x_n) the vector of all covariates. The function is mathematically equivalent to that of a GLM with link function f^(-1). Therefore, all calculated weights are in this case equivalent to the regression parameters of the GLM.

To increase the modeling flexibility, hidden layers can be included. However, Hornik et al. (1989) showed that one hidden layer is sufficient to model any piecewise continuous function. Such an MLP with a hidden layer consisting of J hidden neurons calculates the following function:

$$o(\mathbf{x}) = f\left(w_0 + \sum_{j=1}^{J} w_j \cdot f\left(w_{0j} + \sum_{i=1}^{n} w_{ij} x_i\right)\right) = f\left(w_0 + \sum_{j=1}^{J} w_j \cdot f\left(w_{0j} + \mathbf{w}_j^T \mathbf{x}\right)\right),$$

where w_0 denotes the intercept of the output neuron and w_{0j} the intercept of the jth hidden neuron. Additionally, w_j denotes the synaptic weight corresponding to the synapse starting at the jth hidden neuron and leading to the output neuron, w_j = (w_{1j}, ..., w_{nj}) the vector of all synaptic weights corresponding to the synapses leading to the jth hidden neuron, and x = (x_1, ..., x_n) the vector of all covariates. This shows that neural networks are direct extensions of GLMs. However, the parameters, i.e. the weights, cannot be interpreted in the same way anymore.

Formally stated, all hidden neurons and output neurons calculate an output f(g(z_0, z_1, ..., z_k)) = f(g(z)) from the outputs of all preceding neurons z_0, z_1, ..., z_k, where g : R^(k+1) → R denotes the integration function and f : R → R the activation function. The neuron z_0 ≡ 1 is the constant one belonging to the intercept. The integration function is often defined as $g(\mathbf{z}) = w_0 z_0 + \sum_{i=1}^{k} w_i z_i = w_0 + \mathbf{w}^T \mathbf{z}$. The activation function f is usually a bounded nondecreasing nonlinear and differentiable function such as the logistic function ($f(u) = \frac{1}{1+e^{-u}}$) or the hyperbolic tangent. It should be chosen in relation to the response variable as is the case in GLMs. The logistic function is, for instance, appropriate for binary response variables since it maps the output of each neuron to the interval [0,1]. At the moment, neuralnet uses the same integration as well as activation function for all neurons.
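To make this notation concrete, the following minimal sketch implements the forward pass of an MLP with one hidden layer, using the logistic activation and the integration function defined above; all object names and weight values are our own illustrative choices, independent of the package internals.

logistic <- function(u) 1/(1 + exp(-u))
mlp.output <- function(x, w0, w, w0j, W) {
  # W is an n x J matrix; column j holds the weights leading to hidden neuron j
  hidden <- logistic(w0j + as.vector(t(W) %*% x))  # outputs of the J hidden neurons
  logistic(w0 + sum(w * hidden))                   # output neuron
}
# e.g. two covariates and three hidden neurons:
mlp.output(x=c(0.5, -1), w0=0.1, w=c(1, -2, 0.5),
           w0j=c(0, 0.2, -0.1), W=matrix(rnorm(6), nrow=2))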

Supervised learning

Neural networks are fitted to the data by learning algorithms during a training process. neuralnet focuses on supervised learning algorithms. These learning algorithms are characterized by the usage of a given output that is compared to the predicted output and by the adaptation of all parameters according to this comparison. The parameters of a neural network are its weights. All weights are usually initialized with random values drawn from a standard normal distribution. During an iterative training process, the following steps are repeated:

• The neural network calculates an output o(x) for given inputs x and current weights. If the training process is not yet completed, the predicted output o will differ from the observed output y.

• An error function E, like the sum of squared errors (SSE)

$$E = \frac{1}{2} \sum_{l=1}^{L} \sum_{h=1}^{H} (o_{lh} - y_{lh})^2$$

or the cross-entropy

$$E = -\sum_{l=1}^{L} \sum_{h=1}^{H} \left( y_{lh} \log(o_{lh}) + (1 - y_{lh}) \log(1 - o_{lh}) \right),$$

measures the difference between predicted and observed output, where l = 1, ..., L indexes the observations, i.e. given input-output pairs, and h = 1, ..., H the output nodes.

• All weights are adapted according to the rule of a learning algorithm.

The process stops if a pre-specified criterion is fulfilled, e.g. if all absolute partial derivatives of the error function with respect to the weights (∂E/∂w) are smaller than a given threshold. A widely used learning algorithm is the resilient backpropagation algorithm.
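Both error functions are easily written out in R; the toy matrices below are ours, chosen only to show the dimensions involved (L observations in rows, H output nodes in columns).

o <- matrix(c(0.2, 0.8, 0.6, 0.4), nrow=2)  # predicted outputs in (0,1)
y <- matrix(c(0,   1,   1,   0),   nrow=2)  # observed outputs
sse <- 0.5 * sum((o - y)^2)
ce  <- -sum(y*log(o) + (1 - y)*log(1 - o))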

Backpropagation and resilient backpropagation

The resilient backpropagation algorithm is based on the traditional backpropagation algorithm that modifies the weights of a neural network in order to find a local minimum of the error function. Therefore, the gradient of the error function (dE/dw) is calculated with respect to the weights in order to find a root. In particular, the weights are modified going in the opposite direction of the partial derivatives until a local minimum is reached. This basic idea is roughly illustrated in Figure 2 for a univariate error function.

Figure 2: Basic idea of the backpropagation algorithm illustrated for a univariate error function E(w).

If the partial derivative is negative, the weight is increased (left part of the figure); if the partial derivative is positive, the weight is decreased (right part of the figure). This ensures that a local minimum is reached. All partial derivatives are calculated using the chain rule since the calculated function of a neural network is basically a composition of integration and activation functions. A detailed explanation is given in Rojas (1996).

neuralnet provides the opportunity to switch between backpropagation, resilient backpropagation with (Riedmiller, 1994) or without weight backtracking (Riedmiller and Braun, 1993) and the modified globally convergent version by Anastasiadis et al. (2005). All algorithms try to minimize the error function by adding a learning rate to the weights going in the opposite direction of the gradient. Unlike the traditional backpropagation algorithm, a separate learning rate η_k, which can be changed during the training process, is used for each weight in resilient backpropagation. This solves the problem of defining an overall learning rate that is appropriate for the whole training process and the entire network. Additionally, instead of the magnitude of the partial derivatives, only their sign is used to update the weights. This guarantees an equal influence of the learning rate over the entire network (Riedmiller and Braun, 1993). The weights are adjusted by the following rule

$$w_k^{(t+1)} = w_k^{(t)} - \eta_k^{(t)} \cdot \operatorname{sign}\left(\frac{\partial E^{(t)}}{\partial w_k^{(t)}}\right),$$

as opposed to

$$w_k^{(t+1)} = w_k^{(t)} - \eta \cdot \frac{\partial E^{(t)}}{\partial w_k^{(t)}},$$

in traditional backpropagation, where t indexes the iteration steps and k the weights.

In order to speed up convergence in shallow areas, the learning rate η_k will be increased if the corresponding partial derivative keeps its sign. On the contrary, it will be decreased if the partial derivative of the error function changes its sign, since a changing sign indicates that the minimum was missed due to a too large learning rate. Weight backtracking is a technique of undoing the last iteration and adding a smaller value to the weight in the next step. Without the usage of weight backtracking, the algorithm can jump over the minimum several times. For example, the pseudocode of resilient backpropagation with weight backtracking is given by (Riedmiller and Braun, 1993)

for all weights{
  if (grad.old*grad>0){
    delta := min(delta*eta.plus, delta.max)
    weights := weights - sign(grad)*delta
    grad.old := grad
  }
  else if (grad.old*grad<0){
    weights := weights + sign(grad.old)*delta
    delta := max(delta*eta.minus, delta.min)
    grad.old := 0
  }
  else if (grad.old*grad=0){
    weights := weights - sign(grad)*delta
    grad.old := grad
  }
}

while that of the regular backpropagation is given by

for all weights{
  weights := weights - grad*delta
}
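For readers who prefer runnable code, here is a literal R transcription of the rprop+ update above for a single weight. The state handling is our own, and the default values for eta.plus, eta.minus and the delta bounds follow the common recommendations of Riedmiller and Braun (1993); neuralnet's internal implementation may differ in detail.

rprop.step <- function(w, grad, grad.old, delta,
                       eta.plus=1.2, eta.minus=0.5,
                       delta.max=50, delta.min=1e-6) {
  if (grad.old*grad > 0) {
    delta <- min(delta*eta.plus, delta.max)
    w <- w - sign(grad)*delta
    grad.old <- grad
  } else if (grad.old*grad < 0) {
    w <- w + sign(grad.old)*delta   # weight backtracking: undo the last step
    delta <- max(delta*eta.minus, delta.min)
    grad.old <- 0
  } else {
    w <- w - sign(grad)*delta
    grad.old <- grad
  }
  list(w=w, delta=delta, grad.old=grad.old)
}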

The globally convergent version introduced by Anastasiadis et al. (2005) performs a resilient backpropagation with an additional modification of one learning rate in relation to all other learning rates. It is either the learning rate associated with the smallest absolute partial derivative or the smallest learning rate (indexed with i) that is changed according to

$$\eta_i^{(t)} = -\frac{\sum_{k;\, k \neq i} \eta_k^{(t)} \cdot \frac{\partial E^{(t)}}{\partial w_k^{(t)}} + \delta}{\frac{\partial E^{(t)}}{\partial w_i^{(t)}}},$$

if $\frac{\partial E^{(t)}}{\partial w_i^{(t)}} \neq 0$ and $0 < \delta \ll \infty$. For further details see Anastasiadis et al. (2005).

Using neuralnet

neuralnet depends on two other packages: grid and MASS (Venables and Ripley, 2002). Its usage leans towards that of functions dealing with regression analyses like lm and glm. As essential arguments, a formula in terms of response variables ~ sum of covariates and a data set containing covariates and response variables have to be specified. Default values are defined for all other parameters (see next subsection). We use the data set infert that is provided by the package datasets to illustrate its application. This data set contains data of a case-control study that investigated infertility after spontaneous and induced abortion (Trichopoulos et al., 1976). The data set consists of 248 observations: 83 women who were infertile (cases) and 165 women who were not infertile (controls). It includes amongst others the variables age, parity, induced, and spontaneous. The variables induced and spontaneous denote the number of prior induced and spontaneous abortions, respectively. Both variables take possible values 0, 1, and 2 relating to 0, 1, and 2 or more prior abortions. The age in years is given by the variable age and the number of births by parity.
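Since infert ships with every R installation, a quick look at the variables used below is safe to run:

> data(infert)
> table(infert$case)    # 165 controls (0) and 83 cases (1)
> summary(infert[, c("age", "parity",
+                    "induced", "spontaneous")])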

Training of neural networks

The function neuralnet used for training a neural network provides the opportunity to define the required number of hidden layers and hidden neurons according to the needed complexity. The complexity of the calculated function increases with the addition of hidden layers or hidden neurons. The default value is one hidden layer with one hidden neuron. The most important arguments of the function are the following:

• formula, a symbolic description of the model to be fitted (see above). No default.

• data, a data frame containing the variables specified in formula. No default.

• hidden, a vector specifying the number of hidden layers and hidden neurons in each layer. For example the vector (3,2,1) induces a neural network with three hidden layers, the first one with three, the second one with two and the third one with one hidden neuron. Default: 1.

• threshold, a numeric value specifying the threshold for the partial derivatives of the error function as stopping criterion. Default: 0.01.

• rep, the number of repetitions for the training process. Default: 1.

• startweights, a vector containing prespecified starting values for the weights. Default: random numbers drawn from the standard normal distribution.

• algorithm, a string containing the algorithm type. Possible values are "backprop", "rprop+", "rprop-", "sag", or "slr". "backprop" refers to traditional backpropagation, "rprop+" and "rprop-" refer to resilient backpropagation with and without weight backtracking, and "sag" and "slr" refer to the modified globally convergent algorithm (grprop). "sag" and "slr" define the learning rate that is changed according to all others: "sag" refers to the smallest absolute derivative, "slr" to the smallest learning rate. Default: "rprop+".

• err.fct, a differentiable error function. The strings "sse" and "ce" can be used, which refer to 'sum of squared errors' and 'cross entropy'. Default: "sse".

• act.fct, a differentiable activation function. The strings "logistic" and "tanh" are possible for the logistic function and the hyperbolic tangent. Default: "logistic".

• linear.output, logical. If act.fct should not be applied to the output neurons, linear.output has to be TRUE. Default: TRUE.

• likelihood, logical. If the error function is equal to the negative log-likelihood function, likelihood has to be TRUE. Akaike's Information Criterion (AIC, Akaike, 1973) and the Bayes Information Criterion (BIC, Schwarz, 1978) will then be calculated. Default: FALSE.

• exclude, a vector or matrix specifying weights that should be excluded from training. A matrix with n rows and three columns will exclude n weights, where the first column indicates the layer, the second column the input neuron of the weight, and the third column the output neuron of the weight. If given as a vector, the exact numbering has to be known. The numbering can be checked using the provided plot or the saved starting weights. Default: NULL.

• constant.weights, a vector specifying the values of weights that are excluded from training and treated as fixed. Default: NULL.
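As a purely illustrative combination of these arguments (this call is ours, not part of the original analysis), a network with two hidden layers and AIC/BIC reporting for the infert model used below could be requested as:

> nn.demo <- neuralnet(
+     case~age+parity+induced+spontaneous,
+     data=infert, hidden=c(3,2), rep=3,
+     err.fct="ce", linear.output=FALSE,
+     likelihood=TRUE)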


The usage of neuralnet is described by modeling the relationship between the case-control status (case) as response variable and the four covariates age, parity, induced and spontaneous. Since the response variable is binary, the activation function could be chosen as the logistic function (default) and the error function as cross-entropy (err.fct="ce"). Additionally, the item linear.output should be stated as FALSE to ensure that the output is mapped by the activation function to the interval [0,1]. The number of hidden neurons should be determined in relation to the needed complexity. A neural network with, for example, two hidden neurons is trained by the following statements:

> library(neuralnet)
Loading required package: grid
Loading required package: MASS
>
> nn <- neuralnet(
+     case~age+parity+induced+spontaneous,
+     data=infert, hidden=2, err.fct="ce",
+     linear.output=FALSE)
> nn
Call:
neuralnet(formula = case~age+parity+induced+spontaneous,
    data = infert, hidden = 2, err.fct = "ce",
    linear.output = FALSE)

1 repetition was calculated.

  Error       Reached Threshold Steps
1 125.2126851 0.008779243419    5254

Basic information about the training process and the trained neural network is saved in nn. This includes all information that has to be known to reproduce the results, such as the starting weights. Important values are the following:

• net.result, a list containing the overall result, i.e. the output, of the neural network for each replication.

• weights, a list containing the fitted weights of the neural network for each replication.

• generalized.weights, a list containing the generalized weights of the neural network for each replication.

• result.matrix, a matrix containing the error, reached threshold, needed steps, AIC and BIC (computed if likelihood=TRUE) and estimated weights for each replication. Each column represents one replication.

• startweights, a list containing the starting weights for each replication.

A summary of the main results is provided by nn$result.matrix:

> nn$result.matrix
                                       1
error                   125.212685099732
reached.threshold         0.008779243419
steps                  5254.000000000000
Intercept.to.1layhid1     5.593787533788
age.to.1layhid1          -0.117576380283
parity.to.1layhid1        1.765945780047
induced.to.1layhid1      -2.200113693672
spontaneous.to.1layhid1  -3.369491912508
Intercept.to.1layhid2     1.060701883258
age.to.1layhid2           2.925601414213
parity.to.1layhid2        0.259809664488
induced.to.1layhid2      -0.120043540527
spontaneous.to.1layhid2  -0.033475146593
Intercept.to.case         0.722297491596
1layhid.1.to.case        -5.141324077052
1layhid.2.to.case         2.623245311046

The training process needed 5254 steps until all absolute partial derivatives of the error function were smaller than 0.01 (the default threshold). The estimated weights range from −5.14 to 5.59. For instance, the intercepts of the first hidden layer are 5.59 and 1.06, and the four weights leading to the first hidden neuron are estimated as −0.12, 1.77, −2.20, and −3.37 for the covariates age, parity, induced and spontaneous, respectively. If the error function is equal to the negative log-likelihood function, the error refers to the likelihood, as is used for example to calculate Akaike's Information Criterion (AIC).

The given data is saved in nn$covariate and nn$response as well as in nn$data for the whole data set including unused variables. The output of the neural network, i.e. the fitted values o(x), is provided by nn$net.result:

> out <- cbind(nn$covariate,
+              nn$net.result[[1]])
> dimnames(out) <- list(NULL,
+     c("age","parity","induced",
+       "spontaneous","nn-output"))
> head(out)

     age parity induced spontaneous    nn-output
[1,]  26      6       1           2 0.1519579877
[2,]  42      1       1           0 0.6204480608
[3,]  39      6       2           0 0.1428325816
[4,]  34      4       2           0 0.1513351888
[5,]  35      3       1           1 0.3516163154
[6,]  36      4       2           1 0.4904344475

In this case, the object nn$net.result is a list consisting of only one element relating to one calculated replication. If more than one replication were calculated, the outputs would be saved each in a separate list element. This approach is the same for all values that change with each replication, apart from result.matrix, which is saved as a matrix with one column for each replication.

To compare the results, neural networks are trained with the same parameter setting as above using neuralnet with algorithm="backprop" and the package nnet.


> nn.bp <- neuralnet(
+     case~age+parity+induced+spontaneous,
+     data=infert, hidden=2, err.fct="ce",
+     linear.output=FALSE,
+     algorithm="backprop",
+     learningrate=0.01)
> nn.bp
Call:
neuralnet(formula = case~age+parity+induced+spontaneous,
    data = infert, hidden = 2, learningrate = 0.01,
    algorithm = "backprop", err.fct = "ce",
    linear.output = FALSE)

1 repetition was calculated.

  Error      Reached Threshold Steps
1 158.085556 0.008087314995    4
>
>
> nn.nnet <- nnet(
+     case~age+parity+induced+spontaneous,
+     data=infert, size=2, entropy=T,
+     abstol=0.01)
# weights:  13
initial  value 158.121035
final  value 158.085463
converged

nn.bp and nn.nnet show equal results. Both training processes last only a very few iteration steps and the error is approximately 158. Thus in this little comparison, the model fit is less satisfying than that achieved by resilient backpropagation.

neuralnet includes the calculation of generalized weights as introduced by Intrator and Intrator (2001). The generalized weight w̃_i is defined as the contribution of the ith covariate to the log-odds:

$$\tilde{w}_i = \frac{\partial \log\left(\frac{o(\mathbf{x})}{1-o(\mathbf{x})}\right)}{\partial x_i}.$$

The generalized weight expresses the effect of each covariate x_i and thus has an analogous interpretation as the ith regression parameter in regression models. However, the generalized weight depends on all other covariates. Its distribution indicates whether the effect of the covariate is linear, since a small variance suggests a linear effect (Intrator and Intrator, 2001). They are saved in nn$generalized.weights and are given in the following format (rounded values)

> head(nn$generalized.weights[[1]])
       [,1]       [,2]      [,3]      [,4]
1 0.0088556 -0.1330079 0.1657087 0.2537842
2 0.1492874 -2.2422321 2.7934978 4.2782645
3 0.0004489 -0.0067430 0.0084008 0.0128660
4 0.0083028 -0.1247051 0.1553646 0.2379421
5 0.1071413 -1.6092161 2.0048511 3.0704457
6 0.1360035 -2.0427123 2.5449249 3.8975730

The columns refer to the four covariates age (j = 1), parity (j = 2), induced (j = 3), and spontaneous (j = 4), and a generalized weight is given for each observation even though they are equal for each covariate combination.
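As a numerical illustration of this definition (not the package's own computation), a generalized weight can be approximated by a finite-difference derivative of the fitted log-odds. The helper below uses the compute function described in the 'Additional features' section; both the helper and the step size eps are ours.

gw.numeric <- function(nn, x, i, eps=1e-4) {
  logodds <- function(x) {
    o <- compute(nn, covariate=matrix(x, nrow=1))$net.result
    log(o/(1 - o))
  }
  x.eps <- x
  x.eps[i] <- x.eps[i] + eps
  as.numeric(logodds(x.eps) - logodds(x))/eps
}
# compare with the first row of nn$generalized.weights[[1]]:
gw.numeric(nn, x=c(26, 6, 1, 2), i=4)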

Visualizing the results

The results of the training process can be visualized by two different plots. First, the trained neural network can simply be plotted by

> plot(nn)

The resulting plot is given in Figure 3.

Figure 3: Plot of a trained neural network including trained synaptic weights and basic information about the training process.

It reflects the structure of the trained neural network, i.e. the network topology. The plot includes by default the trained synaptic weights and all intercepts, as well as basic information about the training process like the overall error and the number of steps needed to converge. Especially for larger neural networks, the size of the plot and that of each neuron can be determined using the parameters dimension and radius, respectively.

The second possibility to visualize the results is to plot generalized weights. gwplot uses the calculated generalized weights provided by nn$generalized.weights and can be used by the following statements:

> par(mfrow=c(2,2))
> gwplot(nn, selected.covariate="age",
+        min=-2.5, max=5)
> gwplot(nn, selected.covariate="parity",
+        min=-2.5, max=5)
> gwplot(nn, selected.covariate="induced",
+        min=-2.5, max=5)
> gwplot(nn, selected.covariate="spontaneous",
+        min=-2.5, max=5)

The corresponding plot is shown in Figure 4.

Figure 4: Plots of generalized weights with respect to each covariate.

The generalized weights are given for all covariates within the same range. The distribution of the generalized weights suggests that the covariate age has no effect on the case-control status, since all generalized weights are nearly zero, and that at least the two covariates induced and spontaneous have a non-linear effect, since the variance of their generalized weights is overall greater than one.

Additional features

The compute function

compute calculates and summarizes the output of each neuron, i.e. all neurons in the input, hidden and output layer. Thus, it can be used to trace all signals passing through the neural network for given covariate combinations. This helps to interpret the network topology of a trained neural network. It can also easily be used to calculate predictions for new covariate combinations. A neural network is trained with a training data set consisting of known input-output pairs. It learns an approximation of the relationship between inputs and outputs and can then be used to predict outputs o(x_new) relating to new covariate combinations x_new. The function compute simplifies this calculation. It automatically redefines the structure of the given neural network and calculates the output for arbitrary covariate combinations.

To stay with the example, predicted outputs can be calculated, for instance, for the missing combinations with age=22, parity=1, induced ≤ 1, and spontaneous ≤ 1. They are provided by new.output$net.result:

> new.output <- compute(nn,
+     covariate=matrix(c(22,1,0,0,
+                        22,1,1,0,
+                        22,1,0,1,
+                        22,1,1,1),
+                      byrow=TRUE, ncol=4))
> new.output$net.result
          [,1]
[1,] 0.1477097
[2,] 0.1929026
[3,] 0.3139651
[4,] 0.8516760

This means that the predicted probability of being a case given the mentioned covariate combinations, i.e. o(x), increases in this example with the number of prior abortions.

The confidence.interval function

The weights of a neural network follow a multivariate normal distribution if the network is identified (White, 1989). A neural network is identified if it does not include irrelevant neurons, neither in the input layer nor in the hidden layers. An irrelevant neuron in the input layer can be, for instance, a covariate that has no effect or that is a linear combination of other included covariates. If this restriction is fulfilled and if the error function equals the negative log-likelihood, a confidence interval can be calculated for each weight. The neuralnet package provides a function to calculate these confidence intervals regardless of whether all restrictions are fulfilled. Therefore, the user has to be careful when interpreting the results.

Since the covariate age has no effect on the outcome and the related neuron is thus irrelevant, a new neural network (nn.new), which has only the three input variables parity, induced, and spontaneous, has to be trained to demonstrate the usage of confidence.interval. Let us assume that all restrictions are now fulfilled, i.e. neither the three input variables nor the two hidden neurons are irrelevant. Confidence intervals can then be calculated with the function confidence.interval:

> ci <- confidence.interval(nn.new, alpha=0.05)
> ci$lower.ci
[[1]]
[[1]][[1]]
              [,1]          [,2]
[1,]   1.830803796  -2.680895286
[2,]   1.673863304  -2.839908343
[3,]  -8.883004913 -37.232020925
[4,] -48.906348154 -18.748849335

[[1]][[2]]
             [,1]
[1,]  1.283391149
[2,] -3.724315385
[3,] -2.650545922

For each weight, ci$lower.ci provides the related lower confidence limit and ci$upper.ci the related upper confidence limit. The first matrix contains the limits of the weights leading to the hidden neurons. The columns refer to the two hidden neurons. The other three values are the limits of the weights leading to the output neuron.

Summary

This paper gave a brief introduction to multi-layer perceptrons and supervised learning algorithms. It introduced the package neuralnet that can be applied when modeling functional relationships between covariates and response variables. neuralnet contains a very flexible function that trains multi-layer perceptrons to a given data set in the context of regression analyses. It is a very flexible package since most parameters can be easily adapted. For example, the activation function and the error function can be arbitrarily chosen and can be defined by the usual definition of functions in R.

Acknowledgements

The authors thank Nina Wawro for reading preliminary versions of the paper and for giving helpful comments. Additionally, we would like to thank two anonymous reviewers for their valuable suggestions and remarks.

We gratefully acknowledge the financial support of this research by the grant PI 345/3-1 from the German Research Foundation (DFG).

Bibliography

H. Akaike. Information theory and an extension of the maximum likelihood principle. In Petrov BN and Csaki BF, editors, Second International Symposium on Information Theory, pages 267–281. Academiai Kiado, Budapest, 1973.

C. Almeida, C. Baugh, C. Lacey, C. Frenk, G. Granato, L. Silva, and A. Bressan. Modelling the dusty universe I: Introducing the artificial neural network and first applications to luminosity and colour distributions. Monthly Notices of the Royal Astronomical Society, 402:544–564, 2010.

A. Anastasiadis, G. Magoulas, and M. Vrahatis. New globally convergent training scheme based on the resilient propagation algorithm. Neurocomputing, 64:253–270, 2005.

C. Bishop. Neural Networks for Pattern Recognition. Oxford University Press, New York, 1995.

S. Fritsch and F. Günther. neuralnet: Training of Neural Networks. R Foundation for Statistical Computing, 2008. R package version 1.2.

F. Günther, N. Wawro, and K. Bammann. Neural networks for modeling gene-gene interactions in association studies. BMC Genetics, 10:87, 2009. http://www.biomedcentral.com/1471-2156/10/87.

K. Hornik, M. Stinchcombe, and H. White. Multilayer feedforward networks are universal approximators. Neural Networks, 2:359–366, 1989.

O. Intrator and N. Intrator. Interpreting neural-network results: a simulation study. Computational Statistics & Data Analysis, 37:373–393, 2001.

A. Kumar and D. Zhang. Personal recognition using hand shape and texture. IEEE Transactions on Image Processing, 15:2454–2461, 2006.

M. C. Limas, J. B. Ordieres Meré, E. P. Vergara González, F. J. M. de Pisón Ascacibar, A. V. Pernía Espinoza, and F. A. Elías. AMORE: A MORE Flexible Neural Network Package, 2007. URL http://wiki.r-project.org/rwiki/doku.php?id=packages:cran:amore. R package version 0.2-11.

P. McCullagh and J. Nelder. Generalized Linear Models. Chapman and Hall, London, 1983.

M. Riedmiller. Advanced supervised learning in multi-layer perceptrons - from backpropagation to adaptive learning algorithms. International Journal of Computer Standards and Interfaces, 16:265–278, 1994.

M. Riedmiller and H. Braun. A direct adaptive method for faster backpropagation learning: the RPROP algorithm. Proceedings of the IEEE International Conference on Neural Networks (ICNN), 1:586–591, 1993.

M. Rocha, P. Cortez, and J. Neves. Evolutionary neural network learning. Lecture Notes in Computer Science, 2902:24–28, 2003.

R. Rojas. Neural Networks. Springer-Verlag, Berlin, 1996.

W. Schiffmann, M. Joost, and R. Werner. Optimization of the backpropagation algorithm for training multilayer perceptrons. Technical report, University of Koblenz, Institute of Physics, 1994.

G. Schwarz. Estimating the dimension of a model. Annals of Statistics, 6:461–464, 1978.


D. Trichopoulos, N. Handanos, J. Danezis, A. Kalandidi, and V. Kalapothaki. Induced abortion and secondary infertility. British Journal of Obstetrics and Gynaecology, 83:645–650, 1976.

W. Venables and B. Ripley. Modern Applied Statistics with S. Springer, New York, fourth edition, 2002. URL http://www.stats.ox.ac.uk/pub/MASS4. ISBN 0-387-95457-0.

H. White. Learning in artificial neural networks: a statistical perspective. Neural Computation, 1:425–464, 1989.

Frauke Günther
University of Bremen, Bremen Institute for Prevention Research and Social Medicine
[email protected]

Stefan Fritsch
University of Bremen, Bremen Institute for Prevention Research and Social Medicine


glmperm: A Permutation of Regressor Residuals Test for Inference in Generalized Linear Models
by Wiebke Werft and Axel Benner

Abstract We introduce a new R package called glmperm for inference in generalized linear models, especially for small and moderate-sized data sets. The inference is based on the permutation of regressor residuals test introduced by Potter (2005). The implementation of glmperm outperforms currently available permutation test software, as glmperm can be applied in situations where more than one covariate is involved.

Introduction

A novel permutation test procedure for inference in logistic regression with small- and moderate-sized datasets was introduced by Potter (2005) and showed good performance in comparison to exact conditional methods. This so-called permutation of regressor residuals (PRR) test is implemented in the R package logregperm. However, its application field is limited to logistic regression models. The new glmperm package offers an extension of the PRR test to generalized linear models (GLMs), especially for small and moderate-sized data sets. In contrast to existing permutation test software, the glmperm package provides a permutation test for situations in which more than one covariate is involved, e.g. if established covariates need to be considered together with the new predictor under test.

Generalized linear models

Let Y be a random response vector whose components are independently distributed with means µ. Furthermore, let the covariates x_1, ..., x_p be related to the components of Y via the generalized linear model

$$E(Y_i) = \mu_i, \qquad (1)$$
$$g(\mu_i) = \beta_0 + \sum_{j=1}^{p} x_{ij}\beta_j, \qquad (2)$$
$$\operatorname{Var}(Y_i) = \frac{\phi}{w_i} V(\mu_i), \qquad (3)$$

with i = 1, ..., n, g a link function and V(·) a known variance function; φ is called the dispersion parameter and w_i is a known weight that depends on the underlying observations. The dispersion parameter is constant over observations and might be unknown, e.g. for normal distributions and for binomial and Poisson distributions with over-dispersion (see below).

One is now interested in testing the null hypothesis that the regression coefficient for a covariate of interest, say without loss of generality x_1, is zero, i.e. H_0: β_1 = 0 vs. H_1: β_1 ≠ 0. Let y be the observed vector of the outcome variable Y. Inference is based on the likelihood-ratio test statistic LR(y; x_1, ..., x_p), which is defined as the difference in deviances of the models with and without the covariate of interest, divided by the dispersion parameter. For simplicity we assume that each variable is represented by a single parameter that describes the contribution of the variable to the model. Therefore, the test statistic has an asymptotic chi-squared distribution with one degree of freedom. (For a deeper look into generalized linear models, McCullagh and Nelder (1989) is recommended.)

The PRR test

The basic concept of the PRR test is that it replaces the covariate of interest by its residual r from a linear regression on the remaining covariates x_2, ..., x_p. This is a simple orthogonal projection of x_1 on the space spanned by x_2, ..., x_p and ensures that r by its definition is not correlated with x_2, ..., x_p, while x_1 may be. An interesting feature of this projection is that the maximum value of the likelihood for a generalized linear model of y on r, x_2, ..., x_p is the same as that for y on x_1, x_2, ..., x_p. Hence, the likelihood-ratio test is the same when using the residuals r instead of the covariate x_1. This leads to the idea of using permutations of the residuals r to estimate the null distribution and thus the p-value.

The algorithm for the PRR test to obtain a p-value for testing the null hypothesis H_0: β_1 = 0 of the model

$$g(\mu) = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \ldots + \beta_p x_p \qquad (4)$$

is as follows:

1. Calculate least squares residuals r = (r_1, ..., r_n) via

$$r = x_1 - \left(\hat{\gamma}_0 + \hat{\gamma}_1 x_2 + \ldots + \hat{\gamma}_{p-1} x_p\right), \qquad (5)$$

where γ̂_0, γ̂_1, ..., γ̂_{p-1} are least squares estimates of the coefficients from the linear model

$$E(x_1) = \gamma_0 + \gamma_1 x_2 + \ldots + \gamma_{p-1} x_p. \qquad (6)$$

Then derive the p-value p̂ of the likelihood-ratio test statistic LR(y; r, x_2, ..., x_p) based on r replacing the covariate x_1.


2. For resampling iterations b = 1, ..., B:

• Randomly draw r* from r = (r_1, ..., r_n) without replacement.

• Calculate the p-value p*_b of the likelihood-ratio test statistic LR(y; r*, x_2, ..., x_p).

3. Calculate a permutation p-value for the PRR test according to

$$p_f = \frac{\#(p^*_b \leq f \cdot \hat{p})}{B}. \qquad (7)$$

Thus, the permutation p-value p_f is the fraction of permutations that have a likelihood-based p-value less than or equal to that for the unpermuted data times a factor f. This factor f is equal to or slightly bigger than one, i.e. f ∈ {1; 1.005; 1.01; 1.02; 1.04}. It was introduced by Potter (2005) to account for numerical instabilities which might occur when the algorithms to maximize the likelihood do not converge to exactly the same value for different configurations of the data. Note that varying the factor only changes the results by a few per cent or less. In the R package glmperm various factors are implemented and are displayed in the summary output. Here, the result presentations will be restricted to factors 1 and 1.02.

The number of resampling iterations B of the algorithm is implemented as nrep in the function prr.test. By default nrep=1000. However, depending on the precision that is desired, the number of iterations can be increased at the cost of longer computation time. We recommend using nrep=10000.
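To clarify what the algorithm does, here is a hand-rolled sketch of the PRR test for a logistic model with covariate of interest x1 and a data frame dat of remaining covariates. All names are ours; the package's prr.test is more general (arbitrary GLM families, dispersion estimation, several factors f) and more careful:

prr.sketch <- function(y, x1, dat, B=1000, f=1) {
  # Step 1: residuals of x1 from a linear regression on the other covariates
  r <- residuals(lm(x1 ~ ., data=dat))
  p.lr <- function(z) {
    fit1 <- glm(y ~ z + ., data=dat, family=binomial())
    fit0 <- glm(y ~ .,     data=dat, family=binomial())
    # for binomial models phi = 1, so LR is simply the difference of deviances
    pchisq(deviance(fit0) - deviance(fit1), df=1, lower.tail=FALSE)
  }
  p.obs <- p.lr(r)                       # p-value with the unpermuted residuals
  # Steps 2 and 3: permute r and count p-values <= f * p.obs
  p.star <- replicate(B, p.lr(sample(r)))
  mean(p.star <= f*p.obs)
}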

In order to provide an estimate of the variance of the result, the permutation p-value p_f can be regarded as a binomial random variable B(B, p), where B is the number of resampling iterations and p is the unknown value of the true significance level (Good, 2005). As estimate of p, the observed p-value p̂ multiplied by the factor f is used. Hence, the standard error for the permutation p-value p_f is provided by $\sqrt{f\hat{p}(1-f\hat{p})/B}$.

For binary outcome variables, the PRR test is a competitive approach to exact conditional logistic regression described by Hirji et al. (1987) and Mehta and Patel (1995). The commercial LogXact software has implemented this conditional logistic regression approach. At present, statistical tests employing unconditional permutation methods are not commercially available and the package glmperm bridges this gap.

The rationale of exact conditional logistic regression is based on the exact permutation distribution of the sufficient statistic for the parameter of interest, conditional on the sufficient statistics for the specified nuisance parameters. However, its algorithmic constraint is that the distribution of interest will be degenerate if a conditioning variable is continuous. The difference between the two procedures lies in the permutation scheme: while the conditional approach permutes the outcome variable, the PRR test permutes the residuals from the linear model. Note that if one considers regressions with only a single regressor, the residuals r are basically equal to x_1, and the PRR test reduces to a simple permutation test (shuffle-Z method, see below).

An overview of the methodologic differences between these two and other permutation schemes is provided by Kennedy and Cade (1996). In the context of their work, the permutation method used for the PRR test can be viewed as an extension of their shuffle-R permutation scheme for linear models. The variable associated with the parameter under the null hypothesis is regressed on the remaining covariables and replaced by the corresponding residuals; these residuals are then permuted while the response and the covariates are held constant. Freedman and Lane (1983) introduced the shuffle-R method in combination with tests of significance, and a detailed discussion of this permutation scheme is provided by ter Braak (1992). The conditional method can be implied with the shuffle-Z method, which was first mentioned in the context of multiple regression by Draper and Stoneman (1966). Here the variables associated with the parameters being tested under the null hypothesis are randomized while the response and the covariates are held constant. Kennedy and Cade (1996) discuss the potential pitfalls of the shuffle-Z method and point out that this method violates the ancillarity principle by not holding constant the collinearity between the covariables and the variable associated with the parameter under the null hypothesis. Therefore, Kennedy and Cade (1996) do not recommend the shuffle-Z method unless it employs a pivotal statistic or the hypothesized variable and the remaining covariables are known to be independent.

Modifications for the extension to GLMs

Three main modifications have been implemented in the glmperm package in comparison to the logregperm package. Note that the new package glmperm includes all features of the package logregperm. At present, the logregperm package is still available on CRAN. In general, the glmperm package could replace the logregperm package.

1. The extension of the prr.test function in the glmperm package provides several new arguments compared to the version in logregperm. The input is now organised as a formula expression equivalent to fitting a generalized linear model with glm (R package stats). Hence, for easier usage, the syntax of prr.test is adapted from glm. The covariable of interest, about which inference is to be made, is included as argument var='x1'. An argument seed is provided to allow for reproducibility.

2. The implicit function glm.perm is extended to calculate not only the deviances for the different resampling iterations but also the dispersion parameter for each permutation separately. For Poisson and logistic regression models, the dispersion parameters are pre-defined as φ = 1 for all iterations; the likelihood-ratio test statistic is then simply the difference of deviances. For all other GLMs, the dispersion parameters will be estimated for each resampling iteration based on the underlying data. This ensures that the p-value of the likelihood-ratio test statistic for this precise resampling iteration is accordingly specified given the data.

3. A new feature of the package is that it includes a summary function to view the main results of prr.test. It summarises the permutation of regressor residual-based p-value p_f for the various factors f, the observed likelihood-ratio test statistic and the observed p-value p̂ based on the chi-squared distribution with one degree of freedom. For Poisson and logistic regression models, a warning occurs in case of possible overdispersion (φ̂ > 1.5) or underdispersion (φ̂ < 0.5) and recommends use of family=quasibinomial() or quasipoisson() instead.

An example session

To illustrate the usage of the glmperm package for a GLM, an example session with simulated data is presented. First, we simulated data for three independent variables with n = 20 samples and binary and discrete response variables for the logistic and Poisson regression model, respectively.

# binary response variable
n <- 20
set.seed(4278)
x1 <- rnorm(n)
x0 <- rnorm(n) + x1
y1 <- ifelse(x0 + x1 + 2*rnorm(n) > 0, 1, 0)
test1 <- prr.test(y1 ~ x0 + x1,
                  var="x0", family=binomial())
x2 <- rbinom(n, 1, 0.6)
y2 <- ifelse(x1 + x2 + rnorm(n) > 0, 1, 0)
test2 <- prr.test(y2 ~ x1 + x2, var="x1",
                  nrep=10000, family=binomial())

# Poisson response variable
set.seed(4278)
x1 <- rnorm(n)
x0 <- rnorm(n) + x1
nu <- rgamma(n, shape = 2, scale = 1)
y <- rpois(n, lambda = exp(2) * nu)
test3 <- prr.test(y ~ x0 + x1,
                  var="x0", family=poisson())
test4 <- prr.test(y ~ x0, var="x0",
                  nrep=1000, family=poisson())

A condensed version of the displayed result summary of test2 (only factors f = 1 and f = 1.02 are shown) is given by:

> summary(test2)
-----------------------------------------
Results based on chi-squared distribution
-----------------------------------------
observed p-value: 0.0332
--------------------
Results based on PRR
--------------------
permutation p-value for simulated p-values <=
observed p-value: 0.0522 (Std.err: 0.0018)
permutation p-value for simulated p-values <=
1.02 observed p-value: 0.0531 (Std.err: 0.0018)

For the above example test2, the exact conditional logistic regression p-value calculated via LogXact-4 is 0.0526, whereas the p-value of the PRR test is 0.0522 for factor f = 1, and based on the chi-squared distribution it is 0.0332. The example demonstrates that the p-value obtained via the PRR test (or LogXact) leads to a different rejection decision than a p-value calculated via an approximation by the chi-squared distribution. Hence, when small sample sizes are considered, the PRR test should be preferred over an approximation via the chi-squared distribution.

Special case: Overdispersion

For the computation of GLMs, the dispersion parameter φ for the binomial and Poisson distribution is set equal to one. However, there exist data distributions which violate this assumption. The variance in (3) can then be better described with a dispersion parameter φ ≠ 1. The case φ > 1 is known as overdispersion, as the variance is larger than expected under the assumption of a binomial or Poisson distribution. In practice, one can still use the algorithms for generalized linear models if the estimation of the variance takes account of the dispersion parameter φ > 1. As a consequence of overdispersion, the residual deviance is then divided by the estimate φ̂ of the dispersion factor instead of φ = 1. Hence, this has direct influence on the likelihood-ratio test statistic, which is the difference in deviances and therefore is also scaled by φ̂. The corresponding p-values differ depending on whether overdispersion is considered or not, i.e. whether φ = 1 or φ = φ̂ is used. In the PRR test, one can account for overdispersion by using family=quasipoisson() or quasibinomial() instead of family=poisson() or binomial().
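A common way to gauge whether overdispersion is present is the Pearson-based dispersion estimate: the sum of squared Pearson residuals divided by the residual degrees of freedom. The check below is our own illustration, reusing y, x0 and x1 from the simulated Poisson example above (whose gamma-mixed counts are overdispersed by construction):

fit <- glm(y ~ x0 + x1, family=poisson())
phi.hat <- sum(residuals(fit, type="pearson")^2)/df.residual(fit)
phi.hat   # values clearly above 1 indicate overdispersion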


We experienced a stable performance of the PRR test for overdispersed data. The following treepipit data (Müller and Hothorn, 2004) provide an example of overdispersed data. We show the results of the chi-squared approximation of the likelihood-ratio test statistic as well as the results of the PRR test for the usage of family=poisson() and family=quasipoisson(), respectively.

# example with family=poisson()
data(treepipit, package="coin")
test5 <- prr.test(counts ~ cbpiles + coverstorey +
                  coniferous + coverregen,
                  data=treepipit, var="cbpiles",
                  family=poisson())

summary(test5)
-----------------------------------------
Results based on chi-squared distribution
-----------------------------------------
observed p-value: 0.0037
--------------------
Results based on PRR
--------------------
permutation p-value for simulated p-values <=
observed p-value: 0.083 (Std.err: 0.0019)
permutation p-value for simulated p-values <=
1.02 observed p-value: 0.084 (Std.err: 0.0019)

# example with family=quasipoisson()
test6 <- prr.test(counts ~ cbpiles + coverstorey +
                  coniferous + coverregen,
                  data=treepipit, var="cbpiles",
                  family=quasipoisson())

summary(test6)
-----------------------------------------
Results based on chi-squared distribution
-----------------------------------------
observed p-value: 0.0651
--------------------
Results based on PRR
--------------------
permutation p-value for simulated p-values <=
observed p-value: 0.07 (Std.err: 0.0078)
permutation p-value for simulated p-values <=
1.02 observed p-value: 0.071 (Std.err: 0.0079)

The p-values based on the chi-squared distribution of the likelihood-ratio test statistic are p = 0.0037 and p = 0.0651 when using family=poisson() and family=quasipoisson(), respectively. Hence, a different test decision is made, whereas the test decision for the PRR test is the same in both cases (p = 0.083 and p = 0.07).

Summary

The glmperm package provides a permutation of regressor residuals test for generalized linear models. This version of a permutation test for inference in GLMs is especially suitable for situations in which more than one covariate is involved in the model.

The key feature of the PRR test is to use the orthogonal projection of the variable of interest on the space spanned by all other covariates instead of the variable of interest itself. This feature provides a reasonable amendment to existing permutation test software, which does not cover situations with more than one covariate. Applications to logistic and Poisson models show good performance when compared to gold standards. For the special case of data with overdispersion, the PRR test is more robust than methods based on approximations of the test statistic distribution.

Acknowledgements

We would like to acknowledge the many valuable suggestions made by two anonymous referees.

Bibliography

C. J. F. ter Braak. Permutation versus bootstrap significance tests in multiple regression and ANOVA. In K. H. Jöckel, G. Rothe and W. Sendler, editors, Bootstrapping and Related Techniques, 79–85, Springer, 1992.

N. R. Draper and D. M. Stoneman. Testing for the inclusion of variables in linear regression by a randomization technique. Technometrics, 8:695–698, 1966.

D. Freedman and D. Lane. A nonstochastic interpretation of reported significance levels. Journal of Business and Economic Statistics, 1:292–298, 1983.

P. Good. Permutation, Parametric, and Bootstrap Tests of Hypotheses, 3rd edn. Springer Series in Statistics, 2005. ISBN 0-387-20279-X.

K. F. Hirji, C. R. Mehta and N. R. Patel. Computing distributions for exact logistic regression. Journal of the American Statistical Association, 82:1110–1117, 1987.

P. E. Kennedy and B. S. Cade. Randomization tests for multiple regression. Communications in Statistics - Simulation and Computation, 25(4):923–936, 1996.

D. M. Potter. A permutation test for inference in logistic regression with small- and moderate-sized datasets. Statistics in Medicine, 24:693–708, 2005.

P. McCullagh and J. A. Nelder. Generalized Linear Models, 2nd edn. Chapman and Hall, London, 1989. ISBN 0-412-31760-5.

C. R. Mehta and N. R. Patel. Exact logistic regression: theory and examples. Statistics in Medicine, 14:2143–2160, 1995.

J. Müller and T. Hothorn. Maximally selected two-sample statistics as a new tool for the identification and assessment of habitat factors with an application to breeding bird communities in oak forests. European Journal of Forest Research, 123:219–228, 2004.

Wiebke Werft
German Cancer Research Center
Im Neuenheimer Feld 280, 69120 Heidelberg
[email protected]

Axel Benner
German Cancer Research Center
Im Neuenheimer Feld 280, 69120 Heidelberg
[email protected]


Online Reproducible Research: An Application to Multivariate Analysis of Bacterial DNA Fingerprint Data
by Jean Thioulouse, Claire Valiente-Moro and Lionel Zenner

Abstract This paper presents an example of online reproducible multivariate data analysis. This example is based on a web page providing an online computing facility on a server. HTML forms contain editable R code snippets that can be executed in any web browser thanks to the Rweb software. The example is based on the multivariate analysis of DNA fingerprints of the internal bacterial flora of the poultry red mite Dermanyssus gallinae. Several multivariate data analysis methods from the ade4 package are used to compare the fingerprints of mite pools coming from various poultry farms. All the computations and graphical displays can be redone interactively and further explored online, using only a web browser. Statistical methods are detailed in the duality diagram framework, and a discussion about online reproducibility is initiated.

Introduction

Reproducible research has gained much attention recently (see particularly http://reproducibleresearch.net/ and the references therein). In the area of statistics, the availability of Sweave (Leisch (2002), http://www.stat.uni-muenchen.de/~leisch/Sweave/) has proved extremely useful and Sweave documents are now widely used. Sweave offers the possibility to have text, R code, and the outputs of this code in the same document. This makes reproducibility of a scientific paper written in Sweave very straightforward.

However, using Sweave documents implies a good knowledge of R, of Sweave and of LaTeX. It also requires having R installed on one's computer. The installed version of R must be compatible with the R code in the Sweave document, and all the needed packages must also be installed. This may be a problem for some scientists, for example for many biologists who are not familiar with R and LaTeX, or who do not use them on a regular basis.

In this paper, we demonstrate an online repro-ducibility system that helps circumvent these prob-lems. It is based on a web page that can be used withany web browser. It does not require R to be installedlocally, nor does it need a thorough knowledge of Rand LATEX. Nevertheless it allows the user to present

and comment R code snippets, to redo all the compu-tations and to draw all the graphical displays of theoriginal paper.

The example presented here relates to multivari-ate data analysis. Reproducibility of multivariatedata analyses is particularly important because thereis a large number of different methods, with poorlydefined names, making it often difficult to knowwhat has been done exactly. Many methods haveseveral different names, according to the commu-nity where they were invented (or re-invented), andwhere they are used. Translation of technical termsinto languages other than English is also difficult, asmany are ambiguous and have “false friends”. More-over, data analysis methods make intensive use ofgraphical displays, with a large number of graphicalparameters which can be changed to produce a widevariety of displays.

This situation is similar to what was described byBuckheit and Donoho (1995) in the area of waveletresearch. Their solution was to publish the Mat-lab code used to produce the figures in their re-search articles. Rather than do this we decided toset up a simple computer environment using R tooffer online reproducibility. In 2002, we installedan updated version of the Rweb system (Banfield,1999) on our department server (see http://pbil.univ-lyon1.fr/Rweb/Rweb.general.html), and weimplemented several computational web services inthe field of comparative genomics (Perrière et al.(2003), see for example http://pbil.univ-lyon1.fr/mva/coa.php).

This server now combines the computationalpower of R with simple HTML forms (as suggestedby de Leeuw (2001) for Xlisp-Stat) and the ability tosearch online molecular databases with the seqinrpackage (Charif and Lobry, 2007). It is also usedby several researchers to provide an online repro-ducibility service for scientific papers (see for exam-ple http://pbil.univ-lyon1.fr/members/lobry/).

The present paper provides an example of an application of this server to the analysis of DNA fingerprints by multivariate analysis methods, using the ade4 package (Chessel et al., 2004; Dray and Dufour, 2007). We have shown recently (Valiente-Moro et al., 2009) that multivariate analysis techniques can be used to analyse bacterial DNA fingerprints and that they make it possible to draw useful conclusions about the composition of the bacterial communities from which they originate. We also demonstrate the effectiveness of principal component analysis, between-group analysis and within-group analysis [PCA, BGA and WGA, Benzécri (1983); Dolédec and Chessel (1987)] to show differences in diversity between bacterial communities of various origins.

In summary, we show here that it is easy to set up a software environment offering full online reproducibility of computations and graphical displays of multivariate data analysis methods, even for users who are not familiar with R, Sweave or LaTeX.

Data and methods

In this section, we first describe the biological material used in the example data set. Then we present the statistical methods (PCA, BGA and WGA), in the framework of the duality diagram (Escoufier, 1987; Holmes, 2006). We also detail the software environment that was used.

Biological material

The poultry red mite, Dermanyssus gallinae, is a haematophagous mite frequently present in breeding facilities and especially in laying hen facilities (Chauve, 1998). This arthropod can be responsible for anemia, dermatitis, weight loss and a decrease in egg production (Kirkwood, 1967). It has also been involved in the transmission of many pathogenic agents responsible for serious diseases in both animals and humans (Valiente Moro et al., 2005; Valiente-Moro et al., 2007). The poultry red mite is therefore an emerging problem that must be studied to maintain good conditions in commercial egg production facilities. Nothing is known about its associated non-pathogenic bacterial community and how the diversity of the microflora within mites may influence the transmission of pathogens.

Most studies on insect microflora are based on isolation and culture of the constituent micro-organisms. However, comparison of culture-based and molecular methods reveals that only 20-50% of gut microbes can be detected by cultivation (Suau et al., 1999). Molecular methods have been developed to analyse the bacterial community in complex environments. Among these methods, Denaturing Gradient and Temporal Temperature Gel Electrophoresis (DGGE and TTGE) (Muyzer, 1998) have already been used successfully.

Full details of the example data set used in this paper (mite sampling, DNA extraction, PCR amplification of 16S rDNA fragments and TTGE banding pattern achieving) are given in Valiente-Moro et al. (2009). Briefly, 13 poultry farms were selected in the Bretagne region in France, and in each farm 15 single mites, five pools of 10 mites and one pool of 50 mites were collected. The results for single mites and mite pools were analysed separately, but only the results of mite pools are presented in this study, as they had the most illustrative analysis. Banding patterns can be analysed as quantitative variables (intensity of bands), or as binary indicators (presence/absence of bands), but for the same reason, only the presence/absence data were used in this paper.

TTGE banding patterns were collected in a data table, with bands in columns (55 columns) and mite pools in rows (73 rows). This table was first subjected to a principal component analysis (PCA) to get an overall idea of the structure of the banding patterns. Between-group analysis (BGA) was then applied to study the differences between poultry farms, and finally within-group analysis (WGA) was used to eliminate the farm effect and obtain the main characteristics of the common bacterial flora associated with D. gallinae in standard breeding conditions. A good knowledge of these characteristics would allow comparisons of the standard bacterial flora among various situations, such as geographic regions, type and location of breeding facilities (particularly organic farms), or developmental stages of D. gallinae.

Duality diagram of principal component analysis

Let $X = [x_{ij}]_{(n,p)}$ be the TTGE data table with $n$ rows (individuals = mite pools) and $p$ columns (variables = TTGE bands). Variables have mean $\bar{x}_j = \frac{1}{n}\sum_i x_{ij}$ and variance $\sigma_j^2 = \frac{1}{n}\sum_i (x_{ij} - \bar{x}_j)^2$. Individuals belong to $g$ groups (or classes), namely $G_1, \ldots, G_g$, with group counts $n_1, \ldots, n_g$, and $\sum n_k = n$.

Using duality diagram theory and triplet notation, the PCA of $X$ is the analysis of a triplet $(X_0, D_p, D_n)$. $X_0$ is the table of standardized values,
$$X_0 = [\tilde{x}_{ij}]_{(n,p)} \quad \text{with} \quad \tilde{x}_{ij} = \frac{x_{ij} - \bar{x}_j}{\sigma_j},$$
and $D_n$ and $D_p$ are the diagonal matrices of row and column weights: $D_p = I_p$ and $D_n = \frac{1}{n} I_n$.

This information is summarized in the following mnemonic diagram, called the "duality diagram" because $\mathbb{R}^{p*}$ is the dual of $\mathbb{R}^p$ and $\mathbb{R}^{n*}$ is the dual of $\mathbb{R}^n$:
$$\begin{array}{ccc}
\mathbb{R}^p & \xrightarrow{\ D_p\ } & \mathbb{R}^{p*} \\
{\scriptstyle X_0^T}\,\big\uparrow & & \big\downarrow\,{\scriptstyle X_0} \\
\mathbb{R}^{n*} & \xleftarrow{\ D_n\ } & \mathbb{R}^n
\end{array}$$
$X_0^T$ is the transpose of $X_0$. The analysis of this triplet leads to the diagonalization of the matrix
$$X_0^T D_n X_0 D_p,$$
i.e., the matrix obtained by proceeding counter-clockwise around the diagram, starting from $\mathbb{R}^p$.


Principal components (variable loadings and row scores) are computed using the eigenvalues and eigenvectors of this matrix.
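As an illustration of this triplet notation, here is a minimal sketch (using a simulated stand-in for the TTGE table, since the real data set is distributed through the reproducibility page) that checks that the eigenvalues returned by dudi.pca are those of X_0^T D_n X_0 D_p:

library(ade4)
## Simulated stand-in for the TTGE table: 73 pools x 55 bands
set.seed(1)
X <- matrix(rbinom(73 * 55, 1, 0.4), nrow = 73, ncol = 55)
## PCA of the triplet (X0, Dp, Dn); dudi.pca standardizes with divisor n
pca1 <- dudi.pca(as.data.frame(X), scannf = FALSE, nf = 3)
## Rebuild X0 by hand (scale() uses divisor n - 1, hence the correction)
n <- nrow(X)
X0 <- scale(X) * sqrt(n/(n - 1))
Dn <- diag(rep(1/n, n))
Dp <- diag(ncol(X))
## Eigenvalues of X0^T Dn X0 Dp match pca1$eig
ev <- Re(eigen(t(X0) %*% Dn %*% X0 %*% Dp)$values)
all.equal(ev[1:3], pca1$eig[1:3])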

Between-group analysis

The between-group analysis (BGA) is the analysis of the triplet $(X_B, D_p, D_{n_k})$, where $X_B$ is the $(g, p)$ matrix of group means:
$$X_B = [\bar{x}_j^k]_{(g,p)}.$$
The term $\bar{x}_j^k = \frac{1}{n_k}\sum_{i \in G_k} x_{ij}$ is the mean of variable $j$ in group $k$. In matrix notation, if $B$ is the matrix of class indicators, $B = [b_i^k]_{(n,g)}$, with $b_i^k = 1$ if $i \in G_k$ and $b_i^k = 0$ if $i \notin G_k$, then we have:
$$X_B = D_{n_k} B^T X_0.$$
Matrix $D_{n_k} = \mathrm{Diag}(\frac{1}{n_k})$ is the diagonal matrix of (possibly non-uniform) group weights, and $B^T$ is the transpose of $B$.

BGA is therefore the analysis of the table of group means, leading to the diagonalization of the matrix $X_B^T D_{n_k} X_B D_p$. Its aim is to reveal any evidence of differences between groups. The statistical significance of these differences can be tested with a permutation test. Row scores of the initial data table can be computed by projecting the rows of the standardized table $X_0$ onto the principal component subspaces.
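Continuing the sketch above with a hypothetical farm factor, BGA and its permutation test can be computed with the between and randtest functions of ade4:

## Hypothetical grouping of the 73 simulated pools into 13 farms
farm <- factor(rep(1:13, length.out = 73))
bga1 <- between(pca1, farm, scannf = FALSE, nf = 2)
bga1$ratio                     # between-group to total inertia ratio
test1 <- randtest(bga1, nrepet = 999)
plot(test1)                    # observed ratio against the permutation distribution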

Within-group analysis

The within-group analysis (WGA) is the analysis of the triplet $(X_W, D_p, D_n)$, where $X_W$ is the $(n, p)$ matrix of the differences between the standardized values and the group means:
$$X_W = [\tilde{x}_{ij} - \bar{x}_{ij}^k]_{(n,p)}$$
with $\bar{x}_{ij}^k = \bar{x}_j^k,\ \forall i \in G_k$. In matrix notation,
$$X_W = X_0 - X_B,$$
where $X_B$ is here the matrix of group means, repeated inside each group.

WGA is therefore the analysis of the matrix of the residuals obtained by eliminating the between-group effect. It leads to the diagonalization of the matrix $X_W^T D_n X_W D_p$. It is useful when looking for the main features of a data table after removing an unwanted characteristic.
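A corresponding sketch for WGA, still on the simulated data and the hypothetical farm factor:

## WGA: analysis of the residuals after removing the farm effect
wga1 <- within(pca1, farm, scannf = FALSE, nf = 3)
head(wga1$co)    # band loadings
head(wga1$li)    # row scores of the pools, centered by farm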

Software

We used R version 2.11.0 (R Development Core Team, 2009) for all computations, and the ade4 package (version 1.4-14) for multivariate analyses (Chessel et al., 2004; Dray and Dufour, 2007; Thioulouse and Dray, 2007). PCA, BGA and WGA were computed with the dudi.pca, between and within functions of ade4. BGA permutation tests were done with the randtest function.

Online reproducibility

The statistical analyses presented in Valiente-Moro et al. (2009) are reproducible (Gentleman and Lang, 2004) via the web page http://pbil.univ-lyon1.fr/TTGE/.

This page presents the problem, gives access to the data set, and allows the user to execute R code snippets that reproduce all the computations and graphical displays of the original article (Valiente-Moro et al., 2009). The R code is explained, with a description of important variables and function calls, and the R outputs are documented. Links to the information pages of the main ade4 functions are also provided.

This page is therefore an example of online literate programming (Knuth, 1992), and also of an online dynamic document (Gentleman and Lang, 2004). This is made possible using HTML forms containing R code stored in editable text fields. The code itself can be modified by the user and executed on demand. Code execution is achieved by sending it to the Rweb software (Banfield, 1999) running on our department server: http://pbil.univ-lyon1.fr/Rweb/Rweb.general.html.

Links to the complete R code and data sets are provided on the web page. They can be used to download these files and use them locally if desired. The full R code (the collection of all code snippets) is available directly from http://pbil.univ-lyon1.fr/TTGE/allCode.R.

PCA, BGA and permutation test

The first code snippet in the reproducibility page reads the TTGE data table, builds the factor describing the poultry farms, and computes the three analyses (PCA, BGA, and WGA). The PCA is first done using the dudi.pca function of the ade4 package, and the resulting object is passed to the between and within functions to compute BGA and WGA respectively. The BGA permutation test is then computed and the test output is plotted.

On the reproducibility web page, these successive steps are explained: links to the data files are presented, the R code is displayed and commented, and the "Do it again!" button can be used to redo the computations. The code can be freely modified by the user and executed again. For example, it is possible to change the scaling of the PCA, or the number of random permutations in the Monte Carlo test, to check the influence of these parameters on the analysis outputs.

Figure 1 shows the result of the permutation test of the BGA. The null hypothesis is that there is no difference between farms. The test checks that the observed value of the ratio of between-group to total inertia (0.67) is much higher than expected under the null hypothesis. Under the null hypothesis, mite pools can be permuted randomly among farms without changing significantly the ratio of between to total inertia. To compute the test, the rows of the dataframe are randomly permuted, and the ratio is computed again. This is done many times, to get an idea of the distribution of the between to total inertia ratio. Figure 1 shows that the observed value (black diamond, far on the right) is much higher than all the values obtained after permuting the pools. The p-value is 0.001, implying that the hypothesis of no differences between the farms can confidently be rejected.

Figure 1: Permutation test of the BGA (x-axis: between-group to total inertia ratio; y-axis: frequency). The observed value of the between-group to total inertia ratio is equal to 0.67 (black diamond on the right). The histogram on the left shows the distribution of 1000 values of this ratio obtained after permuting the rows of the data table.

BGA plots

The second code snippet draws Figure 2, showing the factor maps of the BGA.

The loadings of TTGE bands are in the bga1$co dataframe, and they are plotted using the s.label function (first panel). To get an idea of the dispersion of the six mite pools in each farm, we can plot the projection of each pool on the factor map (second panel). The row scores of the pools are in the bga1$ls dataframe, and two graphs are superimposed: the graph of pool stars (with s.class), and the graph of convex hulls surrounding the pools belonging to the same farm (with s.chull). We can see that, as the permutation test had just evidenced, the farms are indeed very different.
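A sketch of the corresponding plotting calls, applied to the bga1 object of the sketches above (panel layout and graphical options are illustrative):

par(mfrow = c(1, 2))
s.label(bga1$co, clabel = 0.7)           # first panel: band loadings
s.class(bga1$ls, farm, cellipse = 0)     # second panel: pool stars by farm
s.chull(bga1$ls, farm, optchull = 1,
        add.plot = TRUE)                 # convex hulls around the farms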

Here also, the user can very easily change the R code to draw alternative graphs. It is possible, for example, to explore the effect of the arguments to the s.label and s.class functions.

Interpreting the differences between farms evidenced in Figure 2 was not easy. All 13 farms are standard poultry farms, using exactly the same breeding conditions, and the differences between TTGE banding patterns could not be attributed to any other factors.

Since the aim of the study was to find the common bacterial flora associated with D. gallinae in standard breeding conditions, we decided to remove this unwanted between-farm effect by doing a within-group analysis.

WGA and cluster analysis plots

The third code snippet draws Figure 3, showing the factor maps of WGA and the dendrogram of the cluster analysis on TTGE band loadings.

The loadings of the 55 TTGE bands are in the wga1$co dataframe, and they are plotted using the s.label function (top-left panel). The scores of the 73 mite pools are in the wga1$li dataframe. They are grouped by convex hulls, according to the poultry farm from which they originate, using the s.class and s.chull functions (top-right panel). The row scores of the WGA are centered by group, so the 13 farms are centered on the origin (this corresponds to the fact that the "farm effect" has been removed in this analysis).

The TTGE bands corresponding to the common dominant bacterial flora associated with D. gallinae in standard breeding conditions were selected on Figure 3 (top-left panel), using cluster analysis (lower panel). This was done using the complete linkage algorithm, with euclidean distances computed on WGA variable loadings on the first three axes (wga1$co) and not on raw data.
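A sketch of this cluster analysis step (complete linkage on euclidean distances between the WGA band loadings on the first three axes, as described above):

d.bands <- dist(wga1$co[, 1:3], method = "euclidean")
hc <- hclust(d.bands, method = "complete")
plot(hc, hang = -1, cex = 0.6)   # dendrogram of the TTGE bands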

The reproducibility page can be used to compare these results and the ones obtained with other clustering algorithms or other distance measures. Another possibility is to check the effect of the number of axes on which variable loadings are computed (three axes in the R code proposed by default).

We included the two leftmost outgroups (bands 0.15, 0.08, 0.11, 0.51, and 0.88) and the right group (0.38, 0.75, 0.03, 0.26, 0.41, 0.59, 0.13, 0.32, 0.23, 0.21, 0.31, 0.28, and 0.3). Band 0.39 was excluded because it was present in only two mite pools belonging to one single farm.

These results show that there is a strong between-farm effect in the TTGE banding patterns corresponding to the bacterial flora associated with D. gallinae in standard breeding conditions. However, it was not possible to give a satisfying biological interpretation of this effect: the farms evidenced by the BGA did not show any particular characteristic.


Figure 2: Factor maps of BGA (x-axis = first principal component, y-axis = second principal component; inertia percentages: 20% and 17%). The scale is given by the value d (top-right corner) that represents the size of the background grid. The first panel shows the map of the 55 TTGE bands (labels correspond to the position of the band on the electrophoresis gel). The second panel shows the 73 mite pools, grouped by convex hulls according to the poultry farm from which they originate (labels correspond to the farms).

The WGA could remove this farm effect, and revealed the common dominant bacterial flora (Valiente-Moro et al., 2009). The knowledge of this flora will allow us to compare the variations observed in different conditions, such as different geographic regions, different types of breeding farms (for example organic farms), or different developmental stages of D. gallinae. Data from farms in different geographic regions and from organic farms are already available, and a paper presenting the results of their analysis has been submitted for publication.

Discussion

Statistical methodology

From the point of view of statistical methodology, BGA appears as a robust method that can be used to reveal any evidence for, and test, a simple effect, namely the effect of a single factor, in a multivariate data table. Many recent examples of its use can be found in the area of Genomics (Culhane et al., 2002, 2005; Baty et al., 2005; Jeffery et al., 2007).

Furthermore, WGA can eliminate a simple effect from a multivariate data table. Like BGA, it can be used even when the number of cases is less than the number of variables. A recent example of WGA in the area of Genomics can be found in Suzuky et al. (2008).

BGA can be seen as a particular case of redundancy analysis [RDA, Stewart and Love (1968)] and WGA as a particular case of partial RDA [CANOCO (ter Braak and Šmilauer, 2002)]. Both cases correspond to covariates reduced to a single dummy variable.

This can be checked with vegan, another classical R package for multivariate ecological data analysis, using the rda function, as explained on the online reproducibility page in code snippet 4. Outputs are not presented here, but they are available on that site, where they may be interactively explored.
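A sketch of that check with vegan, run here on the simulated data and hypothetical farm factor of the earlier sketches (code snippet 4 on the reproducibility page does this on the real TTGE table):

library(vegan)
dat <- data.frame(farm = farm)
rda1 <- rda(X ~ farm, data = dat)              # BGA as RDA with a single factor covariate
wrda1 <- rda(X ~ Condition(farm), data = dat)  # WGA as partial RDA, farm partialled out
anova(rda1)                                    # permutation test of the farm effect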

Computer setup

The reproducibility web page presented here is just an example of use of a simple setup that can be implemented easily on any web server. The current configuration of our server is an old Sun Fire 800, with 8 UltraSparc III processors at 900 MHz, 28 GB of memory and 6 x 36 GB disks. From the point of view of software, the server is running Solaris 10, the HTTP server is Apache 2.2.13, and we use Rweb 1.03 and R 2.11.0. Rweb is mainly a set of CGI scripts written in Perl, and the installed Perl version is v5.8.4. The page itself is written in plain HTML code, using standard HTML forms to communicate with Rweb CGI scripts.

Rweb source code is available from the Montana State University (http://bayes.math.montana.edu/Rweb/Rweb1.03.tar.gz).

There have been many other attempts at mixing R and HTML on the web. Alternative solutions could for example make use of CGIwithR (Firth, 2003) and R2HTML (Lecoutre, 2003), or Rscript.

Running a server offering public computing services necessarily raises many security issues. The server that is used for Rweb at the PBIL (http://pbil.univ-lyon1.fr) also offers many other computing services in the area of Genomic research, including complex database exploration and analysis of huge molecular databases (GenBank, EMBL, etc.). This server has been attacked and successfully compromised several times with common rootkits, but in eight years [we started the Rweb service in 2002, Perrière et al. (2003)], the Rweb server has never been used in these attacks. Rweb precludes the use of the most sensitive functions (particularly "system", "eval", "call", "sink", etc.) and an additional security measure taken on our server is that all computations are performed in a temporary directory that is cleaned up automatically after use.

Figure 3: Factor maps of WGA (x-axis = first principal component, y-axis = second principal component; inertia percentages: 14% and 8%). The scale is given by the value d (top-right corner) that represents the size of the background grid. Top-left panel: map of the 55 TTGE bands (see legend to Figure 2). Top-right panel: map of the 73 mite pools, grouped by farms. Bottom panel: dendrogram of the cluster analysis on WGA band loadings.

Online reproducibility

Sweave is a very good solution for the reproducibility of scientific papers' statistical computation and graphical display. It is more and more frequently used, and not only for reproducible research in the strictest sense. For example, it is used for quality control and for keeping up-to-date the documents used in the statistical teaching modules of the Biometry and Evolutionary Biology department at the University of Lyon, France: http://pbil.univ-lyon1.fr/R/enseignement.html (in French).

Using Sweave documents, however, is very demanding for users. LaTeX and R must be installed on the local computer, and users must have at least some notions of how to compile these documents. The version of R and the list of packages must be compatible with the R code included in the Sweave document. This can be a serious obstacle for many researchers, for example in Biology and Ecology.

The online reproducibility solution presented in this paper is much easier to use for people who do not know LaTeX and who have only vague notions of R. It can be used with any web browser and it can make the analyses used in an applied statistics paper accessible to anyone. It can even be a good tool to help people get a better knowledge of R and encourage them to install R and use it locally.

Rweb servers

Many examples of the use of the PBIL Rweb server are available on the home page of J. R. Lobry, http://pbil.univ-lyon1.fr/members/lobry/. The articles under the tag "[ONLINE REPRODUCIBILITY]" are reproducible through a reproducibility web page, using HTML forms and Rweb as explained in this paper.

Writing such web pages is very easy, and does not require R, Rweb or any other particular software to be installed on the HTTP server. The HTML code below is a minimal example of such a page. It can be used with any web server using, for example, the apache2 HTTP server, or even locally in any web browser:

<html>
<head><title>Rweb</title></head>
<body>
<p style="font-size:30px">Rweb server at PBIL</p>
<form onSubmit="return checkData(this)"
      action="http://pbil.univ-lyon1.fr/cgi-bin/Rweb/Rweb.cgi"
      enctype="multipart/form-data"
      method="post">
<textarea name="Rcode" rows=5 cols=80>
plot(runif(100))
</textarea>
<br />
<input type="submit" value="Run it!">
</form>
</body>
</html>

The main feature of this code is the form tag that declares the CGI script of the Rweb server: http://pbil.univ-lyon1.fr/cgi-bin/Rweb/Rweb.cgi. The textarea tag in this form contains the R code: here by default it is just one plot function call that draws 100 random points. This default code can be changed in the HTML document, or modified interactively by the user when the page is loaded in a web browser. The input tag defines the "Run it!" button that can be used to execute the R code. When the user clicks on this button, the code is sent to the Rweb CGI script and executed by R. The browser display is updated, and the execution listing and the graph appear.

Another solution is to install Rweb and run a local (public or private) Rweb server.

Maintenance and durability

An important problem for reproducible research is durability. How long will a reproducible work stay reproducible?

The answer depends on the availability of the software needed to redo the computations and the graphics. R and contributed packages evolve, and R code can become deprecated if it is not updated on a regular basis. Sweave has been used mainly for just this reason in the documents used in the statistical teaching modules at the University of Lyon. All the documents are recompiled automatically to check their compatibility with the current version of R.

For an online reproducibility system to endure, the same conditions apply. The Biometry and Evolutionary Biology department has been running an Rweb server continuously since 2002, regularly updating the various pieces of software on which it depends (R, contributed packages, Rweb, Perl, Apache, etc.).

Reproducible research needs an ongoing effort; reproducibility ceases when this effort is not sustained.

Acknowledgment

We thank the Editor and two anonymous reviewers for many useful comments and suggestions that helped improve the first version of this paper.

Bibliography

J. Banfield. Rweb: web-based statistical analysis. Journal of Statistical Software, 4(1):1–15, 1999.

F. Baty, M. P. Bihl, G. Perrière, A. C. Culhane, and M. H. Brutsche. Optimized between-group classification: a new jackknife-based gene selection procedure for genome-wide expression data. Bioinformatics, 6:1–12, 2005.

J. P. Benzécri. Analyse de l'inertie intra-classe par l'analyse d'un tableau de correspondances. Les Cahiers de l'Analyse des données, 8:351–358, 1983.

J. B. Buckheit and D. L. Donoho. Wavelab and reproducible research. Tech. rep. 474, Dept. of Statistics, Stanford University, 1995. URL http://www-stat.stanford.edu/~wavelab/Wavelab_850/wavelab.pdf.

D. Charif and J. Lobry. SeqinR 1.0-2: a contributed package to the R project for statistical computing devoted to biological sequences retrieval and analysis. In U. Bastolla, M. Porto, H. E. Roman, and M. Vendruscolo, editors, Structural approaches to sequence evolution: Molecules, networks, populations, Biological and Medical Physics, Biomedical Engineering, pages 207–232. Springer Verlag, New York, 2007. URL http://seqinr.r-forge.r-project.org/. ISBN 978-3-540-35305-8.


C. Chauve. The poultry red mite Dermanyssus gallinae (DeGeer, 1778): current situation and future prospects for control. Veterinary Parasitology, 79:239–245, 1998.

D. Chessel, A.-B. Dufour, and J. Thioulouse. The ade4 package - I: One-table methods. R News, 4:5–10, 2004.

A. C. Culhane, G. Perrière, E. C. Considine, T. G. Cotter, and D. G. Higgins. Between-group analysis of microarray data. Bioinformatics, 18:1600–1608, 2002.

A. C. Culhane, J. Thioulouse, G. Perrière, and D. G. Higgins. MADE4: an R package for multivariate analysis of gene expression data. Bioinformatics, 21:2789–2790, 2005.

J. de Leeuw. Reproducible research: the bottom line. Technical report, UCLA Statistics Program, 2001. URL http://preprints.stat.ucla.edu/301/301.pdf.

S. Dolédec and D. Chessel. Rythmes saisonniers et composantes stationnelles en milieu aquatique I - description d'un plan d'observations complet par projection de variables. Acta Oecologica, Oecologia Generalis, 8(3):403–426, 1987.

S. Dray and A.-B. Dufour. The ade4 package: Implementing the duality diagram for ecologists. Journal of Statistical Software, 22(4):1–20, 2007. URL http://www.jstatsoft.org/v22/i04.

Y. Escoufier. The duality diagram: a means of better practical applications. In P. Legendre and L. Legendre, editors, Developments in numerical ecology, NATO Advanced Institute, Serie G, pages 139–156. Springer Verlag, Berlin, 1987.

D. Firth. CGIwithR: Facilities for processing web forms using R. Journal of Statistical Software, 8(10):1–8, 2003. URL http://www.jstatsoft.org/v08/i10.

R. Gentleman and D. T. Lang. Statistical analyses and reproducible research. Bioconductor Project Working Papers, May 2004. URL http://www.bepress.com/bioconductor/paper2.

S. Holmes. Multivariate analysis: The French way. In D. Nolan and T. Speed, editors, Festschrift for David Freedman, pages 1–14. IMS, Beachwood, OH, 2006.

I. B. Jeffery, S. F. Madden, P. A. McGettigan, G. Perrière, A. C. Culhane, and D. G. Higgins. Integrating transcription factor binding site information with gene expression datasets. Bioinformatics, 23:298–305, 2007.

A. Kirkwood. Anaemia in poultry infested with the red mite Dermanyssus gallinae. The Veterinary Record, 80:514–516, 1967.

D. Knuth. Literate Programming. Center for the Study of Language and Information, Stanford, California, 1992.

E. Lecoutre. The R2HTML package. R News, 3:33–36, 2003.

F. Leisch. Sweave, Part I: Mixing R and LaTeX. R News, 2:28–31, 2002.

G. Muyzer and K. Smalla. Application of denaturing gradient gel electrophoresis (DGGE) and temperature gradient gel electrophoresis (TGGE) in microbial ecology. Antonie van Leeuwenhoek, 73:127–141, 1998.

G. Perrière, C. Combet, S. Penel, C. Blanchet, J. Thioulouse, C. Geourjon, J. Grassot, C. Charavay, M. Gouy, L. Duret, and G. Deléage. Integrated databanks access and sequence/structure analysis services at the PBIL. Nucleic Acids Research, 31:3393–3399, 2003.

R Development Core Team. R: A Language and Environment for Statistical Computing. R Foundation for Statistical Computing, Vienna, Austria, 2009. URL http://www.R-project.org. ISBN 3-900051-07-0.

D. K. Stewart and W. A. Love. A general canonical correlation index. Psychological Bulletin, 70:160–163, 1968.

A. Suau, R. Bonnet, M. Sutren, J. Godon, G. R. Gibson, M. D. Collins, and J. Doré. Direct analysis of genes encoding 16S rRNA from complex communities reveals many novel molecular species within the human gut. Appl. Env. Microbiol., 65:4799–4807, 1999.

H. Suzuky, C. J. Brown, L. J. Forney, and E. M. Top. Comparison of correspondence analysis methods for synonymous codon usage in bacteria. DNA Research, (in press), 2008. doi: 10.1093/dnares/dsn028.

C. J. F. ter Braak and P. Šmilauer. CANOCO Reference Manual and CanoDraw for Windows User's Guide: Software for Canonical Community Ordination (version 4.5). Microcomputer Power, Ithaca, NY, USA, 2002.

J. Thioulouse, D. Chessel, S. Dolédec, and J. M. Olivier. ADE-4: a multivariate analysis and graphical display software. Statistics and Computing, 7:75–82, 1997.

J. Thioulouse and S. Dray. Interactive multivariate data analysis in R with the ade4 and ade4TkGUI packages. Journal of Statistical Software, 22(5):1–14, 2007. URL http://www.jstatsoft.org/v22/i05.

C. Valiente-Moro, C. Chauve, and L. Zenner. Vectorial role of some dermanyssoid mites (Acari, Mesostigmata, Dermanyssoidea). Parasite, 12:99–109, 2005.


C. Valiente-Moro, C. Chauve, and L. Zenner. Experimental infection of Salmonella Enteritidis by the poultry red mite, Dermanyssus gallinae. Veterinary Parasitology, 31:329–336, 2007.

C. Valiente-Moro, J. Thioulouse, C. Chauve, P. Normand, and L. Zenner. Bacterial taxa associated with the hematophagous mite, Dermanyssus gallinae, detected by 16S rDNA PCR amplification and TTGE fingerprint. Research in Microbiology, 160:63–70, 2009.

Jean Thioulouse
Université de Lyon, F-69000, Lyon; Université Lyon 1; CNRS, UMR5558, Biométrie et Biologie Evolutive, F-69622 Villeurbanne Cedex, France
[email protected]
http://pbil.univ-lyon1.fr/JTHome/

Claire Valiente-Moro
Ecole Nationale Vétérinaire de Lyon, 1 avenue Bourgelat, 69280 Marcy l'Etoile, France
Université de Lyon, F-69000, Lyon; Université Lyon 1; CNRS, UMR5557, Écologie Microbienne, F-69622 Villeurbanne Cedex, France
[email protected]

Lionel Zenner
Ecole Nationale Vétérinaire de Lyon, 1 avenue Bourgelat, 69280 Marcy l'Etoile, France
Université de Lyon, F-69000, Lyon; Université Lyon 1; CNRS, UMR5558, Biométrie et Biologie Evolutive, F-69622 Villeurbanne Cedex, France
[email protected]


Two-sided Exact Tests and Matching Confidence Intervals for Discrete Data

by Michael P. Fay

Abstract There is an inherent relationship between two-sided hypothesis tests and confidence intervals. A series of two-sided hypothesis tests may be inverted to obtain the matching 100(1-α)% confidence interval, defined as the smallest interval that contains all point null parameter values that would not be rejected at the α level. Unfortunately, for discrete data there are several different ways of defining two-sided exact tests, and the most commonly used two-sided exact tests are defined one way, while the most commonly used exact confidence intervals are inversions of tests defined another way. This can lead to inconsistencies where the exact test rejects but the exact confidence interval contains the null parameter value. The packages exactci and exact2x2 provide several exact tests with the matching confidence intervals, avoiding these inconsistencies as much as possible. Examples are given for binomial and Poisson parameters and both paired and unpaired 2 × 2 tables.

Applied statisticians are increasingly being encouraged to report confidence intervals (CI) and parameter estimates along with p-values from hypothesis tests. The htest class of the stats package is ideally suited to these kinds of analyses, because all the related statistics may be presented when the results are printed. For exact two-sided tests applied to discrete data, a test-CI inconsistency may occur: the p-value may indicate a significant result at level α while the associated 100(1-α)% confidence interval may cover the null value of the parameter. Ideally, we would like to present a unified report (Hirji, 2006), whereby the p-value and the confidence interval match as much as possible.

A motivating example

I was asked to help design a study to determine if adding a new drug (albendazole) to an existing treatment regimen (ivermectin) for the treatment of a parasitic disease (lymphatic filariasis) would increase the incidence of a rare serious adverse event when given in an area endemic for another parasitic disease (loa loa). There are many statistical issues related to that design (Fay et al., 2007), but here consider a simple scenario to highlight the point of this paper. A previous mass treatment using the existing treatment had 2 out of 17877 experiencing the serious adverse event (SAE), giving an observed rate of 11.2 per 100,000. Suppose the new treatment was given to 20,000 new subjects and suppose that 10 subjects experienced the SAE, giving an observed rate of 50 per 100,000. Assuming Poisson rates, an exact test using poisson.test(c(2,10), c(17877,20000)) from the stats package (throughout we assume version 2.11.0 for the stats package) gives a p-value of p = 0.0421, implying significant differences between the rates at the 0.05 level, but poisson.test also gives a 95% confidence interval of (0.024, 1.050), which contains a rate ratio of 1, implying no significant difference. We return to the motivating example in the 'Poisson: two-sample' section later.

Overview of two-sided exact tests

We briefly review inferences using the p-value function for discrete data. For details see Hirji (2006) or Blaker (2000). Suppose you have a discrete statistic $t$ with random variable $T$ such that larger values of $T$ imply larger values of a parameter of interest, $\theta$. Let $F_\theta(t) = \Pr[T \le t;\theta]$ and $\bar{F}_\theta(t) = \Pr[T \ge t;\theta]$. Suppose we are testing
$$H_0: \theta \ge \theta_0 \qquad \text{versus} \qquad H_1: \theta < \theta_0,$$
where $\theta_0$ is known. Then smaller values of $t$ are more likely to reject, and if we observe $t$, then the probability of observing equal or smaller values is $F_{\theta_0}(t)$, which is the one-sided p-value. Conversely, the one-sided p-value for testing $H_0: \theta \le \theta_0$ is $\bar{F}_{\theta_0}(t)$. We reject when the p-value is less than or equal to the significance level, $\alpha$. The one-sided confidence interval would be all values of $\theta_0$ for which the p-value is greater than $\alpha$.

We list three ways to define the two-sided p-value for testing $H_0: \theta = \theta_0$, which we denote $p_c$, $p_m$ and $p_b$ for the central, minlike, and blaker methods, respectively:

central: $p_c$ is 2 times the minimum of the one-sided p-values bounded above by 1, or mathematically,
$$p_c = \min\left\{1,\ 2 \times \min\left(F_{\theta_0}(t),\ \bar{F}_{\theta_0}(t)\right)\right\}.$$
The name central is motivated by the associated inversion confidence intervals, which are central intervals, i.e., they guarantee that the lower (upper) limit of the 100(1-α)% confidence interval has less than α/2 probability of being greater (less) than the true parameter. This is called the TST (twice the smaller tail method) by Hirji (2006).


minlike: $p_m$ is the sum of probabilities of outcomes with likelihoods less than or equal to the observed likelihood, or
$$p_m = \sum_{T:\, f(T) \le f(t)} f(T),$$
where $f(t) = \Pr[T = t;\theta_0]$. This is called the PB (probability based) method by Hirji (2006).

blaker: $p_b$ combines the probability of the smaller observed tail with the smallest probability of the opposite tail that does not exceed that observed tail probability. Blaker (2000) showed that this p-value may be expressed as
$$p_b = \Pr\left[\gamma(T) \le \gamma(t)\right],$$
where $\gamma(T) = \min\left\{F_{\theta_0}(T),\ \bar{F}_{\theta_0}(T)\right\}$. The name blaker is motivated by Blaker (2000), which comprehensively studies the associated method for confidence intervals, although the method had been mentioned in the literature earlier, see e.g., Cox and Hinkley (1974), p. 79. This is called the CT (combined tail) method by Hirji (2006).

There are other ways to define two-sided p-values, such as defining extreme values according to the score statistic (see e.g., Hirji (2006, Chapter 3) or Agresti and Min (2001)). Note that $p_c \ge p_b$ for all cases, so that $p_b$ gives more powerful tests than $p_c$. On the other hand, although generally $p_c > p_m$, it is possible for $p_c < p_m$.
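As a concrete illustration of the three definitions, here is a minimal sketch for a single binomial observation, written directly from the formulas above (the binom.exact function of exactci computes the same quantities, with more careful handling of numerical ties):

## Two-sided exact p-values for x successes out of n, testing theta = theta0
twosided.pvals <- function(x, n, theta0) {
  support <- 0:n
  f <- dbinom(support, n, theta0)          # f(T) for every possible outcome
  Ft <- pbinom(x, n, theta0)               # Pr[T <= x]
  Fbar <- 1 - pbinom(x - 1, n, theta0)     # Pr[T >= x]
  pc <- min(1, 2 * min(Ft, Fbar))          # central: twice the smaller tail
  pm <- sum(f[f <= dbinom(x, n, theta0)])  # minlike: outcomes no more likely
  ## blaker: gamma(T) = min(F(T), Fbar(T)), then Pr[gamma(T) <= gamma(t)]
  gam <- pmin(pbinom(support, n, theta0),
              1 - pbinom(support - 1, n, theta0))
  pb <- sum(f[gam <= gam[x + 1]])
  c(central = pc, minlike = pm, blaker = pb)
}
twosided.pvals(5, 20, 0.3)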

If $p(\theta_0)$ is a two-sided p-value testing $H_0: \theta = \theta_0$, then its 100(1-α)% matching confidence interval is the smallest interval that contains all $\theta_0$ such that $p(\theta_0) > \alpha$. To calculate the matching confidence intervals, we consider only regular cases where $F_\theta(t)$ and $\bar{F}_\theta(t)$ are monotonic functions of $\theta$ (except perhaps the degenerate cases where $F_\theta(t) = 1$ or $\bar{F}_\theta(t) = 0$ for all $\theta$, when $t$ is the maximum or minimum). In this case the matching confidence limits to the central test are $(\theta_L, \theta_U)$, which are the solutions to
$$\alpha/2 = \bar{F}_{\theta_L}(t) \qquad \text{and} \qquad \alpha/2 = F_{\theta_U}(t),$$
except when $t$ is the minimum or maximum, in which case the limit is set at the appropriate extreme of the parameter space. The matching confidence intervals for $p_m$ and $p_b$ require a more complicated algorithm to ensure precision of the confidence limits (Fay, 2010).
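For the binomial case, a sketch of the central limits obtained by solving these two equations numerically (binom.test returns the same Clopper-Pearson limits, so this is only for illustration):

## Central (Clopper-Pearson) limits for x successes out of n
central.ci <- function(x, n, alpha = 0.05) {
  lower <- if (x == 0) 0 else
    uniroot(function(p) 1 - pbinom(x - 1, n, p) - alpha/2,
            interval = c(1e-10, 1 - 1e-10))$root
  upper <- if (x == n) 1 else
    uniroot(function(p) pbinom(x, n, p) - alpha/2,
            interval = c(1e-10, 1 - 1e-10))$root
  c(lower, upper)
}
central.ci(5, 20)          # compare with binom.test(5, 20)$conf.int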

If matching confidence intervals are used, then test-CI inconsistencies will not happen for the central method, and will happen very rarely for the minlike and blaker methods. We discuss those rare test-CI inconsistencies in the 'Unavoidable inconsistencies' section later, but the main point of this article is that it is not rare for $p_m$ to be inconsistent with the central confidence interval (Fay, 2010), and that particular test-CI combination is the default for many exact tests in the stats package. We show some examples of such inconsistencies in the following sections.

Binomial: one-sample

If $X$ is binomial with parameters $n$ and $\theta$, then the central exact interval is the Clopper-Pearson confidence interval. These are the intervals given by binom.test. The p-value given by binom.test is $p_m$. The matching interval to $p_m$ was proposed by Stern (1954) (see Blaker (2000)).

When $\theta_0 = 0.5$ we have $p_c = p_m = p_b$, and there is no chance of a test-CI inconsistency even when the confidence intervals are not inversions of the test, as is the case in binom.test. When $\theta_0 \neq 0.5$ there may be problems. We explore these cases in the 'Poisson: two-sample' section later, since the associated tests reduce through conditioning to one-sample binomial tests.

Note that there is a theoretically proven set of shortest confidence intervals for this problem. These are called the Blyth-Still-Casella intervals in StatXact (StatXact Procs Version 8). The problem with these shortest intervals is that they are not nested, so that one could have a parameter value that is included in the 90% confidence interval but not in the 95% confidence interval (see Theorem 2 of Blaker (2000)). In contrast, the matching intervals of the binom.exact function of the exactci package will always give nested intervals.
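A sketch of a one-sample call (binom.exact is in the exactci package; the tsmethod argument selects among the three two-sided methods, and the reported confidence interval matches the chosen p-value):

library(exactci)
## 5 successes out of 20, testing theta0 = 0.3
binom.exact(5, 20, p = 0.3, tsmethod = "blaker")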

Poisson: one-sample

If $X$ is Poisson with mean $\theta$, then poisson.test from stats gives the exact central confidence intervals (Garwood, 1936), while the p-value is $p_m$. Thus, we can easily find a test-CI inconsistency: poisson.test(5, r=1.8) gives a p-value of $p_m = 0.036$ but the 95% central confidence interval of (1.6, 11.7) contains the null rate of 1.8. As $\theta$ gets large the Poisson distribution may be approximated by the normal distribution, and these test-CI inconsistencies become more rare.

The exactci package contains the poisson.exact function, which has options for each of the three methods of defining p-values and gives matching confidence intervals. The code poisson.exact(5, r=1.8, tsmethod="central") gives the same confidence interval as above, but a p-value of $p_c = 0.073$; while poisson.exact(5, r=1.8, tsmethod="minlike") returns a p-value equal to $p_m$, but a 95% confidence interval of (2.0, 11.8). Finally, using tsmethod="blaker" we get $p_b = 0.036$ (it is not uncommon for $p_b$ to equal $p_m$) and a 95% confidence interval of (2.0, 11.5). We see that there is no test-CI inconsistency when using the matching confidence intervals.

Poisson: two-sample

For the control group, let the random variable of the counts be $Y_0$, the rate be $\lambda_0$ and the population at risk be $m_0$. Let the corresponding values for the test group be $Y_1$, $\lambda_1$ and $m_1$. If we condition on $Y_0 + Y_1 = N$, then the distribution of $Y_1$ is binomial with parameters $N$ and
$$\theta = \frac{m_1 \lambda_1}{m_0 \lambda_0 + m_1 \lambda_1}.$$
This parameter may be written in terms of the ratio of rates, $\rho = \lambda_1 / \lambda_0$, as
$$\theta = \frac{m_1 \rho}{m_0 + m_1 \rho},$$
or equivalently,
$$\rho = \frac{m_0 \theta}{m_1 (1 - \theta)}. \tag{1}$$
Thus, the null hypothesis that $\lambda_1 = \lambda_0$ is equivalent to $\rho = 1$ or $\theta = m_1/(m_0 + m_1)$, and confidence intervals for $\theta$ may be transformed into confidence intervals for $\rho$ by Equation 1. So the inner workings of the poisson.exact function when dealing with two-sample tests simply use the binom.exact function and transform the results using Equation 1.

Let us return to our motivating example (i.e., testing the difference between the observed rates 2/17877 and 10/20000). As in the other sections, the results from poisson.test output $p_m$ but the 95% central confidence intervals; as we have seen, these give a test-CI inconsistency. The poisson.exact function avoids such a test-CI inconsistency in this case by giving the matching confidence interval; here are the results of the three tsmethod options:

  tsmethod    p-value    95% confidence interval
  central     0.061      (0.024, 1.050)
  minlike     0.042      (0.035, 0.942)
  blaker      0.042      (0.035, 0.936)
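A sketch of this conditional approach applied to the motivating example, written directly from Equation 1 (the minlike row of the table above should be recovered this way):

library(exactci)
## 2 events in 17877 (first group) versus 10 in 20000 (second group)
y1 <- 2;  m1 <- 17877
y0 <- 10; m0 <- 20000
N <- y1 + y0
theta0 <- m1/(m0 + m1)                # value of theta when rho = 1
b <- binom.exact(y1, N, p = theta0, tsmethod = "minlike")
b$p.value                             # the minlike p-value, 0.042
## Equation 1 maps the binomial CI for theta to a CI for the rate ratio
m0 * b$conf.int / (m1 * (1 - b$conf.int))  # approx (0.035, 0.942)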

Analysis of 2 × 2 tables, unpaired

The 2 × 2 table may be created from many different designs; consider first designs where there are two groups of observations with binary outcomes. If all the observations are independent, even if the number in each group is not fixed in advance, proper inferences may still be obtained by conditioning on those totals (Lehmann and Romano, 2005). Fay (2010) considers the 2 × 2 table with independent observations, so we only briefly present his motivating example here. The usual two-sided application of Fisher's exact test, fisher.test(matrix(c(4,11,50,569), 2, 2)), gives $p_m = 0.032$ using the minlike method, but a 95% confidence interval on the odds ratio of (0.92, 14.58) using the central method. As with the other examples, the test-CI inconsistency disappears when we use either the exact2x2 or fisher.exact function from the exact2x2 package.
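A sketch of the consistent analysis of the same table with exact2x2 (tsmethod = "minlike" is spelled out here so the confidence interval matches the usual Fisher p-value):

library(exact2x2)
tab <- matrix(c(4, 11, 50, 569), 2, 2)
fisher.test(tab)                         # p = 0.032, but a central 95% CI
fisher.exact(tab, tsmethod = "minlike")  # same p-value, matching CI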

Analysis of 2 × 2 tables, paired

The case not studied in Fay (2010) is when the data are paired, the case which motivates McNemar's test. For example, suppose you have twins randomized to two treatment groups (Test and Control) then test on a binary outcome (pass or fail). There are 4 possible outcomes for each pair: (a) both twins fail, (b) the twin in the control group fails and the one in the test group passes, (c) the twin in the test group fails and the one in the control group passes, or (d) both twins pass. Here is a table where the numbers of sets of twins falling in each of the four categories are denoted a, b, c and d:

               Test
  Control      Fail   Pass
  Fail           a      b
  Pass           c      d

In order to test if the treatment is helpful, we use only the numbers of discordant pairs of twins, $b$ and $c$, since the other pairs of twins tell us nothing about whether the treatment is helpful or not. McNemar's test statistic is
$$Q \equiv Q(b,c) = \frac{(b-c)^2}{b+c},$$
which for large samples is distributed like a chi-squared distribution with 1 degree of freedom. A closer approximation to the chi-squared distribution uses a continuity correction:
$$Q_C \equiv Q_C(b,c) = \frac{(|b-c|-1)^2}{b+c}.$$

In R this test is given by the function mcnemar.test.

Case-control data may be analyzed this way as well. Suppose you have a set of people with some rare disease (e.g., a certain type of cancer); these are called the cases. For this design you match each case with a control who is as similar as feasible on all important covariates except the exposure of interest. Here is a table:

                 Exposed
  Not Exposed    Control   Case
  Control           a        b
  Case              c        d


For this case as well we can use $Q$ or $Q_C$ to test for no association between case/control status and exposure status.

For either design, we can estimate the odds ratio by $b/c$, which is the maximum likelihood estimate (see Breslow and Day (1980), p. 165). Consider some hypothetical data (chosen to highlight some points):

               Test
  Control      Fail   Pass
  Fail          21      9
  Pass           2     12

When we perform McNemar's test with the continuity correction we get $p = 0.070$, while without the correction we get $p = 0.035$. Since the inferences are on either side of the traditional 0.05 cutoff of significance, it would be nice to have an exact version of the test to be clearer about significance at the 0.05 level. From the exact2x2 package, using mcnemar.exact we get the exact McNemar's test p-value of $p = 0.065$. We now give the motivation for the exact version of the test.
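A sketch reproducing these three p-values from the hypothetical table above:

library(exact2x2)
tab <- matrix(c(21, 2, 9, 12), 2, 2,
              dimnames = list(Control = c("Fail", "Pass"),
                              Test = c("Fail", "Pass")))
mcnemar.test(tab)                   # continuity corrected: p = 0.070
mcnemar.test(tab, correct = FALSE)  # uncorrected:          p = 0.035
mcnemar.exact(tab)                  # exact version:        p = 0.065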

After conditioning on the total number of discordant pairs, $b + c$, we can treat the problem as $B \sim \mathrm{Binomial}(b+c, \theta)$, where $B$ is the random variable associated with $b$. Under the null hypothesis, $\theta = 0.5$. We can transform the parameter $\theta$ into an odds ratio by
$$\text{Odds Ratio} \equiv \phi = \frac{\theta}{1-\theta} \tag{2}$$
(Breslow and Day (1980), p. 166). Since it is easy to perform exact tests on a binomial parameter, we can perform exact versions of McNemar's test internally using the binom.exact function of the package exactci and then transform the results into odds ratios via Equation 2. This is how the calculations are done in the exact2x2 function when paired=TRUE. The alternative and the tsmethod options work in the way one would expect. So although McNemar's test was developed as a two-sided test testing the null that $\theta = 0.5$ (or equivalently $\phi = 1$), we can easily extend this to get one-sided exact McNemar-type tests. For two-sided tests we can get three different versions of the two-sided exact McNemar's p-value function using the three tsmethod options, but all three are equivalent to the exact version of McNemar's test when testing the usual null that $\theta = 0.5$ (see the Appendix in vignette("exactMcNemar") in exact2x2). If we narrowly define McNemar's test as only testing the null that $\theta = 0.5$, as was done in the original formulation, there is only one exact McNemar's test; it is only when we generalize the test to null hypotheses of $\theta = \theta_0 \neq 0.5$ that there are differences between the three methods. Those differences between the tsmethod options become apparent in the calculation of the confidence intervals. The default is to use central confidence intervals, so that they guarantee that the lower (upper) limit of the 100(1-α)% confidence interval has less than α/2 probability of being greater (less) than the true parameter. These guarantees on each tail are not true for the minlike and blaker two-sided confidence intervals; however, the latter give generally tighter confidence intervals.

Graphing P-values

In order to gain insight as to why test-CI inconsistencies occur, we can plot the p-value function. This type of plot explores one data realization and its many associated p-values on the vertical axis, representing a series of tests modified by changing the point null hypothesis parameter ($\theta_0$) on the horizontal axis. There is a default plot command for binom.exact, poisson.exact, and exact2x2 that plots the p-value as a function of the point null hypotheses, draws vertical lines at the confidence limits, draws a line at 1 minus the confidence level, and adds a point at the null hypothesis of interest. Other plot functions (exactbinomPlot, exactpoissonPlot, and exact2x2Plot) can be used to add to that plot for comparing different methods. In Figure 1 we create such a plot for the motivating example. Here is the code to create that figure:

x <- c(2, 10)
n <- c(17877, 20000)
poisson.exact(x, n, plot = TRUE)
exactpoissonPlot(x, n, tsmethod = "minlike",
                 doci = TRUE, col = "black", cex = 0.25,
                 newplot = FALSE)

Figure 1: Graph of p-value functions (p-value versus null hypothesis rate ratio) for the motivating two-sample Poisson example. Gray is the central method, black is the minlike method. Vertical lines are 95% confidence limits; the black circle is the central p-value at the null rate ratio of 1.


We see from Figure 1 that the central method has smoothly changing p-values, while the minlike method has discontinuous ones. The usual confidence interval is the inversion of the central method (the limits are the vertical gray lines, where the dotted line at the significance level intersects with the gray p-values), while the usual p-value at the null that the rate ratio is 1 is where the black line is. To see this more clearly, we plot the lower right hand corner of Figure 1 in Figure 2.

Figure 2: Graph of p-value functions (p-value versus null hypothesis rate ratio) for the motivating two-sample Poisson example. Gray is the central method, black is the minlike method. Vertical lines are upper 95% confidence limits; solid black circles are the respective p-values at the null rate ratio of 1.

From Figure 2 we see why the test-CI inconsistencies occur: the minlike method is generally more powerful than the central method, which is why the p-values from the minlike method can reject a specific null when the confidence intervals from the central method imply failing to reject that same null. We see that, in general, if you use the matching confidence interval to the p-value, there will not be test-CI inconsistencies.

Unavoidable Inconsistencies

Although the exactci and exact2x2 packages do provide a unified report in the sense described in Hirji (2006), it is still possible in rare instances to obtain test-CI inconsistencies when using the minlike or blaker two-sided methods (Fay, 2010). These rare inconsistencies are unavoidable due to the nature of the problem rather than any deficit in the packages.

To show the rare inconsistency problem using the motivating example, we consider the unrealistic situation where we are testing the null hypothesis that the rate ratio is 0.93 at the 0.0776 level. The corresponding confidence interval would be a 92.24% = 100 × (1 − 0.0776)% interval. Using poisson.exact(x, n, r=.93, tsmethod="minlike", conf.level=1-0.0776) we reject the null (since $p_m = 0.07758 < 0.0776$), but the 92.24% matching confidence interval contains the null rate ratio of 0.93. In this situation, the confidence set that is the inversion of the series of tests is two disjoint intervals ([0.0454, 0.9257] and [0.9375, 0.9419]), and the matching confidence interval fills in the hole in the confidence set. This is an unavoidable test-CI inconsistency. Figure 3 plots the situation.

Figure 3: Graph of p-value functions (p-value versus null hypothesis rate ratio) for the motivating two-sample Poisson example. Gray is the central method, black is the minlike method. Circles are p-values, vertical lines are the upper 92.24% confidence limits, and the solid black circle is the minlike p-value at the null rate ratio of 0.93. The unrealistic significance level of 0.0776 and null rate ratio of 0.93 are chosen to highlight the rare unavoidable inconsistency problem.

Additionally, the options tsmethod="minlike" or "blaker" can have other anomalies (see Vos and Hudson (2008) for the single sample binomial case, and Fay (2010) for the two-sample binomial case). For example, the data may reject, but fail to reject after an additional observation is added, regardless of the value of that additional observation. Thus, although the power of the blaker (or minlike) two-sided method is always (almost always) greater than that of the central two-sided method, the central method does avoid all test-CI inconsistencies and the previously mentioned anomalies.

Discussion

We have argued for using a unified report whereby the p-value and the confidence interval are calculated from the same p-value function (also called the evidence function or confidence curve). We have provided several practical examples. Although the theory of these methods has been extensively studied (Hirji, 2006), software has not been readily available. The exactci and exact2x2 packages fill this need. We know of no other software that provides the minlike and blaker confidence intervals, except the PropCIs package, which provides the Blaker confidence interval for the single binomial case only.

Finally, we briefly consider closely related software. The rateratio.test package does the two-sample Poisson exact test with confidence intervals using the central method. The PropCIs package does several different asymptotic confidence intervals, as well as the Clopper-Pearson (i.e., central) and Blaker exact intervals for a single binomial proportion. The PropCIs package also performs the mid-p adjustment to the Clopper-Pearson confidence interval, which is not currently available in exactci. Other exact confidence intervals are not covered in the current version of PropCIs (Version 0.1-6). The coin and perm packages give very general methods for performing exact permutation tests, although neither performs the exact matching confidence intervals for the cases studied in this paper.

I did not perform a comprehensive search of commercial statistical software; however, neither SAS (Version 9.2), perhaps the most comprehensive commercial statistical software, nor StatXact (Version 8), the most comprehensive software for exact tests, implements the blaker and minlike confidence intervals for the binomial, Poisson, and 2x2 table cases.

Bibliography

A. Agresti and Y. Min. On small-sample confidence intervals for parameters in discrete distributions. Biometrics, 57:963-971, 2001.

H. Blaker. Confidence curves and improved exact confidence intervals for discrete distributions. Canadian Journal of Statistics, 28:783-798, 2000.

N. Breslow and N. Day. Statistical Methods in Cancer Research: Volume 1: Analysis of Case Control Studies. International Agency for Research in Cancer, Lyon, France, 1980.

D. Cox and D. Hinkley. Theoretical Statistics. Chapman and Hall, London, 1974.

M. Fay. Confidence intervals that match Fisher's exact or Blaker's exact tests. Biostatistics, 11:373-374, 2010.

M. Fay, C. Huang, and N. Twum-Danso. Monitoring rare serious adverse events from a new treatment and testing for a difference from historical controls. Controlled Clinical Trials, 4:598-610, 2007.

F. Garwood. Fiducial limits for the Poisson distribution. Biometrika, pages 437-442, 1936.

K. Hirji. Exact Analysis of Discrete Data. Chapman and Hall/CRC, New York, 2006.

E. Lehmann and J. Romano. Testing Statistical Hypotheses, third edition. Springer, New York, 2005.

T. Stern. Some remarks on confidence and fiducial limits. Biometrika, pages 275-278, 1954.

P. Vos and S. Hudson. Problems with binomial two-sided tests and the associated confidence intervals. Australian and New Zealand Journal of Statistics, 50:81-89, 2008.

Michael P. Fay
National Institute of Allergy and Infectious Diseases
6700-A Rockledge Dr. Room 5133, Bethesda, MD
[email protected]


Book Reviews

A Beginner's Guide to R

Alain F. ZUUR, Elena N. IENO, and Erik H.W.G. MEESTERS. New York, NY: Springer, 2009. ISBN 978-0-387-93836-3. xv + 218 pp. $59.95 (P).

A Beginner's Guide to R is just what its title implies, a quick-start guide for the newest R users. A unique feature of this welcome addition to Springer's Use R! series is that it is devoted solely to getting the user up and running on R. Unlike other texts geared towards R beginners, such as Verzani (2005), this text does not make the mistake of trying to simultaneously teach statistics. The experienced R user will not have much use for this book, except perhaps for adoption as a textbook. To this end, there are straightforward homework exercises provided throughout (no answers, though), and the data sets can be downloaded from the authors' website http://www.highstat.com. It should be noted, however, that the examples and exercises are limited to the authors' area of expertise: ecology.

The book starts at the very beginning by instructing the user how to install R and load packages from CRAN. One small weakness is that the book is directed almost exclusively toward PC users. In particular, I was disappointed by the paucity of information concerning R text editors that are compatible with the Mac. (After a fair amount of trial and error, I finally determined that gedit would do the job for me.) A nice feature of Chapter 1 is an annotated bibliography of "must-have" R books. Early on, the authors sagely direct the reader toward the universe of R assistance available online (and console the panicked reader that even experienced R users can be intimidated by the sheer amount of information contained in R help files).

The remainder of the book is devoted to demonstrating how to do the most basic tasks with R. Chapter 2 describes several methods for getting data into R (useful information for anybody facing the daunting prospect of importing a large data set into R for the first time). To appease the novice hungering for some "fancy" R output, the authors provide easy-to-follow instructions for constructing both simple (Chapters 5 and 7) and not-so-simple (Chapter 8) graphical displays. Standard plots from the introductory statistics curriculum are included (e.g., the histogram, boxplot, scatterplot, and dotplot), and the lattice package is introduced for the benefit of readers with more advanced graphical needs. Other topics include data management (Chapter 3) and simple functions and loops (Chapters 4 and 6). Chapter 9 concludes the book by suggesting solutions to some problems commonly encountered by R users, beginners and old pros alike.

In sum, A Beginner's Guide to R is an essential resource for the R novice, whether an undergraduate learning statistics for the first time or a seasoned statistician biting the bullet and making the switch to R. To get the most bang for the buck (the cost is a bit steep for such a short paperback), I advise the user to set aside a weekend (or a week) to launch R and work through the book from start to finish. It will be time well spent; just keep in mind that this book is all about learning to play a scale, and you'll be disappointed if you expect to emerge with all the skills required to perform a concerto.

Laura M. Schultz
Rowan University

Bibliography

J. Verzani. Using R for Introductory Statistics. Boca Raton, FL: Chapman & Hall, 2005.


Conference Review: The 2nd Chinese R Conference
by Jing Jiao, Jingjing Guan and Yihui Xie

The 2nd Chinese R Conference was successfully held in December 2009 in Beijing and Shanghai. This conference was organized in two cities so that more people would be able to enjoy similar talks without additional traveling cost. The Beijing session, held in the Renmin University of China (RUC) on December 5-6, 2009, was sponsored and organized by the Center for Applied Statistics and the School of Statistics, RUC. Yanping Chen served as the chair of the conference while Janning Fan was the secretary. Yixuan Qiu and Jingjing Guan were in charge of local arrangements. The Shanghai session took place in the East China Normal University (ECNU) on December 12-13, 2009; it was organized by the School of Resources and Environment Science (SRES) and the School of Finance and Statistics, ECNU, and sponsored by Mango Solutions. Jing Jiao and Xiang Zhang served as the chairs of the conference; the Molecular Ecology Group in SRES was in charge of local arrangements. Both sessions were co-organized by the web community "Capital of Statistics" (http://cos.name).

Besides promoting communication among Chinese R users as the 1st Chinese R Conference did, the 2nd Chinese R Conference concentrated on the applications of R in statistical computing and graphics in various disciplines. The slogan of this conference was "useR eveRywheRe", indicating that R is widespread in a variety of fields and that there is now a large number of useRs in China.

In total, more than 300 participants from over 90 institutions took part in this conference. There were 19 talks in the Beijing session:

Opening A review on the history of the Chinese R Conference and a brief introduction to R by Yanping Chen;

Statistical Graphics Introduction to R base graphics by Tao Gao and Cheng Li; Matrix visualization using the package corrplot by Taiyun Wei;

Statistical Models Nonparametric methods and robust estimation of quantile regression by Chen Zuo; R and WinBUGS by Peng Ding; Grey System Theory in R by Tan Xi;

Software Development Integrating R into C/C++ applications using Visual C++ by Yu Gong; Using R in online data analysis by Zhiyi Huang;

Industrial Applications R in establishing standards for the food industry by Qiding Zhong; Investigation and monitoring on geological environment with R by Yongsheng Liu; R in marketing research by Yingchun Zhu; Zhiyi Huang also gave an example on handling atmosphere data with R; R in near-infrared spectrum analysis by Die Sun;

Data Mining Jingjing Guan introduced RExcel with applications in data mining and the trend of data mining focused on ensemble learning; Dealing with large data and reporting automation by Sizhe Liu;

Kaleidoscope Discussing security issues of R by Nan Xiao; Using R in economics and econometrics by Liyun Chen; R in psychology by Xiaoyan Sun and Ting Wang; R in spatial analysis by Huaru Wang; Optimization of molecular structure parameter matrix in QSAR with the package omd by Bin Ma;

More than 150 people from nearly 75 institutions all over China participated in the Shanghai session. Among these participants, we were honored to meet some pioneer Chinese useRs such as Prof Yincai Tang, the first teacher introducing R in the School of Finance and Statistics of ECNU. There were 13 talks given in the Shanghai session, which partially overlapped with the Beijing session:

Opening Introduction of the 2nd Chinese R Conference (Shanghai session) by Jing Jiao, and opening address by Prof Yincai Tang and Prof Xiaoyong Chen (Vice President of SRES);

Theories Bayesian Statistics with R and WinBUGS by Prof Yincai Tang; Grey System Theory in R by Tan Xi;

Graphics Introduction to R base graphics by Tao Gao and Cheng Li; Matrix visualization using the package corrplot by Taiyun Wei;

Applications Using R in economics and econometrics by Liyun Chen; Marketing analytical framework by Zhenshun Lin; Decision tree with the rpart package by Weijie Wang; Optimization of molecular structure parameter matrix in QSAR with the package omd by Bin Ma; Survival analysis in R by Yi Yu; The application of R and statistics in the semiconductor industry by Guangqi Lin;

Software Development Dealing with large data and reporting automation by Sizhe Liu; JAVA development and optimization in R by Jian Li; Discussing security issues of R by Nan Xiao;


On the second day there was a training session for R basics (Learning R in 153 minutes by Sizhe Liu) and extensions (Writing R packages by Sizhe Liu and Yihui Xie; Extending R by Jian Li); a discussion session was arranged after the training. The conference was closed with a speech by Xiang Zhang.

Overall, participants were very interested in the topics contributed to the conference, and the discussion session reflected a strong need for learning and using R in China. After this conference, we have also decided to make future efforts on:

• Increasing the number of sessions of this conference to meet the demand of useRs in other areas;

• Promoting the communication between statistics and other disciplines via using R;

• Interacting with different industries through successful applications of R.

All slides are available online at http://cos.name/2009/12/2nd-chinese-r-conference-summary/. We thank Prof Xizhi Wu for his consistent encouragement on this conference, and we are especially grateful to Prof Yanyun Zhao, Dean of the School of Statistics of RUC, for his great support of the Beijing session. We look forward to the next R conference in China and warmly welcome more people to attend it. The main session is tentatively arranged in the Summer of 2010. Inquiries and suggestions can be sent to [email protected] or http://cos.name/ChinaR/ChinaR-2010.

Jing Jiao
School of Resources and Environment Science
East China Normal University, China P. R.
[email protected]

Jingjing Guan
School of Statistics, Renmin University of China
China P. R.
[email protected]

Yihui Xie
Department of Statistics & Statistical Laboratory
Iowa State University, USA
[email protected]


Introducing NppToR: R Interaction for Notepad++
by Andrew Redd

Introduction

A good practice when programming or doing analysis in R is to keep the commands in a script that can be saved and reused. An appropriate editor makes programming in R much more efficient and less error prone.

On Linux/Unix systems, users typically use either Vim or EMACS with ESS. Both of these have features to improve programming in R. These include syntax highlighting, which means certain keywords are colored or marked to indicate they have special meaning, and code folding, where lines of code can be automatically grouped and temporarily hidden from view. Perhaps the most important feature is the ability to directly evaluate lines of code in the R interpreter. This facilitates quick and easy programming while maintaining a solid record in a reusable script.

The version of R for Microsoft Windows comes with a built-in editor. This editor does have interaction features like Vim and EMACS, but lacks the other features like syntax highlighting and code folding. There are several alternative code editors for Windows, such as Notepad++, WinEdt, Tinn-R, and Eclipse, that do possess advanced features, but many novice users are simply unaware of them.

NppToR is a native Windows solution that brings the advantages that users of other platforms have to Windows users. NppToR works with the powerful and popular code editor Notepad++ (http://notepad-plus.sourceforge.net/). Notepad++ has a natural Windows feel to it, so the learning curve for getting started is almost nonexistent, but as users become more accustomed to the advanced features of Notepad++ the real power comes out.

NppToR tries to follow the easy use and low learning curve of Notepad++. It not only simply works, but also becomes more powerful as users become more familiar with its features. NppToR has been under development for over a year and has gained a serious user base without ever being promoted outside of the R mailing lists. NppToR can be downloaded from SourceForge (http://sourceforge.net/projects/npptor/).

Interaction

The primary purpose of NppToR is to add R interaction capability to Notepad++. It does this through the use of hotkeys. When a hotkey is pressed it triggers an action such as evaluating a line of code. Figure 1 lists the hotkeys and the actions they perform. The hotkeys are totally configurable through the settings.

NppToR Default Hotkeys and Actions

F8 Evaluate a line of code or selection.

Ctrl+F8 Evaluate the entire current file.

Shift+F8 Evaluate the file to the point of the cursor.

Ctrl+Alt+F8 Evaluate as a batch file and examine the results.

Figure 1: The default hotkeys for evaluating code with NppToR.

NppToR sits in the system tray, as shown in Figure 2. The menu, where the settings and extra features are found, is accessed by right clicking on the NppToR icon in the system tray.

Figure 2: NppToR sits as a utility in the system tray.

One of the ways that NppToR just works is that when evaluating code from Notepad++, NppToR looks for a currently open R GUI session in which to evaluate the code. If R is not running, a new R session will be started and the code evaluated there. The R home directory is automatically found through the Windows registry. One advantage of having NppToR start R is that the working directory for R is set to the directory of the open script in Notepad++. This means that file names for data files and other scripts can be made relative rather than absolute. For example, if the script reads in a data file 'data.txt' and the data file is in the same directory as the script, the file reference can simply be made as

data <- read.table("data.txt")

rather than using the absolute path,

data <- read.table("C:/path/to/data.txt")

which is not only longer than necessary to type but also not portable. Simply moving a file or renaming a folder could break all the file references. For already open R sessions there is a menu item that will set the working directory to the current script directory.

NppToR is designed to be very small and non-intrusive. It sits in the background and is only active when Notepad++ is in use. Occasional users of R may not like having the utility running on startup; for those users, when NppToR is run manually it will also launch Notepad++.

Other features

Although Notepad++ now has native support for the R language, it is a recent addition. NppToR also generates syntax files for use with Notepad++'s user defined language system, which is an artifact from when Notepad++ did not have support for R. This feature has been retained because it still offers advantages over the built-in syntax rules. For example, NppToR generates the syntax files dynamically, and can take advantage of a library to include the keywords from installed packages. This method ensures that all major keywords are highlighted, and can easily be updated to accommodate changes in the R distribution, as well as the addition of new packages. The generated language also differs in enforcing case sensitivity and proper word boundaries.

R is an ideal environment for performing simulation studies. These simulations can run for a long time, and are typically run in batch mode. NppToR not only provides an easy hotkey to run these simulations but also provides a simulation monitor to keep track of active simulations, monitor how long they have been running, and provide an easy way to terminate those that need to be stopped early. Figure 3 shows the simulation monitor.

Figure 3: NppToR monitors active simulations, allowing for an easy way to kill off simulations that have been running too long.

Summary

Notepad++ benefits from a large user base, and is constantly updated and improved, including several extensions that also help with developing for R. Notepad++ with NppToR mirrors the advantages of editors on other platforms but with a native Windows feel. The power and strength of NppToR comes in its simplicity. It enhances an already powerful editor to become a superior editor for R.

Both Notepad++ and NppToR are easy to use and easy to learn, but as users become proficient with the vast array of keyboard commands and the macro system, the true power of Notepad++ is revealed. NppToR should be considered by anyone who works with R on Windows.

Acknowledgments

The author's research was supported by a grant from the National Cancer Institute (CA57030).

Andrew Redd
Department of Statistics
Texas A&M University
3143 TAMU
College Station, TX 77843-3143, USA
[email protected]


Changes in R 2.10.1-2.11.1
by the R Core Team

R 2.11.1 changes

New features

• R CMD INSTALL checks if dependent packages are available early on in the installation of source packages, thereby giving clearer error messages.

• R CMD INSTALL --build now names the file in the format used for Mac OS X binary files on that platform.

• BIC() in package stats4 now also works with multiple fitted models, analogously to AIC().

Deprecated & defunct

• Use of file extension .C for C++ code in packages is now deprecated: it has caused problems for some makes on case-insensitive file systems (although it currently works with the recommended toolkits).

Installation changes

• Command gnutar is preferred to tar when configure sets TAR. This is needed on Mac OS 10.6, where the default tar, bsdtar 2.6.2, has been reported to produce archives with illegal extensions to tar (according to the POSIX standard).

Bug fixes

• The C function mkCharLenCE now no longer reads past len bytes (unlikely to be a problem except in user code). (PR#14246)

• On systems without any default LD_LIBRARY_PATH (not even /usr/local/lib), [DY]LIB_LIBRARY_PATH is now set without a trailing colon. (PR#13637)

• More efficient utf8ToInt() on long multi-byte strings with many multi-byte characters. (PR#14262)

• aggregate.ts() gave platform-dependent results due to rounding error for ndeltat != 1.

• package.skeleton() sometimes failed to fix filenames for .R or .Rd files to start with an alphanumeric. (PR#14253) It also failed when only an S4 class without any methods was defined. (PR#14280)

• splinefun(*, method = "monoH.FC") was not quite monotone in rare cases. (PR#14215)

• Rhttpd no longer crashes due to SIGPIPE when the client closes the connection prematurely. (PR#14266)

• format.POSIXlt() could cause a stack overflow and crash when used on very long vectors. (PR#14267)

• Rd2latex() incorrectly escaped special characters in \usage sections.

• mcnemar.test() could alter the levels (dropping unused levels) if passed x and y as factors (reported by Greg Snow).

• Rd2pdf sometimes needed a further pdflatex pass to get hyperlinked pages correct.

• interaction() produced malformed results when levels were duplicated, causing segfaults in split().

• cut(d, breaks = <n>) now also works for "Date" or "POSIXt" argument d. (PR#14288)
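For example, a small illustration of the fixed behaviour (with made-up dates):

d <- as.Date("2010-01-01") + 0:89
table(cut(d, breaks = 3))  # counts in three equal-width date intervals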

• memDecompress() could incompletely decompress rare xz-compressed input due to incorrect documentation of xz utils. (Report and patch from Olaf Mersmann.)

• The S4 initialize() methods for "matrix", "array", and "ts" have been fixed to call validObject(). (PR#14284)

• R CMD INSTALL now behaves the same way with or without --no-multiarch on platforms with only one installed architecture. (It used to clean the src directory without --no-multiarch.)

• [<-.data.frame was not quite careful enough in assigning (and potentially deleting) columns right-to-left. (PR#14263)

• rbeta(n, a, b) no longer occasionally returns NaN for a >> 1 > b. (PR#14291)

• pnorm(x, log.p = TRUE) could return NaN not -Inf for x near (minus, for lower.tail=TRUE) the largest representable number.

• Compressed data files *.(txt|tab|csv).(gz|bz2|xz) were not recognized for the list of data topics and hence for packages using LazyData. (PR#14273)


• textConnection() did an unnecessary translation on strings in a foreign encoding (e.g. UTF-8 strings on Windows) and so was slower than it could have been on very long input strings. (PR#14286)

• tools::Rd2txt() did not render poorly written Rd files consistently with other renderers.

• na.action() did not extract the na.action component as documented.

R 2.11.0 changes

Significant user-visible changes

• Packages must have been installed under R ≥ 2.10.0, as the current help system is the only one now supported.

• A port to 64-bit Windows is now available, as well as binary package repositories: see the 'R Administration and Installation Manual'.

• Argument matching for primitive functions is now done in the same way as for interpreted functions, except for the deliberate exceptions

call switch .C .Fortran .Call .External

all of which use positional matching for their first argument, and also some internal-use-only primitives.

• The default device for command-line R at the console on Mac OS X is now quartz() and not X11().

New features

• The open modes for connections are now interpreted more consistently. open = "r" is now equivalent to open = "rt" for all connections. The default open = "" now means "rt" for all connections except the compressed file connections gzfile(), bzfile() and xzfile(), for which it means "rb".

• R CMD INSTALL now uses the internal untar() in package utils: this ensures that all platforms can install bzip2- and xz-compressed tarballs. In case this causes problems (as it has on some Windows file systems when run from Cygwin tools) it can be overridden by the environment variable R_INSTALL_TAR: setting this to a modern external tar program will speed up unpacking of large (tens of Mb or more) tarballs.

• help(try.all.packages = TRUE) is much faster (although the time taken by the OS to find all the packages the first time it is used can dominate the time).

• R CMD check has a new option --timings to record per-example timings in file <pkg>.Rcheck/<pkg>-Ex.timings.

• The TRE library has been updated to version 0.8.0 (minor bugfixes).

• grep[l](), [g]sub() and [g]regexpr() now work in bytes in 8-bit locales if there is no marked UTF-8 input string: this will be somewhat faster, and for [g]sub() gives the result in the native encoding rather than in UTF-8 (which returns to the behaviour prior to R 2.10.0).

• A new argument skipCalls has been added to browser() so that it can report the original context when called by other debugging functions.

• More validity checking of UTF-8 and MBCS strings is done by agrep() and the regular-expression matching functions.

• The undocumented restriction on gregexpr() to length(text) > 0 has been removed.

• Package tcltk now sends strings to Tcl in UTF-8: this means that strings with a marked UTF-8 encoding are supported in non-UTF-8 locales.

• The graphics engine now supports rendering of raster (bitmap) images, though not all graphics devices can provide (full) support. Packages providing graphics devices (e.g., Cairo, RSvgDevice, cairoDevice) will need to be reinstalled.

There is also support in the graphics engine for capturing raster images from graphics devices (again not supported on all graphics devices).

• R CMD check now also checks if the package and namespace can be unloaded: this provides a check of the .Last.lib() and .onUnload() hook functions (unless --install=fake).

• prop.table(x) now accepts a one-dimensional table for x.

• A new function vapply() has been added, based on a suggestion from Bill Dunlap. It requires that a template for the function value be specified, and uses it to determine the output type and to check for consistency in the function values.
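For example, a minimal sketch: the FUN.VALUE template fixes the type and length of each result, so vapply() signals an error where sapply() might silently change the result type.

vapply(mtcars[1:3], mean, FUN.VALUE = numeric(1))       # named numeric vector
vapply(list(1:3, 4:6), range, FUN.VALUE = integer(2))   # 2-row integer matrix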

• The main HTML help page now links to a reformatted copy of this NEWS file. (Suggested by Henrik Bengtsson.) Package index files link to the package DESCRIPTION and NEWS files and a list of demos when using dynamic help.

• The [ method for class "AsIs" allows the next method to change the underlying class. (Wish of Jens Oehlschlägel.)


• write.csv[2]() no longer allows append to be changed: as ever, direct calls to write.table() give more flexibility as well as more room for error.

• The index page for HTML help for a package now collapses multiple signatures for S4 methods into a single entry.

• The use of .required by require() and detach() has been replaced by .Depends, which is set from the Depends field of a package (even in packages with name spaces). By default detach() prevents such dependencies from being detached: this can be overridden by the argument force.

• bquote() has been extended to work on function definitions (wish of PR#14031).

• detach() when applied to an object other than a package returns the environment that has been detached, to parallel attach().

• readline() in non-interactive use returns "" and does not attempt to read from the "terminal".

• New function file_ext() in package tools.
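For example:

tools::file_ext("model.R")        # "R"
tools::file_ext("archive.tar.gz") # "gz" (only the final extension)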

• xtfrm() is now primitive and internally generic, as this allows S4 methods to be set on it without name-space scoping issues. There are now "AsIs" and "difftime" methods, and the default method uses unclass(x) if is.numeric(x) is true (which will be faster but relies on is.numeric() having been set correctly for the class).

• is.numeric(x) is now false for a "difftime" object (multiplication and division make no sense for such objects).

• The default method of weighted.mean(x, w) coerces w to be numeric (aka double); previously only integer weights were coerced. Zero weights are handled specially, so an infinite value with zero weight does not force an NaN result. There is now a "difftime" method.

• bug.report() now has package and lib.loc arguments to generate bug reports about packages. When this is used, it looks for a BugReports field in the package DESCRIPTION file, which will be assumed to be a URL at which to submit the report, and otherwise generates an email to the package maintainer. (Suggested by Barry Rowlingson.)

• quantile() now has a method for the date-time class "POSIXt", and types 1 and 3 (which never interpolate) work for Dates and ordered factors.

• length(<POSIXlt>) now returns the length of the corresponding abstract timedate-vector rather than always 9 (the length of the underlying list structure). (Wish of PR#14073 and PR#10507.)

• The readline completion backend no longer sorts possible completions alphabetically (e.g., function argument names) if R was built with readline ≥ 6.

• select.list() gains a graphics argument to allow Windows/Mac users to choose the text interface. This changes the behaviour of new.packages(ask=TRUE) to be like update.packages(ask=TRUE) on those platforms in using a text menu: use ask="graphics" for a graphical menu.

• New function chooseBioCmirror() to set the "BioC_mirror" option.

• The R grammar prevents using the argument name in signatures of S4 methods for $ and $<-, since they will always be called with a character string value for name. The implicit S4 generic functions have been changed to reflect this: packages which included name in the signature of their methods need to be updated and re-installed.

• The handling of the method argument of glm() has been refined following suggestions by Ioannis Kosmidis and Heather Turner.

• str() gains a new argument list.len with default 99, limiting the number of list() items (per level), thanks to suggestions from David Winsemius.

• Having formal arguments of an S4 method in a different order from the generic is now an error (the warning having been ignored by some package maintainers for a long time).

• New functions enc2native() and enc2utf8() convert character vectors with possibly marked encodings to the current locale and UTF-8 respectively.

• Unrecognized escapes and embedded nuls in character strings are now an error, not just a warning. Thus option "warnEscapes" is no longer needed. rawToChar() now removes trailing nuls silently, but other embedded nuls become errors.

• Informational messages about masked objects displayed when a package is attached are now more compact, using strwrap() instead of one object per line.

• print.rle() gains argument prefix.


• download.file() gains a "curl" method, mainly for use on platforms which have curl but not wget, but also for some hard-to-access URLs.

• In Rd, \eqn and \deqn will render in HTML (and convert to text) upper- and lower-case Greek letters (entered as \alpha ...), \ldots, \dots, \ge and \le.

• utf8ToInt() and intToUtf8() now map NA inputs to NA outputs.

• file() has a new argument raw which may help if it is used with something other than a regular file, e.g. a character device.

• New function strtoi(), a wrapper for the C function strtol.
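For example:

strtoi("ff", base = 16L)   # 255
strtoi("0755", base = 8L)  # 493
strtoi("101", base = 2L)   # 5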

• as.octmode() and as.hexmode() now allow inputs of length other than one.

The format() and print() methods for "octmode" now preserve names and dimensions (as those for "hexmode" did).

The format() methods for classes "octmode" and "hexmode" gain a width argument.

• seq.int() returns an integer result in some further cases where seq() does, e.g. seq.int(1L, 9L, by = 2L).

• Added \subsection{}{} macro to Rd syntax, for subsections within sections.

• n-dimensional arrays with dimension names can now be indexed by an n-column character matrix. The indices are matched against the dimension names. NA indices are propagated to the result. Unmatched values and "" are not allowed and result in an error.
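A small sketch of the new indexing (with made-up dimnames):

a <- array(1:8, dim = c(2, 2, 2),
           dimnames = list(c("r1", "r2"), c("c1", "c2"), c("s1", "s2")))
idx <- rbind(c("r1", "c1", "s1"),
             c("r2", "c2", "s2"))
a[idx]  # one element per row of idx: here c(1, 8)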

• interaction(drop=TRUE) uses less memory (related to PR#14121).

• summary() methods have been added to the "srcref" and "srcfile" classes, and various encoding issues have been cleaned up.

• If option "checkPackageLicense" is set to TRUE (not currently the default), users will be asked to agree to non-known-to-be-FOSS package licences at first use.

• Checking setAs(a, b) methods only gives a message instead of a warning, when one of a or b is unknown.

• New function norm() to compute a matrix norm. norm() and also backsolve() and sample() have implicit S4 generics.
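For example:

m <- matrix(1:4, 2)
norm(m, type = "F")  # Frobenius norm, sqrt(sum(m^2)) = sqrt(30)
norm(m, type = "O")  # one norm, the maximum absolute column sum: 7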

• Renviron.site and Rprofile.site can have architecture-specific versions on systems with sub-architectures.

• R CMD check now (by default) also checks Rd files for auto-generated content in need of editing, and missing argument descriptions.

• aggregate() gains a formula method thanks to a contribution by Arni Magnusson. The data frame method now allows summary functions to return arbitrarily many values.
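For example, with the built-in chickwts data:

aggregate(weight ~ feed, data = chickwts, FUN = mean)  # mean weight per feed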

• path.expand() now propagates NA values rather than converting them to "NA".

• file.show() now disallows NA values for file names, headers, and pager.

• The 'fuzz' used by seq() and seq.int() has been reduced from 1e-7 to 1e-10, which should be ample for the double-precision calculations used in R. It ensures that the fuzz never comes into play with sequences of integers (wish of PR#14169).

• The default value of RSiteSearch(restrict=) has been changed to include vignettes but to exclude R-help. The R-help archives available have been split, with a new option of "Rhelp10" for those from 2010.

• New function rasterImage() in the graphics package for drawing raster images.

• stats:::extractAIC.coxph() now omits aliased terms when computing the degrees of freedom (suggestion of Terry Therneau).

• cor() and cov() now test for misuse with non-numeric arguments, such as the non-bug report PR#14207.

• pchisq(ncp =, log.p = TRUE) is more accurate for probabilities near one, e.g. pchisq(80, 4, ncp=1, log.p=TRUE). (Maybe what was meant in PR#14216.)

• maintainer() has been added, to give convenient access to the name of the maintainer of a package (contributed by David Scott).

• sample() and sample.int() allow zero items to be sampled from a zero-length input. sample.int() gains a default value size=n to be more similar to sample().

• switch() returned NULL on error (not previously documented on the help page): it now does so invisibly, analogously to if-without-else. It is now primitive: this means that EXPR is always matched to the first argument and there is no danger of partial matching to later named arguments.


• Primitive functions UseMethod(), attr(), attr<-(), on.exit(), retracemem() and substitute() now use standard argument matching (rather than positional matching). This means that all multi-argument primitives which are not internal now use standard argument matching except where positional matching is desirable (as for switch(), call(), .C(), ...).

• All the one-argument primitives now check that any name supplied for their first argument is a partial match to the argument name as documented on the help page: this also applies to replacement functions of two arguments.

• base::which() uses a new .Internal function when arr.ind is FALSE, resulting in a 10x speedup. Thanks to Patrick Aboyoun for implementation suggestions.

• Help conversion to text now uses the first part of \enc{}{} markup if it is representable in the current output encoding. On the other hand, conversion to LaTeX with the default outputEncoding = "ASCII" uses the second part.

• A new class "listOfMethods" has been introduced to represent the methods in a methods table, to replace the deprecated class "MethodsList".

• any() and all() return early if possible. This may speed up operations on long vectors.

• strptime() now accepts "%z" (for the offset from UTC in the RFC822 format of +/-hhmm).
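For example:

strptime("2010-06-01 12:00 -0500", format = "%Y-%m-%d %H:%M %z")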

• The PCRE library has been updated to version 8.02, a bug-fix release which also updates tables to Unicode 5.02.

• Functions which may use a graphical select.list() (including menu() and install.packages()) now check on a Unix-alike that Tk can be started (and not just capabilities("tcltk") && capabilities("X11")).

• The parser no longer marks strings containing octal or hex escapes as being in UTF-8 when entered in a UTF-8 locale.

• On platforms with cairo but not Pango (notably Mac OS X) the initial default X11() type is set to "Xlib": this avoids several problems with font selection when done by cairo rather than Pango (at least on Mac OS X).

• New arrayInd() such that which(x, arr.ind = TRUE) for an array x is now equivalent to arrayInd(which(x), dim(x), dimnames(x)).
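For example:

m <- matrix(c(2, 7, 4, 9), nrow = 2)
which(m > 3, arr.ind = TRUE)    # row/col positions of the TRUE entries
arrayInd(which(m > 3), dim(m))  # the same positions via arrayInd()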

Deprecated & defunct

• Bundles of packages are defunct.

• stats::clearNames() is defunct: use unname().

• Basic regular expressions are defunct, and strsplit(), grep(), grepl(), sub(), gsub(), regexpr() and gregexpr() no longer have an extended argument.

• methods::trySilent() is defunct.

• index.search() (which was deprecated in 2.10.0) is no longer exported and has a different argument list.

• Use of multiple arguments to return() is now defunct.

• The use of UseMethod() with more than two arguments is now defunct.

• In the methods package, the MethodsList metadata objects which had been superseded by hash tables (environments) since R 2.8.0 are being phased out. Objects of this class are no longer assigned or used as metadata by the package.

getMethods() is now deprecated, with its internal use replaced by findMethods() and other changes. Creating objects from the MethodsList class is also deprecated.

• Parsing strings containing both octal/hex and Unicode escapes now gives a warning and will become an error in R 2.12.0.

Installation changes

• UTF-8 is now used for the reference manual and package manuals. This requires LaTeX 2005/12/01 or later.

• configure looks for a POSIX compliant tr, Solaris's /usr/ucb/tr having been found to cause Rdiff to malfunction.

• configure is now generated with autoconf-2.65, which works better on recent systems and on Mac OS X.


Package installation changes

• Characters in R source which are not translatable to the current locale are now handled more tolerantly: these will be converted to hex codes with a warning. Such characters are only really portable if they appear in comments.

• R CMD INSTALL now tests that the installed package can be loaded (and backs out the installation if it cannot): this can be suppressed by --no-test-load. This avoids installing/updating a package that cannot be used: common causes of failures to load are missing/incompatible external software and missing/broken dependent packages.

• Package installation on Windows for a package with a src directory now checks if a DLL is created unless there is a src/Makefile.win file: this helps catch broken installations where the toolchain has not reported problems in building the DLL. (Note: this can be any DLL, not just one named <pkg-name>.dll.)

Bug fixes

• Using with(), eval() etc with a list with some unnamed elements now works. (PR#14035)

• The "quick" dispatch of S4 methods for primitive functions was not happening, forcing a search each time. (Dispatch for closures was not affected.) A side effect is that default values for arguments in a method that do not have defaults in the generic will now be ignored.

• Trying to dispatch S4 methods for primitives during the search for inherited methods slows that search down and potentially could cause an infinite recursion. An internal switch was added to turn off all such methods from findInheritedMethods().

• R framework installation (on Mac OS X) would not work properly if a rogue Resources directory was present at the top level. Such a non-symlink will now be renamed to Resources.old (and anything previously named Resources.old removed) as part of the framework installation process.

• The checks for conforming S4 method arguments could fail when the signature of the generic function omitted some of the formal arguments (in addition to ...). Arguments omitted from the method definition but conforming (per the documentation) should now be ignored (treated as "ANY") in dispatching.

• The computations for S4 method evaluation when ... was in the signature could fail, treating ... as an ordinary symbol. This has been fixed, for the known cases.

• Various ar() fitting methods have more protection for singular fits.

• callNextMethod() now works again with the drop= argument in '['.

• parse() and parse_Rd() miscounted columns when multibyte UTF-8 characters were present.

• Formatting of help pages has had minor improvements: extra blank lines have been removed from the text format, and empty package labels removed from HTML.

• cor(A, B) where A is n × 1 and B a 1-dimensional array segfaulted or gave an internal error. (The case cor(B, A) was PR#7116.)

• cut.POSIXt() applied to a start value after the DST transition on a DST-change day could give the wrong time for breaks in units of days or longer. (PR#14208)

• do_par() UNPROTECTed too early. (PR#14214)

• Subassignment x[[....]] <- y didn't check for a zero-length right hand side, and inserted a rubbish value. (PR#14217)

• fisher.test() no longer gives a p-value very slightly > 1, in some borderline cases.

• Internal function matchArgs() no longer modifies the general purpose bits of the SEXPs that make up the formals list of R functions. This fixes an invalid error message that would occur when a garbage collection triggered a second call to matchArgs for the same function via a finalizer.

• gsub() in 2.10.x could fail from stack overflow for extremely long strings due to temporary data being allocated on the stack. Also, gsub() with fixed=TRUE is in some circumstances considerably faster.

• Several primitives, including attributes(), attr<-(), interactive(), nargs() and proc.time(), did not check that they were called with the correct number of arguments.

• A potential race condition in list.files() when other processes are operating on the directory has been fixed; the code now dynamically allocates memory for file listings in a single pass instead of making an initial count pass.

• mean(x, trim=, na.rm = FALSE) failed to return NA if x contained missing values. (Reported by Bill Dunlap.)


• Extreme tail behavior of pbeta(), and hence pf(), e.g., pbeta(x, 3, 2200, lower.tail=FALSE, log.p=TRUE), now returns finite values instead of jumping to -Inf too early. (PR#14230)

• parse(text=x) misbehaved for objects x that were not coerced internally to character, notably symbols. (Reported to R-devel by Bill Dunlap.)

• The internal C function coerceSymbol now handles coercion to character, and warns if coercion fails (rather than silently returning NULL). This allows a name to be given where a character vector is required in functions which coerce internally.

• The interpretation by strptime() of that it is ever advisable to use locale- and system-specific input formats).

• capabilities("X11") now works the same way on Mac OS X as on other platforms (and as documented: it was always true for R built with --with-aqua, as the CRAN builds are).

• The X11() device with cairo but not Pango (notably Mac OS X) now checks validity of text strings in UTF-8 locales (since Pango does but cairo, it seems, does not).

• read.fwf() misread multi-line records when n was specified. (PR#14241)

• all.equal(*, tolerance = e) passes the numeric tolerance also to the comparison of the attributes.

• pgamma(0, 0), a boundary case, now returns 0, its limit from the left, rather than the limit from the right.

• Issuing POST requests to the internal web server could stall the request under certain circumstances.

• gzcon( <textConnection> ), an error, no longer damages the connection (in a way that would have it segfault). (PR#14237)

• All the results from hist() now use the nominal breaks, not those adjusted by the numeric fuzz: in recent versions the nominal breaks were reported but the density referred to the intervals used in the calculation, which mattered very slightly for one of the extreme bins. (Based on a report by Martin Becker.)

• If xy[z].coords() (used internally by many graphics functions) are given a list as x, they now check that the list has suitable names and give a more informative error message. (PR#13936)

R 2.10.1 patched changes

New features

• The handling of line textures in the postscript() and pdf() devices was set up for round end caps (the only type which existed at the time): it has now been adjusted for butt endcaps.

• lchoose(a, k) is now defined as log(abs(choose(a, k))), analogously to lfactorial().

• Although \eqn{} in Rd files is defined as a verbatim macro, many packages expected \dots and \ldots to be interpreted there (as was the case in R < 2.10.0), so this is now done (using an ellipsis in HTML rendering).

• Escaping of braces in quoted strings in R-code sections of Rd files is allowed again. This had been changed for the new Rd format in R 2.10.0 but was only documented on the developer site and was handled inconsistently by the converters: text and example conversion removed the escapes but HTML conversion did not.

• The PCRE library has been updated to version 8.01, a bug-fix release.

• tools::readNEWS() now accepts a digit as the first character of a news section.

Bug fixes

• Using read.table(header=TRUE) on a header with an embedded new line would copy part of the header into the data. (PR#14103)

• qpois(p = 1, lambda = 0) now gives 0, as for all other p. (PR#14135)

• Functions related to string comparison (e.g. unique(), match()) could cause crashes when used with strings not in the native encoding, e.g. UTF-8 strings on Windows. (PR#14114 and PR#14125)

• x[ , drop=TRUE] dropped an NA level even if it was in use.

• The dynamic HTML help system reported the wrong MIME type for the style sheet.

• tools::codoc() (used by R CMD check) was missing cases where the function had no arguments but was documented to have some.

• Help links containing special characters (e.g. "?") were not generated correctly when rendered in HTML. (PR#14155)


• lchoose(a, k) no longer wrongly gives NaN for negative a.

• ks.test() could give a p-value that was off by one observation due to rounding error. (PR#14145)

• readBin()/readChar() when reading millions of character strings in a single call used excessive amounts of memory (which also slowed them down).

• R CMD SHLIB could fail if used with paths that were not alphanumeric, e.g. contained +. (PR#14168)

• sprintf() was not re-entrant, which potentially caused problems if an as.character() method called it.

• The quartz() device did not restore the clipping region when filling the background for a new page. This could be observed in multi-page bitmap output as stale outer regions of the plot.

• p.adjust(*, method, n) now works correctly for the rare case n > length(p), also when method differs from "bonferroni" or "none", thanks to a patch from Gordon Smyth.

• tools::showNonASCII() failed to detect non-ASCII characters if iconv() (incorrectly) converted them to different ASCII characters. (Seen on Windows only.)

• tcrossprod() wrongly failed in some cases when one of the arguments was a vector and the other a matrix.

• [cr]bind(..., deparse.level=2) was not always giving names when documented to do so. (Discovered whilst investigating PR#14189.)

• match(incomparables=<non-NULL>) could in rare cases infinite-loop.

• poisson.test() needed to pass conf.level to binom.test(). (PR#14195)

• The "nls" method for df.residual() gave incorrect results for models fitted with na.action = na.exclude. (PR#14194)

• A change to options(scipen=) was only implemented when printing next occurred, even though it should have affected intervening calls to axis(), contour() and filled.contour().

• prettyNum(drop0trailing=TRUE) did not handle signs of imaginary parts of complex numbers correctly (and this was used by str(): PR#14201).

• system.time() had the sys.child component wrong (copied user.child instead) on systems with HAVE_GETRUSAGE. (PR#14210)

• Changing both line texture and line cap (end) resulted in the latter being omitted from the PDF code. In addition, line cap (end) and join are now set explicitly in PDF output to ensure correct defaults.

• The suppression of auto-rotation in bitmap() and dev2bitmap() with the pdfwrite device was not working correctly.

• plot(ecdf(), log="x") no longer gives an incorrect warning.

• read.fwf() works again when file is a connection.

• Startup files will now be found if their paths exceed 255 bytes. (PR#14228)

• contrasts<- (in the stats package) no longer has an undeclared dependence on methods (introduced in 2.10.0).

R 2.10.1 changes since previous Journal Issue

(Most of the changes to what was then "2.10.0 Patched" were described in Vol 1/2.)

Bug fixes

• switch(EXPR = "A") now returns NULL, as does switch(1); it used to signal an error.


Changes on CRAN
2009-12-29 to 2010-06-03 for CRAN task views
2009-10-23 to 2010-06-03 for CRAN packages

by Kurt Hornik and Achim Zeileis

New CRAN task views

Phylogenetics Phylogenetics, especially comparative methods.
Packages: MCMCglmm, PHYLOGR, PhySim, TreeSim, ade4, adephylo, apTreeshape, ape∗, geiger, laser, maticce, ouch, paleoTS, phangorn, phybase, phyclust, phylobase, phyloclim, picante, scaleboot, vegan.
Maintainer: Brian O'Meara.

(* = core package)

New packages in CRAN task views

Bayesian GLMMarp, zic.

Cluster apcluster.

Econometrics BayesPanel, VGAM, effects, fxregime, gam, gamlss, mgcv, micEconCES, micEconSNQP, truncreg.

Environmetrics FD.

ExperimentalDesign DiceDesign, DiceKriging, FrF2.catlg128, dae.

Finance SV.

HighPerformanceComputing Rdsm, batch, biglars, doMPI, doRedis.

MedicalImaging PTAk, cudaBayesreg, dcemriS4∗, oro.dicom∗, oro.nifti∗.

Psychometrics BradleyTerry2, CMC, caGUI, lavaan∗, lordif, pscl.

Spatial Guerry, cshapes, rworldmap.

Survival AIM, Cprob, DTDA, TraMineR, flexCrossHaz, ipw, mspath.

TimeSeries rtv.

gR parcor.

(* = core package)

New contributed packages

AIM AIM: adaptive index model. Authors: L. Tian and R. Tibshirani. In view: Survival.

B2Z Bayesian Two-Zone Model. Authors: João Vitor Dias Monteiro, Sudipto Banerjee and Gurumurthy Ramachandran.

BBMM This package fits a Brownian bridge movement model to observed locations in space and time. Authors: Ryan Nielson, Hall Sawyer and Trent McDonald.

BLR Bayesian Linear Regression. Authors: Gustavo de los Campos and Paulino Pérez Rodríguez.

BTSPAS Bayesian Time-Stratified Population Analysis. Authors: Carl J Schwarz and Simon J Bonner.

BayesPanel Bayesian methods for panel data modeling and inference. Authors: Chunhua Wu and Siddhartha Chib. In view: Econometrics.

BayesQTLBIC Bayesian multi-locus QTL analysis based on the BIC criterion. Author: Rod Ball.

Bergm Bayesian inference for exponential random graph models. Authors: Alberto Caimo and Nial Friel.

BioPhysConnectoR Connecting sequence information and biophysical models. Authors: Franziska Hoffgaard, with contributions from Philipp Weil and Kay Hamacher.

BioStatR Companion to the book "Exercices et problèmes de statistique avec le logiciel R". Authors: Frederic Bertrand and Myriam Maumy-Bertrand.

Bmix Bayesian sampling for stick-breaking mixtures. Author: Matt Taddy. In view: Cluster.

BoSSA A bunch of structure and sequence analysis. Author: Pierre Lefeuvre.

Bolstad2 Bolstad functions. Author: James Curran.

BoolNet Generation, reconstruction, simulation and analysis of synchronous, asynchronous, and probabilistic Boolean networks. Authors: Christoph Müssel, Martin Hopfensitz, Dao Zhou and Hans Kestler.

Boruta A tool for finding significant attributes in information systems. Authors: Algorithm by Witold R. Rudnicki, R implementation by Miron B. Kursa.


BradleyTerry2 Bradley-Terry models. Authors: Heather Turner and David Firth. In view: Psychometrics.

CAVIAR Cambial Activity and wood formation data processing, VIsualisation and Analysis using R. Author: Cyrille Rathgeber.

CCMtools Clustering through “Correlation Clustering Model” (CCM) and cluster analysis tools. Author: Mathieu Vrac.

CCP Significance tests for Canonical Correlation Analysis (CCA). Author: Uwe Menzel.

CMC Cronbach-Mesbah Curve. Authors: Michela Cameletti and Valeria Caviezel. In view: Psychometrics.

COMPoissonReg Conway-Maxwell Poisson (COM-Poisson) Regression. Author: Kimberly Sellers.

CausalGAM Estimation of causal effects with Generalized Additive Models. Authors: Adam Glynn and Kevin Quinn.

CoCoCg Graphical modelling by CG regressions. Author: Jens Henrik Badsberg.

CoCoGraph Interactive and dynamic graphs for the CoCo objects. Author: Jens Henrik Badsberg.

CollocInfer Collocation inference for dynamic systems. Authors: Giles Hooker, Luo Xiao and James Ramsay.

CorrBin Nonparametrics with clustered binary data. Author: Aniko Szabo.

Cprob Conditional probability function of a competing event. Author: Arthur Allignol. In view: Survival.

DAMisc Dave Armstrong’s miscellaneous functions. Author: Dave Armstrong.

DEMEtics Calculation of Gst and D values characterizing the genetic differentiation between populations. Authors: Alexander Jueterbock, Philipp Kraemer, Gabriele Gerlach and Jana Deppermann.

DOSim Computation of functional similarities between DO terms and gene products; DO enrichment analysis; Disease Ontology annotation. Author: Jiang Li.

DatABEL File-based access to large matrices stored on HDD in binary format. Authors: Yurii Aulchenko and Stepan Yakovenko.

DeducerPlugInExample Deducer plug-in example. Author: Ian Fellows.

DesignPatterns Design patterns in R to build reusable object-oriented software. Authors: Jitao David Zhang, supported by Stefan Wiemann and Wolfgang Huber.

DiceDesign Designs of computer experiments. Authors: Jessica Franco, Delphine Dupuy and Olivier Roustant. In view: ExperimentalDesign.

DiceEval Construction and evaluation of metamodels. Authors: D. Dupuy and C. Helbert.

DiceKriging Kriging methods for computer experiments. Authors: O. Roustant, D. Ginsbourger and Y. Deville. In view: ExperimentalDesign.

DiceOptim Kriging-based optimization for computer experiments. Authors: D. Ginsbourger and O. Roustant.

DistributionUtils Distribution utilities. Author: David Scott.

DoseFinding Planning and analyzing dose finding experiments. Authors: Björn Bornkamp, José Pinheiro and Frank Bretz.

EMT Exact Multinomial Test: Goodness-of-fit test for discrete multivariate data. Author: Uwe Menzel.

FAMT Factor Analysis for Multiple Testing (FAMT): Simultaneous tests under dependence in high-dimensional data. Authors: David Causeur, Chloé Friguet, Magalie Houée-Bigot and Maela Kloareg.

FNN Fast Nearest Neighbor search algorithms and applications. Author: Shengqiao Li.
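As an illustrative sketch of a nearest-neighbour query (assuming the package’s get.knn() interface; the random data are made up):

library(FNN)
x <- matrix(rnorm(200), ncol = 2)  # 100 random points in the plane
nn <- get.knn(x, k = 3)            # 3 nearest neighbours of each point
str(nn$nn.index)                   # matrix of neighbour indices
str(nn$nn.dist)                    # matrix of neighbour distances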

FrF2.catlg128 Complete catalogues of resolution IV 128 run 2-level fractional factorials up to 24 factors. Author: Ulrike Grömping. In view: ExperimentalDesign.

FunctSNP SNP annotation data methods and species specific database builder. Authors: S. J. Goodswen with contributions from N. S. Watson-Haigh, H. N. Kadarmideen and C. Gondro.

GAMens Applies GAMbag, GAMrsm and GAMens ensemble classifiers for binary classification. Authors: Koen W. De Bock, Kristof Coussement and Dirk Van den Poel.

GEVcdn GEV conditional density estimation network. Author: Alex J. Cannon.

GGally Extension to ggplot2. Authors: Barret Schloerke, Di Cook, Heike Hofmann and Hadley Wickham.


GWRM A package for fitting Generalized Waring Regression Models. Authors: A.J. Saez-Castillo, J. Rodríguez-Avi, A. Conde-Sánchez, M.J. Olmo-Jiménez and A.M. Martinez-Rodriguez.

GeneralizedHyperbolic The generalized hyperbolic distribution. Author: David Scott.

GrassmannOptim Grassmann Manifold Optimization. Authors: Kofi Placid Adragni and Seongho Wu.

Guerry Guerry: maps, data and methods related to Guerry (1833) “Moral Statistics of France”. Authors: Michael Friendly and Stephane Dray. In view: Spatial.

HDMD Statistical analysis tools for High Dimension Molecular Data (HDMD). Author: Lisa McFerrin.

HGLMMM Hierarchical Generalized Linear Models. Author: Marek Molas.

HMM Hidden Markov Models. Authors: Scientific Software Development — Dr. Lin Himmelmann.

HistData Data sets from the history of statistics and data visualization. Authors: Michael Friendly, with contributions by Stephane Dray and Hadley Wickham.

IFP Identifying functional polymorphisms in genetic association studies. Author: Leeyoung Park.

Imap Interactive Mapping. Author: John R. Wallace.

JJcorr Calculates polychoric correlations for several copula families. Authors: Jeroen Ooms and Joakim Ekström.

JOP Joint Optimization Plot. Authors: Sonja Kuhnt and Nikolaus Rudak.

LDdiag Link function and distribution diagnostic test for social science researchers. Author: Yongmei Ni.

LS2W Locally stationary two-dimensional wavelet process estimation scheme. Authors: Idris Eckley and Guy Nason.

LiblineaR Linear predictive models based on the Liblinear C/C++ Library. Author: Thibault Helleputte.

LogitNet Infer network based on binary arrays using regularized logistic regression. Authors: Pei Wang, Dennis Chao and Li Hsu.

MADAM This package provides some basic methods for meta-analysis. Author: Karl Kugler.

MAMA Meta-Analysis of MicroArray. Author: Ivana Ihnatova.

MAc Meta-analysis with correlations. Authors: AC Del Re and William T. Hoyt.

MAd Meta-analysis with mean differences. Authors: AC Del Re and William T. Hoyt.

MFDF Modeling Functional Data in Finance. Author: Wei Dou.

MImix Mixture summary method for multiple imputation. Authors: Russell Steele, Naisyin Wang and Adrian Raftery.

MNM Multivariate Nonparametric Methods. An approach based on spatial signs and ranks. Authors: Klaus Nordhausen, Jyrki Mottonen and Hannu Oja.

MTSKNN Multivariate two-sample tests based on K-Nearest-Neighbors. Authors: Lisha Chen, Peng Dai and Wei Dou.

MVpower Give power for a given effect size using multivariate classification methods. Authors: Raji Balasubramanian and Yu Guo.

MaXact Exact max-type Cochran-Armitage trend test (CATT). Authors: Jianan Tian and Chenliang Xu.

ModelGood Validation of prediction models. Author: Thomas A. Gerds.

MortalitySmooth Smoothing Poisson counts with P-splines. Author: Carlo G. Camarda.

MplusAutomation Automating Mplus model estimation and interpretation. Author: Michael Hallquist.

MuMIn Multi-model inference. Author: Kamil Barton.

NMF Algorithms and framework for Nonnegative Matrix Factorization (NMF). Author: Renaud Gaujoux.

NMFN Non-Negative Matrix Factorization. Authors: Suhai (Timothy) Liu based on multiplicative updates (Lee and Sung 2001) and alternating least squares algorithm; Lars Kai Hansen’s nnmf_als Matlab implementation; Torsten Hothorn’s Moore-Penrose inverse function.

OligoSpecificitySystem Oligo Specificity System. Authors: Rory Michelland and Laurent Cauquil.

PBSadmb PBS ADMB. Authors: Jon T. Schnute and Rowan Haigh.

PKgraph Model diagnostics for pharmacokinetic models. Author: Xiaoyong Sun.


PKmodelFinder Software for pharmacokinetic modeling. Authors: Eun-Kyung Lee, Gyujeong Noh and Hyeong-Seok Lim.

PermuteNGS Significance testing of transcriptome profiling for RNA-sequencing data. Authors: Taeheon Lee, Bo-Young Lee, Hee-Seok Oh and Heebal Kim.

PowerTOST Power and sample size based on two one-sided t-tests (TOST) for bioequivalence (BE) studies. Author: Detlew Labes.

ProDenICA Product Density Estimation for ICA using tilted Gaussian density estimates. Authors: Trevor Hastie and Rob Tibshirani.

PropCIs Computes confidence intervals for proportions, differences in proportions, an odds ratio and the relative risk in a 2 × 2 table. Author: Ralph Scherer.

QT QT Knowledge Management System. Author: Christoffer W. Tornoe.

QTLNetworkR Interactive software package for QTL visualization. Author: Zheng Wenjun.

R2Cuba Multidimensional numerical integration. Authors: The Cuba library was written by Thomas Hahn; the interface to R was written by Annie Bouvier and Kiên Kiêu.

R2PPT Simple R Interface to Microsoft PowerPoint using rcom. Author: Wayne Jones.

RANN Fast Nearest Neighbour Search. Authors: Samuel E. Kemp and Gregory Jefferis.

RFLPtools Tools to analyse RFLP data. Authors: Fabienne Flessa, Alexandra Kehl and Matthias Kohl.

RFinanceYJ Japanese stock market data from Yahoo!-finance-Japan. Authors: Yohei Sato and Nobuaki Oshiro.

RGtk2DfEdit Improved data frame editor for RGtk2. Authors: Tom Taverner, with contributions from John Verzani.

RH2 DBI/RJDBC interface to H2 Database. Author: G. Grothendieck. Author of H2 is Thomas Mueller.

RODM R interface to Oracle Data Mining. Authors: Pablo Tamayo and Ari Mozes.

ROracleUI Convenient Tools for Working with Oracle Databases. Author: Arni Magnusson.

RPPanalyzer Reads, annotates, and normalizes reverse phase protein array data. Author: Heiko Mannsperger with contributions from Stephan Gade.

RProtoBuf R interface to the Protocol Buffers API. Authors: Romain François and Dirk Eddelbuettel.

RSQLite.extfuns Math and String Extension Functions for RSQLite. Author: Seth Falcon.

Rcgmin Conjugate gradient minimization of nonlinear functions with box constraints. Author: John C. Nash.

RcmdrPlugin.MAc Meta-analysis with correlations (MAc) Rcmdr Plug-in. Author: AC Del Re.

RcmdrPlugin.MAd Meta-analysis with mean differences (MAd) Rcmdr Plug-in. Author: AC Del Re.

RcmdrPlugin.PT Some discrete exponential dispersion models: Poisson-Tweedie. Authors: David Pechel Cactcha, Laure Pauline Fotso and Célestin C Kokonendji.

RcmdrPlugin.SLC SLC Rcmdr plug-in. Authors: Antonio Solanas and Rumen Manolov.

RcmdrPlugin.doex Rcmdr plugin for Stat 4309 course. Author: Erin Hodgess.

RcmdrPlugin.sos Efficiently search R Help pages. Author: Liviu Andronic.

RcmdrPlugin.steepness Steepness Rcmdr plug-in. Authors: David Leiva and Han de Vries.

RcppArmadillo Rcpp/Armadillo bridge. Authors: Romain François, Dirk Eddelbuettel and Doug Bates.

RcppExamples Examples using Rcpp to interface R and C++. Authors: Dirk Eddelbuettel and Romain François, based on code written during 2005 and 2006 by Dominick Samperi.

Rdsm Threads-Like Environment for R. Author: Norm Matloff. In view: HighPerformanceComputing.

RecordLinkage Record linkage in R. Authors: Andreas Borg and Murat Sariyar.

Rhh Calculating multilocus heterozygosity and heterozygosity-heterozygosity correlation. Authors: Jussi Alho and Kaisa Välimäki.

RobLoxBioC Infinitesimally robust estimators for preprocessing omics data. Author: Matthias Kohl.

RpgSQL DBI/RJDBC interface to PostgreSQL database. Author: G. Grothendieck.

Rsolnp General non-linear optimization. Authors: Alexios Ghalanos and Stefan Theussl.

Rvmmin Variable metric nonlinear function minimization with bounds constraints. Author: John C. Nash.


SDMTools Species Distribution Modelling Tools: Tools for processing data associated with species distribution modelling exercises. Authors: Jeremy VanDerWal, Luke Shoo and Stephanie Januchowski. This package was made possible in part by support from Lorena Falconi and Collin Storlie, and financial support from the Australian Research Council & ARC Research Network for Earth System Science.

SDisc Integrated methodology for the identification of homogeneous profiles in data distribution. Author: Fabrice Colas.

SHIP SHrinkage covariance Incorporating Prior knowledge. Authors: Monika Jelizarow and Vincent Guillemot.

SLC Slope and level change. Authors: Antonio Solanas, Rumen Manolov and Patrick Onghena.

SOAR Memory management in R by delayed assignments. Authors: Bill Venables, based on original code by David Brahm.

SPACECAP A program to estimate animal abundance and density using Spatially-Explicit Capture-Recapture. Authors: Pallavi Singh, Arjun M. Gopalaswamy, Andrew J. Royle, N. Samba Kumar and K. Ullas Karanth with contributions from Sumanta Mukherjee, Vatsaraj and Dipti Bharadwaj.

SQN Subset quantile normalization. Author: Zhijin (Jean) Wu.

SV Indirect inference in non-Gaussian stochastic volatility models. Author: Øivind Skare. In view: Finance.

SampleSizeMeans Sample size calculations for normal means. Authors: Lawrence Joseph and Patrick Belisle.

SampleSizeProportions Calculating sample size requirements when estimating the difference between two binomial proportions. Authors: Lawrence Joseph, Roxane du Berger and Patrick Belisle.

SkewHyperbolic The Skew Hyperbolic Student t-distribution. Authors: David Scott and Fiona Grimson.

Sleuth2 Data sets from Ramsey and Schafer’s “Statistical Sleuth (2nd ed)”. Authors: Original by F.L. Ramsey and D.W. Schafer, modifications by Daniel W Schafer, Jeannie Sifneos and Berwin A Turlach.

SpatialEpi Performs various spatial epidemiological analyses. Author: Albert Y. Kim.

StMoSim Simulate QQ-Norm and TA-Plot. Author: Matthias Salvisberg.

TANOVA Time Course Analysis of Variance for Microarray. Authors: Baiyu Zhou and Weihong Xu.

TGUICore Teaching GUI — Core functionality. Authors: Matthias Templ, Gerlinde Dinges, Alexander Kowarik and Bernhard Meindl.

TGUITeaching Teaching GUI — prototype. Authors: Matthias Templ, Gerlinde Dinges, Alexander Kowarik and Bernhard Meindl.

ToxLim Incorporating ecological data and associated uncertainty in bioaccumulation modeling. Authors: Frederik De Laender and Karline Soetaert.

TreeRank Implementation of the TreeRank methodology for building tree-based ranking rules for bipartite ranking through ROC curve optimization. Author: Nicolas Baskiotis.

TreeSim Simulating trees under the birth-death model. Author: Tanja Stadler. In view: Phylogenetics.

UScensus2000 US Census 2000 suite of R Packages. Author: Zack W. Almquist. In view: Spatial.

UScensus2000add US Census 2000 suite of R Packages: Demographic add. Author: Zack W. Almquist.

UScensus2000blkgrp US Census 2000 block group shapefiles and additional demographic data. Author: Zack W. Almquist.

UScensus2000cdp US Census 2000 designated places shapefiles and additional demographic data. Author: Zack W. Almquist.

UScensus2000tract US Census 2000 tract level shapefiles and additional demographic data. Author: Zack W. Almquist.

Unicode Unicode data and utilities. Author: Kurt Hornik.

VHDClassification Discrimination/Classification in very high dimension with linear and quadratic rules. Author: Robin Girard.

VPdtw Variable Penalty Dynamic Time Warping. Authors: David Clifford and Glenn Stone.

VhayuR Vhayu R interface. Author: The Brookhaven Group. In view: TimeSeries.

VizCompX Visualisation of Computer Models. Author: Neil Diamond.


WMBrukerParser William and Mary Parser for Bruker-Ultraflex mass spectrometry data. Authors: William Cooke, Maureen Tracy and Dariya Malyarenko.

WaveCD Wavelet change point detection for array CGH data. Author: M. Shahidul Islam.

adephylo Exploratory analyses for the phylogenetic comparative method. Authors: Thibaut Jombart and Stephane Dray. In view: Phylogenetics.

agilp Extracting and preprocessing Agilent express arrays. Author: Benny Chain.

amba Additive Models for Business Applications. Author: Charlotte Maia.

anesrake ANES Raking Implementation. Author: Josh Pasek.

apcluster Affinity Propagation Clustering. Authors: Ulrich Bodenhofer and Andreas Kothmeier. In view: Cluster.

aqp Algorithms for Quantitative Pedology. Author: Dylan Beaudette.

aroma.cn Copy-number analysis of large microarray data sets. Authors: Henrik Bengtsson and Pierre Neuvial.

aroma.light Light-weight methods for normalization and visualization of microarray data using only basic R data types. Author: Henrik Bengtsson.

asbio A collection of statistical tools for biologists. Authors: Ken Aho; many thanks to V. Winston and D. Roberts.

asd Simulations for adaptive seamless designs. Author: Nick Parsons.

base64 Base 64 encoder/decoder. Authors: Romain François, based on code by Bob Trower available at http://base64.sourceforge.net/.

batch Batching routines in parallel and passing command-line arguments to R. Author: Thomas Hoffmann. In view: HighPerformanceComputing.

bfast BFAST: Breaks For Additive Seasonal and Trend. Authors: Jan Verbesselt, with contributions from Rob Hyndman.

bibtex BibTeX parser. Author: Romain François.

biganalytics A library of utilities for "big.matrix" objects of package bigmemory. Authors: John W. Emerson and Michael J. Kane.

biglars Scalable Least-Angle Regression and Lasso. Authors: Mark Seligman, Chris Fraley and Tim Hesterberg. In view: HighPerformanceComputing.

bigtabulate table-, tapply-, and split-like functionality for "matrix" and "big.matrix" objects. Authors: Michael J. Kane and John W. Emerson.

binhf Haar-Fisz functions for binomial data. Author: Matt Nunes.

boolfun Cryptographic Boolean functions. Author: F. Lafitte.

bootruin A bootstrap test for the probability of ruin in the classical risk process. Authors: Benjamin Baumgartner and Riccardo Gatto.

catnet Categorical Bayesian network inference. Authors: Nikolay Balov and Peter Salzman.

clusterCons Calculate the consensus clustering result from re-sampled clustering experiments with the option of using multiple algorithms and parameters. Author: Dr. T. Ian Simpson.

cmaes Covariance Matrix Adapting Evolutionary Strategy. Authors: Heike Trautmann, Olaf Mersmann and David Arnu.

coarseDataTools A collection of functions to help with analysis of coarse infectious disease data. Author: Nicholas G. Reich.

codep Multiscale codependence analysis. Authors: Guillaume Guenard, with contributions from Bertrand Pages.

coenoflex Gradient-Based Coenospace Vegetation Simulator. Author: David W. Roberts.

compute.es Compute Effect Sizes. Author: AC Del Re.

corrplot Visualization of a correlation matrix. Author: Taiyun Wei.
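A minimal usage sketch (corrplot() expects a correlation matrix; the data set is just an example):

library(corrplot)
M <- cor(mtcars)               # correlation matrix of a standard data set
corrplot(M, method = "circle") # draw it with the default circle glyphs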

crmn CCMN and other noRMalizatioN methods for metabolomics data. Author: Henning Redestig.

cthresh Complex wavelet denoising software. Author: Stuart Barber.

cubature Adaptive multivariate integration over hypercubes. Authors: C code by Steven G. Johnson, R by Balasubramanian Narasimhan.
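For instance, a small sketch integrating a Gaussian over the unit square (assuming the adaptIntegrate() interface):

library(cubature)
f <- function(x) exp(-sum(x^2))  # integrand evaluated at a point c(x1, x2)
adaptIntegrate(f, lowerLimit = c(0, 0), upperLimit = c(1, 1))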

cusp Cusp Catastrophe Model Fitting Using Maximum Likelihood. Author: Raoul P. P. P. Grasman.


dae Functions useful in the design and ANOVA of experiments. Author: Chris Brien. In view: ExperimentalDesign.

dagR R functions for directed acyclic graphs. Author: Lutz P Breitling.

datamap A system for mapping foreign objects to R variables and environments. Author: Jeffrey Horner.

dcemriS4 A Package for Medical Image Analysis (S4 implementation). Authors: Brandon Whitcher and Volker Schmid, with contributions from Andrew Thornton. In view: MedicalImaging.

dclone Data Cloning and MCMC Tools for Maximum Likelihood Methods. Author: Peter Solymos.

decon Deconvolution Estimation in Measurement Error Models. Authors: Xiao-Feng Wang and Bin Wang.

degenes Detection of differentially expressed genes. Author: Klaus Jung.

digitize A plot digitizer in R. Author: Timothee Poisot.

distrEllipse S4 classes for elliptically contoured distributions. Author: Peter Ruckdeschel.

doMPI Foreach parallel adaptor for the Rmpi package. Author: Steve Weston. In view: HighPerformanceComputing.

doRedis Foreach parallel adapter for the rredis package. Author: B. W. Lewis. In view: HighPerformanceComputing.

dpmixsim Dirichlet Process Mixture model simulation for clustering and image segmentation. Author: Adelino Ferreira da Silva.

dvfBm Discrete variations of a fractional Brownian motion. Author: Jean-Francois Coeurjolly.

dynaTree Dynamic trees for learning and design. Authors: Robert B. Gramacy and Matt A. Taddy.

ebdbNet Empirical Bayes Estimation of Dynamic Bayesian Networks. Author: Andrea Rau.

edrGraphicalTools Provide tools for dimension reduction methods. Authors: Benoit Liquet and Jerome Saracco.

edtdbg Integrating R’s debug() with Your Text Editor. Author: Norm Matloff.

egonet Tool for ego-centric measures in Social Network Analysis. Authors: A. Sciandra, F. Gioachin and L. Finos.

emoa Evolutionary Multiobjective Optimization Algorithms. Author: Olaf Mersmann.

envelope Envelope normality test. Author: Felipe Acosta.

epinet A collection of epidemic/network-related tools. Authors: Chris Groendyke, David Welch and David R. Hunter.

equate Statistical methods for test score equating. Author: Anthony Albano.

esd4all Functions for post-processing and gridding empirical-statistical downscaled climate scenarios. Author: Rasmus E. Benestad.

exactci Exact p-values and matching confidence intervals for simple discrete parametric cases. Author: M. P. Fay.

expectreg Expectile regression. Authors: Fabian Sobotka, Thomas Kneib, Sabine Schnabel and Paul Eilers.

extracat Graphic extensions for categorical data. Author: Alexander Pilhöfer.

extremevalues Univariate outlier detection. Author: Mark van der Loo.

fCertificates Basics of certificates and structured products valuation. Author: Stefan Wilhelm.

favir Formatted actuarial vignettes in R. Author: Benedict Escoto.

fdth Frequency Distribution Tables, Histograms and Polygons. Authors: Jose Claudio Faria and Enio Jelihovschi.

fftw Fast FFT and DCT based on FFTW. Authors: Sebastian Krey, Uwe Ligges and Olaf Mersmann.

fisheyeR Fisheye and hyperbolic-space-alike interactive visualization tools in R. Author: Eduardo San Miguel Martin.

flexCrossHaz Flexible crossing hazards in the Cox model. Authors: Vito M.R. Muggeo and Miriam Tagliavia. In view: Survival.

forensim Statistical tools for the interpretation of forensic DNA mixtures. Author: Hinda Haned.

formatR Format R code automatically. Author: Yihui Xie.

formula.tools Tools for working with formulas, expressions, calls and other R objects. Author: Christopher Brown.

fptdApprox Approximation of first-passage-time densities for diffusion processes. Authors: Patricia Román-Román, Juan J. Serrano-Pérez and Francisco Torres-Ruiz.


futile.any Futile library to provide some polymorphism. Author: Brian Lee Yung Rowe.

futile.logger Futile logger subsystem. Author: Brian Lee Yung Rowe.

futile.matrix Futile matrix utilities. Author: Brian Lee Yung Rowe.

futile.options Futile options management. Author: Brian Lee Yung Rowe.

gamlss.add GAMLSS additive. Authors: Mikis Stasinopoulos and Bob Rigby.

gamlss.demo Demos for GAMLSS. Authors: Mikis Stasinopoulos, Bob Rigby, Paul Eilers and Brian Marx with contributions from Larisa Kosidou.

gamlss.util GAMLSS utilities. Authors: Mikis Stasinopoulos, Bob Rigby, Paul Eilers.

gcolor Provides an inequation algorithm function and a solution to import DIMACS files. Authors: Jeffrey L. Duffany, Ph.D. and Euripides Rivera Negron.

genefu Relevant functions for gene expression analysis, especially in breast cancer. Authors: Benjamin Haibe-Kains, Gianluca Bontempi and Christos Sotiriou.

genoPlotR Plot publication-grade gene and genome maps. Author: Lionel Guy.

genomatic Manages microsatellite projects. Creates 96-well maps, genotyping submission forms, rerun management, and import into statistical software. Author: Brian J. Knaus.

geophys Geophysics, continuum mechanics, Mogi model. Author: Jonathan M. Lees.

geosphere Spherical trigonometry. Authors: Robert J. Hijmans, Ed Williams and Chris Vennes.
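A brief sketch (assuming distHaversine(), which takes longitude/latitude pairs in degrees and returns metres; the coordinates are approximate):

library(geosphere)
paris  <- c(2.35, 48.86)      # lon, lat
vienna <- c(16.37, 48.21)
distHaversine(paris, vienna)  # great-circle distance in metres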

grofit Fits many growth curves obtained under different conditions. Authors: Matthias Kahm and Maik Kschischo.

haarfisz Software to perform Haar Fisz transforms. Author: Piotr Fryzlewicz.

halp Show certain types of comments from function bodies as “live-help”. Author: Daniel Haase.

hda Heteroscedastic Discriminant Analysis. Author: Gero Szepannek.

heavy Estimation in the linear mixed model using heavy-tailed distributions. Author: Felipe Osorio.

hergm Hierarchical Exponential-family Random Graph Models. Author: Michael Schweinberger.

hgam High-dimensional Additive Modelling. Authors: The students of the “Advanced R Programming Course” Hannah Frick, Ivan Kondofersky, Oliver S. Kuehnle, Christian Lindenlaub, Georg Pfundstein, Matthias Speidel, Martin Spindler, Ariane Straub, Florian Wickler and Katharina Zink, under the supervision of Manuel Eugster and Torsten Hothorn.

hglm Hierarchical Generalized Linear Models. Authors: Moudud Alam, Lars Ronnegard, Xia Shen.

highlight Syntax highlighter. Author: Romain François.

histogram Construction of regular and irregular histograms with different options for automatic choice of bins. Authors: Thoralf Mildenberger, Yves Rozenholc and David Zasada.

hts Hierarchical time series. Authors: Rob J Hyndman, Roman A Ahmed, and Han Lin Shang.

hyperSpec Interface for hyperspectral data sets, i.e., spectra + meta information (spatial, time, concentration, ...). Author: Claudia Beleites.

iBUGS An interface to R2WinBUGS by gWidgets. Author: Yihui Xie.

iCluster Integrative clustering of multiple genomic data types. Author: Ronglai Shen.

ibr Iterative Bias Reduction. Authors: Pierre-Andre Cornillon, Nicolas Hengartner and Eric Matzner-Lober.

ipw Estimate inverse probability weights. Author: Willem M. van der Wal. In view: Survival.

isopam Isopam (hierarchical clustering). Author: Sebastian Schmidtlein.

itertools Iterator tools. Authors: Steve Weston and Hadley Wickham.

kinfit Routines for fitting kinetic models to chemical degradation data. Author: Johannes Ranke.

kml3d K-means for joint longitudinal data. Author: Christophe Genolini.

lavaan Latent Variable Analysis. Author: Yves Rosseel. In view: Psychometrics.

lcmm Estimation of latent class mixed models. Authors: Cecile Proust-Lima and Benoit Liquet.

leiv Bivariate linear errors-in-variables estimation. Author: David Leonard.


limitplot Jitter/Box Plot with Ordered Points Below the Limit of Detection. Author: Omar E. Olmedo.

lmPerm Permutation tests for linear models. Author: Bob Wheeler.

log10 Decimal log plotting in two and three dimensions. Author: Timothée Poisot.

logging A tentative logging package. Author: Mario Frasca.

lossDev Robust loss development using MCMC. Authors: Christopher W. Laws and Frank A. Schmid.

lqa Penalized likelihood inference for GLMs. Author: Jan Ulbricht.

magnets Simulate micro-magnet dynamics. Author: Hai Qian.

mbmdr Model Based Multifactor Dimensionality Reduction. Authors: Victor Urrea, Malu Calle, Kristel Van Steen and Nuria Malats.

mclogit Mixed Conditional Logit. Author: Martin Elff.

mecdf Multivariate ECDF-Based Models. Author: Charlotte Maia. In view: Distributions.

micEconAids Demand analysis with the Almost Ideal Demand System (AIDS). Author: Arne Henningsen. In view: Econometrics.

micEconCES Analysis with the Constant Elasticity of Substitution (CES) function. Authors: Arne Henningsen and Geraldine Henningsen. In view: Econometrics.

micEconSNQP Symmetric Normalized Quadratic Profit Function. Author: Arne Henningsen. In view: Econometrics.

migui Graphical User Interface of the mi Package. Authors: Daniel Lee and Yu-Sung Su.

miniGUI tktcl quick and simple function GUI. Author: Jorge Luis Ojeda Cabrera.

minqa Derivative-free optimization algorithms by quadratic approximation. Authors: Douglas Bates, Katharine M. Mullen, John C. Nash and Ravi Varadhan.
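A minimal sketch of a bound-constrained minimization (assuming the bobyqa() interface; the objective is a toy function):

library(minqa)
fn <- function(x) sum((x - c(1, 2))^2)          # minimum at (1, 2)
bobyqa(c(0, 0), fn, lower = c(-5, -5), upper = c(5, 5))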

miscTools Miscellaneous Tools and Utilities. Author: Arne Henningsen.

missMDA Handling missing values with/in multivariate data analysis (principal component methods). Authors: François Husson and Julie Josse.

mixOmics Integrate omics data project. Authors: Sébastien Dejean, Ignacio González and Kim-Anh Lê Cao.

mkin Routines for fitting kinetic models with one or more state variables to chemical degradation data. Author: Johannes Ranke.

mlogitBMA Bayesian Model Averaging for multinomial logit models. Authors: Hana Ševcíková and Adrian Raftery.

mmap Map pages of memory. Author: Jeffrey A. Ryan.

monmlp Monotone multi-layer perceptron neural network. Author: Alex J. Cannon.

mrdrc Model-robust concentration-response analysis. Authors: Christian Ritz and Mads Jeppe Tarp-Johansen.

mseq Modeling non-uniformity in short-read rates in RNA-Seq data. Author: Jun Li.

mspath Multi-state Path-dependent models in discrete time. Authors: Ross Boylan and Peter Bacchetti, building on the work of Christopher H. Jackson; Thorsten Ottosen for the ptr_container library; Gennadiy Rozental for the Boost Test Library; and the authors of other Boost libraries used by the previous two. In view: Survival.

mtsc Multivariate timeseries cluster-based models. Author: Charlotte Maia.

mugnet Mixture of Gaussian Bayesian network model. Author: Nikolay Balov.

multisensi Multivariate Sensitivity Analysis. Authors: Matieyendou Lamboni and Hervé Monod.

munfold Metric Unfolding. Author: Martin Elff.

mutatr Mutable objects for R. Author: Hadley Wickham.

mvabund Statistical methods for analysing multivariate abundance data. Authors: Ulrike Naumann, Yi Wang, Stephen Wright and David Warton.

mvngGrAd Software for moving grid adjustment in plant breeding field trials. Author: Frank Technow.

nanop Tools for nanoparticle simulation and PDF calculation. Author: Katharine Mullen.

ncdf4 Interface to Unidata netCDF (version 4 or earlier) format data files. Author: David Pierce.
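An illustrative sketch of reading one variable (the file name and variable name are hypothetical):

library(ncdf4)
nc  <- nc_open("ocean.nc")     # open an existing netCDF file
sst <- ncvar_get(nc, "sst")    # read a variable into an R array
nc_close(nc)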

ncvreg Regularization paths for SCAD- and MCP-penalized regression models. Author: Patrick Breheny.


neldermead R port of the Scilab neldermead module. Authors: Sebastien Bihorel and Michael Baudin (author of the original module).

nga NGA earthquake ground motion prediction equations. Authors: James Kaklamanos and Eric M. Thompson.

nlADG Regression in the Normal Linear ADG Model. Author: Lutz F. Gruber.

nnDiag k-Nearest Neighbor diagnostic tools. Authors: Brian Walters, Andrew O. Finley and Zhen Zhang.

nonparaeff Nonparametric methods for measuring efficiency and productivity. Author: Dong-hyun Oh.

nonrandom Stratification and matching by the propensity score. Author: Susanne Stampf.

npRmpi Parallel nonparametric kernel smoothing methods for mixed data types. Authors: Tristen Hayfield and Jeffrey S. Racine.

nppbib Nonparametric Partially-Balanced Incomplete Block Design Analysis. Authors: David Allingham and D.J. Best.

nutshell Data for “R in a Nutshell”. Author: Joseph Adler.

nytR Provides access to the NY Times API. Author: Shane Conway.

ofp Object-Functional Programming. Author: Charlotte Maia.

omd Filter the molecular descriptors for QSAR. Author: Bin Ma.

openintro Open Intro data sets and supplement functions. Authors: David M Diez and Christopher D Barr.

optimbase R port of the Scilab optimbase module. Authors: Sebastien Bihorel and Michael Baudin (author of the original module).

optimsimplex R port of the Scilab optimsimplex module. Authors: Sebastien Bihorel and Michael Baudin (author of the original module).

optparse Command line option parser. Authors: Trevor Davis. getopt function, some documentation and unit tests ported from Allen Day’s getopt package. Based on the optparse Python library by the Python Software Foundation.

optpart Optimal partitioning of similarity relations. Author: David W. Roberts.

ordinal Regression models for ordinal data. Author: Rune Haubo B Christensen.

oro.dicom Rigorous — DICOM Input / Output. Author: Brandon Whitcher. In view: MedicalImaging.

oro.nifti Rigorous — NIfTI Input / Output. Authors: Brandon Whitcher, Volker Schmid and Andrew Thornton. In view: MedicalImaging.

pGLS Generalized Least Square in comparative Phylogenetics. Authors: Xianyun Mao and Timothy Ryan.

pROC Display and analyze ROC curves. Authors: Xavier Robin, Natacha Turck, Alexandre Hainard, Natalia Tiberti, Frédérique Lisacek, Jean-Charles Sanchez and Markus Müller.
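A short sketch using the example data shipped with the package:

library(pROC)
data(aSAH)                          # clinical example data from pROC
r <- roc(aSAH$outcome, aSAH$s100b)  # build the ROC curve
auc(r)                              # area under the curve
plot(r)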

paleoMAS Paleoecological Analysis. Authors: Alexander Correa-Metrio, Dunia Urrego, Kenneth Cabrera and Mark Bush.

pamctdp Principal Axes Methods for Contingency Tables with Partition Structures on Rows and Columns. Author: Campo Elías Pardo.

partitionMetric Compute a distance metric between two partitions of a set. Authors: David Weisman and Dan Simovici.

parviol Parviol combines parallel coordinates and violin plot. Author: Jaroslav Myslivec.

pedantics Functions to facilitate power and sensitivity analyses for genetic studies of natural populations. Author: Michael Morrissey.

pgfSweave Quality speedy graphics compilation with Sweave. Authors: Cameron Bracken and Charlie Sharpsteen.

phyclust Phylogenetic Clustering (Phyloclustering). Author: Wei-Chen Chen. In view: Phylogenetics.

phylobase Base package for phylogenetic structures and comparative data. Authors: R Hackathon et al. (alphabetically: Ben Bolker, Marguerite Butler, Peter Cowan, Damien de Vienne, Thibaut Jombart, Steve Kembel, François Michonneau, David Orme, Brian O’Meara, Emmanuel Paradis, Jim Regetz, Derrick Zwickl). In view: Phylogenetics.

phyloclim Integrating phylogenetics and climatic niche modelling. Author: Christoph Heibl. In view: Phylogenetics.

plgp Particle Learning of Gaussian Processes. Author: Robert B. Gramacy.

plsdof Degrees of freedom and confidence intervals for Partial Least Squares regression. Authors: Nicole Krämer and Mikio L. Braun.


pmlr Penalized Multinomial Logistic Regression. Authors: Sarah Colby, Sophia Lee, Juan Pablo Lewinger and Shelley Bull.

png Read and write PNG images. Author: Simon Urbanek.
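A minimal round-trip sketch (the file name is hypothetical; greyscale values lie in [0, 1]):

library(png)
m <- matrix(seq(0, 1, length.out = 64), nrow = 8)  # 8 x 8 greyscale gradient
writePNG(m, "gradient.png")                        # write it out
img <- readPNG("gradient.png")                     # read it back
dim(img)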

poistweedie Poisson-Tweedie exponential family models. Authors: David Pechel Cactcha, Laure Pauline Fotso and Célestin C Kokonendji.

polySegratio Simulate and test marker dosage for dominant markers in autopolyploids. Author: Peter Baker.

polySegratioMM Bayesian mixture models for marker dosage in autopolyploids. Author: Peter Baker.

popPK Summary of population pharmacokinetic analysis. Authors: Christoffer W. Tornoe and Fang Li.

powerMediation Power/sample size calculation for mediation analysis. Author: Weiliang Qiu.

ppMeasures Point pattern distances and prototypes. Authors: David M Diez, Katherine E Tranbarger Freier, and Frederic P Schoenberg.

profdpm Profile Dirichlet Process Mixtures. Author: Matt Shotwell.

psychotree Recursive partitioning based on psychometric models. Authors: Achim Zeileis, Carolin Strobl and Florian Wickelmaier. In view: Psychometrics.

ptw Parametric Time Warping. Authors: Jan Gerretzen, Paul Eilers, Hans Wouters, Tom Bloemberg and Ron Wehrens. In view: TimeSeries.

pyramid Functions to draw population pyramids. Author: Minato Nakazawa.

qrnn Quantile regression neural network. Author: Alex J. Cannon.

quaternions Arithmetic and linear algebra with quaternions. Author: K. Gerald van den Boogaart.

r2dRue 2dRue: 2d Rain Use Efficiency model. Authors: Gabriel del Barrio, Juan Puigdefabregas, Maria E. Sanjuan and Alberto Ruiz.

r2lh R to LaTeX and HTML. Authors: Christophe Genolini, Bernard Desgraupes and Lionel Riou Franca.

r4ss R code for Stock Synthesis. Authors: Ian Taylor, Ian Stewart, Tommy Garrison, Andre Punt, John Wallace, and other contributors.

rEMM Extensible Markov Model (EMM) for data stream clustering in R. Authors: Michael Hahsler and Margaret H. Dunham.

random.polychor.pa A parallel analysis with polychoric correlation matrices. Authors: Fabio Presaghi and Marta Desimoni.

rangeMapper A package for easy generation of biodiversity (species richness) or life-history traits maps using species geographical ranges. Authors: Mihai Valcu and James Dale.

raster Geographic analysis and modeling with raster data. Authors: Robert J. Hijmans and Jacob van Etten.

recommenderlab Lab for testing and developing recommender algorithms for binary data. Author: Michael Hahsler.

remix Remix your data. Author: David Hajage.

rgp R genetic programming framework. Authors: Oliver Flasch, Olaf Mersmann, Thomas Bartz-Beielstein and Joerg Stork.

ris RIS package for bibliographic analysis. Author: Stephanie Kovalchik.

rngWELL Toolbox for WELL random number generators. Authors: C code by F. Panneton, P. L’Ecuyer and M. Matsumoto; R port by Christophe Dutang and Petr Savicky.

rocc ROC based classification. Author: Martin Lauss.

rpsychi Statistics for psychiatric research. Author: Yasuyuki Okumura.

rredis Redis client for R. Author: B. W. Lewis.

rrules Generic rule engine for R. Authors: Oliver Flasch, Olaf Mersmann and Thomas Bartz-Beielstein.

rworldmap For mapping global data: rworldmap. Authors: Andy South, with contributions from Barry Rowlingson, Roger Bivand, Matthew Staines and Pru Foster. In view: Spatial.

samplesize A collection of sample size functions. Author: Ralph Scherer.

samplingbook Survey sampling procedures. Authors: Juliane Manitz, contributions by Mark Hempelmann, Göran Kauermann, Helmut Küchenhoff, Cornelia Oberhauser, Nina Westerheide and Manuel Wiesenfarth.

saws Small-Sample Adjustments for Wald tests Using Sandwich Estimators. Author: Michael Fay.

scaRabee Optimization toolkit for pharmacokinetic-pharmacodynamic models. Author: Sebastien Bihorel.


scrapeR Tools for scraping data from HTML and XML documents. Author: Ryan M. Acton.

secr Spatially explicit capture-recapture. Author: Murray Efford.

semPLS Structural Equation Modeling using Partial Least Squares. Author: Armin Monecke.

sensitivityPStrat Principal stratification sensitivity analysis functions. Author: Charles Dupont.

sifds Swedish inflation forecast data set. Author: Michael Lundholm.

simPopulation Simulation of synthetic populations for surveys based on sample data. Authors: Stefan Kraft and Andreas Alfons.

sisus SISUS: Stable Isotope Sourcing using Sampling. Author: Erik Barry Erhardt.

smd.and.more t-test with Graph of Standardized Mean Difference, and More. Author: David W. Gerbing, School of Business Administration, Portland State University.

solaR Solar photovoltaic systems. Author: Oscar Perpiñán Lamigueiro.

someKfwer Controlling the Generalized Familywise Error Rate. Authors: L. Finos and A. Farcomeni.

spaa Species Association Analysis. Authors: Jinlong Zhang, Qiong Ding and Jihong Huang.

space Sparse PArtial Correlation Estimation. Authors: Jie Peng, Pei Wang, Nengfeng Zhou and Ji Zhu.

sparcl Perform sparse hierarchical clustering and sparse k-means clustering. Authors: Daniela M. Witten and Robert Tibshirani.

sparkTable Sparklines and graphical tables for TeX and HTML. Authors: Alexander Kowarik, Bernhard Meindl and Stefan Zechner.

sparr The sparr package: SPAtial Relative Risk. Authors: T.M. Davies, M.L. Hazelton and J.C. Marshall.

spef Semiparametric estimating functions. Authors: Xiaojing Wang and Jun Yan.

sphet Spatial models with heteroskedastic innovations. Author: Gianfranco Piras.

spikeslab Prediction and variable selection using spike and slab regression. Author: Hemant Ishwaran.

stam Spatio-Temporal Analysis and Modelling. Author: Zhijie Zhang.

steepness Testing steepness of dominance hierarchies. Authors: David Leiva and Han de Vries.

stockPortfolio Build stock models and analyze stock portfolios. Authors: David Diez and Nicolas Christou. In view: Finance.

stoichcalc R functions for solving stoichiometric equations. Author: Peter Reichert.

stratification Univariate stratification of survey populations. Authors: Sophie Baillargeon and Louis-Paul Rivest.

stratigraph Toolkit for the plotting and analysis of stratigraphic and palaeontological data. Author: Walton A. Green.

stringr Make it easier to work with strings. Author: Hadley Wickham.
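A few illustrative calls (a sketch of the package’s consistent str_* naming):

library(stringr)
x <- c("apple pie", "banana split")
str_detect(x, "an")       # which elements contain the pattern
str_replace(x, " ", "_")  # replace the first match in each element
str_c("fruit: ", x)       # vectorized concatenation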

survPresmooth Presmoothed estimation in survival analysis. Authors: Ignacio Lopez-de-Ullibarri and Maria Amalia Jacome.

symbols Symbol plots. Author: Jaroslav Myslivec.

symmoments Symbolic central moments of the multivariate normal distribution. Author: Kem Phillips.

tclust Robust trimmed clustering. Authors: Agustín Mayo Iscar, Luis Angel García Escudero and Heinrich Fritz.

testthat Testthat code. Tools to make testing fun :). Author: Hadley Wickham.

tgram Functions to compute and plot tracheidograms. Authors: Marcelino de la Cruz and Lucia DeSoto.

topicmodels Topic models. Authors: Bettina Grün and Kurt Hornik.
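A compact sketch fitting an LDA model to the document-term matrix shipped with the package (the topic count and document subset are chosen arbitrarily):

library(topicmodels)
data("AssociatedPress", package = "topicmodels")
fit <- LDA(AssociatedPress[1:50, ], k = 4)  # 4-topic model on 50 documents
terms(fit, 5)                               # top 5 terms per topic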

tourr Implement tour methods in pure R code. Authors: Dianne Cook and Hadley Wickham.

tourrGui A Tour GUI using gWidgets. Authors: Bei, Dianne Cook, and Hadley Wickham.

traitr An interface for creating GUIs modeled in part after traits UI module for Python. Author: John Verzani.

trex Truncated exact test for two-stage case-control design for studying rare genetic variants. Authors: Schaid DJ and Sinnwell JP.

trio Detection of disease-associated SNP interactions in case-parent trio data. Authors: Qing Li and Holger Schwender.

tsne T-distributed Stochastic Neighbor Embedding for R (t-SNE). Author: Justin Donaldson.


ttime Translate neurodevelopmental event timing across species. Author: Radhakrishnan Nagarajan.

twiddler Interactive manipulation of R expressions. Authors: Oliver Flasch and Olaf Mersmann.

twopartqtl QTL Mapping for Point-mass Mixtures. Author: Sandra L. Taylor.

umlr Open Source Development with UML and R. Author: Charlotte Maia.

unmarked Models for Data from Unmarked Animals. Authors: Ian Fiske and Richard Chandler.

vcdExtra vcd additions. Authors: Michael Friendly with Heather Turner, David Firth, Achim Zeileis and Duncan Murdoch.

vegdata Functions to use vegetation databases (Turboveg) for vegetation analyses in R. Author: Florian Jansen.

venneuler Venn and Euler Diagrams. Author: Lee Wilkinson.

vmv Visualization of Missing Values. Author: Waqas Ahmed Malik.

waterfall Waterfall Charts in R. Author: James P. Howard, II.

waveband Computes credible intervals for Bayesian wavelet shrinkage. Author: Stuart Barber.

webvis Create graphics for the web from R. Author: Shane Conway.

wq Exploring water quality monitoring data. Authors: Alan D. Jassby and James E. Cloern.

xlsx Read, write, format Excel 2007 (‘xlsx’) files. Author: Adrian A. Dragulescu.
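A minimal write/read sketch (file and sheet names are made up):

library(xlsx)
write.xlsx(head(iris), "iris.xlsx", sheetName = "demo")  # write a data frame
d <- read.xlsx("iris.xlsx", sheetIndex = 1)              # read it back
head(d)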

xlsxjars Package required jars for the xlsx package. Author: Adrian A. Dragulescu.

zic Bayesian Inference for Zero-Inflated Count Models. Author: Markus Jochmann. In view: Bayesian.

Other changes

• The following packages were moved to the Archive: CTFS, ComPairWise, DICOM, GFMaps, HFWutils, IQCC, JointGLM, MIfuns, RGrace, RII, SimHap, SoPhy, adapt, agce, agsemisc, assist, cir, connectedness, dirichlet, empiricalBayes, gcl, gtm, latentnetHRT, markerSelectionPower, mcgibbsit, networksis, nlts, popgen, poplab, qdg, r2lUniv, rankreg, spatclus, staRt, sublogo, tdist, tutoR, vabayelMix, and varmixt.

• The following packages were resurrected from the Archive: ProbForecastGOP, SciViews, apTreeshape, far, norm, vardiag and xlsReadWrite.

• Package atm was renamed to amba.

Kurt Hornik
WU Wirtschaftsuniversität Wien, [email protected]

Achim Zeileis
Universität Innsbruck, [email protected]


News from the Bioconductor Project

Bioconductor Team
Program in Computational Biology
Fred Hutchinson Cancer Research Center

We are pleased to announce Bioconductor 2.6, released on April 23, 2010. Bioconductor 2.6 is compatible with R 2.11.1, and consists of 389 packages. There are 37 new packages, and enhancements to many others. Explore Bioconductor at http://bioconductor.org, and install packages with

source("http://bioconductor.org/biocLite.R")biocLite() # install standard packages...biocLite("IRanges") # ...or IRanges

New and revised packages

Sequence analysis packages address infrastructure (GenomicRanges, Rsamtools, girafe); ChIP-seq (BayesPeak, CSAR, PICS); digital gene expression and RNA-seq (DESeq, goseq, segmentSeq); and motif discovery (MotIV, rGADEM).

Microarray analysis packages introduce new approaches to pre-processing and technology-specific assays (affyILM, frma, frmaTools, BeadDataPackR, MassArray); analysis of specific experimental protocols (charm, genoCN, iChip, methVisual); and novel statistical methods (ConsensusClusterPlus, ExpressionView, eisa, GSRI, PROMISE, tigre).

Flow cytometry packages include SamSPECTRAL, flowMeans, flowTrans, and iFlow.

Annotation and integrative analysis packages facilitate interfacing with GEO (GEOsubmission), the Sequence Read Archive (SRAdb), and tabulation of genome sequence project data (genomes); the GSRI package to estimate differentially expressed genes in a gene set; PCA and CCA dependency modeling (pint); and updated access to exon array annotations (xmapcore).

Further information on new and existing packages can be found on the Bioconductor web site, which contains ‘views’ that identify coherent groups of packages. The views link to on-line package descriptions, vignettes, reference manuals, and use statistics.

Other activities

Training courses and a European developer conference were important aspects of Bioconductor, with recent successful offerings ranging from microarray and sequence analysis through advanced R programming (http://bioconductor.org/workshops). The active Bioconductor mailing lists (http://bioconductor.org/docs/mailList.html) connect users with each other, to domain experts, and to maintainers eager to ensure that their packages satisfy the needs of leading edge approaches. The Bioconductor team continues to ensure quality software through technical and scientific reviews of new packages, and daily builds of released packages on Linux, Windows (32- and now 64-bit), and Macintosh platforms.

Looking forward

BioC2010 is the Bioconductor annual conference, scheduled for July 29–30 at the Fred Hutchinson Cancer Research Centre in Seattle, Washington, with a developer day on July 28. The conference features morning scientific talks and afternoon practical sessions (https://secure.bioconductor.org/BioC2010/). European developers will have an opportunity to meet in Heidelberg, Germany, on 17–18 November.

Contributions from the Bioconductor community play an important role in shaping each release. We anticipate continued efforts to provide statistically informed analysis of next generation sequence data, as well as development of algorithms for advanced analysis of microarrays. Areas of opportunity include the ChIP-seq, RNA-seq, rare variant, and structural variant domains. The next release cycle promises to be one of active scientific growth and exploration!


R Foundation News

by Kurt Hornik

Donations and new members

Donations

Vies Animales, France

Owen Jones, Robert Maillardet, and Andrew Robinson, Australia: donating a proportion of the royalties for the book “Introduction to Scientific Programming and Simulation using R”, recently published (2009) by Chapman & Hall/CRC

Jogat Sheth, USA

New benefactors

OpenAnalytics, Belgium
Ritter and Danielson Consulting, Belgium
Saorstat Limited, Ireland

New supporting institutions

Marine Scotland Science (UK)

DFG Research Group “Mind and Brain Dynamics”, Universität Potsdam (Germany)

Chaire d’actuariat, Université Laval (Canada)

New supporting members

José Banda, USA
Kapo Martin Coulibaly, USA
Andreas Eckner, USA
Marco Geraci, UK
David Monterde, Spain
Guido Möser, Germany
Alexandre Salvador, France
Ajit de Silva, USA
Wen Hsiang Wie, Taiwan
Tomas Zelinsky, Slovakia

Kurt Hornik
WU Wirtschaftsuniversität Wien, [email protected]
