MATH6030: Statistical Computing II

Sujit K. Sahu

School of Mathematics, University of Southampton,

Highfield, Southampton, UK.

September, 2008


MATH6030: Statistical Computing II

Lecturer: Dr S. K. Sahu Email: [email protected]: 9001 Telephone: 8059–5123

This course consists of 12 lectures and practicals given during weeks 1–6.

Assessment: 100% coursework. Due by 3PM on Friday, December 12, 2008. Please note this, as no extension is possible.

A useful book: W. N. Venables and B. D. Ripley, Modern Applied Statistics with S-Plus, Springer, 3rd Edition.


Contents

1 Getting Started
1.1 What is S-Plus?
1.2 Preliminaries
1.3 Starting S-Plus
1.4 S-Plus commands and basics
1.5 Workspace
1.6 Importing and exporting data
1.7 S-Plus data types
1.8 Naming conventions
1.9 Fitting linear models
1.10 Plotting
1.11 An introductory session
1.12 A fun program

2 Data manipulation with S-Plus
2.1 Basic manipulation with an example
2.2 Logical vectors
2.3 The functions apply, tapply and lapply
2.4 Factors
2.5 Other useful things to know
2.6 Matrix operations
2.7 Manipulating data frames
2.8 Exercises
2.9 Solutions to Exercises 2.8

3 Writing scripts and functions
3.1 Scripts
3.2 Functions
3.3 Example: Confidence intervals for a probability
3.4 Exercise
3.5 Solutions to Exercises 3.4
3.6 Further Exercise - Investigating Coverage by Simulation
3.7 Conditional execution
3.8 Loops
3.9 Avoiding the loops
3.10 Using the loops
3.11 Exercise - Maximising a nonstandard likelihood
3.12 Solution to Exercises 3.11

4 Graphics
4.1 Simple plots
4.2 Graphical parameters
4.3 Adding material to the body of a plot
4.4 Plotting functions
4.5 A Few Tips
4.6 Bar Plot
4.7 Pairwise scatter plot
4.8 Histogram and qqplot
4.9 Boxplot
4.10 Three Dimensional Plots
4.11 Exercises


Chapter 1

Getting Started

1.1 What is S-Plus?

S-Plus is a statistical programming and analysis language, developed primarily at the AT&T research laboratories during the 1980s. As well as having the usual facilities for statistical modelling, data handling and graphical display, in common with many other statistical packages, S-Plus also allows the user extreme flexibility in manipulating and analysing data.

To enable us to exploit the full flexibility of S-Plus, MATH6030 will focus on the command-based implementation, whereby one enters S-Plus commands and functions directly in the Commands window. S-Plus for Windows also utilises pull-down menus to perform some of the more common operations. Feel free to investigate these for yourselves.

S-Plus is an object-oriented language, which means that everything is stored as a particular type of object, with different operations being appropriate for different types of object. For example, vectors and matrices are both types of object in S-Plus. Data are usually stored in a data frame object, and results of statistical analyses are stored in an object of the appropriate type; for example, when you fit a linear model, the results are stored as a linear model object.

1.2 Preliminaries

It is assumed that you are a registered user of the Computing Services system, and are familiar with using the Windows machines in the Computing Services clusters. Operations you should be familiar with include logging in, accessing programs and files, file handling, using disks, printing, logging out etc. If you are unfamiliar with any of these, then Computing Services provide a wide range of introductory material to help you.

A number of data files, which we shall use during the course, are stored on the University central file server. To access these files, follow the path All Programs → Course Folders → Access Course Folders. Then in Windows Explorer navigate to the math6030 folder and then on to the SData sub-folder. The data files are kept there so that you can easily access data from S-Plus, as we shall see later.


1.3 Starting S-Plus

Start up S-Plus by following the path All Programs → Statistics → S-PLUS 7. After a short delay a large S-Plus window containing two smaller windows will appear.

One is the Commands window, which contains any input commands. This window contains the prompt >. You can switch between windows by using the Windows menu. Type 2+2 in the Commands window and see what happens.

The other window is the Object Explorer. This window contains details of data which you have in your workspace. If you close the Object Explorer, you can reopen it by clicking on the appropriate icon beneath the menu bar (it's the one which looks like a yellow box with red, blue and green objects hovering above it). An alternative way of examining which objects are available to you is to type ls() in the Commands window.

S-Plus has an extensive on-line help system. You can access this using the Help menu. The help system is particularly useful for looking up commands or functions that you know exist but whose name or whose syntax you have forgotten. An alternative way of obtaining information about a function is to type help(<function name>) or ?<function name>, for example help(plot) or ?plot.

The args command prints the list of arguments and defaults which can be passed to a function. For example, type args(pnorm).

You can exit S-Plus by typing

q()

in the Commands window or by following File → Exit.

1.4 S-Plus commands and basics

S-Plus commands are always of the form <function>(<arguments>). For example, qnorm(0.975) gives the 97.5% quantile of the standard normal distribution and qnorm(0.975,2,3) gives the 97.5% quantile of the N(2, 3^2) distribution. In the case where there are no options, e.g. the command q(), you still need to add the brackets. This is because S-Plus treats all of its commands as functions. If you omit the brackets, then S-Plus thinks that you don’t want to execute the function but simply see the S code which the function executes. Type plot and see what happens.
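As a small check of how the positional arguments of qnorm are interpreted, the following sketch (written in R syntax, which shares these functions with S-Plus) compares the N(2, 3^2) quantile with the rescaled standard normal quantile. The object names z, q and q_manual are illustrative only.

```r
# 97.5% quantile of the standard normal distribution
z <- qnorm(0.975)

# 97.5% quantile of N(2, 3^2): the third argument is the standard deviation
q <- qnorm(0.975, 2, 3)

# The same quantile obtained by rescaling the standard normal quantile
q_manual <- 2 + 3 * z

print(c(z, q, q_manual))
```

The last two printed values should agree, which confirms that the second and third arguments act as the mean and the standard deviation rather than the variance.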

The assignment operator in S-Plus is <-, i.e. a ‘less than’ symbol immediately followed by a hyphen, rather than =, which is reserved for something else; see later. (The underscore symbol may be used instead of <-). For example type

x <- 2 + 2

in the Commands window. This can be read as the object x comes from 2+2. You can insert comments by first typing the # character. For example, we could have said

x <- 2 + 2 # The output should be 4!

Note that an assignment does not produce any output (unless you have made an error, in which case an error message will appear). To see the result of an assignment, you need to examine the contents of the object you have assigned the result of the command to. For example, typing


MATH6030 Statistical Computing II: S-Plus Year:08–09 Dr S. K. Sahu 7

x

should now give the output [1] 4. The [1] indicates that 4 is the first component of x. Of course x only has one component here, but this helps you keep track when the output is a vector of many components.

You can repeat or edit previous commands by using the up and down arrow keys (↑↓).

1.5 Workspace

The default workspace is C:\apps\splus7. It is listed at the top of the SearchPath in the Object Explorer. The contents of this workspace will be erased when you log out of the computer. If you use a workspace within your home filestore, then the contents will be saved for future sessions.

Suppose that we want to use H:\MA6030 as the permanent workspace for the duration of this course. To do this for the first time we follow the path File → Chapters → New Working Chapter. In the dialogue box, type in H:\MA6030 and the label MA6030. Click OK and it will ask if you would like to create the directory. On responding Yes you get the desired result; the first item in the SearchPath is MA6030.

All the objects you create from now on will be written to the workspace corresponding to the first item in the SearchPath. When you start up S-Plus in future, make sure to use File → Chapters → Attach/Create Chapter and enter H:\MA6030 in position 1. This will enable you to use datasets and functions you created in previous sessions.

1.6 Importing and exporting data

Probably the best way of entering data into S-Plus is as a data frame object. You can create a new data frame by following File → New and choosing Data Set. A rectangular array appears, and you can type in the data directly. Rows represent units and columns represent variables (which are named). A data frame must consist of columns of equal length. An observation of a variable which is missing or unknown for a particular unit is coded as NA for ‘Not Available’. Use the mouse or the arrow keys (←↑↓→) to move around.

You can also read data into S-Plus directly from an input file, which can be of many types, e.g. text files, EXCEL files, MINITAB files, SAS files, SPSS files, etc. As an example let us import the file weld.txt in the Course folder. (You need to go to the math6030 course folder and then on to the Data sub-folder; see Section 1.2: Preliminaries.) You can then read the data set directly into a data frame from the file by following File → Import Data → From File (use the Browse button to find the file in the R: drive). Remember to give your data frame an informative name, for future reference. Once the data are imported, the two variables can be given names by double clicking on the grey bar above the first row of data. Call the columns of data x and y.

1.7 S-Plus data types

The most common data types are as follows.


• Vectors are ordered strings of data values. A vector can be one of numeric, character, logical or complex types. For example: x <- 5:15 puts the numbers 5, 6, ..., 15 in the vector x. You can access parts of x by calling things like:

x[1] # gives the first element of x

x[2:4] # gives the elements x[2], x[3], x[4]

x[-(2:4)] # gives all but x[2], x[3], x[4]

There are various commands for creating vectors. For example, investigate the vectors produced by the following commands

a1 <- c(1,3,5,6,8,21)

a2 <- seq(1,21,by=2)

a3 <- c(a1,a2)

a4 <- seq(min(a1),max(a1),length=10)

a5 <- rep(2,12)

a6 <- rep(a2,2)

a7 <- rep(a1,c(2,2,3,3,4,1))

• Matrices are rectangular arrays consisting of rows and columns. All data must be of the same mode. For example, y <- matrix(1:6, nrow=3,ncol=2) creates a 3×2 matrix, called y. You can access parts of y by calling things like:

y[1,2] # gives the first row second column entry of y

y[1,] # gives the first row of y

y[,2] # gives the second column of y

Individual elements of vectors or matrices, or whole rows or columns of matrices, may be updated by assigning them new values, e.g.

a1[1] <- 3

y[1,2] <- 3

y[,2] <- c(2,2, 2)

• Data frames are rectangular arrays where columns could be of different types. For example, type z <- kyphosis and bring it up by double clicking on z in the Object Explorer. (You can see what these data are by typing ?kyphosis.) Columns of data frames are vectors and are denoted by <data frame name>$<variable name>, for example weld$x and weld$y in the example above. Data frames are also indexed like matrices, so elements, rows and columns of data can all be accessed as for matrices above.

You can add columns to a data frame, for example

weld$xy<-weld$x*weld$y


adds a new column to the data frame weld.

Note that most operations on vectors are performed componentwise, so for example weld$x*weld$y results in a vector of the same length as weld$x and weld$y, containing the componentwise products. Similarly, weld$x^2-1, 3*sqrt(0.5*weld$x) and log(weld$x)/2 all create vectors of the same length as weld$x, with the relevant operation performed component by component.

However, certain statistical operations on vectors result in scalars, for example the functions mean, median, var, min, max, sum, prod etc. Try, for example:

mean(weld$x)

var(weld$x)

• Lists are used to collect objects of different types. For example, a list may consist of two matrices and three vectors of different size and modes. The components of a list have individual names and are accessed using <list name>$<component name>, similar to data frames (which are themselves lists, of a particular form). For example, the result of fitting a model is usually a list, containing for example, parameter estimates, residuals, fitted values etc.
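The four data types described above can be tried side by side. The following is a minimal sketch in R syntax (which these S-Plus commands share); the data frame d and the list l are made-up illustrations, not course data.

```r
# Vector: the integers 5, 6, ..., 15
x <- 5:15
first <- x[1]           # first element
middle <- x[2:4]        # elements 2 to 4
rest <- x[-(2:4)]       # everything except elements 2 to 4

# Matrix: filled column by column
y <- matrix(1:6, nrow = 3, ncol = 2)
y12 <- y[1, 2]          # row 1, column 2
col2 <- y[, 2]          # second column

# Data frame: columns of equal length, possibly of different types
d <- data.frame(x = c(1, 2, 3), label = c("a", "b", "c"))
d$xsq <- d$x^2          # add a column, computed componentwise

# List: components of different types and sizes, accessed by name
l <- list(vec = x, mat = y)
n_rows <- nrow(l$mat)
```

Note that the matrix is filled down the columns, so y[1, 2] is 4 rather than 2.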

Any unwanted objects can be removed from your workspace by using the function rm. For example, to remove object a1 type

rm(a1)

Alternatively, objects, or individual columns of a data frame, may be removed using the Object Explorer.

1.8 Naming conventions

S-Plus has many reserved names, for example plot. You must avoid giving these names to objects you create. If you do, then it will disable the corresponding library function and it will not work until you remove your object.

You can see if S-Plus has reserved the word by asking it. For example, if you wanted to know whether you can use c or D as the name of an object, simply type it in the Commands window. If you get an error message then it has not reserved the word or it is not in your workspace, and you can use the name. If it outputs S code which you have not written, then the name is already reserved and you cannot use it.

S-Plus is case sensitive. Hence x and X are different objects; plot is a built-in function but Plot is not.

1.9 Fitting linear models

Fitting linear regression models in S-Plus is easy. For example, to fit a simple linear regression to the data in weld with weld$y as the response variable and weld$x as the explanatory variable, type

lm(y~x, data=weld)

in the Commands window. The output contains a summary of important aspects of the regression model. However, much more information can be obtained by assigning the results of the linear model to an S-Plus object (a linear model object which takes the form of a list) using

weld.lm1 <- lm(y~x, data=weld)

The linear model object can then be used within a number of S-Plus commands. Try:

resid(weld.lm1)

coef(weld.lm1)

plot(weld.lm1)

deviance(weld.lm1)

fitted(weld.lm1)

print(weld.lm1)

summary(weld.lm1)

We may be interested in whether a quadratic model fits the data any better. You can fit a quadratic model by issuing one of the following commands.

weld.lm2 <- lm(y~x+x^2, data=weld)

weld.lm2 <- update(weld.lm1, . ~ . +x^2)
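For readers trying the quadratic fit in R rather than S-Plus, here is a hedged sketch. The weld data are not reproduced in these notes, so the data frame sim below is simulated and purely hypothetical. In R the squared term has to be wrapped in I(), because ^ inside a model formula is otherwise treated as a formula operator rather than as arithmetic.

```r
set.seed(1)

# Hypothetical data with a genuinely quadratic trend (not the weld data)
sim <- data.frame(x = seq(1, 10, by = 0.5))
sim$y <- 1 + 2 * sim$x - 0.3 * sim$x^2 + rnorm(nrow(sim), sd = 0.5)

# Straight-line fit, then add the quadratic term with update()
fit1 <- lm(y ~ x, data = sim)
fit2 <- update(fit1, . ~ . + I(x^2))

# F test comparing the two nested models
print(anova(fit1, fit2))
```

The residual sum of squares returned by deviance(fit2) can never exceed deviance(fit1), since the linear model is nested within the quadratic one; the anova table indicates whether the reduction is worthwhile.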

1.10 Plotting

S-Plus is particularly flexible at enabling you to get the plot that you want. A simple scatter plot of Diameter against Current for Dataset 1 can be obtained by

plot(weld$x,weld$y)

A nicer plot can be obtained by

plot(weld$x,weld$y, xlab="Current", ylab="Diameter", las=1)

Can you see what the argument las=1 has achieved? This is just one of a large number of arguments which can be supplied to an S-Plus graphical routine.

1.11 An introductory session

Try this introductory session using the Commands window. (You don’t need to type in the explanations following the # sign.) Don’t worry if you don’t understand everything here. You will see much of this material in future weeks.

x <- seq(from=1, to=20, by=0.5) # Generates a sequence

x # Let’s see what’s in x.

w <- 1 + x/2 # w is a transformation of x.

y <- x + w * rnorm(x) # Elements in w are the s.d.s for random normal data, y

dum <- data.frame(x, y, w) # Create a data frame of three columns

dum # Print dum


rm (x, y, w) # Remove x, y and w

fm <- lm(y~ x, data=dum) # Fit a linear regression

summary(fm) # Look at the analysis

lrf <- loess(y~x, dum) # Fit a smooth regression curve

plot(dum$x, dum$y) # Get a scatter plot

abline(fm, col=2) # Add the linear regression line

lines(spline(dum$x, fitted(lrf)), col=4) # Add the smooth regression curve

plot(fitted(fm), resid(fm), xlab="Fitted values", ylab="Residuals")

title("Residuals versus fitted values")

qqnorm(resid(fm)) # A normal probability plot to check for skewness etc

qqline(resid(fm)) # Adds a line

rm (dum, fm, lrf) # Remove the unwanted objects

1.12 A fun program

Follow File → New → Script File and then type the following into it:

butterfly <- function(color = 8) {

theta <- seq(from=0.0, to=24 * pi, len = 2000)

radius <- exp(cos(theta)) - 2 * cos(4 * theta)

radius <- radius + sin(theta/12)^5

x <- radius * sin(theta)

y <- - radius * cos(theta)

plot(x, y, type = "l", axes = F, xlab = "", ylab = "", col = color)

}

After you have finished typing, highlight all the text and click on the forward triangle (below the File menu on the left). Then type butterfly() in the Commands window and hit enter. What do you see? Try butterfly(5) etc.


Chapter 2

Data manipulation with S-Plus

2.1 Basic manipulation with an example

The data sets liver.cells, liver.exper, liver.gt, and liver.section (all inbuilt data sets in S-Plus) contain observations from University of Wisconsin carcinogenicity studies of rat livers.

liver.cells gives the number of cells injected into each of 26 animals.

liver.exper identifies each of three experiments (A, B, C) in the study.

liver.gt is a 52 by 4 matrix containing data for the 4 lobes (ARL, PRL, PPC, AC). For our purposes, the details are not important.

liver.section identifies replicate tissue sections. Successive observations are pairs of sections for the same specimen.

These data may be combined into a single data frame, called liver using

liver <- data.frame(liver.cells, liver.exper, liver.gt, liver.section)

Note that the four columns of liver.gt have become four columns of the data frame. A data frame is appropriate here as the rows of the data frame correspond to common units of observation.

We use the $ symbol to access columns of a data frame. For example,


mean(liver$liver.cells) # calculates the mean.

To put a new column in liver we may use:

liver$ones <- rep(1, 52) # now check the liver data frame.

To combine objects of different lengths, we would have to use a list, for example

liver.list <- list(liver.cells, liver.gt)

Note that the matrix liver.gt appears as a single component of this list. Names of components of a list or data frame may be examined or changed using names, for example

names(liver.list) <- c("cells","gt")
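The behaviour of names on a list can be tried without the liver data. In the following R-syntax sketch, cells, gt and toy.list are small made-up stand-ins for the inbuilt S-Plus objects.

```r
# Hypothetical stand-ins: a vector of length 3 and a 4 x 2 matrix
cells <- c(10, 20, 30)
gt <- matrix(1:8, nrow = 4, ncol = 2)

# A list can hold objects of different lengths and shapes
toy.list <- list(cells, gt)
before <- names(toy.list)          # no names have been set yet

# Assign names, then retrieve components by name
names(toy.list) <- c("cells", "gt")
after <- names(toy.list)
gt_dim <- dim(toy.list$gt)         # the matrix is stored whole
```

The matrix keeps its dimensions inside the list, which is what is meant by it appearing as a single component.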

2.2 Logical vectors

Recall that we can select a set of components of a vector by indicating the relevant components in square brackets (e.g. liver.cells[1], liver.cells[5:7], liver.cells[c(1,3,5)] etc.). However, we often want to select components based on their values, or on the values of another vector. For example, suppose that we are interested in all the values of liver.cells which are greater than 5, or all the values of liver.cells for experiment A.

Typing a condition involving a vector returns a logical vector of the same length containing T (true) for those components which satisfy the condition and F (false) otherwise. For example, try

liver.cells>5

liver.exper=="A"

(note the use of == in a logical operation, to distinguish it from the assignment =). A logical vector may be used to select a set of components of any other vector. Try

liver.cells[liver.cells>5]

liver.exper[liver.cells>5]

liver.cells[liver.exper=="A"]
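The same selections can be traced by hand on a small example. In this R-syntax sketch the vectors cells and exper are hypothetical and merely mimic the roles of liver.cells and liver.exper.

```r
# Hypothetical measurements and experiment labels of equal length
cells <- c(2, 7, 5, 9, 1, 6)
exper <- c("A", "B", "A", "C", "B", "A")

big <- cells > 5                      # logical vector: F T F T F T
big_cells <- cells[big]               # values greater than 5: 7 9 6
big_exper <- exper[big]               # their experiments: "B" "C" "A"
a_cells <- cells[exper == "A"]        # values for experiment A: 2 5 6

# Conditions on two vectors can be combined with &
a_big <- cells[cells > 5 & exper == "A"]   # 6
```

The logical vector and the vector being indexed must refer to the same units, in the same order, for the selection to be meaningful.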


• The operations & (and) and | (or) operate on pairs of logical vectors. For example, if x <- 1:10, then x>3 & x<7 returns

[1] F F F T T T F F F F

and x<3 | x>7 returns

[1] T T F F F F F T T T

• The functions any and all take a logical vector as their argument, and return a single logical value. For example, any(x>3 & x<7) returns T, because at least one component of its argument is T, whereas all(x>3 & x<7) returns F, because not every component of its argument is T.

• There is a suite of functions called as.xxx which coerce the argument to the xxx type in the best way possible. An example using as.vector appears in the solutions in Section 2.9. We can use these functions to switch between the factor type and the numeric type.

• The suite of functions called is.xxx is used to test whether the argument of the function is of a certain type. For example, is.vector(x) returns true if x is a vector and false otherwise. Other examples include is.matrix, is.data.frame, is.factor, is.na, is.numeric.

• There is a huge number of functions useful for manipulating data and performing statistical computing. It is not possible to discuss all of those in class. You can look at the Splus CHEATSHEET linked from the course web page. The address is: http://lib.stat.cmu.edu/S/cheatsheet
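The logical summaries and the as.xxx and is.xxx families can be illustrated briefly. This is a sketch in R syntax; the factor f is a made-up example, and the point about level codes is worth checking in your own S-Plus session.

```r
x <- 1:10
inside <- x > 3 & x < 7

any_inside <- any(inside)     # TRUE: at least one component is true
all_inside <- all(inside)     # FALSE: not every component is true
n_inside <- sum(inside)       # 3: true counts as 1, false as 0

# A factor whose labels happen to look like numbers
f <- factor(c("10", "20", "10"))
is_f <- is.factor(f)

# as.numeric on a factor gives the internal level codes, not the labels
codes <- as.numeric(f)                   # 1 2 1
values <- as.numeric(as.character(f))    # 10 20 10

missing <- is.na(c(1, NA, 3))            # FALSE TRUE FALSE
```

Going through as.character first is the usual way to recover the numeric labels of a factor.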

2.3 The functions apply, tapply and lapply

It is often desirable, in data analysis, to carry out the same statistical operation separately on different segments of a data frame, matrix or list. The function apply allows us to do this when we want to perform the same function on each row or each column of a matrix or data.frame. For example,

apply(liver.gt,1,mean)

apply(liver.gt,2,var)

Note that you can apply the same commands to the data frame liver, but you should exclude the column corresponding to liver.exper, which contains non-numeric data (the commands below drop columns 2 and 7, i.e. liver.exper and liver.section), for example

apply(liver[,-c(2,7)],1,mean)

apply(liver[,-c(2,7)],2,var)

The function tapply allows us to carry out a statistical operation on subsets of a given vector, defined according to the values of a specified vector. For example, to calculate the mean value of liver.cells for each experiment A,B,C, separately we use

tapply(liver.cells,liver.exper,mean)

The function lapply allows us to carry out a statistical operation on each component of a given list. For example,

lapply(liver.list,median)

gives two median values, one for each component of liver.list. Note that the component gt is a matrix, so the median here represents the median value of all 52 × 4 values.
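The three functions can be compared on a matrix small enough to check by hand. In this R-syntax sketch, m and group are hypothetical stand-ins for liver.gt and liver.exper.

```r
# A 4 x 3 matrix filled column by column, and a grouping of its rows
m <- matrix(1:12, nrow = 4, ncol = 3)
group <- c("A", "A", "B", "B")

# apply: second argument 1 means rows, 2 means columns
row_means <- apply(m, 1, mean)               # 5 6 7 8
col_sums <- apply(m, 2, sum)                 # 10 26 42

# tapply: summarise the first column separately within each group
group_means <- tapply(m[, 1], group, mean)   # A: 1.5, B: 3.5

# lapply: one result per list component; the matrix is treated as 12 values
meds <- lapply(list(v = c(3, 1, 2), m = m), median)   # v: 2, m: 6.5
```

As with liver.list, the median of the matrix component is taken over all of its entries, not row by row or column by column.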

2.4 Factors

There is a data type called the factor which is normally used to hold a categorical variable,for example

citizen <- factor(c("uk", "us", "no", "in", "es", "in"))

Some functions to use with factors are levels, table, as.numeric, etc. For example, type

table(citizen)

The command levels(citizen) produces the output

[1] "es" "in" "no" "uk" "us"

To combine or merge some categories, first issue the levels command and note which categories are present. Then use the levels command again to combine the categories. To merge uk and us, issue

levels(citizen) <- c("es", "in", "no", "uk", "uk")

The combined level will be called uk. By default, S-Plus keeps the levels in alphabetical order. To change this, we need to give the levels explicitly when the factor is declared. For example,

citizen <- factor(c("uk", "us", "no", "in", "es", "in"), levels=c("us", "uk",

"es", "no", "in", "fr"))

will keep the levels in the input order and also add a category fr. Now the command levels(citizen) produces the output

[1] "us" "uk" "es" "no" "in" "fr"

The function cut can be used to categorise a continuous variable. For example:

age <- c(2, 4, 60, 45, 18, 35, 30, 65, 3)

agroup <- cut(age, breaks=3, labels=c("child", "mid-age", "old"))

agroup <- as.factor(agroup)

u <- data.frame(agroup, age)
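The effect of merging levels and of cut can be checked with table. The sketch below uses R syntax and R's behaviour, in which breaks=3 splits the range of age into three intervals of equal width; the cut points chosen by S-Plus may differ slightly, so treat the counts as an illustration.

```r
citizen <- factor(c("uk", "us", "no", "in", "es", "in"))

# Assigning the same label to two levels merges them
levels(citizen) <- c("es", "in", "no", "uk", "uk")
counts <- table(citizen)

age <- c(2, 4, 60, 45, 18, 35, 30, 65, 3)

# Three equal-width intervals over the range of age (2 to 65)
agroup <- cut(age, breaks = 3, labels = c("child", "mid-age", "old"))
group_counts <- table(agroup)

print(counts)
print(group_counts)
```

With equal-width intervals the cut points fall near 23 and 44, so an 18 year old is labelled child. Passing an explicit vector of cut points to breaks gives more sensible groups.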

2.5 Other useful things to know

• When calling a function, the arguments can be placed in any order provided that they are explicitly named. Any unnamed argument passed to a function is assigned to the first variable which has not yet been assigned. Any arguments which have defaults do not need to be specified. For example, consider the function qnorm which gives the quantiles of the normal distribution. We see that the order of arguments is p, mean and sd.

qnorm(0.95, mean=-2.0, sd=3.0)

qnorm(0.95, sd=3.0, mean=-2.0)

qnorm(mean=-2.0, sd=3.0, 0.95)

all have the same effect.

• When working with standard distributions (normal, gamma, t, Weibull, exponential, chi-squared, uniform, ...) S-Plus provides useful inbuilt functions to calculate density values, probabilities and quantiles, and to generate random samples from the distributions. The S-Plus help section on “Probability Distributions and Random Numbers” has further details.
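Both points above can be illustrated with the normal distribution, whose density, probability, quantile and random-number functions are dnorm, pnorm, qnorm and rnorm. The following is a sketch in R syntax; the object names are illustrative.

```r
# Named arguments may appear in any order; the unnamed 0.95 is matched to p
q1 <- qnorm(0.95, mean = -2, sd = 3)
q2 <- qnorm(0.95, sd = 3, mean = -2)
q3 <- qnorm(mean = -2, sd = 3, 0.95)

# pnorm is the inverse of qnorm: it recovers the probability 0.95
p <- pnorm(q1, mean = -2, sd = 3)

# dnorm gives density values; at 0 the standard normal density is 1/sqrt(2*pi)
d <- dnorm(0)

# rnorm generates a random sample (the seed makes the sketch repeatable)
set.seed(42)
r <- rnorm(5, mean = -2, sd = 3)
```

The other distributions follow the same naming pattern, for example dgamma, pgamma, qgamma and rgamma for the gamma distribution.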


2.6 Matrix operations

There are a number of functions in S-Plus which make matrix manipulation easy. For example, if X is a p × q matrix and Y is a q × r matrix then

Z <- X %*% Y

assigns to Z the matrix product of X and Y. Similarly, if y is a vector of q components, X %*% y returns a vector of p components, the result of multiplying vector y by matrix X.

If X and Y are matrices with the same number of rows, then cbind(X,Y) returns the matrix formed by merging the two matrices together, with the columns of X preceding those of Y. Similarly rbind(X,Y) operates on matrices with the same number of columns. One or more of the arguments to cbind or rbind may be vectors (provided the dimensions are appropriate).

The function diag may be used to examine or amend the diagonal components of a matrix. For example, the following creates a 3 × 3 identity matrix.

I <- matrix(0,nrow=3,ncol=3)

diag(I) <- 1
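The dimension rules for these operations can be confirmed on small matrices. In this R-syntax sketch all the objects are made-up examples.

```r
X <- matrix(1:6, nrow = 2, ncol = 3)     # 2 x 3
Y <- matrix(1:12, nrow = 3, ncol = 4)    # 3 x 4

Z <- X %*% Y                 # matrix product: 2 x 4
v <- c(1, 0, -1)
Xv <- X %*% v                # matrix times vector: 2 components

W <- cbind(X, c(10, 20))     # append a column: 2 x 4
R <- rbind(X, 7:9)           # append a row: 3 x 3

# Identity matrix built by amending the diagonal of a zero matrix
I3 <- matrix(0, nrow = 3, ncol = 3)
diag(I3) <- 1
```

The result of X %*% v is returned as a one-column matrix, so as.vector may be needed if a plain vector is wanted.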

Other matrix functions which are useful are t, which returns the matrix transpose, and solve, which returns the matrix inverse. As an exercise in the use of these functions, recall that the least squares estimates for the general linear regression model

$E(Y) = X\beta$

where X is the design matrix, are given by

$\hat{\beta} = (X^T X)^{-1} X^T y$

Use the S-Plus functions solve and t, respectively, to invert and transpose a matrix. For the liver data, a simple linear regression of any response, with number of cells as a single explanatory variable, has design matrix given by

X <- cbind(1,liver.cells)

y <- liver$ARL # $ is used to access the ARL member of liver

When the response variable is the outcome for lobe ARL, calculate the least squares estimates of this regression directly, and compare with the results obtained using the linear model function lm.
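The same calculation can be rehearsed on simulated data before attempting it with the liver data. The sketch below uses R syntax; the variables x and y are simulated and hypothetical, so it shows the technique rather than the answer to the exercise.

```r
set.seed(123)

# Hypothetical data from a straight-line model with intercept 1 and slope 2
n <- 20
x <- runif(n, 0, 10)
y <- 1 + 2 * x + rnorm(n)

# Design matrix: a column of ones for the intercept, then the covariate
X <- cbind(1, x)

# Least squares estimates computed directly from the matrix formula
beta_hat <- solve(t(X) %*% X) %*% t(X) %*% y

# The same model fitted with lm, for comparison
fit <- lm(y ~ x)
print(cbind(matrix = as.vector(beta_hat), lm = coef(fit)))
```

The two columns should agree to rounding error. Calling solve with two arguments, solve(t(X) %*% X, t(X) %*% y), solves the normal equations without forming the inverse explicitly and is usually preferable numerically.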


2.7 Manipulating data frames

Suppose that we have the following two data sets, which we want to match-merge by subject.

Data set rando           Data set results
Subject   Treatment      Subject   Response
1         A              1         81
2         B              2         73
3         A              3         75
4         B              4         61
5         B              5         77
6         A              6         67

We issue the following three commands.

rando <- data.frame(sub=1:6, trt=c("A", "B", "A", "B", "B", "A"))

results <- data.frame(sub=1:6, res=c(81, 73, 75, 61, 77, 67))

dat <- merge(rando, results, by=1)

The by=1 option says merge by the first column of both data sets. More complicated merges can also be performed; see help(merge) for details.

Data sets can be merged using the menu bar and dialog boxes as well. Use CTRL+Click to highlight non-adjacent columns.

If we want to add another column which gives the sex of the subject, for example, we can simply type it in the data window. See also the Insert menu. We can also use the following commands.

sex <- c("F", "F", "M", "M", "F", "M")

newdat <- data.frame(sex, dat)

This inserts sex as the first column. We can change its position using the following command.

newdat <- data.frame(dat[,1], sex, dat[,2:3])
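One point worth checking is that merge matches rows on the value of the key column and not on row position. The following R-syntax sketch reuses the rando and results values from above but deliberately stores results in reverse order; that reordering is an assumption made purely for illustration.

```r
rando <- data.frame(sub = 1:6, trt = c("A", "B", "A", "B", "B", "A"))

# Same subject/response pairs as in the table above, in reverse row order
results <- data.frame(sub = c(6, 5, 4, 3, 2, 1),
                      res = c(67, 77, 61, 75, 73, 81))

# Merge on the column named sub
dat <- merge(rando, results, by = "sub")
print(dat)

# Subject 1 still receives response 81 despite the reordering
res1 <- dat$res[dat$sub == 1]
```

Naming the key column with by="sub" is a little more robust than by=1, since it does not depend on the column order of either data set.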


2.8 Exercises

The following exercises involve the use of S-Plus to manipulate data.

1. Create a column in the liver data frame containing the minimum value of (ARL, PRL, PPC, AC) for each individual and a column containing the maximum value for each individual. Hence create a column containing the range for each row, and calculate the minimum, maximum, range and the median.

2. Create a column in the liver data frame containing the value 1 for observations whose minimum value across (ARL, PRL, PPC, AC) is ARL and 0 otherwise. Now modify this column to take the values 1, 2, 3 and 4, indicating which of (ARL, PRL, PPC, AC) takes the minimum value [if more than one, return the higher value. Alternatively you might try and return the value 5 when more than one of (ARL, PRL, PPC, AC) takes the minimum value - more tricky].

3. Using a single line of S-Plus code, calculate the range of values for each lobe (ARL, PRL, PPC, AC).

4. Create a new data frame with 26 rows, where data for each pair of replicate tissue samples appears in the same row. For each experiment (A,B,C) calculate the mean absolute difference between replicates. [Hint: the matrix function may be useful here]

2.9 Solutions to Exercises 2.8

1. The command apply, suitably used, gets the minimum and maximum. For example,

apply(liver[, 3:6], 1, min)

will get the minimum. However, the output of the command is an array, and if we were to put the output in a new column in the liver data frame, it would transform liver into a read-only data frame, which could be problematic. That is why we use the coercion function as.vector on the output of apply. This simply forces the result to be of vector type. So we issue the commands:

liver <- data.frame(liver.cells, liver.exper, liver.gt, liver.section)

liver$mins <- as.vector(apply(liver[, 3:6], 1, min))

liver$maxs <- as.vector(apply(liver[, 3:6], 1, max))

liver$range <- liver$maxs - liver$mins

liver$meds <- as.vector(apply(liver[, 3:6], 1, median))

2. We first create a column of zeros and then change the values of that column to be 1 for observations whose minimum value is the ARL value. We issue:

liver$newcol <- rep(0, 52)

liver$newcol[liver$ARL==liver$mins] <- 1

Now we use the same technique to create a column with values 2, 3 and 4.

liver$newcol[liver$PRL==liver$mins] <- 2

liver$newcol[liver$PPC==liver$mins] <- 3

liver$newcol[liver$AC==liver$mins] <- 4

As it says, the final modification is a bit tricky. We need to create four columns of differences between the original four columns of data and the minimum column and then count the number of zeros in each row. If there is more than one zero then we assign the value 5.

u <- liver[,3:6] - liver$mins # This creates the differences

u[u>0] <- 1 # Positive differences get the value 1

x <- apply(u, 1, sum) # Counts the number of positive differences

liver$newcol[x<3] <- 5 # Multiple minima if less than 3 positive differences

3. apply(liver[,3:6],2,max)-apply(liver[,3:6],2,min)

4. We first create two data sets for two sections and then combine those into one. Then we create the columns of differences. Finally, we get the answers using the tapply function repeatedly.

livsec1 <- liver[liver$liver.section=="1", ]

livsec2 <- liver[liver$liver.section=="2", ]

livnew <- data.frame(livsec1, livsec2)

livnew[,15:18] <- abs(livnew[, 3:6] - livnew[,10:13])

tapply(livnew[,15], livnew[,2], mean)

tapply(livnew[,16], livnew[,2], mean)

tapply(livnew[,17], livnew[,2], mean)

tapply(livnew[,18], livnew[,2], mean)

The last four commands can be given using a single command aggregate, which is very useful in getting summaries of data cross-classified by more than one factor.

aggregate(livnew[,15:18], livnew[,2], mean)

See the help file for aggregate.data.frame. Using the aggregate function, find the mean of each liver.section by liver.exper combination in the liver dataset.
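The grouping idea behind tapply can be tried on toy data before applying it to the liver data frame. The sketch below uses invented values and invented names (absdiff, exper, d); it is not the liver data. It shows one tapply call, and then sapply used to repeat the same tapply over several columns. It is written to run in R and is expected to behave similarly in S-Plus.

```r
# Toy data (invented): absolute differences for six replicate pairs
absdiff <- c(2, 4, 1, 3, 6, 8)
exper <- c("A", "A", "B", "B", "C", "C")

# Mean of absdiff within each level of exper
group.means <- tapply(absdiff, exper, mean)
print(group.means)

# The same summary for several columns at once:
# sapply runs one tapply call per column of d
d <- data.frame(v1 = absdiff, v2 = 10 * absdiff)
all.means <- sapply(d, function(column) tapply(column, exper, mean))
print(all.means)
```

The result of the sapply call is a matrix with one row per group and one column per variable, which is the same layout that four separate tapply calls would give if bound together.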

Chapter 3

Writing scripts and functions

3.1 Scripts

Frequently, we want to perform the same sequence of S-Plus commands repeatedly. We can make life easier by putting such a sequence in a Script file, just as we did in the butterfly example.

To open a Script window in S-Plus choose File→New→Script File. At the end of your S-Plus session you can save any Script files for future use. They can then be reopened in future sessions using File→Open.

Any sequence of commands may be typed into a Script file. The commands are then executed by clicking on the run > icon to the left of the menu bar. Any output of the commands in the script appears in the lower part of the Script window, unless you have specified an alternative destination using Options→Text output routing.

If you highlight a subset of commands in a Script before clicking on the run > icon, then just the highlighted commands are executed.

Scripts are particularly useful if an analysis involves repeating the same command, or similar commands where slight modifications are required. However, suppose that we have a (maybe long) sequence of commands, which we are required to perform on different data objects. This is likely to involve many modifications to the script, each time we change the data object. In such situations it is usually more convenient to write a function.

3.2 Functions

The ease and flexibility of writing functions in S-Plus makes statistical programming particularly convenient. Statistical calculations which cannot be done by ‘off-the-shelf’ routines in S-Plus or other packages can usually be handled in a straightforward fashion by writing your own S-Plus function.

As a simple example to get us started, we note that there is no intrinsic function in S-Plus for calculating the interquartile range of a set of observations. Here, we will create one, called iqr. The prototype for functions is the following.

function()

{}

Any arguments to the function are specified between the first set of parentheses (), and the S code which the function executes is written between { and }. For example, to create an interquartile range function we require

iqr <- function(x)

{lowerq <- quantile(x,0.25)

upperq <- quantile(x,0.75)

iqrange <- upperq-lowerq

return(iqrange)

}

Now iqr(x=a) will give the inter-quartile range of the data contained in S-Plus object a. The function returns the value specified by return. As iqr takes only one argument, iqr(a) also works.

Once created, a function can always be modified by editing it in the script window. For example, we can include comments, error handling, and improve the output of our interquartile range function as follows:

iqr <- function(x)

{# This function returns the interquartile range

if(!is.numeric(x)) stop("The argument to iqr must be numeric")

quartiles <- quantile(x, c(0.25, 0.75))

iqrange <- quartiles[2] - quartiles[1]

names(iqrange) <- NULL

return(iqrange)

}

[You can also use warning instead of stop, to indicate potential errors - warning prints its argument, but does not stop the function].

Note that functions appear in the object explorer, but that assignments within a function have no effect on the database, e.g. after running iqr no object called quartiles appears. This is usually convenient, as the intermediate objects created while computing a function are rarely of interest in themselves. However, if you do want to create them on the database you need to use <<- when you assign them, e.g.

quartiles <<- quantile(x, c(0.25, 0.75))
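The local nature of ordinary assignment inside a function can be checked directly. The following sketch repeats the iqr function from above, calls it on a small vector (the vector 1:9 is just an invented test input), and then asks whether an object called quartiles exists afterwards. It is written to run in R, where the default quantile definition is assumed; S-Plus should behave in the same way as far as scoping is concerned.

```r
iqr <- function(x)
{
    # This function returns the interquartile range
    if(!is.numeric(x)) stop("The argument to iqr must be numeric")
    quartiles <- quantile(x, c(0.25, 0.75))   # local to the function
    iqrange <- quartiles[2] - quartiles[1]
    names(iqrange) <- NULL
    return(iqrange)
}

a <- 1:9
result <- iqr(a)
print(result)

# quartiles was assigned with <- inside iqr, so it is not visible here
print(exists("quartiles"))
```

Replacing <- by <<- in the quartiles line would leave a copy of quartiles behind after the call, which is occasionally useful for debugging but is best avoided in finished functions.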

3.3 Example: Confidence intervals for a probability

Now let us consider a different example. Suppose we have n observations of a variable X, and we are interested in the probability p that X takes a particular value, or range of values. How would we write a function to estimate p and, even better, calculate a confidence interval for p? The first thing to do is to transform X to a new binary variable (call it Y, say) which takes the value 1 if X takes the specified value, or range of values, and 0 otherwise. Then

$$\sum_{i=1}^{n} Y_i = n\bar{Y} \sim \mathrm{Binomial}(n, p),$$

so $E(n\bar{Y}) = np$ and $\bar{Y}$ is an unbiased estimate for p. Furthermore, approximately for large n,

$$n\bar{Y} \sim \mathrm{Normal}(np,\; np(1-p))$$

which implies that

$$P\left(-c \le \frac{\bar{Y} - p}{\sqrt{p(1-p)/n}} \le c\right) = 1 - \alpha \tag{3.1}$$

where c is the 1 − α/2 quantile of the standard normal distribution, qnorm(1 − α/2). The function qnorm returns quantiles of the normal distribution.

Approximating p by its estimator $\bar{Y}$ in $\sqrt{p(1-p)/n}$, we obtain $\bar{Y} \pm c\sqrt{\bar{Y}(1-\bar{Y})/n}$ as the endpoints of an approximate 1 − α confidence interval for p.

Now we write an S-Plus function which takes two arguments, a binary vector y containing 1s and 0s and a confidence level confcoef (= 1 − α) taking values between 0 and 1. We give confcoef a default value of 0.95. Arguments with default values do not need to be specified when the function is called, provided you are happy with the default value. Put error-handling statements in the function to force a stop if inappropriate arguments are specified for y or confcoef.

confint <- function(y, confcoef = 0.95)

{

c <- qnorm(1 - (1 - confcoef)/2)

n <- length(y)

ybar <- mean(y)

lower <- ybar - c * sqrt((ybar * (1 - ybar))/n)

upper <- ybar + c * sqrt((ybar * (1 - ybar))/n)

return(c(lower, upper))

}

confint1 <- function(y, confcoef = 0.95)

{

# This function returns an approximate

# confidence interval for a probability

if(!is.numeric(y)) stop("The argument to confint must be numeric")

z <- sort(unique(y))

if( length(z)!=2 ) stop("Data are not binary\n")

z <- z - c(0, 1)

if ( sum(z^2) > 0 ) stop("Data are not 0, 1\n")

if ( (confcoef>1) || (confcoef<0) ) stop("Confidence level outside range\n")

c <- qnorm(1 - (1 - confcoef)/2)

n <- length(y)

ybar <- mean(y)

lower <- ybar - c * sqrt((ybar * (1 - ybar))/n)

upper <- ybar + c * sqrt((ybar * (1 - ybar))/n)

return(c(lower, upper))

}

Now use your function to calculate a 95% confidence interval for the probability that a child who has undergone spinal surgery suffers from the post-operative condition, Kyphosis. (Use the inbuilt data set kyphosis mentioned earlier.) Do

y <- as.numeric(kyphosis$Kyphosis) - 1

You should get the interval (0.121, 0.299). Similarly, calculate a 95% confidence interval for the probability that a child has more than 7 vertebrae operated on.

y <- rep(0, length(kyphosis$Number))

y[kyphosis$Number>7] <- 1

You should get the interval (−0.0091, 0.058). What about a 99% confidence interval?
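For the 99% question, only the confcoef argument needs to change. The sketch below is self-contained so that it can be run without the kyphosis data: it uses a made-up binary vector with 17 ones among 81 observations, chosen because that sample proportion reproduces the 95% interval (0.121, 0.299) quoted above. That these are the counts in the kyphosis data is an assumption. The function name wald.interval is also invented here, to avoid clashing with R's own confint; the formula is the same as in confint above.

```r
# Same formula as confint above; renamed to avoid clashing with R's own confint
wald.interval <- function(y, confcoef = 0.95)
{
    z <- qnorm(1 - (1 - confcoef)/2)
    n <- length(y)
    ybar <- mean(y)
    half <- z * sqrt((ybar * (1 - ybar))/n)   # half-width of the interval
    return(c(ybar - half, ybar + half))
}

# Hypothetical binary data: 17 ones among 81 observations
y <- c(rep(1, 17), rep(0, 64))

ci95 <- wald.interval(y)
ci99 <- wald.interval(y, confcoef = 0.99)
print(round(ci95, 3))
print(round(ci99, 3))
```

The 99% interval is centred at the same point as the 95% interval but is wider, because the normal quantile grows from about 1.96 to about 2.58.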

3.4 Exercise

As you can see, the problem with this approach is that it can return confidence intervals with endpoints outside the allowed range for the probability concerned. A more sophisticated approach is based on observing that (3.1) can be written as

$$P\left(\frac{(\bar{Y} - p)^2}{p(1-p)/n} \le c^2\right) = 1 - \alpha \tag{3.2}$$

The inequality in (3.2) can be written as a quadratic inequality in p

$$p^2(1 + c^2/n) - p(2\bar{Y} + c^2/n) + \bar{Y}^2 \le 0$$

and hence the endpoints of the confidence interval can be written as the solutions to the quadratic equation

$$p^2(1 + c^2/n) - p(2\bar{Y} + c^2/n) + \bar{Y}^2 = 0.$$

Modify your function to calculate the alternative confidence interval obtained by solving this quadratic. How does this affect the two confidence intervals for the Kyphosis data?

3.5 Solutions to Exercises 3.4

confint.quadratic <- function(y, confcoef = 0.95) {

c <- qnorm(1 - (1 - confcoef)/2)

n <- length(y)

ybar <- mean(y)

a <- 1 + (c * c)/n

b <- -1 * (2 * ybar + (c * c)/n)

c <- ybar * ybar

lower <- ( - b - sqrt(b * b - 4 * a * c))/(2 * a)

upper <- ( - b + sqrt(b * b - 4 * a * c))/(2 * a)

return(c(lower, upper))

}
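The effect of the quadratic interval is easiest to see on a rare event. The sketch below repeats the calculation of confint.quadratic with two cosmetic changes that are choices made for this example only: the function is called quad.interval, and the normal quantile is stored in z so that it is not overwritten by the constant term of the quadratic. The data are a made-up binary vector with 2 ones among 81 observations, which gives the same sample proportion as the interval (−0.0091, 0.058) discussed in Section 3.3.

```r
# Same calculation as confint.quadratic, with the quantile kept in its own
# variable (z) so that it is not overwritten by the constant term
quad.interval <- function(y, confcoef = 0.95)
{
    z <- qnorm(1 - (1 - confcoef)/2)
    n <- length(y)
    ybar <- mean(y)
    a <- 1 + (z * z)/n
    b <- -1 * (2 * ybar + (z * z)/n)
    const <- ybar * ybar
    disc <- sqrt(b * b - 4 * a * const)
    return(c((-b - disc)/(2 * a), (-b + disc)/(2 * a)))
}

# Hypothetical rare-event data: 2 ones among 81 observations
y <- c(rep(1, 2), rep(0, 79))
ci <- quad.interval(y)
print(round(ci, 4))
```

Both endpoints now lie inside (0, 1), and the interval is no longer symmetric about the sample proportion.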

3.6 Further Exercise - Investigating Coverage by Simulation

[This is more tricky; if you can do this you are well on your way to becoming proficient at using S-Plus].

The performance of the two confidence intervals can be tested by simulation. We can simulate a large number of datasets, using a known value of p, and investigate what proportion of confidence intervals contain the (known) true value. This proportion is known as the coverage of the confidence interval, and should be close to the confidence level.

The best way to perform the simulation is to write a function to do it. Try to write a function which takes three arguments: the sample size n, the probability p, and the number of simulations m, and returns the coverage of the two confidence intervals.

You can use the function sample for simulation, and it will be best to arrange your simulated samples in an n×m matrix, and use apply to calculate the confidence intervals.

Give m the default value 1000, and investigate the coverage of the two intervals for a range of values of n and p.
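One possible shape for such a function is sketched below; it is not the official solution. The name coverage, the extra confcoef argument and the seed are choices made for this example. It uses apply only to obtain the column means and then computes both sets of interval endpoints with vectorised arithmetic, which is a small departure from the hint above. It is written to run in R.

```r
coverage <- function(n, p, m = 1000, confcoef = 0.95)
{
    z <- qnorm(1 - (1 - confcoef)/2)
    # n x m matrix: each column is one simulated sample of 0s and 1s
    sims <- matrix(sample(c(0, 1), n * m, replace = TRUE, prob = c(1 - p, p)),
                   nrow = n, ncol = m)
    ybar <- apply(sims, 2, mean)

    # Simple interval: ybar +/- z * sqrt(ybar (1 - ybar) / n)
    half <- z * sqrt(ybar * (1 - ybar)/n)
    simple <- (ybar - half <= p) & (p <= ybar + half)

    # Quadratic interval: roots of a p^2 + b p + ybar^2 = 0
    a <- 1 + z^2/n
    b <- -(2 * ybar + z^2/n)
    disc <- sqrt(b^2 - 4 * a * ybar^2)
    quad <- ((-b - disc)/(2 * a) <= p) & (p <= (-b + disc)/(2 * a))

    # Proportion of the m intervals that contain the true p
    return(c(simple = mean(simple), quadratic = mean(quad)))
}

set.seed(1)   # arbitrary seed, only so that the run is repeatable
print(coverage(n = 30, p = 0.1))
```

Calling the function over a grid of n and p values, for instance small n with p near 0 or 1, is where the two intervals are expected to differ most.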

3.7 Conditional execution

Often, we want a statistical program to ‘branch’, that is we want the program to perform different operations, depending on the values of certain variables in the program. We have already seen examples of this, in the error handling facilities which we have put into our programs so far. For example, the statement

if(!is.numeric(x)) stop("The argument to iqr must be numeric")

considers the logical value !is.numeric(x) [which takes the value T if x is not numeric (the ! indicates ‘not’ here) and the value F if x is numeric] and only performs the next operation (stopping the program) if that logical value is T, i.e. if x is not numeric. More generally, the syntax is

if(<variable with single logical value>)

{<sequence of commands to be performed if variable takes the value T>

} else {<sequence of commands to be performed if variable takes the value F>

}

The braces {} can be omitted where the sequence consists of a single command, and it is not necessary for else { ... } to be specified at all, if it is not required.
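A minimal sketch of this syntax, using an invented helper called describe, shows an if without else, an if with braces, and a single-command if ... else written on one line. It is written to run in R and should behave similarly in S-Plus.

```r
# Hypothetical helper: describes a single number using if ... else
describe <- function(x)
{
    if(!is.numeric(x)) stop("x must be numeric")   # if without else
    if(x < 0) {
        label <- "negative"
    } else {
        # single commands, so no braces are needed here
        if(x == 0) label <- "zero" else label <- "positive"
    }
    return(label)
}

print(describe(-2))
print(describe(0))
print(describe(3.5))
```

Note that the condition must be a single logical value; passing a vector of several numbers to describe would not test each element separately.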

For example, the function confint2

confint2 <- function(y, confcoef = 0.95)

{

# This function returns an approximate

# confidence interval for a probability

if(!is.numeric(y)) stop("The argument to confint must be numeric")

z <- sort(unique(y))

if( length(z)!=2 ) stop("Data are not binary\n")

z <- z - c(0, 1)

if ( sum(z^2) > 0 ) stop("Data are not 0, 1\n")

if ( (confcoef>1) || (confcoef<0) ) stop("Confidence level outside range\n")

c <- qnorm(1 - (1 - confcoef)/2)

n <- length(y)

ybar <- mean(y)

if (n>50 & ybar<0.8 & ybar>0.2) {

lower <- ybar - c * sqrt((ybar * (1 - ybar))/n)

upper <- ybar + c * sqrt((ybar * (1 - ybar))/n)

} else {

a <- 1 + (c * c)/n

b <- -1 * (2 * ybar + (c * c)/n)

c <- ybar * ybar

lower <- ( - b - sqrt(b * b - 4 * a * c))/(2 * a)

upper <- ( - b + sqrt(b * b - 4 * a * c))/(2 * a)

}

return(c(lower, upper))

}

returns the ‘easy’ binomial confidence interval when it is likely to be accurate, and the ‘more complex’ one otherwise.

3.8 Loops

If we want to execute a command, or sequence of commands, repeatedly, perhaps across a range of different inputs, we can use a loop. The general syntax is

for (variable in sequence) {

statements

}

while (condition is true) {

statements

}

repeat {

statements

}

As usual, the braces are not required if there is only one statement in the loop. For the ‘for’ loop an example of ‘variable in sequence’ is ‘i in 1:6’. The repeat loop can be broken by the break statement. Thus the repeat loop must contain a statement like

if (condition is true) break

For example, to create a vector x containing the squares of the numbers 1, 3, 5, 7 we might type

x <- vector()

for (i in c(1,3,5,7)) x <- c(x,i*i)

3.9 Avoiding the loops

Looping in S-Plus is slow and should be avoided if possible. Fortunately, the facility to perform vectorised operations in S-Plus means that loops can usually be avoided. For example, the loop above can be avoided by the direct calculation

x <- c(1,3,5,7)*c(1,3,5,7)

The functions apply, tapply and sapply can be used to avoid loops. For example, suppose we wanted to find the column (or row) sums of a matrix. We can do the following:

z <- matrix(1:12, 3, 4)

csum <- apply(z, 2, sum) # 2 for column

rsum <- apply(z, 1, sum) # 1 for row
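To see that apply really does replace a loop, the sketch below computes the column sums of the same matrix twice, once with an explicit for loop and once with apply. The name csum.loop is introduced only for this comparison.

```r
z <- matrix(1:12, 3, 4)

# Column sums with an explicit loop
csum.loop <- numeric(ncol(z))
for (j in 1:ncol(z)) csum.loop[j] <- sum(z[, j])

# The same result without a loop
csum <- apply(z, 2, sum)   # 2 for column
rsum <- apply(z, 1, sum)   # 1 for row

print(csum.loop)
print(csum)
print(rsum)
```

Since matrix fills column by column, the first column holds 1, 2, 3, so the column sums are 6, 15, 24, 33 and the row sums are 22, 26, 30.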

3.10 Using the loops

One place where looping cannot be avoided is in iterative computation where the calculations performed each time the loop is executed depend on the outcomes of the previous execution of the loop. For example, suppose that we need to evaluate the first 20 terms of the standard Fibonacci sequence ($x_1 = 0$, $x_2 = 1$, $x_i = x_{i-1} + x_{i-2}$ thereafter). In S-Plus this can be achieved by

x <- c(0,1)

for (i in 3:20) {

y <- x[i-1]+x[i-2]

x <- c(x,y)

}
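A common variant of the loop above, sketched here as an alternative rather than a correction, creates the whole vector first and then fills it in, instead of growing it with c() at every step. The name nterms is introduced only for this example.

```r
nterms <- 20
x <- numeric(nterms)   # vector of zeros, so x[1] = 0 already
x[2] <- 1
for (i in 3:nterms) x[i] <- x[i-1] + x[i-2]
print(x)
```

For 20 terms the difference is negligible, but for long sequences filling a pre-allocated vector avoids copying the vector on every iteration.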

We illustrate the repeat and while loops using the following example. Suppose we have i.i.d. data $y_1, \ldots, y_n$ from the truncated Poisson model which has the probability function:

$$\Pr(Y = y) = \frac{e^{-\lambda}\lambda^{y}}{(1 - e^{-\lambda})\, y!}, \quad y = 1, 2, \ldots$$

We can write the likelihood function as the product of the above with y replaced by $y_i$, $i = 1, \ldots, n$. Taking the log of the likelihood function, differentiating, and setting the derivative equal to zero gives us the likelihood equation:

$$\lambda - \bar{y}(1 - e^{-\lambda}) = 0.$$

This equation has no closed-form solution, so we try Newton’s method of iterative solution.

Newton’s method is as follows. To solve f(x) = 0, use the Taylor series:

$$0 = f(x) \approx f(a) + (x - a) f'(a),$$

which gives,

$$x = a - \frac{f(a)}{f'(a)}.$$

Thus if a is a guess for the solution, then we refine it by the above equation.
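The refinement step can be wrapped in a small general-purpose function. The sketch below is one way to do this; the function name newton, its arguments f, fprime, tol and itmax, and the test equation x² − 2 = 0 are all choices made for this illustration. It is written to run in R.

```r
newton <- function(f, fprime, a, tol = 1e-8, itmax = 50)
{
    for (it in 1:itmax) {
        del <- f(a)/fprime(a)    # Newton increment f(a)/f'(a)
        a <- a - del             # refined guess
        if (abs(del) < tol) return(a)
    }
    stop("Maximum number of iterations exceeded")
}

# Example: solve x^2 - 2 = 0 starting from the guess a = 1
root <- newton(function(x) x^2 - 2, function(x) 2 * x, a = 1)
print(root)
```

The answer should agree with sqrt(2) to many decimal places after only a handful of iterations, which illustrates how quickly the method converges from a reasonable starting value.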

In our example, we have x = λ and can choose $\bar{y}$ as the initial guess for λ. Each of the following two routines finds the mle using simulated data. The function rpois has been used to simulate from the Poisson distribution.

trunc.poisson.repeat <- function(itmax=10)

{

yp <- rpois(50, lam = 1) # rpois generates Poisson r.v.s

u <- table(yp)

print(u)

y <- yp[yp > 0]

ybar <- mean(y)

print(ybar)

lam <- ybar #initial value

it <- 0 # iteration number

repeat {

top <- lam - ybar * (1 - exp( - lam))

bot <- 1 - ybar * exp( - lam)

del <- top/bot # this is the increment

lam <- lam - del

cat(it, lam, "\n")

if (abs(del) < 0.001)

break

it <- it + 1

if ( it > itmax) {

cat("Maximum number of iterations exceeded\n")

stop("Increase itmax")

}

} # This ends the repeat loop

cat("End of program, lambda=", round(lam, 4), "\n")

}

trunc.poisson.while <- function(itmax=10)

{

yp <- rpois(50, lam = 1)

u <- table(yp)

print(u)

y <- yp[yp > 0]

ybar <- mean(y)

print(ybar)

lam <- ybar

it <- 0

del <- 1 #this is required since the condition is checked

# at the beginning of the loop

while(abs(del) > 0.001 && (it <- it + 1) < itmax)

{

top <- lam - ybar * (1 - exp( - lam))

bot <- 1 - ybar * exp( - lam)

del <- top/bot

lam <- lam - del

cat(it, lam, "\n")

}

cat("End of program, lambda=", round(lam, 4), "\n")

}
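Because both routines above simulate their own data, their output changes from run to run. For checking purposes it can help to separate the iteration from the simulation. The sketch below does this with an invented function trunc.poisson.mle that takes the sample mean of the positive counts as its argument; the value 1.6 is made up. It assumes the mean is greater than 1, since otherwise the likelihood equation has no positive solution (λ = 0 always satisfies the equation trivially).

```r
# Solves lam - ybar * (1 - exp(-lam)) = 0 for a given sample mean ybar > 1
trunc.poisson.mle <- function(ybar, tol = 1e-8, itmax = 50)
{
    lam <- ybar     # initial value
    it <- 0         # iteration number
    del <- 1        # needed because the condition is checked first
    while (abs(del) > tol && (it <- it + 1) <= itmax) {
        top <- lam - ybar * (1 - exp(-lam))
        bot <- 1 - ybar * exp(-lam)
        del <- top/bot
        lam <- lam - del
    }
    return(lam)
}

ybar <- 1.6   # hypothetical mean of the positive counts
lam.hat <- trunc.poisson.mle(ybar)
print(lam.hat)
print(lam.hat - ybar * (1 - exp(-lam.hat)))   # residual of the likelihood equation
```

The second printed value should be essentially zero, confirming that the returned value solves the likelihood equation.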

3.11 Exercise - Maximising a nonstandard likelihood

One area where iterative computation is frequently required in Statistics is to find maximum likelihood estimates when the maximum cannot be calculated analytically. Consider the following famous example from Statistical Genetics. A particular Genetic model gives the probabilities for 4 genotypes, AB, Ab, aB and ab, as $\frac{1}{4}(2 + \theta)$, $\frac{1}{4}(1 - \theta)$, $\frac{1}{4}(1 - \theta)$ and $\frac{1}{4}\theta$ respectively, where θ is the square of the recombination fraction. In a sample of 197 animals, the observed frequencies of the four genotypes were (125, 18, 20, 34). The appropriate likelihood here is multinomial

$$L(\theta) \propto \left[\tfrac{1}{4}(2 + \theta)\right]^{125} \left[\tfrac{1}{4}(1 - \theta)\right]^{18} \left[\tfrac{1}{4}(1 - \theta)\right]^{20} \left[\tfrac{1}{4}\theta\right]^{34} \tag{3.3}$$

with log-likelihood

l(θ) = −197 log 4 + 125 log(2 + θ) + 38 log(1− θ) + 34 log θ

To maximise this, we need to solve

$$\frac{\partial l}{\partial \theta} = \frac{125}{2 + \theta} - \frac{38}{1 - \theta} + \frac{34}{\theta} = 0 \tag{3.4}$$

In fact this equation can be written as a quadratic in θ, so the exact solution is available analytically. However, we will proceed as if there is no available exact solution to (3.4), as is the case with many maximum likelihood problems. In such examples, we use a numerical solution. The Newton-Raphson method starts from an arbitrary initial guess, $\theta_0$, of the solution to an equation of the form f(θ) = 0 and iterates to provide a sequence of updates $\theta_1, \theta_2, \ldots$ which converge to the true solution (provided the function is not too badly behaved). Each update is obtained from the value before by

$$\theta_{i+1} = \theta_i - \frac{f(\theta_i)}{f'(\theta_i)}$$

where f′ is the first derivative of f. Hence, for the genetic linkage example, f is given by (3.4) and f′ by its derivative

$$f'(\theta) = -\frac{125}{(2 + \theta)^2} - \frac{38}{(1 - \theta)^2} - \frac{34}{\theta^2}.$$

Now, you have all the information you need to write an S-Plus function to calculate the maximum likelihood estimate of θ using the Newton-Raphson method. Your function should take two arguments: an initial guess at the solution, and the number of iterations you want to perform. The function should return the final value of θ (the maximum likelihood estimate). Edit your function, so that it prints out the value of θ and of l(θ) at each iteration and stops when the value of θ changes by less than 0.0001.

3.12 Solution to Exercises 3.11

Code for the Genetic example

Remember that we had to solve the equation:

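$$f(\theta) = \frac{125}{2 + \theta} - \frac{38}{1 - \theta} + \frac{34}{\theta} = 0.$$

The derivative of this function is:

$$f'(\theta) = -\frac{125}{(2 + \theta)^2} - \frac{38}{(1 - \theta)^2} - \frac{34}{\theta^2}.$$

To find the exact solution, note that the equation can be written as:

$$\begin{aligned}
& \frac{125}{2 + \theta} - \frac{38}{1 - \theta} + \frac{34}{\theta} = 0 \\
\Longrightarrow\; & 125\,\theta(1 - \theta) - 38\,\theta(2 + \theta) + 34\,(2 + \theta)(1 - \theta) = 0 \\
\Longrightarrow\; & -197\,\theta^2 + 15\,\theta + 68 = 0.
\end{aligned}$$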

Now we write the following code:

genetic <- function(theta=0.5, itmax=10)

{

it <- 0 # iteration number

repeat {

top <- 125/(2+theta) - 38/(1-theta) + 34/theta

bot <- -125/(2+theta)^2 - 38/(1-theta)^2 - 34/theta^2

del <- top/bot # this is the increment

theta <- theta - del

cat(it, theta, "\n")

if (abs(del) < 0.000001)

break

it <- it + 1

if ( it > itmax) {

cat("Maximum number of iterations exceeded\n")

stop("Increase itmax")

}

} # This ends the repeat loop

cat("Theta from iterative solution=", round(theta, 5), "\n")

a <- -197

b <- 15

c <- 68

u <- sqrt(b^2 - 4 * a*c)

root1 <- (-b -u)/(2*a)

root2 <- (-b+u)/(2*a)

if (root1 <1.0 & root1>0.0) root <- root1

else root <- root2

cat("Exact value of theta=", round(root, 5), "\n")

}
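The exercise also asks for the log-likelihood to be printed at each iteration and for the estimate to be returned. A variant along those lines is sketched below. The names loglik and genetic.mle and the tol argument are introduced here and are not part of the solution above; the stopping rule uses the threshold 0.0001 stated in the exercise. The closed-form root of the quadratic is computed alongside as a check.

```r
# Log-likelihood l(theta) for the genetic linkage data
loglik <- function(theta)
    125 * log(2 + theta) + 38 * log(1 - theta) + 34 * log(theta) - 197 * log(4)

genetic.mle <- function(theta = 0.5, itmax = 20, tol = 0.0001)
{
    for (it in 1:itmax) {
        top <- 125/(2 + theta) - 38/(1 - theta) + 34/theta
        bot <- -125/(2 + theta)^2 - 38/(1 - theta)^2 - 34/theta^2
        del <- top/bot                        # Newton-Raphson increment
        theta <- theta - del
        cat(it, theta, loglik(theta), "\n")   # iteration, theta, l(theta)
        if (abs(del) < tol) return(theta)
    }
    stop("Increase itmax")
}

theta.hat <- genetic.mle()
# positive root of 197 theta^2 - 15 theta - 68 = 0
exact <- (15 + sqrt(15^2 + 4 * 197 * 68))/(2 * 197)
print(c(theta.hat, exact))
```

The two printed values should agree to at least four decimal places, and the printed log-likelihood should increase from one iteration to the next as θ approaches the maximum.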

Chapter 4

Graphics

4.1 Simple plots

S-Plus is good at producing high quality graphical output for reports and publications. In particular, S-Plus has great flexibility for allowing the user to create the graphical output they want by setting parameters appropriately.

Recall the data frame liver which we created in Chapter 1, using

liver <- data.frame(liver.cells, liver.exper, liver.gt, liver.section)

To investigate how the count of GT(+) colonies for lobe ARL depends on the number of cells injected at the start of the experiment we might type

plot(liver[,1],liver$ARL)

Note that by default the function plot produces a scatterplot of the specified points. Alternative types of plot are available by specifying the argument type. For example type="l" joins consecutive points by a line, without plotting them, while type="b" and type="o" plot points and join them with lines. These types of plot are not appropriate here, but we shall use them later.

4.2 Graphical parameters

A nicer plot can be obtained by

plot(liver[,1],liver$ARL,pch=".",xlab="Number of cells injected",

ylab="GT(+) count for lobe ARL")

The arguments pch, xlab and ylab are just a few of the many graphical parameters which can be supplied to a graphics function in S-Plus, to give the plot the properties that you require. Type ?par to see the others. Graphical parameters may be specified each time a plot is used, or they may be set for all plots, by using the function par. For example

par(mfrow=c(2,2),cex=0.7,las=1)

sets future plots to be displayed 4-to-a-page in a 2×2 array, reduces the font size for any character strings on the graphs (axis labelling for example) to be 70% of the default size, and forces tick labels to be output horizontally (rather than parallel to the axis).

plot(liver[,1],liver$ARL,pch=".",ylab="GT(+) count for lobe ARL",

xlab="Number of cells injected")

plot(liver[,1],liver$PRL,pch=".",ylab="GT(+) count for lobe PRL",

xlab="Number of cells injected")

plot(liver[,1],liver$PPC,pch=".",ylab="GT(+) count for lobe PPC",

xlab="Number of cells injected")

plot(liver[,1],liver$AC,pch=".",ylab="GT(+) count for lobe AC",

xlab="Number of cells injected")

mtext("GT(+) count v. number of cells injected, for all lobes",outer=T,cex=1.5)

The above commands exhibit the relationship between the count of GT(+) colonies and the number of cells injected at the start of the experiment, for all lobes. The function mtext adds text to the margins of a plot (the function text adds text to the body of the plot). The argument outer=T makes sure that the text is placed in the outer margin of the page of plots; without this, the text would have been placed in the margin of the most recent (bottom right) plot. The function title works like mtext but is less flexible.

4.3 Adding material to the body of a plot

The most informative graphics can often be obtained by adding material to the body of a plot. For example, consider the in-built data matrix state.x77 whose columns contain measurements of several variables for the 50 states of the USA (based on information available in 1975). The corresponding state names are in state.name. To obtain a plot of life expectancy against illiteracy, type

par(mfrow=c(1,1)) # reverts to one plot per page

plot(state.x77[,3],state.x77[,4],ylab="Life expectancy (years)",

xlab="Illiteracy")

Our first reaction to the plot might be that there are two states which seem to be somewhat different from the others. To identify these states we can type

identify(state.x77[,3],state.x77[,4],n=2,plot=F)

and click on the two points you want to identify. The argument n=2 specifies that two points should be identified. The indices of the two identified observations are returned. More usefully,

state.name[identify(state.x77[,3],state.x77[,4],n=2,plot=F)]

returns the names of the two states (Hawaii and Nevada). Note that by giving the argument plot=F to identify, no addition has been made to the plot. The default (plot=T) adds the indices of the identified points to the plot, but if the argument labels=state.name is added, as in

identify(state.x77[,3],state.x77[,4],labels=state.name,n=2)

then the two state names are added to the plot, which is more informative. In general, text can be added to a plot using the function text, which adds specified text at specified coordinates. For example,

text(2,73.5,"Hawaii")

would be an alternative way of adding the text to the plot. If you want to add text interactively, choosing the position by a mouse-click, then replacing the specified coordinates with locator(1) achieves this, as in

text(locator(1),"Hawaii")

We may also want to add points and lines to plots. The function points adds points at specified coordinates. As with plot, the plotting character pch can be specified, and the type argument allows the points to be joined by lines. Alternatively, the function lines adds to the plot the lines between a series of specified coordinates (similar to points with type="l"). Different types of line can be obtained by setting the parameter lty. The default lty=1 produces solid lines, with higher values of lty producing a variety of dotted and dashed lines. Finally, the function abline adds a straight line with specified slope and intercept to the current plot. For example,

abline(lm(state.x77[,4]~state.x77[,3])$coef)

adds the straight line whose slope and intercept are given by a linear regression of life expectancy on illiteracy. We can produce a similar line for the regression with the ‘outliers’ (Hawaii and Nevada) removed by

abline(lm(state.x77[-c(11,28),4]~state.x77[-c(11,28),3])$coef,lty=3)
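The two fitted lines can also be compared numerically, without any plotting. The sketch below assumes that state.x77 and state.name are available with the same row and column order as described above, as they are in the datasets shipped with R. The names illit, lifeexp and out are introduced for readability only.

```r
illit <- state.x77[, 3]     # Illiteracy
lifeexp <- state.x77[, 4]   # Life expectancy
out <- c(11, 28)            # rows to be left out

# Confirm which states rows 11 and 28 are
print(state.name[out])

# Intercept and slope with all states, and with the two states removed
coef.all <- lm(lifeexp ~ illit)$coef
coef.sub <- lm(lifeexp[-out] ~ illit[-out])$coef
print(coef.all)
print(coef.sub)
```

Comparing the two slopes shows how much influence the two unusual states have on the fitted relationship.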

Now we can give the plot an informative title and legend.

title("Life expectancy v. Illiteracy for US states")

legend(1.5,72.8,lty=c(1,3),legend=c("Regression","without Hawaii and Nevada"))

does the trick, putting the top left hand corner of the legend box at the specified coordinates (alternatively locator(1) could be used to interactively identify the position). Note that the function legend can be used to describe different line types, shading types, plotting characters etc., depending on the type of plot.

4.4 Plotting functions

We now have everything we need in order to plot graphs of any functions we might be interested in. For example:

x <- seq(-pi,pi,length=1000)

plot(x,sin(x),type="l")

plots a graph of sin x against x for values of x ranging from −π to π. Often, it is interesting to plot a graph of a density function, which can easily be achieved, as S-Plus has a wide range of densities as inbuilt functions (see help on “Probability Distributions and Random Numbers”). For example, to plot the density of the Cauchy ($t_1$) distribution we can use

x <- seq(-15,15,length=1000)

plot(x,dt(x,df=1),type="l")

For density plots the default axes are not entirely appropriate. The y-axis is generally uninformative and is usually better omitted, while the x-axis should appear exactly at y = 0 so we can see where the density is becoming negligible. To achieve these two effects, we need to set appropriate parameters in the plot function. Try

plot(x,dt(x,df=1),type="l",yaxt="n",yaxs="s",xaxs="s",ylab="",bty="n")

Type ?par to see what these parameters are doing.

Sometimes, it is instructive to display multiple functions on the same plot. Suppose that we were interested in the comparison between the $t_1$ (Cauchy) and standard normal density functions. We could add the standard normal density function to the plot above using

z <- seq(-5,5,length=1000)

lines(z,dnorm(z),lty=3)

The normal density cannot be completely plotted in the region available, which was determined on the basis of the original (Cauchy) plot. A similar problem arises if we try to do the normal plot first and then superimpose the Cauchy density. The solution is to set up a null plot first, over a region which is large enough to accommodate both plots:

ymin <- min(dt(x,df=1),dnorm(z))

ymax <- max(dt(x,df=1),dnorm(z))

plot(c(min(x,z),max(x,z)),c(ymin,ymax),type="n",yaxt="n",yaxs="s",xaxs="s",

xlab="x",ylab="",bty="n")

This sets up a plotting region big enough to incorporate the minimum and maximum values of x, z and the corresponding density values. The density curves can then be drawn using

lines(x,dt(x,df=1))

lines(z,dnorm(z),lty=3)
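The same null-plot approach can be written a little more compactly with the function range, which returns the minimum and maximum of all its arguments in one call. The sketch below is an alternative arrangement of the commands above, not a replacement for them. The legend position and labels are choices made for this example, and the axis-style parameters used earlier have been left out for brevity. It is written to run in R.

```r
x <- seq(-15, 15, length = 1000)
z <- seq(-5, 5, length = 1000)

# Limits that cover both curves
xlim <- range(x, z)
ylim <- range(dt(x, df = 1), dnorm(z))

# Null plot over the combined region, then the two densities
plot(xlim, ylim, type = "n", xlab = "x", ylab = "", yaxt = "n", bty = "n")
lines(x, dt(x, df = 1))
lines(z, dnorm(z), lty = 3)
legend(xlim[1], ylim[2], lty = c(1, 3), legend = c("Cauchy", "Standard normal"))
```

Because the limits are computed from the data being plotted, the same commands continue to work if the grids x and z or the densities are changed.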

Functions of two variables may be plotted using a contour plot (contour) or a perspective or 3-dimensional plot (persp), but we don’t have time to explore these functions here.

4.5 A Few Tips

1. The command graphsheet() brings up a new graphics window.

2. The command legend together with locator(1) places the legend interactively. See the tplots2 function.

3. Graphs can be saved in several different formats. From a graph window open the File menu and then the export graph button.

4.6 Bar Plot

The bar plot command expects a matrix or vector as its required argument. Output from the cross-tabulation table command is suitable. The argument matrix or vector provides the heights (positive or negative) of the bars. If height is a matrix, each column represents one bar; the values in the columns are treated as heights of blocks. Blocks of positive height are stacked above the zero line and those with negative height are stacked below the line.

The default S-Plus bar plot command barplot is very basic. My simplified version is called mybplot.

mybplot <- function(w, ...)

{

if(is.matrix(w)) {

nr <- length(w[, 1])

ang <- seq(from = 25, to = 135, length = nr)

color <- seq(from = 2, by = 2, length = nr)

den <- (1:nr) * 4

barplot(w, angle = ang, col = color, density =

den, legend = dimnames(w)[[1]], names =

as.character(dimnames(w)[[2]]), ...)

}

else barplot(w, names = dimnames(w), ...)

}

Now issue the two commands.

y <- table(liver.exper, liver.section)

mybplot(y)

Try barplot(y) and you will see the difference. S-Plus has the pie diagram as well. Try x <- table(liver.exper) and then pie(x, names=dimnames(x)). The names argument puts the labels for the slices.
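The labels used for the legend, bars and slices all come from the dimnames of the table. A minimal sketch follows, using made-up vectors exper and section in place of liver.exper and liver.section; it is written with R in mind.

```r
# Hypothetical stand-ins for liver.exper and liver.section.
exper   <- c("A", "A", "B", "B", "B", "C")
section <- c(1, 2, 1, 1, 2, 2)

y <- table(exper, section)   # two-way table of counts: a matrix
print(y)
print(dimnames(y)[[1]])      # row labels, used for the legend in mybplot
print(dimnames(y)[[2]])      # column labels, used as the bar names

x <- table(exper)            # one-way table: one count per level
print(x)
```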



4.7 Pairwise scatter plot

A pairwise scatter plot can be obtained using the pairs command. For example, issue pairs(liver.gt). You have seen the data set liver.gt before.

4.8 Histogram and qqplot

Histograms are easy to draw. Just try hist(liver.gt[,1]) or try x <- rnorm(10000) and hist(x). An optional argument nclass=n, where n is the number of classes, can be given. Assuming the vector x to be the data, try hist(x, nclass=20), for example. Unequal break points can be given using the breaks option, which expects a vector of break points. Other graphical parameters, e.g. col, xlab etc., can be given. For example try hist(x, nclass=20, xlab="My x", col=5).
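The breaks option can be illustrated with a few made-up numbers. In this sketch the data x and the break points brk are hypothetical; cut and table are used only to show how many observations fall in each class. It is written with R in mind.

```r
# Hypothetical data and unequal break points.
x   <- c(0.5, 1.2, 1.7, 2.4, 3.1, 4.8, 7.5, 9.9)
brk <- c(0, 1, 2, 5, 10)

counts <- table(cut(x, breaks = brk))   # observations in (0,1], (1,2], (2,5], (5,10]
print(counts)

hist(x, breaks = brk, xlab = "My x")    # histogram with the same classes
```

With unequal classes it is the area of each bar, not its height, that should be compared across classes.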

To draw a normal qqplot try qqnorm(x). To check graphically whether x and y have the same distribution try qqplot(x, y).
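If two samples come from the same distribution, the points of qqplot(x, y) should lie close to the line y = x. The sketch below compares a normal sample with a heavy-tailed one; the sample sizes and the seed are arbitrary choices, and the use of the returned value relies on R's version of qqplot.

```r
set.seed(1)              # arbitrary seed, so the sketch is reproducible
x <- rnorm(200)          # standard normal sample
y <- rt(200, df = 1)     # heavy-tailed sample (t with 1 degree of freedom)

qqnorm(x)                # should look roughly like a straight line

qq <- qqplot(x, y)       # in R the plotted coordinates are also returned
abline(0, 1)             # reference line y = x; the tails bend away from it
```

For samples of equal size the plotted points are simply the sorted x values against the sorted y values.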

4.9 Boxplot

These are very interesting plots obtained using the command boxplot. Try boxplot(liver.gt[,1]). Many boxplots on the same graphsheet provide valuable information. The default boxplot shows the median, a box spanning the first and third quartiles, and whiskers drawn from each quartile to the most extreme observation that lies within 1.5 times the inter-quartile range of that quartile. Points beyond the two whiskers are suspected outliers and are drawn individually.
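The whisker rule can be worked through by hand. This is a sketch on made-up data; boxplot itself may compute the quartiles in a slightly different way from quantile, so the limits can differ marginally, although the whiskers and the flagged point agree for this example in R.

```r
# Hypothetical data with one unusually large value.
x <- c(1, 2, 3, 4, 5, 6, 7, 8, 9, 30)

q   <- as.numeric(quantile(x, c(0.25, 0.75)))   # first and third quartiles
iqr <- q[2] - q[1]                              # inter-quartile range

lower.limit <- q[1] - 1.5 * iqr    # whiskers may not go beyond these limits
upper.limit <- q[2] + 1.5 * iqr

lower.whisker <- min(x[x >= lower.limit])   # most extreme points inside the limits
upper.whisker <- max(x[x <= upper.limit])
outliers <- x[x < lower.limit | x > upper.limit]

print(c(lower.whisker, upper.whisker))
print(outliers)

boxplot(x)    # the value 30 is drawn as an individual point
```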

Consider the Kyphosis data set. Suppose we want to see, using graphs only, which of the three predictors Age, Number and Start predicts the probability of Kyphosis being present. We can do the following:

par(mfrow=c(2,2))

plot.factor(kyphosis)

The command plot.factor(liver[,2:6]) draws some factor plots of the liver data. You may want to issue the command par(mfrow=c(1,1)) afterwards to return to one plot per page.
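A related display can be built by hand with split and boxplot. This is only a sketch on made-up data, not a reproduction of what plot.factor does; the vectors present, age and start are hypothetical, and the code is written with R in mind.

```r
# Hypothetical response (a factor) and two numeric predictors.
present <- factor(c("absent", "absent", "present", "absent", "present", "present"))
age     <- c(10, 25, 80, 30, 95, 120)
start   <- c(14, 12, 5, 13, 6, 3)

par(mfrow = c(1, 2))             # two plots side by side
boxplot(split(age, present))     # one box of age for each level of present
boxplot(split(start, present))   # one box of start for each level of present
par(mfrow = c(1, 1))             # back to one plot per page

grp.median <- sapply(split(age, present), median)   # median age in each group
print(grp.median)
```

A predictor whose boxes barely overlap between the two groups is a promising one.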


4.10 Three Dimensional Plots

We can use perspective plots to draw three-dimensional surfaces.

Try the command threedimplot(), threedimplot(rho=-0.8) etc.

There is a clever mechanism employed here. The function outer evaluates the function dbvnorm at every pair x[i] and y[j] and returns the values as a matrix, called z in our program. It also passes the parameter rho on to the dbvnorm function.

threedimplot <- function(rho=0.8, n=50) {
  x <- seq(from=-4, to=4, length=n)        # grid of n points on each axis
  y <- x
  z <- outer(x, y, FUN=dbvnorm, rho=rho)   # n by n matrix of density values
  persp(x, y, z)
  cat("change rho and try again\n")
}

# Standard bivariate normal density with correlation rho
dbvnorm <- function(x, y, rho) {
  0.5 / (pi * sqrt(1-rho^2) ) * exp(-0.5*(x^2+y^2 - 2.0 * rho * x * y)/(1-rho^2))
}
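The way outer fills the matrix is easiest to see on a tiny example. In this sketch the function f and its extra argument a are made up for illustration; any function that works elementwise on vectors, as dbvnorm does, can be used in the same way.

```r
# Hypothetical function of two variables with one extra parameter.
f <- function(x, y, a) a * x + y

x <- 1:3
y <- c(10, 20)

z <- outer(x, y, FUN = f, a = 2)   # z[i, j] is f(x[i], y[j], a = 2)
print(z)
```

The result has one row for each element of x and one column for each element of y, which is exactly the layout that persp expects.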

4.11 Exercises

1. Do a factor plot of the data in fuel.frame and provide comments. See the help file ?fuel.frame to see what these data are.

2. Write a program which is equivalent to threedimplot but which uses two for loops to do the computations that are done by the function outer. Now compare the speeds of the two versions of threedimplot. Try n=100, 200 etc.

Solutions

1. Interpret after issuing the commands:

z <- fuel.frame

plot.factor(z)


2. A slow version of threedimplot.

threedimplot2 <- function(rho=0.8, n=50) {
  x <- seq(from=-4, to=4, length=n)
  y <- x
  z <- matrix(NA, nrow=n, ncol=n)
  for (i in 1:n) {
    for (j in 1:n) {
      z[i, j] <- dbvnorm(x[i], y[j], rho)
    }
  }
  persp(x, y, z)
  cat("change something and try again\n")
}
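One possible way to compare the speeds is to time only the computation of z, leaving out the plotting. The sketch below assumes R, where the timer is called system.time; in S-Plus the timing function may have a different name. The function dens repeats the formula of dbvnorm so that the sketch is self-contained.

```r
# Same formula as dbvnorm, repeated here so the sketch runs on its own.
dens <- function(x, y, rho) {
  0.5 / (pi * sqrt(1 - rho^2)) *
    exp(-0.5 * (x^2 + y^2 - 2 * rho * x * y) / (1 - rho^2))
}

n <- 200
x <- seq(from = -4, to = 4, length = n)
y <- x

# Version 1: a single vectorised call through outer.
t1 <- system.time(z1 <- outer(x, y, FUN = dens, rho = 0.8))

# Version 2: two nested for loops, one call of dens per grid point.
t2 <- system.time({
  z2 <- matrix(NA, nrow = n, ncol = n)
  for (i in 1:n) {
    for (j in 1:n) {
      z2[i, j] <- dens(x[i], y[j], 0.8)
    }
  }
})

print(t1)
print(t2)
print(max(abs(z1 - z2)))   # the two versions should agree
```

The timings depend on the machine, so no particular numbers are quoted here; increase n to make the difference easier to see.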