
Exploratory data analysis and graphics (formerly “lab for EMD book chapter 2”)

©2011 Ben Bolker

January 4, 2011

Licensed under the Creative Commons attribution-noncommercial license (http://creativecommons.org/licenses/by-nc/3.0/). Please share & remix noncommercially, mentioning its origin.

This lab was originally written to accompany chapter 2 of Ecological Models and Data, but should (?) be reasonably self-contained. You will need to have the latest version of R (2.12.1) and either have internet access or have the appropriate packages installed and .csv files downloaded (see below) beforehand. If you have your own data set handy, you can work through this lab reading in and making exploratory plots of your own data along the lines of those shown here.

1 Cleaning, reshaping, and reading in data

1.1 Finding files

Note: the next few commands are examples! You should not actually try to execute any of the commands until we get to reading in duncan_10m.csv . . .

The basic commands for reading data into R are read.table and read.csv. If your data are already in the format R likes (a whitespace- or comma-separated text file) then importing your data may be as easy as read.table("mydata.txt",header=TRUE).

However, if R responds to your command with an error like

Error in file(file, "r") : unable to open connection

In addition: Warning message: cannot open file 'mydata.txt'

it means it can’t find your file, probably because it isn’t looking in the right place. By default, R’s working directory is the directory in which the R program starts up, which is (again by default) something like C:/Program Files/R/R-x.y.z/bin on Windows (where x.y.z is the version number). (R uses / as the [operating-system-independent] separator between directories in a file path.)

To let R know where your data files are located, you have a few choices:

• Spell out the path, or file location, explicitly. (Use a single forward slash to separate folders, e.g. "c:/Users/bolker/My Documents/R/script.R": this works on all platforms.)

• Use filename=file.choose(), which will bring up a dialog box to let you choose the file and set filename. (This only works on Windows and MacOS.)

• Use menu commands to change your working directory to wherever the files are located: (Windows) File/Change dir or (MacOS) Misc/Change Working Directory (or Apple-D).

• Change your working directory to wherever the file(s) are located using the setwd (set working directory) function, e.g. setwd("c:/temp")

Changing your working directory is more efficient in the long run, if you save all the script and data files for a particular project in the same directory and switch to that directory when you start work.

Using the “change directory” commands from the menus is the simplest way to change the working directory for the duration of your R session. While you could just throw everything on your desktop, it’s good to get in the habit of setting up a separate working directory for different projects, so that your data files, metadata files, R script files, and so forth, for a particular project are all in the same place. If you’re working on a shared machine it may be useful to put your working directory on a USB or network drive.
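As a minimal sketch of the setwd approach, the following creates a throwaway project directory, switches to it, reads a file by its bare name, and switches back. The directory name seed_project and the file mydata.csv are made up for illustration; on your own machine you would point proj_dir at a real folder.

```r
## hypothetical project directory, created in R's temporary folder
proj_dir <- file.path(tempdir(), "seed_project")
dir.create(proj_dir, showWarnings = FALSE)

## write a small made-up data file into it
write.csv(data.frame(x = 1:3, y = c(2.5, 3.1, 4.7)),
          file = file.path(proj_dir, "mydata.csv"), row.names = FALSE)

old_wd <- getwd()                 # remember the current working directory
setwd(proj_dir)                   # switch to the project directory
found <- file.exists("mydata.csv")  # TRUE: the bare file name now resolves
mydata <- read.csv("mydata.csv")
setwd(old_wd)                     # switch back when done
```

file.path builds the path with forward slashes, so the same code should work on Windows, MacOS, and Linux.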

Depending on how you have gotten your data files onto your system (e.g. by downloading them from the web), Windows will sometimes hide or otherwise screw up the extension of your file (e.g. adding .txt to a file called mydata.dat). R needs to know the full name of the file, including the extension.

1.2 Checking the number of fields

The next potential problem is that R needs every line of your data file to have the same number of fields (variables) [there are ways to read irregular data into R, but it’s a bit trickier]. You may get an error like:

Error in read.table(file = file, header = header,

sep = sep, quote = quote, :

more columns than column names

or

Error in scan(file = file, what = what,

sep = sep, quote = quote, dec = dec, :

line 1 did not have 5 elements

If you need to check on the number of fields that R thinks you have on each line, use

> count.fields("myfile.dat",sep=",")

(you can omit the sep="," argument if you have whitespace- rather than comma-delimited data). If you are checking a long data file you can try

> cf <- count.fields("myfile.dat",sep=",")

> which(cf!=cf[1])

to get the line numbers with numbers of fields different from the first line.

By default read.csv (but not read.table) will try to fill in what it sees as missing fields with NA (“not available”) values; this can be useful but can also hide errors. You can try

> mydata <- read.csv("myfile.dat",fill=FALSE)

to turn off this behavior; if you don’t have any missing fields at the end of lines in your data this should work.
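To see these checks in action without touching your own files, here is a small sketch that writes a deliberately ragged comma-separated file to a temporary location (the contents are made up) and then inspects it:

```r
## made-up file: the third line has only two fields
tmp <- tempfile(fileext = ".csv")
writeLines(c("a,b,c",
             "1,2,3",
             "4,5",
             "6,7,8"), tmp)

cf <- count.fields(tmp, sep = ",")   # number of fields on each line
bad <- which(cf != cf[1])            # lines that differ from the first line

filled <- read.csv(tmp)              # default fill=TRUE pads the short line with NA
strict <- try(read.csv(tmp, fill = FALSE), silent = TRUE)
inherits(strict, "try-error")        # with fill=FALSE the short line is an error
```

Here bad points at line 3 of the file, which is row 2 of the data once the header has been removed.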

1.3 Checking data

The quickest way to check that all your variables have been classified correctly:

> sapply(data,class)

(this applies the class command, which identifies the type of a variable, to each column in your data).

Non-numeric missing-value strings (such as a star, *) will also make R misclassify a numeric column as non-numeric. Use na.strings in your read.table command:


> mydata <- read.table("mydata.dat",na.strings="*")

(you can specify more than one value with, e.g., na.strings=c("*","***","bad", "-9999")).
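A quick way to convince yourself of the effect of na.strings is to read the same small table twice. This sketch uses the text argument of read.table (available in current versions of R) with made-up data, so no file is needed:

```r
## made-up whitespace-separated data with a star as the missing-value code
txt <- "site count
A 3
B *
C 7"

d_bad <- read.table(text = txt, header = TRUE)
sapply(d_bad, class)    # count is not numeric, because of the star

d_ok <- read.table(text = txt, header = TRUE, na.strings = "*")
sapply(d_ok, class)     # count is now integer, with an NA in row 2
```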

The data you are about to read in are from a study by Duncan and Duncan (2000) on the disappearance of seeds (presumably removal by “seed predators”, animals such as rodents or birds that consume seeds) from experimental stations set up along two transects at different distances from the edge of a forest patch. The data are in long format with information specifying the date, species, station (location along a transect), and number of seeds remaining for each observation. If you have the emdbook package installed, help("SeedPred",package="emdbook") will provide slightly more detailed information.

The seed removal data were originally stored in two separate Excel files, one for the 10 m transect and one for the 25 m transect. After a couple of preliminary errors I decided to include na.strings="?" (to turn question marks into NAs). (There is a # character in the column names; if you were using read.table instead of read.csv you would need comment.char="" to deal with this, or you could edit the csv or Excel file to remove it.)

> dat_10 = read.csv("duncan_10m.csv",na.strings="?")

> dat_25 = read.csv("duncan_25m.csv",na.strings="?")

R doesn’t like column names that begin with numbers (because they are not valid variable names in R) and adds an X in front of them. It also uses dots to replace spaces, dashes, and other characters that don’t belong in variable names, so (e.g.) a column headed “27-Jan” will be named X27.Jan instead. You can use check.names=FALSE to turn off this behavior, but some things (such as accessing columns with $) won’t work in this case.
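The renaming is easy to see on a tiny made-up table with date-style column headings (again read from a string rather than a file):

```r
## made-up data whose column headings start with digits and contain dashes
txt <- "station,27-Jan,3-Feb
1,5,4
2,5,3"

d1 <- read.csv(text = txt)
names(d1)    # "station" "X27.Jan" "X3.Feb"

d2 <- read.csv(text = txt, check.names = FALSE)
names(d2)    # "station" "27-Jan" "3-Feb"

## with the original names kept, indexing by name in brackets still works
d2[["27-Jan"]]
```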

1.4 Accessing data

To access individual variables within your data set use mydata$varname or mydata[,n] or mydata[,"varname"] where n is the column number and varname is the variable name you want. You can also use attach(mydata) to set things up so that you can refer to the variable names alone (e.g. varname rather than mydata$varname). However, BEWARE: if you then modify a variable, you can end up with two copies of it: one (modified) is a local variable called varname, the other (original) is a column in the data frame called varname: it’s probably better not to attach a data set until after you’ve finished cleaning and modifying it. Furthermore, if you have already created a variable called varname, R will find it before it finds the version of varname that is part of your data set. Attaching multiple copies of a data set is a good way to get confused: try to remember to detach(mydata) when you’re done.
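The “two copies” problem can be reproduced in a few lines. This is only an illustration on a made-up data frame, using the same placeholder names (mydata, varname) as the text:

```r
mydata <- data.frame(varname = 1:3)   # made-up data

attach(mydata)
varname <- varname * 2   # creates a NEW variable in the workspace ...
varname                  # 2 4 6
mydata$varname           # ... while the column is still 1 2 3

detach(mydata)           # clean up: detach the data frame
rm(varname)              # and remove the stray copy
```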

To access a data set called dataset that is built in to R or included in an R package, say (for example)

> library(emdbook)

> data(DamselSettlement)

(data() by itself will list all available data sets; data(package="emdbook") lists the data sets available in the emdbook package.)

1.5 Packages (reminder)

The sizeplot function I used for Figure 2 in the chapter requires an add-on package ([unfortunately] the command for loading a package is library!). To use an additional package it must be (i) installed on your machine (with install.packages or through the menu system) and (ii) loaded in your current R session (with library).

If you haven’t already installed the plotrix package in a previous session:

> install.packages("plotrix")

Then (even if you’ve already installed the package):

> library(plotrix)

You must both install (with install.packages) and load (with library) a package before you can use or get help on its functions, although help.search will list functions in packages that are installed but not yet loaded. You only have to install the package once on a particular machine, but you have to load the package again in every R session where you want to use it.

For this session, you will need the packages chron, gdata, gplots, gtools, plotrix, plyr, reshape, rgl, and scatterplot3d installed. (If you have the emdbook package installed (and loaded using library(emdbook)) and have a network connection, you should be able to type get.emdbook.packages() to install all of these packages, and a few more, automatically.)

You should get the ggplot2 package too, and all of the packages it depends on:

> install.packages("ggplot2",dependencies=TRUE)


1.6 Checking and cleaning up data

The summary and str commands are useful for looking at the structure of data sets. They originally showed that I had some extra columns and rows: row 160 of dat_10, and columns 40–44 of dat_25, were junk. I could have gotten rid of them this way:

> dat_10 = dat_10[1:159,]

> dat_25 = dat_25[,1:39]

(I could also have used negative indices to drop specific rows/columns: dat_10[-160,] and dat_25[,-(40:44)] would have the same effect). Instead, I went back and edited the input files.
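Positive and negative indexing can be compared on a small made-up data frame with one junk row and one junk column:

```r
toy <- data.frame(a = c(1, 2, 3, NA),
                  b = c(5, 6, 7, NA),
                  junk = NA)          # made-up data: row 4 and column 3 are junk

keep_rows <- toy[1:3, ]   # keep rows 1 to 3 ...
drop_rows <- toy[-4, ]    # ... or drop row 4: same result

keep_cols <- toy[, 1:2]   # keep columns 1 and 2 ...
drop_cols <- toy[, -3]    # ... or drop column 3: same result
```

Note where the comma goes: indices before the comma refer to rows, indices after it to columns.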

Exercise 1.1 : Try out sapply(dat_10,class), head, summary, str, and View on these data; make sure you understand the results.

Note: The data are integer rather than the more general numeric; this distinction rarely makes much difference.

1.7 Reshaping data

The data are in the “wrong” (wide) format. We reshape them, specifying id.var=1:2 to preserve the first two columns, station and species, as identifier variables (you will need the reshape package installed).

First “melt” the data to long format, keeping the first two columns fixed and using date as the column name for the column of dates that results:

> library(reshape)

> dat_10_melt = melt(dat_10,id.var=1:2,variable_name="date")
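If it is not obvious what melting does, the following base-R sketch performs the same kind of wide-to-long rearrangement by hand on a tiny made-up table. It is not how melt is implemented, and the output column name value is an assumption chosen for this illustration; it is only meant to show the shape of the result.

```r
## made-up wide data: two identifier columns, then one column per date
wide <- data.frame(station = 1:2,
                   species = c("abz", "cd"),
                   X25.Mar = c(5, 5),
                   X28.Mar = c(4, 3))
id_cols   <- 1:2
date_cols <- 3:4

long <- data.frame(
  ## repeat the identifier columns once per date column
  wide[rep(seq_len(nrow(wide)), times = length(date_cols)), id_cols],
  ## record which date column each value came from
  date  = rep(names(wide)[date_cols], each = nrow(wide)),
  ## stack the date columns on top of each other
  value = unlist(wide[date_cols], use.names = FALSE),
  row.names = NULL)
long
```

Each row of long is now a single observation: one station, one species, one date, one count.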

Convert the date column to a real date format. First use paste to append 1999 to each date (sep="." separates the two pasted strings with a period):

> date_10 = paste(dat_10_melt$date,"1999",sep=".")

Then use as.Date to convert the string to a date (%d means day, %b means three-letter month abbreviation*, and %Y means four-digit year; check ?strptime for more date format details).

*If your computer is set to use a language other than English, you will need to switch the time format temporarily so that R can recognize the English month abbreviations used in this data file: Sys.setlocale("LC_TIME","English") will work on Windows, Sys.setlocale("LC_TIME","en_US.UTF-8") will (probably?) work on Linux or MacOS. Before you do this, though, you should probably use Sys.getlocale("LC_TIME") to figure out how the time format is currently set, so you can reset it (although you could always just restart R. . . ) Alternatively, you can use the chron function in the chron package, which uses hard-coded English month names/abbreviations to convert months.


> dat_10_melt$date = as.Date(date_10,format="X%d.%b.%Y")

You could also do this in one step with the transform command, which manipulates variables inside a data frame (you still need to assign the result back to the original variable (if you want to replace it) or a new variable (if not)).

> dat_10_melt <- transform(dat_10_melt,

date=as.Date(paste(date,"1999",sep="."),

format="X%d.%b.%Y"))

Finally, rename the columns.

> names(dat_10_melt) = c("station","species","date","seeds")

Do the same steps for the 25-m transect data:

> dat_25_melt = melt(dat_25,id.var=1:2,variable_name="date")

> date_25 = paste(dat_25_melt$date,"1999",sep=".")

> dat_25_melt$date = as.Date(date_25,format="X%d.%b.%Y")

> names(dat_25_melt) = c("station","species","date","seeds")
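The date conversion used for both transects can be tried on its own with a short vector of made-up column names in the same style as the data file. This sketch assumes R is running with English month abbreviations (see the footnote above):

```r
raw <- c("X25.Mar", "X28.Mar", "X4.Apr")   # made-up names, as R would mangle them

datestr <- paste(raw, "1999", sep = ".")   # "X25.Mar.1999", ...
dates <- as.Date(datestr, format = "X%d.%b.%Y")

dates          # printed in year-month-day form
class(dates)   # "Date"
```

The literal X in the format string matches the X that R prepended to each column name.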

1.8 More on reshaping

We can cast() the melted (or “molten”) data to reshape them into a different form.

Rearrange with a single row for each date and a column for each (species × station) combination (because of the experimental design, there is only one species per station, but multiple stations per species and multiple dates per station: try with(dat_10_melt,table(species,station)) to see this). Because the results of some of the following commands are long, I have hidden them (head only prints the first 6 rows of the data, but it prints every column).

> head(cast(dat_10_melt,date~species+station,value="seeds"))

One row per date and one column per species, calculating the mean by date/species category:

> head(cast(dat_10_melt,date~species,value="seeds",fun.agg=mean))

date abz cd cor dio mmu pol psd uva

1 1999-03-25 4.25 3.70 4.95 4.90 4.473684 2.75 4.75 4.45


2 1999-03-28 2.80 2.65 4.20 3.85 3.052632 1.75 4.50 4.10

3 1999-04-04 1.75 1.20 3.35 3.60 2.421053 1.45 3.95 3.40

4 1999-04-11 1.20 0.90 2.65 3.35 1.789474 1.20 3.50 3.25

5 1999-04-18 0.60 0.65 2.60 3.00 1.631579 0.65 2.85 2.85

6 1999-04-25 0.60 0.65 2.55 2.70 1.473684 0.60 2.85 2.85

One row per date, collapse all data down to a single mean and standard deviation value:

> head(cast(dat_10_melt,date~.,fun.agg=c(mean,sd),value="seeds"))

date mean sd

1 1999-03-25 4.276730 1.630065

2 1999-03-28 3.364780 2.168303

3 1999-04-04 2.641509 2.295519

4 1999-04-11 2.232704 2.270323

5 1999-04-18 1.855346 2.192728

6 1999-04-25 1.786164 2.188476

Exercise 1.2 : Use cast, or aggregate if you prefer, to compute means and standard deviations by species and date. (Hint: you should include the argument na.rm=TRUE in the cast command; it gets passed along to the mean and sd functions to tell R to ignore NA values when it computes the aggregate values.)

1.9 More on data types

While you can usually get by coding data in not quite the right way (for example, coding dates as numeric values or categorical variables as strings), R tries to “do the right thing” with your data, and it is more likely to do the right thing the more it knows about how your data are structured.

Strings instead of factors Sometimes R’s default of assigning factors is not what you want: if your strings are unique identifiers (e.g. if you have a code for observations that combines the date and location of sampling, and each location combination is only sampled once on a given date) then R’s strategy of coding unique levels as integers and then associating a label with integers will waste space and add confusion. If all of your non-numeric variables should be treated as character strings rather than factors, you can just specify as.is=TRUE; if you want specific columns to be left “as is” you can specify them by number or column name. For example, these two commands have the same result:


> tmpdata = read.csv("duncan_10m.csv",comment="",

as.is="Seed.Species")

> tmpdata = read.csv("duncan_10m.csv",comment="",as.is=2)

> sapply(tmpdata,class)[1:2]

X10m.Station.. Seed.Species

"integer" "character"

While the first line of the data file uses the name “Seed Species”, R automatically converts the space to a dot to get the column name. Use c (e.g. c("name1","name2") or c(1,3)) to specify more than one column. You can also use the colClasses argument to read.table to specify the type of each column, for example "character" for columns that should be read as character strings:

> tmpdata = read.csv("duncan_10m.csv",

na.strings="?",

colClasses=c(rep("character",2),

rep("numeric",37)))

brings in the first two columns as characters and the last 37 as numeric.

To convert factors back to strings after you have read them into R, use as.character.

> tmpdata = read.csv("duncan_10m.csv",

na.strings="?")

> tmpdata$Seed.Species = as.character(tmpdata$Seed.Species)

Factors instead of numeric values In contrast, sometimes you have numeric labels for data that are really categorical values: for example, in the seed data set the stations are listed by integer codes (often data sets will contain redundant information in them, e.g. both a species name and a species code number). It’s best to specify appropriate data types, so use colClasses to force R to treat the data as a factor. For example:

> tmpdata = read.csv("duncan_10m.csv",

na.strings="?",colClasses=c(rep("factor",2),rep("numeric",37)))

n.b.: by default, R sets the order of the factor levels alphabetically. You can find out the levels and their order in a factor f with levels(f). If you want your levels ordered in some other way (e.g. site names in order along some transect), you need to specify this explicitly. Most confusingly, R will sort strings in alphabetic order too, even if they represent numbers. This is OK:


> f = factor(1:10); levels(f)

[1] "1" "2" "3" "4" "5" "6" "7" "8" "9" "10"

but the following is not, since we explicitly tell R to treat the numbers as characters (this can happen by accident in some contexts):

> f = factor(as.character(1:10)); levels(f)

[1] "1" "10" "2" "3" "4" "5" "6" "7" "8" "9"

In a list of numbers from 1 to 10, “10” comes after “1” but before “2”! You can fix the levels by using the levels argument in factor to tell R explicitly what you want it to do, e.g.:

> f = factor(as.character(1:10),levels=1:10)

> x = c("far_north","north","middle","south")

> f = factor(x,levels=c("far_north","north","middle","south"))

so that the levels come out ordered geographically rather than alphabetically.

To put factors in the order in which they appear first in the data:

> f = factor(f,levels=unique(as.character(f)))

Sometimes your data contain a subset of integer values in a range, but you want to make sure the levels of the factor you construct include all of the values in the range, not just the ones in your data. Use levels again:

> f = factor(c(3,3,5,6,7,8,10),levels=3:10)

Finally, you may want to get rid of levels that were included in a previous factor but are no longer relevant:

> f = factor(c("a","b","c","d"))

> f2 = f[1:2]

> levels(f2)

[1] "a" "b" "c" "d"

> f2 = droplevels(f2)

> levels(f2)

[1] "a" "b"


This “re-leveling” operation leaves the remaining levels of the factor in the same order.

For more complicated recoding operations, you can use the recode function in the car package (also see the entry on factors in the R wiki, http://rwiki.sciviews.org/doku.php?id=tips:data-factors:factors).

Another useful function, especially for plotting (when legends and groupings will often occur in the order specified by the factor levels) is reorder, which reorders the levels of a factor according to some function (by default the mean) of another variable: that is, reorder(f,x) will take the mean of x for each value of f and reset the levels of f in order of increasing mean(x).
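A small made-up example shows what reorder does to the levels (the values of f and x here are invented purely for illustration):

```r
f <- factor(c("a", "b", "c", "a", "b", "c"))
x <- c(10, 1, 5, 12, 3, 5)   # group means: a = 11, b = 2, c = 5

levels(f)                    # default alphabetical order: "a" "b" "c"
f2 <- reorder(f, x)          # reorder levels by increasing mean of x
levels(f2)                   # "b" "c" "a"
```

The data values in f2 are unchanged; only the order of the levels (and hence of legends and plot groupings) differs.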

Exercise 1.3 : Illustrate the effects of the levels argument: create a factor f from the vector

> v = c(3,3,5,6,7,8,10)

with and without using levels to specify that all the numbers between 3 and 10 should be included as factor levels. Inspect f in each case, and plot the results with plot(f) (which produces a barplot of the numbers in each level). Use par(mfrow=c(1,2)) to draw the results as two side-by-side subplots. (Use par(mfrow=c(1,1)), or close the graphics window, to undo this setting.)

Dates and times Dates and times can be tricky in R, but you can (and should) generally handle your dates as type Date within R rather than messing around with Julian days (i.e., days since the beginning of the year) or maintaining separate variables for day/month/year.

The Date class only handles dates: the POSIXt class deals with date-time objects, but has several tricky aspects, in particular its handling of time zones. The chron package is friendlier. In general the rule for handling dates and times in R is that you should use the least complicated data type that can represent your data. (The zoo package can also be useful for handling time-series data, interpolating missing points, etc.)

You can use colClasses="Date" within read.table to read in dates directly from a file, but only if your dates are in four-digit-year/month/day (e.g. 2005/08/16 or 2005-08-16) format; otherwise R will either butcher your dates or complain

Error in fromchar(x) : character string is not

in a standard unambiguous format

If your dates are in another format in a single column, read them in as character strings (colClasses="character" or using as.is) and then use as.Date, which uses a very flexible format argument to convert character formats to dates:

> as.Date(c("1jan1960", "2jan1960", "31mar1960", "30jul1960"),

format="%d%b%Y")

[1] "1960-01-01" "1960-01-02" "1960-03-31" "1960-07-30"

> as.Date(c("02/27/92", "02/27/92", "01/14/92",

"02/28/92", "02/01/92"),

format="%m/%d/%y")

[1] "1992-02-27" "1992-02-27" "1992-01-14" "1992-02-28" "1992-02-01"

The most useful format codes are %m for month number, %d for day of month, %j for Julian date (day of year), %y for two-digit year (dangerous for dates before 1970!) and %Y for four-digit year; see ?strftime for many more details. (See the previous footnote for issues with month abbreviations if your version of R is not running with English time/date formats.)
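A couple of these codes are worth trying directly. The dates below are made up; the second half illustrates why two-digit years are risky, since R has to guess the century:

```r
## day/month/four-digit-year input, then day of year on output
d <- as.Date("16/08/2005", format = "%d/%m/%Y")
format(d, "%j")     # "228": 16 August is day 228 of 2005

## two-digit years: 92 is read as 1992, but 65 is read as 2065
two_digit <- as.Date(c("02/27/92", "02/27/65"), format = "%m/%d/%y")
two_digit           # "1992-02-27" "2065-02-27"
```

If your data include old dates recorded with two-digit years, it is safer to paste on the century yourself and use %Y.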

If you have your dates as separate (numeric) day, month, and year columns, you actually have to squash them together into a character format (with paste, using sep="/" to specify that the values should be separated by a slash) and then convert them to dates:

> year = c(2004,2004,2004,2005)

> month = c(10,11,12,1)

> day = c(20,18,28,17)

> datestr = paste(year,month,day,sep="/")

> (date = as.Date(datestr))

[1] "2004-10-20" "2004-11-18" "2004-12-28" "2005-01-17"

> rm(date,year,month,day,datestr) ## clean up

Although R prints the dates out so they look like a vector of character strings, they are really dates: class(date) will give you the answer "Date".
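Because they are real dates rather than strings, you can do arithmetic with them. A brief sketch, using three of the dates from the example above:

```r
d <- as.Date(c("2004-10-20", "2004-11-18", "2005-01-17"))

class(d)                  # "Date"
gaps <- diff(d)           # time differences between successive dates
as.numeric(gaps)          # 29 60 (days)
as.numeric(d - d[1])      # 0 29 89: days elapsed since the first date
d[1] + 7                  # one week after the first date
```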

Times work similarly (you have to convert them after you read in your data), but you will need to install and attach the chron package to use them.

> install.packages("chron") ## if you haven't already installed it

> library(chron)


Basic time conversion is pretty easy:

> timevec1 = c("11:00:00","11:25:30","15:30:20")

> (times1 = times(timevec1))

[1] 11:00:00 11:25:30 15:30:20

times1 looks identical to timevec1, but is in a different (and more useful) format. For example, differences (diff) work as they should. If you translate them to numeric values, they correspond to fractions of a day.

> d1 = diff(times1); d1

[1] 00:25:30 04:04:50

> as.numeric(d1)

[1] 0.01770833 0.17002315

> rm(d1) ## clean up

If you have times without seconds, you have to use paste to append :00 first:

> timevec2 = c("11:00","11:25","15:30")

> timevec2 = paste(timevec2,":00",sep="")

> times(timevec2)

[1] 11:00:00 11:25:00 15:30:00

Other traps:

• Quotation marks in character variables: if you have character strings in your data set with apostrophes or quotation marks embedded in them, you have to get R to ignore them. I used a data set recently that contained lines like this:

Western Canyon|Santa Cruz|313120N|1103145WO'Donnell Canyon

I used

> data = read.table("datafile",sep="|",quote="")


to tell R that | was the separator between fields and that it should ignore all apostrophes/single quotations/double quotations in the data set and just read them as part of a string.

• Hash/pound (#) symbols in your data set that are not intended to be comments. For example, I recently loaded a data set from Excel where some entries were #VALUE! (Excel worksheet cells involving a division by zero). R interpreted the # as signifying the start of a comment, and all subsequent variables in those rows were lost. (read.csv eliminates comment characters by using comment.char="" by default, but for read.table you will need to specify it explicitly.)
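Both traps can be reproduced with a few made-up lines read from a string. The second record and its coordinate below are invented for illustration, as is the small table containing #VALUE!:

```r
## trap 1: an apostrophe inside a field; quote="" makes R treat it as plain text
txt1 <- "Western Canyon|Santa Cruz|313120N
O'Donnell Canyon|Santa Cruz|313500N"
d1 <- read.table(text = txt1, sep = "|", quote = "")
as.character(d1$V1)   # both place names survive intact

## trap 2: a # that is data, not a comment; comment.char="" switches comments off,
## and na.strings turns the Excel error code into NA
txt2 <- "id,ratio,n
1,0.5,10
2,#VALUE!,12"
d2 <- read.table(text = txt2, sep = ",", header = TRUE,
                 comment.char = "", na.strings = "#VALUE!")
d2
```

In d2 the ratio column stays numeric and the n column is read in full, rather than being cut off at the # sign.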

1.10 Augmenting the data

We’ve finished cleaning up and reformatting the data. Now we would like to calculate some derived quantities: specifically, tcum (elapsed time from the first sample), tint (time since previous sample), taken (number removed since previous sample), and available (number available at previous sample).

For each station, we want to calculate the cumulative (elapsed) time for each observation by subtracting the first date from all the dates; the time interval by taking the difference of successive dates (with diff) and putting an NA at the beginning; the number of seeds lost by taking the negative of the difference of successive numbers of seeds; and the number of seeds available at the previous time by prepending NA and dropping the last element. Then put the new derived variables together with the original data and re-assign it.

We will write our own function now, to take advantage of a package called plyr. Later on we will be writing lots of our own functions. Here’s how it works:

> tmpf = function(x) {

tcum = as.numeric(x$date-x$date[1]) ## elapsed time since beginning

tint = as.numeric(c(NA,diff(x$date))) ## delta-t since prev. obs.

taken = c(NA,-diff(x$seeds)) ## number taken since prev. obs.

available = c(NA,x$seeds[-nrow(x)]) ## number available at prev. obs.

## combine existing data (x) with derived variables:

data.frame(x,tcum,tint,taken,available)

}

A more “magic” version of the same function computes the new variables on the fly within the data.frame function and uses with to make things more compact (so we don’t need the x$ in front of the variables):

> tmpf = function(x) {

with(x,

data.frame(x,

tcum=as.numeric(date-date[1]),

tint=as.numeric(c(NA,diff(date))),

taken=c(NA,-diff(seeds)),

available=c(NA,seeds[-nrow(x)])))

}

When you define an R function, it doesn’t do anything immediately, just creates the function for later use. Any variables that get defined inside the function are purely temporary, and disappear when the function finishes. The function returns the last expression (in this case a data frame containing the original variables plus the derived variables).

Now we use the ddply function from the plyr package to split the data up by station, run the tmpf function on the data from each station, and put it back together. (plyr is required by the reshape package, so we’ve already loaded it.)

> dat_10 = ddply(dat_10_melt,"station",tmpf)

This trick is useful whenever you have individuals or stations that have data recorded only for the first observation of the individual. In some cases you can also do these manipulations by working with the data in wide format.

Do the same for the 25-m data:

> dat_25 = ddply(dat_25_melt,"station",tmpf)
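To see the split/apply/combine idea without plyr, here is a base-R sketch of the same pattern on a small made-up data set, using split, lapply, and rbind in place of ddply. The function is the tmpf defined above, repeated so that the example stands on its own; the toy numbers are invented.

```r
## made-up long-format data: two stations, three sampling dates each
toy <- data.frame(
  station = rep(1:2, each = 3),
  date = rep(as.Date(c("1999-03-25", "1999-03-28", "1999-04-04")), 2),
  seeds = c(5, 4, 2, 5, 5, 3))

tmpf <- function(x) {
  tcum <- as.numeric(x$date - x$date[1])    # elapsed time since beginning
  tint <- as.numeric(c(NA, diff(x$date)))   # delta-t since prev. obs.
  taken <- c(NA, -diff(x$seeds))            # number taken since prev. obs.
  available <- c(NA, x$seeds[-nrow(x)])     # number available at prev. obs.
  data.frame(x, tcum, tint, taken, available)
}

pieces <- split(toy, toy$station)    # one data frame per station
pieces <- lapply(pieces, tmpf)       # add derived variables to each piece
result <- do.call(rbind, pieces)     # stick the pieces back together
result
```

The first row for each station has NA for tint, taken, and available, because there is no previous observation to compare with.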

Create new data frames with an extra column that gives the distance from the forest (data.frame automatically replicates the values in the new columns to the right length); then stick them together with rbind (“row bind”).

> dat_10 = data.frame(dat_10,dist=10)

> dat_25 = data.frame(dat_25,dist=25)

> SeedPred = rbind(dat_10,dat_25)

Convert station and distance from numeric to factors:


> SeedPred$station = factor(SeedPred$station)

> SeedPred$dist = factor(SeedPred$dist)

Alternatively:

> SeedPred = transform(SeedPred,station = factor(station),

dist = factor(dist))

Reorder columns:

> SeedPred = SeedPred[,c("station","dist","species","date","seeds",

"tcum","tint","taken","available")]

1.11 subset and merge

subset and merge are two other extremely useful data manipulation commands.

1.11.1 subset

Most of what subset does can be achieved by the matrix and data frame indexing you learned in Lab 1, but subset is more convenient and can make your code cleaner and easier to understand. For example, let’s say you want to extract the data for species abz from the SeedPred data frame: you could say

> SeedPred[SeedPred$species=="abz",]

to select all columns (the blank after the comma) of the rows for which the logical statement SeedPred$species=="abz" is true. However

> subset(SeedPred,species=="abz")

is simpler: you don’t have to include the SeedPred$ before the variable, and R assumes you want all the columns (this is the most common use: you can also use the select argument if you don’t want to keep all the columns). Note that subset() does not change the original SeedPred data frame: you have to assign the results to a variable if you want to use the subset of the data for something. Many R modeling and plotting functions have a subset argument that lets you do the subsetting on the fly, rather than creating a temporary subset variable.

It’s often useful to use the logical & (and) and | (or) operators in subsetting: to pull out the data for species abz on the 10-m transect,


> subset(SeedPred,species=="abz" & dist=="10")

(the "10" is in quotation marks because dist is a factor, not a numeric variable) or

> subset(SeedPred,(species=="abz" | species == "cd") & dist=="10")

to pull out data for two species, although

> subset(SeedPred,species %in% c("abz","cd") & dist=="10")

does the same thing and is a little bit easier to read.

One confusing aspect of subset is that it does not automatically drop unused levels: use the droplevels function (new in R 2.12.0) to do this, e.g.

> droplevels(subset(SeedPred,species %in% c("abz","cd") & dist=="10"))
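If you don’t yet have SeedPred built, the same behavior can be seen on a small made-up data frame with the same kinds of columns (the values are invented):

```r
toy <- data.frame(species = factor(c("abz", "cd", "cor", "abz")),
                  dist = factor(c(10, 10, 25, 25)),
                  seeds = c(5, 3, 4, 2))

s1 <- subset(toy, species %in% c("abz", "cd") & dist == "10")
nrow(s1)               # 2 rows are kept ...
levels(s1$species)     # ... but all three levels are still there

s1d <- droplevels(s1)
levels(s1d$species)    # "abz" "cd": the unused level is gone

## select keeps only some columns
s2 <- subset(toy, seeds > 2, select = c(species, seeds))
```

Unused levels matter mostly when you tabulate or plot: table(s1$species) would still report a zero count for cor.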

Exercise 1.4 : how would you extract the data from every species except abz? From every species except abz or cd? (There are at least two ways to do the second part: either using & or combining the ! (not) and %in% operators.)

1.11.2 merge

merge combines two data sets by matching the value of a key: a column that the two data sets have in common. For example, here’s a data set on the prevalence of a mycoplasmal infection in gopher tortoises in different years and sites:

> prev <- read.csv("prevalence.csv",comment="#")

> prev_melt <- melt(prev)

> ## extract year from 'variable' by taking characters 5-8

> ## substr stands for "substring"

> prev_melt$year <- as.numeric(substr(prev_melt$variable,5,8))

> names(prev_melt)[3] <- "prevalence"

> prev_melt = prev_melt[,-2] ## drop old variable column

> head(prev_melt)

Site prevalence year

1 BS 4.2 2003

2 FE 1.0 2003

3 FC 1.0 2003


4 GSW 1.0 2003

5 ORD 1.0 2003

6 CB 14.3 2003

Here’s another data set with information on the number of recovered shells of 2 types (fresh and total):

> shells = read.csv("recoveredshells.csv",comment="#")

> shells_melt = melt(shells,id.var=1)

> ## extract year and type from 'variable'

> shells_melt$year=as.numeric(substr(shells_melt$variable,2,5))

> shells_melt$type=factor(substr(shells_melt$variable,7,11))

> shells_melt = shells_melt[,-2] ## drop old variable column

> ## cast to get both types of shells on the same line

> shells2 = cast(shells_melt,Site+year~type)

> head(shells2)

Site year Fresh Total

1 BS 2003 NA NA

2 BS 2004 0 0

3 BS 2005 0 0

4 BS 2006 0 0

5 CB 2003 NA NA

6 CB 2004 1 1

Now that we have done all this rearranging, we can put the two data sets together, matching on Site and year (by default, merge matches on all the columns with matching names in the two data sets):

> head(merge(shells2,prev_melt,all=TRUE))

Site year Fresh Total prevalence

1 BS 2003 NA NA 4.2

2 BS 2004 0 0 1.0

3 BS 2005 0 0 1.0

4 BS 2006 0 0 1.0

5 CB 2003 NA NA 14.3

6 CB 2004 1 1 4.3

Voila!

If you have time, go back and look at all the intermediate results to see that you understand what’s going on here.
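The effect of all=TRUE is easiest to see on two tiny made-up tables that only partly overlap (the values are loosely modeled on the output above, but the tables themselves are invented):

```r
shells_toy <- data.frame(Site = c("BS", "BS", "CB"),
                         year = c(2003, 2004, 2003),
                         Fresh = c(NA, 0, NA))
prev_toy <- data.frame(Site = c("BS", "CB", "CB"),
                       year = c(2003, 2003, 2004),
                       prevalence = c(4.2, 14.3, 4.3))

## default: keep only Site/year combinations present in BOTH tables
inner <- merge(shells_toy, prev_toy)
nrow(inner)     # 2

## all=TRUE: keep every combination, filling the gaps with NA
outer <- merge(shells_toy, prev_toy, all = TRUE)
nrow(outer)     # 4
outer
```

Because both tables have columns named Site and year, merge uses that pair as the key without being told.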

2 Exploratory graphics: seed data

2.1 Mean number remaining with time

You can attach the seed removal (predation) data with attach(SeedPred) so that you can refer to the variables as date, species, etc. rather than SeedPred$date, SeedPred$species, etc. Using attach can make your code easier to read, since you don’t have to put SeedPred$ in front of the column names, but it’s important to realize that attaching a data frame makes a local copy of the variables. Changes that you make to these variables are not saved in the original data frame, which can be very confusing. Therefore, it’s best to use attach only after you’ve finished modifying your data. attach can also be confusing if you have columns with the same name in two different attached data frames: use search to see where R is looking for variables. If you have to use attach, it’s best to attach just one data frame at a time, and make sure to detach it when you finish. Lately I have started using the with function instead; it is like a temporary version of attach that only applies to a single command.
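A small sketch of the difference, using a made-up data frame d (the name and values are invented for illustration): with evaluates a single expression using the columns of d, while a change made to an attached variable never reaches the data frame itself.

```r
## toy data frame (values invented for illustration)
d <- data.frame(taken = c(0, 1, 2), available = c(2, 2, 4))
## with() evaluates one expression using the columns of d,
## without adding d to the search path
frac <- with(d, taken / available)
print(frac)
## attach() works from a copy: assigning to 'taken' afterwards creates a
## new variable in the workspace and leaves d$taken untouched
attach(d)
taken <- taken * 10
print(taken)      # the modified workspace copy
print(d$taken)    # the data frame still holds the original values
detach(d)
rm(taken)
```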

To split the data on numbers of seeds present by date and species and take the mean (na.rm=TRUE says to drop NA values):

> s_means = recast(SeedPred,

dist+date~species,

measure.var="seeds",

fun.agg=mean,na.rm=TRUE)

matplot (“matrix plot”) plots the columns of a matrix together against a single x variable. Use it to plot the 10 m data on a log scale (log="y") with both lines and points (type="b"), in black (col=1), with plotting characters 1 through 8, with solid lines (lty=1). Use matlines (“matrix lines”) to add the 25 m data in gray. (lines and points are the base graphics commands to add lines and points to an existing graph.) Specifying [,3:10] ignores the first two columns, which are the distance and date.

> s10m = subset(s_means,dist=="10")

> matplot(s10m[,3:10],

log="y",type="b",col=1,pch=1:8,lty=1)

> s25m = subset(s_means,dist=="25")

> matlines(s25m[,3:10],

type="b",col="gray",pch=1:8,lty=1)


We will use ggplot to do the same plots, in a slicker way. First, however, we reorder the species factor in order of decreasing mean number of seeds, so that the legend will come out in an order that approximately matches the order of the lines on the graph:

> SP2 <- transform(SeedPred,species=reorder(species,seeds,

FUN=function(x) -mean(x,na.rm=TRUE)))
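To see what reorder does to the factor levels, here is a minimal sketch on a made-up data frame (d, d2, and the values are invented for illustration). reorder sorts the levels by increasing value of the summary function, so negating the mean gives decreasing order:

```r
## toy data: three species with different mean seed counts (invented values)
d <- data.frame(species = factor(c("a", "a", "b", "b", "c", "c")),
                seeds   = c(1, 2, 10, NA, 4, 6))
levels(d$species)    # default order is alphabetical
## reorder the levels by decreasing mean; the minus sign reverses the usual
## increasing order, and na.rm=TRUE ignores the missing count
d2 <- transform(d, species = reorder(species, seeds,
                                     FUN = function(x) -mean(x, na.rm = TRUE)))
levels(d2$species)   # "b" (mean 10), then "c" (mean 5), then "a" (mean 1.5)
```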

Now use ggplot, setting the date as the x variable, number of seeds as the y variable, with colour corresponding to species and point and line type corresponding to distance. We use stat_summary to collapse the values for any given date/species/distance combination to their mean value. (We have to specify the points and lines separately.)

> library(ggplot2)

> g1 <- ggplot(SP2,

aes(x=date,y=seeds,colour=species,shape=dist,

linetype=dist))+

stat_summary(fun.y=mean,geom="line")+

stat_summary(fun.y=mean,geom="point")

> g1

Alternatively, we could use facet_grid to specify a grid of sub-plots, with a single row (.) and columns specified by dist:

> g2 <- ggplot(SP2,

aes(x=date,y=seeds,colour=species))+

stat_summary(fun.y=mean,geom="line")+

facet_grid(.~dist)

> g2

2.2 Number taken vs. number available

2.2.1 Jittered plot

Jittered plot:

> attach(SeedPred)

> plot(jitter(available),jitter(taken))

ggplot version:

> qplot(available,taken,data=SeedPred,position="jitter")


2.2.2 Bubble plot

The following graph differs from the figure in Chapter 2, because I don’t exclude cases where there are no seeds available. (I use xlim and ylim to extend the axes slightly.) scale and pow can be tweaked to change the size and scaling of the symbols.

To plot the numbers in each category, I use text, row to get row numbers, and col to get column numbers; I subtract 1 from the row and column numbers to plot values starting at zero.

I used

> t1 = table(available,taken)

to cross-tabulate the data, and then used the text command to add the numbers to the plot. There’s a little bit more trickery involved in putting the numbers in the right place on the plot. row(x) gives a matrix with the row numbers corresponding to the elements of x; col(x) does the same for column numbers. Subtracting 1 (col(x)-1) accounts for the fact that columns 1 through 6 of our table refer to 0 through 5 seeds actually taken. When R plots, it simply matches up each of the x values, each of the y values, and each of the text values (which in this case are the numbers in the table) and plots them, even though the numbers are arranged in matrices rather than vectors. I also limit the plotting to positive values (using [t1>0]), although this is just cosmetic.

> library(plotrix)

> sizeplot(available,taken,scale=0.5,pow=0.5,xlim=c(-2,6),ylim=c(-2,5))

> t1 = table(available,taken)

> r = row(t1)-1

> c = col(t1)-1

> text(r[t1>0],c[t1>0],t1[t1>0]) ## label only the non-empty cells
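The row/col trick is easier to follow on a tiny table. This is a sketch with invented vectors av and tk (deliberately not called available and taken, so that nothing in the attached data is masked):

```r
## small cross-tabulation with invented data; both variables start at 0
av <- c(0, 1, 1, 2, 2, 2)
tk <- c(0, 0, 1, 0, 2, 2)
tab <- table(av, tk)
## row() and col() return index matrices with the same shape as tab;
## subtracting 1 converts indices 1..3 into the values 0..2
x <- row(tab) - 1
y <- col(tab) - 1
## keep only the non-empty cells
pos <- tab > 0
## one line per label that text(x[pos], y[pos], tab[pos]) would draw
cbind(av = x[pos], tk = y[pos], count = tab[pos])
```

Each row of the result is one (x, y, label) triple, which is how text matches up its three arguments.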

Or you can use balloonplot from the gplots package:

> library(gplots)

> balloonplot(t1)

ggplot, with a certain amount of finagling (stat_sum counts the number of identical (x,y) values; we make the points thus created semi-transparent (alpha=0.5) and use scale_size_continuous to increase the maximum sizes):


> ggplot(subset(SeedPred,available>0),aes(x=available,y=taken))+

stat_sum(alpha=0.5)+

scale_size_continuous(to=c(1,24),legend=FALSE)+

stat_sum(geom="text",aes(label=..n..,size=NULL),colour="red")

Finally, the default mosaic plot, either using the default plot command on the existing tabulation

> plot(t1)

or using mosaicplot with a formula based on the columns of SeedPred:

> mosaicplot(~available+taken,data=SeedPred)

Here’s the code to produce a bar plot on the log10(1 + x) scale:

> m = t(log10(t1+1))

> barplot(m,

beside=TRUE,legend=TRUE,xlab="Available",

ylab="log10(1+# observations)")

> op = par(xpd=TRUE)

> text(34.5,3.05,"Number taken")

> par(op)

You could also use

> barplot(t(t1+1),log="y",beside=TRUE,

xlab="Available",ylab="1+# observations")

As mentioned in the text, log10(t1+1) finds log10(x + 1), a reasonable transformation to compress the range of discrete data; t transposes the table so we can plot groups by number available (barplot groups the bars by the columns of the matrix rather than the rows, which would mean grouping by the number taken if we didn’t transpose the table). The beside=TRUE argument plots grouped rather than stacked bars; legend=TRUE plots a legend; and xlab and ylab set labels. The statement par(xpd=TRUE) allows text and lines to be plotted outside the edge of the plot; the op=par(...) and par(op) are a way to set parameters and then restore the original settings (I could have called op anything I wanted, but in this case it stands for old parameters).
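The save-and-restore idiom can be checked in isolation. This is a minimal sketch: the pdf(NULL) device is just a convenience so that the example runs without opening a window, and restored is an invented name used only to record the result.

```r
pdf(NULL)                        # throwaway device, so nothing is written to disk
plot(1:10)
op <- par(xpd = TRUE)            # par() returns the previous value(s) as a list
text(5, 11.5, "above the plot")  # only drawn because clipping is switched off
par(op)                          # put the old setting back
restored <- par("xpd")           # same value as before the change
invisible(dev.off())
```

Because par returns the old values of whatever it changes, the same two lines work for any combination of parameters, not just xpd.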

You can use barchart in the lattice package to produce these graphics, although the bars are horizontal rather than vertical by default. Try the following (stack=FALSE is equivalent to beside=TRUE for barplot):


> library(lattice)

> barchart(log10(1+table(available,taken)),

stack=FALSE,

auto.key=TRUE)

More impressively, the lattice package can automatically plot a barplot of a three-way cross-tabulation, in small multiples (I had to experiment a bit to get the factors in the right order in the table command): try

> barchart(log10(1+table(available,species,taken)),

stack=FALSE,auto.key=TRUE)

In ggplot (converting taken to a factor affects the spacing of the bars and the colours chosen):

> ggplot(subset(SeedPred,available>0 & taken>0),

aes(x=factor(taken),fill=factor(taken)))+

stat_bin(aes(y=log10(1+..count..)))+

facet_grid(.~available)

Or splitting by species:


> ggplot(subset(SeedPred,available>0 & taken>0),

aes(x=factor(taken),fill=factor(taken)))+

stat_bin(aes(y=log10(1+..count..)))+

facet_grid(species~available)

Arguably, though, bar plots are not the best for these data anyway: something like this might be better:


> ggplot(subset(SeedPred,available>0 & taken>0),

aes(x=factor(taken),

shape=factor(available),

colour=factor(taken)))+

stat_bin(aes(y=log10(1+..count..)),geom="point")

Exercise 2.1 *: Restricting your analysis to only the observations with 5 seeds available, create a barplot showing the distribution of number of seeds taken broken down by species.


2.3 Mean fraction taken: barplot with error bars

Computing the fraction taken:

> frac_taken = taken/available

Computing the mean fraction taken for each number of seeds available, using the tapply function. tapply (standing for table apply, pronounced “t apply”) splits a vector into groups according to the list of factors provided, then applies a function (e.g. mean or sd) to each group.

> mean_frac_by_avail = tapply(frac_taken,available,mean)

This computes the mean of frac_taken for each group defined by a different value of available (R automatically converts available into a factor temporarily for this purpose).

If you want to compute the mean (or some other function) by group for more than one variable in a data set, you can use aggregate.

We can also calculate the standard errors, σ/√n:

> n_by_avail = table(available)

> se_by_avail = tapply(frac_taken,available,

sd,na.rm=TRUE)/

sqrt(n_by_avail)
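The same calculation on a toy data set makes the pieces easier to inspect. This is a sketch under stated assumptions: the data frame d and its values are invented, and I count only the non-missing values in each group (rather than using table) so that n matches what went into the mean and standard deviation.

```r
## toy data (invented): fraction taken for two levels of 'available'
d <- data.frame(available  = c(1, 1, 1, 2, 2, 2, 2),
                frac_taken = c(0, 1, 1, 0.5, 0, 1, NA))
## group means and standard deviations, ignoring NAs
m <- with(d, tapply(frac_taken, available, mean, na.rm = TRUE))
s <- with(d, tapply(frac_taken, available, sd, na.rm = TRUE))
## number of non-missing observations per group
n <- with(d, tapply(!is.na(frac_taken), available, sum))
se <- s / sqrt(n)
print(cbind(mean = m, se = se, n = n))
## aggregate() does the same kind of split-and-summarize but returns a
## data frame; its formula interface drops rows with NAs by default
agg <- aggregate(frac_taken ~ available, data = d, FUN = mean)
print(agg)
```

If a group contains missing values, table(available) counts them too, which makes the standard error slightly too small; whether that matters depends on how many NAs there are.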

I’ll actually use a variant of barplot, barplot2 (from the gplots package, which you may need to install, along with the gtools and gdata packages), to plot these values with standard errors. (I am mildly embarrassed that R does not supply error-bar plotting as a built-in function, but you can use barplot2 in the gplots package or the plotCI function; the gplots and plotrix packages have slightly different versions of plotCI.)

> library(gplots)

> lower_lim = mean_frac_by_avail-2*se_by_avail

> upper_lim = mean_frac_by_avail+2*se_by_avail

> b = barplot2(mean_frac_by_avail,plot.ci=TRUE,

ci.l=lower_lim,ci.u=upper_lim,

xlab="Number available",

ylab="Mean fraction taken")

I specified that I wanted error bars plotted (plot.ci=TRUE) and the lower (ci.l) and upper (ci.u) limits.


Bar plot of mean fraction taken by species: in this case we use barplot, saving the x locations of the bars in a variable b, and then add the confidence intervals with plotCI.

> library(plotrix)

> frac_taken = SeedPred$taken/SeedPred$available

> mean_frac_by_avail_by_species =

tapply(frac_taken,list(available,species),mean,na.rm=TRUE)

> n_by_avail_by_species = table(available,species)

> se_by_avail_by_species = tapply(frac_taken,list(available,species),

sd,na.rm=TRUE)/sqrt(n_by_avail_by_species)

> b = barplot(mean_frac_by_avail_by_species,beside=TRUE)

> plotCI(b,mean_frac_by_avail_by_species,

se_by_avail_by_species,add=TRUE,pch=".",gap=FALSE)

With ggplot we can use stat_summary to compute the data summaries on the fly: we will also reorder the species factor in order of mean fraction taken . . .

> SeedPred <- transform(SeedPred,frac_taken=taken/available,

species=reorder(species,frac_taken,na.rm=TRUE))

> ggplot(subset(SeedPred,!is.na(available) & available>0),

aes(x=factor(available),y=frac_taken,colour=species))+

stat_summary(fun.data=mean_cl_normal)+facet_grid(.~species)

3D plots: using t1 from above, define the x, y, and z variables for the plot:

> avail = row(t1)[t1>0]

> taken = col(t1)[t1>0]-1

> freq = log10(t1[t1>0])

The scatterplot3d package is a little bit simpler to use, but less interactive: once the plot is drawn you can’t change the viewpoint. Plot -avail and -taken to reverse the order of the axes and use type="h" (originally named for a “high density” plot in R’s 2D graphics) to draw lollipops:

> library(scatterplot3d)

> scatterplot3d(-avail,-taken,freq,type="h",

angle=50,pch=16)

With the rgl package: first plot spheres (type="s") hanging in space:


> library(rgl)

> plot3d(avail,taken,freq,lit=TRUE,

col.pt="gray",type="s",

size=0.5,

zlim=c(0,4))

Then add stems and grids to the plot:

> plot3d(avail,taken,freq,add=TRUE,type="h",size=4,col=gray(0.2))

> grid3d(c("x+","y-","z"))

Use the mouse to move the viewpoint until you like the result.

Clean up:

> rm(taken,avail,freq)

2.4 Histograms/small multiples

All I had to do to get the lattice package to plot the histogram by species was:

> histogram(~frac_taken|species,

data=SeedPred,xlab="Fraction taken")

or with base graphics, using a for loop:

> op = par(mfrow=c(3,3))

> for (i in 1:length(levels(species))) {

hist(frac_taken[species==levels(species)[i]],

xlab="Fraction taken",main="",

col="gray")

}

> par(op)

op stands for “old parameters”. Saving the old parameters in this way and using par(op) at the end of the plot restores the original graphical parameters.

Clean up:

> detach(SeedPred)

Plots in this section: scatterplot (plot or xyplot), bubble plot (sizeplot), barplot (barplot or barchart or barplot2), histogram (hist or histogram).


Data manipulation: reshape, stack/unstack, table, split, lapply, sapply

Exercise 2.2 *: generate three new plots based on one of the data sets in this lab (or elsewhere in Chapter 2), or on your own data.

3 Tadpole data

These data are from an experiment by James Vonesh on predation of tadpoles of the reed frog Hyperolius spinigularis by aquatic dragonfly larvae in experimental tanks. See help("ReedfrogPred",package="emdbook") for more info.

Reading in the data was fairly easy in this case: read.table(...,header=TRUE) and read.csv worked without any tricks. I take a shortcut, therefore, to load these datasets from the emdbook package:

> library(emdbook)

> data(ReedfrogPred)

> data(ReedfrogFuncresp)

> data(ReedfrogSizepred)

3.1 Boxplot of factorial experiment

The boxplot is fairly easy:

> graycols = rep(rep(gray(c(0.4,0.7,0.9)),each=2),2)

> boxplot(propsurv~size*density*pred,data=ReedfrogPred,col=graycols)

Play around with the order of the factors to see how useful the different plots are.

graycols specifies the colors of the boxes to mark the different density treatments. gray(c(0.4,0.7,0.9)) produces a vector of colors; rep(gray(c(0.4,0.7,0.9)),each=2) repeats each color twice (for the big and small treatments within each density treatment); and rep(rep(gray(c(0.4,0.7,0.9)),each=2),2) repeats the whole sequence twice (for the no-predator and predator treatments).
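The nesting of the two rep calls is easier to read with labels in place of colours. In this sketch the strings "dark", "mid", and "light" are invented stand-ins for the three gray levels:

```r
cols <- c("dark", "mid", "light")   # stand-ins for the three gray levels
inner <- rep(cols, each = 2)        # each colour twice, one per size class
outer <- rep(inner, 2)              # whole sequence twice, one per predator level
print(inner)
print(outer)
length(outer)                       # 12 boxes: 2 sizes x 3 densities x 2 predator levels
```

The order works because in size*density*pred the first factor varies fastest, so consecutive pairs of boxes share a density.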

> ggplot(ReedfrogPred,aes(x=interaction(size,density),y=propsurv))+

geom_boxplot()+facet_grid(.~pred)

or


> ReedfrogPred$ssize <- ReedfrogPred$size ## ggplot gets confused by 'size' variable

> ## use density as a factor for proper axis spacing etc.

> ggplot(ReedfrogPred,aes(x=factor(density),y=propsurv))+

geom_boxplot()+facet_grid(ssize~pred)

3.2 Functional response values

I’ll attach the functional response data (warn=FALSE says not to warn about any variables that are masked by the newly attached data):

> attach(ReedfrogFuncresp,warn=FALSE)

A simple x-y plot, with an extended x axis and some axis labels:

> plot(Initial,Killed,xlim=c(0,100),

ylab="Number killed",xlab="Initial density")

Adding the lowess fit (lines is the general command for adding lines to a plot: points is handy too):

> lines(lowess(Initial,Killed))

Calculate mean values and corresponding initial densities, add to the plot with a different line type:

> meanvals = tapply(Killed,Initial,mean)

> densvals = sort(unique(Initial)) ## sorted, to match the order of the tapply results

> lines(densvals,meanvals,lty=3)

Fit a spline to the data using the smooth.spline command:

> lms = smooth.spline(Initial,Killed,df = 5)

To add the spline curve to the plot, I have to use predict to calculate the predicted values for a range of initial densities, then add the results to the plot*:

> ps = predict(lms,x=0:100)

> lines(ps,lty=2)

*Equivalently, I could use the lm function with ns (natural spline), which is a bit more complicated to use in this case but has more general uses:

library(splines)

lm1 = lm(Killed ~ ns(Initial, df = 5), data = ReedfrogFuncresp)

p1 = predict(lm1,newdata=data.frame(Initial=1:100))

lines(p1,lty=2)
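Here is a self-contained sketch of the same fit-then-predict pattern on synthetic data (the saturating curve, the noise, and the names x, y, and fit are all invented for illustration; pdf(NULL) is only there so the example runs without opening a window):

```r
## synthetic data: saturating curve plus noise
set.seed(1)
x <- rep(c(5, 10, 15, 20, 30, 50, 75, 100), each = 4)
y <- 30 * x / (20 + x) + rnorm(length(x))
fit <- smooth.spline(x, y, df = 5)
## predict() on a smooth.spline fit returns a list with components x and y,
## which lines() can plot directly
ps <- predict(fit, x = 0:100)
str(ps)
pdf(NULL)                 # throwaway device
plot(x, y)
lines(ps, lty = 2)
invisible(dev.off())
```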

Finally, I could do linear or quadratic regression (I need to use I(Initial^2) to tell R I really want to fit the square of the initial density); adding the lines to the plot would follow the procedure above.

> lm2 = lm(Killed ~ Initial, data = ReedfrogFuncresp)

> lmq = lm(Killed ~ Initial+I(Initial^2), data = ReedfrogFuncresp)
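A quick way to convince yourself that I() matters is to count coefficients. This sketch uses invented data (x and y are made up); inside a formula, ^ is a crossing operator, so x^2 without I() collapses to plain x:

```r
set.seed(1)
x <- 1:20
y <- 2 + 0.5 * x - 0.02 * x^2 + rnorm(20, sd = 0.1)
## inside a formula ^ means "crossing", so x^2 is treated as just x
c_plain <- coef(lm(y ~ x + x^2))      # intercept and slope only
## I() protects the arithmetic, giving a genuine quadratic term
c_quad  <- coef(lm(y ~ x + I(x^2)))   # intercept, slope, and quadratic coefficient
print(c_plain)
print(c_quad)
```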

Clean up:

> detach(ReedfrogFuncresp)

The (tadpole size) vs. (number killed) plot follows similar lines, although I did use sizeplot because there were overlapping points.

With ggplot it’s easy to add a loess fit, a spline fit (achieved in this case via the gam function in the mgcv package), or a quadratic regression to the plot . . .

> library(mgcv)

> qplot(Initial,Killed,data=ReedfrogFuncresp)+

geom_smooth(colour="black")+

geom_smooth(method=gam,fill="red",colour="red")+

geom_smooth(method=lm,formula=y ~ poly(x,2), fill="blue", colour="blue")

4 Damselfish data

These data are on the settlement (immigration from the open ocean) and recruitment (survival for a particular length of time, to the next life stage) of damselfish Dascyllus trimaculatus, from Schmitt, Holbrook et al. See help("Damselfish",package="emdbook").

4.1 Survivors as a function of density

Load and attach data:

> data(DamselRecruitment)

> data(DamselRecruitment_sum)

> attach(DamselRecruitment)

> attach(DamselRecruitment_sum)


Plot surviving vs. initial density; use plotCI to add the summary data by target density; and add a lowess-smoothed curve to the plot:

> init.dens = init/area*1000

> surv.dens = surv/area*1000

> plot(init.dens,surv.dens,log="x")

> plotCI(settler.den,surv.den,SE,

add=TRUE,pch=16,col="darkgray",gap=0)

> lines(lowess(init.dens,surv.dens))

Clean up:

> detach(DamselRecruitment)

> detach(DamselRecruitment_sum)

This demonstrates how to overlay two different data sets in ggplot . . .

> DamselRecruitment = transform(DamselRecruitment,

init_dens=init/area*1000,

surv_dens=surv/area*1000)

> qplot(init_dens,surv_dens,log="x",data=DamselRecruitment)+

geom_smooth()+

geom_pointrange(data=DamselRecruitment_sum,

aes(x=settler.den,y=surv.den,

ymin=surv.den-2*SE,

ymax=surv.den+2*SE),colour="red")+

labs(x="Initial density",y="Surviving density")

4.2 Distribution of settlement density

Plot the histogram (normally one would specify freq=FALSE to plot probability densities rather than counts, but the uneven breaks argument makes this happen automatically).

> data(DamselSettlement)

> attach(DamselSettlement)

> hist(density[density<200],breaks=c(0,seq(1,201,by=4)),col="gray",

xlab="",

ylab="Prob. density")

> lines(density(density[density<200],from=0),col=2,lwd=2)

Some alternatives to try:


> hist(log(1+density))

> hist(density[density>0],breaks=50)

(you can use breaks to specify particular breakpoints, or to give the total number of bins to use).
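The two uses of breaks, and the automatic switch to densities for uneven bins, can be checked without drawing anything. This is a sketch on invented right-skewed data (z, h_n, and h_v are made-up names):

```r
set.seed(1)
z <- rexp(200, rate = 1/20)            # invented right-skewed data
z <- z[z < 200]
## a single number is only a suggestion for the number of bins
h_n <- hist(z, breaks = 10, plot = FALSE)
length(h_n$counts)
## a vector gives the exact breakpoints; these are unevenly spaced
h_v <- hist(z, breaks = c(0, seq(1, 201, by = 4)), plot = FALSE)
h_v$equidist                           # FALSE, so plotting h_v shows densities
sum(h_v$density * diff(h_v$breaks))    # bar areas sum to 1
```

With densities, the area of each bar (not its height) is proportional to the number of observations, which is what keeps the narrow first bin from looking misleadingly short.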

If you really want to lump all the large values together:

> h1 = hist(density,breaks=c(0,seq(1,201,by=4),500),plot=FALSE)

> b= barplot(h1$counts,space=0)

> axis(side=1,at=b,labels=h1$mids)

(use hist to calculate the number of counts in each bin, but don’t plot anything; use barplot to plot the values (ignoring the uneven width of the bins!), with space=0 to squeeze them together).

Box and whisker plots showing density across different recruitment pulses at different sites:

> bwplot(log10(1+density)~pulse|site,data=DamselSettlement,

horizontal=FALSE)

Other variations to try:

> ## density distributions plotted in a single frame

> densityplot(~density,groups=site,data=DamselSettlement,xlim=c(0,100))

> ## box-and-whiskers of overall settlement by site

> bwplot(density~site,horizontal=FALSE,data=DamselSettlement)

> ## density rather than log10(1+density)

> bwplot(density~site|factor(pulse),horizontal=FALSE,data=DamselSettlement)

> ## violin plots

> bwplot(log10(1+density)~site|factor(pulse),data=DamselSettlement,

panel=panel.violin,

horizontal=FALSE)

> ## all site-pulse combinations, in base graphics (ugly)

> boxplot(density~site*pulse)

> detach(DamselSettlement)

ggplot examples:

> ggplot(DamselSettlement,

aes(x=density))+geom_histogram(breaks=seq(0,501,by=4))

> ## ggplot doesn't allow unequal-width bins in histograms

> DS <- transform(DamselSettlement,site=reorder(site,density))


> ggplot(DS,

aes(x=factor(pulse),y=log10(1+density)))+geom_boxplot()+

facet_wrap(~site)

> ggplot(DS,

aes(x=log10(1+density),fill=site))+

geom_density(colour=NA,position="identity",alpha=0.5)+

facet_wrap(~pulse)