2013 - Notes - R (Trinker's Notes)



Commands for R

R Help
Type:                                              Purpose:
?function.name                                     Help page on a specific function
args(function.name)                                Arguments for a particular function
function.name                                      Code of a function
example(function.name)                             Example(s) of a function in action
??function.name                                    For when ?function.name doesn't work
help(package="package.name")                       Information on a package
RSiteSearch("key phrase")                          Search the R site, from within [R], for key phrases
RSiteSearch("{key phrase}")                        Search the R site, from within [R], for exact phrases
apropos("key phrase")                              Returns a list of all matching objects in the search list
find("function")                                   Returns the package the function is in
news()                                             Find out new things happening in [R]
news(Version == "v.9", package = "package.name")   Find out new things happening with a package
news(grepl("key phrase", Text), db = news())       Search for key words in news()
help.start()                                       Opens a CRAN page for statistical analysis
library(sos); findFn("key phrase") or ???key.phrase; ???'word1 word2'   Search for rated functions related to a topic
sessionInfo()                                      Info about the current session (including loaded packages)

Additional Websites
Website:                               Purpose:
rseek.org                              [R] version of Google
stackoverflow.com                      Q & A forum generally oriented to code and programming
stats.stackexchange.com                Q & A forum generally oriented to statistics
crantastic.org                         Search through packages for key words (use CTRL + F to search by anything, including author)
cran.r-project.org/web/views           Search through packages by area (psychometrics, cluster, etc.)
inside-r.org/                          R community site
zoonek2.free.fr/UNIX/48_R/all.html     Helpful book/manual website (very thorough)
statmethods.net/                       Helpful book/manual website (very detailed)

Sample [R] capabilities
demo(persp); demo(graphics); demo(Hershey); demo(plotmath)

Cool R Visualization Examples
http://paulbutler.org/archives/visualizing-facebook-friends/
http://blog.revolutionanalytics.com/2012/01/nyt-uses-r-to-map-the-1.html
http://blog.revolutionanalytics.com/2009/11/choropleth-challenge-result.html
http://www.r-bloggers.com/visualize-your-facebook-friends-network-with-r/
http://www.r-bloggers.com/see-the-wind/
http://www.r-bloggers.com/mapped-british-and-spanish-shipping-1750-1800/


Packages & Libraries

Working with packages
library(package.name)                          Loads a package
require(package.name)                          Loads a package
.libPaths()                                    Prints the path(s) to R's library(s)
detach(package:package.name, unload = TRUE)    Removes a package from the search path
install.packages("package.name")               Install a package from the command line
library(fBasics); listFunctions("stats")       List functions from a package
objects("package:package.name")                List functions from a package (preferred: from base)
installed.packages()[, 1]                      What packages do you have installed on your computer
maintainer("package.name")                     Get the name and email of the package maintainer
packageDescription("package.name")             Brief info about a package's contents
packageDescription("package.name")["Version"]  Find the version number of a package
remove.packages("package.name")                Delete a package from your library
data(package = "package.name")                 Look at data sets available in a package
library()                                      Look at all available libraries
vignette()                                     Look at all available vignettes for installed packages
vignette("package.name")                       Look at vignettes for a package
(.packages())                                  Currently loaded packages
search()                                       Currently loaded packages
library(help = "package.name")                 See contents of a package
package.name::object                           Access a package object w/o loading the package (especially good if 2 packages have the same named function)
list.files(.libPaths()[1])                     See what files are in your saved library
list.files(.libPaths())                        Shows all available packages including standard install libs
(.packages(all = TRUE))                        Shows all available packages
(.libPaths())                                  Displays the paths to all your library locations
suppressPackageStartupMessages(library(package.name))   Suppress the startup message of a package

Install Package Not Compiled for Windows
install.packages("PATH/TO/THE/SVGAnnotation.tar.gz", repos=NULL, type="source")

Citing [R] and packages
citation()   #citing [R]
citation(package = "psych", lib.loc = NULL, auto = NULL)      #bibtex citation of a package, method 1
utils:::print.bibentry(citation("psych"), style = "Bibtex")   #bibtex citation of a package, method 2
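A quick sketch tying a few of the package commands above together (ggplot2 is only an assumed example of an installed package; any package name works the same way):

#EXAMPLE (assumes ggplot2 is installed)
library(ggplot2)                           #load the package
packageDescription("ggplot2")["Version"]   #version number of the package
maintainer("ggplot2")                      #name and email of the maintainer
data(package = "ggplot2")                  #data sets available in the package
citation("ggplot2")                        #how to cite the package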


Importing data from Excel, (csv), text, HTML
Make sure the Excel file is saved as a .csv file in the folder containing the root directory of R.

Using File Choose

<-read.csv(file.choose(), strip.white = TRUE, header = TRUE, sep = ",", na.strings = "999")

Exporting a data table to Excel
write.table(x, file = "foo.csv", sep = ",", col.names = T, row.names = F, qmethod = "double")
write.table(x, file = "foo.csv", sep = ",", col.names = NA, qmethod = "double")

Exporting a data table to SAS
library(SASxport); write.xport(...dataframe(s)..., file=)

Keyboard Short Cuts
Clear console:                                                                  Ctrl + L
Load script lines:                                                              Ctrl + R
Load all script:                                                                Ctrl + A and then Ctrl + R
Load content to console from a non-interactive window (i.e. history() etc.):   Ctrl + C; Ctrl + V
Go to the beginning or end of script:                                           Ctrl + HOME; Ctrl + END
Highlight from a given point to beginning or end:                               Ctrl + SHIFT + HOME; Ctrl + SHIFT + END

For Fixed Column Width Files: Save as Plain text, and import into Excel using file/open and follow the steps. Then Export as a .csv file. [or use read.fwf()]

<-read.csv(file, header=TRUE, strip.white = TRUE, sep=",", as.is=FALSE, na.strings= c("999", "NA", " "))

<-read.delim(file, header=T, strip.white = T, sep="\t", as.is=FALSE, na.strings= c("999", "NA", " "))

library(XML) #which is the table number to return <-readHTMLTable(doc, which=#, header=T, strip.white = T, as.is=FALSE, sep=",", na.strings= c("999", "NA", " "))

<-read.fwf(file,widths, header=FALSE, strip.white = T, sep=" ", as.is=FALSE, na.strings= c("999", "NA", " "))

library(gdata) <-read.xls(file,sheet=1, header=FALSE, strip.white=T,sep=" ", as.is=FALSE, na.strings= c("999", "NA", " "))

require(foreign) #for SPSS <-read.spss(file, use.value.labels = TRUE, to.data.frame = TRUE)

<-read.table(file.choose(),sep=",",header=T, strip.white = T,na.strings=c("999","NA"," "))

#example 1

library(XML)

URL <- "http://library.columbia.edu/indiv/dssc/data/nycounty_fips.html"

Table <- readHTMLTable(URL,

colClasses = rep("character", 2),

skip.rows=1,

which=1)

names(Table) <- c("County_FIPS", "County_Name")

Table

#example 2

library(XML)

URL2 <- "http://en.wikipedia.org/wiki/List_of_counties_in_New_York"

Table2 <- readHTMLTable(URL2, which=2)

Table2 #needs to be cleaned

Use the as.is argument to control factor conversion: as.is = TRUE keeps character columns as character (does not convert them to factor).


Write and Read in Vector Files of Unequal Length write.unequal(…, csv.name) read.unequal(file)

#=============WRITE=A=CSV=FILE=OF=UNEQUAL=LENGTHS================

Vector1 <- 1:6

Vector2 <- LETTERS[1:9]

Vector3 <- c('the', 'quick', 'red', 'fox',

'jumped', 'over', 'the', 'lazy', 'brown', 'dog')

Vector4 <- c(.1, .3, .6, .4)

lst <- list(Vector1, Vector2, Vector3, Vector4)

lns <- sapply(lst, length)

n <- length(lst)

ans <- as.data.frame(matrix(nrow = max(lns), ncol = n))

for(i in 1:n){

ans[1:lns[i], i] <- lst[[i]]

}

ans

write.csv(ans, file = "DELETE.ME.csv", na = "", row.names = FALSE)

#=============READ=IN=A=FILE=OF=UNEQUAL=LENGTHS==================

x <- "DELETE.ME.csv" #THE CSV OF != LENGTH VECTORS

j <- read.csv(x, stringsAsFactors = F)

k <- lapply(as.list(j), function(x){x[!is.na(x)]})

#======================DELETE=THE=FILE===========================

delete(x)

###################################################################

# NOTE: I WRAPPED THIS ALL UP IN A FUNCTION I KEEP IN THE USEFUL #

# FUNCTIONS FILE LOADED BY .FIRST AS SEEN BELOW #

###################################################################

write.unequal(Vector1, Vector2, Vector3, Vector4, csv.name=".DELETE.ME")

read.unequal(".DELETE.ME.csv")


Read in ascii type files (see my created function to right) read.table(name<-textConnection("")); close(name) site.data <- read.table(tc<-textConnection(

"site year peak

1 ALBEN 5 101529.6

2 ALBEN 10 117483.4

3 ALBEN 20 132960.9

8 ALDER 5 6561.3

9 ALDER 10 7897.1

10 ALDER 20 9208.1

15 AMERI 5 43656.5

16 AMERI 10 51475.3

17 AMERI 20 58854.4")); close(tc)

site.data

site.data[,1]

Read in ascii type files read.table(text="", header=TRUE)

Give comments to an object (object comment; data frame comment)
comment(object)
comment(object) <- value

Object Characteristics str() names() attributes() comment() getAnywhere() #look at any code example getAnywhere(plot.table)

#created read in function

ascii <- function(x, header=TRUE){

name <-textConnection(x)

DF <- read.table(name,header)

close(name)

DF

}

#EXAMPLE:

x <- mtcars

comment(x) <- c("This is about cars #0234", "I know nothing about cars")

x

comment(x)

str(x)

mod<-lm(disp~hp+cyl, mtcars)

str(mod)

attributes(mod)

names(mod)

str(mtcars)

attributes(mtcars)

names(mtcars)

#WATCH OUT FOR METHODS CLASSES

library(tm); library(proxy)

dissimilarity #notice the UseMethod (that tells you to look at the methods)

methods(dissimilarity) #notice there are three different methods types

getAnywhere("dissimilarity.DocumentTermMatrix") #works

tm:::dissimilarity.DocumentTermMatrix #so does :::

ascii("site year peak

1 ALBEN 5 101529.6

2 ALBEN 10 117483.4

3 ALBEN 20 132960.9

8 ALDER 5 6561.3

9 ALDER 10 7897.1

10 ALDER 20 9208.1

15 AMERI 5 43656.5

16 AMERI 10 51475.3

17 AMERI 20 58854.4")

read.table(text="site year peak

1 ALBEN 5 101529.6

2 ALBEN 10 117483.4

3 ALBEN 20 132960.9

8 ALDER 5 6561.3

9 ALDER 10 7897.1

10 ALDER 20 9208.1

15 AMERI 5 43656.5

16 AMERI 10 51475.3

17 AMERI 20 58854.4")

Alternative Method


Merging data sets (by column) merge(x, y, by = intersect(names(x), names(y)), by.x = by, by.y = by, all = FALSE, all.x = all, all.y = all, sort = TRUE, suffixes = c(".x",".y"))

x, y data frames, or objects to be coerced to one.

by, by.x, by.y

specifications of the common columns. See ‘Details’.

all logical; all = L is shorthand for all.x = L and all.y = L.

all.x logical; if TRUE, then extra rows will be added to the output, one for each row in x that has no matching row in y. These rows will have NAs in those columns that are usually filled with values from y. The default is FALSE, so that only rows with data from both x and y are included in the output.

all.y logical; analogous to all.x above.

sort logical. Should the results be sorted on the by columns?
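A minimal worked example of merge() by column using the arguments above (made-up data frames, not from the original notes):

#EXAMPLE
kids   <- data.frame(ID = 1:3, age = c(5, 6, 7))
scores <- data.frame(ID = c(1, 2, 4), score = c(90, 85, 70))
merge(kids, scores, by = "ID")                 #only IDs found in both (all = FALSE)
merge(kids, scores, by = "ID", all.x = TRUE)   #keep every row of kids; NA where there is no match
merge(kids, scores, by = "ID", all = TRUE)     #keep all rows from both data frames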

Merging data sets (by row) and fill missing with NA (merge rows)
library(plyr)
rbind.fill(dataframes…)

#EXAMPLE

x1<-LETTERS[1:3]

x2<-letters[1:3]

x2b<-letters[5:7]

x3<-rnorm(3)

x4<-rnorm(3)

x5<-rnorm(3)

#DATA LOOKS LIKE THIS

data.frame(x1,x2,x3,x4,x5)

data.frame(x1,x3,x4,x5)

data.frame(x2,x3,x4,x5)

data.frame(x1,x2,x3,x4,x5)

data.frame(x1,x2b,x3,x4,x5)

#========================================

# IF EACH ONE IS A DATA FRAME ALREADY

#========================================

library(plyr)

d1 <- data.frame(x1,x2,x3,x4,x5)

d2 <- data.frame(x1,x3,x4,x5)

d3 <- data.frame(x2,x3,x4,x5)

d4 <- data.frame(x1,x2,x3,x4,x5)

d5 <- data.frame(x1,x2b,x3,x4,x5)

rbind.fill(d1,d2,d3,d4,d5)

#========================================

# IF EACH ONE IS A DATA FRAME ALREADY

#========================================

library(plyr)

LIST<-list(

data.frame(x1, x2, x3, x4, x5),

data.frame(x1,x3,x4,x5),

data.frame(x2,x3,x4,x5),

data.frame(x1,x2,x3,x4,x5),

data.frame(x1,x2b,x3,x4,x5))

DF <- rbind.fill(LIST)

data.frame(FAC(DF), NUM(DF))

#Output

x1 x2 x2b x3 x4 x5

1 A a <NA> -1.0193006 -0.8175212 -0.3094028

2 B b <NA> -2.0372846 -1.0685405 -1.0913312

3 C c <NA> -0.6502925 0.7338066 0.7393544

4 A <NA> <NA> -1.0193006 -0.8175212 -0.3094028

5 B <NA> <NA> -2.0372846 -1.0685405 -1.0913312

6 C <NA> <NA> -0.6502925 0.7338066 0.7393544

7 <NA> a <NA> -1.0193006 -0.8175212 -0.3094028

8 <NA> b <NA> -2.0372846 -1.0685405 -1.0913312

9 <NA> c <NA> -0.6502925 0.7338066 0.7393544

10 A a <NA> -1.0193006 -0.8175212 -0.3094028

11 B b <NA> -2.0372846 -1.0685405 -1.0913312

12 C c <NA> -0.6502925 0.7338066 0.7393544

13 A <NA> e -1.0193006 -0.8175212 -0.3094028

14 B <NA> f -2.0372846 -1.0685405 -1.0913312

15 C <NA> g -0.6502925 0.7338066 0.7393544


Merge Rows of a Data Set, Method 1 [sum by id variables; combine rows]
library(plyr)
ddply(data.frame, .(other.facs), summarize, combined.fac = sum(combined.fac))
   dataframe = the data frame
   combined.facs = numeric factors you want to sum (or other operation)
   other.facs = list of factors that are repeated in all rows for the combined.facs

Merge Rows of a Data Set, Method 2 [sum by id variables; combine rows]
library(data.table)
dataframe[, list(combined.facs = sum(combined.facs)), list(other.facs)]
   dataframe = the data frame
   combined.fac = numeric factor you want to sum (or other operation)
   other.facs = list of factors that are repeated in all rows for the combined.facs

Paste two data frames together a<-mtcars[1:3,1:3]

b<-mtcars[1:3,8:10]

mypaste <- function(x,y) paste(x, "(", y, ")", sep="")

mapply(mypaste, a,b)

EXAMPLE

(dat <- structure(list(year = structure(c(1L, 1L, 1L, 1L,

1L, 1L),.Label = "base", class = "factor"), age =

structure(c(1L, 2L, 2L, 3L, 3L, 4L), .Label = c("0",

"1", "2", "3"), class = "factor"), pop = c(98378,

104648, 96769, 92448, 100745, 116926),

FIPS = structure(c(1L, 1L, 1L, 1L, 1L, 1L),

.Label = "6001", class = "factor")),

.Names = c("year", "age", "pop", "FIPS"),

row.names = c(NA, -6L), class = c("data.table",

"data.frame"), sorted = c("year", "age",

"FIPS")))

library(data.table) #Method 1

dat[,list(pop=sum(pop)),list(year,age,FIPS)]

library(plyr) #Method 2

ddply(dat, .(year, age, FIPS), summarize, pop = sum(pop))

dat<-data.frame(dat, "x2"=sample(1:100, nrow(dat)))

dat$x2<-as.numeric(dat$x2)

ddply(dat, .(year, age, FIPS), summarize, pop = sum(pop),

x2=sum(x2))

Extra Combined

factors


Multi Merge #Create three dataframe

Week_1_sheet <- read.table(text="ID Gender DOB Absences Unexcused_Absences Lates

1 1 M 1997 5 1 14

2 2 F 1998 4 2 3", header=TRUE)

Week_2_sheet <- read.table(text="ID Gender DOB Absences Unexcused_Absences Lates

1 1 M 1997 2 1 10

2 2 F 1998 8 2 2

3 3 M 1998 8 2 2", header=TRUE)

Week_3_sheet <- read.table(text="ID Gender DOB Absences Unexcused_Absences Lates

1 1 M 1997 2 1 10

2 2 F 1998 8 2 2", header=TRUE)

#Consolidate them into a list

WEEKlist <- list(Week_1_sheet , Week_2_sheet , Week_3_sheet)

#transform common variables before the merge

lapply(seq_along(WEEKlist), function(x) {

WEEKlist[[x]] <<- transform(WEEKlist[[x]],

Absences=sum(Absences, Unexcused_Absences))[, -5]

}

) #notice the assignment to the environment

#change names of columns that may overlap with other data frame yet not have duplicate data

lapply(seq_along(WEEKlist), function(x) {

y <- names(WEEKlist[[x]]) #do this to avoid repeating this 3 times

names(WEEKlist[[x]]) <<- c(y[1:3], paste(y[4:length(y)], ".", x, sep=""))}

) #notice the assignment to the environment

#Method using a for loop

DF <- WEEKlist[[1]][, 1:3]

for ( .df in WEEKlist) {

DF <-merge(DF,.df,by=c('ID', 'Gender', 'DOB'), all=T, suffixes=c("", ""))

}

DF

#Method using Reduce

merge.all <- function(frames, by) {

return (Reduce(function(x, y) {merge(x, y, by = by, all = TRUE)}, frames))

}

merge.all(frames=WEEKlist, by=c('ID', 'Gender', 'DOB'))

merge.all(frames=WEEKlist, by=1:3)

#Sample output from the benchmark below:
    test replications elapsed relative user.self sys.self
1   LOOP         1000   10.12  1.62701      7.89        0
2 REDUCE         1000    6.22  1.00000      5.34        0

#BENCHMARKING

require(rbenchmark)

benchmark(

LOOP={DF <- WEEKlist[[1]]

for ( .df in WEEKlist) {

DF <-merge(DF,.df,by=c('ID', 'Gender', 'DOB'), all=T, suffixes=c("", ""))

}},

REDUCE=merge.all(frames=WEEKlist, by=c('ID', 'Gender', 'DOB')),

columns = c( "test", "replications", "elapsed", "relative", "user.self", "sys.self"),

order = "test",

replications = 1000,

environment = parent.frame())


Exporting an output to a file (method 1)
cat(object, file = "name.doc", sep = " ", append = FALSE)
Append means add to the existing file (TRUE) or overwrite the file (FALSE).

Exporting an output to a file (method 2)
write(x, file = "data", ncolumns = if(is.character(x)) 1 else 5, append = FALSE)
Append means add to the existing file (TRUE) or overwrite the file (FALSE).

Exporting an output to a file (method 3)
This method prints all the results directly to the file without naming an object to print.
sink(file = "name.doc", append = FALSE, split = FALSE)
Append means add to the existing file (TRUE) or overwrite the file (FALSE). Split sends it to the file and the command line. See also: capture.output()

Saving R Objects
Save objects and load the same objects into R: save(…, file = "foo.RData"); load("foo.RData")
Save objects and load them back in by assigning them to a new object: saveRDS(mod, "mymodel.rds"); new.object <- readRDS("mymodel.rds")
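A small sketch of the save()/load() and saveRDS()/readRDS() pairs above (the file names are arbitrary):

#EXAMPLE
mod <- lm(mpg ~ wt, data = mtcars)
save(mod, file = "foo.RData")           #writes the object under its own name
load("foo.RData")                       #restores an object named mod into the workspace
saveRDS(mod, "mymodel.rds")             #writes a single object without its name
mod2 <- readRDS("mymodel.rds")          #read it back in and assign to any name
unlink(c("foo.RData", "mymodel.rds"))   #clean up (see Delete a File below)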

EXAMPLE

xc<-pi*3^2

cat(xc,file="xREPORT.doc")

xc2<-((xc+23)/4)-1000

cat(xc2,file="xREPORT.doc",append=T)

unlink("xREPORT.doc")

EXAMPLE

sink("example.doc",append=FALSE,split=TRUE)#append=T if file already exists

mod<-lm(mpg~disp*cyl,data=mtcars)

anova(mod)

summary(mod)

cat("The dog ate the food on ",date(),".\n",sep="")

sink()#turns sink off

xc<-pi*3^2

cat(xc,file="xREPORT.doc")

xc2<-((xc+23)/4)-1000

write(xc2,file="xREPORT.doc",append=T)

unlink("xREPORT.doc")


Delete a File from within [R] unlink("file") or file.remove("file")

Checking What Files Are in a directory (default is working directory) list.files()

EXAMPLE

STRING<-"TEST"

cat(STRING, file = ".TEST.txt")

unlink(".TEST.txt")


Checking the Data Set
Simply type the name of the data set (data frame) and hit Enter (in the example above we called the data set myData).

Look @ beginning or end of a data set
head(); tail()
dataset[1:n, ] or dataset[c(3,4,5,6,100,101,102,200), ]
The psych package also includes a quick way to show the first and last n lines of a data.frame, matrix, or a text object:
headtail(x, hlength=4, tlength=4, digits=2)

Arguments
x         A matrix or data frame or free text
hlength   The number of lines at the beginning to show
tlength   The number of lines at the end to show
digits    Round off the data to digits

Quick way to attach/detach variables
Type: attach(myData) / detach(myData)
Where attach is the command and myData is the imported file (data set). Now when you type the column headers you are expressing the variable name.
Preferred method for attaching data to a function/expression: with(data, expr, ...)

Looking at a Variable (column) from the Data Set (data frame)
Type: myData$Day
Note: the myData and the Day portions are alterable; myData is your data set (data frame) and Day is the name of the variable in the data set. This shows the vector for that variable.
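A quick illustration of the commands above, using the built-in mtcars data as an assumed stand-in for myData:

#EXAMPLE (mtcars stands in for myData)
head(mtcars)              #first 6 rows
tail(mtcars, 3)           #last 3 rows
mtcars$mpg                #look at one variable (column)
with(mtcars, mean(mpg))   #preferred alternative to attach()/detach()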


Manual data Entry x<-c(3,4,3,5,2,3,4,3) or x<-scan() This is a line by line entry system that has the feel of Excel entry. When you come to the end of your data press enter. Look at stored variables/objects ls() #see all objects in environment in R console objects() #see all objects in environment in R console browseEnv() #see all objects in environment in web browser ls.str() #gives all the stored objects in the workspace plus some info on each one Remove all stored variables/objects (see also: Reduce Objects and Junk in Memory) rm(list = ls(all = TRUE)) ls() …to check if the variables have been reset output will be character(0) Remove everything except for functions

rm(list=setdiff(ls(all.names=TRUE), lsf.str(all.names=TRUE)))

Searching the Objects in List term="b"

ls(pattern=paste("^",term,sep=""))

ls(pattern=paste(term,sep="")) Hiding objects in workspace EXAMPLE

.BB<-"You can't see me!"

ls()

.BB

rm(.BB) #to remove the object

.BB

Name the object beginning with a period and it hides the object from the working directory

> x<-scan()

1: 21

2: 2

3: 3

4: 4

5: 5

6: 67

7: 776

8: 565

9: 45

10: 87

11: 567

12: 54

13: 34

14: 34

15: 32

16:

Read 15 items

> mean(x)

[1] 153.0667


Checking the Data Missing Values or Missing Data Finding Missing Values type: NAfun() for a list of NA functions I’ve created

Good implementations that can be accessed through R include Amelia II, Mice, and mitools.

Functions to omit observations with missing values (listwise deletion) na.fail(object, ...) na.omit(object, ...) na.exclude(object, ...) na.pass(object, ...) If ‘na.omit’ removes cases, the row numbers of the cases form the‘"na.action"’ attribute of the result, of class ‘"omit"’. ‘na.exclude’ differs from ‘na.omit’ only in the class of the ‘"na.action"’ attribute of the result, which is ‘"exclude"’. This gives different behaviour in functions making use of ‘naresid’ and ‘napredict’: when ‘na.exclude’ is used the residuals and predictions are padded to the correct length by inserting ‘NA’s for cases omitted by ‘na.exclude’. Impute means or median for missing library(e1071) Not a preferred method

impute(x, what = c("median", "mean"))

Replace Missing Values With A Given Value (see the "for selected columns" method below)
variable[is.na(variable)] <- #   (where # is the value you want to impute)

EXAMPLE

lk<-c(3,4,5,6,NA,3,4,5,6)

jk<-c(.4,NA,.5,.3,.4,.3,NA,NA,.8)

das<-data.frame(lk,jk)

sapply(na.omit(das),mean)

sapply(na.omit(das),median)

das

impute(das, what = c("median"))

impute(das, what = c( "mean"))

EXAMPLE: mtcars2<-mtcars

mtcars2$carb<-with(mtcars2,replace(carb, carb==1,NA))

mtcars2

mtcars2$carb[is.na(mtcars2$carb)]<-1000 #1000 could be 0 or mode etc.

mtcars2   #see also: replace()


Replace Missing Values With A given Value for Selected Columns EXAMPLE A<-c(NA,5,4,7,3,NA,NA)

B<-c(.1,.4,.5,NA,NA,.3,.2)

C<-c(30,NA,40,40,60,50,70)

DF<-data.frame(A,B,C)

DF2<-DF #this is just so we can reset DF

cols <- c(2,3) #select the columns you want to impute with 0's

DF[,cols][is.na(DF[,cols])] <- 0

DF

cols <- c(2,3) #select the columns you don't want to impute with 0's

DF2[,-cols][is.na(DF2[,-cols])] <- 0

DF2

Replace Missing Values With Means by Group impute.mean <- function(x) replace(x, is.na(x), mean(x, na.rm = TRUE)) #generic function dat2 <- ddply(dataframe, ~ group.var, transform, new.var{or.replace old} = impute.mean(var))

dat <- read.table(text = "id taxa length width

101 collembola 2.1 0.9

102 mite 0.9 0.7

103 mite 1.1 0.8

104 collembola NA NA

105 collembola 1.5 0.5

106 mite NA NA", header=TRUE)

library(plyr)

impute.mean <- function(x) replace(x, is.na(x), mean(x, na.rm = TRUE))

dat2 <- ddply(dat, ~ taxa, transform, length = impute.mean(length),

width = impute.mean(width))

dat2[order(dat2$id), ]

> dat

id taxa length width

1 101 collembola 2.1 0.9

2 102 mite 0.9 0.7

3 103 mite 1.1 0.8

4 104 collembola NA NA

5 105 collembola 1.5 0.5

6 106 mite NA NA

> dat2[order(dat2$id), ]

id taxa length width

1 101 collembola 2.1 0.90

4 102 mite 0.9 0.70

5 103 mite 1.1 0.80

2 104 collembola 1.8 0.70

3 105 collembola 1.5 0.50

6 106 mite 1.0 0.75


Generic Replace Missing Values impute <- function(x, fun) { missing <- is.na(x) replace(x, missing, fun(x[!missing])) } ddply(dataframe, ~group, transform, length = impute(length, function))

impute <- function(x, fun) {

missing <- is.na(x)

replace(x, missing, fun(x[!missing]))

}

ddply(dat, ~ taxa, transform, length = impute(length, mean),

width = impute(width, mean))

ddply(dat, ~ taxa, transform, length = impute(length, median),

width = impute(width, median))

ddply(dat, ~ taxa, transform, length = impute(length, min),

width = impute(width, min))

con = function(x) 100

ddply(dat, ~ taxa, transform, length = impute(length, con),

width = impute(width, con))


Create a subset of data with missing values removed per variable Example: A<-c(NA,2:6)

B<-c(11:15,NA)

C<-c(NA,3,NA,5,NA,9)

(DF<-data.frame(A,B,C))

with(DF,which(is.na(B)))

with(DF,which(!is.na(B)))

DF[with(DF,which(is.na(B))),]

DF[with(DF,which(!is.na(B))),]

Output: > A<-c(NA,2:6)

> B<-c(11:15,NA)

> C<-c(NA,3,NA,5,NA,9)

> (DF<-data.frame(A,B,C))

A B C

1 NA 11 NA

2 2 12 3

3 3 13 NA

4 4 14 5

5 5 15 NA

6 6 NA 9

> with(DF,which(is.na(B)))

[1] 6

> with(DF,which(!is.na(B)))

[1] 1 2 3 4 5

> DF[with(DF,which(is.na(B))),]

A B C

6 6 NA 9

> DF[with(DF,which(!is.na(B))),]

A B C

1 NA 11 NA

2 2 12 3

3 3 13 NA

4 4 14 5

5 5 15 NA

#LOOK AT: library(Hmisc); aregImpute()


Assumption Testing

Function for Assessing Assumptions library(gvlma) gvlma(lm(model))

Normality Assumption (remember the assumption is usually normality of residuals)

#===============================================================================
# LOADING THE LIBRARIES USED
#===============================================================================
library(MASS);library(nortest);library(fBasics);library(psych);library(timeDate)
#===============================================================================
# GENERATING SOME DATA
#===============================================================================
x.norm<-rnorm(n=200,m=10,sd=2)
#===============================================================================
# LOOKING AT THE GRAPHS (remember somewhat manipulated by the scale you choose)
#===============================================================================
par(mfrow=c(3,1))
h<-hist(x.norm,main="Histogram of observed data w/ normal curve",col="red")
xfit<-seq(min(x.norm),max(x.norm),length=40)
yfit<-dnorm(xfit,mean=mean(x.norm),sd=sd(x.norm))
yfit <- yfit*diff(h$mids[1:2])*length(x.norm)
lines(xfit, yfit, col="blue", lwd=2)
plot(density(x.norm),main="Density estimate of data")
polygon(density(x.norm),col="green", border="blue")
truehist(x.norm,main="True Histogram of observed data")
#===============================================================================
# LOOKING AT THE QQ PLOTS (very effective approach)
#===============================================================================
win.graph()
par(mfrow=c(1,2))
qqnorm(x.norm, col="red")
qqline(x.norm)
qqnormPlot(x.norm)
#===============================================================================
# STATISTICAL TESTS OF NORMALITY (vary greatly; be cautious; p>.05 = normal)
#===============================================================================
ksnormTest(x.norm)      #Kolmogorov-Smirnov (for large sample) normality test
shapiro.test(x.norm)    #Shapiro-Wilk's (for small samples) test for normality
shapiroTest(x.norm)     #Shapiro-Wilk's test for normality
jarqueberaTest(x.norm)  #Jarque-Bera test for normality
dagoTest(x.norm)        #D'Agostino normality test
adTest(x.norm)          #Anderson-Darling normality test
cvmTest(x.norm)         #Cramer-von Mises normality test
lillieTest(x.norm)      #Lilliefors (Kolmogorov-Smirnov)
pchiTest(x.norm)        #Pearson chi-square normality test
sfTest(x.norm)          #Shapiro-Francia normality test
kurtosis(x.norm,type=1) #type 1 biased; type 2 unbiased
kurtosis(x.norm,type=2) #excess selected = moment method or -3 (0 is normal)
kurtosi(x.norm);library(e1071)
kurtosis(x.norm, type=1);kurtosis(x.norm, type=2);kurtosis(x.norm, type=3)
skewness(x.norm, type=1);skewness(x.norm, type=2);skewness(x.norm, type=3)
skew(x.norm);win.graph();par(mfrow=c(1,2))
mardia(x.norm)   #for multivariate data
probplot(x.norm)

EXAMPLE

library(gvlma)

(gvmodel <- with(mtcars,gvlma(lm(mpg~disp*hp*cyl))))

summary(gvmodel)

multiG(27,11,6,2,c(1:12))

plot(gvmodel,onepage=F)


Addressing Non-Normality Assumptions: the Script Folder has several skew and kurtosis checkers.

One Function to Conduct Multiple Tests of Normality

Info on Normality from Andy Field [CTRL + CLICK HERE]

Transforming Skew (for positive skew), 3 Methods:
1. Log Transformation: log10(Xi)
   You can't log values <= 0, so if your data has these you must add a constant to adjust the data.
2. Square Root Transformation: sqrt(Xi)
   You can't sqrt() values < 0, so if your data has negative numbers you must add a constant to adjust the data.
3. Reciprocal Transformation: 1/(X_highest - Xi)
   The scores are reverse scored (X_highest - Xi) to overcome the effect of the inverse making the big scores small and the small scores big.
All of these transformations can be done to negative skew as well, but the data must be reverse scored (X_highest - Xi) first to reverse the skew.
REMEMBER: Transform one numeric variable, transform all of them.

Fixing & Transforming Kurtosis
1. First check for outliers and, if possible, delete any that fall more than a chosen number of SD from the regression line.
2. Square Transformation: (Xi)^2

Function for normalizing Data: uniformDAT(x)
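A small sketch of the three positive-skew transformations listed above, applied to a made-up right-skewed vector:

#EXAMPLE (made-up right-skewed data)
Xi <- rexp(100, rate = 1) * 10    #positively skewed scores
log.x   <- log10(Xi + 1)          #log transformation (constant added so no values <= 0)
sqrt.x  <- sqrt(Xi)               #square root transformation
recip.x <- 1/(max(Xi) - Xi + 1)   #reciprocal of the reverse-scored values (constant avoids dividing by 0)
par(mfrow=c(2,2))
hist(Xi); hist(log.x); hist(sqrt.x); hist(recip.x)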

# NORMAL TRANSFORMATION FUNCTION CODE WITH EXAMPLE

#===============================================================

uniformDAT <- function (x) {

x <- rank(x,

na.last = "keep",

ties.method = "average")

n <- sum(!is.na(x))

x / (n + 1)

}

normalize <- function (x) {

qnorm(uniformDAT(x))

} #===============================================================

# THE DATA

#===============================================================

par(mfrow=c(3,2))

x1<-sample(1:100,100, replace=T)

x2<-ifelse(x1<21,x1+79,x1)

x3<-ifelse(x1>80,x1-79,x1)

#===============================================================

# WHAT THE FUNCTION DOES GRAPHICALLY

#===============================================================

f(x1, FUN = normalize, main = "Distribution 1")

f(x2, FUN = normalize, main = "Right Skewed distribution")

f(x3, FUN = normalize, main = "Left Skewed distribution")

#===============================================================

# LOOKING AT THE DATA

#===============================================================

list("x1 data before"=x1,"x1 data after"=uniformDAT(x1),

"left skewed data before"=x2,

"left skewed data after"=uniformDAT(x2),

"right skewed data before"=x3,

"right skewed data after"=uniformDAT(x3))

source("C:/Users/Rinker/Desktop/PhD Program/CEP 523-Stat Meth Ed Inference/R Stuff/Scripts/Assumption Testing/Tests of

Normality.txt")

Source Path

source("C:/Users/Rinker/Desktop/PhD Program/CEP

523-Stat Meth Ed Inference/R

Stuff/Scripts/Assumption Testing/Normal

Tansformation.txt

mod1<-lm(mpg~disp,data=mtcars)

mod2<-lm(log(mpg)~disp,data=mtcars)

windows(h=6,w=12)

par(mfrow=c(1,2))

with(mtcars,plot(mpg~disp,main="Raw Plot"))

abline(reg=mod1,lty=3,col="green")

with(mtcars,plot(log(mpg)~disp,main="Log

Transformation"))

abline(reg=mod2,lty=3,col="blue")

mod1;mod2

#Graph example: log transformation (refers to the plotting code above)


Homogeneity Assumption
Equal variance of 2 populations: var.test(x, y)
Equal variance of groups: bartlett.test(numeric variable, grouping factor) and levene.test(numeric variable, grouping factor)

EXAMPLE

#=================================================================================

# TESTING IF TWO SAMPLE VARIANCES ARE EQUAL

#=================================================================================

# GENERATING THE DATA

x1<-rnorm(1:1000,100)

y1<-rnorm(1:1000,100)

#=================================================================================

# MEAN AND STANDARD DEVIATION

#=================================================================================

descriptives<-data.frame(c(mean(x1),sd(x1)),c(mean(y1),sd(y1)))

colnames(descriptives)<-c("x1","y1");rownames(descriptives)<-c("mean","sd")

descriptives

#=================================================================================

# GRAPH THE DATA

#=================================================================================

par(mfrow=c(2,1));library(descr)

histkdnc(x1,main="x1");histkdnc(y1,main="y1")

#=================================================================================

# TESTING EQUAL VARIANCES; function--> var.test()

#=================================================================================

list(" NOTE: p > .05; not significantly

different"=var.test(x1,y1,alternative="two.sided",confi.level=.95))

EXAMPLE

#=================================================================================

# TESTING IF TWO SAMPLE VARIANCES ARE EQUAL

#=================================================================================

# GENERATING THE DATA

rannum<-c(sample(1:5,1000,replace=T))

factor<-c(recodeVar(rannum,src=c(1,2,3,4,5),

tgt=c("blue","black","red","green","orange"), default=NULL, keep.na=TRUE))

dep.var<-rnorm(1000)

color.df<-data.frame(factor,dep.var);tail(color.df)

#=================================================================================

# MEAN AND STANDARD DEVIATION

#=================================================================================

library(doBy)

summaryBy(dep.var~factor, data = color.df,

FUN = function(x) { c(n = length(x),mean = mean(x), sd = sd(x)) } )

#=================================================================================

# GRAPH THE DATA

#=================================================================================

black<-subset(color.df,factor=="black");blue<-subset(color.df,factor=="blue")

green<-subset(color.df,factor=="green");orange<-subset(color.df,factor=="orange")

red<-subset(color.df,factor=="red");par(mfrow=c(3,2));library(descr)

histkdnc(dep.var,main="Overall");histkdnc(black$dep.var,main="black",col="black")

histkdnc(blue$dep.var,main="blue",col="blue");histkdnc(green$dep.var,main="green",col="green")

histkdnc(orange$dep.var,main="orange",col="orange");histkdnc(red$dep.var,main="red",col="red")

#=================================================================================

# TESTING EQUAL VARIANCES; function--> var.test()

#=================================================================================

library(lawstat)

list("----------------------------------------------------------------------------------

Levene's Test"=levene.test(dep.var, factor,location= "mean"),"------------------------------------

----------------------------------------------

"=bartlett.test(dep.var, factor))


Equal variance of groups less sensitive to outliers fligner.test(x, ...) fligner.test(x, g, ...) fligner.test(formula, data, subset, na.action, ...) fligner.test(list(group a,group b,group c,group…n))

Arguments x a numeric vector of data values, or a list of numeric data vectors.

g a vector or factor object giving the group for the corresponding elements of x. Ignored

if x is a list.

formula a formula of the form lhs ~ rhs where lhs gives the data values and rhs the

corresponding groups.
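A quick example of fligner.test() on a built-in data set (InsectSprays is just an assumed illustration, not from the original notes):

#EXAMPLE
fligner.test(count ~ spray, data = InsectSprays)   #formula interface
with(InsectSprays, fligner.test(count, spray))     #x, g interface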


Sphericity (Assumption of Repeated Measures Test)
Sphericity is, in a nutshell, that the variances of the differences between the repeated measurements should be about the same.
mauchly.test(); Greenhouse-Geisser correction

Outliers
library(mvoutlier) & library(outliers)
?influence.measures
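The outlier/influence pointers above can be sketched minimally with base R's influence.measures() (the mtcars model is an assumed example):

#EXAMPLE
mod <- lm(mpg ~ wt + hp, data = mtcars)
im <- influence.measures(mod)   #DFBETAS, DFFITS, covariance ratios, Cook's distance, hat values
summary(im)                     #prints only the potentially influential observations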


Create Sequences of Integers
Input: 1:5
Output: 1 2 3 4 5

Create Sequences of Real Numbers
Input: seq(from=3, to=7, by=.5)   NOTE: seq(3,7,.5) would work too
Output: 3.0 3.5 4.0 4.5 5.0 5.5 6.0 6.5 7.0

from, to: the starting and (maximal) end value of the sequence.
by: increment of the sequence (leaving this out is the same as n:m).
length.out: desired length of the sequence. A non-negative number, which for 'seq' and 'seq.int' will be rounded up if fractional.

Repeat Integer Pattern (term: replicate)
rep(pattern, times=)
rep(pattern, times=, each=)

Search Patterns Within Vectors
rle(x)
inverse.rle(rle(x))

EXAMPLE

rep(1:2, times=1,each=25)

rep(1:2, times=25)

#=================================================================

> rep(1:2, times=25)

[1] 1 2 1 2 1 2 1 2 1 2 1 2 1 2 1 2 1 2 1 2 1 2 1 2 1 2 1 2 1 2 1 2 1 2 1 2 1 2

[39] 1 2 1 2 1 2 1 2 1 2 1 2

> rep(1:2, times=1,each=25)

[1] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 2 2 2 2 2 2 2 2 2 2 2 2 2

[39] 2 2 2 2 2 2 2 2 2 2 2 2

x <- rev(rep(6:10, 1:5))

rle(x)

inverse.rle(rle(x))

z <- c(TRUE,TRUE,FALSE,FALSE,TRUE,FALSE,TRUE,TRUE,TRUE)

rle(z)

inverse.rle(rle(z))


Generating Random Numbers, Integers and Categorical Variables (random sample) Random Normal rnorm(n=, m=,sd=) Where n is amount of samples Random Normal Between Certain Values 1 x<- rnorm(n=500, m=42, sd=10) x <- x[x>=30 & x <=50] Random Normal Between Certain Values 2(doesn’t throw away samples) n <- 1000 #samples desired L <- .2 #lower limit U <- .8 #upper limit m <- 1 #mean s <- 1 #sd x <- qnorm(runif(n, pnorm(L, mean=m, sd=s), pnorm(U, mean=m, sd=s)), mean=m, sd=s) x Random Integers sample(seq,n, replace=T) Where seq is a sequence such as 10:80, and n is amount of samples Random Integers 2 sample.int(x,n, replace=T) x= all integers ≤ that value

Random Categorical Vector (See generate factor below) sample(categorical.vector,n,replace=T)

Example colors<-c(sample(c("blue","red","green","orange"),10,replace=T))

hue<-abs(rnorm(10))

colorsDF<-data.frame(colors,hue)

#colors is the random creation of a categorical variable

colors;hue;colorsDF

Note: the sample() with the recodeVar() from the doBy library could also be used for generating a random character vector. (See also relevel which is more efficient)

library(doBy) recodeVar(sample(1:5,25,replace=T), src=c(1,2,3,4,5), tgt=c("a","e","i","o","u"), default=NULL, keep.na=TRUE) [1] "u" "a" "o" "u" "e" "i" "i" "i" "u" "u" "o" "o" "i" "e" "u" "u" "u" "o" "a" "a" "u" "e" "o" "a" "u"

EXAMPLE

sample.int(2,10,replace=T)#flip a coin 10 times

sample.int(6,7,replace=T)#roll a die 7 times

sample.int(52,5,replace=T)#pick a card 5 times

ALLOW REPRODUCIBLE RANDOM NUMBERS
NOTE: you can use set.seed() to enable someone else to reproduce exactly the same random numbers.
set.seed(15) # allow reproducible random numbers

sample(1:2, size=5, replace=TRUE)

sample(1:2, size=5, replace=TRUE)


Generate Factor (non random) gl(n, k, length = n*k, labels = 1:n, ordered = FALSE) ARGUMENTS

Convert scores to z-scores
scale(vector)
The SCALEfun function below is an example of how the z-score "normalizes" the data. The code creates a vector of random integers and then converts it to a z-score vector. There is also a comparison of both vectors with histograms.

Probability calculation, computes the combinations
choose(n,k)
EXAMPLE: choose(54,5)

1/choose(54,5)

Generate all the possible outcomes of a vector given each element is used only once combn(x,m) x vector source for combinations, or integer n for x <- seq_len(n).

m number of elements to choose

SCALEfun<-function(){

ay<-sample(1:100, 25, replace=T)

az<-c(scale(ay, center = TRUE, scale = TRUE))

par(mfrow=c(2,1));library(descr)

histkdnc(ay,main="Before z score Transformation",col="red")

histkdnc(az,main="After z score Transformation",col="green")

list(ay,az)

}

EXAMPLE

gl(3, 5, length=100,labels = c("Control", "Treat","Died"))

EXAMPLES

combn(letters[1:4], 2)

combn(LETTERS[1:10], 9)

combn(0:10,10)


Generate All Possible Outcomes For the outside of a matrix outer(vector 1,vector 2,FUN=) EXAMPLES outer(month.abb, 1999:2003, FUN = "paste")

data.frame(outer(c("R","r"), c("R","r"), FUN = "paste"))#Punnets Square

outer(c("H","T"), 1:6, FUN = "paste")#outcomes of flipping coin and rolling a die

outer(LETTERS[1:10], 0:9, FUN = "paste")

outer(0:9, 0:9, FUN = "*") #multiplication table 0-9

outer(0:20, 0:20, FUN = "*") #multiplication table 0-20

outer(0:9, 1:9, FUN = "/") #division table 0-9

outer(0:9, 0:9, FUN = "^") #exponential table 0-9

outer(0:9, 0:9, FUN = "-") #exponential table 0-9

outer(0:9, 0:9, FUN = "+") #exponential table 0-9

Generate All Possible Combinations for a List of Factors expand.grid(factor.name1=c("factor levels"), factor.name2=c("factor levels"), factor.name…n=c("factor levels"))

List Prime Numbers library(matlab) primes(n) Perform Prime Factorization library(matlab) factors(n) Create Magic Squares library(matlab) magic(n)
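A minimal sketch of the matlab package calls listed above (assumes the matlab package is installed):

#EXAMPLE
library(matlab)
primes(20)    #prime numbers up to 20
factors(60)   #prime factorization of 60
magic(4)      #4 x 4 magic square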

Generate a list of Dummy Codes for a Factor model.matrix(~factor-1)

EXAMPLE

expand.grid(age=c(4:10),academic.level=c("high","med","low"),sex=c("male","female"))

EXAMPLE:

(iris.dummy<-with(iris,model.matrix(~Species-1)))

(IRIS<-data.frame(iris,iris.dummy))


Data Manipulation

Changing digits
options(digits=#)          #This is a global change
print(x, digits=#)         #Local change
cat(format(x, digits=#))   #Local change for functions
round(x, digits=#)         #Rounds x to # digits (0 = integer); SEE Cut Points for an example of rounding a data frame
signif(x, digits=#)        #Rounds x to # significant digits
options(scipen=99)         #Eliminates scientific notation (Global Change)
format(x, …)               #SEE BELOW FOR MORE ABOUT FORMAT
Force rounding with a certain number of digits:
sprintf("%.49f", (1+sqrt(5))/2)
sprintf("%.49f", pi)
library(mpc); mpc(1, 3000) / mpc(998001, 3000)

Format (ENABLES digits): format(x, …)

Arguments

x any R object (conceptually); typically numeric.

trim logical; if FALSE, logical, numeric and complex values are right-justified to a common width: if TRUE the leading blanks for justification are suppressed.

digits how many significant digits are to be used for numeric and complex x. The default, NULL, uses getOption(digits).

This is a suggestion: enough decimal places will be used so that the smallest (in magnitude) number has this many significant digits, and also to satisfy nsmall. (For the interpretation for complex numbers see signif.)

nsmall the minimum number of digits to the right of the decimal point in formatting real/complex numbers in non-

scientific formats. Allowed values are 0 <= nsmall <= 20.

justify should a character vector be left-justified (the default), right-justified, centred or left alone.

width default method: the minimum field width or NULL or 0 for no restriction.

AsIs method: the maximum field width for non-character objects. NULL corresponds to the default 12.

na.encode logical: should NA strings be encoded? Note this only applies to elements of character vectors, not to numerical or logical NAs, which are always encoded as "NA".

scientific Either a logical specifying whether elements of a real or complex vector should be encoded in scientific format, or an integer penalty (see options("scipen")). Missing values correspond to the current default penalty.

...

further arguments passed to or from other methods.
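A short, assumed example of the digit/format controls described above:

#EXAMPLE
x <- 1234.56789
format(x, digits = 4)   #4 significant digits: "1235"
format(x, nsmall = 2)   #at least 2 digits after the decimal point
format(x, width = 15)   #pad to a minimum field width of 15
signif(x, digits = 3)   #round to 3 significant digits: 1230
sprintf("%.2f", x)      #force exactly 2 decimal places: "1234.57"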

Round to the nearest fraction
x <- c(4.2, 4.3, 4.8)

#Method 1 (generalizes to any rounding)

library(plyr)

round_any(x, 3)

round_any(x, 1)

round_any(x, 0.5)

round_any(x, 0.2)

#Method 2 (nearest half)

round(x*2)/2

#Method 3 (nearest half)

round(x/5, 1)*5

can be used to round integers as well

options(digits=10)

(x<-pi*12345)

round(x,-4:4) #negative rounds the integer


Indexing dataframe$object or dataframe[,"object"]

Determine number of observations in a dataframe

nrow(dataframe)

Determine number of variables in a dataframe

ncol(dataframe)

Determine number of levels of a factor
nlevels(factor)

Look at beginning or end of a data frame
head(dataframe); dataframe[1:10, ]   (I compiled this into the function HEAD() in .First)
tail(dataframe); dataframe[(nrow(dataframe)-10):(nrow(dataframe)), ]   (compiled as the function TAIL() in .First)
begend(dataframe)   Looks at the first 5 and last 5 observations of a dataframe

Locating the info for a single row (observation)
Type: data[10,]
Where data is your data frame (set) and 10 is the tenth observation.

Changing a numeric variable by a constant
Type: b<-sa*.45
Where b is the new variable vector name, sa is the original numeric variable, and .45 is the constant. This could be very useful for creating new combined variables:
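A brief illustration of the indexing commands and constant idea above (mtcars is an assumed stand-in for your data; the kpl variable is made up):

#EXAMPLE (mtcars stands in for your data frame)
nrow(mtcars)                  #number of observations
ncol(mtcars)                  #number of variables
nlevels(factor(mtcars$cyl))   #number of levels of a factor
mtcars[10, ]                  #info for a single row (observation)
mtcars2 <- transform(mtcars, kpl = mpg * 0.425)   #new combined variable from a constant (mpg to km per liter, approximately)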

Minimal Specifications
R looks for objects within the environment you specify that minimally meet your requirements:
Example: CO2$T #Both Treatment and Type fit this so R returns NULL

CO2$Ty

CO2$P


List all the variables created Type: ls() or objects() Output Look at all the code and commands you’ve typed in a session in a new window history(number) Sys.setenv(R_HISTSIZE=10000) #increase the 512 line limit even more Pull up the last value A quasi constant in R in that it is a non function that takes the value of the last input .Last.value EXAMPLE Break a data frame into Groups of a factor split(df,factor) Creates a list of data frames by the Groups of the Chosen Factor

EXAMPLE

warpbreaks

(groups<-with(warpbreaks,split(warpbreaks,tension))) #method 1

with(groups,lapply(groups,mean))

with(groups,lapply(groups,nrow))

with(groups,lapply(groups,sd))

takes.a.while <- function(){

Sys.sleep(10)

rnorm(20)

}

takes.a.while()

# Oh no I forgot to assign it to a variable

lifesaver <- .Last.value

lifesaver


Creating a subset of data (useful for control vs. treatment groups) Using your original data set type: gg<-subset(a,g=="f", select=NULL) Where gg is the name of the new data subset, subset is the function, a is the large data set name, g is the variable you wish to make a subset around “f” is the level you want to isolate to make a new data set of, and select are the columns you want.

[Screenshots in the original notes: the original data set and the new subset]

Remember to rename the variables for each data set in the same way you did with the larger data set (let’s say we have a female and male subset for example). This can be done for any number of variables in the subset, enabling tests on the subgroups.

The summary is a great follow up to generate some quick and useful information about each subset: Type: summary(gg)

Also see select cases below

subset(mtcars,select = mpg:vs)

subset(airquality, Temp > 80, select = c(Ozone, Temp))

subset(airquality, Day == 1, select = -Temp)

subset(airquality, select = Ozone:Wind)


Select certain rows or columns by criteria (other ways to subset; see the examples below)

Select columns from a data frame that are just numeric or just factors
mtcars2[, sapply(mtcars2, is.numeric)]   #same as NUM(df) in useful functions
mtcars2[, sapply(mtcars2, is.factor)]    #same as FAC(df) in useful functions

NAs: please also see Missing Values for created functions and specific handling of NAs
Create a Subset of non-NA values for just one column/variable
dataframe2 <- subset(dataframe, factor != is.na(factor))

EXAMPLES:

#========================================================

#select mpg = 21

with(mtcars,mtcars[mpg==21,])

#========================================================

#select mpg greater than 30

with(mtcars,mtcars[mpg>=30,])

#========================================================

#select mpg greater than 30 and disp less than 80

with(mtcars,mtcars[mpg>=30&disp<80,])

#========================================================

#select mpg greater than median mpg and disp over 110

with(mtcars,mtcars[mpg>=median(mpg)&disp>110,])

#========================================================

mtcars.8cyl<-mtcars[cyl==8,]

mtcars.8cyl<-mtcars.8cyl[-c(2)]

mtcars.8cyl #this is the same as:

subset(mtcars[-2],cyl==8)

EXAMPLE: mtcars2<-mtcars

mtcars2$carb<-with(mtcars2,replace(carb, carb==1,NA))

mtcars2;paste("n is = to",length(mtcars2$mpg))

mtcars3<-subset(mtcars2, carb!=is.na(carb))

#Line above is the code (the rest is generating a dataset with NAs in it

mtcars3;paste("n is = to",length(mtcars3$mpg))

mtcars[cyl==8|cyl==6,]

mtcars[cyl==8|gear>=5,]

mtcars[cyl==8&mpg>=18,]

mtcars[cyl==8&mpg>=18|wt>=3.5,]

mtcars[cyl==8&mpg>=18&wt>=3.5,]

mtcars[cyl==8|cyl==6,]

subset(mtcars, (cyl %in% c(6, 8)))

subset(mtcars, !(cyl %in% c(6, 8)))

subset(CO2, !(Plant %in% c("Qn1", "Mc3", "Mc1", "Mn2")))

mtcars2<-mtcars

mtcars2[,c("mpg","disp","wt")] #select some columns meth. 1

subset(mtcars2,select=c(mpg,disp,wt)) #select some columns meth. 2

#Subset really fine tunes the selection:

subset(mtcars2,select=c(mpg,disp,wt),subset=c(mpg>25&wt<=4))

subset(mtcars2,select=c(mpg,disp,wt),subset=c(mpg>25&wt<=4|disp>=400))

LOGICAL OPERATORS & | ! COMPARISON OPERATORS == >= <= != Value Matching %in%


Index a Single Column Without Dropping the Variable Name
subset(dataframe, select=column.number)
dataframe[, column.number, drop = FALSE]
dataframe[column.number] or dataframe[column.name]   #no comma

#EXAMPLES

names(subset(mtcars, select=1))

names(mtcars[,1, drop = FALSE])

mtcars[1] or mtcars["mpg"]

Keep Column Names that Do Not Conform to R Standards
data.frame('a b'=1, check.names=F)
transform(data.frame('a b'=1, check.names=F), `c d`=`a b`, check.names=F)

Split a Numeric Vector by a Categorical Vector
split(numeric, factor)

#================================================================

# CREATE THE DATA SET

#================================================================

colors<-c(sample(c("blue","red","green","orange"),20,replace=T))

hue<-abs(rnorm(20))

colorsDF<-data.frame(colors,hue)

#================================================================

# split(numeric.var,factor)

#================================================================

with(colorsDF,split(hue,colors))

#================================================================

# USING SPLIT FOR MEANS AND SD etc.

#================================================================

sapply(with(colorsDF,split(hue,colors)),mean)

sapply(with(colorsDF,split(hue,colors)),sd)

OUTPUT > with(colorsDF,split(hue,colors))

$blue

[1] 0.0143338132 1.9922393211 0.7892910777 0.0004594093

$green
[1] 0.5897572 0.7480668 2.8692182 0.2506951

$orange
[1] 1.3469976 0.8757391 1.4951192 1.3781447

$red
[1] 0.05447693 0.10730018 2.20397056 0.05800449 1.84962318 0.15243645 0.29141207 0.05877585

===============================================

USING SPLIT FOR MEANS AND SD etc.

================================================

> sapply(with(colorsDF,split(hue,colors)),mean)

blue green orange red

0.6990809 1.1144343 1.2740002 0.5970000

> sapply(with(colorsDF,split(hue,colors)),sd)

blue green orange red

0.9376117 1.1881110 0.2730569 0.8909689


Sorting & Ordering Observations #1 (see also arrange() & orderBy())
NOTE: You can use sort() for vectors
x[order(x$B),]        #sort a dataframe by the order of the elements in B
x[rev(order(x$B)),]   #sort the dataframe in reverse order
with(mtcars, mtcars[order(-cyl, gear, carb), ])   #sort ascending and descending (use of -)

Sorting #2 arrange() MORE EFFICIENT THAN ORDER library(plyr) arrange(df,…) Sorting #3 orderBy() I like this one the best orderBy(~formula, data=)

Duplicate Certain Rows of a Data Frame duplicate rows reprow(dataframe, column, value)

#EXAMPLE

mtcars[1:10] #first 1-10 of data set

#order ascending

#rev descending

mtcars[rev(order(mtcars$cyl)),][1:10] #Sort by cyl (descending)

mtcars[order(mtcars$cyl),][1:10] #Sort by cyl (ascending)

mtcars[order(mtcars$cyl,mtcars$vs),][1:10] #Sort by cyl then vs (asc.)

mtcars[order(mtcars$cyl,mtcars$vs,mtcars$gear),][1:10] #Sort by cyl,vs, then gear( asc.)

library(plyr)

(mtcars2<-data.frame(mtcars[1:16,],"grade"=c(rep(1:3,each=4,times=1),rep("k",4))))

levels(mtcars2$grade) <- c("4","3","2","1","k")

arrange(mtcars2, -as.numeric(factor(grade,levels=c("k","1","2","3")),cyl,-disp))

arrange(mtcars, cyl, desc(disp))

library(doBy)

(mtcars2<-data.frame(mtcars[1:16,],"grade"=c(rep(1:3,each=4,times=1),rep("k",4))))

levels(mtcars2$grade) <- c("4","3","2","1","k")

orderBy(~-grade + cyl, data=mtcars2)

orderBy(~-grade + -disp + hp, data=mtcars2)

#THE FUNCTION

reprow <- function(dataframe, column, value) {

dataframe$ID <- 1:nrow(dataframe)

DF <- data.frame(rbind(rbind(dataframe[which(dataframe[,column] %in% value),

], dataframe[which(dataframe[ ,column] %in% value), ]),

rbind(dataframe[which(!dataframe[,column] %in% value), ])))

DF <- DF[order(DF$ID), ]

DF$ID <- NULL

rownames(DF) <- 1:nrow(DF)

DF

}

#EXAMPLES

reprow(mtcars, 'cyl', 4) #repeats any column with that value a second time

reprow(mtcars, 'cyl', c(4, 6))


Replace (go through the data set, find values, and replace them with a new value) [recode; missing values]

replace(dataframe, list, values)   #remember to assign this to some object, i.e., x <- replace(dataframe, dataframe==-9, NA)
#similar to the operation x[x==-9] <- NA
This can also be done for just one variable in a data frame by saving the output as follows:
dataframe$variable <- with(dataframe, replace(variable, variable==-9, NA))

EXAMPLE: mtcars2<-mtcars

mtcars2$carb<-with(mtcars2,replace(carb, carb==1,NA)); mtcars2 #Way 1

mtcars2<-mtcars #RESET mtcars2

mtcars2$carb[mtcars2$carb==4]<-NA; mtcars2 #Way 2

Remove values in a vector subset could also be used vector [!is.element(x,c(values to remove))]

Example

x <- sample(0:20,100,replace=T)

table(x)

x2<-x[!is.element(x, c(0,9,20))] #removes the values 0, 9, 20

table(x2)


Rename a column (rename a variable), method 1
names(dataframe)[c(column#)] <- "new.name"

Rename a column (rename a variable), method 2
library(reshape)
rename(data.frame, vector of name conversions)

Rename a column (rename a variable), method 3
library(gdata)
rename.vars(data, from="", to="", info=TRUE)

Finding duplicate rows in a data frame (see matching): unique(data.frame)

Locating variables and values which(x==" ") Finding Duplicate Entries in a Column duplicated(x)

Arguments x a vector or a data frame or an array or NULL. incomparables a vector of values that cannot be compared. FALSE is a special value, meaning that all values can be compared, and may be the only value accepted

#EXAMPLE

iris2<-data.frame(rbind(iris[1:15,],iris[1,],iris[3,]))

iris2<-with(iris2,iris2[order(Sepal.Length),])

rownames(iris2)<-c(1:17);iris2

mess<-c("\bNOTE:\n",

"\bThe unique() function searches for duplicates;\n",

"\bnotice observation 6 & 14 are elimnated\n")

unique(iris2,incomparables = FALSE)

cat(mess)

Example:

rename(mtcars, c(wt = "weight", cyl = "cylinders"))

EXAMPLE

(IRIS<-iris[1:30,])

IRIS$Petal.Length[!duplicated(IRIS$Petal.Length)]

IRIS$Petal.Length[duplicated(IRIS$Petal.Length)]

which(duplicated(IRIS$Petal.Length))

which(!duplicated(IRIS$Petal.Length))

EXAMPLE

dat<-mtcars

rename.vars(dat, from="mpg", to="new", info=TRUE)

rename.vars(dat, from=c("wt","mpg"), to=c("new1","new2"), info=TRUE)

example

mtcars.let<-data.frame("lets"=rep(letters[c(19:26)],4) ,mtcars) #making a dataframe

with(mtcars.let,which(lets=="v"))

#[1] 4 12 20 28

with(mtcars.let,which(cyl=="6"))

#[1] 1 2 4 6 10 11 30

EXAMPLE

names(mtcars)[names(mtcars)=='hp'] <-'sweet.new.hp'

names(mtcars)

names(mtcars)[3] <- "new.name"

names(mtcars)

names(mtcars)[c(2,5)] <- c("new.name2","new.name3")

names(mtcars)


Finding Truly Unique Items in Vector (3 methods) #DATA SET:

x <- c(378, 380, 380, 380, 380, 360, 187, 380)

#METHOD 1 [fastest and numeric/unsorted]

setdiff(unique(x), x[duplicated(x)])

#METHOD 2 [medium speed & numeric/sorted]

y <- rle(sort(x)); y[[2]][y[[1]]==1]

#METHOD 3 [slowest/character/sorted]

b <- table(x); names(b[b==1])

test replications elapsed relative user.self sys.self user.child sys.child

3 METHOD_1 1000 0.08 1.000 0.06 0 NA NA

2 METHOD_2 1000 0.25 3.125 0.23 0 NA NA

1 METHOD_3 1000 0.61 7.625 0.48 0 NA NA

Same idea applied to character data set.seed(100)

x <- sample(c("Certin", "features", "of", "the", "setting", "affected"), 13, replace=T)

x

hapax1 <- function(x) {x <- na.omit(tolower(x)); setdiff(unique(x), x[duplicated(x)])}

hapax1(x)

hapax2 <- function(x)names(table(tolower(x))[table(tolower(x))==1])

hapax2(x)

hapax3 <- function(x) {y <- rle(sort(tolower(x))); y[[2]][y[[1]]==1]}

hapax3(x)


Using Which to Find Even and Odd Numbers even odd

which(vector%%2 == 1) #find odd
which(vector%%2 == 0) #find even
Using TRUE/FALSE to find odd and even elements / select every other element of a vector
object[c(T, F)] #odds
object[c(F, T)] #evens
Add a column to a data set or overwrite an existing column (add variable)

transform(data.set.name,new.var=(Science.Comprehension*10))

Delete a variable from a data set (see also subset()) drop variable drop column delete variable delete column

You could select all the columns you want and create a subset or: POSITIVE SELECTION EXAMPLE:

mtcars2<-mtcars

mtcars2[,1:10]

NEGATIVE SELECTION METHODS EXAMPLES: snegative indexing

mtcars[, -which(names(mtcars) == "carb")]

mtcars[, names(mtcars) != "carb"]

mtcars[, !names(mtcars) %in% c("carb")]

mtcars[, -match(c("carb"), names(mtcars))]

mtcars2<-mtcars;mtcars2$hp <- NULL

#--------------------------------

library(gdata)

#--------------------------------

remove.vars(mtcars2, names="mpg", info=TRUE)

remove.vars(mtcars2, names=c("wt","mpg"), info=TRUE)

Logic Testing & Coercion test null test na test missing test factor test numeric
is.numeric() is.factor() is.data.frame() is.character() is.vector() is.na() is.null()

EXAMPLE

airquality<-airquality[1:10,]

transform(airquality, Ozone = -Ozone)

transform(airquality, new = -Ozone, Temp = (Temp-32)/1.8)

#Notice that if a variable is unknown a new variable is created

attach(airquality)

transform(Ozone, logOzone = log(Ozone))

#EXAMPLES

with(mtcars,which(hp%%2 == 1))

with(mtcars,which(hp%%2 == 0))

with(mtcars,sapply(mtcars,even<-function(x){which(x%%2 == 0)}))#apply to data frame

#EXAMPLES

mtcars[c(T,F)] #every odd column

mtcars[c(F,T)] #every even column

mtcars[c(T,F), ] #odd row

mtcars[c(F,T), ] #even row


Logic Testing on a Whole Data Set str(dataset)
Coerce the vectors using: as.numeric(y) as.factor(y) as.data.frame(y) as.character(y) as.vector(y)
Matching smatching match(x, y)
Finding Where Vectors (variables) are the Same and Different union(x, y); intersect(x, y); setdiff(x, y); setdiff(y,x); setequal(x, y); is.element(x, y)
Matching Extenders %w/o% x without y (same as setdiff) %IN% x and y overlap (same as intersect)

Examples:

str(mtcars)

str(CO2)

Method 2: TEST A WHOLE DATA SET WITH: library(gdata) is.what()

Example:

library(gdata)

sapply(mtcars,is.what)

Examples "%w/o%" <- function(x, y) x[!x %in% y] #-- x without y

(1:10) %w/o% c(3,7,12)

[1] 1 2 4 5 6 8 9 10

(1:10) %in% c(3,7,12)

[1] FALSE FALSE TRUE FALSE FALSE FALSE TRUE FALSE FALSE FALSE

"%IN%" <- function(x, y) x[x %in% y] #-- x and y

(1:10) %IN% c(3,7,12)

[1] 3 7

x <- c(sort(sample(1:20, 9)),NA)

y <- c(sort(sample(3:23, 7)),NA)

x;y

length(x);length(y)

union(x, y) ; length(union(x, y)) #combine and remove duplicates

intersect(x, y)#duplicates of the 2 lists

setdiff(x, y)#what x has that y does not

setdiff(y, x) #what y has that x does not

setequal(x, y) #is set x the same as set y

a<-union(x,y)

b<-c(setdiff(x,y), intersect(x,y), setdiff(y,x))

setequal(a,b);sort(a);sort(b)

is.element(x, y)# what elements of set x are the same as set y

is.element(y, x)# what elements of set y are the same as set x

cat("\b%in% IS THE SAME AS is.element()\n")

a%in%b;x%in%y;y%in%x

union: combine and remove duplicates
intersect: duplicates of the 2 lists
setdiff: what x has that y does not
setequal: is set x the same as set y
is.element or %in%: what elements of set x are the same as set y


Intersect multiple vectors See Overlap in User Defined Functions

Reduce(intersect, list(...))

a <- c(1,3,5,7,9)

b <- c(3,6,8,9,10)

c <- c(2,3,4,5,7,9)

Reduce(intersect, list(a,b,c))

Find Rows Not Shared by Two Nested Data Sets x[! data.frame(t(x)) %in% data.frame(t(y)), ] Where x is the full data set and y is the nested data set A <- mtcars

B <- subset(mtcars, cyl==6)

A[! data.frame(t(A)) %in% data.frame(t(B)), ]


Recoding Variables method 1 recode variables recode columns
levels(variable) <- c("new names") #this has one-to-one correspondence
levels(variable) <- list(new1=c("A","C"), ...) #this can combine levels

Recoding Variables method 2 library(doBy) recodeVar(x, src=c(), tgt=c(), default=NULL, keep.na=TRUE)

Cut Points (Chop a numeric variable into a factor) [NOT RECOMMENDED]

cut(x, breaks, labels = NULL, include.lowest = F, right = T, dig.lab = 3, ordered_result = F)

e30<-read.table("e30.csv", header=TRUE, sep=",",na.strings="NA")

library(doBy)

AIDE<-recodeVar(e30$aide,src=c(0,1,NA),tgt=c("YES","NO",NA))

CLASS.TP<-recodeVar(e30$cl.type,src=c(1,2,3,NA),tgt=c("AM","PM","FULL",NA))

CLASS.BH.SPR<-

recodeVar(e30$cl.behav.spr,src=list(c(1:3),c(4,5),NA),tgt=c("POOR","GOOD",NA))

#This one is a complete recoding of a variable with cut points

DDD<-data.frame(AIDE,CLASS.TP,CLASS.BH.SPR)

DDD[620:630,]

NAhunter(DDD)

EXAMPLE

aaa <- c(1.2,2.2,3,4.1,.7,2,pi,4,5.3434343344,6.245,pi/3)

cut(aaa, 3)

cut(aaa, 3, dig.lab=4, ordered = TRUE)

cut(aaa, 3, labels=c("low","medium","high"), ordered = T)

(BBB<-cut(aaa, 3, labels=c("low","medium","high"), ordered = F))

(DF<-data.frame("OBS"=LETTERS[1:11],"LEVEL"=BBB,"NUM.LVL"=aaa))

round(DF,digits=2)#Can't round with factors so...

DF2<-DF

DF$NUM.LVL<-round(DF$NUM.LVL,digits=2)

list("METHOD 1"=format(DF2,digits=3),"METHOD 2"=DF,"USING"="DF$NUM.LVL<-

round(DF$NUM.LVL,digits=2)" )

mtcars

(mpg.rating<-with(mtcars,cut(mpg,3)))

levels(mpg.rating)<-c("low","medium","high")

(mtcars2<-data.frame(mtcars,mpg.rating))

mtcars[with(mtcars2,which(mpg.rating=="high")),]

table(mpg.rating)

mtcars$HPcut<-cut(mtcars$hp, breaks=c(0,66,110,150,335),

labels=c("low","medium","high", "super"), include.lowest=TRUE, right=FALSE)

EXAMPLE

InsectSprays2<-InsectSprays

levels(InsectSprays2$spray)

levels(InsectSprays2$spray)<-list(new1=c("A","C"),YEPS=c("B","D","E"),LASTLY="F")

levels(InsectSprays2$spray)

InsectSprays2


Relevel a factor method 1 [reorder factor groups] (& recode numeric to factor [see cut points for more on this])

Relevel factor Quick Reference set.seed(12)

z <-factor(sample(LETTERS[1:5], 10, T));z

factor(z, levels=c("C", "D", "A", "B")) #the releveling

Extra Reference dataset$factorgroup <- factor(dataset$factorgroup, levels = c("c","a","b"),ordered=is.ordered(factor))

Relevel a factor method 2 (order 1 group; kinda junky)

relevel(x, ref, ...) # only places one group at the front (limited)

Drop unused levels 1 drop factor levels

droplevels(x)

x is a factor vector/dataframe with factors

Drop unused levels 2 drop factor levels

x <- factor(x)
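A minimal sketch of the two drop-unused-levels approaches above, using the built-in iris data (the particular subset is arbitrary):
iris.sub <- subset(iris, Species != "setosa")
levels(iris.sub$Species) #"setosa" is still listed as a level, just unused
levels(droplevels(iris.sub$Species)) #method 1: the unused level is dropped
levels(factor(iris.sub$Species)) #method 2: re-running factor() also drops it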

EXAMPLE

warpbreaks$tension

warpbreaks$tension2 <- relevel(warpbreaks$tension, ref="M")

warpbreaks$tension2

mtcars2<-mtcars

mtcars2$carb[mtcars2$carb==4]<-NA

mtcars2

mtcars2$carb[is.na(mtcars2$carb)]<-4

mtcars2

mtcars2$carb[mtcars2$carb<=4&mtcars2$carb>=3]<-"med"

mtcars2$carb[mtcars2$carb<=2&mtcars2$carb>=1]<-"low"

mtcars2$carb[mtcars2$carb<=8&mtcars2$carb>=6]<-"high"

mtcars2

with(mtcars2,mtcars2[order(carb),])

mtcars2$carb <-as.factor(mtcars2$carb)

levels(mtcars2$carb)

mtcars2$carb <- factor(mtcars2$carb, levels = c("low","med","high"))

mtcars2$carb

> set.seed(12)

> z <-factor(sample(LETTERS[1:5], 10, T));z

[1] A E E B A A A D A A

Levels: A B D E

> factor(z, levels=c("C", "D", "A", "B"))

[1] A <NA> <NA> B A A A D A A

Levels: C D A B


Add an observation to a vector/column method 1 append(vector, new items to add, after=#)
Add an observation to a vector/column method 2: indexed assignment (preferred for its speediness)
EXAMPLE
mtcars2<-mtcars
mtcars2; mtcars2$mpg[3]
mtcars2$mpg[3]<-25
mtcars2
MPG<-mtcars2$mpg
MPG[40]<-25
MPG #[R] fills in the gap w/ NAs

Combine characters/numbers in one column (variable) with that of another

NOTE: you can do this in EXCEL using cell#& " " &cell# example: G1& " " &H1 (whatever is between the “” will be the divider of the characters)

paste(x,y,sep= " ")

x,y,z… are the variable characters/numbers to combine together. Whatever is between the " " will be the character separator.

Paste unknown number of columns apply(x, 1, paste, collapse = ".") apply(x, 1, function(x){if(any(is.na(x))){NA}else{paste(x, collapse = ".")}}) #if any NA returns NA

library(doBy)

x1<-recodeVar(sample(1:26,25,replace=T), src=1:26, tgt=letters, default=NULL, keep.na=TRUE)

y1<-recodeVar(sample(1:26,25,replace=T), src=1:26, tgt=LETTERS, default=NULL, keep.na=TRUE)

z1<-sample(1:26,25,replace=T)

merged.characters<-paste(x1,z1,y1,sep="")

data.frame(x1,z1,y1,merged.characters)

paste(x1,z1,y1,sep="-")#variation

EXAMPLE
old<-c(1:10)

[1] 1 2 3 4 5 6 7 8 9 10

new<-append(old,c(3,6,9),after=4)

[1] 1 2 3 4 3 6 9 5 6 7 8 9 10

#EXAMPLES

CO2[1,1] <- NA

x <- CO2[, 1:3]

y <- CO2[, 1:4]

apply(x, 1, paste, collapse = ".")

apply(x, 1, function(x){if(any(is.na(x))){NA}else{paste(x, collapse = ".")}})

#do.call METHOD

y <- as.list(CO2[1:3]) # make it a list

y$sep = "." # set our separator

do.call("paste", y)


Matrices & Data frames

The difference between a matrix and a data frame is that a matrix must have all the same type of data (e.g., numeric, character, etc.). A data frame may have mixed columns of data.

Turn a Vector Into a Matrix (Creating a Matrix)
EXAMPLE
b<-1:20

b

dim(b)<-c(4,5)

b

Change upper and lower triangle of matrix lower.tri(x, diag = FALSE) upper.tri(x, diag = FALSE)

ARGUMENTS:

Splicing or gluing together Rows or Columns rbind() cbind() NOTE: rbind() and cbind() are slow for larger data sets. It is usually better to create a blank matrix first and then use indexing to put the information into the blank matrix (see the sketch after the example below).

x a matrix.

diag logical. Should the diagonal be included?

EXAMPLE

#CREATE A CORRELATION MATRIX WITH THE LOWER TRIANGLE r^2 VALUES

CORmat<-cor(mtcars)

lower.tri(CORmat)

CORmat[lower.tri(CORmat)]<-CORmat[which(lower.tri(CORmat))]^2

Example

(aRow <- matrix(NA, ncol=18, nrow=49)) #create a matrix of NAs and then fill

aRow[1:44,1:8] <- as.vector(as.matrix(mtcars))

aRow[29:49,11:18] <- as.vector(as.numeric(as.matrix(CO2)))[253:420]

aRow #notice it's all numeric

aRow[49,1]<-"a"

aRow # changes matrix to character
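A small sketch of the preallocation note above; growing a matrix with rbind() inside a loop is compared with filling a preallocated blank matrix (the 1000-row size is arbitrary):
#SLOW: grow the object with rbind() each iteration
out <- NULL
for (i in 1:1000) out <- rbind(out, c(i, i^2))
#FASTER: preallocate a blank matrix, then fill by indexing
out2 <- matrix(NA, nrow=1000, ncol=2)
for (i in 1:1000) out2[i, ] <- c(i, i^2)
all.equal(out, out2, check.attributes=FALSE) #same result either way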


Matrix Algebra

Transpose X' or X^T: t(X)
Diagonal: diag(X)
Matrix Multiplication XY: X %*% Y
Matrix Inverse X^-1: solve(X)
Outer Product XY': X %o% Y
Column Means (returns a vector containing the column means of X): colMeans(X)
Cross Product X'X: crossprod(X)
Cross Product X'Y: crossprod(X,Y)

#Example of Regression Parameters With Matrix Algebra

#FORMULA: b = (X'X)^-1 (X'y)

#DATA

midterm <- c(5,7,7,7,9)

final <- c(4,5,6,8,10)

(SUM <- summary(lm(final~midterm)))

#==============================================

#ASSIGN DATA TO LETTERS TO FIT MATRIX NOTATION

x <- midterm

y <- final

#==============================================

#CONVERT VECTOR x TO MATRIX X WITH PARAMETER

#NOTE THE COLUMN OF ONES BEING ADDED IS

#FOR THE PARAMETER

X <- as.matrix(c(rep(1,length(x)),x))

dim(X)<-c(5,2)

X

#NOW THE HEAVY LIFTING MADE EASY:

(b <- solve(crossprod(X))%*%crossprod(X,y))


Text and Character Strings

Combine multiple items into a single response scat spaste

cat() example: Input cat("The date and time is",date(),"!","\n") Output The date and time is Thu May 05 10:17:47 2011 !

paste()

Pretty print numbers with commas and import comma-formatted numbers into R

prettyNum(x, big.mark=",", scientific=FALSE)
as.numeric(gsub(",", "", x))

Turn a character string into a formula sforumula

as.formula()

Example First<-c("Greg","Sue","Sally");Last<-c("Smith","Collins","Peters")

Ages<-c(11,12,11);ClassRoster<-data.frame(Last,First,Ages)

ClassRoster

Students.FL<-paste(First,Last,sep=" ")

Students.LF<-paste(Last,First,sep=" ")

#============================================================================================================

paste(Students.FL,"is",Ages,"years","old.")

#............................................................................................................

#OUTPUT-->[1] "Greg Smith is 11 years old." "Sue Collins is 12 years old." "Sally Peters is 11 years old."

#============================================================================================================

cat(paste(Students.FL,"is",Ages,"years","old",sep=" ",collapse=", "))

#............................................................................................................

#OUTPUT-->Greg Smith is 11 years old, Sue Collins is 12 years old, Sally Peters is 11 years old

test<-c("mpg", "cyl", "disp", "hp", "drat", "wt", "qsec", "vs", "am", "gear", "carb")

lm(as.formula(paste(test[1],"~", paste(test[-1],collapse="+"), sep="")), data=mtcars)

noquote(prettyNum(12345.678,big.mark=",",scientific=F))

x<-noquote(prettyNum(c(12345.678, 123154543, 32434343),big.mark=",",scientific=F))

as.numeric(gsub(",","", x))


Special Escaped Characters

Using cat() with Quotes to manipulate text output
QUOTES
\n newline
\r carriage return
\t tab
\b backspace
\a alert (bell)
\f form feed
\v vertical tab
\\ backslash \
\' ASCII apostrophe ' OR sQuote(phrase)
\" ASCII quotation mark " OR dQuote(phrase)
Eliminate the "\" from strings
(test <- c("\\hi\\", "\n", "\t", "\\1", "\1", "\01", "\001"))

eval(parse(text=gsub("\\", "", deparse(test), fixed=TRUE)))

#INPUT

#[1] "\\hi\\" "\n" "\t" "\\1" "\001" "\001" "\001"

#OUTPUT

#[1] "hi" "n" "t" "1" "001" "001" "001"

EXAMPLES

cat("\a","Hello","\n","Hel","\blo","\'\tHELLO!\'","\"WElp!\"","\n",

"DELETE ME","","\r\"I DID DELETE YOU!\"","\n","BYE","\\Yeah Backslash",

"\n")

cat(LETTERS,"\n","\r")
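A quick sketch of the sQuote()/dQuote() helpers mentioned above (whether the quotes come out curly or straight depends on options(useFancyQuotes)):
sQuote("hello") #wraps the string in single quotation marks
dQuote("hello") #wraps the string in double quotation marks
cat(sQuote("hello"), dQuote("hello"), "\n")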


Built in Character Strings constants LETTERS [1] "A" "B" "C" "D" "E" "F" "G" "H" "I" "J" "K" "L" "M" [14] "N" "O" "P" "Q" "R" "S" "T" "U" "V" "W" "X" "Y" "Z" letters [1] "a" "b" "c" "d" "e" "f" "g" "h" "i" "j" "k" "l" "m" [14] "n" "o" "p" "q" "r" "s" "t" "u" "v" "w" "x" "y" "z" month.abb [1] "Jan" "Feb" "Mar" "Apr" "May" "Jun" "Jul" "Aug" "Sep" "Oct" "Nov" "Dec" month.name [1] "January" "February" "March" "April" "May" "June" "July" "August" "September" "October" "November" "December" state.name state.abb

Remove quotes for printing noquote(letters) >[1] a b c d e f g h i j k l m n o p q r s t u v w x y z cat(letters) >a b c d e f g h i j k l m n o p q r s t u v w x y z Number of letters per word in a character string nchar(character string) yields the number of characters per word example: pets<-c("chester","callie");nchar(pets)


Replacing Characters in String

gsub(pattern, replacement, x, ignore.case = FALSE, perl = FALSE, fixed = FALSE, useBytes = FALSE)

Arguments

pattern: character string containing a regular expression (or character string for fixed = TRUE) to be matched in the given character vector. Coerced by as.character to a character string if possible. If a character vector of length 2 or more is supplied, the first element is used with a warning. Missing values are allowed except for regexpr and gregexpr.
x, text: a character vector where matches are sought, or an object which can be coerced by as.character to a character vector.
ignore.case: if FALSE, the pattern matching is case sensitive and if TRUE, case is ignored during matching.
perl: logical. Should Perl-compatible regexps be used? Has priority over extended.
value: if FALSE, a vector containing the (integer) indices of the matches determined by grep is returned, and if TRUE, a vector containing the matching elements themselves is returned.
fixed: logical. If TRUE, pattern is a string to be matched as is. Overrides all conflicting arguments.
useBytes: logical. If TRUE the matching is done byte-by-byte rather than character-by-character. See 'Details'.
invert: logical. If TRUE return indices or values for elements that do not match.
replacement: a replacement for matched pattern in sub and gsub. Coerced to character if possible. For fixed = FALSE this can include backreferences "\1" to "\9" to parenthesized subexpressions of pattern. For perl = TRUE only, it can also contain "\U" or "\L" to convert the rest of the replacement to upper or lower case and "\E" to end case conversion. If a character vector of length 2 or more is supplied, the first element is used with a warning. If NA, all elements in the result corresponding to matches will be set to NA.

Each of these functions (apart from regexec, which currently does not support Perl-style regular expressions) operates in one of three modes:

1. fixed = TRUE: use exact matching.

2. perl = TRUE: use Perl-style regular expressions.

3. fixed = FALSE, perl = FALSE: use POSIX 1003.2 extended regular expressions.


EXAMPLE 2

text.ex<-c("hat","coat","gloves","shirt","pants")

gsub("h","H",text.ex)

gsub("^.","A",text.ex)

gsub("(\\w)(\\w*)","\\U\\1\\L\\2",text.ex,perl=T)

gsub("(\\w*)","\\U\\1",text.ex,perl=T)

gsub("(\\w)(\\w*)(\\w)", "\\U\\1\\E\\2\\U\\3", text.ex, perl=T)

Output

> gsub("^.","\\*b",text)

[1] "*bat" "*boat" "*bloves" "*bhirt" "*bants"

> gsub("h","H",text)

[1] "Hat" "coat" "gloves" "sHirt" "pants"

> gsub("^.","A",text)

[1] "Aat" "Aoat" "Aloves" "Ahirt" "Aants"

> gsub("(\\w)(\\w*)","\\U\\1\\L\\2",text,perl=T)

[1] "Hat" "Coat" "Gloves" "Shirt" "Pants"

> gsub("(\\w*)","\\U\\1",text,perl=T)

[1] "HAT" "COAT" "GLOVES" "SHIRT" "PANTS"

> gsub("(\\w)(\\w*)(\\w)", "\\U\\1\\E\\2\\U\\3", text, perl=TRUE)

[1] "HaT" "CoaT" "GloveS" "ShirT" "PantS"

x<-"ATextIWantToDisplayWithSpaces" #Split on capital letters

gsub('([[:upper:]])', ' \\1', x)

> gsub('([[:upper:]])', ' \\1', x)

[1] " A Text I Want To Display With Spaces"

x<-"I like it...What the... Oh I see it." #replace …

gsub(pattern = "\\.\\.\\.", replacement = ".", x) #or

gsub(pattern = "\\.+", replacement = ".", x) #betterand more flexible

> gsub(pattern = "\\.\\.\\.", replacement = ".", x)

[1] "I like it.What the. Oh I see it."

EXAMPLE 1 input

a <- c("foo_5h", "bar_7")

gsub(".*_", "", a)

b <- c("xtfo_oin5hl", "6b_arin7", "xin7")

gsub("in.", "", b)

gsub("t.*l", "HERE", b)

gsub("^([a-zA-Z]in)", "INSERT", b)

d <- c("xtfo_oin5h;lx", "6b_arin;7", "xin;7")

gsub("t.+?l", "HERE", d)

gsub("[a-zA-Z].+?l", "HERE", d)

gsub("[a-zA-Z].+?;", "HERE", d)

gsub("_.+?;", "HERE", d)

e <- c("Dog foo_5h dog bar_7 doGs God")

gsub("\\bdog\\b", "HERE", e)

gsub("\\bdog.\\b", "HERE", e)

gsub("[^a-zA-Z0-9]", "", e)

gsub("\\b[dD][oO][Gg].\\b", " ", e)

gsub("\\b[dD][oO][Gg]\\b", " ", e)

EXAMPLE 1 outcome

> a <- c("foo_5h", "bar_7")

> gsub(".*_", "", a)

[1] "5h" "7"

>

> b <- c("xtfo_oin5hl", "6b_arin7", "xin7")

> gsub("in.", "", b)

[1] "xtfo_ohl" "6b_ar" "x"

> gsub("t.*l", "HERE", b)

[1] "xHERE" "6b_arin7" "xin7"

> gsub("^([a-zA-Z]in)", "INSERT", b)

[1] "xtfo_oin5hl" "6b_arin7" "INSERT7"

>

> d <- c("xtfo_oin5h;lx", "6b_arin;7", "xin;7")

> gsub("t.+?l", "HERE", d)

[1] "xHEREx" "6b_arin;7" "xin;7"

> gsub("[a-zA-Z].+?l", "HERE", d)

[1] "HEREx" "6b_arin;7" "xin;7"

> gsub("[a-zA-Z].+?;", "HERE", d)

[1] "HERElx" "6HERE7" "HERE7"

> gsub("_.+?;", "HERE", d)

[1] "xtfoHERElx" "6bHERE7" "xin;7"

>

> e <- c("Dog foo_5h dog bar_7 doGs God")

> gsub("\\bdog\\b", "HERE", e)

[1] "Dog foo_5h HERE bar_7 doGs God"

> gsub("\\bdog.\\b", "HERE", e)

[1] "Dog foo_5h HEREbar_7 doGs God"

> gsub("[^a-zA-Z0-9]", "", e)

[1] "Dogfoo5hdogbar7doGsGod"

> gsub("\\b[dD][oO][Gg].\\b", " ", e)

[1] " foo_5h bar_7 God"

> gsub("\\b[dD][oO][Gg]\\b", " ", e)

[1] " foo_5h bar_7 doGs God"

Match & replace from here to blank


Replacing Certain Occurrences

Find and remove space and/or numeric occurrences #EXAMPLE 1

#=========

data <- c("Flagstaff 2", "Los Angeles 23", "Cleveland 29", "Cleveland 29", "Seattle 22")

gsub("\\s*\\d*$", "", data)

[1] "Flagstaff" "Los Angeles" "Cleveland" "Cleveland" "Seattle"

#EXAMPLE 2

#=========

x <- "the dog ate his \n food"

gsub("[^o h \n]", "", x)

gsub("[^o h \n]|\\s+", "", x)

> gsub("[^o h \n]", "", x)

[1] "h o h \n oo"

> gsub("[^o h \n]|\\s+", "", x)

[1] "hohoo"

Find Consecutive Occurrences

mystring <- c(1, 2, 3, "toot", "tooooot", "good", "apple", "banana", "frrr")

mystring[!grepl("(.)\\1{2,}", mystring)]

mystring[!grepl("(.)\\1{1,}", mystring)]

gsub("(.)\\1{2,}", "HELLO", mystring)

## > mystring[!grepl("(.)\\1{2,}", mystring)]

## [1] "1" "2" "3" "toot" "good" "apple" "banana"

## > mystring[!grepl("(.)\\1{1,}", mystring)]

## [1] "1" "2" "3" "banana"

## > gsub("(.)\\1{2,}", "HELLO", mystring)

## [1] "1" "2" "3" "toot" "tHELLOt" "good" "apple" "banana" "fHELLO"

string <- c('sta_+1+0_field2ndtry_0000$01.cfg' , 'sta_+B+0_field2ndtry_0000$01.cfg' ,

'sta_+1+0_field2ndtry_0000$01.cfg' , 'sta_+9+0_field2ndtry_0000$01.cfg')

sapply(1:length(string), function(i)gsub("\\+(.*)\\+.", paste("\\+\\1\\+", i, sep=""),

string[i]))

> string

[1] "sta_+1+0_field2ndtry_0000$01.cfg" "sta_+B+0_field2ndtry_0000$01.cfg"

[3] "sta_+1+0_field2ndtry_0000$01.cfg" "sta_+9+0_field2ndtry_0000$01.cfg"

> sapply(1:length(string), function(i)gsub("\\+(.*)\\+.", paste("\\+\\1\\+", i, sep=""),

string[i]))

[1] "sta_+1+1_field2ndtry_0000$01.cfg" "sta_+B+2_field2ndtry_0000$01.cfg"

[3] "sta_+1+3_field2ndtry_0000$01.cfg" "sta_+9+4_field2ndtry_0000$01.cfg"


Find Location of Chunks within a String(s)

Find a pattern in a string or a vector of strings

gregexpr(pattern, text, ignore.case = FALSE, perl = FALSE, fixed = FALSE, useBytes = FALSE)

gregexpr("es", "Testes")

c(gregexpr("es", "Test")[[1]])

c(gregexpr("es", "Testes")[[1]])

c(gregexpr("es", "Testes establishes esteem")[[1]])

gregexpr("es", c("Testes", "dog", 6, "esteem")) #vector of strings

Find a pattern in a vector of strings Gives Location

grep(pattern, x, ignore.case = FALSE, perl = FALSE, value = FALSE, fixed = FALSE, useBytes = FALSE, invert = FALSE)

Gives Logical TRUE/FALSE

grepl(pattern, x, ignore.case = FALSE, perl = FALSE, fixed = FALSE, useBytes = FALSE)

grep("es", c("Testes", "dog", 6, "esteem"))

#[1] 1 4

grepl("es", c("Testes", "dog", 6, "esteem"))

#[1] TRUE FALSE FALSE TRUE
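grep() can also return the matching elements themselves instead of their positions; a small sketch using the same vector:
grep("es", c("Testes", "dog", 6, "esteem"), value = TRUE)
#[1] "Testes" "esteem"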


String Splitting

Split on first space

(x<-rownames(mtcars))

rexp <- "^(\\w+)\\s?(.*)$"

sub(rexp,"\\1",x)

sub(rexp,"\\2",x)

data.frame(COM=x, MANUF=sub(rexp,"\\1",x), MAKE=sub(rexp,"\\2",x))

27 Porsche 914-2

28 Lotus Europa

29 Ford Pantera L

30 Ferrari Dino

31 Maserati Bora

32 Volvo 142E

COM MANUF MAKE

27 Porsche 914-2 Porsche 914-2

28 Lotus Europa Lotus Europa

29 Ford Pantera L Ford Pantera L

30 Ferrari Dino Ferrari Dino

31 Maserati Bora Maserati Bora

32 Volvo 142E Volvo 142E

#Also could have been solved: #METHOD 2

mat <- do.call("rbind", strsplit(sub(" ", ";", x), ";"))

colnames(mat) <- c("MANUF", "MAKE")

#METHOD 3

library(reshape2)

y <- reshape2::colsplit(x," ",c("MANUF","MAKE"))

tail(y)

#METHOD 4

library(stringr)

split_x <- str_split(x, " ", 2)

y <- data.frame(

MANUF = sapply(split_x, head, n = 1),

MAKE = sapply(split_x, tail, n = 1)

)

tail(y)

str <- c("George W. Bush", "Lyndon B. Johnson")

gsub("([A-Z])[.]?", "\\1", str)

sub(" .*", "", str)

sub("\\s\\w+$", "", str)

sub(".*\\s(\\w+$)", "\\1", str)

str <- c("&George W. Bush", "Lyndon B. Johnson?")

gsub("[^[:alnum:][:space:].]", "", str)

> str <- c("&George W. Bush", "Lyndon B.

Johnson?")

> gsub("[^[:alnum:][:space:].]", "", str)

[1] "George W. Bush" "Lyndon B. Johnson"

> str <- c("George W. Bush", "Lyndon B. Johnson")

> gsub("([A-Z])[.]?", "\\1", str)

[1] "George W Bush" "Lyndon B Johnson"

> sub(" .*", "", str)

[1] "George" "Lyndon"

> sub("\\s\\w+$", "", str)

[1] "George W." "Lyndon B."

> sub(".*\\s(\\w+$)", "\\1", str)

[1] "Bush" "Johnson"

>

> str <- c("&George W. Bush", "Lyndon B.

Johnson?")

> gsub("[^[:alnum:][:space:].]", "", str)

[1] "George W. Bush" "Lyndon B. Johnson"

string<- factor(c("California CA", "New York NY", "Georgia GA"))

#Cheesy Method

string <- gsub(" +", " ", string)

sapply(string, function(x) substring(x, 1, nchar(x)-3)) #or

unlist(lapply(string, function(x) substring(x, 1, nchar(x)-3)))

or

#sub Method (BETTER!)

sub("[[:space:]]*..$", "", string)

#OUTPUT

[1] "California" "New York" "Georgia"


Split on first underscore character

Split on first comma

y <- c("Here's comma 1, and 2, see?", "Here's 2nd sting, like it, not a lot.")

#Method 1

XX <- "SoMeThInGrIdIcUlOuS"

LIST <- strsplit(sub(",\\s*", XX, y), XX)

LIST2 <- lapply(LIST, function(x) data.frame('x'=c(x[1]), 'z'=c(x[2])))

do.call('rbind', LIST2)

#Method 2

y2 <- strsplit(y, ",")

LIST <- sapply(seq_along(y2), function(i) data.frame(x= y2[[i]][1],

z=paste(y2[[i]][-1], collapse=" ")), simplify=F)

do.call('rbind', LIST)

#Method 3

library(reshape2)

colsplit(y, ",", c("x","z"))

library(reshape2)
my_var_1 <- factor(c("x00_aaa_123","x00_bbb_123","x00_ccc_123",
"x01_aaa_123","x01_bbb_123","x01_ccc_123",
"x02_aaa_123","x02_bbb_123","x02_ccc_123"))
colsplit(my_var_1, "_", c("x","whatever"))
x whatever
1 x00 aaa_123
2 x00 bbb_123
3 x00 ccc_123
4 x01 aaa_123
5 x01 bbb_123
6 x01 ccc_123
7 x02 aaa_123
8 x02 bbb_123
9 x02 ccc_123


Piece Grabbing Grab part x <- c(

"ELOVL7",

"ELP2",

"EMC1 (includes EG:23065)",

"EPT1 (includes EG:28042)",

"ZEB1 (includes EG:29009)"

)

gsub("(.*)\\s+\\(.*\\)", "\\1", x)

Test and grab certain occurrences (e.g., beginning with abc and ending with some numeric)

#example 1

#==========

s <- c('abc1', 'abc2', 'abc3', 'abc11', 'abc12',

'abcde1', 'abcde2', 'abcde3', 'abcde11', 'abcde12',

'nonsense')

s[grepl("abc.*(3|11|12)", s)]

s[grepl("^abc", s) & grepl("(3|11|12)$", s)]

#^ anchors the pattern to the start of the string (the 2nd form is more interpretable)

> s[grepl("abc.*(3|11|12)", s)]

[1] "abc3" "abc11" "abc12" "abcde3" "abcde11" "abcde12"

> s[grepl("^abc", s) & grepl("(3|11|12)$", s)]

[1] "abc3" "abc11" "abc12" "abcde3" "abcde11" "abcde12"

#example 2

#==========

x <- c("fcer cgr tr cg g.", "gce tgv te ger refxre,c3rfc rf3rcf3rfr?")

x[grepl("^[[:alpha:]]", x) & grepl("(\\?)$", x)]

x[grepl("^[[:alpha:]]", x) & grepl("(\\?|\\.)$", x)]

> x[grepl("^[[:alpha:]]", x) & grepl("(\\?)$", x)]

[1] "gce tgv te ger refxre,c3rfc rf3rcf3rfr?"

> x[grepl("^[[:alpha:]]", x) & grepl("(\\?|\\.)$", x)]

[1] "fcer cgr tr cg g."

[2] "gce tgv te ger refxre,c3rfc rf3rcf3rfr?"

Grab everything except the last word df1 <- structure(list(id = c(1, 2, 3), city = structure(c(2L, 3L, 1L

), .Label = c("Hillside Village", "Middletown Township", "Sunny Valley Borough"

), class = "factor")), .Names = c("id", "city"), row.names = c(NA,

-3L), class = "data.frame")

gsub("\\s*\\w*$", "", df1$city)

> gsub("\\s*\\w*$", "", df1$city)

[1] "Middletown" "Sunny Valley" "Hillside"


Split apart by chunks

test <- "abc123def"

x <- gsub("([0-9]+)","~\\1~", test)

strsplit(x, "~")

#or in one step

strsplit(gsub("([0-9]+)","~\\1~", test), "~")

[[1]]

[1] "abc" "123" "def"


Punctuation

Delete all punctuation except…

#EXAMPLE x <- "I like %$@to*&, chew;: gum, but don't like|}{[] bubble@#^)( gum!?"

#METHODS FOR SUBBING OUT ALL PUNCTUATUION EXCEPT APOSTROPHES

gsub("[^[:alnum:][:space:]'\"]", "", x) #METHOD 1

gsub(".*?($|'|[^[:punct:]]).*?", "\\1", x) #METHOD 2

gsub("(.*?)($|'|[^[:punct:]]+?)(.*?)", "\\2", x) #METHOD 3

#EXTENDING METHOD 1 TO SUB OUT EVERYTHING EXCEPT APOSTROPHES AND SEMI COLONS

gsub("[^[:alnum:][:space:]'\ ;\"]", "", x, perl=T)


Capitalization

#Capitalize the first letter of a word

capitalize <- function(x) {

simpleCap <- function(x) {

s <- strsplit(x, " ")[[1]]

paste(toupper(substring(s, 1,1)), substring(s, 2),

sep="", collapse=" ")

}

unlist(lapply(x, simpleCap))

}

x <- "i'll"

y <- "you"

z <- c("I'll", "go")

capitalize(x)

capitalize(y)

capitalize(z)

Capital Letters capitalize toupper(string) Lower Case Letters tolower(string)

EXAMPLE

string<-toupper(paste("i do not know"," where the dog is.",sep=""))

cat(string,"\n",sep="")

string

tolower(string)


String Matching

Search a Vector for a Match see my Search() function

Exact Matches

grep(pattern, x, ignore.case = FALSE, perl = FALSE, value = FALSE,

fixed = FALSE, useBytes = FALSE, invert = FALSE)

Arguments:

Approximate Matches

agrep(pattern, x, ignore.case = FALSE, value = FALSE,

max.distance = 0.1, useBytes = FALSE)

Arguments:

str1 <- "This is a string, that I've written to ask about a question, or at least tried to."

animals <- c("mose", "dog", "cat", "gooberciluousrex")

animals[agrep("mouse", animals, max.distance = 0.01)] <- "cheese"

animals

animals[agrep("chese", animals)] <- "mouse"

animals

animals[agrep("goobercilsdeef", animals, max.distance = 0.01)] <- "duck"

animals

animals[agrep("goobercilsdeef", animals, max.distance = 0.29)] <- "duck"

animals


My Search() function

Search for a term str <- "BBSSHHSRBSBBS"

unlist(gregexpr("BS", str))

str2 <- "I can't stand know it all egg head scientists."

unlist(gregexpr("i can't", tolower(str2)))

term <- "egg head"

loc <- unlist(gregexpr(term, tolower(str2)))

substring(str2, loc, nchar(term)-1+loc)

Search for a term and Count Occurrences str2 <- "ionisation should only be matched at the end of the word"

matched_commas <- gregexpr(",", str1, fixed = TRUE)

length(matched_commas[[1]])

matched_ion <- gregexpr("ion", str1, fixed = TRUE)

length(matched_ion[[1]])

length(gregexpr("ion\\b", str2, perl = TRUE))

Search for Strings that contain a phrase

a <- c('This is a healthcare facility', 'this is a hospital',

'this is a hospital district', 'this is a district health service')

a[grepl("hospital", a) & !grepl("district", a)]

a[!grepl("district", a)]

Levenshtein distance between strings

pres <- c(" Obama, B.","Bush, G.W.","Obama, B.H.","Clinton, W.J.")

lapply(pres, agrep, pres, value = F)

lapply(pres, agrep, pres, value = T)

Search<-function(term,dataframe,column.name,variation=.02,...){

te<-substitute(term) #use " " for multi word terms

te<-as.character(te)

cn<-substitute(column.name)

cn<-as.character(cn)

HUNT<-agrep(te,dataframe[,cn],ignore.case =TRUE,max.distance=variation,...)

dataframe[c(HUNT),]

}
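A usage sketch of the Search() function defined above, run on a small made-up roster data frame (the names and scores are hypothetical):
roster <- data.frame(name = c("Smith", "Smyth", "Jones", "Smitty"),
 score = c(90, 85, 70, 60), stringsAsFactors = FALSE)
Search(smith, roster, name) #rows whose name approximately matches "smith" (case ignored)
Search("smith", roster, name, .3) #quoted terms also work; a larger variation widens the match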


Test for occurrence in Columns

myfile <- read.table( text = '"G1" "G2"

SEP11 ABCC1

205772_s_at FMO2

214223_at ADAM19

ANK2 215742_at

COPS4 BIK

214808_at DCP1A

ACE ALG3

BAD 215369_at

EMP3 215385_at

CARD8 217579_x_at

', header = TRUE, stringsAsFactors = FALSE)

lapply( myfile,

function(column) grep( "_at$", column, invert = TRUE, value = TRUE )

)

lapply( myfile,

function(column) grep( "_at$", column, value = TRUE )

)

lapply( myfile,

function(column) grep( "_at$", column, invert = TRUE )

)


Insert characters between characters of a character string See also "Insert a Vector of Character…"

x <- "output"

Method 1

y <- unlist(strsplit(x, NULL))

y <- paste(y, collapse="\n")

cat(y)

[1] "o\nu\nt\np\nu\nt"

Method 2

z <- gsub('(?<=.)(?=.)','\n', x, perl=TRUE)

cat(z)

[1] "o\nu\nt\np\nu\nt"

Insert a Vector of Character Strings Into Another Character String

a <- c("string", "factor")

sprintf("This is where a %s goes.", a)

sprintf("This is where a %s goes.", a)

[1] "This is where a string goes." "This is where a factor goes."

Insert Vector(s) of Character Strings Into Another Character String

#paste method

n <- 10; a <- 1:n

paste0("p", a, "=", a)

#sprintf method

n <- 10; a <- 1:n

sprintf("p%d=%d", a, a)

Insert Trailing or Leading Spaces Easily

x <- c("I like", "good", "better than you")

sprintf("%8s", x) #Add leading space

sprintf("%-8s", x) #Add trailing space

> sprintf("%8s", x) #Add leading space

[1] " I like" " good" "better than you"

> sprintf("%-8s", x) #Add trailing space

[1] "I like " "good " "better than you"

Delete Trailing or Leading Spaces Easily (see the Trim() sketch after the output below)
gsub("^\\s+", "", x) #leading spaces
gsub("\\s+$", "", x) #trailing spaces
Trim <- function (x) gsub("^\\s+|\\s+$", "", x)
Insert Leading Zeros
sprintf("%02d", c(1,2,3,45))
> sprintf("%02d",c(1,2,3,45))

[1] "01" "02" "03" "45"

> sprintf("%03d",c(1,2,3,45))

[1] "001" "002" "003" "045"

> sprintf("%010d",c(1,2,3,45))

[1] "0000000001" "0000000002" "0000000003" "0000000045"


Reverse character strings

reverse(string)

reverse("this is a string")

strings1 <- c(123,4212,234567)

reverse(strings1)

strings2 <- c("retsnomerom","was","retar")

reverse(strings2)

reverse <- function(string) {

strReverse <- function(x) sapply(lapply(strsplit(x, NULL),

rev), paste, collapse = "")

if (is.numeric(string)) {

strReverse(as.character(string))

} else {

strReverse(string)

}

}


Select portions of a character string (parts of a character string) substr(text,start point,end point)

EXAMPLES

substr("abcdefghi",2,5)

substr("abcdefghi",1,8)

substring("Callie loves to chew bones!",8,20)

substring("Callie loves to chew bones!",28:1)

substring("Callie loves to chew bones!",1:28)

data.frame(cbind(substring("Callie loves to chew bones!",28:1),

substring("Callie loves to chew bones!",1:28)))

substr(rep("abcdef",4),1:4,4:5)

x <- c("asfef", "qwerty", "yuiop[", "b", "stuff.blah.yech")

substr(x, 2, 5)

substring(x, 2, 4:6)

substring(x, 2) <- c("..", "+++")

x

#USE TO PULL APART BEDS CODE

beds.numbs<-as.character(c(3452171,3452172,3462173,3452274,3452275,3462276,3462277,3452178,

3452189,3452080,3452081,3452082,3462083))

(Region<-substr(beds.numbs,1,3))#use this to recode regions

(District<-substr(beds.numbs,1,6))

DATS<-data.frame(beds.numbs,Region,District)

with(DATS,table(District))

with(DATS,ftable(DATS))

subset(DATS,Region=="346")

subset(DATS,District=="345208")


Create a diminishing list from a vector of names PREDS<-c("gender", "g1freelunch", "g3tmathss", "g3treadss", "yearssmall",

"crap")

#===============================================

method 1 getDiminishingList<-function(data){

ans <- list()

for(i in 1:length(data)){

ans[[i]] <- data[1:(length(data) - i + 1)]

}

ans

}

# Use function

getDiminishingList(PREDS)

getDiminishingList(1:10)

#===============================================

method 2 getDiminishingList <- function(data){

n <- length(data)

tmpfunc <- function(i){

data[1:(length(data) - i + 1)]

}

return(apply(matrix(1:n), 1, tmpfunc))

}

# Use function

getDiminishingList(PREDS)

getDiminishingList(1:10)

Output

[[1]]

[1] "gender" "g1freelunch" "g3tmathss" "g3treadss" "yearssmall"

[6] "crap"

[[2]]

[1] "gender" "g1freelunch" "g3tmathss" "g3treadss" "yearssmall"

[[3]]

[1] "gender" "g1freelunch" "g3tmathss" "g3treadss"

[[4]]

[1] "gender" "g1freelunch" "g3tmathss"

[[5]]

[1] "gender" "g1freelunch"

[[6]]

[1] "gender"


Convert a Character String or Factor to Numeric

Method 1

y <- c( "OLDa", "ALL", "OLDc", "OLDa", "OLDb", "NEW", "OLDb", "OLDa", "ALL")

el <- c("OLDa", "OLDb", "OLDc", "NEW", "ALL")

match(y,el)

Method2

f <- factor(y,levels=c("OLDa", "OLDb", "OLDc", "NEW", "ALL") )

as.integer(f)

Changing Variable (numeric vs. factor) See Dummy Coding

Type:
y <- as.factor(y) changes the variable to factor
y <- as.numeric(y) changes the variable to numeric
Note 1: These functions can be used to change a categorical variable into a numeric variable (useful for dummy coding) ...or recode as 0,1.
Note 2: If you've renamed the variables in your data set you must use the as.numeric function with the original data set terms (data.set$variable.name) to make the column in the actual data set numeric.

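A related sketch worth keeping next to these notes: as.numeric() on a factor returns the underlying level codes, not the printed values, so go through as.character() when the levels are numbers stored as text (the values below are made up):
f <- factor(c("10", "20", "20", "50"))
as.numeric(f) #[1] 1 2 2 3 (level codes, usually not what you want)
as.numeric(as.character(f)) #[1] 10 20 20 50 (the actual values)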


Dummy Coding a Factor method 1 User defined function requires library(ade4) dummy(dataframe)

Dummy Coding a Factor method 2 model.matrix(~factor-1)

#EXAMPLE

x <- c(2, 2, 5, 3, 6, 5, NA)

xf <- factor(x, levels = 2:6)

model.matrix( ~ xf - 1)

#EXAMPLE

dummy <- function(df) {

require(ade4)

ISFACT <- sapply(df, is.factor)

FACTS <- acm.disjonctif(df[, ISFACT, drop = FALSE])

NONFACTS <- df[, !ISFACT,drop = FALSE]

data.frame(NONFACTS, FACTS)

}

df <-data.frame(eggs = c("foo", "foo", "bar", "bar"),

ham = c("red","blue","green","red"), x=rnorm(4))

dummy(df)


Convert numbers to Roman numerals

as.roman(vector)

Convert Decimals to fractions library(MASS) fractions(x, cycles = 10, max.denominator = 2000, ...) Arguments:

NOTE: This is Rational Approximation and may not be a true value of the decimal

EXAMPLES

as.roman(101) #converts to a Roman numeral

as.roman(c(101,23,67,92)) #vector

EXAMPLES

library(MASS)

fractions(.12)

fractions(pi)


Date & Time

Date & Time date()
Date Sys.Date()
Time substr(as.character(Sys.time()),12,19)
Year substr(as.character(Sys.Date()),1,4)
Date/Time/Time Zone Sys.time()

Extracting pieces from Sys.Date and Sys.time format(Sys.time(), "%a %b %d %H:%M:%S %Y") %a=weekday; %b=month; %d=day of the month; %H:%M:%S =hour:minute:second; %Y=year Use of cat with “\n” gets rid of the quotes around the date (see final example) cat(format(Sys.time(), "%a %b %d %H:%M:%S %Y"),"\n")

Import dates in various formats such as dd/mm/yyyy as.Date(x, format = "") Arguments

EXAMPLE

format(Sys.Date(), format="%b %d %Y")

format(Sys.Date(), format="%a %b %d %Y")

format(Sys.time(), "%a %b %d %H:%M:%S %Y")

format(Sys.time(),"%H:%M")

dec1 <- as.Date("2004-12-1")

cat(format(dec1, format="%b %d %Y"),"\n")

#Notice how cat eliminates the quotes

EXAMPLE

dates <- c("02/27/92", "02/27/92", "01/14/92", "02/28/92", "02/01/92")

as.Date(dates, "%m/%d/%y")

dates <- c("02/27/1992", "02/27/1992", "01/14/1992", "02/28/1992", "02/01/1992")

as.Date(dates, "%m/%d/%Y")

Note: the package chron is good at handling dates and times


Differences in Dates and Times difftime(t1,t2) Note: put the later time in for t1 (difftime returns t1 - t2) EXAMPLES difftime("2005-10-21","1980-11-16")

as.numeric(difftime("2005-10-21","1980-11-16"))

difftime("2011-05-17 00:35:07","2002-9-11 8:46:40")

difftime(Sys.time(),"2002-9-11 8:46:40")

Output > difftime("2005-10-21","1980-11-16")

Time difference of 9104.958 days

> as.numeric(difftime("2005-10-21","1980-11-16"))

[1] 9104.958

> difftime("2011-05-17 00:35:07","2002-9-11 8:46:40")

Time difference of 3169.659 days

> difftime(Sys.time(),"2002-9-11 8:46:40")

Time difference of 3169.661 days

Time and Date Sequence seq.Date(from,to,by) by= "day", "week", "month" or "year"

Turn dates into Day of the Week weekdays(x, abbreviate=FALSE)

Units Argument difftime(time1, time2, tz, units = c("auto", "secs", "mins", "hours","days", "weeks")) Can request answer be given in "auto", "secs", "mins", "hours","days", "weeks"

EXAMPLE

C<-seq.Date(as.Date("2010-10-10"),Sys.Date(),"week")

data.frame("OBS"=1:length(C),C)

df <- data.frame(date=c("2012-02-01", "2012-02-01", "2012-02-02"))

df$day <- weekdays(as.Date(df$date)) #turns dates to days of the week

df$day.ab <- weekdays(as.Date(df$date), TRUE)

> df

date day day.ab

1 2012-02-01 Wednesday Wed

2 2012-02-01 Wednesday Wed

3 2012-02-02 Thursday Thu
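A small sketch of the units argument described above (the dates are arbitrary):
difftime(as.Date("2005-10-21"), as.Date("1980-11-16"), units = "weeks")
difftime(as.Date("2005-10-21"), as.Date("1980-11-16"), units = "days")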


Graphics

Open a second Graphics Window (universal) dev.new()
Open a second Graphics Window (windows) win.graph() or windows() or x11()
Open a second Graphics Window (mac) quartz() or x11()
Open a second Graphics Window ready to plot (no need to call a plot before adding lines etc.) plot.new() or frame() This enables you to add lines and text without an actual plot
Check the system for OS and return the correct graphics device (2 methods)
#covers everything and is safe for other Graphics Devices
if (dev.interactive()) dev.new()

#covers only gui graphics device and is not safe for other Graphics Devices

if( .Platform$GUI %in% c("X11", "Tk") ) {

X11()

} else {

if ( .Platform$GUI == "AQUA" ){

quartz()

} else {

windows()

}

}


Control the Size of the Graph Window
windows(width=10, height=4) or win.graph(width=10, height=4) or x11(w=10, h=4)
NOTE: All will take just w or h, or the specific order of w and then h, as in: x11(10,4)
Pause Between Switching to Second Graph par(ask=TRUE)
Multiple Graphs on one page par(mfrow=c(2,3)) #2 is in the rows position and 3 is in the columns position
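A minimal sketch of the multiple-graphs-per-page setting above (the six scatterplots of mtcars are arbitrary filler):
op <- par(mfrow = c(2, 3)) #2 rows by 3 columns of plots
for (i in 1:6) plot(mtcars$wt, mtcars$mpg, main = paste("plot", i))
par(op) #restore the previous par settings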


Graphical output formats (Device: Function)
Screen/GUI Devices: x11() or X11(); windows()
File Devices: postscript(file="myplot.ps"); pdf(file="myplot.pdf"); pictex(file="myplot.tex"); bmp(file="myplot.bmp"); jpeg(file="myplot.jpeg")
Return Current Graphic device dev.cur()
Turn Off Graphic Device dev.off()
Turn Off all Graphics Devices graphics.off()
Copy the Current Graphics Device to a File dev.copy(device=png, file="foo", width=500, height=300)
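A quick sketch of drawing straight into a file device (the file name and size are arbitrary):
pdf(file = "myplot.pdf", width = 7, height = 5) #open the file device
plot(mtcars$wt, mtcars$mpg) #this plot is written to the file, not the screen
dev.off() #close the device so the file is finished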


Par Function Arguments adj

The value of adj determines the way in which text strings are justified in text, mtext and title. A value of 0 produces left-justified text, 0.5 (the default) centered text and 1 right-justified text. (Any value in [0, 1] is allowed, and on most devices values outside that interval will also work.) Note that the adj argument of text also allows adj = c(x, y) for different adjustment in x- and y- directions. Note that whereas for text it refers to positioning of text about a point, for mtext and title it controls placement within the plot or device region.

ann If set to FALSE, high-level plotting functions calling plot.default do not annotate the plots they produce with axis titles and overall titles. The default is to do annotation.

ask logical. If TRUE (and the R session is interactive) the user is asked for input, before a new figure is drawn. As this applies to the device, it also affects output by packages grid and lattice. It can be set even on non-screen devices but may have no effect there. This not really a graphics parameter, and its use is deprecated in favour of devAskNewPage.

bg The color to be used for the background of the device region. When called from par() it also sets new=FALSE. See section ‘Color Specification’ for suitable values. For many devices the initial value is set from the bg argument of the device, and for the rest it is normally "white". Note that some graphics functions such as plot.default and points have an argument of this name with a different meaning.

bty A character string which determined the type of box which is drawn about plots. If bty is one of "o" (the default), "l", "7", "c", "u", or "]" the resulting box resembles the corresponding upper case letter. A value of "n" suppresses the box.

cex A numerical value giving the amount by which plotting text and symbols should be magnified relative to the default. This starts as 1 when a device is opened, and is reset when the layout is changed, e.g. by setting mfrow. Note that some graphics functions such as plot.default have an argument of this name which multiplies this graphical parameter, and some functions such as points accept a vector of values which are recycled. Other uses will take just the first value if a vector of length greater than one is supplied.

cex.axis The magnification to be used for axis annotation relative to the current setting of cex.

cex.lab The magnification to be used for x and y labels relative to the current setting of cex.

cex.main The magnification to be used for main titles relative to the current setting of cex.

cex.sub The magnification to be used for sub-titles relative to the current setting of cex.

cin R.O.; character size (width, height) in inches. These are the same measurements as cra, expressed in different units.

col A specification for the default plotting color. See section ‘Color Specification’. (Some functions such as lines accept a vector of values which are recycled. Other uses will take just the first value if a vector of length greater than one is supplied.)

col.axis The color to be used for axis annotation. Defaults to "black".

col.lab The color to be used for x and y labels. Defaults to "black".

col.main The color to be used for plot main titles. Defaults to "black".

col.sub The color to be used for plot sub-titles. Defaults to "black".

cra R.O.; size of default character (width, height) in ‘rasters’ (pixels). Some devices have no concept of pixels and so assume an arbitrary pixel size, usually 1/72 inch. These are the same measurements as cin, expressed in different units.

crt A numerical value specifying (in degrees) how single characters should be rotated. It is unwise to expect values other than multiples of 90 to work. Compare with srt which does string rotation.

csi R.O.; height of (default-sized) characters in inches. The same as par("cin")[2].

cxy R.O.; size of default character (width, height) in user coordinate units. par("cxy") is par("cin")/par("pin") scaled to user coordinates. Note that c(strwidth(ch), strheight(ch)) for a given string ch is usually much more precise.

din R.O.; the device dimensions, (width,height), in inches.

err (Unimplemented; R is silent when points outside the plot region are not plotted.) The degree of error reporting desired.


family The name of a font family for drawing text. The maximum allowed length is 200 bytes. This name gets mapped by each graphics device to a device-specific font description. The default value is "" which means that the default device fonts will be used (and what those are should be listed on the help page for the device). Standard values are "serif", "sans" and "mono", and the Hershey font families are also available. (Different devices may define others, and some devices will ignore this setting completely.) This can be specified inline for text.

fg The color to be used for the foreground of plots. This is the default color used for things like axes and boxes around plots. When called from par() this also sets parameter col to the same value. See section ‘Color Specification’. A few devices have an argument to set the initial value, which is otherwise "black".

fig A numerical vector of the form c(x1, x2, y1, y2) which gives the (NDC) coordinates of the figure region in the display region of the device. If you set this, unlike S, you start a new plot, so to add to an existing plot use new=TRUE as well.

fin The figure region dimensions, (width,height), in inches. If you set this, unlike S, you start a new plot.

font An integer which specifies which font to use for text. If possible, device drivers arrange so that 1 corresponds to plain text (the default), 2 to bold face, 3 to italic and 4 to bold italic. Also, font 5 is expected to be the symbol font, in Adobe symbol encoding. On some devices font families can be selected by family to choose different sets of 5 fonts.

font.axis The font to be used for axis annotation.

font.lab The font to be used for x and y labels.

font.main The font to be used for plot main titles.

font.sub The font to be used for plot sub-titles.

lab A numerical vector of the form c(x, y, len) which modifies the default way that axes are annotated. The values of x and y give the (approximate) number of tickmarks on the x and y axes and len specifies the label length. The default is c(5, 5, 7). Note that this only affects the way the parameters xaxp and yaxp are set when the user coordinate system is set up, and is not consulted when axes are drawn. len is unimplemented in R.

las numeric in {0,1,2,3}; the style of axis labels. 0: always parallel to the axis [default], 1: always horizontal, 2: always perpendicular to the axis, 3: always vertical. Also supported by mtext. Note that string/character rotation via argument srt to par does not affect the axis labels.

lend The line end style. This can be specified as an integer or string: 0 and "round" mean rounded line caps [default]; 1 and "butt" mean butt line caps; 2 and "square" mean square line caps.

lheight The line height multiplier. The height of a line of text (used to vertically space multi-line text) is found by multiplying the character height both by the current character expansion and by the line height multiplier. Default value is 1. Used in text and strheight.

ljoin The line join style. This can be specified as an integer or string: 0 and "round" mean rounded line joins [default]; 1 and "mitre" mean mitred line joins; 2 and "bevel" mean bevelled line joins.

lmitre The line mitre limit. This controls when mitred line joins are automatically converted into bevelled line joins. The value must be larger than 1 and the default is 10. Not all devices will honour this setting.


lty The line type. Line types can either be specified as an integer (0=blank, 1=solid (default), 2=dashed, 3=dotted, 4=dotdash, 5=longdash, 6=twodash) or as one of the character strings "blank", "solid", "dashed", "dotted", "dotdash", "longdash", or "twodash", where "blank" uses ‘invisible lines’ (i.e., does not draw them). Alternatively, a string of up to 8 characters (from c(1:9, "A":"F")) may be given, giving the length of line segments which are alternatively drawn and skipped. See section ‘Line Type Specification’. Some functions such as lines accept a vector of values which are recycled. Other uses will take just the first value if a vector of length greater than one is supplied.

lwd The line width, a positive number, defaulting to 1. The interpretation is device-specific, and some devices do not implement line widths less than one. (See the help on the device for details of the interpretation.) Some functions such as lines accept a vector of values which are recycled. Other uses will take just the first value if a vector of length greater than one is supplied.

mai A numerical vector of the form c(bottom, left, top, right) which gives the margin size specified in inches.

mar A numerical vector of the form c(bottom, left, top, right) which gives the number of lines of margin to be specified on the four sides of the plot. The default is c(5, 4, 4, 2) + 0.1.

mex mex is a character size expansion factor which is used to describe coordinates in the margins of plots. Note that this does not change the font size, rather specifies the size of font (as a multiple of csi) used to convert between mar and mai, and between oma and omi. This starts as 1 when the device is opened, and is reset when the layout is changed (alongside resetting cex).

mfcol, mfrow A vector of the form c(nr, nc). Subsequent figures will be drawn in an nr-by-nc array on the device by columns (mfcol), or rows (mfrow), respectively. In a layout with exactly two rows and columns the base value of "cex" is reduced by a factor of 0.83: if there are three or more of either rows or columns, the reduction factor is 0.66. Setting a layout resets the base value of cex and that of mex to 1. If either of these is queried it will give the current layout, so querying cannot tell you the order in which the array will be filled. Consider the alternatives, layout and split.screen.

mfg A numerical vector of the form c(i, j) where i and j indicate which figure in an array of figures is to be drawn next (if setting) or is being drawn (if enquiring). The array must already have been set by mfcol or mfrow. For compatibility with S, the form c(i, j, nr, nc) is also accepted, when nr and nc should be the current number of rows and number of columns. Mismatches will be ignored, with a warning.

mgp The margin line (in mex units) for the axis title, axis labels and axis line. Note that mgp[1] affects title whereas mgp[2:3] affect axis. The default is c(3, 1, 0).

mkh The height in inches of symbols to be drawn when the value of pch is an integer. Completely ignored in R.

new logical, defaulting to FALSE. If set to TRUE, the next high-level plotting command (actually plot.new) should not clean the frame before drawing as if it were on a new device. It is an error (ignored with a warning) to try to use new = TRUE on a device that does not currently contain a high-level plot.

oma A vector of the form c(bottom, left, top, right) giving the size of the outer margins in lines of text.

omd A vector of the form c(x1, x2, y1, y2) giving the region inside outer margins in NDC (= normalized device coordinates), i.e., as a fraction (in [0, 1]) of the device region.

omi A vector of the form c(bottom, left, top, right) giving the size of the outer margins in inches.

pch Either an integer specifying a symbol or a single character to be used as the default in plotting points. See points for possible values and their interpretation. Note that only integers and single-character strings can be set as a graphics parameter (and not NA nor NULL).

pin The current plot dimensions, (width,height), in inches.

plt A vector of the form c(x1, x2, y1, y2) giving the coordinates of the plot region as fractions of the current figure region.

ps integer; the point size of text (but not symbols). Unlike the pointsize argument of most devices, this does not change the relationship between mar and mai (nor oma and omi). What is meant by ‘point size’ is device-specific, but most devices mean a multiple of 1bp, that is 1/72 of an inch.

pty A character specifying the type of plot region to be used; "s" generates a square plotting region and "m" generates the maximal plotting region.


smo (Unimplemented) a value which indicates how smooth circles and circular arcs should be.

srt The string rotation in degrees. See the comment about crt. Only supported by text.

tck The length of tick marks as a fraction of the smaller of the width or height of the plotting region. If tck >= 0.5 it is interpreted as a fraction of the relevant side, so if tck = 1 grid lines are drawn. The default setting (tck = NA) is to use tcl = -0.5.

tcl The length of tick marks as a fraction of the height of a line of text. The default value is -0.5; setting tcl = NA sets tck = -0.01 which is S' default.

usr A vector of the form c(x1, x2, y1, y2) giving the extremes of the user coordinates of the plotting region. When a logarithmic scale is in use (i.e., par("xlog") is true, see below), then the x-limits will be 10 ^ par("usr")[1:2]. Similarly for the y-axis.

xaxp A vector of the form c(x1, x2, n) giving the coordinates of the extreme tick marks and the number of intervals between tick-marks when par("xlog") is false. Otherwise, when log coordinates are active, the three values have a different meaning: For a small range, n is negative, and the ticks are as in the linear case, otherwise, n is in 1:3, specifying a case number, and x1 and x2 are the lowest and highest power of 10 inside the user coordinates, 10 ^ par("usr")[1:2]. (The "usr" coordinates are log10-transformed here!) n=1 will produce tick marks at 10^j for integer j, n=2 gives marks k 10^j with k in {1,5}, n=3 gives marks k 10^j with k in {1,2,5}. See axTicks() for a pure R implementation of this. This parameter is reset when a user coordinate system is set up, for example by starting a new page or by calling plot.window or setting par("usr"): n is taken from par("lab"). It affects the default behaviour of subsequent calls to axis for sides 1 or 3.

xaxs The style of axis interval calculation to be used for the x-axis. Possible values are "r", "i", "e", "s", "d". The styles are generally controlled by the range of data or xlim, if given. Style "r" (regular) first extends the data range by 4 percent at each end and then finds an axis with pretty labels that fits within the extended range. Style "i" (internal) just finds an axis with pretty labels that fits within the original data range. Style "s" (standard) finds an axis with pretty labels within which the original data range fits. Style "e" (extended) is like style "s", except that it is also ensures that there is room for plotting symbols within the bounding box. Style "d" (direct) specifies that the current axis should be used on subsequent plots. (Only "r" and "i" styles have been implemented in R.)

xaxt A character which specifies the x axis type. Specifying "n" suppresses plotting of the axis. The standard value is "s": for compatibility with S values "l" and "t" are accepted but are equivalent to "s": any value other than "n" implies plotting.

xlog A logical value (see log in plot.default). If TRUE, a logarithmic scale is in use (e.g., after plot(*, log = "x")). For a new device, it defaults to FALSE, i.e., linear scale.

xpd A logical value or NA. If FALSE, all plotting is clipped to the plot region, if TRUE, all plotting is clipped to the figure region, and if NA, all plotting is clipped to the device region. See also clip.

yaxp A vector of the form c(y1, y2, n) giving the coordinates of the extreme tick marks and the number of intervals between tick-marks unless for log coordinates, see xaxp above.

yaxs The style of axis interval calculation to be used for the y-axis. See xaxs above.

yaxt A character which specifies the y axis type. Specifying "n" suppresses plotting.

ylbias A positive real value used in the positioning of text in the margins by axis and mtext. The default is in principle device-specific, but currently 0.2 for all of R's own devices. Set this to 0.2 for compatibility with R < 2.14.0 on x11 and windows() devices.

ylog A logical value; see xlog above
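Example (a minimal sketch, not from the original notes; the specific values are only illustrative) of setting and querying several of the parameters above in one session:
op <- par(no.readonly = TRUE)              # save the current settings so they can be restored
par(mfrow = c(2, 2),                       # 2-by-2 array of figures, filled by rows
    oma   = c(2, 2, 3, 1),                 # outer margins, in lines of text
    mgp   = c(2, 0.7, 0),                  # placement of axis title, labels and axis line
    tcl   = -0.3,                          # shorter tick marks
    pch   = 19)                            # default plotting symbol
for (i in 1:4) plot(rnorm(20), main = paste("panel", i))
mtext("Overall title in the outer margin", outer = TRUE, cex = 1.2)
par("usr")                                 # query the user coordinates of the last plot
par("mfg")                                 # query which figure in the array was drawn last
par(op)                                    # restore the saved settings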


COLORS

List all the available graphics colors sColors colors() Hexadecimal Color Chart

List of Colors in [R]


Color Palette palette() The palette is what is supplied to col arguments referenced by number. The palette can be changed to any subset of the colors above {palette(colors()[subset of numbers from chart])} and then reset using the "default" argument. Default colors: black, red, green3, blue, cyan, magenta, yellow, gray

Changing Colors in Arguments Example

frame()

textClick("GGG",colors()[47],4)

textClick("GGG",colors()[134],4)

textClick("GGG",colors()[500],4)

textClick("GGG",colors()[551],4)

textClick("GGG",colors()[634],4)

palette() # obtain the current palette

palette(rainbow(6)) # six color rainbow

palette() # obtain the current palette

palette(colors()[c(1,10,20,30,40,50,60,70,80,90,100)]) #11 colors

palette() # obtain the current palette

palette("default") # reset the color palette

palette() # obtain the current palette

Compare the numbers to the number chart above.


Show Some of [R]'s colors by name and color library(DAAG) show.colors(type=c("shades"), order.cols=TRUE) show.colors(type=c("gray"), order.cols=TRUE) show.colors(type=c("singles"), order.cols=TRUE)

EXAMPLE

plot(mpg~disp, col="blue4", data=mtcars) #using shades
Preset Palettes
rainbow(n, s = 1, v = 1, start = 0, end = max(1,n - 1)/n, alpha = 1)
gray.colors(n, start = 0.3, end = 0.9, gamma = 2.2)
heat.colors(n, alpha = 1)
terrain.colors(n, alpha = 1)
topo.colors(n, alpha = 1)
cm.colors(n, alpha = 1)
Arguments

n the number of colors (≥ 1) to be in the palette.

s,v the ‘saturation’ and ‘value’ to be used to complete the HSV color descriptions.

start the (corrected) hue in [0,1] at which the rainbow begins.

end the (corrected) hue in [0,1] at which the rainbow ends.

alpha the alpha transparency, a number in [0,1], see argument alpha in hsv.
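A minimal sketch (my own example, not from the notes) of supplying these preset palettes directly to a col argument:
n <- 8
par(mfrow = c(2, 3), mar = c(2, 2, 2, 1))
barplot(rep(1, n), col = rainbow(n), main = "rainbow")
barplot(rep(1, n), col = heat.colors(n), main = "heat.colors")
barplot(rep(1, n), col = terrain.colors(n), main = "terrain.colors")
barplot(rep(1, n), col = topo.colors(n), main = "topo.colors")
barplot(rep(1, n), col = cm.colors(n), main = "cm.colors")
barplot(rep(1, n), col = gray.colors(n), main = "gray.colors")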

Change the Background Color par(bg="color") Change the Foreground Color par(fg="color") Select Random Color Self-created color randomization function ran.col(c(dataframe, vector, number), color.choice = c(colors, rainbow, heat, terrain, topo, cm))

EXAMPLE

frame()

terrain.colors(6)

textClick("GGG",terrain.colors(7)[1],4)

textClick("GGG",terrain.colors(7)[2],4)

textClick("GGG",terrain.colors(7)[3],4)

textClick("GGG",terrain.colors(7)[4],4)

textClick("GGG",terrain.colors(7)[5],4)

textClick("GGG",terrain.colors(7)[6],4)

x11(16,8)

par(mfrow = c(2,3))

with(mtcars,plot(mpg,disp,pch=19,col=ran.col(mtcars,colors),main="COLORS"))

with(mtcars,plot(mpg,disp,pch=19,col=ran.col(mtcars,rainbow),main="RAINBOW"))

with(mtcars,plot(mpg,disp,pch=19,col=ran.col(mtcars,heat),main="HEAT"))

with(mtcars,plot(mpg,disp,pch=19,col=ran.col(mtcars,terrain),main="TERRAIN"))

with(mtcars,plot(mpg,disp,pch=19,col=ran.col(mtcars,topo),main="TOPO"))

with(mtcars,plot(mpg,disp,pch=19,col=ran.col(3,cm),main="CM"))

ran.col(6,colors)

#USING TO SET PALETTE

palette() #current palette

palette(ran.col(10)) #set palette

palette() #current palette

with(mtcars,plot(mpg,disp,pch=19,col=cyl,main="COLORS"))

palette("default") #return to default


Plot two graphs in the same pane (Overlay graphs) par(new=TRUE)

Building Plot Frames from Pieces plot(x, y, type="n",xlab="",ylab="", axes=F) points(x, y) axis(1) axis(2,at=seq(.2,1.8,.2)) box() Plot Grid Lines grid(nx = NULL, ny = nx, col = "lightgray", lty = "dotted",lwd = par("lwd"), equilogs = TRUE)

EXAMPLE

frame()

grid(col="blue")

shapeClick("seg",col="red")

EXAMPLES

EXAMPLE A

plot(mpg~as.factor(cyl), col="green", data=mtcars)

par(new=TRUE)

plot(mpg~cyl, col="blue", xlab="", axes=F, data=mtcars)

EXAMPLE B

x11()

frame()

with(mtcars,plot(mpg~disp))

shapeClick("poly",6,border="red",col="yellow")

shapeClick("poly",6,border="red",col="green")

shapeClick("poly",6,border="red",col="orange")

par(new=T)

with(mtcars,plot(mpg~disp))

EXAMPLE

attach(mtcars)

plot(mpg, disp, type="n",xlab="",ylab="", axes=F)

points(mpg, disp,col="blue")

axis(1,at=seq(0,35,5),col="red",col.axis="green",lwd=3)

axis(2,lwd=6)

axis(3,seq_along(mpg), c(LETTERS,LETTERS[1:6]), col.axis = "blue")

axis(4)

box(col="orange",lwd=7)

title(main="YEPPER",xlab="OK GUY", ylab="YOU DA MAN",sub="SUBTITLE")

detach(mtcars)

plot(1:10, xaxt = "n")

axis(1, xaxp=c(0, 9, 5))

plot(1:10, xaxt = "n")

axis(1, xaxp=c(2, 9, 7))



Title and Labels for Graphics Type: plot(x, y, main="The Title", xlab="X Axis Label", ylab="Y Axis Label") Where plot is the function, x is the x variable, y is the y variable, "The Title" is what the graph will be named, "X Axis Label" is the name of the X axis, and "Y Axis Label" is the name of the Y axis. An example of plotting without and with the titles and labels. Note: This can be applied to any graphic: hist(p, main="Parent Aggression Levels", xlab="Aggression Range", ylab="Number of Occurrences")


Varying Graphs on a Page #1 layout(matrix(c(...), rows, columns)) Work out the rows and columns first; this creates the grid for the matrix specification. For a 2x2 grid (rows by columns), matrix(c(1,1,2,3), 2, 2, byrow=TRUE) gives graph 1 the two boxes in the top row and graphs 2 and 3 one box each on the bottom. Varying Graphs on a Page #2 (controls size of window and layout) source("C:/Users/Rinker/Desktop/PhD Program/CEP 523-Stat Meth Ed Inference/R Stuff/Scripts/Graphics/Multiple Graphics Function.txt")

multiG (width, height, columns, rows, matrix) For an example use: EXAMPLE(multiG)

#===================================================

# VARYING GRAPHS PER PAGE

#===================================================

attach(mtcars)

#===================================================

layout(matrix(c(1,1,2,3), 2, 2, byrow = TRUE))

hist(wt)

hist(mpg)

hist(disp)

#===================================================

windows()

layout(matrix(c(2,1,2,3), 2, 2, byrow = TRUE))

hist(wt)

hist(mpg)

hist(disp)

#===================================================

windows()

layout(matrix(c(1,2,3,3), 2, 2, byrow = TRUE))

hist(wt)

hist(mpg)

hist(disp)

#===================================================

windows(h=6,w=8)

layout(matrix(c(1,2,3,3,4,5), 3, 2, byrow = TRUE))

hist(wt)

hist(mpg)

hist(disp)

hist(drat)

hist(qsec)

See my created function for doing this quickly.


Varying Graphs on a Page #3 library(plotrix) Split the graphics device into a "panel" type layout for a group of plots panes(mat=NULL,widths=rep(1,ncol(mat)),heights=rep(1,nrow(mat)), nrow=2,ncol=2, mar=c(0,0,1.6,0),oma=c(2.5,1,1,1)) Arguments:

EXAMPLE

y<-runif(8)

panes(matrix(1:4,nrow=2,byrow=TRUE))

par(mar=c(0,2,1.6,0))

boxplot(y,axes=FALSE)

axis(1)

box()

par(mar=c(0,0,1.6,2))

tab.title("Boxplot of y",tab.col="#88dd88")

barplot(y,axes=FALSE,col=2:9)

axis(4)

box()

tab.title("Barplot of y",tab.col="#88dd88")

par(mar=c(2,2,1.6,0))

pie(y,col=2:9)

tab.title("Pie chart of y",tab.col="#88dd88")

box()

par(mar=c(2,0,1.6,2))

plot(y,xaxs="i",xlim=c(0,9),axes=FALSE,col=2:9)

axis(4)

box()

tab.title("Scatterplot of y",tab.col="#88dd88")

# center the title at the left edge of the last plot

mtext("Test of panes function",at=0,side=1,line=0.8,cex=1.5)

panes(matrix(1:3,ncol=1),heights=c(0.7,0.8,1))

par(mar=c(0,2,2,2))

plot(sort(runif(7)),type="l",axes=FALSE)

axis(2,at=seq(0.1,0.9,by=0.2))

box()

tab.title("Rising expectations",tab.col="#ee6666")

barplot(rev(sort(runif(7))),col="blue",axes=FALSE)

axis(2,at=seq(0.1,0.9,by=0.2))

box()

tab.title("Diminishing returns",tab.col="#6666ee")

par(mar=c(4,2,2,2))

tso<-c(0.2,0.3,0.5,0.4,0.6,0.8,0.1)

plot(tso,type="n",axes=FALSE,xlab="")

# the following needs a Unicode locale to work

points(1:7,tso,pch=c(rep(-0x263a,6),-0x2639),cex=2)

axis(1,at=1:7,

labels=c("Tuesday","Wednesday","Thursday","Friday",

"Saturday","Sunday","Monday"))

axis(2,at=seq(0.1,0.9,by=0.2))

box()

tab.title("The sad outcome",tab.col="#66ee66")

mtext("A lot of malarkey",side=1,line=2.5)


Put a Box Around a Figure or a Group of Figures box()

#EXAMPLES

test<-rnorm(100);plot(test)

box("figure", lwd=2)

test<-rnorm(100);plot(test)

box("outer", lwd =2)

par(mfrow = c(2, 2))

plot(test)

box("figure", lwd=1)

plot(test)

box("figure", lwd=1)

plot(test)

box("figure", lwd=1)

plot(test)

box("figure", lwd=1)

box("outer", lwd =5, col="red")


Graph Types

Pie Graph pie(x, labels, ...) x is a vector of values; labels is a vector of label names. See examples below. NOTE: Cleveland (1985) states that a pie chart is a poor choice for displaying information. 3-D Pie Graph library(plotrix) pie3D(x, labels, ...) x is a vector of values; labels is a vector of label names. Very similar to pie. See examples below.

#===========================================================================

# THE DATA

#===========================================================================

slices <- c(11, 12,4, 16, 8)

N<-sum(slices)

Percents<-format(digits=3,(slices/N)*100)

lbls <- c("US", "UK", "Australia", "Germany", "France")

lbls2<-paste(lbls," ",Percents,"%",sep="")

#===========================================================================

# PIE PLOTS (not a preferred method of display)

#===========================================================================

windows(height=6,width=10);par(mfrow=c(2,2))

#...........................................................................

# TYPE 1

#...........................................................................

pie(slices, labels = lbls, main="Pie Chart of Countries")

#...........................................................................

# TYPE 2

#...........................................................................

pie(slices, labels ="", main="Pie Chart of Countries",

col=c("blue","red","green","yellow","orange"))

legend(1.60,.7,lbls,cex=0.8,fill=c("blue","red","green","yellow","orange"))

#...........................................................................

# TYPE 3

#...........................................................................

pie(slices, labels = lbls2, main="Pie Chart of Countries",

col=c("blue","chocolate","red","yellow","bisque"))

#...........................................................................

# TYPE 4

#...........................................................................

pie(slices,labels=paste(Percents,"%",sep=""),main="Pie Chart of Countries",

col=c("blue","red","green","yellow","orange"))

legend(1.60,.7,lbls,cex=0.8,fill=c("blue","red","green","yellow","orange"))

#===========================================================================

# THE DATA

#===========================================================================

slices <- c(11, 12,4, 16, 8)

N<-sum(slices)

Percents<-format(digits=3,(slices/N)*100)

lbls <- c("US", "UK", "Australia", "Germany", "France")

lbls2<-paste(lbls," ",Percents,"%",sep="")

#===========================================================================

# PIE PLOTS (not a preferred method of display)

#===========================================================================

library(plotrix);windows(h=6,w=12);par(mfrow=c(1,2))

#...........................................................................

# TYPE 1

#...........................................................................

pie3D(slices, labels = lbls, main="Pie Chart of Countries")

#...........................................................................

# TYPE 2

#...........................................................................

pie3D(slices, labels ="", main="Pie Chart of Countries",

col=c("blue","red","green","yellow","orange"))

legend(.47,1,lbls,cex=0.8,fill=c("blue","red","green","yellow","orange"))

windows(h=6,w=12);par(mfrow=c(1,2))

#...........................................................................

# TYPE 3

#...........................................................................

pie3D(slices, labels = lbls2, main="Pie Chart of Countries",

col=c("blue","chocolate","red","yellow","bisque"),labelcex=1.1)

#...........................................................................

# TYPE 4

#...........................................................................

pie3D(slices,labels=paste(Percents,"%",sep=""),main="Pie Chart of Countries",

col=c("blue","red","green","yellow","orange"))

legend(.55,1.05,lbls,cex=0.8,fill=c("blue","red","green","yellow","orange"))


Dot Chart dotchart(x, labels, ...) x is a vector of values; labels is a vector of label names. Very similar to pie. See examples below. NOTE: The dot chart is preferred to the pie graph. It can display everything a pie graph can and then some. StripPlot library(lattice) stripplot(factor~numeric)

#===========================================================================

# THE DATA

#===========================================================================

slices <- c(11, 12,4, 16, 8)

N<-sum(slices)

Percents<-format(digits=3,(slices/N)*100)

lbls <- c("US", "UK", "Australia", "Germany", "France")

lbls2<-paste(lbls," ",Percents,"%",sep="")

#===========================================================================

# DOT PLOTS (preferred over pie charts)

#===========================================================================

windows(h=6,w=12);par(mfrow=c(1,2))

#...........................................................................

# Simple 1

#...........................................................................

dotchart(slices,labels=lbls2,cex=.7,

main="Dot Plot COuntries Comparison",

xlab="Corn Production (Millions of Bushels)",

col=c("blue","red","darkgreen","black","orange"))

#...........................................................................

# Simple 2 Colored

#...........................................................................

dotchart(mtcars$mpg,labels=row.names(mtcars),cex=.7,

main="Gas Milage for Car Models",

xlab="Miles Per Gallon")

#...........................................................................

# By Group-Colored(Cylinders)

#...........................................................................

windows(h=6,w=6);par(mfrow=c(1,1))

x <- mtcars[order(mtcars$mpg),] # sort by mpg

x$cyl <- factor(x$cyl) # it must be a factor

x$color[x$cyl==4] <- "red"

x$color[x$cyl==6] <- "blue"

x$color[x$cyl==8] <- "darkgreen"

dotchart(x$mpg,labels=row.names(x),cex=.7,groups= x$cyl,

main="Gas Milage for Car Models\ngrouped by cylinder",

xlab="Miles Per Gallon", gcolor="black", color=x$color)

mtext("Cars Grouped by Cylinder", side = 2, line = 2, cex = .7)

library(lattice)

stripplot(factor(cyl,levels=c("8","4","6"))~mpg,data=mtcars)

stripplot(factor(cyl,levels=c("8","6","4"))~mpg,main="Milage by Cylinder

Type",ylab="Cylinders",data=mtcars)


Venn Diagram 1 svenn library(venneuler) #Example: List1 <- c("apple", "apple", "orange", "kiwi", "cherry", "peach")

List2 <- c("apple", "orange", "cherry", "tomato", "pear", "plum", "plum")

Lists <- list(List1, List2) #put the word vectors into a list to supply lapply

items <- sort(unique(unlist(Lists))) #put in alphabetical order

MAT <- matrix(rep(0, length(items)*length(Lists)), ncol=2) #make a matrix of 0's

colnames(MAT) <- paste0("List", 1:2)

rownames(MAT) <- items

lapply(seq_along(Lists), function(i) { #fill the matrix

MAT[items %in% Lists[[i]], i] <<- table(Lists[[i]])

})

MAT #look at the results

library(venneuler)

v <- venneuler(MAT)

plot(v)

Venn Diagram 2 library(gplots) venn(data, universe=NA, small=0.7, showSetLogicLabel=FALSE, simplify=FALSE, show.plot=TRUE)

Arguments data,x

Either a list containing vectors of names or indices of group members, or a data frame containing Boolean indicators of group membership

universe Subset of valid name/index elements. Values in data not in this list will be ignored. Use NA to use all elements of data (the default).

small Character scaling of the smallest group counts

showSetLogicLabel Logical flag indicating whether the internal group label should be displayed

simplify Logical flag indicating whether unobserved groups should be omitted.

show.plot Logical flag indicating whether the plot should be displayed. If false, simply returns the group count matrix.
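A small sketch of gplots::venn() with made-up data (my own example; the object and group names are arbitrary):
library(gplots)
set.seed(1)
venn_list <- list(GroupA = sample(letters, 16),
                  GroupB = sample(letters, 16),
                  GroupC = sample(letters, 16))
venn(venn_list)                                # draw the Venn diagram
counts <- venn(venn_list, show.plot = FALSE)   # return the group count matrix only
counts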


Line Graph See Example Below

Line Graph With Confidence Interval library(sciplot) lineplot.CI(x.factor=, response=, main=" ", data=, xlab="", ylab="") x.factor is the grouping variable, response is the numeric measure

EXAMPLE

#oneway

lineplot.CI(x.factor=cyl, response=mpg, main="MPG by Cylinder Type", data=mtcars,

xlab="Cylinders",ylab="mpg")

#twoway

lineplot.CI(x.factor=cyl, response=mpg,group=am, main="MPG by Cylinder Type",

data=mtcars,xlab="Cylinders",ylab="mpg")

cars <- c(1, 3, 6, 4, 9)

trucks <- c(2, 5, 4, 5, 12)

# Calculate range from 0 to max value of cars and trucks

g_range <- range(0, cars, trucks)

plot(cars, type="o", col="blue", ylim=g_range,

axes=FALSE, ann=FALSE)

# Make x axis using Mon-Fri labels

axis(1, at=1:5, lab=c("Mon","Tue","Wed","Thu","Fri"))

# Make y axis with horizontal labels that display ticks at

# every 4 marks. 4*0:g_range[2] gives c(0, 4, 8, ..., 48); ticks outside the plot range are not drawn.

axis(2, las=1, at=4*0:g_range[2])

# Create box around plot

box()

# Graph trucks with red dashed line and square points

lines(trucks, type="o", pch=22, lty=2, col="red")

# Create a title with a red, bold/italic font

title(main="Autos", col.main="red", font.main=4)

# Label the x and y axes with dark green text

title(xlab="Days", col.lab=rgb(0,0.5,0))

title(ylab="Total", col.lab=rgb(0,0.5,0))

# Create a legend at (1, g_range[2]) that is slightly smaller

# (cex) and uses the same line colors and points used by

# the actual plots

legend(1, g_range[2], c("cars","trucks"), cex=0.8,

col=c("blue","red"), pch=21:22, lty=1:2)


Line Graph 2 (more for joining points of existing plots) Yellow is the code responsible Interaction Plot method 1 see also effects below interaction.plot(x.factor, trace.factor, response, fun = mean, type = c("l", "p", "b"), legend = TRUE, trace.label = deparse(substitute(trace.factor)), fixed = FALSE, xlab = deparse(substitute(x.factor)), ylab = ylabel, ylim = range(cells, na.rm=TRUE), lty = nc:1, col = 1, pch = c(1:9, 0, letters), xpd = NULL, leg.bg = par("bg"), leg.bty = "n", xtick = FALSE, xaxt = par("xaxt"), axes = TRUE, ...) Arguments:

EXAMPLE: HW19<-read.table("HW19.csv", header=TRUE, sep=",",na.strings="NA");HW19$Attitude<-as.factor(HW19$Attitude);x11(18,8)

frame()

par(mfrow=c(1,3))

with(HW19,interaction.plot(Attitude,Gender,Science.Comprehension,lwd=3,col=c(11,4)))

with(HW19,interaction.plot(Attitude,Grade,Science.Comprehension,lwd=3,col=c(6,2)))

with(HW19,interaction.plot(Grade,Gender,Science.Comprehension,lwd=3,col=c("orange","purple")))

EXAMPLE

with(mtcars,plot(mpg,hp,main="Norah's Cries",

xlab="Time",ylab="Decibals"))

sequence<-with(mtcars,order(mpg))

with(mtcars,lines(mpg[sequence],hp[sequence],

col="green",lwd=2))

shapeClick("arrow",code=1,col="blue",lwd=2)

shapeClick("arrow",code=1,col="blue",lwd=2)

shapeClick("arrow",code=1,col="blue",lwd=2)

text(locator(1), "Begining RTI!", pos=4)

text(locator(1), "It get's worse before", pos=4)

text(locator(1), "it gets better!", pos=4)

text(locator(1), "Extinction Bursts", pos=4)

shapeClick("box",border="blue",lwd=2)

shapeClick("box",border="blue",lwd=2)

shapeClick("box",border="blue",lwd=2)


Interaction Plot method 2 library(effects) plot(effect (term1:term2, fit, list(term3=c(levels))), multiline=TRUE)

Interaction Plot method 3 library(HH) interaction2wt(y~x1*x2…, data=)

EXAMPLE

HW19<-read.table("HW19.csv", header=TRUE, sep=",",na.strings="NA");HW19$Attitude<-

as.factor(HW19$Attitude)

fit <- with(HW19, lm(Science.Comprehension~Gender * Attitude * Grade))

attach(HW19)

plot(effect("Gender:Attitude", fit, list(Gender=c("f","m"))),multiline=T)

plot(effect("Attitude:Grade", fit, list(Grade=c("eight","nine"))),multiline=T)

#had to x out of graph

plot(effect("Gender:Grade", fit, list(Grade=c("eight","nine"))),multiline=T)

interaction2wt(len~supp*dose, data=ToothGrowth)


Plot the columns of one matrix or dataframe against the columns of another matplot(x, y, type = "p", lty = 1:5, lwd = 1, lend = par("lend"), pch = NULL, col = 1:6, cex = NULL, bg = NA, xlab = NULL, ylab = NULL, xlim = NULL, ylim = NULL)

Arguments

Example

x <- as.matrix( EuStockMarkets[1:50,] )

matplot(x,main = "matplot (standard)", xlab = "", ylab = "")

matplot(x, type="l", lty=1, main = "matplot (line)", xlab = "", ylab = "")

#=============================================================

x <- as.data.frame(x)

matplot(x,main = "matplot (standard)", xlab = "", ylab = "")

matplot(x, type="l", lty=1, main = "matplot (line)", xlab = "", ylab = "")

what type of plot should be drawn. Possible types are

"p" for points,

"l" for lines,

"b" for both,

"c" for the lines part alone of "b",

"o" for both ‘overplotted’,

"h" for ‘histogram’ like (or ‘high-density’) vertical lines,

"s" for stair steps,

"S" for other steps, see ‘Details’ below,

"n" for no plotting.


Bar graph sbar graph sbarplot barplot(x) ?barplot for details; horiz=TRUE yields horizontal bars Bar graph with Confidence Intervals library(sciplot) bargraph.CI(x.factor=, response=, main=" ", data=, xlab="", ylab="") x.factor is the grouping variable, response is the numeric measure
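A quick sketch (invented data) contrasting vertical and horizontal bars via the horiz argument:
counts <- c(A = 4, B = 7, C = 2)
par(mfrow = c(1, 2))
barplot(counts, main = "horiz = FALSE (vertical bars)")
barplot(counts, horiz = TRUE, las = 1, main = "horiz = TRUE (horizontal bars)")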

3-D Bar Plot Additional Niceties for Bar Graphs barplot(VADeaths, beside=TRUE, las=1)

abline(h=seq(0, 100, by=1), col="gray90")

abline(h=seq(0, 100, by=10), col="gray")

par(new=T)

barplot(VADeaths, beside=TRUE, las=1)

barplot(VADeaths, beside=TRUE, las=1)

abline(h=seq(0, 100, by=5), col="gray90")

abline(h=seq(0, 100, by=10), col="gray")

par(new=T)

barplot(VADeaths, beside=TRUE, las=1)

barplot(VADeaths, beside=TRUE, las=1)

abline(h=0:100, col="white")

barplot(

VADeaths, beside=TRUE, las=1,

add=TRUE, col=FALSE

)

source("C:/Users/Rinker/Desktop/PhD Program/CEP 523-Stat Meth Ed Inference/R Stuff/Scripts/Graphics/3-D Bar plot.txt")

EXAMPLE

bargraph.CI(x.factor=cyl, response=mpg, main="MPG by Cylinder Type", data=mtcars,

xlab="Cylinders",ylab="mpg")


Add Text Directly Below Or Above Bars

day <- c(0:28)

ndied <- c(342,335,240,122,74,64,49,60,51,44,35,48,41,34,38,

27,29,23,20,15,20,16,17,17,14,10,4,1,2)

pdied <- c(19.1,18.7,13.4,6.8,4.1,3.6,2.7,3.3,2.8,2.5,2.0,2.7,

2.3,1.9,2.1,1.5,1.6,1.3,1.1,0.8,1.1,0.9,0.9,0.9,

0.8,0.6,0.2,0.1,0.1)

pmort <- data.frame(day,ndied,pdied)

barX <- barplot(pmort$pdied,xlab="Age(days)",

ylab="Percent", names=pmort$day,

xlim=c(0,35),ylim=c(0,20),legend="Mortality")

text(cex=.5, x=barX, y=pmort$pdied+par("cxy")[2]/2, pmort$ndied,

xpd=TRUE, col='darkgreen')

text(cex=.5, x=barX, y=-.5, pmort$ndied, xpd=TRUE, col="blue")

X2sum <- c(42.6, 3.6, 1.8, 3.9, 12.1, 14.3, 14.6 ,28.4)

X2.labels <- c("No earnings", "Less than $5000/year", "$5K to $10K" ,

"$10K to $15K" , "$ 15K to $20K" , "$20K to $25K" , "$25K to $30K",

"Over $30K" )

barCenters <- barplot(X2sum)

text(barCenters, par("usr")[3] - 0.5, srt = 45, adj = 1,

labels =X2.labels, xpd = TRUE, cex=.7)

mtcars2 <- mtcars[order(-mtcars$mpg), ]

par(cex.lab=1, cex.axis=.6,

mar=c(6.5, 3, 2, 2) + 0.1, xpd=NA) #shrink axis text and increase bot. mar.

barX <- barplot(mtcars2$mpg,xlab="Cars", main="MPG of Cars",

ylab="", names=rownames(mtcars2), mgp=c(5,1,0),

ylim=c(0, 35), las=2, col=mtcars2$cyl)

mtext(side=2, text="MPG", cex=1, padj=-2.5)

text(cex=.5, x=barX, y=mtcars2$mpg+par("cxy")[2]/2, mtcars2$hp, xpd=TRUE)

text(cex=.5, x=barX, y=-.5, mtcars2$gear, xpd=TRUE, col="red")


Histogram shistogram hist(Set$Attitude, col="purple", breaks=20) hist(Set$Attitude, col="purple") Histogram with kernel density and normal curve library(descr)

histkdnc(variable) Additional arguments are similar to histogram

Histograms for all variables in a data frame or matrix hist.data.frame(data.frame) library(Hmisc) Histograms with normal curve and density plot for all variables in a data frame or matrix multi.hist(dataframe) library(psych)
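Since the Set data frame used above is not included here, a hedged sketch of the same functions run on mtcars (assuming the default arguments of histkdnc, hist.data.frame and multi.hist):
hist(mtcars$mpg, col = "purple", breaks = 10)   # basic histogram

library(descr)
histkdnc(mtcars$mpg)             # histogram overlaid with kernel density and normal curve

library(Hmisc)
hist.data.frame(mtcars[, 1:4])   # histogram for every variable in the data frame subset

library(psych)
multi.hist(mtcars[, 1:4])        # histograms with density and normal curve for each variable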


Density Plot plot(density(d1$mathscore), main="yes", xlab="bad", ylab="good") #The Plot polygon(density(d1$mathscore), col="orange", border="purple") #Coloring and the Border Histogram-density plot with qq plot (check normality) source("C:/Users/Rinker/Desktop/PhD Program/CEP 523-Stat Meth Ed Inference/R Stuff/Scripts/Graphics/Histogram-density plot with qq plot.txt")

QQhist(x) for an example use: QQhist.fun() Plot 2 or more density plots library(sm) sm.density.compare(num.variable,factor)

*GRAPH IS EMBELLISHED WITH MEAN LINES FOR EACH GROUP AND EXTRA MEAN LINES TO EXPLAIN 4 CYL.'S BIMODAL GRAPH

EXAMPLE

library(sm)
library(doBy) # for recodeVar() used below

mtcars2<-mtcars

mtcars2$cyl<-as.factor(with(mtcars,recodeVar(cyl,

src=c(4,6,8),tgt=c("four","six","eight"), default=NULL,

keep.na=TRUE)))

fm<-mean(subset(mtcars2,cyl=="four")$mpg)

sm<-mean(subset(mtcars2,cyl=="six")$mpg)

em<-mean(subset(mtcars2,cyl=="eight")$mpg)

with(mtcars2,sm.density.compare(mpg,cyl)) #plot several densities @ once

abline(v=mean(subset(mtcars2,cyl=="four")$mpg)) #plot means

abline(v=mean(subset(mtcars2,cyl=="six")$mpg))

abline(v=mean(subset(mtcars2,cyl=="eight")$mpg))

# uh oh 4 cyl is bi-modal. Why?

#plot of means for four cylinder by displacement;

#the factor that makes this group's graph bi-modal

abline(v=mean(fmDF[c(1,2,3,7,9,11),1]),col="orange")

abline(v=mean(fmDF[c(4,5,6,8,10),1]),col="pink")

legend(locator(1),c("Four Cylinder","Six Cylinder", "Eight Cylinder", "4cyl low disp", "4cyl high disp"),

fill=c("green","blue","red", "orange", "pink"))


Histogram with colored tails (2 sd or whatever you set) histogram <- hist(scale(vector), breaks= , plot=FALSE) plot(histogram, col=ifelse(abs(histogram$breaks) < #of SD, Color 1, Color 2)) Example windows(13,4) par(mfrow=c(1,2)) histograph <- hist(scale(mtcars$mpg), breaks=10, plot=FALSE) plot(histograph, main="Histogram of MPG",col=ifelse(abs(histograph$breaks) < 2, 5, 8)) x <- rnorm(1000) hx <- hist(x, breaks=150, plot=FALSE) plot(hx, col=ifelse(abs(hx$breaks) < 2, 3, 6)) Stem and Leaf Plot stem(x, scale = 1, width = 80, atom = 1e-08)

Arguments

x a numeric vector. scale This controls the plot length. width The desired width of plot.

atom a tolerance.

EXAMPLE:

x<-round(mtcars$mpg)

stem(x, scale=.5)

stem(x, scale=.25)


MULTIVARIATE DATA PLOTS

Star Plot (Multivariate data) Draw star plots or segment diagrams of a multivariate data set stars(x, full = TRUE, scale = TRUE, radius = TRUE, labels = dimnames(x)[[1]], locations = NULL, nrow = NULL, ncol = NULL, len = 1, key.loc = NULL, key.labels = dimnames(x)[[2]], key.xpd = TRUE, xlim = NULL, ylim = NULL, flip.labels = NULL, draw.segments = FALSE, col.segments = 1:n.seg, col.stars = NA, axes = FALSE, frame.plot = axes, main = NULL, sub = NULL, xlab = "", ylab = "", cex = 0.8, lwd = 0.25, lty = par("lty"), xpd = FALSE, mar = pmin(par("mar"), 1.1+ c(2*axes+ (xlab != ""), 2*axes+ (ylab != ""), 1,0)), add = FALSE, plot = TRUE, ...) ARGUMENTS example(stars)
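The two calls below are adapted from the examples on the stars() help page (example(stars) runs the full set); the key location is only illustrative:
stars(mtcars[, 1:7], key.loc = c(14, 2),
      main = "Motor Trend cars", flip.labels = FALSE)           # star plot
stars(mtcars[, 1:7], key.loc = c(14, 2), draw.segments = TRUE,
      main = "Motor Trend cars (segment diagrams)")             # segment diagram version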


Chernoff Faces (Multivariate data) faces(data, plot.faces=c(TRUE,FALSE)) library(aplpack) faces2(data, ncol=#,nrow=#) library(TeachingDemos)

EXAMPLES: library(aplpack)

a<-aplpack::faces(mtcars,plot.faces=FALSE)

win.graph(11,8);par(mar = rep(0, 4),xpd=NA)

plot(0:5,0:5,type="n")

plot(a)

library(TeachingDemos)

win.graph(11,8);par(mar = rep(0, 4),xpd=NA)

faces2(mtcars[,1:7],ncol=8,nrow=4)


(Figure: example bubble plot produced by the code below; mpg vs. disp for mtcars, with legends for Cylinders (Four, Six, Eight) and carburetors (One, Two, Three, Four, Six, Eight) and the annotation "Bubbles represent Horsepower".)

Bubble Plot (view multivariate data) bubbleplot plot(y~x1) symbols(y~x1,circles=x2)

EXAMPLE

plot(mpg ~ disp, data = mtcars, pch ="+",col=carb)

par(new=T)

plot(mpg ~ disp, data = mtcars, pch = "0",col=cyl)

with(mtcars,symbols(disp, mpg, circles = hp,add = TRUE))

legend(280,34,c("Four","Six","Eight"),fill=c("blue","violet","gray"),title="Cylinders")

legend(385,34,c("One","Two","Three","Four","Six","Eight"),

fill=palette()[as.numeric(levels(as.factor(mtcars$carb)))],title="Carburetors")

textClick(expression("Bubles represent\nHorsepower"),"black",1)

shapeClick("box",3)

#data represents 5 different variables


3-D Scatterplot (view multivariate data) library(scatterplot3d) [see also spinable 3-d scatterplot] scatterplot3d(x, y=NULL, z=NULL, color=par("col"), pch=NULL, main=NULL, sub=NULL, xlim=NULL, ylim=NULL, zlim=NULL, xlab=NULL, ylab=NULL, zlab=NULL, scale.y=1, angle=40,axis=TRUE, tick.marks=TRUE, label.tick.marks=TRUE, x.ticklabs=NULL, y.ticklabs=NULL, z.ticklabs=NULL, y.margin.add=0, grid=TRUE, box=TRUE, lab=par("lab"), lab.z=mean(lab[1:2]), type="p", highlight.3d=FALSE, mar=c(5,3,4,3)+0.1, col.axis=par("col.axis"), col.grid="grey", col.lab=par("col.lab"), cex.symbols=par("cex"), cex.axis=0.8 * par("cex.axis"), cex.lab=par("cex.lab"), font.axis=par("font.axis"),font.lab=par("font.lab"), lty.axis=par("lty"), lty.grid=par("lty"), lty.hide=NULL, lty.hplot=par("lty"), log="")

x the coordinates of points in the plot.

y the y coordinates of points in the plot, optional if x is an appropriate structure.

z the z coordinates of points in the plot, optional if x is an appropriate structure.

color colors of points in the plot, optional if x is an appropriate structure. Will be ignored if highlight.3d = TRUE.

pch plotting "character", i.e. symbol to use.

main an overall title for the plot.

sub sub-title.

xlim, ylim, zlim the x, y and z limits (min, max) of the plot. Note that setting enlarged limits may not work as exactly as expected (a known but unfixed bug).

xlab, ylab, zlab titles for the x, y and z axis.

scale.y scale of y axis related to x- and z axis.

angle angle between x and y axis (Attention: result depends on scaling).

axis a logical value indicating whether axes should be drawn on the plot.

tick.marks a logical value indicating whether tick marks should be drawn on the plot (only if axis = TRUE).

label.tick.marks a logical value indicating whether tick marks should be labeled on the plot (only if axis = TRUE and tick.marks = TRUE).

x.ticklabs, y.ticklabs, z.ticklabs vector of tick mark labels.

y.margin.add add additional space between tick mark labels and axis label of the y axis

grid a logical value indicating whether a grid should be drawn on the plot.

box a logical value indicating whether a box should be drawn around the plot.

lab a numerical vector of the form c(x, y, len). The values of x and y give the (approximate) number of tickmarks on the x and y axes.

lab.z the same as lab, but for z axis.

type character indicating the type of plot: "p" for points, "l" for lines, "h" for vertical lines to x-y-plane, etc.

highlight.3d points will be drawn in different colors related to y coordinates (only if type = "p" or type = "h", else color will be used). On some devices not all colors can be displayed. In this case try the postscript device or use highlight.3d = FALSE.

mar A numerical vector of the form c(bottom, left, top, right) which gives the lines of margin to be specified on the four sides of the plot.

col.axis, col.grid, col.lab the color to be used for axis / grid / axis labels.

cex.symbols, cex.axis, cex.lab the magnification to be used for point symbols, axis annotation, labels relative to the current.

font.axis, font.lab the font to be used for axis annotation / labels.

lty.axis, lty.grid the line type to be used for axis / grid.

lty.hide line style used to plot ‘non-visible’ edges (defaults of the lty.axis style)

lty.hplot the line type to be used for vertical segments with type = "h".

log Not yet implemented! A character string which contains "x" (if the x axis is to be logarithmic), "y", "z", "xy", "xz", "yz", "xyz".

EXAMPLE
library(scatterplot3d)
multiG(16,8,2,1)
with(mtcars,scatterplot3d(mpg,disp,hp,color=cyl,pch=19,main="HP, MPG, & DISP BY CYL"))
par(mar = rep(0, 4),xpd=NA)
legend(locator(1),legend=c("4 cyl","6 cyl", "8 cyl"),fill=c(4,6,8), title="Cylinders")
with(mtcars,scatterplot3d(mpg,disp,hp,color=cyl,pch=gear,main="HP, MPG, & DISP BY CYL & GEAR"))
par(mar = rep(0, 4),xpd=NA)
legend(locator(1),legend=c("3","4", "5"),pch=c(3,4,5), title="Gear Number")
#by changing pch and color we're viewing 5 variables simultaneously


Spinnable 3-d Scatterplot rotate Method 1: plot3d(x, y, z, xlab, ylab, zlab, type = "p", col, size, lwd, radius) Method 2: scatter3d(x, y, z, xlab=deparse(substitute(x)), ylab=deparse(substitute(y)), zlab=deparse(substitute(z)), axis.scales=TRUE, revolutions=0, bg.col=c("white", "black"), axis.col=if (bg.col == "white") c("darkmagenta", "black", "darkcyan") else c("darkmagenta", "white", "darkcyan"), surface.col=c("blue", "green", "orange", "magenta", "cyan", "red", "yellow", "gray"), surface.alpha=0.5, neg.res.col="red", pos.res.col="green", square.col=if (bg.col == "white") "black" else "gray", point.col="yellow", text.col=axis.col, grid.col=if (bg.col == "white") "black" else "gray", fogtype=c("exp2", "linear", "exp", "none"), residuals=(length(fit) == 1), surface=TRUE, fill=TRUE, grid=TRUE, grid.lines=26, df.smooth=NULL, df.additive=NULL, sphere.size=1, threshold=0.01, speed=1, fov=60, fit="linear", groups=NULL, parallel=TRUE, ellipsoid=FALSE, level=0.5, ellipsoid.alpha=0.1, id.method=c("mahal", "xz", "y", "xyz", "identify", "none"), id.n=if (id.method == "identify") Inf else 0, labels=as.character(seq(along=x)), offset = ((100/length(x))^(1/3)) * 0.02, model.summary=FALSE)

EXAMPLES library(rgl)

with(mtcars, plot3d(wt, disp, mpg, col=cyl, size=6))

library(Rcmdr)

with(mtcars, scatter3d(wt, disp, mpg, col=cyl))


Staircase plot (show an increase or a decrease over time) library(plotrix) See examples for best understanding:

EXAMPLE

sample_size<-c(500,-72,428,-94,334,-45,-89,200)

totals<-c(TRUE,FALSE,TRUE,FALSE,TRUE,FALSE,FALSE,TRUE)

labels<-c("Contact list","Uncontactable","","Declined","","Ineligible",

"Died","Final sample")

#==========================================================================

staircase.plot(sample_size,totals,labels,main="Acquisition of the sample",

total.col="gray",inc.col=2:5,bg.col="#eeeebb",direction="s")

staircase.plot(sample_size,totals,labels,main="Acquisition of the sample",

total.col="yellow",inc.col=2:5,bg.col="#eeeebb",direction="e")

#==========================================================================

sample_size<-c(200,+72,272,+94,366,+45,411)

totals2<-c(TRUE,FALSE,TRUE,FALSE,TRUE,FALSE,TRUE)

labels2<-c("Begining Level","Humor","","Water","","Candy",

"Final Level")

#==========================================================================

staircase.plot(sample_size,totals2,labels2,main="Energy Level",

total.col="gray",inc.col=2:5,bg.col="#eeeebb",direction="s")


Pyramid plot (comparing nested groups) library(plotrix) See examples for best understanding:

EXAMPLES

x11(15,8)

par(mfrow=c(1,2))

xy.pop<-c(3.2,3.5,3.6,3.6,3.5,3.5,3.9,3.7,3.9,3.5,3.2,2.8,2.2,1.8,

1.5,1.3,0.7,0.4)

xx.pop<-c(3.2,3.4,3.5,3.5,3.5,3.7,4,3.8,3.9,3.6,3.2,2.5,2,1.7,1.5,

1.3,1,0.8)

agelabels<-c("0-4","5-9","10-14","15-19","20-24","25-29","30-34",

"35-39","40-44","45-49","50-54","55-59","60-64","65-69","70-74",

"75-79","80-44","85+")

mcol<-color.gradient(c(0,0,0.5,1),c(0,0,0.5,1),c(1,1,0.5,1),18)

fcol<-color.gradient(c(1,1,0.5,1),c(0.5,0.5,0.5,1),c(0.5,0.5,0.5,1),18)

#==========================================================================

par(mar=pyramid.plot(xy.pop,xx.pop,labels=agelabels,

main="Australian population pyramid 2002",lxcol=mcol,rxcol=fcol,

gap=0.5,show.values=TRUE))

#==========================================================================

# three column matrices

avtemp<-c(seq(11,2,by=-1),rep(2:6,each=2),seq(11,2,by=-1))

malecook<-matrix(avtemp+sample(-2:2,30,TRUE),ncol=3)

femalecook<-matrix(avtemp+sample(-2:2,30,TRUE),ncol=3)

# group by age

agegrps<-c("0-10","11-20","21-30","31-40","41-50","51-60",

"61-70","71-80","81-90","91+")

#==========================================================================

oldmar<-pyramid.plot(malecook,femalecook,labels=agegrps,

unit="Bowls per month",lxcol=c("#ff0000","#eeee88","#0000ff"),

rxcol=c("#ff0000","#eeee88","#0000ff"),laxlab=c(0,10,20,30),

raxlab=c(0,10,20,30),top.labels=c("Males","Age","Females"),gap=3)

# put a box around it

box()

# give it a title

mtext("Porridge temperature by age and sex of cook",3,2,cex=1.5)

# stick in a legend

legend(par("usr")[1],11,c("Too hot","Just right","Too cold"),

fill=c("#ff0000","#eeee88","#0000ff"))


Engelmann-Hecker-Plot Compare spread of grouped data library(plotrix) ehplot(data, groups, intervals=50, offset=0.1, log=FALSE, median=TRUE, box=FALSE, boxborder="grey50", xlab="groups", ylab="values", col="black", ...) Arguments:

Created using the panes() function

data(iris)

ehplot(iris$Sepal.Length, iris$Species,

intervals=20,offset=0.1, cex=1.5, pch=20)

tab.title("ehplot 1",tab.col=1)

ehplot(iris$Sepal.Width, iris$Species,

intervals=20, offset=0.1,box=TRUE, median=FALSE)

tab.title("ehplot 2",tab.col=4)

ehplot(iris$Petal.Length, iris$Species,

pch=17,offset=0.1, col="red", log=TRUE)

tab.title("ehplot 3",tab.col=3)

ehplot(iris$Petal.Length, iris$Species,

offset=0.1, pch=as.numeric(iris$Species))

tab.title("ehplot 4",tab.col=2)

Example data(iris);library(plotrix)

ehplot(iris$Sepal.Length, iris$Species,

intervals=20, cex=1.8, pch=20)

ehplot(iris$Sepal.Width, iris$Species,

intervals=20, box=TRUE, median=FALSE)

ehplot(iris$Petal.Length, iris$Species,

pch=17, col="red", log=TRUE)

ehplot(iris$Petal.Length, iris$Species,

offset=0.06,

pch=as.numeric(iris$Species))

# Groups don't have to be presorted:

rnd <- sample(150)

plen <- iris$Petal.Length[rnd]

pwid <- abs(rnorm(150, 1.2))

spec <- iris$Species[rnd]

ehplot(plen, spec, pch=19, cex=pwid,

col=rainbow(3,

alpha=0.6)[as.numeric(spec)])


Hexbin Plot (visualize closely clustered data) library(hexbin) see also sunflowerplot; high density data

plot(hexbin(x, y, xbins = 30, shape = 1, xbnds = range(x), ybnds = range(y), xlab = NULL, ylab = NULL)) Bump chart (looking at how ranks have changed from time 1 to time 2) library(plotrix)

Arguments:

EXAMPLE

#======================================================================

# percentage of those over 25 years having completed high school

# in 10 cities in the USA in 1990 and 2000

educattn<-matrix(c(90.4,90.3,75.7,78.9,66,71.8,70.5,70.4,68.4,67.9,

67.2,76.1,68.1,74.7,68.5,72.4,64.3,71.2,73.1,77.8),ncol=2,byrow=TRUE)

rownames(educattn)<-c("Anchorage AK","Boston MA","Chicago IL",

"Houston TX","Los Angeles CA","Louisville KY","New Orleans LA",

"New York NY","Philadelphia PA","Washington DC")

colnames(educattn)<-c(1990,2000)

#......................................................................

bumpchart(educattn,main="Rank for high school completion by over 25s")

#======================================================================

# now show the raw percentages and add central ticks

#======================================================================

bumpchart(educattn,rank=FALSE,

main="Percentage high school completion by over 25s",col=rainbow(10))

# margins have been reset, so use

par(xpd=TRUE)

boxed.labels(1.5,seq(65,90,by=5),seq(65,90,by=5))

par(xpd=FALSE)

EXAMPLE

library(plyr) #contains a large data set

begend(baseball) #look at beginning and end of data set

library(hexbin)

with(baseball,plot(hexbin(r,ab)))

bumpchart(y,top.labels=colnames(y),labels=rownames(y),rank=TRUE,mar=c(2,8,5,8),pch=19,col=par("fg"),lty=1,lwd=1)


Heatmap with numbers x <- "http://datasets.flowingdata.com/ppg2008.csv" nba <- read.csv(x)

dst <- dist(nba[1:20, -1])

dst <- data.matrix(dst)

dim <- ncol(dst)

sdim <- seq_len(dim)

image(sdim, sdim, dst, axes = FALSE)

axis(1, sdim, nba[1:20,1], cex.axis = 0.5)

axis(2, sdim, nba[1:20,1], cex.axis = 0.5)

lapply(sdim, function(i){

lapply(sdim, function(j){

txt <- sprintf("%0.1f", dst[i,j])

text(i, j, txt, cex=0.5)

})

})


Scatterplot Ssymbols plot(Set$Attitude ~ Set$Grade, col="pink") To change the point plot symbols use the pch= argument. Note: Set$Attitude and Set$Grade are merely variable names attached to the data set (data frame). On Windows, R supports Unicode symbols given as negative pch values: plot(1, 1, pch = -0x2665L, cex = 10, xlab = "", ylab = "", col = "firebrick3")

points(.8, .8, pch = -0x2642L, cex = 10, col = "firebrick3")

points(1.2, 1.2, pch = -0x2640L, cex = 10, col = "firebrick3")
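A small sketch (my own) that lays out the 26 standard pch symbols so their integer codes can be read off:
plot(0:25, rep(1, 26), pch = 0:25, cex = 2, ylim = c(0.8, 1.4),
     yaxt = "n", ylab = "", xlab = "pch value", main = "Standard plotting symbols")
text(0:25, rep(1.2, 26), labels = 0:25, cex = 0.7)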

Plot group by color Argument to plot: See example below

Identify Plot Points (Plot point labels and identification) slocate sidentify Locate coordinates of a specific point locator(1) Note that locator(n) can be used to locate a list of specific points and compile a list. This could be useful for locating extreme or odd values that lie outside the overall scatter or group scatter as in the code below in the example box.

Locate the name of a specific point identify(x-vector, y-vector, vector of labels, ...) Label all points with their names text(x-vector, y-vector, vector of labels, ...) See example below for details

EXAMPLE

#Plot data by cylinder groups w/ legend

with(mtcars,plot(drat,hp,col=c("blue","green","red")[as.numeric(as.factor(cyl))]))

legend(locator(1),c("4 cyl","6 cyl","8 cyl"),fill=c("blue","green","red"))

locator(1) #locate a specific point on the plot (x and y coordinate)

#Example of using locate to create a data frame of coordinates for extreme values

outlier<-data.frame(locator(4)[1:2])

outlier

#locate certain points by name (adj: 0 = left justified, 1 = right justified, .5 = centered)

with(mtcars,identify(drat,hp,labels=c(rownames(mtcars)),adj=1))

#label the points on the plot

with(mtcars,text(drat,hp,labels=c(rownames(mtcars)),cex=.5,adj=c(0,-1)))

EXAMPLE

#Plot data by cylinder groups w/ legend

with(mtcars,plot(drat,hp,col=c("blue","green","red")[as.numeric(as.factor(cyl))]))

legend(locator(1),c("4 cyl","6 cyl","8 cyl"),fill=c("blue","green","red"))

#I had to say as.factor for cylinder first because it was actually a numeric variable


Interactive coloring of points (click and recolor)

x <- 1:5

plot(x, x, col=ifelse(x==3, "red", "black"), pch=19)

plot(x, x, col=ifelse(x==3, "red", "black"),

pch=ifelse(x==3, 19, 2), cex=ifelse(x==3, 2, 1))

with(mtcars, plot(hp, disp, pch=19,

col=c(ifelse(mpg>25, 'red', 'green'))))

#===========================================

n <- 15

x <- rnorm(n)

y <- rnorm(n)

# Plot the data

plot(x,y, pch = 19, cex = 2)

# This lets you click on the points you want to change

# the color of. Right click and select "stop" when

# you have clicked all the points you want

pnt <- identify(x, y, plot = F)

# This colors those points red

points(x[pnt], y[pnt], col = "red", pch = 19, cex = 2)

points(x[pnt], y[pnt], col = "green", pch = 17, cex = 1)


Flip the x and y axis library(lattice) # First make some example data

df <- data.frame(name=rep(c("a", "b", "c"), each=5), value=rnorm(15))

# Then try plotting it in both 'orientations'

# ... as a dotplot

xyplot(value~name, data=df)

xyplot(name~value, data=df)

# ... or perhaps as a 'box-and-whisker' plot

bwplot(value~name, data=df)

bwplot(name~value, data=df)


Blending Plots in R sblending transparency

set.seed(42)

p1 <- hist(rnorm(500,4)) # centered at 4

p2 <- hist(rnorm(500,6)) # centered at 6

plot( p1, col=rgb(0,0,1,1/4), xlim=c(0,10)) # first histogram

plot( p2, col=rgb(1,0,0,1/4), xlim=c(0,10), add=T) # second hist

#or

a=rnorm(1000, 3, 1)

b=rnorm(1000, 6, 1)

hist(a, xlim=c(0,10), col="red")

hist(b, add=T, col=rgb(0, 1, 0, 0.5))

library(ggplot2)

path = "http://www-stat.stanford.edu/~tibs/ElemStatLearn/datasets/SAheart.data"

saheart = read.table(path, sep=",",head=T,row.names=1)

fmla = "chd ~ sbp + tobacco + ldl + adiposity + famhist + typea + obesity"

model = glm(fmla, data=saheart, family=binomial(link="logit"),

na.action=na.exclude)

dframe = data.frame(chd=as.factor(saheart$chd),

prediction=predict(model, type="response"))

ggplot(dframe, aes(x=prediction, colour=chd)) + geom_density()

ggplot(dframe, aes(x=prediction, fill=chd)) +

geom_histogram(position="identity", binwidth=0.05, alpha=0.5)



Add text to the graph margins See Example Text-Rect-LineSeg (search this tag) mtext(text, side = 3, line = 0, outer = FALSE, at = NA, adj = NA, padj = NA, cex = NA, col = NA, font = NA, ...) Plot Curved Text (arc text) library(plotrix) arctext(x,center=c(0,0),radius=1,start=NA,middle=pi/2,stretch=1,cex=1,...) Arguments: EXAMPLES

Arguments
text: a character or expression vector specifying the text to be written. Other objects are coerced by as.graphicsAnnot.
side: on which side of the plot (1=bottom, 2=left, 3=top, 4=right).
line: on which MARgin line, starting at 0 counting outwards.
outer: use outer margins if available.
at: give location of each string in user coordinates. If the component of at corresponding to a particular text item is not a finite value (the default), the location will be determined by adj.
adj: adjustment for each string in reading direction. For strings parallel to the axes, adj = 0 means left or bottom alignment, and adj = 1 means right or top alignment. If adj is not a finite value (the default), the value of par("las") determines the adjustment. For strings plotted parallel to the axis the default is to centre the string.
padj: adjustment for each string perpendicular to the reading direction (which is controlled by adj). For strings parallel to the axes, padj = 0 means right or top alignment, and padj = 1 means left or bottom alignment. If padj is not a finite value (the default), the value of par("las") determines the adjustment. For strings plotted perpendicular to the axis the default is to centre the string.
cex: character expansion factor. NULL and NA are equivalent to 1.0. This is an absolute measure, not scaled by par("cex") or by setting par("mfrow") or par("mfcol"). Can be a vector.
col: color to use. Can be a vector. NA values (the default) mean use par("col").
font: font for text. Can be a vector. NA values (the default) mean use par("font").
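A brief sketch of mtext() itself (my own example; the arctext examples from the notes follow below):
plot(mtcars$wt, mtcars$mpg, main = "mtext() sketch")
mtext("note in the right-hand margin", side = 4, line = 0.5, col = "blue")
mtext("flush left in the bottom margin", side = 1, line = 3, adj = 0, cex = 0.8)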

plot(0, xlim=c(1,5), ylim=c(1,5), main="Test of arctext", xlab="", ylab="", type="n")
arctext("bendy like spaghetti", center=c(3,3), col="blue")
arctext("bendy like spaghetti", center=c(3,3), radius=1.5, start=pi, cex=2)
arctext("bendy like spaghetti", center=c(3,3), radius=0.5, start=pi/2, stretch=1.2)

See also: Eliminate margins for clickText()


Add a Table to a Graph library(plotrix) [see also scripts for a click function] add table; table plot

addtable2plot(x, y=NULL, table, lwd=par("lwd"), bty="n", bg=par("bg"), cex=1, xjust=0, yjust=1, box.col=par("fg"), text.col=par("fg"), display.colnames=TRUE, display.rownames=FALSE, hlines=FALSE, vlines=FALSE, title=NULL) Arguments: Plot Lines (vertical, horizontal or sloped) [see line types] abline(various arguments)
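A hedged sketch of addtable2plot() with an invented data frame (the coordinates, object names and styling are arbitrary):
library(plotrix)
scores <- data.frame(Before = c(10, 7, 5), After = c(3, 7, 11),
                     row.names = c("A", "B", "C"))
plot(1:10, type = "n", main = "addtable2plot() sketch")
addtable2plot(2, 8, scores, bty = "o", display.rownames = TRUE,
              hlines = TRUE, title = "Hypothetical scores")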

Plot Lowess Line Example: with(mtcars,plot(mpg,disp,pch=cyl,col=cyl+2))

with(mtcars,lines(lowess(cbind(mpg,disp)),lwd=2,col="blue"))

Plot a circle library(plotrix) [see also scripts for a shapeClick function] draw.circle(x,y,radius,nv=100,border=NULL,col=NA,lty=1,lwd=1) Arguments:

EXAMPLE

with(mtcars,plot(drat,hp))

abline(h=204) #plot horizontal

abline(v=4) #plot vertical

abline(a=210,b=20) #plot sloped (a = y-intercept, b = slope)


Plot a circle inside a square library(plotrix); library(grid) circle square

require(plotrix)

require(grid)

plot(c(-1, 1), c(-1,1), type = "n", asp=1)

rect( -.5, -.5, .5, .5)

draw.circle( 0, 0, .5 )

#note asp must be specified


Add Math symbols and Expressions to Plot expression() #wrapped in title, text, mtext etc. List of Math Symbols to Add

Syntax                  Meaning
x + y                   x plus y
x - y                   x minus y
x*y                     juxtapose x and y
x/y                     x forwardslash y
x %+-% y                x plus or minus y
x %/% y                 x divided by y
x %*% y                 x times y
x %.% y                 x cdot y
x[i]                    x subscript i
x^2                     x superscript 2
paste(x, y, z)          juxtapose x, y, and z
sqrt(x)                 square root of x
sqrt(x, y)              yth root of x
x == y                  x equals y
x != y                  x is not equal to y
x < y                   x is less than y
x <= y                  x is less than or equal to y
x > y                   x is greater than y
x >= y                  x is greater than or equal to y
x %~~% y                x is approximately equal to y
x %=~% y                x and y are congruent
x %==% y                x is defined as y
x %prop% y              x is proportional to y
plain(x)                draw x in normal font
bold(x)                 draw x in bold font
italic(x)               draw x in italic font
bolditalic(x)           draw x in bolditalic font
symbol(x)               draw x in symbol font
list(x, y, z)           comma-separated list
...                     ellipsis (height varies)
cdots                   ellipsis (vertically centred)
ldots                   ellipsis (at baseline)
x %subset% y            x is a proper subset of y
x %subseteq% y          x is a subset of y
x %notsubset% y         x is not a subset of y
x %supset% y            x is a proper superset of y
x %supseteq% y          x is a superset of y
x %in% y                x is an element of y
x %notin% y             x is not an element of y
hat(x)                  x with a circumflex
tilde(x)                x with a tilde
dot(x)                  x with a dot
ring(x)                 x with a ring
bar(xy)                 xy with bar
widehat(xy)             xy with a wide circumflex
widetilde(xy)           xy with a wide tilde
x %<->% y               x double-arrow y
x %->% y                x right-arrow y
x %<-% y                x left-arrow y
x %up% y                x up-arrow y
x %down% y              x down-arrow y
x %<=>% y               x is equivalent to y
x %=>% y                x implies y
x %<=% y                y implies x
x %dblup% y             x double-up-arrow y
x %dbldown% y           x double-down-arrow y
alpha - omega           Greek symbols
Alpha - Omega           uppercase Greek symbols

EXAMPLE

frame()

title(expression( "graph of the function f"(x) == sqrt(1+x^2)))

text(locator(1),expression(sum(x)/sqrt(n*S^2)))

text(locator(1),expression(hat(beta)==-.567))

text(locator(1),expression(hat(Omega)==infinity*frac(x, y)))

mtext(expression(Area == pi*r^2),side=2,line=-12)

text(locator(1), expression(bar(x) == sum(frac(x[i], n), i==1, n)))

mtext(expression(2.3 %+-% 4.5*pi),side=1,line=-5,adj=.7)

mtext(expression(bar(xy)!=sum(x[i], i==1, n) ),side=1,line=-5,adj=0)

textClick(expression(sum(sum((X[ij]-bar(X))^2))))

textClick(expression(sum(x[i], i=1, n)),"green",3)

Summation


Syntax Meaning

theta1, phi1, sigma1, omega1 cursive Greek symbols

Upsilon1 capital upsilon with hook

aleph first letter of Hebrew alphabet

infinity infinity symbol

partialdiff partial differential symbol

nabla nabla, gradient symbol

32*degree 32 degrees

60*minute 60 minutes of angle

30*second 30 seconds of angle

displaystyle(x) draw x in normal size (extra spacing)

textstyle(x) draw x in normal size

scriptstyle(x) draw x in small size

scriptscriptstyle(x) draw x in very small size

underline(x) draw x underlined

x ~~ y put extra space between x and y

x + phantom(0) + y leave gap for "0", but don't draw it

x + over(1, phantom(0)) leave vertical gap for "0" (don't draw)

frac(x, y) x over y

over(x, y) x over y

atop(x, y) x over y (no horizontal bar)

sum(x[i], i==1, n) sum x[i] for i equals 1 to n

prod(plain(P)(X==x), x) product of P(X=x) for all values of x

integral(f(x)*dx, a, b) definite integral of f(x) wrt x

union(A[i], i==1, n) union of A[i] for i equals 1 to n

intersect(A[i], i==1, n) intersection of A[i]

lim(f(x), x %->% 0) limit of f(x) as x tends to 0

min(g(x), x > 0) minimum of g(x) for x greater than 0

inf(S) infimum of S

sup(S) supremum of S

x^y + z normal operator precedence

x^(y + z) visible grouping of operands

x^{y + z} invisible grouping of operands

group("(",list(a, b),"]") specify left and right delimiters

bgroup("(",atop(x,y),")") use scalable delimiters

group(lceil, x, rceil) special delimiters


Expressions in Titles (method 1)
a<-5

b<-1

plot(1:10, main=bquote(p==.(a) *"," ~q==.(b)))

Expressions in Titles (method 2)

a<-5

b<-1

plot(1:10, main = substitute(paste(p == a, ", ", q == b), list(a = a, b = b)))


Control Margins & Eliminate Margins (use locator with text to put text outside the plot)
par(mar = c(5, 4, 4, 2) + 0.1) # standard (default) margins
par(mar = rep(0, 4)) # no margins

Function for adding text or an expression anywhere with locator(): clickText
source("C:/Users/Rinker/Desktop/PhD Program/CEP 523-Stat Meth Ed Inference/R Stuff/Scripts/Graphics/Click text")
clickText(expression, col, font size) # the first argument must be either quoted text or an expression()
Text Outside Margins: clickText() utilizes par(xpd=NA)

#Margins Control Examples

#========================================================

#Standard Margins

x11()

frame()

par(oma = c(0,0,0,0))

grid()

#========================================================

#Attempt to add text to outer margins using locator (Wrong way)

x11()

frame()

par(oma = c(0,0,0,0))

with(mtcars,plot(mpg~wt))

text(locator(1),expression(beta==3)) #The mistake: tried to add text w/o changing margins first

par(mar = rep(0, 4))

text(locator(1),expression(beta==3))

#========================================================

#Add text using locator to outer margins (Correct way)

x11()

frame()

par(oma = c(0,0,0,0))

with(mtcars,plot(mpg~wt)) #Correct way: 1)plot; 2)change margins; 3)add text

par(mar = rep(0, 4))

text(locator(1),expression(beta==3))

#========================================================

#Plot with no margins

x11()

frame()

par(mar = rep(0, 4))

with(mtcars,plot(mpg~wt))

text(locator(1),expression(beta==3))

see also clickText() below


Normal curve with upper/lower shaded (uses the polygon function) Example xv<-seq(-3,3,.01)

yv<-dnorm(xv)

windows(h=5,w=11)

par(mfrow=c(1,2))

plot(xv,yv,type="l",main="2 Standard Deviation")

polygon(c(xv[xv<=-2],-2),c(yv[xv<=-2],yv[xv==-3]),col="blue",border="green")

polygon(c(xv[xv>=2],2),c(yv[xv>=2],yv[xv==3]),col="blue",border="green")

plot(xv,yv,type="l",main="1 Standard Deviation")

polygon(c(xv[xv<=-1],-1),c(yv[xv<=-1],yv[xv==-3]),col="red",border="orange")

polygon(c(xv[xv>=1],1),c(yv[xv>=1],yv[xv==3]),col="red",border="orange")

Use the mouse to add line segments, rectangles, arrows, and polygons
source("C:/Users/Rinker/Desktop/PhD Program/CEP 523-Stat Meth Ed Inference/R Stuff/Scripts/Graphics/Click Shapes.txt")
shapeClick(shape="arrow", corners=NULL, col=NULL, border = NULL, lty = par("lty"), lwd = par("lwd"), code=2)
Example code and use:
source("C:/Users/Rinker/Desktop/PhD Program/CEP 523-Stat Meth Ed Inference/R Stuff/Scripts/Graphics/Click Shapes.txt")

with(mtcars,plot(mpg,disp))

shapeClick("seg",col="red")

shapeClick("box",col="yellow",border="red",lty=1)

shapeClick("arrow",col="orange",lwd=4,lty=55)

shapeClick("arrow",code=3,col="blue",lwd=2)

shapeClick("poly",3,col="yellow",border="red",lwd=2)

shapeClick("poly",5,border="green",lwd=2)

Add Rug to Graph (shows the actual data points)
rug(variable) # like adding a line, this is done after the graph is plotted

x11()

with(mtcars,plot(mpg~disp))

rug(mtcars$disp);rug(mtcars$mpg,side=2)


Add line segments to the graph (see Example Text-Rect-LineSeg below)
segments(x0, y0, x1, y1, col = par("fg"), lty = par("lty"), lwd = par("lwd"), ...) # x0, y0, x1, y1 are coordinates of the start and end points
Add rectangles to the graph (see Example Text-Rect-LineSeg below)
rect(xleft, ybottom, xright, ytop, density = NULL, angle = 45, col = NA, border = NULL, lty = par("lty"), lwd = par("lwd"))
Add arrows to the graph
arrows(x0, y0, x1, y1)

#=================================================================================================

# VARIOUS TEXT ARGUMENTS

#=================================================================================================

windows(h=6.5,w=10);par(mfrow=c(2,3))

plot(1:10, (-4:5)^2, main="Parabola Points", xlab="xlab")

text(3,23, "pos 1", pos=1);text(3,23, "pos 2", pos=2)

text(3,23, "pos 3", pos=3);text(3,23, "pos 4", pos=4)

plot(1:10, (-4:5)^2, main="Parabola Points", xlab="xlab")

text(locator(1), "yippee", pos=1)

plot(1:10, (-4:5)^2, main="Parabola Points", xlab="xlab")

mtext("line -2", line = -2);mtext("line 2", line = 2)

mtext("line 3", line = 3);mtext("line -6", line = -6)

plot(1:10, (-4:5)^2, main="Parabola Points", xlab="xlab")

mtext("adj 0", line = -2, adj =0);mtext("adj .5", line = -2, adj =.5)

mtext("adj 1", line = -2, adj =1)

plot(1:10, (-4:5)^2, main="Parabola Points", xlab="xlab")

mtext("side 1",side=1);mtext("side 2",side=2)

mtext("side 3",side=3);mtext("side 4",side=4)

plot(1:10, (-4:5)^2, main="Parabola Points", xlab="xlab")

mtext("side4;adj1",side=4,adj=1,col="blue")

mtext("side2;adj0",side=2,adj=.5,col="red")

mtext("side4;adj.5",side=4,adj=0,col="green")

mtext("side4;adj.5;padj1",side=4,adj=.5,padj=1,col="purple")

mtext("side4;adj.5;padj-2",side=4,adj=.5,padj=-2,col="orange")

#=================================================================================================

# VARIOUS RECTANGLE USES/ARGUMENTS

#=================================================================================================

windows(h=6,w=6);par(mfrow=c(1,1))

plot(1:10, (-4:5)^2, main="Parabola Points", xlab="xlab")

rect(6, 10, 8, 12, angle = 45,col = NULL, border = "red", lty = 1, lwd = 2)

text(6.16,11, "HELLO!", pos=4)

rect(4, 15, 8, 20, angle = 45,col = NULL, border = "blue", lty = 1, lwd = 4)

text(locator(1), "HELLO!", pos=4,cex=2)

rect(8, 0, 10, 3, angle = 45,col = "black", border = "blue", lty = 1, lwd = 4)

text(locator(1), "HELLO!", pos=4,cex=.8,col="white")

#=================================================================================================

# VARIOUS LINE SEGMENT USES/ARGUMENTS

#=================================================================================================

segments(4, 0, 10, 10,col = "orange", lty = 1, lwd = 1)

segments(2, 25, 8, 25,col = "blue", lty = 2, lwd = 1)

segments(2, 0, 2, 25,col = "yellow", lty = 1, lwd = 3)

#=================================================================================================

# USING TEXT TO CREATE A LABEL (In action)

#=================================================================================================

windows(h=6,w=6);par(mfrow=c(1,1))

x <- mtcars[order(mtcars$mpg),] # sort by mpg

x$cyl <- factor(x$cyl) # it must be a factor

x$color[x$cyl==4] <- "red"

x$color[x$cyl==6] <- "blue"

x$color[x$cyl==8] <- "darkgreen"

dotchart(x$mpg,labels=row.names(x),cex=.7,groups= x$cyl,

main="Gas Milage for Car Models\ngrouped by cylinder",

xlab="Miles Per Gallon", gcolor="black", color=x$color)

mtext("Cars Grouped by Cylinder", side = 2, line = 2, cex = .7)

Example Text-Rect-LineSeg


Add a Legend to a Graph
legend(x, y = NULL, legend, fill = NULL, col = par("col"), border="black", lty, lwd, pch, angle = 45, density = NULL, bty = "o", bg = par("bg"), box.lwd = par("lwd"), box.lty = par("lty"), box.col = par("fg"), pt.bg = NA, cex = 1, pt.cex = cex, pt.lwd = lwd, xjust = 0, yjust = 1, x.intersp = 1, y.intersp = 1, adj = c(0, 0.5), text.width = NULL, text.col = par("col"), merge = do.lines && has.pch, trace = FALSE, plot = TRUE, ncol = 1, horiz = FALSE, title = NULL, inset = 0, xpd, title.col = text.col, title.adj = 0.5, seg.len = 2)

Description: This function can be used to add legends to plots. Note that a call to the function 'locator(1)' can be used in place of the 'x' and 'y' arguments.

Arguments:
x, y: the x and y co-ordinates used to position the legend. They can be specified by keyword or in any way accepted by 'xy.coords': see 'Details'.
fill: if specified, this argument will cause boxes filled with the specified colors (or shaded in the specified colors) to appear beside the legend text.
col: the color of points or lines appearing in the legend.
border: the border color for the boxes (used only if 'fill' is specified).
lty, lwd: the line types and widths for lines appearing in the legend. One of these two _must_ be specified for line drawing.
pch: the plotting symbols appearing in the legend, either as a vector of 1-character strings, or one (multi-character) string. _Must_ be specified for symbol drawing.
angle: angle of shading lines.
density: the density of shading lines, if numeric and positive. If 'NULL' or negative or 'NA', color filling is assumed.
bty: the type of box to be drawn around the legend. The allowed values are '"o"' (the default) and '"n"'.
bg: the background color for the legend box. (Note that this is only used if 'bty != "n"'.)
box.lty, box.lwd, box.col: the line type, width and color for the legend box (if 'bty = "o"').
pt.bg: the background color for the 'points', corresponding to its argument 'bg'.
cex: character expansion factor *relative* to current 'par("cex")'. Used for text, and provides the default for 'pt.cex' and 'title.cex'.
pt.cex: expansion factor(s) for the points.
pt.lwd: line width for the points, defaults to the one for lines, or if that is not set, to 'par("lwd")'.
xjust: how the legend is to be justified relative to the legend x location. A value of 0 means left justified, 0.5 means centered and 1 means right justified.
yjust: the same as 'xjust' for the legend y location.
x.intersp: character interspacing factor for horizontal (x) spacing.
y.intersp: the same for vertical (y) line distances.
adj: numeric of length 1 or 2; the string adjustment for legend text. Useful for y-adjustment when 'labels' are plotmath expressions.
text.width: the width of the legend text in x ('"user"') coordinates. (Should be positive even for a reversed x axis.) Defaults to the proper value computed by 'strwidth(legend)'.
text.col: the color used for the legend text.
merge: logical; if 'TRUE', merge points and lines but not filled boxes. Defaults to 'TRUE' if there are points and lines.
trace: logical; if 'TRUE', shows how 'legend' does all its magical computations.
plot: logical. If 'FALSE', nothing is plotted but the sizes are returned.
ncol: the number of columns in which to set the legend items (default is 1, a vertical legend).
horiz: logical; if 'TRUE', set the legend horizontally rather than vertically (specifying 'horiz' overrides the 'ncol' specification).
title: a character string or length-one expression giving a title to be placed at the top of the legend. Other objects will be coerced by 'as.graphicsAnnot'.
inset: inset distance(s) from the margins as a fraction of the plot region when legend is placed by keyword.
xpd: if supplied, a value of the graphical parameter 'xpd' to be used while the legend is being drawn.
title.col: color for 'title'.
title.adj: horizontal adjustment for 'title': see the help for 'par("adj")'.
seg.len: the length of lines drawn to illustrate 'lty' and/or 'lwd' (in units of character widths).

EXAMPLE

legend(locator(1),c("Grazed","Ungrazed"),fill=c("blue","darkgreen"))


Draw a cylinder
library(plotrix)
cylindrect(xleft,ybottom,xright,ytop,col,border=NA,gradient="x",nslices=50)
Arguments: see example(cylindrect). Included in the shapeClick function in scripts.
Insert a break in a scale (broken scale)
library(plotrix)
axis.break(axis=1,breakpos=NULL,pos=NA,bgcol="white",breakcol="black",style="slash",brw=0.02)

EXAMPLE

#===========================================

x11(12,8)

par(mfrow=c(1,2))

#===========================================

tN <- table(rpois(100, lambda = 5)) # example data; tN was not defined in the notes, so a Poisson sample is assumed here
barplot(tN, col=heat.colors(12), log = "y")

axis.break(axis=2,breakpos=4,style="zigzag")

axis.break(axis=2,breakpos=9,style="zigzag")

#===========================================

plot(mpg~cyl,mtcars)

axis.break(breakpos=4.5,axis=1)


Plot with a zoomed-in plot side by side
library(plotrix)
zoomInPlot(x,y=NULL,xlim=NULL,ylim=NULL,rxlim=xlim, rylim=ylim,xend=NA,zoomtitle=NULL,titlepos=NA,...)
Scatterplot w/ histogram, correlation, density, ellipse
library(psych)
scatter.hist(x, y = NULL, smooth = TRUE, ab = FALSE, correl = TRUE, density = TRUE, ellipse = TRUE, digits = 2, cex.cor = 1, title = "Scatter plot + histograms", xlab = NULL, ylab = NULL)
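A minimal sketch (not from the original notes) of both functions on built-in data, assuming the signatures listed above:

library(plotrix)
# left panel: full scatterplot; right panel: the rxlim/rylim region zoomed in
with(mtcars, zoomInPlot(wt, mpg, rxlim = c(3, 4), rylim = c(15, 25),
                        zoomtitle = "Zoomed region"))

library(psych)
# scatterplot with marginal histograms, density curves, ellipse, and the correlation
scatter.hist(iris$Sepal.Length, iris$Sepal.Width,
             xlab = "Sepal.Length", ylab = "Sepal.Width")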


Design Plots (compare mean differences, sd, var, or medians) [effects] plot.design(y~x1*x2*xn,fun="mean") fun arguments: mean, median, sd, var

BWplot (compares means and spread in a multi box plot format; library(lattice)) bwplot(formula,data) The formula is in the form y ~ x | g1 * g2 * ... (or equivalently, y ~ x | g1 + g2 + ...), indicating

that plots of y (on the y-axis) versus x (on the x-axis) should be produced conditional on the variables g1, g2,

.... Here x and y are the primary variables, and g1, g2, ... are the conditioning variables. The

conditioning variables may be omitted to give a formula of the form y ~ x, in which case the plot will consist

of a single panel with the full dataset. The formula can also involve expressions, e.g., sqrt(), log(), etc.

EXAMPLE

dat<-read.table("HW19.csv", header=TRUE, sep=",",na.strings="999")

dat$Attitude<-as.factor(dat$Attitude)

mod<-lm(Science.Comprehension~Gender*Attitude*Grade,data=dat)

anova(mod)

windows(h=6,w=10)

par(mfrow=c(1,2))

with(dat,plot.design(Science.Comprehension~Gender*Attitude*Grade,

main="Mean Differences"))

with(dat,plot.design(Science.Comprehension~Gender*Attitude*Grade,

fun="sd",main="SD Comparisons"))

EXAMPLE

dat<-read.table("HW19.csv", header=TRUE,

sep=",",na.strings="999")

dat$Attitude<-as.factor(dat$Attitude)

mod<-lm(Science.Comprehension~Gender*Attitude*Grade,data=dat)

anova(mod)

library(lattice)

trellis.par.set(col.whitebg())

bwplot(Science.Comprehension~Gender|Attitude*Grade,dat)


Box plot with confidence intervals library(psych) boxplot(attitude,notch=F,main="Boxplot with error bars") error.bars(attitude,add=TRUE) boxplot(attitude,notch=T,main="Notched boxplot with error bars") error.bars(attitude,add=TRUE)

Spaghetti Plot for Repeated Measures Data library(lattice) xyplot(y ~ x, groups =, type = "b", data=)

Coplots coplot(formula, data)

source("C:/Users/Rinker/Desktop/PhD Program/CEP 523-Stat Meth Ed Inference/R

Stuff/Scripts/Data Sets.txt")

library(reshape)

rep.mes2<-rep.mes

Sex<-gl(2, 25, length=50,labels = c("Male", "Female"))

rep.mes2<-data.frame(rep.mes2[1:2],Sex,rep.mes2[3:5])

long.rep.mes<-melt(rep.mes2,id=1:3)[order(melt(rep.mes)$Sub),]

rownames(long.rep.mes)<-1:150

rep.mes2;long.rep.mes

library(lattice)

xyplot(value[1:18] ~ variable, groups = Sub, type = "b",data=long.rep.mes)

xyplot(value ~ variable, groups = Group, type = "b",data=long.rep.mes)

xyplot(value ~ variable, groups = Sex, type = "b",data=long.rep.mes)

EXAMPLE

coplot(Sepal.Width~ Petal.Length|Petal.Width,data=iris,panel=panel.smooth)


Change Scale Text Direction: las = 0, 1, 2, or 3 (0 = parallel to the axis, 1 = horizontal, 2 = perpendicular to the axis, 3 = vertical). NOTE: This is an argument to a plot.

EXAMPLE

par(mfrow=c(2,2))

with(iris,plot(Sepal.Length,Sepal.Width,pch=as.numeric(Species),col=as.numeric(Species)))

with(iris,plot(Sepal.Length,Sepal.Width,pch=as.numeric(Species),las=1,col=as.numeric(Species)))

with(iris,plot(Sepal.Length,Sepal.Width,pch=as.numeric(Species),las=2,col=as.numeric(Species)))

with(iris,plot(Sepal.Length,Sepal.Width,pch=as.numeric(Species),las=3,col=as.numeric(Species)))


Overplotting

# Generate some data

library(MASS)

set.seed(101)

n <- 50000

X <- mvrnorm(n, mu=c(.5,2.5), Sigma=matrix(c(1,.6,.6,1), ncol=2))

# A color palette from blue to yellow to red

library(RColorBrewer)

k <- 11

my.cols <- rev(brewer.pal(k, "RdYlBu"))

## compute 2D kernel density, see MASS book, pp. 130-131

z <- kde2d(X[,1], X[,2], n=50)

# Make the base plot

plot(X, xlab="X label", ylab="Y label", pch=19, cex=.4)

# Draw the colored contour lines

contour(z, drawlabels=FALSE, nlevels=k, col=my.cols, add=TRUE, lwd=2)

# Make points smaller - use a single pixel as the plotting character

plot(X, pch=".")

# Hexbinning

library(hexbin)

plot(hexbin(X[,1], X[,2]))

# Make points semi-transparent

library(ggplot2)

qplot(X[,1], X[,2], alpha=I(.1))

# The smoothScatter function (graphics package)

smoothScatter(X)


List of Graph Functions in Base


List of Graph Functions in Lattice


Graphics Arguments (Parameters)




Descriptive Statistics

Find standard deviation, variance, mean, median, range, & standard error
From the stats package: sd() var() mean() median() max() min() summary()
From library(plotrix): std.error(x, na.rm) # example: std.error(mtcars$mpg)
Descriptives
library(psych)
describe(x, na.rm = TRUE, interp=FALSE, skew = TRUE, ranges = TRUE, trim=.1)
Descriptives by Group (Note: pairwise deletion; this will include as much data as possible)
library(psych)
describe.by(dataset, variable1) # descriptives grouped by variable 1
describe.by(dataset, variable2) # descriptives grouped by variable 2
describe.by(dataset, list(variable1,variable2)) # all the interactions

library(psych) #example: g4<-read.table("g4.csv", header=TRUE, sep=",",na.strings="999")

describe.by(g4,g4$gender) #Does Descripts on Variable 1

describe.by(g4,g4$race) #Does Descripts on Variable 2

describe.by(g4,list(g4$gender,g4$race)) #Does All the Interactions

NOTE: I created functions to automate this for the .Regression Bundle script. desc2v() for 2 variables; desc3v() for 3 variables use rfun() to view the list of functions and arguments in the regression bundle use: data.frame(cv[[A]][[B]])[c(desired variable rows),c(2,3,4)] to extract certain groups,columns & rows

[[A]] "DESCRIPTIVES FOR VARIABLE 1"=[[1]] “DESCRIPTIVES FOR VARIABLE 2"=[[2]] "DESCRIPTIVES FOR VARIABLE 3"=[[3]] "DESCRIPTIVES FOR INTERACTION OF VARIABLE 1 & 2"=[[4]] "DESCRIPTIVES FOR INTERACTION OF VARIABLE 2 & 3"=[[5]] "DESCRIPTIVES FOR INTERACTION OF VARIABLE 1 & 3"=[[6]]

"DESCRIPTIVES FOR INTERACTION OF VARIABLE 1,2,&3[[7]]

[[B]] Level of group or interaction: ie. a [[A=1]] with 2 groups=[[B n of 2]] a [[A=1]] with 3 groups=[[B n of 3]] a [[A=4]] with 2 and 3 groups=[[B n of 8]] a [[A=7]] with 2, 2 and 3 groups=[[B n of 19]]

2,3,4 here gives n, mean & sd


Descriptives by Group (this relies on specific variables of a data set; more manageable info)

library(doBy)

summaryBy()

examples:

Descriptive by Group (another approach)

by(data set, factor, summary)

#EXAMPLE of summaryBy()

g4<-read.table("g4.csv", header=TRUE, sep=",",na.strings="999")

library(doBy)

summaryBy(mathscore + effort+ initiative+valueing ~ race, data = g4,

FUN = function(x) { c(n = length(x),mean = mean(x), sd = sd(x)) } )

summaryBy(mathscore + effort+ initiative+valueing ~ gender, data = g4,

FUN = function(x) { c(n = length(x),mean = mean(x), sd = sd(x)) } )

summaryBy(mathscore + effort+ initiative+valueing ~ gender+race, data = g4,

FUN = function(x) { c(n = length(x),mean = mean(x), sd = sd(x)) } )

EXAMPLE WITH WRITE TO EXCEL

e30<-read.table("e30.csv", header=TRUE, sep=",",na.strings="NA")

attach(e30)

DFD<-e30[,5:11]

percent.disabled<-as.numeric(N.stud.disable)/as.numeric(class.enroll)

DFSD<-data.frame(DFD,percent.disabled)

DFSD<-na.omit(DFSD)

#_________________________________________________________________________

#DESCRIPTIVES ON AIDES

#

ZZ<-summaryBy(cl.behav.fall+cl.behav.spr+perc.min+percent.disabled~aide,

data = DFSD,FUN = function(x) { c(n = length(x),mean = mean(x), sd = sd(x)) } )

#_________________________________________________________________________

#DESCRIPTIVES ON CLASS TYPE

#

YY<-summaryBy(cl.behav.fall+cl.behav.spr+perc.min+percent.disabled~cl.type,data = DFSD,FUN =

function(x) { c(n = length(x),mean = mean(x), sd = sd(x)) } )

#_________________________________________________________________________

#DESCRIPTIVES ON CLASS TYPE & AIDE

#

XX<-summaryBy(cl.behav.fall+cl.behav.spr+perc.min+percent.disabled~aide+cl.type,

data = DFSD,FUN = function(x) { c(n = length(x),mean = mean(x), sd = sd(x)) } )

#_________________________________________________________________________

#TRANSFORMED DESCRIPTIVES ON CLASS TYPE & AIDE

#

WW<-t(summaryBy(cl.behav.fall+cl.behav.spr+perc.min+percent.disabled~aide+cl.type,

data = DFSD,FUN = function(x) { c(n = length(x),mean = mean(x), sd = sd(x)) } ))

DESCRIBE2<-list(ZZ,YY,XX,WW)

DESCRIBE2

write.table(ZZ, file = "DESCRIBE2.csv", sep = ",", col.names = NA,qmethod = "double")

write.table(YY, file = "DESCRIBE3.csv", sep = ",", col.names = NA,qmethod = "double")

write.table(XX, file = "DESCRIBE4.csv", sep = ",", col.names = NA,qmethod = "double")

write.table(WW, file = "DESCRIBE5.csv", sep = ",", col.names = NA,qmethod = "double")

Example mtcars2<-mtcars

library(doBy)

mtcars2$cyl<-with(mtcars,recodeVar(cyl, src=c(4,6,8),

tgt=c("four","six","eight"), default=NULL,

keep.na=TRUE))

by(mtcars2, mtcars2$cyl, summary)


Using ftable and tapply to generate descriptives

Example

Descriptives by Group (favored method)

It is easiest to look at how this piece of code works through an example:

Descriptives by variable, method 1: library(pastecs)

stat.desc(data.frame)

Descriptives by variable, method 2: library(fBasics)

basicStats(dataframe)
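A quick sketch (not in the original notes) of both calls on a built-in data set:

library(pastecs)
round(stat.desc(mtcars[, c("mpg", "hp", "wt")]), 2)  # n, mean, median, SE, CI, var, sd, etc. per column

library(fBasics)
basicStats(mtcars$mpg)  # n, min/max, quartiles, mean, variance, skewness, kurtosis, etc.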

dat<-read.table("HW19.csv", header=TRUE, sep=",",na.strings="999")

dat$Attitude<-as.factor(dat$Attitude)

mod<-lm(Science.Comprehension~Gender*Attitude*Grade,data=dat)

anova(mod)

#==================================================================================================

# USING FTABLE AND TAPPLY TO GENERATE TABLES OF MEAN, SD, N, VAR ETC. BY GROUP

#==================================================================================================

with(dat,ftable(tapply(Science.Comprehension,list(Grade,Gender,Attitude),mean)))

with(dat,ftable(tapply(Science.Comprehension,list(Grade,Gender,Attitude),sd)))

with(dat,ftable(tapply(Science.Comprehension,list(Grade,Gender,Attitude),length)))

#==================================================================================================

# USING FTABLE AND TAPPLY TO COMPILE 1 TABLE OF MEAN, SD, N, VAR ETC. BY GROUP

#==================================================================================================

DF<-with(dat,as.data.frame.table(tapply(Science.Comprehension,list(Grade,Gender,

Attitude),mean)))

stndev<-as.vector(with(dat,ftable(tapply(Science.Comprehension,list(Grade,Gender,Attitude),sd))))

n<-as.vector(with(dat,ftable(tapply(Science.Comprehension,list(Grade,Gender,Attitude),length))))

DF<-data.frame(DF[,1:3],n,DF[,4],stndev)

colnames(DF)<-c("Grade","Gender","Attitude","n","mean","sd")

#==================================================================================================

# TABLE OF N,MEANS & SD BY GROUP

#==================================================================================================

DF

#EXAMPLE

library(reshape)

dstats <- function(x)(c(n=length(x), mean=mean(x), sd=sd(x), med=median(x)))

dfm <- melt(mtcars, measure.vars=c("mpg", "hp", "wt"),

id.vars=c("am", "cyl"))

cast(dfm, am + cyl + variable ~ ., dstats)


Correlation Matrices and Plots

Correlation Package I’ve created (see cmat() below)
Select Numeric Columns for the cor function (useful for cor())
Method 1

sapply(dataframe, is.numeric)

Method 2

which(sapply(dataframe, is.numeric))

Correlation and Correlation Tables Type: cor( x,y, use="complete.obs") Output Where x and y are single numeric variables the correlation is a single value. To correlate more than two numeric variables:

1) The first step is to bind your outcome variables: y<-cbind(x,y,z…) #or see selecting numeric variables

2) The last step is to type: cor(y, use="complete.obs")

The output will be a correlation table.

Note: If you have changed all the variables to numeric as described in the changing variable section you can simply type: cor(data1) Where data1 is the data set, however some of the numeric conversions (as in age level y, m, o = 1,2,3) are inappropriate correlations.

source("C:/Users/Rinker/Desktop/PhD Program/CEP 523-Stat Meth Ed Inference/R Stuff/Scripts/Regression/Correlation Matrix and Data Visualization.txt")

cmat()

Note: Use round(cor(),2) to round the table to 2 decimals

EXAMPLE With cor()

cor(iris) # the problem: fails because the Species column is not numeric

cor(iris[as.numeric(which(sapply(iris, is.numeric)))])#the fix


Correlation Tables and p-values
Use rcorr() from the Hmisc package to get a correlation matrix and a p-value matrix.
Correlation matrix w/ n’s and p-values (does pairwise deletion)
library(Hmisc)
rcorr(x, y, type=c("pearson","spearman"))
Could make it do listwise (complete-case) deletion instead by doing:
rcorr(na.omit(x), y, type=c("pearson","spearman"))

Pairwise Associations between Items using a Correlation Coefficient library(ltm) This one is similar to rcorr above rcor.test(mat, p.adjust = FALSE, p.adjust.method = "holm", ...)
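A small sketch (not from the original notes) of rcor.test() on built-in data:

library(ltm)
# correlations below the diagonal, p-values above it
rcor.test(as.matrix(mtcars[, c("mpg", "wt", "hp", "disp")]))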


Correlation matrix w/ n’s and pvalues library(psych) corr.test(x, y = NULL, use = "pairwise", method="pearson") x A matrix or dataframe

y A second matrix or dataframe with the same number of rows as x

use use="pairwise" is the default value and will do pairwise deletion of cases.

use="complete" will select just complete cases.

method method="pearson" is the default value. The alternatives to be passed to cor are

"spearman" and "kendall"

Correlation Matrix w/sig stars sigstarC(dataset)

Find the significance of the difference between (un)paired correlations library(psych) paired.r(xy, xz, yz=NULL, n, n2=NULL,twotailed=TRUE)

Arguments xy r(xy) xz r(xz) yz r(yz) n Number of subjects for first group n2 Number of subjects in second group (if not equal to n) twotailed Calculate two or one tailed probability values

Description

Test the difference between two (paired or unpaired) correlations. Given 3 variables, x, y, z, is the correlation between xy different than that between xz? If y and z are independent, this is a simple t-test of the z transformed rs. But, if they are dependent, it is a bit more complicated. To find the z of the difference between two independent correlations, first convert them to z scores using the Fisher r-z transform and then find the z of the difference between the two correlations. The default assumption is that the group sizes are the same, but the test can be done for different size groups by specifying n2. If the correlations are not independent (i.e., they are from the same sample) then the correlation with the third variable r(yz) must be specified. Find a t statistic for the difference of these two dependent correlations.

source("C:/Users/Rinker/Desktop/PhD Program/CEP 523-Stat Meth Ed Inference/R Stuff/Scripts/Regression/Correlation Matrix With Sig Stars.txt

Example

library(psych)

corr.test(mtcars)


Correlation matrix, correlation plots, and histograms of the variables library(psych) pairs.panels(x, smooth = TRUE, scale = FALSE, density=TRUE,ellipses=TRUE,digits=2)

Arguments x a data.frame or matrix

smooth TRUE draws loess smooths scale TRUE scales the correlation font by the size of the absolute correlation. density TRUE shows the density plots as well as histograms ellipses TRUE draws correlation ellipses lm Plot the linear fit rather than the LOESS smoothed fits. digits the number of digits to show pch The plot character (defaults to 20 which is a ’.’). cor If plotting regressions, should correlations be reported? jiggle Should the points be jittered before plotting? factor factor for jittering (1-5)

hist.col What color should the histogram on the diagonal be? show.points If FALSE, do not show the data points

Description Adapted from the help page for pairs, pairs.panels shows a scatter plot of matrices (SPLOM), with bivariate scatter plots below the diagonal, histograms on the diagonal, and the Pearson correlation above the diagonal. Useful for descriptive statistics of small data sets. If lm=TRUE, linear regression fits are shown for both y by x and x by y. Correlation ellipses are also shown. Points may be given different colors depending upon some grouping variable.

scatterplot matrix
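A one-line sketch (not from the original notes) of pairs.panels() on the numeric iris columns:

library(psych)
pairs.panels(iris[, 1:4], smooth = TRUE, density = TRUE, ellipses = TRUE)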


Correlation plot represented by colors library(psych) cor.plot(dfg,colors=TRUE, n=10,main=NULL,zlim=c(-1,1),show.legend=TRUE,labels=NULL) Correlation Test With p values (pairwise or complete obs.) library(psych) corr.test(x, y = NULL, use = "pairwise",method="pearson")

Description Although the cor function finds the correlations for a matrix, it does not report probability values. corr.test uses cor to find the correlations for either complete or pairwise data and reports the sample sizes and probability values as well.

Arguments x A matrix or dataframe y A second matrix or dataframe with the same number of rows as x use

use="pairwise" is the default value and will do pairwise deletion of cases. use="complete" will select just complete cases.

method method="pearson" is the default value. The alternatives to be passed to cor are "spearman" and "kendall"

Details corr.test uses the cor function to find the correlations, and then applies a t-test to the individual correlations using the formula Value r The matrix of correlations n Number of cases per correlation t value of t-test for each correlation p two tailed probability of t for each correlation
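A short sketch (not from the original notes) of both calls on built-in data:

library(psych)
R <- cor(mtcars[, c("mpg", "disp", "hp", "wt", "qsec")])
cor.plot(R, main = "mtcars correlations")                  # colored image of the correlation matrix
corr.test(mtcars[, c("mpg", "disp", "hp", "wt", "qsec")])  # r, n, t, and p matrices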


Correlation Matrix Plot Using Pie Graphs and Colors library(corrgram) corrgram(x, order = NULL, panel=panel.txt, lower.panel=panel.shade, upper.panel=panel.pie, diag.panel=NULL, text.panel=panel.txt, label.pos=0.5, cex.labels=NULL, font.labels=1, row1attop=TRUE, gap=0, main=NULL)

Options x is a dataframe with one observation per row.

order=TRUE will cause the variables to be ordered using principal component analysis

of the correlation matrix. panel= refers to the off-diagonal panels. You can use lower.panel= and

upper.panel= to choose different options below and above the main diagonal respectively. text.panel= and diag.panel= refer to the main diagnonal. Allowable parameters are given below.

off diagonal panels panel.pie (the filled portion of the pie indicates the magnitude of the correlation) panel.shade (the depth of the shading indicates the magnitude of the correlation) panel.ellipse (confidence ellipse and smoothed line) panel.pts (scatterplot) main diagonal panels panel.minmax (min and max values of the variable) panel.txt (variable name).

Use this function before plotting to change the colors used (note: "lightred" in the original is not a valid R colour name, so "salmon" is substituted here): col.corrgram <- function(ncol){ colorRampPalette(c("purple", "red", "salmon", "pink"))(ncol)}

Lines on the shade indicate direction.

Shade color and pie indicates magnitude.
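A minimal sketch (not from the original notes) of a corrgram on built-in data:

library(corrgram)
corrgram(mtcars, order = TRUE,            # PCA-order the variables
         lower.panel = panel.shade,       # shaded cells below the diagonal
         upper.panel = panel.pie,         # pie cells above the diagonal
         text.panel = panel.txt,
         main = "Corrgram of mtcars")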


Correlation Hypothesis testing

Correlation Package I’ve created for testing Correlation Hypotheses

Confidence Interval for a Correlation Coefficient library(psychometric) CIr(r, n, level = 0.95)

Arguments r Correlation Coefficient n Sample Size level Significance Level for constructing the CI, default is .95

Convert r values to z scores
fisherz <- function(rho) {0.5*log((1+rho)/(1-rho))} # converts r to z; usage: fisherz(r)
OR using library(psych):
fisherz(rho)
fisherz2r(z)
r.con(rho,n,p=.95,twotailed=TRUE)
r2t(rho,n)

Description convert a correlation to a z score or z to r using the Fisher transformation or find the confidence intervals for a specified correlation

Convert a Pearson correlation coefficient to Fishers z’ library(psychometric) r2z(x) Where x is the Pearson correlation coefficient

source("C:/Users/Rinker/Desktop/PhD Program/CEP 523-Stat Meth Ed Inference/R Stuff/Scripts/Regression/Correlation Matrix and Data Visualization.txt")

cmat()


Confidence Interval for Fisher z’ library(psychometric) CIz(z, n, level = 0.95)

Arguments z Fishers z’ n Sample Size level Significance Level for constructing the CI, default is .95

Convert a Fisher's z’ to a Pearson correlation coefficient library(psychometric) z2r(x) Where x is the Fisher's z’
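A round-trip sketch (not from the original notes) tying the psychometric helpers together:

library(psychometric)
r2z(.45)                 # Pearson r to Fisher z'
CIz(r2z(.45), n = 100)   # confidence interval around z'
z2r(r2z(.45))            # back-transform: returns .45
CIr(.45, n = 100)        # confidence interval directly on r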


Find partial correlation between two variables, with other vars removed library(ggm) pcor(u, S) u a vector of integers of length > 1. The first two integers are the indices of the variables the

correlation of which must be computed. The rest of the vector is the conditioning set.

S a symmetric positive definite matrix, a sample covariance matrix.

Find the partial correlations for a set (x) of variables with set (y) removed library(psych) partial.r(m, x, y)

Arguments m A data or correlation matrix x The variable numbers associated with the X set. y The variable numbers associated with the Y set

Description

A straightforward application of matrix algebra to remove the effect of the variables in the y set from the x set. Input may be either a data matrix or a correlation matrix. Variables in x and y are specified by location. It is sometimes convenient to partial the effect of a number of variables (e.g., sex, age, education) out of the correlations of another set of variables. This could be done laboriously by finding the residuals of various multiple correlations, and then correlating these residuals. The matrix algebra alternative is to do it directly.

Find the partial correlations for a set correlation or covariance matrix library(corpcor) cor2pcor(m, tol)

EXAMPLE

cool <- make.hierarchical() #make up a correlation matrix

round(cool[1:5,1:5],2)

partial.r(cool,c(1,3,5),c(2,4))

EXAMPLE

library(ggm)

data(marks)

with(marks, cor(vectors, algebra)) #cor not accounting for anything else

## The correlation between vectors and algebra given analysis and statistics

pcor(c("vectors", "algebra", "analysis", "statistics"), var(marks))


Tests the hypothesis that two correlations are significantly different library(psychometric) rdif.nul(r1, r2, n1, n2) (one tailed; p must be doubled to get p value for 2 tail)

Arguments r1 Correlation 1 r2 Correlation 2 n1 Sample size for r1 n2 Sample size for r2

Details First converts r to z’ for each correlation. Then constructs a z test for the difference z <- (z1 - z2)/sqrt(1/(n1-3)+1/(n2-3))

Returns a table with 2 elements zDIF z value for the H0 p p value
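A one-call sketch (not from the original notes) with made-up values:

library(psychometric)
# Is r1 = .60 (n1 = 100) significantly different from r2 = .40 (n2 = 120)?
rdif.nul(r1 = .60, r2 = .40, n1 = 100, n2 = 120)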


Tests of significance for correlations library(psych) r.test(n, r12, r34 = NULL, r23 = NULL, r13 = NULL, r14 = NULL, r24 = NULL, n2 = NULL, pooled =

TRUE, twotailed = TRUE)

Arguments n Sample size of first group r12 Correlation to be tested r34 Test if this correlation is different from r12, if r23 is specified, but r13 is not,

then r34 becomes r13 r23 if ra = r(12) and rb = r(13) then test for differences of dependent correlations

given r23 r13 implies ra =r(12) and rb =r(34) test for difference of dependent correlations r14 implies ra =r(12) and rb =r(34) r24 ra =r(12) and rb =r(34) n2 n2 is specified in the case of two independent correlations. n2 defaults to n if

not specified pooled use pooled estimates of correlations

twotailed should a twotailed or one tailed test be used

Description Tests the significance of a single correlation, the difference between two independent correlations, the difference between two dependent correlations sharing one variable (Williams’s Test), or the difference between two dependent correlations with different variables (Steiger Tests). Details Depending upon the input, one of four different tests of correlations is done. 1. For a sample size n, find the t value for a single correlation. 2. For sample sizes of n and n2 (n2 = n if not specified) find the z of the difference between

the z transformed correlations divided by the standard error of the difference of two z scores.

3. For sample size n, and correlations ra= r12, rb= r23 and r13 specified, test for the difference of two dependent correlations.

4. For sample size n, test for the difference between two dependent correlations involving different variables.

For clarity, correlations may be specified by value. If specified by location and if doing the test of dependent correlations, if three correlations are specified, they are assumed to be in the order r12, r13, r23.

Value test Label of test done z z value for tests 2 or 4 t t value for tests 1 and 3 p probability value of z or t
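A sketch (not from the original notes) of the main cases with made-up correlations:

library(psych)
r.test(n = 50, r12 = .30)                        # case 1: is a single r different from 0?
r.test(n = 50, r12 = .50, r34 = .30, n2 = 60)    # case 2: two independent correlations
r.test(n = 50, r12 = .50, r13 = .30, r23 = .40)  # case 3: two dependent correlations sharing a variable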


Nil hypothesis for a correlation (Does r = 0?) library(psychometric) r.nil(r, n)

Arguments r Correlation coefficient n Sample Size

Performs a one-tailed t-test of the H0 that r = 0

Returns a table with 4 elements “H0:rNot0” correlation to be tested t t value for the H0

df degrees of freedom p p value
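A quick sketch (not from the original notes):

library(psychometric)
r.nil(r = .35, n = 40)  # t-test of H0: the population correlation is 0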


Cronbach’s Alpha library(psych)

alpha(x, keys=NULL,cumulative=FALSE, title=NULL, max=10,na.rm = TRUE)

Cronbach’s Alpha
library(psy)
cronbach(v1) # v1 = n*p matrix or dataframe, n subjects and p items. Missing values are omitted in a "listwise" way (all items are removed even if only one of them is missing).
Cronbach’s Alpha
library(psychometric)
alpha(x) # where x is a data.frame
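A quick sketch (not from the original notes) on the built-in attitude data (7 numeric rating columns); note that psych and psychometric both define a function named alpha(), so load only the one you need:

library(psych)
alpha(attitude)     # raw and standardized alpha, plus item statistics

library(psy)
cronbach(attitude)  # number of subjects, number of items, and alpha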


Cronbach’s Alpha library(ltm) cronbach.alpha(data, standardized = FALSE, CI = FALSE, probs = c(0.025, 0.975), B = 1000, na.rm = FALSE) Confidence Interval for Coefficient Alpha (1 or 2 tailed) library(psychometric) First calculate an alpha and then: alpha.CI(alpha, k, N, level = 0.90, onesided = FALSE)
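A short sketch (not from the original notes), using the LSAT item data shipped with ltm:

library(ltm)
data(LSAT)                        # 1000 examinees x 5 binary items
cronbach.alpha(LSAT, CI = TRUE)   # alpha with a bootstrap confidence interval

library(psychometric)
alpha.CI(alpha = .82, k = 5, N = 1000, level = .90)  # CI from summary values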


Descriptive Statistics for a response data frame (includes Cronbach’s alpha and lots of goodies)
library(ltm)
descript(data, n.print = 10, chi.squared = TRUE, B = 1000)
Returns: response proportions per item, total-score frequencies, Cronbach’s alpha, and p-values for pairwise item associations (chi-squared).



Alternative reliability Analysis library(psych) ?guttman guttman(r,key=NULL) tenberge(r) glb(r,key=NULL) glb.fa(r,key=NULL)

Arguments r A correlation matrix or raw data matrix. key a vector of -1, 0, 1 to select or reverse items

Estimation of a True Score library(psychometric) Est.true(obs, mx, rxx)

Arguments obs an observed score on test x mx mean of test x rxx reliability of test x

Description

Given the mean and reliability of a test, this function estimates the true score based on an observed score. The estimation is accounting for regression to the mean

Spearman-Brown Prophecy Formulae library(psychometric) SBrel(Nlength, rxx) SBlength(rxxp, rxx)

Arguments Nlength New length of a test in relation to original rxx reliability of test x rxxp reliability of desired (parallel) test x

Returns: rxxp - the prophesized reliability; N -Ratio of new test length to original test length
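A numeric sketch (not from the original notes) of the three psychometric helpers:

library(psychometric)
Est.true(obs = 115, mx = 100, rxx = .80)  # estimated true score (regressed toward the mean): 112
SBrel(Nlength = 2, rxx = .70)             # reliability if the test is doubled in length (about .82)
SBlength(rxxp = .90, rxx = .70)           # length multiple needed to reach reliability .90 (about 3.86)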


Item Analysis (Gives lots of info from a sample’s responses) library(psychometric) item.exam(x, y = NULL, discrim = FALSE)

Arguments x matrix or data.frame of items y Criterion variable discrim Whether or not the discrimination of item is to be computed

Description

Conducts an item-level analysis. Provides item-total correlations, standard deviations of items, difficulty, discrimination, and reliability and validity indices.

Details

If someone is interested in examining the items of a dataset contained in data.frame x, and the criterion measure is also in data.frame x, one must parse the matrix or data.frame and specify each part into the function. See example below. Otherwise, one must be sure that x and y are properly merged/matched. If one is not interested in assessing item-criterion relationships, simply leave out that portion of the call. The function does not check whether the items are dichotomously coded, this is user specified. As such, one can specify that items are binary when in fact they are not. This has the effect of computing the discrimination index for continuously coded variables.

The difficulty index (p) is simply the mean of the item. When dichotomously coded, p reflects the proportion endorsing the item. However, when continuously coded, p has a different interpretation.
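A sketch (not from the original notes) on simulated dichotomous items; the data are made up purely for illustration:

library(psychometric)
set.seed(42)
# 200 examinees answering 6 right/wrong items
items <- as.data.frame(matrix(rbinom(200 * 6, size = 1, prob = .6), ncol = 6))
item.exam(items, discrim = TRUE)  # difficulty, item-total r, discrimination, reliability indices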


Grade multiple choices (uses multiple choice data set and answer key to give correct(1) incorrect (0)

mult.choice(data, correct) library(ltm)

Arguments data a matrix or a data.frame containing the manifest variables as columns. correct a vector of length ncol(data) with the correct responses (answer key)

This new matrix could then be used to do column sums; row sums; weighting of questions; grades; weighted grades based on question weights.
Find Intraclass Correlations (ICC1, ICC2, ICC3 from Shrout and Fleiss) of two raters (numeric)
library(psych)
ICC(x, missing=TRUE, alpha=.05)

Arguments x a matrix or dataframe of ratings missing if TRUE, remove missing data – work on complete cases only alpha The alpha level for significance for finding the confidence intervals

Description The Intraclass correlation is used as a measure of association when studying the reliability of raters. Shrout and Fleiss (1979) outline 6 different estimates, that depend upon the particular experimental design. All are implemented and given confidence limits. Intraclass correlation coefficient (ICC) library(psy) icc(data) data = n*p matrix or dataframe, n subjects p raters

Details Missing data are omitted in a listwise way. The "agreement" ICC is the ratio of the subject variance by the sum of the subject variance, the rater variance and the residual; it is generally preferred. The "consistency" version is the ratio of the subject variance by the sum of the subject variance and the residual; it may be of interest when estimating the reliability of pre/post variations in measurements.

Page 154: 2013 - Notes - R Trinker'S_notes

154 | P a g e

Find Cohen’s kappa and weighted kappa coefficients for correlation of two raters (nominal)

cohen.kappa(x, w=NULL,n.obs=NULL,alpha=.05) library(psych) wkappa(x, w = NULL)

Arguments
x Either a two-column data frame/matrix (ratings by two raters) with categorical values from 1 to p, or a p x p table. If a data array is given, a table will be found.
w A p x p matrix of weights. If not specified, they are set to 0 on the diagonal and (distance from diagonal)^2 off the diagonal.
n.obs Number of observations (if the input is a square matrix).
alpha Probability level for confidence intervals

Description Cohen’s kappa (Cohen, 1960) and weighted kappa (Cohen, 1968) may be used to find the agreement of two raters when using nominal scores. weighted.kappa is (probability of observed matches - probability of expected matches)/(1 - probability of expected matches). Kappa just considers the matches on the main diagonal. Weighted kappa considers off diagonal elements as well. Find Cohen’s kappa and weighted kappa coefficients for correlation of two raters (nominal) library(psy) wkappa(r, weights="squared")

Arguments r n*2 matrix or dataframe, n subjects and 2 raters weights weights="squared" to obtain squared weights. If not, absolute weights are

computed

Details The diagnoses are ordered as follows: numbers < letters, letters and numbers ordered naturally. For weights="squared", weights are related to squared differences between row and column indices (in this situation wkappa is close to an icc). For weights!="squared", weights are related to absolute values of differences between row and column indices. The function deals with the case where the two raters have not exactly the same scope of rating (some software associates an error with this situation). Missing values are omitted.


Reverse Scoring library(psych) reverse.code(keys, items, mini = NULL, maxi = NULL)

NOTE: Reverse scoring can also be accomplished by taking the item and creating a new rescored variable using the formula: (m+1)-s = reverse scored item Where m is the max score you could have gotten on a Likert type scale and s is the score vector containing the scores of the item that is to be reverse scored.

EXAMPLE

original <- matrix(sample(6,50,replace=TRUE),10,5)

keys <- c(1,1,-1,-1,1) #reverse the 3rd and 4th items

new <- reverse.code(keys,original,mini=rep(1,5),maxi=rep(6,5))


TABULAR DATA

Table of Counts (frequency table) [nested]

table(factor1,factor2,n factor)

ftable(factor1,factor2,n factor…) Note: you can also specify the table with a formula using the tilde ~ (see example)

Un-nested table of counts margin.table(table,factor # to reveal)

Compute column and row sums for a table method 1 addmargins(table) Compute column and row sums for a table method 2 library(vcd) mar_table(x)

EXAMPLE

DF<-read.table("fake remdial reading (logistic regression example).csv", header=TRUE,

sep=",",na.strings="999")

DF<-data.frame(DF,"Fav.Color"=sample(c("blue","red","green","orange"),nrow(DF),replace=T))

with(DF,table(sex,rem.read.rec))

with(DF,table(sex,rem.read.rec,Fav.Color)) #too cumbersome so we use ftable

with(DF,ftable(sex,rem.read.rec,Fav.Color))

with(DF,ftable(rem.read.rec~sex+Fav.Color))

with(DF,ftable(sex+Fav.Color~rem.read.rec))

EXAMPLE:

DF<-read.table("fake remdial reading (logistic regression example).csv", header=TRUE,

sep=",",na.strings="999")

DF<-data.frame(DF,"Fav.Color"=sample(c("blue","red","green","orange"),nrow(DF),replace=T))

(tab2<-with(DF,table(sex,Fav.Color,rem.read.rec)))

margin.table(tab2,1)

margin.table(tab2,2)

margin.table(tab2,3)

d<-data.frame(matrix(c(sample(c("red","blue", "green"), 25, replace=T),

sample(c(letters[1:5]), 25, replace=T),

sample(c("DOG","CAT", "CHICKEN", "SNAKE"), 25, replace=T)), nrow=25, ncol=3))

DT <- with(d, table(X1,X3))

with(d, chisq.test(X1,X2))

with(d, fisher.test(X1,X2))

DT<-with(d,xtabs(~X1+X3))

addmargins(DT)

mar_table(DT)


Table of Counts for Proportion Tables

prop.table(table)

Tabular Data 2 x 2 Table Chi squared test of independence 2 x 2 summary(table(factor1,factor2)) EXAMPLE

DF<-data.frame(cbind("X1"=c(rep("yes",12),rep("no",12)),"X2"=c(rep("red",10),rep("blue",14))))

with(DF,summary(table(X1,X2)))

Tabular Data 2 x 2 Table with Yates Continuity Correction Chi squared test of independence

chisq.test(table(factor1,factor2)) EXAMPLE

DF<-data.frame(cbind("X1"=c(rep("yes",12),rep("no",12)),"X2"=c(rep("red",10),rep("blue",14))))

with(DF, chisq.test(table(X1,X2)))

Compute a table of expected frequencies (used by chisq.test) library(vcd)

independence_table(x, frequency = c("absolute", "relative")) x is a table. frequency indicates whether absolute or relative frequencies should be computed.

Tabular Data 2 x 2 and larger Table fisher.test(table(factor1,factor2)) EXAMPLE

with(warpbreaks,fisher.test(table(wool,tension)))

EXAMPLE

x<-ftable(mtcars[,c(2,8)])

independence_table(x)

EXAMPLE

DF<-read.table("fake remdial reading (logistic regression example).csv", header=TRUE,

sep=",",na.strings="999")

DF<-data.frame(DF,"Fav.Color"=sample(c("blue","red","green","orange"),nrow(DF),replace=T))

with(DF,table(sex,rem.read.rec))

with(DF,table(sex,rem.read.rec,Fav.Color))

with(DF,ftable(sex,rem.read.rec,Fav.Color))

with(DF,ftable(rem.read.rec~sex+Fav.Color))

(tab1<-with(DF,ftable(sex+Fav.Color~rem.read.rec)))

prop.table(tab1,1)

(percentTABLE<-prop.table(tab1,1)*100)

(tab2<-with(DF,table(sex,Fav.Color,rem.read.rec)))

prop.table(tab2,1)

(percentTABLE<-prop.table(tab2,1)*100)


Cross Tabulation with Tests for Factor Independence library(gmodels) CrossTable(x, y, digits=3, max.width = 5, expected=FALSE, prop.r=TRUE, prop.c=TRUE, prop.t=TRUE, prop.chisq=TRUE, chisq = FALSE, fisher=FALSE, mcnemar=FALSE, resid=FALSE, sresid=FALSE, asresid=FALSE, missing.include=FALSE, format=c("SAS","SPSS"), dnn = NULL, ...)

Arguments

x A vector or a matrix. If y is specified, x must be a vector

y A vector in a matrix or a dataframe

digits Number of digits after the decimal point for cell proportions

max.width In the case of a 1 x n table, the default will be to print the output horizontally. If the number of columns exceeds max.width, the table will be wrapped for each successive increment of max.width columns. If you want a single column vertical table, set max.width to 1

expected If TRUE, chisq will be set to TRUE and expected cell counts from the Chi-Square will be included

prop.r If TRUE, row proportions will be included

prop.c If TRUE, column proportions will be included

prop.t If TRUE, table proportions will be included

prop.chisq If TRUE, chi-square contribution of each cell will be included

chisq If TRUE, the results of a chi-square test will be included

fisher If TRUE, the results of a Fisher Exact test will be included

mcnemar If TRUE, the results of a McNemar test will be included

resid If TRUE, residual (Pearson) will be included

sresid If TRUE, standardized residual will be included

asresid If TRUE, adjusted standardized residual will be included

missing.include If TRUE, then remove any unused factor levels

format Either SAS (default) or SPSS, depending on the type of output desired.

dnn the names to be given to the dimensions in the result (the dimnames names).

Combine columns or rows of a cross table library(vcdExtra) collapse.table(table)

EXAMPLES

library(gmodels)

CrossTable(infert$education, infert$induced, expected = TRUE,chisq = T, fisher=T,

mcnemar=T,resid=T, sresid=T, asresid=T, missing.include=T)

#&&&&&&&&&&&&&&&&&&&&&

CrossTable(mtcars$cyl,mtcars$vs,format="SAS")

CrossTable(mtcars$cyl,mtcars$vs,format="SPSS")

#EXAMPLE

library(vcdExtra)

# create some sample data in table form

sex <- c("Male", "Female")

age <- letters[1:6]

education <- c("low", "med", "high")

data <- expand.grid(sex=sex, age=age, education=education)

counts <- rpois(36, 100)

data <- cbind(data, counts)

(t1 <- xtabs(counts ~ sex + age + education, data=data))

# collapse age to 3 levels

(t2 <- collapse.table(t1, age=c("A", "A", "B", "B", "C", "C")))

# collapse age to 3 levels and pool education: "low" and "med" to "low"

(t3 <- collapse.table(t1, age=c("A", "A", "B", "B", "C", "C")))

education=c("low", "low", "high"))

# change labels for levels of education to 1:3

(t4 <- collapse.table(t1, education=1:3))


Strength of Effect Measures [SOE] (Tabular Data)

Read Measures of association in crosstab tables article for SOE measures decisions Compute Pearson χ2, Likelihood Ratio χ2, φ coefficient, contingency coefficient & Cramer's V assocstats(x) library(vcd)

Cohen's kappa and weighted kappa for a confusion matrix: library(vcd) Kappa(x), where x is a confusion matrix (a table of ratings). Note: base R's kappa(z), shown in the examples below, is a different function that estimates the condition number of a matrix z (or of the result of qr() or of a fit inheriting from "lm").

EXAMPLES

data("Arthritis")

Arthritis

(tab <- xtabs(~Improved + Treatment, data = Arthritis))

summary(assocstats(tab))

#AND

x<-ftable(mtcars[,c(2,8)])

summary(assocstats(x))

Examples

kappa(x1 <- cbind(1,1:10))# 15.71

kappa(x1, exact = TRUE) # 13.68

kappa(x2 <- cbind(x1,2:11))# high! [x2 is singular!]


Turn a table into a dataframe METHOD 1
table2flat(table) # table can be an ftable, xtabs, or table object
Turn a table into a dataframe METHOD 2
library(vcdExtra)
expand.dft(x, var.names = NULL, freq = "Freq", ...)
expand.table(x, var.names = NULL, freq = "Freq", ...)

Categorical Article Data (Cross Table) to Raw Data Below is an example starting with creating a table from numeric values (replicate data frame from results)

FROM AN ARTICLE TO A TABLE TO RAW DATA

#CODE

table2flat <- function(mytable){

#by Robert Kabakoff

df <- as.data.frame(mytable)

rows <- dim(df)[1]

cols <- dim(df)[2]

x <- NULL

for (i in 1:rows){

for (j in 1:df$Freq[i]){

row <- df[i, c(1:(cols-1))]

x <- rbind(x, row)

}

}

row.names(x) <- c(1:dim(x)[1])

return(x)

}

#EXAMPLE

x <- with(mtcars,table(am, gear, cyl, vs))

table2flat(x)

x2 <- with(mtcars,ftable(am, gear, cyl, vs))

table2flat(x2)

#===================================================================

# CREATE THE DATA FRAME FROM A MATRIX OF FREQUENCIES

#===================================================================

d2 <- matrix(c(23, 15, 66, 34, 19, 22), ncol=3, nrow=2)

dimnames(d2) <-list(Gender=c("boys", "girls"),

Inst.Meth=c("direct.int", "explicit.learn", "didactic"))

d2<-as.table(d2)

d2

#===================================================================

expand.dft(d2) #BOTH FUNCTIONS WILL RETURN THE DATA FRAME

table2flat(d2)

art <- xtabs(~Treatment + Improved, data = Arthritis)

art

expand.dft(art)


ANOVA

ANOVA (balanced or not; as many ways as you want [1 way, 2 way, 3 way …]) linear model

Type: anova(lm(sc ~ g))   # One way

Analysis of Variance Table
Response: sc
          Df Sum Sq Mean Sq F value  Pr(>F)
g          1 18.778 18.7778  6.7759 0.01359 *
Residuals 34 94.222  2.7712
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

anova(lm(sc ~ s*a*g))   # Multi-way (gives interactions)

Analysis of Variance Table
Response: sc
          Df Sum Sq Mean Sq F value  Pr(>F)
s          1  4.000  4.0000  1.8947 0.18138
a          2 23.167 11.5833  5.4868 0.01091 *
g          1 18.778 18.7778  8.8947 0.00647 **
s:a        2  1.167  0.5833  0.2763 0.76095
s:g        1 11.111 11.1111  5.2632 0.03083 *
a:g        2  1.389  0.6944  0.3289 0.72287
s:a:g      2  2.722  1.3611  0.6447 0.53365
Residuals 24 50.667  2.1111
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

anova(lm(sc ~ s+a+g))   # Multi-way (gives only main effects)

Analysis of Variance Table
Response: sc
          Df Sum Sq Mean Sq F value   Pr(>F)
s          1  4.000  4.0000  1.8492 0.183684
a          2 23.167 11.5833  5.3550 0.010055 *
g          1 18.778 18.7778  8.6810 0.006056 **
Residuals 31 67.056  2.1631
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Note: sc is the dependent variable; s, a, g are the main effects (IVs); s*a, s*g, a*g, & s*a*g are the interactions.


Analysis of Variance AOV Model Type: z.aov <- aov(y ~ factor1*factor2*factor3) Where z.aov is the output, y is the DV (numeric scores), and factor1, factor2, factor3 are your categorical IVs. summary(z.aov) gives you the same output as anova(lm(y ~ factor1*factor2*factor3)) from the linear model approach. Means Tables This is for after you have run an aov model (make sure you've labeled the categorical IVs as factors using the as.factor() function): model.tables(z.aov, "means", se=T) Where z.aov is the output label for the aov model you've just run. This gives you the means tables for main and interaction effects. Residual plots plot(model) example: plot(hw.aov) Where model is the aov model. Post Hoc & Protected Tests (for use after ANOVA) Tukey TukeyHSD(model) [example: TukeyHSD(z.aov)] Where model is the output label for the aov.
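As a runnable sketch of these steps (using the built-in warpbreaks data purely for illustration; it is not the data set from these notes):

z.aov <- aov(breaks ~ wool * tension, data = warpbreaks)
summary(z.aov)                 # ANOVA table with main effects and interaction
model.tables(z.aov, "means")   # cell and marginal means
plot(z.aov)                    # residual diagnostic plots
TukeyHSD(z.aov, "tension")     # Tukey post hoc comparisons for tension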


MANOVA. I will use the following data set to illustrate the MANOVA. I usually go in and change the variable names to something simple using the command: s<-data$Study.Group. Then make sure your categorical variables are factors using the command: s<-as.factor(s). The next step is to bind your outcome variables: y<-cbind(c,l,h). Now you can check for outliers using the aq.plot() function from the mvoutlier package: library(mvoutlier); aq.plot(y)

Projection to the first and second robust principal components.
Proportion of total variation (explained variance): 0.8742298
$outliers
[1] FALSE FALSE FALSE FALSE FALSE FALSE FALSE TRUE FALSE FALSE FALSE FALSE TRUE FALSE
[15] TRUE TRUE FALSE TRUE FALSE FALSE FALSE FALSE FALSE TRUE
Warning message:
In princomp.default(x, covmat = covr) :
both 'x' and 'covmat' were supplied: 'x' will be ignored

[aq.plot() graphic: four panels showing (1) the projection onto the first two robust principal components, (2) the cumulative probability of the ordered squared robust distances with the 97.5% quantile and adjusted quantile cutoffs marked, (3) outliers based on the 97.5% quantile, and (4) outliers based on the adjusted quantile; observations 8, 13, 15, 16, 18 and 24 are flagged.]


Next you can run the MANOVA using the following command: v<-manova(lm(y ~ s * r)). The v is an arbitrary name, just as y was in the last step. The s*r term gives you the main effects as well as the interaction. To get the F statistics you need to call up a summary command: summary(v). By default the test will be the Pillai. If you wish to change the output to Wilks enter: summary(v, test="Wilks"). Note: you can also change "Wilks" to "Roy". From here you should go on to running individual ANOVAs on each of the outcome variables to complete the ANOVA table.

MANOVA Correlation Table(for the DV’s) Type: cor(y) Note: the y is the bind numeric variables. The output table will be as a correlation table.
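The output shown in the original notes was an image; below is a minimal runnable sketch of the same workflow, using the built-in iris data and an arbitrary choice of DVs (an assumption purely for illustration):

y <- with(iris, cbind(Sepal.Length, Petal.Length, Petal.Width))  # bind the DVs
v <- manova(y ~ Species, data = iris)
summary(v)                    # Pillai by default
summary(v, test = "Wilks")    # or "Roy", "Hotelling-Lawley"
summary.aov(v)                # follow-up univariate ANOVAs for each DV
cor(y)                        # correlations among the DVs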


Repeated Measures Anova (Balanced Design). (It is inappropriate to use this for unbalanced designs and/or missing values.)

A 3 time no between factors model (IV-Subject[categorical and random]; IV-Meal[categorical and fixed]; DV-Cholesterol Rating[numeric])

Data Set Type: ex20.aov <- aov(Cholesterol.Intake ~factor(Meal) + Error(factor(Student))) Where Cholesterol.Intake is the DV, Meal is an IV fixed factor, and Student is an IV random factor. [output] A 3 time within and between factors model (IV-Subject[categorical and random]; IV-Gender[categorical and fixed]; Instructor Type[categorical and fixed]; DV-Cholesterol Rating[numeric])

> ex27.10.aov <- aov(Study.Time ~factor(Instructor.Type)*factor(Gender) + Error(factor(Student))) And then… > summary(ex27.10.aov) [output] Note: see section 27:12 for the ANOVA table that corresponds to this output.


Repeated Measures Anova (balanced or unbalanced) [relies on the car package] A 3 time no between factors model (IV-Subject[categorical and random]; IV-Meal[categorical and fixed]; DV-Cholesterol Rating[numeric]) Note: this model keeps data sets in the traditional data table format (no need to re-format data); additionally the DV does not have an actual column (it is instead the numeric measurements at the meal variables)

Data Set 1) Create a vector of levels for the measurement

points (1 for each measurement point):

meals <- c(1, 2, 3) Where meals is the new vector name (a factor), and the numbers represent each measurement point. 2) Create a within groups measurement point factor

to house the levels you just created (this will be used later in our data frame (matrix style) and in our Anova analysis):

mealFactor<- as.factor(meals) Where mealFactor is the new factor with n levels to house our levels that describe our n numeric columns(measurement points). 3) Create a matrix style data frame from the factor and levels that will be used to describe our

numeric columns(measurement points): mealFrame <- data.frame(mealFactor) 4) Now create a bound vector containing the n numeric columns for later use in the linear

model: mealBind<-cbind(breakfast , lunch, dinner) 5) Create a linear model with the bound vector you just created. mealModel<-lm(mealBind~1) 6) Use the Anova function from the car package to analyze our data (notice we are using the

measurement point matrix style data frame and corresponding within groups factors as well as the linear model we just created):

analysis3 <- Anova(mealModel, idata = mealFrame, idesign = ~mealFactor) Note: we could have added the argument type="III", but the default of Anova() is to switch from type II to type III SS when there is only one intercept. 7) Now create a summary of the anova tables and information: summary(analysis3) Look below at the summary:
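The summary referred to above was an image in the original notes. A self-contained sketch of the same steps, with simulated breakfast/lunch/dinner columns (assumed purely for illustration):

set.seed(1)
breakfast <- rnorm(10, 200, 20); lunch <- rnorm(10, 210, 20); dinner <- rnorm(10, 230, 20)
library(car)
meals      <- c(1, 2, 3)
mealFactor <- as.factor(meals)
mealFrame  <- data.frame(mealFactor)
mealBind   <- cbind(breakfast, lunch, dinner)
mealModel  <- lm(mealBind ~ 1)
analysis3  <- Anova(mealModel, idata = mealFrame, idesign = ~mealFactor)
summary(analysis3)   # multivariate tests plus GG and HF corrected univariate tests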


Possible Errors for small DF

"Your tutorial is excellent. I was able to follow it easily and quickly analyze a data set I've been working with for a long time. I tried applying the same steps to another data set but when I tried to use the Anova(mod, idata, idesign) function I got the following error message: Error in linearHypothesis.mlm(mod, hyp.matrix, SSPE = SSPE, idata = idata, : The error SSP matrix is apparently of deficient rank = 3 < 4. Do you have any idea what this means or how to deal with it? Thanks a lot!"

John M. Quick said... "Thanks for the comments. I am familiar with this error. In short, it has to do with a combination of a lack of degrees of freedom to execute the multivariate tests (i.e., small sample size compared to variables) and the inability of the Anova() function to ignore/forgo calculating the multivariate tests. See this R listserv discussion for details: http://r.789695.n4.nabble.com/Anova-in-car-SSPE-apparently-deficient-rank-tp997619p997619.html An alternative, which will get you the Greenhouse-Geisser and Huynh-Feldt epsilon corrections, but no multivariate tests, is to use the anova() function: anova(ageModel, idata = ageFrame, X = ~ageFactor, test = "Spherical"). One caveat, I believe, is that this will use Type I SS, whereas my Anova() example uses Type III SS. I'm not sure how to get Type III SS with the anova() function."


A 3 time within and between factors model (IV-Subject[categorical and random]; IV-Gender[categorical and fixed]; Instructor Type[categorical and fixed]; DV-Cholesterol Rating[numeric]) Note: this model keeps data sets in the traditional data table format (no need to re-format data); additionally the DV does not have an actual column (it is instead the numeric measurements at the instructor-type variables)

Data Set 1) Create a vector of levels for the measurement points

(1 for each measurement point):

instructor <- c(1, 2, 3) Where instructor is the new vector name, and the numbers represent each measurement point. 2) Create a within groups measurement point factor to

house the levels you just created (this will be used later in our data frame (matrix style) and in our Anova analysis):

instructorF<- as.factor(instructor) Where instructorF is the new factor with n levels to house our levels that describe our n numeric columns(measurement points). 3) Create a matrix style data frame from the factor and levels that will be used to describe

our numeric columns(measurement points): instructorFR <- data.frame(instructorF) 4) Now create a bound vector containing the n numeric columns for later use in the linear

model: instructorBind<-cbind(male, female, computer) 5) Create a linear model with the bound vector you just created. LMmodel<-lm(instructorBind~gender) Notice we have included the fixed between groups gender variable in the linear model. 6) Use the Anova function from the car package to analyze our data (notice we are using the

measurement point matrix style data frame and corresponding within groups factors as well as the linear model we just created):

analysis7 <- Anova(LMmodel, idata = instructorFR, idesign = ~instructorF)

7) Now create a summary of the anova tables and information: summary(analysis7) Look below at the summary and how the information was placed in an anova table:


>instructor<- c(1, 2, 3) > instructorF<- as.factor(instructor) > instructorFR<- data.frame(instructorF) > instructorBind<-cbind(male ,female ,computer) > LMmodel<-lm( instructorBind~gender ) > analysis4 <- Anova(LMmodel, idata = instructorFR, idesign = ~instructorF) > summary(analysis4)

Analysis of Variance Table Source SS df MS F

Grand Mean 2022.78 1

Student Gender(A) 3.56 1 3.56 1.21

Instructor Type(B) 92.29 2 46.14 22.58**

AB Interaction 37.16 2 18.58 9.09**

Student within 47.07 16 2.94

Student x Instructor 65.39 32 2.04

Total 2668.25 54

.01 Critical Values: F(1,16) = 8.53, F(2,32) = 5.387 **p<.01

1. MS=SS/df 2. To get totals sum the columns


Graph comparing the regression lines of the three models

LINEAR MODELING

Linear Model m1 <- lm(gl~sc,data=df) summary(m1) Note: m1 is changeable, gl and sc are the numeric variables. Resistant Linear Model library(MASS) lqs(gl~sc,data=df) For models with outliers consider this model. Uses least median squares(LMS) & least trimmed squares (LTS). Robust Linear Model library(MASS) rlm(gl~sc,data=df) For models with outliers and heteroscedasticity problems consider this model.

CODE FOR THE GRAPH ABOVE mtcars2<-data.frame(rbind(mtcars[,c(1,6)],"mp800"=c(16.0,9),"DFxz00"=c(12.0,3.7)))

mod<-lm(mpg~wt,data=mtcars2);library(MASS)

mod2<-lqs(mpg~wt,data=mtcars2)

mod3<-rlm(mpg~wt,data=mtcars2)

plot(mod);par(ask=T);library(mvoutlier)

aq.plot(mtcars2[c("mpg","wt")]);par(ask=T)

uni.plot(mtcars2[c("mpg","wt")],symbol=T)

influence.measures(mod)

par(ask=T);par(mfrow=c(1,1))

with(mtcars2,plot(wt,mpg))

abline(reg=mod,lty=1,col="blue")

abline(reg=mod2,lty=2,col="red")

abline(reg=mod3,lty=3,col="green")

legend(x=6.72,y=33.67,legend=c("lm()","lqs()","rlm()"),

lty=c(1,2,3),col=c("blue","red","green"))

mtext("Notice how the outliers affect lm(), less with lqs(), and least with rlm()", font=4,side=3,col="dark green")

[Graphic: scatterplot of mpg vs wt for mtcars2 with the three fitted lines overlaid (lm() solid blue, lqs() dashed red, rlm() dotted green). Notice how the outliers affect lm(), less with lqs(), and least with rlm().]

> mtcars2

mpg wt

Mazda RX4 21.0 2.620

Mazda RX4 Wag 21.0 2.875

Datsun 710 22.8 2.320

Hornet 4 Drive 21.4 3.215

Hornet Sportabout 18.7 3.440

Valiant 18.1 3.460

Duster 360 14.3 3.570

Merc 240D 24.4 3.190

Merc 230 22.8 3.150

Merc 280 19.2 3.440

Merc 280C 17.8 3.440

Merc 450SE 16.4 4.070

Merc 450SL 17.3 3.730

Merc 450SLC 15.2 3.780

Cadillac Fleetwood 10.4 5.250

Lincoln Continental 10.4 5.424

Chrysler Imperial 14.7 5.345

Fiat 128 32.4 2.200

Honda Civic 30.4 1.615

Toyota Corolla 33.9 1.835

Toyota Corona 21.5 2.465

Dodge Challenger 15.5 3.520

AMC Javelin 15.2 3.435

Camaro Z28 13.3 3.840

Pontiac Firebird 19.2 3.845

Fiat X1-9 27.3 1.935

Porsche 914-2 26.0 2.140

Lotus Europa 30.4 1.513

Ford Pantera L 15.8 3.170

Ferrari Dino 19.7 2.770

Maserati Bora 15.0 3.570

Volvo 142E 21.4 2.780

mp800 16.0 9.000

DFxz00 12.0 3.700


Calling Components of Linear Models and Summaries Use model$ and one of the components below [use names(model) to view these]

1. # [1] "coefficients" "residuals" "effects" "rank" 2. # [5] "fitted.values" "assign" "qr" "df.residual" 3. # [9] "xlevels" "call" "terms" "model"

Use summary(model)$ and one of the components below [use names(summary(model)) to view]

1. names(summary(fit)) 2. # [1] "call" "terms" "residuals" "coefficients" 3. # [5] "aliased" "sigma" "df" "r.squared" 4. # [9] "adj.r.squared" "fstatistic" "cov.unscaled"

Accessor Functions residuals(model);resid(model) coefficients(model);coef(model) fitted(model) predict(model) deviance(model) df.residual(model) rstandard(model) rstudent(model) influence.measures(model) influence(model) dfbeta(model) dfbetas(model) covratio(model) cooks.distance(model) hatvalues(model) BIC(model) AIC(model) model.frame(model)


Dealing with multicollinearity method 1 lm.ridge() From the library(MASS); it attempts to minimize the residual SS while penalizing for coefficient sizes (a small lm.ridge() sketch follows the source calls below). Dealing with multicollinearity method 2 lars(x, y, type="lasso") Where x is a matrix of predictor values and y is the response variable. From the library(lars); it penalizes for coefficient sizes differently than lm.ridge, using the least angle regression algorithm. Dealing with multicollinearity method 3 pcr(formula) From the library(pls); it transforms the predictors (principal components) and then linear regression is performed. Dealing with multicollinearity method 4 plsr(formula) From the library(pls); it uses partial least squares and then linear regression is performed. Dealing with multicollinearity method 5 mean centering according to Aiken and West. Linear Model Hypothesis Test For Simple Linear Regression See: ex21c.docx Regression Analysis Use the source code below for calling Regression and Correlations Functions:

source("C:/Users/Rinker/Desktop/PhD Program/CEP 523-Stat Meth Ed Inference/R Stuff/Scripts/Regression/.Regression Bundle.txt")

rfun()

source("C:/Users/Rinker/Desktop/PhD Program/CEP 523-Stat Meth Ed Inference/R Stuff/Scripts/Regression/Correlation Matrix and Data

Visualization.txt")

cmat()
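A small sketch of the ridge approach mentioned above (method 1), using mtcars purely for illustration; the lambda grid is arbitrary:

library(MASS)
rr <- lm.ridge(mpg ~ wt + disp + hp, data = mtcars, lambda = seq(0, 10, 0.1))
select(rr)   # lambda suggested by the HKB, L-W and GCV criteria
plot(rr)     # coefficient paths across the penalty grid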


Calculate Multiple Regression from a correlation matrix library(psych) mat.regress(m, x, y,n.obs=NULL) Multiple Regression from matrix input

Arguments:
m       a matrix of correlations or, if not square, of data
x       either the column numbers of the x set (e.g., c(1,3,5)) or the column names of the x set (e.g., c("Cubes","PaperFormBoard"))
y       either the column numbers of the y set (e.g., c(2,4,6)) or the column names of the y set (e.g., c("Flags","Addition"))
n.obs   if specified, then confidence intervals, etc. are calculated; not needed if raw data are given
Allows you to calculate a multiple regression from a correlation matrix. Useful for interpreting results from someone else's work. Computes the confidence interval for a desired level for the squared multiple correlation: CI.Rsq(rsq, n, k, level = 0.95). For this function you have to enter the R2 yourself, unlike the function below.

require(foreign)

data<-read.spss("data3_Revised.sav", use.value.labels = TRUE, to.data.frame = TRUE)

dat <- subset(data, select=c(achiev,momedu,ses,parsupport, parmonitor,rules ))

dat3 <-subset(dat, select=c(achiev,parsupport, parmonitor,rules))

dat3<-na.omit(dat3)

mod7 <- with(dat3, lm(achiev ~parsupport + parmonitor + rules))

test.data<-cor(dat3)

x<-mat.regress(test.data,c(2,3,4),c(1),n.obs=478)#choose the variables by number

summary(x,digits=4) #note gives standardized beta weights

#compare to:

summary(mod7);mod7

stan.beta(mod7)


Computes the confidence interval for a desired level for the squared multiple correlation CI.Rsqlm(obj, level = 0.95) library(psychometric) Where obj is the linear model (i.e., obj <- lm(y ~ x1 + x2)) and level is the confidence level desired.

Arguments R Correlation Coefficient n Sample Size level Significance Level for constructing the CI, default is .95

Predicting from a model

predict(object, newdata)

#EXAMPLE

(mod <- lm(mpg~hp+disp+hp:disp, data=mtcars))

NEW<-data.frame(hp=c(260, 280), disp=c(330, 350))

predict(mod, NEW)


One Factor/One Continuous ANCOVA Example:

#============================================================================================================
#GETTING THE DATA
#============================================================================================================
regrowth <- read.table("ipomopsis.txt", header=TRUE, sep="\t", na.strings="999")
attach(regrowth)
names(regrowth)
head(regrowth)
#COVARIATE-->"Root" / OUTCOME-->"Fruit" / CATEGORICAL-->"Grazing"
#============================================================================================================
#LOOKING AT MEANS
#============================================================================================================
mean(subset(regrowth, Grazing=="Ungrazed")$Fruit)
mean(subset(regrowth, Grazing=="Grazed")$Fruit)
#............................
#Looking at the means alone would suggest that the grazed plants produce more fruit
#(an incorrect conclusion, as the plot will show)
#............................
#============================================================================================================
#PLOTTING THE DATA
#============================================================================================================
plot(Root, Fruit, pch=16+as.numeric(Grazing), col=c("blue","green")[as.numeric(Grazing)])
#............................
#A look at the lines reveals ungrazed actually produces more fruit, the opposite of what the means suggest
#16+as.numeric() is what turns the categorical data into plot points [16 changes the point type]
#............................
abline(lm(Fruit[Grazing=="Grazed"]~Root[Grazing=="Grazed"]), lty=15, col="blue")
abline(lm(Fruit[Grazing=="Ungrazed"]~Root[Grazing=="Ungrazed"]), lty=3, col="dark green")
legend(locator(1), c("Grazed","Ungrazed"), fill=c("blue","dark green"))
#............................
#draws the regression lines for each group of Grazing as described by the covariate Root
#............................
#============================================================================================================
#ANALYZING THE DATA (ANCOVA)
#============================================================================================================
ancova.fruit <- lm(Fruit~Grazing*Root)
#............................
#covariates go second because we are not interested in their effects, just the additional error they remove and the power they give
#order matters here: anova(lm(Fruit~Root*Grazing)) will give a different output
#............................
summary(ancova.fruit)
anova(ancova.fruit)


Two Factor/One Continuous ANCOVA Example:

#============================================================================================================
#GETTING THE DATA
#============================================================================================================
Gain <- read.table("Gain.txt", header=TRUE, sep="\t", na.strings="999")
attach(Gain)
names(Gain)
head(Gain)
#COVARIATE-->"Age" / OUTCOME-->"Weight" / CATEGORICAL-->"Sex" / CATEGORICAL-->"Genotype"
#============================================================================================================
#LOOKING AT MEANS
#============================================================================================================
#............................
#method 1
#............................
library(doBy)
summaryBy(Weight ~ Sex + Genotype, data = Gain,
    FUN = function(x) { c(n = length(x), mean = mean(x), sd = sd(x)) })
#............................
#method 2
#............................
source("C:/Users/Rinker/Desktop/PhD Program/CEP 523-Stat Meth Ed Inference/R Stuff/Scripts/Regression/.Regression Bundle.txt")
rfun()
desc2v(Gain, Sex, Genotype)
#============================================================================================================
#ANALYZING THE DATA (ANCOVA)
#============================================================================================================
ancova.gain <- lm(Weight~Sex*Age*Genotype)
summary(ancova.gain)
anova(ancova.gain)


Basic Functions

Generic Functions


Find the probability of obtaining the same result if the experiment were conducted again library(psych)

Usage p.rep(p = 0.05, n=NULL,twotailed = FALSE) p.rep.f(F,df2,twotailed=FALSE) p.rep.r(r,n,twotailed=TRUE) p.rep.t(t,df,df2=NULL,twotailed=TRUE)

Arguments

p           conventional probability of statistic (e.g., of F, t, or r)
F           the F statistic
df          degrees of freedom of the t-test, or of the first group if unequal sizes
df2         degrees of freedom of the denominator of F, or of the second group in an unequal-sizes t test
r           correlation coefficient
n           total sample size if using r
t           t-statistic if doing a t-test or testing significance of a regression slope

twotailed Should a one or two tailed test be used?
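A few hedged example calls (the statistic values are made up purely for illustration):

library(psych)
p.rep(p = 0.05)              # from a conventional p value
p.rep.r(r = 0.30, n = 100)   # from a correlation and its sample size
p.rep.t(t = 2.5, df = 98)    # from a t statistic and its degrees of freedom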


Critical Values (p=1-α or 1-α/2 [1 or 2 tail], df= degrees of freedom, q=critical value)

The birthday paradox. Find the number of observations needed for the event to occur with a given probability: qbirthday(prob = 0.5, classes = 365, coincident = 2). Find the probability of the event given the number of observations: pbirthday(n, classes = 365, coincident = 2)

Function          What it does
qnorm(p)          Returns the value q such that the area to the left of q for a standard normal random variable is p.
pnorm(q)          Returns p, the area to the left of q on a standard normal.
qt(p,df)          Returns the value q such that the area to the left of q on a t(df) distribution is p.
pt(q,df)          Returns p, the area to the left of q for a t(df) distribution.
qf(p,df1,df2)     Returns the value q such that the area to the left of q on an F(df1, df2) distribution is p. For example, qf(.95,3,20) returns the 95th percentile of the F(3, 20) distribution.
pf(q,df1,df2)     Returns p, the area to the left of q on an F(df1, df2) distribution.
qchisq(p,df)      Returns the value q such that the area to the left of q on a χ2(df) distribution is p.
pchisq(q,df)      Returns p, the area to the left of q on a χ2(df) distribution.

EXAMPLES

qbirthday(prob = .95, classes = 365, coincident = 2)

pbirthday(23, classes = 365, coincident = 2)
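A few quick checks of the critical-value functions above (the particular values are illustrative):

qnorm(0.975)          # ~1.96, two-tailed .05 critical value for z
qt(0.95, df = 20)     # one-tailed .05 critical value for t(20)
qf(0.95, 3, 20)       # .05 critical value for F(3, 20)
pchisq(3.84, df = 1)  # ~0.95, so 3.84 is roughly the .05 cutoff for chi-square(1)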


Function Writing Information
LOOPS: the for() function; using a for loop to repeat a function over and over again (example below).
Infinite Repeat Loop i <- 1

repeat{

i <- i/2

print(i)

flush.console()

}

Repeat Loop i <- 1

repeat{

i <- i/2

print(i)

flush.console()

if (i < .0005) break

}

While Loop i <- c(1)

while(i < 20){

i <- c(i, i*1.5)

print(i)

flush.console()

}

Nested For Loop for (i in 1:2){

for(j in 20:21){

for (k in c("horse", "cow")){

print(i)

print(j)

print(i*j)

print(k)

}

}

}

X<-function(col=10, rows=40){

vec <- 1:col

holder <-c()

for (i in 1:rows){

perm <- sample(vec, replace=F)

holder <- rbind(holder, perm)

}

holder

rownames(holder)<-paste("obs. ", 1:rows,

sep="")

colnames(holder)<-paste("VAR-",LETTERS[1:col],

sep="")

holder

}

X(15, 20)

#===========================================

#loop to repeat a function

#===========================================

DFer = list()

n = 10

j=6

for (i in 1:n){

DFer[[i]]= data.frame(A=1:j, B=rnorm(j),

C=letters[1:j])

}

DFer

#===========================================

#or (the 2nd allocates the vector ahead of time)

#===========================================

getDFs <- function(n, j) {

df <- vector("list", n)  # if you know the size, allocate the object beforehand

for (i in seq(n))

df[[i]] <- data.frame(A = seq(j), B =

rnorm(j), C = letters[seq(j)])

return(df)

} # end function

(x<-getDFs(10, 4))

#===========================================

#put it together (the list into a data frame

#===========================================

do.call("rbind", x)

library(plyr)

ldply(x, rbind)

For Loop with next i <- 0

for (i in 1:100){

if (i%%2==0) next

i <- i +1

print(i)

flush.console()

}

For Loop with break and next i <- 0

for (i in 1:100){

if (i%%2==0) next

if (i > 90) break

i <- i +1

print(i)

flush.console()

}


IFELSE ifelse(test,then this occurs, if not this happens) Example x<-sample(-2:5,20,replace=T);x

outcome<-ifelse(x >= 0, sqrt(x), NA)

data.frame(x,outcome)

Switch Function Example 1

Central <- function(y, measure = "Mean"){

switch(measure,

Mean = mean(y),

Geometric = exp(mean(log(y))),

Harmonic = 1/mean(1/y),

Median = median(y),

stop("chose a mean")

)

}

Central(mtcars$mpg,"Median")

Central(mtcars$mpg,"Geometric")

Central(mtcars$mpg,"Harmonic")

Example2

FUN <- function(x){

switch(x,

`1` = "A",

`2` = "B",

`3` = "C",

stop("chose a # between 1-3")

)

}

FUN(1)

FUN(2)

FUN(4)


Repeat Loops (complex example) EXAMPLE:

#FIRST I'LL RECREATE A DATA SET. IT'LL CONTAIN REDUNDANCY

DATA <- structure(list(person = structure(c(4L, 1L, 5L, 4L, 1L, 3L, 1L,

4L, 3L, 2L, 1L), .Label = c("greg", "researcher", "sally", "sam",

"teacher"), class = "factor"), sex = structure(c(2L, 2L, 2L,

2L, 2L, 1L, 2L, 2L, 1L, 1L, 2L), .Label = c("f", "m"), class = "factor"),

adult = c(0L, 0L, 1L, 0L, 0L, 0L, 0L, 0L, 0L, 1L, 0L), state = structure(c(2L,

7L, 9L, 11L, 5L, 4L, 8L, 3L, 10L, 1L, 6L), .Label = c("Shall we move on? Good then.",

"Computer is fun. Not too fun.", "I distrust you.", "How can we be certain?",

"I am telling the truth!", "Im hungry. Lets eat. You already?",

"No its not, its ****.", "There is no way.", "What to do?",

"What are you talking about?", "You liar, it stinks!"), class = "factor")), .Names =

c("person",

"sex", "adult", "state"), class = "data.frame", row.names = c(NA,

-11L))

DATA <- data.frame(rbind(DATA, DATA, DATA))

DATA <- data.frame(rbind(DATA, DATA, DATA))

DATA <- data.frame(tot = 1:nrow(DATA), DATA)

DATA <- DATA[with(DATA, order(person, tot)), ]

rownames(DATA)<-1:nrow(DATA)

#==========================================

#A SIMPLE WORD COUNT FUNCTION

word.count <- function(text, by = "row") {

unblanker <-function(x)subset(x, nchar(x)>0)

word.split <- function(x) sapply(x, function(x)as.vector(unlist(strsplit(x, " "))))

reducer <- function(x) gsub("\\s+", " ", x)

txt <- sapply(as.character(text), function(x) ifelse(is.na(x), "", x))

OP <- switch(by,

all = length(unblanker(unlist(word.split(reducer(unlist(as.character(txt))))))),

row = sapply(txt, function(x)

length(unblanker(unlist(word.split(reducer(unlist(as.character(x)))))))))

ifelse(OP==0, NA, OP)

}

#==========================================

DATA$wc <- word.count(DATA$state)

#==========================================

#METHOD 1 DASON

g <- function(x, k = 30){

# Need to know how long the final vector should be

n <- length(x)

# Take care of case where we don't get any groups.

if(sum(x) < k){

ans <- rep(NA, n)

return(factor(ans))

}

# Store where we want to break into a new group

breaks <- c()

repeat{

# Find the first spot where the vector sums to at least k

# If that doesn't happen we get Inf and a warning

# suppress the warning

spot <- suppressWarnings(min(which(cumsum(x) >= k)))

# If we got inf then the sum couldn't reach k

if(!is.finite(spot)){

break # Jump out of the repeat loop

}

# Remove the spots that we accounted for

x <- x[-c(1:spot)]

# Note where the break is


breaks <- c(breaks, spot)

}

ans <- rep(NA, n)

groups <- paste("sel_", rep(1:(length(breaks)), breaks), sep = "")

ans[1:length(groups)] <- groups

return(factor(ans))

}

# Try it out

x <- subset(DATA, person=='sam')

g(x$wc)

y <- subset(DATA, person=='teacher')

g(y$wc)

z <- subset(DATA, person=='greg')

g(z$wc)

#METHOD 2 BRYANGOODRICH

f <- function(x) {

# Initialize variables

r <- length(x) # Total size

n <- 1 # Starting index

i <- 1 # Group index

sum <- 0 # Container for sum check

groups <- vector("character", r)

# Loop through r-length vector x

for (m in seq(r)) {

sum <- sum + x[m] # Add to running sum

isEnough <- sum >= 30

# Block allocation and index adjustments

if (isEnough) {

groups[n:m] <- paste("sel_", i, sep ="")

i <- i + 1 # Increment group index

n <- m + 1 # Increment start index

sum <- 0 # Start sum check over

} # end if isEnough

} # end for m

groups[groups == ""] <- NA

return (factor(groups))

} # end function

# Try it out

x <- subset(DATA, person=='sam')

f(x$wc)

y <- subset(DATA, person=='teacher')

f(y$wc)

z <- subset(DATA, person=='greg')

f(z$wc)


Menu (Interactive Mode) menu(choices, graphics = FALSE, title = NULL) Arguments choices- a character vector of choices graphics- a logical indicating whether a graphics menu should be used if available. title- a character string to be used as the title of the menu. NULL is also accepted.

Menu (gwidgets style) menu(choices, title, graphics = TRUE) select.list(choices, title)

EXAMPLES

switch(menu(c("List letters", "List LETTERS", "What does this do?")) + 1,

cat("Nothing done\n"), letters, LETTERS,c("Oh now I get it"))

#=====================================================================

.hurtz.donut<-function(){

cat("You want a hurtz donut?\n\n")

switch(menu(c("Yes", "No")) ,

cat("<PUNCH>\nHurts, don't it?\n"), cat("What a wimp!!\n"))

}

.hurtz.donut()

menu(sort(.packages(all.available = TRUE)), title = "packages", graphics = TRUE)

#returns --> [1] 17

select.list(sort(.packages(all.available = TRUE)), title = "packages")

#returns --> [1] "car"


Progress Bars (base) see also: tcltk, plyr, RGtk2. txtProgressBar(min = 0, max = 1, initial = 0, char = "=", width = NA, title, label, style = 1, file = "") winProgressBar(title = "R progress bar", label = "", min = 0, max = 1, initial = 0, width = 300) close(pb) #needed after the call to the progress bar

#EXAMPLE PASSING SEQUENCE ALONG THE VECTOR

total = nrow(mtcars)

progress.bar = TRUE

type = FALSE

#progress.bar = FALSE #parameter to play with

#type = 'text' #parameter to play with

if(progress.bar) {

if (Sys.info()[['sysname']]=="Windows" & type != "text"){

# create progress bar

pb <- winProgressBar(title = "progress bar", min = 0,

max = total, width = 300)

lapply(1:total, function(i) {

Sys.sleep(.5)

setWinProgressBar(pb, i,

title=paste(round(i/total*100, 0), "% done"))

}

)

close(pb)

} else {

# create progress bar

pb <- txtProgressBar(min = 0, max = total, style = 3)

lapply(1:total, function(i) {

Sys.sleep(.5)

setTxtProgressBar(pb, i)

}

)

close(pb)

}

} else {

Sys.sleep(total/4)

return("should have used a progress bar")

}


#EXAMPLE PASSING THE VECTOR

w <- c("raptors are awesome don't you all agree")

y <- unlist(strsplit(w, " "))

total <- length(y)

#WINDOWS TEXT BAR

pb <- winProgressBar(title = "progress bar", min = 0,

max = total, width = 300)

lapply(y, function(x){

z <- nchar(x); Sys.sleep(.5)

i <- which(y %in% x)

setWinProgressBar(pb, i, title=

paste(round(i/total *100,

0), "% done"))

return(z)

}

)

close(pb)

#STANDARD TEXT BAR

pb <- txtProgressBar(min = 0, max = total, style = 3)

lapply(y, function(x){

z <- nchar(x); Sys.sleep(.5)

i <- which(y %in% x)

setTxtProgressBar(pb, i)

return(z)

}

)

close(pb)

#EXAMPLE PASSING THE VECTOR With Global Assignment

w <- c("raptors are awesome don't you all agree")

y <- unlist(strsplit(w, " "))

total <- length(y)

#WINDOWS VERSION

pb <- winProgressBar(title = "progress bar",

min = 0, max = total, width = 300)

i <- 0

lapply(y, function(x){

z <- nchar(x); Sys.sleep(.5)

i <<- i + 1

setWinProgressBar(pb, i, title=

paste(round(i/ total *100, 0), "% done"))

return(z)

}

)

close(pb)

#STANDARD TEXT VERSION

pb <- txtProgressBar(min = 0, max = total, style = 3)

i <- 0

lapply(y, function(x){

z <- nchar(x); Sys.sleep(.5)

i <<- i + 1

setTxtProgressBar(pb, i)

return(z)

}

)

close(pb)


Pass a data frame to a function (One method) f <- function(x,data=NULL, fun) {

fun(eval(match.call()$x,data))

}

f(hp,mtcars,mean)

Passing a Variable (Vector name) on to an argument as a character. Best seen with an example. See below: four is passed on without using quotes.

Passing a character string to a function with eval(parse()). Best seen with an example. See below.

EXAMPLE WITH DATA

mtcars2<-mtcars

library(doBy)

mtcars2$cyl<-with(mtcars2,recodeVar(mtcars2$cyl,src=c(4,6,8),

tgt=c("four","six","eight"), default=NULL, keep.na=TRUE))

with(mtcars2,cyl)

Tfun <-function (DV,IV,group1){

g <- substitute(group1)

g1<-DV[IV ==as.character(g)]

p <- mean(g1)

list(g1,p)

}

with(mtcars2,Tfun(mpg,cyl,four))

SIMPLE EXAMPLE

extract.arg <-function (a){

s <- substitute(a)

as.character(s)

}

extract.arg(hello)

#EXAMPLE

x <- c(1:20)

myoptions <- "trim=0, na.rm=FALSE"

eval(parse(text = paste("mean(x,", myoptions, ")")))

library(fortunes);fortune(106)


Functions That Take Input scan(n=,what = double(0),quiet=T)

EXAMPLE ASKING FOR 1 INPUT

x<-function(){

#choose angle in degrees

cat("\n","Enter Value","\n")

x<-scan(n=1,what = double(0),quiet=T)

x

}

EXAMPLE ASKING FOR 4 INPUT

x<-function(){

#choose angle in degrees

cat("\n","Enter Value","\n")

x<-scan(n=4,what = double(0),quiet=T)

x

}


Warning Messages warning(..., call. = TRUE, immediate. = FALSE, domain = NULL) suppressWarnings()

Alarm alarm() #makes a call to "\a" OR cat("\a") Tell a function what to do with a missing argument missing(x) #Best understood with an example Used as a “minifunction” within the function to tell what to do if y is not given. Technically this could be done with myplot <- function(x,y=x) as well. Reset Parameters useful for resetting graphical parameters or performing cleanup actions. on.exit()

EXAMPLE

test <- function() warning("You idiot you forgot quotes!")

test() ## shows call

test2 <- function() warning("You idiot you forgot quotes!", call. = FALSE)

test2() ## no call

EXAMPLE

myplot <- function(x,y) {

if(missing(y)) {

y <- x

x <- 1:length(y)

}

plot(x,y)

}

textClick <- function(express, col="black", cex=NULL, srt = 0, family="sans", ...){

old.par <- par(no.readonly = TRUE)

on.exit(par(old.par))

par(mar = rep(0, 4),xpd=NA)

x<-locator(1)

X<-format(x, digits=3)

text(x[1], x[2], express, col=col, cex=cex, srt=srt, family=family, ...)

noquote(paste(X[1], X[2],sep=", "))

}


Sequence for the n of a vector. Traditionally people use: 1:length(x). However this fails for zero-length vectors (1:0 gives c(1, 0)). Use instead: seq_along(x)
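A quick illustration of the zero-length case:

x <- numeric(0)
1:length(x)   # 1 0  -- a for loop over this runs twice by mistake
seq_along(x)  # integer(0) -- a for loop over this runs zero times, as intended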


Viewing the code of generic functions (look at a function's code)

You can often view a function's code by simply typing the name of the function: for the aov() function type aov (and enter). For generic functions, type methods(function). This gives a list of the methods (functions with suffixes); now type the function with the suffix name for its code. methods(anova)

anova.glm

How R evaluates true and false In R, TRUE is considered to be the number 1 and FALSE is considered the number 0. This can be very useful in practice. Example: T+T+F=2 T*T*F=0
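A small example of why the 1/0 coding is handy (the vector values are made up):

x <- c(2, 5, 8, 3, 9)
sum(x > 4)    # 3   -- how many elements exceed 4
mean(x > 4)   # 0.6 -- the proportion of elements that exceed 4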


Determine how much memory an object takes up object.size(x,units=) Units can be changed units = c("b", "auto", "Kb", "Mb", "Gb") EXAMPLEs print(object.size(library(base)),units="auto") #specific library

print(object.size(library),units ="auto") #entire library

print(object.size(Tpass),units ="auto") #a function

print(object.size(ls()),units ="auto") #current objects in workspace

Determine memory allocation and Increase Allocation memory.limit() #Report memory limit memory.limit(size=3500) #increase memory limit Reduce Objects and Junk in Memory gc() rm(list=ls()) rm(list = ls(all.names = TRUE))


Timing

Determine How Long It Takes to Run a Function (method 1) library(microbenchmark) [very accurate] microbenchmark(..., list, times=100, control=list())
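A minimal sketch (assumes the microbenchmark package is installed; the expressions compared are arbitrary):

library(microbenchmark)
x <- rnorm(1000)
microbenchmark(mean(x), sum(x)/length(x), times = 100)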


Determine How Long It Takes to Run a Function (method 2) library(rbenchmark)

benchmark( ..., columns = c( "test", "replications", "elapsed", "relative", "user.self", "sys.self", "user.child", "sys.child"), order = "test", replications = 100, environment = parent.frame()) Arguments ... captures any number of unevaluated expressions passed to benchmark as named or unnamed arguments (the

functions to be tested).

columns a character or integer vector specifying which columns should be included in the returned data frame (see below).

order a character or integer vector specifying which columns should be used to sort the output data frame. Any of the columns that can be specified for columns (see above) can be used, even if it is not included in columns and will not appear in the output data frame. If order=NULL, the benchmarks will appear sorted by the order of the expressions in the call to benchmark.

replications a numeric vector specifying how many times an expression should be evaluated when the runtime is measured. If replications consists of more than one value, each expression will be benchmarked multiple times, once for each value in replications.

environment the environment in which the expressions will be evaluated.

Determine How Long It Takes to Run a Function (method 3) system.time()

#example and output

benchmark(

Plyr = ddply(mtcars, .(cyl, gear), summarise, output = mean(hp)),

Tapply = with(mtcars, data.frame(output = tapply(hp, interaction(cyl, gear), mean))),

Aggregate = aggregate(hp ~ cyl + gear, mtcars, mean),

order=c('replications', 'elapsed'))

# test replications elapsed relative user.self sys.self user.child sys.child

# 2 Tapply 100 0.19 1.000000 0.18 0 NA NA

# 3 Aggregate 100 0.51 2.684211 0.39 0 NA NA

# 1 Plyr 100 1.36 7.157895 1.10 0 NA NA

Explanation: Typically just look at the elapsed time and the relative times. The relative is really what is interesting to me - it tells you how long each expression takes in comparison to the fastest expression. So in your example the Tapply is the quickest and Aggregate takes 2.68 times longer and the Plyr solution takes 7.15 times longer than the Tapply.


Timer 1 library(data.table) begin.time <- Sys.time() timetaken(begin.time) Timer 2 library(matlab) tic(gcFirst=FALSE) toc(echo=TRUE) Timer 3 base x <- Sys.time() difftime(Sys.time(), x) Time Stamping timestamp()


Generate Reproducible Code

Write code or a data frame to a file to send to a help archive (reproducible code) [see also sink()]: dput(object, "file to write to"). EXAMPLE: dput(mtcars, "foo.txt")
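The file written by dput() can be read back with dget(), which is what makes this useful for reproducible examples; a quick round trip:

dput(mtcars, "foo.txt")
mt2 <- dget("foo.txt")
all.equal(mt2, mtcars)  # should be TRUE (up to printing precision)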

SEE ALSO: Exporting an output to a file section using cat() Generate window of a code, data frame to send to a help archive (reproducible code) page(object)

EXAMPLE page(mtcars)


Customized Workflow

Create Multiple Working Directories Create a shortcut where you want a new directory. Locate (& copy [ctrl + c]) the location of where you stored the shortcut (i.e., C:\Users\Rinker\Desktop\PhD Program\CEP 523-Stat Meth Ed Inference\R Stuff). Click on Properties and paste (ctrl + v) the location into the Start In box. Now data files from this location load automatically without referencing their specific location. Stop the Stupid Start Up Message and Auto Save At the end of the target box (see "Create Multiple Working Directories") location add a space and then -q --no-save "C:\Program Files\R\R-2.13.0\bin\i386\Rgui.exe" -q --no-save


.First Function (start up commands) Open a new script from within [R]. Create a .First function with the following type of set up: .First<-function(Sys.time){

library(psych)

library(car)

options(repos="http://lib.stat.cmu.edu/R/CRAN")

source("C:/Users/Rinker/Desktop/PhD Program/CEP 523-Stat Meth Ed Inference/R

Stuff/Scripts/Missing Values/.NA Bundle.txt")

source("C:/Users/Rinker/Desktop/PhD Program/CEP 523-Stat Meth Ed Inference/R

Stuff/Scripts/Assumption Testing/Tests of Normality.txt")

options(repos="http://lib.stat.cmu.edu/R/CRAN") #See Choosing a CRAN Mirror

cat("Hello Tyler! Today is",date(),"\n")

}

Now save it to the working directory (see Creating A Working Directory) as .Rprofile NOTE: As of right now I have not been able to edit this file. I have to delete it and resave it using the method above to edit it. Choosing a CRAN Mirror Type the following to choose a CRAN mirror and to find its URL: chooseCRANmirror()

options("repos")[[1]][1]

#gives a URL for the Mirror you just chose

Now paste the following command to the .First function in your .Rprofile options(repos="URL") See options .Options


Paths, Directories and System Info

Check the System Info and Operating System Sys.info() Sys.info()[['sysname']] Check System Path Info Sys.getenv("USERNAME") Sys.getenv("HOME") Sys.getenv() Determine Working Directory / Set Working Directory getwd() setwd(dir) Directory Functions dir() #list files in working directory list.files() #list files in working directory file.info(list.files()) #get info on files in directory path.expand("~/Desktop/PhD Program/") #replaces the tilde with user home directory

>Sys.getenv("USERNAME")

[1] "Rinker"

> Sys.getenv("HOME")

[1] "C:\\Users\\Rinker"


File Editing and Web Browsing

Opening Web Pages method 1 See the web() function in scripts browseURL(url, browser = getOption("browser"), encodeIfNeeded = FALSE) Opening Web Pages method 2 Not recommended as the function may not work on non-Windows machines. shell.exec(url) Arguments: file = file or url to open Opening Files Within R See the ret() function in scripts shell.exec(file) Arguments: file = file or url to open Open text files for editing in R file.edit("file")

Example

file.edit(".Rprofile")


Determine if a file exists file.exists(path) If you don't provide a full path R only checks the working directory (getwd()) Rename a file file.rename(from, to) Delete a directory unlink(x, recursive = TRUE, force = FALSE) Format R code library(formatR) tidy.source(source, file.output="windows console")

EXAMPLE CODE

#save it somewhere or copy and then read it in

# check tidy.source's clipborad option

library(formatR)

xx<-pathPrep()

C:\Users\Rinker\Desktop\transcript Functions.R

shell.exec(xx)

tidy.source(source = xx, file="transcript Functions.txt")

tidy.source(source = "clipboard") #using a clipboard

tidy.source(source = "clipboard", file="transcript

Functions.txt")) #using a clipboard

sink(file="New.doc")

4*3; sink()

file.exists("New.doc")

Open() #look at the New.doc

file.rename("New.doc", "Renamed.doc")

file.exists("New.doc")

Open() #look at the Renamed.doc

delete(Renamed.doc) #user defined

Open() #look at the no Renamed.doc


Debugging

Find out the values of a function up to a given point browser() Exit Browser: type Q and press Enter

Debug Use debug(Function_Name) and then call the function to step through it line by line debug(mean)

mean(1:10)

undebug(mean)

try and tryCatch L <- list(a=c(1, 3, 5), b=c("a", "v"), d=mtcars[,1])

lapply(L, function(x){

try(sum(x))

})

L <- list(a=c(1, 3, 5), b=c("a", "v"), d=mtcars[,1])

sapply(L, function(x){

tryCatch(sum(x), error=function(err) NA)

})

testfun <- function(x = 5){

y = 5

browser()

print(x + y)

}

testfun()

y; x; p

Browse[1]> y; x; p #check the values of each of these objects within the function

[1] 5

[1] 5

Error: object 'p' not found


Expand a column that's a list column
#====================================

# THE DATA FRAME

#====================================

input <- data.frame(site = 1:6,

sector = factor(c("north", "south", "east",

"west", "east", "south")),

observations =

I(list(c(1, 2, 3), c(4, 3), c(), c(14, 12, 53, 2, 4), c(3),c(23))))

#====================================

# EXPAND THE COLUMN AND MERGE

#====================================

obs.l <- sapply(input$observations, length)

desire.output <- data.frame(site=rep(1:6,obs.l), obs=unlist(input$observations))

merge(input[, -3], desire.output, all.x=TRUE)

#NOTE- THE SITE IS THE KEY FOR THEN MERGING WITH THE DATA FRAME


Expand a Text Column (Split by sentence)

sentSplit <- function(dataframe, text.var, splitpoint = NULL, rownames = numeric,

text.place = original) {

DF <- dataframe

input <- as.character(substitute(text.var))

re <- ifelse(is.null(splitpoint), "[\\?\\.\\!]", as.character(substitute(splitpoint)))

RN <- as.character(substitute(rownames))

TP <- as.character(substitute(text.place))

breakinput <- function(input, re) {

j <- gregexpr(re, input)

lengths <- unlist(lapply(j, length))

spots <- lapply(j, as.numeric)

first <- unlist(lapply(spots, function(x) {

c(1, (x + 1)[-length(x)])

}))

last <- unlist(spots)

ans <- substring(rep(input, lengths), first, last)

return(list(text = ans, lengths = lengths))

}

j <- breakinput(DF[, input], re)

others <- DF[, -which(colnames(DF) == input)]

idx <- rep(1:dim(others)[1], j$lengths)

ans <- cbind(input = j$text, others[idx, ])

colnames(ans)[1] <- input

if (RN == "numeric") {

rownames(ans) <- 1:nrow(ans)

}

if (TP == "original") {

ans <- ans[, c(colnames(DF))]

} else {

if (TP == "right") {

ans <- data.frame(ans[, -1], ans[, 1])

colnames(ans)<-c(colnames(ans)[-ncol(ans)],input)

} else {

if (TP == "left") {

ans

}

}

}

return(ans)

}

#=====================

#TEST IT

#=====================

DATA<-structure(list(person = structure(c(4L, 1L, 5L, 4L, 1L, 3L, 1L,

4L, 3L, 2L, 1L), .Label = c("greg", "researcher", "sally", "sam",

"teacher"), class = "factor"), sex = structure(c(2L, 2L, 2L,

2L, 2L, 1L, 2L, 2L, 1L, 1L, 2L), .Label = c("f", "m"), class = "factor"),

adult = c(0L, 0L, 1L, 0L, 0L, 0L, 0L, 0L, 0L, 1L, 0L), state = structure(c(2L,

7L, 9L, 11L, 5L, 4L, 8L, 3L, 10L, 1L, 6L), .Label = c("Shall we move on? Good then.",

"Computer is fun. Not too fun.", "I distrust you.",

"How can we be certain?", "I am telling the truth!", "Im hungry. Lets eat. You already?",

"No its not, its dumb.", "There is no way.", "What should we do?",

"What are you talking about?", "You liar, it stinks!"

), class = "factor"), code = structure(c(1L, 4L, 5L, 6L,

7L, 8L, 9L, 10L, 11L, 2L, 3L), .Label = c("K1", "K10", "K11",

"K2", "K3", "K4", "K5", "K6", "K7", "K8", "K9"), class = "factor")), .Names = c("person",

"sex", "adult", "state", "code"), row.names = c(NA, -11L), class = "data.frame")

sentSplit(DATA, state, rownames=sub)

sentSplit(DATA, state)

sentSplit(DATA, state, text.place=right)

sentSplit(DATA, state, text.place=left)


Collapse A Text Column by A grouping Variable

#THE DATA

dat <- structure(list(sex = structure(c(2L, 2L, 2L, 2L, 2L, 1L, 2L,

2L, 1L, 1L, 2L), .Label = c("f", "m"), class = "factor"), state = structure(c(2L,

7L, 9L, 11L, 5L, 4L, 8L, 3L, 10L, 1L, 6L), .Label = c("Shall we move on? Good then.",

"Computer is fun. Not too fun.", "I distrust you.", "How can we be certain?",

"I am telling the truth!", "Im hungry. Lets eat. You already?",

"No its not, its dumb.", "There is no way.", "What should we do?",

"What are you talking about?", "You liar, it stinks!"), class = "factor")), .Names =

c("group",

"text"), class = "data.frame", row.names = c(NA, -11L))

#METHOD 1 (A better choice)

# Needed for later

k <- rle(as.numeric(dat$group))

# Create a grouping vector

id <- rep(seq_along(k$len), k$len)

# Combine the text in the desired manner

out <- tapply(dat$text, id, paste, collapse = " ")

# Bring it together into a data frame

data.frame(text = out, group = levels(dat$group)[k$val])

#METHOD 2

y <- rle(as.character(dat$group))

x <- y[[1]]

dat$new <- as.factor(rep(1:length(x), x))

text <- aggregate(text~new, dat, paste, collapse = " ")[, 2]

data.frame(text, group = y[[2]])

#Method 3 (combined) k <- rle(as.numeric(dat$group)); dat$id <- rep(seq_along(k$len), k$len)

data.frame(sex=rle(as.character(dat$group))$val,

aggregate(text~id, dat, paste, collapse=" "))

group text

1 m Computer is fun. Not too fun. No its not, its dumb. What should we do? You liar, it stinks! I am telling the truth!

2 f How can we be certain?

3 m There is no way. I distrust you.

4 f What are you talking about? Shall we move on? Good then.

5 m Im hungry. Lets eat. You already?

group text

1 m Computer is fun. Not too fun.

2 m No its not, its dumb.

3 m What should we do?

4 m You liar, it stinks!

5 m I am telling the truth!

6 f How can we be certain?

7 m There is no way.

8 m I distrust you.

9 f What are you talking about?

10 f Shall we move on? Good then.

11 m Im hungry. Lets eat. You already?


Enter an unknown number of vectors as unnamed arguments; return their names and the math output

Extract components from dots (…) f1 <- function(x, ...) substitute(...()) #Dunlap's method

f2 <- function(x, ...) match.call(expand.dots=FALSE)$... #traditional match.call

f1(1, warning("Hmm"), stop("Oops"), cat("some output\n"))

f2(1, warning("Hmm"), stop("Oops"), cat("some output\n"))

Creating a Quasi-Package

#EXAMPLE

#Create objects such as data sets and functions to include in the file

.hurtz.donut<-function(){"You want a hurtz donut? Yes! <Punch> Hurts don't it?"}

.hurtz.donut()

(.FUNdat<-data.frame(cbind(LETTERS,1:26)))

#Save the objects to the .RData file

save(.hurtz.donut,.FUNdat,file="myFUNCTIONS.RData")

#just to show everything has been wiped clean (this will delete all objects from your workspace)

rm(list = ls(all.names = TRUE))

#Close out and reload [R]

load("myFUNCTIONS.RData")

.hurtz.donut()

.FUNdat

#FAKE DATA

a<-sample(50:90+0, 20, replace=TRUE)

b<-sample(50:90+20, 20, replace=TRUE)

d<-sample(50:90-20, 20, replace=TRUE)

#FUNCTION THAT WORKS IF SUPPLYING A LIST AS THE UNNAMED ARGUMENT

foo <- function(...){

# Get the names of the objects that were passed into the function

x <- as.character(match.call())[-1]

# Apply mean to every object passed in

y <- sapply(list(...), mean)

return(list(x, y))

}

#TEST IT OUT

foo(a,b,d)

#THE OUTPUT

[[1]]

[1] "a" "b" "d"

[[2]]

[1] 72.90 92.80 51.75


Function returns return() print() invisible() return() specifically tells the function what to return; if it is not given, the value of the last line of code is returned. invisible() lets a function-created object be recalled later, but the object is not automatically printed.

Function returns extended (return some, recall the rest later) Look at both examples

EXAMPLE Invisible

test <- function(){

with(mtcars, plot(mpg~hp))

invisible(list("type1"="Shh! I'm invisible.","type2"="Real quiet now."))

}

x <- test()

x$type1

x$type2

#==========================================

#The original function that returns a list

#==========================================

test <- function(number=10){

XX <- number

YY <- "hello"

ZZ <- Sys.time()

o <- list(x = XX, y = YY, z = ZZ)

class(o) <- "stuff"

return(o)

}

#=================================================

#This makes the above return one piece of the list

#=================================================

print.stuff <- function(stuff){

print(stuff$z)

}

#=================================================

#See the end results

#=================================================

(PP <- test()) #returns what was specified by print.stuff

PP$y #recall the other components of the list

PP$x

test <- function(number=10){

XX <- number

YY <- "hello"

ZZ <- Sys.time()

o <- list(x = XX, y = YY, z = ZZ, zz = "Recall Me")

class(o) <- "stuff"

return(o)

}

#=================================================

#This makes the above return one piece of the list

#=================================================

print.stuff <- function(stuff){

list(print(stuff$z),

print(stuff$x))

}

#=================================================

#See the end results

#=================================================

(PP <- test()) #returns what was specified by print.stuff

PP$y #recall the other components of the list

PP$zz

PP$z

PP$x

EXAMPLE2 Invisible

a <- data.frame(x=1:10,y=1:10)

test <- function(z){

mean.x<-mean(z$x)

nm <-as.character(substitute(z))

print(mtcars)

invisible(list(mean.x, nm))}

x <- test(a)

x


Rolling Math Functions (these compute over an expanding window from the start of the vector, i.e., running statistics): rolling mean, rolling median. Pattern: sapply(seq(x), function(i) MATH.FUNCTION(x[seq(i)])) x <- mtcars$disp

sapply(seq(x), function(i) median(x[seq(i)]))

sapply(seq(x), function(i) mean(x[seq(i)]))

sapply(seq(x), function(i) range(x[seq(i)]))

sapply(seq(x), function(i) sd(x[seq(i)]))
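For a fixed-width rolling window (as opposed to the expanding window used above), the zoo package's rollmean()/rollapply() are a common choice; a sketch assuming zoo is installed and an arbitrary window width of 5:

library(zoo)
x <- mtcars$disp
rollmean(x, k = 5)       # rolling mean over a window of 5
rollapply(x, 5, median)  # rolling median over a window of 5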


Apply Family, PLYR & RESHAPE

PLYR

#Group by subgroups, find max of another variable by these subgroups, return those rows

#################################################

## A FAKE DATA SET LIKE THE ONE YOU DESCRIBE ##

#################################################

DF <- structure(list(ID = structure(c(1L, 1L, 1L, 1L, 1L, 2L, 2L, 2L,

2L, 2L), .Label = c("1", "2"), class = "factor"), var2 = structure(c(1L,

4L, 5L, 9L, 3L, 8L, 2L, 6L, 10L, 7L), .Label = c("B", "C", "F",

"G", "H", "I", "M", "P", "W", "Z"), class = "factor"), var2.1 = c(-0.184379525153166,

-1.42413441621445, -0.245741502747687, 0.762805889444348, -0.85561728601498,

-0.358079724034542, -0.137767483903655, -0.952739149867607, 1.01227935773242,

0.0722005649132995), DF.DATE = structure(c(14662, 15556, NA,

14903, 15641, NA, 14970, 15625, 15075, 14819), class = "Date")), .Names = c("ID",

"var2", "var3", "DATE"), row.names = c(NA, -10L), class = "data.frame")

DF #view dataframe

library(plyr) #get plyr

ddply(na.omit(DF), .(ID), summarise, max = max(DATE)) #or

ddply(na.omit(DF), "ID", summarise, max = max(DATE))

ddply(na.omit(DF), "ID", summarise, mean = mean(var3))

X1 <- sample(1:20, 60, replace=TRUE)

X2 <- X1*(1+sample(seq( .00, .2, .005), 60, replace=TRUE))

X3 <- as.factor(sort(sample(c("dog", "cat", "pig", "snake"),

60, replace=TRUE)))

DF <- data.frame(X1, X2, X3)

library(plyr)

ddply(na.omit(DF), .(X3), summarise, cor = cor(X1,X2)) #correlation by group

stats <- function(x)c("mean"=mean(x), "med"=median(x), "sd"=sd(x),

"var"=var(x), "n"=length(x))

ddply(na.omit(DF), .(X3), summarise, X1=stats(X1),X2=stats(X2))

ID var2 var3 DATE

1 1 B -0.184380 2010-02-22

2 1 G -1.424134 2012-08-04

3 1 H -0.245742 <NA>

4 1 W 0.762806 2010-10-21

5 1 F -0.855617 2012-10-28

6 2 P -0.358080 <NA>

7 2 C -0.137767 2010-12-27

8 2 I -0.952739 2012-10-12

9 2 Z 1.012279 2011-04-11

10 2 M 0.072201 2010-07-29

> ddply(na.omit(DF), .(ID), summarise, max = max(DATE))

ID max

1 1 2012-10-28

2 2 2012-10-12

> ddply(na.omit(DF), "ID", summarise, mean = mean(var3))

ID mean

1 1 -0.4253313

2 2 -0.0015067

require(plyr)

ddply(mtcars, .(cyl, am), with, each(min, mean, sd, max)(hp))

> ddply(mtcars, .(cyl, am), with,

each(min, mean, sd, max)(hp))

cyl am min mean sd max

1 4 0 62 84.66667 19.65536 97

2 4 1 52 81.87500 22.65542 113

3 6 0 105 115.25000 9.17878 123

4 6 1 110 131.66667 37.52777 175

5 8 0 150 194.16667 33.35984 245

6 8 1 264 299.50000 50.20458 335

> ddply(na.omit(DF), .(X3),

summarise, cor = cor(X1,X2))

X3 cor

1 cat 0.9970943

2 dog 0.9974141

3 pig 0.9959173

4 snake 0.9865586


DF<-structure(list(car_id = c(500L, 500L, 500L, 500L, 500L, 500L,

501L, 501L, 501L, 501L, 501L, 501L, 501L, 502L, 502L, 502L, 502L,

502L, 502L), visitnum = c(40L, 50L, 60L, 100L, 110L, 120L, 40L,

50L, 60L, 100L, 110L, 120L, 150L, 40L, 50L, 60L, 100L, 110L,

120L), measurement = c(2301L, NA, NA, NA, NA, NA, 4480L, NA,

NA, NA, NA, NA, 38570L, NA, NA, NA, NA, NA, 2560L)), .Names = c("car_id",

"visitnum", "measurement"), class = "data.frame", row.names = c(NA,

-19L))

DF

library(plyr)

DF$measurement2 <- DF$measurement #duplicate measurement column

DF$measurement2[is.na(DF$measurement2)]<-0 #replace NA's with 0

FM <-function(x)ifelse(sum(x)-x[1]>x[1], 1, 0) #code to make a new column of 0 and 1

ddply(DF, .(car_id), transform, "flagmeasure" = FM(measurement2))[,-4]

ddply(DF, .(car_id), summarise, "flagmeasure" = FM(measurement2))

> DF

car_id visitnum measurement

1 500 40 2301

2 500 50 NA

3 500 60 NA

4 500 100 NA

5 500 110 NA

6 500 120 NA

7 501 40 4480

8 501 50 NA

9 501 60 NA

10 501 100 NA

11 501 110 NA

12 501 120 NA

13 501 150 38570

14 502 40 NA

15 502 50 NA

16 502 60 NA

17 502 100 NA

18 502 110 NA

19 502 120 2560

ddply(DF, .(car_id), transform, "flagmeasure"

= FM(measurement2))[,-4]

car_id visitnum measurement flagmeasure

1 500 40 2301 0

2 500 50 NA 0

3 500 60 NA 0

4 500 100 NA 0

5 500 110 NA 0

6 500 120 NA 0

7 501 40 4480 1

8 501 50 NA 1

9 501 60 NA 1

10 501 100 NA 1

11 501 110 NA 1

12 501 120 NA 1

13 501 150 38570 1

14 502 40 NA 1

15 502 50 NA 1

16 502 60 NA 1

17 502 100 NA 1

18 502 110 NA 1

19 502 120 2560 1

> ddply(DF, .(car_id), summarise, "flagmeasure" = FM(measurement2))

car_id flagmeasure

1 500 0

2 501 1

3 502 1


#========================

# The data

#========================

test<-data.frame(group=c(rep(1,4),rep(2,5),3),day=c(0:3,0:4,0),

measure=c(5,3,7,8,3,2,4,5,7,5))

(test1<-test)

#========================

# With base (faster)

#========================

test$diff <- unlist(by(test$measure, test$group, function(x){x - x[1]}))

test$perchange <- unlist(by(test$measure, test$group, function(x){(x - x[1])/x[1]}))

test

#========================

# With plyr (slower)

#========================

test<-test1 #reset test

library(plyr)

perch<-function(x){(x - x[1])/x[1]}

differ<-function(x){x - x[1]}

ddply(test, .(group), transform, diff=differ(measure))

ddply(test, .(group), transform, perchange= perch(measure))

ddply(test, .(group), transform, diff=differ(measure), perchange= perch(measure))

group day measure

1 1 0 5

2 1 1 3

3 1 2 7

4 1 3 8

5 2 0 3

6 2 1 2

7 2 2 4

8 2 3 5

9 2 4 7

10 3 0 5

group day measure diff perchange

1 1 0 5 0 0.0000000

2 1 1 3 -2 -0.4000000

3 1 2 7 2 0.4000000

4 1 3 8 3 0.6000000

5 2 0 3 0 0.0000000

6 2 1 2 -1 -0.3333333

7 2 2 4 1 0.3333333

8 2 3 5 2 0.6666667

9 2 4 7 4 1.3333333

10 3 0 5 0 0.0000000


test<-data.frame(person=c("A","A","A","A", "B","B",'C', 'C'),day=c(7,14,21,22, 7, 14, 7, 14),

measure=c(112,0,500,600, 0, 0, 0, 50),temp=c(36.9,36.1,37.2,39.6, 35, 37, 37, 35))

test$detector<-ifelse(test$measure>0 & test$temp>=37, 'TYPE.II',

ifelse(test$measure>0 & test$temp<37, 'TYPE.I','ok'))

firstFUN <- function(x, y) y[which(x != 'ok')[1]] # day of the first non-'ok' (failure) detector value

typeFUN <- function(x, y) y[which(x != 'ok')[1]] # same logic, applied to the detector column itself

(outcome<-ddply(test, .(person), transform, "failure.day" = firstFUN(detector, day),

"failure.type" = typeFUN(detector, detector)))

> test

person day measure temp

1 A 7 112 36.9

2 A 14 0 36.1

3 A 21 500 37.2

4 A 22 600 39.6

5 B 7 0 35.0

6 B 14 0 37.0

7 C 7 0 37.0

8 C 14 50 35.0

>outcome

person day measure temp detector failure.day failure.type

1 A 7 112 36.9 TYPE.1 7 TYPE.1

2 A 14 0 36.1 ok 7 TYPE.1

3 A 21 500 37.2 TYPE.II 7 TYPE.1

4 A 22 600 39.6 TYPE.II 7 TYPE.1

5 B 7 0 35.0 ok NA <NA>

6 B 14 0 37.0 ok NA <NA>

7 C 7 0 37.0 ok 14 TYPE.1

8 C 14 50 35.0 TYPE.1 14 TYPE.1


APPLY A FUNCTION TO A DATA SET BROKEN DOWN BY A CATEGORICAL VARIABLE

distTab(mtcars, 5) #Normal use of the function (distTab appears to be a custom/qdap helper rather than base R)

require(plyr)

dlply(mtcars, .(cyl), function(x)distTab(x, 4))

dlply(mtcars, .(cyl, am), function(x)distTab(x, 4))

dlply(CO2, .(Type, Treatment), function(x)distTab(x, 4))

dlply(CO2, .(Type, Treatment), mean)

> Test

Person Day Parasites

1 A 1 100

2 A 5 0

3 A 12 0

4 B 1 34

5 B 3 15

6 B 5 11

7 B 9 0

8 B 27 0

9 C 1 188

10 C 3 15

11 C 5 0

12 C 9 8

13 C 19 0

14 D 1 35

15 D 2 0

16 D 4 0

17 D 6 12

18 D 23 10

Test<-dput(structure(list(Person = structure(c(1L, 1L, 1L, 2L, 2L, 2L, 2L,

2L, 3L, 3L, 3L, 3L, 3L, 4L, 4L, 4L, 4L, 4L), .Label = c("A",

"B", "C", "D"), class = "factor"), Day = c(1L, 5L, 12L, 1L, 3L,

5L, 9L, 27L, 1L, 3L, 5L, 9L, 19L, 1L, 2L, 4L, 6L, 23L), Parasites = c(100L,

0L, 0L, 34L, 15L, 11L, 0L, 0L, 188L, 15L, 0L, 8L, 0L, 35L, 0L,

0L, 12L, 10L)), .Names = c("Person", "Day", "Parasites"), class = "data.frame",

row.names = c(NA,

-18L)))

#############################################################################

#METHOD 1

TESTER <- function(day, parasites){

x <- rle(parasites)

ifelse(x[[2]][length(x[[2]])]==0,

as.character(day[length(parasites)+1-x[[1]][length(x[[1]])]]),

"DNC"

)

}

NEW <- ddply(Test, .(Person), transform, "clearance.day" = TESTER(Day, Parasites))

############################################################################

#METHOD 2

fun <- function(Parasite, Day){

tmp <- rle(rev(Parasite))

len <- length(Parasite)

if(tmp$values[1] != 0){

return(rep("DNC", len))

}

n <- len

k <- n + 1 - tmp$lengths[1]

return(rep(Day[k], len))

}

ddply(Test, .(Person), summarize, Day = Day, clearance = fun(Parasites, Day))

#################################################################################

# test replications elapsed relative user.self sys.self user.child sys.child #

# 1 meth1 1000 7.92 1.000000 7.05 0.00 NA NA #

# 2 meth2 1000 15.70 1.982323 10.59 0.01 NA NA #

#################################################################################

Find the last occurrence of a value

> NEW

Person Day Parasites clearance.day

1 A 1 100 5

2 A 5 0 5

3 A 12 0 5

4 B 1 34 9

5 B 3 15 9

6 B 5 11 9

7 B 9 0 9

8 B 27 0 9

9 C 1 188 19

10 C 3 15 19

11 C 5 0 19

12 C 9 8 19

13 C 19 0 19

14 D 1 35 DNC

15 D 2 0 DNC

16 D 4 0 DNC

17 D 6 12 DNC

18 D 23 10 DNC


APPLY A FUNCTION BY GROUP TO TWO COLUMNS OF A DATA FRAME (use lapply with split (faster), OR by)

df <- data.frame(group = rep(c("G1", "G2"), each = 10),

var1 = rnorm(20),

var2 = rnorm(20))

r <- by(df, df$group, FUN = function(X) cor(X$var1, X$var2, method = "spearman"))

j <- lapply(split(df, df$group), function(x){cor(x[,2], x[,3], method = "spearman")})

data.frame(group = names(j), corr = unlist(j), row.names = NULL)

#INPUT

group var1 var2

1 G1 -0.60324036 0.22355138

2 G1 -1.64211667 -0.78414595

3 G1 -0.26629745 1.00448792

4 G1 0.42810545 1.04770451

5 G1 -1.26773098 -0.38998673

6 G1 0.78676448 -0.70243031

7 G1 0.29611857 -0.51216302

8 G1 1.96831668 -0.07017856

9 G1 0.13034798 1.28344355

10 G1 -0.15531481 0.94086118

11 G2 0.65258740 -0.48107934

12 G2 -1.11294137 -0.51280763

13 G2 1.35929571 -0.85913000

14 G2 -0.36637039 -0.50303582

15 G2 -1.20766391 -0.52910758

16 G2 0.27350136 -0.00188101

17 G2 -1.03189591 -0.11919335

18 G2 -0.11188425 -1.42868344

19 G2 0.05789754 -1.66900549

20 G2 -1.16903207 -0.17194032

#OUTPUT (by method)

>by(df, df$group, FUN = function(X) cor(X$var1, X$var2, method = "spearman"))

df$group: G1

[1] 0.1515152

------------------------------------------------------------

df$group: G2

[1] -0.1151515

#OUTPUT (lapply & split method)

> j <- lapply(split(df, df$group), function(x){cor(x[,2], x[,3], method = "spearman")})

> data.frame(group = names(j), corr = unlist(j), row.names = NULL)

group corr

1 G1 0.1515152

2 G2 -0.1151515


Unlist & Recursively Unlist (sunlist)

A <- data.frame( a = c(1:10), b = c(11:20) )

B <- data.frame( a = c(101:110), b = c(111:120) )

C <- data.frame( a = c(5:8), b = c(55:58) )

L <- list(list(B,C),list(A),list(C,A),list(A,B,C),list(C))

unlist(L) #unlist everything into one vector

unlist(L, recursive=F) #unlist everything into one list of many vectors

Access Elements in a List

mod <- summary(lm(cyl~mpg, data=mtcars))

mod[[4]][[2]]

mod[[c(4,2)]] #uses the vector c(4, 2) to recursively index the list
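A hedged alternative, not from the original notes: index the summary by name rather than by position, so the code does not depend on the order of summary.lm()'s components:

mod$coefficients["mpg", "Estimate"]   # same value as mod[[c(4, 2)]], but selected by name
coef(mod)["mpg", "Std. Error"]        # coef() on the summary returns the coefficient table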

Repeat rows by the counts held in another column [good for table()-summarised data frames]; see dfpeat() in .Rprofile

Template: dataframe[ rep( seq(dim(dataframe)[1]), count.column), -1]   # count.column holds the repeat counts; the -1 drops column 1 (the count column in the example below)

EXAMPLE

DF <- structure(list(num = c(5, 5, 4), freq = c(96,

60, 59), rank = c(1, 2, 3)), .Names = c("num",

"freq", "rank"), row.names = c(NA, 3L), class = "data.frame")

num freq rank

1 5 96 1

2 5 60 2

3 4 59 3

DF2 <- DF[ rep( seq(dim(DF)[1]), DF$num), -1]

rownames(DF2) <- 1:nrow(DF2)

DF2

> DF2

freq rank

1 96 1

2 96 1

3 96 1

4 96 1

5 96 1

6 60 2

7 60 2

8 60 2

9 60 2

10 60 2

11 59 3

12 59 3

13 59 3

14 59 3


Turn a list into a dataframe 3 ways

j <- lapply(1:10, rnorm, n=4)

#METHOD 1

do.call(rbind, j) #or

data.frame(do.call(rbind, j))

#METHOD 2

library(plyr)

ldply(j, I)

#METHOD 3

ldply(j, function(x){x})

> do.call(rbind, j)

[,1] [,2] [,3] [,4]

[1,] 1.6411064 1.157174 0.873377 0.3134954

[2,] 0.9041039 2.667465 1.965937 0.4181302

[3,] 4.4037940 2.420527 3.264888 3.8311805

[4,] 3.9637209 5.402170 5.196343 4.6943378

[5,] 3.9358796 5.866777 5.540184 4.2303664

[6,] 5.8809682 4.669888 4.773183 6.8188467

[7,] 8.3059954 6.389316 5.942269 7.4630666

[8,] 7.5501919 7.807572 7.373059 7.5226562

[9,] 8.6035129 7.044928 9.074038 8.0470154

[10,] 9.3076546 8.424741 11.628522 9.7019016

>

> ldply(j, I)

V1 V2 V3 V4

1 1.6411064 1.157174 0.873377 0.3134954

2 0.9041039 2.667465 1.965937 0.4181302

3 4.4037940 2.420527 3.264888 3.8311805

4 3.9637209 5.402170 5.196343 4.6943378

5 3.9358796 5.866777 5.540184 4.2303664

6 5.8809682 4.669888 4.773183 6.8188467

7 8.3059954 6.389316 5.942269 7.4630666

8 7.5501919 7.807572 7.373059 7.5226562

9 8.6035129 7.044928 9.074038 8.0470154

10 9.3076546 8.424741 11.628522 9.7019016

>

> ldply(j, function(x){x})

V1 V2 V3 V4

1 1.6411064 1.157174 0.873377 0.3134954

2 0.9041039 2.667465 1.965937 0.4181302

3 4.4037940 2.420527 3.264888 3.8311805

4 3.9637209 5.402170 5.196343 4.6943378

5 3.9358796 5.866777 5.540184 4.2303664

6 5.8809682 4.669888 4.773183 6.8188467

7 8.3059954 6.389316 5.942269 7.4630666

8 7.5501919 7.807572 7.373059 7.5226562

9 8.6035129 7.044928 9.074038 8.0470154

10 9.3076546 8.424741 11.628522 9.7019016


EVAL/PARSE

a <- 3

x <- "a > 2"

eval(parse(text=x))

x2 <- "a==3"

eval(parse(text=x2))

a <- 1:13

x <- "mean(a)"

eval(parse(text=x))

## > a <- 3

## > x <- "a > 2"

## > eval(parse(text=x))

## [1] TRUE

## >

## > x2 <- "a==3"

## > eval(parse(text=x2))

## [1] TRUE

## >

## > a <- 1:13

## > x <- "mean(a)"

## > eval(parse(text=x))

## [1] 7
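A side note and an assumption about intent: when the string merely names an object or a function, get() and match.fun() are lighter-weight than eval(parse(text = ...)):

a <- 1:13
get("a")                   # fetch an object by name
match.fun("mean")(a)       # fetch a function by name, then call it
do.call("mean", list(a))   # the same call built with do.call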


RESHAPE http://had.co.nz/stat405/lectures/19-tables.pdf

Data Set to Long Format for Repeated Measures

library(reshape)

melt(data.frame, id = variables/columns to group by)

cast(molten data.frame, formula, variable or value, aggregate.function)

# Example 2:

d <- read.table(text="Code Country 1950 1951 1952 1953 1954
AFG Afghanistan 20,249 21,352 22,532 23,557 24,555
ALB Albania 8,097 8,986 10,058 11,123 12,246", header=TRUE)

d

#Method 1

x1 <- reshape(d, direction="long", varying=list(names(d)[3:7]), v.names="Value",

idvar=c("Code","Country"), timevar="Year", times=1950:1954)

rownames(x1) <- 1:nrow(x1)

x1

#Method 2 (PREFERRED)

library(reshape)

x2 <- melt(d,id=c("Code","Country"),variable_name="Year")

x2[,"Year"] <- as.numeric(gsub("X","",x2[,"Year"]))

x2

cast(x2, Year~Country)

cast(x2, Country~Year)

cast(x2, Country + Code~Year)

#Example 1:

source("C:/Users/Rinker/Desktop/PhD Program/CEP 523-Stat Meth Ed Inference/R Stuff/Scripts/Data Sets.txt")

library(reshape)

rep.mes2<-rep.mes

Sex<-gl(2, 25, length=50,labels = c("Male", "Female"))

rep.mes2<-data.frame(rep.mes2[1:2],Sex,rep.mes2[3:5])

long.rep.mes<-melt(rep.mes2,id=1:3)[order(melt(rep.mes)$Sub),]

rownames(long.rep.mes)<-1:150

rep.mes2;long.rep.mes

#EXAMPLE

Code Country X1950 X1951 X1952 X1953 X1954

1 AFG Afghanistan 20,249 21,352 22,532 23,557 24,555

2 ALB Albania 8,097 8,986 10,058 11,123 12,246

> x2 #MELTED

Code Country Year value

1 AFG Afghanistan 1950 20,249

2 ALB Albania 1950 8,097

3 AFG Afghanistan 1951 21,352

4 ALB Albania 1951 8,986

5 AFG Afghanistan 1952 22,532

6 ALB Albania 1952 10,058

7 AFG Afghanistan 1953 23,557

8 ALB Albania 1953 11,123

9 AFG Afghanistan 1954 24,555

10 ALB Albania 1954 12,246

#RECASTED

> cast(x2, Year~Country)

Year Afghanistan Albania

1 1950 20,249 8,097

2 1951 21,352 8,986

3 1952 22,532 10,058

4 1953 23,557 11,123

5 1954 24,555 12,246

> cast(x2, Country~Year)

Country 1950 1951 1952 1953 1954

1 Afghanistan 20,249 21,352 22,532 23,557 24,555

2 Albania 8,097 8,986 10,058 11,123 12,246

> cast(x2, Country + Code~Year)

Country Code 1950 1951 1952 1953 1954

1 Afghanistan AFG 20,249 21,352 22,532 23,557 24,555

2 Albania ALB 8,097 8,986 10,058 11,123 12,246


library('reshape')

DF<-data.frame("TAX"=c("A", "A", "A", "A", "B","B","B","B"),

"YEAR"=c(2000,2001,2002,2003,2000,2001,2002,2004),

"NUMBER"=c(2,2,3,1,3,4,3,2))

DF

cast(DF, YEAR ~ TAX, value = 'NUMBER', fill = 0)

DF2<-data.frame(DF, "NEW"=rnorm(nrow(DF)))

cast(DF2, YEAR+NEW ~ TAX, value = 'NUMBER', fill = 0)

cast(DF2, TAX ~ YEAR, value = 'NUMBER')

cast(DF2, TAX ~ NUMBER, value = 'NEW', mean)

cast(DF2, TAX ~ NUMBER, value = 'NEW', mean, fill=NA)

cast(DF2, TAX ~ NUMBER , value = 'NEW', sum)

cast(DF2, TAX + YEAR ~ NUMBER , value = 'NEW', sum)

TAX YEAR NUMBER

1 A 2000 2

2 A 2001 2

3 A 2002 3

4 A 2003 1

5 B 2000 3

6 B 2001 4

7 B 2002 3

8 B 2004 2

YEAR A B

1 2000 2 3

2 2001 2 4

3 2002 3 3

4 2003 1 0

5 2004 0 2

> cast(DF2, YEAR+NEW ~ TAX, value = 'NUMBER', fill = 0)

YEAR NEW A B

1 2000 -1.77380068 2 0

2 2000 0.46681003 0 3

3 2001 -0.09072904 0 4

4 2001 2.19618765 2 0

5 2002 -1.68538164 0 3

6 2002 0.85410280 3 0

7 2003 0.13744107 1 0

8 2004 0.12992724 0 2

> cast(DF2, TAX ~ YEAR, value = 'NUMBER')

TAX 2000 2001 2002 2003 2004

1 A 2 2 3 1 NA

2 B 3 4 3 NA 2

> cast(DF2, TAX ~ NUMBER, value = 'NEW', mean)

TAX 1 2 3 4

1 A 0.1374411 0.2111935 0.8541028 NaN

2 B NaN 0.1299272 -0.6092858 -0.09072904

> cast(DF2, TAX ~ NUMBER, value = 'NEW', mean, fill=NA)

TAX 1 2 3 4

1 A 0.1374411 0.2111935 0.8541028 NA

2 B NA 0.1299272 -0.6092858 -0.09072904

> cast(DF2, TAX ~ NUMBER , value = 'NEW', sum)

TAX 1 2 3 4

1 A 0.1374411 0.4223870 0.8541028 0.00000000

2 B 0.0000000 0.1299272 -1.2185716 -0.09072904

> cast(DF2, TAX + YEAR ~ NUMBER , value = 'NEW', sum)

TAX YEAR 1 2 3 4

1 A 2000 0.0000000 -1.7738007 0.0000000 0.00000000

2 A 2001 0.0000000 2.1961876 0.0000000 0.00000000

3 A 2002 0.0000000 0.0000000 0.8541028 0.00000000

4 A 2003 0.1374411 0.0000000 0.0000000 0.00000000

5 B 2000 0.0000000 0.0000000 0.4668100 0.00000000

6 B 2001 0.0000000 0.0000000 0.0000000 -0.09072904

7 B 2002 0.0000000 0.0000000 -1.6853816 0.00000000

8 B 2004 0.0000000 0.1299272 0.0000000 0.00000000

DF3 <- melt(DF2, id=c("TAX", "YEAR"), na.rm=TRUE)

cast(DF3, TAX ~ . | variable, mean)

cast(DF3, TAX ~ . | variable, sum)

cast(DF3, TAX ~ . | variable, range) #or even better

cast(DF3, TAX ~ . | variable, c(min, max))

cast(DF3, YEAR + TAX ~ . | variable)

recast(DF2, YEAR + TAX ~ . | variable,

id.var=c("TAX", "YEAR"),

measure.var=c("NUMBER", "NEW"))

recast(DF2, YEAR + TAX + NUMBER ~ . | variable,

id.var=c("TAX", "YEAR", "NUMBER"),

measure.var=c("NEW"),

fun.aggregate=range)

> cast(DF3, TAX ~ . | variable, c(min, max))

$NUMBER

TAX min max

1 A 1 3

2 B 2 4

$NEW

TAX min max

1 A -2.061310 1.281748

2 B -1.726776 1.986024

> cast(DF3, YEAR + TAX ~ . |variable)

$NUMBER

YEAR TAX (all)

1 2000 A 2

2 2000 B 3

3 2001 A 2

4 2001 B 4

5 2002 A 3

6 2002 B 3

7 2003 A 1

8 2004 B 2

$NEW

YEAR TAX (all)

1 2000 A -2.06131000

2 2000 B 1.98602360

3 2001 A 1.10881310

4 2001 B -0.89410042

5 2002 A 1.28174758

6 2002 B -1.72677556

7 2003 A 0.05761605

8 2004 B -0.15146665


Parm values person

1 hour 0.00 1

2 day 0.00 1

3 min 5.00 1

4 max 7.00 1

5 outlier 0.25 1

6 hour 1.00 2

7 day 0.00 2

8 min 5.00 2

9 max 7.00 2

10 outlier 0.25 2

person day hour max min outlier

1 1 0 0 7 5 0.25

2 2 0 1 7 5 0.25

test<-dput(structure(list(Parm = structure(c(2L, 1L, 4L, 3L, 5L, 2L, 1L,

4L, 3L, 5L), .Label = c("day", "hour", "max", "min", "outlier"

), class = "factor"), values = c(0, 0, 5, 7, 0.25, 1, 0, 5, 7,

0.25), person = c(1L, 1L, 1L, 1L, 1L, 2L, 2L, 2L, 2L, 2L)), .Names = c("Parm",

"values", "person"), class = "data.frame", row.names = c(NA,

-10L)))

library(reshape)

cast(test, person ~ Parm, value = "values")


Differences between a value by group

#Create the data set

set.seed(1234)

x <- data.frame(id=1:5, value=sample(20:30, 5, replace=T), year=3)

y <- data.frame(id=1:5, value=sample(10:19, 5, replace=T), year=2)

z <- data.frame(id=1:5, value=sample(0:9, 5, replace=T), year=1)

df <- df.reset <- rbind(x, y, z) #The reset allows us to reset df each time

#SAPPLY

df <- df[order(df$id, df$year), ]

sdf <-split(df, df$id)

df$actual <- c(sapply(seq_along(sdf), function(x) diff(c(0, sdf[[x]][,2]))))

df[order(as.numeric(rownames(df))),]

#AGGREGATE

df <- df.reset

df <- df[order(df$id, df$year), ]

diff2 <- function(x) diff(c(0, x))

df$actual <- c(unlist(t(aggregate(value~id, df, diff2)[, -1])))

df[order(as.numeric(rownames(df))),]

#BY

df <- df.reset

df <- df[order(df$id, df$year), ]

df$actual <- unlist(by(df$value, df$id, diff2))

df[order(as.numeric(rownames(df))),]

#PLYR

df <- df.reset

df <- df[order(df$id, df$year), ]

df <- data.frame(temp=1:nrow(df), df)

library(plyr)

df <- ddply(df, .(id), transform, actual=diff2(value))

df[order(df$year, df$temp),][, -1]

Extending this to multiple columns

#Create the data set

set.seed(1234)

x <- data.frame(id=1:5, value=sample(20:30, 5, replace=T), year=3)

y <- data.frame(id=1:5, value=sample(10:19, 5, replace=T), year=2)

z <- data.frame(id=1:5, value=sample(0:9, 5, replace=T), year=1)

df <- rbind(x, y, z)

df <- df.reset <- data.frame(df[, 1:2], new.var=df[, 2]+sample(1:5, nrow(df),

replace=T), year=df[, 3])

#SAPPLY the BY

df <- df[order(df$id, df$year), ]

diff2 <- function(x) diff(c(0, x))

group.diff<- function(x) unlist(by(x, df$id, diff2))

df <- data.frame(df, sapply(df[, 2:3], group.diff))

df <- df[order(as.numeric(rownames(df))),]

names(df)[5:6] <- c('actual', 'actual.new');df

#TRANSFORM the BY

df <- df.reset

df <- df[order(df$id, df$year), ]

diff2 <- function(x) diff(c(0, x))

group.diff<- function(x) unlist(by(x, df$id, diff2))

df <- transform(df, actual=group.diff(value), actual.new=group.diff(new.var))

df[order(as.numeric(rownames(df))),]

#PLYR

df <- df.reset

df <- data.frame(temp=1:nrow(df), df)

df <- df[order(df$id, df$year), ]

library(plyr)

df <- ddply(df, .(id), transform, actual=diff2(value), actual.new=diff2(new.var))

df[order(df$temp),][, -1]

id value year

1 1 21 3

2 2 26 3

3 3 26 3

4 4 26 3

5 5 29 3

6 1 16 2

7 2 10 2

8 3 12 2

9 4 16 2

10 5 15 2

11 1 6 1

12 2 5 1

13 3 2 1

14 4 9 1

15 5 2 1

id value year actual

1 1 21 3 5

2 2 26 3 16

3 3 26 3 14

4 4 26 3 10

5 5 29 3 14

6 1 16 2 10

7 2 10 2 5

8 3 12 2 10

9 4 16 2 7

10 5 15 2 13

11 1 6 1 6

12 2 5 1 5

13 3 2 1 2

14 4 9 1 9

15 5 2 1 2


Extract just the rows of a dataframe from the max of 1 variable

#THE DATA

df <- structure(list(ID = c(1L, 1L, 1L, 4L, 4L, 4L, 4L, 4L, 9L, 9L,

9L, 9L, 9L), week = c(2L, 4L, 6L, 2L, 6L, 9L, 9L, 12L, 2L, 4L,

6L, 9L, 12L), outcome = c(14L, 28L, 42L, 14L, 46L, 64L, 71L,

85L, 14L, 28L, 51L, 66L, 84L)), .Names = c("ID", "week", "outcome"

), class = "data.frame", row.names = c(NA, -13L))

#METHOD 1

do.call("rbind",

by(df, INDICES=df$ID, FUN=function(DF) DF[which.max(DF$week),]))

#METHOD 2

library(data.table)

dt <- data.table(df, key="ID")

dt[, .SD[which.max(outcome),], by=ID]

#METHOD 3

library(plyr)

ddply(df, .(ID), function(X) X[which.max(X$week), ])

#METHOD 4

sdf <-with(df, split(df, ID))

max.week <- sapply(seq_along(sdf), function(x) which.max(sdf[[x]][, 'week']))

data.frame(t(mapply(function(x, y) y[x, ], max.week, sdf)))

#METHOD 5

df[rev(rownames(df)),][match(unique(df$ID), rev(df$ID)), ]

#METHOD 6

sdf <-with(df, split(df, ID))

df[cumsum(sapply(seq_along(sdf), function(x) which.max(sdf[[x]][, 'week']))), ]

#METHOD 7

df[cumsum(aggregate(week ~ ID, df, which.max)$week), ]

#METHOD 8

df[tapply(seq_along(df$ID), df$ID, function(x){tail(x,1)}),]

#METHOD 9

df[cumsum(as.numeric(lapply(split(df$week, df$ID), which.max))), ]

See the rbenchmark results below:

ID week outcome

1 1 2 14

2 1 4 28

3 1 6 42

4 4 2 14

5 4 6 46

6 4 9 64

7 4 9 71

8 4 12 85

9 9 2 14

10 9 4 28

11 9 6 51

12 9 9 66

13 9 12 84

ID week outcome

1 1 6 42

4 4 12 85

9 9 12 84

We want to select the max week for each individual but return the rest of the data frame.


library(rbenchmark)

benchmark(

DATA.TABLE= {dt <- data.table(df, key="ID")

dt[, .SD[which.max(outcome),], by=ID]},

DO.CALL={do.call("rbind",

by(df, INDICES=df$ID, FUN=function(DF) DF[which.max(DF$week),]))},

PLYR=ddply(df, .(ID), function(X) X[which.max(X$week), ]),

SPLIT={sdf <-with(df, split(df, ID))

max.week <- sapply(seq_along(sdf), function(x) which.max(sdf[[x]][, 'week']))

data.frame(t(mapply(function(x, y) y[x, ], max.week, sdf)))},

MATCH.INDEX=df[rev(rownames(df)),][match(unique(df$ID), rev(df$ID)), ],

AGGREGATE=df[cumsum(aggregate(week ~ ID, df, which.max)$week), ],

BRYANS.INDEX = df[cumsum(as.numeric(lapply(split(df$week, df$ID), which.max))), ],

SPLIT2={sdf <-with(df, split(df, ID))

df[cumsum(sapply(seq_along(sdf), function(x) which.max(sdf[[x]][, 'week']))), ]},

TAPPLY= df[tapply(seq_along(df$ID), df$ID, function(x){tail(x,1)}),],

columns = c( "test", "replications", "elapsed", "relative", "user.self","sys.self"),

order = "test", replications = 1000, environment = parent.frame())

test replications elapsed relative user.self sys.self

6 AGGREGATE 1000 4.49 7.610169 2.84 0.05

7 BRYANS.INDEX 1000 0.59 1.000000 0.20 0.00

1 DATA.TABLE 1000 20.28 34.372881 11.98 0.00

2 DO.CALL 1000 4.67 7.915254 2.95 0.03

5 MATCH.INDEX 1000 1.07 1.813559 0.51 0.00

3 PLYR 1000 10.61 17.983051 5.07 0.00

4 SPLIT 1000 3.12 5.288136 1.81 0.00

8 SPLIT2 1000 1.56 2.644068 1.28 0.00

9 TAPPLY 1000 1.08 1.830508 0.88 0.00


Summarize (apply a function to) a numeric variable by 2 categorical variables

#================

#Make a data set

#================

n <- 100

dat <- data.frame(

Accuracy = round(runif(n, 0, 5), 1),

Month = sample(1:2, n, replace=TRUE),

Day = sample(1:5, n, replace=TRUE),

Easting = rnorm(n),

Northing = rnorm(n),

Etc = rnorm(n)

)

#==========

#using plyr

#==========

library(plyr)

ddply(

dat,

c("Month", "Day"),

function (x) x[ which.min(x$Accuracy), ]

)

#==========

#using base

#==========

t(sapply(

split(dat, list(dat$Month, dat$Day)),

function(d) d[ which.min(d$Accuracy), ]))

#aggregate gets you part of the way there (but the rest of the data frame's columns do not come along):

aggregate(Accuracy ~ Month + Day, data = dat, FUN = min)

#OUTCOME (find min value by month and day)

Accuracy Month Day Easting Northing Etc

1 1.0 1 1 -1.2107186 -0.06473102 1.5195738

2 0.7 1 2 0.7552501 1.20389863 0.1319931

3 0.5 1 3 1.1104158 -0.31173230 -0.4738744

4 0.5 1 4 -0.7936402 0.94957122 -0.5173246

5 0.4 1 5 0.1725260 2.50637015 0.5808553

6 0.1 2 1 1.1359366 1.73373416 1.1122071

7 0.3 2 2 0.9101894 0.57581224 0.2726678

8 0.2 2 3 -0.2905642 0.67290842 1.7687111

9 0.7 2 4 -2.2955213 0.23270159 1.2040872

10 0.0 2 5 1.1167519 1.04612217 -0.7811158
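To get all the way there with aggregate (as noted above), one hedged option is to merge the group minima back onto the original rows; ties would return more than one row per group:

mins <- aggregate(Accuracy ~ Month + Day, data = dat, FUN = min)
merge(mins, dat)   # rows of dat whose Accuracy equals the group minimum for that Month/Day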


Apply multiple functions to multiple outcomes by multiple groups

> head(CO2)

Plant Type Treatment conc uptake

1 Qn1 Quebec nonchilled 95 16.0

2 Qn1 Quebec nonchilled 175 30.4

3 Qn1 Quebec nonchilled 250 34.8

4 Qn1 Quebec nonchilled 350 37.2

5 Qn1 Quebec nonchilled 500 35.3

6 Qn1 Quebec nonchilled 675 39.2

aggregate(cbind(conc, uptake) ~ Plant + Type + Treatment, data=CO2, FUN=mean)

Plant Type Treatment conc uptake

1 Qn1 Quebec nonchilled 435 33.22857

2 Qn2 Quebec nonchilled 435 35.15714

3 Qn3 Quebec nonchilled 435 37.61429

4 Mn3 Mississippi nonchilled 435 24.11429

5 Mn2 Mississippi nonchilled 435 27.34286

6 Mn1 Mississippi nonchilled 435 26.40000

7 Qc1 Quebec chilled 435 29.97143

8 Qc3 Quebec chilled 435 32.58571

9 Qc2 Quebec chilled 435 32.70000

10 Mc2 Mississippi chilled 435 12.14286

11 Mc3 Mississippi chilled 435 17.30000

12 Mc1 Mississippi chilled 435 18.00000

SUM <- function(x) c(mean=mean(x), sd=sd(x), n=length(x))

aggregate(cbind(conc, uptake) ~ Plant + Type + Treatment, data=CO2, FUN=SUM)

Plant Type Treatment conc.mean conc.sd conc.n uptake.mean uptake.sd uptake.n

1 Qn1 Quebec nonchilled 435.0000 317.7263 7.0000 33.228571 8.214766 7.000000

2 Qn2 Quebec nonchilled 435.0000 317.7263 7.0000 35.157143 11.004069 7.000000

3 Qn3 Quebec nonchilled 435.0000 317.7263 7.0000 37.614286 10.349948 7.000000

4 Mn3 Mississippi nonchilled 435.0000 317.7263 7.0000 24.114286 6.484707 7.000000

5 Mn2 Mississippi nonchilled 435.0000 317.7263 7.0000 27.342857 7.652855 7.000000

6 Mn1 Mississippi nonchilled 435.0000 317.7263 7.0000 26.400000 8.694251 7.000000

7 Qc1 Quebec chilled 435.0000 317.7263 7.0000 29.971429 8.334609 7.000000

8 Qc3 Quebec chilled 435.0000 317.7263 7.0000 32.585714 10.321083 7.000000

9 Qc2 Quebec chilled 435.0000 317.7263 7.0000 32.700000 11.336960 7.000000

10 Mc2 Mississippi chilled 435.0000 317.7263 7.0000 12.142857 2.186974 7.000000

11 Mc3 Mississippi chilled 435.0000 317.7263 7.0000 17.300000 3.049044 7.000000

12 Mc1 Mississippi chilled 435.0000 317.7263 7.0000 18.000000 4.118657 7.000000


Table of Means (smeans)

dat <- structure(list(partic = c(4.875, 3.375, 4.5, 2.875, 4, 4.625,

4.375, 4, 4.375, 3.625, 3.25, 4.875, 4.625, 4.875, 4.125, 3.25,

2.5, 3.875, 3.75, 3.625, 3.375, 4.75, 4.75, 3.57142857142857,

2.5, 4.125, 3.5, 3.375, 3.5, 4.5, 4.375, 3.66666666666667, 1.5,

4.375, 3.875, 4.375, 3.14285714285714, 3.875, 3.875, 3.125, 3.25,

2.375, 2.5, 3.5, 4.25, 4.25, 3.5, 3.625, 3.5, 3.625, 3.75, 3.625,

3.625, 4.25, 4, 4, 3.75, 3.875, 3.5, 4.375, 4, 3.5, 3.75, 3.375,

4.375, 3.875, 1.75, 4.5, 3.75, 3.625, 4, 4, 3.875, 2.75, 3.625,

3.5, 4.5, 4.125, 4.125, 4.625, 3.125, 4.625, 3.875, 3, 4.5, 4.25,

4.375, 4.25, 3.625, 3.5, 2.5, 2.875, 2.875, 2.5, 3.75, 4, 2.875,

2.375, 4.125, 4.5), grade = structure(c(3L, 4L, 3L, 3L, 3L, 3L,

3L, 3L, 3L, 3L, 3L, 3L, 1L, 1L, 4L, 4L, 4L, 4L, 4L, 3L, 1L, 4L,

3L, 4L, 2L, 2L, 3L, 4L, 4L, 2L, 2L, 2L, 2L, 2L, 3L, 2L, 2L, 4L,

4L, 4L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 3L,

3L, 3L, 4L, 4L, 4L, 4L, 4L, 4L, 4L, 4L, 4L, 3L, 4L, 3L, 1L, 1L,

1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 4L, 4L, 4L, 4L, 4L, 4L,

4L, 4L, 4L, 4L, 3L, 3L, 3L, 3L, 3L, 2L, 3L, 2L, 2L, 2L), .Label = c("freshman",

"sophomore", "junior", "senior"), class = "factor"), race3 = structure(c(1L,

2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 1L, 1L, 1L, 2L, 2L, 2L, 1L,

1L, 3L, 2L, 3L, 2L, 3L, 2L, 2L, 3L, 2L, 2L, 1L, 3L, 1L, 3L, 2L,

1L, 1L, 2L, 1L, 2L, 2L, 2L, 2L, 2L, 2L, 1L, 3L, 2L, 1L, 1L, 1L,

1L, 2L, 1L, 2L, 1L, 3L, 1L, 1L, 1L, 1L, 1L, 1L, 2L, 1L, 1L, 1L,

2L, 1L, 3L, 1L, 2L, 1L, 3L, 1L, 1L, 2L, 1L, 2L, 2L, 2L, 1L, 2L,

2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 3L, 1L, 2L, 2L, 2L,

2L, 2L, 1L), .Label = c("white/asian", "black", "hispanic"), class = "factor")), .Names =

c("partic",

"grade", "race3"), na.action = structure(c(11L, 42L, 154L, 210L,

230L, 282L, 306L, 336L, 349L, 352L, 377L, 378L, 397L, 437L, 477L

), .Names = c("11", "42", "154", "210", "230", "282", "306",

"336", "349", "352", "377", "378", "397", "437", "477"), class = "omit"), row.names = c(NA,

100L), class = "data.frame")

#====================================================================

library(reshape)

DF.rs <- melt(dat, id=c("grade", "race3"))

MT <- function(x){ paste(round(mean(x), digits=2),

"(", round(sd(x), digits=2), ")", sep="")}

cast(DF.rs, grade ~ race3, fun.aggregate=MT,

margins=c("grand_row", "grand_col"))

library(reshape)

dat <- read.table(text="

request user group

1 1 1

4 1 1

7 1 1

5 1 2

8 1 2

1 2 3

4 2 3

7 2 3

9 2 4

", header=TRUE)

library(plyr)

newdat <- ddply(dat, .(user, group), transform, idx = paste("request", 1:length(request), sep = ""))

cast(newdat, user + group ~ idx, value = .(request))

> cast(newdat, user + group ~ idx, value = .(request))

user group request1 request2 request3

1 1 1 1 4 7

2 1 2 5 8 NA

3 2 3 1 4 7

4 2 4 9 NA NA



Means of Means (smeans tables)

names(airquality) <- tolower(names(airquality))

aqm <- melt(airquality, id=c("month", "day"), na.rm=TRUE)

cast(aqm, day ~ month ~ variable)

cast(aqm, month ~ variable, mean)

cast(aqm, month ~ . | variable, mean)

cast(aqm, month ~ variable, mean, margins=c("grand_row", "grand_col"))

cast(aqm, day ~ month, mean, subset=variable=="ozone")

cast(aqm, month ~ variable, range)

cast(aqm, month ~ variable + result_variable, range)

cast(aqm, variable ~ month ~ result_variable,range)

#Chick weight example

names(ChickWeight) <- tolower(names(ChickWeight))

chick_m <- melt(ChickWeight, id=2:4, na.rm=TRUE)

cast(chick_m, time ~ variable, mean) # average effect of time

cast(chick_m, diet ~ variable, mean) # average effect of diet

cast(chick_m, diet ~ time ~ variable, mean) # average effect of diet & time

# How many chicks at each time? - checking for balance

cast(chick_m, time ~ diet, length)

cast(chick_m, chick ~ time, mean)

cast(chick_m, chick ~ time, mean, subset=time < 10 & chick < 20)

cast(chick_m, diet + chick ~ time)

cast(chick_m, chick ~ time ~ diet)

cast(chick_m, diet + chick ~ time, mean, margins="diet")

#Tips example

cast(melt(tips), sex ~ smoker, mean, subset=variable=="total_bill")

cast(melt(tips), sex ~ smoker | variable, mean)

ff_d <- melt(french_fries, id=1:4, na.rm=TRUE)

cast(ff_d, subject ~ time, length)

cast(ff_d, subject ~ time, length, fill=0)

cast(ff_d, subject ~ time, function(x) 30 - length(x))

cast(ff_d, subject ~ time, function(x) 30 - length(x), fill=30)

cast(ff_d, variable ~ ., c(min, max))

cast(ff_d, variable ~ ., function(x) quantile(x,c(0.25,0.5)))

cast(ff_d, treatment ~ variable, mean, margins=c("grand_col", "grand_row"))

cast(ff_d, treatment + subject ~ variable, mean, margins="treatment")


From long to wide format

reshape(data, varying = NULL, v.names = NULL, timevar = "time", idvar = "id", ids = 1:NROW(data), times = seq_along(varying[[1]]), drop = NULL, direction, new.row.names = NULL)

Arguments

data - a data frame.

varying - names of sets of variables in the wide format that correspond to single variables in long format ('time-varying'). This is canonically a list of vectors of variable names, but it can optionally be a matrix of names, or a single vector of names. In each case, the names can be replaced by indices which are interpreted as referring to names(data). See below for more details and options.

v.names - names of variables in the long format that correspond to multiple variables in the wide format. See below for details.

timevar - the variable in long format that differentiates multiple records from the same group or individual.

idvar - names of one or more variables in long format that identify multiple records from the same group/individual. These variables may also be present in wide format.

ids - the values to use for a newly created idvar variable in long format.

times - the values to use for a newly created timevar variable in long format. See below for details.

drop - a vector of names of variables to drop before reshaping.

direction - character string, either "wide" to reshape to wide format, or "long" to reshape to long format.

new.row.names - logical; if TRUE and direction="long", create new row names in long format from the values of the id and time variables.

DF<-data.frame(id=rep(paste("subject", 1:5, sep=" "), 3),

sex=rep(c("m","m","m","f","f"), 3),

time=c(rep("Time1",5), rep("Time2",5), rep("Time3",5)),

score1=rnorm(15), score2=abs(rnorm(15)*4))

wide <- reshape(DF, v.names=c("score1", "score2"), idvar="id",

timevar="time", direction="wide")

wide

long <- with(wide, reshape(wide, idvar="id",

v.names=c("score1", "score2"), direction="long"))

rownames(long)<-1:nrow(long)

long

#USING RESHAPE (the package)

DF<-data.frame(id=rep(paste("subject", 1:5, sep=" "), 3),

sex=rep(c("m","m","m","f","f"), 3),

time=c(rep("Time1",5), rep("Time2",5),

rep("Time3",5)), score1=rnorm(15),

score2=abs(rnorm(15)*4))

library(reshape)

m <- melt(DF)

cast(m,id+sex~...)

cast(m,id+sex~variable+time)


Long to wide example (base's reshape and the reshape2 package)

dat <- data.frame(county = rep(letters[1:4], each=2),

state = rep(LETTERS[1], times=8),

industry = rep(c("construction", "manufacturing"), 4),

employment = round(rnorm(8, 100, 50), 0),

establishments = round(rnorm(8, 20, 5), 0))

#Method 1 (base)

reshape(dat, direction="wide", idvar=c("state", "county"), timevar="industry")

#Method 2 (reshape2 package)

library(reshape2)

m <- melt(dat)

dcast(m, state + county~...)

county state industry employment establishments

1 a A construction 100 24

2 a A manufacturing 159 26

3 b A construction 117 17

4 b A manufacturing 64 25

5 c A construction 85 23

6 c A manufacturing 50 19

7 d A construction 21 14

8 d A manufacturing 48 8

state county construction_employment construction_establishments manufacturing_employment manufacturing_establishments

1 A a 100 24 159 26

2 A b 117 17 64 25

3 A c 85 23 50 19

4 A d 21 14 48 8


Long to wide with base reshape explained:

df3 <- data.frame(school = rep(1:3, each = 4), class = rep(9:10, 6),

time = rep(c(1,1,2,2), 3), score = rnorm(12))

df3

wide <- reshape(df3, idvar = c("school","class"), direction = "wide")

wide

DF<-data.frame(id=rep(paste("subject", 1:5, sep=" "), 3),

sex=rep(c("m","m","m","f","f"), 3),

time=c(rep("Time1",5), rep("Time2",5), rep("Time3",5)),

score1=rnorm(15), score2=abs(rnorm(15)*4))

DF

wide <- reshape(DF, v.names=c("score1", "score2"), idvar="id",

timevar="time", direction="wide")

DF2 <- expand.grid(market = LETTERS[1:5],

date = Sys.Date()+(0:5),

sitename = letters[1:2])

DF2$impression <- sample(100, nrow(DF2), replace=TRUE)

DF2$clicks <- sample(100, nrow(DF2), replace=TRUE)

DF2

wide <- reshape(DF2, v.names=c("impression", "clicks"), idvar=c("market",

"date"),

timevar="sitename", direction="wide")

wide

What's going on with reshape when it's long to wide:

timevar - these are the repeated measures; they may be times or locations etc. [categorical]

v.names - the repeated-measures measurements (in both of these cases we have two different variables being measured over the repeats) [numeric]

idvar - these are the variables we want to replicate and unstack to match up with the timevar and v.names columns

Basically, worry about what your repeated-measures variable is (timevar). This is not numeric but categorical. Then enter the actual measures taken at each repeated measure (v.names). These are usually numeric (though they could be categorical). Generally everything remaining is an id variable.
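A minimal sketch tying the explanation above back to an actual call; this simply restates the DF2 example from this section with each argument labelled:

# timevar = the categorical repeated-measures variable   ("sitename")
# v.names = the measurements taken at each repeat        ("impression", "clicks")
# idvar   = everything else that identifies a row        ("market", "date")
reshape(DF2, direction = "wide",
        timevar = "sitename",
        v.names = c("impression", "clicks"),
        idvar   = c("market", "date"))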


Wide to Long with > 2 Measures Per Time (FROM the wide table below, TO the long table, followed by the CODE)

id x1 x2 x3 y1 y2 y3 z1 z2 z3 v

1 1 2 4 5 10 20 15 200 150 170 2.5

2 2 3 7 6 25 35 40 300 350 400 4.2

id xsource x y v

1 1 x1 2 10 2.5

2 1 x2 4 20 2.5

3 1 x3 5 15 2.5

4 2 x1 3 25 4.2

5 2 x2 7 35 4.2

6 2 x3 6 40 4.2

x <- read.table(text="

id x1 x2 x3 y1 y2 y3 z1 z2 z3 v

1 2 4 5 10 20 15 200 150 170 2.5

2 3 7 6 25 35 40 300 350 400 4.2

", header=TRUE)

x

#===============================================================

#METHOD #1

res <- reshape(x, direction = "long", idvar = "id",

varying = list(c("x1","x2", "x3"),

c("y1", "y2", "y3"),

c("z1", "z2", "z3")),

v.names = c("x", "y", "z"),

timevar = "xsource", times = c("x1", "x2", "x3"))

res <- res[order(res$id, res$xsource), c(1,3,4,5,2)]

row.names(res) <- NULL

res

#===============================================================

#METHOD #2

chunks <- lapply(1:nrow(x),

function(i) cbind(x[i, 1], 1:3, matrix(x[i, 2:10], ncol=3), x[i, 11]))

res <- do.call(rbind, chunks)

colnames(res) <- c("id", "source", "x", "y", "z", "v")

res


Wide to Long with Multiple Measures per Time (stacking and double stacking closely examined)

(The tables below, in order: Original, Double Stack, Single Stack)

#The Data Frame

id <- paste('x', "1.", 1:10, sep="")

set.seed(10)

DF <- data.frame(id, trt=sample(c('cnt', 'tr'), 10, T),

work.T1=runif(10), play.T1=runif(10), talk.T1=runif(10),

total.T1=runif(10), work.T2=runif(10), play.T2=runif(10),

talk.T2=runif(10), total.T2=runif(10))

id trt work.T1 play.T1 talk.T1 total.T1 work.T2 play.T2 talk.T2 total.T2

1 x1.1 tr 0.65165567 0.8647212 0.53559704 0.27548386 0.3543281 0.03188816 0.07557029 0.86138244

2 x1.2 cnt 0.56773775 0.6153524 0.09308813 0.22890394 0.9364325 0.11446759 0.53442678 0.46439198

3 x1.3 cnt 0.11350898 0.7751099 0.16980304 0.01443391 0.2458664 0.46893548 0.64135658 0.22286743

4 x1.4 tr 0.59592531 0.3555687 0.89983245 0.72896456 0.4731415 0.39698674 0.52573932 0.62354960

5 x1.5 cnt 0.35804998 0.4058500 0.42263761 0.24988047 0.1915609 0.83361919 0.03928139 0.20364770

6 x1.6 cnt 0.42880942 0.7066469 0.74774647 0.16118328 0.5832220 0.76112174 0.54585984 0.01967341

7 x1.7 cnt 0.05190332 0.8382877 0.82265258 0.01704265 0.4594732 0.57335645 0.37276310 0.79799301

8 x1.8 cnt 0.26417767 0.2395891 0.95465365 0.48610035 0.4674340 0.44750805 0.96130241 0.27431890

9 x1.9 tr 0.39879073 0.7707715 0.68544451 0.10290017 0.3998326 0.08380201 0.25734157 0.16660910

10 x1.10 cnt 0.83613414 0.3558977 0.50050323 0.80154700 0.5052856 0.21913855 0.20795168 0.17015172

id trt times Work Play Talk Total

1 x1.1 tr 1 0.65165567 0.86472123 0.53559704 0.27548386

2 x1.2 cnt 1 0.56773775 0.61535242 0.09308813 0.22890394

3 x1.3 cnt 1 0.11350898 0.77510990 0.16980304 0.01443391

4 x1.4 tr 1 0.59592531 0.35556869 0.89983245 0.72896456

5 x1.5 cnt 1 0.35804998 0.40584997 0.42263761 0.24988047

6 x1.6 cnt 1 0.42880942 0.70664691 0.74774647 0.16118328

7 x1.7 cnt 1 0.05190332 0.83828767 0.82265258 0.01704265

8 x1.8 cnt 1 0.26417767 0.23958913 0.95465365 0.48610035

9 x1.9 tr 1 0.39879073 0.77077153 0.68544451 0.10290017

10 x1.10 cnt 1 0.83613414 0.35589774 0.50050323 0.80154700

11 x1.1 tr 2 0.35432806 0.03188816 0.07557029 0.86138244

12 x1.2 cnt 2 0.93643254 0.11446759 0.53442678 0.46439198

13 x1.3 cnt 2 0.24586639 0.46893548 0.64135658 0.22286743

14 x1.4 tr 2 0.47314146 0.39698674 0.52573932 0.62354960

15 x1.5 cnt 2 0.19156087 0.83361919 0.03928139 0.20364770

16 x1.6 cnt 2 0.58322197 0.76112174 0.54585984 0.01967341

17 x1.7 cnt 2 0.45947319 0.57335645 0.37276310 0.79799301

18 x1.8 cnt 2 0.46743405 0.44750805 0.96130241 0.27431890

19 x1.9 tr 2 0.39983256 0.08380201 0.25734157 0.16660910

20 x1.10 cnt 2 0.50528560 0.21913855 0.20795168 0.17015172

id trt time type measures

1 x1.1 tr 1 work 0.65165567

2 x1.2 cnt 1 work 0.56773775

3 x1.3 cnt 1 work 0.11350898

4 x1.4 tr 1 work 0.59592531

5 x1.5 cnt 1 work 0.35804998

6 x1.6 cnt 1 work 0.42880942

7 x1.7 cnt 1 work 0.05190332

8 x1.8 cnt 1 work 0.26417767

9 x1.9 tr 1 work 0.39879073

10 x1.10 cnt 1 work 0.83613414

11 x1.1 tr 2 work 0.35432806

12 x1.2 cnt 2 work 0.93643254

13 x1.3 cnt 2 work 0.24586639

14 x1.4 tr 2 work 0.47314146

15 x1.5 cnt 2 work 0.19156087

16 x1.6 cnt 2 work 0.58322197

17 x1.7 cnt 2 work 0.45947319

18 x1.8 cnt 2 work 0.46743405

19 x1.9 tr 2 work 0.39983256

20 x1.10 cnt 2 work 0.50528560

21 x1.1 tr 1 play 0.86472123

22 x1.2 cnt 1 play 0.61535242

23 x1.3 cnt 1 play 0.77510990

24 x1.4 tr 1 play 0.35556869

25 x1.5 cnt 1 play 0.40584997

26 x1.6 cnt 1 play 0.70664691

27 x1.7 cnt 1 play 0.83828767

28 x1.8 cnt 1 play 0.23958913

29 x1.9 tr 1 play 0.77077153

30 x1.10 cnt 1 play 0.35589774

31 x1.1 tr 2 play 0.03188816

32 x1.2 cnt 2 play 0.11446759

33 x1.3 cnt 2 play 0.46893548

34 x1.4 tr 2 play 0.39698674

35 x1.5 cnt 2 play 0.83361919

36 x1.6 cnt 2 play 0.76112174

37 x1.7 cnt 2 play 0.57335645

38 x1.8 cnt 2 play 0.44750805

39 x1.9 tr 2 play 0.08380201

40 x1.10 cnt 2 play 0.21913855

41 x1.1 tr 1 talk 0.53559704

42 x1.2 cnt 1 talk 0.09308813

43 x1.3 cnt 1 talk 0.16980304

44 x1.4 tr 1 talk 0.89983245

45 x1.5 cnt 1 talk 0.42263761

46 x1.6 cnt 1 talk 0.74774647

47 x1.7 cnt 1 talk 0.82265258

48 x1.8 cnt 1 talk 0.95465365

49 x1.9 tr 1 talk 0.68544451

50 x1.10 cnt 1 talk 0.50050323

51 x1.1 tr 2 talk 0.07557029

52 x1.2 cnt 2 talk 0.53442678

53 x1.3 cnt 2 talk 0.64135658

54 x1.4 tr 2 talk 0.52573932

55 x1.5 cnt 2 talk 0.03928139

56 x1.6 cnt 2 talk 0.54585984

57 x1.7 cnt 2 talk 0.37276310

58 x1.8 cnt 2 talk 0.96130241

59 x1.9 tr 2 talk 0.25734157

60 x1.10 cnt 2 talk 0.20795168

61 x1.1 tr 1 total 0.27548386

62 x1.2 cnt 1 total 0.22890394

63 x1.3 cnt 1 total 0.01443391

64 x1.4 tr 1 total 0.72896456

65 x1.5 cnt 1 total 0.24988047

66 x1.6 cnt 1 total 0.16118328

67 x1.7 cnt 1 total 0.01704265

68 x1.8 cnt 1 total 0.48610035

69 x1.9 tr 1 total 0.10290017

70 x1.10 cnt 1 total 0.80154700

71 x1.1 tr 2 total 0.86138244

72 x1.2 cnt 2 total 0.46439198

73 x1.3 cnt 2 total 0.22286743

74 x1.4 tr 2 total 0.62354960

75 x1.5 cnt 2 total 0.20364770

76 x1.6 cnt 2 total 0.01967341

77 x1.7 cnt 2 total 0.79799301

78 x1.8 cnt 2 total 0.27431890

79 x1.9 tr 2 total 0.16660910

80 x1.10 cnt 2 total 0.17015172

See below for how to stack and double stack with:

1. reshape (base function) 2. reshape (the package) 3. rbinding and cbinding


Single Stack

2 Methods using reshape from base:

#Method 1

NEW <- reshape(DF, varying=list(work= c(3, 7), play= c(4,8), talk= c(5,9), total= c(6,10) ),

v.names=c("work", "play", "talk", "total"),

# v.names is needed after changing the 'varying' arg to a list to allow 'times'

direction="long",

times=1:2, # substitutes numbers for T1 and T2

timevar="times") # to name the time col

rownames(NEW) <- 1:nrow(NEW)

#Method 2 (shorter but less explicit)

NEW <- reshape(DF, direction="long", varying=3:10, sep=".T")

rownames(NEW) <- 1:nrow(NEW)

NEW

Method from reshape package:

library(reshape)

DF2 <- melt(DF,id.vars=1:2)

DF3 <- cbind(DF2,

colsplit(as.character(DF2$variable),"\\.",

names=c("activity","times")))

## rename time, reorder factors:

DF4 <- transform(DF3,

times=as.numeric(gsub("^T","",times)),

activity=factor(activity,

levels=c("work","play","talk","total")),

id=factor(id,levels=paste("x1",1:10,sep=".")))

## reshape back to wide

DF5 <- cast(subset(DF4,select=-variable),id+trt+times~activity)

## reorder

NEW <- with(DF5,DF5[order(times,id),])

NEW

2 Methods using rbinding and cbinding:

#Method 1

DF.1 <- DF[, 1:2]

DFlist <- list(DF[, 3:6], DF[, 7:10])

lapply(seq_along(DFlist), function(x) names(DFlist[[x]]) <<-

unlist(strsplit(names(DFlist[[x]])[1:length(names(DFlist[[x]]))],

".", fixed=T))[c(T, F)]

)

repeats <- 2 #Number of repeated measures

time <- rep(1:repeats, each=nrow(DF.1))

NEW <- data.frame(DF.1[rep(seq_len(nrow(DF.1)), repeats), ], time,

do.call('rbind', DFlist))

NEW

#Method 2

DF.1 <- DF[, 1:2]

DF.2 <- DF[, 3:6]

DF.3 <- DF[, 7:10]

repeats <- 2 #Number of repeated measures

names(DF.2) <- names(DF.3) <- unlist(strsplit(names(DF.2), ".", fixed=T))[c(T,F)]

time <- rep(1:repeats, each=nrow(DF.1))

NEW <- data.frame(DF.1[rep(seq_len(nrow(DF.1)), repeats), ], time, rbind(DF.2, DF.3))

NEW

The strsplit()/gsub() pieces in the methods above are doing the same thing: turning the T1 & T2 labels into 1 & 2

Replicate and stack a subset (columns) of a data frame (repeat rows)

This is a method of stacking the same data frame x number of times (e.g., for the id variables):

dataframe[rep(seq_len(nrow(dataframe)), repeats), ]

Where: dataframe is the data frame to be repeated and stacked, and repeats is the number of times to repeat the dataframe.
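A small worked example of the template; the toy data frame here is an assumption, not from the notes:

dataframe <- data.frame(id = 1:2, score = c(10, 20))   # toy data
repeats <- 3
dataframe[rep(seq_len(nrow(dataframe)), repeats), ]    # 6 rows: the 2 originals stacked 3 times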


Single Stack (this can be useful for certain analyses or graphics such as repeated measures or faceting in ggplot2)

Method using reshape from base:

NEW2 <- reshape(NEW, direction = "long", idvar = c("id", "trt", "time"),

varying = list(c("work", "play", "talk", "total")),

v.names = c("measures"),

timevar = "type",

times = c("work", "play", "talk", "total"))

rownames(NEW2) <- 1:nrow(NEW2)

NEW2

Method from reshape package:

require(reshape)

DF2 <- melt(DF,id.vars=1:2)

DF3 <- cbind(DF2,

colsplit(as.character(DF2$variable),"\\.",

names=c("type","times")))

NEW2 <- with(DF3, DF3[, c('id', 'trt', 'times', 'type', 'value')])

levels(NEW2$times) <- 1:2

NEW2



Another Wide To Long With Awkwardly Named Columns (rename 'em for ease)

#THE DATA SET

dat <- read.table(text=" WorkerId pio_1_1 pio_1_2 pio_1_3 pio_1_4 pio_2_1 pio_2_2 pio_2_3 pio_2_4

1 1 Yes No No No No No Yes No

2 2 No Yes No No Yes No Yes No

3 3 Yes Yes No No Yes No Yes No", header=T)

redat <- dat #To reset the Data

The trick to getting the most out of reshape is to get your column names into an R-friendly format to begin with; otherwise you have to spell out for 'varying' exactly which columns stack onto which.

#METHOD 1 (Cool renaming; If you rename varying is easy)

#The "([a-z])_([0-9])_([0-9])" part says: look for a character, then "_" followed by a digit,

#then "_" followed by another digit. The "\\1_\\3\\.\\2" replacement keeps the first character

#and "_" in the first spot, then takes the last digit (\\3) and puts it second,

#then puts a period and takes the 2nd digit (\\2) and puts it 3rd.

names(dat) <- gsub("([a-z])_([0-9])_([0-9])", "\\1_\\3\\.\\2", names(dat))

#names(dat) <- gsub("([0-9])_([0-9])$", "\\2\\.\\1", names(dat)) # another way

dat2 <- reshape(dat, direction="long", varying=2:9, timevar="set", idvar=1)

row.names(dat2) <- NULL

dat2[order(dat2$WorkerId), ]

#METHOD 2 (My Method; If you rename varying is easy)

y <- do.call('rbind', strsplit(names(dat)[-1], "_"))[, c(1, 3, 2)]

names(dat) <- c(names(dat)[1], paste0(y[, 1], "_", y[, 2], ".", y[, 3]))

dat2 <- reshape(dat, varying=2:9, idvar = "WorkerId", direction="long",

timevar="set")

row.names(dat2) <- NULL

dat2[order(dat2$WorkerId, dat2$set), ]

#METHOD 3 (Using Reshape)

dat <- redat

library("reshape2")

reshape.middle <- function(dat) {

dat <- melt(dat, id="WorkerId")

dat$set <- substr(dat$variable, 5,5)

dat$name <- paste(substr(dat$variable, 1, 4),

substr(dat$variable, 7, 7),

sep="")

dat$variable <- NULL

dat <- melt(dat, id=c("WorkerId", "set", "name"))

dat$variable <- NULL

return(dcast(dat, WorkerId + set ~ name))

}

reshape.middle(dat)

#Without the rename You'd have to approach it this way

dat2 <- reshape(dat,

varying=list(pio_1= c(2, 6), pio_2= c(3,7), pio_3= c(4,8), pio_4= c(5,9) ),

v.names=c(paste0("pio_",1:4)),

idvar = "WorkerId",

direction="long",

timevar="set")

row.names(dat2) <- NULL

dat2[order(dat2$WorkerId, dat2$set), ]


Randomish Rows – Long Format to Wide w/ Missing Data

var<-c("Id", "Name", "Score", "Id", "Score", "Id", "Name")

num<-c(1, "Tom", 4, 2, 7, 3, "Jim")

format1<-data.frame(var, num)

format1

#STARTING DATAFRAME

#

# var num

# 1 Id 1

# 2 Name Tom

# 3 Score 4

# 4 Id 2

# 5 Score 7

# 6 Id 3

# 7 Name Jim

format1$ID <- cumsum(format1$var == "Id")

#ADD THE cumsum ID COLUMN (IMPORTANT FOR BOTH METHODS)

#

# var num ID

# 1 Id 1 1

# 2 Name Tom 1

# 3 Score 4 1

# 4 Id 2 2

# 5 Score 7 2

# 6 Id 3 3

# 7 Name Jim 3

# METHOD 1

format2 <- reshape(format1, idvar = "ID",timevar = "var", direction = "wide")[-1]

names(format2) <- gsub("num.", "", names(format2))

format2

#OUTCOME

#

# Id Name Score

# 1 1 Tom 4

# 4 2 <NA> 7

# 6 3 Jim <NA>

# METHOD 2

reshape(format1, idvar = "ID",timevar = "var", direction = "wide",

varying = list(c("Id", "Name", "Score")))[-1]

# METHOD 3

format1$pk <- cumsum( format1$var=="Id" )

library(reshape2)

dcast( format1, pk ~ var, value.var="num" )


Extract object names from list in a function (using both lapply and a for loop)

x <- c("yes", "no", "maybe", "no", "no", "yes")

y <- c("red", "blue", "green", "green", "orange")

list.xy <- list(x=x, y=y)

WORD.C <- function(WORDS){

require(wordcloud)

L2 <- lapply(WORDS, function(x) as.data.frame(table(x), stringsAsFactors = FALSE))

# Takes a dataframe and the text you want to display

FUN <- function(X, text){

windows() # open a new graphics device (Windows-only; dev.new() is the cross-platform equivalent)

wordcloud(X[, 1], X[, 2], min.freq=1)

mtext(text, 3, padj=-4.5, col="red") #what I'm trying that isn't working

}

# Now creates the sequence 1,...,length(L2)

# Loops over that and then create an anonymous function

# to send in the information you want to use.

lapply(seq_along(L2), function(i){FUN(L2[[i]], names(L2)[i])})

}

WORD.C2 <- function(WORDS){

require(wordcloud)

L2 <- lapply(WORDS, function(x) as.data.frame(table(x), stringsAsFactors = FALSE))

# Takes a dataframe and the text you want to display

FUN <- function(X, text){

windows()

wordcloud(X[, 1], X[, 2], min.freq=1)

mtext(text, 3, padj=-4.5, col="red") #what I'm trying that isn't working

}

# you could use i in seq_along(L2)

# instead of 1:length(L2) if you wanted to

for(i in 1:length(L2)){

FUN(L2[[i]], names(L2)[i])

}

}

WORD.C(list.xy)

WORD.C2(list.xy)


Working on Dataframes in Lists & Acting on Global Environment Variables #CREATE A FAKE DATA SET

df <- data.frame(

x.2=rnorm(25),

y.2=rnorm(25),

g=rep(factor(LETTERS[1:5]), 5)

)

#Strip a Particular Column From Every data Frame in the List

LIST <- split(df, df$g) #split it into a list of data frames

NAMES <- names(LIST) #save the names of this for later use as they may be stripped

LIST <- lapply(seq_along(LIST), function(x) as.data.frame(LIST[[x]])[, 1:2])

LIST

#Change All Variable Names of Data Frames in a List

LIST <- lapply(LIST, function(x) {

names(x) <- unlist(strsplit(names(x)[1:length(names(x))],

".", fixed=T))[c(T, F)]

return(x)

}

)

LIST

#Rename All the Data Frames in the List

names(LIST) <- NAMES

LIST

#Assign Data Frames in a List to Objects in The Global Environment

lapply(seq_along(LIST),

function(x) {

assign(c("V", "W", "X", "Y", "Z")[x], LIST[[x]], envir=.GlobalEnv)

}

)

V; W #etc

#Use Global Assignment to Change All Variable Names of Data Frames in a List

lapply(seq_along(LIST), function(x) names(LIST[[x]]) <<-

unlist(strsplit(names(LIST[[x]])[1:length(names(LIST[[x]]))],

".", fixed=T))[c(T, F)]

)

LIST

#Rename All the Data Frames in the List Using Global Assignment

lapply(seq_along(LIST), function(x) {names(LIST)[[x]] <<- NAMES[x]})

LIST


do.call, replicate, split

do.call (take a list, apply a function)

#the arguments passed to do.call must be in a list

mtcars2 <- as.list(mtcars)

#do.call with rbind and dataframe

do.call('rbind', mtcars2)

do.call('data.frame', mtcars2)

#to use with paste we have to pass the separator that paste takes

mtcars2$sep <- "HELLO"

do.call('paste', mtcars2)

Classic Use of split, lapply and do.call (split by factor(s), apply a function, put back together)

Note: consider using by and tapply as well

LIST <- split(mtcars, mtcars$cyl)

MEANS <- lapply(LIST, colMeans)

row2col(do.call('rbind', MEANS), 'cyl')

#notice we split by two factors

LIST2 <- split(mtcars, list(mtcars$cyl, mtcars$carb))

MEANS2 <- lapply(LIST2, colMeans)

OC <- row2col(do.call('rbind', MEANS2), 'cyl.carb')

replacer(OC, NaN, NA)
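row2col() and replacer() look like custom helpers from the author's .Rprofile (an assumption). A base-only sketch of the same split/lapply/do.call pattern, moving the grouping values back into a column by hand:

LIST  <- split(mtcars, mtcars$cyl)
MEANS <- do.call('rbind', lapply(LIST, colMeans))          # matrix; cyl values become row names
data.frame(cyl = as.numeric(rownames(MEANS)),
           MEANS[, colnames(MEANS) != "cyl"], row.names = NULL)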

Use replicate to repeat a function over and over and then do.call('rbind', ) the results together

# Create some fake data.

dat <- rnorm(200)

# Get a sample of size 5 from this without replacement

sample(dat, 5)

# Do this 10 times

replicate(10, sample(dat, 5))

#replicate finding means

replicate(10, colMeans(mtcars))

#replicate and paste a data frame

do.call('rbind', replicate(10, data.frame(a=1:10, b=letters[1:10],

c=state.name[1:10]), simplify=F))


Data Table

Sample Stats

require(data.table)

dat <- data.table(iris)

x <- dat[,list(mean=mean(Sepal.Length), sd=sd(Sepal.Length)),by=Species]

rownamer(x)
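rownamer() appears to be another custom helper (an assumption); the data.table result is already tabular, so base coercion covers the same ground:

as.data.frame(x)           # plain data.frame, one row per Species
x[Species == "setosa"]     # data.table rows can also be filtered by column directly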


Look Up Tables & Dictionaries

#Create a Data Set to Match

test1<-(structure(list(person = structure(1:7, .Label = c("A", "B", "C",

"D", "E", "F", "G"), class = "factor"), age = c(7L, 22L, 65L,

32L, 14L, 53L, 23L)), .Names = c("person", "age"), class = "data.frame", row.names = c(NA, -7L)))

test2<-(structure(list(Lower_limit = c(5L, 15L, 25L, 45L), Upper_limt = c(15L,

25L, 45L, 100L), support = c(10L, 20L, 30L, 40L)), .Names = c("Lower_limit",

"Upper_limt", "support"), class = "data.frame", row.names = c(NA,

-4L)))

test1; test2

# Merge (slow)

test1$sup1<-cut(test1$age, breaks=unique(unlist(test2[, 1:2])))

key <- data.frame(sup1=levels(test1$sup1), support=test2$support)

test3 <- merge(test1, key, sort=FALSE)[, -1]

test3 <- test3[order(test3$person), ]

rownames(test3) <- 1:nrow(test3)

test3

# Match

test1$sup1<-cut(test1$age, breaks=unique(unlist(test2[, 1:2])))

key <- data.frame(sup1=levels(test1$sup1), support=test2$support)

test1$support <- key[match(test1$sup1, key$sup1), 2]

test1[, -3]

# Hash Table

test1$sup1<-cut(test1$age, breaks=unique(unlist(test2[, 1:2])))

key <- data.frame(sup1=levels(test1$sup1), support=test2$support)

hash <- function(x, type = "character") {

e <- new.env(hash = TRUE, size = nrow(x), parent = emptyenv())

char <- function(col) assign(col[1], as.character(col[2]), envir = e)

num <- function(col) assign(col[1], as.numeric(col[2]), envir = e)

FUN <- if(type=="character") char else num

apply(x, 1, FUN)

return(e)

}

KEY <- hash(key, type="numeric")

type <- function(x) if(exists(x, env = KEY))get(x, e = KEY) else NA

test1$support <- sapply(as.character(test1$sup1), type)

test1[, -3]

# Indexing

test1$sup1<-cut(test1$age, breaks=unique(unlist(test2[, 1:2])))

key2 <- c(test2$support);names(key2) <- levels(test1$sup1) #lookup table

transform(test1, support=key2[sup1])[, -3]

# data.table

library(data.table)

test1$sup1<-cut(test1$age, breaks=unique(unlist(test2[, 1:2])))

key <- data.frame(sup1=levels(test1$sup1), support=test2$support)

dtKEY <- data.table(key, key="sup1")

test1$support <- dtKEY[J(test1$sup1), ][[2]]

test1[, -3]

# qdap lookup (hash based)

library(qdap)

test1$sup1 <-cut(test1$age, breaks=unique(unlist(test2[, 1:2])))

key <- data.frame(sup1=levels(test1$sup1), support=test2$support)

test1$support <- lookup(cut(test1$age, c(5, 15, 25, 45, 100)), key)

test1 is the data frame to match against; test2 is the dictionary/lookup table.


ggplot2

Globally Alter background color

Globally Reset Background Color: theme_set(theme_gray()) / theme_set(theme_bw())

White Background, Gray Grid: + theme(panel.grid.major = element_blank())

White Background, No Grid: + theme_bw() + theme(panel.grid.major=element_blank(), panel.grid.minor=element_blank())

library(ggplot2)

x <- ggplot(CO2, aes(x=uptake, group=Plant))

y <- x + geom_density(aes(colour=Plant)) + facet_grid(Type~Treatment)

y + theme_bw() + theme(panel.grid.major=element_blank(),panel.grid.minor=element_blank())

Change Background Color: + theme(panel.background = element_rect(fill='green', colour='red'))

Change Margins Color: + theme(plot.background = element_rect(fill='green', colour='red'))

theme_new <- theme_update(

panel.background = element_rect(fill="gray20")

)

new <- theme_set(theme_new)

theme_set(new)

ggplot(data = mtcars, aes(mpg, wt)) + geom_point(colour="yellow")

theme_new <- theme_update(

panel.background = element_rect(fill="red")

)

new <- theme_set(theme_new)

theme_set(new)

ggplot(data = mtcars, aes(mpg, wt)) + geom_point(colour="yellow")

theme_set(theme_gray())

ggplot(data = mtcars, aes(mpg, wt)) + geom_point(colour="yellow") +

theme_new #not global change


Change the ggplot2 Palette and Apply Colors to Individual Observations: + scale_colour_identity()
First create a color palette from hexadecimal colors (the col vector below). Then assign those colors to groups or observations and add + scale_colour_identity().

Change Color Intensity (saturation/chroma & luminance): scale_fill_hue(h = c(0, 360) + 15, l = 65, c = 100) #defaults shown

df <- data.frame(cond = c("A", "B", "C"), yval = c(2, 4, 3)) #small example data set (any cond/yval data frame works)

ggplot(df, aes(x=cond, y=yval, fill=cond)) + geom_bar(stat="identity") + scale_fill_hue(c=15, l=10)

ggplot(df, aes(x=cond, y=yval, fill=cond)) + geom_bar(stat="identity") + scale_fill_hue(c=85, l=10)

ggplot(df, aes(x=cond, y=yval, fill=cond)) + geom_bar(stat="identity") + scale_fill_hue(c=85, l=90)

col <- c("#000000", "#FF0000",

"0033000")

mtcars$col <- col[1]

mtcars$col[5:6] <- col[2:3]

p <- ggplot(mtcars, aes(x=wt, y=mpg,

label=rownames(mtcars)))

p + geom_text(data=mtcars,

aes(colour=col), size=2) + scale_colour_identity()


Map a Numeric Variable on a Color Continuim scale_colour_gradient(low='col1', high='col2')

Change Color of Factors and Reorder Legend + scale_colour_manual(values = cols)

#Examples

gradient_rb <- scale_colour_gradient(low='blue', high='red')

p <- ggplot(mtcars, aes(x=wt, y=mpg,

label=rownames(mtcars)))

p + geom_text(data=mtcars,

aes(colour=mpg), size=3)+ gradient_rb

#example 2

p + geom_point(data=mtcars,

aes(colour=mpg), size=2)+ gradient_rb

#Examples

p <- ggplot(mtcars, aes(x=wt, y=mpg,

label=rownames(mtcars)))

w <- p + geom_point(data=mtcars,

aes(colour=cyl), size=3)

w

w + scale_colour_manual(values = c("red","blue", "green"))

w + scale_colour_manual( #specify who takes what color

values = c("8" = "red","4" = "blue","6" = "green"))

cols <- c("8" = "red","4" = "blue","6" = "darkgreen", "10" = "orange")

w + scale_colour_manual(values = cols)

#breaks allows you to specify which factor gets what color

w + scale_colour_manual(values = cols, breaks = c("4", "6", "8"))

w + scale_colour_manual(values = cols, breaks = c("8", "6", "4"))

w + scale_colour_manual(values = cols, breaks = c("4", "6", "8"),

labels = c("four", "six", "eight"))

#plot just some of the groups (below 6 cyl not plotted)

w + scale_colour_manual(values = cols, limits = c("4", "8"))


Change Color Palette for bar graphs (items you fill in): + scale_fill_manual()

Adjust Transparency (an argument to many geoms): alpha=

Symbols and Color Fills: symbols 21:25 are fillable

#EXAMPLE

library(ggplot2)

cbbFillPalette <- scale_fill_manual(values=c("#000000", "#E69F00", "#56B4E9"))

cbbFillPalette2 <- scale_fill_manual(values=c("red", "blue", "brown"))

mtcars$cyl <- as.factor(mtcars$cyl) #make cylinder a factor

ggplot(mtcars, aes( x=cyl, fill=cyl)) + geom_bar() + cbbFillPalette

ggplot(mtcars, aes( x=cyl, fill=cyl)) + geom_bar() + cbbFillPalette2

#EXAMPLES

library(ggplot2)

#EX1

x <- ggplot(mtcars, aes(factor(cyl)))

x + geom_bar(fill = "dark grey", colour = "black", alpha = 1/3)

#EX2

df <- data.frame(x = rnorm(5000), y = rnorm(5000))

h <- ggplot(df, aes(x,y))

h + geom_point(alpha = 0.5)

h + geom_point(alpha = 1/10)

x <- ggplot(mtcars, aes(x=hp, y=mpg))

x + geom_point(shape = 21, size = 4,

colour = "red", fill = "black")

df2 <- data.frame(x = 1:5 , y = 1:25,

z = 1:25)

s <- ggplot(df2, aes(x = x, y = y))

s + geom_point(aes(shape = z), size = 4,

colour = "red", fill = "black") +

scale_shape_identity()


Fill By 2 or More Combined Variables

library(ggplot2); library(RColorBrewer)

dat <- data.frame(category = c("A","A","B","B","C","C","D","D"),

variable = c("inclusion","exclusion","inclusion","exclusion",

"inclusion", "exclusion","inclusion","exclusion"),

value = c(60,20,20,80,50,55,25,20))

#FILL BY 1 VARIABLE

colors <- c("#FF0000","#990000")

ggplot(dat, aes(category, value, fill = variable)) +

geom_bar(stat="identity") + scale_fill_manual(values = colors)

#FILL BY 2 VARIABLES

dat$grp <- paste2(dat[, 1:2], sep=" ") # create a combined variable (paste2 comes from the qdap package)

ggplot(dat, aes(category, value, fill = grp)) +

geom_bar(stat="identity") +

scale_fill_manual(values = brewer.pal(8,"Reds"))


Annotations and Text

Correct Approach to Plotting Annotations (text not found in the original data frame): create a separate data frame with the text and locations, and pass that data frame to geom_text.

#Original data frame

data2 <- read.table(text= "type value time year

1 NA* 0.90 3 2008

3 EDS 0.01 3 2008

4 KIU 0.01 3 2008

5 MVH 0.09 3 2008

6 LAK 0.00 3 2008

7 NA* 0.80 6 2007

9 EDS 0.05 6 2007

10 KIU 0.00 6 2007

11 MVH 0.15 6 2007

12 LAK 0.00 6 2007

13 NA* 0.41 15 2007

15 EDS 0.04 15 2007

16 KIU 0.03 15 2007

17 MVH 0.52 15 2007

18 LAK 0.00 15 2007

19 NA* 0.23 27 2006

21 EDS 0.11 27 2006

22 KIU 0.02 27 2006

23 MVH 0.64 27 2006

24 LAK 0.01 27 2006", header=T)

#create separate text data frame

data2.labels <- data.frame(

time = c(7, 15),

value = c(.9, .6),

label = c("correct color", "another correct color!"),

type = c("NA*", "MVH")

)

ggplot(data2, aes(x=time, y=value, group=type, col=type))+

geom_line()+

geom_point()+

theme_bw() +

#pass the new data frame to geom_text so it doesn't print 1000x

geom_text(data = data2.labels, aes(x = time, y = value, label = label))

Greek Letters and Other Plotmath Text: separate the words with the tilde (~) symbol.

d <- data.frame(x=1:3,y=1:3)

qplot(x, y, data=d) +

geom_text(aes(2, 2, label="rho~and~some~other~text"), parse=TRUE)


Adjust Size Difference Ratio: + scale_size(range = c(x, y)) or + scale_size_continuous(range = c(x, y))

Change Aspect Ratio of the Plot Region: + coord_equal(ratio = 5)

qplot(mpg, wt, data = mtcars) + coord_equal(ratio = 5)

qplot(mpg, wt, data = mtcars) + coord_equal(ratio = 1)

qplot(mpg, wt, data = mtcars) + coord_equal(ratio = 1/5)

p <- ggplot(mtcars, aes(hp, as.factor(cyl))) +

geom_point(aes(size=mpg))

p

p + scale_size(range = c(2, 10))

p + scale_size_continuous(range = c(3,8))

p + scale_size_continuous(range = c(.05,15))


Add a title: + ggtitle("Title text") or + labs(title="Title text")
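A quick sketch (mtcars used only for illustration):

library(ggplot2)
qplot(mpg, wt, data = mtcars) + ggtitle("Weight vs. Miles per Gallon")
qplot(mpg, wt, data = mtcars) + labs(title = "Weight vs. Miles per Gallon") #same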


Legends

Legend Manipulation: + guides()

library(reshape2) # for melt

df <- melt(outer(1:4, 1:4), varnames = c("X1", "X2"))

p1 <- ggplot(df, aes(X1, X2)) + geom_tile(aes(fill = value))

# Basic form

p1 + scale_fill_continuous(guide = "legend")

p1 + scale_fill_continuous(guide = guide_legend())

# Guide title

p1 + scale_fill_continuous(guide = guide_legend(title = "V")) # title text

p1 + scale_fill_continuous(name = "V") # same

p1 + scale_fill_continuous(guide = guide_legend(title = NULL)) # no title

# Control styles

# key size

p1 + guides(fill = guide_legend(keywidth = 3, keyheight = 1))

# title position

p1 + guides(fill = guide_legend(title = "LEFT", title.position = "left"))

# title text styles via element_text

p1 + guides(fill = guide_legend(

title.theme = element_text(size=15, face="italic", colour="red", angle=45)))

p1 + guides(fill = guide_legend(label.position = "bottom"))

# label styles

p1 + scale_fill_continuous(breaks = c(5, 10, 15),

labels = paste("long", c(5, 10, 15)),

guide = guide_legend(direction = "horizontal", title.position = "top",

label.position="bottom", label.hjust = 0.5, label.vjust = 0.5,

label.theme = element_text(angle = 90)))

# Set aesthetic of legend key

# very low alpha value make it difficult to see legend key

p3 <- qplot(carat, price, data = diamonds, colour = color,

alpha = I(1/100))

p3

# override.aes overwrites the alpha

p3 + guides(colour = guide_legend(override.aes = list(alpha = 1)))

# multiple row/col legends

p <- qplot(1:20, 1:20, colour = letters[1:20])

p + guides(col = guide_legend(nrow = 8))

p + guides(col = guide_legend(ncol = 8))

p + guides(col = guide_legend(nrow = 8, byrow = TRUE))

p + guides(col = guide_legend(ncol = 8, byrow = TRUE))

# reversed order legend

p + guides(col = guide_legend(reverse = TRUE))



Change Legend Title: + labs(shape=, colour=, fill=, linetype=, etc.)

Change Legend Position: + theme(legend.position = 'left') #directional input, or + theme(legend.position = c(0.5, 0.5)) #coordinate input

Eliminate Legend: + theme(legend.position = "none") (see the example after the position examples below)

library(ggplot2)

data(iris)

ggplot(iris, aes(x=Sepal.Length, y=Sepal.Width)) +

geom_point(aes(shape=Species, colour=Petal.Width)) +

scale_colour_gradient() +

labs(shape="Species label", colour="Petal width label")

library(ggplot2)

xy <- data.frame(x=1:10, y=10:1, type = rep(LETTERS[1:2], each=5))

plot <- ggplot(data = xy)+ geom_point(aes(x = x, y = y, color=type))

plot

plot + theme(legend.position = 'left')

plot + theme(legend.position = 'bottom')

plot + theme(legend.position = c(0.5, 0.5))

plot + theme(legend.position = c(0.9, 0.9))
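Removing the legend entirely uses the same plot object from above:

plot + theme(legend.position = "none")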


Share Legend

p1 <- ggplot(subset(mtcars, cyl == 4), aes(wt, cyl, colour = mpg)) +

geom_point()

p2 <- ggplot(subset(mtcars, cyl == 8), aes(wt, hp, colour = mpg)) +

geom_point() + guides(colour=FALSE)

library(gridExtra)

grid.draw(cbind(ggplotGrob(p2), ggplotGrob(p1), size="last"))

## Extract the legend as a grob

tmp <- ggplot_gtable(ggplot_build(p2))

leg <- which(sapply(tmp$grobs, function(x) x$name) == "guide-box")

legend <- tmp$grobs[[leg]]

# Plot objects using widths and height and respect to fix aspect ratios

# We make a grid layout with 3 columns, one each for the plots and one for the legend

grid.newpage()

pushViewport( viewport( layout = grid.layout( 1 , 3 , widths = unit( c( 0.4 , 0.4 ,

0.2 ) , "npc" ) ,heights = unit( c( 0.45 , 0.45 , 0.45 ) , "npc" ) , respect =

matrix(rep(1,3),1) ) ) )

print( p1 + theme(legend.position="none") , vp = viewport( layout.pos.row = 1 ,

layout.pos.col = 1 ) )

print( p2 + theme(legend.position="none") , vp = viewport( layout.pos.row = 1,

layout.pos.col = 2 ) )

upViewport(0)

vp3 <- viewport( width = unit(0.2,"npc") , x = 0.9 , y = 0.5)

pushViewport(vp3)

grid.draw(legend)

popViewport()


Continuous Legend: + guides(fill = guide_colorbar())

Reverse Order Legend: + guides(fill = guide_legend(reverse = TRUE))

Change Legend Symbols: library(grid); grid.gedit("^key-[-0-9]+$", label = "NEW_SYMBOL")

library(reshape2) # for melt

df <- melt(outer(1:4, 1:4), varnames = c("X1", "X2"))

p1 <- ggplot(df, aes(X1, X2)) + geom_tile(aes(fill = value))

p1 + guides(fill = guide_colorbar(barwidth = 0.5, barheight = 10))

p1 + guides(fill = guide_colorbar(label = FALSE))

p1 + guides(fill = guide_colorbar(ticks = FALSE))

p1 + guides(fill = guide_colorbar(label.position = "left"))

p1 + guides(fill = guide_colorbar(label.theme = element_text(colour="blue")))

p1 + scale_fill_continuous(limits = c(0,20), breaks=c(0, 5, 10, 15, 20),

guide = guide_colorbar(nbin=100, draw.ulim = FALSE, draw.llim = FALSE))

p1 + guides(fill = guide_colorbar(direction = "horizontal",

label.theme = element_text(colour="blue")))

#EXAMPLE

library(ggplot2)

p <- ggplot(diamonds, aes(clarity, fill=cut)) + geom_bar()

p

p + guides(fill = guide_legend(reverse = TRUE))

#EXAMPLE

df <- expand.grid(x = factor(seq(1:5)), y =

factor(seq(1:5)), KEEP.OUT.ATTRS = FALSE)

df$Count <- seq(1:25)

# A plot

library(ggplot2)

p <- ggplot(data = df, aes( x = x, y = y,

label = Count, size = Count)) +

geom_text() +

scale_size(range = c(2, 10))

p

library(grid)

grid.gedit("^key-[-0-9]+$", label = ":)")


Custom Legend

library(ggplot2)

df <- data.frame(gp = factor(rep(letters[1:3], each = 10)), y = rnorm(30))

library(plyr)

ds <- ddply(df, .(gp), summarise, mean = mean(y), sd = sd(y))

ggplot(df, aes(x = gp, y = y)) +

geom_point(aes(colour="data")) +

geom_point(data = ds, aes(y = mean, colour = "mean"), size = 3) +

scale_colour_manual("Legend", values=c("mean"="red", "data"="black"))

library(reshape2)

# in long format

dsl <- melt(ds, value.name = 'y')

# add variable column to df data.frame

df[['variable']] <- 'data'

# combine

all_data <- rbind(df,dsl)

# drop sd rows

data_w_mean <- subset(all_data,variable != 'sd',drop = T)

# create vectors for use with scale_..._manual

colour_scales <- setNames(c('black','red'),c('data','mean'))

size_scales <- setNames(c(1,3),c('data','mean') )

ggplot(data_w_mean, aes(x = gp, y = y)) +

geom_point(aes(colour = variable, size = variable)) +

scale_colour_manual(name = 'Type', values = colour_scales) +

scale_size_manual(name = 'Type', values = size_scales)

dsl_mean <- subset(dsl,variable != 'sd',drop = T)

ggplot(df, aes(x = gp, y = y, colour = variable, size = variable)) +

geom_point() +

geom_point(data = dsl_mean) +

scale_colour_manual(name = 'Type', values = colour_scales) +

scale_size_manual(name = 'Type', values = size_scales)

Remove Diagonal Lines (from legend keys): show_guide=FALSE

ggplot(mtcars, aes(factor(cyl), fill=am, group=am)) +

geom_bar(colour="black")

ggplot(mtcars, aes(factor(cyl), fill=am, group=am)) +

geom_bar() + geom_bar(colour="black", show_guide=FALSE)


Eliminate Vertical/Horizontal Grid Lines

#MAKE A DATA SET

library(ggplot2); set.seed(10)

CO3 <- data.frame(id=1:nrow(CO2), CO2[, 2:3],

outcome=factor(sample(c('none', 'some', 'lots', 'tons'),

nrow(CO2), rep=T), levels=c('none', 'some', 'lots', 'tons')))

x <- ggplot(CO3, aes(x=outcome)) + geom_bar(aes(x=outcome))+

facet_grid(Treatment~Type, margins='Treatment', scales='free') +

theme_bw() + theme(axis.text.x=element_text(angle= 45, vjust=1, hjust= 1))

#REMOVE LINES

x + theme(panel.grid.major.y = element_blank(), panel.grid.minor.y = element_blank())

x + theme(panel.grid.major.x = element_blank(), panel.grid.minor.x = element_blank())

Equal distance between bars

df <- structure(list(ID = structure(c(1L, 1L, 1L, 1L, 1L, 2L, 2L, 2L,

2L, 2L, 2L, 3L, 3L, 3L, 3L, 3L, 3L, 4L, 4L, 4L, 4L, 4L, 4L, 5L,

5L, 5L, 5L, 5L, 5L, 6L, 6L, 6L, 6L, 6L, 6L, 7L, 7L, 7L), .Label = c("1",

"2", "3", "4", "5", "6", "7"), class = "factor"), TYPE = structure(c(1L,

2L, 3L, 4L, 5L, 1L, 2L, 3L, 4L, 5L, 6L, 1L, 2L, 3L, 4L, 5L, 6L,

1L, 2L, 3L, 4L, 5L, 6L, 1L, 2L, 3L, 4L, 5L, 6L, 1L, 2L, 3L, 4L,

5L, 6L, 1L, 2L, 3L), .Label = c("1", "2", "3", "4", "5", "6",

"7", "8"), class = "factor"), TIME = structure(c(2L, 2L, 2L,

2L, 2L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 2L, 2L,

2L, 2L, 2L, 2L, 1L, 1L, 1L, 1L, 1L, 1L, 2L, 2L, 2L, 2L, 2L, 2L,

1L, 1L, 1L), .Label = c("1", "5", "15"), class = "factor"), VAL = c(0.94,

0.52, 0.28, 0.97, 0.12, 0.05, 0.47, 0.62, 0.2, 0.73, 1, 0.98,

0.67, 0.29, 0.17, 0.86, 0.17, 0.83, 0.62, 0.79, 0.76, 0.43, 0.61,

0.18, 0.53, 0.49, 0.47, 0.07, 0.7, 0.23, 0.36, 0.52, 0.26, 0.15,

0.01, 0.46, 0.92, 0.23), w = c(0.675, 0.675, 0.675, 0.675, 0.675,

0.9, 0.9, 0.9, 0.9, 0.9, 0.9, 0.9, 0.9, 0.9, 0.9, 0.9, 0.9, 0.675,

0.675, 0.675, 0.675, 0.675, 0.675, 0.9, 0.9, 0.9, 0.9, 0.9, 0.9,

0.675, 0.675, 0.675, 0.675, 0.675, 0.675, 0.9, 0.9, 0.9)), .Names = c("ID",

"TYPE", "TIME", "VAL", "w"), row.names = c(NA, -38L), class = "data.frame")

ggplot(df, aes(x=ID, y=VAL, fill=TYPE)) +

facet_wrap(~TIME, ncol=1) +

geom_bar(position="stack",stat = "identity") +

coord_flip()

ggplot(df, aes(x=ID, y=VAL, fill=TYPE)) +

facet_wrap(~TIME, ncol=1, scale="free") +

geom_bar(position="stack",stat = "identity") +

coord_flip()

df$w <- 0.9

df$w[df$TIME == 5] <- 0.9 * 3/4

ggplot(df, aes(x=ID, y=VAL, fill=TYPE)) +

facet_wrap(~TIME, ncol=1, scale="free") +

geom_bar(position="stack",aes(width = w),stat = "identity") +

coord_flip()


Faceting

Faceted Plot

library(ggplot2)

qplot(mpg, wt, data=mtcars) + facet_grid(cyl ~ vs)

Faceted Plot Margins (including plotting just one margin)

library(ggplot2)

qplot(mpg, wt, data=mtcars) + facet_grid(cyl ~ vs, margins=TRUE)

qplot(mpg, wt, data=mtcars) + facet_grid(cyl ~ vs, margins='vs')

qplot(mpg, wt, data=mtcars) + facet_grid(cyl ~ vs, margins='cyl')


Facet Labels on Top: library(ggplot2); + facet_wrap(~Species, ncol=1) or + facet_wrap(~Species, nrow = 3)

Change Facet Labels

#Data Source and plot
library(ggplot2); library(directlabels)

x <- ggplot(CO2, aes(x=uptake, group=Plant))

y <- x + geom_density(aes(colour=Plant))

y + facet_grid(Type~Treatment)

#method 1: does not alter the data
mf_labeller <- function(var, value){

value <- as.character(value)

if (var=="Treatment") {

value[value=="nonchilled"] <- "Var 1"

value[value=="chilled"] <- "Var 2"

}

return(value)

}

y + facet_grid(Type~Treatment, labeller=mf_labeller)

#method 2: faster but alters the data
levels(CO2$Treatment) <- c("Var 1", "Var 2")

library(ggplot2); library(directlabels)

x <- ggplot(CO2, aes(x=uptake, group=Plant))

y <- x + geom_density(aes(colour=Plant))

y + facet_grid(Type~Treatment)

library(ggplot2)

ggplot(iris, aes(Petal.Length)) + stat_bin() + facet_grid(Species ~ .)

ggplot(iris, aes(Petal.Length)) + stat_bin() + facet_wrap(~Species,nrow = 3)

ggplot(iris, aes(Petal.Length)) + stat_bin() + facet_wrap(~Species, ncol=1)

ggplot(iris, aes(Petal.Length)) + stat_bin() + facet_wrap(~Species,nrow = 4)


Change the (all) facet_grid margin label

label_renamemargin_gen <- function(newname="Total") {

function(variable, value) {

value <- as.character(value)

value[value == "(all)"] <- newname

value

}

}

ggplot(mtcars, aes(cyl)) +

geom_point(stat="bin", size = 2,

aes(shape = factor(gear)), position = "stack") +

facet_grid(carb ~ gear, margins = TRUE,

labeller=label_renamemargin_gen("Total"))

Adjust Facet Labels and Boxes

library(ggplot2)

x <- ggplot(CO2, aes(x=uptake, group=Plant))

x + geom_density(aes(colour=Plant)) +

facet_grid(Type~Treatment)+

theme(strip.text.x = element_text(size=8, angle=75),

strip.text.y = element_text(size=12, face="bold"),

strip.background = element_rect(colour="red", fill="#CCCCFF"))

Eliminate Background Color and Maintain Facet Boxes + theme_bw()

ggplot(CO2, aes(conc)) + geom_density() +

facet_grid(Type~Treatment) +

theme(panel.background = element_blank())

#basically don't use panel.background for this

ggplot(CO2, aes(conc)) + geom_density() +

facet_grid(Type~Treatment) +

#theme(panel.background = element_blank()) +

theme_bw()


Annotate one box in facet_grid

library(ggplot2)

p <- ggplot(mtcars, aes(mpg, wt)) + geom_point()

p <- p + facet_grid(. ~ cyl)

#create a new data frame with the info

ann_text <- data.frame(mpg = 15,wt = 5,lab = "Text",

cyl = factor(8,levels = c("4","6","8")))

p + geom_text(data = ann_text,label = "Text")

Annotate every box in facet_grid

#make a few numeric into factors

mtcars[, c("cyl", "am", "gear")] <- lapply(mtcars[, c("cyl", "am", "gear")], as.factor)

#plot it with no annotations

p <- ggplot(mtcars, aes(mpg, wt, group = cyl)) +

geom_line(aes(color=cyl)) +

geom_point(aes(shape=cyl)) +

facet_grid(gear ~ am) +

theme_bw()

p

#find number of facets

len <- length(levels(mtcars$gear)) * length(levels(mtcars$am))

#make a data frame with coordinates, facet variable levels, labels

vars <- data.frame(expand.grid(levels(mtcars$gear), levels(mtcars$am)))

colnames(vars) <- c("gear", "am")

dat <- data.frame(x = rep(15, len), y = rep(5, len), vars,

labs=LETTERS[1:len])

#use geom_text to annotate (notice group=NULL)

p + geom_text(aes(x, y, label=labs, group=NULL),data=dat)

#change just one location

dat[1, 1:2] <- c(30, 2) #to change specific locations

p + geom_text(aes(x, y, label=labs, group=NULL), data=dat)

#use math plotting

p + geom_text(aes(x, y, label=paste("beta ==", labs),

group=NULL), size = 4, color = "grey50", data=dat, parse = T)

#the dat data frame created above:
#  x y gear am labs
#1 15 5    3  0    A
#2 15 5    4  0    B
#3 15 5    5  0    C
#4 15 5    3  1    D
#5 15 5    4  1    E
#6 15 5    5  1    F


Axis Adjustments

Eliminate Space at Bottom of Barplots: + scale_y_continuous(expand = c(0,0))

Reverse/Flip Axes: + coord_flip()

#EXAMPLE

qplot(1:10, geom = 'bar')

qplot(1:10, geom = 'bar') + scale_y_continuous(expand = c(0,0))

#Examples

qplot(cut, price, data=diamonds, geom="boxplot")

last_plot() + coord_flip()

qplot(cut, data=diamonds, geom="bar")

last_plot() + coord_flip()


Adjust Axis Labels (title position): + theme(axis.title.x = element_text(vjust=-0.5)) #vertical, or + theme(axis.title.x = element_text(hjust=0.25)) #horizontal

Axis Label Names: + labs(x = "x", y = "y") OR + xlab("x") + ylab("y")

p <- qplot(mpg, wt, data = mtcars)

p

p + xlab("Vehicle Weight") + ylab("Miles per Gallon")

# Or

p + labs(x = "Vehicle Weight", y = "Miles per Gallon")
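The element_text() position tweaks from the heading above can be tacked onto the same p object; a minimal sketch:

p + labs(x = "Miles per Gallon", y = "Vehicle Weight") +
  theme(axis.title.x = element_text(vjust = -0.5),
        axis.title.y = element_text(hjust = 0.25))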


Dendrograms with ggplot2: library(ggplot2); library(ggdendro)

library(ggplot2)

library(ggdendro)

data(mtcars)

x <- as.matrix(scale(mtcars))

dd.row <- as.dendrogram(hclust(dist(t(x))))

ddata_x <- dendro_data(dd.row)

p <- ggplot(segment(ddata_x)) +

geom_segment(aes(x=x, y=y, xend=xend, yend=yend)) +

scale_y_continuous(trans = 'reverse')

p + geom_text(data=label(ddata_x),

aes(label=label, x=x, y=0), hjust=0) +

coord_flip()

Initial Between-Variable Data Visualization (scatterplot matrix)

library(ggplot2)

library(GGally)

ggpairs(iris, colour='Species', alpha=0.4)

ggpairs(CO2, colour ='Type', alpha=0.4)

mtcars$cyl <- factor(mtcars$cyl)

ggpairs(mtcars, colour ='cyl', alpha=0.4)


Nice reference to this:

http://stackoverflow.com/questions/8112208/how-can-i-obtain-an-unbalanced-grid-of-ggplots

Combine Two Plots (even faceted plots): library(gridExtra); grid.arrange(plot.1, ..., plot.n)

library(ggplot2)

p1 <- ggplot(mtcars[mtcars$cyl!=8,], aes(mpg, wt))+

geom_point()+

facet_wrap( ~ cyl)

p2 <- ggplot(mtcars[mtcars$cyl!=8,], aes(mpg, wt))+

geom_point()+

facet_grid(am ~ cyl)+

theme( axis.text.y = element_blank(),

axis.text.x = element_blank(),

axis.title.y = element_blank(),

axis.ticks = element_blank(),

#strip.background = element_blank(),

strip.text.x = element_blank())

library(gridExtra)

grid.arrange(p1,p2,

main ="this is a title", left =

"This is my global Y-axis title")


Add a table to a grid plot #1 (superimposed on the panel): + annotation_custom(grob, xmin = -Inf, xmax = Inf, ymin = -Inf, ymax = Inf) (a sketch of this approach follows the width-controlled example below)

Add a table to a grid plot #2 (can't superimpose): library(gridExtra); tableGrob(); ?tableGrob

Add Table to Plot (control widths)

my_hist <- ggplot(diamonds, aes(clarity, fill=cut)) + geom_bar()

my_table<- tableGrob(head(diamonds)[,1:3],

gpar.coretext = gpar(fontsize=8),gpar.coltext=gpar(fontsize=8),

gpar.rowtext=gpar(fontsize=8))

grid.arrange(my_hist,my_table, ncol=2)

grid.arrange(my_hist,my_table, ncol=2, widths=c(.7, .3))
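A minimal sketch of the annotation_custom() route from #1 above (assuming ggplot2 and gridExtra are loaded; the table grob is stretched over the whole plot panel rather than placed beside it):

my_tab <- tableGrob(head(diamonds)[, 1:3])
my_hist + annotation_custom(my_tab, xmin = -Inf, xmax = Inf, ymin = -Inf, ymax = Inf)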

Add a table right below a legend

my_hist <- ggplot(diamonds, aes(clarity, fill=cut)) + geom_bar()

#create inset table

my_table<- tableGrob(head(diamonds)[,1:3],

gpar.coretext =gpar(fontsize=8), gpar.coltext=gpar(fontsize=8),

gpar.rowtext=gpar(fontsize=8))

#Extract Legend

g_legend<-function(a.gplot){

tmp <- ggplot_gtable(ggplot_build(a.gplot))

leg <- which(sapply(tmp$grobs, function(x) x$name) == "guide-box")

legend <- tmp$grobs[[leg]]

return(legend)}

legend <- g_legend(my_hist)

#Create the viewports, push them, draw and go up

grid.newpage()

vp1 <- viewport(width = 0.75, height = 1, x = 0.375, y = .5)

vpleg <- viewport(width = 0.25, height = 0.5, x = 0.85, y = 0.75)

subvp <- viewport(width = 0.3, height = 0.3, x = 0.85, y = 0.25)

print(my_hist + theme(legend.position = "none"), vp = vp1)

upViewport(0)

pushViewport(vpleg)

grid.draw(legend)

#Make the new viewport active and draw

upViewport(0)

pushViewport(subvp)

grid.draw(my_table)


Add text to a bar plot

EXAMPLE 1: Above Bars

library(ggplot2)

mtcars2 <- data.frame(id=1:nrow(mtcars), mtcars[, c(2, 8:11)])

mtcars2[, -1] <- lapply(mtcars2[, -1], as.factor)

with(mtcars2, ftable(cyl, gear, am)) #USE FOR FREQUENCY COUNTS OF ANY VARIABLE

ggplot(mtcars2, aes(x=cyl)) + geom_bar() +

facet_grid(gear~am) + stat_bin(geom="text", aes(label=..count.., vjust=-1))

EXAMPLE 2: On Stacked Bar

Year <- c(rep(c("2006-07", "2007-08", "2008-09", "2009-10"), each = 4))

Category <- c(rep(c("A", "B", "C", "D"), times = 4))

Frequency <- c(168, 259, 226, 340, 216, 431, 319, 368, 423, 645, 234,

685, 166, 467, 274, 251)

Data <- data.frame(Year, Category, Frequency)

library(ggplot2)

p <- qplot(Year, Frequency, data = Data, geom = "bar", stat = "identity", fill = Category,

theme_set(theme_bw()))

p + geom_text(aes(label = Frequency), size = 3, hjust = 0.5, vjust = 3,

position = "stack")

EXAMPLE 3: Centered on Stacked Bar

Year <- c(rep(c("2006-07", "2007-08", "2008-09", "2009-10"), each = 4))

Category <- c(rep(c("A", "B", "C", "D"), times = 4))

Frequency <- c(168, 259, 226, 340, 216, 431, 319, 368, 423, 645,

234, 685, 166, 467, 274, 251)

Data <- data.frame(Year, Category, Frequency)

library(ggplot2); library(plyr)

#compute the midpoint of each stacked segment (pos), used by geom_text below
Data <- ddply(Data, .(Year), transform, pos = cumsum(Frequency) - 0.5*Frequency)

ggplot(Data, aes(x = Year, y = Frequency)) +

geom_bar(aes(fill = Category), stat = "identity") +

geom_text(aes(label = Frequency, y = pos), size = 3)

Add text to Barplots (negative and positive values)

library(plyr); library(ggplot2); library(scales)

dtf <- data.frame(x = c("ETB", "PMA", "PER", "KON", "TRA",

"DDR", "BUM", "MAT", "HED", "EXP"), y = c(.02, .11,

-.01, -.03, -.03, .02, .1, -.01, -.02, 0.06))

ggplot(dtf, aes(x, y)) +

geom_bar(stat = "identity", aes(fill = x), legend = FALSE) +

geom_text(aes(label = paste(y * 100, "%"),

vjust = ifelse(y >= 0, -.2, 1.1))) +

scale_y_continuous("Anteil in Prozent",

labels = percent_format()) +

theme(axis.title.x = element_blank())


stat_summary [ggplot2]

Alter boxplot ends

library(ggplot2)

data(mpg)

#Create a function to calculate the points

get_tails <- function(x) {

q1 = quantile(x)[2]

q3 = quantile(x)[4]

iqr = q3 -q1

upper = q3+1.5*iqr

lower = q1-1.5*iqr

##Trim upper and lower

up = max(x[x < upper])

lo = min(x[x > lower])

return(c(lo, up))

}

ggplot(mpg, aes(x=drv,y=hwy)) + geom_boxplot() +

stat_summary(geom="point", fun.y= get_tails, colour="Red",

shape=3, size=5)


Add Colored Rectangles in Background

geom_rect()

#EXAMPLE

scores <- data.frame(category = 1:4,

percentage = c(34,62,41,44), type = c("a","a","a","b"))

rects <- data.frame(ystart = c(0,25,45,65,85),

yend = c(25,45,65,85,100), #the y values to stop and start coloring

col = c("Z1","Z2","Z3","Z4","Z5")) #the "grouping" variable to color on

labels <- c("ER", "OP", "PAE", "Overall") #labels for the x axis

medals <- c("navy","goldenrod4","darkgrey","gold","cadetblue1") #rectangle colors

library(ggplot2)

ggplot() +

geom_rect(data = rects, aes(xmin = -Inf, xmax = Inf, ymin = ystart,

ymax = yend, fill=col), alpha = 0.3) +

theme(legend.position="none") +

geom_bar(data=scores, aes(x=category, y=percentage, fill=type), stat="identity") +

scale_fill_manual(values=c("indianred1", "indianred4", medals)) +

scale_x_continuous(breaks = 1:4, labels = labels)


Labels

Labels above bars (with percent y-axis labels: scale_y_continuous(labels = percent))

#DATA SET
df <- structure(list(A = structure(c(1L, 1L, 1L, 2L, 2L, 2L, 3L, 3L,

3L), .Label = c("0-50,000", "50,001-250,000", "250,001-Over"),

class = "factor"), B = structure(c(1L, 2L, 3L, 1L, 2L, 3L, 1L, 2L, 3L),

.Label = c("0-50,000", "50,001-250,000", "250,001-Over"),

class = "factor"), Freq = c(0.507713884992987,

0.258064516129032, 0.23422159887798, 0.168539325842697,

0.525280898876405, 0.306179775280899, 0.160958904109589,

0.243150684931507, 0.595890410958904)), .Names = c("A", "B", "Freq"),

class = "data.frame", row.names = c(NA,

-9L))

library(ggplot2); library(scales)

ggplot(data=df, aes(x=A, y=Freq))+

geom_bar(aes(fill=B), position = position_dodge(), stat = "identity") +

geom_text(aes(label = paste(sprintf("%.1f", Freq*100), "%", sep=""),

y = Freq+0.015, group=B),

size = 3, position = position_dodge(width=0.9)) +

scale_y_continuous(labels = percent) +

theme_bw()


Mapping

Maps for Different Coordinate Systems + coord_map()

require("maps")

states <- data.frame(map("state", plot=FALSE)[c("x","y")])

(usamap <- qplot(x, y, data=states, geom="path"))

usamap + coord_map()

usamap + coord_map(project="orthographic")

usamap + coord_map(project="stereographic")

usamap + coord_map(project="conic", lat0 = 30)

usamap + coord_map(project="bonne", lat0 = 50)


Random

Stacked Bar Histogram

# Create data set
set.seed(3421)

library(plyr); library(ggplot2)

# added type to mimick which candidate is supported

dfr <- data.frame(

name = LETTERS[1:26],

percent = rnorm(26, mean=15),

type = sample(c("A", "B"), 26, replace = TRUE)

)

# easier to prepare data in advance. uses two ideas

# 1. calculate histogram bins (quite flexible)

# 2. calculate frequencies and label positions

dfr <- transform(dfr, perc_bin = cut(percent, 5))

dfr <- ddply(dfr, .(perc_bin), mutate,

freq = length(name), pos = cumsum(freq) - 0.5*freq)

# start plotting. key steps are

# 1. plot bars, filled by type and grouped by name

# 2. plot labels using name at position pos

# 3. get rid of grid, border, background, y axis text and labels

ggplot(dfr, aes(x = perc_bin)) +

geom_hline(yintercept=seq(10, 70, by=10), colour="gray90", size=.05) +

geom_bar(aes(y = freq, group = name, fill = type), colour = 'gray60',

show_guide = F) +

geom_text(aes(y = pos, label = name), colour = 'white') +

scale_fill_manual(values = c('red', 'orange')) +

theme_bw() + xlab("") + ylab("") + scale_y_continuous(expand = c(0,0))+

theme(panel.grid.major = element_blank(), panel.grid.minor = element_blank())

Show Zero Count (discrete/categorical) + scale_x_discrete(drop=F)

Histogram That Matches Base R Binning: right = TRUE

#EXAMPLES
library(ggplot2)

mtcars$cyl<-factor(mtcars$cyl)

levels(mtcars[!mtcars$cyl==4,]$cyl) #level 4 there but won't be plotted

ggplot(mtcars[!mtcars$cyl==4,], aes(cyl))+ geom_bar()

ggplot(mtcars[!mtcars$cyl==4,], aes(cyl))+ geom_bar() +

scale_x_discrete(drop=F)

#EXAMPLE

ggplot(diamonds, aes(carat, ..density..)) +

geom_histogram(binwidth = 0.2) +

facet_grid(.~cut)

ggplot(diamonds, aes(carat, ..density..)) +

geom_histogram(binwidth = 0.2, right = TRUE) +

facet_grid(.~cut)


Market Share Plot

library(ggplot2)

library(reshape2)

library(scales)

# A DATA SET

Scholar <- structure(list(Year = 1995:2011, BDMP = c(18L, 19L, 41L, 30L,

18L, 36L, 28L, 33L, 37L, 45L, 36L, 27L, 27L, 26L, 47L, 43L, 45L

), JMP = c(257L, 370L, 550L, 690L, 865L, 1060L, 1190L, 1430L,

1710L, 2070L, 2520L, 2970L, 3400L, 3830L, 4170L, 4680L, 5590L

), Minitab = c(1150L, 1290L, 1400L, 1460L, 1670L, 1890L, 2180L,

2490L, 2860L, 3300L, 3770L, 4590L, 5210L, 5830L, 6510L, 7190L,

7990L), SPSS = c(6450L, 7600L, 10500L, 14500L, 24300L, 45600L,

67200L, 87200L, 75900L, 137000L, 145000L, 141000L, 133000L, 119000L,

61500L, 45700L, 33200L), SAS = c(8630L, 8700L, 10200L, 11100L,

12700L, 16500L, 21900L, 27200L, 39600L, 49400L, 57000L, 62800L,

60400L, 59100L, 53700L, 43000L, 32300L), Stata = c(22L, 91L,

205L, 322L, 516L, 784L, 986L, 1290L, 1740L, 2400L, 3090L, 4010L,

5100L, 6330L, 7600L, 9230L, 12000L), Statistica = c(3L, 11L,

19L, 28L, 23L, 42L, 62L, 84L, 89L, 146L, 165L, 219L, 209L, 249L,

297L, 351L, 413L), Systat = c(2480L, 2510L, 3390L, 2700L, 2650L,

2780L, 2880L, 2900L, 3100L, 3340L, 4000L, 4870L, 5430L, 6270L,

6560L, 7030L, 8060L), R = c(8L, 2L, 6L, 13L, 25L, 51L, 133L,

286L, 627L, 1180L, 2180L, 3430L, 5060L, 6960L, 9150L, 11400L,

14500L), SPlus = c(8L, 17L, 33L, 39L, 45L, 52L, 159L, 341L, 574L,

817L, 1010L, 1180L, 1160L, 1180L, 970L, 710L, 644L)), .Names = c("Year",

"BDMP", "JMP", "Minitab", "SPSS", "SAS", "Stata", "Statistica",

"Systat", "R", "SPlus"), class = "data.frame", row.names = c(NA,

-17L))

Scholar

Little6 <- c("JMP","Minitab","Stata","Statistica","Systat","R")

Subset <- Scholar[ , Little6]

Year <- rep(Scholar$Year, length(Subset))

ScholarLong <- melt(Subset)

names(ScholarLong) <- c("Software", "Hits")

ScholarLong <- data.frame(Year, ScholarLong)

ggplot(ScholarLong, aes(Year, Hits, group=Software)) +

geom_smooth(aes(fill=Software), position="fill") +

coord_flip()+

scale_x_continuous("Year", trans="reverse") +

scale_y_continuous("Proportion of Google Scholar Hits For Each Software",

labels = NULL)+

ggtitle("Market Share") + theme(axis.ticks = element_blank())


Dotplot

DF <- structure(list(Country = structure(1:30, .Label = c("Georgia",

"South Africa", "Colombia", "Cuba", "Poland", "Romania", "Taipei (Chinese Taipei)",

"Azerbaijan", "Belgium", "Canada", "Republic of Moldova", "Norway",

"Serbia", "Slovakia", "Ukraine", "Uzbekistan", "Kazakhstan",

"Netherlands", "Great Britain", "Democratic People's Republic of Korea",

"Australia", "Brazil", "Hungary", "France", "Russian Federation",

"Republic of Korea", "Japan", "Italy", "United States of America",

"People's Republic of China"), class = "factor"), Gold = c(1,

1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 2, 1, 0, 2, 1, 1,

1, 2, 1, 2, 0, 2, 3, 6), Silver = c(0, 0, 1, 1, 1, 1, 1, 0, 0,

0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 0, 1, 1, 1, 1, 0, 1, 2, 3, 5, 4

), Bronze = c(0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1,

0, 0, 1, 1, 1, 1, 1, 1, 3, 2, 3, 2, 3, 2), total = c(1, 1, 1,

1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 3, 3, 3, 3, 4,

4, 5, 5, 7, 11, 12)), .Names = c("Country", "Gold", "Silver",

"Bronze", "total"), row.names = c(13L, 14L, 17L, 18L, 19L, 20L,

21L, 22L, 23L, 24L, 25L, 26L, 27L, 28L, 29L, 30L, 7L, 11L, 16L,

6L, 8L, 9L, 10L, 5L, 12L, 4L, 15L, 3L, 2L, 1L), class = "data.frame")

#CONVERT TO LONG

DF2 <- reshape(DF, varying = 2:5, direction="long",v.names = "number", timevar = "medal",

idvar = "Country", times =c("Gold", "Silver", "Bronze", "total"))

DF2$medal <- factor(DF2$medal, levels=c("Bronze", "Silver", "Gold", "total"))

ggplot(DF2, aes(x = number, y = Country, colour = medal)) +

geom_point() +

facet_grid(.~medal) + theme_bw()+

scale_colour_manual(values=c("#CC6600", "#999999", "#FFCC33", "#000000"))


Direct Labels

Move Specific Labels Around

#The faceted ggplot code

library(ggplot2); library(directlabels)

x <- ggplot(CO2, aes(x=uptake, group=Plant))

y <- x + geom_density(aes(colour=Plant)) +

facet_grid(Type~Treatment)+ theme_bw()

y #with a legend

direct.label(y) #with direct labels

#use this to supply arguments to direct.label to move it around

my.method1 <-

list('top.points',

dl.move("Qn1", hjust=0,vjust=-5),

dl.move("Qc2", hjust=6,vjust=-8)

)

direct.label(y, my.method1) #moved labels

Find Values from the Plot and Adjust that Way

library(ggplot2); library(directlabels)

set.seed(124234345)

# Generate data

df.2 <- data.frame("n_gram" = c("word1"),

"year" = rep(100:199),

"match_count" = runif(100 ,min = 1000 , max = 2000))

df.2 <- rbind(df.2, data.frame("n_gram" = c("word2"),

"year" = rep(100:199),

"match_count" = runif(100 ,min = 1000 , max = 2000)))

# Function to get last Y-value from loess

funcDlMove <- function (n_gram) {

model <- loess(match_count ~ year, df.2[df.2$n_gram==n_gram,], span=0.3)

Y <- model$fitted[length(model$fitted)]

Y <- dl.move(n_gram, y=Y,x=200)

return(Y)

}

index <- unique(df.2$n_gram)

mymethod <- list(

"top.points",

lapply(index, funcDlMove)

)

# Plot

PLOT <- ggplot(df.2, aes(year, match_count, group=n_gram, color=n_gram)) +

geom_line(alpha = I(7/10), color="grey", show_guide=F) +

stat_smooth(size=2, span=0.3, se=F, show_guide=F)

direct.label(PLOT, mymethod)


Move Plot Over to Add Line Names

library(pacman); p_load(ggplot2, directlabels)

set.seed(124234345)

# Generate data

df.2 <- data.frame("n_gram" = c("word1"),

"year" = rep(100:199),

"match_count" = runif(100 ,min = 1000 , max = 2000))

df.2 <- rbind(df.2, data.frame("n_gram" = c("word2"),

"year" = rep(100:199),

"match_count" = runif(100 ,min = 1000 , max = 2000)))

mymethod <- list(

"top.points",

dl.move("word1", hjust=-.5, vjust=19.5),

dl.move("word2", hjust =-4.4, vjust=15.5)

)

ggplot(df.2, aes(year, match_count, group=n_gram, color=n_gram)) +

geom_line(alpha = I(7/10), color="grey", show_guide=F) +

xlim(c(100,220))+

stat_smooth(size=2, span=0.3, se=F, show_guide=F) +

geom_dl(aes(label=n_gram), method = mymethod, show_guide=F)


LaTeX

Prepare tables for LaTeX: library(xtable); xtable(table, caption=NULL, label=NULL, align=NULL, digits=3, display=NULL)

Info on R to LaTeX: http://stackoverflow.com/questions/2978784/suggestion-for-r-latex-table-creation-package
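A minimal sketch (assuming the xtable package is installed) that converts a small data frame into LaTeX table code:

library(xtable)
tab <- xtable(head(mtcars[, 1:4]), caption = "First rows of mtcars", digits = 2)
print(tab, include.rownames = TRUE) #prints the LaTeX tabular code to the console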