Download - 2013 - Notes - R Trinker'S_notes
Commands for R
R Help
Type:                                   Purpose:
?function.name                          Help page on a specific function
args(function.name)                     Arguments for a particular function
function.name                           Code of a function
example(function.name)                  Example(s) of a function in action
??function.name                         For when ?function.name doesn't work
help(package="package.name")            Information on a package
RSiteSearch("key phrase")               Search the R site for key phrases from within [R]
RSiteSearch("{key phrase}")             Search the R site for exact phrases from within [R]
apropos("key phrase")                   Returns a list of all matching objects in the search list
find("function")                        Returns the package the function is in
news()                                  Find out new things happening in [R]
news(Version == "v.9", package = "package.name")     Find out new things happening with a package
news(grepl("key phrase", Text), db= news())          Search for key words in news()
help.start()                            Opens the HTML help start page in a browser
library(sos); findFn("key phrase") or ???key.phrase; ???'word1 word2'     Search for rated functions related to a topic
sessionInfo()                           Info about current session (including loaded packages)
Additional Websites
Website:                                Purpose:
rseek.org                               [R] version of Google
stackoverflow.com                       Q & A forum generally oriented to code and programming
stats.stackexchange.com                 Q & A forum generally oriented to statistics
crantastic.org                          Search through packages for key words (use CTRL + F to search by anything, including author)
cran.r-project.org/web/views            Search through packages by area (psychometrics, cluster, etc.)
inside-r.org/                           R community site
zoonek2.free.fr/UNIX/48_R/all.html      Helpful Book/Manual Website (very thorough)
statmethods.net/                        Helpful Book/Manual Website (very detailed)
Sample [R] capabilities
demo(persp); demo(graphics); demo(Hershey); demo(plotmath)
Cool R Visualization Examples
http://paulbutler.org/archives/visualizing-facebook-friends/
http://blog.revolutionanalytics.com/2012/01/nyt-uses-r-to-map-the-1.html
http://blog.revolutionanalytics.com/2009/11/choropleth-challenge-result.html
http://www.r-bloggers.com/visualize-your-facebook-friends-network-with-r/
http://www.r-bloggers.com/see-the-wind/
http://www.r-bloggers.com/mapped-british-and-spanish-shipping-1750-1800/
Packages & Libraries
Working with packages
library(package.name)                   Loads library
require(package.name)                   Loads library
.libPaths()                             Prints the path(s) to R's library(s)
detach(package:package.name, unload = TRUE)     Detaches the package from the search path and unloads it
install.packages("package.name")        Install a package from the command line
library(fBasics); listFunctions("stats")        List functions from a library
objects(package:package.name)           List functions from a library (preferred: from base)
installed.packages()[,1]                What packages do you have installed on your computer
maintainer("package.name")              Get the name and email of the package maintainer
packageDescription("package.name")      Brief info about a package's contents
packageDescription("package.name")["Version"]   Find version number of package
remove.packages("package.name")         Delete a package from your library
data(package = "package.name")          Look at data sets available in a package
library()                               Look at all available libraries
vignette()                              Look at all available vignettes for installed libraries
vignette(package = "package.name")      Look at vignettes for a library
(.packages())                           Current packages loaded
search()                                Current packages loaded
library(help="package.name")            See contents of a package
package.name::object                    Access library object w/o opening package; especially good if 2 packages have the same named function
list.files(.libPaths()[1])              See what files are in your saved library
list.files(.libPaths())                 Shows all available packages including standard install libs
.packages(all.available = TRUE)         Shows all available packages
(.libPaths())                           Displays the paths to all your library locations
suppressPackageStartupMessages(library(package.name))     Suppress the startup message of a package
Install Package Not Compiled for Windows
install.packages("PATH/TO/THE/SVGAnnotation.tar.gz", repos=NULL, type="source")
Citing [R] and packages
citation() #citing [R]
citation(package = "psych", lib.loc = NULL, auto = NULL) #bibtex citation of a package method 1
utils:::print.bibentry(citation("psych"), style = "Bibtex") #bibtex citation of a package method 2
Importing data from Excel (csv), text, HTML
Make sure the Excel file is saved as a .csv file in R's root (working) directory.
Using File Choose
myData <- read.csv(file.choose(), strip.white = TRUE, header = TRUE, sep = ",", na.strings = "999")
Exporting a data table to Excel
write.table(x, file = "foo.csv", sep = ",", col.names = TRUE, row.names = FALSE, qmethod = "double")
write.table(x, file = "foo.csv", sep = ",", col.names = NA, qmethod = "double")
Exporting a data table to SAS
library(SASxport); write.xport(...dataframe(s)..., file=)
Keyboard Short Cuts
Clear console                                        ctrl + L
Load script lines                                    ctrl + R
Load all script                                      ctrl + A and then ctrl + R
Load content to console from non interactive window (i.e. history() etc.)     ctrl + C; ctrl + V
Go to the beginning or end of script                 ctrl + HOME; ctrl + END
Highlight from a given point to beginning or end     ctrl + SHIFT + HOME; ctrl + SHIFT + END
For Fixed Column Width Files: Save as plain text, import into Excel using file/open and follow the steps, then export as a .csv file [or use read.fwf()].
myData <- read.csv(file, header=TRUE, strip.white = TRUE, sep=",", as.is=FALSE, na.strings= c("999", "NA", " "))
myData <- read.delim(file, header=T, strip.white = T, sep="\t", as.is=FALSE, na.strings= c("999", "NA", " "))
library(XML)
#which is the table number to return (1 here)
myData <- readHTMLTable(doc, which=1, header=T, strip.white = T, as.is=FALSE, sep=",", na.strings= c("999", "NA", " "))
myData <- read.fwf(file, widths, header=FALSE, strip.white = T, sep=" ", as.is=FALSE, na.strings= c("999", "NA", " "))
library(gdata)
myData <- read.xls(file, sheet=1, header=FALSE, strip.white=T, sep=" ", as.is=FALSE, na.strings= c("999", "NA", " "))
require(foreign) #for SPSS
myData <- read.spss(file, use.value.labels = TRUE, to.data.frame = TRUE)
myData <- read.table(file.choose(), sep=",", header=T, strip.white = T, na.strings=c("999","NA"," "))
#example 1
library(XML)
URL <- "http://library.columbia.edu/indiv/dssc/data/nycounty_fips.html"
Table <- readHTMLTable(URL,
colClasses = rep("character", 2),
skip.rows=1,
which=1)
names(Table) <- c("County_FIPS", "County_Name")
Table
#example 2
library(XML)
URL2 <- "http://en.wikipedia.org/wiki/List_of_counties_in_New_York"
Table2 <- readHTMLTable(URL2, which=2)
Table2 #needs to be cleaned
Use as.is=TRUE to keep character columns as character; as.is=FALSE lets them be converted to factors.
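A minimal sketch of how the na.strings, strip.white, and as.is arguments used above behave together. It reads an inline CSV string instead of a file, and the sample values are made up for illustration.

```r
# Inline CSV standing in for a file on disk (illustrative values only)
csv_text <- "id,name,score
1,Ann,12
2,Bob,999
3, Cy ,NA"

demo <- read.csv(text = csv_text, header = TRUE, strip.white = TRUE,
                 as.is = TRUE, na.strings = c("999", "NA", " "))

demo              # 999 and NA in score are both read as missing
class(demo$name)  # character, because as.is = TRUE
```

strip.white removes the padding around " Cy ", and every string listed in na.strings is treated as a missing value on import.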
Write and Read in Vector Files of Unequal Length
write.unequal(…, csv.name)
read.unequal(file)
#=============WRITE=A=CSV=FILE=OF=UNEQUAL=LENGTHS================
Vector1 <- 1:6
Vector2 <- LETTERS[1:9]
Vector3 <- c('the', 'quick', 'red', 'fox',
'jumped', 'over', 'the', 'lazy', 'brown', 'dog')
Vector4 <- c(.1, .3, .6, .4)
lst <- list(Vector1, Vector2, Vector3, Vector4)
lns <- sapply(lst, length)
n <- length(lst)
ans <- as.data.frame(matrix(nrow = max(lns), ncol = n))
for(i in 1:n){
ans[1:lns[i], i] <- lst[[i]]
}
ans
write.csv(ans, file = "DELETE.ME.csv", na = "", row.names = FALSE)
#=============READ=IN=A=FILE=OF=UNEQUAL=LENGTHS==================
x <- "DELETE.ME.csv" #THE CSV OF != LENGTH VECTORS
j <- read.csv(x, stringsAsFactors = F, na.strings = "") #blank cells (the padding) are read as NA, also in character columns
k <- lapply(as.list(j), function(x){x[!is.na(x)]})
#======================DELETE=THE=FILE===========================
unlink(x)
###################################################################
# NOTE: I WRAPPED THIS ALL UP IN A FUNCTION I KEEP IN THE USEFUL  #
# FUNCTIONS FILE LOADED BY .FIRST AS SEEN BELOW                   #
###################################################################
write.unequal(Vector1, Vector2, Vector3, Vector4, csv.name=".DELETE.ME")
read.unequal(".DELETE.ME.csv")
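The notes call write.unequal() and read.unequal() without showing their definitions. The sketch below is one way such wrappers could be built from the steps above; the names write_unequal_sketch and read_unequal_sketch are hypothetical and this is not the author's code.

```r
# Hypothetical wrappers around the pad-with-NA steps shown above
write_unequal_sketch <- function(..., file) {
  lst <- list(...)
  lns <- vapply(lst, length, integer(1))
  # One column per vector, padded with NA down to the longest vector
  ans <- as.data.frame(matrix(nrow = max(lns), ncol = length(lst)))
  for (i in seq_along(lst)) ans[seq_len(lns[i]), i] <- lst[[i]]
  write.csv(ans, file = file, na = "", row.names = FALSE)
}

read_unequal_sketch <- function(file) {
  # na.strings = "" turns the blank padding cells back into NA
  j <- read.csv(file, stringsAsFactors = FALSE, na.strings = "")
  lapply(as.list(j), function(x) x[!is.na(x)])
}

demo_file <- tempfile(fileext = ".csv")
write_unequal_sketch(1:3, letters[1:5], file = demo_file)
out <- read_unequal_sketch(demo_file)
out
unlink(demo_file)  # clean up the temporary file
```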
Read in ascii type files (see my created function to right)
read.table(name<-textConnection("")); close(name)
site.data <- read.table(tc<-textConnection(
"site year peak
1 ALBEN 5 101529.6
2 ALBEN 10 117483.4
3 ALBEN 20 132960.9
8 ALDER 5 6561.3
9 ALDER 10 7897.1
10 ALDER 20 9208.1
15 AMERI 5 43656.5
16 AMERI 10 51475.3
17 AMERI 20 58854.4")); close(tc)
site.data
site.data[,1]
Read in ascii type files
read.table(text="", header=TRUE)
Give comments to an object [object comment; dataframe comments]
comment(object)
comment(object) <- value
Object Characteristics
str()
names()
attributes()
comment()
getAnywhere() #look at any code; example: getAnywhere(plot.table)
#created read in function
ascii <- function(x, header=TRUE){
name <-textConnection(x)
DF <- read.table(name,header)
close(name)
DF
}
#EXAMPLE:
x <- mtcars
comment(x) <- c("This is about cars #0234", "I know nothing about cars")
x
comment(x)
str(x)
mod<-lm(disp~hp+cyl, mtcars)
str(mod)
attributes(mod)
names(mod)
str(mtcars)
attributes(mtcars)
names(mtcars)
#WATCH OUT FOR METHODS/CLASSES
library(tm); library(proxy)
dissimilarity #notice the UseMethod (that tells you to look at the methods)
methods(dissimilarity) #notice there are three different method types
getAnywhere("dissimilarity.DocumentTermMatrix") #works, or
tm:::dissimilarity.DocumentTermMatrix #so does :::
ascii("site year peak
1 ALBEN 5 101529.6
2 ALBEN 10 117483.4
3 ALBEN 20 132960.9
8 ALDER 5 6561.3
9 ALDER 10 7897.1
10 ALDER 20 9208.1
15 AMERI 5 43656.5
16 AMERI 10 51475.3
17 AMERI 20 58854.4")
read.table(text="site year peak
1 ALBEN 5 101529.6
2 ALBEN 10 117483.4
3 ALBEN 20 132960.9
8 ALDER 5 6561.3
9 ALDER 10 7897.1
10 ALDER 20 9208.1
15 AMERI 5 43656.5
16 AMERI 10 51475.3
17 AMERI 20 58854.4")
Alternative Method
Merging data sets (by column)
merge(x, y, by = intersect(names(x), names(y)), by.x = by, by.y = by, all = FALSE, all.x = all, all.y = all, sort = TRUE, suffixes = c(".x",".y"))
x, y data frames, or objects to be coerced to one.
by, by.x, by.y specifications of the common columns. See ‘Details’.
all logical; all = L is shorthand for all.x = L and all.y = L.
all.x logical; if TRUE, then extra rows will be added to the output, one for each row in x that has no matching row in y. These rows will have NAs in those columns that are usually filled with values from y. The default is FALSE, so that only rows with data from both x and y are included in the output.
all.y logical; analogous to all.x above.
sort logical. Should the results be sorted on the by columns?
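A small sketch of what the all, all.x, and all.y arguments described above do. The two data frames are made up for illustration.

```r
# Two toy data frames sharing an "id" column (illustrative values)
x <- data.frame(id = c(1, 2, 3), height = c(150, 160, 170))
y <- data.frame(id = c(2, 3, 4), weight = c(55, 65, 75))

inner <- merge(x, y, by = "id")                # only ids found in both
left  <- merge(x, y, by = "id", all.x = TRUE)  # every row of x, NA where y has no match
full  <- merge(x, y, by = "id", all = TRUE)    # every row of x and y

inner
left
full
```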
Merging data sets (by row) and fill missing with NA [merge by row; merge rows]
library(plyr)
rbind.fill(dataframes…)
#EXAMPLE
x1<-LETTERS[1:3]
x2<-letters[1:3]
x2b<-letters[5:7]
x3<-rnorm(3)
x4<-rnorm(3)
x5<-rnorm(3)
#DATA LOOKS LIKE THIS
data.frame(x1,x2,x3,x4,x5)
data.frame(x1,x3,x4,x5)
data.frame(x2,x3,x4,x5)
data.frame(x1,x2,x3,x4,x5)
data.frame(x1,x2b,x3,x4,x5)
#========================================
# IF EACH ONE IS A DATA FRAME ALREADY
#========================================
library(plyr)
d1 <- data.frame(x1,x2,x3,x4,x5)
d2 <- data.frame(x1,x3,x4,x5)
d3 <- data.frame(x2,x3,x4,x5)
d4 <- data.frame(x1,x2,x3,x4,x5)
d5 <- data.frame(x1,x2b,x3,x4,x5)
rbind.fill(d1,d2,d3,d4,d5)
#========================================
# IF THE DATA FRAMES ARE STORED IN A LIST
#========================================
library(plyr)
LIST<-list(
data.frame(x1, x2, x3, x4, x5),
data.frame(x1,x3,x4,x5),
data.frame(x2,x3,x4,x5),
data.frame(x1,x2,x3,x4,x5),
data.frame(x1,x2b,x3,x4,x5))
DF <- rbind.fill(LIST)
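data.frame(FAC(DF), NUM(DF)) #FAC() and NUM() are helper functions not defined in these notes (they appear to pick out the factor and numeric columns)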
#Output
x1 x2 x2b x3 x4 x5
1 A a <NA> -1.0193006 -0.8175212 -0.3094028
2 B b <NA> -2.0372846 -1.0685405 -1.0913312
3 C c <NA> -0.6502925 0.7338066 0.7393544
4 A <NA> <NA> -1.0193006 -0.8175212 -0.3094028
5 B <NA> <NA> -2.0372846 -1.0685405 -1.0913312
6 C <NA> <NA> -0.6502925 0.7338066 0.7393544
7 <NA> a <NA> -1.0193006 -0.8175212 -0.3094028
8 <NA> b <NA> -2.0372846 -1.0685405 -1.0913312
9 <NA> c <NA> -0.6502925 0.7338066 0.7393544
10 A a <NA> -1.0193006 -0.8175212 -0.3094028
11 B b <NA> -2.0372846 -1.0685405 -1.0913312
12 C c <NA> -0.6502925 0.7338066 0.7393544
13 A <NA> e -1.0193006 -0.8175212 -0.3094028
14 B <NA> f -2.0372846 -1.0685405 -1.0913312
15 C <NA> g -0.6502925 0.7338066 0.7393544
Merge Rows of a Data Set Method 1 [sum by id variables; combine rows]
library(plyr)
ddply(dataframe, .(other.facs), summarize, combined.fac = sum(combined.fac))
dataframe = data frame
combined.fac = numeric variable you want to sum (or other operation)
other.facs = list of factors that are repeated in all rows for the combined.fac
Merge Rows of a Data Set Method 2 [sum by id variables; combine rows]
library(data.table)
dataframe[ , list(combined.fac=sum(combined.fac)), list(other.facs)]
dataframe = data table
combined.fac = numeric variable you want to sum (or other operation)
other.facs = list of factors that are repeated in all rows for the combined.fac
Paste two data frames together
a<-mtcars[1:3,1:3]
b<-mtcars[1:3,8:10]
mypaste <- function(x,y) paste(x, "(", y, ")", sep="")
mapply(mypaste, a,b)
EXAMPLE
(dat <- structure(list(year = structure(c(1L, 1L, 1L, 1L,
1L, 1L),.Label = "base", class = "factor"), age =
structure(c(1L, 2L, 2L, 3L, 3L, 4L), .Label = c("0",
"1", "2", "3"), class = "factor"), pop = c(98378,
104648, 96769, 92448, 100745, 116926),
FIPS = structure(c(1L, 1L, 1L, 1L, 1L, 1L),
.Label = "6001", class = "factor")),
.Names = c("year", "age", "pop", "FIPS"),
row.names = c(NA, -6L), class = c("data.table",
"data.frame"), sorted = c("year", "age",
"FIPS")))
library(data.table) #Method 1
dat[,list(pop=sum(pop)),list(year,age,FIPS)]
library(plyr) #Method 2
ddply(dat, .(year, age, FIPS), summarize, pop = sum(pop))
dat<-data.frame(dat, "x2"=sample(1:100, nrow(dat)))
dat$x2<-as.numeric(dat$x2)
ddply(dat, .(year, age, FIPS), summarize, pop = sum(pop),
x2=sum(x2))
Extra Combined
factors
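For comparison with the two methods above, the same "sum by id variables" step can be sketched with base R's aggregate(), which needs no extra package. The data frame below is rebuilt by hand from the values in the example.

```r
# Same values as the example above, built as a plain data frame
dat <- data.frame(year = "base",
                  age  = c("0", "1", "1", "2", "2", "3"),
                  pop  = c(98378, 104648, 96769, 92448, 100745, 116926),
                  FIPS = "6001")

# Sum pop within each year/age/FIPS combination
totals <- aggregate(pop ~ year + age + FIPS, data = dat, FUN = sum)
totals
```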
Multi Merge
#Create three data frames
Week_1_sheet <- read.table(text="ID Gender DOB Absences Unexcused_Absences Lates
1 1 M 1997 5 1 14
2 2 F 1998 4 2 3", header=TRUE)
Week_2_sheet <- read.table(text="ID Gender DOB Absences Unexcused_Absences Lates
1 1 M 1997 2 1 10
2 2 F 1998 8 2 2
3 3 M 1998 8 2 2", header=TRUE)
Week_3_sheet <- read.table(text="ID Gender DOB Absences Unexcused_Absences Lates
1 1 M 1997 2 1 10
2 2 F 1998 8 2 2", header=TRUE)
#Consolidate them into a list
WEEKlist <- list(Week_1_sheet , Week_2_sheet , Week_3_sheet)
#transform common variables before the merge
lapply(seq_along(WEEKlist), function(x) {
WEEKlist[[x]] <<- transform(WEEKlist[[x]],
Absences = Absences + Unexcused_Absences)[, -5] #row-wise total; sum() would collapse all rows into one number
}
) #notice the <<- assignment to the enclosing environment
#change names of columns that may overlap with other data frames yet do not hold duplicate data
lapply(seq_along(WEEKlist), function(x) {
y <- names(WEEKlist[[x]]) #do this to avoid repeating this 3 times
names(WEEKlist[[x]]) <<- c(y[1:3], paste(y[4:length(y)], ".", x, sep=""))}
) #notice the <<- assignment to the enclosing environment
#Method using a for loop
DF <- WEEKlist[[1]][, 1:3]
for ( .df in WEEKlist) {
DF <-merge(DF,.df,by=c('ID', 'Gender', 'DOB'), all=T, suffixes=c("", ""))
}
DF
#Method using Reduce
merge.all <- function(frames, by) {
return (Reduce(function(x, y) {merge(x, y, by = by, all = TRUE)}, frames))
}
merge.all(frames=WEEKlist, by=c('ID', 'Gender', 'DOB'))
merge.all(frames=WEEKlist, by=1:3)
#BENCHMARK OUTPUT (produced by the benchmarking code below)
test replications elapsed relative user.self sys.self
1 LOOP 1000 10.12 1.62701 7.89 0
2 REDUCE 1000 6.22 1.00000 5.34 0
#BENCHMARKING
require(rbenchmark)
benchmark(
LOOP={DF <- WEEKlist[[1]]
for ( .df in WEEKlist) {
DF <-merge(DF,.df,by=c('ID', 'Gender', 'DOB'), all=T, suffixes=c("", ""))
}},
REDUCE=merge.all(frames=WEEKlist, by=c('ID', 'Gender', 'DOB')),
columns = c( "test", "replications", "elapsed", "relative", "user.self", "sys.self"),
order = "test",
replications = 1000,
environment = parent.frame())
Exporting an output to a file (method 1) [saving a file; save file]
cat(object, file="name.doc", sep = " ", append = FALSE)
Append means add to the existing file (TRUE) or overwrite the file (FALSE)
Exporting an output to a file (method 2) [saving a file; save file]
write(x, file="data", ncolumns=if(is.character(x)) 1 else 5, append=FALSE)
Append means add to the existing file (TRUE) or overwrite the file (FALSE)
Exporting an output to a file (method 3) [saving a file; save file]
This method prints all the results directly to the file without naming an object to print.
sink(file="name.doc", append = FALSE, split = FALSE)
Append means add to the existing file (TRUE) or overwrite the file (FALSE)
Split sends the output to both the file and the command line.
see also: capture.output()
Saving R Objects
Save objects and load the same objects into R
save(…, file = "foo.RData")
load("foo.RData")
Save an object and load it back in by assigning it to a new object
saveRDS(mod, "mymodel.rds")
mod2 <- readRDS("mymodel.rds")
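A short sketch contrasting the two ways of saving R objects listed above. It writes to temporary files so nothing is left behind; the object names are arbitrary.

```r
# saveRDS()/readRDS(): one object, restored under any name you choose
mod <- lm(mpg ~ disp, data = mtcars)
rds_path <- tempfile(fileext = ".rds")
saveRDS(mod, rds_path)
mod_copy <- readRDS(rds_path)

# save()/load(): one or more objects, restored under their original names
a <- 1:3
b <- "text"
rdata_path <- tempfile(fileext = ".RData")
save(a, b, file = rdata_path)
rm(a, b)
restored <- load(rdata_path)  # load() returns the names it restored
restored

unlink(c(rds_path, rdata_path))
```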
EXAMPLE
xc<-pi*3^2
cat(xc,file="xREPORT.doc")
xc2<-((xc+23)/4)-1000
cat(xc2,file="xREPORT.doc",append=T)
unlink("xREPORT.doc")
EXAMPLE
sink("example.doc",append=FALSE,split=TRUE)#append=T if file already exists
mod<-lm(mpg~disp*cyl,data=mtcars)
anova(mod)
summary(mod)
cat("The dog ate the food on ",date(),".\n",sep="")
sink()#turns sink off
xc<-pi*3^2
cat(xc,file="xREPORT.doc")
xc2<-((xc+23)/4)-1000
write(xc2,file="xREPORT.doc",append=T)
unlink("xREPORT.doc")
Delete a File from within [R]
unlink("file") or file.remove("file")
Checking What Files Are in a directory (default is working directory)
list.files()
EXAMPLE
STRING<-"TEST"
cat(STRING, file = ".TEST.txt")
unlink(".TEST.txt")
Checking the Data Set
Simply type the name of the data set (data frame) and hit enter (in the example above we called the data set myData).
Look @ beginning or end of a data set
head()
tail()
dataset[1:n, ] or dataset[c(3,4,5,6,100,101,102,200),]
The psych package also includes a quick way to show the first and last n lines of a data.frame, matrix, or a text object.
headtail(x,hlength=4,tlength=4,digits=2)
Arguments
x        A matrix or data frame or free text
hlength  The number of lines at the beginning to show
tlength  The number of lines at the end to show
digits   Round off the data to digits
Quick way to attach/detach variables
Type: attach(myData)/detach(myData)
Here attach is the command and myData is the imported file (data set). Once attached, typing a column header refers to that variable.
Preferred method for attaching data to a function/expression
with(data, expr, ...)
Looking at a Variable (column) from the Data Set (data frame)
Type: myData$Day
Note: the myData and the Day portions are alterable; myData is your data set (data frame) and Day is the name of the variable in the data set. This shows the vector for that variable.
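A brief sketch of the preferred with() approach next to the $ form, using the built-in mtcars data set in place of myData.

```r
# with() evaluates the expression inside the data frame, no attach() needed
mean_with   <- with(mtcars, mean(mpg[cyl == 4]))

# The same calculation spelled out with $
mean_dollar <- mean(mtcars$mpg[mtcars$cyl == 4])

first_rows <- head(mtcars, 3)  # first 3 rows
last_rows  <- tail(mtcars, 3)  # last 3 rows
```

Unlike attach(), with() leaves the search path untouched, so there is nothing to detach afterwards.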
Manual data Entry
x<-c(3,4,3,5,2,3,4,3) or x<-scan()
This is a line by line entry system that has the feel of Excel entry. When you come to the end of your data press enter on an empty line.
Look at stored variables/objects
ls() #see all objects in environment in R console
objects() #see all objects in environment in R console
browseEnv() #see all objects in environment in web browser
ls.str() #gives all the stored objects in the workspace plus some info on each one
Remove all stored variables/objects (see also: Reduce Objects and Junk in Memory)
rm(list = ls(all = TRUE))
ls() #to check that the variables have been reset; output will be character(0)
Remove everything except for functions
rm(list=setdiff(ls(all.names=TRUE), lsf.str(all.names=TRUE)))
Searching the Objects in List
term="b"
ls(pattern=paste("^",term,sep="")) #object names that start with term
ls(pattern=paste(term,sep="")) #object names that contain term
Hiding objects in workspace
EXAMPLE
.BB<-"You can't see me!"
ls()
.BB
rm(.BB) #to remove the object
.BB
Name the object beginning with a period and it is hidden from ls() (the workspace listing); ls(all.names=TRUE) still shows it.
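A small sketch of the hiding behaviour, run inside a scratch environment (demo_env is a name chosen here) so the real workspace is left alone.

```r
# Scratch environment so the example does not touch the global workspace
demo_env <- new.env()
assign(".BB", "You can't see me!", envir = demo_env)
assign("visible", 1, envir = demo_env)

shown    <- ls(demo_env)                    # dot-prefixed names are skipped
all_objs <- ls(demo_env, all.names = TRUE)  # dot-prefixed names are included
shown
all_objs
```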
> x<-scan()
1: 21
2: 2
3: 3
4: 4
5: 5
6: 67
7: 776
8: 565
9: 45
10: 87
11: 567
12: 54
13: 34
14: 34
15: 32
16:
Read 15 items
> mean(x)
[1] 153.0667
Checking the Data
Missing Values or Missing Data
Finding Missing Values
type: NAfun() for a list of NA functions I’ve created
Good multiple-imputation implementations that can be accessed through R include Amelia II, mice, and mitools.
Functions to omit observations with missing values (listwise deletion)
na.fail(object, ...)
na.omit(object, ...)
na.exclude(object, ...)
na.pass(object, ...)
If ‘na.omit’ removes cases, the row numbers of the cases form the ‘"na.action"’ attribute of the result, of class ‘"omit"’.
‘na.exclude’ differs from ‘na.omit’ only in the class of the ‘"na.action"’ attribute of the result, which is ‘"exclude"’. This gives different behaviour in functions making use of ‘naresid’ and ‘napredict’: when ‘na.exclude’ is used the residuals and predictions are padded to the correct length by inserting ‘NA’s for cases omitted by ‘na.exclude’.
Impute means or median for missing (not a preferred method)
library(e1071)
impute(x, what = c("median", "mean"))
Replace Missing Values With A given Value (see the method for selected columns below)
variable[is.na(variable)] <- value #the value you want to impute
EXAMPLE
library(e1071) #for impute()
lk<-c(3,4,5,6,NA,3,4,5,6)
jk<-c(.4,NA,.5,.3,.4,.3,NA,NA,.8)
das<-data.frame(lk,jk)
sapply(na.omit(das),mean)
sapply(na.omit(das),median)
das
impute(das, what = c("median"))
impute(das, what = c( "mean"))
EXAMPLE:
mtcars2<-mtcars
mtcars2$carb<-with(mtcars2,replace(carb, carb==1,NA))
mtcars2
mtcars2$carb[is.na(mtcars2$carb)]<-1000 #1000 could be 0 or mode etc.
mtcars2
see also: replace()
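The difference between na.omit and na.exclude described earlier in this section can be sketched with a tiny regression. The data values are made up for illustration.

```r
# Five cases, one with a missing response (illustrative values)
d <- data.frame(x = c(1, 2, 3, 4, 5),
                y = c(2.1, NA, 6.2, 7.9, 10.1))

fit_omit <- lm(y ~ x, data = d, na.action = na.omit)
fit_excl <- lm(y ~ x, data = d, na.action = na.exclude)

res_omit <- residuals(fit_omit)  # dropped case is simply absent
res_excl <- residuals(fit_excl)  # padded with NA so it lines up with d
length(res_omit)
length(res_excl)
```

Both fits use the same four complete cases; only the padding of residuals and predictions differs, which matters when the results are bound back onto the original data frame.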
Replace Missing Values With A given Value for Selected Columns
EXAMPLE
A<-c(NA,5,4,7,3,NA,NA)
B<-c(.1,.4,.5,NA,NA,.3,.2)
C<-c(30,NA,40,40,60,50,70)
DF<-data.frame(A,B,C)
DF2<-DF #this is just so we can reset DF
cols <- c(2,3) #select the columns you want to impute with 0's
DF[,cols][is.na(DF[,cols])] <- 0
DF
cols <- c(2,3) #select the columns you don't want to impute with 0's
DF2[,-cols][is.na(DF2[,-cols])] <- 0
DF2
Replace Missing Values With Means by Group
impute.mean <- function(x) replace(x, is.na(x), mean(x, na.rm = TRUE)) #generic function
dat2 <- ddply(dataframe, ~ group.var, transform, new.var = impute.mean(var)) #new.var can reuse the old name to replace var in place
dat <- read.table(text = "id taxa length width
101 collembola 2.1 0.9
102 mite 0.9 0.7
103 mite 1.1 0.8
104 collembola NA NA
105 collembola 1.5 0.5
106 mite NA NA", header=TRUE)
library(plyr)
impute.mean <- function(x) replace(x, is.na(x), mean(x, na.rm = TRUE))
dat2 <- ddply(dat, ~ taxa, transform, length = impute.mean(length),
width = impute.mean(width))
dat2[order(dat2$id), ]
> dat
id taxa length width
1 101 collembola 2.1 0.9
2 102 mite 0.9 0.7
3 103 mite 1.1 0.8
4 104 collembola NA NA
5 105 collembola 1.5 0.5
6 106 mite NA NA
> dat2[order(dat2$id), ]
id taxa length width
1 101 collembola 2.1 0.90
4 102 mite 0.9 0.70
5 103 mite 1.1 0.80
2 104 collembola 1.8 0.70
3 105 collembola 1.5 0.50
6 106 mite 1.0 0.75
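As a base R alternative to the ddply() call above, ave() applies a function within groups and returns the result in the original row order. This is a sketch on the same data, not the method used in these notes; the name length_imp is chosen here.

```r
dat <- read.table(text = "id taxa length width
101 collembola 2.1 0.9
102 mite 0.9 0.7
103 mite 1.1 0.8
104 collembola NA NA
105 collembola 1.5 0.5
106 mite NA NA", header = TRUE)

impute.mean <- function(x) replace(x, is.na(x), mean(x, na.rm = TRUE))

# ave() runs impute.mean separately within each taxa group
length_imp <- ave(dat$length, dat$taxa, FUN = impute.mean)
length_imp
```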
Generic Replace Missing Values
impute(x, fun)
ddply(dataframe, ~group, transform, length = impute(length, fun)) #fun = mean, median, min, ...
impute <- function(x, fun) {
missing <- is.na(x)
replace(x, missing, fun(x[!missing]))
}
ddply(dat, ~ taxa, transform, length = impute(length, mean),
width = impute(width, mean))
ddply(dat, ~ taxa, transform, length = impute(length, median),
width = impute(width, median))
ddply(dat, ~ taxa, transform, length = impute(length, min),
width = impute(width, min))
con = function(x) 100
ddply(dat, ~ taxa, transform, length = impute(length, con),
width = impute(width, con))
Create a subset of data with missing values removed per variable
Example:
A<-c(NA,2:6)
B<-c(11:15,NA)
C<-c(NA,3,NA,5,NA,9)
(DF<-data.frame(A,B,C))
with(DF,which(is.na(B)))
with(DF,which(!is.na(B)))
DF[with(DF,which(is.na(B))),]
DF[with(DF,which(!is.na(B))),]
Output:
> A<-c(NA,2:6)
> B<-c(11:15,NA)
> C<-c(NA,3,NA,5,NA,9)
> (DF<-data.frame(A,B,C))
A B C
1 NA 11 NA
2 2 12 3
3 3 13 NA
4 4 14 5
5 5 15 NA
6 6 NA 9
> with(DF,which(is.na(B)))
[1] 6
> with(DF,which(!is.na(B)))
[1] 1 2 3 4 5
> DF[with(DF,which(is.na(B))),]
A B C
6 6 NA 9
> DF[with(DF,which(!is.na(B))),]
A B C
1 NA 11 NA
2 2 12 3
3 3 13 NA
4 4 14 5
5 5 15 NA
#LOOK AT: library(Hmisc) aregImpute()
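When the filter should cover several variables at once, base R's complete.cases() is a compact alternative to which(!is.na(...)). A sketch on the same DF:

```r
A <- c(NA, 2:6)
B <- c(11:15, NA)
C <- c(NA, 3, NA, 5, NA, 9)
DF <- data.frame(A, B, C)

all_complete <- DF[complete.cases(DF), ]                 # no NA in any column
ab_complete  <- DF[complete.cases(DF[, c("A", "B")]), ]  # no NA in A or B only
all_complete
ab_complete
```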
Assumption Testing
Function for Assessing Assumptions
library(gvlma)
gvlma(lm(model))
Normality Assumption (remember the assumption is usually normality of residuals)
#===============================================================================
# LOADING THE LIBRARIES USED
#===============================================================================
library(MASS);library(nortest);library(fBasics);library(psych);library(timeDate)
#===============================================================================
# GENERATING SOME DATA
#===============================================================================
x.norm<-rnorm(n=200,mean=10,sd=2)
#===============================================================================
#LOOKING AT THE GRAPHS (remember somewhat manipulated by the scale you choose)
#===============================================================================
par(mfrow=c(3,1))
h<-hist(x.norm,main="Histogram of observed data w/ normal curve",col="red")
xfit<-seq(min(x.norm),max(x.norm),length=40)
yfit<-dnorm(xfit,mean=mean(x.norm),sd=sd(x.norm))
yfit <- yfit*diff(h$mids[1:2])*length(x.norm)
lines(xfit, yfit, col="blue", lwd=2)
plot(density(x.norm),main="Density estimate of data")
polygon(density(x.norm) ,col="green", border="blue")
truehist(x.norm,main="True Histogram of observed data")
#===============================================================================
#LOOKING AT THE QQ PLOTS (very effective approach)
#===============================================================================
win.graph()
par(mfrow=c(1,2))
qqnorm(x.norm, col="red")
qqline(x.norm)
qqnormPlot(x.norm)
#===============================================================================
#STATISTICAL TESTS OF NORMALITY (vary greatly; be cautious; p>.05 = normal)
#===============================================================================
ksnormTest(x.norm) #Kolmogorov-Smirnov (for large sample) normality test
shapiro.test(x.norm) #Shapiro-Wilk’s (for small samples) test for normality
shapiroTest(x.norm) #Shapiro-Wilk’s test for normality
jarqueberaTest(x.norm) #Jarque–Bera test for normality
dagoTest(x.norm) #D’Agostino normality test
adTest(x.norm) #Anderson–Darling normality test
cvmTest(x.norm) #Cramer–von Mises normality test
lillieTest(x.norm) #Lilliefors (Kolmogorov-Smirnov)
pchiTest(x.norm) #Pearson chi–square normality test
sfTest(x.norm) #Shapiro–Francia normality test
kurtosis(x.norm,type=1) #type 1 biased; type 2 unbiased
kurtosis(x.norm,type=2) #excess selected = moment method or -3 (0 is normal)
kurtosi(x.norm);library(e1071)
kurtosis(x.norm, type=1);kurtosis(x.norm, type=2);kurtosis(x.norm, type=3)
skewness(x.norm, type=1);skewness(x.norm, type=2);skewness(x.norm, type=3)
skew(x.norm);win.graph();par(mfrow=c(1,2))
mardia(x.norm) #for multivariate data
probplot(x.norm) #
EXAMPLE
library(gvlma)
(gvmodel <- with(mtcars,gvlma(lm(mpg~disp*hp*cyl))))
summary(gvmodel)
multiG(27,11,6,2,c(1:12))
plot(gvmodel,onepage=F)
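Since the assumption is usually about the residuals rather than the raw variable, a minimal base R sketch of that check is shown here. The model mpg ~ disp is chosen only for illustration, and a single test should be read with the same caution noted above.

```r
# Fit a simple model and test its residuals, not the raw outcome
mod <- lm(mpg ~ disp, data = mtcars)
res <- residuals(mod)

sw <- shapiro.test(res)  # Shapiro-Wilk test applied to the residuals
sw$statistic
sw$p.value
```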
Addressing Non-Normality Assumptions
Script Folder has several skew, kurtosis checkers.
One Function to Conduct Multiple Tests of Normality
Info on Normality from Andy Field
Transforming Skew (for positive skew)
3 Methods:
1. Log Transformation log10(Xi)
You can’t log values ≤ 0, so if your data has these you must add a constant to adjust the data
2. Square Root Transformation sqrt(Xi)
You can’t sqrt() values < 0, so if your data has negative numbers you must add a constant to adjust the data
3. Reciprocal Transformation 1/(Xhighest score − Xi)
Reverse score (Xhighest score − Xi) to overcome the effect of the inverse making the big scores small and the small scores big.
All of these transformations can be applied to negative skew as well, but the data must be reverse scored (Xhighest score − Xi) first to reverse the skew.
REMEMBER: Transform one numeric variable, transform all of them.
Fixing & Transforming Kurtosis
1. First check for outliers and, if possible, delete any more than SD from the regression line
2. Square Transformation (Xi)^2
Function for normalizing Data
uniformDAT(x)
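A worked sketch of the log and square root transformations and of reverse scoring. The data are made up, the skew() helper is a plain moment-based estimate written for this example, and the "+ 1" in the reverse scoring is an added assumption so that no reversed score is 0 before taking a log.

```r
# Moment-based skewness, written out so no package is needed
skew <- function(v) {
  d <- v - mean(v)
  mean(d^3) / mean(d^2)^1.5
}

x <- c(1, 2, 2, 3, 3, 3, 4, 5, 9, 20)  # positively skewed toy data
log_x  <- log10(x)
sqrt_x <- sqrt(x)
c(raw = skew(x), log10 = skew(log_x), sqrt = skew(sqrt_x))

# Negative skew: reverse score first, then transform
y <- c(20, 19, 19, 18, 18, 18, 17, 16, 12, 1)  # negatively skewed toy data
reversed <- max(y) - y + 1                     # "+ 1" is an assumption, see above
log_rev  <- log10(reversed)
c(raw = skew(y), reversed = skew(reversed), log10_reversed = skew(log_rev))
```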
# NORMAL TRANSFORMATION FUNCTION CODE WITH EXAMPLE
#===============================================================
uniformDAT <- function (x) {
x <- rank(x,
na.last = "keep",
ties.method = "average")
n <- sum(!is.na(x))
x / (n + 1)
}
normalize <- function (x) {
qnorm(uniformDAT(x)) #uniformDAT() is the rank-based function defined above
}
#===============================================================
# THE DATA
#===============================================================
par(mfrow=c(3,2))
x1<-sample(1:100,100, replace=T)
x2<-ifelse(x1<21,x1+79,x1)
x3<-ifelse(x1>80,x1-79,x1)
#===============================================================
# WHAT THE FUNCTION DOES GRAPHICALLY
#===============================================================
f(x1, FUN = normalize, main = "Distribution 1") #f() is a plotting helper that is not defined in these notes
f(x2, FUN = normalize, main = "Left Skewed distribution")
f(x3, FUN = normalize, main = "Right Skewed distribution")
#===============================================================
# LOOKING AT THE DATA
#===============================================================
list("x1 data before"=x1,"x1 data after"=uniformDAT(x1),
"left skewed data before"=x2,
"left skewed data after"=uniformDAT(x2),
"right skewed data before"=x3,
"right skewed data after"=uniformDAT(x3))
source("C:/Users/Rinker/Desktop/PhD Program/CEP 523-Stat Meth Ed Inference/R Stuff/Scripts/Assumption Testing/Tests of Normality.txt")
Source Path
source("C:/Users/Rinker/Desktop/PhD Program/CEP 523-Stat Meth Ed Inference/R Stuff/Scripts/Assumption Testing/Normal Tansformation.txt")
mod1<-lm(mpg~disp,data=mtcars)
mod2<-lm(log(mpg)~disp,data=mtcars)
windows(h=6,w=12)
par(mfrow=c(1,2))
with(mtcars,plot(mpg~disp,main="Raw Plot"))
abline(reg=mod1,lty=3,col="green")
with(mtcars,plot(log(mpg)~disp,main="Log Transformation"))
abline(reg=mod2,lty=3,col="blue")
mod1;mod2
Graph example: log transformation
Homogeneity Assumption
Equal variance of 2 populations
var.test(x,y)
Equal variance of groups
bartlett.test(numeric variable, grouping factor) and levene.test(numeric variable, grouping factor)
EXAMPLE
#=================================================================================
# TESTING IF TWO SAMPLE VARIANCES ARE EQUAL
#=================================================================================
# GENERATING THE DATA
x1<-rnorm(1000,mean=100)
y1<-rnorm(1000,mean=100)
#=================================================================================
# MEAN AND STANDARD DEVIATION
#=================================================================================
descriptives<-data.frame(c(mean(x1),sd(x1)),c(mean(y1),sd(y1)))
colnames(descriptives)<-c("x1","y1");rownames(descriptives)<-c("mean","sd")
descriptives
#=================================================================================
# GRAPH THE DATA
#=================================================================================
par(mfrow=c(2,1));library(descr)
histkdnc(x1,main="x1");histkdnc(y1,main="y1")
#=================================================================================
# TESTING EQUAL VARIANCES; function--> var.test()
#=================================================================================
list("NOTE: p > .05; not significantly different"=var.test(x1,y1,alternative="two.sided",conf.level=.95))
EXAMPLE
#=================================================================================
# TESTING IF GROUP VARIANCES ARE EQUAL
#=================================================================================
# GENERATING THE DATA
library(doBy) #recodeVar() comes from doBy
rannum<-c(sample(1:5,1000,replace=T))
factor<-c(recodeVar(rannum,src=c(1,2,3,4,5),
tgt=c("blue","black","red","green","orange"), default=NULL, keep.na=TRUE))
dep.var<-rnorm(1000)
color.df<-data.frame(factor,dep.var);tail(color.df)
#=================================================================================
# MEAN AND STANDARD DEVIATION
#=================================================================================
library(doBy)
summaryBy(dep.var~factor, data = color.df,
FUN = function(x) { c(n = length(x),mean = mean(x), sd = sd(x)) } )
#=================================================================================
# GRAPH THE DATA
#=================================================================================
black<-subset(color.df,factor=="black");blue<-subset(color.df,factor=="blue")
green<-subset(color.df,factor=="green");orange<-subset(color.df,factor=="orange")
red<-subset(color.df,factor=="red");par(mfrow=c(3,2));library(descr)
histkdnc(dep.var,main="Overall");histkdnc(black$dep.var,main="black",col="black")
histkdnc(blue$dep.var,main="blue",col="blue");histkdnc(green$dep.var,main="green",col="green")
histkdnc(orange$dep.var,main="orange",col="orange");histkdnc(red$dep.var,main="red",col="red")
#=================================================================================
# TESTING EQUAL VARIANCES; functions--> levene.test() & bartlett.test()
#=================================================================================
library(lawstat)
list("Levene's Test"=levene.test(dep.var, factor,location= "mean"),
"Bartlett's Test"=bartlett.test(dep.var, factor))
Equal variance of groups (less sensitive to outliers)
fligner.test(x, ...)
fligner.test(x, g, ...)
fligner.test(formula, data, subset, na.action, ...)
fligner.test(list(group a,group b,group c,group…n))
Arguments
x: a numeric vector of data values, or a list of numeric data vectors.
g: a vector or factor object giving the group for the corresponding elements of x. Ignored if x is a list.
formula: a formula of the form lhs ~ rhs where lhs gives the data values and rhs the corresponding groups.
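A short sketch of the three calling styles listed above, using the InsectSprays data set that ships with R (my choice of data, not the notes'). All three forms should describe the same test, so the statistics are expected to agree.

```r
# Fligner-Killeen test called three ways on the same data
ft_formula <- fligner.test(count ~ spray, data = InsectSprays)
ft_vectors <- fligner.test(InsectSprays$count, InsectSprays$spray)
ft_list    <- fligner.test(split(InsectSprays$count, InsectSprays$spray))
ft_formula$p.value   # p < .05 would suggest the group variances differ
c(ft_formula$statistic, ft_vectors$statistic, ft_list$statistic)
```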
Sphericity (Assumption of Repeated Measures Test)
Sphericity, in a nutshell, means that the variances of the differences between the repeated measurements should be about the same.
mauchly.test
Greenhouse-Geisser
Outliers
library(mvoutlier) & library(outliers)
?influence.measures
Create Sequences of Integers
Input: 1:5
Output: 1 2 3 4 5
Create Sequences of Real Numbers
Input: seq(from=3,to=7,by=.5) NOTE: seq(3,7,.5) would do too
Output: 3.0 3.5 4.0 4.5 5.0 5.5 6.0 6.5 7.0
from, to: the starting and (maximal) end value of the sequence.
by: increment of the sequence (leaving this out is the same as n:m).
length.out: desired length of the sequence. A non-negative number, which for ‘seq’ and ‘seq.int’ will be rounded up if fractional.
Repeat Integer Pattern (term: replicate)
rep(pattern, times=)
rep(pattern, times=,each=)
Search Patterns Within Vectors spattern
rle(x)
inverse.rle(rle(x))
EXAMPLE
rep(1:2, times=1,each=25)
rep(1:2, times=25)
#=================================================================
> rep(1:2, times=25)
[1] 1 2 1 2 1 2 1 2 1 2 1 2 1 2 1 2 1 2 1 2 1 2 1 2 1 2 1 2 1 2 1 2 1 2 1 2 1 2
[39] 1 2 1 2 1 2 1 2 1 2 1 2
> rep(1:2, times=1,each=25)
[1] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 2 2 2 2 2 2 2 2 2 2 2 2 2
[39] 2 2 2 2 2 2 2 2 2 2 2 2
x <- rev(rep(6:10, 1:5))
rle(x)
inverse.rle(rle(x))
z <- c(TRUE,TRUE,FALSE,FALSE,TRUE,FALSE,TRUE,TRUE,TRUE)
rle(z)
inverse.rle(rle(z))
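One use for rle() that goes a step past the examples above: finding the longest run of TRUE values and where it starts. This is a sketch of my own; the names runs, longest_true and run_starts are hypothetical.

```r
# Longest run of TRUE in a logical vector, via run-length encoding
z <- c(TRUE,TRUE,FALSE,FALSE,TRUE,FALSE,TRUE,TRUE,TRUE)
runs <- rle(z)
longest_true <- max(runs$lengths[runs$values])       # length of longest TRUE run
run_starts <- cumsum(c(1, head(runs$lengths, -1)))   # index where each run begins
start_of_longest <- run_starts[runs$values & runs$lengths == longest_true]
longest_true
start_of_longest
```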
Generating Random Numbers, Integers and Categorical Variables (random sample)
Random Normal
rnorm(n=, mean=, sd=) Where n is the number of samples
Random Normal Between Certain Values 1 (throws away samples outside the range)
x <- rnorm(n=500, mean=42, sd=10)
x <- x[x>=30 & x<=50]
Random Normal Between Certain Values 2 (doesn’t throw away samples)
n <- 1000 #samples desired
L <- .2 #lower limit
U <- .8 #upper limit
m <- 1 #mean
s <- 1 #sd
x <- qnorm(runif(n, pnorm(L, mean=m, sd=s), pnorm(U, mean=m, sd=s)), mean=m, sd=s)
x
Random Integers
sample(seq,n, replace=T) Where seq is a sequence such as 10:80, and n is the number of samples
Random Integers 2
sample.int(x,n, replace=T) Draws from the integers 1 through x
Random Categorical Vector (See generate factor below) sample(categorical.vector,n,replace=T)
Example colors<-c(sample(c("blue","red","green","orange"),10,replace=T))
hue<-abs(rnorm(10))
colorsDF<-data.frame(colors,hue)
#colors is the random creation of a categorical variable
colors;hue;colorsDF
Note: sample() with recodeVar() from the doBy library could also be used for generating a random character vector. (See also relevel, which is more efficient.)
library(doBy)
recodeVar(sample(1:5,25,replace=T), src=c(1,2,3,4,5), tgt=c("a","e","i","o","u"), default=NULL, keep.na=TRUE)
[1] "u" "a" "o" "u" "e" "i" "i" "i" "u" "u" "o" "o" "i" "e" "u" "u" "u" "o" "a" "a" "u" "e" "o" "a" "u"
EXAMPLE
sample.int(2,10,replace=T)#flip a coin 10 times
sample.int(6,7,replace=T)#roll a die 7 times
sample.int(52,5,replace=T)#pick a card 5 times
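The two "Random Normal Between Certain Values" methods listed above differ in how many values come back. A sketch comparing them, assuming the mean 42 / sd 10 / range 30 to 50 setup from method 1 (the seed and the names x_reject, x_inverse are mine):

```r
# Rejection (method 1) vs. inverse-CDF (method 2) for a bounded normal sample
set.seed(42)
n <- 1000; L <- 30; U <- 50; m <- 42; s <- 10
# Method 1: draw, then discard anything outside [L, U]
x_reject <- rnorm(n, mean = m, sd = s)
x_reject <- x_reject[x_reject >= L & x_reject <= U]
# Method 2: draw uniform probabilities between P(X<=L) and P(X<=U), map back with qnorm
p <- runif(n, pnorm(L, mean = m, sd = s), pnorm(U, mean = m, sd = s))
x_inverse <- qnorm(p, mean = m, sd = s)
length(x_reject)    # fewer than n: some draws were thrown away
length(x_inverse)   # exactly n
range(x_inverse)    # stays inside the limits
```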
ALLOW REPRODUCIBLE RANDOM NUMBERS NOTE: you can use set.seed() to enable someone else to reproduce exactly the same random numbers.
set.seed(15)# allow reproducible random numbers
sample(1:2, size=5, replace=TRUE)
sample(1:2, size=5, replace=TRUE)
Generate Factor (non random)
gl(n, k, length = n*k, labels = 1:n, ordered = FALSE)
Convert scores to z-scores
scale(vector)
The function SCALEfun() below is an example of how z scores “normalize” the data. The code creates a vector of random integers and then converts it to a z-score vector. There is also a comparison of both vectors with histograms.
Probability calculation: compute the number of combinations
choose(n,k)
EXAMPLE: choose(54,5)
1/choose(54,5)
Generate all the possible combinations of the elements of a vector, given each element is used only once
combn(x,m)
x: vector source for combinations, or integer n for x <- seq_len(n).
m: number of elements to choose
SCALEfun<-function(){
ay<-sample(1:100, 25, replace=T)
az<-c(scale(ay, center = TRUE, scale = TRUE))
par(mfrow=c(2,1));library(descr)
histkdnc(ay,main="Before z score Transformation",col="red")
histkdnc(az,main="After z score Transformation",col="green")
list(ay,az)
}
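A small check, not in the original notes, that scale() with its defaults is the usual z-score formula (value minus mean, divided by the sample standard deviation). The vector is made up for illustration.

```r
# scale() vs. the z-score formula computed by hand
ay <- c(12, 47, 55, 68, 90)
z_scale <- c(scale(ay))                # c() drops the matrix attributes
z_hand  <- (ay - mean(ay)) / sd(ay)
all.equal(z_scale, z_hand)
mean(z_scale)   # approximately 0
sd(z_scale)     # 1
```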
EXAMPLE
gl(3, 5, length=100,labels = c("Control", "Treat","Died"))
EXAMPLES
combn(letters[1:4], 2)
combn(LETTERS[1:10], 9)
combn(0:10,10)
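choose() counts the combinations that combn() lists, so the two can be checked against each other. A brief sketch (the names n_pairs and pairs are mine):

```r
# choose(n, k) gives the count; combn() returns one combination per column
n_pairs <- choose(4, 2)
pairs <- combn(letters[1:4], 2)
n_pairs
ncol(pairs)          # same number as choose(4, 2)
choose(54, 5)        # number of possible 5-from-54 draws
1/choose(54, 5)      # chance of any single one of them
```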
Generate All Possible Outcomes (outer product of two vectors, returned as a matrix)
outer(vector 1,vector 2,FUN=)
EXAMPLES
outer(month.abb, 1999:2003, FUN = "paste")
data.frame(outer(c("R","r"), c("R","r"), FUN = "paste"))#Punnett Square
outer(c("H","T"), 1:6, FUN = "paste")#outcomes of flipping coin and rolling a die
outer(LETTERS[1:10], 0:9, FUN = "paste")
outer(0:9, 0:9, FUN = "*") #multiplication table 0-9
outer(0:20, 0:20, FUN = "*") #multiplication table 0-20
outer(0:9, 1:9, FUN = "/") #division table 0-9
outer(0:9, 0:9, FUN = "^") #exponential table 0-9
outer(0:9, 0:9, FUN = "-") #subtraction table 0-9
outer(0:9, 0:9, FUN = "+") #addition table 0-9
Generate All Possible Combinations for a List of Factors
expand.grid(factor.name1=c("factor levels"), factor.name2=c("factor levels"), factor.name…n=c("factor levels"))
List Prime Numbers
library(matlab); primes(n)
Perform Prime Factorization
library(matlab); factors(n)
Create Magic Squares
library(matlab); magic(n)
Generate a list of Dummy Codes for a Factor
model.matrix(~factor-1)
EXAMPLE
expand.grid(age=c(4:10),academic.level=c("high","med","low"),sex=c("male","female"))
EXAMPLE:
(iris.dummy<-with(iris,model.matrix(~Species-1)))
(IRIS<-data.frame(iris,iris.dummy))
Data Manipulation sround sformat
Changing digits
options(digits=#) #This is a global change
print(x,digits=#) #Local change
cat(format(x,digits=#)) #Local change for functions
round(x, digits=#) #Rounds x to # decimal places (to an integer when digits is left out); SEE Cut Points for an example of rounding a data frame
signif(x, digits=#) #Rounds x to # significant digits
options(scipen=99) #Eliminates scientific notation (Global Change)
format(x,…) #SEE BELOW FOR MORE ABOUT FORMAT
Force rounding with a certain number of digits
sprintf("%.49f", (1+sqrt(5))/2)
sprintf("%.49f", pi)
library(mpc); mpc(1, 3000) / mpc(998001, 3000)
Format (ENABLES digits)
format(x,…)
Arguments
x any R object (conceptually); typically numeric.
trim logical; if FALSE, logical, numeric and complex values are right-justified to a common width: if TRUE the leading blanks for justification are suppressed.
digits how many significant digits are to be used for numeric and complex x. The default, NULL, uses getOption("digits").
This is a suggestion: enough decimal places will be used so that the smallest (in magnitude) number has this many significant digits, and also to satisfy nsmall. (For the interpretation for complex numbers see signif.)
nsmall the minimum number of digits to the right of the decimal point in formatting real/complex numbers in non-scientific formats. Allowed values are 0 <= nsmall <= 20.
justify should a character vector be left-justified (the default), right-justified, centred or left alone.
width default method: the minimum field width or NULL or 0 for no restriction.
AsIs method: the maximum field width for non-character objects. NULL corresponds to the default 12.
na.encode logical: should NA strings be encoded? Note this only applies to elements of character vectors, not to numerical or logical NAs, which are always encoded as "NA".
scientific Either a logical specifying whether elements of a real or complex vector should be encoded in scientific format, or an integer penalty (see options("scipen")). Missing values correspond to the current default penalty.
...
further arguments passed to or from other methods.
Round to the nearest fraction
x <- c(4.2, 4.3, 4.8)
#Method 1 (generalizes to any rounding)
library(plyr)
round_any(x, 3)
round_any(x, 1)
round_any(x, 0.5)
round_any(x, 0.2)
#Method 2 (nearest half)
round(x*2)/2
#Method 3 (nearest half)
round(x/5, 1)*5
round() can be used to round integers as well
options(digits=10)
(x<-pi*12345)
round(x,-4:4) #negative rounds the integer
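The digit functions above are easy to mix up, so here is a side-by-side sketch on one number. The value is arbitrary; note that round() and signif() return numbers while format() and sprintf() return character strings.

```r
# round() counts decimal places, signif() counts significant digits
v <- 1234.5678
r2   <- round(v, 2)     # two decimal places
rneg <- round(v, -2)    # negative digits round to the left of the decimal point
s2   <- signif(v, 2)    # two significant digits
s6   <- signif(v, 6)    # six significant digits
f2   <- format(2, nsmall = 2)   # character: pads to at least 2 decimals
p2   <- sprintf("%.2f", v)      # character: exactly 2 decimals
list(r2, rneg, s2, s6, f2, p2)
```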
Indexing dataframe$object or dataframe[,"object"]
Determine number of observations in a dataframe
nrow(dataframe)
Determine number of variables in a dataframe
ncol(dataframe)
Determine number of levels of a factor
nlevels(factor)
Look at beginning or end of a data frame
head(dataframe) or dataframe[1:10,] #I compiled this into the function HEAD() in .First
tail(dataframe) or dataframe[(nrow(dataframe)-10):(nrow(dataframe)),] #Compiled as function TAIL() in .First
begend(dataframe) #Looks at first 5 and last 5 observations of a dataframe
Locating the info for a single row (observation)
Type: data[10,]
Where data is your data frame (set) and the 10 is the tenth observation.
Changing a numeric variable by a constant
Type: b<-sa*.45
Where b is the new variable vector name, sa is the original numeric variable, and .45 is the constant. This could be very useful for creating new combined variables.
Minimal Specifications
R looks for objects within the environment you specify that minimally meet your requirements (partial matching of names with $):
Example
CO2$T #Both Treatment and Type fit this so R returns NULL
CO2$Ty
CO2$P
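A sketch of the same partial matching on a tiny made-up data frame (column names borrowed from CO2; df_pm is a hypothetical name), plus the contrast with [[ ]], which matches names exactly unless told otherwise.

```r
# $ does partial matching; [[ ]] is exact by default
df_pm <- data.frame(Treatment = c("chilled", "nonchilled"),
                    Type = c("Quebec", "Mississippi"),
                    Plant = c("Qn1", "Mn1"))
df_pm$T                       # NULL: "T" fits both Treatment and Type
df_pm$Ty                      # unique partial match, returns Type
df_pm[["Ty"]]                 # NULL: no column is named exactly "Ty"
df_pm[["Ty", exact = FALSE]]  # partial matching switched on
```

Spelling names out in full avoids surprises when a new column later makes a short prefix ambiguous.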
List all the variables created
Type: ls() or objects()
Look at all the code and commands you’ve typed in a session in a new window
history(number)
Sys.setenv(R_HISTSIZE=10000) #increase the 512 line limit even more
Pull up the last value
.Last.value #A quasi constant in R in that it is a non function that takes the value of the last input
Break a data frame into Groups of a factor
split(df,factor) #Creates a list of data frames by the Groups of the Chosen Factor
EXAMPLE
warpbreaks
(groups<-with(warpbreaks,split(warpbreaks,tension))) #method 1
lapply(groups,function(d) mean(d$breaks)) #mean() and sd() need a numeric column, not a whole data frame
lapply(groups,nrow)
lapply(groups,function(d) sd(d$breaks))
takes.a.while <- function(){
Sys.sleep(10)
rnorm(20)
}
takes.a.while()
# Oh no I forgot to assign it to a variable
lifesaver <- .Last.value
lifesaver
Creating a subset of data (useful for control vs. treatment groups)
Using your original data set type: gg<-subset(a,g=="f", select=NULL)
Where gg is the name of the new data subset, subset is the function, a is the large data set name, g is the variable you wish to make a subset around, "f" is the level you want to isolate to make a new data set of, and select gives the columns you want.
[Figure: original data set and new subset]
Remember to rename the variables for each data set in the same way you did with the larger data set (let’s say we have a female and male subset for example). This can be done for any number of variables in the subset, enabling tests on the subgroups.
The summary is a great follow up to generate some quick and useful information about each subset: Type: summary(gg)
Also see select cases below
subset(mtcars,select = mpg:vs)
subset(airquality, Temp > 80, select = c(Ozone, Temp))
subset(airquality, Day == 1, select = -Temp)
subset(airquality, select = Ozone:Wind)
Select certain rows or columns by criteria (other ways to subset) slogical
Select columns from a data frame that are just numeric or just factors
mtcars2[,sapply(mtcars2,is.numeric)] #same as NUM(df) in useful functions
mtcars2[,sapply(mtcars2,is.factor)] #same as FAC(df) in useful functions
NAs
Please also see Missing Values for created functions and specific handling of NAs
Create a Subset of non NA for just one column/variable
dataframe2<-subset(dataframe, !is.na(factor))
#========================================================
#select mpg = 21
with(mtcars,mtcars[mpg==21,])
#========================================================
#select mpg of at least 30
with(mtcars,mtcars[mpg>=30,])
#========================================================
#select mpg of at least 30 and disp less than 80
with(mtcars,mtcars[mpg>=30&disp<80,])
#========================================================
#select mpg at or above median mpg and disp over 110
with(mtcars,mtcars[mpg>=median(mpg)&disp>110,])
#========================================================
mtcars.8cyl<-mtcars[mtcars$cyl==8,] #cyl must be referenced through mtcars unless attached
mtcars.8cyl<-mtcars.8cyl[-c(2)]
mtcars.8cyl #this is the same as:
subset(mtcars,cyl==8,select=-cyl)
EXAMPLE:
mtcars2<-mtcars
mtcars2$carb<-with(mtcars2,replace(carb, carb==1,NA))
mtcars2;paste("n is = to",length(mtcars2$mpg))
mtcars3<-subset(mtcars2, !is.na(carb))
#Line above is the code (the rest is generating a dataset with NAs in it)
mtcars3;paste("n is = to",length(mtcars3$mpg))
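Why !is.na(x) is the safer filter: comparing a column to is.na() of itself (x != is.na(x)) happens to drop the NAs but also drops legitimate zeros, because FALSE is treated as 0. A sketch on made-up data (names d_na, bad, good, also are mine):

```r
# Filtering out NAs: a fragile idiom vs. the explicit test
d_na <- data.frame(id = 1:5, score = c(3, 0, NA, 7, 0))
bad  <- subset(d_na, score != is.na(score))  # loses the two zero scores as well
good <- subset(d_na, !is.na(score))          # drops only the NA row
also <- d_na[complete.cases(d_na), ]         # same rows here; checks every column
nrow(bad); nrow(good); nrow(also)
```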
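with(mtcars,mtcars[cyl==8|cyl==6,])
with(mtcars,mtcars[cyl==8|gear>=5,])
with(mtcars,mtcars[cyl==8&mpg>=18,])
with(mtcars,mtcars[cyl==8&mpg>=18|wt>=3.5,])
with(mtcars,mtcars[cyl==8&mpg>=18&wt>=3.5,])
with(mtcars,mtcars[cyl==8|cyl==6,]) #same rows as the %in% version below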
subset(mtcars, (cyl %in% c(6, 8)))
subset(mtcars, !(cyl %in% c(6, 8)))
subset(CO2, !(Plant %in% c("Qn1", "Mc3", "Mc1", "Mn2")))
mtcars2<-mtcars
mtcars2[,c("mpg","disp","wt")] #select some columns meth. 1
subset(mtcars2,select=c(mpg,disp,wt)) #select some columns meth. 2
#Subset really fine tunes the selection:
subset(mtcars2,select=c(mpg,disp,wt),subset=c(mpg>25&wt<=4))
subset(mtcars2,select=c(mpg,disp,wt),subset=c(mpg>25&wt<=4|disp>=400))
LOGICAL OPERATORS: & | !
COMPARISON OPERATORS: == >= <= !=
VALUE MATCHING: %in%
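When & and | are mixed in one condition, as in several of the examples above, R evaluates & before |. A sketch showing that parentheses change which rows are selected (the variable names are mine):

```r
# Operator precedence: & binds tighter than |
TRUE | FALSE & FALSE      # TRUE, read as TRUE | (FALSE & FALSE)
(TRUE | FALSE) & FALSE    # FALSE
sel_default  <- with(mtcars, cyl == 8 & mpg >= 18 | gear >= 5)
sel_explicit <- with(mtcars, (cyl == 8 & mpg >= 18) | gear >= 5)
sel_other    <- with(mtcars, cyl == 8 & (mpg >= 18 | gear >= 5))
identical(sel_default, sel_explicit)   # same grouping
sum(sel_default); sum(sel_other)       # different grouping, different row counts
```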
Index a Single Column Without Dropping the Variable Name drop name
subset(dataframe, select=column.number)
dataframe[, column.number, drop = FALSE]
dataframe[column.number] or dataframe[column.name] #no comma
#EXAMPLES
names(subset(mtcars, select=1))
names(mtcars[,1, drop = FALSE])
mtcars[1]; mtcars["mpg"]
Keep Column Names That Do Not Conform to R Standards
data.frame('a b'=1, check.names=F)
transform(data.frame('a b'=1, check.names=F), `c d`=`a b`, check.names=F)
Split a Numeric Vector by a Categorical vector
split(numeric,factor)
#================================================================
# CREATE THE DATA SET
#================================================================
colors<-c(sample(c("blue","red","green","orange"),20,replace=T))
hue<-abs(rnorm(20))
colorsDF<-data.frame(colors,hue)
#================================================================
# split(numeric.var,factor)
#================================================================
with(colorsDF,split(hue,colors))
#================================================================
# USING SPLIT FOR MEANS AND SD etc.
#================================================================
sapply(with(colorsDF,split(hue,colors)),mean)
sapply(with(colorsDF,split(hue,colors)),sd)
OUTPUT
> with(colorsDF,split(hue,colors))
$blue
[1] 0.0143338132 1.9922393211 0.7892910777 0.0004594093
$green
[1] 0.5897572 0.7480668 2.8692182 0.2506951
$orange
[1] 1.3469976 0.8757391 1.4951192 1.3781447
$red
[1] 0.05447693 0.10730018 2.20397056 0.05800449 1.84962318 0.15243645 0.29141207
[8] 0.05877585
===============================================
USING SPLIT FOR MEANS AND SD etc.
================================================
> sapply(with(colorsDF,split(hue,colors)),mean)
blue green orange red
0.6990809 1.1144343 1.2740002 0.5970000
> sapply(with(colorsDF,split(hue,colors)),sd)
blue green orange red
0.9376117 1.1881110 0.2730569 0.8909689
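The split() plus sapply() pattern above has two common shortcuts in base R, tapply() and aggregate(). A sketch showing that all three give the same group means; the data are regenerated here with an arbitrary seed, so the numbers will not match the output shown above.

```r
# Group means three ways: split + sapply, tapply, aggregate
set.seed(7)
colors <- sample(c("blue", "red", "green", "orange"), 20, replace = TRUE)
hue <- abs(rnorm(20))
colorsDF <- data.frame(colors, hue)
by_split     <- sapply(with(colorsDF, split(hue, colors)), mean)  # named vector
by_tapply    <- with(colorsDF, tapply(hue, colors, mean))         # 1-d array
by_aggregate <- aggregate(hue ~ colors, data = colorsDF, FUN = mean)  # data frame
by_split
by_tapply
by_aggregate
```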
Sorting & Ordering Observations #1 (see also arrange() & orderBy()) ssort sorder
NOTE: You can use sort() for vectors
x[order(x$B),] #sort a dataframe by the order of the elements in B
x[rev(order(x$B)),] #sort the dataframe in reverse order
with(mtcars,mtcars[ order(-cyl, gear, carb) ,]) #sort ascending and descending (use of - )
Sorting #2 arrange() MORE EFFICIENT THAN ORDER
library(plyr)
arrange(df,…)
Sorting #3 orderBy() I like this one the best
library(doBy)
orderBy(~formula, data=)
Duplicate Certain Rows of a Data Frame duplicate rows
reprow(dataframe, column, value)
#EXAMPLE
mtcars[1:10] #columns 1-10 of the data set (no comma, so this selects columns)
#order ascending
#rev descending
mtcars[rev(order(mtcars$cyl)),][1:10] #Sort by cyl (descending)
mtcars[order(mtcars$cyl),][1:10] #Sort by cyl (ascending)
mtcars[order(mtcars$cyl,mtcars$vs),][1:10] #Sort by cyl then vs (asc.)
mtcars[order(mtcars$cyl,mtcars$vs,mtcars$gear),][1:10] #Sort by cyl,vs, then gear( asc.)
library(plyr)
(mtcars2<-data.frame(mtcars[1:16,],"grade"=factor(c(rep(1:3,each=4,times=1),rep("k",4)),levels=c("k","1","2","3"))))
#sort by grade (descending: 3, 2, 1, k), then cyl (ascending), then disp (descending)
arrange(mtcars2, -as.numeric(grade),cyl,-disp)
arrange(mtcars, cyl, desc(disp))
library(doBy)
(mtcars2<-data.frame(mtcars[1:16,],"grade"=factor(c(rep(1:3,each=4,times=1),rep("k",4)),levels=c("k","1","2","3"))))
orderBy(~-grade + cyl, data=mtcars2)
orderBy(~-grade + -disp + hp, data=mtcars2)
#THE FUNCTION
reprow <- function(dataframe, column, value) {
dataframe$ID <- 1:nrow(dataframe)
DF <- data.frame(rbind(rbind(dataframe[which(dataframe[,column] %in% value),
], dataframe[which(dataframe[ ,column] %in% value), ]),
rbind(dataframe[which(!dataframe[,column] %in% value), ])))
DF <- DF[order(DF$ID), ]
DF$ID <- NULL
rownames(DF) <- 1:nrow(DF)
DF
}
#EXAMPLES
reprow(mtcars, 'cyl', 4) #repeats any row with that value a second time
reprow(mtcars, 'cyl', c(4, 6))
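The same row duplication can be sketched more compactly by repeating row indices with rep(). This is an alternative of my own, not the function from the notes; reprow2 is a hypothetical name, and unlike reprow() it resets the row names rather than renumbering after a sort.

```r
# Duplicate matching rows by repeating their indices
reprow2 <- function(dataframe, column, value) {
  # 2 copies for rows whose column value is in `value`, 1 copy otherwise
  times <- ifelse(dataframe[[column]] %in% value, 2, 1)
  out <- dataframe[rep(seq_len(nrow(dataframe)), times = times), ]
  rownames(out) <- NULL
  out
}
dup4 <- reprow2(mtcars, "cyl", 4)
nrow(mtcars); nrow(dup4)
```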
Replace (Go through data set and find values and replace them with a new value) [recode; missing values] srplace
replace(dataframe, list, values) #remember to assign this to some object, i.e.,
x <- replace(dataframe,dataframe==-9,NA) #similar to the operation x[x==-9] <- NA
This can also be done for just one variable in a data frame by saving the output as follows:
dataframe$variable<- with(dataframe,replace(variable,variable==-9,NA))
EXAMPLE:
mtcars2<-mtcars
mtcars2$carb<-with(mtcars2,replace(carb, carb==1,NA)); mtcars2 #Way 1
mtcars2<-mtcars #RESET mtcars2
mtcars2$carb[mtcars2$carb==4]<-NA; mtcars2 #Way 2
Remove values in a vector (subset could also be used)
vector[!is.element(vector,c(values to remove))]
Example
x <- sample(0:20,100,replace=T)
table(x)
x2<-x[!is.element(x, c(0,9,20))] #removes the values 0, 9 and 20
table(x2)
Rename a column (rename a variable) method 1
names(dataframe)[c(column#)] <- "new.name"
Rename a column (rename a variable) method 2 rename variable rename column
library(reshape)
rename(data.frame, vector of name conversions)
Rename a column (rename a variable) method 3 rename variable rename column
library(gdata)
rename.vars(data, from="", to="", info=TRUE)
Removing duplicate rows in a data frame (see matching)
unique(data.frame)
Locating variables and values
which(x==" ")
Finding Duplicate Entries in a Column
duplicated(x)
Arguments
x: a vector or a data frame or an array or NULL.
incomparables: a vector of values that cannot be compared. FALSE is a special value, meaning that all values can be compared, and may be the only value accepted
#EXAMPLE
iris2<-data.frame(rbind(iris[1:15,],iris[1,],iris[3,]))
iris2<-with(iris2,iris2[order(Sepal.Length),])
rownames(iris2)<-c(1:17);iris2
mess<-c("\bNOTE:\n",
"\bThe unique() function searches for duplicates;\n",
"\bnotice observation 6 & 14 are eliminated\n")
unique(iris2,incomparables = FALSE)
cat(mess)
Example:
rename(mtcars, c(wt = "weight", cyl = "cylinders"))
EXAMPLE
(IRIS<-iris[1:30,])
IRIS$Petal.Length[!duplicated(IRIS$Petal.Length)]
IRIS$Petal.Length[duplicated(IRIS$Petal.Length)]
which(duplicated(IRIS$Petal.Length))
which(!duplicated(IRIS$Petal.Length))
EXAMPLE
dat<-mtcars
rename.vars(dat, from="mpg", to="new", info=TRUE)
rename.vars(dat, from=c("wt","mpg"), to=c("new1","new2"), info=TRUE)
example
mtcars.let<-data.frame("lets"=rep(letters[c(19:26)],4) ,mtcars) #making a dataframe
with(mtcars.let,which(lets=="v"))
#[1] 4 12 20 28
with(mtcars.let,which(cyl=="6"))
#[1] 1 2 4 6 10 11 30
EXAMPLE
names(mtcars)[names(mtcars)=='hp'] <-'sweet.new.hp'
names(mtcars)
names(mtcars)[3] <- "new.name"
names(mtcars)
names(mtcars)[c(2,5)] <- c("new.name2","new.name3")
names(mtcars)
Finding Truly Unique Items in Vector (3 methods)
#DATA SET:
x <- c(378, 380, 380, 380, 380, 360, 187, 380)
#METHOD 1 [fastest and numeric/unsorted]
setdiff(unique(x), x[duplicated(x)])
#METHOD 2 [medium speed & numeric/sorted]
y <- rle(sort(x)); y[[2]][y[[1]]==1]
#METHOD 3 [slowest/character/sorted]
b <- table(x); names(b[b==1])
test replications elapsed relative user.self sys.self user.child sys.child
3 METHOD_1 1000 0.08 1.000 0.06 0 NA NA
2 METHOD_2 1000 0.25 3.125 0.23 0 NA NA
1 METHOD_3 1000 0.61 7.625 0.48 0 NA NA
Same idea applied to character data
set.seed(100)
x <- sample(c("Certin", "features", "of", "the", "setting", "affected"), 13, replace=T)
x
hapax1 <- function(x) {x <- na.omit(tolower(x)); setdiff(unique(x), x[duplicated(x)])}
hapax1(x)
hapax2 <- function(x)names(table(tolower(x))[table(tolower(x))==1])
hapax2(x)
hapax3 <- function(x) {y <- rle(sort(tolower(x))); y[[2]][y[[1]]==1]}
hapax3(x)
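A fourth way to get the truly unique items, offered as a sketch rather than one of the benchmarked methods above: flag duplicates scanning from both ends and keep what neither scan flags. It preserves the original order, like method 1.

```r
# Items that occur exactly once, using duplicated() in both directions
x <- c(378, 380, 380, 380, 380, 360, 187, 380)
once <- x[!(duplicated(x) | duplicated(x, fromLast = TRUE))]
once
# agrees with METHOD 1 from above
method1 <- setdiff(unique(x), x[duplicated(x)])
identical(once, method1)
```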
Using Which to Find Even and Odd Numbers even odd
which(vector%%2 == 1) #find odd
which(vector%%2 == 0) #find even
Using TRUE/FALSE to find odd and even elements of objects (select every other element of a vector)
object[c(T, F)] #odds
object[c(F, T)] #evens
Add column to a data set or change an existing column (add variable)
transform(data.set.name,new.var=(Science.Comprehension*10))
Delete a variable from a data set (see also subset()) delete variable delete column drop variable drop column
You could select all the columns you want and create a subset or: POSITIVE SELECTION EXAMPLE:
mtcars2<-mtcars
mtcars2[,1:10]
NEGATIVE SELECTION METHODS EXAMPLES: snegative indexing
mtcars[, -which(names(mtcars) == "carb")]
mtcars[, names(mtcars) != "carb"]
mtcars[, !names(mtcars) %in% c("carb")]
mtcars[, -match(c("carb"), names(mtcars))]
mtcars2<-mtcars;mtcars2$hp <- NULL
#--------------------------------
library(gdata)
#--------------------------------
remove.vars(mtcars2, names="mpg", info=TRUE)
remove.vars(mtcars2, names=c("wt","mpg"), info=TRUE)
Logic Testing & Coercion test null test na test missing test factor test numeric
is.numeric() is.factor() is.data.frame() is.character() is.vector() is.na() is.null()
EXAMPLE
airquality<-airquality[1:10,]
transform(airquality, Ozone = -Ozone)
transform(airquality, new = -Ozone, Temp = (Temp-32)/1.8)
#Notice that if a variable is unknown a new variable is created
attach(airquality)
transform(Ozone, logOzone = log(Ozone))
#EXAMPLES
with(mtcars,which(hp%%2 == 1))
with(mtcars,which(hp%%2 == 0))
with(mtcars,sapply(mtcars,even<-function(x){which(x%%2 == 0)}))#apply to data frame
#EXAMPLES
mtcars[c(T,F)] #every odd column
mtcars[c(F,T)] #every even column
mtcars[c(T,F), ] #odd row
mtcars[c(F,T), ] #even row
Logic Testing on a Whole Data Set
str(dataset)
Coerce the vectors using:
as.numeric(y) as.factor(y) as.data.frame(y) as.character(y) as.vector(y)
Matching smatching
match(x, y)
Finding Where Vectors (variables) are the Same and Different
union(x, y); intersect(x, y); setdiff(x, y); setdiff(y,x); setequal(x, y); is.element(x, y)
Matching Extenders
%w/o% x without y (same as setdiff)
%IN% x and y overlap (same as intersect)
Examples:
str(mtcars)
str(CO2)
Method 2: TEST A WHOLE DATA SET WITH: library(gdata) is.what()
Example:
library(gdata)
sapply(mtcars,is.what)
Examples
"%w/o%" <- function(x, y) x[!x %in% y] #-- x without y
(1:10) %w/o% c(3,7,12)
[1] 1 2 4 5 6 8 9 10
(1:10) %in% c(3,7,12)
[1] FALSE FALSE TRUE FALSE FALSE FALSE TRUE FALSE FALSE FALSE
"%IN%" <- function(x, y) x[x %in% y] #-- x and y
(1:10) %IN% c(3,7,12)
[1] 3 7
x <- c(sort(sample(1:20, 9)),NA)
y <- c(sort(sample(3:23, 7)),NA)
x;y
length(x);length(y)
union(x, y) ; length(union(x, y)) #combine and remove duplicates
intersect(x, y)#duplicates of the 2 lists
setdiff(x, y)#what x has that y does not
setdiff(y, x) #what y has that x does not
setequal(x, y) #is set x the same as set y
a<-union(x,y)
b<-c(setdiff(x,y), intersect(x,y), setdiff(y,x))
setequal(a,b);sort(a);sort(b)
is.element(x, y)# what elements of set x are the same as set y
is.element(y, x)# what elements of set y are the same as set x
cat("\b%in% IS THE SAME AS is.element()\n")
a%in%b;x%in%y;y%in%x
union: combine and remove duplicates
intersect: duplicates of the 2 lists
setdiff: what x has that y does not
setequal: is set x the same as set y
is.element or %in%: what elements of set x are the same as set y
Intersect multiple vectors See Overlap in User Defined Functions
Reduce(intersect, list(...))
a <- c(1,3,5,7,9)
b <- c(3,6,8,9,10)
c <- c(2,3,4,5,7,9)
Reduce(intersect, list(a,b,c))
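What Reduce() is doing in the call above: it applies intersect() to the first two vectors, then intersects that result with the next vector, and so on. A sketch that spells the steps out (the names v1, v2, v3 are mine, holding the same values as a, b, c):

```r
# Reduce(intersect, ...) is a chain of pairwise intersections
v1 <- c(1,3,5,7,9)
v2 <- c(3,6,8,9,10)
v3 <- c(2,3,4,5,7,9)
by_reduce <- Reduce(intersect, list(v1, v2, v3))
by_hand   <- intersect(intersect(v1, v2), v3)
steps     <- Reduce(intersect, list(v1, v2, v3), accumulate = TRUE)  # keeps each intermediate result
by_reduce
steps
```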
Find Rows Not Shared by Two Nested Data Sets
x[! data.frame(t(x)) %in% data.frame(t(y)), ] #Where x is the full data set and y is the nested data set
A <- mtcars
B <- subset(mtcars, cyl==6)
A[! data.frame(t(A)) %in% data.frame(t(B)), ]
Recoding Variables method 1 recode variables recode columns
levels(variable)<-c("new names") #This has one to one correspondence
levels(variable)<-list(new1=c("A","C") etc…) #This can combine levels
Recoding Variables method 2
library(doBy)
recodeVar(x, src=c(), tgt=c(), default=NULL, keep.na=TRUE)
Cut Points (Chop a numeric variable into a factor) [NOT RECOMMENDED]
cut(x, breaks, labels = NULL, include.lowest = F, right = T, dig.lab = 3, ordered_result = F)
e30<-read.table("e30.csv", header=TRUE, sep=",",na.strings="NA")
library(doBy)
AIDE<-recodeVar(e30$aide,src=c(0,1,NA),tgt=c("YES","NO",NA))
CLASS.TP<-recodeVar(e30$cl.type,src=c(1,2,3,NA),tgt=c("AM","PM","FULL",NA))
CLASS.BH.SPR<-recodeVar(e30$cl.behav.spr,src=list(c(1:3),c(4,5),NA),tgt=c("POOR","GOOD",NA))
#The line above is a complete recoding of a variable with cut points
DDD<-data.frame(AIDE,CLASS.TP,CLASS.BH.SPR)
DDD[620:630,]
NAhunter(DDD)
EXAMPLE
aaa <- c(1.2,2.2,3,4.1,.7,2,pi,4,5.3434343344,6.245,pi/3)
cut(aaa, 3)
cut(aaa, 3, dig.lab=4, ordered = TRUE)
cut(aaa, 3, labels=c("low","medium","high"), ordered = T)
(BBB<-cut(aaa, 3, labels=c("low","medium","high"), ordered = F))
(DF<-data.frame("OBS"=LETTERS[1:11],"LEVEL"=BBB,"NUM.LVL"=aaa))
try(round(DF,digits=2))#Errors: can't round with factors so...
DF2<-DF
DF$NUM.LVL<-round(DF$NUM.LVL,digits=2)
list("METHOD 1"=format(DF2,digits=3),"METHOD 2"=DF,"USING"="DF$NUM.LVL<-round(DF$NUM.LVL,digits=2)" )
mtcars
(mpg.rating<-with(mtcars,cut(mpg,3)))
levels(mpg.rating)<-c("low","medium","high")
(mtcars2<-data.frame(mtcars,mpg.rating))
mtcars[with(mtcars2,which(mpg.rating=="high")),]
table(mpg.rating)
mtcars$HPcut<-cut(mtcars$hp, breaks=c(0,66,110,150,335),
labels=c("low","medium","high", "super"), include.lowest=TRUE, right=FALSE)
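How right and include.lowest decide which interval a boundary value lands in is the part of cut() that is easiest to get wrong. A sketch with three values sitting exactly on the breaks (the vector and names are mine):

```r
# Boundary handling in cut(): which side of each interval is closed
v <- c(0, 5, 10)
brk <- c(0, 5, 10)
default_cut <- as.character(cut(v, breaks = brk))                        # intervals (a,b]
lowest_cut  <- as.character(cut(v, breaks = brk, include.lowest = TRUE)) # first interval becomes [a,b]
left_cut    <- as.character(cut(v, breaks = brk, right = FALSE))         # intervals [a,b)
both_cut    <- as.character(cut(v, breaks = brk, right = FALSE, include.lowest = TRUE))
default_cut   # NA       "(0,5]"  "(5,10]"
lowest_cut    # "[0,5]"  "[0,5]"  "(5,10]"
left_cut      # "[0,5)"  "[5,10)" NA
both_cut      # "[0,5)"  "[5,10]" "[5,10]"
```

With right=FALSE, include.lowest=TRUE closes the last interval instead of the first, which is why the HPcut example above keeps the maximum hp value.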
EXAMPLE
InsectSprays2<-InsectSprays
levels(InsectSprays2$spray)
levels(InsectSprays2$spray)<-list(new1=c("A","C"),YEPS=c("B","D","E"),LASTLY="F")
levels(InsectSprays2$spray)
InsectSprays2
Relevel a factor method 1 [reorder factor groups] (& recode numeric to factor [see cut points for more on this])
Relevel factor Quick Reference
set.seed(12)
z <-factor(sample(LETTERS[1:5], 10, T));z
factor(z, levels=c("C", "D", "A", "B")) #the releveling (values not listed in levels become NA)
Extra Reference
dataset$factorgroup <- factor(dataset$factorgroup, levels = c("c","a","b"),ordered=is.ordered(dataset$factorgroup))
Relevel a factor method 2 (order 1 group; kinda junky)
relevel(x, ref, ...) # only places one group at the front (limited)
Drop unused levels 1 drop factor levels
droplevels(x)
x is a factor vector/dataframe with factors
Drop unused levels 2 drop factor levels
x <- factor(x)
EXAMPLE
warpbreaks$tension
warpbreaks$tension2 <- relevel(warpbreaks$tension, ref="M")
warpbreaks$tension2
mtcars2<-mtcars
mtcars2$carb[mtcars2$carb==4]<-NA
mtcars2
mtcars2$carb[is.na(mtcars2$carb)]<-4
mtcars2
mtcars2$carb[mtcars2$carb<=4&mtcars2$carb>=3]<-"med"
mtcars2$carb[mtcars2$carb<=2&mtcars2$carb>=1]<-"low"
mtcars2$carb[mtcars2$carb<=8&mtcars2$carb>=6]<-"high"
mtcars2
with(mtcars2,mtcars2[order(carb),])
mtcars2$carb <-as.factor(mtcars2$carb)
levels(mtcars2$carb)
mtcars2$carb <- factor(mtcars2$carb, levels = c("low","med","high"))
mtcars2$carb
> set.seed(12)
> z <-factor(sample(LETTERS[1:5], 10, T));z
[1] A E E B A A A D A A
Levels: A B D E
> factor(z, levels=c("C", "D", "A", "B"))
[1] A <NA> <NA> B A A A D A A
Levels: C D A B
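A compact sketch tying together relevel() and the two "drop unused levels" methods listed above, on a small made-up factor (f and kept are hypothetical names):

```r
# Subsetting keeps unused levels; droplevels() or factor() removes them
f <- factor(c("low", "med", "high", "med"), levels = c("low", "med", "high"))
kept <- f[f != "high"]
levels(kept)                     # "high" is still listed although no value uses it
levels(droplevels(kept))         # drop method 1
levels(factor(kept))             # drop method 2
levels(relevel(f, ref = "high")) # relevel() only moves one level to the front
```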
Add an observation to a vector/column method 1
append(vector, new items to add, after=#)
Add an observation to a vector/column method 2 (preferred for its speediness)
EXAMPLE
mtcars2<-mtcars
mtcars2;mtcars2$mpg[3]
mtcars2$mpg[3]<-25
mtcars2
MPG<-mtcars2$mpg
MPG[40]<-25
MPG #[R] fills in the gap w/ NAs
Combine characters/numbers in one column (variable) with that of another
NOTE: you can do this in EXCEL using cell#& " " &cell# example: G1& " " &H1 (whatever is between the “” will be the divider of the characters)
paste(x,y,sep= " ")
x,y,z… are the variable characters/numbers to combine together. Whatever is between the " " will be the character separator.
Paste unknown number of columns
apply(x, 1, paste, collapse = ".")
apply(x, 1, function(x){if(any(is.na(x))){NA}else{paste(x, collapse = ".")}}) #if any NA returns NA
library(doBy)
x1<-recodeVar(sample(1:26,25,replace=T), src=1:26, tgt=letters, default=NULL, keep.na=TRUE)
y1<-recodeVar(sample(1:26,25,replace=T), src=1:26, tgt=LETTERS, default=NULL, keep.na=TRUE)
z1<-sample(1:26,25,replace=T)
merged.characters<-paste(x1,z1,y1,sep="")
data.frame(x1,z1,y1,merged.characters)
paste(x1,z1,y1,sep="-")#variation
EXAMPLE
old<-c(1:10)
[1] 1 2 3 4 5 6 7 8 9 10
new<-append(old,c(3,6,9),after=4)
[1] 1 2 3 4 3 6 9 5 6 7 8 9 10
#EXAMPLES
CO2[1,1] <- NA
x <- CO2[, 1:3]
y <- CO2[, 1:4]
apply(x, 1, paste, collapse = ".")
apply(x, 1, function(x){if(any(is.na(x))){NA}else{paste(x, collapse = ".")}})
#do.call METHOD
y <- as.list(CO2[1:3]) # make it a list
y$sep = "." # set our separator
do.call("paste", y)
Matrices & Data frames
The difference between a matrix and a data frame is that the matrix must have all the same type of data (e.g., numeric, character, etc.). A data frame may have mixed columns of data.
Turn a Vector Into a Matrix (Creating a Matrix)
EXAMPLE
b<-1:20
b
dim(b)<-c(4,5)
b
Change upper and lower triangle of matrix
lower.tri(x, diag = FALSE)
upper.tri(x, diag = FALSE)
ARGUMENTS:
x a matrix.
diag logical. Should the diagonal be included?
Splicing or gluing together Rows or Columns
rbind() cbind()
NOTE: rbind or cbind are slow functions for larger data sets. It is usually better to create a blank matrix first and then use indexing to put the information into the blank matrix.
EXAMPLE
#CREATE A CORRELATION MATRIX WITH THE LOWER TRIANGLE r^2 VALUES
CORmat<-cor(mtcars)
lower.tri(CORmat)
CORmat[lower.tri(CORmat)]<-CORmat[which(lower.tri(CORmat))]^2
Example
(aRow <- matrix(NA, ncol=18, nrow=49)) #create a matrix of NAs and then fill
aRow[1:44,1:8] <- as.vector(as.matrix(mtcars))
aRow[29:49,11:18] <- as.vector(as.numeric(as.matrix(CO2)))[253:420]
aRow #notice it's all numeric
aRow[49,1]<-"a"
aRow # changes matrix to character
Matrix Algebra
Operation              Notation     R code
Transpose              X' or X^T    t(X)
Diagonal                            diag(X)
Matrix Multiplication  XY           X %*% Y
Matrix Inverse         X^-1         solve(X)
Outer Product          XY'          X %o% Y
Column Means                        colMeans(X)   Returns a vector containing the column means of X
Cross Products         X'X          crossprod(X)
Cross Products         X'Y          crossprod(X,Y)
#Example of Regression Parameters With Matrix Algebra
#FORMULA: b = (X'X)^-1 (X'y)
#DATA
midterm <- c(5,7,7,7,9)
final <- c(4,5,6,8,10)
(SUM <- summary(lm(final~midterm)))
#==============================================
#ASSIGN DATA TO LETTERS TO FIT MATRIX NOTATION
x <- midterm
y <- final
#==============================================
#CONVERT VECTOR x TO THE DESIGN MATRIX X
#NOTE THE COLUMN OF ONES BEING ADDED IS
#FOR THE INTERCEPT PARAMETER
X <- as.matrix(c(rep(1,length(x)),x))
dim(X)<-c(5,2)
X
#NOW THE HEAVY LIFTING MADE EASY:
(b <- solve(crossprod(X))%*%crossprod(X,y))
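As a check on the formula b = (X'X)^-1 (X'y), the sketch below rebuilds the design matrix with `cbind()` and compares the result with `lm()`. Passing the right-hand side to `solve()` as a second argument solves the system (X'X) b = X'y directly, a common alternative to forming the inverse explicitly. The object name `fit` is illustrative.

```r
midterm <- c(5, 7, 7, 7, 9)
final   <- c(4, 5, 6, 8, 10)

X <- cbind(1, midterm)                         # column of ones for the intercept
b <- solve(crossprod(X), crossprod(X, final))  # solves (X'X) b = X'y

fit <- lm(final ~ midterm)
cbind(matrix_algebra = drop(b), lm = coef(fit))
```

For these data both approaches give an intercept of -3.9 and a slope of 1.5.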
Text and Character Strings
Combine multiple items into a single response
cat()
example:
Input cat("The date and time is",date(),"!","\n")
Output The date and time is Thu May 05 10:17:47 2011 !
paste()
Pretty print numbers with commas and import numbers with commas into R
prettyNum(string, big.mark=",", scientific=F)
as.numeric(gsub(",","", string))
Turn a character string into a formula
as.formula()
Example
First<-c("Greg","Sue","Sally");Last<-c("Smith","Collins","Peters")
Ages<-c(11,12,11);ClassRoster<-data.frame(Last,First,Ages)
ClassRoster
Students.FL<-paste(First,Last,sep=" ")
Students.LF<-paste(Last,First,sep=" ")
#============================================================================================================
paste(Students.FL,"is",Ages,"years","old.")
#............................................................................................................
#OUTPUT-->[1] "Greg Smith is 11 years old." "Sue Collins is 12 years old." "Sally Peters is 11 years old."
#============================================================================================================
cat(paste(Students.FL,"is",Ages,"years","old",collapse=",",sep=" "))
#............................................................................................................
#OUTPUT-->Greg Smith is 11 years old,Sue Collins is 12 years old,Sally Peters is 11 years old
test<-c("mpg", "cyl", "disp", "hp", "drat", "wt", "qsec", "vs", "am", "gear", "carb")
lm(as.formula(paste(test[1],"~", paste(test[-1],collapse="+"), sep="")), data=mtcars)
noquote(prettyNum(12345.678,big.mark=",",scientific=F))
x<-noquote(prettyNum(c(12345.678, 123154543, 32434343),big.mark=",",scientific=F))
as.numeric(gsub(",","", x))
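For the string-to-formula step, `reformulate()` from the stats package is an alternative to pasting the pieces by hand. The sketch below uses a shortened, made-up predictor list (`predictors`) to show that both routes describe the same model.

```r
predictors <- c("cyl", "disp", "hp")   # illustrative subset of mtcars columns

# Route 1: build the text, then convert it
f1 <- as.formula(paste("mpg ~", paste(predictors, collapse = " + ")))

# Route 2: let reformulate() assemble the formula
f2 <- reformulate(predictors, response = "mpg")

f1
f2
coef(lm(f2, data = mtcars))
```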
Special Escaped Characters
Using cat() with Quotes to manipulate text output
QUOTES
\n newline
\r carriage return
\t tab
\b backspace
\a alert (bell)
\f form feed
\v vertical tab
\\ backslash \
\' ASCII apostrophe ' OR sQuote(phrase)
\" ASCII quotation mark " OR dQuote(phrase)
Eliminate the "\" from strings
(test <- c("\\hi\\", "\n", "\t", "\\1", "\1", "\01", "\001"))
eval(parse(text=gsub("\\", "", deparse(test), fixed=TRUE)))
#INPUT
#[1] "\\hi\\" "\n" "\t" "\\1" "\001" "\001" "\001"
#OUTPUT
#[1] "hi" "n" "t" "1" "001" "001" "001"
EXAMPLES
cat("\a","Hello","\n","Hel","\blo","\'\tHELLO!\'","\"WElp!\"","\n",
"DELETE ME","","\r\"I DID DELETE YOU!\"","\n","BYE","\\Yeah Backslash",
"\n")
cat(LETTERS,"\n","\r")
Built in Character Strings constants
LETTERS
[1] "A" "B" "C" "D" "E" "F" "G" "H" "I" "J" "K" "L" "M"
[14] "N" "O" "P" "Q" "R" "S" "T" "U" "V" "W" "X" "Y" "Z"
letters
[1] "a" "b" "c" "d" "e" "f" "g" "h" "i" "j" "k" "l" "m"
[14] "n" "o" "p" "q" "r" "s" "t" "u" "v" "w" "x" "y" "z"
month.abb
[1] "Jan" "Feb" "Mar" "Apr" "May" "Jun" "Jul" "Aug" "Sep" "Oct" "Nov" "Dec"
month.name
[1] "January" "February" "March" "April" "May" "June" "July" "August" "September" "October" "November" "December"
state.name
state.abb
Remove quotes for printing
noquote(letters)
>[1] a b c d e f g h i j k l m n o p q r s t u v w x y z
cat(letters)
>a b c d e f g h i j k l m n o p q r s t u v w x y z
Number of letters per word in a character string
nchar(character string) yields the number of characters per word
example: pets<-c("chester","callie");nchar(pets)
Replacing Characters in String
gsub(pattern, replacement, x, ignore.case = FALSE, perl = FALSE, fixed = FALSE, useBytes = FALSE)
Arguments
pattern character string containing a regular expression (or character string for fixed = TRUE) to be matched in the
given character vector. Coerced by as.character to a character string if possible. If a character vector of length 2
or more is supplied, the first element is used with a warning. Missing values are allowed except
for regexpr and gregexpr.
x, text a character vector where matches are sought, or an object which can be coerced by as.character to a
character vector.
ignore.case if FALSE, the pattern matching is case sensitive and if TRUE, case is ignored during matching.
perl logical. Should perl-compatible regexps be used? Has priority over extended.
value if FALSE, a vector containing the (integer) indices of the matches determined by grep is returned, and
if TRUE, a vector containing the matching elements themselves is returned.
fixed logical. If TRUE, pattern is a string to be matched as is. Overrides all conflicting arguments.
useBytes logical. If TRUE the matching is done byte-by-byte rather than character-by-character. See ‘Details’.
invert logical. If TRUE return indices or values for elements that do not match.
replacement a replacement for matched pattern in sub and gsub. Coerced to character if possible. For fixed =
FALSE this can include backreferences "\1" to "\9" to parenthesized subexpressions of pattern.
For perl = TRUE only, it can also contain "\U" or "\L" to convert the rest of the replacement to upper or
lower case and "\E" to end case conversion. If a character vector of length 2 or more is supplied, the first
element is used with a warning. If NA, all elements in the result corresponding to matches will be set to NA.
Each of these functions (apart from regexec, which currently does not support Perl-
style regular expressions) operates in one of three modes:
1. fixed = TRUE: use exact matching.
2. perl = TRUE: use Perl-style regular expressions.
3. fixed = FALSE, perl = FALSE: use POSIX 1003.2 extended regular expressions.
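The three modes can be compared on one small input. This is a sketch with a made-up vector `x`; the same pattern string behaves differently depending on `fixed` and `perl`.

```r
x <- c("a.c", "abc")

gsub(".", "-", x, fixed = TRUE)            # exact matching: only the literal dot
gsub(".", "-", x)                          # regex: "." matches any character
gsub("\\.", "-", x)                        # regex with the dot escaped
gsub("(\\w+)", "\\U\\1", x, perl = TRUE)   # \\U is available only with perl = TRUE
```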
EXAMPLE 2
text.ex<-c("hat","coat","gloves","shirt","pants")
gsub("h","H",text.ex)
gsub("^.","A",text.ex)
gsub("(\\w)(\\w*)","\\U\\1\\L\\2",text.ex,perl=T)
gsub("(\\w*)","\\U\\1",text.ex,perl=T)
gsub("(\\w)(\\w*)(\\w)", "\\U\\1\\E\\2\\U\\3", text.ex, perl=T)
Output
> gsub("^.","\\*b",text.ex)
[1] "*bat" "*boat" "*bloves" "*bhirt" "*bants"
> gsub("h","H",text.ex)
[1] "Hat" "coat" "gloves" "sHirt" "pants"
> gsub("^.","A",text.ex)
[1] "Aat" "Aoat" "Aloves" "Ahirt" "Aants"
> gsub("(\\w)(\\w*)","\\U\\1\\L\\2",text.ex,perl=T)
[1] "Hat" "Coat" "Gloves" "Shirt" "Pants"
> gsub("(\\w*)","\\U\\1",text.ex,perl=T)
[1] "HAT" "COAT" "GLOVES" "SHIRT" "PANTS"
> gsub("(\\w)(\\w*)(\\w)", "\\U\\1\\E\\2\\U\\3", text.ex, perl=TRUE)
[1] "HaT" "CoaT" "GloveS" "ShirT" "PantS"
x<-"ATextIWantToDisplayWithSpaces" #Split on capital letters
gsub('([[:upper:]])', ' \\1', x)
> gsub('([[:upper:]])', ' \\1', x)
[1] " A Text I Want To Display With Spaces"
x<-"I like it...What the... Oh I see it." #replace "..." with "."
gsub(pattern = "\\.\\.\\.", replacement = ".", x) #or
gsub(pattern = "\\.+", replacement = ".", x) #better and more flexible
> gsub(pattern = "\\.\\.\\.", replacement = ".", x)
[1] "I like it.What the. Oh I see it."
EXAMPLE 1 input
a <- c("foo_5h", "bar_7")
gsub(".*_", "", a)
b <- c("xtfo_oin5hl", "6b_arin7", "xin7")
gsub("in.", "", b)
gsub("t.*l", "HERE", b)
gsub("^([a-zA-Z]in)", "INSERT", b)
d <- c("xtfo_oin5h;lx", "6b_arin;7", "xin;7")
gsub("t.+?l", "HERE", d)
gsub("[a-zA-Z].+?l", "HERE", d)
gsub("[a-zA-Z].+?;", "HERE", d)
gsub("_.+?;", "HERE", d)
e <- c("Dog foo_5h dog bar_7 doGs God")
gsub("\\bdog\\b", "HERE", e)
gsub("\\bdog.\\b", "HERE", e)
gsub("[^a-zA-Z0-9]", "", e)
gsub("\\b[dD][oO][Gg].\\b", " ", e)
gsub("\\b[dD][oO][Gg]\\b", " ", e)
EXAMPLE 1 outcome
> a <- c("foo_5h", "bar_7")
> gsub(".*_", "", a)
[1] "5h" "7"
>
> b <- c("xtfo_oin5hl", "6b_arin7", "xin7")
> gsub("in.", "", b)
[1] "xtfo_ohl" "6b_ar" "x"
> gsub("t.*l", "HERE", b)
[1] "xHERE" "6b_arin7" "xin7"
> gsub("^([a-zA-Z]in)", "INSERT", b)
[1] "xtfo_oin5hl" "6b_arin7" "INSERT7"
>
> d <- c("xtfo_oin5h;lx", "6b_arin;7", "xin;7")
> gsub("t.+?l", "HERE", d)
[1] "xHEREx" "6b_arin;7" "xin;7"
> gsub("[a-zA-Z].+?l", "HERE", d)
[1] "HEREx" "6b_arin;7" "xin;7"
> gsub("[a-zA-Z].+?;", "HERE", d)
[1] "HERElx" "6HERE7" "HERE7"
> gsub("_.+?;", "HERE", d)
[1] "xtfoHERElx" "6bHERE7" "xin;7"
>
> e <- c("Dog foo_5h dog bar_7 doGs God")
> gsub("\\bdog\\b", "HERE", e)
[1] "Dog foo_5h HERE bar_7 doGs God"
> gsub("\\bdog.\\b", "HERE", e)
[1] "Dog foo_5h HEREbar_7 doGs God"
> gsub("[^a-zA-Z0-9]", "", e)
[1] "Dogfoo5hdogbar7doGsGod"
> gsub("\\b[dD][oO][Gg].\\b", " ", e)
[1] " foo_5h bar_7 God"
> gsub("\\b[dD][oO][Gg]\\b", " ", e)
[1] " foo_5h bar_7 doGs God"
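Several of the patterns above differ only in whether the quantifier is greedy (`.*`, `.+`) or lazy (`.*?`, `.+?`). A minimal sketch on a made-up string `d2` shows the difference; `perl = TRUE` is used here so the lazy quantifier follows Perl semantics.

```r
d2 <- "tal tol"

gsub("t.*l",  "X", d2, perl = TRUE)   # greedy: runs from the first t to the last l
gsub("t.*?l", "X", d2, perl = TRUE)   # lazy: stops at the nearest l each time
```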
Match & replace from here to blank
Replacing Certain Occurrences
Find and remove space and/or numeric occurrences
#EXAMPLE 1
#=========
data <- c("Flagstaff 2", "Los Angeles 23", "Cleveland 29", "Cleveland 29", "Seattle 22")
gsub("\\s*\\d*$", "", data)
[1] "Flagstaff" "Los Angeles" "Cleveland" "Cleveland" "Seattle"
#EXAMPLE 2
#=========
x <- "the dog ate his \n food"
gsub("[^o h \n]", "", x)
gsub("[^o h \n]|\\s+", "", x)
> gsub("[^o h \n]", "", x)
[1] "h o h \n oo"
> gsub("[^o h \n]|\\s+", "", x)
[1] "hohoo"
Find Consecutive Occurrences
mystring <- c(1, 2, 3, "toot", "tooooot", "good", "apple", "banana", "frrr")
mystring[!grepl("(.)\\1{2,}", mystring)]
mystring[!grepl("(.)\\1{1,}", mystring)]
gsub("(.)\\1{2,}", "HELLO", mystring)
## > mystring[!grepl("(.)\\1{2,}", mystring)]
## [1] "1" "2" "3" "toot" "good" "apple" "banana"
## > mystring[!grepl("(.)\\1{1,}", mystring)]
## [1] "1" "2" "3" "banana"
## > gsub("(.)\\1{2,}", "HELLO", mystring)
## [1] "1" "2" "3" "toot" "tHELLOt" "good" "apple" "banana" "fHELLO"
string <- c('sta_+1+0_field2ndtry_0000$01.cfg' , 'sta_+B+0_field2ndtry_0000$01.cfg' ,
'sta_+1+0_field2ndtry_0000$01.cfg' , 'sta_+9+0_field2ndtry_0000$01.cfg')
sapply(1:length(string), function(i)gsub("\\+(.*)\\+.", paste("\\+\\1\\+", i, sep=""),
string[i]))
> string
[1] "sta_+1+0_field2ndtry_0000$01.cfg" "sta_+B+0_field2ndtry_0000$01.cfg"
[3] "sta_+1+0_field2ndtry_0000$01.cfg" "sta_+9+0_field2ndtry_0000$01.cfg"
> sapply(1:length(string), function(i)gsub("\\+(.*)\\+.", paste("\\+\\1\\+", i, sep=""),
string[i]))
[1] "sta_+1+1_field2ndtry_0000$01.cfg" "sta_+B+2_field2ndtry_0000$01.cfg"
[3] "sta_+1+3_field2ndtry_0000$01.cfg" "sta_+9+4_field2ndtry_0000$01.cfg"
Find Location of Chunks within a String(s)
Find a pattern in a string or a vector of strings
gregexpr(pattern, text, ignore.case = FALSE, perl = FALSE, fixed = FALSE, useBytes = FALSE)
gregexpr("es", "Testes")
c(gregexpr("es", "Test")[[1]])
c(gregexpr("es", "Testes")[[1]])
c(gregexpr("es", "Testes establishes esteem")[[1]])
gregexpr("es", c("Testes", "dog", 6, "esteem")) #vector of strings
Find a pattern in a vector of strings
Gives Location
grep(pattern, x, ignore.case = FALSE, perl = FALSE, value = FALSE, fixed = FALSE, useBytes = FALSE, invert = FALSE)
Gives Logical TRUE/FALSE
grepl(pattern, x, ignore.case = FALSE, perl = FALSE, fixed = FALSE, useBytes = FALSE)
grep("es", c("Testes", "dog", 6, "esteem"))
#[1] 1 4
grepl("es", c("Testes", "dog", 6, "esteem"))
#[1] TRUE FALSE FALSE TRUE
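To go from match positions to the matched text itself, `regmatches()` can be combined with `gregexpr()`. The sketch below uses a shortened version of the vector above; `m`, `hits` and `counts` are illustrative names.

```r
x <- c("Testes", "dog", "esteem")

m <- gregexpr("es", x)       # start positions per string (-1 means no match)
hits <- regmatches(x, m)     # the matched substrings, one list element per string
hits

counts <- sapply(hits, length)   # number of matches in each string
counts
```

Counting via `regmatches()` avoids the -1 placeholder that `gregexpr()` returns for strings without a match.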
String Splitting
Split on first space
(x<-rownames(mtcars))
rexp <- "^(\\w+)\\s?(.*)$"
sub(rexp,"\\1",x)
sub(rexp,"\\2",x)
data.frame(COM=x, MANUF=sub(rexp,"\\1",x), MAKE=sub(rexp,"\\2",x))
#OUTPUT (rows 27 to 32)
              COM    MANUF      MAKE
27  Porsche 914-2  Porsche     914-2
28   Lotus Europa    Lotus    Europa
29 Ford Pantera L     Ford Pantera L
30   Ferrari Dino  Ferrari      Dino
31  Maserati Bora Maserati      Bora
32     Volvo 142E    Volvo      142E
#Also could have been solved:
#METHOD 2
mat <- do.call("rbind", strsplit(sub(" ", ";", x), ";"))
colnames(mat) <- c("MANUF", "MAKE")
#METHOD 3
library(reshape2)
y <- reshape2::colsplit(x," ",c("MANUF","MAKE"))
tail(y)
#METHOD 4
library(stringr)
split_x <- str_split(x, " ", 2)
y <- data.frame(
MANUF = sapply(split_x, head, n = 1),
MAKE = sapply(split_x, tail, n = 1)
)
tail(y)
str <- c("George W. Bush", "Lyndon B. Johnson")
gsub("([A-Z])[.]?", "\\1", str)
sub(" .*", "", str)
sub("\\s\\w+$", "", str)
sub(".*\\s(\\w+$)", "\\1", str)
str <- c("&George W. Bush", "Lyndon B. Johnson?")
gsub("[^[:alnum:][:space:].]", "", str)
> str <- c("&George W. Bush", "Lyndon B. Johnson?")
> gsub("[^[:alnum:][:space:].]", "", str)
[1] "George W. Bush" "Lyndon B. Johnson"
> str <- c("George W. Bush", "Lyndon B. Johnson")
> gsub("([A-Z])[.]?", "\\1", str)
[1] "George W Bush" "Lyndon B Johnson"
> sub(" .*", "", str)
[1] "George" "Lyndon"
> sub("\\s\\w+$", "", str)
[1] "George W." "Lyndon B."
> sub(".*\\s(\\w+$)", "\\1", str)
[1] "Bush" "Johnson"
string<- factor(c("California CA", "New York NY", "Georgia GA"))
#Cheesy Method
string <- gsub(" +", " ", string)
sapply(string, function(x) substring(x, 1, nchar(x)-3)) #or
unlist(lapply(string, function(x) substring(x, 1, nchar(x)-3)))
#or
#sub Method (BETTER!)
sub("[[:space:]]*..$", "", string)
#OUTPUT
[1] "California" "New York" "Georgia"
Split on first comma
y <- c("Here's comma 1, and 2, see?", "Here's 2nd sting, like it, not a lot.")
#Method 1
XX <- "SoMeThInGrIdIcUlOuS"
LIST <- strsplit(sub(",\\s*", XX, y), XX)
LIST2 <- lapply(LIST, function(x) data.frame('x'=c(x[1]), 'z'=c(x[2])))
do.call('rbind', LIST2)
#Method 2
y2 <- strsplit(y, ",")
LIST <- sapply(seq_along(y2), function(i) data.frame(x= y2[[i]][1],
z=paste(y2[[i]][-1], collapse=" ")), simplify=F)
do.call('rbind', LIST)
#Method 3
library(reshape2)
colsplit(y, ",", c("x","z"))
Split on first underscore character
library(reshape2)
my_var_1 <- factor(c("x00_aaa_123","x00_bbb_123","x00_ccc_123",
"x01_aaa_123","x01_bbb_123","x01_ccc_123","x02_aaa_123","x02_bbb_123","x02_ccc_123"))
colsplit(my_var_1, "_", c("x","whatever"))
    x whatever
1 x00  aaa_123
2 x00  bbb_123
3 x00  ccc_123
4 x01  aaa_123
5 x01  bbb_123
6 x01  ccc_123
7 x02  aaa_123
8 x02  bbb_123
9 x02  ccc_123
Piece Grabbing
Grab part
x <- c(
"ELOVL7",
"ELP2",
"EMC1 (includes EG:23065)",
"EPT1 (includes EG:28042)",
"ZEB1 (includes EG:29009)"
)
gsub("(.*)\\s+\\(.*\\)", "\\1", x)
Test and grab certain occurrences (e.g. beginning with abc and ending with some number)
#example 1
#==========
s <- c('abc1', 'abc2', 'abc3', 'abc11', 'abc12',
'abcde1', 'abcde2', 'abcde3', 'abcde11', 'abcde12',
'nonsense')
s[grepl("abc.*(3|11|12)", s)]
s[grepl("^abc", s) & grepl("(3|11|12)$", s)]
#^ anchors the match to the start of the string and $ to the end; ^ negates only inside [ ] (the 2nd one is more interpretable)
> s[grepl("abc.*(3|11|12)", s)]
[1] "abc3" "abc11" "abc12" "abcde3" "abcde11" "abcde12"
> s[grepl("^abc", s) & grepl("(3|11|12)$", s)]
[1] "abc3" "abc11" "abc12" "abcde3" "abcde11" "abcde12"
#example 2
#==========
x <- c("fcer cgr tr cg g.", "gce tgv te ger refxre,c3rfc rf3rcf3rfr?")
x[grepl("^[[:alpha:]]", x) & grepl("(\\?)$", x)]
x[grepl("^[[:alpha:]]", x) & grepl("(\\?|\\.)$", x)]
> x[grepl("^[[:alpha:]]", x) & grepl("(\\?)$", x)]
[1] "gce tgv te ger refxre,c3rfc rf3rcf3rfr?"
> x[grepl("^[[:alpha:]]", x) & grepl("(\\?|\\.)$", x)]
[1] "fcer cgr tr cg g."
[2] "gce tgv te ger refxre,c3rfc rf3rcf3rfr?"
Grab everything except last word
df1 <- structure(list(id = c(1, 2, 3), city = structure(c(2L, 3L, 1L
), .Label = c("Hillside Village", "Middletown Township", "Sunny Valley Borough"
), class = "factor")), .Names = c("id", "city"), row.names = c(NA,
-3L), class = "data.frame")
gsub("\\s*\\w*$", "", df1$city)
> gsub("\\s*\\w*$", "", df1$city)
[1] "Middletown" "Sunny Valley" "Hillside"
Split apart by chunks
test <- "abc123def"
x <- gsub("([0-9]+)","~\\1~", test)
strsplit(x, "~")
#or in one step
strsplit(gsub("([0-9]+)","~\\1~", test), "~")
[[1]]
[1] "abc" "123" "def"
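An alternative sketch that skips the temporary "~" marker: match runs of digits or runs of non-digits directly and pull them out with `regmatches()`. The helper name `split_chunks` is made up for illustration. One possible advantage is that no empty first element appears when the string starts with digits.

```r
# Hypothetical helper: split a string into digit and non-digit chunks
split_chunks <- function(s) {
  regmatches(s, gregexpr("[0-9]+|[^0-9]+", s))[[1]]
}

split_chunks("abc123def")
split_chunks("123abc")
```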
Punctuation
Delete all punctuation except…
#EXAMPLE
x <- "I like %$@to*&, chew;: gum, but don't like|}{[] bubble@#^)( gum!?"
#METHODS FOR SUBBING OUT ALL PUNCTUATION EXCEPT APOSTROPHES
gsub("[^[:alnum:][:space:]'\"]", "", x) #METHOD 1
gsub(".*?($|'|[^[:punct:]]).*?", "\\1", x) #METHOD 2
gsub("(.*?)($|'|[^[:punct:]]+?)(.*?)", "\\2", x) #METHOD 3
#EXTENDING METHOD 1 TO SUB OUT EVERYTHING EXCEPT APOSTROPHES AND SEMICOLONS
gsub("[^[:alnum:][:space:]';\"]", "", x, perl=T)
Capitalization
#Capitalize the first letter of a word
capitalize <- function(x) {
simpleCap <- function(x) {
s <- strsplit(x, " ")[[1]]
paste(toupper(substring(s, 1,1)), substring(s, 2),
sep="", collapse=" ")
}
unlist(lapply(x, simpleCap))
}
x <- "i'll"
y <- "you"
z <- c("I'll", "go")
capitalize(x)
capitalize(y)
capitalize(z)
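A shorter, vectorized variant can be sketched with `gsub()` and the Perl `\\U` case conversion. The function name `cap_words` is hypothetical, and this version only upper-cases ASCII letters a-z that follow the start of the string or whitespace, so letters after an apostrophe are left alone.

```r
# Hypothetical helper: capitalize the first letter of each space-separated word
cap_words <- function(x) {
  gsub("(^|\\s)([a-z])", "\\1\\U\\2", x, perl = TRUE)
}

cap_words(c("i'll go", "you"))
```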
Capital Letters
toupper(string)
Lower Case Letters
tolower(string)
EXAMPLE
string<-toupper(paste("i do not know"," where the dog is.",sep=""))
cat(string,"\n",sep="")
string
tolower(string)
String Matching
Search a Vector for a Match (see my Search() function)
Exact Matches
grep(pattern, x, ignore.case = FALSE, perl = FALSE, value = FALSE,
fixed = FALSE, useBytes = FALSE, invert = FALSE)
Arguments:
Approximate Matches
agrep(pattern, x, ignore.case = FALSE, value = FALSE,
max.distance = 0.1, useBytes = FALSE)
Arguments:
str1 <- "This is a string, that I've written to ask about a question, or at least tried to."
animals <- c("mose", "dog", "cat", "gooberciluousrex")
animals[agrep("mouse", animals, max.distance = 0.01)] <- "cheese"
animals
animals[agrep("chese", animals)] <- "mouse"
animals
animals[agrep("goobercilsdeef", animals, max.distance = 0.01)] <- "duck"
animals
animals[agrep("goobercilsdeef", animals, max.distance = 0.29)] <- "duck"
animals
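To see why a given `max.distance` does or does not match, it can help to look at the edit distances directly. The sketch below uses `adist()` from the utils package, which computes the generalized Levenshtein distance; the object name `d` is illustrative.

```r
animals <- c("mose", "dog", "cat", "gooberciluousrex")

d <- adist("mouse", animals)   # 1 x 4 matrix of edit distances
d
animals[which.min(d)]          # closest candidate to "mouse"

adist("chese", "cheese")       # one insertion apart
```

A distance of 1 means a single insertion, deletion or substitution separates the two strings.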
My Search() function
Search for a term
str <- "BBSSHHSRBSBBS"
unlist(gregexpr("BS", str))
str2 <- "I can't stand know it all egg head scientists."
unlist(gregexpr("i can't", tolower(str2)))
term <- "egg head"
loc <- unlist(gregexpr(term, tolower(str2)))
substring(str2, loc, nchar(term)-1+loc)
Search for a term and Count Occurrences
str2 <- "ionisation should only be matched at the end of the word"
matched_commas <- gregexpr(",", str1, fixed = TRUE)
length(matched_commas[[1]])
matched_ion <- gregexpr("ion", str1, fixed = TRUE)
length(matched_ion[[1]])
length(gregexpr("ion\\b", str2, perl = TRUE)[[1]]) #[[1]] is needed; the list itself always has length 1
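One caveat with counting via `length()`: when nothing matches, `gregexpr()` returns -1, so the length is still 1. A minimal sketch of a counting helper that handles this case follows; the name `count_matches` is made up for illustration.

```r
# Hypothetical helper: count matches of a pattern in a single string
count_matches <- function(pattern, text, ...) {
  m <- gregexpr(pattern, text, ...)[[1]]
  sum(m > 0)   # gregexpr() returns -1 when there is no match
}

str1 <- "This is a string, that I've written to ask about a question, or at least tried to."
count_matches(",", str1, fixed = TRUE)
count_matches("ion", str1, fixed = TRUE)
count_matches("zebra", str1, fixed = TRUE)
```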
Search for Strings that contain a phrase
a <- c('This is a healthcare facility', 'this is a hospital',
'this is a hospital district', 'this is a district health service')
a[grepl("hospital", a) & !grepl("district", a)]
a[!grepl("district", a)]
Levenshtein distance between strings
pres <- c(" Obama, B.","Bush, G.W.","Obama, B.H.","Clinton, W.J.")
lapply(pres, agrep, pres, value = F)
lapply(pres, agrep, pres, value = T)
Search<-function(term,dataframe,column.name,variation=.02,...){
te<-substitute(term) #use " " for multi word terms
te<-as.character(te)
cn<-substitute(column.name)
cn<-as.character(cn)
HUNT<-agrep(te,dataframe[,cn],ignore.case =TRUE,max.distance=variation,...)
dataframe[c(HUNT),]
}
Test for occurrence in Columns
myfile <- read.table( text = '"G1" "G2"
SEP11 ABCC1
205772_s_at FMO2
214223_at ADAM19
ANK2 215742_at
COPS4 BIK
214808_at DCP1A
ACE ALG3
BAD 215369_at
EMP3 215385_at
CARD8 217579_x_at
', header = TRUE, stringsAsFactors = FALSE)
lapply( myfile,
function(column) grep( "_at$", column, invert = TRUE, value = TRUE )
)
lapply( myfile,
function(column) grep( "_at$", column, value = TRUE )
)
lapply( myfile,
function(column) grep( "_at$", column, invert = TRUE )
)
Insert characters between characters of a character string
See also "Insert a Vector of Character…"
x <- "output"
Method 1
y <- unlist(strsplit(x, NULL))
y <- paste(y, collapse="\n")
y
[1] "o\nu\nt\np\nu\nt"
cat(y) #prints one letter per line
Method 2
z <- gsub('(?<=.)(?=.)','\n', x, perl=TRUE)
z
[1] "o\nu\nt\np\nu\nt"
cat(z) #prints one letter per line
Insert a Vector of Character Strings Into Another Character String
a <- c("string", "factor")
sprintf("This is where a %s goes.", a)
> sprintf("This is where a %s goes.", a)
[1] "This is where a string goes." "This is where a factor goes."
Insert Vector(s) of Character Strings Into Another Character String
#paste method
n <- 10; a <- 1:n
paste0("p", a, "=", a)
#sprintf method
n <- 10; a <- 1:n
sprintf("p%d=%d", a, a)
Insert Trailing or Leading Spaces Easily
x <- c("I like", "good", "better than you")
sprintf("%8s", x) #Add leading space
sprintf("%-8s", x) #Add trailing space
> sprintf("%8s", x) #Add leading space
[1] " I like" " good" "better than you"
> sprintf("%-8s", x) #Add trailing space
[1] "I like " "good " "better than you"
Delete Trailing or Leading Spaces Easily
gsub("^\\s+", "", x) #leading spaces
gsub("\\s+$", "", x) #trailing spaces
Trim <- function (x) gsub("^\\s+|\\s+$", "", x)
Insert Leading Zeros
sprintf("%02d",c(1,2,3,45))
> sprintf("%02d",c(1,2,3,45))
[1] "01" "02" "03" "45"
> sprintf("%03d",c(1,2,3,45))
[1] "001" "002" "003" "045"
> sprintf("%010d",c(1,2,3,45))
[1] "0000000001" "0000000002" "0000000003" "0000000045"
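Base R also offers `formatC()` for padding and, in R 3.2.0 and later, `trimws()` for trimming. The sketch below is one possible alternative to the sprintf() and gsub() calls above; the vector `ids` is made up.

```r
ids <- c(1L, 2L, 3L, 45L)

formatC(ids, width = 3, format = "d", flag = "0")   # leading zeros
formatC("good", width = 8)                          # leading spaces
formatC("good", width = 8, flag = "-")              # trailing spaces
trimws("  padded  ")                                # strip both ends
```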
Reverse character strings
#USAGE: reverse(string)
#Define the function before calling it
reverse <- function(string) {
strReverse <- function(x) sapply(lapply(strsplit(x, NULL),
rev), paste, collapse = "")
if (is.numeric(string)) {
strReverse(as.character(string))
} else {
strReverse(string)
}
}
reverse("this is a string")
strings1 <- c(123,4212,234567)
reverse(strings1)
strings2 <- c("retsnomerom","was","retar")
reverse(strings2)
Select portions of a character string (parts of a character string)
substr(text,start point,end point)
EXAMPLES
substr("abcdefghi",2,5)
substr("abcdefghi",1,8)
substring("Callie loves to chew bones!",8,20)
substring("Callie loves to chew bones!",28:1)
substring("Callie loves to chew bones!",1:28)
data.frame(cbind(substring("Callie loves to chew bones!",28:1),
substring("Callie loves to chew bones!",1:28)))
substr(rep("abcdef",4),1:4,4:5)
x <- c("asfef", "qwerty", "yuiop[", "b", "stuff.blah.yech")
substr(x, 2, 5)
substring(x, 2, 4:6)
substring(x, 2) <- c("..", "+++")
x
#USE TO PULL APART BEDS CODE
beds.numbs<-as.character(c(3452171,3452172,3462173,3452274,3452275,3462276,3462277,3452178,
3452189,3452080,3452081,3452082,3462083))
(Region<-substr(beds.numbs,1,3))#use this to recode regions
(District<-substr(beds.numbs,1,6))
DATS<-data.frame(beds.numbs,Region,District)
with(DATS,table(District))
with(DATS,ftable(DATS))
subset(DATS,Region=="346")
subset(DATS,District=="345208")
Create a diminishing list from a vector of names
PREDS<-c("gender", "g1freelunch", "g3tmathss", "g3treadss", "yearssmall",
"crap")
#===============================================
#method 1
getDiminishingList<-function(data){
ans <- list()
for(i in 1:length(data)){
ans[[i]] <- data[1:(length(data) - i + 1)]
}
ans
}
# Use function
getDiminishingList(PREDS)
getDiminishingList(1:10)
#===============================================
#method 2
getDiminishingList <- function(data){
n <- length(data)
tmpfunc <- function(i){
data[1:(length(data) - i + 1)]
}
return(apply(matrix(1:n), 1, tmpfunc))
}
# Use function
getDiminishingList(PREDS)
getDiminishingList(1:10)
Output
[[1]]
[1] "gender" "g1freelunch" "g3tmathss" "g3treadss" "yearssmall"
[6] "crap"
[[2]]
[1] "gender" "g1freelunch" "g3tmathss" "g3treadss" "yearssmall"
[[3]]
[1] "gender" "g1freelunch" "g3tmathss" "g3treadss"
[[4]]
[1] "gender" "g1freelunch" "g3tmathss"
[[5]]
[1] "gender" "g1freelunch"
[[6]]
[1] "gender"
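The same diminishing list can be sketched in one line with `lapply()` and `head()`. This is only an alternative formulation, not the method from the notes; `diminishing` is an illustrative name. `seq_along()` is used instead of `1:length()` so that an empty input yields an empty list.

```r
PREDS <- c("gender", "g1freelunch", "g3tmathss", "g3treadss", "yearssmall", "crap")

# Take the first n names for n = 6, 5, ..., 1
diminishing <- lapply(rev(seq_along(PREDS)), function(n) head(PREDS, n))

sapply(diminishing, length)
diminishing[[6]]
```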
Convert a Character String or Factor to Numeric
Method 1
y <- c( "OLDa", "ALL", "OLDc", "OLDa", "OLDb", "NEW", "OLDb", "OLDa", "ALL")
el <- c("OLDa", "OLDb", "OLDc", "NEW", "ALL")
match(y,el)
Method 2
f <- factor(y,levels=c("OLDa", "OLDb", "OLDc", "NEW", "ALL") )
as.integer(f)
Changing Variable (numeric vs. factor)
See Dummy Coding
Type:
y <- as.factor(y) changes the variable to factor
y <- as.numeric(y) changes the variable to numeric
Note 1: This function can be used to change a categorical variable into a numeric variable (useful for dummy coding) …Or recode as 0,1
Note 2: If you’ve renamed the variables in your data set you must use the as.numeric function with the original data set terms (data.set$variable.name) to turn the column in the actual data set numeric (see incorrect [A] vs. correct [B]).
[Screenshots A (incorrect) and B (correct) are not reproduced here.]
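A common pitfall with `as.numeric()` on a factor is that it returns the internal level codes, not the numbers shown in the labels. The sketch below uses a made-up factor `f` to show the difference and two ways to recover the printed values.

```r
f <- factor(c("10", "20", "10", "5"))

as.numeric(f)                 # level codes: 1 2 1 3
as.numeric(as.character(f))   # the values as printed: 10 20 10 5
as.numeric(levels(f))[f]      # same result, converting each level only once
```

The level codes are what is wanted for dummy coding; the second and third forms are for factors whose labels are really numbers.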
Dummy Coding a Factor method 1
User defined function; requires library(ade4)
dummy(dataframe)
Dummy Coding a Factor method 2
model.matrix(~factor-1)
#EXAMPLE
x <- c(2, 2, 5, 3, 6, 5, NA)
xf <- factor(x, levels = 2:6)
model.matrix( ~ xf - 1)
#EXAMPLE
dummy <- function(df) {
require(ade4)
ISFACT <- sapply(df, is.factor)
FACTS <- acm.disjonctif(df[, ISFACT, drop = FALSE])
NONFACTS <- df[, !ISFACT,drop = FALSE]
data.frame(NONFACTS, FACTS)
}
df <-data.frame(eggs = c("foo", "foo", "bar", "bar"),
ham = c("red","blue","green","red"), x=rnorm(4))
dummy(df)
Convert numbers to Roman numerals
as.roman(vector)
Convert Decimals to fractions
library(MASS)
fractions(x, cycles = 10, max.denominator = 2000, ...)
NOTE: This is Rational Approximation and may not be a true value of the decimal
EXAMPLES
as.roman(101) #converts to roman numeral
as.roman(c(101,23,67,92)) #vector
EXAMPLES
library(MASS)
fractions(.12)
fractions(pi)
Date & Time
Date & Time            date()
Date                   Sys.Date()
Time                   substr(as.character(Sys.time()),12,19)
Year                   substr(as.character(Sys.Date()),1,4)
Date/Time/Time Zone    Sys.time()
Extracting pieces from Sys.Date and Sys.time
format(Sys.time(), "%a %b %d %H:%M:%S %Y")
%a=weekday; %b=month; %d=day of the month; %H:%M:%S =hour:minute:second; %Y=year
Use of cat with "\n" gets rid of the quotes around the date (see final example)
cat(format(Sys.time(), "%a %b %d %H:%M:%S %Y"),"\n")
Import dates in various formats such as dd/mm/yyyy
as.Date(x, format = "")
EXAMPLE
format(Sys.Date(), format="%b %d %Y")
format(Sys.Date(), format="%a %b %d %Y")
format(Sys.time(), "%a %b %d %H:%M:%S %Y")
format(Sys.time(),"%H:%M")
dec1 <- as.Date("2004-12-1")
cat(format(dec1, format="%b %d %Y"),"\n")
#Notice how cat eliminates the quotes
EXAMPLE
dates <- c("02/27/92", "02/27/92", "01/14/92", "02/28/92", "02/01/92")
as.Date(dates, "%m/%d/%y")
dates <- c("02/27/1992", "02/27/1992", "01/14/1992", "02/28/1992", "02/01/1992")
as.Date(dates, "%m/%d/%Y")
Note: the package chron is good at handling dates and times
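For the dd/mm/yyyy case named in the heading, the format string simply swaps the day and month codes. A short sketch with made-up dates (`raw`, `d` are illustrative names); note that `%Y` is a four-digit year and `%y` a two-digit year.

```r
raw <- c("27/02/1992", "14/01/1992", "01/02/1992")

d <- as.Date(raw, format = "%d/%m/%Y")
d

format(d, "%Y-%m-%d")   # back to text in ISO order
format(d, "%d.%m.%y")   # two-digit year

as.Date("31/02/1992", format = "%d/%m/%Y")   # an impossible date gives NA
```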
Differences in Dates and Times
difftime(t1,t2)
Note: difftime returns t1 - t2, so put the later time in for t1 to get a positive difference
EXAMPLES
difftime("2005-10-21","1980-11-16")
as.numeric(difftime("2005-10-21","1980-11-16"))
difftime("2011-05-17 00:35:07","2002-9-11 8:46:40")
difftime(Sys.time(),"2002-9-11 8:46:40")
Output
> difftime("2005-10-21","1980-11-16")
Time difference of 9104.958 days
> as.numeric(difftime("2005-10-21","1980-11-16"))
[1] 9104.958
> difftime("2011-05-17 00:35:07","2002-9-11 8:46:40")
Time difference of 3169.659 days
> difftime(Sys.time(),"2002-9-11 8:46:40")
Time difference of 3169.661 days
Time and Date Sequence
seq.Date(from,to,by) by= "day", "week", "month" or "year"
Turn dates into Day of the Week
weekdays(x, abbreviate=FALSE)
Units Argument
difftime(time1, time2, tz, units = c("auto", "secs", "mins", "hours","days", "weeks"))
Can request answer be given in "auto", "secs", "mins", "hours","days", "weeks"
EXAMPLE
C<-seq.Date(as.Date("2010-10-10"),Sys.Date(),"week")
data.frame("OBS"=1:length(C),C)
df <- data.frame(date=c("2012-02-01", "2012-02-01", "2012-02-02"))
df$day <- weekdays(as.Date(df$date)) #turns dates to days of the week
df$day.ab <- weekdays(as.Date(df$date), TRUE)
> df
date day day.ab
1 2012-02-01 Wednesday Wed
2 2012-02-01 Wednesday Wed
3 2012-02-02 Thursday Thu
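A short sketch of the `units` argument of difftime(), using Date objects for the same two days as the earlier difftime example. With Date inputs the difference comes out as a whole number of days; the fractional 9104.958 shown earlier is probably due to the character inputs being converted to date-times in the local time zone, where a daylight-saving shift can remove an hour.

```r
t1 <- as.Date("2005-10-21")
t2 <- as.Date("1980-11-16")

difftime(t1, t2, units = "days")
difftime(t1, t2, units = "weeks")
as.numeric(t1 - t2)   # subtracting Date objects gives the same number of days
```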
Graphics
Open a second Graphics Window (universal)
dev.new()
Open a second Graphics Window (windows)
win.graph() or windows() or x11()
Open a second Graphics Window (mac)
quartz() or x11()
Open a second Graphics Window ready to plot (No need to call a plot before adding lines etc.)
plot.new() or frame() This enables you to add lines and text without an actual plot
Check the system for OS and return correct graphics device (2 methods)
#covers everything and is safe for other Graphics Devices
if (dev.interactive()) dev.new()
#covers only gui graphics device and is not safe for other Graphics Devices
if( .Platform$GUI %in% c("X11", "Tk") ) {
X11()
} else {
if ( .Platform$GUI == "AQUA" ){
quartz()
} else {
windows()
}
}
Control the Size of the Graph Window
windows(width=10, height=4) or win.graph(width=10, height=4) or x11(w=10,h=4)
NOTE: All will take just w or h or the specific order of w and then h as in: x11(10,4)
Pause Between Switching to Second Graph
par(ask=TRUE)
Multiple Graphs on one page
par(mfrow=c(2,3)) 2 is in the rows position and 3 is in the columns position
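A minimal sketch of a 2 x 3 layout with `par(mfrow = ...)`. To keep it independent of the operating system it draws into a PDF file rather than a screen window; the temporary file `out` and the choice of mtcars columns are assumptions made for the example. `par()` returns the previous settings, which makes it easy to restore them afterwards.

```r
out <- tempfile(fileext = ".pdf")    # hypothetical output file
pdf(out, width = 10, height = 4)     # size in inches, as with windows()/x11()

old_par <- par(mfrow = c(2, 3))      # 2 rows, 3 columns; keeps the old settings
for (v in c("mpg", "cyl", "disp", "hp", "drat", "wt")) {
  hist(mtcars[[v]], main = v, xlab = v)
}
par(old_par)                         # restore the previous layout

dev.off()
file.exists(out)
```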
Graphical output formats
Device Function
Screen/GUI Devices
x11() or X11()
windows()
File Devices
postscript(file="myplot.ps")
pdf(file="myplot.pdf")
pictex(file="myplot.tex")
bmp(file="myplot.bmp")
jpeg(file="myplot.jpeg")
Return Current Graphic device
dev.cur()
Turn Off Graphic Device
dev.off()
Turns Off all graphics Devices
graphics.off()
Copy the Current Graphics Device to a File
dev.copy(device=png, file="foo", width=500, height=300)
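The usual life cycle of a file device can be sketched as open, plot, close. The file is only guaranteed to be complete once `dev.off()` has been called. The temporary file name `out` and the plotted columns are assumptions for this example.

```r
out <- tempfile(fileext = ".pdf")   # hypothetical output file

pdf(file = out)                     # open a file device; it becomes current
dev_name <- names(dev.cur())        # name of the current device
dev_name

plot(mtcars$wt, mtcars$mpg, xlab = "wt", ylab = "mpg")
dev.off()                           # close the device and finish writing the file

file.exists(out)
```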
Par Function Arguments
adj The value of adj determines the way in which text strings are justified in text, mtext and title. A value of 0 produces left-justified text, 0.5 (the default) centered text and 1 right-justified text. (Any value in [0, 1] is allowed, and on most devices values outside that interval will also work.) Note that the adj argument of text also allows adj = c(x, y) for different adjustment in x- and y- directions. Note that whereas for text it refers to positioning of text about a point, for mtext and title it controls placement within the plot or device region.
ann If set to FALSE, high-level plotting functions calling plot.default do not annotate the plots they produce with axis titles and overall titles. The default is to do annotation.
ask logical. If TRUE (and the R session is interactive) the user is asked for input, before a new figure is drawn. As this applies to the device, it also affects output by packages grid and lattice. It can be set even on non-screen devices but may have no effect there. This is not really a graphics parameter, and its use is deprecated in favour of devAskNewPage.
bg The color to be used for the background of the device region. When called from par() it also sets new=FALSE. See section ‘Color Specification’ for suitable values. For many devices the initial value is set from the bg argument of the device, and for the rest it is normally "white". Note that some graphics functions such as plot.default and points have an argument of this name with a different meaning.
bty A character string which determines the type of box which is drawn about plots. If bty is one of "o" (the default), "l", "7", "c", "u", or "]" the resulting box resembles the corresponding upper case letter. A value of "n" suppresses the box.
cex A numerical value giving the amount by which plotting text and symbols should be magnified relative to the default. This starts as 1 when a device is opened, and is reset when the layout is changed, e.g. by setting mfrow. Note that some graphics functions such as plot.default have an argument of this name which multiplies this graphical parameter, and some functions such as points accept a vector of values which are recycled. Other uses will take just the first value if a vector of length greater than one is supplied.
cex.axis The magnification to be used for axis annotation relative to the current setting of cex.
cex.lab The magnification to be used for x and y labels relative to the current setting of cex.
cex.main The magnification to be used for main titles relative to the current setting of cex.
cex.sub The magnification to be used for sub-titles relative to the current setting of cex.
cin R.O.; character size (width, height) in inches. These are the same measurements as cra, expressed in different units.
col A specification for the default plotting color. See section ‘Color Specification’. (Some functions such as lines accept a vector of values which are recycled. Other uses will take just the first value if a vector of length greater than one is supplied.)
col.axis The color to be used for axis annotation. Defaults to "black".
col.lab The color to be used for x and y labels. Defaults to "black".
col.main The color to be used for plot main titles. Defaults to "black".
col.sub The color to be used for plot sub-titles. Defaults to "black".
cra R.O.; size of default character (width, height) in ‘rasters’ (pixels). Some devices have no concept of pixels and so assume an arbitrary pixel size, usually 1/72 inch. These are the same measurements as cin, expressed in different units.
crt A numerical value specifying (in degrees) how single characters should be rotated. It is unwise to expect values other than multiples of 90 to work. Compare with srt which does string rotation.
csi R.O.; height of (default-sized) characters in inches. The same as par("cin")[2].
cxy R.O.; size of default character (width, height) in user coordinate units. par("cxy") is par("cin")/par("pin") scaled to user coordinates. Note that c(strwidth(ch), strheight(ch)) for a given string ch is usually much more precise.
din R.O.; the device dimensions, (width,height), in inches.
err (Unimplemented; R is silent when points outside the plot region are not plotted.) The degree of error reporting desired.
family The name of a font family for drawing text. The maximum allowed length is 200 bytes. This name gets mapped by each graphics device to a device-specific font description. The default value is "" which means that the default device fonts will be used (and what those are should be listed on the help page for the device). Standard values are "serif", "sans" and "mono", and the Hershey font families are also available. (Different devices may define others, and some devices will ignore this setting completely.) This can be specified inline for text.
fg The color to be used for the foreground of plots. This is the default color used for things like axes and boxes around plots. When called from par() this also sets parameter col to the same value. See section ‘Color Specification’. A few devices have an argument to set the initial value, which is otherwise "black".
fig A numerical vector of the form c(x1, x2, y1, y2) which gives the (NDC) coordinates of the figure region in the display region of the device. If you set this, unlike S, you start a new plot, so to add to an existing plot use new=TRUE as well.
fin The figure region dimensions, (width,height), in inches. If you set this, unlike S, you start a new plot.
font An integer which specifies which font to use for text. If possible, device drivers arrange so that 1 corresponds to plain text (the default), 2 to bold face, 3 to italic and 4 to bold italic. Also, font 5 is expected to be the symbol font, in Adobe symbol encoding. On some devices font families can be selected by family to choose different sets of 5 fonts.
font.axis The font to be used for axis annotation.
font.lab The font to be used for x and y labels.
font.main The font to be used for plot main titles.
font.sub The font to be used for plot sub-titles.
lab A numerical vector of the form c(x, y, len) which modifies the default way that axes are annotated. The values of x and y give the (approximate) number of tickmarks on the x and y axes and len specifies the label length. The default is c(5, 5, 7). Note that this only affects the way the parameters xaxp and yaxp are set when the user coordinate system is set up, and is not consulted when axes are drawn. len is unimplemented in R.
las numeric in {0,1,2,3}; the style of axis labels. 0: always parallel to the axis [default], 1: always horizontal, 2: always perpendicular to the axis, 3: always vertical. Also supported by mtext. Note that string/character rotation via argument srt to par does not affect the axis labels.
lend The line end style. This can be specified as an integer or string: 0 and "round" mean rounded line caps [default]; 1 and "butt" mean butt line caps; 2 and "square" mean square line caps.
lheight The line height multiplier. The height of a line of text (used to vertically space multi-line text) is found by multiplying the character height both by the current character expansion and by the line height multiplier. Default value is 1. Used in text and strheight.
ljoin The line join style. This can be specified as an integer or string: 0 and "round" mean rounded line joins [default]; 1 and "mitre" mean mitred line joins; 2 and "bevel" mean bevelled line joins.
lmitre The line mitre limit. This controls when mitred line joins are automatically converted into bevelled line joins. The value must be larger than 1 and the default is 10. Not all devices will honour this setting.
lty The line type. Line types can either be specified as an integer (0=blank, 1=solid (default), 2=dashed, 3=dotted, 4=dotdash, 5=longdash, 6=twodash) or as one of the character strings "blank", "solid", "dashed", "dotted", "dotdash", "longdash", or "twodash", where "blank" uses ‘invisible lines’ (i.e., does not draw them). Alternatively, a string of up to 8 characters (from c(1:9, "A":"F")) may be given, giving the length of line segments which are alternately drawn and skipped. See section ‘Line Type Specification’. Some functions such as lines accept a vector of values which are recycled. Other uses will take just the first value if a vector of length greater than one is supplied.
lwd The line width, a positive number, defaulting to 1. The interpretation is device-specific, and some devices do not implement line widths less than one. (See the help on the device for details of the interpretation.) Some functions such as lines accept a vector of values which are recycled. Other uses will take just the first value if a vector of length greater than one is supplied.
mai A numerical vector of the form c(bottom, left, top, right) which gives the margin size specified in inches.
mar A numerical vector of the form c(bottom, left, top, right) which gives the number of lines of margin to be specified on the four sides of the plot. The default is c(5, 4, 4, 2) + 0.1.
mex mex is a character size expansion factor which is used to describe coordinates in the margins of plots. Note that this does not change the font size, rather specifies the size of font (as a multiple of csi) used to convert between mar and mai, and between oma and omi. This starts as 1 when the device is opened, and is reset when the layout is changed (alongside resetting cex).
mfcol, mfrow A vector of the form c(nr, nc). Subsequent figures will be drawn in an nr-by-nc array on the device by columns (mfcol), or rows (mfrow), respectively. In a layout with exactly two rows and columns the base value of "cex" is reduced by a factor of 0.83: if there are three or more of either rows or columns, the reduction factor is 0.66. Setting a layout resets the base value of cex and that of mex to 1. If either of these is queried it will give the current layout, so querying cannot tell you the order in which the array will be filled. Consider the alternatives, layout and split.screen.
mfg A numerical vector of the form c(i, j) where i and j indicate which figure in an array of figures is to be drawn next (if setting) or is being drawn (if enquiring). The array must already have been set by mfcol or mfrow. For compatibility with S, the form c(i, j, nr, nc) is also accepted, when nr and nc should be the current number of rows and number of columns. Mismatches will be ignored, with a warning.
mgp The margin line (in mex units) for the axis title, axis labels and axis line. Note that mgp[1] affects title whereas mgp[2:3] affect axis. The default is c(3, 1, 0).
mkh The height in inches of symbols to be drawn when the value of pch is an integer. Completely ignored in R.
new logical, defaulting to FALSE. If set to TRUE, the next high-level plotting command (actually plot.new) should not clean the frame before drawing as if it were on a new device. It is an error (ignored with a warning) to try to use new = TRUE on a device that does not currently contain a high-level plot.
oma A vector of the form c(bottom, left, top, right) giving the size of the outer margins in lines of text.
omd A vector of the form c(x1, x2, y1, y2) giving the region inside outer margins in NDC (= normalized device coordinates), i.e., as a fraction (in [0, 1]) of the device region.
omi A vector of the form c(bottom, left, top, right) giving the size of the outer margins in inches.
pch Either an integer specifying a symbol or a single character to be used as the default in plotting points. See points for possible values and their interpretation. Note that only integers and single-character strings can be set as a graphics parameter (and not NA nor NULL).
pin The current plot dimensions, (width,height), in inches.
plt A vector of the form c(x1, x2, y1, y2) giving the coordinates of the plot region as fractions of the current figure region.
ps integer; the point size of text (but not symbols). Unlike the pointsize argument of most devices, this does not change the relationship between mar and mai (nor oma and omi). What is meant by ‘point size’ is device-specific, but most devices mean a multiple of 1bp, that is 1/72 of an inch.
pty A character specifying the type of plot region to be used; "s" generates a square plotting region and "m" generates the maximal plotting region.
smo (Unimplemented) a value which indicates how smooth circles and circular arcs should be.
srt The string rotation in degrees. See the comment about crt. Only supported by text.
tck The length of tick marks as a fraction of the smaller of the width or height of the plotting region. If tck >= 0.5 it is interpreted as a fraction of the relevant side, so if tck = 1 grid lines are drawn. The default setting (tck = NA) is to use tcl = -0.5.
tcl The length of tick marks as a fraction of the height of a line of text. The default value is -0.5; setting tcl = NA sets tck = -0.01 which is S' default.
usr A vector of the form c(x1, x2, y1, y2) giving the extremes of the user coordinates of the plotting region. When a logarithmic scale is in use (i.e., par("xlog") is true, see below), then the x-limits will be 10 ^ par("usr")[1:2]. Similarly for the y-axis.
xaxp A vector of the form c(x1, x2, n) giving the coordinates of the extreme tick marks and the number of intervals between tick-marks when par("xlog") is false. Otherwise, when log coordinates are active, the three values have a different meaning: For a small range, n is negative, and the ticks are as in the linear case, otherwise, n is in 1:3, specifying a case number, and x1 and x2 are the lowest and highest power of 10 inside the user coordinates, 10 ^ par("usr")[1:2]. (The "usr" coordinates are log10-transformed here!) n=1 will produce tick marks at 10^j for integer j, n=2 gives marks k 10^j with k in {1,5}, n=3 gives marks k 10^j with k in {1,2,5}. See axTicks() for a pure R implementation of this. This parameter is reset when a user coordinate system is set up, for example by starting a new page or by calling plot.window or setting par("usr"): n is taken from par("lab"). It affects the default behaviour of subsequent calls to axis for sides 1 or 3.
xaxs The style of axis interval calculation to be used for the x-axis. Possible values are "r", "i", "e", "s", "d". The styles are generally controlled by the range of data or xlim, if given. Style "r" (regular) first extends the data range by 4 percent at each end and then finds an axis with pretty labels that fits within the extended range. Style "i" (internal) just finds an axis with pretty labels that fits within the original data range. Style "s" (standard) finds an axis with pretty labels within which the original data range fits. Style "e" (extended) is like style "s", except that it also ensures that there is room for plotting symbols within the bounding box. Style "d" (direct) specifies that the current axis should be used on subsequent plots. (Only "r" and "i" styles have been implemented in R.)
xaxt A character which specifies the x axis type. Specifying "n" suppresses plotting of the axis. The standard value is "s": for compatibility with S values "l" and "t" are accepted but are equivalent to "s": any value other than "n" implies plotting.
xlog A logical value (see log in plot.default). If TRUE, a logarithmic scale is in use (e.g., after plot(*, log = "x")). For a new device, it defaults to FALSE, i.e., linear scale.
xpd A logical value or NA. If FALSE, all plotting is clipped to the plot region, if TRUE, all plotting is clipped to the figure region, and if NA, all plotting is clipped to the device region. See also clip.
yaxp A vector of the form c(y1, y2, n) giving the coordinates of the extreme tick marks and the number of intervals between tick-marks unless for log coordinates, see xaxp above.
yaxs The style of axis interval calculation to be used for the y-axis. See xaxs above.
yaxt A character which specifies the y axis type. Specifying "n" suppresses plotting.
ylbias A positive real value used in the positioning of text in the margins by axis and mtext. The default is in principle device-specific, but currently 0.2 for all of R's own devices. Set this to 0.2 for compatibility with R < 2.14.0 on x11 and windows() devices.
ylog A logical value; see xlog above
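A minimal sketch of how a few of the parameters above interact, assuming only base R graphics and the built-in mtcars data. The variable names (old_par, cex_in_layout, x_limits) are illustrative. Calling par() with new values returns the old ones, which makes it easy to restore them afterwards.

```r
# Set a 2 x 2 layout filled by rows, tighter margins and horizontal
# axis labels; par() returns the previous values of what it changed.
old_par <- par(mfrow = c(2, 2), mar = c(4, 4, 2, 1), las = 1)

# In a 2 x 2 layout the base value of cex is reduced to 0.83.
cex_in_layout <- par("cex")

plot(mpg ~ wt, data = mtcars, main = "Default box")
plot(mpg ~ wt, data = mtcars, bty = "l", main = "bty = l")
plot(mpg ~ wt, data = mtcars, pch = 19, col = "blue", main = "pch and col")
plot(mpg ~ wt, data = mtcars, log = "x", main = "log x axis")

# With a log x axis, usr holds log10 limits; back-transform to data units.
x_limits <- 10 ^ par("usr")[1:2]

# Restore the previous settings (this also resets the layout).
par(old_par)
```

Restoring only the parameters that were changed avoids touching the read-only (R.O.) entries such as cin, cra and din.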
COLORS
List all the available graphics colors
colors()
Hexadecimal Color Chart
List of Colors in [R]
Color Palette
palette()
The palette is what is supplied to col arguments referenced by number. The palette can be changed to any subset of the numbered colors above {colors()[subset of numbers from chart]} and then reset using the "default" argument.
Default colors: black, red, green3, blue, cyan, magenta, yellow, gray
Changing Colors in Arguments Example
frame()
textClick("GGG",colors()[47],4)
textClick("GGG",colors()[134],4)
textClick("GGG",colors()[500],4)
textClick("GGG",colors()[551],4)
textClick("GGG",colors()[634],4)
palette() # obtain the current palette
palette(rainbow(6)) # six color rainbow
palette() # obtain the current palette
palette(colors()[c(1,10,20,30,40,50,60,70,80,90,100)]) #11 colors
palette() # obtain the current palette
palette("default") # reset the color palette
palette() # obtain the current palette
Compare the numbers to the number chart above.
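A short sketch of how numeric col values are looked up in the palette, assuming base R only. The four-color palette is an arbitrary illustration. Note that in R 4.0.0 and later the default palette is a different set of colors from the list above; there, palette("R3") brings back the older defaults.

```r
# Remember whatever palette is active so it can be restored.
old_palette <- palette()

# Install a four-colour palette; col = 1..4 now refer to these entries.
palette(c("black", "red", "green3", "blue"))
four <- palette()

# Indices beyond the palette length wrap around: col = 5 reuses entry 1.
plot(1:8, pch = 19, cex = 3, col = 1:8,
     main = "col = 1:8 with a four-colour palette")

# Put the previous palette back.
palette(old_palette)
```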
Show Some of [R]'s colors by name and color
library(DAAG)
show.colors(type=c("shades"), order.cols=TRUE)
show.colors(type=c("gray"), order.cols=TRUE)
show.colors(type=c("singles"), order.cols=TRUE)
EXAMPLE
plot(mpg~disp,col="blue4", data=mtcars) #using shades
Preset Palettes
rainbow(n, s = 1, v = 1, start = 0, end = max(1,n - 1)/n, alpha = 1)
gray.colors(n, start = 0.3, end = 0.9, gamma = 2.2)
heat.colors(n, alpha = 1)
terrain.colors(n, alpha = 1)
topo.colors(n, alpha = 1)
cm.colors(n, alpha = 1)
Arguments
n the number of colors (≥ 1) to be in the palette.
s,v the ‘saturation’ and ‘value’ to be used to complete the HSV color descriptions.
start the (corrected) hue in [0,1] at which the rainbow begins.
end the (corrected) hue in [0,1] at which the rainbow ends.
alpha the alpha transparency, a number in [0,1], see argument alpha in hsv.
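A quick sketch of the preset palette functions and their n and alpha arguments, assuming base R only. The swatch layout and variable names are just one way to compare them side by side.

```r
n <- 6
pal_rainbow <- rainbow(n)                     # evenly spaced hues
pal_heat    <- heat.colors(n, alpha = 0.5)    # half-transparent colours
pal_gray    <- gray.colors(n, start = 0.3, end = 0.9)

# Draw each palette as a row of coloured bars for comparison.
old_par <- par(mfrow = c(3, 1), mar = c(1, 1, 2, 1))
barplot(rep(1, n), col = pal_rainbow, axes = FALSE, main = "rainbow(6)")
barplot(rep(1, n), col = pal_heat, axes = FALSE,
        main = "heat.colors(6, alpha = 0.5)")
barplot(rep(1, n), col = pal_gray, axes = FALSE, main = "gray.colors(6)")
par(old_par)

# col2rgb() converts a colour string back to red/green/blue components.
first_rgb <- col2rgb(pal_rainbow[1])[, 1]
```

Each function returns a character vector of hex color strings, so the result can be passed straight to any col argument or to palette().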
Change the Background Color
par(bg="color")
Change the Foreground Color
par(fg="color")
Select Random Color
Self-created color randomization function
ran.col(c(dataframe, vector, number), color.choice = c(colors, rainbow, heat, terrain, topo, cm))
EXAMPLE
frame()
terrain.colors(6)
textClick("GGG",terrain.colors(7)[1],4)
textClick("GGG",terrain.colors(7)[2],4)
textClick("GGG",terrain.colors(7)[3],4)
textClick("GGG",terrain.colors(7)[4],4)
textClick("GGG",terrain.colors(7)[5],4)
textClick("GGG",terrain.colors(7)[6],4)
x11(16,8)
par(mfrow = c(2,3))
with(mtcars,plot(mpg,disp,pch=19,col=ran.col(mtcars,colors),main="COLORS"))
with(mtcars,plot(mpg,disp,pch=19,col=ran.col(mtcars,rainbow),main="RAINBOW"))
with(mtcars,plot(mpg,disp,pch=19,col=ran.col(mtcars,heat),main="HEAT"))
with(mtcars,plot(mpg,disp,pch=19,col=ran.col(mtcars,terrain),main="TERRAIN"))
with(mtcars,plot(mpg,disp,pch=19,col=ran.col(mtcars,topo),main="TOPO"))
with(mtcars,plot(mpg,disp,pch=19,col=ran.col(3,cm),main="CM"))
ran.col(6,colors)
#USING TO SET PALETTE
palette() #current palette
palette(ran.col(10)) #set palette
palette() #current palette
with(mtcars,plot(mpg,disp,pch=19,col=cyl,main="COLORS"))
palette("default") #return to default
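The source of ran.col is not included in these notes. The sketch below is a hypothetical stand-in (random_colors and its from argument are made-up names, not the author's implementation) showing one way such a helper could be written with sample().

```r
# Hypothetical stand-in for ran.col(); not the author's implementation.
# Draws n random colours from one of a few colour pools.
random_colors <- function(n, from = c("colors", "rainbow", "heat")) {
  from <- match.arg(from)
  pool <- switch(from,
    colors  = colors(),          # all named R colours
    rainbow = rainbow(100),
    heat    = heat.colors(100)
  )
  # Sample with replacement only if more colours are asked for than exist.
  sample(pool, n, replace = n > length(pool))
}

set.seed(1)   # make the random draw reproducible
point_cols <- random_colors(nrow(mtcars))
plot(disp ~ mpg, data = mtcars, pch = 19, col = point_cols,
     main = "Random colours")
```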
Plot two graphs in the same pane (Overlay graphs)
par(new=TRUE)
Building Plot Frames from Pieces
plot(x, y, type="n",xlab="",ylab="", axes=FALSE)
points(x, y)
axis(1)
axis(2,at=seq(.2,1.8,.2))
box()
Plot Grid Lines
grid(nx = NULL, ny = nx, col = "lightgray", lty = "dotted",lwd = par("lwd"), equilogs = TRUE)
EXAMPLE
frame()
grid(col="blue")
shapeClick("seg",col="red")
EXAMPLES
EXAMPLE A
plot(mpg~as.factor(cyl),col="green",data=mtcars)
par("new"=TRUE)
plot(mpg~cyl,col="blue",xlab="",axes=FALSE,data=mtcars)
EXAMPLE B
x11()
frame()
with(mtcars,plot(mpg~disp))
shapeClick("poly",6,border="red",col="yellow")
shapeClick("poly",6,border="red",col="green")
shapeClick("poly",6,border="red",col="orange")
par(new=T)
with(mtcars,plot(mpg~disp))
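A non-interactive sketch of the same overlay idea, assuming base R and mtcars. It draws a second variable on top of the first with its own right-hand axis. The choice of wt, mpg and hp is arbitrary.

```r
# Leave extra room on the right for a second axis.
old_par <- par(mar = c(5, 4, 4, 4) + 0.1)

# First plot: mpg against weight, with the usual left-hand axis.
plot(mtcars$wt, mtcars$mpg, pch = 19, col = "blue",
     xlab = "Weight", ylab = "mpg")

# Tell the next high-level plot not to clear the frame.
par(new = TRUE)

# Second plot over the same x values, without axes or labels.
plot(mtcars$wt, mtcars$hp, pch = 17, col = "red",
     axes = FALSE, xlab = "", ylab = "")
axis(4)                              # right-hand axis for hp
mtext("hp", side = 4, line = 2.5)
usr_overlay <- par("usr")            # user coordinates of the overlaid plot

par(old_par)
```

Because each plot() call sets up its own user coordinates, the two y scales are independent; the x scales only line up here because both calls use the same x values.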
EXAMPLE
attach(mtcars)
plot(mpg, disp, type="n",xlab="",ylab="", axes=F)
points(mpg, disp,col="blue")
axis(1,at=seq(0,35,5),col="red",col.axis="green",lwd=3)
axis(2,lwd=6)
axis(3,seq_along(mpg), c(LETTERS,LETTERS[1:6]), col.axis = "blue")
axis(4)
box(col="orange",lwd=7)
title(main="YEPPER",xlab="OK GUY", ylab="YOU DA MAN",sub="SUBTITLE")
detach(mtcars)
plot(1:10, xaxt = "n")
axis(1, xaxp=c(0, 9, 5))
plot(1:10, xaxt = "n")
axis(1, xaxp=c(2, 9, 7))
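The pieces above (an empty frame, points, axes, a box and grid lines) can be combined in one short script. This is a sketch with made-up data; x, y and tick_spec are illustrative names.

```r
x <- 1:10
y <- c(2, 4, 3, 6, 5, 8, 7, 9, 8, 10)     # illustrative data

# Set up the coordinate system without drawing anything.
plot(x, y, type = "n", axes = FALSE, xlab = "", ylab = "")
grid(col = "lightgray", lty = "dotted")   # grid at the default tick positions
points(x, y, pch = 19)
axis(1, at = seq(2, 10, by = 2))          # custom x tick positions
axis(2, las = 1)                          # horizontal y tick labels
box()
title(main = "Built from pieces", xlab = "x", ylab = "y")

# xaxp records c(lowest tick, highest tick, number of intervals).
tick_spec <- par("xaxp")
```

Drawing the grid before the points keeps the grid lines underneath the data.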
Title and Labels for Graphics
Type: plot(x, y,main="The Title", xlab="X Axis Label", ylab="Y Axis Label")
Where plot is the function, x is the x variable, y is the y variable, "The Title" is what the graph will be named, "X Axis Label" is the name of the X axis, and "Y Axis Label" is the name of the Y axis.
Note: This can be applied to any graphic:
hist(p,main="Parent Aggression Levels", xlab="Aggression Range", ylab="Number of Occurrences")
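A small sketch of the two usual ways to label a plot, assuming base R. The vector p is simulated here only so the hist() call can run; it is not the original data.

```r
set.seed(42)
p <- rnorm(100, mean = 50, sd = 10)   # simulated scores, for illustration only

# 1. Titles and labels supplied directly to the high-level function.
h <- hist(p, main = "Parent Aggression Levels",
          xlab = "Aggression Range", ylab = "Number of Occurrences")

# 2. Suppress annotation with ann = FALSE, then add it with title().
plot(mtcars$wt, mtcars$mpg, ann = FALSE)
title(main = "The Title", sub = "A subtitle",
      xlab = "X Axis Label", ylab = "Y Axis Label")
```

The second form is handy when the labels need their own settings, such as col.main or font.lab, without affecting the rest of the plot.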
Varying Graphs on a Page #1
layout(matrix(c(), rows, columns))
Work on the rows and columns first. This creates the grid work for the matrix specifications. For a 2x2 grid (rows by columns), matrix(c(1,1,2,3), 2, 2, byrow=TRUE) gives graph 1 the first two boxes (the top row) and graphs 2 and 3 one box each on the bottom.
Varying Graphs on a Page #2 (controls size of window and layout)
source("C:/Users/Rinker/Desktop/PhD Program/CEP 523-Stat Meth Ed Inference/R Stuff/Scripts/Graphics/Multiple Graphics Function.txt")
multiG (width, height, columns, rows, matrix)
For an example use: EXAMPLE(multiG)
#===================================================
# VARYING GRAPHS PER PAGE
#===================================================
attach(mtcars)
#===================================================
layout(matrix(c(1,1,2,3), 2, 2, byrow = TRUE))
hist(wt)
hist(mpg)
hist(disp)
#===================================================
windows()
layout(matrix(c(2,1,2,3), 2, 2, byrow = TRUE))
hist(wt)
hist(mpg)
hist(disp)
#===================================================
windows()
layout(matrix(c(1,2,3,3), 2, 2, byrow = TRUE))
hist(wt)
hist(mpg)
hist(disp)
#===================================================
windows(h=6,w=8)
layout(matrix(c(1,2,3,3,4,5), 3, 2, byrow = TRUE))
hist(wt)
hist(mpg)
hist(disp)
hist(drat)
hist(qsec)
See my created function for doing this quickly.
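A device-independent sketch of layout(), assuming base R and mtcars. It avoids windows() so it runs on any platform, and it uses the heights argument and layout.show(), which the examples above do not.

```r
# Top row: one wide plot; bottom row: two plots side by side.
m <- matrix(c(1, 1,
              2, 3), nrow = 2, ncol = 2, byrow = TRUE)

# layout() returns the number of figures; make the top row twice as tall.
n_fig <- layout(m, heights = c(2, 1))
layout.show(n_fig)          # outline where each figure will be drawn

hist(mtcars$wt,   main = "wt")
hist(mtcars$mpg,  main = "mpg")
hist(mtcars$disp, main = "disp")

layout(1)                   # back to a single figure per page
```

The numbers in the matrix give the order in which plots are drawn, and repeating a number lets one plot span several cells.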
Varying Graphs on a Page #3
library(plotrix)
Split the graphics device into a "panel" type layout for a group of plots
panes(mat=NULL,widths=rep(1,ncol(mat)),heights=rep(1,nrow(mat)), nrow=2,ncol=2, mar=c(0,0,1.6,0),oma=c(2.5,1,1,1))
EXAMPLE
y<-runif(8)
panes(matrix(1:4,nrow=2,byrow=TRUE))
par(mar=c(0,2,1.6,0))
boxplot(y,axes=FALSE)
axis(2)
box()
par(mar=c(0,0,1.6,2))
tab.title("Boxplot of y",tab.col="#88dd88")
barplot(y,axes=FALSE,col=2:9)
axis(4)
box()
tab.title("Barplot of y",tab.col="#88dd88")
par(mar=c(2,2,1.6,0))
pie(y,col=2:9)
tab.title("Pie chart of y",tab.col="#88dd88")
box()
par(mar=c(2,0,1.6,2))
plot(y,xaxs="i",xlim=c(0,9),axes=FALSE,col=2:9)
axis(4)
box()
tab.title("Scatterplot of y",tab.col="#88dd88")
# center the title at the left edge of the last plot
mtext("Test of panes function",at=0,side=1,line=0.8,cex=1.5)
panes(matrix(1:3,ncol=1),heights=c(0.7,0.8,1))
par(mar=c(0,2,2,2))
plot(sort(runif(7)),type="l",axes=FALSE)
axis(2,at=seq(0.1,0.9,by=0.2))
box()
tab.title("Rising expectations",tab.col="#ee6666")
barplot(rev(sort(runif(7))),col="blue",axes=FALSE)
axis(2,at=seq(0.1,0.9,by=0.2))
box()
tab.title("Diminishing returns",tab.col="#6666ee")
par(mar=c(4,2,2,2))
tso<-c(0.2,0.3,0.5,0.4,0.6,0.8,0.1)
plot(tso,type="n",axes=FALSE,xlab="")
# the following needs a Unicode locale to work
points(1:7,tso,pch=c(rep(-0x263a,6),-0x2639),cex=2)
axis(1,at=1:7,
labels=c("Tuesday","Wednesday","Thursday","Friday",
"Saturday","Sunday","Monday"))
axis(2,at=seq(0.1,0.9,by=0.2))
box()
tab.title("The sad outcome",tab.col="#66ee66")
mtext("A lot of malarkey",side=1,line=2.5)
Put a Box Around a Figure or a Group of Figures
box()
#EXAMPLES
test<-rnorm(100);plot(test)
box("figure", lwd=2)
test<-rnorm(100);plot(test)
box("outer", lwd =2)
par(mfrow = c(2, 2))
plot(test)
box("figure", lwd=1)
plot(test)
box("figure", lwd=1)
plot(test)
box("figure", lwd=1)
plot(test)
box("figure", lwd=1)
box("outer", lwd =5, col="red")
Graph Types
Pie Graph
pie(x,labels,…)
x is a vector of values; labels is a vector of label names. See the examples below.
NOTE: Cleveland (1985) states that a pie chart is a poor choice for displaying information.
3-D Pie Graph
pie3D(x,labels,…)
x is a vector of values; labels is a vector of label names. Very similar to pie. See the examples below.
#===========================================================================
# THE DATA
#===========================================================================
slices <- c(11, 12,4, 16, 8)
N<-sum(slices)
Percents<-format(digits=3,(slices/N)*100)
lbls <- c("US", "UK", "Australia", "Germany", "France")
lbls2<-paste(lbls," ",Percents,"%",sep="")
#===========================================================================
# PIE PLOTS (not a preferred method of display)
#===========================================================================
windows(height=6,width=10);par(mfrow=c(2,2))
#...........................................................................
# TYPE 1
#...........................................................................
pie(slices, labels = lbls, main="Pie Chart of Countries")
#...........................................................................
# TYPE 2
#...........................................................................
pie(slices, labels ="", main="Pie Chart of Countries",
col=c("blue","red","green","yellow","orange"))
legend(1.60,.7,lbls,cex=0.8,fill=c("blue","red","green","yellow","orange"))
#...........................................................................
# TYPE 3
#...........................................................................
pie(slices, labels = lbls2, main="Pie Chart of Countries",
col=c("blue","chocolate","red","yellow","bisque"))
#...........................................................................
# TYPE 4
#...........................................................................
pie(slices,labels=paste(Percents,"%",sep=""),main="Pie Chart of Countries",
col=c("blue","red","green","yellow","orange"))
legend(1.60,.7,lbls,cex=0.8,fill=c("blue","red","green","yellow","orange"))
#===========================================================================
# THE DATA
#===========================================================================
slices <- c(11, 12,4, 16, 8)
N<-sum(slices)
Percents<-format(digits=3,(slices/N)*100)
lbls <- c("US", "UK", "Australia", "Germany", "France")
lbls2<-paste(lbls," ",Percents,"%",sep="")
#===========================================================================
# PIE PLOTS (not a preferred method of display)
#===========================================================================
library(plotrix);windows(h=6,w=12);par(mfrow=c(1,2))
#...........................................................................
# TYPE 1
#...........................................................................
pie3D(slices, labels = lbls, main="Pie Chart of Countries")
#...........................................................................
# TYPE 2
#...........................................................................
pie3D(slices, labels ="", main="Pie Chart of Countries",
col=c("blue","red","green","yellow","orange"))
legend(.47,1,lbls,cex=0.8,fill=c("blue","red","green","yellow","orange"))
windows(h=6,w=12);par(mfrow=c(1,2))
#...........................................................................
# TYPE 3
#...........................................................................
pie3D(slices, labels = lbls2, main="Pie Chart of Countries",
col=c("blue","chocolate","red","yellow","bisque"),labelcex=1.1)
#...........................................................................
# TYPE 4
#...........................................................................
pie3D(slices,labels=paste(Percents,"%",sep=""),main="Pie Chart of Countries",
col=c("blue","red","green","yellow","orange"))
legend(.55,1.05,lbls,cex=0.8,fill=c("blue","red","green","yellow","orange"))
Dot Chart
dotchart(x,labels,…)
x is a vector of values; labels is a vector of label names. Very similar to pie. See the examples below.
NOTE: The dot chart is preferred to the pie graph. It can display everything a pie graph can and then some.
StripPlot
library(lattice)
stripplot(factor~numeric)
#===========================================================================
# THE DATA
#===========================================================================
slices <- c(11, 12,4, 16, 8)
N<-sum(slices)
Percents<-format(digits=3,(slices/N)*100)
lbls <- c("US", "UK", "Australia", "Germany", "France")
lbls2<-paste(lbls," ",Percents,"%",sep="")
#===========================================================================
# DOT PLOTS (preferred over pie charts)
#===========================================================================
windows(h=6,w=12);par(mfrow=c(1,2))
#...........................................................................
# Simple 1
#...........................................................................
dotchart(slices,labels=lbls2,cex=.7,
main="Dot Plot Countries Comparison",
xlab="Corn Production (Millions of Bushels)",
col=c("blue","red","darkgreen","black","orange"))
#...........................................................................
# Simple 2 Colored
#...........................................................................
dotchart(mtcars$mpg,labels=row.names(mtcars),cex=.7,
main="Gas Mileage for Car Models",
xlab="Miles Per Gallon")
#...........................................................................
# By Group-Colored(Cylinders)
#...........................................................................
windows(h=6,w=6);par(mfrow=c(1,1))
x <- mtcars[order(mtcars$mpg),] # sort by mpg
x$cyl <- factor(x$cyl) # it must be a factor
x$color[x$cyl==4] <- "red"
x$color[x$cyl==6] <- "blue"
x$color[x$cyl==8] <- "darkgreen"
dotchart(x$mpg,labels=row.names(x),cex=.7,groups= x$cyl,
main="Gas Mileage for Car Models\ngrouped by cylinder",
xlab="Miles Per Gallon", gcolor="black", color=x$color)
mtext("Cars Grouped by Cylinder", side = 2, line = 2, cex = .7)
library(lattice)
stripplot(factor(cyl,levels=c("8","4","6"))~mpg,data=mtcars)
stripplot(factor(cyl,levels=c("8","6","4"))~mpg,main="Mileage by Cylinder Type",
ylab="Cylinders",data=mtcars)
Venn Diagram 1
library(venneuler)
#Example:
List1 <- c("apple", "apple", "orange", "kiwi", "cherry", "peach")
List2 <- c("apple", "orange", "cherry", "tomatoe", "pear", "plum", "plum")
Lists <- list(List1, List2) #put the word vectors into a list to supply lapply
items <- sort(unique(unlist(Lists))) #put in alphabetical order
MAT <- matrix(rep(0, length(items)*length(Lists)), ncol=2) #make a matrix of 0's
colnames(MAT) <- paste0("List", 1:2)
rownames(MAT) <- items
lapply(seq_along(Lists), function(i) { #fill the matrix
MAT[items %in% Lists[[i]], i] <<- table(Lists[[i]])
})
MAT #look at the results
library(venneuler)
v <- venneuler(MAT)
plot(v)
Venn Diagram 2
library(gplots)
venn(data, universe=NA, small=0.7, showSetLogicLabel=FALSE, simplify=FALSE, show.plot=TRUE)
Arguments
data,x Either a list containing vectors of names or indices of group members, or a data frame containing boolean indicators of group membership
universe Subset of valid name/index elements. Values in data not in this list will be ignored. Use NA to use all elements of data (the default).
small Character scaling of the smallest group counts
showSetLogicLabel Logical flag indicating whether the internal group label should be displayed
simplify Logical flag indicating whether unobserved groups should be omitted.
show.plot Logical flag indicating whether the plot should be displayed. If false, simply returns the group count matrix.
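A sketch of what a two-set Venn diagram counts, using base R set functions on two small made-up lists. The diagram itself is drawn only if gplots happens to be installed; the list form of the data argument is the one described above.

```r
List1 <- c("apple", "orange", "kiwi", "cherry", "peach")
List2 <- c("apple", "orange", "cherry", "pear", "plum")

# The three regions of a two-set Venn diagram.
both  <- intersect(List1, List2)    # items in both lists
only1 <- setdiff(List1, List2)      # items only in List1
only2 <- setdiff(List2, List1)      # items only in List2
region_counts <- c(List1_only = length(only1),
                   both       = length(both),
                   List2_only = length(only2))

# Draw the diagram only if the gplots package is available.
if (requireNamespace("gplots", quietly = TRUE)) {
  gplots::venn(list(List1 = List1, List2 = List2))
}
```

Checking the counts by hand like this is a cheap way to confirm the numbers printed inside the diagram.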
Line Graph
See Example Below
Line Graph With Confidence Interval
library(sciplot)
lineplot.CI(x.factor=, response=, main=" ", data=,xlab="",ylab="")
x.factor is the grouping variable, response is the numeric measure
EXAMPLE
#oneway
lineplot.CI(x.factor=cyl, response=mpg, main="MPG by Cylinder Type", data=mtcars,
xlab="Cylinders",ylab="mpg")
#twoway
lineplot.CI(x.factor=cyl, response=mpg,group=am, main="MPG by Cylinder Type",
data=mtcars,xlab="Cylinders",ylab="mpg")
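For comparison, a base R sketch of the same kind of plot: group means joined by a line with error bars. It assumes bars of plus or minus one standard error, which is my understanding of the lineplot.CI default but should be checked against the sciplot help page. The names grp_mean, grp_se and x_pos are illustrative.

```r
# Mean and standard error of mpg for each cylinder count.
grp_mean <- tapply(mtcars$mpg, mtcars$cyl, mean)
grp_se   <- tapply(mtcars$mpg, mtcars$cyl,
                   function(v) sd(v) / sqrt(length(v)))

# Plot the means at positions 1, 2, 3 and label them by group.
x_pos <- seq_along(grp_mean)
plot(x_pos, grp_mean, type = "b", pch = 19, xaxt = "n",
     ylim = range(grp_mean - grp_se, grp_mean + grp_se),
     xlab = "Cylinders", ylab = "mpg", main = "MPG by Cylinder Type")
axis(1, at = x_pos, labels = names(grp_mean))

# Error bars: vertical segments with flat caps at mean +/- 1 SE.
arrows(x_pos, grp_mean - grp_se, x_pos, grp_mean + grp_se,
       angle = 90, code = 3, length = 0.05)
```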
cars <- c(1, 3, 6, 4, 9)
trucks <- c(2, 5, 4, 5, 12)
# Calculate range from 0 to max value of cars and trucks
g_range <- range(0, cars, trucks)
plot(cars, type="o", col="blue", ylim=g_range,
axes=FALSE, ann=FALSE)
# Make x axis using Mon-Fri labels
axis(1, at=1:5, lab=c("Mon","Tue","Wed","Thu","Fri"))
# Make y axis with horizontal labels that display ticks at
# every 4 units: seq(0, g_range[2], by=4) is c(0,4,8,12).
axis(2, las=1, at=seq(0, g_range[2], by=4))
# Create box around plot
box()
# Graph trucks with red dashed line and square points
lines(trucks, type="o", pch=22, lty=2, col="red")
# Create a title with a red, bold/italic font
title(main="Autos", col.main="red", font.main=4)
# Label the x and y axes with dark green text
title(xlab="Days", col.lab=rgb(0,0.5,0))
title(ylab="Total", col.lab=rgb(0,0.5,0))
# Create a legend at (1, g_range[2]) that is slightly smaller
# (cex) and uses the same line colors and points used by
# the actual plots
legend(1, g_range[2], c("cars","trucks"), cex=0.8,
col=c("blue","red"), pch=21:22, lty=1:2)
Line Graph 2 (more for joining points of existing plots)
The order() and lines() calls in the second example below are the code responsible.
Interaction Plot method 1 (see also effects below)
interaction.plot(x.factor, trace.factor, response, fun = mean, type = c("l", "p", "b"), legend = TRUE, trace.label = deparse(substitute(trace.factor)), fixed = FALSE, xlab = deparse(substitute(x.factor)), ylab = ylabel, ylim = range(cells, na.rm=TRUE), lty = nc:1, col = 1, pch = c(1:9, 0, letters), xpd = NULL, leg.bg = par("bg"), leg.bty = "n", xtick = FALSE, xaxt = par("xaxt"), axes = TRUE, ...)
EXAMPLE:
HW19<-read.table("HW19.csv", header=TRUE, sep=",",na.strings="NA")
HW19$Attitude<-as.factor(HW19$Attitude)
x11(18,8)
frame()
par(mfrow=c(1,3))
with(HW19,interaction.plot(Attitude,Gender,Science.Comprehension,lwd=3,col=c(11,4)))
with(HW19,interaction.plot(Attitude,Grade,Science.Comprehension,lwd=3,col=c(6,2)))
with(HW19,interaction.plot(Grade,Gender,Science.Comprehension,lwd=3,col=c("orange","purple")))
EXAMPLE
with(mtcars,plot(mpg,hp,main="Norah's Cries",
xlab="Time",ylab="Decibels"))
sequence<-with(mtcars,order(mpg))
with(mtcars,lines(mpg[sequence],hp[sequence],
col="green",lwd=2))
shapeClick("arrow",code=1,col="blue",lwd=2)
shapeClick("arrow",code=1,col="blue",lwd=2)
shapeClick("arrow",code=1,col="blue",lwd=2)
text(locator(1), "Beginning RTI!", pos=4)
text(locator(1), "It gets worse before", pos=4)
text(locator(1), "it gets better!", pos=4)
text(locator(1), "Extinction Bursts", pos=4)
shapeClick("box",border="blue",lwd=2)
shapeClick("box",border="blue",lwd=2)
shapeClick("box",border="blue",lwd=2)
Interaction Plot method 2
library(effects)
plot(effect(term1:term2, fit, list(term3=c(levels))), multiline=TRUE)
Interaction Plot method 3
library(HH)
interaction2wt(y~x1*x2…, data=)
EXAMPLE
library(effects); library(HH) #effect() and interaction2wt() come from these packages
HW19<-read.table("HW19.csv", header=TRUE, sep=",",na.strings="NA")
HW19$Attitude<-as.factor(HW19$Attitude)
fit <- with(HW19, lm(Science.Comprehension~Gender * Attitude * Grade))
attach(HW19)
plot(effect("Gender:Attitude", fit, list(Gender=c("f","m"))),multiline=T)
plot(effect("Attitude:Grade", fit, list(Grade=c("eight","nine"))),multiline=T)
#had to x out of graph
plot(effect("Gender:Grade", fit, list(Grade=c("eight","nine"))),multiline=T)
interaction2wt(len~supp*dose, data=ToothGrowth)
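The HW19.csv file is not available outside these notes, so here is a minimal sketch of method 1 on the built-in ToothGrowth data. It first tabulates the cell means that the plot draws, then calls interaction.plot(); the name cell_means is made up for the illustration.

```r
# Cell means of tooth length for each supplement x dose combination
cell_means <- with(ToothGrowth, tapply(len, list(supp, dose), mean))
print(cell_means)

# One line per supplement type, dose on the x axis
with(ToothGrowth,
     interaction.plot(x.factor = factor(dose), trace.factor = supp,
                      response = len, fun = mean, type = "b",
                      pch = c(1, 19), xlab = "Dose", ylab = "Mean length",
                      trace.label = "Supplement"))
```

Lines that are far from parallel suggest an interaction; here the gap between the two supplements narrows at the highest dose.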
Plot the columns of one matrix or dataframe against the columns of another
matplot(x, y, type = "p", lty = 1:5, lwd = 1, lend = par("lend"), pch = NULL, col = 1:6, cex = NULL, bg = NA, xlab = NULL, ylab = NULL, xlim = NULL, ylim = NULL)
Arguments
Example
x <- as.matrix( EuStockMarkets[1:50,] )
matplot(x,main = "matplot (standard)", xlab = "", ylab = "")
matplot(x,type="l",lty=1, main = "matplot (line)", xlab = "", ylab = "")
#=============================================================
x <- as.data.frame(x)
matplot(x,main = "matplot (standard)", xlab = "", ylab = "")
matplot(x,type="l",lty=1, main = "matplot (line)", xlab = "", ylab = "")
type: what type of plot should be drawn. Possible types are
"p" for points,
"l" for lines,
"b" for both,
"c" for the lines part alone of "b",
"o" for both ‘overplotted’,
"h" for ‘histogram’ like (or ‘high-density’) vertical lines,
"s" for stair steps,
"S" for other steps, see ‘Details’ below,
"n" for no plotting.
Bar graph (search tags: sbar graph, sbarplot)
barplot(x) (?barplot for details; horiz=TRUE yields horizontal bars)
Bar graph with Confidence Intervals
library(sciplot)
bargraph.CI(x.factor=, response=, main=" ", data=, xlab="", ylab="")
x.factor is the grouping variable; response is the numeric measure.
3-D Bar Plot
Additional Niceties for Bar Graphs
barplot(VADeaths, beside=TRUE, las=1)
abline(h=seq(0, 100, by=1), col="gray90")
abline(h=seq(0, 100, by=10), col="gray")
par(new=T)
barplot(VADeaths, beside=TRUE, las=1)
barplot(VADeaths, beside=TRUE, las=1)
abline(h=seq(0, 100, by=5), col="gray90")
abline(h=seq(0, 100, by=10), col="gray")
par(new=T)
barplot(VADeaths, beside=TRUE, las=1)
barplot(VADeaths, beside=TRUE, las=1)
abline(h=0:100, col="white")
barplot(
VADeaths, beside=TRUE, las=1,
add=TRUE, col=FALSE
)
source("C:/Users/Rinker/Desktop/PhD Program/CEP 523-Stat Meth Ed Inference/R Stuff/Scripts/Graphics/3-D Bar plot.txt")
EXAMPLE
bargraph.CI(x.factor=cyl, response=mpg, main="MPG by Cylinder Type", data=mtcars,
xlab="Cylinders",ylab="mpg")
Add Text Directly Below Or Above Bars
day <- c(0:28)
ndied <- c(342,335,240,122,74,64,49,60,51,44,35,48,41,34,38,
27,29,23,20,15,20,16,17,17,14,10,4,1,2)
pdied <- c(19.1,18.7,13.4,6.8,4.1,3.6,2.7,3.3,2.8,2.5,2.0,2.7,
2.3,1.9,2.1,1.5,1.6,1.3,1.1,0.8,1.1,0.9,0.9,0.9,
0.8,0.6,0.2,0.1,0.1)
pmort <- data.frame(day,ndied,pdied)
barX <- barplot(pmort$pdied,xlab="Age(days)",
ylab="Percent", names=pmort$day,
xlim=c(0,35),ylim=c(0,20),legend="Mortality")
text(cex=.5, x=barX, y=pmort$pdied+par("cxy")[2]/2, pmort$ndied,
xpd=TRUE, col='darkgreen')
text(cex=.5, x=barX, y=-.5, pmort$ndied, xpd=TRUE, col="blue")
X2sum <- c(42.6, 3.6, 1.8, 3.9, 12.1, 14.3, 14.6 ,28.4)
X2.labels <- c("No earnings", "Less than $5000/year", "$5K to $10K" ,
"$10K to $15K" , "$ 15K to $20K" , "$20K to $25K" , "$25K to $30K",
"Over $30K" )
barCenters <- barplot(X2sum)
text(barCenters, par("usr")[3] - 0.5, srt = 45, adj = 1,
labels =X2.labels, xpd = TRUE, cex=.7)
mtcars2 <- mtcars[order(-mtcars$mpg), ]
par(cex.lab=1, cex.axis=.6,
mar=c(6.5, 3, 2, 2) + 0.1, xpd=NA) #shrink axis text and increase bot. mar.
barX <- barplot(mtcars2$mpg,xlab="Cars", main="MPG of Cars",
ylab="", names=rownames(mtcars2), mgp=c(5,1,0),
ylim=c(0, 35), las=2, col=mtcars2$cyl)
mtext(side=2, text="MPG", cex=1, padj=-2.5)
text(cex=.5, x=barX, y=mtcars2$mpg+par("cxy")[2]/2, mtcars2$hp, xpd=TRUE)
text(cex=.5, x=barX, y=-.5, mtcars2$gear, xpd=TRUE, col="red")
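All three examples above rely on the same idea: barplot() returns the x coordinates of the bar midpoints, and text() can reuse them. A stripped-down sketch on mtcars (the names counts and bar_mid are made up for the illustration):

```r
# Number of cars per cylinder count
counts <- table(mtcars$cyl)

# barplot() invisibly returns the bar midpoints; extra ylim leaves headroom
bar_mid <- barplot(counts, ylim = c(0, max(counts) * 1.2),
                   xlab = "Cylinders", ylab = "Number of cars")

# pos = 3 places each label just above the given (x, y) point
text(x = bar_mid, y = as.vector(counts), labels = as.vector(counts), pos = 3)
```

Using pos = 3 avoids the manual offset with par("cxy") when the label only needs to sit directly above the bar.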
Histogram (search tag: shistogram)
hist(Set$Attitude, col="purple", breaks=20)
hist(Set$Attitude, col="purple")
Histogram with kernel density and normal curve
library(descr)
histkdnc(variable) Additional arguments are similar to histogram
Histograms for all variables in a data frame or matrix
library(Hmisc)
hist.data.frame(data.frame)
Histograms with normal curve and density plot for all variables in a data frame or matrix
library(psych)
multi.hist(dataframe)
Density Plot
plot(density(d1$mathscore), main="yes", xlab="bad", ylab="good") #The Plot
polygon(density(d1$mathscore), col="orange", border="purple") #Coloring and the Border
Histogram-density plot with qq plot (check normality)
source("C:/Users/Rinker/Desktop/PhD Program/CEP 523-Stat Meth Ed Inference/R Stuff/Scripts/Graphics/ Histogram-density plot with qq plot.txt")
QQhist(x) for an example use: QQhist.fun()
Plot 2 or more density plots
library(sm)
sm.density.compare(num.variable,factor)
*GRAPH IS EMBELLISHED WITH MEAN LINES FOR EACH GROUP
AND EXTRA MEAN LINES TO EXPLAIN 4 CYL.'S BIMODAL GRAPH
EXAMPLE
library(sm)
library(doBy) #provides recodeVar()
mtcars2<-mtcars
mtcars2$cyl<-as.factor(with(mtcars,recodeVar(cyl,
src=c(4,6,8),tgt=c("four","six","eight"), default=NULL,
keep.na=TRUE)))
fm<-mean(subset(mtcars2,cyl=="four")$mpg)
sm<-mean(subset(mtcars2,cyl=="six")$mpg)
em<-mean(subset(mtcars2,cyl=="eight")$mpg)
with(mtcars2,sm.density.compare(mpg,cyl)) #plot several densities @ once
abline(v=mean(subset(mtcars2,cyl=="four")$mpg)) #plot means
abline(v=mean(subset(mtcars2,cyl=="six")$mpg))
abline(v=mean(subset(mtcars2,cyl=="eight")$mpg))
# uh oh 4 cyl is bi-modal. Why?
#plot of means for four cylinder by displacement;
#the factor that makes this group's graph bi-modal
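fmDF<-subset(mtcars2,cyl=="four") #assumed definition (fmDF is not defined in these notes); column 1 is mpg
abline(v=mean(fmDF[c(1,2,3,7,9,11),1]),col="orange")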
abline(v=mean(fmDF[c(4,5,6,8,10),1]),col="pink")
legend(locator(1),c("Four Cylinder","Six Cylinder", "Eight Cylinder", "4cyl low disp", "4cyl high disp"),
fill=c("green","blue","red", "orange", "pink"))
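The same comparison can be sketched without the sm or doBy packages. This is a base-R alternative, not what sm.density.compare() does internally: it estimates one density per group with density() and overlays them, with a dashed line at each group mean. The names mpg_by_cyl, dens and cols are made up for the illustration.

```r
# One vector of mpg values per cylinder group, and one density estimate each
mpg_by_cyl <- split(mtcars$mpg, mtcars$cyl)
dens <- lapply(mpg_by_cyl, density)

# Common axis limits so every curve fits on the same plot
x_rng <- range(sapply(dens, function(d) range(d$x)))
y_max <- max(sapply(dens, function(d) max(d$y)))
cols  <- c("green", "blue", "red")

plot(0, 0, type = "n", xlim = x_rng, ylim = c(0, y_max),
     xlab = "mpg", ylab = "Density", main = "MPG density by cylinders")
for (i in seq_along(dens)) {
  lines(dens[[i]], col = cols[i], lwd = 2)                    # density curve
  abline(v = mean(mpg_by_cyl[[i]]), col = cols[i], lty = 2)   # group mean
}
legend("topright", legend = paste(names(dens), "cyl"), col = cols, lwd = 2)
```

Placing the legend with "topright" instead of locator(1) keeps the example non-interactive.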
Histogram with colored tails (2 sd or whatever you set)
histogram <- hist(scale(vector), breaks= , plot=FALSE)
plot(histogram, col=ifelse(abs(histogram$breaks) < #of SD, Color 1, Color 2))
Example
windows(13,4)
par(mfrow=c(1,2))
histograph <- hist(scale(mtcars$mpg), breaks=10, plot=FALSE)
plot(histograph, main="Histogram of MPG",col=ifelse(abs(histograph$breaks) < 2, 5, 8))
x <- rnorm(1000)
hx <- hist(x, breaks=150, plot=FALSE)
plot(hx, col=ifelse(abs(hx$breaks) < 2, 3, 6))
Stem and Leaf Plot
stem(x, scale = 1, width = 80, atom = 1e-08)
Arguments
x: a numeric vector.
scale: controls the plot length.
width: the desired width of plot.
atom: a tolerance.
EXAMPLE:
x<-round(mtcars$mpg)
stem(x, scale=.5)
stem(x, scale=.25)
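A note on the colored-tails histogram earlier on this page: the colors there are computed from the breaks, and a histogram has one more break than it has bars, so the color vector is one element too long and each bar takes the color of its left edge. A hedged variant is to color by bar midpoints instead, which gives exactly one color per bar. The names z, h and tail_col are made up for the illustration.

```r
set.seed(1)
z <- as.vector(scale(rnorm(1000)))      # standardized values
h <- hist(z, breaks = 30, plot = FALSE)

# One color per bar: bars whose midpoint lies beyond 2 SD are highlighted
tail_col <- ifelse(abs(h$mids) < 2, "gray80", "firebrick")
plot(h, col = tail_col, main = "Histogram with colored tails")
```

With coarse breaks a bar can straddle the cutoff, so the choice between left edge, right edge and midpoint changes which boundary bar gets highlighted.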
MULTIVARIATE DATA PLOTS
Star Plot (Multivariate data)
Draw star plots or segment diagrams of a multivariate data set
stars(x, full = TRUE, scale = TRUE, radius = TRUE, labels = dimnames(x)[[1]], locations = NULL, nrow = NULL, ncol = NULL, len = 1, key.loc = NULL, key.labels = dimnames(x)[[2]], key.xpd = TRUE, xlim = NULL, ylim = NULL, flip.labels = NULL, draw.segments = FALSE, col.segments = 1:n.seg, col.stars = NA, axes = FALSE, frame.plot = axes, main = NULL, sub = NULL, xlab = "", ylab = "", cex = 0.8, lwd = 0.25, lty = par("lty"), xpd = FALSE, mar = pmin(par("mar"), 1.1+ c(2*axes+ (xlab != ""), 2*axes+ (ylab != ""), 1,0)), add = FALSE, plot = TRUE, ...)
ARGUMENTS
example(stars)
Chernoff Faces (Multivariate data)
library(aplpack)
faces(data, plot.faces=c(TRUE,FALSE))
library(TeachingDemos)
faces2(data, ncol=#,nrow=#)
EXAMPLES:
library(aplpack)
a<-aplpack::faces(mtcars,plot.faces=FALSE)
win.graph(11,8);par(mar = rep(0, 4),xpd=NA)
plot(0:5,0:5,type="n")
plot(a)
library(TeachingDemos)
win.graph(11,8);par(mar = rep(0, 4),xpd=NA)
faces2(mtcars[,1:7],ncol=8,nrow=4)
[Figure: bubble plot of mpg (y axis, 10 to 30) against disp (x axis, 100 to 400) for mtcars, drawn once with "+" symbols and once with "0" symbols. Legends: Cylinders (Four, Six, Eight) and a second legend listing One, Two, Three, Four, Six, Eight. Annotation: bubbles represent horsepower.]
Bubble Plot (view multivariate data) (search tag: bubbleplot)
plot(y~x1)
symbols(y~x1,circles=x2)
EXAMPLE
plot(mpg ~ disp, data = mtcars, pch ="+",col=carb)
par(new=T)
plot(mpg ~ disp, data = mtcars, pch = "0",col=cyl)
with(mtcars,symbols(disp, mpg, circles = hp,add = TRUE))
legend(280,34,c("Four","Six","Eight"),fill=c("blue","violet","gray"),title="Cylinders")
legend(385,34,c("One","Two","Three","Four","Six","Eight"),
fill=palette()[as.numeric(levels(as.factor(mtcars$carb)))],title="Carburetors")
textClick(expression("Bubbles represent\nHorsepower"),"black",1)
shapeClick("box",3)
#data represents 5 different variables
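One thing to watch with symbols(): circles= sets the radius, so a variable passed directly makes the bubble area grow with its square. A minimal sketch, assuming the intent is for area to track horsepower, takes the square root first and uses inches= to cap the largest bubble. The names radius and cyl_col are made up for the illustration.

```r
# Radius chosen so that circle AREA is proportional to horsepower
radius <- sqrt(mtcars$hp / pi)

# One fill color per car, looked up by cylinder count
cyl_col <- c("4" = "blue", "6" = "violet", "8" = "gray")[as.character(mtcars$cyl)]

# inches = 0.25 scales the largest circle to a quarter-inch radius
symbols(mtcars$disp, mtcars$mpg, circles = radius, inches = 0.25,
        bg = cyl_col, fg = "black", xlab = "disp", ylab = "mpg")
legend("topright", legend = c("Four", "Six", "Eight"),
       fill = c("blue", "violet", "gray"), title = "Cylinders")
```

This version is non-interactive, so it runs in a script without the textClick() and shapeClick() helpers.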
3-D Scatterplot (view multivariate data) [see also spinnable 3-d scatterplot]
library(scatterplot3d)
scatterplot3d(x, y=NULL, z=NULL, color=par("col"), pch=NULL, main=NULL, sub=NULL, xlim=NULL, ylim=NULL, zlim=NULL, xlab=NULL, ylab=NULL, zlab=NULL, scale.y=1, angle=40,axis=TRUE, tick.marks=TRUE, label.tick.marks=TRUE, x.ticklabs=NULL, y.ticklabs=NULL, z.ticklabs=NULL, y.margin.add=0, grid=TRUE, box=TRUE, lab=par("lab"), lab.z=mean(lab[1:2]), type="p", highlight.3d=FALSE, mar=c(5,3,4,3)+0.1, col.axis=par("col.axis"), col.grid="grey", col.lab=par("col.lab"), cex.symbols=par("cex"), cex.axis=0.8 * par("cex.axis"), cex.lab=par("cex.lab"), font.axis=par("font.axis"),font.lab=par("font.lab"), lty.axis=par("lty"), lty.grid=par("lty"), lty.hide=NULL, lty.hplot=par("lty"), log="")
Arguments:
x the coordinates of points in the plot.
y the y coordinates of points in the plot, optional if x is an appropriate structure.
z the z coordinates of points in the plot, optional if x is an appropriate structure.
color colors of points in the plot, optional if x is an appropriate structure. Will be ignored if highlight.3d = TRUE.
pch plotting "character", i.e. symbol to use.
main an overall title for the plot.
sub sub-title.
xlim, ylim, zlim the x, y and z limits (min, max) of the plot. Note that setting enlarged limits may not work as exactly as expected (a known but unfixed bug).
xlab, ylab, zlab titles for the x, y and z axis.
scale.y scale of y axis related to x- and z axis.
angle angle between x and y axis (Attention: result depends on scaling).
axis a logical value indicating whether axes should be drawn on the plot.
tick.marks a logical value indicating whether tick marks should be drawn on the plot (only if axis = TRUE).
label.tick.marks a logical value indicating whether tick marks should be labeled on the plot (only if axis = TRUE and tick.marks = TRUE).
x.ticklabs, y.ticklabs, z.ticklabs vector of tick mark labels.
y.margin.add add additional space between tick mark labels and axis label of the y axis
grid a logical value indicating whether a grid should be drawn on the plot.
box a logical value indicating whether a box should be drawn around the plot.
lab a numerical vector of the form c(x, y, len). The values of x and y give the (approximate) number of tickmarks on the x and y axes.
lab.z the same as lab, but for z axis.
type character indicating the type of plot: "p" for points, "l" for lines, "h" for vertical lines to x-y-plane, etc.
highlight.3d points will be drawn in different colors related to y coordinates (only if type = "p" or type = "h", else color will be used). On some devices not all colors can be displayed. In this case try the postscript device or use highlight.3d = FALSE.
mar A numerical vector of the form c(bottom, left, top, right) which gives the lines of margin to be specified on the four sides of the plot.
col.axis, col.grid, col.lab the color to be used for axis / grid / axis labels.
cex.symbols, cex.axis, cex.lab the magnification to be used for point symbols, axis annotation, labels relative to the current.
font.axis, font.lab the font to be used for axis annotation / labels.
lty.axis, lty.grid the line type to be used for axis / grid.
lty.hide line style used to plot ‘non-visible’ edges (defaults of the lty.axis style)
lty.hplot the line type to be used for vertical segments with type = "h".
log Not yet implemented! A character string which contains "x" (if the x axis is to be logarithmic), "y", "z", "xy", "xz", "yz", "xyz".
EXAMPLE
library(scatterplot3d)
multiG(16,8,2,1)
with(mtcars,scatterplot3d(mpg,disp,hp,color=cyl,pch=19,main="HP, MPG, & DISP BY CYL"))
par(mar = rep(0, 4),xpd=NA)
legend(locator(1),legend=c("4 cyl","6 cyl", "8 cyl"),fill=c(4,6,8), title="Cylinders")
with(mtcars,scatterplot3d(mpg,disp,hp,color=cyl,pch=gear,main="HP, MPG, & DISP BY CYL & GEAR"))
par(mar = rep(0, 4),xpd=NA)
legend(locator(1),legend=c("3","4", "5"),pch=c(3,4,5), title="Gear Number")
#by changing pch and color we're viewing 5 variables simultaneously
Spinnable 3-d Scatterplot (search tag: rotate)
Method 1:
plot3d(x, y, z, xlab, ylab, zlab, type = "p", col, size, lwd, radius)
Method 2:
scatter3d(x, y, z, xlab=deparse(substitute(x)), ylab=deparse(substitute(y)), zlab=deparse(substitute(z)), axis.scales=TRUE, revolutions=0, bg.col=c("white", "black"), axis.col=if (bg.col == "white") c("darkmagenta", "black", "darkcyan") else c("darkmagenta", "white", "darkcyan"), surface.col=c("blue", "green", "orange", "magenta", "cyan", "red", "yellow", "gray"), surface.alpha=0.5, neg.res.col="red", pos.res.col="green", square.col=if (bg.col == "white") "black" else "gray", point.col="yellow", text.col=axis.col, grid.col=if (bg.col == "white") "black" else "gray", fogtype=c("exp2", "linear", "exp", "none"), residuals=(length(fit) == 1), surface=TRUE, fill=TRUE, grid=TRUE, grid.lines=26, df.smooth=NULL, df.additive=NULL, sphere.size=1, threshold=0.01, speed=1, fov=60, fit="linear", groups=NULL, parallel=TRUE, ellipsoid=FALSE, level=0.5, ellipsoid.alpha=0.1, id.method=c("mahal", "xz", "y", "xyz", "identify", "none"), id.n=if (id.method == "identify") Inf else 0, labels=as.character(seq(along=x)), offset = ((100/length(x))^(1/3)) * 0.02, model.summary=FALSE)
EXAMPLES library(rgl)
with(mtcars, plot3d(wt, disp, mpg, col=cyl, size=6))
library(Rcmdr)
with(mtcars, scatter3d(wt, disp, mpg, col=cyl))
Staircase plot (show an increase or a decrease over time) library(plotrix) See examples for best understanding:
EXAMPLE
sample_size<-c(500,-72,428,-94,334,-45,-89,200)
totals<-c(TRUE,FALSE,TRUE,FALSE,TRUE,FALSE,FALSE,TRUE)
labels<-c("Contact list","Uncontactable","","Declined","","Ineligible",
"Died","Final sample")
#==========================================================================
staircase.plot(sample_size,totals,labels,main="Acquisition of the sample",
total.col="gray",inc.col=2:5,bg.col="#eeeebb",direction="s")
staircase.plot(sample_size,totals,labels,main="Acquisition of the sample",
total.col="yellow",inc.col=2:5,bg.col="#eeeebb",direction="e")
#==========================================================================
sample_size<-c(200,+72,272,+94,366,+45,411)
totals2<-c(TRUE,FALSE,TRUE,FALSE,TRUE,FALSE,TRUE)
labels2<-c("Beginning Level","Humor","","Water","","Candy",
"Final Level")
#==========================================================================
staircase.plot(sample_size,totals2,labels2,main="Energy Level",
total.col="gray",inc.col=2:5,bg.col="#eeeebb",direction="s")
Pyramid plot (comparing nested groups) library(plotrix) See examples for best understanding:
EXAMPLES
x11(15,8)
par(mfrow=c(1,2))
xy.pop<-c(3.2,3.5,3.6,3.6,3.5,3.5,3.9,3.7,3.9,3.5,3.2,2.8,2.2,1.8,
1.5,1.3,0.7,0.4)
xx.pop<-c(3.2,3.4,3.5,3.5,3.5,3.7,4,3.8,3.9,3.6,3.2,2.5,2,1.7,1.5,
1.3,1,0.8)
agelabels<-c("0-4","5-9","10-14","15-19","20-24","25-29","30-34",
"35-39","40-44","45-49","50-54","55-59","60-64","65-69","70-74",
"75-79","80-84","85+")
mcol<-color.gradient(c(0,0,0.5,1),c(0,0,0.5,1),c(1,1,0.5,1),18)
fcol<-color.gradient(c(1,1,0.5,1),c(0.5,0.5,0.5,1),c(0.5,0.5,0.5,1),18)
#==========================================================================
par(mar=pyramid.plot(xy.pop,xx.pop,labels=agelabels,
main="Australian population pyramid 2002",lxcol=mcol,rxcol=fcol,
gap=0.5,show.values=TRUE))
#==========================================================================
# three column matrices
avtemp<-c(seq(11,2,by=-1),rep(2:6,each=2),seq(11,2,by=-1))
malecook<-matrix(avtemp+sample(-2:2,30,TRUE),ncol=3)
femalecook<-matrix(avtemp+sample(-2:2,30,TRUE),ncol=3)
# group by age
agegrps<-c("0-10","11-20","21-30","31-40","41-50","51-60",
"61-70","71-80","81-90","91+")
#==========================================================================
oldmar<-pyramid.plot(malecook,femalecook,labels=agegrps,
unit="Bowls per month",lxcol=c("#ff0000","#eeee88","#0000ff"),
rxcol=c("#ff0000","#eeee88","#0000ff"),laxlab=c(0,10,20,30),
raxlab=c(0,10,20,30),top.labels=c("Males","Age","Females"),gap=3)
# put a box around it
box()
# give it a title
mtext("Porridge temperature by age and sex of cook",3,2,cex=1.5)
# stick in a legend
legend(par("usr")[1],11,c("Too hot","Just right","Too cold"),
fill=c("#ff0000","#eeee88","#0000ff"))
Engelmann-Hecker-Plot
Compare spread of grouped data
library(plotrix)
ehplot(data, groups, intervals=50, offset=0.1, log=FALSE, median=TRUE, box=FALSE, boxborder="grey50", xlab="groups", ylab="values", col="black", ...)
Arguments:
Created using the panes() function
data(iris)
ehplot(iris$Sepal.Length, iris$Species,
intervals=20,offset=0.1, cex=1.5, pch=20)
tab.title("ehplot 1",tab.col=1)
ehplot(iris$Sepal.Width, iris$Species,
intervals=20, offset=0.1,box=TRUE, median=FALSE)
tab.title("ehplot 2",tab.col=4)
ehplot(iris$Petal.Length, iris$Species,
pch=17,offset=0.1, col="red", log=TRUE)
tab.title("ehplot 3",tab.col=3)
ehplot(iris$Petal.Length, iris$Species,
offset=0.1, pch=as.numeric(iris$Species))
tab.title("ehplot 4",tab.col=2)
Example
data(iris);library(plotrix)
ehplot(iris$Sepal.Length, iris$Species,
intervals=20, cex=1.8, pch=20)
ehplot(iris$Sepal.Width, iris$Species,
intervals=20, box=TRUE, median=FALSE)
ehplot(iris$Petal.Length, iris$Species,
pch=17, col="red", log=TRUE)
ehplot(iris$Petal.Length, iris$Species,
offset=0.06,
pch=as.numeric(iris$Species))
# Groups don't have to be presorted:
rnd <- sample(150)
plen <- iris$Petal.Length[rnd]
pwid <- abs(rnorm(150, 1.2))
spec <- iris$Species[rnd]
ehplot(plen, spec, pch=19, cex=pwid,
col=rainbow(3,
alpha=0.6)[as.numeric(spec)])
Hexbin Plot (visualize closely clustered, high-density data; see also sunflowerplot)
library(hexbin)
plot(hexbin(x, y, xbins = 30, shape = 1,xbnds = range(x), ybnds = range(y),xlab = NULL, ylab = NULL))
Bump chart (looking at how ranks have changed from time 1 to time 2)
library(plotrix)
Arguments:
EXAMPLE
#======================================================================
# percentage of those over 25 years having completed high school
# in 10 cities in the USA in 1990 and 2000
educattn<-matrix(c(90.4,90.3,75.7,78.9,66,71.8,70.5,70.4,68.4,67.9,
67.2,76.1,68.1,74.7,68.5,72.4,64.3,71.2,73.1,77.8),ncol=2,byrow=TRUE)
rownames(educattn)<-c("Anchorage AK","Boston MA","Chicago IL",
"Houston TX","Los Angeles CA","Louisville KY","New Orleans LA",
"New York NY","Philadelphia PA","Washington DC")
colnames(educattn)<-c(1990,2000)
#......................................................................
bumpchart(educattn,main="Rank for high school completion by over 25s")
#======================================================================
# now show the raw percentages and add central ticks
#======================================================================
bumpchart(educattn,rank=FALSE,
main="Percentage high school completion by over 25s",col=rainbow(10))
# margins have been reset, so use
par(xpd=TRUE)
boxed.labels(1.5,seq(65,90,by=5),seq(65,90,by=5))
par(xpd=FALSE)
EXAMPLE
library(plyr) #contains a large data set
begend(baseball) #look at beginning and end of data set
library(hexbin)
with(baseball,plot(hexbin(r,ab)))
bumpchart(y,top.labels=colnames(y),labels=rownames(y),rank=TRUE,mar=c(2,8,5,8),pch=19,col=par("fg"),lty=1,lwd=1)
Heatmap with numbers
x <- "http://datasets.flowingdata.com/ppg2008.csv"
nba <- read.csv(x)
dst <- dist(nba[1:20, -1])
dst <- data.matrix(dst)
dim <- ncol(dst)
sdim <- seq_len(dim)
image(sdim, sdim, dst, axes = FALSE)
axis(1, sdim, nba[1:20,1], cex.axis = 0.5)
axis(2, sdim, nba[1:20,1], cex.axis = 0.5)
lapply(sdim, function(i){
lapply(sdim, function(j){
txt <- sprintf("%0.1f", dst[i,j])
text(i, j, txt, cex=0.5)
})
})
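The example above needs the CSV to be reachable online. Here is a self-contained sketch of the same technique on mtcars: a distance matrix drawn with image(), with each cell's value written on top. It replaces the nested lapply() with a single vectorized text() call; the names cars20, idx and op are made up for the illustration, and scaling the columns first is my assumption, not part of the original.

```r
# Distances between the first 20 cars, on standardized variables
cars20 <- mtcars[1:20, ]
dst <- as.matrix(dist(scale(cars20)))
idx <- seq_len(ncol(dst))

op <- par(mar = c(7, 7, 2, 1))             # room for rotated car names
image(idx, idx, dst, axes = FALSE, xlab = "", ylab = "")
axis(1, idx, rownames(cars20), las = 2, cex.axis = 0.5)
axis(2, idx, rownames(cars20), las = 2, cex.axis = 0.5)

# row() and col() give the cell indices; cell [i, j] is drawn at x = i, y = j
text(as.vector(row(dst)), as.vector(col(dst)),
     sprintf("%0.1f", dst), cex = 0.5)
par(op)                                     # restore the previous margins
```

Because the matrix, its row indices and its column indices are all flattened in the same column-major order, each label lands on its own cell.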
Scatterplot (search tag: Ssymbols)
plot(Set$Attitude ~ Set$Grade, col="pink")
To change the point plot symbols use the pch= argument. Note: Set$Attitude and Set$Grade are merely variable names attached to the data set (data frame).
On Windows, R supports Unicode plotting symbols when the code point is given with a negative sign:
plot(1, 1, pch = -0x2665L, cex = 10, xlab = "", ylab = "", col = "firebrick3")
points(.8, .8, pch = -0x2642L, cex = 10, col = "firebrick3")
points(1.2, 1.2, pch = -0x2640L, cex = 10, col = "firebrick3")
Plot group by color
Argument to plot: See example below
Identify Plot Points (plot point labels and identification) (search tags: slocate, sidentify)
Locate coordinates of a specific point
locator(1)
Note that locator(n) can be used to locate several specific points and compile them into a list. This is useful for locating extreme or odd values that lie outside the overall scatter or group scatter, as in the code in the example box below.
Locate the names of specific points
identify(x-vector,y-vector,vector of labels,…)
Locate names of all points
text(x-vector,y-vector,vector of labels,…)
See example below for details
EXAMPLE
#Plot data by cylinder groups w/ legend
with(mtcars,plot(drat,hp,col=c("blue","green","red")[as.numeric(as.factor(cyl))]))
legend(locator(1),c("4 cyl","6 cyl","8 cyl"),fill=c("blue","green","red"))
locator(1) #locate a specific point on the plot (x and y coordinate)
#Example of using locate to create a data frame of coordinates for extreme values
outlier<-data.frame(locator(4)[1:2])
outlier
#locate certain points by name (adj: -1=right justify, 1=left, .5=centered)
with(mtcars,identify(drat,hp,labels=c(rownames(mtcars)),adj=1))
#label the points on the plot
with(mtcars,text(drat,hp,labels=c(rownames(mtcars)),cex=.5,adj=c(0,-1)))
#Note on the plot call above: as.factor() is applied to cyl first because cyl is stored as a numeric variable
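locator() and identify() need an interactive graphics device, so they do nothing useful in a script or report. A non-interactive sketch of the same idea labels only the extreme points, here assumed (for illustration) to be cars above the 90th percentile of horsepower. The names extreme and ext are made up.

```r
# Flag cars whose horsepower exceeds the 90th percentile
extreme <- mtcars$hp > quantile(mtcars$hp, 0.9)
ext <- mtcars[extreme, ]

with(mtcars, plot(drat, hp,
                  col = c("blue", "green", "red")[as.numeric(as.factor(cyl))]))
legend("topright", c("4 cyl", "6 cyl", "8 cyl"), fill = c("blue", "green", "red"))

# pos = 2 writes each label to the left of its point
text(ext$drat, ext$hp, labels = rownames(ext), pos = 2, cex = 0.7)
```

Any logical rule can replace the quantile cutoff, for example a residual threshold from a fitted model.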
Interactive coloring of points (click and recolor)
x <- 1:5
plot(x, x, col=ifelse(x==3, "red", "black"), pch=19)
plot(x, x, col=ifelse(x==3, "red", "black"),
pch=ifelse(x==3, 19, 2), cex=ifelse(x==3, 2, 1))
with(mtcars, plot(hp, disp, pch=19,
col=c(ifelse(mpg>25, 'red', 'green'))))
#===========================================
n <- 15
x <- rnorm(n)
y <- rnorm(n)
# Plot the data
plot(x,y, pch = 19, cex = 2)
# This lets you click on the points you want to change
# the color of. Right click and select "stop" when
# you have clicked all the points you want
pnt <- identify(x, y, plot = F)
# This colors those points red
points(x[pnt], y[pnt], col = "red", pch = 19, cex = 2)
points(x[pnt], y[pnt], col = "green", pch = 17, cex = 1)
Flip the x and y axis
library(lattice)
# First make some example data
df <- data.frame(name=rep(c("a", "b", "c"), each=5), value=rnorm(15))
# Then try plotting it in both 'orientations'
# ... as a dotplot
xyplot(value~name, data=df)
xyplot(name~value, data=df)
# ... or perhaps as a 'box-and-whisker' plot
bwplot(value~name, data=df)
bwplot(name~value, data=df)
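Base graphics can do the same flip without lattice, though through an argument rather than by reversing the formula. A minimal sketch on the same kind of data (the names df_demo and bp are made up for the illustration):

```r
set.seed(1)
df_demo <- data.frame(name = rep(c("a", "b", "c"), each = 5), value = rnorm(15))

# horizontal = TRUE lays the boxes on their side (groups on the y axis)
bp <- boxplot(value ~ name, data = df_demo, horizontal = TRUE)

# stripchart() is horizontal by default; vertical = TRUE stands it upright
stripchart(value ~ name, data = df_demo, vertical = TRUE, pch = 19)
```

In both functions the formula stays value ~ name; only the orientation argument changes.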
Blending Plots in R (search tags: sblending, transparency)
set.seed(42)
p1 <- hist(rnorm(500,4)) # centered at 4
p2 <- hist(rnorm(500,6)) # centered at 6
plot( p1, col=rgb(0,0,1,1/4), xlim=c(0,10)) # first histogram
plot( p2, col=rgb(1,0,0,1/4), xlim=c(0,10), add=T) # second hist
#or
a=rnorm(1000, 3, 1)
b=rnorm(1000, 6, 1)
hist(a, xlim=c(0,10), col="red")
hist(b, add=T, col=rgb(0, 1, 0, 0.5))
library(ggplot2)
path = "http://www-stat.stanford.edu/~tibs/ElemStatLearn/datasets/SAheart.data"
saheart = read.table(path, sep=",",head=T,row.names=1)
fmla = "chd ~ sbp + tobacco + ldl + adiposity + famhist + typea + obesity"
model = glm(fmla, data=saheart, family=binomial(link="logit"),
na.action=na.exclude)
dframe = data.frame(chd=as.factor(saheart$chd),
prediction=predict(model, type="response"))
ggplot(dframe, aes(x=prediction, colour=chd)) + geom_density()
ggplot(dframe, aes(x=prediction, fill=chd)) +
geom_histogram(position="identity", binwidth=0.05, alpha=0.5)
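For the base-graphics histograms at the top of this section, adjustcolor() is an alternative to spelling out rgb() values: it takes any named color and adds transparency. The sketch below also passes the same breaks to both histograms so the bars line up; the names brks, col_a and col_b are made up for the illustration.

```r
set.seed(42)
a <- rnorm(500, 4)   # centered at 4
b <- rnorm(500, 6)   # centered at 6

# Shared breaks covering both samples, so the bars of the two plots align
brks <- seq(floor(min(a, b)), ceiling(max(a, b)), by = 0.5)

# alpha.f = 0.25 makes each fill 25% opaque
col_a <- adjustcolor("blue", alpha.f = 0.25)
col_b <- adjustcolor("red",  alpha.f = 0.25)

hist(a, breaks = brks, col = col_a, xlim = range(brks),
     main = "Two blended histograms", xlab = "value")
hist(b, breaks = brks, col = col_b, add = TRUE)
```

Where the two histograms overlap the fills mix, which is the blending effect this section is after.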
Add text to the graph margins
See Example Text-Rect-LineSeg (search this tag)
mtext(text, side = 3, line = 0, outer = FALSE, at = NA, adj = NA, padj = NA, cex = NA, col = NA, font = NA, ...)
Arguments (mtext):
text: a character or expression vector specifying the text to be written. Other objects are coerced by ‘as.graphicsAnnot’.
side: on which side of the plot (1=bottom, 2=left, 3=top, 4=right).
line: on which MARgin line, starting at 0 counting outwards.
outer: use outer margins if available.
at: give location of each string in user coordinates. If the component of ‘at’ corresponding to a particular text item is not a finite value (the default), the location will be determined by ‘adj’.
adj: adjustment for each string in reading direction. For strings parallel to the axes, ‘adj = 0’ means left or bottom alignment, and ‘adj = 1’ means right or top alignment. If ‘adj’ is not a finite value (the default), the value of ‘par("las")’ determines the adjustment. For strings plotted parallel to the axis the default is to centre the string.
padj: adjustment for each string perpendicular to the reading direction (which is controlled by ‘adj’). For strings parallel to the axes, ‘padj = 0’ means right or top alignment, and ‘padj = 1’ means left or bottom alignment. If ‘padj’ is not a finite value (the default), the value of ‘par("las")’ determines the adjustment. For strings plotted perpendicular to the axis the default is to centre the string.
cex: character expansion factor. ‘NULL’ and ‘NA’ are equivalent to ‘1.0’. This is an absolute measure, not scaled by ‘par("cex")’ or by setting ‘par("mfrow")’ or ‘par("mfcol")’. Can be a vector.
col: color to use. Can be a vector. ‘NA’ values (the default) mean use ‘par("col")’.
font: font for text. Can be a vector. ‘NA’ values (the default) mean use ‘par("font")’.
Plot Curved Text (arc text)
library(plotrix)
arctext(x,center=c(0,0),radius=1,start=NA,middle=pi/2,stretch=1,cex=1,...)
Arguments:
EXAMPLES
library(plotrix)
plot(0,xlim=c(1,5),ylim=c(1,5),main="Test of arctext",xlab="",ylab="",
type="n")
arctext("bendy like spaghetti",center=c(3,3),col="blue")
arctext("bendy like spaghetti",center=c(3,3),radius=1.5,start=pi,cex=2)
arctext("bendy like spaghetti",center=c(3,3),radius=0.5,
start=pi/2,stretch=1.2)
See also: Eliminate margins for clickText()
Add a Table to a Graph (search tags: add table; table plot)
library(plotrix) [see also scripts for a click function]
addtable2plot(x,y=NULL,table,lwd=par("lwd"),bty="n",bg=par("bg"),cex=1,xjust=0,yjust=1,box.col=par("fg"),text.col=par("fg"),display.colnames=TRUE,display.rownames=FALSE,hlines=FALSE,vlines=FALSE, title=NULL)
Arguments:
Plot Lines (vertical, horizontal or sloped) [see line types]
abline(various arguments)
Plot Lowess Line
Example:
with(mtcars,plot(mpg,disp,pch=cyl,col=cyl+2))
with(mtcars,lines(lowess(cbind(mpg,disp)),lwd=2,col="blue"))
Plot a circle
library(plotrix) [see also scripts for a shapeClick function]
draw.circle(x,y,radius,nv=100,border=NULL,col=NA,lty=1,lwd=1)
Arguments:
EXAMPLE
with(mtcars,plot(drat,hp))
abline(h=204) #plot horizontal
abline(v=4) #plot vertical
abline(a=210,b=20) #plot sloped (a=y intercept; and b=slope)
Plot a circle inside a square (search tags: circle, square)
library(plotrix); library(grid)
require(plotrix)
require(grid)
plot(c(-1, 1), c(-1,1), type = "n", asp=1)
rect( -.5, -.5, .5, .5)
draw.circle( 0, 0, .5 )
#note asp must be specified
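The same figure can be sketched in base R alone by tracing the circle as a closed path. This is an alternative to draw.circle(), not how plotrix implements it; the names theta, circ_x and circ_y are made up for the illustration.

```r
# asp = 1 keeps one x unit the same length as one y unit, so the circle is round
plot(c(-1, 1), c(-1, 1), type = "n", asp = 1, xlab = "", ylab = "")
rect(-0.5, -0.5, 0.5, 0.5)

# Points on a circle of radius 0.5 centered at the origin
theta  <- seq(0, 2 * pi, length.out = 200)
circ_x <- 0.5 * cos(theta)
circ_y <- 0.5 * sin(theta)
lines(circ_x, circ_y)
```

Without asp = 1 the axes are stretched to fill the device, and the circle comes out as an ellipse that no longer touches all four sides of the square.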
Add Math symbols and Expressions to Plot
expression() #wrapped in title, text, mtext etc.
List of Math Symbols to Add
Syntax                Meaning
x + y                 x plus y
x - y                 x minus y
x*y                   juxtapose x and y
x/y                   x forwardslash y
x %+-% y              x plus or minus y
x %/% y               x divided by y
x %*% y               x times y
x %.% y               x cdot y
x[i]                  x subscript i
x^2                   x superscript 2
paste(x, y, z)        juxtapose x, y, and z
sqrt(x)               square root of x
sqrt(x, y)            yth root of x
x == y                x equals y
x != y                x is not equal to y
x < y                 x is less than y
x <= y                x is less than or equal to y
x > y                 x is greater than y
x >= y                x is greater than or equal to y
x %~~% y              x is approximately equal to y
x %=~% y              x and y are congruent
x %==% y              x is defined as y
x %prop% y            x is proportional to y
plain(x)              draw x in normal font
bold(x)               draw x in bold font
italic(x)             draw x in italic font
bolditalic(x)         draw x in bolditalic font
symbol(x)             draw x in symbol font
list(x, y, z)         comma-separated list
...                   ellipsis (height varies)
cdots                 ellipsis (vertically centred)
ldots                 ellipsis (at baseline)
x %subset% y          x is a proper subset of y
x %subseteq% y        x is a subset of y
x %notsubset% y       x is not a subset of y
x %supset% y          x is a proper superset of y
x %supseteq% y        x is a superset of y
x %in% y              x is an element of y
x %notin% y           x is not an element of y
hat(x)                x with a circumflex
tilde(x)              x with a tilde
dot(x)                x with a dot
ring(x)               x with a ring
bar(xy)               xy with bar
widehat(xy)           xy with a wide circumflex
widetilde(xy)         xy with a wide tilde
x %<->% y             x double-arrow y
x %->% y              x right-arrow y
x %<-% y              x left-arrow y
x %up% y              x up-arrow y
x %down% y            x down-arrow y
x %<=>% y             x is equivalent to y
x %=>% y              x implies y
x %<=% y              y implies x
x %dblup% y           x double-up-arrow y
x %dbldown% y         x double-down-arrow y
alpha – omega         Greek symbols
Alpha – Omega         uppercase Greek symbols
EXAMPLE
frame()
title(expression( "graph of the function f"(x) == sqrt(1+x^2)))
text(locator(1),expression(sum(x)/sqrt(n*S^2)))
text(locator(1),expression(hat(beta)==-.567))
text(locator(1),expression(hat(Omega)==infinity*frac(x, y)))
mtext(expression(Area == pi*r^2),side=2,line=-12)
text(locator(1), expression(bar(x) == sum(frac(x[i], n), i==1, n)))
mtext(expression(2.3 %+-% 4.5*pi),side=1,line=-5,adj=.7)
mtext(expression(bar(xy)!=sum(x[i], i==1, n) ),side=1,line=-5,adj=0)
textClick(expression(sum(sum((X[ij]-bar(X))^2))))
textClick(expression(sum(x[i], i==1, n)),"green",3)
Summation
Syntax Meaning
theta1, phi1, sigma1, omega1 cursive Greek symbols
Upsilon1 capital upsilon with hook
aleph first letter of Hebrew alphabet
infinity infinity symbol
partialdiff partial differential symbol
nabla nabla, gradient symbol
32*degree 32 degrees
60*minute 60 minutes of angle
30*second 30 seconds of angle
displaystyle(x) draw x in normal size (extra spacing)
textstyle(x) draw x in normal size
scriptstyle(x) draw x in small size
scriptscriptstyle(x) draw x in very small size
underline(x) draw x underlined
x ~~ y put extra space between x and y
x + phantom(0) + y leave gap for "0", but don't draw it
x + over(1, phantom(0)) leave vertical gap for "0" (don't draw)
frac(x, y) x over y
over(x, y) x over y
atop(x, y) x over y (no horizontal bar)
sum(x[i], i==1, n) sum x[i] for i equals 1 to n
prod(plain(P)(X==x), x) product of P(X=x) for all values of x
integral(f(x)*dx, a, b) definite integral of f(x) wrt x
union(A[i], i==1, n) union of A[i] for i equals 1 to n
intersect(A[i], i==1, n) intersection of A[i]
lim(f(x), x %->% 0) limit of f(x) as x tends to 0
min(g(x), x > 0) minimum of g(x) for x greater than 0
inf(S) infimum of S
sup(S) supremum of S
x^y + z normal operator precedence
x^(y + z) visible grouping of operands
x^{y + z} invisible grouping of operands
group("(",list(a, b),"]") specify left and right delimiters
bgroup("(",atop(x,y),")") use scalable delimiters
group(lceil, x, rceil) special delimiters
Expressions in Titles (method 1)
a<-5
b<-1
plot(1:10, main=bquote(p==.(a) *"," ~q==.(b)))
Expressions in Titles (method 2)
a<-5
b<-1
plot(1:10, main = substitute(paste(p == a, ", ", q == b), list(a = a, b = b)))
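Both methods build an unevaluated call with the current values of a and b spliced in. A small sketch, using the same a and b, that inspects the two calls without plotting:

```r
a <- 5
b <- 1

# bquote() replaces anything wrapped in .() with its current value
e1 <- bquote(p == .(a) * "," ~ q == .(b))

# substitute() swaps the symbols named in the list for their values
e2 <- substitute(paste(p == a, ", ", q == b), list(a = a, b = b))

print(e1)
print(e2)
```

Either object can be passed to main=, xlab=, text(), and so on.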
Control Margins & Eliminate Margins (use locator with text to put text outside plot)
par(mar = c(5, 4, 4, 2) + 0.1) #standard (default) margins
par(mar = rep(0, 4)) #no margin
Function for adding text or expression anywhere with locator: clickText
source("C:/Users/Rinker/Desktop/PhD Program/CEP 523-Stat Meth Ed Inference/R Stuff/Scripts/Graphics/Click text")
clickText(expression, col, font size) #expression must be in quotes if it is text, or wrapped in expression()
Text Outside Margins (clickText() utilizes this)
par(xpd=NA)
#Margins Control Examples
#========================================================
#Standard Margins
x11()
frame()
par(oma = c(0,0,0,0))
grid()
#========================================================
#Attempt to add text to outer margins using locator (Wrong way)
x11()
frame()
par(oma = c(0,0,0,0))
with(mtcars,plot(mpg~wt))
text(locator(1),expression(beta==3)) #The mistake: tried to add text w/o changing margins first
par(mar = rep(0, 4))
text(locator(1),expression(beta==3))
#========================================================
#Add text using locator to outer margins (Correct way)
x11()
frame()
par(oma = c(0,0,0,0))
with(mtcars,plot(mpg~wt)) #Correct way: 1)plot; 2)change margins; 3)add text
par(mar = rep(0, 4))
text(locator(1),expression(beta==3))
#========================================================
#Plot with no margins
x11()
frame()
par(mar = rep(0, 4))
with(mtcars,plot(mpg~wt))
text(locator(1),expression(beta==3))
see also clickText() below
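A non-interactive sketch of placing text outside the plot region. Note that it uses a different technique from the examples above: it leaves the margins alone and turns clipping off with par(xpd = NA). The label position and the null PDF device are assumptions for illustration.

```r
pdf(NULL)                                  # null device: nothing is written to disk
with(mtcars, plot(mpg ~ wt))               # 1) plot with the usual margins
usr <- par("usr")                          # plot-region limits: x1, x2, y1, y2
op  <- par(xpd = NA)                       # 2) allow drawing outside the plot region

# 3) place the label just above the top edge of the plot, i.e. in the margin
text(mean(usr[1:2]), usr[4] + 0.05 * diff(usr[3:4]), expression(beta == 3))

par(op)                                    # restore the previous clipping setting
xpd_after <- par("xpd")
dev.off()
```

Saving the value returned by par() and passing it back afterwards keeps later plots from inheriting the changed setting.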
Normal curve with upper/lower shaded (uses the polygon function)
Example
xv<-seq(-3,3,.01)
yv<-dnorm(xv)
windows(h=5,w=11)
par(mfrow=c(1,2))
plot(xv,yv,type="l",main="2 Standard Deviation")
polygon(c(xv[xv<=-2],-2),c(yv[xv<=-2],yv[xv==-3]),col="blue",border="green")
polygon(c(xv[xv>=2],2),c(yv[xv>=2],yv[xv==3]),col="blue",border="green")
plot(xv,yv,type="l",main="1 Standard Deviation")
polygon(c(xv[xv<=-1],-1),c(yv[xv<=-1],yv[xv==-3]),col="red",border="orange")
polygon(c(xv[xv>=1],1),c(yv[xv>=1],yv[xv==3]),col="red",border="orange")
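The polygon() calls above trace the curve and then close the shape along the baseline. The sketch below wraps that idea in a small helper and also computes the shaded probability with pnorm(); the helper name upper_tail is hypothetical, not part of any package.

```r
xv <- seq(-3, 3, 0.01)
yv <- dnorm(xv)

# Outline of the upper tail beyond `cut`: start on the baseline,
# follow the curve, then drop back to the baseline at the right edge
upper_tail <- function(xv, yv, cut) {
  keep <- xv >= cut
  list(x = c(cut, xv[keep], max(xv)), y = c(0, yv[keep], 0))
}

tail2 <- upper_tail(xv, yv, 2)

pdf(NULL)                                  # null device so this runs unattended
plot(xv, yv, type = "l", main = "2 Standard Deviations")
polygon(tail2$x, tail2$y, col = "blue", border = "green")
dev.off()

# Probability mass in both tails beyond +/- 2 standard deviations
both_tails <- 2 * pnorm(-2)
both_tails
```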
Use the mouse to add line segments, rectangles, arrows and polygons
source("C:/Users/Rinker/Desktop/PhD Program/CEP 523-Stat Meth Ed Inference/R Stuff/Scripts/Graphics/Click Shapes.txt")
shapeClick(shape="arrow", corners=NULL,col=NULL, border = NULL, lty = par("lty"), lwd = par("lwd"),code=2)
Example code and use
source("C:/Users/Rinker/Desktop/PhD Program/CEP 523-Stat Meth Ed Inference/R Stuff/Scripts/Graphics/Click Shapes.txt")
with(mtcars,plot(mpg,disp))
shapeClick("seg",col="red")
shapeClick("box",col="yellow",border="red",lty=1)
shapeClick("arrow",col="orange",lwd=4,lty=55)
shapeClick("arrow",code=3,col="blue",lwd=2)
shapeClick("poly",3,col="yellow",border="red",lwd=2)
shapeClick("poly",5,border="green",lwd=2)
Add Rug to Graph (Shows the actual data points)
rug(variable) # like a line, this is done after the graph is plotted
x11()
with(mtcars,plot(mpg~disp))
rug(mtcars$disp);rug(mtcars$mpg,side=2)
Add line segments to the graph (see Example Text-Rect-LineSeg below)
segments(x0, y0, x1, y1,col = par("fg"), lty = par("lty"), lwd = par("lwd"), ...)
x0,y0,x1,y1 are coordinates of the start and end points
Add rectangles to the graph (see Example Text-Rect-LineSeg)
rect(xleft, ybottom, xright, ytop, angle = 45, col = NULL, border = NULL, lty = NULL, lwd = par("lwd"))
Add Arrows to Graph
arrows(x0,y0,x1,y1)
#=================================================================================================
# VARIOUS TEXT ARGUMENTS
#=================================================================================================
windows(h=6.5,w=10);par(mfrow=c(2,3))
plot(1:10, (-4:5)^2, main="Parabola Points", xlab="xlab")
text(3,23, "pos 1", pos=1);text(3,23, "pos 2", pos=2)
text(3,23, "pos 3", pos=3);text(3,23, "pos 4", pos=4)
plot(1:10, (-4:5)^2, main="Parabola Points", xlab="xlab")
text(locator(1), "yippee", pos=1)
plot(1:10, (-4:5)^2, main="Parabola Points", xlab="xlab")
mtext("line -2", line = -2);mtext("line 2", line = 2)
mtext("line 3", line = 3);mtext("line -6", line = -6)
plot(1:10, (-4:5)^2, main="Parabola Points", xlab="xlab")
mtext("adj 0", line = -2, adj =0);mtext("adj .5", line = -2, adj =.5)
mtext("adj 1", line = -2, adj =1)
plot(1:10, (-4:5)^2, main="Parabola Points", xlab="xlab")
mtext("side 1",side=1);mtext("side 2",side=2)
mtext("side 3",side=3);mtext("side 4",side=4)
plot(1:10, (-4:5)^2, main="Parabola Points", xlab="xlab")
mtext("side4;adj1",side=4,adj=1,col="blue")
mtext("side2;adj.5",side=2,adj=.5,col="red")
mtext("side4;adj0",side=4,adj=0,col="green")
mtext("side4;adj.5;padj1",side=4,adj=.5,padj=1,col="purple")
mtext("side4;adj.5;padj-2",side=4,adj=.5,padj=-2,col="orange")
#=================================================================================================
# VARIOUS RECTANGLE USES/ARGUMENTS
#=================================================================================================
windows(h=6,w=6);par(mfrow=c(1,1))
plot(1:10, (-4:5)^2, main="Parabola Points", xlab="xlab")
rect(6, 10, 8, 12, angle = 45,col = NULL, border = "red", lty = 1, lwd = 2)
text(6.16,11, "HELLO!", pos=4)
rect(4, 15, 8, 20, angle = 45,col = NULL, border = "blue", lty = 1, lwd = 4)
text(locator(1), "HELLO!", pos=4,cex=2)
rect(8, 0, 10, 3, angle = 45,col = "black", border = "blue", lty = 1, lwd = 4)
text(locator(1), "HELLO!", pos=4,cex=.8,col="white")
#=================================================================================================
# VARIOUS LINE SEGMENT USES/ARGUMENTS
#=================================================================================================
segments(4, 0, 10, 10,col = "orange", lty = 1, lwd = 1)
segments(2, 25, 8, 25,col = "blue", lty = 2, lwd = 1)
segments(2, 0, 2, 25,col = "yellow", lty = 1, lwd = 3)
#=================================================================================================
# USING TEXT TO CREATE A LABEL (In action)
#=================================================================================================
windows(h=6,w=6);par(mfrow=c(1,1))
x <- mtcars[order(mtcars$mpg),] # sort by mpg
x$cyl <- factor(x$cyl) # it must be a factor
x$color[x$cyl==4] <- "red"
x$color[x$cyl==6] <- "blue"
x$color[x$cyl==8] <- "darkgreen"
dotchart(x$mpg,labels=row.names(x),cex=.7,groups= x$cyl,
main="Gas Mileage for Car Models\ngrouped by cylinder",
xlab="Miles Per Gallon", gcolor="black", color=x$color)
mtext("Cars Grouped by Cylinder", side = 2, line = 2, cex = .7)
Example Text-Rect-LineSeg
Add a Legend to a Graph sLEGEND
legend(x, y = NULL, legend, fill = NULL, col = par("col"), border="black", lty, lwd, pch, angle = 45, density = NULL, bty = "o", bg = par("bg"), box.lwd = par("lwd"), box.lty = par("lty"), box.col = par("fg"), pt.bg = NA, cex = 1, pt.cex = cex, pt.lwd = lwd, xjust = 0, yjust = 1, x.intersp = 1, y.intersp = 1, adj = c(0, 0.5), text.width = NULL, text.col = par("col"), merge = do.lines && has.pch, trace = FALSE, plot = TRUE, ncol = 1, horiz = FALSE, title = NULL, inset = 0, xpd, title.col = text.col, title.adj = 0.5, seg.len = 2)
Description: This function can be used to add legends to plots. Note that a call to the function ‘locator(1)’ can be used in place of the ‘x’ and ‘y’ arguments.
Arguments:
x, y: the x and y co-ordinates to be used to position the legend. They can be specified by keyword or in any way which is accepted by ‘xy.coords’: See ‘Details’.
fill: if specified, this argument will cause boxes filled with the specified colors (or shaded in the specified colors) to appear beside the legend text.
col: the color of points or lines appearing in the legend.
border: the border color for the boxes (used only if ‘fill’ is specified).
lty, lwd: the line types and widths for lines appearing in the legend. One of these two _must_ be specified for line drawing.
pch: the plotting symbols appearing in the legend, either as vector of 1-character strings, or one (multi character) string. _Must_ be specified for symbol drawing.
angle: angle of shading lines.
density: the density of shading lines, if numeric and positive. If ‘NULL’ or negative or ‘NA’ color filling is assumed.
bty: the type of box to be drawn around the legend. The allowed values are ‘"o"’ (the default) and ‘"n"’.
bg: the background color for the legend box. (Note that this is only used if ‘bty != "n"’.)
box.lty, box.lwd, box.col: the line type, width and color for the legend box (if ‘bty = "o"’).
pt.bg: the background color for the ‘points’, corresponding to its argument ‘bg’.
cex: character expansion factor *relative* to current ‘par("cex")’. Used for text, and provides the default for ‘pt.cex’ and ‘title.cex’.
pt.cex: expansion factor(s) for the points.
pt.lwd: line width for the points, defaults to the one for lines, or if that is not set, to ‘par("lwd")’.
xjust: how the legend is to be justified relative to the legend x location. A value of 0 means left justified, 0.5 means centered and 1 means right justified.
yjust: the same as ‘xjust’ for the legend y location.
x.intersp: character interspacing factor for horizontal (x) spacing.
y.intersp: the same for vertical (y) line distances.
adj: numeric of length 1 or 2; the string adjustment for legend text. Useful for y-adjustment when ‘labels’ are plotmath expressions.
text.width: the width of the legend text in x (‘"user"’) coordinates. (Should be positive even for a reversed x axis.) Defaults to the proper value computed by ‘strwidth(legend)’.
text.col: the color used for the legend text.
merge: logical; if ‘TRUE’, merge points and lines but not filled boxes. Defaults to ‘TRUE’ if there are points and lines.
trace: logical; if ‘TRUE’, shows how ‘legend’ does all its magical computations.
plot: logical. If ‘FALSE’, nothing is plotted but the sizes are returned.
ncol: the number of columns in which to set the legend items (default is 1, a vertical legend).
horiz: logical; if ‘TRUE’, set the legend horizontally rather than vertically (specifying ‘horiz’ overrides the ‘ncol’ specification).
title: a character string or length-one expression giving a title to be placed at the top of the legend. Other objects will be coerced by ‘as.graphicsAnnot’.
inset: inset distance(s) from the margins as a fraction of the plot region when legend is placed by keyword.
xpd: if supplied, a value of the graphical parameter ‘xpd’ to be used while the legend is being drawn.
title.col: color for ‘title’.
title.adj: horizontal adjustment for ‘title’: see the help for ‘par("adj")’.
seg.len: the length of lines drawn to illustrate ‘lty’ and/or ‘lwd’ (in units of character widths).
EXAMPLE
legend(locator(1),c("Grazed","Ungrazed"),fill=c("blue","darkgreen"))
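A sketch of the two non-interactive ways to position a legend described in the arguments above: by keyword (with inset) and with plot = FALSE to get the computed size back without drawing. The empty plot, the title text and the null PDF device are assumptions for illustration.

```r
pdf(NULL)                                   # null device so this runs unattended
plot(1:10, type = "n")

# Position by keyword instead of locator(1); inset nudges it off the border
legend("topright", legend = c("Grazed", "Ungrazed"),
       fill = c("blue", "darkgreen"), inset = 0.02, title = "Treatment")

# plot = FALSE draws nothing and returns the legend's rectangle and text positions
sz <- legend("bottomleft", legend = c("Grazed", "Ungrazed"),
             lty = 1:2, col = c("blue", "darkgreen"), plot = FALSE)
dev.off()

sz$rect                                     # width, height and corner of the legend box
```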
Draw a cylinder
library(plotrix)
cylindrect(xleft,ybottom,xright,ytop,col,border=NA,gradient="x",nslices=50)
example(cylindrect)
Included in the shapeClick function in scripts
Insert a break in a scale (broken scale)
library(plotrix)
axis.break(axis=1,breakpos=NULL,pos=NA,bgcol="white",breakcol="black",style="slash",brw=0.02)
EXAMPLE
#===========================================
x11(12,8)
par(mfrow=c(1,2))
#===========================================
tN <- table(rpois(100, lambda=5)) #sample counts to plot (as in example(barplot))
barplot(tN, col=heat.colors(12), log = "y")
axis.break(axis=2,breakpos=4,style="zigzag")
axis.break(axis=2,breakpos=9,style="zigzag")
#===========================================
plot(mpg~cyl,mtcars)
axis.break(breakpos=4.5,axis=1)
Plot with a zoomed in plot side by side
library(plotrix)
zoomInPlot(x,y=NULL,xlim=NULL,ylim=NULL,rxlim=xlim, rylim=ylim,xend=NA,zoomtitle=NULL,titlepos=NA,...)
Scatterplot w/ histogram, correlation, density, ellipse
library(psych)
scatter.hist(x, y = NULL, smooth = TRUE, ab = FALSE, correl = TRUE, density = TRUE, ellipse = TRUE, digits = 2, cex.cor = 1, title = "Scatter plot + histograms", xlab = NULL, ylab = NULL)
Design Plots (compare mean differences, sd, var, or medians) [effects]
plot.design(y~x1*x2*xn,fun="mean")
fun arguments: mean, median, sd, var
BWplot (compares medians and spread in a multi box plot format)
library(lattice)
bwplot(formula,data)
The formula has the form y ~ x | g1 * g2 * ... (or equivalently, y ~ x | g1 + g2 + ...), indicating that plots of y (on the y-axis) versus x (on the x-axis) should be produced conditional on the variables g1, g2, .... Here x and y are the primary variables, and g1, g2, ... are the conditioning variables. The conditioning variables may be omitted to give a formula of the form y ~ x, in which case the plot consists of a single panel with the full dataset. The formula can also involve expressions, e.g., sqrt(), log(), etc.
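A short sketch of the two formula shapes described above, using the built-in mtcars data in place of a course data set (the variable choices are only for illustration).

```r
library(lattice)

# Conditioning and grouping variables should be factors
cars <- transform(mtcars,
                  cyl = factor(cyl),
                  am  = factor(am, labels = c("automatic", "manual")))

p1 <- bwplot(mpg ~ cyl, data = cars)        # y ~ x: a single panel
p2 <- bwplot(mpg ~ cyl | am, data = cars)   # y ~ x | g1: one panel per level of am

pdf(NULL)                                   # lattice objects are drawn when printed
print(p1)
print(p2)
dev.off()
```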
EXAMPLE
dat<-read.table("HW19.csv", header=TRUE, sep=",",na.strings="999")
dat$Attitude<-as.factor(dat$Attitude)
mod<-lm(Science.Comprehension~Gender*Attitude*Grade,data=dat)
anova(mod)
windows(h=6,w=10)
par(mfrow=c(1,2))
with(dat,plot.design(Science.Comprehension~Gender*Attitude*Grade,
main="Mean Differences"))
with(dat,plot.design(Science.Comprehension~Gender*Attitude*Grade,
fun="sd",main="SD Comparisons"))
EXAMPLE
dat<-read.table("HW19.csv", header=TRUE,
sep=",",na.strings="999")
dat$Attitude<-as.factor(dat$Attitude)
mod<-lm(Science.Comprehension~Gender*Attitude*Grade,data=dat)
anova(mod)
library(lattice)
trellis.par.set(col.whitebg())
bwplot(Science.Comprehension~Gender|Attitude*Grade,dat)
Box plot with confidence intervals
library(psych)
boxplot(attitude,notch=FALSE,main="Boxplot with error bars")
error.bars(attitude,add=TRUE)
boxplot(attitude,notch=TRUE,main="Notched boxplot with error bars")
error.bars(attitude,add=TRUE)
Spaghetti Plot for Repeated Measures Data
library(lattice)
xyplot(y ~ x, groups =, type = "b", data=)
Coplots
coplot(formula,data,)
source("C:/Users/Rinker/Desktop/PhD Program/CEP 523-Stat Meth Ed Inference/R Stuff/Scripts/Data Sets.txt")
library(reshape)
rep.mes2<-rep.mes
Sex<-gl(2, 25, length=50,labels = c("Male", "Female"))
rep.mes2<-data.frame(rep.mes2[1:2],Sex,rep.mes2[3:5])
long.rep.mes<-melt(rep.mes2,id=1:3)
long.rep.mes<-long.rep.mes[order(long.rep.mes$Sub),] #sort the melted data by subject
rownames(long.rep.mes)<-1:150
rep.mes2;long.rep.mes
library(lattice)
xyplot(value[1:18] ~ variable, groups = Sub, type = "b",data=long.rep.mes)
xyplot(value ~ variable, groups = Group, type = "b",data=long.rep.mes)
xyplot(value ~ variable, groups = Sex, type = "b",data=long.rep.mes)
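The example above depends on a data set loaded from a local script. A self-contained sketch of the same spaghetti plot on made-up data follows; the data frame wide, its column names and the use of base reshape() in place of melt() are all assumptions for illustration.

```r
library(lattice)

set.seed(1)
# Made-up wide data: 6 subjects in 2 groups, measured at 3 time points
wide <- data.frame(Sub   = factor(1:6),
                   Group = gl(2, 3, labels = c("A", "B")),
                   t1 = rnorm(6, 10), t2 = rnorm(6, 12), t3 = rnorm(6, 14))

# Wide to long: one row per subject per time point
long <- reshape(wide, direction = "long",
                varying = c("t1", "t2", "t3"), v.names = "value",
                timevar = "time", times = 1:3, idvar = "Sub")
long <- long[order(long$Sub, long$time), ]

# One line per subject
p <- xyplot(value ~ time, groups = Sub, type = "b", data = long)

pdf(NULL)
print(p)
dev.off()
```

Changing groups = Sub to groups = Group colours the lines by group instead of by subject.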
EXAMPLE
coplot(Sepal.Width~ Petal.Length|Petal.Width,data=iris,panel=panel.smooth)
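Change Scale Text Direction
las = 0, 1, 2 or 3 (0 = parallel to the axis, the default; 1 = always horizontal; 2 = perpendicular to the axis; 3 = always vertical)
NOTE: This is an argument to a plot.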
EXAMPLE
par(mfrow=c(2,2))
with(iris,plot(Sepal.Length,Sepal.Width,pch=as.numeric(Species),col=as.numeric(Species)))
with(iris,plot(Sepal.Length,Sepal.Width,pch=as.numeric(Species),las=1,col=as.numeric(Species)))
with(iris,plot(Sepal.Length,Sepal.Width,pch=as.numeric(Species),las=2,col=as.numeric(Species)))
with(iris,plot(Sepal.Length,Sepal.Width,pch=as.numeric(Species),las=3,col=as.numeric(Species)))
Overplotting
# Generate some data
library(MASS)
set.seed(101)
n <- 50000
X <- mvrnorm(n, mu=c(.5,2.5), Sigma=matrix(c(1,.6,.6,1), ncol=2))
# A color palette from blue to yellow to red
library(RColorBrewer)
k <- 11
my.cols <- rev(brewer.pal(k, "RdYlBu"))
## compute 2D kernel density, see MASS book, pp. 130-131
z <- kde2d(X[,1], X[,2], n=50)
# Make the base plot
plot(X, xlab="X label", ylab="Y label", pch=19, cex=.4)
# Draw the colored contour lines
contour(z, drawlabels=FALSE, nlevels=k, col=my.cols, add=TRUE, lwd=2)
# Make points smaller - use a single pixel as the plotting character
plot(X, pch=".")
# Hexbinning
library(hexbin)
plot(hexbin(X[,1], X[,2]))
# Make points semi-transparent
library(ggplot2)
qplot(X[,1], X[,2], alpha=I(.1))
# The smoothScatter function (graphics package)
smoothScatter(X)
List of Graph Functions in Base
List of Graph Functions in Lattice
Graphics Arguments (Parameters)
Descriptive Statistics
Find standard deviation, variance, mean, median, range, & standard error
From the base and stats packages: sd() var() mean() median() max() min() summary()
From library(plotrix): std.error(x,na.rm) #--> example: std.error(mtcars$mpg,na.rm)
Descriptives
library(psych)
describe(x, na.rm = TRUE, interp=FALSE,skew = TRUE, ranges = TRUE,trim=.1)
Descriptives by Group (Note: Pairwise deletion; this will include as much data as possible)
library(psych) #describe.by() is named describeBy() in current psych releases
describe.by(dataset, variable1) #Does Descripts on Variable 1
describe.by(dataset, variable2) #Does Descripts on Variable 2
describe.by(dataset, list(variable1,variable2)) #Does All the Interactions
#example:
library(psych)
g4<-read.table("g4.csv", header=TRUE, sep=",",na.strings="999")
describe.by(g4,g4$gender) #Does Descripts on Variable 1
describe.by(g4,g4$race) #Does Descripts on Variable 2
describe.by(g4,list(g4$gender,g4$race)) #Does All the Interactions
NOTE: I created functions to automate this for the .Regression Bundle script.
desc2v() for 2 variables; desc3v() for 3 variables
use rfun() to view the list of functions and arguments in the regression bundle
use: data.frame(cv[[A]][[B]])[c(desired variable rows),c(2,3,4)] to extract certain groups, columns & rows
[[A]] selects the set of descriptives:
[[1]] = "DESCRIPTIVES FOR VARIABLE 1"
[[2]] = "DESCRIPTIVES FOR VARIABLE 2"
[[3]] = "DESCRIPTIVES FOR VARIABLE 3"
[[4]] = "DESCRIPTIVES FOR INTERACTION OF VARIABLE 1 & 2"
[[5]] = "DESCRIPTIVES FOR INTERACTION OF VARIABLE 2 & 3"
[[6]] = "DESCRIPTIVES FOR INTERACTION OF VARIABLE 1 & 3"
[[7]] = "DESCRIPTIVES FOR INTERACTION OF VARIABLE 1,2,&3"
[[B]] selects the level of the group or interaction, i.e.:
a [[A=1]] with 2 groups=[[B n of 2]]
a [[A=1]] with 3 groups=[[B n of 3]]
a [[A=4]] with 2 and 3 groups=[[B n of 8]]
a [[A=7]] with 2, 2 and 3 groups=[[B n of 19]]
c(2,3,4) here gives n, mean & sd
Descriptives by Group (this relies on specific variables of a data set; more manageable info)
library(doBy)
summaryBy()
examples:
Descriptive by Group (another approach)
by(data set, factor, summary)
#EXAMPLE of summaryBy()
g4<-read.table("g4.csv", header=TRUE, sep=",",na.strings="999")
library(doBy)
summaryBy(mathscore + effort+ initiative+valueing ~ race, data = g4,
FUN = function(x) { c(n = length(x),mean = mean(x), sd = sd(x)) } )
summaryBy(mathscore + effort+ initiative+valueing ~ gender, data = g4,
FUN = function(x) { c(n = length(x),mean = mean(x), sd = sd(x)) } )
summaryBy(mathscore + effort+ initiative+valueing ~ gender+race, data = g4,
FUN = function(x) { c(n = length(x),mean = mean(x), sd = sd(x)) } )
EXAMPLE WITH WRITE TO EXCEL
e30<-read.table("e30.csv", header=TRUE, sep=",",na.strings="NA")
attach(e30)
DFD<-e30[,5:11]
percent.disabled<-as.numeric(N.stud.disable)/as.numeric(class.enroll)
DFSD<-data.frame(DFD,percent.disabled)
DFSD<-na.omit(DFSD)
#_________________________________________________________________________
#DESCRIPTIVES ON AIDES
#
ZZ<-summaryBy(cl.behav.fall+cl.behav.spr+perc.min+percent.disabled~aide,
data = DFSD,FUN = function(x) { c(n = length(x),mean = mean(x), sd = sd(x)) } )
#_________________________________________________________________________
#DESCRIPTIVES ON CLASS TYPE
#
YY<-summaryBy(cl.behav.fall+cl.behav.spr+perc.min+percent.disabled~cl.type,data = DFSD,FUN =
function(x) { c(n = length(x),mean = mean(x), sd = sd(x)) } )
#_________________________________________________________________________
#DESCRIPTIVES ON CLASS TYPE & AIDE
#
XX<-summaryBy(cl.behav.fall+cl.behav.spr+perc.min+percent.disabled~aide+cl.type,
data = DFSD,FUN = function(x) { c(n = length(x),mean = mean(x), sd = sd(x)) } )
#_________________________________________________________________________
#TRANSFORMED DESCRIPTIVES ON CLASS TYPE & AIDE
#
WW<-t(summaryBy(cl.behav.fall+cl.behav.spr+perc.min+percent.disabled~aide+cl.type,
data = DFSD,FUN = function(x) { c(n = length(x),mean = mean(x), sd = sd(x)) } ))
DESCRIBE2<-list(ZZ,YY,XX,WW)
DESCRIBE2
write.table(ZZ, file = "DESCRIBE2.csv", sep = ",", col.names = NA,qmethod = "double")
write.table(YY, file = "DESCRIBE3.csv", sep = ",", col.names = NA,qmethod = "double")
write.table(XX, file = "DESCRIBE4.csv", sep = ",", col.names = NA,qmethod = "double")
write.table(WW, file = "DESCRIBE5.csv", sep = ",", col.names = NA,qmethod = "double")
Example
mtcars2<-mtcars
library(doBy)
mtcars2$cyl<-with(mtcars,recodeVar(cyl, src=c(4,6,8),
tgt=c("four","six","eight"), default=NULL,
keep.na=TRUE))
by(mtcars2, mtcars2$cyl, summary)
Using ftable and tapply to generate descriptives
Descriptives by Group (Favored method)
It is easiest to look at how this piece of code works through an example (below).
Descriptives by Variable1
library(pastecs)
stat.desc(data.frame)
Descriptives by Variable2
library(fBasics)
basicStats(dataframe)
Example
dat<-read.table("HW19.csv", header=TRUE, sep=",",na.strings="999")
dat$Attitude<-as.factor(dat$Attitude)
mod<-lm(Science.Comprehension~Gender*Attitude*Grade,data=dat)
anova(mod)
#==================================================================================================
# USING FTABLE AND TAPPLY TO GENERATE TABLES OF MEAN, SD, N, VAR ETC. BY GROUP
#==================================================================================================
with(dat,ftable(tapply(Science.Comprehension,list(Grade,Gender,Attitude),mean)))
with(dat,ftable(tapply(Science.Comprehension,list(Grade,Gender,Attitude),sd)))
with(dat,ftable(tapply(Science.Comprehension,list(Grade,Gender,Attitude),length)))
#==================================================================================================
# USING FTABLE AND TAPPLY TO COMPILE 1 TABLE OF MEAN, SD, N, VAR ETC. BY GROUP
#==================================================================================================
DF<-with(dat,as.data.frame.table(tapply(Science.Comprehension,list(Grade,Gender,
Attitude),mean)))
#take sd and n straight from tapply (no ftable) so the cells come out in the same order as the rows of DF
stndev<-as.vector(with(dat,tapply(Science.Comprehension,list(Grade,Gender,Attitude),sd)))
n<-as.vector(with(dat,tapply(Science.Comprehension,list(Grade,Gender,Attitude),length)))
DF<-data.frame(DF[,1:3],n,DF[,4],stndev)
colnames(DF)<-c("Grade","Gender","Attitude","n","mean","sd")
#==================================================================================================
# TABLE OF N,MEANS & SD BY GROUP
#==================================================================================================
DF
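The same n / mean / sd table can be sketched on a built-in data set to check that each row really matches its group. This version uses warpbreaks with two grouping factors; the helper name cell is hypothetical. The table and the extra columns all come from tapply() directly, so the cells line up in the same (first factor fastest) order.

```r
# One tapply() per statistic over the same grouping factors
cell <- function(f) with(warpbreaks, tapply(breaks, list(wool, tension), f))

DF <- as.data.frame.table(cell(mean))       # long table: one row per wool x tension cell
colnames(DF) <- c("wool", "tension", "mean")

# as.vector() unrolls each array in the same order as the rows of DF
DF$n  <- as.vector(cell(length))
DF$sd <- as.vector(cell(sd))

DF <- DF[, c("wool", "tension", "n", "mean", "sd")]
DF
```

Spot-checking one row against the raw data, e.g. mean(warpbreaks$breaks[warpbreaks$wool == "B" & warpbreaks$tension == "M"]), is a quick way to confirm the alignment.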
#EXAMPLE
library(reshape)
dstats <- function(x)(c(n=length(x), mean=mean(x), sd=sd(x), med=median(x)))
dfm <- melt(mtcars, measure.vars=c("mpg", "hp", "wt"),
id.vars=c("am", "cyl"))
cast(dfm, am + cyl + variable ~ ., dstats)
Correlation Matrices and Plots scorrelation
Correlation Package I’ve created
Select Numeric Columns for the cor function (useful for cor())
Method 1
sapply(dataframe, is.numeric)
Method 2
which(sapply(data.frame, is.numeric))
Correlation and Correlation Tables
Type: cor(x, y, use="complete.obs")
Output: Where x and y are single numeric variables the correlation is a single value.
To correlate more than two numeric variables:
1) The first step is to bind your outcome variables: y<-cbind(x,y,z…) #or see selecting numeric variables
2) The last step is to type: cor(y, use="complete.obs")
The output will be a correlation table.
Note: If you have changed all the variables to numeric as described in the changing variable section you can simply type: cor(data1) Where data1 is the data set, however some of the numeric conversions (as in age level y, m, o = 1,2,3) are inappropriate correlations.
source("C:/Users/Rinker/Desktop/PhD Program/CEP 523-Stat Meth Ed Inference/R Stuff/Scripts/Regression/Correlation Matrix and Data Visualization.txt")
cmat()
Note: Use round(cor(),2) to round the table to 2 decimals
EXAMPLE With cor()
cor(iris) #the problem: fails because Species is not numeric
cor(iris[as.numeric(which(sapply(iris, is.numeric)))])#the fix
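A slightly shorter sketch of the same fix: the logical vector from Method 1 can index the data frame directly, and round() tidies the table.

```r
# TRUE for each numeric column, FALSE for factors such as Species
num_cols <- sapply(iris, is.numeric)

# Correlation table of the numeric columns only, rounded to 2 decimals
cm <- round(cor(iris[num_cols], use = "complete.obs"), 2)
cm
```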
Correlation Tables and p-values
Use rcorr() from the Hmisc package to get a correlation matrix and a p-value matrix.
Correlation matrix w/ n’s and p-values (does pairwise deletion)
library(Hmisc)
rcorr(x, y, type=c("pearson","spearman"))
Could make it do listwise deletion by dropping incomplete rows first:
rcorr(na.omit(x), y, type=c("pearson","spearman"))
Pairwise Associations between Items using a Correlation Coefficient
library(ltm) #This one is similar to rcorr above
rcor.test(mat, p.adjust = FALSE, p.adjust.method = "holm", ...)
Correlation matrix w/ n’s and p-values
library(psych)
corr.test(x, y = NULL, use = "pairwise", method="pearson")
x A matrix or dataframe
y A second matrix or dataframe with the same number of rows as x
use use="pairwise" is the default value and will do pairwise deletion of cases. use="complete" will select just complete cases.
method method="pearson" is the default value. The alternatives to be passed to cor are "spearman" and "kendall"
Correlation Matrix w/sig stars
source("C:/Users/Rinker/Desktop/PhD Program/CEP 523-Stat Meth Ed Inference/R Stuff/Scripts/Regression/Correlation Matrix With Sig Stars.txt")
sigstarC(dataset)
Find the significance of the difference between (un)paired correlations
library(psych)
paired.r(xy, xz, yz=NULL, n, n2=NULL,twotailed=TRUE)
Arguments
xy r(xy)
xz r(xz)
yz r(yz)
n Number of subjects for first group
n2 Number of subjects in second group (if not equal to n)
twotailed Calculate two or one tailed probability values
Description
Test the difference between two (paired or unpaired) correlations. Given 3 variables, x, y, z, is the correlation between xy different than that between xz? If y and z are independent, this is a simple t-test of the z transformed rs. But, if they are dependent, it is a bit more complicated. To find the z of the difference between two independent correlations, first convert them to z scores using the Fisher r-z transform and then find the z of the difference between the two correlations. The default assumption is that the group sizes are the same, but the test can be done for different size groups by specifying n2. If the correlations are not independent (i.e., they are from the same sample) then the correlation with the third variable r(yz) must be specified. Find a t statistic for the difference of these two dependent correlations.
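The independent-correlations case can be sketched in base R with the standard Fisher r-to-z test. The function name r_diff_independent and the example values are hypothetical; this illustrates the textbook formula, not the internals of paired.r().

```r
# z test for the difference between two independent correlations
r_diff_independent <- function(r1, n1, r2, n2 = n1, twotailed = TRUE) {
  z1 <- atanh(r1)                          # Fisher r-to-z: 0.5 * log((1 + r) / (1 - r))
  z2 <- atanh(r2)
  se <- sqrt(1 / (n1 - 3) + 1 / (n2 - 3))  # standard error of z1 - z2
  z  <- (z1 - z2) / se
  p  <- pnorm(-abs(z))                     # one-tailed probability
  if (twotailed) p <- 2 * p
  c(z = z, p = p)
}

# Hypothetical values: r = .50 and r = .30, each from 100 subjects
res <- r_diff_independent(0.50, 100, 0.30, 100)
res
```

The dependent case (same sample) needs r(yz) as well and a different statistic, so it is left to paired.r().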
Example
library(psych)
corr.test(mtcars)
Correlation matrix, correlation plots, and histograms of the variables
library(psych)
pairs.panels(x, smooth = TRUE, scale = FALSE, density=TRUE,ellipses=TRUE,digits=2)
Arguments
x a data.frame or matrix
smooth TRUE draws loess smooths
scale TRUE scales the correlation font by the size of the absolute correlation.
density TRUE shows the density plots as well as histograms
ellipses TRUE draws correlation ellipses
lm Plot the linear fit rather than the LOESS smoothed fits.
digits the number of digits to show
pch The plot character (defaults to 20 which is a ’.’).
cor If plotting regressions, should correlations be reported?
jiggle Should the points be jittered before plotting?
factor factor for jittering (1-5)
hist.col What color should the histogram on the diagonal be?
show.points If FALSE, do not show the data points
Description
Adapted from the help page for pairs, pairs.panels shows a scatter plot matrix (SPLOM), with bivariate scatter plots below the diagonal, histograms on the diagonal, and the Pearson correlation above the diagonal. Useful for descriptive statistics of small data sets. If lm=TRUE, linear regression fits are shown for both y by x and x by y. Correlation ellipses are also shown. Points may be given different colors depending upon some grouping variable.
scatterplot matrix
Correlation plot represented by colors
library(psych)
cor.plot(dfg,colors=TRUE, n=10,main=NULL,zlim=c(-1,1),show.legend=TRUE,labels=NULL)
Correlation Test With p values (pairwise or complete obs.)
library(psych)
corr.test(x, y = NULL, use = "pairwise",method="pearson")
Description
Although the cor function finds the correlations for a matrix, it does not report probability values. corr.test uses cor to find the correlations for either complete or pairwise data and reports the sample sizes and probability values as well.
Arguments
x A matrix or dataframe
y A second matrix or dataframe with the same number of rows as x
use use="pairwise" is the default value and will do pairwise deletion of cases. use="complete" will select just complete cases.
method method="pearson" is the default value. The alternatives to be passed to cor are "spearman" and "kendall"
Details
corr.test uses the cor function to find the correlations, and then applies a t-test to the individual correlations using the formula t = r*sqrt(n-2)/sqrt(1-r^2)
Value
r The matrix of correlations
n Number of cases per correlation
t value of t-test for each correlation
p two tailed probability of t for each correlation
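The t-test applied to each correlation can be checked by hand in base R. This sketch computes t = r*sqrt(n-2)/sqrt(1-r^2) and its two-tailed p value for one pair of mtcars columns and compares them with cor.test(); the choice of mpg and wt is only for illustration.

```r
x <- mtcars$mpg
y <- mtcars$wt

r <- cor(x, y)
n <- length(x)

# t statistic for H0: rho = 0, on n - 2 degrees of freedom
t_stat <- r * sqrt(n - 2) / sqrt(1 - r^2)
p_val  <- 2 * pt(-abs(t_stat), df = n - 2)   # two-tailed probability

# Reference values from base R
ct <- cor.test(x, y)

c(t_by_hand = t_stat, t_cor.test = unname(ct$statistic))
c(p_by_hand = p_val,  p_cor.test = ct$p.value)
```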
Correlation Matrix Plot Using Pie Graphs and Colors
library(corrgram)
corrgram(x, order = NULL, panel=panel.txt, lower.panel=panel.shade, upper.panel=panel.pie, diag.panel=NULL, text.panel=panel.txt, label.pos=0.5, cex.labels=NULL, font.labels=1, row1attop=TRUE, gap=0, main=NULL)
Options
x is a dataframe with one observation per row.
order=TRUE will cause the variables to be ordered using principal component analysis of the correlation matrix.
panel= refers to the off-diagonal panels. You can use lower.panel= and upper.panel= to choose different options below and above the main diagonal respectively.
text.panel= and diag.panel= refer to the main diagonal. Allowable parameters are given below.
off diagonal panels
panel.pie (the filled portion of the pie indicates the magnitude of the correlation)
panel.shade (the depth of the shading indicates the magnitude of the correlation)
panel.ellipse (confidence ellipse and smoothed line)
panel.pts (scatterplot)
main diagonal panels
panel.minmax (min and max values of the variable)
panel.txt (variable name)
Use this function before plotting to change the colors used:
col.corrgram <- function(ncol){ colorRampPalette(c("purple", "red", "lightcoral", "pink"))(ncol)}
Lines on the shade indicate direction.
Shade color and pie indicate magnitude.
139 | P a g e
Correlation plot represented by colors
library(psych)
cor.plot(dfg,colors=TRUE, n=10,main=NULL,zlim=c(-1,1),show.legend=TRUE,labels=NULL)
Correlation Test With p values (pairwise or complete obs.)
library(psych)
corr.test(x, y = NULL, use = "pairwise",method="pearson")
Description
Although the cor function finds the correlations for a matrix, it does not report probability values. corr.test uses cor to find the correlations for either complete or pairwise data and reports the sample sizes and probability values as well.
Arguments
x A matrix or dataframe
y A second matrix or dataframe with the same number of rows as x
use use="pairwise" is the default value and will do pairwise deletion of cases. use="complete" will select just complete cases.
method method="pearson" is the default value. The alternatives to be passed to cor are "spearman" and "kendall"
Details
corr.test uses the cor function to find the correlations, and then applies a t-test to the individual correlations using the formula t = r*sqrt(n-2)/sqrt(1-r^2)
Value
r The matrix of correlations
n Number of cases per correlation
t value of t-test for each correlation
p two tailed probability of t for each correlation
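The t-test applied to each correlation can be checked by hand in base R. This is a minimal sketch, assuming the standard formula t = r*sqrt(n-2)/sqrt(1-r^2) with n-2 degrees of freedom; the mtcars columns are my own choice of illustrative data, not part of these notes.

```r
# Pearson correlation between two mtcars columns (illustrative data only)
r <- cor(mtcars$mpg, mtcars$wt)
n <- nrow(mtcars)

# t statistic and two-tailed p value for H0: rho = 0
t_val <- r * sqrt(n - 2) / sqrt(1 - r^2)
p_val <- 2 * pt(-abs(t_val), df = n - 2)

# base R's cor.test() runs the same test, so it serves as a cross-check
ct <- cor.test(mtcars$mpg, mtcars$wt)
c(t = t_val, p = p_val)
```

Comparing t_val and p_val with ct$statistic and ct$p.value is a quick way to confirm the hand calculation.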
Correlation Hypothesis testing
Correlation Package I’ve created for testing Correlation Hypotheses
Confidence Interval for a Correlation Coefficient
library(psychometric)
CIr(r, n, level = 0.95)
Arguments
r Correlation Coefficient
n Sample Size
level Confidence level for constructing the CI, default is .95
Convert r values to z scores
fisherz <- function(rho) {0.5*log((1+rho)/(1-rho)) } #converts r to z
fisherz(r)
OR using library(psych)
fisherz(rho)
fisherz2r(z)
r.con(rho,n,p=.95,twotailed=TRUE)
r2t(rho,n)
Description
Convert a correlation to a z score or z to r using the Fisher transformation, or find the confidence intervals for a specified correlation.
Convert a Pearson correlation coefficient to Fisher's z’
library(psychometric)
r2z(x)
Where x is the Pearson correlation coefficient
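The Fisher transformation is also available in base R, since 0.5*log((1+r)/(1-r)) is the inverse hyperbolic tangent. A small sketch with made-up correlations, showing the round trip from r to z and back:

```r
r <- c(-0.9, -0.3, 0, 0.5, 0.8)   # example correlations (made up)
z <- atanh(r)                     # Fisher's z: same as 0.5*log((1+r)/(1-r))
back <- tanh(z)                   # inverse transformation, z back to r
data.frame(r = r, z = round(z, 3), back = back)
```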
source("C:/Users/Rinker/Desktop/PhD Program/CEP 523-Stat Meth Ed Inference/R Stuff/Scripts/Regression/Correlation Matrix and Data Visualization.txt")
cmat()
Confidence Interval for Fisher z’
library(psychometric)
CIz(z, n, level = 0.95)
Arguments
z Fisher's z’
n Sample Size
level Confidence level for constructing the CI, default is .95
Convert a Fisher's z’ to a Pearson correlation coefficient
library(psychometric)
z2r(x)
Where x is the Fisher's z’
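The confidence interval functions in this section all rest on the same idea: build the interval on the z’ scale, where the standard error is 1/sqrt(n-3), then transform the limits back to r. The helper ci_r below is hypothetical (it is not the psychometric implementation) and mtcars is used only as illustrative data.

```r
# Hypothetical helper (not from psychometric): CI for r via Fisher's z
ci_r <- function(r, n, level = 0.95) {
  z    <- atanh(r)                       # r to z
  se   <- 1 / sqrt(n - 3)                # standard error of z
  crit <- qnorm(1 - (1 - level) / 2)     # two-sided critical value
  tanh(z + c(-1, 1) * crit * se)         # CI in z, transformed back to r
}

r <- cor(mtcars$mpg, mtcars$wt)          # illustrative data only
ci <- ci_r(r, nrow(mtcars))
ci
```

Base R's cor.test() reports an interval built the same way, so its conf.int component can be used as a check.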
Find partial correlation between two variables, with other vars removed
library(ggm)
pcor(u, S)
u a vector of integers of length > 1. The first two integers are the indices of variables the correlation of which must be computed. The rest of the vector is the conditioning set.
S a symmetric positive definite matrix, a sample covariance matrix.
Find the partial correlations for a set (x) of variables with set (y) removed
library(psych)
partial.r(m, x, y)
Arguments
m A data or correlation matrix
x The variable numbers associated with the X set.
y The variable numbers associated with the Y set
Description
A straightforward application of matrix algebra to remove the effect of the variables in the y set from the x set. Input may be either a data matrix or a correlation matrix. Variables in x and y are specified by location. It is sometimes convenient to partial the effect of a number of variables (e.g., sex, age, education) out of the correlations of another set of variables. This could be done laboriously by finding the residuals of various multiple correlations, and then correlating these residuals. The matrix algebra alternative is to do it directly.
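The two routes described above (correlating residuals versus direct matrix algebra) can be compared in base R. This is a sketch under my own assumptions: three mtcars columns serve as illustrative data, and the matrix route uses the inverse of the covariance matrix, which may differ in detail from how partial.r is coded.

```r
# Illustrative data: three mtcars columns (not from the notes)
d <- mtcars[, c("mpg", "wt", "hp")]

# Route 1: correlate the residuals after regressing each variable on hp
res_mpg <- resid(lm(mpg ~ hp, data = d))
res_wt  <- resid(lm(wt ~ hp, data = d))
by_residuals <- cor(res_mpg, res_wt)

# Route 2: matrix algebra on the inverse of the covariance matrix
# (columns 1 and 2 are mpg and wt; hp is the variable partialled out)
P <- solve(cov(d))
by_matrix <- -P[1, 2] / sqrt(P[1, 1] * P[2, 2])

c(residuals = by_residuals, matrix = by_matrix)
```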
Find the partial correlations for a set correlation or covariance matrix library(corpcor) cor2pcor(m, tol)
EXAMPLE
cool <- make.hierarchical() #make up a correlation matrix
round(cool[1:5,1:5],2)
partial.r(cool,c(1,3,5),c(2,4))
EXAMPLE
library(ggm)
data(marks)
with(marks, cor(vectors, algebra)) #cor not accounting for anything else
## The correlation between vectors and algebra given analysis and statistics
pcor(c("vectors", "algebra", "analysis", "statistics"), var(marks))
Tests the hypothesis that two correlations are significantly different
library(psychometric)
rdif.nul(r1, r2, n1, n2) (one tailed; p must be doubled to get p value for 2 tail)
Arguments
r1 Correlation 1
r2 Correlation 2
n1 Sample size for r1
n2 Sample size for r2
Details
First converts r to z’ for each correlation. Then constructs a z test for the difference:
z <- (z1 - z2)/sqrt(1/(n1-3)+1/(n2-3))
Returns a table with 2 elements
zDIF z value for the H0
p p value
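The z test in the Details can be written out in a few lines of base R. The function name diff_cor_test and the input values are hypothetical; the sketch follows the formula given above and reports both the one-tailed and the doubled two-tailed p value.

```r
# Hypothetical helper: z test for two independent correlations
diff_cor_test <- function(r1, r2, n1, n2) {
  z1 <- atanh(r1)                                   # r to Fisher's z
  z2 <- atanh(r2)
  z  <- (z1 - z2) / sqrt(1 / (n1 - 3) + 1 / (n2 - 3))
  c(zDIF = z,
    p.one.tailed = pnorm(abs(z), lower.tail = FALSE),
    p.two.tailed = 2 * pnorm(abs(z), lower.tail = FALSE))
}

diff_cor_test(r1 = 0.50, r2 = 0.30, n1 = 100, n2 = 100)  # made-up inputs
```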
Tests of significance for correlations
library(psych)
r.test(n, r12, r34 = NULL, r23 = NULL, r13 = NULL, r14 = NULL, r24 = NULL, n2 = NULL, pooled = TRUE, twotailed = TRUE)
Arguments
n Sample size of first group
r12 Correlation to be tested
r34 Test if this correlation is different from r12. If r23 is specified, but r13 is not, then r34 becomes r13
r23 if ra = r(12) and rb = r(13) then test for differences of dependent correlations given r23
r13 implies ra = r(12) and rb = r(34), test for difference of dependent correlations
r14 implies ra = r(12) and rb = r(34)
r24 ra = r(12) and rb = r(34)
n2 n2 is specified in the case of two independent correlations. n2 defaults to n if not specified
pooled use pooled estimates of correlations
twotailed should a two-tailed or one-tailed test be used
Description
Tests the significance of a single correlation, the difference between two independent correlations, the difference between two dependent correlations sharing one variable (Williams’s Test), or the difference between two dependent correlations with different variables (Steiger Tests).
Details
Depending upon the input, one of four different tests of correlations is done.
1. For a sample size n, find the t value for a single correlation.
2. For sample sizes of n and n2 (n2 = n if not specified) find the z of the difference between the z transformed correlations divided by the standard error of the difference of two z scores.
3. For sample size n, and correlations ra= r12, rb= r23 and r13 specified, test for the difference of two dependent correlations.
4. For sample size n, test for the difference between two dependent correlations involving different variables.
For clarity, correlations may be specified by value. If specified by location and if doing the test of dependent correlations, if three correlations are specified, they are assumed to be in the order r12, r13, r23.
Value
test Label of test done
z z value for tests 2 or 4
t t value for tests 1 and 3
p probability value of z or t
Nil hypothesis for a correlation (Does r = 0?)
library(psychometric)
r.nil(r, n)
Arguments
r Correlation coefficient
n Sample Size
Performs a one-tailed t-test of the H0 that r = 0
Returns a table with 4 elements
“H0:rNot0” correlation to be tested
t t value for the H0
df degrees of freedom
p p value
Cronbach’s Alpha
library(psych)
alpha(x, keys=NULL,cumulative=FALSE, title=NULL, max=10,na.rm = TRUE)
Cronbach’s Alpha
library(psy)
cronbach(v1)
v1 = n*p matrix or dataframe, n subjects and p items
Missing values are omitted in a "listwise" way (all items are removed even if only one of them is missing).
Cronbach’s Alpha
library(psychometric)
alpha(x)
Where x is a data.frame
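For reference, coefficient alpha can also be computed directly from item variances. The sketch below assumes the usual textbook formula alpha = k/(k-1) * (1 - sum of item variances / variance of the total score); the helper name cronbach_alpha is hypothetical and the packaged functions above may handle missing data or reverse keying differently.

```r
# Hypothetical helper: Cronbach's alpha from an items matrix/data frame
cronbach_alpha <- function(items) {
  items <- na.omit(as.matrix(items))   # listwise deletion of incomplete rows
  k <- ncol(items)
  item_var  <- apply(items, 2, var)    # variance of each item
  total_var <- var(rowSums(items))     # variance of the total score
  (k / (k - 1)) * (1 - sum(item_var) / total_var)
}

# The built-in attitude ratings stand in for questionnaire items here
a_attitude <- cronbach_alpha(attitude)
a_attitude
```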
Cronbach’s Alpha
library(ltm)
cronbach.alpha(data, standardized = FALSE, CI = FALSE, probs = c(0.025, 0.975), B = 1000, na.rm = FALSE)
Confidence Interval for Coefficient Alpha (1 or 2 tailed)
library(psychometric)
First calculate an alpha and then:
alpha.CI(alpha, k, N, level = 0.90, onesided = FALSE)
Descriptive Statistics for a response data frame (includes Cron. Alpha and lots of goodies)
library(ltm)
descript(data, n.print = 10, chi.squared = TRUE, B = 1000)
Returns:
Alternative reliability Analysis
library(psych)
?guttman
guttman(r,key=NULL)
tenberge(r)
glb(r,key=NULL)
glb.fa(r,key=NULL)
Arguments
r A correlation matrix or raw data matrix.
key a vector of -1, 0, 1 to select or reverse items
Estimation of a True Score
library(psychometric)
Est.true(obs, mx, rxx)
Arguments
obs an observed score on test x
mx mean of test x
rxx reliability of test x
Description
Given the mean and reliability of a test, this function estimates the true score based on an observed score. The estimate accounts for regression to the mean.
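A true score estimate that accounts for regression to the mean is usually computed with Kelley's formula, true = mx + rxx*(obs - mx). The sketch below assumes that formula; whether Est.true uses exactly this expression is an assumption on my part, and the helper name is hypothetical.

```r
# Hypothetical helper using Kelley's formula (assumed, not taken from psychometric)
est_true <- function(obs, mx, rxx) {
  mx + rxx * (obs - mx)   # pull the observed score toward the mean
}

# Made-up example: observed score 130 on a test with mean 100 and reliability .80
est_true(obs = 130, mx = 100, rxx = 0.80)
```

With perfect reliability (rxx = 1) the estimate equals the observed score; with rxx = 0 it collapses to the test mean.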
Spearman-Brown Prophecy Formulae
library(psychometric)
SBrel(Nlength, rxx)
SBlength(rxxp, rxx)
Arguments
Nlength New length of a test in relation to original
rxx reliability of test x
rxxp reliability of desired (parallel) test x
Returns: rxxp - the prophesied reliability; N - ratio of new test length to original test length
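Both directions of the Spearman-Brown formula are short enough to write by hand. The helper names sb_rel and sb_length are hypothetical stand-ins for SBrel and SBlength; the formulas are the standard ones, rxxp = N*rxx/(1+(N-1)*rxx) and N = rxxp*(1-rxx)/(rxx*(1-rxxp)).

```r
# Hypothetical helpers mirroring the two Spearman-Brown directions
sb_rel <- function(Nlength, rxx) {
  Nlength * rxx / (1 + (Nlength - 1) * rxx)   # reliability of the lengthened test
}
sb_length <- function(rxxp, rxx) {
  rxxp * (1 - rxx) / (rxx * (1 - rxxp))       # length ratio needed to reach rxxp
}

sb_rel(Nlength = 2, rxx = 0.60)      # doubling a test with reliability .60
sb_length(rxxp = 0.75, rxx = 0.60)   # how much longer for reliability .75
```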
Item Analysis (Gives lots of info from a sample’s responses)
library(psychometric)
item.exam(x, y = NULL, discrim = FALSE)
Arguments
x matrix or data.frame of items
y Criterion variable
discrim Whether or not the discrimination of item is to be computed
Description
Conducts an item level analysis. Provides item-total correlations, standard deviation in items, difficulty, discrimination, and reliability and validity indices.
Details
If someone is interested in examining the items of a dataset contained in data.frame x, and the criterion measure is also in data.frame x, one must parse the matrix or data.frame and specify each part into the function. Otherwise, one must be sure that x and y are properly merged/matched. If one is not interested in assessing item-criterion relationships, simply leave out that portion of the call. The function does not check whether the items are dichotomously coded; this is user specified. As such, one can specify that items are binary when in fact they are not. This has the effect of computing the discrimination index for continuously coded variables.
The difficulty index (p) is simply the mean of the item. When dichotomously coded, p reflects the proportion endorsing the item. However, when continuously coded, p has a different interpretation.
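Two of the quantities described above, the difficulty index and an item-total correlation, are easy to reproduce in base R. The responses below are simulated and the corrected item-total correlation (item against the sum of the other items) is one common variant; item.exam may define its indices differently.

```r
set.seed(1)  # simulated 0/1 responses: 50 people, 4 items (illustrative only)
items <- matrix(rbinom(200, 1, 0.6), nrow = 50, ncol = 4,
                dimnames = list(NULL, paste0("item", 1:4)))

# Difficulty index p: the item mean (proportion endorsing when coded 0/1)
difficulty <- colMeans(items)

# Corrected item-total correlation: item vs. total of the remaining items
item_total <- sapply(seq_len(ncol(items)), function(j) {
  cor(items[, j], rowSums(items[, -j]))
})
names(item_total) <- colnames(items)

round(rbind(difficulty, item_total), 3)
```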
Grade multiple choices (uses multiple choice data set and answer key to give correct (1) incorrect (0))
library(ltm)
mult.choice(data, correct)
Arguments
data a matrix or a data.frame containing the manifest variables as columns.
correct a vector of length ncol(data) with the correct responses (answer key)
This new matrix could then be used to do column sums; row sums; weighting of questions; grades; weighted grades based on question weights.
Find Intraclass Correlations (ICC1, ICC2, ICC3 from Shrout and Fleiss) of two raters (numeric)
library(psych)
ICC(x,missing=TRUE,alpha=.05)
Arguments
x a matrix or dataframe of ratings
missing if TRUE, remove missing data – work on complete cases only
alpha The alpha level for significance for finding the confidence intervals
Description
The Intraclass correlation is used as a measure of association when studying the reliability of raters. Shrout and Fleiss (1979) outline 6 different estimates that depend upon the particular experimental design. All are implemented and given confidence limits.
Intraclass correlation coefficient (ICC)
library(psy)
icc(data)
data = n*p matrix or dataframe, n subjects p raters
Details
Missing data are omitted in a listwise way. The "agreement" ICC is the ratio of the subject variance to the sum of the subject variance, the rater variance and the residual; it is generally preferred. The "consistency" version is the ratio of the subject variance to the sum of the subject variance and the residual; it may be of interest when estimating the reliability of pre/post variations in measurements.
Find Cohen’s kappa and weighted kappa coefficients for correlation of two raters (nominal)
library(psych)
cohen.kappa(x, w=NULL,n.obs=NULL,alpha=.05)
wkappa(x, w = NULL)
Arguments
x Either a two by n data array with categorical values from 1 to p or a p x p table. If a data array, a table will be found.
w A p x p matrix of weights. If not specified, they are set to be 0 (on the diagonal) and (distance from diagonal)^2 off the diagonal.
n.obs Number of observations (if input is a square matrix).
alpha Probability level for confidence intervals
Description
Cohen’s kappa (Cohen, 1960) and weighted kappa (Cohen, 1968) may be used to find the agreement of two raters when using nominal scores. weighted.kappa is (probability of observed matches - probability of expected matches)/(1 – probability of expected matches). Kappa just considers the matches on the main diagonal. Weighted kappa considers off diagonal elements as well.
Find Cohen’s kappa and weighted kappa coefficients for correlation of two raters (nominal)
library(psy)
wkappa(r,weights="squared")
Arguments
r n*2 matrix or dataframe, n subjects and 2 raters
weights weights="squared" to obtain squared weights. If not, absolute weights are computed
Details
The diagnoses are ordered as follows: numbers < letters, letters and numbers ordered naturally. For weights="squared", weights are related to squared differences between row and column indices (in this situation wkappa is close to an icc). For weights!="squared", weights are related to absolute values of differences between row and column indices. The function deals with the case where the two raters do not have exactly the same scope of rating (some software associates an error with this situation). Missing values are omitted.
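The unweighted kappa formula quoted in the Description, (observed matches - expected matches)/(1 - expected matches), can be computed from a square agreement table in a few lines. The helper name cohen_kappa and the ratings are made up for illustration; the packaged functions add weights and confidence intervals on top of this.

```r
# Hypothetical helper: unweighted Cohen's kappa from a square agreement table
cohen_kappa <- function(tab) {
  p  <- tab / sum(tab)                    # cell proportions
  po <- sum(diag(p))                      # observed agreement
  pe <- sum(rowSums(p) * colSums(p))      # agreement expected by chance
  (po - pe) / (1 - pe)
}

# Made-up ratings: rows = rater 1, columns = rater 2
ratings <- matrix(c(20, 5,
                    10, 15), nrow = 2, byrow = TRUE)
cohen_kappa(ratings)
```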
Reverse Scoring
library(psych)
reverse.code(keys, items, mini = NULL, maxi = NULL)
NOTE: Reverse scoring can also be accomplished by taking the item and creating a new rescored variable using the formula:
(m+1)-s = reverse scored item
Where m is the max score you could have gotten on a Likert type scale and s is the score vector containing the scores of the item that is to be reverse scored.
EXAMPLE
original <- matrix(sample(6,50,replace=TRUE),10,5)
keys <- c(1,1,-1,-1,1) #reverse the 3rd and 4th items
new <- reverse.code(keys,original,mini=rep(1,5),maxi=rep(6,5))
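The (m+1)-s formula from the note can be applied directly with base R indexing. A sketch with made-up responses on a 1 to 6 scale, reversing the same two items as the example above:

```r
set.seed(42)  # made-up responses on a 1-6 Likert scale: 10 people, 5 items
original <- matrix(sample(6, 50, replace = TRUE), 10, 5)

m <- 6                      # highest possible score on the scale
reversed <- original
reversed[, c(3, 4)] <- (m + 1) - original[, c(3, 4)]  # reverse items 3 and 4

head(cbind(original[, 3], reversed[, 3]))  # old and new scores side by side
```

The formula assumes the scale starts at 1; for a scale with another minimum the general form is (min + max) - s.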
TABULAR DATA
Table of Counts (frequency table) [nested]
table(factor1,factor2,…,factor n)
ftable(factor1,factor2,…,factor n) Note: ftable also accepts a formula using the tilde ~ (see example)
Un-nested table of counts
margin.table(table, margin) where margin is the number of the factor (table dimension) to reveal
Compute column and row sums for a table method 1
addmargins(table)
Compute column and row sums for a table method 2
library(vcd)
mar_table(x)
EXAMPLE
DF<-read.table("fake remdial reading (logistic regression example).csv", header=TRUE,
sep=",",na.strings="999")
DF<-data.frame(DF,"Fav.Color"=sample(c("blue","red","green","orange"),nrow(DF),replace=T))
with(DF,table(sex,rem.read.rec))
with(DF,table(sex,rem.read.rec,Fav.Color)) #too cumbersome so we use ftable
with(DF,ftable(sex,rem.read.rec,Fav.Color))
with(DF,ftable(rem.read.rec~sex+Fav.Color))
with(DF,ftable(sex+Fav.Color~rem.read.rec))
EXAMPLE:
DF<-read.table("fake remdial reading (logistic regression example).csv", header=TRUE,
sep=",",na.strings="999")
DF<-data.frame(DF,"Fav.Color"=sample(c("blue","red","green","orange"),nrow(DF),replace=T))
(tab2<-with(DF,table(sex,Fav.Color,rem.read.rec)))
margin.table(tab2,1)
margin.table(tab2,2)
margin.table(tab2,3)
d<-data.frame(matrix(c(sample(c("red","blue", "green"), 25, replace=T),
sample(c(letters[1:5]), 25, replace=T),
sample(c("DOG","CAT", "CHICKEN", "SNAKE"), 25, replace=T)), nrow=25, ncol=3))
DT <- with(d, table(X1,X3))
with(d, chisq.test(X1,X2))
with(d, fisher.test(X1,X2))
DT<-with(d,xtabs(~X1+X3))
addmargins(DT)
library(vcd) #mar_table comes from the vcd package
mar_table(DT)
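The examples above read a local CSV file, so here is a self-contained sketch of the same table functions using the built-in mtcars data (my substitution, not the author's data set):

```r
# Built-in mtcars stands in for the CSV used above (illustrative only)
tab <- with(mtcars, table(cyl, gear, am))

ftable(tab)                            # flattened view of the 3-way table
by_cyl  <- margin.table(tab, 1)        # counts for cyl, summed over gear and am
by_gear <- margin.table(tab, 2)        # counts for gear
two_way <- margin.table(tab, c(1, 2))  # cyl x gear, summed over am
with_sums <- addmargins(two_way)       # append row and column sums
with_sums
```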
Table of Counts for Proportion Tables
prop.table(table)
Tabular Data 2 x 2 Table
Chi squared test of independence 2 x 2
summary(table(factor1,factor2))
EXAMPLE
DF<-data.frame(cbind("X1"=c(rep("yes",12),rep("no",12)),"X2"=c(rep("red",10),rep("blue",14))))
with(DF,summary(table(X1,X2)))
Tabular Data 2 x 2 Table with Yates Continuity Correction
Chi squared test of independence
chisq.test(table(factor1,factor2))
EXAMPLE
DF<-data.frame(cbind("X1"=c(rep("yes",12),rep("no",12)),"X2"=c(rep("red",10),rep("blue",14))))
with(DF, chisq.test(table(X1,X2)))
Compute a table of expected frequencies (used by chisq.test)
library(vcd)
independence_table(x, frequency = c("absolute", "relative"))
x is a table. frequency indicates whether absolute or relative frequencies should be computed.
Tabular Data 2 x 2 and larger Table
fisher.test(table(factor1,factor2))
EXAMPLE
with(warpbreaks,fisher.test(table(wool,tension)))
EXAMPLE
x<-ftable(mtcars[,c(2,8)])
independence_table(x)
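The expected frequencies under independence are just (row total * column total) / grand total for each cell, so they can be built in base R without vcd. A sketch using two mtcars columns as illustrative data:

```r
# Observed 2-way table from built-in data (illustrative)
obs <- with(mtcars, table(cyl, vs))

# Expected count in each cell = (row total * column total) / grand total
expected <- outer(rowSums(obs), colSums(obs)) / sum(obs)

# chisq.test() stores the same matrix in its $expected component
# (warnings about small expected counts are suppressed for this small table)
ct <- suppressWarnings(chisq.test(obs))
round(expected, 2)
```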
EXAMPLE
DF<-read.table("fake remdial reading (logistic regression example).csv", header=TRUE,
sep=",",na.strings="999")
DF<-data.frame(DF,"Fav.Color"=sample(c("blue","red","green","orange"),nrow(DF),replace=T))
with(DF,table(sex,rem.read.rec))
with(DF,table(sex,rem.read.rec,Fav.Color))
with(DF,ftable(sex,rem.read.rec,Fav.Color))
with(DF,ftable(rem.read.rec~sex+Fav.Color))
(tab1<-with(DF,ftable(sex+Fav.Color~rem.read.rec)))
prop.table(tab1,1)
(percentTABLE<-prop.table(tab1,1)*100)
(tab2<-with(DF,table(sex,Fav.Color,rem.read.rec)))
prop.table(tab2,1)
(percentTABLE<-prop.table(tab2,1)*100)
Cross Tabulation with Tests for Factor Independence
library(gmodels)
CrossTable(x, y, digits=3, max.width = 5, expected=FALSE, prop.r=TRUE, prop.c=TRUE, prop.t=TRUE, prop.chisq=TRUE, chisq = FALSE, fisher=FALSE, mcnemar=FALSE, resid=FALSE, sresid=FALSE, asresid=FALSE, missing.include=FALSE, format=c("SAS","SPSS"), dnn = NULL, ...)
Arguments
x A vector or a matrix. If y is specified, x must be a vector
y A vector in a matrix or a dataframe
digits Number of digits after the decimal point for cell proportions
max.width In the case of a 1 x n table, the default will be to print the output horizontally. If the number of columns exceeds max.width, the table will be wrapped for each successive increment of max.width columns. If you want a single column vertical table, set max.width to 1
expected If TRUE, chisq will be set to TRUE and expected cell counts from the Chi-Square will be included
prop.r If TRUE, row proportions will be included
prop.c If TRUE, column proportions will be included
prop.t If TRUE, table proportions will be included
prop.chisq If TRUE, chi-square contribution of each cell will be included
chisq If TRUE, the results of a chi-square test will be included
fisher If TRUE, the results of a Fisher Exact test will be included
mcnemar If TRUE, the results of a McNemar test will be included
resid If TRUE, residual (Pearson) will be included
sresid If TRUE, standardized residual will be included
asresid If TRUE, adjusted standardized residual will be included
missing.include If TRUE, then remove any unused factor levels
format Either SAS (default) or SPSS, depending on the type of output desired.
dnn the names to be given to the dimensions in the result (the dimnames names).
Combine columns or rows of a cross table
library(vcdExtra)
collapse.table(table)
EXAMPLES
library(gmodels)
CrossTable(infert$education, infert$induced, expected = TRUE, chisq = TRUE, fisher = TRUE,
mcnemar = TRUE, resid = TRUE, sresid = TRUE, asresid = TRUE, missing.include = TRUE)
#&&&&&&&&&&&&&&&&&&&&&
CrossTable(mtcars$cyl,mtcars$vs,format="SAS")
CrossTable(mtcars$cyl,mtcars$vs,format="SPSS")
#EXAMPLE
library(vcdExtra)
# create some sample data in table form
sex <- c("Male", "Female")
age <- letters[1:6]
education <- c("low", "med", "high")
data <- expand.grid(sex=sex, age=age, education=education)
counts <- rpois(36, 100)
data <- cbind(data, counts)
(t1 <- xtabs(counts ~ sex + age + education, data=data))
# collapse age to 3 levels
(t2 <- collapse.table(t1, age=c("A", "A", "B", "B", "C", "C")))
# collapse age to 3 levels and pool education: "low" and "med" to "low"
(t3 <- collapse.table(t1, age=c("A", "A", "B", "B", "C", "C"),
education=c("low", "low", "high")))
# change labels for levels of education to 1:3
(t4 <- collapse.table(t1, education=1:3))
Strength of Effect Measures [SOE] (Tabular Data)
Read Measures of association in crosstab tables article for SOE measures decisions
Compute Pearson χ2, Likelihood Ratio χ2, φ coefficient, contingency coefficient & Cramer's V
library(vcd)
assocstats(x)
Cohen's kappa and weighted kappa for a confusion matrix
library(vcd)
Kappa(x) x is a square confusion matrix (table). Note the capital K: base R's kappa(z) is a different function that estimates the condition number of a matrix (z is a matrix, the result of qr, or a fit from a class inheriting from "lm").
EXAMPLES
library(vcd) #Arthritis and assocstats come from the vcd package
data("Arthritis")
Arthritis
(tab <- xtabs(~Improved + Treatment, data = Arthritis))
summary(assocstats(tab))
#AND
x<-ftable(mtcars[,c(2,8)])
summary(assocstats(x))
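Two of the measures reported by assocstats can be derived by hand from the Pearson chi-square, which helps when interpreting the output. The sketch assumes the usual definitions, contingency coefficient = sqrt(chi2/(chi2+n)) and Cramer's V = sqrt(chi2/(n*(min(rows, cols)-1))); the helper names are hypothetical.

```r
# Hypothetical helpers built on the uncorrected Pearson chi-square
pearson_chi2 <- function(tab) {
  unname(suppressWarnings(chisq.test(tab, correct = FALSE))$statistic)
}
contingency_coef <- function(tab) {
  chi2 <- pearson_chi2(tab)
  sqrt(chi2 / (chi2 + sum(tab)))
}
cramers_v <- function(tab) {
  chi2 <- pearson_chi2(tab)
  sqrt(chi2 / (sum(tab) * (min(dim(tab)) - 1)))
}

tab <- with(mtcars, table(cyl, vs))   # illustrative data only
c(contingency = contingency_coef(tab), cramers_v = cramers_v(tab))
```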
Examples (these call base R's kappa(), which returns the condition number of a matrix, not Cohen's kappa)
kappa(x1 <- cbind(1,1:10))# 15.71
kappa(x1, exact = TRUE) # 13.68
kappa(x2 <- cbind(x1,2:11))# high! [x2 is singular!]
Turn a table into a dataframe METHOD 1
table2flat(table) table- can be ftable, xtabs, table
Turn a table into a dataframe METHOD 2
library(vcdExtra)
expand.dft(x, var.names = NULL, freq = "Freq", ...)
expand.table(x, var.names = NULL, freq = "Freq", ...)
Categorical Article Data (Cross Table) to Raw Data
Below is an example starting with creating a table from numeric values (replicate data frame from results)
FROM AN ARTICLE TO A TABLE TO RAW DATA
#CODE
table2flat <- function(mytable){
  #by Robert Kabacoff
  df <- as.data.frame(mytable)
  rows <- dim(df)[1]
  cols <- dim(df)[2]
  x <- NULL
  for (i in 1:rows){
    #seq_len() skips cells with a count of 0 (1:0 would loop twice)
    for (j in seq_len(df$Freq[i])){
      row <- df[i, c(1:(cols-1))]
      x <- rbind(x, row)
    }
  }
  row.names(x) <- c(1:dim(x)[1])
  return(x)
}
#EXAMPLE
x <- with(mtcars,table(am, gear, cyl, vs))
table2flat(x)
x2 <- with(mtcars,ftable(am, gear, cyl, vs))
table2flat(x2)
#===================================================================
# CREATE THE DATA FRAME FROM A MATRIX OF FREQUENCIES
#===================================================================
d2 <- matrix(c(23, 15, 66, 34, 19, 22), ncol=3, nrow=2)
dimnames(d2) <-list(Gender=c("boys", "girls"),
Inst.Meth=c("direct.int", "explicit.learn", "didactic"))
d2<-as.table(d2)
d2
#===================================================================
expand.dft(d2) #BOTH FUNCTIONS WILL RETURN THE DATA FRAME
table2flat(d2)
art <- xtabs(~Treatment + Improved, data = Arthritis)
art
expand.dft(art)
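The same table-to-raw-data expansion can be done without loops by repeating each row index Freq times. This is a base R sketch of the idea, reusing the frequencies from the d2 matrix above; it is an alternative I am suggesting, not the method used by expand.dft.

```r
# Same frequencies as the d2 table above
d2 <- as.table(matrix(c(23, 15, 66, 34, 19, 22), nrow = 2,
                      dimnames = list(Gender = c("boys", "girls"),
                                      Inst.Meth = c("direct.int", "explicit.learn", "didactic"))))

counts <- as.data.frame(d2)                 # one row per cell, counts in Freq
raw <- counts[rep(seq_len(nrow(counts)), counts$Freq), c("Gender", "Inst.Meth")]
row.names(raw) <- NULL                      # one row per observation

table(raw)                                  # tabulating the raw data recovers d2
```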
ANOVA
ANOVA (balanced or not; as many ways as you want [1 way, 2 way, 3 way …]) linear model
Type:
anova(lm(sc~ g)) One way
Analysis of Variance Table
Response: sc
          Df Sum Sq Mean Sq F value  Pr(>F)
g          1 18.778 18.7778  6.7759 0.01359 *
Residuals 34 94.222  2.7712
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
anova(lm(sc~ s*a*g)) Multi-way (gives interactions)
Analysis of Variance Table
Response: sc
          Df Sum Sq Mean Sq F value  Pr(>F)
s          1  4.000  4.0000  1.8947 0.18138
a          2 23.167 11.5833  5.4868 0.01091 *
g          1 18.778 18.7778  8.8947 0.00647 **
s:a        2  1.167  0.5833  0.2763 0.76095
s:g        1 11.111 11.1111  5.2632 0.03083 *
a:g        2  1.389  0.6944  0.3289 0.72287
s:a:g      2  2.722  1.3611  0.6447 0.53365
Residuals 24 50.667  2.1111
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
anova(lm(sc~ s+a+g)) Multi-way (gives only main effects)
Analysis of Variance Table
Response: sc
          Df Sum Sq Mean Sq F value   Pr(>F)
s          1  4.000  4.0000  1.8492 0.183684
a          2 23.167 11.5833  5.3550 0.010055 *
g          1 18.778 18.7778  8.6810 0.006056 **
Residuals 31 67.056  2.1631
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Note: sc is the dependent variable; s, a, g are the main effects (IVs); s*a, s*g, a*g, & s*a*g are the interactions.
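The difference between the + and * formulas is easy to see on a built-in data set. The sketch below uses warpbreaks (my choice of illustrative data; sc, s, a and g from the notes are not available), where breaks is numeric and wool and tension are factors.

```r
# Built-in warpbreaks data: breaks (numeric DV), wool and tension (factor IVs)
one_way   <- anova(lm(breaks ~ tension, data = warpbreaks))
main_only <- anova(lm(breaks ~ wool + tension, data = warpbreaks))
with_int  <- anova(lm(breaks ~ wool * tension, data = warpbreaks))

rownames(main_only)   # main effects plus residuals
rownames(with_int)    # the * formula adds the wool:tension interaction row
```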
Analysis of Variance AOV Model
Type:
z.aov<-aov(y~factor1*factor2*factor3)
Where z.aov is the output, y is the DV (numeric scores), and factor1, factor2, factor3 are your categorical IVs.
summary(z.aov)
This will give you the same table as anova(lm(y~ factor1*factor2*factor3)) from the linear model approach.
Means Tables
This is for after you have run an aov model (make sure you’ve labeled the categorical IVs as factors using the as.factor function). model.tables is part of base R's stats package, so no extra package needs to be loaded:
model.tables(z.aov,"means",se=T)
Where z.aov is the output label for the aov model you’ve just run. This gives you the means tables for main and interaction effects.
Residual plots
plot(model) example: plot(hw.aov)
Where model is the aov model.
Post Hoc & Protected Tests (for use after ANOVA)
Tukey
TukeyHSD(model) [example: TukeyHSD(z.aov)]
Where model is the output label for the aov.
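A compact run-through of the aov workflow described here, again using the built-in warpbreaks data as a stand-in for the author's variables:

```r
# Built-in warpbreaks data (illustrative): breaks is the DV, wool and tension are factors
z.aov <- aov(breaks ~ wool * tension, data = warpbreaks)

summary(z.aov)                           # ANOVA table
model.tables(z.aov, "means", se = TRUE)  # means for main effects and interaction
tuk <- TukeyHSD(z.aov, "tension")        # pairwise comparisons for tension
tuk
```

plot(z.aov) would then produce the residual diagnostic plots for the same model.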
MANOVA
I will use the following data set to illustrate the MANOVA.
I usually go in and change the variable names to something simple using the command:
s<-data$Study.Group
Then make sure your categorical variables are factors using the command:
s<-as.factor(s)
The next step is to bind your outcome variables:
y<-cbind(c,l,h)
The output (when entering y) should look like this
Now you can check for outliers using the aq.plot command from the mvoutlier package:
library(mvoutlier)
> aq.plot(y)
Projection to the first and second robust principal
components.
Proportion of total variation (explained variance):
0.8742298
$outliers
[1] FALSE FALSE FALSE FALSE FALSE FALSE FALSE TRUE FALSE
FALSE FALSE FALSE TRUE FALSE
[15] TRUE TRUE FALSE TRUE FALSE FALSE FALSE FALSE FALSE
TRUE
Warning message:
In princomp.default(x, covmat = covr) :
both 'x' and 'covmat' were supplied: 'x' will be ignored
[Figure: aq.plot output in four panels: (1) scatterplot of the 24 observations; (2) cumulative probability against ordered squared robust distance, with the 97.5% quantile and the adjusted quantile marked; (3) outliers based on 97.5% quantile; (4) outliers based on adjusted quantile. Observations 8, 13, 15, 16, 18 and 24 are flagged in panels 3 and 4.]
Next you can run the MANOVA using the following command:
v<-manova(y ~ s * r)
The v is an arbitrary choice just as y was in the last step. The s*r will give you the main effects and the interaction as well.
The output (when entering v) should look like this
To get the F statistics you need to call up a summary command:
summary(v)
By default the test will be the Pillai. If you wish to change the output to Wilks enter:
summary(v,test="Wilks")
The output should look like this
Note: you can also change Wilks to Roy
From here you should go on to run individual ANOVAs on each of the outcome variables to complete the ANOVA table.
MANOVA Correlation Table (for the DVs)
Type:
cor(y)
Note: y is the bound matrix of numeric variables. The output will be a correlation table.
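Since the data set used in these notes is not reproduced, here is the same MANOVA sequence on the built-in iris data (my substitution): bind the outcomes, fit the model, request different test statistics, then follow up with univariate ANOVAs.

```r
# Built-in iris data: three of its numeric columns as outcomes, Species as the factor
y <- cbind(sepal_len = iris$Sepal.Length,
           sepal_wid = iris$Sepal.Width,
           petal_len = iris$Petal.Length)
g <- iris$Species

fit <- manova(y ~ g)
pillai <- summary(fit)                  # Pillai's trace is the default test
wilks  <- summary(fit, test = "Wilks")  # same model, Wilks' lambda
univariate <- summary.aov(fit)          # one follow-up ANOVA per outcome
cor(y)                                  # correlations among the outcomes
```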
Repeated Measures Anova (Balanced Design) (It is inappropriate to use this for unbalanced designs and/or missing values)
A 3 time no between factors model (IV-Subject[categorical and random]; IV-Meal[categorical and fixed]; DV-Cholesterol Rating[numeric])
Data Set
Type:
ex20.aov <- aov(Cholesterol.Intake ~factor(Meal) + Error(factor(Student)))
Where Cholesterol.Intake is the DV, Meal is an IV fixed factor, and Student is an IV random factor.
[output]
A 3 time within and between factors model (IV-Subject[categorical and random]; IV-Gender[categorical and fixed]; Instructor Type[categorical and fixed]; DV-Cholesterol Rating[numeric])
> ex27.10.aov <- aov(Study.Time ~factor(Instructor.Type)*factor(Gender) + Error(factor(Student)))
And then…
> summary(ex27.10.aov)
[output]
Note: see section 27:12 for the ANOVA table that corresponds to this output.
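The aov approach needs the data in long format: one row per subject per measurement point. A sketch with simulated data (variable names borrowed from the first model above; the numbers are random and purely illustrative):

```r
set.seed(7)  # simulated long-format data: 8 students x 3 meals (illustrative)
long <- data.frame(
  Student = factor(rep(1:8, each = 3)),
  Meal    = factor(rep(c("breakfast", "lunch", "dinner"), times = 8)),
  Cholesterol.Intake = rnorm(24, mean = 200, sd = 20)
)

# Meal is the fixed within-subject factor; Student defines the error stratum
rm_fit <- aov(Cholesterol.Intake ~ Meal + Error(Student), data = long)
summary(rm_fit)
```

The summary prints one table per error stratum; the Meal effect appears in the "Within" stratum.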
Repeated Measures Anova (balanced or unbalanced) [relies on the car package]
A 3 time no between factors model (IV-Subject[categorical and random]; IV-Meal[categorical and fixed]; DV-Cholesterol Rating[numeric])
Note: this model keeps data sets in the traditional data table format (no need to re-format data); additionally the DV does not have an actual column (it is instead the numeric measurements at the meal variables)
Data Set
1) Create a vector of levels for the measurement points (1 for each measurement point):
meals <- c(1, 2, 3)
Where meals is the new vector name, and the numbers represent each measurement point.
2) Create a within groups measurement point factor to house the levels you just created (this will be used later in our matrix style data frame and in our Anova analysis):
mealFactor<- as.factor(meals)
Where mealFactor is the new factor with n levels to house our levels that describe our n numeric columns (measurement points).
3) Create a matrix style data frame from the factor and levels that will be used to describe our numeric columns (measurement points):
mealFrame <- data.frame(mealFactor)
4) Now create a bound matrix containing the n numeric columns for later use in the linear model:
mealBind<-cbind(breakfast, lunch, dinner)
5) Create a linear model with the bound matrix you just created.
mealModel<-lm(mealBind~1)
6) Use the Anova function from the car package to analyze our data (notice we are using the measurement point matrix style data frame and corresponding within groups factors as well as the linear model we just created):
analysis3 <- Anova(mealModel, idata = mealFrame, idesign = ~mealFactor)
Note: we could have added the argument ,type="III" but the default of Anova is to switch from type II to type III SS when there is only one intercept
7) Now create a summary of the anova tables and information:
summary(analysis3)
Look below at the summary:
Possible Errors for small DF
Your tutorial is excellent. I was able to follow it easily and quickly analyze a data set I've been working with for a long time. I tried applying the same steps to another data set but when I tried to use the Anova(mod, idata, idesign) function I got the following error message:
Error in linearHypothesis.mlm(mod, hyp.matrix, SSPE = SSPE, idata = idata, : The error SSP matrix is apparently of deficient rank = 3 < 4
Do you have any idea what this means or how to deal with it? Thanks a lot!
John M. Quick said...
Thanks for the comments. I am familiar with this error. In short, it has to do with a combination of a lack of degrees of freedom to execute the multivariate tests (i.e. small sample size compared to variables) and the inability of the Anova() function to ignore/forgo calculating the multivariate tests. See this R listserv discussion for details: http://r.789695.n4.nabble.com/Anova-in-car-SSPE-apparently-deficient-rank-tp997619p997619.html
An alternative, which will get you the Greenhouse-Geisser and Huynh-Feldt epsilon corrections, but no multivariate tests, is to use the anova() function.
anova(ageModel, idata = ageFrame, X = ~ageFactor, test = "Spherical")
One caveat, I believe, is that this will use Type I SS, whereas my Anova() example uses Type III SS. I'm not sure how to get Type III SS with the anova() function.
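For reference, a self-contained sketch of the base R anova() route on simulated wide-format data. It follows the two-model pattern from the anova.mlm help page (full model against a null model, X = ~1), which is my choice here and differs slightly from the single-model call quoted above; all data and object names are made up.

```r
set.seed(3)  # simulated wide-format data: 10 subjects x 3 measurement points
scores <- cbind(breakfast = rnorm(10, 200, 20),
                lunch     = rnorm(10, 210, 20),
                dinner    = rnorm(10, 220, 20))

fit  <- lm(scores ~ 1)      # multivariate model with an intercept only
fit0 <- update(fit, ~ 0)    # null model without the intercept

# Test of the within-subject effect assuming sphericity; the printed table
# includes Greenhouse-Geisser and Huynh-Feldt adjusted p values
sph <- anova(fit, fit0, X = ~1, test = "Spherical")
sph
```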
A 3 time within and between factors model (IV-Subject[categorical and random]; IV-Gender[categorical and fixed]; Instructor Type[categorical and fixed]; DV-Cholesterol Rating[numeric])
Note: this model keeps data sets in the traditional data table format (no need to re-format data); additionally the DV does not have an actual column (it is instead the numeric measurements at the measurement point variables)
Data Set
1) Create a vector of levels for the measurement points (1 for each measurement point):
instructor <- c(1, 2, 3)
Where instructor is the new vector name, and the numbers represent each measurement point.
2) Create a within groups measurement point factor to house the levels you just created (this will be used later in our matrix style data frame and in our Anova analysis):
instructorF<- as.factor(instructor)
Where instructorF is the new factor with n levels to house our levels that describe our n numeric columns (measurement points).
3) Create a matrix style data frame from the factor and levels that will be used to describe our numeric columns (measurement points):
instructorFR <- data.frame(instructorF)
4) Now create a bound matrix containing the n numeric columns for later use in the linear model:
instructorBind<-cbind(male, female, computer)
5) Create a linear model with the bound matrix you just created.
LMmodel<-lm(instructorBind~gender)
Notice we have included the fixed between groups gender variable in the linear model.
6) Use the Anova function from the car package to analyze our data (notice we are using the measurement point matrix style data frame and corresponding within groups factors as well as the linear model we just created):
analysis7 <- Anova(LMmodel, idata = instructorFR, idesign = ~instructorF)
7) Now create a summary of the anova tables and information:
summary(analysis7)
Look below at the summary and how the information was placed in an anova table:
> instructor <- c(1, 2, 3)
> instructorF <- as.factor(instructor)
> instructorFR <- data.frame(instructorF)
> instructorBind <- cbind(male, female, computer)
> LMmodel <- lm(instructorBind ~ gender)
> analysis4 <- Anova(LMmodel, idata = instructorFR, idesign = ~instructorF)
> summary(analysis4)
Analysis of Variance Table
Source                   SS        df   MS      F
Grand Mean               2022.78    1
Student Gender(A)           3.56    1    3.56    1.21
Instructor Type(B)         92.29    2   46.14   22.58**
AB Interaction             37.16    2   18.58    9.09**
Student within             47.07   16    2.94
Student x Instructor       65.39   32    2.04
Total                    2268.25   54
.01 Critical Values: F(1,16) = 8.53, F(2,32) = 5.387
**p<.01
1. MS=SS/df
2. To get totals sum the columns
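The arithmetic behind the table can be checked by hand. The sketch below is only an illustration: the SS and df values are copied from the table, the object names (ss, df_tab, ms) are made up for this example, and the choice of error term for each F (Student within for gender, Student x Instructor for the within effects) is the usual mixed-design convention, assumed here rather than taken from the Anova() output.

```r
# SS and df copied from the ANOVA table above (names are illustrative)
ss <- c(gender = 3.56, instructor = 92.29, interaction = 37.16,
        student_within = 47.07, student_x_instructor = 65.39)
df_tab <- c(gender = 1, instructor = 2, interaction = 2,
            student_within = 16, student_x_instructor = 32)

# Rule 1 from the notes: MS = SS/df
ms <- ss / df_tab

# Between effect is tested against Student within;
# within effects are tested against Student x Instructor
F_gender      <- ms[["gender"]]      / ms[["student_within"]]
F_instructor  <- ms[["instructor"]]  / ms[["student_x_instructor"]]
F_interaction <- ms[["interaction"]] / ms[["student_x_instructor"]]
round(c(F_gender, F_instructor, F_interaction), 2)

# .01 critical values for comparison
qf(.99, 1, 16)
qf(.99, 2, 32)
```

The rounded F values should match the F column of the table.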
Graph comparing the regression line of the three models
LINEAR MODELING
Linear Model
m1 <- lm(gl~sc,data=df)
summary(m1)
Note: m1 is changeable, gl and sc are the numeric variables.
Resistant Linear Model
library(MASS)
lqs(gl~sc,data=df)
For models with outliers consider this model. Uses least median of squares (LMS) & least trimmed squares (LTS).
Robust Linear Model
library(MASS)
rlm(gl~sc,data=df)
For models with outliers and heteroscedasticity problems consider this model.
CODE FOR THE GRAPH ABOVE
mtcars2<-data.frame(rbind(mtcars[,c(1,6)],"mp800"=c(16.0,9),"DFxz00"=c(12.0,3.7)))
mod<-lm(mpg~wt,data=mtcars2);library(MASS)
mod2<-lqs(mpg~wt,data=mtcars2)
mod3<-rlm(mpg~wt,data=mtcars2)
plot(mod);par(ask=T);library(mvoutlier)
aq.plot(mtcars2[c("mpg","wt")]);par(ask=T)
uni.plot(mtcars2[c("mpg","wt")],symbol=T)
influence.measures(mod)
par(ask=T);par(mfrow=c(1,1))
with(mtcars2,plot(wt,mpg))
abline(reg=mod,lty=1,col="blue")
abline(reg=mod2,lty=2,col="red")
abline(reg=mod3,lty=3,col="green")
legend(x=6.72,y=33.67,legend=c("lm()","lqs()","rlm()"),
lty=c(1,2,3),col=c("blue","red","green"))
mtext("Notice how the outliers affect lm(), less with lqs(), and least with rlm()", font=4,side=3,col="dark green")
[Figure: scatterplot of mpg (y axis) against wt (x axis) with the lm(), lqs(), and rlm() regression lines. Title: "Notice how the outliers affect lm(), less with lqs(), and least with rlm()"]
> mtcars2
mpg wt
Mazda RX4 21.0 2.620
Mazda RX4 Wag 21.0 2.875
Datsun 710 22.8 2.320
Hornet 4 Drive 21.4 3.215
Hornet Sportabout 18.7 3.440
Valiant 18.1 3.460
Duster 360 14.3 3.570
Merc 240D 24.4 3.190
Merc 230 22.8 3.150
Merc 280 19.2 3.440
Merc 280C 17.8 3.440
Merc 450SE 16.4 4.070
Merc 450SL 17.3 3.730
Merc 450SLC 15.2 3.780
Cadillac Fleetwood 10.4 5.250
Lincoln Continental 10.4 5.424
Chrysler Imperial 14.7 5.345
Fiat 128 32.4 2.200
Honda Civic 30.4 1.615
Toyota Corolla 33.9 1.835
Toyota Corona 21.5 2.465
Dodge Challenger 15.5 3.520
AMC Javelin 15.2 3.435
Camaro Z28 13.3 3.840
Pontiac Firebird 19.2 3.845
Fiat X1-9 27.3 1.935
Porsche 914-2 26.0 2.140
Lotus Europa 30.4 1.513
Ford Pantera L 15.8 3.170
Ferrari Dino 19.7 2.770
Maserati Bora 15.0 3.570
Volvo 142E 21.4 2.780
mp800 16.0 9.000
DFxz00 12.0 3.700
Calling Components of Linear Models and Summaries
Use model$ and one of the components below [use names(model) to view these]
# [1] "coefficients" "residuals" "effects" "rank"
# [5] "fitted.values" "assign" "qr" "df.residual"
# [9] "xlevels" "call" "terms" "model"
Use summary(model)$ and one of the components below [use names(summary(model)) to view these]
names(summary(fit))
# [1] "call" "terms" "residuals" "coefficients"
# [5] "aliased" "sigma" "df" "r.squared"
# [9] "adj.r.squared" "fstatistic" "cov.unscaled"
Accessor Functions
residuals(model); resid(model)
coefficients(model); coef(model)
fitted(model)
predict(model)
deviance(model)
df.residual(model)
rstandard(model)
rstudent(model)
influence.measures(model)
influence(model)
dfbeta(model)
dfbetas(model)
covratio(model)
cooks.distance(model)
hatvalues(model)
BIC(model)
AIC(model)
model.frame(model)
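A short illustration of the two access styles, using the built-in mtcars data. The model name fit and the chosen formula are assumptions made for this sketch only.

```r
# Fit a simple model on a built-in data set
fit <- lm(mpg ~ wt, data = mtcars)

# Component access with $ and the matching accessor function
fit$coefficients
coef(fit)

# Components of the summary object
s <- summary(fit)
s$r.squared
s$coefficients        # estimates, standard errors, t values, p values

# A few other accessors from the list above
head(resid(fit), 3)
head(fitted(fit), 3)
df.residual(fit)
```

The accessor functions are generally the safer habit because they keep working for other model classes whose internal component names differ.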
Dealing with multicollinearity method 1
lm.ridge()
From the library(MASS) it attempts to minimize SS residuals and penalizes for coefficient sizes (ridge regression).
Dealing with multicollinearity method 2
lars(x,y,type= "lasso")
Where x is a matrix of predictor values and y is the response variable. From the library(lars) it penalizes for coefficient sizes differently than lm.ridge, using the least angle regression algorithm.
Dealing with multicollinearity method 3
pcr(formula)
From the library(pls) it transforms the predictors into principal components and then linear regression is performed on the components.
Dealing with multicollinearity method 4
plsr(formula)
From the library(pls) it performs partial least squares regression.
Dealing with multicollinearity method 5
Mean centering according to Aiken and West
Linear Model Hypothesis Test For Simple Linear Regression
See: ex21c.docx
Regression Analysis
Use the source code below for calling Regression and Correlations Functions:
source("C:/Users/Rinker/Desktop/PhD Program/CEP 523-Stat Meth Ed Inference/R Stuff/Scripts/Regression/.Regression Bundle.txt")
rfun()
source("C:/Users/Rinker/Desktop/PhD Program/CEP 523-Stat Meth Ed Inference/R Stuff/Scripts/Regression/Correlation Matrix and Data Visualization.txt")
cmat()
Calculate Multiple Regression from a correlation matrix
library(psych)
mat.regress(m, x, y, n.obs=NULL)
Multiple Regression from matrix input
Arguments
m      a matrix of correlations or, if not square, of data
x      either the column numbers of the x set (e.g., c(1,3,5)) or the column names of the x set (e.g., c("Cubes","PaperFormBoard"))
y      either the column numbers of the y set (e.g., c(2,4,6)) or the column names of the y set (e.g., c("Flags","Addition"))
n.obs  If specified, then confidence intervals, etc. are calculated; not needed if raw data are given
Allows you to calculate multiple regression from a correlation matrix. Useful for interpreting results from someone else's work.
Computes the confidence interval for a desired level for the squared-multiple correlation
CI.Rsq(rsq, n, k, level = 0.95)
With this function you have to enter the R2 yourself; with CI.Rsqlm() you pass the fitted model instead.
require(foreign)
data<-read.spss("data3_Revised.sav", use.value.labels = TRUE, to.data.frame = TRUE)
dat <- subset(data, select=c(achiev,momedu,ses,parsupport, parmonitor,rules ))
dat3 <-subset(dat, select=c(achiev,parsupport, parmonitor,rules))
dat3<-na.omit(dat3)
mod7 <- with(dat3, lm(achiev ~parsupport + parmonitor + rules))
test.data<-cor(dat3)
x<-mat.regress(test.data,c(2,3,4),c(1),n.obs=478)#choose the variables by number
summary(x,digits=4) #note gives standardized beta weights
#compare to:
summary(mod7);mod7
stan.beta(mod7)
Computes the confidence interval for a desired level for the squared-multiple correlation
library(psychometric)
CI.Rsqlm(obj, level = 0.95)
Where obj is the linear model (i.e., obj<-lm(y~x1+x2)) and level is the confidence level desired.
Arguments
R      Correlation Coefficient
n      Sample Size
level  Significance Level for constructing the CI, default is .95
Predicting from a model sPREDICT
predict(object, newdata)
#EXAMPLE
(mod <- lm(mpg~hp+disp+hp:disp, data=mtcars))
NEW<-data.frame(hp=c(260, 280), disp=c(330, 350))
predict(mod, NEW)
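predict() for lm objects can also return interval estimates. The sketch below reuses the model and the NEW data frame from the example above; the object names conf_int and pred_int are made up for illustration.

```r
mod <- lm(mpg ~ hp + disp + hp:disp, data = mtcars)
NEW <- data.frame(hp = c(260, 280), disp = c(330, 350))

# Confidence interval for the mean response at the new values
conf_int <- predict(mod, newdata = NEW, interval = "confidence", level = 0.95)

# Prediction interval for a single new observation (always wider)
pred_int <- predict(mod, newdata = NEW, interval = "prediction", level = 0.95)

conf_int
pred_int
```

Both calls return a matrix with columns fit, lwr, and upr. The column names in newdata must match the predictor names used in the model formula.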
One Factor/One Continuous ANCOVA Example:
#============================================================
#GETTING THE DATA
#============================================================
regrowth<-read.table("ipomopsis.txt", header=TRUE, sep="\t", na.strings="999", stringsAsFactors=TRUE)
#stringsAsFactors=TRUE keeps Grazing a factor (R >= 4.0 no longer does this by default)
attach(regrowth)
names(regrowth)
head(regrowth)
#COVARIATE-->"Root"/OUTCOME-->"Fruit"/CATEGORICAL-->"Grazing"
#============================================================
#LOOKING AT MEANS
#============================================================
mean(subset(regrowth,Grazing=="Ungrazed")$Fruit)
mean(subset(regrowth,Grazing=="Grazed")$Fruit)
#............................
#Looking at means we would suggest that the grazed plants actually produce more fruit (incorrect assumption as the plot will show)
#............................
#============================================================
#PLOTTING THE DATA
#============================================================
plot(Root,Fruit,pch=16+as.numeric(Grazing),col=c("blue","green")[as.numeric(Grazing)])
#............................
#A look at the lines reveals ungrazed actually produces more fruit, opposite of what the means suggest
#16+as.numeric(Grazing) is what turns the categorical data into plot points [16 changes the point type]
#............................
abline(lm(Fruit[Grazing=="Grazed"]~Root[Grazing=="Grazed"]),lty=15,col="blue")
abline(lm(Fruit[Grazing=="Ungrazed"]~Root[Grazing=="Ungrazed"]),lty=3,col="dark green")
legend(locator(1),c("Grazed","Ungrazed"),fill=c("blue","dark green"))
#............................
#draws the regression lines for each group of Grazing as described by the covariate Root
#............................
#============================================================
#ANALYZING THE DATA (ANCOVA)
#============================================================
ancova.fruit<-lm(Fruit~Grazing*Root)
#............................
#covariates go second, because we are not interested in their effects, just the additional error they remove and the power they give
#order matters here: anova(lm(Fruit~Root*Grazing)) will give a different output
#............................
summary(ancova.fruit)
anova(ancova.fruit)
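The note that order matters can be seen on any data set where the factor and the covariate are correlated. Since ipomopsis.txt may not be at hand, the sketch below substitutes the built-in mtcars data (am as the factor, wt as the covariate); this substitution and the object names are assumptions for illustration, not part of the original example.

```r
# Stand-in data: am plays the role of Grazing, wt the role of Root
cars <- transform(mtcars, am = factor(am, labels = c("automatic", "manual")))

a_factor_first    <- anova(lm(mpg ~ am * wt, data = cars))
a_covariate_first <- anova(lm(mpg ~ wt * am, data = cars))

# anova() uses sequential (Type I) SS, so the SS credited to am
# depends on whether wt was already in the model
a_factor_first["am", "Sum Sq"]
a_covariate_first["am", "Sum Sq"]

# The residual SS is the same either way: it is the same fitted model
a_factor_first["Residuals", "Sum Sq"]
a_covariate_first["Residuals", "Sum Sq"]
```

Only the split of the explained variation among the terms changes with the order; the fitted values and residuals do not.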
Two Factor/One Continuous ANCOVA Example:
#============================================================
#GETTING THE DATA
#============================================================
Gain<-read.table("Gain.txt", header=TRUE, sep="\t",na.strings="999")
attach(Gain)
names(Gain)
head(Gain)
#COVARIATE-->"Age"/OUTCOME-->"Weight"/CATEGORICAL-->"Sex"/CATEGORICAL-->"Genotype"
#============================================================
#LOOKING AT MEANS
#============================================================
#............................
#method 1: summaryBy() from library(doBy)
#............................
library(doBy)
summaryBy(Weight~ Sex+Genotype, data = Gain,FUN = function(x) { c(n = length(x),mean = mean(x), sd = sd(x)) } )
#............................
#method 2
#............................
source("C:/Users/Rinker/Desktop/PhD Program/CEP 523-Stat Meth Ed Inference/R Stuff/Scripts/Regression/.Regression Bundle.txt")
rfun()
desc2v(Gain,Sex,Genotype)
#============================================================
#ANALYZING THE DATA (ANCOVA)
#============================================================
ancova.gain<-lm(Weight~Sex*Age*Genotype)
summary(ancova.gain)
anova(ancova.gain)
Basic Functions
Generic Functions
Find the probability of obtaining the same result if the experiment were conducted again
library(psych)
Usage
p.rep(p = 0.05, n=NULL, twotailed = FALSE)
p.rep.f(F, df2, twotailed=FALSE)
p.rep.r(r, n, twotailed=TRUE)
p.rep.t(t, df, df2=NULL, twotailed=TRUE)
Arguments
p          conventional probability of statistic (e.g., of F, t, or r)
F          The F statistic
df         Degrees of freedom of the t-test, or of the first group if unequal sizes
df2        Degrees of freedom of the denominator of F or the second group in an unequal sizes t test
r          Correlation coefficient
n          Total sample size if using r
t          t-statistic if doing a t-test or testing significance of a regression slope
twotailed  Should a one or two tailed test be used?
Critical Values (p=1-α or 1-α/2 [1 or 2 tail], df= degrees of freedom, q=critical value)
The birthday paradox
Find the # of occurrences given the probability of the event
qbirthday(prob = 0.5, classes = 365, coincident = 2)
Find the probability given the # of occurrences of the event
pbirthday(n, classes = 365, coincident = 2)
Function        What it does
qnorm(p)        Returns a value q such that the area to the left of q for a standard normal random variable is p.
pnorm(q)        Returns p, the area to the left of q on a standard normal distribution.
qt(p,df)        Returns a value q such that the area to the left of q on a t(df) distribution equals p.
pt(q,df)        Returns p, the area to the left of q for a t(df) distribution.
qf(p,df1,df2)   Returns a value q such that the area to the left of q on a F(df1, df2) distribution is p. For example, qf(.95,3,20) returns the 95% point of the F(3, 20) distribution.
pf(q,df1,df2)   Returns p, the area to the left of q on a F(df1, df2) distribution.
qchisq(p,df)    Returns a value q such that the area to the left of q on a χ2(df) distribution is p.
pchisq(q,df)    Returns p, the area to the left of q on a χ2(df) distribution.
EXAMPLES
qbirthday(prob = .95, classes = 365, coincident = 2)
pbirthday(23, classes = 365, coincident = 2)
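The q and p functions in the table are inverses of each other, which the following sketch illustrates. The variable names are made up for this example.

```r
# Two-tailed critical value for alpha = .05 on a standard normal
q_norm <- qnorm(0.975)
q_norm
pnorm(q_norm)              # feeding q back into pnorm() recovers p

# Same idea for t and F
q_t <- qt(0.975, df = 20)
pt(q_t, df = 20)
q_f <- qf(0.95, 3, 20)
pf(q_f, 3, 20)

# Upper-tail p value for an observed chi-square statistic
1 - pchisq(3.84, df = 1)
pchisq(3.84, df = 1, lower.tail = FALSE)   # same thing
```

For a one-tailed test at level α use p = 1-α; for a two-tailed test use p = 1-α/2, as in the heading of this section.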
Function Writing Information
LOOPS
for() Function Using a for Loop Example
Repeat a function over and over again
Infinite Repeat Loop
i <- 1
repeat{
i <- i/2
print(i)
flush.console()
}
Repeat Loop
i <- 1
repeat{
i <- i/2
print(i)
flush.console()
if (i < .0005) break
}
While Loop
i <- 1
while(i < 20){
i <- i*1.5 #update i on each pass so the condition eventually fails
print(i)
flush.console()
}
Nested For Loop
for (i in 1:2){
for(j in 20:21){
for (k in c("horse", "cow")){
print(i)
print(j)
print(i*j)
print(k)
}
}
}
X<-function(col=10, rows=40){
vec <- 1:col
holder <-c()
for (i in 1:rows){
perm <- sample(vec, replace=F)
holder <- rbind(holder, perm)
}
holder
rownames(holder)<-paste("obs. ", 1:rows,
sep="")
colnames(holder)<-paste("VAR-",LETTERS[1:col],
sep="")
holder
}
X(15, 20)
#===========================================
#loop to repeat a function
#===========================================
DFer = list()
n = 10
j=6
for (i in 1:n){
DFer[[i]]= data.frame(A=1:j, B=rnorm(j),
C=letters[1:j])
}
DFer
#===========================================
#or (the 2nd allocates the vector ahead of time)
#===========================================
getDFs <- function(n, j) {
df <- vector("list", n) # As I said, if you know the size, allocate the object beforehand
for (i in seq(n))
df[[i]] <- data.frame(A = seq(j), B = rnorm(j), C = letters[seq(j)])
return(df)
} # end function
(x<-getDFs(10, 4))
#===========================================
#put it together (the list into a data frame)
#===========================================
do.call("rbind", x)
library(plyr)
ldply(x, rbind)
For Loop with next
i <- 0
for (i in 1:100){
if (i%%2==0) next
i <- i +1
print(i)
flush.console()
}
For Loop with break and next
i <- 0
for (i in 1:100){
if (i%%2==0) next
if (i > 90) break
i <- i +1
print(i)
flush.console()
}
IFELSE
ifelse(test, then this occurs, if not this happens)
Example
x<-sample(-2:5,20,replace=T);x
outcome<-ifelse(x >= 0, sqrt(x), NA)
data.frame(x,outcome)
Switch Function Example 1
Central <- function(y, measure = "Mean"){
switch(measure,
Mean = mean(y),
Geometric = exp(mean(log(y))),
Harmonic = 1/mean(1/y),
Median = median(y),
stop("choose a mean")
)
}
Central(mtcars$mpg,"Median")
Central(mtcars$mpg,"Geometric")
Central(mtcars$mpg,"Harmonic")
Example 2
FUN <- function(x){
switch(x,
`1` = "A",
`2` = "B",
`3` = "C",
stop("choose a # between 1-3")
)
}
FUN(1)
FUN(2)
FUN(4)
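Two details of switch() are easy to miss in the examples above: a character selector can fall through empty alternatives, and a numeric selector simply picks the nth alternative. The function name describe and its labels are invented for this sketch.

```r
describe <- function(x) {
  switch(x,
    mean = ,                       # empty: falls through to the next value
    average = "arithmetic mean",
    median = "middle value",
    stop("unknown measure: ", x)   # unnamed last argument is the default
  )
}
describe("mean")
describe("average")
describe("median")

# With a number, switch() picks by position and ignores names
switch(2, "A", "B", "C")
# A number past the end returns NULL (invisibly) rather than an error
is.null(switch(5, "A", "B", "C"))
```

This is why FUN(4) in Example 2 reaches the stop() call: with a numeric argument the stop() is simply the fourth alternative, not a true default.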
Repeat Loops (complex example) srepeat
EXAMPLE:
#FIRST I'LL RECREATE A DATA SET. IT'LL CONTAIN REDUNDANCY
DATA <- structure(list(person = structure(c(4L, 1L, 5L, 4L, 1L, 3L, 1L,
4L, 3L, 2L, 1L), .Label = c("greg", "researcher", "sally", "sam",
"teacher"), class = "factor"), sex = structure(c(2L, 2L, 2L,
2L, 2L, 1L, 2L, 2L, 1L, 1L, 2L), .Label = c("f", "m"), class = "factor"),
adult = c(0L, 0L, 1L, 0L, 0L, 0L, 0L, 0L, 0L, 1L, 0L), state = structure(c(2L,
7L, 9L, 11L, 5L, 4L, 8L, 3L, 10L, 1L, 6L), .Label = c("Shall we move on? Good then.",
"Computer is fun. Not too fun.", "I distrust you.", "How can we be certain?",
"I am telling the truth!", "Im hungry. Lets eat. You already?",
"No its not, its ****.", "There is no way.", "What to do?",
"What are you talking about?", "You liar, it stinks!"), class = "factor")), .Names =
c("person",
"sex", "adult", "state"), class = "data.frame", row.names = c(NA,
-11L))
DATA <- data.frame(rbind(DATA, DATA, DATA))
DATA <- data.frame(rbind(DATA, DATA, DATA))
DATA <- data.frame(tot = 1:nrow(DATA), DATA)
DATA <- DATA[with(DATA, order(person, tot)), ]
rownames(DATA)<-1:nrow(DATA)
#==========================================
#A SIMPLE WORD COUNT FUNCTION
word.count <- function(text, by = "row") {
unblanker <-function(x)subset(x, nchar(x)>0)
word.split <- function(x) sapply(x, function(x)as.vector(unlist(strsplit(x, " "))))
reducer <- function(x) gsub("\\s+", " ", x)
txt <- sapply(as.character(text), function(x) ifelse(is.na(x), "", x))
OP <- switch(by,
all = length(unblanker(unlist(word.split(reducer(unlist(as.character(txt))))))),
row = sapply(txt, function(x)
length(unblanker(unlist(word.split(reducer(unlist(as.character(x)))))))))
ifelse(OP==0, NA, OP)
}
#==========================================
DATA$wc <- word.count(DATA$state)
#==========================================
#METHOD 1 DASON
g <- function(x, k = 30){
# Need to know how long the final vector should be
n <- length(x)
# Take care of case where we don't get any groups.
if(sum(x) < k){
ans <- rep(NA, n)
return(factor(ans))
}
# Store where we want to break into a new group
breaks <- c()
repeat{
# Find the first spot where the vector sums to at least k
# If that doesn't happen we get Inf and a warning
# suppress the warning
spot <- suppressWarnings(min(which(cumsum(x) >= k)))
# If we got inf then the sum couldn't reach k
if(!is.finite(spot)){
break # Jump out of the repeat loop
}
# Remove the spots that we accounted for
x <- x[-c(1:spot)]
# Note where the break is
breaks <- c(breaks, spot)
}
ans <- rep(NA, n)
groups <- paste("sel_", rep(1:(length(breaks)), breaks), sep = "")
ans[1:length(groups)] <- groups
return(factor(ans))
}
# Try it out
x <- subset(DATA, person=='sam')
g(x$wc)
y <- subset(DATA, person=='teacher')
g(y$wc)
z <- subset(DATA, person=='greg')
g(z$wc)
#METHOD 2 BRYANGOODRICH
f <- function(x) {
# Initialize variables
r <- length(x) # Total size
n <- 1 # Starting index
i <- 1 # Group index
sum <- 0 # Container for sum check
groups <- vector("character", r)
# Loop through r-length vector x
for (m in seq(r)) {
sum <- sum + x[m] # Add to running sum
isEnough <- sum >= 30
# Block allocation and index adjustments
if (isEnough) {
groups[n:m] <- paste("sel_", i, sep ="")
i <- i + 1 # Increment group index
n <- m + 1 # Increment start index
sum <- 0 # Start sum check over
} # end if isEnough
} # end for m
groups[groups == ""] <- NA
return (factor(groups))
} # end function
# Try it out
x <- subset(DATA, person=='sam')
f(x$wc)
y <- subset(DATA, person=='teacher')
f(y$wc)
z <- subset(DATA, person=='greg')
f(z$wc)
Menu (Interactive Mode)
menu(choices, graphics = FALSE, title = NULL)
Arguments
choices- a character vector of choices
graphics- a logical indicating whether a graphics menu should be used if available.
title- a character string to be used as the title of the menu. NULL is also accepted.
Menu (gwidgets style)
menu(choices, title, graphics = TRUE)
select.list(choices, title)
EXAMPLES
switch(menu(c("List letters", "List LETTERS", "What does this do?")) + 1,
cat("Nothing done\n"), letters, LETTERS,c("Oh now I get it"))
#=====================================================================
.hurtz.donut<-function(){
cat("You want a hurtz donut?\n\n")
switch(menu(c("Yes", "No")) ,
cat("<PUNCH>\nHurts, don't it?\n"), cat("What a wimp!!\n"))
}
.hurtz.donut()
menu(sort(.packages(all.available = TRUE)), title = "packages", graphics = TRUE)
#returns --> [1] 17
select.list(sort(.packages(all.available = TRUE)), title = "packages")
#returns --> [1] "car"
Progress Bars (base)
see also: tcltk, plyr, RGtk2
txtProgressBar(min = 0, max = 1, initial = 0, char = "=", width = NA, title, label, style = 1, file = "")
winProgressBar(title = "R progress bar", label = "", min = 0, max = 1, initial = 0, width = 300) #Windows only
close() #needed after the call to the progress bar
#EXAMPLE PASSING SEQUENCE ALONG THE VECTOR
total = nrow(mtcars)
progress.bar = TRUE
type = FALSE
#progress.bar = FALSE #parameter to play with
#type = 'text' #parameter to play with
if(progress.bar) {
if (Sys.info()[['sysname']]=="Windows" & type != "text"){
# create progress bar
pb <- winProgressBar(title = "progress bar", min = 0,
max = total, width = 300)
lapply(1:total, function(i) {
Sys.sleep(.5)
setWinProgressBar(pb, i,
title=paste(round(i/total*100, 0), "% done"))
}
)
close(pb)
} else {
# create progress bar
pb <- txtProgressBar(min = 0, max = total, style = 3)
lapply(1:total, function(i) {
Sys.sleep(.5)
setTxtProgressBar(pb, i)
}
)
close(pb)
}
} else {
Sys.sleep(total/4)
message("should have used a progress bar") #return() only works inside a function
}
Note: Sys.sleep(.5) above is a placeholder for the portion that is function dependent.
#EXAMPLE PASSING THE VECTOR
w <- c("raptors are awesome don't you all agree")
y <- unlist(strsplit(w, " "))
total <- length(y)
#WINDOWS TEXT BAR
pb <- winProgressBar(title = "progress bar", min = 0,
max = total, width = 300)
lapply(y, function(x){
z <- nchar(x); Sys.sleep(.5)
i <- which(y %in% x)
setWinProgressBar(pb, i, title=
paste(round(i/total *100,
0), "% done"))
return(z)
}
)
close(pb)
#STANDARD TEXT BAR
pb <- txtProgressBar(min = 0, max = total, style = 3)
lapply(y, function(x){
z <- nchar(x); Sys.sleep(.5)
i <- which(y %in% x)
setTxtProgressBar(pb, i)
return(z)
}
)
close(pb)
#EXAMPLE PASSING THE VECTOR With Global Assignment
w <- c("raptors are awesome don't you all agree")
y <- unlist(strsplit(w, " "))
total <- length(y)
#WINDOWS VERSION
pb <- winProgressBar(title = "progress bar",
min = 0, max = total, width = 300)
i <- 0
lapply(y, function(x){
z <- nchar(x); Sys.sleep(.5)
i <<- i + 1
setWinProgressBar(pb, i, title=
paste(round(i/ total *100, 0), "% done"))
return(z)
}
)
close(pb)
#STANDARD TEXT VERSION
pb <- txtProgressBar(min = 0, max = total, style = 3)
i <- 0
lapply(y, function(x){
z <- nchar(x); Sys.sleep(.5)
i <<- i + 1
setTxtProgressBar(pb, i)
return(z)
}
)
close(pb)
Pass a data frame to a function (One method)
f <- function(x,data=NULL, fun) {
fun(eval(match.call()$x,data))
}
f(hp,mtcars,mean)
Passing a Variable (Vector name) on to an argument as a character
Best seen with an example. See below. In the example, four is passed on without using quotes.
Passing a character string to a function (eval parse)
Best seen with an example. See below.
EXAMPLE WITH DATA
mtcars2<-mtcars
library(doBy)
mtcars2$cyl<-with(mtcars2,recodeVar(mtcars2$cyl,src=c(4,6,8),
tgt=c("four","six","eight"), default=NULL, keep.na=TRUE))
with(mtcars2,cyl)
Tfun <-function (DV,IV,group1){
g <- substitute(group1)
g1<-DV[IV ==as.character(g)]
p <- mean(g1)
list(g1,p)
}
with(mtcars2,Tfun(mpg,cyl,four))
SIMPLE EXAMPLE
extract.arg <-function (a){
s <- substitute(a)
as.character(s)
}
extract.arg(hello)
#EXAMPLE
x <- c(1:20)
myoptions <- "trim=0, na.rm=FALSE"
eval(parse(text = paste("mean(x,", myoptions, ")")))
library(fortunes);fortune(106)
Functions That Take Input
scan(n=,what = double(0),quiet=T)
EXAMPLE ASKING FOR 1 INPUT
x<-function(){
#prompt for one numeric value
cat("\n","Enter Value","\n")
x<-scan(n=1,what = double(0),quiet=T)
x
}
EXAMPLE ASKING FOR 4 INPUTS
x<-function(){
#prompt for four numeric values
cat("\n","Enter Value","\n")
x<-scan(n=4,what = double(0),quiet=T)
x
}
Warning Messages
warning(..., call. = TRUE, immediate. = FALSE, domain = NULL)
suppressWarnings()
Alarm
alarm() #makes a call to "\a" OR cat("\a")
Tell a function what to do with a missing argument
missing(x) #Best understood with an example
Used as a "minifunction" within the function to tell what to do if y is not given. Technically this could be done with myplot <- function(x,y=x) as well.
Reset Parameters
on.exit()
Useful for resetting graphical parameters or performing cleanup actions.
EXAMPLE
test <- function() warning("You idiot you forgot quotes!")
test() ## shows call
test2 <- function() warning("You idiot you forgot quotes!", call. = FALSE)
test2() ## no call
EXAMPLE
myplot <- function(x,y) {
if(missing(y)) {
y <- x
x <- 1:length(y)
}
plot(x,y)
}
textClick <- function(express, col="black", cex=NULL, srt = 0, family="sans", ...){
old.par <- par(no.readonly = TRUE)
on.exit(par(old.par))
par(mar = rep(0, 4),xpd=NA)
x<-locator(1)
X<-format(x, digits=3)
text(x[1], x[2], express, col=col, cex=cex, srt=srt, family=family, ...)
noquote(paste(X[1], X[2],sep=", "))
}
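The same missing() and on.exit() pattern works without a graphics device, which makes it easier to test. In this sketch the function with_digits is hypothetical; it changes a global option temporarily and restores it on exit.

```r
with_digits <- function(x, digits) {
  # missing() supplies a fallback when the caller omits the argument
  if (missing(digits)) digits <- 3
  old <- options(digits = digits)   # options() returns the previous settings
  on.exit(options(old))             # restored even if format() fails
  format(x)
}

before <- getOption("digits")
with_digits(pi)         # "3.14"
with_digits(pi, 10)
identical(getOption("digits"), before)   # the global option is untouched
```

The design point is the same as in textClick() above: save the old state first, register the cleanup with on.exit() immediately, and only then change anything.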
Sequence for the n of a vector
Traditionally people use: 1:length(x)
However this may lead to problems when x is empty, because 1:0 counts down (1 0) instead of giving an empty sequence. Use instead: seq_along(x)
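A minimal demonstration of the problem case, using an empty vector. The variable names are illustrative.

```r
x <- numeric(0)          # an empty vector

1:length(x)              # 1 0  (counts down instead of being empty)
seq_along(x)             # integer(0)

# A loop over seq_along(x) correctly runs zero times
count <- 0
for (i in seq_along(x)) count <- count + 1
count

# seq_len(n) is the matching tool when you have a length rather than a vector
seq_len(0)
```

With 1:length(x) the loop body would run twice (for i = 1 and i = 0), usually producing an error or NA values far from the real cause.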
Viewing the code of functions (look at a function's code)
For most functions, including ones you've created (downloaded), you can view the code by simply typing the name of the function. For the aov() function type: aov (and enter)
Viewing the code of generic functions
Type methods(function). This gives a list of the functions with suffixes. Now type the function with the suffix name for its code.
methods(anova)
anova.glm
How R evaluates true and false
In R, TRUE is considered to be the number 1 and FALSE is considered the number 0. This can be very useful in practice.
Example:
T+T+F=2
T*T*F=0
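The practical payoff is counting and proportions: summing a logical vector counts the TRUEs, and its mean is the proportion of TRUEs. The vector below is made up for illustration.

```r
x <- c(3, 8, 1, 9, 4)

x > 3            # FALSE TRUE FALSE TRUE TRUE
sum(x > 3)       # 3   (how many values exceed 3)
mean(x > 3)      # 0.6 (what proportion exceed 3)

TRUE + TRUE + FALSE   # 2
TRUE * TRUE * FALSE   # 0
```

It is safer to spell out TRUE and FALSE in code, since T and F are ordinary variables that can be reassigned.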
Determine how much memory an object takes up
object.size(x,units=)
Units can be changed: units = c("b", "auto", "Kb", "Mb", "Gb")
EXAMPLES
print(object.size(library(base)),units="auto") #the value returned by library(base) (a character vector of package names)
print(object.size(library),units ="auto") #the library function itself
print(object.size(Tpass),units ="auto") #a function
print(object.size(ls()),units ="auto") #the vector of names of current objects in workspace
Determine memory allocation and Increase Allocation
memory.limit() #Report memory limit (Windows only)
memory.limit(size=3500) #increase memory limit
Reduce Objects and Junk in Memory
gc()
rm(list=ls())
rm(list = ls(all.names = TRUE))
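A small sketch of object.size() on built-in data sets, including a way to compare several objects at once. The names size and sizes are made up for this example.

```r
# Size of a single object
size <- object.size(mtcars)
print(size, units = "Kb")
format(size, units = "auto")     # returns the size as a character string

# Compare several objects: sapply() returns their sizes in bytes
sizes <- sapply(list(mtcars = mtcars, iris = iris, letters = letters),
                object.size)
sort(sizes, decreasing = TRUE)
```

To size everything in the workspace, the same idea can be applied to mget(ls()) instead of a hand-built list.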
Timing
Determine How Long It Takes to Run a Function (method 1)
library(microbenchmark) [very accurate]
microbenchmark(..., list, times=100, control=list())
Determine How Long It Takes to Run a Function (method 2) (function timing; time a function)
library(rbenchmark)
benchmark( ..., columns = c( "test", "replications", "elapsed", "relative", "user.self", "sys.self", "user.child", "sys.child"), order = "test", replications = 100, environment = parent.frame())
Arguments
...           captures any number of unevaluated expressions passed to benchmark as named or unnamed arguments (the functions to be tested).
columns       a character or integer vector specifying which columns should be included in the returned data frame (see below).
order         a character or integer vector specifying which columns should be used to sort the output data frame. Any of the columns that can be specified for columns (see above) can be used, even if it is not included in columns and will not appear in the output data frame. If order=NULL, the benchmarks will appear sorted by the order of the expressions in the call to benchmark.
replications  a numeric vector specifying how many times an expression should be evaluated when the runtime is measured. If replications consists of more than one value, each expression will be benchmarked multiple times, once for each value in replications.
environment   the environment in which the expressions will be evaluated.
Determine How Long It Takes to Run a Function (method 3) (function timing; time a function)
system.time()
#example and output
benchmark(
Plyr = ddply(mtcars, .(cyl, gear), summarise, output = mean(hp)),
Tapply = with(mtcars, data.frame(output = tapply(hp, interaction(cyl, gear), mean))),
Aggregate = aggregate(hp ~ cyl + gear, mtcars, mean),
order=c('replications', 'elapsed'))
# test replications elapsed relative user.self sys.self user.child sys.child
# 2 Tapply 100 0.19 1.000000 0.18 0 NA NA
# 3 Aggregate 100 0.51 2.684211 0.39 0 NA NA
# 1 Plyr 100 1.36 7.157895 1.10 0 NA NA
Explanation: Typically just look at the elapsed time and the relative times. The relative is really what is interesting to me - it tells you how long each expression takes in comparison to the fastest expression. So in your example the Tapply is the quickest and Aggregate takes 2.68 times longer and the Plyr solution takes 7.16 times longer than the Tapply.
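For method 3, system.time() needs no extra packages. The sketch below compares a loop with a vectorized call; the function slow_sum is made up for illustration, and the timings will vary by machine, so no output is shown.

```r
# A deliberately slow loop-based sum (hypothetical helper)
slow_sum <- function(n) {
  total <- 0
  for (i in seq_len(n)) total <- total + i
  total
}

t_loop <- system.time(slow_sum(1e6))
t_vec  <- system.time(sum(as.numeric(seq_len(1e6))))

# "elapsed" is the wall-clock time, the number usually worth reading
t_loop["elapsed"]
t_vec["elapsed"]

# Base R timer around any stretch of code
start <- Sys.time()
invisible(slow_sum(1e5))
difftime(Sys.time(), start, units = "secs")
```

A single system.time() call is noisy for fast expressions; that is the case where the replicated timings from benchmark() or microbenchmark() are more trustworthy.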
Timer 1
library(data.table)
begin.time <- Sys.time()
timetaken(begin.time)
Timer 2
library(matlab)
tic(gcFirst=FALSE)
toc(echo=TRUE)
Timer 3 (base)
x <- Sys.time()
difftime(Sys.time(), x)
Time Stamping
timestamp()
Generate Reproducible Code
Write a code, data frame to a file to send to a help archive (reproducible code) sink
dput(object, "file to write to")
EXAMPLE
dput(mtcars, "foo.txt")
SEE ALSO: Exporting an output to a file section using cat()
Generate window of a code, data frame to send to a help archive (reproducible code)
page(object)
EXAMPLE
page(mtcars)
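What makes dput() useful for help requests is that its output is R code that rebuilds the object. A minimal round trip, assuming a temporary file rather than a file in the working directory (the names tmp, small, and restored are illustrative):

```r
tmp <- tempfile(fileext = ".txt")
small <- head(mtcars, 3)          # post a small slice, not the whole data set

dput(small, tmp)                  # write the R code that recreates `small`
readLines(tmp)[1]                 # the file starts with a structure(list(... call

restored <- dget(tmp)             # read it back in
all.equal(small, restored)

dput(small)                       # or print to the console to copy and paste
unlink(tmp)
```

The person answering can paste the dput() text after `x <-` and have the same object, including column types and row names.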
Customized Workflow
Create Multiple Working Directories
Create a shortcut where you want a new directory. Locate (& copy [ctrl + c]) the location of where you stored the shortcut (i.e., C:\Users\Rinker\Desktop\PhD Program\CEP 523-Stat Meth Ed Inference\R Stuff). Click on Properties and paste (ctrl + v) the location to the Start In box. Now data files from this location load automatically without referencing their specific location.
Stop the Stupid Start Up Message and Auto Save
At the end of the Target box (see "Create Multiple Working Directories") add a space and then -q --no-save
"C:\Program Files\R\R-2.13.0\bin\i386\Rgui.exe" -q --no-save
.First Function (start up commands)
Open a new script from within [R]. Create a .First function with the following type of set up:
.First<-function(){
library(psych)
library(car)
source("C:/Users/Rinker/Desktop/PhD Program/CEP 523-Stat Meth Ed Inference/R Stuff/Scripts/Missing Values/.NA Bundle.txt")
source("C:/Users/Rinker/Desktop/PhD Program/CEP 523-Stat Meth Ed Inference/R Stuff/Scripts/Assumption Testing/Tests of Normality.txt")
options(repos="http://lib.stat.cmu.edu/R/CRAN") #See Choosing a CRAN Mirror
cat("Hello Tyler! Today is",date(),"\n")
}
Now save it to the working directory (see Creating A Working Directory) as .Rprofile
NOTE: As of right now I have not been able to edit this file. I have to delete it and resave it using the method above to edit it.
Choosing a CRAN Mirror
Type the following to choose a CRAN mirror and to find its URL:
chooseCRANmirror()
options("repos")[[1]][1]
#gives a URL for the Mirror you just chose
Now paste the following command to the .First function in your .Rprofile options(repos="URL") See options .Options
Paths, Directories and System Info
Check the System info and Operating System
Sys.info()
Sys.info()[['sysname']]
Check System Path Info
Sys.getenv("USERNAME")
Sys.getenv("HOME")
Sys.getenv()
Determine working Directory/ Set Working Directory
getwd()
setwd(dir)
Directory Functions
dir() #list files in working directory
list.files() #list files in working directory
file.info(list.files()) #get info on files in directory
path.expand("~/Desktop/PhD Program/") #replaces the tilde with user home directory
>Sys.getenv("USERNAME")
[1] "Rinker"
> Sys.getenv("HOME")
[1] "C:\\Users\\Rinker"
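A short sketch of building and taking apart paths with base functions, which avoids typing separators by hand. The folder names are hypothetical and the folder does not need to exist for this to run.

```r
# Build a path from its parts; file.path() inserts the separator for you.
scripts <- file.path(path.expand("~"), "R Stuff", "Scripts")  # hypothetical folder
basename(scripts)                # last component of the path
dirname(scripts)                 # everything before the last component
# normalizePath() gives an absolute path; mustWork = FALSE avoids an error
# when the folder does not exist yet.
normalizePath(scripts, winslash = "/", mustWork = FALSE)
```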
File Editing and Web Browsing
Opening Web Pages method 1
See the web() function in scripts
browseURL(url, browser = getOption("browser"), encodeIfNeeded = FALSE)
Opening Web Pages method 2
Not recommended: shell.exec() is only available on Windows.
shell.exec(url)
Arguments: file = file or url to open
Opening Files Within R
See the ret() function in scripts
shell.exec(file)
Arguments: file = file or url to open
Open text files for editing in R
file.edit("file")
Example
file.edit(".Rprofile")
Determine if a file exists
file.exists(path)
If you don't provide a full path, R only checks the working directory (getwd()).
Rename a file
file.rename(from, to)
Delete a directory
unlink(x, recursive = TRUE, force = FALSE)
Format R code
library(formatR) sformatr
tidy.source(source, file.output="windows console")
EXAMPLE CODE
#save it somewhere or copy and then read it in
# check tidy.source's clipboard option
library(formatR)
xx<-pathPrep()
# C:\Users\Rinker\Desktop\transcript Functions.R
shell.exec(xx)
tidy.source(source = xx, file="transcript Functions.txt")
tidy.source(source = "clipboard") #using a clipboard
tidy.source(source = "clipboard", file="transcript Functions.txt") #using a clipboard
sink(file="New.doc")
4*3; sink()
file.exists("New.doc")
Open() #look at the New.doc
file.rename("New.doc", "Renamed.doc")
file.exists("New.doc")
Open() #look at the Renamed.doc
delete(Renamed.doc) #user defined
Open() #Renamed.doc is no longer there
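The same create / check / rename / delete cycle, sketched with base functions only so it runs without the user-defined Open() and delete(). It works inside tempdir() (the folder and file names are made up) so nothing in the working directory is touched.

```r
# Work in a temporary folder so nothing in the working directory is touched.
tmp <- file.path(tempdir(), "file_demo")   # hypothetical folder name
dir.create(tmp, showWarnings = FALSE)
old <- file.path(tmp, "New.txt")
new <- file.path(tmp, "Renamed.txt")
writeLines("4 * 3 = 12", old)       # create a small file
exists.before <- file.exists(old)   # the file is there
renamed <- file.rename(old, new)    # takes from and to; TRUE on success
exists.after <- file.exists(old)    # the old name is gone
unlink(tmp, recursive = TRUE)       # delete the folder and its contents
gone <- !file.exists(new)
```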
Debugging
Find out the values of a function up to a given point
browser()
Exit Browser
Type Q and press Enter
Debug
Use debug(Function_Name) and then call the function to step through it line by line
debug(mean)
mean(1:10)
undebug(mean)
try and tryCatch
L <- list(a=c(1, 3, 5), b=c("a", "v"), d=mtcars[,1])
lapply(L, function(x){
try(sum(x))
})
L <- list(a=c(1, 3, 5), b=c("a", "v"), d=mtcars[,1])
sapply(L, function(x){
tryCatch(sum(x), error=function(err) NA)
})
testfun <- function(x = 5){
y = 5
browser()
print(x + y)
}
testfun()
y; x; p
Browse[1]> y; x; p #check the values of each of these objects within the function
[1] 5
[1] 5
Error: object 'p' not found
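tryCatch can also handle warnings and run clean-up code. A small sketch follows; safe_log is a made-up name, and returning NA for both warnings and errors is just one possible choice.

```r
# Return NA for anything that fails or warns, and report what happened.
safe_log <- function(x) {
    tryCatch(
        log(x),
        warning = function(w) {
            message("warning caught: ", conditionMessage(w))
            NA_real_
        },
        error = function(e) {
            message("error caught: ", conditionMessage(e))
            NA_real_
        },
        finally = message("done with one input")   # always runs
    )
}
res <- sapply(list(10, -1, "a"), safe_log)
res
```

log(-1) raises a warning and log("a") raises an error, so only the first element comes back as a number.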
Expand a column that's a list column spit
#====================================
# THE DATA FRAME
#====================================
input <- data.frame(site = 1:6,
sector = factor(c("north", "south", "east",
"west", "east", "south")),
observations =
I(list(c(1, 2, 3), c(4, 3), c(), c(14, 12, 53, 2, 4), c(3),c(23))))
#====================================
# EXPAND THE COLUMN AND MERGE
#====================================
obs.l <- sapply(input$observations, length)
desire.output <- data.frame(site=rep(1:6,obs.l), obs=unlist(input$observations))
merge(input[, -3], desire.output, all.x=TRUE)
#NOTE- THE SITE IS THE KEY FOR THEN MERGING WITH THE DATA FRAME
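An alternative sketch that skips the merge by repeating the row index directly. It uses a smaller made-up data frame; note that, unlike merge(..., all.x=TRUE), rows with no observations are dropped rather than kept with NA.

```r
input <- data.frame(site = 1:3, sector = c("north", "south", "east"))
input$observations <- list(c(1, 2, 3), numeric(0), c(4, 3))  # list column

n <- lengths(input$observations)          # number of observations per row
# Repeat each row once per observation, then attach the flattened values
long <- data.frame(input[rep(seq_len(nrow(input)), n), c("site", "sector")],
                   obs = unlist(input$observations),
                   row.names = NULL)
long
```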
Expand a Text Column (Split by sentence)
sentSplit <- function(dataframe, text.var, splitpoint = NULL, rownames = numeric,
text.place = original) {
DF <- dataframe
input <- as.character(substitute(text.var))
re <- ifelse(is.null(splitpoint), "[\\?\\.\\!]", as.character(substitute(splitpoint)))
RN <- as.character(substitute(rownames))
TP <- as.character(substitute(text.place))
breakinput <- function(input, re) {
j <- gregexpr(re, input)
lengths <- unlist(lapply(j, length))
spots <- lapply(j, as.numeric)
first <- unlist(lapply(spots, function(x) {
c(1, (x + 1)[-length(x)])
}))
last <- unlist(spots)
ans <- substring(rep(input, lengths), first, last)
return(list(text = ans, lengths = lengths))
}
j <- breakinput(DF[, input], re)
others <- DF[, -which(colnames(DF) == input)]
idx <- rep(1:dim(others)[1], j$lengths)
ans <- cbind(input = j$text, others[idx, ])
colnames(ans)[1] <- input
if (RN == "numeric") {
rownames(ans) <- 1:nrow(ans)
}
if (TP == "original") {
ans <- ans[, c(colnames(DF))]
} else {
if (TP == "right") {
ans <- data.frame(ans[, -1], ans[, 1])
colnames(ans)<-c(colnames(ans)[-ncol(ans)],input)
} else {
if (TP == "left") {
ans
}
}
}
return(ans)
}
#=====================
#TEST IT
#=====================
DATA<-structure(list(person = structure(c(4L, 1L, 5L, 4L, 1L, 3L, 1L,
4L, 3L, 2L, 1L), .Label = c("greg", "researcher", "sally", "sam",
"teacher"), class = "factor"), sex = structure(c(2L, 2L, 2L,
2L, 2L, 1L, 2L, 2L, 1L, 1L, 2L), .Label = c("f", "m"), class = "factor"),
adult = c(0L, 0L, 1L, 0L, 0L, 0L, 0L, 0L, 0L, 1L, 0L), state = structure(c(2L,
7L, 9L, 11L, 5L, 4L, 8L, 3L, 10L, 1L, 6L), .Label = c("Shall we move on? Good then.",
"Computer is fun. Not too fun.", "I distrust you.",
"How can we be certain?", "I am telling the truth!", "Im hungry. Lets eat. You already?",
"No its not, its dumb.", "There is no way.", "What should we do?",
"What are you talking about?", "You liar, it stinks!"
), class = "factor"), code = structure(c(1L, 4L, 5L, 6L,
7L, 8L, 9L, 10L, 11L, 2L, 3L), .Label = c("K1", "K10", "K11",
"K2", "K3", "K4", "K5", "K6", "K7", "K8", "K9"), class = "factor")), .Names = c("person",
"sex", "adult", "state", "code"), row.names = c(NA, -11L), class = "data.frame")
sentSplit(DATA, state, rownames=sub)
sentSplit(DATA, state)
sentSplit(DATA, state, text.place=right)
sentSplit(DATA, state, text.place=left)
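For quick jobs, a shorter sketch using strsplit() with a look-behind. This is not sentSplit; it assumes sentences end in ., ? or ! and are separated by whitespace, and it only keeps a row index instead of carrying the other columns along.

```r
x <- c("Computer is fun. Not too fun.", "How can we be certain?")
# Split on whitespace that follows ., ? or ! (the look-behind keeps the
# punctuation attached to its sentence); perl = TRUE is needed for (?<=...).
pieces <- strsplit(x, "(?<=[.?!])\\s+", perl = TRUE)
out <- data.frame(row = rep(seq_along(x), lengths(pieces)),
                  text = unlist(pieces),
                  stringsAsFactors = FALSE)
out
```

The row column can be used to index back into the original data frame, as in DATA[out$row, ].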
Collapse A Text Column by A grouping Variable
#THE DATA
dat <- structure(list(sex = structure(c(2L, 2L, 2L, 2L, 2L, 1L, 2L,
2L, 1L, 1L, 2L), .Label = c("f", "m"), class = "factor"), state = structure(c(2L,
7L, 9L, 11L, 5L, 4L, 8L, 3L, 10L, 1L, 6L), .Label = c("Shall we move on? Good then.",
"Computer is fun. Not too fun.", "I distrust you.", "How can we be certain?",
"I am telling the truth!", "Im hungry. Lets eat. You already?",
"No its not, its dumb.", "There is no way.", "What should we do?",
"What are you talking about?", "You liar, it stinks!"), class = "factor")), .Names =
c("group",
"text"), class = "data.frame", row.names = c(NA, -11L))
#METHOD 1 (A better choice)
# Needed for later
k <- rle(as.numeric(dat$group))
# Create a grouping vector
id <- rep(seq_along(k$len), k$len)
# Combine the text in the desired manner
out <- tapply(dat$text, id, paste, collapse = " ")
# Bring it together into a data frame
data.frame(text = out, group = levels(dat$group)[k$val])
#METHOD 2
y <- rle(as.character(dat$group))
x <- y[[1]]
dat$new <- as.factor(rep(1:length(x), x))
text <- aggregate(text~new, dat, paste, collapse = " ")[, 2]
data.frame(text, group = y[[2]])
#Method 3 (combined)
k <- rle(as.numeric(dat$group)); dat$id <- rep(seq_along(k$len), k$len)
data.frame(sex=rle(as.character(dat$group))$val,
aggregate(text~id, dat, paste, collapse=" "))
group text
1 m Computer is fun. Not too fun. No its not, its dumb. What should we do? You liar, it stinks! I am telling the truth!
2 f How can we be certain?
3 m There is no way. I distrust you.
4 f What are you talking about? Shall we move on? Good then.
5 m Im hungry. Lets eat. You already?
group text
1 m Computer is fun. Not too fun.
2 m No its not, its dumb.
3 m What should we do?
4 m You liar, it stinks!
5 m I am telling the truth!
6 f How can we be certain?
7 m There is no way.
8 m I distrust you.
9 f What are you talking about?
10 f Shall we move on? Good then.
11 m Im hungry. Lets eat. You already?
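The grouping vector can also be built without rle(). A sketch on made-up data: a new run starts wherever the value differs from the one before it, and cumsum() turns those starts into run numbers.

```r
group <- c("m", "m", "f", "m", "m")
text  <- c("One.", "Two.", "Three.", "Four.", "Five.")
# A new run starts at position 1 and wherever the value changes
run.id <- cumsum(c(TRUE, group[-1] != group[-length(group)]))
collapsed <- data.frame(
    group = group[!duplicated(run.id)],   # first group label of each run
    text = unname(vapply(split(text, run.id), paste, character(1), collapse = " ")),
    stringsAsFactors = FALSE)
collapsed
```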
Enter an unknown number of vectors as unnamed arguments; return names and math output
Extract components from dots (…)
f1 <- function(x, ...) substitute(...()) #Dunlap's method
f2 <- function(x, ...) match.call(expand.dots=FALSE)$... #traditional match.call
f1(1, warning("Hmm"), stop("Oops"), cat("some output\n"))
f2(1, warning("Hmm"), stop("Oops"), cat("some output\n"))
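A third way to get at the dots, sketched here with a made-up helper name (dot.names): capture the call list(...) unevaluated and deparse each argument to text.

```r
dot.names <- function(...) {
    # substitute(list(...)) captures the unevaluated call list(a, b, ...);
    # dropping the first element (the name "list") leaves the arguments.
    vapply(as.list(substitute(list(...)))[-1], deparse, character(1))
}
a <- 1:3; b <- 4:6
nms <- dot.names(a, b, a + b)
nms
```

Because nothing is evaluated, this also works on arguments that would fail if run, as long as each one deparses to a single line.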
Creating a Quasi-Package
#EXAMPLE
#Create objects such as data sets and functions to include in the file
.hurtz.donut<-function(){"You want a hurtz donut? Yes! <Punch> Hurts don't it?"}
.hurtz.donut()
(.FUNdat<-data.frame(cbind(LETTERS,1:26)))
#Save the objects to the .RData file
save(.hurtz.donut,.FUNdat,file="myFUNCTIONS.RData")
#just to show everything has been wiped clean (this will delete all objects from your workspace)
rm(list = ls(all.names = TRUE))
#Close out and reload [R]
load("myFUNCTIONS.RData")
.hurtz.donut()
.FUNdat
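A variation on the same idea, as a sketch: load the saved objects into their own environment instead of the workspace, so they cannot overwrite anything. The file lives in tempdir() and .hello is a made-up stand-in for your own functions.

```r
f <- file.path(tempdir(), "myFUNCTIONS.RData")
.hello <- function() "hello from the quasi-package"   # hypothetical helper
save(.hello, file = f)
rm(.hello)

tools.env <- new.env()
load(f, envir = tools.env)        # objects land in tools.env, not the workspace
loaded <- ls(tools.env, all.names = TRUE)
tools.env$.hello()
```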
#FAKE DATA
a<-sample(50:90+0, 20, replace=TRUE)
b<-sample(50:90+20, 20, replace=TRUE)
d<-sample(50:90-20, 20, replace=TRUE)
#FUNCTION THAT WORKS IF SUPPLYING A LIST AS THE UNNAMED ARGUMENT
foo <- function(...){
# Get the names of the objects that were passed into the function
x <- as.character(match.call())[-1]
# Apply mean to every object passed in
y <- sapply(list(...), mean)
return(list(x, y))
}
#TEST IT OUT
foo(a,b,d)
#THE OUTPUT
[[1]]
[1] "a" "b" "d"
[[2]]
[1] 72.90 92.80 51.75
Function returns
return() print() invisible()
return() specifically tells the function what to return. If return() is not given, the value of the last evaluated expression is returned. invisible() returns an object without printing it automatically; the object can still be assigned and recalled later.
Function returns extended (return some, recall the rest later)
Look at both examples
EXAMPLE Invisible
test <- function(){
with(mtcars, plot(mpg~hp))
invisible(list("type1"="Shh! I'm invisible.","type2"="Real quiet now."))
}
x <- test()
x$type1
x$type2
#==========================================
#The original function that returns a list
#==========================================
test <- function(number=10){
XX <- number
YY <- "hello"
ZZ <- Sys.time()
o <- list(x = XX, y = YY, z = ZZ)
class(o) <- "stuff"
return(o)
}
#=================================================
#This makes the above return one piece of the list
#=================================================
print.stuff <- function(stuff){
print(stuff$z)
}
#=================================================
#See the end results
#=================================================
(PP <- test()) #returns what was specified by print.stuff
PP$y #recall the other components of the list
PP$x
test <- function(number=10){
XX <- number
YY <- "hello"
ZZ <- Sys.time()
o <- list(x = XX, y = YY, z = ZZ, zz = "Recall Me")
class(o) <- "stuff"
return(o)
}
#=================================================
#This makes the above return one piece of the list
#=================================================
print.stuff <- function(stuff){
list(print(stuff$z),
print(stuff$x))
}
#=================================================
#See the end results
#=================================================
(PP <- test()) #returns what was specified by print.stuff
PP$y #recall the other components of the list
PP$zz
PP$z
PP$x
EXAMPLE2 Invisible
a <- data.frame(x=1:10,y=1:10)
test <- function(z){
mean.x<-mean(z$x)
nm <-as.character(substitute(z))
print(mtcars)
invisible(list(mean.x, nm))}
x <- test(a)
x
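A sketch of the same print-method trick written the conventional way: the method takes (x, ...) like the print generic and hands the object back invisibly. The names make.stuff and the contents of the list are made up for illustration.

```r
make.stuff <- function(number = 10) {
    o <- list(x = number, y = "hello", z = "shown by print")
    class(o) <- "stuff"
    o
}
# A print method conventionally takes (x, ...) and returns x invisibly.
print.stuff <- function(x, ...) {
    cat("stuff object; z =", x$z, "\n")
    invisible(x)
}
PP <- make.stuff()
PP          # auto-printing dispatches to print.stuff
PP$y        # the other components are still available
```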
Rolling Math Functions
rolling mean, rolling median (each value is computed over all elements from the first through the i-th)
sapply(seq(x), function(i) MATH.FUNCTION(x[seq(i)]))
x<- mtcars$disp
sapply(seq(x), function(i) median(x[seq(i)]))
sapply(seq(x), function(i) mean(x[seq(i)]))
sapply(seq(x), function(i) range(x[seq(i)]))
sapply(seq(x), function(i) sd(x[seq(i)]))
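The sapply() pattern above uses a window that grows from the first element. If a fixed-width window is wanted instead, one possible sketch uses embed(); the width k = 3 and the NA padding at the start are arbitrary choices.

```r
x <- c(2, 4, 6, 8, 10)
# Cumulative (expanding) mean: same result as the sapply() pattern for mean
cum.mean <- cumsum(x) / seq_along(x)
# Fixed-width rolling statistics over the current and two previous values;
# embed() lays the windows out as rows of a matrix.
k <- 3
roll.mean <- c(rep(NA, k - 1), rowMeans(embed(x, k)))
roll.median <- c(rep(NA, k - 1), apply(embed(x, k), 1, median))
```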
Apply Family, PLYR & RESHAPE
PLYR splyr
#Group by subgroups, find max of another variable by these subgroups, return those rows
#################################################
## A FAKE DATA SET LIKE THE ONE YOU DESCRIBE ##
#################################################
DF <- structure(list(ID = structure(c(1L, 1L, 1L, 1L, 1L, 2L, 2L, 2L,
2L, 2L), .Label = c("1", "2"), class = "factor"), var2 = structure(c(1L,
4L, 5L, 9L, 3L, 8L, 2L, 6L, 10L, 7L), .Label = c("B", "C", "F",
"G", "H", "I", "M", "P", "W", "Z"), class = "factor"), var2.1 = c(-0.184379525153166,
-1.42413441621445, -0.245741502747687, 0.762805889444348, -0.85561728601498,
-0.358079724034542, -0.137767483903655, -0.952739149867607, 1.01227935773242,
0.0722005649132995), DF.DATE = structure(c(14662, 15556, NA,
14903, 15641, NA, 14970, 15625, 15075, 14819), class = "Date")), .Names = c("ID",
"var2", "var3", "DATE"), row.names = c(NA, -10L), class = "data.frame")
DF #view dataframe
library(plyr) #get plyr
ddply(na.omit(DF), .(ID), summarise, max = max(DATE)) #or
ddply(na.omit(DF), "ID", summarise, max = max(DATE))
ddply(na.omit(DF), "ID", summarise, mean = mean(var3))
X1 <- sample(1:20, 60, replace=TRUE)
X2 <- X1*(1+sample(seq( .00, .2, .005), 60, replace=TRUE))
X3 <- as.factor(sort(sample(c("dog", "cat", "pig", "snake"),
60, replace=TRUE)))
DF <- data.frame(X1, X2, X3)
library(plyr)
ddply(na.omit(DF), .(X3), summarise, cor = cor(X1,X2)) #correlation by group
stats <- function(x)c("mean"=mean(x), "med"=median(x), "sd"=sd(x),
"var"=var(x), "n"=length(x))
ddply(na.omit(DF), .(X3), summarise, X1=stats(X1),X2=stats(X2))
ID var2 var3 DATE
1 1 B -0.184380 2010-02-22
2 1 G -1.424134 2012-08-04
3 1 H -0.245742 <NA>
4 1 W 0.762806 2010-10-21
5 1 F -0.855617 2012-10-28
6 2 P -0.358080 <NA>
7 2 C -0.137767 2010-12-27
8 2 I -0.952739 2012-10-12
9 2 Z 1.012279 2011-04-11
10 2 M 0.072201 2010-07-29
> ddply(na.omit(DF), .(ID), summarise, max = max(DATE))
ID max
1 1 2012-10-28
2 2 2012-10-12
> ddply(na.omit(DF), "ID", summarise, mean = mean(var3))
ID mean
1 1 -0.4253313
2 2 -0.0015067
require(plyr)
ddply(mtcars, .(cyl, am), with, each(min, mean, sd, max)(hp))
> ddply(mtcars, .(cyl, am), with,
each(min, mean, sd, max)(hp))
cyl am min mean sd max
1 4 0 62 84.66667 19.65536 97
2 4 1 52 81.87500 22.65542 113
3 6 0 105 115.25000 9.17878 123
4 6 1 110 131.66667 37.52777 175
5 8 0 150 194.16667 33.35984 245
6 8 1 264 299.50000 50.20458 335
> ddply(na.omit(DF), .(X3),
summarise, cor = cor(X1,X2))
X3 cor
1 cat 0.9970943
2 dog 0.9974141
3 pig 0.9959173
4 snake 0.9865586
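For comparison, a base R sketch of the grouped summaries on mtcars without plyr. aggregate() handles a single statistic; split() plus lapply() is one way to get several at once.

```r
# Mean horsepower for each cyl/am combination
agg <- aggregate(hp ~ cyl + am, data = mtcars, FUN = mean)
# Several statistics at once, one list element per group
by.group <- lapply(split(mtcars$hp, list(mtcars$cyl, mtcars$am)),
                   function(x) c(min = min(x), mean = mean(x), sd = sd(x), max = max(x)))
stats.tab <- do.call(rbind, by.group)   # rows are named "cyl.am", e.g. "4.0"
stats.tab
```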
DF<-structure(list(car_id = c(500L, 500L, 500L, 500L, 500L, 500L,
501L, 501L, 501L, 501L, 501L, 501L, 501L, 502L, 502L, 502L, 502L,
502L, 502L), visitnum = c(40L, 50L, 60L, 100L, 110L, 120L, 40L,
50L, 60L, 100L, 110L, 120L, 150L, 40L, 50L, 60L, 100L, 110L,
120L), measurement = c(2301L, NA, NA, NA, NA, NA, 4480L, NA,
NA, NA, NA, NA, 38570L, NA, NA, NA, NA, NA, 2560L)), .Names = c("car_id",
"visitnum", "measurement"), class = "data.frame", row.names = c(NA,
-19L))
DF
library(plyr)
DF$measurement2 <- DF$measurement #duplicate measurement column
DF$measurement2[is.na(DF$measurement2)]<-0 #replace NA's with 0
FM <-function(x)ifelse(sum(x)-x[1]>x[1], 1, 0) #code to make a new column of 0 and 1
ddply(DF, .(car_id), transform, "flagmeasure" = FM(measurement2))[,-4]
ddply(DF, .(car_id), summarise, "flagmeasure" = FM(measurement2))
> DF
car_id visitnum measurement
1 500 40 2301
2 500 50 NA
3 500 60 NA
4 500 100 NA
5 500 110 NA
6 500 120 NA
7 501 40 4480
8 501 50 NA
9 501 60 NA
10 501 100 NA
11 501 110 NA
12 501 120 NA
13 501 150 38570
14 502 40 NA
15 502 50 NA
16 502 60 NA
17 502 100 NA
18 502 110 NA
19 502 120 2560
ddply(DF, .(car_id), transform, "flagmeasure"
= FM(measurement2))[,-4]
car_id visitnum measurement flagmeasure
1 500 40 2301 0
2 500 50 NA 0
3 500 60 NA 0
4 500 100 NA 0
5 500 110 NA 0
6 500 120 NA 0
7 501 40 4480 1
8 501 50 NA 1
9 501 60 NA 1
10 501 100 NA 1
11 501 110 NA 1
12 501 120 NA 1
13 501 150 38570 1
14 502 40 NA 1
15 502 50 NA 1
16 502 60 NA 1
17 502 100 NA 1
18 502 110 NA 1
19 502 120 2560 1
> ddply(DF, .(car_id), summarise, "flagmeasure" = FM(measurement2))
car_id flagmeasure
1 500 0
2 501 1
3 502 1
#========================
# The data
#========================
test<-data.frame(group=c(rep(1,4),rep(2,5),3),day=c(0:3,0:4,0),
measure=c(5,3,7,8,3,2,4,5,7,5))
(test1<-test)
#========================
# With base (faster)
#========================
test$diff <- unlist(by(test$measure, test$group, function(x){x - x[1]}))
test$perchange <- unlist(by(test$measure, test$group, function(x){(x - x[1])/x[1]}))
test
#========================
# With plyr (slower)
#========================
test<-test1 #reset test
library(plyr)
perch<-function(x){(x - x[1])/x[1]}
differ<-function(x){x - x[1]}
ddply(test, .(group), transform, diff=differ(measure))
ddply(test, .(group), transform, perchange= perch(measure))
ddply(test, .(group), transform, diff=differ(measure), perchange= perch(measure))
group day measure
1 1 0 5
2 1 1 3
3 1 2 7
4 1 3 8
5 2 0 3
6 2 1 2
7 2 2 4
8 2 3 5
9 2 4 7
10 3 0 5
group day measure diff perchange
1 1 0 5 0 0.0000000
2 1 1 3 -2 -0.4000000
3 1 2 7 2 0.4000000
4 1 3 8 3 0.6000000
5 2 0 3 0 0.0000000
6 2 1 2 -1 -0.3333333
7 2 2 4 1 0.3333333
8 2 3 5 2 0.6666667
9 2 4 7 4 1.3333333
10 3 0 5 0 0.0000000
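Another base option, sketched with ave(): it applies a function within each group and returns the results in the original row order, so no unlist() is needed.

```r
test <- data.frame(group = c(rep(1, 4), rep(2, 5), 3),
                   day = c(0:3, 0:4, 0),
                   measure = c(5, 3, 7, 8, 3, 2, 4, 5, 7, 5))
# Difference and proportional change from the first measure of each group
test$diff <- ave(test$measure, test$group, FUN = function(x) x - x[1])
test$perchange <- ave(test$measure, test$group, FUN = function(x) (x - x[1]) / x[1])
test
```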
test<-data.frame(person=c("A","A","A","A", "B","B",'C', 'C'),day=c(7,14,21,22, 7, 14, 7, 14),
measure=c(112,0,500,600, 0, 0, 0, 50),temp=c(36.9,36.1,37.2,39.6, 35, 37, 37, 35))
test$detector<-ifelse(test$measure>0 & test$temp>=37, 'TYPE.II',
ifelse(test$measure>0 & test$temp<37, 'TYPE.I','ok'))
firstFUN <-function(x, y) y [which(x!='ok')[1]]
typeFUN <-function(x, y) y [which(x!='ok')[1]]
(outcome<-ddply(test, .(person), transform, "failure.day" = firstFUN(detector, day),
"failure.type" = typeFUN(detector, detector)))
> test
person day measure temp
1 A 7 112 36.9
2 A 14 0 36.1
3 A 21 500 37.2
4 A 22 600 39.6
5 B 7 0 35.0
6 B 14 0 37.0
7 C 7 0 37.0
8 C 14 50 35.0
>outcome
person day measure temp detector failure.day failure.type
1 A 7 112 36.9 TYPE.I 7 TYPE.I
2 A 14 0 36.1 ok 7 TYPE.I
3 A 21 500 37.2 TYPE.II 7 TYPE.I
4 A 22 600 39.6 TYPE.II 7 TYPE.I
5 B 7 0 35.0 ok NA <NA>
6 B 14 0 37.0 ok NA <NA>
7 C 7 0 37.0 ok 14 TYPE.I
8 C 14 50 35.0 TYPE.I 14 TYPE.I
APPLY A FUNCTION TO A DATA SET BROKEN DOWN BY A CATEGORICAL VARIABLE
distTab(mtcars, 5)#Normal use of the function
require(plyr)
dlply(mtcars, .(cyl), function(x)distTab(x, 4))
dlply(mtcars, .(cyl, am), function(x)distTab(x, 4))
dlply(CO2, .(Type, Treatment), function(x)distTab(x, 4))
dlply(CO2, .(Type, Treatment), mean)
> Test
Person Day Parasites
1 A 1 100
2 A 5 0
3 A 12 0
4 B 1 34
5 B 3 15
6 B 5 11
7 B 9 0
8 B 27 0
9 C 1 188
10 C 3 15
11 C 5 0
12 C 9 8
13 C 19 0
14 D 1 35
15 D 2 0
16 D 4 0
17 D 6 12
18 D 23 10
Test<-dput(structure(list(Person = structure(c(1L, 1L, 1L, 2L, 2L, 2L, 2L,
2L, 3L, 3L, 3L, 3L, 3L, 4L, 4L, 4L, 4L, 4L), .Label = c("A",
"B", "C", "D"), class = "factor"), Day = c(1L, 5L, 12L, 1L, 3L,
5L, 9L, 27L, 1L, 3L, 5L, 9L, 19L, 1L, 2L, 4L, 6L, 23L), Parasites = c(100L,
0L, 0L, 34L, 15L, 11L, 0L, 0L, 188L, 15L, 0L, 8L, 0L, 35L, 0L,
0L, 12L, 10L)), .Names = c("Person", "Day", "Parasites"), class = "data.frame",
row.names = c(NA,
-18L)))
#############################################################################
#METHOD 1
TESTER <- function(day, parasites){
x <- rle(parasites)
ifelse(x[[2]][length(x[[2]])]==0,
as.character(day[length(parasites)+1-x[[1]][length(x[[1]])]]),
"DNC"
)
}
NEW <- ddply(Test, .(Person), transform, "clearance.day" = TESTER(Day, Parasites))
############################################################################
#METHOD 2
fun <- function(Parasite, Day){
tmp <- rle(rev(Parasite))
len <- length(Parasite)
if(tmp$values[1] != 0){
return(rep("DNC", len))
}
n <- len
k <- n + 1 - tmp$lengths[1]
return(rep(Day[k], len))
}
ddply(Test, .(Person), summarize, Day = Day, clearance = fun(Parasites, Day))
#################################################################################
# test replications elapsed relative user.self sys.self user.child sys.child #
# 1 meth1 1000 7.92 1.000000 7.05 0.00 NA NA #
# 2 meth2 1000 15.70 1.982323 10.59 0.01 NA NA #
#################################################################################
Find the last occurrence of a value
> NEW
Person Day Parasites clearance.day
1 A 1 100 5
2 A 5 0 5
3 A 12 0 5
4 B 1 34 9
5 B 3 15 9
6 B 5 11 9
7 B 9 0 9
8 B 27 0 9
9 C 1 188 19
10 C 3 15 19
11 C 5 0 19
12 C 9 8 19
13 C 19 0 19
14 D 1 35 DNC
15 D 2 0 DNC
16 D 4 0 DNC
17 D 6 12 DNC
18 D 23 10 DNC
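The core of both methods is finding where the final run of zeros starts. A stripped-down sketch of that step in base R, on two of the persons above, returning one value per person. Reading "DNC" as "did not clear" is an assumption, and clearance.day is a made-up name.

```r
Test <- data.frame(Person = rep(c("A", "D"), times = c(3, 5)),
                   Day = c(1, 5, 12, 1, 2, 4, 6, 23),
                   Parasites = c(100, 0, 0, 35, 0, 0, 12, 10))
# First day of the final run of zeros, or "DNC" when the last
# measurement is not zero.
clearance.day <- function(day, parasites) {
    r <- rle(parasites)
    n <- length(r$values)
    if (r$values[n] != 0) return("DNC")
    as.character(day[length(parasites) - r$lengths[n] + 1])
}
cleared <- sapply(split(Test, Test$Person),
                  function(d) clearance.day(d$Day, d$Parasites))
cleared
```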
APPLY A FUNCTION BY GROUP TO TWO COLUMNS OF A DATA FRAME
use: lapply with split (faster) OR by
df <- data.frame(group = rep(c("G1", "G2"), each = 10),
var1 = rnorm(20),
var2 = rnorm(20))
r <- by(df, df$group, FUN = function(X) cor(X$var1, X$var2, method = "spearman"))
j <- lapply(split(df, df$group), function(x){cor(x[,2], x[,3], method = "spearman")})
data.frame(group = names(j), corr = unlist(j), row.names = NULL)
#INPUT
group var1 var2
1 G1 -0.60324036 0.22355138
2 G1 -1.64211667 -0.78414595
3 G1 -0.26629745 1.00448792
4 G1 0.42810545 1.04770451
5 G1 -1.26773098 -0.38998673
6 G1 0.78676448 -0.70243031
7 G1 0.29611857 -0.51216302
8 G1 1.96831668 -0.07017856
9 G1 0.13034798 1.28344355
10 G1 -0.15531481 0.94086118
11 G2 0.65258740 -0.48107934
12 G2 -1.11294137 -0.51280763
13 G2 1.35929571 -0.85913000
14 G2 -0.36637039 -0.50303582
15 G2 -1.20766391 -0.52910758
16 G2 0.27350136 -0.00188101
17 G2 -1.03189591 -0.11919335
18 G2 -0.11188425 -1.42868344
19 G2 0.05789754 -1.66900549
20 G2 -1.16903207 -0.17194032
#OUTPUT (by method)
>by(df, df$group, FUN = function(X) cor(X$var1, X$var2, method = "spearman"))
df$group: G1
[1] 0.1515152
------------------------------------------------------------
df$group: G2
[1] -0.1151515
#OUTPUT (lapply & split method)
> j <- lapply(split(df, df$group), function(x){cor(x[,2], x[,3], method = "spearman")})
> data.frame(group = names(j), corr = unlist(j), row.names = NULL)
group corr
1 G1 0.1515152
2 G2 -0.1151515
Unlist & Recursively Unlist sunlist
A <- data.frame( a = c(1:10), b = c(11:20) )
B <- data.frame( a = c(101:110), b = c(111:120) )
C <- data.frame( a = c(5:8), b = c(55:58) )
L <- list(list(B,C),list(A),list(C,A),list(A,B,C),list(C))
unlist(L) #unlist everything into one vector
unlist(L, recursive=FALSE) #remove one level of nesting: one list of data frames
Access Elements in a List
mod <- summary(lm(cyl~mpg, data=mtcars))
mod[[4]][[2]]
mod[[c(4,2)]] #uses the vector to recursively access lists
Repeat rows (1 & 2) by times of a third column [good for table() summarised data frames]
see dfpeat() in .Rprofile
dataframe[ rep( seq(dim(dataframe)[1]), 3rd column), -1]
EXAMPLE
DF <- structure(list(num = c(5, 5, 4), freq = c(96,
60, 59), rank = c(1, 2, 3)), .Names = c("num",
"freq", "rank"), row.names = c(NA, 3L), class = "data.frame")
num freq rank
1 5 96 1
2 5 60 2
3 4 59 3
DF2 <- DF[ rep( seq(dim(DF)[1]), DF$num), -1]
rownames(DF2) <- 1:nrow(DF2)
DF2
> DF2
freq rank
1 96 1
2 96 1
3 96 1
4 96 1
5 96 1
6 60 2
7 60 2
8 60 2
9 60 2
10 60 2
11 59 3
12 59 3
13 59 3
14 59 3
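The table() use case mentioned above, sketched on mtcars: summarise a column into counts, then expand the counts back into one row per original case.

```r
tab <- as.data.frame(table(cyl = mtcars$cyl))   # columns: cyl, Freq
# Repeat each row of the summary Freq times, keeping only the cyl column
expanded <- tab[rep(seq_len(nrow(tab)), tab$Freq), "cyl", drop = FALSE]
rownames(expanded) <- NULL
nrow(expanded)   # same number of rows as mtcars
```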
Turn a list into a dataframe 3 ways
j <- lapply(1:10, rnorm, n=4)
#METHOD 1
do.call(rbind, j) #or
data.frame(do.call(rbind, j))
#METHOD 2
library(plyr)
ldply(j, I)
#METHOD 3
ldply(j, function(x){x})
> do.call(rbind, j)
[,1] [,2] [,3] [,4]
[1,] 1.6411064 1.157174 0.873377 0.3134954
[2,] 0.9041039 2.667465 1.965937 0.4181302
[3,] 4.4037940 2.420527 3.264888 3.8311805
[4,] 3.9637209 5.402170 5.196343 4.6943378
[5,] 3.9358796 5.866777 5.540184 4.2303664
[6,] 5.8809682 4.669888 4.773183 6.8188467
[7,] 8.3059954 6.389316 5.942269 7.4630666
[8,] 7.5501919 7.807572 7.373059 7.5226562
[9,] 8.6035129 7.044928 9.074038 8.0470154
[10,] 9.3076546 8.424741 11.628522 9.7019016
>
> ldply(j, I)
V1 V2 V3 V4
1 1.6411064 1.157174 0.873377 0.3134954
2 0.9041039 2.667465 1.965937 0.4181302
3 4.4037940 2.420527 3.264888 3.8311805
4 3.9637209 5.402170 5.196343 4.6943378
5 3.9358796 5.866777 5.540184 4.2303664
6 5.8809682 4.669888 4.773183 6.8188467
7 8.3059954 6.389316 5.942269 7.4630666
8 7.5501919 7.807572 7.373059 7.5226562
9 8.6035129 7.044928 9.074038 8.0470154
10 9.3076546 8.424741 11.628522 9.7019016
>
> ldply(j, function(x){x})
V1 V2 V3 V4
1 1.6411064 1.157174 0.873377 0.3134954
2 0.9041039 2.667465 1.965937 0.4181302
3 4.4037940 2.420527 3.264888 3.8311805
4 3.9637209 5.402170 5.196343 4.6943378
5 3.9358796 5.866777 5.540184 4.2303664
6 5.8809682 4.669888 4.773183 6.8188467
7 8.3059954 6.389316 5.942269 7.4630666
8 7.5501919 7.807572 7.373059 7.5226562
9 8.6035129 7.044928 9.074038 8.0470154
10 9.3076546 8.424741 11.628522 9.7019016
EVAL/PARSE
a <- 3
x <- "a > 2"
eval(parse(text=x))
x2 <- "a==3"
eval(parse(text=x2))
a <- 1:13
x <- "mean(a)"
eval(parse(text=x))
## > a <- 3
## > x <- "a > 2"
## > eval(parse(text=x))
## [1] TRUE
## >
## > x2 <- "a==3"
## > eval(parse(text=x2))
## [1] TRUE
## >
## > a <- 1:13
## > x <- "mean(a)"
## > eval(parse(text=x))
## [1] 7
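eval() also takes a second argument saying where to look up the names, so the same text can be evaluated against different values without creating objects in the workspace. A small sketch with made-up values:

```r
expr <- parse(text = "a > 2 & b == 'x'")
# Evaluate the same expression against different sets of values
r1 <- eval(expr, list(a = 3, b = "x"))
r2 <- eval(expr, list(a = 1, b = "x"))
# A data frame also works as the evaluation environment
r3 <- eval(parse(text = "mean(mpg)"), mtcars)
```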
RESHAPE
http://had.co.nz/stat405/lectures/19-tables.pdf
Data Set to Long Format for Repeated Measures
library(reshape)
melt(data.frame, id=variables/columns to group by)
cast(molten data.frame, formula, variable or value, aggregate.function)
# Example 2:
d<-ascii("Code Country 1950 1951 1952 1953 1954
AFG Afghanistan 20,249 21,352 22,532 23,557 24,555
ALB Albania 8,097 8,986 10,058 11,123 12,246")
d
#Method 1
x1 <- reshape(d, direction="long", varying=list(names(d)[3:7]), v.names="Value",
idvar=c("Code","Country"), timevar="Year", times=1950:1954)
rownames(x1) <- 1:nrow(x1)
x1
#Method 2 PREFERRED
library(reshape)
x2 <- melt(d,id=c("Code","Country"),variable_name="Year")
x2[,"Year"] <- as.numeric(gsub("X","",x2[,"Year"]))
x2
cast(x2, Year~Country)
cast(x2, Country~Year)
cast(x2, Country + Code~Year)
#Example 1:
source("C:/Users/Rinker/Desktop/PhD Program/CEP 523-Stat Meth Ed Inference/R Stuff/Scripts/Data Sets.txt")
library(reshape)
rep.mes2<-rep.mes
Sex<-gl(2, 25, length=50,labels = c("Male", "Female"))
rep.mes2<-data.frame(rep.mes2[1:2],Sex,rep.mes2[3:5])
long.rep.mes<-melt(rep.mes2,id=1:3)[order(melt(rep.mes)$Sub),]
rownames(long.rep.mes)<-1:150
rep.mes2;long.rep.mes
#EXAMPLE
Code Country X1950 X1951 X1952 X1953 X1954
1 AFG Afghanistan 20,249 21,352 22,532 23,557 24,555
2 ALB Albania 8,097 8,986 10,058 11,123 12,246
> x2 #MELTED
Code Country Year value
1 AFG Afghanistan 1950 20,249
2 ALB Albania 1950 8,097
3 AFG Afghanistan 1951 21,352
4 ALB Albania 1951 8,986
5 AFG Afghanistan 1952 22,532
6 ALB Albania 1952 10,058
7 AFG Afghanistan 1953 23,557
8 ALB Albania 1953 11,123
9 AFG Afghanistan 1954 24,555
10 ALB Albania 1954 12,246
#RECASTED
> cast(x2, Year~Country)
Year Afghanistan Albania
1 1950 20,249 8,097
2 1951 21,352 8,986
3 1952 22,532 10,058
4 1953 23,557 11,123
5 1954 24,555 12,246
> cast(x2, Country~Year)
Country 1950 1951 1952 1953 1954
1 Afghanistan 20,249 21,352 22,532 23,557 24,555
2 Albania 8,097 8,986 10,058 11,123 12,246
> cast(x2, Country + Code~Year)
Country Code 1950 1951 1952 1953 1954
1 Afghanistan AFG 20,249 21,352 22,532 23,557 24,555
2 Albania ALB 8,097 8,986 10,058 11,123 12,246
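Note that the values above still contain commas, so they are text rather than numbers. A sketch of the clean-up step on a few of them, assuming the commas are thousands separators:

```r
value <- c("20,249", "8,097", "21,352")   # as stored after melt()
# as.character() guards against the column being a factor
value.num <- as.numeric(gsub(",", "", as.character(value), fixed = TRUE))
value.num
```

Applied to the melted data this would be x2$value <- as.numeric(gsub(",", "", as.character(x2$value), fixed = TRUE)).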
library('reshape')
DF<-data.frame("TAX"=c("A", "A", "A", "A", "B","B","B","B"),
"YEAR"=c(2000,2001,2002,2003,2000,2001,2002,2004),
"NUMBER"=c(2,2,3,1,3,4,3,2))
DF
cast(DF, YEAR ~ TAX, value = 'NUMBER', fill = 0)
DF2<-data.frame(DF, "NEW"=rnorm(nrow(DF)))
cast(DF2, YEAR+NEW ~ TAX, value = 'NUMBER', fill = 0)
cast(DF2, TAX ~ YEAR, value = 'NUMBER')
cast(DF2, TAX ~ NUMBER, value = 'NEW', mean)
cast(DF2, TAX ~ NUMBER, value = 'NEW', mean, fill=NA)
cast(DF2, TAX ~ NUMBER , value = 'NEW', sum)
cast(DF2, TAX + YEAR ~ NUMBER , value = 'NEW', sum)
TAX YEAR NUMBER
1 A 2000 2
2 A 2001 2
3 A 2002 3
4 A 2003 1
5 B 2000 3
6 B 2001 4
7 B 2002 3
8 B 2004 2
YEAR A B
1 2000 2 3
2 2001 2 4
3 2002 3 3
4 2003 1 0
5 2004 0 2
> cast(DF2, YEAR+NEW ~ TAX, value = 'NUMBER', fill = 0)
YEAR NEW A B
1 2000 -1.77380068 2 0
2 2000 0.46681003 0 3
3 2001 -0.09072904 0 4
4 2001 2.19618765 2 0
5 2002 -1.68538164 0 3
6 2002 0.85410280 3 0
7 2003 0.13744107 1 0
8 2004 0.12992724 0 2
> cast(DF2, TAX ~ YEAR, value = 'NUMBER')
TAX 2000 2001 2002 2003 2004
1 A 2 2 3 1 NA
2 B 3 4 3 NA 2
> cast(DF2, TAX ~ NUMBER, value = 'NEW', mean)
TAX 1 2 3 4
1 A 0.1374411 0.2111935 0.8541028 NaN
2 B NaN 0.1299272 -0.6092858 -0.09072904
> cast(DF2, TAX ~ NUMBER, value = 'NEW', mean, fill=NA)
TAX 1 2 3 4
1 A 0.1374411 0.2111935 0.8541028 NA
2 B NA 0.1299272 -0.6092858 -0.09072904
> cast(DF2, TAX ~ NUMBER , value = 'NEW', sum)
TAX 1 2 3 4
1 A 0.1374411 0.4223870 0.8541028 0.00000000
2 B 0.0000000 0.1299272 -1.2185716 -0.09072904
> cast(DF2, TAX + YEAR ~ NUMBER , value = 'NEW', sum)
TAX YEAR 1 2 3 4
1 A 2000 0.0000000 -1.7738007 0.0000000 0.00000000
2 A 2001 0.0000000 2.1961876 0.0000000 0.00000000
3 A 2002 0.0000000 0.0000000 0.8541028 0.00000000
4 A 2003 0.1374411 0.0000000 0.0000000 0.00000000
5 B 2000 0.0000000 0.0000000 0.4668100 0.00000000
6 B 2001 0.0000000 0.0000000 0.0000000 -0.09072904
7 B 2002 0.0000000 0.0000000 -1.6853816 0.00000000
8 B 2004 0.0000000 0.1299272 0.0000000 0.00000000
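For the simple YEAR by TAX table, base xtabs() gives the same layout as cast(DF, YEAR ~ TAX, value = 'NUMBER', fill = 0) without reshape. A sketch; it sums NUMBER within each cell, and empty cells come out as 0.

```r
DF <- data.frame(TAX = c("A", "A", "A", "A", "B", "B", "B", "B"),
                 YEAR = c(2000, 2001, 2002, 2003, 2000, 2001, 2002, 2004),
                 NUMBER = c(2, 2, 3, 1, 3, 4, 3, 2))
# Sum NUMBER within each YEAR x TAX cell
wide <- xtabs(NUMBER ~ YEAR + TAX, data = DF)
wide
```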
DF3 <- melt(DF2, id=c("TAX", "YEAR"), na.rm=TRUE)
cast(DF3, TAX ~ . | variable, mean)
cast(DF3, TAX ~ . | variable, sum)
cast(DF3, TAX ~ . | variable, range) #or even better
cast(DF3, TAX ~ . | variable, c(min, max))
cast(DF3, YEAR + TAX ~ . | variable)
recast(DF2, YEAR + TAX ~ . | variable,
id.var=c("TAX", "YEAR"),
measure.var=c("NUMBER", "NEW"))
recast(DF2, YEAR + TAX + NUMBER ~ . | variable,
id.var=c("TAX", "YEAR", "NUMBER"),
measure.var=c("NEW"),
fun.aggregate=range)
> cast(DF3, TAX ~ . | variable, c(min, max))
$NUMBER
TAX min max
1 A 1 3
2 B 2 4
$NEW
TAX min max
1 A -2.061310 1.281748
2 B -1.726776 1.986024
> cast(DF3, YEAR + TAX ~ . |variable)
$NUMBER
YEAR TAX (all)
1 2000 A 2
2 2000 B 3
3 2001 A 2
4 2001 B 4
5 2002 A 3
6 2002 B 3
7 2003 A 1
8 2004 B 2
$NEW
YEAR TAX (all)
1 2000 A -2.06131000
2 2000 B 1.98602360
3 2001 A 1.10881310
4 2001 B -0.89410042
5 2002 A 1.28174758
6 2002 B -1.72677556
7 2003 A 0.05761605
8 2004 B -0.15146665
Parm values person
1 hour 0.00 1
2 day 0.00 1
3 min 5.00 1
4 max 7.00 1
5 outlier 0.25 1
6 hour 1.00 2
7 day 0.00 2
8 min 5.00 2
9 max 7.00 2
10 outlier 0.25 2
person day hour max min outlier
1 1 0 0 7 5 0.25
2 2 0 1 7 5 0.25
test<-dput(structure(list(Parm = structure(c(2L, 1L, 4L, 3L, 5L, 2L, 1L,
4L, 3L, 5L), .Label = c("day", "hour", "max", "min", "outlier"
), class = "factor"), values = c(0, 0, 5, 7, 0.25, 1, 0, 5, 7,
0.25), person = c(1L, 1L, 1L, 1L, 1L, 2L, 2L, 2L, 2L, 2L)), .Names = c("Parm",
"values", "person"), class = "data.frame", row.names = c(NA,
-10L)))
library(reshape)
cast(test, person ~ Parm, value = "values")
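The same long-to-wide step, sketched with base reshape() in case the reshape package is not available. The data frame is rebuilt here so the block runs on its own; the last line only strips the "values." prefix that reshape() adds to the new column names.

```r
test <- data.frame(Parm = rep(c("hour", "day", "min", "max", "outlier"), 2),
                   values = c(0, 0, 5, 7, 0.25, 1, 0, 5, 7, 0.25),
                   person = rep(1:2, each = 5))
# One row per person, one column per level of Parm
wide <- reshape(test, idvar = "person", timevar = "Parm", direction = "wide")
names(wide) <- sub("^values\\.", "", names(wide))   # drop the "values." prefix
wide
```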
Differences between a value by group
#Create the data set
set.seed(1234)
x <- data.frame(id=1:5, value=sample(20:30, 5, replace=T), year=3)
y <- data.frame(id=1:5, value=sample(10:19, 5, replace=T), year=2)
z <- data.frame(id=1:5, value=sample(0:9, 5, replace=T), year=1)
df <- df.reset <- rbind(x, y, z) #The reset allows us to reset df each time
#SAPPLY
df <- df[order(df$id, df$year), ]
sdf <-split(df, df$id)
df$actual <- c(sapply(seq_along(sdf), function(x) diff(c(0, sdf[[x]][,2]))))
df[order(as.numeric(rownames(df))),]
#AGGREGATE
df <- df.reset
df <- df[order(df$id, df$year), ]
diff2 <- function(x) diff(c(0, x))
df$actual <- c(unlist(t(aggregate(value~id, df, diff2)[, -1])))
df[order(as.numeric(rownames(df))),]
#BY
df <- df.reset
df <- df[order(df$id, df$year), ]
df$actual <- unlist(by(df$value, df$id, diff2))
df[order(as.numeric(rownames(df))),]
#PLYR
df <- df.reset
df <- df[order(df$id, df$year), ]
df <- data.frame(temp=1:nrow(df), df)
library(plyr)
df <- ddply(df, .(id), transform, actual=diff2(value))
df[order(df$year, df$temp),][, -1]
Extending this to multiple columns
#Create the data set
set.seed(1234)
x <- data.frame(id=1:5, value=sample(20:30, 5, replace=T), year=3)
y <- data.frame(id=1:5, value=sample(10:19, 5, replace=T), year=2)
z <- data.frame(id=1:5, value=sample(0:9, 5, replace=T), year=1)
df <- rbind(x, y, z)
df <- df.reset <- data.frame(df[, 1:2], new.var=df[, 2]+sample(1:5, nrow(df),
replace=T), year=df[, 3])
#SAPPLY the BY
df <- df[order(df$id, df$year), ]
diff2 <- function(x) diff(c(0, x))
group.diff<- function(x) unlist(by(x, df$id, diff2))
df <- data.frame(df, sapply(df[, 2:3], group.diff))
df <- df[order(as.numeric(rownames(df))),]
names(df)[5:6] <- c('actual', 'actual.new');df
#TRANSFORM the BY
df <- df.reset
df <- df[order(df$id, df$year), ]
diff2 <- function(x) diff(c(0, x))
group.diff<- function(x) unlist(by(x, df$id, diff2))
df <- transform(df, actual=group.diff(value), actual.new=group.diff(new.var))
df[order(as.numeric(rownames(df))),]
#PLYR
df <- df.reset
df <- data.frame(temp=1:nrow(df), df)
df <- df[order(df$id, df$year), ]
library(plyr)
df <- ddply(df, .(id), transform, actual=diff2(value), actual.new=diff2(new.var))
df[order(df$temp),][, -1]
id value year
1 1 21 3
2 2 26 3
3 3 26 3
4 4 26 3
5 5 29 3
6 1 16 2
7 2 10 2
8 3 12 2
9 4 16 2
10 5 15 2
11 1 6 1
12 2 5 1
13 3 2 1
14 4 9 1
15 5 2 1
id value year actual
1 1 21 3 5
2 2 26 3 16
3 3 26 3 14
4 4 26 3 10
5 5 29 3 14
6 1 16 2 10
7 2 10 2 5
8 3 12 2 10
9 4 16 2 7
10 5 15 2 13
11 1 6 1 6
12 2 5 1 5
13 3 2 1 2
14 4 9 1 9
15 5 2 1 2
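The same group-wise difference can also be sketched with base R's ave(), which applies a function within groups and returns a vector aligned with the rows it was given. This is an alternative that the notes above do not use; the toy data and the name dat.ave are made up for illustration.

```r
# Hypothetical toy data: a cumulative 'value' per id, recorded for years 1 to 3
dat.ave <- data.frame(id    = c(1, 2, 1, 2, 1, 2),
                      value = c(30, 25, 12, 10, 5, 4),
                      year  = c(3, 3, 2, 2, 1, 1))
diff2 <- function(x) diff(c(0, x))

# Sort by id and year so diff() sees each id's values in time order
dat.ave <- dat.ave[order(dat.ave$id, dat.ave$year), ]
# ave() applies diff2 within each id and keeps the result aligned with the rows
dat.ave$actual <- ave(dat.ave$value, dat.ave$id, FUN = diff2)
# Restore the original row order
dat.ave <- dat.ave[order(as.numeric(rownames(dat.ave))), ]
dat.ave
```

Because ave() returns its result in the order of its input, no unlist() or transpose step is needed before assigning the new column.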
Extract just the rows of a data frame at the max of one variable (by group)
#THE DATA
df <- structure(list(ID = c(1L, 1L, 1L, 4L, 4L, 4L, 4L, 4L, 9L, 9L,
9L, 9L, 9L), week = c(2L, 4L, 6L, 2L, 6L, 9L, 9L, 12L, 2L, 4L,
6L, 9L, 12L), outcome = c(14L, 28L, 42L, 14L, 46L, 64L, 71L,
85L, 14L, 28L, 51L, 66L, 84L)), .Names = c("ID", "week", "outcome"
), class = "data.frame", row.names = c(NA, -13L))
#METHOD 1
do.call("rbind",
by(df, INDICES=df$ID, FUN=function(DF) DF[which.max(DF$week),]))
#METHOD 2
library(data.table)
dt <- data.table(df, key="ID")
dt[, .SD[which.max(week),], by=ID] #max of week, to match the other methods
#METHOD 3
library(plyr)
ddply(df, .(ID), function(X) X[which.max(X$week), ])
#METHOD 4
sdf <-with(df, split(df, ID))
max.week <- sapply(seq_along(sdf), function(x) which.max(sdf[[x]][, 'week']))
data.frame(t(mapply(function(x, y) y[x, ], max.week, sdf)))
#METHOD 5
df[rev(rownames(df)),][match(unique(df$ID), rev(df$ID)), ]
#METHOD 6
sdf <-with(df, split(df, ID))
df[cumsum(sapply(seq_along(sdf), function(x) which.max(sdf[[x]][, 'week']))), ]
#METHOD 7
df[cumsum(aggregate(week ~ ID, df, which.max)$week), ]
#METHOD 8
df[tapply(seq_along(df$ID), df$ID, function(x){tail(x,1)}),]
#METHOD 9
df[cumsum(as.numeric(lapply(split(df$week, df$ID), which.max))), ]
See the rbenchmark results further below. The data and the desired result:
ID week outcome
1 1 2 14
2 1 4 28
3 1 6 42
4 4 2 14
5 4 6 46
6 4 9 64
7 4 9 71
8 4 12 85
9 9 2 14
10 9 4 28
11 9 6 51
12 9 9 66
13 9 12 84
ID week outcome
1 1 6 42
4 4 12 85
9 9 12 84
We want to select the max week for each individual but return the rest of the data frame.
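Several of the methods above (the cumsum and tail-based ones) only return the right rows when the maximum week is the last row of each ID. A minimal sketch of an order-independent alternative, using base R's ave(); the toy data frame d is made up for illustration and is deliberately not sorted by week.

```r
# Hypothetical toy data, deliberately NOT sorted by week within ID
d <- data.frame(ID      = c(1, 1, 1, 4, 4),
                week    = c(6, 2, 4, 12, 9),
                outcome = c(42, 14, 28, 85, 64))
# ave() returns, for every row, the maximum week within that row's ID
is.max <- d$week == ave(d$week, d$ID, FUN = max)
# Keep whole rows where week equals its group maximum
d[is.max, ]
```

Note that if an ID has tied maximum weeks, this keeps all of the tied rows, whereas which.max() keeps only the first.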
library(rbenchmark)
benchmark(
DATA.TABLE= {dt <- data.table(df, key="ID")
dt[, .SD[which.max(week),], by=ID]},
DO.CALL={do.call("rbind",
by(df, INDICES=df$ID, FUN=function(DF) DF[which.max(DF$week),]))},
PLYR=ddply(df, .(ID), function(X) X[which.max(X$week), ]),
SPLIT={sdf <-with(df, split(df, ID))
max.week <- sapply(seq_along(sdf), function(x) which.max(sdf[[x]][, 'week']))
data.frame(t(mapply(function(x, y) y[x, ], max.week, sdf)))},
MATCH.INDEX=df[rev(rownames(df)),][match(unique(df$ID), rev(df$ID)), ],
AGGREGATE=df[cumsum(aggregate(week ~ ID, df, which.max)$week), ],
BRYANS.INDEX = df[cumsum(as.numeric(lapply(split(df$week, df$ID), which.max))), ],
SPLIT2={sdf <-with(df, split(df, ID))
df[cumsum(sapply(seq_along(sdf), function(x) which.max(sdf[[x]][, 'week']))), ]},
TAPPLY= df[tapply(seq_along(df$ID), df$ID, function(x){tail(x,1)}),],
columns = c( "test", "replications", "elapsed", "relative", "user.self","sys.self"),
order = "test", replications = 1000, environment = parent.frame())
test replications elapsed relative user.self sys.self
6 AGGREGATE 1000 4.49 7.610169 2.84 0.05
7 BRYANS.INDEX 1000 0.59 1.000000 0.20 0.00
1 DATA.TABLE 1000 20.28 34.372881 11.98 0.00
2 DO.CALL 1000 4.67 7.915254 2.95 0.03
5 MATCH.INDEX 1000 1.07 1.813559 0.51 0.00
3 PLYR 1000 10.61 17.983051 5.07 0.00
4 SPLIT 1000 3.12 5.288136 1.81 0.00
8 SPLIT2 1000 1.56 2.644068 1.28 0.00
9 TAPPLY 1000 1.08 1.830508 0.88 0.00
Summarize (apply a function to) a numeric variable by 2 categorical variables
#================
#Make a data set
#================
n <- 100
dat <- data.frame(
Accuracy = round(runif(n, 0, 5), 1),
Month = sample(1:2, n, replace=TRUE),
Day = sample(1:5, n, replace=TRUE),
Easting = rnorm(n),
Northing = rnorm(n),
Etc = rnorm(n)
)
#==========
#using plyr
#==========
library(plyr)
ddply(
dat,
c("Month", "Day"),
function (x) x[ which.min(x$Accuracy), ]
)
#==========
#using base
#==========
t(sapply(
split(dat, list(dat$Month, dat$Day)),
function(d) d[ which.min(d$Accuracy), ]))
#You can get quasi there with (but not all the data frame will come along):
aggregate(Accuracy ~ Month + Day, data = dat, FUN = min)
#OUTCOME (find min value by month and day)
Accuracy Month Day Easting Northing Etc
1 1.0 1 1 -1.2107186 -0.06473102 1.5195738
2 0.7 1 2 0.7552501 1.20389863 0.1319931
3 0.5 1 3 1.1104158 -0.31173230 -0.4738744
4 0.5 1 4 -0.7936402 0.94957122 -0.5173246
5 0.4 1 5 0.1725260 2.50637015 0.5808553
6 0.1 2 1 1.1359366 1.73373416 1.1122071
7 0.3 2 2 0.9101894 0.57581224 0.2726678
8 0.2 2 3 -0.2905642 0.67290842 1.7687111
9 0.7 2 4 -2.2955213 0.23270159 1.2040872
10 0.0 2 5 1.1167519 1.04612217 -0.7811158
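One caveat with the base version above: t(sapply(...)) over one-row data frames gives a matrix whose cells are lists, not a regular data frame. A minimal sketch of a base alternative that returns a data frame, using do.call(rbind, lapply(...)); the toy data and names (toy, pieces, res) are made up for illustration.

```r
# Hypothetical toy data with two grouping variables
toy <- data.frame(Accuracy = c(3.1, 0.5, 2.2, 1.4, 0.9, 4.0),
                  Month    = c(1, 1, 1, 2, 2, 2),
                  Day      = c(1, 1, 2, 1, 2, 2),
                  Etc      = letters[1:6])
# One data frame per Month/Day combination (drop = TRUE skips empty combinations)
pieces <- split(toy, list(toy$Month, toy$Day), drop = TRUE)
# Take the row with the smallest Accuracy from each piece, then stack the rows
res <- do.call(rbind, lapply(pieces, function(d) d[which.min(d$Accuracy), ]))
rownames(res) <- NULL
res
```

The result keeps every column of the original data, with one row per Month/Day group.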
Apply multiple functions to multiple outcomes by multiple groups
> head(CO2)
Plant Type Treatment conc uptake
1 Qn1 Quebec nonchilled 95 16.0
2 Qn1 Quebec nonchilled 175 30.4
3 Qn1 Quebec nonchilled 250 34.8
4 Qn1 Quebec nonchilled 350 37.2
5 Qn1 Quebec nonchilled 500 35.3
6 Qn1 Quebec nonchilled 675 39.2
aggregate(cbind(conc, uptake) ~ Plant + Type + Treatment, data=CO2, FUN=mean)
Plant Type Treatment conc uptake
1 Qn1 Quebec nonchilled 435 33.22857
2 Qn2 Quebec nonchilled 435 35.15714
3 Qn3 Quebec nonchilled 435 37.61429
4 Mn3 Mississippi nonchilled 435 24.11429
5 Mn2 Mississippi nonchilled 435 27.34286
6 Mn1 Mississippi nonchilled 435 26.40000
7 Qc1 Quebec chilled 435 29.97143
8 Qc3 Quebec chilled 435 32.58571
9 Qc2 Quebec chilled 435 32.70000
10 Mc2 Mississippi chilled 435 12.14286
11 Mc3 Mississippi chilled 435 17.30000
12 Mc1 Mississippi chilled 435 18.00000
SUM <- function(x) c(mean=mean(x), sd=sd(x), n=length(x))
aggregate(cbind(conc, uptake) ~ Plant + Type + Treatment, data=CO2, FUN=SUM)
Plant Type Treatment conc.mean conc.sd conc.n uptake.mean uptake.sd uptake.n
1 Qn1 Quebec nonchilled 435.0000 317.7263 7.0000 33.228571 8.214766 7.000000
2 Qn2 Quebec nonchilled 435.0000 317.7263 7.0000 35.157143 11.004069 7.000000
3 Qn3 Quebec nonchilled 435.0000 317.7263 7.0000 37.614286 10.349948 7.000000
4 Mn3 Mississippi nonchilled 435.0000 317.7263 7.0000 24.114286 6.484707 7.000000
5 Mn2 Mississippi nonchilled 435.0000 317.7263 7.0000 27.342857 7.652855 7.000000
6 Mn1 Mississippi nonchilled 435.0000 317.7263 7.0000 26.400000 8.694251 7.000000
7 Qc1 Quebec chilled 435.0000 317.7263 7.0000 29.971429 8.334609 7.000000
8 Qc3 Quebec chilled 435.0000 317.7263 7.0000 32.585714 10.321083 7.000000
9 Qc2 Quebec chilled 435.0000 317.7263 7.0000 32.700000 11.336960 7.000000
10 Mc2 Mississippi chilled 435.0000 317.7263 7.0000 12.142857 2.186974 7.000000
11 Mc3 Mississippi chilled 435.0000 317.7263 7.0000 17.300000 3.049044 7.000000
12 Mc1 Mississippi chilled 435.0000 317.7263 7.0000 18.000000 4.118657 7.000000
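When FUN returns more than one value, aggregate() stores each outcome as a single matrix column, even though the printout looks like separate columns (conc.mean, conc.sd, ...). A minimal sketch of flattening the result into ordinary columns with do.call(data.frame, ...); the names agg and flat are made up for illustration, and fewer grouping variables are used to keep the output small.

```r
SUM <- function(x) c(mean = mean(x), sd = sd(x), n = length(x))
agg <- aggregate(cbind(conc, uptake) ~ Type + Treatment, data = CO2, FUN = SUM)
ncol(agg)   # 4: Type, Treatment, plus one matrix column each for conc and uptake

# data.frame() expands each matrix column into separate, individually named columns
flat <- do.call(data.frame, agg)
names(flat)
flat$uptake.mean   # now an ordinary numeric column
```

After flattening, the summary columns can be selected, sorted, or merged like any other column.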
Table of Means
dat <- structure(list(partic = c(4.875, 3.375, 4.5, 2.875, 4, 4.625,
4.375, 4, 4.375, 3.625, 3.25, 4.875, 4.625, 4.875, 4.125, 3.25,
2.5, 3.875, 3.75, 3.625, 3.375, 4.75, 4.75, 3.57142857142857,
2.5, 4.125, 3.5, 3.375, 3.5, 4.5, 4.375, 3.66666666666667, 1.5,
4.375, 3.875, 4.375, 3.14285714285714, 3.875, 3.875, 3.125, 3.25,
2.375, 2.5, 3.5, 4.25, 4.25, 3.5, 3.625, 3.5, 3.625, 3.75, 3.625,
3.625, 4.25, 4, 4, 3.75, 3.875, 3.5, 4.375, 4, 3.5, 3.75, 3.375,
4.375, 3.875, 1.75, 4.5, 3.75, 3.625, 4, 4, 3.875, 2.75, 3.625,
3.5, 4.5, 4.125, 4.125, 4.625, 3.125, 4.625, 3.875, 3, 4.5, 4.25,
4.375, 4.25, 3.625, 3.5, 2.5, 2.875, 2.875, 2.5, 3.75, 4, 2.875,
2.375, 4.125, 4.5), grade = structure(c(3L, 4L, 3L, 3L, 3L, 3L,
3L, 3L, 3L, 3L, 3L, 3L, 1L, 1L, 4L, 4L, 4L, 4L, 4L, 3L, 1L, 4L,
3L, 4L, 2L, 2L, 3L, 4L, 4L, 2L, 2L, 2L, 2L, 2L, 3L, 2L, 2L, 4L,
4L, 4L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 3L,
3L, 3L, 4L, 4L, 4L, 4L, 4L, 4L, 4L, 4L, 4L, 3L, 4L, 3L, 1L, 1L,
1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 4L, 4L, 4L, 4L, 4L, 4L,
4L, 4L, 4L, 4L, 3L, 3L, 3L, 3L, 3L, 2L, 3L, 2L, 2L, 2L), .Label = c("freshman",
"sophomore", "junior", "senior"), class = "factor"), race3 = structure(c(1L,
2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 1L, 1L, 1L, 2L, 2L, 2L, 1L,
1L, 3L, 2L, 3L, 2L, 3L, 2L, 2L, 3L, 2L, 2L, 1L, 3L, 1L, 3L, 2L,
1L, 1L, 2L, 1L, 2L, 2L, 2L, 2L, 2L, 2L, 1L, 3L, 2L, 1L, 1L, 1L,
1L, 2L, 1L, 2L, 1L, 3L, 1L, 1L, 1L, 1L, 1L, 1L, 2L, 1L, 1L, 1L,
2L, 1L, 3L, 1L, 2L, 1L, 3L, 1L, 1L, 2L, 1L, 2L, 2L, 2L, 1L, 2L,
2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 3L, 1L, 2L, 2L, 2L,
2L, 2L, 1L), .Label = c("white/asian", "black", "hispanic"), class = "factor")), .Names =
c("partic",
"grade", "race3"), na.action = structure(c(11L, 42L, 154L, 210L,
230L, 282L, 306L, 336L, 349L, 352L, 377L, 378L, 397L, 437L, 477L
), .Names = c("11", "42", "154", "210", "230", "282", "306",
"336", "349", "352", "377", "378", "397", "437", "477"), class = "omit"), row.names = c(NA,
100L), class = "data.frame")
#====================================================================
library(reshape) #melt() and cast() come from the reshape package
DF.rs <- melt(dat, id=c("grade", "race3"))
#MT returns "mean(sd)" as one string per cell
MT <- function(x){ paste(round(mean(x), digits=2),
"(", round(sd(x), digits=2), ")", sep="")}
cast(DF.rs, grade ~ race3, fun.aggregate=MT,
margins=c("grand_row", "grand_col"))
dat <- read.table(text="
request user group
1 1 1
4 1 1
7 1 1
5 1 2
8 1 2
1 2 3
4 2 3
7 2 3
9 2 4
", header=TRUE)
library(plyr) #for ddply()
newdat <- ddply(dat, .(user, group), transform,
idx = paste("request", 1:length(request), sep = ""))
> cast(newdat, user + group ~ idx, value = "request")
user group request1 request2 request3
1 1 1 1 4 7
2 1 2 5 8 NA
3 2 3 1 4 7
4 2 4 9 NA NA
Tables of means: further cast examples (reshape package)
names(airquality) <- tolower(names(airquality))
aqm <- melt(airquality, id=c("month", "day"), na.rm=TRUE)
cast(aqm, day ~ month ~ variable)
cast(aqm, month ~ variable, mean)
cast(aqm, month ~ . | variable, mean)
cast(aqm, month ~ variable, mean, margins=c("grand_row", "grand_col"))
cast(aqm, day ~ month, mean, subset=variable=="ozone")
cast(aqm, month ~ variable, range)
cast(aqm, month ~ variable + result_variable, range)
cast(aqm, variable ~ month ~ result_variable,range)
#Chick weight example
names(ChickWeight) <- tolower(names(ChickWeight))
chick_m <- melt(ChickWeight, id=2:4, na.rm=TRUE)
cast(chick_m, time ~ variable, mean) # average effect of time
cast(chick_m, diet ~ variable, mean) # average effect of diet
cast(chick_m, diet ~ time ~ variable, mean) # average effect of diet & time
# How many chicks at each time? - checking for balance
cast(chick_m, time ~ diet, length)
cast(chick_m, chick ~ time, mean)
cast(chick_m, chick ~ time, mean, subset=time < 10 & chick < 20)
cast(chick_m, diet + chick ~ time)
cast(chick_m, chick ~ time ~ diet)
cast(chick_m, diet + chick ~ time, mean, margins="diet")
#Tips example
cast(melt(tips), sex ~ smoker, mean, subset=variable=="total_bill")
cast(melt(tips), sex ~ smoker | variable, mean)
ff_d <- melt(french_fries, id=1:4, na.rm=TRUE)
cast(ff_d, subject ~ time, length)
cast(ff_d, subject ~ time, length, fill=0)
cast(ff_d, subject ~ time, function(x) 30 - length(x))
cast(ff_d, subject ~ time, function(x) 30 - length(x), fill=30)
cast(ff_d, variable ~ ., c(min, max))
cast(ff_d, variable ~ ., function(x) quantile(x,c(0.25,0.5)))
cast(ff_d, treatment ~ variable, mean, margins=c("grand_col", "grand_row"))
cast(ff_d, treatment + subject ~ variable, mean, margins="treatment")
From long to wide format
reshape(data, varying = NULL, v.names = NULL, timevar = "time", idvar = "id",
ids = 1:NROW(data), times = seq_along(varying[[1]]), drop = NULL, direction,
new.row.names = NULL)
Arguments
data            a data frame
varying         names of sets of variables in the wide format that correspond to single variables in long format ('time-varying'). This is canonically a list of vectors of variable names, but it can optionally be a matrix of names, or a single vector of names. In each case, the names can be replaced by indices which are interpreted as referring to names(data). See below for more details and options.
v.names         names of variables in the long format that correspond to multiple variables in the wide format. See below for details.
timevar         the variable in long format that differentiates multiple records from the same group or individual.
idvar           names of one or more variables in long format that identify multiple records from the same group/individual. These variables may also be present in wide format.
ids             the values to use for a newly created idvar variable in long format.
times           the values to use for a newly created timevar variable in long format. See below for details.
drop            a vector of names of variables to drop before reshaping
direction       character string, either "wide" to reshape to wide format, or "long" to reshape to long format.
new.row.names   logical; if TRUE and direction="wide", create new row names in long format from the values of the id and time variables.
DF<-data.frame(id=rep(paste("subject", 1:5, sep=" "), 3),
sex=rep(c("m","m","m","f","f"), 3),
time=c(rep("Time1",5), rep("Time2",5), rep("Time3",5)),
score1=rnorm(15), score2=abs(rnorm(15)*4))
wide <- reshape(DF, v.names=c("score1", "score2"), idvar="id",
timevar="time", direction="wide")
wide
long <- with(wide, reshape(wide, idvar="id",
v.names=c("score1", "score2"), direction="long"))
rownames(long)<-1:nrow(long)
long
#USING RESHAPE
DF<-data.frame(id=rep(paste("subject", 1:5, sep=" "), 3),
sex=rep(c("m","m","m","f","f"), 3),
time=c(rep("Time1",5), rep("Time2",5), rep("Time3",5)),
score1=rnorm(15), score2=abs(rnorm(15)*4))
library(reshape)
m <- melt(DF)
cast(m,id+sex~...)
cast(m,id+sex~variable+time)
Long to wide example (base's reshape and the reshape2 package)
dat <- data.frame(county = rep(letters[1:4], each=2),
state = rep(LETTERS[1], times=8),
industry = rep(c("construction", "manufacturing"), 4),
employment = round(rnorm(8, 100, 50), 0),
establishments = round(rnorm(8, 20, 5), 0))
#Method 1 (base)
reshape(dat, direction="wide", idvar=c("state", "county"), timevar="industry")
#Method 2 (reshape2 package)
library(reshape2)
m <- melt(dat)
dcast(m, state + county~...)
county state industry employment establishments
1 a A construction 100 24
2 a A manufacturing 159 26
3 b A construction 117 17
4 b A manufacturing 64 25
5 c A construction 85 23
6 c A manufacturing 50 19
7 d A construction 21 14
8 d A manufacturing 48 8
state county construction_employment construction_establishments manufacturing_employment manufacturing_establishments
1 A a 100 24 159 26
2 A b 117 17 64 25
3 A c 85 23 50 19
4 A d 21 14 48 8
Long to wide with base reshape explained:
df3 <- data.frame(school = rep(1:3, each = 4), class = rep(9:10, 6),
time = rep(c(1,1,2,2), 3), score = rnorm(12))
df3
wide <- reshape(df3, idvar = c("school","class"), direction = "wide")
wide
DF<-data.frame(id=rep(paste("subject", 1:5, sep=" "), 3),
sex=rep(c("m","m","m","f","f"), 3),
time=c(rep("Time1",5), rep("Time2",5), rep("Time3",5)),
score1=rnorm(15), score2=abs(rnorm(15)*4))
DF
wide <- reshape(DF, v.names=c("score1", "score2"), idvar="id",
timevar="time", direction="wide")
DF2 <- expand.grid(market = LETTERS[1:5],
date = Sys.Date()+(0:5),
sitename = letters[1:2])
DF2$impression <- sample(100, nrow(DF2), replace=TRUE)
DF2$clicks <- sample(100, nrow(DF2), replace=TRUE)
DF2
wide <- reshape(DF2, v.names=c("impression", "clicks"), idvar=c("market", "date"),
timevar="sitename", direction="wide")
wide
What's going on with reshape when it goes from long to wide:
timevar   the repeated-measures variable; this may be times, locations, etc. [categorical]
v.names   the measurements taken at each repeated measure (in both of these cases we have two different variables being measured at each level of timevar) [numeric]
idvar     the variables that identify each unit; they are kept once per unit while the v.names columns are unstacked, one set per level of timevar
Basically, worry about what your repeated-measures variable is (timevar).
This is not numeric but categorical. Then enter the actual measures
taken at each repeated measure (v.names). These are usually numeric (but
could be categorical). Generally everything remaining is an id variable.
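The three roles can be seen on a very small example. This is a minimal sketch with made-up data (long.toy and wide.toy are hypothetical names): id identifies the unit, time is the categorical repeated-measures variable, and score is the measurement.

```r
# Hypothetical long data: two subjects, each measured at T1 and T2
long.toy <- data.frame(id    = rep(c("s1", "s2"), each = 2),
                       sex   = rep(c("m", "f"), each = 2),
                       time  = rep(c("T1", "T2"), times = 2),
                       score = c(10, 12, 20, 25))
wide.toy <- reshape(long.toy,
                    idvar     = "id",     # identifies the unit
                    timevar   = "time",   # categorical repeated-measures variable
                    v.names   = "score",  # the measurement taken at each time
                    direction = "wide")
wide.toy
```

The result has one row per id, keeps sex as a constant column, and creates one score column per level of time (score.T1 and score.T2).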
Wide to Long with > 2 Measures Per Time
FROM
id x1 x2 x3 y1 y2 y3 z1 z2 z3 v
1 1 2 4 5 10 20 15 200 150 170 2.5
2 2 3 7 6 25 35 40 300 350 400 4.2
TO
id xsource x y v
1 1 x1 2 10 2.5
2 1 x2 4 20 2.5
3 1 x3 5 15 2.5
4 2 x1 3 25 4.2
5 2 x2 7 35 4.2
6 2 x3 6 40 4.2
CODE
x <- read.table(text="
id x1 x2 x3 y1 y2 y3 z1 z2 z3 v
1 2 4 5 10 20 15 200 150 170 2.5
2 3 7 6 25 35 40 300 350 400 4.2
", header=TRUE)
x
#===============================================================
#METHOD #1
res <- reshape(x, direction = "long", idvar = "id",
varying = list(c("x1","x2", "x3"),
c("y1", "y2", "y3"),
c("z1", "z2", "z3")),
v.names = c("x", "y", "z"),
timevar = "xsource", times = c("x1", "x2", "x3"))
res <- res[order(res$id, res$xsource), c(1,3,4,5,2)]
row.names(res) <- NULL
res
#===============================================================
#METHOD #2
chunks <- lapply(1:nrow(x),
#unlist() turns the one-row data frame into a plain numeric vector before reshaping
function(i) cbind(x[i, 1], 1:3, matrix(unlist(x[i, 2:10]), ncol=3), x[i, 11]))
res <- do.call(rbind, chunks)
colnames(res) <- c("id", "source", "x", "y", "z", "v")
res
Wide to Long with Multiple Measures per Time (stacking and double stacking closely examined)
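Three tables follow the code: the original (wide) data, the single stack (one row per id per time), and the double stack (one row per id per time per measure).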
#The Data Frame
id <- paste('x', "1.", 1:10, sep="")
set.seed(10)
DF <- data.frame(id, trt=sample(c('cnt', 'tr'), 10, T),
work.T1=runif(10), play.T1=runif(10), talk.T1=runif(10),
total.T1=runif(10), work.T2=runif(10), play.T2=runif(10),
talk.T2=runif(10), total.T2=runif(10))
id trt work.T1 play.T1 talk.T1 total.T1 work.T2 play.T2 talk.T2 total.T2
1 x1.1 tr 0.65165567 0.8647212 0.53559704 0.27548386 0.3543281 0.03188816 0.07557029 0.86138244
2 x1.2 cnt 0.56773775 0.6153524 0.09308813 0.22890394 0.9364325 0.11446759 0.53442678 0.46439198
3 x1.3 cnt 0.11350898 0.7751099 0.16980304 0.01443391 0.2458664 0.46893548 0.64135658 0.22286743
4 x1.4 tr 0.59592531 0.3555687 0.89983245 0.72896456 0.4731415 0.39698674 0.52573932 0.62354960
5 x1.5 cnt 0.35804998 0.4058500 0.42263761 0.24988047 0.1915609 0.83361919 0.03928139 0.20364770
6 x1.6 cnt 0.42880942 0.7066469 0.74774647 0.16118328 0.5832220 0.76112174 0.54585984 0.01967341
7 x1.7 cnt 0.05190332 0.8382877 0.82265258 0.01704265 0.4594732 0.57335645 0.37276310 0.79799301
8 x1.8 cnt 0.26417767 0.2395891 0.95465365 0.48610035 0.4674340 0.44750805 0.96130241 0.27431890
9 x1.9 tr 0.39879073 0.7707715 0.68544451 0.10290017 0.3998326 0.08380201 0.25734157 0.16660910
10 x1.10 cnt 0.83613414 0.3558977 0.50050323 0.80154700 0.5052856 0.21913855 0.20795168 0.17015172
id trt times Work Play Talk Total
1 x1.1 tr 1 0.65165567 0.86472123 0.53559704 0.27548386
2 x1.2 cnt 1 0.56773775 0.61535242 0.09308813 0.22890394
3 x1.3 cnt 1 0.11350898 0.77510990 0.16980304 0.01443391
4 x1.4 tr 1 0.59592531 0.35556869 0.89983245 0.72896456
5 x1.5 cnt 1 0.35804998 0.40584997 0.42263761 0.24988047
6 x1.6 cnt 1 0.42880942 0.70664691 0.74774647 0.16118328
7 x1.7 cnt 1 0.05190332 0.83828767 0.82265258 0.01704265
8 x1.8 cnt 1 0.26417767 0.23958913 0.95465365 0.48610035
9 x1.9 tr 1 0.39879073 0.77077153 0.68544451 0.10290017
10 x1.10 cnt 1 0.83613414 0.35589774 0.50050323 0.80154700
11 x1.1 tr 2 0.35432806 0.03188816 0.07557029 0.86138244
12 x1.2 cnt 2 0.93643254 0.11446759 0.53442678 0.46439198
13 x1.3 cnt 2 0.24586639 0.46893548 0.64135658 0.22286743
14 x1.4 tr 2 0.47314146 0.39698674 0.52573932 0.62354960
15 x1.5 cnt 2 0.19156087 0.83361919 0.03928139 0.20364770
16 x1.6 cnt 2 0.58322197 0.76112174 0.54585984 0.01967341
17 x1.7 cnt 2 0.45947319 0.57335645 0.37276310 0.79799301
18 x1.8 cnt 2 0.46743405 0.44750805 0.96130241 0.27431890
19 x1.9 tr 2 0.39983256 0.08380201 0.25734157 0.16660910
20 x1.10 cnt 2 0.50528560 0.21913855 0.20795168 0.17015172
id trt time type measures
1 x1.1 tr 1 work 0.65165567
2 x1.2 cnt 1 work 0.56773775
3 x1.3 cnt 1 work 0.11350898
4 x1.4 tr 1 work 0.59592531
5 x1.5 cnt 1 work 0.35804998
6 x1.6 cnt 1 work 0.42880942
7 x1.7 cnt 1 work 0.05190332
8 x1.8 cnt 1 work 0.26417767
9 x1.9 tr 1 work 0.39879073
10 x1.10 cnt 1 work 0.83613414
11 x1.1 tr 2 work 0.35432806
12 x1.2 cnt 2 work 0.93643254
13 x1.3 cnt 2 work 0.24586639
14 x1.4 tr 2 work 0.47314146
15 x1.5 cnt 2 work 0.19156087
16 x1.6 cnt 2 work 0.58322197
17 x1.7 cnt 2 work 0.45947319
18 x1.8 cnt 2 work 0.46743405
19 x1.9 tr 2 work 0.39983256
20 x1.10 cnt 2 work 0.50528560
21 x1.1 tr 1 play 0.86472123
22 x1.2 cnt 1 play 0.61535242
23 x1.3 cnt 1 play 0.77510990
24 x1.4 tr 1 play 0.35556869
25 x1.5 cnt 1 play 0.40584997
26 x1.6 cnt 1 play 0.70664691
27 x1.7 cnt 1 play 0.83828767
28 x1.8 cnt 1 play 0.23958913
29 x1.9 tr 1 play 0.77077153
30 x1.10 cnt 1 play 0.35589774
31 x1.1 tr 2 play 0.03188816
32 x1.2 cnt 2 play 0.11446759
33 x1.3 cnt 2 play 0.46893548
34 x1.4 tr 2 play 0.39698674
35 x1.5 cnt 2 play 0.83361919
36 x1.6 cnt 2 play 0.76112174
37 x1.7 cnt 2 play 0.57335645
38 x1.8 cnt 2 play 0.44750805
39 x1.9 tr 2 play 0.08380201
40 x1.10 cnt 2 play 0.21913855
41 x1.1 tr 1 talk 0.53559704
42 x1.2 cnt 1 talk 0.09308813
43 x1.3 cnt 1 talk 0.16980304
44 x1.4 tr 1 talk 0.89983245
45 x1.5 cnt 1 talk 0.42263761
46 x1.6 cnt 1 talk 0.74774647
47 x1.7 cnt 1 talk 0.82265258
48 x1.8 cnt 1 talk 0.95465365
49 x1.9 tr 1 talk 0.68544451
50 x1.10 cnt 1 talk 0.50050323
51 x1.1 tr 2 talk 0.07557029
52 x1.2 cnt 2 talk 0.53442678
53 x1.3 cnt 2 talk 0.64135658
54 x1.4 tr 2 talk 0.52573932
55 x1.5 cnt 2 talk 0.03928139
56 x1.6 cnt 2 talk 0.54585984
57 x1.7 cnt 2 talk 0.37276310
58 x1.8 cnt 2 talk 0.96130241
59 x1.9 tr 2 talk 0.25734157
60 x1.10 cnt 2 talk 0.20795168
61 x1.1 tr 1 total 0.27548386
62 x1.2 cnt 1 total 0.22890394
63 x1.3 cnt 1 total 0.01443391
64 x1.4 tr 1 total 0.72896456
65 x1.5 cnt 1 total 0.24988047
66 x1.6 cnt 1 total 0.16118328
67 x1.7 cnt 1 total 0.01704265
68 x1.8 cnt 1 total 0.48610035
69 x1.9 tr 1 total 0.10290017
70 x1.10 cnt 1 total 0.80154700
71 x1.1 tr 2 total 0.86138244
72 x1.2 cnt 2 total 0.46439198
73 x1.3 cnt 2 total 0.22286743
74 x1.4 tr 2 total 0.62354960
75 x1.5 cnt 2 total 0.20364770
76 x1.6 cnt 2 total 0.01967341
77 x1.7 cnt 2 total 0.79799301
78 x1.8 cnt 2 total 0.27431890
79 x1.9 tr 2 total 0.16660910
80 x1.10 cnt 2 total 0.17015172
See the following sections for how to stack and double stack with:
1. reshape (base function)
2. reshape (the package)
3. rbinding and cbinding
Single Stack
2 Methods using reshape from base:
#Method 1
NEW <- reshape(DF, varying=list(work= c(3, 7), play= c(4,8), talk= c(5,9), total= c(6,10) ),
v.names=c("work", "play", "talk", "total"),
# v.names is needed once 'varying' is given as a list (the list form is what allows 'times')
direction="long",
times=1:2, # substitutes numbers for T1 and T2
timevar="times") # to name the time col
rownames(NEW) <- 1:nrow(NEW)
#Method 2 (shorter but less explicit)
NEW <- reshape(DF, direction="long", varying=3:10, sep=".T")
rownames(NEW) <- 1:nrow(NEW)
NEW
Method from reshape package:
library(reshape)
DF2 <- melt(DF,id.vars=1:2)
DF3 <- cbind(DF2,
colsplit(as.character(DF2$variable),"\\.",
names=c("activity","times")))
## rename time, reorder factors:
DF4 <- transform(DF3,
times=as.numeric(gsub("^T","",times)), #the column created by colsplit is named times
activity=factor(activity,
levels=c("work","play","talk","total")),
id=factor(id,levels=paste("x1",1:10,sep=".")))
## reshape back to wide
DF5 <- cast(subset(DF4,select=-variable),id+trt+times~activity)
## reorder
NEW <- with(DF5,DF5[order(times,id),])
NEW
2 Methods using rbinding and cbinding:
#Method 1
DF.1 <- DF[, 1:2]
DFlist <- list(DF[, 3:6], DF[, 7:10])
lapply(seq_along(DFlist), function(x) names(DFlist[[x]]) <<-
unlist(strsplit(names(DFlist[[x]])[1:length(names(DFlist[[x]]))],
".", fixed=T))[c(T, F)]
)
repeats <- 2 #Number of repeated measures
time <- rep(1:repeats, each=nrow(DF.1))
NEW <- data.frame(DF.1[rep(seq_len(nrow(DF.1)), repeats), ], time,
do.call('rbind', DFlist))
NEW
#Method 2
DF.1 <- DF[, 1:2]
DF.2 <- DF[, 3:6]
DF.3 <- DF[, 7:10]
repeats <- 2 #Number of repeated measures
names(DF.2) <- names(DF.3) <- unlist(strsplit(names(DF.2), ".", fixed=T))[c(T,F)]
time <- rep(1:repeats, each=nrow(DF.1))
NEW <- data.frame(DF.1[rep(seq_len(nrow(DF.1)), repeats), ], time, rbind(DF.2, DF.3))
NEW
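The items highlighted in green in the original notes (the steps that handle the time suffix, such as times=1:2 and the gsub("^T", ...) call) are doing the same thing: turning T1 & T2 into 1 & 2.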
Replicate and stack subset (columns) of a data frame (repeat rows / replicate rows)
This is a method of stacking the same data frame x number of times (the id variables):
dataframe[rep(seq_len(nrow(dataframe)), repeats), ]
Where:
dataframe - the data frame to be repeated and stacked
repeats - the number of times to repeat the data frame
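A minimal sketch of the row-replication idiom on made-up data (ids, repeats, and stacked are hypothetical names), adding a time column that marks which copy each row belongs to:

```r
# Hypothetical id columns to be repeated once per time point
ids <- data.frame(id = c("a", "b", "c"), trt = c("cnt", "tr", "cnt"))
repeats <- 2   # number of repeated measures

# Indexing with the row sequence repeated 'repeats' times stacks whole copies
stacked <- ids[rep(seq_len(nrow(ids)), repeats), ]
# Label each copy with its time point
stacked$time <- rep(seq_len(repeats), each = nrow(ids))
rownames(stacked) <- NULL
stacked
```

The first nrow(ids) rows are copy 1 and the next nrow(ids) rows are copy 2, which is the layout that the rbind-based stacking methods rely on.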
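Double Stack (This can be useful for certain analyses or graphics such as repeated measures or faceting in ggplot2)
Method using reshape from base:
#assumes NEW has a column named time (as produced by Method 2 or the rbind methods); use "times" if NEW came from Method 1
NEW2 <- reshape(NEW, direction = "long", idvar = c("id", "trt", "time"),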
varying = list(c("work", "play", "talk", "total")),
v.names = c("measures"),
timevar = "type",
times = c("work", "play", "talk", "total"))
rownames(NEW2) <- 1:nrow(NEW2)
NEW2
Method from reshape package:
require(reshape)
DF2 <- melt(DF,id.vars=1:2)
DF3 <- cbind(DF2,
colsplit(as.character(DF2$variable),"\\.",
names=c("type","times")))
NEW2 <- with(DF3, DF3[, c('id', 'trt', 'times', 'type', 'value')])
NEW2$times <- as.numeric(gsub("^T", "", NEW2$times)) #turn T1 & T2 into 1 & 2
NEW2
Method using rbinding and cbinding: stack the measure columns of NEW one after another with rbind, repeating the id columns for each measure (the same replicate-and-stack idea used for the single stack).
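A minimal sketch of a double stack built only from cbind and rbind. This is one possible approach under stated assumptions, not code from the original notes; the toy data frame single stands in for the single-stacked NEW, and the names vars, pieces, and double are made up.

```r
# Hypothetical single-stacked data (one row per id per time)
single <- data.frame(id   = c("a", "b", "a", "b"),
                     time = c(1, 1, 2, 2),
                     work = c(0.1, 0.2, 0.3, 0.4),
                     play = c(0.5, 0.6, 0.7, 0.8))
vars <- c("work", "play")   # the measure columns to stack

# For each measure: bind the id columns to a label and that measure's values
pieces <- lapply(vars, function(v) {
  cbind(single[, c("id", "time")], type = v, measures = single[[v]])
})
# Stack the per-measure pieces on top of each other
double <- do.call(rbind, pieces)
rownames(double) <- NULL
double
```

Each measure contributes one block of rows, so the result has nrow(single) * length(vars) rows with columns id, time, type, and measures.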
Another Wide To Long With Awkwardly Named Columns (rename 'em for ease)
#THE DATA SET
dat <- read.table(text=" WorkerId pio_1_1 pio_1_2 pio_1_3 pio_1_4 pio_2_1 pio_2_2 pio_2_3 pio_2_4
1 1 Yes No No No No No Yes No
2 2 No Yes No No Yes No Yes No
3 3 Yes Yes No No Yes No Yes No", header=T)
redat <- dat #To reset the Data
The trick to getting the most out of reshape is to get your column names into an R-friendly format to begin with; otherwise you have to tell varying which columns to stack on which.
#METHOD 1 (Cool renaming; if you rename, varying is easy)
#The "([a-z])_([0-9])_([0-9])" part says: look for a lowercase letter, then "_" followed by a
#digit, then "_" followed by another digit. The "\\1_\\3\\.\\2" means: keep the letter
#and "_" in the first spot, then take the last digit (group 3) and make it
#second, then put a period and put the digit captured as group 2 third.
names(dat) <- gsub("([a-z])_([0-9])_([0-9])", "\\1_\\3\\.\\2", names(dat))
#names(dat) <- gsub("([0-9])_([0-9])$", "\\2\\.\\1", names(dat)) # another way
dat2 <- reshape(dat, direction="long", varying=2:9, timevar="set", idvar=1)
row.names(dat2) <- NULL
dat2[order(dat2$WorkerId), ]
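The renaming step in Method 1 can be checked in isolation before handing the names to reshape. A minimal sketch on a few of the column names (old and new are made-up variable names):

```r
old <- c("WorkerId", "pio_1_1", "pio_1_4", "pio_2_3")
# Swap the two digits and put a period before the set number
new <- gsub("([a-z])_([0-9])_([0-9])", "\\1_\\3\\.\\2", old)
new
# Everything before the period becomes the variable name in long format,
# everything after it becomes the value of the time (set) variable
strsplit(new[-1], ".", fixed = TRUE)
```

With names in the form item.set, reshape's default sep="." can work out the stacking on its own, which is why varying=2:9 is enough in Method 1.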
#METHOD 2 (My Method; if you rename, varying is easy)
dat <- redat #reset first, because Method 1 already renamed the columns
y <- do.call('rbind', strsplit(names(dat)[-1], "_"))[, c(1, 3, 2)]
names(dat) <- c(names(dat)[1], paste0(y[, 1], "_", y[, 2], ".", y[, 3]))
dat2 <- reshape(dat, varying=2:9, idvar = "WorkerId", direction="long",
timevar="set")
row.names(dat2) <- NULL
dat2[order(dat2$WorkerId, dat2$set), ]
#METHOD 3 (Using Reshape)
dat <- redat
library("reshape2")
reshape.middle <- function(dat) {
dat <- melt(dat, id="WorkerId")
dat$set <- substr(dat$variable, 5,5)
dat$name <- paste(substr(dat$variable, 1, 4),
substr(dat$variable, 7, 7),
sep="")
dat$variable <- NULL
dat <- melt(dat, id=c("WorkerId", "set", "name"))
dat$variable <- NULL
return(dcast(dat, WorkerId + set ~ name))
}
reshape.middle(dat)
#Without the rename You'd have to approach it this way
dat2 <- reshape(dat,
varying=list(pio_1= c(2, 6), pio_2= c(3,7), pio_3= c(4,8), pio_4= c(5,9) ),
v.names=c(paste0("pio_",1:4)),
idvar = "WorkerId",
direction="long",
timevar="set")
row.names(dat2) <- NULL
dat2[order(dat2$WorkerId, dat2$set), ]
Randomish Rows – Long Format to Wide w/ Missing Data
var<-c("Id", "Name", "Score", "Id", "Score", "Id", "Name")
num<-c(1, "Tom", 4, 2, 7, 3, "Jim")
format1<-data.frame(var, num)
format1
#STARTING DATAFRAME
#
# var num
# 1 Id 1
# 2 Name Tom
# 3 Score 4
# 4 Id 2
# 5 Score 7
# 6 Id 3
# 7 Name Jim
format1$ID <- cumsum(format1$var == "Id")
#ADD THE cumsum ID COLUMN (IMPORTANT FOR BOTH METHODS)
#
# var num ID
# 1 Id 1 1
# 2 Name Tom 1
# 3 Score 4 1
# 4 Id 2 2
# 5 Score 7 2
# 6 Id 3 3
# 7 Name Jim 3
# METHOD 1
format2 <- reshape(format1, idvar = "ID",timevar = "var", direction = "wide")[-1]
names(format2) <- gsub("num.", "", names(format2), fixed=TRUE) #fixed=TRUE so "." is a literal period
format2
#OUTCOME
#
# Id Name Score
# 1 1 Tom 4
# 4 2 <NA> 7
# 6 3 Jim <NA>
# METHOD 2
reshape(format1, idvar = "ID",timevar = "var", direction = "wide",
varying = list(c("Id", "Name", "Score")))[-1]
# METHOD 3
format1$pk <- cumsum( format1$var=="Id" )
library(reshape2)
dcast( format1, pk ~ var, value.var="num" )
Extract object names from list in a function (using both lapply and a for loop)
x <- c("yes", "no", "maybe", "no", "no", "yes")
y <- c("red", "blue", "green", "green", "orange")
list.xy <- list(x=x, y=y)
WORD.C <- function(WORDS){
require(wordcloud)
L2 <- lapply(WORDS, function(x) as.data.frame(table(x), stringsAsFactors = FALSE))
# Takes a dataframe and the text you want to display
FUN <- function(X, text){
dev.new() #open a new graphics device (windows() only works on Windows)
wordcloud(X[, 1], X[, 2], min.freq=1)
mtext(text, 3, padj=-4.5, col="red") #label the plot with the list element's name
}
# Now creates the sequence 1,...,length(L2)
# Loops over that and then create an anonymous function
# to send in the information you want to use.
lapply(seq_along(L2), function(i){FUN(L2[[i]], names(L2)[i])})
}
WORD.C2 <- function(WORDS){
require(wordcloud)
L2 <- lapply(WORDS, function(x) as.data.frame(table(x), stringsAsFactors = FALSE))
# Takes a dataframe and the text you want to display
FUN <- function(X, text){
dev.new() #open a new graphics device (windows() only works on Windows)
wordcloud(X[, 1], X[, 2], min.freq=1)
mtext(text, 3, padj=-4.5, col="red") #label the plot with the list element's name
}
# you could use i in seq_along(L2)
# instead of 1:length(L2) if you wanted to
for(i in 1:length(L2)){
FUN(L2[[i]], names(L2)[i])
}
}
WORD.C(list.xy)
WORD.C2(list.xy)
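The key point in both versions is that lapply(WORDS, ...) alone never sees the element names, so the names have to be passed in alongside the elements. A minimal sketch of the same idea without any plotting, using base R's Map(); describe, toy.xy, and out are made-up names for illustration.

```r
toy.xy <- list(x = c("yes", "no", "no"), y = c("red", "blue"))

# A function that needs both the element and its name
describe <- function(words, label) {
  paste0(label, ": ", length(unique(words)), " unique word(s)")
}
# Map() walks over the list and its names in parallel
out <- Map(describe, toy.xy, names(toy.xy))
out
```

Map() keeps the names of the list on its result, so out$x and out$y can be looked up directly.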
Working on Dataframes in Lists & Acting on Global Environment Variables
#CREATE A FAKE DATA SET
df <- data.frame(
x.2=rnorm(25),
y.2=rnorm(25),
g=rep(factor(LETTERS[1:5]), 5)
)
#Strip a Particular Column From Every data Frame in the List
LIST <- split(df, df$g) #split it into a list of data frames
NAMES <- names(LIST) #save the names of this for later use as they may be stripped
LIST <- lapply(seq_along(LIST), function(x) as.data.frame(LIST[[x]])[, 1:2])
LIST
#Change All Variable Names of Data Frames in a List
LIST <- lapply(LIST, function(x) {
names(x) <- unlist(strsplit(names(x)[1:length(names(x))],
".", fixed=T))[c(T, F)]
return(x)
}
)
LIST
#Rename All the Data Frames in the List
names(LIST) <- NAMES
LIST
#Assign Data Frames in a List to Objects in The Global Environment
lapply(seq_along(LIST),
function(x) {
assign(c("V", "W", "X", "Y", "Z")[x], LIST[[x]], envir=.GlobalEnv)
}
)
V; W #etc
#Use Global Assignment to Change All Variable Names of Data Frames in a List
lapply(seq_along(LIST), function(x) names(LIST[[x]]) <<-
unlist(strsplit(names(LIST[[x]])[1:length(names(LIST[[x]]))],
".", fixed=T))[c(T, F)]
)
LIST
#Rename All the Data Frames in the List Using Global Assignment
lapply(seq_along(LIST), function(x) {names(LIST)[[x]] <<- NAMES[x]})
LIST
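For assigning every element of a named list to its own object, base R also offers list2env(), which avoids the explicit lapply/assign loop. A minimal sketch on made-up data (LIST.toy and e are hypothetical names); a fresh environment is used here so the global workspace is left untouched.

```r
# Hypothetical named list of data frames
LIST.toy <- list(V = data.frame(x = 1:2), W = data.frame(x = 3:4))

# Assumption for this sketch: write into a new environment rather than .GlobalEnv;
# pass envir = .GlobalEnv instead to mimic the assign() version above
e <- new.env()
list2env(LIST.toy, envir = e)

ls(e)                   # the objects that were created
get("W", envir = e)     # retrieve one of them
```

The object names come straight from names(LIST.toy), so the list must be named before calling list2env().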
do.call, replicate, split
do.call (take a list, apply a function)
#the arguments passed to do.call must be in a list
mtcars2 <- as.list(mtcars)
#do.call with rbind and dataframe
do.call('rbind', mtcars2)
do.call('data.frame', mtcars2)
#to use with paste we have to pass the separator that paste takes
mtcars2$sep <- "HELLO"
do.call('paste', mtcars2)
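Classic Use of split, lapply and do.call (split by a factor(s), apply a function, put back together)
Note: consider using by and tapply as well
#row2col() and replacer() below are not base R functions; they are assumed to be user-defined helpers
#(row2col: move the row names into a named column; replacer: replace one value with another)
LIST <- split(mtcars, mtcars$cyl)
MEANS <- lapply(LIST, colMeans)
row2col(do.call('rbind', MEANS), 'cyl')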
#notice we split by two factors
LIST2 <- split(mtcars, list(mtcars$cyl, mtcars$carb))
MEANS2 <- lapply(LIST2, colMeans)
OC <- row2col(do.call('rbind', MEANS2), 'cyl.carb')
replacer(OC, NaN, NA)
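The split, lapply, do.call pattern can also be run with base R only. A minimal sketch, assuming the goal of the helper is simply to turn the group labels (the row names) into a regular column; the column name cyl.group and the object out are made up for illustration.

```r
LIST <- split(mtcars, mtcars$cyl)                 # split by a factor
MEANS <- do.call(rbind, lapply(LIST, colMeans))   # apply, then put back together
# Move the row names (the cyl groups) into a regular column
out <- data.frame(cyl.group = rownames(MEANS), MEANS)
rownames(out) <- NULL
out[, 1:3]
```

MEANS is a matrix with one row per group; wrapping it in data.frame() expands it into one column per variable next to the new label column.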
Use replicate to repeat a function over and over and then do.call('rbind', ) them together
# Create some fake data.
dat <- rnorm(200)
# Get a sample of size 5 from this without replacement
sample(dat, 5)
# Do this 10 times
replicate(10, sample(dat, 5))
#replicate finding means
replicate(10, colMeans(mtcars))
#replicate and paste a data frame
do.call('rbind', replicate(10, data.frame(a=1:10, b=letters[1:10],
c=state.name[1:10]), simplify=F))
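The shape of what replicate returns depends on simplify, which matters when the results are handed to do.call. A minimal sketch (the seed and object names are arbitrary choices for the illustration):

```r
set.seed(1)                      # make the draws reproducible
dat <- rnorm(200)

# Each replication returns 5 values, so the default result is a 5 x 10 matrix
m <- replicate(10, sample(dat, 5))
dim(m)

# simplify = FALSE keeps a list of 10 results, ready for do.call
l <- replicate(10, sample(dat, 5), simplify = FALSE)
stacked <- do.call(rbind, l)     # one replication per row: 10 x 5
dim(stacked)
```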
Data Table
Sample Stats
require(data.table)
dat <- data.table(iris)
x <- dat[,list(mean=mean(Sepal.Length), sd=sd(Sepal.Length)),by=Species]
rownamer(x)
Look Up Tables & Dictionaries
#Create a Data Set to Match
test1<-(structure(list(person = structure(1:7, .Label = c("A", "B", "C",
"D", "E", "F", "G"), class = "factor"), age = c(7L, 22L, 65L,
32L, 14L, 53L, 23L)), .Names = c("person", "age"), class = "data.frame", row.names = c(NA, -7L)))
test2<-(structure(list(Lower_limit = c(5L, 15L, 25L, 45L), Upper_limt = c(15L,
25L, 45L, 100L), support = c(10L, 20L, 30L, 40L)), .Names = c("Lower_limit",
"Upper_limt", "support"), class = "data.frame", row.names = c(NA,
-4L)))
test1; test2
# Merge (slow)
test1$sup1<-cut(test1$age, breaks=unique(unlist(test2[, 1:2])))
key <- data.frame(sup1=levels(test1$sup1), support=test2$support)
test3 <- merge(test1, key, sort=FALSE)[, -1]
test3 <- test3[order(test3$person), ]
rownames(test3) <- 1:nrow(test3)
test3
# Match
test1$sup1<-cut(test1$age, breaks=unique(unlist(test2[, 1:2])))
key <- data.frame(sup1=levels(test1$sup1), support=test2$support)
test1$support <- key[match(test1$sup1, key$sup1), 2]
test1[, -3]
# Hash Table
test1$sup1<-cut(test1$age, breaks=unique(unlist(test2[, 1:2])))
key <- data.frame(sup1=levels(test1$sup1), support=test2$support)
hash <- function(x, type = "character") {
e <- new.env(hash = TRUE, size = nrow(x), parent = emptyenv())
char <- function(col) assign(col[1], as.character(col[2]), envir = e)
num <- function(col) assign(col[1], as.numeric(col[2]), envir = e)
FUN <- if(type=="character") char else num
apply(x, 1, FUN)
return(e)
}
KEY <- hash(key, type="numeric")
type <- function(x) if(exists(x, envir = KEY)) get(x, envir = KEY) else NA
test1$support <- sapply(as.character(test1$sup1), type)
test1[, -3]
# Indexing
test1$sup1<-cut(test1$age, breaks=unique(unlist(test2[, 1:2])))
key2 <- c(test2$support);names(key2) <- levels(test1$sup1) #lookup table
transform(test1, support=key2[sup1])[, -3]
# data.table
library(data.table)
test1$sup1<-cut(test1$age, breaks=unique(unlist(test2[, 1:2])))
key <- data.frame(sup1=levels(test1$sup1), support=test2$support)
dtKEY <- data.table(key, key="sup1")
test1$support <- dtKEY[J(test1$sup1), ][[2]]
test1[, -3]
# qdap lookup (hash based)
library(qdap)
test1$sup1 <-cut(test1$age, breaks=unique(unlist(test2[, 1:2])))
key <- data.frame(sup1=levels(test1$sup1), support=test2$support)
test1$support <- lookup(cut(test1$age, c(5, 15, 25, 45, 100)), key)
test1 is the data frame; test2 is the dictionary (lookup table).
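One more base R option, offered as a sketch rather than as part of the methods listed above: findInterval returns the index of the interval each age falls in, which can index the support values directly. The vectors below retype the ages and limits from test1 and test2; the names idx and looked_up are made up.

```r
# Same ages, lower limits and support values as test1 and test2, typed in directly
age     <- c(7L, 22L, 65L, 32L, 14L, 53L, 23L)
lower   <- c(5L, 15L, 25L, 45L)
support <- c(10L, 20L, 30L, 40L)

# left.open = TRUE mimics cut()'s default (lower, upper] intervals
# Caveat: an age at or below the first limit gives index 0, and the top
# limit (100) is not enforced here, so check the range first on real data
idx <- findInterval(age, lower, left.open = TRUE)
looked_up <- support[idx]
looked_up
```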
ggplot2
Globally Alter Background Color
theme_set(theme_bw())
Globally Reset Background Color
theme_set(theme_gray())
White Background Gray Grid
+ theme(panel.grid.major = element_blank())
White Background No Grid
+ theme_bw() + theme(panel.grid.major=element_blank(),panel.grid.minor=element_blank())
library(ggplot2)
x <- ggplot(CO2, aes(x=uptake, group=Plant))
y <- x + geom_density(aes(colour=Plant)) + facet_grid(Type~Treatment)
y + theme_bw() + theme(panel.grid.major=element_blank(),panel.grid.minor=element_blank())
Change Background Color
+ theme(panel.background = element_rect(fill='green', colour='red'))
Change Margins Color
+ theme(plot.background = element_rect(fill='green', colour='red'))
theme_new <- theme_update(
panel.background = element_rect(fill="gray20")
)
new <- theme_set(theme_new)
theme_set(new)
ggplot(data = mtcars, aes(mpg, wt)) + geom_point(colour="yellow")
theme_new <- theme_update(
panel.background = element_rect(fill="red")
)
new <- theme_set(theme_new)
theme_set(new)
ggplot(data = mtcars, aes(mpg, wt)) + geom_point(colour="yellow")
theme_set(theme_gray())
ggplot(data = mtcars, aes(mpg, wt)) + geom_point(colour="yellow") +
theme_new #not global change
Change ggplot2 Palette and Apply to Individuals
+ scale_colour_identity()
First create a color palette from hexadecimal color codes. Then assign those colors to groups or observations and add + scale_colour_identity() to the plot.
Change Color Saturation (chroma) and Luminance
scale_fill_hue(h = c(0, 360) + 15, l = 65, c = 100)
ggplot(df, aes(x=cond, y=yval, fill=cond)) + geom_bar(stat="identity") + scale_fill_hue(c=15, l=10)
ggplot(df, aes(x=cond, y=yval, fill=cond)) + geom_bar(stat="identity") + scale_fill_hue(c=85, l=10)
ggplot(df, aes(x=cond, y=yval, fill=cond)) + geom_bar(stat="identity") + scale_fill_hue(c=85, l=90)
col <- c("#000000", "#FF0000",
"#003300")
mtcars$col <- col[1]
mtcars$col[5:6] <- col[2:3]
p <- ggplot(mtcars, aes(x=wt, y=mpg,
label=rownames(mtcars)))
p + geom_text(data=mtcars,
aes(colour=col), size=2) + scale_colour_identity()
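Because scale_colour_identity() uses the color strings exactly as given, a malformed hex code will break the plot. A quick base R check like the sketch below can catch that before plotting. The vector cols and the pattern are assumptions for the illustration; the pattern covers only the six-digit #RRGGBB form, not named colors or eight-digit codes with alpha.

```r
# Candidate palette: the third entry is deliberately malformed
cols <- c("#000000", "#FF0000", "0033000", "#003300")

# TRUE where the string is '#' followed by exactly six hex digits
ok <- grepl("^#[0-9A-Fa-f]{6}$", cols)
ok

# Entries that need fixing
cols[!ok]
```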
Map a Numeric Variable on a Color Continuum
+ scale_colour_gradient(low='col1', high='col2')
Change Color of Factors and Reorder Legend
+ scale_colour_manual(values = cols)
#Examples
gradient_rb <- scale_colour_gradient(low='blue', high='red')
p <- ggplot(mtcars, aes(x=wt, y=mpg,
label=rownames(mtcars)))
p + geom_text(data=mtcars,
aes(colour=mpg), size=3)+ gradient_rb
#example 2
p + geom_point(data=mtcars,
aes(colour=mpg), size=2)+ gradient_rb
#Examples
p <- ggplot(mtcars, aes(x=wt, y=mpg,
label=rownames(mtcars)))
w <- p + geom_point(data=mtcars,
aes(colour=factor(cyl)), size=3) #manual colour scales need a discrete variable
w
w + scale_colour_manual(values = c("red","blue", "green"))
w + scale_colour_manual( #specify who takes what color
values = c("8" = "red","4" = "blue","6" = "green"))
cols <- c("8" = "red","4" = "blue","6" = "darkgreen", "10" = "orange")
w + scale_colour_manual(values = cols)
#breaks controls which levels appear in the legend and in what order (colors come from values)
w + scale_colour_manual(values = cols, breaks = c("4", "6", "8"))
w + scale_colour_manual(values = cols, breaks = c("8", "6", "4"))
w + scale_colour_manual(values = cols, breaks = c("4", "6", "8"),
labels = c("four", "six", "eight"))
#plot just some of the groups (6 cyl not plotted)
w + scale_colour_manual(values = cols, limits = c("4", "8"))
Change Color Palette for bar graphs (items you fill in)
+ scale_fill_manual()
Adjust Transparency (An argument to many geoms)
alpha=
Symbols and Color Fills
symbols 21:25 are fillable
#EXAMPLE
library(ggplot2)
cbbFillPalette <- scale_fill_manual(values=c("#000000", "#E69F00", "#56B4E9"))
cbbFillPalette2 <- scale_fill_manual(values=c("red", "blue", "brown"))
mtcars$cyl <- as.factor(mtcars$cyl) #make cylinder a factor
ggplot(mtcars, aes( x=cyl, fill=cyl)) + geom_bar() + cbbFillPalette
ggplot(mtcars, aes( x=cyl, fill=cyl)) + geom_bar() + cbbFillPalette2
#EXAMPLES
library(ggplot2)
#EX1
x <- ggplot(mtcars, aes(factor(cyl)))
x + geom_bar(fill = "dark grey", colour = "black", alpha = 1/3)
#EX2
df <- data.frame(x = rnorm(5000), y = rnorm(5000))
h <- ggplot(df, aes(x,y))
h + geom_point(alpha = 0.5)
h + geom_point(alpha = 1/10)
x <- ggplot(mtcars, aes(x=hp, y=mpg))
x + geom_point(shape = 21, size = 4,
colour = "red", fill = "black")
df2 <- data.frame(x = 1:5 , y = 1:25,
z = 1:25)
s <- ggplot(df2, aes(x = x, y = y))
s + geom_point(aes(shape = z), size = 4,
colour = "red", fill = "black") +
scale_shape_identity()
Fill By 2 or More Combined Variables
library(ggplot2); library(RColorBrewer)
dat <- data.frame(category = c("A","A","B","B","C","C","D","D"),
variable = c("inclusion","exclusion","inclusion","exclusion",
"inclusion", "exclusion","inclusion","exclusion"),
value = c(60,20,20,80,50,55,25,20))
#FILL BY 1 VARIABLE
colors <- c("#FF0000","#990000")
ggplot(dat, aes(category, value, fill = variable)) +
geom_bar(stat = "identity")+ scale_fill_manual(values = colors)
#FILL BY 2 VARIABLES
#create a combined variable (base R stand-in for paste2(dat[, 1:2], sep=" "))
dat$grp <- paste(dat$category, dat$variable, sep=" ")
ggplot(dat, aes(category, value, fill = grp)) +
geom_bar(stat = "identity")+
scale_fill_manual(values = brewer.pal(8,"Reds"))
Annotations and Text
Correct Approach to Plotting Annotations (not found in original dataframe)
Create a separate data frame with the text and locations and pass that data frame to geom_text.
#Original data frame
data2 <- read.table(text= "type value time year
1 NA* 0.90 3 2008
3 EDS 0.01 3 2008
4 KIU 0.01 3 2008
5 MVH 0.09 3 2008
6 LAK 0.00 3 2008
7 NA* 0.80 6 2007
9 EDS 0.05 6 2007
10 KIU 0.00 6 2007
11 MVH 0.15 6 2007
12 LAK 0.00 6 2007
13 NA* 0.41 15 2007
15 EDS 0.04 15 2007
16 KIU 0.03 15 2007
17 MVH 0.52 15 2007
18 LAK 0.00 15 2007
19 NA* 0.23 27 2006
21 EDS 0.11 27 2006
22 KIU 0.02 27 2006
23 MVH 0.64 27 2006
24 LAK 0.01 27 2006", header=T)
#create separate text data frame
data2.labels <- data.frame(
time = c(7, 15),
value = c(.9, .6),
label = c("correct color", "another correct color!"),
type = c("NA*", "MVH")
)
ggplot(data2, aes(x=time, y=value, group=type, col=type))+
geom_line()+
geom_point()+
theme_bw() +
#pass the new data frame to geom_text so it doesn't print 1000x
geom_text(data = data2.labels, aes(x = time, y = value, label = label))
Grid Letters and Greek Text
Separate the words with the tilde (~) symbol.
d <- data.frame(x=1:3,y=1:3)
qplot(x, y, data=d) +
geom_text(aes(2, 2, label="rho~and~some~other~text"), parse=TRUE)
Adjust Size Difference Ratio
+ scale_size(range = c(x, y))
+ scale_size_continuous(range = c(x, y))
Change Aspect Ratio Of the Plot Region
+ coord_equal(ratio = 5)
qplot(mpg, wt, data = mtcars) + coord_equal(ratio = 5)
qplot(mpg, wt, data = mtcars) + coord_equal(ratio = 1)
qplot(mpg, wt, data = mtcars) + coord_equal(ratio = 1/5)
p <- ggplot(mtcars, aes(hp, as.factor(cyl))) +
geom_point(aes(size=mpg))
p
p + scale_size(range = c(2, 10))
p + scale_size_continuous(range = c(3,8))
p + scale_size_continuous(range = c(.05,15))
Add a title (sggplot title, ggplot title)
+ ggtitle("Title text")
+ labs(title="Title text")
Legends
Legend Manipulation
+ guides()
library(reshape2) # for melt
df <- melt(outer(1:4, 1:4), varnames = c("X1", "X2"))
p1 <- ggplot(df, aes(X1, X2)) + geom_tile(aes(fill = value))
# Basic form
p1 + scale_fill_continuous(guide = "legend")
p1 + scale_fill_continuous(guide = guide_legend())
# Guide title
p1 + scale_fill_continuous(guide = guide_legend(title = "V")) # title text
p1 + scale_fill_continuous(name = "V") # same
p1 + scale_fill_continuous(guide = guide_legend(title = NULL)) # no title
# Control styles
# key size
p1 + guides(fill = guide_legend(keywidth = 3, keyheight = 1))
# title position
p1 + guides(fill = guide_legend(title = "LEFT", title.position = "left"))
# title text styles via element_text
p1 + guides(fill = guide_legend(
title.theme = element_text(size=15, face="italic", colour="red", angle=45)))
p1 + guides(fill = guide_legend(label.position = "bottom"))
# label styles
p1 + scale_fill_continuous(breaks = c(5, 10, 15),
labels = paste("long", c(5, 10, 15)),
guide = guide_legend(direction = "horizontal", title.position = "top",
label.position="bottom", label.hjust = 0.5, label.vjust = 0.5,
label.theme = element_text(angle = 90)))
# Set aesthetic of legend key
# very low alpha value make it difficult to see legend key
p3 <- qplot(carat, price, data = diamonds, colour = color,
alpha = I(1/100))
p3
# override.aes overwrites the alpha
p3 + guides(colour = guide_legend(override.aes = list(alpha = 1)))
# multiple row/col legends
p <- qplot(1:20, 1:20, colour = letters[1:20])
p + guides(col = guide_legend(nrow = 8))
p + guides(col = guide_legend(ncol = 8))
p + guides(col = guide_legend(nrow = 8, byrow = TRUE))
p + guides(col = guide_legend(ncol = 8, byrow = TRUE))
# reversed order legend
p + guides(col = guide_legend(reverse = TRUE))
sggplot legends sggplot2 legends
Change Legend Title
+ labs(shape, colour, fill, linetype, etc.)
Change Legend Position
+ theme(legend.position = 'left') #directional input
+ theme(legend.position = c(0.5, 0.5)) #coordinate input
Eliminate Legend
+ theme(legend.position = "none")
library(ggplot2)
data(iris)
ggplot(iris, aes(x=Sepal.Length, y=Sepal.Width)) +
geom_point(aes(shape=Species, colour=Petal.Width)) +
scale_colour_gradient() +
labs(shape="Species label", colour="Petal width label")
library(ggplot2)
xy <- data.frame(x=1:10, y=10:1, type = rep(LETTERS[1:2], each=5))
plot <- ggplot(data = xy)+ geom_point(aes(x = x, y = y, color=type))
plot
plot + theme(legend.position = 'left')
plot + theme(legend.position = 'bottom')
plot + theme(legend.position = c(0.5, 0.5))
plot + theme(legend.position = c(0.9, 0.9))
Share Legend
library(ggplot2); library(grid); library(gridExtra)
#subset with == (cyl = 4 would keep every row)
p1 <- ggplot(subset(mtcars, cyl == 4), aes(wt, cyl, colour = mpg)) +
geom_point()
p2 <- ggplot(subset(mtcars, cyl == 8), aes(wt, hp, colour = mpg)) +
geom_point() + guides(colour=FALSE)
grid.draw(cbind(ggplotGrob(p2), ggplotGrob(p1), size="last"))
## Make a tableGrob of your legend (taken from p1, since p2's legend is turned off)
tmp <- ggplot_gtable(ggplot_build(p1))
leg <- which(sapply(tmp$grobs, function(x) x$name) == "guide-box")
legend <- tmp$grobs[[leg]]
# Plot objects using widths and height and respect to fix aspect ratios
# We make a grid layout with 3 columns, one each for the plots and one for the legend
grid.newpage()
pushViewport( viewport( layout = grid.layout( 1 , 3 , widths = unit( c( 0.4 , 0.4 ,
0.2 ) , "npc" ) ,heights = unit( c( 0.45 , 0.45 , 0.45 ) , "npc" ) , respect =
matrix(rep(1,3),1) ) ) )
print( p1 + theme(legend.position="none") , vp = viewport( layout.pos.row = 1 ,
layout.pos.col = 1 ) )
print( p2 + theme(legend.position="none") , vp = viewport( layout.pos.row = 1,
layout.pos.col = 2 ) )
upViewport(0)
vp3 <- viewport( width = unit(0.2,"npc") , x = 0.9 , y = 0.5)
pushViewport(vp3)
grid.draw(legend)
popViewport()
Continuous Legend
+ guides(fill = guide_colorbar())
Reverse Order Legend
+ guides(fill = guide_legend(reverse = TRUE))
Change Legend Symbols
library(grid)
grid.gedit("^key-[-0-9]+$", label = "NEW_SYMBOL")
library(reshape2) # for melt
df <- melt(outer(1:4, 1:4), varnames = c("X1", "X2"))
p1 <- ggplot(df, aes(X1, X2)) + geom_tile(aes(fill = value))
p1 + guides(fill = guide_colorbar(barwidth = 0.5, barheight = 10))
p1 + guides(fill = guide_colorbar(label = FALSE))
p1 + guides(fill = guide_colorbar(ticks = FALSE))
p1 + guides(fill = guide_colorbar(label.position = "left"))
p1 + guides(fill = guide_colorbar(label.theme = element_text(colour="blue", angle=0)))
p1 + scale_fill_continuous(limits = c(0,20), breaks=c(0, 5, 10, 15, 20),
guide = guide_colorbar(nbin=100, draw.ulim = FALSE, draw.llim = FALSE))
p1 + guides(fill = guide_colorbar(direction = "horizontal",
label.theme = element_text(colour="blue", angle=0)))
#EXAMPLE
library(ggplot2)
p <- ggplot(diamonds, aes(clarity, fill=cut)) + geom_bar()
p
p + guides(fill = guide_legend(reverse = TRUE))
#EXAMPLE
df <- expand.grid(x = factor(seq(1:5)), y =
factor(seq(1:5)), KEEP.OUT.ATTRS = FALSE)
df$Count <- seq(1:25)
# A plot
library(ggplot2)
p <- ggplot(data = df, aes( x = x, y = y,
label = Count, size = Count)) +
geom_text() +
scale_size(range = c(2, 10))
p
library(grid)
grid.gedit("^key-[-0-9]+$", label = ":)")
Custom Legend
library(ggplot2)
df <- data.frame(gp = factor(rep(letters[1:3], each = 10)), y = rnorm(30))
library(plyr)
ds <- ddply(df, .(gp), summarise, mean = mean(y), sd = sd(y))
ggplot(df, aes(x = gp, y = y)) +
geom_point(aes(colour="data")) +
geom_point(data = ds, aes(y = mean, colour = "mean"), size = 3) +
scale_colour_manual("Legend", values=c("mean"="red", "data"="black"))
library(reshape2)
# in long format
dsl <- melt(ds, value.name = 'y')
# add variable column to df data.frame
df[['variable']] <- 'data'
# combine
all_data <- rbind(df,dsl)
# drop sd rows
data_w_mean <- subset(all_data,variable != 'sd',drop = T)
# create vectors for use with scale_..._manual
colour_scales <- setNames(c('black','red'),c('data','mean'))
size_scales <- setNames(c(1,3),c('data','mean') )
ggplot(data_w_mean, aes(x = gp, y = y)) +
geom_point(aes(colour = variable, size = variable)) +
scale_colour_manual(name = 'Type', values = colour_scales) +
scale_size_manual(name = 'Type', values = size_scales)
dsl_mean <- subset(dsl,variable != 'sd',drop = T)
ggplot(df, aes(x = gp, y = y, colour = variable, size = variable)) +
geom_point() +
geom_point(data = dsl_mean) +
scale_colour_manual(name = 'Type', values = colour_scales) +
scale_size_manual(name = 'Type', values = size_scales)
Remove Diagonal Lines
show.legend=FALSE (show_guide=FALSE in older ggplot2 versions)
ggplot(mtcars, aes(factor(cyl), fill=am, group=am)) +
geom_bar(colour="black")
ggplot(mtcars, aes(factor(cyl), fill=am, group=am)) +
geom_bar() + geom_bar(colour="black", show.legend=FALSE)
Eliminate Vertical/Horizontal Grid Lines
#MAKE A DATA SET
library(ggplot2); set.seed(10)
CO3 <- data.frame(id=1:nrow(CO2), CO2[, 2:3],
outcome=factor(sample(c('none', 'some', 'lots', 'tons'),
nrow(CO2), replace=TRUE), levels=c('none', 'some', 'lots', 'tons')))
x <- ggplot(CO3, aes(x=outcome)) + geom_bar(aes(x=outcome))+
facet_grid(Treatment~Type, margins='Treatment', scales='free') +
theme_bw() + theme(axis.text.x=element_text(angle= 45, vjust=1, hjust= 1))
#REMOVE LINES
x + theme(panel.grid.major.y = element_blank(), panel.grid.minor.y = element_blank())
x + theme(panel.grid.major.x = element_blank(), panel.grid.minor.x = element_blank())
Equal distance between bars
df <- structure(list(ID = structure(c(1L, 1L, 1L, 1L, 1L, 2L, 2L, 2L,
2L, 2L, 2L, 3L, 3L, 3L, 3L, 3L, 3L, 4L, 4L, 4L, 4L, 4L, 4L, 5L,
5L, 5L, 5L, 5L, 5L, 6L, 6L, 6L, 6L, 6L, 6L, 7L, 7L, 7L), .Label = c("1",
"2", "3", "4", "5", "6", "7"), class = "factor"), TYPE = structure(c(1L,
2L, 3L, 4L, 5L, 1L, 2L, 3L, 4L, 5L, 6L, 1L, 2L, 3L, 4L, 5L, 6L,
1L, 2L, 3L, 4L, 5L, 6L, 1L, 2L, 3L, 4L, 5L, 6L, 1L, 2L, 3L, 4L,
5L, 6L, 1L, 2L, 3L), .Label = c("1", "2", "3", "4", "5", "6",
"7", "8"), class = "factor"), TIME = structure(c(2L, 2L, 2L,
2L, 2L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 2L, 2L,
2L, 2L, 2L, 2L, 1L, 1L, 1L, 1L, 1L, 1L, 2L, 2L, 2L, 2L, 2L, 2L,
1L, 1L, 1L), .Label = c("1", "5", "15"), class = "factor"), VAL = c(0.94,
0.52, 0.28, 0.97, 0.12, 0.05, 0.47, 0.62, 0.2, 0.73, 1, 0.98,
0.67, 0.29, 0.17, 0.86, 0.17, 0.83, 0.62, 0.79, 0.76, 0.43, 0.61,
0.18, 0.53, 0.49, 0.47, 0.07, 0.7, 0.23, 0.36, 0.52, 0.26, 0.15,
0.01, 0.46, 0.92, 0.23), w = c(0.675, 0.675, 0.675, 0.675, 0.675,
0.9, 0.9, 0.9, 0.9, 0.9, 0.9, 0.9, 0.9, 0.9, 0.9, 0.9, 0.9, 0.675,
0.675, 0.675, 0.675, 0.675, 0.675, 0.9, 0.9, 0.9, 0.9, 0.9, 0.9,
0.675, 0.675, 0.675, 0.675, 0.675, 0.675, 0.9, 0.9, 0.9)), .Names = c("ID",
"TYPE", "TIME", "VAL", "w"), row.names = c(NA, -38L), class = "data.frame")
ggplot(df, aes(x=ID, y=VAL, fill=TYPE)) +
facet_wrap(~TIME, ncol=1) +
geom_bar(position="stack",stat = "identity") +
coord_flip()
ggplot(df, aes(x=ID, y=VAL, fill=TYPE)) +
facet_wrap(~TIME, ncol=1, scales="free") +
geom_bar(position="stack",stat = "identity") +
coord_flip()
df$w <- 0.9
df$w[df$TIME == 5] <- 0.9 * 3/4
ggplot(df, aes(x=ID, y=VAL, fill=TYPE)) +
facet_wrap(~TIME, ncol=1, scales="free") +
geom_bar(position="stack",aes(width = w),stat = "identity") +
coord_flip()
Faceting
Faceted Plot
library(ggplot2)
qplot(mpg, wt, data=mtcars) + facet_grid(cyl ~ vs)
Faceted Plot Margins (including plotting just one margin)
library(ggplot2)
qplot(mpg, wt, data=mtcars) + facet_grid(cyl ~ vs, margins=TRUE)
qplot(mpg, wt, data=mtcars) + facet_grid(cyl ~ vs, margins='vs')
qplot(mpg, wt, data=mtcars) + facet_grid(cyl ~ vs, margins='cyl')
Facet Labels on Top
library(ggplot2)
+ facet_wrap(~Species, ncol=1)
+ facet_wrap(~Species,nrow = 3)
Change Facet Labels
#Data Source and plot
library(ggplot2); library(directlabels)
x <- ggplot(CO2, aes(x=uptake, group=Plant))
y <- x + geom_density(aes(colour=Plant))
y + facet_grid(Type~Treatment)
#method 1 Does not alter data
mf_labeller <- function(var, value){
value <- as.character(value)
if (var=="Treatment") {
value[value=="nonchilled"] <- "Var 1"
value[value=="chilled"] <- "Var 2"
}
return(value)
}
y + facet_grid(Type~Treatment, labeller=mf_labeller)
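Since a labeller written this way is an ordinary function, its relabelling logic can be checked on its own before it is handed to facet_grid. The sketch below repeats the function under a hypothetical name (demo_labeller) and calls it directly, with no ggplot2 needed. Newer ggplot2 releases changed the labeller interface, so treat this as an illustration of the relabelling step only.

```r
demo_labeller <- function(var, value) {
  value <- as.character(value)
  if (var == "Treatment") {
    value[value == "nonchilled"] <- "Var 1"
    value[value == "chilled"]    <- "Var 2"
  }
  value
}

# Relabelled when the facet variable is Treatment
demo_labeller("Treatment", factor(c("nonchilled", "chilled")))

# Left untouched for any other facet variable
demo_labeller("Type", factor(c("Quebec", "Mississippi")))
```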
#method 2 Faster but does alter data
levels(CO2$Treatment) <- c("Var 1", "Var 2")
library(ggplot2); library(directlabels)
x <- ggplot(CO2, aes(x=uptake, group=Plant))
y <- x + geom_density(aes(colour=Plant))
y + facet_grid(Type~Treatment)
library(ggplot2)
ggplot(iris, aes(Petal.Length)) + stat_bin() + facet_grid(Species ~ .)
ggplot(iris, aes(Petal.Length)) + stat_bin() + facet_wrap(~Species,nrow = 3)
ggplot(iris, aes(Petal.Length)) + stat_bin() + facet_wrap(~Species, ncol=1)
ggplot(iris, aes(Petal.Length)) + stat_bin() + facet_wrap(~Species,nrow = 4)
Change (all) facet_grid
label_renamemargin_gen <- function(newname="Total") {
function(variable, value) {
value <- as.character(value)
value[value == "(all)"] <- newname
value
}
}
ggplot(mtcars, aes(cyl)) +
geom_point(stat="bin", size = 2,
aes(shape = gear, position = "stack")) +
facet_grid(carb ~ gear, margins = TRUE,
labeller=label_renamemargin_gen("Total"))
Adjust Facet Labels and Boxes
library(ggplot2)
x <- ggplot(CO2, aes(x=uptake, group=Plant))
x + geom_density(aes(colour=Plant)) +
facet_grid(Type~Treatment)+
theme(strip.text.x = element_text(size=8, angle=75),
strip.text.y = element_text(size=12, face="bold"),
strip.background = element_rect(colour="red", fill="#CCCCFF"))
Eliminate Background Color and Maintain Facet Boxes
+ theme_bw()
ggplot(CO2, aes(conc)) + geom_density() +
facet_grid(Type~Treatment) +
theme(panel.background = element_blank())
#basically don't use panel.background for this
ggplot(CO2, aes(conc)) + geom_density() +
facet_grid(Type~Treatment) +
#theme(panel.background = element_blank()) +
theme_bw()
Annotate one box in facet_grid
library(ggplot2)
p <- ggplot(mtcars, aes(mpg, wt)) + geom_point()
p <- p + facet_grid(. ~ cyl)
#create a new data frame with the info
ann_text <- data.frame(mpg = 15,wt = 5,lab = "Text",
cyl = factor(8,levels = c("4","6","8")))
p + geom_text(data = ann_text,label = "Text")
Annotate every box in facet_grid
#make a few numeric into factors
mtcars[, c("cyl", "am", "gear")] <- lapply(mtcars[, c("cyl", "am", "gear")], as.factor)
#plot it with no annotations
p <- ggplot(mtcars, aes(mpg, wt, group = cyl)) +
geom_line(aes(color=cyl)) +
geom_point(aes(shape=cyl)) +
facet_grid(gear ~ am) +
theme_bw()
p
#find number of facets
len <- length(levels(mtcars$gear)) * length(levels(mtcars$am))
#make a data frame with coordinates, facet variable levels, labels
vars <- data.frame(expand.grid(levels(mtcars$gear), levels(mtcars$am)))
colnames(vars) <- c("gear", "am")
dat <- data.frame(x = rep(15, len), y = rep(5, len), vars,
labs=LETTERS[1:len])
#use geom_text to annotate (notice group=NULL)
p + geom_text(aes(x, y, label=labs, group=NULL),data=dat)
#change just one location
dat[1, 1:2] <- c(30, 2) #to change specific locations
p + geom_text(aes(x, y, label=labs, group=NULL), data=dat)
#use math plotting
p + geom_text(aes(x, y, label=paste("beta ==", labs),
group=NULL), size = 4, color = "grey50", data=dat, parse = T)
x y gear am labs
1 15 5 3 0 A
2 15 5 4 0 B
3 15 5 5 0 C
4 15 5 3 1 D
5 15 5 4 1 E
6 15 5 5 1 F
Axis Adjustments
Eliminate Space at Bottom of Barplots
+ scale_y_continuous(expand = c(0,0))
Reverse Axis (sflip axis)
+ coord_flip()
#EXAMPLE
qplot(1:10, geom = 'bar')
qplot(1:10, geom = 'bar') + scale_y_continuous(expand = c(0,0))
#Examples
qplot(cut, price, data=diamonds, geom="boxplot")
last_plot() + coord_flip()
qplot(cut, data=diamonds, geom="bar")
last_plot() + coord_flip()
Adjust Axis Labels
+ theme(axis.title.x = element_text(vjust=-0.5)) #vertical
+ theme(axis.title.x = element_text(hjust=0.25)) #horizontal
Axis Labels Names
+ labs(x = "x", y = "y") OR + xlab("x") + ylab("y")
p <- qplot(mpg, wt, data = mtcars)
p
p + xlab("Miles per Gallon") + ylab("Vehicle Weight")
# Or
p + labs(x = "Miles per Gallon", y = "Vehicle Weight")
Dendrograms with ggplot
library(ggplot2)
library(ggdendro)
data(mtcars)
x <- as.matrix(scale(mtcars))
dd.row <- as.dendrogram(hclust(dist(t(x))))
ddata_x <- dendro_data(dd.row)
p <- ggplot(segment(ddata_x)) +
geom_segment(aes(x=x, y=y, xend=xend, yend=yend)) +
scale_y_continuous(trans = 'reverse')
p + geom_text(data=label(ddata_x),
aes(label=label, x=x, y=0), hjust=0) +
coord_flip()
Initial Between Variable Data Visualization scatterplot matrix
library(ggplot2)
library(GGally)
ggpairs(iris, colour='Species', alpha=0.4)
ggpairs(CO2, colour ='Type', alpha=0.4)
mtcars$cyl <- factor(mtcars$cyl)
ggpairs(mtcars, colour ='cyl', alpha=0.4)
Nice reference to this:
http://stackoverflow.com/questions/8112208/how-can-i-obtain-an-unbalanced-grid-of-ggplots
Combine Two Plots (even faceted plots)
library(gridExtra)
grid.arrange(plot.1, ..., plot.n)
library(ggplot2)
p1 <- ggplot(mtcars[mtcars$cyl!=8,], aes(mpg, wt))+
geom_point()+
facet_wrap( ~ cyl)
p2 <- ggplot(mtcars[mtcars$cyl!=8,], aes(mpg, wt))+
geom_point()+
facet_grid(am ~ cyl)+
theme( axis.text.y = element_blank(),
axis.text.x = element_blank(),
axis.title.y = element_blank(),
axis.ticks = element_blank(),
#strip.background = element_blank(),
strip.text.x = element_blank())
library(gridExtra)
#top= is the title argument in gridExtra >= 2.0 (older versions used main=)
grid.arrange(p1,p2,
top ="this is a title", left =
"This is my global Y-axis title")
Add a table to a grid plot #1
+ annotation_custom(grob, xmin = -Inf, xmax = Inf, ymin = -Inf, ymax = Inf)
Add a table to a grid plot (can't superimpose)
library(gridExtra)
tableGrob()
?tableGrob
Add Table to Plot (control widths)
my_hist<-ggplot(diamonds, aes(clarity, fill=cut)) + geom_bar()
my_table<- tableGrob(head(diamonds)[,1:3],
gpar.coretext = gpar(fontsize=8),gpar.coltext=gpar(fontsize=8),
gpar.rowtext=gpar(fontsize=8))
grid.arrange(my_hist,my_table, ncol=2)
grid.arrange(my_hist,my_table, ncol=2, widths=c(.7, .3))
Add a table right below a legend
my_hist<-ggplot(diamonds, aes(clarity, fill=cut)) + geom_bar()
#create inset table
my_table<- tableGrob(head(diamonds)[,1:3],
gpar.coretext =gpar(fontsize=8), gpar.coltext=gpar(fontsize=8),
gpar.rowtext=gpar(fontsize=8))
#Extract Legend
g_legend<-function(a.gplot){
tmp <- ggplot_gtable(ggplot_build(a.gplot))
leg <- which(sapply(tmp$grobs, function(x) x$name) == "guide-box")
legend <- tmp$grobs[[leg]]
return(legend)}
legend <- g_legend(my_hist)
#Create the viewports, push them, draw and go up
grid.newpage()
vp1 <- viewport(width = 0.75, height = 1, x = 0.375, y = .5)
vpleg <- viewport(width = 0.25, height = 0.5, x = 0.85, y = 0.75)
subvp <- viewport(width = 0.3, height = 0.3, x = 0.85, y = 0.25)
print(my_hist + theme(legend.position = "none"), vp = vp1)
upViewport(0)
pushViewport(vpleg)
grid.draw(legend)
#Make the new viewport active and draw
upViewport(0)
pushViewport(subvp)
grid.draw(my_table)
Add text to a bar plot
EXAMPLE 1: Above Bars
library(ggplot2)
mtcars2 <- data.frame(id=1:nrow(mtcars), mtcars[, c(2, 8:11)])
mtcars2[, -1] <- lapply(mtcars2[, -1], as.factor)
with(mtcars2, ftable(cyl, gear, am)) #USE FOR FREQUENCY COUNTS OF ANY VARIABLE
ggplot(mtcars2, aes(x=cyl)) + geom_bar() +
facet_grid(gear~am) + stat_bin(geom="text", aes(label=..count.., vjust=-1))
EXAMPLE 2: On Stacked Bar
Year <- c(rep(c("2006-07", "2007-08", "2008-09", "2009-10"), each = 4))
Category <- c(rep(c("A", "B", "C", "D"), times = 4))
Frequency <- c(168, 259, 226, 340, 216, 431, 319, 368, 423, 645, 234,
685, 166, 467, 274, 251)
Data <- data.frame(Year, Category, Frequency)
library(ggplot2)
p <- qplot(Year, Frequency, data = Data, geom = "bar", fill = Category,
theme_set(theme_bw()))
p + geom_text(aes(label = Frequency), size = 3, hjust = 0.5, vjust = 3,
position = "stack")
EXAMPLE 3: Centered on Stacked Bar
Year <- c(rep(c("2006-07", "2007-08", "2008-09", "2009-10"), each = 4))
Category <- c(rep(c("A", "B", "C", "D"), times = 4))
Frequency <- c(168, 259, 226, 340, 216, 431, 319, 368, 423, 645,
234, 685, 166, 467, 274, 251)
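Data <- data.frame(Year, Category, Frequency)
#label height: midpoint of each segment within its Year stack
#(assumes segments stack in row order; newer ggplot2 stacks in reverse,
#where position_stack(vjust = 0.5) in geom_text is the simpler route)
Data$pos <- ave(Data$Frequency, Data$Year, FUN = function(f) cumsum(f) - 0.5 * f)
library(ggplot2)
ggplot(Data, aes(x = Year, y = Frequency)) +
geom_bar(aes(fill = Category), stat = "identity") +
geom_text(aes(label = Frequency, y = pos), size = 3)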
Add text to Barplots (negative and positive values)
library(plyr);library(ggplot2);library(scales)
dtf <- data.frame(x = c("ETB", "PMA", "PER", "KON", "TRA",
"DDR", "BUM", "MAT", "HED", "EXP"), y = c(.02, .11,
-.01, -.03, -.03, .02, .1, -.01, -.02, 0.06))
ggplot(dtf, aes(x, y)) +
geom_bar(stat = "identity", aes(fill = x), show.legend = FALSE) +
geom_text(aes(label = paste(y * 100, "%"),
vjust = ifelse(y >= 0, -.2, 1.1))) +
scale_y_continuous("Anteil in Prozent",
labels = percent_format()) +
theme(axis.title.x = element_blank())
stat_summary [ggplot2]
Alter boxplot ends
library(ggplot2)
data(mpg)
#Create a function to calculate the points
get_tails <- function(x) {
q1 = quantile(x)[2]
q3 = quantile(x)[4]
iqr = q3 -q1
upper = q3+1.5*iqr
lower = q1-1.5*iqr
##Trim upper and lower
up = max(x[x < upper])
lo = min(x[x > lower])
return(c(lo, up))
}
ggplot(mpg, aes(x=drv,y=hwy)) + geom_boxplot() +
stat_summary(geom="point", fun.y= get_tails, colour="Red",
shape=3, size=5)
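As a quick sanity check on get_tails outside of any plot, the sketch below runs it on a small made-up vector and compares the result with the whisker ends that base R's boxplot.stats reports. The two use slightly different quartile definitions (quantile versus hinges), so they need not agree on every data set; they do for this example vector.

```r
# Same function as above, repeated so this block runs on its own
get_tails <- function(x) {
  q1 <- quantile(x)[2]
  q3 <- quantile(x)[4]
  iqr <- q3 - q1
  upper <- q3 + 1.5 * iqr
  lower <- q1 - 1.5 * iqr
  # most extreme observations still inside the fences
  up <- max(x[x < upper])
  lo <- min(x[x > lower])
  c(lo, up)
}

x <- c(1:10, 50)                 # 50 is an obvious outlier
get_tails(x)
boxplot.stats(x)$stats[c(1, 5)]  # whisker ends from base R
```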
Add Colored Rectangles in Background
geom_rect() #EXAMPLE
scores <- data.frame(category = 1:4,
percentage = c(34,62,41,44), type = c("a","a","a","b"))
rects <- data.frame(ystart = c(0,25,45,65,85),
yend = c(25,45,65,85,100), #the y values to stop and start coloring
col = c("Z1","Z2","Z3","Z4","Z5")) #the "grouping" variable to color on
labels <- c("ER", "OP", "PAE", "Overall") #labels for the x axis
medals <- c("navy","goldenrod4","darkgrey","gold","cadetblue1") #rectangle colors
library(ggplot2)
ggplot() +
geom_rect(data = rects, aes(xmin = -Inf, xmax = Inf, ymin = ystart,
ymax = yend, fill=col), alpha = 0.3) +
theme(legend.position="none") +
geom_bar(data=scores, aes(x=category, y=percentage, fill=type), stat="identity") +
scale_fill_manual(values=c("indianred1", "indianred4", medals)) +
scale_x_continuous(breaks = 1:4, labels = labels)
Labels
Labels above bars
scale_y_continuous(labels = percent)
#DATA SET
df <- structure(list(A = structure(c(1L, 1L, 1L, 2L, 2L, 2L, 3L, 3L,
3L), .Label = c("0-50,000", "50,001-250,000", "250,001-Over"),
class = "factor"), B = structure(c(1L, 2L, 3L, 1L, 2L, 3L, 1L, 2L, 3L),
.Label = c("0-50,000", "50,001-250,000", "250,001-Over"),
class = "factor"), Freq = c(0.507713884992987,
0.258064516129032, 0.23422159887798, 0.168539325842697,
0.525280898876405, 0.306179775280899, 0.160958904109589,
0.243150684931507, 0.595890410958904)), .Names = c("A", "B", "Freq"),
class = "data.frame", row.names = c(NA,
-9L))
library(ggplot2); library(scales)
ggplot(data=df, aes(x=A, y=Freq))+
geom_bar(aes(fill=B), stat = "identity", position = position_dodge()) +
geom_text(aes(label = paste(sprintf("%.1f", Freq*100), "%", sep=""),
y = Freq+0.015, group=B),
size = 3, position = position_dodge(width=0.9)) +
scale_y_continuous(labels = percent) +
theme_bw()
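The label text in geom_text above is built with sprintf and paste, which can be tried on its own. A minimal sketch using the first three Freq values from the data set (freq and lab are illustrative names):

```r
freq <- c(0.507713884992987, 0.258064516129032, 0.23422159887798)

# Convert to percent, keep one decimal place, then append the percent sign
lab <- paste(sprintf("%.1f", freq * 100), "%", sep = "")
lab
```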
Mapping
Maps for Different Coordinate Systems + coord_map()
require("maps")
states <- data.frame(map("state", plot=FALSE)[c("x","y")])
(usamap <- qplot(x, y, data=states, geom="path"))
usamap + coord_map()
usamap + coord_map(project="orthographic")
usamap + coord_map(project="stereographic")
usamap + coord_map(project="conic", lat0 = 30)
usamap + coord_map(project="bonne", lat0 = 50)
Random
Stacked Bar Histogram
# Create data Set
set.seed(3421)
library(plyr); library(ggplot2)
# added type to mimic which candidate is supported
dfr <- data.frame(
name = LETTERS[1:26],
percent = rnorm(26, mean=15),
type = sample(c("A", "B"), 26, replace = TRUE)
)
# easier to prepare data in advance. uses two ideas
# 1. calculate histogram bins (quite flexible)
# 2. calculate frequencies and label positions
dfr <- transform(dfr, perc_bin = cut(percent, 5))
dfr <- ddply(dfr, .(perc_bin), mutate,
freq = length(name), pos = cumsum(freq) - 0.5*freq)
# start plotting. key steps are
# 1. plot bars, filled by type and grouped by name
# 2. plot labels using name at position pos
# 3. get rid of grid, border, background, y axis text and labels
ggplot(dfr, aes(x = perc_bin)) +
geom_hline(yintercept=seq(10, 70, by=10), colour="gray90", size=.05) +
geom_bar(aes(y = freq, group = name, fill = type), stat = "identity",
colour = 'gray60', show.legend = FALSE) +
geom_text(aes(y = pos, label = name), colour = 'white') +
scale_fill_manual(values = c('red', 'orange')) +
theme_bw() + xlab("") + ylab("") + scale_y_continuous(expand = c(0,0))+
theme(panel.grid.major = element_blank(), panel.grid.minor = element_blank())
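The label positions above use the pattern cumsum(height) - 0.5*height, which puts each label at the vertical midpoint of its stacked segment. A small base R sketch of that pattern on made-up data (toy, bin, height, top and pos are hypothetical names, and ave() stands in for the ddply() grouping):

```r
# Hypothetical toy data: six segments stacked in two bins
toy <- data.frame(
  name   = c("a", "b", "c", "d", "e", "f"),
  bin    = c("low", "low", "low", "high", "high", "high"),
  height = c(2, 3, 1, 1, 1, 4)
)

# Top edge of each segment: running total of heights within its bin
toy$top <- ave(toy$height, toy$bin, FUN = cumsum)

# Midpoint of each segment: top edge minus half the segment height
toy$pos <- toy$top - 0.5 * toy$height

print(toy)
# "low" bin midpoints:  1.0, 3.5, 5.5
# "high" bin midpoints: 0.5, 1.5, 4.0
```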
Show Zero Count (discrete/categorical) + scale_x_discrete(drop=FALSE)
Histogram That Matches Base's Histogram + geom_histogram(right = TRUE)
#EXAMPLES
library(ggplot2)
mtcars$cyl<-factor(mtcars$cyl)
levels(mtcars[!mtcars$cyl==4,]$cyl) #level 4 there but won't be plotted
ggplot(mtcars[!mtcars$cyl==4,], aes(cyl))+ geom_bar()
ggplot(mtcars[!mtcars$cyl==4,], aes(cyl))+ geom_bar() +
scale_x_discrete(drop=F)
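Why the zero count can be shown at all: subsetting a factor keeps its unused levels, so level 4 is still known even though no rows carry it. A quick base R check (cyl and no4 are illustrative names, not objects from the notes):

```r
cyl <- factor(mtcars$cyl)
no4 <- cyl[cyl != 4]          # drop the rows, not the level

levels(no4)                   # "4" "6" "8": level 4 is still there
table(no4)                    # counts 0, 7, 14
levels(droplevels(no4))       # "6" "8": droplevels() removes the empty level
```

If the empty level is not wanted anywhere, droplevels() on the data is the alternative to drop=F on the scale.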
#EXAMPLE
ggplot(diamonds, aes(carat, ..density..)) +
geom_histogram(binwidth = 0.2) +
facet_grid(.~cut)
ggplot(diamonds, aes(carat, ..density..)) +
geom_histogram(binwidth = 0.2, right = TRUE) +
facet_grid(.~cut)
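What right = TRUE changes is which edge of each bin is closed. Base hist() defaults to right-closed bins, so a value sitting exactly on a break is counted in the bin to its left. A minimal base R sketch with made-up values (x, breaks, right_closed and left_closed are illustrative names):

```r
x <- c(1, 2, 2, 3)      # every value sits exactly on a break
breaks <- 0:3

# Right-closed bins: [0,1], (1,2], (2,3]
right_closed <- hist(x, breaks = breaks, right = TRUE, plot = FALSE)$counts

# Left-closed bins: [0,1), [1,2), [2,3]
left_closed <- hist(x, breaks = breaks, right = FALSE, plot = FALSE)$counts

right_closed   # 1 2 1
left_closed    # 0 1 3
```

The two settings only differ when data fall exactly on bin edges, which is common with rounded values such as carat.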
Market Share Plot
library(ggplot2)
library(reshape2)
library(scales)
# A DATA SET
Scholar <- structure(list(Year = 1995:2011, BDMP = c(18L, 19L, 41L, 30L,
18L, 36L, 28L, 33L, 37L, 45L, 36L, 27L, 27L, 26L, 47L, 43L, 45L
), JMP = c(257L, 370L, 550L, 690L, 865L, 1060L, 1190L, 1430L,
1710L, 2070L, 2520L, 2970L, 3400L, 3830L, 4170L, 4680L, 5590L
), Minitab = c(1150L, 1290L, 1400L, 1460L, 1670L, 1890L, 2180L,
2490L, 2860L, 3300L, 3770L, 4590L, 5210L, 5830L, 6510L, 7190L,
7990L), SPSS = c(6450L, 7600L, 10500L, 14500L, 24300L, 45600L,
67200L, 87200L, 75900L, 137000L, 145000L, 141000L, 133000L, 119000L,
61500L, 45700L, 33200L), SAS = c(8630L, 8700L, 10200L, 11100L,
12700L, 16500L, 21900L, 27200L, 39600L, 49400L, 57000L, 62800L,
60400L, 59100L, 53700L, 43000L, 32300L), Stata = c(22L, 91L,
205L, 322L, 516L, 784L, 986L, 1290L, 1740L, 2400L, 3090L, 4010L,
5100L, 6330L, 7600L, 9230L, 12000L), Statistica = c(3L, 11L,
19L, 28L, 23L, 42L, 62L, 84L, 89L, 146L, 165L, 219L, 209L, 249L,
297L, 351L, 413L), Systat = c(2480L, 2510L, 3390L, 2700L, 2650L,
2780L, 2880L, 2900L, 3100L, 3340L, 4000L, 4870L, 5430L, 6270L,
6560L, 7030L, 8060L), R = c(8L, 2L, 6L, 13L, 25L, 51L, 133L,
286L, 627L, 1180L, 2180L, 3430L, 5060L, 6960L, 9150L, 11400L,
14500L), SPlus = c(8L, 17L, 33L, 39L, 45L, 52L, 159L, 341L, 574L,
817L, 1010L, 1180L, 1160L, 1180L, 970L, 710L, 644L)), .Names = c("Year",
"BDMP", "JMP", "Minitab", "SPSS", "SAS", "Stata", "Statistica",
"Systat", "R", "SPlus"), class = "data.frame", row.names = c(NA,
-17L))
Scholar
Little6 <- c("JMP","Minitab","Stata","Statistica","Systat","R")
Subset <- Scholar[ , Little6]
Year <- rep(Scholar$Year, length(Subset))
ScholarLong <- melt(Subset)
names(ScholarLong) <- c("Software", "Hits")
ScholarLong <- data.frame(Year, ScholarLong)
ggplot(ScholarLong, aes(Year, Hits, group=Software)) +
geom_smooth(aes(fill=Software), position="fill") +
coord_flip()+
scale_x_continuous("Year", trans="reverse") +
scale_y_continuous("Proportion of Google Scholar Hits For Each Software",
labels = NULL)+
ggtitle("Market Share") + theme(axis.ticks = element_blank())
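position="fill" rescales the stacked values so that each year sums to 1, which is what turns raw hits into shares. A rough base R sketch of that rescaling on the raw counts (not the smoothed ones), using the 1995 and 1996 values for Stata and R from the data set; hits and share are illustrative names:

```r
# Two years, two packages, taken from the Scholar data above
hits <- rbind("1995" = c(Stata = 22, R = 8),
              "1996" = c(Stata = 91, R = 2))

# Divide each row by its total so every year sums to 1
share <- prop.table(hits, margin = 1)

share
rowSums(share)   # 1 1
```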
Dotplot
DF <- structure(list(Country = structure(1:30, .Label = c("Georgia",
"South Africa", "Colombia", "Cuba", "Poland", "Romania", "Taipei (Chinese Taipei)",
"Azerbaijan", "Belgium", "Canada", "Republic of Moldova", "Norway",
"Serbia", "Slovakia", "Ukraine", "Uzbekistan", "Kazakhstan",
"Netherlands", "Great Britain", "Democratic People's Republic of Korea",
"Australia", "Brazil", "Hungary", "France", "Russian Federation",
"Republic of Korea", "Japan", "Italy", "United States of America",
"People's Republic of China"), class = "factor"), Gold = c(1,
1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 2, 1, 0, 2, 1, 1,
1, 2, 1, 2, 0, 2, 3, 6), Silver = c(0, 0, 1, 1, 1, 1, 1, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 0, 1, 1, 1, 1, 0, 1, 2, 3, 5, 4
), Bronze = c(0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1,
0, 0, 1, 1, 1, 1, 1, 1, 3, 2, 3, 2, 3, 2), total = c(1, 1, 1,
1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 3, 3, 3, 3, 4,
4, 5, 5, 7, 11, 12)), .Names = c("Country", "Gold", "Silver",
"Bronze", "total"), row.names = c(13L, 14L, 17L, 18L, 19L, 20L,
21L, 22L, 23L, 24L, 25L, 26L, 27L, 28L, 29L, 30L, 7L, 11L, 16L,
6L, 8L, 9L, 10L, 5L, 12L, 4L, 15L, 3L, 2L, 1L), class = "data.frame")
#CONVERT TO LONG
DF2 <- reshape(DF, varying = 2:5, direction="long",v.names = "number", timevar = "medal",
idvar = "Country", times =c("Gold", "Silver", "Bronze", "total"))
DF2$medal <- factor(DF2$medal, levels=c("Bronze", "Silver", "Gold", "total"))
ggplot(DF2, aes(x = number, y = Country, colour = medal)) +
geom_point() +
facet_grid(.~medal) + theme_bw()+
scale_colour_manual(values=c("#CC6600", "#999999", "#FFCC33", "#000000"))
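The reshape() call is the step that is easiest to get wrong, so here is the same wide-to-long conversion on a tiny made-up table (wide and long are hypothetical names; the countries "X" and "Y" are not from the data):

```r
# Hypothetical wide table: one row per country, one column per medal
wide <- data.frame(Country = c("X", "Y"),
                   Gold    = c(1, 0),
                   Silver  = c(0, 2))

# Columns 2:3 are stacked into one "number" column;
# "medal" records which original column each value came from
long <- reshape(wide, varying = 2:3, direction = "long",
                v.names = "number", timevar = "medal",
                idvar = "Country", times = c("Gold", "Silver"))

long   # 4 rows: both countries for Gold, then both for Silver
```

The times vector must list the medal names in the same order as the columns in varying, otherwise the values end up under the wrong medal.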
Direct Labels
Move Specific Labels Around
#The faceted ggplot code
library(ggplot2); library(directlabels)
x <- ggplot(CO2, aes(x=uptake, group=Plant))
y <- x + geom_density(aes(colour=Plant)) +
facet_grid(Type~Treatment)+ theme_bw()
y #with a legend
direct.label(y) #with direct labels
#use this to supply arguments to direct.label to move it around
my.method1 <-
list('top.points',
dl.move("Qn1", hjust=0,vjust=-5),
dl.move("Qc2", hjust=6,vjust=-8)
)
direct.label(y, my.method1) #moved labels
Find Values from Plot and Adjust that Way
library(ggplot2); library(directlabels)
set.seed(124234345)
# Generate data
df.2 <- data.frame("n_gram" = c("word1"),
"year" = rep(100:199),
"match_count" = runif(100 ,min = 1000 , max = 2000))
df.2 <- rbind(df.2, data.frame("n_gram" = c("word2"),
"year" = rep(100:199),
"match_count" = runif(100 ,min = 1000 , max = 2000)))
# Function to get last Y-value from loess
funcDlMove <- function (n_gram) {
model <- loess(match_count ~ year, df.2[df.2$n_gram==n_gram,], span=0.3)
Y <- model$fitted[length(model$fitted)]
Y <- dl.move(n_gram, y=Y,x=200)
return(Y)
}
index <- unique(df.2$n_gram)
mymethod <- list(
"top.points",
lapply(index, funcDlMove)
)
# Plot
PLOT <- ggplot(df.2, aes(year, match_count, group=n_gram, color=n_gram)) +
geom_line(alpha = I(7/10), color="grey", show.legend=FALSE) +
stat_smooth(size=2, span=0.3, se=FALSE, show.legend=FALSE)
direct.label(PLOT, mymethod)
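The core of funcDlMove is reading the last fitted value off a loess fit so the label can be placed where the smooth ends. That part runs in base R alone. A minimal sketch on made-up data (toy, fit, last_x and last_y are illustrative names; the seed is arbitrary):

```r
set.seed(1)
# Hypothetical series shaped like one n_gram in df.2
toy <- data.frame(year = 100:199,
                  match_count = runif(100, min = 1000, max = 2000))

# Same span as stat_smooth() in the plot, so the value is comparable
fit <- loess(match_count ~ year, data = toy, span = 0.3)

# Fitted values come back in data order, so the last one belongs to the last year
last_y <- fit$fitted[length(fit$fitted)]
last_x <- max(toy$year)

c(x = last_x, y = last_y)
```

This relies on the rows being sorted by year; with unsorted data, take the fitted value at which.max(toy$year) instead.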
Move Plot Over to Add Line Names
pacman::p_load(ggplot2, directlabels)
set.seed(124234345)
# Generate data
df.2 <- data.frame("n_gram" = c("word1"),
"year" = rep(100:199),
"match_count" = runif(100 ,min = 1000 , max = 2000))
df.2 <- rbind(df.2, data.frame("n_gram" = c("word2"),
"year" = rep(100:199),
"match_count" = runif(100 ,min = 1000 , max = 2000)))
mymethod <- list(
"top.points",
dl.move("word1", hjust=-.5, vjust=19.5),
dl.move("word2", hjust =-4.4, vjust=15.5)
)
ggplot(df.2, aes(year, match_count, group=n_gram, color=n_gram)) +
geom_line(alpha = I(7/10), color="grey", show.legend=FALSE) +
xlim(c(100,220))+
stat_smooth(size=2, span=0.3, se=FALSE, show.legend=FALSE) +
geom_dl(aes(label=n_gram), method = mymethod, show_guide=F)
LATEX
Prepare tables for LATEX xtable(table, caption=NULL, label=NULL, align=NULL, digits=3, display=NULL)
Info on R to Latex http://stackoverflow.com/questions/2978784/suggestion-for-r-latex-table-creation-package
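A short usage sketch for xtable(), assuming the xtable package is installed (the code does nothing if it is not). The caption and label strings and the names dat, tab and tex are made up for illustration:

```r
# Small table to export: first three rows of three mtcars columns
dat <- head(mtcars[, c("mpg", "cyl", "wt")], 3)

tex <- character(0)
if (requireNamespace("xtable", quietly = TRUE)) {
  tab <- xtable::xtable(dat,
                        caption = "First rows of mtcars",  # illustrative caption
                        label   = "tab:mtcars",            # illustrative label
                        digits  = 2)
  # print() emits the LaTeX source; capture it so it can be reused or written to a file
  tex <- capture.output(print(tab))
  cat(tex, sep = "\n")
}
```

The printed text can be pasted into a .tex file as is.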