intermediate programming in r session 1: datawernera/rintermediate/... · 2013-05-03 · dr olivia...

Olivia Lau, PhD Intermediate Programming in R Session 1: Data

Intermediate R Programming | Dr Olivia Lau

Outline

About Me

About You

Course Overview and Logistics

R Data Types

R Data Structures

Importing Data

Recoding Data

About Me

• Using and programming in R for over 10 years

• Working in high tech, previously worked in the federal sector and academia

• Expertise in:
  – Linear and general linear models
  – Survival and hazard rate models
  – Multi-level models
  – Experimentation and causal inference

• Taught “A Crash Course in R Programming” at the 2010 and 2012 UseR! conferences

• For more information, see http://www.olivialau.org

About You

• Have taken “Introduction to R” or have equivalent experience
  – Familiar with R data types
  – Familiar with R data structures
  – Comfortable typing at the command line

• Take a moment to introduce yourselves via the “Meet and Greet” on the course website
  – What is your background?
  – What do you do?
  – What do you want to get out of the course?

Course Overview and Logistics

• 4 modules
  – Data (and review)
  – Loops
  – Functions
  – Avoiding loops

• Self-paced, so please feel free to pause, rewind, and review

• I will answer questions twice per day, once in the early morning and once in the early evening Pacific time. Students are encouraged to reply to questions as well.

• Note: Throughout the slides, R code will look like this

Setting up your work environment

• Install R version 2.14 or greater
  – If you have R installed, you can check the version with R.Version()$version.string
  – If your R version number is less than 2.14.0, you must install the latest version
  – Windows users: Make sure you install to C:\Program Files

• Install the R editor of your choice (Word, Notepad, TextEdit are not sufficient)
  – Emacs with ESS: http://vgoulet.act.ulaval.ca/en/emacs/
  – Vim with Vim-R-plugin: http://www.vim.org/scripts/script.php?script_id=2628
  – TinnR: http://sourceforge.net/projects/tinn-r/
  – NotePad++: http://notepad-plus-plus.org
  – Eclipse with StatET: http://www.walware.de/goto/statet

Some Reminders

• R is case-sensitive

• R for Windows uses / (forward slash) instead of \ (back slash) in file paths

• Ctrl-C kills the R command being executed

• If you get a syntax error, check your commas

• If you get stuck, try
  – args(command) to see the inputs to the command function
  – help(command) for detailed help on the command function
  – ls() to see the contents of your workspace
  – names(object) or str(object) to see the contents of data frames and lists
  – Do classes match up as they should? Check with class(object)

• Final reminders:
  – getwd() and setwd() to identify and set your working directory
  – save() to save your workspace or specific objects
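
The helper commands above can be tried on any object. The following is a minimal sketch; the list `scores` is a hypothetical object made up for illustration, and `paste` merely stands in for whichever function you are stuck on.

```r
# A small throwaway object to inspect (hypothetical example data)
scores <- list(id = 1:3, label = c("a", "b", "c"))

args(paste)        # inputs to the paste() function
ls()               # objects currently in the workspace
names(scores)      # element names of the list
str(scores)        # compact summary of the object's structure
class(scores$id)   # class of one element ("integer" here, since 1:3 is integer)
getwd()            # current working directory
```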

Check In 1

• What are the arguments to the read.table command?

• Answer
    > args(read.table)
    function (file, header = FALSE, sep = "", quote = "\"'", dec = ".",
        row.names, col.names, as.is = !stringsAsFactors, na.strings = "NA",
        colClasses = NA, nrows = -1, skip = 0, check.names = TRUE,
        fill = !blank.lines.skip, strip.white = FALSE, blank.lines.skip = TRUE,
        comment.char = "#", allowEscapes = FALSE, flush = FALSE,
        stringsAsFactors = default.stringsAsFactors(), fileEncoding = "",
        encoding = "unknown", text)

R Data Types: Atomic or scalar units

• Smallest building blocks in the R language

• All units have a class attribute
  – Numeric (or integer)
  – Character
  – Logical (TRUE or FALSE)
  – Date (either as Date [without time] or a full POSIXct time stamp)
  – Additional specialized classes can be defined

    foo <- 25
    class(foo)
    foobar <- "super"
    class(foobar)
    sotrue <- TRUE
    class(sotrue)
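
The remaining classes in the list can be checked the same way. This is a small sketch; the object names `whole`, `today`, and `now` are made up for illustration.

```r
whole <- 25L           # the L suffix requests an integer rather than a double
class(whole)           # "integer"
class(25)              # "numeric"

today <- Sys.Date()    # current date, without a time component
class(today)           # "Date"

now <- Sys.time()      # current time stamp
class(now)             # "POSIXct" "POSIXt"
```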

Special classes: Dates and Factors

• A factor is a categorical value
  – Levels are usually represented as character strings, but may also be numeric
  – Can be unordered (nominal) values, or ordered values

• Time is represented in two ways
  – Dates with day and optionally time zone values, e.g.,
      as.Date("2012-08-12", format = "%Y-%m-%d")
  – Time stamps with date, time in hours, minutes, and seconds, and optionally time zone attributes
      as.POSIXct("2011-03-27 01:30:00", format = "%Y-%m-%d %H:%M:%S")
  – POSIXct stores time stamps as the number of seconds since January 1, 1970
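
A brief sketch of both special classes follows. The factor `size` and the time stamp `stamp` are hypothetical values chosen for illustration; the time zone is fixed to UTC so that the seconds count does not depend on the local machine.

```r
# An ordered factor: levels are stored as strings, values as integer codes
size <- factor(c("low", "high", "medium", "low"),
               levels = c("low", "medium", "high"), ordered = TRUE)
levels(size)        # "low" "medium" "high"
as.integer(size)    # 1 3 2 1 (position of each value in the level set)
size < "high"       # TRUE FALSE TRUE TRUE (comparison is valid because the factor is ordered)

# A POSIXct time stamp is a count of seconds since 1970-01-01 00:00:00 UTC
stamp <- as.POSIXct("1970-01-02 00:00:00", tz = "UTC")
as.numeric(stamp)   # 86400, i.e. exactly one day of seconds
```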

Check In 2

• Create a timestamp for one hour in the future. Do not hardcode the date and time; use the function Sys.time(), which returns the current system time

• Answer
    Sys.time() + (60 * 60)

R Data Types: Homogeneous Data Structures

• A homogeneous data structure contains scalars all of the same class

• These data structures are indexed with square brackets [], using commas to separate dimensions
  – Vector: One dimension
      foo.v <- c(2, 3, 4, 5)
      names(foo.v) <- c("eeny", "meeny", "miny", "moe")
  – Matrix: Two dimensions (first is always row, second is always column)
      foo.m <- matrix(1:20, nrow = 4, ncol = 5)
      rownames(foo.m) <- c("A", "B", "C", "D")
      colnames(foo.m) <- c("E", "F", "G", "H", "I")
  – Array: K dimensions
      foo.a <- array(1:30, dim = c(2, 5, 3),
                     dimnames = list(c("r1", "r2"), NULL, c("z1", "z2", "z3")))
      dim(foo.a)

• Hit pause, and take a moment to create the data structures foo.v, foo.m, and foo.a

Check In 3

• Extract the element named “eeny” from foo.v

• Answer
    foo.v["eeny"]

• Extract row 3 from the matrix foo.m

• Answer
    foo.m[3, ]

• Extract the matrix associated with column 4 of foo.a

• Answer
    foo.a[ , 4, ]
       z1 z2 z3
    r1  7 17 27
    r2  8 18 28

R Data Types: Heterogeneous Data Structures: Lists

• Most general type: The list
  – Can contain any type of data structure: scalars, vectors, matrices, arrays, other lists, etc.
  – Has names and length attributes

• Come in two flavors: S3 and S4
  – Use $ or [[ ]] to extract elements from S3 lists
  – Use @ to extract elements from S4 lists

    foo.l <- list(vec = foo.v, mat = foo.m, arr = foo.a)
    foo.l$vec
    foo.l$vec["eeny"]
    foo.l[[3]][2, , ]
    attributes(foo.l)
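
The slide shows only the S3 case, so here is a rough sketch contrasting the two extraction styles. The class name "Shape" and both objects are hypothetical, defined only for this illustration; in S4 the components are formally called slots.

```r
library(methods)  # provides setClass() and new()

# S3-style list: elements are reached with $ or [[ ]]
shape.l <- list(name = "square", sides = 4)
shape.l$sides
shape.l[["name"]]

# S4 object: components live in slots and are reached with @
# ("Shape" is a hypothetical class defined only for this sketch)
setClass("Shape", representation(name = "character", sides = "numeric"))
shape.s4 <- new("Shape", name = "square", sides = 4)
shape.s4@sides
```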

R Data Types: Data frames

• A data frame is a list in which all of the elements have the same length

• Data frames use S3 methods of extraction

    library(MASS)
    data(Cars93)

    names(Cars93)
    str(Cars93)
    dim(Cars93)
    summary(Cars93)
    head(Cars93)
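
To see that a data frame really is a list of equal-length elements, consider this minimal sketch. The data frame `cars.df` and its values are made up for illustration and are not taken from Cars93.

```r
# Toy data frame with made-up values
cars.df <- data.frame(model  = c("A", "B", "C"),
                      weight = c(2500, 3000, 3500),
                      stringsAsFactors = FALSE)

is.list(cars.df)          # TRUE: a data frame is a list under the hood
sapply(cars.df, length)   # every element (column) has the same length, 3
cars.df$weight            # list-style extraction with $
cars.df[["weight"]][1:2]  # list-style extraction with [[ ]], then vector indexing
```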

Check In 4

• In the Cars93 data set from the MASS library, identify the first 10 values in Weight

• Answer
    library(MASS)
    data(Cars93)
    Cars93$Weight[1:10]

Importing Data

• Text files, space delimited
    worldbank <- read.table("worldbank.txt", header = TRUE)

• Text files, tab delimited
    worldbank <- read.table("worldbank.tab", header = TRUE, sep = "\t")

• Text files, comma delimited
    worldbank <- read.csv("worldbank.csv")

• Text files, fixed width: read.fwf()

• If reading a text file takes a long time,
  – Pre-specify column classes using the colClasses argument for text files
  – Alternatively, use scan()

• SAS, STATA, SPSS, and other “foreign” file types can be imported using the foreign library
    library(foreign)
    worldbank <- read.dta("worldbank.dta") # For STATA files
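
As a rough illustration of pre-specifying column classes, the sketch below writes a tiny comma-delimited file to a temporary location and reads it back. The file contents (countries "Alpha" and "Beta", their values) are invented for the example and are unrelated to the course's worldbank data.

```r
# Write a small made-up CSV file to a temporary path
path <- tempfile(fileext = ".csv")
writeLines(c("Country,Year,GDP",
             "Alpha,2002,1.5",
             "Beta,2003,2.5"), path)

# Declare one class per column so R does not have to guess them
wb <- read.csv(path, colClasses = c("character", "integer", "numeric"))
str(wb)
sapply(wb, class)

unlink(path)  # remove the temporary file
```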

Importing Data: Sanity Checks

• Check for the number of rows and columns with dim() or nrow() or ncol()

• Check for variable names using names(), assign names if necessary

• Check for missing values with apply(worldbank, 2, function(x) sum(is.na(x)))
  – Do you have the right number of missing values for each variable?
  – Were missing values coded in the original data (e.g., -99 = missing)? If so, use
      read.*(..., na.strings = c("", " ", "-99"))

• Check variable classes, recode if necessary
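
The effect of na.strings can be sketched with a tiny made-up file in which -99 codes a missing value. All names and values here are hypothetical; colSums(is.na(...)) is used as a shorter equivalent of the apply() call above.

```r
# Made-up CSV: -99 marks a missing GDP, and one Year field is blank
path <- tempfile(fileext = ".csv")
writeLines(c("Country,Year,GDP",
             "Alpha,2002,-99",
             "Beta,,2.5",
             "Gamma,2004,3.5"), path)

raw <- read.csv(path)
colSums(is.na(raw))     # GDP shows 0 missing: -99 was read as an ordinary number

clean <- read.csv(path, na.strings = c("", " ", "-99"))
colSums(is.na(clean))   # GDP now shows 1 missing; Year shows 1 (the blank field)

unlink(path)
```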

Recoding Data: Change Variable Classes

• Never attach() a data frame

• By default, read.*() in R versions before 4.0.0 coerces character strings (including dates) to factors
  – To override for all character variables, use read.*(..., as.is = TRUE)
  – If some character fields are factors and others character, read.*() as normal, then recode to the correct class

• From factor to character
    worldbank$YearCode <- as.character(worldbank$YearCode)

• From factor to numeric
    worldbank$Year <- as.numeric(as.character(worldbank$Year))

• From character to ordered factor
    worldbank$YearCode <- factor(worldbank$YearCode, levels = paste0("YR", 2002:2011), ordered = TRUE)
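
The factor-to-numeric recode goes through as.character() for a reason: applied directly, as.numeric() returns the factor's internal level codes rather than the values. A short sketch with a made-up factor `year`:

```r
year <- factor(c("2005", "2002", "2011"))   # levels sort to "2002" "2005" "2011"

as.numeric(year)                 # 2 1 3: the internal level codes, not the years
as.numeric(as.character(year))   # 2005 2002 2011: the intended values
```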

Recoding Data: Change Variable Names

• Check existing variables with
    names(worldbank)

• Rename variables in two steps
  – Create the new variable
      worldbank$Year.Factor <- as.factor(worldbank$YearCode)
  – Remove the old variable
      worldbank$YearCode <- NULL

• No error message if you are overwriting an existing variable
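
When only the name needs to change (not the class), an alternative to the two-step approach is to assign into names() directly. This is a sketch on a made-up data frame `wb`, not the course data; note that it also overwrites silently if the new name is already in use.

```r
# Toy data frame with made-up values
wb <- data.frame(YearCode = c("YR2002", "YR2003"), GDP = c(1.5, 2.5))

# Replace the matching name in place; the column's contents are untouched
names(wb)[names(wb) == "YearCode"] <- "Year.Factor"
names(wb)
```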

Recoding Data: Subsets

• Identify rows, columns, or vector positions using
  – A logical vector of the same dimension as the object
  – A numeric vector with the dimension index of the object
  – A character vector with the element names (row names, column names, etc) of the object
  – Any combination of the above three

• Extract identified positions and save them as new objects in the workspace
    yr2002 <- worldbank[worldbank$Year == 2002, ]
    ck2002 <- worldbank[which(worldbank$Year == 2002), ]
    identical(yr2002, ck2002)

• Replace identified positions with new values
    worldbank$before2005 <- 0
    worldbank[worldbank$Year < 2005, "before2005"] <- 1
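
The logical and which() forms agree only when the comparison produces no missing values. The sketch below uses a made-up data frame with one NA in Year to show where they differ.

```r
# Toy data with one missing Year (values are invented)
wb <- data.frame(Year = c(2002, NA, 2003, 2002), GDP = c(1, 2, 3, 4))

by.logical <- wb[wb$Year == 2002, ]          # NA == 2002 is NA, which yields a row of NAs
by.which   <- wb[which(wb$Year == 2002), ]   # which() keeps only the TRUE positions

nrow(by.logical)                  # 3: two matches plus one all-NA row
nrow(by.which)                    # 2: the two matches only
identical(by.logical, by.which)   # FALSE
```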

Recoding Data: Merging data

• Both data frames to be merged should already be R objects in the workspace

• R creates a primary key by looking for identical variable names in dataset x and dataset y
  – Check that the variable names are what you expect before joining
  – If no common variables are found, R returns the Cartesian product of the two data sets (every row of x paired with every row of y), which can be very large

• R supports 4 standard types of joins using one command: merge()
  – Inner join (default, unless there are no common variables)
      merge(x, y)
  – Outer join
      merge(x, y, all = TRUE)
  – Left join
      merge(x, y, all.x = TRUE)
  – Right join
      merge(x, y, all.y = TRUE)

• R’s equivalent of SQL’s UNION ALL
    rbind(x, y)
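
The join types can be compared by row count on two small data frames. This is a sketch with invented data; `x`, `y`, and `z` are hypothetical and share (or, for `z`, deliberately do not share) the key column `id`.

```r
# Toy data: ids 2 and 3 appear in both x and y
x <- data.frame(id = c(1, 2, 3), gdp = c(10, 20, 30))
y <- data.frame(id = c(2, 3, 4), pop = c(5, 6, 7))

nrow(merge(x, y))                 # 2: inner join keeps ids 2 and 3
nrow(merge(x, y, all = TRUE))     # 4: outer join keeps ids 1 to 4
nrow(merge(x, y, all.x = TRUE))   # 3: left join keeps every row of x
nrow(merge(x, y, all.y = TRUE))   # 3: right join keeps every row of y

# No common variable names: every row of x is paired with every row of z
z <- data.frame(code = c("a", "b"))
nrow(merge(x, z))                 # 6 = 3 rows of x times 2 rows of z

# Stacking rows; both inputs must have the same column names
nrow(rbind(x, x))                 # 6
```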

Assignment

• Introduce yourself on the class discussion board

• Reading for this week
  – From the course text, Paul Teetor’s R Cookbook:
      • Chapters 1-2
      • Chapter 4, Sections 7-10 only
      • Chapter 5 (stop at the beginning of Section 5.1 on p. 101)
  – R help pages for:
      • which
      • merge

• Problem set as assigned
