Transcript
Page 1: Stat 115 Lecture Notes 4 - Washington State University

Stat 115 Lecture Notes 4Xiongzhi Chen

Washington State University

Contents2

Overview of data 2Course materials . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2NYC flight data: nycflights13 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2NYC flight data: flights . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2NYC flight data: airlines . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2NYC flight data: airports . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3NYC flight data: planes . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3NYC flight data: weather . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3NYC flight data . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3NYC flight data . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4

Mutating Joins 4Key . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4Data: flights . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4Data: airlines . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5Add airline name to flights2 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5Inner join . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5Outer Joins . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6Left outer join . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6Right outer join . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6Full outer join . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7Join flights and plances . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7Join flights and airports . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7Filtering Joins . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8

Missing values 8Course contents . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8Check object types . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8Check object types . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8Coerce an object . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9Missing observations/values . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9Example . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10Check for NA . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10Remove observatiosn with NAs . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10Complete and fill in NAs . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11Mising as if not . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11Fill in missing values . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11License and session Information . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 12

1

Page 2: Stat 115 Lecture Notes 4 - Washington State University

Overview of data

Course materials

• Chapters 10 of book “R for data science” by Wickham and Grolemun• R library: tidyverse, dplyr, nycflights13

NYC flight data: nycflights13

nycflights13 contains 5 tibbles

• airlines: full carrier name• airports: information about each airport• planes: information plane, identified by tailnum• weather: hourly weather at each NYC airport• flights: airplane scheduled departure and arrival times, tailnum, etc

NYC flight data: flights

# A tibble: 6 x 19year month day dep_time sched_dep_time dep_delay arr_time

<int> <int> <int> <int> <int> <dbl> <int>1 2013 1 1 517 515 2 8302 2013 1 1 533 529 4 8503 2013 1 1 542 540 2 9234 2013 1 1 544 545 -1 10045 2013 1 1 554 600 -6 8126 2013 1 1 554 558 -4 740# ... with 12 more variables: sched_arr_time <int>,# arr_delay <dbl>, carrier <chr>, flight <int>,# tailnum <chr>, origin <chr>, dest <chr>, air_time <dbl>,# distance <dbl>, hour <dbl>, minute <dbl>, time_hour <dttm>

NYC flight data: airlines

# A tibble: 6 x 2carrier name<chr> <chr>

1 9E Endeavor Air Inc.2 AA American Airlines Inc.3 AS Alaska Airlines Inc.4 B6 JetBlue Airways5 DL Delta Air Lines Inc.6 EV ExpressJet Airlines Inc.

2

Page 3: Stat 115 Lecture Notes 4 - Washington State University

NYC flight data: airports

# A tibble: 6 x 8faa name lat lon alt tz dst tzone<chr> <chr> <dbl> <dbl> <int> <dbl> <chr> <chr>

1 04G Lansdowne Airp~ 41.1 -80.6 1044 -5 A America/~2 06A Moton Field Mu~ 32.5 -85.7 264 -6 A America/~3 06C Schaumburg Reg~ 42.0 -88.1 801 -6 A America/~4 06N Randall Airport 41.4 -74.4 523 -5 A America/~5 09J Jekyll Island ~ 31.1 -81.4 11 -5 A America/~6 0A9 Elizabethton M~ 36.4 -82.2 1593 -5 A America/~

NYC flight data: planes

# A tibble: 6 x 9tailnum year type manufacturer model engines seats speed<chr> <int> <chr> <chr> <chr> <int> <int> <int>

1 N10156 2004 Fixe~ EMBRAER EMB-~ 2 55 NA2 N102UW 1998 Fixe~ AIRBUS INDU~ A320~ 2 182 NA3 N103US 1999 Fixe~ AIRBUS INDU~ A320~ 2 182 NA4 N104UW 1999 Fixe~ AIRBUS INDU~ A320~ 2 182 NA5 N10575 2002 Fixe~ EMBRAER EMB-~ 2 55 NA6 N105UW 1999 Fixe~ AIRBUS INDU~ A320~ 2 182 NA# ... with 1 more variable: engine <chr>

NYC flight data: weather

# A tibble: 6 x 15origin year month day hour temp dewp humid wind_dir<chr> <dbl> <dbl> <int> <int> <dbl> <dbl> <dbl> <dbl>

1 EWR 2013 1 1 1 39.0 26.1 59.4 2702 EWR 2013 1 1 2 39.0 27.0 61.6 2503 EWR 2013 1 1 3 39.0 28.0 64.4 2404 EWR 2013 1 1 4 39.9 28.0 62.2 2505 EWR 2013 1 1 5 39.0 28.0 64.4 2606 EWR 2013 1 1 6 37.9 28.0 67.2 240# ... with 6 more variables: wind_speed <dbl>,# wind_gust <dbl>, precip <dbl>, pressure <dbl>,# visib <dbl>, time_hour <dttm>

NYC flight data

For nycflights13:

• flights connects to planes via a single variable, tailnum• flights connects to airlines through the carrier variable• flights connects to airports in two ways: via the origin and dest variables• flights connects to weather via origin (the location), and year, month, day, and hour (the time).

3

Page 4: Stat 115 Lecture Notes 4 - Washington State University

NYC flight data

Mutating Joins

Key

• a key is a variable (or set of variables) that uniquely identifies an observation• A primary key uniquely identifies an observation in its own table. For example, planes$tailnum

is a primary key because it uniquely identifies each plane in the planes table• A foreign key uniquely identifies an observation in another table. For example, flights$tailnum

is a foreign key because it appears in the flights table where it matches each flight to a unique plane.

Data: flights

> library(dplyr)> flights2 <- flights %>%+ select(year:day, hour, origin, dest, tailnum, carrier)> flights2# A tibble: 336,776 x 8

year month day hour origin dest tailnum carrier<int> <int> <int> <dbl> <chr> <chr> <chr> <chr>

1 2013 1 1 5 EWR IAH N14228 UA2 2013 1 1 5 LGA IAH N24211 UA3 2013 1 1 5 JFK MIA N619AA AA4 2013 1 1 5 JFK BQN N804JB B65 2013 1 1 6 LGA ATL N668DN DL6 2013 1 1 5 EWR ORD N39463 UA

4

Page 5: Stat 115 Lecture Notes 4 - Washington State University

7 2013 1 1 6 EWR FLL N516JB B68 2013 1 1 6 LGA IAD N829AS EV9 2013 1 1 6 JFK MCO N593JB B6

10 2013 1 1 6 LGA ORD N3ALAA AA# ... with 336,766 more rows

Data: airlines

# A tibble: 6 x 2carrier name<chr> <chr>

1 9E Endeavor Air Inc.2 AA American Airlines Inc.3 AS Alaska Airlines Inc.4 B6 JetBlue Airways5 DL Delta Air Lines Inc.6 EV ExpressJet Airlines Inc.

Add airline name to flights2

> library(dplyr)> flights2 %>%+ select(-origin, -dest) %>%+ left_join(airlines, by = "carrier")# A tibble: 336,776 x 7

year month day hour tailnum carrier name<int> <int> <int> <dbl> <chr> <chr> <chr>

1 2013 1 1 5 N14228 UA United Air Lines In~2 2013 1 1 5 N24211 UA United Air Lines In~3 2013 1 1 5 N619AA AA American Airlines I~4 2013 1 1 5 N804JB B6 JetBlue Airways5 2013 1 1 6 N668DN DL Delta Air Lines Inc.6 2013 1 1 5 N39463 UA United Air Lines In~7 2013 1 1 6 N516JB B6 JetBlue Airways8 2013 1 1 6 N829AS EV ExpressJet Airlines~9 2013 1 1 6 N593JB B6 JetBlue Airways

10 2013 1 1 6 N3ALAA AA American Airlines I~# ... with 336,766 more rows

Inner join

inner_join: unmatched rows are not included in the result

5

Page 6: Stat 115 Lecture Notes 4 - Washington State University

y <- tribble( ~key, ~val_y, 1, "y1", 2, "y2", 4, "y3")

The colored column represents the “key” variable: these are used tomatch the rows between the tables. The gray column represents the“value” column that is carried along for the ride. In these examplesI’ll show a single key variable and single value variable, but the ideageneralizes in a straightforward way to multiple keys and multiplevalues.

A join is a way of connecting each row in x to zero, one, or morerows in y. The following diagram shows each potential match as anintersection of a pair of lines:

(If you look closely, you might notice that we’ve switched the orderof the key and value columns in x. This is to emphasize that joinsmatch based on the key; the value is just carried along for the ride.)

In an actual join, matches will be indicated with dots. The numberof dots = the number of matches = the number of rows in the out‐put.

Inner JoinThe simplest type of join is the inner join. An inner join matchespairs of observations whenever their keys are equal:

180 | Chapter 10: Relational Data with dplyr

Outer Joins

outer join keeps observations that appear in at least one of the tables. There are three types of outer joins:

• A left join keeps all observations in x• A right join keeps all observations in y• A full join keeps all observations in x and y

Left outer joinGraphically, that looks like:

The most commonly used join is the left join: you use this wheneveryou look up additional data from another table, because it preservesthe original observations even when there isn’t a match. The left joinshould be your default join: use it unless you have a strong reason toprefer one of the others.

Another way to depict the different types of joins is with a Venn dia‐gram:

However, this is not a great representation. It might jog your mem‐ory about which join preserves the observations in which table, butit suffers from a major limitation: a Venn diagram can’t show whathappens when keys don’t uniquely identify an observation.

182 | Chapter 10: Relational Data with dplyr

Right outer join

Graphically, that looks like:

The most commonly used join is the left join: you use this wheneveryou look up additional data from another table, because it preservesthe original observations even when there isn’t a match. The left joinshould be your default join: use it unless you have a strong reason toprefer one of the others.

Another way to depict the different types of joins is with a Venn dia‐gram:

However, this is not a great representation. It might jog your mem‐ory about which join preserves the observations in which table, butit suffers from a major limitation: a Venn diagram can’t show whathappens when keys don’t uniquely identify an observation.

182 | Chapter 10: Relational Data with dplyr

6

Page 7: Stat 115 Lecture Notes 4 - Washington State University

Full outer join

Graphically, that looks like:

The most commonly used join is the left join: you use this wheneveryou look up additional data from another table, because it preservesthe original observations even when there isn’t a match. The left joinshould be your default join: use it unless you have a strong reason toprefer one of the others.

Another way to depict the different types of joins is with a Venn dia‐gram:

However, this is not a great representation. It might jog your mem‐ory about which join preserves the observations in which table, butit suffers from a major limitation: a Venn diagram can’t show whathappens when keys don’t uniquely identify an observation.

182 | Chapter 10: Relational Data with dplyr

Join flights and plances

joint flights and planes by tailnum> flights2 %>% left_join(planes, by = "tailnum")# A tibble: 336,776 x 16

year.x month day hour origin dest tailnum carrier year.y<int> <int> <int> <dbl> <chr> <chr> <chr> <chr> <int>

1 2013 1 1 5 EWR IAH N14228 UA 19992 2013 1 1 5 LGA IAH N24211 UA 19983 2013 1 1 5 JFK MIA N619AA AA 19904 2013 1 1 5 JFK BQN N804JB B6 20125 2013 1 1 6 LGA ATL N668DN DL 19916 2013 1 1 5 EWR ORD N39463 UA 20127 2013 1 1 6 EWR FLL N516JB B6 20008 2013 1 1 6 LGA IAD N829AS EV 19989 2013 1 1 6 JFK MCO N593JB B6 2004

10 2013 1 1 6 LGA ORD N3ALAA AA NA# ... with 336,766 more rows, and 7 more variables:# type <chr>, manufacturer <chr>, model <chr>,# engines <int>, seats <int>, speed <int>, engine <chr>

Join flights and airports

joint flights and airports by matching dest in flights to faa in airports; the variables in x will beused in the output> flights2 %>%+ left_join(airports, c("dest" = "faa"))# A tibble: 336,776 x 15

year month day hour origin dest tailnum carrier name<int> <int> <int> <dbl> <chr> <chr> <chr> <chr> <chr>

1 2013 1 1 5 EWR IAH N14228 UA Geor~2 2013 1 1 5 LGA IAH N24211 UA Geor~3 2013 1 1 5 JFK MIA N619AA AA Miam~4 2013 1 1 5 JFK BQN N804JB B6 <NA>5 2013 1 1 6 LGA ATL N668DN DL Hart~6 2013 1 1 5 EWR ORD N39463 UA Chic~

7

Page 8: Stat 115 Lecture Notes 4 - Washington State University

7 2013 1 1 6 EWR FLL N516JB B6 Fort~8 2013 1 1 6 LGA IAD N829AS EV Wash~9 2013 1 1 6 JFK MCO N593JB B6 Orla~

10 2013 1 1 6 LGA ORD N3ALAA AA Chic~# ... with 336,766 more rows, and 6 more variables: lat <dbl>,# lon <dbl>, alt <int>, tz <dbl>, dst <chr>, tzone <chr>

Filtering Joins

Filtering joins match observations in the same way as mutating joins, but affect the observations, not thevariables. There are two types:

• semi_join(x, y) keeps all observations in x that have a match in y• anti_join(x, y) drops all observations in x that have a match in y

Missing values

Course contents

• Chapter 9 of book “R for data science” by Wickham and Grolemun• R packages: tidyr, dplyr and tidyverse

Check object types

After R imports data, make sure variables have correct object types

• str gives object type of each variable in a data frame or tibble• summary gives object type of each variable, together with 5-value summary for each numeric object• head gives object type for each variable in a data frame or tibble

Check object types

> summary(flights2)year month day

Min. :2013 Min. : 1.000 Min. : 1.001st Qu.:2013 1st Qu.: 4.000 1st Qu.: 8.00Median :2013 Median : 7.000 Median :16.00Mean :2013 Mean : 6.549 Mean :15.713rd Qu.:2013 3rd Qu.:10.000 3rd Qu.:23.00Max. :2013 Max. :12.000 Max. :31.00

hour origin destMin. : 1.00 Length:336776 Length:3367761st Qu.: 9.00 Class :character Class :characterMedian :13.00 Mode :character Mode :characterMean :13.183rd Qu.:17.00Max. :23.00

tailnum carrier

8

Page 9: Stat 115 Lecture Notes 4 - Washington State University

Length:336776 Length:336776Class :character Class :characterMode :character Mode :character

> str(flights2)Classes 'tbl_df', 'tbl' and 'data.frame': 336776 obs. of 8 variables:$ year : int 2013 2013 2013 2013 2013 2013 2013 2013 2013 2013 ...$ month : int 1 1 1 1 1 1 1 1 1 1 ...$ day : int 1 1 1 1 1 1 1 1 1 1 ...$ hour : num 5 5 5 5 6 5 6 6 6 6 ...$ origin : chr "EWR" "LGA" "JFK" "JFK" ...$ dest : chr "IAH" "IAH" "MIA" "BQN" ...$ tailnum: chr "N14228" "N24211" "N619AA" "N804JB" ...$ carrier: chr "UA" "UA" "AA" "B6" ...

> head(flights2)# A tibble: 6 x 8

year month day hour origin dest tailnum carrier<int> <int> <int> <dbl> <chr> <chr> <chr> <chr>

1 2013 1 1 5 EWR IAH N14228 UA2 2013 1 1 5 LGA IAH N24211 UA3 2013 1 1 5 JFK MIA N619AA AA4 2013 1 1 5 JFK BQN N804JB B65 2013 1 1 6 LGA ATL N668DN DL6 2013 1 1 5 EWR ORD N39463 UA

Coerce an object

Force to change the type of a object by as.objecType()> x = c("1","2","3")> x[1] "1" "2" "3"> as.numeric(x)[1] 1 2 3> as.factor(x)[1] 1 2 3Levels: 1 2 3> as.matrix(x,3,1)

[,1][1,] "1"[2,] "2"[3,] "3"> y = c("a","b")> as.numeric(y)[1] NA NA

Missing observations/values

• Missing explicitly, i.e., flagged with NA (,i.e., presence of an absence)

9

Page 10: Stat 115 Lecture Notes 4 - Washington State University

• Missing implicitly, i.e., simply not present in the data (i.e., absence of a presence)

Example

> library(tidyr)> stocks <- tibble(+ year = c(2015, 2015, 2015, 2015, 2016, 2016, 2016),+ qtr = c( 1, 2, 3, 4, 2, 3, 4),+ return = c(1.88, 0.59, 0.35, NA, 0.92, 0.17, 2.66)+ )> stocks# A tibble: 7 x 3

year qtr return<dbl> <dbl> <dbl>

1 2015 1 1.882 2015 2 0.593 2015 3 0.354 2015 4 NA5 2016 2 0.926 2016 3 0.177 2016 4 2.66

Check for NA

Command: is.na() or any(is.na())> any(is.na(stocks))[1] TRUE> is.na(stocks)

year qtr return[1,] FALSE FALSE FALSE[2,] FALSE FALSE FALSE[3,] FALSE FALSE FALSE[4,] FALSE FALSE TRUE[5,] FALSE FALSE FALSE[6,] FALSE FALSE FALSE[7,] FALSE FALSE FALSE

Remove observatiosn with NAs

> stocks %>%+ spread(year, return) %>%+ gather(year, return, `2015`:`2016`, na.rm = TRUE)# A tibble: 6 x 3

qtr year return* <dbl> <chr> <dbl>1 1 2015 1.882 2 2015 0.593 3 2015 0.35

10

Page 11: Stat 115 Lecture Notes 4 - Washington State University

4 2 2016 0.925 3 2016 0.176 4 2016 2.66

Complete and fill in NAs

> stocks %>% complete(year, qtr)# A tibble: 8 x 3

year qtr return<dbl> <dbl> <dbl>

1 2015 1 1.882 2015 2 0.593 2015 3 0.354 2015 4 NA5 2016 1 NA6 2016 2 0.927 2016 3 0.178 2016 4 2.66

Mising as if not

scenario: missing values indicate that the previous value should be carried forward> library(tibble)> treatment <- tribble(+ ~ person, ~ treatment, ~response,+ "Derrick Whitmore", 1, 7,+ NA, 2, 10,+ NA, 3, 9,+ "Katherine Burke", 1, 4+ )> treatment# A tibble: 4 x 3

person treatment response<chr> <dbl> <dbl>

1 Derrick Whitmore 1 72 <NA> 2 103 <NA> 3 94 Katherine Burke 1 4

Fill in missing values

> treatment %>% fill(person)# A tibble: 4 x 3

person treatment response<chr> <dbl> <dbl>

1 Derrick Whitmore 1 72 Derrick Whitmore 2 10

11

Page 12: Stat 115 Lecture Notes 4 - Washington State University

3 Derrick Whitmore 3 94 Katherine Burke 1 4

License and session Information

License> sessionInfo()R version 3.5.0 (2018-04-23)Platform: x86_64-w64-mingw32/x64 (64-bit)Running under: Windows 7 x64 (build 7601) Service Pack 1

Matrix products: default

locale:[1] LC_COLLATE=English_United States.1252[2] LC_CTYPE=English_United States.1252[3] LC_MONETARY=English_United States.1252[4] LC_NUMERIC=C[5] LC_TIME=English_United States.1252

attached base packages:[1] stats graphics grDevices utils datasets methods[7] base

other attached packages:[1] tidyr_0.8.2 dplyr_0.7.8 nycflights13_1.0.0[4] knitr_1.21

loaded via a namespace (and not attached):[1] Rcpp_1.0.0 rstudioapi_0.8 bindr_0.1.1[4] magrittr_1.5 tidyselect_0.2.5 R6_2.3.0[7] rlang_0.3.0.1 fansi_0.4.0 stringr_1.3.1

[10] tools_3.5.0 xfun_0.4 utf8_1.1.4[13] cli_1.0.1 htmltools_0.3.6 yaml_2.2.0[16] digest_0.6.18 assertthat_0.2.0 tibble_1.4.2[19] crayon_1.3.4 bindrcpp_0.2.2 purrr_0.2.5[22] glue_1.3.0 evaluate_0.12 rmarkdown_1.11[25] stringi_1.2.4 compiler_3.5.0 pillar_1.3.1[28] pkgconfig_2.0.2

12


Top Related