![Page 1: Data Analysis in R - cdn.cs50.netcdn.cs50.net/2014/fall/seminars/data_analysis_r/data_analysis_r.pdf · • The biggest bottleneck in data analysis is cognitive. • You need tools](https://reader030.vdocuments.us/reader030/viewer/2022040713/5e180cb9e00d2f23c271c880/html5/thumbnails/1.jpg)
Data Analysis in R
@dustinvtran
Engineering and Applied Sciences @Harvard University
1. introduction
What is R?
• R is a language developed for statistical computing and visualization
• It is free and open source
• It is a dynamic, lazy, functional, and object-oriented language
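
A small sketch of what "dynamic", "lazy", and "functional" can look like in practice. It is not taken from the slides; the names `ignore_second`, `square`, and `thing` are made up for illustration.

```r
# Lazy: a function argument is only evaluated if the body actually uses it.
# The second argument would raise an error if it were ever evaluated.
ignore_second <- function(x, y) {
  x * 2
}
lazy_result <- ignore_second(21, stop("never evaluated"))
lazy_result
## [1] 42

# Functional: functions are ordinary values that can be passed around.
square <- function(v) v^2
squares <- sapply(1:4, square)
squares
## [1]  1  4  9 16

# Dynamic: a name can be rebound to a value of a different type.
thing <- 3
thing <- "three"
class(thing)
## [1] "character"
```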
Bottlenecks
• The biggest bottleneck in data analysis is cognitive.
• You need tools (domain-specific languages) to help you define the problem and express solutions programmatically.
R...
• has an enormous number of packages for statistical modelling, machine learning, visualization, and importing and manipulating data
• is designed to interface with high-performance computing languages such as Fortran and C++
• can also integrate web visualizations from JavaScript libraries such as D3.js, Leaflet, Google Charts
2. fundamentals
2 + 2
2 * pi
7 + runif(1)
3^4

sqrt(4^4)
log(10)
log(100, base=10)

23 %% 2    # 23 mod 2
23 %/% 2   # floor(23/2)
5e9 * 1e3  # 5000000000 * 1000
val <- 3
val
## [1] 3
print(val)
## [1] 3

val = 1:6
val
## [1] 1 2 3 4 5 6
R objects
• Vector: vector of some type (all entries have the same type)
# numeric
nums <- c(1.1, 3, -5.7)
devs <- rnorm(2)
devs
## [1] 1.8469193 0.4091781

# integer
ints <- c(1L, 5L, -3L)
ints
## [1] 1 5 -3
# character
chars <- c('arthur', "marvin's", "marvin\"s")
chars
## [1] "arthur" "marvin's" "marvin\"s"

# logical
bools <- c(TRUE, FALSE, TRUE)
bools
## [1] TRUE FALSE TRUE
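
Because all entries of a vector share one type, mixing types in `c()` makes R coerce everything to the most general type present. A brief sketch, with variable names (`mixed_num`, `mixed_lgl`, `mixed_chr`) invented for illustration rather than taken from the slides:

```r
# Hypothetical vectors, not from the slides.
mixed_num <- c(1L, 2.5)       # integer + numeric   -> numeric
mixed_lgl <- c(TRUE, 2L)      # logical + integer   -> integer
mixed_chr <- c(1, "a", TRUE)  # anything + character -> character

class(mixed_num)
## [1] "numeric"
class(mixed_lgl)
## [1] "integer"
class(mixed_chr)
## [1] "character"
mixed_chr
## [1] "1"    "a"    "TRUE"
```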
vals <- seq(2, 12, by=2)
vals
## [1] 2 4 6 8 10 12
vals[3]
## [1] 6
vals[3:5]
## [1] 6 8 10
vals[c(1, 3, 6)]
## [1] 2 6 12
vals[-c(1, 3, 6)]
## [1] 4 8 10
# logical mask with one entry per element of vals
vals[c(rep(TRUE, 3), rep(FALSE, 3))]
## [1] 2 4 6
set.seed(42)
vals <- rnorm(3)
vals
## [1] 1.3709584 -0.5646982 0.3631284

vals[1:2] <- 0
vals
## [1] 0.0000000 0.0000000 0.3631284

vals[vals != 0] <- 5
vals
## [1] 0 0 5
vec1 <- 1:3
vec2 <- 3:5
vec1 + vec2
## [1] 4 6 8
vec1 * vec2
## [1] 3 8 15
vec1 >= vec2
## [1] FALSE FALSE FALSE
• Matrix: matrix of some type (all entries have the same type)

mat <- matrix(1:9, nrow = 3)
mat
##      [,1] [,2] [,3]
## [1,]    1    4    7
## [2,]    2    5    8
## [3,]    3    6    9
dim(mat)
class(mat)
t(mat) %*% mat
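
As a rough illustration of how the same matrix can be indexed, and of how `*` (element-wise) differs from `%*%` (matrix product), here is a short sketch. The names `elementwise` and `product` are introduced only for this example.

```r
mat <- matrix(1:9, nrow = 3)  # filled column by column

mat[2, 3]  # row 2, column 3
## [1] 8
mat[, 1]   # entire first column
## [1] 1 2 3

# Element-wise multiplication squares each entry.
elementwise <- mat * mat
elementwise[2, 3]
## [1] 64

# Matrix product: entry [1, 1] is the sum of squares of column 1.
product <- t(mat) %*% mat
product[1, 1]
## [1] 14
```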
• Data frame: collection of columns (each column can be a different type)

dat <- data.frame(ints=1:3,
                  chars=c("hello", "world", "foo"))
dat
##   ints chars
## 1    1 hello
## 2    2 world
## 3    3   foo
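
A sketch of a few common ways to pull columns and rows out of a data frame like the one above. `stringsAsFactors = FALSE` is added here as an assumption so that `chars` is a character column on every R version (before R 4.0 the default converted strings to factors); `big` and `col_types` are illustrative names.

```r
dat <- data.frame(ints = 1:3,
                  chars = c("hello", "world", "foo"),
                  stringsAsFactors = FALSE)

dat$ints          # one column, as a vector
## [1] 1 2 3
dat[["chars"]]    # same idea, using the column name as a string
## [1] "hello" "world" "foo"

# Keep only the rows where a condition holds.
big <- dat[dat$ints > 1, ]
nrow(big)
## [1] 2

# Each column keeps its own type.
col_types <- sapply(dat, class)
col_types
##        ints       chars
##   "integer" "character"
```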
• List: collection of objects

list(stuff = 3,
     mat = matrix(1:4, nrow = 2),
     moreStuff = "china",
     list(5, "bear"))
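
To show how elements of such a list might be retrieved, here is a minimal sketch that stores the same list under the hypothetical name `things`. `$` and `[[ ]]` return the element itself, while single brackets return a smaller list.

```r
things <- list(stuff = 3,
               mat = matrix(1:4, nrow = 2),
               moreStuff = "china",
               list(5, "bear"))

things$stuff           # access by name
## [1] 3
things[["mat"]][2, 2]  # access by name, then index into the matrix
## [1] 4
things[[4]][[2]]       # unnamed fourth element is itself a list
## [1] "bear"

class(things["stuff"]) # single brackets keep the list wrapper
## [1] "list"
names(things)          # unnamed elements get an empty name
## [1] "stuff"     "mat"       "moreStuff" ""
```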
help(lm)
?lm
3. demo
4. closer
Resources
Guides
• Text: Hadley Wickham's "Advanced R"
• Videos: 2013 R bootcamp at UC Berkeley
• Interactive: DataCamp

Community & Help
• mailing lists
• #rstats
• useR!
• Stack Overflow, Google, GitHub, ...
@dustinvtran • dustinvtran.com • [email protected]