![Page 1: Computing for Data Analysis R statistics programming environment Ming Ni mingni@buffalo.edu 11/14/2014](https://reader035.vdocuments.us/reader035/viewer/2022062421/56649cf95503460f949c9e1b/html5/thumbnails/1.jpg)
Introduction to the R language
Computing for Data Analysis
R statistics programming environment
Ming Ni
mingni@buffalo.edu
11/14/2014
http://tinyurl.com/ise-r-talk
Outline
1. Overview and History of R
2. Data types in R
3. Reading and Writing Data
4. Plotting Data
Overview and History of R
What is S?
• R is a dialect of the S language.
• S is a language that was developed by John Chambers and others at Bell Labs.
• S was initiated in 1976 as an internal statistical analysis environment, originally implemented as Fortran libraries.
• Version 4 of the S language was released in 1998 and is the version we use today.
What is R?
• 1991: R is created in New Zealand by Ross Ihaka and Robert Gentleman.
• 1993: R is first announced to the public.
• 1995: R is released under the GNU General Public License, making it free software.
• 1997: The R Core Group is formed. The core group controls the source code for R.
• 2000: R version 1.0.0 is released.
• 2014: R version 3.1.2 is the most recent release at the time of this talk.
Features of R
1. It is free!
2. The syntax and semantics are very similar to S
3. R is case sensitive
4. Commands are separated either by ; or by a newline
5. Runs on almost any standard computing platform/OS (Windows, Mac, Linux, even on the PlayStation 4)
Features of R, cont’d
6. Frequent releases (annual + bugfix releases); active development
7. Core software is quite lean; functionality is divided into modular packages
8. Graphics capabilities are very sophisticated
9. Useful for interactive work, but contains a powerful programming language for developing new tools
10. Very active and vibrant user community (mailing lists and Stack Overflow)
Drawbacks of R
1. Essentially based on 40-year-old technology
2. Little built-in support for dynamic or 3-D graphics
3. No help line you can call for support or for explanations of features
4. Objects must generally be stored in the physical memory of the computer (a real limitation in the big data age)
5. Not ideal for all possible situations. R cannot do everything!
Other Data Analysis Software
[Figure: the number of analytics jobs for the more popular software (250 jobs or more, 2/2014).]
[Figure: the number of scholarly articles found for each software (2/2014).]
Honorable mentions:
• Python with the packages NumPy, pandas, and SciPy
• SPSS Modeler: easy drag-and-drop nodes give access to advanced data analytics
Downloading and Installing R
http://cran.us.r-project.org/
Design of the R System
The R system is divided into 2 conceptual parts:
• The “base” R system that you download from CRAN
• Everything else
R functionality is divided into a number of packages:
• There are 4000+ packages on CRAN
• Packages are contributed by users and are not controlled by R Core
• There is also a large number of R packages outside of CRAN
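As a rough illustration of the split between the base system and contributed packages, the sketch below lists the packages that ship with base R and loads one of them. The package name in the commented-out install line is only an example and does not come from the slides.

```r
# Packages that ship with the base R system (no download needed)
base_pkgs <- rownames(installed.packages(priority = "base"))
print(base_pkgs)

# Load an installed package to make its functions available
library(stats)

# Installing a contributed package from CRAN needs network access, so it is
# left commented out; "ggplot2" is only an example name.
# install.packages("ggplot2")
```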
Getting Started with R
R Console and R Script
Getting Started with R
You can work directly in R, but most users prefer a graphical interface.
Integrated Development Environment (IDE):
• RStudio
• Tinn-R
• Deducer
• Revolution R (leverages R in Hadoop environments)
Text editor with plugins:
• Vim
• Eclipse + StatET
RStudio Server in a web browser
Getting Started with R
R starts as an interactive environment, where people do not consciously think of themselves as programming:
• Read tables
• Data analysis
• User
As their sophistication increases and they have a clear need, people are able to slide gradually into programming:
• Data processing
• Develop their own tools
• Programmer
Outline
1. Overview and History of R
2. Data types in R
3. Reading and Writing Data
4. Plotting Data
Data types in R
• Basic classes: numeric, integer, character, logical (TRUE/FALSE), complex
• vector, matrix, list
• factor
• missing value
• data frame
Assignment Operator
Entering input: at the R prompt we type expressions. The <- symbol is the assignment operator.
Expression: x <- 1
Object: x
Value: 1
Class of x: numeric
The hash symbol (#) starts a comment; everything to its right is ignored.
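A minimal sketch of assignment and comments at the prompt. The object msg is a made-up example and is not taken from the slides.

```r
x <- 1            # assign the value 1 to the object x
class(x)          # "numeric"

msg <- "hello"    # hypothetical second object holding a character string
class(msg)        # "character"

# A line that starts with the hash symbol is a comment and is not evaluated
```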
Printing
When a complete expression is entered at the prompt, it is evaluated and the result of the evaluated expression is returned. The result may be auto-printed.
The [1] indicates that x is a vector and that the value 1 is its first element.
The : operator is used to create integer sequences.
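The printing behaviour and the : operator described on these slides can be tried with a few lines such as the following. The values are illustrative only.

```r
x <- 5      # nothing is printed on assignment
x           # auto-printing; shows [1] 5
print(x)    # explicit printing; shows [1] 5

y <- 1:20   # integer sequence from 1 to 20
y           # longer vectors wrap; each output line starts with an index such as [1]
class(y)    # "integer"
```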
Create Vectors
The c() function can be used to create vectors of objects.
When different objects are mixed in a vector, coercion occurs so that every element in the vector is of the same class.
class(object) # class or type of an object
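A short sketch of c() and implicit coercion; the variable names are chosen here for illustration. When classes are mixed, R picks the most general class that can represent every element.

```r
num <- c(0.5, 0.6)        # numeric vector
lgl <- c(TRUE, FALSE)     # logical vector
chr <- c("a", "b", "c")   # character vector

mixed1 <- c(1.7, "a")     # numeric + character -> character
mixed2 <- c(TRUE, 2)      # logical + numeric   -> numeric (TRUE becomes 1)
mixed3 <- c("a", TRUE)    # character + logical -> character

class(mixed1)             # "character"
class(mixed2)             # "numeric"
class(mixed3)             # "character"
```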
Explicit Coercion
Objects can be explicitly coerced from one class to another using the as.* functions, if available.
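A small sketch of explicit coercion with the as.* functions. The input vectors are made up for the example; a coercion that makes no sense yields NA with a warning, which is suppressed here to keep the output quiet.

```r
x <- 0:6
as.numeric(x)      # same values, stored as numeric
as.logical(x)      # 0 becomes FALSE, every other number becomes TRUE
as.character(x)    # "0" "1" "2" "3" "4" "5" "6"

y <- c("a", "b")
z <- suppressWarnings(as.numeric(y))   # letters cannot become numbers
z                                      # NA NA
```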
vector, matrix, list
1. vector: A vector can only contain objects of the same class.
2. matrix: Matrices are vectors with a dimension attribute. The dimension attribute is an integer vector of length 2 (nrow, ncol).
3. list: Lists are a special type of vector that can contain elements of different classes. A list can also hold other lists, giving nested structures.
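The three containers can be compared side by side with a sketch like this one; all values are invented for the example.

```r
v <- c(1, 2, 3)                        # vector: every element has the same class

m <- matrix(1:6, nrow = 2, ncol = 3)   # matrix: a vector plus a dim attribute
dim(m)                                 # 2 3
m                                      # filled column by column

l <- list(1, "a", TRUE, c(2.5, 3.5))   # list: elements of different classes
l[[2]]                                 # "a"
nested <- list(id = 1, inner = list(a = "x", b = 2))   # a list inside a list
nested$inner$b                         # 2
```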
cbind-ing and rbind-ing
Matrices can be created by column-binding or row-binding with cbind() and rbind(). Both functions also work on data frames.
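A minimal sketch of both binding directions, using two made-up vectors of equal length.

```r
x <- 1:3
y <- 10:12

cb <- cbind(x, y)   # x and y become columns: a 3 x 2 matrix
rb <- rbind(x, y)   # x and y become rows:    a 2 x 3 matrix

dim(cb)             # 3 2
dim(rb)             # 2 3
```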
Factor
A factor is a special type of vector. Factors are used to represent categorical data.
• Factors can be unordered or ordered.
• Each element of a factor has a label.
Factors are treated specially by modelling functions like lm() and glm().
Frequency tables can be generated from a factor with the table() function.
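A short sketch of creating factors and tabulating them. The labels are invented for illustration.

```r
# Unordered factor: levels are sorted alphabetically by default
f <- factor(c("yes", "yes", "no", "yes", "no"))
levels(f)            # "no" "yes"

counts <- table(f)   # frequency of each level
counts

# Ordered factor: the level order is given explicitly
o <- factor(c("low", "high", "medium"),
            levels = c("low", "medium", "high"),
            ordered = TRUE)
o[1] < o[2]          # TRUE, because "low" comes before "high"
```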
Missing Values
Missing values are denoted by NA; the results of undefined mathematical operations are denoted by NaN.
• NaN stands for Not a Number; for example, 0/0 gives NaN.
• NA is generally interpreted as a missing value.
• NA values have a class also, so there are integer NA, character NA, logical NA, etc.
• A NaN value is also NA, but the converse is not true.
Missing Values Functions
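The examples from this slide did not survive in the transcript. The sketch below assumes the functions meant are is.na() and is.nan(), the standard base R tests for missing values, and uses made-up data.

```r
x <- c(1, 2, NA, 10, 3)
is.na(x)                 # FALSE FALSE  TRUE FALSE FALSE
is.nan(x)                # all FALSE: NA is not NaN

y <- c(1, NaN, NA, 4)
is.na(y)                 # FALSE  TRUE  TRUE FALSE (NaN counts as NA)
is.nan(y)                # FALSE  TRUE FALSE FALSE

0 / 0                    # NaN
mean(x)                  # NA, because one value is missing
mean(x, na.rm = TRUE)    # 4, missing values are dropped first
```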
Summary
• Basic classes: numeric, integer, character, logical (TRUE/FALSE), complex
• vector, matrix, list
• factor
• missing value
• data frame
Outline
1. Overview and History of R
2. Data types in R
3. Reading and Writing Data
4. Plotting Data
Reading and Writing Data
Principal functions for reading data into R:
• read.table, read.csv, for reading tabular data (.csv, .txt)
• readLines, for reading lines of a text file
• source, for reading in R code files (.r)
• load, for reading in saved workspaces (.rdata)
Analogous functions for writing data to files:
• write.table (.txt, .csv)
• writeLines
• dump
• save
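A round trip through a temporary file shows how the reading and writing functions pair up. This is only a sketch; the data frame and file are invented for the example.

```r
df <- data.frame(id = 1:3, name = c("a", "b", "c"))

path <- tempfile(fileext = ".csv")                    # throwaway file
write.table(df, path, sep = ",", row.names = FALSE)   # write as comma-separated text

back  <- read.csv(path)     # read the table back as a data frame
lines <- readLines(path)    # read the same file as raw lines of text
lines[1]                    # the header line

unlink(path)                # remove the temporary file
```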
The read.table function is one of the most commonly used functions for reading data. It has a few important arguments:
read.table(file, header, sep, colClasses, nrows, skip, stringsAsFactors)
• file, the name of a file, or a connection
• header, a logical indicating whether the file has a header line
• sep, a string indicating how the columns are separated
• colClasses, a character vector indicating the class of each column in the dataset
• nrows, the number of rows in the dataset
• skip, the number of lines to skip from the beginning
• stringsAsFactors, should character variables be coded as factors?
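The arguments listed above can be seen working together in a sketch like the following. The data are hypothetical, and the text argument of read.table stands in for a file on disk so the example is self-contained.

```r
# Hypothetical file contents: one title line, a header line, three data rows
raw <- "sample measurements
id,species,length
1,setosa,5.1
2,virginica,6.3
3,versicolor,5.9"

df <- read.table(text = raw,          # text= is used here instead of file=
                 header = TRUE,       # the first line read holds column names
                 sep = ",",           # columns are comma-separated
                 colClasses = c("integer", "character", "numeric"),
                 nrows = 3,           # read three data rows
                 skip = 1,            # skip the title line
                 stringsAsFactors = FALSE)
str(df)
```

Giving colClasses and nrows is optional, but on large files it saves R from guessing column types and row counts.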
• read.table(file, header, sep)
• The other arguments of the function use their default values. How can you check them?
The usage section of the help file for the read.table function, from the R documentation:
read.table(file, header = FALSE, sep = "", quote = "\"'", dec = ".",
  numerals = c("allow.loss", "warn.loss", "no.loss"),
  row.names, col.names, as.is = !stringsAsFactors,
  na.strings = "NA", colClasses = NA, nrows = -1, skip = 0,
  check.names = TRUE, fill = !blank.lines.skip, strip.white = FALSE,
  blank.lines.skip = TRUE, comment.char = "#", allowEscapes = FALSE,
  flush = FALSE, stringsAsFactors = default.stringsAsFactors(),
  fileEncoding = "", encoding = "unknown", text, skipNul = FALSE)
Check the R help documentation:
1. ?read.table: precede the name of the function with ?
2. ??keyword: searches the R documentation for keyword
3. Google "read.table r"
If you cannot follow the help documentation, look at the examples first; they are at the end of the help page.
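Besides the help pages, the default values of a function's arguments can be inspected from code. The sketch below uses args() and formals() from base R; the interactive help calls are commented out so the snippet also runs as a script. Defaults can differ between R versions, so the output reflects the installed version.

```r
# Interactive forms of the help lookups:
# ?read.table
# ??"read table"
# example(read.table)             # runs the examples from the help page

args(read.table)                  # prints the full argument list with defaults
defaults <- formals(read.table)   # the same information as a list
defaults$header                   # FALSE
defaults$skip                     # 0
```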
Data frames are used to store tabular data (the key data type used in R).
1. They are represented as a special type of list where every element of the list has to have the same length
2. Unlike matrices, data frames can store different classes of objects in each column (just like lists)
3. Data frames also have a special attribute called row.names, used to annotate the data
4. Data frames are usually created by calling read.table() or read.csv()
5. They can be converted to a matrix by calling data.matrix()
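The properties listed above can be checked on a small data frame built by hand. The column names and values are invented for this sketch.

```r
df <- data.frame(id     = 1:3,
                 score  = c(90.5, 82, 77.25),
                 passed = c(TRUE, TRUE, FALSE))

is.list(df)          # TRUE: a data frame is a special kind of list
sapply(df, class)    # each column keeps its own class
nrow(df); ncol(df)   # 3 rows, 3 columns

row.names(df) <- c("r1", "r2", "r3")   # annotate the rows

m <- data.matrix(df)   # convert to a numeric matrix; TRUE/FALSE become 1/0
m
```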
Demo
• The Iris data set consists of 50 samples from each of three species of Iris flowers (Iris setosa, Iris virginica and Iris versicolor).
• 4 attributes were measured from each sample.
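The demo script itself is not part of the transcript. As a stand-in, the sketch below assumes the copy of the data that ships with R as the built-in iris data frame, and writes it to a temporary CSV file to practise reading it back.

```r
data(iris)             # built-in copy of the Iris data set
str(iris)              # 4 numeric measurements plus the Species factor
dim(iris)              # 150 5
table(iris$Species)    # 50 samples per species
head(iris, 3)          # first three rows

# Mean sepal length per species
means <- aggregate(Sepal.Length ~ Species, data = iris, FUN = mean)
means

# Write the data to a CSV file and read it back
out <- tempfile(fileext = ".csv")
write.csv(iris, out, row.names = FALSE)
iris2 <- read.csv(out)
unlink(out)
```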
Outline
1. Overview and History of R
2. Data types in R
3. Reading and Writing Data
4. Plotting Data
Plotting Data
The plotting and graphics engine in R lives in a few base and recommended packages:
• graphics: contains plotting functions for the “base” graphing system, including plot, hist, boxplot, etc.
• lattice
• grid
• grDevices
Common questions about R plotting:
• Where to plot: R graphics devices
• How to plot: functions with parameters
• Need to resize: choose the export format
The process of making a base plot in R:
• Base graphics are usually constructed piece by piece.
• Each aspect of the plot is handled separately through a series of function calls.
• This mirrors the thought process.
Base plotting is the most commonly used system and is very powerful for creating 2-D graphics.
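The piece-by-piece workflow can be sketched as follows. The choice of the built-in iris data, the PDF device, and the temporary output file are assumptions made for this example, not part of the slides.

```r
# Where to plot: open a PDF graphics device on a temporary file
out <- tempfile(fileext = ".pdf")
pdf(out, width = 6, height = 4)

# How to plot: build the figure one piece at a time
plot(iris$Sepal.Length, iris$Petal.Length, type = "n",
     xlab = "Sepal length (cm)", ylab = "Petal length (cm)")    # empty frame with axes
points(iris$Sepal.Length, iris$Petal.Length,
       col = as.integer(iris$Species))                           # add the data points
abline(lm(Petal.Length ~ Sepal.Length, data = iris), lty = 2)    # add a fitted line
title(main = "Iris: petal length vs. sepal length")              # add the title
legend("topleft", legend = levels(iris$Species), col = 1:3, pch = 1)

dev.off()   # close the device so the file is written
```

A vector format such as PDF can be resized without loss of quality, which is one reason the export format matters.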
[Figure: layout of a base plot, showing the plot title, the Y label, the X label, and margins 1, 2, 3, 4.]
Some Important Base Graphics Parameters
The par() function is used to specify global graphics parameters that affect all plots in an R session.
• pch: the plotting symbol (default is open circle)
• lty: the line type (solid, dashed, dotted)
• lwd: the line width
• col: the plotting color
• las: the orientation of the axis labels
• bg: the background color
• mar: the margin size
• mfrow: number of plots per row, column (plots are filled row-wise)
• mfcol: number of plots per row, column (plots are filled column-wise)
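A brief sketch of several of these parameters in use. Global settings go through par(), while pch, col, lty and lwd are passed to the individual plotting calls; the PDF device and the temporary file are assumptions made so the example runs without a display.

```r
out <- tempfile(fileext = ".pdf")
pdf(out)

# Global settings: two panels in one row, custom margins, horizontal axis labels
old <- par(mfrow = c(1, 2), mar = c(4, 4, 2, 1), las = 1, bg = "white")

plot(1:10, pch = 19, col = "blue", main = "pch and col")         # filled blue circles
plot(1:10, type = "l", lty = 2, lwd = 2, main = "lty and lwd")   # thick dashed line

mfrow_used <- par("mfrow")   # query the current panel layout
par(old)                     # restore the previous settings
dev.off()
```

Saving the value returned by par() and restoring it afterwards keeps the changes from leaking into later plots.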
Demo: R base plotting
Ming Ni
Student of Industrial and Systems Engineering, State University of New York at Buffalo
Email: mingni@buffalo.edu
Qing He, Ph.D.