Page 1: Getting to Know Your Data with R

dev.Objective() ~ Thursday, 14 May 2015

Getting to Know Your Data with R

Steve Withington

Page 2: Getting to Know Your Data with R

Steve Withington

✤ Blue River / Mura CMS Application Engineer & Instructor

✤ dev.Objective() Steering Committee Member since 2010

✤ Author

✤ Marathoner & Triathlete

✤ Amateur Data Enthusiast

Page 3: Getting to Know Your Data with R

Steps in a Data Analysis

✤ Define the question

✤ Define the ideal data set

✤ Determine what data you can access

✤ Obtain the data

✤ Clean the data

✤ Exploratory data analysis

✤ Statistical prediction/modeling

✤ Interpret results

✤ Challenge results

✤ Synthesize/write up results

✤ Create reproducible code

Page 5: Getting to Know Your Data with R

Types of Data Science Questions

✤ In approximate order of difficulty

✤ Descriptive

✤ Exploratory

✤ Inferential

✤ Predictive

✤ Causal

✤ Mechanistic

Page 6: Getting to Know Your Data with R

Descriptive Analysis

✤ Goal: Describe a set of data

✤ The first kind of data analysis performed

✤ Commonly applied to census data

✤ The description and interpretation are different steps

✤ Descriptions usually cannot be generalized without additional statistical modeling

Page 7: Getting to Know Your Data with R

Descriptive Analysis

– http://www.census.gov/2010census/

Page 8: Getting to Know Your Data with R

Exploratory Analysis

✤ Goal: Find relationships you didn’t know about

✤ Exploratory models are good for discovering new connections

✤ They are also useful for defining future studies

✤ Exploratory analyses are usually not the final say

✤ Exploratory analyses alone should not be used for generalizing/predicting

✤ Correlation does not imply causation

Page 9: Getting to Know Your Data with R

Correlation is Not Causation

✤ Even if you observe that two variables are correlated, you have to convince yourself that the relationship is not caused by some other variable you didn’t measure

✤ Example: shoe size and literacy (in children, both increase with age)

Page 10: Getting to Know Your Data with R

Exploratory Analysis

– http://www.sdss.org

Page 11: Getting to Know Your Data with R

Inferential Analysis

✤ Goal: Use a relatively small sample of data to say something about a bigger population

✤ Inference is commonly the goal of statistical models

✤ Inference involves estimating both the quantity you care about and your uncertainty about your estimate

✤ Inference depends heavily on both the population and the sampling scheme

Page 12: Getting to Know Your Data with R

Inferential Analysis

– http://www.ncbi.nlm.nih.gov/pmc/articles/PMC3521092/

Page 13: Getting to Know Your Data with R

Predictive Analysis

✤ Goal: To use the data on some objects to predict the values for another object

✤ If X predicts Y it does not mean that X causes Y

✤ Accurate prediction depends heavily on measuring the right variables

✤ Although there are better and worse prediction models, more data and a simple model often work really well

✤ Prediction is very hard, especially about the future

Page 14: Getting to Know Your Data with R

Predictive Analysis

– http://fivethirtyeight.com/interactives/uk-general-election-predictions/

Page 15: Getting to Know Your Data with R

Predictive Analysis

– http://www.forbes.com/sites/kashmirhill/2012/02/16/how-target-figured-out-a-teen-girl-was-pregnant-before-her-father-did/

Page 16: Getting to Know Your Data with R

Causal Analysis

✤ Goal: To find out what happens to one variable when you make another variable change

✤ Usually randomized studies are required to identify causation

✤ There are approaches to inferring causation in non-randomized studies, but they are complicated and sensitive to assumptions

✤ Causal relationships are usually identified as average effects, but may not apply to every individual

✤ Causal models are usually the “gold standard” for data analysis

Page 17: Getting to Know Your Data with R

Causal Analysis

– http://www.nejm.org/doi/full/10.1056/NEJM200007133430201

Page 18: Getting to Know Your Data with R

Mechanistic Analysis

✤ Goal: Understand the exact changes in variables that lead to changes in other variables for individual objects

✤ Incredibly hard to infer, except in simple situations

✤ Usually modeled by a deterministic set of equations (physical/engineering science)

✤ Generally the random component of the data is measurement error

✤ If the equations are known but the parameters are not, they may be inferred with data analysis

Page 19: Getting to Know Your Data with R

Mechanistic Analysis

– http://www.ellisonfoundation.org/content/mechanistic-analysis-transcriptional-switch-regulating-p53-activity-and-premature-aging

Page 20: Getting to Know Your Data with R

“You can have data without information, but you cannot have information without data.”

–Daniel Keys Moran

Page 22: Getting to Know Your Data with R

New Terminology

– http://www.dailymail.co.uk/sciencetech/article-2247081/There-soon-words-data-stored-world.html

Page 23: Getting to Know Your Data with R

Yottabyte

✤ 1000^1 kB kilobyte

1000^2 MB megabyte

1000^3 GB gigabyte

1000^4 TB terabyte

1000^5 PB petabyte

1000^6 EB exabyte

1000^7 ZB zettabyte

1000^8 YB yottabyte

✤ 1 YB = 1000^8 bytes = 10^24 bytes = 1 000 000 000 000 000 000 000 000 bytes = 1000 zettabytes = 1 trillion terabytes

Page 24: Getting to Know Your Data with R

NSA’s Bumblehive

– http://www.bbc.com/news/business-26383058 (Capable of storing one thousand trillion gigabytes!)

Page 25: Getting to Know Your Data with R

Data

✤ Companies have terabytes of data about the consumers they interact with

✤ Governmental, academic, and private research institutions have extensive archival and survey data on every manner of research topic

Page 26: Getting to Know Your Data with R

The Challenge

✤ Presenting information in easily accessible and digestible ways

Page 27: Getting to Know Your Data with R

Visualizations

✤ We’re really good at gleaning useful information from visual presentations

✤ Graphical presentations are an excellent way to convey results and uncover meaning

Page 28: Getting to Know Your Data with R

Visualizing Friendships

– https://www.facebook.com/note.php?note_id=469716398919

Page 30: Getting to Know Your Data with R

“Data is a set of values of qualitative or quantitative variables; restated, pieces of data are individual pieces of information.”

– http://en.wikipedia.org/wiki/Data

Page 31: Getting to Know Your Data with R

Definition of Data

“Data is a set of values of qualitative or quantitative variables; restated, pieces of data are individual pieces of information.”

✤ Qualitative: properties that are observed and generally cannot be measured with a numerical result. (Sex, country of origin, etc.)

✤ Quantitative: properties that can exist as a magnitude or multitude. (Height, weight, blood pressure, etc.)

Page 32: Getting to Know Your Data with R

Definition of Data

“Data is a set of values of qualitative or quantitative variables; restated, pieces of data are individual pieces of information.”

✤ Variable: a logical set of attributes

✤ Attribute: a characteristic of an object (person, thing, etc.)

Page 33: Getting to Know Your Data with R

Definition of Data

“Data is a set of values of qualitative or quantitative variables; restated, pieces of data are individual pieces of information.”

✤ Information: an answer to a question, as well as that from which knowledge and data can be derived (as data represents values attributed to parameters, and knowledge signifies understanding of real things or abstract concepts). As it regards data, the information’s existence is not necessarily coupled to an observer (it exists beyond an event horizon, for example), while in the case of knowledge, information requires a cognitive observer.

Page 34: Getting to Know Your Data with R

“The data may not contain the answer. The combination of some data and an aching desire for an answer does not ensure that a reasonable answer can be extracted from a given body of data.”

–John Wilder Tukey

Page 35: Getting to Know Your Data with R

What Data Looks Like

– http://brianknaus.com/software/srtoolbox/s_4_1_sequence80.txt

Page 36: Getting to Know Your Data with R

What Data Looks Like

– https://developers.facebook.com/docs/graph-api

Page 37: Getting to Know Your Data with R

What Data Looks Like

– http://www.nytimes.com/2012/06/26/technology/in-a-big-network-of-computers-evidence-of-machine-learning.html

Page 38: Getting to Know Your Data with R

What Data Looks Like

– http://www.baseball-reference.com/leagues/AL/2015-standard-pitching.shtml

Page 39: Getting to Know Your Data with R

Data is Second Most Important

✤ The most important thing in data science is the question

✤ The second most important is the data

✤ Often the data will limit or enable the questions

✤ But having data can’t save you if you don’t have a question

Page 41: Getting to Know Your Data with R

“We wanted users to be able to begin in an interactive environment, where they did not consciously think of themselves as programming. Then as their needs became clearer and their sophistication increased, they should be able to slide gradually into programming, when the language and system aspects would become more important.”

–John Chambers, “Stages in the Evolution of S”

Page 42: Getting to Know Your Data with R

R

✤ Comprehensive statistical platform (just about any type of data analysis can be done in R)

✤ Useful for interactive work, and also contains a powerful programming language for developing new tools (user -> programmer)

✤ Open source & free (see http://www.fsf.org)

✤ State-of-the-art graphics

✤ Easily import data from a wide variety of sources

✤ RStudio IDE

✤ Developer ecosystem and active community (mailing lists, Stack Overflow, etc.)

✤ Can be integrated into other languages, including C++, Java, Python, PHP, etc.

Page 43: Getting to Know Your Data with R

Design of the R System

✤ The “base” R system that you download from CRAN

✤ Everything else

Page 44: Getting to Know Your Data with R

The R System

✤ R functionality is divided into a number of packages

✤ The “base” R system contains, among other things, the base package, which is required to run R and contains the most fundamental functions

✤ There are several other packages contained in the “base” system as well as several “recommended” packages

✤ Over 6,000 other packages, developed by users and programmers around the world, are available on CRAN

✤ In addition, sites such as http://bioconductor.org (for genomic and biological data analysis) host a plethora of R packages

Page 45: Getting to Know Your Data with R

Where to Get R

✤ Comprehensive R Archive Network (CRAN) http://cran.r-project.org

✤ Linux, Mac OS X, and Windows

Page 46: Getting to Know Your Data with R

Getting Started

✤ If you think of R as having a programming language instead of being one, you’ll be better off in the long run

✤ It’s a case-sensitive, interpreted language

✤ Enter commands one at a time via the command line, or sets of commands from a source file

✤ Wide variety of data types, including vectors, matrices, data frames (similar to recordsets), and lists (collections of objects)

✤ An object is basically anything that can be assigned a value

Page 47: Getting to Know Your Data with R

demo()

Page 48: Getting to Know Your Data with R

RStudio

Page 49: Getting to Know Your Data with R

RStudio’s Interface

Page 50: Getting to Know Your Data with R

text-file.R

Page 51: Getting to Know Your Data with R

Some Important R Functions

✤ Access help file

Page 52: Getting to Know Your Data with R

Some Important R Functions

✤ Search help files

Page 53: Getting to Know Your Data with R

Some Important R Functions

✤ Get arguments

✤ See code
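
The code shown on these three slides did not survive in the transcript. As a minimal sketch of the usual base R calls, assuming `sd` as the example function (that choice is not from the slides):

```r
# Help pages open in a pager or browser, so only request them interactively.
if (interactive()) {
  help("sd")                         # access a help file; ?sd is the shorthand
  help.search("standard deviation")  # search help files; ??"standard deviation" also works
}

# Get the arguments of a function.
args(sd)

# See the code: typing a function name without parentheses prints its source.
sd
```

The same pattern applies to any function name in place of `sd`.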

Page 54: Getting to Know Your Data with R

R Reference Card

– http://cran.r-project.org/doc/contrib/Short-refcard.pdf

Page 55: Getting to Know Your Data with R

Entering Input

✤ At the R prompt we type expressions. The <- symbol is the assignment operator

✤ The grammar of the language determines whether an expression is complete or not

✤ The # character indicates a comment. Anything to the right of the # (including the # itself) is ignored
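
A small sketch of what typing at the prompt might look like; the variable names `x` and `msg` are only illustrative:

```r
x <- 1          # assign the value 1 to x; nothing is printed
print(x)        # explicit printing
msg <- "hello"  # assign a character value

# Typing an incomplete expression such as `x <-` and pressing Enter makes the
# console show a `+` continuation prompt and wait for the rest of the input.
```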

Page 56: Getting to Know Your Data with R

Evaluation

✤ When a complete expression is entered at the prompt, it is evaluated and the result of the evaluated expression is returned. The result may be auto-printed.

✤ The [1] indicates that x is a vector and 5 is the first element

Page 57: Getting to Know Your Data with R

Printing

✤ The : operator is used to create integer sequences
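
The evaluation and printing behavior described on the last two slides can be sketched as follows (variable names are illustrative):

```r
x <- 5     # assignment: nothing is printed
x          # auto-printing at the prompt shows: [1] 5
print(x)   # explicit printing gives the same output

y <- 1:20  # the : operator creates the integer sequence 1, 2, ..., 20
y          # when output wraps, each line starts with the index of its first element
```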

Page 58: Getting to Know Your Data with R

Objects

✤ R has five basic or “atomic” classes of objects:

✤ character

✤ numeric (real numbers)

✤ integer

✤ complex

✤ logical (TRUE/FALSE)

Page 59: Getting to Know Your Data with R

Objects

✤ The most basic object is a vector

✤ Vectors are one-dimensional arrays that can hold data

✤ A vector can only contain objects of the same class

✤ The one exception is a list, which is represented as a vector but can contain objects of different classes (that’s usually why lists are used)

✤ Empty vectors can be created with the vector() function

Page 60: Getting to Know Your Data with R

Numbers

✤ Numbers in R are generally treated as numeric objects (i.e., double precision real numbers)

✤ If you explicitly want an integer, you need to specify the L suffix

✤ For example, entering 1 gives you a numeric object; entering 1L explicitly gives you an integer

✤ There is also a special number Inf, which represents infinity (e.g., 1 / 0); Inf can be used in ordinary calculations (e.g., 1 / Inf is 0)

✤ The value NaN (“not a number”) represents an undefined value (e.g., 0 / 0); NaN can also be thought of as a missing value
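
These points are easy to check at the console; a short sketch using only base R:

```r
class(1)       # "numeric"
class(1L)      # "integer"
1 / 0          # Inf
1 / Inf        # 0
0 / 0          # NaN
is.nan(0 / 0)  # TRUE
```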

Page 61: Getting to Know Your Data with R

Attributes

✤ R objects can have attributes

✤ names, dimnames

✤ dimensions (e.g., matrices, arrays)

✤ class

✤ length

✤ other user-defined attributes/metadata

✤ Attributes of an object can be accessed using the attributes() function

Page 62: Getting to Know Your Data with R

Vectors

✤ Vectors are one-dimensional arrays that can hold numeric data, character data, or logical data. The combine function c() is used to form the vector.

✤ You can also use the vector() function

Page 63: Getting to Know Your Data with R

Vectors

✤ You can refer to elements of a vector using a numeric vector of positions within brackets. For example, a[c(2, 4)] refers to the second and fourth elements of vector a
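
A brief sketch of creating and indexing vectors; the values are made up for illustration:

```r
a <- c(1, 2, 5, 3, 6, -2, 4)         # numeric vector
b <- c("one", "two", "three")        # character vector
d <- c(TRUE, TRUE, FALSE)            # logical vector
v <- vector("numeric", length = 10)  # ten zeros

a[3]        # third element: 5
a[c(2, 4)]  # second and fourth elements: 2 3
a[2:4]      # elements two through four: 2 5 3
```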

Page 64: Getting to Know Your Data with R

Mixing Objects

✤ What about the following?

✤ When different objects are mixed in a vector, coercion occurs so that every element in the vector is of the same class
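
The slide's examples are missing from the transcript; the following sketch shows the kind of mixing in question, with illustrative values:

```r
y1 <- c(1.7, "a")   # numeric + character -> character
y2 <- c(TRUE, 2)    # logical + numeric   -> numeric (TRUE becomes 1)
y3 <- c("a", TRUE)  # character + logical -> character

class(y1)  # "character"
class(y2)  # "numeric"
class(y3)  # "character"
```

No error is raised in any of these cases; R silently picks the class that can represent every element.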

Page 65: Getting to Know Your Data with R

Explicit Coercion

✤ Objects can be explicitly coerced from one class to another using the as.* functions, if available

Page 66: Getting to Know Your Data with R

Explicit Coercion

✤ Nonsensical coercion results in NAs
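
A short sketch of explicit coercion with the `as.*` functions, including a case that produces NAs (values are illustrative):

```r
x <- 0:6
class(x)         # "integer"
as.numeric(x)    # 0 1 2 3 4 5 6
as.logical(x)    # FALSE TRUE TRUE TRUE TRUE TRUE TRUE
as.character(x)  # "0" "1" "2" "3" "4" "5" "6"

# Nonsensical coercion yields NA values.
z <- c("a", "b", "c")
as.numeric(z)    # NA NA NA, with a warning
as.logical(z)    # NA NA NA
```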

Page 67: Getting to Know Your Data with R

Matrices

✤ A matrix is a two-dimensional array. It has a dimension attribute which is an integer vector of length 2 (nrow, ncol)

Page 68: Getting to Know Your Data with R

Matrices

✤ Matrices are constructed column-wise, so entries can be thought of as starting in the “upper left” corner and running down the columns

Page 69: Getting to Know Your Data with R

Matrices

✤ Matrices can also be created directly from vectors by adding a dimension attribute

Page 70: Getting to Know Your Data with R

cbind-ing and rbind-ing

✤ Matrices can also be created by column-binding or row-binding with cbind() and rbind()
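
The three ways of building a matrix covered on the preceding slides can be sketched as follows (values are illustrative):

```r
m <- matrix(1:6, nrow = 2, ncol = 3)  # filled column by column
dim(m)                                # 2 3
m[1, 2]                               # 3

v <- 1:10
dim(v) <- c(2, 5)                     # adding a dim attribute makes v a 2 x 5 matrix

x <- 1:3
y <- 10:12
cbind(x, y)                           # 3 x 2: x and y become columns
rbind(x, y)                           # 2 x 3: x and y become rows
```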

Page 71: Getting to Know Your Data with R

Arrays

✤ Arrays are similar to matrices, but they can have more than two dimensions

Page 72: Getting to Know Your Data with R

Factors

✤ Variables can be described as nominal, ordinal, or continuous

✤ Nominal: categorical, without an implied order (e.g., blood type)

✤ Ordinal: also categorical, implying order but not amount (e.g., grades)

✤ Continuous: can take on any value within some range; both order and amount are implied (e.g., age)

✤ Categorical (nominal) and ordered categorical (ordinal) variables in R are called factors

✤ Factors are crucial in R because they determine how data is analyzed and presented visually

Page 73: Getting to Know Your Data with R

Factors

✤ The factor() function stores the categorical values as a vector of integers in the range [1… k], (where k is the number of unique values in the nominal variable) and an internal vector of character strings (the original values) mapped to these integers

Page 74: Getting to Know Your Data with R

Factors

✤ By default, factor levels for character vectors are created in alphabetical order; this can be overridden by specifying the levels argument
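
A minimal sketch of both behaviors, using a made-up `status` variable:

```r
status <- c("poor", "improved", "excellent", "poor")

f <- factor(status)
levels(f)      # "excellent" "improved" "poor"  (alphabetical)
as.integer(f)  # 3 2 1 3  (the integer codes stored internally)

# Override the default order, and mark the factor as ordinal.
g <- factor(status, levels = c("poor", "improved", "excellent"), ordered = TRUE)
levels(g)      # "poor" "improved" "excellent"
as.integer(g)  # 1 2 3 1
```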

Page 75: Getting to Know Your Data with R

Lists

✤ Lists are the most complex of the R data types. Basically, a list is an ordered collection of objects. A list allows you to gather a variety of (possibly unrelated) objects under one name. For example, a list may contain a combination of vectors, matrices, data frames, and even other lists.
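
As a sketch, a list gathering unrelated objects under one name might look like this (the element names are illustrative):

```r
mylist <- list(title  = "My First List",
               ages   = c(25, 26, 18, 39),
               m      = matrix(1:10, nrow = 5),
               nested = list("a", TRUE))

mylist$title        # access an element by name
mylist[["ages"]]    # same idea, with double brackets
mylist[[2]]         # or by position
length(mylist)      # 4
```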

Page 76: Getting to Know Your Data with R

Data Frames

✤ A data frame is more general than a matrix in that its columns can contain different modes of data (numeric, character, etc.). Data frames store tabular data and are the most commonly used data structure in R.

✤ They are represented as a special type of list where every element of the list has to have the same length

✤ Each element of the list can be thought of as a column and the length of each element of the list is the number of rows

✤ Unlike matrices, data frames can store different classes of objects in each column (just like lists); matrices must have every element be the same class

✤ Data frames also have a special attribute called row.names

✤ Data frames are usually created by calling read.table() or read.csv()

✤ Can be converted to a matrix by calling data.matrix()

Page 77: Getting to Know Your Data with R

Data Frames

✤ Each column must have only one mode, but you can put columns of different modes together
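
A small sketch of a data frame built by hand with `data.frame()`; the `patients` data are invented for illustration:

```r
patients <- data.frame(
  id       = 1:4,
  age      = c(25, 34, 28, 52),
  diabetes = c("Type1", "Type2", "Type1", "Type1"),
  improved = c(TRUE, FALSE, TRUE, FALSE),
  stringsAsFactors = FALSE
)

nrow(patients)           # 4
ncol(patients)           # 4
sapply(patients, class)  # one class per column
patients$age             # a single column, as a vector
data.matrix(patients)    # converted to a numeric matrix
```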

Page 78: Getting to Know Your Data with R

Names

✤ R objects can also have names, which is very useful for writing readable code and self-describing objects

Page 79: Getting to Know Your Data with R

Names

✤ Lists can also have names

Page 80: Getting to Know Your Data with R

Names

✤ And matrices
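
The examples for these three slides are missing from the transcript; a sketch with illustrative names:

```r
x <- 1:3
names(x)                           # NULL: no names yet
names(x) <- c("foo", "bar", "norf")
x["bar"]                           # element lookup by name

lst <- list(a = 1, b = 2, c = 3)   # lists can be named at creation
names(lst)                         # "a" "b" "c"

m <- matrix(1:4, nrow = 2, ncol = 2)
dimnames(m) <- list(c("a", "b"), c("c", "d"))  # row names, then column names
m["a", "d"]                        # 3
```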

Page 81: Getting to Know Your Data with R

Missing Values

✤ Missing values are denoted by NA, or by NaN for undefined mathematical operations

✤ is.na() is used to test objects if they are NA

✤ is.nan() is used to test objects for NaN

✤ NA values have a class also, so there are integer NA, character NA, etc.

✤ A NaN value is also NA but the converse is not true
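
The difference between the two tests can be sketched with a pair of small vectors (values are illustrative):

```r
x <- c(1, 2, NA, 10, 3)
is.na(x)   # FALSE FALSE  TRUE FALSE FALSE
is.nan(x)  # FALSE FALSE FALSE FALSE FALSE

y <- c(1, 2, NaN, NA, 4)
is.na(y)   # FALSE FALSE  TRUE  TRUE FALSE  (NaN counts as NA)
is.nan(y)  # FALSE FALSE  TRUE FALSE FALSE  (NA is not NaN)

mean(x)                # NA: missing values propagate
mean(x, na.rm = TRUE)  # 4
```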

Page 82: Getting to Know Your Data with R

Missing Values

Page 83: Getting to Know Your Data with R

Data Types Summary

✤ atomic classes: numeric, logical, character, integer, complex

✤ vectors

✤ matrices

✤ arrays

✤ lists

✤ data frames

✤ names

✤ missing values

Page 84: Getting to Know Your Data with R

Data Types Summary

Page 85: Getting to Know Your Data with R

Oddities

✤ The period (.) has no special significance in object names. The dollar sign ($) has a somewhat similar meaning to the period in other object-oriented languages and can be used to identify the parts of a data frame or list (e.g., mydata$columnname)

✤ R doesn’t have multiline or block comments (every comment line must start with #)

✤ Assigning a value to a nonexistent element of a vector, matrix, array, or list expands that structure to accommodate the new value

✤ R doesn’t have scalar values; they’re represented as one-element vectors

✤ Indices start at 1, not 0

✤ Variables can’t be declared, they come into existence on the first assignment

Page 86: Getting to Know Your Data with R

Helpful Methods

✤ getwd()

✤ setwd("some/directory")

✤ list.files()

✤ ls() or objects()

✤ rm(object/objectlist)

✤ rm(list=ls())

✤ edit(objectname)

✤ source("filename")

✤ sink("filename.ext", append=TRUE, split=TRUE)

✤ data() ~ list available data sets

✤ search() ~ list loaded packages

✤ q()

Page 87: Getting to Know Your Data with R

Reading Data

✤ There are a few principal functions for reading data into R

✤ read.table, read.csv, for reading tabular data

✤ readLines, for reading lines of a text file

✤ source, for reading in R code files (inverse of dump)

✤ dget, for reading in R code files (inverse of dput)

✤ load, for reading in saved workspaces

✤ unserialize, for reading single R objects in binary form

Page 88: Getting to Know Your Data with R

Writing Data

✤ There are similar functions for writing data to files

✤ write.table

✤ writeLines

✤ dump

✤ dput

✤ save

✤ serialize

Page 89: Getting to Know Your Data with R

read.table

✤ The read.table function is one of the most commonly used functions for reading data

✤ It has a few important arguments:

✤ file, the name of the file, or a connection

✤ header, logical indicating if the file has a header line

✤ sep, a string indicating how the columns are separated

✤ colClasses, a character vector indicating the class of each column in the dataset

✤ nrows, the number of rows in the dataset

✤ comment.char, a character string indicating the comment character

✤ skip, the number of lines to skip from the beginning

✤ stringsAsFactors, should character variables be coded as factors?

Page 90: Getting to Know Your Data with R

read.table

✤ For small to moderately sized datasets, you can usually call read.table without specifying any other arguments

✤ R will automatically

✤ skip lines that begin with #

✤ figure out how many rows there are (and how much memory needs to be allocated)

✤ figure out what type of variable is in each column of the table (telling R all these things directly makes R run much faster and more efficiently)

✤ read.csv is identical to read.table except for its defaults: the separator is a comma, header = TRUE, and comment.char = "" (so lines beginning with # are not skipped)
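
A self-contained sketch of both functions reading the same file. The file is a throwaway temp file written just for the demo, and its contents are invented:

```r
# Write a small comma-separated file with one comment line.
path <- tempfile(fileext = ".csv")
writeLines(c("# exported for the demo",
             "name,age,city",
             "Ann,34,Denver",
             "Bob,27,Austin"), path)

# read.table skips the # line by default; header and separator are spelled out.
people <- read.table(path, header = TRUE, sep = ",", stringsAsFactors = FALSE)

# read.csv defaults to sep = "," and header = TRUE, but its comment.char
# default is "", so the comment character is set here to skip the first line.
same <- read.csv(path, comment.char = "#", stringsAsFactors = FALSE)

str(people)
```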

Page 91: Getting to Know Your Data with R

Larger Datasets

✤ With much larger datasets, doing the following things will make your life easier and will prevent R from choking

✤ Read the help page for read.table, which contains many hints

✤ Make a rough calculation of the memory required to store your dataset

✤ If the dataset is larger than the amount of RAM on your computer, you can probably stop right here

✤ Set comment.char = "" if there are no commented lines in your file

Page 92: Getting to Know Your Data with R

Larger Datasets

✤ Use the colClasses argument

✤ Specifying this option instead of using the default can make read.table run much faster, often twice as fast

✤ In order to use this option, you have to know the class of each column in your data frame

✤ If all of the columns are numeric, for example, then you can just set colClasses = "numeric"

✤ Set nrows

✤ This doesn’t make R run faster but it helps with memory usage
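
One common way to obtain the column classes, sketched here on a generated temp file, is to read a few rows first and reuse the classes R infers. The file, its size, and the variable names are assumptions made for the demo:

```r
# Create a demo file with 1000 rows of numeric data.
path <- tempfile(fileext = ".txt")
write.table(data.frame(a = rnorm(1000), b = rnorm(1000)),
            path, row.names = FALSE)

# Read the first rows and let R work out the class of each column...
initial <- read.table(path, header = TRUE, nrows = 100)
classes <- sapply(initial, class)

# ...then read the whole file with the classes and the row count supplied.
full <- read.table(path, header = TRUE, colClasses = classes,
                   nrows = 1000, comment.char = "")
dim(full)
```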

Page 93: Getting to Know Your Data with R

Know Thy System

✤ In general, when using R with larger datasets, it’s useful to know a few things about your system

✤ How much memory is available?

✤ What other applications are in use?

✤ Are there other users logged into the same system?

✤ What is the operating system?

✤ Is the OS 32-bit or 64-bit?

Page 94: Getting to Know Your Data with R

Calculating Memory Requirements

✤ I have a data frame with 1,500,000 rows and 120 columns, all of which are numeric data. Roughly, how much memory is required to store this data frame? Note: each numeric value is stored as a double, which requires 8 bytes of memory.

✤ 1,500,000 rows * 120 columns = 180,000,000 numeric values

✤ 180,000,000 * 8 bytes = 1,440,000,000 bytes

✤ 1,440,000,000 bytes / 1024 = 1,406,250 KB

✤ 1,406,250 KB / 1024 = 1,373.29 MB

✤ 1,373.29 MB / 1024 = 1.34 GB

✤ If you only have 2 GB of RAM, you may want to think twice before doing it. If you've got 4, 6, or 8+ GB of RAM, you should be fine.
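
The same arithmetic can be done in R, assuming 8 bytes per numeric value; the `object.size()` cross-check on a smaller data frame is an addition for illustration:

```r
rows <- 1500000
cols <- 120
bytes_per_numeric <- 8   # one double-precision value

bytes <- rows * cols * bytes_per_numeric
mb <- bytes / 2^20
gb <- bytes / 2^30
round(mb, 2)             # 1373.29
round(gb, 2)             # 1.34

# Cross-check on a small numeric data frame: 1000 x 120 values take
# 960,000 bytes of data plus a little overhead for names and attributes.
small <- as.data.frame(matrix(0, nrow = 1000, ncol = 120))
object.size(small)
```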

Page 95: Getting to Know Your Data with R

Textual Formats

✤ Dumping and dput-ing are useful because the resulting textual format is editable and, in the case of corruption, potentially recoverable

✤ Unlike writing out a table or csv file, dump and dput preserve the metadata (sacrificing some readability), so that another user doesn’t have to specify it all over again

✤ Textual formats can work much better with version control programs like SVN or Git, which can only track changes meaningfully in text files

✤ Textual formats can be longer-lived; if there is corruption somewhere in the file, it can be easier to fix the problem

✤ Downside: Format is not very space-efficient

Page 96: Getting to Know Your Data with R

dput-ting R Objects

✤ Another way to pass data around is by deparsing the R object with dput and reading it back in using dget
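
A minimal round trip, sketched with a one-row data frame and a temp file (both invented for the demo):

```r
y <- data.frame(a = 1, b = "a", stringsAsFactors = FALSE)

dput(y)                    # prints R code that would recreate y

path <- tempfile(fileext = ".R")
dput(y, file = path)       # write the deparsed object to a file
new_y <- dget(path)        # read it back in
all.equal(y, new_y)
```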

Page 97: Getting to Know Your Data with R

Dumping R Objects

✤ Multiple objects can be deparsed using the dump function and read back in using source
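
The same idea for several objects at once, again sketched with made-up objects and a temp file:

```r
x <- "foo"
y <- data.frame(a = 1, b = "a", stringsAsFactors = FALSE)

path <- tempfile(fileext = ".R")
dump(c("x", "y"), file = path)  # objects are passed by name, as strings

rm(x, y)                        # remove them from the workspace...
source(path)                    # ...and restore them from the dumped file
x                               # "foo"
```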

Page 98: Getting to Know Your Data with R

Interfaces to the Outside World

✤ Data are read in using connection interfaces

✤ Connections can be made to files (most common) or to other more exotic things

✤ file, opens a connection to a file

✤ gzfile, opens a connection to a file compressed with gzip

✤ bzfile, opens a connection to a file compressed with bzip2

✤ url, opens a connection to a webpage

Page 99: Getting to Know Your Data with R

File Connections

✤ description is the name of the file

✤ open is a code indicating the mode:

✤ "r" read only

✤ "w" writing (and initializing a new file)

✤ "a" appending

✤ "rb", "wb", "ab" reading, writing, or appending in binary mode (Windows)

Page 100: Getting to Know Your Data with R

Connections

✤ In general, connections are powerful tools that let you navigate files or other external objects!

✤ In practice, we often don’t need to deal with the connection interface directly

Page 101: Getting to Know Your Data with R

Reading Lines of a Text File

✤ readLines can be useful for reading in lines of webpages
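
A sketch of an explicit file connection next to the shortcut most code uses. It reads a temp file rather than a web page so it runs without network access; the file contents are invented:

```r
path <- tempfile(fileext = ".txt")
writeLines(c("first line", "second line", "third line"), path)

# Explicit connection: description is the path, open = "r" means read only.
con <- file(path, open = "r")
first_two <- readLines(con, n = 2)
close(con)

# In practice the connection is usually created for us.
all_lines <- readLines(path)

# The same pattern applies to web pages (needs network access), e.g.:
# con <- url("http://www.r-project.org", "r"); head(readLines(con)); close(con)
```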

Page 102: Getting to Know Your Data with R

Connecting and Listing Databases

Page 104: Getting to Know Your Data with R

R Packages

✤ When you download R from the Comprehensive R Archive Network (CRAN), you get the “base” R system

✤ The base R system comes with basic functionality and implements the R language

✤ One reason R is so useful is the large collection of packages that extend the basic functionality of R

✤ R packages are developed and published by the larger R community

Page 105: Getting to Know Your Data with R

Obtaining R Packages

✤ The primary location for obtaining R packages is CRAN

✤ For biological applications, many packages are available from the Bioconductor Project

✤ You can obtain information about the available packages on CRAN with the available.packages() function

Page 106: Getting to Know Your Data with R

Obtaining R Packages

✤ There are over 6,000 packages on CRAN covering a wide range of topics

✤ A list of some topics is available through the Task Views link (http://cran.r-project.org/web/views/), which groups together many R packages related to a given topic

Page 107: Getting to Know Your Data with R

Installing an R Package

✤ Packages can be installed with the install.packages() function in R

✤ To install a single package, pass the name of the package to the install.packages() function as the first argument

✤ The following code installs the zipcode package from CRAN

✤ This command downloads the zipcode package from CRAN and installs it on your computer

✤ Any packages on which this package depends will also be downloaded and installed

Page 108: Getting to Know Your Data with R

Installing Multiple R Packages

✤ You can install multiple R packages at once with a single call to install.packages()

✤ Place the names of the R packages in a character vector
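
The install commands from the last two slides are missing from the transcript. A sketch of what they might look like follows; the package names in the second call are an assumption, and whether zipcode is still hosted on CRAN today is not something the slides can confirm. The calls are guarded so that running the file as a script does not trigger downloads:

```r
# Installing needs a network connection and a CRAN mirror.
if (interactive()) {
  install.packages("zipcode")                 # a single package, by name
  install.packages(c("ggplot2", "devtools"))  # several packages in one call
}

# List what is already installed in the current library paths.
head(rownames(installed.packages()))
```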

Page 109: Getting to Know Your Data with R

Installing Packages via RStudio

Page 111: Getting to Know Your Data with R

Loading R Packages

✤ Installing a package does not make it immediately available to you in R; you must load the package

✤ The library() function is used to load packages into R

✤ The following code is used to load the ggplot2 package into R

Page 112: Getting to Know Your Data with R

Loading R Packages

✤ Any packages that need to be loaded as dependencies will be loaded first, before the named package is loaded

✤ Note: Do not put the package name in quotes!

✤ Some packages produce messages when they are loaded (but some don’t)

Page 113: Getting to Know Your Data with R

Loading R Packages

✤ After loading a package, the functions exported by that package will be attached to the top of the search() list (after the workspace)
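
A sketch of loading packages and inspecting the search list. ggplot2 is only loaded if it happens to be installed; `splines`, which ships with R, stands in so the example runs on any installation:

```r
# Load ggplot2 when it is available on this machine.
if (requireNamespace("ggplot2", quietly = TRUE)) {
  library(ggplot2)
}

# A package distributed with R itself, so this call always succeeds.
library(splines)

# The workspace (.GlobalEnv) comes first, followed by attached packages,
# most recently loaded first.
search()
```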

Page 114: Getting to Know Your Data with R

example()

Page 115: Getting to Know Your Data with R

R Packages Summary

✤ R packages provide a powerful mechanism for extending the functionality of R

✤ R packages can be obtained from CRAN or other repositories

✤ The install.packages() function can be used to install packages at the R console

✤ The library() function loads packages that have been installed so that you may access the functionality in the packages

Page 116: Getting to Know Your Data with R

Learning Resources

✤ The R Manuals http://cran.r-project.org/manuals.html

✤ QuickR http://www.statmethods.net

✤ R Mailing Lists https://stat.ethz.ch/mailman/listinfo

✤ SwiRl http://swirlstats.com

✤ Introduction to Data Science https://www.udemy.com/introduction-to-data-science/

✤ Johns Hopkins University Data Science Specialization https://www.coursera.org/specialization/jhudatascience/1

Page 117: Getting to Know Your Data with R

More Learning Resources

✤ R By Example http://www.mayin.org/ajayshah/KB/R/index.html

✤ R Data Import/Export Manual http://cran.r-project.org/doc/manuals/r-release/R-data.html

✤ Data Analysis with R: Visually Analyze and Summarize Data Sets https://www.udacity.com/course/data-analysis-with-r--ud651

✤ Intro to Descriptive Statistics: Mathematics for Understanding Data https://www.udacity.com/course/intro-to-descriptive-statistics--ud827

✤ Intro to Inferential Statistics: Making Predictions from Data https://www.udacity.com/course/intro-to-inferential-statistics--ud201

Page 118: Getting to Know Your Data with R

Good Stuff

✤ FiveThirtyEight http://fivethirtyeight.com

✤ Kaggle https://www.kaggle.com

✤ Figshare http://figshare.com

✤ OpenCPU https://www.opencpu.org

✤ RPy http://rpy.sourceforge.net

✤ rCharts http://rcharts.io

✤ Facebook Data Science Research Service https://www.facebook.com/data

✤ Google’s R Style Guide http://google-styleguide.googlecode.com/svn/trunk/Rguide.xml

Page 119: Getting to Know Your Data with R

Books

✤ Chambers. Software for Data Analysis. Springer, 2008.

✤ Matloff. The Art of R Programming: A Tour of Statistical Software Design. No Starch Press, 2011.

✤ Kabacoff. R in Action: Data analysis and graphics with R. Second Edition. Manning Publications, 2015.

✤ Cielen & Meysman. Introducing Data Science. Manning Publications, 2015.

✤ Tukey. Exploratory Data Analysis. Pearson, 1977.

Page 120: Getting to Know Your Data with R

Books

✤ Manning Publications Discount Code: ctwdevob

✤ Packt Discount Code: DEV.Objective
