Introduction to R

Upload: agnonchik

Post on 14-Jul-2015


Contents

● What is R? How to invoke it?
● Basic data types, control structures
● Environments, functions
● Classes, packages, graphs

What is R

● A free software programming language and software environment for statistical computing and graphics

● A dialect of the S programming language with lexical scoping semantics inspired by Scheme

● Provides linear and nonlinear modeling, classical statistical tests, time-series analysis, classification, clustering, and more...

● Cross-platform: Windows, Linux, Mac

How to invoke

● Command-line interpreter for interactive programming
– Type R in a terminal
– Use ?topic or help(topic) for help; try example(topic) for examples
– Use quit() to exit

● GUIs
– RStudio IDE
– Web interface: http://10.122.85.41:8787/

Basic Data Types (1)

● R is value typed. Everything is an object

x <- value # <- is the assignment operator

y <- x # y behaves as an independent (deep) copy of x

● Atomic data types
– integer (32 bits) age <- 20L
– double (binary64) gpa <- 3.34
– character name <- "John"
– logical married <- TRUE # or FALSE
– complex, raw

mode(x), typeof(x), class(x), str(x)

● Other types
– closure f <- function() {}
– language q <- quote(x <- 1)

● Special constants

NULL, NA, Inf, NaN

Basic Data Types (2)

● Vectors
– An ordered collection of values of a single atomic type

banknotes <- c(1, 5, 10, 20, 50, 100) # c means combine
banknotes[5], length(banknotes), mode(banknotes)

name <- c(given="George", middle="Walker", family="Bush")
name[1], name["given"], names(name)

– Tricks with indexes

x <- c(1, 2, 3) # try x[0], x[c(1,3)], x[-1], x[c(-1,-3)]
x[c(T,F,T)], x>1, x[x>1] # logical indexing

– Recycling: a shorter vector argument is cycled through

c(1,2,3,4) + c(0,1), 10 * c(1,2,3,4)
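The indexing tricks and the recycling rule above can be tried out with a short sketch like the following. The variable names on the left-hand side (empty, dropped, picked, and so on) are made up for illustration and are not part of the original slides.

```r
# Vector indexing and recycling, using the vectors from the slide.
banknotes <- c(1, 5, 10, 20, 50, 100)
fifth <- banknotes[5]                 # fifth element: 50

x <- c(1, 2, 3)
empty   <- x[0]                       # index 0 selects nothing: numeric(0)
dropped <- x[-1]                      # negative index drops an element: 2 3
picked  <- x[c(TRUE, FALSE, TRUE)]    # logical index keeps TRUE positions: 1 3
big     <- x[x > 1]                   # filter by condition: 2 3

# The shorter vector c(0, 1) is recycled to c(0, 1, 0, 1)
recycled <- c(1, 2, 3, 4) + c(0, 1)   # 1 3 3 5
scaled   <- 10 * c(1, 2, 3, 4)        # the scalar 10 is recycled: 10 20 30 40
```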

Basic Data Types (3)

● Lists
– A set of objects of different types

l <- list(age=20L, gpa=3.34, name="John", married=TRUE)
length(l), names(l)

– Use [] to extract a sublist

l[1], l[c("name","married")], l[-1], l[c(-1,-3)]

– Use [[]] or $ to access an object in a list

l[[1]], l[["age"]], l$age

list(c(1, 2, 3), c("a", "b"), function(){})

● Attributes

attributes(age) <- list(units="years")

structure(l, comment="those one guy")
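The difference between [] and [[]] on a list, and the effect of setting attributes, is easy to miss on a slide. A minimal sketch, reusing the list l and the variable age from above (the result names sub, val, and same are illustrative only):

```r
# [] returns a sublist; [[]] and $ return the element itself.
l <- list(age = 20L, gpa = 3.34, name = "John", married = TRUE)

sub  <- l["age"]      # still a list, of length 1
val  <- l[["age"]]    # the integer 20L itself
same <- l$age         # equivalent to l[["age"]]

# Attributes are metadata attached to an object.
age <- 20L
attributes(age) <- list(units = "years")
units_of_age <- attr(age, "units")   # "years"
```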

Basic Data Types (4)

● Matrices

m <- matrix(c(1,2,3,4), nrow=2, ncol=2) # use dim, nrow, ncol, rbind, and cbind with matrices

– Use t to transpose, %*% for matrix multiplication, diag to extract the diagonal

● Arrays

a <- array(rnorm(8), c(2,2,2))

● Factors (enumerated type)

faculty <- factor("engineering", levels=c("arts", "law", "engineering", "finances"))

Basic Data Types (5)

● Data frames
– A data frame combines a set of vectors of the same length

df <- data.frame(age=c(20L, 21L),
                 gpa=c(3.34, 3.14),
                 name=c("John", "George"),
                 married=c(T, F))

– Any data frame can be accessed either as a list or as a matrix

df.1 <- df[df$gpa>3.2, c("name","married")]
df.2 <- subset(df, subset=(gpa>3.2),
               select=c(name,married))
identical(df.1, df.2) # TRUE
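To make the "list or matrix" remark concrete, here is a small sketch on the same data frame. The stringsAsFactors = FALSE argument is an addition, made so that the name column is a character vector regardless of the R version; the result names are illustrative.

```r
# A data frame is a list of equal-length columns that also supports
# matrix-style [row, column] indexing.
df <- data.frame(age = c(20L, 21L),
                 gpa = c(3.34, 3.14),
                 name = c("John", "George"),
                 married = c(TRUE, FALSE),
                 stringsAsFactors = FALSE)

# List-style access: columns are list elements
n_columns <- length(df)        # 4, one per column
gpa_col   <- df[["gpa"]]       # same as df$gpa

# Matrix-style access: rows and columns
shape      <- dim(df)          # 2 rows, 4 columns
second_one <- df[2, "name"]    # "George"

# Combining both: logical row filter plus column selection
high <- df[df$gpa > 3.2, c("name", "married")]
```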

Control Structures (1)

if (cond) expr

if (cond) expr1 else expr2

for (var in seq) expr

while (cond) expr

repeat expr

break, next

switch (expr, ...)

ifelse (test, yes, no)

● Implicit looping

lapply, sapply, apply, mapply

Control Structures (2)

● Examples

df.3 <- data.frame(name=character(), married=logical())

for (row in seq_len(nrow(df)))
  if (df$gpa[row] > 3.2)
    df.3 <- rbind(df.3, data.frame(name=df$name[row],
                                   married=df$married[row]))

identical(df.2, df.3) # TRUE

lapply(l, typeof) # returns a list

sapply(l, typeof) # simplifies the result to a vector

apply(m, 2, max) # max column element (vector)

apply(m, 1:length(dim(m)), sqrt) # applied to every element (matrix)

mapply(function(x, y) seq_len(x) + y, c(1, 2, 3), c(10, 0, -10))

Environments (1)

● Every variable or function is defined in an environment

environment() # gives the current evaluation environment

● Environments form a tree with the root given by emptyenv()

● The root environment emptyenv() cannot be populated

● .GlobalEnv is the user's working environment or workspace. It can also be accessed by globalenv()

identical(environment(), globalenv()) # TRUE at the top level

● baseenv() is the library environment for the basic R functions

ls(baseenv())

Environments (2)

[Figure: tree of environments with the nodes emptyenv(), baseenv(), .GlobalEnv, .BaseNamespaceEnv, e.1, and e.2]

Environments (3)

● parent.env(env) returns the parent of environment env

identical(parent.env(baseenv()), emptyenv()) # TRUE

● To create a new environment use new.env(parent)

– If the parent parameter is omitted, the calling environment parent.frame() is used by default, which is .GlobalEnv at the top level

● To change the evaluation environment use

evalq(expr, env)

with(data, expr) # does the same for data frames and lists

● Example

e.1 <- new.env() # creates a new environment e.1

parent.env(e.1) # should be .GlobalEnv

evalq(environment(), e.1) # should be e.1

e.2 <- new.env(parent=e.1) # creates a new environment e.2

parent.env(e.2) # should be e.1

evalq(environment(), e.2) # should be e.2

Environments (4)

● When resolving a variable or function name, R searches the current evaluation environment, then the parent environments along the path to the root environment emptyenv()

x <- 0 # set x to 0 in .GlobalEnv

Both evalq(x, e.1) and evalq(x, e.2) should give 0

evalq(x <- 2, e.2) # set x to 2 in e.2

Now evalq(x, e.1) still gives 0 while evalq(x, e.2) has changed to 2

● To set an object, such as a variable or a function, in a particular environment use

assign(obj.name, value, envir=env) # inherits is FALSE by default

● To get the value of an object in a particular environment use

get(obj.name, envir=env) # inherits is TRUE by default

● To check whether an object exists in a particular environment use

exists(obj.name, envir=env) # inherits is TRUE by default

For example,

exists("x", e.1, inherits=FALSE) # FALSE

exists("x", e.2, inherits=FALSE) # TRUE
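The lookup rule can be checked without touching the workspace by building a private chain of environments. This is only a sketch: the names root, child, and grandchild are hypothetical stand-ins for .GlobalEnv, e.1, and e.2 from the slides.

```r
# A three-level chain: root <- child <- grandchild
root       <- new.env()
child      <- new.env(parent = root)
grandchild <- new.env(parent = child)

assign("x", 0, envir = root)

# Not defined locally, so the lookup walks up to root
before <- get("x", envir = grandchild)             # 0

# Define a local x in grandchild; it shadows the one in root
assign("x", 2, envir = grandchild)
after_child      <- get("x", envir = child)        # still 0, found in root
after_grandchild <- get("x", envir = grandchild)   # 2, found locally

# inherits = FALSE restricts the check to the environment itself
own_child      <- exists("x", envir = child, inherits = FALSE)       # FALSE
own_grandchild <- exists("x", envir = grandchild, inherits = FALSE)  # TRUE
```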

Environments (5)

● Every environment can also be treated as a list. For example, e.2$x gives access to x in e.2

● The so-called search path starts from .GlobalEnv and ends with baseenv(). The search() function returns string names of the environments in the search path

[1] ".GlobalEnv" "tools:rstudio"

[3] "package:stats" "package:graphics"

[5] "package:grDevices" "package:utils"

[7] "package:datasets" "package:methods"

[9] "Autoloads" "package:base"

Environments (6)

● To restore an environment from its string name, use as.environment(name). For example,

as.environment("package:base") # maps the string to the baseenv() object

● Unless the evaluation environment was specified explicitly, the interpreter searches the environments along the search path, starting from .GlobalEnv, until it hits emptyenv()
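A short sketch of how the search path relates to the chain of parent environments. The exact entries returned by search() depend on which packages are attached in a given session, so only the first and last entries are relied on here; the variable names are illustrative.

```r
# The search path as a character vector
path  <- search()
first <- path[1]                  # ".GlobalEnv"
last  <- path[length(path)]       # "package:base"

# Map a search-path name back to the environment object
base_env <- as.environment("package:base")
same_as_baseenv <- identical(base_env, baseenv())

# Walk the parent chain from .GlobalEnv until emptyenv() is reached.
# Each hop visits the next entry of the search path; the last hop
# goes from baseenv() to emptyenv().
env  <- globalenv()
hops <- 0
while (!identical(env, emptyenv())) {
  env  <- parent.env(env)
  hops <- hops + 1
}
```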

Environments (7)

● To add an environment env to the search path one can use attach(env, pos, name) which creates a copy of the environment env with string name name and inserts it at position pos>1 in the search path

● find(obj.name) returns all environments along the search path containing objects with a specified name

● Example

e.1$x <- 1; attach(e.1, 2L, "e.1")

assign("x", 11, e.1) # modified e.1 but not its attached duplicate

get("x", e.1) # returns 11

x # is still 1
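The copy made by attach can be observed in a self-contained way as sketched below. The environment e, the variable attached_demo_value, and the search-path name "demo_env" are all hypothetical names chosen to avoid clashing with anything in the workspace; the entry is detached at the end to leave the search path as it was.

```r
e <- new.env()
e$attached_demo_value <- 1

# attach() copies the contents of e into a new search-path entry
attach(e, pos = 2L, name = "demo_env")

# Modify the original environment, not the attached copy
assign("attached_demo_value", 11, envir = e)

from_original    <- get("attached_demo_value", envir = e)        # 11
from_search_path <- get("attached_demo_value", pos = "demo_env") # still 1
found_in         <- find("attached_demo_value")                  # "demo_env"

# Clean up the search path
detach("demo_env")
```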

Functions (1)

● Functions in R are “first class objects”, which means they can be treated much like any other object
– Can be passed as arguments to other functions
– Can be nested, so that you can define a function inside of another function
– Can be returned by other functions

● The return value of a function is the last expression in the function body to be evaluated

● Example: factorial

fact <- function(x) ifelse(x==1, 1, x*fact(x-1)) # ifelse plays the role of ?: in C

fact # function(x) ifelse(x==1, 1, x*fact(x-1))

fact(5) # 120

fact(1000) # Inf

Functions (2)

● A function consists of its formal arguments and a body and it has a reference to the enclosing environment (closure)

formals(fact) # $x

body(fact) # ifelse(x == 1, 1, x * fact(x - 1))

environment(fact) # .GlobalEnv

● By default, the enclosing environment is the environment in which the function was created, but it can be redefined with

environment(fact) <- some.other.environment

● When called, a function creates its own environment, a child of the enclosing environment

● Thus we have
– the environment where the function is created (where its name is bound): find("fact")
– the environment where the function resides (enclosing environment): environment(fact)
– the environment created when a function is run: environment()
– the environment where a function is called: parent.frame()
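The relationship between the enclosing environment and the environment created by a call can be inspected directly. The sketch below uses a hypothetical helper make_probe that returns a function reporting its own environments; none of these names come from the slides.

```r
# make_probe() returns a function whose enclosing environment is the
# environment created by this particular call to make_probe().
make_probe <- function() {
  function() list(execution = environment(),   # created by the call
                  caller    = parent.frame())  # where the call was made
}

probe     <- make_probe()
enclosing <- environment(probe)   # where probe "resides"
envs      <- probe()

# The execution environment is a child of the enclosing environment
child_of_enclosing <- identical(parent.env(envs$execution), enclosing)

# Every call gets a fresh execution environment
fresh_per_call <- !identical(probe()$execution, envs$execution)
```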

Functions (3)

● Function arguments are named and may have default values

● You can mix positional matching with matching by name. When an argument is matched by name it is “taken out” of the argument list and the remaining unnamed arguments are matched in the order that they are listed in the function definition

f <- function(x, y, z=0) as.list(environment())

f(1, 2) # x: 1, y: 2, z: 0

f(y=1, x=2, z=3) # x: 2, y: 1, z: 3

f(y=1) # x: (missing), y: 1, z: 0

f(z=3, 2, 1) # x: 2, y: 1, z: 3

f(1, 2, 3, 4) # error: unused argument (4)

Functions (4)

● The order of operations when given an argument is
– Check for exact match for a named argument
– Check for partial match
– Check for a positional match

● The ... argument indicates a variable number of arguments that are usually passed on to other functions

● Any argument that appears after ... in the argument list must be named explicitly and cannot be partially matched

g <- function(y, z=0) as.list(environment())

f <- function(x, ...) g(...)

f(1) # y: (missing), z: 0

f(1, 2, 3) # y: 2, z: 3

f(1, 2, 3, 4) # error: unused argument (4)

f(y=1, 2, 3) # y: 1, z: 3

f(2, 3, x=1) # y: 2, z: 3
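Partial matching and the special treatment of arguments placed after ... are not covered by the examples above. A minimal sketch, with the hypothetical functions h and k introduced only for illustration:

```r
# Partial matching: "al" is a unique prefix of "alpha"
h  <- function(alpha, beta = 0) list(alpha = alpha, beta = beta)
r1 <- h(2, al = 1)   # alpha = 1 by partial match, beta = 2 by position

# Arguments after ... must be spelled out in full
k  <- function(..., verbose = FALSE) list(dots = list(...), verbose = verbose)
r2 <- k(1, verb = TRUE)      # "verb" is NOT matched; it ends up in ...
r3 <- k(1, verbose = TRUE)   # exact name: verbose is set
```

Here r2$verbose keeps its default FALSE, while r2$dots carries an element named verb.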

Functions (5)

● Free variables

f <- function() x # x is a free variable

f() # error: object 'x' not found

x <- 1; f() # 1

● Lexical (static) scoping

f <- function() {
   x <- 1
   g <- function() x
}

x <- 2; h <- f()

h() # 1 or 2?

● Why 1?

environment(h) # created by f() call, not .GlobalEnv

environment(h)$x # 1

Functions (6)

● Example: function that returns function

power <- function(n) {
  function(x) x^n
}

n <- 5 # ignored

square <- power(2)

square(3) # 9

cube <- power(3)

cube(2) # 8
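Each function returned by power carries its own enclosing environment holding its own n, which can be inspected directly. In this sketch the call force(n) is an addition not present in the slides; it evaluates the argument eagerly so that the stored value does not depend on when the returned function is first called.

```r
power <- function(n) {
  force(n)            # addition: evaluate n now rather than lazily
  function(x) x^n
}

square <- power(2)
cube   <- power(3)

# Each closure remembers the n it was created with
n_square <- environment(square)$n   # 2
n_cube   <- environment(cube)$n     # 3

# The two closures do not share an environment
separate <- !identical(environment(square), environment(cube))
```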

Classes (1)

● Everything in R is an object
● A class is the definition of an object
● A method is a function that performs specific calculations on objects of a specific class. A generic function is used to determine the class of its arguments and select the appropriate method. A generic function is a function with a collection of methods
● print, plot, summary...
● See ?Classes and ?Methods for more info

Classes (2)

● S3 classes – old style, quick and dirty, informal

● Set an object's class attribute to the class name, e.g.

x <- c("a", "b", "c") # this is an object

class(x) <- "X" # set the class of the object

# Define a method specific to the X class
print.X <- function(x, ...) {
  cat("X obj:\n")
  print(unclass(x), ...)
}

print(x) # X obj: a b c

● Inheritance

class(x) <- c("X", "Y", "Z")

inherits(x, "Z") # TRUE

Classes (3)

● Constructor

X <- function(x) {
  if (!is.numeric(x)) stop("x must be numeric")
  structure(x, class = "X")
}

● Useful functions for S3 classes

is.object(obj) checks whether an object has a class attribute

class(x), unclass(x), methods(generic.function),

methods(class="class"), inherits(obj, "class"), is(obj, "class")

● Creating new generics

f <- function(x, ...) UseMethod("f") # the generic dispatches on the class of x

f.X <- function(x, ...) print(x, ...) # the method for class X

f(x) # X obj: a b c
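Putting the S3 pieces together (constructor, generic, class-specific method, and a default fallback) might look like the sketch below. The class money, the constructor new_money, and the generic describe are hypothetical names invented for this example.

```r
# Constructor: validates input and sets the class attribute
new_money <- function(amount, currency = "USD") {
  if (!is.numeric(amount)) stop("amount must be numeric")
  structure(list(amount = amount, currency = currency), class = "money")
}

# Generic: dispatches on the class of its first argument
describe <- function(x, ...) UseMethod("describe")

# Method for class "money"
describe.money <- function(x, ...) paste(x$amount, x$currency)

# Fallback used when no class-specific method exists
describe.default <- function(x, ...) "something else"

wallet   <- new_money(5)
wallet_d <- describe(wallet)   # "5 USD"
other_d  <- describe(1)        # "something else"
```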

Classes (4)

● S4 classes – new style, rigorous and formal
● Classes have formal definitions which describe their fields and inheritance structures (parent classes)
● Method dispatch can be based on multiple arguments to a generic function, not just one
● There is a special operator, @, for extracting slots (aka fields) from an S4 object
● All S4 related code is stored in the methods package

Classes (5)

● To create a new S4 class

setClass(Class, representation)

● Use new() to generate a new object from a class

● To create an S4 method

setMethod(f, signature, definition)

Classes (6)

● Example

setClass("Person",

  slots = list(name = "character", age = "numeric"))

setClass("Employee",

  slots = list(boss = "Person"),

  contains = "Person")

alice <- new("Person", name = "Alice", age = 40)

john <- new("Employee", name = "John", age = 20, boss = alice)
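The same Person and Employee classes can be used to try out slot access with @, inheritance checks, and the type checking that new() performs. This is a sketch; the result names (boss_name, is_person, bad) are illustrative.

```r
library(methods)

setClass("Person",
  slots = list(name = "character", age = "numeric"))
setClass("Employee",
  slots = list(boss = "Person"),
  contains = "Person")

alice <- new("Person", name = "Alice", age = 40)
john  <- new("Employee", name = "John", age = 20, boss = alice)

boss_name  <- john@boss@name        # slots are extracted with @
is_person  <- is(john, "Person")    # an Employee is also a Person
slot_names <- slotNames("Employee") # own and inherited slots

# new() rejects a value of the wrong type for a slot
bad <- tryCatch(new("Person", name = 1, age = 40),
                error = function(e) "rejected")
```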

Classes (7)

● Example

setGeneric("union") # [1] "union"

setMethod("union",

  c(x = "data.frame", y = "data.frame"),

  function(x, y) {

    unique(rbind(x, y))

  }

) # [1] "union"

● Useful functions

getGenerics, getClasses, showMethods
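Dispatch on more than one argument, mentioned earlier as a feature of S4, can be sketched with a brand-new generic. The generic describePair and its methods are hypothetical and exist only for this illustration.

```r
library(methods)

# A new generic with two arguments that take part in dispatch
setGeneric("describePair", function(x, y) standardGeneric("describePair"))

setMethod("describePair", signature(x = "numeric", y = "numeric"),
  function(x, y) "two numbers")

setMethod("describePair", signature(x = "numeric", y = "character"),
  function(x, y) "a number and a string")

# Fallback for any other combination of classes
setMethod("describePair", signature(x = "ANY", y = "ANY"),
  function(x, y) "something else")

r1 <- describePair(1, 2)
r2 <- describePair(1, "a")
r3 <- describePair("a", "b")
```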

Packages (1)

● Packages extend the functionality of R

● http://cran.r-project.org/web/packages

– 5434 available packages as of Apr 14, 2014

● repository → installed → loaded

● library(help="package")

● Datasets

data(mtcars); help(mtcars)

● Example: libsvm

install.packages("e1071")

library(e1071)

detach("package:e1071")

Packages (2)

● Library and namespace environments
– Library environments, such as "package:stats", contain the exported objects that the user can resolve by name. Thus a library environment has to be attached to the search path
– Conversely, the namespace environments contain internal objects that are opaque to the user but visible to the library functions. Usually, a namespace environment is a descendant of .BaseNamespaceEnv

find("svm") # "package:e1071"

environment(svm) # <environment: namespace:e1071>

length(ls(environment(svm))) # 90 objects in the namespace environment

length(ls("package:e1071")) # 58 objects in the package environment

– Path from the namespace environment to .GlobalEnv

<environment: namespace:e1071> → "imports:e1071" → .BaseNamespaceEnv → .GlobalEnv

Packages (3)

● Example SparkR

sc <- sparkR.init(master)

parallelize(sc, col, numSlices)

map(rdd, func)

reduce(rdd, func)

reduceByKey(rdd, combineFunc, numPartitions)

cache(rdd)

collect(rdd)

● Pi example

Graphs

demo(graphics); demo(persp)

References

http://www.pitt.edu/~njc23

http://adv-r.had.co.nz/

Appendix: funny stuff about R

● Expression which returns itself

(function(x) substitute((x)(x)))(function(x) substitute((x)(x)))

# prints (function(x) substitute((x)(x)))(function(x) substitute((x)(x)))

expression <- (function(x) substitute((x)(x)))(function(x) substitute((x)(x)))

expression == eval(expression) # TRUE