r tutorial (r program 101)

Tutorial of R - The basic commands of R
www.mbaprogrammer.com

[The goal of this page]

Most R introductions I have read are filled with nothing but instructions. The goal of learning R, however, is to solve real-life problems, so I have kept this page to a minimum. In practice, we still need to understand a few key concepts before tackling a real problem, so this page covers the basic data structures and data manipulation methods. I still believe the best way to learn the R programming language is to work on real-life problems. Please just skim through to see how things work.

[Assign Variables]

Unlike C, C++, or Java, R does not ask you to think about memory. The beauty of R is that it lets you write code without in-depth knowledge of algorithms or memory management. You can assign a variable easily in R.

Let's assume that you know the incoming cash flows for the next several years (T=1: 30, T=2: 50, T=3: 80) and want to discount them at the rate r=0.02. It would be tedious to type 1+0.02 every time. That is why programmers came up with a brilliant idea: assigning a value to a variable. Here are some examples. Please keep in mind that the ">" and "+" signs are command prompts. They are not something that I type.

    > r <- 0.02
    > 30/(1+r)
    [1] 29.41176
    > 50/(1+r)^2
    [1] 48.05844
    > 80/(1+r)^3
    [1] 75.38579

[Iterations]

When you run into a sigma sign (∑), think of the "for" loop, which lets you run the same commands a set number of times.

<Annuity Problem>

Let's assume that we have an incoming cash flow of $50 at T=1, T=2, and T=3. The discount rate is r=0.05. What is the present value of this cash flow?

    > pv <- 0
    > for (i in 1:3) {
    +   pv <- pv + 50/(1.05)^i
    + }
    > print(pv)
    [1] 136.1624

[Data type]

Some data cannot be represented by a single number. A cash flow is a good example. If you want to study historical stock performance, the time series data type would meet your requirement. Sometimes we want a data type like an Excel sheet: a combination of numbers, strings, and true/false values. In that case, consider using a data frame. There is no golden rule for which data type to use in which situation. Part of being a good data scientist is having an eye for the data type that lets you solve the business problem quickly.

<Collection Data Type (aka Vector)>

A collection type (vector) is helpful for representing a cash flow. Let's assume that there is a cash flow T=1: 50, T=2: 100, T=3: 50, T=4: 80. Here is how we define the vector.

    > cf <- c(50,100,50,80) #c() combines values into a vector
    > cf
    [1]  50 100  50  80
    > cf[1] #Getting value of first index
    [1] 50
    > cf[2] #Getting value of second index
    [1] 100
    > cf[3] #Getting value of third
    [1] 50
    > cf[4] #Fourth
    [1] 80

Keep in mind that you don't need to type the indices for the cash flow. Indexing automatically starts at 1.

<Data Frame>

If you want a data type that is more like Excel or a database table, the "data frame" will meet your need. Let's assume that you need this information.

    Month     LIBOR rate   T-bill rate
    1 month   0.01         0.015
    3 month   0.02         0.022
    6 month   0.03         0.033

You can build the data frame like this.

    > information_table <- data.frame(
    +   month=c("1m", "3m", "6m"),
    +   libor_rate=c(0.01, 0.02, 0.03),
    +   tbill_rate=c(0.015, 0.022, 0.033)
    + )
    > information_table[2]
      libor_rate
    1       0.01
    2       0.02
    3       0.03
    > information_table[,2]
    [1] 0.01 0.02 0.03

The important thing is that when you choose the name of a variable, make sure it contains no white space (space, tab, or line break).
If you type DataFrame[2], you get the 2nd column (as a one-column data frame).
If you type DataFrame[,2], you get the 2nd column as a vector. DataFrame[2,] gives the 2nd row.
If you type DataFrame[2,2], you get the value in the 2nd row and 2nd column.

<List Data Type>

If you want to put together vectors of different types, the list type is useful. The vectors don't have to have the same length. This is especially useful when you want to store unstructured data.

    > libor <- c(0.01, 0.02, 0.03) #Number
    > tbill <- c(0.01, 0.02, 0.05, 0.12) #Number
    > swap <- c("1m", "2m", "3m", "6m", "12m") #String
    > counterpartyrisk <- c(TRUE, FALSE, TRUE) #Boolean
    > list_all <- list(libor, tbill, swap, counterpartyrisk)

To access each vector, use [[]] instead of []. You can use [], but you get a list back instead of the vector, which makes it harder to reach a specific value.

    > list_all[1] #this returns list type
    [[1]]
    [1] 0.01 0.02 0.03

    > list_all[[1]] #this returns collection type (vector)
    [1] 0.01 0.02 0.03
    > list_all[1][1] #this returns list type too.
    [[1]]
    [1] 0.01 0.02 0.03
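The difference between [] and [[]] in the commands above can be checked with class(). This is only a small sketch: it reuses two of the vectors from the list example, and the names named_all, libor, and swap given to the list elements are chosen here for illustration.

```r
# Two of the vectors from the list example above
libor <- c(0.01, 0.02, 0.03)
swap  <- c("1m", "2m", "3m", "6m", "12m")
list_all <- list(libor, swap)

# Single brackets keep the list wrapper
class(list_all[1])       # "list"

# Double brackets return the element itself
class(list_all[[1]])     # "numeric"
class(list_all[[2]])     # "character"

# Naming the elements (illustrative names) allows access by name as well
named_all <- list(libor = libor, swap = swap)
named_all$swap[2]        # "2m"
named_all[["libor"]][3]  # 0.03
```

Named elements tend to be easier to read than positions once a list holds more than a few vectors.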

To reach a single value, index the vector that [[]] returns.

    > list_all[[1]][1] #1st element of the 1st vector in the list
    [1] 0.01

<Factors (Categorical value)>

A factor is a categorical variable, such as Male/Female, Adult/Kids, or A/B/C/D/F grades. If you only use Excel, you will hardly run into this data type, but if you use databases such as Oracle, MySQL, or MS-SQL, you should get familiar with the concept, as it separates the actual value from the code value.

    > information_table <- data.frame(
    +   people_name=c("Tom", "Jane", "Greg", "Kelly"),
    +   people_gender=c(1, 2, 1, 2)
    + )
    > information_table$people_gender <- factor(information_table$people_gender, labels=c("Male", "Female"))
    > information_table
      people_name people_gender
    1         Tom          Male
    2        Jane        Female
    3        Greg          Male
    4       Kelly        Female
    > information_table$people_gender
    [1] Male   Female Male   Female
    Levels: Male Female

One way to access an entire column of a data frame is to use "$": DataFrame$ColumnName

<Time Series>

A time series is very similar to a data frame. The main difference is that each row is indexed by a time stamp, such as 2015-01-01, which makes date calculations easier than with a data frame. In our posts, we are going to use the "zoo" type. I will defer the explanation until we meet "tseries", when we get stock data from the internet. Again, I want to keep the introduction as simple as possible. I believe that is the big differentiator from other online R material: you should learn a program or tool by actually solving problems with it.

[Defining functions]

Mathematically, a function can be defined as y=f(x). A function in R is not that different. It is supposed to return a certain value corresponding to the x value. If the x value is the same, so is y, unless it is a stochastic function. Say I want the square (^2) of a number. Let's do that.

    > getsquare <- function(x) {
    +   y <- x^2
    +   return(y)
    + }
    >
    > getsquare(2)
    [1] 4
    > getsquare(4)
    [1] 16

<When you use multiple input variables>

Let's assume that your function has two input variables, x and y: f(x,y) = x+y. When you call the function, you can assign each variable by name inside the parentheses. This feature is especially powerful when the function has many input variables, a case we will see often, and it keeps you from getting confused by functions that take more than 10 inputs. In mathematical terms, it looks like this: f(x=3, y=5) = 8. We can do the same in R.

    > plot(x=information_table$month, y=information_table$libor_rate) #named arguments with a built-in function, using the information_table from <Data Frame>
    > sumvariables <- function(x, y) {
    +   tmp <- x+y
    +   return(tmp)
    + }
    > sumvariables(x=3, y=5)
    [1] 8

[Other data manipulation]

<sapply>

If you want to change all the values in a data frame, say, multiply the whole data frame by 2, you can use the "sapply" command. Here are the steps.
(1) Define (or read) the data.
(2) Define the function.
(3) Apply sapply to the data frame that you want to change. The function defined in (2) is applied to each column, and therefore to every data point.
(4) See the result. Note that sapply returns a matrix here, not a data frame.

    > information_table <- data.frame(
    +   libor_rate=c(0.01, 0.02, 0.03),
    +   tbill_rate=c(0.015, 0.022, 0.033)
    + )
    >
    > changevalues <- function(x) {
    +   y <- x*2
    +   return(y)
    + }
    >
    > information_table <- sapply(information_table, changevalues)
    > information_table
         libor_rate tbill_rate
    [1,]       0.02      0.030
    [2,]       0.04      0.044
    [3,]       0.06      0.066

[Logical Flow]

If you want to control the logical flow, you can use the if ~ else statement just like in other languages. I'll skip the detailed explanation. Seeing the code will make more sense.

    > is_positive <- function(x) {
    +   if(x > 0) {
    +     print("It's positive")
    +   } else if(x == 0) {
    +     print("It's Zero")
    +   } else {
    +     print("It's negative")
    +   }
    + }
    > is_positive(3)
    [1] "It's positive"
    > is_positive(0)
    [1] "It's Zero"
    > is_positive(-1)
    [1] "It's negative"

[Data Read & Write]

We can't type in all the data. Eventually, we need to download it or extract it from a database. R can read CSV and other files, including Excel files. Here are some functions that let you read and write files.

<Directory>

First, you need to choose the directory (aka folder) that you want to work in.

    > getwd()
    [1] "/Users/seokbongchoi"

FYI, if you are a Windows user, you can still use "/" as the directory delimiter. If you want to change the working directory, use the "setwd" command.

    > setwd("/Users/seokbongchoi/Documents/FinanceInR")

<CSV File Read>

Let's start off with a CSV file. Please create a CSV file like the one below.

filename: test.csv

    column1,column2
    0,1
    2,3
    3,4
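If you prefer not to leave R, the same file can also be produced with writeLines(). The sketch below is only one way to do it. It assumes a temporary folder instead of your own working directory, and the names work_dir and csv_path are chosen here for illustration.

```r
# Work in a temporary folder so nothing in your own directories is touched
work_dir <- tempdir()

# file.path() joins path parts with "/" on every platform
csv_path <- file.path(work_dir, "test.csv")

# Write the header and the three data rows; writeLines() ends every line,
# including the last one, with a line break
writeLines(c("column1,column2", "0,1", "2,3", "3,4"), con = csv_path)

file.exists(csv_path)  # TRUE
readLines(csv_path)    # the four lines written above
```

To follow the rest of this section exactly, replace work_dir with the folder you passed to setwd().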

Please save the file in the folder that you set with "setwd()". Then we can read this file.

    > mydata <- read.csv("test.csv") #You can also use the full path "/Users/username/Documents/test.csv"
    Warning message:
    In read.table(file = file, header = header, sep = sep, quote = quote, :
      incomplete final line found by readTableHeader on 'test.csv'
    > mydata
      column1 column2
    1       0       1
    2       2       3
    3       3       4

You have successfully read the file. (The warning only means that the last line of test.csv does not end with a line break. The data is still read correctly.) You can also store data in a file with "write.csv".

    > write.csv(information_table, file="information_table.csv")
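A quick way to convince yourself that writing and reading match is a round trip: write a data frame, read it back, and compare. This is a minimal sketch that assumes a temporary file, and out_path and read_back are names made up for the example. It also uses row.names=FALSE, which the command above does not. Without it, write.csv stores the row numbers as an extra first column.

```r
information_table <- data.frame(
  month = c("1m", "3m", "6m"),
  libor_rate = c(0.01, 0.02, 0.03),
  tbill_rate = c(0.015, 0.022, 0.033)
)

out_path <- file.path(tempdir(), "information_table.csv")

# row.names=FALSE leaves out the row numbers, so the file has exactly three columns
write.csv(information_table, file = out_path, row.names = FALSE)

read_back <- read.csv(out_path)
names(read_back)  # "month" "libor_rate" "tbill_rate"
nrow(read_back)   # 3
all.equal(read_back$libor_rate, information_table$libor_rate)  # TRUE
```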

What if the file is on the internet? You just need to add the "url" function.

    > iris <- read.csv(url("http://archive.ics.uci.edu/ml/machine-learning-databases/iris/iris.data"), header = FALSE) #There are great sample data sets offered by UCI. Let's use this!
    >
    > head(iris)
       V1  V2  V3  V4          V5
    1 5.1 3.5 1.4 0.2 Iris-setosa
    2 4.9 3.0 1.4 0.2 Iris-setosa
    3 4.7 3.2 1.3 0.2 Iris-setosa
    4 4.6 3.1 1.5 0.2 Iris-setosa
    5 5.0 3.6 1.4 0.2 Iris-setosa
    6 5.4 3.9 1.7 0.4 Iris-setosa

The head() function shows the first 6 rows of the data by default. tail() shows the last 6 rows.

<Excel File Read>

To read an Excel file, you need to install an external package. You can do this with the "install.packages" instruction. You should download the "gdata" package.

    > install.packages("gdata")
    The downloaded binary packages are in
        /var/folders/54/5hcnnr_12m1bj167mv61gf3w0000gn/T//Rtmps8bNOL/downloaded_packages

Reading and writing are pretty much the same as with a CSV file. The only difference is the instruction. (This time, I would encourage you to make a simple Excel file like the one I used in the CSV example.)

To use the library, you need to load it as below.

    > library(gdata)

Then you can use the functions in that library. (By the way, a library is nothing but a collection of functions made by a third party.)

    > my.data <- read.xls("test_excel.xlsx")
    > my.data
      Column1 Column2
    1       1       3
    2       2       4
    3       3       5

[String Manipulation]

I think data science consists of 4 parts.
(1) Data Gathering
(2) Data Manipulation
(3) Data Visualization
(4) Data Interpretation
Each part is equally important. In R, however, data manipulation sometimes matters more than the others, because information gathered from the Internet is likely to arrive as strings. You need to slice or merge them to make them meaningful. Here are some string functions that let you manipulate strings.

<Concatenation>

Unfortunately, concatenating strings is not as easy as in other languages. Use "paste" to concatenate strings. By default, paste puts a single space between its arguments.

    > sentence <- paste("I am", "a boy.")
    > sentence <- paste(sentence, "You are a girl")
    > sentence
    [1] "I am a boy. You are a girl"

<Replace String>

When you want to replace part of a string, gsub does the job. Keep in mind that the pattern is a regular expression. A plain word works as it is, but if the pattern contains special characters, you will need to study regular expressions further.

    > sentence <- "I am a boy. You are a girl"
    > sentence_fixed <- gsub(pattern="a boy", replacement="boys", sentence)
    > sentence_fixed
    [1] "I am boys. You are a girl"
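Two details of paste and gsub are easy to trip over: the separator that paste inserts, and the fact that gsub reads its pattern as a regular expression. The short sketch below uses only base R functions. The sample strings are made up for illustration.

```r
# paste() joins its arguments with sep, which defaults to a single space
paste("I am", "a boy.")                 # "I am a boy."
paste("2015", "01", "02", sep = "-")    # "2015-01-02"

# paste0() joins with no separator at all
paste0("Test", 1)                       # "Test1"

# collapse joins the elements of one vector into a single string
paste(c("1m", "3m", "6m"), collapse = ", ")  # "1m, 3m, 6m"

# In a regular expression "." matches any character ...
gsub(pattern = ".", replacement = "!", "ab")                        # "!!"
# ... while fixed=TRUE makes gsub() treat the pattern as plain text
gsub(pattern = ".", replacement = "!", "I am a boy.", fixed = TRUE) # "I am a boy!"
```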

But what if we want to change both "a boy" and "a girl" to "a student"? That's where regular expressions come into play.

    > sentence <- "I am a boy. You are a girl"
    > sentence_fixed <- gsub(pattern="((a boy)|(a girl))", replacement="a student", sentence)
    > sentence_fixed
    [1] "I am a student. You are a student"

"|" means "OR".

<substring>

This is the counterpart of the LEFT / RIGHT functions in Excel. R has a built-in function, "substr", but I found that an external string library makes string manipulation easier. Let's install the "stringr" package with install.packages("stringr"). The function is str_sub(string, start, end).

    > library(stringr)
    > sentence <- "Test1Test2"
    > str_sub(sentence, 1, 4)
    [1] "Test"
    > str_sub(sentence, 1, 5)
    [1] "Test1"
    > str_sub(sentence, 2, 4)
    [1] "est"

You can use a negative number as an index. In that case, it means "count from the end of the string".

    > str_sub(sentence, -1, -1)
    [1] "2"
    > str_sub(sentence, -4, -1)
    [1] "est2"
    > str_sub(sentence, -5, -1)
    [1] "Test2"

[Add or remove the data point from the data frame]

Now it's time to look at how to manipulate the data frame. Besides "sapply", there are many ways to manipulate a data frame.

<rbind - merge two data frames into one>

    > information_table <- data.frame(
    +   month=c("1m", "3m", "6m"),
    +   libor_rate=c(0.01, 0.02, 0.03),
    +   tbill_rate=c(0.015, 0.022, 0.033)
    + )
    >
    > add_table <- data.frame(
    +   month=c("12m", "18m"),
    +   libor_rate=c(0.04, 0.05),
    +   tbill_rate=c(0.045, 0.055)
    + )
    > information_table <- rbind(information_table, add_table)
    > information_table
      month libor_rate tbill_rate
    1    1m       0.01      0.015
    2    3m       0.02      0.022
    3    6m       0.03      0.033
    4   12m       0.04      0.045
    5   18m       0.05      0.055

<cbind - if you want to add one additional column on the data frame>

    > information_table <- data.frame(
    +   month=c("1m", "3m", "6m"),
    +   libor_rate=c(0.01, 0.02, 0.03),
    +   tbill_rate=c(0.015, 0.022, 0.033)
    + )
    > add_column <- data.frame(
    +   rprate=c(0.01, 0.022, 0.031)
    + )
    >
    > information_table <- cbind(information_table, add_column)
    > information_table
      month libor_rate tbill_rate rprate
    1    1m       0.01      0.015  0.010
    2    3m       0.02      0.022  0.022
    3    6m       0.03      0.033  0.031

<delete>

You can use the "-" operator to get rid of a row or a column. (The output below matches the original three-column information_table, before rprate was added.)

    > information_table <- information_table[-3,] #Get rid of 3rd row
    > information_table
      month libor_rate tbill_rate
    1    1m       0.01      0.015
    2    3m       0.02      0.022
    > information_table <- information_table[,-1] #Get rid of 1st column
    > information_table
      libor_rate tbill_rate
    1       0.01      0.015
    2       0.02      0.022

[Basic stats for your data]

Now it's time to use actual data. We are going to use the stock price history of Apple in 2015. To do that, please install "tseries" first.

    > install.packages("tseries")
    > library(tseries)
    > aapl <- get.hist.quote("AAPL", #Ticker symbol
    +   start="2015-01-01", #Start date YYYY-MM-DD
    +   end="2015-12-31" #End date YYYY-MM-DD
    + )
    time series starts 2015-01-02
    > head(aapl)
                 Open   High    Low  Close
    2015-01-02 111.39 111.44 107.35 109.33
    2015-01-05 108.29 108.65 105.41 106.25
    2015-01-06 106.54 107.43 104.63 106.26
    2015-01-07 107.20 108.20 106.70 107.75
    2015-01-08 109.23 112.15 108.70 111.89
    2015-01-09 112.67 113.25 110.21 112.01

As it is a series of prices, each day has an opening price, a high price, a low price, and a closing price. What really matters to us is the closing price. When we say that "the stock price went up", we usually mean the closing price.

<Summary>

You can see summarized information about the time series.

    > summary(aapl)
         Index                 Open             High            Low            Close
     Min.   :2015-01-02   Min.   : 94.87   Min.   :107.0   Min.   : 92.0   Min.   :103.1
     1st Qu.:2015-04-05   1st Qu.:113.17   1st Qu.:114.4   1st Qu.:112.0   1st Qu.:113.4
     Median :2015-07-04   Median :120.80   Median :121.6   Median :119.3   Median :120.3
     Mean   :2015-07-03   Mean   :120.18   Mean   :121.2   Mean   :118.9   Mean   :120.0
     3rd Qu.:2015-10-01   3rd Qu.:127.14   3rd Qu.:127.9   3rd Qu.:126.0   3rd Qu.:126.9
     Max.   :2015-12-31   Max.   :134.46   Max.   :134.5   Max.   :131.4   Max.   :133.0
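summary() also works on a plain numeric vector, so it can be tried without downloading anything. The sketch below is only an illustration: it types in the six closing prices that head(aapl) printed above, and close_price is a name chosen here. Six days are of course far too few for real analysis.

```r
# The six closing prices shown by head(aapl) above, typed in by hand
close_price <- c(109.33, 106.25, 106.26, 107.75, 111.89, 112.01)

# Minimum, quartiles, median, mean, and maximum in one call
summary(close_price)

# range() returns the minimum and maximum together
range(close_price)     # 106.25 112.01

# quantile() gives the quartiles that summary() reports
quantile(close_price, probs = c(0.25, 0.5, 0.75))
```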

<Min>

From now on, we are going to use the closing price. Check that these values are consistent with the summary above.

    > min(aapl$Close)
    [1] 103.12

<Max>

    > max(aapl$Close)
    [1] 133

<Mean>

    > mean(aapl$Close)
    [1] 120.04

<Median>

    > median(aapl$Close)
    [1] 120.3

[Drawing the basic graph]

Please don't remove the variable aapl. We are going to keep using it to draw a simple graph. It's simple: for a time series, it takes just one command.

    > plot(aapl$Close)
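If the download is not available, a similar line chart can be drawn from dates and prices typed in by hand. This is a rough sketch, not a replacement for the zoo plot above: it assumes only base R, reuses the six rows that head(aapl) printed earlier, and the variable names dates and close_price are chosen for the example.

```r
# Dates and closing prices copied from the head(aapl) output
dates <- as.Date(c("2015-01-02", "2015-01-05", "2015-01-06",
                   "2015-01-07", "2015-01-08", "2015-01-09"))
close_price <- c(109.33, 106.25, 106.26, 107.75, 111.89, 112.01)

# type="l" draws a line; main, xlab, and ylab set the title and axis labels
plot(x = dates, y = close_price, type = "l",
     main = "AAPL closing price", xlab = "Date", ylab = "Close")

# Mark the individual observations on top of the line
points(x = dates, y = close_price, pch = 16)
```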

If you use a data frame instead, you need to specify which column goes on the x-axis and which goes on the y-axis.

    > information_table <- data.frame(
    +   month=c("1m", "3m", "6m"),
    +   libor_rate=c(0.01, 0.02, 0.03),
    +   tbill_rate=c(0.015, 0.022, 0.033)
    + )
    > plot(x=information_table$month, y=information_table$libor_rate) #X-axis: Month, Y-axis: Libor Rate
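One caveat: since R 4.0, data.frame() keeps strings such as "1m" as plain characters instead of converting them to factors, and plot() may then fail because it cannot place text on a numeric axis. The sketch below shows two possible workarounds. It is not part of the original example, and the name positions is chosen here for illustration.

```r
information_table <- data.frame(
  month = c("1m", "3m", "6m"),
  libor_rate = c(0.01, 0.02, 0.03),
  tbill_rate = c(0.015, 0.022, 0.033)
)

# Option 1: make month a factor with an explicit level order
information_table$month <- factor(information_table$month,
                                  levels = c("1m", "3m", "6m"))
plot(x = information_table$month, y = information_table$libor_rate,
     xlab = "Month", ylab = "LIBOR rate")

# Option 2: use numeric positions on the x-axis and label them with the months
positions <- seq_along(information_table$month)
plot(x = positions, y = information_table$libor_rate, type = "b",
     xaxt = "n", xlab = "Month", ylab = "Rate", ylim = c(0, 0.04))
axis(side = 1, at = positions, labels = levels(information_table$month))

# Add the T-bill rate as a second, dashed line
lines(x = positions, y = information_table$tbill_rate, type = "b", lty = 2)
legend("topleft", legend = c("LIBOR", "T-bill"), lty = c(1, 2))
```

Option 1 draws one box per month because the x value is a factor. Option 2 gives an ordinary line chart and makes it easy to overlay both rates.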

If you want more R examples, please visit www.mbaprogrammer.com
