R for Hadoopers
R + Hadoop presentation. From OSCON and Seattle meetup.
Goals:
• Teach R basics and Hadoop basics
• Share useful R libraries
• Explain RHadoop
• Give tips on how to use RHadoop
• Have fun, kick ass
Not a Goal:
Turn you into:
• Statistician
• Machine learning expert
• R guru
• Hadoop expert
#include warning.h
Agenda
• R Basics
• Hadoop Basics
• Data Manipulation
• RHadoop
Get Started with RStudio
Basic Data Types
• String
• Number
• Boolean
• Assignment <-
R can be a nice calculator
> x <- 1
> x * 2
[1] 2
> y <- x + 3
> y
[1] 4
> log(y)
[1] 1.386294
> help(log)
Complex Data Types
• Vector
  • c, seq, rep, []
• List
• Data Frame
  • List of vectors of the same length
  • Not a matrix
Creating vectors
> v1 <- c(1,2,3,4)
> v1
[1] 1 2 3 4
> v1 * 4
[1]  4  8 12 16
> v4 <- c(1:5)
> v4
[1] 1 2 3 4 5
> v2 <- seq(2,12,by=3)
> v2
[1]  2  5  8 11
> v1 * v2
[1]  2 10 24 44
> v3 <- rep(3,4)
> v3
[1] 3 3 3 3
Accessing and filtering vectors
> v1 <- c(2,4,6,8)
> v1
[1] 2 4 6 8
> v1[2]
[1] 4
> v1[2:4]
[1] 4 6 8
> v1[-2]
[1] 2 6 8
> v1[v1>3]
[1] 4 6 8
Lists
> lst <- list(1,"x",FALSE)
> lst
[[1]]
[1] 1

[[2]]
[1] "x"

[[3]]
[1] FALSE

> lst[1]
[[1]]
[1] 1

> lst[[1]]
[1] 1
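Data Frames
books <- read.csv("~/books.csv")
books[1,]                      # first row
books[,1]                      # first column
books[3:4]                     # columns 3 and 4
books$price                    # the price column as a vector
books[books$price==6.99,]      # rows where the price is 6.99
martin_price <- books[books$author_t=="George R.R. Martin",]$price
mean(martin_price)
subset(books,select=-c(id,cat,sequence_i))   # drop the id, cat and sequence_i columns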
Vectorization: always prefer operations on entire vectors
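As a minimal sketch of what this advice means in practice, the snippet below computes the same result twice: once with an explicit loop and once with a single vector operation. The `prices` values and the 10% markup are made up for illustration.

```r
# Made-up sample data for illustration
prices <- c(6.99, 7.99, 5.99)

# Loop version: one element at a time
with_markup_loop <- numeric(length(prices))
for (i in seq_along(prices)) {
  with_markup_loop[i] <- prices[i] * 1.1
}

# Vectorized version: one operation on the entire vector
with_markup <- prices * 1.1
```

Both versions give the same numbers; the vectorized form is shorter and is usually faster in R because the loop runs inside the interpreter's compiled code.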
Functions
> sq <- function(x) { x*x }
> sq(3)
[1] 9
Note: R is a functional programming language. Functions are first-class objects and can be passed to other functions.
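A short sketch of passing functions around, reusing `sq` from the slide. The helper `apply_twice` is a hypothetical name introduced only for this example.

```r
sq <- function(x) { x * x }

# Hypothetical helper: takes a function f and applies it twice
apply_twice <- function(f, x) {
  f(f(x))
}

fourth_power <- apply_twice(sq, 3)   # sq(sq(3))

# Built-in functions such as sapply also accept a function as an argument
squares <- sapply(1:4, sq)
```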
Packages
Agenda
• R Basics
• Hadoop Basics
• Data Manipulation
• RHadoop
“In pioneer days they used oxen for heavy pulling, and when one ox couldn’t budge a log, they didn’t try to grow a larger ox”
Grace Hopper, early advocate of distributed computing
Hadoop in a Nutshell
Map-Reduce is the interesting bit
• Map – Apply a function to each input record
• Shuffle & Sort – Partition the map output and sort each partition
• Reduce – Apply aggregation function to all values in each partition
• Map reads input from disk
• Reduce writes output to disk
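The three phases can be imitated on a single machine in plain R. This is only a local sketch with made-up input, not Hadoop itself: `strsplit` plays the map step, `split` plays shuffle and sort, and `sapply` with `sum` plays the reduce step.

```r
# Made-up input records
records <- c("a b a", "b c")

# Map: emit one (word, 1) pair per word
words <- unlist(strsplit(records, " "))
pairs <- data.frame(key = words, value = 1, stringsAsFactors = FALSE)

# Shuffle & sort: group the values by key (split orders the keys)
groups <- split(pairs$value, pairs$key)

# Reduce: aggregate all values of each key
counts <- sapply(groups, sum)
```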
Example – Sessionize clickstream
Sessionize
Identify unique “sessions” of interaction with our website
Session: for each user (IP), a set of clicks that happened within 30 minutes of each other
Input – Apache Access Log Records
127.0.0.1 - frank [10/Oct/2000:13:55:36 -0700] "GET /apache_pb.gif HTTP/1.0" 200 2326
Output – Add Session ID
127.0.0.1 - frank [10/Oct/2000:13:55:36 -0700] "GET /apache_pb.gif HTTP/1.0" 200 2326 15
Overview
[Diagram: log lines are read by three Map tasks; the map output is grouped by IP (“IP1, log lines”) and passed to two Reduce tasks; the output is “Log line, session ID”]
Map
parsedRecord = re.search('(\\d+.\\d+….', record)
IP = parsedRecord.group(1)
timestamp = parsedRecord.group(2)
print((IP, timestamp), record)
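The same map step can be sketched in R. The regular expression below is an assumption written for the sample log line on the “Input” slide (the slide's own pattern is abbreviated); it captures the first field as the IP and the bracketed field as the timestamp.

```r
# Sample record from the "Input" slide
record <- '127.0.0.1 - frank [10/Oct/2000:13:55:36 -0700] "GET /apache_pb.gif HTTP/1.0" 200 2326'

# Assumed pattern: IP is the first field, the timestamp sits inside [ ]
pattern <- '^(\\S+) \\S+ \\S+ \\[([^]]+)\\]'
parsed <- regmatches(record, regexec(pattern, record))[[1]]

ip <- parsed[2]          # first capture group
timestamp <- parsed[3]   # second capture group

# Emit ((IP, timestamp), record), as on the slide
map_output <- list(key = c(ip, timestamp), value = record)
```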
Shuffle & Sort
Partition by: IP
Sort by: timestamp
Now reduce gets:
(IP,timestamp) [record1,record2,record3….]
Reduce
sessionID = 1
curr_record = records[0]
curr_timestamp = getTimestamp(curr_record)
for record in records:
    # records arrive sorted by timestamp; a gap of more than 30 minutes starts a new session
    if getTimestamp(record) - curr_timestamp > 30:
        sessionID += 1
    curr_timestamp = getTimestamp(record)
    print(record + " " + str(sessionID))
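In R the same reduce logic can be written without a loop. This is a sketch under two assumptions: timestamps are already sorted and expressed as plain numbers of minutes, and `sessionize` is a hypothetical helper name, not part of any package.

```r
# Hypothetical helper: assign session IDs to sorted timestamps (in minutes)
sessionize <- function(ts_minutes, gap = 30) {
  if (length(ts_minutes) == 0) {
    return(integer(0))
  }
  # A new session starts at the first click and after every gap longer than `gap`
  new_session <- c(TRUE, diff(ts_minutes) > gap)
  cumsum(new_session)
}

# Made-up click times for one IP
clicks <- c(0, 5, 20, 60, 70, 200)
session_ids <- sessionize(clicks)
```

Here `diff` finds the gaps between consecutive clicks and `cumsum` turns the "new session" flags into running session numbers, so the six clicks above fall into sessions 1, 1, 1, 2, 2 and 3.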
Agenda
• R Basics
• Hadoop Basics
• Data Manipulation Libraries
• RHadoop
Reshape2
• Two functions:
  • Melt – wide format to long format
  • Cast – long format to wide format
• Columns: identifiers or measured variables
• Molten data:
  • Unique identifiers
  • New column – variable name
  • New column – value
• Default – all numeric columns are treated as measured values
Melt
> tips
 total_bill  tip    sex smoker day   time size
      16.99 1.01 Female     No Sun Dinner    2
      10.34 1.66   Male     No Sun Dinner    3
      21.01 3.50   Male     No Sun Dinner    3

> melt(tips)
    sex smoker day   time   variable value
 Female     No Sun Dinner total_bill 16.99
 Female     No Sun Dinner        tip  1.01
 Female     No Sun Dinner       size     2
Cast
> m_tips <- melt(tips)
> m_tips
    sex smoker day   time   variable value
 Female     No Sun Dinner total_bill 16.99
 Female     No Sun Dinner        tip  1.01
 Female     No Sun Dinner       size     2

> dcast(m_tips,sex+time~variable,mean)
    sex   time total_bill      tip     size
 Female Dinner   19.21308 3.002115 2.461538
 Female  Lunch   16.33914 2.582857 2.457143
   Male Dinner   21.46145 3.144839 2.701613
   Male  Lunch   18.04848 2.882121 2.363636
*Apply
• apply – apply a function on rows or columns of a matrix
• lapply – apply a function on each item of a list
  • Returns a list
• sapply – like lapply, but returns a vector when the results can be simplified
• tapply – apply a function to subsets of a vector, grouped by a factor
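A small base-R sketch of the four functions side by side. All input values are made up for illustration.

```r
m <- matrix(1:6, nrow = 2)          # rows: (1, 3, 5) and (2, 4, 6)
row_sums <- apply(m, 1, sum)        # 1 = apply over rows
col_sums <- apply(m, 2, sum)        # 2 = apply over columns

words <- list("a", "bb", "ccc")
lens_list <- lapply(words, nchar)   # returns a list
lens_vec  <- sapply(words, nchar)   # same values, simplified to a vector

# tapply: mean tip per person, grouping one vector by another
tip <- c(1, 2, 1, 2.5, 3, 1)
who <- c("Gwen", "Jeff", "Leon", "Gwen", "Leon", "Jeff")
avg_tip <- tapply(tip, who, mean)
```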
plyr
• Split – apply – combine
• ddply – data frame to data frame
  ddply(.data, .variables, .fun = NULL, ...,
• summarize – aggregate data into new data frame
• transform – modify data frame
DDPLY Example
> ddply(tips,c("sex","time"),summarize,
+   mean=mean(tip),
+   sd=sd(tip),
+   ratio=mean(tip/total_bill)
+ )
     sex   time     mean       sd     ratio
1 Female Dinner 3.002115 1.193483 0.1693216
2 Female  Lunch 2.582857 1.075108 0.1622849
3   Male Dinner 3.144839 1.529116 0.1554065
4   Male  Lunch 2.882121 1.329017 0.1660826
Agenda
• R Basics
• Hadoop Basics
• Data Manipulation Libraries
• RHadoop
RHadoop Projects
• RMR
• RHDFS
• RHBase
• (new) PlyRMR
Most Important: RMR does not parallelize algorithms.
It allows you to implement MapReduce in R. Efficiently. That’s it.
What does that mean?
• Use RMR if you can break your problem down into small pieces and apply the algorithm to each piece
• Use a commercial R+Hadoop product if you need a parallel version of a well-known algorithm
• Good fit: Fit piecewise regression model for each county in the US
• Bad fit: Fit piecewise regression model for the entire US population
• Bad fit: Logistic regression
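The "good fit" shape, one independent model per group, looks like this in plain R on a single machine. This is only a sketch: the data is simulated, the county names are made up, and an ordinary linear model (`lm`) stands in for the piecewise regression mentioned above. With RMR, each group would be handled by a separate reduce call instead of `lapply`.

```r
set.seed(1)

# Simulated data: two made-up counties with different slopes
df <- data.frame(county = rep(c("A", "B"), each = 10),
                 x = rep(1:10, 2))
df$y <- ifelse(df$county == "A", 2, -1) * df$x + rnorm(20, sd = 0.1)

# Split by county, then fit one small model per piece, independently
fits <- lapply(split(df, df$county), function(d) coef(lm(y ~ x, data = d)))
```

Each fit only needs the rows of its own county, which is why this kind of problem splits cleanly across reducers.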
Use-case examples – Good or Bad?
1. Model power consumption per household to determine if incentive programs work
2. Aggregate corn yield per 10x10 portion of field to determine best seeds to use
3. Create churn models for service subscribers and determine who is most likely to cancel
4. Determine correlation between device restarts and support calls
Second Most Important: RMR requires R, RMR, and all the libraries you’ll use to be installed on all nodes and accessible by the Hadoop user
RMR is different from Hadoop Streaming.
RMR mapper input:
Key, [List of Records]
This is so we can use vector operations
How to RMRify a Problem
In more detail…
• Mappers get a list of values
• You need to process each one independently
• But do it for all lines at once.
• Reducers work normally
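A sketch of what "all lines at once" can look like inside a mapper body. The vector `v` stands in for the batch of lines a mapper receives; its contents are made up, and no rmr2 call is involved.

```r
# Stand-in for the batch of input lines handed to one mapper call
v <- c("Gwen 1", "Jeff 2", "Leon 1")

# Each line is handled independently, but sub() works on the whole vector
person <- sub(" .*$", "", v)              # text before the first space
tip    <- as.numeric(sub("^.* ", "", v))  # text after the last space
```

There is no explicit loop over lines: `sub` and `as.numeric` are vectorized, so one call processes the entire batch.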
Demo 6
> library(rmr2)

# Try the word splitting locally first
t <- list("hello world","don't worry be happy")
unlist(sapply(t,function (x) {strsplit(x," ")}))

# Map: emit (word, 1) for every word in the batch of lines
wc.map <- function(k,v) {
  ret_k <- unlist(sapply(v,function(x){strsplit(x," ")}))
  keyval(ret_k,1)
}

# Reduce: sum the counts for each word
wc.reduce <- function(k,v) {
  keyval(k,sum(v))
}

mapreduce(input="~/hadoop-recipes/data/shakespeare/Shakespeare_2.txt",
          output="~/wc.json",
          input.format="text",
          output.format="json",
          map=wc.map,
          reduce=wc.reduce)
Cheating in MapReduce: do everything possible to have map-only jobs
Avg Tips per Person – Naïve Input
Gwen 1
Jeff 2
Leon 1
Gwen 2.5
Leon 3
Jeff 1
Gwen 1
Gwen 2
Jeff 1.5
Avg Tips per Person - Naive
avg.map <- function(k,v) {
  keyval(v$V1,v$V2)
}

avg.reduce <- function(k,v) {
  keyval(k,mean(v))
}

mapreduce(input="~/hadoop-recipes/data/tip1.txt",
          output="~/avg.txt",
          input.format=make.input.format("csv"),
          output.format="text",
          map=avg.map,
          reduce=avg.reduce)
Avg Tips per Person – Awesome Input
Gwen 1,2.5,1,2
Jeff 2,1,1.5
Leon 1,3
Avg Tips per Person - Optimized
avg2.map <- function(k,v) {
  v1 <- (sapply(v$V2,function(x){strsplit(as.character(x)," ")}))
  keyval(v$V1,sapply(v1,function(x){mean(as.numeric(x))}))
}

mapreduce(input="~/hadoop-recipes/data/tip2.txt",
          output="~/avg2.txt",
          input.format=make.input.format("csv",sep=","),
          output.format="text",
          map=avg2.map)
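The idea behind the map-only version can be checked locally in base R: once all tips for a person sit on one line, each line can be averaged on its own and no reduce step is needed. This sketch uses the three lines from the "Awesome Input" slide and does not call rmr2; the parsing shown is an assumption about that layout, not the demo's actual input format.

```r
# The pre-grouped input from the "Awesome Input" slide
lines <- c("Gwen 1,2.5,1,2", "Jeff 2,1,1.5", "Leon 1,3")

# Split each line into the name and the comma-separated tips
parts <- strsplit(lines, " ")
who   <- sapply(parts, `[`, 1)
tips  <- strsplit(sapply(parts, `[`, 2), ",")

# "Map-only": every line is averaged independently of the others
avg_tip <- sapply(tips, function(x) mean(as.numeric(x)))
names(avg_tip) <- who
```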
Few Final RMR Tips
• Backend = "local" has files as input and output
• Backend = "hadoop" uses HDFS directories
• In "hadoop" mode, print(X) inside the mapper will fail the job.
• Use: cat("ERROR!", file = stderr())
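One way to follow the last tip is to wrap the `cat` call in a small helper so that debug output always goes to standard error. `log_msg` is a hypothetical name used only in this sketch.

```r
# Hypothetical helper: write a debug message to stderr instead of stdout
log_msg <- function(...) {
  cat(..., "\n", file = stderr())
}

# Example call, e.g. from inside a mapper
result <- log_msg("processing batch of", 3, "records")
```

Because nothing is written to standard output, the message does not get mixed into the data the mapper emits.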
Recommended Reading
• http://cran.r-project.org/doc/manuals/R-intro.html
• http://blog.revolutionanalytics.com/2013/02/10-r-packages-every-data-scientist-should-know-about.html
• http://had.co.nz/reshape/paper-dsc2005.pdf
• http://seananderson.ca/2013/12/01/plyr.html
• https://github.com/RevolutionAnalytics/rmr2/blob/master/docs/tutorial.md
• http://cran.r-project.org/web/packages/data.table/index.html