Integrate Hive and R
RHive : Integrating R and Hive
Introduction
JunHo Cho
Data Analysis Platform Team
Friday, November 11, 11
Analysis of Data
MapReduce
Clustering
Classifier
CF
Decision Tree
Graph
Recommendation
Related Works
• RHIPE
• RHadoop
• hive (Hadoop InteractiVE)
• segue
Must understand MapReduce
RHive is inspired by ...
• Many analysts have used R for a long time
• Many analysts can use SQL
• There are already a lot of statistical functions in R
• R needs a capability to analyze big data
• Hive supports SQL-like query language (HQL)
• Hive supports MapReduce to execute HQL
R is the best solution for familiarity
Hive is the best solution for capability
RHive Components
• Hadoop
• store and analyze big data
• Hive
• use HQL instead of MapReduce programming
• R
• provide a friendly environment for analysts
RHive - Architecture
[Figure: RHive architecture. Components: Rserve, R Function, R Object, RUDF, RUDAF.]
Execute R Function Objects and R Objects through Hive Query
Execute Hive Query through R
rcal <- function(arg1,arg2) { coeff * sum(arg1,arg2) }
SELECT R('rcal',col1,col2) FROM tab1
RHive API
• Extension R Functions
• rhive.connect
• rhive.query
• rhive.assign
• rhive.export
• rhive.napply
• rhive.sapply
• rhive.aggregate
• rhive.list.tables
• rhive.load.table
• rhive.desc.table
• Extension Hive Functions
• RUDF
• RUDAF
• GenericUDTFExpand
• GenericUDTFUnFold
RUDF - R User-defined Functions
• The UDF does not know the return type until the R function is called
• TYPE : an example value of the return type (for instance 0.0 for numeric)
SELECT R('R function-name',col1,col2,...,TYPE)
Example : R function which sums all passed columns
sumCols <- function(arg1,...) {
  sum(arg1,...)
}
rhive.assign('sumCols',sumCols)
rhive.exportAll('sumCols',hadoop-clusters)
result <- rhive.query("SELECT R('sumCols', col1, col2, col3, col4, 0.0) FROM tab")
plot(result)
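The row-by-row behaviour of the R(...) call can be tried out locally before anything is exported to the cluster. The following is only a sketch in plain R under assumptions: `r_udf_local` is a hypothetical helper that is not part of RHive, the sample table is made up, and the trailing TYPE argument is assumed to act as an example value whose class decides the return type.

```r
sumCols <- function(arg1, ...) {
  sum(arg1, ...)
}

# Hypothetical local stand-in for: SELECT R('fn', col1, ..., TYPE) FROM tab
# It calls `fun` once per row and coerces each result to the class of `type`.
r_udf_local <- function(tab, fun, cols, type) {
  args <- c(list(FUN = fun, SIMPLIFY = FALSE, USE.NAMES = FALSE),
            unname(as.list(tab[cols])))
  rows <- do.call(mapply, args)
  vapply(rows, function(x) methods::as(x, class(type)), type)
}

# Made-up sample data standing in for the Hive table "tab".
tab <- data.frame(col1 = c(1, 2), col2 = c(10, 20), col3 = c(100, 200))
result <- r_udf_local(tab, sumCols, c("col1", "col2", "col3"), 0.0)
print(result)
```

Each element of `result` corresponds to one input row, which is the shape a per-row UDF would return.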
RUDAF - R User-defined Aggregation Function
• R cannot manipulate large datasets
• Supports the UDAF life cycle
• iterate, partial-merge, merge, terminate
• The return type is always a string delimited by ',' - "data1,data2,data3,..."
SELECT RA('R function-name',col1,col2,...)
[Figure: RUDAF life cycle. Stages: FUN, FUN.partial, FUN.merge, FUN.terminate. Labels: partial aggregation, partial aggregation, aggregation, values.]
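The four stages named on this slide can be imitated in plain R to see why an aggregate has to be split up this way. This is a sketch under assumptions rather than RHive's actual calling convention: the argument lists of the four functions, the state layout, and the three made-up partitions are all illustrative. Only the naming pattern (FUN, FUN.partial, FUN.merge, FUN.terminate) and the comma-delimited string result come from the slide.

```r
# Running state for a mean: total and row count.
init_state <- c(sum = 0, n = 0)

# iterate: fold one input value into the state (assumed signature).
avg <- function(state, value) state + c(value, 1)

# partial: hand the per-task state on to the merge step unchanged.
avg.partial <- function(state) state

# merge: combine two partial states.
avg.merge <- function(a, b) a + b

# terminate: emit the final values as one comma-delimited string.
avg.terminate <- function(state) {
  paste(state[["sum"]] / state[["n"]], state[["n"]], sep = ",")
}

# Made-up data, split as if three tasks each saw part of the table.
partitions <- list(c(1, 2, 3), c(4, 5), c(6))

partials <- lapply(partitions, function(p) avg.partial(Reduce(avg, p, init_state)))
merged <- Reduce(avg.merge, partials)
result <- avg.terminate(merged)
print(result)
```

No single step ever holds the whole dataset; each task only carries a two-number state, which is what lets R take part in an aggregation over data it could not load at once.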
UDTF : unfold and expand
• RUDAF only returns a string delimited by ','
• unfold and expand convert the RUDAF result into a form that loads as an R data.frame
RA('newcenter',...) returns "num1,num2,num3" per cluster-key
select unfold(tb1.x,0.0,0.0,0.0,',') as (col1,col2,col3) from (select RA('newcenter', attr1,attr2,attr3,attr4) as x from table group by cluster-key) tb1
unfold(string_value,type1,type2,...,delimiter)
expand(string_value,type,delimiter)
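What the two UDTFs do to a delimited string can be pictured with local R analogues. This is a sketch, not the Hive implementation: `unfold_local` and `expand_local` are hypothetical names, and the reading that unfold yields one row with several typed columns while expand yields one row per element is inferred from the two signatures above.

```r
# Hypothetical analogue of unfold: one string -> one row, one typed column per field.
unfold_local <- function(string_value, types, delimiter = ",") {
  parts <- strsplit(string_value, delimiter, fixed = TRUE)[[1]]
  stopifnot(length(parts) == length(types))
  cols <- Map(function(p, t) methods::as(p, class(t)), parts, types)
  names(cols) <- paste0("col", seq_along(cols))
  as.data.frame(cols, stringsAsFactors = FALSE)
}

# Hypothetical analogue of expand: one string -> one row per field, single type.
expand_local <- function(string_value, type, delimiter = ",") {
  parts <- strsplit(string_value, delimiter, fixed = TRUE)[[1]]
  data.frame(value = methods::as(parts, class(type)), stringsAsFactors = FALSE)
}

# Made-up RUDAF output for one cluster key.
row  <- unfold_local("1.5,2.5,3.5", list(0.0, 0.0, 0.0))
rows <- expand_local("1.5,2.5,3.5", 0.0)
print(row)
print(rows)
```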
napply and sapply
• napply : R apply function for Numeric type
• sapply : R apply function for String type
rhive.napply(table-name,FUN,col1,...)
rhive.sapply(table-name,FUN,col1,...)
Example : R function which sums all passed columns
sumCols <- function(arg1,...) {
  sum(arg1,...)
}
result <- rhive.napply("tab", sumCols, "col1", "col2", "col3", "col4")
rhive.load.table(result)
napply
rhive.napply <- function(tablename, FUN, col = NULL, ...) {
  if (is.null(col)) cols <- "" else cols <- paste(",", col)
  for (element in c(...)) { cols <- paste(cols, ",", element) }
  exportname <- paste(tablename, "_sapply", as.integer(Sys.time()), sep = "")
  rhive.assign(exportname, FUN)
  rhive.exportAll(exportname)
  # sep = "" keeps the temporary table name free of spaces
  tmptable <- paste(exportname, "_table", sep = "")
  rhive.query(paste("CREATE TABLE ", tmptable, " AS SELECT ", "R('", exportname, "'", cols, ",0.0) FROM ", tablename, sep = ""))
  tmptable
}
• 'napply' is similar to the R apply function
• Stores large results in HDFS as a Hive table
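The core of napply is string assembly: it wraps the exported function name and the column names in a CREATE TABLE ... AS SELECT R(...) statement. The sketch below isolates that step in plain R so the generated HQL can be inspected without a cluster. `build_napply_query` is a hypothetical helper, not an RHive function, and the export name is made up.

```r
# Hypothetical helper: build the HQL that an napply-style call would submit.
build_napply_query <- function(tablename, exportname, ...) {
  cols <- c(...)
  # Quoted function name first, then the columns, then 0.0 as the numeric TYPE marker.
  args <- paste(c(sprintf("'%s'", exportname), cols, "0.0"), collapse = ", ")
  tmptable <- paste0(exportname, "_table")
  sprintf("CREATE TABLE %s AS SELECT R(%s) FROM %s", tmptable, args, tablename)
}

q <- build_napply_query("tab", "tab_napply_1", "col1", "col2")
print(q)
```

Printing the statement before running it is a cheap way to catch malformed table or column names.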
aggregate
• RHive aggregation function that aggregates data stored in HDFS using a Hive function
rhive.aggregate(table-name,hive-FUN,...,groups)
Example : Aggregate using SUM (Hive aggregation function)
result <- rhive.aggregate("emp", "SUM", "sal", groups="deptno")
rhive.load.table(result)
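For comparison, the same grouping on a small in-memory data frame is a single base R call. This sketch uses made-up rows for `emp` and base R's `aggregate`, not RHive; it only illustrates the result shape to expect from summing `sal` per `deptno`.

```r
# Made-up rows standing in for the Hive table "emp".
emp <- data.frame(deptno = c(10, 10, 20), sal = c(1000, 1500, 2000))

# Sum of sal per deptno, computed locally.
local_result <- aggregate(sal ~ deptno, data = emp, FUN = sum)
print(local_result)
```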
Examples - predict flight delay
library(RHive)
rhive.connect()
# Retrieve training set from large dataset stored in HDFS
train <- rhive.query("SELECT dayofweek,arrdelay,distance FROM airlines TABLESAMPLE(BUCKET 1 OUT OF 10000 ON rand())")
train$arrdelay <- as.numeric(train$arrdelay)
train$distance <- as.numeric(train$distance)
train <- train[!(is.na(train$arrdelay) | is.na(train$distance)),]
model <- lm(arrdelay ~ distance + dayofweek,data=train)
# Export R object data
rhive.assign("model", model)
# Analyze big data using model calculated by R
predict_table <- rhive.napply("airlines", function(arg1,arg2,arg3) {
  if(is.null(arg1) || is.null(arg2) || is.null(arg3)) return(0.0)
  res <- predict.lm(model, data.frame(dayofweek=arg1,arrdelay=arg2,distance=arg3))
  return(as.numeric(res))
}, 'dayofweek', 'arrdelay', 'distance')
Native R code
HiveQuery + R code
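The per-row prediction function can be checked locally before it is shipped to the cluster. The sketch below runs only the native R half on synthetic data: the data frame is simulated (the values are made up and say nothing about real flights), and `predict_row` is a hypothetical name for the function that the example passes to rhive.napply.

```r
# Synthetic stand-in for the sampled training set; values are made up.
set.seed(1)
train <- data.frame(
  dayofweek = sample(1:7, 200, replace = TRUE),
  distance  = runif(200, 100, 2500)
)
train$arrdelay <- 5 + 0.01 * train$distance + rnorm(200, sd = 10)

model <- lm(arrdelay ~ distance + dayofweek, data = train)

# Same shape as the function given to rhive.napply: one row in, one number out.
predict_row <- function(arg1, arg2, arg3) {
  if (is.null(arg1) || is.null(arg2) || is.null(arg3)) return(0.0)
  res <- predict(model, data.frame(dayofweek = arg1, arrdelay = arg2, distance = arg3))
  as.numeric(res)
}

# Apply it row by row, as the cluster would.
local_pred <- mapply(predict_row, train$dayofweek, train$arrdelay, train$distance)
print(head(local_pred))
```

Row-wise predictions on the training rows should match the fitted values of the model, which gives a quick sanity check of the function.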
DEMO
Conclusion
• RHive uses HQL rather than the MapReduce programming model
• RHive lets analysts do everything in the R console
• RHive lets R data and HDFS data interact
• Future & Current Works
• Integrate Hadoop HDFS
• Support Transform/Map-Reduce Scripts
• Distributed Rserve
• Support more R style API
• Support machine learning algorithms (k-means, classifier, ...)
Cooperators
• JunHo Cho
• Seonghak Hong
• Choonghyun Ryu
YOU!
How to join RHive project
• Logo
• github (https://github.com/nexr/RHive)
• CRAN (http://cran.r-project.org/web/packages/RHive)
• Welcome to join RHive project
References
• Ricardo (https://mpi-inf.mpg.de/~rgemulla/publications/das10ricardo.pdf)
• RHIPE (http://ml.stat.purdue.edu/rhipe)
• Hive (http://hive.apache.org)
• Parallel R by Q. Ethan McCallum and Stephen Weston
Appendix
RHIPE
• the R and Hadoop Integrated Processing Environment
• Must understand the MapReduce model
[Figure: RHIPE architecture. Components: R, RHMR, PersonalServer, ProtocolBuf, Fork, R Conf, R Objects (map, reduce), Mapper (R, R Objects (map)), Reducer (R, R Objects (reduce)), shuffle / sort, HDFS.]
map <- function() {...}
reduce <- function() {...}
rmr <- rhmr(map,reduce,...)
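What "understanding the MapReduce model" demands of an analyst can be seen in a base R sketch that uses no Hadoop package at all. The map, shuffle / sort, and reduce phases are written out by hand for a word count over two made-up input lines; none of this is the RHIPE API, and the function bodies are illustrative only.

```r
# map: one input line -> list of (key, value) pairs.
map <- function(line) {
  words <- strsplit(tolower(line), "[^a-z]+")[[1]]
  words <- words[nzchar(words)]
  lapply(words, function(w) list(key = w, value = 1L))
}

# reduce: one key plus all of its values -> one result.
reduce <- function(key, values) list(key = key, value = sum(values))

input <- c("R and Hive", "Hive and Hadoop")   # made-up input lines

pairs <- unlist(lapply(input, map), recursive = FALSE)
keys  <- vapply(pairs, function(p) p$key, character(1))
vals  <- vapply(pairs, function(p) p$value, integer(1))

grouped <- split(vals, keys)                  # shuffle / sort: group values by key
counts  <- Map(reduce, names(grouped), grouped)
print(counts[["hive"]]$value)
```

Even for a word count the analyst has to think in keys, values, and grouping, which is the overhead that an HQL query hides.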
RHadoop
• Manipulate Hadoop data stores and HBASE directly from R
• Write MapReduce models in R using Hadoop Streaming
• Must understand the MapReduce model
[Figure: RHadoop architecture. Components: R, rmr, rhdfs, rhbase, Hadoop Streaming, Mapper (R), Reducer (R), shuffle / sort, HDFS, HBASE. Arrow labels: manipulate, store R objects as file, execute hadoop streaming.]
map <- function() {...}
reduce <- function() {...}
mapreduce(input,output,map,reduce,...)
hive (Hadoop InteractiVE)
• R extension facilitating distributed computing via the MapReduce paradigm
• Provide an interface to Hadoop, HDFS and Hadoop Streaming
• Must understand the MapReduce model
[Figure: hive architecture. Components: R, hive, hive_stream, DFS, Hadoop Streaming, Mapper (R), Reducer (R), shuffle / sort, HDFS. Arrow labels: manipulate, save R script on local, execute hadoop streaming.]
map <- function() {...}
reduce <- function() {...}
hive_stream(map,reduce,...)
segue
• Simple parallel processing with a fast and easy setup on Amazon Web Services (AWS).
• Parallel lapply function for the EMR engine using Hadoop streaming.
• Does not support MapReduce model but only Map model.
[Figure: segue architecture. Components: R, emrlapply, awsFunctions, Amazon S3, EMR, Hadoop Streaming, Mapper (R), bootstrap (setup R), mapper.R. Arrow labels: save R objects (data + FUN) on local, upload R objects.]
data <- list(...)
emrlapply(clusterObject,data,FUN,..)
Ricardo
• Integrate R and Jaql (JSON Query Language)
• Must know how to use Jaql, an uncommon query language
• Not open-source
Ref : Ricardo-SIGMOD10