
RHadoop and MapR
Accessing Enterprise-Grade Hadoop from R
Version 2.0 (14.March.2014)

Table of Contents

Introduction
Environment
R
Special Installation Notes
Install R
Install RHadoop
    Install rhdfs
    Install rmr2
    Install rhbase
Conclusion
Resources

Introduction

RHadoop is an open source collection of three R packages created by Revolution Analytics that allow users to manage and analyze data with Hadoop from an R environment. It lets data scientists familiar with R combine the enterprise-grade capabilities of the MapR Hadoop distribution directly with the analytic capabilities of R. This paper provides step-by-step instructions to install and use RHadoop with MapR and R on RedHat Enterprise Linux.

RHadoop consists of the following packages:

• rhdfs - functions providing file management of the HDFS from within R
• rmr2 - functions providing Hadoop MapReduce functionality in R
• rhbase - functions providing database management for the HBase distributed database from within R

Each of the RHadoop packages can be installed and used independently or in conjunction with each other. The sketch below gives a flavor of each package's role.
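A minimal hedged sketch of the division of labor, assuming all three packages are installed as described later in this paper (paths and values are illustrative):

> library(rhdfs); hdfs.init()
> hdfs.ls('/')                       # rhdfs: browse the cluster file system
> library(rmr2)
> from.dfs(mapreduce(to.dfs(1:3),    # rmr2: run a tiny MapReduce job
    map = function(k, v) keyval(v, v * 2)))
> library(rhbase); hb.init()
> hb.list.tables()                   # rhbase: manage HBase tables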

Environment

The integration testing described in this paper was performed in March 2014 on a 3-node Amazon EC2 cluster. The product versions in the test are listed in the table below. Note that Revolution Analytics currently provides Linux support only on RedHat.

Product                              Version
EC2 AMI                              RHEL-6.4 x86_64 (ami-74557e31)
EC2 instance type                    m1.large
EC2 root/boot storage                8GB EBS standard
EC2 MapR storage                     (3) 450GB EBS standard
RedHat Enterprise Linux 6.4 64-bit   2.6.32-358.el6.x86_64
Java                                 java-1.7.0-openjdk.x86_64, java-1.7.0-openjdk-devel.x86_64
MapR M7                              3.0.2 (3.0.2.22510.GA-1)
HBase                                0.94.13
GNU R                                3.0.2
RHadoop rhdfs                        1.0.8
RHadoop rmr2                         3.0.1
RHadoop rhbase                       1.2.0
Apache Thrift                        0.9.1

R

R is both a language and environment for statistical computation and is freely available from GNU. Revolution Analytics provides two versions of R: the free Revolution R Community and the premium Revolution R Enterprise for workstations and servers. Revolution R Community is an enhanced distribution of the open source R for users looking for faster performance and greater stability. Revolution R Enterprise adds commercial enhancements and professional support for real-world use. It brings higher performance, greater scalability, and stronger reliability to R at a fraction of the cost of legacy products.

R is easily extended with libraries that are distributed in packages. Packages are collections of R functions, data, and compiled code in a well-defined format. The directory where packages are stored is called the library. R comes with a standard set of packages; others are available for download and installation. Once installed, they have to be loaded into the session to be used, as the example below shows.
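A minimal illustration of installing and then loading a package into the session (the package chosen here is arbitrary):

> install.packages('stringr', repos="http://cran.revolutionanalytics.com")
> library(stringr)    # the package must be loaded before its functions can be used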

Special Installation Notes

These installation instructions are specific to the product versions specified earlier in this document. Some modifications may be required for your environment. A package repository must be available to install dependent packages. The MapR cluster must be installed and running either Hadoop HBase or MapR tables. You will need root privileges on all nodes in the cluster. MapR installation instructions are available on the MapR Documentation web site:

http://doc.mapr.com/display/MapR/Quick+Installation+Guide

All commands entered by the user are in bold courier font. Commands entered from within the R environment are preceded by the default R prompt ">". Linux shell commands are preceded by either the '#' character (for the root user) or the '$' character (for a non-root user). For example, the following represents running the yum command as the root user:

# yum install git -y

The following example represents running a command within the R shell:

> library(rhdfs)

Similar to the R shell, the following represents running a command within the HBase shell:

hbase(main):001:0> create 'mytable', 'cf1'

Note that some Linux commands are long and wrap across multiple lines in this document. Linux commands use the backslash "\" character to escape the carriage return. Similarly, in the R shell, a long command that ends a line with a comma "," continues on the next line. This facilitates copying from this document and pasting into a Linux terminal window. Here is an example of a long Linux command that wraps to multiple lines in this document:

# su - user01 -c "hadoop jar /opt/mapr/hadoop/hadoop-0.20.2/hadoop-*-\
examples.jar wordcount /tmp/mrtest/wc-in /tmp/mrtest/wc-out"

And here is an example of a long R command that wraps to multiple lines in this document:

> install.packages(c('Rcpp'),
repos="http://cran.revolutionanalytics.com",
INSTALL_opts=c('--byte-compile'))

Unless otherwise indicated, all commands should be run as the root user on the client and task tracker systems. This document assumes there is a non-root user called user01 on your client system for purposes of validating the installation. You can use any non-root user that can access the MapR cluster in those commands; just remember to replace all occurrences of user01 with your non-root user. Finally, note that these installation instructions assume your client systems are running RedHat Linux, as that is the only operating system supported by Revolution Analytics.

Install R

Testing for this paper was done with GNU R. If you have Revolution R Community or Enterprise, you can use that version of R instead; follow Revolution Analytics installation instructions for the appropriate edition. A version of R must always be installed on the client system accessing the cluster using the RHadoop libraries. Additionally, to execute MapReduce jobs with the rmr2 library, R must be installed on all task tracker nodes in the cluster.

As the root user, follow the installation steps below to install GNU R on all client systems and task trackers.

1) Install GNU R.

# yum -y --enablerepo=epel install R

2) Install the GNU R developer package.

# yum -y --enablerepo=epel install R-devel

Note that the R-devel package may already be up to date when installing the R package.

3) Confirm the installation was successful by running R as a non-root user on your client system and all your task trackers. At the command line, type the following command to determine the version of R that is installed.

# su - user01 -c "R --version"

Install RHadoop

The installation instructions that follow are complete for each RHadoop package (rhdfs, rmr2, rhbase). System administrators can skip to the installation instructions for just the package(s) they want to install. Recall that R must be installed before installing any of the RHadoop packages.

Install rhdfs

The rhdfs package uses the hadoop command to access MapR file services. To use rhdfs, R and the rhdfs package only need to be installed on the client system that is accessing the cluster. This can be a node in the cluster or any client system that can access the cluster with the hadoop command.

As the root user, perform the following steps on every client node.

1) Confirm that you can access the MapR file services by listing the contents of the top-level directory.

# su - user01 -c "hadoop fs -ls /"

2) Install the rJava R package that is required by rhdfs.

# R --save
> install.packages(c('rJava'), repos="http://cran.revolutionanalytics.com")
> quit()

3) Download the rhdfs package from github.

# yum install git -y
# cd ~
# git clone git://github.com/RevolutionAnalytics/rhdfs.git

4) Set the HADOOP_CMD environment variable to the hadoop command script and install the rhdfs package. Whereas the rJava package in the previous step was downloaded and installed from a CRAN repository, rhdfs is installed from the locally cloned source using R CMD INSTALL.

# export HADOOP_CMD=/opt/mapr/hadoop/hadoop-0.20.2/bin/hadoop
# R CMD INSTALL ~/rhdfs/pkg

5) Set required environment variables. In addition to the HADOOP_CMD environment variable set in the previous step, LD_LIBRARY_PATH must include the location of the MapR client library libMapRClient.so, and HADOOP_CONF must specify the Hadoop configuration directory. Any user wanting to use the rhdfs library must set these environment variables.

# export LD_LIBRARY_PATH=/opt/mapr/lib:$LD_LIBRARY_PATH
# export HADOOP_CONF=/opt/mapr/hadoop/hadoop-0.20.2/conf

6) Switch user to your non-root user to validate the installation was successful.

# su user01
$

7) From R, load the rhdfs library and confirm that you can access the MapR cluster file system by listing the root directory.

$ R --no-save
> library(rhdfs)
> hdfs.init()
> hdfs.ls('/')
> quit()

Note: When loading an R library using the library() command, dependent libraries will also be loaded. For rhdfs, the rJava library will be loaded if it has not already been loaded.

8) Exit from your su command.

$ exit
#

9) Check the installation of the rhdfs package.

# R CMD check ~/rhdfs/pkg; echo $?

An exit code of 0 means the installation was successful. If no errors are reported, you have successfully installed rhdfs and can use it to access the MapR file services from R (you can safely ignore any "notes").

10) (optional) Persist the required environment variable settings for the shells in your client environment. The commands below will set the variables for Bourne and bash shell users. You may wish to examine the /etc/profile file first before making the following edits to ensure that you're not duplicating or clobbering existing settings.

# echo -e "export LD_LIBRARY_PATH=/opt/mapr/lib:\$LD_LIBRARY_PATH" >> \
/etc/profile
# echo "export HADOOP_CMD=/opt/mapr/hadoop/hadoop-0.20.2/bin/hadoop" >> \
/etc/profile
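With rhdfs installed, the following hedged sketch exercises a few more of its file-management functions (file and directory names are illustrative; it assumes the environment variables above are set):

$ R --no-save
> library(rhdfs)
> hdfs.init()
> hdfs.mkdir('/tmp/rhdfs-demo')               # create a directory in MapR-FS
> writeLines('hello from R', 'local.txt')     # create a small local file
> hdfs.put('local.txt', '/tmp/rhdfs-demo')    # copy it into the cluster
> hdfs.ls('/tmp/rhdfs-demo')                  # confirm the file arrived
> hdfs.delete('/tmp/rhdfs-demo')              # clean up
> quit()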

 

Install rmr2

The rmr2 package uses Hadoop Streaming to invoke R map and reduce functions. To use rmr2, the R and rmr2 packages must be installed on the clients as well as on every task tracker node in the MapR cluster. (A minimal example of an R MapReduce job appears at the end of this section.)

Install rmr2 on every task tracker node AND on every client system as the root user.

1) Install the required R packages.

# R --no-save
> install.packages(c('Rcpp', 'RJSONIO', 'itertools', 'digest', 'functional',
  'stringr', 'plyr', 'bitops', 'reshape2', 'caTools'),
  repos="http://cran.revolutionanalytics.com",
  INSTALL_opts=c('--byte-compile'))
> quit()

Note: Warnings may be safely ignored while the packages are being built.

2) Download the quickcheck and rmr2 packages from github. RHadoop includes the quickcheck package to support the randomized unit tests performed by the rmr2 package check.

# cd ~
# git clone git://github.com/RevolutionAnalytics/quickcheck.git
# git clone git://github.com/RevolutionAnalytics/rmr2.git

3) Install the quickcheck and rmr2 packages.

# R CMD INSTALL ~/quickcheck/pkg
# R CMD INSTALL ~/rmr2/pkg

Note: Warnings may be safely ignored while the packages are being built.

4) From any task tracker node, create a directory called /tmp (if it doesn't already exist) in the root of your MapR-FS (owned by the mapr user) and give it global read-write permissions. This directory is required for running MapReduce applications in R using the rmr2 package.

# su - mapr -c "hadoop fs -mkdir /tmp"

Note: the command above will fail gracefully if the /tmp directory already exists.

# su - mapr -c "hadoop fs -chmod 777 /tmp"

Validate rmr2 on any single client system as a non-root user with the following steps. Note that the rmr2 package must be installed on ALL the task trackers before proceeding.

1) Confirm that your MapR cluster is configured to run a simple MapReduce job outside of the RHadoop environment. You only need to perform this step on one client from which you intend to run your R MapReduce programs.

# su - user01 -c "hadoop fs -mkdir /tmp/mrtest/wc-in"
# su - user01 -c "hadoop fs -put /opt/mapr/NOTICE.txt /tmp/mrtest/wc-in"
# su - user01 -c "hadoop jar /opt/mapr/hadoop/hadoop-0.20.2/hadoop-*-\
examples.jar wordcount /tmp/mrtest/wc-in /tmp/mrtest/wc-out"
# su - user01 -c "hadoop fs -cat /tmp/mrtest/wc-out/part-r-00000"
# su - user01 -c "hadoop fs -rmr /tmp/mrtest"

2) Copy the wordcount.R script to a directory that is accessible by your non-root user.

# cp ~/rmr2/pkg/tests/wordcount.R /tmp

3) Set the required environment variables HADOOP_CMD and HADOOP_STREAMING. Since rmr2 uses Hadoop Streaming, it needs access to both the hadoop command and the streaming jar.

# export HADOOP_CMD=/opt/mapr/hadoop/hadoop-0.20.2/bin/hadoop
# export HADOOP_STREAMING=/opt/mapr/hadoop/hadoop-\
0.20.2/contrib/streaming/hadoop-0.20.2-dev-streaming.jar

4) Switch user to your non-root user to validate the installation was successful.

# su user01

5) Run the wordcount.R program from the R environment as your non-root user.

$ R --no-save < /tmp/wordcount.R; echo $?

Note that an exit code of 0 means the command was successful.

6) Exit from your su command.

$ exit
#

7) Run the full rmr2 check. The examples run by the rmr2 check below will sequentially generate 81 streaming MapReduce jobs on the cluster. 80 of the jobs have just 2 mappers, so a large cluster will not speed this up. On a 3-node medium EC2 cluster with two task trackers, the examples take just over 1 hour. You may wish to launch this under nohup as shown below and wait for it to complete before you proceed in this document.

# nohup R CMD check ~/rmr2/pkg > ~/rmr2-check.out &

Check the output in ~/rmr2-check.out for any errors.

8) (optional) Persist the required environment variable settings for the shells in your client environment. The commands below will set the variables for Bourne and bash shell users. You may wish to examine the /etc/profile file first before making the following edits to ensure that you're not duplicating or clobbering existing settings.

# echo -e "export LD_LIBRARY_PATH=/opt/mapr/lib:\$LD_LIBRARY_PATH" >> \
/etc/profile
# echo "export HADOOP_CONF=/opt/mapr/hadoop/hadoop-0.20.2/conf" >> \
/etc/profile
# echo "export HADOOP_CMD=/opt/mapr/hadoop/hadoop-0.20.2/bin/hadoop" >> \
/etc/profile
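With rmr2 installed and validated, here is a minimal hedged sketch of a MapReduce job written directly in R (the data is illustrative; it assumes HADOOP_CMD and HADOOP_STREAMING are set as described above):

$ R --no-save
> library(rmr2)
> small.ints <- to.dfs(1:10)                  # write a small vector into MapR-FS
> result <- mapreduce(input = small.ints,
    map = function(k, v) keyval(v, v^2))      # emit (value, value squared) pairs
> from.dfs(result)                            # read the key-value pairs back into R
> quit()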

Install rhbase

The rhbase package accesses HBase via the HBase Thrift server, which is included in the MapR HBase distribution. The rhbase package is a Thrift client that sends requests to and receives responses from the Thrift server. The Thrift server listens for Thrift requests and in turn uses the HBase HTable java class to access HBase.

Since rhbase is a client-side technology, it only needs to be installed on the client system that will access the MapR HBase cluster. Any MapR HBase cluster node can also be a client. For the client system to access a local Thrift server, the client system must have the mapr-hbase-internal package installed, which includes the MapR HBase Thrift server. If your client system is one of the MapR HBase Masters or Region Servers, it will already have this package installed. These rhbase installation instructions assume that mapr-hbase-internal is already installed on the client system.

In addition to the HBase Thrift server, the rhbase package requires the Thrift include files to compile and the C++ Thrift library at runtime in order to be a Thrift client. Since these Thrift components are not included in the MapR distribution, Thrift must be installed before rhbase.

By default, rhbase connects to a Thrift server on the local host. A remote server can be specified in the rhbase hb.init() call (see the sketch after these steps), but the rhbase package check expects the Thrift server to be local. These installation instructions assume the Thrift server is running locally and that HBase is installed and running in your cluster.

As the root user, perform the following installation steps on ALL task tracker nodes and client systems.

1) Install (or update) prerequisite packages.

# yum -y install automake libtool flex bison pkgconfig gcc-c++ boost-devel \
libevent-devel zlib-devel python-devel openssl-devel ruby-devel qt qt-devel \
php-devel

2) Download, build, and install Thrift.

# cd ~
# git clone https://git-wip-us.apache.org/repos/asf/thrift.git thrift
# cd ~/thrift
# sed -i s/2.65/2.63/ configure.ac
# ./bootstrap.sh
# ./configure
# make && make install
# /sbin/ldconfig /usr/lib/libthrift-0.9.1.so

3) Download the rhbase package.

# cd ~
# git clone git://github.com/RevolutionAnalytics/rhbase.git

4) Modify the thrift.pc file. We need to add the thrift directory to the end of the includedir configuration.

# sed -i 's/^includedir=\${prefix}\/include/includedir=\${prefix}\/include\/thrift/' \
~/thrift/lib/cpp/thrift.pc

5) Install the rhbase package. The LD_LIBRARY_PATH must be set to find the Thrift library (libthrift.so), which was installed as part of the Thrift installation. Also, PKG_CONFIG_PATH must point to the directory containing the thrift.pc package configuration file.

# export LD_LIBRARY_PATH=/usr/local/lib
# export PKG_CONFIG_PATH=~/thrift/lib/cpp
# R CMD INSTALL ~/rhbase/pkg

6) Configure the file /opt/mapr/hbase/hbase-0.94.13/conf/hbase-site.xml with the HBase zookeeper servers and their port number. For HBase Master Servers and HBase Region Servers, zookeeper servers should already be properly configured in this file. For client-only systems, edit the hbase.zookeeper.quorum and hbase.zookeeper.property.clientPort properties to correspond to your zookeeper servers.

# su - mapr -c "maprcli node listzookeepers"
# su - mapr -c "vi /opt/mapr/hbase/hbase-0.94.13/conf/hbase-site.xml"

. . .
<property>
  <name>hbase.zookeeper.quorum</name>
  <value>zkhost1,zkhost2,zkhost3</value>
</property>
<property>
  <name>hbase.zookeeper.property.clientPort</name>
  <value>5181</value>
</property>
. . .

7) Start the MapR HBase Thrift server as a background daemon.

# /opt/mapr/hbase/hbase-0.94.13/bin/hbase-daemon.sh start thrift

Note: Pass the parameters stop thrift to the hbase-daemon.sh script to stop the daemon.

8) Run the rhbase package checks. LD_LIBRARY_PATH and PKG_CONFIG_PATH must still be set. Note that the command will produce errors you can safely ignore if you are not running Hadoop HBase in your MapR cluster. Recall that the rhbase package is compatible with MapR tables, which do not require Hadoop HBase to be installed and running in your MapR cluster.

# R CMD check ~/rhbase/pkg

Note: When running the rhbase package check, warnings can be safely ignored.

9) (optional) Persist the required environment variable settings for the shells in your environment. The commands below will set the variables for Bourne and bash shell users. You may wish to examine the /etc/profile file first before making the following edits to ensure that you're not duplicating or clobbering existing settings.

# echo -e "export \
LD_LIBRARY_PATH=/usr/local/lib:/usr/lib64/R/library/rhbase/libs:\$LD_LIBRARY_PATH" \
>> /etc/profile
# echo "export PKG_CONFIG_PATH=~root/thrift/lib/cpp" >> /etc/profile
# echo "export HADOOP_CMD=/opt/mapr/hadoop/hadoop-0.20.2/bin/hadoop" >> \
/etc/profile
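As noted above, the package check assumes a local Thrift server. To use a Thrift server on another host, hb.init() accepts host and port arguments. A hedged sketch (the hostname is illustrative; 9090 is the Thrift server's default port):

> library(rhbase)
> hb.init(host="thrift01.example.com", port=9090)   # connect to a remote HBase Thrift server
> hb.list.tables()                                  # confirm the connection works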

   

Test rhbase with Hadoop HBase

The rhbase package is now installed and ready for use by any user on the system. Validate that a non-root user can access HBase via the HBase shell and from rhbase. Note that the instructions below assume the environment variables LD_LIBRARY_PATH and HADOOP_CMD have been configured for the user01 user. They also assume you are running Hadoop HBase in your MapR cluster.

1) Start the HBase shell, create a table, display its description, and drop it.

# su user01 -c "hbase shell"
hbase(main):001:0> create 'mytable', 'cf1'
hbase(main):002:0> describe 'mytable'
hbase(main):003:0> disable 'mytable'
hbase(main):004:0> drop 'mytable'
hbase(main):005:0> quit

2) Now perform the same test with rhbase.

# su user01 -c "R --save"
> library(rhbase)
> hb.init()
> hb.new.table('mytable', 'cf1')
> hb.describe.table('mytable')
> hb.disable.table('mytable')
> hb.delete.table('mytable')
> q()
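Beyond table management, rhbase can also read and write cells. The following hedged sketch inserts one value and reads it back (row, column, and value are illustrative; the argument shapes for hb.insert() and hb.get() follow the rhbase documentation):

> hb.new.table('mytable', 'cf1')
> hb.insert('mytable', list(list('row1', c('cf1:greeting'), list('hello'))))   # write one cell
> hb.get('mytable', 'row1')                                                    # read the row back
> hb.disable.table('mytable')
> hb.delete.table('mytable')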

 

Test rhbase with MapR Tables

The rhbase package is compatible with MapR tables. You can use the MapR tables feature if you have an M7 license installed on your MapR cluster. Simply use absolute paths in MapR-FS for your table names rather than relative paths as for HBase. The following steps assume that the user01 home directory is in a MapR file system called /mapr/mycluster/home/user01.

1) Start the HBase shell, create a table, display its description, and drop it.

# su user01 -c "hbase shell"
hbase(main):001:0> create '/mapr/mycluster/home/user01/mytable', 'cf1'
hbase(main):002:0> describe '/mapr/mycluster/home/user01/mytable'
hbase(main):003:0> disable '/mapr/mycluster/home/user01/mytable'
hbase(main):004:0> drop '/mapr/mycluster/home/user01/mytable'
hbase(main):005:0> quit

2) Now perform the same test with rhbase.

# su user01 -c "R --save"
> library(rhbase)
> hb.init()
> hb.new.table('/mapr/mycluster/home/user01/mytable', 'cf1')
> hb.describe.table('/mapr/mycluster/home/user01/mytable')
> hb.delete.table('/mapr/mycluster/home/user01/mytable')
> q()

Conclusion

With Revolution Analytics' RHadoop packages and MapR's enterprise-grade Hadoop distribution, data scientists can utilize the full potential of Hadoop from the familiar R environment.

Resources

More information on RHadoop and the other technologies referenced in this paper can be found at the links below.

GNU R home page: http://www.r-project.org
RHadoop home page: https://github.com/RevolutionAnalytics/
Apache Thrift: http://thrift.apache.org
Revolution Analytics: http://www.revolutionanalytics.com
MapR Technologies: http://www.mapr.com

About MapR Technologies

MapR's advanced distribution for Apache Hadoop delivers on the promise of Hadoop, making the management and analysis of big data a practical reality for more organizations. MapR's advanced capabilities, such as streaming analytics, mission-critical data protection, and MapR tables, expand the breadth and depth of use cases across industries.

About Revolution Analytics

Revolution Analytics (formerly Revolution Computing) was founded in 2007 to foster the R Community, as well as support the growing needs of commercial users. Our name derives from combining the letter "R" with the word "evolution." It speaks to the ongoing development of the R language from an open-source academic research tool into commercial applications for industrial use.

03/14/2014