
Page 1

Linux Clusters Institute: Turning an HPC Cluster into a Big Data Cluster

Fang (Cherry) Liu, PhD    fang.liu@oit.gatech.edu

A Partnership for an Advanced Computing Environment (PACE)

OIT/ART, Georgia Tech

18-22 May 2015

Page 2


Targets for this session

Target audience: HPC system admins who want to support a Hadoop cluster

Points of interest:
• Big Data is common
• Challenges and tools
• Hadoop vs. HPC
• Hadoop core characteristics
• Hadoop ecosystem
  • Core parts - HDFS and MapReduce
  • Distributions
  • Projects
• Hadoop basic operations
• Configuring a Hadoop cluster on an HPC cluster - the PACE Hadoop cluster
• Hadoop advanced operations

Page 3

Big Data is Common

• The entire rendering of Avatar reportedly required over 1 petabyte of storage space, according to BBC's Clickbits, which is the equivalent of 500 hard drives of 2 TB each. That's equal to a 32-year-long MP3 file. The computing core (34 racks, each with four chassis of 32 machines each) adds up to some 40,000 processors and 104 terabytes of RAM. http://thenextweb.com/2010/01/01/avatar-takes-1-petabyte-storage-space-equivalent-32-year-long-mp3/

• Facebook's data warehouses grow by "over half a petabyte ... every 24 hours" (2012). http://www.theregister.co.uk/2012/11/09/facebook_open_sources_corona/

• An average of one hundred thousand MapReduce jobs are executed on Google's clusters every day, processing a total of more than twenty petabytes of data per day (2009). http://dl.acm.org/citation.cfm?id=1327492&dl=ACM&coll=DL&CFID=508148704&CFTOKEN=44216082


Page 4

Big Data: Three Challenges

• Volume (storage)
  • The size of the data

• Velocity (processing speed)
  • The latency of data processing relative to the growing demand for interactivity

• Variety (structured vs. unstructured data)
  • The diversity of sources, formats, quality, and structures


Page 5

What tools are there to use?

• Hadoop is open-source software for reliable, scalable, distributed computing:
  • Created by Doug Cutting and Michael Cafarella while at Yahoo; named after Doug's son's toy elephant
  • Written in Java
  • Scales to thousands of machines with linear scalability
  • Uses a simple programming model (MapReduce)
  • Fault tolerant (HDFS)

• Spark
  • A fast and general engine for large-scale data processing; it claims to run programs up to 100x faster than Hadoop MapReduce in memory, or 10x faster on disk.
  • It provides high-level tools including Spark SQL, MLlib for machine learning, GraphX, and Spark Streaming.
  • It runs on Hadoop, on Mesos, standalone, or in the cloud.
  • It can access data sources including HDFS, Cassandra, HBase, Hive, and Amazon S3.


Page 6

Hadoop vs. Conventional HPC


Hadoop | HPC
A cluster of machines collocates the data with the compute nodes | A cluster of machines accesses a shared filesystem
Moves computation to the data | Moves data to the computation; network bandwidth is the bottleneck
MapReduce operates at a higher level; data flow is implicit | Message Passing Interface (MPI) explicitly handles the mechanics of the data flow
Fault tolerance through data replication; it is easy to rerun Map or Reduce tasks | Needs to explicitly manage checkpointing and recovery
Restricted to data-processing problems | Can solve more complex algorithms

Page 7

Hadoop Core Characteristics

• Distribution - instead of building one big supercomputer, storage and processing are spread across a cluster of smaller machines that communicate and work together.

• Horizontal scalability - it is easy to extend a Hadoop cluster by just adding new machines. Every new machine increases the total storage and processing power of the Hadoop cluster.

• Fault tolerance - Hadoop continues to operate even when a few hardware or software components fail to work properly.

• Cost optimization - Hadoop runs on standard hardware; it does not require expensive servers.

• Programming abstraction - Hadoop takes care of all the messy details of distributed computing. Using a high-level API, users can focus on implementing the business logic that solves their real-world problems.

• Data locality - don't move large datasets to where the application is running; run the application where the data already is.


Page 8

Hadoop Ecosystem


[Diagram: the Hadoop ecosystem stack. Local disks at the bottom; HDFS spanning them; MapReduce and HBase on top of HDFS; analysis tools (Impala, Pig, Scalding, Mahout, Hive) above; workflow managers (Azkaban, Oozie) and the Hue web interface alongside.]

Page 9

Hadoop Ecosystem - Core Parts

• HDFS: the Hadoop Distributed File System (HDFS) is a distributed file system designed to run on a commodity cluster of machines. HDFS is highly fault tolerant and is useful for processing large data sets.

• MapReduce: MapReduce is a software framework for processing large data sets, at petabyte scale, on a cluster of commodity hardware. When MapReduce jobs are run, Hadoop splits the input and locates the nodes on the cluster that hold it. The actual tasks are then run at or close to the nodes where the data resides, so that the data is as close to the computation as possible. This avoids transferring huge amounts of data across the network, so the network does not become a bottleneck or get flooded.


Page 10

Hadoop  EcoSystem  -­‐  Distribu.ons

•  Apache:  Purely  Open  Source  distribu@on  of  Hadoop  maintained  by  the  community  at  Apache  Sonware  Founda@on.  

•  Cloudera:  Cloudera’s  distribu@on  of  Hadoop  that  is  built  on  top  of  Apache  Hadoop.  The  distribu@on  includes  capabili@es  such  as  management,  security,  high  availability  and  integra@on  with  a  wide  variety  of  hardware  and  sonware  solu@ons.  Cloudera  is  the  leading  distributor  of  Hadoop.  

•  Horton  Works:  This  also  builds  on  the  Open  Source  Apache  Hadoop  with  claims  to  enterprise  readiness.  It  also  claims  to  be  the  only  distribu@on  that  is  available  for  Windows  servers.  

• MapR:  Hadoop  distribu@on  with  some  unique  features,  most  notably  the  ability  to  mount  the  Hadoop  cluster  over  NFS.  

•  Amazon  EMR  :  Amazon’s  hosted  version  of  MapReduce  is  called  Elas@c  Map  Reduce.  This  is  part  of  the  Amazon  Web  Services  (AWS).  EMR  allows  a  Hadoop  cluster  to  be  deployed  and  MapReduce  jobs  to  be  run  in  the  AWS  cloud  with  just  a  few  clicks.  


Page 11

Hadoop Ecosystem - Related Projects

• Pig: a high-level language for analyzing large data sets which eases development of MapReduce jobs. Hundreds of lines of MapReduce code can be replaced with just a few lines of Pig. At Yahoo, more than 60% of Hadoop usage is through Pig.

• Hive: a data warehouse framework that supports querying of large data sets stored in Hadoop. Hive provides a high-level SQL-like language called HiveQL.

• HBase: a distributed, scalable data store built on Hadoop. HBase is a distributed, versioned, column-oriented database modeled after Google's BigTable.

• Mahout: a scalable machine-learning library. Mahout utilizes Hadoop to achieve massive scalability.

• YARN: the next generation of MapReduce, a.k.a. MapReduce 2. The MapReduce framework was overhauled with YARN to overcome the scalability bottlenecks of the earlier version of MapReduce when run on very large clusters (thousands of nodes).

• Oozie: a workflow scheduler system that eases the creation and management of sequences of MapReduce jobs.

• Flume: a distributed, reliable, and available service for collecting, aggregating, and moving log data to HDFS. This is typically useful in systems where log data needs to be moved to HDFS periodically for processing.


Page 12

Hadoop Basics - HDFS

• HDFS is designed to store a very large amount of information (terabytes or petabytes). This requires spreading the data across a large number of machines. It also supports much larger file sizes than NFS.

• HDFS should store data reliably. If individual machines in the cluster malfunction, data should still be available (see the sketch below).

• HDFS should provide fast, scalable access to this information. It should be possible to serve a larger number of clients by simply adding more machines to the cluster.

• HDFS should integrate well with Hadoop MapReduce, allowing data to be read and computed upon locally when possible.
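As a quick illustration of the reliability and locality goals on a running cluster (a sketch; the path below is just an example, not from the slides), the stock HDFS tools report replication and block placement directly:

    # Cluster-wide view of capacity, live DataNodes, and under-replicated blocks
    hdfs dfsadmin -report

    # Per-file view: which blocks a file has, how many replicas, and which DataNodes hold them
    hdfs fsck /user/hadoop/dir1 -files -blocks -locations

If a DataNode is lost, the NameNode re-replicates the affected blocks on the remaining nodes, which is what keeps the data available as described above.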


Page 13

Hadoop Basics - HDFS (Cont.)


Page 14

Hadoop Basics - MapReduce


Mapping and reducing tasks run on nodes where individual records of data are already present.

Page 15

Example: Word Count

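The figure on this slide is not reproduced in the transcript; the following is a small sketch of the same word-count data flow, with made-up input:

    Input split (two lines):
        the quick brown fox
        the lazy dog

    Map phase - each mapper emits one (word, 1) pair per word:
        (the,1) (quick,1) (brown,1) (fox,1)
        (the,1) (lazy,1) (dog,1)

    Shuffle/sort - pairs are grouped by key:
        brown -> [1]    dog -> [1]     fox -> [1]
        lazy  -> [1]    quick -> [1]   the -> [1, 1]

    Reduce phase - each reducer sums the list for its key:
        brown 1, dog 1, fox 1, lazy 1, quick 1, the 2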

Page 16

Basic HDFS Commands

• Create a directory in HDFS
  • hadoop fs -mkdir <paths> (e.g. hadoop fs -mkdir /user/hadoop/dir1)

• List files
  • hadoop fs -ls <args> (e.g. hadoop fs -ls /user/hadoop/dir1)

• Upload data from the local system to HDFS
  • hadoop fs -put <localsrc> ... <HDFS_dest_path> (e.g. hadoop fs -put ~/foo.txt /user/hadoop/dir1/foo.txt)

• Download a file from HDFS
  • hadoop fs -get <hdfs_src> <localdst> (e.g. hadoop fs -get /user/hadoop/dir1/foo.txt /home/)

• Check space utilization in an HDFS directory
  • hadoop fs -du URI (e.g. hadoop fs -du /user/hadoop)

• Get help
  • hadoop fs -help
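Putting the commands above together, a first session might look like the following sketch (the user and file names are illustrative, not taken from the slides):

    hadoop fs -mkdir /user/hadoop/dir1                       # create a working directory
    hadoop fs -put ~/foo.txt /user/hadoop/dir1/              # copy a local file into HDFS
    hadoop fs -ls /user/hadoop/dir1                          # confirm the file is there
    hadoop fs -du /user/hadoop                               # check space used under the directory
    hadoop fs -get /user/hadoop/dir1/foo.txt /tmp/foo.txt    # copy it back out
    hadoop fs -rm /user/hadoop/dir1/foo.txt                  # clean up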


Page 17

Case Study: PACE Hadoop Cluster

• Consists of 5 x 24-core Altus 2704 servers, each with 128 GB of RAM. Each node has a 3 TB local RAID disk (6 x 500 GB).

• All nodes are connected via a 40 Gb/s InfiniBand connection.
• One node serves as NameNode + JobTracker + DataNode and is named hadoop-nn1.

• The remaining four nodes serve as DataNodes, namely:
  • hadoop-dn1
  • hadoop-dn2
  • hadoop-dn3
  • hadoop-dn4

• Check the cluster status and running jobs at:
  • Overview: http://<NameNode fully qualified name>:50070
  • JobTracker (YARN ResourceManager): http://<NameNode fully qualified name>:8088


Page 18

Installing Hadoop on an academic cluster

• Download a release from the official website http://hadoop.apache.org/; the most recent release was 2.7.0, released April 21, 2015.

• Put the binary distribution in an NFS location so that all participating nodes can access it (see the sketch below).

• Add local disk on each node to serve as the HDFS file system.
• Configure the nodes into name nodes and data nodes.
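A sketch of the download-and-unpack step (the mirror URL and the exact NFS path are assumptions, modeled on the /usr/local/packages/hadoop paths used later in this deck):

    # Fetch and unpack the release onto an NFS path visible to every node
    cd /usr/local/packages/hadoop
    wget https://archive.apache.org/dist/hadoop/common/hadoop-2.7.0/hadoop-2.7.0.tar.gz
    tar -xzf hadoop-2.7.0.tar.gz
    ln -s hadoop-2.7.0 2.7.0        # keep the versioned directory layout used in these slides

    # Point HADOOP_PREFIX at the shared install (e.g. in a module file or shell profile)
    export HADOOP_PREFIX=/usr/local/packages/hadoop/2.7.0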


Page 19

Configuring HDFS

• All configuration files are in ${HADOOP_PREFIX}/etc/hadoop (e.g. /usr/local/packages/hadoop/2.6/etc/hadoop)

• In core-site.xml:

Key | Value | Example
fs.defaultFS | protocol://servername:port | hdfs://127.0.0.1:9000
hadoop.tmp.dir | pathname | /dfs/hadoop/tmp

• In hdfs-site.xml:

Key | Value | Example
dfs.name.dir | pathname | /dfs/hadoop/name
dfs.data.dir | pathname | /dfs/hadoop/data
dfs.replication | number of replicas | 3
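A minimal sketch of the two files using the example values from the tables above (hostnames and paths are site-specific):

    <!-- core-site.xml -->
    <configuration>
      <property>
        <name>fs.defaultFS</name>
        <value>hdfs://127.0.0.1:9000</value>
      </property>
      <property>
        <name>hadoop.tmp.dir</name>
        <value>/dfs/hadoop/tmp</value>
      </property>
    </configuration>

    <!-- hdfs-site.xml -->
    <configuration>
      <property>
        <name>dfs.name.dir</name>
        <value>/dfs/hadoop/name</value>
      </property>
      <property>
        <name>dfs.data.dir</name>
        <value>/dfs/hadoop/data</value>
      </property>
      <property>
        <name>dfs.replication</name>
        <value>3</value>
      </property>
    </configuration>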

Page 20

Configuring YARN

• In ${HADOOP_PREFIX}/etc/hadoop/yarn-site.xml:


    <property>
        <name>yarn.resourcemanager.resource-tracker.address</name>
        <value>hdfs://<hostname>:8025</value>
    </property>
    <property>
        <name>yarn.resourcemanager.scheduler.address</name>
        <value>hdfs://<hostname>:8030</value>
    </property>
    <property>
        <name>yarn.resourcemanager.address</name>
        <value>hdfs://<hostname>:8040</value>
    </property>
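Not shown on the slide, but commonly needed alongside the YARN settings above so that MapReduce jobs are actually submitted to YARN (an assumption about this setup, not something the slides specify), is one property in ${HADOOP_PREFIX}/etc/hadoop/mapred-site.xml:

    <property>
        <name>mapreduce.framework.name</name>
        <value>yarn</value>
    </property>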

Page 21

Configure the slaves file: ${HADOOP_HOME}/etc/hadoop/slaves


[Diagram: hadoop-nn1 is the NameNode; hadoop-dn1, hadoop-dn2, hadoop-dn3, and hadoop-dn4 are DataNodes.]

Contents of the slaves file on each node:

hadoop-nn1:
  hadoop-nn1
  hadoop-dn1
  hadoop-dn2
  hadoop-dn3
  hadoop-dn4

hadoop-dn1:
  hadoop-dn1

hadoop-dn2:
  hadoop-dn2

hadoop-dn3:
  hadoop-dn3

hadoop-dn4:
  hadoop-dn4

Page 22

Configure environment

• Add the following two lines to ${HADOOP_PREFIX}/etc/hadoop/hadoop-env.sh:
  export JAVA_HOME=/usr/local/packages/java/1.7.0
  export HADOOP_LOG_DIR=/nv/ap2/logs/hadoop

• Add the following line to ${HADOOP_PREFIX}/etc/hadoop/yarn-env.sh:
  export YARN_LOG_DIR=/nv/ap2/logs/hadoop

• Add the following line to ${HADOOP_PREFIX}/etc/hadoop/mapred-env.sh:
  export HADOOP_MAPRED_LOG_DIR=/nv/ap2/logs/hadoop


Page 23

Starting HDFS

• Create a user named hadoop, and start all Hadoop services as the hadoop user to avoid security issues.

• Format the file system based on the above configuration:
  ${HADOOP_HOME}/bin/hdfs namenode -format <cluster_name>

• Start HDFS:
  start-dfs.sh
  Make sure it is running correctly: ps -ef | grep -i hadoop

• Start YARN:
  start-yarn.sh
  Make sure it is running correctly: ps -ef | grep -i resourcemanager
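A sketch of what a successful start looks like when checked with jps (the daemon names are standard Hadoop 2.x daemons; the PIDs are made up, and the exact set per node depends on the roles configured for this cluster):

    # On hadoop-nn1 (NameNode + DataNode + resource manager in this deck's layout)
    $ jps
    12001 NameNode
    12210 DataNode
    12455 ResourceManager
    12602 NodeManager

    # On a DataNode-only host such as hadoop-dn1
    $ jps
    8011 DataNode
    8150 NodeManager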


Page 24

User management on the Hadoop Cluster

• Log in as the hadoop user:
  ssh hadoop@hadoop-nn1

• Create a directory for the given user:
  hadoop fs -mkdir /user/userID

• Set directory ownership to the given user:
  hadoop fs -chown -R userID:groupID /user/userID

• Change the permissions so only that user has access:
  hadoop fs -chmod -R 700 /user/userID
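For more than a handful of users, the same three commands can be scripted; a sketch (the user list and group name are illustrative):

    #!/bin/bash
    # Run as the hadoop user on hadoop-nn1
    for u in alice bob carol; do
        hadoop fs -mkdir -p /user/$u          # -p avoids an error if the directory already exists
        hadoop fs -chown -R $u:hadoopusers /user/$u
        hadoop fs -chmod -R 700 /user/$u
    done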


Page 25

User runs a job

• The user logs in to the Hadoop NameNode and submits the job from there:
  ssh userID@hadoop-nn1

• Load the Hadoop environment variables, such as Java, Python, HADOOP_PREFIX, and HADOOP_YARN_HOME:
  module load hadoop/2.6.0

• Upload an input file from the local file system to HDFS:
  hadoop fs -put example.txt /user/userID

• Run wordcount on example.txt:
  hadoop jar /usr/local/packages/hadoop/2.6.0/share/hadoop/mapreduce/hadoop-mapreduce-examples-2.6.0.jar wordcount /user/userID/ /user/userID/testout

• Check the result:
  hadoop fs -cat /user/userID/testout/part* > output
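A sketch of inspecting the finished job (the part-r-* output naming is the standard MapReduce convention; it is not shown on the slide):

    hadoop fs -ls /user/userID/testout                         # expect a _SUCCESS marker plus part-r-00000, ...
    hadoop fs -cat /user/userID/testout/part-r-00000 | head    # peek at the first few word counts
    yarn application -list -appStates FINISHED                 # confirm the job completed on YARN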


Page 26

Stopping HDFS

• Stop the resource manager:
  stop-yarn.sh

• Stop the name node and data nodes:
  stop-dfs.sh

Note: Sometimes a manual shutdown may be required on individual data nodes.
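When a DataNode does not stop cleanly, the per-daemon script can be run on that node directly; a sketch using this deck's node names and install path:

    ssh hadoop@hadoop-dn3 /usr/local/packages/hadoop/2.6.0/sbin/hadoop-daemon.sh stop datanode
    ssh hadoop@hadoop-dn3 jps    # verify that no DataNode process remains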


Page 27

Hadoop Cluster Overview


Page 28

Hadoop DataNodes


Page 29

Hadoop Job Tracker History


Page 30

References:

• Yahoo Hadoop Tutorial: https://developer.yahoo.com/hadoop/tutorial/index.html

• "Introduction to Data Science", Bill Howe, University of Washington
