
Page 1

Linux Clusters Institute: Turning an HPC Cluster into a Big Data Cluster

Fang (Cherry) Liu, PhD    fang.liu@oit.gatech.edu

A Partnership for an Advanced Computing Environment (PACE)

OIT/ART, Georgia Tech

18-22 May 2015

Page 2


Targets for this session

Target audience: HPC system admins who want to support a Hadoop cluster

Points of interest:
• Big Data is common
• Challenges and tools
• Hadoop vs. HPC
• Hadoop core characteristics
• Hadoop ecosystem
  • Core parts - HDFS and MapReduce
  • Distributions
  • Projects
• Hadoop basic operations
• Configuring a Hadoop cluster on an HPC cluster - the PACE Hadoop cluster
• Hadoop advanced operations

Page 3

Big Data is Common

• The entire rendering of Avatar reportedly required over 1 petabyte of storage space, according to BBC's Clickbits, which is the equivalent of 500 hard drives of 2 TB each. That's equal to a 32-year-long MP3 file. The computing core (34 racks, each with four chassis of 32 machines each) adds up to some 40,000 processors and 104 terabytes of RAM. http://thenextweb.com/2010/01/01/avatar-takes-1-petabyte-storage-space-equivalent-32-year-long-mp3/

• Facebook's data warehouses grow by "over half a petabyte ... every 24 hours" (2012). http://www.theregister.co.uk/2012/11/09/facebook_open_sources_corona/

• An average of one hundred thousand MapReduce jobs are executed on Google's clusters every day, processing a total of more than twenty petabytes of data per day (2009). http://dl.acm.org/citation.cfm?id=1327492&dl=ACM&coll=DL&CFID=508148704&CFTOKEN=44216082


Page 4

Big Data: Three Challenges

• Volume (storage)
  • The size of the data

• Velocity (processing speed)
  • The latency of data processing relative to the growing demand for interactivity

• Variety (structured vs. unstructured data)
  • The diversity of sources, formats, quality, and structures


Page 5

What tools are there to use?

• Hadoop is open-source software for reliable, scalable, distributed computing:
  • Created by Doug Cutting and Michael Cafarella while at Yahoo; named after Doug's son's toy elephant
  • Written in Java
  • Scales to thousands of machines with linear scalability
  • Uses a simple programming model (MapReduce)
  • Fault tolerant (HDFS)

• Spark
  • A fast and general engine for large-scale data processing; it claims to run programs up to 100x faster than Hadoop MapReduce in memory, or 10x faster on disk.
  • It provides high-level tools including Spark SQL, MLlib for machine learning, GraphX, and Spark Streaming.
  • It runs on Hadoop, on Mesos, standalone, or in the cloud.
  • It can access data sources including HDFS, Cassandra, HBase, Hive, and Amazon S3.


Page 6

Hadoop vs. Conventional HPC


Hadoop | HPC
A cluster of machines collocates the data with the compute nodes | A cluster of machines accesses a shared filesystem
Moves computation to the data | Moves data to the computation; network bandwidth is the bottleneck
MapReduce operates at a higher level; data flow is implicit | Message Passing Interface (MPI) explicitly handles the mechanics of the data flow
Fault tolerance through data replication; it is easy to rerun Map or Reduce tasks | Needs to explicitly manage checkpointing and recovery
Restricted to data-processing problems | Can solve more complex algorithms

Page 7

Hadoop Core Characteristics

• Distribution - instead of building one big supercomputer, storage and processing are spread across a cluster of smaller machines that communicate and work together.

• Horizontal scalability - it is easy to extend a Hadoop cluster by just adding new machines. Every new machine increases the total storage and processing power of the Hadoop cluster.

• Fault tolerance - Hadoop continues to operate even when a few hardware or software components fail to work properly.

• Cost optimization - Hadoop runs on standard hardware; it does not require expensive servers.

• Programming abstraction - Hadoop takes care of all the messy details of distributed computing. Using a high-level API, users can focus on implementing the business logic that solves their real-world problems.

• Data locality - don't move large datasets to where the application is running; run the application where the data already is.


Page 8

Hadoop Ecosystem


[Diagram: the Hadoop ecosystem stack. Local disks at the bottom; HDFS spanning them; MapReduce and HBase on top of HDFS; analysis tools (Impala, Pig, Scalding, Mahout, Hive) above; workflow managers (Azkaban, Oozie) and the Hue web interface alongside.]

Page 9

Hadoop Ecosystem - Core Parts

• HDFS: the Hadoop Distributed File System (HDFS) is a distributed file system designed to run on a commodity cluster of machines. HDFS is highly fault tolerant and is useful for processing large data sets.

• MapReduce: MapReduce is a software framework for processing large data sets, at petabyte scale, on a cluster of commodity hardware. When MapReduce jobs are run, Hadoop splits the input and locates the nodes on the cluster that hold it. The actual tasks are then run at or close to the nodes where the data resides, so that the data is as close to the computation as possible. This avoids transferring huge amounts of data across the network, so the network does not become a bottleneck or get flooded.


Page 10

Hadoop  EcoSystem  -­‐  Distribu.ons

•  Apache:  Purely  Open  Source  distribu@on  of  Hadoop  maintained  by  the  community  at  Apache  Sonware  Founda@on.  

•  Cloudera:  Cloudera’s  distribu@on  of  Hadoop  that  is  built  on  top  of  Apache  Hadoop.  The  distribu@on  includes  capabili@es  such  as  management,  security,  high  availability  and  integra@on  with  a  wide  variety  of  hardware  and  sonware  solu@ons.  Cloudera  is  the  leading  distributor  of  Hadoop.  

•  Horton  Works:  This  also  builds  on  the  Open  Source  Apache  Hadoop  with  claims  to  enterprise  readiness.  It  also  claims  to  be  the  only  distribu@on  that  is  available  for  Windows  servers.  

• MapR:  Hadoop  distribu@on  with  some  unique  features,  most  notably  the  ability  to  mount  the  Hadoop  cluster  over  NFS.  

•  Amazon  EMR  :  Amazon’s  hosted  version  of  MapReduce  is  called  Elas@c  Map  Reduce.  This  is  part  of  the  Amazon  Web  Services  (AWS).  EMR  allows  a  Hadoop  cluster  to  be  deployed  and  MapReduce  jobs  to  be  run  in  the  AWS  cloud  with  just  a  few  clicks.  


Page 11

Hadoop Ecosystem - Related Projects

• Pig: a high-level language for analyzing large data sets which eases development of MapReduce jobs. Hundreds of lines of MapReduce code can be replaced with just a few lines of Pig. At Yahoo, more than 60% of Hadoop usage is through Pig.

• Hive: a data warehouse framework that supports querying of large data sets stored in Hadoop. Hive provides a high-level SQL-like language called HiveQL.

• HBase: a distributed, scalable data store built on Hadoop. HBase is a distributed, versioned, column-oriented database modeled after Google's BigTable.

• Mahout: a scalable machine-learning library. Mahout utilizes Hadoop to achieve massive scalability.

• YARN: the next generation of MapReduce, a.k.a. MapReduce 2. The MapReduce framework was overhauled with YARN to overcome the scalability bottlenecks of the earlier version of MapReduce when run on very large clusters (thousands of nodes).

• Oozie: a workflow scheduler system that eases the creation and management of sequences of MapReduce jobs.

• Flume: a distributed, reliable, and available service for collecting, aggregating, and moving log data to HDFS. This is typically useful in systems where log data needs to be moved to HDFS periodically for processing.


Page 12

Hadoop Basics - HDFS

• HDFS is designed to store a very large amount of information (terabytes or petabytes). This requires spreading the data across a large number of machines. It also supports much larger file sizes than NFS.

• HDFS should store data reliably. If individual machines in the cluster malfunction, data should still be available (see the sketch below).

• HDFS should provide fast, scalable access to this information. It should be possible to serve a larger number of clients by simply adding more machines to the cluster.

• HDFS should integrate well with Hadoop MapReduce, allowing data to be read and computed upon locally when possible.
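As a quick illustration of the reliability and locality goals on a running cluster (a sketch; the path below is just an example, not from the slides), the stock HDFS tools report replication and block placement directly:

    # Cluster-wide view of capacity, live DataNodes, and under-replicated blocks
    hdfs dfsadmin -report

    # Per-file view: which blocks a file has, how many replicas, and which DataNodes hold them
    hdfs fsck /user/hadoop/dir1 -files -blocks -locations

If a DataNode is lost, the NameNode re-replicates the affected blocks on the remaining nodes, which is what keeps the data available as described above.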


Page 13

Hadoop Basics - HDFS (Cont.)


Page 14

Hadoop Basics - MapReduce


Mapping and reducing tasks run on nodes where individual records of data are already present.

Page 15

Example: Word Count

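The figure on this slide is not reproduced in the transcript; the following is a small sketch of the same word-count data flow, with made-up input:

    Input split (two lines):
        the quick brown fox
        the lazy dog

    Map phase - each mapper emits one (word, 1) pair per word:
        (the,1) (quick,1) (brown,1) (fox,1)
        (the,1) (lazy,1) (dog,1)

    Shuffle/sort - pairs are grouped by key:
        brown -> [1]    dog -> [1]     fox -> [1]
        lazy  -> [1]    quick -> [1]   the -> [1, 1]

    Reduce phase - each reducer sums the list for its key:
        brown 1, dog 1, fox 1, lazy 1, quick 1, the 2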

Page 16

Basic HDFS Commands

• Create a directory in HDFS
  • hadoop fs -mkdir <paths> (e.g. hadoop fs -mkdir /user/hadoop/dir1)

• List files
  • hadoop fs -ls <args> (e.g. hadoop fs -ls /user/hadoop/dir1)

• Upload data from the local system to HDFS
  • hadoop fs -put <localsrc> ... <HDFS_dest_path> (e.g. hadoop fs -put ~/foo.txt /user/hadoop/dir1/foo.txt)

• Download a file from HDFS
  • hadoop fs -get <hdfs_src> <localdst> (e.g. hadoop fs -get /user/hadoop/dir1/foo.txt /home/)

• Check space utilization in an HDFS directory
  • hadoop fs -du URI (e.g. hadoop fs -du /user/hadoop)

• Get help
  • hadoop fs -help
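Putting the commands above together, a first session might look like the following sketch (the user and file names are illustrative, not taken from the slides):

    hadoop fs -mkdir /user/hadoop/dir1                       # create a working directory
    hadoop fs -put ~/foo.txt /user/hadoop/dir1/              # copy a local file into HDFS
    hadoop fs -ls /user/hadoop/dir1                          # confirm the file is there
    hadoop fs -du /user/hadoop                               # check space used under the directory
    hadoop fs -get /user/hadoop/dir1/foo.txt /tmp/foo.txt    # copy it back out
    hadoop fs -rm /user/hadoop/dir1/foo.txt                  # clean up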


Page 17

Case Study: PACE Hadoop Cluster

• Consists of 5 x 24-core Altus 2704 servers, each with 128 GB of RAM. Each node has a 3 TB local RAID disk (6 x 500 GB).

• All nodes are connected via a 40 Gb/s InfiniBand connection.
• One node serves as NameNode + JobTracker + DataNode and is named hadoop-nn1.

• The remaining four nodes serve as DataNodes, namely:
  • hadoop-dn1
  • hadoop-dn2
  • hadoop-dn3
  • hadoop-dn4

• Check the cluster status and running jobs at:
  • Overview: http://<NameNode fully qualified name>:50070
  • JobTracker (YARN ResourceManager): http://<NameNode fully qualified name>:8088


Page 18

Installing Hadoop on an academic cluster

• Download a release from the official website http://hadoop.apache.org/; the most recent release was 2.7.0, released April 21, 2015.

• Put the binary distribution in an NFS location so that all participating nodes can access it (see the sketch below).

• Add local disk on each node to serve as the HDFS file system.
• Configure the nodes into name nodes and data nodes.
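A sketch of the download-and-unpack step (the mirror URL and the exact NFS path are assumptions, modeled on the /usr/local/packages/hadoop paths used later in this deck):

    # Fetch and unpack the release onto an NFS path visible to every node
    cd /usr/local/packages/hadoop
    wget https://archive.apache.org/dist/hadoop/common/hadoop-2.7.0/hadoop-2.7.0.tar.gz
    tar -xzf hadoop-2.7.0.tar.gz
    ln -s hadoop-2.7.0 2.7.0        # keep the versioned directory layout used in these slides

    # Point HADOOP_PREFIX at the shared install (e.g. in a module file or shell profile)
    export HADOOP_PREFIX=/usr/local/packages/hadoop/2.7.0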


Page 19

Configuring HDFS

• All configuration files are in ${HADOOP_PREFIX}/etc/hadoop (e.g. /usr/local/packages/hadoop/2.6/etc/hadoop)

• In core-site.xml:

Key | Value | Example
fs.defaultFS | protocol://servername:port | hdfs://127.0.0.1:9000
hadoop.tmp.dir | pathname | /dfs/hadoop/tmp

• In hdfs-site.xml:

Key | Value | Example
dfs.name.dir | pathname | /dfs/hadoop/name
dfs.data.dir | pathname | /dfs/hadoop/data
dfs.replication | number of replicas | 3
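A minimal sketch of the two files using the example values from the tables above (hostnames and paths are site-specific):

    <!-- core-site.xml -->
    <configuration>
      <property>
        <name>fs.defaultFS</name>
        <value>hdfs://127.0.0.1:9000</value>
      </property>
      <property>
        <name>hadoop.tmp.dir</name>
        <value>/dfs/hadoop/tmp</value>
      </property>
    </configuration>

    <!-- hdfs-site.xml -->
    <configuration>
      <property>
        <name>dfs.name.dir</name>
        <value>/dfs/hadoop/name</value>
      </property>
      <property>
        <name>dfs.data.dir</name>
        <value>/dfs/hadoop/data</value>
      </property>
      <property>
        <name>dfs.replication</name>
        <value>3</value>
      </property>
    </configuration>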

Page 20

Configuring YARN

• In ${HADOOP_PREFIX}/etc/hadoop/yarn-site.xml:


    <property>
        <name>yarn.resourcemanager.resource-tracker.address</name>
        <value>hdfs://<hostname>:8025</value>
    </property>
    <property>
        <name>yarn.resourcemanager.scheduler.address</name>
        <value>hdfs://<hostname>:8030</value>
    </property>
    <property>
        <name>yarn.resourcemanager.address</name>
        <value>hdfs://<hostname>:8040</value>
    </property>
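Not shown on the slide, but commonly needed alongside the YARN settings above so that MapReduce jobs are actually submitted to YARN (an assumption about this setup, not something the slides specify), is one property in ${HADOOP_PREFIX}/etc/hadoop/mapred-site.xml:

    <property>
        <name>mapreduce.framework.name</name>
        <value>yarn</value>
    </property>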

Page 21

Configure the slaves file: ${HADOOP_HOME}/etc/hadoop/slaves


[Diagram: hadoop-nn1 is the NameNode; hadoop-dn1, hadoop-dn2, hadoop-dn3, and hadoop-dn4 are DataNodes.]

Contents of the slaves file on each node:

hadoop-nn1:
  hadoop-nn1
  hadoop-dn1
  hadoop-dn2
  hadoop-dn3
  hadoop-dn4

hadoop-dn1:
  hadoop-dn1

hadoop-dn2:
  hadoop-dn2

hadoop-dn3:
  hadoop-dn3

hadoop-dn4:
  hadoop-dn4

Page 22

Configure environment

• Add the following two lines to ${HADOOP_PREFIX}/etc/hadoop/hadoop-env.sh:
  export JAVA_HOME=/usr/local/packages/java/1.7.0
  export HADOOP_LOG_DIR=/nv/ap2/logs/hadoop

• Add the following line to ${HADOOP_PREFIX}/etc/hadoop/yarn-env.sh:
  export YARN_LOG_DIR=/nv/ap2/logs/hadoop

• Add the following line to ${HADOOP_PREFIX}/etc/hadoop/mapred-env.sh:
  export HADOOP_MAPRED_LOG_DIR=/nv/ap2/logs/hadoop


Page 23

Starting HDFS

• Create a user named hadoop, and start all Hadoop services as the hadoop user to avoid security issues.

• Format the file system based on the above configuration:
  ${HADOOP_HOME}/bin/hdfs namenode -format <cluster_name>

• Start HDFS:
  start-dfs.sh
  Make sure it is running correctly: ps -ef | grep -i hadoop

• Start YARN:
  start-yarn.sh
  Make sure it is running correctly: ps -ef | grep -i resourcemanager
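A sketch of what a successful start looks like when checked with jps (the daemon names are standard Hadoop 2.x daemons; the PIDs are made up, and the exact set per node depends on the roles configured for this cluster):

    # On hadoop-nn1 (NameNode + DataNode + resource manager in this deck's layout)
    $ jps
    12001 NameNode
    12210 DataNode
    12455 ResourceManager
    12602 NodeManager

    # On a DataNode-only host such as hadoop-dn1
    $ jps
    8011 DataNode
    8150 NodeManager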


Page 24

User management on the Hadoop Cluster

• Log in as the hadoop user:
  ssh hadoop@hadoop-nn1

• Create a directory for the given user:
  hadoop fs -mkdir /user/userID

• Set directory ownership to the given user:
  hadoop fs -chown -R userID:groupID /user/userID

• Change the permissions so only that user has access:
  hadoop fs -chmod -R 700 /user/userID
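For more than a handful of users, the same three commands can be scripted; a sketch (the user list and group name are illustrative):

    #!/bin/bash
    # Run as the hadoop user on hadoop-nn1
    for u in alice bob carol; do
        hadoop fs -mkdir -p /user/$u          # -p avoids an error if the directory already exists
        hadoop fs -chown -R $u:hadoopusers /user/$u
        hadoop fs -chmod -R 700 /user/$u
    done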


Page 25

User runs a job

• The user logs in to the Hadoop NameNode and submits the job from there:
  ssh userID@hadoop-nn1

• Load the Hadoop environment variables, such as Java, Python, HADOOP_PREFIX, and HADOOP_YARN_HOME:
  module load hadoop/2.6.0

• Upload an input file from the local file system to HDFS:
  hadoop fs -put example.txt /user/userID

• Run wordcount on example.txt:
  hadoop jar /usr/local/packages/hadoop/2.6.0/share/hadoop/mapreduce/hadoop-mapreduce-examples-2.6.0.jar wordcount /user/userID/ /user/userID/testout

• Check the result:
  hadoop fs -cat /user/userID/testout/part* > output
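A sketch of inspecting the finished job (the part-r-* output naming is the standard MapReduce convention; it is not shown on the slide):

    hadoop fs -ls /user/userID/testout                         # expect a _SUCCESS marker plus part-r-00000, ...
    hadoop fs -cat /user/userID/testout/part-r-00000 | head    # peek at the first few word counts
    yarn application -list -appStates FINISHED                 # confirm the job completed on YARN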


Page 26

Stopping HDFS

• Stop the resource manager:
  stop-yarn.sh

• Stop the name node and data nodes:
  stop-dfs.sh

Note: Sometimes a manual shutdown may be required on individual data nodes.
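When a DataNode does not stop cleanly, the per-daemon script can be run on that node directly; a sketch using this deck's node names and install path:

    ssh hadoop@hadoop-dn3 /usr/local/packages/hadoop/2.6.0/sbin/hadoop-daemon.sh stop datanode
    ssh hadoop@hadoop-dn3 jps    # verify that no DataNode process remains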


Page 27

Hadoop Cluster Overview


Page 28

Hadoop DataNodes


Page 29

Hadoop Job Tracker History


Page 30

References:

• Yahoo Hadoop Tutorial: https://developer.yahoo.com/hadoop/tutorial/index.html

• "Introduction to Data Science", Bill Howe, University of Washington
