MODULE 2
Prof. Mohammed Tanzeem Agra
1) Essential Hadoop Tools
2) Hadoop YARN Applications
3) Managing Hadoop with Apache Ambari
4) Basic Hadoop Administration Procedures
Essential Hadoop Tools
● Apache Pig
● Apache Hive (data warehouse infrastructure)
● Apache Sqoop (relational data import/export)
● Apache Oozie (designed to run and manage multiple Hadoop jobs)
● Apache HBase (a distributed database modeled on Google Bigtable)
Apache Pig
● Apache Pig is a high-level language that enables programmers to write complex MapReduce transformations using a simple scripting language (Pig Latin).
● Pig is often used for data-set operations such as aggregate, join, and sort.
● It is also used for extract, transform, and load (ETL) data pipelines, quick research on raw data, and iterative data processing.
● There are two modes (see the short example after this list):
● Local mode: all processing is done on the local machine.
● Non-local mode: the job is executed on the cluster using either the MapReduce or the Tez engine.
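A minimal sketch of both modes in practice (file and path names are hypothetical). First, a small Pig Latin word-count script that groups, aggregates, and sorts:

    -- wordcount.pig: count word occurrences in input.txt
    lines  = LOAD 'input.txt' AS (line:chararray);
    words  = FOREACH lines GENERATE FLATTEN(TOKENIZE(line)) AS word;
    grpd   = GROUP words BY word;
    counts = FOREACH grpd GENERATE group AS word, COUNT(words) AS n;
    sorted = ORDER counts BY n DESC;
    STORE sorted INTO 'wordcount-out';

It can then be run in either mode:

    pig -x local wordcount.pig        # local mode: all processing on the local machine
    pig -x mapreduce wordcount.pig    # non-local mode: runs on the cluster as MapReduce
    pig -x tez wordcount.pig          # non-local mode using the Tez engine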
Apache Hive
● Apache Hive is a data warehouse built on top of Hadoop for providing data summarization, ad hoc queries, and analysis of large data sets using an SQL-like language called HiveQL.
● Features (a short HiveQL sketch follows this list):
– Tools to enable easy data extraction, transformation, and loading (ETL)
– A mechanism to impose structure on a variety of data formats
– Access to files stored either directly in HDFS or in other data storage systems
– Query execution via MapReduce and Tez
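As a sketch (table and column names are invented for illustration), HiveQL can impose structure on delimited files already sitting in HDFS and then run an ad hoc summarization query over them:

    -- Impose structure on existing HDFS files (ETL-style external table)
    CREATE EXTERNAL TABLE web_logs (
        ip     STRING,
        ts     STRING,
        url    STRING,
        status INT)
    ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
    LOCATION '/data/web_logs';

    -- Ad hoc query; Hive compiles this to MapReduce or Tez jobs
    SELECT url, COUNT(*) AS hits
    FROM web_logs
    GROUP BY url
    ORDER BY hits DESC
    LIMIT 10;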
2.2 : APACHE YARN (Yet Another Resource Negotiator)
Why YARN : MapReduce Version 1
In MapReduce version 1 (MRv1), MapReduce performed both processing and resource management functions. It consisted of a Job Tracker, which was the single master.
The Job Tracker allocated resources, performed scheduling, and monitored the processing jobs.
It assigned map and reduce tasks to a number of subordinate processes called Task Trackers.
The Task Trackers periodically reported their progress to the Job Tracker.
Problem with Version 1
There was a scalability bottleneck due to the single Job Tracker.
IBM mentioned in its article that, according to Yahoo!, the practical limits of such a design are reached with a cluster of 5,000 nodes and 40,000 tasks running concurrently.
Apart from this limitation, the utilization of computational resources was inefficient in MRv1, and the Hadoop framework was limited to the MapReduce processing paradigm alone.
INTRODUCTION TO YARN
YARN allows different data processing methods, such as graph processing, interactive processing, stream processing, and batch processing, to run and process data stored in HDFS.
YARN therefore opens up Hadoop to other types of distributed applications beyond MapReduce.
YARN enables users to perform operations as required by using a variety of tools, such as Spark for real-time processing, Hive for SQL, HBase for NoSQL, and others.
YARN COMPONENTS
YARN performs all processing activities by allocating resources and scheduling tasks. The Apache Hadoop YARN architecture consists of the following main components:
Resource Manager
Node Manager
Application Master
Container
1. Resource Manager
It is the ultimate authority in resource allocation.
On receiving processing requests, it passes parts of the requests to the corresponding Node Managers, where the actual processing takes place.
It is the arbitrator of the cluster resources and decides the allocation of the available resources among competing applications.
It optimizes cluster utilization (keeping all resources in use all the time) against various constraints such as capacity guarantees, fairness, and SLAs.
Resource Manager Components
The Resource Manager has two major components: a) Scheduler b) Application Manager
Scheduler: The Scheduler is responsible for allocating resources to the various running applications, subject to constraints of capacities, queues, etc.
It performs scheduling based on the resource requirements of the applications.
It is a pure scheduler: if there is an application failure or a hardware failure, the Scheduler does not guarantee to restart the failed tasks. (An example queue configuration follows.)
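With the default Capacity Scheduler, for instance, those constraints are expressed as queue definitions in capacity-scheduler.xml; the queue names and percentages below are hypothetical:

    # In capacity-scheduler.xml (illustrative values):
    #   yarn.scheduler.capacity.root.queues             = default,analytics
    #   yarn.scheduler.capacity.root.default.capacity   = 70
    #   yarn.scheduler.capacity.root.analytics.capacity = 30
    # Reload the queue definitions on a running cluster:
    yarn rmadmin -refreshQueues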
Application Manager
It is responsible for accepting job submissions.
It negotiates the first container from the Resource Manager for executing the application-specific Application Master.
It manages the running Application Masters in the cluster and provides the service for restarting the Application Master container on failure.
2. Node Manager
It takes care of individual nodes in a Hadoop cluster and manages user jobs and workflow on the given node.
It registers with the Resource Manager and sends heartbeats with the health status of the node.
Its primary goal is to manage the application containers assigned to it by the Resource Manager.
It keeps up to date with the Resource Manager.
The Application Master requests an assigned container from the Node Manager by sending it a Container Launch Context (CLC), which includes everything the application needs in order to run. The Node Manager creates the requested container process and starts it.
It monitors the resource usage (memory, CPU) of individual containers.
It performs log management.
It also kills containers as directed by the Resource Manager. (A quick way to inspect registered nodes follows.)
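The nodes that have registered, along with their state and running container counts, can be listed from the command line; the per-node resources a Node Manager offers are set in yarn-site.xml (values below are illustrative):

    # Resources each Node Manager advertises (yarn-site.xml, example values):
    #   yarn.nodemanager.resource.memory-mb  = 8192
    #   yarn.nodemanager.resource.cpu-vcores = 8
    # List all registered nodes with their status:
    yarn node -list -all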
3. Application Master
An application is a single job submitted to the framework. Each such application has a unique Application Master associated with it, which is a framework-specific entity.
It is the process that coordinates an application's execution in the cluster and also manages faults.
Its task is to negotiate resources from the Resource Manager and work with the Node Managers to execute and monitor the component tasks.
It is responsible for negotiating appropriate resource containers from the Resource Manager, tracking their status, and monitoring progress.
Once started, it periodically sends heartbeats to the Resource Manager to affirm its health and to update the record of its resource demands.
4. Container
A container is a collection of physical resources, such as RAM, CPU cores, and disks, on a single node.
YARN containers are managed through a Container Launch Context (CLC), a record of the container life cycle. This record contains a map of environment variables, dependencies stored in remotely accessible storage, security tokens, the payload for Node Manager services, and the command necessary to create the process.
A container grants an application the right to use a specific amount of resources (memory, CPU, etc.) on a specific host. (A sketch of requesting containers follows.)
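A quick way to see containers as resource grants is the DistributedShell example application that ships with Hadoop; the sketch below (the jar path varies by installation) asks YARN for two containers of 1 GB and 1 vcore each and runs a shell command in each:

    # Request two 1 GB / 1 vcore containers and run `hostname` inside each
    yarn jar /path/to/hadoop-yarn-applications-distributedshell.jar \
        org.apache.hadoop.yarn.applications.distributedshell.Client \
        -jar /path/to/hadoop-yarn-applications-distributedshell.jar \
        -shell_command hostname -num_containers 2 \
        -container_memory 1024 -container_vcores 1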
Application Submission in YARN
Data Flow in YARN
1. The client submits an application.
2. The Resource Manager allocates a container to start the Application Master.
3. The Application Master registers with the Resource Manager.
4. The Application Master asks for containers from the Resource Manager.
5. The Application Master notifies the Node Managers to launch the containers.
6. The application code is executed in the containers.
7. The client contacts the Resource Manager/Application Master to monitor the application's status.
8. The Application Master unregisters with the Resource Manager.
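The same flow can be traced from the command line when submitting the stock pi example that ships with Hadoop (the jar path is installation dependent):

    # 1. Submit an application:
    yarn jar /path/to/hadoop-mapreduce-examples.jar pi 8 1000
    # 2-6. Watch its state move through ACCEPTED -> RUNNING -> FINISHED:
    yarn application -list
    yarn application -status <application_id>
    # 8. After the Application Master unregisters, fetch the aggregated logs:
    yarn logs -applicationId <application_id>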
Using Apache Sqoop to Acquire Relational Data
● Sqoop is a tool designed to transfer data between Hadoop and relational databases.
● Sqoop can be used with any JDBC-compliant database and has been tested on Microsoft SQL Server, PostgreSQL, MySQL, and Oracle.
● It can also import data directly into Hive or HBase.
There are two methods: import and export.
● Apache Sqoop Import Method
● Step 1: Sqoop examines the database to gather the necessary metadata for the data to be imported.
● Step 2: Sqoop submits a map-only Hadoop job to the cluster, which transfers the data using the metadata captured in Step 1.
● Each node doing the import must have access to the database.
● The imported data are saved in an HDFS directory.
● By default, Sqoop uses the table name for that directory.
[Diagram: the two-step Sqoop data import method]
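A representative import command (connection string, credentials, and table name are all hypothetical):

    # Import the `orders` table into HDFS using four parallel map tasks
    sqoop import \
        --connect jdbc:mysql://db.example.com/shop \
        --username sqoopuser -P \
        --table orders \
        --num-mappers 4
    # With no --target-dir given, the data lands in an HDFS directory
    # named after the table (here, orders/)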
Apache Sqoop Export Method
Data export from the cluster works in a similar fashion. The export is done in two steps.
Step 1: Sqoop examines the database for metadata.
Step 2: Sqoop divides the input data set into splits, then uses individual map tasks to push the splits to the database.
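The mirror-image export (again with hypothetical names) pushes an HDFS directory back into a relational table, which must already exist:

    # Export comma-delimited HDFS files into an existing order_summary table
    sqoop export \
        --connect jdbc:mysql://db.example.com/shop \
        --username sqoopuser -P \
        --table order_summary \
        --export-dir /user/hadoop/order_summary \
        --input-fields-terminated-by ','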
Managing Hadoop with Apache Ambari (GUI for Hadoop)
Why APACHE AMBARI
Managing a Hadoop installation by hand can be tedious and time consuming.
Keeping configuration files synchronized across a cluster, and starting, stopping, and restarting Hadoop services and their dependent services in the right order, is not a simple task.
Apache Ambari is designed to help with these tasks; this section covers some basic navigation and usage scenarios for Ambari.
INTRODUCTION
Apache Ambari is an open source graphical installation and management tool for installing Hadoop version 2.
A minimum four-node cluster is required.
It can support: HDFS, YARN, MapReduce, Tez, Hive, HBase, Pig, Sqoop, Oozie, ZooKeeper, and Flume.
To use Ambari, the entire Hadoop installation must be performed by Ambari. It is not possible to use Ambari to manage a Hadoop cluster that has been installed by other means.
Architecture
[Diagram: Ambari architecture]
Application of Ambari
It is a centralized point of administration for a Hadoop cluster.
Users can configure cluster services.
It monitors the status of cluster hosts.
It can start and stop services.
It can add new hosts to the cluster.
It also provides real-time reporting of important metrics. (The same operations are exposed through a REST API; a sketch follows.)
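Everything the Ambari web interface does goes through a REST API, which can also be called directly; a sketch (host name, credentials, and cluster name are placeholders):

    # List the clusters this Ambari server manages:
    curl -u admin:admin http://ambari.example.com:8080/api/v1/clusters
    # Check the state of one service (here HDFS) on a cluster named "mycluster":
    curl -u admin:admin \
        http://ambari.example.com:8080/api/v1/clusters/mycluster/services/HDFS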
Quick Tour of Apache Ambari
Dashboard View: widgets can be moved, edited, removed, or added.
[Screenshot: CPU Usage dashboard widget]
Service View
Provides a detailed look at each service running on the cluster.
It also provides a graphical method to configure each service, in place of hand-editing the files under /etc/hadoop/conf.
The currently installed services are listed in the left-side menu. To select a service, click the service name in the menu.
The status of the NameNode, SecondaryNameNode, and DataNodes, the uptime, and the available disk space are displayed in the Summary window.
Clicking the Configs tab opens the options form. The options are the same ones that are set in the Hadoop XML files.
[Screenshot: Ambari Hadoop 2 service window]
Hosts View
The Hosts menu item provides information such as host name, IP address, number of cores, memory, disk usage, current load average, and installed Hadoop components, in tabular form.
New hosts can be added by using the Actions pull-down menu.
The Hosts view provides three sub-windows:
Components
Host Metrics
Summary Information
Each service can be stopped, restarted, decommissioned, or placed in maintenance mode.
[Screenshot: the Hosts view]
Admin View
The Admin view provides three options:
List of installed software
Service Accounts: Hortonworks Data Platform (HDP)
Security
Admin Pull-Down Menu
About: Provides the current version of Ambari.
Manage Ambari: Opens the management screen, where Users, Groups, Permissions, and Ambari Views can be created and configured.
Settings: Provides the option to turn off the progress window.
Sign Out: Exits the interface.
Fault Tolerance
✔ Fault tolerance is a property that enables a system to continue operating properly in the event of the failure of some of its components.
✔ Strict control of data flow throughout the execution: mapper processes do not exchange data with other mapper processes, and data can only go from mappers to reducers.
✔ Recovery from the failure of one or many map processes: if a server fails, the map tasks that were running on that machine can easily be restarted on another working server.
✔ Failed reducers can also be restarted, but additional work has to be redone in that case. (A note on the retry settings follows.)
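How many times a failed task is retried before the whole job is declared failed is configurable through standard MapReduce properties (the values shown are the usual defaults):

    # In mapred-site.xml, or per job via -D (usual defaults shown):
    #   mapreduce.map.maxattempts    = 4
    #   mapreduce.reduce.maxattempts = 4
    # A failed task attempt is rescheduled on another node until the
    # attempt limit is reached; only then does the whole job fail.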
Speculative Execution
● Executing a program on a large cluster is challenging because individual machines can run at different speeds; controlling and monitoring such a large set of resources is not easy.
● When one part of a MapReduce process runs slowly, it ultimately slows down everything, since in parallel computing the job is only as fast as its slowest task.
● In speculative execution, when Hadoop determines that a task is running slowly, it launches a duplicate (speculative) copy of that task on another node; the results of whichever copy finishes first are used, and the other copy is killed. (A per-job configuration sketch follows.)
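Speculative execution can be toggled per job through standard properties; a sketch using the bundled pi example (the jar path is installation dependent):

    # Enable speculative copies of slow map and reduce tasks for this job:
    yarn jar /path/to/hadoop-mapreduce-examples.jar pi \
        -D mapreduce.map.speculative=true \
        -D mapreduce.reduce.speculative=true \
        8 1000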
Hadoop MapReduce Hardware
● Server
● Storage (Hard disk)
● Processing (Processor)
● Old and new hardware?