MODULE 2
Prof. Mohammed Tanzeem Agra
1) Essential Hadoop Tools
2) Hadoop YARN Applications
3) Managing Hadoop with Apache Ambari
4) Basic Hadoop Administration Procedures
Essential Hadoop Tools
● Apache Pig
● Apache Hive (data warehouse infrastructure)
● Apache Sqoop (relational data import/export)
● Apache Oozie (designed to run and manage multiple Hadoop jobs)
● Apache HBase (a distributed database modeled on Google Bigtable)
Apache Pig
● Apache Pig is a high-level language that enables programmers to write complex MapReduce transformations using a simple scripting language (Pig Latin).
● Pig is often used for data-set operations such as aggregate, join, and sort.
● It is also used for extract, transform, and load (ETL) data pipelines, quick research on raw data, and iterative data processing.
● There are two modes (see the short example after this list):
● Local mode: all processing is done on the local machine.
● Non-local mode: the job is executed on the cluster using either the MapReduce or the Tez engine.
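A minimal sketch of both modes in practice (file and path names are hypothetical). First, a small Pig Latin word-count script that groups, aggregates, and sorts:

    -- wordcount.pig: count word occurrences in input.txt
    lines  = LOAD 'input.txt' AS (line:chararray);
    words  = FOREACH lines GENERATE FLATTEN(TOKENIZE(line)) AS word;
    grpd   = GROUP words BY word;
    counts = FOREACH grpd GENERATE group AS word, COUNT(words) AS n;
    sorted = ORDER counts BY n DESC;
    STORE sorted INTO 'wordcount-out';

It can then be run in either mode:

    pig -x local wordcount.pig        # local mode: all processing on the local machine
    pig -x mapreduce wordcount.pig    # non-local mode: runs on the cluster as MapReduce
    pig -x tez wordcount.pig          # non-local mode using the Tez engine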
Apache Hive
● Apache Hive is a data warehouse built on top of Hadoop for providing data summarization, ad hoc queries, and analysis of large data sets using an SQL-like language called HiveQL.
● Features (a short HiveQL sketch follows this list):
– Tools to enable easy data extraction, transformation, and loading (ETL)
– A mechanism to impose structure on a variety of data formats
– Access to files stored either directly in HDFS or in other data storage systems
– Query execution via MapReduce and Tez
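As a sketch (table and column names are invented for illustration), HiveQL can impose structure on delimited files already sitting in HDFS and then run an ad hoc summarization query over them:

    -- Impose structure on existing HDFS files (ETL-style external table)
    CREATE EXTERNAL TABLE web_logs (
        ip     STRING,
        ts     STRING,
        url    STRING,
        status INT)
    ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
    LOCATION '/data/web_logs';

    -- Ad hoc query; Hive compiles this to MapReduce or Tez jobs
    SELECT url, COUNT(*) AS hits
    FROM web_logs
    GROUP BY url
    ORDER BY hits DESC
    LIMIT 10;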
2.2 : APACHE YARN (Yet Another Resource Negotiator)
Why YARN : MapReduce Version 1
In MapReduce version 1 (MRv1), MapReduce performed both processing and resource management functions. It consisted of a Job Tracker, which was the single master.
The Job Tracker allocated resources, performed scheduling, and monitored the processing jobs.
It assigned map and reduce tasks to a number of subordinate processes called Task Trackers.
The Task Trackers periodically reported their progress to the Job Tracker.
Problem with Version 1
There was a scalability bottleneck due to the single Job Tracker.
IBM mentioned in its article that, according to Yahoo!, the practical limits of such a design are reached with a cluster of 5,000 nodes and 40,000 tasks running concurrently.
Apart from this limitation, the utilization of computational resources was inefficient in MRv1, and the Hadoop framework was limited to the MapReduce processing paradigm alone.
INTRODUCTION TO YARN
YARN allows different data processing methods, such as graph processing, interactive processing, stream processing, and batch processing, to run and process data stored in HDFS.
YARN therefore opens up Hadoop to other types of distributed applications beyond MapReduce.
YARN enables users to perform operations as required by using a variety of tools, such as Spark for real-time processing, Hive for SQL, HBase for NoSQL, and others.
YARN COMPONENTS
YARN performs all processing activities by allocating resources and scheduling tasks. The Apache Hadoop YARN architecture consists of the following main components:
Resource Manager
Node Manager
Application Master
Container
1. Resource Manager
It is the ultimate authority in resource allocation.
On receiving processing requests, it passes parts of the requests to the corresponding Node Managers, where the actual processing takes place.
It is the arbitrator of the cluster resources and decides the allocation of the available resources among competing applications.
It optimizes cluster utilization (keeping all resources in use all the time) against various constraints such as capacity guarantees, fairness, and SLAs.
Resource Manager Components
The Resource Manager has two major components: a) Scheduler b) Application Manager
Scheduler: The Scheduler is responsible for allocating resources to the various running applications, subject to constraints of capacities, queues, etc.
It performs scheduling based on the resource requirements of the applications.
It is a pure scheduler: if there is an application failure or a hardware failure, the Scheduler does not guarantee to restart the failed tasks. (An example queue configuration follows.)
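With the default Capacity Scheduler, for instance, those constraints are expressed as queue definitions in capacity-scheduler.xml; the queue names and percentages below are hypothetical:

    # In capacity-scheduler.xml (illustrative values):
    #   yarn.scheduler.capacity.root.queues             = default,analytics
    #   yarn.scheduler.capacity.root.default.capacity   = 70
    #   yarn.scheduler.capacity.root.analytics.capacity = 30
    # Reload the queue definitions on a running cluster:
    yarn rmadmin -refreshQueues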
Application Manager
It is responsible for accepting job submissions.
It negotiates the first container from the Resource Manager for executing the application-specific Application Master.
It manages the running Application Masters in the cluster and provides the service for restarting the Application Master container on failure.
2. Node Manager
It takes care of individual nodes in a Hadoop cluster and manages user jobs and workflow on the given node.
It registers with the Resource Manager and sends heartbeats with the health status of the node.
Its primary goal is to manage the application containers assigned to it by the Resource Manager.
It keeps up to date with the Resource Manager.
The Application Master requests an assigned container from the Node Manager by sending it a Container Launch Context (CLC), which includes everything the application needs in order to run. The Node Manager creates the requested container process and starts it.
It monitors the resource usage (memory, CPU) of individual containers.
It performs log management.
It also kills containers as directed by the Resource Manager. (A quick way to inspect registered nodes follows.)
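The nodes that have registered, along with their state and running container counts, can be listed from the command line; the per-node resources a Node Manager offers are set in yarn-site.xml (values below are illustrative):

    # Resources each Node Manager advertises (yarn-site.xml, example values):
    #   yarn.nodemanager.resource.memory-mb  = 8192
    #   yarn.nodemanager.resource.cpu-vcores = 8
    # List all registered nodes with their status:
    yarn node -list -all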
3. Application Master
An application is a single job submitted to the framework. Each such application has a unique Application Master associated with it, which is a framework-specific entity.
It is the process that coordinates an application's execution in the cluster and also manages faults.
Its task is to negotiate resources from the Resource Manager and work with the Node Managers to execute and monitor the component tasks.
It is responsible for negotiating appropriate resource containers from the Resource Manager, tracking their status, and monitoring progress.
Once started, it periodically sends heartbeats to the Resource Manager to affirm its health and to update the record of its resource demands.
4. Container
A container is a collection of physical resources, such as RAM, CPU cores, and disks, on a single node.
YARN containers are managed through a Container Launch Context (CLC), a record of the container life cycle. This record contains a map of environment variables, dependencies stored in remotely accessible storage, security tokens, the payload for Node Manager services, and the command necessary to create the process.
A container grants an application the right to use a specific amount of resources (memory, CPU, etc.) on a specific host. (A sketch of requesting containers follows.)
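A quick way to see containers as resource grants is the DistributedShell example application that ships with Hadoop; the sketch below (the jar path varies by installation) asks YARN for two containers of 1 GB and 1 vcore each and runs a shell command in each:

    # Request two 1 GB / 1 vcore containers and run `hostname` inside each
    yarn jar /path/to/hadoop-yarn-applications-distributedshell.jar \
        org.apache.hadoop.yarn.applications.distributedshell.Client \
        -jar /path/to/hadoop-yarn-applications-distributedshell.jar \
        -shell_command hostname -num_containers 2 \
        -container_memory 1024 -container_vcores 1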
Application Submission in YARN
Data Flow in YARN
1. The client submits an application.
2. The Resource Manager allocates a container to start the Application Master.
3. The Application Master registers with the Resource Manager.
4. The Application Master asks for containers from the Resource Manager.
5. The Application Master notifies the Node Managers to launch the containers.
6. The application code is executed in the containers.
7. The client contacts the Resource Manager/Application Master to monitor the application's status.
8. The Application Master unregisters with the Resource Manager.
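The same flow can be traced from the command line when submitting the stock pi example that ships with Hadoop (the jar path is installation dependent):

    # 1. Submit an application:
    yarn jar /path/to/hadoop-mapreduce-examples.jar pi 8 1000
    # 2-6. Watch its state move through ACCEPTED -> RUNNING -> FINISHED:
    yarn application -list
    yarn application -status <application_id>
    # 8. After the Application Master unregisters, fetch the aggregated logs:
    yarn logs -applicationId <application_id>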
Using Apache Sqoop to Acquire Relational Data
● Sqoop is a tool designed to transfer data between Hadoop and relational databases.
● Sqoop can be used with any JDBC-compliant database and has been tested on Microsoft SQL Server, PostgreSQL, MySQL, and Oracle.
● It can also import data directly into Hive or HBase.
There are two methods: import and export.
● Apache Sqoop Import Method
● Step 1: Sqoop examines the database to gather the necessary metadata for the data to be imported.
● Step 2: Sqoop submits a map-only Hadoop job to the cluster, which transfers the data using the metadata captured in Step 1.
● Each node doing the import must have access to the database.
● The imported data are saved in an HDFS directory.
● By default, Sqoop uses the table name for that directory.
[Diagram: the two-step Sqoop data import method]
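A representative import command (connection string, credentials, and table name are all hypothetical):

    # Import the `orders` table into HDFS using four parallel map tasks
    sqoop import \
        --connect jdbc:mysql://db.example.com/shop \
        --username sqoopuser -P \
        --table orders \
        --num-mappers 4
    # With no --target-dir given, the data lands in an HDFS directory
    # named after the table (here, orders/)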
Apache Sqoop Export Method
Data export from the cluster works in a similar fashion. The export is done in two steps.
Step 1: Sqoop examines the database for metadata.
Step 2: Sqoop divides the input data set into splits, then uses individual map tasks to push the splits to the database.
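The mirror-image export (again with hypothetical names) pushes an HDFS directory back into a relational table, which must already exist:

    # Export comma-delimited HDFS files into an existing order_summary table
    sqoop export \
        --connect jdbc:mysql://db.example.com/shop \
        --username sqoopuser -P \
        --table order_summary \
        --export-dir /user/hadoop/order_summary \
        --input-fields-terminated-by ','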
Managing Hadoop with Apache Ambari (GUI for Hadoop)
Why APACHE AMBARI
Managing a Hadoop installation by hand can be tedious and time consuming.
Keeping configuration files synchronized across a cluster, and starting, stopping, and restarting Hadoop services and their dependent services in the right order, is not a simple task.
Apache Ambari is designed to help with these tasks; this section covers some basic navigation and usage scenarios for Ambari.
INTRODUCTION
Apache Ambari is an open source graphical installation and management tool for installing Hadoop version 2.
A minimum four-node cluster is required.
It can support: HDFS, YARN, MapReduce, Tez, Hive, HBase, Pig, Sqoop, Oozie, ZooKeeper, and Flume.
To use Ambari, the entire Hadoop installation must be performed by Ambari. It is not possible to use Ambari to manage a Hadoop cluster that has been installed by other means.
Architecture
[Diagram: Ambari architecture]
Application of Ambari
It is a centralized point of administration for a Hadoop cluster.
Users can configure cluster services.
It monitors the status of cluster hosts.
It can start and stop services.
It can add new hosts to the cluster.
It also provides real-time reporting of important metrics. (The same operations are exposed through a REST API; a sketch follows.)
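Everything the Ambari web interface does goes through a REST API, which can also be called directly; a sketch (host name, credentials, and cluster name are placeholders):

    # List the clusters this Ambari server manages:
    curl -u admin:admin http://ambari.example.com:8080/api/v1/clusters
    # Check the state of one service (here HDFS) on a cluster named "mycluster":
    curl -u admin:admin \
        http://ambari.example.com:8080/api/v1/clusters/mycluster/services/HDFS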
Quick Tour of Apache Ambari
Dashboard View: widgets can be moved, edited, removed, or added.
[Screenshot: CPU Usage dashboard widget]
Service View
Provides a detailed look at each service running on the cluster.
It also provides a graphical method to configure each service, in place of hand-editing the files under /etc/hadoop/conf.
The currently installed services are listed in the left-side menu. To select a service, click the service name in the menu.
The status of the NameNode, SecondaryNameNode, and DataNodes, the uptime, and the available disk space are displayed in the Summary window.
Clicking the Configs tab opens the options form. The options are the same ones that are set in the Hadoop XML files.
[Screenshot: Ambari Hadoop 2 service window]
Hosts View
The Hosts menu item provides information such as host name, IP address, number of cores, memory, disk usage, current load average, and installed Hadoop components, in tabular form.
New hosts can be added by using the Actions pull-down menu.
The Hosts view provides three sub-windows:
Components
Host Metrics
Summary Information
Each service can be stopped, restarted, decommissioned, or placed in maintenance mode.
[Screenshot: the Hosts view]
Admin View
The Admin view provides three options:
List of installed software
Service Accounts: Hortonworks Data Platform (HDP)
Security
Admin Pull-Down Menu
About: Provides the current version of Ambari.
Manage Ambari: Opens the management screen, where Users, Groups, Permissions, and Ambari Views can be created and configured.
Settings: Provides the option to turn off the progress window.
Sign Out: Exits the interface.
Fault Tolerance
✔ Fault tolerance is a property that enables a system to continue operating properly in the event of the failure of some of its components.
✔ Strict control of data flow throughout the execution: mapper processes do not exchange data with other mapper processes, and data can only go from mappers to reducers.
✔ Recovery from the failure of one or many map processes: if a server fails, the map tasks that were running on that machine can easily be restarted on another working server.
✔ Failed reducers can also be restarted, but additional work has to be redone in that case. (A note on the retry settings follows.)
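How many times a failed task is retried before the whole job is declared failed is configurable through standard MapReduce properties (the values shown are the usual defaults):

    # In mapred-site.xml, or per job via -D (usual defaults shown):
    #   mapreduce.map.maxattempts    = 4
    #   mapreduce.reduce.maxattempts = 4
    # A failed task attempt is rescheduled on another node until the
    # attempt limit is reached; only then does the whole job fail.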
Speculative Execution
● Executing a program on a large cluster is challenging because individual machines can run at different speeds; controlling and monitoring such a large set of resources is not easy.
● When one part of a MapReduce process runs slowly, it ultimately slows down everything, since in parallel computing the job is only as fast as its slowest task.
● In speculative execution, when Hadoop determines that a task is running slowly, it launches a duplicate (speculative) copy of that task on another node; the results of whichever copy finishes first are used, and the other copy is killed. (A per-job configuration sketch follows.)
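Speculative execution can be toggled per job through standard properties; a sketch using the bundled pi example (the jar path is installation dependent):

    # Enable speculative copies of slow map and reduce tasks for this job:
    yarn jar /path/to/hadoop-mapreduce-examples.jar pi \
        -D mapreduce.map.speculative=true \
        -D mapreduce.reduce.speculative=true \
        8 1000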
Hadoop MapReduce Hardware
● Server
● Storage (Hard disk)
● Processing (Processor)
● Old and new hardware?