SCHEDULERS OPTIMIZATION TO HANDLE MULTIPLE JOBS IN HADOOP CLUSTER
By Shivaraj B G, 4th Sem, Department of Computer Networking, MITE.

Upload: shivraj-raj
Posted on: 13-Apr-2017
Category: Engineering

TRANSCRIPT

Page 1: Schedulers optimization to handle multiple jobs in hadoop cluster

SCHEDULERS OPTIMIZATION TO HANDLE MULTIPLE JOBS IN HADOOP CLUSTER

By Shivaraj B G, 4th Sem, Department of Computer Networking, MITE.

Page 2: Schedulers optimization to handle multiple jobs in hadoop cluster

Introduction

Big Data: Information is becoming not only more accessible but also more comprehensible to computers.

A large portion of the world's flow of data is unstructured content such as text, pictures and video on the Web, along with streams of sensor data.

This is called unstructured data, and it is not readily handled by conventional databases.

Big data must first be acquired, then processed and analyzed.

Page 3: Schedulers optimization to handle multiple jobs in hadoop cluster

Characteristics of Big Data

Volume: the sheer amount of data.

Velocity: the speed at which data arrives.

Variety: unstructured and semi-structured data types.

Value: data has inherent value, but that value must be discovered.

Page 4: Schedulers optimization to handle multiple jobs in hadoop cluster

Hadoop: an open-source framework of tools, libraries and techniques for 'big data' analysis.

Hadoop combines two essential components: the Hadoop Distributed File System (HDFS), which provides storage and management, and MapReduce, which provides processing capability.

Why Hadoop? It is an adaptable hardware-and-software framework with built-in, application-level fault tolerance across clusters of computers.

Page 5: Schedulers optimization to handle multiple jobs in hadoop cluster

Why use Hadoop?
Hadoop is reliable and simple, and is used to store extensive data with virtually unlimited storage capacity in an adaptable framework.
It is a 100% open-source package.
Quick recovery from system failures.
Ability to process large amounts of data quickly and in parallel.
Data written once to HDFS can be read many times.

Page 6: Schedulers optimization to handle multiple jobs in hadoop cluster

Contents of Hadoop

1. Hadoop Distributed File System (HDFS): HDFS follows a master/slave architecture; the client first interacts with the master via the NameNode and Secondary NameNode.

Every slave runs both a DataNode and a TaskTracker daemon, which communicate with and receive instructions from the master node.

The TaskTracker daemon is the slave to the JobTracker, and the DataNode daemon is the slave to the NameNode.

Page 7: Schedulers optimization to handle multiple jobs in hadoop cluster

2. MapReduce: a simple programming model built around key-value pairs.
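The key-value flow described above can be sketched in plain Python (word count as the illustrative job; this is a minimal teaching sketch, not the Hadoop API):

```python
# Minimal sketch of MapReduce's key-value flow, using word count.
from collections import defaultdict

def map_phase(lines):
    for line in lines:
        for word in line.split():
            yield (word, 1)            # map emits key-value pairs

def shuffle(pairs):
    groups = defaultdict(list)
    for key, value in pairs:           # shuffle groups values by key
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    return {key: sum(vals) for key, vals in groups.items()}  # reduce per key

counts = reduce_phase(shuffle(map_phase(["hadoop pig hive", "hadoop hive"])))
print(counts)   # {'hadoop': 2, 'pig': 1, 'hive': 2}
```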

Page 8: Schedulers optimization to handle multiple jobs in hadoop cluster

Scheduling in Hadoop

Scheduling is the policy used to determine when a job executes its tasks. Nodes communicate with each other, and across clusters, over the TCP/IP suite through the distributed file system.

Scheduling must account for factors such as processing time and the communication cost of data transmission, along with available bandwidth.

The need for scheduling arises from multiplexing and multitasking.

A fresh Hadoop setup uses only the First-In First-Out (FIFO) scheduler.

Page 9: Schedulers optimization to handle multiple jobs in hadoop cluster

Pig: a scripting language intended to handle any sort of data set, including unstructured data.

Pig comprises two parts: the first is the language itself, called Pig Latin; the second is the Pig runtime environment in which Pig Latin programs are executed.

Pig has two execution modes: local mode (pig -x local) and MapReduce mode (pig, or pig -x mapreduce), providing an ad-hoc way of creating and executing MapReduce jobs.

Pig Latin: a high-level scripting language; requires no metadata or schema; statements are translated into a series of MapReduce jobs.
Grunt: an interactive shell.
Piggybank: a shared repository for User Defined Functions (UDFs).

Page 10: Schedulers optimization to handle multiple jobs in hadoop cluster

Hive: a data warehouse system for Hadoop.

Provides tools for effortless extract/transform/load (ETL) of records stored directly in HDFS or in other storage systems such as HBase.

Contains metadata that describes the data held in HDFS and HBase, not the data itself.

Uses a straightforward SQL-like query language called HiveQL.

Queries are executed through MapReduce.

Page 11: Schedulers optimization to handle multiple jobs in hadoop cluster

With the aim of optimizing schedulers, the framework allows jobs to complete in a timely manner while letting users who submit queries get results back in a reasonable time. Users thus have more freedom to adopt the most appropriate scheduler, or other techniques, according to their requirements.

Page 12: Schedulers optimization to handle multiple jobs in hadoop cluster

Problem Statement

The First-In First-Out (FIFO) scheduler allows one job to take all task slots within the cluster, i.e., no other job can utilize the cluster until the current one completes. Consequently, jobs that arrive later or with a lower priority are blocked by those ahead in the queue. When the total number of jobs is large, the FIFO scheduler causes significant delay.
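The blocking effect can be illustrated with a tiny Python sketch (hypothetical job lengths and a single task slot, purely for intuition):

```python
# Sketch of FIFO delay: each job waits for every job queued ahead of it.
jobs = [100, 2, 2, 2]      # slot-seconds per job; one long job arrives first
waits, clock = [], 0
for length in jobs:
    waits.append(clock)    # FIFO wait = total length of all earlier jobs
    clock += length
avg_wait = sum(waits) / len(jobs)
print(avg_wait)            # 76.5: three 2-second jobs wait behind the 100s job
```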

 

Page 13: Schedulers optimization to handle multiple jobs in hadoop cluster

Proposed System

The proposed system is a novel framework for the optimization approach.

1. Fair Scheduler

The Fair scheduler starts execution whenever slots are available; during execution of tasks, smaller jobs are assigned to free slots.

Merits of the Fair Scheduler:
Even when the cluster is shared with large jobs, it allows small jobs to run quickly.
Unlike the default FIFO scheduler, fair scheduling chooses a job from the queue to run a task whenever a slot becomes free, without starving either large or small jobs.
It provides service for multiple pools with guaranteed levels of job execution in a shared cluster, and is simple to configure and administer.
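The "small jobs finish quickly" property can be sketched in Python with a round-robin slot (hypothetical task counts; the real Fair scheduler uses pools and weights, this only shows the interleaving idea):

```python
# Sketch of fair sharing: the free slot rotates among active jobs,
# so a small job finishes quickly even beside a large one.
from collections import deque

def finish_times(jobs):
    """Round-robin one task slot among active jobs; return completion times."""
    remaining = dict(jobs)
    done, clock = {}, 0
    queue = deque(remaining)
    while queue:
        name = queue.popleft()
        remaining[name] -= 1          # run one task of this job
        clock += 1
        if remaining[name] == 0:
            done[name] = clock
        else:
            queue.append(name)        # job still has tasks: back of the line
    return done

times = finish_times([("large", 100), ("small", 2)])
print(times["small"], times["large"])   # 4 102: the small job is not starved
```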

Page 14: Schedulers optimization to handle multiple jobs in hadoop cluster

2. Capacity Scheduler

The central concept of the Capacity Scheduler is scheduling queues.

It is designed for large clusters, providing minimum-capacity guarantees while the cluster is partitioned among multiple users.

Merits of the Capacity Scheduler:
Capacity guarantees: multiple queues support simultaneous execution of jobs as submitted by users.
Elasticity: jobs can be assigned resources beyond their queue's capacity when resources are freely available, which improves overall cluster utilization.
Multi-tenancy: system resources are provided to user-created queues, which link with the JobTracker to execute jobs.
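The guarantee-plus-elasticity idea can be sketched as follows (hypothetical slot counts and queue shares; this is an illustrative allocation sketch, not the actual CapacityScheduler algorithm):

```python
# Sketch of capacity-style allocation: each queue gets its guaranteed
# share first, then idle capacity flows to queues with unmet demand.
def allocate(total_slots, guarantees, demand):
    # guaranteed share, capped by what the queue actually asks for
    alloc = {q: min(demand[q], round(total_slots * g))
             for q, g in guarantees.items()}
    spare = total_slots - sum(alloc.values())
    for q in alloc:   # elasticity: spare slots go to still-hungry queues
        extra = min(spare, demand[q] - alloc[q])
        alloc[q] += extra
        spare -= extra
    return alloc

# queueA is busy (80 tasks) while queueB uses only 10 of its 40 guaranteed
# slots, so queueA borrows queueB's idle capacity.
print(allocate(100, {"queueA": 0.6, "queueB": 0.4},
               {"queueA": 80, "queueB": 10}))
```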

Page 15: Schedulers optimization to handle multiple jobs in hadoop cluster

Objective

To parallelize job execution across standalone mode or a cluster.
To estimate the processing time of parallel applications, both small and long-running jobs, by distributing tasks to the schedulers.

Page 16: Schedulers optimization to handle multiple jobs in hadoop cluster

Methodology

The HDFS framework works on block size. A general parallel file system supports block sizes of 16 KB to 4 MB, with a default block size of 256 KB, whereas HDFS uses a default block size of 64 MB.

Page 17: Schedulers optimization to handle multiple jobs in hadoop cluster
Page 18: Schedulers optimization to handle multiple jobs in hadoop cluster
Page 19: Schedulers optimization to handle multiple jobs in hadoop cluster

Optimization Techniques

1. MapCombineReduce
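The MapCombineReduce idea can be sketched in Python: a combiner pre-aggregates map output on the mapper node, shrinking what is shuffled to the reducers (word count as the illustrative workload; this is a sketch, not the Hadoop API):

```python
# Sketch of a combiner: local aggregation of map output before the shuffle.
from collections import Counter

def mapper(line):
    return [(word, 1) for word in line.split()]

def combiner(pairs):
    local = Counter()
    for key, value in pairs:       # aggregate on the mapper node
        local[key] += value
    return list(local.items())

pairs = mapper("to be or not to be")
combined = combiner(pairs)
print(len(pairs), len(combined))   # 6 4: six pairs shrink to four
```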

Page 20: Schedulers optimization to handle multiple jobs in hadoop cluster

2. Distinctive Block Size

dfs.block.size: file system block size, default 67108864 bytes (64 MB).

E.g., if input data size = 1 GB and dfs.block.size = 64 MB, then the minimum number of maps is (1*1024)/64 = 16 maps.

E.g., if input data size = 1 GB and dfs.block.size = 128 MB (134217728 bytes), then the minimum number of maps is (1*1024)/128 = 8 maps.
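The map-count arithmetic above fits in a tiny helper (minimum map count only; actual splits also depend on the input format and file boundaries):

```python
# Minimum number of map tasks for a given input and block size.
import math

def min_maps(input_mb, block_mb):
    return math.ceil(input_mb / block_mb)

print(min_maps(1024, 64))    # 16 maps with a 64 MB block size
print(min_maps(1024, 128))   # 8 maps with a 128 MB block size
```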

File Size | Time for Moving Data to HDFS (64 MB / 128 MB) | Total Number of Blocks Created (64 MB / 128 MB)
1.3 GB    | 0.48 sec / 0.38 sec | 21 / 11
2.7 GB    | 1.38 sec / 1.17 sec | 41 / 21
4.0 GB    | 2.32 sec / 1.58 sec | 61 / 31

Page 21: Schedulers optimization to handle multiple jobs in hadoop cluster

S/W and H/W Specification

Software Specification:
Operating System: Ubuntu 12.04 LTS.
Java (JDK 6.1), required to run *.jar files.
Hadoop version: hadoop-1.0.3.tar.gz (stable release).
Pig: apache-pig-0.14.0.tar.gz.
Hive: apache-hive-0.13.1-bin.tar.gz.

Hardware Specification:
Main processor: P4, 2 GHz or higher
Secondary memory: 200 GB
Primary memory: 4 GB or higher

Page 22: Schedulers optimization to handle multiple jobs in hadoop cluster

Results

Starting the Hadoop multi-node cluster:
$ cd /usr/local/hadoop/bin
$ ./start-all.sh

Page 23: Schedulers optimization to handle multiple jobs in hadoop cluster

Displaying the master node and available cluster summary: using a web browser, open the master URL, such as localhost:50070 or master:50070.

Page 24: Schedulers optimization to handle multiple jobs in hadoop cluster

Listing Total Number of Files in HDFS.

$ bin/hadoop dfs -ls

Page 25: Schedulers optimization to handle multiple jobs in hadoop cluster

Input file of weblogs Stored in HDFS

Page 26: Schedulers optimization to handle multiple jobs in hadoop cluster

Movie dataset Input File Stored in HDFS

Page 27: Schedulers optimization to handle multiple jobs in hadoop cluster

Job Queue Scheduling Information with Default Scheduler

hduser@ubuntu:/usr/local/hadoop$ bin/hadoop jar WeblogHitsByLinkProcessor.jar WeblogHitsByLinkProcessor /user/hduser/nasa_input /user/hduser/weblog_output/

hduser@ubuntu:/usr/local/hadoop$ bin/hadoop jar WeblogMessagesizevsHitsProcessor.jar WeblogMessagesizevsHitsProcessor /user/hduser/nasa_input /user/hduser/weblogs_output1/

hduser@ubuntu:/usr/local/hadoop$ bin/hadoop jar WeblogTimeOfDayHistogramCreator.jar WeblogTimeOfDayHistogramCreator /user/hduser/nasa_input /user/hduser/welogg_output2/

Page 28: Schedulers optimization to handle multiple jobs in hadoop cluster

Scheduling Information with the Fair Scheduler. Using a web browser, open the Hadoop MapReduce URL, such as localhost:50030.

$ time bin/hadoop jar WeblogHits.jar WeblogHits /user/hduser/heterogeneous/inputs/weblog /user/hduser/heterogeneous/fair/ouputs/weblog_output

$ time bin/hadoop jar WeblogMessagesizevsHitsProcessor.jar WeblogTimeOfDayHistogramCreator /user/hduser/heterogeneous/inputs/weblog /user/hduser/heterogeneous/fair/ouputs/weblogs_output1

$ time bin/hadoop jar WeblogTimeOfDayHistogramCreator.jar WeblogTimeOfDayHistogramCreator /user/hduser/heterogeneous/inputs/weblog /user/hduser/heterogeneous/fair/ouputs/weblogg_output2

Page 29: Schedulers optimization to handle multiple jobs in hadoop cluster
Page 30: Schedulers optimization to handle multiple jobs in hadoop cluster

Job Summary for QueueA using Capacity Scheduler

$ time bin/hadoop jar Weblog.jar Weblog -Dmapred.job.queue.name=queueA user/hduser/heterogeneous/inputs/weblog /user/hduser/heterogeneous/capacity/ouputs/weblog_output

Page 31: Schedulers optimization to handle multiple jobs in hadoop cluster

Job Summary for QueueB using Capacity Scheduler.

$ time bin/hadoop jar WeblogMessagesizevsHitsProcessor.jar WeblogTimeOfDayHistogramCreator -Dmapred.job.queue.name=queueB /user/hduser/heterogeneous/inputs/weblogtime /user/hduser/heterogeneous/capacity/ouputs/weblogs_output1

Page 32: Schedulers optimization to handle multiple jobs in hadoop cluster

Output for Web-Logs with Total Number of Hits on Particular Links.

Page 33: Schedulers optimization to handle multiple jobs in hadoop cluster

Output for Web-Logs with Total Number of Messages and their Sizes.

Page 34: Schedulers optimization to handle multiple jobs in hadoop cluster


Page 35: Schedulers optimization to handle multiple jobs in hadoop cluster

Output for Web-Logs with Time of Day (0-23 Hours) versus Number of Users.

Page 36: Schedulers optimization to handle multiple jobs in hadoop cluster

[Chart: time of day (x-axis, hours) vs. number of users (y-axis)]

Page 37: Schedulers optimization to handle multiple jobs in hadoop cluster

Performance Analysis of Proposed Hadoop System using Homogeneous Hadoop Jobs with Single-Node Cluster.

Chart values (Proposed Hadoop System):
Fair 64 MB: 3.7
Fair 128 MB: 2.92
Capacity 64 MB: 3.6
Capacity 128 MB: 3.25

Page 38: Schedulers optimization to handle multiple jobs in hadoop cluster

Performance Analysis of Proposed Hadoop System using Heterogeneous Hadoop Jobs with Single-Node Cluster.

Chart values (Proposed Hadoop System):
Fair 64 MB: 26.11
Fair 128 MB: 19.04
Capacity 64 MB: 29.96
Capacity 128 MB: 21.24

Page 39: Schedulers optimization to handle multiple jobs in hadoop cluster

Performance Analysis of Proposed Hadoop System using Homogeneous Hadoop Jobs with Multi-Node Cluster.

Chart values (Proposed Hadoop System):
Fair 64 MB: 1.54
Fair 128 MB: 1.5
Capacity 64 MB: 3.79
Capacity 128 MB: 1.66

Page 40: Schedulers optimization to handle multiple jobs in hadoop cluster

Performance Analysis of Proposed Hadoop System using Heterogeneous Hadoop Jobs with Multi-Node Cluster.

Chart values (Proposed Hadoop System):
Fair 64 MB: 7.38
Fair 128 MB: 6.38
Capacity 64 MB: 13.47
Capacity 128 MB: 12.26

Page 41: Schedulers optimization to handle multiple jobs in hadoop cluster

Results for Movie Dataset using Pig

Running Pig in local mode: pig -x local

movies = LOAD '/Users/Rich/Documents/Courses/Fall2014/BigData/Pig/movies_data.csv' USING PigStorage(',') AS (id,name,year,rating,duration);
DUMP movies;

Page 42: Schedulers optimization to handle multiple jobs in hadoop cluster

Filter: List the movies that were released between 1950 and 1960:

movies_1950_1960 = FILTER movies BY (float)year > 1949 AND (float)year < 1961;
STORE movies_1950_1960 INTO '/Users/Rich/Desktop/Demo/movies_1950_1960';

Page 43: Schedulers optimization to handle multiple jobs in hadoop cluster

Foreach/Generate: List movie names and their duration in minutes (duration is stored in seconds, so divide by 60):

movies_name_duration = FOREACH movies GENERATE name, (float)duration/60;
STORE movies_name_duration INTO '/Users/Rich/Desktop/Demo/movies_name_duration';

Page 44: Schedulers optimization to handle multiple jobs in hadoop cluster

Order: List all movies in descending order of year:

movies_year_sort = ORDER movies BY year DESC;
STORE movies_year_sort INTO '/Users/Rich/Desktop/Demo/movies_year_sort';

Page 45: Schedulers optimization to handle multiple jobs in hadoop cluster

Results for Movie Dataset using Hive

Displaying the contents of the movie dataset:
hive> SELECT * FROM movies;

Page 46: Schedulers optimization to handle multiple jobs in hadoop cluster

Result for a Particular Year of Movies Released:

hive> SELECT * FROM movies WHERE year = 1995;

Page 47: Schedulers optimization to handle multiple jobs in hadoop cluster

Finding all movies based on a specified movie length:

hive> SELECT * FROM movies WHERE length > 3000;

Page 48: Schedulers optimization to handle multiple jobs in hadoop cluster

Retrieving information using the GROUP BY clause:

hive> SELECT COUNT(1) FROM movies GROUP BY year;

Page 49: Schedulers optimization to handle multiple jobs in hadoop cluster

Results for heterogeneous datasets using Hadoop, Pig and Hive

             | Hadoop | Pig  | Hive
Word Count   | 2.36   | 1.58 | 1.52
Movie Rating | 1.53   | 1.45 | 1.57

Page 50: Schedulers optimization to handle multiple jobs in hadoop cluster

Conclusion

This effort is intended to give a high-level summary of what Big Data is and how to address the issues arising from the four V's, storing data in HDFS with various configuration parameters and setting up Hadoop, Pig and Hive to retrieve useful information from bulky data sets.

Page 51: Schedulers optimization to handle multiple jobs in hadoop cluster

Thank you