IJSRD - International Journal for Scientific Research & Development| Vol. 3, Issue 03, 2015 | ISSN (online): 2321-0613
All rights reserved by www.ijsrd.com 3571
Big Data and Hadoop: Improving MapReduce Performance using
Clustering Algorithm
Krunal Dave1 Jignesh Vania2
1P.G. Student 2Assistant Professor
1,2Department of Information Technology, LJIET, Ahmedabad, Gujarat, India
Abstract— "Big Data" is a popular term used to describe the exponential growth and availability of structured, semi-structured, and unstructured data that has the potential to be mined for information. Data mining involves knowledge discovery from these large data sets. Hadoop is the core platform for storing large volumes of data in the Hadoop Distributed File System (HDFS), where the data is processed in parallel by the MapReduce model. Hadoop is designed to scale up from a single server to thousands of machines, with a very high degree of fault tolerance. In this paper, we implement the k-means algorithm on single-node and multi-node Hadoop clusters and measure performance by the time taken to execute the MapReduce job.
Key words: Big data, Clustering, Hadoop, k-means,
MapReduce, HDFS
I. INTRODUCTION
Over the last few years, the data generated and stored in the world has increased exponentially. Although this vast amount of data can be very useful to people and corporations, storing it in traditional databases is problematic and difficult. Big data is data that is too big or too hard for existing systems and algorithms to handle.
Hadoop is a Java-based programming framework that supports the processing of large data sets in a distributed computing environment; it is part of the Apache project sponsored by the Apache Software Foundation. Hadoop is a combination of MapReduce and the Hadoop Distributed File System (HDFS): MapReduce processes the data, and HDFS stores it in the file system.
This paper is organized as follows: Section II gives basic information about big data, its characteristics, and some problems with big data processing. Section III explains Hadoop, its architecture, and its components. Section IV explains clustering. Section V presents the proposed solution, Section VI the implementation strategy, and Section VII the implementation results. Section VIII concludes the paper.
II. BIG DATA
"Big Data" is a very popular term used to describe the voluminous data coming from heterogeneous and independent sources that has the potential to be mined for information.
We can describe the characteristics of big data using the following three Vs [9]: Volume, Variety, Velocity.
Fig. 1: Characteristics of Big Data [10]
III. HADOOP
Apache Hadoop is an open source software project written
in java that enables the distributed processing of large data
sets across various clusters of different commodity
hardware. It is mainly designed to scale up from a single
computer machine to the thousands of different machines,
with a very high degree of fault tolerance. Hadoop give
automatic parallelization and Data partitioning, Scheduling
of task, Machine failures handling and managing inter-
machine communication.
Fig. 2: MapReduce Execution Flow
[4]
Users use the MapReduce framework by submitting a job, an instance of a MapReduce application, to the JobTracker. The job is divided into map tasks (also called mappers) and reduce tasks (also called reducers), and each task is executed in an available slot on a worker node. Once all map tasks have finished and the output of the last map tasks has been copied, the reduce tasks move from the shuffle phase into the reduce phase. In the reduce phase, the reduce function is called to process the intermediate data and write the final output.
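The map → shuffle → reduce flow described above can be illustrated with a small in-memory simulation (plain Python standing in for the Java framework; the word-count job used here is a hypothetical example, not the paper's algorithm):

```python
from collections import defaultdict

def map_phase(records, map_fn):
    """Run the map function over every input record, emitting (key, value) pairs."""
    pairs = []
    for record in records:
        pairs.extend(map_fn(record))
    return pairs

def shuffle_phase(pairs):
    """Group intermediate values by key, as the framework does between map and reduce."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups, reduce_fn):
    """Apply the reduce function to each key and its grouped list of values."""
    return {key: reduce_fn(key, values) for key, values in groups.items()}

# Hypothetical word-count job: the mapper emits (word, 1), the reducer sums counts.
lines = ["big data", "big hadoop"]
pairs = map_phase(lines, lambda line: [(w, 1) for w in line.split()])
counts = reduce_phase(shuffle_phase(pairs), lambda k, vs: sum(vs))
print(counts)  # {'big': 2, 'data': 1, 'hadoop': 1}
```

In the real framework the three phases run on different worker nodes, and the shuffle step moves intermediate pairs across the network, which is exactly the cost the proposed solution later tries to reduce.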
IV. CLUSTERING
A cluster is a collection of objects that are "similar" to each other and "dissimilar" to objects belonging to other clusters. In data mining, clustering has several requirements: scalability, the ability to deal with different kinds of attributes and with noisy data, discovery of clusters of arbitrary shape, interpretability, and handling of high dimensionality. Clustering methods can be classified as follows: hierarchical, density-based, partition-based, grid-based, model-based, and constraint-based methods.
V. PROPOSED SOLUTION
MapReduce suffers from inter-process data dependency, which increases the time taken to execute a MapReduce job. If we can reduce the inter-process data dependency in a MapReduce program, it will save a good amount of time.
Fig. 3: Block diagram of proposed system
The proposed model uses a clustering algorithm within MapReduce. First the input file is loaded into MapReduce; then a clustering algorithm is applied within the MapReduce framework to cluster the input data, producing a number of clusters that each hold similar data, which are then stored. Because of this clustering, the sort-and-shuffle process does not have to look for a similar key beyond, or far from, a record's own block. This reduces the inter-process data dependency in the MapReduce program and saves a good amount of time.
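The pre-clustering idea above can be sketched in a few lines (illustrative only; the block assignment by key hash is an assumption standing in for the paper's clustering step):

```python
# Sketch of the proposed pre-clustering step: records are grouped into blocks
# of similar keys before the MapReduce job runs, so the shuffle phase never
# has to fetch a key's values from a distant block.
def pre_cluster(records, key_fn, num_blocks):
    """Assign each record to a block so that records sharing a key share a block."""
    blocks = [[] for _ in range(num_blocks)]
    for record in records:
        # Same key -> same hash -> same block, within a single run.
        blocks[hash(key_fn(record)) % num_blocks].append(record)
    return blocks

records = [("apple", 3), ("banana", 1), ("apple", 2), ("banana", 4)]
blocks = pre_cluster(records, key_fn=lambda r: r[0], num_blocks=2)

# Every occurrence of a given key now sits in exactly one block, so each block
# can be mapped and reduced independently, without cross-block data dependency.
for block in blocks:
    print({r[0] for r in block})
```

This is the property the proposed system relies on: once similar keys are co-located, the sort-and-shuffle phase becomes a local operation per block.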
A. Basic K-Means Algorithm using MapReduce:
- Input: k, the number of clusters; D, a dataset of n data points.
- Output: a set of k clusters.
B. Steps:
1) Iteratively improve the partitioning of the data into k clusters.
2) Mapper Phase:
- The input to the map function is a <key, value> pair consisting of the cluster centers and a data point.
- The map function finds the nearest of the k centers for the input point.
3) Reduce Phase:
- The input of the reduce function is the output of the map function: each data point paired with the one center, of the k centers, nearest to it.
- The reduce phase calculates a new center from the data points.
- The reducer generates the final output: the cluster mean, as the cluster id, together with all data points in that cluster.
4) Repeat until the centers do not change.
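The mapper/reducer steps above can be sketched as a small simulation (plain Python in place of the Hadoop framework; the 1-D data points and starting centers are made up for illustration):

```python
def kmeans_map(point, centers):
    """Mapper: emit (index of nearest center, point) for one input point."""
    nearest = min(range(len(centers)), key=lambda i: abs(point - centers[i]))
    return (nearest, point)

def kmeans_reduce(points):
    """Reducer: recompute a cluster center as the mean of its assigned points."""
    return sum(points) / len(points)

def kmeans(data, centers, max_iters=100):
    """Iterate the map and reduce phases until the centers stop changing."""
    for _ in range(max_iters):
        groups = {}
        for point in data:                       # map phase
            idx, p = kmeans_map(point, centers)
            groups.setdefault(idx, []).append(p)  # shuffle: group by center id
        new_centers = [kmeans_reduce(groups[i]) if i in groups else c
                       for i, c in enumerate(centers)]  # reduce phase
        if new_centers == centers:               # step 4: centers unchanged
            break
        centers = new_centers
    return centers

# Hypothetical 1-D dataset with two obvious clusters.
data = [1.0, 2.0, 3.0, 10.0, 11.0, 12.0]
print(kmeans(data, centers=[1.0, 10.0]))  # [2.0, 11.0]
```

In the Hadoop version each iteration is a separate MapReduce job, with the updated centers redistributed to the mappers between jobs.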
VI. IMPLEMENTATION
Here, we have implemented Hadoop MapReduce on the Ubuntu operating system. We used text files with different numbers of records as our datasets, and tested the performance of the MapReduce job on each of these files.
A. Single Node Hadoop Cluster
Here, we have created a single-node Hadoop cluster on a single computer running Ubuntu.
These are the implementation steps.
1) Step 1: Create the single-node Hadoop cluster [26].
- Add a dedicated system user in Ubuntu.
- Download and install Java.
- Configure SSH for passwordless login.
- Disable IPv6.
- Download and install Hadoop.
- Configure the directory where Hadoop will store its data files. In this step you have to edit several files in the conf directory.
- Format the HDFS file system via the NameNode.
- Start your single-node cluster.
2) Step 2: Perform the clustering operation and load the input file into Hadoop.
3) Step 3: Run the MapReduce job.
B. Multi Node Hadoop Cluster
Here, we have created a multi-node cluster using two computers running Ubuntu. To create a multi-node Hadoop cluster, follow the steps described above to build a single-node Hadoop cluster on each of the two Ubuntu machines. Use the same settings (paths and installation locations) on both machines, then connect and merge the two machines.
These are the implementation steps.
1) Step 1: Create the multi-node Hadoop cluster [27].
- Configure the network so that the master and slave machines can identify each other.
- Configure SSH access.
- Configure the directory where Hadoop will store its data files. In this step you have to modify several files in the conf directory.
- Format the HDFS file system via the NameNode.
- Start your multi-node cluster.
2) Step 2: Perform the clustering operation and load the input file into Hadoop.
3) Step 3: Run the MapReduce job.
VII. IMPLEMENTATION RESULTS
We run the MapReduce job on our dataset. First we process the text file with a normal clustering algorithm; then we implement the clustering algorithm in MapReduce, take the same file as input, and process it. We compare the two approaches by execution time: the normal clustered run versus the MapReduce-with-clustering run.
Total Inputs | Clustering (Normal) | Clustering (MapReduce, Single Node) | Clustering (MapReduce, Multi Node)
Input 1      | 23                  | 22                                  | 16
Input 2      | 46                  | 34                                  | 22
Execution time in seconds.
Table 1: Time Comparison in Single and Multi Node Hadoop Cluster
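The reported numbers in Table 1 can be turned into speedups directly (a quick arithmetic check of the table, nothing more):

```python
# Execution times (seconds) copied from Table 1.
times = {
    "Input 1": {"normal": 23, "single_node": 22, "multi_node": 16},
    "Input 2": {"normal": 46, "single_node": 34, "multi_node": 22},
}

# Speedup of the multi-node MapReduce run over the normal clustering run.
for name, t in times.items():
    speedup = t["normal"] / t["multi_node"]
    print(f"{name}: multi-node speedup = {speedup:.2f}x")
# Input 1: 23/16 = 1.44x; Input 2: 46/22 = 2.09x
```

The larger input shows the larger speedup, which is consistent with the claim that reducing shuffle-phase data dependency matters more as the dataset grows.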
Fig. 4: Execution Time Comparison in Single Node and Multi Node Hadoop cluster
This graph shows the execution-time comparison of the normal clustering run and the MapReduce clustering run on single-node and multi-node Hadoop clusters.
VIII. CONCLUSION
In this paper, we have given an overview of big data and its characteristics (volume, variety, velocity), Hadoop, MapReduce, and clustering algorithms. We implemented a clustering algorithm within the MapReduce operation, which reduces the inter-process data dependency in the shuffle-and-sort phase of MapReduce. This reduces the time complexity and increases MapReduce performance.
REFERENCES
[1] Sachchidanand Singh and Nirmala Singh, "Big Data Analytics", 2013 International Conference on Communication, Information & Computing Technology (ICCICT), IEEE, 2013.
[2] Stephen Kaisler, Frank Armour, J. Alberto Espinosa, and William Money, "Big Data: Issues and Challenges Moving Forward", 46th Hawaii International Conference on System Sciences, pp. 995-1004, IEEE, 2013.
[3] Wei Fan and Albert Bifet, "Mining Big Data: Current Status, and Forecast to the Future", SIGKDD Explorations, Volume 14, Issue 2.
[4] Jeffrey Dean and Sanjay Ghemawat, "MapReduce: Simplified Data Processing on Large Clusters", 6th Symposium on Operating Systems Design and Implementation, USENIX Association, 2008, pp. 137-150.
[5] Archana, "Study and Comparison of Partition Based and Hierarchical Clustering", IJARCSSE, 2014.
[6] Gopi Gandhi and Rohit Srivastava, "A Comparative Study on Partitioning Techniques of Clustering Algorithms", International Journal of Computer Applications, 2014.
[7] Jiong Xie, FanJun Meng, HaiLong Wang, JinHong Cheng, Hongfang Pan, and Xiao Qin, "Adaptive Preshuffling in Hadoop Clusters", International Journal of Grid and Distributed Computing, 2013.
[8] Prajesh P Anchalia, Anjan K Koundinya, and Shrinath N K, "MapReduce Design of K-Means Clustering Algorithm", IEEE, 2013.
[9] Dagli Mikin K. and Brijesh B. Mehta, "Big Data and Hadoop: A Review", IJARES, Volume 2, Issue 2, February 2014, pp. 192-196.
[10] Seref Sagiroglu and Duygu Sinanc, "Big Data: A Review", 2013 International Conference on Collaboration Technologies and Systems (CTS), IEEE, 2013.
[11] Jiawei Han and Micheline Kamber, Data Mining: Concepts and Techniques, Morgan Kaufmann Publishers, 2007.
[12] Kapil Bakshi, "Considerations for Big Data: Architecture and Approach", IEEE, 2012.
[13] V. Borkar, M. J. Carey, and C. Li, "Inside Big Data Management: Ogres, Onions, or Parfaits?", EDBT/ICDT 2012 Joint Conference, Berlin, Germany, 2012.
[14] Madhavi Vaidya, "Parallel Processing of Cluster by Map Reduce", International Journal of Distributed and Parallel Systems (IJDPS), Vol. 3, No. 1, January 2012.
[15] http://hadoop.apache.org/, accessed 18/10/2014.
[16] http://developer.yahoo.com/hadoop/tutorial, accessed 20/10/2014.