IJSRD - International Journal for Scientific Research & Development| Vol. 3, Issue 03, 2015 | ISSN (online): 2321-0613
All rights reserved by www.ijsrd.com 3571
Big Data and Hadoop: Improving MapReduce Performance using
Clustering Algorithm
Krunal Dave1 Jignesh Vania2
1P.G. Student 2Assistant Professor
1,2Department of Information Technology, LJIET, Ahmedabad, Gujarat, India
Abstract— "Big Data" is a popular term used to describe the exponential growth and availability of structured, semi-structured, and unstructured data that has the potential to be mined for information. Data mining involves knowledge discovery from these large data sets. Hadoop is the core platform for storing large volumes of data in the Hadoop Distributed File System (HDFS), where the data is processed in parallel by the MapReduce model. Hadoop is designed to scale up from a single server to thousands of machines, with a very high degree of fault tolerance. In this paper, we implement the k-means algorithm on single-node and multi-node Hadoop clusters and measure performance by the time taken to execute the MapReduce job.
Key words: Big data, Clustering, Hadoop, k-means,
MapReduce, HDFS
I. INTRODUCTION
Over the last few years, the data generated and stored in the world has increased exponentially. Although this vast amount of data can be very useful to people and corporations, storing it in traditional databases is problematic and difficult. Big data is data that is too big or too hard for existing systems and algorithms to handle.
Hadoop is a Java-based programming framework that supports the processing of large data sets in a distributed computing environment; it is part of the Apache project sponsored by the Apache Software Foundation. Hadoop is a combination of MapReduce and the Hadoop Distributed File System (HDFS): MapReduce processes the data, and HDFS stores it in the file system.
This paper is organized as follows: Section II gives basic information about big data, its characteristics, and some problems with big data processing. Section III explains Hadoop, its architecture, and its components. Section IV explains clustering. Section V presents the proposed solution, Section VI the implementation strategy, and Section VII the implementation results. Section VIII concludes the paper.
II. BIG DATA
"Big Data" is a very popular term used to describe the voluminous data coming from heterogeneous and independent sources that has the potential to be mined for information.
We can describe the characteristics of big data using the following three Vs [9]: Volume, Variety, Velocity.
Fig. 1: Characteristics of Big Data [10]
III. HADOOP
Apache Hadoop is an open source software project written
in java that enables the distributed processing of large data
sets across various clusters of different commodity
hardware. It is mainly designed to scale up from a single
computer machine to the thousands of different machines,
with a very high degree of fault tolerance. Hadoop give
automatic parallelization and Data partitioning, Scheduling
of task, Machine failures handling and managing inter-
machine communication.
Fig. 2: MapReduce Execution Flow
[4]
Users use the MapReduce framework by submitting a job, an instance of a MapReduce application, to the JobTracker. The job is divided into map tasks (also called mappers) and reduce tasks (also called reducers), and each task is executed in an available slot on a worker node. Once all map tasks have finished and the output of the last map tasks has been copied, the reduce tasks move from the shuffle phase into the reduce phase. In the reduce phase, the reduce function is called to process the intermediate data and write the final output.
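The map → shuffle → reduce flow described above can be illustrated with a small in-memory simulation (plain Python standing in for the Java framework; the word-count job used here is a hypothetical example, not the paper's algorithm):

```python
from collections import defaultdict

def map_phase(records, map_fn):
    """Run the map function over every input record, emitting (key, value) pairs."""
    pairs = []
    for record in records:
        pairs.extend(map_fn(record))
    return pairs

def shuffle_phase(pairs):
    """Group intermediate values by key, as the framework does between map and reduce."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups, reduce_fn):
    """Apply the reduce function to each key and its grouped list of values."""
    return {key: reduce_fn(key, values) for key, values in groups.items()}

# Hypothetical word-count job: the mapper emits (word, 1), the reducer sums counts.
lines = ["big data", "big hadoop"]
pairs = map_phase(lines, lambda line: [(w, 1) for w in line.split()])
counts = reduce_phase(shuffle_phase(pairs), lambda k, vs: sum(vs))
print(counts)  # {'big': 2, 'data': 1, 'hadoop': 1}
```

In the real framework the three phases run on different worker nodes, and the shuffle step moves intermediate pairs across the network, which is exactly the cost the proposed solution later tries to reduce.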
IV. CLUSTERING
A cluster is a collection of objects that are "similar" to each other and "dissimilar" to objects belonging to other clusters. In data mining, clustering has several requirements: scalability, the ability to deal with different kinds of attributes and with noisy data, discovery of clusters of arbitrary shape, interpretability, and handling of high dimensionality. Clustering methods can be classified as follows: hierarchical, density-based, partition-based, grid-based, model-based, and constraint-based methods.
V. PROPOSED SOLUTION
MapReduce suffers from inter-process data dependency, which increases the time taken to execute a MapReduce job. If we can reduce the inter-process data dependency in a MapReduce program, it will save a good amount of time.
Fig. 3: Block diagram of proposed system
The proposed model uses a clustering algorithm within MapReduce. First the input file is loaded into MapReduce; then a clustering algorithm is applied within the MapReduce framework to cluster the input data, producing a number of clusters that each hold similar data, which are then stored. Because of this clustering, the sort-and-shuffle process does not have to look for a similar key beyond, or far from, a record's own block. This reduces the inter-process data dependency in the MapReduce program and saves a good amount of time.
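The pre-clustering idea above can be sketched in a few lines (illustrative only; the block assignment by key hash is an assumption standing in for the paper's clustering step):

```python
# Sketch of the proposed pre-clustering step: records are grouped into blocks
# of similar keys before the MapReduce job runs, so the shuffle phase never
# has to fetch a key's values from a distant block.
def pre_cluster(records, key_fn, num_blocks):
    """Assign each record to a block so that records sharing a key share a block."""
    blocks = [[] for _ in range(num_blocks)]
    for record in records:
        # Same key -> same hash -> same block, within a single run.
        blocks[hash(key_fn(record)) % num_blocks].append(record)
    return blocks

records = [("apple", 3), ("banana", 1), ("apple", 2), ("banana", 4)]
blocks = pre_cluster(records, key_fn=lambda r: r[0], num_blocks=2)

# Every occurrence of a given key now sits in exactly one block, so each block
# can be mapped and reduced independently, without cross-block data dependency.
for block in blocks:
    print({r[0] for r in block})
```

This is the property the proposed system relies on: once similar keys are co-located, the sort-and-shuffle phase becomes a local operation per block.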
A. Basic K-Means Algorithm using MapReduce:
- Input: k, the number of clusters; D, a dataset of n data points.
- Output: a set of k clusters.
B. Steps:
1) Iteratively improve the partitioning of the data into k clusters.
2) Mapper Phase:
- The input to the map function is a <key, value> pair consisting of the cluster centers and a data point.
- The map function finds the nearest of the k centers for the input point.
3) Reduce Phase:
- The input of the reduce function is the output of the map function: each data point paired with the one center, of the k centers, nearest to it.
- The reduce phase calculates a new center from the data points.
- The reducer generates the final output: the cluster mean, as the cluster id, together with all data points in that cluster.
4) Repeat until the centers do not change.
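The mapper/reducer steps above can be sketched as a small simulation (plain Python in place of the Hadoop framework; the 1-D data points and starting centers are made up for illustration):

```python
def kmeans_map(point, centers):
    """Mapper: emit (index of nearest center, point) for one input point."""
    nearest = min(range(len(centers)), key=lambda i: abs(point - centers[i]))
    return (nearest, point)

def kmeans_reduce(points):
    """Reducer: recompute a cluster center as the mean of its assigned points."""
    return sum(points) / len(points)

def kmeans(data, centers, max_iters=100):
    """Iterate the map and reduce phases until the centers stop changing."""
    for _ in range(max_iters):
        groups = {}
        for point in data:                       # map phase
            idx, p = kmeans_map(point, centers)
            groups.setdefault(idx, []).append(p)  # shuffle: group by center id
        new_centers = [kmeans_reduce(groups[i]) if i in groups else c
                       for i, c in enumerate(centers)]  # reduce phase
        if new_centers == centers:               # step 4: centers unchanged
            break
        centers = new_centers
    return centers

# Hypothetical 1-D dataset with two obvious clusters.
data = [1.0, 2.0, 3.0, 10.0, 11.0, 12.0]
print(kmeans(data, centers=[1.0, 10.0]))  # [2.0, 11.0]
```

In the Hadoop version each iteration is a separate MapReduce job, with the updated centers redistributed to the mappers between jobs.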
VI. IMPLEMENTATION
Here, we have implemented Hadoop MapReduce on the Ubuntu operating system. We used text files with different numbers of records as our datasets, and tested the performance of the MapReduce job on each of these files.
A. Single Node Hadoop Cluster
Here, we have created a single-node Hadoop cluster on a single computer running Ubuntu.
These are the implementation steps.
1) Step 1: Create the single-node Hadoop cluster [26].
- Add a dedicated system user in Ubuntu.
- Download and install Java.
- Configure SSH for passwordless login.
- Disable IPv6.
- Download and install Hadoop.
- Configure the directory where Hadoop will store its data files. In this step you have to edit several files in the conf directory.
- Format the HDFS file system via the NameNode.
- Start your single-node cluster.
2) Step 2: Perform the clustering operation and load the input file into Hadoop.
3) Step 3: Run the MapReduce job.
B. Multi Node Hadoop Cluster
Here, we have created a multi-node cluster using two computers running Ubuntu. To create a multi-node Hadoop cluster, follow the steps described above to build a single-node Hadoop cluster on each of the two Ubuntu machines. Use the same settings (paths and installation locations) on both machines, then connect and merge the two machines.
These are the implementation steps.
1) Step 1: Create the multi-node Hadoop cluster [27].
- Configure the network so that the master and slave machines can identify each other.
- Configure SSH access.
- Configure the directory where Hadoop will store its data files. In this step you have to modify several files in the conf directory.
- Format the HDFS file system via the NameNode.
- Start your multi-node cluster.
2) Step 2: Perform the clustering operation and load the input file into Hadoop.
3) Step 3: Run the MapReduce job.
VII. IMPLEMENTATION RESULTS
We run the MapReduce job on our dataset. First we process the text file with a normal clustering algorithm; then we implement the clustering algorithm in MapReduce, take the same file as input, and process it. We compare the two approaches by execution time: the normal clustered run versus the MapReduce-with-clustering run.
Total Inputs | Clustering (Normal) | Clustering (MapReduce, Single Node) | Clustering (MapReduce, Multi Node)
Input 1      | 23                  | 22                                  | 16
Input 2      | 46                  | 34                                  | 22
Execution time in seconds.
Table 1: Time Comparison in Single and Multi Node Hadoop Cluster
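The reported numbers in Table 1 can be turned into speedups directly (a quick arithmetic check of the table, nothing more):

```python
# Execution times (seconds) copied from Table 1.
times = {
    "Input 1": {"normal": 23, "single_node": 22, "multi_node": 16},
    "Input 2": {"normal": 46, "single_node": 34, "multi_node": 22},
}

# Speedup of the multi-node MapReduce run over the normal clustering run.
for name, t in times.items():
    speedup = t["normal"] / t["multi_node"]
    print(f"{name}: multi-node speedup = {speedup:.2f}x")
# Input 1: 23/16 = 1.44x; Input 2: 46/22 = 2.09x
```

The larger input shows the larger speedup, which is consistent with the claim that reducing shuffle-phase data dependency matters more as the dataset grows.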
Fig. 4: Execution Time Comparison in Single Node and Multi Node Hadoop cluster
This graph shows the execution-time comparison of the normal clustering run and the MapReduce clustering run on single-node and multi-node Hadoop clusters.
VIII. CONCLUSION
In this paper, we have given an overview of big data and its characteristics (volume, variety, velocity), Hadoop, MapReduce, and clustering algorithms. We implemented a clustering algorithm within the MapReduce operation, which reduces the inter-process data dependency in the shuffle-and-sort phase of MapReduce. This reduces the time complexity and increases MapReduce performance.
REFERENCES
[1] Sachchidanand Singh and Nirmala Singh, "Big Data Analytics", 2013 International Conference on Communication, Information & Computing Technology (ICCICT), IEEE, 2013.
[2] Stephen Kaisler, Frank Armour, J. Alberto Espinosa, and William Money, "Big Data: Issues and Challenges Moving Forward", 46th Hawaii International Conference on System Sciences, pp. 995-1004, IEEE, 2013.
[3] Wei Fan and Albert Bifet, "Mining Big Data: Current Status, and Forecast to the Future", SIGKDD Explorations, Volume 14, Issue 2.
[4] Jeffrey Dean and Sanjay Ghemawat, "MapReduce: Simplified Data Processing on Large Clusters", 6th Symposium on Operating Systems Design and Implementation, USENIX Association, 2008, pp. 137-150.
[5] Archana, "Study and Comparison of Partition Based and Hierarchical Clustering", IJARCSSE, 2014.
[6] Gopi Gandhi and Rohit Srivastava, "A Comparative Study on Partitioning Techniques of Clustering Algorithms", International Journal of Computer Applications, 2014.
[7] Jiong Xie, FanJun Meng, HaiLong Wang, JinHong Cheng, Hongfang Pan, and Xiao Qin, "Adaptive Preshuffling in Hadoop Clusters", International Journal of Grid and Distributed Computing, 2013.
[8] Prajesh P Anchalia, Anjan K Koundinya, and Shrinath N K, "MapReduce Design of K-Means Clustering Algorithm", IEEE, 2013.
[9] Dagli Mikin K. and Brijesh B. Mehta, "Big Data and Hadoop: A Review", IJARES, Volume 2, Issue 2, February 2014, pp. 192-196.
[10] Seref Sagiroglu and Duygu Sinanc, "Big Data: A Review", 2013 International Conference on Collaboration Technologies and Systems (CTS), IEEE, 2013.
[11] Jiawei Han and Micheline Kamber, Data Mining: Concepts and Techniques, Morgan Kaufmann Publishers, 2007.
[12] Kapil Bakshi, "Considerations for Big Data: Architecture and Approach", IEEE, 2012.
[13] V. Borkar, M. J. Carey, and C. Li, "Inside Big Data Management: Ogres, Onions, or Parfaits?", EDBT/ICDT 2012 Joint Conference, Berlin, Germany, 2012.
[14] Madhavi Vaidya, "Parallel Processing of Cluster by Map Reduce", International Journal of Distributed and Parallel Systems (IJDPS), Vol. 3, No. 1, January 2012.
[15] http://hadoop.apache.org/, accessed 18/10/2014.
[16] http://developer.yahoo.com/hadoop/tutorial, accessed 20/10/2014.