Comprehensive Analysis of Hadoop Ecosystem
Components: MapReduce, Pig and Hive
N. Suneetha1, Ch. Sekhar2, A. Viswanath Sharma3 and P. Sandhya4
1,2,3,4 Vignan's Institute of Information Technology, Duvvada, Visakhapatnam [email protected], [email protected]
Abstract. Big Data is a term for large-volume, complex, growing data sets with multiple, autonomous sources, generated in fields ranging from economic and business activities to public administration, and from national security to scientific research. It is an emerging technology that draws considerable attention from researchers seeking to extract value from voluminous datasets. As the velocity of data growth increases, along with the accompanying technological challenges, the organization and storage of data are of primary concern. When approaching a given big data problem, the parameters to consider are, firstly, the background of the data; then the value-chain phases, namely data generation, data acquisition, data storage and data analysis; and finally the representative applications of big data, including enterprise management, the Internet of Things, online social networks, medical applications, collective intelligence, financial service sectors and the smart grid. This paper emphasizes the performance and working of data analysis through Hadoop ecosystem components such as MapReduce, Pig and Hive.
Keywords: Big Data, MapReduce, Pig, Hive.
1 Introduction
Over the past few decades, data has grown at a large scale in various fields. According to a report from the International Data Corporation (IDC), in 2011 the overall created and copied data volume in the world was 1.8 ZB (≈ 10²¹ B), an increase of nearly nine times within five years [1]. This figure is expected to double at least every two years in the near future. Under this explosive growth of information, the term Big Data refers to large or complex data sets that cannot be handled by traditional database technologies. Big Data also brings many challenges, such as difficulties in data capture, data storage, data analysis and data visualization. Big data is characterized by the five Vs, namely volume, variety, velocity, value and veracity (Fig. 1). This 5V definition highlights the meaning and necessity of big data.
Volume:
Volume refers to the generation and collection of masses of data. Nowadays the volume, or size, of data in different areas and enterprises exceeds terabytes and petabytes. The task of Big Data is to extract valuable information from high volumes of low-density, unstructured data, that is, data of unknown value, such as Twitter data feeds, click streams on web pages and mobile apps, network traffic, data captured by sensor-enabled equipment, and many more. Big Data volume includes such features as size, scale, amount and dimension for tera- and exascale data collected from many transactions and stored in individual files or databases so as to be accessible, searchable, processable and manageable. As an example from industry, global service providers such as Google, Facebook and Twitter produce, analyze and store huge amounts of data as part of their regular production services. To deal with such large volumes of data, Big Data technologies are needed.
Variety:
Variety deals with the complexity of big data and with the information and semantic models behind these data. Big data comes from a great variety of sources and generally comes in four types: structured, semi-structured, unstructured and mixed data. Structured data enters a data warehouse already tagged and is easily sorted, whereas unstructured data is random and difficult to analyze. Unstructured and semi-structured data types, such as text, audio and video, do not conform to fixed fields; they may contain tags to separate data elements but require additional processing to derive both their meaning and the supporting metadata.
Velocity:
Velocity is the fast rate at which big data streams are generated into memory or onto disks by arrays of sensors or multiple events, and at which they need to be processed in real time, near real time, in batch, or as streams (as in the case of visualisation). Velocity matters not only for acquiring big data but also for all subsequent processing. For time-limited processes, big data should be used as it streams into the organization in order to maximize its value [4, 16]. Applications such as the Internet of Things (IoT), consumer e-commerce and mobile communications deal with large amounts of data in real time or near real time.
Value:
Value is an important feature of Big Data, defined by the added value that the collected data can bring to the intended process, activity or predictive analysis/hypothesis. The value of data depends on the events or processes it represents, which may be stochastic, probabilistic, regular or random. For example, in consumer applications the intrinsic value of data is derived using quantitative and investigative techniques, ranging from discovering a consumer preference or sentiment, to making a relevant offer by location, to identifying a piece of equipment that is about to fail. Depending on this, requirements may be imposed to collect all data, to store it for a longer period (for some possible event of interest), and so on. However, finding the value of Big Data requires new discovery processes to support more accurate and precise decisions.
Veracity:
Big Data veracity ensures that the data used are trusted, authentic and protected from
unauthorized access and modification. The data must be secured during its whole lifecycle, from collection from trusted sources to processing on trusted compute
facilities and storage on protected and trusted storage facilities. Data veracity relies
entirely on the security infrastructure deployed and available from the Big Data
infrastructure.
With this definition, characteristics of big data may be summarized as five Vs, i.e.,
Volume (great volume), Variety (various modalities), Velocity (rapid generation),
Value (huge value but very low density) and Veracity (trusted and authentic) as
shown in Fig. 1.
Fig. 1. Five V’s of Big Data
This paper is organized as follows. Section 2 presents the Hadoop ecosystem components. Section 3 describes the working of the distributed environment. Section 4 concludes the presented work.
2 Hadoop Ecosystem Components
The size of datasets is currently increasing at a rapid pace, to the tune of petabytes (PB), which makes data analysis difficult. The challenges of such analysis include capture, curation, search, sharing, storage, transfer, visualization, querying, updating and information privacy. To meet these challenges, parallel data processing software such as the Hadoop framework is required. Hadoop is an open-source framework that allows Big Data to be stored and processed in a distributed environment across clusters of computers using simple programming models. The Hadoop platform consists of two main services: one is a reliable distributed file system called the Hadoop Distributed File System (HDFS), and the other is a high-performance parallel data processing engine called Hadoop MapReduce. Vendors that provide Hadoop-based platforms include Cloudera, Hortonworks, MapR, Greenplum, IBM,
and Amazon. Data is acquired from diverse sources, such as social media, traditional enterprise data or sensor data.
The two main components of Hadoop are the Hadoop Distributed File System (HDFS), which provides distributed storage, and the MapReduce framework, which provides distributed processing. The combination of HDFS and MapReduce provides a software framework for processing vast amounts of data in parallel on large clusters of commodity hardware in a reliable, fault-tolerant manner. Hadoop is a generic processing
framework designed to execute queries and other batch operations against massive
datasets that can scale from tens of terabytes to petabytes in size.
Figure 2 shows the Hadoop ecosystem. Within this ecosystem, there have been several recent research projects exploiting sharing opportunities and eliminating unnecessary data movements, e.g. [2].
Fig. 2. Hadoop ecosystem
To cope with huge amounts of inconsistent, incomplete and noisy data, a number of data preprocessing techniques, including data cleaning, data integration, data transformation and data reduction, can be applied to remove noise and correct inconsistencies [5].
Data capture and storage
Data sets are captured from different sources, such as traditional organizational data from enterprises (including information from transactional ERP systems, web store transactions, general ledger data, etc.), machine-generated or sensor data (including information from information-sensing mobile devices, aerial sensory technologies, remote sensing, radio-frequency identification readers, etc.), social data (including information from blogging sites and social media platforms, etc.) and so on. The world's technological capacity to store information is increasing exponentially and is now measured in quintillions of bytes.
Data organization
Data is organized using distributed file systems, namely the Google File System (GFS) and the Hadoop Distributed File System (HDFS). Google Inc. developed GFS as a distributed file system for its own use, intended for efficient and reliable access to data using large clusters of commodity hardware. It builds on the approach of "BigFiles", developed earlier by Larry Page and Sergey Brin. Files are partitioned into fixed-size chunks of 64 MB with a replication factor of 3. HDFS is the corresponding distributed storage in Hadoop: it places data into fixed-size blocks, each replicated three times, with one master node and multiple slave nodes. MapReduce jobs are processed using either version 1 of the framework (JobTracker/TaskTracker) or version 2 (YARN).
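As an illustration, the block size and replication factor described above are typically configured in hdfs-site.xml. The snippet below is a minimal sketch matching a 64 MB / 3-replica setup; the property names are standard Hadoop settings, while the choice of values is only an assumption for a particular cluster:

<!-- hdfs-site.xml: sketch of the storage settings discussed above -->
<configuration>
  <!-- fixed block size of 64 MB, expressed in bytes -->
  <property>
    <name>dfs.blocksize</name>
    <value>67108864</value>
  </property>
  <!-- each block is replicated to three DataNodes -->
  <property>
    <name>dfs.replication</name>
    <value>3</value>
  </property>
</configuration>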
Data analysis
Pig and Hive are frameworks that have inherent MapReduce functionality with good processing speed. The analysis is performed on large datasets, ranging from thousands to lakhs of records, using Hive and MapReduce. Both frameworks run on different platforms such as Red Hat Linux, Cloudera, Hortonworks, etc. Figure 3 shows the Uber transport dataset collected from GitHub, which is placed into HDFS using the hadoop fs -put command. Ubuntu Server 14.04 LTS is used to implement the operations on the specified data set.
Fig. 3. Uber data set from the GitHub site
The data shown in Figure 3 is taken from the GitHub site and contains the fields DATE/TIME, LATITUDE, LONGITUDE and BASE. For the given data, the number of cars with each base number waiting at a particular latitude position is analyzed, and the Pig and Hive execution speeds are compared.
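For reference, a typical sequence for placing the CSV file into HDFS looks like the following; the local and HDFS paths are assumptions, only the hadoop fs commands themselves are standard:

# create a target directory in HDFS and copy the Uber CSV into it (paths are assumed)
hadoop fs -mkdir -p /user/hadoop/uber
hadoop fs -put uber-raw-data-sep14.csv /user/hadoop/uber/
# verify that the file is now stored in HDFS
hadoop fs -ls /user/hadoop/uber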
3 Working of the Distributed Environment
MapReduce
Hadoop MapReduce is a software framework for easily writing applications which process vast amounts of data (multi-terabyte data sets) in parallel on large clusters (thousands of nodes) of commodity hardware in a reliable, fault-tolerant manner. A MapReduce job usually splits the input data set into independent chunks which are processed by the map tasks in a completely parallel manner. The framework sorts the outputs of the maps, which are then input to the reduce tasks. Typically both the input and the output of the job are stored in a file system. The framework takes care of scheduling tasks, monitoring them and re-executing the failed tasks. Normally the compute nodes and the storage nodes are the same, that is, the MapReduce framework and the Hadoop Distributed File System run on the same set of nodes. This configuration allows the framework to effectively schedule tasks on the nodes where data is already present, resulting in high aggregate bandwidth across the cluster. The MapReduce framework consists of a single master JobTracker and one slave TaskTracker per cluster node. The master is responsible for scheduling the jobs' component tasks on the slaves, monitoring them and re-executing the failed tasks. The slaves execute the tasks as directed by the master. Minimally, applications specify the input/output locations and supply map and reduce functions through implementations of appropriate interfaces and/or abstract classes. These, and other job parameters, comprise the job configuration. The Hadoop job client then submits the job (jar/executable, etc.) and configuration to the JobTracker, which then assumes the responsibility of distributing the software/configuration to the slaves, scheduling tasks and monitoring them, and providing status and diagnostic information to the job client.
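As a concrete illustration of this submission flow, the word-count example that ships with Hadoop can be submitted from the shell roughly as follows. The jar location and HDFS directories are assumptions for a particular installation; only the hadoop jar invocation pattern itself is standard:

# submit the built-in word-count MapReduce job (jar path and HDFS paths are assumed)
hadoop jar /usr/lib/hadoop-mapreduce/hadoop-mapreduce-examples.jar \
    wordcount /user/hadoop/input /user/hadoop/wordcount-output
# inspect the reducer output written back to HDFS
hadoop fs -cat /user/hadoop/wordcount-output/part-r-00000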
Pig
Pig was initially developed at Yahoo! Research around 2006 and moved into the Apache Software Foundation in 2007. Pig is a scripting platform that consists of a language and an execution environment built on the Hadoop MapReduce framework; the scripting language itself is called Pig Latin [6]. Pig came into the world of big data because many industries and companies found that even handling small data sets required 50-60 lines of MapReduce code, wasting time and money. A Pig Latin script connects operations together very easily in a few lines of code. It is a high-level language and does not require Java programming; Pig supports a dataflow style of programming and can handle complex data structures, even those with several levels of nesting. It has two types of execution environment, local and distributed. The local environment is used
for testing when the distributed environment cannot be deployed. A Pig Latin program is a collection of statements, where a statement can be an operation or a command. The distributed (MapReduce) mode is chosen with the command pig, and the local mode with the command pig -x local.
Fig. 4. Execution of PIG
Figure 4 shows the execution process of the Pig framework. The Grunt shell is the console for writing and executing Pig scripts. The execution engine converts the scripts into MapReduce jobs, and the results are stored in HDFS.
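To make the later Pig-versus-Hive comparison concrete, a Pig Latin sketch of the Uber analysis described earlier (counting cars per base number at latitude 40.758) might look as follows. The HDFS path, field layout and relation names are assumptions derived from the dataset description, not the authors' original script:

-- load the Uber CSV from HDFS (path and schema are assumed)
uber = LOAD '/user/hadoop/uber/uber-raw-data-sep14.csv' USING PigStorage(',')
       AS (pickdatetime:chararray, lat:float, log:float, base:chararray);
-- keep only the records at the latitude of interest
at_lat = FILTER uber BY lat == 40.758f;
-- group by base number and count the cars per base
by_base = GROUP at_lat BY base;
counts = FOREACH by_base GENERATE group AS base, COUNT(at_lat) AS cnt;
-- write the result back to HDFS
STORE counts INTO '/user/hadoop/uber/base_counts_pig';

Run in MapReduce mode, such a script is compiled by Pig into map and reduce stages of the same kind that the Hive query shown later triggers.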
Handling of Semi-structured Data using Pig
Data of a semi-structured nature, with extensions such as .xml, .jpeg, etc., can also be handled using Pig. To handle .xml files, some additional libraries need to be included with Pig; the two relevant libraries are compared below. The piggybank library is used for loading the data into Pig from HDFS, and its XMLLoader() function is used for handling XML files.
Piggy Bank:
1) A collection of useful LOAD, STORE and UDF functions.
2) Has many user-defined functions.
3) An open-source project of Apache Hadoop; it must be downloaded externally and linked with Pig.

Apache DataFu:
1) A collection of libraries for working with large-scale data in Hadoop.
2) A project inspired by the need for stable, well-tested libraries for data mining and statistics.
3) Also an Apache open-source project; it is included within Apache Pig from version 0.14.
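Because piggybank is distributed separately, its jar normally has to be registered before XMLLoader can be referenced in a script; a minimal sketch, assuming the jar has been copied into the working directory (the path is an assumption):

-- make the piggybank UDFs, including XMLLoader, visible to the script (jar path is assumed)
REGISTER piggybank.jar;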
The following is a sample XML file:
<Data>
<employee>
<id>5</id>
<name>Aravind</name>
<gender>M</gender>
</employee>
<employee>
<id>6</id>
<name>Krishna</name>
<gender>M</gender>
</employee>
<employee>
<id>7</id>
<name>Thiru</name>
<gender>M</gender>
</employee>
<employee>
<id>8</id>
<name>Mani</name>
<gender>M</gender>
</employee>
</Data>
The above data is loaded in the Pig execution mode using the LOAD command, whose syntax is as follows:
employee = LOAD 'input/employee.xml'
    USING org.apache.pig.piggybank.storage.XMLLoader('employee')
    AS (x:chararray);
The LOAD statement produces a relation in which each tuple holds one employee element as a character array. The relation can be projected and displayed with the following Pig statements:
s = foreach employee generate x;
The data is displayed using the DUMP operator, e.g.: dump employee;

Hive
Apache Hive is a data warehouse system for Apache Hadoop [7]. Hive is a technology developed by Facebook which turns Hadoop into a complete data warehouse with an SQL-like extension for querying. HiveQL, the declarative language used by Hive, executes its commands against a defined schema. The dataset specified above is used to describe the processing in Hive. The configuration can be set in three ways: by editing the hive-site.xml file, by using the set command at the hive> prompt, or by passing properties on the command line when the Hive shell is started.
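For illustration, the three configuration routes mentioned above might look as follows. The specific property shown (Hive's automatic local-mode switch) is chosen only as an example and is an assumption, not a setting prescribed by the paper:

# 1) cluster-wide default in hive-site.xml (XML fragment):
#      <property>
#        <name>hive.exec.mode.local.auto</name>
#        <value>true</value>
#      </property>
# 2) per-session override at the Hive prompt:
#      hive> set hive.exec.mode.local.auto=true;
# 3) per-invocation override when launching the shell:
hive --hiveconf hive.exec.mode.local.auto=true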
Data Analysis using Hive
The data set is taken from GitHub. The data can be downloaded by using the wget command with the link address in the console. Then the data is moved from the local system to the Hadoop Distributed File System using the hadoop fs -put command, e.g.:
hadoop fs -put 'localfileaddress' 'destinationaddress'
hadoop fs -put uber-raw-data-may14.csv '/user/allamrajuviswanath_gmail/viswanath'
The Hive console is opened using the command hive; the Hive shell then starts in the console with the prompt hive>. First create a database with a name, and then create a table for the dataset. For the Uber data set, the table creation looks like:
create table ubersep14(pickdatetime string,lat float,log float,base string)
row format delimited
fields terminated by ','
stored as textfile;
Load the data into the table using the command
LOAD DATA INPATH
'/user/allamrajuviswanath_gmail/viswanath/uber-raw-data-sep14.csv'
OVERWRITE INTO TABLE ubersep14;
The query to retrieve the total number of base entries of Uber cars from the table is as follows:
select count(base) from ubersep14;
The resultant output screen is shown below:
For counting the number of Uber cars per base number at the particular latitude position 40.758, the command used is:
select base, count(*) as count from ubersep14 where lat=40.758 group by base;
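As a possible extension (not part of the original experiments), the same grouping can be ordered so that the busiest base appears first; a sketch assuming the same ubersep14 table:

-- bases at latitude 40.758, ordered by the number of waiting cars (hypothetical extension)
select base, count(*) as cnt
from ubersep14
where lat = 40.758
group by base
order by cnt desc
limit 5;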
4 Conclusion
The results above show that huge databases can be analyzed with less time and effort. The Pig, Hive and MapReduce frameworks can carry out such investigations in a short time: Hive can analyze a database of more than 8 lakh (800,000) records in only 34 seconds, and Pig can do the same in 40 seconds. Hence these components make it possible to handle and utilize vast databases in a simple and efficient way.
References
1. Apache Hadoop. Available at http://hadoop.apache.org.
2. R. Lee, T. Luo, Y. Huai, F. Wang, Y. He, and X. Zhang. YSmart: Yet Another SQL-to-
MapReduce Translator. In ICDCS, 2011.
3. H. Lim, H. Herodotou, and S. Babu. Stubby: A Transformation- based Optimizer for
Mapreduce Workflows. In VLDB, 2012.
4. T. Nykiel, M. Potamias, C. Mishra, G. Kollios, and N. Koudas. MRShare: Sharing Across
Multiple Queries in Mapreduce. In VLDB, 2010.
5. X. Wang, C. Olston, A. D. Sarma, and R. Burns. CoScan: Cooperative Scan Sharing in the
Cloud. In SoCC, 2011.
6. A. Thusoo, J. Sen Sarma, N. Jain, Z. Shao, P. Chakka, N. Zhang, S. Antony, H. Liu and R. Murthy. Hive - A Petabyte Scale Data Warehouse Using Hadoop. Facebook Data Infrastructure Team.
7. Z. Yang, Q. Tu, K. Fan, L. Zhu, R. Chen and B. Peng. Performance Gain with Variable Chunk Size in GFS-like File Systems. Journal of Computational Information Systems, 4:3, pp. 1077-1084, 2008.
8. S. Madden. From Databases to Big Data. IEEE Computer Society, 2012.
9. S. Dhawan and S. Rathee. Big Data Analytics using Hadoop Components like Pig and Hive. American International Journal of Research in Science, Technology, Engineering & Mathematics, pp. 1-5.