Comprehensive Analysis of Hadoop Ecosystem
Components: MapReduce, Pig and Hive
N. Suneetha1, Ch. Sekhar2, A. Viswanath Sharma3 and P. Sandhya4
1,2,3,4 Vignan's Institute of Information Technology, Duvvada, Visakhapatnam [email protected], [email protected]
Abstract. Big Data is a term for large-volume, complex, growing data sets with multiple, autonomous sources, generated in fields ranging from economic and business activities to public administration, and from national security to scientific research. It is an emerging technology that draws considerable attention from researchers seeking to extract value from voluminous datasets. As the velocity of data growth increases, along with the accompanying technological challenges, the organization and storage of data are of primary concern. When approaching a given big data problem, the parameters to consider are, firstly, the background of the data; then the value-chain phases, namely data generation, data acquisition, data storage and data analysis; and finally the representative applications of big data, including enterprise management, the Internet of Things, online social networks, medical applications, collective intelligence, financial service sectors and the smart grid. This paper emphasizes the performance and working of data analysis through Hadoop ecosystem components such as MapReduce, Pig and Hive.
Keywords: Big Data, MapReduce, Pig, Hive.
1 Introduction
Over the past few decades, data has grown at a large scale in various fields. According to a report from the International Data Corporation (IDC), in 2011 the overall created and copied data volume in the world was 1.8 ZB (≈ 10²¹ B), an increase of nearly nine times within five years [1]. This figure is expected to double at least every two years in the near future. Under this explosive growth of information, the term Big Data refers to large or complex data sets that cannot be handled by traditional database technologies. Big Data also brings many challenges, such as difficulties in data capture, data storage, data analysis and data visualization. Big data is characterized by the five Vs, namely volume, variety, velocity, value and veracity (Fig. 1). This 5V definition highlights the meaning and necessity of big data.
Volume:
Volume refers to the generation and collection of masses of data. Nowadays the volume, or size, of data in different areas and enterprises exceeds terabytes and petabytes. The task of Big Data is to extract valuable information from high volumes of low-density, unstructured data, that is, data of unknown value, such as Twitter data feeds, click streams on web pages and mobile apps, network traffic, data captured by sensor-enabled equipment, and many more. Big Data volume includes such features as size, scale, amount and dimension for tera- and exascale data collected from many transactions and stored in individual files or databases so as to be accessible, searchable, processable and manageable. As an example from industry, global service providers such as Google, Facebook and Twitter produce, analyze and store huge amounts of data as part of their regular production services. To deal with such large volumes of data, Big Data technologies are needed.
Variety:
Variety deals with the complexity of big data and with the information and semantic models behind these data. Big data comes from a great variety of sources and generally comes in four types: structured, semi-structured, unstructured and mixed data. Structured data enters a data warehouse already tagged and is easily sorted, whereas unstructured data is random and difficult to analyze. Unstructured and semi-structured data types, such as text, audio and video, do not conform to fixed fields; they may contain tags to separate data elements but require additional processing to derive both their meaning and the supporting metadata.
Velocity:
Velocity is the fast rate at which big data streams are generated into memory or onto disks by arrays of sensors or multiple events, and at which they need to be processed in real time, near real time, in batch, or as streams (as in the case of visualisation). Velocity matters not only for acquiring big data but also for all subsequent processing. For time-limited processes, big data should be used as it streams into the organization in order to maximize its value [4, 16]. Applications such as the Internet of Things (IoT), consumer e-commerce and mobile communications deal with large amounts of data in real time or near real time.
Value:
Value is an important feature of Big Data, defined by the added value that the collected data can bring to the intended process, activity or predictive analysis/hypothesis. The value of data depends on the events or processes it represents, which may be stochastic, probabilistic, regular or random. For example, in consumer applications the intrinsic value of data is derived using quantitative and investigative techniques, ranging from discovering a consumer preference or sentiment, to making a relevant offer by location, to identifying a piece of equipment that is about to fail. Depending on this, requirements may be imposed to collect all data, to store it for a longer period (for some possible event of interest), and so on. However, finding the value of Big Data requires new discovery processes to support more accurate and precise decisions.
Veracity:
Big Data veracity ensures that the data used are trusted, authentic and protected from
unauthorized access and modification. The data must be secured during its whole lifecycle, from collection from trusted sources to processing on trusted compute
facilities and storage on protected and trusted storage facilities. Data veracity relies
entirely on the security infrastructure deployed and available from the Big Data
infrastructure.
With this definition, characteristics of big data may be summarized as five Vs, i.e.,
Volume (great volume), Variety (various modalities), Velocity (rapid generation),
Value (huge value but very low density) and Veracity (trusted and authentic) as
shown in Fig. 1.
Fig. 1. Five V’s of Big Data
This paper is organized as follows. Section 2 presents the Hadoop ecosystem components. Section 3 describes the working of the distributed environment. Section 4 concludes the presented work.
2 Hadoop Ecosystem Components
The size of datasets is currently increasing at a rapid pace, to the tune of petabytes (PB), which makes data analysis difficult. The challenges of such analysis include capture, curation, search, sharing, storage, transfer, visualization, querying, updating and information privacy. To meet these challenges, parallel data processing software such as the Hadoop framework is required. Hadoop is an open-source framework that allows Big Data to be stored and processed in a distributed environment across clusters of computers using simple programming models. The Hadoop platform consists of two main services: one is a reliable distributed file system called the Hadoop Distributed File System (HDFS), and the other is a high-performance parallel data processing engine called Hadoop MapReduce. Vendors that provide Hadoop-based platforms include Cloudera, Hortonworks, MapR, Greenplum, IBM,
and Amazon. Data is acquired from diverse sources, such as social media, traditional enterprise data or sensor data.
The two main components of Hadoop are the Hadoop Distributed File System (HDFS), which provides distributed storage, and the MapReduce framework, which provides distributed processing. The combination of HDFS and MapReduce provides a software framework for processing vast amounts of data in parallel on large clusters of commodity hardware in a reliable, fault-tolerant manner. Hadoop is a generic processing
framework designed to execute queries and other batch operations against massive
datasets that can scale from tens of terabytes to petabytes in size.
Figure 2 shows the Hadoop ecosystem. Within this ecosystem, there have been several recent research projects exploiting sharing opportunities and eliminating unnecessary data movements, e.g. [2].
Fig. 2. Hadoop ecosystem
To cope with huge amounts of inconsistent, incomplete and noisy data, a number of data preprocessing techniques, including data cleaning, data integration, data transformation and data reduction, can be applied to remove noise and correct inconsistencies [5].
Data capture and storage
Data sets are captured from different sources, such as traditional organizational data from enterprises (including information from transactional ERP systems, web store transactions, general ledger data, etc.), machine-generated or sensor data (including information from information-sensing mobile devices, aerial sensory technologies, remote sensing, radio-frequency identification readers, etc.), social data (including information from blogging sites and social media platforms, etc.) and so on. The world's technological capacity to store information is increasing exponentially and is now measured in quintillions of bytes.
Data organization
Data is organized using distributed file systems, namely the Google File System (GFS) and the Hadoop Distributed File System (HDFS). Google Inc. developed GFS as a distributed file system for its own use, intended for efficient and reliable access to data using large clusters of commodity hardware. It builds on the approach of "BigFiles", developed earlier by Larry Page and Sergey Brin. Files are partitioned into fixed-size chunks of 64 MB with a replication factor of 3. HDFS is the corresponding distributed storage in Hadoop: it places data into fixed-size blocks, each replicated three times, with one master node and multiple slave nodes. MapReduce jobs are processed using either version 1 of the framework (JobTracker/TaskTracker) or version 2 (YARN).
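As an illustration, the block size and replication factor described above are typically configured in hdfs-site.xml. The snippet below is a minimal sketch matching a 64 MB / 3-replica setup; the property names are standard Hadoop settings, while the choice of values is only an assumption for a particular cluster:

<!-- hdfs-site.xml: sketch of the storage settings discussed above -->
<configuration>
  <!-- fixed block size of 64 MB, expressed in bytes -->
  <property>
    <name>dfs.blocksize</name>
    <value>67108864</value>
  </property>
  <!-- each block is replicated to three DataNodes -->
  <property>
    <name>dfs.replication</name>
    <value>3</value>
  </property>
</configuration>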
Data analysis
Pig and Hive are frameworks that have inherent MapReduce functionality with good processing speed. The analysis is performed on large datasets, ranging from thousands to lakhs of records, using Hive and MapReduce. Both frameworks run on different platforms such as Red Hat Linux, Cloudera, Hortonworks, etc. Figure 3 shows the Uber transport dataset collected from GitHub, which is placed into HDFS using the hadoop fs -put command. Ubuntu Server 14.04 LTS is used to implement the operations on the specified data set.
Fig. 3. Uber data set from the GitHub site
The data shown in Figure 3 is taken from the GitHub site and contains the fields DATE/TIME, LATITUDE, LONGITUDE and BASE. For the given data, the number of cars with each base number waiting at a particular latitude position is analyzed, and the Pig and Hive execution speeds are compared.
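For reference, a typical sequence for placing the CSV file into HDFS looks like the following; the local and HDFS paths are assumptions, only the hadoop fs commands themselves are standard:

# create a target directory in HDFS and copy the Uber CSV into it (paths are assumed)
hadoop fs -mkdir -p /user/hadoop/uber
hadoop fs -put uber-raw-data-sep14.csv /user/hadoop/uber/
# verify that the file is now stored in HDFS
hadoop fs -ls /user/hadoop/uber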
3 Working of the Distributed Environment
MapReduce
Hadoop MapReduce is a software framework for easily writing applications which process vast amounts of data (multi-terabyte data sets) in parallel on large clusters (thousands of nodes) of commodity hardware in a reliable, fault-tolerant manner. A MapReduce job usually splits the input data set into independent chunks which are processed by the map tasks in a completely parallel manner. The framework sorts the outputs of the maps, which are then input to the reduce tasks. Typically both the input and the output of the job are stored in a file system. The framework takes care of scheduling tasks, monitoring them and re-executing the failed tasks. Normally the compute nodes and the storage nodes are the same, that is, the MapReduce framework and the Hadoop Distributed File System run on the same set of nodes. This configuration allows the framework to effectively schedule tasks on the nodes where data is already present, resulting in high aggregate bandwidth across the cluster. The MapReduce framework consists of a single master JobTracker and one slave TaskTracker per cluster node. The master is responsible for scheduling the jobs' component tasks on the slaves, monitoring them and re-executing the failed tasks. The slaves execute the tasks as directed by the master. Minimally, applications specify the input/output locations and supply map and reduce functions through implementations of appropriate interfaces and/or abstract classes. These, and other job parameters, comprise the job configuration. The Hadoop job client then submits the job (jar/executable, etc.) and configuration to the JobTracker, which then assumes the responsibility of distributing the software/configuration to the slaves, scheduling tasks and monitoring them, and providing status and diagnostic information to the job client.
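As a concrete illustration of this submission flow, the word-count example that ships with Hadoop can be submitted from the shell roughly as follows. The jar location and HDFS directories are assumptions for a particular installation; only the hadoop jar invocation pattern itself is standard:

# submit the built-in word-count MapReduce job (jar path and HDFS paths are assumed)
hadoop jar /usr/lib/hadoop-mapreduce/hadoop-mapreduce-examples.jar \
    wordcount /user/hadoop/input /user/hadoop/wordcount-output
# inspect the reducer output written back to HDFS
hadoop fs -cat /user/hadoop/wordcount-output/part-r-00000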
Pig
Pig was initially developed at Yahoo! Research around 2006 and moved into the Apache Software Foundation in 2007. Pig is a scripting platform that consists of a language and an execution environment built on the Hadoop MapReduce framework; the scripting language itself is called Pig Latin [6]. Pig came into the world of big data because many industries and companies found that even handling small data sets required 50-60 lines of MapReduce code, wasting time and money. A Pig Latin script connects operations together very easily in a few lines of code. It is a high-level language and does not require Java programming; Pig supports a dataflow style of programming and can handle complex data structures, even those with several levels of nesting. It has two types of execution environment, local and distributed. The local environment is used
for testing when the distributed environment cannot be deployed. A Pig Latin program is a collection of statements, where a statement can be an operation or a command. The distributed (MapReduce) mode is chosen with the command pig, and the local mode with the command pig -x local.
Fig. 4. Execution of PIG
Figure 4 shows the execution process of the Pig framework. The Grunt shell is the console for writing and executing Pig scripts. The execution engine converts the scripts into MapReduce jobs, and the results are stored in HDFS.
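To make the later Pig-versus-Hive comparison concrete, a Pig Latin sketch of the Uber analysis described earlier (counting cars per base number at latitude 40.758) might look as follows. The HDFS path, field layout and relation names are assumptions derived from the dataset description, not the authors' original script:

-- load the Uber CSV from HDFS (path and schema are assumed)
uber = LOAD '/user/hadoop/uber/uber-raw-data-sep14.csv' USING PigStorage(',')
       AS (pickdatetime:chararray, lat:float, log:float, base:chararray);
-- keep only the records at the latitude of interest
at_lat = FILTER uber BY lat == 40.758f;
-- group by base number and count the cars per base
by_base = GROUP at_lat BY base;
counts = FOREACH by_base GENERATE group AS base, COUNT(at_lat) AS cnt;
-- write the result back to HDFS
STORE counts INTO '/user/hadoop/uber/base_counts_pig';

Run in MapReduce mode, such a script is compiled by Pig into map and reduce stages of the same kind that the Hive query shown later triggers.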
Handling of Semi-structured Data using Pig
Data of a semi-structured nature, with extensions such as .xml, .jpeg, etc., can also be handled using Pig. To handle .xml files, some additional libraries need to be included with Pig; the two relevant libraries are compared below. The piggybank library is used for loading the data into Pig from HDFS, and its XMLLoader() function is used for handling XML files.
Piggy Bank:
1) A collection of useful LOAD, STORE and UDF functions.
2) Has many user-defined functions.
3) An open-source project of Apache Hadoop; it must be downloaded externally and linked with Pig.

Apache DataFu:
1) A collection of libraries for working with large-scale data in Hadoop.
2) A project inspired by the need for stable, well-tested libraries for data mining and statistics.
3) Also an Apache open-source project; it is included within Apache Pig from version 0.14.
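Because piggybank is distributed separately, its jar normally has to be registered before XMLLoader can be referenced in a script; a minimal sketch, assuming the jar has been copied into the working directory (the path is an assumption):

-- make the piggybank UDFs, including XMLLoader, visible to the script (jar path is assumed)
REGISTER piggybank.jar;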
The following is a sample XML file:
<Data>
<employee>
<id>5</id>
<name>Aravind</name>
<gender>M</gender>
</employee>
<employee>
<id>6</id>
<name>Krishna</name>
<gender>M</gender>
</employee>
<employee>
<id>7</id>
<name>Thiru</name>
<gender>M</gender>
</employee>
<employee>
<id>8</id>
<name>Mani</name>
<gender>M</gender>
</employee>
</Data>
The above data is loaded in the Pig execution mode using the LOAD command, whose syntax is as follows:
employee = LOAD 'input/employee.xml'
    USING org.apache.pig.piggybank.storage.XMLLoader('employee')
    AS (x:chararray);
The LOAD statement produces a relation in which each tuple holds one employee element as a character array. The relation can be projected and displayed with the following Pig statements:
s = foreach employee generate x;
The data is displayed using the DUMP operator, e.g.: dump employee;

Hive
Apache Hive is a data warehouse system for Apache Hadoop [7]. Hive is a technology developed by Facebook which turns Hadoop into a complete data warehouse with an SQL-like extension for querying. HiveQL, the declarative language used by Hive, executes its commands against a defined schema. The dataset specified above is used to describe the processing in Hive. The configuration can be set in three ways: by editing the hive-site.xml file, by using the set command at the hive> prompt, or by passing properties on the command line when the Hive shell is started.
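For illustration, the three configuration routes mentioned above might look as follows. The specific property shown (Hive's automatic local-mode switch) is chosen only as an example and is an assumption, not a setting prescribed by the paper:

# 1) cluster-wide default in hive-site.xml (XML fragment):
#      <property>
#        <name>hive.exec.mode.local.auto</name>
#        <value>true</value>
#      </property>
# 2) per-session override at the Hive prompt:
#      hive> set hive.exec.mode.local.auto=true;
# 3) per-invocation override when launching the shell:
hive --hiveconf hive.exec.mode.local.auto=true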
Data Analysis using Hive
The data set is taken from GitHub. The data can be downloaded by using the wget command with the link address in the console. Then the data is moved from the local system to the Hadoop Distributed File System using the hadoop fs -put command, e.g.:
hadoop fs -put 'localfileaddress' 'destinationaddress'
hadoop fs -put uber-raw-data-may14.csv '/user/allamrajuviswanath_gmail/viswanath'
The Hive console is opened using the command hive; the Hive shell then starts in the console with the prompt hive>. First create a database with a name, and then create a table for the dataset. For the Uber data set, the table creation looks like:
create table ubersep14(pickdatetime string,lat float,log float,base string)
row format delimited
fields terminated by ','
stored as textfile;
Load the data into the table using the command
LOAD DATA INPATH
'/user/allamrajuviswanath_gmail/viswanath/uber-raw-data-sep14.csv'
OVERWRITE INTO TABLE ubersep14;
The query to retrieve the total number of base entries of Uber cars from the table is as follows:
select count(base) from ubersep14;
The resultant output screen is shown below:
For counting the number of Uber cars per base number at the particular latitude position 40.758, the command used is:
select base, count(*) as count from ubersep14 where lat=40.758 group by base;
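As a possible extension (not part of the original experiments), the same grouping can be ordered so that the busiest base appears first; a sketch assuming the same ubersep14 table:

-- bases at latitude 40.758, ordered by the number of waiting cars (hypothetical extension)
select base, count(*) as cnt
from ubersep14
where lat = 40.758
group by base
order by cnt desc
limit 5;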
4 Conclusion
The results above show that huge databases can be analyzed with less time and effort. The Pig, Hive and MapReduce frameworks can carry out such investigations in a short time: Hive can analyze a database of more than 8 lakh (800,000) records in only 34 seconds, and Pig can do the same in 40 seconds. Hence these components make it possible to handle and utilize vast databases in a simple and efficient way.
References
1. Apache Hadoop. Available at http://hadoop.apache.org.
2. R. Lee, T. Luo, Y. Huai, F. Wang, Y. He, and X. Zhang. YSmart: Yet Another SQL-to-
MapReduce Translator. In ICDCS, 2011.
3. H. Lim, H. Herodotou, and S. Babu. Stubby: A Transformation- based Optimizer for
Mapreduce Workflows. In VLDB, 2012.
4. T. Nykiel, M. Potamias, C. Mishra, G. Kollios, and N. Koudas. MRShare: Sharing Across
Multiple Queries in Mapreduce. In VLDB, 2010.
5. X. Wang, C. Olston, A. D. Sarma, and R. Burns. CoScan: Cooperative Scan Sharing in the
Cloud. In SoCC, 2011.
6. A. Thusoo, J. Sen Sarma, N. Jain, Z. Shao, P. Chakka, N. Zhang, S. Antony, H. Liu and R. Murthy. Hive - A Petabyte Scale Data Warehouse Using Hadoop. Facebook Data Infrastructure Team.
7. Z. Yang, Q. Tu, K. Fan, L. Zhu, R. Chen and B. Peng. Performance Gain with Variable Chunk Size in GFS-like File Systems. Journal of Computational Information Systems, 4:3, pp. 1077-1084, 2008.
8. S. Madden. From Databases to Big Data. IEEE Computer Society, 2012.
9. S. Dhawan and S. Rathee. Big Data Analytics using Hadoop Components like Pig and Hive. American International Journal of Research in Science, Technology, Engineering & Mathematics, pp. 1-5.