
ISSN: 2347-971X (online), ISSN: 2347-9728 (print)
International Journal of Innovations in Scientific and Engineering Research (IJISER)
www.ijiser.com, Vol 4, Issue 3, March 2017/101

REVIEW ON BIG DATA ANALYTICS AND HADOOP FRAMEWORK

1Dr. R. Kousalya, 2T. Sindhupriya
1Research Supervisor, Professor & Head, Department of Computer Applications, Dr. N.G.P Arts and Science College, Coimbatore
2Research Scholar, PG & Research Department of Computer Science, Dr. N.G.P Arts and Science College, Coimbatore
[email protected], [email protected]

Abstract: Big data is a term for data sets that are so large and complex that traditional data processing applications cannot handle them. The challenges in big data include analysis, scalability, lack of talent, actionable insights, data quality, visualization, search, sharing, information privacy, transfer, capture, data accuracy, cost management, querying, and updating. Big data describes the large volume of data, both structured and unstructured, that inundates a business on a day-to-day basis. Big data can be analyzed for insights that support better decisions and strategic business planning. This paper discusses the HACE theorem and the features of the big data revolution from a data mining point of view.

Keywords: Big data, Data mining, HACE theorem, Hadoop, MapReduce.

1. INTRODUCTION

Big data is a collection of data sets so large and complex that they become difficult to process. Big data is distributed and grows exponentially, in both structured and unstructured forms. Rapidly expanding systems, both biological and engineered, are considered distributed architectures. Big data is a problem statement that covers challenges including capturing, storage, transfer, search, processing, analysis, accuracy, and visualization. Big data might consist of petabytes (1,024 terabytes) or exabytes (1,024 petabytes) of data, comprising billions to trillions of records of millions of people drawn from different sources (e.g., web, sales, customer contact centers, mobile data, social media, and so on). Such loosely structured data is often incomplete and inaccessible. Twitter produces over 90 million tweets per day. eBay uses petabyte-scale data warehouses as well as Hadoop clusters for search and consumer recommendation. Walmart handles more than a million customer transactions every hour, and its databases are estimated to hold more than 2.5 petabytes of data. The importance of big data does not revolve around how much data you have, but what you do with it. Data can be taken from any source and analyzed to find answers that enable cost reductions, time reductions, new product development and offerings, and smart decision making.

2. BIG DATA CHARACTERISTICS: HACE THEOREM

Big data begins with huge-volume, heterogeneous, autonomous sources with distributed control, and explores complex and evolving relationships among data [1]. These features make it an extreme challenge to discover knowledge from big data. As a commonsense analogy, imagine a number of blind men trying to size up a giant elephant, which represents big data in this context. The goal of each blind man is to draw a picture of the elephant from the information he gathers during the process. Because each blind man's point of view is limited to his local region, each will conclude independently that the elephant feels like a mountain, a cave, a tree, a branch, a snake, and so on, depending on the region to which he is limited [5]. To make the problem even more complicated, assume that the elephant is growing and its pose keeps changing, and that each blind man may have his own biased impressions of the elephant; for example, one blind man may exchange his information about the elephant with another, and the exchanged knowledge is inherently biased.


Figure 1: The blind men and the giant elephant: the limited view of each blind man leads to a biased conclusion

3. PROBLEM WITH BIG DATA PROCESSING

3.1 Huge data with heterogeneous and diverse dimensionality

Huge volumes of data are represented by diverse dimensionality: there are various types of representation for the same individual, and a variety of features for each observation. For example, medical information may be stored under different schemas, some of it in a structured manner (name, age, gender, history) [3] and some of it as images and videos such as CT scans and X-rays. Here the aggregation of heterogeneous data is very important.

3.2 Autonomous sources with distributed and decentralized controls

A main characteristic of big data is autonomous sources with distributed controls. Each independent source has the freedom to govern itself and control its own data, generating and collecting information without depending on any centralized control [4]. A very large quantity of data also exposes the system to the possibility of being attacked or harmed (e.g., the servers of Google, Walmart, and Facebook).

3.3 Complex and evolving relationships

In a centralized information system, different things are linked in close or complicated ways, and the focus is on finding the best features to represent each observation. Treating feature values alone considers the individual entity without taking its social connections into account [2]. The vital task is to capture complex data relationships and evolving changes, and to discover useful patterns from big data collections.

4. THE CHALLENGES OF BIG DATA

The five different characteristics of big data are volume, variety, velocity, veracity, and value [6].

Volume: refers to the vast amount of data generated every second. The size of the data establishes its value and the capacity needed to handle it, which is what makes it big data. Just think of the emails, Facebook and Twitter messages, photos, and videos that we create and share every second. On Facebook alone we send millions of messages, click billions of like buttons, and upload new pictures each and every day. With big data technology we store and use these data sets with the help of distributed systems [7].

Variety: refers to the type and nature of the data. We can now use many different types of data. In the past, structured data fit neatly into tables and relational databases, as with financial data. Unstructured data cannot easily be put into tables and relational databases. With big data technology, different types of data, including messages, photos, videos, and voice recordings, can be given structure.

Velocity: refers to the speed at which new data is generated and processed to meet the challenges that lie in its development. Data is generated at high speed and moves around quickly; just think of a social media message going viral in minutes, or of the speed at which online transactions are checked.

Veracity: refers to the disorderliness of the data, for example Twitter posts with hashtags and abbreviations. Big data analytics technology allows working with these types of data [8]. The quality and volume of such data may vary, affecting accurate analysis.


Value: refers to the ability to turn data into value. It is what makes it worthwhile for businesses to attempt to collect big data.

5. HADOOP: SOLUTION FOR BIG DATA PROCESSING

Hadoop is a framework of tools, or a programming framework, used to support running applications on big data. It is an open-source set of tools distributed under the Apache license. Big data creates the challenges that Hadoop addresses: large volumes of data are gathered at very high speed [10]. Hadoop solves the big data problem by storing the data using HDFS and processing the data using MapReduce. Hadoop was derived from Google's MapReduce, a software framework in which an application is broken down into various parts. The current Apache Hadoop ecosystem consists of the Hadoop kernel, MapReduce, HDFS, and a number of components such as Apache Hive, HBase, and ZooKeeper. HDFS and MapReduce are explained in the following points.

Figure 2: Hadoop Architecture

5.1 HDFS Architecture

Hadoop stores data in the Hadoop Distributed File System (HDFS). It is a distributed file system that stores large files and runs on commodity hardware. HDFS manages storage by breaking files into pieces called 'blocks' and storing each block across a pool of servers. The default size of each block is 64 MB. The HDFS architecture comprises three kinds of nodes: the name node, the data nodes, and the HDFS clients/edge nodes.

Figure 3: HDFS Architecture
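To make the division of roles concrete, the short Java sketch below shows how an HDFS client, typically running on an edge node, writes and reads a file through the Hadoop FileSystem API. It is a minimal illustration rather than code from the reviewed works; the name-node address, path, and file contents are hypothetical, and the splitting into blocks happens transparently underneath.

// A minimal sketch of an HDFS client; names and addresses are hypothetical.
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsClientExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Assumed name-node address; in practice it comes from the
        // core-site.xml configuration on the edge node.
        conf.set("fs.defaultFS", "hdfs://namenode:8020");
        FileSystem fs = FileSystem.get(conf);

        Path file = new Path("/user/demo/sample.txt");

        // Write: the client asks the name node where to place blocks, then
        // streams data to data nodes; HDFS splits the file into blocks
        // (64 MB by default in classic Hadoop) behind the scenes.
        try (FSDataOutputStream out = fs.create(file, true)) {
            out.writeUTF("hello hdfs");
        }

        // Read: blocks are fetched from the data nodes that hold them.
        try (FSDataInputStream in = fs.open(file)) {
            System.out.println(in.readUTF());
        }

        fs.close();
    }
}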

5.2 MapReduce Architecture

The processing pillar of the Hadoop ecosystem is the MapReduce framework. The framework allows computations to be applied to huge data sets by dividing the data and processing the parts in parallel. MapReduce processes data by filtering and aggregating it [9]. There are two functions in MapReduce:

Figure 4: MapReduce Architecture


Map: the map function is essentially a filtering technique applied to large data sets. It takes key/value pairs as input and generates an intermediate set of key/value pairs.

Reduce: the reduce function is a technique used for aggregating the data sets. It merges all the intermediate values associated with the same intermediate key.
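As an illustration, the classic word-count job below sketches how the two functions fit together in the Hadoop MapReduce Java API. It is a minimal example, not taken from the reviewed works: the mapper transforms each input line into intermediate (word, 1) pairs, and the reducer merges all values that share the same intermediate key into a total count.

// Classic word count: map emits (word, 1); reduce sums per word.
import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {
    // Map: emit (word, 1) for every word in the input split.
    public static class TokenizerMapper
            extends Mapper<Object, Text, Text, IntWritable> {
        private final static IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        public void map(Object key, Text value, Context context)
                throws IOException, InterruptedException {
            StringTokenizer itr = new StringTokenizer(value.toString());
            while (itr.hasMoreTokens()) {
                word.set(itr.nextToken());
                context.write(word, ONE);
            }
        }
    }

    // Reduce: merge all counts that share the same intermediate key.
    public static class IntSumReducer
            extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        public void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable v : values) {
                sum += v.get();
            }
            context.write(key, new IntWritable(sum));
        }
    }

    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "word count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(TokenizerMapper.class);
        job.setCombinerClass(IntSumReducer.class);
        job.setReducerClass(IntSumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}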

6. RELATED WORK

Due to the huge, heterogeneous, and dynamic characteristics of the sources involved when applications handle data in a distributed environment, one of the most important features of big data is the ability to compute petabyte (PB)- and exabyte (EB)-scale data with complex computing processes. The critical goal of big data processing is to change distributed data from "quantity" to "quality". Big data processing mainly depends on parallel programming models like MapReduce, which also underpin cloud computing platforms for big data services. MapReduce is a batch-oriented parallel computing model. This parallel programming model has been applied to many machine learning and data mining algorithms; the algorithms realized in the framework include k-means, naive Bayes, linear support vector machines, and independent variable analysis.

From the analysis of these classical machine learning algorithms, we argue that the computational operations in the algorithm's learning process can be transformed into summation operations over a number of training data sets. Summation operations can be performed on different subsets independently, achieving a parallelization that is easily executed on the MapReduce programming platform. Therefore, a large-scale data set can be divided into several subsets and assigned to multiple mapper nodes.
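As a concrete sketch of this summation form, the hypothetical job below (an illustration under assumed input of one numeric sample per line, not code from the reviewed works) computes a global mean: each mapper produces a partial sum and count over its own subset, and the reducer performs the summation that combines them, the same pattern that underlies the k-means update step.

// Summation-form sketch: partial sums per subset, combined in the reducer.
import java.io.IOException;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

public class MeanBySummation {
    // Map: accumulate a partial sum over this mapper's subset; emit it once
    // in cleanup() so each subset contributes a single "sum,count" record.
    public static class PartialSumMapper extends Mapper<Object, Text, Text, Text> {
        private double sum = 0.0;
        private long count = 0;

        @Override
        public void map(Object key, Text value, Context context) {
            sum += Double.parseDouble(value.toString().trim());
            count++;
        }

        @Override
        protected void cleanup(Context context)
                throws IOException, InterruptedException {
            context.write(new Text("partial"), new Text(sum + "," + count));
        }
    }

    // Reduce: the summation operation; independent partial results are
    // merged, and the global mean is the total sum over the total count.
    public static class MeanReducer extends Reducer<Text, Text, Text, Text> {
        @Override
        public void reduce(Text key, Iterable<Text> values, Context context)
                throws IOException, InterruptedException {
            double total = 0.0;
            long n = 0;
            for (Text v : values) {
                String[] parts = v.toString().split(",");
                total += Double.parseDouble(parts[0]);
                n += Long.parseLong(parts[1]);
            }
            context.write(new Text("mean"), new Text(String.valueOf(total / n)));
        }
    }
}

The job driver would mirror the word-count driver above; the point of the sketch is that each subset's computation is independent, so adding mapper nodes scales the summation directly.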

7. CONCLUSION

Managing and mining big data has been a challenging but compelling task for business stakeholders and real-world applications. The HACE theorem suggests the key characteristics of big data: 1) huge data with heterogeneous and diverse dimensionality; 2) autonomous sources with distributed and decentralized controls; and 3) complex and evolving relationships. Altered copies of the data are produced when errors, noise, and privacy-preserving modifications are introduced into it. Big data has stimulated data mining and information discovery in connected areas and given definite structure to big data evaluation and statistics. Big data is an emerging trend, and the need for big data mining is arising in all science and engineering domains. The challenges include not just the obvious issues of scale, but also heterogeneity, lack of structure, error handling, privacy, timeliness, provenance, and visualization, at all stages of the analysis pipeline from data acquisition to result interpretation. These technical challenges are common across a large variety of application domains, and are therefore not cost-effective to address in the context of one domain alone. This paper describes Hadoop, an open-source software framework used for processing big data.

REFERENCES

[1] Wu, Xindong, et al. "Data mining with big data." IEEE Transactions on Knowledge and Data Engineering 26.1 (2014): 97-107.

[2] Tupe, Akshaya, and Amrit Priyadarshi. "Data Mining with Big Data and Privacy Preservation." nature 5.4 (2016).

[3] Kosinski, Michal, et al. "Mining big data to extract patterns and predict real-life outcomes." Psychological Methods 21.4 (2016): 493.

[4] Banerjee, S., and N. Agarwal. "Analyzing Collective Behavior from Blogs Using Swarm Intelligence." Knowledge and Information Systems 33.3 (2012): 523-547.

[5] Pandey, Rajneeshkumar, and Uday A. Mande. "Data Mining and Information Security in Big Data Using HACE Theorem." (2016).

[6] Ducange, Pietro, et al. "Educational Big Data Mining: How to Enhance Virtual Learning Environments." International Conference on EUropean Transnational Education. Springer International Publishing, 2016.

[7] Chen, Lihong, et al. "VFDB 2016: hierarchical and refined dataset for big data analysis—10 years on." Nucleic Acids Research 44.D1 (2016): D694-D697.

[8] O'Leary, Daniel E. "Some Issues of Privacy in a World of Big Data and Data Mining." Big Data: Storage, Sharing, and Security (2016): 289.


[9] Chen, Jiaoyan, et al. "MR-ELM: a MapReduce-based framework for large-scale ELM training in big data era." Neural Computing and Applications 27.1 (2016): 101-110.

[10] Bikku, Thulasi, N. Sambasiva Rao, and Ananda Rao Akepogu. "Hadoop based feature selection and decision making models on Big Data." Indian Journal of Science and Technology 9.10 (2016).