data analytics using hadoop

Upload: vineetsajwan

Post on 02-Jun-2018


  • 8/10/2019 data analytics using hadoop

    November - 2014

    Research Paper

    Analytics of Data using Hadoop-A review

    Vineet Sajwan Vikash Yadav

    Students, Computer Science and Engineering, !"o#$, Noida Sec-'(

    Abstract - We live in a world of data. It is not easy to measure how much data is produced every day, but according to an IDC estimate the planet held 2.75 zettabytes (ZB) of data in 2012, rising to 8 zettabytes by 2015. This flood of data comes from social websites (such as Facebook and Twitter) and from other sources such as Google Maps and heat and pressure sensors. This situation gave rise to a new term, "big data". This review discusses what big data really is and which platforms can be used to process it.

    Keywords - Big data, Big data analytics, MapReduce, HDFS, HBase, Pig, Hive

    I. INTRODUCTION

    A. Definition

    The volume of data that enterprises and modern technologies acquire every day is increasing exponentially. Many organizations now use big data analytics to process these massive amounts of data. Big data refers to extremely large data sets that may be analysed computationally to reveal patterns and trends. Big data analytics is the process of examining such data sets, which may be structured or unstructured, to uncover hidden patterns, unknown correlations, market trends and other business information.

    B. Big Data Parameters

    Big data is commonly described in terms of 5 Vs: Volume, Velocity, Variety, Veracity, and Value.

    a. Volume:- refers to the vast amount of data generated every second. For example, Facebook stores approximately 10 billion photos, taking up one petabyte of storage. On Facebook we send 10 billion messages per day and upload 300 million new pictures every day. The amount of stored data has been growing exponentially. With big data analytics tools we can now store huge data sets and use them through distributed systems, where parts of the data are stored in different locations and brought together by software.

    b. Velocity:- In today's world of social networks, data is generated quickly and moves around very fast. On social media (like Facebook) a message can go viral in seconds. Big data technology allows us to analyse data while it is being generated, without first putting it into a database.

    c. Variety:- refers to the structured, unstructured and semi-structured data we can now use. In the past we focused on data that fits into tables, but now 80% of the world's data is unstructured and therefore cannot easily be put into a table.
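The Velocity point, analysing data while it is generated instead of storing it first, can be illustrated with a minimal Python sketch. The sensor feed and the names `running_mean` and `sensor_feed` are hypothetical and chosen only for illustration; a production system would use a dedicated stream-processing platform, not a plain generator.

```python
from typing import Iterable, Iterator, Tuple


def running_mean(readings: Iterable[float]) -> Iterator[Tuple[int, float]]:
    """Yield (count, mean) after every reading without storing the readings."""
    count = 0
    mean = 0.0
    for value in readings:
        count += 1
        # Incremental update: only the current mean and count are kept.
        mean += (value - mean) / count
        yield count, mean


def sensor_feed() -> Iterator[float]:
    """Hypothetical stand-in for a live sensor; values are made up."""
    for value in [20.0, 22.0, 21.0, 25.0]:
        yield value


latest = (0, 0.0)
for latest in running_mean(sensor_feed()):
    pass  # each intermediate result is available as soon as the reading arrives

print(latest)  # (4, 22.0)
```

Only two numbers are held in memory regardless of how many readings arrive, which is what makes this style of processing suitable for fast-moving data.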

    The problem is simple: the storage capacity of hard drives has increased massively over the years, but the rate at which data can be read from drives has not kept up. In 1990, a typical drive could store 1,370 MB of data and had a transfer speed of 4.4 MB/s, so the full drive could be read in about 5 minutes. Twenty years later, a typical drive stores one terabyte but transfers at around 100 MB/s, so it takes more than two and a half hours to read the whole disk.
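The read-time arithmetic can be checked with a short Python sketch. The 1990 figures (1,370 MB at 4.4 MB/s) come from the text; the modern figures (a 1 TB drive read at about 100 MB/s) are an assumption used for illustration, and the helper name `read_time_seconds` is hypothetical.

```python
def read_time_seconds(capacity_mb: float, speed_mb_per_s: float, drives: int = 1) -> float:
    """Seconds to read capacity_mb when it is split evenly across
    `drives` disks that are all read in parallel at the same speed."""
    if drives < 1 or speed_mb_per_s <= 0:
        raise ValueError("drives must be >= 1 and speed must be positive")
    return capacity_mb / drives / speed_mb_per_s


drive_1990 = read_time_seconds(1370, 4.4)
# Assumed modern drive: 1 TB (taken as 1,000,000 MB) at 100 MB/s.
drive_modern = read_time_seconds(1_000_000, 100)
# The same data spread over 100 drives read in parallel.
parallel_100 = read_time_seconds(1_000_000, 100, drives=100)

print(f"1990 drive:   {drive_1990 / 60:.1f} minutes")    # 5.2 minutes
print(f"1 TB drive:   {drive_modern / 3600:.1f} hours")  # 2.8 hours
print(f"100 drives:   {parallel_100:.0f} seconds")       # 100 seconds
```

Capacity grew by a factor of roughly 700 while transfer speed grew by a factor of about 20, which is why the time to read a full drive went from minutes to hours.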

    B. Problem Solution

    The obvious way to reduce the read time is to read from multiple disks at once. If we had 100 drives working in parallel, each holding one hundredth of the data, we could read the data in under 2 minutes. Using only one hundredth of each disk is wasteful, but we can store 100 datasets, each of which is one terabyte, across the drives and provide shared access to them.

    C. Problem to Solve

    1. Hardware failure:- The chance that some drive fails increases as we connect more drives. A common way of avoiding data loss is to replicate the data sets, so that another copy is available after a failure. The Hadoop Distributed Filesystem (HDFS) takes this approach.

    2. Analysis tasks need to combine data in some way:- Data read from one disk may need to be combined with data from any of the other 99 disks. Various distributed systems allow this, but doing it correctly is very challenging. MapReduce provides a programming model that abstracts the problem away from disk reads and writes, transforming it into a computation over sets of keys and values.
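The idea of a computation over sets of keys and values can be sketched in plain Python. This is a single-process illustration of the map, shuffle and reduce steps using the classic word-count example; it does not use Hadoop's API, and the function names are hypothetical.

```python
from collections import defaultdict
from typing import Dict, Iterable, Iterator, List, Tuple


def map_phase(line: str) -> Iterator[Tuple[str, int]]:
    """Map: emit a (word, 1) pair for every word in one input line."""
    for word in line.lower().split():
        yield word, 1


def shuffle(pairs: Iterable[Tuple[str, int]]) -> Dict[str, List[int]]:
    """Shuffle: group all values that share the same key."""
    groups: Dict[str, List[int]] = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key: str, values: List[int]) -> Tuple[str, int]:
    """Reduce: collapse the values for one key into a single result."""
    return key, sum(values)


def word_count(lines: Iterable[str]) -> Dict[str, int]:
    pairs = (pair for line in lines for pair in map_phase(line))
    return dict(reduce_phase(k, v) for k, v in shuffle(pairs).items())


counts = word_count(["big data needs big clusters", "data moves fast"])
print(counts["big"], counts["data"], counts["fast"])  # 2 2 1
```

Because each line is mapped independently and each key is reduced independently, a framework is free to run these steps on different machines, which is what hides the disk reads and writes from the programmer.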

    IV. BIG DATA ANALYTICS USING HADOOP

    Hadoop:- Apache Hadoop is an open source Java framework for processing and querying vast amounts of data on large clusters of commodity hardware. Hadoop is a top-level Apache project, initiated and led by Yahoo! and Doug Cutting.

    Apache Hadoop has two main features:- the Hadoop Distributed File System (HDFS) and MapReduce.

    The current Apache Hadoop ecosystem consists of HDFS, MapReduce, Pig, Hive, Sqoop, HBase, and Zookeeper.

    MapReduce:- MapReduce is a programming framework for processing data in parallel across a cluster. It is a low-level programming model.

    HDFS:- HDFS is a Java-based file system that provides scalable and reliable data storage.

    Pig:- Because MapReduce is low level, it is difficult to implement jobs directly in it. Pig is a high-level, procedural data flow language.
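How block-based, replicated storage gives HDFS its reliability can be sketched as a toy model in Python. The 128 MB block size and replication factor of 3 are the commonly cited defaults of recent Hadoop releases and are treated here as assumptions; the round-robin placement is a simplification and is not HDFS's actual rack-aware policy. All names are hypothetical.

```python
import math
from typing import Dict, List, Set


def place_blocks(file_size_mb: float, nodes: List[str],
                 block_size_mb: int = 128, replication: int = 3) -> Dict[int, List[str]]:
    """Split a file into fixed-size blocks and assign each block to
    `replication` distinct nodes (simple round-robin, for illustration only)."""
    if replication > len(nodes):
        raise ValueError("need at least as many nodes as replicas")
    n_blocks = math.ceil(file_size_mb / block_size_mb)
    return {
        block: [nodes[(block + r) % len(nodes)] for r in range(replication)]
        for block in range(n_blocks)
    }


def readable_blocks(placement: Dict[int, List[str]], failed: Set[str]) -> Set[int]:
    """Blocks that still have at least one replica on a healthy node."""
    return {b for b, replicas in placement.items()
            if any(node not in failed for node in replicas)}


nodes = ["node1", "node2", "node3", "node4"]
placement = place_blocks(500, nodes)           # 500 MB file -> 4 blocks
still_ok = readable_blocks(placement, {"node1", "node2"})
print(len(placement), sorted(still_ok))        # 4 [0, 1, 2, 3]
```

With three replicas per block, the file in this example stays fully readable even after two of the four nodes fail.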

                  RDBMS        MapReduce
    Latency       Low          High
    Integrity     High         Low
    Language      SQL          Procedural (Java, C)
    Scaling       Nonlinear    Linear

    MapReduce is a good fit for problems that need to analyze the whole dataset in a batch fashion, whereas an RDBMS is good for point queries and updates.

    MapReduce suits applications where data is written once and read many times, while a database is good for datasets that are continually updated.

    MapReduce works on unstructured or semi-structured data, but an RDBMS works only on structured data.

    Relational data is normalized to retain its integrity and remove redundancy, but normalization poses a problem for MapReduce because it makes reading a record a nonlocal operation.
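The difference between the two access patterns can be illustrated with a small Python sketch. It uses the standard-library `sqlite3` module as a stand-in for an RDBMS and a plain full scan as a stand-in for a batch job; the `orders` table and its contents are made up for illustration, and nothing here uses Hadoop itself.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (id INTEGER PRIMARY KEY, customer TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO orders VALUES (?, ?, ?)",
    [(1, "alice", 30.0), (2, "bob", 12.5), (3, "alice", 7.5)],  # made-up rows
)

# RDBMS strength: a point query finds one row through the primary-key index.
point = conn.execute("SELECT amount FROM orders WHERE id = ?", (2,)).fetchone()

# RDBMS strength: a small in-place update of a single record.
conn.execute("UPDATE orders SET amount = ? WHERE id = ?", (15.0, 2))

# Batch style: read every record once and aggregate by key, as a
# MapReduce job would do over the whole dataset.
totals = {}
for customer, amount in conn.execute("SELECT customer, amount FROM orders"):
    totals[customer] = totals.get(customer, 0.0) + amount

print(point, totals)  # (12.5,) {'alice': 37.5, 'bob': 15.0}
conn.close()
```

The point query and the update touch a single row, while the aggregation has to visit all of them. The second pattern is the one that benefits from spreading the data across many disks.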

    VI. CONCLUSION