Hadoop and Big Data Analysis



Let's Hadoop


1. WHAT'S THE BIG DEAL WITH BIG DATA?


Big data is a term applied to data sets whose size is beyond the ability of commonly used software tools to capture, manage, and process the data within a tolerable elapsed time.

Gartner predicts 800% data growth over the next 5 years.

Big Data opens the door to a new approach to engaging customers and making decisions.


2. BIG DATA: WHAT ARE THE CHALLENGES?


How can we capture and deliver data to the right people in real time?

How can we understand and use big data when it comes in a variety of formats?

How can we store and analyze the data given its size and computational requirements?

While the storage capacities of hard drives have increased massively over the years, access speeds (the rate at which data can be read from drives) have not kept up. Example: we need to process a 100 TB dataset (worked through in the sketch below):

o On 1 node, scanning @ 50 MB/s: ~23 days
o On a 1000-node cluster, scanning @ 50 MB/s: ~33 min

o Hardware problems
o Need to process and combine data from multiple disks

Traditional systems can't scale, are not reliable, and are expensive.
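To make the arithmetic concrete, here is a minimal sketch (class name and constants are illustrative, not from the original slides) that reproduces the scan-time numbers quoted above:

```java
// Back-of-the-envelope scan times for a 100 TB dataset at 50 MB/s per disk.
public class ScanTime {
    public static void main(String[] args) {
        double datasetBytes = 100e12;     // 100 TB
        double bytesPerSecond = 50e6;     // 50 MB/s sequential read per node

        double oneNodeSeconds = datasetBytes / bytesPerSecond;   // single disk
        double clusterSeconds = oneNodeSeconds / 1000;           // 1000 nodes in parallel

        System.out.printf("1 node:     %.1f days%n", oneNodeSeconds / 86400);  // ~23.1 days
        System.out.printf("1000 nodes: %.1f min%n", clusterSeconds / 60);      // ~33.3 min
    }
}
```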


3. WHAT TECHNOLOGIES SUPPORT BIG DATA?


Scale out everything:
o Storage
o Compute


4. WHAT MAKES HADOOP DIFFERENT?


o Accessible: Hadoop runs on large clusters of commodity machines or in the cloud (e.g., EC2).
o Robust: Hadoop is architected with the assumption of frequent hardware malfunctions; it can gracefully handle most such failures.
o Scalable: Hadoop scales linearly to handle larger data by adding more nodes to the cluster.
o Simple: Hadoop allows users to quickly write efficient parallel code.
o Data locality: move computation to the data.
o Replication: replicate data across servers to deal with unreliable storage and servers (see the sketch after this list).
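As a concrete illustration of the replication point, a minimal sketch of setting HDFS's block replication factor via the dfs.replication property (normally configured in hdfs-site.xml; it defaults to 3):

```java
import org.apache.hadoop.conf.Configuration;

// Sketch only: override the default block replication factor programmatically.
public class ReplicationConfig {
    public static void main(String[] args) {
        Configuration conf = new Configuration();   // loads *-site.xml if present
        conf.set("dfs.replication", "3");           // keep three copies of each block
        System.out.println("replication = " + conf.get("dfs.replication"));
    }
}
```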


5. IS HADOOP A ONE-STOP SOLUTION?


Not good for...
o Real-time processing
o Small datasets
o Algorithms that require large amounts of cross-talk between nodes
o Problems that are CPU-bound

Good for...
o Batch analysis of very large (petabyte-scale) datasets


Hadoop is an open-source framework for writing and running distributed applications that process large amounts of data.

o Framework written in Java
o Designed to solve problems that involve analyzing large data (petabytes)
o Programming model based on Google's MapReduce
o Infrastructure based on Google's big data and distributed file system work (GFS)

Hadoop consists of two core components:
o The Hadoop Distributed File System (HDFS): a distributed file system
o MapReduce: distributed processing on compute clusters


NameNode
o Manages the file system namespace (metadata) and regulates access to files by clients.
o Executes namespace operations such as opening, closing, and renaming files and directories.

DataNode
o Manages storage attached to the node on which it runs.
o Serves read and write requests and performs block operations (creation, deletion, replication) upon request from the NameNode.
o There are many DataNodes, typically one DataNode per physical node.
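To make the client's view of this architecture concrete, here is a minimal sketch using the standard org.apache.hadoop.fs.FileSystem API (the path is hypothetical): the client asks the NameNode for metadata and block locations, while the bytes themselves stream to and from DataNodes.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsHello {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();  // reads core-site.xml / hdfs-site.xml
        FileSystem fs = FileSystem.get(conf);

        Path file = new Path("/user/demo/hello.txt");  // hypothetical path

        // Write: the NameNode allocates blocks; bytes stream to DataNodes.
        try (FSDataOutputStream out = fs.create(file, true)) {
            out.writeUTF("hello, hdfs");
        }

        // Read: the NameNode returns block locations; data comes from DataNodes.
        try (FSDataInputStream in = fs.open(file)) {
            System.out.println(in.readUTF());
        }

        fs.delete(file, false);  // a namespace operation handled by the NameNode
    }
}
```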


Large-scale data processing:
o Want to use 1000s of CPUs
o But don't want the hassle of managing things

The MapReduce architecture provides:
o Automatic parallelization & distribution
o Fault tolerance
o I/O scheduling
o Monitoring & status updates

MapReduce is a method for distributing a task across multiple nodes. Each node processes data stored on that node. It consists of two phases:
o Map
o Reduce


In the map phase, the mapper reads data in the form of key/value pairs and emits intermediate key/value pairs.

The reducer processes all output from the mappers, arrives at the final key/value pairs, and writes them to HDFS, as sketched below.
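A WordCount-style mapper/reducer pair, written against the org.apache.hadoop.mapreduce API, illustrates this key/value flow; the class names are illustrative, not from the original slides:

```java
import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

// Map phase: read (offset, line) pairs and emit (word, 1) for every word.
public class WordCountMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    protected void map(LongWritable key, Text value, Context ctx)
            throws IOException, InterruptedException {
        StringTokenizer tokens = new StringTokenizer(value.toString());
        while (tokens.hasMoreTokens()) {
            word.set(tokens.nextToken());
            ctx.write(word, ONE);
        }
    }
}

// Reduce phase: sum the counts for each word and write the final
// (word, total) pairs, which end up in HDFS.
class WordCountReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context ctx)
            throws IOException, InterruptedException {
        int sum = 0;
        for (IntWritable v : values) sum += v.get();
        ctx.write(key, new IntWritable(sum));
    }
}
```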

There are two types of nodes that control job execution:
o Jobtracker
o Tasktrackers

The jobtracker coordinates all the jobs run on the system by scheduling tasks to run on tasktrackers. Tasktrackers run tasks and send progress reports to the jobtracker. The jobtracker runs on the NameNode machine; tasktrackers run on the DataNodes.
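For completeness, a driver sketch that wires the mapper and reducer above into a job and submits it to the cluster (in these MR1 terms, to the jobtracker); the input and output paths are hypothetical:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCountDriver {
    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "word count");
        job.setJarByClass(WordCountDriver.class);
        job.setMapperClass(WordCountMapper.class);
        job.setCombinerClass(WordCountReducer.class);  // map-side pre-aggregation
        job.setReducerClass(WordCountReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path("/user/demo/input"));     // hypothetical
        FileOutputFormat.setOutputPath(job, new Path("/user/demo/output"));  // hypothetical
        System.exit(job.waitForCompletion(true) ? 0 : 1);  // block until the job finishes
    }
}
```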
