Hadoop and Big Data Analysis



Let's Hadoop


1. WHAT'S THE BIG DEAL WITH BIG DATA?


Big data is a term applied to data sets whose size is beyond the ability of commonly used software tools to capture, manage, and process the data within a tolerable elapsed time.

Gartner predicts 800% data growth over the next 5 years.

Big Data opens the door to a new approach to engaging customers and making decisions.


2. BIG DATA: WHAT ARE THE CHALLENGES?


How can we capture and deliver data to the right people in real time?

How can we understand and use big data when it comes in a variety of formats?

How can we store and analyze the data given its size and computational requirements?

While the storage capacities of hard drives have increased massively over the years, access speeds (the rate at which data can be read from drives) have not kept up. Example: we need to process a 100 TB dataset (worked through in the sketch below):

o On 1 node, scanning @ 50 MB/s: ~23 days
o On a 1000-node cluster, scanning @ 50 MB/s: ~33 min

o Hardware problems
o Need to process and combine data from multiple disks

Traditional systems can't scale, are not reliable, and are expensive.
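To make the arithmetic concrete, here is a minimal sketch (class name and constants are illustrative, not from the original slides) that reproduces the scan-time numbers quoted above:

```java
// Back-of-the-envelope scan times for a 100 TB dataset at 50 MB/s per disk.
public class ScanTime {
    public static void main(String[] args) {
        double datasetBytes = 100e12;     // 100 TB
        double bytesPerSecond = 50e6;     // 50 MB/s sequential read per node

        double oneNodeSeconds = datasetBytes / bytesPerSecond;   // single disk
        double clusterSeconds = oneNodeSeconds / 1000;           // 1000 nodes in parallel

        System.out.printf("1 node:     %.1f days%n", oneNodeSeconds / 86400);  // ~23.1 days
        System.out.printf("1000 nodes: %.1f min%n", clusterSeconds / 60);      // ~33.3 min
    }
}
```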


3. WHAT TECHNOLOGIES SUPPORT BIG DATA?


Scale out everything:
o Storage
o Compute


4. WHAT MAKES HADOOP DIFFERENT?


o Accessible: Hadoop runs on large clusters of commodity machines or in the cloud (e.g., EC2).
o Robust: Hadoop is architected with the assumption of frequent hardware malfunctions; it can gracefully handle most such failures.
o Scalable: Hadoop scales linearly to handle larger data by adding more nodes to the cluster.
o Simple: Hadoop allows users to quickly write efficient parallel code.
o Data locality: move computation to the data.
o Replication: replicate data across servers to deal with unreliable storage and servers (see the sketch after this list).
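As a concrete illustration of the replication point, a minimal sketch of setting HDFS's block replication factor via the dfs.replication property (normally configured in hdfs-site.xml; it defaults to 3):

```java
import org.apache.hadoop.conf.Configuration;

// Sketch only: override the default block replication factor programmatically.
public class ReplicationConfig {
    public static void main(String[] args) {
        Configuration conf = new Configuration();   // loads *-site.xml if present
        conf.set("dfs.replication", "3");           // keep three copies of each block
        System.out.println("replication = " + conf.get("dfs.replication"));
    }
}
```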


5. IS HADOOP A ONE-STOP SOLUTION?


Not good for...
o Real-time processing
o Small datasets
o Algorithms that require large amounts of cross-talk between nodes
o Problems that are CPU-bound

Good for...
o Batch analysis of very large (petabyte-scale) datasets


Hadoop is an open-source framework for writing and running distributed applications that process large amounts of data.

o Framework written in Java
o Designed to solve problems that involve analyzing large data (petabytes)
o Programming model based on Google's MapReduce
o Infrastructure based on Google's big data and distributed file system work (GFS)

Hadoop consists of two core components:
o The Hadoop Distributed File System (HDFS): a distributed file system
o MapReduce: distributed processing on compute clusters


NameNode
o Manages the file system namespace (metadata) and regulates access to files by clients.
o Executes namespace operations such as opening, closing, and renaming files and directories.

DataNode
o Manages storage attached to the node on which it runs.
o Serves read and write requests and performs block operations (creation, deletion, replication) upon request from the NameNode.
o There are many DataNodes, typically one DataNode per physical node.
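To make the client's view of this architecture concrete, here is a minimal sketch using the standard org.apache.hadoop.fs.FileSystem API (the path is hypothetical): the client asks the NameNode for metadata and block locations, while the bytes themselves stream to and from DataNodes.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsHello {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();  // reads core-site.xml / hdfs-site.xml
        FileSystem fs = FileSystem.get(conf);

        Path file = new Path("/user/demo/hello.txt");  // hypothetical path

        // Write: the NameNode allocates blocks; bytes stream to DataNodes.
        try (FSDataOutputStream out = fs.create(file, true)) {
            out.writeUTF("hello, hdfs");
        }

        // Read: the NameNode returns block locations; data comes from DataNodes.
        try (FSDataInputStream in = fs.open(file)) {
            System.out.println(in.readUTF());
        }

        fs.delete(file, false);  // a namespace operation handled by the NameNode
    }
}
```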


Large-scale data processing:
o Want to use 1000s of CPUs
o But don't want the hassle of managing things

The MapReduce architecture provides:
o Automatic parallelization & distribution
o Fault tolerance
o I/O scheduling
o Monitoring & status updates

MapReduce is a method for distributing a task across multiple nodes. Each node processes data stored on that node. It consists of two phases:
o Map
o Reduce


In the map phase, the mapper reads data in the form of key/value pairs and emits intermediate key/value pairs.

The reducer processes all output from the mappers, arrives at the final key/value pairs, and writes them to HDFS, as sketched below.
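A WordCount-style mapper/reducer pair, written against the org.apache.hadoop.mapreduce API, illustrates this key/value flow; the class names are illustrative, not from the original slides:

```java
import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

// Map phase: read (offset, line) pairs and emit (word, 1) for every word.
public class WordCountMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    protected void map(LongWritable key, Text value, Context ctx)
            throws IOException, InterruptedException {
        StringTokenizer tokens = new StringTokenizer(value.toString());
        while (tokens.hasMoreTokens()) {
            word.set(tokens.nextToken());
            ctx.write(word, ONE);
        }
    }
}

// Reduce phase: sum the counts for each word and write the final
// (word, total) pairs, which end up in HDFS.
class WordCountReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context ctx)
            throws IOException, InterruptedException {
        int sum = 0;
        for (IntWritable v : values) sum += v.get();
        ctx.write(key, new IntWritable(sum));
    }
}
```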

There are two types of nodes that control job execution:
o Jobtracker
o Tasktrackers

The jobtracker coordinates all the jobs run on the system by scheduling tasks to run on tasktrackers. Tasktrackers run tasks and send progress reports to the jobtracker. The jobtracker runs on the NameNode machine; tasktrackers run on the DataNodes.
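For completeness, a driver sketch that wires the mapper and reducer above into a job and submits it to the cluster (in these MR1 terms, to the jobtracker); the input and output paths are hypothetical:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCountDriver {
    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "word count");
        job.setJarByClass(WordCountDriver.class);
        job.setMapperClass(WordCountMapper.class);
        job.setCombinerClass(WordCountReducer.class);  // map-side pre-aggregation
        job.setReducerClass(WordCountReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path("/user/demo/input"));     // hypothetical
        FileOutputFormat.setOutputPath(job, new Path("/user/demo/output"));  // hypothetical
        System.exit(job.waitForCompletion(true) ? 0 : 1);  // block until the job finishes
    }
}
```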
