hadoop-quick introduction


Upload: sandeep-singh

Post on 16-Apr-2017


TRANSCRIPT

Page 1: Hadoop-Quick introduction

“Data is a precious thing and will last longer than the systems themselves.”

– Tim Berners-Lee

Page 2: Hadoop-Quick introduction

Sandeep Kumar

Page 3: Hadoop-Quick introduction

What is Data?

• What is data?
• And why should we care about it?

Page 4: Hadoop-Quick introduction

What is Big Data?

• Big data is a collection of data sets so large and complex that it becomes difficult to process using traditional data processing applications.

Page 5: Hadoop-Quick introduction

A Few Examples

• Web logs
• RFID
• Social data: Facebook, LinkedIn, Twitter
• Call detail records
• Large-scale e-commerce
• Medical records
• Video archives
• Atmospheric science
• Astronomy
• Feeds
• Media & advertising

Page 6: Hadoop-Quick introduction

What is Big Data?

• Ancestry.com stores around 2.5 petabytes of data.
• The New York Stock Exchange generates about one terabyte of new trade data per day.
• The Internet Archive stores around 2 petabytes of data, and is growing at a rate of 20 terabytes per month. (http://archive.org/web/web.php)

Page 7: Hadoop-Quick introduction

How to Process Big Data?

• Need to process large datasets (>100 TB)
• Just reading 100 TB of data can be overwhelming
• Takes ~11 days to read on a standard computer
• Takes about a day across a 10 Gbit/s link (very high-end storage solution)
• On a single node (@ 50 MB/s): 23 days
• On a 1000-node cluster: 33 min
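The single-node and cluster figures above can be reproduced with a few lines of arithmetic. This is only a rough sketch: it assumes decimal units (1 TB = 10^12 bytes), data spread evenly across nodes, and no coordination overhead. The helper name `read_time_seconds` is made up for illustration.

```python
TB = 10**12  # decimal terabyte in bytes (assumption)
MB = 10**6   # decimal megabyte in bytes (assumption)


def read_time_seconds(data_bytes, bytes_per_second, nodes=1):
    """Time to scan data_bytes when `nodes` machines read in parallel.

    Assumes an even data spread and perfect parallelism (no overhead).
    """
    return data_bytes / (bytes_per_second * nodes)


DATA = 100 * TB

single_node_days = read_time_seconds(DATA, 50 * MB) / 86_400
cluster_minutes = read_time_seconds(DATA, 50 * MB, nodes=1000) / 60

print(f"single node @ 50 MB/s: {single_node_days:.1f} days")
print(f"1000 nodes  @ 50 MB/s: {cluster_minutes:.1f} minutes")
```

The results come out at roughly 23 days and 33 minutes, which is where the numbers on the slide come from.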

Page 8: Hadoop-Quick introduction

Not so easy………..

• The challenges are in search, sharing, transfer, visualization, etc.
• Moving data from a storage cluster to a computation cluster is not feasible.
• In a large cluster, failure is expected. Computers fail every day.
• It is very expensive to build reliability into each application.
• Massively parallel software runs on tens, hundreds, or even thousands of servers.
• A programmer has to worry about errors, data motion, and communication.

Page 9: Hadoop-Quick introduction

What We are looking for.

Page 10: Hadoop-Quick introduction

What we are looking for.

• A common infrastructure and standard set of tools to handle this complexity.

• An efficient, reliable, fault-tolerant, and usable framework.

Page 11: Hadoop-Quick introduction

What is Hadoop?

• It's a framework that allows distributed processing of large data sets across clusters of computers.

• It is designed to scale up from single servers to thousands of machines.

• It's also designed to run on commodity hardware.

Page 12: Hadoop-Quick introduction

Hadoop is….

• Scalable: stores and processes petabytes; scales by adding hardware, which can be added without needing to change data formats.
• Economical: runs on 1000s of commodity machines.
• Efficient: runs tasks where the data is located.
• Flexible: Hadoop is schema-less and can absorb any type of data, structured or not, from any number of sources.
• Fault tolerant: when you lose a node, the system redirects work to another location of the data and continues processing without missing a beat.

Page 13: Hadoop-Quick introduction

Hadoop is useful for…….

• Batch data processing.
• Log processing.
• Document analysis & indexing.
• Text mining.
• Crawl data processing.
• Highly parallel, data-intensive distributed applications.

Page 14: Hadoop-Quick introduction

Use The Right Tool For The Right Job: Hadoop vs. RDBMS

When to use Hadoop?
• Write once, read many times
• Structured or not (agility)
• Batch processing

When to use an RDBMS?
• Interactive reporting (<1 sec)
• Multistep transactions
• Lots of inserts/updates/deletes

Page 21: Hadoop-Quick introduction

Hadoop Terminology…….

[Diagram: a Hadoop Cluster made up of three racks, each containing several nodes]

Rack 1: Node 1, Node 2, ..., Node 3
Rack 2: Node 1, Node 2, ..., Node 3
Rack 3: Node 1, Node 2, ..., Node 3

Page 22: Hadoop-Quick introduction

Hadoop Framework…….

Page 23: Hadoop-Quick introduction

Hadoop Nodes…….

• HDFS nodes
  - NameNode (master)
  - DataNode (slaves)
  - Checkpoint Node
  - Secondary NameNode (deprecated)
  - Backup Node

Page 24: Hadoop-Quick introduction

Hadoop Nodes…….

• MapReduce nodes
  - JobTracker (master)
  - TaskTracker (slaves)

Page 25: Hadoop-Quick introduction

Hadoop Nodes-Overview

Page 26: Hadoop-Quick introduction

Hadoop Nodes-NameNode

• Manages the filesystem namespace and metadata
• Replicates missing blocks
• No data goes through the NameNode
• NameNode mainly consists of:
  - fsimage: contains a checkpoint copy of the metadata on disk
  - edit logs: record all write operations and synchronize with the metadata in RAM after each write
• In case of a ‘power failure’ on the NameNode, it can recover using fsimage + edit logs
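To make the "fsimage + edit logs" recovery idea concrete, here is a toy sketch. It is not how HDFS stores or parses these files: the dict-based namespace, the tuple format of the edits, and the function names `apply_edit` and `recover` are all assumptions made up for illustration.

```python
def apply_edit(namespace, edit):
    """Apply one logged write operation to an in-memory namespace dict."""
    op, path = edit[0], edit[1]
    if op == "create":
        namespace[path] = edit[2]                  # path -> list of block ids
    elif op == "delete":
        namespace.pop(path, None)
    elif op == "rename":
        namespace[edit[2]] = namespace.pop(path)   # edit[2] is the new path
    else:
        raise ValueError(f"unknown operation: {op}")


def recover(fsimage, edit_log):
    """Rebuild the namespace: load the checkpoint, then replay the edits in order."""
    namespace = {path: list(blocks) for path, blocks in fsimage.items()}
    for edit in edit_log:
        apply_edit(namespace, edit)
    return namespace


# Hypothetical checkpoint and the writes logged after it was taken.
fsimage = {"/logs/a.log": ["blk_1", "blk_2"]}
edit_log = [
    ("create", "/logs/b.log", ["blk_3"]),
    ("rename", "/logs/a.log", "/archive/a.log"),
    ("delete", "/logs/b.log"),
]

recovered = recover(fsimage, edit_log)
print(recovered)
```

The checkpoint alone would be stale; replaying the logged writes on top of it brings the metadata back to the state it had just before the failure.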

Page 27: Hadoop-Quick introduction

Hadoop Nodes-Checkpoint Node

• Periodically creates checkpoints of the NameNode filesystem
• The Checkpoint node should run on a different machine than the NameNode
• Should have the same storage requirements as the NameNode
• There can be many Checkpoint nodes per cluster

Page 28: Hadoop-Quick introduction

Hadoop Nodes-Backup Node

• The difference from the Checkpoint node is that it keeps an up-to-date copy of the metadata in RAM
• Same RAM requirements as the NameNode
• Can only have one Backup node per cluster

Page 29: Hadoop-Quick introduction

Hadoop Nodes-DataNode

Can be many per Hadoop cluster

• Manages blocks with data and serves them to clients
• Periodically reports to the NameNode the list of blocks it stores
• Use inexpensive commodity hardware for this node
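The periodic block reports are what let the NameNode know where each block lives and which blocks are short of replicas. The sketch below is a simplified model, not HDFS code: the report format, the function names, and the target of 3 replicas are assumptions for illustration.

```python
from collections import defaultdict


def build_block_map(block_reports):
    """Map each block id to the set of DataNodes that reported holding it."""
    block_map = defaultdict(set)
    for datanode, blocks in block_reports.items():
        for block in blocks:
            block_map[block].add(datanode)
    return block_map


def under_replicated(block_map, expected_blocks, replication=3):
    """Return the blocks that have fewer reported replicas than the target."""
    return sorted(
        block for block in expected_blocks
        if len(block_map.get(block, ())) < replication
    )


# Hypothetical block reports from three DataNodes.
reports = {
    "datanode-1": ["blk_1", "blk_2"],
    "datanode-2": ["blk_1", "blk_2"],
    "datanode-3": ["blk_1"],
}

block_map = build_block_map(reports)
needs_copy = under_replicated(block_map, ["blk_1", "blk_2", "blk_3"])
print(needs_copy)
```

Here `blk_2` has only two replicas and `blk_3` was not reported by anyone, so both would be candidates for re-replication.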

Page 30: Hadoop-Quick introduction

Hadoop Nodes-JobTracker

One per Hadoop cluster (multiple NameNodes can be configured in Hadoop 2.2 or later versions)

• Receives job requests submitted by clients
• Schedules and monitors MapReduce jobs on TaskTrackers

Page 31: Hadoop-Quick introduction

Hadoop Nodes-TaskTracker

• Can be many per Hadoop cluster
• Executes MapReduce operations
• Reads blocks from DataNodes
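One way to picture how scheduling and data locality fit together is a greedy assignment that prefers a node already holding the block. This is a deliberately simplified sketch and not the actual JobTracker algorithm: the function name `assign_tasks`, the slot model, and the node names are assumptions for illustration.

```python
def assign_tasks(block_locations, free_slots):
    """Assign one map task per block, preferring a node that stores the block.

    block_locations: block id -> list of nodes holding a replica
    free_slots:      node -> number of free task slots
    Returns block id -> (node, is_local).
    """
    slots = dict(free_slots)  # copy so the caller's dict is not modified
    assignment = {}
    for block, replicas in block_locations.items():
        local = [node for node in replicas if slots.get(node, 0) > 0]
        if local:
            node, is_local = local[0], True
        else:
            candidates = [node for node, free in slots.items() if free > 0]
            if not candidates:
                break  # no free slots left; remaining tasks would have to wait
            node, is_local = candidates[0], False
        slots[node] -= 1
        assignment[block] = (node, is_local)
    return assignment


# Hypothetical cluster state.
locations = {
    "blk_1": ["node-1", "node-2"],
    "blk_2": ["node-1"],
    "blk_3": ["node-3"],
}
plan = assign_tasks(locations, {"node-1": 1, "node-2": 1, "node-3": 0})
print(plan)
```

In this run `blk_1` is processed locally on `node-1`, `blk_2` falls back to a remote read from `node-2`, and `blk_3` stays unscheduled until a slot frees up.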

Page 32: Hadoop-Quick introduction

MapReduce

It offers:

• Operates on key and value pairs.
• Two major functions: Map() and Reduce()
• Input formats and splits
• Number of tasks.
• Provides status about jobs to users
• Monitors task progress
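The link between input splits and the number of map tasks can be sketched as one task per split. This is a simplification (real input formats apply extra rules when cutting splits), and the 128 MB split size and 1000 MB file below are only assumed example values.

```python
def num_map_tasks(file_size_bytes, split_size_bytes):
    """One map task per input split, using integer ceiling division."""
    if split_size_bytes <= 0:
        raise ValueError("split size must be positive")
    return (file_size_bytes + split_size_bytes - 1) // split_size_bytes


MB = 1024 * 1024

# Hypothetical 1000 MB input file with a 128 MB split size.
tasks = num_map_tasks(1000 * MB, 128 * MB)
print(tasks)
```

A larger split size means fewer, longer-running map tasks; a smaller one means more parallelism but more per-task overhead.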

Page 33: Hadoop-Quick introduction

Map Reduce Diagram

Page 34: Hadoop-Quick introduction

Map Reduce Architecture.

Page 35: Hadoop-Quick introduction

Map Reduce Job

[Diagram: client, JobTracker, NameNode, TaskTrackers & DataNodes]

1. Submit job (client → JobTracker)
2. Job status (jobid, etc.) (JobTracker → client)
3. Namespace info (NameNode)
4. Tasks (TaskTrackers & DataNodes)

Page 36: Hadoop-Quick introduction

Input Output

The MapReduce framework operates on <key, value> pairs.

It views the input to the job as a set of <key, value> pairs and produces a set of <key, value> pairs as the output of the job.

Page 37: Hadoop-Quick introduction

Input Output

Input and Output types of a MapReduce job:

(input) <k1, v1> -> map -> <k2, v2> -> combine -> <k2, v2> -> reduce -> <k3, v3> (output)

Reference:

http://hadoop.apache.org/docs/r1.0.4/mapred_tutorial.html
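The map -> combine -> reduce flow can be imitated in plain Python with a word count. This is an in-memory sketch of the data flow only, not the Hadoop API: the function names, the way splits are represented as lists of lines, and the sample input are assumptions for illustration.

```python
from collections import defaultdict


def map_fn(_offset, line):
    """map: <k1, v1> = (offset, line) -> list of <k2, v2> = (word, 1)."""
    return [(word.lower(), 1) for word in line.split()]


def combine_fn(word, counts):
    """combine: pre-aggregate one map task's output, <k2, v2> -> <k2, v2>."""
    return (word, sum(counts))


def reduce_fn(word, counts):
    """reduce: <k2, [v2, ...]> -> <k3, v3> = (word, total count)."""
    return (word, sum(counts))


def group_by_key(pairs):
    """Collect all values that share a key."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def run_job(splits):
    combined = []
    for split in splits:  # each split stands in for one map task
        mapped = []
        for offset, line in enumerate(split):
            mapped.extend(map_fn(offset, line))
        combined.extend(
            combine_fn(key, values) for key, values in group_by_key(mapped).items()
        )
    shuffled = group_by_key(combined)  # shuffle: group values from all map tasks
    return dict(reduce_fn(key, values) for key, values in sorted(shuffled.items()))


# Two hypothetical input splits, one line each.
splits = [["big data is big"], ["data is precious"]]
result = run_job(splits)
print(result)
```

The combiner is optional: it only shrinks the data that has to travel between map and reduce, which is why its input and output types are both <k2, v2>.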

Page 38: Hadoop-Quick introduction

HDFS Architecture

Page 39: Hadoop-Quick introduction

Hadoop Tools…….

• Hive

It's a data warehouse system for Hadoop, providing data summarization, query, and analysis.

Page 40: Hadoop-Quick introduction

Hadoop Tools…….

• Pig

It's a high-level platform for creating MapReduce programs used with Hadoop. Developed by Yahoo.

Page 41: Hadoop-Quick introduction

Hadoop Tools…….

• HBase

Used when you need random, real-time read/write access to your Big Data.

Also used for storing historical data.

Page 42: Hadoop-Quick introduction

Hadoop Tools…….

• Hue

It's a web application for interacting with Apache Hadoop. It supports a file browser, a job tracker interface, Hive, Pig, and more.

Page 43: Hadoop-Quick introduction

Hadoop Tools…….

• Sqoop

It's a command-line interface application for transferring data between relational databases and Hadoop.

Microsoft uses a Sqoop-based connector to help transfer data from Microsoft SQL Server databases to Hadoop.

Page 44: Hadoop-Quick introduction

Hadoop Tools…….

• Flume

It's used for efficiently collecting, aggregating, and moving large amounts of distributed data or log data.

Page 45: Hadoop-Quick introduction

Hadoop Tools…….

• Flume Model

Page 46: Hadoop-Quick introduction

Hadoop in the Enterprise…….

Page 47: Hadoop-Quick introduction

Many tools have been developed on top of Hadoop; they are commercially available and widely used in industry. More information is available from Cloudera, Hortonworks, and Google.com.

Page 48: Hadoop-Quick introduction

Thanks for your time today.