Apache Hadoop: Big Data Technology


Er. Jay Nagar (Technology Researcher), +91-9601957620

What is Apache Hadoop?

An open-source software framework designed for the storage and processing of large-scale data on clusters of commodity hardware.

Created by Doug Cutting and Mike Cafarella in 2005.

Cutting named the project after his son's toy elephant.

Uses for Hadoop

Data-intensive text processing

Assembly of large genomes

Graph mining

Machine learning and data mining

Large scale social network analysis

Who Uses Hadoop?

The Hadoop Ecosystem

How much data?

Facebook: 500 TB per day

Yahoo!: over 170 PB

eBay: over 6 PB

Getting the data to the processors becomes the bottleneck

Hadoop:

an open-source software framework that supports data-intensive distributed applications, licensed under the Apache v2 license.

Goals / Requirements:

Abstract and facilitate the storage and processing of large and/or rapidly growing data sets

Structured and unstructured data

Simple programming models

High scalability and availability

Use commodity (cheap!) hardware with little redundancy

Fault-tolerance

Move computation rather than data

Hadoop Framework Tools

Hadoop's Architecture

Distributed, with some centralization

The main nodes of the cluster are where most of the system's computational power and storage lie

Main nodes run a TaskTracker to accept and reply to MapReduce tasks, and also a DataNode to store needed blocks as close as possible

A central control node runs the NameNode to keep track of HDFS directories and files, and the JobTracker to dispatch compute tasks to the TaskTrackers

Written in Java, also supports Python and Ruby

Hadoop's Architecture

Hadoop Distributed Filesystem

Tailored to needs of MapReduce

Targeted towards many reads of filestreams

Writes are more costly

High degree of data replication (3x by default)

No need for RAID on normal nodes

Large block size (64 MB by default)

Location awareness of DataNodes in network
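To make these defaults concrete, here is a minimal client-side sketch, assuming a running Hadoop 1.x cluster and the standard org.apache.hadoop.fs API; the path is hypothetical. It writes a file with an explicit 3x replication factor and a 64 MB block size:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsWriteSketch {
    public static void main(String[] args) throws Exception {
        // Picks up core-site.xml / hdfs-site.xml from the classpath
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);

        // create(path, overwrite, bufferSize, replication, blockSize):
        // 3 replicas and a 64 MB block size, the classic defaults
        Path out = new Path("/user/demo/sample.txt"); // hypothetical path
        FSDataOutputStream stream = fs.create(out, true, 4096,
                (short) 3, 64L * 1024 * 1024);
        stream.writeUTF("hello hdfs");
        stream.close();
    }
}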

Hadoop is in use at most organizations that handle big data: Yahoo!, Facebook, Amazon, Netflix, etc.

Some examples of scale: Yahoo!'s Search Webmap runs on a 10,000-core Linux cluster and powers Yahoo! web search

Facebook's Hadoop cluster hosts 100+ PB of data (July 2012) and is growing at PB scale per day (Nov 2012)

Hadoop's Architecture

NameNode:

Stores metadata for the files, like the directory structure of a typical FS.

The server holding the NameNode instance is critical, as there is only one (a single point of failure).

Maintains a transaction log for file deletes/adds, etc. Transactions cover only metadata, not whole blocks or file streams.

Handles creation of more replica blocks when necessary after a DataNode failure
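As a concrete illustration, a client can ask the NameNode for a file's metadata without contacting any DataNode. A minimal sketch, assuming the standard FileSystem API and a hypothetical path:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class NameNodeMetadataSketch {
    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());

        // All of this information is served from the NameNode's metadata,
        // not from the DataNodes that hold the actual blocks
        FileStatus status = fs.getFileStatus(new Path("/user/demo/sample.txt"));
        System.out.println("length:      " + status.getLen());
        System.out.println("replication: " + status.getReplication());
        System.out.println("block size:  " + status.getBlockSize());
    }
}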

Hadoop's Architecture

DataNode:

Stores the actual data in HDFS

Can run on any underlying filesystem (ext3/4, NTFS, etc.)

Notifies NameNode of what blocks it has

The NameNode places replicas rack-aware: by default, one replica on the writer's local rack and two on a single remote rack

Hadoop's Architecture: MapReduce Engine

Hadoop's Architecture

MapReduce Engine:

JobTracker & TaskTracker

The JobTracker splits the input into smaller tasks (map tasks) and sends them to the TaskTracker process on each node

Each TaskTracker reports back to the JobTracker on task progress, sends results (reduce output), or requests new work
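From the client's side, handing a job to the JobTracker looks like the sketch below, assuming the Hadoop 1.x org.apache.hadoop.mapreduce API; WordCountMapper and WordCountReducer are the classes defined in the word-count sketch later in the deck:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCountDriver {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = new Job(conf, "word count"); // Hadoop 1.x style

        job.setJarByClass(WordCountDriver.class);
        job.setMapperClass(WordCountMapper.class);   // see word-count sketch below
        job.setReducerClass(WordCountReducer.class); // see word-count sketch below
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);

        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));

        // Submits the job to the JobTracker, which splits it into
        // map tasks (one per input split) and reduce tasks
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}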

HDFS Basic Concepts

HDFS is a file system written in Java, based on Google's GFS

Provides redundant storage for massive amounts of data

HDFS Basic Concepts

HDFS works best with a smaller number of large files

Millions as opposed to billions of files

Typically 100 MB or more per file

Files in HDFS are write-once

Optimized for streaming reads of large files and not random reads

How are Files Stored

Files are split into blocks

Blocks are spread across many machines at load time

Different blocks from the same file will be stored on different machines

Blocks are replicated across multiple machines

The NameNode keeps track of which blocks make up a file and where they are stored (see the sketch below)

Default replication is 3-fold
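The sketch mentioned above: the block-to-machine mapping the NameNode maintains is queryable by clients. Assuming the standard FileSystem API and a hypothetical path, this prints where each block of a file lives:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.BlockLocation;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class BlockLocationsSketch {
    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());
        FileStatus status = fs.getFileStatus(new Path("/user/demo/sample.txt"));

        // The NameNode answers this from its metadata: one entry per
        // block, each listing the DataNodes holding a replica
        BlockLocation[] blocks =
                fs.getFileBlockLocations(status, 0, status.getLen());
        for (BlockLocation block : blocks) {
            System.out.println("offset " + block.getOffset() + " on "
                    + String.join(", ", block.getHosts()));
        }
    }
}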

Data Replication

Default replication is 3-fold

MapReduce

Distributing computation across nodes

MapReduce Overview

A method for distributing computation across multiple nodes

Each node processes the data that is stored at that node

Consists of two main phases: Map and Reduce

MapReduce Features

Automatic parallelization and distribution

Fault-Tolerance

Provides a clean abstraction for programmers to use

The Mapper

Reads data as key/value pairs

The key is often discarded

Outputs zero or more key/value pairs

Shuffle and Sort

Output from the mapper is sorted by key

All values with the same key are guaranteed to go to the same machine
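That guarantee comes from the partitioner, which maps each key to a reducer. A sketch modelled on the behavior of Hadoop's default HashPartitioner (the class name here is hypothetical):

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Partitioner;

// Assigns every occurrence of a key to the same reducer, which is
// what guarantees that all values for a key meet on one machine
public class WordPartitioner extends Partitioner<Text, IntWritable> {
    @Override
    public int getPartition(Text key, IntWritable value, int numPartitions) {
        // Mask off the sign bit so the result is non-negative
        return (key.hashCode() & Integer.MAX_VALUE) % numPartitions;
    }
}

Wiring it in is one line in the driver (job.setPartitionerClass(WordPartitioner.class)), though Hadoop applies this hash scheme by default.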

The Reducer

Called once for each unique key

Gets a list of all values associated with a key as input

The reducer outputs zero or more final key/value pairs

Usually just one output per input key

MapReduce: Word Count
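A minimal reconstruction of the word-count example, assuming the Hadoop 1.x org.apache.hadoop.mapreduce API; the class names match the driver sketch earlier in the deck, and each class would live in its own file:

import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// Map phase: emit (word, 1) for every word; the input key (the byte
// offset of the line) is discarded, as the Mapper slide noted
public class WordCountMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        StringTokenizer tokens = new StringTokenizer(value.toString());
        while (tokens.hasMoreTokens()) {
            word.set(tokens.nextToken());
            context.write(word, ONE);
        }
    }
}

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

// Reduce phase: called once per unique word with all of its counts;
// emits a single (word, total) pair per input key
public class WordCountReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
        int sum = 0;
        for (IntWritable v : values) {
            sum += v.get();
        }
        context.write(key, new IntWritable(sum));
    }
}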

Overview

NameNode: holds the metadata for HDFS

Secondary NameNode: performs housekeeping functions for the NameNode

DataNode: stores the actual HDFS data blocks

JobTracker: manages MapReduce jobs

TaskTracker: monitors individual map and reduce tasks

The NameNode

Stores the HDFS file system information in an fsimage file

Updates to the file system (add/remove blocks) do not change the fsimage file

They are instead written to a log file

When starting, the NameNode loads the fsimage file and then applies the changes in the log file

The Secondary NameNode

NOT a backup for the NameNode

Periodically reads the log file and applies its changes to the fsimage file, bringing it up to date

Allows the NameNode to restart faster when required

JobTracker and TaskTracker

JobTracker

Determines the execution plan for the job

Assigns individual tasks

TaskTracker

Keeps track of the performance of an individual mapper or reducer

Hadoop Ecosystem

Other available tools

Why do these tools exist?

MapReduce is very powerful, but can be awkward to master

These tools allow programmers who are familiar with other programming styles to take advantage of the power of MapReduce

Other Tools

Hive: Hadoop processing with SQL

Pig: Hadoop processing with scripting

Cascading: pipe-and-filter processing model

HBase: database model built on top of Hadoop

Flume: designed for large-scale data movement

[Diagram: MapReduce data flow — each Mapper emits intermediate key/value pairs, a Partitioner assigns each intermediate key to a Reducer, the shuffle phase moves the intermediates between machines, and the Reducers produce the final output.]