Apache Hadoop: Big Data Technology
TRANSCRIPT of a PowerPoint presentation
Er. Jay Nagar (Technology Researcher), +91-9601957620
What is Apache Hadoop?
Open-source software framework designed for storage and processing of large-scale data on clusters of commodity hardware
Created by Doug Cutting and Mike Cafarella in 2005.
Cutting named the project after his son's toy elephant.
Uses for Hadoop
Data-intensive text processing
Assembly of large genomes
Graph mining
Machine learning and data mining
Large scale social network analysis
Who Uses Hadoop?
The Hadoop Ecosystem
How much data?
Facebook: 500 TB per day
Yahoo: over 170 PB
eBay: over 6 PB
Getting the data to the processors becomes the bottleneck
Hadoop:
an open-source software framework that supports data-intensive distributed applications, licensed under the Apache v2 license.
Goals / Requirements:
Abstract and facilitate the storage and processing of large and/or rapidly growing data sets
Structured and non-structured data
Simple programming models
High scalability and availability
Use commodity (cheap!) hardware with little redundancy
Fault-tolerance
Move computation rather than data
Hadoop Framework Tools
Hadoop's Architecture
Distributed, with some centralization
Main nodes of cluster are where most of the computational power and storage of the system lies
Main nodes run TaskTracker to accept and reply to MapReduce tasks, and also DataNode to store needed blocks as close as possible
Central control node runs NameNode to keep track of HDFS directories & files, and JobTracker to dispatch compute tasks to TaskTracker
Written in Java; also supports Python and Ruby (e.g., via Hadoop Streaming)
Hadoop's Architecture
Hadoop Distributed Filesystem
Tailored to needs of MapReduce
Targeted towards many reads of filestreams
Writes are more costly
High degree of data replication (3x by default)
No need for RAID on normal nodes
Large blocksize (64MB)
Location awareness of DataNodes in network
Hadoop is in use at most organizations that handle big data:
Yahoo!
Facebook
Amazon
Netflix
etc.
Some examples of scale: Yahoo!'s Search Webmap runs on a 10,000-core Linux cluster and powers Yahoo! web search
Facebook's Hadoop cluster hosts 100+ PB of data (July 2012) and is growing at a PB per day (Nov 2012)
Hadoop's Architecture
NameNode:
Stores metadata for the files, like the directory structure of a typical FS.
The server holding the NameNode instance is quite crucial, as there is only one.
Transaction log for file deletes/adds, etc. Does not use transactions for whole blocks or file-streams, only metadata.
Handles creation of more replica blocks when necessary after a DataNode failure
Hadoop's Architecture
DataNode:
Stores the actual data in HDFS
Can run on any underlying filesystem (ext3/4, NTFS, etc)
Notifies NameNode of what blocks it has
NameNode replicates blocks 2x in local rack, 1x elsewhere
Hadoop's Architecture: MapReduce Engine
JobTracker & TaskTracker
JobTracker splits up data into smaller tasks (Map) and sends them to the TaskTracker process on each node
TaskTracker reports back to the JobTracker node on job progress, sends data (Reduce), or requests new jobs
HDFS Basic Concepts
HDFS is a file system written in Java, based on Google's GFS
Provides redundant storage for massive amounts of data
HDFS Basic Concepts
HDFS works best with a smaller number of large files
Millions as opposed to billions of files
Typically 100 MB or more per file
Files in HDFS are write once
Optimized for streaming reads of large files and not random reads
How are Files Stored
Files are split into blocks
Blocks are split across many machines at load time
Different blocks from the same file will be stored on different machines
Blocks are replicated across multiple machines
The NameNode keeps track of which blocks make up a file and where they are stored
Data Replication
Default replication is 3-fold
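The block and replication defaults above can be illustrated with some back-of-the-envelope arithmetic. This is a sketch of the math, not Hadoop API code; the function names are illustrative, and the 64 MB block size and 3x replication factor are the classic defaults quoted in these slides.

```python
# Sketch: how a file divides into HDFS-style blocks and how much raw
# storage 3x replication consumes. Illustrative only -- not Hadoop code.
BLOCK_SIZE = 64 * 1024 * 1024   # 64 MB, the classic HDFS default
REPLICATION = 3                 # default replication factor

def block_count(file_size_bytes):
    """Number of blocks needed; the last block may be partially filled."""
    return -(-file_size_bytes // BLOCK_SIZE)  # ceiling division

def raw_storage(file_size_bytes):
    """Bytes actually consumed across the cluster with replication."""
    return file_size_bytes * REPLICATION

one_gb = 1024 ** 3
print(block_count(one_gb))   # 16 blocks of 64 MB
print(raw_storage(one_gb))   # 3 GB of raw disk across the cluster
```

Note the trade-off the slides describe: replication replaces RAID on individual nodes, at the cost of storing every byte three times.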
MapReduce
Distributing computation across nodes
MapReduce Overview
A method for distributing computation across multiple nodes
Each node processes the data that is stored at that node
Consists of two main phases: Map and Reduce
MapReduce Features
Automatic parallelization and distribution
Fault-Tolerance
Provides a clean abstraction for programmers to use
The Mapper
Reads data as key/value pairs
The key is often discarded
Outputs zero or more key/value pairs
Shuffle and Sort
Output from the mapper is sorted by key
All values with the same key are guaranteed to go to the same machine
The Reducer
Called once for each unique key
Gets a list of all values associated with a key as input
The reducer outputs zero or more final key/value pairs
Usually just one output per input key
MapReduce: Word Count
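The Map → shuffle → Reduce flow described above can be simulated for word count in plain Python. This is a single-process sketch of the phases, not the Hadoop Java API; the function names and sample input are illustrative.

```python
from collections import defaultdict

# Map phase: each mapper reads (key, value) records -- here the key
# (line offset) is discarded -- and emits zero or more (word, 1) pairs.
def mapper(_offset, line):
    for word in line.lower().split():
        yield (word, 1)

# Shuffle and sort: group all values by key and sort by key, as the
# framework would, guaranteeing that every value for a given key
# reaches the same reducer.
def shuffle(pairs):
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return sorted(groups.items())

# Reduce phase: called once per unique key with the list of its values;
# for word count it emits a single (word, total) pair.
def reducer(word, counts):
    yield (word, sum(counts))

lines = ["the quick brown fox", "the lazy dog"]
pairs = [p for off, line in enumerate(lines) for p in mapper(off, line)]
result = dict(kv for key, vals in shuffle(pairs) for kv in reducer(key, vals))
print(result["the"])  # 2
```

In a real cluster the mapper runs on the node holding each block, and the partitioner decides which reducer receives each key during the shuffle.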
Overview
NameNode: holds the metadata for HDFS
Secondary NameNode: performs housekeeping functions for the NameNode
DataNode: stores the actual HDFS data blocks
JobTracker: manages MapReduce jobs
TaskTracker: monitors individual Map and Reduce tasks
The NameNode
Stores the HDFS file system information in an fsimage file
Updates to the file system (add/remove blocks) do not change the fsimage file
They are instead written to a log file
When starting, the NameNode loads the fsimage file and then applies the changes in the log file
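The checkpoint-plus-edit-log pattern the slides describe can be sketched generically. This is illustrative Python showing the replay idea, not the NameNode's actual on-disk format; the names and sample paths are invented for the example.

```python
# Sketch of checkpoint + log replay, the pattern behind fsimage/edits:
# mutations are appended to a log, and startup loads the last
# checkpoint and replays the log on top of it.
def apply_edit(edit, namespace):
    """Apply one logged mutation to the in-memory namespace."""
    op, path = edit
    if op == "add":
        namespace.add(path)
    elif op == "remove":
        namespace.discard(path)
    return namespace

def start_namenode(fsimage, edit_log):
    """Load the checkpointed namespace, then replay the logged edits."""
    namespace = set(fsimage)
    for edit in edit_log:
        apply_edit(edit, namespace)
    return namespace

fsimage = {"/data/a", "/data/b"}                          # last checkpoint
edit_log = [("add", "/data/c"), ("remove", "/data/a")]    # later mutations
print(sorted(start_namenode(fsimage, edit_log)))  # ['/data/b', '/data/c']
```

The longer the edit log grows, the slower startup becomes, which is exactly the problem the Secondary NameNode's periodic checkpointing (next slide) addresses.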
The Secondary NameNode
NOT a backup for the NameNode
Periodically reads the log file and applies the changes to the fsimage file, bringing it up to date
Allows the NameNode to restart faster when required
JobTracker and TaskTracker
JobTracker
Determines the execution plan for the job
Assigns individual tasks
TaskTracker
Keeps track of the performance of an individual mapper or reducer
Hadoop Ecosystem
Other available tools
Why do these tools exist?
MapReduce is very powerful, but can be awkward to master
These tools allow programmers who are familiar with other programming styles to take advantage of the power of MapReduce
Other Tools
Hive: Hadoop processing with SQL
Pig: Hadoop processing with scripting
Cascading: pipe-and-filter processing model
HBase: database model built on top of Hadoop
Flume: designed for large-scale data movement
[Diagram: four Mappers each emit intermediates; Partitioners route them during the shuffle to three Reducers]