Introduction to Hadoop
Hadoop – Taming Big Data
Jax ArcSig, June 2012
Ovidiu Dimulescu
About @odimulescu
• Working on the Web since 1997
• Likes stuff well done
• Into engineering cultures and all-around automation
• Speaker at local user groups
• Organizer for the local Mobile User Group, jaxmug.com
Agenda
• Introduction
• Use cases
• Architecture
• MapReduce Examples
• Q&A
What is Hadoop?
• Apache Hadoop is an open-source Java software framework for running data-intensive applications on large clusters of commodity hardware
• Created by Doug Cutting (Lucene & Nutch creator)
• Named after Doug's son's toy elephant
What is Hadoop solving, and how?
• Processing diverse, large datasets in practical time at low cost
• Consolidates data in a distributed file system
• Moves computation to data rather than data to computation
• Simpler programming model
Why does it matter?
• Volume, Velocity, Variety and Value
• Datasets do not fit on local HDDs let alone RAM
• Data grows at a tremendous pace
• Data is heterogeneous
• Scaling up is expensive (licensing, CPUs, disks, interconnects, etc.)
• Scaling up has a ceiling (physical, technical, etc.)
Why does it matter?
[Chart: data types by volume – roughly 80% complex/unstructured data (images, video, logs, documents, call records, sensor data, mail archives) versus roughly 20% structured data (user profiles, CRM, HR records). Chart source: IDC White Paper]
Why does it matter?
• Need to process a 10TB dataset
• Assume a sustained transfer rate of 75 MB/s
• On 1 node – scanning the data takes ~2 days
• On a 10-node cluster – scanning the data takes ~5 hrs
• Low $/TB for commodity drives
• Low-end servers are multicore capable
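A quick back-of-the-envelope check of those numbers (a sketch only; the slide's figures round up to allow for seek and processing overhead):

// Rough scan-time estimate: 10 TB at a sustained 75 MB/s per node
public class ScanTime {
    public static void main(String[] args) {
        double datasetMB = 10_000_000;   // 10 TB in decimal MB
        double mbPerSecond = 75;         // sustained transfer rate per node
        for (int nodes : new int[] {1, 10}) {
            double hours = datasetMB / (mbPerSecond * nodes) / 3600;
            System.out.printf("%d node(s): ~%.0f hours%n", nodes, hours);
        }
        // Prints ~37 hours on 1 node and ~4 hours on 10 nodes, roughly the
        // "~2 days" and "~5 hrs" above once real-world overhead is included.
    }
}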
Use Cases
• ETL – Extract, Transform, Load
• Recommendation Engines
• Customer Churn Analysis
• Ad Targeting
• Data “sandbox”
Use Cases – Typical ETL
[Diagram: typical ETL – the Live DB, Reporting DB, and Logs feed ETL jobs (ETL 1, ETL 2) that load a Data Warehouse consumed by BI Applications]
Use Cases – Hadoop ETL
[Diagram: Hadoop ETL – the Live DB, Reporting DB, and Logs are loaded into Hadoop, which handles the transformation and loads the Data Warehouse consumed by BI Applications]
Use Cases – Analysis methods
• Pattern recognition
• Index building
• Text mining
• Collaborative filtering
• Prediction models
• Sentiment analysis
• Graph creation and traversal
Who uses it?
Who supports it?
Why use Hadoop?
• Practical to do things that previously were not
  ✓ Shorter execution time
  ✓ Costs less
  ✓ Simpler programming model
• Open system with greater flexibility
• Large and growing ecosystem
Hadoop – Silver bullet?
• Not a database replacement
• Not a data warehouse (it complements one)
• Not for interactive reporting
• Not a general-purpose storage mechanism
• Not for problems that are not parallelizable in a shared-nothing fashion
Architecture – Design Axioms
• System Shall Manage and Heal Itself
• Performance Shall Scale Linearly
• Compute Should Move to Data
• Simple Core, Modular and Extensible
Architecture – Core Components
HDFS
Distributed filesystem designed for low-cost storage and high-bandwidth access across the cluster.
MapReduce
Programming model for processing and generating large data sets.
Architecture – Official Extensions
• Storage: HDFS, HBase
• Data Processing: MapReduce Framework
• Management: ZooKeeper, Chukwa
• Data Access: Pig (Data Flow), Hive (SQL), Avro
Architecture – CDH Distribution
1. CDH – Cloudera's Distribution of Hadoop
2. Image credit – Cloudera presentation @ Microstrategy World 2011
HDFS – Design
• Based on Google’s GFS
• Files are stored as blocks (64MB default size)
• Configurable data replication (3x, rack-aware)
• Fault Tolerant, Expects HW failures
• HUGE files; expects streaming access, not low latency
• Mostly WORM (write once, read many)
HDFS – Architecture
[Diagram: a single Namenode (NN) coordinating Datanode 1, Datanode 2, … Datanode N]
Namenode – Master
• Filesystem metadata
• Controls read/write access to files
• Manages block replication
• Applies the transaction log on startup
Datanode – Slaves
• Read/write blocks to/from clients
• Replicate blocks at the master's request
The client asks the NN for a file, the NN returns the DNs that host it, and the client reads the data directly from those DNs.
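A minimal client-side sketch of that read path using the Hadoop FileSystem API (the NameNode address and the file path here are assumptions, not from the talk); the API hides the NN/DN conversation described above:

import java.io.BufferedReader;
import java.io.InputStreamReader;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsCat {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        conf.set("fs.default.name", "hdfs://namenode:8020"); // assumed NN address
        FileSystem fs = FileSystem.get(conf);

        // open() asks the NameNode for block locations; the returned stream
        // then reads the blocks directly from the DataNodes that host them.
        try (BufferedReader in = new BufferedReader(
                new InputStreamReader(fs.open(new Path("/data/sample.txt"))))) {
            String line;
            while ((line = in.readLine()) != null) {
                System.out.println(line);
            }
        }
    }
}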
HDFS – Fault tolerance
• DataNode
§ Uses CRC checksums to guard against corruption
§ Data is replicated on other nodes (3x)
• NameNode
§ Checkpoint NameNode
§ Backup NameNode
§ Failover is manual
MapReduce – Design
• Based on Google's MapReduce paper
• Borrows from functional programming
• Simpler programming model
§ map (in_key, in_value) -> (out_key, intermediate_value) list
§ reduce (out_key, intermediate_value list) -> out_value list
• No user synchronization or coordination needed
Input -> Map -> Reduce -> Output
MapReduce – Architecture
[Diagram: a single JobTracker (JT) coordinating TaskTracker 1, TaskTracker 2, … TaskTracker N]
JobTracker – Master
• Accepts MR jobs submitted by clients
• Assigns Map and Reduce tasks to TaskTrackers, data-locality aware
• Monitors tasks and TaskTracker status, re-executes tasks upon failure
• Speculative execution
TaskTracker – Slaves
• Run Map and Reduce tasks received from the JobTracker
• Manage storage and transmission of intermediate output
Client launches a job, providing: Configuration, Mapper, Reducer, Input, Output
Hadoop – Core Architecture
[Diagram: the JobTracker and NameNode masters coordinate worker nodes, each running a TaskTracker and a DataNode (TaskTracker 1/DataNode 1 … TaskTracker N/DataNode N); clients use the Jobs API and HDFS]
Acts like a mini OS: file system + scheduler
http://www.slideshare.net/esaliya/mapreduce-in-simple-terms
MapReduce – Head First Style
MapReduce – Mapper Types
One-to-One
map(k, v) = emit (k, transform(v))

Exploder
map(k, v) = foreach p in v: emit (k, p)

Filter
map(k, v) = if cond(v) then emit (k, v)
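A sketch of how the Exploder and Filter shapes look against Hadoop's Mapper class (the input types, the comma split, and the "ERROR" predicate are illustrative assumptions, not from the talk):

import java.io.IOException;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// Exploder: one input record fans out into many output records
class ExploderMapper extends Mapper<Text, Text, Text, Text> {
    @Override
    protected void map(Text key, Text value, Context ctx)
            throws IOException, InterruptedException {
        for (String part : value.toString().split(",")) { // foreach p in v
            ctx.write(key, new Text(part));               // emit (k, p)
        }
    }
}

// Filter: emit the record only when a condition holds
class FilterMapper extends Mapper<Text, Text, Text, Text> {
    @Override
    protected void map(Text key, Text value, Context ctx)
            throws IOException, InterruptedException {
        if (value.toString().contains("ERROR")) { // cond(v)
            ctx.write(key, value);                // emit (k, v)
        }
    }
}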
MapReduce – Reducer Types
Sum Reducer
reduce(k, vals) =
  sum = 0
  foreach v in vals: sum += v
  emit (k, sum)
MapReduce – High level pipeline
[Diagram: mapper outputs are grouped by key (K1, K2, …) and shuffled so each reducer receives all values for its keys]
MapReduce – Detailed pipeline
Diagram: http://developer.yahoo.com/hadoop/tutorial/module4.html
MapReduce – Combiner Phase
• Optional
• Runs on mapper nodes after the map phase
• A "mini-reduce," run only on local map output
• Used to save bandwidth before sending data to the full reducer
• The Reducer can double as the Combiner if (see the wiring sketch below):
  1. Its output key/value types are the same as its input key/value types
  2. The operation is commutative and associative (SUM, MAX ok, but AVG not)
Diagram: http://developer.yahoo.com/hadoop/tutorial/module4.html
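A sketch of the wiring in a driver, reusing Hadoop's stock IntSumReducer as both combiner and reducer (the job name is arbitrary and the other job settings are omitted; a fuller driver sketch appears later in the deck):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.reduce.IntSumReducer;

public class CombinerWiring {
    public static void main(String[] args) throws Exception {
        Job job = new Job(new Configuration(), "word count");
        // The reducer doubles as the combiner: summing is commutative and
        // associative, and its output types match its input types.
        job.setCombinerClass(IntSumReducer.class); // runs locally on map output
        job.setReducerClass(IntSumReducer.class);  // runs on the shuffled output
        // An averaging reducer could NOT be reused this way:
        // avg(avg(a,b), c) != avg(a,b,c).
    }
}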
Installation
1. Download & configure a single-node cluster
hadoop.apache.org/common/releases.html
2. Download a demo VM
Cloudera, Hortonworks
3. Use a hosted environment (Amazon’s EMR, Azure)
Installation – Platform Notes
Production: Linux (official)
Development: Linux, OS X, Windows via Cygwin, other *nix
MapReduce – Client Languages
Java, any JVM language – native
C++ – Pipes framework (socket IO)
Any language – Streaming (stdin/stdout)
Pig Latin, Hive HQL, C via JNI

hadoop pipes -input path_in -output path_out -program exec_program

hadoop jar hadoop-streaming.jar -mapper map_prog -reducer reduce_prog -input path_in -output path_out
hadoop jar jar_path main_class input_path output_path
MapReduce – Client Anatomy
• Main Program (aka Driver)
Configures the Job
Initiates the Job
• Input Location
• Mapper
• Combiner (optional)
• Reducer
• Output Location
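A minimal sketch of such a driver, using the word count job from the next slides as the example (the class names WordCountMapper and WordCountReducer are assumed and match the sketches on the Java Mapper and Java Reducer slides below):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCountDriver {
    public static void main(String[] args) throws Exception {
        Job job = new Job(new Configuration(), "word count");
        job.setJarByClass(WordCountDriver.class);

        job.setMapperClass(WordCountMapper.class);     // Mapper
        job.setCombinerClass(WordCountReducer.class);  // Combiner (optional)
        job.setReducerClass(WordCountReducer.class);   // Reducer

        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);

        FileInputFormat.addInputPath(job, new Path(args[0]));    // Input location
        FileOutputFormat.setOutputPath(job, new Path(args[1]));  // Output location

        System.exit(job.waitForCompletion(true) ? 0 : 1);        // initiate the job
    }
}

Launched with the generic hadoop jar invocation shown earlier, where jar_path is the job jar and main_class is WordCountDriver.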
MapReduce – Word Count Example
MapReduce – C# Mapper
MapReduce – C# Reducer
MapReduce – Java Mapper
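A sketch of the canonical Java word count mapper (the class name is assumed, matching the driver sketch above):

import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class WordCountMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    protected void map(LongWritable offset, Text line, Context ctx)
            throws IOException, InterruptedException {
        // Emit (word, 1) for every token in the line – an "exploder"-style mapper
        StringTokenizer tokens = new StringTokenizer(line.toString());
        while (tokens.hasMoreTokens()) {
            word.set(tokens.nextToken());
            ctx.write(word, ONE);
        }
    }
}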
MapReduce – Java Reducer
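And the matching reducer, the same shape as the Sum Reducer pseudocode earlier (the class name is assumed, matching the driver sketch above):

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

public class WordCountReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    private final IntWritable result = new IntWritable();

    @Override
    protected void reduce(Text word, Iterable<IntWritable> counts, Context ctx)
            throws IOException, InterruptedException {
        int sum = 0;
        for (IntWritable count : counts) { // foreach v in vals: sum += v
            sum += count.get();
        }
        result.set(sum);
        ctx.write(word, result);           // emit (k, sum)
    }
}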
MapReduce – JavaScript Mapper
MapReduce – JavaScript Reducer
Summary
Hadoop is an economical, scalable, distributed data processing system which enables:
✓ Data consolidation (structured or not)
✓ Query flexibility (any language)
✓ Agility (evolving schemas)
Questions?
References
Hadoop at Yahoo!, by Y! Developer Network
MapReduce in Simple Terms, by Saliya Ekanayake
Hadoop Architecture, by Philippe Julio
10 Hadoop-able Problems, by Cloudera
Hadoop, An Industry Perspective, by Amr Awadallah
Anatomy of a MapReduce Job Run, by Tom White
MapReduce Jobs in Hadoop