Introduction to Hadoop
Hadoop – Taming Big Data
Jax ArcSig, June 2012
Ovidiu Dimulescu
About @odimulescu
• Working on the Web since 1997
• Likes stuff well done
• Into engineering cultures and all-around automation
• Speaker at local user groups
• Organizer for the local Mobile User Group, jaxmug.com
Agenda
• Introduction
• Use cases
• Architecture
• MapReduce Examples
• Q&A
What is Hadoop?
• Apache Hadoop is an open-source Java software framework for running data-intensive applications on large clusters of commodity hardware
• Created by Doug Cutting (Lucene & Nutch creator)
• Named after Doug's son's toy elephant
What is Hadoop solving, and how?
• Processing diverse, large datasets in practical time at low cost
• Consolidates data in a distributed file system
• Moves computation to data rather than data to computation
• Simpler programming model
Why does it matter?
• Volume, Velocity, Variety and Value
• Datasets do not fit on local HDDs let alone RAM
• Data grows at a tremendous pace
• Data is heterogeneous
• Scaling up is expensive (licensing, CPUs, disks, interconnects, etc.)
• Scaling up has a ceiling (physical, technical, etc.)
Why does it matter?
[Chart: data types by volume – roughly 80% complex/unstructured data (images, video, logs, documents, call records, sensor data, mail archives) versus roughly 20% structured data (user profiles, CRM, HR records). Chart source: IDC White Paper]
Why does it matter?
• Need to process a 10TB dataset
• Assume a sustained transfer rate of 75 MB/s
• On 1 node – scanning the data takes ~2 days
• On a 10-node cluster – scanning the data takes ~5 hrs
• Low $/TB for commodity drives
• Low-end servers are multicore capable
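A quick back-of-the-envelope check of those numbers (a sketch only; the slide's figures round up to allow for seek and processing overhead):

// Rough scan-time estimate: 10 TB at a sustained 75 MB/s per node
public class ScanTime {
    public static void main(String[] args) {
        double datasetMB = 10_000_000;   // 10 TB in decimal MB
        double mbPerSecond = 75;         // sustained transfer rate per node
        for (int nodes : new int[] {1, 10}) {
            double hours = datasetMB / (mbPerSecond * nodes) / 3600;
            System.out.printf("%d node(s): ~%.0f hours%n", nodes, hours);
        }
        // Prints ~37 hours on 1 node and ~4 hours on 10 nodes, roughly the
        // "~2 days" and "~5 hrs" above once real-world overhead is included.
    }
}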
Use Cases
• ETL – Extract, Transform, Load
• Recommendation Engines
• Customer Churn Analysis
• Ad Targeting
• Data “sandbox”
Use Cases – Typical ETL
[Diagram: typical ETL – the Live DB, Reporting DB, and Logs feed ETL jobs (ETL 1, ETL 2) that load a Data Warehouse consumed by BI Applications]
Use Cases – Hadoop ETL
[Diagram: Hadoop ETL – the Live DB, Reporting DB, and Logs are loaded into Hadoop, which handles the transformation and loads the Data Warehouse consumed by BI Applications]
Use Cases – Analysis methods
• Pattern recognition
• Index building
• Text mining
• Collaborative filtering
• Prediction models
• Sentiment analysis
• Graph creation and traversal
Who uses it?
Who supports it?
Why use Hadoop?
• Practical to do things that previously were not
  ✓ Shorter execution time
  ✓ Costs less
  ✓ Simpler programming model
• Open system with greater flexibility
• Large and growing ecosystem
Hadoop – Silver bullet?
• Not a database replacement
• Not a data warehouse (it complements one)
• Not for interactive reporting
• Not a general-purpose storage mechanism
• Not for problems that are not parallelizable in a shared-nothing fashion
Architecture – Design Axioms
• System Shall Manage and Heal Itself
• Performance Shall Scale Linearly
• Compute Should Move to Data
• Simple Core, Modular and Extensible
Architecture – Core Components
HDFS
Distributed filesystem designed for low-cost storage and high-bandwidth access across the cluster.
MapReduce
Programming model for processing and generating large data sets.
Architecture – Official Extensions
• Storage: HDFS, HBase
• Data Processing: MapReduce Framework
• Management: ZooKeeper, Chukwa
• Data Access: Pig (Data Flow), Hive (SQL), Avro
Architecture – CDH Distribution
1. CDH – Cloudera's Distribution of Hadoop
2. Image credit – Cloudera presentation @ Microstrategy World 2011
HDFS – Design
• Based on Google’s GFS
• Files are stored as blocks (64MB default size)
• Configurable data replication (3x, rack-aware)
• Fault Tolerant, Expects HW failures
• HUGE files; expects streaming access, not low latency
• Mostly WORM (write once, read many)
HDFS – Architecture
[Diagram: a single Namenode (NN) coordinating Datanode 1, Datanode 2, … Datanode N]
Namenode – Master
• Filesystem metadata
• Controls read/write access to files
• Manages block replication
• Applies the transaction log on startup
Datanode – Slaves
• Read/write blocks to/from clients
• Replicate blocks at the master's request
The client asks the NN for a file, the NN returns the DNs that host it, and the client reads the data directly from those DNs.
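A minimal client-side sketch of that read path using the Hadoop FileSystem API (the NameNode address and the file path here are assumptions, not from the talk); the API hides the NN/DN conversation described above:

import java.io.BufferedReader;
import java.io.InputStreamReader;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsCat {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        conf.set("fs.default.name", "hdfs://namenode:8020"); // assumed NN address
        FileSystem fs = FileSystem.get(conf);

        // open() asks the NameNode for block locations; the returned stream
        // then reads the blocks directly from the DataNodes that host them.
        try (BufferedReader in = new BufferedReader(
                new InputStreamReader(fs.open(new Path("/data/sample.txt"))))) {
            String line;
            while ((line = in.readLine()) != null) {
                System.out.println(line);
            }
        }
    }
}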
HDFS – Fault tolerance
• DataNode
§ Uses CRC checksums to guard against corruption
§ Data is replicated on other nodes (3x)
• NameNode
§ Checkpoint NameNode
§ Backup NameNode
§ Failover is manual
MapReduce – Design
• Based on Google's MapReduce paper
• Borrows from functional programming
• Simpler programming model
§ map (in_key, in_value) -> (out_key, intermediate_value) list
§ reduce (out_key, intermediate_value list) -> out_value list
• No user synchronization or coordination needed
Input -> Map -> Reduce -> Output
MapReduce – Architecture
[Diagram: a single JobTracker (JT) coordinating TaskTracker 1, TaskTracker 2, … TaskTracker N]
JobTracker – Master
• Accepts MR jobs submitted by clients
• Assigns Map and Reduce tasks to TaskTrackers, data-locality aware
• Monitors tasks and TaskTracker status, re-executes tasks upon failure
• Speculative execution
TaskTracker – Slaves
• Run Map and Reduce tasks received from the JobTracker
• Manage storage and transmission of intermediate output
Client launches a job, providing: Configuration, Mapper, Reducer, Input, Output
Hadoop – Core Architecture
[Diagram: the JobTracker and NameNode masters coordinate worker nodes, each running a TaskTracker and a DataNode (TaskTracker 1/DataNode 1 … TaskTracker N/DataNode N); clients use the Jobs API and HDFS]
Acts like a mini OS: file system + scheduler
http://www.slideshare.net/esaliya/mapreduce-in-simple-terms
MapReduce – Head First Style
MapReduce – Mapper Types
One-to-One
map(k, v) = emit (k, transform(v))

Exploder
map(k, v) = foreach p in v: emit (k, p)

Filter
map(k, v) = if cond(v) then emit (k, v)
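A sketch of how the Exploder and Filter shapes look against Hadoop's Mapper class (the input types, the comma split, and the "ERROR" predicate are illustrative assumptions, not from the talk):

import java.io.IOException;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// Exploder: one input record fans out into many output records
class ExploderMapper extends Mapper<Text, Text, Text, Text> {
    @Override
    protected void map(Text key, Text value, Context ctx)
            throws IOException, InterruptedException {
        for (String part : value.toString().split(",")) { // foreach p in v
            ctx.write(key, new Text(part));               // emit (k, p)
        }
    }
}

// Filter: emit the record only when a condition holds
class FilterMapper extends Mapper<Text, Text, Text, Text> {
    @Override
    protected void map(Text key, Text value, Context ctx)
            throws IOException, InterruptedException {
        if (value.toString().contains("ERROR")) { // cond(v)
            ctx.write(key, value);                // emit (k, v)
        }
    }
}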
MapReduce – Reducer Types
Sum Reducer
reduce(k, vals) =
  sum = 0
  foreach v in vals: sum += v
  emit (k, sum)
MapReduce – High level pipeline
[Diagram: mapper outputs are grouped by key (K1, K2, …) and shuffled so each reducer receives all values for its keys]
MapReduce – Detailed pipeline
Diagram: http://developer.yahoo.com/hadoop/tutorial/module4.html
MapReduce – Combiner Phase
• Optional
• Runs on mapper nodes after the map phase
• A "mini-reduce," run only on local map output
• Used to save bandwidth before sending data to the full reducer
• The Reducer can double as the Combiner if (see the wiring sketch below):
  1. Its output key/value types are the same as its input key/value types
  2. The operation is commutative and associative (SUM, MAX ok, but AVG not)
Diagram: http://developer.yahoo.com/hadoop/tutorial/module4.html
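A sketch of the wiring in a driver, reusing Hadoop's stock IntSumReducer as both combiner and reducer (the job name is arbitrary and the other job settings are omitted; a fuller driver sketch appears later in the deck):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.reduce.IntSumReducer;

public class CombinerWiring {
    public static void main(String[] args) throws Exception {
        Job job = new Job(new Configuration(), "word count");
        // The reducer doubles as the combiner: summing is commutative and
        // associative, and its output types match its input types.
        job.setCombinerClass(IntSumReducer.class); // runs locally on map output
        job.setReducerClass(IntSumReducer.class);  // runs on the shuffled output
        // An averaging reducer could NOT be reused this way:
        // avg(avg(a,b), c) != avg(a,b,c).
    }
}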
Installation
1. Download & configure a single-node cluster
hadoop.apache.org/common/releases.html
2. Download a demo VM
Cloudera, Hortonworks
3. Use a hosted environment (Amazon’s EMR, Azure)
Installation – Platform Notes
Production: Linux (official)
Development: Linux, OS X, Windows via Cygwin, other *nix
MapReduce – Client Languages
Java, any JVM language – native
C++ – Pipes framework (socket IO)
Any language – Streaming (stdin/stdout)
Pig Latin, Hive HQL, C via JNI

hadoop pipes -input path_in -output path_out -program exec_program

hadoop jar hadoop-streaming.jar -mapper map_prog -reducer reduce_prog -input path_in -output path_out
hadoop jar jar_path main_class input_path output_path
MapReduce – Client Anatomy
• Main Program (aka Driver)
Configures the Job
Initiates the Job
• Input Location
• Mapper
• Combiner (optional)
• Reducer
• Output Location
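A minimal sketch of such a driver, using the word count job from the next slides as the example (the class names WordCountMapper and WordCountReducer are assumed and match the sketches on the Java Mapper and Java Reducer slides below):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCountDriver {
    public static void main(String[] args) throws Exception {
        Job job = new Job(new Configuration(), "word count");
        job.setJarByClass(WordCountDriver.class);

        job.setMapperClass(WordCountMapper.class);     // Mapper
        job.setCombinerClass(WordCountReducer.class);  // Combiner (optional)
        job.setReducerClass(WordCountReducer.class);   // Reducer

        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);

        FileInputFormat.addInputPath(job, new Path(args[0]));    // Input location
        FileOutputFormat.setOutputPath(job, new Path(args[1]));  // Output location

        System.exit(job.waitForCompletion(true) ? 0 : 1);        // initiate the job
    }
}

Launched with the generic hadoop jar invocation shown earlier, where jar_path is the job jar and main_class is WordCountDriver.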
MapReduce – Word Count Example
MapReduce – C# Mapper
MapReduce – C# Reducer
MapReduce – Java Mapper
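A sketch of the canonical Java word count mapper (the class name is assumed, matching the driver sketch above):

import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class WordCountMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    protected void map(LongWritable offset, Text line, Context ctx)
            throws IOException, InterruptedException {
        // Emit (word, 1) for every token in the line – an "exploder"-style mapper
        StringTokenizer tokens = new StringTokenizer(line.toString());
        while (tokens.hasMoreTokens()) {
            word.set(tokens.nextToken());
            ctx.write(word, ONE);
        }
    }
}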
MapReduce – Java Reducer
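And the matching reducer, the same shape as the Sum Reducer pseudocode earlier (the class name is assumed, matching the driver sketch above):

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

public class WordCountReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    private final IntWritable result = new IntWritable();

    @Override
    protected void reduce(Text word, Iterable<IntWritable> counts, Context ctx)
            throws IOException, InterruptedException {
        int sum = 0;
        for (IntWritable count : counts) { // foreach v in vals: sum += v
            sum += count.get();
        }
        result.set(sum);
        ctx.write(word, result);           // emit (k, sum)
    }
}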
MapReduce – JavaScript Mapper
MapReduce – JavaScript Reducer
Summary
Hadoop is an economical, scalable, distributed data processing system which enables:
✓ Data consolidation (structured or not)
✓ Query flexibility (any language)
✓ Agility (evolving schemas)
Questions?
References
Hadoop at Yahoo!, by Y! Developer Network
MapReduce in Simple Terms, by Saliya Ekanayake
Hadoop Architecture, by Philippe Julio
10 Hadoop-able Problems, by Cloudera
Hadoop, An Industry Perspective, by Amr Awadallah
Anatomy of a MapReduce Job Run, by Tom White
MapReduce Jobs in Hadoop