
Hadoop in a Nutshell
Siva Pandeti
Cloudera Certified Developer for Apache Hadoop (CCDH)

Overview
• Why Hadoop? Data growth; What is Big Data?; Hadoop usage
• What is Hadoop? Components; NoSQL; cluster; vendors; tool comparison
• How to Hadoop? Typical implementation; data analysis with Pig & Hive; opportunities
• Examples: MapReduce deep dive; word count; search index; recommendation engine

Why Hadoop?

Data Growth

OLTP: databases for operations
• Throw away historical data
• Relational (Oracle, DB2)

OLAP: data warehouses for analytics
• Cheaper centralized storage -> data warehouses (ETL tools)
• Relational / MPP appliances
• < few hundred TB

Big Data
• Data explosion (social media, etc.); petabyte scale
• Network speeds haven't increased, so data locality is needed
• Distributed processing on commodity hardware (Hadoop)
• Non-relational

What is Big Data?
• Volume: petabyte scale
• Variety: structured, semi-structured, unstructured
• Velocity: social, sensor; throughput
• Veracity: unclean, imprecise, unclear

Where is Hadoop Used?

Industry / Use Cases
• Technology: search; people you may know; movie recommendations
• Banks: fraud detection; regulatory; risk management
• Media/Retail: marketing analytics; customer service; product recommendations
• Manufacturing: preventive maintenance

What is Hadoop?

Open source distributed computing framework for storage and processing.

HDFS: Distributed Storage
• Economical: commodity hardware
• Scalable: rebalances data onto new nodes
• Fault tolerant: detects faults and auto-recovers
• Reliable: maintains multiple copies of data
• High throughput: because data is distributed

MapReduce: Distributed Processing
• Data locality: process where the data resides
• Fault tolerant: auto-recovers from job failures
• Scalable: add nodes to increase parallelism
• Economical: commodity hardware
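The MapReduce flow can be illustrated without a cluster. A minimal pure-Python sketch (my own illustration, not from the slides) of the map -> shuffle/sort -> reduce sequence, using word count as the running example:

# Conceptual map -> shuffle/sort -> reduce, outside Hadoop
from itertools import groupby
from operator import itemgetter

lines = ["apple banana", "banana apple apple"]

# Map: emit (key, value) pairs
pairs = [(word, 1) for line in lines for word in line.split()]

# Shuffle/sort: group pairs by key (Hadoop does this between the two phases)
pairs.sort(key=itemgetter(0))

# Reduce: aggregate all values for a particular key
for word, group in groupby(pairs, key=itemgetter(0)):
    print("%s\t%d" % (word, sum(count for _, count in group)))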

NoSQL DBs: HBase
• Modeled after Google's BigTable
• Random, real-time read/write access to Big Data
• Billions of rows x millions of columns
• Commodity hardware
• Open source, distributed, versioned, column-oriented
• Integrates with MapReduce; has Java/REST APIs
• Automatic sharding
• Unlike an RDBMS:
  o De-normalized
  o No secondary indexes
  o No transactions

Source: http://www.larsgeorge.com/2009/10/hbase-architecture-101-storage.html
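The slides mention Java and REST APIs; as a quick illustration of random real-time reads and writes, here is a minimal sketch using the third-party happybase Python client over HBase's Thrift gateway (my own assumption, not covered in the slides; the table and column names are hypothetical):

# Sketch only: assumes an HBase Thrift server on localhost and an
# existing table 'customers' with a column family 'cf'
import happybase

connection = happybase.Connection('localhost')  # Thrift gateway
table = connection.table('customers')

# Random real-time write: one row keyed by customer id
table.put(b'customer-42', {b'cf:name': b'Alice', b'cf:city': b'Austin'})

# Random real-time read of the same row
row = table.row(b'customer-42')
print(row[b'cf:name'])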

How Does Hadoop Work?

Cluster
• Master node: Job Tracker, Name Node
• Slave nodes: each slave runs a Task Tracker and a Data Node

Vendors

Hadoop Distributors: Apache Hadoop, Cloudera, HortonWorks, MapR, Amazon EMR
ETL/BI Connectors: Pentaho, Informatica, Talend, Clover, MicroStrategy, Tableau, SAS, Ab Initio

Comparison: Traditional ETL/BI vs. Hadoop

Cost
• Traditional ETL/BI: expensive license; expensive hardware
• Hadoop: open source; cheap commodity hardware

Volume
• Traditional ETL/BI: < 100 TB; central storage
• Hadoop: petabyte scale; distributed storage

Speed
• Traditional ETL/BI: quick response for processing small data; not as fast on large data
• Hadoop: even the smallest job takes 15 seconds; super fast on large data

Throughput
• Traditional ETL/BI: thousands of reads/writes per minute
• Hadoop: millions of reads/writes per minute

How to Hadoop?

Hadoop Implementation
• Ingest: Flume, Sqoop, put/get, ETL tools (sources: RDBMS, data feeds, files)
• Store: HDFS (Hadoop)
• Process: MapReduce, Pig, Hive, Mahout
• Output: reports, machine learning, analytics, visualization (SAS, R)

Data Analysis: Pig & Hive

Both are abstractions on top of MapReduce: they generate MapReduce jobs in the backend and are useful for analysts who are not programmers.

Pig
• Data flow language
• No schema
• Better with less-structured data
• Example:
  customers = LOAD 'file' USING PigStorage('\t') AS (id, name);
• Common operators: FILTER, FOREACH, GROUP, ORDER, STORE

Hive
• SQL-like language
• Schema, tables, and joins are stored in a metastore
• Example:
  CREATE TABLE customer (id INT, name STRING)
  ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t';
  SELECT * FROM customer WHERE id < 100 LIMIT 10;

MapReduce

[Diagram: word count MapReduce overview]
Source: http://www.rabidgremlin.com/data20/MapReduceWordCountOverview1.png

Examples

Word count - Java
• Copy input files to HDFS
  o hadoop fs -put file1.txt input
• Create driver
  o Set configuration variables, mapper and reducer class names
• Create mapper
  o Read input and emit key-value pairs
• Create reducer (optional)
  o Aggregate all values for a particular key
• Execute
  o hadoop jar WordCount.jar WordCount input output
• Analyze output
  o hadoop fs -cat output/* | head

Word count - Streaming
• Hadoop is written in Java. I don't know Java. What do I do?
  o Hadoop Streaming (Python, Ruby, R, etc.)
• Copy input files to HDFS
  o hadoop fs -put file1.txt input
• Create mapper (see the sketch below)
  o Read the input stream (stdin) and emit (print) key-value pairs
• Create reducer (optional; see the sketch below)
  o Aggregate all values for a particular key
• Execute
  o hadoop jar /usr/lib/hadoop-0.20-mapreduce/contrib/streaming/hadoop-stream*.jar \
      -mapper mapper.py -file mapper.py \
      -reducer reducer.py -file reducer.py \
      -input input -output output
• Analyze output
  o hadoop fs -cat output/* | head
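A minimal sketch of the mapper.py and reducer.py referenced above (my own illustration, assuming whitespace-tokenized text and the tab-separated key-value convention Hadoop Streaming uses; the framework sorts mapper output by key before it reaches the reducer):

#!/usr/bin/env python
# mapper.py: read lines from stdin, emit one "word<TAB>1" pair per word
import sys

for line in sys.stdin:
    for word in line.strip().split():
        print("%s\t%d" % (word, 1))

#!/usr/bin/env python
# reducer.py: input arrives sorted by key, so a running count can be
# kept per word and flushed whenever the word changes
import sys

current_word, current_count = None, 0
for line in sys.stdin:
    word, count = line.strip().split("\t", 1)
    if word == current_word:
        current_count += int(count)
    else:
        if current_word is not None:
            print("%s\t%d" % (current_word, current_count))
        current_word, current_count = word, int(count)
if current_word is not None:
    print("%s\t%d" % (current_word, current_count))

The pair can be tested locally before submitting the job: cat file1.txt | python mapper.py | sort | python reducer.py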

Hadoop for R
• RHadoop packages
  o rmr
  o rhdfs
  o rhbase
• Uses Hadoop Streaming
• The example below determines how many countries have a greater GDP than Apple's 2012 revenue

Sys.setenv(HADOOP_HOME="/home/istvan/hadoop")
Sys.setenv(HADOOP_CMD="/home/istvan/hadoop/bin/hadoop")
library(rmr2)
library(rhdfs)

setwd("/home/istvan/rhadoop/blogs/")
gdp <- read.csv("GDP_converted.csv")
head(gdp)

hdfs.init()
gdp.values <- to.dfs(gdp)

# AAPL revenue in 2012 in millions USD
aaplRevenue = 156508

gdp.map.fn <- function(k, v) {
  key <- ifelse(v[4] < aaplRevenue, "less", "greater")
  keyval(key, 1)
}

count.reduce.fn <- function(k, v) {
  keyval(k, length(v))
}

count <- mapreduce(input=gdp.values,
                   map=gdp.map.fn,
                   reduce=count.reduce.fn)
from.dfs(count)

Source: http://bighadoop.wordpress.com/2013/02/25/r-and-hadoop-data-analysis-rhadoop/

Search index example
• Crawl web
  o Crawl and save websites to a local directory
• Ingest files to HDFS
• Map (see the sketch below)
  o Split the words and associate each word with the file name it appears in
• Reduce (see the sketch below)
  o Build an index of words and files, with a count of occurrences
• Search
  o Pass a word to the index to get the files it shows up in; display the file listing in descending order of the number of occurrences of the word in each file
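A minimal streaming-style sketch of the map and reduce steps above (my own illustration; Hadoop Streaming exposes the current input file's name to the mapper through the map_input_file environment variable, mapreduce_map_input_file in newer releases):

#!/usr/bin/env python
# index_mapper.py: emit one "word<TAB>filename" pair per word
import os
import sys

# Hadoop Streaming sets the current split's file name in the environment
filename = os.environ.get("map_input_file") or \
           os.environ.get("mapreduce_map_input_file", "unknown")

for line in sys.stdin:
    for word in line.strip().split():
        print("%s\t%s" % (word.lower(), filename))

#!/usr/bin/env python
# index_reducer.py: for each word, count occurrences per file and emit
# the file list sorted by descending occurrence count
import sys
from collections import Counter

def flush(word, counts):
    if word is not None:
        ranked = sorted(counts.items(), key=lambda kv: -kv[1])
        print("%s\t%s" % (word, ",".join("%s:%d" % kv for kv in ranked)))

current_word, counts = None, Counter()
for line in sys.stdin:
    word, fname = line.strip().split("\t", 1)
    if word != current_word:
        flush(current_word, counts)
        current_word, counts = word, Counter()
    counts[fname] += 1
flush(current_word, counts)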

Recommender example

• Use web server logs with user ratings info for items
• Create Hive tables to build structure on top of this log data
• Generate a Mahout-specific CSV input file (user, item, rating); see the sketch below
• Run Mahout to build item-based recommendations for users
  o mahout recommenditembased \
      --input /user/hive/warehouse/mahout_input \
      --output recommendations \
      -s SIMILARITY_PEARSON_CORRELATION -n 20
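A minimal sketch of the CSV-generation step (my own illustration; the sample records are hypothetical, and Mahout's item-based recommender expects headerless user,item,rating lines):

#!/usr/bin/env python
# make_mahout_input.py: write (user, item, rating) tuples as the
# headerless CSV lines Mahout's item-based recommender expects
import csv
import sys

# hypothetical records already parsed out of the web server logs
ratings = [
    (1, 101, 5.0),
    (1, 102, 3.0),
    (2, 101, 2.0),
]

writer = csv.writer(sys.stdout)
for user_id, item_id, rating in ratings:
    writer.writerow([user_id, item_id, rating])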

Recap
• Why Hadoop? Data growth; What is Big Data?; Hadoop usage
• What is Hadoop? Components; NoSQL; cluster; vendors; tool comparison
• How to Hadoop? Typical implementation; data analysis with Pig & Hive; opportunities
• Demo: MapReduce deep dive; word count; search index; recommendation engine

Q & A

Contact Siva Pandeti:
Email: [email protected]
LinkedIn: www.linkedin.com/in/SivaPandeti
Twitter: @SivaPandeti
Blog: http://pandeti.com/blog