
A Tour of the Zoo – the Hadoop Ecosystem

Prafulla Wani

Technical Architect - Big Data

Syntel

Confidential ©2012 Syntel, Inc.

Agenda

Welcome to the Zoo!

Evolution Timeline

Traditional BI/DW Architecture

Where Hadoop Fits In


Welcome to the Zoo!


[Project logos: Jaql, Giraph, Shark, ZooKeeper, Pig, Hama, Hadoop]

I am sure you won’t find a Shark in any other zoo

http://zookeeper.apache.org/

https://cwiki.apache.org/confluence/display/Hive


What is Hadoop?

Hadoop is an open-source project overseen by the Apache Software Foundation

Hadoop is an ecosystem, not a single product

Originally based on papers published by Google in 2003 and 2004

Several projects in the ecosystem were inspired by whitepapers published by Google


Google calls it | Hadoop equivalent
GFS             | HDFS
MapReduce       | Hadoop MapReduce
Sawzall         | Hive, Pig
BigTable        | HBase
Chubby          | ZooKeeper
Pregel          | Giraph


Evolution Timeline

Started by Doug Cutting at Yahoo! in early 2006 and named after his son's toy elephant

Hadoop committers work at several different organizations, including Facebook, Yahoo!, LinkedIn, Twitter, Cloudera, and Hortonworks


[Timeline figure spanning 2006 to 2011; surviving project labels: Jaql, Giraph]



Traditional Data Strategy - BI/DW Architecture


            | ETL Tools              | DW / Marts   | BI                  | Analytics
Commercial  | Informatica            | Teradata     | MicroStrategy       | SAS
            | Oracle Data Integrator | Oracle       | OBIEE               | TIBCO Spotfire
            | IBM DataStage          | DB2, Netezza | Cognos              | SPSS
            | Microsoft SSIS         | SQL Server   | Microsoft SSRS      |
Open source | Talend                 | MySQL        | Pentaho, Jaspersoft | R, RapidMiner

[Diagram: source systems (ERP, CRM, Database, Files) feed an ETL Process that loads the Data Warehouse and Data Marts, which serve Analytics, OLAP Analysis/BI, and Ad Hoc Reporting]


Where Hadoop Fits In


Hadoop can complement the existing DW environment as well as replace some of the components in a traditional data architecture.

[Diagram: source systems (ERP, CRM, Database, Files) feed an ETL Process that loads the Data Warehouse and Data Marts, which serve Analytics, OLAP Analysis/BI, and Ad Hoc Reporting]


Data Storage

Hadoop Distributed File System (HDFS)

It’s a file system, not a DBMS

Allows storage of both structured and unstructured data

Provides distributed, redundant storage for massive amounts of data on cheap, unreliable computers

The Hadoop 2.0 release (still in beta) added two important features: HDFS Federation and High Availability
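The idea of redundant storage on unreliable machines can be sketched in a few lines. This is a toy model, not HDFS code: `place_blocks` and `readable_after_failure` are hypothetical helper names, the sizes are arbitrary units, and the round-robin placement is an assumption made for brevity (real HDFS placement is rack-aware). It only illustrates that a file is cut into fixed-size blocks and each block is copied to several distinct nodes.

```python
def place_blocks(file_size, block_size, nodes, replication=3):
    """Split a file into fixed-size blocks and assign each block to
    `replication` distinct nodes (simple round-robin, not rack-aware)."""
    if replication > len(nodes):
        raise ValueError("replication factor exceeds number of nodes")
    num_blocks = -(-file_size // block_size)  # ceiling division
    placement = {}
    for block_id in range(num_blocks):
        # Rotate the starting node so replicas spread across the cluster.
        placement[block_id] = [
            nodes[(block_id + r) % len(nodes)] for r in range(replication)
        ]
    return placement


def readable_after_failure(placement, failed_node):
    """A file stays readable if every block keeps at least one live replica."""
    return all(
        any(node != failed_node for node in replicas)
        for replicas in placement.values()
    )


# Hypothetical four-node cluster; a 200-unit file with 64-unit blocks.
nodes = ["node1", "node2", "node3", "node4"]
layout = place_blocks(file_size=200, block_size=64, nodes=nodes)
survives = readable_after_failure(layout, "node1")
```

With three replicas per block, losing any single node in this sketch leaves every block readable, which is the property that makes cheap hardware acceptable.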

HBase

Distributed, versioned, column-oriented store on top of HDFS

Provides an option of “low-latency” (OLTP) reads/writes along with support for the batch-processing model of map-reduce

Goal: store tables with billions of rows and millions of columns
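What "versioned, column-oriented" means can be shown with a small in-memory model. This is a sketch under stated assumptions, not the HBase client API: `VersionedTable`, its `max_versions` parameter, and the sample row and column names are all hypothetical. It only illustrates that a cell is addressed by row key, column, and timestamp, and that a read returns the newest version unless an older timestamp is requested.

```python
from collections import defaultdict


class VersionedTable:
    """Toy model of a versioned store: row key -> column -> {timestamp: value}."""

    def __init__(self, max_versions=3):
        self.max_versions = max_versions
        self.rows = defaultdict(lambda: defaultdict(dict))

    def put(self, row, column, value, ts):
        versions = self.rows[row][column]
        versions[ts] = value
        # Keep only the newest `max_versions` timestamps for this cell.
        for old_ts in sorted(versions)[:-self.max_versions]:
            del versions[old_ts]

    def get(self, row, column, ts=None):
        """Return the newest value at or before `ts` (newest overall if ts is None)."""
        versions = self.rows.get(row, {}).get(column, {})
        candidates = [t for t in versions if ts is None or t <= ts]
        return versions[max(candidates)] if candidates else None


# Hypothetical data: three writes to one cell, keeping two versions.
table = VersionedTable(max_versions=2)
table.put("user42", "info:email", "a@example.com", ts=1)
table.put("user42", "info:email", "b@example.com", ts=2)
table.put("user42", "info:email", "c@example.com", ts=3)

latest = table.get("user42", "info:email")
as_of_2 = table.get("user42", "info:email", ts=2)
as_of_1 = table.get("user42", "info:email", ts=1)  # oldest version was evicted
```

Rows that never set a column store nothing for it, which is why very wide, sparse tables are practical in this kind of model.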



Data Processing (ETL / Analytics)

Extract / Load

Source / Target is RDBMS - Sqoop

Log collection and aggregation - Flume, Scribe, Chukwa

Stream processing - S4, Storm (supports Transformation also)

Transformation

Map-reduce programming in Java or another language, or high-level query languages such as Pig and Hive

Workflow design and implementation using tools like Oozie, Azkaban etc.

Iterative algorithms or in-memory cluster processing using Spark, Shark etc.
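The map-reduce programming model mentioned under Transformation can be sketched as a single-process word count. This is an illustration of the model only, not Hadoop's API: the function names `map_phase`, `shuffle`, `reduce_phase`, and `run_job` are hypothetical, and a real job would run the map and reduce steps in parallel across the cluster.

```python
from collections import defaultdict


def map_phase(line):
    """Map: emit a (word, 1) pair for every word in one input line."""
    for word in line.lower().split():
        yield word, 1


def shuffle(pairs):
    """Shuffle: group all emitted values by key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Reduce: collapse the values for one key into a single result."""
    return key, sum(values)


def run_job(lines):
    """Run map, shuffle, and reduce in sequence within a single process."""
    mapped = (pair for line in lines for pair in map_phase(line))
    return dict(reduce_phase(key, values) for key, values in shuffle(mapped).items())


counts = run_job(["pig hive pig", "hive hbase"])
```

Higher-level languages such as Pig and Hive express the same kind of grouping and aggregation declaratively, so the user does not write the map and reduce functions by hand.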

Analytics

Mahout - Scalable machine learning library with most of the algorithms implemented on top of Apache Hadoop using the map/reduce paradigm

RHadoop – Provides R packages to access data in HDFS & HBase and also to write map-reduce jobs in R



Common Industry Use Cases


Use cases                                   | Solution               | Comments
Cold data storage                           | HDFS                   | More cost-effective option compared to most appliances in the market
Huge transactional volume                   | HBase                  | StumbleUpon created OpenTSDB to capture their infrastructure metrics data
Batch processing                            | MapReduce / Hive / Pig |
Log aggregation                             | Flume, Scribe, Chukwa  | Web-log collection on HDFS in near real-time
Real-time message / stream processing       | Storm, S4              | Used by Twitter for real-time tweet processing
Iterative algorithms / in-memory processing | Spark / Shark          | Predictive analytics, log mining
Machine learning / analytics                | Mahout, RHadoop        |
Graph data storage / processing             | Giraph                 | Championed at Yahoo!



Proposed Big Data Roadmap

Kickoff - Assessment Study:

Understand the business processes

Understand organizational goals & current investments

Understand the challenges and pain-points of current setup

Proof of Concept:

Run a proof of concept to demonstrate how Hadoop can enhance the DW

Big Data integration – Initial steps

Move cold/warm data to Hive/HBase to reduce expenses on storage infrastructure

Bring in new data sources such as web logs, which was not feasible with traditional storage solutions

Big Data integration – Next steps

Open the data to business users for analysis so that they can appreciate the power of the new infrastructure


Identify the opportunities in ETL & Analytics space

Move hot data to Hadoop

Perform real-time data integration using Storm/Spark


Implement advanced solutions

[Figure: Hadoop Technology Stack, with layers labeled HDFS, HBase; Hive, Pig, MapReduce; Mahout, RHadoop]

Thank You
