harnessing hadoop for big data analytics v0.1

Confidential. Copyright © 2010 Flytxt B.V. All rights reserved.

Transforming Mobile Marketing & Advertising™

Harnessing Hadoop for Big Data Analytics

Jobin [email protected]


Who am I?

• Architect @ Flytxt (Big Data Analytics & Automation)

• Passionate about data, distributed computing, and machine learning

• Previously

• Virtualization & Cloud Lifecycle Management (BMC)

• Designed and implemented Cloud Life Cycle Management Interface for BMC

• Large Scale Data Centre Automation (AOL)

• Implemented Centralized Data Center Management Framework for AOL

• Workflow Systems & Automation (Accenture)

• Implemented Service Management Suite for various customers


Session Agenda!


• Data – What's the big deal?

• What is Hadoop (& what it is not)

• Map-Reduce Model & HDFS

• Hadoop Ecosystem & Tools

• Let's get started!

• Q&A


Five computers & a 640k ;-)

Moore’s Law

"I think there is a world market for about five computers"

Thomas Watson, Chairman of the board of IBM, 1943

"640k ought to be enough for anybody"

Attributed to Bill Gates, 1981


Data Explosion!


Do I also know what you might do next summer?

• Does your travel company know you visited Goa & Cochin twice in the last two years?

• Collaborative Filtering

• Lots of Data + Statistics = WOW!!!

• BTW, don’t worry about the eqn
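To make the collaborative filtering bullet concrete, here is a minimal sketch of item co-occurrence scoring in plain Python. The travel histories, the user names and the `recommend` helper are all hypothetical illustrations, not data or code from the talk; a real system would run this kind of counting over far larger data sets.

```python
from collections import Counter
from itertools import permutations

# Hypothetical travel histories (user -> places visited); illustrative only.
trips = {
    "asha":  {"Goa", "Cochin", "Munnar"},
    "ravi":  {"Goa", "Cochin", "Munnar"},
    "meera": {"Goa", "Cochin", "Hampi"},
    "you":   {"Goa", "Cochin"},
}

def recommend(user, histories, top_n=1):
    """Suggest places that co-occur most often with the user's own places."""
    # Count how often each ordered pair of places shares a history.
    co_counts = Counter()
    for places in histories.values():
        for a, b in permutations(sorted(places), 2):
            co_counts[(a, b)] += 1

    # Score every unseen place by its co-occurrence with the seen ones.
    seen = histories[user]
    scores = Counter()
    for (a, b), n in co_counts.items():
        if a in seen and b not in seen:
            scores[b] += n

    # Sort by score (descending), then by name, so ties are deterministic.
    ranked = sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))
    return [place for place, _ in ranked[:top_n]]

print(recommend("you", trips))  # ['Munnar']
```

People who went to Goa and Cochin mostly also went to Munnar, so that is the guess for "next summer". The statistics are trivial; the value comes from having lots of histories to count over.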


Don't throw away data just because it doesn't 'fit'

• Relational tuples, log files, semi-structured textual data (e.g., e-mail), pictures, videos

• User generated data & System generated data

• Applications need more than structured data

• My application is not “Dumb” any more!!

• “I keep saying that the sexy job in the next 10 years will be statisticians, and I’m not kidding.” - Hal Varian (Google’s chief economist)


Let's get to business!!

• Apache Hadoop is an open-source system to reliably store and process extremely large data sets across many commodity computers.

• Originally developed to support the Nutch search engine project.

• Designed to scale close to linearly as data size or analysis complexity grows

• Scale-out, shared-nothing architecture

• Inspired by Google's MapReduce and Google File System (GFS) papers

What is Apache Hadoop?


Basics of Hadoop

• Two Core Components – HDFS & Map-Reduce

• Machines are unreliable

• Separates distributed fault-tolerant computing code from application logic.

• No need to worry about identity of a machine

• Lets you interact with a cluster, not a bunch of machines.

• Analysis workloads span multiple machines

• Runs as a cloud (cluster) & possibly on a cloud (EC2)
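The split between fault-tolerance code and application logic can be sketched in a few lines of plain Python. This is a toy model and not Hadoop's API: `run_task`, `run_on_cluster`, the worker names and the `dead` set are all assumptions made for illustration.

```python
def run_task(app_logic, record, workers, dead):
    """Framework side: try machines in turn until one completes the task."""
    for worker in workers:
        if worker in dead:
            continue  # machine failed: reschedule the task on the next one
        return app_logic(record), worker
    raise RuntimeError("no healthy worker left")

def run_on_cluster(app_logic, records, workers, dead=frozenset()):
    """Run app_logic over every record; the caller never picks a machine."""
    return [run_task(app_logic, r, workers, dead)[0] for r in records]

# Application side: plain logic, no failure handling, no machine names.
def line_length(line):
    return len(line)

# Worker "w1" is down, yet the application code is unchanged.
print(run_on_cluster(line_length, ["ab", "cde"], ["w1", "w2", "w3"], dead={"w1"}))  # [2, 3]
```

The application function knows nothing about which machine ran it or which one failed, which is the point of the bullets above.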


Lead Actors

• Name Node – Bookkeeping metadata server

• Secondary Name Node – Checkpointing assistant to the Name Node (not a hot standby)

• Job Tracker – Scheduler

• Task Tracker – Task execution

• Data Node – Block storage


HDFS Write Model
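As a rough illustration of the write model, the sketch below splits a file into fixed-size blocks, has a toy name node choose replica targets, and stores each block on several data nodes. `MiniHDFS` and its methods are hypothetical, and the tiny block size and round-robin placement are simplifications; real HDFS uses large blocks and rack-aware placement.

```python
from itertools import cycle

class MiniHDFS:
    """Toy model: a name node metadata table plus in-memory data nodes."""

    def __init__(self, data_nodes, block_size=4, replication=3):
        self.block_size = block_size
        self.replication = min(replication, len(data_nodes))
        self.data_nodes = {name: {} for name in data_nodes}  # node -> {block_id: bytes}
        self.metadata = {}                  # path -> [(block_id, [replica nodes])]
        self._next_node = cycle(data_nodes)  # naive round-robin placement

    def write(self, path, data):
        blocks = []
        for i in range(0, len(data), self.block_size):
            block_id = f"{path}#blk{i // self.block_size}"
            # The name node picks the replica targets for this block.
            targets = [next(self._next_node) for _ in range(self.replication)]
            # The block reaches the first target and is passed down the pipeline.
            for node in targets:
                self.data_nodes[node][block_id] = data[i:i + self.block_size]
            blocks.append((block_id, targets))
        self.metadata[path] = blocks  # only metadata lives on the name node

    def read(self, path, down=()):
        """Reassemble a file, skipping any data nodes listed in `down`."""
        out = b""
        for block_id, nodes in self.metadata[path]:
            node = next(n for n in nodes if n not in down)
            out += self.data_nodes[node][block_id]
        return out

fs = MiniHDFS(["dn1", "dn2", "dn3", "dn4"])
fs.write("/logs/a.txt", b"hello hadoop")
print(fs.read("/logs/a.txt", down={"dn1", "dn2"}))  # b'hello hadoop'
```

Because every block has three replicas, the file is still readable with two data nodes down.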


Map-Reduce Model
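The model can be sketched in plain Python with the classic word count: a map function emits key/value pairs, the framework groups values by key, and a reduce function combines each group. This is an in-memory illustration only; `map_fn`, `reduce_fn` and `map_reduce` are hypothetical names, not Hadoop's Java API.

```python
from collections import defaultdict

def map_fn(_, line):
    """Map: emit (word, 1) for every word in one input line."""
    for word in line.lower().split():
        yield word, 1

def reduce_fn(word, counts):
    """Reduce: combine all values that share a key."""
    yield word, sum(counts)

def map_reduce(records, mapper, reducer):
    # Map phase: each (key, value) record is processed independently.
    groups = defaultdict(list)
    for key, value in records:
        for out_key, out_value in mapper(key, value):
            groups[out_key].append(out_value)  # shuffle: group values by key
    # Reduce phase: visit keys in sorted order and combine each group.
    results = []
    for out_key in sorted(groups):
        results.extend(reducer(out_key, groups[out_key]))
    return results

lines = [(0, "big data big deal"), (1, "Big cluster")]
print(map_reduce(lines, map_fn, reduce_fn))
# [('big', 3), ('cluster', 1), ('data', 1), ('deal', 1)]
```

The application author writes only `map_fn` and `reduce_fn`; everything inside `map_reduce` is what the framework would do across many machines.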


Map-Reduce Execution Flow
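The execution flow can be approximated as: split the input, run one map task per split, partition map output by key across reducers, then sort and reduce within each reducer. The sketch below simulates that sequence in a single process. `partition` and `run_job` are hypothetical names, and `zlib.crc32` is only a stand-in for a hash partitioner, chosen because it is stable between runs.

```python
import zlib
from collections import defaultdict

def partition(key, num_reducers):
    # Same key always lands on the same reducer.
    return zlib.crc32(key.encode("utf-8")) % num_reducers

def run_job(lines, num_splits=2, num_reducers=2):
    # 1. Split the input so that map tasks can run independently.
    splits = [lines[i::num_splits] for i in range(num_splits)]

    # 2. Each map task emits (word, 1) pairs into per-reducer buckets.
    buckets = [defaultdict(list) for _ in range(num_reducers)]
    for split in splits:
        for line in split:
            for word in line.split():
                buckets[partition(word, num_reducers)][word].append(1)

    # 3. Each reduce task sorts its keys and sums the values,
    #    producing one output "part" per reducer.
    return [
        [(word, sum(ones)) for word, ones in sorted(bucket.items())]
        for bucket in buckets
    ]

parts = run_job(["a b a", "b c", "a"])
print(len(parts))  # 2
```

Each word appears in exactly one output part, so the parts can be written and read independently.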


Hadoop Ecosystem

• Oozie – Open-source workflow/coordination service to manage data processing jobs for Apache Hadoop™, developed at Yahoo!

• HBase – Column-oriented database modeled on Google’s Bigtable; holds extremely large data sets (petabytes)

• Hive – SQL-based data warehousing application with features for analyzing very large data sets, developed at Facebook

• ZooKeeper – Distributed coordination service used for leader election, service discovery, and distributed locking / mutual exclusion

• Pig – Platform for analyzing large data sets, built around a high-level language (Pig Latin) for expressing data analysis steps

• Ganglia – Scalable distributed monitoring system for high-performance computing systems such as clusters and grids


Hadoop is not a “Holy Grail”

• Not a substitute for a database

• MapReduce is not always the best algorithm

• HDFS is not a substitute for a High Availability SAN-hosted FS

• HDFS is not a POSIX file system

• Not a place to learn Java programming

• Not a place to learn Unix/Linux system administration

• Not a place to learn basics of networking


Notable Users of Hadoop (Source: http://en.wikipedia.org/wiki/Hadoop)

• A9.com

• AOL

• eHarmony

• eBay

• Facebook

• Fox Interactive Media

• IBM

• Last.fm

• LinkedIn

• Meebo

• Metaweb

• The New York Times

• Rackspace

• StumbleUpon

• Twitter

• Yahoo

• Amazon


www.flytxt.com

Q & A



THANK YOU

Contact us: [email protected] / [email protected]
