Harnessing Hadoop for Big Data Analytics v0.1
Confidential. Copyright © 2010 Flytxt B.V. All rights reserved.
Transforming Mobile Marketing & Advertising™
Harnessing Hadoop for Big Data Analytics
Jobin [email protected]
Who am I ?
• Architect @ Flytxt (Big Data Analytics & Automation)
• Passionate about data, distributed computing , machine learning
• Previously
• Virtualization & Cloud Lifecycle Management (BMC)
• Designed and implemented the Cloud Lifecycle Management interface for BMC
• Large-Scale Data Centre Automation (AOL)
• Implemented a centralized data center management framework for AOL
• Workflow Systems & Automation (Accenture)
• Implemented a Service Management Suite for various customers
Session Agenda!
• Data – What's the big deal?
• What is Hadoop (& what it is not)
• Map-Reduce Model & HDFS
• Hadoop Ecosystem & Tools
• Let's get started!
• Q&A
Five computers & 640k ;-)
Moore’s Law
• "I think there is a world market for about five computers" - attributed to Thomas Watson, Chairman of the board of IBM, 1943
• "640k ought to be enough for anybody" - attributed to Bill Gates, 1981
Do I also know what you might do next summer?
• Does your travel company know you visited Goa &
Cochin twice in the last two years?
• Collaborative Filtering
• Lots of Data + Statistics = WOW!!!
• BTW, don’t worry about the eqn
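The collaborative filtering idea above can be sketched in a few lines. This is a minimal, user-based variant using cosine similarity, not the equation from the original slide; the traveller names, destinations, and visit counts are made-up illustrative data, and `cosine` and `recommend` are hypothetical helper names.

```python
from math import sqrt

# Hypothetical visit counts per traveller and destination (illustrative data only).
visits = {
    "asha":  {"Goa": 2, "Cochin": 2, "Munnar": 1},
    "ravi":  {"Goa": 2, "Cochin": 1},
    "meera": {"Jaipur": 3, "Agra": 1},
}

def cosine(a, b):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(a[k] * b[k] for k in a.keys() & b.keys())
    norm_a = sqrt(sum(v * v for v in a.values()))
    norm_b = sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def recommend(user, data):
    """Rank destinations the user has not visited, weighted by how
    similar the travellers who did visit them are to this user."""
    scores = {}
    for other, places in data.items():
        if other == user:
            continue
        sim = cosine(data[user], places)
        if sim <= 0:
            continue  # no overlap in travel history, so no signal
        for place, count in places.items():
            if place not in data[user]:
                scores[place] = scores.get(place, 0.0) + sim * count
    return sorted(scores, key=scores.get, reverse=True)

print(recommend("ravi", visits))
```

With only three travellers the result is trivial; the point of the slide is that the same statistics become useful once there is a lot of data behind them.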
Don't throw away data just because it doesn't 'fit'
• Relational tuples, log files, semi-structured textual data (e.g., e-mail), pictures, videos
• User generated data & System generated data
• Applications need more than structured data
• My application is not “Dumb” any more!!
• “I keep saying that the sexy job in the next 10 years will be
statisticians, and I’m not kidding.” - Hal Varian (Google’s chief economist)
What is Apache Hadoop?
Let's get to business!!
• Apache Hadoop is an open-source system to
reliably store and process extremely large data sets
across many commodity computers.
• Originally developed to support the Nutch search engine
project.
• Designed to scale close to linearly with data size or analysis complexity as machines are added
• Scale-out, shared-nothing architecture
• Inspired by Google's MapReduce and Google File
System (GFS) papers
Basics of Hadoop
• Two Core Components – HDFS & Map-Reduce
• Machines are unreliable
• Separates distributed fault-tolerant computing code from application
logic.
• No need to worry about identity of a machine
• lets you interact with a cluster, not a bunch of machines.
• Analysis workloads span across multiple machines
• Runs as a cloud (cluster) & possibly on a cloud (e.g., EC2)
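To make the Map-Reduce half of this slide concrete, here is a single-process sketch of the model using the usual word-count example. It is plain Python, not the Hadoop API: the function names (`map_phase`, `shuffle`, `reduce_phase`, `word_count`) are hypothetical, and on a real cluster the framework runs the map and reduce steps on many machines and does the shuffle for you.

```python
from collections import defaultdict

def map_phase(line):
    """Map: emit a (word, 1) pair for every word in one input line."""
    return [(word.lower(), 1) for word in line.split()]

def shuffle(pairs):
    """Shuffle: group all values by key, the step the framework
    performs between the map and reduce phases."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(key, values):
    """Reduce: collapse all values for one key into a single result."""
    return key, sum(values)

def word_count(lines):
    """Run map, shuffle, and reduce in sequence over the input lines."""
    pairs = [pair for line in lines for pair in map_phase(line)]
    return dict(reduce_phase(k, v) for k, v in shuffle(pairs).items())

lines = ["big data big cluster", "data on a cluster"]
print(word_count(lines))
```

The application author writes only the map and reduce functions; everything about distribution and retries stays out of the application logic, which is the separation the slide describes.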
Lead Actors
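• Name Node – Bookkeeping metadata server (which blocks make up a file, and where they live)
• Secondary Name Node – Assistant to the Name Node (periodic metadata checkpoints; not a hot standby)
• Job Tracker – Job scheduler
• Task Tracker – Task execution
• Data Node – Block storage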
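A toy model of the Name Node / Data Node split may help here: a file is cut into blocks, each block is copied to several Data Nodes, and the Name Node only keeps the map. This is a sketch under stated assumptions, not HDFS code: `place_blocks` and `survives_failure` are hypothetical names, the 64 MB block size and 3 replicas are used as illustrative defaults, and round-robin placement stands in for the rack-aware policy a real cluster uses.

```python
def place_blocks(file_size_mb, data_nodes, block_size_mb=64, replication=3):
    """Return {block_index: [data nodes holding a replica]} for one file.

    This dictionary plays the role of the Name Node's metadata.
    Placement is simple round-robin (an assumption for illustration).
    """
    if replication > len(data_nodes):
        raise ValueError("need at least as many data nodes as replicas")
    num_blocks = -(-file_size_mb // block_size_mb)  # ceiling division
    block_map = {}
    for block in range(num_blocks):
        start = block % len(data_nodes)
        block_map[block] = [
            data_nodes[(start + r) % len(data_nodes)] for r in range(replication)
        ]
    return block_map

def survives_failure(block_map, failed_node):
    """True if every block still has a replica on some other node."""
    return all(
        any(node != failed_node for node in replicas)
        for replicas in block_map.values()
    )

nodes = ["dn1", "dn2", "dn3", "dn4"]
block_map = place_blocks(200, nodes)
for block, replicas in block_map.items():
    print(block, replicas)
print(survives_failure(block_map, "dn2"))
```

With three replicas per block, losing any single Data Node leaves every block readable, which is how the cluster tolerates unreliable machines.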
Hadoop Ecosystem
• Oozie – Open-source workflow/coordination
service to manage data processing jobs for Apache
Hadoop™ - Developed at Yahoo!
• HBase – Column-oriented database modeled on
Google’s Bigtable. Holds extremely large data sets
(petabytes)
• Hive – SQL-based data warehousing application with
features for analyzing very large data sets -
Developed at Facebook
• ZooKeeper – Distributed coordination service
providing leader election, service
discovery, distributed locking / mutual exclusion
• Pig - platform for analyzing large data sets that
consists of a high-level language for expressing
data analysis steps
• Ganglia - a scalable distributed monitoring system
for high-performance computing systems such as
clusters and Grids
Hadoop is not a “Holy Grail”
• Not a substitute for a database
• MapReduce is not always the best algorithm
• HDFS is not a substitute for a
High Availability SAN-hosted FS
• HDFS is not a POSIX file system
• Not a place to learn Java programming
• Not a place to learn Unix/Linux system administration
• Not a place to learn basics of networking
Notable Users of Hadoop (Source: http://en.wikipedia.org/wiki/Hadoop)
• A9.com
• AOL
• eHarmony
• eBay
• Fox Interactive Media
• IBM
• Last.fm
• Meebo
• Metaweb
• The New York Times
• Rackspace
• StumbleUpon
• Yahoo!
• Amazon
www.flytxt.com
THANK YOU
Contact us: [email protected] / [email protected]