Houston Hadoop Meetup presentation by Vikram Oberoi of Cloudera
DESCRIPTION
When and why to use Hadoop. Hadoop-able problems and use cases.
TRANSCRIPT
What is Hadoop, and When Should I Consider Using It?
Houston HUG June 6th, 2011
Vikram Oberoi, Cloudera
Copyright 2011 Cloudera Inc. All rights reserved
About me
• Data engineer at Cloudera, present • Using data and Hadoop to enable more responsive support
• Data engineer at Meebo, Aug ’09 – Nov ’10 • Data infrastructure, analytics
• CS at Stanford, ’09 • Senior project: ext3 and XFS under Hadoop MapReduce workloads
• Data engineer at Meebo, ’08 • Built an A/B testing system
• SDE Intern at Amazon, ’07 • R&D on item-to-item similarities
What will I talk about?
• What is Hadoop? • Typical Hadoop-able problems and use cases
• Cloudera overview
What is Hadoop?
Source: An IDC White Paper - sponsored by EMC. As the Economy Contracts, the Digital Universe Expands. May 2009.
Big Data Problem: Exploding Data Volumes
• Online • Web-ready devices • Social media • Digital content
• Enterprise • Transactions • R&D data • Operational (control) data
• Open data initiatives
Relational
Complex, Unstructured
• 2,500 exabytes of new information in 2012 with Internet as primary driver • Digital universe grew by 62% last year to 800K petabytes and will grow to 1.2 “zettabytes” this year
Big Data Problem: Data Economics
Low ROB
• Return on Byte (ROB) = value to be extracted from that byte / cost of storing that byte • If ROB is < 1, the data gets buried in a tape wasteland, so we need cheaper active storage.
High ROB
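The Return on Byte ratio from this slide can be sketched in a few lines of Python. This is only an illustration of the arithmetic: the function names and the dollar figures in the usage note are hypothetical and do not come from the talk.

```python
def return_on_byte(value_per_byte: float, storage_cost_per_byte: float) -> float:
    """ROB = value to be extracted from a byte / cost of storing that byte."""
    if storage_cost_per_byte <= 0:
        raise ValueError("storage cost per byte must be positive")
    return value_per_byte / storage_cost_per_byte


def stays_in_active_storage(value_per_byte: float, storage_cost_per_byte: float) -> bool:
    """Per the slide, data with ROB < 1 tends to end up on tape."""
    return return_on_byte(value_per_byte, storage_cost_per_byte) >= 1.0
```

With made-up numbers: if a byte is expected to yield 2 units of value, it fails the test at a storage cost of 4 units (ROB = 0.5) and passes it at a cost of 1 unit (ROB = 2.0). Lowering the storage cost is what moves data from the low-ROB side to the high-ROB side.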
MapReduce
Hadoop Distributed File System (HDFS)
Hadoop: A Data Platform with Unique Benefits
• Consolidates Everything • Move complex and relational data into a single repository
• Stores Inexpensively • Keep raw data always available • Use commodity hardware
• Processes at the Source • Eliminate ETL bottlenecks • Mine data first, govern later
Hadoop Distributed File System (HDFS)
• Based on design of Google’s GFS
• Data stored in large files • Files can contain any data
• Files separated into blocks • 64MB up to 256MB per block (tunable) • Each block replicated across the cluster (tunable, usually 3 replicas) • This buys you: fault tolerance, parallelizable disk reads
• Store whatever you want in it • This buys you: flexibility
“How is data stored?”
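The block and replica arithmetic on this slide can be worked through with a small sketch. It assumes the 64MB block size and 3 replicas quoted above as defaults. The function name and the returned field names are hypothetical, chosen for illustration only, and are not part of any Hadoop API.

```python
MB = 1024 ** 2


def hdfs_block_layout(file_size_bytes: int,
                      block_size_bytes: int = 64 * MB,
                      replication: int = 3) -> dict:
    """Estimate how a file splits into blocks and how much raw disk it uses."""
    if file_size_bytes < 0 or block_size_bytes <= 0 or replication <= 0:
        raise ValueError("file size must be >= 0; block size and replication must be > 0")
    # Ceiling division: a partially filled last block still counts as a block.
    num_blocks = -(-file_size_bytes // block_size_bytes)
    return {
        "blocks": num_blocks,
        "block_replicas": num_blocks * replication,
        # The last block only occupies its actual length on disk,
        # so raw usage is the file size times the replication factor.
        "raw_bytes": file_size_bytes * replication,
    }
```

For example, a 300MB file with the defaults splits into 5 blocks stored as 15 block replicas, using about 900MB of raw disk across the cluster. Those 5 blocks are what allow 5 nodes to read the file in parallel.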
MapReduce
• Framework designed for parallel processing of large, disk-bound batch jobs
• Data processed at the source • File ‘foo’ has 5 blocks, processing happens on 5 nodes • Parallelized disk reads → remove disk bottleneck
• Way to express algorithms such that they are parallelizable
• Two functions at the core of every job: • Map function (group by) • Reduce function (perform action on group)
“How is data processed?”
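The map and reduce roles described on this slide can be illustrated with a single-process word count in plain Python. This is a teaching sketch, not Hadoop's API: `run_job`, `map_words`, and `reduce_counts` are hypothetical names, and the in-memory grouping stands in for the shuffle that a real cluster performs between the two phases.

```python
from collections import defaultdict
from typing import Callable, Iterable, Iterator, Tuple


def map_words(line: str) -> Iterator[Tuple[str, int]]:
    """Map: emit a (key, value) pair per word; the key decides the group."""
    for word in line.lower().split():
        yield word, 1


def reduce_counts(word: str, counts: Iterable[int]) -> Tuple[str, int]:
    """Reduce: perform an action on one group, here summing its values."""
    return word, sum(counts)


def run_job(records: Iterable[str], mapper: Callable, reducer: Callable) -> dict:
    """Run map, group by key, then reduce, all in one process."""
    groups = defaultdict(list)
    for record in records:
        for key, value in mapper(record):
            groups[key].append(value)  # stand-in for the shuffle phase
    return dict(reducer(key, values) for key, values in groups.items())
```

Calling `run_job(["Hadoop stores data", "Hadoop processes data"], map_words, reduce_counts)` counts `hadoop` and `data` twice each and the other words once. Because each record is mapped independently and each group is reduced independently, both phases can be spread across nodes, which is the property the slide calls parallelizable.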
What is Hadoop?
• A scalable fault-tolerant distributed system for data storage and processing (open source under the Apache license)
• Scalable data processing engine • Hadoop Distributed File System (HDFS): self-healing high-bandwidth clustered storage
• MapReduce: fault-tolerant distributed processing • Key value
• Flexible -> store data without a schema and add it later as needed • Affordable -> cost / TB at a fraction of traditional options • Broadly adopted -> a large and active ecosystem • Proven at scale -> dozens of petabyte+ implementations in production today
Cloudera’s Distribution Including Apache Hadoop
• Open source – 100% Apache licensed and free for download • Simplified – Component versions & dependencies managed for you • Integrated – All components & functions interoperate through standard APIs • Reliable – Patched with fixes from future releases to improve stability • Supported – Employs project founders and committers for >90% of components
[Diagram of components: Hue, Hue SDK, Oozie, HBase, Flume, Sqoop, Zookeeper, Pig, Hive]
The Industry’s Leading Hadoop Distribution
Typical Hadoop-able problems
What is common across Hadoop-able problems?
Nature of the data
• Complex data • Multiple data sources • Lots of it
Nature of the analysis
• Batch processing • Parallelizable
What kinds of analyses are possible with Hadoop?
• Text mining
• Index building
• Graph creation and analysis
• Pattern recognition
• Collaborative filtering
• Prediction models
• Sentiment analysis
• Risk assessment
Top 10 Hadoop-able Problems
1. Modeling True Risk
2. Customer Churn Analysis
3. Recommendation Engines
4. Ad Targeting
5. Point of Sale Transaction Analysis
6. Analyzing Network Data to Predict Failure
7. Threat Analysis/Fraud Detection
8. Trade Surveillance
9. Search Quality
10. Data “Sandbox”
See archived webinar on cloudera.com
Example: Modeling True Risk
Example: Modeling True Risk
Solution with Hadoop
• Source, parse and aggregate disparate data sources to build a comprehensive data picture • e.g. credit card records, call recordings, chat sessions, emails, banking activity
• Structure and analyze • Sentiment analysis, graph creation, pattern recognition
Typical Industry
• Financial Services (Banks, Insurance)
Example: Threat Analysis
Example: Threat Analysis
Solution with Hadoop
• Parallel processing over huge datasets
• Pattern recognition to identify anomalies, i.e. threats
Typical Industry
• Security • Financial Services • General: spam fighting, click fraud
Example: Recommendation Engine
Example: Recommendation Engine
Solution with Hadoop
• Batch processing framework • Allows execution in parallel over large datasets
• Collaborative filtering • Collecting ‘taste’ information from many users • Utilizing that information to predict what similar users like
Typical Industry
• Ecommerce, Manufacturing, Retail
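The collaborative filtering idea above (collect taste information from many users, then predict what similar users like) can be sketched with item co-occurrence counts. This is one simple variant picked for illustration and not necessarily the approach the speaker had in mind. The function names and the sample data in the usage note are hypothetical, and everything runs in one process. On a cluster, the pair counting would be the natural candidate for a batch job.

```python
from collections import Counter
from itertools import combinations
from typing import Dict, List


def item_cooccurrence(baskets: Dict[str, List[str]]) -> Counter:
    """Count how many users have each unordered pair of items."""
    counts = Counter()
    for items in baskets.values():
        for a, b in combinations(sorted(set(items)), 2):
            counts[(a, b)] += 1
    return counts


def recommend(user: str, baskets: Dict[str, List[str]], top_n: int = 3) -> List[str]:
    """Score items the user lacks by how often they co-occur with owned items."""
    owned = set(baskets[user])
    scores = Counter()
    for (a, b), n in item_cooccurrence(baskets).items():
        if a in owned and b not in owned:
            scores[b] += n
        elif b in owned and a not in owned:
            scores[a] += n
    return [item for item, _ in scores.most_common(top_n)]
```

Given `{"ann": ["book", "lamp"], "bob": ["book", "lamp", "desk"], "cat": ["book", "desk"]}`, the call `recommend("ann", ...)` suggests `desk`, because other users who have a book or a lamp also tend to have a desk.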
Example: Analyzing Network Data to Predict Failure
Example: Analyzing Network Data to Predict Failure
Solution with Hadoop
• Take the computation to the data • Expand the range of indexing techniques from simple scans to more complex data mining
• Better understand how the network reacts to fluctuations • How previously thought discrete anomalies may, in fact, be interconnected
• Identify leading indicators of component failure
Typical Industry
• Utilities, Telecommunications, Data Centers
Example: Supporting Hadoop at Cloudera
• Collect data from customer clusters • OS configs, Hadoop configs, command outputs, logs • Data served by HBase, used by supporters
• Consolidate data about Hadoop in HDFS • Mailing lists, issue trackers, wiki pages, IRC, books • Customer cluster data
• Analyze many data sources to understand Hadoop issues and deployments • Build tools to enable easier diagnosis or proactive support
Cloudera overview
Cloudera Offerings Enabling the Enterprise Adoption of Apache Hadoop
PLATFORM SUPPORT & APPLICATIONS
PROFESSIONAL SERVICES TRAINING
Contact/Resources/Questions
• [email protected] • irc.freenode.net #cloudera #hadoop • @cloudera
• Cloudera Groups: http://groups.cloudera.org • Hadoop: The Definitive Guide • 10 Hadoop-able problems on Slideshare
• Questions? (P.S. We’re hiring SAs in Houston!)