![Page 1: 20100806 cloudera 10 hadoopable problems webinar](https://reader036.vdocuments.us/reader036/viewer/2022062514/55a2d84a1a28ab887d8b462b/html5/thumbnails/1.jpg)
10 Common Hadoop-able Problems
August 5, 2010
![Page 2: 20100806 cloudera 10 hadoopable problems webinar](https://reader036.vdocuments.us/reader036/viewer/2022062514/55a2d84a1a28ab887d8b462b/html5/thumbnails/2.jpg)
Topics
• Introduction
• 10 Common Hadoop-able Problems
• Summary
• Questions
Copyright 2010 Cloudera Inc. All rights reserved
![Page 3: 20100806 cloudera 10 hadoopable problems webinar](https://reader036.vdocuments.us/reader036/viewer/2022062514/55a2d84a1a28ab887d8b462b/html5/thumbnails/3.jpg)
Today’s speaker - Jeff Hammerbacher
• Studied Mathematics at Harvard
• Worked as a Quant on Wall Street
• Conceived, built, and led Data team at Facebook
• Nearly 30 amazing engineers and data scientists
• Several open source projects and research papers
• Founder of Cloudera
• Chief Scientist
• Also, check out the book “Beautiful Data”
![Page 4: 20100806 cloudera 10 hadoopable problems webinar](https://reader036.vdocuments.us/reader036/viewer/2022062514/55a2d84a1a28ab887d8b462b/html5/thumbnails/4.jpg)
What is Hadoop?
• A scalable fault-tolerant distributed system for data storage and processing (open source under the Apache license)
• Scalable data processing engine
• Hadoop Distributed File System (HDFS): self-healing high-bandwidth clustered storage
• MapReduce: fault-tolerant distributed processing
• Key value
• Flexible -> store data without a schema and add it later as needed
• Affordable -> cost / TB at a fraction of traditional options
• Broadly adopted -> a large and active ecosystem
• Proven at scale -> dozens of petabyte+ implementations in production today
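To make the MapReduce bullet concrete, here is a minimal single-process sketch of the map, shuffle, and reduce steps using the classic word-count example. It is an illustration only, not Hadoop's API: the function names `map_phase`, `shuffle`, `reduce_phase`, and `word_count` are assumptions made for this sketch, and a real job would run these phases across many machines.

```python
from collections import defaultdict

def map_phase(line):
    """Emit a (word, 1) pair for every word in one input line."""
    for word in line.lower().split():
        yield word, 1

def shuffle(pairs):
    """Group all emitted values by key, as the framework does between phases."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped

def reduce_phase(key, values):
    """Combine all values for one key into a single result."""
    return key, sum(values)

def word_count(lines):
    """Run the three phases in order on an in-memory list of lines."""
    pairs = (pair for line in lines for pair in map_phase(line))
    return dict(reduce_phase(k, v) for k, v in shuffle(pairs).items())

if __name__ == "__main__":
    print(word_count(["Hadoop stores data", "Hadoop processes data"]))
```

Because each call to `map_phase` and `reduce_phase` depends only on its own input, the framework can run many of them at once and rerun any that fail.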
![Page 5: 20100806 cloudera 10 hadoopable problems webinar](https://reader036.vdocuments.us/reader036/viewer/2022062514/55a2d84a1a28ab887d8b462b/html5/thumbnails/5.jpg)
Cloudera’s Distribution for Hadoop, version 3
• Open source – 100% Apache licensed
• Simplified – Component versions & dependencies managed for you
• Integrated – All components & functions interoperate through standard APIs
• Reliable – Patched with fixes from future releases to improve stability
• Supported – Employs project founders and committers for >70% of components
[Diagram: CDH3 components: Hue, Hue SDK, Oozie, HBase, Flume, Sqoop, Zookeeper, Hive, Pig]
The industry’s leading Hadoop distribution
![Page 6: 20100806 cloudera 10 hadoopable problems webinar](https://reader036.vdocuments.us/reader036/viewer/2022062514/55a2d84a1a28ab887d8b462b/html5/thumbnails/6.jpg)
How does Cloudera know which problems are Hadoop-able?
• Talking to 1000s of users
• Supporting 100s of implementations
• Experience putting Hadoop into production with customers across a range of industries
![Page 7: 20100806 cloudera 10 hadoopable problems webinar](https://reader036.vdocuments.us/reader036/viewer/2022062514/55a2d84a1a28ab887d8b462b/html5/thumbnails/7.jpg)
Summary – 10 Common Hadoop-able Problems
1. Modeling true risk
2. Customer churn analysis
3. Recommendation engine
4. Ad targeting
5. PoS transaction analysis
6. Analyzing network data to predict failure
7. Threat analysis
8. Trade surveillance
9. Search quality
10. Data “sandbox”
![Page 8: 20100806 cloudera 10 hadoopable problems webinar](https://reader036.vdocuments.us/reader036/viewer/2022062514/55a2d84a1a28ab887d8b462b/html5/thumbnails/8.jpg)
What is common across Hadoop-able problems?
Nature of the data
• Complex data
• Multiple data sources
• Lots of it
Nature of the analysis
• Batch processing
• Parallel execution
• Spread data over a cluster of servers and take the computation to the data
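The idea of taking the computation to the data can be sketched as follows: each node summarizes its own partition locally, and only the small summaries travel over the network to be merged. The node names, the `partitions` layout, and the functions `local_count` and `combine` are hypothetical and exist only for this illustration.

```python
from collections import Counter

# Hypothetical cluster: each "node" holds its own partition of log records.
partitions = {
    "node-1": ["GET", "GET", "POST"],
    "node-2": ["GET", "PUT"],
    "node-3": ["POST", "POST", "GET"],
}

def local_count(records):
    """Runs where the data lives; returns a small summary, not the raw records."""
    return Counter(records)

def combine(partials):
    """Merge the per-node summaries into one global result."""
    total = Counter()
    for partial in partials:
        total.update(partial)
    return total

# Each partition is summarized independently, then the summaries are merged.
partials = [local_count(records) for records in partitions.values()]
totals = combine(partials)
print(dict(totals))
```

The design choice is that the raw records never leave their node, so network traffic grows with the size of the summaries rather than the size of the data.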
![Page 9: 20100806 cloudera 10 hadoopable problems webinar](https://reader036.vdocuments.us/reader036/viewer/2022062514/55a2d84a1a28ab887d8b462b/html5/thumbnails/9.jpg)
What Analysis is Possible With Hadoop?
• Text mining
• Index building
• Graph creation and analysis
• Pattern recognition
• Collaborative filtering
• Prediction models
• Sentiment analysis
• Risk assessment
![Page 10: 20100806 cloudera 10 hadoopable problems webinar](https://reader036.vdocuments.us/reader036/viewer/2022062514/55a2d84a1a28ab887d8b462b/html5/thumbnails/10.jpg)
Benefits of Analyzing With Hadoop
• Previously impossible/impractical to do this analysis
• Analysis conducted at lower cost
• Analysis conducted in less time
• Greater flexibility
![Page 11: 20100806 cloudera 10 hadoopable problems webinar](https://reader036.vdocuments.us/reader036/viewer/2022062514/55a2d84a1a28ab887d8b462b/html5/thumbnails/11.jpg)
Topics
• Introduction
• 10 Common Hadoop-able Problems
• Summary
• Questions
![Page 12: 20100806 cloudera 10 hadoopable problems webinar](https://reader036.vdocuments.us/reader036/viewer/2022062514/55a2d84a1a28ab887d8b462b/html5/thumbnails/12.jpg)
1. Modeling True Risk
![Page 13: 20100806 cloudera 10 hadoopable problems webinar](https://reader036.vdocuments.us/reader036/viewer/2022062514/55a2d84a1a28ab887d8b462b/html5/thumbnails/13.jpg)
1. Modeling True Risk
Solution with Hadoop
• Source, parse and aggregate disparate data sources to build comprehensive data picture
• e.g. credit card records, call recordings, chat sessions, emails, banking activity
• Structure and analyze
• Sentiment analysis, graph creation, pattern recognition
Typical Industry
• Financial Services (Banks, Insurance)
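As a rough sketch of the "source, parse and aggregate" step, the snippet below groups records from several sources under the customer they belong to, producing one combined picture per customer. The source names, record shapes, and the function `build_profiles` are assumptions for illustration; the slides do not specify a data model.

```python
from collections import defaultdict

def build_profiles(sources):
    """Group records from every source under the customer they belong to.

    `sources` maps a source name to a list of (customer_id, record) pairs.
    """
    profiles = defaultdict(lambda: defaultdict(list))
    for source_name, records in sources.items():
        for customer_id, record in records:
            profiles[customer_id][source_name].append(record)
    # Convert nested defaultdicts to plain dicts for a predictable result.
    return {cid: dict(by_source) for cid, by_source in profiles.items()}

if __name__ == "__main__":
    example = {
        "card": [("c1", {"amount": 40}), ("c2", {"amount": 900})],
        "email": [("c1", "late payment notice")],
    }
    print(build_profiles(example))
```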
![Page 14: 20100806 cloudera 10 hadoopable problems webinar](https://reader036.vdocuments.us/reader036/viewer/2022062514/55a2d84a1a28ab887d8b462b/html5/thumbnails/14.jpg)
2. Customer Churn Analysis
![Page 15: 20100806 cloudera 10 hadoopable problems webinar](https://reader036.vdocuments.us/reader036/viewer/2022062514/55a2d84a1a28ab887d8b462b/html5/thumbnails/15.jpg)
2. Customer Churn Analysis
Solution with Hadoop
• Rapidly build and test a behavioral model of the customer from disparate data sources
• Structure and analyze with Hadoop
• Traversing
• Graph creation
• Pattern recognition
Typical Industry
• Telecommunications, Financial Services
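Graph creation and traversal might look like the sketch below, which assumes call records are available as (caller, callee) pairs and asks which customers sit within a few hops of a customer who has already churned. The input shape and the names `build_graph` and `within_hops` are hypothetical.

```python
from collections import defaultdict, deque

def build_graph(call_records):
    """Build an undirected graph: an edge links two customers who called each other."""
    graph = defaultdict(set)
    for caller, callee in call_records:
        graph[caller].add(callee)
        graph[callee].add(caller)
    return graph

def within_hops(graph, start, max_hops):
    """Breadth-first traversal returning customers reachable within max_hops of start."""
    seen = {start}
    queue = deque([(start, 0)])
    while queue:
        node, depth = queue.popleft()
        if depth == max_hops:
            continue  # do not expand beyond the hop limit
        for neighbor in graph.get(node, ()):
            if neighbor not in seen:
                seen.add(neighbor)
                queue.append((neighbor, depth + 1))
    return seen - {start}

if __name__ == "__main__":
    calls = [("ann", "bob"), ("bob", "cat"), ("cat", "dan")]
    print(within_hops(build_graph(calls), "ann", 2))
```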
![Page 16: 20100806 cloudera 10 hadoopable problems webinar](https://reader036.vdocuments.us/reader036/viewer/2022062514/55a2d84a1a28ab887d8b462b/html5/thumbnails/16.jpg)
3. Recommendation Engine
![Page 17: 20100806 cloudera 10 hadoopable problems webinar](https://reader036.vdocuments.us/reader036/viewer/2022062514/55a2d84a1a28ab887d8b462b/html5/thumbnails/17.jpg)
3. Recommendation Engine
Solution with Hadoop
• Batch processing framework
• Allows execution in parallel over large datasets
• Collaborative filtering
• Collecting ‘taste’ information from many users
• Utilizing information to predict what similar users like
Typical Industry
• Ecommerce, Manufacturing, Retail
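One simple form of collaborative filtering is item co-occurrence: count how often two items are liked by the same user, then suggest items that co-occur with what a user already likes. The sketch below is one possible approach under that assumption, not the method any particular recommender uses; `co_occurrence` and `recommend` are illustrative names.

```python
from collections import Counter
from itertools import combinations

def co_occurrence(baskets):
    """Count how often each ordered pair of items is liked by the same user."""
    counts = Counter()
    for items in baskets.values():
        for a, b in combinations(sorted(set(items)), 2):
            counts[(a, b)] += 1
            counts[(b, a)] += 1
    return counts

def recommend(user, baskets, top_n=3):
    """Score items the user has not liked by co-occurrence with items they do like."""
    counts = co_occurrence(baskets)
    liked = set(baskets[user])
    scores = Counter()
    for (a, b), n in counts.items():
        if a in liked and b not in liked:
            scores[b] += n
    return [item for item, _ in scores.most_common(top_n)]

if __name__ == "__main__":
    likes = {
        "u1": ["book", "lamp"],
        "u2": ["book", "lamp", "desk"],
        "u3": ["book", "desk"],
        "u4": ["lamp", "pen"],
    }
    print(recommend("u1", likes))
```

The pair-counting step is independent per user, which is what makes it a natural fit for batch execution in parallel.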
![Page 18: 20100806 cloudera 10 hadoopable problems webinar](https://reader036.vdocuments.us/reader036/viewer/2022062514/55a2d84a1a28ab887d8b462b/html5/thumbnails/18.jpg)
4. Ad Targeting
![Page 19: 20100806 cloudera 10 hadoopable problems webinar](https://reader036.vdocuments.us/reader036/viewer/2022062514/55a2d84a1a28ab887d8b462b/html5/thumbnails/19.jpg)
4. Ad Targeting
Solution with Hadoop
• Data analysis can be conducted in parallel, reducing processing times from days to hours
• With Hadoop, as data volumes grow the only expansion cost is hardware
• Add more nodes without a degradation in performance
Typical Industry
• Advertising
![Page 20: 20100806 cloudera 10 hadoopable problems webinar](https://reader036.vdocuments.us/reader036/viewer/2022062514/55a2d84a1a28ab887d8b462b/html5/thumbnails/20.jpg)
5. Point of Sale Transaction Analysis
![Page 21: 20100806 cloudera 10 hadoopable problems webinar](https://reader036.vdocuments.us/reader036/viewer/2022062514/55a2d84a1a28ab887d8b462b/html5/thumbnails/21.jpg)
5. Point of Sale Transaction Analysis
Solution with Hadoop
• Batch processing framework
• Allows execution in parallel over large datasets
• Pattern recognition
• Optimizing over multiple data sources
• Utilizing information to predict demand
Typical Industry
• Retail
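As a small illustration of using transaction data to predict demand, the sketch below sums units sold per product and day across stores, then forecasts the next day with a plain moving average. The tuple layout and the names `daily_totals` and `forecast` are assumptions, and a moving average stands in here for whatever model a retailer would actually choose.

```python
from collections import defaultdict

def daily_totals(transactions):
    """Sum units sold per (product, day) across every store's PoS feed."""
    totals = defaultdict(int)
    for store, product, day, units in transactions:
        totals[(product, day)] += units
    return totals

def forecast(totals, product, days, window=3):
    """Naive forecast: mean demand over the last `window` entries of `days`.

    `days` must be a non-empty, chronologically ordered list of day keys.
    """
    recent = [totals.get((product, d), 0) for d in days[-window:]]
    return sum(recent) / len(recent)

if __name__ == "__main__":
    tx = [
        ("s1", "milk", 1, 10),
        ("s2", "milk", 1, 5),
        ("s1", "milk", 2, 12),
        ("s2", "milk", 3, 9),
    ]
    print(forecast(daily_totals(tx), "milk", [1, 2, 3]))
```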
![Page 22: 20100806 cloudera 10 hadoopable problems webinar](https://reader036.vdocuments.us/reader036/viewer/2022062514/55a2d84a1a28ab887d8b462b/html5/thumbnails/22.jpg)
6. Analyzing Network Data to Predict Failure
![Page 23: 20100806 cloudera 10 hadoopable problems webinar](https://reader036.vdocuments.us/reader036/viewer/2022062514/55a2d84a1a28ab887d8b462b/html5/thumbnails/23.jpg)
6. Analyzing Network Data to Predict Failure
Solution with Hadoop
• Take the computation to the data
• Expand the range of indexing techniques from simple scans to more complex data mining
• Better understand how the network reacts to fluctuations
• Discover how anomalies previously thought to be discrete may, in fact, be interconnected
• Identify leading indicators of component failure
Typical Industry
• Utilities, Telecommunications, Data Centers
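One way to look for leading indicators of component failure is to count which event types tend to appear on a component shortly before it fails. The sketch below assumes events arrive as (timestamp, component, event_type) tuples and that failures are marked with the hypothetical type `"FAILURE"`; the function name `leading_indicators` is also an assumption.

```python
from collections import Counter

def leading_indicators(events, window):
    """Count event types seen on a component within `window` seconds before it fails.

    `events` is a list of (timestamp, component, event_type) tuples.
    """
    failures = [(t, c) for t, c, kind in events if kind == "FAILURE"]
    counts = Counter()
    for t, c, kind in events:
        if kind == "FAILURE":
            continue
        # Count the event once if any failure of the same component follows in time.
        if any(c == fc and 0 < ft - t <= window for ft, fc in failures):
            counts[kind] += 1
    return counts

if __name__ == "__main__":
    log = [
        (100, "sw1", "HIGH_TEMP"),
        (150, "sw1", "PACKET_LOSS"),
        (200, "sw1", "FAILURE"),
        (120, "sw2", "HIGH_TEMP"),
        (880, "sw2", "PACKET_LOSS"),
        (900, "sw2", "FAILURE"),
    ]
    print(dict(leading_indicators(log, window=120)))
```

This nested scan is quadratic and only suitable for a small sample; at cluster scale the same counting would be partitioned by component.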
![Page 24: 20100806 cloudera 10 hadoopable problems webinar](https://reader036.vdocuments.us/reader036/viewer/2022062514/55a2d84a1a28ab887d8b462b/html5/thumbnails/24.jpg)
7. Threat Analysis
![Page 25: 20100806 cloudera 10 hadoopable problems webinar](https://reader036.vdocuments.us/reader036/viewer/2022062514/55a2d84a1a28ab887d8b462b/html5/thumbnails/25.jpg)
7. Threat Analysis
Solution with Hadoop
• Parallel processing over huge datasets
• Pattern recognition to identify anomalies, i.e., potential threats
Typical Industry
• Security
• Financial Services
• General: spam fighting, click fraud
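A very simple form of anomaly detection flags values that sit far above the average, measured in standard deviations (a z-score). The sketch below applies that idea to per-source event counts; the threshold of 2.0, the sample IP addresses, and the function name `find_anomalies` are assumptions chosen for illustration, and real threat detection would use richer features.

```python
from statistics import mean, pstdev

def find_anomalies(counts, threshold=2.0):
    """Return keys whose count lies more than `threshold` standard deviations above the mean."""
    values = list(counts.values())
    mu, sigma = mean(values), pstdev(values)
    if sigma == 0:
        return []  # every source behaves identically, nothing stands out
    return sorted(k for k, v in counts.items() if (v - mu) / sigma > threshold)

if __name__ == "__main__":
    failed_logins = {
        "10.0.0.1": 3, "10.0.0.2": 4, "10.0.0.3": 5, "10.0.0.4": 4,
        "10.0.0.5": 3, "10.0.0.6": 4, "10.0.0.7": 90,
    }
    print(find_anomalies(failed_logins))
```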
![Page 26: 20100806 cloudera 10 hadoopable problems webinar](https://reader036.vdocuments.us/reader036/viewer/2022062514/55a2d84a1a28ab887d8b462b/html5/thumbnails/26.jpg)
8. Trade Surveillance
![Page 27: 20100806 cloudera 10 hadoopable problems webinar](https://reader036.vdocuments.us/reader036/viewer/2022062514/55a2d84a1a28ab887d8b462b/html5/thumbnails/27.jpg)
8. Trade Surveillance
Solution with Hadoop
• Batch processing framework
• Allows execution in parallel over large datasets
• Pattern recognition
• Detect trading anomalies and harmful behavior
Typical Industry
• Financial services
• Regulatory bodies
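As one hypothetical example of a trading anomaly, the sketch below flags accounts that buy and sell the same symbol and quantity within a short time window. The pattern, the tuple layout, and the name `flag_round_trips` are assumptions for illustration; the slides do not say which behaviors a surveillance system looks for.

```python
def flag_round_trips(trades, window):
    """Flag accounts that buy and sell the same symbol and quantity within `window` seconds.

    `trades` is a list of (timestamp, account, symbol, side, quantity) tuples.
    """
    flagged = set()
    ordered = sorted(trades)  # tuples sort by timestamp first
    for i, (t1, acct1, sym1, side1, qty1) in enumerate(ordered):
        for t2, acct2, sym2, side2, qty2 in ordered[i + 1:]:
            if t2 - t1 > window:
                break  # trades are time-ordered, so later ones are out of range too
            if (acct1, sym1, qty1) == (acct2, sym2, qty2) and side1 != side2:
                flagged.add((acct1, sym1))
    return sorted(flagged)

if __name__ == "__main__":
    sample = [
        (0, "A", "XYZ", "BUY", 100),
        (30, "A", "XYZ", "SELL", 100),
        (10, "B", "XYZ", "BUY", 50),
        (500, "B", "XYZ", "SELL", 50),
    ]
    print(flag_round_trips(sample, window=60))
```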
![Page 28: 20100806 cloudera 10 hadoopable problems webinar](https://reader036.vdocuments.us/reader036/viewer/2022062514/55a2d84a1a28ab887d8b462b/html5/thumbnails/28.jpg)
9. Search Quality
![Page 29: 20100806 cloudera 10 hadoopable problems webinar](https://reader036.vdocuments.us/reader036/viewer/2022062514/55a2d84a1a28ab887d8b462b/html5/thumbnails/29.jpg)
9. Search Quality
Solution with Hadoop
• Analyzing search attempts in conjunction with structured data
• Pattern recognition
• Browsing pattern of users performing searches in different categories
Typical Industry
• Web
• Ecommerce
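Analyzing search attempts together with structured data could be as simple as joining a search log against a category lookup and measuring how often searches in each category end without a click. The log shape, the `catalog` mapping, and the name `abandonment_by_category` are hypothetical.

```python
from collections import defaultdict

def abandonment_by_category(search_log, catalog):
    """Share of searches per category that ended without a click.

    `search_log` holds (query, clicked) pairs; `catalog` maps query -> category.
    """
    totals = defaultdict(int)
    abandoned = defaultdict(int)
    for query, clicked in search_log:
        category = catalog.get(query, "unknown")
        totals[category] += 1
        if not clicked:
            abandoned[category] += 1
    return {c: abandoned[c] / totals[c] for c in totals}

if __name__ == "__main__":
    log = [("tv", True), ("tv", False), ("sofa", False), ("sofa", False), ("xyz", True)]
    categories = {"tv": "electronics", "sofa": "furniture"}
    print(abandonment_by_category(log, categories))
```

A category with a high abandonment share is a candidate for closer inspection of its search results.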
![Page 30: 20100806 cloudera 10 hadoopable problems webinar](https://reader036.vdocuments.us/reader036/viewer/2022062514/55a2d84a1a28ab887d8b462b/html5/thumbnails/30.jpg)
10. Data “Sandbox”
![Page 31: 20100806 cloudera 10 hadoopable problems webinar](https://reader036.vdocuments.us/reader036/viewer/2022062514/55a2d84a1a28ab887d8b462b/html5/thumbnails/31.jpg)
10. Data “Sandbox”
Solution with Hadoop
• With Hadoop, an organization can “dump” all this data into an HDFS cluster
• Then use Hadoop to start trying out different analyses on the data
• See patterns or relationships that allow the organization to derive additional value from data
Typical Industry
• Common across all industries
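The sandbox approach relies on storing raw data first and deciding how to interpret it later. A minimal sketch of that idea follows: raw lines are kept untouched, and each analysis applies its own choice of fields when it reads them. The sample lines, the JSON encoding, and the function `read_with_schema` are assumptions for illustration.

```python
import json

# Raw lines land in storage untouched; no schema is declared up front.
raw_lines = [
    '{"user": "ann", "action": "view", "ms": 120}',
    '{"user": "bob", "action": "buy", "amount": 19.99}',
    'not json at all',
]

def read_with_schema(lines, fields):
    """Apply a schema at read time: keep requested fields, default missing ones to None."""
    for line in lines:
        try:
            record = json.loads(line)
        except json.JSONDecodeError:
            continue  # skip lines this particular reader cannot interpret
        if isinstance(record, dict):
            yield {f: record.get(f) for f in fields}

# Two different "analyses" read the same raw data with different field choices.
rows = list(read_with_schema(raw_lines, ["user", "action"]))
amounts = list(read_with_schema(raw_lines, ["amount"]))
print(rows)
print(amounts)
```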
![Page 32: 20100806 cloudera 10 hadoopable problems webinar](https://reader036.vdocuments.us/reader036/viewer/2022062514/55a2d84a1a28ab887d8b462b/html5/thumbnails/32.jpg)
Topics
• Introduction
• 10 Common Hadoop-able Problems
• Summary
• Questions
![Page 33: 20100806 cloudera 10 hadoopable problems webinar](https://reader036.vdocuments.us/reader036/viewer/2022062514/55a2d84a1a28ab887d8b462b/html5/thumbnails/33.jpg)
Summary – 10 Common Hadoop-able Problems
1. Modeling true risk
2. Customer churn analysis
3. Recommendation engine
4. Ad targeting
5. PoS transaction analysis
6. Analyzing network data to predict failure
7. Threat analysis
8. Trade surveillance
9. Search quality
10. Data “sandbox”
![Page 34: 20100806 cloudera 10 hadoopable problems webinar](https://reader036.vdocuments.us/reader036/viewer/2022062514/55a2d84a1a28ab887d8b462b/html5/thumbnails/34.jpg)
Who is Cloudera?
• Enterprise software & services company providing the industry’s leading Hadoop-based data management platform
• Founding team came from large Web companies
• Products: Cloudera Enterprise & Cloudera’s Distribution for Hadoop
• All necessary packages, matched, tested and supported
• Tools to support production use of Hadoop
• The leading distribution for the enterprise
• Contributors and committers
• Fixing, patching and adding features
![Page 35: 20100806 cloudera 10 hadoopable problems webinar](https://reader036.vdocuments.us/reader036/viewer/2022062514/55a2d84a1a28ab887d8b462b/html5/thumbnails/35.jpg)
Hear More Examples @ Hadoop World 2010
• 2nd annual event focused on practical applications of Hadoop
• Date: October 12th 2010
• Location: Hilton New York
• Keynote from Tim O’Reilly – founder O’Reilly Media
• Pre and post conference training available for Hadoop and related projects
• 36 business and technical focused sessions
Confirmed speakers from
http://www.cloudera.com/company/press-center/hadoop-world-nyc/
![Page 36: 20100806 cloudera 10 hadoopable problems webinar](https://reader036.vdocuments.us/reader036/viewer/2022062514/55a2d84a1a28ab887d8b462b/html5/thumbnails/36.jpg)
Questions?