Big Data Paris
DESCRIPTION
A talk I gave during the vendor pitch section at Big Data Paris.
TRANSCRIPT
1©MapR Technologies - Confidential
Expect More from Hadoop
Introducing MapR
MapR offers the technology-leading distribution for Hadoop
Industry Leaders Choose MapR in the Cloud
Google chose MapR to provide Hadoop on Google Compute Engine
Amazon EMR is the largest Hadoop provider in revenue and number of clusters
MapR Supports Broad Set of Use Cases
– Log analysis
– HBase
– Customer targeting
– Social media analysis
– Customer Revenue Analytics
– ETL Offload
– Advertising exchange analysis and optimization
– Clickstream Analysis
– Quality profiling/field failure analysis
– Customer Sentiment
– Network Analytics
– Monitors and measures behavior of online shoppers
– Fraud Detection
– Channel analytics
– Customer Behavior Analysis
– Brand Monitoring
– Customer targeting
– Viewer Behavioral analytics
– Recommendation Engine
– Family tree connections
– Intrusion detection & prevention
– Forensic analysis
– Global threat analytics
– Virus analysis
– Patient care monitoring
– Leading Retailer: Recommendation Engine
– Leading Bank: Fraud detection and Prevention
Introducing Hadoop
Hadoop is deployed because
a) big data
b) fast data
c) rapidly changing data
Introducing Change
Changing data implies a need for integration
If you copy, the data will change before you finish.
Controlling Change
Changing data implies a need for stabilization
Long running analyses must have stable data
The Story Can Now be Told
Here are three true stories about how Hadoop integration pays off
Story #1: ETL Off-load
The Problem
Major telecom vendor
Key step in billing pipeline handled by data warehouse (EDW)
EDW at maximum capacity
Multiple rounds of software optimization already done
Revenue limiting (= career limiting) bottleneck
Original Flow
[Diagram components: CDR billing records, ETL, Data Warehouse, Billing reports, Customer bills]
70% of total load
<10% of total code
Import by bulk load from NFS
With ETL Offload
[Diagram components: CDR billing records, ETL, Data Warehouse, Billing reports, Customer billing]
Import written to MapR via NFS
Bulk load via NFS from MapR
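Because both the import and the bulk load go through NFS, the offloaded ETL step can be written with ordinary file I/O. The sketch below illustrates that idea only: the CSV layout (`customer_id`, `seconds`, `rate_per_minute`), the function name, and the paths are hypothetical, and the talk does not describe the real record format or ETL logic.

```python
import csv
from collections import defaultdict
from pathlib import Path


def summarize_cdrs(cdr_file: Path, out_file: Path) -> dict:
    """Aggregate call charges per customer and write a bulk-load file.

    Both paths are plain files. On a real cluster they would live under
    the NFS mount point (assumed here, not taken from the talk).
    """
    totals = defaultdict(float)
    with cdr_file.open(newline="") as f:
        for row in csv.DictReader(f):
            minutes = float(row["seconds"]) / 60.0
            totals[row["customer_id"]] += minutes * float(row["rate_per_minute"])

    # Write one line per customer, ready for the warehouse bulk loader.
    with out_file.open("w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(["customer_id", "charge"])
        for customer, charge in sorted(totals.items()):
            writer.writerow([customer, f"{charge:.2f}"])
    return dict(totals)
```

Nothing in the function is cluster-specific, which is the point of the NFS interface: the same code runs against a local directory or a mounted cluster path.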
Simplified Analysis – EDW Strategy
70% of EDW consumed by ETL processing
EDW direct hardware cost is approximately $30 million CAPEX, $12 million OPEX
Additional EDW only increases capacity by 50% due to poor division of labor
Simplified Analysis – MapR Strategy
Hardware + MapR cost ~ $1.5 million
ETL replacement development costs ~ $1.5 million
Result is 3x performance increase
Price Performance
EDW strategy
– 1.5 x performance
– $30 million
MapR strategy
– 3 x performance
– $3 million
20x cost/performance advantage for MapR strategy
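The 20x figure follows from the numbers on the previous slides. The short calculation below reproduces it, assuming (as the slide appears to) that only the $30 million CAPEX is counted for the EDW and that the MapR cost is the sum of the two ~$1.5 million items. The function and variable names are illustrative.

```python
def cost_per_unit_performance(cost_musd: float, speedup: float) -> float:
    """Cost in millions of dollars per 1x of performance gained."""
    return cost_musd / speedup


# EDW strategy: $30M for a 1.5x performance increase.
edw = cost_per_unit_performance(30.0, 1.5)

# MapR strategy: hardware + MapR (~$1.5M) plus ETL rewrite (~$1.5M) for 3x.
mapr_cost = 1.5 + 1.5
mapr = cost_per_unit_performance(mapr_cost, 3.0)

advantage = edw / mapr
print(f"EDW: ${edw:.0f}M per 1x, MapR: ${mapr:.0f}M per 1x, advantage: {advantage:.0f}x")
```

Including the $12 million OPEX on the EDW side would make the ratio larger, so 20x is the conservative reading of the slide's numbers.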
Story #2: Search Abuse
The Problem
Build a high-performance recommendation engine
– Use all kinds of available data
Deploy it to production
– Must have efficient deployment
Input Data
User transactions
– user id, merchant id
– SIC code, amount
Offer transactions
– user id, offer id
– vendor id, merchant ids
– offers, views, accepts
Import data via standard interfaces from log files, databases, direct feeds
Find anomalous indicators of behavior
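The slides do not say how "anomalous" is scored. One widely used choice for cooccurrence data, and one that Mahout provides, is the log-likelihood ratio (G²) test on a 2x2 contingency table. The sketch below is an assumption about the technique, not a description of the system in the talk, and the counts in the usage lines are made up.

```python
import math


def x_log_x(x: float) -> float:
    return 0.0 if x == 0 else x * math.log(x)


def entropy(*counts: float) -> float:
    """Unnormalized Shannon entropy of a set of counts."""
    return x_log_x(sum(counts)) - sum(x_log_x(k) for k in counts)


def llr(k11: int, k12: int, k21: int, k22: int) -> float:
    """Log-likelihood ratio score for a 2x2 cooccurrence table.

    k11: users with both A and B     k12: users with A but not B
    k21: users with B but not A      k22: users with neither
    Large values mean A and B cooccur more (or less) than chance predicts.
    """
    row = entropy(k11 + k12, k21 + k22)
    col = entropy(k11 + k21, k12 + k22)
    mat = entropy(k11, k12, k21, k22)
    return max(0.0, 2.0 * (row + col - mat))


# Hypothetical counts: independent events score near zero,
# strongly associated events score high.
independent = llr(10, 10, 10, 10)
associated = llr(100, 0, 0, 100)
```

Item pairs whose score exceeds a chosen threshold would become the "indicator" fields stored with each item.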
Search-based Recommendations
Sample “document”
– Merchant Id
– Field for text description
– Phone
– Address
– Location
– Indicator merchant ids
– Indicator industry (SIC) ids
– Indicator offers
– Indicator text
– Local top40
User History (query)
– Current location
– Recent merchant descriptions
– Recent merchant ids
– Recent SIC codes
– Recent accepted offers
– Local top40
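The idea on this slide is that each item is stored as a search document carrying indicator fields, and the user's recent history is turned into a query against those fields. The toy sketch below shows the matching step with a plain overlap count. It is a stand-in for a real search engine, which would weight terms rather than count them, and all field names and ids are hypothetical.

```python
def score(doc: dict, history: dict) -> int:
    """Count history terms that appear in the matching indicator field."""
    total = 0
    for field, terms in history.items():
        total += len(set(terms) & set(doc.get(field, ())))
    return total


def recommend(docs: list, history: dict, k: int = 2) -> list:
    """Return up to k merchant ids, best match first, skipping non-matches."""
    ranked = sorted(docs, key=lambda d: score(d, history), reverse=True)
    return [d["merchant_id"] for d in ranked[:k] if score(d, history) > 0]


# Hypothetical item documents with indicator fields.
docs = [
    {"merchant_id": "m1", "indicator_merchant_ids": ["m7", "m9"], "indicator_sic_ids": ["5812"]},
    {"merchant_id": "m2", "indicator_merchant_ids": ["m3"], "indicator_sic_ids": ["5411"]},
    {"merchant_id": "m3", "indicator_merchant_ids": [], "indicator_sic_ids": []},
]

# The user's recent history, expressed as a query over the same fields.
history = {"indicator_merchant_ids": ["m7"], "indicator_sic_ids": ["5812", "5411"]}

top = recommend(docs, history)
```

Because the recommendation is just a query, deploying it needs only the search index and the web tier, with no separate model-serving system.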
[Diagram components: Transactions, Web Views, Email offers, Cooccurrence (Mahout), Item meta-data, Solr indexing, Solr Indexer (x2), Index shards]
Legacy code runs directly in map-reduce framework
[Diagram components: Web tier, User history, Solr search, Solr Indexer (x2), Item meta-data, Index shards]
SolrCloud runs without change via NFS
Objective Results
At a very large credit card company
History is all transactions, all web interaction
Processing time cut from 20 hours per day to 3
Recommendation engine load time decreased from 8 hours to 3 minutes
Story #3: Stable Learning
The Theme and Setting
A humble machine learning expert once lived in a small cubicle
One day the CEO walked in and said
– Your machine recommended PINK WAFFLES to my wife!!!
– Tell me why it is suddenly doing this
The machine learning expert could say nothing because he could not reproduce the conditions under which the model was trained
The CEO was not pleased
Why?
[Diagram components: Web-site, Web Service, Data Logger, Kafka API, Kafka Cluster (several), Storm, NAS, Web Data, Flume, Hadoop, HDFS Data]
Data arrives continuously
Learning steps can’t be tied to delayed data
It can be delayed arbitrarily
The Essence of the Problem
Coupling data arrival with modeling makes the data chain brittle
– Minor delays in data delivery will break modeling SLAs
But if data can arrive late and restate the past then we can’t easily replicate a model build
Existing data chains don’t support full bitemporal queries
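A bitemporal query asks two questions at once: what happened up to some event time, and what did the system know at some later point. The sketch below shows why this matters when late data restates the past. The `Fact` record and `as_of` function are illustrative names, not part of any system described in the talk.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Fact:
    key: str
    value: float
    event_time: int    # when the event happened
    arrival_time: int  # when the record reached the system


def as_of(facts, event_cutoff: int, knowledge_cutoff: int) -> dict:
    """Values for events up to event_cutoff, as known at knowledge_cutoff.

    Later arrivals overwrite earlier ones for the same (key, event_time),
    but only if they had arrived by knowledge_cutoff.
    """
    known = {}
    for f in sorted(facts, key=lambda f: f.arrival_time):
        if f.event_time <= event_cutoff and f.arrival_time <= knowledge_cutoff:
            known[(f.key, f.event_time)] = f.value
    return known


facts = [
    Fact("u1", 10.0, event_time=1, arrival_time=1),
    Fact("u1", 12.0, event_time=1, arrival_time=5),  # late restatement
]

at_training_time = as_of(facts, event_cutoff=1, knowledge_cutoff=2)
today = as_of(facts, event_cutoff=1, knowledge_cutoff=9)
```

A model trained at time 2 saw the value 10.0. Replaying the build later against the current data would see 12.0 unless the query also fixes the knowledge cutoff.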
[Diagram components: Web-site, Data Logger, MapR, Data, Snap, Modeling, Model (x4), Mirror, Live System]
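The diagram suggests that modeling reads from a snapshot ("Snap") of the data rather than from the live, changing copy. To make a build reproducible, each model also needs a record of which snapshot and code version produced it. The manifest sketch below is one hypothetical way to keep that record. The field names and example values are assumptions, and the code does not call any MapR API.

```python
import hashlib
import json


def build_manifest(model_name: str, snapshot_name: str,
                   code_version: str, params: dict) -> dict:
    """Describe a model build so it can be replayed later.

    The build id is a hash of the inputs, so identical inputs always
    give the same id and any change in data or code gives a new one.
    """
    manifest = {
        "model": model_name,
        "data_snapshot": snapshot_name,
        "code_version": code_version,
        "params": params,
    }
    canonical = json.dumps(manifest, sort_keys=True)
    manifest["build_id"] = hashlib.sha256(canonical.encode()).hexdigest()[:12]
    return manifest


# Hypothetical snapshot name and code version.
first = build_manifest("recommender", "snap-monday", "v1.4", {"k": 40})
again = build_manifest("recommender", "snap-monday", "v1.4", {"k": 40})
later = build_manifest("recommender", "snap-tuesday", "v1.4", {"k": 40})
```

Storing the manifest next to each deployed model ties the model back to stable input data, even while new data keeps arriving on the live system.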
The New Story
A humble machine learning expert once lived in a small cubicle
One day the CEO walked in and said
– Your machine recommended PINK WAFFLES to my wife!!!
– Tell me why it is suddenly doing this
The machine learning expert could
– Pull out all previously deployed models
– Exactly replicate any training run with any version of software
– Point out that PINK WAFFLES were actually quite stylish
The CEO was very pleased … he ran off to buy pink waffles
Expect more from Hadoop
Expect MapR
Contact me!
[email protected] or [email protected]
@ted_dunning
Come to the MapR booth