Big Data Paris

Upload: ted-dunning

Post on 10-May-2015

DESCRIPTION

A talk I gave during the vendor pitch section at Big Data Paris.

TRANSCRIPT

Page 1: Big Data Paris

1©MapR Technologies - Confidential

Expect More from Hadoop

Page 2: Big Data Paris

Introducing MapR

MapR offers the technology-leading distribution for Hadoop

Page 3: Big Data Paris

The Industry Leaders Choose MapR in the Cloud

Google chose MapR to provide Hadoop on Google Compute Engine

Amazon EMR is the largest Hadoop provider in revenue and number of clusters

Page 4: Big Data Paris

MapR Supports a Broad Set of Use Cases

Log analysis
HBase
Customer targeting
Social media analysis
Customer Revenue Analytics
ETL Offload
Advertising exchange analysis and optimization
Clickstream Analysis
Quality profiling/field failure analysis
Customer Sentiment
Network Analytics
Monitors and measures behavior of online shoppers
Fraud Detection
Channel analytics
Customer Behavior Analysis
Brand Monitoring
Customer targeting
Viewer Behavioral analytics
Recommendation Engine
Family tree connections
Intrusion detection & prevention
Forensic analysis
Global threat analytics
Virus analysis
Patient care monitoring
Leading Retailer
Recommendation Engine
Fraud detection and Prevention
Leading Bank

Page 5: Big Data Paris

Introducing Hadoop

Hadoop is deployed because

a) big data
b) fast data
c) rapidly changing data

Page 8: Big Data Paris

Introducing Change

Changing data implies a need for integration

If you copy, the data will change before you finish.

Page 10: Big Data Paris

Controlling Change

Changing data implies a need for stabilization

Long-running analyses must have stable data

Page 11: Big Data Paris

The Story Can Now be Told

Here are three true stories about how Hadoop integration pays off

Page 12: Big Data Paris

Story #1: ETL Off-load

Page 13: Big Data Paris

The Problem

Major telecom vendor

Key step in billing pipeline handled by data warehouse (EDW)

EDW at maximum capacity

Multiple rounds of software optimization already done

Revenue-limiting (= career-limiting) bottleneck

Page 15: Big Data Paris

Original Flow

[Diagram labels: CDR billing records, ETL, Data Warehouse, Billing reports, Customer bills]

70% of total load, <10% of total code

Import by bulk load from NFS

Page 16: Big Data Paris

With ETL Offload

[Diagram labels: CDR billing records, ETL, Data Warehouse, Billing reports, Customer billing]

Import written to MapR via NFS

Bulk load via NFS from MapR
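Because the cluster is reachable through an ordinary NFS mount, the offloaded ETL step can in principle be written with plain file I/O. The following is a minimal sketch under stated assumptions, not the telecom's actual pipeline: the CSV layout (`customer_id,seconds,rate_per_minute`) and the function names `summarize_cdrs` and `write_bulk_load` are invented for illustration, and in-memory text stands in for files on the mount.

```python
import csv
import io
from collections import defaultdict


def summarize_cdrs(lines):
    """Aggregate raw call detail records into one amount per customer.

    `lines` is any iterable of CSV text lines, for example a file opened
    from the NFS mount. The column names are hypothetical.
    """
    totals = defaultdict(float)
    for row in csv.DictReader(lines):
        minutes = float(row["seconds"]) / 60.0
        totals[row["customer_id"]] += minutes * float(row["rate_per_minute"])
    return dict(totals)


def write_bulk_load(totals, out):
    """Write a compact CSV that a warehouse bulk loader could pick up."""
    writer = csv.writer(out, lineterminator="\n")
    writer.writerow(["customer_id", "amount"])
    for customer_id in sorted(totals):
        writer.writerow([customer_id, f"{totals[customer_id]:.2f}"])


# Demo with in-memory text; on a cluster these would be open() calls
# on paths under the NFS mount.
raw = io.StringIO(
    "customer_id,seconds,rate_per_minute\n"
    "A,120,0.5\n"
    "B,60,1.0\n"
    "A,60,0.5\n"
)
out = io.StringIO()
write_bulk_load(summarize_cdrs(raw), out)
```

The point of the sketch is only that no special client library is involved: the heavy aggregation happens outside the warehouse, and the warehouse sees a much smaller file to load.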

Page 17: Big Data Paris

Simplified Analysis – EDW Strategy

70% of EDW consumed by ETL processing

EDW direct hardware cost is approximately $30 million CAPEX, $12 million OPEX

Additional EDW only increases capacity by 50% due to poor division of labor

Page 18: Big Data Paris

Simplified Analysis – MapR Strategy

Hardware + MapR cost ~ $1.5 million

ETL replacement development costs ~ $1.5 million

Result is 3x performance increase
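One way to read the 3x figure, offered here as an assumption rather than something the slides state: if ETL consumes 70% of the warehouse and is moved off entirely, the remaining workload occupies only 30% of the machine, so it could grow by roughly 1 / 0.3 before the warehouse is full again. The function name is hypothetical.

```python
def headroom_after_offload(etl_fraction):
    """Factor by which the non-ETL workload can grow once ETL is offloaded.

    Assumes load scales linearly and that the offloaded ETL leaves no
    residual work on the warehouse; both are simplifying assumptions.
    """
    if not 0.0 <= etl_fraction < 1.0:
        raise ValueError("etl_fraction must be in [0, 1)")
    return 1.0 / (1.0 - etl_fraction)


growth = headroom_after_offload(0.70)
print(f"{growth:.2f}x headroom")
```

In practice the bulk load still costs the warehouse something, which would pull the real number below this idealized bound.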

Page 19: Big Data Paris

Price Performance

EDW strategy
– 1.5x performance
– $30 million

MapR strategy
– 3x performance
– $3 million

20x cost/performance advantage for MapR strategy
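The 20x figure follows from dividing each strategy's cost by its speedup. A short check of the arithmetic, using only the numbers on the slides (CAPEX only, as in the slide; the function name is hypothetical):

```python
def cost_per_unit_performance(cost_millions, speedup):
    """Millions of dollars paid per 1x of baseline performance."""
    return cost_millions / speedup


edw = cost_per_unit_performance(30.0, 1.5)         # EDW strategy
mapr = cost_per_unit_performance(1.5 + 1.5, 3.0)   # hardware + MapR, plus ETL rewrite
advantage = edw / mapr
print(f"EDW: ${edw:.0f}M per 1x, MapR: ${mapr:.0f}M per 1x, advantage: {advantage:.0f}x")
```

Including the $12 million OPEX on the EDW side would only widen the gap.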

Page 20: Big Data Paris

Story #2: Search Abuse

Page 21: Big Data Paris

The Problem

Build a high-performance recommendation engine
– Use all kinds of available data

Deploy it to production
– Must have efficient deployment

Page 23: Big Data Paris

Input Data

User transactions
– user id, merchant id
– SIC code, amount

Offer transactions
– user id, offer id
– vendor id, merchant id’s
– offers, views, accepts

Import data via standard interfaces from log files, databases, direct feeds

Find anomalous indicators of behavior
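The slides do not say which statistic was used to decide that a cooccurrence is "anomalous". The sketch below assumes the log-likelihood ratio (G²) test on a 2x2 contingency table, which is one common choice for cooccurrence analysis. The function names, the threshold of 10.0, and the toy histories are all assumptions made for illustration.

```python
import math


def x_log_x(x):
    return 0.0 if x == 0 else x * math.log(x)


def entropy(*counts):
    """Unnormalized Shannon entropy of a set of counts."""
    return x_log_x(sum(counts)) - sum(x_log_x(k) for k in counts)


def llr(k11, k12, k21, k22):
    """Log-likelihood ratio score for a 2x2 contingency table.

    k11: histories with both A and B, k12: A without B,
    k21: B without A, k22: neither.
    """
    row = entropy(k11 + k12, k21 + k22)
    col = entropy(k11 + k21, k12 + k22)
    mat = entropy(k11, k12, k21, k22)
    return max(0.0, 2.0 * (row + col - mat))


def indicators(histories, item, threshold=10.0):
    """Items that cooccur with `item` anomalously often across histories."""
    n = len(histories)
    with_item = [h for h in histories if item in h]
    found = {}
    for other in set().union(*histories) - {item}:
        k11 = sum(1 for h in with_item if other in h)
        k12 = len(with_item) - k11
        k21 = sum(1 for h in histories if other in h) - k11
        k22 = n - k11 - k12 - k21
        # The score is also high for items that avoid each other,
        # so keep only positive associations.
        positive = k11 * k22 > k12 * k21
        score = llr(k11, k12, k21, k22)
        if positive and score >= threshold:
            found[other] = score
    return found


# Toy data: each history is the set of merchant ids one user visited.
histories = [{"m1", "m2"}] * 10 + [{"m3"}] * 10
result = indicators(histories, "m1")
```

Here `m2` comes out as an indicator for `m1`, while `m3` is rejected because it never appears alongside `m1`.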

Page 26: Big Data Paris

Search-based Recommendations

Sample “document”
– Merchant Id
– Field for text description
– Phone
– Address
– Location
– Indicator merchant id’s
– Indicator industry (SIC) id’s
– Indicator offers
– Indicator text
– Local top 40

User History (query)
– Current location
– Recent merchant descriptions
– Recent merchant id’s
– Recent SIC codes
– Recent accepted offers
– Local top 40
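The idea of treating the user's history as a query against indicator fields can be shown with a toy scorer. This is a stand-in for a real search engine, not the deployed system: the field names, the merchant ids, and the plain overlap count used as a score are all assumptions (a real engine would weight terms rather than count them).

```python
def score(document, query):
    """Count how many query terms hit the matching indicator fields."""
    total = 0
    for field, terms in query.items():
        total += len(set(terms) & set(document.get(field, ())))
    return total


def recommend(documents, query, k=2):
    """Return up to k merchant ids with the highest non-zero scores."""
    scored = [(score(d, query), d["merchant_id"]) for d in documents]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [merchant for s, merchant in scored[:k] if s > 0]


# Hypothetical merchant "documents" carrying indicator fields.
documents = [
    {"merchant_id": "m1",
     "indicator_merchant_ids": ["m7", "m9"],
     "indicator_sic_ids": ["5812"]},
    {"merchant_id": "m2",
     "indicator_merchant_ids": ["m3"],
     "indicator_sic_ids": ["5411"]},
    {"merchant_id": "m4",
     "indicator_merchant_ids": ["m9"],
     "indicator_sic_ids": []},
]

# The user's recent merchants and SIC codes become the query.
user_history = {
    "indicator_merchant_ids": ["m9", "m7"],
    "indicator_sic_ids": ["5812"],
}
top = recommend(documents, user_history)
```

The design point is that recommendation reduces to an ordinary search request, so the serving side needs nothing beyond the search engine itself.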

Page 28: Big Data Paris

[Diagram labels: Transactions, Web Views, Email offers, Cooccurrence (Mahout), Item meta-data, Solr Indexer (x2), Solr indexing, Index shards]

Legacy code runs directly in map-reduce framework

Page 30: Big Data Paris

[Diagram labels: User history, Web tier, Solr search, Solr Indexer (x2), Item meta-data, Index shards]

SolrCloud runs without change via NFS

Page 31: Big Data Paris

Objective Results

At a very large credit card company

History is all transactions, all web interaction

Processing time cut from 20 hours per day to 3

Recommendation engine load time decreased from 8 hours to 3 minutes

Page 32: Big Data Paris

Story #3: Stable Learning

Page 34: Big Data Paris

The Theme and Setting

A humble machine learning expert once lived in a small cubicle

One day the CEO walked in and said
– Your machine recommended PINK WAFFLES to my wife!!!
– Tell me why it is suddenly doing this

The machine learning expert could say nothing because he could not reproduce the conditions that the model was trained with

The CEO was not pleased

Page 35: Big Data Paris

Why?

Page 37: Big Data Paris

[Diagram labels: Twitter, Web-site, Data Logger, Web Service, Kafka API, Kafka Cluster (x3), Kafka, Storm, Flume, NAS, Web Data, Hadoop, HDFS Data]

Data arrives continuously

Learning steps can’t be tied to delayed data

It can be delayed arbitrarily

Page 38: Big Data Paris

The Essence of the Problem

Coupling data arrival with modeling makes the data chain brittle
– Minor delays in data delivery will break modeling SLA’s

But if data can arrive late and restate the past then we can’t easily replicate a model build

Existing data chains don’t support full bitemporal queries
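A bitemporal query asks two questions at once: which events happened up to some time, and what the system knew about them at some other time. The sketch below is a minimal illustration of why the second axis matters for replicating a model build. The `Record` fields, the integer timestamps, and the `as_of` function are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Record:
    key: str
    value: float
    event_time: int    # when the event happened
    arrival_time: int  # when the system learned about it


def as_of(records, event_cutoff, knowledge_cutoff):
    """Values for events up to event_cutoff, as known at knowledge_cutoff.

    When the same (key, event_time) arrives more than once, the latest
    arrival within the knowledge cutoff wins.
    """
    view = {}
    for r in sorted(records, key=lambda rec: rec.arrival_time):
        if r.event_time <= event_cutoff and r.arrival_time <= knowledge_cutoff:
            view[(r.key, r.event_time)] = r.value
    return view


records = [
    Record("u1", 10.0, event_time=1, arrival_time=1),
    Record("u2", 7.0, event_time=2, arrival_time=2),
    Record("u1", 12.0, event_time=1, arrival_time=5),  # late restatement
]

seen_by_training = as_of(records, event_cutoff=2, knowledge_cutoff=3)
seen_today = as_of(records, event_cutoff=2, knowledge_cutoff=6)
```

A model trained at time 3 saw 10.0 for `u1`, but a query by event time alone today returns 12.0, so the original training input cannot be recovered without the knowledge-time axis.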

Page 39: Big Data Paris

[Diagram labels: Twitter, Web-site, Data Logger, MapR, Data, Snap, Modeling, Model (x4), Mirror, Live System]
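The arrangement in this diagram, modeling against a snapshot while data keeps arriving, can be illustrated with an in-memory stand-in. This is a conceptual sketch only: `SnapshotStore`, the snapshot name, and the placeholder `train` function are hypothetical and do not represent any MapR API.

```python
class SnapshotStore:
    """In-memory stand-in for a store with point-in-time snapshots."""

    def __init__(self):
        self.live = []
        self._snapshots = {}

    def append(self, value):
        self.live.append(value)

    def snapshot(self, name):
        # Freeze the current contents; later appends do not affect it.
        self._snapshots[name] = tuple(self.live)
        return name

    def read(self, name):
        return self._snapshots[name]


def train(data):
    """Placeholder 'model': the mean of the training data."""
    return sum(data) / len(data)


store = SnapshotStore()
for value in (1.0, 2.0, 3.0):
    store.append(value)

snap = store.snapshot("snap-1")
model = train(store.read(snap))

store.append(100.0)                 # data keeps arriving
rebuilt = train(store.read(snap))   # same input, same model
drifted = train(store.live)         # live data gives a different model
```

Recording the snapshot name alongside each deployed model is what makes it possible to rerun a training job later and get the same result.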

Page 41: Big Data Paris

The New Story

A humble machine learning expert once lived in a small cubicle

One day the CEO walked in and said
– Your machine recommended PINK WAFFLES to my wife!!!
– Tell me why it is suddenly doing this

The machine learning expert could
– Pull out all previously deployed models
– Exactly replicate any training run with any version of software
– Point out that PINK WAFFLES were actually quite stylish

The CEO was very pleased … he ran off to buy pink waffles

Page 42: Big Data Paris

Expect more from Hadoop

Page 43: Big Data Paris

Expect MapR

Page 44: Big Data Paris

Contact me!

[email protected] or [email protected]

@ted_dunning

Come to the MapR booth