predictive-analytics-san-diego-2013-02-21

Upload: ted-dunning

Post on 26-Jan-2015

DESCRIPTION

The unification of big and little data processing onto a single platform is an important requirement for Hadoop. How can this be achieved? I explain what is needed for three important use cases.

TRANSCRIPT

Page 1: predictive-analytics-san-diego-2013-02-21

1©MapR Technologies - Confidential

Remembering the Future

Page 2: predictive-analytics-san-diego-2013-02-21

My Background

University, Startups
– Aptex, MusicMatch, ID Analytics, Veoh
– big data since before it was big

Open source
– even before the internet
– Apache Hadoop, Mahout, Zookeeper, Drill
– bought the beer at first HUG

MapR

Founding member of Apache Drill

Page 3: predictive-analytics-san-diego-2013-02-21

MapR Technologies

Silicon Valley Startup
– Top investors
– Top technical and management team
  • Google, Microsoft, EMC, NetApp, Oracle

Enterprise quality distribution for Hadoop

Many extensions to basic Hadoop function

Strong supporter of Apache Drill

Page 4: predictive-analytics-san-diego-2013-02-21

Philosophy First

What is History?

Page 5: predictive-analytics-san-diego-2013-02-21

The study of the past

(what came before now)

Page 6: predictive-analytics-san-diego-2013-02-21

What is the future?

(it comes after now)

Page 7: predictive-analytics-san-diego-2013-02-21

Page 8: predictive-analytics-san-diego-2013-02-21

Page 9: predictive-analytics-san-diego-2013-02-21

Page 10: predictive-analytics-san-diego-2013-02-21

But the future also has a past!

Page 11: predictive-analytics-san-diego-2013-02-21

Do you remember the future?

Page 12: predictive-analytics-san-diego-2013-02-21

Page 13: predictive-analytics-san-diego-2013-02-21

Page 14: predictive-analytics-san-diego-2013-02-21

Page 15: predictive-analytics-san-diego-2013-02-21

Page 16: predictive-analytics-san-diego-2013-02-21

Page 17: predictive-analytics-san-diego-2013-02-21

Some things turned out as expected

Page 18: predictive-analytics-san-diego-2013-02-21

Guys wearing Fedoras

Page 19: predictive-analytics-san-diego-2013-02-21

Many things are different!

Page 20: predictive-analytics-san-diego-2013-02-21

Hadoop has a history

Page 21: predictive-analytics-san-diego-2013-02-21

Hadoop also has a future

Page 22: predictive-analytics-san-diego-2013-02-21

The Old Future of Hadoop

Map-reduce and HDFS
– more and more, but not really different

Eco-system additions
– Simpler programming (Hive and Pig)
– Key-value store
– Ad hoc query

Stands apart from other computing
– Required by HDFS and other limitations

Page 23: predictive-analytics-san-diego-2013-02-21

The New Future of Hadoop

Real-time processing
– Combines real-time and long-time

Integration with traditional IT
– No need to stand apart

Integration with new technologies
– Solr, Node.js, Twisted all should interface directly

Fast and flexible computation
– Drill logical plan language

Page 24: predictive-analytics-san-diego-2013-02-21

Example #1: Search Abuse

Page 25: predictive-analytics-san-diego-2013-02-21

History matrix

One row per user

One column per thing

Page 26: predictive-analytics-san-diego-2013-02-21

Recommendation based on cooccurrence

Cooccurrence gives item-item mapping

One row and column per thing
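A minimal sketch of the step from the history matrix to an item-item cooccurrence mapping, in plain Python. The function names `cooccurrence` and `recommend`, the sample users, and the use of raw pair counts are assumptions for illustration; the Mahout job mentioned later in the deck may weight or filter pairs differently.

```python
from collections import defaultdict
from itertools import combinations


def cooccurrence(history):
    """Count, for each pair of items, how many users interacted with both.

    `history` maps a user id to the set of items in that user's row of the
    history matrix (hypothetical in-memory stand-in for the real data).
    """
    counts = defaultdict(lambda: defaultdict(int))
    for items in history.values():
        for a, b in combinations(sorted(items), 2):
            counts[a][b] += 1
            counts[b][a] += 1  # the mapping is symmetric
    return counts


def recommend(counts, user_items, top_n=3):
    """Rank unseen items by summed cooccurrence with the user's own items."""
    scores = defaultdict(int)
    for item in user_items:
        for other, c in counts.get(item, {}).items():
            if other not in user_items:
                scores[other] += c
    # Highest score first; ties broken alphabetically for a stable order.
    return sorted(scores, key=lambda k: (-scores[k], k))[:top_n]


history = {
    "u1": {"apple", "banana", "cherry"},
    "u2": {"apple", "banana"},
    "u3": {"banana", "cherry", "date"},
}
counts = cooccurrence(history)
suggestions = recommend(counts, {"apple"})
```

Here `counts` has one row and one column per thing, matching the slide, and `suggestions` ranks the items that most often appear alongside the user's history.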

Page 27: predictive-analytics-san-diego-2013-02-21

Cooccurrence matrix can also be implemented as a search index
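One way to picture this, as a sketch only: store each item's cooccurring items as an indicator field, invert that field into a term index, and use the user's history as the query. The helper names `build_index` and `search`, the sample data, and the hit-count scoring are all assumptions; a real search engine such as Solr would apply its own relevance scoring.

```python
from collections import defaultdict


def build_index(indicators):
    """Invert item -> indicator items into indicator term -> set of items."""
    index = defaultdict(set)
    for item, linked in indicators.items():
        for term in linked:
            index[term].add(item)
    return index


def search(index, user_history, top_n=3):
    """Rank items by how many history terms match their indicator field."""
    hits = defaultdict(int)
    for term in user_history:
        for item in index.get(term, ()):
            if item not in user_history:
                hits[item] += 1
    # Most matching terms first; ties broken alphabetically.
    return sorted(hits, key=lambda k: (-hits[k], k))[:top_n]


# Hypothetical indicator fields, e.g. derived from a cooccurrence matrix.
indicators = {
    "banana": ["apple", "cherry"],
    "cherry": ["banana", "date"],
    "date": ["cherry"],
}
index = build_index(indicators)
results = search(index, ["apple", "cherry"])
```

The recommendation request then becomes an ordinary search query whose terms are the user's recent items.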

Page 28: predictive-analytics-san-diego-2013-02-21

Solr Indexer

Solr Indexer

Solr indexing

Cooccurrence (Mahout)

Item meta-data

Index shards

Complete history

Page 29: predictive-analytics-san-diego-2013-02-21

Solr Indexer

Solr Indexer

Solr search

Web tier

Item meta-data

Index shards

User history

Page 30: predictive-analytics-san-diego-2013-02-21

Objective Results

At a very large credit card company

History is all transactions, all web interaction

Processing time cut from 20 hours per day to 3

Recommendation engine load time decreased from 8 hours to 3 minutes

Page 31: predictive-analytics-san-diego-2013-02-21

Example #2: Web Technology

Page 32: predictive-analytics-san-diego-2013-02-21

Fast analysis (Storm)

Analytic output

Real-time data

Raw logs

Page 33: predictive-analytics-san-diego-2013-02-21

Large analysis (map-reduce)

Analytic output

Raw logs

Page 34: predictive-analytics-san-diego-2013-02-21

Presentation tier (d3 + node.js)

Analytic output

Browser query

Raw logs

Page 35: predictive-analytics-san-diego-2013-02-21

Objective Results

Real-time + long-time analysis is seamless

Web tier can be rooted directly on Hadoop cluster

No need to move data

Page 36: predictive-analytics-san-diego-2013-02-21

Example #3: Apache Drill

Page 37: predictive-analytics-san-diego-2013-02-21

Big Data Processing – Hadoop

| | Batch processing |
| --- | --- |
| Query runtime | Minutes to hours |
| Data volume | TBs to PBs |
| Programming model | MapReduce |
| Users | Developers |
| Google project | MapReduce |
| Open source project | Hadoop MapReduce |

Page 38: predictive-analytics-san-diego-2013-02-21

Big Data Processing – Hadoop and Storm

| | Batch processing | Stream processing |
| --- | --- | --- |
| Query runtime | Minutes to hours | Never-ending |
| Data volume | TBs to PBs | Continuous stream |
| Programming model | MapReduce | DAG (pre-programmed) |
| Users | Developers | Developers |
| Google project | MapReduce | |
| Open source project | Hadoop MapReduce | Storm or Apache S4 |

Page 39: predictive-analytics-san-diego-2013-02-21

Big Data Processing – The missing part

| | Batch processing | Interactive analysis | Stream processing |
| --- | --- | --- | --- |
| Query runtime | Minutes to hours | | Never-ending |
| Data volume | TBs to PBs | | Continuous stream |
| Programming model | MapReduce | | DAG (pre-programmed) |
| Users | Developers | | Developers |
| Google project | MapReduce | | |
| Open source project | Hadoop MapReduce | | Storm and S4 |

Page 40: predictive-analytics-san-diego-2013-02-21

Big Data Processing – The missing part

| | Batch processing | Interactive analysis | Stream processing |
| --- | --- | --- | --- |
| Query runtime | Minutes to hours | Milliseconds to minutes | Never-ending |
| Data volume | TBs to PBs | GBs to PBs | Continuous stream |
| Programming model | MapReduce | Queries (ad hoc) | DAG (pre-programmed) |
| Users | Developers | Analysts and developers | Developers |
| Google project | MapReduce | | |
| Open source project | Hadoop MapReduce | | Storm and S4 |

Page 41: predictive-analytics-san-diego-2013-02-21

Big Data Processing

| | Batch processing | Interactive analysis | Stream processing |
| --- | --- | --- | --- |
| Query runtime | Minutes to hours | Milliseconds to minutes | Never-ending |
| Data volume | TBs to PBs | GBs to PBs | Continuous stream |
| Programming model | MapReduce | Queries | DAG |
| Users | Developers | Analysts and developers | Developers |
| Google project | MapReduce | Dremel | |
| Open source project | Hadoop MapReduce | | Storm and S4 |

Page 42: predictive-analytics-san-diego-2013-02-21

Big Data Processing

| | Batch processing | Interactive analysis | Stream processing |
| --- | --- | --- | --- |
| Query runtime | Minutes to hours | Milliseconds to minutes | Never-ending |
| Data volume | TBs to PBs | GBs to PBs | Continuous stream |
| Programming model | MapReduce | Queries | DAG |
| Users | Developers | Analysts and developers | Developers |
| Google project | MapReduce | Dremel | |
| Open source project | Hadoop MapReduce | Apache Drill | Storm and S4 |

Page 43: predictive-analytics-san-diego-2013-02-21

Design Principles

Flexible
• Pluggable query languages
• Extensible execution engine
• Pluggable data formats
  • Column-based and row-based
  • Schema and schema-less
• Pluggable data sources

Easy
• Unzip and run
• Zero configuration
• Reverse DNS not needed
• IP addresses can change
• Clear and concise log messages

Dependable
• No SPOF
• Instant recovery from crashes

Fast
• C/C++ core with Java support
  • Google C++ style guide
• Min latency and max throughput (limited only by hardware)

Page 44: predictive-analytics-san-diego-2013-02-21

Simple Architecture

Page 45: predictive-analytics-san-diego-2013-02-21

Standard Interfaces

Page 46: predictive-analytics-san-diego-2013-02-21

Logical Plan Syntax:

query: [
  { op: "sequence",
    do: [
      { op: "scan",
        memo: "initial_scan",
        ref: "donuts",
        source: "local-logs",
        selection: {data: "activity"} },
      { op: "transform",
        transforms: [
          { ref: "donuts.quantity", expr: "donuts.sales" }
        ] },
      { op: "filter", expr: "donuts.ppu < 1.00" },
      …
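To make the scan, transform, and filter steps of this plan fragment concrete, here is a small Python sketch that applies such a sequence to in-memory records. It is not Drill code: `run_sequence`, the sample records, and the use of Python callables in place of expression strings such as "donuts.ppu < 1.00" are assumptions for illustration.

```python
def run_sequence(steps, sources):
    """Apply each step of a sequence to the records left by the previous step."""
    records = []
    for step in steps:
        if step["op"] == "scan":
            # Wrap every raw record under the step's ref name, e.g. {"donuts": {...}}.
            records = [{step["ref"]: dict(raw)} for raw in sources[step["source"]]]
        elif step["op"] == "transform":
            for rec in records:
                for t in step["transforms"]:
                    holder, field = t["ref"].split(".")
                    rec[holder][field] = t["expr"](rec)
        elif step["op"] == "filter":
            records = [rec for rec in records if step["expr"](rec)]
        else:
            raise ValueError(f"unsupported op: {step['op']}")
    return records


# Hypothetical data standing in for the "local-logs" source.
sources = {"local-logs": [{"sales": 10, "ppu": 0.5}, {"sales": 4, "ppu": 1.25}]}
steps = [
    {"op": "scan", "ref": "donuts", "source": "local-logs"},
    {"op": "transform",
     "transforms": [{"ref": "donuts.quantity",
                     "expr": lambda r: r["donuts"]["sales"]}]},
    {"op": "filter", "expr": lambda r: r["donuts"]["ppu"] < 1.00},
]
result = run_sequence(steps, sources)
```

Only the record with a price per unit below 1.00 survives the filter, and it carries the extra field added by the transform.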

Page 47: predictive-analytics-san-diego-2013-02-21

Logical Streaming Example

{ @id: <refnum>,
  op: "window-frame",
  input: <input>,
  keys: [ <name>, ... ],
  ref: <name>,
  before: 2,
  after: here }

Input records: 0 1 2 3 4

Window frames: [0] [0 1] [0 1 2] [1 2 3] [2 3 4]
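One reading of `before: 2, after: here` is that each record is paired with up to two preceding records and the frame ends at the record itself. The sketch below implements that reading in Python; the function name `window_frames` is hypothetical, and the `keys` partitioning from the operator is left out for brevity.

```python
def window_frames(records, before=2):
    """Return, for each record, the frame of up to `before` earlier records plus itself."""
    frames = []
    for i in range(len(records)):
        start = max(0, i - before)  # clamp at the start of the stream
        frames.append(records[start:i + 1])
    return frames


frames = window_frames([0, 1, 2, 3, 4], before=2)
```

Under this reading the first frames are shorter because fewer than two earlier records exist yet.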

Page 48: predictive-analytics-san-diego-2013-02-21

Logical Plan

Page 49: predictive-analytics-san-diego-2013-02-21

Execution Plan

Page 50: predictive-analytics-san-diego-2013-02-21

Representing a DAG

{ @id: 19,
  op: "aggregate",
  input: 18,
  type: <simple|running|repeat>,
  keys: [ <name>, ... ],
  aggregations: [
    { ref: <name>, expr: <aggexpr> }, ...
  ] }
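The operator above points at its upstream operator by id (`input: 18`), which is how a flat list of operators can describe a DAG. A rough Python sketch of resolving those references follows. It is not Drill's implementation: `run_plan`, the `scan` source lookup, the sample rows, and the `fn` callable standing in for `<aggexpr>` are assumptions, and only a simple (non-running) aggregate is handled.

```python
from collections import defaultdict


def run_plan(ops, sources):
    """Evaluate operators in dependency order; each op names its upstream by id."""
    by_id = {op["@id"]: op for op in ops}
    results = {}

    def evaluate(op_id):
        if op_id in results:  # shared upstream operators are computed once
            return results[op_id]
        op = by_id[op_id]
        if op["op"] == "scan":
            out = list(sources[op["source"]])
        elif op["op"] == "aggregate":
            groups = defaultdict(list)
            for row in evaluate(op["input"]):
                groups[tuple(row[k] for k in op["keys"])].append(row)
            out = []
            for key, members in sorted(groups.items()):
                rec = dict(zip(op["keys"], key))
                for agg in op["aggregations"]:
                    rec[agg["ref"]] = agg["fn"](members)
                out.append(rec)
        else:
            raise ValueError(f"unsupported op: {op['op']}")
        results[op_id] = out
        return out

    return {op_id: evaluate(op_id) for op_id in by_id}


# Hypothetical source rows and a two-operator plan: 18 (scan) feeds 19 (aggregate).
sources = {"sales": [{"flavor": "glazed", "qty": 2},
                     {"flavor": "plain", "qty": 1},
                     {"flavor": "glazed", "qty": 3}]}
ops = [
    {"@id": 19, "op": "aggregate", "input": 18, "keys": ["flavor"],
     "aggregations": [{"ref": "total",
                       "fn": lambda rows: sum(r["qty"] for r in rows)}]},
    {"@id": 18, "op": "scan", "source": "sales"},
]
output = run_plan(ops, sources)[19]
```

Because inputs are looked up by id, the operators can be listed in any order, as the example does by listing the aggregate before its scan.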

Page 51: predictive-analytics-san-diego-2013-02-21

Non-SQL queries

Page 52: predictive-analytics-san-diego-2013-02-21

Design Principles

Flexible
• Pluggable query languages
• Extensible execution engine
• Pluggable data formats
  • Column-based and row-based
  • Schema and schema-less
• Pluggable data sources

Easy
• Unzip and run
• Zero configuration
• Reverse DNS not needed
• IP addresses can change
• Clear and concise log messages

Dependable
• No SPOF
• Instant recovery from crashes

Fast
• C/C++ core with Java support
  • Google C++ style guide
• Min latency and max throughput (limited only by hardware)

Page 53: predictive-analytics-san-diego-2013-02-21

The future is not what we thought it would be

Page 54: predictive-analytics-san-diego-2013-02-21

It is better!

Page 55: predictive-analytics-san-diego-2013-02-21

Get Involved!

Tweet: #hcj13w #mapr

@ted_dunning

Page 56: predictive-analytics-san-diego-2013-02-21

Get Involved!

Download these slides
– http://www.mapr.com/company/events/hcj-01-21-2013

Join the Drill project
– [email protected]
– #apachedrill

Contact me:
– [email protected]
– [email protected]
– @ted_dunning

Join MapR (in Japan!)
– [email protected]