Big Data Analysis Patterns - TriHUG 6/27/2013

Upload: boorad

Post on 26-Jan-2015

DESCRIPTION

Big Data Analysis Patterns: tying real-world use cases to strategies for analysis using big data technologies and tools. Big data is ushering in a new era for analytics, with large-scale data and relatively simple algorithms driving results rather than complex models built on sample data. When you are ready to extract benefits from your data, how do you decide what approach, what algorithm, and what tool to use? The answer is simpler than you think. This session tackles big data analysis with a practical description of strategies for several classes of application types, identified concretely with use cases. Topics include new approaches to search and recommendation using scalable technologies such as Hadoop, Mahout, Storm, Solr, and Titan.

TRANSCRIPT

Page 1: Big Data Analysis Patterns - TriHUG 6/27/2013

Big Data Analysis Patterns
TriHUG
6/27/2013

Page 2: Big Data Analysis Patterns - TriHUG 6/27/2013

whoami
• Brad Anderson
• Solutions Architect at MapR (Atlanta)
• ATLHUG co-chair
• NoSQL East Conference 2009
• “boorad” most places (Twitter, GitHub)
• [email protected]

Page 3: Big Data Analysis Patterns - TriHUG 6/27/2013

BIG DATA

Page 4: Big Data Analysis Patterns - TriHUG 6/27/2013

Page 5: Big Data Analysis Patterns - TriHUG 6/27/2013

Big Data is not new!
But the tools are.

Page 6: Big Data Analysis Patterns - TriHUG 6/27/2013

The Good News in Big Data:

“Simple algorithms and lots of data trump complex models”

Halevy, Norvig, and Pereira, Google
IEEE Intelligent Systems

Page 7: Big Data Analysis Patterns - TriHUG 6/27/2013

The Challenge: So Many Solutions!

What solutions fit your business problem?

For example, do you need…
• Apache Hadoop?
• Apache Mahout?
• Storm?
• Apache Solr/Lucene?
• Apache HBase (or MapR M7)?
• Apache Drill (or Impala)?
• d3.js or Tableau?
• Node.js?
• Titan?

Page 8: Big Data Analysis Patterns - TriHUG 6/27/2013

Ask a Different Question

It may be more useful to better define the problem by asking some of these questions:
• How large is the data to be stored?
• How large is the data to be queried? (the analysis volume)
• What time frame is appropriate for your query response?
• How fast is data arriving? (in bursts or continuously?)
• Are queries made by sophisticated users?
• Are you looking for common patterns or outliers?
• How are your data sources structured?

Page 9: Big Data Analysis Patterns - TriHUG 6/27/2013

Picking the Best Solution

Your responses to these questions can help you better:
• define the problem
• recognize the analysis pattern to which it belongs
• guide the choice of solutions to try

But first, here’s a quick review of a few of the technologies you might choose, and then we will focus on three of the questions as a part of the landscape.

Page 10: Big Data Analysis Patterns - TriHUG 6/27/2013

Apache Solr/Lucene

Solr/Lucene is a powerful search engine used for flexible, heavily indexed queries, including data such as:
• Full text
• Geographical data
• Statistically weighted data

Solr is a small data tool that has flourished in a big data world.

Page 11: Big Data Analysis Patterns - TriHUG 6/27/2013

Apache Mahout

Mahout provides a library of scalable machine learning algorithms useful for big data analysis based on Hadoop or other storage systems.

Mahout algorithms are mainly used for:
• Recommendation (collaborative filtering)
• Clustering
• Classification

Mahout can be used in conjunction with solutions such as Solr: you might use Mahout to create a co-occurrence database that could then be queried using a search tool such as Solr.

Page 12: Big Data Analysis Patterns - TriHUG 6/27/2013

Apache Drill

• Google Dremel clone
• Pluggable Query Languages
  – Starts with ANSI SQL 2003
  – Hive, Pig, Cascading, MongoQL, …
• Pluggable Storage Backends
  – Hadoop, HBase
  – MongoDB (BSON)
  – RDBMS?
• Bypasses MapReduce

Page 13: Big Data Analysis Patterns - TriHUG 6/27/2013

Storm

• Realtime Stream Computation Engine
• Horizontal Scalability
• Guaranteed Data Processing
• Fault Tolerance
• Higher level abstraction over:
  – Message Queues
  – Worker Logic

“The Hadoop of Realtime”

Page 14: Big Data Analysis Patterns - TriHUG 6/27/2013

Titan

• Distributed Graph Database
• Property Graph
• Pluggable Backend Storage
  – HBase or M7
  – Cassandra
  – Berkeley DB
• Search Integrated
  – Solr/Lucene
  – Elasticsearch
• Faunus
  – Graph traversals on subset
  – In-memory

Page 15: Big Data Analysis Patterns - TriHUG 6/27/2013

Using the Answers to Guide Your Choices

For simplicity, let’s focus on the first three questions:
• How large is the data to be stored?
• How large is the data to be queried? (the analysis volume)
• What time frame is appropriate for your query response?

Page 16: Big Data Analysis Patterns - TriHUG 6/27/2013

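Big Data Decision Tree

[Figure: decision tree; only the node labels survive in the transcript]
• How big is your data? <10 GB / mid / >200 GB
• What size queries? Single element at a time / One pass over 100% / Multiple passes over big chunks
• Other branch labels: Big storage, Streaming
• Response time? < 100s (human scale) / throughput, not response
• Leaves: A, B, C, D, E, ??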
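
The three questions in the tree can be encoded as a small classifier. The following is a minimal sketch, not the speaker's method: the 10 GB, 200 GB, and 100 s thresholds come from the slide, while the names `WorkloadProfile` and `classify_workload` and the rule for telling query sizes apart are assumptions made for illustration. It stops at a profile and does not map to leaves A through E, because the transcript does not preserve which branch leads to which leaf.

```python
from dataclasses import dataclass

# Thresholds taken from the slide; everything else here is illustrative.
SMALL_GB = 10
LARGE_GB = 200
HUMAN_SCALE_SECONDS = 100


@dataclass(frozen=True)
class WorkloadProfile:
    data_size: str    # "small", "mid", or "large"
    query_size: str   # "single_element", "one_pass", or "multi_pass"
    response: str     # "human_scale" or "throughput"


def classify_workload(stored_gb, fraction_read_per_query, passes, max_wait_seconds):
    """Turn answers to the three questions into a coarse profile."""
    if stored_gb < SMALL_GB:
        data_size = "small"
    elif stored_gb > LARGE_GB:
        data_size = "large"
    else:
        data_size = "mid"

    # Assumption: anything short of a full scan is treated as an
    # element-at-a-time lookup; a real classifier would need finer rules.
    if passes > 1:
        query_size = "multi_pass"
    elif fraction_read_per_query >= 1.0:
        query_size = "one_pass"
    else:
        query_size = "single_element"

    if max_wait_seconds < HUMAN_SCALE_SECONDS:
        response = "human_scale"
    else:
        response = "throughput"
    return WorkloadProfile(data_size, query_size, response)


profile = classify_workload(stored_gb=500, fraction_read_per_query=1.0,
                            passes=1, max_wait_seconds=3600)
print(profile)
```

A nightly full scan over 500 GB comes out as large data, one pass, throughput-oriented, which is the kind of profile the telecommunications use case later in the deck describes.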

Page 17: Big Data Analysis Patterns - TriHUG 6/27/2013

Use Cases

• Company
• Data Shape
• Technique(s)
• Business Value

Page 18: Big Data Analysis Patterns - TriHUG 6/27/2013

Business Value

Page 19: Big Data Analysis Patterns - TriHUG 6/27/2013

Business Value

Page 20: Big Data Analysis Patterns - TriHUG 6/27/2013

Telecommunications Giant

ETL Offload

Page 21: Big Data Analysis Patterns - TriHUG 6/27/2013

Data Shape
Telecommunications

• Lots of Data
• Lots of Queries across Large Sets
• Throughput important

Page 22: Big Data Analysis Patterns - TriHUG 6/27/2013

Techniques
Telecommunications

• Analytics
• ETL

Page 23: Big Data Analysis Patterns - TriHUG 6/27/2013

Techniques
Telecommunications

ETL (Hadoop) + Analytics (Teradata)

Page 24: Big Data Analysis Patterns - TriHUG 6/27/2013

Business Value
Telecommunications

Page 25: Big Data Analysis Patterns - TriHUG 6/27/2013

Credit Card Issuer

Page 26: Big Data Analysis Patterns - TriHUG 6/27/2013

Data Shape
Credit Card Issuer

• Customer Purchase History (big)
• Merchant Designations
• Merchant Special Offers
• Throughput important
• Recommendations

Page 27: Big Data Analysis Patterns - TriHUG 6/27/2013

History matrix

One row per user

One column per thing

Search Abuse

A Recommendation Engine with Mahout and Solr/Lucene

Techniques

Page 28: Big Data Analysis Patterns - TriHUG 6/27/2013

Recommendation based on cooccurrence

Cooccurrence gives item-item mapping

One row and column per thing

Techniques
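
To make the step from a history matrix to a cooccurrence matrix concrete, here is a minimal sketch in plain Python. It counts, for every pair of things, how many users have both in their history, which gives the item-item mapping with one row and one column per thing. The sample data and the function name `cooccurrence` are made up for illustration, and the sketch uses raw counts only; it is not Mahout's implementation, which runs at scale on Hadoop.

```python
from collections import defaultdict
from itertools import combinations


def cooccurrence(history):
    """history: dict of user -> set of things (one row per user).

    Returns a nested dict: thing -> thing -> number of users with both.
    """
    counts = defaultdict(lambda: defaultdict(int))
    for things in history.values():
        # Every unordered pair in one user's row cooccurs once.
        for a, b in combinations(sorted(things), 2):
            counts[a][b] += 1
            counts[b][a] += 1   # keep the matrix symmetric
    return counts


# Hypothetical purchase histories.
history = {
    "u1": {"coffee", "bagel"},
    "u2": {"coffee", "bagel", "tea"},
    "u3": {"tea", "scone"},
}
cooc = cooccurrence(history)
print(dict(cooc["coffee"]))  # {'bagel': 2, 'tea': 1}
```

The row for "coffee" says that two users who bought coffee also bought a bagel and one also bought tea, so "bagel" is the stronger indicator for "coffee".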

Page 29: Big Data Analysis Patterns - TriHUG 6/27/2013

Cooccurrence matrix can also be implemented as a search index

Techniques
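
One way to picture the cooccurrence matrix as a search index: treat each item as a document whose indexed terms are the items that cooccur with it, then use the user's history as the query. The sketch below is a toy inverted index in plain Python under that assumption. It is not Solr's API, the names `build_index` and `recommend` are hypothetical, and a simple match count stands in for the relevance scoring a real search engine would apply.

```python
def build_index(indicators):
    """indicators: dict of item -> set of items that cooccur with it.

    Returns an inverted index: indicator term -> set of items whose
    document contains that term.
    """
    index = {}
    for item, terms in indicators.items():
        for term in terms:
            index.setdefault(term, set()).add(item)
    return index


def recommend(index, user_history, top_n=3):
    """Query the index with the user's history; rank items by match count."""
    scores = {}
    for term in user_history:
        for item in index.get(term, ()):
            if item not in user_history:          # skip things already seen
                scores[item] = scores.get(item, 0) + 1
    # Highest score first; ties broken alphabetically for stable output.
    return sorted(scores, key=lambda item: (-scores[item], item))[:top_n]


# Hypothetical indicator sets, e.g. rows of a cooccurrence matrix.
indicators = {
    "bagel": {"coffee", "tea"},
    "coffee": {"bagel", "tea"},
    "tea": {"bagel", "coffee", "scone"},
    "scone": {"tea"},
}
index = build_index(indicators)
print(recommend(index, {"coffee", "bagel"}))  # ['tea']
```

Because the recommendation is just a query, a fresh user history can be scored at request time without rerunning the batch job that produced the indicators.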

Page 30: Big Data Analysis Patterns - TriHUG 6/27/2013

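Techniques

[Diagram: Solr indexing pipeline; components as labeled on the slide]
• Complete history
• Cooccurrence (Mahout)
• Item meta-data
• Solr indexing (two Solr Indexer boxes)
• Index shards
• Timing labels: 20 Hrs, 3 Hrs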

Page 31: Big Data Analysis Patterns - TriHUG 6/27/2013

Techniques

[Diagram: Solr search path; components as labeled on the slide]
• User history
• Web tier
• Item meta-data
• Solr search (two Solr Indexer boxes)
• Index shards
• Timing labels: 8 Hrs, 3 Min

Page 32: Big Data Analysis Patterns - TriHUG 6/27/2013

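Techniques

[Diagram: recommendation pipeline with a database export; components as labeled on the slide]
• Inputs: Purchase History, Merchant Information, Merchant Offers
• Hadoop: Recommendation Engine (Mahout), Results
• Export (4 hrs)
• Import (4 hrs)
• Presentation Data Store (DB2)
• App (×5)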

Page 33: Big Data Analysis Patterns - TriHUG 6/27/2013

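Techniques

[Diagram: recommendation pipeline with a search index; components as labeled on the slide]
• Inputs: Purchase History, Merchant Information, Merchant Offers
• Hadoop: Recommendation Engine (Mahout), Results
• Index Update (3 min)
• Recommendation Search Index (Solr)
• App (×5)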

Page 34: Big Data Analysis Patterns - TriHUG 6/27/2013

Business Value

Page 35: Big Data Analysis Patterns - TriHUG 6/27/2013

Idle Alerts

Waste & Recycling Leader

Page 36: Big Data Analysis Patterns - TriHUG 6/27/2013

Data Shape

• Truck Geolocation Data
  – 20,000 trucks
  – 5 sec interval (arriving quickly)
• Landfill Geographic Boundaries
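
A quick back-of-the-envelope calculation shows what "arriving quickly" means here. The truck count and reporting interval come from the slide; the assumption that every truck reports on every interval around the clock is mine, so treat the result as an upper bound.

```python
TRUCKS = 20_000          # from the slide
REPORT_INTERVAL_S = 5    # from the slide

# Assumption: every truck reports on every interval, 24 hours a day.
events_per_second = TRUCKS / REPORT_INTERVAL_S
events_per_day = events_per_second * 24 * 60 * 60

print(f"{events_per_second:,.0f} events/s")   # 4,000 events/s
print(f"{events_per_day:,.0f} events/day")    # 345,600,000 events/day
```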

Page 37: Big Data Analysis Patterns - TriHUG 6/27/2013

Techniques

[Diagram: components as labeled on the slide]
• Truck Geolocation Data
• Realtime Stream Computation (Storm)
• Immediate Alerts
• Hadoop Storage
• Batch Computation (MapReduce)
• Tax Reduction Reporting
• Shortest Path Graph Algorithm (Titan)
• Route Optimization
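
The immediate-alert path can be sketched as per-truck state updated on each position report. The code below is a minimal illustration in plain Python, not Storm code and not the customer's implementation: the `IdleDetector` class, the 15 m radius, and the 5 minute limit are all hypothetical. In a stream processor, the same logic would sit in a worker that receives all reports for a given truck.

```python
import math

IDLE_RADIUS_M = 15.0   # hypothetical: movement below this counts as stationary
IDLE_LIMIT_S = 300     # hypothetical: alert after 5 minutes without movement


def distance_m(lat1, lon1, lat2, lon2):
    """Approximate ground distance in meters (equirectangular projection)."""
    mean_lat = math.radians((lat1 + lat2) / 2)
    dx = math.radians(lon2 - lon1) * math.cos(mean_lat)
    dy = math.radians(lat2 - lat1)
    return 6_371_000 * math.hypot(dx, dy)


class IdleDetector:
    def __init__(self):
        # truck_id -> (anchor_lat, anchor_lon, anchor_time, already_alerted)
        self.state = {}

    def process(self, truck_id, timestamp, lat, lon):
        """Consume one position report; return an alert string or None."""
        anchor = self.state.get(truck_id)
        if anchor is None or distance_m(anchor[0], anchor[1], lat, lon) > IDLE_RADIUS_M:
            # First report, or the truck moved: restart the idle clock here.
            self.state[truck_id] = (lat, lon, timestamp, False)
            return None
        a_lat, a_lon, since, alerted = anchor
        if not alerted and timestamp - since >= IDLE_LIMIT_S:
            self.state[truck_id] = (a_lat, a_lon, since, True)  # alert once
            return f"truck {truck_id} idle for {timestamp - since}s"
        return None


detector = IdleDetector()
alerts = []
for t in range(0, 400, 5):   # one report every 5 seconds; the truck never moves
    alert = detector.process("T-1", t, 35.7796, -78.6382)
    if alert:
        alerts.append(alert)
print(alerts)  # ['truck T-1 idle for 300s']
```

Keeping only one anchor position per truck keeps the state small, which matters when 20,000 trucks report every 5 seconds.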

Page 38: Big Data Analysis Patterns - TriHUG 6/27/2013

Business Value

Page 39: Big Data Analysis Patterns - TriHUG 6/27/2013

Social Engagement Application

Beverage Company

Page 40: Big Data Analysis Patterns - TriHUG 6/27/2013

Data Shape

• Tweets, FB Messages
• Person, Activity links
• Graph Traversal

Page 41: Big Data Analysis Patterns - TriHUG 6/27/2013

Consumer Activity Graph

[Figure: graph; node labels as shown on the slide]
• Wal*Mart.com
• CVS
• Dollar General
• Ebay
• Ebay Motors
• Toys R Us
• StubHub
• Shopping.com
• Sam’s

Page 42: Big Data Analysis Patterns - TriHUG 6/27/2013

Techniques

• Property Graph (Titan)
• Key/Value Store (MapR M7)
• Social Activity Stream
• Graph Traversal (Faunus)
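
To illustrate what a traversal over person and activity links looks like, here is a minimal breadth-first sketch in plain Python. It does not use the Titan or Faunus APIs; the people, edge labels, and function names are invented for the example, and only the retailer names are borrowed from the Consumer Activity Graph slide.

```python
from collections import deque

# Edges: (person, activity label, target). All entries are illustrative.
edges = [
    ("alice", "tweeted_about", "StubHub"),
    ("bob", "purchased_at", "StubHub"),
    ("bob", "purchased_at", "CVS"),
    ("carol", "messaged_about", "CVS"),
]


def build_adjacency(edges):
    """Index edges in both directions so traversal can walk either way."""
    adjacency = {}
    for source, label, target in edges:
        adjacency.setdefault(source, []).append((label, target))
        adjacency.setdefault(target, []).append((label, source))
    return adjacency


def within_hops(adjacency, start, max_hops):
    """Breadth-first traversal: vertex -> hop count, up to max_hops away."""
    seen = {start: 0}
    queue = deque([start])
    while queue:
        vertex = queue.popleft()
        if seen[vertex] == max_hops:
            continue                      # do not expand past the hop limit
        for _label, neighbor in adjacency.get(vertex, []):
            if neighbor not in seen:
                seen[neighbor] = seen[vertex] + 1
                queue.append(neighbor)
    return seen


adjacency = build_adjacency(edges)
print(within_hops(adjacency, "alice", 2))  # {'alice': 0, 'StubHub': 1, 'bob': 2}
```

Two hops from "alice" reach "bob" through a shared retailer, which is the kind of person-to-person link through activity that the data shape slide describes.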

Page 43: Big Data Analysis Patterns - TriHUG 6/27/2013

Business Value

Page 44: Big Data Analysis Patterns - TriHUG 6/27/2013

Questions?