Big Data Analysis Patterns with Hadoop, Mahout and Solr

Big Data Analysis Patterns, Atlanta Big Data User Group, 8/15/2013

Upload: boorad

Post on 15-Jan-2015


Category:

Technology



DESCRIPTION

Big Data Analysis Patterns: Tying real world use cases to strategies for analysis using big data technologies and tools. Big data is ushering in a new era for analytics with large scale data and relatively simple algorithms driving results rather than relying on complex models that use sample data. When you are ready to extract benefits from your data, how do you decide what approach, what algorithm, what tool to use? The answer is simpler than you think. This session tackles big data analysis with a practical description of strategies for several classes of application types, identified concretely with use cases. Topics include new approaches to search and recommendation using scalable technologies such as Hadoop, Mahout, Storm, Solr, & Titan.

TRANSCRIPT

Page 1: Big Data Analysis Patterns with Hadoop, Mahout and Solr

1

Big Data Analysis Patterns
Atlanta Big Data User Group
8/15/2013

Page 2: Big Data Analysis Patterns with Hadoop, Mahout and Solr

2

whoami
• Brad Anderson
• Solutions Architect at MapR (Atlanta)
• ATLHUG co-chair
• NoSQL East Conference 2009
• “boorad” most places (twitter, github)
• [email protected]

Page 3: Big Data Analysis Patterns with Hadoop, Mahout and Solr

3

Announcements

Next ATLHUG Meeting - Sept. 26
– How Google Does Big Data

Wednesday – MapR Data Warehouse Offload Roadshow

MapR Upcoming Training
• MapR M7 & HBase for Developers on August 27 in Campbell, CA
• MapR M7 & HBase for Developers on Sept 17 in Reston, VA
• MapR M5 for Administrators on Oct 3 in Campbell, CA


Page 4: Big Data Analysis Patterns with Hadoop, Mahout and Solr

4

BIG DATA

Page 5: Big Data Analysis Patterns with Hadoop, Mahout and Solr

5

Page 6: Big Data Analysis Patterns with Hadoop, Mahout and Solr

6

Big Data is not new! But the tools are.

Page 7: Big Data Analysis Patterns with Hadoop, Mahout and Solr

7

The Good News in Big Data:

“Simple algorithms and lots of data trump complex models”

Halevy, Norvig, and Pereira, Google
IEEE Intelligent Systems

Page 8: Big Data Analysis Patterns with Hadoop, Mahout and Solr

8

The Challenge: So Many Solutions!

What solutions fit your business problem?

For example, do you need…
• Apache Hadoop?
• Apache Mahout?
• Storm?
• Apache Solr/Lucene?
• Apache HBase (or MapR M7)?
• Apache Drill (or Impala)?
• d3.js or Tableau?
• Node.js?
• Titan?


Page 9: Big Data Analysis Patterns with Hadoop, Mahout and Solr

9

Ask a Different Question

It may be more useful to define the problem better by asking some of these questions:
• How large is the data to be stored?
• How large is the data to be queried? (the analysis volume)
• What time frame is appropriate for your query response?
• How fast is data arriving? (in bursts or continuously?)
• Are queries made by sophisticated users?
• Are you looking for common patterns or outliers?
• How are your data sources structured?


Page 10: Big Data Analysis Patterns with Hadoop, Mahout and Solr

10

Picking the Best Solution

Your responses to these questions can help you better:
• define the problem
• recognize the analysis pattern to which it belongs
• guide the choice of solutions to try

But first, here’s a quick review of a few of the technologies you might choose, and then we will focus on three of the questions as a part of the landscape.


Page 11: Big Data Analysis Patterns with Hadoop, Mahout and Solr

11

Apache Solr/Lucene

Solr/Lucene is a powerful search engine used for flexible, heavily indexed queries, including data such as:
• Full text
• Geographical data
• Statistically weighted data

Solr is a small data tool that has flourished in a big data world

Page 12: Big Data Analysis Patterns with Hadoop, Mahout and Solr

12

Apache Mahout

Mahout provides a library of scalable machine learning algorithms useful for big data analysis based on Hadoop or other storage systems.

Mahout algorithms are mainly used for:
• Recommendation (collaborative filtering)
• Clustering
• Classification

Mahout can be used in conjunction with solutions such as Solr: you might use Mahout to create a co-occurrence database that could then be queried using a search tool such as Solr.

Page 13: Big Data Analysis Patterns with Hadoop, Mahout and Solr

13

Apache Drill

• Google Dremel clone
• Pluggable Query Languages
– Starts with ANSI SQL 2003
– Hive, Pig, Cascading, MongoQL, …
• Pluggable Storage Backends
– Hadoop, HBase
– MongoDB (BSON)
– RDBMS?
• Bypasses MapReduce

Page 14: Big Data Analysis Patterns with Hadoop, Mahout and Solr

14

Storm

• Realtime Stream Computation Engine
• Horizontal Scalability
• Guaranteed Data Processing
• Fault Tolerance
• Higher level abstraction over:
– Message Queues
– Worker Logic

“The Hadoop of Realtime”

Page 15: Big Data Analysis Patterns with Hadoop, Mahout and Solr

15

Titan

• Distributed Graph Database
• Property Graph
• Pluggable Backend Storage
– HBase or M7
– Cassandra
– Berkeley DB
• Search Integrated
– Solr/Lucene
– Elastic Search
• Faunus
– Batch processing of large graphs
• Fulgora
– Graph traversals on subset
– In-memory

Page 16: Big Data Analysis Patterns with Hadoop, Mahout and Solr

16

Using the Answers to Guide Your Choices

For simplicity, let’s focus on the first three questions:
• How large is the data to be stored?
• How large is the data to be queried? (the analysis volume)
• What time frame is appropriate for your query response?

Page 17: Big Data Analysis Patterns with Hadoop, Mahout and Solr

17

Big Data Decision Tree

[Decision tree diagram; labels as extracted]

How big is your data?
• <10 GB
• mid
• >200 GB

What size queries?
• Single element at a time
• One pass over 100%
• Multiple passes over big chunks

Response time?
• < 100s (human scale)
• throughput, not response

Other labels: Big storage, Streaming
Leaf labels: A, B, C, D, E, ??
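The three questions on this slide can be written down as a small record before any tool is chosen. The sketch below is only an illustration: the 10 GB, 200 GB and 100 s thresholds come from the slide, while the names `DataShape`, `size_bucket`, `response_class` and `describe` are hypothetical. The mapping from answers to the leaves A to E is not recoverable from the transcript, so it is deliberately left out.

```python
from dataclasses import dataclass
from typing import Optional, Tuple

SMALL_GB = 10              # "<10 GB" branch on the slide
LARGE_GB = 200             # ">200 GB" branch on the slide
HUMAN_SCALE_SECONDS = 100  # "< 100s (human scale)" branch on the slide


@dataclass
class DataShape:
    """Answers to the three questions (hypothetical record, not from the deck)."""
    stored_gb: float
    query_scope: str                       # e.g. "single element", "one pass", "multiple passes"
    max_response_seconds: Optional[float]  # None means throughput matters, not response time


def size_bucket(stored_gb: float) -> str:
    """Place the stored volume in one of the slide's three size branches."""
    if stored_gb < SMALL_GB:
        return "small"
    if stored_gb > LARGE_GB:
        return "large"
    return "mid"


def response_class(max_response_seconds: Optional[float]) -> str:
    """Distinguish human-scale response from throughput-oriented work."""
    if max_response_seconds is not None and max_response_seconds < HUMAN_SCALE_SECONDS:
        return "human scale"
    return "throughput"


def describe(shape: DataShape) -> Tuple[str, str, str]:
    """Summarise a problem along the three axes of the decision tree."""
    return (size_bucket(shape.stored_gb),
            shape.query_scope,
            response_class(shape.max_response_seconds))
```

For example, `describe(DataShape(500, "one pass", None))` describes a large, single-pass, throughput-oriented workload.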

Page 18: Big Data Analysis Patterns with Hadoop, Mahout and Solr

18

Use Cases
• Company
• Data Shape
• Technique(s)
• Business Value

Page 19: Big Data Analysis Patterns with Hadoop, Mahout and Solr

19

Business Value

Page 20: Big Data Analysis Patterns with Hadoop, Mahout and Solr

20

Business Value

Page 21: Big Data Analysis Patterns with Hadoop, Mahout and Solr

21

Telecommunications Giant

ETL Offload

Page 22: Big Data Analysis Patterns with Hadoop, Mahout and Solr

22

• Lots of Data
• Lots of Queries across Large Sets
• Throughput important

Data Shape

Telecommunications

Page 23: Big Data Analysis Patterns with Hadoop, Mahout and Solr

23

Techniques

Analytics

ETL

Telecommunications

Page 24: Big Data Analysis Patterns with Hadoop, Mahout and Solr

24

Techniques

ETL (Hadoop) + Analytics (Teradata)

Telecommunications

Page 25: Big Data Analysis Patterns with Hadoop, Mahout and Solr

25

Business Value

Telecommunications

Page 26: Big Data Analysis Patterns with Hadoop, Mahout and Solr

26

Credit Card Issuer

Page 27: Big Data Analysis Patterns with Hadoop, Mahout and Solr

27

• Customer Purchase History (big)
• Merchant Designations
• Merchant Special Offers
• Throughput important
• Recommendations

Data Shape

Credit Card Issuer

Page 28: Big Data Analysis Patterns with Hadoop, Mahout and Solr

28

History matrix

One row per user

One column per thing

A Recommendation Engine with Mahout and Solr/Lucene

Techniques

Credit Card Issuer

Page 29: Big Data Analysis Patterns with Hadoop, Mahout and Solr

29

Recommendation based on cooccurrence

Cooccurrence gives item-item mapping

One row and column per thing

Techniques

Credit Card Issuer
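To make the history matrix and the item-item cooccurrence matrix concrete, here is a minimal pure-Python sketch. It is not Mahout's implementation: it uses raw cooccurrence counts, whereas the deck does not say which scoring or filtering was applied. The function names and the sample data are hypothetical.

```python
from collections import defaultdict
from itertools import combinations


def cooccurrence(history):
    """Build item-item counts from a user history.

    history maps each user (one row) to the set of things (columns) they
    interacted with. The result has one row and one column per thing:
    counts[a][b] is the number of users whose history contains both a and b.
    """
    counts = defaultdict(lambda: defaultdict(int))
    for items in history.values():
        for a, b in combinations(sorted(set(items)), 2):
            counts[a][b] += 1
            counts[b][a] += 1
    return counts


def recommend(history, user, counts, top_n=3):
    """Score unseen items by how often they cooccur with the user's items."""
    seen = history[user]
    scores = defaultdict(int)
    for item in seen:
        for other, count in counts[item].items():
            if other not in seen:
                scores[other] += count
    # Highest score first; ties broken alphabetically for stable output.
    return sorted(scores, key=lambda i: (-scores[i], i))[:top_n]


# Hypothetical purchase histories.
history = {
    "u1": {"coffee", "bagel"},
    "u2": {"coffee", "bagel", "juice"},
    "u3": {"coffee", "juice"},
    "u4": {"bagel"},
}
counts = cooccurrence(history)
print(recommend(history, "u4", counts))  # ['coffee', 'juice']
```

At real scale the counting step is the part that would run as a batch job over the complete history; the scoring step is cheap enough to run per request.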

Page 30: Big Data Analysis Patterns with Hadoop, Mahout and Solr

30

Cooccurrence matrix can also be implemented as a search index

Techniques

Credit Card Issuer
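One way to read "cooccurrence matrix as a search index": store each item as a document whose indicator field lists the items it cooccurs with, then use the user's recent history as the query. The sketch below imitates that with a tiny in-memory inverted index. It does not use Solr's API, it scores by the number of matching terms rather than by a search engine's relevance formula, and all names and data are hypothetical.

```python
from collections import defaultdict


def build_index(indicators):
    """Invert item -> indicator terms into term -> set of item documents."""
    index = defaultdict(set)
    for item, terms in indicators.items():
        for term in terms:
            index[term].add(item)
    return index


def search(index, user_history, top_n=3):
    """Query with the user's history; rank items by number of matching terms."""
    hits = defaultdict(int)
    for term in user_history:
        for doc in index.get(term, ()):
            if doc not in user_history:  # do not recommend what the user already has
                hits[doc] += 1
    return sorted(hits, key=lambda d: (-hits[d], d))[:top_n]


# Hypothetical indicator fields, e.g. the strongest entries of each cooccurrence row.
indicators = {
    "coffee": ["bagel", "juice"],
    "bagel": ["coffee"],
    "juice": ["coffee"],
    "tea": ["scone"],
}
index = build_index(indicators)
print(search(index, {"bagel", "juice"}))  # ['coffee']
```

The appeal of this design is that recommendation becomes an ordinary search query, so serving it reuses whatever search tier is already deployed.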

Page 31: Big Data Analysis Patterns with Hadoop, Mahout and Solr

31

[Diagram labels: Complete history; Cooccurrence (Mahout); Item meta-data; Solr indexing (Solr indexers); Index shards. Timing labels: 20 Hrs, 3 Hrs.]

Techniques

Credit Card Issuer

Page 32: Big Data Analysis Patterns with Hadoop, Mahout and Solr

32

[Diagram labels: User history; Web tier; Solr search (Solr indexers); Item meta-data; Index shards. Timing labels: 8 Hrs, 3 Min.]

Techniques

Credit Card Issuer

Page 33: Big Data Analysis Patterns with Hadoop, Mahout and Solr

33

Techniques

Purchase History

Merchant Information

Merchant Offers

Recommendation Engine Results (Mahout)

Presentation Data Store (DB2)

App (×5)

Hadoop

Export (4 hrs)

Import (4 hrs)

Credit Card Issuer

Page 34: Big Data Analysis Patterns with Hadoop, Mahout and Solr

34

Techniques

Purchase History

Merchant Information

Merchant Offers

Recommendation Engine Results (Mahout)

Recommendation Search Index (Solr)

App (×5)

Hadoop

Index Update (3 min)

Credit Card Issuer

Page 35: Big Data Analysis Patterns with Hadoop, Mahout and Solr

35

Business Value

Credit Card Issuer

Page 36: Big Data Analysis Patterns with Hadoop, Mahout and Solr

36

Idle Alerts

Waste & Recycling Leader

Page 37: Big Data Analysis Patterns with Hadoop, Mahout and Solr

37

Truck Geolocation Data
– 20,000 trucks
– 5 sec interval (arriving quickly)

Landfill Geographic Boundaries

Data Shape
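A quick back-of-the-envelope calculation shows why this data counts as "arriving quickly". It assumes every one of the 20,000 trucks reports at every 5-second interval around the clock, which is an upper bound rather than a figure from the deck.

```python
TRUCKS = 20_000          # from the slide
INTERVAL_SECONDS = 5     # from the slide
SECONDS_PER_DAY = 24 * 60 * 60

# Assumption: all trucks report at every interval, all day.
events_per_second = TRUCKS // INTERVAL_SECONDS
events_per_day = events_per_second * SECONDS_PER_DAY

print(events_per_second)  # 4000
print(events_per_day)     # 345600000
```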

Page 38: Big Data Analysis Patterns with Hadoop, Mahout and Solr

38

Techniques

Truck Geolocation Data

Realtime Stream Computation (Storm)

Batch Computation (MapReduce)

Immediate Alerts

Tax Reduction Reporting

Hadoop Storage

Shortest Path Graph Algorithm (Titan)

Route Optimization
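The deck does not spell out how an idle alert is computed, so the following is only a sketch of the kind of per-event logic a realtime stream step might hold. It assumes "idle" means a truck stayed within a small radius for a minimum time; the thresholds, function names and sample readings are all hypothetical, and it is plain Python rather than Storm code.

```python
import math

IDLE_RADIUS_M = 25.0   # assumed: movement below this counts as standing still
IDLE_SECONDS = 300     # assumed: alert after five minutes without movement
EARTH_RADIUS_M = 6_371_000


def distance_m(a, b):
    """Great-circle (haversine) distance in metres between two (lat, lon) points."""
    lat1, lon1 = map(math.radians, a)
    lat2, lon2 = map(math.radians, b)
    h = (math.sin((lat2 - lat1) / 2) ** 2
         + math.cos(lat1) * math.cos(lat2) * math.sin((lon2 - lon1) / 2) ** 2)
    return 2 * EARTH_RADIUS_M * math.asin(math.sqrt(h))


def idle_alerts(readings):
    """Yield (truck_id, timestamp) once each time a truck becomes idle.

    readings: iterable of (truck_id, timestamp_seconds, (lat, lon)) in time order.
    """
    state = {}  # truck_id -> (anchor_time, anchor_position, already_alerted)
    for truck, ts, pos in readings:
        current = state.get(truck)
        if current is None or distance_m(current[1], pos) > IDLE_RADIUS_M:
            state[truck] = (ts, pos, False)  # truck moved: restart the idle clock
            continue
        start, where, alerted = current
        if not alerted and ts - start >= IDLE_SECONDS:
            state[truck] = (start, where, True)
            yield truck, ts


# Hypothetical readings: t1 stands still, t2 moves roughly 111 m every 5 seconds.
readings = []
for t in range(0, 305, 5):
    readings.append(("t1", t, (33.75, -84.39)))
    readings.append(("t2", t, (33.75 + t * 0.0002, -84.39)))

print(list(idle_alerts(readings)))  # [('t1', 300)]
```

Keeping only one small state tuple per truck is what makes this kind of check suitable for a streaming tier, while the full history goes to batch storage.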

Page 39: Big Data Analysis Patterns with Hadoop, Mahout and Solr

39

Business Value

Page 40: Big Data Analysis Patterns with Hadoop, Mahout and Solr

40

Social Engagement Application

Beverage Company

Page 41: Big Data Analysis Patterns with Hadoop, Mahout and Solr

41

• Tweets, FB Messages
• Person, Activity links
• Graph Traversal

Data Shape

Page 42: Big Data Analysis Patterns with Hadoop, Mahout and Solr

42

Consumer Activity Graph

Wal*Mart.com

CVS

Dollar General

Ebay

Ebay Motors

Toys R Us

StubHub

Shopping.com

Sam’s

Page 43: Big Data Analysis Patterns with Hadoop, Mahout and Solr

43

Techniques

Property Graph (Titan)

Key/Value Store (MapR M7)

Social Activity Stream

Graph Traversal (Faunus/Fulgora)

Page 44: Big Data Analysis Patterns with Hadoop, Mahout and Solr

44

Business Value

Page 45: Big Data Analysis Patterns with Hadoop, Mahout and Solr

45

Fraud Detection

Data Lake

Page 46: Big Data Analysis Patterns with Hadoop, Mahout and Solr

46

• Anti-Money Laundering
• Consumer Transactions

Data Sources

Page 47: Big Data Analysis Patterns with Hadoop, Mahout and Solr

47

Techniques

Anti-Money Laundering System

Consumer Transactions System

Page 48: Big Data Analysis Patterns with Hadoop, Mahout and Solr

48

Techniques

AML

Consumer Transactions

Data Lake (Hadoop)

Suspicious Events

• Latent Dirichlet Allocation
• Bayesian Learning Neural Network
• Peer Group Analysis

Analyst

Page 49: Big Data Analysis Patterns with Hadoop, Mahout and Solr

49

Business Value

Page 50: Big Data Analysis Patterns with Hadoop, Mahout and Solr

50

Machine Learning

Search Relevance

DNA Matching

Page 51: Big Data Analysis Patterns with Hadoop, Mahout and Solr

51

• Birth, Death, Census, Military, Immigration records
• Search Behavior Activity
• DNA SNP (snips)

Data Sources

Page 52: Big Data Analysis Patterns with Hadoop, Mahout and Solr

52

Techniques
• Record Linking
• Search Relevance
• Clickstream Behavior
• Security Forensics
• DNA Matching

Page 53: Big Data Analysis Patterns with Hadoop, Mahout and Solr

53

Business Value

Page 54: Big Data Analysis Patterns with Hadoop, Mahout and Solr

54

Traffic Analytics

Page 55: Big Data Analysis Patterns with Hadoop, Mahout and Solr

55

Inrix Road Segment Data
– Avg Speed / minute / segment
– Reference Speeds

Road Segment Geolocation Data

Data Sources

Page 56: Big Data Analysis Patterns with Hadoop, Mahout and Solr

56

Techniques
• Bottleneck Detection Algorithm
• Time Offset Correlations
– Alternate Routes
• Predictive Congestion Analysis
– Growth & Term Assumptions
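The deck names a bottleneck detection algorithm without describing it. One common way to frame the idea, given average and reference speeds per segment, is sketched below: treat a segment as congested when its average speed falls well below its reference speed, and call it a bottleneck when the next segment downstream is not congested. The 0.6 ratio, the `downstream` map and the segment names are assumptions for illustration, not details from the talk.

```python
def bottlenecks(avg_speed, reference_speed, downstream, ratio=0.6):
    """Return congested segments whose downstream neighbour flows freely.

    avg_speed / reference_speed: segment id -> speed for one time slice.
    downstream: segment id -> id of the next segment in the travel direction.
    A slow segment with no known downstream neighbour is also reported.
    """
    slow = {
        seg for seg, speed in avg_speed.items()
        if seg in reference_speed and speed < ratio * reference_speed[seg]
    }
    return sorted(seg for seg in slow if downstream.get(seg) not in slow)


# Hypothetical corridor A -> B -> C -> D with a 65 mph reference speed.
avg_speed = {"A": 20, "B": 25, "C": 60, "D": 62}
reference_speed = {"A": 65, "B": 65, "C": 65, "D": 65}
downstream = {"A": "B", "B": "C", "C": "D"}

print(bottlenecks(avg_speed, reference_speed, downstream))  # ['B']
```

Here A and B are both slow, but only B sits at the head of the queue, so it is the segment flagged.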

Page 57: Big Data Analysis Patterns with Hadoop, Mahout and Solr

57

Page 58: Big Data Analysis Patterns with Hadoop, Mahout and Solr

58

Page 59: Big Data Analysis Patterns with Hadoop, Mahout and Solr

59

Business Value

Page 60: Big Data Analysis Patterns with Hadoop, Mahout and Solr

60

Similar Characteristics
• Lots of Data
• Structured, Semi-Structured, Unstructured
• Varied Systems Interoperating
– Hadoop, Storm, Solr, MPP, Visualizations

• Increase Revenue
• Decrease Costs

Page 61: Big Data Analysis Patterns with Hadoop, Mahout and Solr

61

Questions?