Analytics in OLAP with Lucene & Hadoop

1 Berlin | 07/02/2012 | zanox | Presentation title

Post on 21-Oct-2014


Page 1: Analytics in olap with lucene & hadoop


Page 2: Analytics in olap with lucene & hadoop

TABLE OF CONTENTS

1. ABOUT ZANOX AND ME

2. REPORTING IN GENERAL

3. ANALYTICS

4. QUESTIONS

2 Berlin | 08/05/2013 | zanox | Zanox Reporting Systems

Page 3: Analytics in olap with lucene & hadoop

THE ZANOX NETWORK


Page 4: Analytics in olap with lucene & hadoop

WHO AM I?


● Senior Architect at Zanox

● 7+ years experience in building distributed search systems based on Lucene

Lucid Certified Apache Solr/Lucene Developer

● 5+ years of using Hadoop and 2+ years of using HBase for data mining

Cloudera Certified Developer for Apache Hadoop and HBase

● During my PhD I applied different machine-learning techniques, mainly to optimise resource usage while performing distributed search

● See my book: “Beyond Centralised Search Engines: An Agent-Based Filtering Framework”

● How can you reach me?

[email protected]

● http://www.linkedin.com/in/draganmilosevic


Page 5: Analytics in olap with lucene & hadoop

1. ABOUT ZANOX AND ME

2. REPORTING IN GENERAL

3. ANALYTICS

4. QUESTIONS


Page 6: Analytics in olap with lucene & hadoop

REPORTING IN GENERAL

[Figure: architecture diagram with the stages Tracking, Custom, Map Reduce, Distributed Reporting and JBoss, drawn above three time axes: hours of the day (00:00 to 06:00 on the following day), milliseconds within a query (0 to 950 ms) and minutes (0 to 28 min).]

Name: Searcher
Role:
● Knows how to build reports for queries
● Able to load assigned indexes from HDFS
Design:
● RPC server
● Query optimiser
● Lucene searcher with MMap directories
● Report aggregation component

Name: Merger
Role:
● Knows how to build sub-queries for searchers
● Able to assign indexes to searchers
● Aware of available searchers
Design:
● RPC server
● Report aggregation component
● Sorting and master-data management

Name: Observer
Role:
● Checks the health of mergers and searchers
● Automatic deployment of clusters
● Repairs unhealthy mergers and searchers
● Provides information about observed clusters
Design:
● RPC server

Name: Client
Role:
● Keeps track of available mergers
● Reroutes failed queries
● Takes care of load when assigning queries
Design:
● JBoss application
● RPC client for communication with mergers

Name: Tracking Server
Role:
● Tracks events such as sales, clicks, views …
● Writes tracked events to HDFS
Design:
● Proprietary application

Name: Database server, FTP server, Log server
Role:
● Provide data needed for reporting
Design:
● …

Name: Google, Yahoo, Miva
Role:
● Provide search engine cost data
Design:
● …

Data import:
● Data producers push data throughout the whole day

MR jobs:
● Reporting jobs are started at midnight
● It takes around 6 hours to process the previous day's data
● Index loading normally takes no more than 30 minutes

Reporting:
● There are multiple search clusters for redundancy
● Every cluster has one merger that coordinates its searchers
● The health of every cluster is ensured by an observer
● Searchers are grouped based on their capabilities
● Queries are propagated to the best searchers within 100 ms
● Capable of providing reports in sub-second time

Page 7: Analytics in olap with lucene & hadoop

1. ABOUT ZANOX AND ME

2. REPORTING IN GENERAL

3. ANALYTICS

4. QUESTIONS


Page 8: Analytics in olap with lucene & hadoop

ANALYTICS

Date        Publisher  Advertiser  Keyword                 Commission  Amount
2012-09-03  Jim        HUK         Auto insurance company  23.56 €     345.06 €
2012-09-03  Tom        DEVK        Health insurance        9.24 €      81.35 €
2012-09-03  Bob        DEVK        Health insurance        43.81 €     512.19 €
2012-09-03  Bill       HUK         Cheap auto insurance    7.22 €      83.76 €
2012-09-04  Al         HUK         Auto insurance company  43.07 €     892.12 €
2012-09-04  Mike       HUK         Auto insurance company  29.98 €     98.17 €
2012-09-04  Nick       DEVK        Health insurance        12.03 €     108.19 €
2012-09-04  Phil       DEVK        Cheap auto insurance    9.72 €      83.21 €

Commission | Amount
23.56 | 345.06 €
9.24 | 81.35 €
43.81 | 512.19 €
7.22 | 83.76 €
43.07 | 892.12 €
29.98 | 98.17 €
12.03 | 108.19 €
9.72 | 83.21 €

Page 9: Analytics in olap with lucene & hadoop

ANALYTICS

Date        Advertiser  Keyword                 Publisher – Commission | Amount
2012-09-03  HUK         Auto insurance company  Jim – 23.56 | 345.06 €
2012-09-03  DEVK        Health insurance        Tom – 9.24 | 81.35 €
                                                Bob – 43.81 | 512.19 €
2012-09-03  HUK         Cheap auto insurance    Bill – 7.22 | 83.76 €
2012-09-04  HUK         Auto insurance company  Al – 43.07 | 892.12 €
                                                Mike – 29.98 | 98.17 €
2012-09-04  DEVK        Health insurance        Nick – 12.03 | 108.19 €
2012-09-04  DEVK        Cheap auto insurance    Phil – 9.72 | 83.21 €

Page 10: Analytics in olap with lucene & hadoop

ANALYTICS

Date        Advertiser  Keyword                 Publisher – Commission | Amount
2012-09-03  HUK         Auto insurance company  Jim – 23.56 | 345.06 €
2012-09-03  DEVK        Health insurance        Tom – 9.24 | 81.35 €
                                                Bob – 43.81 | 512.19 €
2012-09-03  HUK         Cheap auto insurance    Bill – 7.22 | 83.76 €
2012-09-04  HUK         Auto insurance company  Al – 43.07 | 892.12 €
                                                Mike – 29.98 | 98.17 €
2012-09-04  DEVK        Health insurance        Nick – 12.03 | 108.19 €
2012-09-04  DEVK        Cheap auto insurance    Phil – 9.72 | 83.21 €

Query (Q), with FILTER and GROUP BY parts; labels shown: Jim (PUBLISHER), HUK (ADVERTISER), HUK (ADVERTISER)

Report (R), VALUES: COMMISSION 103,83 €, AMOUNT 1419,11 €

What does each searcher do when a query comes in?
● SEARCHING
● AGGREGATION
● LOADING

Can we further improve the loading of documents? Yes, we can! Find out which dimensions are important filters and use them while indexing documents.
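The three searcher steps can be illustrated with a minimal Python sketch over the example rows from the slides. It is not the Lucene-based implementation; the function name `build_report`, the in-memory row layout, and the step order (search, then load, then aggregate) are assumptions made for illustration.

```python
from decimal import Decimal

# Example rows from the slides:
# (date, advertiser, keyword, publisher, commission, amount)
ROWS = [
    ("2012-09-03", "HUK", "Auto insurance company", "Jim", "23.56", "345.06"),
    ("2012-09-03", "DEVK", "Health insurance", "Tom", "9.24", "81.35"),
    ("2012-09-03", "DEVK", "Health insurance", "Bob", "43.81", "512.19"),
    ("2012-09-03", "HUK", "Cheap auto insurance", "Bill", "7.22", "83.76"),
    ("2012-09-04", "HUK", "Auto insurance company", "Al", "43.07", "892.12"),
    ("2012-09-04", "HUK", "Auto insurance company", "Mike", "29.98", "98.17"),
    ("2012-09-04", "DEVK", "Health insurance", "Nick", "12.03", "108.19"),
    ("2012-09-04", "DEVK", "Cheap auto insurance", "Phil", "9.72", "83.21"),
]


def build_report(rows, advertiser):
    """Hypothetical helper: filter by advertiser and sum both measures."""
    # Searching: find the positions of the rows that match the filter.
    hits = [i for i, row in enumerate(rows) if row[1] == advertiser]
    # Loading: fetch the measures of the matching rows.
    loaded = [(Decimal(rows[i][4]), Decimal(rows[i][5])) for i in hits]
    # Aggregation: sum each measure over the loaded rows.
    commission = sum((c for c, _ in loaded), Decimal("0"))
    amount = sum((a for _, a in loaded), Decimal("0"))
    return {"advertiser": advertiser, "commission": commission, "amount": amount}


report = build_report(ROWS, "HUK")
```

For the advertiser HUK this yields the values shown in the report on the slide (103,83 € commission and 1419,11 € amount).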


Page 12: Analytics in olap with lucene & hadoop

ANALYTICS

● There are 9 dimensions in total, several of which
  ● have huge cardinality (more than 1 million distinct values) and
  ● are at the same time rarely used as filters, meaning
  ● they are excellent candidates for being stored together with the measures.
● There are more than 50 measures that
  ● are freely combined in user-defined formulas, but at the same time
  ● form only a few groups based on usage patterns, meaning
  ● they contribute to better compression because fields become bigger.
● The order in which documents are inserted into the index is driven by
  ● the dimensions that are known to be the most important filters, meaning
  ● faster loading, as the needed documents are closer to each other.
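The effect of the insertion order can be sketched in a few lines of Python. The document tuples, the helper names and the choice of date and advertiser as the two most important filters are assumptions for illustration only; the point is that sorting by the filter dimensions before indexing places matching documents next to each other.

```python
# Hypothetical documents in arrival order: (date, advertiser, publisher)
DOCS = [
    ("2012-09-04", "DEVK", "Nick"),
    ("2012-09-03", "HUK", "Jim"),
    ("2012-09-04", "HUK", "Al"),
    ("2012-09-03", "DEVK", "Tom"),
    ("2012-09-03", "HUK", "Bill"),
    ("2012-09-04", "HUK", "Mike"),
]


def insertion_order(docs, filter_positions):
    """Sort documents by the dimensions most often used as filters."""
    return sorted(docs, key=lambda d: tuple(d[p] for p in filter_positions))


def matching_positions(docs, date, advertiser):
    """Positions of the documents a filter on date and advertiser would load."""
    return [i for i, d in enumerate(docs) if d[0] == date and d[1] == advertiser]


def is_contiguous(positions):
    """True when the positions form one unbroken run."""
    return all(b - a == 1 for a, b in zip(positions, positions[1:]))


ordered = insertion_order(DOCS, filter_positions=(0, 1))
before = matching_positions(DOCS, "2012-09-04", "HUK")
after = matching_positions(ordered, "2012-09-04", "HUK")
```

In arrival order the two matching documents are scattered; after sorting they sit in adjacent positions, so loading them touches one contiguous range.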

Page 13: Analytics in olap with lucene & hadoop

ANALYTICS

Sharding based on hashing

[Figure: distributed reporting cluster in which the merger (M) sends the query (Q) to each of five searchers (S); every searcher performs SEARCHING, AGGREGATION and LOADING and returns a partial report (R).]

Query (Q), with FILTER and GROUP BY parts; labels shown: Jim (PUBLISHER), HUK (ADVERTISER), HUK (ADVERTISER)

Partial reports (R) returned by the searchers:
COMMISSION  AMOUNT
16,55 €     523,21 €
103,83 €    1419,11 €
78,72 €     879,34 €
134,23 €    1622,56 €
2,83 €      19,87 €

Merged report (R): COMMISSION 336,16 €, AMOUNT 4464,09 €

Sharding based on hashing:
● The query is sent to each and every searcher
● Searchers work hard to produce reports
● A huge amount of data is sent over the network

Goals for alternative sharding:
● Think very carefully when distributing data
● The network is your friend and a precious resource
● Produce the same report with fewer searchers
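How a merger might combine the partial reports from the hashing example can be sketched as follows. The dictionary layout and the function name `merge_reports` are assumptions; the figures are the five partial reports shown on the slide.

```python
from decimal import Decimal

# Partial reports returned by the five searchers on the slide.
PARTIALS = [
    {"commission": Decimal("16.55"), "amount": Decimal("523.21")},
    {"commission": Decimal("103.83"), "amount": Decimal("1419.11")},
    {"commission": Decimal("78.72"), "amount": Decimal("879.34")},
    {"commission": Decimal("134.23"), "amount": Decimal("1622.56")},
    {"commission": Decimal("2.83"), "amount": Decimal("19.87")},
]


def merge_reports(partials):
    """Hypothetical merger step: sum every measure across partial reports."""
    merged = {}
    for partial in partials:
        for measure, value in partial.items():
            merged[measure] = merged.get(measure, Decimal("0")) + value
    return merged


merged = merge_reports(PARTIALS)
```

Plain summation only works for additive measures such as commission and amount; averages or ratios would need their components merged first.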

Page 14: Analytics in olap with lucene & hadoop

ANALYTICS

Sharding based on selected filters

[Figure: the merger (M) sends the query (Q) to only two of the five searchers (S); each of the two performs SEARCHING, AGGREGATION and LOADING and returns a partial report (R).]

Query (Q), with FILTER and GROUP BY parts; labels shown: Jim (PUBLISHER), HUK (ADVERTISER), HUK (ADVERTISER)

Partial reports (R) returned by the searchers:
COMMISSION  AMOUNT
133,33 €    2454,22 €
202,83 €    2009,87 €

Merged report (R): COMMISSION 336,16 €, AMOUNT 4464,09 €
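Routing under filter-based sharding can be sketched like this, assuming the merger keeps a catalogue of which advertiser values each searcher holds. The searcher names, the catalogue layout, the placeholder advertisers `OTHER-1` and `OTHER-2`, and the function `route` are all hypothetical.

```python
# Hypothetical catalogue kept by the merger: searcher -> advertisers it holds.
SHARD_CATALOGUE = {
    "searcher-1": {"HUK"},
    "searcher-2": {"HUK", "DEVK"},
    "searcher-3": {"DEVK"},
    "searcher-4": {"OTHER-1"},
    "searcher-5": {"OTHER-2"},
}


def route(catalogue, advertiser=None):
    """Return the searchers that must receive the query.

    Without an advertiser filter every searcher is needed, which is the
    behaviour of hash-based sharding. With a filter, only the searchers
    whose shard contains that advertiser are contacted.
    """
    if advertiser is None:
        return sorted(catalogue)
    return sorted(name for name, held in catalogue.items() if advertiser in held)


targets = route(SHARD_CATALOGUE, advertiser="HUK")
everyone = route(SHARD_CATALOGUE)
```

With the filter on HUK only two of the five searchers are contacted, which mirrors the two partial reports in the figure.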

Page 15: Analytics in olap with lucene & hadoop

ANALYTICS

● Sharding based on hashing distributes data randomly,
  ● but that is not needed because no relevance is computed.
  ● Queries have to be sent to each and every searcher, which is suboptimal because much more data has to be transmitted over the network.
  ● The slowest searcher determines the final response time, because the report must be complete.
● How do we partition based on filter fields?
  ● Order the dimensions by how frequently they are used as filters in the particular group.
  ● Greater usage frequency means more important for sharding.
  ● Examples of dimensions found to be excellent for sharding are the time-related dimensions and the advertiser name dimension.
  ● The merger uses its knowledge of which data each shard holds when routing.
● The final result is a more stable system and much better QPS.
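Ordering dimensions by filter usage can be sketched with a simple frequency count. The query log below is invented for illustration, and `rank_sharding_dimensions` is a hypothetical helper, not part of the described system.

```python
from collections import Counter

# Hypothetical query log: each entry lists the dimensions a query filtered on.
QUERY_LOG = [
    {"date", "advertiser"},
    {"date"},
    {"date", "advertiser", "publisher"},
    {"advertiser"},
    {"date", "keyword"},
]


def rank_sharding_dimensions(query_log):
    """Order dimensions by how often they appear as a filter."""
    usage = Counter(dim for filters in query_log for dim in filters)
    # Descending frequency first, then name, so ties are deterministic.
    return sorted(usage, key=lambda dim: (-usage[dim], dim))


ranking = rank_sharding_dimensions(QUERY_LOG)
```

In this made-up log the date and advertiser dimensions rank first, which is in line with the examples named above.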

Page 16: Analytics in olap with lucene & hadoop

ANALYTICS

[Figure: distributed reporting cluster of mergers (M), searchers (S) and observers (O), with the searchers organised into three groups (1st, 2nd and 3rd). The groups cover the dimension sets {Publisher, Advertiser, Keyword}, {Publisher, Keyword} and {Publisher, Advertiser}.]

Page 17: Analytics in olap with lucene & hadoop

ANALYTICS

We have created a separate searcher group for every possible combination of dimensions. Therefore we are well prepared for every query that can hit us.

But we were attacked where we had not thought it possible: one new field was requested, and that spoiled everything.

[Figure: the merger (M) and 13 searcher groups, each built for one set of dimensions.]

Groups 1 to 3: {Publisher, Advertiser, Keyword}, {Publisher, Keyword}, {Publisher, Advertiser}
4th group: Publisher
5th group: Advertiser
6th group: Advertiser, Keyword
7th group: Keyword
8th group: Advertiser, Website, Keyword
9th group: Website, Advertiser, Publisher
10th group: Website
11th group: Website, Keyword
12th group: Website, Advertiser
13th group: Website, Publisher

Query (Q), with FILTER and GROUP BY parts; labels shown: Jim (PUBLISHER), HUK (ADVERTISER), Insurance (KEYWORD), Keyword, Website

There are so many groups, but none has the fields Website, Publisher and Keyword.

Creating a separate searcher group for each and every possible combination of dimensions in a query is not feasible, because the number of searcher groups grows extremely fast (exponentially).

Number of dimensions  Number of searcher groups
3                     2^3 – 1 = 7
4                     2^4 – 1 = 15
…                     …
9                     2^9 – 1 = 511

The solution is to analyze already processed queries in order to discard low-performing groups and to create new ones for emerging queries.
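The growth in the number of searcher groups follows from counting the non-empty subsets of the dimensions. A short Python check, with hypothetical function names, enumerates the subsets and compares the count with 2^n – 1.

```python
from itertools import combinations


def searcher_groups(dimensions):
    """Enumerate every non-empty combination of the given dimensions."""
    return [
        combo
        for size in range(1, len(dimensions) + 1)
        for combo in combinations(dimensions, size)
    ]


def group_count(number_of_dimensions):
    """Closed form: all subsets minus the empty one."""
    return 2 ** number_of_dimensions - 1


three = searcher_groups(["Publisher", "Advertiser", "Keyword"])
four = searcher_groups(["Publisher", "Advertiser", "Keyword", "Website"])
```

Three dimensions give 7 groups and adding Website gives 15; with all 9 dimensions the closed form gives 511.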

Page 18: Analytics in olap with lucene & hadoop

ANALYTICS

[Figure: the 13 searcher groups from the previous slide, each annotated with the number of queries it served and its response time, plus a 14th group with the fields Website, Publisher and Keyword. The assignment of the figures below to individual groups is not recoverable from the transcript; they are listed in slide order.]

queries   response
451236    45.54 ms
9         15.54 ms
221128    82.87 ms
2876136   23.54 ms
236       92.33 ms
2         147.82 ms
228732    61.03 ms
89        215.04 ms
15        105.02 ms
5644      29.76 ms
4         563.04 ms
32        87.03 ms
543765    35.87 ms

14th group: Website, Publisher, Keyword

Query (Q), with FILTER and GROUP BY parts; labels shown: Jim (PUBLISHER), HUK (ADVERTISER), Insurance (KEYWORD), Keyword, Website

There are so many groups, but none has the fields Website, Publisher and Keyword.

Page 19: Analytics in olap with lucene & hadoop

ANALYTICS

[Figure: the merger (M) and the searcher groups that remain (1st, 3rd, 4th, 9th and 11th), together with the 14th group (Website, Publisher, Keyword) and a new 15th group (Keyword, Publisher). The first five pairs of figures below are listed in slide order.]

queries   response
20236     45.54 ms
91234     15.54 ms
221128    82.87 ms
286136    23.54 ms
22336     92.33 ms

14th group (Website, Publisher, Keyword): queries 32546, response 727.23 ms
● Fields = Website, Publisher, Keyword: queries 9874, response 129.14 ms
● Fields = Publisher, Keyword: queries 22678, response 989.28 ms

Query (Q), with FILTER and GROUP BY parts; labels shown: Jim (PUBLISHER), HUK (ADVERTISER), Insurance (KEYWORD), Website, Publisher

There is no group that is optimized for queries having the fields Publisher and Keyword.

We optimized the system by adding a specialized group that is excellent for the slow queries we found.
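The adaptation loop (discard groups that serve few queries, add groups for frequent slow field sets) can be sketched as below. The thresholds, the function name `adapt_groups` and the low-traffic group are hypothetical. The figures for the two field sets are taken from the slide, and pairing them with those field sets is an assumed reading of the diagram.

```python
WPK = frozenset({"Website", "Publisher", "Keyword"})
PK = frozenset({"Publisher", "Keyword"})
AK = frozenset({"Advertiser", "Keyword"})

# Existing groups, keyed by the fields they were built for.
GROUPS = {WPK, AK}
# Queries served per group; the figure for AK is hypothetical.
QUERIES_PER_GROUP = {WPK: 32546, AK: 2}
# Per field set of the incoming queries: (query count, response time in ms).
STATS_PER_FIELD_SET = {WPK: (9874, 129.14), PK: (22678, 989.28)}


def adapt_groups(groups, queries_per_group, stats_per_field_set,
                 min_queries=1000, slow_ms=500.0):
    """Hypothetical adaptation step driven by the query log."""
    # Discard groups that served too few queries to justify their cost.
    kept = {g for g in groups if queries_per_group.get(g, 0) >= min_queries}
    # Add a dedicated group for every field set that is both frequent and slow.
    for fields, (queries, response_ms) in stats_per_field_set.items():
        if queries >= min_queries and response_ms > slow_ms:
            kept.add(fields)
    return kept


adapted = adapt_groups(GROUPS, QUERIES_PER_GROUP, STATS_PER_FIELD_SET)
```

Under these assumed thresholds the low-traffic group is dropped and a new group for Publisher and Keyword is created, which corresponds to the specialized group added on the slide.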

Page 20: Analytics in olap with lucene & hadoop

ANALYTICS

● Hadoop makes it possible to build new indexes easily:
  ● Utilization of Hadoop resources is optimized, because
  ● better indexes can always be built by low-priority jobs.
  ● The resulting OLAP system continuously adapts itself to the queries, and
  ● there are no expensive queries left in the long run.
● How can we decide which group is best for the query at hand?
  ● That is query routing, which we solved either by
    ● asking searchers only to return the number of hits and then to generate the report if selected, or, more simply,
    ● by using a heuristic based on dimension cardinality.
  ● But query routing is not a topic of this talk.
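The hit-count variant of query routing can be sketched as a two-phase exchange. The talk does not say how the selection is made; this sketch assumes that the group reporting the fewest hits is chosen and that each group holds documents pre-aggregated to its own dimensions. The class, its methods and the amounts (in cents) are hypothetical.

```python
class SearcherGroup:
    """Hypothetical stand-in for a searcher group built on a set of fields."""

    def __init__(self, name, fields, documents):
        self.name = name
        self.fields = set(fields)
        self.documents = documents

    def can_answer(self, needed_fields):
        return set(needed_fields) <= self.fields

    def _matches(self, filters):
        return [d for d in self.documents
                if all(d.get(k) == v for k, v in filters.items())]

    def count_hits(self, filters):
        # Phase 1: cheap call that only reports how many documents match.
        return len(self._matches(filters))

    def build_report(self, filters):
        # Phase 2: the expensive call, made only on the selected group.
        return sum(d["amount_cents"] for d in self._matches(filters))


def route(groups, filters, group_by):
    """Pick the candidate group with the fewest hits (assumed criterion)."""
    needed = set(filters) | set(group_by)
    candidates = [g for g in groups if g.can_answer(needed)]
    if not candidates:
        raise LookupError("no searcher group covers the requested fields")
    return min(candidates, key=lambda g: g.count_hits(filters))


coarse = SearcherGroup("advertiser-only", {"advertiser"}, [
    {"advertiser": "HUK", "amount_cents": 141911},
    {"advertiser": "DEVK", "amount_cents": 78494},
])
fine = SearcherGroup("advertiser-publisher", {"advertiser", "publisher"}, [
    {"advertiser": "HUK", "publisher": "Jim", "amount_cents": 34506},
    {"advertiser": "HUK", "publisher": "Bill", "amount_cents": 8376},
    {"advertiser": "HUK", "publisher": "Al", "amount_cents": 89212},
    {"advertiser": "HUK", "publisher": "Mike", "amount_cents": 9817},
])

chosen = route([coarse, fine], {"advertiser": "HUK"}, ["advertiser"])
total = chosen.build_report({"advertiser": "HUK"})
by_publisher = route([coarse, fine], {"publisher": "Jim"}, ["publisher"])
```

Both groups can answer the advertiser query and would produce the same total, but the coarser group does so from a single document, so it is the one asked to build the report.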

Page 21: Analytics in olap with lucene & hadoop

1. ABOUT ZANOX AND ME

2. REPORTING IN GENERAL

3. QUERY ROUTING

4. QUESTIONS

