analytics in olap with lucene & hadoop
Post on 21-Oct-2014
1.616 views
DESCRIPTION
TRANSCRIPT
1 Berlin | 07/02/2012 | zanox | Presentation title
TABLE OF CONTENTS
1. ABOUT ZANOX AND ME
2. REPORTING IN GENERAL
3. ANALYTICS
4. QUESTIONS
2 Berlin | 08/05/2013 | zanox | Zanox Reporting Systems
THE ZANOX NETWORK
3 Berlin | 08/05/2013 | zanox | Zanox Reporting Systems
WHO AM I?
4
● Senior Architect at Zanox
● 7+ years experience in building distributed search systems based on Lucene
Lucid Certified Apache Solr/Lucene Developer
● 5+ years of using Hadoop and 2+ years of using HBase for data mining
Cloudera Certified Developer for Apache Hadoop and HBase
● I have applied different machine-learning techniques mainly to optimise
resource usage while performing distributed search during my PhD
● See my book: “Beyond Centralised Search Engines
An Agent-Based Filtering Framework”
● How can you reach me?
● http://www.linkedin.com/in/draganmilosevic
Berlin | 08/05/2013 | zanox | Zanox Reporting Systems
1. ABOUT ZANOX AND ME
2. REPORTING IN GENERAL
3. ANALYTICS
4. QUESTIONS
5 Berlin | 08/05/2013 | zanox | Zanox Reporting Systems
REPORTING IN GENERAL
6
Tracking
Custom
Map Reduce
Distributed Reporting
M
S
S
S
JBoss
S
S
S
S
S
S
S
S
S
M
S
S
S
S
S
S
S
S
S
S
S
S
M
S
S
S
S
S
S
S
S
S
S
S
S
C
C
C
C
C
C
C
D
D
L
G
Y
D
D
L
F
M
T
T
T
T
T
T
T
T O
O
O
D
D
D
D
I I I I I
I I I I I I I
I I I I I I I I
I I I I I I I I I I
I I I I I I
D D
D
D D
D
D
D D
D
D D
D D
Q Q Q Q Q Q
R
R
R
R
R
R
D D
D D
D
D
D
D
D D
D
D
D
D
D
D D
D D
D D
D D
D D
D D
D D
OK OK
00:00 03:00 06:00 09:00 12:00 15:00 18:00 21:00 00:00 01:00 02:00 03:00 04:00 05:00 06:00
D D
D D
D D
D D
D D
D D
D D
D D
D D
00:00:00.000 00:00:00.005 00:00:00.010 00:00:00.015 00:00:00.020 00:00:00.025 00:00:00.030 00:00:00.035 00:00:00.040 00:00:00.045 00:00:00.050 00:00:00.055 00:00:00.060 00:00:00.065 00:00:00.070 00:00:00.075 00:00:00.080 00:00:00.085 00:00:00.090 00:00:00.095 00:00:00.100 00:00:00.200 00:00:00.300 00:00:00.400 00:00:00.500 00:00:00.600 00:00:00.650 00:00:00.700 00:00:00.750 00:00:00.800 00:00:00.850 00:00:00.900 00:00:00.910 00:00:00.920 00:00:00.930 00:00:00.940 00:00:00.950 00:00:00 00:02:00 00:04:00 00:06:00 00:08:00 00:10:00 00:10:20 00:10:40 00:11:00 00:11:20 00:11:40 00:12:00 00:14:00 00:16:00 00:18:00 00:20:00 00:22:00 00:24:00 00:26:00 00:28:00
Berlin | 08/05/2013 | zanox | Zanox Reporting Systems
Name Searcher
Role Knows how to build reports for queries
Able to load assigned indexes from HDFS
Design RPC Server
Query optimiser
Lucene searcher with MMap directories
Report aggregation component
Name Merger
Role Knows how to build sub-queries for searchers
Able to assign indexes to searchers
Aware of available searchers
Design RPC Server
Report aggregation component
Sorting and master-data management
Name Observer
Role Checks the health of mergers and searchers
Automatic deployment of clusters
Repair unhealthy mergers and searchers
Provide information about observed clusters
Design RPC Server
Name Client
Role Keep track of available mergers
Rerouting of failed queries
Take care of load when assigning queries
Design JBoss application
RPC client for communication with mergers
Name Tracking Server
Role Tracks events such as sales, clicks, views …
Writes tracked events to HDFS
…
Design Proprietary application
Name Database server
FTP server
Log server
Role Provides data needed for reporting
…
Design …
Name Google
Yahoo
Miva
Role Provides search engine cost data
…
Design …
Data-Import Data producers are pushing data during a whole day
MR Jobs Reporting jobs are started at midnight
It takes around 6 hours to process data from yesterday
Index loading normally lasts no more that 30 minutes
Reporting There are multiple search clusters for redundancy
Every cluster has one merger that coordinates searchers
The health of every cluster is ensured by observer
Searchers are grouped based on their capabilities
Within 100ms queries are propagated to best searchers
Capable to provide reports in sub-second time
ANALYTICS
1. ABOUT ZANOX AND ME
2. REPORTING IN GENERAL
3. ANALYTICS
4. QUESTIONS
7 Berlin | 08/05/2013 | zanox | Zanox Reporting Systems
Searching
2012-09-03
2012-09-03
2012-09-03
2012-09-03
2012-09-04
2012-09-04
2012-09-04
2012-09-04
ANALYTICS
8 Berlin | 08/05/2013 | zanox | Zanox Reporting Systems
Date Publisher Advertiser Keyword Commission
Jim HUK Auto insurance company 23.56 €
Tom DEVK Health insurance 9.24 €
Bob DEVK Health insurance 43.81 €
Bill HUK Cheap auto insurance 7.22 €
Al HUK Auto insurance company 43.07 €
Mike HUK Auto insurance company 29.98 €
Nick DEVK Health insurance 12.03 €
Phil DEVK Cheap auto insurance 9.72 €
Amount
345.06 €
81.35 €
512.19 €
83.76 €
892.12 €
98.17 €
108.19 €
83.21 €
Optimization Sharding
Commission | Amount
23.56 | 345.06 €
9.24 | 81.35 €
43.81 | 512.19 €
7.22 | 83.76 €
43.07 | 892.12 €
29.98 | 98.17 €
12.03 | 108.19 €
9.72 | 83.21 €
2012-09-03
2012-09-03
2012-09-03
2012-09-04
2012-09-04
2012-09-04
ANALYTICS
9 Berlin | 08/05/2013 | zanox | Zanox Reporting Systems
Date Advertiser Keyword Publisher – Commission | Amount
HUK Auto insurance company Jim – 23.56 | 345.06 €
DEVK Health insurance Tom – 9.24 | 81.35 €
Bob – 43.81 | 512.19 €
HUK Cheap auto insurance Bill – 7.22 | 83.76 €
HUK Auto insurance company Al – 43.07 | 892.12 €
Mike – 29.98 | 98.17 €
DEVK Health insurance Nick – 12.03 | 108.19 €
DEVK Cheap auto insurance Phil – 9.72 | 83.21 €
Searching Optimization Sharding
Searching
ANALYTICS
10 Berlin | 08/05/2013 | zanox | Zanox Reporting Systems
2012-09-03
2012-09-03
2012-09-03
2012-09-04
2012-09-04
2012-09-04
Date Advertiser Publisher – Commission | Amount
HUK Jim – 23.56 | 345.06 €
DEVK Tom – 9.24 | 81.35 €
Bob – 43.81 | 512.19 €
HUK Bill – 7.22 | 83.76 €
HUK Al – 43.07 | 892.12 €
Mike – 29.98 | 98.17 €
DEVK Nick – 12.03 | 108.19 €
DEVK
Keyword
Auto insurance company
Health insurance
Cheap auto insurance
Auto insurance company
Health insurance
Cheap auto insurance Phil – 9.72 | 83.21 €
Q
FILTER
GROUP BY
Jim PUBLISHER
HUK ADVERTISER
HUK ADVERTISER
R
VALUES
103,83 €
COMMISSION
1419,11 €
AMOUNT
SEARCHING
AGGREGATION
LOADING
What does each
searcher when
query comes?
Optimization Sharding
Can we further
improve loading
of documents?
Yes we can!
Find out which
dimensions are
important filters
and use them
while indexing
documents
Searching
ANALYTICS
11 Berlin | 08/05/2013 | zanox | Zanox Reporting Systems
2012-09-03
2012-09-03
2012-09-03
2012-09-04
2012-09-04
2012-09-04
Date Advertiser Publisher – Commission | Amount
HUK Jim – 23.56 | 345.06 €
DEVK Tom – 9.24 | 81.35 €
Bob – 43.81 | 512.19 €
HUK Bill – 7.22 | 83.76 €
HUK Al – 43.07 | 892.12 €
Mike – 29.98 | 98.17 €
DEVK Nick – 12.03 | 108.19 €
DEVK
Keyword
Auto insurance company
Health insurance
Cheap auto insurance
Auto insurance company
Health insurance
Cheap auto insurance Phil – 9.72 | 83.21 €
Q
FILTER
GROUP BY
Jim PUBLISHER
HUK ADVERTISER
HUK ADVERTISER
Optimization Sharding
R
VALUES
103,83 €
COMMISSION
1419,11 €
AMOUNT
Searching
ANALYTICS
12 Berlin | 08/05/2013 | zanox | Zanox Reporting Systems
● There are 9 dimensions in total, where several
● Have huge cardinality (>1 million different values) and
● At the same time are rarely used as filter, meaning
● They are excellent candidates for being put together with measures.
● There are more than 50 measures that
● Are freely combined in user-defined formulas, but at the same time
● Form only few groups based on usage patterns, meaning
● They contribute to better compression because fields become bigger.
● Ordering in which documents are inserted in index is driven by
● Dimensions that are known to be the most important filters, meaning
● Faster loading as needed documents are closer to each other.
Optimization Sharding
Optimization Searching Sharding
Sharding based on hashing
ANALYTICS
13 Berlin | 08/05/2013 | zanox | Zanox Reporting Systems
SEARCHING
AGGREGATION
LOADING
SEARCHING
AGGREGATION
LOADING
SEARCHING
AGGREGATION
LOADING
SEARCHING
AGGREGATION
LOADING
SEARCHING
AGGREGATION
LOADING
Distributed Reporting
M
S
S
S
S
S
S
S
S
S
S
S
S
M
S
S
S
S
S
S
S
S
S
S
S
S O
M
S
S
S
S
S
S
S
S
S
S
S
S O
O
R
VALUES
16,55 €
COMMISSION
523,21 €
AMOUNT
R
VALUES
103,83 €
COMMISSION
1419,11 €
AMOUNT
R
VALUES
78,72 €
COMMISSION
879,34 €
AMOUNT
R
VALUES
134,23 €
COMMISSION
1622,56 €
AMOUNT
R
VALUES
336,16 €
COMMISSION
4464,09 €
AMOUNT
R R R R
Q
FILTER
GROUP BY
Jim PUBLISHER
HUK ADVERTISER
HUK ADVERTISER
Q Q Q Q Q
R
VALUES
2,83 €
COMMISSION
19,87 €
AMOUNT
R Sharding based on hashing Query is sent to each and every searcher
Searchers work hard to produce reports
Huge amount of data sent over network
Goals for alternative sharding Think very well while distributing data
Network is your friend and precious resource
Produce same report with less searchers
Sharding based on selected filters
Optimization Sharding Searching
ANALYTICS
14 Berlin | 08/05/2013 | zanox | Zanox Reporting Systems
SEARCHING
AGGREGATION
LOADING
SEARCHING
AGGREGATION
LOADING
M
S S S S S R
VALUES
336,16 €
COMMISSION
4464,09 €
AMOUNT
Q
FILTER
GROUP BY
Jim PUBLISHER
HUK ADVERTISER
HUK ADVERTISER
Q Q
VALUES
133,33 €
COMMISSION
2454,22 €
AMOUNT
R R R
VALUES
202,83 €
COMMISSION
2009,87 €
AMOUNT
R
Optimization Sharding Searching
ANALYTICS
15 Berlin | 08/05/2013 | zanox | Zanox Reporting Systems
● Sharding based on hashing randomly distributes data
● But that is not needed because no relevance will be computed.
● Queries have to be send to each and every searcher, being suboptimal
because much more data have to be transmitted over the network.
● The slowest searcher determine final response time due to completeness.
● How we partition based on filter fields?
● Order dimensions based on their usage frequency as filter in particular group.
● Greater usage frequency means more important for sharding.
● Examples of dimensions that are found to be excellent for sharding are:
time related dimensions and advertiser name dimension.
● Merger uses knowledge about which data each shard has while routing.
● Final result is more stable system and much better QPS.
ANALYTICS
16 Berlin | 08/05/2013 | zanox | Zanox Reporting Systems
3rd
Gro
up
2nd G
roup
1st G
roup
Publisher
Advertiser
Keyword
Publisher
Keyword
Publisher
Advertiser
Distributed Reporting
M
S
S
S
S
S
S
S
S
S
S
S
S
M
S
S
S
S
S
S
S
S
S
S
S
S O
M
S
S
S
S
S
S
S
S
S
S
S
S O
O
Optimization Sharding Searching
We have created separate
searcher group for every
possible combination of
dimensions. Therefore we
are well prepared for each
query that can hit us.
But we are attacked where
we haven’t thought that it
is possible. One new field
have been requested and
that has spoiled everything
17 Berlin | 07/02/2012 | zanox | Presentation title
3rd
Gro
up
2nd G
roup
1st G
roup
Publisher
Advertiser
Keyword
Publisher
Keyword
Publisher
Advertiser
M 4
th G
roup
Publisher
5th
Gro
up
Advertiser
6th
Gro
up
Advertiser
Keyword
7th
Gro
up
Keyword
Q
FILTER
GROUP BY
Jim PUBLISHER
Keyword
HUK ADVERTISER Insurance
KEYWORD
Website
8th
Gro
up
Advertiser
Website
Keyword 9
th G
roup
Website
Advertiser
Publisher
10
th G
roup
Website
11
th G
roup
Website
Keyword
12
th G
roup
Website
Advertiser
13
th G
roup
Website
Publisher
Publisher
There are so
many groups
but no one
has fields
Website
Publisher
Keyword
Creating a separate searcher group for each and every possible
combination of dimensions in each query is not feasible because
number of searcher-groups grows extremely fast (exponentially)
Number of dimensions Number of searcher groups
3 23 – 1 = 7
4 24 – 1 = 15
… …
9 29 – 1 = 511
Solution is to analyze already processed queries to discard low
performing groups and to create new ones for emerging queries
ANALYTICS
Optimization Sharding Searching
Optimization Sharding Searching
18 Berlin | 07/02/2012 | zanox | Presentation title
3rd
Gro
up
2nd G
roup
1st G
roup
Publisher
Advertiser
Keyword
Publisher
Keyword
Publisher
Advertiser
4th
Gro
up
Publisher
5th
Gro
up
Advertiser
6th
Gro
up
Advertiser
Keyword
7th
Gro
up
Keyword
8th
Gro
up
Advertiser
Website
Keyword 9
th G
roup
Website
Advertiser
Publisher
10
th G
roup
Website
11
th G
roup
Website
Keyword
12
th G
roup
Website
Advertiser
13
th G
roup
Website
Publisher
queries
451236
response
45.54 ms
queries
9
response
15.54 ms
queries
221128
response
82.87 ms
queries
2876136
response
23.54 ms
queries
236
response
92.33 ms
queries
2
response
147.82 ms
queries
228732
response
61.03 ms
queries
89
response
215.04 ms
queries
15
response
105.02 ms
queries
5644
response
29.76 ms
queries
4
response
563.04 ms
queries
32
response
87.03 ms
queries
543765
response
35.87 ms 14
th G
roup
Website
Publisher
Keyword
M Q
FILTER
GROUP BY
Jim PUBLISHER
Keyword
HUK ADVERTISER Insurance
KEYWORD
Website
Publisher
There are so
many groups
but no one
has fields
Website
Publisher
Keyword
ANALYTICS
19 Berlin | 07/02/2012 | zanox | Presentation title
3rd
Gro
up
9th
Gro
up
1st G
roup
Publisher
Advertiser
Keyword
Website
Publisher
Publisher
Advertiser
4th
Gro
up
Publisher
11
th G
roup
Website
Keyword
queries
20236
response
45.54 ms
Advertiser
queries
91234
response
15.54 ms
queries
221128
response
82.87 ms
queries
286136
response
23.54 ms
queries
22336
response
92.33 ms
14
th G
roup
Website
Publisher
Keyword
queries
32546
response
727.23 ms
M Q
FILTER
GROUP BY
Jim PUBLISHER
HUK ADVERTISER Insurance
KEYWORD
Website
Publisher
15
th G
roup
Keyword
Publisher
Fie
lds Website
Publisher
Keyword
queries
9874
response
129.14 ms
Publisher
Keyword
queries
22678
response
989.28 ms F
ield
s
=
We optimized system
by adding specialized
group that is excellent
for found slow queries
There is no
group that is
optimized
for queries
having fields
Publisher
Keyword
ANALYTICS
Optimization Sharding Searching
ANALYTICS
20 Berlin | 08/05/2013 | zanox | Zanox Reporting Systems
● Hadoop makes possible to easily build new indexes:
● Optimize utilization of Hadoop resources because
● Better indexes can be always build by low priority jobs.
● Built OLAP continuously adapts itself to queries and
● There are no more expensive queries in long run.
● How can we decide which group is best of query at hand?
● That’s query routing, which we solved either by
● Asking searchers only to return number of hits and
then to generate report if being selected, or simpler
● By using heuristic based on dimension cardinality.
● But query routing is not a topic of this talk.
Optimization Sharding Searching
1. ABOUT ZANOX AND ME
2. REPORTING IN GENERAL
3. QUERY ROUTING
4. QUESTIONS
21 Berlin | 08/05/2013 | zanox | Zanox Reporting Systems
22 Berlin | 08/05/2013 | zanox | Zanox Reporting Systems