hundreds of queries in the time of one - gianmario spacagna

16
Hundreds of queries in the time of one Gianmario Spacagna [email protected]

Upload: spark-summit

Post on 21-Apr-2017

2.187 views

Category:

Data & Analytics


0 download

TRANSCRIPT

Page 1: Hundreds of queries in the time of one - Gianmario Spacagna

Hundreds of queries in the time of oneGianmario [email protected]

Page 2: Hundreds of queries in the time of one - Gianmario Spacagna

• Retail Banking• 1M+ Barclays business

customers in UK• 95% small businesses• Many of them accept

debit card payments• Huge potential for

Business Analytics

Page 3: Hundreds of queries in the time of one - Gianmario Spacagna

Small businesses can’t harness their own data and/or

compare their performance with the competitors

• Monetary cost• Lack of IT infrastructures• Lack of analytics expertise• Lack of market data

Page 4: Hundreds of queries in the time of one - Gianmario Spacagna

Insights Engine• Calculates a business’ Key

Performance Indicators (KPIs) by combining hundreds of Business Intelligence (BI) queries

• Compares them to those of similar local businesses, i.e. market competitors

• Filters them to expose only relevant insights and to preserve privacy

• Presents them in natural language form

• Collects feedback

Page 5: Hundreds of queries in the time of one - Gianmario Spacagna

Example of InsightsArchetype Comparison Granularity

Time YearIndustry HairdressingLocation Blackpool

Archetype 1: “Compare to customers in Segment/Location”f(A) vs. f(B)

“This year, your business spent £1,000 on electricity, other hairdressing businesses in Blackpool spent on average £1,200 on electricity”

Archetype 2: “Calculate Growth and compare to customers in Segment/Location”f(A)/f(A’) vs. f(B)/f(B’)

“Based on the transactions for the past 12 months, your customers spend £25 on average each time they visit your business. By comparison, customers of other hairdressing businesses in Blackpool spend £23.”

Archetype 3: “Calculate % of wallet and compare to customers in Segment/Location”f(A)/g(A) vs. f(B)/g(B)

“This year, your customers spent 2.3% of their income in your business. Customers of other hairdressers in Blackpool spent 3.5% of their income with them.”

Page 6: Hundreds of queries in the time of one - Gianmario Spacagna

Technical ChallengesMultiple Operations on the Same Dataset Optimized In-Memory Execution Plan No Unnecessary I/OAgile Safe Refactoring Statically Typed

Parallel Architecture Map/Reduce & Composable High-Orders Functions

Functional Programming

None of the above would be feasible in traditional RDBMS!

Page 7: Hundreds of queries in the time of one - Gianmario Spacagna

Technical ChallengesMultiple Operations on the Same Dataset Optimized In-Memory Execution Plan No Unnecessary I/OAgile Safe Refactoring Statically Typed

Parallel Architecture Map/Reduce & Composable High-Orders Functions

Functional Programming

Building complex production-quality applications • Flexibility• Richness• High-level features• Native to the computation framework

None of the above would be feasible in traditional RDBMS!

Page 8: Hundreds of queries in the time of one - Gianmario Spacagna

Domain Specific Language• Elegant: few lines where SQL would use more than 200• Natural: use English to specify, implement and test• Compiled: easy to spot mistakes

Page 9: Hundreds of queries in the time of one - Gianmario Spacagna

Commutative MonoidsAlgebraic Structure made of (T, |+|, Zero):

• Type T

• Binary operator |+|: (T, T) => T

• Neutral element Zero: T

sum = (Int, +, 0), multiplication = (Int, *, 1), distinct = (Set, ++, Set.empty)

Properties:

• (Associativity, Commutability) => Parallelizable aggregation

• Identity: t |+| Zero = t

Our composable “Count and Sum of Amount” monoid:

– ((Int, Int), (count1 + count2, sum1 + sum2), (0, 0))

– val toMonoid = (t: Transaction) => (1, t.amount)

– e.g. (4, 120) |+| (1, 80) = (5, 200)

– Can even achieve median using probabilistic monoids, or DistinctCountBounds using hash sets

(13, 790)

(9, 630)

(3, 400)(1, 120)(1, 200)(1, 80)

(2, 90)(1, 60)(1, 30)

(4, 140)

(1, 90)(1, 10)(1, 30)(1, 10)

(4, 160)(2, 50)

(1, 20)(1, 30)

(2, 110)(1, 60)(1, 50)

Page 10: Hundreds of queries in the time of one - Gianmario Spacagna

Insight Part Optimization

Minimum Set of Insight Parts

Insight Type

3

Insight Type

2

Insight Type

1

Part1

Part2

Part3 Part4

Part5

Given the set of insight types:

• Optimize the minimum number of informative insight parts to compute

• Each part consists of:

– Filters (bookkeeping, paymentType, spendCategory)

– Monoids (Sum, Count, Distinct, Median…)

Example:

- Type: “Energy spending growth”

- 4 Insight Parts:“(Count & Sum of Amount) of Outgoing Transactions for Energy” of:

1. this business in current timeslot

2. this business in previous timeslot

3. competitors in current timeslot

4. competitors in previous timeslot

Page 11: Hundreds of queries in the time of one - Gianmario Spacagna

Hierarchical Aggregation

Time:• Month• Quarter• Half year• Year

Industry:• Sub-segment• Segment• All

Location:• Area• County• Country• Whole UK

day

businessId

filters: • bookkeeping = Outgoing• spendCategory = Electricity

Cell contains all of the monoids sliced by

filters of each extracted Part and aggregated at

that granularity combination

(1, t.amount)

Given the set of insight parts and granularities:

• Fill the OLAP cube starting from the finest granularity levels

• Example:(Month, Hairdressing, Blackpool) -> List(part1@Part(filters = List(Ingoing), monoid = (Count(122), Sum(11928)), part2@Part(filters = List(Electricity, Outgoing), monoid = (Sum(89))

• Further aggregation on ascending levels

Page 12: Hundreds of queries in the time of one - Gianmario Spacagna

Derive Insights from the “memoized” results

Atomic Monoids

Memoized Parts

Insight

1. this busi-ness in current year

2. this business in previous year

4. competitors in:previous year,Blackpool, hairdressing

“This year, your business spent £1,000 on electricity, other hairdressing businesses in Blackpool spent on average £1,200 on electricity”

3. competitors in:current year,Blackpool, hairdressing

This business parts (1 and 2)

comparison parts in each granularity combination (3 and 4)

Collate Function

• Growth• Compare• Ratios• Best day of week

Page 13: Hundreds of queries in the time of one - Gianmario Spacagna

DSLmonoids

comparison levels comparison time

filterscollate function

Page 14: Hundreds of queries in the time of one - Gianmario Spacagna

Some Numbers• 700,000,000 rows of data (2 years worth)• 275,000 UK Businesses• 66 Insights for each Business

– Upper Bound: 9 main queries (insight types) * 4 sub-queries (insight parts) * 36 granularities (dimension combos) =1296 Queries

– Filtered on Privacy and Ranked by Relevance• The Engine ran in 30 minutes

– on a small low-performance cluster6 x (20 CPUs, 48G RAM)

– 500x faster than Hive and probably wouldn’t return on Teradata

Page 15: Hundreds of queries in the time of one - Gianmario Spacagna

SummaryInsights Engine is an analytical engine that takes hundreds of queries as input and generates an optimized single execution plan by combining and re-using intermediate results for each business and each combination of granularity over multiple hierarchical dimensions.

What is cool about it:1. Composable ”Monoids” allowing aggregations at multiple

levels of granularity, like a tree2. A DSL that defines Insights succinctly

(3 lines of code vs ~250 lines of SQL)3. Inspection of the queries specified in the DSL to find

"duplicate" structures of computation (Insight Parts), and up-front “memoization" to ensure they are only computed once

4. Ensure all of the results are privacy-safe and relevance-ranked