hundreds of queries in the time of one - gianmario spacagna

Hundreds of queries in the time of oneGianmario [email protected]

• Retail Banking• 1M+ Barclays business

customers in UK• 95% small businesses• Many of them accept

debit card payments• Huge potential for

Business Analytics

Small businesses can’t harness their own data and/or

compare their performance with the competitors

• Monetary cost• Lack of IT infrastructures• Lack of analytics expertise• Lack of market data

Insights Engine• Calculates a business’ Key

Performance Indicators (KPIs) by combining hundreds of Business Intelligence (BI) queries

• Compares them to those of similar local businesses, i.e. market competitors

• Filters them to expose only relevant insights and to preserve privacy

• Presents them in natural language form

• Collects feedback

Example of InsightsArchetype Comparison Granularity

Time YearIndustry HairdressingLocation Blackpool

Archetype 1: “Compare to customers in Segment/Location”f(A) vs. f(B)

“This year, your business spent £1,000 on electricity, other hairdressing businesses in Blackpool spent on average £1,200 on electricity”

Archetype 2: “Calculate Growth and compare to customers in Segment/Location”f(A)/f(A’) vs. f(B)/f(B’)

“Based on the transactions for the past 12 months, your customers spend £25 on average each time they visit your business. By comparison, customers of other hairdressing businesses in Blackpool spend £23.”

Archetype 3: “Calculate % of wallet and compare to customers in Segment/Location”f(A)/g(A) vs. f(B)/g(B)

“This year, your customers spent 2.3% of their income in your business. Customers of other hairdressers in Blackpool spent 3.5% of their income with them.”

Technical ChallengesMultiple Operations on the Same Dataset Optimized In-Memory Execution Plan No Unnecessary I/OAgile Safe Refactoring Statically Typed

Parallel Architecture Map/Reduce & Composable High-Orders Functions

Functional Programming

None of the above would be feasible in traditional RDBMS!

Technical ChallengesMultiple Operations on the Same Dataset Optimized In-Memory Execution Plan No Unnecessary I/OAgile Safe Refactoring Statically Typed

Parallel Architecture Map/Reduce & Composable High-Orders Functions

Functional Programming

Building complex production-quality applications • Flexibility• Richness• High-level features• Native to the computation framework

None of the above would be feasible in traditional RDBMS!

Domain Specific Language• Elegant: few lines where SQL would use more than 200• Natural: use English to specify, implement and test• Compiled: easy to spot mistakes

Commutative MonoidsAlgebraic Structure made of (T, |+|, Zero):

• Type T

• Binary operator |+|: (T, T) => T

• Neutral element Zero: T

sum = (Int, +, 0), multiplication = (Int, *, 1), distinct = (Set, ++, Set.empty)

Properties:

• (Associativity, Commutability) => Parallelizable aggregation

• Identity: t |+| Zero = t

Our composable “Count and Sum of Amount” monoid:

– ((Int, Int), (count1 + count2, sum1 + sum2), (0, 0))

– val toMonoid = (t: Transaction) => (1, t.amount)

– e.g. (4, 120) |+| (1, 80) = (5, 200)

– Can even achieve median using probabilistic monoids, or DistinctCountBounds using hash sets

(13, 790)

(9, 630)

(3, 400)(1, 120)(1, 200)(1, 80)

(2, 90)(1, 60)(1, 30)

(4, 140)

(1, 90)(1, 10)(1, 30)(1, 10)

(4, 160)(2, 50)

(1, 20)(1, 30)

(2, 110)(1, 60)(1, 50)

Insight Part Optimization

Minimum Set of Insight Parts

Insight Type

3

Insight Type

2

Insight Type

1

Part1

Part2

Part3 Part4

Part5

Given the set of insight types:

• Optimize the minimum number of informative insight parts to compute

• Each part consists of:

– Filters (bookkeeping, paymentType, spendCategory)

– Monoids (Sum, Count, Distinct, Median…)

Example:

- Type: “Energy spending growth”

- 4 Insight Parts:“(Count & Sum of Amount) of Outgoing Transactions for Energy” of:

1. this business in current timeslot

2. this business in previous timeslot

3. competitors in current timeslot

4. competitors in previous timeslot

Hierarchical Aggregation

Time:• Month• Quarter• Half year• Year

Industry:• Sub-segment• Segment• All

Location:• Area• County• Country• Whole UK

day

businessId

filters: • bookkeeping = Outgoing• spendCategory = Electricity

Cell contains all of the monoids sliced by

filters of each extracted Part and aggregated at

that granularity combination

(1, t.amount)

Given the set of insight parts and granularities:

• Fill the OLAP cube starting from the finest granularity levels

• Example:(Month, Hairdressing, Blackpool) -> List(part1@Part(filters = List(Ingoing), monoid = (Count(122), Sum(11928)), part2@Part(filters = List(Electricity, Outgoing), monoid = (Sum(89))

• Further aggregation on ascending levels

Derive Insights from the “memoized” results

Atomic Monoids

Memoized Parts

Insight

1. this busi-ness in current year

2. this business in previous year

4. competitors in:previous year,Blackpool, hairdressing

“This year, your business spent £1,000 on electricity, other hairdressing businesses in Blackpool spent on average £1,200 on electricity”

3. competitors in:current year,Blackpool, hairdressing

This business parts (1 and 2)

comparison parts in each granularity combination (3 and 4)

Collate Function

• Growth• Compare• Ratios• Best day of week

DSLmonoids

comparison levels comparison time

filterscollate function

Some Numbers• 700,000,000 rows of data (2 years worth)• 275,000 UK Businesses• 66 Insights for each Business

– Upper Bound: 9 main queries (insight types) * 4 sub-queries (insight parts) * 36 granularities (dimension combos) =1296 Queries

– Filtered on Privacy and Ranked by Relevance• The Engine ran in 30 minutes

– on a small low-performance cluster6 x (20 CPUs, 48G RAM)

– 500x faster than Hive and probably wouldn’t return on Teradata

SummaryInsights Engine is an analytical engine that takes hundreds of queries as input and generates an optimized single execution plan by combining and re-using intermediate results for each business and each combination of granularity over multiple hierarchical dimensions.

What is cool about it:1. Composable ”Monoids” allowing aggregations at multiple

levels of granularity, like a tree2. A DSL that defines Insights succinctly

(3 lines of code vs ~250 lines of SQL)3. Inspection of the queries specified in the DSL to find

"duplicate" structures of computation (Insight Parts), and up-front “memoization" to ensure they are only computed once

4. Ensure all of the results are privacy-safe and relevance-ranked

Follow-up Links• Insights Engine Blog on Cloudera:

http://blog.cloudera.com/blog/2015/08/how-apache-spark-scala-and-functional-programming-made-hard-problems-easy-at-barclays/

• Fast accurate low-memory aggregations DSL for Spark:https://github.com/samthebest/aggregations

• Contribute to the Agile Data Science Manifesto:www.datasciencemanifesto.com





https://github.com/samthebest/aggregations

https://github.com/samthebest/aggregations

http://www.datasciencemanifesto.com/

hundreds of queries in the time of one - gianmario spacagna

Data & Analytics