Aggregators: Data Day Texas, 2015

Post on 14-Jul-2015

Aggregators: modeling data queries functionally

Oscar Boykin, Twitter @posco

Or:

Aggregators: composable aggregation for Scalding, Spark, Summingbird, and plain Scala

@Twitter

How to compute size of a list in Map/Reduce?

2 3 5 7 11 13 17

1 1 1 1 1 1 1

map(x => 1)

2 2 2

3 4

7

reduce {(x, y) => x+y}
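As a rough illustration of the two steps on this slide, here is a minimal sketch in plain Scala collections. The object and value names are made up for this example; only `map(x => 1)` and the `reduce` call come from the slides.

```scala
object SizeViaMapReduce {
  val primes: List[Int] = List(2, 3, 5, 7, 11, 13, 17)

  // Map step: replace every element with 1.
  val ones: List[Int] = primes.map(x => 1)

  // Reduce step: add the ones together.
  val size: Int = ones.reduce { (x, y) => x + y }
}
```

Here `SizeViaMapReduce.size` is 7, the number of elements in the list.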

Associative functions: f(a,f(b,c)) == f(f(a,b),c)

also called “semigroups”
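To make the associativity requirement concrete, the sketch below defines a deliberately tiny `Semigroup` trait. It is a stand-in written for this example, not Algebird's actual definition, and `isAssociativeFor` is a hypothetical helper. It checks the law on sample values, then shows that a left-to-right reduce and a reduce over two separately combined halves agree.

```scala
object SemigroupSketch {
  // Simplified stand-in for a semigroup: one associative `plus`.
  trait Semigroup[T] {
    def plus(a: T, b: T): T
  }

  val intAddition: Semigroup[Int] = new Semigroup[Int] {
    def plus(a: Int, b: Int): Int = a + b
  }

  // Spot-check of f(a, f(b, c)) == f(f(a, b), c) for given values.
  def isAssociativeFor[T](sg: Semigroup[T], a: T, b: T, c: T): Boolean =
    sg.plus(a, sg.plus(b, c)) == sg.plus(sg.plus(a, b), c)

  val xs: List[Int] = List(1, 1, 1, 1, 1, 1, 1)

  // Reduce everything in one pass.
  val sequential: Int = xs.reduce((a, b) => intAddition.plus(a, b))

  // Reduce two halves independently (as two workers might), then combine.
  val (left, right) = xs.splitAt(3)
  val split: Int = intAddition.plus(
    left.reduce((a, b) => intAddition.plus(a, b)),
    right.reduce((a, b) => intAddition.plus(a, b))
  )
}
```

Both `sequential` and `split` come out as 7. Associativity is what lets the work be grouped in any way across machines.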

We want map + semigroup in one abstraction!

Getting the average

2 3 5 7 11 13 17

(1,2) (1,3) (1,5) (1,7) (1,11) (1,13) (1,17)

map(x => (1,x))

(2,5) (2,12) (2,24)

(4,17) (3,41)

(7,58)

reduce(Semigroup.plus)

(7,58) → 8.285

map { case (c, s) => s / c.toDouble }
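Putting the three averaging steps together, a minimal sketch with plain Scala collections might look like this. The names are chosen for this example, and the pairwise addition is written out by hand where the slide uses `Semigroup.plus`.

```scala
object AverageViaMapReduceMap {
  val primes: List[Int] = List(2, 3, 5, 7, 11, 13, 17)

  // Map: pair each value with a count of 1.
  val pairs: List[(Int, Int)] = primes.map(x => (1, x))

  // Reduce: add counts and sums componentwise (an associative operation).
  val (count, sum) = pairs.reduce { (a, b) => (a._1 + b._1, a._2 + b._2) }

  // Map: divide the sum by the count.
  val average: Double = sum / count.toDouble
}
```

The reduce step yields `(7, 58)`, and the final map turns that into 58 / 7, roughly 8.29.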

We really want map+semigroup+map in one abstraction!

trait Aggregator[In, Middle, Out] {
  def prepare(i: In): Middle
  def semigroup: Semigroup[Middle]
  def present(m: Middle): Out
}

https://github.com/twitter/algebird

How do we use this?
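One way to picture usage is the self-contained sketch below. It restates the trait with a simplified `Semigroup` and adds an `apply` method that runs prepare, reduce, and present over a non-empty in-memory collection. The `apply` shown here is an assumption made for illustration; Algebird's own classes are richer than this.

```scala
object AggregatorSketch {
  // Simplified stand-in for a semigroup.
  trait Semigroup[T] { def plus(a: T, b: T): T }

  trait Aggregator[In, Middle, Out] {
    def prepare(i: In): Middle
    def semigroup: Semigroup[Middle]
    def present(m: Middle): Out

    // Illustrative runner: map (prepare), reduce (semigroup), map (present).
    // Assumes `items` is non-empty, since a semigroup has no zero value.
    def apply(items: Iterable[In]): Out =
      present(items.map(prepare).reduce((a, b) => semigroup.plus(a, b)))
  }

  // The averaging example from the earlier slides as a single value.
  val average: Aggregator[Int, (Int, Int), Double] =
    new Aggregator[Int, (Int, Int), Double] {
      def prepare(i: Int): (Int, Int) = (1, i)
      val semigroup: Semigroup[(Int, Int)] = new Semigroup[(Int, Int)] {
        def plus(a: (Int, Int), b: (Int, Int)): (Int, Int) =
          (a._1 + b._1, a._2 + b._2)
      }
      def present(m: (Int, Int)): Double = m._2 / m._1.toDouble
    }

  val result: Double = average(List(2, 3, 5, 7, 11, 13, 17))
}
```

The whole query is now one value, `average`, that can be passed around and applied to any collection of integers.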

Not such a new idea. Scalding had a mapReduceMap function in the first release:

But why should we be excited?

map (prepare)

reduce (semigroup)

map (present)

“Does not compose” is the new “is a piece of crap”

paraphrasing Dan Rosen @mergeconflict

Aggregators Compose

Aggregator != 💩

composePrepare: function → map (prepare) → reduce (semigroup) → map (present)

Function + Aggregator = Aggregator
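A minimal sketch of how such a `composePrepare` could be written, using a simplified stand-in trait rather than Algebird's real class. The `sum` and `totalLength` aggregators are invented for this example.

```scala
object ComposePrepareSketch {
  trait Semigroup[T] { def plus(a: T, b: T): T }

  trait Aggregator[In, Middle, Out] { self =>
    def prepare(i: In): Middle
    def semigroup: Semigroup[Middle]
    def present(m: Middle): Out

    // Illustrative runner over a non-empty collection.
    def apply(items: Iterable[In]): Out =
      present(items.map(prepare).reduce((a, b) => semigroup.plus(a, b)))

    // Function + Aggregator = Aggregator: run `fn` before `prepare`.
    def composePrepare[In2](fn: In2 => In): Aggregator[In2, Middle, Out] =
      new Aggregator[In2, Middle, Out] {
        def prepare(i: In2): Middle = self.prepare(fn(i))
        def semigroup: Semigroup[Middle] = self.semigroup
        def present(m: Middle): Out = self.present(m)
      }
  }

  // Sums integers.
  val sum: Aggregator[Int, Int, Int] = new Aggregator[Int, Int, Int] {
    def prepare(i: Int): Int = i
    val semigroup: Semigroup[Int] = new Semigroup[Int] {
      def plus(a: Int, b: Int): Int = a + b
    }
    def present(m: Int): Int = m
  }

  // Total length of some strings: measure each string, then sum.
  val totalLength: Aggregator[String, Int, Int] =
    sum.composePrepare((s: String) => s.length)

  val result: Int = totalLength(List("scalding", "spark", "summingbird"))
}
```

Here `result` is 24 (8 + 5 + 11). The semigroup and present steps are reused untouched; only the input side changes.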

andThenPresent: map (prepare) → reduce (semigroup) → map (present) → function

Aggregator + Function = Aggregator
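The mirror-image operation can be sketched the same way. Again this uses a simplified stand-in trait written for this example, and the `size` and `sizeReport` aggregators are hypothetical.

```scala
object AndThenPresentSketch {
  trait Semigroup[T] { def plus(a: T, b: T): T }

  trait Aggregator[In, Middle, Out] { self =>
    def prepare(i: In): Middle
    def semigroup: Semigroup[Middle]
    def present(m: Middle): Out

    // Illustrative runner over a non-empty collection.
    def apply(items: Iterable[In]): Out =
      present(items.map(prepare).reduce((a, b) => semigroup.plus(a, b)))

    // Aggregator + Function = Aggregator: run `fn` after `present`.
    def andThenPresent[Out2](fn: Out => Out2): Aggregator[In, Middle, Out2] =
      new Aggregator[In, Middle, Out2] {
        def prepare(i: In): Middle = self.prepare(i)
        def semigroup: Semigroup[Middle] = self.semigroup
        def present(m: Middle): Out2 = fn(self.present(m))
      }
  }

  // Counts elements: every input becomes 1, and the ones are added up.
  val size: Aggregator[Int, Int, Int] = new Aggregator[Int, Int, Int] {
    def prepare(i: Int): Int = 1
    val semigroup: Semigroup[Int] = new Semigroup[Int] {
      def plus(a: Int, b: Int): Int = a + b
    }
    def present(m: Int): Int = m
  }

  // Same aggregation, different presentation of the result.
  val sizeReport: Aggregator[Int, Int, String] =
    size.andThenPresent(n => s"$n items")

  val result: String = sizeReport(List(2, 3, 5, 7, 11, 13, 17))
}
```

Here `result` is the string `"7 items"`. The function only touches the final output, so the distributed part of the computation is unchanged.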

Aggregator 1: map (prepare) → reduce (semigroup) → map (present)

Aggregator 2: map (prepare) → reduce (semigroup) → map (present)

Joined Aggregator: map (prepare) → reduce (semigroup) → map (present)

Aggregator * Aggregator = Aggregator
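Joining can be sketched by pairing up each of the three steps. The trait below is a simplified stand-in written for this example, and `fromReduce`, `min`, and `max` are helpers defined here for illustration.

```scala
object JoinSketch {
  trait Semigroup[T] { def plus(a: T, b: T): T }

  trait Aggregator[In, Middle, Out] { self =>
    def prepare(i: In): Middle
    def semigroup: Semigroup[Middle]
    def present(m: Middle): Out

    // Illustrative runner over a non-empty collection.
    def apply(items: Iterable[In]): Out =
      present(items.map(prepare).reduce((a, b) => semigroup.plus(a, b)))

    // Aggregator * Aggregator = Aggregator: run both in a single pass.
    def join[M2, O2](that: Aggregator[In, M2, O2]): Aggregator[In, (Middle, M2), (Out, O2)] =
      new Aggregator[In, (Middle, M2), (Out, O2)] {
        def prepare(i: In): (Middle, M2) = (self.prepare(i), that.prepare(i))
        val semigroup: Semigroup[(Middle, M2)] = new Semigroup[(Middle, M2)] {
          def plus(a: (Middle, M2), b: (Middle, M2)): (Middle, M2) =
            (self.semigroup.plus(a._1, b._1), that.semigroup.plus(a._2, b._2))
        }
        def present(m: (Middle, M2)): (Out, O2) =
          (self.present(m._1), that.present(m._2))
      }
  }

  // Helper for this sketch: an aggregator from a bare associative function.
  def fromReduce[T](f: (T, T) => T): Aggregator[T, T, T] =
    new Aggregator[T, T, T] {
      def prepare(i: T): T = i
      val semigroup: Semigroup[T] = new Semigroup[T] {
        def plus(a: T, b: T): T = f(a, b)
      }
      def present(m: T): T = m
    }

  val min: Aggregator[Int, Int, Int] = fromReduce[Int]((a, b) => math.min(a, b))
  val max: Aggregator[Int, Int, Int] = fromReduce[Int]((a, b) => math.max(a, b))

  val minAndMax: Aggregator[Int, (Int, Int), (Int, Int)] = min.join(max)

  val result: (Int, Int) = minAndMax(List(5, 2, 17, 3, 11))
}
```

Here `result` is `(2, 17)`, computed in one traversal of the data instead of two.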

Aggregators are Applicative Functors

Functor: has a map method: def map(t: A[T])(fn: T => U): A[U]

Applicative: has a join method: def join(t: A[T], u: A[U]): A[(T, U)]

Monad: has a flatMap method: def flatMap(t: A[T])(fn: T => A[U]): A[U]
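To suggest why the functor and applicative operations are already enough for practical work, the sketch below rebuilds the average from two simpler aggregators using only a join and a map over the output. The trait is a simplified stand-in written for this example; `andThenPresent` plays the role of `map`, and `intSum` is a hypothetical helper.

```scala
object ApplicativeSketch {
  trait Semigroup[T] { def plus(a: T, b: T): T }

  trait Aggregator[In, Middle, Out] { self =>
    def prepare(i: In): Middle
    def semigroup: Semigroup[Middle]
    def present(m: Middle): Out

    // Illustrative runner over a non-empty collection.
    def apply(items: Iterable[In]): Out =
      present(items.map(prepare).reduce((a, b) => semigroup.plus(a, b)))

    // Functor-style map over the output type.
    def andThenPresent[Out2](fn: Out => Out2): Aggregator[In, Middle, Out2] =
      new Aggregator[In, Middle, Out2] {
        def prepare(i: In): Middle = self.prepare(i)
        def semigroup: Semigroup[Middle] = self.semigroup
        def present(m: Middle): Out2 = fn(self.present(m))
      }

    // Applicative-style join of two aggregators over the same input.
    def join[M2, O2](that: Aggregator[In, M2, O2]): Aggregator[In, (Middle, M2), (Out, O2)] =
      new Aggregator[In, (Middle, M2), (Out, O2)] {
        def prepare(i: In): (Middle, M2) = (self.prepare(i), that.prepare(i))
        val semigroup: Semigroup[(Middle, M2)] = new Semigroup[(Middle, M2)] {
          def plus(a: (Middle, M2), b: (Middle, M2)): (Middle, M2) =
            (self.semigroup.plus(a._1, b._1), that.semigroup.plus(a._2, b._2))
        }
        def present(m: (Middle, M2)): (Out, O2) =
          (self.present(m._1), that.present(m._2))
      }
  }

  // Helper for this sketch: prepare each input to an Int, then add.
  def intSum[In](prep: In => Int): Aggregator[In, Int, Int] =
    new Aggregator[In, Int, Int] {
      def prepare(i: In): Int = prep(i)
      val semigroup: Semigroup[Int] = new Semigroup[Int] {
        def plus(a: Int, b: Int): Int = a + b
      }
      def present(m: Int): Int = m
    }

  val size: Aggregator[Int, Int, Int] = intSum[Int](_ => 1)
  val sum: Aggregator[Int, Int, Int] = intSum[Int](x => x)

  // join, then map: the average as a combination of two existing parts.
  val average: Aggregator[Int, (Int, Int), Double] =
    size.join(sum).andThenPresent { case (n, s) => s / n.toDouble }

  val result: Double = average(List(2, 3, 5, 7, 11, 13, 17))
}
```

The result is again 58 / 7, but this time no averaging-specific semigroup had to be written by hand.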

Let’s go to the REPL

http://bit.ly/AggregatingWithAlice

https://gist.github.com/johnynek/814fc1e77aad1d295bb7

Aggregators “just work” with Scala collections

Aggregators are built into Scalding

Aggregators are easy to use with Spark

Algebird with spark: https://github.com/twitter/algebird/pull/397

Key Points

1) Aggregators encapsulate very general query logic independent of how it is executed (in memory, Scalding, Spark, you name it)

2) Aggregators compose so you can define parts you use, and easily glue them together

3) Algebird has many advanced, well tested Aggregators: TopK, HyperLogLog, CountMinSketch, Mean, Stddev, …

Oscar Boykin @posco / oscar@twitter.com

Algebird has these aggregators and more:

https://github.com/twitter/algebird
