craig kerstiens - scalable uniques in postgres @ postgres open

Post on 23-Jan-2015

756 Views

Category:

Technology

5 Downloads

Preview:

Click to see full reader

DESCRIPTION

 

TRANSCRIPT

Scalable Uniques in Postgres -

Craig KerstiensHeroku Postgres

Postgresql-HLL

Truviso

• Extended Postgres to do streaming

• Various markets

• Ad space

• Wanted unique impressions

• Sort of wanted unique impressions

SELECT count(*)

Approx Top K

Compressed Bitmap

HyperLogLog

HyperLogLog• KMV - K minimum value

HyperLogLog• KMV - K minimum value

• Bit observable patterns

HyperLogLog• KMV - K minimum value

• Bit observable patterns

• Stochastic averaging

HyperLogLog• KMV - K minimum value

• Bit observable patterns

• Stochastic averaging

• Harmonic averaging

HyperLogLog• KMV - K minimum value

• Bit observable patterns

• Stochastic averaging

• Harmonic averaging

HyperLogLog• KMV - K minimum value

• Bit observable patterns

• Stochastic averaging

• Harmonic averaging

• Implemented by Aggregate Knowledge

HyperLogLog

Probabilistic uniques with small footprint

HyperLogLog

Probabilistic uniques with small footprintClose enough distinct with small footprint

Use cases

Use cases

• Semi distinct count

• Think pg_stat_statements

• Ad networks

• Web traffic

Use cases

• Semi distinct count

• Think pg_stat_statements

• Ad networks

• Web traffic

• With rollups/groupings

Digging in

CREATE  EXTENSION  hll;

   CREATE  TABLE  helloworld  (            id        integer,            set      hll    );

Digging in

CREATE  EXTENSION  hll;

   CREATE  TABLE  helloworld  (            id        integer,            set      hll    );

Inserting dataUPDATE  helloworld  SET  set  =  hll_add(set,  hll_hash_integer(12345))  WHERE  id  =  1;

UPDATE  helloworld  SET  set  =  hll_add(set,  hll_hash_text('hello  world'))  WHERE  id  =  1;

Real world

CREATE  TABLE  daily_uniques  (        date                        date  UNIQUE,        users                      hll);

Real world

INSERT  INTO  daily_uniques(date,  users)    SELECT  date,  hll_add_agg(hll_hash_integer(user_id))    FROM  users    GROUP  BY  1;

Real worldSELECT  

             EXTRACT(MONTH  FROM  date)  AS  month,                hll_cardinality(hll_union_agg(users))FROM  daily_uniquesWHERE  date  >=  '2012-­‐01-­‐01'  AND            date  <    '2013-­‐01-­‐01'GROUP  BY  1;

Real worldSELECT                EXTRACT(MONTH  FROM  date)  AS  month,                hll_cardinality(hll_union_agg(users))FROM  daily_uniquesWHERE  date  >=  '2012-­‐01-­‐01'  AND            date  <    '2013-­‐01-­‐01'GROUP  BY  1;

Good practices

Good practices

Good practices

• It uses update

Good practices

• It uses update

• Do as a batch in most cases

Good practices

• It uses update

• Do as a batch in most cases

• Tweak the config

Tuning Parameters

Tuning Parameters• log2m - log base 2 of registers

• Between 4 and 17

• Each 1 increase doubles storage

Tuning Parameters• log2m - log base 2 of registers

• Between 4 and 17

• Each 1 increase doubles storage

• regwidth - bits per register

Tuning Parameters• log2m - log base 2 of registers

• Between 4 and 17

• Each 1 increase doubles storage

• regwidth - bits per register

• expthresh - threshold for explicit vs sparse

Tuning Parameters• log2m - log base 2 of registers

• Between 4 and 17

• Each 1 increase doubles storage

• regwidth - bits per register

• expthresh - threshold for explicit vs sparse

• spareson - on/off for sparse

Is it better?

1280 bytesEstimate count of 10s of billions

Few percent error

Questions

top related