craig kerstiens - scalable uniques in postgres @ postgres open
DESCRIPTION
TRANSCRIPT
Scalable Uniques in Postgres -
Craig KerstiensHeroku Postgres
Postgresql-HLL
Truviso
• Extended Postgres to do streaming
• Various markets
• Ad space
• Wanted unique impressions
• Sort of wanted unique impressions
SELECT count(*)
Approx Top K
Compressed Bitmap
HyperLogLog
HyperLogLog• KMV - K minimum value
HyperLogLog• KMV - K minimum value
• Bit observable patterns
HyperLogLog• KMV - K minimum value
• Bit observable patterns
• Stochastic averaging
HyperLogLog• KMV - K minimum value
• Bit observable patterns
• Stochastic averaging
• Harmonic averaging
HyperLogLog• KMV - K minimum value
• Bit observable patterns
• Stochastic averaging
• Harmonic averaging
HyperLogLog• KMV - K minimum value
• Bit observable patterns
• Stochastic averaging
• Harmonic averaging
• Implemented by Aggregate Knowledge
HyperLogLog
Probabilistic uniques with small footprint
HyperLogLog
Probabilistic uniques with small footprintClose enough distinct with small footprint
Use cases
Use cases
• Semi distinct count
• Think pg_stat_statements
• Ad networks
• Web traffic
Use cases
• Semi distinct count
• Think pg_stat_statements
• Ad networks
• Web traffic
• With rollups/groupings
Digging in
CREATE EXTENSION hll;
CREATE TABLE helloworld ( id integer, set hll );
Digging in
CREATE EXTENSION hll;
CREATE TABLE helloworld ( id integer, set hll );
Inserting dataUPDATE helloworld SET set = hll_add(set, hll_hash_integer(12345)) WHERE id = 1;
UPDATE helloworld SET set = hll_add(set, hll_hash_text('hello world')) WHERE id = 1;
Real world
CREATE TABLE daily_uniques ( date date UNIQUE, users hll);
Real world
INSERT INTO daily_uniques(date, users) SELECT date, hll_add_agg(hll_hash_integer(user_id)) FROM users GROUP BY 1;
Real worldSELECT
EXTRACT(MONTH FROM date) AS month, hll_cardinality(hll_union_agg(users))FROM daily_uniquesWHERE date >= '2012-‐01-‐01' AND date < '2013-‐01-‐01'GROUP BY 1;
Real worldSELECT EXTRACT(MONTH FROM date) AS month, hll_cardinality(hll_union_agg(users))FROM daily_uniquesWHERE date >= '2012-‐01-‐01' AND date < '2013-‐01-‐01'GROUP BY 1;
Good practices
Good practices
Good practices
• It uses update
Good practices
• It uses update
• Do as a batch in most cases
Good practices
• It uses update
• Do as a batch in most cases
• Tweak the config
Tuning Parameters
Tuning Parameters• log2m - log base 2 of registers
• Between 4 and 17
• Each 1 increase doubles storage
Tuning Parameters• log2m - log base 2 of registers
• Between 4 and 17
• Each 1 increase doubles storage
• regwidth - bits per register
Tuning Parameters• log2m - log base 2 of registers
• Between 4 and 17
• Each 1 increase doubles storage
• regwidth - bits per register
• expthresh - threshold for explicit vs sparse
Tuning Parameters• log2m - log base 2 of registers
• Between 4 and 17
• Each 1 increase doubles storage
• regwidth - bits per register
• expthresh - threshold for explicit vs sparse
• spareson - on/off for sparse
Is it better?
1280 bytesEstimate count of 10s of billions
Few percent error
Resources• https://github.com/aggregateknowledge/
postgresql-hll
• http://blog.aggregateknowledge.com/2013/02/04/open-source-release-postgresql-hll/
• http://tapoueh.org/blog/2013/02/25-postgresql-hyperloglog
Questions