redshift and why the 'like' in 'posgressql like' matters

11
1 Metail's Redshift Experience and Why the 'Like' in 'Postgres Like' Is Important Gareth Rogers, Data Engineer

Upload: gareth-rogers

Post on 23-Feb-2017

41 views

Category:

Data & Analytics


0 download

TRANSCRIPT

Page 1: Redshift and Why the 'Like' In 'PosgresSQL Like' Matters

1

Metail's Redshift Experience and Why the 'Like' in 'Postgres Like' Is ImportantGareth Rogers, Data Engineer

Page 2: Redshift and Why the 'Like' In 'PosgresSQL Like' Matters

2

Metail lets you try on clothes online

Discover clothes on your body shape

Create, save outfits and share

Shop with confidence of size and fit

Page 3: Redshift and Why the 'Like' In 'PosgresSQL Like' Matters

3

Proven impact as validated by American business schools and A/B tests

‘‘

…customers who had access to the fitting tool are more likely to come back to the site, and this effect is statistically significant…

‘‘

…shows approximately a 5.1 percent reduction in returns compared to the control group…In other words, providing fit information reduces average fulfilment costs”

…sales for users with access to the tool were substantially higher overall - 22.32 percent larger

‘‘Source: “The Value of Fit Information in Online Retail: Evidence from a Randomized Field Experiment” by Prof Santiago Gallino  (Dartmouth College - Tuck School of Business) & Prof Antonio Moreno (Northwestern University) –Oct 21, 2015

DATA1000+ GARMENTS

POINTS3M

Page 4: Redshift and Why the 'Like' In 'PosgresSQL Like' Matters

4

Architecture

Comparing with a more modern flow:http://tech.metail.com/elastic-mapreduce-metail-aws-loft-london/

User DBDynamoDB

Page 5: Redshift and Why the 'Like' In 'PosgresSQL Like' Matters

5

Creating the Cluster

Page 6: Redshift and Why the 'Like' In 'PosgresSQL Like' Matters

6

Creating the Cluster

• Compute capacity vs storage capacity– Tight coupling of compute and storage– We load everything into Redshift so far >3TB of data– At 1GB per day compute cluster last 10 sprints, at 30GB per day not so long :S– Six node dc1.8xlarge cluster costs $957.60 per week on-demand pricing

Page 7: Redshift and Why the 'Like' In 'PosgresSQL Like' Matters

7

Creating the Cluster

Page 8: Redshift and Why the 'Like' In 'PosgresSQL Like' Matters

8

It’s Postgres Like – Connecting to the Server

• Postgres like system meant from day one there were mature tools and stack overflow help

• Redshift ecosystem now more mature and optimised tooling and help exists

• Redshift JDBC/ODBC is now recommended over PostgresSQL driver

Page 9: Redshift and Why the 'Like' In 'PosgresSQL Like' Matters

9

It’s Postgres Like – My First QueryWITH order_events AS ( SELECT collector_tstamp, event_id, ue_properties FROM events WHERE collector_tstamp >= '2015-09-20' AND collector_tstamp < '2015-10-02‘ AND event = 'unstruct' AND JSON_EXTRACT_PATH_TEXT(ue_properties,'data','data','name') = 'Order'),in_orders AS ( SELECT DATE(collector_tstamp) AS order_date, COUNT(event_id) AS orders, COUNT(DISTINCT event_id) AS orders_distinct FROM order_events WHERE ue_properties ILIKE '%\\\\\"bin\\\\\":\\\\\"in\\\\\",%' GROUP BY DATE(collector_tstamp) ORDER BY DATE (collector_tstamp)),out_orders AS ( SELECT DATE(collector_tstamp) AS order_date, COUNT(event_id) AS orders, COUNT(DISTINCT event_id) AS orders_distinct FROM order_events WHERE ue_properties ILIKE '%\\\\\"bin\\\\\":\\\\\"out\\\\\",%' GROUP BY DATE(collector_tstamp) ORDER BY DATE(collector_tstamp))SELECT bin_in.order_date, bin_in.orders AS bin_in_orders, bin_out.orders AS bin_out_ordersFROM in_orders AS bin_in INNER JOIN out_orders AS bin_out ON bin_in.order_date = bin_out.order_dateORDER BY bin_in.order_date;

Page 10: Redshift and Why the 'Like' In 'PosgresSQL Like' Matters

10

Not so Postgres Like – Schema Design

• For day-to-day querying even power users won’t notice the difference

• For the schema designers the differences matter and will bite you from the start

• Redshift = columnar; Postgres = row; Very different optimisation considerations

Page 11: Redshift and Why the 'Like' In 'PosgresSQL Like' Matters

11

Summary

• Redshift gives you all the usual AWS goodies

• Day-to-day you don’t care that Redshift is Postgres like

• When designing the schema forget about row databases, experiment with columnar stores