redshift and why the 'like' in 'posgressql like' matters

1

Metail's Redshift Experience and Why the 'Like' in 'Postgres Like' Is ImportantGareth Rogers, Data Engineer

2

Metail lets you try on clothes online

Discover clothes on your body shape

Create, save outfits and share

Shop with confidence of size and fit

3

Proven impact as validated by American business schools and A/B tests

‘‘

…customers who had access to the fitting tool are more likely to come back to the site, and this effect is statistically significant…

‘‘

…shows approximately a 5.1 percent reduction in returns compared to the control group…In other words, providing fit information reduces average fulfilment costs”

…sales for users with access to the tool were substantially higher overall - 22.32 percent larger

‘‘Source: “The Value of Fit Information in Online Retail: Evidence from a Randomized Field Experiment” by Prof Santiago Gallino (Dartmouth College - Tuck School of Business) & Prof Antonio Moreno (Northwestern University) –Oct 21, 2015

DATA1000+ GARMENTS

POINTS3M

http://poseidon01.ssrn.com/delivery.php?ID=675084119021001015102024112117066122024006056079005030120082090022113111009099072123124060121106033007109027002117105018067118107006090023002099027120107100099016118063082019111098095017126090122081116007112025022028099001088004088090098075002106021094&EXT=pdf






4

Architecture

Comparing with a more modern flow:http://tech.metail.com/elastic-mapreduce-metail-aws-loft-london/

User DBDynamoDB

5

Creating the Cluster

6


• Compute capacity vs storage capacity– Tight coupling of compute and storage– We load everything into Redshift so far >3TB of data– At 1GB per day compute cluster last 10 sprints, at 30GB per day not so long :S– Six node dc1.8xlarge cluster costs $957.60 per week on-demand pricing

7


8

It’s Postgres Like – Connecting to the Server

• Postgres like system meant from day one there were mature tools and stack overflow help

• Redshift ecosystem now more mature and optimised tooling and help exists

• Redshift JDBC/ODBC is now recommended over PostgresSQL driver

9

It’s Postgres Like – My First QueryWITH order_events AS ( SELECT collector_tstamp, event_id, ue_properties FROM events WHERE collector_tstamp >= '2015-09-20' AND collector_tstamp < '2015-10-02‘ AND event = 'unstruct' AND JSON_EXTRACT_PATH_TEXT(ue_properties,'data','data','name') = 'Order'),in_orders AS ( SELECT DATE(collector_tstamp) AS order_date, COUNT(event_id) AS orders, COUNT(DISTINCT event_id) AS orders_distinct FROM order_events WHERE ue_properties ILIKE '%\\\\\"bin\\\\\":\\\\\"in\\\\\",%' GROUP BY DATE(collector_tstamp) ORDER BY DATE (collector_tstamp)),out_orders AS ( SELECT DATE(collector_tstamp) AS order_date, COUNT(event_id) AS orders, COUNT(DISTINCT event_id) AS orders_distinct FROM order_events WHERE ue_properties ILIKE '%\\\\\"bin\\\\\":\\\\\"out\\\\\",%' GROUP BY DATE(collector_tstamp) ORDER BY DATE(collector_tstamp))SELECT bin_in.order_date, bin_in.orders AS bin_in_orders, bin_out.orders AS bin_out_ordersFROM in_orders AS bin_in INNER JOIN out_orders AS bin_out ON bin_in.order_date = bin_out.order_dateORDER BY bin_in.order_date;

10

Not so Postgres Like – Schema Design

• For day-to-day querying even power users won’t notice the difference

• For the schema designers the differences matter and will bite you from the start

• Redshift = columnar; Postgres = row; Very different optimisation considerations

11

Summary

• Redshift gives you all the usual AWS goodies

• Day-to-day you don’t care that Redshift is Postgres like

• When designing the schema forget about row databases, experiment with columnar stores

redshift and why the 'like' in 'posgressql like' matters

Data & Analytics