redshift and why the 'like' in 'posgressql like' matters
TRANSCRIPT
1
Metail's Redshift Experience and Why the 'Like' in 'Postgres Like' Is ImportantGareth Rogers, Data Engineer
2
Metail lets you try on clothes online
Discover clothes on your body shape
Create, save outfits and share
Shop with confidence of size and fit
3
Proven impact as validated by American business schools and A/B tests
‘‘
…customers who had access to the fitting tool are more likely to come back to the site, and this effect is statistically significant…
‘‘
…shows approximately a 5.1 percent reduction in returns compared to the control group…In other words, providing fit information reduces average fulfilment costs”
…sales for users with access to the tool were substantially higher overall - 22.32 percent larger
‘‘Source: “The Value of Fit Information in Online Retail: Evidence from a Randomized Field Experiment” by Prof Santiago Gallino (Dartmouth College - Tuck School of Business) & Prof Antonio Moreno (Northwestern University) –Oct 21, 2015
DATA1000+ GARMENTS
POINTS3M
4
Architecture
Comparing with a more modern flow:http://tech.metail.com/elastic-mapreduce-metail-aws-loft-london/
User DBDynamoDB
5
Creating the Cluster
6
Creating the Cluster
• Compute capacity vs storage capacity– Tight coupling of compute and storage– We load everything into Redshift so far >3TB of data– At 1GB per day compute cluster last 10 sprints, at 30GB per day not so long :S– Six node dc1.8xlarge cluster costs $957.60 per week on-demand pricing
7
Creating the Cluster
8
It’s Postgres Like – Connecting to the Server
• Postgres like system meant from day one there were mature tools and stack overflow help
• Redshift ecosystem now more mature and optimised tooling and help exists
• Redshift JDBC/ODBC is now recommended over PostgresSQL driver
9
It’s Postgres Like – My First QueryWITH order_events AS ( SELECT collector_tstamp, event_id, ue_properties FROM events WHERE collector_tstamp >= '2015-09-20' AND collector_tstamp < '2015-10-02‘ AND event = 'unstruct' AND JSON_EXTRACT_PATH_TEXT(ue_properties,'data','data','name') = 'Order'),in_orders AS ( SELECT DATE(collector_tstamp) AS order_date, COUNT(event_id) AS orders, COUNT(DISTINCT event_id) AS orders_distinct FROM order_events WHERE ue_properties ILIKE '%\\\\\"bin\\\\\":\\\\\"in\\\\\",%' GROUP BY DATE(collector_tstamp) ORDER BY DATE (collector_tstamp)),out_orders AS ( SELECT DATE(collector_tstamp) AS order_date, COUNT(event_id) AS orders, COUNT(DISTINCT event_id) AS orders_distinct FROM order_events WHERE ue_properties ILIKE '%\\\\\"bin\\\\\":\\\\\"out\\\\\",%' GROUP BY DATE(collector_tstamp) ORDER BY DATE(collector_tstamp))SELECT bin_in.order_date, bin_in.orders AS bin_in_orders, bin_out.orders AS bin_out_ordersFROM in_orders AS bin_in INNER JOIN out_orders AS bin_out ON bin_in.order_date = bin_out.order_dateORDER BY bin_in.order_date;
10
Not so Postgres Like – Schema Design
• For day-to-day querying even power users won’t notice the difference
• For the schema designers the differences matter and will bite you from the start
• Redshift = columnar; Postgres = row; Very different optimisation considerations
11
Summary
• Redshift gives you all the usual AWS goodies
• Day-to-day you don’t care that Redshift is Postgres like
• When designing the schema forget about row databases, experiment with columnar stores