a year of data science at metail

A Year Of Data Science at Metail

Matt McDonnell - Data Scientist

Business Context

Startup: “A group of people operating in an environment of uncertainty striving for a repeatable and scalable business model”

A scalable startup needs a Customer Factory

Figure adapted from ‘Scaling Lean’ by Ash Maurya https://leanstack.com/scaling-lean-book/

A look behind the curtain – what’s the data?

See Metail in action:

http://metail.myshopify.com?utm_source=DataInsightsNov2016

(Scary UTM code is there so I don’t have to spend the next week digging into ‘Who are these mysterious visitors?’)
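For anyone who has not met UTM codes before, here is a minimal sketch of how the tag in the link above could be read back out of a landing URL. The helper name `utm_source` is made up for this illustration; it only uses Python's standard `urllib.parse` and is not how Metail's pipeline actually does it.

```python
from urllib.parse import urlparse, parse_qs


def utm_source(url):
    """Return the utm_source query parameter of a URL, or None if absent."""
    query = parse_qs(urlparse(url).query)
    values = query.get("utm_source")
    return values[0] if values else None


landing_url = "http://metail.myshopify.com?utm_source=DataInsightsNov2016"
print(utm_source(landing_url))
```

Grouping visits by this value is what makes the "mysterious visitors" easy to identify later.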

Live Demo Starts Here!

Sheepish explanation of why it’s not working starts here

The road to Data Science

• Understand the data

• Learn the tools

• Build the analytics for business intelligence

• More sophisticated data analysis for deeper understanding

• Apply machine learning techniques

• Develop models for prediction and decision making

My experience prior to Metail

Careers

• Physics Postdoc: Oxford, Griffith

• Technical Consultant: MathWorks

• Quant Developer: Fidelity Worldwide Investment

• Quant Analyst: Fidelity Worldwide Investment

Tools used:

(plus some Java, C#, Excel and VBA when I had to)

Understanding the data and tools

My experience since joining Metail

Lots of event stream data

Many AWS components

Outputs:
- Business Intelligence
- Bespoke Analysis
- Productionised Science

Tools to learn

Tools we used a year ago

• R for analysis and science
  • dplyr, tidyr, ggplot

• Looker for some of the analysis

Tools we use now

• Python
  • pandas, SQLAlchemy, boto3, seaborn

• Still some R
  • dplyr, tidyr, ggplot

• Looker for most of day to day analysis

• Swagger

• AWS stack

Data Analytics

Business intelligence

• How well is the customer factory working? (KPIs)

• What about if we do this? (A/B Tests)

• How’s our retention? (Cohort analysis)

• How efficiently are we digitising garments? (Process monitoring)

• How are we growing?

To answer this we need …

LOTS AND LOTS OF SQL! (yay.)

Most of it embedded in Looker LookML (basically YAML) (yay - again.)
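To make the cohort analysis question concrete, here is a rough pandas sketch of a monthly retention table. The slides say the real work is SQL embedded in LookML; pandas is swapped in here purely for illustration, and the visit log, column names and user ids are all hypothetical.

```python
import pandas as pd

# Hypothetical visit log: one row per user visit.
visits = pd.DataFrame({
    "user_id": ["a", "a", "b", "b", "c", "c", "d"],
    "visit_date": pd.to_datetime([
        "2016-01-05", "2016-02-10", "2016-01-20", "2016-03-02",
        "2016-02-03", "2016-02-25", "2016-02-14",
    ]),
})

# A user's cohort is the month of their first visit.
first = visits.groupby("user_id")["visit_date"].transform("min")
visits["cohort"] = first.dt.strftime("%Y-%m")

# Whole months elapsed between the first visit and this visit.
visits["months_since_first"] = (
    (visits["visit_date"].dt.year - first.dt.year) * 12
    + (visits["visit_date"].dt.month - first.dt.month)
)

# Distinct active users per cohort and month offset.
active = (
    visits.groupby(["cohort", "months_since_first"])["user_id"]
    .nunique()
    .unstack(fill_value=0)
)

# Retention: share of each cohort still active N months after first visit.
retention = active.div(active[0], axis=0)
print(retention)
```

Each row is a cohort and each column is a month offset, so column 0 is always 1.0 and the later columns show how quickly each cohort drops off.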

Data Analytics

Raw Events → Engagement States → Analytics Model

(Looker demo goes here if time allows)

Data Science

Exploring Digitised Garments

Event Data

    {
      "schema": "iglu:com.snowplowanalytics.snowplow\/unstruct_event\/jsonschema\/1-0-0",
      "data": {
        "schema": "",
        "data": {
          "name": "GarmentCoverage",
          "data": {
            "page": {
              "garments": 24,
              "garmentsWithCtas": 14,
              "scrollPosY": 201,
              "load": {
                "isInitiator": false,
                "elapsedTimeMs": 1424
              }
            },
            "batch": {
              "garments": 12,
              "garmentsWithCtas": 7,
              "ctas": [
                {"sku": "32536", "x": 0.2721021611002, "y": 1.6311844077961},
                {"sku": "32544", "x": 0.51768172888016, "y": 1.6311844077961},
                {"sku": "32545", "x": 0.51768172888016, "y": 1.0134932533733},
                {"sku": "32548", "x": 0.51768172888016, "y": 0.39580209895052},
                {"sku": "53282", "x": 0.76326129666012, "y": 0.39580209895052},
                {"sku": "53337", "x": 0.026522593320236, "y": 1.0134932533733},
                {"sku": "134499", "x": 0.2721021611002, "y": 0.39580209895052}
              ]
            }
          }
        }
      }
    }

GarmentCoverage event

"scrollPosY": 201,

"garmentsWithCtas": 7,

{"sku": "32544","x": 0.51768172888016,"y": 1.6311844077961},
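As a sketch of how the highlighted fields might be pulled out for analysis, the snippet below flattens one GarmentCoverage event into one row per garment call-to-action. The function name and output column names are invented for this example, the "load" block is dropped for brevity, and the real pipeline (Snowplow into Redshift) will differ.

```python
import json

# Copy of the GarmentCoverage event shown above ("load" omitted for brevity).
RAW_EVENT = r"""
{
  "schema": "iglu:com.snowplowanalytics.snowplow\/unstruct_event\/jsonschema\/1-0-0",
  "data": {
    "schema": "",
    "data": {
      "name": "GarmentCoverage",
      "data": {
        "page": {"garments": 24, "garmentsWithCtas": 14, "scrollPosY": 201},
        "batch": {
          "garments": 12,
          "garmentsWithCtas": 7,
          "ctas": [
            {"sku": "32536", "x": 0.2721021611002, "y": 1.6311844077961},
            {"sku": "32544", "x": 0.51768172888016, "y": 1.6311844077961},
            {"sku": "32545", "x": 0.51768172888016, "y": 1.0134932533733},
            {"sku": "32548", "x": 0.51768172888016, "y": 0.39580209895052},
            {"sku": "53282", "x": 0.76326129666012, "y": 0.39580209895052},
            {"sku": "53337", "x": 0.026522593320236, "y": 1.0134932533733},
            {"sku": "134499", "x": 0.2721021611002, "y": 0.39580209895052}
          ]
        }
      }
    }
  }
}
"""


def flatten_garment_coverage(raw):
    """Return one dict per CTA, tagged with the page scroll position."""
    payload = json.loads(raw)["data"]["data"]
    if payload["name"] != "GarmentCoverage":
        return []  # not the event type we are interested in
    body = payload["data"]
    scroll_pos_y = body["page"]["scrollPosY"]
    return [
        {"sku": cta["sku"], "x": cta["x"], "y": cta["y"],
         "scroll_pos_y": scroll_pos_y}
        for cta in body["batch"]["ctas"]
    ]


rows = flatten_garment_coverage(RAW_EVENT)
print(len(rows), rows[1])
```

One row per (event, sku) is a convenient shape for the position and visitor-count aggregations on the following slides.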

Spread of digitised garments

• Look at positions of all digitised garments for a given category.

• page is in units of #scrolls (based on browser height on the user’s device)

• Digitised garments on /women-dress and /women-tops-tees are more spread out than garments on /women-jeans

Views by garment position

• Aggregate visitors who see garment ‘X’ in a given category on a given date.

• Scale these visitor counts by the maximum #visitors for a garment on that date in that category.

• In the /women-dress category:
  • Digitised garments are spread between 0 and 120 page scrolls, with median ~40

• Long “tail” of digitised garments which get far fewer visits.

• The average digitised garment typically gets 20% as many visitors as the most popular garment in that category (on a given day).

Date        url_path      sku     Users  Page  scaled_count
2016-01-01  /women-dress  101742  699    5.0   0.743617
2016-01-01  /women-dress  101743  700    4.0   0.744681
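The scaling step described above can be sketched in pandas as a group-wise maximum. The first two rows reuse the Users values from the table; the third row (sku "999999" with 940 users) is hypothetical, chosen because the table's scaled counts are consistent with a daily maximum of 940 (699 / 940 ≈ 0.743617). Column names are assumptions.

```python
import pandas as pd

views = pd.DataFrame({
    "date": ["2016-01-01"] * 3,
    "url_path": ["/women-dress"] * 3,
    "sku": ["101742", "101743", "999999"],  # last sku is hypothetical
    "users": [699, 700, 940],               # 940 is the assumed daily maximum
})

# Visitor count of the most-viewed garment per (date, category).
max_users = views.groupby(["date", "url_path"])["users"].transform("max")

# Scale each garment's visitors by that maximum, giving values in (0, 1].
views["scaled_count"] = views["users"] / max_users
print(views)
```

Using `transform("max")` keeps the result aligned row by row with the original frame, so no merge is needed.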

Views by category

• Look at positions of all digitised garments for a given category.

• ‘page’ is in units of #scrolls (based on browser height on the user’s device)

• Digitised garments on /women-dress and /women-tops-tees are more spread out than digitised garments on /women-jeans. This could also be because there are more digitised garments in /women-tops-tees.

• There are some “hotspots” of digitised garment positions, e.g. ~page 100 for /women-tops-tees. Unfortunately, they are quite far down the category page, where visitor counts are typically around 10-20% of those for the most popular garments (which sit closest to the top of the category page).

/women-tops-tees /women-jeans /women-dress

Views as time series

• Digitised garments on /women-dress over time

• The “hotspot” moves further down the page, most noticeably in the last 2 weeks.

Data Science

Exploring User Body Shapes

BMI Quantiles

BMI: 17.6, Height: 160 cm, Weight: 45 kg

BMI: 19.9, Height: 157 cm, Weight: 49 kg
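For reference, the BMI values on this slide follow from the standard definition (weight in kg divided by height in metres squared). A small sketch, with the function name chosen for this example:

```python
def bmi(weight_kg, height_cm):
    """Body mass index: weight (kg) / height (m) squared."""
    height_m = height_cm / 100.0
    return weight_kg / height_m ** 2


# The two example users from the slide, as (weight_kg, height_cm).
for weight_kg, height_cm in [(45, 160), (49, 157)]:
    print(f"{height_cm} cm, {weight_kg} kg -> BMI {bmi(weight_kg, height_cm):.1f}")
```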




Our Shape Segmentation

Spoon, Triangle, Bottom Hourglass, Rectangle, Hourglass, Top Hourglass, Inverted Triangle

Adapting the shape segmentation rules of the Lee et al. (2007) paper used by FFIT

Users Segmented by Shape

[Scatter plot axes: Hips – Waist (cm) and Bust – Waist (cm)]

Shape Distribution and Popular Garments

Engagement by Shape

% of users trying on at least two garments on personalised MeModel

1SD

Data Science

Learning User Behaviour

Understanding Users

Event stream summary over a month

Visits by day of month

All users

Distinct types of users

Machine Learning Techniques

Data Driven User Segmentation

Distinct types of users

Use Machine Learning techniques to characterise which features define users in each cluster

Identify clusters: engaged and converted users

Cluster Labels into Redshift / Looker
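The slides do not say which clustering algorithm was used, so the following is only an illustrative sketch using plain k-means (Lloyd's algorithm) on two made-up per-user features. In practice a library implementation would be the natural choice; this version is written out in standard-library Python so the two steps are visible. All data and names are hypothetical.

```python
import math


def kmeans(points, centroids, n_iter=10):
    """Plain Lloyd's algorithm. Returns (labels, centroids)."""
    centroids = [tuple(c) for c in centroids]
    labels = [0] * len(points)
    for _ in range(n_iter):
        # Assignment step: each point joins its nearest centroid.
        labels = [
            min(range(len(centroids)), key=lambda k: math.dist(p, centroids[k]))
            for p in points
        ]
        # Update step: each centroid moves to the mean of its members.
        for k in range(len(centroids)):
            members = [p for p, lab in zip(points, labels) if lab == k]
            if members:  # an empty cluster keeps its previous centroid
                centroids[k] = tuple(sum(dim) / len(members) for dim in zip(*members))
    return labels, centroids


# Hypothetical features per user: (visits in month, garments tried on).
users = [(1, 0), (2, 0), (1, 1), (8, 6), (9, 7), (10, 5)]

# Initial centroids fixed by hand so the run is deterministic.
labels, centroids = kmeans(users, centroids=[users[0], users[3]])
print(labels, centroids)
```

The resulting labels are the kind of per-user value that could then be written back to a warehouse table and joined into dashboards.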

[Figure labels: Acquisition Rate, RPV, Seen Size Advice Rate, Acquisition, Retention Reuse, Retention Revisit, Deep Funnel, Revenue, Try-ons (any model). Group sizes: 674 users, 595 users, 541 users, 721 users, 312 users.]

A first look at the clusters

Future plans: more MODELLING!

Some possibilities:

• Use engagement clustering to create labels for supervised learning

• Engagement prediction using trained machine learning

• Apply Probabilistic Graphical Modelling techniques
  • (I quite like Daphne Koller’s Coursera course and book https://www.coursera.org/learn/probabilistic-graphical-models/home/welcome )

• More Bayesian reasoning

• … (any suggestions?)

Time permitting, SAMIAM (http://reasoning.cs.ucla.edu/samiam/) demo goes here


Bayesian inference – what are the variables?

(Disclaimer: this is me playing around with SAMIAM for 15 minutes and not an actual model)

Bayesian inference – how are things related?


Bayesian inference – what can we infer?
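As a toy version of the kind of question a Bayesian network answers, here is Bayes' rule for just two binary variables. The variable names and every probability below are invented for illustration; they are not Metail figures and not taken from the SAMIAM model.

```python
def posterior(prior, likelihood_if_true, likelihood_if_false):
    """P(H | E) from P(H), P(E | H) and P(E | not H) via Bayes' rule."""
    evidence = prior * likelihood_if_true + (1 - prior) * likelihood_if_false
    return prior * likelihood_if_true / evidence


# Hypothetical numbers: H = "user is engaged", E = "user converted".
p_engaged = 0.3
p_convert_given_engaged = 0.2
p_convert_given_not_engaged = 0.02

p = posterior(p_engaged, p_convert_given_engaged, p_convert_given_not_engaged)
print(f"P(engaged | converted) = {p:.3f}")
```

A graphical model generalises this to many variables, with the graph structure saying which conditional probabilities are needed.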


That’s all folks!

Questions?
