Replication in Data Science - A Dance Between Data Science & Machine Learning Strata 2016

68
Pinterest

Upload: june-andrews

Post on 21-Apr-2017


Category:

Data & Analytics



TRANSCRIPT

Page 1: Replication in Data Science - A Dance Between Data Science & Machine Learning Strata 2016

Pinterest

Page 2: Replication in Data Science - A Dance Between Data Science & Machine Learning Strata 2016

Iterative supervised clustering
A dance between data science and machine learning

Dr June Andrews — September 2016

Page 3: Replication in Data Science - A Dance Between Data Science & Machine Learning Strata 2016

Agenda

1. Explore Pinterest’s content
2. Question our understanding
3. Inspire the future

Design system

Page 5: Replication in Data Science - A Dance Between Data Science & Machine Learning Strata 2016

Clothing Cooking Decorating Beauty Teaching Carpentry Cars Animated GIFs

Electronics Stereos Fashion Sewing Articles Painting Photography Nature

Cute cats Tattoos Hair Microscopy TV shows Apps Self help Motorcycles

Page 6: Replication in Data Science - A Dance Between Data Science & Machine Learning Strata 2016
Page 7: Replication in Data Science - A Dance Between Data Science & Machine Learning Strata 2016

Chairs

Page 8: Replication in Data Science - A Dance Between Data Science & Machine Learning Strata 2016

Fashion

Travel

Garden

Chairs

Food

Page 9: Replication in Data Science - A Dance Between Data Science & Machine Learning Strata 2016

Links are behind every Pin
How are users engaging with link domains?

Page 10: Replication in Data Science - A Dance Between Data Science & Machine Learning Strata 2016

Tool: Cluster algorithms (SVM, K-Means, Spectral)
Pros: • Considers all users • Accurate
Cons: • Tough to communicate • Definitions change over time

Tool: User experience studies
Pros: • Deep knowledge • Captures the immeasurable
Cons: • Costly • Considers few users

Tool: Domain expert hypothesis
Pros: • Human interpretable
Cons: • Inaccurate

Page 12: Replication in Data Science - A Dance Between Data Science & Machine Learning Strata 2016

Current cluster analysis

1. Clean and load data into favorite clustering algorithm
2. Build visualizations on top of clusters
3. Fiddle with parameters in clustering algorithm
4. Add human labels to each cluster
5. Share human interpretation of clusters

Page 13: Replication in Data Science - A Dance Between Data Science & Machine Learning Strata 2016

Current cluster analysis

Fatal flaw

Page 14: Replication in Data Science - A Dance Between Data Science & Machine Learning Strata 2016

Human in the loop computing
Community membership identification from small seed sets (Kloumann & Kleinberg)

[Figure labels: T, Domain Expert, Favorite Clustering Algorithm]

Page 15: Replication in Data Science - A Dance Between Data Science & Machine Learning Strata 2016

Human in the loop computing
When machine confidence dips, engage with domain expert

[Figure labels: T, Domain Expert, Favorite Clustering Algorithm, ?, Unsure, Confident]

Page 17: Replication in Data Science - A Dance Between Data Science & Machine Learning Strata 2016

Human in the loop computing
Domain expert determines when labeling is done

[Figure labels: T, Domain Expert, Favorite Clustering Algorithm, "That's all!"]
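The "unsure versus confident" split on the slides above can be sketched as follows. This is a minimal illustration, not the method used in the talk: it assumes a centroid-based clusterer and treats a point as "unsure" when its nearest and second-nearest centroids are almost equally far away. The function name `split_by_confidence` and the `margin_threshold` parameter are hypothetical.

```python
import numpy as np


def split_by_confidence(points, centroids, margin_threshold=0.2):
    """Return (confident_idx, unsure_idx) for points given cluster centroids.

    Assumption for illustration: confidence is the relative margin between
    the nearest and second-nearest centroid (0 = tie, 1 = on a centroid).
    Requires at least two centroids.
    """
    points = np.asarray(points, dtype=float)
    centroids = np.asarray(centroids, dtype=float)
    # Pairwise Euclidean distances, shape (n_points, n_centroids).
    dists = np.linalg.norm(points[:, None, :] - centroids[None, :, :], axis=2)
    nearest_two = np.sort(dists, axis=1)[:, :2]
    margin = 1.0 - nearest_two[:, 0] / np.maximum(nearest_two[:, 1], 1e-12)
    unsure = np.flatnonzero(margin < margin_threshold)
    confident = np.flatnonzero(margin >= margin_threshold)
    return confident, unsure


# The third point sits halfway between both centroids, so it is routed
# to the domain expert instead of being labeled by the machine.
confident_idx, unsure_idx = split_by_confidence(
    points=[[0.0, 0.0], [10.0, 0.0], [5.0, 0.1]],
    centroids=[[0.0, 0.0], [10.0, 0.0]],
)
```

Only the items in `unsure_idx` would be shown to the domain expert; the rest keep their machine-assigned cluster.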

Page 18: Replication in Data Science - A Dance Between Data Science & Machine Learning Strata 2016

Current analysis methodology

1. Clean and load data into favorite clustering algorithm
2. Build visualizations on top of clusters
3. Fiddle with parameters in clustering algorithm
4. Add human labels to each cluster
5. Share human interpretation of clusters

Page 19: Replication in Data Science - A Dance Between Data Science & Machine Learning Strata 2016

Human in the loop computing
Stage 1: Machine clusters data

Favorite Clustering Algorithm

Page 20: Replication in Data Science - A Dance Between Data Science & Machine Learning Strata 2016

Human in the loop computing
Stage 2: Domain expert creates 1 human interpretable cluster

Domain Expert

Page 21: Replication in Data Science - A Dance Between Data Science & Machine Learning Strata 2016

Human in the loop computing
Stage 3: Remove human labeled clusters and iterate

Favorite Clustering Algorithm

Domain Expert
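The three stages above can be sketched as one loop. This is a hedged outline under stated assumptions rather than the notebook from the talk: `cluster_fn` stands in for any clustering algorithm, `ask_expert` stands in for the human step and returns a hypothetical `(name, predicate)` rule, and the expert ends the process by returning `None`.

```python
def iterative_supervised_clustering(items, cluster_fn, ask_expert, max_iterations=20):
    """Alternate machine clustering with one expert-defined cluster per round.

    cluster_fn(remaining) -> machine clusters shown to the expert (stage 1).
    ask_expert(remaining, clusters) -> (name, predicate), or None when the
    expert decides labeling is done (stage 2).
    """
    remaining = list(items)
    labeled = {}
    for _ in range(max_iterations):
        if not remaining:
            break
        clusters = cluster_fn(remaining)
        rule = ask_expert(remaining, clusters)
        if rule is None:
            break
        name, predicate = rule
        labeled.setdefault(name, []).extend(x for x in remaining if predicate(x))
        # Stage 3: drop everything the expert's rule claimed, then iterate.
        remaining = [x for x in remaining if not predicate(x)]
    return labeled, remaining


# Toy run: integers stand in for link domains, and a scripted "expert"
# replays two rules before declaring the labeling finished.
scripted_rules = iter([
    ("small", lambda x: x < 3),
    ("even", lambda x: x % 2 == 0),
    None,
])
labeled, remaining = iterative_supervised_clustering(
    items=range(10),
    cluster_fn=lambda xs: [x % 2 for x in xs],  # placeholder for a real clusterer
    ask_expert=lambda xs, clusters: next(scripted_rules),
)
```

Because each expert rule is an explicit predicate, every human-labeled cluster stays reproducible on new data, which is the point of removing it before the machine clusters again.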

Page 22: Replication in Data Science - A Dance Between Data Science & Machine Learning Strata 2016

How are users engaging with link domains?

• For a sample set of link domains we’re interested in:
  • All Pin creates in their first year on Pinterest
  • All repins in their first year on Pinterest
• 100k link domains sampled total

Links are behind every Pin
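A per-domain record for this sample might look like the sketch below. The attribute names `n_pin_creates`, `n_repins`, `monthly_pin_creates` and `monthly_repins` are the ones the deck's detection functions read; everything else (the class layout, `build_engagement`, the event tuples, the 12-month window) is an assumption made for illustration.

```python
from dataclasses import dataclass, field
from typing import Iterable, List, Tuple


@dataclass
class DomainEngagement:
    """First-year engagement for one link domain (hypothetical layout)."""
    domain: str
    monthly_pin_creates: List[int] = field(default_factory=lambda: [0] * 12)
    monthly_repins: List[int] = field(default_factory=lambda: [0] * 12)

    @property
    def n_pin_creates(self) -> int:
        return sum(self.monthly_pin_creates)

    @property
    def n_repins(self) -> int:
        return sum(self.monthly_repins)


def build_engagement(domain: str, events: Iterable[Tuple[int, str]]) -> DomainEngagement:
    """Aggregate (month_index, kind) events; kind is "pin_create" or "repin"."""
    engagement = DomainEngagement(domain)
    for month, kind in events:
        if not 0 <= month < 12:
            continue  # outside the domain's first year on Pinterest
        if kind == "pin_create":
            engagement.monthly_pin_creates[month] += 1
        elif kind == "repin":
            engagement.monthly_repins[month] += 1
    return engagement


example = build_engagement(
    "example.com",
    [(0, "pin_create"), (0, "repin"), (1, "repin"), (12, "repin")],
)
```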

Page 23: Replication in Data Science - A Dance Between Data Science & Machine Learning Strata 2016

Python Notebook

Page 24: Replication in Data Science - A Dance Between Data Science & Machine Learning Strata 2016

Provides guided iteration

Python Notebook

Page 25: Replication in Data Science - A Dance Between Data Science & Machine Learning Strata 2016

Sample visualization for each cluster

Python Notebook

[Figure: sample cluster plot. Panels: Pin creates, Repins. Axis labels: Few, Many]

Page 26: Replication in Data Science - A Dance Between Data Science & Machine Learning Strata 2016

Iteration 1

Title Dark content

Description Fewer than 2 Pins a week on average

Examples Noisy low quality content

Page 27: Replication in Data Science - A Dance Between Data Science & Machine Learning Strata 2016

Iteration 2
42% of domains left

[Figure: Cluster 1, Cluster 2, Cluster 3. Panels per cluster: Pin creates, Repins. Axis labels: Few, Many, Few, Some, Few, Many; all axes start at 0]

Page 28: Replication in Data Science - A Dance Between Data Science & Machine Learning Strata 2016

Iteration 2
Pinterest specials

Description Domains with few Pins, but these Pins thrive in the Pinterest ecosystem

Calculation

    def detect_pinterest_specials(domain_engagement):
        # Repins per Pin create; max() guards against division by zero.
        ratio = domain_engagement.n_repins / max(1.0, float(domain_engagement.n_pin_creates))
        # X and Y are thresholds that the slide leaves unspecified.
        return domain_engagement.n_pin_creates <= X and ratio >= Y

Examples Fashion and impulse sites

[Figure: Pinterest specials. Panels: Pin creates, Repins. Axis labels: Few, Many, 0]
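To make the Pinterest specials rule concrete, here is a small runnable variant. The thresholds are passed in as parameters because the slide leaves X and Y unspecified; the values 20 and 10.0, the parameter names, and the `DomainEngagement` namedtuple are all hypothetical and chosen only for illustration.

```python
from collections import namedtuple

# Hypothetical stand-in for the per-domain record used on the slide.
DomainEngagement = namedtuple("DomainEngagement", ["n_pin_creates", "n_repins"])


def detect_pinterest_specials(domain_engagement, max_pin_creates, min_repin_ratio):
    # Repins per Pin create; max() avoids dividing by zero for domains
    # that were only ever repinned.
    ratio = domain_engagement.n_repins / max(1.0, float(domain_engagement.n_pin_creates))
    return domain_engagement.n_pin_creates <= max_pin_creates and ratio >= min_repin_ratio


# Illustrative thresholds only; the deck does not give values for X and Y.
niche = DomainEngagement(n_pin_creates=5, n_repins=400)
busy = DomainEngagement(n_pin_creates=900, n_repins=1200)
repin_only = DomainEngagement(n_pin_creates=0, n_repins=50)

is_niche_special = detect_pinterest_specials(niche, max_pin_creates=20, min_repin_ratio=10.0)
is_busy_special = detect_pinterest_specials(busy, max_pin_creates=20, min_repin_ratio=10.0)
is_repin_only_special = detect_pinterest_specials(repin_only, max_pin_creates=20, min_repin_ratio=10.0)
```

With these made-up thresholds the niche domain qualifies (5 Pin creates, 80 repins per Pin create), while the busy domain fails the Pin-create cap.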

Page 29: Replication in Data Science - A Dance Between Data Science & Machine Learning Strata 2016

Iteration 3
33% of domains left

[Figure: Cluster 1, Cluster 2, Cluster 3. Panels per cluster: Pin creates, Repins. Axis labels: Few, Few, Few, Some, Few, Many; all axes start at 0]

Page 30: Replication in Data Science - A Dance Between Data Science & Machine Learning Strata 2016

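Iteration 3
Steady growth

Description Active Pin creates and steady growth throughout the year

Calculation

    import numpy as np

    def detect_steady_growth(domain_engagement):
        # growth_rate is the slope of the least-squares line through monthly repins.
        (growth_rate, intercept) = np.polyfit(
            range(len(domain_engagement.monthly_repins)),
            domain_engagement.monthly_repins, 1)
        # X and Y are thresholds that the slide leaves unspecified;
        # months_pins_created is used on the slide without being defined.
        return months_pins_created >= X and growth_rate >= Y

Examples Recipe and DIY sites

[Figure: Steady growth. Panels: Pin creates, Repins. Axis labels: Some, Many, 0]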
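The growth test above rests on a degree-1 `np.polyfit`, which returns the slope and intercept of a least-squares line. A short worked example on made-up monthly counts shows what `growth_rate` measures; the helper name `monthly_growth_rate` and both series are hypothetical.

```python
import numpy as np


def monthly_growth_rate(monthly_counts):
    """Slope of the least-squares line through a monthly series."""
    slope, _intercept = np.polyfit(range(len(monthly_counts)), monthly_counts, 1)
    return slope


# Made-up series: one gains 2 repins per month, the other stays flat.
growing = [5 + 2 * month for month in range(12)]
flat = [40] * 12

growing_rate = monthly_growth_rate(growing)
flat_rate = monthly_growth_rate(flat)
```

A domain passes a slope threshold of Y only if its fitted line rises by at least Y repins per month, regardless of how noisy the individual months are.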

Page 31: Replication in Data Science - A Dance Between Data Science & Machine Learning Strata 2016

Iteration 4
25% of domains left

[Figure: Cluster 1, Cluster 2, Cluster 3. Panels per cluster: Pin creates, Repins. Axis labels: Few, Some, Many, Some, Few, Some; all axes start at 0]

Page 32: Replication in Data Science - A Dance Between Data Science & Machine Learning Strata 2016

Iteration 4
Slow growth

Description Similar to steady growth, but not as fast

Calculation

    import numpy as np

    def detect_steady_growth(domain_engagement):
        # Same test as in iteration 3; the slide reuses the function name.
        (growth_rate, intercept) = np.polyfit(
            range(len(domain_engagement.monthly_repins)),
            domain_engagement.monthly_repins, 1)
        # X and Y are thresholds that the slide leaves unspecified;
        # months_pins_created is used on the slide without being defined.
        return months_pins_created >= X and growth_rate >= Y

Examples Little lower quality recipe and DIY sites

[Figure: Slow growth. Panels: Pin creates, Repins. Axis labels: Few, Many, 0]

Page 33: Replication in Data Science - A Dance Between Data Science & Machine Learning Strata 2016

Iteration 5
Churning

Description Slowly fade through the year

Calculation

    import numpy as np

    def detect_churning(domain_engagement):
        # Fit trend lines from the third month onward ([2:] skips the first two months).
        (repin_growth, intercept) = np.polyfit(
            range(len(domain_engagement.monthly_repins) - 2),
            domain_engagement.monthly_repins[2:], 1)
        (pin_create_growth, intercept) = np.polyfit(
            range(len(domain_engagement.monthly_repins) - 2),
            domain_engagement.monthly_pin_creates[2:], 1)
        # Churning: both repins and Pin creates trend downward.
        return repin_growth < 0 and pin_create_growth < 0

Examples Fashion sale and click bait sites

[Figure: Churning. Panels: Pin creates, Repins. Axis labels: Few, Many, 0]

Page 34: Replication in Data Science - A Dance Between Data Science & Machine Learning Strata 2016

Iteration 6
Yearly

Description Slowly fade through the year

Calculation

    import numpy as np

    def detect_churning(domain_engagement):
        (repin_growth, intercept) = np.polyfit(
            range(len(domain_engagement.monthly_repins) - 2),
            domain_engagement.monthly_repins[2:], 1)
        (pin_create_growth, intercept) = np.polyfit(
            range(len(domain_engagement.monthly_repins) - 2),
            domain_engagement.monthly_pin_creates[2:], 1)
        return repin_growth < 0 and pin_create_growth < 0

Examples Seasonal fashion, such as snow boots

[Figure: Yearly. Panels: Pin creates, Repins. Axis labels: Few, Many, 0]

Page 35: Replication in Data Science - A Dance Between Data Science & Machine Learning Strata 2016

Iteration 7
Late bloomer

Description Peak mid year

Calculation

    import numpy as np

    def detect_late_bloomer(domain_engagement):
        # Fit a parabola to combined monthly repins and Pin creates,
        # skipping the first two months.
        (concavity, pin_growth, intercept) = np.polyfit(
            range(len(domain_engagement.monthly_repins) - 2),
            [r + p for (r, p) in zip(domain_engagement.monthly_repins[2:],
                                     domain_engagement.monthly_pin_creates[2:])],
            2)
        # A negative leading coefficient means the curve opens downward.
        return concavity < 0

Examples Blogs that get off to a slow start

[Figure: Pinterest late bloomer. Panels: Pin creates, Repins. Axis labels: Few, Many, 0]
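The late bloomer test relies on the sign of the leading coefficient of a degree-2 `np.polyfit`. A brief sketch on made-up series shows how that coefficient separates a mid-year peak from accelerating growth; the helper name `concavity` and both series are hypothetical.

```python
import numpy as np


def concavity(series):
    """Leading coefficient of the least-squares parabola through a series."""
    coefficients = np.polyfit(range(len(series)), series, 2)
    return coefficients[0]  # np.polyfit returns the highest power first


# Made-up activity: one series rises, peaks at month 5, then falls;
# the other keeps accelerating upward.
peaked = [30 - (month - 5) ** 2 for month in range(10)]
accelerating = [month ** 2 for month in range(10)]

peaked_concavity = concavity(peaked)
accelerating_concavity = concavity(accelerating)
```

One caveat worth keeping in mind: a series that only rises but flattens out also has negative concavity, so the sign alone does not guarantee a peak inside the year.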

Page 36: Replication in Data Science - A Dance Between Data Science & Machine Learning Strata 2016

Clusters

• Dark content
• Pinterest specials
• Steady growth
• Slow growth
• Churning
• Yearly
• Late bloomer

Page 37: Replication in Data Science - A Dance Between Data Science & Machine Learning Strata 2016

Agenda

1. Explore Pinterest’s content
2. Question our understanding
3. Inspire the future

Design system

Page 38: Replication in Data Science - A Dance Between Data Science & Machine Learning Strata 2016

Does asking twice yield the same answer?
Should we cluster again?

Page 39: Replication in Data Science - A Dance Between Data Science & Machine Learning Strata 2016

Data science is expensive

Cost of replicating analysis is leaving other business opportunities on the table

Page 40: Replication in Data Science - A Dance Between Data Science & Machine Learning Strata 2016

Would it make a difference?

Unknown

Page 41: Replication in Data Science - A Dance Between Data Science & Machine Learning Strata 2016

Replication Crisis in Psychology

Silberzahn & Uhlmann; Crowdsourced research: Many hands make tight work

Nature October 2015

Page 42: Replication in Data Science - A Dance Between Data Science & Machine Learning Strata 2016

Crowd sourced study on red cards in soccer

Silberzahn & Uhlmann; Crowdsourced research: Many hands make tight work

Nature October 2015

Page 43: Replication in Data Science - A Dance Between Data Science & Machine Learning Strata 2016

The New York Times on predicting the presidency
September, 2016

Cohn; We Gave Four Good Pollsters the Same Raw Data. They Had Four Different Results.

Page 44: Replication in Data Science - A Dance Between Data Science & Machine Learning Strata 2016

Data science is expensive

… but we’ve lowered the cost!

Page 45: Replication in Data Science - A Dance Between Data Science & Machine Learning Strata 2016

… 9 data scientists and machine learning engineers. Same data, same UI, same day. Everyone finished in ~1 hour.

…so we did it again

Page 46: Replication in Data Science - A Dance Between Data Science & Machine Learning Strata 2016

Models a real world situation with limited resources

9 is huge!

Page 47: Replication in Data Science - A Dance Between Data Science & Machine Learning Strata 2016

Were the results the same?

Everything was the same

Page 48: Replication in Data Science - A Dance Between Data Science & Machine Learning Strata 2016

Existing clusters as our baseline

Columns: Baseline clusters | Results e | Results l | Results d | Results m | Results z | Results b | Results k

Baseline clusters:
• Dark content
• Pinterest specials
• Steady growth
• Slow growth
• Churning
• Yearly
• Late bloomer

Page 49: Replication in Data Science - A Dance Between Data Science & Machine Learning Strata 2016

90% Matches

Columns: Baseline clusters | Results e | Results l | Results d | Results m | Results z | Results b | Results k

Dark content: Unpopular (95%); Trailing (90%)
Pinterest specials: Trailing (100%); Viral on Pinterest (98%); Pin creates drop off (97%)
Steady growth: Increasing repins (94%); Continuous growth (94%)
Slow growth: none
Churning: none
Yearly: none
Late bloomer: none
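The deck does not define how the match percentages are computed. One reading that fits the numbers, since a single analyst cluster such as "Trailing" appears against two different baseline clusters, is the share of a baseline cluster's domains that fall inside the analyst's cluster. The sketch below implements that assumed definition; `coverage`, `match_table` and the toy domain sets are all hypothetical.

```python
def coverage(baseline_members, analyst_members):
    """Share of the baseline cluster contained in the analyst's cluster."""
    baseline = set(baseline_members)
    if not baseline:
        return 0.0
    return len(baseline & set(analyst_members)) / len(baseline)


def match_table(baseline_clusters, analyst_clusters, threshold=0.9):
    """For each baseline cluster, list analyst clusters covering >= threshold of it."""
    matches = {}
    for baseline_name, baseline_members in baseline_clusters.items():
        matches[baseline_name] = [
            (analyst_name, coverage(baseline_members, analyst_members))
            for analyst_name, analyst_members in analyst_clusters.items()
            if coverage(baseline_members, analyst_members) >= threshold
        ]
    return matches


# Toy domains: "Trailing" swallows all of one baseline cluster and most of another.
baseline_clusters = {
    "Dark content": {"a", "b", "c", "d"},
    "Pinterest specials": {"e", "f"},
}
analyst_clusters = {
    "Trailing": {"a", "b", "c", "e", "f"},
    "Unpopular": {"d"},
}
table = match_table(baseline_clusters, analyst_clusters, threshold=0.75)
```

Under this definition a broad analyst cluster can score highly against several baseline clusters at once, so a high percentage signals containment, not that the two clusters are the same.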

Page 50: Replication in Data Science - A Dance Between Data Science & Machine Learning Strata 2016

50% Matches

Columns: Baseline clusters | Results e | Results l | Results d | Results m | Results z | Results b | Results k

Dark content: Unpopular (95%); Trailing (90%); Original Pinny (84%)
Pinterest specials: Trailing (100%); Minimal original Pins (66%); Viral on Pinterest (98%); Pin creates drop off (97%)
Steady growth: Pinterest viral content (62%); Other (53%); Original Pinny (51%); Viral on the internet (69%); Increasing repins (94%); Continuous growth (94%); Suspected Save button high Pin creates (73%)
Slow growth: Pinterest viral content (55%); Original Pinny (82%); Viral on the internet (65%); Increasing repins (65%); Continuous growth (86%); Suspected Save button high Pin creates (51%)
Churning: Original Pinny (68%); Viral on the internet (53%)
Yearly: Original Pinny (71%)
Late bloomer: Original Pinny (71%); Continuous growth (55%); Suspected Save button high Pin creates (59%)

Page 54: Replication in Data Science - A Dance Between Data Science & Machine Learning Strata 2016

Conceptually similar clusters
But not related in implementation

Columns: Baseline clusters | Results e | Results l | Results d | Results m | Results z | Results b | Results k

Yearly: Seasonal; Throwback; Seasonal; Annual
Steady growth: Gaining popularity; Increasing repins; Continuous growth; High engagement
Pinterest specials: Initial flurry; Minimal original Pins; Viral on Pinterest; Pin create drop off; Unpopular domains with good content

Page 55: Replication in Data Science - A Dance Between Data Science & Machine Learning Strata 2016

Two roots of variations

• Good vs. bad
• Differences in perspective

Page 56: Replication in Data Science - A Dance Between Data Science & Machine Learning Strata 2016

Signs of suboptimal clustering

• Leading with biases
• Cherry-picking: responding to a limited subset of the data

[Figure: Seasonal. Panels: Pin creates, Repins. Axis labels: Few, Few, 0]

Page 57: Replication in Data Science - A Dance Between Data Science & Machine Learning Strata 2016

Differences of perspective

• Results m - Viral growth centric
  • Viral on Pinterest
  • Viral on the internet
  • Lame
• Results d - Original content centric
  • Persistent original Pins
  • Minimal original Pins
  • Original Pinny
• Results l - Return on investment centric
  • Underserved
  • Draught
  • Trailing

Page 58: Replication in Data Science - A Dance Between Data Science & Machine Learning Strata 2016

Impact implications
9 data scientists, 9 answers

• Products depending on cluster used
  • Viral mechanisms
  • Speeding Pin demotion
  • Promoting underserved Pins
• For same product, domains impacted differ for
  • Seasonality
  • Steady growth
  • Pinterest specials

Page 59: Replication in Data Science - A Dance Between Data Science & Machine Learning Strata 2016

Bottom line
It matters which data scientist does an analysis

Page 60: Replication in Data Science - A Dance Between Data Science & Machine Learning Strata 2016

Agenda

1. Explore Pinterest’s content
2. Question our understanding
3. Inspire the future

Design system

Page 61: Replication in Data Science - A Dance Between Data Science & Machine Learning Strata 2016

Let’s ask the hard question and brave the answer together

When is data science a house of cards?

Page 62: Replication in Data Science - A Dance Between Data Science & Machine Learning Strata 2016

Avalanche of Resources
Measuring data science impact

• Experimental systems are now standard
• Data scientists are more available
• Reproducible analysis
• [Now] Fast replicable analysis

Page 63: Replication in Data Science - A Dance Between Data Science & Machine Learning Strata 2016

Utilize Resources
Experiment

• Record end to end from analysis to impact
• Innovate on processes
• Borrow ideas on replication from science
• Tailor our techniques for replication

Page 64: Replication in Data Science - A Dance Between Data Science & Machine Learning Strata 2016

Concrete experiments
Break down the problem and build up

• Narrow difference in perception through priming analysts
• Develop a rubric of excellence
• Train analysts on generated data
• Add process stabilizers

Page 65: Replication in Data Science - A Dance Between Data Science & Machine Learning Strata 2016

Pinterest is interested

pin.it/Data

Reach out!

Dr June Andrews [email protected] / DrAndrews/ DrJuneAndrews

Page 66: Replication in Data Science - A Dance Between Data Science & Machine Learning Strata 2016

Let’s data science, data science!
Let’s crack the code to systematic innovation

Page 67: Replication in Data Science - A Dance Between Data Science & Machine Learning Strata 2016

Thank you!

We are hiring!
https://engineering.pinterest.com/

pin.it/Data
