diving in the deep end of the big data pool

Post on 11-Jul-2015


Diving In The Deep End of the Big Data Pool

François Garillot (@huitseeker)

17:45 Thursday

Understanding your Unicorns: Data Science Team Building in Action

Location: 120-121

4 analytical PhDs

3 weeks

1 org with data

& a QUESTION

François Garillot (me)

Stephen Gadd

Marisa Figueiredo

Federica Capranico

Globetrotters

Family

Entertainment

SMB

Sport

Music Festivals

Football Fans

In Car Market Buyers

Pet Owners

Technology

Drivers

Mums Preschool

University Students

Gamblers

Mums

Shoppers

Music

Zone 1 commuters Infrequent

Zone 1 commuters Freq.

Zone 1 commuters Resident

Zone 1 commuters Regular

Zone 1 commuters

Entertainment Films

Food Coffee Shops

Gamers

Autos

B2B

Business/Finance

Careers

Education

Entertainment

Family & Youth

Gambling

Gaming

IT

Lifestyle

News

Property

Government

Retail

Search

Social

Sport

Telco

Travel

5+ million

50+ K

... so: Things Not To Mess Up

Nobody ever gets those two right

unsupervised clustering

find new segments

based on web browsing history

relative distances

spatial representation

unsupervised clustering based on web browsing history

have a position for each user

no implementation that works at scale!

find new segments

simrank

Simrank & MDS

[Diagram: graph of website nodes, 22 million nodes and 123 million edges → simrank → 5+ million, 25+ trillion → Clustering]
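A minimal sketch of what the simrank step computes, assuming the textbook SimRank recurrence (two nodes are similar when their neighbours are similar) applied to a toy user/website graph. The node names, the decay of 0.8 and the iteration count are illustrative assumptions, not values from the talk, and this naive all-pairs version is not the implementation the team used: it stores a score for every pair of nodes, which is exactly what stops working at millions of nodes.

```python
from itertools import product


def simrank(neighbors, decay=0.8, iterations=10):
    """Naive all-pairs SimRank on an undirected graph {node: set_of_neighbors}."""
    nodes = list(neighbors)
    # Start from the identity: a node is only similar to itself.
    sim = {(a, b): 1.0 if a == b else 0.0 for a, b in product(nodes, nodes)}
    for _ in range(iterations):
        new_sim = {}
        for a, b in product(nodes, nodes):
            if a == b:
                new_sim[(a, b)] = 1.0
            elif not neighbors[a] or not neighbors[b]:
                new_sim[(a, b)] = 0.0
            else:
                # Average similarity over all pairs of neighbours, damped by `decay`.
                total = sum(sim[(i, j)] for i in neighbors[a] for j in neighbors[b])
                new_sim[(a, b)] = decay * total / (len(neighbors[a]) * len(neighbors[b]))
        sim = new_sim
    return sim


# Hypothetical toy data: which user visited which site.
visits = {"u1": {"site_a", "site_b"}, "u2": {"site_a", "site_b"}, "u3": {"site_c"}}

# Build the undirected user/website graph from the visits.
graph = {user: set(sites) for user, sites in visits.items()}
for user, sites in visits.items():
    for site in sites:
        graph.setdefault(site, set()).add(user)

scores = simrank(graph)
```

Here u1 and u2 share all their sites and end up with a positive score, while u3 shares nothing with them and stays at zero.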

Simrank & MDS

MDS: scalable, but too complex to do in time

[Diagram: graph of website nodes, 22 million nodes and 123 million edges → simrank → 5+ million → MDS → (45, 36) → Clustering]
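To make the MDS step concrete: multidimensional scaling turns pairwise distances into coordinates, which is the "position for each user" the earlier slides ask for. The sketch below is classical (Torgerson) MDS in plain Python, using power iteration for the leading eigenvectors. It is an illustration under assumptions, not the scalable variant the talk alludes to: it needs the full symmetric distance matrix in memory, and the function name, parameters and sample points are all hypothetical.

```python
import math
import random


def classical_mds(dist, dims=2, iterations=200, seed=0):
    """Embed n points in `dims` dimensions from a symmetric n x n distance matrix."""
    n = len(dist)
    # Double-centre the squared distances: B = -1/2 * J * D^2 * J.
    sq = [[d * d for d in row] for row in dist]
    row_mean = [sum(row) / n for row in sq]
    total_mean = sum(row_mean) / n
    b = [[-0.5 * (sq[i][j] - row_mean[i] - row_mean[j] + total_mean)
          for j in range(n)] for i in range(n)]
    rng = random.Random(seed)
    coords = [[0.0] * dims for _ in range(n)]
    for k in range(dims):
        # Power iteration: converge to the dominant eigenvector of b.
        v = [rng.random() for _ in range(n)]
        for _ in range(iterations):
            w = [sum(b[i][j] * v[j] for j in range(n)) for i in range(n)]
            norm = math.sqrt(sum(x * x for x in w))
            if norm == 0.0:
                break
            v = [x / norm for x in w]
        # Rayleigh quotient gives the matching eigenvalue.
        eigenvalue = sum(v[i] * sum(b[i][j] * v[j] for j in range(n)) for i in range(n))
        scale = math.sqrt(max(eigenvalue, 0.0))
        for i in range(n):
            coords[i][k] = scale * v[i]
        # Deflate b so the next pass finds the next eigenvector.
        for i in range(n):
            for j in range(n):
                b[i][j] -= eigenvalue * v[i] * v[j]
    return coords


# Hypothetical example: four points on a 3 x 4 rectangle, known only by distance.
points = [(0, 0), (3, 0), (0, 4), (3, 4)]
dist = [[math.dist(p, q) for q in points] for p in points]
coords = classical_mds(dist)
```

The recovered coordinates match the original layout up to rotation and reflection, so all pairwise distances are preserved. In practice a linear algebra library would replace the hand-rolled eigen solver.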

✓ Implemented

✖ Fail

Lay the bare stuff down first, THEN refine

Cluster still a huge mess to deploy

Results

Singles

Locality-Sensitive Hashing

Hand-made code!
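The slides do not say which LSH family the hand-made code used. As a sketch only, here is one common choice for set-valued data such as browsing histories: MinHash signatures with banding, which group users whose site sets have high Jaccard similarity without comparing every pair. The function names, the number of hashes and bands, and the toy histories are assumptions for illustration.

```python
import hashlib
from collections import defaultdict


def minhash_signature(items, num_hashes=32):
    """MinHash signature of a non-empty set: one minimum per seeded hash function."""
    signature = []
    for seed in range(num_hashes):
        signature.append(min(
            int.from_bytes(
                hashlib.blake2b(f"{seed}:{item}".encode(), digest_size=8).digest(),
                "big",
            )
            for item in items
        ))
    return signature


def lsh_candidates(histories, num_hashes=32, bands=16):
    """Return pairs of users whose signatures agree on at least one band."""
    rows = num_hashes // bands
    buckets = defaultdict(set)
    for user, sites in histories.items():
        sig = minhash_signature(sites, num_hashes)
        for band in range(bands):
            # Users sharing a whole band of the signature land in the same bucket.
            key = (band, tuple(sig[band * rows:(band + 1) * rows]))
            buckets[key].add(user)
    pairs = set()
    for users in buckets.values():
        ordered = sorted(users)
        for i, a in enumerate(ordered):
            for b in ordered[i + 1:]:
                pairs.add((a, b))
    return pairs


# Hypothetical toy histories; real input would be one set of sites per user.
histories = {
    "u1": {"pof.com", "tagged.com"},
    "u2": {"pof.com", "tagged.com"},
    "u3": {"example.org"},
}
candidates = lsh_candidates(histories)
```

Only candidate pairs need an exact similarity check afterwards, which is what makes the approach workable when the full pairwise matrix is out of reach. More bands with fewer rows each catch less similar pairs at the cost of more false candidates.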

typical web browsing: pof.com, tagged.com

“The year of being single”, Marketing Magazine, 2013

“The rise of the single economy”, The Guardian, 2014

Final results obtained on the last day

Essential: fuel & friends

- power & network fail

- Bare pipeline first

- Distributed is hard, let's go think instead!

- Fuel & friends
