Duplicate Detection via Topic Modeling

Upload: brent-schneeman

Post on 09-Apr-2017


TRANSCRIPT

Page 1: Data Day Seattle  Duplicate Detection via Topic Modeling

Duplicate Detection via Topic Modeling

Page 2:

HomeAway Key Facts

● 1,300,000+ global vacation rental listings

● 200,000,000+ vacation days / year

● ~190 countries, 22 languages

● HQ in Austin, TX; part of Expedia, Inc.

--> Capable competition and fraud vectors

Page 3:

Competitive Intelligence

Over 2 million global HomeAway (HA) and competitor documents and metadata

Page 4:

Breckenridge, Colorado

HomeAway in blue

Page 5:

Breckenridge, zoomed in

Page 6:

Same Property

Page 7:

The Property Descriptions

Why Property Descriptions?

● Almost identical text

● Similar descriptions seemed probable

○ Consistent owner branding, easy to replicate

● Tech team wanted to use natural language processing techniques

● Didn’t know if this would work when we began

The Other Guys

There are truly inspiring views at High Point Retreat and plenty of places to sit and enjoy them. Take a load off in one of the many rooms with views of the ski mountain and remember how lucky you are to live like this. Cozy up with family in the sunken living room and chat for hours on end. Sit in a circle of tree stumps around the outdoor fire pit and roast marshmallows. After all that sitting, youll be more than happy to walk 250 yards to the free shuttle to get the blood pumping again. Then, have a seat and enjoy your free ride. Best. Vacation. Ever. Vacation homes allow families to stay...together. At InvitedHome, we think that's pretty important, so we do everything in our power to make your vacation totally epic. Not only do we choose the best homes in the best destinations, but we make the experience effortless so you can really enjoy yourself. Our team will stock your fridge, babysit the kids, cater your party, plan your day trip, make reservations, and do whatever we can to make sure you have the Best. Vacation. Ever.

HomeAway

There are truly inspiring views at High Point Retreat and plenty of places to sit and enjoy them. Take a load off in one of the many rooms with views of the ski mountain and remember how lucky you are to live like this. Cozy up with family in the sunken living room and chat for hours on end. Sit in a circle of tree stumps around the outdoor fire pit and roast marshmallows. After all that sitting, you’ll be more than happy to walk 250 yards to the free shuttle to get the blood pumping again. Then, have a seat and enjoy your free ride.Best.Vacation.Ever. Vacation homes allow families to stay...together. At InvitedHome, we think that's pretty important, so we do everything in our power to make your vacation totally epic. Not only do we choose the best homes in the best destinations, but we make the experience effortless so you can really enjoy yourself. Let us connect you with the best options in town for babysitting, equipment rental, transportation, catering, day trips, shopping, dining, and even stocking your fridge with groceries! We’ll do everything in our power to make sure you have the Best. Vacation. Ever.

Page 8:

Hypothesis

We can detect properties listed on HomeAway and the competition by comparing the text in the property descriptions

Page 9:

Initial Approach: TF-IDF and Cosine Distance

Worked great, but...

“Large” vocabulary size

~10K tokens -> 10K dimensions and millions of sparse vectors

A little slow (took a week to process the US)
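To make the initial approach concrete, here is a minimal sketch of TF-IDF weighting followed by cosine distance. It is not the code behind the talk: the listing texts are invented, the tokenizer is a plain whitespace split, and the smoothed IDF formula is one common variant chosen for illustration.

```python
import math
from collections import Counter


def tfidf_vectors(docs):
    """Return one sparse TF-IDF vector (token -> weight) per tokenized document."""
    n_docs = len(docs)
    # Document frequency: in how many documents each token appears.
    doc_freq = Counter(tok for doc in docs for tok in set(doc))
    # Smoothed inverse document frequency (one common variant, an assumption here).
    idf = {tok: math.log((1 + n_docs) / (1 + df)) + 1.0 for tok, df in doc_freq.items()}
    vectors = []
    for doc in docs:
        counts = Counter(doc)
        vectors.append({tok: (c / len(doc)) * idf[tok] for tok, c in counts.items()})
    return vectors


def cosine_distance(a, b):
    """1 - cosine similarity of two sparse vectors stored as dicts."""
    dot = sum(w * b.get(tok, 0.0) for tok, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 1.0
    return 1.0 - dot / (norm_a * norm_b)


# Hypothetical listing descriptions, invented for illustration.
listings = [
    "cozy cabin with views of the ski mountain and outdoor fire pit",
    "cozy cabin with views of the ski mountain and a free shuttle",
    "downtown apartment near golf course with rooftop pool",
]
vectors = tfidf_vectors([text.lower().split() for text in listings])
near_duplicate = cosine_distance(vectors[0], vectors[1])
unrelated = cosine_distance(vectors[0], vectors[2])
print(f"near-duplicate: {near_duplicate:.3f}, unrelated: {unrelated:.3f}")
```

Each vector has one dimension per vocabulary token, which is why a ~10K-token vocabulary leads to 10K-dimensional sparse vectors, and comparing every pair of documents grows quadratically with corpus size.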

Page 10:

Spark Clusters?

Topic Modeling?

Other Distance Metrics?

Page 11:

Hypothesis

We can detect properties listed on HomeAway and the competition by comparing the text in the property descriptions

We can leverage Topic Modeling to do it

Page 12:

Latent Dirichlet Allocation (Topic Modeling)

Communications of the ACM, Vol. 55 No. 4, Pages 77-84, doi:10.1145/2133806.2133826

Page 13:

Topic Modeling and LDA

In natural language processing, Latent Dirichlet allocation (LDA) is a generative model that allows sets of observations to be explained by unobserved groups that explain why some parts of the data are similar.

(Wikipedia)

Document A: Cat, Dog, Fish, Turtle, Hamster

Document B: Cat, Dog, Mass, Hysteria, Living, Together

Document C: Cat, Dog, Cold, Rain, Hot, Temperature
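As a rough illustration of what LDA produces, the sketch below runs a tiny collapsed Gibbs sampler over the three example documents and returns a topic distribution per document. This is a toy, not the Spark implementation used in the talk: the function name `lda_gibbs`, the hyperparameters `alpha` and `beta`, the iteration count, and the choice of two topics are all assumptions made for the example, and three short documents are far too few for meaningful topics.

```python
import random
from collections import defaultdict


def lda_gibbs(docs, num_topics, iterations=200, alpha=0.1, beta=0.01, seed=0):
    """Toy collapsed Gibbs sampler for LDA; returns one topic distribution per document."""
    rng = random.Random(seed)
    vocab_size = len({w for doc in docs for w in doc})
    doc_topic = [[0] * num_topics for _ in docs]                 # tokens in doc d assigned to topic k
    topic_word = [defaultdict(int) for _ in range(num_topics)]   # times word w is assigned to topic k
    topic_total = [0] * num_topics                               # tokens assigned to topic k
    assignments = []
    # Start from a random topic for every token.
    for d, doc in enumerate(docs):
        zs = []
        for w in doc:
            k = rng.randrange(num_topics)
            zs.append(k)
            doc_topic[d][k] += 1
            topic_word[k][w] += 1
            topic_total[k] += 1
        assignments.append(zs)
    for _ in range(iterations):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                # Remove the token's current assignment, then resample its topic.
                k = assignments[d][i]
                doc_topic[d][k] -= 1
                topic_word[k][w] -= 1
                topic_total[k] -= 1
                weights = [
                    (doc_topic[d][t] + alpha)
                    * (topic_word[t][w] + beta) / (topic_total[t] + vocab_size * beta)
                    for t in range(num_topics)
                ]
                k = rng.choices(range(num_topics), weights=weights)[0]
                assignments[d][i] = k
                doc_topic[d][k] += 1
                topic_word[k][w] += 1
                topic_total[k] += 1
    # Smoothed per-document topic proportions.
    return [
        [(doc_topic[d][t] + alpha) / (len(doc) + num_topics * alpha) for t in range(num_topics)]
        for d, doc in enumerate(docs)
    ]


docs = [
    ["cat", "dog", "fish", "turtle", "hamster"],
    ["cat", "dog", "mass", "hysteria", "living", "together"],
    ["cat", "dog", "cold", "rain", "hot", "temperature"],
]
theta = lda_gibbs(docs, num_topics=2)
for name, dist in zip("ABC", theta):
    print(f"Document {name}:", [round(p, 3) for p in dist])
```

Each row of `theta` is a probability distribution over topics, so a document is represented by `num_topics` numbers rather than one number per vocabulary token.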

Page 14:

Some Example Topics from Breckenridge

time, setting, wifi, elk, central, enjoying, spend, marijuana, sleepers, brittany

buffalo, soaking, pubs, titles, washroom, pristine, ratedgas, multiple, especially, scrumptious

apartment, weekend, maintained, company, bedroom, bed, sized, bathroom, walk, queen

golf, course, chateau, sole, beauty, payment, splendor, championship, rooftop, stonehaven

smoking, allowed, deposit, damage, fee, owner, dates, paid, balance, zone

Page 15:

Topic Modeling Motivations

● Smaller dimensional space

● Faster processing times?

● At the end, we’d have Topic Models

Must be useful for duplicate detection

We used Spark’s ML APIs for this:

    import org.apache.spark.ml.clustering.LDA

    // numTopics, params, and featureCol are defined elsewhere in the job
    val countLDA = new LDA()
      .setK(numTopics)
      .setMaxIter(params.maxIterations)
      .setSeed(params.randomSeed)
      .setFeaturesCol(featureCol)
      .setTopicDistributionCol("topicDistribution")

Page 16:
Pages 17-19:

Distances between Topic Distributions

Euclidean Manhattan Cosine

Jensen-Shannon Hellinger
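The five distances named above can be sketched for two topic distributions as follows. This is an illustrative implementation with hypothetical function names, not the code used in the talk; the Jensen-Shannon function returns the distance (the square root of the base-2 divergence), which is one of several conventions.

```python
import math


def euclidean(p, q):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))


def manhattan(p, q):
    return sum(abs(a - b) for a, b in zip(p, q))


def cosine(p, q):
    dot = sum(a * b for a, b in zip(p, q))
    norms = math.sqrt(sum(a * a for a in p)) * math.sqrt(sum(b * b for b in q))
    return 1.0 - dot / norms


def hellinger(p, q):
    """Hellinger distance between two discrete distributions, in [0, 1]."""
    return math.sqrt(sum((math.sqrt(a) - math.sqrt(b)) ** 2 for a, b in zip(p, q))) / math.sqrt(2)


def _kl(p, q):
    """Kullback-Leibler divergence in bits; terms with p_i == 0 contribute nothing."""
    return sum(a * math.log2(a / b) for a, b in zip(p, q) if a > 0)


def jensen_shannon(p, q):
    """Jensen-Shannon distance: square root of the base-2 JS divergence, in [0, 1]."""
    m = [(a + b) / 2 for a, b in zip(p, q)]
    divergence = 0.5 * _kl(p, m) + 0.5 * _kl(q, m)
    return math.sqrt(max(0.0, divergence))


# Hypothetical topic distributions for two listings over three topics.
p = [0.7, 0.2, 0.1]
q = [0.6, 0.3, 0.1]
for fn in (euclidean, manhattan, cosine, jensen_shannon, hellinger):
    print(f"{fn.__name__}: {fn(p, q):.4f}")
```

Euclidean, Manhattan, and cosine treat the inputs as generic vectors, while Jensen-Shannon and Hellinger are defined on probability distributions, which is what an LDA topic distribution is.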

Pages 20-22:

Create an experimental dataset

Original Corpus

Random selection

Duplicate (with optional degradation)...

… and see if we can find those duplicates
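The experimental setup above might be sketched like this. The slides do not say how duplicates were degraded, so randomly dropping words is an assumption made for illustration, as are the function name `make_experiment`, the `-dup` id suffix, and the sample corpus.

```python
import random


def make_experiment(corpus, sample_size, drop_fraction=0.0, seed=42):
    """Copy a random selection of documents back into the corpus as planted duplicates.

    Returns (augmented_corpus, truth), where truth maps each planted
    duplicate id to the id of the document it was copied from.
    """
    rng = random.Random(seed)
    chosen = rng.sample(sorted(corpus), sample_size)
    augmented = dict(corpus)
    truth = {}
    for doc_id in chosen:
        tokens = corpus[doc_id].split()
        # Assumed degradation: drop each word independently with probability drop_fraction.
        kept = [tok for tok in tokens if rng.random() >= drop_fraction]
        dup_id = f"{doc_id}-dup"
        augmented[dup_id] = " ".join(kept)
        truth[dup_id] = doc_id
    return augmented, truth


# Hypothetical corpus, invented for illustration.
corpus = {
    "doc-1": "cozy cabin with views of the ski mountain",
    "doc-2": "downtown apartment near the golf course",
    "doc-3": "lakefront cottage with a private dock",
    "doc-4": "modern condo steps from the free shuttle",
}
augmented, truth = make_experiment(corpus, sample_size=2, drop_fraction=0.2)
for dup_id, source_id in truth.items():
    print(f"{dup_id} <- {source_id}: {augmented[dup_id]!r}")
```

Because `truth` records which documents were planted, any detector run over `augmented` can be scored by how many of those pairs it recovers.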

Page 23:
Page 24:
Page 25:
Page 26:
Page 27:

How to make something useful?

Machine Learning Effort

Page 28:
Page 29:
Page 30:

Interquartile Ranges are more resilient to outliers than standard deviations

IQRs bring information about the entire set of possible duplicates
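A small numeric check of the robustness claim, using invented distances: adding a single extreme value changes the standard deviation far more than the interquartile range. The data and the helper name `iqr` are hypothetical, and `statistics.quantiles` uses its default quartile method.

```python
import statistics


def iqr(values):
    """Interquartile range: third quartile minus first quartile."""
    q1, _median, q3 = statistics.quantiles(values, n=4)
    return q3 - q1


# Hypothetical distances from one listing to its candidate duplicates.
distances = [0.42, 0.45, 0.47, 0.50, 0.52, 0.55, 0.58, 0.60]
with_outlier = distances + [5.0]  # one wildly distant candidate

iqr_ratio = iqr(with_outlier) / iqr(distances)
std_ratio = statistics.stdev(with_outlier) / statistics.stdev(distances)
print(f"IQR grows by a factor of {iqr_ratio:.2f}, standard deviation by {std_ratio:.2f}")
```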

Combining Distance and IQR

Random Forest Model (R):

    library(caret)         # createDataPartition
    library(randomForest)  # randomForest

    trainIdx <- createDataPartition(dupesFoundByTopic$match, p=0.9, list=FALSE, times=1)
    train <- dupesFoundByTopic[trainIdx,]
    fit <- randomForest(as.factor(match) ~ distance + iqrs, data=train)

Feature     Mean Decrease Gini
distance    498
IQR         57

            Reference
Pred.       FALSE   TRUE
FALSE       204     2
TRUE        4       32
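The confusion matrix can be summarized with the usual metrics. The sketch below assumes rows are predictions and columns are reference labels, with TRUE (a duplicate) as the positive class; the slide does not state this explicitly.

```python
# Counts copied from the confusion matrix above. Reading rows as predictions
# and columns as reference labels is an assumption about the table layout.
true_negative, false_negative = 204, 2   # predicted FALSE
false_positive, true_positive = 4, 32    # predicted TRUE

total = true_negative + false_negative + false_positive + true_positive
accuracy = (true_positive + true_negative) / total
precision = true_positive / (true_positive + false_positive)
recall = true_positive / (true_positive + false_negative)

print(f"accuracy={accuracy:.3f} precision={precision:.3f} recall={recall:.3f}")
```

Under that reading, the model is right on about 97.5% of the 242 pairs, roughly 89% of the pairs it flags are real duplicates, and it finds about 94% of the real duplicates.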

Page 31:

Current Status

● Topic Models / Topic Distances seem useful

○ Esp. when part of a multi-signal model (e.g. images)

● Hybrid Spark and R approach

○ Moving to 100% Spark in future for speed

● Topic Models just sitting there, waiting for exploitation

○ “Programmatic” Marketing Efforts, etc.

● But what about Locality Sensitive Hashing?

Page 32:

Questions?

Brent Schneeman

Principal Data Scientist

HomeAway, Inc.

[email protected]

@schnee

← https://www.homeaway.com/vacation-rental/p3482065