prep fo/r/ diy

18
prep fo/r/diy Find relevant content to plan your next hobby or home-improvement Kristofor Nyquist

Upload: kristofor-nyquist

Post on 16-Jul-2015

123 views

Category:

Data & Analytics


0 download

TRANSCRIPT

prep fo/r/diyFind relevant content to plan your next hobby or home-improvement

Kristofor Nyquist

…but reddit is a popularity contest

Given that I’m interested in a post1. What are similar posts I can read? 2. Who are the best people I can talk to?

Goal: find related DIY projects

Collecting / organizing data• Scraped reddit for do-it-yourself projects

– Collected text content from each project

i.e. title, externally linked blog post, comments, and general topic

Data conversion

• Combine all of the text to create one “document”

• Treat the document as a list of words

• Convert the list of words to a list of numbers

– Each number represents the “uniqueness” of a particular word to its document

i.e. 0 means word appears in every post

1 means word only appears in that post

Data conversion

• Combine all of the text to create one “document”

• Treat the document as a list of words

• Convert the list of words to a list of numbers

– Each number represents the “uniqueness” of a particular word to its document

i.e. 0 means word appears in every post

1 means word only appears in that post

Data conversion

• Combine all of the text to create one “document”

• Treat the document as a list of words

• Convert the list of words to a list of numbers

– Each number represents the “uniqueness” of a particular word to its document

i.e. 0 means word appears in every post

1 means word only appears in that post

Data conversion

• Combine all of the text to create one “document”

• Treat the document as a list of words

• Convert the list of words to a list of numbers

– Each number represents the “uniqueness” of a particular word to its document

i.e. 0 means word appears in every post

1 means word only appears in that post

Data conversion

• Combine all of the text to create one “document”

• Treat the document as a list of words

• Convert the list of words to a list of numbers

– Each number represents the “uniqueness” of a particular word to its document

i.e. 0 means word appears in every post

1 means word only appears in that post

ClusteringText content is rich enough to cluster projects

ClusteringClassify projects with missing categories based on user inputs

Validating post similaritySimilar posts group well by general topic. Deviations can make sense

Against randomObvious that randomly picking posts is not suitable

PhD Biophysics, UC Berkeley

BS Physics, WSU

About me (Kristofor Nyquist)

Hobbies

Algorithm

Post similarity / Classification• Turn the text into a list of numbers using term-frequency-

inverse-document-frequency

• “Compress” the data for speed– ~70,000 dimensions to 80 dimensions

For similarity:

• Calculate cosine-similarity between documents

• Present user with 5 most similar posts

For classification:

• Logistic regression (L1 regularization)

Settling on 80 PCs

80 principal components somewhat arbitraryBUT overall accuracy of classifier has definitely converged…

even though 80 PCs capture ~30% of variance

Validation numbers

• NLP algorithm

– accuracy: 0.62

recalls:

auto: 0.73

electronic: 0.6

home improve: 0.54

metalwork: 0.44

other: 0.52

outdoor: 0.89

woodworking: 0.68

• Random

–accuracy: 0.17

recalls:

auto: 0.05

electronic: 0.00

home improve: 0.18

metalwork: 0.00

other: 0.08

outdoor: 0.12

woodworking: 0.32

Full BoW vs. 80 PCs

80 PCs Full tfidf vector

Validating classifier

auto

metalwork

home

electronic

other

outdoor

woodwork

auto

met

alw

ork

ho

me

elec

tro

nic

oth

er

ou

tdo

or

wo

od

wo

rk

Pre

dic

tio

n

Truth

Column normalized

vs. rest ROCWoodworking

Other

Home improvement

With this application, can tolerate false positives. We can also present users with more limited options, maintaining higher accuracy