use nlp to solve business problems

The Use of NLP to Solve Problems Annie Flippo 11/2/2016

Upload: annie-flippo

Post on 15-Apr-2017


Category:

Data & Analytics



TRANSCRIPT

The Use of NLP to Solve Problems

Annie Flippo, 11/2/2016

Who am I?

Annie Flippo, Sr. Data Scientist
AwesomenessTV / DreamWorks Animation SKG

Slides at bit.ly/acflippo-nlp

Who is AwesomenessTV?

We're a digital content provider for platforms including Hulu, Netflix, Roku, Verizon & YouTube.

Business Problem

Many systems managing videos on different platforms

Goals

Develop a method to identify the same or similar assets across systems:
• Show asset relationship
• Generate unique id for in-house apps

Why use NLP?

Top goals for Natural Language Processing are:
1. Document Similarity (search engine query)
2. Topic Modeling (Twitter/Blog Analysis)
3. Sentiment Analysis (movie or restaurant reviews)

Data Processing

Why perform text processing?
• To get rid of the messiness of free-form text
• To group words with the same meaning
• Convert text to numeric features
• Model on equivalent numeric features

Data Processing

Titles and descriptions get scrubbed:
• Remove punctuation, non-ASCII characters, carriage returns
• Remove stop words (e.g. it, this, and, that)
• Stemming
• Lemmatize
• Tokenize
• Vectorize
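The first two scrubbing steps can be sketched in plain Python. This is only an illustration: the function name `scrub` and the short `STOP_WORDS` set are assumptions made for this example, not the code behind the slides.

```python
import string

# Hypothetical stop-word list for illustration; real lists are much longer.
STOP_WORDS = {"it", "this", "and", "that", "the"}

def scrub(text):
    """Drop non-ASCII characters and punctuation, lowercase, remove stop words."""
    # Encoding with errors="ignore" silently discards non-ASCII characters.
    ascii_text = text.encode("ascii", errors="ignore").decode("ascii")
    # Replace each punctuation mark with a space so adjacent words stay separate.
    table = str.maketrans(string.punctuation, " " * len(string.punctuation))
    cleaned = ascii_text.translate(table).lower()
    # split() with no arguments also discards carriage returns and newlines.
    return [word for word in cleaned.split() if word not in STOP_WORDS]

# "\u2122" is the trademark sign, used here as a sample non-ASCII character.
print(scrub("Quick\u2122 Brown Fox:\r\n and that lazy dog!"))
# ['quick', 'brown', 'fox', 'lazy', 'dog']
```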

Stemming

Reduce to the root of the word

Provision, providing, provider, provided => provid

Argue, argues, arguing, argued => argu
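To make the idea concrete, here is a toy suffix-stripping stemmer. It is not the Porter algorithm or any library's stemmer. The suffix list and the `min_stem` cutoff are assumptions chosen so the "argue" example above works, and a real stemmer applies many more rules.

```python
# Hypothetical suffix list for this sketch, longest suffixes checked first.
SUFFIXES = ("ing", "ed", "es", "er", "e", "s")

def toy_stem(word, min_stem=3):
    """Strip the first matching suffix, keeping at least min_stem characters."""
    word = word.lower()
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= min_stem:
            return word[: -len(suffix)]
    return word

print([toy_stem(w) for w in ["Argue", "argues", "arguing", "argued"]])
# ['argu', 'argu', 'argu', 'argu']
```

Note that the stem "argu" is not a dictionary word. Stemming only needs related forms to collapse to the same string.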

Lemmatize

Retrieve the linguistic root of the word

Walk, walking, walked => walk

Is, am, are => be

Begin, began, begun => begin

* Nouns and verbs are lemmatized differently.
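A lemmatizer is essentially a dictionary lookup that depends on the part of speech. The sketch below uses a tiny hand-built table. The `LEMMAS` table and the `lemmatize` function are hypothetical stand-ins for a real lexical database.

```python
# Hypothetical, hand-built lemma table keyed by (word, part of speech).
LEMMAS = {
    ("is", "verb"): "be",
    ("am", "verb"): "be",
    ("are", "verb"): "be",
    ("walking", "verb"): "walk",
    ("walked", "verb"): "walk",
    ("began", "verb"): "begin",
    ("begun", "verb"): "begin",
    ("meeting", "verb"): "meet",     # "we are meeting today"
    ("meeting", "noun"): "meeting",  # "the meeting ran long"
}

def lemmatize(word, pos):
    """Look up the lemma for a word and part of speech; fall back to the word."""
    key = (word.lower(), pos)
    return LEMMAS.get(key, word.lower())

print(lemmatize("Are", "verb"))      # be
print(lemmatize("begun", "verb"))    # begin
print(lemmatize("meeting", "verb"))  # meet
print(lemmatize("meeting", "noun"))  # meeting
```

The last two calls show why the part of speech matters: the same surface word maps to different lemmas as a verb and as a noun.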

Tokenize

Count distinct words from a corpus

“The quick brown fox jumped over the lazy dogs”

becomes

[‘the’, ‘quick’, ‘brown’, ‘fox’, ‘jump’, ‘over’, ‘lazy’, ‘dog’]
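The example above can be reproduced with a few lines of Python. The `NORMALIZE` table is a hypothetical placeholder for a real stemmer or lemmatizer, included only so that "jumped" and "dogs" reduce as shown on the slide.

```python
import re

# Hypothetical normalization table standing in for a real stemmer or lemmatizer.
NORMALIZE = {"jumped": "jump", "dogs": "dog"}

def tokenize(text):
    """Split text into lowercase, normalized word tokens in reading order."""
    words = re.findall(r"[a-z0-9]+", text.lower())
    return [NORMALIZE.get(word, word) for word in words]

def distinct(tokens):
    """Keep the first occurrence of each token, preserving order."""
    return list(dict.fromkeys(tokens))

tokens = tokenize("The quick brown fox jumped over the lazy dogs")
print(distinct(tokens))
# ['the', 'quick', 'brown', 'fox', 'jump', 'over', 'lazy', 'dog']
```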

Vectorize

Count the occurrences of each word in the distinct word vector.

“The quick brown fox jumped over the lazy dogs”

Tokenized to:

[‘the’, ‘quick’, ‘brown’, ‘fox’, ‘jump’, ‘over’, ‘lazy’, ‘dog’]

Vectorized to:

[2, 1, 1, 1, 1, 1, 1, 1]
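A minimal sketch of the counting step, assuming the sentence has already been tokenized and stemmed. The function name `vectorize` is chosen for this example.

```python
from collections import Counter

def vectorize(tokens, vocabulary):
    """Return how often each vocabulary word occurs in tokens."""
    counts = Counter(tokens)
    return [counts[word] for word in vocabulary]

# Tokens of the example sentence after stemming; "the" appears twice.
tokens = ["the", "quick", "brown", "fox", "jump", "over", "the", "lazy", "dog"]
vocabulary = list(dict.fromkeys(tokens))  # distinct words, first-seen order

print(vectorize(tokens, vocabulary))
# [2, 1, 1, 1, 1, 1, 1, 1]
```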

Bag-of-Words Comparison

Doc1: “The quick brown fox jumped over the lazy dogs”
Doc2: “The quick fox ran away from the dog”

After processing, the corpus attribute vector is:

[‘quick’, ‘brown’, ‘fox’, ‘jump’, ‘over’, ‘lazy’, ‘dog’, ‘run’, ‘away’, ‘from’]

Two documents vectorize to:
Doc1: [1, 1, 1, 1, 1, 1, 1, 0, 0, 0]
Doc2: [1, 0, 1, 0, 0, 0, 1, 1, 1, 1]
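The comparison depends on both documents sharing one vocabulary. Here is a small sketch, assuming the two documents have already been scrubbed and stemmed into the token lists below. The helper names are made up for this example.

```python
def build_vocabulary(documents):
    """Collect distinct tokens across all documents in first-seen order."""
    vocabulary = {}
    for tokens in documents:
        for token in tokens:
            vocabulary.setdefault(token, None)
    return list(vocabulary)

def bag_of_words(tokens, vocabulary):
    """Count each vocabulary word in one document."""
    return [tokens.count(word) for word in vocabulary]

# Processed tokens (stop word "the" removed, words stemmed or lemmatized).
doc1 = ["quick", "brown", "fox", "jump", "over", "lazy", "dog"]
doc2 = ["quick", "fox", "run", "away", "from", "dog"]

vocabulary = build_vocabulary([doc1, doc2])
print(vocabulary)
print(bag_of_words(doc1, vocabulary))  # [1, 1, 1, 1, 1, 1, 1, 0, 0, 0]
print(bag_of_words(doc2, vocabulary))  # [1, 0, 1, 0, 0, 0, 1, 1, 1, 1]
```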

Sentences are transformed into numeric vectors!

[Figure: word labels brown, fox, lazy, run, quick, dog]

Similarity Measure

Cosine similarity calculates how close two numeric vectors are, based on the angle between them, which is like a distance measure between two points.

The problem has now been reduced to simple matrix algebra.
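As a worked example, cosine similarity is the dot product of two vectors divided by the product of their lengths. The sketch below applies it to the Doc1 and Doc2 vectors from the bag-of-words slide. The function name is chosen for this example.

```python
import math

def cosine_similarity(a, b):
    """Return dot(a, b) / (|a| * |b|), or 0.0 if either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

doc1 = [1, 1, 1, 1, 1, 1, 1, 0, 0, 0]
doc2 = [1, 0, 1, 0, 0, 0, 1, 1, 1, 1]

# Shared words: quick, fox, dog, so dot = 3; lengths are sqrt(7) and sqrt(6).
print(round(cosine_similarity(doc1, doc2), 4))
# 0.4629
```

A score of 1.0 means the vectors point in the same direction, and 0.0 means the documents share no words.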

Bi-Gram Comparison

Because the same words are used across our videos, the bag-of-words similarity resulted in many false-positive matches.

The solution is to use a Bi-Gram algorithm where 2 consecutive words are extracted as one feature:

“The quick brown fox jumped over the lazy dogs”

becomes:

[‘the quick’, ‘quick brown’, ‘brown fox’, ‘fox jump’, ‘jump over’, ‘over the’, ‘the lazy’, ‘lazy dog’]
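Bi-gram extraction can be sketched by pairing each token with its successor, assuming the sentence has already been tokenized and stemmed as in the example above.

```python
def bigrams(tokens):
    """Join each pair of consecutive tokens into a single feature."""
    return [f"{first} {second}" for first, second in zip(tokens, tokens[1:])]

tokens = ["the", "quick", "brown", "fox", "jump", "over", "the", "lazy", "dog"]
print(bigrams(tokens))
# ['the quick', 'quick brown', 'brown fox', 'fox jump',
#  'jump over', 'over the', 'the lazy', 'lazy dog']
```

The bi-gram features can then be vectorized and compared with cosine similarity in the same way as single words.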

Limitations

Certain phrases such as “Behind the scenes” are found frequently. This creates an artificially high similarity score even if the videos are dissimilar.

Possible solutions:
• Perform more custom data scrubbing
• Double-check by matching duration of videos
• Have the matches verified by a human
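The duration double-check could look something like the sketch below. Everything here is an assumption for illustration: the function name, the 0.8 similarity threshold, and the 2-second tolerance are not taken from the slides and would need tuning on real data.

```python
def is_probable_same_video(similarity, duration_a, duration_b,
                           min_similarity=0.8, max_gap_seconds=2.0):
    """Accept a text match only if the video durations also agree.

    Thresholds are hypothetical defaults chosen for this sketch.
    """
    text_matches = similarity >= min_similarity
    duration_matches = abs(duration_a - duration_b) <= max_gap_seconds
    return text_matches and duration_matches

# Both titles say "Behind the scenes", but the running times differ widely.
print(is_probable_same_video(0.92, duration_a=185.0, duration_b=642.0))  # False
print(is_probable_same_video(0.92, duration_a=185.0, duration_b=186.0))  # True
```

A check like this suits identical videos only. Derived videos such as trailers have different durations by design, so they would still need another rule or human review.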

Conclusion

I use Natural Language Processing to:

1. Identify similar videos across platforms
2. Tie assets together where some are identical videos while others are derived videos (such as trailers or promos).

Thank You!

Annie Flippo @ACflippo

Slides and code are available at bit.ly/acflippo-nlp