Use NLP to Solve Business Problems
TRANSCRIPT
Who am I?
Annie Flippo, Sr. Data Scientist
AwesomenessTV / Dreamworks Animation SKG
Slides at bit.ly/acflippo-nlp
Who is AwesomenessTV?
We’re a digital content provider for platforms including Hulu, Netflix, Roku, Verizon & YouTube.
Goals
Develop a method to identify the same or similar assets across systems:
• Show asset relationships
• Generate a unique ID for in-house apps
Why use NLP?
Top goals for Natural Language Processing are:
1. Document Similarity (search engine query)
2. Topic Modeling (Twitter/Blog Analysis)
3. Sentiment Analysis (movie or restaurant reviews)
Data Processing
Why perform text processing?
• To get rid of the messiness of free-form text
• To group words with the same meaning
• To convert text to numeric features
• To model on equivalent numeric features
Data Processing
Titles and descriptions get scrubbed:
• Remove punctuation, non-ASCII characters, carriage returns
• Remove stop words (e.g. it, this, and, that)
• Stemming
• Lemmatize
• Tokenize
• Vectorize
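
The first two scrubbing steps can be sketched in plain Python. This is a minimal illustration, not the pipeline used in the talk: the `scrub` function and the `STOP_WORDS` set are hypothetical names, the stop-word list is deliberately tiny, and stemming and lemmatizing are left out.

```python
import re
import string

# Hypothetical stop-word list for illustration; a real list would be much longer.
STOP_WORDS = {"it", "this", "and", "that", "the", "a", "an"}

def scrub(text):
    """Strip non-ASCII, line breaks and punctuation, then drop stop words."""
    # Drop any character that cannot be encoded as ASCII.
    text = text.encode("ascii", errors="ignore").decode("ascii")
    # Replace carriage returns and newlines with a single space.
    text = re.sub(r"[\r\n]+", " ", text)
    # Remove punctuation characters.
    text = text.translate(str.maketrans("", "", string.punctuation))
    # Lowercase, split on whitespace, and filter out stop words.
    return [word for word in text.lower().split() if word not in STOP_WORDS]

tokens = scrub("The quick, brown fox\r\njumped over the lazy dogs!")
print(tokens)
```

Stemming or lemmatizing would then be applied to the returned tokens, for example turning "jumped" into "jump" and "dogs" into "dog".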
Stemming
Reduce to the root of the word
Provision, providing, provider, provided => provid
Argue, argues, arguing, argued => argu
Lemmatize
Retrieve the linguistic root of the word
Walk, walking, walked => walk
Is, am, are => be
Begin, began, begun => begin
* Nouns and verbs are lemmatized differently.
Tokenize
Collect the distinct words from a corpus
“The quick brown fox jumped over the lazy dogs”
becomes
[‘the’, ‘quick’, ‘brown’, ‘fox’, ‘jump’, ‘over’, ‘lazy’, ‘dog’]
Vectorize
Count the occurrences of each word in the distinct word vector.
“The quick brown fox jumped over the lazy dogs”
Tokenized to:
[‘the’, ‘quick’, ‘brown’, ‘fox’, ‘jump’, ‘over’, ‘lazy’, ‘dog’]
Vectorized to:
[2, 1, 1, 1, 1, 1, 1, 1]
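
A minimal sketch of this counting step, assuming the sentence has already been scrubbed and stemmed into a token list. The `vectorize` helper is a hypothetical name used only for illustration.

```python
from collections import Counter

def vectorize(tokens):
    """Return the distinct words (first-seen order) and their occurrence counts."""
    # dict.fromkeys keeps one entry per word and preserves insertion order.
    vocabulary = list(dict.fromkeys(tokens))
    counts = Counter(tokens)
    return vocabulary, [counts[word] for word in vocabulary]

# Stemmed tokens of "The quick brown fox jumped over the lazy dogs".
tokens = ["the", "quick", "brown", "fox", "jump", "over", "the", "lazy", "dog"]
vocabulary, vector = vectorize(tokens)
print(vocabulary)  # ['the', 'quick', 'brown', 'fox', 'jump', 'over', 'lazy', 'dog']
print(vector)      # [2, 1, 1, 1, 1, 1, 1, 1]
```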
Bag-of-Words Comparison
Doc 1: “The quick brown fox jumped over the lazy dogs”
Doc 2: “The quick fox ran away from the dog”
After processing, the corpus attribute vector is:
[‘quick’, ‘brown’, ‘fox’, ‘jump’, ‘over’, ‘lazy’, ‘dog’, ‘run’, ‘away’, ‘from’]
The two documents vectorize to:
Doc 1: [1, 1, 1, 1, 1, 1, 1, 0, 0, 0]
Doc 2: [1, 0, 1, 0, 0, 0, 1, 1, 1, 1]
Sentences are transformed into numeric vectors!
[Figure: the words brown, fox, lazy, run, quick and dog]
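
The bag-of-words comparison can be reproduced with a short sketch. It assumes both documents have already been scrubbed, stemmed and stripped of "the"; `bag_of_words` is a hypothetical helper name.

```python
def bag_of_words(docs):
    """Build a shared vocabulary and one count vector per tokenized document."""
    # Distinct words across all documents, in first-seen order.
    vocabulary = list(dict.fromkeys(word for doc in docs for word in doc))
    # One vector per document: how often each vocabulary word occurs in it.
    vectors = [[doc.count(word) for word in vocabulary] for doc in docs]
    return vocabulary, vectors

doc1 = ["quick", "brown", "fox", "jump", "over", "lazy", "dog"]
doc2 = ["quick", "fox", "run", "away", "from", "dog"]
vocabulary, (vec1, vec2) = bag_of_words([doc1, doc2])
print(vocabulary)
print(vec1)  # [1, 1, 1, 1, 1, 1, 1, 0, 0, 0]
print(vec2)  # [1, 0, 1, 0, 0, 0, 1, 1, 1, 1]
```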
Similarity Measure
Cosine similarity calculates how close 2 numeric vectors are from the angle between them, which is like a distance measure between 2 points.
The problem has just been reduced to simple matrix algebra.
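
As a sketch, cosine similarity is the dot product of two vectors divided by the product of their lengths. The example below applies it to the two bag-of-words vectors from the comparison slide; the function name `cosine_similarity` is chosen here for illustration.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length numeric vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    # An all-zero vector has no direction; treat it as sharing nothing.
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

doc1 = [1, 1, 1, 1, 1, 1, 1, 0, 0, 0]
doc2 = [1, 0, 1, 0, 0, 0, 1, 1, 1, 1]
score = cosine_similarity(doc1, doc2)
print(round(score, 3))  # 0.463
```

The two documents share 3 words (quick, fox, dog), so the score is 3 / (sqrt(7) * sqrt(6)). A score of 1 means the vectors point in the same direction and 0 means they share no words.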
Bi-Gram Comparison
Because the same words are used across our videos, the bag-of-words similarity resulted in many false-positive matches.
The solution is to use a bi-gram algorithm where 2 consecutive words are extracted as one feature:
“The quick brown fox jumped over the lazy dogs”
becomes:
[‘the quick’, ‘quick brown’, ‘brown fox’, ‘fox jump’,
‘jump over’, ‘over the’, ‘the lazy’, ‘lazy dog’]
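
Bi-gram extraction can be sketched by pairing each token with the one that follows it. The `bigrams` helper is a hypothetical name, and the input is assumed to be the stemmed token list with stop words kept, as in the example above.

```python
def bigrams(tokens):
    """Join each pair of consecutive tokens into a single feature."""
    return [f"{first} {second}" for first, second in zip(tokens, tokens[1:])]

tokens = ["the", "quick", "brown", "fox", "jump", "over", "the", "lazy", "dog"]
features = bigrams(tokens)
print(features)
```

The resulting features can be counted and compared with cosine similarity in the same way as single words.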
Limitations
Certain phrases such as “Behind the scenes” are found frequently. This creates an artificially high similarity score even if the videos are dissimilar.
Possible solutions:
• Perform more custom data scrubbing
• Double-check by matching duration of videos
• Have the matches verified by a human
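
The duration double-check could look something like the sketch below. Everything in it is an assumption made for illustration: the function name `is_likely_match`, the similarity threshold of 0.8 and the 2-second duration tolerance are hypothetical values, not figures from the talk.

```python
def is_likely_match(similarity, duration_a, duration_b,
                    min_similarity=0.8, max_duration_gap=2.0):
    """Accept a text match only if the video durations (in seconds) also agree.

    The default thresholds are illustrative assumptions, not tuned values.
    """
    durations_agree = abs(duration_a - duration_b) <= max_duration_gap
    return similarity >= min_similarity and durations_agree

# Similar titles but very different lengths: flag for human review instead.
print(is_likely_match(0.92, 185.0, 31.0))   # False
# Similar titles and near-identical lengths: likely the same video.
print(is_likely_match(0.92, 185.0, 185.5))  # True
```

A check like this only suits identical videos. Derived videos such as trailers differ in length by design, so those matches would still need another rule or a human reviewer.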
Conclusion
I use Natural Language Processing to:
1. Identify similar videos across platforms
2. Tie assets together where some are identical videos while others are derived videos (such as trailers or promos).