NLP with H2O
TRANSCRIPT
Michal Kurka, Megan Kurka
Agenda
Today's Talk

H2O Overview
• What is H2O?
• H2O in R

Natural Language Processing
• Word2Vec Algorithm
• Converting Text to Vectors: An Example

Supervised Learning
• Introduction to Machine Learning
• Overview of Supervised Learning Algorithms
• Training Models in R and Flow
• Hyper-Parameter Tuning
High-Level Architecture
H2O in R
Using H2O with R
• REST API drives H2O from R
• All computations are performed in performance-optimized Java code in the H2O cluster
Reading Data into H2O with R
Reading Data from HDFS into H2O with R
Natural Language Processing
Word2Vec
• Learns vector representations of words by analyzing a large text corpus
• Word Embeddings = mapping of words to vectors from a high-dimensional space (100-1000 dimensions)
• Text sources: Google News, Wikipedia, Tweets, …
• Embeddings capture the meaning of the word
  • semantically similar words are close to each other
  • can be used to find synonyms & analogies
  • famous example:
    » man is to woman as king is to ??? (queen)
    » Vec(king) – Vec(man) + Vec(woman) = Vec(queen)
• Different variations of word2vec
  • Skip-Gram, CBOW
  • Hierarchical Softmax, Negative Sampling
[Figure: example 5-dimensional embedding vectors, e.g. alien = (-0.09, -0.14, -0.12, -0.06, 0.16), car = (-0.38, -0.11, 0.10, -0.26, -0.24)]
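The analogy arithmetic above can be sketched with toy vectors in plain R. The 5-dimensional embeddings below are invented purely for illustration (real word2vec embeddings have 100-1000 dimensions):

```r
# Invented toy embeddings -- not real word2vec output
emb <- list(
  king  = c(0.80, 0.10, 0.70, 0.20, 0.10),
  man   = c(0.70, 0.05, 0.10, 0.15, 0.05),
  woman = c(0.10, 0.75, 0.15, 0.10, 0.05),
  queen = c(0.20, 0.80, 0.75, 0.15, 0.10)
)

# Cosine similarity between two vectors
cosine <- function(a, b) sum(a * b) / (sqrt(sum(a * a)) * sqrt(sum(b * b)))

# Vec(king) - Vec(man) + Vec(woman) should land closest to Vec(queen)
target <- emb$king - emb$man + emb$woman
sims <- sapply(emb, cosine, b = target)
names(which.max(sims))  # "queen"
```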
Word2Vec

Source: twitter.com/DanilBaibak/status/844647217885581312
Word2Vec Algorithm
[Diagram: neural network with Input Layer, Hidden Layer, Output Layer]

• word2vec trains a Neural Network
  • single hidden layer
  • number of neurons in the hidden layer = length of the embeddings
• Uses a trick to formulate the problem as a supervised problem
  • the network is trained to predict target words based on given input words (a window sliding over the text)
  • the weights on the edges connecting the input layer to the hidden layer constitute the word embedding (weight matrix)
  • similar approach to auto-encoders
• Optimizations
  • Hierarchical Softmax – binary tree to represent output probabilities (implemented in H2O)
  • Negative Sampling – sample the output vectors to be updated
  • HogWild! – lock-free parallelization (ignores conflicts)
Word2Vec Algorithm

"A seemingly indestructible humanoid cyborg is sent from 2029 to 1984 to assassinate a waitress, whose unborn son will lead humanity in a war against the machines, while a soldier from that war is sent to protect her at all costs."

Given: humanoid
Predict: indestructible and cyborg

[Diagram: Skip-Gram network with "humanoid" at the Input Layer, a Hidden Layer, and "indestructible" and "cyborg" at the Output Layer]

• H2O supports the Skip-Gram architecture
• The network tries to predict the surrounding words (context) based on a given input word
• It treats each context-target pair as a new observation
• Skip-Gram works well with larger training data
  • represents even rare words and phrases well
• CBOW support is planned
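The sliding window that generates these context-target observations can be sketched in plain R; `skipgram_pairs` is a hypothetical helper written for this illustration, not an H2O function:

```r
# Turn a token sequence into skip-gram (input, target) observations
# using a window sliding over the text
skipgram_pairs <- function(tokens, window = 1) {
  pairs <- list()
  for (i in seq_along(tokens)) {
    # indices of the context words around position i
    context <- setdiff(max(1, i - window):min(length(tokens), i + window), i)
    for (j in context)
      pairs[[length(pairs) + 1]] <- c(input = tokens[i], target = tokens[j])
  }
  do.call(rbind, pairs)
}

tokens <- c("indestructible", "humanoid", "cyborg")
skipgram_pairs(tokens)
# "humanoid" contributes two observations, one per context word:
#   humanoid -> indestructible and humanoid -> cyborg
```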
Word2Vec – Usage

• word2vec embeddings are typically used in a pre-processing step of another ML algorithm
• we need to aggregate the word embeddings for word sequences of variable length (sentences, paragraphs, ...)
  • pooling
    • averaging (aka "mean pooling")
    • (TF-IDF) weighted averaging
    • not good for use cases where the order of words is important (Sentiment Analysis)
    • http://jxieeducation.com/static/research/documentembedding_poster.pdf
    • PCA & Fisher Vectors: http://www.cs.tau.ac.il/~wolf/papers/qagg.pdf
  • simply concatenate embeddings
    • for short inputs, e.g. job titles
  • use an extension of word2vec
    • Paragraph Vectors / doc2vec
    • adds a "paragraph token" as an input of the training process; it acts as a memory that remembers what is missing from the current context – or the topic of the paragraph
    • https://cs.stanford.edu/~quocle/paragraph_vector.pdf
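The two pooling options can be sketched in plain R; the embedding matrix and the weights below are invented for illustration (in practice the weights would be the words' TF-IDF scores):

```r
# Embeddings for the words of one sentence: one row per word
vecs <- rbind(
  alien = c(-0.09, -0.14, -0.12, -0.06, 0.16),
  car   = c(-0.38, -0.11,  0.10, -0.26, -0.24)
)

# Mean pooling: element-wise average over the words
mean_pooled <- colMeans(vecs)

# Weighted averaging: scale each word's row by its weight, then average
w <- c(alien = 0.8, car = 0.2)
weighted_pooled <- colSums(sweep(vecs, 1, w, "*")) / sum(w)
```

Both variants collapse a variable-length word sequence into one fixed-length vector, which is what a downstream supervised algorithm needs.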
Word2Vec in an Example ML Workflow

Use Case: Predict movie genre from a movie synopsis / plot summary.

1. word2vec workflow
   1. Tokenize plots: break up plots into separate words
   2. Filter words: remove stop words like "and", "the", "of"; remove too-short words
   3. Train a word2vec model
   4. Sanity check: find synonyms based on the word embeddings
   5. (optional) Evaluate the semantic accuracy of the model
      • but the best measure is always the performance on the overall problem (how well we predict the genres)
2. Use the model to transform plots to vectors
3. Use the plot vector representations with a supervised machine learning algorithm to predict the genre
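Steps 1.1-1.2 can be sketched in base R (the actual demo uses `h2o.tokenize`; the stop-word list below is a tiny illustrative subset):

```r
plot_text <- "A seemingly indestructible humanoid cyborg is sent from 2029 to 1984"
stop_words <- c("a", "and", "the", "of", "is", "to", "from")

# 1. Tokenize: break the plot into separate words on non-word characters
tokens <- tolower(unlist(strsplit(plot_text, "\\W+")))
# 2. Filter: drop stop words and too-short words
tokens <- tokens[!(tokens %in% stop_words) & nchar(tokens) > 2]
tokens
# "seemingly" "indestructible" "humanoid" "cyborg" "sent" "2029" "1984"
```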
Word2Vec in H2O – Key Functions

# Break plots into sequences of words
words <- h2o.tokenize(movies$plot, "\\\\W+")

# Build word2vec model
w2v_model <- h2o.word2vec(words, epochs = 10)

# Sanity check - find synonyms for the word 'teacher'
h2o.findSynonyms(w2v_model, "teacher", count = 5)

# Transform words into embeddings and aggregate for each plot
plot_vecs <- h2o.transform(w2v_model, words, aggregate_method = "AVERAGE")
Machine Learning
Rule-Based Model

Movie Plots
If Plot Contains:
• Alien, Future, Scientist → Sci-Fi
• Dog, Magic, Princess → Family
Machine Learning Model

[Chart: predicted probability (scale 0 to 100) for Sci-Fi and Family]

Machine Learning Model

[Diagram: Movie Plots fed to the model, which outputs Sci-Fi or Family]
Supervised Learning
Regression: How much will a movie cost to make?
Classification: What is the genre of a movie? Is it a Comedy? Drama? Action?
Overfitting

Finding the Signal in the Data vs. Memorizing the Data
Demo Time
Establish a Baseline
In R:

print("Build Gradient Boosted Machine with default parameters")
gbm_model <- h2o.gbm(x = myX, y = "genre", training_frame = train, validation_frame = test)

In Flow:
Performance Metrics in Flow
Hit Ratio / Scoring History

Performance Metrics in Flow
Confusion Matrix on Validation Data
Examining Results

We see that the model predicted a lot of movies that were Thriller as Drama.

Why?

Plot of a Drama from the training data: "Cordelia Gray is the reluctant owner of a ramshackle investigation agency following the suicide of her boss. Watching over her as she hunts down clues in the murky and sinister world of crime, is her straight-laced and intuitive office assistant Edith Sparshott."
Demo Time
Adding New Features
Features: Mean Per Class Error
• Pre-trained Word Embeddings: 0.637
• Pre-trained Word Embeddings + H2O Word Embeddings: 0.542
Examine Results

Variable Importance: Words with High C58
• spirituals
• percussion
• philharmonic
• orchestral
• concerto
• sonata
• hadyn
Examine Results

Variable Importance: Words with High C74
• einsteins
• patrolmen
• upperclassman
• speakeasies
• razzle
Early Stopping

Early stopping once the validation mean per class error doesn't improve by at least 0.01% for 5 consecutive scoring events:
• stopping_rounds = 5
• stopping_tolerance = 1e-4
• stopping_metric = "mean_per_class_error"
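The stopping rule can be sketched in plain R as a simplified illustration (H2O's actual implementation compares moving averages over the scoring history):

```r
# Stop when the best error of the last `stopping_rounds` scoring events
# fails to beat the earlier best by the relative tolerance (1e-4 = 0.01%)
should_stop <- function(history, stopping_rounds = 5, stopping_tolerance = 1e-4) {
  n <- length(history)
  if (n <= stopping_rounds) return(FALSE)
  best_before <- min(history[1:(n - stopping_rounds)])
  best_recent <- min(history[(n - stopping_rounds + 1):n])
  best_recent > best_before * (1 - stopping_tolerance)
}

# Validation mean per class error flattens out -> stop
should_stop(c(0.70, 0.65, 0.60, 0.60, 0.60, 0.60, 0.60, 0.60))  # TRUE
# Still improving -> keep training
should_stop(c(0.70, 0.65, 0.60, 0.55, 0.50, 0.45, 0.44, 0.43))  # FALSE
```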
Early Stopping
Overfitting
Examine Results

Model: Mean Per Class Error
• Pre-trained Word Embeddings: 0.637
• Pre-trained Word Embeddings + H2O Word Embeddings: 0.542
• Pre-trained Word Embeddings + H2O Word Embeddings + Early Stopping: 0.503
Grid Search
In R:

grid <- h2o.grid(algorithm = "gbm",
                 hyper_params = list(max_depth = 1:20),
                 search_criteria = list(strategy = "RandomDiscrete",
                                        max_runtime_secs = 3600),
                 ...)

In Flow:
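What the "RandomDiscrete" strategy does can be sketched in plain R (the `learn_rate` values are invented to make the grid two-dimensional; the slide's grid only varies `max_depth`):

```r
# Full Cartesian grid of hyper-parameter combinations
hyper_params <- list(max_depth = 1:20, learn_rate = c(0.01, 0.05, 0.1))
grid_all <- expand.grid(hyper_params)  # 20 x 3 = 60 combinations

# RandomDiscrete: train models for a random subset of the combinations,
# stopping when a budget such as max_runtime_secs is exhausted
sampled <- grid_all[sample(nrow(grid_all), 10), ]
nrow(sampled)  # 10 candidate models instead of 60
```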
Questions?