Michael Alcorn, Sr. Software Engineer, Red Hat Inc., at MLconf SF 2017
TRANSCRIPT
REPRESENTATION LEARNING @ RED HAT
Michael A. Alcorn ([email protected])
Machine Learning Engineer - Information Retrieval
https://sites.google.com/view/michaelaalcorn/
1
Outline
Background
word2vec/url2vec
doc2vec/account2vec
Duplicate Detection
(batter|pitcher)2vec
MLconf Blog
2
Background
Why?
Small amount (zero?) of labeled data for task
Lots of unlabeled data (labeled data for a different task?)
Can we use large amounts of unlabeled data to make better predictions?
Not the same as traditional unsupervised learning!
Representation learning
Transfer learning
Excellent chapter in Goodfellow et al.'s Deep Learning textbook
Article by Bengio et al.
3
word2vec
NVIDIA - "Introduction to Neural Machine Translation with GPUs (Part 2)"
4
word2vec
Deeplearning4j - "Word2vec"
Mikolov et al. (2013)
5
word2vec
Analogies
"x is to y as ? is to z"
x - y + z = ?
bash - shellshock + heartbleed = openssl
firefox - linux + windows = internet_explorer
openshift - cloud + storage = gluster
rhn_register - rhn + rhsm = subscription-manager
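The analogy arithmetic above can be sketched in a few lines. This is a minimal illustration, not the code behind the slide: the 3-d vectors are invented so that the example works, whereas real word2vec vectors are learned from text. The `analogy` helper and `EMBEDDINGS` table are hypothetical names.

```python
import math

# Toy 3-d embeddings, invented purely for illustration (assumption);
# real word2vec vectors are learned and have many more dimensions.
EMBEDDINGS = {
    "bash":       [1.0, 0.0, 0.0],
    "shellshock": [1.0, 0.0, 1.0],
    "openssl":    [0.0, 1.0, 0.0],
    "heartbleed": [0.0, 1.0, 1.0],
    "firefox":    [0.5, 0.5, 0.0],
}


def cosine(u, v):
    """Cosine similarity between two equal-length dense vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def analogy(x, y, z, embeddings):
    """Return the word whose vector is closest to x - y + z."""
    target = [
        a - b + c
        for a, b, c in zip(embeddings[x], embeddings[y], embeddings[z])
    ]
    # Skip the query words themselves so they cannot be returned.
    candidates = (w for w in embeddings if w not in (x, y, z))
    return max(candidates, key=lambda w: cosine(target, embeddings[w]))


answer = analogy("bash", "shellshock", "heartbleed", EMBEDDINGS)
print(answer)
```

With these made-up vectors the query returns `openssl`, mirroring the first analogy on the slide. Ranking by cosine similarity rather than Euclidean distance is a common choice for this kind of lookup.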
6
Naming Colors
Blog post by Janelle Shane mapping RGB values to color names
Results are pretty underwhelming for those in the know
Can word embeddings improve (GitHub)?
7
url2vec
Tasks concerning URLs
Search - returning relevant content
Troubleshooting - recommending related articles
Obvious method - look at text
Alternative/enhanced method - use customer browsing behavior as additional contextual clues
8
url2vec
How?
Treat each day of browsing activity as a "sentence"
Treat each URL as a "word"
Run word2vec!
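The recipe above mostly comes down to data preparation. The sketch below shows one way the "sentences" might be built from a click log. The log layout, the account and URL tokens, and the choice to group by account as well as by day are all assumptions for illustration.

```python
from collections import defaultdict

# Hypothetical click log: (account_id, date, url) tuples in click order.
# The layout and all values are made up for illustration (assumption).
CLICKS = [
    ("acct1", "2017-11-01", "url_a"),
    ("acct1", "2017-11-01", "url_b"),
    ("acct2", "2017-11-01", "url_b"),
    ("acct1", "2017-11-02", "url_c"),
    ("acct2", "2017-11-01", "url_c"),
]


def build_sentences(clicks):
    """Group URLs into one 'sentence' per (account, day), keeping click order."""
    sessions = defaultdict(list)
    for account, day, url in clicks:
        sessions[(account, day)].append(url)
    # Sort the keys so the output order is deterministic.
    return [sessions[key] for key in sorted(sessions)]


sentences = build_sentences(CLICKS)
print(sentences)
```

The result is a list of token lists, the input shape that word2vec implementations such as gensim's `Word2Vec` accept, so each URL ends up with its own vector.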
9
url2vec
https://access.redhat.com/solutions/25190
https://access.redhat.com/solutions/10107
Application: ScatterPlot3D
10
doc2vec
Le and Mikolov (2014)
"NLP 05: From Word2vec to Doc2vec: a simple example with Gensim"
11
customer2vec
Why?
Data-driven segmentation
Same idea as url2vec except now we treat each account as a "document" of many "sentences" (different browsing days)
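To make the account-as-document idea concrete, here is a minimal sketch of how browsing days might be tagged with their account before training a paragraph-vector model. `TaggedSentence`, `tag_sentences`, and the sample data are hypothetical names and values, not taken from the talk.

```python
from collections import namedtuple

# One training example: a sentence of URL tokens plus the tag of the
# "document" (account) it belongs to. Hypothetical structure (assumption).
TaggedSentence = namedtuple("TaggedSentence", ["words", "tag"])

# Made-up browsing days per account, one inner list per day.
BROWSING_DAYS = {
    "acct1": [["url_a", "url_b"], ["url_c"]],
    "acct2": [["url_b", "url_c"]],
}


def tag_sentences(browsing_days):
    """Flatten accounts into (sentence, account tag) training examples."""
    return [
        TaggedSentence(words=day, tag=account)
        for account in sorted(browsing_days)
        for day in browsing_days[account]
    ]


examples = tag_sentences(BROWSING_DAYS)
for example in examples:
    print(example.tag, example.words)
```

In a doc2vec-style model, every sentence sharing a tag contributes to the same document vector, so each account would end up with a single embedding that can be used for segmentation.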
12
Duplicate Detection
There are a number of "duplicate" KCS solutions on the Customer Portal
Muddy search results
How can we identify candidate duplicate documents?
Obvious approach - compare text (e.g., tf-idf)
Bag-of-words loses any structural meaning behind text
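A small sketch of the tf-idf baseline shows the limitation. The weighting here uses a smoothed idf, which is one common variant and an assumption on my part, and the example titles are made up.

```python
import math
from collections import Counter


def tfidf_vectors(docs):
    """Return one {term: weight} dict per document using smoothed tf-idf."""
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    # Document frequency: number of documents containing each term.
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))
    vectors = []
    for tokens in tokenized:
        counts = Counter(tokens)
        vectors.append({
            # Smoothed idf keeps terms found in every document nonzero.
            term: count * (math.log((1 + n_docs) / (1 + doc_freq[term])) + 1)
            for term, count in counts.items()
        })
    return vectors


def cosine(u, v):
    """Cosine similarity between two sparse {term: weight} vectors."""
    dot = sum(weight * v.get(term, 0.0) for term, weight in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v)


# Made-up titles: the first two differ only in word order.
DOCS = [
    "client cannot reach server",
    "server cannot reach client",
    "how to register a system",
]
vecs = tfidf_vectors(DOCS)
print(round(cosine(vecs[0], vecs[1]), 6))
print(round(cosine(vecs[0], vecs[2]), 6))
```

The first two titles describe opposite situations yet receive a similarity of essentially 1, because word order is discarded. The third title shares no terms with the first and scores 0.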
Can we learn better representations?
Title is essentially a summary of the solution content
Learn representations of body that are similar to title representations (like the DSSM; my code)
15
Deep Semantic Similarity Model
Jianfeng Gao - "Deep Learning for Web Search and Natural Language Processing"
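As a rough illustration of a DSSM-style objective applied to titles and bodies, the sketch below scores a body vector against candidate title vectors with a softmax over scaled cosine similarities. The vectors stand in for encoder outputs and are invented, and the `gamma` value and function names are assumptions. It is not the speaker's implementation.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length dense vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def title_posteriors(body_vec, title_vecs, gamma=10.0):
    """Softmax over scaled cosine similarities between a body and titles."""
    scores = [gamma * cosine(body_vec, t) for t in title_vecs]
    max_score = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - max_score) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dssm_loss(body_vec, title_vecs, positive_index=0, gamma=10.0):
    """Negative log-probability of the true title among the candidates."""
    probs = title_posteriors(body_vec, title_vecs, gamma)
    return -math.log(probs[positive_index])


# Invented 2-d vectors standing in for learned representations (assumption).
body = [1.0, 0.2]
titles = [[0.9, 0.1], [0.0, 1.0], [-1.0, 0.0]]  # index 0 is the true title
probs = title_posteriors(body, titles)
print(probs)
print(dssm_loss(body, titles))
```

Minimizing this loss pulls a body's representation toward its own title and away from the other titles in the candidate set. Bodies whose learned vectors land close together could then be flagged as candidate duplicates.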
16
(batter|pitcher)2vec (GitHub)
Can we learn meaningful representations of MLB players?
Accurate representations could be used to simulate games and inform trades
Find undervalued/overvalued players
17
(batter|pitcher)2vec (GitHub)
SI.com, NBCSports.com
19
(batter|pitcher)2vec
"Learning to Coach Football"
Wang and Zemel (2016)
20
THANK YOU!
21