Michael Alcorn, Sr. Software Engineer, Red Hat Inc. at MLconf SF 2017

Upload: mlconf

Post on 22-Jan-2018

Page 1: Michael Alcorn, Sr. Software Engineer, Red Hat Inc. at MLconf SF 2017

REPRESENTATION LEARNING @ RED HAT

Michael A. Alcorn ([email protected])

Machine Learning Engineer - Information Retrieval

https://sites.google.com/view/michaelaalcorn/

Page 2: Michael Alcorn, Sr. Software Engineer, Red Hat Inc. at MLconf SF 2017

Outline

Background

word2vec/url2vec

doc2vec/account2vec

Duplicate Detection

(batter|pitcher)2vec

MLconf Blog

Page 3: Michael Alcorn, Sr. Software Engineer, Red Hat Inc. at MLconf SF 2017

Background

Why?

Small amount (zero?) of labeled data for task

Lots of unlabeled data (labeled data for a different task?)

Can we use large amounts of unlabeled data to make better predictions?

Not the same as traditional unsupervised learning!

Representation learning

Transfer learning

Excellent chapter in Goodfellow et al.'s Deep Learning textbook

Article by Bengio et al.

Page 5: Michael Alcorn, Sr. Software Engineer, Red Hat Inc. at MLconf SF 2017

word2vec

Mikolov et al. (2013)

Deeplearning4j - "Word2vec"

Page 6: Michael Alcorn, Sr. Software Engineer, Red Hat Inc. at MLconf SF 2017

word2vec

Analogies

"x is to y as ? is to z"

x - y + z = ?

bash - shellshock + heartbleed = openssl

firefox - linux + windows = internet_explorer

openshift - cloud + storage = gluster

rhn_register - rhn + rhsm = subscription-manager
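
The analogy arithmetic on this slide can be sketched in a few lines. This is a minimal illustration, not the code behind the talk: the three-dimensional vectors are made up by hand (real word2vec embeddings are learned and have far more dimensions), and the `analogy` helper is a hypothetical name.

```python
import math

# Hypothetical 3-D toy embeddings, hand-written for illustration (not learned).
# Dimensions are roughly [bash-related, openssl-related, is-a-vulnerability].
EMBEDDINGS = {
    "bash": [1.0, 0.0, 0.0],
    "shellshock": [1.0, 0.0, 1.0],
    "openssl": [0.0, 1.0, 0.0],
    "heartbleed": [0.0, 1.0, 1.0],
    "gluster": [0.2, 0.1, 0.1],
}


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def analogy(x, y, z, embeddings=EMBEDDINGS):
    """Return the word closest to x - y + z, excluding the three inputs."""
    target = [
        a - b + c
        for a, b, c in zip(embeddings[x], embeddings[y], embeddings[z])
    ]
    candidates = (w for w in embeddings if w not in {x, y, z})
    return max(candidates, key=lambda w: cosine(target, embeddings[w]))


# bash - shellshock + heartbleed = ?
result = analogy("bash", "shellshock", "heartbleed")
```

Excluding the input words from the candidates is a common convention, since the nearest vector to x - y + z is often one of the inputs themselves.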

Page 7: Michael Alcorn, Sr. Software Engineer, Red Hat Inc. at MLconf SF 2017

Naming Colors

Blog post by Janelle Shane mapping RGB values to color names

Results are pretty underwhelming for those in the know

Can word embeddings improve (GitHub)?

Page 8: Michael Alcorn, Sr. Software Engineer, Red Hat Inc. at MLconf SF 2017

url2vec

Tasks concerning URLs

Search - returning relevant content

Troubleshooting - recommending related articles

Obvious method - look at text

Alternative/enhanced method - use customer browsing behavior as additional contextual clues

Page 9: Michael Alcorn, Sr. Software Engineer, Red Hat Inc. at MLconf SF 2017

url2vec

How?

Treat each day of browsing activity as a "sentence"

Treat each URL as a "word"

Run word2vec!
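
A rough sketch of the corpus-building step follows. It assumes a hypothetical page-view log of (visitor, date, URL) tuples in time order and assumes "a day of browsing" means one visitor's day. The log format, the URLs, and the `build_sentences` name are all illustrative, not taken from the talk.

```python
from collections import defaultdict

# Hypothetical page-view log: (visitor_id, date, url), already in time order.
PAGE_VIEWS = [
    ("visitor_a", "2017-11-01", "/solutions/1"),
    ("visitor_a", "2017-11-01", "/solutions/2"),
    ("visitor_b", "2017-11-01", "/articles/3"),
    ("visitor_a", "2017-11-02", "/solutions/2"),
    ("visitor_b", "2017-11-01", "/solutions/1"),
]


def build_sentences(page_views):
    """Group URLs into one "sentence" per (visitor, day), keeping visit order."""
    days = defaultdict(list)
    for visitor, day, url in page_views:
        days[(visitor, day)].append(url)
    # Dicts preserve insertion order, so sentences come out in first-seen order.
    return list(days.values())


sentences = build_sentences(PAGE_VIEWS)

# The sentences can then be handed to any word2vec implementation. With gensim
# installed (an assumption; the talk does not name a library), for example:
#   from gensim.models import Word2Vec
#   model = Word2Vec(sentences=sentences, vector_size=100, window=5, min_count=1)
```

Each URL then gets a vector shaped by the URLs that tend to be visited around it, which is the contextual signal the previous slide describes.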

Page 11: Michael Alcorn, Sr. Software Engineer, Red Hat Inc. at MLconf SF 2017

doc2vec

Le and Mikolov (2014)

"NLP 05: From Word2vec to Doc2vec: a simple example with Gensim"

Page 12: Michael Alcorn, Sr. Software Engineer, Red Hat Inc. at MLconf SF 2017

customer2vec

Why?

Data-driven segmentation

Same idea as url2vec except now we treat each account as a "document" of many "sentences" (different browsing days)
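
One way to picture the account-as-document idea is to tag every browsing-day sentence with the account it came from. This is a sketch under stated assumptions: the log format, the account IDs, and the `build_tagged_sentences` helper are hypothetical, and the talk does not say how documents were actually assembled.

```python
from collections import defaultdict

# Hypothetical page-view log: (account_id, date, url), already in time order.
PAGE_VIEWS = [
    ("acct_1", "2017-11-01", "/solutions/a"),
    ("acct_1", "2017-11-01", "/solutions/b"),
    ("acct_1", "2017-11-02", "/articles/c"),
    ("acct_2", "2017-11-01", "/solutions/a"),
]


def build_tagged_sentences(page_views):
    """Return (urls, [account_id]) pairs, one per account browsing day."""
    days = defaultdict(list)
    for account, day, url in page_views:
        days[(account, day)].append(url)
    # Every sentence from the same account shares that account's tag, so the
    # account vector is trained on all of its browsing days.
    return [(urls, [account]) for (account, _day), urls in days.items()]


tagged = build_tagged_sentences(PAGE_VIEWS)

# With gensim installed (an assumption), each pair maps onto a TaggedDocument:
#   from gensim.models.doc2vec import Doc2Vec, TaggedDocument
#   corpus = [TaggedDocument(words=urls, tags=tags) for urls, tags in tagged]
#   model = Doc2Vec(corpus, vector_size=100, min_count=1)
```

The learned per-account vectors could then be clustered for the data-driven segmentation mentioned above.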


Page 15: Michael Alcorn, Sr. Software Engineer, Red Hat Inc. at MLconf SF 2017

Duplicate Detection

There are a number of "duplicate" KCS solutions on the Customer Portal

Muddy search results

How can we identify candidate duplicate documents?

Obvious approach - compare text (e.g., tf-idf)

Bag-of-words loses any structural meaning behind text

Can we learn better representations?

Title is essentially a summary of the solution content

Learn representations of body that are similar to title representations (like the DSSM; my code)
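
The tf-idf baseline and its bag-of-words limitation can be illustrated with a small pure-Python sketch. The three example texts are invented, the idf uses one common smoothed variant (other weightings exist), and the function names are hypothetical.

```python
import math
from collections import Counter


def tfidf_vectors(docs):
    """Return one {term: weight} dict per document, using a smoothed idf."""
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    # Document frequency: how many documents contain each term.
    df = Counter(term for tokens in tokenized for term in set(tokens))
    vectors = []
    for tokens in tokenized:
        tf = Counter(tokens)
        vectors.append({
            term: count * (math.log((1 + n_docs) / (1 + df[term])) + 1.0)
            for term, count in tf.items()
        })
    return vectors


def cosine(u, v):
    """Cosine similarity between two sparse {term: weight} vectors."""
    dot = sum(w * v.get(term, 0.0) for term, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


# Invented example texts (not real KCS solutions).
docs = [
    "kernel panic after update",
    "update after kernel panic",
    "how to register a system",
]
vectors = tfidf_vectors(docs)
sim_reordered = cosine(vectors[0], vectors[1])  # same words, different order
sim_unrelated = cosine(vectors[0], vectors[2])  # no shared words
```

The first two texts contain the same words in a different order, so their tf-idf vectors are identical and the similarity is maximal. Word order plays no role, which is the loss of structure the slide refers to.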

Page 16: Michael Alcorn, Sr. Software Engineer, Red Hat Inc. at MLconf SF 2017

Deep Semantic Similarity Model

Jianfeng Gao - "Deep Learning for Web Search and Natural Language Processing"
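
A DSSM-style objective scores a pair by the cosine similarity of its two learned vectors and turns the scores into probabilities with a softmax over the true match and some non-matching candidates. The sketch below applies that to the body/title setup from the previous slide. Using other solutions' titles as the negatives is an assumption here, the vectors are hand-written stand-ins for network outputs, and the function names and `gamma` value are illustrative.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length dense vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def title_posterior(body_vec, title_vecs, gamma=10.0):
    """Softmax over gamma-scaled cosine similarities to each candidate title."""
    scores = [gamma * cosine(body_vec, title) for title in title_vecs]
    top = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(score - top) for score in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dssm_loss(body_vec, title_vecs, true_index, gamma=10.0):
    """Negative log-probability of the body's own title among the candidates."""
    return -math.log(title_posterior(body_vec, title_vecs, gamma)[true_index])


# Stand-in vectors: index 0 is the body's own title, the rest are negatives.
body = [1.0, 0.0]
titles = [[0.9, 0.1], [0.0, 1.0], [-1.0, 0.2]]
probs = title_posterior(body, titles)
loss = dssm_loss(body, titles, true_index=0)
```

Minimizing this loss pulls a body's vector toward its own title and away from the other titles. Solutions whose body vectors end up close together are then candidate duplicates.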

Page 17: Michael Alcorn, Sr. Software Engineer, Red Hat Inc. at MLconf SF 2017

(batter|pitcher)2vec (GitHub)

Can we learn meaningful representations of MLB players?

Accurate representations could be used to simulate games and inform trades

Find undervalued/overvalued players


Page 21: Michael Alcorn, Sr. Software Engineer, Red Hat Inc. at MLconf SF 2017

THANK YOU!
