![Page 1: Tag Extraction Final Presentation - CS185CSpring2014](https://reader033.vdocuments.us/reader033/viewer/2022060120/55911be21a28ab9b758b47cd/html5/thumbnails/1.jpg)
Tag Extraction
George McBay, Naoki Nakatani
San Jose State University
CS185C Spring 2014
![Page 2: Tag Extraction Final Presentation - CS185CSpring2014](https://reader033.vdocuments.us/reader033/viewer/2022060120/55911be21a28ab9b758b47cd/html5/thumbnails/2.jpg)
Agenda
● Problem Description
● ETL
● Data Analysis
● Machine Learning
● Optimization
  ○ Feature Engineering
    - Title vs Body
    - Stop Words
  ○ Multi-label Classification
● Apache Spark
![Page 4: Tag Extraction Final Presentation - CS185CSpring2014](https://reader033.vdocuments.us/reader033/viewer/2022060120/55911be21a28ab9b758b47cd/html5/thumbnails/4.jpg)
Problem
Given a question with a title and a body, can we automatically generate tags for it?
Example question
● Title: Where can I find the LaTeX3 manual?
● Body: Few month ago I saw a big pdf-manual of all LaTeX3-packages and the new syntax. I think it was bigger than 300 pages. I can't find it on the web. Does anyone have a link?
● Tags: documentation, latex3, expl3
![Page 5: Tag Extraction Final Presentation - CS185CSpring2014](https://reader033.vdocuments.us/reader033/viewer/2022060120/55911be21a28ab9b758b47cd/html5/thumbnails/5.jpg)
Dataset
File:
● Train.csv
● Test.csv
Fields:
● id, title, body, tags (Train)
● id, title, body (Test)
Characteristics:
● Quoted CSV
● Body contains \n
● Tags separated by spaces
● Entries delimited by \0
[Figure: layout of the quoted CSV records, each terminated by \0; a body field may span several lines]
![Page 6: Tag Extraction Final Presentation - CS185CSpring2014](https://reader033.vdocuments.us/reader033/viewer/2022060120/55911be21a28ab9b758b47cd/html5/thumbnails/6.jpg)
Working Environment
● Mac OS 10.9.1
● Apache Hadoop 1.2.1
● Apache Mahout 0.8
● Apache Spark 0.9
![Page 8: Tag Extraction Final Presentation - CS185CSpring2014](https://reader033.vdocuments.us/reader033/viewer/2022060120/55911be21a28ab9b758b47cd/html5/thumbnails/8.jpg)
ETL
Extract: Assume the data has already been extracted from the website
Transform: Use OpenCSV
1. Remove whitespace (' ', '\n', '\t')
2. Combine fields with '\t'
3. Write to a TSV file
Load: Upload to HDFS
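The transform step can be sketched in a few lines of Python. This is only an illustration: the slides used OpenCSV (Java), so Python's `csv` module is a swapped-in stand-in, and the helper names `records_from_raw` and `to_tsv_line` are hypothetical. It assumes each entry is one quoted CSV row terminated by `\0`, and it reads "remove whitespace" as collapsing whitespace runs inside a field to a single space.

```python
import csv
import io
import re

def records_from_raw(raw):
    """Split a raw dump on the \\0 entry delimiter and parse each entry as quoted CSV."""
    for entry in raw.split("\0"):
        if not entry.strip():
            continue  # skip empty chunks left between delimiters
        # csv.reader handles quoted fields that contain commas and newlines
        for row in csv.reader(io.StringIO(entry, newline="")):
            if row:
                yield row

def to_tsv_line(fields):
    """Collapse whitespace runs (spaces, newlines, tabs) in each field, then join with tabs."""
    return "\t".join(re.sub(r"\s+", " ", f).strip() for f in fields)

# Tiny made-up sample in the assumed format (not taken from the real dataset)
raw = '"1","How to parse?","<p>line one\nline two</p>","python csv"\n\0"2","Title\ttwo","body","java"\n\0'
tsv_lines = [to_tsv_line(r) for r in records_from_raw(raw)]
```

Normalizing the embedded newlines and tabs first is what makes a one-record-per-line TSV file safe to feed to line-oriented Map-Reduce jobs.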
![Page 10: Tag Extraction Final Presentation - CS185CSpring2014](https://reader033.vdocuments.us/reader033/viewer/2022060120/55911be21a28ab9b758b47cd/html5/thumbnails/10.jpg)
Data Analysis
Tag Occurrence Count
TSV file ⇒ Map-Reduce
● Input: <index, question>
● Mapper output: <tag, 1> for each tag
● Reducer output: <tag, count> for each tag
Most frequent tags (count, tag):
● 7785 c#
● 6788 java
● 6575 php
● 6135 javascript
● 5317 android
● 4949 jquery
● 3278 c++
● 3082 python
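The mapper and reducer described above can be mimicked in plain Python to see the data flow. This is a sketch, not the Hadoop job from the project: the function names are hypothetical, and it assumes the tags are the last tab-separated field of each TSV line, separated by spaces.

```python
from collections import defaultdict

def mapper(index, question):
    """Emit (tag, 1) for every tag of one question."""
    tags = question.split("\t")[-1].split()
    for tag in tags:
        yield tag, 1

def reducer(pairs):
    """Sum the ones per tag, giving (tag, count)."""
    counts = defaultdict(int)
    for tag, one in pairs:
        counts[tag] += one
    return dict(counts)

# Made-up TSV lines: id, title, body, tags
questions = ["1\tt\tb\tjava android", "2\tt\tb\tjava", "3\tt\tb\tphp"]
pairs = [kv for i, q in enumerate(questions) for kv in mapper(i, q)]
tag_counts = reducer(pairs)
```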
![Page 12: Tag Extraction Final Presentation - CS185CSpring2014](https://reader033.vdocuments.us/reader033/viewer/2022060120/55911be21a28ab9b758b47cd/html5/thumbnails/12.jpg)
Question Filtering for ML
Input: TSV file
Map-Reduce
● Input: <index, question>
● Mapper output: <index, question> if the question contains a top-5 tag
● Reducer output: <index, question>
Output: TSV file with the questions that have at least one of the top-5 tags
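The filtering mapper amounts to a membership test. A minimal sketch, assuming the same TSV layout (tags in the last field) and taking the top-5 set from the tag counts reported earlier in the deck; `has_top5_tag` is a hypothetical name.

```python
# Top five tags by occurrence count, as listed on the Data Analysis slide
TOP5 = {"c#", "java", "php", "javascript", "android"}

def has_top5_tag(question, top_tags=TOP5):
    """True when at least one of the question's tags is in the top-5 set."""
    tags = question.split("\t")[-1].split()
    return any(tag in top_tags for tag in tags)

# Made-up TSV lines: id, title, body, tags
questions = ["1\tt\tb\tjava android", "2\tt\tb\tlatex3 expl3", "3\tt\tb\tjquery php"]
kept = [q for q in questions if has_top5_tag(q)]
```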
![Page 13: Tag Extraction Final Presentation - CS185CSpring2014](https://reader033.vdocuments.us/reader033/viewer/2022060120/55911be21a28ab9b758b47cd/html5/thumbnails/13.jpg)
Machine Learning
● Problem
  ○ Can we classify questions into one of 5 categories (tags)?
Classification
● Naive Bayes classifier
● Details in the Mahout Classification presentation
![Page 14: Tag Extraction Final Presentation - CS185CSpring2014](https://reader033.vdocuments.us/reader033/viewer/2022060120/55911be21a28ab9b758b47cd/html5/thumbnails/14.jpg)
Machine Learning
● Correctly classified instances: 10209 (81.8816%)
● Incorrectly classified instances: 2259 (18.1184%)
● Total classified instances: 12468
![Page 16: Tag Extraction Final Presentation - CS185CSpring2014](https://reader033.vdocuments.us/reader033/viewer/2022060120/55911be21a28ab9b758b47cd/html5/thumbnails/16.jpg)
Title vs Body
Intuitively, the title is a short summary describing the body of the question
⇒ The title must be more important than the body!
How can we put more emphasis on the title?
● Build separate models for title and body, and give more weight to the title model?
● Prepend the title several times and feed the result into the regular model?
![Page 17: Tag Extraction Final Presentation - CS185CSpring2014](https://reader033.vdocuments.us/reader033/viewer/2022060120/55911be21a28ab9b758b47cd/html5/thumbnails/17.jpg)
Two models approach
The title model is not accurate...
● Titles are too short for the model to distinguish labels
● Longer text wins!
![Page 18: Tag Extraction Final Presentation - CS185CSpring2014](https://reader033.vdocuments.us/reader033/viewer/2022060120/55911be21a28ab9b758b47cd/html5/thumbnails/18.jpg)
Repeated title approach
Slight improvement!
● Testing against the training set: ~93% ⇒ ~95%
● Testing against the test set: ~80% ⇒ ~82%
Effect of repeating the title
● More stop words ⇒ no effect
● More keywords (if the title contains any)
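The repeated-title idea is simple enough to show directly. A minimal sketch: the function name `emphasize_title` and the default of 3 repeats are assumptions, since the slides do not say how many times the title was prepended.

```python
def emphasize_title(title, body, repeats=3):
    """Prepend the title `repeats` times so its words count more in a bag-of-words model."""
    return " ".join([title] * repeats + [body])

text = emphasize_title("Upgrade iPhone 3GS", "I have a iPhone 3GS at 4.2.1", repeats=2)
```

Because the classifier only sees word counts, repeating the title multiplies the counts of its words without any change to the model itself.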
![Page 20: Tag Extraction Final Presentation - CS185CSpring2014](https://reader033.vdocuments.us/reader033/viewer/2022060120/55911be21a28ab9b758b47cd/html5/thumbnails/20.jpg)
Diving into the model
● Top 10 words from each category
● Popular (redundant) words show up in all categories (I, it, code, etc.)
BUT
● Some words are specific to one category (activity for android, jquery for javascript, echo for php)
![Page 21: Tag Extraction Final Presentation - CS185CSpring2014](https://reader033.vdocuments.us/reader033/viewer/2022060120/55911be21a28ab9b758b47cd/html5/thumbnails/21.jpg)
Which words to drop?
Word count against TrainSmall.tsv?
● Total count: 19276034
Top 5:
● p - 827029
● the - 545950
● i - 476056
● to - 393027
● a - 362328
Problem
● Keywords have high counts too
  ○ 39th - http - count 51412
  ○ 63rd - java - count 35076
  ○ 91st - php - count 25135
Can't even throw away the first 100 words...
![Page 22: Tag Extraction Final Presentation - CS185CSpring2014](https://reader033.vdocuments.us/reader033/viewer/2022060120/55911be21a28ab9b758b47cd/html5/thumbnails/22.jpg)
Which words to drop?
Word count against ordinary English text?
● 20 books from gutenberg.org
● Total count: 1041565
● A lot less technical! (java appears only 4 times, probably as the Indonesian island)
● Safe to throw away 1959 words (those with a count > 50)
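The stop-word selection above can be sketched as a count-and-threshold step over a non-technical reference corpus. Everything here is illustrative: the helper names are hypothetical, the tokenizer regex is an assumption, and the toy corpus and `threshold=2` stand in for the 20 Gutenberg books and the count > 50 cutoff.

```python
import re
from collections import Counter

def word_counts(texts):
    """Count lower-cased tokens across all texts."""
    counts = Counter()
    for text in texts:
        counts.update(re.findall(r"[a-z0-9#+]+", text.lower()))
    return counts

def frequent_words(counts, threshold=50):
    """Words seen more than `threshold` times in the reference corpus."""
    return {w for w, c in counts.items() if c > threshold}

def drop_stop_words(text, stop_words):
    """Remove stop words from a whitespace-tokenized text."""
    return " ".join(w for w in text.split() if w.lower() not in stop_words)

reference = ["the cat and the dog", "the island of java", "the end"]
stop_words = frequent_words(word_counts(reference), threshold=2)
cleaned = drop_stop_words("the java code", stop_words)
```

Counting against ordinary English rather than the training set keeps technical keywords such as java out of the stop list, since they are rare in the reference text.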
![Page 23: Tag Extraction Final Presentation - CS185CSpring2014](https://reader033.vdocuments.us/reader033/viewer/2022060120/55911be21a28ab9b758b47cd/html5/thumbnails/23.jpg)
BUT
![Page 24: Tag Extraction Final Presentation - CS185CSpring2014](https://reader033.vdocuments.us/reader033/viewer/2022060120/55911be21a28ab9b758b47cd/html5/thumbnails/24.jpg)
Not much improvement...
● Due to the tf-idf measure, which already discounts common words
  ○ Less weight for words appearing in many documents
  ○ More weight for words appearing only in specific documents
![Page 26: Tag Extraction Final Presentation - CS185CSpring2014](https://reader033.vdocuments.us/reader033/viewer/2022060120/55911be21a28ab9b758b47cd/html5/thumbnails/26.jpg)
Any room for improvement?
What is the source of error?
● android ⇔ java ⇒ both are Java
● javascript ⇔ php ⇒ both are web-related
● java classified as c# ⇒ many questions have both tags
![Page 27: Tag Extraction Final Presentation - CS185CSpring2014](https://reader033.vdocuments.us/reader033/viewer/2022060120/55911be21a28ab9b758b47cd/html5/thumbnails/27.jpg)
Any room for improvement?
No problem if we can give multiple labels to one question!
![Page 28: Tag Extraction Final Presentation - CS185CSpring2014](https://reader033.vdocuments.us/reader033/viewer/2022060120/55911be21a28ab9b758b47cd/html5/thumbnails/28.jpg)
Multi-label classification
● Modifications from the previous classification task
  ○ Top-5 tags ⇒ top-1000 tags
  ○ 1 tag per question ⇒ 5 tags per question (pick the 5 most probable tags)
  ○ 1 question learned only once ⇒ 1 question with multiple tags learned multiple times
[Figure: one question expanded into (tag1, body) and (tag2, body) training examples that feed the model]
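Two of these modifications are easy to illustrate: expanding a multi-tag question into one training example per tag, and picking the most probable tags at prediction time. This is a sketch under assumptions; `expand_by_tag` and `top_k_tags` are hypothetical names, and the per-tag scores are made up rather than produced by the classifier.

```python
import heapq

def expand_by_tag(tags, body):
    """Turn one multi-tag question into one (tag, body) training example per tag."""
    return [(tag, body) for tag in tags]

def top_k_tags(scores, k=5):
    """Return the k tags with the highest score, best first."""
    return heapq.nlargest(k, scores, key=scores.get)

examples = expand_by_tag(["iphone", "ios"], "question text")
# Hypothetical per-tag scores for one question
predicted = top_k_tags({"iphone": 0.5, "ios": 0.3, "php": 0.1, "osx": 0.2}, k=2)
```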
![Page 29: Tag Extraction Final Presentation - CS185CSpring2014](https://reader033.vdocuments.us/reader033/viewer/2022060120/55911be21a28ab9b758b47cd/html5/thumbnails/29.jpg)
Good outcome (Example 1)
TITLE: Upgrade iPhone 3GS from iOS 4.2.1 to 4.3.x
BODY: <p>Hi folks I have a iPhone 3GS at 4.2.1 and want to upgrade it to 4.3.x for testing. I have read some articles about it but it seems that those are too old and cannot work. </p> <p>Does anyone have some experience in doing this or does apple provide tutorials for developers in this?</p> <p>A lot of thanks.</p>
Actual tags
● iphone
● ios
● upgrade
Predicted tags
● iphone
● ios
● osx
● objective-c
● php
![Page 30: Tag Extraction Final Presentation - CS185CSpring2014](https://reader033.vdocuments.us/reader033/viewer/2022060120/55911be21a28ab9b758b47cd/html5/thumbnails/30.jpg)
GREAT outcome (Example 2)
TITLE: Is it possible to display an image in text field in html?
BODY: <p>Can we display image inside a text field in <code>html</code>? </p> <p><strong>Edit</strong></p> <p>What I want to do is to have an <code>editable</code> area, and want to add <code>html</code> objects inside it(i.e. button, image ..etc)</p>
Actual tags
● javascript
● jquery
● html
● css
● web
Predicted tags
● javascript
● jquery ⇒ never appears in the text!
● html
● c#
● php
![Page 31: Tag Extraction Final Presentation - CS185CSpring2014](https://reader033.vdocuments.us/reader033/viewer/2022060120/55911be21a28ab9b758b47cd/html5/thumbnails/31.jpg)
Stats
Row: number of actual tags assigned to one question
Col: number of predicted tags that are also in the actual tag set
[Ex] Out of a total of 32798 questions that have 2 tags:
● For 14541 questions, the model suggested both actual tags.
● For 13922 questions, the model suggested 1 of the 2 actual tags.
● For 4335 questions, the model suggested neither of the actual tags.
![Page 32: Tag Extraction Final Presentation - CS185CSpring2014](https://reader033.vdocuments.us/reader033/viewer/2022060120/55911be21a28ab9b758b47cd/html5/thumbnails/32.jpg)
How to evaluate
Generous evaluator
If the model gets at least 1 tag correct, approve it!
Total accuracy = 83.55% (B)
![Page 33: Tag Extraction Final Presentation - CS185CSpring2014](https://reader033.vdocuments.us/reader033/viewer/2022060120/55911be21a28ab9b758b47cd/html5/thumbnails/33.jpg)
How to evaluate
Strict evaluator
Never approve unless the model gets all tags correct!
Total accuracy = 43.04% (F)
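The two evaluators can be written as set operations. This is a sketch of one reading of the slides: "all correct" is taken to mean that every actual tag appears among the predictions, and the function names are hypothetical. The example pairs are the two questions shown earlier, and the last lines apply the same rules to the 2-tag row of the Stats slide (14541 / 13922 / 4335).

```python
def generous_hit(actual, predicted):
    """Approve when at least one predicted tag is an actual tag."""
    return bool(set(actual) & set(predicted))

def strict_hit(actual, predicted):
    """Approve only when every actual tag was predicted."""
    return set(actual) <= set(predicted)

def accuracy(pairs, hit):
    """Fraction of (actual, predicted) pairs approved by the given evaluator."""
    return sum(hit(a, p) for a, p in pairs) / len(pairs)

pairs = [
    (["iphone", "ios", "upgrade"], ["iphone", "ios", "osx", "objective-c", "php"]),
    (["javascript", "jquery", "html", "css", "web"], ["javascript", "jquery", "html", "c#", "php"]),
]
generous = accuracy(pairs, generous_hit)
strict = accuracy(pairs, strict_hit)

# Same rules applied to the questions that have exactly 2 tags
both, one, neither = 14541, 13922, 4335
generous_2tag = (both + one) / (both + one + neither)
strict_2tag = both / (both + one + neither)
```

Both example questions pass the generous evaluator and fail the strict one, which shows how far apart the two headline numbers can be on the same predictions.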
![Page 34: Tag Extraction Final Presentation - CS185CSpring2014](https://reader033.vdocuments.us/reader033/viewer/2022060120/55911be21a28ab9b758b47cd/html5/thumbnails/34.jpg)
Conclusion for performance
● Overall, good!
  ○ The predicted tag set is relatively close to the actual tag set (Apple-related, Web-related)
● But not there yet...
  ○ Almost impossible to distinguish versions (c#-3.0, c#-4.0 ⇒ c#) or sub-tags (facebook-graph-api, facebook-like ⇒ facebook)
  ○ Still suggests unrelated tags (php and python everywhere!)
![Page 36: Tag Extraction Final Presentation - CS185CSpring2014](https://reader033.vdocuments.us/reader033/viewer/2022060120/55911be21a28ab9b758b47cd/html5/thumbnails/36.jpg)
Spark
Advantages:
- Easy to get started with
- Interactive shell
- Less code to write
![Page 37: Tag Extraction Final Presentation - CS185CSpring2014](https://reader033.vdocuments.us/reader033/viewer/2022060120/55911be21a28ab9b758b47cd/html5/thumbnails/37.jpg)
Spark
Disadvantages:
- Few references available for MLlib
- Still new
![Page 38: Tag Extraction Final Presentation - CS185CSpring2014](https://reader033.vdocuments.us/reader033/viewer/2022060120/55911be21a28ab9b758b47cd/html5/thumbnails/38.jpg)
Spark
● Used PySpark, the Python interface to Spark
● Implemented the ML model from the ground up using Python dictionaries and map-reduce procedures
![Page 39: Tag Extraction Final Presentation - CS185CSpring2014](https://reader033.vdocuments.us/reader033/viewer/2022060120/55911be21a28ab9b758b47cd/html5/thumbnails/39.jpg)
How It Works
5 basic procedures used:
● map
● flatMap
● reduce
● reduceByKey
● collectAsMap
![Page 40: Tag Extraction Final Presentation - CS185CSpring2014](https://reader033.vdocuments.us/reader033/viewer/2022060120/55911be21a28ab9b758b47cd/html5/thumbnails/40.jpg)
How It Works
key_val = line.flatMap(~).map(~)
LINE ⇒ (a, 1) (b, 1) (c, 1) (d, 1) (a, 1) (d, 1)
key_val = key_val.reduceByKey(~)
(a, 1) (b, 1) (c, 1) (d, 1) (a, 1) (d, 1) ⇒ (a, 2) (b, 1) (c, 1) (d, 2)
![Page 41: Tag Extraction Final Presentation - CS185CSpring2014](https://reader033.vdocuments.us/reader033/viewer/2022060120/55911be21a28ab9b758b47cd/html5/thumbnails/41.jpg)
How It Works
dict = key_val.collectAsMap()
(a, 2) (b, 1) (c, 1) (d, 2) ⇒ {a : 2, b : 1, c : 1, d : 2}
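The flow on these slides can be traced without a cluster by emulating each Spark step in plain Python. This is a stand-in for illustration, not PySpark itself: the comments name the RDD operation each step corresponds to, and the two input lines are chosen to reproduce the a/b/c/d example from the slides.

```python
from collections import defaultdict

lines = ["a b c d", "a d"]

# flatMap: each line yields several words, flattened into one sequence
words = [w for line in lines for w in line.split()]

# map: turn each word into a (key, 1) pair
key_val = [(w, 1) for w in words]

# reduceByKey: add up the values that share a key
reduced = defaultdict(int)
for key, value in key_val:
    reduced[key] += value

# collectAsMap: bring the (key, value) pairs back as an ordinary dict
counts = dict(reduced)
```

In Spark the first three steps run distributed across the workers, and only `collectAsMap` brings the result back to the driver, so it suits results small enough to fit in a dictionary.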
![Page 42: Tag Extraction Final Presentation - CS185CSpring2014](https://reader033.vdocuments.us/reader033/viewer/2022060120/55911be21a28ab9b758b47cd/html5/thumbnails/42.jpg)
How It Works
Model:
- Statistical model
- Matrix of weights
- Uses tf-idf
![Page 43: Tag Extraction Final Presentation - CS185CSpring2014](https://reader033.vdocuments.us/reader033/viewer/2022060120/55911be21a28ab9b758b47cd/html5/thumbnails/43.jpg)
How It Works
[Figure, built up over three slides: a matrix relating tags to words from the document, where each entry is the relevance of the word to the tag]
![Page 46: Tag Extraction Final Presentation - CS185CSpring2014](https://reader033.vdocuments.us/reader033/viewer/2022060120/55911be21a28ab9b758b47cd/html5/thumbnails/46.jpg)
How It Works
Implemented as → { tag : { word : weight } }
![Page 47: Tag Extraction Final Presentation - CS185CSpring2014](https://reader033.vdocuments.us/reader033/viewer/2022060120/55911be21a28ab9b758b47cd/html5/thumbnails/47.jpg)
How It Works
● The most relevant tag is chosen by the sum of the weights associated with the words contained in the document
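With the model stored as a nested dictionary, scoring a document is a sum of lookups. A minimal sketch: the function names and the toy weights are made up, and summing over distinct words (rather than every occurrence) is an assumption the slides do not settle.

```python
def score_tags(model, words):
    """Sum, per tag, the weights of the distinct words found in the document."""
    distinct = set(words)
    return {
        tag: sum(weights.get(w, 0.0) for w in distinct)
        for tag, weights in model.items()
    }

def best_tag(model, words):
    """Tag with the highest summed weight."""
    scores = score_tags(model, words)
    return max(scores, key=scores.get)

# Toy model in the { tag : { word : weight } } layout
model = {
    "python": {"def": 0.5, "import": 0.25},
    "php": {"echo": 0.75, "import": 0.125},
}
chosen = best_tag(model, "import os def main".split())
```

Sorting the scores instead of taking the maximum gives the five most relevant tags used for multi-label prediction.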
![Page 48: Tag Extraction Final Presentation - CS185CSpring2014](https://reader033.vdocuments.us/reader033/viewer/2022060120/55911be21a28ab9b758b47cd/html5/thumbnails/48.jpg)
How It Works
Now, how are the weights calculated?
● First, calculate idf (inverse document frequency) for each word
● Next, calculate tf (term frequency) associated with each tag
● Multiply each tf entry by the idf of its word, then normalize
![Page 49: Tag Extraction Final Presentation - CS185CSpring2014](https://reader033.vdocuments.us/reader033/viewer/2022060120/55911be21a28ab9b758b47cd/html5/thumbnails/49.jpg)
How It Works
idf for a word is defined by:
idf(word) = log(D / F(word))
where
D = total number of documents in the training set
F(word) = number of documents that contain the word
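The definition translates directly into code. A small sketch, assuming each document is already tokenized into a list of words; the function name `compute_idf` and the toy documents are illustrative.

```python
import math
from collections import Counter

def compute_idf(docs):
    """idf(word) = log(D / F(word)), with D documents and F(word) documents containing the word."""
    D = len(docs)
    doc_freq = Counter()
    for doc in docs:
        doc_freq.update(set(doc))  # count each word at most once per document
    return {word: math.log(D / f) for word, f in doc_freq.items()}

docs = [["a", "b"], ["a"], ["a", "c", "c"]]
idf = compute_idf(docs)
```

Here `a` occurs in all three documents, so its idf is log(3/3) = 0, while `b` and `c` each occur in one document and get log(3).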
![Page 50: Tag Extraction Final Presentation - CS185CSpring2014](https://reader033.vdocuments.us/reader033/viewer/2022060120/55911be21a28ab9b758b47cd/html5/thumbnails/50.jpg)
How It Works
Two ways to calculate tf:
1) The number of times the term appears in documents associated with a tag
2) The number of documents associated with a tag in which the term appears (in other words, count only once per document)
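The difference between the two tf definitions is whether repeated words inside one document are counted. A sketch with hypothetical function names, assuming each training example is a (tags, words) pair:

```python
from collections import Counter, defaultdict

def tf_by_occurrence(examples):
    """Variant 1: count every occurrence of a term in documents carrying the tag."""
    tf = defaultdict(Counter)
    for tags, words in examples:
        for tag in tags:
            tf[tag].update(words)
    return tf

def tf_by_document(examples):
    """Variant 2: count a term at most once per document carrying the tag."""
    tf = defaultdict(Counter)
    for tags, words in examples:
        for tag in tags:
            tf[tag].update(set(words))
    return tf

examples = [(["php"], ["echo", "echo", "x"]), (["php", "html"], ["echo"])]
occ = tf_by_occurrence(examples)
per_doc = tf_by_document(examples)
```

Variant 2 is less sensitive to a single long document that repeats one word many times.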
![Page 51: Tag Extraction Final Presentation - CS185CSpring2014](https://reader033.vdocuments.us/reader033/viewer/2022060120/55911be21a28ab9b758b47cd/html5/thumbnails/51.jpg)
Results
TITLE: Upgrade iPhone 3GS from iOS 4.2.1 to 4.3.x
BODY: <p>Hi folks I have a iPhone 3GS at 4.2.1 and want to upgrade it to 4.3.x for testing. I have read some articles about it but it seems that those are too old and cannot work. </p> <p>Does anyone have some experience in doing this or does apple provide tutorials for developers in this?</p> <p>A lot of thanks.</p>
Actual tags
● iphone
● ios
● upgrade
Predicted tags
● ios4.3
● iphone-3gs
● cocoa-touch
● ios4
● upgrade
![Page 52: Tag Extraction Final Presentation - CS185CSpring2014](https://reader033.vdocuments.us/reader033/viewer/2022060120/55911be21a28ab9b758b47cd/html5/thumbnails/52.jpg)
Results
TITLE: Is it possible to display an image in text field in html?
BODY: <p>Can we display image inside a text field in <code>html</code>? </p> <p><strong>Edit</strong></p> <p>What I want to do is to have an <code>editable</code> area, and want to add <code>html</code> objects inside it(i.e. button, image ..etc)</p>
Actual tags
● javascript
● jquery
● html
● css
● web
Predicted tags
● html
● img
● alignment
● get
● web
![Page 53: Tag Extraction Final Presentation - CS185CSpring2014](https://reader033.vdocuments.us/reader033/viewer/2022060120/55911be21a28ab9b758b47cd/html5/thumbnails/53.jpg)
Results
[Figure: predicted tags (top) compared with actual tags (below)]
![Page 54: Tag Extraction Final Presentation - CS185CSpring2014](https://reader033.vdocuments.us/reader033/viewer/2022060120/55911be21a28ab9b758b47cd/html5/thumbnails/54.jpg)
Results
● Not perfect
● But very close
● Relevant words for tags look right
![Page 55: Tag Extraction Final Presentation - CS185CSpring2014](https://reader033.vdocuments.us/reader033/viewer/2022060120/55911be21a28ab9b758b47cd/html5/thumbnails/55.jpg)
Results
most relevant words for tag “python”: [u'python', u'def', u'import', u'print', u'module', u'file', u'self', … ]
most relevant words for tag “math”:[u'vector', u'x', u'math', u'calculate', u'number', u'mathematical', u'example', u'matlab', ... ]
![Page 56: Tag Extraction Final Presentation - CS185CSpring2014](https://reader033.vdocuments.us/reader033/viewer/2022060120/55911be21a28ab9b758b47cd/html5/thumbnails/56.jpg)
Adjusting
What can be adjusted?
● Pretty much anything!
● I tried playing with tf, idf, tag_frequency, normalization, text cleaning, etc.
![Page 57: Tag Extraction Final Presentation - CS185CSpring2014](https://reader033.vdocuments.us/reader033/viewer/2022060120/55911be21a28ab9b758b47cd/html5/thumbnails/57.jpg)
Conclusion
● Adjusting the metrics to get the right model can be time-consuming (many things can be adjusted)!
● Still, the Naive Bayes algorithm is well suited to the keyword extraction problem (and to text classification in general) because of how tf-idf is defined.
![Page 58: Tag Extraction Final Presentation - CS185CSpring2014](https://reader033.vdocuments.us/reader033/viewer/2022060120/55911be21a28ab9b758b47cd/html5/thumbnails/58.jpg)