![Page 1: Detecting Promotional Content in Wikipedia Shruti Bhosale Heath Vinicombe Ray Mooney University of Texas at Austin 1](https://reader035.vdocuments.us/reader035/viewer/2022070409/56649e755503460f94b768ea/html5/thumbnails/1.jpg)
1
Detecting Promotional Content in Wikipedia
Shruti Bhosale
Heath Vinicombe
Ray Mooney
University of Texas at Austin
![Page 2: Detecting Promotional Content in Wikipedia Shruti Bhosale Heath Vinicombe Ray Mooney University of Texas at Austin 1](https://reader035.vdocuments.us/reader035/viewer/2022070409/56649e755503460f94b768ea/html5/thumbnails/2.jpg)
2
Outline
• Introduction
• Related Work
• Our contribution
• Evaluation
• Conclusion
![Page 4: Detecting Promotional Content in Wikipedia Shruti Bhosale Heath Vinicombe Ray Mooney University of Texas at Austin 1](https://reader035.vdocuments.us/reader035/viewer/2022070409/56649e755503460f94b768ea/html5/thumbnails/4.jpg)
4
Wikipedia’s Core Policies
Can be edited by anyone
Neutral-point-of-view
Verifiability
![Page 5: Detecting Promotional Content in Wikipedia Shruti Bhosale Heath Vinicombe Ray Mooney University of Texas at Austin 1](https://reader035.vdocuments.us/reader035/viewer/2022070409/56649e755503460f94b768ea/html5/thumbnails/5.jpg)
5
Quality Control in Wikipedia
• Wikipedia editors and administrators
• Clean-up Tags
![Page 6: Detecting Promotional Content in Wikipedia Shruti Bhosale Heath Vinicombe Ray Mooney University of Texas at Austin 1](https://reader035.vdocuments.us/reader035/viewer/2022070409/56649e755503460f94b768ea/html5/thumbnails/6.jpg)
6
Wikipedia Articles with a promotional tone
• Wikipedia Article on Steve Angello
– …Since then, he has exploded onto the house music scene…
– …Steve Angello encompasses enough fame as a stand alone producer. Add astounding remixes for….his unassailable musical sights have truly made for an intense discography…
![Page 7: Detecting Promotional Content in Wikipedia Shruti Bhosale Heath Vinicombe Ray Mooney University of Texas at Austin 1](https://reader035.vdocuments.us/reader035/viewer/2022070409/56649e755503460f94b768ea/html5/thumbnails/7.jpg)
7
Wikipedia Articles with a promotional tone
Identified manually and tagged with a cleanup message by Wikipedia editors
![Page 8: Detecting Promotional Content in Wikipedia Shruti Bhosale Heath Vinicombe Ray Mooney University of Texas at Austin 1](https://reader035.vdocuments.us/reader035/viewer/2022070409/56649e755503460f94b768ea/html5/thumbnails/8.jpg)
8
Outline
• Introduction
• Related Work
• Our contribution
• Evaluation
• Conclusion
![Page 9: Detecting Promotional Content in Wikipedia Shruti Bhosale Heath Vinicombe Ray Mooney University of Texas at Austin 1](https://reader035.vdocuments.us/reader035/viewer/2022070409/56649e755503460f94b768ea/html5/thumbnails/9.jpg)
9
Quality Flaw Prediction in Wikipedia (Anderka et al., 2012)
• Classifiers for ten most frequent quality flaws in Wikipedia
• One of the ten flaws is “Advert” => Written like an advertisement
• The majority of promotional Wikipedia articles are tagged “Advert”
![Page 10: Detecting Promotional Content in Wikipedia Shruti Bhosale Heath Vinicombe Ray Mooney University of Texas at Austin 1](https://reader035.vdocuments.us/reader035/viewer/2022070409/56649e755503460f94b768ea/html5/thumbnails/10.jpg)
10
Features Used in Classification (Anderka et al., 2012)
Content-based features
Structure features
Network features
Edit history features
![Page 11: Detecting Promotional Content in Wikipedia Shruti Bhosale Heath Vinicombe Ray Mooney University of Texas at Austin 1](https://reader035.vdocuments.us/reader035/viewer/2022070409/56649e755503460f94b768ea/html5/thumbnails/11.jpg)
11
Outline
• Introduction
• Related Work
• Our Approach
– Motivation
– Dataset Collection
– Features
– Classification
• Evaluation
• Conclusion
![Page 12: Detecting Promotional Content in Wikipedia Shruti Bhosale Heath Vinicombe Ray Mooney University of Texas at Austin 1](https://reader035.vdocuments.us/reader035/viewer/2022070409/56649e755503460f94b768ea/html5/thumbnails/12.jpg)
12
Style of Writing
• Our hypothesis - Promotional Articles could contain a distinct style of writing.
• Style of writing could be captured using
– PCFG models
– n-gram models
![Page 13: Detecting Promotional Content in Wikipedia Shruti Bhosale Heath Vinicombe Ray Mooney University of Texas at Austin 1](https://reader035.vdocuments.us/reader035/viewer/2022070409/56649e755503460f94b768ea/html5/thumbnails/13.jpg)
13
Our Approach
• Train PCFG models, character trigram models and word trigram models on the sets of promotional and non-promotional Wikipedia articles
• Compute features based on these models
![Page 14: Detecting Promotional Content in Wikipedia Shruti Bhosale Heath Vinicombe Ray Mooney University of Texas at Austin 1](https://reader035.vdocuments.us/reader035/viewer/2022070409/56649e755503460f94b768ea/html5/thumbnails/14.jpg)
14
Dataset Collection
• Positive Examples: 13,000 articles from English Wikipedia’s category, “Category:All articles with a promotional tone” (April 2013)
• Negative Examples: Randomly selected untagged articles (April 2013)
![Page 15: Detecting Promotional Content in Wikipedia Shruti Bhosale Heath Vinicombe Ray Mooney University of Texas at Austin 1](https://reader035.vdocuments.us/reader035/viewer/2022070409/56649e755503460f94b768ea/html5/thumbnails/15.jpg)
15
Training and Testing
• 70% of the data is used to train language models for each category of articles
• 30% of the data is used to train and test the classifiers for detecting promotional articles
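The two-stage split above can be sketched as follows. This is a minimal illustration, assuming a shuffled split within each category; the slides do not say how articles were assigned to each portion, and `split_articles`, its parameters and the seed are hypothetical names.

```python
import random

def split_articles(articles, lm_fraction=0.7, seed=0):
    """Shuffle one category's articles and split them into an
    LM-training portion and a classifier (train/test) portion."""
    shuffled = list(articles)
    random.Random(seed).shuffle(shuffled)  # fixed seed keeps the split reproducible
    cut = int(round(lm_fraction * len(shuffled)))
    return shuffled[:cut], shuffled[cut:]

# Hypothetical article identifiers, for illustration only.
articles = [f"article_{i}" for i in range(10)]
lm_part, clf_part = split_articles(articles)
print(len(lm_part), len(clf_part))
```

Keeping the two portions disjoint presumably means the classifier only sees language-model scores for articles the language models were not trained on.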
![Page 16: Detecting Promotional Content in Wikipedia Shruti Bhosale Heath Vinicombe Ray Mooney University of Texas at Austin 1](https://reader035.vdocuments.us/reader035/viewer/2022070409/56649e755503460f94b768ea/html5/thumbnails/16.jpg)
16
Training N-gram Models
• For each category of articles, we train
– Word Trigram language models and
– Character Trigram language models
• We also train a unigram word (BOW) model as a baseline for evaluation
![Page 17: Detecting Promotional Content in Wikipedia Shruti Bhosale Heath Vinicombe Ray Mooney University of Texas at Austin 1](https://reader035.vdocuments.us/reader035/viewer/2022070409/56649e755503460f94b768ea/html5/thumbnails/17.jpg)
17
N-gram Model Features
• Difference in the probabilities assigned to an article by the positive and the negative class character trigram language models
• Difference in the probabilities assigned to an article by the positive and the negative class word trigram language models
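A minimal sketch of the character trigram feature, under several assumptions that are not in the slides: add-one smoothing, natural-log probabilities instead of raw probabilities (raw probabilities underflow on long articles), and no length normalisation. `CharTrigramLM` and `lm_difference_feature` are hypothetical names, and the training sentences are toy data. The word trigram feature would follow the same pattern over word tokens.

```python
import math
from collections import Counter

class CharTrigramLM:
    """Character trigram model with add-one smoothing (smoothing is an assumption)."""

    def __init__(self, texts):
        self.trigrams = Counter()   # counts of (two-char context + next char)
        self.contexts = Counter()   # counts of each two-char context
        self.vocab = set()
        for text in texts:
            padded = "  " + text    # two leading spaces give the first character a context
            self.vocab.update(padded)
            for i in range(2, len(padded)):
                self.trigrams[padded[i - 2:i + 1]] += 1
                self.contexts[padded[i - 2:i]] += 1

    def log_prob(self, text):
        """Natural-log probability of `text` under the model."""
        padded = "  " + text
        v = len(self.vocab) + 1     # +1 reserves probability mass for unseen characters
        total = 0.0
        for i in range(2, len(padded)):
            numerator = self.trigrams[padded[i - 2:i + 1]] + 1
            denominator = self.contexts[padded[i - 2:i]] + v
            total += math.log(numerator / denominator)
        return total

def lm_difference_feature(text, positive_lm, negative_lm):
    """Positive-class score minus negative-class score for one article."""
    return positive_lm.log_prob(text) - negative_lm.log_prob(text)

# Toy training data, for illustration only.
promo_lm = CharTrigramLM(["an astounding, truly unrivalled discography",
                          "the best and most exciting producer"])
neutral_lm = CharTrigramLM(["he was born in 1982 in athens",
                            "the album was released in 2007"])
feature = lm_difference_feature("a truly astounding producer", promo_lm, neutral_lm)
print(feature > 0)  # positive when the promotional model fits the text better
```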
![Page 18: Detecting Promotional Content in Wikipedia Shruti Bhosale Heath Vinicombe Ray Mooney University of Texas at Austin 1](https://reader035.vdocuments.us/reader035/viewer/2022070409/56649e755503460f94b768ea/html5/thumbnails/18.jpg)
18
Training PCFG models (Raghavan et al., 2010; Harpalani et al., 2011)
[Diagram: two parallel pipelines]
• Promotional Articles → PCFG parsing → Promotional Category Treebank → PCFG model training → Promotional PCFG model
• Non-Promotional Articles → PCFG parsing → Non-promotional Category Treebank → PCFG model training → Non-Promotional Articles PCFG model
![Page 19: Detecting Promotional Content in Wikipedia Shruti Bhosale Heath Vinicombe Ray Mooney University of Texas at Austin 1](https://reader035.vdocuments.us/reader035/viewer/2022070409/56649e755503460f94b768ea/html5/thumbnails/19.jpg)
19
PCFG Model Features
• Calculate probabilities assigned to all sentences of an article by each of the two PCFG models
• Compute Mean, Maximum, Minimum and Standard Deviation of all probabilities, per PCFG model.
• Compute the difference in the values for these statistics assigned by the positive and negative class PCFG models
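The three steps above can be sketched as follows, assuming the per-sentence scores have already been produced by the two PCFG models (the parser itself is not shown). The function names, the use of log-probabilities and the population standard deviation are assumptions, since the slides do not specify them.

```python
import statistics

def summary(scores):
    """Mean, maximum, minimum and standard deviation of per-sentence scores."""
    return {
        "mean": statistics.fmean(scores),
        "max": max(scores),
        "min": min(scores),
        "std": statistics.pstdev(scores),  # population std. dev. (an assumption)
    }

def pcfg_features(pos_scores, neg_scores):
    """Per-model statistics plus their positive-minus-negative differences."""
    pos, neg = summary(pos_scores), summary(neg_scores)
    features = {}
    for name in ("mean", "max", "min", "std"):
        features[f"pos_{name}"] = pos[name]
        features[f"neg_{name}"] = neg[name]
        features[f"diff_{name}"] = pos[name] - neg[name]
    return features

# Hypothetical log-probabilities for a three-sentence article.
pos_scores = [-40.0, -55.0, -35.0]   # under the promotional PCFG
neg_scores = [-45.0, -60.0, -50.0]   # under the non-promotional PCFG
features = pcfg_features(pos_scores, neg_scores)
print(features["diff_mean"], features["diff_max"], features["diff_min"])
```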
![Page 20: Detecting Promotional Content in Wikipedia Shruti Bhosale Heath Vinicombe Ray Mooney University of Texas at Austin 1](https://reader035.vdocuments.us/reader035/viewer/2022070409/56649e755503460f94b768ea/html5/thumbnails/20.jpg)
20
Classification
• LogitBoost with Decision Stumps (Friedman et al., 2000)
• 10-fold cross-validation
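For readers unfamiliar with the classifier, here is a small pure-Python sketch of two-class LogitBoost with regression stumps, following the update rule in Friedman et al. (2000). It is an illustration rather than the implementation used in this work: the number of rounds, the clipping constants `z_max` and `w_min`, and the toy data are assumptions, and cross-validation is not shown.

```python
import math

def fit_stump(X, z, w):
    """Weighted least-squares regression stump.
    Returns (feature index, threshold, left value, right value)."""
    best = None
    for j in range(len(X[0])):
        values = sorted(set(row[j] for row in X))
        for lo, hi in zip(values, values[1:]):
            t = (lo + hi) / 2.0
            sums = {True: [0.0, 0.0], False: [0.0, 0.0]}  # side -> [sum w*z, sum w]
            for row, zi, wi in zip(X, z, w):
                side = row[j] <= t
                sums[side][0] += wi * zi
                sums[side][1] += wi
            left = sums[True][0] / sums[True][1]
            right = sums[False][0] / sums[False][1]
            err = sum(wi * (zi - (left if row[j] <= t else right)) ** 2
                      for row, zi, wi in zip(X, z, w))
            if best is None or err < best[0]:
                best = (err, j, t, left, right)
    if best is None:
        raise ValueError("no feature has two distinct values")
    return best[1:]

def logitboost(X, y, rounds=20, z_max=4.0, w_min=1e-10):
    """Two-class LogitBoost; y holds 0/1 labels. Returns the fitted stumps."""
    F = [0.0] * len(X)
    stumps = []
    for _ in range(rounds):
        p = [1.0 / (1.0 + math.exp(-2.0 * f)) for f in F]
        w = [max(pi * (1.0 - pi), w_min) for pi in p]           # working weights
        z = [max(-z_max, min(z_max, (yi - pi) / wi))            # clipped working response
             for yi, pi, wi in zip(y, p, w)]
        j, t, left, right = fit_stump(X, z, w)
        stumps.append((j, t, left, right))
        F = [f + 0.5 * (left if row[j] <= t else right) for f, row in zip(F, X)]
    return stumps

def predict(stumps, row):
    """Sign of the additive score: 1 = promotional, 0 = non-promotional."""
    score = sum(0.5 * (left if row[j] <= t else right) for j, t, left, right in stumps)
    return 1 if score > 0 else 0

# Toy data: feature 0 separates the classes, feature 1 is noise.
X = [[-3.0, 0.1], [-2.0, 0.4], [-1.0, 0.2], [1.0, 0.3], [2.0, 0.1], [3.0, 0.5]]
y = [0, 0, 0, 1, 1, 1]
stumps = logitboost(X, y)
print([predict(stumps, row) for row in X])
```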
![Page 21: Detecting Promotional Content in Wikipedia Shruti Bhosale Heath Vinicombe Ray Mooney University of Texas at Austin 1](https://reader035.vdocuments.us/reader035/viewer/2022070409/56649e755503460f94b768ea/html5/thumbnails/21.jpg)
21
Outline
• Introduction
• Related Work
• Our contribution
• Evaluation
• Conclusion
![Page 22: Detecting Promotional Content in Wikipedia Shruti Bhosale Heath Vinicombe Ray Mooney University of Texas at Austin 1](https://reader035.vdocuments.us/reader035/viewer/2022070409/56649e755503460f94b768ea/html5/thumbnails/22.jpg)
22
Evaluation

| Features | Precision | Recall | F1 | AUC |
|---|---|---|---|---|
| Bag-of-words Baseline | 0.82 | 0.82 | 0.82 | 0.89 |
| PCFG | 0.88 | 0.87 | 0.87 | 0.94 |
| Char. Trigram | 0.89 | 0.89 | 0.89 | 0.95 |
| Word Trigram | 0.86 | 0.86 | 0.86 | 0.93 |
| PCFG + Char. Trigram + Word Trigram | 0.91 | 0.92 | 0.91 | 0.97 |
| Content and Meta Features (Anderka et al.) | 0.87 | 0.87 | 0.87 | 0.94 |
| All Features | 0.94 | 0.94 | 0.94 | 0.99 |
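As a quick reading aid, the reported F1 values can be checked against the reported precision and recall. The sketch below assumes F1 is the harmonic mean of the two reported numbers; the slides do not state how scores were averaged over the two classes.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)

# (precision, recall, reported F1) for three rows of the evaluation results.
rows = {
    "PCFG": (0.88, 0.87, 0.87),
    "PCFG + Char. Trigram + Word Trigram": (0.91, 0.92, 0.91),
    "All Features": (0.94, 0.94, 0.94),
}
for name, (p, r, reported) in rows.items():
    print(name, round(f1_score(p, r), 2), reported)
```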
![Page 27: Detecting Promotional Content in Wikipedia Shruti Bhosale Heath Vinicombe Ray Mooney University of Texas at Austin 1](https://reader035.vdocuments.us/reader035/viewer/2022070409/56649e755503460f94b768ea/html5/thumbnails/27.jpg)
27
Top 10 Features (Based on Information Gain)
1. LM char trigram
2. LM word trigram
3. PCFG min
4. PCFG max
5. PCFG mean
6. PCFG std. deviation
7. Number of Characters
8. Number of Words
9. Number of Categories
10. Number of Sentences
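Information gain, the ranking criterion named above, is the drop in label entropy once a feature's value is known. The sketch below computes it for a discrete feature; continuous features such as the language-model differences would have to be discretised first, and the slides do not say how that was done. The feature values and labels are toy data.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def information_gain(feature_values, labels):
    """H(labels) - H(labels | feature) for a discrete feature."""
    n = len(labels)
    groups = {}
    for value, label in zip(feature_values, labels):
        groups.setdefault(value, []).append(label)
    conditional = sum(len(g) / n * entropy(g) for g in groups.values())
    return entropy(labels) - conditional

labels = [1, 1, 0, 0]
perfect = information_gain(["hi", "hi", "lo", "lo"], labels)   # feature determines the label
useless = information_gain(["hi", "lo", "hi", "lo"], labels)   # feature says nothing about it
print(perfect, useless)
```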
![Page 28: Detecting Promotional Content in Wikipedia Shruti Bhosale Heath Vinicombe Ray Mooney University of Texas at Austin 1](https://reader035.vdocuments.us/reader035/viewer/2022070409/56649e755503460f94b768ea/html5/thumbnails/28.jpg)
28
Average Sentiment Score
• Average sentiment of all words in an article using SentiWordNet (Baccianella et al., 2010)
• Intuitively seems like a discriminative feature
• 18th most informative feature
• Reinforces our hypothesis that surface level features are insufficient
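A minimal sketch of an average sentiment score. The tiny lexicon below is a hypothetical stand-in for SentiWordNet, the regular-expression tokenizer is an assumption, and so is the choice to score words missing from the lexicon as 0.

```python
import re

# Hypothetical toy lexicon: word -> (positive score - negative score).
# A real run would derive these scores from SentiWordNet instead.
TOY_LEXICON = {"astounding": 0.75, "intense": 0.25, "poor": -0.625}

def average_sentiment(text, lexicon):
    """Mean lexicon score over all words; unknown words count as 0."""
    words = re.findall(r"[a-z']+", text.lower())
    if not words:
        return 0.0
    return sum(lexicon.get(word, 0.0) for word in words) / len(words)

score = average_sentiment("Astounding remixes and an intense discography", TOY_LEXICON)
print(score)
```

Because the score is averaged over every word, a few strongly promotional terms are diluted by the neutral words around them.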
![Page 29: Detecting Promotional Content in Wikipedia Shruti Bhosale Heath Vinicombe Ray Mooney University of Texas at Austin 1](https://reader035.vdocuments.us/reader035/viewer/2022070409/56649e755503460f94b768ea/html5/thumbnails/29.jpg)
29
Conclusion
• Features based on n-gram language models and PCFG models detect promotional Wikipedia articles well: combined, they reach an F1 of 0.91 and an AUC of 0.97, against 0.82 and 0.89 for the bag-of-words baseline.
• Main advantages
– Depend only on the article’s content and not on external meta-data
– Perform with high accuracy
![Page 30: Detecting Promotional Content in Wikipedia Shruti Bhosale Heath Vinicombe Ray Mooney University of Texas at Austin 1](https://reader035.vdocuments.us/reader035/viewer/2022070409/56649e755503460f94b768ea/html5/thumbnails/30.jpg)
30
Questions?
![Page 31: Detecting Promotional Content in Wikipedia Shruti Bhosale Heath Vinicombe Ray Mooney University of Texas at Austin 1](https://reader035.vdocuments.us/reader035/viewer/2022070409/56649e755503460f94b768ea/html5/thumbnails/31.jpg)
31
Content-based Features
• Number of characters, words, sentences
• Avg. Word Length
• Avg., min., max. Sentence Lengths, Ratio of max. to min. sentence lengths
• Ratio of long sentences (>48 words) to Short Sentences (<33 words)
• % of Sentences in the passive voice
• Relative Frequencies of POS tags
• % of sentences beginning with selected POS tags
• % of special phrases (e.g. editorializing terms like ‘without a doubt’, ‘of course’)
• % of easy words, difficult words, long words and stop words
• Overall Sentiment Score based on SentiWordNet*
* Baccianella et al., 2010
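A few of the simpler content-based features can be sketched with the standard library. The regular-expression sentence splitter and word tokenizer are assumptions, as are the function and key names; features that need a POS tagger or word lists are left out.

```python
import re

def content_features(text, long_min=48, short_max=33):
    """Length-based content features for one article's plain text."""
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]
    lengths = [len(re.findall(r"\w+", s)) for s in sentences]  # words per sentence
    words = re.findall(r"\w+", text)
    n_long = sum(1 for n in lengths if n > long_min)
    n_short = sum(1 for n in lengths if n < short_max)
    return {
        "num_chars": len(text),
        "num_words": len(words),
        "num_sentences": len(sentences),
        "avg_word_length": sum(map(len, words)) / len(words) if words else 0.0,
        "avg_sentence_length": sum(lengths) / len(lengths) if lengths else 0.0,
        "max_to_min_sentence_length":
            max(lengths) / min(lengths) if lengths and min(lengths) else 0.0,
        "long_to_short_ratio": n_long / n_short if n_short else 0.0,
    }

sample_text = "He was born in 1982. He has exploded onto the house music scene!"
content = content_features(sample_text)
print(content["num_sentences"], content["num_words"])
```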
![Page 32: Detecting Promotional Content in Wikipedia Shruti Bhosale Heath Vinicombe Ray Mooney University of Texas at Austin 1](https://reader035.vdocuments.us/reader035/viewer/2022070409/56649e755503460f94b768ea/html5/thumbnails/32.jpg)
32
Structure Features
• Number of Sections
• Number of Images
• Number of Categories
• Number of Wikipedia Templates used
• Number of References, Number of References per sentence and Number of references per section
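These counts can be roughly approximated from an article's raw wikitext. The sketch below uses regular expressions, which is an assumption and only an approximation (nested templates and unusual markup are not handled); a wikitext parser would be more reliable. The sample article is made up for illustration.

```python
import re

def structure_features(wikitext, num_sentences):
    """Approximate structure counts from raw wikitext."""
    refs = len(re.findall(r"<ref[\s>]", wikitext))  # opening <ref> tags only
    sections = len(re.findall(r"^==+[^=].*?==+\s*$", wikitext, flags=re.MULTILINE))
    return {
        "num_sections": sections,
        "num_images": len(re.findall(r"\[\[(?:File|Image):", wikitext)),
        "num_categories": len(re.findall(r"\[\[Category:", wikitext)),
        "num_templates": len(re.findall(r"\{\{", wikitext)),
        "num_references": refs,
        "refs_per_sentence": refs / num_sentences if num_sentences else 0.0,
        "refs_per_section": refs / sections if sections else 0.0,
    }

# Made-up wikitext, for illustration only.
sample_wikitext = """{{Infobox musical artist}}
'''Example Artist''' is a producer.<ref>Source A</ref>
== Career ==
[[File:Photo.jpg|thumb]]
The artist released an album.<ref name="b">Source B</ref>
== References ==
<references />
[[Category:Living people]]
[[Category:DJs]]"""
structure = structure_features(sample_wikitext, num_sentences=2)
print(structure)
```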
![Page 33: Detecting Promotional Content in Wikipedia Shruti Bhosale Heath Vinicombe Ray Mooney University of Texas at Austin 1](https://reader035.vdocuments.us/reader035/viewer/2022070409/56649e755503460f94b768ea/html5/thumbnails/33.jpg)
33
Wikipedia Network Features
• Number of Internal Wikilinks (to other Wikipedia pages)
• Number of External Links (to other websites)
• Number of Backlinks (i.e. Number of wikilinks from other Wikipedia articles to an article)
• Number of Language Links (i.e. Number of links to the same article in other languages)
![Page 34: Detecting Promotional Content in Wikipedia Shruti Bhosale Heath Vinicombe Ray Mooney University of Texas at Austin 1](https://reader035.vdocuments.us/reader035/viewer/2022070409/56649e755503460f94b768ea/html5/thumbnails/34.jpg)
34
Edit History Features
• Age of the article
• Days since last revision of the article
• Number of edits to the article
• Number of unique editors
• Number of edits made by registered users and by anonymous IP addresses
• Number of edits per editor
• Percentage of edits by top 5% of the top contributors to the article
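The count-based edit history features can be sketched from a list of revisions. The input format (one `(editor, is_registered)` tuple per edit) is hypothetical, and the last feature is read here as the share of edits made by the top 5% of editors ranked by edit count, with at least one editor included; the slides do not define it more precisely. Age and recency features would additionally need revision timestamps.

```python
import math
from collections import Counter

def edit_history_features(revisions):
    """`revisions` is a list of (editor, is_registered) tuples, one per edit."""
    n_edits = len(revisions)
    if n_edits == 0:
        return {}
    per_editor = Counter(editor for editor, _ in revisions)
    n_editors = len(per_editor)
    registered = sum(1 for _, is_registered in revisions if is_registered)
    top_k = max(1, math.ceil(0.05 * n_editors))  # top 5% of editors, at least one
    top_edits = sum(count for _, count in per_editor.most_common(top_k))
    return {
        "num_edits": n_edits,
        "num_unique_editors": n_editors,
        "num_registered_edits": registered,
        "num_anonymous_edits": n_edits - registered,
        "edits_per_editor": n_edits / n_editors,
        "pct_edits_by_top_contributors": 100.0 * top_edits / n_edits,
    }

# Made-up revision log: two registered users and one anonymous IP address.
revisions = [("alice", True)] * 6 + [("bob", True)] * 3 + [("203.0.113.7", False)]
history = edit_history_features(revisions)
print(history)
```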