detecting promotional content in wikipediabenking/resources/reading... · automatic vandalism...
TRANSCRIPT
Detecting Promotional Content in Wikipedia
Shruti BhosaleHeath VinicombeRaymond Mooney
Shibamouli LahiriFebruary 27, 2014
Bhosale, Vinicombe, Mooney Promotional Content in Wikipedia 1
Motivation
Automatically identifying if an article is promotional, is hard.
Bhosale, Vinicombe, Mooney Promotional Content in Wikipedia 2
Motivation
Automatically identifying if an article is promotional, is hard,because:
Number of such articles is small (17,000) compared to totalnumber of articles in Wikipedia (4.46 million).
Good negative examples are hard to find.
Bhosale, Vinicombe, Mooney Promotional Content in Wikipedia 3
Motivation
Number of such articles is small (17,000) compared to totalnumber of articles in Wikipedia (4.46 million).
Under-sample the negative class.
Good negative examples are hard to find.
Use noisy documentsOptimistic vs Pessimistic settingsUse lots of featuresBoosting
Bhosale, Vinicombe, Mooney Promotional Content in Wikipedia 4
Motivation
Number of such articles is small (17,000) compared to totalnumber of articles in Wikipedia (4.46 million).
Under-sample the negative class.
Good negative examples are hard to find.
Use noisy documentsOptimistic vs Pessimistic settingsUse lots of featuresBoosting
Bhosale, Vinicombe, Mooney Promotional Content in Wikipedia 4
Related Work
Predicting Quality Flaws in User-generated Content: TheCase of Wikipedia (SIGIR 2012)
Language of Vandalism: Improving Wikipedia VandalismDetection via Stylometric Analysis (ACL-HLT 2011)
“Got you!”: Automatic Vandalism Detection in Wikipediawith Web-based Shallow Syntactic-Semantic Modeling(COLING 2010)
Automatic quality assessment of content createdcollaboratively by web communities: a case study ofWikipedia (JCDL 2009)
Does It Matter Who Contributes? – A Study on FeaturedArticles in the German Wikipedia (HT 2007)
Assessing the value of cooperation in Wikipedia (arXiv 2007)
Bhosale, Vinicombe, Mooney Promotional Content in Wikipedia 5
Related Work
Predicting Quality Flaws in User-generated Content: TheCase of Wikipedia (SIGIR 2012)
Language of Vandalism: Improving Wikipedia VandalismDetection via Stylometric Analysis (ACL-HLT 2011)
“Got you!”: Automatic Vandalism Detection in Wikipediawith Web-based Shallow Syntactic-Semantic Modeling(COLING 2010)
Automatic quality assessment of content createdcollaboratively by web communities: a case study ofWikipedia (JCDL 2009)
Does It Matter Who Contributes? – A Study on FeaturedArticles in the German Wikipedia (HT 2007)
Assessing the value of cooperation in Wikipedia (arXiv 2007)
Bhosale, Vinicombe, Mooney Promotional Content in Wikipedia 5
Related Work
Predicting Quality Flaws in User-generated Content: TheCase of Wikipedia (SIGIR 2012)
Language of Vandalism: Improving Wikipedia VandalismDetection via Stylometric Analysis (ACL-HLT 2011)
“Got you!”: Automatic Vandalism Detection in Wikipediawith Web-based Shallow Syntactic-Semantic Modeling(COLING 2010)
Automatic quality assessment of content createdcollaboratively by web communities: a case study ofWikipedia (JCDL 2009)
Does It Matter Who Contributes? – A Study on FeaturedArticles in the German Wikipedia (HT 2007)
Assessing the value of cooperation in Wikipedia (arXiv 2007)
Bhosale, Vinicombe, Mooney Promotional Content in Wikipedia 5
Related Work
Predicting Quality Flaws in User-generated Content: TheCase of Wikipedia (SIGIR 2012)
Language of Vandalism: Improving Wikipedia VandalismDetection via Stylometric Analysis (ACL-HLT 2011)
“Got you!”: Automatic Vandalism Detection in Wikipediawith Web-based Shallow Syntactic-Semantic Modeling(COLING 2010)
Automatic quality assessment of content createdcollaboratively by web communities: a case study ofWikipedia (JCDL 2009)
Does It Matter Who Contributes? – A Study on FeaturedArticles in the German Wikipedia (HT 2007)
Assessing the value of cooperation in Wikipedia (arXiv 2007)
Bhosale, Vinicombe, Mooney Promotional Content in Wikipedia 5
Dataset
13,000 promotional articles
26,000 untagged (“noisy”) articles
11,000 Featured and Good articles
70% to train language models, 30% for ten-foldcross-validation
Bhosale, Vinicombe, Mooney Promotional Content in Wikipedia 6
Dataset
13,000 promotional articles
26,000 untagged (“noisy”) articles
11,000 Featured and Good articles
70% to train language models, 30% for ten-foldcross-validation
LogitBoost with default base classifier (Decision Stumps) and500 boosting iterations
Bhosale, Vinicombe, Mooney Promotional Content in Wikipedia 7
Features
Content features
Structural features
Network features
Edit history features
Bhosale, Vinicombe, Mooney Promotional Content in Wikipedia 8
Features
Content features
Structural features
Network features
Edit history features
Overall sentiment score
Bhosale, Vinicombe, Mooney Promotional Content in Wikipedia 9
Features
Content features
Structural features
Network features
Edit history features
Overall sentiment score
58 features
Bhosale, Vinicombe, Mooney Promotional Content in Wikipedia 10
Features
Content features
Structural features
Network features
Edit history features
Overall sentiment score
trigram language models
PCFG language models
Bhosale, Vinicombe, Mooney Promotional Content in Wikipedia 11
Trigram Language Model and PCFG Features
LM char trigramLM word trigram
Bhosale, Vinicombe, Mooney Promotional Content in Wikipedia 12
Trigram Language Model and PCFG Features
LM char trigramLM word trigram
PCFG maxPCFG minPCFG meanPCFG std deviation
Bhosale, Vinicombe, Mooney Promotional Content in Wikipedia 13
Results
Features
Pessimistic Setting Optimistic Setting
P R F1 AUC P R F1 AUC
Bhosale, Vinicombe, Mooney Promotional Content in Wikipedia 14
Results
Features
Pessimistic Setting Optimistic Setting
P R F1 AUC P R F1 AUC
Bag-of-words Baseline 0.823 0.820 0.821 0.893 0.931 0.931 0.931 0.979
Bhosale, Vinicombe, Mooney Promotional Content in Wikipedia 15
Results
Features
Pessimistic Setting Optimistic Setting
P R F1 AUC P R F1 AUC
Bag-of-words Baseline 0.823 0.820 0.821 0.893 0.931 0.931 0.931 0.979PCFG 0.881 0.870 0.865 0.903 0.910 0.910 0.910 0.961
Bhosale, Vinicombe, Mooney Promotional Content in Wikipedia 16
Results
Features
Pessimistic Setting Optimistic Setting
P R F1 AUC P R F1 AUC
Bag-of-words Baseline 0.823 0.820 0.821 0.893 0.931 0.931 0.931 0.979PCFG 0.881 0.870 0.865 0.903 0.910 0.910 0.910 0.961Character trigrams 0.889 0.887 0.888 0.952 0.858 0.843 0.841 0.877
Bhosale, Vinicombe, Mooney Promotional Content in Wikipedia 17
Results
Features
Pessimistic Setting Optimistic Setting
P R F1 AUC P R F1 AUC
Bag-of-words Baseline 0.823 0.820 0.821 0.893 0.931 0.931 0.931 0.979PCFG 0.881 0.870 0.865 0.903 0.910 0.910 0.910 0.961Character trigrams 0.889 0.887 0.888 0.952 0.858 0.843 0.841 0.877Word trigrams 0.863 0.863 0.863 0.931 0.887 0.883 0.882 0.931
Bhosale, Vinicombe, Mooney Promotional Content in Wikipedia 18
Results
Features
Pessimistic Setting Optimistic Setting
P R F1 AUC P R F1 AUC
Bag-of-words Baseline 0.823 0.820 0.821 0.893 0.931 0.931 0.931 0.979PCFG 0.881 0.870 0.865 0.903 0.910 0.910 0.910 0.961Character trigrams 0.889 0.887 0.888 0.952 0.858 0.843 0.841 0.877Word trigrams 0.863 0.863 0.863 0.931 0.887 0.883 0.882 0.931Character trigrams + Word trigrams 0.89 0.888 0.889 0.952 0.908 0.907 0.907 0.962
Bhosale, Vinicombe, Mooney Promotional Content in Wikipedia 19
Results
Features
Pessimistic Setting Optimistic Setting
P R F1 AUC P R F1 AUC
Bag-of-words Baseline 0.823 0.820 0.821 0.893 0.931 0.931 0.931 0.979PCFG 0.881 0.870 0.865 0.903 0.910 0.910 0.910 0.961Character trigrams 0.889 0.887 0.888 0.952 0.858 0.843 0.841 0.877Word trigrams 0.863 0.863 0.863 0.931 0.887 0.883 0.882 0.931Character trigrams + Word trigrams 0.89 0.888 0.889 0.952 0.908 0.907 0.907 0.962PCFG + Char. trigrams + Word trigrams 0.914 0.915 0.914 0.974 0.950 0.950 0.950 0.983
Bhosale, Vinicombe, Mooney Promotional Content in Wikipedia 20
Results
Features
Pessimistic Setting Optimistic Setting
P R F1 AUC P R F1 AUC
Bag-of-words Baseline 0.823 0.820 0.821 0.893 0.931 0.931 0.931 0.979PCFG 0.881 0.870 0.865 0.903 0.910 0.910 0.910 0.961Character trigrams 0.889 0.887 0.888 0.952 0.858 0.843 0.841 0.877Word trigrams 0.863 0.863 0.863 0.931 0.887 0.883 0.882 0.931Character trigrams + Word trigrams 0.89 0.888 0.889 0.952 0.908 0.907 0.907 0.962PCFG + Char. trigrams + Word trigrams 0.914 0.915 0.914 0.974 0.950 0.950 0.950 0.98358 Content and Meta Features 0.866 0.867 0.867 0.938 0.986 0.986 0.986 0.996
Bhosale, Vinicombe, Mooney Promotional Content in Wikipedia 21
Results
Features
Pessimistic Setting Optimistic Setting
P R F1 AUC P R F1 AUC
Bag-of-words Baseline 0.823 0.820 0.821 0.893 0.931 0.931 0.931 0.979PCFG 0.881 0.870 0.865 0.903 0.910 0.910 0.910 0.961Character trigrams 0.889 0.887 0.888 0.952 0.858 0.843 0.841 0.877Word trigrams 0.863 0.863 0.863 0.931 0.887 0.883 0.882 0.931Character trigrams + Word trigrams 0.89 0.888 0.889 0.952 0.908 0.907 0.907 0.962PCFG + Char. trigrams + Word trigrams 0.914 0.915 0.914 0.974 0.950 0.950 0.950 0.98358 Content and Meta Features 0.866 0.867 0.867 0.938 0.986 0.986 0.986 0.996All Features 0.940 0.940 0.940 0.986 0.989 0.989 0.989 0.997
Bhosale, Vinicombe, Mooney Promotional Content in Wikipedia 22
Results
Best Features in Pessimistic Setting
LM char trigramLM word trigramPCFG minPCFG maxPCFG meanPCFG std deviationNumber of CharactersNumber of WordsNumber of CategoriesNumber of Sentences
Bhosale, Vinicombe, Mooney Promotional Content in Wikipedia 23
Results
Best Features in Pessimistic Setting Best Features in Optimistic Setting
LM char trigram Number of ReferencesLM word trigram Number of References per SectionPCFG min LM word trigramPCFG max Number of WordsPCFG mean PCFG meanPCFG std deviation Number of SentencesNumber of Characters LM char trigramNumber of Words Number of WordsNumber of Categories Number of CharactersNumber of Sentences Number of Backlinks
Bhosale, Vinicombe, Mooney Promotional Content in Wikipedia 24
Results
Best Features in Pessimistic Setting Best Features in Optimistic Setting
LM char trigram Number of ReferencesLM word trigram Number of References per SectionPCFG min LM word trigramPCFG max Number of WordsPCFG mean PCFG meanPCFG std deviation Number of SentencesNumber of Characters LM char trigramNumber of Words Number of WordsNumber of Categories Number of CharactersNumber of Sentences Number of Backlinks
Bhosale, Vinicombe, Mooney Promotional Content in Wikipedia 25
Observations
Stylometric features (PCFG and trigram language models)help.
Overall sentiment score doesn’t.
Using ten top-ranked features, F1 = 0.93 in pessimisticsetting, 0.988 in optimistic setting.
Featured/good articles are long, with more references.
Equal proportion of false positives and false negatives.
Many false positives are actually promotional articles. Manyaren’t.
False negatives are actually promotional; classifier thinks theyare not. Narrative style, excessive detail, biography pages.
Non-neutral vs Neutral promotional articles.
Bhosale, Vinicombe, Mooney Promotional Content in Wikipedia 26
Thank you!
Bhosale, Vinicombe, Mooney Promotional Content in Wikipedia 27