detecting promotional content in wikipediabenking/resources/reading... · automatic vandalism...

31
Detecting Promotional Content in Wikipedia Shruti Bhosale Heath Vinicombe Raymond Mooney Shibamouli Lahiri February 27, 2014 Bhosale, Vinicombe, Mooney Promotional Content in Wikipedia 1

Upload: others

Post on 28-Jun-2020

4 views

Category:

Documents


0 download

TRANSCRIPT

Page 1: Detecting Promotional Content in Wikipediabenking/resources/reading... · Automatic Vandalism Detection in Wikipedia with Web-based Shallow Syntactic-Semantic Modeling (COLING 2010)

Detecting Promotional Content in Wikipedia

Shruti BhosaleHeath VinicombeRaymond Mooney

Shibamouli LahiriFebruary 27, 2014

Bhosale, Vinicombe, Mooney Promotional Content in Wikipedia 1

Page 2: Detecting Promotional Content in Wikipediabenking/resources/reading... · Automatic Vandalism Detection in Wikipedia with Web-based Shallow Syntactic-Semantic Modeling (COLING 2010)

Motivation

Automatically identifying if an article is promotional, is hard.

Bhosale, Vinicombe, Mooney Promotional Content in Wikipedia 2

Page 3: Detecting Promotional Content in Wikipediabenking/resources/reading... · Automatic Vandalism Detection in Wikipedia with Web-based Shallow Syntactic-Semantic Modeling (COLING 2010)

Motivation

Automatically identifying if an article is promotional, is hard,because:

Number of such articles is small (17,000) compared to totalnumber of articles in Wikipedia (4.46 million).

Good negative examples are hard to find.

Bhosale, Vinicombe, Mooney Promotional Content in Wikipedia 3

Page 4: Detecting Promotional Content in Wikipediabenking/resources/reading... · Automatic Vandalism Detection in Wikipedia with Web-based Shallow Syntactic-Semantic Modeling (COLING 2010)

Motivation

Number of such articles is small (17,000) compared to totalnumber of articles in Wikipedia (4.46 million).

Under-sample the negative class.

Good negative examples are hard to find.

Use noisy documentsOptimistic vs Pessimistic settingsUse lots of featuresBoosting

Bhosale, Vinicombe, Mooney Promotional Content in Wikipedia 4

Page 5: Detecting Promotional Content in Wikipediabenking/resources/reading... · Automatic Vandalism Detection in Wikipedia with Web-based Shallow Syntactic-Semantic Modeling (COLING 2010)

Motivation

Number of such articles is small (17,000) compared to totalnumber of articles in Wikipedia (4.46 million).

Under-sample the negative class.

Good negative examples are hard to find.

Use noisy documentsOptimistic vs Pessimistic settingsUse lots of featuresBoosting

Bhosale, Vinicombe, Mooney Promotional Content in Wikipedia 4

Page 6: Detecting Promotional Content in Wikipediabenking/resources/reading... · Automatic Vandalism Detection in Wikipedia with Web-based Shallow Syntactic-Semantic Modeling (COLING 2010)

Related Work

Predicting Quality Flaws in User-generated Content: TheCase of Wikipedia (SIGIR 2012)

Language of Vandalism: Improving Wikipedia VandalismDetection via Stylometric Analysis (ACL-HLT 2011)

“Got you!”: Automatic Vandalism Detection in Wikipediawith Web-based Shallow Syntactic-Semantic Modeling(COLING 2010)

Automatic quality assessment of content createdcollaboratively by web communities: a case study ofWikipedia (JCDL 2009)

Does It Matter Who Contributes? – A Study on FeaturedArticles in the German Wikipedia (HT 2007)

Assessing the value of cooperation in Wikipedia (arXiv 2007)

Bhosale, Vinicombe, Mooney Promotional Content in Wikipedia 5

Page 7: Detecting Promotional Content in Wikipediabenking/resources/reading... · Automatic Vandalism Detection in Wikipedia with Web-based Shallow Syntactic-Semantic Modeling (COLING 2010)

Related Work

Predicting Quality Flaws in User-generated Content: TheCase of Wikipedia (SIGIR 2012)

Language of Vandalism: Improving Wikipedia VandalismDetection via Stylometric Analysis (ACL-HLT 2011)

“Got you!”: Automatic Vandalism Detection in Wikipediawith Web-based Shallow Syntactic-Semantic Modeling(COLING 2010)

Automatic quality assessment of content createdcollaboratively by web communities: a case study ofWikipedia (JCDL 2009)

Does It Matter Who Contributes? – A Study on FeaturedArticles in the German Wikipedia (HT 2007)

Assessing the value of cooperation in Wikipedia (arXiv 2007)

Bhosale, Vinicombe, Mooney Promotional Content in Wikipedia 5

Page 8: Detecting Promotional Content in Wikipediabenking/resources/reading... · Automatic Vandalism Detection in Wikipedia with Web-based Shallow Syntactic-Semantic Modeling (COLING 2010)

Related Work

Predicting Quality Flaws in User-generated Content: TheCase of Wikipedia (SIGIR 2012)

Language of Vandalism: Improving Wikipedia VandalismDetection via Stylometric Analysis (ACL-HLT 2011)

“Got you!”: Automatic Vandalism Detection in Wikipediawith Web-based Shallow Syntactic-Semantic Modeling(COLING 2010)

Automatic quality assessment of content createdcollaboratively by web communities: a case study ofWikipedia (JCDL 2009)

Does It Matter Who Contributes? – A Study on FeaturedArticles in the German Wikipedia (HT 2007)

Assessing the value of cooperation in Wikipedia (arXiv 2007)

Bhosale, Vinicombe, Mooney Promotional Content in Wikipedia 5

Page 9: Detecting Promotional Content in Wikipediabenking/resources/reading... · Automatic Vandalism Detection in Wikipedia with Web-based Shallow Syntactic-Semantic Modeling (COLING 2010)

Related Work

Predicting Quality Flaws in User-generated Content: TheCase of Wikipedia (SIGIR 2012)

Language of Vandalism: Improving Wikipedia VandalismDetection via Stylometric Analysis (ACL-HLT 2011)

“Got you!”: Automatic Vandalism Detection in Wikipediawith Web-based Shallow Syntactic-Semantic Modeling(COLING 2010)

Automatic quality assessment of content createdcollaboratively by web communities: a case study ofWikipedia (JCDL 2009)

Does It Matter Who Contributes? – A Study on FeaturedArticles in the German Wikipedia (HT 2007)

Assessing the value of cooperation in Wikipedia (arXiv 2007)

Bhosale, Vinicombe, Mooney Promotional Content in Wikipedia 5

Page 10: Detecting Promotional Content in Wikipediabenking/resources/reading... · Automatic Vandalism Detection in Wikipedia with Web-based Shallow Syntactic-Semantic Modeling (COLING 2010)

Dataset

13,000 promotional articles

26,000 untagged (“noisy”) articles

11,000 Featured and Good articles

70% to train language models, 30% for ten-foldcross-validation

Bhosale, Vinicombe, Mooney Promotional Content in Wikipedia 6

Page 11: Detecting Promotional Content in Wikipediabenking/resources/reading... · Automatic Vandalism Detection in Wikipedia with Web-based Shallow Syntactic-Semantic Modeling (COLING 2010)

Dataset

13,000 promotional articles

26,000 untagged (“noisy”) articles

11,000 Featured and Good articles

70% to train language models, 30% for ten-foldcross-validation

LogitBoost with default base classifier (Decision Stumps) and500 boosting iterations

Bhosale, Vinicombe, Mooney Promotional Content in Wikipedia 7

Page 12: Detecting Promotional Content in Wikipediabenking/resources/reading... · Automatic Vandalism Detection in Wikipedia with Web-based Shallow Syntactic-Semantic Modeling (COLING 2010)

Features

Content features

Structural features

Network features

Edit history features

Bhosale, Vinicombe, Mooney Promotional Content in Wikipedia 8

Page 13: Detecting Promotional Content in Wikipediabenking/resources/reading... · Automatic Vandalism Detection in Wikipedia with Web-based Shallow Syntactic-Semantic Modeling (COLING 2010)

Features

Content features

Structural features

Network features

Edit history features

Overall sentiment score

Bhosale, Vinicombe, Mooney Promotional Content in Wikipedia 9

Page 14: Detecting Promotional Content in Wikipediabenking/resources/reading... · Automatic Vandalism Detection in Wikipedia with Web-based Shallow Syntactic-Semantic Modeling (COLING 2010)

Features

Content features

Structural features

Network features

Edit history features

Overall sentiment score

58 features

Bhosale, Vinicombe, Mooney Promotional Content in Wikipedia 10

Page 15: Detecting Promotional Content in Wikipediabenking/resources/reading... · Automatic Vandalism Detection in Wikipedia with Web-based Shallow Syntactic-Semantic Modeling (COLING 2010)

Features

Content features

Structural features

Network features

Edit history features

Overall sentiment score

trigram language models

PCFG language models

Bhosale, Vinicombe, Mooney Promotional Content in Wikipedia 11

Page 16: Detecting Promotional Content in Wikipediabenking/resources/reading... · Automatic Vandalism Detection in Wikipedia with Web-based Shallow Syntactic-Semantic Modeling (COLING 2010)

Trigram Language Model and PCFG Features

LM char trigramLM word trigram

Bhosale, Vinicombe, Mooney Promotional Content in Wikipedia 12

Page 17: Detecting Promotional Content in Wikipediabenking/resources/reading... · Automatic Vandalism Detection in Wikipedia with Web-based Shallow Syntactic-Semantic Modeling (COLING 2010)

Trigram Language Model and PCFG Features

LM char trigramLM word trigram

PCFG maxPCFG minPCFG meanPCFG std deviation

Bhosale, Vinicombe, Mooney Promotional Content in Wikipedia 13

Page 18: Detecting Promotional Content in Wikipediabenking/resources/reading... · Automatic Vandalism Detection in Wikipedia with Web-based Shallow Syntactic-Semantic Modeling (COLING 2010)

Results

Features

Pessimistic Setting Optimistic Setting

P R F1 AUC P R F1 AUC

Bhosale, Vinicombe, Mooney Promotional Content in Wikipedia 14

Page 19: Detecting Promotional Content in Wikipediabenking/resources/reading... · Automatic Vandalism Detection in Wikipedia with Web-based Shallow Syntactic-Semantic Modeling (COLING 2010)

Results

Features

Pessimistic Setting Optimistic Setting

P R F1 AUC P R F1 AUC

Bag-of-words Baseline 0.823 0.820 0.821 0.893 0.931 0.931 0.931 0.979

Bhosale, Vinicombe, Mooney Promotional Content in Wikipedia 15

Page 20: Detecting Promotional Content in Wikipediabenking/resources/reading... · Automatic Vandalism Detection in Wikipedia with Web-based Shallow Syntactic-Semantic Modeling (COLING 2010)

Results

Features

Pessimistic Setting Optimistic Setting

P R F1 AUC P R F1 AUC

Bag-of-words Baseline 0.823 0.820 0.821 0.893 0.931 0.931 0.931 0.979PCFG 0.881 0.870 0.865 0.903 0.910 0.910 0.910 0.961

Bhosale, Vinicombe, Mooney Promotional Content in Wikipedia 16

Page 21: Detecting Promotional Content in Wikipediabenking/resources/reading... · Automatic Vandalism Detection in Wikipedia with Web-based Shallow Syntactic-Semantic Modeling (COLING 2010)

Results

Features

Pessimistic Setting Optimistic Setting

P R F1 AUC P R F1 AUC

Bag-of-words Baseline 0.823 0.820 0.821 0.893 0.931 0.931 0.931 0.979PCFG 0.881 0.870 0.865 0.903 0.910 0.910 0.910 0.961Character trigrams 0.889 0.887 0.888 0.952 0.858 0.843 0.841 0.877

Bhosale, Vinicombe, Mooney Promotional Content in Wikipedia 17

Page 22: Detecting Promotional Content in Wikipediabenking/resources/reading... · Automatic Vandalism Detection in Wikipedia with Web-based Shallow Syntactic-Semantic Modeling (COLING 2010)

Results

Features

Pessimistic Setting Optimistic Setting

P R F1 AUC P R F1 AUC

Bag-of-words Baseline 0.823 0.820 0.821 0.893 0.931 0.931 0.931 0.979PCFG 0.881 0.870 0.865 0.903 0.910 0.910 0.910 0.961Character trigrams 0.889 0.887 0.888 0.952 0.858 0.843 0.841 0.877Word trigrams 0.863 0.863 0.863 0.931 0.887 0.883 0.882 0.931

Bhosale, Vinicombe, Mooney Promotional Content in Wikipedia 18

Page 23: Detecting Promotional Content in Wikipediabenking/resources/reading... · Automatic Vandalism Detection in Wikipedia with Web-based Shallow Syntactic-Semantic Modeling (COLING 2010)

Results

Features

Pessimistic Setting Optimistic Setting

P R F1 AUC P R F1 AUC

Bag-of-words Baseline 0.823 0.820 0.821 0.893 0.931 0.931 0.931 0.979PCFG 0.881 0.870 0.865 0.903 0.910 0.910 0.910 0.961Character trigrams 0.889 0.887 0.888 0.952 0.858 0.843 0.841 0.877Word trigrams 0.863 0.863 0.863 0.931 0.887 0.883 0.882 0.931Character trigrams + Word trigrams 0.89 0.888 0.889 0.952 0.908 0.907 0.907 0.962

Bhosale, Vinicombe, Mooney Promotional Content in Wikipedia 19

Page 24: Detecting Promotional Content in Wikipediabenking/resources/reading... · Automatic Vandalism Detection in Wikipedia with Web-based Shallow Syntactic-Semantic Modeling (COLING 2010)

Results

Features

Pessimistic Setting Optimistic Setting

P R F1 AUC P R F1 AUC

Bag-of-words Baseline 0.823 0.820 0.821 0.893 0.931 0.931 0.931 0.979PCFG 0.881 0.870 0.865 0.903 0.910 0.910 0.910 0.961Character trigrams 0.889 0.887 0.888 0.952 0.858 0.843 0.841 0.877Word trigrams 0.863 0.863 0.863 0.931 0.887 0.883 0.882 0.931Character trigrams + Word trigrams 0.89 0.888 0.889 0.952 0.908 0.907 0.907 0.962PCFG + Char. trigrams + Word trigrams 0.914 0.915 0.914 0.974 0.950 0.950 0.950 0.983

Bhosale, Vinicombe, Mooney Promotional Content in Wikipedia 20

Page 25: Detecting Promotional Content in Wikipediabenking/resources/reading... · Automatic Vandalism Detection in Wikipedia with Web-based Shallow Syntactic-Semantic Modeling (COLING 2010)

Results

Features

Pessimistic Setting Optimistic Setting

P R F1 AUC P R F1 AUC

Bag-of-words Baseline 0.823 0.820 0.821 0.893 0.931 0.931 0.931 0.979PCFG 0.881 0.870 0.865 0.903 0.910 0.910 0.910 0.961Character trigrams 0.889 0.887 0.888 0.952 0.858 0.843 0.841 0.877Word trigrams 0.863 0.863 0.863 0.931 0.887 0.883 0.882 0.931Character trigrams + Word trigrams 0.89 0.888 0.889 0.952 0.908 0.907 0.907 0.962PCFG + Char. trigrams + Word trigrams 0.914 0.915 0.914 0.974 0.950 0.950 0.950 0.98358 Content and Meta Features 0.866 0.867 0.867 0.938 0.986 0.986 0.986 0.996

Bhosale, Vinicombe, Mooney Promotional Content in Wikipedia 21

Page 26: Detecting Promotional Content in Wikipediabenking/resources/reading... · Automatic Vandalism Detection in Wikipedia with Web-based Shallow Syntactic-Semantic Modeling (COLING 2010)

Results

Features

Pessimistic Setting Optimistic Setting

P R F1 AUC P R F1 AUC

Bag-of-words Baseline 0.823 0.820 0.821 0.893 0.931 0.931 0.931 0.979PCFG 0.881 0.870 0.865 0.903 0.910 0.910 0.910 0.961Character trigrams 0.889 0.887 0.888 0.952 0.858 0.843 0.841 0.877Word trigrams 0.863 0.863 0.863 0.931 0.887 0.883 0.882 0.931Character trigrams + Word trigrams 0.89 0.888 0.889 0.952 0.908 0.907 0.907 0.962PCFG + Char. trigrams + Word trigrams 0.914 0.915 0.914 0.974 0.950 0.950 0.950 0.98358 Content and Meta Features 0.866 0.867 0.867 0.938 0.986 0.986 0.986 0.996All Features 0.940 0.940 0.940 0.986 0.989 0.989 0.989 0.997

Bhosale, Vinicombe, Mooney Promotional Content in Wikipedia 22

Page 27: Detecting Promotional Content in Wikipediabenking/resources/reading... · Automatic Vandalism Detection in Wikipedia with Web-based Shallow Syntactic-Semantic Modeling (COLING 2010)

Results

Best Features in Pessimistic Setting

LM char trigramLM word trigramPCFG minPCFG maxPCFG meanPCFG std deviationNumber of CharactersNumber of WordsNumber of CategoriesNumber of Sentences

Bhosale, Vinicombe, Mooney Promotional Content in Wikipedia 23

Page 28: Detecting Promotional Content in Wikipediabenking/resources/reading... · Automatic Vandalism Detection in Wikipedia with Web-based Shallow Syntactic-Semantic Modeling (COLING 2010)

Results

Best Features in Pessimistic Setting Best Features in Optimistic Setting

LM char trigram Number of ReferencesLM word trigram Number of References per SectionPCFG min LM word trigramPCFG max Number of WordsPCFG mean PCFG meanPCFG std deviation Number of SentencesNumber of Characters LM char trigramNumber of Words Number of WordsNumber of Categories Number of CharactersNumber of Sentences Number of Backlinks

Bhosale, Vinicombe, Mooney Promotional Content in Wikipedia 24

Page 29: Detecting Promotional Content in Wikipediabenking/resources/reading... · Automatic Vandalism Detection in Wikipedia with Web-based Shallow Syntactic-Semantic Modeling (COLING 2010)

Results

Best Features in Pessimistic Setting Best Features in Optimistic Setting

LM char trigram Number of ReferencesLM word trigram Number of References per SectionPCFG min LM word trigramPCFG max Number of WordsPCFG mean PCFG meanPCFG std deviation Number of SentencesNumber of Characters LM char trigramNumber of Words Number of WordsNumber of Categories Number of CharactersNumber of Sentences Number of Backlinks

Bhosale, Vinicombe, Mooney Promotional Content in Wikipedia 25

Page 30: Detecting Promotional Content in Wikipediabenking/resources/reading... · Automatic Vandalism Detection in Wikipedia with Web-based Shallow Syntactic-Semantic Modeling (COLING 2010)

Observations

Stylometric features (PCFG and trigram language models)help.

Overall sentiment score doesn’t.

Using ten top-ranked features, F1 = 0.93 in pessimisticsetting, 0.988 in optimistic setting.

Featured/good articles are long, with more references.

Equal proportion of false positives and false negatives.

Many false positives are actually promotional articles. Manyaren’t.

False negatives are actually promotional; classifier thinks theyare not. Narrative style, excessive detail, biography pages.

Non-neutral vs Neutral promotional articles.

Bhosale, Vinicombe, Mooney Promotional Content in Wikipedia 26

Page 31: Detecting Promotional Content in Wikipediabenking/resources/reading... · Automatic Vandalism Detection in Wikipedia with Web-based Shallow Syntactic-Semantic Modeling (COLING 2010)

Thank you!

Bhosale, Vinicombe, Mooney Promotional Content in Wikipedia 27