Mining Interesting Trivia for Entities from Wikipedia (Part II)


Mining Interesting Trivia for Entities from Wikipedia

Presented By: Abhay Prakash, Enrollment No. 10211002, IIT Roorkee

Supervised By: Dr. Dhaval Patel, Assistant Professor, IIT Roorkee
               Dr. Manoj K. Chinnakotla, Applied Researcher, Microsoft India

Publication Accepted

[1] Abhay Prakash, Manoj K. Chinnakotla, Dhaval Patel, Puneet Garg: "Did You Know? Mining Interesting Trivia for Entities from Wikipedia". In Proceedings of the 24th International Joint Conference on Artificial Intelligence (IJCAI), 2015.

Conference Rating: A*

Introduction: Problem Statement

Definition: Trivia are facts about an entity that are interesting because of characteristics such as unusualness, uniqueness, unexpectedness or weirdness. They generally appear in "Did you know?" articles.

E.g. "To prepare for the Joker's role, Heath Ledger secluded himself in a hotel room for a month" [The Dark Knight]

It is unusual for an actor (or any person) to seclude himself for a month.

Problem Statement: For a given entity, mine the top-k interesting trivia from its Wikipedia page, where a trivia item is considered interesting if, when shown to N persons, more than N/2 of them find it interesting. For evaluation on the unseen set, we chose N = 5 (statistical significance discussed in the mid-term evaluation).

Wikipedia Trivia Miner (WTM)

- Based on an ML approach to mine trivia from unstructured text
- Trains a ranker using sample trivia of the target domain; experiments with Movie entities and Celebrity entities
- Harnesses the trained ranker to mine trivia from an entity's Wikipedia page; retrieves the top-k standalone interesting sentences from the entity's page

Why Wikipedia?

- Reliable for factual correctness
- Ample number of interesting trivia (56/100 in our experiment)

System Architecture

- Filtering & Grading: filters out noisy samples; gives a grade to each sample, as required by the ranker
- Interestingness Ranker: extracts features from the samples/candidates; trains the ranker (SVMrank) and ranks the candidates
- Candidate Selection: identifies candidate sentences from Wikipedia

[Architecture diagram. Train Phase: Human Voted Trivia Source → Filtering & Grading → Feature Extraction → SVMrank → Model. Retrieval Phase: Candidates' Source → Candidate Selection → Feature Extraction (using a Knowledge Base) → SVMrank with the trained Model → Top-K Interesting Trivia from Candidates.]

Execution Phases

Train Phase:
- Crawls and prepares the training data
- Featurizes the training data
- Trains SVMrank to build a model

Retrieval Phase:
- Crawls the entity's Wikipedia text
- Identifies candidate sentences for trivia
- Featurizes the candidates
- Ranks the candidates using the already-built model
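A minimal sketch of these two phases is given below. It assumes a featurize() helper that maps a sentence to {feature_index: value}, plus graded_trivia and candidate_sentences collections; those names, the file names and the SVMrank training parameter are illustrative assumptions rather than the authors' actual code.

```python
# Sketch of the Train and Retrieval phases around the SVMrank binaries.
import subprocess

def write_svmrank_file(samples, path):
    """Write (grade, qid, feature_dict) rows in SVMrank's SVM-light input format."""
    with open(path, "w") as f:
        for grade, qid, feats in samples:
            cols = " ".join(f"{i}:{v:.4f}" for i, v in sorted(feats.items()) if v)
            f.write(f"{grade} qid:{qid} {cols}\n")

# Train Phase: graded human-voted trivia -> ranking model
train_rows = [(grade, qid, featurize(sent)) for grade, qid, sent in graded_trivia]
write_svmrank_file(train_rows, "train.dat")
subprocess.run(["svm_rank_learn", "-c", "3", "train.dat", "model.dat"], check=True)

# Retrieval Phase: candidate sentences from the entity's Wikipedia page -> top-k trivia
cand_rows = [(0, 1, featurize(sent)) for sent in candidate_sentences]  # grades unknown here
write_svmrank_file(cand_rows, "candidates.dat")
subprocess.run(["svm_rank_classify", "candidates.dat", "model.dat", "scores.txt"], check=True)

scores = [float(line) for line in open("scores.txt")]
top_k = [sent for _, sent in sorted(zip(scores, candidate_sentences), reverse=True)][:10]
```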

Feature Engineering

Bucket: Unigram (U) Features
- Each word's TF-IDF: identifies important words which make the trivia interesting. Sample features: "stunt", "award", "improvise". Example trivia: "Tom Cruise did all of his own stunt driving."

Bucket: Linguistic (L) Features
- Superlative words: show extremeness (uniqueness). Sample features: "best", "longest", "first". Example trivia: "The longest animated Disney film since Fantasia (1940)."
- Contradictory words: opposing ideas could spark intrigue and interest. Sample features: "but", "although", "unlike". Example trivia: "The studios wanted Matthew McConaughey for the lead role, but James Cameron insisted on Leonardo DiCaprio."
- Root word (main verb): captures the core activity being discussed in the sentence. Sample feature: root_gross. Example trivia: "Gravity grossed $274 Mn in North America."
- Subject word (first noun): captures the core thing being discussed in the sentence. Sample feature: subj_actor. Example trivia: "The actors snorted crushed B vitamins for scenes involving cocaine."
- Readability: complex and lengthy trivia are hardly interesting. Feature: FOG index binned into 3 bins.
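The Linguistic (L) bucket can be approximated with off-the-shelf tools; the sketch below uses spaCy and textstat as stand-ins for whatever the authors used, and the contradictory-word list, feature names and FOG bin edges are assumptions for illustration only.

```python
# Rough sketch of the Linguistic (L) features: superlative POS tags, contradictory
# words, root verb, subject noun, and a binned FOG readability score.
import spacy
import textstat

nlp = spacy.load("en_core_web_sm")
CONTRADICTORY = {"but", "although", "unlike", "however", "despite"}  # assumed word list

def linguistic_features(sentence: str) -> dict:
    doc = nlp(sentence)
    feats = {}
    # Superlative words: Penn Treebank tags JJS (adjective) and RBS (adverb)
    feats["supPOS"] = sum(1 for t in doc if t.tag_ in ("JJS", "RBS"))
    # Contradictory / contrast words
    feats["contradictory"] = int(any(t.lower_ in CONTRADICTORY for t in doc))
    # Root word (main verb) and subject word (first nominal subject)
    root = next((t for t in doc if t.dep_ == "ROOT"), None)
    if root is not None:
        feats["root_" + root.lemma_.lower()] = 1
    subj = next((t for t in doc if t.dep_ in ("nsubj", "nsubjpass")), None)
    if subj is not None:
        feats["subj_" + subj.lemma_.lower()] = 1
    # Readability: Gunning FOG index binned into 3 bins (assumed bin edges)
    fog = textstat.gunning_fog(sentence)
    feats["FOG"] = 1 if fog < 8 else (2 if fog < 12 else 3)
    return feats

print(linguistic_features("The actors snorted crushed B vitamins for scenes involving cocaine."))
```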

Feature Engineering (Contd.)

Bucket: Entity (E) Features
- Generic NEs: capture general about-ness. Sample features: MONEY, ORGANIZATION, PERSON, DATE, TIME and LOCATION. Example trivia: "The guns in the film were supplied by Aldo Uberti Inc., a company in Italy." → ORGANIZATION and LOCATION.
- Related entities: capture specific about-ness (entities resolved using DBpedia). Sample features: entity_producer, entity_director. Example trivia: "According to Victoria Alonso, Rocket Raccoon and Groot were created through a mix of motion-capture and rotomation VFX." → entity_producer, entity_character.
- Entity linking before parsing: captures the generalized story of the sentence. Sample feature: subj_entity_producer. Example (same trivia as above): the sentence is parsed as "According to entity_producer, …", so subj_Victoria becomes subj_entity_producer.
- Focus entities: capture the core entities being talked about. Sample feature: underroot_entity_producer. Example (same trivia as above): underroot_entity_producer, underroot_entity_character.
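The sketch below illustrates this bucket under simplifying assumptions: DBpedia attribute:value pairs are matched by plain string replacement before parsing, spaCy stands in for whatever NER/parser the authors used, and the underroot_ check is a simplified reading of "focus entities".

```python
# Rough sketch of the Entity (E) features: link DBpedia values to entity_<attribute>
# placeholders before parsing, then add generic NE features and subj_/underroot_ variants.
import spacy

nlp = spacy.load("en_core_web_sm")

def link_entities(sentence: str, attr_values: dict) -> str:
    """Replace each DBpedia attribute value found in the sentence with entity_<attribute>."""
    for attr, value in attr_values.items():
        sentence = sentence.replace(value, f"entity_{attr}")
    return sentence

def entity_features(sentence: str, attr_values: dict) -> dict:
    doc = nlp(link_entities(sentence, attr_values))
    feats = {}
    for ent in doc.ents:                      # generic NEs (general about-ness);
        feats[ent.label_] = 1                 # spaCy uses ORG/GPE for ORGANIZATION/LOCATION
    for tok in doc:                           # linked entities (specific about-ness)
        if tok.text.startswith("entity_"):
            feats[tok.text] = 1
            if tok.dep_ in ("nsubj", "nsubjpass"):
                feats["subj_" + tok.text] = 1        # entity linking before parsing
            if tok.head.dep_ == "ROOT":
                feats["underroot_" + tok.text] = 1   # simplified "focus entity" test
    return feats

attrs = {"producer": "Victoria Alonso", "character": "Rocket Raccoon"}  # assumed pairs
print(entity_features("According to Victoria Alonso, Rocket Raccoon and Groot were created "
                      "through a mix of motion-capture and rotomation VFX.", attrs))
```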

Feature Engineering: Example

Example: "According to Victoria Alonso, Rocket Raccoon and Groot were created through a mix of motion-capture and rotomation VFX."

Features extracted: 18025 (U) + 5 (L) + 4686 (E) columns in total over all the training data.

Feature values for this sentence (all remaining features are 0, e.g. entity_actor = 0, award = 0, subj_actor = 0, root_win = 0, …):

create = 0.25, mix = 0.75, motion = 0.96, capture = 0.4, rotomation = 0.85, VFX = 0.75, root_create = 1, supPOS = 0, subj_entity_producer = 1, FOG = 3,
contradictory = 0, entity_producer = 1, entity_character = 1, underroot_entity_producer = 1, underroot_entity_character = 1
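One way to assemble such a wide, sparse feature matrix is scikit-learn's DictVectorizer, sketched below; unigram_tfidf() is a hypothetical helper for the U bucket, train_sentences is an assumed corpus, and linguistic_features()/entity_features() refer to the earlier sketches. The column counts quoted above come from the whole training corpus, not from this single sentence.

```python
# Sketch: merge the U, L and E feature dicts per sentence and let DictVectorizer
# assign the sparse column indices.
from sklearn.feature_extraction import DictVectorizer

def featurize(sentence: str, attr_values: dict) -> dict:
    feats = {}
    feats.update(unigram_tfidf(sentence))                 # Unigram (U) bucket (assumed helper)
    feats.update(linguistic_features(sentence))           # Linguistic (L) bucket
    feats.update(entity_features(sentence, attr_values))  # Entity (E) bucket
    return feats

vec = DictVectorizer(sparse=True)
X_train = vec.fit_transform([featurize(s, av) for s, av in train_sentences])
print(X_train.shape)   # (number of sentences, total number of U + L + E columns)
```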

Comparative Approaches

I. Random [Baseline I]:

- 10 sentences picked randomly from Wikipedia

II. CS + Random

- Candidate Selection applied (keeps only standalone, context-independent sentences)

- i.e., remove sentences like “it really reminds me of my childhood”

- 10 sentences picked randomly from candidates

III. CS + supPOS (Best) [Baseline II]:

- Candidate Selection applied

- Candidates ranked by the number of superlative words

- Best case: among sentences with the same number of superlative words, the interesting ones are deliberately placed first (see the table and sketch below)

supPOS (Best Case) illustration:

Rank | # of superlative words | Class
1    | 2                      | Interesting
2    | 2                      | Boring
3    | 1                      | Interesting
4    | 1                      | Interesting
5    | 1                      | Interesting
6    | 1                      | Boring
7    | 1                      | Boring
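A tiny sketch of this best-case baseline, assuming a sup_count() helper that counts superlative POS tags (e.g. the supPOS value from the earlier linguistic-features sketch) and that the human interestingness label is available for tie-breaking:

```python
# CS + supPOS (Best) baseline: rank by superlative-word count, and on ties place the
# sentences labelled interesting first (an upper bound for this baseline).
def suppos_best_rank(candidates):
    """candidates: list of (sentence, is_interesting) pairs."""
    return sorted(candidates, key=lambda c: (-sup_count(c[0]), not c[1]))  # sup_count() assumed

ranked = suppos_best_rank([
    ("The longest animated Disney film since Fantasia (1940).", True),      # illustrative inputs
    ("The first trailer was released a year before the film.", False),
])
```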

Variants of WTM

I. WTM (U)

- Candidates Selected

- ML Ranking of candidates using only Unigram Features

II. WTM (U+L+E)

- Candidates Selected

- ML Ranking of candidates using all features: Unigram (U) + Linguistic (L) + Entity (E)

Results: P@10 (Movie Domain)

The metric is Precision at 10 (P@10): out of the top 10 ranked candidates, how many are actually interesting.

P@10 by approach: Random 0.25, CS+Random 0.30, supPOS (Best Case) 0.34, WTM (U) 0.34, WTM (U+L+E) 0.45.

- CS+Random > Random: shows the significance of Candidate Selection
- WTM (U+L+E) >> WTM (U): shows the significance of the engineered Linguistic (L) and Entity (E) features

Results: Recall@K

- supPOS is limited to one kind of trivia; WTM captures varied types, reaching 62% recall by rank 25
- Performance comparison: supPOS is better up to rank 3; soon after rank 3, WTM beats supPOS

[Recall-vs-rank plot (ranks up to 25, % recall up to ~70) comparing supPOS (Best Case), WTM and Random.]
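For reference, a minimal sketch of these two metrics, assuming each ranked candidate carries the majority-vote interestingness label described in the problem statement:

```python
# P@10 and Recall@K over a ranked list of 0/1 interestingness labels.
def precision_at_k(ranked_labels, k=10):
    """Fraction of the top-k ranked candidates that are labelled interesting."""
    top = ranked_labels[:k]
    return sum(top) / len(top)

def recall_at_k(ranked_labels, k):
    """Fraction of all interesting candidates that appear within the top k."""
    total = sum(ranked_labels)
    return sum(ranked_labels[:k]) / total if total else 0.0

labels = [1, 0, 1, 1, 0, 1, 0, 0, 1, 0]   # illustrative labels in ranked order
print(precision_at_k(labels, 10), recall_at_k(labels, 5))
```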

Sensitivity to Training Size

- Current results are reported with 6163 training trivia
- WTM's precision increases with training size
- This is a desirable property, since precision can be improved by taking more training data
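A sketch of how such a sensitivity study can be run; graded_trivia, held_out_entities and the train_and_score() helper (wrapping the SVMrank train/retrieval steps sketched earlier) are all assumed names.

```python
# Retrain on increasing fractions of the graded trivia and measure P@10 on a held-out set.
import random

random.seed(0)
random.shuffle(graded_trivia)
for frac in (0.25, 0.5, 0.75, 1.0):
    subset = graded_trivia[: int(frac * len(graded_trivia))]
    p_at_10 = train_and_score(subset, held_out_entities)   # assumed helper
    print(f"{len(subset)} training trivia -> P@10 = {p_at_10:.2f}")
```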

WTM's Domain Independence

Experiment on the Celebrity domain to justify the claim of domain independence.

Dataset: crawled trivia for the top 1000 movie celebrities from IMDb and performed a 5-fold test.

- Train dataset: 4459 trivia (106 entities)
- Test dataset: 500 trivia (10 entities)

The only features in doubt of being domain dependent are the Entity features:
- Unigram (U) Features: all words
- Linguistic (L) Features: subj_actor, root_reveal, subj_scene, but, best, FOG_index = 7.2
- Entity (E) Features: entity_producer, entity_director, …

WTM's Domain Independence (Contd.)

- Entity (E) features are domain independent too: they are automatically generated using attribute:value pairs from DBpedia; whenever a 'value' matches in a sentence, the match is replaced by entity_'attribute'
- Unigram (U) and Linguistic (L) features are clearly domain independent

[Slide shows DBpedia (attribute: value) pairs for Batman Begins alongside sample trivia for Batman Begins.]


Example of Entity Features in the Celebrity Domain (Entity Feature Generation from DBpedia):

Feature            | Entity        | Trivia
entity_partner     | Johnny Depp   | "Engaged to Amber Heard [January 17, 2014]." **
entity_citizenship | Nicole Kidman | "First Australian actress to win the Best Actress Academy Award."

** After entity linking, the sentence is parsed as "Engaged to entity_partner".

WTM's Domain Independence (Contd.)

DBpedia attribute:value pairs and the features generated from them:

Movie Domain (e.g. Batman Begins (2005)):
- Director: Christopher Nolan → entity_director
- Producer: Larry J. Franco → entity_producer

Celebrity Domain (e.g. Angelina Jolie):
- Partner: Brad Pitt → entity_partner
- birthPlace: California → entity_birthPlace
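A rough sketch of how such attribute:value pairs can be pulled from DBpedia and turned into entity features; the SPARQL query, the restriction to dbpedia.org/ontology properties and the example sentence are illustrative assumptions, not the authors' exact procedure.

```python
# Fetch (attribute, value) pairs for an entity from the public DBpedia SPARQL
# endpoint and emit entity_<attribute> features for values found in a sentence.
import requests

SPARQL = "https://dbpedia.org/sparql"

def dbpedia_attr_values(resource: str) -> dict:
    """Return {attribute: value} pairs from the entity's DBpedia ontology properties."""
    query = f"""
    SELECT ?p ?o WHERE {{
      <http://dbpedia.org/resource/{resource}> ?p ?o .
      FILTER(STRSTARTS(STR(?p), "http://dbpedia.org/ontology/"))
    }}"""
    resp = requests.get(SPARQL, params={"query": query,
                                        "format": "application/sparql-results+json"},
                        timeout=30)
    pairs = {}
    for row in resp.json()["results"]["bindings"]:
        attr = row["p"]["value"].rsplit("/", 1)[-1]                    # e.g. director, producer
        pairs[attr] = row["o"]["value"].rsplit("/", 1)[-1].replace("_", " ")
    return pairs

pairs = dbpedia_attr_values("Batman_Begins")
sentence = "Christopher Nolan insisted on practical effects for the film."  # illustrative sentence
features = {f"entity_{attr}": 1 for attr, value in pairs.items() if value and value in sentence}
print(features)   # e.g. {'entity_director': 1} if the lookup matches
```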

Feature Contribution (Movie vs. Celebrity)

Movie Domain (top-ranked features):
Rank | Feature                         | Group
1    | subj_scene                      | Linguistic
2    | subj_entity_cast                | Linguistic + Entity
3    | entity_produced_by              | Entity
4    | underroot_unlinked_organization | Linguistic + Entity
6    | root_improvise                  | Linguistic
7    | entity_character                | Entity
8    | MONEY                           | Entity (NER)
14   | stunt                           | Unigram
16   | superPOS                        | Linguistic
17   | subj_actor                      | Linguistic

Celebrity Domain (top-ranked features):
Rank | Feature                 | Group
1    | win                     | Unigram
3    | magazine                | Unigram
4    | superPOS                | Linguistic
5    | MONEY                   | Entity (NER)
6    | entity_alternativenames | Entity
7    | root_engage             | Linguistic
14   | subj_earnings           | Linguistic
15   | subj_entity_children    | Linguistic + Entity
18   | entity_birthplace       | Entity
19   | subj_unlinked_location  | Linguistic + Entity

Top features: our advanced features are useful and intuitive for humans too.
Entity linking leads to better generalization (instead of entity_wolverine, the model gets entity_cast).
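Tables like these can be produced from a trained linear ranker by pairing each learned weight with its feature name and sorting by absolute weight; the sketch below assumes the weight vector has already been recovered from the SVMrank model (or any linear ranking model) and that the DictVectorizer from the earlier featurization sketch supplies the feature names.

```python
# Rank features by the magnitude of their learned weights in a linear ranking model.
import numpy as np

def top_features(weights, feature_names, k=10):
    order = np.argsort(-np.abs(np.asarray(weights)))[:k]
    return [(feature_names[i], float(weights[i])) for i in order]

# Example usage: print(top_features(weights, vec.get_feature_names_out(), k=10))
```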

Results: P@10 (Celebrity Domain)

P@10 by approach: Random 0.39, supPOS (Best Case) 0.54, WTM (U) 0.58, WTM (U+L+E) 0.71.

- Again WTM (U+L+E) >> WTM (U): shows the significance of the advanced (L) and (E) features
- Hence, the features and the approach are domain independent
- For entities of any domain, just replace the training data (sample trivia)

Dissertation Contribution

- Identified, defined and provided a novel research problem, rather than only providing solutions to an existing problem
- Proposed a domain-independent system, "Wikipedia Trivia Miner (WTM)", to mine the top-k interesting trivia for any given entity based on their interestingness
- Engineered features that capture the 'about-ness' of a sentence and generalize which sentences are interesting
- Proposed a mechanism to prepare ground truth for the test set that is cost-effective yet statistically significant

Future Works

- New features to increase ranking quality:
  - Unusualness: probability of occurrence of the sentence in the considered domain
  - Fact popularity: lesser-known trivia could be more interesting to the majority of people
- Trying deep learning: could be helpful, as in the case of sarcasm detection
- Generating questions from the mined trivia, to present trivia in question form
- Obtaining personalized interesting trivia: in this dissertation, interestingness was based on majority voting; ranking could instead be based on user demographics
