semantic text mining - kbs · •entities comprise of facts and statements supported by external...

Semantic Text Mining: Mining Structured Information from Unstructured Data

Besnik Fetahu

Outline

• Introduction

• Information Sources

• Structured Data: Knowledge Bases and Formal Representation of Information

• Text Mining Applications

• Relation Extraction

• Named Entity Disambiguation

• Machine Reading

• Wikipedia News Suggestion

• Conclusions

Introduction

• Large Amounts of Data.

• Heterogeneity of information: provenance, quality, content, representation, language etc.

• Unstructured vs. Semi-Structured vs. Structured

• Knowledge Bases: maintenance, updating, addition of new facts

• Automated vs. Crowd-based analysis of data

Information Sources

• Unstructured:

• News Collections: NYTimes, Reuters, Wall Street Journal, GDelt etc.

• Web Resources: Common Crawl, ClueWeb

• Social Streams: Twitter, Facebook, Reddit

• Semi-Structured:

• Wikipedia

• Structured:

• Linked Data: Linked Open Data Cloud

• Knowledge Bases: DBpedia, YAGO, Freebase

Information Sources: GDelt

• ~138200 indexed daily news articles from more than 3000 news domains

• 37007 news domains in total

• ~18682 daily entities with an average of 64 mentions per entity

news domain        news articles
yahoo.com          1244781
allafrica.com      1035646
reuters.com        828133
dailymail.co.uk    815372
indiatimes.com     743991
wn.com             587607

Information Sources: Wikipedia

• 5 million articles

• Articles structured into sections

• Articles annotated with categories

• Collaboratively edited and maintained


Structured Data: Formal Representation of Knowledge and Knowledge Bases

• Semantic Web

• Ontologies: Knowledge Representation

• Knowledge Bases

Structured Data: Semantic Web

The ultimate goal of the Web of data is to enable computers to do more useful work and to develop systems that can support trusted interactions over the network. The term “Semantic Web” refers to W3C’s vision of the Web of linked data. Semantic Web technologies enable people to create data stores on the Web, build vocabularies, and write rules for handling data. Linked data are empowered by technologies such as RDF, SPARQL, OWL, and SKOS.

• Format: turtle, n3, etc.

• Syntax: XML Schema

• Models: RDF

• Taxonomies: RDFS

• Ontologies: OWL

• Query languages: SPARQL

• Interchange formats: RIF
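As a rough illustration of the RDF data model and of SPARQL-style pattern matching, the sketch below stores facts as (subject, predicate, object) triples and answers a single triple pattern. The triples, the `ex:` prefix, and the `match` helper are hypothetical and only serve the example; a real setup would use an RDF store and a SPARQL endpoint.

```python
# Toy triple store: illustrates the RDF data model, not a real RDF/SPARQL engine.
# All identifiers (the ex: prefix, the match helper) are made up for this sketch.
TRIPLES = {
    ("ex:Barack_Obama", "rdf:type", "ex:Politician"),
    ("ex:Barack_Obama", "ex:birthPlace", "ex:Honolulu"),
    ("ex:Odisha", "rdf:type", "ex:State"),
}


def match(s=None, p=None, o=None, triples=TRIPLES):
    """Return the triples matching a pattern; None plays the role of a variable."""
    return sorted(
        t for t in triples
        if (s is None or t[0] == s)
        and (p is None or t[1] == p)
        and (o is None or t[2] == o)
    )


# Roughly: SELECT ?s WHERE { ?s rdf:type ex:Politician }
politicians = [s for s, _, _ in match(p="rdf:type", o="ex:Politician")]
print(politicians)
```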

Structured Data: Ontologies

Structured Data: Knowledge Bases

• Nell

• TextRunner

• YAGO

• DBpedia

• Freebase

Text Mining Applications

• Relation Extraction

• Named Entity Disambiguation

• Machine Reading

Text Mining Applications: Relation Extraction

• DP of chunks of texts for relation extraction

• Syntactic patterns for relation extraction

• Semantic and Lexical patterns for relation extraction
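A minimal sketch of pattern-based relation extraction, assuming hand-written surface patterns expressed as regular expressions. The two patterns and the relation names `birthPlace` and `capitalOf` are hypothetical; real systems typically work on parsed text rather than raw strings, and these naive patterns only behave well on short, simple sentences.

```python
import re

# Hypothetical lexical patterns: each maps a surface pattern to a relation name.
PATTERNS = [
    (re.compile(r"(?P<subj>[A-Z][\w ]+?) was born in (?P<obj>[A-Z][\w ]+)"), "birthPlace"),
    (re.compile(r"(?P<subj>[A-Z][\w ]+?) is the capital of (?P<obj>[A-Z][\w ]+)"), "capitalOf"),
]


def extract_relations(sentence):
    """Return (subject, relation, object) triples found by the patterns."""
    triples = []
    for pattern, relation in PATTERNS:
        for m in pattern.finditer(sentence):
            triples.append((m.group("subj").strip(), relation, m.group("obj").strip()))
    return triples


print(extract_relations("Barack Obama was born in Honolulu."))
```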

Text Mining Applications: Named Entity Disambiguation

• Textual content has rich underlying syntactic and semantic structure.

• Frequently extracted syntactic and semantic information: POS, Co-Ref and NER.

• Named entity recognition with specific entity types: Person, Organisation, Place, Date.

Text Mining Applications: Named Entity Disambiguation

• NED: named entity disambiguation of surface forms with entities from knowledge bases

• DBPedia Spotlight

• Aida

• Wikiminer
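To make the idea of linking a surface form to a knowledge base entity concrete, here is a minimal sketch that combines a prior probability per candidate with keyword overlap between the text and the candidate's context. The candidate dictionary, the priors, the context keywords, and the scoring formula are all assumptions for illustration; this is not how DBpedia Spotlight, AIDA, or Wikiminer are implemented.

```python
# Hypothetical candidate dictionary: surface form -> {entity: prior probability}.
CANDIDATES = {
    "Tesla": {"Nikola_Tesla": 0.4, "Tesla_Inc.": 0.6},
}
# Hypothetical context keywords per entity (e.g. taken from a KB description).
CONTEXT = {
    "Nikola_Tesla": {"inventor", "alternating", "current", "engineer"},
    "Tesla_Inc.": {"car", "electric", "musk", "company"},
}


def disambiguate(surface_form, text):
    """Pick the candidate with the highest prior * (1 + context overlap)."""
    tokens = set(text.lower().split())
    best, best_score = None, -1.0
    for entity, prior in CANDIDATES.get(surface_form, {}).items():
        overlap = len(tokens & CONTEXT[entity])
        score = prior * (1 + overlap)
        if score > best_score:
            best, best_score = entity, score
    return best


print(disambiguate("Tesla", "Tesla was an inventor and engineer of alternating current systems"))
```

Without any matching context the sketch falls back to the candidate with the highest prior, and it returns `None` for surface forms that have no candidates.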

Everything done?

• Only a small fraction of data is actually structured

• Cumbersome to define schemas, taxonomies, and ontologies manually and explicitly

• Large proportion of data is unstructured or semi-structured

• Can we automatically extract and model such content?

How can we enrich and maintain Wikipedia?

Why Wikipedia and News?

• Why Wikipedia?

• Text Categorization

• Entity Disambiguation

• Entity Search

• Knowledge Bases

• Why News?

• Authoritative sources

• Professionally edited and qualitative source of information!

• Inherent importance of reported events and facts about entities in Wikipedia

• Second most cited source of information in Wikipedia

[Figure: citation categories (news, book, court, journal, web, thesis) per entity type, y-axis from 0 to 1; entity types include ComicsCreator, Artwork, NaturalPlace, Airline, Film, SoccerManager, LegalCase, Album, Band, SportsTeam, TelevisionShow, Athlete, Politician, Scientist, MilitaryConflict, Writer, Location, and others]

[Figure: six panels for the years 2001 to 2006, each plotting the series EE and NEE over an x-axis from -11 to 11]

News Distribution in Wikipedia

Entities reported in News and Wikipedia

• Human fatalities: 10k (Odisha cyclone) vs. 1.8k (Hurricane Katrina)

• Estimated damages: $4.5 billion (Odisha cyclone) vs. $108 billion (Hurricane Katrina)

• ‘Odisha cyclone’ has no coverage in the entity location ‘Odisha’

• ‘Hurricane Katrina’ finds broad coverage in the entity location ‘New Orleans’

[Figure: entity pages New Orleans and Odisha; event pages Hurricane Katrina and Odisha Cyclone]

Why does this matter at all?

• Entity pages consist of facts and statements supported by external references!

• News as authoritative sources with emerging facts and events.

• Delay between the reporting of an event in news and its inclusion in entity pages [1]

• Incomplete section structure for long-tail entities

• Several implications on real-world applications that make use of Wikipedia, e.g. KB maintenance, entity disambiguation etc.

[1] “How much is Wikipedia lagging behind news?” Besnik Fetahu, Abhijit Anand and Avishek Anand, WebSci’15, Oxford, UK.

Approach: Automated news suggestion to entity pages

news article:

Some half a million people were evacuated from the southeastern Indian coast as Cyclone Phailin, a tropical storm from the Bay of Bengal, bore down on India. The states of Orissa and Andhra Pradesh, both of which have large coastal populations, were on high alert ahead of the storm’s expected arrival.

[Diagram: feature extraction over the news article, its entities, and the sections of the Wikipedia entity page feeds two tasks]

• Task#1: article entity placement (entities in the example: Odisha, Bay of Bengal, Phailin)

• Task#2: article section placement, with one classifier per entity type (e.g. [state]: geography, [city]: climate …)

Article-Entity Placement

[Figure: example entities Nikola Tesla, Elon Musk, Larry Page, John B. Kennedy]

News Suggestion Attributes: Task#1 Entity Salience

Entity Salience: Relative Entity Frequency

• reward entity appearing throughout the text

• reward entity appearing in the top paragraphs

• weigh an entity w.r.t its co-occurring entities

Tesla is a central concept in the given news article
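The three criteria above can be sketched as a position-weighted mention count that is normalised over all entities in the article. The `1 / position` paragraph weight and the function name are assumptions for illustration, not the exact measure used in the talk.

```python
def relative_entity_frequency(paragraphs, entities):
    """Score entities by position-weighted mention counts, normalised over all entities."""
    raw = {}
    for entity in entities:
        score = 0.0
        for position, paragraph in enumerate(paragraphs, start=1):
            mentions = paragraph.lower().count(entity.lower())
            score += mentions / position  # assumed weighting: earlier paragraphs count more
        raw[entity] = score
    total = sum(raw.values())  # normalising weighs each entity against its co-occurring entities
    return {e: (s / total if total else 0.0) for e, s in raw.items()}


paragraphs = [
    "Tesla unveiled a new battery. Tesla said production starts soon.",
    "Elon Musk spoke about Tesla at the event.",
    "Larry Page attended.",
]
salience = relative_entity_frequency(paragraphs, ["Tesla", "Elon Musk", "Larry Page"])
print(max(salience, key=salience.get))
```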


News Suggestion Attributes: Task#1 Relative Entity Authority

[Figure: example entities Elias Taban and Hillary Clinton]

Relative Entity Authority

• entities with ‘low authority’ have a lower entry barrier for a news article

• a news article in which an entity co-occurs with ‘high authority’ entities conveys the importance of the news

• entity authority as an a priori probability or any centrality based measure
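One possible reading of the bullets above, as a sketch: take an a priori probability per entity (here estimated from hypothetical mention counts) and compare the average authority of the co-occurring entities with the authority of the entity itself. The counts, function names, and the ratio are assumptions; any centrality-based measure could replace the prior.

```python
def authority_prior(mention_counts):
    """A priori probability of each entity, estimated from (hypothetical) mention counts."""
    total = sum(mention_counts.values())
    return {entity: count / total for entity, count in mention_counts.items()}


def relative_authority(entity, cooccurring, prior):
    """Mean authority of co-occurring entities divided by the entity's own authority."""
    if not cooccurring:
        return 1.0
    mean_other = sum(prior[e] for e in cooccurring) / len(cooccurring)
    return mean_other / prior[entity]


prior = authority_prior({"Hillary Clinton": 9000, "Elias Taban": 10})
# A value above 1 means the article mentions the entity alongside more authoritative ones.
print(relative_authority("Elias Taban", ["Hillary Clinton"], prior) > 1)
```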

News Suggestion Attributes: Task#1 Novelty & Redundancy

previously added news articles

• novelty is measured w.r.t previously added news articles in an entity page

• major events have wide coverage in news media

• place the news article into the correct section

Novelty and Redundancy Measure
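A minimal sketch of a novelty measure with respect to previously added news articles, assuming novelty is one minus the highest Jaccard similarity over word sets. The choice of Jaccard similarity and whitespace tokenisation are assumptions, not the measure used in the talk.

```python
def jaccard(a, b):
    """Jaccard similarity of two sets; 0.0 when both are empty."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def novelty(news_text, previously_added):
    """1.0 means entirely new; 0.0 means identical (as a word set) to an earlier article."""
    tokens = set(news_text.lower().split())
    if not previously_added:
        return 1.0
    max_similarity = max(
        jaccard(tokens, set(previous.lower().split())) for previous in previously_added
    )
    return 1.0 - max_similarity


print(novelty("Cyclone Phailin hits Odisha", ["Cyclone Phailin hits Odisha"]))
```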


Article-Section Placement

Task#2: Section-template generation

[Figure: example entities Germanwings, Adria, Lufthansa]

• Section templates per entity type

• Pre-determined number of main sections

• Canonicalize sections

• Generate ‘complete’ section templates based on similar entities

• Cluster based on the X-means [3] algorithm

[3] D. Pelleg, A. W. Moore, et al. X-means: Extending k-means with efficient estimation of the number of clusters. In ICML, pages 727–734, 2000.

Task#2: Overall news-section fit

• What is the best section to append a given news article to?

• measure overall similarity between the news article n and the pre-computed sections in the section templates

• Similarity aspects between news articles and sections:

• Topic similarity (LDA models over the sections and news documents)

• Syntactic similarity

• Lexical similarity

• Entity-based similarity (overlap of named entities)

• Frequency
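As a sketch of how two of these similarity aspects could be combined, the code below scores each section by a weighted sum of lexical similarity (cosine over term counts) and entity-based similarity (Jaccard overlap of named entities), and picks the best-scoring section. The equal weighting, the function names, and the example sections are assumptions; topic, syntactic, and frequency aspects are left out.

```python
import math
from collections import Counter


def cosine(text_a, text_b):
    """Lexical similarity: cosine over whitespace-token counts."""
    ca, cb = Counter(text_a.lower().split()), Counter(text_b.lower().split())
    dot = sum(ca[t] * cb[t] for t in ca)
    norm = math.sqrt(sum(v * v for v in ca.values())) * math.sqrt(sum(v * v for v in cb.values()))
    return dot / norm if norm else 0.0


def entity_overlap(entities_a, entities_b):
    """Entity-based similarity: Jaccard overlap of named entities."""
    a, b = set(entities_a), set(entities_b)
    return len(a & b) / len(a | b) if a | b else 0.0


def best_section(news_text, news_entities, sections, weight=0.5):
    """sections maps a section name to (section text, section entities)."""
    def fit(item):
        text, entities = item[1]
        return weight * cosine(news_text, text) + (1 - weight) * entity_overlap(news_entities, entities)
    return max(sections.items(), key=fit)[0]


sections = {
    "Geography": ("coast bay river hills", ["Bay of Bengal"]),
    "Climate": ("cyclone storm monsoon rainfall", ["Phailin", "Bay of Bengal"]),
}
print(best_section("cyclone storm evacuated coast", ["Phailin", "Bay of Bengal", "Odisha"], sections))
```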


Can we do more than suggest news to a Wikipedia Section?!

Suggest citations to actual statements in Wikipedia

External Links in Wikipedia and Knowledge Bases

Finding News Citations in Wikipedia!

On October 9, 2009, the Norwegian Nobel Committee announced that Obama had won the 2009 Nobel Peace Prize "for his extraordinary efforts to strengthen international diplomacy and cooperation between peoples". Obama accepted this award in Oslo, Norway on December 10, 2009, with "deep gratitude and great humility." The award drew a mixture of praise and criticism from world leaders and media figures.


“Citogenesis”: Citogenesis, on the other hand is a portmanteau of 'Citation' and 'Genesis'. A Citation is a reference to a source, used to back up a specific claim. Genesis means the origin of something. By extension, citogenesis is the creation of text in a reliable source that can be cited to back-up a claim.

http://en.wikipedia.org/wiki/Citogenesis

http://www.merriam-webster.com/dictionary/portmanteau

http://en.wikipedia.org/wiki/Citation

http://mw1.merriam-webster.com/dictionary/genesis?show=1&t=1346949206

Citogenesis [citation needed]

• What type of citation do we need here?

• Which citation do we place for this definition?

https://www.explainxkcd.com/wiki/index.php/978:_Citogenesis

https://xkcd.com/978/



Finding Citations: Motivation


• Increase Wikipedia article quality by providing citations to external references

• Replace and update existing citations with higher quality and authority references

• Find and replace citations for statements that have dead URLs

• Find citations for statements that are flagged with a “citation needed” tag, currently around 300k statements

• Automate the process of enriching Wikipedia and help editors in the decision process of providing citations to existing or new Wikipedia statements

Motivation: Acknowledged Problem by Wikimedia Foundation


Approach Overview

Wikipedia entities, e.g. Barack Obama (typeOf: Politician):

• Sections

• Anchors

• Text

• Categories

Entity statements:

• Obama was born on August 4, 1961,[4] …..

• The couple married in Wailuku on Maui on …

• After graduating with a JD degree magna cum laude[49]…

• Obama was elected to the Illinois Senate in …

1. Statement Classification (Task#1: Classify statements): Feature Extraction, then decide: news statement?

2. Finding Citations (Task#2: Find citations): if YES, Query Construction (“Obama”, “Illinois”, “Senate”) against the news index, retrieve doc_1, doc_2, …, doc_k, Choose Document, Feature Extraction, Classify Correct Reference

Statement Classification

Features:

1. Wikipedia Entity Structure

2. Language Style

3. Entity Type Probabilities

Train Supervised Models:

1. Learn models that accurately predict the citation category of a statement.

2. Multi-class classification problem for citation categories: {web, news, comic, journal …}

3. Optimize for “news” statements

Finding Citations

Wikipedia Statement: On October 9, 2009, the Norwegian Nobel Committee announced that Obama had won the 2009 Nobel Peace Prize "for his extraordinary efforts to strengthen international diplomacy and cooperation between peoples". Obama accepted this award in Oslo, Norway on December 10, 2009, with "deep gratitude and great humility." The award drew a mixture of praise and criticism from world leaders and media figures.

Q = {Wikipedia Statement}

Query the News Index and retrieve the top-100 documents: doc_1, doc_2, …, doc_k

Candidate references:

http://nobelprize.org/nobel_prizes/peace/laureates/2009/

http://www.cnn.com/2009/politics/12/10/obama.transcript/index.html

http://www.timesonline.co.uk/tol/news/world/us_and_americas/article6868905.ece

…………..

http://www.msnbc.msn.com/id/33237202/

http://www.reuters.com/article/topnews/idustre5981jk20091009?sp=true

http://www.whitehouse.gov/the_press_office/remarks-by-the-president-on-winning-the-nobel-peace-prize/

Pick relevant articles
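The retrieval step can be sketched as ranking indexed news documents against the statement used as a query and keeping the top k. The slides do not name the retrieval model, so the TF-IDF style scoring, the `tokenize` and `retrieve` helpers, and the three toy documents are assumptions for illustration only.

```python
import math
from collections import Counter


def tokenize(text):
    """Lowercase whitespace tokens with surrounding punctuation stripped."""
    return [t.strip('.,"()').lower() for t in text.split() if t.strip('.,"()')]


def retrieve(query, docs, k=100):
    """Return up to k document ids ranked by a TF-IDF weighted overlap with the query."""
    doc_tokens = {doc_id: Counter(tokenize(text)) for doc_id, text in docs.items()}
    n = len(docs)
    df = Counter(t for counts in doc_tokens.values() for t in counts)  # document frequency
    query_terms = set(tokenize(query))
    scores = {
        doc_id: sum(counts[t] * math.log(1 + n / df[t]) for t in query_terms if t in counts)
        for doc_id, counts in doc_tokens.items()
    }
    ranked = sorted(scores, key=lambda d: scores[d], reverse=True)
    return [d for d in ranked if scores[d] > 0][:k]


docs = {
    "doc_1": "Obama wins the Nobel Peace Prize",
    "doc_2": "Cyclone Phailin hits Odisha",
    "doc_3": "Nobel committee praises diplomacy",
}
print(retrieve("Obama had won the 2009 Nobel Peace Prize", docs))
```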


Finding Citations: Entailment

Tree Kernel: K(s1, s2)

• Capture syntactic similarity between two sentences

• Capture semantic similarity between two sentences by checking the POS of a word
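A tree kernel K(s1, s2) can be sketched as a count of the tree fragments that two parse trees share. The code below is a small subset-tree kernel in the style of Collins and Duffy, assuming parse trees are given as nested tuples with POS tags as pre-terminal labels; the talk does not state which kernel variant was used, and the example trees are hand-written rather than parser output.

```python
def nodes(tree):
    """Yield every internal node of a tree given as nested tuples (label, child, ...)."""
    if isinstance(tree, tuple):
        yield tree
        for child in tree[1:]:
            yield from nodes(child)


def production(node):
    """The node label plus the labels (or words) of its direct children."""
    return (node[0],) + tuple(c[0] if isinstance(c, tuple) else c for c in node[1:])


def common_subtrees(n1, n2, decay):
    """Weighted number of common fragments rooted at n1 and n2."""
    if production(n1) != production(n2):
        return 0.0
    score = decay
    for c1, c2 in zip(n1[1:], n2[1:]):
        if isinstance(c1, tuple) and isinstance(c2, tuple):
            score *= 1.0 + common_subtrees(c1, c2, decay)
    return score


def tree_kernel(t1, t2, decay=1.0):
    """K(s1, s2): sum of common fragment counts over all node pairs."""
    return sum(common_subtrees(a, b, decay) for a in nodes(t1) for b in nodes(t2))


s1 = ("S", ("NP", ("NNP", "Obama")), ("VP", ("VBD", "won")))
s2 = ("S", ("NP", ("NNP", "Clinton")), ("VP", ("VBD", "won")))
print(tree_kernel(s1, s1), tree_kernel(s1, s2))
```

The two sentences share their syntactic skeleton and POS tags but differ in one word, so the kernel value for the pair is lower than the value of a sentence with itself.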

Finding Citations: Centrality Features

• Not all sentences in a news article have the same importance

• Capture the entailment features w.r.t the central sentence

Central sentence in a news article (TextRank):
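A minimal sketch of picking the central sentence with TextRank: sentences are nodes, edges are weighted by word overlap, and a damped power iteration ranks the nodes. The whitespace tokenisation, the damping factor of 0.85, the fixed 50 iterations, and the example sentences are assumptions made for this illustration.

```python
import math


def similarity(s1, s2):
    """Word overlap normalised by sentence lengths, as in the TextRank sentence graph."""
    w1, w2 = set(s1.lower().split()), set(s2.lower().split())
    if len(w1) < 2 or len(w2) < 2:
        return 0.0  # avoids a zero denominator for one-word sentences
    return len(w1 & w2) / (math.log(len(w1)) + math.log(len(w2)))


def textrank(sentences, damping=0.85, iterations=50):
    """Return one score per sentence from a damped power iteration."""
    n = len(sentences)
    weights = [
        [similarity(a, b) if i != j else 0.0 for j, b in enumerate(sentences)]
        for i, a in enumerate(sentences)
    ]
    out_sum = [sum(row) for row in weights]
    scores = [1.0 / n] * n
    for _ in range(iterations):
        scores = [
            (1 - damping) / n + damping * sum(
                weights[j][i] / out_sum[j] * scores[j] for j in range(n) if out_sum[j] > 0
            )
            for i in range(n)
        ]
    return scores


def central_sentence(sentences):
    """The sentence with the highest TextRank score."""
    scores = textrank(sentences)
    return sentences[max(range(len(sentences)), key=scores.__getitem__)]


sentences = [
    "Cyclone Phailin hit the Odisha coast",
    "Odisha evacuated people from the coast",
    "Cyclone Phailin formed in the Bay of Bengal",
    "Stock markets closed higher",
]
print(central_sentence(sentences))
```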

Finding Citations: News Domains

• For different entity types domains have varying authority.

• We learn to predict the more reliable and authoritative sources of information
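One simple way to approximate domain authority per entity type is to count how often each news domain is already cited for entities of that type. The sketch below does only that; the talk describes a learned model, so this count-based estimate, the function name, and the toy citation pairs are stand-in assumptions.

```python
from collections import Counter, defaultdict


def domain_authority(citations):
    """citations: iterable of (entity_type, news_domain) pairs from existing references.

    Returns, per entity type, the share of citations that point to each domain.
    """
    counts = defaultdict(Counter)
    for entity_type, domain in citations:
        counts[entity_type][domain] += 1
    return {
        entity_type: {d: c / sum(domains.values()) for d, c in domains.items()}
        for entity_type, domains in counts.items()
    }


citations = [
    ("Politician", "reuters.com"),
    ("Politician", "reuters.com"),
    ("Politician", "dailymail.co.uk"),
    ("Athlete", "dailymail.co.uk"),
]
authority = domain_authority(citations)
print(authority["Politician"])
```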

Finding Citations: Why other and crowdsourced evaluation strategies?


Conclusions

• Enrich and Expand Wikipedia Entity Pages

• Maintain an up-to-date and consistent state of Wikipedia

• Improve quality of Wikipedia pages

• Knowledge base approaches benefit from richer and more up-to-date content in Wikipedia

• Applications, like Google Search, Q&A systems like Siri etc., benefit due to their use of Wikipedia

Thank You! Questions?

For more: Twitter: @FetahuBesnik Web: http://l3s.de/~fetahu/
