Semantic Text Mining
Mining Structured Information from Unstructured Data
Besnik Fetahu
Outline
• Introduction
• Information Sources
• Structured Data: Knowledge Bases and Formal Representation of Information
• Text Mining Applications
• Relation Extraction
• Named Entity Disambiguation
• Machine Reading
• Wikipedia News Suggestion
• Conclusions
Introduction
• Large Amounts of Data.
• Heterogeneity of information: provenance, quality, content, representation, language etc.
• Unstructured vs. Semi-Structured vs. Structured
• Knowledge Bases: maintenance, updating, addition of new facts
• Automated vs. crowd-based analysis of data
Information Sources
• Unstructured:
• News Collections: NYTimes, Reuters, Wall Street Journal, GDelt etc.
• Web Resources: Common Crawl, ClueWeb
• Social Streams: Twitter, Facebook, Reddit
• Semi-Structured:
• Wikipedia
• Structured:
• Linked Data: Linked Open Data Cloud
• Knowledge Bases: DBpedia, YAGO, Freebase
Information Sources: GDelt
• ~138,200 indexed daily news articles from more than 3,000 news domains
• 37,007 news domains in total
• ~18,682 daily entities with an average of 64 mentions per entity
News articles per news domain:
news domain        news articles
yahoo.com          1,244,781
allafrica.com      1,035,646
reuters.com        828,133
dailymail.co.uk    815,372
indiatimes.com     743,991
wn.com             587,607
Information Sources: Wikipedia
• 5 million articles
• Articles structured into sections
• Articles annotated with categories
• Collaboratively edited and maintained
Structured Data: Formal Representation of Knowledge and Knowledge Bases
• Semantic Web
• Ontologies: Knowledge Representation
• Knowledge Bases
Structured Data: Semantic Web
The ultimate goal of the Web of data is to enable computers to do more useful work and to develop systems that can support trusted interactions over the network. The term “Semantic Web” refers to W3C’s vision of the Web of linked data. Semantic Web technologies enable people to create data stores on the Web, build vocabularies, and write rules for handling data. Linked data are empowered by technologies such as RDF, SPARQL, OWL, and SKOS.
• Formats: Turtle, N3, etc.
• Syntax: XML Schema
• Models: RDF
• Taxonomies: RDFS
• Ontologies: OWL
• Query languages: SPARQL
• Interchange formats: RIF
Structured Data: Ontologies
Structured Data: Knowledge Bases
• NELL
• TextRunner
• YAGO
• DBpedia
• Freebase
Text Mining Applications
• Relation Extraction
• Named Entity Disambiguation
• Machine Reading
Text Mining Applications: Relation Extraction
• Dependency parsing (DP) of chunks of text for relation extraction
• Syntactic patterns for relation extraction
• Semantic and Lexical patterns for relation extraction
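A minimal sketch of pattern-based relation extraction, assuming simple surface patterns stand in for the syntactic patterns mentioned above. The `NAME` expression, the two patterns, and the relation names `bornIn` and `foundedBy` are hypothetical choices for illustration, not the patterns used by any particular system.

```python
import re

# A capitalised name of one or more words, e.g. "Barack Obama"
# (a simplifying assumption; real systems use NER or parse trees).
NAME = r"[A-Z]\w*(?: [A-Z]\w*)*"

# Hypothetical patterns; the relation names are made up for this example.
PATTERNS = {
    "bornIn": re.compile(rf"(?P<subj>{NAME}) was born in (?P<obj>{NAME})"),
    "foundedBy": re.compile(rf"(?P<subj>{NAME}) was founded by (?P<obj>{NAME})"),
}


def extract_relations(text):
    """Return (subject, relation, object) triples found by the patterns."""
    triples = []
    for relation, pattern in PATTERNS.items():
        for match in pattern.finditer(text):
            triples.append((match.group("subj"), relation, match.group("obj")))
    return triples


print(extract_relations(
    "Barack Obama was born in Honolulu. Microsoft was founded by Bill Gates."
))
```

Each extracted triple has the same subject, predicate, object shape as a fact in a knowledge base, which is what makes such patterns useful for populating one.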
Text Mining Applications: Named Entity Disambiguation
• Textual content has rich underlying syntactic and semantic structure.
• Frequently extracted syntactic and semantic information: part-of-speech tags (POS), co-reference resolution (Co-Ref), and named entity recognition (NER).
• Named entity recognition with specific entity types: Person, Organisation, Place, Date.
Text Mining Applications: Named Entity Disambiguation
• NED: named entity disambiguation of surface forms with entities from knowledge bases
• DBpedia Spotlight
• AIDA
• Wikiminer
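A minimal sketch of how a surface form could be disambiguated against knowledge base entities, assuming a small candidate dictionary with prior probabilities and context keywords. The dictionary contents, the prior values, and the equal weighting of prior and context overlap are made up for illustration and do not reflect how the tools listed above work internally.

```python
import re

# Hypothetical candidate dictionary: surface form -> candidate entities.
# Priors and context keywords are invented for this example.
CANDIDATES = {
    "tesla": [
        {"entity": "Nikola_Tesla", "prior": 0.4,
         "context": {"inventor", "electricity", "physicist"}},
        {"entity": "Tesla,_Inc.", "prior": 0.6,
         "context": {"car", "electric", "musk", "company"}},
    ],
}


def disambiguate(surface_form, context_text):
    """Pick the candidate whose prior and context keywords fit best."""
    candidates = CANDIDATES.get(surface_form.lower(), [])
    if not candidates:
        return None
    tokens = set(re.findall(r"\w+", context_text.lower()))

    def score(candidate):
        overlap = len(candidate["context"] & tokens) / len(candidate["context"])
        # Assumed equal weighting of prior and context overlap.
        return 0.5 * candidate["prior"] + 0.5 * overlap

    return max(candidates, key=score)["entity"]


print(disambiguate("Tesla", "The inventor Tesla experimented with electricity"))
```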
Everything done?
• Only a small fraction of data is actually structured
• Cumbersome to define schemas, taxonomies, and ontologies manually and explicitly
• Large proportion of data is unstructured or semi-structured
• Can we automatically extract and model such content?
How can we enrich and maintain Wikipedia?
Why Wikipedia and News?
• Why Wikipedia?
  • Text Categorization
  • Entity Disambiguation
  • Entity Search
  • Knowledge Bases
• Why News?
  • Authoritative sources
  • Professionally edited, high-quality source of information!
  • Inherent importance of reported events and facts about entities in Wikipedia
  • Second most cited source of information in Wikipedia
[Figure: share (0 to 1) of citation types (news, book, court, journal, web, thesis) per entity type: ComicsCreator, Artwork, NaturalPlace, Airline, Film, SoccerManager, LegalCase, Album, Band, SportsTeam, TelevisionShow, AnatomicalStructure, Athlete, Weapon, Criminal, MusicalArtist, Politician, Plant, Song, Non-ProfitOrganisation, Book, Actor, FictionalCharacter, RecordLabel, Broadcaster, PoliticalParty, Automobile, TradeUnion, Scientist, MilitaryPerson, Philosopher, TelevisionSeason, Election, OfficeHolder, SportsLeague, GovernmentAgency, Single, Animal, Award, SportsEvent, Airport, MilitaryConflict, TelevisionEpisode, Aircraft, Magazine, Writer, Location]
[Figure: six panels, one per year from 2001 to 2006, each with the series EE and NEE; x-axis from -11 to 11, y-axis counts]
News Distribution in Wikipedia
Entities reported in News and Wikipedia
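• Human fatalities: 10k (Odisha cyclone) vs. 1.8k (Hurricane Katrina)
• Estimated damages: $4.5 billion (Odisha cyclone) vs. $108 billion (Hurricane Katrina)
• ‘Odisha cyclone’ has no coverage in the entity page of its location, ‘Odisha’
• ‘Hurricane Katrina’ finds broad coverage in the entity page of its location, ‘New Orleans’
Compared entity pages: New Orleans and Odisha (locations); Hurricane Katrina and Odisha Cyclone (events)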
Why does this matter at all?
• Entity pages consist of facts and statements supported by external references!
• News articles are authoritative sources of emerging facts and events.
• Delay between the reporting of an event in news and its inclusion in entity pages [1]
• Incomplete section structure for long-tail entities
• Several implications for real-world applications that make use of Wikipedia, e.g. KB maintenance, entity disambiguation etc.
[1] Besnik Fetahu, Abhijit Anand, and Avishek Anand. “How much is Wikipedia lagging behind news?” WebSci’15, Oxford, UK.
Approach: Automated news suggestion to entity pages
News article (example): Some half a million people were evacuated from the southeastern Indian coast as Cyclone Phailin, a tropical storm from the Bay of Bengal, bore down on India. The states of Orissa and Andhra Pradesh, both of which have large coastal populations, were on high alert ahead of the storm’s expected arrival.
Feature extraction over: the news article, its entities, and the sections of the Wikipedia entity page
Task#1: Article-entity placement
• Example entities: Odisha, Bay of Bengal, Phailin
Task#2: Article-section placement
• One classifier per entity type
• Example sections: [state]: geography, [city]: climate, …
Article-Entity Placement
Example entities: Nikola Tesla, Elon Musk, Larry Page, John B. Kennedy
News Suggestion Attributes: Task#1 Entity Salience
Entity Salience: Relative Entity Frequency
• reward an entity appearing throughout the text
• reward an entity appearing in the top paragraphs
• weigh an entity w.r.t. its co-occurring entities
Tesla is a central concept in the given news article
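A minimal sketch of relative entity frequency, assuming the news article is already split into paragraphs with their entity mentions. The `1 / (position + 1)` paragraph weight and the normalisation over all co-occurring entities are assumptions chosen to mirror the three bullets above, not the exact formulation of the approach.

```python
def entity_salience(paragraph_mentions):
    """Relative, position-weighted frequency of each entity.

    paragraph_mentions: list of paragraphs in document order,
    each paragraph being a list of entity mentions.
    """
    weighted = {}
    for position, mentions in enumerate(paragraph_mentions):
        # Assumed decay: mentions in top paragraphs count more.
        weight = 1.0 / (position + 1)
        for entity in mentions:
            weighted[entity] = weighted.get(entity, 0.0) + weight
    total = sum(weighted.values())
    if total == 0:
        return {}
    # Normalise against all co-occurring entities in the article.
    return {entity: value / total for entity, value in weighted.items()}


scores = entity_salience([
    ["Tesla", "Elon Musk"],   # first paragraph
    ["Tesla"],
    ["Tesla", "Larry Page"],  # last paragraph
])
print(max(scores, key=scores.get))
```

An entity mentioned in every paragraph, including the first one, ends up with the largest share, which matches the intuition that it is the central concept of the article.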
News Suggestion Attributes: Task#1 Relative Entity Authority
Example entities: Elias Taban, Hillary Clinton
Relative Entity Authority
• entities with ‘low authority’ have a lower entry barrier for a news article
• a news article in which an entity co-occurs with ‘high authority’ entities conveys the importance of the news
• entity authority as an a priori probability or any centrality-based measure
News Suggestion Attributes: Task#1 Novelty & Redundancy
• novelty is measured w.r.t. previously added news articles in an entity page
• major events have wide coverage in news media
• place the news article into the correct section
Novelty and Redundancy Measure
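A minimal sketch of a novelty score, assuming novelty is one minus the highest word-level Jaccard similarity to any news article already cited in the entity page. The slides do not specify the measure, so the Jaccard choice and the function names are assumptions.

```python
import re


def tokens(text):
    """Lower-cased set of word tokens."""
    return set(re.findall(r"\w+", text.lower()))


def jaccard(a, b):
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def novelty(news_text, previous_articles):
    """1.0 means entirely new; 0.0 means redundant with an earlier article."""
    if not previous_articles:
        return 1.0
    new = tokens(news_text)
    return 1.0 - max(jaccard(new, tokens(old)) for old in previous_articles)


print(novelty("cyclone hits odisha coast", ["cyclone hits odisha"]))
```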
Article-Section Placement
Task#2: Section-template generation
Example entities: Germanwings, Adria, Lufthansa
• Section templates per entity type
• Pre-determined number of main sections
• Canonicalize sections
• Generate ‘complete’ section templates based on similar entities
• Cluster based on the X-means [3] algorithm
[3] D. Pelleg, A. W. Moore, et al. X-means: Extending k-means with efficient estimation of the number of clusters. In ICML, pages 727–734, 2000.
Task#2: Overall news-section fit
• What is the best section to append a given news article to?
  • Measure the overall similarity between the news article n and the pre-computed sections in the section templates
• Similarity aspects between news articles and sections
  • Topic similarity (LDA models over the sections and news documents)
  • Syntactic similarity
  • Lexical similarity
  • Entity-based similarity (overlap of named entities)
  • Frequency
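A minimal sketch of picking the best-fitting section, assuming only two of the similarity aspects above: lexical similarity (cosine over bag-of-words counts) and entity-based similarity (Jaccard overlap of named entities). The equal weights, the section data, and the function names are hypothetical; the topic, syntactic, and frequency aspects are left out.

```python
import math
import re
from collections import Counter


def bag(text):
    return Counter(re.findall(r"\w+", text.lower()))


def cosine(a, b):
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0


def jaccard(a, b):
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def best_section(news_text, news_entities, sections, w_lexical=0.5, w_entity=0.5):
    """sections: dict mapping section name -> (section text, set of entities)."""
    news_bag = bag(news_text)

    def fit(item):
        text, entities = item[1]
        return (w_lexical * cosine(news_bag, bag(text))
                + w_entity * jaccard(news_entities, entities))

    return max(sections.items(), key=fit)[0]


# Hypothetical section template for an entity of type [state].
sections = {
    "Climate": ("cyclone storm rainfall monsoon coast", {"Bay of Bengal"}),
    "Economy": ("industry agriculture mining exports", {"Reserve Bank of India"}),
}
print(best_section(
    "Cyclone Phailin a tropical storm bore down on the coast",
    {"Phailin", "Bay of Bengal"},
    sections,
))
```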
Can we do more than suggest news to a Wikipedia section?
Suggest citations to actual statements in Wikipedia
External Links in Wikipedia and Knowledge Bases
Finding News Citations in Wikipedia!
On October 9, 2009, the Norwegian Nobel Committee announced that Obama had won the 2009 Nobel Peace Prize "for his extraordinary efforts to strengthen international diplomacy and cooperation between peoples". Obama accepted this award in Oslo, Norway on December 10, 2009, with "deep gratitude and great humility." The award drew a mixture of praise and criticism from world leaders and media figures.
“Citogenesis”: Citogenesis, on the other hand is a portmanteau of 'Citation' and 'Genesis'. A Citation is a reference to a source, used to back up a specific claim. Genesis means the origin of something. By extension, citogenesis is the creation of text in a reliable source that can be cited to back-up a claim.
Citogenesis [citation needed]
• What type of citation do we need here?
• Which citation do we place for this definition?
• Candidate sources: https://www.explainxkcd.com/wiki/index.php/978:_Citogenesis and https://xkcd.com/978/
Finding Citations: Motivation
• Increase Wikipedia article quality by providing citations to external references
• Replace and update existing citations with higher quality and authority references
• Find and replace citations for statements that have dead URLs
• Find citations for statements that are flagged with a “citation needed” tag, currently around 300k statements
• Automate the process of enriching Wikipedia and help editors in the decision process of providing citations to existing or new Wikipedia statements
Motivation: Acknowledged Problem by Wikimedia Foundation
Approach Overview
Input: Wikipedia entities (e.g. Barack Obama, typeOf Politician)
• Sections
• Anchors
• Text
• Categories
Entity statements (examples):
• Obama was born on August 4, 1961,[4] …
• The couple married in Wailuku on Maui on …
• After graduating with a JD degree magna cum laude[49] …
• Obama was elected to the Illinois Senate in …
Task#1: Classify statements (1. Statement Classification)
• Feature extraction over the entity statements
• Decide: is this a news statement? If YES, continue with Task#2
Task#2: Find citations (2. Finding Citations)
• Query construction, e.g. “Obama”, “Illinois”, “Senate”
• Retrieve doc_1, doc_2, …, doc_k from the news index
• Choose a document, extract features, and classify whether it is the correct reference
Statement Classification
Features:
1. Wikipedia Entity Structure
2. Language Style
3. Entity Type Probabilities
Train Supervised Models:
1. Learn models that accurately predict the citation category of a statement.
2. Multi-class classification problem for citation categories: {web, news, comic, journal …}
3. Optimize for “news” statements
Finding Citations
Wikipedia Statement: On October 9, 2009, the Norwegian Nobel Committee announced that Obama had won the 2009 Nobel Peace Prize "for his extraordinary efforts to strengthen international diplomacy and cooperation between peoples". Obama accepted this award in Oslo, Norway on December 10, 2009, with "deep gratitude and great humility." The award drew a mixture of praise and criticism from world leaders and media figures.
• Query the news index with Q = {Wikipedia statement}
• Retrieve the top-100 documents doc_1, doc_2, …, doc_k
• Example URLs:
  http://nobelprize.org/nobel_prizes/peace/laureates/2009/
  http://www.cnn.com/2009/politics/12/10/obama.transcript/index.html
  http://www.timesonline.co.uk/tol/news/world/us_and_americas/article6868905.ece
  …
  http://www.msnbc.msn.com/id/33237202/
  http://www.reuters.com/article/topnews/idustre5981jk20091009?sp=true
  http://www.whitehouse.gov/the_press_office/remarks-by-the-president-on-winning-the-nobel-peace-prize/
• Pick relevant articles
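A minimal sketch of the retrieval step, assuming an in-memory news index and a simple TF-IDF style score with the Wikipedia statement used as the query. The index layout (`doc_id -> text`), the scoring formula, and the sample documents are assumptions; the slides do not say which retrieval model is used.

```python
import math
import re
from collections import Counter


def tokenize(text):
    return re.findall(r"\w+", text.lower())


def retrieve(statement, index, k=100):
    """Return the ids of the top-k documents for a Wikipedia statement.

    index: dict mapping doc_id -> news article text (hypothetical layout).
    """
    n_docs = len(index)
    doc_counts = {doc_id: Counter(tokenize(text)) for doc_id, text in index.items()}
    # Document frequency of every term in the index.
    df = Counter(term for counts in doc_counts.values() for term in counts)
    query = set(tokenize(statement))
    scores = {}
    for doc_id, counts in doc_counts.items():
        score = sum(counts[t] * math.log(1 + n_docs / df[t])
                    for t in query if t in counts)
        if score > 0:
            scores[doc_id] = score
    return sorted(scores, key=scores.get, reverse=True)[:k]


# Made-up documents for illustration.
index = {
    "doc_1": "Obama wins the Nobel Peace Prize",
    "doc_2": "Football results from the weekend",
    "doc_3": "Nobel committee announces prize",
}
print(retrieve("Obama had won the 2009 Nobel Peace Prize", index, k=2))
```

The retrieved candidates are only a starting point: the later feature extraction and classification steps decide which of them is a suitable citation.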
Finding Citations: Entailment
Tree Kernel: K(s1, s2)
• Capture syntactic similarity between two sentences
• Capture semantic similarity between two sentences by checking the POS of a word
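A minimal sketch of a tree kernel K(s1, s2), assuming the subset-tree formulation of Collins and Duffy over parse trees written as nested tuples `(label, child, ...)` with words as plain strings. The slides do not say which kernel variant is used, so this is one plausible instance; the tree encoding and the `decay` parameter are assumptions.

```python
def internal_nodes(tree):
    """Yield every non-leaf node; leaves (words) are plain strings."""
    if isinstance(tree, tuple):
        yield tree
        for child in tree[1:]:
            yield from internal_nodes(child)


def production(node):
    """Grammar production at a node, e.g. NP -> DT NN or NN -> 'prize'."""
    return (node[0],) + tuple(
        ("node", child[0]) if isinstance(child, tuple) else ("word", child)
        for child in node[1:]
    )


def common_fragments(n1, n2, decay):
    """Weighted count of tree fragments rooted at both n1 and n2."""
    if production(n1) != production(n2):
        return 0.0
    count = decay
    for c1, c2 in zip(n1[1:], n2[1:]):
        if isinstance(c1, tuple):
            count *= 1.0 + common_fragments(c1, c2, decay)
    return count


def tree_kernel(t1, t2, decay=1.0):
    """K(s1, s2): sum of shared fragments over all pairs of nodes."""
    return sum(common_fragments(a, b, decay)
               for a in internal_nodes(t1) for b in internal_nodes(t2))


s1 = ("NP", ("DT", "the"), ("NN", "prize"))
s2 = ("NP", ("DT", "the"), ("NN", "award"))
print(tree_kernel(s1, s1), tree_kernel(s1, s2))
```

Because POS tags sit directly above the words, two sentences with different nouns still share the `NP -> DT NN` structure, so the kernel is non-zero even when the wording differs.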
Finding Citations: Centrality Features
• Not all sentences in a news article have the same importance
• Capture the entailment features w.r.t. the central sentence
• Central sentence in a news article (TextRank)
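A minimal sketch of finding the central sentence with TextRank, assuming the usual sentence graph: edges weighted by word overlap normalised by log sentence lengths, scored with a damped power iteration. The damping factor 0.85 and the fixed iteration count are conventional defaults chosen here, not values given in the slides.

```python
import math
import re


def similarity(a, b):
    """Word overlap normalised by the log lengths of both sentences."""
    denom = math.log(len(a)) + math.log(len(b)) if a and b else 0.0
    return len(a & b) / denom if denom > 0 else 0.0


def central_sentence(sentences, damping=0.85, iterations=50):
    """Return the highest-scoring sentence under TextRank, or None if empty."""
    n = len(sentences)
    if n == 0:
        return None
    words = [set(re.findall(r"\w+", s.lower())) for s in sentences]
    weights = [[similarity(words[i], words[j]) if i != j else 0.0
                for j in range(n)] for i in range(n)]
    out_sum = [sum(row) for row in weights]
    scores = [1.0] * n
    for _ in range(iterations):
        # Each sentence receives score from its neighbours, in proportion
        # to the edge weight relative to the neighbour's total weight.
        scores = [
            (1 - damping) + damping * sum(
                weights[j][i] / out_sum[j] * scores[j]
                for j in range(n) if out_sum[j] > 0
            )
            for i in range(n)
        ]
    return sentences[max(range(n), key=scores.__getitem__)]


print(central_sentence([
    "Cyclone Phailin bore down on India.",
    "Half a million people were evacuated as cyclone Phailin approached India.",
    "Stock markets were calm.",
]))
```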
Finding Citations: News Domains
• News domains have varying authority for different entity types.
• We learn to predict the more reliable and authoritative sources of information
Finding Citations: Why other and crowdsourced evaluation strategies?
Conclusions
• Enrich and Expand Wikipedia Entity Pages
• Maintain an up-to-date and consistent state of Wikipedia
• Improve the quality of Wikipedia pages
• Knowledge base approaches benefit from richer and more up-to-date content in Wikipedia
• Applications such as Google Search and Q&A systems such as Siri benefit because they rely on Wikipedia
Thank You! Questions?
For more: Twitter: @FetahuBesnik Web: http://l3s.de/~fetahu/