Construction and Querying of Large-scale Knowledge Bases
Xiang Ren (1), Yu Su (2), Xifeng Yan (2)
(1) University of Southern California
(2) University of California, Santa Barbara
Tutorial website: http://xren7.web.engr.illinois.edu/tutorial-cikm17.html
Slides, code, datasets, references
Turning Unstructured Text Data into Structures
Unstructured Text Data
(accounts for ~80% of all data in organizations)
Knowledge & Insights
? Structures
(Chakraborty, 2016)
Reading the reviews: From Text to Structured Facts
Entity types: Restaurant, Location, Organization, Event
"This hotel is my favorite Hilton property in NYC! It is located right on 42nd street near Times Square, it is close to all subways, Broadways shows, and next to great restaurants like Junior's Cheesecake, Virgil's BBQ and many others."
-- TripAdvisor
[Figure: entity graph. Nodes: hotel, Hilton property, NYC, Times Square, Junior's Cheesecake, Virgil's BBQ, Broadways shows. Edge labels: is a, located at, located near, close to]
Structured Facts
1. "Typed" entities
2. "Typed" relationships
Why Text to Structures?
Structured Search & Exploration
Pattern / Association Rule Mining
Structured Feature Generation
Graph Mining & Network Analysis
A Product Use Case: Finding "Interesting Hotel Collections"
http://engineering.tripadvisor.com/using-nlp-to-find-interesting-collections-of-hotels/
Technology Transfer to TripAdvisor
Grouping hotels based on structured facts extracted from the review text
Features for "Catch a Show" collection:
1. broadway shows
2. beacon theater
3. broadway dance center
4. broadway plays
5. david letterman show
6. radio city music hall
7. theatre shows
Features for "Near The High Line" collection:
1. high line park
2. chelsea market
3. highline walkway
4. elevated park
5. meatpacking district
6. west side
7. old railway
Prior Art: Extracting Structures with Repeated Human Effort
Text Corpus
This hotel is my favorite Hilton property in NYC! It is located right on 42nd street near Times Square, it is close to all subways, Broadways shows, and next to many great …
… The June 2013 Egyptian protest were mass protest event that occurred in Egypt on 30 June 2013, …
… We had a room facing Times Square and a room facing the Empire State Building, The location is close to everything and we love …
Human labeling
Labeled data
Extraction Rules
Machine-Learning Models
Stanford CoreNLP, CMU NELL, UW KnowItAll, USC AMR, IBM Alchemy APIs, Google Knowledge Graph, Microsoft Satori, …
Structured Facts
Broadways shows, NYC, Hilton property, Times square, hotel
This Tutorial: Effort-Light StructMine
Corpus-specific Models
Text Corpus
Structures
• Enables quick development of applications over various corpora
• Extracts complex structures without introducing human error
News articles
PubMed papers
Knowledge Bases (KB)
Effort-Light StructMine: Where Are We?
A Review of Previous Efforts
Axes: human labeling effort, feature engineering effort
Hand-crafted systems: UCB Hearst Pattern, 1992; NYU Proteus, 1997
Supervised learning systems: Stanford CoreNLP, 2005 - present; UT Austin Dependency Kernel, 2005; IBM Watson Language APIs
Weakly-supervised learning systems: CMU NELL, 2009 - present; UW KnowItAll, Open IE, 2005 - present; Max-Planck YAGO, 2008 - present
Distantly-supervised learning systems: Stanford DeepDive, MIML-RE, 2012 - present; UW FIGER, MultiR, 2012; Effort-Light StructMine (WWW'15, KDD'15, KDD'16, EMNLP'16, WWW'17, …)
"Distant" Supervision: What Is It?
Text corpus
Knowledge Bases
"Matchable" structures: entity names, entity types, typed relationships ...
(Mintz et al., 2009), (Riedel et al., 2010), (Lin et al., 2012), (Ling et al., 2012), (Surdeanu et al., 2012), (Xu et al., 2013), (Nagesh et al., 2014), …
Freely available!
• Common knowledge
• Life sciences
• Art …
Rapidly growing!
Number of Wikipedia articles: https://en.wikipedia.org/wiki/Wikipedia:Size_of_Wikipedia
Human crowds
"Un-matchable"
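The matching step behind distant supervision can be sketched in a few lines. This is only an illustration under stated assumptions, not the tutorial's actual system: the toy knowledge base `TOY_KB`, its type labels, and the helper `distant_label` are hypothetical, and the matching is plain exact string lookup.

```python
import re

# Hypothetical toy knowledge base (an assumption for illustration):
# entity name -> entity type.
TOY_KB = {
    "Times Square": "Location",
    "NYC": "Location",
    "Hilton": "Organization",
}

def distant_label(sentence, kb):
    """Label every exact, whole-word occurrence of a KB entity name."""
    mentions = []
    for name, etype in kb.items():
        pattern = r"\b" + re.escape(name) + r"\b"
        for m in re.finditer(pattern, sentence):
            mentions.append((m.start(), name, etype))
    mentions.sort()  # order mentions by their position in the sentence
    return [(name, etype) for _, name, etype in mentions]

sentence = ("This hotel is my favorite Hilton property in NYC! "
            "It is located right on 42nd street near Times Square.")
labels = distant_label(sentence, TOY_KB)
print(labels)
```

Names that are missing from the knowledge base, such as "Junior's Cheesecake" in this toy setting, receive no label at all; these are the "un-matchable" mentions that the rest of the pipeline has to handle.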
Learning with Distant Supervision: Challenges
1. Sparsity of "Matchable"
• Incomplete knowledge bases
• Low-confidence matching
2. Accuracy of "Expansion"
• For "matchable": Are all the labels assigned accurately?
• For "un-matchable": How to perform inference accurately?
(Ren et al., KDD'15)
It is my favorite city in the United States
The United States needs a new strategy to meet this challenge
Government, Location
… next to restaurants like Junior's Cheesecake
Effort-Light StructMine: Contributions
Challenge: Sparsity of "Matchable"
Solution idea: Effective expansion from "matchable" to "un-matchable"
Challenge: Accuracy of "Expansion"
Solution idea: Pick the "best" labels based on the context (for both "matchable" and "un-matchable")
Harness the "data redundancy" using graph-based joint optimization
Text corpus
It is my favorite city in the United States
The United States needs a new strategy to meet this challenge
Government, Location
Effort-Light StructMine: Methodology
Text corpus
Data-driven text segmentation (SIGMOD'15, WWW'16)
Entity names & context units
Knowledge bases
Partially-labeled corpus
Learning Corpus-specific Model (KDD'15, KDD'16, EMNLP'16, WWW'17)
Structures from the remaining unlabeled data
Transformation in Information Search
Desktop search
Mobile search
Lengthy Documents? Direct Answers!
"Which hotel has a roller coaster in Las Vegas?"
Answer: New York-New York hotel
Surge of mobile Internet use in China
Application: Facebook Entity Graph
People, Places, and Things
Facebook's knowledge graph (entity graph) stores as entities the users, places, pages and other objects within Facebook.
Connecting
The connections between the entities indicate the type of relationship between them, such as friend, following, photo, check-in, etc.
Structured Query: RDF + SPARQL
Triples in an RDF graph
Subject            Predicate   Object
Barack_Obama       parentOf    Malia_Obama
Barack_Obama       parentOf    Natasha_Obama
Barack_Obama       spouse      Michelle_Obama
Barack_Obama_Sr.   parentOf    Barack_Obama
RDF graph (nodes: Barack_Obama_Sr., Barack_Obama, Malia_Obama, Natasha_Obama, Michelle_Obama)
SPARQL query
SELECT ?x WHERE { Barack_Obama_Sr. parentOf ?y . ?y parentOf ?x . }
Answer
<Malia_Obama> <Natasha_Obama>
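To make the two-hop pattern in the SPARQL query concrete, here is a minimal sketch in plain Python that evaluates it over the four triples from the slide. It is not a SPARQL engine: the helper names `objects` and `grandchildren` are assumptions introduced for illustration, and the query pattern is hard-coded.

```python
# Triples copied from the slide's table: (subject, predicate, object).
TRIPLES = [
    ("Barack_Obama", "parentOf", "Malia_Obama"),
    ("Barack_Obama", "parentOf", "Natasha_Obama"),
    ("Barack_Obama", "spouse", "Michelle_Obama"),
    ("Barack_Obama_Sr.", "parentOf", "Barack_Obama"),
]

def objects(triples, subject, predicate):
    """Return every object o such that (subject, predicate, o) is a triple."""
    return [o for s, p, o in triples if s == subject and p == predicate]

def grandchildren(triples, person):
    """Evaluate `person parentOf ?y . ?y parentOf ?x` and return the ?x bindings."""
    result = []
    for y in objects(triples, person, "parentOf"):   # bind ?y
        result.extend(objects(triples, y, "parentOf"))  # bind ?x
    return result

answer = grandchildren(TRIPLES, "Barack_Obama_Sr.")
print(answer)  # ['Malia_Obama', 'Natasha_Obama']
```

The shared variable ?y is what joins the two triple patterns; the nested lookup plays the same role here.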
Why Structured Query Falls Short?
Knowledge Base            #Entities   #Triples   #Classes   #Relations
Freebase                  45M         3B         53K        35K
DBpedia                   6.6M        13B        760        2.8K
Google Knowledge Graph*   570M        18B        1.5K       35K
YAGO                      10M         120M       350K       100
Knowledge Vault           45M         1.6B       1.1K       4.5K
* as of 2014
• It's more than large: High heterogeneity of KBs
• If it's hard to write SQL on simple relational tables, it's only harder to write SPARQL on large knowledge bases
• Even harder on automatically constructed KBs with a massive, loosely-defined schema
Schema-agnostic KB Querying
Keyword query: query like search engine
"Barack Obama Sr. grandchildren"
Graph query: add a little structure
Barack Obama Sr., grandchildren
Natural language query: like asking a friend
"Who are Barack Obama Sr.'s grandchildren?"
Query by example: Just show me examples
<Barack Obama Sr., Malia Obama>