construction and querying of large-scale knowledge bases

20
Construction and Querying of Large - scale Knowledge Bases Xiang Ren 1 Yu Su 2 Xifeng Yan 2 University of Southern California 1 University of California, Santa Barbara 2

Upload: others

Post on 05-Nov-2021

1 views

Category:

Documents


0 download

TRANSCRIPT

Construction and Querying of Large-scale Knowledge Bases

Xiang Ren1 Yu Su2 Xifeng Yan2

University of Southern California1

University of California, Santa Barbara2

Tutorial website:http://xren7.web.engr.illinois.edu/tutorial-cikm17.html

Slides, code, datasets, references

Turning Unstructured Text Datainto Structures

3

UnstructuredText Data

(accountfor~80% of alldata inorganizations)

Knowledge& Insights

?Structures

(Chakraborty,2016)

Reading the reviews: From Text toStructured Facts

4

Restaurant Location

Organization Event

Thishotel ismy favorite Hiltonproperty inNYC! It is locatedrighton42ndstreetnear

TimesSquare,itisclose toallsubways, Broadways shows,andnext togreatrestaurantslikeJunior’s Cheesecake,

Virgil’s BBQ andmanyothers.

-- TripAdvisor

hotel

locatedat

NYCTimesSquare

Hiltonpropertyis a

Junior’sCheesecake

Virgil’sBBQ

locatednear

close toclose to

Broadwaysshows

close to

StructuredFacts

1. “Typed” entities2. “Typed” relationships

Why Text to Structures?

5

Structured Search & Exploration

Pattern / Association Rule Mining Structured Feature Generation

Graph Mining & Network Analysis

1234567

broadway showsbeacon theaterbroadway dance centerbroadway playsdavid letterman showradio city music halltheatre shows

1234567

high line parkchelsea markethighline walkwayelevated parkmeatpacking districtwest sideold railway

A Product Use Case: Finding“Interesting Hotel Collections”

6http://engineering.tripadvisor.com/using-nlp-to-find-interesting-collections-of-hotels/

Technology Transfer to TripAdvisor Features for “Catch a Show” collection

Features for “Near The High Line” collection

Grouping hotels based on structured factsextracted from the review text

Prior Art: Extracting Structures withRepeated Human Effort

7

Thishotel ismy favorite Hiltonproperty inNYC! It is located

righton42ndstreetnearTimesSquare,itisclose toall

subways, Broadways shows,andnext tomany great …

… The June2013Egyptianprotestweremassprotesteventthatoccurredin Egypt on30June

2013, …

Humanlabeling

…WehadaroomfacingTimesSquareandaroomfacingthe Empire State

Building, Thelocationisclosetoeverything

andwelove…

Extraction RulesMachine-Learning Models

Broadways shows

NYC

Hiltonproperty

Labeled data Text Corpus

Stanford CoreNLPCMU NELL

UW KnowItAllUSC AMR

IBMAlchemyAPIsGoogle Knowledge Graph

Microsoft Satori…

Structured Facts

Times square

hotel

This Tutorial: Effort-Light StructMine

8

Corpus-specificModelsText Corpus Structures

• Enables quick development of applications over various corpora• Extracts complex structures without introducing human error

Newsarticles

PubMedpapers

KnowledgeBases (KB)

Effort–Light StructMine: Where Are We?

9

Humanlabelingeffort

Featureengineering effort

Weakly-supervisedlearning systems

Hand-craftedSystems

Supervisedlearning systems

Distantly-supervisedLearning Systems

CMU NELL, 2009 - presentUW KnowItAll, Open IE, 2005 - presentMax-Planck YAGO, 2008 - present

Stanford CoreNLP, 2005 - presentUT Austin Dependency Kernel, 2005IBMWatson LanguageAPIs

UCB Hearst Pattern, 1992NYU Proteus, 1997

Stanford DeepDive,MIML-RE 2012- presentUW FIGER, MultiR, 2012Effort-Light StructMine(WWW’15, KDD’15, KDD’16, EMNLP’16, WWW’17, …)

A Review of Previous Efforts

“Distant” Supervision: What Is It?

10

Textcorpus

Knowledge Bases

“Matchable” structures: entity names,entity types, typed relationships ...

(Mintz et al., 2009), (Riedek et al., 2010), (Lin et al., 2012), (Ling et al., 2012),(Surdeanu et al., 2012), (Xu et al., 2013), (Nagesh et al., 2014), …

Freely available!• Common knowledge• Life sciences• Art …

Rapidlygrowing!

Number of Wikipedia articles

https://en.wikipedia.org/wiki/Wikipedia:Size_of_Wikipedia

Human crowds

“Un-matchable”

Learning with Distant Supervision:Challenges

1. Sparsity of “Matchable”• Incomplete knowledge bases

• Low-confidence matching

2. Accuracy of “Expansion”• For “matchable”: Are all thelabels assigned accurately?

• For “un-matchable”:How toperform inference accurately?

11

(Ren et al., KDD’15)

It is my favoritecity in the

United States

The United States needs a new strategy to meet this

challenge

GovernmentLocation

… next to restaurants

like Junior’s Cheesecake

✗

Effort-Light StructMine: Contributions

12

Sparsity of“Matchable”

Effective expansionfrom “matchable”to “un-matchable”

Accuracy of“Expansion”

Pick the “best” labelsbased on the context(for both “matchable”and “un-matchable”)

Harness the “dataredundancy” using

graph-basedjoint optimization

Challenge Solution IdeaText

corpus

It is my favoritecity in the

United States

The United States needs a new strategy to meet this

challenge

GovernmentLocation

Effort-Light StructMine: Methodology

13

Data-driven textsegmentation

(SIGMOD’15, WWW’16)

Entity names& context units

Partially-labeledcorpus

Learning Corpus-specific Model(KDD’15, KDD’16,

EMNLP’16, WWW’17)

Structures fromthe remainingunlabeled data

Knowledgebases

Textcorpus

Knowledge Base Querying

14

15

Desktop search Mobile search

Transformation in Information Search

NewYork-NewYorkhotelAnswer:

LengthyDocuments?DirectAnswers!

“WhichhotelhasarollercoasterinLasVegas?”

Surgeof

mobileInternet

usein

China

Application: Facebook Entity Graph

16

People,Places,andThingsFacebook’s knowledge graph (entity graph) stores as entities the users, places, pages and other objects within the Facebook.

ConnectingThe connections between the entities indicate the type of relationship between them, such as friend, following, photo, check-in, etc.

Structured Query: RDF + SPARQL

Subject Predicate ObjectBarack_Obama parentOf Malia_ObamaBarack_Obama parentOf Natasha_ObamaBarack_Obama spouse Michelle_ObamaBarack_Obama_Sr. parentOf Barack_Obama

17

Triples in an RDF graph

Barack_Obama_Sr.

Barack_Obama

Malia_Obama

Natasha_Obama

Michelle_ObamaRDF graph

SELECT?xWHERE{Barack_Obama_Sr.parentOf ?y.?yparentOf ?x.}

<Malia_Obama><Natasha_Obama>

SPARQL query

Answer

Why Structured Query Falls Short?

18

KnowledgeBase #Entities #Triples #Classes #RelationsFreebase 45M 3B 53K 35KDBpedia 6.6M 13B 760 2.8KGoogleKnowledgeGraph* 570M 18B 1.5K 35KYAGO 10M 120M 350K 100KnowledgeVault 45M 1.6B 1.1K 4.5K

• It’s more than large: High heterogeneity of KBs • If it’s hard to write SQL on simple relational

tables, it’s only harder to write SPARQL on large knowledge bases• EvenharderonautomaticallyconstructedKBswithamassive,loosely-definedschema

* as of 2014

Schema-agnostic KB Querying

19

“BarackObamaSr.grandchildren”

Keyword query: query like search engine BarackObamaSr.

grandchildren

Graph query: add a little structure

“WhoareBarackObamaSr.’sgrandchildren?”

Natural language query: like asking a friend

<BarackObamaSr.,MaliaObama>

Query by example: Just show me examples

Tutorial Outline

• Introduction• Part I: Effort–Light StructMine• Tea break at 3:00pm

• Part II: Schema-agnostic KB Querying• Summary & Future Directions

20