A Scalable Graph Representation of Knowledge Bases and its Uses for Semantic Document Relatedness

Yosi Mass, Dafna Sheinwald (HRL); Feng Cao, Yuan Ni, Hai Pei Zhang, Qiongkai Xu (CRL)

IBM Research

© 2014 IBM Corporation


Post on 03-Aug-2015




Introduction – Knowledge Base

A knowledge base (KB) is a representation of knowledge in which:

• Nodes represent entities

• Edges represent relationships between entities

• Nodes and edges may have attributes

Linked Open Data


The DBPedia Knowledge base


Usage of Knowledge Bases

1. Semantic understanding of a text by mapping phrases to the knowledge base.

2. Finding relatedness/similarity between two given texts.

Example text:

"In the United Kingdom and Ireland, high school students traditionally do not have 'free periods' but do have 'break' which normally occurs just after their second lesson of the day (normally referred to as second period)."

Mentions

• United Kingdom - http://en.wikipedia.org/wiki/United_Kingdom

• Ireland - http://en.wikipedia.org/wiki/Ireland

• high school students - http://en.wikipedia.org/wiki/High_school - note the derivation to "high school student" and then the redirect to "High school".

• 'free periods' - http://en.wikipedia.org/wiki/Period_(school) - note the disambiguation.

• 'break' - http://en.wikipedia.org/wiki/Break_(work) - note the disambiguation.

• lesson - http://en.wikipedia.org/wiki/Lesson

• day - http://en.wikipedia.org/wiki/Day

• period - http://en.wikipedia.org/wiki/Period_(school) - note the disambiguation.


Facet graph use case: find semantic relatedness between two text paragraphs

Mention Detection

Graph-based similarity scorers

• Exploit the graph structure to find relationships between pairs of mentions

• Aggregate over all pairs

[Figure: how related are Paragraph 1 and Paragraph 2?]
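The "aggregate over all pairs" step can be sketched as follows. This is a minimal illustration, not the scorer used in the project: the slides do not say which aggregate is applied, so the mean is an assumption, and `paragraph_relatedness` and `toy_pair_score` are hypothetical names introduced here.

```python
from itertools import product
from statistics import mean
from typing import Callable, Iterable


def paragraph_relatedness(
    mentions_a: Iterable[str],
    mentions_b: Iterable[str],
    pair_score: Callable[[str, str], float],
) -> float:
    """Average a pairwise mention score over all cross-paragraph pairs.

    Assumption: the mean is used as the aggregate; the slides only say
    that scores are aggregated over all pairs.
    """
    pairs = list(product(mentions_a, mentions_b))
    if not pairs:
        return 0.0
    return mean(pair_score(a, b) for a, b in pairs)


def toy_pair_score(a: str, b: str) -> float:
    # Hypothetical stand-in for a graph-based scorer:
    # 1.0 for identical concepts, 0.0 otherwise.
    return 1.0 if a == b else 0.0


score = paragraph_relatedness(
    ["High_school", "Lesson"], ["Lesson", "Day"], toy_pair_score
)
print(score)
```

Any pairwise scorer with the same signature (for example one based on shortest paths or shared inlinks in the graph) could be passed in place of `toy_pair_score`.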


Outline

• Generation of the Facet Graph from DBPedia

• Mention Detection

• Similarity measures on the Facet Graph


The TinkerPop Stack: Usage in a project

[Figure: the TinkerPop graph stack as used in the project. Labels: graph stack library; Titan graph; HBase; Cassandra (planned); Hadoop; MapReduce code to generate the graph; access the graph; shortest path; similarity scorers]


Generate the Knowledge Graph from RDF data

• Input is given as RDF triples (subject, predicate, object).

• Example:

subject: http://dbpedia.org/resource/Yehuda_Vilner

predicate: http://dbpedia.org/ontology/birthPlace

object: http://dbpedia.org/resource/Israel

• URIs are translated to vertex IDs

• Adding a triple requires:

1. Add the subject and object as nodes (or get their IDs if they are already in the graph)

2. Add the predicate as an edge between the two nodes

This is the most expensive operation

Does not scale to millions of triples
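The two steps for adding a triple can be sketched with a small in-memory stand-in. `TinyGraph` and its methods are hypothetical and are not the Titan API; they only show the get-or-create lookup followed by the edge insertion. In a persistent store, each lookup would be a query against the backend rather than a dictionary access.

```python
class TinyGraph:
    """In-memory stand-in for a graph store (hypothetical, not the Titan API)."""

    def __init__(self):
        self.vertex_ids = {}  # URI -> vertex ID
        self.edges = []       # (subject ID, predicate URI, object ID)

    def get_or_add_vertex(self, uri):
        # Step 1: look the URI up first; create a vertex only if it is missing.
        if uri not in self.vertex_ids:
            self.vertex_ids[uri] = len(self.vertex_ids) + 1
        return self.vertex_ids[uri]

    def add_triple(self, subject, predicate, obj):
        sid = self.get_or_add_vertex(subject)
        oid = self.get_or_add_vertex(obj)
        # Step 2: add the predicate as an edge between the two vertices.
        self.edges.append((sid, predicate, oid))
        return sid, oid


graph = TinyGraph()
graph.add_triple(
    "http://dbpedia.org/resource/Yehuda_Vilner",
    "http://dbpedia.org/ontology/birthPlace",
    "http://dbpedia.org/resource/Israel",
)
print(graph.vertex_ids)
print(graph.edges)
```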


A scalable solution using MapReduce

• What is MapReduce?

• Programming model for expressing distributed computations at a massive scale

• Execution framework for organizing and performing such computations

• Open-source implementation called Hadoop

• Programmers specify two functions:

map (k, v) → <k’, v’>*

reduce (k’, v’*) → <k’’, v’’>*

All values with the same key are sent to the same reducer

The execution framework handles everything else…
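The programming model above can be imitated in a single process. The sketch below is not Hadoop code: `run_mapreduce` is a hypothetical helper that plays the role of the execution framework, and word counting is used only as a familiar example that does not come from the slides.

```python
from collections import defaultdict


def run_mapreduce(records, mapper, reducer):
    """Single-process imitation of map, shuffle-and-sort, and reduce."""
    groups = defaultdict(list)
    for key, value in records:
        for out_key, out_value in mapper(key, value):
            # Shuffle: all values with the same key end up in the same group.
            groups[out_key].append(out_value)
    results = []
    for out_key in sorted(groups):  # Sort: reducers see keys in order.
        results.extend(reducer(out_key, groups[out_key]))
    return results


def mapper(doc_id, text):
    # map (k, v) -> <k', v'>*
    for word in text.split():
        yield word, 1


def reducer(word, counts):
    # reduce (k', v'*) -> <k'', v''>*
    yield word, sum(counts)


records = [(1, "a b"), (2, "c c"), (3, "a c"), (4, "b c")]
word_counts = run_mapreduce(records, mapper, reducer)
print(word_counts)
```

In a real cluster the groups are partitioned across reducer machines; the guarantee that matters here is only that every value for a given key reaches the same reducer.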


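MapReduce

[Figure: MapReduce data flow]

Input pairs: (k1, v1), (k2, v2), (k3, v3), (k4, v4), (k5, v5), (k6, v6)

map outputs (four mappers): (a, 1), (b, 2) | (c, 3), (c, 6) | (a, 5), (c, 2) | (b, 7), (c, 8)

Shuffle and Sort: aggregate values by keys: a → [1, 5], b → [2, 7], c → [2, 3, 6, 8]

reduce outputs (three reducers): (r1, s1), (r2, s2), (r3, s3)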


Graph generation using MapReduce

Job 1 – sort by subjects

Input triples:

(S1, P1, O1)
(S2, P2, O2)
(S3, P3, O1)
(S1, P2, O2)

map output, keyed by subject and passed to reduce:

S1 (P1, O1)
S2 (P2, O2)
S3 (P3, O1)
S1 (P2, O2)

Job 2 – add subjects to graph and sort by objects

map output, keyed by object and passed to reduce (SIDn is the vertex ID of subject Sn):

O1 (P1, SID1)
O2 (P2, SID1)
O1 (P3, SID3)
O2 (P2, SID2)

Job 3 – add objects and edges to graph

Resulting edges (OIDn is the vertex ID of object On):

SID1 -P1-> OID1
SID1 -P2-> OID2
SID3 -P3-> OID1
SID2 -P2-> OID2
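The three jobs can be traced on the four example triples with a single-process sketch. It is an illustration under assumptions, not the project's MapReduce code: `group_by_key`, `add_vertex` and `build_graph` are hypothetical names, vertex IDs are plain integers, and grouping by key stands in for the shuffle between map and reduce. One plausible reading of the design is that, once triples are grouped by subject and then by object, each URI needs to be added to the graph only once per job.

```python
from collections import defaultdict


def group_by_key(pairs):
    """Stand-in for shuffle-and-sort: collect the values of each key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return dict(sorted(groups.items()))


def build_graph(triples):
    vertex_ids = {}  # URI -> vertex ID
    edges = []       # (subject ID, predicate, object ID)

    def add_vertex(uri):
        # Returns the existing ID, or assigns the next free one.
        return vertex_ids.setdefault(uri, len(vertex_ids) + 1)

    # Job 1: key every triple by its subject.
    by_subject = group_by_key((s, (p, o)) for s, p, o in triples)

    # Job 2: add each subject once, then re-key its triples by object.
    object_pairs = []
    for subject, po_list in by_subject.items():
        sid = add_vertex(subject)
        for predicate, obj in po_list:
            object_pairs.append((obj, (predicate, sid)))
    by_object = group_by_key(object_pairs)

    # Job 3: add each object once, then add the edges.
    for obj, ps_list in by_object.items():
        oid = add_vertex(obj)
        for predicate, sid in ps_list:
            edges.append((sid, predicate, oid))

    return vertex_ids, edges


triples = [
    ("S1", "P1", "O1"),
    ("S2", "P2", "O2"),
    ("S3", "P3", "O1"),
    ("S1", "P2", "O2"),
]
vertex_ids, edges = build_graph(triples)
print(vertex_ids)
print(edges)
```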


Facet Graph Architecture

• Implementation based on the Titan graph library with HBase as the backend

• Runs on a cluster of 3 machines

• Each machine has 16 cores, 2 TB disk and 32 GB memory

[Figure: architecture diagram. Components: Application, REST API, Rexster Server, Titan graph 1 … Titan graph n, HBase, Hadoop cluster]


Facet Graph performance

• Creation (offline)

• Use three Map-reduce jobs to index DBPedia into Titan

1. First job sorts subjects

2. Second job adds subjects

3. Third job adds objects and edges

• Access (online)

• Implemented as a Java API that wraps the REST API of the Rexster server

• Performance on a cluster of 3 machines, each with 16 cores, 2 TB disk and 32 GB memory

Graph          #Vertices   #Edges   Creation time   Access time
Semantics FG   14M         72M      3h:45m          1 msec to get a node description; 2 sec to get the 223K inlinks of a heavy node (USA)
Links FG       19M         152M     7h:18m          4.4 sec to get the 447K inlinks of a heavy node (USA)
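The figures in the table can be turned into rough rates with a few lines of arithmetic. This is only a back-of-the-envelope sketch: the inputs are the rounded values from the table, the creation time also covers vertex insertion, and `per_second` is a hypothetical helper.

```python
def per_second(count, hours=0, minutes=0, seconds=0.0):
    """Divide a count by a duration given in hours, minutes and seconds."""
    total_seconds = hours * 3600 + minutes * 60 + seconds
    return count / total_seconds


# Edges indexed per second during offline creation.
semantics_edges_rate = per_second(72_000_000, hours=3, minutes=45)
links_edges_rate = per_second(152_000_000, hours=7, minutes=18)

# Inlinks returned per second when reading a heavy node (USA).
semantics_inlinks_rate = per_second(223_000, seconds=2)
links_inlinks_rate = per_second(447_000, seconds=4.4)

print(round(semantics_edges_rate), round(links_edges_rate))
print(round(semantics_inlinks_rate), round(links_inlinks_rate))
```

Under these assumptions both graphs are built at between 5,000 and 6,000 edges per second, and a heavy node's inlinks are read at roughly 100,000 per second.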


Mention detection

[Figure: pipeline from Input Text through Spotting, Candidate Selection and Disambiguation to Annotated Text. Resources shown: Lexicon, Lucene Index, Facet Graph]

Spotting stage: recognizes in a sentence the phrases (surface forms) that may indicate a mention in the KB

Candidate selection stage: given the surface form, retrieves the set of candidate URIs for disambiguation

Disambiguation stage: uses the context around the spotted phrase to decide on the best candidate.
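The three stages can be illustrated with a toy pipeline. Everything in it is an assumption made for illustration: the lexicon, the candidate lists and the context words are made up, spotting is a plain dictionary lookup on single tokens, and disambiguation picks the candidate whose context words overlap most with the sentence. The slides do not describe the actual methods at this level of detail.

```python
import re

# Hypothetical lexicon: surface form -> candidate concept names.
LEXICON = {
    "break": ["Break_(work)", "Break_(music)"],
    "lesson": ["Lesson"],
}

# Hypothetical context words associated with each candidate.
CONTEXT = {
    "Break_(work)": {"school", "lesson", "work", "rest"},
    "Break_(music)": {"drum", "song", "beat"},
    "Lesson": {"school", "teacher", "students"},
}


def spot(text):
    """Spotting: find tokens that appear as surface forms in the lexicon."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return tokens, [t for t in tokens if t in LEXICON]


def select_candidates(surface_form):
    """Candidate selection: retrieve the candidates for a surface form."""
    return LEXICON.get(surface_form, [])


def disambiguate(candidates, context_tokens):
    """Disambiguation: keep the candidate that best overlaps the context."""
    context = set(context_tokens)
    return max(candidates, key=lambda c: len(CONTEXT.get(c, set()) & context))


def detect_mentions(text):
    tokens, surface_forms = spot(text)
    return [
        (form, disambiguate(select_candidates(form), tokens))
        for form in surface_forms
    ]


mentions = detect_mentions(
    "Students have a break after their second lesson at school"
)
print(mentions)
```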


Pairwise Concept similarity based on wikilinks [1]

[1] Milne D., Witten I. H., An Effective, Low-Cost Measure of Semantic Relatedness Obtained from Wikipedia Links, AAAI, 2008
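The formula on this slide did not survive extraction. As a sketch of the measure in [1], as commonly stated: for concepts a and b with inlink sets A and B, and W the set of all articles, the distance is (log(max(|A|, |B|)) - log(|A ∩ B|)) / (log(|W|) - log(min(|A|, |B|))). Turning it into a similarity as 1 minus the distance, clamping at 0, and returning 0 when there are no shared inlinks are common conventions assumed here; `wikilink_relatedness` is a hypothetical name, not the project's scorer.

```python
import math


def wikilink_relatedness(inlinks_a, inlinks_b, total_articles):
    """Link-based relatedness in the style of Milne and Witten [1].

    inlinks_a, inlinks_b: IDs of the articles that link to each concept.
    total_articles: number of articles in the collection (|W|).
    """
    a, b = set(inlinks_a), set(inlinks_b)
    common = a & b
    if not common:
        return 0.0  # Assumption: no shared inlinks means unrelated.
    smaller, larger = min(len(a), len(b)), max(len(a), len(b))
    if total_articles <= smaller:
        raise ValueError("total_articles must exceed the inlink set sizes")
    distance = (math.log(larger) - math.log(len(common))) / (
        math.log(total_articles) - math.log(smaller)
    )
    # Assumption: similarity is 1 - distance, clamped at 0.
    return max(0.0, 1.0 - distance)


score = wikilink_relatedness({1, 2, 3, 4}, {3, 4, 5}, total_articles=100)
print(round(score, 4))
```

With the inlink counts reported earlier for a heavy node (hundreds of thousands of inlinks), the cost of this measure is dominated by fetching and intersecting the two inlink sets.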


Our assets on IBM.next

IBM Confidential 14/9/8

http://ibmnext.stage1.mybluemix.net/assets


Thank You