irmles2010 random indexing spaces to bridge the human and data webs


Upload: jose-quesada

Post on 20-Jan-2015

Page 1: Irmles2010 Random indexing spaces to bridge the human and data webs

Jose Quesada: Random indexing spaces for bridging the Human and Data Webs

Random indexing spaces for bridging the Human and Data Webs

Jose Quesada, Ralph Brandao-Vidal, Lael Schooler

Max Planck Institute, Adaptive Behavior and Cognition, Berlin

Introduction

Most of the existing knowledge on the Web is in plain, unstructured text

The problem we aim to solve in this paper is simply converting literals into resources

mpib:c97169cadaadbba92afbc2895b9eb9f

unique, meaningful ID (MUID)

Lorem ipsum dolor sit amet, consectetur adipiscing elit. Nam vulputate ipsum ac erat cursus et adipiscing diam pulvinar. In at ultricies odio. Donec sodales enim euismod nulla pulvinar et elementum velit congue. Cras ac quam ante, non facilisis massa.

What is the 'human web'?

What is the 'data web'?

Ontotext's linked data semantic repository (LDSR)

Resources vs Literals

Resource

The first explicit definition of a resource is found in RFC 2396, which states: "A resource can be anything that has identity. Familiar examples include an electronic document, an image, a service (e.g., 'today's weather report for Los Angeles'), and a collection of other resources. Not all resources are network 'retrievable'; e.g., human beings, corporations, and bound books in a library can also be considered resources."

Literals

Literals are values that do not have a unique identifier. They are usually a string that contains some human-readable text, for example names, dates and other types of values about a subject. In the previous example, the string ‘Fido’ is a literal. They optionally have a language (e.g., English, Japanese) or a type (e.g., integer, Boolean, string), but this is about all that can be said about literals. They cannot have properties like resources. Unlike resources, literals cannot link to the rest of the graph. They are second-class citizens on the Semantic Web. In terms of graphs, literals are one-way streets: since they cannot be the subject of a triple, there can be no outgoing links to other nodes.

What's in an identifier?

● Uniform Resource Identifier (URI)

Scheme ":" ["//" authority "/"] [path] [ "?" query ] [ "#" fragment]
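As a minimal sketch of the generic syntax above, Python's standard `urllib.parse.urlsplit` breaks a URI into these five components. The example URI is an assumption made for illustration: it appends the example MUID from these slides as a fragment to the `mpib` namespace.

```python
from urllib.parse import urlsplit

# Illustrative URI: the mpib namespace plus the example MUID as fragment.
uri = "http://mpi-ldsr.ontotext.com/mpib#c97169cadaadbba92afbc2895b9eb9f"
parts = urlsplit(uri)

print("scheme:   ", parts.scheme)
print("authority:", parts.netloc)
print("path:     ", parts.path)
print("query:    ", repr(parts.query))  # empty for this URI
print("fragment: ", parts.fragment)
```

The query component is empty here because the URI contains no "?" part.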

Why turning literals into resources is useful

● Increased integration of the human and data Webs

● Dangling nodes prevent us from applying some machine learning techniques:

Number of URIs: 126,875,974

Number of Literals: 227,758,535

Total number of entities: 354,635,159

● We will use statistical semantics to generate a vector for any literal

● This vector can be used to uniquely identify a literal, which makes the literal operationally equivalent to a resource

Attaching new resources to the center of the graph

Statistical semantics

Exploits statistical patterns of human word usage to figure out word meaning

● Completely unsupervised

● Scales better than, say, neural networks

● Most require linear algebra operations on large sparse matrices

● Computationally expensive

● LSA (Landauer)

● Topic Models (Griffiths)

● BEAGLE (Jones)

● HAL (Burgess)

● Random indexing (Sahlgren)

● SP (Dennis)

Example of text data: Titles of Some Technical Memos

● c1: Human machine interface for ABC computer applications

● c2: A survey of user opinion of computer system response time

● c3: The EPS user interface management system

● c4: System and human system engineering testing of EPS

● c5: Relation of user perceived response time to error measurement

● m1: The generation of random, binary, ordered trees

● m2: The intersection graph of paths in trees

● m3: Graph minors IV: Widths of trees and well-quasi-ordering

● m4: Graph minors: A survey

Matrix of words by contexts
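A minimal sketch of how such a matrix can be built from the nine memo titles. The selection rule is an assumption, not stated on the slides: a word is kept if it is not in a small stoplist and occurs in at least two titles. Each row is a word, each column a title (context), and each cell counts occurrences.

```python
import re

titles = {
    "c1": "Human machine interface for ABC computer applications",
    "c2": "A survey of user opinion of computer system response time",
    "c3": "The EPS user interface management system",
    "c4": "System and human system engineering testing of EPS",
    "c5": "Relation of user perceived response time to error measurement",
    "m1": "The generation of random, binary, ordered trees",
    "m2": "The intersection graph of paths in trees",
    "m3": "Graph minors IV: Widths of trees and well-quasi-ordering",
    "m4": "Graph minors: A survey",
}

# Assumed stoplist; the slides do not list the one actually used.
STOPLIST = {"a", "and", "for", "in", "of", "the", "to"}

def tokenize(text):
    """Lowercase the text and split it into alphabetic tokens."""
    return re.findall(r"[a-z]+", text.lower())

# Number of titles each word occurs in.
doc_freq = {}
for text in titles.values():
    for word in set(tokenize(text)):
        doc_freq[word] = doc_freq.get(word, 0) + 1

vocab = sorted(w for w, n in doc_freq.items() if n >= 2 and w not in STOPLIST)
contexts = list(titles)

# Words-by-contexts count matrix, stored as one row (list) per word.
matrix = {w: [tokenize(titles[c]).count(w) for c in contexts] for w in vocab}

print(" " * 10 + " ".join(contexts))
for word in vocab:
    print(f"{word:<10}" + " ".join(f"{n:>2}" for n in matrix[word]))
```

With these assumptions the matrix has 12 word rows and 9 context columns.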

Singular value decomposition of the words by contexts matrix

[Figure, slides 16 to 22: words (states) by contexts matrix = its singular value decomposition]
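For reference, the decomposition in its standard form. The symbols $X$, $U$, $\Sigma$, $V$ and $k$ are notation assumed here; they do not appear on the slides.

```latex
X = U \Sigma V^{\top},
\qquad
X_k = U_k \Sigma_k V_k^{\top}
```

Here $X$ is the words by contexts matrix, the columns of $U$ and $V$ are orthonormal, and $\Sigma$ is diagonal with the singular values in decreasing order. Keeping only the $k$ largest singular values gives $X_k$, the best rank-$k$ approximation of $X$ in the least-squares sense. The rows of $U_k \Sigma_k$ can then serve as $k$-dimensional word vectors.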

Correlation           Before   After

r (human - user)      -.38     .94

r (human - minors)    -.28     -.83
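The "Before" value for human and user can be checked by hand. This is a minimal sketch, assuming that r is the Pearson correlation between the raw count rows of the two words over the nine memo titles (c1 to c5, then m1 to m4). The row vectors below were counted from the titles.

```python
from math import sqrt

def pearson(x, y):
    """Pearson correlation between two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)

# Occurrence counts per title, in the order c1..c5, m1..m4.
human = [1, 0, 0, 1, 0, 0, 0, 0, 0]
user = [0, 1, 1, 0, 1, 0, 0, 0, 0]
minors = [0, 0, 0, 0, 0, 0, 0, 1, 1]

print(round(pearson(human, user), 2))  # -0.38
print(round(pearson(human, minors), 2))
```

Both raw correlations are negative because the words never occur in the same title. The "After" values need the reduced space and are not reproduced by this sketch.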

Similarity Measures

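● Dot product: $x \cdot y = \sum_{i=1}^{N} x_i y_i$

● Cosine: $\cos(\theta_{xy}) = \dfrac{x \cdot y}{\lVert x \rVert \, \lVert y \rVert}$

● Euclidean: $\mathrm{euclid}(x, y) = \sqrt{\sum_{i=1}^{N} (x_i - y_i)^2}$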
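The three measures can be sketched in plain Python as follows. The function names and the two example vectors are chosen here for illustration.

```python
from math import sqrt

def dot(x, y):
    """Dot product: sum of elementwise products."""
    return sum(a * b for a, b in zip(x, y))

def cosine(x, y):
    """Cosine of the angle between x and y."""
    return dot(x, y) / (sqrt(dot(x, x)) * sqrt(dot(y, y)))

def euclid(x, y):
    """Euclidean distance between x and y."""
    return sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

x = [1.0, 2.0, 2.0]
y = [2.0, 4.0, 4.0]  # same direction as x, twice the length

print(dot(x, y))     # 18.0
print(cosine(x, y))  # 1.0
print(euclid(x, y))  # 3.0
```

The example shows the practical difference: cosine ignores vector length, so x and y count as identical, while the Euclidean distance between them is nonzero.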

Parallel spaces

● DBpedia

  ● Structured

  ● Well-connected to the rest of the semantic web

● One-to-one mappings

● Wikipedia

  ● Plain text

  ● Representative of human knowledge and interest

  ● Pageviews reflect how present a concept is in the average human mind

DBpedia-Wikipedia corpus

● Currently 4M concepts. We used the most central 1M

● A concept has to have > 100 words after applying the stoplist

● More than 5 incoming and outgoing links
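The two inclusion criteria can be written as a simple predicate. This is a sketch under assumptions: the function and parameter names are hypothetical, "more than 5 incoming and outgoing links" is read as more than 5 of each, and the choice of the "most central 1M" concepts is not modeled because the slides do not define the centrality measure.

```python
MIN_WORDS = 100  # threshold from the slide: more than 100 words after stoplist
MIN_LINKS = 5    # threshold from the slide: more than 5 links

def keep_concept(words_after_stoplist, incoming_links, outgoing_links):
    """Return True if a concept passes both corpus criteria (assumed reading)."""
    return (
        words_after_stoplist > MIN_WORDS
        and incoming_links > MIN_LINKS
        and outgoing_links > MIN_LINKS
    )

print(keep_concept(250, 12, 30))  # True
print(keep_concept(250, 3, 30))   # False: too few incoming links
print(keep_concept(80, 12, 30))   # False: too short
```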

How to use statistical semantics to convert literals into resources

Any literal can have a vector

Computing nearest neighbors will find similar resources
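A minimal sketch of the nearest-neighbor step, assuming cosine similarity as the measure and a brute-force scan. The 3-dimensional vectors are made up for illustration and do not come from the actual space. Only the concept names are borrowed from the example results in these slides.

```python
from math import sqrt

def cosine(x, y):
    """Cosine similarity between two equal-length vectors."""
    num = sum(a * b for a, b in zip(x, y))
    return num / (sqrt(sum(a * a for a in x)) * sqrt(sum(b * b for b in y)))

def nearest(query, space, k=2):
    """Return the names of the k resources most similar to the query vector."""
    ranked = sorted(space, key=lambda name: cosine(query, space[name]), reverse=True)
    return ranked[:k]

# Toy resource space: name -> vector (values are illustrative only).
space = {
    "Google_Apps": [0.9, 0.1, 0.0],
    "IGoogle": [0.8, 0.2, 0.1],
    "Graph_minor": [0.0, 0.1, 0.9],
}

literal_vector = [1.0, 0.1, 0.0]  # stands in for the vector of some literal
print(nearest(literal_vector, space))  # ['Google_Apps', 'IGoogle']
```

Each returned name is a candidate resource to link the literal to.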

Random indexing

● Same dimension reduction, without SVD

● For each context, assign a random vector (the number of nonzero seed values is a free parameter)

● A word vector is the average of the vectors of all contexts the word appears in

● A new doc vector (e.g., a query) is the average of the vectors for the words it contains
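The steps above can be sketched in a few lines of Python. This is an illustration under assumptions, not the authors' implementation: seed values are taken to be +1 or -1, `NONZERO = 10` is an arbitrary choice for the free parameter, each context counts once per word, tokenization is a plain whitespace split, and all function names are hypothetical.

```python
import random

DIM = 1000    # vector length; the slides mention 1000-dimensional vectors
NONZERO = 10  # number of nonzero seed values (free parameter, value assumed)

def random_index_vector(rng, dim=DIM, nonzero=NONZERO):
    """Sparse random vector: a few positions set to +1 or -1, the rest 0."""
    vec = [0.0] * dim
    for pos in rng.sample(range(dim), nonzero):
        vec[pos] = rng.choice((-1.0, 1.0))
    return vec

def average(vectors):
    """Elementwise mean of a non-empty list of equal-length vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]

def train(contexts, seed=0):
    """Map each word to the average of the vectors of the contexts it occurs in."""
    rng = random.Random(seed)
    context_vecs = {cid: random_index_vector(rng) for cid in contexts}
    seen = {}
    for cid, text in contexts.items():
        for word in set(text.lower().split()):
            seen.setdefault(word, []).append(context_vecs[cid])
    return {word: average(vecs) for word, vecs in seen.items()}

def text_vector(text, word_vecs):
    """Vector for new text: average of its known word vectors, or None."""
    known = [word_vecs[w] for w in text.lower().split() if w in word_vecs]
    return average(known) if known else None

contexts = {
    "c3": "The EPS user interface management system",
    "c4": "System and human system engineering testing of EPS",
}
word_vecs = train(contexts)
query = text_vector("human interface", word_vecs)
print(len(query))  # 1000
```

No matrix factorization is involved: training is a single pass over the contexts, and new contexts can be added without recomputing the existing vectors.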

Training

Generating the Meaningful, Unique Identifier (MUID)

● Each literal gets a 1000-dimensional vector. This vector 'captures the meaning' of the text

● Too long to be passed around in RDF. MD5 hashing compacts it

@prefix mpib: <http://mpi-ldsr.ontotext.com/mpib#> .

mpib:c97169cadaadbba92afbc2895b9eb9f
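The slides do not say how the vector is serialized before hashing, so the sketch below assumes one possibility: format each component with fixed precision, join the values into a string, and take the MD5 hex digest with Python's `hashlib`. The function name `muid` and the precision are hypothetical.

```python
import hashlib

MPIB = "http://mpi-ldsr.ontotext.com/mpib#"  # namespace from the slides

def muid(vector, precision=6):
    """Hash a vector into a short hexadecimal identifier (assumed scheme)."""
    serialized = ",".join(f"{value:.{precision}f}" for value in vector)
    return hashlib.md5(serialized.encode("utf-8")).hexdigest()

vector = [0.0, 0.5, -0.25]  # stands in for a 1000-dimensional literal vector
local_name = muid(vector)

print("mpib:" + local_name)  # prefixed form
print(MPIB + local_name)     # full URI
```

`hexdigest()` always returns 32 hexadecimal characters regardless of the vector length. The hash identifies the vector but does not preserve similarity, so nearest-neighbor comparisons still have to be done on the vectors themselves.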

Example results: taking any page and getting the closest DBpedia concepts

Results for the search 'http://www.google.de':

@prefix rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#> .

@prefix mpib: <http://mpi-ldsr.ontotext.com/mpib#> .

@prefix skos: <http://www.w3.org/2004/02/skos/core#> .

@prefix dbpedia: <http://en.wikipedia.org/wiki#> .

mpib:c97169cadaadbba92afbc2895b9eb9f skos:related <http://en.wikipedia.org/wiki/Google_Alerts> .

mpib:8482e762cceb5d7636529cccf1c825 skos:related <http://en.wikipedia.org/wiki/Google_Apps> .

mpib:278c93125941f38c18dfe67591c94a5 skos:related <http://en.wikipedia.org/wiki/Googlepedia> .

mpib:2885141b46cd2fdc3c447bcfa18b73 skos:related <http://en.wikipedia.org/wiki/IGoogle> .

mpib:2959b4e35ca423f34a47b8fce196cf skos:related <http://en.wikipedia.org/wiki/List_of_Google_products> .

Example results

Problems

● A nearest-neighbor search on the current space takes 2 minutes. Fortunately, it is easily parallelizable

● Vectors depend on the corpus. Two Wikipedia versions from different years may yield slightly different vectors

● Selecting the most relevant concepts on Wikipedia is an extra source of free parameters

Advantages

● We can now use any text as subject. We can say that an essay is a review, or that a particular paragraph is insightful

● Works at different granularity levels, from single word to entire books

● We could use this to disambiguate text

● It may reduce graph search time by connecting dangling nodes to central parts of the graph. Whether this is a good idea is an open question

Future work

● Merge meaningful ID generation and compression into a single step

● Improve nearest neighbors time

● Apply it in a realistic use case scenario

What's in an identifier?

Uniform Resource Identifier (URI)

Scheme ":" ["//" authority "/"] [path] [ "?" query ] [ "#" fragment]

Meaningful, unique identifier (MUID)

@prefix mpib: <http://mpi-ldsr.ontotext.com/mpib#> .

mpib:c97169cadaadbba92afbc2895b9eb9f

Random indexing spaces for bridging the Human and Data Webs

Jose Quesada, [email protected]

Max Planck Institute, Adaptive Behavior and Cognition, Berlin