
Indexing Vector Spaces Graphs Search Engines, by Piero Molino (XY Lab)

Uploaded by kenzo-kabuto on 23-Jun-2015


DESCRIPTION

A brief introduction for non-technical people to the internals of a search engine

TRANSCRIPT

Page 1: Indexing vector spaces graph search engines

Indexing Vector Spaces Graphs Search Engines
Piero Molino, XY Lab

Page 2: Indexing vector spaces graph search engines

Relevance for New Publishing

Search engines can be considered publishers

They select, filter, choose, and sometimes create the content they return as results to queries

From the user's point of view, the search process is a black box, like magic

Understanding it increases awareness of its risks and could help in bypassing or hacking it

Page 3: Indexing vector spaces graph search engines

Information Retrieval

Given: a collection of unstructured information (a set of texts)

Given: an information need (the query)

Objective: find the items of the collection that answer the information need

Page 4: Indexing vector spaces graph search engines

Representation

Simple/naive assumption

A text can be represented by the words it contains

Bag-of-words model

Page 5: Indexing vector spaces graph search engines

Bag of Words

"Me and John and Mary attend this lab"

Me:1

Page 6: Indexing vector spaces graph search engines

Bag of Words

"Me and John and Mary attend this lab"

Me:1 and:1

Page 7: Indexing vector spaces graph search engines

Bag of Words

"Me and John and Mary attend this lab"

Me:1 and:1 John:1

Page 8: Indexing vector spaces graph search engines

Bag of Words

"Me and John and Mary attend this lab"

Me:1 and:2 John:1

Page 9: Indexing vector spaces graph search engines

Bag of Words

"Me and John and Mary attend this lab"

Me:1 and:2 John:1 Mary:1

Page 10: Indexing vector spaces graph search engines

Bag of Words

"Me and John and Mary attend this lab"

Me:1 and:2 John:1 Mary:1 attend:1

Page 11: Indexing vector spaces graph search engines

Bag of Words

"Me and John and Mary attend this lab"

Me:1 and:2 John:1 Mary:1 attend:1 this:1

Page 12: Indexing vector spaces graph search engines

Bag of Words

"Me and John and Mary attend this lab"

Me:1 and:2 John:1 Mary:1 attend:1 this:1 lab:1
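The word-by-word counting walked through on the slides above can be sketched in a few lines of Python (a minimal illustration, not code from the talk):

```python
from collections import Counter

def bag_of_words(text):
    # Split the text on whitespace and count each word,
    # discarding word order: this is the bag-of-words model.
    return Counter(text.split())

counts = bag_of_words("Me and John and Mary attend this lab")
print(counts)  # "and" is counted twice, every other word once
```

Note how "and" ends up with count 2 while every other word gets 1, exactly as in the step-by-step slides.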

Page 13: Indexing vector spaces graph search engines

Inverted index

To access the documents, we build an index, just like humans do

Inverted Index: word-to-document

"Indice Analitico"

Page 14: Indexing vector spaces graph search engines

Example

doc1:"I like football"

doc2:"John likes football"

doc3:"John likes basketball"

I → doc1

like → doc1

football → doc1 doc2

John → doc2 doc3

likes → doc2 doc3

basketball → doc3
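The word-to-document mapping shown above can be built with a short Python sketch (a minimal illustration, assuming whitespace tokenization):

```python
def build_inverted_index(docs):
    """Map each word to the list of documents that contain it."""
    index = {}
    for doc_id, text in docs.items():
        for word in text.split():
            postings = index.setdefault(word, [])
            if doc_id not in postings:  # avoid duplicate entries
                postings.append(doc_id)
    return index

docs = {
    "doc1": "I like football",
    "doc2": "John likes football",
    "doc3": "John likes basketball",
}
index = build_inverted_index(docs)
print(index["football"])  # ['doc1', 'doc2']
```

Looking up a word is now a single dictionary access, instead of scanning every document.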

Page 15: Indexing vector spaces graph search engines

Query

The information need is represented with the same bag-of-words model

Who likes basketball? → Who:1 likes:1 basketball:1

Page 16: Indexing vector spaces graph search engines

Retrieval

Who:1 likes:1 basketball:1

I → doc1

like → doc1

football → doc1 doc2

John → doc2 doc3

likes → doc2 doc3

basketball → doc3

Result: [doc2, doc3] (without any order... or is there?)

Page 17: Indexing vector spaces graph search engines

Boolean Model

doc3 contains both "likes" and "basketball", 2/3 of the query terms; doc2 contains only 1/3

Boolean Model: represents only the presence or absence of query words and ranks documents according to how many query words they contain

Result: [doc3, doc2] (in this precise order)

likes → doc2 doc3

basketball → doc3
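The Boolean ranking described above, counting how many query words each document matches, can be sketched like this (an illustration with the slides' toy index, not code from the talk):

```python
from collections import Counter

# The inverted index from the earlier example.
index = {
    "I": ["doc1"], "like": ["doc1"],
    "football": ["doc1", "doc2"],
    "John": ["doc2", "doc3"], "likes": ["doc2", "doc3"],
    "basketball": ["doc3"],
}

def boolean_rank(query, index):
    """Rank documents by the number of query words they contain."""
    hits = Counter()
    for word in query.split():
        for doc_id in index.get(word, []):  # unknown words are skipped
            hits[doc_id] += 1
    return [doc_id for doc_id, _ in hits.most_common()]

print(boolean_rank("Who likes basketball", index))  # ['doc3', 'doc2']
```

"Who" appears in no document, so it contributes nothing; doc3 matches two query words to doc2's one, giving the ordered result from the slide.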

Page 18: Indexing vector spaces graph search engines

Vector Space Model

Represents documents and words as vectors or points in a (multidimensional) geometrical space

We build a term × document matrix:

             doc1  doc2  doc3
I              1     0     0
like           1     0     0
football       1     1     0
John           0     1     1
likes          0     1     1
basketball     0     0     1

Page 19: Indexing vector spaces graph search engines

Graphical Representation

[Figure: doc1, doc2 and doc3 plotted as points in the space spanned by the "likes" and "football" axes]

Page 20: Indexing vector spaces graph search engines

Tf-idf

The cells of the matrix contain the frequency of a word in a document (term frequency, tf)

This value is counterbalanced by the number of documents that contain the word (inverse document frequency, idf)

The final weight is called tf-idf

Page 21: Indexing vector spaces graph search engines

Tf-idf Example 1

"Romeo" is found 700 times in the document "Romeo and Juliet" and also in 7 other plays

The tf-idf of "Romeo" in the document "Romeo and Juliet" is 7000x(1/7)=100

Page 22: Indexing vector spaces graph search engines

Tf-idf Example 2

"and" is found 1000 times in the document "Romeo and Juliet" and also in 10000 other plays

The tf-idf of "and" in the document "Romeo and Juliet" is 1000x(1/10000)=0.1

Page 23: Indexing vector spaces graph search engines

Similarity

Queries are like an additional column in the matrix

We can compute a similarity between documents and queries

Page 24: Indexing vector spaces graph search engines

Similarity

[Figure: doc1, doc2, doc3 and the query as points in the space; Euclidean distance between the query and each document]

Page 25: Indexing vector spaces graph search engines

Similarity

[Figure: length-normalized document and query vectors on the unit circle, with the angle θ between them]

Vector length normalization: the projection of one normalized vector onto another equals cos(θ)

Page 26: Indexing vector spaces graph search engines

Cosine Similarity

The cosine of the angle between two vectors is a measure of how similar the two vectors are

Since the vectors represent documents and queries, the cosine measures how similar a document is to the query

Page 27: Indexing vector spaces graph search engines

Ranking with Cosine Similarity

Shows how strongly query and documents are correlated

1 = maximum correlation, 0 = no correlation, -1 = negative correlation

The documents obtained from the inverted index are ranked by their cosine similarity with respect to the query
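Cosine similarity over the columns of the term × document matrix can be sketched as follows (a minimal illustration using the toy matrix from the earlier slide; "Who" is out of vocabulary, so the query vector only has "likes" and "basketball" set):

```python
import math

def cosine(u, v):
    """Cosine of the angle between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

# Columns of the matrix; rows are: I, like, football, John, likes, basketball
doc2 = [0, 0, 1, 1, 1, 0]   # "John likes football"
doc3 = [0, 0, 0, 1, 1, 1]   # "John likes basketball"
query = [0, 0, 0, 0, 1, 1]  # "Who likes basketball?"

print(cosine(doc3, query))  # ≈ 0.82, higher than cosine(doc2, query) ≈ 0.41
```

doc3 scores higher, so the ranked result is [doc3, doc2], matching the Boolean-model ordering on this example.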

Page 28: Indexing vector spaces graph search engines

Ranking on the web

On the web, documents (webpages) are connected to each other by hyperlinks

Is there a way to exploit this topology?

PageRank algorithm (and several others)

Page 29: Indexing vector spaces graph search engines

The web as a graph

[Figure: seven webpages (nodes 1-7) connected by directed hyperlink edges]

Page 30: Indexing vector spaces graph search engines

Matrix Representation

Row i of the matrix holds the transition probabilities out of page i: a page with k outgoing links assigns 1/k to each linked page.

[Table, column positions lost in extraction: row 1: 1/2, 1/2; row 2: 1/2, 1/2; row 3: 1; row 4: 1/3, 1/3, 1/3; row 5: 1/2, 1/2; row 6: 1/2, 1/2; row 7: 1]

Page 31: Indexing vector spaces graph search engines

Simulating Random Walks

e = a vector that can be interpreted as how important each node in the network is

E.g. e = 1:0.5, 2:0.5, 3:0.5, 4:0.5, ...

A = the matrix

Computing e = e · A repeatedly is like simulating a random walk on the graph

The process converges to a vector e where each value represents that node's importance in the network
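The repeated multiplication e = e · A is the power iteration at the heart of PageRank. A minimal sketch in plain Python, on a hypothetical 3-node graph (this omits refinements real PageRank uses, such as the damping factor):

```python
def pagerank(adjacency, iterations=50):
    """Power iteration: repeatedly compute e = e . A, where A is the
    row-stochastic transition matrix built from the link structure."""
    n = len(adjacency)
    # A page with k outgoing links gives probability 1/k to each target.
    A = [[(1.0 / len(links)) if j in links else 0.0 for j in range(n)]
         for links in adjacency]
    e = [1.0 / n] * n  # start with uniform importance
    for _ in range(iterations):
        e = [sum(e[i] * A[i][j] for i in range(n)) for j in range(n)]
    return e

# Hypothetical graph: 0 -> 1, 1 -> 2, 2 -> 0 and 2 -> 1
ranks = pagerank([[1], [2], [0, 1]])
print(ranks)  # node 1 ends up more important than node 0
```

Node 1 receives links from both node 0 and node 2, so the walk visits it more often and its entry in e converges to a larger value.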

Page 32: Indexing vector spaces graph search engines

Random Walks

[Figure, repeated across pages 32-36 with a different active node each time: the seven-node web graph, highlighting the node the random walker currently occupies as it moves along hyperlinks]

Page 37: Indexing vector spaces graph search engines

Importance - Authority

[Figure: the seven-node graph, with each node drawn proportionally to its importance (its value in e)]

Page 38: Indexing vector spaces graph search engines

Ranking with PageRank

I → doc1

like → doc1

football → doc1 doc2

John → doc2 doc3

likes → doc2 doc3

basketball → doc3

e: doc1 → 0.3, doc2 → 0.1, doc3 → 0.5

The inverted index gives doc2 and doc3. Using the importance values in e for ranking, Result: [doc3, doc2] (an ordered list)

Page 39: Indexing vector spaces graph search engines

Real World Ranking

Real world search engines exploit Vector-Space-Model-like approaches, PageRank-like approaches, and several others

They balance all the different factors by observing which webpages you click on after issuing a query and using those clicks as examples for a machine learning algorithm

Page 40: Indexing vector spaces graph search engines

Machine Learning

A machine learning algorithm works like a baby

You give it examples of what a dog is and what a cat is

It learns to look at the pointy ears, the nose, and other aspects of the animals

When it sees a new animal, it says "cat" or "dog" depending on its experience

Page 41: Indexing vector spaces graph search engines

Learning to Rank

The different indicators of how good a webpage is for a query (cosine similarity, PageRank, etc.) are like the animal's nose, tail, and ears

Instead of cat and dog, the algorithm classifies results as relevant or non-relevant!

When you issue a query and click on a result, it is marked as relevant; otherwise it is considered non-relevant
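One simple way to combine such indicators, not necessarily what any real engine uses, is logistic regression on hand-made click data. Here is a sketch where each result is described by two hypothetical features, its cosine similarity and its PageRank, and the label is whether it was clicked:

```python
import math

def train_relevance_model(examples, lr=0.5, epochs=2000):
    """Logistic regression trained by gradient descent: learns how much
    weight to give each ranking signal based on click feedback."""
    w, b = [0.0, 0.0], 0.0
    for _ in range(epochs):
        for features, clicked in examples:
            z = sum(wi * xi for wi, xi in zip(w, features)) + b
            p = 1.0 / (1.0 + math.exp(-z))  # predicted probability of relevance
            err = p - clicked
            w = [wi - lr * err * xi for wi, xi in zip(w, features)]
            b -= lr * err
    return w, b

def predict(w, b, features):
    z = sum(wi * xi for wi, xi in zip(w, features)) + b
    return 1.0 / (1.0 + math.exp(-z))

# Made-up examples: (cosine similarity, PageRank) -> clicked (1) or not (0)
examples = [([0.9, 0.5], 1), ([0.8, 0.1], 1), ([0.2, 0.4], 0), ([0.1, 0.1], 0)]
w, b = train_relevance_model(examples)
```

After training, results whose signals resemble previously clicked ones get a high relevance score, which is the essence of learning to rank from clicks.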

Page 42: Indexing vector spaces graph search engines

Issues

Among the different indicators there are many contextual ones

Where you issue the query from, who you are, what you visited before, which results you clicked on before, ...

This makes the results different for each person, in each place, at different moments in time

Page 43: Indexing vector spaces graph search engines

The filter bubble

A lot of diverse content is filtered away from you

The search engine shows you what it thinks you will be interested in, based on all these contextual factors

There is a lack of transparency in the search process

A way out of the bubble? Maybe favoring serendipity

Page 44: Indexing vector spaces graph search engines

Thank you for your attention