Manzoni Paolo, Menchicchi Fabio (speaker) (Group 1)

EFFICIENT RETRIEVAL OF THE TOP-K MOST RELEVANT SPATIAL WEB OBJECTS

GAO CONG, CHRISTIAN S. JENSEN, DINGMING WU


Motivations
• About one fifth of web search queries are textual and have local intent
• This implies that the user would like to find:
– Something that satisfies their needs
– Something nearby
– … and see only the best results …

I’d like to find an Italian restaurant where I can drink Duff beer near Springfield

Scenario
• The conventional internet is acquiring a geo-spatial dimension:
• Many points of interest are being associated with descriptive text
• Web documents are increasingly being geo-tagged
• Commercial search engines have started to provide location-based services, such as map services, local search, and local advertisements
• For example, Google Maps and Yellow Pages support location-aware text retrieval queries

Example

Problems to face
1. How to evaluate both spatial proximity and text similarity: the location-aware top-k text retrieval (LkT) query
2. How to join two different worlds:
   1. Inverted files for text searches
   2. R-trees for geo-searches

Ad hoc data structures (IR-tree, DIR-tree)

Object Evaluations

Object = (text document, coordinates)

$$D_{ST}(Q, O) = \alpha \, \frac{D_{\varepsilon}(Q.\mathit{loc}, O.\mathit{loc})}{\mathit{maxD}} + (1 - \alpha) \left( 1 - \frac{P(Q.\mathit{keywords} \mid O.\mathit{doc})}{\mathit{maxP}} \right)$$

D = spatial database
O = object in D, O = (O.loc, O.doc)
Q = query, Q = (Q.loc, Q.keywords)

First term: normalized Euclidean distance between the query and O.
Second term: P evaluates the normalized relevance of the query keywords in the document related to O.

Lower values of D_ST mean more interesting objects for the user.
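A minimal Python sketch of the ranking score above. The slides do not say how P is computed, so the text relevance is passed in as a ready-made number `p`; the function name `dst` and all parameter names are illustrative assumptions, not part of the original work.

```python
import math

def dst(alpha, q_loc, o_loc, max_d, p, max_p):
    """Combined spatial-textual score D_ST(Q, O); lower is better.

    p stands for P(Q.keywords | O.doc) and is assumed to be computed
    elsewhere; max_d and max_p are the normalization constants.
    """
    spatial = math.dist(q_loc, o_loc) / max_d  # normalized Euclidean distance
    textual = 1.0 - p / max_p                  # 0 when the document is maximally relevant
    return alpha * spatial + (1.0 - alpha) * textual
```

With alpha = 1 the score depends only on distance; with alpha = 0 it depends only on text relevance.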

IR-tree
• Ad hoc data structure created for the LkT query
• Basically an R-tree, with the extra information needed to evaluate text relevance in each node ...

[Figure: R-tree. Root R7 has children R5 and R6; R5 has children R3 and R4; R6 has children R2 and R1; the leaf nodes hold objects O1-O9.]

We must be able to evaluate, for each node:
1. The spatial distance from a query (the R-tree helps us here)
2. The text relevance for a query

How can we evaluate the text relevance of non-leaf nodes?
• Pseudo-document: represents all the documents in the entries of a child node
• Inverted files

[Figure: spatial layout of objects O1-O9 and their enclosing rectangles R1-R7]

Example
Q = (A; ‘IRISH’, ‘PUB’)

        O1  O2  O3  O4  O5  O6  O7  O8  O9
PUB      5   2   5   0   2   3   4   0   0
IRISH    5   4   0   5   0   3   2   4   3
PIZZA    1   0   5   6   5   1   1   5   6

[Figure: R-tree (root R7; R5 with children R3 and R4; R6 with children R2 and R1), with one inverted file attached to each node]


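INVERTED-FILE 7 (node R7)
PUB: [R5, 4] [R6, 5]
IRISH: [R5, 4] [R6, 5]
PIZZA: [R5, 6] [R6, 6]

INVERTED-FILE 5 (node R5)
PUB: [R3, 4]
IRISH: [R3, 3] [R4, 4]
PIZZA: [R3, 1] [R4, 6]

INVERTED-FILE 6 (node R6)
PUB: [R2, 5] [R1, 5]
IRISH: [R2, 5] [R1, 5]
PIZZA: [R2, 1] [R1, 6]

INVERTED-FILE 3 (node R3)
PUB: [O7, 4] [O6, 3]
IRISH: [O7, 2] [O6, 3]
PIZZA: [O7, 1] [O6, 1]

INVERTED-FILE 4 (node R4)
PUB: (empty)
IRISH: [O8, 4] [O9, 3]
PIZZA: [O8, 5] [O9, 6]

INVERTED-FILE 2 (node R2)
PUB: [O1, 5] [O2, 2]
IRISH: [O1, 5] [O2, 4]
PIZZA: [O1, 1]

INVERTED-FILE 1 (node R1)
PUB: [O3, 5] [O5, 2]
IRISH: [O4, 5]
PIZZA: [O3, 5] [O4, 6] [O5, 5]

Pseudo-documents (maximum weight of each term in the subtree), for example:
R3: PUB 4, IRISH 3, PIZZA 1
R4: IRISH 4, PIZZA 6
R1: PUB 5, IRISH 5, PIZZA 6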
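The pseudo-documents in this example can be derived mechanically: for each term, keep the maximum weight found in the subtree. The sketch below is an illustration under that assumption, using the weights of objects O3-O9 from the example; `DOCS` and `pseudo_document` are hypothetical names.

```python
# Term weights of objects O3-O9, taken from the example.
DOCS = {
    "O3": {"PUB": 5, "IRISH": 0, "PIZZA": 5},
    "O4": {"PUB": 0, "IRISH": 5, "PIZZA": 6},
    "O5": {"PUB": 2, "IRISH": 0, "PIZZA": 5},
    "O6": {"PUB": 3, "IRISH": 3, "PIZZA": 1},
    "O7": {"PUB": 4, "IRISH": 2, "PIZZA": 1},
    "O8": {"PUB": 0, "IRISH": 4, "PIZZA": 5},
    "O9": {"PUB": 0, "IRISH": 3, "PIZZA": 6},
}

def pseudo_document(docs):
    """Keep, for every term, the maximum weight found among the given documents.

    Terms whose weight is 0 everywhere are left out, as in an inverted file.
    """
    pseudo = {}
    for doc in docs:
        for term, weight in doc.items():
            if weight > pseudo.get(term, 0):
                pseudo[term] = weight
    return pseudo

r1 = pseudo_document([DOCS["O3"], DOCS["O4"], DOCS["O5"]])
r3 = pseudo_document([DOCS["O6"], DOCS["O7"]])
r4 = pseudo_document([DOCS["O8"], DOCS["O9"]])
r5 = pseudo_document([r3, r4])  # a parent summarizes its children's pseudo-documents
```

Here `r3` and `r4` reproduce the weights listed for R3 and R4, and `r5` matches the entries for R5 in INVERTED-FILE 7.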

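Searching in the tree
We can now evaluate each node of the tree on both text and distance, and explore the tree in a ‘clever way’.

For the objects in a leaf node, this is our evaluation:

$$D_{ST}(Q, O) = \alpha \, \frac{D_{\varepsilon}(Q.\mathit{loc}, O.\mathit{loc})}{\mathit{maxD}} + (1 - \alpha) \left( 1 - \frac{P(Q.\mathit{keywords} \mid O.\mathit{doc})}{\mathit{maxP}} \right)$$

For a non-leaf node, this is our evaluation:

$$\mathit{MIND}_{ST}(Q, N) = \alpha \, \frac{D_{\varepsilon}(Q.\mathit{loc}, N.\mathit{rect})}{\mathit{maxD}} + (1 - \alpha) \left( 1 - \frac{P(Q.\mathit{keywords} \mid N.\mathit{doc})}{\mathit{maxP}} \right)$$

Here D_ε(Q.loc, N.rect) is the minimum distance between the query location and the rectangle of N, and N.doc is the pseudo-document of N.

Because of the nature of pseudo-documents, given a query point Q and a node N whose rectangle encloses a set of objects SO, MIND_ST(Q, N) is a lower bound on D_ST(Q, O) for every object O in SO.

[Figure: leaf nodes R1 and R2 with their objects O1-O5]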
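The node bound can be sketched in Python as follows. This is an illustration only: rectangles are assumed to be axis-aligned tuples `(x_min, y_min, x_max, y_max)`, the pseudo-document relevance `p_node` is assumed to be computed elsewhere, and `min_dist` and `mind_st` are hypothetical names.

```python
import math

def min_dist(point, rect):
    """Minimum Euclidean distance from a point to an axis-aligned rectangle.

    rect is (x_min, y_min, x_max, y_max); the distance is 0 inside the rectangle.
    """
    x, y = point
    x_min, y_min, x_max, y_max = rect
    dx = max(x_min - x, 0.0, x - x_max)
    dy = max(y_min - y, 0.0, y - y_max)
    return math.hypot(dx, dy)

def mind_st(alpha, q_loc, rect, max_d, p_node, max_p):
    """MIND_ST(Q, N), built from the node rectangle and its pseudo-document relevance."""
    spatial = min_dist(q_loc, rect) / max_d
    textual = 1.0 - p_node / max_p
    return alpha * spatial + (1.0 - alpha) * textual
```

The rectangle is never farther from the query than any object it encloses, and the pseudo-document holds the best weight of every term in the subtree, which is why the value acts as a lower bound for the objects below the node.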

LkT Query
Q = (A; ‘IRISH’, ‘PUB’)

[Figure: spatial layout of objects O1-O9 in rectangles R1-R7, and the corresponding R-tree]

Priority queue ordered by increasing score (MIND_ST for nodes, D_ST for objects):
1) R7 (0)
2) R6 (0.1), R5 (0.16)
3) R1 (0.15), R5 (0.16), R2 (0.19)

LkT Query

[Figure: spatial layout of objects O1-O9 in rectangles R1-R7, and the corresponding R-tree]

Priority queue ordered by increasing score (MIND_ST for nodes, D_ST for objects):
4) R5 (0.16), R2 (0.19), O3 (0.46), O4 (0.58), O5 (0.66)
5) R2 (0.19), R3 (0.25), O3 (0.46), R4 (0.48), O4 (0.58), O5 (0.66)
6) O1 (0.23), R3 (0.25), O3 (0.46), R4 (0.48), O4 (0.58), O5 (0.66), O2 (0.7)
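The queue evolution in steps 1) to 6) can be reproduced with a best-first traversal driven by a min-heap. The sketch below is a simplified illustration, not the authors' implementation: scores are hard-coded from the slides instead of being computed, and because the slides give no scores for O6-O9 it is only meant to be run with k = 1. `CHILDREN`, `SCORE`, and `lkt_search` are hypothetical names.

```python
import heapq

# Tree from the slides: R7 is the root; leaf nodes R1-R4 hold the objects.
CHILDREN = {
    "R7": ["R5", "R6"],
    "R5": ["R3", "R4"],
    "R6": ["R2", "R1"],
    "R1": ["O3", "O4", "O5"],
    "R2": ["O1", "O2"],
    "R3": ["O6", "O7"],
    "R4": ["O8", "O9"],
}

# Scores read off the priority-queue slides (nodes: MIND_ST, objects: D_ST).
# O6-O9 are missing because the slides never show them.
SCORE = {
    "R7": 0.0, "R6": 0.1, "R5": 0.16, "R1": 0.15, "R2": 0.19,
    "R3": 0.25, "R4": 0.48,
    "O1": 0.23, "O2": 0.7, "O3": 0.46, "O4": 0.58, "O5": 0.66,
}

def lkt_search(root, k):
    """Best-first search: always expand the entry with the lowest score."""
    queue = [(SCORE[root], root)]
    results, expanded = [], []
    while queue and len(results) < k:
        score, entry = heapq.heappop(queue)
        if entry in CHILDREN:           # a node: push its children
            expanded.append(entry)
            for child in CHILDREN[entry]:
                heapq.heappush(queue, (SCORE[child], child))
        else:                           # an object: it is the next best result
            results.append((entry, score))
    return results, expanded

results, expanded = lkt_search("R7", 1)
```

Because every node score is a lower bound for the objects below it, the first object popped from the queue cannot be beaten by anything still in the queue, so the search can stop as soon as k objects have been popped.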

Considerations

        O1  O2  O3  O4  O5  O6  O7  O8  O9
PUB      5   2   5   0   2   3   4   0   0
IRISH    5   4   0   5   0   3   2   4   3
PIZZA    1   0   5   6   5   1   1   5   6

[Figure: spatial layout of objects O1-O9 in rectangles R1-R7, and the corresponding R-tree]

In the previous example we explored nodes R5 and R1, even though they do not contain any really interesting objects ...

The culprit is the pseudo-document: it inherits the best of all the documents in its subtree, and those documents may be very different from each other.

To avoid this, we would like each node to contain:
• Similar documents
• Nearby objects

DIR-tree

DIR-Tree
Unlike the IR-tree, the DIR-tree takes both location and text information into account during tree construction: it tries to minimize the areas of the enclosing rectangles while maximizing the text similarity between the documents inside each enclosing rectangle.

We need a heuristic function to decide in which leaf node to insert an item during tree construction:

$$\mathit{SimAreaCost}(E_k, O) = (1 - \beta) \, \frac{\mathit{AreaCost}(E_k)}{\mathit{maxArea}} + \beta \left( 1 - \mathit{cosine}(E_k.\mathit{vector}, O.\mathit{vector}) \right)$$

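[Figure: current node with entries E1, E2, …, Ep; the rectangle of E1 is enlarged to E'1 so that it encloses the new object O]

E_k = generic entry of a non-leaf node
O = object to insert in the tree

$$\mathit{AreaCost}(E_k) = \mathit{area}(E'_k.\mathit{rectangle}) - \mathit{area}(E_k.\mathit{rectangle})$$

E'_k = E_k with its rectangle enlarged to enclose O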

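[Figure: current node with entries E1, E2, …, Ep; O is the object to insert]

Documents in E1: (Pub 2, Pizza 0) and (Pub 1, Pizza 0), with vectors [2,0] and [1,0]
Pseudo-document vector of E1: (Pub 2, Pizza 0), vector [2,0]
Document of O: (Pub 0, Pizza 1), vector [0,1]

O.vector = term frequency vector of the document of O
E_k.vector = term frequency vector of the pseudo-document related to E_k

[Figure: O.vector and E1.vector on the Pub/Pizza axes, separated by the angle θ]
The two vectors are orthogonal: SIMILARITY = 0, max penalty (1)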
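A possible Python sketch of the insertion heuristic, applied to the vectors from this example. The rectangle coordinates, the object location, `beta`, and `max_area` are made-up values for illustration, and all function names are assumptions.

```python
import math

def area(rect):
    """Area of an axis-aligned rectangle (x_min, y_min, x_max, y_max)."""
    x_min, y_min, x_max, y_max = rect
    return (x_max - x_min) * (y_max - y_min)

def enlarge(rect, point):
    """Smallest rectangle that encloses both rect and point."""
    x, y = point
    x_min, y_min, x_max, y_max = rect
    return (min(x_min, x), min(y_min, y), max(x_max, x), max(y_max, y))

def cosine(u, v):
    """Cosine similarity of two term-frequency vectors (0 if either is all zeros)."""
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    if norm == 0:
        return 0.0
    return sum(a * b for a, b in zip(u, v)) / norm

def sim_area_cost(entry_rect, entry_vec, obj_loc, obj_vec, beta, max_area):
    """SimAreaCost(E_k, O): area enlargement blended with text dissimilarity."""
    area_cost = area(enlarge(entry_rect, obj_loc)) - area(entry_rect)
    return (1 - beta) * area_cost / max_area + beta * (1 - cosine(entry_vec, obj_vec))

# E1.vector = [2, 0] and O.vector = [0, 1] come from the example above;
# the geometry (unit square, O at (2, 1), max_area = 4) is hypothetical.
cost = sim_area_cost((0, 0, 1, 1), [2, 0], (2, 1), [0, 1], beta=0.5, max_area=4.0)
```

During insertion, the entry with the lowest cost would be chosen at each level; a larger beta gives more weight to text similarity and less to rectangle growth.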

[Figure: R-tree (root R7; R5 with children R3 and R4; R6 with children R2 and R1) with a new object O10 to insert]

Here are the benefits

[Figure: the same objects O1-O9 grouped into rectangles by an IR-tree (left) and by a DIR-tree (right)]

How to improve performance ...
• Clustering: spatial web objects often belong to different categories, for example accommodations, restaurants, and tourist attractions.

The idea is to cluster objects into groups according to their corresponding documents.

But what does this involve in terms of:
• Efficiency?
• Changes to the tree (IR or DIR) structure?

[Figure: R-tree whose objects are labeled by cluster. CLUSTER 1 (Art): O2, O3, O4, O6, O9. CLUSTER 2 (Sport): O1, O5, O7, O8. Each node entry is annotated with the clusters present in its subtree.]

ADVANTAGES
Bounds estimated using the clusters in a node will be tighter than those estimated for whole nodes.

The problem of finding a clustering solution that minimizes ScanTime is NP-hard.

STRUCTURE MODIFICATION:
We need to store, in each entry of each node, one pseudo-document for each cluster represented in its subtree. This implies additional space usage.
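To illustrate why per-cluster pseudo-documents give tighter bounds, the sketch below builds one pseudo-document per cluster for leaf node R1. Combining the Art/Sport labels of this slide with the term weights of the earlier IRISH/PUB example is an assumption made purely for illustration; `R1_OBJECTS` and `cluster_pseudo_documents` are hypothetical names.

```python
from collections import defaultdict

# Objects of leaf node R1: (name, cluster label, term weights).
R1_OBJECTS = [
    ("O3", "Art",   {"PUB": 5, "IRISH": 0, "PIZZA": 5}),
    ("O4", "Art",   {"PUB": 0, "IRISH": 5, "PIZZA": 6}),
    ("O5", "Sport", {"PUB": 2, "IRISH": 0, "PIZZA": 5}),
]

def cluster_pseudo_documents(objects):
    """One pseudo-document per cluster: max weight of each term within the cluster."""
    pseudo = defaultdict(dict)
    for _name, cluster, doc in objects:
        for term, weight in doc.items():
            pseudo[cluster][term] = max(weight, pseudo[cluster].get(term, 0))
    return dict(pseudo)

per_cluster = cluster_pseudo_documents(R1_OBJECTS)
```

The whole-node pseudo-document of R1 is (PUB 5, IRISH 5, PIZZA 6), while the Sport cluster alone only reaches (PUB 2, IRISH 0, PIZZA 5), so a query for ‘IRISH’, ‘PUB’ gets a much less optimistic bound for that cluster. The price is one extra pseudo-document per cluster in every entry.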

Performance results

Property                                      Dataset
Total number of objects                       131,461
Average number of unique words per object     112
Total number of unique words in dataset       30,616
Total number of words in dataset              14,809,845

[Figure: performance comparison of Baseline, IR-tree, CIR-tree, DIR-tree, and CDIR-tree]
