manzoni paolo menchicchi fabio( relatore ) ( gruppo 1)
DESCRIPTION
Efficient Retrieval of the Top-k Most Relevant Spatial Web Objects Gao Cong Christian S. Jensen Dingming Wu. Manzoni Paolo Menchicchi Fabio( Relatore ) ( GRUPPO 1). Motivations. About one fifth of web search queries are textual and have local intent - PowerPoint PPT PresentationTRANSCRIPT
EFFICIENT RETRIEVAL OF THE TOP-K
MOST RELEVANTSPATIAL WEB OBJECTS
GAO CONG CHRISTIAN S. JENSEN DINGMING WU
Manzoni Paolo Menchicchi Fabio(Relatore)
(GRUPPO 1)
Motivationsβ’ About one fifth of web search queries are
textual and have local intentβ’ This imply that the user would like to find:
β Something that satisfy his needs β Something near to him
β β¦ and know only best results β¦
Iβd like to find an italian restaurant where i can drink
duff-beer near Springfield
Scenaryβ’ The conventional internet is acquiring a geo-spatial
dimension:β’ Many points of interest are being associated with
descriptive textβ’ Web documents are increasingly being geo-taggedβ’ Commercial search engines have started to provide
location based services, such as map services, local search, and local advertisements.
β’ For example, Google Maps and Yellow Pages supports location-aware text retrieval queries
Example
Problems to face1. How to evaluate both spatial proximity and
text similarity Location aware top-K Text retrieval (LkT) query
2. How to join two different worlds:1. Inverted files for Text Searches2. R-tree for Geo-Searches
Ad hoc data structures (IR-tree , DIR-tree)
Object EvaluationsObject
Text Document Coordinates
Dst (Q ,O )=πΌ D Ξ΅(Q . loc ,O. loc )maxD
+(1βπΌ)(1β P (Q .keywords|O .doc )maxP
)
D= Spatial DatabaseO= Object in D O=(O.loc, O.doc)
Q= Query Q=(Q.loc, Q.keywords)
Normalized euclidean distancebetween queriy and O
P evaluates the normalized relevancy of keywords in the document related to O.
Lower values of Dst implys interesting objects for users
IR-treeβ’ Ad-hoc Data Structure created for LkT Query
β’ Basically an R-tree, with extra infos needed to evaluate Text Relevancy in each node ...
πΉπ
πΉπ πΉπ
πΉπ πΉπ πΉπ πΉπ
R5R6
R3R4 R2R1
πΆπ πΆπ πΆππΆππΆππΆππΆππΆππΆπ
R-tree We must be able to evaluate FOREACH NODE:
1. The SPATIAL DISTANCE from a query (R-tree help us)
2. The TEXT RELEVANCY for a query
β’ PSEUDO DOCUMENTRepresents all the documents into the entries of a child node.
How can we evaluate NON-LEAF NODES about TEXT RELEVANCY?
β’ INVERTED FILES
πΆπ
πΆπ
πΆπ
πΆπ
πΆπ
πΆπ
πΆπ
πΆπ
πΆπ
πΉπ
πΉπ
πΉππΉπ
πΉπ
πΉπ πΉπ
ExampleQ=(A; βIRISHβ, βPUBβ)
O1 O2 O3 O4 O5 O6 O7 O8 O9PUB 2 5 0 2 3 4 0 0 IRISH 5 4 0 5 0 3 2 4 3 PIZZA 1 0 5 6 5 1 1 5 6
πΉπ
πΉπ πΉπ
πΉπ πΉπ πΉπ πΉπ
R5R6
R3R4 R2R1
πΆπ πΆπ πΆππΆππΆππΆππΆππΆππΆπ
R-Tree
πΉπ
πΉπ πΉπ
πΉπ πΉπ πΉπ πΉπ
R5R6
R3R4 R2R1
πΆπ πΆπ πΆππΆππΆππΆππΆππΆππΆπ
INVERTED-FILE 7PUB [R5] [R6]IRISH [R54] [R65]PIZZA [R5 [R66]
INVERTED-FILE 5PUB [R3]IRISH [R33] [R44]PIZZA [R3[R46]
INVERTED-FILE 6PUB [R2 5] [R1]IRISH [R25] [R15]PIZZA [R2 [R16]
INVERTED-FILE 3PUB [O.7] [O.63]IRISH [O.72] [O.63]PIZZA [O.7 [O.6 1]
INVERTED-FILE 4PUB IRISH [O.84] [O.93]PIZZA [O.85]
INVERTED-FILE 2PUB [O.1] [O.22]IRISH [O.15] [O.24]PIZZA [O.1
INVERTED-FILE 1PUB [O.3 5] [O.52]IRISH [O.45]PIZZA [O.3 [O.4[O.5
O1 O2 O3 O4 O5 O6 O7 O8 O9PUB 2 5 0 2 3 4 0 0 IRISH 5 4 0 5 0 3 2 4 3 PIZZA 1 0 5 6 5 1 1 5 6
PUB 4 IRISH 3PIZZA 1
IRISH 4PIZZA 6
PUB 5 IRISH 5PIZZA 1
PUB 5IRISH 5PIZZA 6
PUB 4 IRISH 4PIZZA 6
PUB 5IRISH 5PIZZA 6
Searching in the treeWe can now evaluate each node of the tree both in text and distance, and explore it in a βclever wayβ
Dst (Q ,O )=πΌ D Ξ΅(Q . loc ,O. loc)maxD
+(1βπΌ)(1β P (Q .keywords|O .doc )maxP
)
For a Leaf node this is our evaluation :
MINDst (Q , N )=πΌ DΞ΅ (Q .loc ,N . rect )maxD
+(1βπΌ)(1β P (Q .keywords|N .doc )maxP
)
For a NON Leaf node this is our evaluation :
Because of the nature of Pseudo documentsGiven a query point Q and a node N whose rectangle encloses a set of objects SO πΉπ πΉπ
R2R1
πΆππΆππΆππΆππΆπ
+
LkT Query
πΆπ
πΆπ
πΆπ
πΆπ
πΆπ
πΆπ
πΆπ
πΆπ
πΆπ
πΉπ
πΉπ
πΉππΉπ
πΉπ
πΉπ πΉπ
πΉπ
πΉπ πΉπ
πΉπ πΉπ πΉπ πΉπ
R5R6
R3R4 R2R1
πΆππΆππΆππΆππΆππΆππΆππΆππΆπ
PriorityQueue for increasing values of
Q=(A; βIRISHβ, βPUBβ)
R70
R60,1
R50,16
R10,15
R50,16
R20,19
O1 O2 O3 O4 O5 O6 O7 O8 O9PUB 2 5 0 2 3 4 0 0 IRISH 5 4 0 5 0 3 2 4 3 PIZZA 1 0 5 6 5 1 1 5 6
1)
2)
3)
LkT Query
πΆπ
πΆπ
πΆπ
πΆπ
πΆπ
πΆπ
πΆπ
πΆπ
πΆπ
πΉπ
πΉπ
πΉππΉπ
πΉπ
πΉπ πΉπ
πΉπ
πΉπ πΉπ
πΉπ πΉπ πΉπ πΉπ
R5R6
R3R4 R2R1
πΆππΆππΆππΆππΆππΆππΆππΆππΆπ
PriorityQueue for increasing values of
Q=(A; βIRISHβ, βPUBβ)
O30,46
O40,58
R50,16
R20,19
O50,66
O30,46
R40,48
R20,19
R30,25
O40,58
O50,66
R30,25
O30,46
O10,23
R40,48
O40,58
O50,66
O20,7
O1 O2 O3 O4 O5 O6 O7 O8 O9PUB 2 5 0 2 3 4 0 0 IRISH 5 4 0 5 0 3 2 4 3 PIZZA 1 0 5 6 5 1 1 5 6
4)
5)
6)
Considerations O1 O2 O3 O4 O5 O6 O7 O8 O9PUB 2 5 0 2 3 4 0 0 IRISH 5 4 0 5 0 3 2 4 3 PIZZA 1 0 5 6 5 1 1 5 6
πΆπ
πΆπ
πΆπ
πΆπ
πΆπ
πΆπ
πΆπ
πΆπ
πΆπ
πΉπ
πΉπ
πΉππΉπ
πΉπ
πΉπ πΉπ
πΉπ
πΉπ πΉπ
πΉπ πΉπ πΉπ πΉπ
R5R6
R3R4 R2R1
πΆππΆππΆππΆππΆππΆππΆππΆππΆπ
Q=(A; βIRISHβ, βPUBβ)
In the previous example we explore nodes R5 and R1, even if they do not contain really interesting objects ...
The guilty is the pseudo-document because inherits the best of all the documents inside its subtree, and those documents may be very different between them
To avoid this we would like that inside a node there are:β’ SIMILAR DOCUMENTS β’ Near objects
DIR-tree
DIR-TreeUnlike the IR-tree, the DIR-tree aims to take both location and text information into account during tree construction, by optimizing for a combination of minimizing the areas of the enclosing rectangles and maximizing the text similarities between the documents of the enclosing rectangles.
SimAreaCost ( Ek ,O )=(1β π½ )AreaCost ( Ek )maxArea
+π½(1βcosine(Ek .Vector ,O .vector))
We need an heuristic function to evaluate in which leaf node insert an item during tree construction.
π¬π π¬π
E1E2
πΆππΆππΆππΆππΆπ
β¦E p
Current Node
= generic element of a non-leaf node = object to insert in the tree
πΆπ
πΆπ
π¬π
Oπ¬ β²π
AreaCost ( E k )=area (E β²k . rectangle )βarea (Ek .rectangle )
E β² k=area ΒΏ
O
π¬π π¬π
E1E2
πΆππΆππΆππΆππΆπ
β¦E p
Current Node
Pub 2Pizza 0
Pub 1Pizza 0
vector [2,0] [1,0]
Pub 2Pizza 0
Pseudo-Document-Vector
vector [2,0]
Pub 0Pizza 1[0,1]
O.Vector = term frequency vector of the document OVector = term frequency vector of the pseudo-document related to
Pub
PizzaπΆ .ππππππ
ΞΈ
SIMILARITY=0max penality (1)
π¬π .ππππππ
O
πΉπ
πΉπ πΉπ
πΉπ πΉπ πΉπ πΉπ
R5R6
R3R4 R2R1
πΆππΆππΆππΆππΆππΆππΆππΆππΆπ
O10
Here the benefits
πΆπ
πΆπ
πΆπ
πΆπ
πΆπ
πΆπ
πΆπ
πΆπ
πΆπ
πΉπ
πΉππΉπ
πΉπ
πΉπ
πΉπ πΉπ
O1 O2 O3 O4 O5 O6 O7 O8 O9PUB 2 5 0 2 3 4 0 0 IRISH 5 4 0 5 0 3 2 4 3 PIZZA 1 0 5 6 5 1 1 5 6
πΆπ
πΆπ
πΆπ
πΆπ
πΆπ
πΆπ
πΆπ
πΆπ
πΆπ
πΉπ
πΉπ
πΉππΉπ
πΉπ
πΉπ πΉπ
IR-Tree DIR-Tree
πΉπ
How to improve performances ...β’ Clustering :Spatial web objects often belong to different categories, for example accommodations, restaurants, and tourist attractions.
The idea is to cluster objects into groups according to their corresponding documents.
But what does it involve from the point of view of:
β’ Efficiency?β’ Changes in the trees(IR or DIR) structure?
πΉπ
πΉπ πΉπ
πΉπ πΉπ πΉπ πΉπ
R5R6
R3R4 R2R1
πΊππππ :πΆπArt: π¨ππ :πΆππ¨ππ :πΆππΊππππ :πΆππΊππππ :πΆππ¨ππ :πΆππ¨ππ :πΆππΊππππ :πΆπ
CLUSTER 1 Art, ,)CLUSTER 2 Sport,
Art
Sport
Sport
Sport
Sport
Sport
Art
Art
Art
Art
Art
Sport
ADVANTAGESbounds estimated using clusters in a node will be tighter than those estimated for whole nodes
The problem of finding a clustering solution that minimizes ScanTime is NP-hard.
STRUCTURE MODIFICATION:We need to memorize in each entry of each node one pseudo-document for each kind of cluster represented in a subtreeThis imply an additional space usage
Performances resultsProperty DatasetTotal number Of objects 131.461
Average number of unique words per objects 112
Total number of unique words in dataset 30.616
Total number of words in dataset 14.809.845
Baseline
IR-tree
CIR-tree
DIR-tree
CDIR-tree