proximity search in databases

Proximity Search in Databases

A Paper by

Roy Goldman, Narayanan Shivakumar, Suresh Venkatasubramanian, Hector Garcia-Molina

Presented by

Arjun Saraswat

Flow of the Presentation

Introduction
Motivation
Problem Statement
Model/Design
Scoring Function
Implementation
Strategies
Performance Experiments

INTRODUCTION

Basic Idea: Proximity search is used in IR to retrieve documents that have words occurring near each other.

The database is viewed as a collection of objects that are related by a distance function.

Objects: can be tuples, records…

In IR, proximity search is traditionally intra-object: it searches within the same document.

The proximity search in this paper ranks objects based on their distance to other objects.

MOTIVATION

There are situations in which a user cannot generate a specific query, or it is impractical to do so, or the search needs to be based on the relevance of different data objects to one another.

Neither databases nor IR systems provide this kind of inter-object proximity search.

The motivation is to develop a general-purpose proximity service that can be implemented independently of the underlying database.

PROBLEM STATEMENT

Basic Statement: To rank the objects in one given set (Find) based on their proximity to the objects in another set (Near).

What is the Find set? It is the set of objects that is of interest for the proximity search.

What is the Near set? The Find set objects are ranked with respect to their distance to the Near set objects.

This becomes clearer with an example:

“Find Movie Near Travolta Cage”

Find Movie

Looks for all objects of the type movie, or objects that have the word "movie" in their body. It does not in any way mean a search for a movie containing Travolta and Cage.

Here Movie, Travolta and Cage are all different objects.

For the query “Find Movie Near Travolta Cage”

The top 10 results are:

1. Face off
2. She’s so Lovely
3. Primary colors
4. Con air
5. Mad City
6. Happy Birthday Elizabeth: A Celebration for life
7. Original Sin’s
8. ’Night Sins’
9. That old feeling
10. Dancer Upstairs

As we can see, “Face-off” is the top hit because it has both stars, Travolta and Cage. This is explained by both actor objects being a short distance away from the movie Face-off. The next five movies share second place; each has one of the two stars.

The rest of the answers have indirect affiliations, meaning they are at larger distances.

MODEL/DESIGN

Basic Architecture

Fig. 1

Figure 1 shows the basic components of the Proximity Search architecture.

A database stores a set of objects that can be tuples, records, etc.

The application issues Find and Near queries to get the Find set and the Near set.

The Proximity Search Engine takes the Find and Near sets as input, together with the distances obtained from the Distance Module, and outputs the Find set re-ranked by those distances.

Distance Module

In simplified terms, the Distance Module provides the Proximity Search Engine with a set of triplets (X, Y, d), where d is the distance between the objects with identifiers X and Y.

Assumption 1: All distances are greater than or equal to one.

Assumption 2: The Proximity Search Engine uses these distances to compute the lengths of shortest paths between objects. Since we are mostly interested in close objects, only distances up to some constant K are computed; any pair of objects farther apart than K is treated as being at infinite distance.

This will become clearer when we talk about the algorithm.
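The two assumptions can be illustrated with a small in-memory sketch. It is not the paper's implementation: the function names, the sample triplets and the choice of Dijkstra's algorithm with a cutoff are assumptions made here for illustration only.

```python
import heapq
from collections import defaultdict


def build_graph(triplets):
    """Turn (X, Y, d) triplets into an undirected adjacency map."""
    graph = defaultdict(dict)
    for x, y, d in triplets:
        if d < 1:
            raise ValueError("distances are assumed to be >= 1")
        # Keep the smallest distance if a pair is listed more than once.
        best = min(d, graph[x].get(y, float("inf")))
        graph[x][y] = best
        graph[y][x] = best
    return graph


def bounded_distances(graph, source, K):
    """Shortest-path lengths from source, ignoring anything farther than K."""
    dist = {source: 0}
    heap = [(0, source)]
    while heap:
        d, u = heapq.heappop(heap)
        if d > dist.get(u, float("inf")):
            continue  # stale heap entry
        for v, w in graph[u].items():
            nd = d + w
            if nd <= K and nd < dist.get(v, float("inf")):
                dist[v] = nd
                heapq.heappush(heap, (nd, v))
    # Objects missing from the result are treated as infinitely far away.
    return dist


# Hypothetical triplets, as a Distance Module might supply them.
triplets = [
    ("movie1", "travolta", 1),
    ("movie1", "cage", 1),
    ("movie2", "cage", 1),
    ("movie2", "studio", 4),
]
graph = build_graph(triplets)
near = bounded_distances(graph, "travolta", K=3)
```

With K = 3, the object "studio" lies at distance 7 from "travolta" and is therefore left out of the result, which corresponds to an infinite distance.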


From the perspective of the Proximity Search Engine, the database is viewed as an undirected graph with weighted edges. This does not mean that the underlying database needs to be maintained as an undirected graph.

This can be seen from the figure on the right, which shows a normalized relational schema for the Internet Movie Database.

Graph-Based Representation

In the graph-based representation, each tuple is broken down into multiple objects: one for the entity and an additional object for each attribute value.

Distances between objects are assigned on the following basis:

1. Small weights are assigned between an entity and its attribute values, i.e. a close relationship.

2. Larger weights are assigned between objects linked through foreign and primary keys.

3. The largest weights are assigned between entity tuples in the same relation.
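The three weighting rules above can be sketched as a function that turns relational rows into weighted edges. The numeric weights (1, 2, 4), the function name and the sample rows are hypothetical; the slides only say "small", "larger" and "largest". Linking every pair of entities in a relation directly is also a simplification chosen here for brevity.

```python
# Hypothetical weights: the source only orders them, it does not give values.
ATTRIBUTE_WEIGHT = 1       # entity <-> its attribute value
FOREIGN_KEY_WEIGHT = 2     # entity <-> entity referenced by a foreign key
SAME_RELATION_WEIGHT = 4   # entity <-> another entity in the same relation


def tuples_to_edges(relation, rows, key, foreign_keys=None):
    """Break each row into an entity object plus one object per attribute value.

    foreign_keys maps a column name to the relation it references.
    Returns a list of (object, object, weight) edges.
    """
    foreign_keys = foreign_keys or {}
    edges = []
    entities = []
    for row in rows:
        entity = (relation, row[key])
        entities.append(entity)
        for column, value in row.items():
            if column == key:
                continue
            if column in foreign_keys:
                target = (foreign_keys[column], value)
                edges.append((entity, target, FOREIGN_KEY_WEIGHT))
            else:
                edges.append((entity, (column, value), ATTRIBUTE_WEIGHT))
    # Rule 3: entities of the same relation are linked with the largest weight.
    for i, first in enumerate(entities):
        for second in entities[i + 1:]:
            edges.append((first, second, SAME_RELATION_WEIGHT))
    return edges


movies = [{"id": 1, "title": "Face off"}, {"id": 2, "title": "Con air"}]
cast = [{"id": 10, "movie_id": 1, "actor": "Travolta"}]

movie_edges = tuples_to_edges("Movie", movies, key="id")
cast_edges = tuples_to_edges("Cast", cast, key="id",
                             foreign_keys={"movie_id": "Movie"})
```

The resulting edge lists have the same (X, Y, d) shape as the triplets supplied by the Distance Module.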

SCORING FUNCTION

The main idea is that we want to rank each object f in the Find set F based on its proximity to the objects in the Near set N.

r_F: ranking function in the Find set.
r_N: ranking function in the Near set.
The range of these functions is [0, 1], with 1 representing the highest possible rank.

The distance between any two objects f ∈ F and n ∈ N is the weight of the shortest path between them in the underlying database graph, written d(f, n).

The bond between f and n, where f ≠ n, is:

$$b(f, n) = \frac{r_F(f)\, r_N(n)}{d(f, n)^t}$$

Here t is a tuning exponent, a non-negative real number that controls the impact of distance on the bond.
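As a minimal sketch of the bond formula, the function below computes the product of the two ranks divided by the distance raised to t. The function name, the parameter names and the input checks are choices made here, not part of the paper.

```python
def bond(rank_f, rank_n, distance, t=1.0):
    """Bond b(f, n) = r_F(f) * r_N(n) / d(f, n)**t."""
    if not (0.0 <= rank_f <= 1.0 and 0.0 <= rank_n <= 1.0):
        raise ValueError("ranks are expected to lie in [0, 1]")
    if distance < 1:
        raise ValueError("distances are assumed to be >= 1")
    if t < 0:
        raise ValueError("the tuning exponent t must be non-negative")
    return rank_f * rank_n / distance ** t


# A larger t makes the bond fall off faster with distance.
close = bond(1.0, 1.0, 1.0)               # adjacent objects, full ranks
farther = bond(1.0, 1.0, 2.0, t=2.0)      # distance 2, quadratic fall-off
unreachable = bond(1.0, 1.0, float("inf"))  # beyond K: no bond at all
```

Because ranks are at most 1 and distances are at least 1, the result always stays within [0, 1].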

The bond ranges over [0, 1]; the higher the value, the stronger the bond.

How bonds are used depends on the application; different approaches can be taken to interpret the bonds to Near objects.

Some of the approaches are discussed below:

1. Additive: For example, in the query “Find Movie Near Travolta Cage” we intuitively expect that a movie with both actors should be ranked higher, so we score each object f by the sum of its bonds with the Near objects:

$$\mathrm{score}(f) = \sum_{n \in N} b(f, n)$$

2. Maximum: In some cases the maximum bond may be more important than the total, in which case:

$$\mathrm{score}(f) = \max_{n \in N} b(f, n)$$

3. Beliefs: Here we treat bonds as beliefs. Suppose the graph represents connections between electronic devices, such that two devices close together in the graph are close together physically as well.

r_F indicates the known status of the Find devices.
r_N gives the belief that a Near device is faulty.
b(f, n) gives the belief that f is faulty due to n: the closer f is to a faulty device, the more likely it is to be faulty.

$$\mathrm{score}(f) = 1 - \prod_{n \in N} \bigl(1 - b(f, n)\bigr)$$
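The three ways of combining bonds can be compared side by side. This is a small sketch: the function names are made up for illustration, and each function takes the list of bonds b(f, n) of one Find object f to all Near objects n.

```python
import math


def additive_score(bonds):
    """Sum of the bonds: rewards being close to many Near objects."""
    return sum(bonds)


def maximum_score(bonds):
    """Largest single bond: only the closest Near object matters."""
    return max(bonds, default=0.0)


def belief_score(bonds):
    """1 - prod(1 - b): bonds treated as independent beliefs."""
    return 1.0 - math.prod(1.0 - b for b in bonds)


# A Find object with a bond of 0.5 to each of two Near objects.
bonds = [0.5, 0.5]
scores = (additive_score(bonds), maximum_score(bonds), belief_score(bonds))
```

For these two bonds the additive score is 1.0, the maximum score is 0.5 and the belief score is 0.75. Only the additive score can exceed 1.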

IMPLEMENTATION

The proximity search architecture was implemented on top of LORE, a database system designed at Stanford University for storing and querying graph-structured data.

It is based on OEM (Object Exchange Model).

What is OEM?

An OEM object contains an OID, a textual label, a type and a value.

A value may be atomic or complex.

An atomic OEM value is any data value that should be considered indivisible by the database.

A complex OEM value, on the other hand, is a collection of zero or more OEM objects.


Complex OEM Object:

<Birthday {
  <Month "January">
  <Day 7>
  <Year 1972>
}>

Here Birthday is a single complex OEM object containing three atomic OEM objects: Month, Day and Year.
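The Birthday object can be mirrored in a small data structure to make the atomic/complex distinction concrete. This is only an illustrative model, not LORE's storage format: the class name, the `type_name` helper and the OID strings ("&1" to "&4") are invented here.

```python
from dataclasses import dataclass
from typing import List, Union


@dataclass
class OEMObject:
    """An OEM object: OID, textual label, and an atomic or complex value."""
    oid: str
    label: str
    value: Union[str, int, float, List["OEMObject"]]

    @property
    def is_complex(self) -> bool:
        # A complex value is a collection of zero or more OEM objects.
        return isinstance(self.value, list)

    @property
    def type_name(self) -> str:
        return "complex" if self.is_complex else type(self.value).__name__


# The Birthday example, with hypothetical OIDs.
birthday = OEMObject("&1", "Birthday", [
    OEMObject("&2", "Month", "January"),
    OEMObject("&3", "Day", 7),
    OEMObject("&4", "Year", 1972),
])

child_labels = [child.label for child in birthday.value]
```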


Basics of OEM:

<Restaurant {
  <Entree {<Name "Burger"> <NINE: Price 9.00>}>
  <Entree {<Name "BLT"> <&NINE>}>
  <Entree {<Name "Reuben"> <Cost &NINE>}>
}>

Here NINE is a SymOid (symbolic object identifier): it names the Price object so that the other entries can refer to the same object as &NINE.

STRATEGIES

Naïve Approach: A simple approach would be to compute the shortest distances between objects at search time using Dijkstra's single-source shortest path algorithm.

In each iteration the algorithm explores the N(v) vertices adjacent to some vertex v, so for a disk-based graph it makes N(v) random seeks per iteration and as many as |E_1| random seeks overall. This approach requires too many random seeks.

E_1: the edge list provided by the distance module; its entries are of the form <u, v, w>.

Algorithm for Self-Joins

Algorithm: Distance self-join
Input: Edge set E_1, Maximum required distance: K
Output: Lookup table Dist supplies the shortest distance (up to K) between any pair of objects
[1] For l = 1 to ⌈log2 K⌉
[2]     Copy E_l into E'_(l+1)
[3]     Sort E_l on first vertex. // To improve performance
[4]     Scan sorted E_l:
[5]         For each <v_i, v_j, w_k> and <v_i, v'_j, w'_k> where v_j != v'_j
[6]             If (w_k + w'_k ≤ 2^l) and (w_k + w'_k ≤ K)
[7]                 Add <v_j, v'_j, w_k + w'_k> and <v'_j, v_j, w_k + w'_k> to E'_(l+1)
[8]     Sort E'_(l+1) on first vertex, and store in E_(l+1)
[9]     Scan sorted E_(l+1):
[10]        Remove tuple <u, v, w>, if there exists another tuple <u, v, w'>, with w > w'.
[11] Let Dist be the final E_(l+1).
[12] Build index on first vertex in Dist.


In the self-join algorithm:

E_l: edge-list representation of $A^{2^{l-1}}$, where A is the original matrix
E'_l: edge list before applying the min operator

The loop runs ⌈log2 K⌉ times, squaring the matrix on each pass, to give $A^K$.

The final output, Dist, is the lookup table that contains the distances between all pairs of vertices within distance K of each other.

The table stores <v_i, v_j, w_k> for all vertex pairs v_i, v_j having w_k ≤ K.

The main purpose is to answer queries for d(v_i, v_j). This can be done efficiently because Dist is indexed: look up the tuple <v_i, v_j, w_k>; if it is there, the distance is w_k, otherwise the distance is greater than K.

The problem with this approach is that it requires a lot of space for the generated edge lists, and the scanning and sorting operations on them can be expensive.
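The distance self-join can be tried out on a small graph with an in-memory sketch. It follows the same rounds as the pseudocode, but swaps the on-disk sort-and-scan steps for dictionaries, so it shows the logic only and not the I/O behaviour the paper is concerned with. The function name and the sample edges are assumptions.

```python
import math
from collections import defaultdict


def distance_self_join(edges, K):
    """Return {(u, v): shortest distance} for all pairs within distance K.

    edges: iterable of (u, v, w) undirected edges with w >= 1.
    """
    current = {}
    for u, v, w in edges:
        if w <= K:
            # Store both directions, keeping the smaller weight on duplicates.
            best = min(w, current.get((u, v), math.inf))
            current[(u, v)] = best
            current[(v, u)] = best

    for l in range(1, math.ceil(math.log2(K)) + 1):
        # Group tuples by first vertex (stands in for "sort on first vertex").
        by_first = defaultdict(list)
        for (first, second), w in current.items():
            by_first[first].append((second, w))

        candidate = dict(current)  # copy E_l into E'_(l+1)
        for neighbours in by_first.values():
            # Join two tuples that share the same first vertex.
            for vj, wk in neighbours:
                for vj2, wk2 in neighbours:
                    total = wk + wk2
                    if vj != vj2 and total <= 2 ** l and total <= K:
                        # Keep only the minimum distance per pair.
                        if total < candidate.get((vj, vj2), math.inf):
                            candidate[(vj, vj2)] = total
        current = candidate
    return current


# A path a - b - c - d - e with unit weights.
edges = [("a", "b", 1), ("b", "c", 1), ("c", "d", 1), ("d", "e", 1)]
dist = distance_self_join(edges, K=3)
```

With K = 3 the pair (a, d) is found at distance 3, while (a, e), at distance 4, is absent from the table and is therefore reported as farther than K.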


Hub Indexing: It requires far less space for shortest distances than the self-join algorithm, at the cost of access time.

Hubs: In the figure, p and q are hub vertices that connect two subgraphs, A and B.

Here we compute (|A| + |B|) pairwise shortest distances rather than storing all (|A| * |B|).

Construction of hub indexes: The main components are a hub set H and a table of distances for the pairs whose shortest paths do not cross through H.

The table is the Dist lookup table generated by the self-join algorithm.

One step of that algorithm needs to be changed to accommodate hub indexes.

We also need to maintain in memory a matrix of pairwise hub distances of the form Hubs[h_i][h_j], initialized with distances equal to infinity; for each edge <h_i, h_j, w_k> where h_i, h_j ∈ H, set Hubs[h_i][h_j] = w_k.

Floyd-Warshall's algorithm is then used to compute the shortest distances between hubs.
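Building the Hubs matrix can be sketched as follows. The function name and the sample hubs are hypothetical. Setting the diagonal to 0 is an assumption added here so that a path through a single hub works out; the slides only mention initializing with infinity.

```python
import math


def build_hub_matrix(hubs, edges):
    """Return nested dicts Hubs[h_i][h_j] with shortest hub-to-hub distances.

    hubs: iterable of hub vertices (the set H).
    edges: iterable of (u, v, w); only edges with both ends in H are used.
    """
    hubs = list(hubs)
    # Initialize with infinity (assumption: 0 on the diagonal).
    matrix = {h: {g: (0 if g == h else math.inf) for g in hubs} for h in hubs}

    for u, v, w in edges:
        if u in matrix and v in matrix and u != v:
            best = min(w, matrix[u][v])
            matrix[u][v] = best
            matrix[v][u] = best

    # Floyd-Warshall: allow each hub k in turn as an intermediate stop.
    for k in hubs:
        for i in hubs:
            for j in hubs:
                via_k = matrix[i][k] + matrix[k][j]
                if via_k < matrix[i][j]:
                    matrix[i][j] = via_k
    return matrix


hub_edges = [("p", "q", 2), ("q", "r", 3), ("p", "x", 1)]  # x is not a hub
hub_matrix = build_hub_matrix(["p", "q", "r"], hub_edges)
```

The matrix is quadratic in the number of hubs, which is one reason to keep the hub set small enough to fit in memory.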

Hub Indexing Algorithm

Algorithm: Pair-wise distance querying
Input: Lookup table on disk: Dist, Lookup matrix in memory: Hubs,
       Maximum required distance: K, Hub set: H
       Vertices to compute distance between: u, v (u ≠ v)
Return Value: Distance between u and v: d
[1] If u, v ∈ H, return d = Hubs[u][v].
[2] d = ∞
[3] If u ∈ H
[4]     For each <v, v_i, w_k> in Dist
[5]         If v_i ∈ H // Path u ~ v_i ~ v
[6]             d = min(d, w_k + Hubs[v_i][u])
[7]     If d > K, return d = ∞, else return d.
[8] If v ∈ H and u ∉ H, apply steps [4]-[7] symmetrically.
[9] // Neither u nor v is in H
[10] Cache in main memory (E_u) all <u, v_i, w_k> from Dist
[11] For each <v, v'_i, w'_k> in Dist
[12]     If (v'_i = u)
[13]         d = min(d, w'_k) // Path u ~ v without crossing hubs
[14]     For each edge <u, v_i, w_k> in E_u
[15]         If v'_i ∈ H and v_i ∈ H // Path u ~ v_i ~ v'_i ~ v
[16]             d = min(d, w_k + w'_k + Hubs[v'_i][v_i])
[17] If d > K, return d = ∞, else return d.
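The pair-wise query can be exercised with a short in-memory sketch. It is a stand-in only: `dist` is a plain dictionary in place of the indexed on-disk Dist table, `hub_matrix` is a nested dictionary in place of the in-memory Hubs matrix (with 0 on the diagonal, an assumption), and the sample graph is made up.

```python
import math


def pair_distance(u, v, dist, hub_matrix, K):
    """Distance between u and v (u != v), or infinity if it exceeds K.

    dist: vertex -> {other: weight} for shortest paths that cross no hub.
    hub_matrix: hub -> {hub: shortest hub-to-hub distance}.
    """
    u_is_hub, v_is_hub = u in hub_matrix, v in hub_matrix
    if u_is_hub and v_is_hub:                      # step [1]
        return hub_matrix[u][v]

    d = math.inf
    if u_is_hub or v_is_hub:                       # steps [3]-[8]
        hub, other = (u, v) if u_is_hub else (v, u)
        for vi, wk in dist.get(other, {}).items():
            if vi in hub_matrix:                   # path other ~ vi ~ hub
                d = min(d, wk + hub_matrix[vi][hub])
        return d if d <= K else math.inf

    e_u = dist.get(u, {})                          # step [10]
    for vi2, wk2 in dist.get(v, {}).items():       # step [11]
        if vi2 == u:
            d = min(d, wk2)                        # path avoiding all hubs
        if vi2 in hub_matrix:
            for vi, wk in e_u.items():
                if vi in hub_matrix:               # path u ~ vi ~ vi2 ~ v
                    d = min(d, wk + wk2 + hub_matrix[vi2][vi])
    return d if d <= K else math.inf


# Graph: c - a - p - q - b with hubs p and q (weights 1, 1, 2, 1).
dist = {
    "a": {"p": 1, "c": 1},
    "c": {"a": 1, "p": 2},
    "b": {"q": 1},
    "p": {"a": 1, "c": 2, "q": 2},
    "q": {"p": 2, "b": 1},
}
hub_matrix = {"p": {"p": 0, "q": 2}, "q": {"p": 2, "q": 0}}

d_ab = pair_distance("a", "b", dist, hub_matrix, K=10)
d_ac = pair_distance("a", "c", dist, hub_matrix, K=10)
d_pb = pair_distance("p", "b", dist, hub_matrix, K=10)
```

Here a and b sit on opposite sides of the hubs, so their distance is assembled from a's distance to p, the hub-to-hub distance and b's distance to q.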

The algorithms discussed earlier can be used to get the distance between a single pair of objects.

A naïve approach for a Find/Near query would be to check all pairs of Find and Near objects. To avoid unnecessary seeks, the objects can be clustered; this has to be done by the engine administrator.

In this proximity search engine, clustering is done on labels such as Actors, Producers, etc.

Hub Selection

Consider a graph G(V, E), and let V1, V2 be disjoint subsets of V. A set of vertices S ⊆ V separates V1 and V2 if, for all pairs of vertices (v1, v2) with v1 ∈ V1 and v2 ∈ V2, every path from v1 to v2 goes through some vertex from S.

We say that S is a balanced separator if

min(|V1|, |V2|) ≥ |V|/3

We say that S is a c-separator if

V - S = V1 ∪ V2,

i.e. S disconnects the graph.
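The separator definitions can be checked mechanically on a small graph. The sketch below only tests whether a given candidate S separates V1 from V2 and whether the split is balanced. It does not show how the paper selects hubs, and the function names and sample graph are assumptions.

```python
from collections import deque


def separates(adjacency, separator, v1, v2):
    """True if every path from a vertex in v1 to one in v2 passes through separator."""
    blocked = set(separator)
    targets = set(v2)
    seen = {x for x in v1 if x not in blocked}
    queue = deque(seen)
    # Breadth-first search from V1 that is never allowed to enter S.
    while queue:
        node = queue.popleft()
        if node in targets:
            return False  # reached V2 without touching S
        for nxt in adjacency.get(node, ()):
            if nxt not in blocked and nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return True


def is_balanced(v1, v2, total_vertices):
    """Balanced condition: min(|V1|, |V2|) >= |V| / 3."""
    return min(len(v1), len(v2)) >= total_vertices / 3


# Path a - b - p - c - d: removing p splits the graph into {a, b} and {c, d}.
adjacency = {
    "a": ["b"], "b": ["a", "p"], "p": ["b", "c"],
    "c": ["p", "d"], "d": ["c"],
}
p_separates = separates(adjacency, {"p"}, {"a", "b"}, {"c", "d"})
nothing_separates = separates(adjacency, set(), {"a", "b"}, {"c", "d"})
balanced = is_balanced({"a", "b"}, {"c", "d"}, total_vertices=5)
```

In this example p is a natural hub candidate: it is a single vertex whose removal yields two parts of equal size.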

PERFORMANCE EXPERIMENTS

The experiments were run on a Sun SPARC/Ultra II (2x200 MHz) running SunOS 5.6, with 256 MB of RAM and 18 GB of local disk space.

The experiments use two datasets: IMDB and the DBgroup dataset.

A generator is used that takes IMDB's edge list as input and scales the database by a scale factor S.

For performance, ISAM indexes are used.

Performance issues discussed:

Index Performance:

First figure: storage requirements with varying K.

Second figure: index construction time with varying K.

Both are for a small number of hubs: the scale factor is S = 10 and 2.5% of the vertices are hubs.

Algorithm scalability as the database grows in size:

First figure: total storage with varying scale.

The settings listed for this figure are S = 10 and 2.5% of vertices as hubs.

Second figure: number of hubs as a percentage of vertices.

The settings listed for this figure are K = 12, S = 10 and 2.5% of vertices as hubs.

THANK YOU

References

1. A Standard Textual Interchange Format for the Object Exchange Model (OEM), by Roy Goldman, Sudarshan Chawathe, Arturo Crespo, Jason McHugh
