
Page 1: Module II: Multimedia Data Mining - unibo.it · 2015-10-23 · MM queries From previous lesson we know how to effectively represent MM data content by means of low-level features

ALMA MATER STUDIORUM - UNIVERSITÀ DI BOLOGNA

Module II: Multimedia Data Mining
Laurea Magistrale in Ingegneria Informatica
University of Bologna

Efficient Algorithms for Multimedia Queries

Home page: http://www-db.disi.unibo.it/courses/DM/
Electronic version: 3.01.AlgorithmsForMultimediaQueries.pdf
Electronic version: 3.01.AlgorithmsForMultimediaQueries-2p.pdf

I. Bartolini Data Mining M

Outline

- MM query formulation paradigms
- Sequential retrieval of MM data
- Index-based retrieval of MM data

Page 2:

MM queries

- From the previous lesson we know how to effectively represent MM data content by means of low-level features…
  - Global, e.g., color and texture of images
  - Local, e.g., the shape of an image region, or keypoints of an image
- …and how to compare such features, using suitable distance (or dissimilarity) functions, in order to compute their distance, i.e., their dissimilarity score
  - Under the hypothesis that such a score is a numeric value in the range [0,1], we can define the similarity score ≅ (1 – dissimilarity score)
- Today we focus on how to efficiently solve an MM query

Problem: "Given an input MM data object (i.e., a query) Q and an MM DB, we want to compute which objects in the MM DB represent the query results wrt Q"

MM query formulation paradigms (1)

- By text (attributes/annotations)
  1. using SQL statements (if data is maintained in a DB)
     - E.g., suppose we have created the table SONG with song records storing information such as the song artist, the title, etc.
     - The execution of this type of search exploits standard DBMS technologies

Select mp3
From SONG
Where artist = 'Manowar'

Page 3:

MM query formulation paradigms (2)

  2. using the "query-by-keyword" paradigm
     - The execution of this type of query exploits traditional Information Retrieval (IR) techniques
     - With the assumption that an association has been recorded between MM objects and keywords
       - …will detail it very soon! ☺

MM query formulation paradigms (3)

- By low-level features (e.g., color, texture, etc.)
  - following the "query-by-example" (QBE) paradigm, first adopted in IBM's Query By Image Content (QBIC) system
  - Full vs. partial queries
  - "query-by-sketch"
  - The execution of this type of query is based on content-based MM data retrieval techniques

Page 4:

Content-based MM queries

- From previous lessons we also know that content-based MM retrieval can be conveniently formulated as a k-Nearest Neighbor (k-NN) problem
  - Top-k queries
- Thus, our initial problem can be rewritten as follows:

Problem: "Given an MM query object Q, an MM DB, an MM data similarity function S (that for each pair of objects (Oi,Oj) returns their similarity score si,j), and an integer k, determine the k-NN objects to Q in the MM DB, i.e., the k objects with the highest similarity scores wrt Q, i.e., the Top-k objects!"

MM query evaluation strategies

[Figure: content-based retrieval architecture: GUI, query processor, feature extraction, image segmentation (optional), feature DB, image DB, index; the query image enters the query engine and the results are visualized]

- Sequential evaluation: Q is compared with all DB objects
  - not a feasible solution for large MM data sets
- Index-based execution: Q is compared with a subset of DB objects, with a "guarantee" on the correctness of the result
  - speeds up query evaluation time

Page 5:

Access methods for MM data

- Among the plethora of access methods proposed for multidimensional data, here we focus on two basic cases
- R-tree (Guttman, 1984): B-trees in multiple dimensions
  - MM object represented as a point in a vector space
  - Index based on the concept of Minimum Bounding Rectangle (MBR)
  - Variants: R+-tree (Sellis et al., 1987), R*-tree (Beckmann et al., 1990)
- M-tree (Ciaccia et al., 1997)
  - Intuitively, it generalizes "R-tree principles" to arbitrary metric spaces
  - The M-tree treats the distance function as a "black box"

Recall on structured data…

- Structured data are based on a predefined schema able to describe the content of the document collection
- A database (DB) can be seen as a collection of objects representing some information of interest in a structured way (i.e., through a schema)
- A relational database management system (RDBMS, or just DBMS) is a software system able to "manage" collections of objects which can be very large (giga-/terabytes and more) and shared by different applications in a persistent way (even in the presence of faults)
  - "manage" = obtain, elaborate, maintain, produce, distribute
- Examples of DBMSs: Oracle, IBM (DB2 UDB), Microsoft (SQL Server), Sybase, MySQL, PostgreSQL, InterBase

Page 6:

Relations as tables

- DBMSs use the relational model (Codd, 1970) to describe the data; that is, the information is organized in tables ("relations")
- The rows of a table correspond to records, while the columns correspond to attributes (schema)
- The language to store/retrieve information from such tables is the Structured Query Language (SQL)
- Example: if we want to create a table of employee records, so that we can store their employee number, name, age, and salary, we can use the following SQL statement:

create table EMPLOYEE (
  empN   integer PRIMARY KEY,
  name   char(50),
  age    integer,
  salary float );

[Table EMPLOYEE: empN | name | age | salary]

Populating and querying tables

- Tables can be populated with the SQL insert command, e.g.:

insert into EMPLOYEE values (123, 'Smith, John', 30, 38000.00);
insert into EMPLOYEE values (456, 'Johnson, Tom', 25, 55000.00);

empN | name         | age | salary
123  | Smith, John  | 30  | 38000.00
456  | Johnson, Tom | 25  | 55000.00

- We can retrieve information using the select command. E.g., if we want to find all the employees with salary at most 50000, we use the query:

Select *
From EMPLOYEE
Where salary <= 50000.00

Page 7:

Query execution

- In the absence of access methods (e.g., an index), the DBMS will perform a sequential scan, checking the salary of each and every employee record against the desired threshold of 50000!!!
- To accelerate query execution, we can create an index (usually a B-tree index, as we will see in a few minutes) with the command create index
- E.g., to build an index on the employees' salary, we would issue the SQL statement:

create index salIdx on EMPLOYEE(salary)

- In general the DBMS relies on an "optimizer" component to decide the most efficient way to execute a given query
  - sequential vs. index-based evaluation
  - which index is the most appropriate
  - …

Storage hierarchies

- The first level is typically main memory (core, RAM)
  - Fast (access time of microseconds or faster), small, expensive
- The second level (secondary storage) is typically magnetic disk
  - Much slower (5-10 ms access time), but much larger and cheaper
- Database researchers have focused on "large databases" that do not fit in main memory and thus have to be stored in secondary memory
- Secondary storage is organized into blocks (= pages)
  - The reason is that accessing data on disk involves the mechanical movement of the read/write head of the disk to the appropriate track
  - Because these moves ("seeks") are slow and expensive, every time we do a disk read we bring into main memory a whole disk block (on the order of 1 KB - 8 KB)
  - So it makes a huge difference in performance if we manage to group similar data in the same disk blocks!!

Page 8:

B-tree

- Access methods like the B-tree try exactly to achieve good clustering of data, in order to minimize the number of disk reads
- Balanced tree of order p
- Node: <P1, <K1,Pr1>, P2, <K2,Pr2>, …, Pq>, with q ≤ p
  - Pi: tree node pointer (possibly null), Pri: data pointer

[Figure: B-tree with root keys 5, 8 and leaf nodes (1, 3), (6, 7), (9, 12); each key carries a data pointer]

B+-tree

- B-tree variant, more commonly used than the B-tree
- Data pointers only at the leaf nodes
- All leaf nodes linked together
  - allows ordered access! ☺

[Figure: internal node <P1, K1, …, Pi, Ki, …, Pq>, where the subtree pointed to by Pi holds the keys X with Ki-1 < X ≤ Ki (X ≤ K1 for P1, Kq-1 < X for Pq); leaf node <<K1,Pr1>, …, <Kq-1,Prq-1>, Pnext>, with data pointers Pri and a pointer Pnext to the next leaf node in the tree]
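The ordered access that linked leaves make possible can be sketched in a few lines (a toy model, not a disk-based B+-tree: leaves are plain sorted key runs, and `range_scan` is an illustrative name):

```python
# Toy model of B+-tree leaf-level ordered access: leaf nodes hold sorted
# keys and are linked left to right, so a range query scans leaves in order.
leaves = [[1, 3], [6, 7], [9, 12]]   # leaf keys, as in the B-tree figure

def range_scan(leaves, lo, hi):
    """Collect all keys in [lo, hi] by walking the linked leaves in order."""
    out = []
    for leaf in leaves:              # follow the next-leaf pointers
        for key in leaf:
            if key > hi:             # keys are sorted: we can stop early
                return out
            if key >= lo:
                out.append(key)
    return out
```

Here `range_scan(leaves, 3, 9)` visits only the leaves up to the one containing 9, which is exactly why range queries are cheap on a B+-tree.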

Page 9:

Back to multidimensional access methods: what the R-tree looks like

Basic idea:
- Recursive bottom-up aggregation of objects based on MBRs
- Regions can overlap
- At query time, only the portions of the tree which are "close" to the query are accessed

[Figure: R-tree with root entries A, B, C over leaf-level MBRs D-P]

Why is the R-tree not enough?

- The R-tree principle can be generalized to handle an arbitrary number of dimensions
  - MBRs become hyper-rectangles
- However, vector spaces equipped with some (weighted) Lp-norm are not general enough to deal with the whole variety of feature types and distance functions needed in MMDBs. For example:
  - Objects have different dimensionality (audio of different lengths)
  - Non-numerical features (sets of values, words, etc.)
- Yet the relevance of objects to the query is still assessed by way of a similarity function

Page 10:

Metric spaces

- A metric space M = (U,d) is a pair where U is a domain ("universe") of values, and d is a distance function that, ∀ x,y,z ∈ U, satisfies the metric axioms:

  d(x,y) ≥ 0, d(x,y) = 0 ⇔ x = y   (positivity)
  d(x,y) = d(y,x)                  (symmetry)
  d(x,y) ≤ d(x,z) + d(z,y)         (triangle inequality)

- Note:
  - All the distance functions seen in the previous examples are metrics, and so are the (weighted) Lp-norms
  - An example of a distance that does not fit the metric framework is the DTW
- Metric indexes only use the metric axioms to organize objects, and exploit the triangle inequality to prune the search space!!

The M-tree

- Intuitively, it generalizes "R-tree principles" to arbitrary metric spaces
- The M-tree treats the distance function as a "black box"
- At first sight, the M-tree "looks like" an R-tree; however, the M-tree only "knows" about distance values, thus it ignores coordinate values and does not rely on any "geometric" (coordinate-based) reasoning
- Since 1997 [CPZ97] it has been used by several research groups for:
  - image retrieval, text indexing, shape matching, clustering algorithms, fingerprint matching, DNA DBs, etc.
- C++ source code freely available at http://www-db.deis.unibo.it/Mtree/

Page 11:

M-tree: what it looks like

Basic idea:
- Recursive bottom-up aggregation of objects based on regions
- Regions can overlap
- At query time, only the portions of the tree which are "close" to the query are accessed
- Depending on the metric (L1, L2, L∞, weighted Euclidean, quadratic distance), the "shape" of the index regions changes

[Figure: M-tree with d ≡ L2; root entries A, B over regions C, D, E, F]

The M-tree regions

- Each node N of the tree has an associated region, Reg(N), defined as

  Reg(N) = {p : p ∈ U, d(p,vN) ≤ rN}

  where:
  - vN (the "center") is also called a routing object, and
  - rN is called the (covering) radius of the region
- The set of indexed points p that are reachable from node N is guaranteed to have d(p,vN) ≤ rN
- The triangle inequality guarantees that if vN is sufficiently far from Q, then the whole index region can be skipped
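The pruning rule can be sketched in a few lines of Python (illustrative names, not the M-tree implementation: for a range query with radius r, the triangle inequality gives d(Q,p) ≥ d(Q,vN) − rN for every p in the region, so the subtree is skipped when that lower bound already exceeds r):

```python
import math

def can_prune(d, q, v_n, r_n, search_radius):
    """Skip node N's whole subtree when even its closest possible point
    is farther from the query than the search radius:
    d(q, p) >= d(q, v_n) - r_n for all p in Reg(N)."""
    return d(q, v_n) - r_n > search_radius

d = math.dist  # Euclidean distance, used here as the "black box" metric

# Region centered at (0, 0) with radius 1, query at (5, 0), search radius 2:
# 5 - 1 = 4 > 2, so the subtree is skipped with a single distance computation.
```

Note the only thing `can_prune` asks of `d` is that it satisfies the metric axioms, which is exactly the M-tree's "black box" view of the distance function.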

Page 12:

Solving Top-k queries

- The R-tree and/or M-tree are immediately able to efficiently solve MM Top-k queries where the MM query is seen as a whole object, i.e., by means of global features
  - querying the index, we obtain the final result
- When dealing with complex objects this is no longer true. Examples:
  - multi-object queries (e.g., queries in the region-based image retrieval scenario)
  - distributed queries (e.g., multiple-feature-based MM queries)
- In such cases, the index only gives a partial result, which has to be somehow integrated outside of the index itself

[Figure: image index over images I; region index over regions R1, R2]

Top-k middleware queries

- A Top-k middleware query retrieves the best k objects, given the (partial) descriptions provided for such objects by m data sources
- We make some simplifying assumptions by requiring, for each source:
  1. given a query, it can return a ranked list of results (i.e., not just a set)
     - More precisely, the output of the j-th data source DSj (j=1,…,m) is a list of objects/tuples with format (ObjID, Attributes, Score), where:
       - ObjID is the identifier of the objects in DSj
       - Attributes is the set of attributes that the query requests from DSj
       - Score is a numerical value that says how well an object matches the query on DSj, that is, how "similar" it is to our ideal target object
         - We also say that this is the "local/partial score" computed by DSj

Page 13:

Further assumptions

  2. it supports a random access interface:
     - A random access retrieves the local score of an object with respect to a query Q
     - "getScore" interface
  3. it supports a sorted access interface:
     - A sorted access gets the next best object (and its local score) for query Q
     - supported by our reference DB access methods, R-tree and M-tree, through the "getNext" interface
  4. the ObjID is "global": a given object has the same identifier across all data sources
     - The important point is to be able to "match" in some way the descriptions provided by the data sources
  5. each source manages the same set of objects, i.e., each DSj "knows" about every object

A model with scoring functions

- In order to provide a unifying approach to the problem, we consider:
  - A Top-k query Q = (Q1,Q2,…,Qm)
    - Qj is the sub-query sent to the j-th data source DSj
  - Each object o returned by a source DSj has an associated local/partial score sj(o), with sj(o) ∈ [0,1]
    - Scores are normalized, with higher scores being better
  - The hypercube [0,1]^m is called the "score space"
  - The point s(o) = (s1(o),s2(o),…,sm(o)) ∈ [0,1]^m, which maps o into the score space, is called the "(representative) point" of o
  - The global/overall score gs(o) ∈ [0,1] of o is computed by means of a scoring function (s.f.) S that aggregates the local scores of o:

    S : [0,1]^m → [0,1],   gs(o) = S(s(o)) = S(s1(o),s2(o),…,sm(o))

- Relevant examples of scoring functions: Average (AVG), SUM, MAX, and MIN
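As a toy illustration of the local-to-global mapping (the scores below are invented, and `global_scores` is an illustrative name), AVG and MIN aggregate the same representative points quite differently:

```python
# Each object has m = 3 local scores s_j(o) in [0,1] (made-up data).
scores = {"o1": (0.9, 0.4, 0.8), "o2": (0.6, 0.7, 0.65)}

def global_scores(scores, S):
    """Apply a scoring function S to each object's representative point."""
    return {o: S(point) for o, point in scores.items()}

avg = lambda point: sum(point) / len(point)

global_scores(scores, avg)  # AVG: o1 -> ~0.7,  o2 -> ~0.65
global_scores(scores, min)  # MIN: o1 -> 0.4,   o2 -> 0.6
```

Note how the ranking flips: o1 wins under AVG, but its weak second source makes it lose under MIN. Both functions are monotone, which is the property A0 and TA rely on.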

Page 14:

Efficient algorithms for middleware queries

- The A0 algorithm [Fag96] solves the problem for any monotone s.f.
- The TA algorithm solves the problem for any monotone s.f. and improves on the performance of A0
- These algorithms represent a general solution to the Top-k middleware query problem
- Many variants have been proposed for specific scenarios

Monotone scoring function:
- An m-ary scoring function S is said to be monotone if
  x1 ≤ y1, x2 ≤ y2, …, xm ≤ ym ⇒ S(x1,x2,…,xm) ≤ S(y1,y2,…,ym)

The A0 algorithm

- A0 works in 3 distinct phases:
  1. Sorted access phase
     - Perform on each DSj a sequence of sorted accesses, and stop when the set M = ∩j Obj(j) contains at least k objects
  2. Random access phase
     - For each object o ∈ Obj = ∪j Obj(j), perform random accesses to retrieve the missing partial scores for o
  3. Final computation
     - For each object o ∈ Obj, compute gs(o) and return the k objects with the highest global scores

Example (k = 1):

  DS1:          DS2:
  ObjID  s1     ObjID  s2
  o7     0.7    o3     0.8
  o3     0.5    o2     0.7
  …      …      …      …

  M = {o3}, Obj = {o2,o3,o7}
  Random accesses: getScoreDS1(Q,o2), getScoreDS2(Q,o7)
  Then compute the Top-k results according to S
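The three phases can be sketched in Python over in-memory sources (a sketch under simplifying assumptions: each source is a list of (ObjID, score) pairs sorted by decreasing score, and a dictionary lookup stands in for the getScore interface):

```python
def a0(sources, S, k):
    """Fagin's A0: (1) sorted accesses until at least k objects are seen
    in every source, (2) random accesses for the missing partial scores,
    (3) compute global scores and return the Top-k."""
    m = len(sources)
    lookup = [dict(src) for src in sources]   # getScore (random access)
    seen = [dict() for _ in range(m)]         # scores seen under sorted access
    depth = 0
    while True:
        for j in range(m):                    # one sorted access per source
            if depth < len(sources[j]):
                oid, s = sources[j][depth]
                seen[j][oid] = s
        depth += 1
        common = set(seen[0]).intersection(*seen[1:])   # M
        if len(common) >= k or depth >= max(map(len, sources)):
            break
    objs = set().union(*seen)                 # Obj = union of seen objects
    gs = {o: S([lookup[j][o] for j in range(m)]) for o in objs}
    return sorted(gs.items(), key=lambda t: -t[1])[:k]
```

On the data of the worked example on the next slide, `a0(..., min, 1)` returns o3 and `a0(..., avg, 1)` returns o7, matching the slides.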

Page 15:

How A0 works

- Let's take k = 1 and apply A0 to the following data:

  DS1:            DS2:            DS3:
  ObjID  s1       ObjID  s2       ObjID  s3
  o7     0.9      o2     0.95     o7     1.0
  o3     0.65     o3     0.7      o2     0.8
  o2     0.6      o4     0.6      o4     0.75
  o1     0.5      o1     0.5      o3     0.7
  o4     0.4      o7     0.5      o1     0.6

- After the sorted accesses we obtain: M = {o2}, Obj = {o2,o3,o4,o7}
- After performing the needed random accesses we get:

  S ≡ MIN          S ≡ AVG
  ObjID  gs        ObjID  gs
  o3     0.65      o7     0.8
  o2     0.6       o2     0.783
  o7     0.5       o3     0.683
  o4     0.4       o4     0.583

  RIGHT in both cases!! ☺

The TA algorithm

- TA works by interleaving sorted and random accesses:
  1. Perform on each DSj a sorted access; for each new object o seen under sorted access, perform random accesses to retrieve the missing partial scores for o and compute gs(o); if gs(o) is one of the k highest scores seen so far, keep (o,gs(o)), otherwise discard o and its score
  2. Let sj be the lowest score seen so far on DSj; let τ = S(s1,s2,…,sm) be the threshold score
  3. If the current Top-k objects are such that gs(o) ≥ τ holds for each of them, then stop; otherwise repeat from 1.
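A matching sketch of TA under the same in-memory model as the A0 sketch (sorted lists of (ObjID, score) pairs, dictionary lookups as random accesses):

```python
def ta(sources, S, k):
    """Threshold Algorithm: interleave sorted and random accesses and stop
    as soon as the current Top-k all reach the threshold tau = S(s1,...,sm)."""
    m = len(sources)
    lookup = [dict(src) for src in sources]   # getScore (random access)
    gs = {}                                   # global scores of seen objects
    last = [1.0] * m                          # lowest score seen per source
    for depth in range(max(map(len, sources))):
        for j in range(m):                    # one sorted access per source
            if depth < len(sources[j]):
                oid, s = sources[j][depth]
                last[j] = s
                if oid not in gs:             # random accesses for the rest
                    gs[oid] = S([lookup[i][oid] for i in range(m)])
        tau = S(last)                         # threshold score
        topk = sorted(gs.items(), key=lambda t: -t[1])[:k]
        if len(topk) == k and all(g >= tau for _, g in topk):
            return topk                       # no unseen object can do better
    return sorted(gs.items(), key=lambda t: -t[1])[:k]
```

Monotonicity of S is what makes τ a valid upper bound on the global score of every object not yet seen, so the early stop never discards a true Top-k result.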

Page 16:

How TA works

(Same data as in the A0 example)

- Let's take S ≡ MIN and k = 1:
  - Round 1: gs(o2) = 0.6, gs(o7) = 0.5; τ = 0.9 → keep going
  - Round 2: gs(o3) = 0.65; τ = 0.65 → stop: o3 is the Top-1
- Let's take S ≡ AVG and k = 2:
  - Round 1: gs(o2) = 0.783, gs(o7) = 0.8; τ = 0.95 → keep going
  - Round 2: gs(o3) = 0.683; τ = 0.716 → stop: o7 and o2 are the Top-2

Back to the Windsurf case study

- Windsurf: Wavelet-Based Indexing of Images Using Regions Fragmentation
  - Discrete Wavelet Transform (DWT): extracts a set of features representing the image in the color-texture space
  - Clustering: fragments the image into a set of regions using the wavelet coefficients
  - Similarity features: used to compare regions

[Figure: pipeline DWT → Clustering → Similarity Features]

Page 17:

Remember the "query processor" module!

- From the previous lesson we know that the similarity between images is assessed by taking into account the similarity between matched regions

[Figure: region matching between query image and DB image]

Sequential algorithm

Algorithm k-NNSeqWindsurf
Require: I (query image), k (cardinality of result), DB (image DB), sf (scoring function)
Ensure: set of k-NN images
1: for all images I' in DB do
2:   for all regions Rj in I' do
3:     for all regions Ri in I do
4:       compute sR(Ri,Rj)
5:   compute all matchings between regions in I and regions in I', and select
     the one that maximizes sI(I,I') by means of sf
6: return the k images having the highest overall similarity scores sI
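The scan above can be sketched as follows (the region similarity sR and the matching step that yields sI are stubbed as caller-supplied functions, since their details belong to the previous lesson; all names are illustrative):

```python
def knn_seq(query_regions, db, k, s_r, match_and_score):
    """Sequential k-NN over the whole image DB: compute every region-to-region
    similarity, solve the matching to get s_I, and keep the best k images."""
    scored = []
    for image_id, regions in db.items():          # for all images I' in DB
        sims = {(i, j): s_r(q, r)                 # s_R(Ri, Rj) for all pairs
                for i, q in enumerate(query_regions)
                for j, r in enumerate(regions)}
        s_i = match_and_score(sims)               # best matching under sf
        scored.append((s_i, image_id))
    scored.sort(reverse=True)                     # highest s_I first
    return [img for _, img in scored[:k]]
```

Note the cost: every image in the DB pays the full region-matching price, which is exactly what the index-based algorithms on the next slides avoid.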

Page 18:

Index-based query processing

- Goal: speed up query evaluation time over a sequential scan
  - reduce the number of images on which the region matching problem has to be solved
- The provided algorithms are based on the M-tree index
  - Good for k-Nearest Neighbor (k-NN) queries
  - Able to perform sorted and random accesses to the data
- Remember: the M-tree requires the (dis-)similarity function d(I1,I2) to be a metric:
  - d(I1,I2) ≥ 0
  - d(I1,I1) = 0
  - d(I1,I2) + d(I2,I3) ≥ d(I1,I3)

Query processing algorithms

- Indexing regions
  - the Bhattacharyya distance is a metric
  - 1-1 matching (Hungarian), M-N matching (EMD)
  - Monotonic scoring functions (e.g., avg) and/or qualitative preferences (e.g., Skyline) [BCC+06, BCO+07, BCP10]
- Indexing images (as a "signature" of regions)
  - "If the ground distance (i.e., Bhattacharyya) is a metric and the total weights of the two distributions are equal, EMD is a metric" [BCP10]
  - Full and partial queries [BCP10]

[Figure: image index over images I; region index over regions R1, R2]
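For intuition, 1-1 region matching can be brute-forced over permutations (a stand-in for the Hungarian algorithm named above, viable only for a handful of regions; the similarity matrix below is invented):

```python
from itertools import permutations

def best_one_to_one(sim):
    """sim[i][j]: similarity between query region i and DB region j.
    Return the 1-1 assignment maximizing the average region similarity."""
    n = len(sim)
    best = max(permutations(range(n)),
               key=lambda p: sum(sim[i][p[i]] for i in range(n)))
    return best, sum(sim[i][best[i]] for i in range(n)) / n

# e.g. best_one_to_one([[0.9, 0.1], [0.2, 0.8]]) assigns region 0 -> 0 and
# region 1 -> 1, with average similarity (0.9 + 0.8) / 2
```

The Hungarian algorithm finds the same optimum in O(n^3) instead of O(n!), which is why it is the method of choice for real region sets.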

Page 19:

1-Nearest Neighbor query example

- 1-1 match, Avg as scoring function

[Figure, steps 1-3: at each step the next-best regions are fetched under sorted access (sR values 0.92, 0.9, 0.82, 0.74, 0.73, 0.72), the threshold θ drops from 0.91 to 0.81 to 0.725, and the candidate images' scores sI (0.56, 0.7, 0.67, 0.76, 0.71, 0.68) are computed; the scan stops at step 3, when the best sI seen (0.76) is ≥ θ = 0.725]

Region index-based algorithm

Algorithm k-NNIDXWindsurf
Require: I (query image), k (cardinality of result), T (index on regions), sf (scoring function)
Ensure: set of k-NN images
 1: output = 0, Res = ∅
 2: for all regions Ri in I do
 3:   open a sorted access index scan Li on T
 4: while output < k do
 5:   for all regions Ri in I do
 6:     retrieve from Li the next NN region R'i (R'i in image I')   -- sorted access
 7:     θi = sR(Ri, R'i)
 8:   retrieve the missing regions of I'                            -- random access
 9:   compute all matchings between regions in I and regions in I', and select the one that maximizes sI(I,I') by means of sf
10:   Res = Res ∪ {I'}   -- Res is kept sorted by decreasing values of sI(I,I')
11:   θ = sf(θ1,…,θn)
12:   for all images I' in Res do
13:     if sI(I,I') ≥ θ then
14:       output I'
15:       output = output + 1