
Post on 28-Mar-2015


COMP3410 DB32: Technologies for Knowledge Management

Lecture 4:

Inverted Files and Signature Files for IR

By Eric Atwell, School of Computing, University of Leeds

(including re-use of teaching resources from other sources, esp. Stuart Roberts, School of Computing, Univ of Leeds)


Module Objectives

“On completion of this module, students should be able to:

… describe classical and emerging information retrieval techniques, and their relevance to knowledge management; …”

Today’s objectives

By the end of this lecture you should understand:

• why relational database techniques, e.g. B-tree indexing, are no use for ‘IR’ queries;

• how ‘inverted file’ structures work to provide efficient query processing;

• an alternative approach provided by signature files.

The relational problem

• the simple approach to searching for a keyword uses a LIKE pattern with leading (and trailing) wildcards, e.g. ‘%graphics%’

• a B-tree index orders values by their leading characters, so it cannot help when the pattern starts with a wildcard; there is no way other than a ‘brute force scan’ to match such a condition against the data records held in a traditional relational database.

The relational problem

• Rather than hold full text, why not do content analysis, extract the index terms (keywords), and hold these in a relational database?

[Diagram: Module is ‘indexed by’ Index term]

module(mod_code, title, semester, …)

term(term_id, value)

index(mod_code, term_id)

Sample SQL query: OR

find all modules matching: ‘database’ or ‘AI’ or ‘knowledge base’

select distinct m.*
from module m
  inner join index i on m.mod_code = i.mod_code
  inner join term t on t.term_id = i.term_id
where t.value = 'database'
   or t.value = 'AI'
   or t.value = 'knowledge base';

Another sample query: AND (?)

find all modules matching: ‘database’ and ‘AI’ and ‘knowledge base’

select distinct m.*
from module m
  inner join index i on m.mod_code = i.mod_code
  inner join term t on t.term_id = i.term_id
where t.value = 'database'
  and t.value = 'AI'
  and t.value = 'knowledge base';

This SQL query will not match any record; t.value cannot be simultaneously equal to ‘database’, ‘AI’ and ‘knowledge base’.

We cannot simply replace the ‘OR’s of the last SQL query with ‘AND’s.

Corrected sample query: AND

find all modules matching: ‘database’ and ‘AI’

select distinct m.*
from module m
  inner join index i1 on m.mod_code = i1.mod_code
  inner join term t1 on t1.term_id = i1.term_id
  inner join index i2 on m.mod_code = i2.mod_code
  inner join term t2 on t2.term_id = i2.term_id
where t1.value = 'database'
  and t2.value = 'AI';

Both tables ‘index’ and ‘term’ must be searched twice in order to establish whether each module is attached to both terms ‘database’ and ‘AI’.

If the query is a conjunction of N terms, the SQL would have 2N inner joins. AND is more complicated than OR (but common in IR).
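
The difference between the naive and the corrected AND query can be checked with a small experiment. The following is only a sketch, run against an in-memory SQLite database from Python: the sample rows (modules M1 and M2) are invented for illustration, the join column is taken to be mod_code as in the module(mod_code, …) schema, and the table name index is double-quoted because INDEX is a reserved word in SQLite.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE module (mod_code TEXT PRIMARY KEY, title TEXT);
    CREATE TABLE term (term_id INTEGER PRIMARY KEY, value TEXT);
    CREATE TABLE "index" (mod_code TEXT, term_id INTEGER);
    -- Hypothetical sample data: M1 is indexed by both terms, M2 by one.
    INSERT INTO module VALUES ('M1', 'Sample module one'), ('M2', 'Sample module two');
    INSERT INTO term VALUES (1, 'database'), (2, 'AI');
    INSERT INTO "index" VALUES ('M1', 1), ('M1', 2), ('M2', 1);
""")

# Naive AND: a single row of term can never hold two values at once.
naive_and = """
    SELECT DISTINCT m.mod_code FROM module m
      INNER JOIN "index" i ON m.mod_code = i.mod_code
      INNER JOIN term t ON t.term_id = i.term_id
    WHERE t.value = 'database' AND t.value = 'AI'
"""

# Corrected AND: one (index, term) pair of joins per query term.
corrected_and = """
    SELECT DISTINCT m.mod_code FROM module m
      INNER JOIN "index" i1 ON m.mod_code = i1.mod_code
      INNER JOIN term t1 ON t1.term_id = i1.term_id
      INNER JOIN "index" i2 ON m.mod_code = i2.mod_code
      INNER JOIN term t2 ON t2.term_id = i2.term_id
    WHERE t1.value = 'database' AND t2.value = 'AI'
"""

naive_rows = conn.execute(naive_and).fetchall()
corrected_rows = conn.execute(corrected_and).fetchall()
print(naive_rows)      # []
print(corrected_rows)  # [('M1',)]
```

Each extra term in the conjunction would add another pair of joins to the corrected query, which is the 2N growth noted above.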

Inverted file

• Non-DB structure, so not suitable for standard SQL

• each index term entry ‘points’ to a list of document record identifiers (RIDs)

• standard indexing method for IR systems

• widely used for search engines

• can be extended to allow for positional (context) searches

Inverted file structure

The idea of an inverted file is that, as well as storing each document with the list of terms used to index it, we store the list of terms used in the whole collection of documents, and for each term point to the list of documents indexed by that term. So we have ‘inverted’ the structure:

D1: T11, T12, …, T1k

D2: T21, T22, …, T2l

…

to give:

T1: D11, D12, …, D1m

T2: D21, D22, …, D2n
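
The inversion can be sketched in a few lines of Python. This is an illustration only: the function name invert and the three sample documents are assumptions, not part of the lecture material.

```python
from collections import defaultdict


def invert(doc_terms):
    """Turn {doc_id: [terms]} into {term: sorted list of doc_ids}."""
    postings = defaultdict(set)
    for doc_id, terms in doc_terms.items():
        for term in terms:
            postings[term].add(doc_id)
    return {term: sorted(doc_ids) for term, doc_ids in postings.items()}


# Hypothetical collection: each document with the terms that index it.
doc_terms = {
    1: ["database", "sql"],
    2: ["database", "ai"],
    3: ["ai", "knowledge base"],
}

inverted = invert(doc_terms)
print(inverted["database"])  # [1, 2]
print(inverted["ai"])        # [2, 3]
```

Keeping each list of document ids sorted is what later allows several lists to be merged in a single pass.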

Inverted file structure

[Figure: three linked files, labelled dictionary, inverted (or postings) file, and data file. Each dictionary entry gives a term with its number of postings in brackets and points to that term’s list of record identifiers in the postings file; each record identifier points to a document (Doc 1, Doc 2, …, Doc 6, …) in the data file.]

Term 1 (2): 1, 2

Term 2 (3): 1, 2, 3

Term 3 (1): 2

Term 4 (3): 2, 3, 4

Term 5 (4): …

Offsets shown in the figure: 1, 3, 6, 7, 9, …


Dictionary (in IR)

• list of terms, including ‘normalised’ keywords or stems, plus object descriptors (e.g. author name)

• frequency with which that term occurs in the collection

• pointer to the inverted file

• access to the dictionary is by a standard file access method (binary search, B-tree or hashing algorithm; see DB21)

Inverted file

• for each entry in the dictionary:

– a list of pointers into the data file (or object-ids, or URLs, …)

– identifying those objects indexed by the dictionary term

• inverted file may also contain:

– positional information within each document

– term frequency (or weight) within each document
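
One possible in-memory layout for the dictionary and postings is sketched below. It assumes a sorted term list searched by binary search (one of the access methods listed for the dictionary) and postings that carry word positions, from which a term frequency per document follows. The names Posting, doc_count and lookup, and all the sample data, are hypothetical.

```python
from bisect import bisect_left
from dataclasses import dataclass


@dataclass
class Posting:
    doc_id: int
    positions: list  # word offsets of the term within this document

    @property
    def term_frequency(self):
        # Occurrences of the term in this document.
        return len(self.positions)


# Dictionary: sorted terms, each with a document count and a "pointer"
# (here simply the same list index) into the postings file.
terms = ["ai", "database", "knowledge"]
doc_count = [2, 2, 1]
postings_file = [
    [Posting(2, [4]), Posting(3, [0, 7])],   # ai
    [Posting(1, [0]), Posting(2, [1, 9])],   # database
    [Posting(3, [1])],                       # knowledge
]


def lookup(term):
    """Binary-search the dictionary; return (document count, postings)."""
    i = bisect_left(terms, term)
    if i < len(terms) and terms[i] == term:
        return doc_count[i], postings_file[i]
    return 0, []


count, plist = lookup("database")
print(count, [p.doc_id for p in plist])  # 2 [1, 2]
```

The count comes back from the dictionary alone, so the number of hits for a single term can be reported without reading the postings or the data file.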

Use of inverted file

• Boolean query: (A or B) and C

– disjunctive normal form: (A and not B and C) or (not A and B and C) or (A and B and C), i.e. the (A, B, C) vectors (1, 0, 1) OR (0, 1, 1) OR (1, 1, 1)

– retrieve lists of document ids from inverted file corresponding to A, B and C

A: doc1, doc3, doc4, doc7, doc8, doc10

B: doc2, doc3, doc5, doc6, doc8, doc12

C: doc1, doc2, doc4, doc9, doc11, doc12

Use of inverted file

• Boolean query: (A or B) and C

– disjunctive normal form: (1, 0, 1) OR (0, 1, 1) OR (1, 1, 1)

– merge the three lists in document id order: at each step take the smallest remaining document id, remove it from every list that contains it, and record which of A, B and C it appeared in as an (A, B, C) vector

doc1: (1, 0, 1)
doc2: (0, 1, 1)
doc3: (1, 1, 0)
doc4: (1, 0, 1)
doc5: (0, 1, 0)
doc6: (0, 1, 0)
doc7: (1, 0, 0)
doc8: (1, 1, 0)
doc9: (0, 0, 1)
doc10: (1, 0, 0)
doc11: (0, 0, 1)
doc12: (0, 1, 1)

Use of inverted file

• Boolean query: (A or B) and C

– disjunctive normal form: (1, 0, 1) OR (0, 1, 1) OR (1, 1, 1)

– keep only the documents whose vector matches one of these:

doc1: (1, 0, 1)
doc2: (0, 1, 1)
doc4: (1, 0, 1)
doc12: (0, 1, 1)

– report number of hits to user (4). (Note: this can be done before any ‘hits’ are retrieved.)

– retrieve all objects using ‘pointers’: doc1, doc2, doc4 and doc12
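
The walk-through for (A or B) and C can be reproduced with a short Python sketch. It merges the three sorted postings lists with heapq.merge and groups equal document ids; this is one possible way to code the merge, not necessarily how a production IR system does it. The function name doc_vectors is an assumption; the postings are the lists from the example.

```python
import heapq
from itertools import groupby

postings = {
    "A": [1, 3, 4, 7, 8, 10],
    "B": [2, 3, 5, 6, 8, 12],
    "C": [1, 2, 4, 9, 11, 12],
}
terms = ["A", "B", "C"]

# Disjunctive normal form of (A or B) and C as (A, B, C) vectors.
accepted = {(1, 0, 1), (0, 1, 1), (1, 1, 1)}


def doc_vectors(postings, terms):
    """Merge the sorted postings lists, yielding (doc_id, vector) in id order."""
    tagged = [
        [(doc_id, column) for doc_id in postings[term]]
        for column, term in enumerate(terms)
    ]
    merged = heapq.merge(*tagged)
    for doc_id, group in groupby(merged, key=lambda pair: pair[0]):
        vector = [0] * len(terms)
        for _, column in group:
            vector[column] = 1
        yield doc_id, tuple(vector)


hits = [doc_id for doc_id, vec in doc_vectors(postings, terms) if vec in accepted]
print(len(hits), hits)  # 4 [1, 2, 4, 12]
```

Only the postings lists are touched; the documents themselves are not read until the user asks for them.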

Use of inverted file with weighted terms

weighted query: A 0.5, B 0.7, C 1.0

form weighted vector: (0.5, 0.7, 1.0)

retrieve lists of document ids from inverted file corresponding to A, B and C with weights

A: doc1 (.2), doc3 (.6), doc4 (.7), doc7 (.3), doc8 (.5), doc10 (.5)

B: doc2 (.6), doc3 (.8), doc5 (.9), doc6 (.3), doc8 (.5), doc12 (.2)

C: doc1 (.4), doc2 (.4), doc4 (.7), doc9 (.6), doc11 (.3), doc12 (.6)

for each document in turn, compute its similarity to the query vector:

sim((0.5, 0.7, 1.0), (0.2, 0.0, 0.4)) = 0.85

sim((0.5, 0.7, 1.0), (0.0, 0.6, 0.4)) = 0.86

doc1: 0.85
doc2: 0.86
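
The two similarity values shown (0.85 for doc1, 0.86 for doc2) agree with the cosine of the angle between the query vector and the document vector, so the sketch below assumes cosine similarity; the slides do not name the measure. The function name rank is hypothetical, and document vectors are built only from the query terms A, B and C, as in the example.

```python
import math

query = {"A": 0.5, "B": 0.7, "C": 1.0}

# Weighted postings from the example: term -> {doc_id: weight}.
postings = {
    "A": {1: 0.2, 3: 0.6, 4: 0.7, 7: 0.3, 8: 0.5, 10: 0.5},
    "B": {2: 0.6, 3: 0.8, 5: 0.9, 6: 0.3, 8: 0.5, 12: 0.2},
    "C": {1: 0.4, 2: 0.4, 4: 0.7, 9: 0.6, 11: 0.3, 12: 0.6},
}


def rank(query, postings):
    """Return [(doc_id, cosine similarity)] sorted by decreasing similarity."""
    dot = {}      # running dot product with the query, per document
    norm_sq = {}  # running squared length of the document vector
    for term, q_weight in query.items():
        for doc_id, d_weight in postings.get(term, {}).items():
            dot[doc_id] = dot.get(doc_id, 0.0) + q_weight * d_weight
            norm_sq[doc_id] = norm_sq.get(doc_id, 0.0) + d_weight ** 2
    q_norm = math.sqrt(sum(w * w for w in query.values()))
    scores = {d: dot[d] / (q_norm * math.sqrt(norm_sq[d])) for d in dot}
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


ranked = rank(query, postings)
scores = dict(ranked)
print(round(scores[1], 2), round(scores[2], 2))  # 0.85 0.86
```

As with the Boolean case, the scores are accumulated from the postings lists alone.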

Use of inverted file

• sort (rank) list according to similarity coefficient.

• retrieve first ‘N’ ranked objects.

• present ranked list to user.

• offer to retrieve next ‘N’.

• Note that so far we have not retrieved any documents; this is particularly important if the ids are URLs: we don’t need to start downloading web pages in order to rank them.

Use of inverted file

• proximity queries, e.g. Q1: “A B”, Q2: “A(3)B” (A…B)

– postings file holds positional information

– proceed as for ‘A and B’

– keep positional information in (AB) list

– filter (AB) list:

for Q1: pos(A) = pos(B) - 1

for Q2: |pos(B) - pos(A)| < 3

• now we can distinguish ‘Venetian blind’ from ‘blind Venetian’

• in principle this should help precision without affecting recall too much
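
A proximity filter over positional postings might look like the following sketch. The three sample documents and the names proximity_hits, is_phrase and within_three are assumptions for illustration; the two conditions are the ones given for Q1 and Q2.

```python
def proximity_hits(pos_a, pos_b, matches):
    """Docs containing both terms where some pair of positions satisfies `matches`."""
    hits = []
    for doc_id in sorted(pos_a.keys() & pos_b.keys()):  # as for 'A and B'
        if any(matches(a, b) for a in pos_a[doc_id] for b in pos_b[doc_id]):
            hits.append(doc_id)
    return hits


def is_phrase(a, b):
    # Q1 "A B": A immediately before B.
    return a == b - 1


def within_three(a, b):
    # Q2 "A(3)B": condition as given, |pos(B) - pos(A)| < 3.
    return abs(b - a) < 3


# Hypothetical positional postings: term -> {doc_id: [word positions]}.
# doc1 reads "Venetian blind", doc2 reads "blind Venetian",
# doc3 has the two words five positions apart.
venetian = {1: [0], 2: [1], 3: [0]}
blind = {1: [1], 2: [0], 3: [5]}

print(proximity_hits(venetian, blind, is_phrase))     # [1]
print(proximity_hits(venetian, blind, within_three))  # [1, 2]
```

The phrase test is order-sensitive, which is how ‘Venetian blind’ is told apart from ‘blind Venetian’.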

Pros and cons of inverted file

• can be used for Boolean, weighted and positional queries

• query processing can be completed without accessing data file

• number of hits for single term is available from dictionary

• expensive to update if information objects change content.

• demanding storage requirements (dictionary + inverted file approx. same size as original data)

An alternative: Text signatures

• use a hash algorithm to map a keyword onto one or more bits in a bit string: like Hashing (DB21)

• Simplest example: use one bit:

– ‘Bath’ = [66, 97, 116, 104]

mod32(66+97+116+104) = mod32(383) = 31

so represent ‘Bath’ by setting bit 31 of a 32-bit string (bits 0-31, bit 0 leftmost):

00000000000000000000000000000001

Text signatures

• Or use several bits:

– ‘Bath’ = [66, 97, 116, 104]

‘ Ba’ mod32(66+97) = 3

‘Bat’ mod32(66+97+116) = 23

‘ath’ mod32(97+116+104) = 29

‘th ’ mod32(116+104) = 28

represent ‘Bath’ by setting bits 3, 23, 28 and 29:

00010000000000000000000100001100

• This may allow wildcards, e.g. Bat?
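
Both hashing schemes can be sketched in Python as follows. The function names are hypothetical. The word is padded with a space on each side to form the three-character pieces; a space has character code 32, which contributes nothing modulo 32, so the results match the sums shown for ‘Bath’.

```python
SIGNATURE_BITS = 32


def bit_for(text):
    """Sum of character codes, modulo the signature length."""
    return sum(ord(ch) for ch in text) % SIGNATURE_BITS


def one_bit_signature(word):
    """Simplest scheme: the whole word sets a single bit."""
    return {bit_for(word)}


def trigram_signature(word):
    """Several bits: one per three-character piece of the padded word."""
    padded = f" {word} "
    return {bit_for(padded[i:i + 3]) for i in range(len(padded) - 2)}


def as_bit_string(bits):
    """Render a set of bit numbers as a string, bit 0 leftmost."""
    return "".join("1" if i in bits else "0" for i in range(SIGNATURE_BITS))


print(sorted(one_bit_signature("Bath")))         # [31]
print(sorted(trigram_signature("Bath")))         # [3, 23, 28, 29]
print(as_bit_string(trigram_signature("Bath")))  # 00010000000000000000000100001100
```

With the several-bit scheme, a query such as Bat? can be hashed on the pieces it does specify (here ‘ Ba’ and ‘Bat’) and matched against signatures that have at least those bits set.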

Document signatures

• superimpose (bitwise OR) the keyword signatures:

Bath  00000000000000000000000000000001
tub   00000000000100000000000000000000
      00000000000100000000000000000001

• if each document has 6 keywords, each setting a different bit, there would be comb(32, 6) = 906192 different document signatures.

• Document signatures can be mapped onto numbers between 1 and 906192
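
Superimposing keyword signatures is a bitwise OR, which the sketch below shows using Python integers as 32-bit strings. It assumes the one-bit hash (sum of character codes modulo 32) from the ‘Bath’ example; the function names are hypothetical.

```python
import math

SIGNATURE_BITS = 32


def keyword_bit(word):
    """One-bit hash: sum of character codes modulo 32."""
    return sum(ord(ch) for ch in word) % SIGNATURE_BITS


def document_signature(keywords):
    """Superimpose (bitwise OR) the one-bit signatures of all keywords."""
    signature = 0
    for word in keywords:
        signature |= 1 << keyword_bit(word)
    return signature


def bit_string(signature):
    """Render the signature with bit 0 leftmost."""
    return "".join(
        "1" if (signature >> i) & 1 else "0" for i in range(SIGNATURE_BITS)
    )


print(keyword_bit("Bath"), keyword_bit("tub"))        # 31 11
print(bit_string(document_signature(["Bath", "tub"])))
# 00000000000100000000000000000001

# Six distinct bits chosen out of 32:
print(math.comb(32, 6))  # 906192
```

If two keywords of a document hash to the same bit, the signature has fewer than six bits set, so the count of 906192 applies only when all six bits are distinct.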

Using signature file

• Boolean query: (A or B) and C

– superimpose signatures of A and C

– superimpose signatures of B and C

– for each signature, S, in the file: if either all bits of A&C are set in S or all bits of B&C are set in S, retrieve the document with signature S.

– check document to see if it is a ‘hit’

– bit comparisons are very fast compared to string comparisons.
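
The steps above can be sketched as follows, again assuming the one-bit hash (sum of character codes modulo 32). The four sample documents are invented; ‘pat’ is included because it is an anagram of ‘tap’ and therefore hashes to the same bit, which produces a false hit that the final check removes.

```python
def signature(words):
    """Superimposed one-bit signatures of the given words, as an integer."""
    sig = 0
    for word in words:
        sig |= 1 << (sum(ord(ch) for ch in word) % 32)
    return sig


def search(documents, a, b, c):
    """Evaluate (a or b) and c: signature scan first, then check each candidate."""
    sig_ac = signature([a, c])
    sig_bc = signature([b, c])
    signature_file = {doc_id: signature(words) for doc_id, words in documents.items()}
    # Bit test: all bits of A&C, or all bits of B&C, are set in S.
    candidates = [
        doc_id
        for doc_id, s in signature_file.items()
        if (s & sig_ac) == sig_ac or (s & sig_bc) == sig_bc
    ]
    # Check the retrieved documents themselves to discard false hits.
    hits = [
        doc_id
        for doc_id in candidates
        if c in documents[doc_id] and (a in documents[doc_id] or b in documents[doc_id])
    ]
    return candidates, hits


# Hypothetical documents, each represented by its keywords.
documents = {
    1: ["Bath", "tap"],
    2: ["tub", "tap"],
    3: ["tub", "pat"],   # 'pat' collides with 'tap': a false hit
    4: ["Bath", "tub"],
}

candidates, hits = search(documents, "Bath", "tub", "tap")
print(candidates)  # [1, 2, 3]
print(hits)        # [1, 2]
```

Unlike the inverted file, the final check has to read the candidate documents, so the number of true hits is not known from the signature file alone.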

Pros and cons of signature files

• Needs less space than an inverted file

• easier to update as documents change

• fast for queries with many keywords

• probabilistic - will return false hits

• cannot filter on positional information

• cannot hold keyword weights (or other weights)

These last three points imply that further processing is required to filter retrieved documents.


Summary of key points

• standard relational databases do not provide suitable indexing for handling index terms.

• standard SQL is not good at expressing ‘search-engine’ type queries

• inverted file structures are purpose made for these types of system

• storing frequencies/weights in the dictionary and inverted file allows for vector model queries

• storing positional information allows proximity queries, “Knowledge Management” v “MK”

• Signature files give faster matches but with limitations

Questions to think about

• Explain why the relational model is not good for IR.

• How is it that, using an inverted file, the number of hits can be reported without retrieving anything from the data file?

• Could this be achieved using signature files?

• What are proximity queries, and how can inverted file technology be used to deal with them?

• How can signature files be used for proximity queries?