www.monash.edu.au
CSE3201/CSE4500 Information Retrieval Systems
Signature Based Text Retrieval Systems
Signature File for Text Retrieval
• A “signature” is created as an abstraction of a document.
• All the signatures that represent the documents in the collection are kept in a file called a “signature file”.
Word Signature (WS)
• A word signature
– is a fixed-length bit string that represents a word.
– is described by
> the length (N)
> the number of bits set to 1 (k)
Example (N = 24, k = 7):
1 1 1 0 0 0 1 0 0 1 0 0 0 0 0 0 0 1 0 0 1 0 0 0
Word Signature Generation
• Use a hash function to find the position of the bit(s) that will be set to 1.
• Use triplets of characters to generate the word signature:
– Divide the word into overlapping triplets.
– For each triplet of characters:
> Convert the characters to a numeric value (for example, the ASCII representation of each character).
> Use the number as the input to the hash function.
> The hash function produces a number that represents the bit position of the triplet in the word signature.
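The triplet step can be sketched in Python as follows. This is a minimal illustration, not the course's reference code: the function name `overlapping_triplets` is hypothetical, and padding the word with `-` on both sides is an assumption that mirrors the `-si ... re-` triplets shown for the word “signature” in these slides.

```python
def overlapping_triplets(word: str, pad: str = "-") -> list[str]:
    """Return the overlapping character triplets of a word.

    The word is padded on both sides with `pad` (an assumed convention)
    so that its first and last characters also take part in a triplet.
    """
    padded = f"{pad}{word}{pad}"
    return [padded[i:i + 3] for i in range(len(padded) - 2)]


print(overlapping_triplets("signature"))
# ['-si', 'sig', 'ign', 'gna', 'nat', 'atu', 'tur', 'ure', 're-']
```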
Signature Generator Algorithm
Set hash_value to 0
for each character in the triplet do
    hash_value := (hash_value * 137 + ASCII value of character) mod 256
K values
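A runnable version of this recurrence might look like the sketch below. The recurrence itself (multiplier 137, modulus 256) is taken from the slide; everything else is an assumption made for illustration: the function names, the `-` padding character, and in particular the step that folds the 0..255 hash value into a bit position with `% n_bits`, which the slide does not specify.

```python
def triplet_hash(triplet: str) -> int:
    """Apply the slide's recurrence to the characters of one triplet."""
    hash_value = 0
    for ch in triplet:
        hash_value = (hash_value * 137 + ord(ch)) % 256
    return hash_value


def word_signature(word: str, n_bits: int = 24) -> list[int]:
    """Build a word signature by setting one bit per overlapping triplet."""
    padded = f"-{word}-"  # ASSUMPTION: pad the word with '-' on both sides
    bits = [0] * n_bits
    for i in range(len(padded) - 2):
        # ASSUMPTION: map the 0..255 hash onto a bit position with mod n_bits.
        position = triplet_hash(padded[i:i + 3]) % n_bits
        bits[position] = 1
    return bits


signature = word_signature("signature")
print("".join(map(str, signature)), "k =", sum(signature))
```

Because different triplets can hash to the same position, the number of bits set (k) can be smaller than the number of triplets.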
Word Signature Generation – simplified example
• Example:
– A signature 111000111001 is generated for the word “signature”.
• The position is read from left to right.
Word: signature
Triplets:                                            -si  sig  ign  gna  nat  atu  tur  ure  re-
Hash function output (position of the bit set to 1):   1    2    7    3    2    3    9   12    8
Word signature: 1 1 1 0 0 0 1 1 1 0 0 1
Document Signature (DS)
• A document signature can be created using two methods:
– concatenation of word signatures.
– superimposed coding.
Document Signature – Concatenation of WS
• The length of a document signature (DS) can vary.
• A fixed number of bits may precede the document signature (DS) to indicate its length.
• It is also possible to fix the length of the document signature (DS).
– The length can be set to the signature length of the longest document in the collection.
– Extra “0” bits are padded onto the signatures of the shorter documents.
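The fixed-length variant can be sketched as below. The helper name `concatenate_signatures` is hypothetical, and padding on the right is an assumption (the slide says only that “0” bits are padded). The two word signatures are the ones used for “free” and “text” elsewhere in these slides.

```python
from typing import Optional


def concatenate_signatures(word_signatures: list[str],
                           fixed_length: Optional[int] = None) -> str:
    """Concatenate word signatures (bit strings) into a document signature.

    If `fixed_length` is given, the result is padded on the right with "0"
    bits (an assumed convention); an over-long signature is rejected.
    """
    document_signature = "".join(word_signatures)
    if fixed_length is None:
        return document_signature
    if len(document_signature) > fixed_length:
        raise ValueError("document signature exceeds the fixed length")
    return document_signature.ljust(fixed_length, "0")


docs = {
    "d1": ["001000110010", "000010101001"],  # two words
    "d2": ["001000110010"],                  # one word
}
# Fix the length to that of the longest document signature in the collection.
longest = max(len("".join(ws)) for ws in docs.values())
for name, ws in docs.items():
    print(name, concatenate_signatures(ws, fixed_length=longest))
```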
Document Signature – Superimposed Coding
• Each document is divided into blocks containing a constant number of distinct words.
• To create a block signature, perform a bitwise OR on the signatures of all the words in the block.
free              001 000 110 010
text              000 010 101 001
Block signature   001 010 111 011
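The OR step can be reproduced with integer bit operations. This is a small sketch with a hypothetical helper name, `block_signature`, that assumes all word signatures have the same length.

```python
from functools import reduce
from operator import or_


def block_signature(word_signatures: list[str]) -> str:
    """Superimpose equal-length word signatures with a bitwise OR."""
    width = len(word_signatures[0])
    combined = reduce(or_, (int(sig, 2) for sig in word_signatures))
    return format(combined, f"0{width}b")  # keep leading zeros


free = "001000110010"
text = "000010101001"
print(block_signature([free, text]))  # 001010111011
```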
Document Signature – Superimposed Coding
• To create the document signature, all the block signatures are superimposed.
Query Signature
• The query is converted to a block signature in the same way as a document.
• Example:
free          0 0 1 0 0 0 1 1 0 0 1 0
text          0 0 0 0 1 0 1 0 1 0 0 1
Block/Query   0 0 1 0 1 0 1 1 1 0 1 1
Matching the Query and Document Signature
• Premise:
– The positions of the bits set to 1 represent the existence of particular words in the query or document.
• A relevant document is a document whose signature has bits set to 1 at every position where the query’s signature has a bit set to 1.
• The relevant document’s signature does not have to be an exact match of the query’s signature.
• Example:
– Query: 0100
– Matching document signatures: 1111, 0111, 0110, 0100.
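One way to make the rule concrete is to list the query bits that a document signature fails to cover; a document matches when that list is empty. The function names are hypothetical, and the signature `1011` is an extra non-matching case added here for contrast (it is not on the slide).

```python
def missing_bits(query: str, document: str) -> list[int]:
    """Return the 1-based positions set in the query but not in the document."""
    return [
        i + 1
        for i, (q, d) in enumerate(zip(query, document))
        if q == "1" and d == "0"
    ]


def is_match(query: str, document: str) -> bool:
    """A document matches when it covers every 1 bit of the query."""
    return not missing_bits(query, document)


query = "0100"
for document in ["1111", "0111", "0110", "0100", "1011"]:
    print(document, is_match(query, document), missing_bits(query, document))
```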
Query on Signature File
Query: 001 010 111 011

Block signature            Match?
0 0 1 0 0 0 1 1 1 0 1 1    No
0 0 1 1 1 1 1 1 1 0 1 1    Yes
0 0 1 0 1 0 1 0 1 0 1 1    No
0 0 1 0 1 0 1 1 1 0 1 0    No
1 1 1 0 1 0 1 1 1 0 1 1    Yes
0 0 1 1 0 0 1 1 1 0 1 1    No
0 0 1 0 1 0 1 1 1 1 1 1    Yes

Match? Perform an AND operation between the query and the block signature; if (result – query) = 0, that is, the result equals the query, they match.
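The AND rule can be checked against the seven signatures listed on this slide with a short sequential scan. This is a sketch with hypothetical names (`scan`, `SIGNATURE_FILE`); the bit strings are copied from the slide.

```python
QUERY = "001010111011"
SIGNATURE_FILE = [
    "001000111011",
    "001111111011",
    "001010101011",
    "001010111010",
    "111010111011",
    "001100111011",
    "001010111111",
]


def scan(query: str, signatures: list[str]) -> list[bool]:
    """Compare the query with every signature: match if (sig AND query) == query."""
    q = int(query, 2)
    return [(q & int(sig, 2)) == q for sig in signatures]


results = scan(QUERY, SIGNATURE_FILE)
for sig, matched in zip(SIGNATURE_FILE, results):
    print(sig, "Yes" if matched else "No")
```

Only the second, fifth and seventh signatures have a 1 at every position where the query has a 1.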
Signature File Structure
• Sequential
– During searching, each signature is compared with the query signature.
– Time-consuming because:
> Memory size is limited, so all signatures cannot be loaded into memory at once.
> This may result in many I/O operations.
• We need a file structure for the signature file that minimises the number of I/O operations.
• Bit-Sliced Signature
– At most, only N records need to be retrieved (N is the size of the signature).
Matrix Transposed
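$$
\begin{bmatrix} x_{11} & x_{12} & x_{13} \\ x_{21} & x_{22} & x_{23} \end{bmatrix}^{T}
=
\begin{bmatrix} x_{11} & x_{21} \\ x_{12} & x_{22} \\ x_{13} & x_{23} \end{bmatrix},
\qquad x_{ij} \rightarrow x_{ji}
$$

$$
\begin{bmatrix} a & b & c \\ d & e & f \end{bmatrix}^{T}
=
\begin{bmatrix} a & d \\ b & e \\ c & f \end{bmatrix}
$$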
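In code, the same transposition (element x_ij moves to x_ji) can be written with `zip`. A minimal sketch; the function name `transpose` is just a label for this example.

```python
def transpose(matrix: list[list[str]]) -> list[list[str]]:
    """Turn rows into columns: element [i][j] moves to [j][i]."""
    return [list(column) for column in zip(*matrix)]


print(transpose([["a", "b", "c"], ["d", "e", "f"]]))
# [['a', 'd'], ['b', 'e'], ['c', 'f']]
```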
Bit-Sliced
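Query: 001 010 111 011

Sequential (one record per document, N bits per record):
d1    0 0 1 0 0 0 1 1 1 0 1 1
d2    0 0 1 1 1 1 1 1 1 0 1 1
d3    0 0 1 0 1 0 1 0 1 0 1 1
d4    0 0 1 0 1 0 1 1 1 0 1 0
...
dn

Bit-sliced (one record per bit position, N records):
          d1 d2 d3 d4 ... dn
bit 1     0  0  0  0
bit 2     0  0  0  0
bit 3     1  1  1  1
bit 4     0  1  0  0
bit 5     0  1  1  1
bit 6     0  1  0  0
bit 7     1  1  1  1
bit 8     1  1  0  1
bit 9     1  1  1  1
bit 10    0  0  0  0
bit 11    1  1  1  1
bit 12    1  1  1  0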
Bit Sliced Signature File
• Retrieval
– If the ith bit in the query signature is set to 1, retrieve the ith record (bit slice) of the signature file.
– If n bits are set to 1 in the query, only n records need to be retrieved.
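The retrieval rule can be sketched as follows, using the four document signatures d1 to d4 from the bit-sliced example in these slides. The function names are hypothetical, and holding each slice as a bit string in memory is a simplification: in a real signature file each slice would be a record read from disk.

```python
def bit_slices(signatures: list[str]) -> list[str]:
    """Transpose document signatures: slice i holds bit i of every document."""
    return ["".join(column) for column in zip(*signatures)]


def bit_sliced_query(query: str, slices: list[str]) -> list[int]:
    """Return the 1-based numbers of documents whose signatures cover the query."""
    n_docs = len(slices[0])
    candidates = (1 << n_docs) - 1      # start with every document as a candidate
    for bit, record in zip(query, slices):
        if bit == "1":                  # only these records are retrieved
            candidates &= int(record, 2)
    # The leftmost bit of `candidates` corresponds to document 1.
    return [d + 1 for d in range(n_docs)
            if (candidates >> (n_docs - 1 - d)) & 1]


signatures = [
    "001000111011",  # d1
    "001111111011",  # d2
    "001010101011",  # d3
    "001010111010",  # d4
]
slices = bit_slices(signatures)
print(slices[3], slices[4])                       # 0100 0111
print(bit_sliced_query("001010111011", slices))   # [2]
```

The query has seven bits set to 1, so only seven of the twelve records are read, regardless of how many documents the file holds.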
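Bit Sliced Signature File
Query: 001 010 111 011

Bit-sliced signature file:
          d1 d2 d3 d4
bit 1     0  0  0  0
bit 2     0  0  0  0
bit 3     1  1  1  1
bit 4     0  1  0  0
bit 5     0  1  1  1
bit 6     0  1  0  0
bit 7     1  1  1  1
bit 8     1  1  0  1
bit 9     1  1  1  1
bit 10    0  0  0  0
bit 11    1  1  1  1
bit 12    1  1  1  0

Retrieved records (bits set to 1 in the query):
          d1 d2 d3 d4
bit 3     1  1  1  1
bit 5     0  1  1  1
bit 7     1  1  1  1
bit 8     1  1  0  1
bit 9     1  1  1  1
bit 11    1  1  1  1
bit 12    1  1  1  0

Match: d2 (the 2nd block), because all bits in its column of the retrieved records are set to 1.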
Bit Sliced Signature File
• Advantages:
– Fewer records are retrieved -> faster retrieval.
• Disadvantages:
– An update operation becomes very costly, because adding one document’s signature means writing one bit into each of the N records.
False Drop
• A false drop occurs when a document’s signature matches a query’s signature but the query’s word does not match any word in the document.
• This is possible because 2 distinct blocks may have the same signature due to:
– the hashing algorithm
– superimposed coding
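A small sketch of a false drop caused by superimposed coding is shown below. The signatures for “free” and “text” are the ones used in these slides; the signature `001010000001` for the absent word is hypothetical, chosen so that all of its 1 bits happen to be covered by the block signature.

```python
free = int("001000110010", 2)
text = int("000010101001", 2)
block = free | text  # 001010111011: the block contains only "free" and "text"

# HYPOTHETICAL signature of a query word that is NOT in the block.
absent_word = int("001010000001", 2)

signature_matches = (absent_word & block) == absent_word
word_is_in_block = absent_word in (free, text)

print(signature_matches, word_is_in_block)  # True False -> a false drop
```

Because a signature match is only a necessary condition, matching blocks still have to be checked against the actual text to filter out such cases.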
Relation Between the Signature Properties and False Drop
• The rate of false drops depends on:
– The size of the signature (N bits)
> An increase in N decreases the false drop rate.
– The number of bits set to 1 (k bits)
> An increase in k, up to a certain level, decreases the false drop rate.
– The number of unique words per block
> A decrease in the number of unique words per block decreases the false drop rate.
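These three trends can be illustrated with a rough probabilistic model. The model is not taken from the slides and is only an approximation: it assumes each word sets k bit positions chosen independently and uniformly at random, so a given bit of a block signature is 1 with probability 1 - (1 - 1/N)^(kD) for D unique words per block, and a single-word query is a false drop when all k of its bits happen to be set. The function name and the parameter values are illustrative.

```python
def false_drop_probability(n_bits: int, k: int, words_per_block: int) -> float:
    """Approximate false drop probability for a single-word query.

    ASSUMPTION: every word sets k independent, uniformly random bit positions.
    """
    bit_is_set = 1.0 - (1.0 - 1.0 / n_bits) ** (k * words_per_block)
    return bit_is_set ** k


# Vary k with N = 256 bits and 20 unique words per block.
for k in (1, 2, 4, 8, 16, 32):
    print(f"k={k:2d}  {false_drop_probability(256, k, 20):.6f}")

# Larger N, or fewer unique words per block, with k fixed at 8.
print(false_drop_probability(512, 8, 20))
print(false_drop_probability(256, 8, 10))
```

Under this model the probability falls as k grows from 1 to 8 and rises again by k = 16, which is consistent with the slide's remark that increasing k helps only up to a certain level.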