indices
DESCRIPTION
Indices. Tomasz Bartoszewski. Inverted Index. Search Construction Compression. Inverted Index. In its simplest form, the inverted index of a document collection is basically a data structure that attaches each distinctive term with a list of all documents that contains the term. - PowerPoint PPT PresentationTRANSCRIPT
![Page 1: Indices](https://reader035.vdocuments.us/reader035/viewer/2022062811/56816128550346895dd080c7/html5/thumbnails/1.jpg)
IndicesTomasz Bartoszewski
![Page 2: Indices](https://reader035.vdocuments.us/reader035/viewer/2022062811/56816128550346895dd080c7/html5/thumbnails/2.jpg)
Inverted Index• Search• Construction• Compression
![Page 3: Indices](https://reader035.vdocuments.us/reader035/viewer/2022062811/56816128550346895dd080c7/html5/thumbnails/3.jpg)
Inverted Index• In its simplest form, the inverted index of a document
collection is basically a data structure that attaches each distinctive term with a list of all documents that contains the term.
![Page 4: Indices](https://reader035.vdocuments.us/reader035/viewer/2022062811/56816128550346895dd080c7/html5/thumbnails/4.jpg)
![Page 5: Indices](https://reader035.vdocuments.us/reader035/viewer/2022062811/56816128550346895dd080c7/html5/thumbnails/5.jpg)
Search Using an Inverted Index
![Page 6: Indices](https://reader035.vdocuments.us/reader035/viewer/2022062811/56816128550346895dd080c7/html5/thumbnails/6.jpg)
Step 1 – vocabulary searchfinds each query term in the vocabularyIf (Single term in query){ goto step3;}Else{ goto step2;}
![Page 7: Indices](https://reader035.vdocuments.us/reader035/viewer/2022062811/56816128550346895dd080c7/html5/thumbnails/7.jpg)
Step 2 – results merging• merging of the lists is performed to find their intersection• use the shortest list as the base• partial match is possible
![Page 8: Indices](https://reader035.vdocuments.us/reader035/viewer/2022062811/56816128550346895dd080c7/html5/thumbnails/8.jpg)
Step 3 – rank score computation• based on a relevance function (e.g. okapi, cosine)• score used in the final ranking
![Page 9: Indices](https://reader035.vdocuments.us/reader035/viewer/2022062811/56816128550346895dd080c7/html5/thumbnails/9.jpg)
Example
![Page 10: Indices](https://reader035.vdocuments.us/reader035/viewer/2022062811/56816128550346895dd080c7/html5/thumbnails/10.jpg)
Index Construction
![Page 11: Indices](https://reader035.vdocuments.us/reader035/viewer/2022062811/56816128550346895dd080c7/html5/thumbnails/11.jpg)
Time complexity
• O(T), where T is the number of all terms (including duplicates) in the document collection (after pre-processing)
![Page 12: Indices](https://reader035.vdocuments.us/reader035/viewer/2022062811/56816128550346895dd080c7/html5/thumbnails/12.jpg)
Index Compression
![Page 13: Indices](https://reader035.vdocuments.us/reader035/viewer/2022062811/56816128550346895dd080c7/html5/thumbnails/13.jpg)
Why?• avoid disk I/O• the size of an inverted index can be reduced dramatically• the original index can also be reconstructed• all the information is represented with positive integers ->
integer compression
![Page 14: Indices](https://reader035.vdocuments.us/reader035/viewer/2022062811/56816128550346895dd080c7/html5/thumbnails/14.jpg)
Use gaps• 4, 10, 300, and 305 -> 4, 6, 290 and 5• Smaller numbers• Large for rare terms – not a big problem
![Page 15: Indices](https://reader035.vdocuments.us/reader035/viewer/2022062811/56816128550346895dd080c7/html5/thumbnails/15.jpg)
All in one
![Page 16: Indices](https://reader035.vdocuments.us/reader035/viewer/2022062811/56816128550346895dd080c7/html5/thumbnails/16.jpg)
Unary• For x:X-1 bits of 0 and one of 1e.g.5 -> 000017 -> 0000001
![Page 17: Indices](https://reader035.vdocuments.us/reader035/viewer/2022062811/56816128550346895dd080c7/html5/thumbnails/17.jpg)
Elias Gamma Coding• in unary (i.e., 0-bits followed by a 1-bit)• followed by the binary representation of x without its most
significant bit.• efficient for small integers but is not suited to large integers• is simply the number of bits of x in binary• 9 -> 000 1001
![Page 18: Indices](https://reader035.vdocuments.us/reader035/viewer/2022062811/56816128550346895dd080c7/html5/thumbnails/18.jpg)
Elias Delta Coding• For small int longer than gamma codes (better for larger)• gamma code representation of • followed by the binary representation of x less the most
significant bit• Dla 9: -> 001009 -> 00100 001
![Page 19: Indices](https://reader035.vdocuments.us/reader035/viewer/2022062811/56816128550346895dd080c7/html5/thumbnails/19.jpg)
Golomb Coding• values relative to a constant b• several variations of the original Golomb• E.g.
Remainder (b possible reminders e.g. b=3: 0,1,2)binary representation of a remainder requires or write the first few remainders using r
![Page 20: Indices](https://reader035.vdocuments.us/reader035/viewer/2022062811/56816128550346895dd080c7/html5/thumbnails/20.jpg)
Example• b=3 and x=9
• => ()
• Result 00010
![Page 21: Indices](https://reader035.vdocuments.us/reader035/viewer/2022062811/56816128550346895dd080c7/html5/thumbnails/21.jpg)
The coding tree for b=5
![Page 22: Indices](https://reader035.vdocuments.us/reader035/viewer/2022062811/56816128550346895dd080c7/html5/thumbnails/22.jpg)
Selection of b
• N – total number of documents• – number of documents that contain term t
![Page 23: Indices](https://reader035.vdocuments.us/reader035/viewer/2022062811/56816128550346895dd080c7/html5/thumbnails/23.jpg)
Variable-Byte Coding• seven bits in each byte are used to code an integer• last bit 0 – end, 1 – continue• E.g. 135 -> 00000011 00001110
![Page 24: Indices](https://reader035.vdocuments.us/reader035/viewer/2022062811/56816128550346895dd080c7/html5/thumbnails/24.jpg)
Summary• Golomb coding better than Elias• Gamma coding does not work well• Variable-byte integers are often faster than Variable-bit
(higher storage costs)• compression technique can allow retrieval to be up to twice as
fast than without compression• space requirement averages 20% – 25% of the cost of storing
uncompressed integers
![Page 25: Indices](https://reader035.vdocuments.us/reader035/viewer/2022062811/56816128550346895dd080c7/html5/thumbnails/25.jpg)
Latent Semantic Indexing
![Page 26: Indices](https://reader035.vdocuments.us/reader035/viewer/2022062811/56816128550346895dd080c7/html5/thumbnails/26.jpg)
Reason• many concepts or objects can be described in multiple ways• find using synonyms of the words in the user query• deal with this problem through the identification of statistical
associations of terms
![Page 27: Indices](https://reader035.vdocuments.us/reader035/viewer/2022062811/56816128550346895dd080c7/html5/thumbnails/27.jpg)
Singular value decomposition (SVD)
• estimate latent structure, and to remove the “noise”• hidden “concept” space, which associates syntactically
different but semantically similar terms and documents
![Page 28: Indices](https://reader035.vdocuments.us/reader035/viewer/2022062811/56816128550346895dd080c7/html5/thumbnails/28.jpg)
LSI• LSI starts with an m*n termdocument matrix A• row = term; column = document• value e.g. term frequency
![Page 29: Indices](https://reader035.vdocuments.us/reader035/viewer/2022062811/56816128550346895dd080c7/html5/thumbnails/29.jpg)
Singular Value Decomposition• factor matrix A into three matrices:
m is the number of row in An is the number of columns in Ar is the rank of A,
![Page 30: Indices](https://reader035.vdocuments.us/reader035/viewer/2022062811/56816128550346895dd080c7/html5/thumbnails/30.jpg)
Singular Value Decomposition• U is a matrix and its columns, called left singular vectors, are
eigenvectors associated with the r non-zero eigenvalues of • V is an matrix and its columns, called right singular vectors,
are eigenvectors associated with the r non-zero eigenvalues of
• E is a diagonal matrix, E = diag(, , …, ), . , , …, , called singular values, are the non-negative square roots of r non-zero eigenvalues of they are arranged in decreasing order, i.e.,
• reduce the size of the matrices
![Page 31: Indices](https://reader035.vdocuments.us/reader035/viewer/2022062811/56816128550346895dd080c7/html5/thumbnails/31.jpg)
𝐴𝑘=𝑈𝑘𝐸𝑘𝑉 𝑘𝑇
![Page 32: Indices](https://reader035.vdocuments.us/reader035/viewer/2022062811/56816128550346895dd080c7/html5/thumbnails/32.jpg)
Query and Retrieval• q - user query (treated as a new document)• document in the k-concept space, denoted by
![Page 33: Indices](https://reader035.vdocuments.us/reader035/viewer/2022062811/56816128550346895dd080c7/html5/thumbnails/33.jpg)
Example
![Page 34: Indices](https://reader035.vdocuments.us/reader035/viewer/2022062811/56816128550346895dd080c7/html5/thumbnails/34.jpg)
Example
![Page 35: Indices](https://reader035.vdocuments.us/reader035/viewer/2022062811/56816128550346895dd080c7/html5/thumbnails/35.jpg)
Example
![Page 36: Indices](https://reader035.vdocuments.us/reader035/viewer/2022062811/56816128550346895dd080c7/html5/thumbnails/36.jpg)
Example
![Page 37: Indices](https://reader035.vdocuments.us/reader035/viewer/2022062811/56816128550346895dd080c7/html5/thumbnails/37.jpg)
Example
![Page 38: Indices](https://reader035.vdocuments.us/reader035/viewer/2022062811/56816128550346895dd080c7/html5/thumbnails/38.jpg)
Example
![Page 39: Indices](https://reader035.vdocuments.us/reader035/viewer/2022062811/56816128550346895dd080c7/html5/thumbnails/39.jpg)
Exampleq - “user interface”
![Page 40: Indices](https://reader035.vdocuments.us/reader035/viewer/2022062811/56816128550346895dd080c7/html5/thumbnails/40.jpg)
Example
![Page 41: Indices](https://reader035.vdocuments.us/reader035/viewer/2022062811/56816128550346895dd080c7/html5/thumbnails/41.jpg)
Summary• The original paper of LSI suggests 50–350 dimensions.• k needs to be determined based on the specific document
collection• association rules may be able to approximate the results of
LSI