Search Engines 1 TDT4125 Algoritmekonstruksjon, Spring 2011
Øystein Torbjørnsen Microsoft® Development Center Norway
Outline
• Inverted index
• Constructing inverted indexes
• Compression
• Succinct index (Holger Bast)
• Hierarchical inverted indexes
• Skip lists
Inverted index
[Figure: the dictionary (terms a, cal, dark, darker, drill, excellent, ..., zebra) points into the posting file. Each posting list entry holds docid, frequency, and position list.]
Inverted index
• Posting list is sorted on docid
• Usually 2 disk IOs to look up one term, O(1)
– One to read the dictionary entry
– One to read the posting list (possibly large)
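A minimal in-memory sketch of the structure above, assuming whitespace tokenization and a Python dict standing in for the on-disk dictionary. The name `build_index` and the sample documents are illustrative, not from the lecture.

```python
from collections import defaultdict

def build_index(docs):
    """Map each term to a posting list of (docid, frequency, position list)."""
    index = defaultdict(list)
    for docid in sorted(docs):                  # ascending docid keeps posting lists sorted
        positions = defaultdict(list)
        for pos, term in enumerate(docs[docid].lower().split()):
            positions[term].append(pos)
        for term, plist in positions.items():
            index[term].append((docid, len(plist), plist))
    return dict(index)

# Hypothetical sample documents.
docs = {1: "dark zebra", 2: "a darker zebra", 3: "zebra drill zebra"}
index = build_index(docs)

# A term lookup is one dictionary access followed by one posting list read.
postings = index.get("zebra", [])
```

On disk, the two steps of the lookup correspond to the two I/Os listed above.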
Searching
• Phrase search: "jens stoltenberg"
• Proximity search: jens w/5 prime
• Wildcard search
– Prefix search: stolt* je*s
– Postfix search: *berg
– Full wildcard search: *olten* *ol*be*
Construction
• Create sorted subfiles
• Merge the subfiles into one large file
• Needs twice as much disk storage as the final index
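The two construction steps can be sketched as follows. In-memory lists stand in for the sorted sub-files, and `make_run` and `merge_runs` are hypothetical helper names. A real implementation would stream the runs from disk.

```python
import heapq
from itertools import groupby

def make_run(docs):
    """Build one sorted run of (term, docid) pairs from a batch of documents."""
    return sorted({(term, docid) for docid, text in docs.items() for term in text.split()})

def merge_runs(runs):
    """Merge sorted runs into a mapping of term -> sorted docid list."""
    merged = heapq.merge(*runs)                 # k-way merge, reads each run sequentially
    return {term: [docid for _, docid in group]
            for term, group in groupby(merged, key=lambda pair: pair[0])}

# Two hypothetical batches, each producing one sorted sub-file.
runs = [make_run({1: "a b", 2: "b c"}), make_run({3: "a c"})]
final_index = merge_runs(runs)
```

Because the runs and the merged output exist at the same time, the sketch also shows where the extra storage goes.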
Compression
• Basic idea:
– Use knowledge of value distribution to compress data
• Costly to compress and decompress, but
– Less disk IO
– More data fits in main memory
– Better locality in memory
• Many different schemes:
– Delta coding
– vByte
– PFOR-DELTA
– Huffman, Golomb, Rice, Simple9, Simple16
Delta coding
• Works on sorted lists
• Encoded as difference from previous entry
• To be combined with other compression
Original: 17 31 62 88 89 97 113 187 199
Deltas:   17 14 31 26 1 8 16 74 12
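A small sketch of delta coding on the docid list from the example above. The function names are illustrative, and the first value is stored as its difference from 0.

```python
from itertools import accumulate

def delta_encode(sorted_values):
    """Replace each value by its difference from the previous one."""
    prev = 0
    deltas = []
    for value in sorted_values:
        deltas.append(value - prev)
        prev = value
    return deltas

def delta_decode(deltas):
    """Running sum restores the original sorted values."""
    return list(accumulate(deltas))

docids = [17, 31, 62, 88, 89, 97, 113, 187, 199]
deltas = delta_encode(docids)   # [17, 14, 31, 26, 1, 8, 16, 74, 12]
```

The deltas are smaller numbers than the docids, which is what makes a following scheme such as vByte effective.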
vByte
• Variable-byte encoding
• Using full bytes
• 1 marker bit + 7 value bits
• Fast encoding and decoding
Byte layout: end marker (1 bit) + value (7 bits)
0 1001100 = 76 * 128 * 128 = 1245184
0 0111001 = 57 * 128 = 7296
1 1101010 = 106 = 106
Sum = 1252586
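A sketch of vByte that follows the byte layout in the example: the most significant 7-bit group comes first, and the marker bit is set only on the last byte of each value. Other vByte variants order the groups or use the marker differently, so treat this layout as an assumption taken from the slide.

```python
def vbyte_encode(n):
    """Encode a non-negative integer as 7-bit groups, most significant first."""
    groups = [n & 0x7F]
    n >>= 7
    while n:
        groups.append(n & 0x7F)
        n >>= 7
    groups.reverse()
    groups[-1] |= 0x80          # end marker on the final byte
    return bytes(groups)

def vbyte_decode(data):
    """Decode a concatenation of vByte-encoded integers."""
    values, current = [], 0
    for byte in data:
        current = (current << 7) | (byte & 0x7F)
        if byte & 0x80:         # end marker: the value is complete
            values.append(current)
            current = 0
    return values

encoded = vbyte_encode(1252586)  # bytes 01001100, 00111001, 11101010
```

Values below 128 take a single byte, so delta-coded posting lists with small gaps compress well.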
PFOR-DELTA
• Combination of three techniques
– P = Patched (outliers are patched in from an exception list)
– FOR = Frame Of Reference
– DELTA = delta coding
• Blocks of e.g. 128 values
• Fixed number of bits per value
• Exception list for outliers
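A much simplified sketch of the idea, not the layout from the Zukowski et al. paper: deltas are taken, a bit width is chosen so that most values fit, and the rest go to an exception list. Slots are kept as plain Python integers rather than bit-packed, the `coverage` parameter is an assumption, and the input block must be non-empty.

```python
def pfor_delta_encode(docids, coverage=0.9):
    """Return (bits, slots, exceptions) for one non-empty block of sorted docids."""
    deltas = [b - a for a, b in zip([0] + docids[:-1], docids)]
    ranked = sorted(deltas)
    # Pick the smallest width that covers roughly `coverage` of the deltas.
    cutoff = ranked[max(0, int(len(ranked) * coverage) - 1)]
    bits = max(1, cutoff.bit_length())
    limit = 1 << bits
    slots = [d if d < limit else 0 for d in deltas]          # fixed-width part
    exceptions = [(i, d) for i, d in enumerate(deltas) if d >= limit]
    return bits, slots, exceptions

def pfor_delta_decode(bits, slots, exceptions):
    """Patch the exceptions back in, then undo the delta coding."""
    deltas = list(slots)
    for i, d in exceptions:
        deltas[i] = d
    out, total = [], 0
    for d in deltas:
        total += d
        out.append(total)
    return out

block = [17, 31, 62, 88, 89, 97, 113, 187, 199]
bits, slots, exceptions = pfor_delta_encode(block)
```

For this block the width comes out as 5 bits, and the single large delta 74 is stored as an exception.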
Succinct index
• Variation of inverted index
• Index ranges of words
• Prefix and range search
• Smaller dictionary
• Longer lists to process
• Better compression
• Fewer disk I/Os
– Disk positioning vs. transfer times
Hierarchical inverted indexes
• Incremental indexing
• Build vs lookup time
Never merge
• Just keep sub-files and never merge into large file
• Construction is O(n)
• Fastest possible construction time
• Slow lookup with many files, O(n): every sub-file must be checked
Hierarchy
[Figure: hierarchy of index files with n=3, arranged in Level 1, Level 2, and Level 3.]
Merging strategy
• Merge into same level
• Merge to level above
[Figure: the two strategies illustrated with m=2 and n=3.]
Issues
• Needs twice the space
• Merge of upper layer takes a long time
• Larger initial files lead to fewer merges
• Lookup time varies over time depending on the number of files at each level
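A sketch of the "merge to level above" strategy under stated assumptions: once a level holds `fanout` files, they are merged into one file on the next level. The class name `HierarchicalIndex`, its methods, and the in-memory lists standing in for files are all hypothetical.

```python
import heapq

class HierarchicalIndex:
    def __init__(self, fanout=3):
        self.fanout = fanout
        self.levels = [[]]          # levels[i] holds the sorted runs at level i

    def add_run(self, postings):
        """Add a new sub-file of (term, docid) pairs and merge upward as needed."""
        self.levels[0].append(sorted(postings))
        level = 0
        while len(self.levels[level]) == self.fanout:
            merged = list(heapq.merge(*self.levels[level]))
            self.levels[level] = []
            if level + 1 == len(self.levels):
                self.levels.append([])
            self.levels[level + 1].append(merged)
            level += 1

    def lookup(self, term):
        """Collect docids from every file; cost grows with the file count."""
        hits = [docid for runs in self.levels for run in runs
                for t, docid in run if t == term]
        return sorted(hits)

    def file_counts(self):
        return [len(runs) for runs in self.levels]

idx = HierarchicalIndex(fanout=3)
for docid in range(1, 10):
    idx.add_run([("a", docid)])
counts_after_nine = idx.file_counts()   # [0, 0, 1]
```

The file counts change with every added run, which is why lookup time is not constant over the life of the index.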
Column organization
• Field selection
– Based on query
• Phrase queries and proximity scoring need position
• Simple boolean queries do not need position or frequency
• Relevance scoring needs frequency
– Don’t decompress what you don’t need
– Don’t read from disk what you don’t need
– Locality
More than text search
• Context info
• Meta data
• Values
[Figure: example posting layouts: docid + frequency + position list + context; docid + date; docid + size; docid + owner; docid + person; docid + zip code; docid + company; docid + URI. Three "position" labels also appear in the figure.]
Skipping
• Search engine and skipping
– Used in merging (AND queries)
– Semi sequential access
– Direct lookup
– Disk based
• Skip list
• Vs Btree
• Variants
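A sketch of how skipping helps an AND query over two sorted docid lists. It assumes a fixed skip distance `step` that can be followed from any position, which is a simplification. Real skip pointers sit only at selected entries or in a separate table.

```python
def intersect_with_skips(a, b, step=4):
    """AND two sorted docid lists, jumping `step` entries when it is safe."""
    i = j = 0
    result = []
    while i < len(a) and j < len(b):
        if a[i] == b[j]:
            result.append(a[i])
            i += 1
            j += 1
        elif a[i] < b[j]:
            # Skip ahead in a while the skip target does not overshoot b[j].
            while i + step < len(a) and a[i + step] <= b[j]:
                i += step
            if a[i] < b[j]:
                i += 1
        else:
            # Symmetric case: skip ahead in b.
            while j + step < len(b) and b[j + step] <= a[i]:
                j += step
            if b[j] < a[i]:
                j += 1
    return result

multiples_of_3 = list(range(0, 100, 3))
multiples_of_5 = list(range(0, 100, 5))
both = intersect_with_skips(multiples_of_3, multiples_of_5)
```

Skipping pays off most when one list is much longer than the other, since long stretches of the long list are never compared.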
Skip list
• 0 < p < 1 (e.g. p=1/2 or p=1/4)
• Lookup and insertion are O(log n) (expected)
• Size vs. speed
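A compact sketch of a probabilistic skip list in the style of Pugh's paper, with p as the promotion probability. The class layout, `max_height`, and the `seed` parameter are assumptions made for illustration.

```python
import random

class _Node:
    def __init__(self, key, height):
        self.key = key
        self.forward = [None] * height      # one next-pointer per level

class SkipList:
    def __init__(self, p=0.5, max_height=16, seed=None):
        self.p = p
        self.max_height = max_height
        self.rng = random.Random(seed)
        self.head = _Node(None, max_height)
        self.height = 1

    def _random_height(self):
        """Each extra level is granted with probability p."""
        h = 1
        while h < self.max_height and self.rng.random() < self.p:
            h += 1
        return h

    def _predecessors(self, key):
        """For every level, find the last node whose key is smaller than `key`."""
        update = [self.head] * self.max_height
        node = self.head
        for level in range(self.height - 1, -1, -1):
            while node.forward[level] is not None and node.forward[level].key < key:
                node = node.forward[level]
            update[level] = node
        return update

    def insert(self, key):
        update = self._predecessors(key)
        nxt = update[0].forward[0]
        if nxt is not None and nxt.key == key:
            return                           # key already present
        h = self._random_height()
        self.height = max(self.height, h)
        node = _Node(key, h)
        for level in range(h):
            node.forward[level] = update[level].forward[level]
            update[level].forward[level] = node

    def __contains__(self, key):
        nxt = self._predecessors(key)[0].forward[0]
        return nxt is not None and nxt.key == key

    def __iter__(self):
        node = self.head.forward[0]
        while node is not None:
            yield node.key
            node = node.forward[0]

sl = SkipList(p=0.25, seed=1)
for docid in [88, 17, 199, 31, 113, 62, 97, 187, 89]:
    sl.insert(docid)
```

A smaller p means fewer pointers per node and less space, at the cost of longer horizontal scans on each level.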
Issues
• Compression
• Can be skewed
Skip list vs B-Tree
Skip list
• Main-memory structure
• Less space
B-Tree
• Disk based structure
• Better locality
Variations
• Deterministic skip list
• 1 level skips
• Separate skip table
Literature
• Justin Zobel and Alistair Moffat. Inverted files for text search engines. ACM Comput. Surv. 38, 2, July 2006.
• Marcin Zukowski, Sandor Heman, Niels Nes, and Peter Boncz. Super-Scalar RAM-CPU Cache Compression. In Proceedings of the 22nd International Conference on Data Engineering (ICDE '06).
• Holger Bast and Ingmar Weber. Type less, find more: fast autocompletion search with a succinct index. In Proceedings of the 29th annual international ACM SIGIR conference (SIGIR '06).
• William Pugh. Skip lists: a probabilistic alternative to balanced trees. Communications of the ACM 33, 6, June 1990. ftp://ftp.cs.umd.edu/pub/skipLists/skiplists.pdf