Search Engine Caching
Rank-preserving two-level caching for scalable search engines, Patricia Correia Saraiva et al., September 2001
http://doi.acm.org.ezproxy.lib.vt.edu:8080/10.1145/383952.383959
Predictive Caching and Prefetching of Query Results in Search Engines, Ronny Lempel and Shlomo Moran, May 2003
http://doi.acm.org.ezproxy.lib.vt.edu:8080/10.1145/775152.775156
Presented by
Adam "So, is this gonna be on the test?" Edelman
The Problem
• The User: "I want my results now!"
• But...
– Over 4 billion web pages
– Over 1 million queries per minute
• How do we keep response times down as the web grows?
Search Engine Statistics
• 63.7% of the search phrases appear only once in the billion-query log
• The 25 most popular queries in the log account for 1.5% of the submissions
• Considerable time and processing power can be saved through well-implemented caching
Search Engine Statistics
• 58% of the users view only the first page of results (the top-10 results)
• No more than 12% of users browse through more than 3 result pages
• We do not need to cache large result sets for a given query
What do we Cache?
• 36% of all queries have been retrieved before
• Can we apply caching even if the query does not exactly match any previous query?
What do we Cache?
• Saraiva et al. propose a two-level cache
• In addition to caching query results, we also cache inverted lists for popular terms
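A minimal sketch of how the two levels could fit together, assuming a toy in-memory index. The names `LRU`, `TwoLevelCache`, and the `fetch_list` callback are hypothetical, and the "ranking" is just an intersection of document ids; none of this is taken from Saraiva et al.'s implementation.

```python
from collections import OrderedDict


class LRU:
    """Tiny LRU map: the most recently used keys sit at the end."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.data = OrderedDict()

    def get(self, key):
        if key not in self.data:
            return None
        self.data.move_to_end(key)
        return self.data[key]

    def put(self, key, value):
        self.data[key] = value
        self.data.move_to_end(key)
        if len(self.data) > self.capacity:
            self.data.popitem(last=False)  # evict the least recently used


class TwoLevelCache:
    """Level 1 caches whole query results; level 2 caches inverted lists."""

    def __init__(self, fetch_list, n_queries, n_lists):
        self.fetch_list = fetch_list  # hypothetical index/disk lookup
        self.results = LRU(n_queries)
        self.lists = LRU(n_lists)
        self.disk_reads = 0

    def _inverted_list(self, term):
        docs = self.lists.get(term)
        if docs is None:
            self.disk_reads += 1
            docs = self.fetch_list(term)
            self.lists.put(term, docs)
        return docs

    def search(self, query):
        hit = self.results.get(query)
        if hit is not None:
            return hit
        # Toy evaluation: documents that contain every term, sorted by id.
        sets = [set(self._inverted_list(t)) for t in query.split()]
        answer = sorted(set.intersection(*sets)) if sets else []
        self.results.put(query, answer)
        return answer
```

With this sketch, a query such as "cache web" that was never seen before misses the result cache but can still be answered without touching the index if the lists for "cache" and "web" were cached by an earlier query.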
Query Cache Implementation
• Store only the first 50 references per query
– ~25KB per query
• Query logs show that miss ratios do not improve drastically once the query result cache exceeds 10 MB
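A quick back-of-the-envelope check of what these figures imply. It assumes binary units (1 MB = 1024 KB), which the slides do not specify, so treat the results as rough.

```python
KB = 1024
MB = 1024 * KB

entry_bytes = 25 * KB   # ~25KB per cached query (figure from the slide)
cache_bytes = 10 * MB   # size beyond which miss ratios barely improve

# How many query results fit in a 10 MB query cache?
entries = cache_bytes // entry_bytes

# Average space per stored reference, given 50 references per query.
bytes_per_reference = entry_bytes / 50

print(entries, bytes_per_reference)
```

Under these assumptions the cache holds on the order of a few hundred queries, at roughly half a kilobyte per stored reference.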
Inverted List Cache Implementation
• For this data set, 50-75% of inverted lists contain documents where the term appears only once
• Use 4KB inverted list size per term
– More work needs to be done
• Asymptotic behavior is apparent after the cache exceeds 200MB
• Use 250MB for IL Cache
Two-Level Cache Implementation
• Combine previous two caches
• 270MB total cache
– Accounts for only 6.5% of overall index size
• Tested over a log of 100K queries to TodoBR
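The sizes above can be cross-checked with a little arithmetic. This is only a sketch: the index size and the query-result share are derived from the slide's numbers, not stated in the slides themselves.

```python
total_cache_mb = 270       # total two-level cache (from the slide)
il_cache_mb = 250          # inverted list cache (from the previous slide)
fraction_of_index = 0.065  # the cache is 6.5% of the overall index size

# Space left for cached query results (derived, not stated in the slides).
query_cache_mb = total_cache_mb - il_cache_mb

# Implied overall index size in MB (derived, not stated in the slides).
index_mb = total_cache_mb / fraction_of_index

print(query_cache_mb, round(index_mb))
```

So the numbers imply an index of a little over 4 GB, with about 20 MB of the cache left for query results.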
Two-Level Cache Results
• Compared to caches of 270MB for only query results, only inverted lists and no cache
• Queries processed reduced by 62%
– 21% increase compared to only query result cache
• Page fetches from the database reduced by 95%
– 3% increase compared to only inverted list cache
Two-Level Cache Results
• At more than 20 queries per second, the two-level cache needs only 20% of the disk reads required with no cache
• Two-level cache can handle 64 queries per second against 22 per second with no cache
How do we cache?
• Saraiva et al. use a least recently used (LRU) replacement policy for cache maintenance
• Users search in sessions; the next query will probably be related to the previous query
• Can we use this to improve caching?
Probability Driven Cache (PDC)
• Lempel and Moran propose a cache based on the probability of a page being requested
Page Least Recently Used (PLRU)
• Allocate a page queue that can accommodate a certain number of result pages
• When the queue is full and a new page needs to be cached, the least recently used page is removed from the cache
• Achieves hit ratios around 30% for warm, large caches
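The queue behavior described above can be sketched as follows. `PageLRU` and the `compute_page` callback are hypothetical names, and the sketch assumes a capacity of at least one page; it illustrates plain LRU over result pages rather than reproducing Lempel and Moran's code.

```python
from collections import OrderedDict


class PageLRU:
    """LRU queue of result pages keyed by (query, page_number)."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.queue = OrderedDict()  # head = least recently used
        self.hits = 0
        self.requests = 0

    def request(self, query, page_no, compute_page):
        self.requests += 1
        key = (query, page_no)
        if key in self.queue:
            self.hits += 1
            self.queue.move_to_end(key)  # now the most recently used
            return self.queue[key]
        page = compute_page(query, page_no)  # hypothetical back-end call
        if len(self.queue) >= self.capacity:
            self.queue.popitem(last=False)  # drop the least recently used
        self.queue[key] = page
        return page

    def hit_ratio(self):
        return self.hits / self.requests if self.requests else 0.0
```

Replaying a query log through `request` and reading `hit_ratio()` once the cache is warm is one way to measure figures like the 30% quoted above.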
Page Segmented LRU (PSLRU)
• Maintains two LRU segments, a protected segment and a probationary segment
• Pages are first placed in the probationary segment; if requested again, they are moved to the protected segment
• Pages evicted from the protected segment are moved to the probationary segment
• Pages evicted from the probationary segment are removed from the cache
• Consistently outperforms PLRU, although the difference is very small
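A minimal sketch of the two-segment scheme described above. The class name, the `compute_page` callback, and the segment sizes are illustrative assumptions; the promotion and demotion rules follow the bullets on this slide.

```python
from collections import OrderedDict


class SegmentedLRU:
    """Two LRU segments: new pages are probationary, re-requested ones protected."""

    def __init__(self, protected_size, probationary_size):
        self.protected_size = protected_size
        self.probationary_size = probationary_size
        self.protected = OrderedDict()      # head = least recently used
        self.probationary = OrderedDict()   # head = least recently used

    def _add_probationary(self, key, page):
        self.probationary[key] = page
        self.probationary.move_to_end(key)
        if len(self.probationary) > self.probationary_size:
            self.probationary.popitem(last=False)  # leaves the cache entirely

    def _add_protected(self, key, page):
        self.protected[key] = page
        self.protected.move_to_end(key)
        if len(self.protected) > self.protected_size:
            old_key, old_page = self.protected.popitem(last=False)
            self._add_probationary(old_key, old_page)  # demoted, not dropped

    def request(self, key, compute_page):
        """Return (page, was_hit)."""
        if key in self.protected:
            self.protected.move_to_end(key)
            return self.protected[key], True
        if key in self.probationary:
            page = self.probationary.pop(key)
            self._add_protected(key, page)  # second request: promote
            return page, True
        page = compute_page(key)  # hypothetical back-end call
        self._add_probationary(key, page)
        return page, False
```

The design choice here is that a page requested only once can never push out a page that has already proven popular, since one-off pages only compete inside the probationary segment.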
Topic LRU (TLRU)
• Let t(q) denote the topic of the query q
• After the cache is warm, each request for q moves any cached result page of t(q) to the tail of the queue
• Each topic’s pages will reside contiguously in the queue
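A rough sketch of the topic refresh described above. It assumes the topic function `t(q)` is supplied by the caller and defaults to the query string itself, so that all result pages of one query share a topic; `TopicLRU` and `compute_page` are hypothetical names, and the linear scan over the queue is for clarity rather than speed.

```python
from collections import OrderedDict


class TopicLRU:
    """LRU queue in which a request refreshes every cached page of its topic."""

    def __init__(self, capacity, topic=lambda query: query):
        self.capacity = capacity
        self.topic = topic          # assumed stand-in for t(q)
        self.queue = OrderedDict()  # key: (query, page_no); head = LRU

    def request(self, query, page_no, compute_page):
        """Return (page, was_hit)."""
        key = (query, page_no)
        hit = key in self.queue
        if not hit:
            if len(self.queue) >= self.capacity:
                self.queue.popitem(last=False)  # evict the LRU page
            self.queue[key] = compute_page(query, page_no)
        # Move every other cached page of topic t(q) to the tail...
        t = self.topic(query)
        same_topic = [k for k in self.queue
                      if k != key and self.topic(k[0]) == t]
        for k in same_topic:
            self.queue.move_to_end(k)
        # ...then the requested page, so it is the most recent of all.
        self.queue.move_to_end(key)
        return self.queue[key], hit
```

Because the whole topic is refreshed together, its pages end up next to each other at the tail, and an older page of an active topic outlives a newer page of an idle one.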
Topic SLRU (TSLRU)
• All pages are initially inserted in the probationary segment
• In addition to promoting pages from probationary to protected, we also promote all pages of t(q)
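One way to read the two bullets above, as a sketch. It assumes that t(q) defaults to the query string itself, that a newly computed page always starts in the probationary segment, and that every other cached page of the same topic is promoted (or refreshed) in the protected segment. `TopicSLRU` and `compute_page` are hypothetical names.

```python
from collections import OrderedDict


class TopicSLRU:
    """Segmented LRU where a request also promotes all pages of its topic."""

    def __init__(self, protected_size, probationary_size, topic=lambda q: q):
        self.protected_size = protected_size
        self.probationary_size = probationary_size
        self.topic = topic  # assumed stand-in for t(q)
        self.protected = OrderedDict()
        self.probationary = OrderedDict()

    def _take(self, key):
        """Remove key from whichever segment holds it; return (found, page)."""
        for segment in (self.probationary, self.protected):
            if key in segment:
                return True, segment.pop(key)
        return False, None

    def _insert(self, key, page, protected):
        segment = self.protected if protected else self.probationary
        limit = self.protected_size if protected else self.probationary_size
        segment[key] = page  # key is absent here, so it lands at the tail
        if len(segment) > limit:
            old_key, old_page = segment.popitem(last=False)
            if protected:
                self._insert(old_key, old_page, protected=False)  # demote

    def request(self, query, page_no, compute_page):
        """Return (page, was_hit)."""
        key = (query, page_no)
        hit, page = self._take(key)
        if not hit:
            page = compute_page(query, page_no)  # hypothetical back-end call
        t = self.topic(query)
        cached = list(self.probationary) + list(self.protected)
        for k in [k for k in cached if self.topic(k[0]) == t]:
            found, other = self._take(k)
            if found:  # an earlier promotion may already have evicted it
                self._insert(k, other, protected=True)
        # New pages start probationary; re-requested pages are promoted.
        self._insert(key, page, protected=hit)
        return page, hit
```

Taking the requested page out of its segment first keeps it from being evicted while its topic neighbors are being promoted.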