Search Engine Caching

Rank-preserving two-level caching for scalable search engines, Patricia Correia Saraiva et al., September 2001 http://doi.acm.org.ezproxy.lib.vt.edu:8080/10.1145/383952.383959

Predictive Caching and Prefetching of Query Results in Search Engines, Ronny Lempel and Shlomo Moran, May 2003 http://doi.acm.org.ezproxy.lib.vt.edu:8080/10.1145/775152.775156

Presented by

Adam "So, is this gonna be on the test?" Edelman





The Problem

• The User: "I want my results now!"

• But...
  – Over 4 billion web pages
  – Over 1 million queries per minute

• How do we keep response times down as the web grows?

Search Engine Statistics

• 63.7% of the search phrases appear only once in the billion-query log

• The 25 most popular queries in the log account for 1.5% of the submissions

• Considerable time and processing power can be saved through well-implemented caching

Search Engine Statistics

• 58% of the users view only the first page of results (the top-10 results)

• No more than 12% of users browse through more than 3 result pages.

• We do not need to cache large result sets for a given query

What do we Cache?

• 36% of all queries have been retrieved before

• Can we apply caching even if the query does not exactly match any previous query?

What do we Cache?

• Saraiva et al. propose a two-level cache

• In addition to caching query results, we also cache inverted lists for popular terms

Query Cache Implementation

• Store only the first 50 references per query
  – ~25KB per query

• Query logs show that miss ratios do not improve drastically once the query result cache exceeds 10MB
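As a rough check on these figures, the arithmetic below works out how many query entries fit in a 10MB result cache at about 25KB each. The per-reference size is derived rather than taken from the slides, and binary units (1MB = 1024KB) are an assumption.

```python
# Back-of-the-envelope sizing for the query result cache.
# Assumption: binary units (1 MB = 1024 KB); the slides do not say which.
KB = 1024
MB = 1024 * KB

refs_per_query = 50          # "first 50 references per query"
bytes_per_query = 25 * KB    # "~25KB per query"
cache_bytes = 10 * MB        # size beyond which miss ratios level off

# Derived values (not stated in the slides).
bytes_per_ref = bytes_per_query / refs_per_query
queries_in_cache = cache_bytes // bytes_per_query

print(f"about {bytes_per_ref:.0f} bytes per reference")
print(f"about {queries_in_cache} query entries fit in the cache")
```

Under these assumptions the cache holds only a few hundred distinct queries, which fits the earlier observation that a small set of popular queries accounts for a disproportionate share of submissions.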

Inverted List Cache Implementation

• For this data set, 50-75% of inverted lists contain documents where the term appears only once

• Use 4KB inverted list size per term
  – More work needs to be done

• Asymptotic behavior is apparent after cache exceeds 200MB

• Use 250MB for IL Cache

Two-Level Cache Implementation

• Combine previous two caches

• 270MB total cache
  – Accounts for only 6.5% of overall index size

• Tested over a log of 100K queries to TodoBR
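A minimal sketch of how the two levels might fit together, not the authors' implementation. The class names, the in-memory `index` dictionary standing in for the on-disk index, the summed-score ranking, and the term-sorting normalization are all assumptions made for illustration. Capacities are counted in entries rather than bytes, and whole inverted lists are cached rather than a fixed 4KB portion.

```python
from collections import OrderedDict


class LRUCache:
    """Fixed-capacity LRU map (capacity counted in entries, not bytes)."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.data = OrderedDict()

    def get(self, key):
        if key not in self.data:
            return None
        self.data.move_to_end(key)          # mark as most recently used
        return self.data[key]

    def put(self, key, value):
        self.data[key] = value
        self.data.move_to_end(key)
        if len(self.data) > self.capacity:
            self.data.popitem(last=False)   # evict least recently used


class TwoLevelCache:
    """Level 1 caches query results; level 2 caches inverted lists."""

    def __init__(self, index, query_capacity, list_capacity, top_k=50):
        self.index = index                  # term -> [(doc_id, score)]; stands in for disk
        self.results = LRUCache(query_capacity)
        self.lists = LRUCache(list_capacity)
        self.top_k = top_k                  # keep only the first 50 references
        self.disk_reads = 0

    def _inverted_list(self, term):
        postings = self.lists.get(term)
        if postings is None:                # level-2 miss: go to "disk"
            self.disk_reads += 1
            postings = self.index.get(term, [])
            self.lists.put(term, postings)
        return postings

    def search(self, query):
        # Assumption: queries with the same terms share one cache entry.
        key = " ".join(sorted(query.lower().split()))
        cached = self.results.get(key)
        if cached is not None:              # level-1 hit: no processing at all
            return cached
        scores = {}
        for term in key.split():
            for doc_id, score in self._inverted_list(term):
                scores[doc_id] = scores.get(doc_id, 0.0) + score
        # Placeholder ranking: highest summed score first, ties by doc id.
        ranked = sorted(scores, key=lambda d: (-scores[d], d))[: self.top_k]
        self.results.put(key, ranked)
        return ranked


index = {"cache": [(1, 0.9), (2, 0.4)], "search": [(2, 0.8), (3, 0.5)]}
cache = TwoLevelCache(index, query_capacity=2, list_capacity=2)
first = cache.search("search cache")    # misses both levels: one read per term
second = cache.search("cache")          # result miss, inverted list hit
third = cache.search("cache search")    # result hit after normalization
```

The second query shows what the inverted list level adds: it has never been asked before, yet it is answered without touching the index.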

Two-Level Cache Results

• Compared to caches of 270MB for only query results, only inverted lists, and no cache

• Queries processed reduced by 62%
  – 21% increase compared to only query result cache

• Page fetches from the database reduced 95%
  – 3% increase compared to only inverted list cache

Two-Level Cache Results

• At more than 20 queries per second, the two-level cache needs only 20% of the disk reads required with no cache

• Two-level cache can handle 64 queries per second against 22 per second with no cache

How do we cache?

• Saraiva et al. use a least recently used (LRU) replacement policy for cache maintenance

• Users search in sessions; the next query will probably be related to the previous query

• Can we use this to improve caching?

Probability Driven Cache (PDC)

• Lempel and Moran propose a cache based on the probability of a page being requested

Page Least Recently Used (PLRU)

• Allocate a page queue that can accommodate a certain number of result pages

• When the queue is full and a new page needs to be cached, the least recently used page is removed from the cache

• Achieves hit ratios around 30% for warm, large caches
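The PLRU policy described above can be sketched as follows. This is an illustration only: the `PageLRU` class, the `(query, page_no)` key, and the `fetch` backend are hypothetical names, and the toy request stream says nothing about the hit ratios reported in the paper.

```python
from collections import OrderedDict


class PageLRU:
    """Queue of result pages; the head is the least recently used page."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.queue = OrderedDict()   # (query, page_no) -> result page
        self.hits = 0
        self.requests = 0

    def request(self, query, page_no, fetch):
        self.requests += 1
        key = (query, page_no)
        if key in self.queue:
            self.hits += 1
            self.queue.move_to_end(key)      # now the most recently used
            return self.queue[key]
        page = fetch(query, page_no)         # miss: ask the search backend
        self.queue[key] = page
        if len(self.queue) > self.capacity:
            self.queue.popitem(last=False)   # drop the least recently used
        return page


def fetch(query, page_no):
    # Hypothetical backend standing in for real query processing.
    return f"results for {query!r}, page {page_no}"


cache = PageLRU(capacity=2)
for q, p in [("a", 1), ("b", 1), ("a", 1), ("c", 1), ("b", 1)]:
    cache.request(q, p, fetch)
hit_ratio = cache.hits / cache.requests
```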

Page Segmented LRU (PSLRU)

• Maintains two LRU segments, a protected segment and a probationary segment

• Pages are first placed in the probationary segment; if requested again, they are moved to the protected segment

• Pages evicted from the protected segment are moved to the probationary segment

• Pages evicted from the probationary segment are removed from the cache

• Consistently outperforms PLRU, although the difference is very small
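The four rules above translate fairly directly into two ordered maps. The sketch below is one possible reading; the class name, the segment sizes, and the `fetch` callback are assumptions for illustration.

```python
from collections import OrderedDict


class PageSLRU:
    """Segmented LRU: new pages are probationary, re-requested pages protected."""

    def __init__(self, protected_size, probationary_size):
        self.protected_size = protected_size
        self.probationary_size = probationary_size
        self.protected = OrderedDict()
        self.probationary = OrderedDict()

    def _insert_probationary(self, key, page):
        self.probationary[key] = page
        self.probationary.move_to_end(key)
        if len(self.probationary) > self.probationary_size:
            self.probationary.popitem(last=False)    # leaves the cache entirely

    def _promote(self, key, page):
        self.protected[key] = page
        self.protected.move_to_end(key)
        if len(self.protected) > self.protected_size:
            old_key, old_page = self.protected.popitem(last=False)
            self._insert_probationary(old_key, old_page)   # demoted, not dropped

    def request(self, key, fetch):
        """Return (page, hit)."""
        if key in self.protected:
            self.protected.move_to_end(key)
            return self.protected[key], True
        if key in self.probationary:
            page = self.probationary.pop(key)
            self._promote(key, page)                 # second request: protect it
            return page, True
        page = fetch(key)
        self._insert_probationary(key, page)         # first request: probation
        return page, False


cache = PageSLRU(protected_size=1, probationary_size=2)
outcomes = [cache.request(k, lambda key: f"page {key}")[1]
            for k in ["a", "a", "b", "b", "c", "d"]]
```

In this run, page "a" is promoted, later demoted when "b" takes the single protected slot, and finally evicted from the probationary segment by "d".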

Topic LRU (TLRU)

• Let t(q) denote the topic of the query q

• After the cache is warm, any cached result page of t(q) is moved to the tail of the queue.

• Each topic’s pages will reside contiguously in the queue
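A sketch of the TLRU idea, under assumptions. The slides do not say how t(q) is computed, so `topic_of` below simply uses the normalized query text; the class name and the `fetch` callback are likewise hypothetical.

```python
from collections import OrderedDict


def topic_of(query):
    # Assumption: the topic is the normalized query text, so every result
    # page of the same query shares one topic.
    return " ".join(query.lower().split())


class TopicLRU:
    """LRU queue in which a request refreshes every cached page of its topic."""

    def __init__(self, capacity, topic_fn=topic_of):
        self.capacity = capacity
        self.topic_fn = topic_fn
        self.queue = OrderedDict()   # (topic, page_no) -> page; head is evicted first

    def request(self, query, page_no, fetch):
        """Return (page, hit)."""
        topic = self.topic_fn(query)
        key = (topic, page_no)
        hit = key in self.queue
        if not hit:
            self.queue[key] = fetch(query, page_no)
        page = self.queue[key]
        # Move all cached pages of this topic to the tail, keeping their
        # relative order, so the topic's pages sit contiguously.
        for k in [k for k in self.queue if k[0] == topic]:
            self.queue.move_to_end(k)
        while len(self.queue) > self.capacity:
            self.queue.popitem(last=False)
        return page, hit


cache = TopicLRU(capacity=3)
fetch = lambda q, p: f"{q} page {p}"   # hypothetical backend
for q, p in [("jazz", 1), ("rust", 1), ("jazz", 2), ("go", 1)]:
    cache.request(q, p, fetch)
remaining = list(cache.queue)
```

With plain PLRU, page 1 of "jazz" would be the oldest entry and would be evicted when "go" arrives. Here the request for page 2 refreshed it, so "rust" is evicted instead.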

Topic SLRU (TSLRU)

• All pages are initially inserted in the probationary segment

• In addition to promoting pages from probationary to protected, we also promote all pages of t(q)
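One possible reading of TSLRU, sketched with hypothetical names. It assumes that a newly fetched page always starts in the probationary segment, while any request for a topic promotes (or refreshes) every page of that topic already in the cache. The slides leave those details open.

```python
from collections import OrderedDict


class TopicSLRU:
    """Segmented LRU where a request promotes all cached pages of its topic."""

    def __init__(self, protected_size, probationary_size):
        self.protected_size = protected_size
        self.probationary_size = probationary_size
        self.protected = OrderedDict()      # (topic, page_no) -> page
        self.probationary = OrderedDict()

    def _rebalance(self):
        while len(self.protected) > self.protected_size:
            key, page = self.protected.popitem(last=False)
            self.probationary[key] = page            # demoted to probationary tail
        while len(self.probationary) > self.probationary_size:
            self.probationary.popitem(last=False)    # leaves the cache

    def request(self, topic, page_no, fetch):
        """Return (page, hit)."""
        key = (topic, page_no)
        # Promote probationary pages of this topic, then refresh the topic's
        # protected pages so they sit at the most recently used end.
        for k in [k for k in self.probationary if k[0] == topic]:
            self.protected[k] = self.probationary.pop(k)
        for k in [k for k in self.protected if k[0] == topic]:
            self.protected.move_to_end(k)
        hit = key in self.protected
        page = self.protected[key] if hit else fetch(topic, page_no)
        self._rebalance()
        if not hit:
            self.probationary[key] = page            # new pages start on probation
            self._rebalance()
        return page, hit


cache = TopicSLRU(protected_size=2, probationary_size=2)
fetch = lambda t, p: f"{t} page {p}"   # hypothetical backend
outcomes = [cache.request(t, p, fetch)[1]
            for t, p in [("jazz", 1), ("jazz", 2), ("rust", 1), ("jazz", 1)]]
```

In this run, page 2 of "jazz" is promoted by the final request even though only page 1 was asked for, which is the topic-level promotion the slide describes.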
