Mining di Dati Web (Web Data Mining)
![Page 1: Mining di Dati Web](https://reader036.vdocuments.us/reader036/viewer/2022081512/568154c1550346895dc2c972/html5/thumbnails/1.jpg)
Mining di Dati Web
Web Search Engine’s Query Log Mining
A.A. (academic year) 2006/2007
![Page 2: Mining di Dati Web](https://reader036.vdocuments.us/reader036/viewer/2022081512/568154c1550346895dc2c972/html5/thumbnails/2.jpg)
What’s Recorded in a WSE Query Log?
Each component of a WSE records information about its operations.
We are mainly concerned with frontend logs.
They record each query submitted to the WSE.
![Page 3: Mining di Dati Web](https://reader036.vdocuments.us/reader036/viewer/2022081512/568154c1550346895dc2c972/html5/thumbnails/3.jpg)
Data Recorded
Among other information, WSEs record:
The query topic.
The first result wanted.
The number of results wanted.
Some examples:
q(fabrizio silvestri)f(1)n(10)
q(“information retrieval”)f(5)n(15)
Some other information:
The language.
Results folded? (Y/N).
Etc.
Commonly referred to as “the query”
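A minimal sketch of how entries in the example format could be parsed. The `q(...)f(...)n(...)` layout is inferred only from the two examples on this slide; the function name `parse_entry` and the returned field names are hypothetical, and a real frontend log will likely carry more fields.

```python
import re

# Pattern inferred from the slide's examples:
#   q(<topic>)f(<first result wanted>)n(<number of results wanted>)
# Assumption: the whole entry fits on one line and ends with the n(...) field.
ENTRY = re.compile(r"^q\((?P<topic>.*)\)f\((?P<first>\d+)\)n\((?P<count>\d+)\)$")


def parse_entry(line):
    """Split one log entry into topic, first result wanted, and number of results."""
    match = ENTRY.match(line.strip())
    if match is None:
        raise ValueError(f"unrecognized log entry: {line!r}")
    return {
        "topic": match.group("topic"),
        "first": int(match.group("first")),
        "count": int(match.group("count")),
    }


entry = parse_entry("q(fabrizio silvestri)f(1)n(10)")
```

Here `entry["topic"]` holds the query text, while `entry["first"]` and `entry["count"]` identify which slice of the result list was requested.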
![Page 4: Mining di Dati Web](https://reader036.vdocuments.us/reader036/viewer/2022081512/568154c1550346895dc2c972/html5/thumbnails/4.jpg)
What Can We Look For?
The most popular queries.
How queries are distributed.
How queries are related.
How subsequent queries are related.
How topics are distributed.
How topics change throughout the 24 hours.
Can we exploit this information?
![Page 5: Mining di Dati Web](https://reader036.vdocuments.us/reader036/viewer/2022081512/568154c1550346895dc2c972/html5/thumbnails/5.jpg)
Let’s Start Looking at Some Interesting Items
What are the most popular queries?
![Page 6: Mining di Dati Web](https://reader036.vdocuments.us/reader036/viewer/2022081512/568154c1550346895dc2c972/html5/thumbnails/6.jpg)
Most Popular Topics
![Page 7: Mining di Dati Web](https://reader036.vdocuments.us/reader036/viewer/2022081512/568154c1550346895dc2c972/html5/thumbnails/7.jpg)
Most Popular Terms
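As a rough illustration of how the most popular queries and terms could be extracted from a log, here is a small sketch. The input is assumed to be a plain list of query strings, and the light normalization (trimming and lowercasing) is an assumption, not something the slides prescribe.

```python
from collections import Counter


def most_popular(queries, k=3):
    """Return the k most frequent queries and the k most frequent terms."""
    # Assumption: queries differing only in case or outer whitespace are the same.
    normalized = [q.strip().lower() for q in queries]
    query_counts = Counter(normalized)
    term_counts = Counter(term for q in normalized for term in q.split())
    return query_counts.most_common(k), term_counts.most_common(k)


log = ["pisa weather", "Pisa weather", "web mining", "pisa hotels"]
top_queries, top_terms = most_popular(log, k=2)
```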
![Page 8: Mining di Dati Web](https://reader036.vdocuments.us/reader036/viewer/2022081512/568154c1550346895dc2c972/html5/thumbnails/8.jpg)
What Are Users Doing?
Not typing many words!
The average query was 2.6 words long in 2001, up from 2.4 words in 1997.
Moving toward e-commerce.
Less sex (down from 17% to 9%), more business (up from 13% to 25%).
Spink A., et al. “From e-Sex to e-Commerce: Web Search Changes”, Computer, March 2002.
![Page 9: Mining di Dati Web](https://reader036.vdocuments.us/reader036/viewer/2022081512/568154c1550346895dc2c972/html5/thumbnails/9.jpg)
Why Are Queries so Short?
Users minimize effort.
Users don’t realize that more information is better.
Users learn that more words lead to fewer results (because of the implicit AND).
Query boxes are small.
Belkin, N.J., et al. “Rutgers’ TREC 2001 Interactive Track Experience”, in Voorhees & Harman, The Tenth Text Retrieval Conference.
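The effect of the implicit AND can be illustrated with a toy inverted index: each extra term intersects one more posting set, so the result set can only shrink or stay the same. The index contents and the function name `search_and` are made up for this sketch.

```python
def search_and(index, query):
    """Return the documents that contain every term of the query (implicit AND)."""
    postings = [index.get(term, set()) for term in query.split()]
    if not postings:
        return set()
    return set.intersection(*postings)


# Toy index (hypothetical data): term -> set of document ids.
index = {"web": {1, 2, 3}, "mining": {2, 3}, "caching": {3}}

one_term = search_and(index, "web")
two_terms = search_and(index, "web mining")
three_terms = search_and(index, "web mining caching")
```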
![Page 10: Mining di Dati Web](https://reader036.vdocuments.us/reader036/viewer/2022081512/568154c1550346895dc2c972/html5/thumbnails/10.jpg)
Different Kinds of Queries
![Page 11: Mining di Dati Web](https://reader036.vdocuments.us/reader036/viewer/2022081512/568154c1550346895dc2c972/html5/thumbnails/11.jpg)
Distribution of Query Types
![Page 12: Mining di Dati Web](https://reader036.vdocuments.us/reader036/viewer/2022081512/568154c1550346895dc2c972/html5/thumbnails/12.jpg)
Hourly Analysis of a Query Log
Steven M. Beitzel, Eric C. Jensen, Abdur Chowdhury, David Grossman, Ophir Frieder, "Hourly Analysis of a Very Large Topically Categorized Web Query Log", Proceedings of the 2004 ACM Conference on Research and Development in Information Retrieval (ACM-SIGIR), Sheffield, UK, July 2004.
![Page 13: Mining di Dati Web](https://reader036.vdocuments.us/reader036/viewer/2022081512/568154c1550346895dc2c972/html5/thumbnails/13.jpg)
Frequency Time Distribution
![Page 14: Mining di Dati Web](https://reader036.vdocuments.us/reader036/viewer/2022081512/568154c1550346895dc2c972/html5/thumbnails/14.jpg)
Query Repetition
![Page 15: Mining di Dati Web](https://reader036.vdocuments.us/reader036/viewer/2022081512/568154c1550346895dc2c972/html5/thumbnails/15.jpg)
Query Categories
![Page 16: Mining di Dati Web](https://reader036.vdocuments.us/reader036/viewer/2022081512/568154c1550346895dc2c972/html5/thumbnails/16.jpg)
Categories over Time
![Page 17: Mining di Dati Web](https://reader036.vdocuments.us/reader036/viewer/2022081512/568154c1550346895dc2c972/html5/thumbnails/17.jpg)
Analysis of Three Query Logs
Tiziano Fagni, Salvatore Orlando, Raffaele Perego, Fabrizio Silvestri. “Boosting the Performance of Web Search Engines: Caching and Prefetching Query Results by Exploiting Historical Usage Data”. ACM Transactions on Information Systems. 24(1). January 2006.
![Page 18: Mining di Dati Web](https://reader036.vdocuments.us/reader036/viewer/2022081512/568154c1550346895dc2c972/html5/thumbnails/18.jpg)
Temporal Locality
=0.66
![Page 19: Mining di Dati Web](https://reader036.vdocuments.us/reader036/viewer/2022081512/568154c1550346895dc2c972/html5/thumbnails/19.jpg)
Query Submission Distance
![Page 20: Mining di Dati Web](https://reader036.vdocuments.us/reader036/viewer/2022081512/568154c1550346895dc2c972/html5/thumbnails/20.jpg)
Page Requested
![Page 21: Mining di Dati Web](https://reader036.vdocuments.us/reader036/viewer/2022081512/568154c1550346895dc2c972/html5/thumbnails/21.jpg)
Subsequent Page Requests
![Page 22: Mining di Dati Web](https://reader036.vdocuments.us/reader036/viewer/2022081512/568154c1550346895dc2c972/html5/thumbnails/22.jpg)
Query Caching
[Figure: query caching diagram. Labels: query “Francesca, 1”, WSE, Index, Results.]
![Page 23: Mining di Dati Web](https://reader036.vdocuments.us/reader036/viewer/2022081512/568154c1550346895dc2c972/html5/thumbnails/23.jpg)
Caching: Who Cares?!?
Successful caching of query results can:
Lower the number/cost of query executions.
Shorten the engine’s response time.
Increase the engine’s throughput.
![Page 24: Mining di Dati Web](https://reader036.vdocuments.us/reader036/viewer/2022081512/568154c1550346895dc2c972/html5/thumbnails/24.jpg)
Caching: How-To?
Caching can exploit the locality of reference in the query streams that search engines are faced with.
Query popularity follows a power law and varies widely, from the extremely popular to the very rare.
![Page 25: Mining di Dati Web](https://reader036.vdocuments.us/reader036/viewer/2022081512/568154c1550346895dc2c972/html5/thumbnails/25.jpg)
Caching: What to Measure?
Hit Ratio:
Let N be the number of requests to the WSE.
Let H be the number of hits, i.e. the number of queries that can be answered by the cache.
The Hit Ratio HR is defined as HR = H/N. It is usually expressed as a percentage.
E.g. HR = 30% means that thirty percent of the queries are satisfied using the cache.
Alternatively we could define the Miss Ratio: MR = 1 - HR = M/N, where M is the number of misses, i.e. the number of queries that cannot be answered by the cache.
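The two definitions translate directly into code. This is only a sketch of the arithmetic; the function name is hypothetical.

```python
def hit_and_miss_ratio(hits, requests):
    """Return (HR, MR) with HR = H/N and MR = 1 - HR."""
    if requests <= 0:
        raise ValueError("the number of requests N must be positive")
    if not 0 <= hits <= requests:
        raise ValueError("the number of hits H must lie between 0 and N")
    hit_ratio = hits / requests
    return hit_ratio, 1.0 - hit_ratio


# The slide's example: 30 of 100 queries are answered by the cache.
hr, mr = hit_and_miss_ratio(30, 100)
```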
![Page 26: Mining di Dati Web](https://reader036.vdocuments.us/reader036/viewer/2022081512/568154c1550346895dc2c972/html5/thumbnails/26.jpg)
What About the Throughput?
The throughput is defined as the number of queries answered per second.
Caching, in general, raises the throughput.
The lower the hit-ratio the lower the throughput.
The lower the cache response-time the higher the throughput.
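These two relationships can be checked with a deliberately simplified model that is not taken from the slides: a single server answers one query at a time, a hit costs `t_cache` seconds and a miss costs `t_engine` seconds. Both parameter names and the model itself are assumptions made for illustration.

```python
def throughput(hit_ratio, t_cache, t_engine):
    """Queries per second under a simplified single-server model (an assumption)."""
    # Mean time per query: hits are served by the cache, misses by the engine.
    mean_time = hit_ratio * t_cache + (1.0 - hit_ratio) * t_engine
    return 1.0 / mean_time


low_hr = throughput(0.3, t_cache=0.001, t_engine=0.1)
high_hr = throughput(0.5, t_cache=0.001, t_engine=0.1)
slow_cache = throughput(0.5, t_cache=0.01, t_engine=0.1)
```

In this model `high_hr` exceeds `low_hr`, and `high_hr` exceeds `slow_cache`, matching the two statements on the slide.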
![Page 27: Mining di Dati Web](https://reader036.vdocuments.us/reader036/viewer/2022081512/568154c1550346895dc2c972/html5/thumbnails/27.jpg)
Caching Complexity
The cache response time depends on the complexity of the replacement policy.
The complexity usually depends on the cache size K.
There exist policies that are:
O(1), i.e. constant: they don’t depend on the size of the cache.
O(log K).
O(N).
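As an example of a constant-time policy, LRU can be implemented so that both lookup and replacement take O(1) time on average, independently of K. The sketch below relies on Python's `OrderedDict`; the class and method names are illustrative, not taken from the slides.

```python
from collections import OrderedDict


class LRUCache:
    """LRU cache of at most `capacity` entries with O(1) average-time operations."""

    def __init__(self, capacity):
        if capacity < 1:
            raise ValueError("capacity must be at least 1")
        self.capacity = capacity
        self._entries = OrderedDict()  # first item = least recently used

    def get(self, key):
        if key not in self._entries:
            return None  # miss
        self._entries.move_to_end(key)  # mark as most recently used
        return self._entries[key]

    def put(self, key, value):
        if key in self._entries:
            self._entries.move_to_end(key)
        self._entries[key] = value
        if len(self._entries) > self.capacity:
            self._entries.popitem(last=False)  # evict the least recently used entry


cache = LRUCache(2)
cache.put("a", 1)
cache.put("b", 2)
cache.get("a")      # "a" becomes the most recently used entry
cache.put("c", 3)   # evicts "b"
```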
![Page 28: Mining di Dati Web](https://reader036.vdocuments.us/reader036/viewer/2022081512/568154c1550346895dc2c972/html5/thumbnails/28.jpg)
Is There Only Caching?
No!!!! There’s also PREFETCHING!
What’s Prefetching:
Anticipating users’ queries by exploiting query stream properties.
Uhuuuu! Sounds like a kind of “Usage Mining”!
For instance let’s have a look at the probability of subsequent page requests.
Prefetching factor p is the number of pages prefetched.
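A minimal sketch of prefetching with factor p: on a miss for page i of a query, pages i through i+p are computed and cached together. The callable `compute_page` stands in for the search backend and is hypothetical, as is the unbounded dictionary used as the cache.

```python
def fetch(cache, compute_page, query, page, p):
    """Return (results, was_hit) for (query, page), prefetching p extra pages on a miss."""
    key = (query, page)
    if key in cache:
        return cache[key], True
    # Miss: compute the requested page and the p following ones in one go.
    for i in range(page, page + p + 1):
        cache[(query, i)] = compute_page(query, i)
    return cache[key], False


calls = []


def backend(query, page):
    """Stand-in for the search engine (hypothetical)."""
    calls.append((query, page))
    return f"{query}#{page}"


cache = {}
first, first_hit = fetch(cache, backend, "pisa", 1, p=2)
second, second_hit = fetch(cache, backend, "pisa", 2, p=2)
```

The request for page 2 is served from the cache because it was prefetched when page 1 missed. A real engine would compute the p+1 pages in a single query execution rather than in a loop.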
![Page 29: Mining di Dati Web](https://reader036.vdocuments.us/reader036/viewer/2022081512/568154c1550346895dc2c972/html5/thumbnails/29.jpg)
Prefetching: PROS and CONS
Pros:
Prefetching enhances the hit ratio.
Prefetching reduces the query load on the query server.
The cost of computing p pages of results is approximately the same as that of computing only one page.
Cons:
Prefetching is very likely to load pages that will never be requested in the future.
![Page 30: Mining di Dati Web](https://reader036.vdocuments.us/reader036/viewer/2022081512/568154c1550346895dc2c972/html5/thumbnails/30.jpg)
Adaptive Prefetching
![Page 31: Mining di Dati Web](https://reader036.vdocuments.us/reader036/viewer/2022081512/568154c1550346895dc2c972/html5/thumbnails/31.jpg)
Theoretical Bounds
![Page 32: Mining di Dati Web](https://reader036.vdocuments.us/reader036/viewer/2022081512/568154c1550346895dc2c972/html5/thumbnails/32.jpg)
Some Classical Caching Policies
LRU: Least Recently Used. Evict from the cache the query results whose last access lies farthest in the past.
SLRU: Segmented LRU. Two segments:
Probationary.
Protected.
Lines in each segment are ordered from the most to the least recently accessed. Data from misses is added to the cache at the most recently accessed end of the probationary segment. Hits are removed from wherever they currently reside and added to the most recently accessed end of the protected segment. Lines in the protected segment have thus been accessed at least twice. The protected segment is finite, so migration of a line from the probationary segment to the protected segment may force the migration of the LRU line in the protected segment to the most recently used (MRU) end of the probationary segment, giving this line another chance to be accessed before being replaced.
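The SLRU behaviour described above can be sketched with two ordered dictionaries, one per segment. This is an illustrative implementation under the assumption that both segment sizes are at least 1; class and attribute names are hypothetical.

```python
from collections import OrderedDict


class SLRUCache:
    """Segmented LRU: misses enter the probationary segment, hits move to protected."""

    def __init__(self, probationary_size, protected_size):
        self.probationary_size = probationary_size
        self.protected_size = protected_size
        # In both segments the last item is the most recently used (MRU) one.
        self.probationary = OrderedDict()
        self.protected = OrderedDict()

    def get(self, key):
        if key in self.protected:
            self.protected.move_to_end(key)
            return self.protected[key]
        if key in self.probationary:
            value = self.probationary.pop(key)
            self._promote(key, value)
            return value
        return None  # miss

    def put(self, key, value):
        """Store data from a miss at the MRU end of the probationary segment."""
        if key in self.protected:
            self.protected[key] = value
            self.protected.move_to_end(key)
            return
        self.probationary[key] = value
        self.probationary.move_to_end(key)
        self._trim_probationary()

    def _promote(self, key, value):
        self.protected[key] = value
        if len(self.protected) > self.protected_size:
            # The LRU protected line gets another chance at the probationary MRU end.
            old_key, old_value = self.protected.popitem(last=False)
            self.probationary[old_key] = old_value
        self._trim_probationary()

    def _trim_probationary(self):
        while len(self.probationary) > self.probationary_size:
            self.probationary.popitem(last=False)  # evict the LRU probationary line
```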
![Page 33: Mining di Dati Web](https://reader036.vdocuments.us/reader036/viewer/2022081512/568154c1550346895dc2c972/html5/thumbnails/33.jpg)
Problems
Classical replacement policies do not care about stream characteristics.
They are not designed using usage mining investigation techniques.
They offer good performance, though!
Uhmmm…. Are you sure?!? Stay tuned!
![Page 34: Mining di Dati Web](https://reader036.vdocuments.us/reader036/viewer/2022081512/568154c1550346895dc2c972/html5/thumbnails/34.jpg)
Caching May Be Attacked from Two Directions
Architecture of the caching system: two-level caching, three-level caching, SDC.
Replacement policy: PDC, SDC.
Both: SDC.
![Page 35: Mining di Dati Web](https://reader036.vdocuments.us/reader036/viewer/2022081512/568154c1550346895dc2c972/html5/thumbnails/35.jpg)
Two-level Caching
Cache of Query Results
Cache of Inverted Lists
Both
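One way to picture the combination of the two caches is the toy sketch below: a cache of query results sits in front of a cache of inverted lists, and the index is read only when both miss. The callable `read_list`, the unbounded dictionaries, and the AND semantics are all assumptions made for illustration.

```python
class TwoLevelCache:
    """Toy two-level cache: query results first, inverted lists second."""

    def __init__(self, read_list):
        self.read_list = read_list  # hypothetical: term -> set of doc ids from the index
        self.results = {}           # level 1: query -> result set
        self.lists = {}             # level 2: term -> inverted list
        self.index_reads = 0

    def _inverted_list(self, term):
        if term not in self.lists:
            self.lists[term] = self.read_list(term)
            self.index_reads += 1
        return self.lists[term]

    def search(self, query):
        if query in self.results:
            return self.results[query]  # level-1 hit: no list processing at all
        terms = query.split()
        if terms:
            docs = set.intersection(*(self._inverted_list(t) for t in terms))
        else:
            docs = set()
        self.results[query] = docs
        return docs


index = {"web": {1, 2}, "mining": {2, 3}, "search": {1, 2, 3}}
cache = TwoLevelCache(lambda term: index.get(term, set()))
```

A repeated query is answered by the first level, while a new query that shares terms with earlier ones still avoids some index reads thanks to the second level.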
![Page 36: Mining di Dati Web](https://reader036.vdocuments.us/reader036/viewer/2022081512/568154c1550346895dc2c972/html5/thumbnails/36.jpg)
Throughput
![Page 37: Mining di Dati Web](https://reader036.vdocuments.us/reader036/viewer/2022081512/568154c1550346895dc2c972/html5/thumbnails/37.jpg)
Three-level Caching
Long, X. and Suel, T. 2005. Three-level caching for efficient query processing in large Web search engines. In Proceedings of the 14th international Conference on World Wide Web (Chiba, Japan, May 10 - 14, 2005). WWW '05. ACM Press, New York, NY, 257-266.
![Page 38: Mining di Dati Web](https://reader036.vdocuments.us/reader036/viewer/2022081512/568154c1550346895dc2c972/html5/thumbnails/38.jpg)
Probability Driven Caching
Lempel, R. and Moran, S. 2003. Predictive caching and prefetching of query results in search engines. In Proceedings of the 12th international Conference on World Wide Web (Budapest, Hungary, May 20 - 24, 2003). WWW '03. ACM Press, New York, NY, 19-28.
Thanks to Ronny for his original slides!
![Page 39: Mining di Dati Web](https://reader036.vdocuments.us/reader036/viewer/2022081512/568154c1550346895dc2c972/html5/thumbnails/39.jpg)
Static-Dynamic Caching
Tiziano Fagni, Salvatore Orlando, Raffaele Perego, Fabrizio Silvestri. “Boosting the Performance of Web Search Engines: Caching and Prefetching Query Results by Exploiting Historical Usage Data”. ACM Transactions on Information Systems. 24(1). January 2006.
Idea: Divide the cache into two sets:
Static Set.
Dynamic Set.
Fill the Static Set with the queries most frequently submitted in the past.
The Static Set is read-only: good in multithreaded architectures.
![Page 40: Mining di Dati Web](https://reader036.vdocuments.us/reader036/viewer/2022081512/568154c1550346895dc2c972/html5/thumbnails/40.jpg)
Inside SDC
Static-Dynamic Caching. The cache is divided into two sets:
Static Set: contains the results of the queries most frequently submitted so far.
Dynamic Set: implemented using a classical cache replacement policy such as LRU, SLRU, or PDC.
The Static Set size is given by fstatic*N, where 0 < fstatic < 1 is the fraction of the total entries (N) of the cache devoted to the Static Set.
Adaptive Prefetching is adopted.
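The following sketch shows the static/dynamic split with LRU as the dynamic policy. It is an illustration under stated assumptions, not the authors' implementation: adaptive prefetching is left out, `compute` stands in for the search backend, and the hit counters are added only to make the behaviour observable.

```python
from collections import Counter, OrderedDict


class SDCCache:
    """Static-Dynamic Cache sketch: read-only static set plus an LRU dynamic set."""

    def __init__(self, capacity, f_static, past_queries, compute):
        static_size = int(f_static * capacity)  # fstatic * N entries
        self.compute = compute                  # hypothetical backend: query -> results
        # Static set: results of the most frequent past queries, never modified later.
        top = Counter(past_queries).most_common(static_size)
        self.static = {query: compute(query) for query, _ in top}
        self.dynamic_size = capacity - static_size
        self.dynamic = OrderedDict()            # first item = least recently used
        self.hits = 0
        self.requests = 0

    def lookup(self, query):
        self.requests += 1
        if query in self.static:
            self.hits += 1
            return self.static[query]
        if query in self.dynamic:
            self.hits += 1
            self.dynamic.move_to_end(query)
            return self.dynamic[query]
        result = self.compute(query)            # miss: ask the engine
        if self.dynamic_size > 0:
            self.dynamic[query] = result
            if len(self.dynamic) > self.dynamic_size:
                self.dynamic.popitem(last=False)
        return result


past = ["a", "a", "a", "b", "b", "c"]
sdc = SDCCache(capacity=4, f_static=0.5, past_queries=past, compute=str.upper)
for q in ["a", "c", "c", "d", "e", "c"]:
    sdc.lookup(q)
```

Because the static set is never written after construction, concurrent readers would need no lock for it; only the dynamic set would require synchronization.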
![Page 41: Mining di Dati Web](https://reader036.vdocuments.us/reader036/viewer/2022081512/568154c1550346895dc2c972/html5/thumbnails/41.jpg)
Benefits in Real-World Caches
[Figure: SDC cache architecture. Labels: WSE, SDCCacheThread (four instances), SDCCache, StaticSet, DynamicSet, Mutex.]
![Page 42: Mining di Dati Web](https://reader036.vdocuments.us/reader036/viewer/2022081512/568154c1550346895dc2c972/html5/thumbnails/42.jpg)
SDC Performance
Linux PC: 2GHz Pentium Xeon, 1GB RAM.
Single process.
fstatic = 0.5.
No prefetching.
![Page 43: Mining di Dati Web](https://reader036.vdocuments.us/reader036/viewer/2022081512/568154c1550346895dc2c972/html5/thumbnails/43.jpg)
SDC Hit-Ratio
![Page 44: Mining di Dati Web](https://reader036.vdocuments.us/reader036/viewer/2022081512/568154c1550346895dc2c972/html5/thumbnails/44.jpg)
SDC Hit-Ratio
![Page 45: Mining di Dati Web](https://reader036.vdocuments.us/reader036/viewer/2022081512/568154c1550346895dc2c972/html5/thumbnails/45.jpg)
SDC Hit-Ratio
![Page 46: Mining di Dati Web](https://reader036.vdocuments.us/reader036/viewer/2022081512/568154c1550346895dc2c972/html5/thumbnails/46.jpg)
SDC Hit-Ratio
![Page 47: Mining di Dati Web](https://reader036.vdocuments.us/reader036/viewer/2022081512/568154c1550346895dc2c972/html5/thumbnails/47.jpg)
SDC Hit-Ratio
![Page 48: Mining di Dati Web](https://reader036.vdocuments.us/reader036/viewer/2022081512/568154c1550346895dc2c972/html5/thumbnails/48.jpg)
SDC Hit-Ratio
![Page 49: Mining di Dati Web](https://reader036.vdocuments.us/reader036/viewer/2022081512/568154c1550346895dc2c972/html5/thumbnails/49.jpg)
SDC Hit-Ratio
![Page 50: Mining di Dati Web](https://reader036.vdocuments.us/reader036/viewer/2022081512/568154c1550346895dc2c972/html5/thumbnails/50.jpg)
SDC Hit-Ratio
![Page 51: Mining di Dati Web](https://reader036.vdocuments.us/reader036/viewer/2022081512/568154c1550346895dc2c972/html5/thumbnails/51.jpg)
Why Does the Static Set Help?
![Page 52: Mining di Dati Web](https://reader036.vdocuments.us/reader036/viewer/2022081512/568154c1550346895dc2c972/html5/thumbnails/52.jpg)
Concurrent Caching
![Page 53: Mining di Dati Web](https://reader036.vdocuments.us/reader036/viewer/2022081512/568154c1550346895dc2c972/html5/thumbnails/53.jpg)
Freshness of the Training Data
How frequently should we perform mining again on the usage data?
Does the performance of Usage-Mining-based caching degrade gracefully as time goes by?
Do time-of-day patterns exist in the query stream?
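A first look at time-of-day patterns could start from a per-hour histogram of the query stream. The sketch assumes each log entry is a pair of an ISO 8601 timestamp and a query string; the slides do not show the actual timestamp format, so this layout is hypothetical.

```python
from collections import Counter
from datetime import datetime


def queries_per_hour(entries):
    """Return a 24-element list with the number of queries submitted in each hour of day."""
    counts = Counter(datetime.fromisoformat(timestamp).hour for timestamp, _ in entries)
    return [counts.get(hour, 0) for hour in range(24)]


entries = [
    ("2007-03-01T09:15:00", "pisa weather"),
    ("2007-03-01T09:45:10", "web mining"),
    ("2007-03-02T21:05:00", "news"),
]
histogram = queries_per_hour(entries)
```

Comparing such histograms across days, or computing them per topic category, is one way to see whether the stream shows stable daily patterns.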
![Page 54: Mining di Dati Web](https://reader036.vdocuments.us/reader036/viewer/2022081512/568154c1550346895dc2c972/html5/thumbnails/54.jpg)
Daily Patterns