tpdl 2016 topic_coverage_indf-bf_crawls
Comparing Topic Coverage in Breadth-first & Depth-first Crawls using Anchor Texts
Thaer Samar, Myriam C. Traub, Jacco van Ossenbruggen, Arjen P. de Vries
Web Archives
Web archives were initiated to preserve the fast-changing Web
Web archives are created using Web crawlers
Crawling strategy has a great influence on the data that is archived
Web Crawlers
Two main strategies
Depth-first, focus only on selected web sites, as deep as possible
Breadth-first, focus on the entire web, but not in depth
Our Study
Research Question:
How does the crawling strategy impact a Web archive's coverage of past popular topics?
Challenges
How to approximate user interest when no query log is available?
Scalability: Web archives or crawls consist of huge amounts of data, e.g., a one-month snapshot from Common Crawl can be 80 TB of compressed data.
Approach
We use hyperlink anchor text from the raw content of two datasets crawled using different crawling strategies
We compare anchor texts with external sources that identify past popular topics on the Web at the time of our datasets
Based on What (Anchor Text)
Hyperlinks define the Web structure
  link information: source URL, target URL, and the anchor text
Anchor text is a short text describing the target page
  exhibits characteristics similar to user queries and document titles [Eiron & McCurley, Jin et al.]
  widely used in Information Retrieval to improve search effectiveness
  combined with timestamps, has been used to
    capture & trace entity evolution [Kanhabua and Nejdl]
    uncover & provide representations of missing pages from the archive [Klein & Nelson, Huurdeman et al.]
Anchor Text (data)
Anchor text from two different crawls
Dutch Web archive by the National Library of the Netherlands (KB)
  depth-first (selective)
  +10,000 websites
  Dutch domain
  100 million crawled URLs
Common Crawl (CC)
  breadth-first
  millions of websites
  entire Web
  2.8 billion crawled URLs
Link Processing
Extraction
Cleaning
  URL normalization; get the host of the source and the target
  Clean spam, e.g., rolex watches
Aggregation
  Pre-processing: lowercasing, stop-word removal & stemming
  Aggregate based on the anchor text
Deduplication
  Remove duplicate links caused by crawling frequency
  Duplicates share the same source, target, anchor text, and year of crawl-date
[Figure: record formats in the pipeline: (SourceURL, targetURL, anchorText, crawl-date (year)); (SourceURL, targetURL, anchorText); (anchorText, count)]
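The steps on this slide can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the authors' implementation (which ran on a Hadoop cluster): the stop-word list, the spam list, the `naive_stem` placeholder, and the order in which the steps are applied are all hypothetical choices made for the example.

```python
from collections import Counter
from urllib.parse import urlsplit

# Illustrative values only; the slides do not list the real stop-words or spam terms.
STOP_WORDS = {"the", "a", "an", "of", "de", "het", "een"}
SPAM_TERMS = {"rolex watches"}


def host_of(url):
    """Return the lowercased host of a URL (a very light form of normalization)."""
    return (urlsplit(url).hostname or "").lower()


def naive_stem(token):
    """Placeholder stemmer: strips a trailing 's'. A real stemmer would replace this."""
    return token[:-1] if len(token) > 3 and token.endswith("s") else token


def preprocess(text):
    """Lowercase, drop stop-words, and stem the remaining tokens."""
    tokens = [t for t in text.lower().split() if t not in STOP_WORDS]
    return " ".join(naive_stem(t) for t in tokens)


def process_links(links):
    """links: iterable of (source_url, target_url, anchor_text, crawl_year) tuples."""
    # Deduplication: same source, target, anchor text, and crawl year count once.
    unique = set(links)
    counts = Counter()
    for source, target, anchor, _year in unique:
        # Cleaning: skip links without a usable host and links with spam anchors.
        if not host_of(source) or not host_of(target):
            continue
        if anchor.lower().strip() in SPAM_TERMS:
            continue
        key = preprocess(anchor)
        if key:
            # Aggregation: count links per pre-processed anchor text.
            counts[key] += 1
    return counts
```

The result maps each pre-processed anchor text to the number of distinct links that carry it, which corresponds to the (anchorText, count) records at the end of the pipeline.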
Anchor Text Summary (1/2)
Short texts describing target pages
Available for both archived & unarchived pages
  not all target pages exist in the archive
  e.g., 6.5% of the unique hosts of target pages of external links were crawled
Anchor Text Summary (2/2)
The depth-first crawl (KB) and the breadth-first crawl (CC) differ in terms of size & coverage
  the number of links from CC is 559 times larger than the number of KB links
  the KB dataset mainly covers the NL domain, and the CC dataset covers the entire Web
  CC contains more hosts (websites) than the KB, both source and target hosts
For Fair Comparison
Subsets from the CC dataset
  NL part: links that originate from the NL domain
  KB seeds: links whose source is in the KB seeds list
The number of links from the NL part of CC is comparable to the number of links from the KB
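As a rough sketch, the two subsets could be selected by filtering on the source host of each link. The example assumes that "NL domain" means hosts under the `.nl` top-level domain and that the KB seeds list is available as a set of host names; both are assumptions, and the function names are hypothetical.

```python
from urllib.parse import urlsplit


def source_host(url):
    """Return the lowercased host of a source URL."""
    return (urlsplit(url).hostname or "").lower()


def nl_part(links):
    """Links whose source host is under .nl (assumed meaning of 'NL domain')."""
    return [link for link in links if source_host(link[0]).endswith(".nl")]


def kb_seeds_part(links, seed_hosts):
    """Links whose source host appears in a (hypothetical) set of KB seed hosts."""
    return [link for link in links if source_host(link[0]) in seed_hosts]
```

Each link is expected to be a tuple whose first element is the source URL, as in the (SourceURL, targetURL, anchorText) records.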
Past Popular Topics (of interest to users)
Queries represent the user's information need
User queries have usually not been preserved
Impossible to go back in time and reconstruct which queries the user would have used to search the archive
Past Popular Topics (data 1/3)
Three different sources to identify past popular topics
Wikistats
  aggregated views of Wikipedia (WP) pages over time
  contains WP title, number of views, timestamp, and the language
  keep WP titles viewed >= 1,000 times
Google Trends
  top searched queries from the entire world or per country
Query log from users visiting the public Dutch historic newspaper archive
Past Popular Topics (data 2/3)
Our assumption is that the CC (a breadth-first crawl) covers more global topics, and that the KB (a depth-first crawl) covers more topics from the NL domain
For validation: split topic sources into topics that attracted attention in the entire Web (global) and topics that were only picked up in the NL domain
Past Popular Topics (data 3/3)
Wikistats
  we used the language to get the Dutch Wikipedia pages
  all other pages were labeled as global
Google Trends
  most searched queries from the entire world
  most searched queries from the Netherlands
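A minimal sketch of how the Wikistats records might be filtered and split into NL and global topics. It assumes each record is a (title, views, timestamp, language) tuple, that the view threshold is applied per record, and that Dutch Wikipedia pages carry the language code "nl"; none of these details are stated on the slides.

```python
def split_wikistats(rows, min_views=1000):
    """Split Wikistats rows into (nl_topics, global_topics) sets of titles.

    rows: iterable of (title, views, timestamp, language) tuples.
    """
    nl_topics, global_topics = set(), set()
    for title, views, _timestamp, language in rows:
        # Keep only titles viewed at least min_views times.
        if views < min_views:
            continue
        # Assumption: Dutch Wikipedia rows use the language code "nl";
        # every other language is labeled as global.
        if language == "nl":
            nl_topics.add(title)
        else:
            global_topics.add(title)
    return nl_topics, global_topics
```

The Google Trends queries need no such split, since they are already available for the entire world and for the Netherlands separately.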
Match Anchor Text to Topics
Pre-process topics from all sources
  lowercasing, stop-word removal, and stemming
Exact string matching between anchor text and topics from all sources
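Exact matching reduces to a set intersection once anchor texts and topics have gone through the same pre-processing. The sketch below assumes that coverage is measured as the fraction of topics for which an identical anchor text exists; the slides report a percentage of coverage but do not spell out the formula, so treat this definition as an assumption.

```python
def coverage(anchor_texts, topics):
    """Return the fraction of topics that exactly match some anchor text.

    Both inputs are expected to be pre-processed in the same way already.
    """
    topics = set(topics)
    if not topics:
        return 0.0
    matched = topics & set(anchor_texts)
    return len(matched) / len(topics)
```

For example, if two out of four topics appear verbatim among the anchor texts, `coverage` returns 0.5.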
Topic Coverage (1/3)
The breadth-first crawl (CC) covers more topics than the depth-first crawl (KB) for all topic sources
  not only for global topics (as expected)
  but also for topics from the NL domain!
Topic Coverage (2/3)
Subsets from the CC dataset have comparable results
[Figure panels: NL part; KB seeds]
Topic Coverage (3/3)
Influence of anchor text popularity in the archive
  rank anchor texts based on their frequency
  a high percentage of anchor texts occur only once
  match again to topics at different rank cut-offs
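The rank cut-off analysis can be sketched as follows: anchor texts are ranked by frequency, and the matching is repeated against only the top-k anchor texts for several values of k. The function name, the cut-off values, and the definition of coverage as the fraction of matched topics are assumptions made for this example.

```python
from collections import Counter


def coverage_at_cutoffs(anchor_counts, topics, cutoffs):
    """Return {k: fraction of topics matched by the k most frequent anchor texts}."""
    topics = set(topics)
    # Rank anchor texts from most to least frequent.
    ranked = [anchor for anchor, _count in Counter(anchor_counts).most_common()]
    result = {}
    for k in cutoffs:
        top_k = set(ranked[:k])
        result[k] = len(topics & top_k) / len(topics) if topics else 0.0
    return result
```

With counts such as `{"world cup": 50, "home": 40, "ebola": 3, "contact": 1}` and the topics `{"world cup", "ebola"}`, the topic "ebola" is only matched once the cut-off reaches rank 3, which illustrates how coverage can depend on anchor text frequency.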
Anchor text combined with timestamps can be used to find past popular topics
  the % of coverage varies across the sources
  we find a relation between anchor text frequency and the percentage of coverage
Breadth-first (CC) covers more topics globally and from the NL domain
Subsets from CC
  the NL part covers more topics
  the KB seeds part shows comparable results
References
[Eiron & McCurley] Nadav Eiron and Kevin S. McCurley. Analysis of anchor text for web search. In SIGIR, 2003.
[Jin et al.] Rong Jin, Alexander G. Hauptmann, and ChengXiang Zhai. Title language model for information retrieval. In SIGIR, 2002.
[Kanhabua and Nejdl] Nattiya Kanhabua and Wolfgang Nejdl. On the value of temporal anchor texts in wikipedia. In SIGIR Workshop on Temporal, Social and Spatially-aware Information Access, 2014.
[Klein & Nelson] Martin Klein and Michael L. Nelson. Moved but not gone: an evaluation of real-time methods for discovering replacement web pages. Int. J. on Digital Libraries, 2014.
[Huurdeman et al.] Hugo C. Huurdeman, Jaap Kamps, Thaer Samar, Arjen P. de Vries, Anat Ben-David, and Richard A. Rogers. Lost but not forgotten: finding pages on the unarchived web. Int. J. on Digital Libraries, 2015.
[CommonCrawl] https://commoncrawl.org/
[WikiStats] http://wikistats.ins.cwi.nl/
We would like to thank
[logo] for making the Web archive data available to us
and
[logo] for giving us access to their Hadoop cluster