tpdl 2016 topic_coverage_indf-bf_crawls

Comparing Topic Coverage in Breadth-first & Depth-first Crawls using Anchor Texts

Thaer Samar, Myriam C. Traub, Jacco van Ossenbruggen, Arjen P. de Vries

Web Archives

Web archives were initiated to preserve the fast-changing Web

Web archives are created using Web crawlers

Crawling strategy has a great influence on the data that is archived

Web Crawlers

Two main strategies

Depth-first: focus only on selected websites, as deep as possible

Breadth-first: focus on the entire Web, but not in depth
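To make the contrast concrete, here is a minimal sketch of the two strategies on a toy link graph. The URLs, the `LINKS` graph, and the `crawl` function with its `max_depth` and `stay_on_seed_hosts` parameters are all hypothetical and only illustrate the idea; they are not the crawlers used by KB or Common Crawl. "Depth-first" is read here as "stay on the selected sites, with no depth limit", and "breadth-first" as "follow links to any host, but only a few hops deep".

```python
from collections import deque
from urllib.parse import urlsplit

# Toy link graph (hypothetical URLs): page -> pages it links to.
LINKS = {
    "http://a.nl/": ["http://a.nl/1", "http://b.com/"],
    "http://a.nl/1": ["http://a.nl/2"],
    "http://a.nl/2": ["http://a.nl/3"],
    "http://a.nl/3": [],
    "http://b.com/": ["http://b.com/1", "http://c.org/"],
    "http://b.com/1": ["http://b.com/2"],
    "http://b.com/2": [],
    "http://c.org/": [],
}


def crawl(seeds, max_depth=None, stay_on_seed_hosts=False):
    """Return the set of pages reached from the seeds.

    max_depth=None means no depth limit; stay_on_seed_hosts=True
    restricts the crawl to the hosts of the seed URLs.
    """
    seed_hosts = {urlsplit(s).hostname for s in seeds}
    seen = set(seeds)
    queue = deque((s, 0) for s in seeds)
    while queue:
        url, depth = queue.popleft()
        if max_depth is not None and depth >= max_depth:
            continue  # do not expand pages at the depth limit
        for target in LINKS.get(url, []):
            if stay_on_seed_hosts and urlsplit(target).hostname not in seed_hosts:
                continue  # selective crawl: ignore other sites
            if target not in seen:
                seen.add(target)
                queue.append((target, depth + 1))
    return seen


# Selective, deep crawl of one site versus a shallow crawl across sites.
selective = crawl(["http://a.nl/"], stay_on_seed_hosts=True)
broad = crawl(["http://a.nl/"], max_depth=2)
```

In this toy graph the selective crawl reaches the deepest page of `a.nl` but never leaves that host, while the shallow crawl touches three hosts and misses `http://a.nl/3`.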

Our Study

Research Question:

How does the crawling strategy impact a Web archive's coverage of past popular topics?

Challenges

How to approximate user interest when no query log is available?

Scalability: Web archives or crawls consist of huge amounts of data, e.g., a one-month snapshot from Common Crawl can be 80 TB of compressed data

Approach

We use hyperlink anchor text from the raw content of two datasets crawled using different crawling strategies

We compare anchor texts with external sources that identify past popular topics on the Web at the time of our datasets

Based on What (Anchor Text)

Hyperlinks define the Web structure
  link information: source URL, target URL, and the anchor text

Anchor text
  is a short text describing the target page
  exhibits characteristics similar to user queries and document titles [Eiron & McCurley, Jin et al.]
  is widely used in Information Retrieval to improve search effectiveness
  combined with timestamps, has been used to
    capture & trace entity evolution [Kanhabua and Nejdl]
    uncover & provide representations of missing pages from the archive [Klein & Nelson, Huurdeman et al.]
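As an illustration of the link information described above, the following sketch pulls (source URL, target URL, anchor text) triples out of raw HTML using only the Python standard library. The class name `AnchorExtractor`, the helper `extract_links`, and the sample page are assumptions made for this example; the slides do not say which parser was used.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class AnchorExtractor(HTMLParser):
    """Collect (source URL, target URL, anchor text) for each <a href>."""

    def __init__(self, source_url):
        super().__init__()
        self.source_url = source_url
        self.links = []
        self._href = None   # target of the <a> currently open, if any
        self._text = []     # text fragments seen inside that <a>

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                # Resolve relative targets against the source page.
                self._href = urljoin(self.source_url, href)
                self._text = []

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            # Collapse runs of whitespace in the anchor text.
            anchor = " ".join("".join(self._text).split())
            if anchor:
                self.links.append((self.source_url, self._href, anchor))
            self._href = None


def extract_links(source_url, html):
    parser = AnchorExtractor(source_url)
    parser.feed(html)
    parser.close()
    return parser.links


page = ('<p>See <a href="/nieuws">Latest  <b>news</b></a> and '
        '<a href="http://example.org/">Example</a>.</p>')
links = extract_links("http://example.nl/index.html", page)
```

Links without visible text (for example image-only links) are skipped here, which is a simplification.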

Anchor Text (data)

Anchor text from two different crawls

Dutch Web archive by the National Library of the Netherlands (KB)
  depth-first (selective)
  10,000+ websites
  Dutch domain
  100 million crawled URLs

Common Crawl (CC)
  breadth-first
  millions of websites
  entire Web
  2.8 billion crawled URLs

Link Processing

Extraction

Cleaning
  URL normalization; get the host of the source and the target
  Clean spam, e.g., "rolex watches"

Aggregation
  Pre-processing: lowercasing, stop-word removal & stemming
  Aggregate based on the anchor text

Deduplication
  Remove duplicate links caused by crawling frequency
  Duplicates share the same source, target, anchor text, and year of crawl date

[Diagram labels: sourceURL, targetURL, anchorText; anchorText, count; sourceURL, targetURL, anchorText, crawl-date (year)]
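The processing steps above can be sketched roughly as follows. This is an illustration under stated assumptions, not the authors' pipeline: the `normalize` rule (lowercase scheme and host, drop the fragment), the tiny `STOP_WORDS` and `SPAM_TERMS` sets, and the function names are all hypothetical, and stemming is left out to keep the example free of third-party dependencies.

```python
from collections import Counter
from urllib.parse import urlsplit, urlunsplit

STOP_WORDS = {"the", "a", "of", "de", "het", "een"}  # assumed, tiny list
SPAM_TERMS = {"rolex watches"}                       # assumed blacklist


def normalize(url):
    """Assumed normalization: lowercase scheme/host, drop the fragment."""
    parts = urlsplit(url)
    return urlunsplit((parts.scheme.lower(), parts.netloc.lower(),
                       parts.path or "/", parts.query, ""))


def preprocess(anchor):
    """Lowercase and remove stop words (stemming omitted in this sketch)."""
    return " ".join(t for t in anchor.lower().split() if t not in STOP_WORDS)


def process(links):
    """links: iterable of (source_url, target_url, anchor_text, crawl_year)."""
    unique = set()
    for source, target, anchor, year in links:
        if anchor.lower().strip() in SPAM_TERMS:
            continue  # cleaning: drop known spam anchors
        # Deduplication key: same source, target, anchor text, crawl year.
        unique.add((normalize(source), normalize(target), anchor.strip(), year))
    # Aggregation: count links per pre-processed anchor text.
    counts = Counter()
    for _, _, anchor, _ in unique:
        text = preprocess(anchor)
        if text:
            counts[text] += 1
    return counts


links = [
    ("http://A.nl/x", "http://b.nl/", "De Volkskrant", 2014),
    ("http://a.nl/x", "http://b.nl/#top", "De Volkskrant", 2014),  # duplicate
    ("http://c.nl/", "http://b.nl/", "volkskrant", 2014),
    ("http://c.nl/", "http://spam.example/", "Rolex Watches", 2014),
]
anchor_counts = process(links)
```

The sketch deduplicates before aggregating so that a page crawled several times in the same year does not inflate the count of its anchor texts.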

Anchor Text Summary (1/2)

Short texts describing target pages

Available for both archived & unarchived pages
  not all target pages exist in the archive
  e.g., 6.5% of the unique hosts of target pages of external links were crawled

Anchor Text Summary (2/2)

The depth-first crawl (KB) and the breadth-first crawl (CC) differ in terms of size & coverage
  the number of links from CC is 559 times larger than the number of KB links

KB dataset mainly covers the NL domain, and CC dataset covers the entire Web

CC contains more hosts (websites) than the KB, both source and target hosts

For Fair Comparison

Subsets from the CC dataset
  NL part: we focus on links that originate from the NL domain
  KB seeds: links whose source is on the KB seeds list

The number of links from the NL part of CC is comparable to the number of links from the KB
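The two subsets could be selected with filters along these lines. This is a sketch that assumes "NL domain" means a source host under the `.nl` top-level domain and that the seeds list is matched by host name; the slides do not spell out either rule, and `nl_part`, `kb_seeds_part`, and the sample data are hypothetical.

```python
from urllib.parse import urlsplit


def host(url):
    """Lowercased host name of a URL ('' if it has none)."""
    return (urlsplit(url).hostname or "").lower()


def nl_part(links):
    """Keep links whose source host is under .nl (assumed definition)."""
    return [link for link in links if host(link[0]).endswith(".nl")]


def kb_seeds_part(links, seed_hosts):
    """Keep links whose source host appears in the given seed hosts."""
    return [link for link in links if host(link[0]) in seed_hosts]


cc_links = [
    ("http://www.example.nl/a", "http://example.org/", "example"),
    ("http://example.com/", "http://www.example.nl/", "dutch site"),
    ("http://archief.example.nl/", "http://example.com/", "partner"),
]
nl_links = nl_part(cc_links)
seed_links = kb_seeds_part(cc_links, {"www.example.nl"})
```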

Past Popular Topics (of interests to users)

Queries represent the user's information need

User queries have usually not been preserved

Impossible to go back in time and reconstruct which queries the user would have used to search the archive

Past Popular Topics (data 1/3)

Three different sources to identify past popular topics

Wikistats
  aggregation of views of Wikipedia (WP) pages over time
  contains WP title, number of views, timestamp, and language
  keep WP titles viewed >= 1,000 times

Google Trends
  top searched queries from the entire world or per country

Query log from users visiting the public Dutch historic newspaper archive


Our assumption is that CC (a breadth-first crawl) covers more global topics, and that KB (a depth-first crawl) covers more topics from the NL domain

For validation: split topic sources into topics that attracted attention on the entire Web (global) and topics that were only picked up in the NL domain


Wikistats
  we used the language to get the Dutch Wikipedia pages
  all other pages we labeled as global

Google Trends
  most searched queries from the entire world
  most searched queries from the Netherlands
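For the Wikistats source, the split into NL and global topics might look like the sketch below. The record layout `(title, views, timestamp, language)`, the language code `"nl"`, and the choice to apply the 1,000-view threshold per record are assumptions for illustration; the actual Wikistats file format and aggregation level may differ.

```python
def split_wikistats(records, min_views=1000):
    """Split page titles into (nl_topics, global_topics).

    records: iterable of (title, views, timestamp, language) tuples,
    an assumed layout. Titles with fewer than min_views are dropped.
    """
    nl_topics, global_topics = set(), set()
    for title, views, _timestamp, language in records:
        if views < min_views:
            continue
        if language == "nl":
            nl_topics.add(title)       # Dutch Wikipedia page
        else:
            global_topics.add(title)   # every other language edition
    return nl_topics, global_topics


records = [
    ("Koningsdag", 2500, "2014-04", "nl"),
    ("World Cup", 90000, "2014-06", "en"),
    ("Zeldzame pagina", 12, "2014-04", "nl"),
]
nl_topics, global_topics = split_wikistats(records)
```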

Match Anchor Text to Topics

Pre-process topics from all sources
  lowercasing, stop-word removal, and stemming

Exact string matching between anchor text and topics from all sources
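Exact matching after pre-processing reduces to a set intersection. The sketch below assumes that coverage is measured as the percentage of topics that also occur as an anchor text, which is one plausible reading of the slides; `preprocess`, `coverage`, and the tiny stop-word list are hypothetical, and stemming is again omitted.

```python
STOP_WORDS = {"the", "a", "of", "de", "het", "een"}  # assumed, tiny list


def preprocess(text):
    """Lowercase and remove stop words (stemming omitted in this sketch)."""
    return " ".join(t for t in text.lower().split() if t not in STOP_WORDS)


def coverage(anchor_texts, topics):
    """Percentage of topics that exactly match some anchor text."""
    topic_set = {preprocess(t) for t in topics}
    topic_set.discard("")
    if not topic_set:
        return 0.0
    anchor_set = {preprocess(a) for a in anchor_texts}
    return 100.0 * len(topic_set & anchor_set) / len(topic_set)


anchors = ["De Volkskrant", "weer", "contact"]
topics = ["volkskrant", "Het Weer", "ajax", "koningsdag"]
pct = coverage(anchors, topics)
```

Applying the same pre-processing to both sides matters: without it, "Het Weer" and "weer" would not match exactly.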

Topic Coverage (1/3)

The breadth-first crawl (CC) covers more topics than the depth-first crawl (KB) for all topic sources
  not only for global topics (as expected)
  but also for topics from the NL domain!


Subsets from the CC dataset have comparable results

[Figure panels: NL part; KB seeds]


Influence of anchor text popularity in the archive
  rank anchor texts based on their frequency
  a high percentage of anchor texts occur only once
  match again to topics at different rank cut-offs
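The rank cut-off analysis can be sketched as follows: sort anchor texts by frequency, keep the top k, and recompute the share of topics matched. The function name `coverage_at_cutoffs`, the tie-breaking rule (alphabetical), and the sample counts are assumptions for illustration only.

```python
def coverage_at_cutoffs(anchor_counts, topics, cutoffs):
    """Map each cut-off k to the % of topics found in the top-k anchors.

    anchor_counts: dict of pre-processed anchor text -> frequency.
    Ties in frequency are broken alphabetically (an arbitrary choice).
    """
    ranked = [text for text, _ in
              sorted(anchor_counts.items(), key=lambda kv: (-kv[1], kv[0]))]
    topic_set = set(topics)
    result = {}
    for k in cutoffs:
        top_k = set(ranked[:k])
        result[k] = 100.0 * len(topic_set & top_k) / len(topic_set)
    return result


counts = {"nieuws": 50, "weer": 20, "ajax": 5, "obscure page": 1}
by_cutoff = coverage_at_cutoffs(counts, {"weer", "ajax"}, [1, 2, 4])

# Share of anchor texts that occur exactly once.
singleton_share = sum(1 for c in counts.values() if c == 1) / len(counts)
```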

Anchor text combined with timestamps can be used to find past popular topics
  the percentage of coverage varies across the sources
  we find a relation between anchor text frequency and the percentage of coverage

Breadth-first (CC) covers more topics globally and from the NL domain

Subsets from CC
  the NL part covers more topics
  the KB seeds part shows comparable results

References

[Eiron & McCurley] Nadav Eiron and Kevin S. McCurley. Analysis of anchor text for web search. In SIGIR, 2003.

[Jin et al.] Rong Jin, Alexander G. Hauptmann, and ChengXiang Zhai. Title language model for information retrieval. In SIGIR, 2002.

[Kanhabua and Nejdl] Nattiya Kanhabua and Wolfgang Nejdl. On the value of temporal anchor texts in Wikipedia. In SIGIR Workshop on Temporal, Social and Spatially-aware Information Access, 2014.

[Klein & Nelson] Martin Klein and Michael L. Nelson. Moved but not gone: an evaluation of real-time methods for discovering replacement web pages. Int. J. on Digital Libraries, 2014.

[Huurdeman et al.] Hugo C. Huurdeman, Jaap Kamps, Thaer Samar, Arjen P. de Vries, Anat Ben-David, and Richard A. Rogers. Lost but not forgotten: finding pages on the unarchived web. Int. J. on Digital Libraries, 2015.

[CommonCrawl] https://commoncrawl.org/

[WikiStats] http://wikistats.ins.cwi.nl/

We would like to thank

for making the Web archive data available for us

And

For giving us access to their Hadoop cluster
