Synchronicity Real Time Recovery of Missing Web Pages Martin Klein mklein@cs.odu.edu Introduction to Digital Libraries Week 14 CS 751 Spring 2011 04/12/2011



Synchronicity

Real Time Recovery of Missing Web Pages

Martin Klein
mklein@cs.odu.edu

Introduction to Digital Libraries
Week 14

CS 751 Spring 2011
04/12/2011


Who are you again?

• Ph.D. student w/ MLN since 2005
• Diagnostic exam in 2006, dissertation proposal in 2008
• 17 publications to date
• Outstanding RA award, CS dept
• CoS dissertation fellowship
• 3 ACM SIGWEB + 2 misc travel grants
• CS595 (S10) & CS518 (F10)


The Problem

http://www.jcdl2007.org

http://www.jcdl2007.org/JCDL2007_Program.pdf


The Problem

• Web users experience 404 errors
  • expected lifetime of a web page is 44 days [Kahle97]
  • 2% of web disappears every week [Fetterly03]
• Are they really gone? Or just relocated?
  • has anybody crawled and indexed it?
  • do Google, Yahoo!, Bing or the IA have a copy of that page?
• Information retrieval techniques needed to (re-)discover content


The Environment

Web Infrastructure (WI) [McCown07]

• Web search engines (Google, Yahoo!, Bing) and their caches
• Web archives (Internet Archive)
• Research projects (CiteSeer)


Refreshing and Migration in the WI

Digital preservation happens in the WI

• Google Scholar
• CiteSeerX
• Internet Archive: http://waybackmachine.org/*/http:/techreports.larc.nasa.gov/ltrs/PDF/tm109025.pdf


URI – Content Mapping Problem

1. same URI maps to same or very similar content at a later time
   (time A: U1 → C1; time B: U1 → C1)
2. same URI maps to different content at a later time
   (time A: U1 → C1; time B: U1 → C2)
3. different URI maps to same or very similar content at the same or at a later time
   (time A: U1 → C1; time B: U2 → C1, U1 → 404)
4. the content can not be found at any URI
   (time A: U1 → C1; time B: U1 → ???)


Content Similarity

JCDL 2005
http://www.jcdl2005.org/ (July 2005)
http://www.jcdl2005.org/ (Today)


Content Similarity

Hypertext 2006
http://www.ht06.org/ (August 2006)
http://www.ht06.org/ (Today)


Content Similarity

PSP 2003
http://www.pspcentral.org/events/annual_meeting_2003.html (August 2003)
http://www.pspcentral.org/events/archive/annual_meeting_2003.html (Today)


Content Similarity

ECDL 1999
http://www-rocq.inria.fr/EuroDL99/ (October 1999)
http://www.informatik.uni-trier.de/~ley/db/conf/ercimdl/ercimdl99.html (Today)


Content Similarity

Greynet 1999
http://www.konbib.nl/infolev/greynet/2.5.htm (1999)
? ? (Today)


Lexical Signatures (LSs)

• First introduced by Phelps and Wilensky [Phelps00]
• Small set of terms capturing “aboutness” of a document, “lightweight” metadata

[Figure labels: Resource, Abstract, LS, Removal, Hit Rate, Proxy, Cache, Google, Yahoo, Bing]


Generation of Lexical Signatures

• Following TF-IDF scheme first introduced by Spärck Jones and Robertson [Jones88]
• Term frequency (TF):
  – “How often does this word appear in this document?”
• Inverse document frequency (IDF):
  – “In how many documents does this word appear?”
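As a rough illustration of the TF-IDF idea on this slide, the sketch below ranks the terms of one document against a small in-memory corpus and keeps the top k as a lexical signature. The function names, the tokenizer, and the choice of IDF = log(N / DF) are assumptions made for this example; the slides do not specify the exact weighting used.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase a string and split it into alphanumeric terms."""
    return re.findall(r"[a-z0-9]+", text.lower())


def lexical_signature(document, corpus, k=5):
    """Return the k terms of `document` with the highest TF-IDF score.

    `corpus` is a list of strings standing in for the document collection.
    IDF is computed as log(N / DF), one common variant (an assumption here).
    """
    tf = Counter(tokenize(document))
    n_docs = len(corpus)
    # DF: number of corpus documents that contain the term at least once.
    df = Counter()
    for other in corpus:
        df.update(set(tokenize(other)))
    scores = {}
    for term, count in tf.items():
        # Terms unseen in the corpus get DF = 1 to avoid division by zero.
        idf = math.log(n_docs / max(df[term], 1))
        scores[term] = count * idf
    # Sort by score (descending), then alphabetically for a stable order.
    ranked = sorted(scores, key=lambda t: (-scores[t], t))
    return ranked[:k]
```

With a toy corpus of three short strings, a term that occurs in every document (IDF of zero) drops to the bottom of the ranking, while rare terms rise to the top.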


LS as Proposed by Phelps and Wilensky

• “Robust Hyperlink”
• 5 terms are suitable
• Append LS to URL
  http://www.cs.berkeley.edu/~wilensky/NLP.html?lexical-signature=texttiling+wilensky+disambiguation+subtopic+iago
• Limitations:
  1. Applications (browsers) need to be modified to exploit LSs
  2. LSs need to be computed a priori
  3. Works well with most URLs but not with all of them
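The "append LS to URL" step can be sketched with Python's standard `urllib.parse` module. The helper names `append_signature` and `extract_signature` are hypothetical; only the `lexical-signature` parameter name and the example URL come from the slide.

```python
from urllib.parse import parse_qs, urlencode, urlsplit


def append_signature(url, terms):
    """Append a lexical signature to `url` as a query parameter."""
    separator = "&" if urlsplit(url).query else "?"
    # urlencode encodes the spaces between terms as "+".
    return url + separator + urlencode({"lexical-signature": " ".join(terms)})


def extract_signature(url):
    """Return the signature terms carried by a robust hyperlink, if any."""
    values = parse_qs(urlsplit(url).query).get("lexical-signature", [])
    return values[0].split() if values else []
```

A client that understands the parameter could call `extract_signature` when the URL returns a 404 and use the terms as a search engine query, which is the browser modification the first limitation refers to.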


Generation of Lexical Signatures

• Park et al. [Park03] investigated performance of various LS generation algorithms
• Evaluated “tunability” of TF and IDF component
• Weight on TF increases recall (completeness)
• Weight on IDF improves precision (exactness)


Lexical Signatures -- Examples

Rank/Results   URL                                 LS
1/243          http://endeavour.cs.berkeley.edu/   endeavour 94720-1776 achieve inter-endeavour amplifies
1/1,930        http://www.jcdl2005.org             jcdl2005 libraries conference cyberinfrastructure jcdl
1/25,900       http://www.loc.gov                  celebrate knowledge webcasts kluge library


Synchronicity

404 error occurs while browsing
look for same or older page in WI (1)
if user satisfied
    return page (2)
else
    generate LS from retrieved page (3)
    query SEs with LS
    if result sufficient
        return “good enough” alternative page (4)
    else
        get more input about desired content (5)
        (link neighborhood, user input,...)
        re-generate LS && query SEs ...
        return pages (6)

The system may not return any results at all
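The numbered steps of the workflow can be written as a small Python function. This is only a sketch: every callable passed in (`find_copy`, `user_accepts`, `make_signature`, `search`, `more_input`) is a hypothetical stand-in for a component the slides do not define, and "result sufficient" is simplified here to "the result list is not empty".

```python
def recover_page(uri, find_copy, user_accepts, make_signature, search, more_input):
    """Sketch of the recovery steps; all callables are hypothetical stand-ins."""
    copy = find_copy(uri)                  # (1) same or older page in the WI
    if copy is not None and user_accepts(copy):
        return [copy]                      # (2) user is satisfied
    if copy is not None:
        signature = make_signature(copy)   # (3) LS from the retrieved page
        results = search(signature)
        if results:                        # assumption: non-empty means sufficient
            return results                 # (4) "good enough" alternative page
    extra = more_input(uri)                # (5) link neighborhood, user input, ...
    if extra is None:
        return []                          # nothing left to build a query from
    return search(make_signature(extra))   # (6) may still be an empty list
```

The empty list in the last two branches mirrors the remark that the system may not return any results at all.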


Synchro…What?

Synchronicity
• Experience of causally unrelated events occurring together in a meaningful manner
• Events reveal underlying pattern, framework bigger than any of the synchronous systems
• Carl Gustav Jung (1875-1961)
  • “meaningful coincidence”
  • Deschamps – de Fontgibu plum pudding example

picture from http://www.crystalinks.com/jung.html


404 Errors


404 Errors


“Soft 404” Errors


“Soft 404” Errors


A Comparison of Techniques for Estimating IDF Values to Generate Lexical Signatures for the Web

(WIDM 2008)


The Problem

• LSs are usually generated following the TF-IDF scheme
• TF rather trivial to compute
• IDF requires knowledge about:
  • overall size of the corpus (# of documents)
  • # of documents a term occurs in
• Not complicated to compute for bounded corpora (such as TREC)
• If the web is the corpus, values can only be estimated


The Idea

• Use IDF values obtained from
  1. Local collection of web pages
  2. “screen scraping” SE result pages
• Validate both methods through comparison to baseline
• Use Google N-Grams as baseline
• Note: N-Grams provide term count (TC) and not DF values – details to come


Accurate IDF Values for LSs

Screen scraping the Google web interface


The Dataset

Local universe consisting of copies of URLs from the IA between 1996 and 2007


The Dataset

Same as above, follows Zipf distribution

• 10,493 observations
• 254,384 total terms
• 16,791 unique terms


The Dataset

Total terms vs new terms


LSs Example

Based on all 3 methods
URL: http://www.perfect10wines.com
Year: 2007
Union: 12 unique terms


Comparing LSs

1. Normalized term overlap
   • Assume term commutativity
   • k-term LSs normalized by k
2. Kendall Tau
   • Modified version since LSs to compare may contain different terms
3. M-Score
   • Penalizes discordance in higher ranks
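The first measure, normalized term overlap, is simple enough to sketch directly. Term order is ignored (the commutativity assumption from the slide) and the count of shared terms is divided by k. Using the longer of the two lists as k when the lengths differ is an assumption of this example.

```python
def normalized_overlap(ls_a, ls_b):
    """Share of terms two k-term LSs have in common, ignoring term order."""
    k = max(len(ls_a), len(ls_b))
    if k == 0:
        return 0.0
    return len(set(ls_a) & set(ls_b)) / k
```

Two 5-term signatures that share three terms therefore score 3/5, regardless of where those terms are ranked. The Kendall Tau and M-Score variants are rank-sensitive and are not reproduced here because the slides do not give their exact definitions.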


Comparing LSs

Top 5, 10 and 15 terms

LC – local universe
SC – screen scraping
NG – N-Grams


Conclusions

• Both methods for the computation of IDF values provide accurate results
  • compared to the Google N-Gram baseline
• Screen scraping method seems preferable since
  • similarity scores slightly higher
  • feasible in real time


Correlation of Term Count and Document Frequency for Google N-Grams

(ECIR 2009)


The Problem

• Need for a reliable source to accurately compute IDF values of web pages (in real time)
• Screen scraping has been shown to work, but
  • validation of the baseline (Google N-Grams) is missing
  • N-Grams seem suitable (recently created, based on web pages) but provide TC and not DF: what is their relationship?


Background & Motivation

• Term frequency (TF) – inverse document frequency (IDF) is a well known term weighting concept
  • Used (among others) to generate lexical signatures (LSs)
• TF is not hard to compute; IDF is, since it depends on global knowledge about the corpus
  When the entire web is the corpus, IDF can only be estimated!
• Most text corpora provide term count values (TC)

D1 = “Please, Please Me”
D2 = “Can’t Buy Me Love”
D3 = “All You Need Is Love”
D4 = “Long, Long, Long”

TC >= DF but is there a correlation? Can we use TC to estimate DF?

Term  All  Buy  Can’t  Is  Love  Me  Need  Please  You  Long
TC    1    1    1      1   2     2   1     2       1    3
DF    1    1    1      1   2     2   1     1       1    1
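The TC and DF rows for the four song titles can be reproduced with a few lines of Python. The tokenizer (lowercasing, keeping apostrophes) is an assumption of this sketch, and the titles use a plain ASCII apostrophe.

```python
import re
from collections import Counter

titles = [
    "Please, Please Me",
    "Can't Buy Me Love",
    "All You Need Is Love",
    "Long, Long, Long",
]


def terms(text):
    """Lowercase a title and split it into words, keeping apostrophes."""
    return re.findall(r"[a-z']+", text.lower())


term_count = Counter()     # TC: total occurrences across all documents
document_freq = Counter()  # DF: number of documents containing the term
for title in titles:
    words = terms(title)
    term_count.update(words)
    document_freq.update(set(words))
```

"Please" and "Long" are the two terms where TC exceeds DF, because they repeat inside a single title; for every other term the two counts coincide.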


The Idea

• Investigate relationship between:
  • TC and DF within the Web as Corpus (WaC)
  • WaC based TC and Google N-Gram based TC
• TREC, BNC could be used but:
  • they are not free
  • TREC has been shown to be somewhat dated [Chiang05]


The Experiment

• Analyze correlation of list of terms ordered by their TC and DF rank by computing:
  • Spearman’s Rho
  • Kendall Tau
• Display frequency of TC/DF ratio for all terms
• Compare TC (WaC) and TC (N-Grams) frequencies
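Both rank correlation measures can be sketched with the standard library alone. The version below is a simplified illustration, not the exact procedure used in the experiment: Spearman's rho is computed as the Pearson correlation of average ranks, and Kendall's tau is the plain tau-a variant without a correction for ties. It assumes neither input list is constant.

```python
from itertools import combinations
from math import sqrt


def average_ranks(values):
    """Rank values from 1 (smallest) upward; tied values share their mean rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for pos in range(i, j + 1):
            ranks[order[pos]] = mean_rank
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)


def spearman_rho(x, y):
    """Spearman's rho: Pearson correlation of the average ranks."""
    return pearson(average_ranks(x), average_ranks(y))


def kendall_tau_a(x, y):
    """Kendall's tau-a: (concordant - discordant) pairs over all pairs."""
    concordant = discordant = 0
    for i, j in combinations(range(len(x)), 2):
        product = (x[i] - x[j]) * (y[i] - y[j])
        if product > 0:
            concordant += 1
        elif product < 0:
            discordant += 1
    return (concordant - discordant) / (len(x) * (len(x) - 1) / 2)
```

Feeding in the TC and DF values of the same terms, in the same order, yields a value near 1 when both counts rank the terms alike.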


Experiment Results

Investigate correlation between TC and DF within “Web as Corpus” (WaC)

Rank similarity of all terms


Experiment Results

Investigate correlation between TC and DF within “Web as Corpus” (WaC)

Spearman’s ρ and Kendall τ


Experiment Results

Rank  WaC-DF      WaC-TC      Google      N-Grams
1     IR          IR          IR          IR
2     RETRIEVAL   RETRIEVAL   RETRIEVAL   IRSG
3     IRSG        IRSG        IRSG        RETRIEVAL
4     BCS         IRIT        CONFERENCE  BCS
5     IRIT        BCS         BCS         EUROPEAN
6     CONFERENCE  2009        GRANT       CONFERENCE
7     GOOGLE      FILTERING   IRIT        IRIT
8     2009        GOOGLE      FILTERING   GOOGLE
9     FILTERING   CONFERENCE  EUROPEAN    ACM
10    GRANT       ARIA        PAPERS      GRANT

Google: screen scraping DF values from the Google web interface

Top 10 terms in decreasing order of their TF/IDF values taken from http://ecir09.irit.fr

∪ = 14, ∩ = 6

Strong indicator that TC can be used to estimate DF for web pages!
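The union and intersection sizes quoted on this slide can be checked from the four top-10 lists with Python sets. The dictionary name `top10` is chosen for this example; the terms are copied from the table.

```python
top10 = {
    "WaC-DF": ["IR", "RETRIEVAL", "IRSG", "BCS", "IRIT",
               "CONFERENCE", "GOOGLE", "2009", "FILTERING", "GRANT"],
    "WaC-TC": ["IR", "RETRIEVAL", "IRSG", "IRIT", "BCS",
               "2009", "FILTERING", "GOOGLE", "CONFERENCE", "ARIA"],
    "Google": ["IR", "RETRIEVAL", "IRSG", "CONFERENCE", "BCS",
               "GRANT", "IRIT", "FILTERING", "EUROPEAN", "PAPERS"],
    "N-Grams": ["IR", "IRSG", "RETRIEVAL", "BCS", "EUROPEAN",
                "CONFERENCE", "IRIT", "GOOGLE", "ACM", "GRANT"],
}

# Terms that appear in at least one list, and terms shared by all four.
union = set().union(*top10.values())
intersection = set.intersection(*(set(terms) for terms in top10.values()))
```

The six shared terms are IR, RETRIEVAL, IRSG, BCS, IRIT and CONFERENCE, which also occupy the upper ranks of every list.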


Experiment Results

Frequency of TC/DF Ratio Within the WaC

[Figure panels: Integer Values, Two Decimals, One Decimal]


Experiment Results

Show similarity between WaC based TC and Google N-Gram based TC

TC frequencies

N-Grams have a threshold of 200


Conclusions

• TC and DF ranks within the WaC show strong correlation
• TC frequencies of WaC and Google N-Grams are very similar
• Together with results shown earlier (high correlation between baseline and two other methods), N-Grams seem suitable for accurate IDF estimation for web pages
  This does not mean everything correlated to TC can be used as a DF substitute!


Inter-Search Engine Lexical Signature Performance

(JCDL 2009)


Inter-Search Engine Lexical Signature Performance

Martin Klein, Michael L. Nelson

{mklein,mln}@cs.odu.edu

http://en.wikipedia.org/wiki/Elephant

Elephant, Tusks, Trunk, African, Loxodonta
Elephant, Asian, African, Species, Trunk
Elephant, African, Tusks, Asian, Trunk



Revisiting Lexical Signatures to (Re-)Discover Web Pages

(ECDL 2008)


How to Evaluate the Evolution of LSs over Time

Idea:
• Conduct overlap analysis of LSs generated over time
• LSs based on local universe mentioned above
• Neither Phelps and Wilensky nor Park et al. did that
  • Park et al. just re-confirmed their findings after 6 months


Dataset

Local universe consisting of copies of URLs from the IA between 1996 and 2007


LSs Over Time - Example

10-term LSs generated for http://www.perfect10wines.com


LS Overlap Analysis

Rooted: overlap between the LS of the year of the first observation in the IA and all LSs of the consecutive years that URL has been observed

Sliding: overlap between two LSs of consecutive years starting with the first year and ending with the last
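The difference between the two analyses is easiest to see in code. The sketch below assumes the LSs of one URL are stored in a dictionary keyed by observation year, and it uses a simple order-insensitive overlap as the similarity measure; both choices are assumptions of this example.

```python
def overlap(ls_a, ls_b):
    """Order-insensitive share of common terms between two LSs."""
    k = max(len(ls_a), len(ls_b))
    return len(set(ls_a) & set(ls_b)) / k if k else 0.0


def rooted_overlap(ls_by_year):
    """Compare the LS of the first observed year with every later year."""
    years = sorted(ls_by_year)
    first = ls_by_year[years[0]]
    return {year: overlap(first, ls_by_year[year]) for year in years[1:]}


def sliding_overlap(ls_by_year):
    """Compare each observed year's LS with the LS of the next observed year."""
    years = sorted(ls_by_year)
    return {(a, b): overlap(ls_by_year[a], ls_by_year[b])
            for a, b in zip(years, years[1:])}
```

A URL whose signature changes gradually can keep a high sliding overlap from year to year while its rooted overlap with the first year drops to zero.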


Evolution of LSs over Time

Rooted

Results:
• Little overlap between the early years and more recent ones
• Highest overlap in the first 1-2 years after creation of the LS
• Rarely peaks after that – once terms are gone they do not return


Evolution of LSs over Time

Sliding

Results:
• Overlap increases over time
• Seem to reach steady state around 2003


Performance of LSs

Idea:
• Query Google search API with LSs
• LSs based on local universe mentioned above
• Identify URL in result set
• For each URL it is possible that:
  1. URL is returned as the top ranked result
  2. URL is ranked somewhere between 2 and 10
  3. URL is ranked somewhere between 11 and 100
  4. URL is ranked somewhere beyond rank 100 (considered as not returned)


Performance of LSs wrt Number of Terms

Results:
• 2-, 3- and 4-term LSs perform poorly
• 5-, 6- and 7-term LSs seem best
  • Top mean rank (MR) value with 5 terms
  • Most top ranked with 7 terms
  • Binary pattern: either in top 10 or undiscovered
• 8 terms and beyond do not show improvement


Performance - Number of Terms

• Lightest gray = rank 1

• Black = rank 101 and beyond

• Ranks 11-20, 21-30,… colored proportionally

• 50% top ranked, 20% in top 10, 30% black

Rank distribution of 5 term LSs

Performance of LSs wrt Number of Terms


Performance of LSs

Scoring:
• normalized Discounted Cumulative Gain (nDCG)
• Binary relevance: 1 for match, 0 otherwise
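With binary relevance and a single sought URL, nDCG reduces to a function of the rank at which that URL appears. The sketch below assumes the common DCG formulation with a log2(rank + 1) discount; the slides do not state which variant was used. The cutoff of 100 follows the earlier rule that URLs beyond rank 100 count as not returned.

```python
import math


def ndcg_for_rank(rank, cutoff=100):
    """nDCG when exactly one result (the sought URL) is relevant.

    `rank` is the 1-based position of the URL, or None if it was not returned.
    """
    if rank is None or rank > cutoff:
        return 0.0
    # DCG = 1 / log2(rank + 1); the ideal ranking puts the URL first, so IDCG = 1.
    return 1.0 / math.log2(rank + 1)
```

Under this formulation a top ranked URL scores 1.0, a URL at rank 3 scores 0.5, and the score keeps falling slowly toward the cutoff.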


nDCG for LSs consisting of 2-15 terms (mean over all years)

Performance of LSs wrt Number of Terms


Performance of LSs over Time

Score for LSs consisting of 2, 5, 7 and 10 terms


Conclusions

• LSs decay over time
  • Rooted: quickly after generation
  • Sliding: seem to stabilize
• 5-, 6- and 7-term LSs seem to perform best
  • 7 – most top ranked
  • 5 – fewest undiscovered
  • 5 – lowest mean rank
• 2..4 as well as 8+ terms insufficient


Evaluating Methods to Rediscover Missing Web Pages from the Web Infrastructure

(JCDL 2010)


The Problem

Internet Archive - Wayback Machine

www.aircharter-international.com
http://web.archive.org/web/*/http://www.aircharter-international.com

59 copies

Lexical Signature (TF/IDF): Charter Aircraft Cargo Passenger Jet Air Enquiry

Title: ACMI, Private Jet Charter, Private Jet Lease, Charter Flight Service: Air Charter International


The Problem

www.aircharter-international.com

Lexical Signature (TF/IDF): Charter Aircraft Cargo Passenger Jet Air Enquiry


The Problem

www.aircharter-international.com

Title: ACMI, Private Jet Charter, Private Jet Lease, Charter Flight Service: Air Charter International


The Problem

If no archived/cached copy can be found...

Tags

Link Neighborhood (LNLS)

[Figure labels: A, B, C?]


The Problem


Contributions

• Compare performance of four automated methods to rediscover web pages
  1. Lexical signatures (LSs)
  2. Titles
  3. Tags
  4. LNLS
• Analysis of title characteristics wrt their retrieval performance
• Evaluate performance of combination of methods and suggest workflow for real time web page rediscovery


Experiment - Data Gathering

• 500 URIs randomly sampled from DMOZ

• Applied filters

– .com, .org, .net, .edu domains

– English Language

– min. of 50 terms [Park]

• Results in 309 URIs to download and parse

Data Gathering


Experiment - Data Gathering

• Extract title
  – <Title>...</Title>
• Generate 3 LSs per page
  – IDF values obtained from Google, Yahoo!, MSN Live
• Obtain tags from delicious.com API (only 15%)
• Obtain link neighborhood from Yahoo! API (max. 50 URIs)
  – Generate LNLS
  – TF from “bucket” of words per neighborhood
  – IDF obtained from Yahoo! API
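The title extraction step can be sketched with Python's standard `html.parser` module. The class and function names are made up for this example, and the slides do not say which parser the experiment used.

```python
from html.parser import HTMLParser


class TitleParser(HTMLParser):
    """Collect the text found inside the <title> element of a page."""

    def __init__(self):
        super().__init__()
        self.in_title = False
        self.parts = []

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag names, so <Title> and <TITLE> also match.
        if tag == "title":
            self.in_title = True

    def handle_endtag(self, tag):
        if tag == "title":
            self.in_title = False

    def handle_data(self, data):
        if self.in_title:
            self.parts.append(data)


def extract_title(html):
    """Return the page title with runs of whitespace collapsed to one space."""
    parser = TitleParser()
    parser.feed(html)
    return " ".join("".join(parser.parts).split())
```

The returned string can be sent to a search engine as is or wrapped in quotes, which corresponds to the non-quoted and quoted title queries compared later in the deck.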


LS Retrieval Performance

5- and 7-Term LSs

•Yahoo! returns most URIs top ranked and leaves least undiscovered

•Binary retrieval pattern, URI either within top 10 or undiscovered

LS Retrieval Performance


Title Retrieval Performance

Non-Quoted and Quoted Titles

•Results at least as good as for LSs

•Google and Yahoo! return more URIs for non-quoted titles

•Same binary retrieval pattern

Title Retrieval Performance


Tags Retrieval Performance

•API returns up to top10 tags - distinguish between # of tags queried

•Low # of URIs

•More later…

Tags Retrieval Performance


LNLS Retrieval Performance

•5- and 7-term LNLSs

•< 5% top ranked

•More later…

LNLS Retrieval Performance


Combination of Methods

Can we achieve better retrieval performance if we combine 2 or more methods?

[Flowchart steps: Query LS, Query Title, Query Tags, Query LNLS; three exits labeled Done]


Combination of Methods

Google
      Top    Top10  Undis
LS5   50.8   12.6   32.4
LS7   57.3   9.1    31.1
TI    69.3   8.1    19.7
TA    2.1    10.6   75.5

Yahoo!
      Top    Top10  Undis
LS5   67.6   7.8    22.3
LS7   66.7   4.5    26.9
TI    63.8   8.1    27.5
TA    6.4    17.0   63.8

MSN Live
      Top    Top10  Undis
LS5   63.1   8.1    27.2
LS7   62.8   5.8    29.8
TI    61.5   6.8    30.7
TA    0      8.5    80.9


Combination of Methods

Top Results for Combination of Methods

             Google  Yahoo!  MSN Live
LS5-TI       65.0    73.8    71.5
LS7-TI       70.9    75.7    73.8
TI-LS5       73.5    75.7    73.1
TI-LS7       74.1    75.1    74.1
LS5-TI-LS7   65.4    73.8    72.5
LS7-TI-LS5   71.2    76.4    74.4
TI-LS5-LS7   73.8    75.7    74.1
TI-LS7-LS5   74.4    75.7    74.8
LS5-LS7      52.8    68.0    64.4
LS7-LS5      59.9    71.5    66.7


•Length varies between 1 and 43 terms

•Length between 3 and 6 terms occurs most frequently and performs well [Ntoulas]

Title Characteristics

Length in # of Terms

Title Characteristics


•Length varies between 4 and 294 characters

•Short titles (<10) do not perform well

•Length between 10 and 70 most common

•Length between 10 and 45 seem to perform best

Title Characteristics

Length in # of Characters

Title Characteristics


•Title terms with a mean of 5,6,7 characters seem most suitable for well performing terms

•More than 1 or 2 stop words hurts performance

Title Characteristics

Mean # of Characters, # of Stop Words

Title Characteristics


Concluding Remarks

Lexical signatures, as much as titles, are very suitable as search engine queries to rediscover missing web pages. They return 50-70% URIs top ranked.

Tags and link neighborhood LSs do not seem to significantly contribute to the retrieval of the web pages.

Titles are much cheaper to obtain than LSs. The combination of primarily querying titles and 5-term LSs as a second option returns more than 75% URIs top ranked.

Not all titles are equally good. Titles containing between 3 and 6 terms seem to perform best. More than a couple of stop words hurt the performance.

Conclusions


Is This a Good Title?

(Hypertext 2010)


The Problem

www.aircharter-international.com

Lexical Signature (TF/IDF): Charter Aircraft Cargo Passenger Jet Air Enquiry


The Problem

www.aircharter-international.com

Title: ACMI, Private Jet Charter, Private Jet Lease, Charter Flight Service: Air Charter International


The Problem

http://www.drbartell.com/

Lexical Signature(TF/IDF)Plastic Surgeon Reconstructive Dr Bartell Symbol University

???

The Problem

The Problem

http://www.drbartell.com/

Title: Thomas Bartell MD Board-Certified - Cosmetic Plastic Reconstructive Surgery

The Problem

www.reagan.navy.mil

Lexical Signature (TF/IDF): Ronald USS MCSN Torrey Naval Sea Commanding

The Problem

www.reagan.navy.mil

Title: Home Page ???

Is This a Good Title?

Contributions

• Discuss discovery performance of web page titles (compared to LSs)

• Analysis of discovered pages regarding their relevancy

• Display title evolution compared to content evolution over time

• Provide a prediction model for a title’s retrieval potential


Experiment - Data Gathering

• 20k URIs randomly sampled from DMOZ

• Applied filters
  – English language
  – min. of 50 terms

• Results in 6,875 URIs

• Downloaded and parsed the pages

• Extract title and generate LS per page (baseline)

           .com   .org   .net   .edu    sum
Original  15289   2755   1459    497  20000
Filtered   4863   1327    369    316   6875
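Generating a TF/IDF lexical signature per page can be sketched as follows. This is a minimal illustration under assumptions, not the implementation used for the experiment: `doc_freq` and `num_docs` are hypothetical inputs standing in for whatever document frequency source is available, the tokenizer is a simple regular expression, and the smoothed IDF formula is one common variant.

```python
import math
import re
from collections import Counter


def lexical_signature(text, doc_freq, num_docs, k=5):
    """Return the k terms of `text` with the highest TF-IDF score.

    doc_freq: dict mapping a term to the number of documents containing it
              (hypothetical corpus statistics supplied by the caller).
    num_docs: total number of documents in that corpus.
    """
    terms = re.findall(r"[a-z0-9]+", text.lower())
    tf = Counter(terms)

    def score(term):
        # Smoothed IDF; terms that occur in nearly every document score close to 0.
        idf = math.log(num_docs / (1 + doc_freq.get(term, 0)))
        return tf[term] * idf

    # Sort by descending score, breaking ties alphabetically for stable output.
    ranked = sorted(tf, key=lambda t: (-score(t), t))
    return ranked[:k]
```

With `k=5` or `k=7` this yields signatures of the lengths compared in the talk.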

Data Gathering

Title (and LS) Retrieval Performance

Panels: Titles; 5- and 7-Term LSs

•Titles return more than 60% of URIs top ranked

•Binary retrieval pattern: a URI is either within the top 10 or undiscovered


???

Relevancy of Retrieval Results

•Distinguish between discovered (top 10) and undiscovered URIs

•Analyze content of top 10 results

•Measure relevancy in terms of normalized term overlap and shingles between original URI and search result by rank
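Both relevancy measures can be sketched in a few lines. The slides do not say how the term overlap is normalized or which shingle size is used, so the choices here are assumptions: overlap is computed as the Jaccard coefficient of the two term sets, and shingles are word w-grams with a default of `w=4`.

```python
def term_overlap(a, b):
    """Jaccard overlap of the term sets of two texts (assumed normalization)."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    if not ta or not tb:
        return 0.0
    return len(ta & tb) / len(ta | tb)


def shingles(text, w=4):
    """Return the set of word w-grams (shingles) of a text."""
    terms = text.lower().split()
    if len(terms) < w:
        # Texts shorter than w form a single shingle; empty texts have none.
        return {tuple(terms)} if terms else set()
    return {tuple(terms[i:i + w]) for i in range(len(terms) - w + 1)}


def shingle_resemblance(a, b, w=4):
    """Jaccard resemblance of the shingle sets of two texts."""
    sa, sb = shingles(a, w), shingles(b, w)
    if not sa or not sb:
        return 0.0
    return len(sa & sb) / len(sa | sb)
```

Term overlap ignores word order, while shingles reward shared word sequences, which is why near-duplicates and aliases tend to score high on both.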

Do titles return relevant results besides the original URI?


Relevancy of Retrieval Results

Term Overlap (panels: Discovered, Undiscovered)

High relevancy in the top ranks, with possible aliases and duplicates.

Relevancy of Retrieval Results

Shingles (panels: Discovered, Undiscovered)

Some results have better shingle values than the top ranked URIs: possible aliases and duplicates.

Title Evolution – Example I

www.sun.com/solutions

1998-01-27: Sun Software Products Selector Guides - Solutions Tree

1999-02-20: Sun Software Solutions

2002-02-01: Sun Microsystems Products

2002-06-01: Sun Microsystems - Business & Industry Solutions

2003-08-01: Sun Microsystems - Industry & Infrastructure Solutions Sun Solutions

2004-02-02: Sun Microsystems – Solutions

2004-06-10: Gateway Page - Sun Solutions

2006-01-09: Sun Microsystems Solutions & Services

2007-01-03: Services & Solutions

2007-02-07: Sun Services & Solutions

2008-01-19: Sun Solutions

Title Evolution – Example II

www.datacity.com/mainf.html

2000-06-19: DataCity of Manassas Park Main Page

2000-10-12: DataCity of Manassas Park sells Custom Built Computers & Removable Hard Drives

2001-08-21: DataCity a computer company in Manassas Park sells Custom Built Computers & Removable Hard Drives

2002-10-16: computer company in Manassas Virginia sells Custom Built Computers with Removable Hard Drives Kits and Iomega 2GB Jaz Drives (jazz drives) October 2002 DataCity 800-326-5051 toll free

2006-03-14: Est 1989 Computer company in Stafford Virginia sells Custom Built Secure Computers with DoD 5200.1-R Approved Removable Hard Drives, Hard Drive Kits and Iomega 2GB Jaz Drives (jazz drives), introduces the IllumiNite; lighted keyboard DataCity 800-326-5051 Service Disabled Veteran Owned Business SDVOB


•Copies from fixed size time windows per year

•Extract available titles of past 14 years

•Compute normalized Levenshtein edit distance between titles of copies and baseline (0 = identical; 1 = completely dissimilar)
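A normalized Levenshtein distance with the 0 to 1 range described above can be sketched as follows. Dividing by the length of the longer title is an assumption; the slides do not state which normalization was used.

```python
def levenshtein(a, b):
    """Classic dynamic-programming edit distance between two strings."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(
                prev[j] + 1,               # deletion
                cur[j - 1] + 1,            # insertion
                prev[j - 1] + (ca != cb),  # substitution (free if characters match)
            ))
        prev = cur
    return prev[-1]


def normalized_edit_distance(a, b):
    """Edit distance scaled to [0, 1]: 0 = identical, 1 = completely dissimilar."""
    longest = max(len(a), len(b))
    return levenshtein(a, b) / longest if longest else 0.0
```

Applied to each archived title against the baseline title, this gives one value per copy that can then be binned by year.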

How much do titles change over time?

Title Evolution Over Time

Title Evolution Over Time

Title edit distance frequencies

•Half the titles of available copies from recent years are (close to) identical

•Decay from 2005 on (with fewer copies available)

•4 year old title: 40% chance to be unchanged

Title Evolution Over Time

Title vs Document

•Y: avg shingle value for all copies per URI

•X: avg edit distance of corresponding titles

•overlap indicated by color: green: <10, red: >90

•Semi-transparent: total amount of points plotted

[0,1] - over 1600 times

[0,0] - 122 times


Title Performance Prediction

•Quality prediction of title by
  •Number of nouns, articles etc.
  •Number of title terms, characters ([Ntoulas])

•Observation of re-occurring terms in poorly performing titles - “Stop Titles”

home, index, home page, welcome, untitled document

The performance of any given title can be predicted to be insufficient if 75% or more of it consists of a “Stop Title”!
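The 75% rule can be sketched as a small check. The slides do not say whether the share is measured in terms or in characters, so this sketch assumes terms: it looks for the longest listed Stop Title that occurs as a contiguous phrase and divides its length by the number of title terms. The names `stop_title_ratio` and `predicted_insufficient` are made up for this example.

```python
# Stop Titles as listed on the slide.
STOP_TITLES = ["home", "index", "home page", "welcome", "untitled document"]


def stop_title_ratio(title):
    """Share of title terms covered by the longest matching Stop Title phrase."""
    terms = title.lower().split()
    if not terms:
        return 1.0  # an empty title carries no usable query terms
    best = 0
    for stop_title in STOP_TITLES:
        phrase = stop_title.split()
        n = len(phrase)
        for i in range(len(terms) - n + 1):
            if terms[i:i + n] == phrase:
                best = max(best, n)
    return best / len(terms)


def predicted_insufficient(title, threshold=0.75):
    """True if the title consists to `threshold` or more of a Stop Title."""
    return stop_title_ratio(title) >= threshold
```

Under these assumptions "Home Page" is flagged, while "Welcome to Sun Solutions" is not, because only one of its four terms is a Stop Title.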

[Ntoulas] A. Ntoulas et al. “Detecting Spam Web Pages Through Content Analysis” In Proceedings of WWW 2006, pp 83-92



Concluding Remarks

The “aboutness” of web pages can be determined from either the content or from the title.

More than 60% of URIs are returned top ranked when using the title as a search engine query.

Titles change more slowly and less significantly over time than the web pages’ content.

Not all titles are equally good. If the majority of a title’s terms are Stop Titles, its quality can be predicted to be poor.

Conclusions

Find, New, Copy, Web, Page - Tagging for the (Re-)Discovery of Web Pages

(submitted for publication)

The Problem

We have seen that we have a good chance to rediscover missing pages with

• Lexical signatures
• Titles

BUT

What if no archived/cached copy can be found?

The Solution?

Conferences, Digitallibraries, Conference, Library, Jcdl2005

Search

The Questions

• What is a good length for a tag based query string?
  • 5 or 7 tags like lexical signatures?

• Can we improve retrieval performance when combining tags w/ title- and/or lexical signature-based queries?

• Do tags contain information about a page that is not in the title/content?

The Experiment

• URIs with tags rather sparse in previously created corpora

• Creation of new, tag centered corpus
  • query Delicious for 5k unique URIs

• eventually obtain:
  • 4,968 URIs
  • 11 duplicates
  • 21 URIs w/o tags

The Experiment

• Tags queried against the Yahoo! BOSS API
• Same four retrieval cases introduced earlier
• nDCG w/ same relevance scoring
• Mean Average Precision
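The two evaluation measures named above can be sketched as follows. This is a generic illustration, not the scoring used in the experiment: the relevance values are whatever the caller supplies, the ideal ranking for nDCG is derived from the same result list, and average precision is normalized by the number of relevant results actually retrieved. All three are simplifying assumptions.

```python
import math


def dcg(relevances):
    """Discounted cumulative gain of a ranked list of relevance scores."""
    return sum(rel / math.log2(rank + 1) for rank, rel in enumerate(relevances, 1))


def ndcg(relevances):
    """DCG divided by the DCG of the same scores in ideal (descending) order."""
    ideal = dcg(sorted(relevances, reverse=True))
    return dcg(relevances) / ideal if ideal else 0.0


def average_precision(hits):
    """Average of precision@k over the ranks k at which a relevant result occurs."""
    found, total = 0, 0.0
    for rank, hit in enumerate(hits, 1):
        if hit:
            found += 1
            total += found / rank
    return total / found if found else 0.0


def mean_average_precision(runs):
    """Mean of the average precision over several queries."""
    return sum(average_precision(r) for r in runs) / len(runs) if runs else 0.0
```

Each query contributes one ranked result list; nDCG rewards relevant results near the top, and MAP summarizes precision across all queries.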

The Experiment

• JaroWinkler distance between URIs
• Dice similarity between contents
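The Dice similarity between two page contents can be sketched as below. Computing it over sets of whitespace-separated, lowercased terms is an assumption, and the Jaro-Winkler comparison of URIs is left out of this sketch.

```python
def dice_similarity(a, b):
    """Dice coefficient 2|A ∩ B| / (|A| + |B|) over the term sets of two texts."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    if not ta and not tb:
        return 0.0  # convention chosen here for two empty texts
    return 2 * len(ta & tb) / (len(ta) + len(tb))
```

The value is 1.0 for texts with identical term sets and 0.0 for texts that share no terms.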

The Experiment

Combining methods

The Experiment

• Fact:
  • ~50% of tags do not occur in page

• “Secret”:
  • ~50% of tags do not occur in current version of page

• ergo: How about previous versions?

Ghost Tags

• 3,306 URIs w/ older copies
• 66.3% of our tags do not occur in page
• 4.9% of tags occur in previous version of page – Ghost Tags
  • represent a previous version better than the current one

• But what kind of tags are these?
• Are they important to the document? To the Delicious user?
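Identifying Ghost Tags amounts to a three-way split of a page's tags. The sketch below is an illustration under assumptions: tags and page text are matched as lowercased whitespace-separated terms, and `classify_tags` is a hypothetical helper name.

```python
def classify_tags(tags, current_text, previous_texts):
    """Split tags into those present in the current page, ghost tags, and absent tags.

    A ghost tag does not occur in the current version of the page
    but does occur in at least one previous version.
    """
    current_terms = set(current_text.lower().split())
    previous_terms = set()
    for text in previous_texts:
        previous_terms |= set(text.lower().split())

    result = {"present": [], "ghost": [], "absent": []}
    for tag in tags:
        term = tag.lower()
        if term in current_terms:
            result["present"].append(tag)
        elif term in previous_terms:
            result["ghost"].append(tag)
        else:
            result["absent"].append(tag)
    return result
```

The previous versions would come from archived copies of the page, which is why only URIs with older copies can be examined for Ghost Tags.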

Ghost Tags

Document importance: TF rank

User importance: Delicious rank

Normalized rank: 0 - top, 1 - bottom
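A normalized rank with 0 for the top position and 1 for the bottom can be sketched as follows. The linear scaling and the helper names `tf_ranking` and `normalized_rank` are assumptions; the same function would apply to a list of tags ordered by Delicious popularity.

```python
from collections import Counter


def tf_ranking(text):
    """Terms of a text ordered by descending term frequency (ties alphabetical)."""
    counts = Counter(text.lower().split())
    return [term for term, _ in sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))]


def normalized_rank(item, ranked):
    """Position of `item` in `ranked`, scaled so that 0 = top and 1 = bottom."""
    if item not in ranked:
        return None
    if len(ranked) == 1:
        return 0.0
    return ranked.index(item) / (len(ranked) - 1)
```

Normalizing makes ranks comparable across pages with different numbers of terms or tags.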


Concluding Remarks

Tags can be used for search!

We can improve the retrieval performance by combining tag-based search with titles and lexical signatures.

Ghost Tags exist! One out of three important terms better describes a previous version than the current version of a page.

How old are Ghost Tags? When do tags “ghostify”? With respect to importance/change of page?

Conclusions


Rediscovering Missing Web Pages Using Link Neighborhood Lexical Signatures

(JCDL 2011)

The Problem

We have seen that we have a good chance to rediscover missing pages with

• Lexical signatures
• Titles

BUT

What if no archived/cached copy can be found?

Plan A: Tags

The Solution?

Plan B: Link neighborhood Lexical Signatures

The Questions

• What is a good length for a neighborhood based lexical signature?
  • 5 or 7 terms like lexical signatures?
  • 5..8 terms like tag-based queries?

• How many backlinks do we need?
• Is the 1st level of backlinks sufficient?
• From where in the linking page should we draw the candidate terms?

The Radius Question

Paragraph

Entire page

Anchor text
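Drawing candidate terms from around a backlink can be sketched with the standard library HTML parser. This is an illustration under assumptions, not the extraction code behind the results: the class and function names are hypothetical, links are matched by exact `href` string, text in scripts and styles is not filtered out, and only the anchor text and word-window radii (for example ±5 or ±10 words) are covered, not the paragraph or entire page radii.

```python
from html.parser import HTMLParser


class LinkContext(HTMLParser):
    """Collect the page's words and the word spans of anchors pointing to `target`."""

    def __init__(self, target):
        super().__init__()
        self.target = target
        self.words = []
        self.spans = []      # (start, end) word indices of matching anchor texts
        self._start = None   # start index of the anchor currently open, if any

    def handle_starttag(self, tag, attrs):
        if tag == "a" and dict(attrs).get("href") == self.target:
            self._start = len(self.words)

    def handle_endtag(self, tag):
        if tag == "a" and self._start is not None:
            self.spans.append((self._start, len(self.words)))
            self._start = None

    def handle_data(self, data):
        self.words.extend(data.split())


def neighborhood_terms(html, target, radius=0):
    """Anchor text of each link to `target`, widened by `radius` words on both sides."""
    parser = LinkContext(target)
    parser.feed(html)
    parser.close()
    return [parser.words[max(0, start - radius):end + radius]
            for start, end in parser.spans]
```

With `radius=0` only the anchor text is returned; larger values add the surrounding words from the linking page.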

The Dataset

• Same as for JCDL 2010 experiment
• 309 URIs
• 28,325 first level & 306,700 second level backlinks
• Filter for language, file type, content length, HTTP response code, “soft 404s” => 12% discarded
• Lexical signature generation
  • IDF values from Yahoo!
  • 1..7 and 10 terms

The Results

level-radius-rank

Anchor text

The Results – Backlink Level

level-radius-rank

Anchor text

±5 words

The Results – Backlink Level

level-radius-rank

Anchor text

±10 words

The Results – Backlink Level

level-radius-rank

Anchor text

±10 words

The Results – Radius

level-radius-rank

All Radii

The Results – Backlink Rank

level-radius-rank

Anchor, Ranks 10, 100, 1000

The Results – In Numbers

1-anchor-1000

1-anchor-10

WINNER

• 4 terms
• first backlink level only
• top 10 backlinks only
• anchor text only


Concluding Remarks

Link neighborhood based lexical signatures can help rediscover missing pages.

It is a feasible “Plan C” due to the high success rate of cheaper methods (titles, tags, lexical signatures).

Fortunately, the smallest parameters perform best (anchor text, 10 backlinks, 1st level backlinks).

Can we find an optimum for the number of backlinks? (10/100/1000 leaves a big margin)

Can we identify “Stop Anchors”, e.g. click here, acrobat, etc.?

Conclusions
