
Synchronicity

Real Time Recovery of Missing Web Pages

Martin Klein
mklein@cs.odu.edu

Introduction to Digital Libraries
Week 14

CS 751 Spring 2011
04/12/2011

Who are you again?

• Ph.D. student w/ MLN since 2005
• Diagnostic exam in 2006, dissertation proposal in 2008
• 17 publications to date
• Outstanding RA award CS dept
• CoS dissertation fellowship
• 3 ACM SIGWEB + 2 misc travel grants
• CS595 (S10) & CS518 (F10)

The Problem

http://www.jcdl2007.org

http://www.jcdl2007.org/JCDL2007_Program.pdf

The Problem

• Web users experience 404 errors
  • expected lifetime of a web page is 44 days [Kahle97]
  • 2% of web disappears every week [Fetterly03]
• Are they really gone? Or just relocated?
  • has anybody crawled and indexed it?
  • do Google, Yahoo!, Bing or the IA have a copy of that page?
• Information retrieval techniques needed to (re-)discover content

Web Infrastructure (WI) [McCown07]

• Web search engines (Google, Yahoo!, Bing) and their caches
• Web archives (Internet Archive)
• Research projects (CiteSeer)

The Environment

Digital preservation happens in the WI

Refreshing and Migration in the WI

Google Scholar

CiteSeerX

Internet Archive
http://waybackmachine.org/*/http:/techreports.larc.nasa.gov/ltrs/PDF/tm109025.pdf

same URI maps to same or very similar content at a later time

same URI maps to different content at a later time

different URI maps to same or very similar content at the same or at a later time

the content can not be found at any URI

URI – Content Mapping Problem

[Diagram: time axis with points A and B]

Content Similarity

JCDL 2005
http://www.jcdl2005.org/

July 2005
http://www.jcdl2005.org/

Content Similarity

Hypertext 2006
http://www.ht06.org/

August 2006
http://www.ht06.org/

Content Similarity

PSP 2003
http://www.pspcentral.org/events/annual_meeting_2003.html

August 2003
http://www.pspcentral.org/events/archive/annual_meeting_2003.html

Content Similarity

ECDL 1999
http://www-rocq.inria.fr/EuroDL99/

October 1999
http://www.informatik.uni-trier.de/~ley/db/conf/ercimdl/ercimdl99.html

Content Similarity

Greynet 1999
http://www.konbib.nl/infolev/greynet/2.5.htm

1999
Today

[Figure labels: Removal, Hit, Rate, Proxy, Cache, Google, Yahoo, Bing]

• First introduced by Phelps and Wilensky [Phelps00]

• Small set of terms capturing “aboutness” of a document, “lightweight” metadata

Lexical Signatures (LSs)

Resource
Abstract

• Following TF-IDF scheme first introduced by Spärck Jones and Robertson [Jones88]
• Term frequency (TF):
– “How often does this word appear in this document?”
• Inverse document frequency (IDF):
– “In how many documents does this word appear?”
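As a rough illustration of the TF and IDF questions above, the following sketch ranks the terms of one document by TF times IDF and keeps the top k as its lexical signature. The tokenizer, the log(N/DF) form of IDF, the function names and the three-document corpus are all assumptions made for this example; they are not taken from the slides.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase a text and split it into alphanumeric terms."""
    return re.findall(r"[a-z0-9]+", text.lower())


def lexical_signature(document, corpus, k=5):
    """Return the k terms of `document` with the highest TF-IDF score.

    `corpus` is a list of document strings that includes `document`, so
    every term has a document frequency of at least 1.
    """
    tf = Counter(tokenize(document))
    n_docs = len(corpus)
    doc_terms = [set(tokenize(d)) for d in corpus]
    scores = {}
    for term, freq in tf.items():
        df = sum(1 for terms in doc_terms if term in terms)
        scores[term] = freq * math.log(n_docs / df)  # assumed IDF form
    # Sort by descending score; break ties alphabetically for determinism.
    ranked = sorted(scores, key=lambda t: (-scores[t], t))
    return ranked[:k]


# Hypothetical toy corpus, only for illustration.
corpus = [
    "digital libraries conference on digital libraries",
    "web archive of conference pages",
    "search engine cache of web pages",
]
print(lexical_signature(corpus[0], corpus, k=2))
```

Terms that are frequent in the document but rare in the corpus rise to the top, which is the "aboutness" property a lexical signature relies on.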

Generation of Lexical Signatures

• “Robust Hyperlink”
• 5 terms are suitable
• Append LS to URL

http://www.cs.berkeley.edu/~wilensky/NLP.html?lexical-signature=texttiling+wilensky+disambiguation+subtopic+iago

• Limitations:
1. Applications (browsers) need to be modified to exploit LSs
2. LSs need to be computed a priori
3. Works well with most URLs but not with all of them
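A minimal sketch of how a lexical signature could be appended to a URL and read back, following the example URL on this slide. The function names are hypothetical, and the handling of URLs that already carry a query string is an assumption.

```python
from urllib.parse import parse_qs, quote_plus, urlsplit


def robust_hyperlink(url, signature_terms):
    """Append a lexical signature to a URL as a query parameter."""
    # Assumption: reuse "&" if the URL already has a query string.
    separator = "&" if urlsplit(url).query else "?"
    signature = "+".join(quote_plus(term) for term in signature_terms)
    return f"{url}{separator}lexical-signature={signature}"


def extract_signature(robust_url):
    """Recover the list of signature terms from a robust hyperlink."""
    params = parse_qs(urlsplit(robust_url).query)
    values = params.get("lexical-signature", [""])
    # parse_qs decodes "+" as a space, so split on whitespace.
    return values[0].split()


terms = ["texttiling", "wilensky", "disambiguation", "subtopic", "iago"]
link = robust_hyperlink("http://www.cs.berkeley.edu/~wilensky/NLP.html", terms)
print(link)
print(extract_signature(link))
```

A client that understands the parameter could fall back to querying a search engine with the extracted terms when the URL itself returns a 404.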

LS as Proposed by Phelps and Wilensky

• Park et al. [Park03] investigated performance of various LS generation algorithms

• Evaluated “tunability” of TF and IDF component

• Weight on TF increases recall (completeness)
• Weight on IDF improves precision (exactness)

Generation of Lexical Signatures

Rank/Results  URL                                LS
1/243         http://endeavour.cs.berkeley.edu/  endeavour 94720-1776 achieve inter-endeavour amplifies
1/1,930       http://www.jcdl2005.org            jcdl2005 libraries conference cyberinfrastructure jcdl
1/25,900      http://www.loc.gov                 celebrate knowledge webcasts kluge library

Lexical Signatures -- Examples

Synchronicity

404 error occurs while browsing
  look for same or older page in WI (1)
  if user satisfied
    return page (2)
  else
    generate LS from retrieved page (3)
    query SEs with LS
    if result sufficient
      return “good enough” alternative page (4)
    else
      get more input about desired content (5)
        (link neighborhood, user input,...)
      re-generate LS && query SEs
      ...
      return pages (6)

The system may not return any results at all
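The numbered steps above can be sketched as a single function. Every callable passed in (`find_copy`, `user_satisfied`, `make_ls`, `search`, `good_enough`, `more_input`) is a hypothetical stand-in, and the behavior when no archived copy exists at all is an assumption, since the slide does not spell it out.

```python
def recover(uri, find_copy, user_satisfied, make_ls, search, good_enough,
            more_input):
    """Walk the six numbered steps of the workflow for a missing URI.

    All callables are hypothetical stand-ins supplied by the caller.
    Returns a list of candidate pages, which may be empty.
    """
    copy = find_copy(uri)                    # (1) same or older page in WI
    if copy is not None and user_satisfied(copy):
        return [copy]                        # (2)
    results = []
    if copy is not None:
        results = search(make_ls(copy))      # (3) LS from retrieved page
        if good_enough(results):
            return results                   # (4)
    extra = more_input(uri)                  # (5) link neighborhood, user input
    if extra is None:
        return results                       # assumption: give up, possibly empty
    return search(make_ls(extra))            # (6)


# Usage with trivial stand-ins: an old copy exists but the user rejects it.
pages = recover(
    "http://www.example.org/missing",
    find_copy=lambda uri: "archived text of the page",
    user_satisfied=lambda copy: False,
    make_ls=lambda text: text.split()[:5],
    search=lambda ls: ["http://www.example.org/moved"],
    good_enough=lambda results: bool(results),
    more_input=lambda uri: None,
)
print(pages)
```

Passing the steps in as callables keeps the control flow visible without committing to any particular archive or search engine API.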

Synchro…What?

Synchronicity
• Experience of causally unrelated events occurring together in a meaningful manner
• Events reveal underlying pattern, framework bigger than any of the synchronous systems
• Carl Gustav Jung (1875-1961)
• “meaningful coincidence”
• Deschamps – de Fontgibu plum pudding example

picture from http://www.crystalinks.com/jung.html

404 Errors

“Soft 404” Errors

A Comparison of Techniques for Estimating IDF Values to Generate Lexical Signatures for the Web

(WIDM 2008)

• LSs are usually generated following the TF-IDF scheme

• TF rather trivial to compute
• IDF requires knowledge about:
  • overall size of the corpus (# of documents)
  • # of documents a term occurs in

• Not complicated to compute for bounded corpora (such as TREC)

• If the web is the corpus, values can only be estimated

The Problem

• Use IDF values obtained from
1. Local collection of web pages
2. “screen scraping” SE result pages
• Validate both methods through comparison to baseline
• Use Google N-Grams as baseline
• Note: N-Grams provide term count (TC) and not DF values – details to come

The Idea

Accurate IDF Values for LSs

Screen scraping the Google web interface

The Dataset

Local universe consisting of copies of URLs from the IA between 1996 and 2007

Same as above, follows Zipf distribution

10,493 observations
254,384 total terms
16,791 unique terms

The Dataset

Total terms vs new terms

The Dataset

Based on all 3 methods
URL: http://www.perfect10wines.com
Year: 2007
Union: 12 unique terms

LSs Example

1. Normalized term overlap
  • Assume term commutativity
  • k-term LSs normalized by k
2. Kendall Tau
  • Modified version since LSs to compare may contain different terms
3. M-Score
  • Penalizes discordance in higher ranks
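The first measure, normalized term overlap, can be sketched in a few lines: term order is ignored (commutativity) and the number of shared terms is divided by k. The function name and the sample terms are hypothetical; the modified Kendall Tau and the M-Score are not reproduced here because the slides do not give their definitions.

```python
def normalized_overlap(ls_a, ls_b):
    """Share of terms two k-term LSs have in common, ignoring term order."""
    if len(ls_a) != len(ls_b):
        raise ValueError("both LSs must have the same number of terms")
    k = len(ls_a)
    return len(set(ls_a) & set(ls_b)) / k


# Hypothetical 5-term LSs for the same page from two IDF sources.
ls_local = ["wine", "cellar", "napa", "vintage", "order"]
ls_scraped = ["order", "wine", "merlot", "shipping", "cellar"]
print(normalized_overlap(ls_local, ls_scraped))
```

A score of 1.0 means both methods selected the same k terms, regardless of their order.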

Comparing LSs

Top 5, 10 and 15 terms

LC – local universe

SC – screen scraping

NG – N-Grams

Comparing LSs

• Both methods for the computation of IDF values provide accurate results
  • compared to the Google N-Gram baseline
• Screen scraping method seems preferable since
  • similarity scores slightly higher
  • feasible in real time

Conclusions

Correlation of Term Count and Document Frequency for Google N-Grams

(ECIR 2009)

• Need for a reliable source to accurately compute IDF values of web pages (in real time)
• Shown, screen scraping works but
  • missing validation of baseline (Google N-Grams)
  • N-Grams seem suitable (recently created, based on web pages) but provide TC and not DF; what is their relationship?

The Problem

Background & Motivation

• Term frequency (TF) – inverse document frequency (IDF) is a well known term weighting concept
  • Used (among others) to generate lexical signatures (LSs)
• TF is not hard to compute, IDF is since it depends on global knowledge about the corpus
  When the entire web is the corpus IDF can only be estimated!

• Most text corpora provide term count values (TC)

D1 = “Please, Please Me”
D2 = “Can’t Buy Me Love”
D3 = “All You Need Is Love”
D4 = “Long, Long, Long”

TC >= DF but is there a correlation? Can we use TC to estimate DF?

Term  All  Buy  Can’t  Is  Love  Me  Need  Please  You  Long
TC    1    1    1      1   2     2   1     2       1    3
DF    1    1    1      1   2     2   1     1       1    1
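The TC and DF rows above can be recomputed from the four titles. This is only a small sketch; the tokenizer (lowercasing, keeping apostrophes) and the variable names are assumptions.

```python
import re
from collections import Counter

documents = {
    "D1": "Please, Please Me",
    "D2": "Can't Buy Me Love",
    "D3": "All You Need Is Love",
    "D4": "Long, Long, Long",
}


def terms(text):
    """Split a title into lowercase terms, keeping apostrophes."""
    return re.findall(r"[a-z']+", text.lower())


term_count = Counter()     # TC: total occurrences across all documents
doc_frequency = Counter()  # DF: number of documents containing the term
for text in documents.values():
    tokens = terms(text)
    term_count.update(tokens)
    doc_frequency.update(set(tokens))

for term in sorted(term_count):
    print(f"{term:8} TC={term_count[term]} DF={doc_frequency[term]}")
```

Only terms repeated within a single document ("please", "long") have TC greater than DF; for all others the two values coincide, which is why TC can never be smaller than DF.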

• Investigate relationship between:
  • TC and DF within the Web as Corpus (WaC)
  • WaC based TC and Google N-Gram based TC
• TREC, BNC could be used but:
  • they are not free
  • TREC has been shown to be somewhat dated [Chiang05]

The Idea

• Analyze correlation of list of terms ordered by their TC and DF rank by computing:
  • Spearman’s Rho
  • Kendall Tau
• Display frequency of TC/DF ratio for all terms
• Compare TC (WaC) and TC (N-Grams) frequencies
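Both rank correlation measures can be sketched for the simple case of two tie-free rankings over the same terms. The tau-a variant of Kendall's coefficient, the function names and the five sample ranks are assumptions for illustration; the experiment itself may have handled ties differently.

```python
from itertools import combinations


def spearman_rho(rank_x, rank_y):
    """Spearman's rho for two tie-free rankings of the same n items."""
    n = len(rank_x)
    d_squared = sum((rank_x[t] - rank_y[t]) ** 2 for t in rank_x)
    return 1 - 6 * d_squared / (n * (n * n - 1))


def kendall_tau(rank_x, rank_y):
    """Kendall's tau-a: (concordant - discordant) / number of pairs."""
    concordant = discordant = 0
    for a, b in combinations(rank_x, 2):
        product = (rank_x[a] - rank_x[b]) * (rank_y[a] - rank_y[b])
        if product > 0:
            concordant += 1
        elif product < 0:
            discordant += 1
    n = len(rank_x)
    return (concordant - discordant) / (n * (n - 1) / 2)


# Illustrative ranks of five terms by TC and by DF (1 = highest).
tc_rank = {"ir": 1, "retrieval": 2, "irsg": 3, "irit": 4, "bcs": 5}
df_rank = {"ir": 1, "retrieval": 2, "irsg": 3, "bcs": 4, "irit": 5}
print(spearman_rho(tc_rank, df_rank), kendall_tau(tc_rank, df_rank))
```

Both values approach 1 when the TC-ordered and DF-ordered term lists agree, which is the property the experiment tests.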

The Experiment

Experiment Results

Investigate correlation between TC and DF within “Web as Corpus” (WaC)

Rank similarity of all terms

Experiment Results

Investigate correlation between TC and DF within “Web as Corpus” (WaC)

Spearman’s ρ and Kendall τ

Experiment Results

Rank  WaC-DF      WaC-TC      Google      N-Grams
1     IR          IR          IR          IR
2     RETRIEVAL   RETRIEVAL   RETRIEVAL   IRSG
3     IRSG        IRSG        IRSG        RETRIEVAL
4     BCS         IRIT        CONFERENCE  BCS
5     IRIT        BCS         BCS         EUROPEAN
6     CONFERENCE  2009        GRANT       CONFERENCE
7     GOOGLE      FILTERING   IRIT        IRIT
8     2009        GOOGLE      FILTERING   GOOGLE
9     FILTERING   CONFERENCE  EUROPEAN    ACM
10    GRANT       ARIA        PAPERS      GRANT

Google: screen scraping DF values from the Google web interface

Top 10 terms in decreasing order of their TF/IDF values taken from http://ecir09.irit.fr

∪ = 14
∩ = 6
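The union and intersection sizes can be checked directly from the four top-10 columns of this slide. The dictionary layout and variable names are assumptions; the terms are copied from the table.

```python
top10 = {
    "WaC-DF": ["IR", "RETRIEVAL", "IRSG", "BCS", "IRIT", "CONFERENCE",
               "GOOGLE", "2009", "FILTERING", "GRANT"],
    "WaC-TC": ["IR", "RETRIEVAL", "IRSG", "IRIT", "BCS", "2009",
               "FILTERING", "GOOGLE", "CONFERENCE", "ARIA"],
    "Google": ["IR", "RETRIEVAL", "IRSG", "CONFERENCE", "BCS", "GRANT",
               "IRIT", "FILTERING", "EUROPEAN", "PAPERS"],
    "N-Grams": ["IR", "IRSG", "RETRIEVAL", "BCS", "EUROPEAN", "CONFERENCE",
                "IRIT", "GOOGLE", "ACM", "GRANT"],
}

term_sets = [set(terms) for terms in top10.values()]
union = set.union(*term_sets)                # terms appearing in any list
intersection = set.intersection(*term_sets)  # terms appearing in all lists

print(len(union), len(intersection))  # 14 6
print(sorted(intersection))
```

Six of the ten terms are shared by all four methods, and only four additional terms are needed to cover every list.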

Strong indicator that TC can be used to estimate DF for web pages!

[Figure panels: Integer Values, Two Decimals, One Decimal]

Frequency of TC/DF Ratio Within the WaC

Experiment Results

Show similarity between WaC based TC and Google N-Gram based TC

TC frequencies

N-Grams have a threshold of 200

• TC and DF Ranks within the WaC show strong correlation

• TC frequencies of WaC and Google N-Grams are very similar

• Together with results shown earlier (high correlation between baseline and two other methods) N-Grams seem suitable for accurate IDF estimation for web pages
  This does not mean everything correlated to TC can be used as a DF substitute!

Conclusions

Inter-Search Engine Lexical Signature Performance

(JCDL 2009)

Inter-Search Engine Lexical Signature Performance

Martin Klein Michael L. Nelson

{mklein,mln}@cs.odu.edu

http://en.wikipedia.org/wiki/Elephant

Elephant, Tusks, Trunk, African, Loxodonta
Elephant, Asian, African, Species, Trunk
Elephant, African, Tusks, Asian, Trunk

Revisiting Lexical Signatures to (Re-)Discover Web Pages

(ECDL 2008)

How to Evaluate the Evolution of LSs over Time

Idea:
• Conduct overlap analysis of LSs generated over time
• LSs based on local universe mentioned above
• Neither Phelps and Wilensky nor Park et al. did that
• Park et al. just re-confirmed their findings after 6

Dataset

Local universe consisting of copies of URLs from the IA between 1996 and 2007

10-term LSs generated for http://www.perfect10wines.com

LSs Over Time - Example

LS Overlap Analysis

Rooted: overlap between the LS of the year of the first observation in the IA and all LSs of the consecutive years that URL has been observed

Sliding: overlap between two LSs of consecutive years starting with the first year and ending with the last
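The two overlap schemes can be sketched as follows, assuming normalized term overlap as the similarity measure. The function names and the three-year history of 3-term LSs are hypothetical.

```python
def overlap(ls_a, ls_b):
    """Normalized term overlap of two LSs of equal length."""
    return len(set(ls_a) & set(ls_b)) / len(ls_a)


def rooted_overlap(ls_by_year):
    """Overlap of the first year's LS with the LS of every later year."""
    years = sorted(ls_by_year)
    first = ls_by_year[years[0]]
    return {year: overlap(first, ls_by_year[year]) for year in years[1:]}


def sliding_overlap(ls_by_year):
    """Overlap of each pair of LSs from consecutive observed years."""
    years = sorted(ls_by_year)
    return {
        (prev, curr): overlap(ls_by_year[prev], ls_by_year[curr])
        for prev, curr in zip(years, years[1:])
    }


# Hypothetical 3-term LSs for one URL, keyed by observation year.
history = {
    1998: ["wine", "cellar", "napa"],
    1999: ["wine", "cellar", "vintage"],
    2000: ["wine", "vintage", "order"],
}
print(rooted_overlap(history))
print(sliding_overlap(history))
```

In this toy history the rooted overlap drops from year to year while the sliding overlap stays constant, because each LS changes by only one term relative to its predecessor.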

Evolution of LSs over Time

Results:
• Little overlap between the early years and more recent ones
• Highest overlap in the first 1-2 years after creation of the LS
• Rarely peaks after that – once terms are gone they do not return

Rooted

Evolution of LSs over Time

Results:
• Overlap increases over time
• Seem to reach steady state around 2003

Sliding

Performance of LSs

Idea:
• Query Google search API with LSs
• LSs based on local universe mentioned above
• Identify URL in result set
• For each URL it is possible that:
1. URL is returned as the top ranked result
2. URL is ranked somewhere between 2 and 10
3. URL is ranked somewhere between 11 and 100
4. URL is ranked somewhere beyond rank 100 (considered as not returned)
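The four outcomes can be expressed as a small classification over the rank at which the URL appears in a result list. The function names, the category labels and the use of exact string matching on URLs are assumptions for this sketch.

```python
def find_rank(target_url, result_urls):
    """Return the 1-based position of target_url in result_urls, else None."""
    for position, url in enumerate(result_urls, start=1):
        if url == target_url:
            return position
    return None


def rank_category(rank):
    """Map a 1-based result rank (or None if absent) to one of four cases."""
    if rank is None or rank > 100:
        return "undiscovered"  # case 4: considered as not returned
    if rank == 1:
        return "top"           # case 1
    if rank <= 10:
        return "top10"         # case 2
    return "top100"            # case 3


results = ["http://a.example/", "http://b.example/", "http://c.example/"]
print(rank_category(find_rank("http://b.example/", results)))
```

In practice URLs would likely need normalization (trailing slashes, scheme, case) before comparison; that step is left out here.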

Performance of LSs wrt Number of Terms

Results:
• 2-, 3- and 4-term LSs perform poorly
• 5-, 6- and 7-term LSs seem best
  • Top mean rank (MR) value with 5 terms
  • Most top ranked with 7 terms
  • Binary pattern: either in top 10 or undiscovered

• 8 terms and beyond do not show improvement

Performance - Number of Terms

• Lightest gray = rank 1

• Black = rank 101 and beyond

• Ranks 11-20, 21-30,… colored proportionally

• 50% top ranked, 20% in top 10, 30% black

Rank distribution of 5 term LSs

Performance of LSs

Scoring:
• normalized Discounted Cumulative Gain (nDCG)
• Binary relevance: 1 for match, 0 otherwise

nDCG for LSs consisting of 2-15 terms (mean over all years)
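With binary relevance and exactly one relevant result (the sought URL), nDCG reduces to a function of the rank alone. The sketch below assumes the common log2(1 + rank) discount, which the slide does not state; the function names are hypothetical.

```python
import math


def ndcg_single_target(rank):
    """nDCG when exactly one result (the sought URL) is relevant.

    `rank` is the 1-based position of the URL, or None if it was not
    returned. The ideal ranking puts the URL at rank 1, so IDCG is 1.
    """
    if rank is None:
        return 0.0
    # Assumed discount: 1 / log2(1 + rank).
    return 1.0 / math.log2(1 + rank)


def mean_ndcg(ranks):
    """Average nDCG over the ranks obtained for a set of URLs."""
    return sum(ndcg_single_target(r) for r in ranks) / len(ranks)


# Three URLs: found at rank 1, found at rank 3, not found.
print(mean_ndcg([1, 3, None]))
```

Under this assumption a top ranked URL scores 1, the score decays with rank, and an undiscovered URL contributes 0 to the mean.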

Performance of LSs over Time

Score for LSs consisting of 2, 5, 7 and 10 terms

• LSs decay over time
  • Rooted: quickly after generation
  • Sliding: seem to stabilize
• 5-, 6- and 7-term LSs seem to perform best
  • 7 – most top ranked
  • 5 – fewest undiscovered
  • 5 – lowest mean rank

• 2..4 as well as 8+ terms insufficient

Conclusions

Evaluating Methods to Rediscover Missing Web Pages from the Web Infrastructure

(JCDL 2010)

The Problem

Internet Archive - Wayback Machine

www.aircharter-international.com
http://web.archive.org/web/*/http://www.aircharter-international.com

Lexical Signature (TF/IDF)
Charter Aircraft Cargo Passenger Jet Air Enquiry

Title
ACMI, Private Jet Charter, Private Jet Lease, Charter Flight Service: Air Charter International

59 copies

The Problem

www.aircharter-international.com

The Problem

If no archived/cached copy can be found...

Link Neighborhood (LNLS)

The Problem

The Problem

Contributions

• Compare performance of four automated methods to rediscover web pages

1. Lexical signatures (LSs)
2. Titles
3. Tags
4. LNLS

• Analysis of title characteristics wrt their retrieval performance

• Evaluate performance of combination of methods and suggest workflow for real time web page rediscovery

Contributions

Experiment - Data Gathering

• 500 URIs randomly sampled from DMOZ

• Applied filters

– .com, .org, .net, .edu domains

– English Language

– min. of 50 terms [Park]

• Results in 309 URIs to download and parse

Data Gathering

• Extract title
– <Title>...</Title>

• Generate 3 LSs per page
– IDF values obtained from Google, Yahoo!, MSN Live

• Obtain tags from delicious.com API (only 15%)

• Obtain link neighborhood from Yahoo! API (max. 50 URIs)
– Generate LNLS

– TF from “bucket” of words per neighborhood

– IDF obtained from Yahoo! API

Data Gathering

LS Retrieval Performance

5- and 7-Term LSs

• Yahoo! returns most URIs top ranked and leaves least undiscovered

• Binary retrieval pattern, URI either within top 10 or undiscovered

LS Retrieval Performance

Title Retrieval Performance

Non-Quoted and Quoted Titles

• Results at least as good as for LSs

• Google and Yahoo! return more URIs for non-quoted titles

• Same binary retrieval pattern

Title Retrieval Performance

Tags Retrieval Performance

• API returns up to the top 10 tags; results are distinguished by the # of tags queried

• Low # of URIs

• More later…

Tags Retrieval Performance

LNLS Retrieval Performance

• 5- and 7-term LNLSs

• < 5% top ranked

• More later…

LNLS Retrieval Performance

Query LNLS

Combination of Methods

Can we achieve better retrieval performance if we combine 2 or more methods?

Query Tags

Query Title

Query LS

Google
      Top   Top10  Undis
LS5   50.8  12.6   32.4
LS7   57.3  9.1    31.1
TI    69.3  8.1    19.7
TA    2.1   10.6   75.5

Yahoo!
      Top   Top10  Undis
LS5   67.6  7.8    22.3
LS7   66.7  4.5    26.9
TI    63.8  8.1    27.5
TA    6.4   17.0   63.8

MSN Live
      Top   Top10  Undis
LS5   63.1  8.1    27.2
LS7   62.8  5.8    29.8
TI    61.5  6.8    30.7
TA    0     8.5    80.9

            Google  Yahoo!  MSN Live
LS5-TI      65.0    73.8    71.5
LS7-TI      70.9    75.7    73.8
TI-LS5      73.5    75.7    73.1
TI-LS7      74.1    75.1    74.1
LS5-TI-LS7  65.4    73.8    72.5
LS7-TI-LS5  71.2    76.4    74.4
TI-LS5-LS7  73.8    75.7    74.1
TI-LS7-LS5  74.4    75.7    74.8
LS5-LS7     52.8    68.0    64.4
LS7-LS5     59.9    71.5    66.7
