Using the Web Infrastructure for Real Time Recovery of Missing Web Pages
Dissertation Defense
Martin Klein
Old Dominion University
Norfolk, VA
07/18/2011
Committee:
• Dr. Michael L. Nelson (Advisor)
• Dr. Yaohang Li
• Dr. Michele C. Weigle
• Dr. Mohammad Zubair
• Dr. Robert Sanderson
• Dr. Herbert Van de Sompel
Agenda
• Motivation
• Background
• TC-DF Correlation
• DF Estimation Techniques
• LSs for Web Pages
• Web Page Titles
• Web Page Tags
• Link Neighborhood LSs
• Synchronicity
• Book of the Dead
The Problem
The Problem - 404 Errors
• Expected lifetime of a web page is 44 days [Kahle1997]
• URIs inaccessible in CS papers: 23%-53% [Lawrence2001]
• Inaccessible web pages: 67% after 4 years [Koehler2002]
• Inaccessible objects in DLs: 3% [Nelson2002]
• URIs inaccessible in high IF journals: 3.8% after 3 months; 13% after 27 months [Dellavalle2003]
• URIs inaccessible in D-Lib Magazine: ~30% [McCown2005]
• URIs inaccessible (and not archived) in scholarly articles: ~25% [Sanderson2011]
The Problem - 404 Errors
• Are they really gone? Or just relocated?
• Has anybody crawled and indexed it?
• Do Google, Yahoo!, Bing have a copy of the page?
• Has the page been archived by a web archive?
• Information retrieval techniques needed to (re-)discover content
The Solution?
• Search engines
  • Requires knowledge about content
  • Problem with homographs (jaguar, present, lead, M/mobile, etc.)
  • Problem with very frequent terms/names (Michael Nelson, Eric Miller, etc.)
• Web archives
  • Helps for, e.g., an apple pie recipe but not for the web page of transferred faculty
Content Similarity: JCDL 2005
• July 2005: http://www.jcdl2005.org/
• Today: http://www.jcdl2005.org/
Content Similarity: Hypertext 2006
• August 2006: http://www.ht06.org/
• Today: http://www.ht06.org/
Content Similarity: PSP 2003
• August 2003: http://www.pspcentral.org/events/annual_meeting_2003.html
• Today: http://www.pspcentral.org/events/archive/annual_meeting_2003.html
Content Similarity: ECDL 1999
• October 1999: http://www-rocq.inria.fr/EuroDL99/
• Today: http://www.informatik.uni-trier.de/~ley/db/conf/ercimdl/ercimdl99.html
Content Similarity: Greynet 1999
• 1999: http://www.konbib.nl/infolev/greynet/2.5.htm
• Today: ? ?
Research Questions (1)
1. Based on the WI, can we use content- and link structure based methods to (re-)discover missing web pages in real time?
Investigated Methods:
a) Lexical signatures
b) Titles
c) Tags
d) Link neighborhood lexical signatures
Research Questions (2)
2. What are the optimal characteristics of these methods (age, length, etc.) with respect to retrieval performance?
3. Can we improve the performance by consolidating two or more methods?
4. Can we have a real-world implementation and evaluation of the above?
Background
Memento, Web Infrastructure (WI)
Lexical Signatures (LSs)
• First introduced by Phelps and Wilensky [Phelps2000]
• Small set of terms capturing “aboutness” of a document, “lightweight” metadata
[Figure: Resource (10,000 terms), Abstract (200 terms), Lexical Signature (5 terms: Removal, Hit, Rate, Proxy, Cache)]
Lexical Signature Generation
• Following TF-IDF scheme first introduced by Spärck Jones and Robertson [Jones1973]
• Term frequency (TF):
  • “How often does this word appear in this document?”
• Inverse document frequency (IDF):
  • “In how many documents does this word appear?”
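The TF-IDF ranking described above can be sketched in a few lines. This is a minimal illustration, not the exact weighting used in the dissertation: it assumes raw term counts for TF and log(N/DF) for IDF, and the function name, the document frequencies and the corpus size are made-up values.

```python
import math
from collections import Counter

def lexical_signature(doc_terms, doc_freq, corpus_size, k=5):
    """Return the k terms with the highest TF-IDF score (assumed weighting)."""
    tf = Counter(doc_terms)
    scores = {}
    for term, count in tf.items():
        df = doc_freq.get(term, 1)  # assumption: unseen terms get DF = 1
        scores[term] = count * math.log(corpus_size / df)
    # Sort by descending score; break ties alphabetically for determinism.
    ranked = sorted(scores, key=lambda t: (-scores[t], t))
    return ranked[:k]

# Hypothetical document and document frequencies, for illustration only.
doc = "jcdl conference digital libraries jcdl conference the the the".split()
df = {"jcdl": 10, "conference": 1000, "digital": 5000,
      "libraries": 2000, "the": 1_000_000}
signature = lexical_signature(doc, df, corpus_size=1_000_000, k=3)
print(signature)
```

A term such as "the" occurs often in the document but in (nearly) every document of the corpus, so its IDF pushes it out of the signature.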
Lexical Signatures -- Examples
Rank/Results | URL                     | LS
1/1,930      | http://www.jcdl2005.org | jcdl2005 libraries conference cyberinfrastructure jcdl
1/24,100     | http://www.norfolk.gov  | norfolk city council rfps nauticus
1/185        | http://library.lanl.gov | lanl library ldrd alamos oppie
2/738,000    | http://www.usopen.org   | open us ashe tickets usta
A Comparison of Techniques for Estimating IDF Values to Generate Lexical Signatures for the Web (WIDM 2008)
Accurate IDF Values for LSs
Screen scraping the Google web interface
The Dataset
Local universe consisting of copies of URIs from the Internet Archive between 1996 and 2007
The Idea
• Use IDF values obtained from
  1. Local collection of web pages
  2. “Screen scraping” SE result pages
• Validate both methods against a baseline
  • Google N-Grams
Note: N-Grams provide term count (TC) and not DF values – ask me for details
LSs Example
Based on all 3 methods
URL: http://www.perfect10wines.com
Year: 2007
Union: 12 unique terms
Comparing LSs
1. Normalized term overlap
  • Assume term commutativity
  • k-term LSs normalized by k
2. Kendall Tau
  • Modified version since LSs to compare may contain different terms
3. M-Score
  • Penalizes discordance in higher ranks
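The first measure, normalized term overlap, is simple enough to sketch. The version below assumes that "normalized by k" means dividing the number of shared terms by the LS length; the function name and the sample signatures are illustrative.

```python
def normalized_overlap(ls_a, ls_b):
    """Shared terms of two LSs divided by k; term order is ignored."""
    k = max(len(ls_a), len(ls_b))
    if k == 0:
        return 0.0
    return len(set(ls_a) & set(ls_b)) / k

# Two hypothetical 5-term LSs that share two terms.
a = ["wine", "perfect", "merlot", "cellar", "vintage"]
b = ["vintage", "wine", "shipping", "order", "gift"]
print(normalized_overlap(a, b))
```

Because the terms are treated as a set, two LSs with the same terms in a different order score 1.0; rank differences are what Kendall Tau and the M-Score are meant to capture.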
Comparing LSs
[Figure: LS similarity for the top 5, 10 and 15 terms; LC = local universe, SC = screen scraping, NG = N-Grams]
Conclusions
• Both methods for the computation of IDF values provide accurate results
  • Compared to the Google N-Gram baseline
• Screen scraping method seems preferable
  • Similarity scores are slightly higher
  • Feasible in real time!!!
Contribution: Established well performing IDF estimation technique.
Revisiting Lexical Signatures to (Re-)Discover Web Pages (ECDL 2008)
The Idea
Evaluate the evolution of LSs over time:
• Generate LSs of URIs (from local universe mentioned above) over time
• Conduct overlap analysis
• Neither Phelps and Wilensky nor Park et al. [Park2004] did that
  • Park et al. just re-confirmed their findings after 6 months
LSs Over Time - Example
10-term LSs generated for http://www.perfect10wines.com
LS Overlap Analysis
Rooted: overlap between the LS of the year of the first observation in the IA and all LSs of the consecutive years that URI has been observed
Sliding: overlap between two LSs of consecutive years starting with the first year and ending with the last
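The two overlap schemes can be sketched as follows. The sketch assumes one LS per observed year and reuses a simple set overlap normalized by LS length; the function names and the sample terms are illustrative.

```python
def overlap(ls_a, ls_b):
    """Share of terms two LSs have in common (order ignored)."""
    k = max(len(ls_a), len(ls_b))
    return len(set(ls_a) & set(ls_b)) / k if k else 0.0

def rooted_overlap(ls_by_year):
    """Compare the LS of the first observed year with every later year."""
    years = sorted(ls_by_year)
    if not years:
        return {}
    first = ls_by_year[years[0]]
    return {y: overlap(first, ls_by_year[y]) for y in years[1:]}

def sliding_overlap(ls_by_year):
    """Compare the LSs of each pair of consecutive observed years."""
    years = sorted(ls_by_year)
    return {(a, b): overlap(ls_by_year[a], ls_by_year[b])
            for a, b in zip(years, years[1:])}

# Hypothetical 4-term LSs for three observed years.
history = {
    1996: ["wine", "cellar", "merlot", "vintage"],
    1997: ["wine", "cellar", "merlot", "order"],
    1998: ["wine", "order", "gift", "shipping"],
}
print(rooted_overlap(history))
print(sliding_overlap(history))
```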
Evolution of LSs over Time: Rooted
Results:
• Little overlap between the early years and more recent ones
• Highest overlap in the first 1-2 years after creation of the LS
• Rarely peaks after that – once terms are gone, they do not return
Evolution of LSs over Time: Sliding
Results:
• Overlap increases over time
• Seems to reach steady state around 2003
Performance of LSs
Idea:
• Query LSs against Google search API
• Identify URI in result set
• For each URI it is possible that:
  1. URI is returned as the top ranked result
  2. URI is ranked somewhere between 2 and 10
  3. URI is ranked somewhere between 11 and 100
  4. URI is ranked somewhere beyond rank 100 (considered as not returned)
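The four retrieval cases map directly onto the rank of the target URI in the result list. The following sketch assumes exact string matching of URIs (a real evaluation would likely canonicalize them first); the function name and case labels are illustrative.

```python
def retrieval_case(target_uri, ranked_uris):
    """Classify a query outcome into one of the four retrieval cases."""
    try:
        rank = ranked_uris.index(target_uri) + 1  # 1-based rank
    except ValueError:
        return "undiscovered"
    if rank == 1:
        return "top"
    if rank <= 10:
        return "top 10"
    if rank <= 100:
        return "top 100"
    return "undiscovered"  # beyond rank 100 counts as not returned

# Hypothetical result list with the target at rank 3.
results = ["http://a.example/", "http://b.example/", "http://www.jcdl2005.org/"]
print(retrieval_case("http://www.jcdl2005.org/", results))
```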
Performance of LSs wrt Length
Results:
• 2-, 3- and 4-term LSs perform poorly
• 5-, 6- and 7-term LSs seem best
  • Top mean rank (MR) value with 5 terms
  • Most top ranked with 7 terms
  • Binary pattern: either in top 10 or undiscovered
• 8 terms and beyond do not show improvement
Performance of LSs wrt Length
[Figure: nDCG for LSs consisting of 2-15 terms (mean over all years)]
Performance of LSs over Time
[Figure: nDCG for LSs consisting of 2, 5, 7 and 10 terms]
Conclusions
• LSs decay over time
  • Rooted: quickly after generation
  • Sliding: seem to stabilize
• LSs older than 5 years perform poorly
• 5-, 6- and 7-term LSs seem to perform best
  • 7 – most top ranked
  • 5 – lowest mean rank
• 2..4 as well as 8+ term LSs are insufficient
Contribution: Determined age and length limits for LSs.
Evaluating Methods to Rediscover Missing Web Pages from the Web Infrastructure (JCDL 2010)
The Problem
www.aircharter-international.com
Internet Archive - Wayback Machine: 59 copies
http://web.archive.org/web/*/http://www.aircharter-international.com
Lexical Signature (TF/IDF): Charter Aircraft Cargo Passenger Jet Air Enquiry
Title: ACMI, Private Jet Charter, Private Jet Lease, Charter Flight Service: Air Charter International
The Idea
• Compare performance of two automated methods to rediscover web pages
  1. Lexical signatures (LSs)
  2. Titles
• Evaluate performance of combination of methods and suggest workflow for real time web page rediscovery
LS Retrieval Performance
[Figure: 5- and 7-Term LSs]
• Yahoo! returns most URIs top ranked and leaves least undiscovered
• Binary retrieval pattern, URI either within top 10 or undiscovered
Title Retrieval Performance
[Figure: Non-Quoted and Quoted Titles]
• Results at least as good as for LSs
• Google and Yahoo! return more URIs for non-quoted titles
• Same binary retrieval pattern
Combination of Methods
Top Results for Combination of Methods
Method     | Google | Yahoo! | MSN Live
LS5-TI     | 65.0   | 73.8   | 71.5
LS7-TI     | 70.9   | 75.7   | 73.8
TI-LS5     | 73.5   | 75.7   | 73.1
TI-LS7     | 74.1   | 75.1   | 74.1
LS5-TI-LS7 | 65.4   | 73.8   | 72.5
LS7-TI-LS5 | 71.2   | 76.4   | 74.4
TI-LS5-LS7 | 73.8   | 75.7   | 74.1
TI-LS7-LS5 | 74.4   | 75.7   | 74.8
LS5-LS7    | 52.8   | 68.0   | 64.4
LS7-LS5    | 59.9   | 71.5   | 66.7
Conclusions
• LSs and titles are suitable as search engine queries
  • Return 50%-70% URIs top ranked
BUT
• Titles are cheaper to obtain, hence
  • Preferred primary method
  • 5-term LSs secondary method
  • Results in 75% top ranked URIs
Contributions: Provided evidence for suitability of titles and introduced web page discovery framework.
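A combination such as TI-LS5 (title first, 5-term LS second) can be read as a fallback chain: issue the next query only if the previous one did not surface the page. The sketch below evaluates such a chain against a known target URI; `search` is a hypothetical callable standing in for a search engine API, and the fake index is made-up data.

```python
def first_hit(target_uri, queries, search, top_n=10):
    """Try (method, query) pairs in order; return the first method that
    places target_uri within the top_n results, with its 1-based rank."""
    for method, query in queries:
        results = search(query)[:top_n]
        if target_uri in results:
            return method, results.index(target_uri) + 1
    return None

# Hypothetical search backend: a dict from query string to ranked URIs.
fake_index = {
    "Home Page": ["http://a.example/", "http://b.example/"],
    "ronald uss naval sea commanding": ["http://x.example/",
                                        "http://www.reagan.navy.mil/"],
}

def fake_search(query):
    return fake_index.get(query, [])

chain = [("title", "Home Page"),
         ("ls5", "ronald uss naval sea commanding")]
print(first_hit("http://www.reagan.navy.mil/", chain, fake_search))
```

Putting the title first reflects the slide's argument that titles are cheaper to obtain than LSs; the order of the chain is the only thing that changes between, say, TI-LS5 and LS5-TI.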
Is This a Good Title? (Hypertext 2010)
The Problem
http://www.drbartell.com/
Lexical Signature (TF/IDF): Plastic Surgeon Reconstructive Dr Bartell Symbol University ???
Title: Thomas Bartell MD Board-Certified - Cosmetic Plastic Reconstructive Surgery
The Problem
www.reagan.navy.mil
Lexical Signature (TF/IDF): Ronald USS MCSN Torrey Naval Sea Commanding
Title: Home Page ???
Is This a Good Title?
The Idea
• Display title evolution over time
• Compare to content evolution
• “Normalize” time as fixed size windows
• Provide prediction model for title’s retrieval potential
Title (and LS) Retrieval Performance
[Figure: Titles; 5- and 7-Term LSs]
• Titles return more than 60% URIs top ranked
• Binary retrieval pattern, URI either within top 10 or undiscovered
Title Evolution – Example I
www.sun.com/solutions
• 1998-01-27: Sun Software Products Selector Guides - Solutions Tree
• 1999-02-20: Sun Software Solutions
• 2002-02-01: Sun Microsystems Products
• 2002-06-01: Sun Microsystems - Business & Industry Solutions
• 2003-08-01: Sun Microsystems - Industry & Infrastructure Solutions Sun Solutions
• 2004-02-02: Sun Microsystems – Solutions
• 2004-06-10: Gateway Page - Sun Solutions
• 2006-01-09: Sun Microsystems Solutions & Services
• 2007-01-03: Services & Solutions
• 2007-02-07: Sun Services & Solutions
• 2008-01-19: Sun Solutions
Title Evolution – Example II
www.datacity.com/mainf.html
• 2000-06-19: DataCity of Manassas Park Main Page
• 2000-10-12: DataCity of Manassas Park sells Custom Built Computers & Removable Hard Drives
• 2001-08-21: DataCity a computer company in Manassas Park sells Custom Built Computers & Removable Hard Drives
• 2002-10-16: computer company in Manassas Virginia sells Custom Built Computers with Removable Hard Drives Kits and Iomega 2GB Jaz Drives (jazz drives) October 2002 DataCity 800-326-5051 toll free
• 2006-03-14: Est 1989 Computer company in Stafford Virginia sells Custom Built Secure Computers with DoD 5200.1-R Approved Removable Hard Drives, Hard Drive Kits and Iomega 2GB Jaz Drives (jazz drives), introduces the IllumiNite; lighted keyboard DataCity 800-326-5051 Service Disabled Veteran Owned Business SDVOB
Title Evolution Over Time
How much do titles change over time?
• Copies from fixed size time windows per year
• Extract available titles of past 14 years
• Compute normalized Levenshtein edit distance between titles of copies and baseline (today) (0 = identical; 1 = completely dissimilar)
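A normalized Levenshtein distance can be sketched as below. The slides do not state how the distance is normalized; this sketch assumes division by the length of the longer title, which yields 0 for identical and 1 for completely dissimilar strings. The sample titles are taken from the Sun example.

```python
def levenshtein(a, b):
    """Classic dynamic-programming edit distance (insert, delete, substitute)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,         # deletion
                            curr[j - 1] + 1,     # insertion
                            prev[j - 1] + cost)) # substitution or match
        prev = curr
    return prev[-1]

def normalized_edit_distance(a, b):
    """Edit distance divided by the longer length (assumed normalization)."""
    longest = max(len(a), len(b))
    return levenshtein(a, b) / longest if longest else 0.0

old_title = "Sun Software Solutions"
new_title = "Sun Solutions"
print(levenshtein(old_title, new_title))
print(normalized_edit_distance(old_title, new_title))
```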
Title Evolution Over Time
[Figure: Title edit distance frequencies]
• Half the titles of available copies from recent years are (close to) identical
• Decay from 2005 on (with fewer copies available)
• 4 year old title: 40% chance to be unchanged
Title Evolution Over Time
[Figure: Title vs Document]
• Y: avg shingle value for all copies per URI
• X: avg edit distance of corresponding titles
• Overlap indicated by: green: <10, red: >90
• Semi-transparent: total amount of points plotted
• [0,1] - over 1600 times
• [0,0] - 122 times
Title Performance Prediction
• Quality prediction of title by
  • Number of nouns, articles, etc.
  • Amount of title terms, characters [Ntoulas2006]
• Observation of re-occurring terms in poorly performing titles - “Stop Titles”
  • home, index, home page, welcome, untitled document
The performance of any given title can be predicted as insufficient if 75% or more of it consists of a “Stop Title”!
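The 75% rule can be sketched as a simple ratio test. The slides do not say whether the share is measured in terms or characters; this sketch assumes it is the share of title terms that occur in a Stop Title, and the function names and threshold parameter are illustrative.

```python
import re

# Stop Titles as listed on the slide; the term-based matching is an assumption.
STOP_TITLES = ["home", "index", "home page", "welcome", "untitled document"]
STOP_TERMS = {word for phrase in STOP_TITLES for word in phrase.split()}

def stop_title_ratio(title):
    """Share of title terms that are Stop Title terms."""
    terms = re.findall(r"[a-z0-9]+", title.lower())
    if not terms:
        return 1.0  # assumption: an empty title carries no information
    return sum(t in STOP_TERMS for t in terms) / len(terms)

def predicted_insufficient(title, threshold=0.75):
    """Predict poor retrieval performance when the ratio reaches the threshold."""
    return stop_title_ratio(title) >= threshold

print(predicted_insufficient("Home Page"))
print(predicted_insufficient(
    "Thomas Bartell MD Board-Certified - Cosmetic Plastic Reconstructive Surgery"))
```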
Conclusions
• Titles change more slowly and less significantly over time than web page content
• Not all titles are equally good
  • If the majority of title terms are Stop Titles, its quality can be predicted to be poor
Contribution: Quantified title evolution and introduced stop titles.
Find, New, Copy, Web, Page - Tagging for the (Re-)Discovery of Web Pages (TPDL 2011)
The Problem
We have seen that we have a good chance to rediscover missing pages with
• Lexical signatures
• Titles
BUT
What if no archived/cached copy can be found?
The Solution?
Tags: Conferences, Digitallibraries, Conference, Library, Jcdl2005
The Idea
• Experimental evaluation of tag based query length cf. 5- or 7-term LSs
• Test combination of methods to improve retrieval performance
• Investigate “descriptive” power of tags
The Experiment
• Tags queried against the Yahoo! BOSS API
• Same four retrieval cases introduced earlier
• nDCG w/ binary relevance scoring
• Mean Average Precision
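With binary relevance and a single target URI, nDCG reduces to a function of the target's rank, since the ideal ranking puts the one relevant result first. The sketch below assumes the common discount 1/log2(rank + 1) and a cutoff of 100 (matching the "not returned" case); the exact DCG variant used in the experiments is not given on the slides.

```python
import math

def ndcg_single_target(rank, cutoff=100):
    """nDCG for one relevant URI at a 1-based rank; None means not returned.

    Assumed gain/discount: relevance 1 for the target, 0 otherwise,
    discounted by log2(rank + 1). The ideal DCG is then 1.
    """
    if rank is None or rank > cutoff:
        return 0.0
    return 1.0 / math.log2(rank + 1)

for r in (1, 3, 10, None):
    print(r, ndcg_single_target(r))
```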
The Experiment
[Figure: Combining methods]
The Experiment
• Fact:
  • ~50% of tags do not occur in page [Bischoff2008]
• “Secret”:
  • ~50% of tags do not occur in current version of page
  • ergo: How about previous versions?
Ghost Tags
• 3,306 URIs w/ older copies
• 66.3% of our tags do not occur in page
• 4.9% of tags occur in previous version of page: Ghost Tags
  • represent a previous version better than the current one
• What kind of tags are these?
  • Important to the document, to the Delicious user?
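Detecting Ghost Tags amounts to checking each tag against the current page text and, if absent, against the text of earlier copies. This is a minimal sketch that assumes exact, case-insensitive term matching; the function names and the sample texts are made up for illustration.

```python
import re

def page_terms(text):
    """Lowercased alphanumeric terms of a page's text."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))

def classify_tags(tags, current_text, previous_texts):
    """Split tags into those in the current page, Ghost Tags
    (only in a previous version) and tags absent from all versions."""
    current = page_terms(current_text)
    previous = set()
    for text in previous_texts:
        previous |= page_terms(text)
    result = {"in_page": [], "ghost": [], "absent": []}
    for tag in tags:
        term = tag.lower()
        if term in current:
            result["in_page"].append(tag)
        elif term in previous:
            result["ghost"].append(tag)
        else:
            result["absent"].append(tag)
    return result

# Hypothetical current page text and one archived copy.
tags = ["jcdl2005", "conference", "denver", "travel"]
now = "JCDL 2005 conference archive"
archived = ["JCDL2005 in Denver: call for papers"]
print(classify_tags(tags, now, archived))
```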
Ghost Tags
[Figure: Document importance (TF rank) vs. user importance (Delicious rank); normalized rank: 0 = top, 1 = bottom]
Conclusions
• Tags can be used for search (if available)
• Combining tags with titles and LSs gains URIs
• Ghost Tags exist!
  • 1/3 of them are important to the page and user
Contributions: Added tags to web page discovery framework and introduced notion of Ghost Tags.
Rediscovering Missing Web Pages Using Link Neighborhood Lexical Signatures (JCDL 2011)
The Problem
We have seen that we have a good chance to rediscover missing pages with
• Lexical signatures
• Titles
BUT
What if no archived/cached copy can be found?
Plan A: Tags
Plan B
Link neighborhood Lexical Signatures (LNLSs)
[Figure: terms extracted from the link neighborhood (Computer, Dominion, Norfolk, Monarch) describe what the missing page “is about”]
The Idea
• Determine for well performing LNLS:
  • Length
  • Number of backlinks
  • Backlink levels
  • Radius of terms on backlink page
The Radius on a Backlink Page
• Anchor text
• Paragraph
• Entire page
The Dataset
• 309 URIs
  • 28,325 first level backlinks
  • 306,700 second level backlinks
• Filter for language, file type, etc.
  • 12% discarded
• Lexical signature generation
  • IDF values from Yahoo!
  • 1..7 and 10 terms
• Query Yahoo! API
• Compute “goodness” (nDCG)
The Results
[Figure: results by level-radius-rank for 1st and 2nd level backlinks; an arrow marks the “better” direction]
The Results – Radius
[Figure: results by level-radius-rank for all radii]
The Results – Backlink Rank
[Figure: results by level-radius-rank for backlink ranks 10, 100 and 1000]
The Results – In Numbers
1-anchor-1000
WINNER
1-anchor-10
GOOD
Conclusions
• Optimal link neighborhood lexical signatures:
  • Contain 4 terms
  • Parsed from top 10 backlink pages
  • Include first backlink level only
  • Consider anchor text only
Contributions: Added LNLS to web page discovery framework.
Synchronicity – Automatically Rediscover Missing Web Pages in Real Time (JCDL 2011)
Synchronicity
• Firefox add-on
• Triggers on 404 error
• Rediscover page via:
  • Memento
  • Title
  • Lexical signature
  • Tags
  • Link neighborhood lexical signature
  • URI modification
• http://bit.ly/no-more-404
Contributions
1. Introduce reliable real-time approach to estimate IDF values
2. Workflow for generation of well performing lexical signatures
3. Performance evaluation of web page titles
4. Investigation of tags for web page discovery
5. Analysis of link neighborhood lexical signatures and their optimal parameters
6. Introduce Synchronicity implementing the entire framework
Next Stop… New Mexico
List of my Relevant Publications
1. M. Klein, M.L. Nelson, “A Comparison of Techniques for Estimating IDF Values to Generate Lexical Signatures for the Web”, WIDM 2008, pp. 39-46
2. M. Klein, M.L. Nelson, “Revisiting Lexical Signatures to (Re-)Discover Web Pages”, ECDL 2008, pp. 371-382
3. M. Klein, M.L. Nelson, “Correlation of Term Count and Document Frequency for Google N-Grams”, ECIR 2009, pp. 620-627
4. M. Klein, M.L. Nelson, “Inter-Search Engine Lexical Signature Performance”, JCDL 2009, pp. 413-414
5. M. Klein, M.L. Nelson, “Investigating the Change of Web Pages Titles Over Time”, InDP 2009
6. M. Klein, J. Shipman, M.L. Nelson, “Is This a Good Title”, Hypertext 2010, pp. 3-12
7. M. Klein, M.L. Nelson, “Evaluating Methods to Rediscover Missing Web Pages from the Web Infrastructure”, JCDL 2010, pp. 59-68
8. M. Klein, J. Ware, M.L. Nelson, “Rediscovering Missing Web Pages Using Link Neighborhood Lexical Signatures”, JCDL 2011
9. M. Klein, M. Aly, M.L. Nelson, “Synchronicity - Automatically Rediscover Missing Web Pages in Real Time”, JCDL 2011
10. M. Klein, M.L. Nelson, “Find, New, Copy, Web, Page – Tagging for the (Re-)Discovery of Web Pages”, TPDL 2011, to appear
References
[Bischoff2008] K. Bischoff, C. Firan, W. Nejdl, R. Paiu, “Can All Tags Be Used for Search?”, Proceedings of CIKM '08, pp. 193-202, 2008
[Dellavalle2003] R.P. Dellavalle, E.J. Hester, L.F. Heilig, A.L. Drake, J.W. Kuntzman, M. Graber, L.M. Schilling, “Information Science: Going, Going, Gone: Lost Internet References”, Science 302(5646), pp. 787-788, 2003
[Jones1973] K. Spärck Jones, “Index Term Weighting”, Information Storage and Retrieval, pp. 619-633, 1973
[Kahle1997] B. Kahle, “Preserving the Internet”, Scientific American 276, pp. 82-83, 1997
[Koehler2002] W.C. Koehler, “Web Page Change and Persistence - A Four-Year Longitudinal Study”, JASIST 53(2), pp. 162-171, 2002
[Lawrence2001] S. Lawrence, D.M. Pennock, G.W. Flake, R. Krovetz, F.M. Coetzee, E. Glover, F.A. Nielsen, A. Kruger, C.L. Giles, “Persistence of Web References in Scientific Research”, Computer 34(2), pp. 26-31, 2001
[McCown2005] F. McCown, S. Chan, M.L. Nelson, J. Bollen, “The Availability and Persistence of Web References in D-Lib Magazine”, Proceedings of IWAW '05, 2005
[Nelson2002] M.L. Nelson, B.D. Allen, “Object Persistence and Availability in Digital Libraries”, D-Lib Magazine 8(1), 2002
[Ntoulas2006] A. Ntoulas, M. Najork, M. Manasse, D. Fetterly, “Detecting Spam Web Pages Through Content Analysis”, Proceedings of WWW '06, pp. 83-92, 2006
[Park2004] S.T. Park, D.M. Pennock, C.L. Giles, R. Krovetz, “Analysis of Lexical Signatures for Improving Information Persistence on the World Wide Web”, TOIS 22(4), pp. 540-572, 2004
[Phelps2000] T.A. Phelps, R. Wilensky, “Robust Hyperlinks Cost Just Five Words Each”, technical report, UC Berkeley, 2000
[Sanderson2011] R. Sanderson, M. Phillips, H. Van de Sompel, “Analyzing the Persistence of Referenced Web Resources with Memento”, Proceedings of OR '11, 2011
http://www.cs.odu.edu/~mklein/
Backup Slides
Future Work
• “Story Telling” with Memento
• Find more Stop Titles
• Find more Ghost Tags
• Identify “Stop Anchors”
• Synchronicity 1.0
  • Web service
  • CMD line tool
Correlation of Term Count and Document Frequency for Google N-Grams (ECIR 2009)
The Problem
• Need for a reliable source to accurately compute IDF values of web pages (in real time)
• Screen scraping has been shown to work, but
  • validation of the baseline (Google N-Grams) is missing
• N-Grams seem suitable (recently created, based on web pages) but provide TC and not DF: what is their relationship?
Background
• Google N-grams provide term count (TC) values
D1 = “Please, Please Me”
D2 = “Can’t Buy Me Love”
D3 = “All You Need Is Love”
D4 = “Long, Long, Long”
Term | All | Buy | Can’t | Is | Love | Me | Need | Please | You | Long
TC   | 1   | 1   | 1     | 1  | 2    | 2  | 1    | 2      | 1   | 3
DF   | 1   | 1   | 1     | 1  | 2    | 2  | 1    | 1      | 1   | 1
TC >= DF, but is there a correlation?
Can we use TC to estimate DF?
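The TC and DF values of the four-document example can be reproduced with a short script. The tokenization (lowercasing, splitting on non-letters except the apostrophe) is an assumption made for this sketch.

```python
import re
from collections import Counter

docs = {
    "D1": "Please, Please Me",
    "D2": "Can't Buy Me Love",
    "D3": "All You Need Is Love",
    "D4": "Long, Long, Long",
}

def tokenize(text):
    """Lowercase terms; keeps apostrophes so that "can't" stays one term."""
    return re.findall(r"[a-z']+", text.lower())

tc = Counter()  # term count: total occurrences across the corpus
df = Counter()  # document frequency: number of documents containing the term
for text in docs.values():
    tokens = tokenize(text)
    tc.update(tokens)
    df.update(set(tokens))

for term in sorted(tc):
    print(term, tc[term], df[term])
```

"Long" and "Please" show why TC only bounds DF from above: repeated use inside a single document raises TC but not DF.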
Experiment Results
Investigate correlation between TC and DF within “Web as Corpus” (WaC)
[Figure: Rank similarity of all terms]
Experiment Results
Investigate correlation between TC and DF within “Web as Corpus” (WaC)
[Figure: Spearman’s ρ and Kendall τ]
Experiment Results
Top 10 terms in decreasing order of their TF/IDF values taken from http://ecir09.irit.fr
Rank | WaC-DF     | WaC-TC     | Google     | N-Grams
1    | IR         | IR         | IR         | IR
2    | RETRIEVAL  | RETRIEVAL  | RETRIEVAL  | IRSG
3    | IRSG       | IRSG       | IRSG       | RETRIEVAL
4    | BCS        | IRIT       | CONFERENCE | BCS
5    | IRIT       | BCS        | BCS        | EUROPEAN
6    | CONFERENCE | 2009       | GRANT      | CONFERENCE
7    | GOOGLE     | FILTERING  | IRIT       | IRIT
8    | 2009       | GOOGLE     | FILTERING  | GOOGLE
9    | FILTERING  | CONFERENCE | EUROPEAN   | ACM
10   | GRANT      | ARIA       | PAPERS     | GRANT
Google: screen scraping DF values from the Google web interface
U = 14, ∩ = 6
Strong indicator that TC can be used to estimate DF for web pages!
Experiment Results
Show similarity between WaC based TC and Google N-Gram based TC
[Figure: TC frequencies; N-Grams have a threshold of 200]
Experiment Results
[Figure: Frequency of TC/DF ratio within the WaC; panels: two decimals, one decimal, integer values]
Conclusions
• TC and DF ranks within the WaC show strong correlation
• TC frequencies of WaC and Google N-Grams are very similar
• N-Grams are suitable for accurate IDF estimation for web pages
This does not mean that everything correlated to TC can be used as a DF substitute!
Inter-Search Engine Lexical Signature Performance (JCDL 2009)
Inter-Search Engine Lexical Signature Performance
Martin Klein, Michael L. Nelson
{mklein,mln}@cs.odu.edu
http://en.wikipedia.org/wiki/Elephant
• Elephant, Tusks, Trunk, African, Loxodonta
• Elephant, Asian, African, Species, Trunk
• Elephant, African, Tusks, Asian, Trunk
Synchronicity – Automatically Rediscover Missing Web Pages in Real Time (JCDL 2011)
Synchro…What?
Synchronicity
• Experience of causally unrelated events occurring together in a meaningful manner
• Events reveal underlying pattern, framework bigger than any of the synchronous systems
• Carl Gustav Jung (1875-1961)
  • “meaningful coincidence”
  • Deschamps – de Fontgibu plum pudding example
Picture from http://www.crystalinks.com/jung.html
Synchro…What?
http://www.youtube.com/watch?v=X4HQyqc-aVU
Repo Man (1984): http://www.imdb.com/title/tt0087995/
Book of the Dead (not yet published)
Book of the Dead
• Corpus of missing web pages
• 233 URIs returning status 404
• Mechanical Turk to determine “aboutness”
  • Guess from URI string
• Mementos for 161 URIs
  • Apply lexical signatures and title
Experiment Results
[Figure: Dice similarity coefficient of top 100 results, panels for 5-term LSs and titles; legend: D = 0, 0.0 < D ≤ 0.3, 0.3 < D ≤ 0.6, 0.6 < D ≤ 1.0]
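The Dice coefficient and the four legend intervals can be sketched as follows. The slide does not say which strings are compared or how they are split; this sketch assumes a comparison of two strings (for example a result URI and the missing URI) over their sets of character bigrams, and all function names are illustrative.

```python
def bigrams(s):
    """Set of adjacent character pairs of a string."""
    return {s[i:i + 2] for i in range(len(s) - 1)}

def dice(a, b):
    """Dice coefficient 2|X ∩ Y| / (|X| + |Y|) over character bigrams."""
    x, y = bigrams(a), bigrams(b)
    if not x and not y:
        return 1.0 if a == b else 0.0  # strings too short to form bigrams
    return 2 * len(x & y) / (len(x) + len(y))

def bucket(d):
    """Map a coefficient to the intervals used in the figure legend."""
    if d == 0:
        return "D = 0"
    if d <= 0.3:
        return "0.0 < D ≤ 0.3"
    if d <= 0.6:
        return "0.3 < D ≤ 0.6"
    return "0.6 < D ≤ 1.0"

score = dice("night", "nacht")
print(score, bucket(score))
```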
Experiment Results
[Figure: Jaro distance of top 100 results, panels for 5-term LSs and titles; legend: J = 0, 0.0 < J ≤ 0.3, 0.3 < J ≤ 0.6, 0.6 < J ≤ 1.0]
Book of the Dead
• Mechanical Turk to determine relevance of results
  • Top 10 only
  • Relevant
  • Somewhat relevant
  • Not relevant
  • Broken URI
• nDCG of top 10 results
Experiment Results
[Figure: Relevance of top 10 results, panels for 5-term LSs and titles]
Experiment Results
[Figure: nDCG of top 10 results]