link persistence, website persistence
DESCRIPTION
Presentation on the discrepancy between measurements of link persistence and website persistence and why it matters.
TRANSCRIPT
Link Persistence, Website Persistence
Nicholas Taylor
@nullhandle
May 28, 2013
“Forward” by Flickr user Hitchster under CC BY 2.0
why preserve the web?
broken links
“404” by Flickr user adactio under CC BY 2.0
44 days (Kahle, 1997)
75 days (Kahle, 2001)
100 days (Kahle, 2003)
variable (Sanderson, Phillips, and Van de Sompel, 2011)
• literature review of 17 studies
• research focused on scholarly citations
• decay rates of 39-82%
• over periods of 1-13 years
“Digital documents last forever—or five years, whichever comes first.”
(Jeff Rothenberg, 1997)
“Out of books sprout... plants” by DeviantArt user quinn.anya under CC BY-SA 2.0
The Art and Science of
LINK CHECKING
“http Blue Background” by DeviantArt user SoulArt2012 under CC BY-NC-ND 3.0
http response codes
• 404: “Not Found”
• 200: “OK”
• 301: “Moved Permanently”
• 500: “Internal Server Error”
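A minimal sketch of how a link checker might turn the response codes above into a coarse verdict. The function name `classify_status` and the three verdict labels are assumptions made for illustration; they are not taken from the presentation.

```python
from http import HTTPStatus


def classify_status(code: int) -> str:
    """Map an HTTP status code to a coarse link-check verdict (hypothetical labels)."""
    if 200 <= code < 300:
        return "working"
    if 300 <= code < 400:
        return "redirect"
    if 400 <= code < 600:
        return "broken"
    return "unknown"


# The four codes named on the slide, with their standard reason phrases.
for code in (404, 200, 301, 500):
    print(code, HTTPStatus(code).phrase, "->", classify_status(code))
```

A verdict of "working" only says that the server answered; it says nothing about whether the same website is still behind the link.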
automated link checker
“La Machine @ Yokohama” by Flickr user chidorian under CC BY-SA 2.0
what link checking tells us
“200 ok” by Flickr user reidab under CC BY-NC-SA 2.0
possible scenarios
• link works; same website
• link works; different website
– website may or may not still exist
• link doesn’t work; website still exists
• link doesn’t work; website no longer exists
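The four scenarios can be written down as a small decision function. This is only a sketch: the `Scenario` enum and the three boolean inputs are hypothetical names, and deciding the value of each boolean is the hard part that the rest of the talk addresses.

```python
from enum import Enum


class Scenario(Enum):
    WORKS_SAME_SITE = "link works; same website"
    WORKS_DIFFERENT_SITE = "link works; different website"
    BROKEN_SITE_EXISTS = "link doesn't work; website still exists"
    BROKEN_SITE_GONE = "link doesn't work; website no longer exists"


def classify(link_works: bool, same_site: bool, site_exists: bool) -> Scenario:
    """Pick one of the four scenarios from three yes/no observations."""
    if link_works:
        # For a working link that now shows a different website, the original
        # website may or may not still exist, so site_exists is not consulted.
        return Scenario.WORKS_SAME_SITE if same_site else Scenario.WORKS_DIFFERENT_SITE
    return Scenario.BROKEN_SITE_EXISTS if site_exists else Scenario.BROKEN_SITE_GONE


print(classify(True, True, True).value)
print(classify(True, False, True).value)
print(classify(False, False, True).value)
print(classify(False, False, False).value)
```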
link works; same website
http://www.fair.org/ (2002)
http://www.fair.org/ (2013)
link works; different website…
http://www.fb.com/ (2002)
http://www.fb.com/ (2013)
…but website still exists
http://www.fb.org/ (2013)
link doesn’t work…
http://www.state.mo.us/ (2002)
http://www.state.mo.us/ (2013)
…but website still exists
http://www.sos.mo.gov/ (2013)
link doesn’t work; website no longer exists
assumptions
• link works; same website
• link works; different website
– website may or may not still exist
• link doesn’t work; website still exists
• link doesn’t work; website no longer exists
research questions
• how much are we overestimating website persistence?
– some working links point to different websites
• how much are we underestimating website persistence?
– websites may still exist even though links don’t work or do work but point to different websites
A Study Using
WEB ARCHIVES
Library of Congress
U.S. Election 2002 Web Archive
preparing the list of links
• exclude links corresponding to electoral candidate websites
• 1,071 links
– state government
– political parties
– advocacy organizations
– major newspapers
– political blogs
methodology
automated
• run Heritrix against links, ignoring robots.txt
• log http response codes
• log redirects
manual
• manually check each link
• same website behind working link?
• does website still exist?
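The automated half of the methodology (log response codes, log redirects) can be sketched in a few lines. This is not Heritrix and does not use its API; `trace_link`, the injected `fetch` callable, and the example URLs are all hypothetical, chosen so the sketch runs without network access.

```python
from typing import Callable, Dict, List, Optional, Tuple
from urllib.parse import urljoin

# A fetcher takes a URL and returns (status code, Location header or None).
Fetch = Callable[[str], Tuple[int, Optional[str]]]

REDIRECT_CODES = (301, 302, 303, 307, 308)


def trace_link(url: str, fetch: Fetch, max_hops: int = 10) -> List[Tuple[str, int]]:
    """Follow redirects one hop at a time, logging (url, status) for every hop."""
    log: List[Tuple[str, int]] = []
    for _ in range(max_hops):
        status, location = fetch(url)
        log.append((url, status))
        if status in REDIRECT_CODES and location:
            url = urljoin(url, location)  # resolve relative Location values
        else:
            break
    return log


# Canned responses standing in for real servers (illustrative data only).
FAKE_RESPONSES: Dict[str, Tuple[int, Optional[str]]] = {
    "http://old.example/": (301, "http://new.example/"),
    "http://new.example/": (200, None),
}


def fake_fetch(url: str) -> Tuple[int, Optional[str]]:
    return FAKE_RESPONSES.get(url, (404, None))


print(trace_link("http://old.example/", fake_fetch))
```

The log answers only the "working link?" question; the manual checks in the methodology are what establish whether the same website is behind the final URL.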
working link?
• working: 91%
• non-working: 9%
same website?
• working link; same site: 83%
• non-working link: 9%
• working link; different site: 8%
non-working link; website still exists?
• working: 91%
• still exists: 8%
• doesn't exist: 2%
website still exists?
• still exists: 94%
• doesn't exist: 6%
summary of results
• how much are we overestimating website persistence?
– 8% of working links point to different websites
• how much are we underestimating website persistence?
– 82% of websites associated with non-working links still exist
– 48% of websites whose links now point to different websites still exist
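As a rough cross-check, the rounded percentages from the charts can be combined to see how the link-level and website-level figures relate. The variable names are hypothetical, and because the inputs are already rounded, the result is approximate.

```python
# Shares of all links, as read from the charts (rounded values).
working_same_site = 0.83
working_different_site = 0.08
non_working = 0.09

# Shares of websites that still exist within each problem group.
different_site_still_exists = 0.48
non_working_still_exists = 0.82

# Link persistence: the link returns something, whatever it is.
link_persistence = working_same_site + working_different_site

# Website persistence: the original website can still be found somewhere.
website_persistence = (
    working_same_site
    + working_different_site * different_site_still_exists
    + non_working * non_working_still_exists
)

print(f"link persistence:    {link_persistence:.0%}")
print(f"website persistence: {website_persistence:.0%}")
```

Under these assumptions the website figure comes out a few points above the link figure, in line with the 91% and 94% reported on the charts.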
what does it mean?
• websites are (much more) persistent than links
• websites are surprisingly durable?
“Golden Spider Silk” by Flickr user amandabhslater under CC BY-SA 2.0
Beyond Link Checking,
WEBSITE CHECKING?
“Check” by Flickr user ex.libris under CC BY-NC-ND 2.0
building a website checker
1. check whether link still works
2. check whether link still corresponds to website
3. check whether website still exists
“Most web archiving problems are problems of scale.”
(Kris Carpenter Negulescu, 2012)
“chutes and ladders” by Flickr user reallyboring under CC BY-NC-SA 2.0
building a website checker
1. check whether link still works
2. check whether link still corresponds to website
3. check whether website still exists
Heritrix compares checksums
“Fingerprint” by Flickr user CPOA under CC BY-ND 2.0
…but checksums are limited
“Hashing Emily” by Flickr user wlef70 under CC BY-NC-SA 3.0
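A small sketch of why a checksum comparison is a blunt instrument for step 2. SHA-1 and the three sample pages are assumptions for illustration; the point is only that a digest reports "identical" or "not identical", with nothing in between.

```python
import hashlib


def digest(content: bytes) -> str:
    """Return a hex digest of the page bytes (SHA-1 chosen for illustration)."""
    return hashlib.sha1(content).hexdigest()


# Same site, captured twice, differing only in a generated timestamp.
page_then = b"<html><body><h1>Election 2002</h1><p>Generated 10:00</p></body></html>"
page_now = b"<html><body><h1>Election 2002</h1><p>Generated 10:01</p></body></html>"
# A different website now answering at the same link.
page_other = b"<html><body><h1>Domain for sale</h1></body></html>"

same_bytes = digest(page_then) == digest(page_then)
trivial_change = digest(page_then) == digest(page_now)
different_site = digest(page_then) == digest(page_other)

print(same_bytes, trivial_change, different_site)
```

A one-character change and a wholesale replacement both show up simply as a mismatch, so the checksum alone cannot say whether the link still corresponds to the same website.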
visual analysis of page changes
Pehlivan, Ben-Saad, and Gançarski: “Vi-DIFF: Understanding Web Pages Changes”
building a website checker
1. check whether link still works
2. check whether link still corresponds to website
3. check whether website still exists
lexical signature of archived page
Ware, Klein, and Nelson: “An Evaluation of Link Neighborhood Lexical Signatures to Rediscover Missing Web Pages”
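A lexical signature is a short list of a page's most distinctive terms, which can then be used as a search query to rediscover the page. The sketch below ranks terms by a smoothed TF-IDF score; the weighting, the tokenizer, and the sample texts are assumptions for illustration and are not the method evaluated in the cited paper.

```python
import math
import re
from collections import Counter
from typing import List


def tokenize(text: str) -> List[str]:
    """Lowercase the text and keep alphabetic runs only."""
    return re.findall(r"[a-z]+", text.lower())


def lexical_signature(page: str, corpus: List[str], k: int = 5) -> List[str]:
    """Return the k terms of `page` with the highest TF-IDF score against `corpus`."""
    tf = Counter(tokenize(page))
    corpus_terms = [set(tokenize(doc)) for doc in corpus]
    n = len(corpus)

    def score(term: str) -> float:
        df = sum(1 for terms in corpus_terms if term in terms)
        idf = math.log((n + 1) / (df + 1)) + 1  # smoothed inverse document frequency
        return tf[term] * idf

    # Sort by descending score, breaking ties alphabetically for stable output.
    return sorted(tf, key=lambda t: (-score(t), t))[:k]


archived_page = "farm bureau farm policy news"
background = ["news and policy today", "policy news update", "sports news"]
print(lexical_signature(archived_page, background, k=2))
```

Terms that are frequent on the archived page but rare elsewhere rise to the top, which is what makes them useful as a query for finding where the website lives now.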
find archived pages w/ Memento
• http protocol enhancement
• enables discovery of archived resources in distributed web archives
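In Memento, a client asks a TimeGate for the archived version of a URL closest to a given date by sending an `Accept-Datetime` request header. The sketch below only builds such a request and sends nothing; the TimeGate prefix and the function name `timegate_request` are assumptions, so substitute the TimeGate of whichever archive or aggregator is in use.

```python
from datetime import datetime, timezone
from email.utils import format_datetime
from typing import Dict, Tuple

# Assumed aggregator TimeGate prefix; replace with the one you actually use.
TIMEGATE_PREFIX = "http://timetravel.mementoweb.org/timegate/"


def timegate_request(url: str, when: datetime) -> Tuple[str, Dict[str, str]]:
    """Build the TimeGate URL and headers for datetime negotiation.

    `when` must be a timezone-aware datetime; it is converted to GMT.
    """
    http_date = format_datetime(when.astimezone(timezone.utc), usegmt=True)
    return TIMEGATE_PREFIX + url, {"Accept-Datetime": http_date}


# Ask for the capture of a 2002 link nearest to election day 2002.
election_day = datetime(2002, 11, 5, tzinfo=timezone.utc)
request_url, request_headers = timegate_request("http://www.state.mo.us/", election_day)
print(request_url)
print(request_headers)
```

The returned URL and headers can be passed to any HTTP client; the TimeGate is expected to answer with a redirect to the archived capture.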
lexical signatures of backlink pages
“The future is already here; it’s just not very evenly distributed.”
(William Gibson, 1999)
“Time Travel” by Flickr user xcalibr under CC BY-NC-ND 2.0
Nicholas Taylor
@nullhandle
“Thank You” by Flickr user muffintinmom under CC BY 2.0