![Page 1: "Web Archive services framework for tighter integration between the past and the present web", Phd defense presentation](https://reader036.vdocuments.us/reader036/viewer/2022062513/554e821cb4c9054a698b5542/html5/thumbnails/1.jpg)
1
WEB ARCHIVE SERVICES FRAMEWORK
FOR TIGHTER INTEGRATION BETWEEN THE PAST AND PRESENT WEB
Ahmed AlSum
PhD Defense
February 2014
Committee Members:
• Michael L. Nelson
• Michele C. Weigle
• Hussein M. Abdel-Wahab
• M’Hammad Abdous
• Herbert Van de Sompel
Old Dominion University Computer Science Department
![Page 2: "Web Archive services framework for tighter integration between the past and the present web", Phd defense presentation](https://reader036.vdocuments.us/reader036/viewer/2022062513/554e821cb4c9054a698b5542/html5/thumbnails/2.jpg)
2
Domain
Contribution
Goal
WEB ARCHIVE SERVICES FRAMEWORK
FOR TIGHTER INTEGRATION BETWEEN THE PAST AND PRESENT WEB
![Page 3: "Web Archive services framework for tighter integration between the past and the present web", Phd defense presentation](https://reader036.vdocuments.us/reader036/viewer/2022062513/554e821cb4c9054a698b5542/html5/thumbnails/3.jpg)
3
Outline
• Introduction
• Web Archiving Services Framework
  • Content Service
  • Metadata Service
  • URI Service
  • Archive Service
• Conclusions
![Page 4: "Web Archive services framework for tighter integration between the past and the present web", Phd defense presentation](https://reader036.vdocuments.us/reader036/viewer/2022062513/554e821cb4c9054a698b5542/html5/thumbnails/4.jpg)
4
INTRODUCTION
Motivation and Research Questions
![Page 5: "Web Archive services framework for tighter integration between the past and the present web", Phd defense presentation](https://reader036.vdocuments.us/reader036/viewer/2022062513/554e821cb4c9054a698b5542/html5/thumbnails/5.jpg)
5
What is a Web Archive?
Introduction Motivation
http://www.cs.odu.edu
![Page 6: "Web Archive services framework for tighter integration between the past and the present web", Phd defense presentation](https://reader036.vdocuments.us/reader036/viewer/2022062513/554e821cb4c9054a698b5542/html5/thumbnails/6.jpg)
6
Who is using Web Archives, and how?
• Politicians
• Journalists
• Web designers
• Historians
• Researchers
• Social scientists
• Curious users
Introduction Motivation
*IIPC Access Working Group 2006, Costa 2010, Dougherty 2010, Stirling 2011, Smith 2009
![Page 7: "Web Archive services framework for tighter integration between the past and the present web", Phd defense presentation](https://reader036.vdocuments.us/reader036/viewer/2022062513/554e821cb4c9054a698b5542/html5/thumbnails/7.jpg)
7
Web Archives interfaces are limited
Introduction Motivation
![Page 8: "Web Archive services framework for tighter integration between the past and the present web", Phd defense presentation](https://reader036.vdocuments.us/reader036/viewer/2022062513/554e821cb4c9054a698b5542/html5/thumbnails/8.jpg)
8
Web Archiving Use Cases
• Ponguru asked on the Internet Archive forum on May 17, 2010*:
"Hi All - I am new to Archive.org. A few quick questions
(1) Is there any API or tools available to access the Archive.org contents programmatically?
(2) Are there any research papers where Archive.org was used for data collection / analysis (e.g. studying a particular topic over time, etc.)? I digged a little bit, could not find much, so checking with the group."
Introduction Motivation
*http://archive.org/post/306799/api-or-tools-to-access-research-publications-on-archiveorg
![Page 9: "Web Archive services framework for tighter integration between the past and the present web", Phd defense presentation](https://reader036.vdocuments.us/reader036/viewer/2022062513/554e821cb4c9054a698b5542/html5/thumbnails/9.jpg)
9
Lack of APIs
• Popular websites provide APIs to third-party developers.
Introduction Motivation
![Page 10: "Web Archive services framework for tighter integration between the past and the present web", Phd defense presentation](https://reader036.vdocuments.us/reader036/viewer/2022062513/554e821cb4c9054a698b5542/html5/thumbnails/10.jpg)
10
Limited and non-standard APIs
• Current Web Archives have a limited set of APIs that don’t cover users’ needs.
Introduction Motivation
![Page 11: "Web Archive services framework for tighter integration between the past and the present web", Phd defense presentation](https://reader036.vdocuments.us/reader036/viewer/2022062513/554e821cb4c9054a698b5542/html5/thumbnails/11.jpg)
11
Wayback Machine API
• It provides a JSON interface to the list of available Mementos.
Introduction Motivation
![Page 12: "Web Archive services framework for tighter integration between the past and the present web", Phd defense presentation](https://reader036.vdocuments.us/reader036/viewer/2022062513/554e821cb4c9054a698b5542/html5/thumbnails/12.jpg)
12
Croatian Web Archive
Introduction Motivation
Full-text search web interface Full-text search APIs in JSON
![Page 13: "Web Archive services framework for tighter integration between the past and the present web", Phd defense presentation](https://reader036.vdocuments.us/reader036/viewer/2022062513/554e821cb4c9054a698b5542/html5/thumbnails/13.jpg)
13
Memento
• Memento provides TimeMaps in the application/link-format serialization (CoRE Link Format).
Introduction Motivation
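A TimeMap in link format is a comma-separated list of links, each carrying a `rel` value and, for mementos, a `datetime`. The sketch below shows one way a client might read such a document. The sample TimeMap, the `archive.example.org` host, and the function name `parse_timemap` are assumptions made for illustration; a production parser would need to handle more of the link-format grammar.

```python
import re
from datetime import datetime

# Hypothetical TimeMap body in application/link-format (illustrative data only).
SAMPLE_TIMEMAP = """\
<http://example.com/>;rel="original",
<http://archive.example.org/web/20110411070244/http://example.com/>;rel="first memento";datetime="Mon, 11 Apr 2011 07:02:44 GMT",
<http://archive.example.org/web/20120101000000/http://example.com/>;rel="last memento";datetime="Sun, 01 Jan 2012 00:00:00 GMT"
"""

# One link: <URI> followed by zero or more ;name="value" parameters.
LINK_PATTERN = re.compile(r'<([^>]+)>((?:\s*;\s*\w+="[^"]*")*)')
PARAM_PATTERN = re.compile(r'(\w+)="([^"]*)"')


def parse_timemap(text):
    """Return a time-ordered list of (datetime, URI-M) pairs from a TimeMap."""
    mementos = []
    for uri, params in LINK_PATTERN.findall(text):
        attrs = dict(PARAM_PATTERN.findall(params))
        # Only links whose rel contains the token "memento" are mementos.
        if "memento" in attrs.get("rel", "").split():
            when = datetime.strptime(attrs["datetime"], "%a, %d %b %Y %H:%M:%S GMT")
            mementos.append((when, uri))
    return sorted(mementos)


if __name__ == "__main__":
    for when, uri in parse_timemap(SAMPLE_TIMEMAP):
        print(when.isoformat(), uri)
```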
![Page 14: "Web Archive services framework for tighter integration between the past and the present web", Phd defense presentation](https://reader036.vdocuments.us/reader036/viewer/2022062513/554e821cb4c9054a698b5542/html5/thumbnails/14.jpg)
14
Memento Terminology
Introduction Motivation

| Term | Notation | Example |
| --- | --- | --- |
| Original Resource | URI-R, R | http://www.amazon.com |
| Memento | URI-M, M | http://web.archive.org/web/20110411070244/http://amazon.com |
| TimeMap | URI-T, TM | |

Van de Sompel, H., Nelson, M. L., & Sanderson, R. (2013). RFC 7089 - HTTP framework for time-based access to resource states -- Memento. Internet Engineering Task Force (IETF). Retrieved from http://tools.ietf.org/html/rfc7089
![Page 15: "Web Archive services framework for tighter integration between the past and the present web", Phd defense presentation](https://reader036.vdocuments.us/reader036/viewer/2022062513/554e821cb4c9054a698b5542/html5/thumbnails/15.jpg)
15
Memento Aggregator
• Merges TimeMaps from various archives.
Introduction Motivation
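Merging TimeMaps can be pictured as a merge of several time-ordered lists. The following is a minimal sketch, assuming each archive's TimeMap has already been reduced to a sorted list of `(timestamp, URI-M)` tuples; the function name `merge_timemaps` and the sample archives are hypothetical and say nothing about how the actual aggregator is implemented.

```python
import heapq


def merge_timemaps(*timemaps):
    """Merge several time-sorted TimeMaps into one, dropping exact duplicates."""
    merged = []
    seen = set()
    # heapq.merge lazily combines inputs that are each already sorted.
    for entry in heapq.merge(*timemaps):
        if entry not in seen:
            seen.add(entry)
            merged.append(entry)
    return merged


if __name__ == "__main__":
    # Hypothetical TimeMaps from two archives, with 14-digit timestamps.
    archive_a = [
        ("20100101000000", "http://archive-a.example/web/20100101000000/http://example.com/"),
        ("20120101000000", "http://archive-a.example/web/20120101000000/http://example.com/"),
    ]
    archive_b = [
        ("20110101000000", "http://archive-b.example/20110101000000/http://example.com/"),
    ]
    for timestamp, uri in merge_timemaps(archive_a, archive_b):
        print(timestamp, uri)
```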
![Page 16: "Web Archive services framework for tighter integration between the past and the present web", Phd defense presentation](https://reader036.vdocuments.us/reader036/viewer/2022062513/554e821cb4c9054a698b5542/html5/thumbnails/16.jpg)
16
Web Archiving as Big Data
• The Internet Archive corpus has reached 5 petabytes.
• Bibliotheca Alexandrina needs one year to recompute the checksums for its corpus.
• Tools
  • Apache Pig
Introduction Motivation
![Page 17: "Web Archive services framework for tighter integration between the past and the present web", Phd defense presentation](https://reader036.vdocuments.us/reader036/viewer/2022062513/554e821cb4c9054a698b5542/html5/thumbnails/17.jpg)
17
Research Question
How can we enrich the web archive access interface in conjunction with the live web?
Introduction Research Questions
![Page 18: "Web Archive services framework for tighter integration between the past and the present web", Phd defense presentation](https://reader036.vdocuments.us/reader036/viewer/2022062513/554e821cb4c9054a698b5542/html5/thumbnails/18.jpg)
18
Research Questions
• What are the required services for the web archiving user community?
• Shall we work on the web archive collection as one entity or on different levels?
• How can we use the web archive content beyond full-text search?
• What are the metadata fields that could enhance user browsing?
• How can we develop an access interface to the temporal web graph?
• How can we optimize the creation of thumbnails?
• How can we use HTTP redirection to enhance the URI-lookup query?
• How can we optimize the query routing mechanism across the web archives?
Introduction Research Questions
![Page 19: "Web Archive services framework for tighter integration between the past and the present web", Phd defense presentation](https://reader036.vdocuments.us/reader036/viewer/2022062513/554e821cb4c9054a698b5542/html5/thumbnails/19.jpg)
19
WEB ARCHIVE SERVICE FRAMEWORK
Levels and Datasets
![Page 20: "Web Archive services framework for tighter integration between the past and the present web", Phd defense presentation](https://reader036.vdocuments.us/reader036/viewer/2022062513/554e821cb4c9054a698b5542/html5/thumbnails/20.jpg)
20
Web Archive Service Framework
![Page 21: "Web Archive services framework for tighter integration between the past and the present web", Phd defense presentation](https://reader036.vdocuments.us/reader036/viewer/2022062513/554e821cb4c9054a698b5542/html5/thumbnails/21.jpg)
21
• Archive level
  • Web Archive profiling to optimize the query routing.
• URI level
  • URI HTTP redirection in the web archive URI-lookup.
• Metadata level
  • ArcLink
  • ArcThumb
• Content level
  • ArcContent
Web Archive Service Framework
ArcSys
![Page 22: "Web Archive services framework for tighter integration between the past and the present web", Phd defense presentation](https://reader036.vdocuments.us/reader036/viewer/2022062513/554e821cb4c9054a698b5542/html5/thumbnails/22.jpg)
22
IIPC 2010 Winter Olympics
Web Archive Service Framework Datasets
* http://olympics.us.archive.org/olympics2010/
Size 700+GB
From Nov 2009
To Mar 2010
#URI-R 6.4M
#URI-M 23.7M
![Page 23: "Web Archive services framework for tighter integration between the past and the present web", Phd defense presentation](https://reader036.vdocuments.us/reader036/viewer/2022062513/554e821cb4c9054a698b5542/html5/thumbnails/23.jpg)
23
Fortune 500
• 499,540 mementos from 488 TimeMaps.
• For each Memento, we download the HTML and capture the thumbnail using PhantomJS.
Web Archive Service Framework Datasets
![Page 24: "Web Archive services framework for tighter integration between the past and the present web", Phd defense presentation](https://reader036.vdocuments.us/reader036/viewer/2022062513/554e821cb4c9054a698b5542/html5/thumbnails/24.jpg)
24
DMOZ
Web Archive Service Framework Datasets
• URI Open Directory based on user submissions.
![Page 25: "Web Archive services framework for tighter integration between the past and the present web", Phd defense presentation](https://reader036.vdocuments.us/reader036/viewer/2022062513/554e821cb4c9054a698b5542/html5/thumbnails/25.jpg)
25
CONTENT SERVICE
ArcContent
Archive
URI
Metadata
Content
![Page 26: "Web Archive services framework for tighter integration between the past and the present web", Phd defense presentation](https://reader036.vdocuments.us/reader036/viewer/2022062513/554e821cb4c9054a698b5542/html5/thumbnails/26.jpg)
26
Wayback Machine URI Rewriting
Original Rewritten
Content Service
![Page 27: "Web Archive services framework for tighter integration between the past and the present web", Phd defense presentation](https://reader036.vdocuments.us/reader036/viewer/2022062513/554e821cb4c9054a698b5542/html5/thumbnails/27.jpg)
27
Response Types
Raw Response
Modified Response
Extracted Response
Content Service
![Page 28: "Web Archive services framework for tighter integration between the past and the present web", Phd defense presentation](https://reader036.vdocuments.us/reader036/viewer/2022062513/554e821cb4c9054a698b5542/html5/thumbnails/28.jpg)
28
ArcContent Architecture Diagram
Content Service
![Page 29: "Web Archive services framework for tighter integration between the past and the present web", Phd defense presentation](https://reader036.vdocuments.us/reader036/viewer/2022062513/554e821cb4c9054a698b5542/html5/thumbnails/29.jpg)
29
Extracted Response Filters
Content Service
TextContent
TFContent
![Page 30: "Web Archive services framework for tighter integration between the past and the present web", Phd defense presentation](https://reader036.vdocuments.us/reader036/viewer/2022062513/554e821cb4c9054a698b5542/html5/thumbnails/30.jpg)
30
Extracted Response Formats
Content Service
XML
JSON
![Page 31: "Web Archive services framework for tighter integration between the past and the present web", Phd defense presentation](https://reader036.vdocuments.us/reader036/viewer/2022062513/554e821cb4c9054a698b5542/html5/thumbnails/31.jpg)
31
ArcContent Applications
Content Service
TFContent
TagClouds
![Page 32: "Web Archive services framework for tighter integration between the past and the present web", Phd defense presentation](https://reader036.vdocuments.us/reader036/viewer/2022062513/554e821cb4c9054a698b5542/html5/thumbnails/32.jpg)
32
METADATA SERVICE
ArcLink & ArcThumb
Archive
URI
Metadata
Content
![Page 33: "Web Archive services framework for tighter integration between the past and the present web", Phd defense presentation](https://reader036.vdocuments.us/reader036/viewer/2022062513/554e821cb4c9054a698b5542/html5/thumbnails/33.jpg)
33
Metadata Access Service
Metadata Service
• Metadata is data about data.
• Metadata layer is data about mementos.

| Type | Field | Description | Example |
| --- | --- | --- | --- |
| Technical | Content-type | Entity mimetype. | text/html |
| Technical | Content-length | Size of the entity-body. | 90883 |
| Extracted | Title | Title of the page. | Egypt rejoices at Mubarak departure |
| Extracted | Description | Description about the content of the entity-body. | The BBC World Affairs Editor John Simpson reflects on how Egypt brought about the overthrow of President Hosni Mubarak. |
| Extracted | Outgoing Links | A list of all the outlinks that the page pointed to. | |
| Derived | Thumbnail | Thumbnail of the representation of the web page. | |
| Derived | Incoming Links | A list of all the inlinks that pointed to the page. | |
![Page 34: "Web Archive services framework for tighter integration between the past and the present web", Phd defense presentation](https://reader036.vdocuments.us/reader036/viewer/2022062513/554e821cb4c9054a698b5542/html5/thumbnails/34.jpg)
34
ArcLink
Motivation, Stages, Cost Model, Applications
![Page 35: "Web Archive services framework for tighter integration between the past and the present web", Phd defense presentation](https://reader036.vdocuments.us/reader036/viewer/2022062513/554e821cb4c9054a698b5542/html5/thumbnails/35.jpg)
35
ArcLink: optimization techniques to build and retrieve the temporal web graph
A. AlSum and M. L. Nelson.
In Proceedings of the 13th annual international ACM/IEEE joint conference on Digital libraries
JCDL ‘13, Indianapolis, Indiana, 2013
See also: http://arxiv.org/abs/1305.5959
![Page 36: "Web Archive services framework for tighter integration between the past and the present web", Phd defense presentation](https://reader036.vdocuments.us/reader036/viewer/2022062513/554e821cb4c9054a698b5542/html5/thumbnails/36.jpg)
36
Easily Solved Questions
Q: What are the available mementos for www.vancouver2010.com?
Metadata Service ArcLink Motivation
![Page 37: "Web Archive services framework for tighter integration between the past and the present web", Phd defense presentation](https://reader036.vdocuments.us/reader036/viewer/2022062513/554e821cb4c9054a698b5542/html5/thumbnails/37.jpg)
37
Solved Questions, but hard
Q. What are the HTML titles for www.vancouver2010.com through time?
A. Page scraping for all mementos
Metadata Service ArcLink Motivation
![Page 38: "Web Archive services framework for tighter integration between the past and the present web", Phd defense presentation](https://reader036.vdocuments.us/reader036/viewer/2022062513/554e821cb4c9054a698b5542/html5/thumbnails/38.jpg)
38
Impossible Questions
Q. What anchor texts pointed to www.vancouver2010.com through time?
Metadata Service ArcLink Motivation
…<a href=www.vancouver2010.com >Vancouver Olympics</a>….
…<a href=www.vancouver2010.com >Winter Olympics</a>…
…<a href=www.vancouver2010.com >Vancouver 2010</a>…
![Page 39: "Web Archive services framework for tighter integration between the past and the present web", Phd defense presentation](https://reader036.vdocuments.us/reader036/viewer/2022062513/554e821cb4c9054a698b5542/html5/thumbnails/39.jpg)
39
Outlinks
Metadata Service ArcLink Motivation
![Page 40: "Web Archive services framework for tighter integration between the past and the present web", Phd defense presentation](https://reader036.vdocuments.us/reader036/viewer/2022062513/554e821cb4c9054a698b5542/html5/thumbnails/40.jpg)
40
ArcLink and Temporal Web Graph
What is ArcLink?
• ArcLink is a complete system to extract, preserve, and access the Temporal Web Graph.
What is the Temporal Web Graph?
• The link structure through time, including inlinks and outlinks.
Metadata Service ArcLink Motivation
[Figure: WG @t1, WG @t2, TWG]
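To make the idea of a temporal web graph concrete, here is a small in-memory sketch that stores timestamped links and answers the "anchor text through time" question from the earlier slide. The class name `TemporalWebGraph`, its methods, and the sample links are assumptions for illustration only; they do not describe ArcLink's actual storage or API.

```python
from collections import defaultdict


class TemporalWebGraph:
    """Toy temporal web graph: every link carries the datetime of its memento."""

    def __init__(self):
        self._outlinks = defaultdict(list)
        self._inlinks = defaultdict(list)

    def add_link(self, timestamp, source, target, anchor_text=""):
        # Index the same edge from both ends so inlinks and outlinks are cheap.
        self._outlinks[source].append((timestamp, target, anchor_text))
        self._inlinks[target].append((timestamp, source, anchor_text))

    def outlinks(self, uri):
        return sorted(self._outlinks.get(uri, []))

    def inlinks(self, uri, start=None, end=None):
        """Inlinks to uri, optionally restricted to start <= timestamp <= end."""
        return sorted(
            link
            for link in self._inlinks.get(uri, [])
            if (start is None or link[0] >= start) and (end is None or link[0] <= end)
        )

    def anchor_texts(self, uri):
        """Anchor texts that pointed to uri, ordered by time."""
        return [(ts, text) for ts, _source, text in self.inlinks(uri)]


if __name__ == "__main__":
    graph = TemporalWebGraph()
    # Hypothetical source pages; the anchor texts echo the slide's examples.
    graph.add_link("200911", "news.example/a", "www.vancouver2010.com", "Vancouver Olympics")
    graph.add_link("201002", "blog.example/b", "www.vancouver2010.com", "Winter Olympics")
    print(graph.anchor_texts("www.vancouver2010.com"))
```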
![Page 41: "Web Archive services framework for tighter integration between the past and the present web", Phd defense presentation](https://reader036.vdocuments.us/reader036/viewer/2022062513/554e821cb4c9054a698b5542/html5/thumbnails/41.jpg)
41
System Stages
Metadata Service ArcLink Stages
![Page 42: "Web Archive services framework for tighter integration between the past and the present web", Phd defense presentation](https://reader036.vdocuments.us/reader036/viewer/2022062513/554e821cb4c9054a698b5542/html5/thumbnails/42.jpg)
42
Filtering
• Using CDX files to filter the URI to select the mementos that will contribute to the Web Graph.
• For example,
  • Exclude non-200 HTTP status code
  • Exclude Images, style-sheets, videos, etc
  • Exclude duplicate mementos
• Technique: Using Pig Latin script on CDX files
• Results: CDX was reduced to 25% of the original size, from 23.8M mementos to 6.7M mementos.
Metadata Service ArcLink Stages
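The filtering rules above can be sketched in a few lines. The slides state that the real filtering was done with a Pig Latin script; the Python below is only an illustration of the same three rules. It assumes a space-separated CDX layout whose first six fields are URL key, timestamp, original URI, MIME type, status code, and digest, and it treats "keep only text/html" as a stand-in for excluding images, style sheets, and videos. The sample lines are made up.

```python
def filter_cdx(lines):
    """Yield CDX lines for unique HTML mementos that returned HTTP 200."""
    seen = set()
    for line in lines:
        if line.lstrip().startswith("CDX"):
            continue  # skip the CDX header line
        fields = line.split()
        if len(fields) < 6:
            continue  # malformed record
        urlkey, _timestamp, _original, mimetype, status, digest = fields[:6]
        if status != "200":
            continue  # exclude non-200 responses
        if not mimetype.startswith("text/html"):
            continue  # exclude images, style sheets, videos, etc.
        if (urlkey, digest) in seen:
            continue  # exclude duplicate captures of identical content
        seen.add((urlkey, digest))
        yield line


if __name__ == "__main__":
    # Hypothetical CDX records (assumed field order, see the lead-in).
    sample = [
        " CDX N b a m s k r M S V g",
        "com,example)/ 20100101000000 http://example.com/ text/html 200 AAAA - - 100 0 f.warc.gz",
        "com,example)/ 20100102000000 http://example.com/ text/html 200 AAAA - - 100 9 f.warc.gz",
        "com,example)/gone 20100101000000 http://example.com/gone text/html 404 BBBB - - 50 5 f.warc.gz",
        "com,example)/logo.png 20100101000000 http://example.com/logo.png image/png 200 CCCC - - 70 7 f.warc.gz",
    ]
    for kept in filter_cdx(sample):
        print(kept)
```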
![Page 43: "Web Archive services framework for tighter integration between the past and the present web", Phd defense presentation](https://reader036.vdocuments.us/reader036/viewer/2022062513/554e821cb4c9054a698b5542/html5/thumbnails/43.jpg)
43
Extraction
• Technique: Hadoop
• Step 1: URI-ID generation
  • Canonicalized the URI into SURT format
  • Hash the canonicalized format using SimHash
  • Completely distributed
  • Example: www.example.org/foo, example.org/foo, www1.example.org/foo → org,example)/foo → ABCD11
• Step 2: Define data sources
  • WARC
  • Web archive UI
Metadata Service ArcLink Stages

| Tasks | Input Source | Map (sec) | Reduce (sec) | Total (sec) |
| --- | --- | --- | --- | --- |
| 2 Tasks | Wayback | 21,422 | 4,194 | 25,616 |
| 2 Tasks | WARC | 13,327 | 2,770 | 16,098 (62%) |
| 5 Tasks | Wayback | 13,721 | 2,257 | 15,978 |
| 5 Tasks | WARC | 8,304 | 1,746 | 10,051 (62%) |
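The URI-ID step can be illustrated with a simplified canonicalizer. This sketch reproduces only the behaviour shown in the slide's example (host variants such as `www` and `www1` collapse to one SURT key); real SURT canonicalization has more rules. The slide names SimHash as the hash, whereas this sketch swaps in a truncated MD5 digest purely to obtain a stable identifier, so `uri_id` and its 16-character output are assumptions.

```python
import hashlib
import re
from urllib.parse import urlsplit


def to_surt(uri):
    """Simplified SURT-style key: reversed host labels, then ')' and the path."""
    if "://" not in uri:
        uri = "http://" + uri  # urlsplit needs a scheme to find the host
    parts = urlsplit(uri)
    host = parts.hostname or ""
    host = re.sub(r"^www\d*\.", "", host)  # www., www1., ... are treated alike
    reversed_host = ",".join(reversed(host.split(".")))
    return f"{reversed_host}){parts.path or '/'}"


def uri_id(uri):
    """Stable identifier for a canonicalized URI (MD5 stand-in, not SimHash)."""
    return hashlib.md5(to_surt(uri).encode("utf-8")).hexdigest()[:16]


if __name__ == "__main__":
    for variant in ("www.example.org/foo", "example.org/foo", "www1.example.org/foo"):
        print(variant, "->", to_surt(variant), "->", uri_id(variant))
```

Because every map task can compute the identifier from the URI alone, no shared counter or lookup table is needed, which is consistent with the "completely distributed" bullet.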
![Page 44: "Web Archive services framework for tighter integration between the past and the present web", Phd defense presentation](https://reader036.vdocuments.us/reader036/viewer/2022062513/554e821cb4c9054a698b5542/html5/thumbnails/44.jpg)
44
Storage
• ArcLink uses a database to store the web graph.
Metadata Service ArcLink Stages
[Figure panels: Insertion Performance, Update Performance]
![Page 45: "Web Archive services framework for tighter integration between the past and the present web", Phd defense presentation](https://reader036.vdocuments.us/reader036/viewer/2022062513/554e821cb4c9054a698b5542/html5/thumbnails/45.jpg)
45
ArcLink Response
Metadata Service ArcLink Stages
![Page 46: "Web Archive services framework for tighter integration between the past and the present web", Phd defense presentation](https://reader036.vdocuments.us/reader036/viewer/2022062513/554e821cb4c9054a698b5542/html5/thumbnails/46.jpg)
46
ArcLink Response
Metadata Service ArcLink Stages
![Page 47: "Web Archive services framework for tighter integration between the past and the present web", Phd defense presentation](https://reader036.vdocuments.us/reader036/viewer/2022062513/554e821cb4c9054a698b5542/html5/thumbnails/47.jpg)
47
ArcLink Response
Metadata Service ArcLink Stages
![Page 48: "Web Archive services framework for tighter integration between the past and the present web", Phd defense presentation](https://reader036.vdocuments.us/reader036/viewer/2022062513/554e821cb4c9054a698b5542/html5/thumbnails/48.jpg)
48
Impossible Questions
Q. What anchor texts pointed to www.vancouver2010.com through time?
Metadata Service ArcLink Applications
![Page 49: "Web Archive services framework for tighter integration between the past and the present web", Phd defense presentation](https://reader036.vdocuments.us/reader036/viewer/2022062513/554e821cb4c9054a698b5542/html5/thumbnails/49.jpg)
49
Temporal Page Rank

| Rank | Nov-2009 | Dec-2009 | Jan-2010 |
| --- | --- | --- | --- |
| 1 | vancouver2010.com/code | - | topsport.com/sportch/liveticker/ |
| 2 | vancouver2010.com/en/langpolicy | - | vancouver2010.com/code |
| 3 | vancouver2010.com/forgotpassword | - | canadacode.vancouver2010.com/user/register |
| 4 | vancouver2010.com/store | - | canadacode.vancouver2010.com |
| 5 | vancouver2010.com/store/index.html | - | canadacode.vancouver2010.com/explore |
| 6 | vancouver2010.com/ | - | canadacode.vancouver2010.com/user/login?destination=node/add/image |
| 7 | canadacode.vancouver2010.com | - | canadacode.vancouver2010.com/pulse |
| 8 | canadacode.vancouver2010.com/nfb-onf | - | canadacode.vancouver2010.com/challenge |
| 9 | canadacode.vancouver2010.com/contact | - | i-credible.nl |
| 10 | canadacode.vancouver2010.com/resources | - | vpzschaatsteam.nl |

| Rank | Feb-2010 | Mar-2010 | Collection (Nov-09 to Mar-10) |
| --- | --- | --- | --- |
| 1 | monlibe.liberation.fr | monlibe.liberation.fr | monlibe.liberation.fr |
| 2 | topsport.com/sportch/liveticker/ | laprovence.com/la-provence-le-faq-de-la-moderation | vancouver2010.com/code |
| 3 | lefigaro.fr | get.adobe.com/flashplayer | lefigaro.fr |
| 4 | laprovence.com/la-provence-le-faq-de-la-moderation | vancouver2010.teamgb.com/teamgb/team-behind-team-gb/filenotfound.aspx | laprovence.com/la-provence-le-faq-de-la-moderation |
| 5 | lefigaro.fr/sport | ledauphine.com | lefigaro.fr/sport |
| 6 | get.adobe.com/flashplayer | lefigaro.fr/economie | get.adobe.com/flashplayer |
| 7 | lefigaro.fr/meteo | lefigaro.fr/sport | lefigaro.fr/meteo |
| 8 | lefigaro.fr/le-talk | lefigaro.fr/actualites-a-la-une | lefigaro.fr/le-talk |
| 9 | dosb.de/de/vancouver-2010/vancouver-ticker/detail/printer.html | lemonde.fr/cgv | topsport.com/sportch/liveticker/ |
| 10 | ledauphine.com | ffs.fr/index.php | vancouver2010.com/en/langpolicy |
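A ranking per month, as in the tables above, can be obtained by running PageRank separately on the links observed in each time slice. The sketch below uses a plain power-iteration PageRank; the damping factor, iteration count, function names, and the tiny sample graph are assumptions, and the slides do not say which PageRank variant or parameters were actually used.

```python
from collections import defaultdict


def pagerank(edges, damping=0.85, iterations=50):
    """Power-iteration PageRank over a list of (source, target) edges."""
    nodes = sorted({node for edge in edges for node in edge})
    if not nodes:
        return {}
    out = {node: [] for node in nodes}
    for source, target in edges:
        out[source].append(target)
    rank = {node: 1.0 / len(nodes) for node in nodes}
    for _ in range(iterations):
        new_rank = {node: (1.0 - damping) / len(nodes) for node in nodes}
        for node in nodes:
            # Dangling nodes spread their rank evenly over all nodes.
            targets = out[node] or nodes
            share = damping * rank[node] / len(targets)
            for target in targets:
                new_rank[target] += share
        rank = new_rank
    return rank


def pagerank_by_slice(temporal_links):
    """temporal_links: (slice_label, source, target) triples, e.g. one label per month."""
    slices = defaultdict(list)
    for label, source, target in temporal_links:
        slices[label].append((source, target))
    return {label: pagerank(edges) for label, edges in slices.items()}


if __name__ == "__main__":
    # Hypothetical links for two monthly slices.
    links = [
        ("2009-11", "a", "c"), ("2009-11", "b", "c"), ("2009-11", "c", "a"),
        ("2010-01", "a", "b"), ("2010-01", "c", "b"),
    ]
    for label, ranks in sorted(pagerank_by_slice(links).items()):
        top = max(ranks, key=ranks.get)
        print(label, "top page:", top)
```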
![Page 50: "Web Archive services framework for tighter integration between the past and the present web", Phd defense presentation](https://reader036.vdocuments.us/reader036/viewer/2022062513/554e821cb4c9054a698b5542/html5/thumbnails/50.jpg)
50
ArcThumb
Motivation, Feature Exploration, Selection Algorithm
![Page 51: "Web Archive services framework for tighter integration between the past and the present web", Phd defense presentation](https://reader036.vdocuments.us/reader036/viewer/2022062513/554e821cb4c9054a698b5542/html5/thumbnails/51.jpg)
51
Thumbnail Summarization Techniques For Web Archives
A. AlSum and M. L. Nelson.
In Proceedings of the 36th European Conference on Information Retrieval.
ECIR 2014, Amsterdam, Netherlands, 2014
![Page 52: "Web Archive services framework for tighter integration between the past and the present web", Phd defense presentation](https://reader036.vdocuments.us/reader036/viewer/2022062513/554e821cb4c9054a698b5542/html5/thumbnails/52.jpg)
52
Thumbnails in Web Archive
Metadata Service ArcThumb Motivation
Internet Archive UK Web Archive
![Page 53: "Web Archive services framework for tighter integration between the past and the present web", Phd defense presentation](https://reader036.vdocuments.us/reader036/viewer/2022062513/554e821cb4c9054a698b5542/html5/thumbnails/53.jpg)
53
Thumbnails Creation Challenges
• Scalability in Time
  • IA may need 361 years to create a thumbnail for each memento using one hundred machines.
• Scalability in Space
  • IA will need 355 TB to store one thumbnail for each memento.
• Page quality
Metadata Service ArcThumb Motivation
![Page 54: "Web Archive services framework for tighter integration between the past and the present web", Phd defense presentation](https://reader036.vdocuments.us/reader036/viewer/2022062513/554e821cb4c9054a698b5542/html5/thumbnails/54.jpg)
54
Thumbnails Usage Challenges
Metadata Service ArcThumb Motivation
• This is a partial view of 700 thumbnails out of 10,500 available mementos for www.apple.com
![Page 55: "Web Archive services framework for tighter integration between the past and the present web", Phd defense presentation](https://reader036.vdocuments.us/reader036/viewer/2022062513/554e821cb4c9054a698b5542/html5/thumbnails/55.jpg)
55
From 10,500 Mementos to 69 Thumbnails.
Metadata Service ArcThumb Motivation
![Page 56: "Web Archive services framework for tighter integration between the past and the present web", Phd defense presentation](https://reader036.vdocuments.us/reader036/viewer/2022062513/554e821cb4c9054a698b5542/html5/thumbnails/56.jpg)
56
How many thumbnails do we need?
Metadata Service ArcThumb Methodology
www.unfi.com on the live Web
![Page 57: "Web Archive services framework for tighter integration between the past and the present web", Phd defense presentation](https://reader036.vdocuments.us/reader036/viewer/2022062513/554e821cb4c9054a698b5542/html5/thumbnails/57.jpg)
57
How many thumbnails do we need?
Metadata Service ArcThumb Methodology
www.unfi.com on the live Web
![Page 58: "Web Archive services framework for tighter integration between the past and the present web", Phd defense presentation](https://reader036.vdocuments.us/reader036/viewer/2022062513/554e821cb4c9054a698b5542/html5/thumbnails/58.jpg)
58
40 Thumbnails are good.
Metadata Service ArcThumb Methodology
![Page 59: "Web Archive services framework for tighter integration between the past and the present web", Phd defense presentation](https://reader036.vdocuments.us/reader036/viewer/2022062513/554e821cb4c9054a698b5542/html5/thumbnails/59.jpg)
59
Visual Similarity and Text Similarity
Metadata Service ArcThumb Methodology
[Figure: page pairs labeled Similar and Different; panels: HTML, Text]
![Page 60: "Web Archive services framework for tighter integration between the past and the present web", Phd defense presentation](https://reader036.vdocuments.us/reader036/viewer/2022062513/554e821cb4c9054a698b5542/html5/thumbnails/60.jpg)
60
Correlation between Visual Similarity and Text Similarity
Metadata Service ArcThumb Feature Exploration
SimHash DOM tree
Embedded resources Memento Datetime
SimHash [Charikar 2002], DOM tree [Pawlik 2011], Memento Datetime [Van de Sompel 2013]
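SimHash, one of the features listed above, maps a document to a fixed-size fingerprint so that similar texts tend to differ in only a few bits. The following is a minimal unweighted sketch over word tokens; the tokenization, the 64-bit size, and the use of MD5 as the per-token hash are assumptions, not details taken from the slides.

```python
import hashlib
import re


def simhash(text, bits=64):
    """Unweighted SimHash fingerprint of the word tokens in text."""
    weights = [0] * bits
    for token in re.findall(r"\w+", text.lower()):
        digest = hashlib.md5(token.encode("utf-8")).digest()
        token_hash = int.from_bytes(digest[: bits // 8], "big")
        for i in range(bits):
            # Each token votes +1 or -1 on every bit position.
            weights[i] += 1 if (token_hash >> i) & 1 else -1
    fingerprint = 0
    for i, weight in enumerate(weights):
        if weight > 0:
            fingerprint |= 1 << i
    return fingerprint


def hamming_distance(a, b):
    """Number of bit positions in which two fingerprints differ."""
    return bin(a ^ b).count("1")


if __name__ == "__main__":
    first = simhash("Vancouver 2010 Winter Olympics official site")
    second = simhash("Vancouver 2010 Winter Olympics official store")
    print(hamming_distance(first, second))
```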
![Page 61: "Web Archive services framework for tighter integration between the past and the present web", Phd defense presentation](https://reader036.vdocuments.us/reader036/viewer/2022062513/554e821cb4c9054a698b5542/html5/thumbnails/61.jpg)
61
Threshold Grouping
Metadata Service ArcThumb Selection Algorithms
![Page 62: "Web Archive services framework for tighter integration between the past and the present web", Phd defense presentation](https://reader036.vdocuments.us/reader036/viewer/2022062513/554e821cb4c9054a698b5542/html5/thumbnails/62.jpg)
62
Threshold Grouping
Metadata Service ArcThumb Selection Algorithms
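One plausible reading of threshold grouping is sketched below: walk the TimeMap in time order and start a new group whenever the current memento's fingerprint is farther than a threshold from the memento that opened the current group, keeping one thumbnail per group. The slides show the method only as figures, so the choice of reference memento, the threshold value, and the function names here are all assumptions.

```python
def hamming_distance(a, b):
    """Number of differing bits between two integer fingerprints."""
    return bin(a ^ b).count("1")


def threshold_grouping(mementos, threshold):
    """Select one memento per group of consecutive, similar mementos.

    mementos: time-ordered list of (timestamp, fingerprint) pairs.
    Returns the timestamps of the mementos that open each group.
    """
    selected = []
    reference = None
    for timestamp, fingerprint in mementos:
        if reference is None or hamming_distance(reference, fingerprint) > threshold:
            # The page changed enough: open a new group and keep this memento.
            selected.append(timestamp)
            reference = fingerprint
    return selected


if __name__ == "__main__":
    # Hypothetical 4-bit fingerprints for four mementos of one URI.
    timemap = [("2001", 0b0000), ("2002", 0b0001), ("2003", 0b1111), ("2004", 0b1110)]
    print(threshold_grouping(timemap, threshold=2))
```

Because each memento is compared only with the current group's reference, new mementos can be appended later without reprocessing earlier ones, which matches the "Incremental: Yes" entry in the comparison slide.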
![Page 63: "Web Archive services framework for tighter integration between the past and the present web", Phd defense presentation](https://reader036.vdocuments.us/reader036/viewer/2022062513/554e821cb4c9054a698b5542/html5/thumbnails/63.jpg)
63
Clustering technique
Metadata Service ArcThumb Selection Algorithms
SimHash Feature SimHash and Datetime Features
Park, H.-S., & Jun, C.-H. (2009). A simple and fast algorithm for K-medoids clustering. Expert Systems with Applications, 36(2, Part 2), 3336–3341.
![Page 64: "Web Archive services framework for tighter integration between the past and the present web", Phd defense presentation](https://reader036.vdocuments.us/reader036/viewer/2022062513/554e821cb4c9054a698b5542/html5/thumbnails/64.jpg)
64
Time Normalization
Metadata Service ArcThumb Selection Algorithms
![Page 65: "Web Archive services framework for tighter integration between the past and the present web", Phd defense presentation](https://reader036.vdocuments.us/reader036/viewer/2022062513/554e821cb4c9054a698b5542/html5/thumbnails/65.jpg)
65
Selection Algorithms Comparison

| | Threshold Grouping | K clustering | Time Normalization |
| --- | --- | --- | --- |
| TimeMap Reduction | 27% | 9% to 12% | 23% |
| Image Loss | 28 | 78 - 101 | 109 |
| # Features | 1 feature | 1 or more | 1 feature |
| Preprocessing required | Yes | Yes | No |
| Efficient processing | Medium | Extensive | Light |
| Incremental | Yes | No | Yes |
| Online/offline | Both | Both | Both |

Metadata Service ArcThumb Selection Algorithms
![Page 66: "Web Archive services framework for tighter integration between the past and the present web", Phd defense presentation](https://reader036.vdocuments.us/reader036/viewer/2022062513/554e821cb4c9054a698b5542/html5/thumbnails/66.jpg)
66
URI SERVICE
Archive
URI
Metadata
Content
![Page 67: "Web Archive services framework for tighter integration between the past and the present web", Phd defense presentation](https://reader036.vdocuments.us/reader036/viewer/2022062513/554e821cb4c9054a698b5542/html5/thumbnails/67.jpg)
67
ARCHIVAL HTTP REDIRECTION RETRIEVAL POLICIES
A. AlSum, M. L. Nelson, R. Sanderson, and H. Van de Sompel
In Proceedings of 3rd Temporal Web Analytics Workshop.
TempWeb 2013, Rio de Janeiro, Brazil
![Page 68: "Web Archive services framework for tighter integration between the past and the present web", Phd defense presentation](https://reader036.vdocuments.us/reader036/viewer/2022062513/554e821cb4c9054a698b5542/html5/thumbnails/68.jpg)
68
Live Web Redirect
http://bit.ly/r9kIfC redirects to http://www.cs.odu.edu
URI Service
% curl -I http://bit.ly/r9kIfC
HTTP/1.1 301 Moved
….
Location: http://www.cs.odu.edu/
…
![Page 69: "Web Archive services framework for tighter integration between the past and the present web", Phd defense presentation](https://reader036.vdocuments.us/reader036/viewer/2022062513/554e821cb4c9054a698b5542/html5/thumbnails/69.jpg)
69
Live Web Redirect
URI Service
R http://bit.ly/r9kIfC redirects to R http://www.cs.odu.edu
![Page 70: "Web Archive services framework for tighter integration between the past and the present web", Phd defense presentation](https://reader036.vdocuments.us/reader036/viewer/2022062513/554e821cb4c9054a698b5542/html5/thumbnails/70.jpg)
70
Archived Web Redirect
• Live web: R1 www.draculathemusical.co.uk redirects to R2 www.mosaicstudio.co.uk
• Web Archive: R1 has Memento http://web.archive.org/web/20020212194020/http://www.draculathemusical.co.uk/, which redirects to R3 http://web.archive.org/web/20020212194020/http://www.geocities.com/draculathemusical
URI Service
![Page 71: "Web Archive services framework for tighter integration between the past and the present web", Phd defense presentation](https://reader036.vdocuments.us/reader036/viewer/2022062513/554e821cb4c9054a698b5542/html5/thumbnails/71.jpg)
71
Experiment
• Dataset: 10,000 sample URIs from
• Dataset does not include bit.ly nor doi.
• Experiment focused on the root page (no embedded resources)
URI Service Experiment and Results

Live HTTP status code (10,000 URI-R)

| HTTP Status/Code | Percentage |
| --- | --- |
| OK (200) | 82.83% |
| Redirection (3xx) | 14.71% |
| Redirection (301) | 8.4% |
| Redirection (302) | 6.1% |
| Redirection (others) | 0.2% |
| Not-Found (4xx) | 1.18% |
| Others | 1.28% |

Memento HTTP status code (894,717 URI-M)

| HTTP Status/Code | Percentage |
| --- | --- |
| OK (200) | 93.46% |
| Redirection (3xx) | 5.69% |
| Not-Found (4xx) | 0.26% |
| Others | 0.59% |
![Page 72: "Web Archive services framework for tighter integration between the past and the present web", Phd defense presentation](https://reader036.vdocuments.us/reader036/viewer/2022062513/554e821cb4c9054a698b5542/html5/thumbnails/72.jpg)
72
URI Stability
• URI’s stability is a count of the change in HTTP responses across time (200, 3xx, or 4xx) and the number of different URIs in the “Location” for 3xx status code.
• High Stability = 1
• No Stability = 0
URI Service
![Page 73: "Web Archive services framework for tighter integration between the past and the present web", Phd defense presentation](https://reader036.vdocuments.us/reader036/viewer/2022062513/554e821cb4c9054a698b5542/html5/thumbnails/73.jpg)
73
Abstract Model
• TimeMap for R
URI Service
[Figure: TimeMap of R with mementos M1, M2, M3]
![Page 74: "Web Archive services framework for tighter integration between the past and the present web", Phd defense presentation](https://reader036.vdocuments.us/reader036/viewer/2022062513/554e821cb4c9054a698b5542/html5/thumbnails/74.jpg)
74
Timemap Redirection Categories
URI Service
• All Mementos have 200 HTTP status code. Stability = 1
• All Mementos have redirection to the same URI. Stability = 1
• All Mementos have redirection to different URIs. Stability ≈ 0
• Mementos have different HTTP status codes. Stability
![Page 75: "Web Archive services framework for tighter integration between the past and the present web", Phd defense presentation](https://reader036.vdocuments.us/reader036/viewer/2022062513/554e821cb4c9054a698b5542/html5/thumbnails/75.jpg)
75
URI Stability
URI Service Experiment and Results
TimeMap Category Percentage Stability
All Mementos have OK 52% 1
Mementos have mixed status codes 36% 0.91
All Mementos have Redirection 0.92% 0.85
Redirection to the same URI 0.62%
Redirection to different URIs 0.30%
URI has no Mementos at all 10.97% 0
Figure panels: Stability in semi-log scale; Stability for |TM(R)| < 300
![Page 76: "Web Archive services framework for tighter integration between the past and the present web", Phd defense presentation](https://reader036.vdocuments.us/reader036/viewer/2022062513/554e821cb4c9054a698b5542/html5/thumbnails/76.jpg)
76
Current Wayback Machine Policy
• Live Redirect: Wayback Machine ignores the live redirects. Use instead of
• Archived Redirect: Wayback Machine follows the redirection.
URI Service Retrieval Policies
![Page 77: "Web Archive services framework for tighter integration between the past and the present web", Phd defense presentation](https://reader036.vdocuments.us/reader036/viewer/2022062513/554e821cb4c9054a698b5542/html5/thumbnails/77.jpg)
77
Policy one: URI-R with HTTP redirection
• Scope: Selection between R and its redirection target on the live web.
• Example: http://bit.ly/r9kIfC → http://www.cs.odu.edu
• Algorithm:
1. Retrieve the memento M for R.
2. If Status(M) = 200: stop.
3. If Status(M) = 3xx: go to Policy 2.
4. If Status(M) = 4xx && R has a live redirection: use the redirection target instead of R. Otherwise: stop.
URI Service Retrieval Policies
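The flowchart for policy one can be sketched as a small decision function. This is only an illustration: the condition that is cut off on the slide ("R has") is read here as "R has a live redirection", an assumption based on the policy's evaluation scope, and the function name and return strings are hypothetical.

```python
from typing import Optional


def policy_one(memento_status: int, live_redirect_target: Optional[str]) -> str:
    """Return the next action for resource R given the status of its memento M."""
    if memento_status == 200:
        return "stop"                # a memento exists; nothing more to do
    if 300 <= memento_status < 400:
        return "go to policy 2"      # archived redirect, handled by policy two
    if 400 <= memento_status < 500 and live_redirect_target:
        # Assumed reading: no memento for R, but R redirects on the live web,
        # so the lookup is retried with the redirection target.
        return f"use {live_redirect_target} instead of R"
    return "stop"
```

For instance, `policy_one(404, "http://www.cs.odu.edu")` selects the live redirection target, while `policy_one(404, None)` stops.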
![Page 78: "Web Archive services framework for tighter integration between the past and the present web", Phd defense presentation](https://reader036.vdocuments.us/reader036/viewer/2022062513/554e821cb4c9054a698b5542/html5/thumbnails/78.jpg)
78
Policy one: URI-R with HTTP redirection
• Evaluation:
• Policy scope: 1471 URIs (that have live redirection)
• 77 out of 1471 have no mementos at all
• For 17 out of those 77, mementos were retrieved based on the live redirection
URI Service Retrieval Policies
![Page 79: "Web Archive services framework for tighter integration between the past and the present web", Phd defense presentation](https://reader036.vdocuments.us/reader036/viewer/2022062513/554e821cb4c9054a698b5542/html5/thumbnails/79.jpg)
79
Policy two: URI-M with HTTP redirection
• Scope: Selection between in web archive.
• Example: http://api.wayback.archive.org/memento/20101109032705/http://bit.ly/2EEjBl → http://api.wayback.archive.org/memento/20101109032705/http://www.cnn.com/
• Algorithm:
URI Service Retrieval Policies
𝑀→𝑀
Extract original from
Repeat content negotiation in datetime for original()
http://api.wayback.archive.org/memento/20101109032705/http://bit.ly/2EEjBl http://api.wayback.archive.org/memento/20101109032705/http://www.cnn.com
/http://www.cnn.com/
Accept-Datetime: Sun, 13 May 2006 http://www.cnn.com/
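The algorithm above re-runs datetime negotiation for the original resource behind the redirection target. A minimal sketch of that step, assuming Wayback-style URI-Ms of the form `<prefix>/<14-digit timestamp>/<original URI>` as in the example; in Memento itself the original would normally come from the `rel="original"` Link header, so the URL parsing here is only a stand-in. The function names are hypothetical and no network request is made.

```python
import re
from datetime import datetime, timezone
from email.utils import format_datetime
from typing import Dict, Tuple

# Assumed URI-M layout: <archive prefix>/<YYYYMMDDhhmmss>/<original URI>
URI_M_PATTERN = re.compile(r"^(?P<prefix>.+?)/(?P<ts>\d{14})/(?P<original>https?://.+)$")


def split_uri_m(uri_m: str) -> Tuple[datetime, str]:
    """Split a Wayback-style URI-M into its capture datetime and original URI."""
    match = URI_M_PATTERN.match(uri_m)
    if match is None:
        raise ValueError(f"not a Wayback-style URI-M: {uri_m}")
    captured = datetime.strptime(match["ts"], "%Y%m%d%H%M%S").replace(tzinfo=timezone.utc)
    return captured, match["original"]


def renegotiation_request(redirect_target_uri_m: str) -> Tuple[str, Dict[str, str]]:
    """Build the original URI and Accept-Datetime header for the repeated negotiation."""
    captured, original = split_uri_m(redirect_target_uri_m)
    return original, {"Accept-Datetime": format_datetime(captured, usegmt=True)}
```

The returned pair would then be sent to a TimeGate for the original URI; which TimeGate to use is outside this sketch.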
![Page 80: "Web Archive services framework for tighter integration between the past and the present web", Phd defense presentation](https://reader036.vdocuments.us/reader036/viewer/2022062513/554e821cb4c9054a698b5542/html5/thumbnails/80.jpg)
80
Policy two: URI-M with HTTP redirection• Evaluation:
• Policy scope: 2980 TimeMap (that showed HTTP redirection status code in at least one memento)
• Success criteria: Using policy two contributed to the original TimeMap
• Success percentage: 58% of the cases
URI Service Retrieval Policies
![Page 81: "Web Archive services framework for tighter integration between the past and the present web", Phd defense presentation](https://reader036.vdocuments.us/reader036/viewer/2022062513/554e821cb4c9054a698b5542/html5/thumbnails/81.jpg)
81
ARCHIVE SERVICE
Percentage and Distribution
Archive
URI
Metadata
Content
![Page 82: "Web Archive services framework for tighter integration between the past and the present web", Phd defense presentation](https://reader036.vdocuments.us/reader036/viewer/2022062513/554e821cb4c9054a698b5542/html5/thumbnails/82.jpg)
82
How Much Of The Web Is Archived?
S. G. Ainsworth, A. AlSum, H. SalahEldeen, M. C. Weigle, and M. L. Nelson
In Proceedings of the 11th annual international ACM/IEEE joint conference on Digital libraries
JCDL '11, Ottawa, Canada 2011
See also: http://arxiv.org/abs/1212.6177
![Page 83: "Web Archive services framework for tighter integration between the past and the present web", Phd defense presentation](https://reader036.vdocuments.us/reader036/viewer/2022062513/554e821cb4c9054a698b5542/html5/thumbnails/83.jpg)
83
Experiment• 4 Sample sets – 1000 URIs each
• For each URI, we used Memento Aggregator to record the TimeMap for this URI.
Archive Service Percentage Experiment
![Page 84: "Web Archive services framework for tighter integration between the past and the present web", Phd defense presentation](https://reader036.vdocuments.us/reader036/viewer/2022062513/554e821cb4c9054a698b5542/html5/thumbnails/84.jpg)
84
Archives Under Experiment
2010 | 2010 and 2013 | 2013
Archive Service Percentage Experiment
UK
![Page 85: "Web Archive services framework for tighter integration between the past and the present web", Phd defense presentation](https://reader036.vdocuments.us/reader036/viewer/2022062513/554e821cb4c9054a698b5542/html5/thumbnails/85.jpg)
85
How Much of the Web is Archived?• It Depends on Which Web…
Archive Service Percentage Results
2010 (including SE cache), 2010 (excluding SE cache), 2013 (general)
90% 79% 90%
97% 68% 95%
88% 19% 52%
35% 16% 33%
![Page 86: "Web Archive services framework for tighter integration between the past and the present web", Phd defense presentation](https://reader036.vdocuments.us/reader036/viewer/2022062513/554e821cb4c9054a698b5542/html5/thumbnails/86.jpg)
86
Profiling Web Archive Coverage For Top-level Domain And Content Language
A. AlSum, M. C. Weigle, M. L. Nelson, and H. Van de Sompel
In Proceedings of the 17th International Conference on Theory and Practice of Digital Libraries
TPDL 2013, Valletta, Malta, 2013
An extended version was invited to a special issue of IJDL.
See also: http://arxiv.org/abs/1309.4008
![Page 87: "Web Archive services framework for tighter integration between the past and the present web", Phd defense presentation](https://reader036.vdocuments.us/reader036/viewer/2022062513/554e821cb4c9054a698b5542/html5/thumbnails/87.jpg)
87
Memento Aggregator
Archive Service Distribution
![Page 88: "Web Archive services framework for tighter integration between the past and the present web", Phd defense presentation](https://reader036.vdocuments.us/reader036/viewer/2022062513/554e821cb4c9054a698b5542/html5/thumbnails/88.jpg)
88
Where can you find?
Archive Service Distribution
http://www.google.com/
![Page 89: "Web Archive services framework for tighter integration between the past and the present web", Phd defense presentation](https://reader036.vdocuments.us/reader036/viewer/2022062513/554e821cb4c9054a698b5542/html5/thumbnails/89.jpg)
89
Where can you find?
Archive Service Distribution
http://www.google.com/
![Page 90: "Web Archive services framework for tighter integration between the past and the present web", Phd defense presentation](https://reader036.vdocuments.us/reader036/viewer/2022062513/554e821cb4c9054a698b5542/html5/thumbnails/90.jpg)
90
Where can you find?
Archive Service Distribution
http://www.japantimes.co.jp/
![Page 91: "Web Archive services framework for tighter integration between the past and the present web", Phd defense presentation](https://reader036.vdocuments.us/reader036/viewer/2022062513/554e821cb4c9054a698b5542/html5/thumbnails/91.jpg)
91
Where can you find?
Archive Service Distribution
http://www.japantimes.co.jp/
![Page 92: "Web Archive services framework for tighter integration between the past and the present web", Phd defense presentation](https://reader036.vdocuments.us/reader036/viewer/2022062513/554e821cb4c9054a698b5542/html5/thumbnails/92.jpg)
92
Research Question
Problem
• We need to profile the web archives around the world with these characteristics:
• Age
• Top-level domains
• Languages
• Growth rate
Goal
• To optimize the query routing for Memento Aggregator.
• To determine the missing parts of the web.
Archive Service Distribution
![Page 93: "Web Archive services framework for tighter integration between the past and the present web", Phd defense presentation](https://reader036.vdocuments.us/reader036/viewer/2022062513/554e821cb4c9054a698b5542/html5/thumbnails/93.jpg)
93
URIs Samples Sources
Archive Service Distribution
Web
1. DMOZ – Random sample
2. DMOZ – TLD: 200 URIs for each TLD from DMOZ (80 TLDs)
3. DMOZ – Languages: 100 URIs for each language (40 languages)
Web Archives
4. Top 1-grams from Bing
5. Top 1000 query terms by Yahoo in 9 languages
User requests
6. IA Wayback Machine log files
7. Memento aggregator log files
* We used hostnames only
![Page 94: "Web Archive services framework for tighter integration between the past and the present web", Phd defense presentation](https://reader036.vdocuments.us/reader036/viewer/2022062513/554e821cb4c9054a698b5542/html5/thumbnails/94.jpg)
94
TLD Coverage
Archive Service Distribution
Legend: IA = Internet Archive; AIT = Archive-It; LoC = Library of Congress; UK = UK National Library; BL = British Library; SG = Singapore; CAT = Web Archive of Catalonia; CR = Croatian Web Archive; CZ = Archive of the Czech Web; TW = National Taiwan University; IC = Icelandic Web Archive; CAN = Library and Archives Canada; SI = Slovenia; WEB = WebCite; AIS = Archive.is
![Page 95: "Web Archive services framework for tighter integration between the past and the present web", Phd defense presentation](https://reader036.vdocuments.us/reader036/viewer/2022062513/554e821cb4c9054a698b5542/html5/thumbnails/95.jpg)
95
Language Coverage
Archive Service Distribution
Legend: IA = Internet Archive; CAN = Library and Archives Canada; PO = Portuguese Web Archive; CZ = Archive of the Czech Web; LoC = Library of Congress; BL = British Library; CAT = Web Archive of Catalonia; TW = National Taiwan University; IC = Icelandic Web Archive; UK = UK National Library; CR = Croatian Web Archive; AIT = Archive-It
![Page 96: "Web Archive services framework for tighter integration between the past and the present web", Phd defense presentation](https://reader036.vdocuments.us/reader036/viewer/2022062513/554e821cb4c9054a698b5542/html5/thumbnails/96.jpg)
96
Growth Rate
Archive Service Distribution
Stopped archiving in 2008
Steady growth
Stopped getting new URIs, but still crawling
![Page 97: "Web Archive services framework for tighter integration between the past and the present web", Phd defense presentation](https://reader036.vdocuments.us/reader036/viewer/2022062513/554e821cb4c9054a698b5542/html5/thumbnails/97.jpg)
97
Building Web Archive Profile
Archive Service Distribution
{"Profile":{
"Name" : "Taiwan Web Archive",
"URI" : "http://webarchive.lib.ntu.edu.tw",
"TimeGate" : "http://mementoproxy.cs.odu.edu/tw/timegate/",
"Code" : "TW",
"Age" : "Tue, 15 Jul 1997 00:00:00 GMT",
"TLD" : [ {"tw":0.6},{"cn":0.08},{"hk":0.04}, {"eg":0.04},{"gov":0.04}, {"my":0.04},{"jp":0.04},{"kr":0.02}],
"Language" : [{"zh-TW":0.5},{"zh-CN":0.25},{"id":0.08},{"ar":0.08}],
"GrowthRate" : [
{"199707":[4,4]},{"200202":[1,1]},
{"200607":[30,62]},{"200608":[20,80]},
{"200609":[5,9]},{"200612":[77,129]},
... // other values truncated
{"201308":[7,94]},{"201309":[2,94]}]
}
}
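One way such a profile could inform query routing is to rank archives by the weight their profile gives to the TLD of the requested URI. The sketch below is an illustration only, not the aggregator's actual routing logic: `rank_archives` is a hypothetical helper, the "TW" weights are copied from the profile above, and the "XX" archive is invented so that the ranking has something to compare against.

```python
from typing import Dict, List
from urllib.parse import urlparse

# TLD weights keyed by archive code. "TW" comes from the profile above;
# "XX" is a made-up archive used purely for illustration.
TLD_PROFILES: Dict[str, Dict[str, float]] = {
    "TW": {"tw": 0.6, "cn": 0.08, "hk": 0.04, "eg": 0.04,
           "gov": 0.04, "my": 0.04, "jp": 0.04, "kr": 0.02},
    "XX": {"jp": 0.7, "tw": 0.1},
}


def tld_of(uri: str) -> str:
    """Return the last label of the URI's hostname, lowercased."""
    host = urlparse(uri).hostname or ""
    return host.rsplit(".", 1)[-1].lower()


def rank_archives(uri: str, profiles: Dict[str, Dict[str, float]]) -> List[str]:
    """Order archive codes by their profile weight for the URI's TLD."""
    tld = tld_of(uri)
    scored = [(weights.get(tld, 0.0), code) for code, weights in profiles.items()]
    # Highest weight first; archives with no weight for this TLD are skipped.
    return [code for weight, code in sorted(scored, reverse=True) if weight > 0]
```

A real router would probably combine TLD with the language and age fields as well; this sketch uses TLD alone to keep the idea visible.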
![Page 98: "Web Archive services framework for tighter integration between the past and the present web", Phd defense presentation](https://reader036.vdocuments.us/reader036/viewer/2022062513/554e821cb4c9054a698b5542/html5/thumbnails/98.jpg)
98
Web Archive Selection Evaluation
$$RecallTM@n = \frac{|TM| \text{ using } n \text{ archives}}{|TM| \text{ using } N \text{ archives}}$$
TM(R): A1 → M1, M2, M3; A2 → M4, M5; A3 → M6; A4 → M7; A5 → M8
• RecallTM@1 = 3/8 = 0.375
• RecallTM@2 = 5/8 = 0.625
Archive Service Distribution
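The RecallTM@n values for this example can be reproduced with a few lines of Python. This is a sketch: the archive-to-memento mapping is the one on the slide, while the function name and the idea of passing the archive ranking as a list are assumptions made for illustration.

```python
from typing import Dict, List

# TimeMap of the example: archive -> mementos it holds for R.
TIMEMAP: Dict[str, List[str]] = {
    "A1": ["M1", "M2", "M3"],
    "A2": ["M4", "M5"],
    "A3": ["M6"],
    "A4": ["M7"],
    "A5": ["M8"],
}


def recall_tm_at_n(timemap: Dict[str, List[str]], ranking: List[str], n: int) -> float:
    """Fraction of the full TimeMap obtained by querying only the top-n ranked archives."""
    total = sum(len(mementos) for mementos in timemap.values())
    found = sum(len(timemap.get(archive, [])) for archive in ranking[:n])
    return found / total
```

With the ranking A1, A2, A3, A4, A5 this gives 3/8 for n = 1 and 5/8 for n = 2, matching the slide.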
![Page 99: "Web Archive services framework for tighter integration between the past and the present web", Phd defense presentation](https://reader036.vdocuments.us/reader036/viewer/2022062513/554e821cb4c9054a698b5542/html5/thumbnails/99.jpg)
99
Web Archive Selection Evaluation
Archive Service Distribution
Number of Archives Including IA Excluding IA
RecallTM@3 0.96 0.647
RecallTM@6 0.98 0.83
RecallTM@9 0.998 0.983
RecallTM@12 0.999 0.987
• Total number of archives N = 15
$$RecallTM@n = \frac{|TM| \text{ using } n \text{ archives}}{|TM| \text{ using } N \text{ archives}}$$
![Page 100: "Web Archive services framework for tighter integration between the past and the present web", Phd defense presentation](https://reader036.vdocuments.us/reader036/viewer/2022062513/554e821cb4c9054a698b5542/html5/thumbnails/100.jpg)
100
CONCLUSIONS
![Page 101: "Web Archive services framework for tighter integration between the past and the present web", Phd defense presentation](https://reader036.vdocuments.us/reader036/viewer/2022062513/554e821cb4c9054a698b5542/html5/thumbnails/101.jpg)
101
Conclusions
• We proposed a new service framework that divides the web archive corpus into four levels: Content, Metadata, URI, and Archive.
• We developed ArcContent, which supports the web archive interface with extracted versions of the mementos based on a set of predefined filters.
• We studied the challenges of building the temporal web graph and developed ArcLink, a distributed system to extract, preserve, and expose the temporal web graph.
• We studied optimization and summarization techniques to create thumbnails for the web graph collections based on SimHash fingerprints.
• We extended the concept of URI lookup in the web archive to include the HTTP redirection status code.
• We defined the concept of a "Web Archive Profile" to characterize the web archive corpus, with an application to distributed search in the Memento Aggregator.
![Page 102: "Web Archive services framework for tighter integration between the past and the present web", Phd defense presentation](https://reader036.vdocuments.us/reader036/viewer/2022062513/554e821cb4c9054a698b5542/html5/thumbnails/102.jpg)
102
Publications
• S. G. Ainsworth, A. AlSum, H. SalahEldeen, M. C. Weigle, and M. L. Nelson. "How Much of the Web is Archived?" In Proceedings of the 11th annual international ACM/IEEE joint conference on Digital libraries, JCDL '11, 2011.
• A. AlSum, M. L. Nelson, R. Sanderson, and H. Van de Sompel. "Archival HTTP Redirection Retrieval Policies." In Proceedings of the 3rd Temporal Web Analytics Workshop, TempWeb '13, 2013.
• A. AlSum and M. L. Nelson. "ArcLink: Optimization Techniques to Build and Retrieve the Temporal Web Graph." In Proceedings of the 13th annual international ACM/IEEE joint conference on Digital libraries, JCDL '13, 2013.
• A. AlSum, M. C. Weigle, M. L. Nelson, and H. Van de Sompel. "Profiling Web Archive Coverage for Top-Level Domain and Content Language." In Proceedings of the 17th International Conference on Theory and Practice of Digital Libraries, TPDL 2013, 2013.
• A. AlSum and M. L. Nelson. "Thumbnail Summarization Techniques for Web Archives." In Proceedings of the 36th European Conference on Information Retrieval, ECIR '14, 2014.
![Page 103: "Web Archive services framework for tighter integration between the past and the present web", Phd defense presentation](https://reader036.vdocuments.us/reader036/viewer/2022062513/554e821cb4c9054a698b5542/html5/thumbnails/103.jpg)
103
What’s next?• Web Archiving Engineer at Stanford University.
![Page 104: "Web Archive services framework for tighter integration between the past and the present web", Phd defense presentation](https://reader036.vdocuments.us/reader036/viewer/2022062513/554e821cb4c9054a698b5542/html5/thumbnails/104.jpg)
104
WEB ARCHIVE SERVICES FRAMEWORK
FOR TIGHTER INTEGRATION BETWEEN THE PAST AND PRESENT WEB
Ahmed AlSum
PhD Defense
February 2014
Old Dominion University Computer Science Department
@aalsum
![Page 105: "Web Archive services framework for tighter integration between the past and the present web", Phd defense presentation](https://reader036.vdocuments.us/reader036/viewer/2022062513/554e821cb4c9054a698b5542/html5/thumbnails/105.jpg)
105
BACKUP
![Page 106: "Web Archive services framework for tighter integration between the past and the present web", Phd defense presentation](https://reader036.vdocuments.us/reader036/viewer/2022062513/554e821cb4c9054a698b5542/html5/thumbnails/106.jpg)
106
Memento
• Memento is an HTTP extension to integrate the Past and the Current Web
I. Jacobs and N. Walsh. Architecture of the World Wide Web. Technical report, W3C, 2004. http://www.w3.org/TR/webarch/
Now
T1
T2
T3
![Page 107: "Web Archive services framework for tighter integration between the past and the present web", Phd defense presentation](https://reader036.vdocuments.us/reader036/viewer/2022062513/554e821cb4c9054a698b5542/html5/thumbnails/107.jpg)
107
Memento
• Developer and administrator for Memento aggregator and proxies
![Page 108: "Web Archive services framework for tighter integration between the past and the present web", Phd defense presentation](https://reader036.vdocuments.us/reader036/viewer/2022062513/554e821cb4c9054a698b5542/html5/thumbnails/108.jpg)
108
Memento Clients
• Memento is now an RFC.
![Page 109: "Web Archive services framework for tighter integration between the past and the present web", Phd defense presentation](https://reader036.vdocuments.us/reader036/viewer/2022062513/554e821cb4c9054a698b5542/html5/thumbnails/109.jpg)
109
Lack of APIs
• Popular websites provide APIs to third-party developers.
• Introduction Motivation
![Page 110: "Web Archive services framework for tighter integration between the past and the present web", Phd defense presentation](https://reader036.vdocuments.us/reader036/viewer/2022062513/554e821cb4c9054a698b5542/html5/thumbnails/110.jpg)
110
Lack of APIs
• US agencies have started to offer APIs for data access.
• Introduction Motivation
![Page 111: "Web Archive services framework for tighter integration between the past and the present web", Phd defense presentation](https://reader036.vdocuments.us/reader036/viewer/2022062513/554e821cb4c9054a698b5542/html5/thumbnails/111.jpg)
111
Web Archiving Use Cases
• Temporal navigation.
• Full text search.
• Use language filters.
• Provide raw WARC.
• Import of metadata records into other repositories.
• Introduction Motivation
*IIPC Access Working Group. Use cases for Access to Internet Archives. International Internet Preservation Consortium Publications, http://www.netpreserve.org/resources/use-cases-access-internet-archives, 2006.
![Page 112: "Web Archive services framework for tighter integration between the past and the present web", Phd defense presentation](https://reader036.vdocuments.us/reader036/viewer/2022062513/554e821cb4c9054a698b5542/html5/thumbnails/112.jpg)
112
Related Projects
Data analysis for the web data
Tools and Methods to access the web archive
Enable the user to do experiments on the raw crawled data on Amazon S3
Enable the user to browse the present and the past web
• Introduction
![Page 113: "Web Archive services framework for tighter integration between the past and the present web", Phd defense presentation](https://reader036.vdocuments.us/reader036/viewer/2022062513/554e821cb4c9054a698b5542/html5/thumbnails/113.jpg)
113
Selection• Decide what to capture
Everything, any domain
National domains
Delegate selection to partners
Users’ favorites
• We studied what is already captured
![Page 114: "Web Archive services framework for tighter integration between the past and the present web", Phd defense presentation](https://reader036.vdocuments.us/reader036/viewer/2022062513/554e821cb4c9054a698b5542/html5/thumbnails/114.jpg)
114
URI-Based
WayBack Machine• Web Archiving Trends Accessing Web Archive
• Textbox to enter the requested URI.
• BubbleMap to show you the available mementos.
![Page 115: "Web Archive services framework for tighter integration between the past and the present web", Phd defense presentation](https://reader036.vdocuments.us/reader036/viewer/2022062513/554e821cb4c9054a698b5542/html5/thumbnails/115.jpg)
115
Collection-Based• Web Archiving Trends Accessing Web Archive
• In addition to browsing the collection, you can browse the URIs in this collection.
![Page 116: "Web Archive services framework for tighter integration between the past and the present web", Phd defense presentation](https://reader036.vdocuments.us/reader036/viewer/2022062513/554e821cb4c9054a698b5542/html5/thumbnails/116.jpg)
116
Full-text search• Web Archiving Trends Accessing Web Archive
• BL interface provides different filtering techniques for the results.
![Page 117: "Web Archive services framework for tighter integration between the past and the present web", Phd defense presentation](https://reader036.vdocuments.us/reader036/viewer/2022062513/554e821cb4c9054a698b5542/html5/thumbnails/117.jpg)
117
Past Web Browser• Web Archiving Trends Accessing Web Archive
• You can replay the pages with different controls to forward, backward, pause and stop.
![Page 118: "Web Archive services framework for tighter integration between the past and the present web", Phd defense presentation](https://reader036.vdocuments.us/reader036/viewer/2022062513/554e821cb4c9054a698b5542/html5/thumbnails/118.jpg)
118
Zoetrope• Web Archiving Trends Accessing Web Archive
• Different Views
• Comparison between different Mementos
• Not feasible on the current web archiving infrastructure
![Page 119: "Web Archive services framework for tighter integration between the past and the present web", Phd defense presentation](https://reader036.vdocuments.us/reader036/viewer/2022062513/554e821cb4c9054a698b5542/html5/thumbnails/119.jpg)
119
DiffIE• Web Archiving Trends Accessing Web Archive
• A browser plug-in that caches the pages a person visits and highlights how those pages have changed when the person returns to them
• It is feasible for personal archiving.
![Page 120: "Web Archive services framework for tighter integration between the past and the present web", Phd defense presentation](https://reader036.vdocuments.us/reader036/viewer/2022062513/554e821cb4c9054a698b5542/html5/thumbnails/120.jpg)
120
Synchronicity• Web Archiving Trends Accessing Web Archive
• A Mozilla Firefox add-on that supports Internet users in (re-)discovering missing web pages in real time
![Page 121: "Web Archive services framework for tighter integration between the past and the present web", Phd defense presentation](https://reader036.vdocuments.us/reader036/viewer/2022062513/554e821cb4c9054a698b5542/html5/thumbnails/121.jpg)
121
Warrick• Web Archiving Trends Accessing Web Archive
• It’s a utility for reconstructing or recovering a website when a back-up is not available
![Page 122: "Web Archive services framework for tighter integration between the past and the present web", Phd defense presentation](https://reader036.vdocuments.us/reader036/viewer/2022062513/554e821cb4c9054a698b5542/html5/thumbnails/122.jpg)
122
ArcSys Architecture Diagram
Web Archive Service Framework
![Page 123: "Web Archive services framework for tighter integration between the past and the present web", Phd defense presentation](https://reader036.vdocuments.us/reader036/viewer/2022062513/554e821cb4c9054a698b5542/html5/thumbnails/123.jpg)
123
WAT files
• WAT files are metadata files for WARC files
• WAT files are used to create data analysis reports based on large datasets.
Metadata Service
![Page 124: "Web Archive services framework for tighter integration between the past and the present web", Phd defense presentation](https://reader036.vdocuments.us/reader036/viewer/2022062513/554e821cb4c9054a698b5542/html5/thumbnails/124.jpg)
124
It’s More than WAT files
• WAT: Batch process on a set of WARCs. ArcLink: Batch process on a set of URIs.
• WAT: For internal use. ArcLink: For public use.
• WAT: No way to integrate with other WAT files in other locations. ArcLink: It could be aggregated with other graphs.
• WAT: No incremental update. ArcLink: Supports incremental update.
• WAT: Access on WAT file level using Pig. ArcLink: Access on URI level using Web service.
Metadata Service ArcLink Motivation
![Page 125: "Web Archive services framework for tighter integration between the past and the present web", Phd defense presentation](https://reader036.vdocuments.us/reader036/viewer/2022062513/554e821cb4c9054a698b5542/html5/thumbnails/125.jpg)
125
Cost of Scaling Up• Filtering
• Extraction
• Storage
Metadata Service ArcLink Cost model
Internet Archive: 88 hrs, 108 × 10^9 mementos, 247 days, 500 TB
Filtering → Extraction → Storage
$$Time = \frac{n}{10^6} \cdot \frac{5.5}{m} \text{ (hrs)}$$
$$Size = n \cdot 10\%$$
*Numbers based on Wayback Machine published statistics on Oct 2013 of 360B mementos with total size 5PB
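The two cost formulas on this slide can be checked with a short calculation. This is a sketch under stated assumptions: the slide does not define m, so it is treated here as a generic divisor (for example a number of parallel workers), and the value m = 100 is chosen only because it comes close to the 247-day figure. Reading n as a memento count in the time formula and as a corpus size in the size formula is also an interpretation. The function names are hypothetical.

```python
def processing_hours(n_mementos: float, m: float) -> float:
    """Time = n / 10^6 * 5.5 / m, in hours (m is undefined in the source; assumed divisor)."""
    return n_mementos / 1e6 * 5.5 / m


def reduced_size(total_size: float, fraction: float = 0.10) -> float:
    """Size = n * 10%, with n read here as the total corpus size."""
    return total_size * fraction


# 108 * 10^9 mementos with the assumed m = 100, converted from hours to days.
days = processing_hours(108e9, 100) / 24

# 5 PB expressed as 5000 TB, reduced to 10%.
size_tb = reduced_size(5000)
```

With these assumed inputs, `days` comes out at 247.5 and `size_tb` at about 500, in line with the 247 days and 500 TB shown for the Internet Archive.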
![Page 126: "Web Archive services framework for tighter integration between the past and the present web", Phd defense presentation](https://reader036.vdocuments.us/reader036/viewer/2022062513/554e821cb4c9054a698b5542/html5/thumbnails/126.jpg)
126
Time-Indexed Inlinks Information
Metadata Service ArcLink Applications
Date Anchor Text
04-Nov-09 vancouver2010.com
11-Nov-09 vancouver2010.com
18-Nov-09 vancouver2010.com
16-Jan-10 Vancouver 2010 Olympic Games
16-Jan-10 Vancouver 2010 Olympic Games
23-Jan-10 vancouver2010.com
23-Jan-10 2010 Vancouver Olympic Games Medals Results Schedule Sports
30-Jan-10 2010 Vancouver Olympic Games Medals Results Schedule Sports
30-Jan-10 vancouver2010.com
30-Jan-10 Vancouver 2010 Olympic Games
13-Feb-10 Vancouver 2010 Olympic Winter Games
15-Feb-10 Vancouver 2010 Olympic Games
18-Feb-10 Official Vancouver Games site
19-Feb-10 vancouver2010.com
20-Feb-10 Official Vancouver Games site
21-Feb-10 VANOC 2010
![Page 127: "Web Archive services framework for tighter integration between the past and the present web", Phd defense presentation](https://reader036.vdocuments.us/reader036/viewer/2022062513/554e821cb4c9054a698b5542/html5/thumbnails/127.jpg)
127
HTTP Redirection Relationship between URI-R & URI-M
URI Service Experiment and Results
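• Live Web URI-R OK, Web Archive URI-M OK: Case 1
• Live Web URI-R Redirection, Web Archive URI-M OK: Case 5
• Live Web URI-R OK, Web Archive URI-M Redirection: Case 2
• Live Web URI-R Redirection, Web Archive URI-M Redirection: Cases 3, 4
Figure labels: Case 1, Case 2, Case 3, Case 4, Case 5
Figure values: 80.8%, 2.74%, 1.34%, 1.33%, 13.7%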
![Page 128: "Web Archive services framework for tighter integration between the past and the present web", Phd defense presentation](https://reader036.vdocuments.us/reader036/viewer/2022062513/554e821cb4c9054a698b5542/html5/thumbnails/128.jpg)
128
Timemap Redirection Categories• Category 1
URI Service
All Mementos have 200 HTTP status code
Stability =1
![Page 129: "Web Archive services framework for tighter integration between the past and the present web", Phd defense presentation](https://reader036.vdocuments.us/reader036/viewer/2022062513/554e821cb4c9054a698b5542/html5/thumbnails/129.jpg)
129
Timemap Redirection Categories• Category 2
URI Service
All Mementos have redirection to the same URI.
Stability =1
![Page 130: "Web Archive services framework for tighter integration between the past and the present web", Phd defense presentation](https://reader036.vdocuments.us/reader036/viewer/2022062513/554e821cb4c9054a698b5542/html5/thumbnails/130.jpg)
130
Timemap Redirection Categories• Category 3
URI Service
All Mementos have redirection to different URIs.
Stability ≈ 0
![Page 131: "Web Archive services framework for tighter integration between the past and the present web", Phd defense presentation](https://reader036.vdocuments.us/reader036/viewer/2022062513/554e821cb4c9054a698b5542/html5/thumbnails/131.jpg)
131
Timemap Redirection Categories• Category 4
URI Service
Mementos have different HTTP status code.
Stability
![Page 132: "Web Archive services framework for tighter integration between the past and the present web", Phd defense presentation](https://reader036.vdocuments.us/reader036/viewer/2022062513/554e821cb4c9054a698b5542/html5/thumbnails/132.jpg)
132
HTTP Redirection Relationship between URI-R & URI-M
URI Service
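• Live Web URI-R OK, Web Archive URI-M OK: Case 1
• Live Web URI-R Redirection, Web Archive URI-M OK: Case 5
• Live Web URI-R OK, Web Archive URI-M Redirection: Case 2
• Live Web URI-R Redirection, Web Archive URI-M Redirection: Cases 3, 4
Figure labels: Case 1, Case 2, Case 3, Case 4, Case 5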
![Page 133: "Web Archive services framework for tighter integration between the past and the present web", Phd defense presentation](https://reader036.vdocuments.us/reader036/viewer/2022062513/554e821cb4c9054a698b5542/html5/thumbnails/133.jpg)
133
URI Reliability
URI Service
Figure: TimeMap of R with mementos M1, M2, M3, each with status 3xx (Stability = 1); each redirection target is linked by rel=original; final statuses shown: 200, 404, 3xx.
![Page 134: "Web Archive services framework for tighter integration between the past and the present web", Phd defense presentation](https://reader036.vdocuments.us/reader036/viewer/2022062513/554e821cb4c9054a698b5542/html5/thumbnails/134.jpg)
134
Summary
• Quantitative study with 10,000 URIs.
• 48% were not fully stable through time.
• 27% were not perfectly reliable through time.
• New archival retrieval policy:
• Policy one: successfully retrieved mementos for 17 out of 77.
• Policy two: expanded the TimeMap for 58% of cases.
URI Service Retrieval Policies
![Page 135: "Web Archive services framework for tighter integration between the past and the present web", Phd defense presentation](https://reader036.vdocuments.us/reader036/viewer/2022062513/554e821cb4c9054a698b5542/html5/thumbnails/135.jpg)
135
URI Reliability
• 23% of the mementos did not lead to a successful memento at the end.
URI Service Experiment and Results
Figure panels: Reliability in semi-log scale; Reliability for |TM(R)| < 300
![Page 136: "Web Archive services framework for tighter integration between the past and the present web", Phd defense presentation](https://reader036.vdocuments.us/reader036/viewer/2022062513/554e821cb4c9054a698b5542/html5/thumbnails/136.jpg)
136
Experiment
Archive Service Percentage Experiment
• For each sample set, we used Memento Aggregator to get all the possible archived copies (Mementos).
• For each URI, Memento Aggregator responded with TimeMap for this URI.
Example
<http://memento.waybackmachine.org/memento/20010819194233/http://jcdl2002.org>; rel="first memento"; datetime="Sun, 19 Aug 2001 19:42:33 GMT",
<http://memento.waybackmachine.org/memento/20011216220248/http://jcdl2002.org>; rel="memento"; datetime="Sun, 16 Dec 2001 22:02:48 GMT",
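A TimeMap in this link format can be turned into a list of records with two regular expressions. The sketch below is a simplified reader written for illustration, not a full parser for the link format: it assumes every parameter value is double-quoted, as in the example, and `parse_timemap` is a hypothetical name. Splitting on commas is avoided on purpose, because the datetime values contain commas themselves.

```python
import re
from typing import Dict, List

# <URI> followed by its ;key="value" parameters, up to the next "<".
LINK_PATTERN = re.compile(r'<(?P<uri>[^>]+)>\s*;(?P<params>[^<]*)')
PARAM_PATTERN = re.compile(r'(?P<key>[\w-]+)\s*=\s*"(?P<value>[^"]*)"')


def parse_timemap(text: str) -> List[Dict[str, str]]:
    """Return one dict per link with its URI plus rel/datetime parameters."""
    entries = []
    for link in LINK_PATTERN.finditer(text):
        entry = {"uri": link["uri"]}
        entry.update({p["key"]: p["value"] for p in PARAM_PATTERN.finditer(link["params"])})
        entries.append(entry)
    return entries
```

Counting the entries whose `rel` contains "memento" then gives the number of archived copies recorded for a URI.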
![Page 137: "Web Archive services framework for tighter integration between the past and the present web", Phd defense presentation](https://reader036.vdocuments.us/reader036/viewer/2022062513/554e821cb4c9054a698b5542/html5/thumbnails/137.jpg)
137
1000 URIs Ordered by First Observation Date
Archive Service Percentage Results
See also: http://ws-dl.blogspot.com/2011/06/2011-06-23-how-much-of-web-is-archived.html
![Page 138: "Web Archive services framework for tighter integration between the past and the present web", Phd defense presentation](https://reader036.vdocuments.us/reader036/viewer/2022062513/554e821cb4c9054a698b5542/html5/thumbnails/138.jpg)
138
2010
Archive Service Percentage Results
2013
![Page 139: "Web Archive services framework for tighter integration between the past and the present web", Phd defense presentation](https://reader036.vdocuments.us/reader036/viewer/2022062513/554e821cb4c9054a698b5542/html5/thumbnails/139.jpg)
139
Archive Service Percentage Results
2010 2013
![Page 140: "Web Archive services framework for tighter integration between the past and the present web", Phd defense presentation](https://reader036.vdocuments.us/reader036/viewer/2022062513/554e821cb4c9054a698b5542/html5/thumbnails/140.jpg)
140
Archive Service Percentage Results
2010 2013
![Page 141: "Web Archive services framework for tighter integration between the past and the present web", Phd defense presentation](https://reader036.vdocuments.us/reader036/viewer/2022062513/554e821cb4c9054a698b5542/html5/thumbnails/141.jpg)
141
Archive Service Percentage Results
2010 2013
![Page 142: "Web Archive services framework for tighter integration between the past and the present web", Phd defense presentation](https://reader036.vdocuments.us/reader036/viewer/2022062513/554e821cb4c9054a698b5542/html5/thumbnails/142.jpg)
142
URI Sample Sources – Live Web
1. DMOZ – Random sample
• 10,000 URIs randomly sampled from the DMOZ directory (~5M URIs).
2. DMOZ – TLD: 200 URIs for each TLD
• 80 TLDs.
3. DMOZ – Languages: 100 URIs for each language
• 40 languages.
Archive Service Distribution
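The per-TLD sampling described above can be sketched as a stratified random sample. This is an illustrative sketch only: `tld_of` and `sample_per_tld` are hypothetical helpers, the TLD is taken naively as the last label of the host name, and the URIs are toy values rather than DMOZ data.

```python
import random
from collections import defaultdict
from urllib.parse import urlsplit


def tld_of(uri):
    """Naive TLD: the last dot-separated label of the host name."""
    host = urlsplit(uri).hostname or ""
    return host.rsplit(".", 1)[-1].lower()


def sample_per_tld(uris, per_tld, seed=0):
    """Group URIs by TLD, then draw up to per_tld URIs from each group."""
    rng = random.Random(seed)  # fixed seed so the sample is repeatable
    groups = defaultdict(list)
    for uri in uris:
        groups[tld_of(uri)].append(uri)
    return {
        tld: rng.sample(members, min(per_tld, len(members)))
        for tld, members in sorted(groups.items())
    }


# Toy input: 5 URIs for each of 3 TLDs (the slide uses 200 URIs for each of 80 TLDs).
uris = [f"http://site{i}.example.{tld}/" for tld in ("org", "de", "jp") for i in range(5)]
picked = sample_per_tld(uris, per_tld=2)
for tld, chosen in picked.items():
    print(tld, chosen)
```

The same grouping idea would apply to the per-language sample, given a language label for each URI.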
![Page 143: "Web Archive services framework for tighter integration between the past and the present web", Phd defense presentation](https://reader036.vdocuments.us/reader036/viewer/2022062513/554e821cb4c9054a698b5542/html5/thumbnails/143.jpg)
143
URI Sample Sources – Web Archive
• Query the full-text search interfaces of the web archives with two sets of query terms.
4. Top 1-grams from Bing
• Most of them are English.
5. Top 1,000 query terms from Yahoo in 9 languages
• We excluded general keywords such as Obama and Facebook.
Archive Service Distribution
![Page 144: "Web Archive services framework for tighter integration between the past and the present web", Phd defense presentation](https://reader036.vdocuments.us/reader036/viewer/2022062513/554e821cb4c9054a698b5542/html5/thumbnails/144.jpg)
144
URI Sample Sources – User Requests
• Sampling from user requests for archived web materials.
6. Sample from IA Wayback Machine log files
• 10,000 URIs randomly sampled from Feb 22, 2012 to Feb 26, 2012.
7. Sample from Memento Aggregator log files
• 1,000 URIs randomly sampled from the LANL Memento Aggregator between 2011 and 2013.
Archive Service Distribution
![Page 145: "Web Archive services framework for tighter integration between the past and the present web", Phd defense presentation](https://reader036.vdocuments.us/reader036/viewer/2022062513/554e821cb4c9054a698b5542/html5/thumbnails/145.jpg)
145
General Coverage
Archive Service Distribution
• IA: Internet Archive
• AIT: Archive-It
• LoC: Library of Congress
• UK: UK National Library
• BL: British Library
• SG: Singapore
• CAT: Web Archive of Catalonia
• CR: Croatian Web Archive
• CZ: Archive of the Czech Web
• TW: National Taiwan University
• IC: Icelandic Web Archive
• CAN: Library and Archives Canada
• SI: Slovenia
• WEB: WebCite
• AIS: Archive.is
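One way to tabulate per-archive coverage for a sample set is sketched below. The definition is an assumption made for illustration, not taken from the slides: coverage of an archive is the fraction of sampled URIs for which that archive holds at least one memento. The function `coverage` and the `holdings` data are hypothetical toy values.

```python
def coverage(holdings, archives):
    """holdings maps each sampled URI to the set of archive codes holding it.

    Returns, per archive, the fraction of URIs with at least one memento there
    (assumed definition of coverage).
    """
    total = len(holdings)
    result = {}
    for code in archives:
        held = sum(1 for codes in holdings.values() if code in codes)
        result[code] = held / total if total else 0.0
    return result


# Toy data only; archive codes follow the legend above.
holdings = {
    "http://a.example/": {"IA", "AIT"},
    "http://b.example/": {"IA"},
    "http://c.example/": set(),
    "http://d.example/": {"WEB"},
}
shares = coverage(holdings, ["IA", "AIT", "WEB"])
for code, share in shares.items():
    print(f"{code}: {share:.0%}")
```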
![Page 146: "Web Archive services framework for tighter integration between the past and the present web", Phd defense presentation](https://reader036.vdocuments.us/reader036/viewer/2022062513/554e821cb4c9054a698b5542/html5/thumbnails/146.jpg)
146
Web Archive Selection Evaluation
Archive Service Distribution
![Page 147: "Web Archive services framework for tighter integration between the past and the present web", Phd defense presentation](https://reader036.vdocuments.us/reader036/viewer/2022062513/554e821cb4c9054a698b5542/html5/thumbnails/147.jpg)
147
Web Archive Selection Evaluation
Archive Service Distribution
![Page 148: "Web Archive services framework for tighter integration between the past and the present web", Phd defense presentation](https://reader036.vdocuments.us/reader036/viewer/2022062513/554e821cb4c9054a698b5542/html5/thumbnails/148.jpg)
148
Future Work
![Page 149: "Web Archive services framework for tighter integration between the past and the present web", Phd defense presentation](https://reader036.vdocuments.us/reader036/viewer/2022062513/554e821cb4c9054a698b5542/html5/thumbnails/149.jpg)
149
iTunes cover application
Metadata Service ArcThumb Motivation