Archival HTTP Redirection Retrieval Policies
Temporal Web Analytics Workshop 2013, Rio de Janeiro


Ahmed AlSum, Michael L. Nelson
Old Dominion University
Norfolk VA, USA
{aalsum,mln}@cs.odu.edu

Robert Sanderson, Herbert Van de Sompel
Los Alamos National Laboratory
Los Alamos NM, USA
{rsanderson,herbertv}@lanl.gov

Agenda
• Introduction
• Abstract Model
• Experiment And Results
• Retrieval Policies

Memento Terminology
• URI-R (R): Original Resource, e.g. http://www.amazon.com
• URI-M (M): Memento, e.g. http://web.archive.org/web/20110411070244/http://amazon.com
• URI-T (TM): TimeMap

Live Redirect
http://bit.ly/r9kIfC redirects to http://www.cs.odu.edu

% curl -I http://bit.ly/r9kIfC
HTTP/1.1 301 Moved….
Location: http://www.cs.odu.edu/…
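The status line and Location header in a response like the one above can be read programmatically. The following is a minimal sketch, not the authors' tooling: the function names `parse_redirect` and `is_redirect` are made up for illustration, and the sample response text is an assumed stand-in for the elided curl output.

```python
from typing import Optional, Tuple


def parse_redirect(head: str) -> Tuple[int, Optional[str]]:
    """Return (status code, Location value or None) from raw response headers."""
    lines = head.strip().splitlines()
    # The status line looks like "HTTP/1.1 301 Moved Permanently".
    status = int(lines[0].split()[1])
    location = None
    for line in lines[1:]:
        # Split only at the first colon so URLs in the value stay intact.
        name, sep, value = line.partition(":")
        if sep and name.strip().lower() == "location":
            location = value.strip()
            break
    return status, location


def is_redirect(status: int) -> bool:
    """True for any 3xx status code."""
    return 300 <= status < 400


# Illustrative response head (assumed; the slide elides parts of the real one).
sample = "HTTP/1.1 301 Moved Permanently\r\nLocation: http://www.cs.odu.edu/\r\n"
status, location = parse_redirect(sample)
print(status, location, is_redirect(status))
```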

Archived Redirect
www.draculathemusical.co.uk redirects to www.dracula-uk.com/index.html

http://api.wayback.archive.org/memento/20020212194020/http://www.draculathemusical.co.uk/
redirects to
http://api.wayback.archive.org/memento/20020212194020/http://www.geocities.com/draculathemusical
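Memento URIs of the form shown above embed a 14-digit timestamp followed by the original URI. A small sketch of splitting the two apart follows; the helper name `split_memento_uri` is hypothetical, and treating the timestamp as UTC is an assumption about the archive's convention.

```python
import re
from datetime import datetime, timezone
from typing import Tuple

# A 14-digit timestamp (YYYYMMDDhhmmss) between slashes, then the original URI.
_MEMENTO_RE = re.compile(r"/(\d{14})/(.+)$")


def split_memento_uri(uri_m: str) -> Tuple[datetime, str]:
    """Split a memento URI into (archival datetime, original URI)."""
    match = _MEMENTO_RE.search(uri_m)
    if match is None:
        raise ValueError(f"no 14-digit timestamp found in {uri_m!r}")
    stamp, uri_r = match.groups()
    # Assumption: the embedded timestamp is expressed in UTC.
    when = datetime.strptime(stamp, "%Y%m%d%H%M%S").replace(tzinfo=timezone.utc)
    return when, uri_r


when, uri_r = split_memento_uri(
    "http://api.wayback.archive.org/memento/20020212194020/"
    "http://www.draculathemusical.co.uk/"
)
print(when.isoformat(), uri_r)
```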

Abstract Model

[Figure: mementos M1, M2, M3]

URI Stability
• A URI's stability is a count of the changes in HTTP responses across time (200, 3xx, or 4xx) and of the number of different URIs in the "Location" header for 3xx status codes.
• High stability = 1; no stability = 0

TimeMap Redirection Categories
• All mementos have a 200 HTTP status code.
• All mementos redirect to the same URI.
• All mementos redirect to different URIs.
• Mementos have different HTTP status codes.

Stability labels on the slide, in order: = 1, = 1, ≈ 0
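The slides describe stability only as a count of response changes scaled between 0 and 1, so the sketch below is one possible reading rather than the authors' formula. The normalization (one minus the fraction of consecutive observations that differ) and the names `Observation`, `response_class`, and `stability` are assumptions made for illustration.

```python
from typing import List, Optional, Tuple

# One observation per memento: (HTTP status, Location value or None).
Observation = Tuple[int, Optional[str]]


def response_class(obs: Observation) -> Tuple[str, Optional[str]]:
    """Reduce an observation to its status class, keeping Location only for 3xx."""
    status, location = obs
    if 300 <= status < 400:
        return ("3xx", location)
    return (f"{status // 100}xx", None)


def stability(observations: List[Observation]) -> float:
    """Assumed score: 1.0 if nothing changes over time, 0.0 if every step changes."""
    if len(observations) < 2:
        return 1.0
    keys = [response_class(o) for o in observations]
    changes = sum(1 for a, b in zip(keys, keys[1:]) if a != b)
    return 1.0 - changes / (len(keys) - 1)


all_ok = [(200, None)] * 3
same_target = [(301, "http://a.example/")] * 3
different_targets = [(301, "http://a.example/"), (301, "http://b.example/"), (301, "http://c.example/")]
print(stability(all_ok), stability(same_target), stability(different_targets))
```

Under this reading, a TimeMap whose mementos all return 200 or all redirect to the same URI scores 1, while one whose mementos each redirect somewhere different scores 0.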

URI Reliability

[Figure: mementos M1, M2, M3, each with status 3xx (Stability = 1); rel=original links between R` and M; unknown outcomes (?) labeled 200, 404, 3xx]

HTTP Redirection Relationship between URI-R & URI-M

[Figure: Case 1, Case 2, Case 3, Case 4, Case 5]

Experiment & Results

Experiment
• Dataset: 10,000 sample URIs from

HTTP status code         Percentage (10,000 URI-R)
OK (200)                 82.83%
Redirection (3xx)        14.71%
  Redirection (301)      8.4%
  Redirection (302)      6.1%
  Redirection (others)   0.2%
Not-Found (4xx)          1.18%
Others                   1.28%

HTTP status code         Percentage (894,717 URI-M)
OK (200)                 93.46%
Redirection (3xx)        5.69%
Not-Found (4xx)          0.26%
Others                   0.59%
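As a quick consistency check on the two tables, the percentages can be summed and the live 3xx share converted back into a URI count. This is only arithmetic on the figures reported above; the variable names are illustrative.

```python
# Percentages as reported for the 10,000 live URI-Rs.
live = {"OK (200)": 82.83, "Redirection (3xx)": 14.71, "Not-Found (4xx)": 1.18, "Others": 1.28}
# Breakdown of the live 3xx row.
redirect_breakdown = {"301": 8.4, "302": 6.1, "others": 0.2}
# Percentages as reported for the 894,717 URI-Ms.
mementos = {"OK (200)": 93.46, "Redirection (3xx)": 5.69, "Not-Found (4xx)": 0.26, "Others": 0.59}

sample_size = 10_000

live_total = round(sum(live.values()), 2)
memento_total = round(sum(mementos.values()), 2)
breakdown_total = round(sum(redirect_breakdown.values()), 2)
# Number of live URIs that redirect, derived from the 3xx share.
live_redirect_count = round(sample_size * live["Redirection (3xx)"] / 100)

print(live_total, memento_total, breakdown_total, live_redirect_count)
```

Both tables sum to 100%, the rounded 301/302/others rows sum to 14.7%, and the 3xx share of 10,000 URIs corresponds to 1,471 URIs.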

[Figure labels: URIs; Live HTTP status code; Memento HTTP status code; Time span; Number of Mementos]

URI Stability

[Figure: stability in semi-log scale; stability for |TM(R)| < 300]

URI Reliability

[Figure: reliability in semi-log scale; reliability for |TM(R)| < 300]

HTTP Redirection Relationship between URI-R & URI-M

[Figure: Case 1 through Case 5, with the percentages 80.8%, 2.74%, 1.34%, 1.33%, 13.7% (listed in extraction order)]

RETRIEVAL POLICIES
ARCHIVED HTTP REDIRECTION RETRIEVAL POLICIES

Current Wayback Machine Policy

Policy one: URI-R with HTTP redirection

[Flowchart: "Retrieve the memento M for R."; decision nodes "Status(M) = 200" and "Status(M) = 3xx"; outcomes "Stop", "Go to Policy 2", "Stop"; branches labeled Yes / No]

Policy one: URI-R with HTTP redirection
• Evaluation:
o Policy scope: 1,471 URIs (those with a live redirection)
o 77 out of 1,471 have no mementos at all
o For 17 of those 77, mementos were retrieved based on the live redirection
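The evaluation above suggests that policy one falls back to the live redirect target when the requested URI-R has no mementos. The sketch below shows that idea under stated assumptions: the flowchart's exact branches are not recoverable from the transcript, and `policy_one`, `get_mementos`, `live_redirect_target`, and the example.org data are all hypothetical stand-ins.

```python
from typing import Callable, Dict, List, Optional


def policy_one(
    uri_r: str,
    get_mementos: Callable[[str], List[str]],
    live_redirect_target: Callable[[str], Optional[str]],
) -> List[str]:
    """Return mementos for uri_r, falling back to its live redirect target."""
    mementos = get_mementos(uri_r)
    if mementos:
        return mementos
    # No mementos for R: check whether the live R redirects to some R'.
    target = live_redirect_target(uri_r)
    if target is None:
        return []
    return get_mementos(target)


# Hypothetical in-memory archive and live-web lookup, for illustration only.
archive: Dict[str, List[str]] = {
    "http://new.example.org/": ["http://archive.example.org/20100101000000/http://new.example.org/"],
}
live: Dict[str, str] = {"http://old.example.org/": "http://new.example.org/"}

found = policy_one(
    "http://old.example.org/",
    lambda uri: archive.get(uri, []),
    lambda uri: live.get(uri),
)
print(found)
```

Passing the lookups in as functions keeps the sketch independent of any particular archive API.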

Policy two: URI-M with HTTP redirection

[Figure: request for http://www.cnn.com/ with the header "Accept-Datetime: Sun, 13 May 2006"]
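In the Memento protocol, a client asks for a past version of a resource by sending an Accept-Datetime header carrying an HTTP date. The following sketch builds such a request without sending it. The function name `datetime_negotiation_request` is hypothetical, and the midnight time of day is an assumption, since the slide shows only the date.

```python
from datetime import datetime, timezone
from email.utils import format_datetime
from urllib.request import Request


def datetime_negotiation_request(uri: str, when: datetime) -> Request:
    """Build (but do not send) a GET request carrying an Accept-Datetime header.

    `when` must be a timezone-aware datetime; it is converted to GMT.
    """
    stamp = format_datetime(when.astimezone(timezone.utc), usegmt=True)
    return Request(uri, headers={"Accept-Datetime": stamp}, method="GET")


# Assumed time of day (midnight UTC); the weekday is derived by the formatter.
request = datetime_negotiation_request(
    "http://www.cnn.com/", datetime(2006, 5, 13, tzinfo=timezone.utc)
)
# urllib stores header names capitalized as "Accept-datetime".
print(request.get_method(), request.full_url, request.get_header("Accept-datetime"))
```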

Policy two: URI-M with HTTP redirection
• Evaluation:
o Policy scope: 2,980 TimeMaps (those showing an HTTP redirection status code in at least one memento)
o Success criterion: using policy two contributed to the original TimeMap
o Success percentage: 58% of the cases
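One way to read "contributed to the original TimeMap" is that mementos reachable through archived redirects are merged into the TimeMap of the requested URI. The sketch below follows that interpretation, which is an assumption; the `Memento` record, `policy_two`, and the example.org data are hypothetical.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Optional


@dataclass(frozen=True)
class Memento:
    uri_m: str
    status: int
    # Original URI (R') that a 3xx memento redirects to, if any.
    redirect_uri_r: Optional[str] = None


def policy_two(
    timemap: List[Memento],
    get_timemap: Callable[[str], List[Memento]],
) -> List[Memento]:
    """Merge the TimeMaps of redirect targets into the original TimeMap."""
    merged = list(timemap)
    seen = {m.uri_m for m in merged}
    targets = {
        m.redirect_uri_r
        for m in timemap
        if 300 <= m.status < 400 and m.redirect_uri_r
    }
    for target in sorted(targets):
        for memento in get_timemap(target):
            if memento.uri_m not in seen:
                seen.add(memento.uri_m)
                merged.append(memento)
    return merged


# Hypothetical data: one archived redirect pointing at http://b.example.org/.
original = [
    Memento("http://archive.example.org/2001/http://a.example.org/", 200),
    Memento("http://archive.example.org/2002/http://a.example.org/", 302, "http://b.example.org/"),
]
other: Dict[str, List[Memento]] = {
    "http://b.example.org/": [Memento("http://archive.example.org/2002/http://b.example.org/", 200)],
}

expanded = policy_two(original, lambda uri: other.get(uri, []))
contributed = len(expanded) > len(original)
print(len(original), len(expanded), contributed)
```

Here `contributed` mirrors the success criterion: the policy counts as successful for a TimeMap when the merge adds at least one memento.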

Conclusion
• Quantitative study with 10,000 URIs.
• 48% were not fully stable through time.
• 27% were not perfectly reliable through time.
• New archival retrieval policy:
o Policy one: successfully retrieved mementos for 17 out of 77.
o Policy two: expanded the TimeMap for 58% of cases.

@aalsum