Using the Memento Framework to Assess Content Drift in Scholarly Communication

Using the Memento Framework to Assess Content Drift in Scholarly Communication. Martin Klein (@mart1nkle1n) and Herbert Van de Sompel (@hvdsomp), Research Library, Los Alamos National Laboratory. IIPC WAC, 06/16/2017, London, UK. Acknowledgements: Shawn Jones, Harihar Shankar (LANL); Richard Tobin, Claire Grover (University of Edinburgh); Andy Jackson (British Library).

Upload: martin-klein

Post on 22-Jan-2018

TRANSCRIPT

Page 1: Using the Memento Framework to Assess Content Drift in Scholarly Communication

Memento to Assess Content Drift in Scholarly Communication

@mart1nkle1n

IIPC WAC, 06/16/2017, London, UK

Using the Memento Framework

to Assess Content Drift

in Scholarly Communication

Acknowledgements:

Shawn Jones, Harihar Shankar (LANL)

Richard Tobin, Claire Grover (University of Edinburgh)

Andy Jackson (British Library)

Martin Klein @mart1nkle1n

Herbert Van de Sompel @hvdsomp

Research Library

Los Alamos National Laboratory

Page 2: Using the Memento Framework to Assess Content Drift in Scholarly Communication

Link Rot

Page 3: Using the Memento Framework to Assess Content Drift in Scholarly Communication

Page 4: Using the Memento Framework to Assess Content Drift in Scholarly Communication

Content Drift

Page 5: Using the Memento Framework to Assess Content Drift in Scholarly Communication

http://dl00.org

2000

Page 6: Using the Memento Framework to Assess Content Drift in Scholarly Communication

http://dl00.org

2004

Page 7: Using the Memento Framework to Assess Content Drift in Scholarly Communication

http://dl00.org

2005

Page 8: Using the Memento Framework to Assess Content Drift in Scholarly Communication

http://dl00.org

2008

Page 9: Using the Memento Framework to Assess Content Drift in Scholarly Communication

Content Drift

(in legal documents)

Page 10: Using the Memento Framework to Assess Content Drift in Scholarly Communication

Page 11: Using the Memento Framework to Assess Content Drift in Scholarly Communication

Content Drift

(in scholarly articles)

Page 12: Using the Memento Framework to Assess Content Drift in Scholarly Communication

Referenced in http://dx.doi.org/10.1016/j.nuclphysa.2009.05.110, published on August 15th 2009

[Two versions shown: May 8th 2009 and August 27th 2009]

Page 13: Using the Memento Framework to Assess Content Drift in Scholarly Communication

Referenced in http://arxiv.org/abs/astro-ph/9707064, published on July 4th 1997

[Two versions shown: June 7th 1997 and today]

Page 14: Using the Memento Framework to Assess Content Drift in Scholarly Communication

[Figure: arXiv corpus; articles and URI references by year, 1997 to 2011]

Page 15: Using the Memento Framework to Assess Content Drift in Scholarly Communication

http://hiberlink.org/

Definition:

• Link Rot + Content Drift = Reference Rot

Observation:

• Web at large resources are referenced in scholarly articles

• Links to these resources are subject to Reference Rot

Problem:

• Threat to integrity of the web-based scholarly record

• Resources do not have the same sense of fixity as, e.g., journal articles

• Resources' custodianship is different in terms of long-term archiving, integrity, and access

Page 16: Using the Memento Framework to Assess Content Drift in Scholarly Communication

http://dx.doi.org/10.1371/journal.pone.0115253

Page 17: Using the Memento Framework to Assess Content Drift in Scholarly Communication

Focus: Content Drift

Page 18: Using the Memento Framework to Assess Content Drift in Scholarly Communication

http://dx.doi.org/10.1371/journal.pone.0167475

Page 19: Using the Memento Framework to Assess Content Drift in Scholarly Communication

Study Dataset

• 3.5 million articles from arXiv, Elsevier, PMC

• Published between Jan 1997 and Dec 2012

• Converted from PDF to XML

• Extraction of URIs to web at large resources (>1 million)

• Keep track of articles’ publication date

Page 20: Using the Memento Framework to Assess Content Drift in Scholarly Communication

Novel Approach to Assess Content Drift

Page 21: Using the Memento Framework to Assess Content Drift in Scholarly Communication

Step 1: Find Mementos

• ~ 1 million URI references

• ~ 650k Memento Pre/Post pairs discovered via Memento (https://mementoweb.org, https://tools.ietf.org/html/rfc7089)

[Figure: timeline marking t-1, t, t+1]
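A minimal sketch of how a Pre/Post pair might be picked for one URI reference, assuming the capture datetimes of its Mementos have already been retrieved (for example from a TimeMap). The function names are hypothetical, and the slides do not say whether a capture made exactly on the reference date counts as Pre or Post; here it counts as Post. The `Accept-Datetime` header is the one RFC 7089 defines for datetime negotiation with a TimeGate. The dates reuse those shown on slide 12 purely as an illustration.

```python
from bisect import bisect_left
from datetime import datetime, timezone
from email.utils import format_datetime


def accept_datetime_header(reference_date):
    """Build the Accept-Datetime request header defined by RFC 7089."""
    # The header uses the RFC 1123 date format in GMT, so the input must be
    # a timezone-aware UTC datetime (format_datetime raises otherwise).
    return {"Accept-Datetime": format_datetime(reference_date, usegmt=True)}


def select_pre_post(memento_datetimes, reference_date):
    """Return (pre, post): the closest capture before the reference date and
    the closest capture at or after it. Either may be None."""
    ordered = sorted(memento_datetimes)
    split = bisect_left(ordered, reference_date)  # first index with capture >= reference
    pre = ordered[split - 1] if split > 0 else None
    post = ordered[split] if split < len(ordered) else None
    return pre, post


# Illustration only: publication date and capture dates as shown on slide 12.
published = datetime(2009, 8, 15, tzinfo=timezone.utc)
captures = [
    datetime(2009, 8, 27, tzinfo=timezone.utc),
    datetime(2009, 5, 8, tzinfo=timezone.utc),
]
pre, post = select_pre_post(captures, published)
print(accept_datetime_header(published), pre.date(), post.date())
```

A URI reference contributes a pair only when both values are present; references with captures on one side only would drop out at this step.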

Page 22: Using the Memento Framework to Assess Content Drift in Scholarly Communication

Step 2: Select Representative Mementos

Page 23: Using the Memento Framework to Assess Content Drift in Scholarly Communication

Step 2: Select Representative Mementos

• Apply content similarity measures

• How similar is representative?

Page 24: Using the Memento Framework to Assess Content Drift in Scholarly Communication

Content Similarity Measures

• Compute normalized scores (values from 0 to 100) for:

• Simhash

• Jaccard

• Sørensen-Dice

• Cosine
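The four measures can be sketched on plain text as follows. This is an illustration under stated assumptions, not the study's implementation: the tokenizer, the use of token sets for Jaccard and Sørensen-Dice, term counts for cosine, a 64-bit MD5-based Simhash, and the mapping of Hamming distance onto a 0 to 100 scale are all choices made here for the example.

```python
import hashlib
import math
import re
from collections import Counter

SIMHASH_BITS = 64  # assumption: 64-bit fingerprints


def tokens(text):
    """Lowercased word tokens (an assumed, deliberately simple tokenizer)."""
    return re.findall(r"\w+", text.lower())


def jaccard(a, b):
    sa, sb = set(tokens(a)), set(tokens(b))
    if not sa and not sb:
        return 100.0
    return 100.0 * len(sa & sb) / len(sa | sb)


def sorensen_dice(a, b):
    sa, sb = set(tokens(a)), set(tokens(b))
    if not sa and not sb:
        return 100.0
    return 100.0 * 2 * len(sa & sb) / (len(sa) + len(sb))


def cosine(a, b):
    ca, cb = Counter(tokens(a)), Counter(tokens(b))
    if not ca and not cb:
        return 100.0
    dot = sum(ca[t] * cb[t] for t in ca.keys() & cb.keys())
    norm = math.sqrt(sum(v * v for v in ca.values()) * sum(v * v for v in cb.values()))
    return 100.0 * dot / norm if norm else 0.0


def simhash(text):
    """Weighted bit vote over per-token hashes, collapsed into one fingerprint."""
    votes = [0] * SIMHASH_BITS
    for token, count in Counter(tokens(text)).items():
        digest = hashlib.md5(token.encode("utf-8")).digest()
        value = int.from_bytes(digest[:8], "big")  # first 64 bits of the hash
        for bit in range(SIMHASH_BITS):
            votes[bit] += count if (value >> bit) & 1 else -count
    return sum(1 << bit for bit, vote in enumerate(votes) if vote > 0)


def simhash_similarity(a, b):
    """Share of matching fingerprint bits, scaled to 0..100 (assumed normalization)."""
    distance = bin(simhash(a) ^ simhash(b)).count("1")
    return 100.0 * (1 - distance / SIMHASH_BITS)


def all_scores(a, b):
    return {
        "simhash": simhash_similarity(a, b),
        "jaccard": jaccard(a, b),
        "sorensen_dice": sorensen_dice(a, b),
        "cosine": cosine(a, b),
    }
```

For two identical texts every measure returns 100, which is the "perfect score" condition the following slides build on.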

Page 25: Using the Memento Framework to Assess Content Drift in Scholarly Communication

Representative Mementos

• Idea

• If a pair has a perfect score in all 4 similarity measures, Memento Pre and Post are the same

• Sanity check needed

• Via HTTP headers: ETag and Last-Modified

• If both are the same for Pre and Post Memento, the pair is HTTP-same

• Sanity check passed!

• 98.88% of Memento pairs that are HTTP-same have a perfect score in all 4 similarity measures
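The HTTP-same check can be sketched as a comparison of two header dictionaries. The function name is hypothetical, and the slide leaves open whether both headers must match or one suffices; this sketch assumes both `ETag` and `Last-Modified` must be present and identical. It also assumes the original server headers have already been extracted from the archived responses.

```python
def http_same(pre_headers, post_headers):
    """True when ETag and Last-Modified are present and equal in both header sets.

    Assumption: both headers are required; a missing header means "not HTTP-same".
    """

    def pick(headers, name):
        # HTTP header names are case-insensitive.
        for key, value in headers.items():
            if key.lower() == name:
                return value.strip()
        return None

    for name in ("etag", "last-modified"):
        pre_value = pick(pre_headers, name)
        post_value = pick(post_headers, name)
        if pre_value is None or post_value is None or pre_value != post_value:
            return False
    return True


# Illustrative header values, not taken from the study.
pre = {"ETag": '"abc123"', "Last-Modified": "Fri, 08 May 2009 10:00:00 GMT"}
post = {"etag": '"abc123"', "last-modified": "Fri, 08 May 2009 10:00:00 GMT"}
print(http_same(pre, post))
```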

Page 26: Using the Memento Framework to Assess Content Drift in Scholarly Communication

Step 2: Select Representative Mementos

• ~ 313k referenced URIs have representative Mementos

Page 27: Using the Memento Framework to Assess Content Drift in Scholarly Communication

Representative Mementos in arXiv

Page 28: Using the Memento Framework to Assess Content Drift in Scholarly Communication

[Figure panels: arXiv, Elsevier, PMC]

Page 29: Using the Memento Framework to Assess Content Drift in Scholarly Communication

Step 3: Dereference Live Web Version of URI

• 241k out of 313k URIs have a live web version

Page 30: Using the Memento Framework to Assess Content Drift in Scholarly Communication

Step 4: Representative Memento vs. Live Version

• Apply content similarity measures

• Bin results into 6 clusters
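The binning step might look like the sketch below. The slides give neither the cluster boundaries nor the way the four scores are combined, so both are hypothetical here: the aggregate is taken as the arithmetic mean, and the six clusters are a perfect score plus five equal-width ranges.

```python
from collections import Counter
from statistics import mean

# Hypothetical clusters: (lower edge, label), checked from highest to lowest.
CLUSTERS = [
    (100, "100"),
    (80, "[80, 100)"),
    (60, "[60, 80)"),
    (40, "[40, 60)"),
    (20, "[20, 40)"),
    (0, "[0, 20)"),
]


def aggregate_score(scores):
    """Combine the four similarity scores (assumption: arithmetic mean)."""
    return mean(scores)


def cluster(score):
    """Return the label of the first cluster whose lower edge the score reaches."""
    for lower_edge, label in CLUSTERS:
        if score >= lower_edge:
            return label
    raise ValueError("score must be between 0 and 100")


def histogram(per_uri_scores):
    """Count URIs per cluster, given one list of four scores per URI."""
    return Counter(cluster(aggregate_score(scores)) for scores in per_uri_scores)


# Illustrative scores, not taken from the study.
example = [[100, 100, 100, 100], [50, 50, 70, 70], [10, 0, 5, 5]]
print(histogram(example))
```

Under these assumptions only a URI with a perfect score in all four measures lands in the "100" cluster, which matches the slides' notion of a resource that has not drifted.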

Page 31: Using the Memento Framework to Assess Content Drift in Scholarly Communication

Page 32: Using the Memento Framework to Assess Content Drift in Scholarly Communication

Aggregate Similarity Score

Good: 23.7% of URIs have *not* drifted!

Bad: 3/4 of URIs *have* drifted!

Page 33: Using the Memento Framework to Assess Content Drift in Scholarly Communication

Content Drift & Link Rot Over Time - arXiv

Page 34: Using the Memento Framework to Assess Content Drift in Scholarly Communication

[Figure panels: arXiv, Elsevier, PMC]

Page 35: Using the Memento Framework to Assess Content Drift in Scholarly Communication

Take-Aways

1. Scholarly articles increasingly contain URI references to web at large resources.

2. Such resources are subject to reference rot (link rot + content drift).

3. Custodians of these resources are typically not overly concerned with archiving of their content and longevity of the scholarly record.

4. Spoiler: Authors, publishers, web archives, and other parties can help tackle this problem (see my lightning talk + poster on Robust Links).

Page 36: Using the Memento Framework to Assess Content Drift in Scholarly Communication

Using the Memento Framework

to Assess Content Drift

in Scholarly Communication

Martin Klein @mart1nkle1n

Herbert Van de Sompel @hvdsomp

Research Library

Los Alamos National Laboratory