formats over time: exploring uk web history

Post on 18-Dec-2014

Formats over Time Exploring UK Web History

Andrew Jackson UK Web Archive, The British Library

iPres 2012 | 04-10-2012 | Toronto

DEBATING OBSOLESCENCE Formats over Time

Rothenberg & Rosenthal On Format Obsolescence

  Jeff Rothenberg:
    “Digital Information Lasts Forever – Or Five Years, Whichever Comes First.” (1997)
    “…still apt…” (2012)

  David Rosenthal:
    “when challenged, proponents of [format migration strategies] have failed to identify even one format in wide use when Rothenberg [made that assertion] that has gone obsolete in the intervening decade and a half.” (2010)
    That network effects inhibit obsolescence

  Where is the evidence?

AN EXPERIMENT Formats over Time

UK Web Domain Dataset (1994-2010)

  UK Web Domain Dataset (1994-2010)
    From the Internet Archive
    Millions of websites
    > 2.5 billion resources
    > 400,000 ARC/WARC files
    > 35 TB

  Execution at Scale
    Stored on HDFS
    Map-Reduce
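The Map-Reduce pattern named above can be illustrated with a small in-process sketch. This is not the job that was run on the cluster: the record shape (crawl year, identified MIME type) and the function names `map_record`, `reduce_counts` and `run_job` are assumptions made for illustration, and the shuffle step that Hadoop performs across machines is simulated here with a dictionary.

```python
from collections import defaultdict

def map_record(record):
    """Map step: emit one count per (format, year) key.

    `record` is a hypothetical (crawl_year, mime_type) pair; real ARC/WARC
    records would first need parsing and format identification.
    """
    year, mime_type = record
    yield (mime_type, year), 1

def reduce_counts(key, values):
    """Reduce step: sum all counts emitted for one (format, year) key."""
    return key, sum(values)

def run_job(records):
    """Simulate map, shuffle (group by key) and reduce in a single process."""
    grouped = defaultdict(list)
    for record in records:
        for key, value in map_record(record):
            grouped[key].append(value)
    return dict(reduce_counts(key, values) for key, values in grouped.items())

# Illustrative input only, not taken from the dataset.
sample = [
    (2004, "image/png"),
    (2004, "image/png"),
    (2004, "application/pdf"),
    (2005, "image/png"),
]
counts = run_job(sample)
```

Because each key is reduced independently, the same two functions could in principle be distributed over many input files, which is what makes the pattern suit a corpus of this size.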

Identification Tools

  DROID
    Well-known in digital preservation community
    Format version level identification
    Minor problem concerning file handles
    Only binary signature part (DROID-B) could be embedded

  Apache Tika
    Widely used identification and data extraction tool
    Identifies many formats at the MIME type level
    Easy to embed and extend
      Added ability to extract e.g. software identifiers
    Minor bug concerning identification buffer size

A Common Language For Format Identifiers

  Comparison and combination require a common model
  Map PRONOM IDs to extended MIME Types
    fmt/18 becomes application/pdf; version=1.4
  Allows easy comparison at sub-type level
  Can easily extend to cover other properties:
    text/plain; charset=UTF-8
    application/pdf; software="Adobe Acrobat 6.0"
  Also extended Tika to output details from PDFs
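The mapping idea on this slide can be sketched in a few lines. Only the fmt/18 entry comes from the slide; the lookup table name `PRONOM_TO_MIME` and the helper functions are hypothetical, and the parser is deliberately naive (it assumes no semicolons inside quoted parameter values).

```python
# Only this entry is taken from the slide; a real table would be far larger.
PRONOM_TO_MIME = {
    "fmt/18": "application/pdf; version=1.4",
}

def parse_extended_mime(value):
    """Split 'type/subtype; key=value; ...' into (base type, parameters).

    Naive sketch: assumes parameter values contain no ';' characters.
    """
    base, *raw_params = [part.strip() for part in value.split(";")]
    params = {}
    for raw in raw_params:
        if not raw:
            continue
        key, _, val = raw.partition("=")
        params[key.strip().lower()] = val.strip().strip('"')
    return base.lower(), params

def same_at_subtype_level(a, b):
    """Compare two identifiers while ignoring version, software, etc."""
    return parse_extended_mime(a)[0] == parse_extended_mime(b)[0]

pdf_type, pdf_params = parse_extended_mime(PRONOM_TO_MIME["fmt/18"])
_, software_params = parse_extended_mime(
    'application/pdf; software="Adobe Acrobat 6.0"'
)
```

Keeping extra properties as parameters means a tool that reports only `application/pdf` can still be compared with one that reports a version, by dropping the parameters before comparing.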

Format Profile Dataset

  Server, Tika & DROID-B format profiles, over time:

    Columns, in order: server type | Tika type | DROID-B type | year | count

    image/png | image/png | image/png; version=1.0 | 2004 | 102
    application/pdf | application/pdf; version=1.2; software="Acrobat Distiller 4.0 for Windows"; source="Adobe PageMaker 6.0" | application/pdf; version=1.2 | 2004 | 1

  CC0 – free to download and reuse
    http://data.webarchive.org.uk/opendata/ukwa.ds.2/fmt/
    Please cite us and/or let us know if you use it

  Source code of all tools and modifications also available
    https://github.com/openplanets/nanite
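As a rough illustration of how the format profile records might be consumed, the sketch below parses the two example records and computes each format's share of a year's resources. The on-disk layout is an assumption: the code supposes one tab-separated line per record with five fields (server type, Tika type, DROID-B type, year, count), which should be checked against the published files. `ProfileRow`, `parse_row` and `yearly_share` are hypothetical names.

```python
from collections import defaultdict
from typing import NamedTuple

class ProfileRow(NamedTuple):
    server_type: str
    tika_type: str
    droid_type: str
    year: int
    count: int

def parse_row(line):
    """Parse one record, assuming five tab-separated fields (an assumption)."""
    server, tika, droid, year, count = line.rstrip("\n").split("\t")
    return ProfileRow(server, tika, droid, int(year), int(count))

def yearly_share(rows, field="droid_type"):
    """Return {(year, format): percentage of that year's resources}."""
    totals = defaultdict(int)
    per_format = defaultdict(int)
    for row in rows:
        totals[row.year] += row.count
        per_format[(row.year, getattr(row, field))] += row.count
    return {key: 100.0 * n / totals[key[0]] for key, n in per_format.items()}

# The two example records shown on the slide.
lines = [
    "image/png\timage/png\timage/png; version=1.0\t2004\t102",
    'application/pdf\tapplication/pdf; version=1.2; '
    'software="Acrobat Distiller 4.0 for Windows"; '
    'source="Adobe PageMaker 6.0"\tapplication/pdf; version=1.2\t2004\t1',
]
rows = [parse_row(line) for line in lines]
shares = yearly_share(rows)
```

Passing `field="tika_type"` or `field="server_type"` gives the same breakdown from the point of view of a different identification source.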

COMPARING TOOLS Results

Coverage & Depth

  [Figure: percentage of resources unidentified (log scale, up to 100%) by year, 1996-2010; series: DROID-B v.59, Apache Tika 1.1]

No format-version-level information from Apache Tika.

Inconsistencies

  Gaps
    37 formats spotted by DROID-B but not Tika
      Notably includes earlier Office formats
    129 formats spotted by Tika but not DROID-B
      But at least 20 are due to not using the full DROID

  Conflicts
    Failed MIME type mapping, e.g. PDF 1.7 (since fixed)
    ‘Soft’ signatures – e.g. PICT matching 3M JPG (gone)
    DROID strictness – 9M GIF, 4M JPG, 1.3M PDF…

  Both tools bad at non-HTML/XML text formats
    CSS, scripting languages like JS, CSV, TSV, etc.
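The two kinds of inconsistency above, gaps and conflicts, could be computed along the following lines. This is a minimal sketch rather than the comparison code actually used: the function names are hypothetical, the sample values are invented for illustration, and identifiers are compared at the base MIME type level only.

```python
from collections import Counter

def base_type(extended_mime):
    """Drop parameters such as version so both tools compare like for like."""
    return extended_mime.split(";", 1)[0].strip().lower()

def find_gaps(tika_formats, droid_formats):
    """Return (formats only DROID-B reported, formats only Tika reported)."""
    tika = {base_type(f) for f in tika_formats}
    droid = {base_type(f) for f in droid_formats}
    return droid - tika, tika - droid

def find_conflicts(pairs):
    """Count resources where both tools gave an answer but disagreed.

    `pairs` holds one (tika_type, droid_type) tuple per resource, with None
    meaning that tool did not identify the resource.
    """
    conflicts = Counter()
    for tika_type, droid_type in pairs:
        if tika_type is None or droid_type is None:
            continue
        key = (base_type(tika_type), base_type(droid_type))
        if key[0] != key[1]:
            conflicts[key] += 1
    return conflicts

# Invented sample data, for illustration only.
droid_only, tika_only = find_gaps(
    tika_formats={"image/jpeg", "audio/midi", "application/pdf"},
    droid_formats={"image/jpeg", "application/pdf; version=1.4", "application/msword"},
)
conflicts = find_conflicts([
    ("image/jpeg", "image/x-pict"),
    ("image/jpeg", "image/jpeg"),
    ("image/gif", None),
    ("image/jpeg", "image/x-pict"),
])
```

Separating unidentified resources from genuine disagreements keeps strictness effects (one tool returning nothing) apart from soft-signature effects (one tool returning the wrong thing).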

FORMATS OVER TIME Results

Image Formats Over Time

  [Figure: percentage of crawl (log scale, 0.00001% to 100%) by year, 1996-2010; series: JPEG, GIF, PNG, ICON, XBM, TIFF]

HTML Versions Over Time

  [Figure: percentage of HTML resources (0% to 100%) by year, 1996-2010; series: HTML 2.0, HTML 3.2, HTML 4.0, HTML 4.01, XHTML 1.0]

PDF Versions Over Time

  [Figure: percentage of PDF resources (0% to 100%) by year, 1996-2010; series: PDF 1.0, 1.1, 1.2, 1.3, 1.4, 1.5, 1.6]

Format Usage Versus Time

  [Figure: number of resources in archive (log scale, 1 to 10,000,000,000) against timespan in years (0 to 18)]
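One plausible way to derive the two quantities plotted here is sketched below. The slide does not define "timespan", so the code assumes it means the last year minus the first year in which a format appears; that definition, the function name `usage_versus_timespan` and the sample rows are all assumptions.

```python
def usage_versus_timespan(rows):
    """Return {format: (timespan_in_years, total_resources)}.

    `rows` is an iterable of (format, year, count) tuples. Timespan is
    assumed to be last year seen minus first year seen.
    """
    first, last, total = {}, {}, {}
    for fmt, year, count in rows:
        first[fmt] = min(year, first.get(fmt, year))
        last[fmt] = max(year, last.get(fmt, year))
        total[fmt] = total.get(fmt, 0) + count
    return {fmt: (last[fmt] - first[fmt], total[fmt]) for fmt in total}

# Invented counts, for illustration only.
points = usage_versus_timespan([
    ("image/x-xbitmap", 1996, 10),
    ("image/x-xbitmap", 1999, 2),
    ("image/png", 1998, 5),
    ("image/png", 2010, 500),
])
```

Each resulting pair is one point on a scatter plot of this kind, with the count on a logarithmic axis.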

IMPLEMENTATIONS Results

PDF Software Over Time

  [Figure: percentage of PDF resources (0% to 100%) by year, 1996-2010; series: Acrobat Distiller, Acrobat PDFWriter, Acrobat]

Over 2100 Distinct PDF Software IDs

JPEG Hardware Over Time

  [Figure: percentage of hardware IDs (0% to 100%) by year, 1994-2010; series labelled DS5, CYBERSHOT, E990, MX1700, NIKON D40]

Over 2100 Distinct JPEG Hardware IDs

CONCLUSIONS Formats over Time

Summary

  Format obsolescence is complex
  Network effects do appear to stabilize formats
  But once-popular formats are fading nevertheless
  More sophisticated approach required

  Please re-use our data, or ask for more
  Firmer conclusions need:
    Richer, more detailed results
    From a wider range of corpora

  This approach only gives creator information
  A different approach will be needed to understand resource consumption (e.g. PPT 4, RealAudio 1)

webarchive.org.uk

Questions?
