Formats over Time: Exploring UK Web History

Formats over Time: Exploring UK Web History
Andrew Jackson, UK Web Archive, The British Library
iPres 2012 | 04-10-2012 | Toronto

Upload: andy-jackson

Post on 18-Dec-2014

TRANSCRIPT

Page 1: Formats Over Time: Exploring UK Web History

Formats over Time Exploring UK Web History

Andrew Jackson UK Web Archive, The British Library

iPres 2012 | 04-10-2012 | Toronto

Page 2: Formats Over Time: Exploring UK Web History

DEBATING OBSOLESCENCE Formats over Time

Page 3: Formats Over Time: Exploring UK Web History

Rothenberg & Rosenthal On Format Obsolescence

  Jeff Rothenberg:
    “Digital Information Lasts Forever – Or Five Years, Whichever Comes First.” (1997)
    “…still apt…” (2012)

  David Rosenthal:
    “when challenged, proponents of [format migration strategies] have failed to identify even one format in wide use when Rothenberg [made that assertion] that has gone obsolete in the intervening decade and a half.” (2010)
    Argues that network effects inhibit obsolescence

  Where is the evidence?

Page 4: Formats Over Time: Exploring UK Web History

AN EXPERIMENT Formats over Time

Page 5: Formats Over Time: Exploring UK Web History

UK Web Domain Dataset (1994-2010)

  UK Web Domain Dataset (1994-2010)
    From the Internet Archive
    Millions of websites
    > 2.5 billion resources
    > 400,000 ARC/WARC files
    > 35TB

  Execution at Scale
    Stored on HDFS
    Map-Reduce
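The Map-Reduce shape of this kind of job can be illustrated with a small in-memory sketch. It is not the project's Hadoop code: the record layout (a crawl year plus an identified format string) and all function names are assumptions made for illustration, and the identification step itself is left out.

```python
from collections import defaultdict
from typing import Dict, Iterable, Iterator, List, Tuple

# Hypothetical record: (crawl year, identified format string).
Record = Tuple[int, str]
Key = Tuple[str, int]


def map_record(record: Record) -> Iterator[Tuple[Key, int]]:
    """Map step: emit ((format, year), 1) for one archived resource."""
    year, mime_type = record
    yield (mime_type, year), 1


def reduce_counts(pairs: Iterable[Tuple[Key, int]]) -> Dict[Key, int]:
    """Shuffle and reduce step: sum the counts that share a key."""
    totals: Dict[Key, int] = defaultdict(int)
    for key, count in pairs:
        totals[key] += count
    return dict(totals)


def format_profile(records: Iterable[Record]) -> Dict[Key, int]:
    """Run the map step over every record, then reduce the emitted pairs."""
    emitted = (pair for record in records for pair in map_record(record))
    return reduce_counts(emitted)


if __name__ == "__main__":
    # Made-up sample records, not taken from the dataset.
    sample: List[Record] = [
        (2004, "image/png"),
        (2004, "image/png"),
        (2004, "application/pdf"),
        (2005, "image/png"),
    ]
    for (mime_type, year), count in sorted(format_profile(sample).items()):
        print(mime_type, year, count)
```

On a real cluster the map step would run once per resource inside each ARC/WARC file and the framework would handle the grouping; only the per-key summation is shown here.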

Page 6: Formats Over Time: Exploring UK Web History

Identification Tools

  DROID
    Well-known in digital preservation community
    Format version level identification
    Minor problem concerning file handles
    Only binary signature part (DROID-B) could be embedded

  Apache Tika
    Widely used identification and data extraction tool
    Identifies many formats at the MIME type level
    Easy to embed and extend
      Added ability to extract e.g. software identifiers
    Minor bug concerning identification buffer size

Page 7: Formats Over Time: Exploring UK Web History

A Common Language For Format Identifiers

  Comparison and combination require a common model
  Map PRONOM IDs to extended MIME types
    fmt/18 becomes application/pdf; version=1.4

  Allows easy comparison at sub-type level
  Can easily extend to cover other properties:
    text/plain; charset=UTF-8
    application/pdf; software="Adobe Acrobat 6.0"

  Also extended Tika to output details from PDFs
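A minimal sketch of the common model, assuming identifiers are carried around as plain strings. Only the fmt/18 entry comes from the slide; the function names and the lookup table structure are hypothetical, and the parser is a simplification that does not handle a semicolon inside a quoted value.

```python
from typing import Dict, Optional, Tuple

# Only the fmt/18 entry is taken from the slide. A real table would be
# generated from the PRONOM registry rather than written by hand.
PRONOM_TO_MIME: Dict[str, str] = {
    "fmt/18": "application/pdf; version=1.4",
}


def pronom_to_mime(puid: str) -> Optional[str]:
    """Return the extended MIME type for a PRONOM ID, or None if unmapped."""
    return PRONOM_TO_MIME.get(puid)


def parse_extended_mime(value: str) -> Tuple[str, Dict[str, str]]:
    """Split 'type/subtype; key=value; ...' into a base type and parameters.

    Simplification: splits on every ';', so quoted values must not
    contain a semicolon.
    """
    parts = [part.strip() for part in value.split(";")]
    base = parts[0].lower()
    params: Dict[str, str] = {}
    for part in parts[1:]:
        if not part:
            continue
        key, _, raw = part.partition("=")
        params[key.strip().lower()] = raw.strip().strip('"')
    return base, params


def same_base_type(a: str, b: str) -> bool:
    """Compare two identifiers at the MIME type level, ignoring parameters."""
    return parse_extended_mime(a)[0] == parse_extended_mime(b)[0]


if __name__ == "__main__":
    print(parse_extended_mime(pronom_to_mime("fmt/18") or ""))
    print(same_base_type("application/pdf", "application/pdf; version=1.4"))
```

Keeping version, charset and software as parameters means two tools can be compared at the base type level even when only one of them reports the finer detail.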

Page 8: Formats Over Time: Exploring UK Web History

Format Profile Dataset

  Server, Tika & DROID-B format profiles, over time:

    image/png    image/png    image/png; version=1.0    2004    102
    application/pdf    application/pdf; version=1.2; software="Acrobat Distiller 4.0 for Windows"; source="Adobe PageMaker 6.0"    application/pdf; version=1.2    2004    1

  CC0: free to download and reuse
    http://data.webarchive.org.uk/opendata/ukwa.ds.2/fmt/
    Please cite us and/or let us know if you use it

  Source code of all tools and modifications also available
    https://github.com/openplanets/nanite
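The two sample rows suggest five fields per row: the server-supplied type, the Tika type, the DROID-B type, a year and a count. A minimal reader under that reading might look like the sketch below. The tab delimiter, the field order and the `ProfileRow` name are assumptions; check them against the published files before relying on this.

```python
from dataclasses import dataclass
from typing import Iterable, Iterator


@dataclass(frozen=True)
class ProfileRow:
    """One row of a format profile (field meanings are assumed)."""
    server_type: str
    tika_type: str
    droid_type: str
    year: int
    count: int


def parse_row(line: str) -> ProfileRow:
    """Parse one tab-separated row into a ProfileRow."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) != 5:
        raise ValueError(f"expected 5 fields, got {len(fields)}: {line!r}")
    server, tika, droid, year, count = (field.strip() for field in fields)
    return ProfileRow(server, tika, droid, int(year), int(count))


def parse_rows(lines: Iterable[str]) -> Iterator[ProfileRow]:
    """Parse every non-blank line, skipping empty ones."""
    for line in lines:
        if line.strip():
            yield parse_row(line)


if __name__ == "__main__":
    sample = "image/png\timage/png\timage/png; version=1.0\t2004\t102"
    print(parse_row(sample))
```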

Page 9: Formats Over Time: Exploring UK Web History

COMPARING TOOLS Results

Page 10: Formats Over Time: Exploring UK Web History

Coverage & Depth

[Figure: percentage of resources unidentified (axis ticks 0%, 1%, 10%, 100%) by year, 1996 to 2010. Series: DROID-B v.59, Apache Tika 1.1]

No format-version-level information from Apache Tika.
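The quantity plotted here is a per-year share. A minimal sketch of the arithmetic, assuming each tool's output has already been reduced to (format, year, count) rows and that an unidentified resource is reported as `application/octet-stream`; both the row shape and that marker are assumptions, not something stated on the slide.

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple

# Assumed marker for "could not identify".
UNKNOWN = "application/octet-stream"


def unidentified_percentage(
    rows: Iterable[Tuple[str, int, int]], unknown: str = UNKNOWN
) -> Dict[int, float]:
    """Return, per year, the percentage of resources left unidentified."""
    total: Dict[int, int] = defaultdict(int)
    missing: Dict[int, int] = defaultdict(int)
    for fmt, year, count in rows:
        total[year] += count
        # Compare at the base type level, ignoring any parameters.
        if fmt.split(";", 1)[0].strip().lower() == unknown:
            missing[year] += count
    return {
        year: 100.0 * missing[year] / total[year]
        for year in total
        if total[year] > 0
    }


if __name__ == "__main__":
    # Made-up counts for illustration only.
    rows = [
        ("application/octet-stream", 2004, 1),
        ("image/png", 2004, 3),
        ("text/html", 2005, 10),
    ]
    print(unidentified_percentage(rows))
```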

Page 11: Formats Over Time: Exploring UK Web History

Inconsistencies

  Gaps
    37 formats spotted by DROID-B but not Tika
      Notably includes earlier Office formats
    129 formats spotted by Tika but not DROID-B
      But at least 20 are due to not using the full DROID

  Conflicts
    Failed MIME type mapping, e.g. PDF 1.7 (since fixed)
    ‘Soft’ signatures, e.g. PICT matching 3M JPG (gone)
    DROID strictness: 9M GIF, 4M JPG, 1.3M PDF…

  Both tools bad at non-HTML/XML text formats
    CSS, scripting languages like JS, CSV, TSV, etc.
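Gaps and conflicts of this kind can be found with simple set operations once both tools' results share a common identifier scheme. The sketch below assumes each resource yields a (Tika type, DROID-B type) pair and that `application/octet-stream` marks a failed identification; the function names and the sample pairs are hypothetical.

```python
from typing import Iterable, List, Set, Tuple

# Assumed marker for "could not identify".
UNKNOWN = "application/octet-stream"


def base_type(value: str) -> str:
    """Strip parameters such as '; version=1.4' and normalise case."""
    return value.split(";", 1)[0].strip().lower()


def coverage_gaps(pairs: Iterable[Tuple[str, str]]) -> Tuple[Set[str], Set[str]]:
    """Return (formats only DROID-B reported, formats only Tika reported)."""
    tika_seen: Set[str] = set()
    droid_seen: Set[str] = set()
    for tika, droid in pairs:
        if base_type(tika) != UNKNOWN:
            tika_seen.add(base_type(tika))
        if base_type(droid) != UNKNOWN:
            droid_seen.add(base_type(droid))
    return droid_seen - tika_seen, tika_seen - droid_seen


def conflicts(pairs: Iterable[Tuple[str, str]]) -> List[Tuple[str, str]]:
    """Return pairs where both tools identified the resource but disagree."""
    found = []
    for tika, droid in pairs:
        t, d = base_type(tika), base_type(droid)
        if UNKNOWN not in (t, d) and t != d:
            found.append((t, d))
    return found


if __name__ == "__main__":
    # Made-up pairs for illustration only.
    sample = [
        ("image/gif", "image/gif; version=89a"),
        ("image/jpeg", "image/x-pict"),
        ("text/css", "application/octet-stream"),
    ]
    print(coverage_gaps(sample))
    print(conflicts(sample))
```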

Page 12: Formats Over Time: Exploring UK Web History

FORMATS OVER TIME Results

Page 13: Formats Over Time: Exploring UK Web History

Image Formats Over Time

[Figure: percentage of crawl (log scale, 0.00001% to 100%) by year, 1996 to 2010. Series: JPEG, GIF, PNG, ICON, XBM, TIFF]

Page 14: Formats Over Time: Exploring UK Web History

HTML Versions Over Time

[Figure: percentage of HTML resources (0% to 100%) by year, 1996 to 2010. Series: HTML 2.0, HTML 3.2, HTML 4.0, HTML 4.01, XHTML 1.0]

Page 15: Formats Over Time: Exploring UK Web History

PDF Versions Over Time

[Figure: percentage of PDF resources (0% to 100%) by year, 1996 to 2010. Series: PDF 1.0, 1.1, 1.2, 1.3, 1.4, 1.5, 1.6]

Page 16: Formats Over Time: Exploring UK Web History

Format Usage Versus Time

[Figure: number of resources in archive (log scale, 1 to 10,000,000,000) against timespan in years (0 to 18)]
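One way to derive the two axes of this plot from a format profile is sketched below. The slide does not define "timespan", so the sketch assumes it means the last year a format was seen minus the first; the (format, year, count) row shape and the function name are also assumptions.

```python
from typing import Dict, Iterable, Tuple


def usage_versus_time(
    rows: Iterable[Tuple[str, int, int]]
) -> Dict[str, Tuple[int, int]]:
    """Map each format to (timespan in years, total resource count).

    Timespan is assumed to be last year seen minus first year seen.
    """
    first: Dict[str, int] = {}
    last: Dict[str, int] = {}
    total: Dict[str, int] = {}
    for fmt, year, count in rows:
        first[fmt] = min(year, first.get(fmt, year))
        last[fmt] = max(year, last.get(fmt, year))
        total[fmt] = total.get(fmt, 0) + count
    return {fmt: (last[fmt] - first[fmt], total[fmt]) for fmt in total}


if __name__ == "__main__":
    # Made-up counts for illustration only.
    rows = [
        ("image/gif", 1996, 10),
        ("image/gif", 2010, 5),
        ("image/x-xbitmap", 1997, 3),
    ]
    for fmt, (span, count) in sorted(usage_versus_time(rows).items()):
        print(fmt, span, count)
```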

Page 17: Formats Over Time: Exploring UK Web History

IMPLEMENTATIONS Results

Page 18: Formats Over Time: Exploring UK Web History

PDF Software Over Time

[Figure: percentage of PDF resources (0% to 100%) by year, 1996 to 2010. Labelled series: Acrobat Distiller, Acrobat PDFWriter, Acrobat]

Over 2100 Distinct PDF Software IDs

Page 19: Formats Over Time: Exploring UK Web History

JPEG Hardware Over Time

[Figure: percentage of hardware IDs (0% to 100%) by year, 1994 to 2010. Labelled series: DS5, CYBERSHOT, E990, MX1700, NIKON D40]

Over 2100 Distinct JPEG Hardware IDs

Page 20: Formats Over Time: Exploring UK Web History

CONCLUSIONS Formats over Time

Page 21: Formats Over Time: Exploring UK Web History

Summary

  Format obsolescence is complex
    Network effects do appear to stabilize formats
    But once-popular formats are fading nevertheless
    More sophisticated approach required

  Please re-use our data, or ask for more

  Firmer conclusions need:
    Richer, more detailed results
    From a wider range of corpora

  This approach only gives creator information
    A different approach will be needed to understand resource consumption (e.g. PPT 4, RealAudio 1)

Page 22: Formats Over Time: Exploring UK Web History

webarchive.org.uk

Questions?