Improved validation and feature extraction for JPEG 2000 Part 1: the jpylyzer tool

Johan van der Knijff (1, 2), René van der Ark (1), Carl Wilson (3). (1) Koninklijke Bibliotheek – National Library of the Netherlands; (2) Open Planets Foundation; (3) The British Library. IS&T, Archiving 2012, Copenhagen, 15.6.2012


DESCRIPTION

Presentation on jpylyzer, a new tool that performs thorough validation of JPEG 2000 Part 1 (JP2) images. Presented during IS&T "Archiving 2012" conference.

TRANSCRIPT

Page 1: Improved validation and feature extraction for JPEG 2000 Part 1: the jpylyzer tool

SCAPE

Johan van der Knijff (1, 2), René van der Ark (1), Carl Wilson (3)

(1) Koninklijke Bibliotheek – National Library of the Netherlands
(2) Open Planets Foundation
(3) The British Library

IS&T, Archiving 2012, Copenhagen, 15.6.2012

Page 2: Metamorfoze

National Programme for preservation of paper heritage

Digitisation as a means to conserve threatened paper originals

TIFF → JP2

146 TB

Migrate by end 2012

Presenter notes:
How this all started: Metamorfoze is a joint effort of the KB and the National Archive of the Netherlands. Need to save on storage costs. Early 2011: investigate feasibility of migration to JP2. Some of this material is irreplaceable, because the originals are in bad shape!
Page 3: JP2 from JISC 1 Newspaper Collection (BL)

Presenter notes:
Presented by Paul Wheatley (BL) during the SCAPE kickoff around the same time. JISC 1 collection: 19th century newspapers. Masters as TIFF; converted to JP2 to save storage costs.
Page 4: JP2 from JISC 1 Newspaper Collection (BL)

“Well-formed and valid”

Presenter notes:
Each image was validated by JHOVE 1.6 before ingest. The BL experiences raised concerns about potential risks of the Metamorfoze migration.
Page 5: Hardware failure may result in corrupted images

Source: http://img70.imageshack.us/img70/9950/serversnm2.jpg

Presenter notes:
Brief network interruptions, disk failure. Result: images that are truncated or have missing data. Will we be able to detect this in time?
Page 6: Not all encoders produce standard-compliant images

Presenter notes:
Examples: Photoshop JP2 / JPX hybrids; older versions of the Luratech encoder would produce JPX under certain conditions. Result: embedded ICC profiles are potentially ignored by a compliant JP2 reader. Images are usually readable, but colour space information may be lost after a future migration to another format. This is not acceptable if content needs to remain accessible for posterity!
Page 7: Possible solutions

Option 1

Improve JPEG 2000 module of JHOVE

But no institutional support, superseded by JHOVE2 (?)

Option 2

Develop JPEG 2000 module for JHOVE2

Not ready for operational use (yet)

Option 3

Develop dedicated tool

Page 8: Jpylyzer tool

[Figure: stream of binary digits]

Presenter notes:
Validation against the file specification (annexes on file format syntax and codestream syntax). Will also identify byte-level corruption in case of missing or added bytes. Limited detection of bit rot (e.g. zeros that have become ones and vice versa). No analysis at the compressed bitstream level: this is out of scope (it would require decoding of the image data).
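To illustrate why missing bytes are detectable while bit rot inside the compressed data mostly is not, here is a minimal sketch (not jpylyzer's actual code). It relies on two facts from the JPEG 2000 codestream syntax: a codestream starts with the SOC marker (0xFF4F) and ends with the EOC marker (0xFFD9). The function name and the synthetic filler bytes are assumptions made for the example.

```python
SOC = b"\xff\x4f"  # start-of-codestream marker
EOC = b"\xff\xd9"  # end-of-codestream marker


def codestream_markers_ok(codestream: bytes) -> dict:
    """Return True/False results for two simple marker tests (hypothetical helper)."""
    return {
        "starts_with_soc": codestream[:2] == SOC,
        "ends_with_eoc": codestream[-2:] == EOC,
    }


# Synthetic stand-ins for a codestream: a real one carries header segments
# and tile data between the markers; the filler bytes are placeholders.
intact = SOC + b"\x00" * 16 + EOC
truncated = intact[:-5]                                    # simulate an interrupted write
bit_rot = SOC + b"\x00" * 8 + b"\x01" + b"\x00" * 7 + EOC  # one altered filler byte

for label, data in [("intact", intact), ("truncated", truncated), ("bit rot", bit_rot)]:
    print(label, codestream_markers_ok(data))
```

The truncated sample fails the end-marker test, whereas the sample with one altered byte in the middle passes both tests: structural checks of this kind say nothing about the compressed image data itself.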
Page 9: Jpylyzer tool

First prototype: December 2011

Refactoring of original code: Jan 2012

Packaging (Debian): Mar 2012 (Univ. Southampton, KEEP Solutions, AIT Vienna)

Add remaining functionality, bugfixes: Apr-May 2012 (current version: 1.5)

Page 10: JP2 file

JPEG 2000 Signature box

File Type box

JP2 Header box (superbox)

Contiguous Codestream box 0

Contiguous Codestream box n

IPR box

XML box(es)

UUID box(es)

UUID Info box(es) (superbox)

Presenter notes:
Check general file structure: Do all required boxes exist? Are the boxes in the right order? Separate validator functions for all boxes. Further subdivision for the codestream box (more complex). Also checks interdependencies between boxes.
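The structural checks described in these notes can be sketched roughly as follows. This is an illustration, not jpylyzer's implementation: it assumes only the general JP2 box layout (a 4-byte big-endian length, a 4-byte type, an optional 8-byte extended length when the length field is 1, and a length of 0 meaning "runs to end of file"). The function names, the test names and the synthetic sample file are all made up for the example.

```python
import struct


def top_level_boxes(data: bytes):
    """Yield (box_type, payload) for each top-level box; raise ValueError on bad lengths."""
    pos = 0
    while pos < len(data):
        if pos + 8 > len(data):
            raise ValueError("truncated box header")
        length, box_type = struct.unpack(">I4s", data[pos:pos + 8])
        header = 8
        if length == 1:  # an 8-byte extended length follows the type field
            if pos + 16 > len(data):
                raise ValueError("truncated extended length")
            (length,) = struct.unpack(">Q", data[pos + 8:pos + 16])
            header = 16
        elif length == 0:  # box runs to the end of the file
            length = len(data) - pos
        if length < header or pos + length > len(data):
            raise ValueError("box length inconsistent with file size")
        yield box_type, data[pos + header:pos + length]
        pos += length


def check_structure(data: bytes) -> dict:
    """Run a few True/False structure tests (illustrative test names)."""
    try:
        types = [box_type for box_type, _ in top_level_boxes(data)]
    except ValueError:
        return {"box_lengths_consistent": False}
    has_header = b"jp2h" in types
    has_codestream = b"jp2c" in types
    return {
        "box_lengths_consistent": True,
        "signature_box_first": types[:1] == [b"jP  "],
        "file_type_box_second": types[1:2] == [b"ftyp"],
        "has_jp2_header_box": has_header,
        "has_codestream_box": has_codestream,
        "header_before_codestream": has_header and has_codestream
        and types.index(b"jp2h") < types.index(b"jp2c"),
    }


def box(box_type: bytes, payload: bytes) -> bytes:
    """Build one box with a plain 4-byte length field (helper for the demo)."""
    return struct.pack(">I4s", 8 + len(payload), box_type) + payload


# Synthetic file: the payloads are placeholders, not valid box contents.
sample = (
    box(b"jP  ", b"\r\n\x87\n")
    + box(b"ftyp", b"jp2 \x00\x00\x00\x00jp2 ")
    + box(b"jp2h", b"")
    + box(b"jp2c", b"\xff\x4f\xff\xd9")
)
print(check_structure(sample))
print(check_structure(sample[:-3]))  # three missing bytes at the end
```

A real validator would go on to check the contents of each box and the dependencies between them; this sketch stops at presence, order and length consistency.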
Page 11: Command-line use

Page 12: Result

Presenter notes:
Validation is based on a sequence of tests that each return a True/False result. If all tests returned True: most likely a valid JP2! By default only tests that returned False are reported in the output. Note how the output follows the box structure.
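The reporting logic described in these notes could look something like the sketch below. It is not taken from jpylyzer: the function name, the box names and the test names are illustrative assumptions. Test results are grouped per box, the file counts as valid only if every test returned True, and by default only the failed tests are kept in the report.

```python
def summarise(results: dict, verbose: bool = False):
    """Return (is_valid, report) from per-box True/False test results."""
    is_valid = all(ok for tests in results.values() for ok in tests.values())
    report = {}
    for box_name, tests in results.items():
        # Keep every test in verbose mode, otherwise only the failures.
        kept = {name: ok for name, ok in tests.items() if verbose or not ok}
        if kept:
            report[box_name] = kept
    return is_valid, report


# Hypothetical test outcomes, grouped by box as in the tool's output.
results = {
    "signatureBox": {"boxLengthIsValid": True, "signatureIsValid": True},
    "contiguousCodestreamBox": {"foundSOCMarker": True, "foundEOCMarker": False},
}

print(summarise(results))
print(summarise(results, verbose=True))
```

Reporting only the failures keeps the output short for the common case of a valid file, while still pointing at the box in which a problem occurred.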
Page 13: Properties extraction (excerpt)

Page 14: Properties of embedded ICC profile

Page 15: Documentation

Presenter notes:
Documents ALL tests and ALL reported properties!
Page 16: Example 1: detection of broken JP2s in JISC 1 Newspapers

Number of images: 2,152,116
Total size: 45 TB
Average image size: 21.8 MB
Number of threads: 1
Time: 21 days*
Images/day/thread: 100,000
TB/day/thread: 2

*Includes unzipping; the actual time needed by jpylyzer is much less!

Page 17: Results

676 broken JP2s in JISC 1 collection (0.03 %)

TIFF originals still available

JISC 2 (> 1 million images): 3 broken JP2s

19th Century books (> 22 million images): no broken JP2s
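As a quick cross-check of the JISC 1 figures, the rounded per-day rates and the failure percentage can be recomputed from the raw numbers on the slides. This is plain arithmetic on the reported values; the variable names are just labels for the example.

```python
images = 2_152_116  # number of images (JISC 1)
total_tb = 45       # total size in TB
days = 21           # elapsed time, single thread, including unzipping
broken = 676        # broken JP2s found

images_per_day = images / days      # slide rounds this to 100,000
tb_per_day = total_tb / days        # slide rounds this to 2
broken_pct = 100 * broken / images  # slide rounds this to 0.03 %

print(f"{images_per_day:,.0f} images/day/thread")
print(f"{tb_per_day:.2f} TB/day/thread")
print(f"{broken_pct:.3f} % broken")
```

Since the 21 days include unzipping, these rates are a lower bound on what jpylyzer itself achieved.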

Page 18: Example 2: quality control Metamorfoze migration

TIFF → JP2

146 TB

Migrate by end 2012

Page 19

[Flowchart. Nodes: TIFF; Aware JP2K SDK; JP2; Jpylyzer*; pixel compare; compare image properties; properties; properties profile; pass; fail. Decisions: valid JP2? properties match? pixels identical? A "no" on any decision leads to fail; three "yes" answers lead to pass.]

*Imported as module in Python-based workflow

Presenter notes:
TIFFs are converted to JP2 using the Aware JPEG 2000 SDK. Pixel check: not within the scope of jpylyzer! Validity doesn't guarantee the integrity of the compressed image bitstreams. A pixel-wise check by itself does not guarantee that the image is without faults (header fields, ICC profiles etc. could still be wrong). Properties checks: these include progression order, number of layers, tile size, presence of an ICC profile, and so on. So these checks are complementary. KEY POINT: jpylyzer is NOT intended as a one-stop solution for JP2 quality control, but rather as a component of it.
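The pass/fail logic of this workflow might be wired together as in the sketch below. It is a guess at the shape of such a workflow, not the actual Metamorfoze code: the three checks are passed in as callables so that no particular jpylyzer or image-library API is assumed, and the order in which the checks run is an assumption too.

```python
from typing import Callable, Tuple


def migration_qc(
    tiff_path: str,
    jp2_path: str,
    is_valid_jp2: Callable[[str], bool],
    properties_match_profile: Callable[[str], bool],
    pixels_identical: Callable[[str, str], bool],
) -> Tuple[bool, str]:
    """Return (passed, reason); the first failed check decides the outcome."""
    if not is_valid_jp2(jp2_path):
        return False, "not a valid JP2"
    if not properties_match_profile(jp2_path):
        return False, "properties do not match profile"
    if not pixels_identical(tiff_path, jp2_path):
        return False, "pixels differ from source TIFF"
    return True, "pass"


# Stub checks standing in for jpylyzer, a profile comparison and a pixel compare.
outcome = migration_qc(
    "page001.tif",
    "page001.jp2",
    is_valid_jp2=lambda jp2: True,
    properties_match_profile=lambda jp2: True,
    pixels_identical=lambda tiff, jp2: False,
)
print(outcome)
```

Keeping the three checks separate mirrors the key point of the notes: each one catches problems that the others cannot.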
Page 20: Example 3: pre-ingest quality control Wellcome Library

JP2s produced in-house and by external suppliers

Use jpylyzer to validate against JP2 spec

Use extracted properties to validate against a profile (progression order, ratio, layers, ...)

Profile coded as XML schema (so jpylyzer output can be validated against schema)
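As a simplified stand-in for the schema-based approach on this slide, the sketch below compares a few extracted properties against expected values using only the Python standard library. It is not XML Schema validation, and it is not the Wellcome Library's profile: the element names, the sample output and the profile values are assumptions chosen for illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical profile: expected text value per property element.
PROFILE = {"order": "RPCL", "layers": "8", "transformation": "9-7 irreversible"}

# Hypothetical excerpt of tool output; real output has many more elements.
SAMPLE_OUTPUT = """
<jpylyzer>
  <isValidJP2>True</isValidJP2>
  <properties>
    <order>RPCL</order>
    <layers>6</layers>
    <transformation>9-7 irreversible</transformation>
  </properties>
</jpylyzer>
"""


def profile_mismatches(xml_text: str, profile: dict) -> dict:
    """Return {property: (expected, found)} for every property that deviates."""
    root = ET.fromstring(xml_text)
    mismatches = {}
    for name, expected in profile.items():
        found = root.findtext(f".//{name}")  # None if the element is absent
        if found != expected:
            mismatches[name] = (expected, found)
    return mismatches


print(profile_mismatches(SAMPLE_OUTPUT, PROFILE))
```

Expressing the profile as an XML schema, as the slide describes, moves these expectations out of the code and into a document that a standard schema validator can apply to the output directly.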

Page 21: Platforms and licensing stuff

Presenter notes:
Jpylyzer is released under a very permissive license, with few restrictions on re-use and modification. No need to install Python if you don't want to!
Page 22

http://www.openplanetsfoundation.org/software/jpylyzer

Presenter notes:
Links to source code repository, binaries, documentation.
Page 23: Community involvement

Presenter notes:
Involvement of the wider community is important to make jpylyzer sustainable. Everyone who wants to contribute (code, bug reports, testing) is welcome: just get in touch. If you find the software useful, please let us know.
Page 24: Acknowledgements

Debian packages:
Dave Tarrant (Uni Southampton/OPF)
Miguel Ferreira, Rui Castro, Hélder Silva (KEEP Solutions)
Rainer Schmidt (AIT)

Feedback on early versions:
Christy Henshaw (Wellcome Library)
Ross Spencer (TNA)
Wouter Kool (KB)

Page 25: Funding

#SCAPEProject

http://www.scape-project.eu

This work was partially supported by the SCAPE Project. The SCAPE project is co-funded by the European Union under FP7 ICT-2009.4.1 (Grant Agreement number 270137).