Using JHOVE2 for Policy Assessment of Files
Richard Anderson
Code4LibCon Preconference
2/7/2011
http://code4lib.org/conference/2011/schedule#preconf
13:30-16:30 : Persimmon Room
Agenda 13:30-16:30
• What is JHOVE2?
• Characterization of digital objects
• Validation vs Assessment
• Examples of JHOVE2 output
• Source Units, Modules, Reportable Properties
• Implementation of Assessment
• Configuration of Assessment Rules
JHOVE2 is …
… a project to develop a next-generation open source framework and application for format-aware characterization
… a collaborative undertaking of the California Digital Library (CDL), Portico, and Stanford University
… funded by a two-year grant from the Library of Congress as part of its National Digital Information Infrastructure and Preservation Program (NDIIPP)
“What? So what?”
Characterization is the automated determination of the intrinsic and extrinsic properties of a formatted object
– Identification: determining the presumptive format of a digital object based on suggestive extrinsic hints and intrinsic signatures
– Feature extraction: reporting the intrinsic properties of an object significant for classification, analysis, and planning
– Validation
– Assessment
What's new in JHOVE2?
Processing of multi-file objects as well as embedded objects inside files
Recursive processing of container objects
Plug-in Format Modules
Buffered I/O
Internationalized output
Clean APIs and modern design patterns
Je ne sais quoi !
API design idioms
Separation of concerns
– Annotation and Reflection: confluence.ucop.edu/display/JHOVE2Info/Background+Papers
Inversion of Control (IOC) / Dependency Injection
– Martin Fowler: martinfowler.com/articles/injection.html
– Spring Framework: www.springsource.org/
Project Home
Domain name
– http://jhove2.org/
Code Repository
– https://bitbucket.org/jhove2/main/wiki/Home
• Public Wiki/Documentation
• Browse/Clone Source Code
• Download Release Packages
• Changeset History
• Issue Tracking
Mailing lists
– [email protected]
– [email protected]
JHOVE2 Documentation
Complete documentation
– User’s guide
– Architectural overview
– Module specifications
– Programmer’s guide
Characterization
Validation vs. Assessment
Validation is the determination of the level of conformance to the normative requirements of a format’s authoritative specification
– To the extent that there is community consensus on these requirements, validation is an objective determination
– Hard coded in JHOVE2 Modules
Assessment is the determination of the level of acceptability for a specific purpose on the basis of locally-defined policy rules
– Since these rules are locally configurable, assessment is a subjective determination
– Scripted via config files
Format Specifications
Format      Specification
JPEG 2000   JP2 (ISO/IEC 15444-1), JPX (ISO/IEC 15444-2)
PDF         PDF 1.0 – 1.7, ISO 32000-1, PDF/A-1 (ISO 19005-1), PDF/X-1 (ISO 15930-1), -1a (ISO 15930-4), -2 (ISO 15930-5), -3 (ISO 15930-6)
TIFF        TIFF 4 – 6, Class B, F, G, P, R, Y, TIFF/EP (ISO 12234-2), TIFF/IT (ISO 12639), GeoTIFF, Exif (JEITA CP-3451), DNG
UTF-8       ASCII (ANSI X3.4)
WAVE        BWF (EBU N22-1997)
Putting it another way …
Assessment is the evaluation of a source unit's reportable properties against a set of policy-based rules
A source unit's
– File (UTF-8)
– File with embedded ByteStream(s) (TIFF with ICC profile)
– Aggregate (Directory, ZIP)
– ClumpSource (ShapeFile)
reportable properties
– Format Identification
– Features
– Validity
against a set of policy-based rules
– Is the item acceptable?
– Is there a preservation risk?
– What level of preservation service?
– Should we flag object for future action?
Practical Applications of Assessment
• Ingest workflows
• Migration workflows
• Digitization workflows
• Publishing workflows
Running JHOVE2
jhove2.sh -d Text -o outfile.txt myfile.xml
Display format choices are: Text (default), JSON, and XML.
File argument can be any of:
– Filename
– Directory name
– URL
– Set of space-delimited filepaths
http://bitbucket.org/jhove2/main/wiki/documents/JHOVE2-Users-Guide.pdf
JHOVE2 Output options
• Input File
– xml-schemaLocation-cannot-resolve.xml
• Text
– text-output.txt
• XML
– xml-output.xml
• JSON
– json-output.txt
FileSource:
Path: E:\samples\xml\schema-sample.xml
Size (byte): 9516
LastModified: 2010-10-12T11:55:29-06:00
SourceName: schema-sample.xml
StartingOffset (byte): 0
…
JHOVE2 Output
Format Identification
PresumptiveFormats:
PresumptiveFormat {FormatIdentification}:
NativeIdentifier {I8R}:
Namespace: PUID
Value: fmt/101 PRONOM Identifier
JHOVE2Identifier {I8R}:
Namespace: JHOVE2
Value: http://jhove2.org/terms/format/xml
...
PRONOM Format Registry
http://www.nationalarchives.gov.uk/PRONOM/Format/proFormatSearch.aspx?status=detailReport&id=638
Name: Extensible Markup Language
Version: 1.0
Other names: XML (1.0)
Identifiers: PUID: fmt/101; Apple Uniform Type Identifier: public.xml; MIME: text/xml
Classification: Text (Mark-up)
Description: The Extensible Markup Language (XML) is a general purpose markup language for creating other, special purpose, markup languages, and is a simplified subset of SGML. …
Agent used for Identification
Module {DROIDIdentifier}:
SignatureFile: …/DROID_SignatureFile_V20.xml
Version: 2.0.0
ReleaseDate: 2010-09-10
WrappedProduct:
Name: DROID
Version: 4.0.0
ReleaseDate: 2009-07-23
...
DROID
http://sourceforge.net/projects/droid/
DROID (Digital Record Object Identification) is an automatic file format identification tool. It is the first in a planned series of tools developed by The National Archives under the umbrella of its PRONOM technical registry service.
XML Module
Module {XmlModule}:
SaxParser:
Parser: org.apache.xerces.parsers.SAXParser
XmlDeclaration:
Version: 1.0
Encoding: UTF-8
Standalone: no
RootElement:
Name: mets
Namespace: http://www.loc.gov/METS/
XML Module (namespaces)
NamespaceInformation:
NamespaceCount: 2
Namespaces:
Namespace:
URI: http://www.loc.gov/METS/
Declarations:
Prefix: [default]
SchemaLocations:
SchemaLocation:
Location: http://www.loc.gov/standards/mets/version15/mets.xsd
Namespace:
URI: http://www.loc.gov/mix/v10
Declarations:
Prefix: mix
XML Module (cont)
ValidationResults:
ParserWarnings {ValidationMessageList}:
ValidationMessageCount: 0
ParserErrors {ValidationMessageList}:
ValidationMessageCount: 0
FatalParserErrors {ValidationMessageList}:
ValidationMessageCount: 0
isWellFormed: true
isValid: true
Format Modules from JHOVE2 Team
– ICC color profile
– JPEG 2000
– PDF
– SGML
– Shapefile
– TIFF
– UTF-8
– WAVE
– XML
– Zip
JHOVE2 can identify (by DROID) many more formats than it can validate (by modules)
Other Module Development
3rd party development activities
– NetCDF and GRIB modules (Wegener Institute)
– Integration with DuraCloud (DuraSpace)
– ARC module (Bibliothèque nationale de France)
– WARC, JPEG, GIF modules (CDL, hopefully ;-)
Possible development efforts
– Additional format modules
– Configuration GUIs
– JHOVE2-as-a-service
– Integration with DAITSS, DSpace, Fedora, FITS, etc.
Suggestions, volunteers and funders welcome
AssessmentModule
Module {AssessmentModule}:
AssessmentResultSets:
AssessmentResultSet:
RuleSetName: XmlRuleSet
RuleSetDescription: RuleSet for Xml Module
ObjectFilter: org.jhove2.module.format.xml.XmlModule
BooleanResult: true
AssessmentResults:
AssessmentResult:
RuleName: XmlValidityRule
RuleDescription: Is the XML file acceptable?
BooleanResult: true
NarrativeResult: Acceptable
JHOVE2 Abstractions
• Source Unit
• Module
• Reportable
• Reportable Property
• Message
Source Unit
A formatted object about which characterization information can be meaningfully reported
– Unitary
  File, e.g. UTF-8 text file
  File inside of a container, e.g. TIFF inside a Zip
  Byte stream inside a file, e.g. ICC inside a TIFF
– Aggregate
  Directory
  Directory inside of a container
  Clump, e.g. Shapefile
  File set, e.g. command line arguments
For purposes of characterization, directories, file sets, and clumps are considered format types
Source Interface (Java)
public Set<FormatIdentification> getPresumptiveFormats() {
    return presumptiveFormatIdentifications;
}
public List<Module> getModules() {
    return this.modules;
}
public List<Source> getChildSources() {
    return this.children;
}
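The getChildSources() accessor is what makes recursive processing of containers possible. The following is a minimal sketch of such a depth-first walk, not JHOVE2 code: SimpleSource and SourceWalker are hypothetical stand-ins invented for illustration, and the file names are made up.

```java
import java.util.ArrayList;
import java.util.List;

class SourceWalker {
    // Depth-first walk: record the source itself, then descend into each child.
    static List<String> walk(SimpleSource source) {
        List<String> visited = new ArrayList<>();
        collect(source, "", visited);
        return visited;
    }

    private static void collect(SimpleSource source, String indent, List<String> out) {
        out.add(indent + source.getName());
        for (SimpleSource child : source.getChildSources()) {
            collect(child, indent + "  ", out);
        }
    }

    public static void main(String[] args) {
        // A Zip holding a TIFF (with an embedded ICC profile) and a text file.
        SimpleSource zip = new SimpleSource("archive.zip")
            .addChild(new SimpleSource("image.tif")
                .addChild(new SimpleSource("ICC profile")))
            .addChild(new SimpleSource("notes.txt"));
        for (String line : walk(zip)) {
            System.out.println(line);
        }
    }
}

// Hypothetical stand-in for a source unit; NOT the real JHOVE2 Source type.
class SimpleSource {
    private final String name;
    private final List<SimpleSource> children = new ArrayList<>();

    SimpleSource(String name) {
        this.name = name;
    }

    SimpleSource addChild(SimpleSource child) {
        children.add(child);
        return this;
    }

    String getName() {
        return name;
    }

    List<SimpleSource> getChildSources() {
        return children;
    }
}
```

Each nesting level is indented by two spaces, so a container's children appear directly beneath it in the listing.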
Format Module
• implements Parser
• implements Validator
• implements Reportable
• imports org.jhove2.annotation.ReportableProperty
public long parse(JHOVE2 jhove2, Source source, Input input) {
    // extract features and fill in the reportable properties fields
    . . .
}
Reportables
A Reportable is a named set of properties
– Reportables correspond to Java classes
– Including classes for sources and modules
Also define reportables for the major conceptual structures inherent to a format
– JPEG 2000: Box
– TIFF: IFH, IFD, IFD entry (“tag”)
– UTF-8: Character stream, character
– WAVE: Chunk
Reportable Interface
package org.jhove2.core;

public interface Reportable {
    public I8R getReportableIdentifier();
    public String getReportableName();
    public void setReportableName(String name);
}

public abstract class AbstractReportable implements Reportable {
    protected I8R reportableIdentifier;
    protected String reportableName;
}
A reportable class implements the Reportable marker interface
ReportableProperties
A ReportableProperty is a named, typed value
– org.jhove2.annotation.ReportableProperty
– Unique formal identifier
– Data type
  Scalar or collection
  Java types, JHOVE2 primitive types, or JHOVE2 reportables
– Typed value
– Description of correct semantic interpretation
– Properties correspond to fields
ReportableProperty Annotation
Each reportable property is represented by a field and accessor and mutator methods
The accessor method must be marked with the @ReportableProperty annotation
public class MyReportable implements Reportable {
    protected String myProperty;

    @ReportableProperty(order=1, desc="description", ref="reference")
    public String getMyProperty() { return this.myProperty; }

    public void setMyProperty(String property) { this.myProperty = property; }
}
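Marking accessors with an annotation lets a framework discover properties by reflection instead of hard-coding them. The sketch below illustrates that general idea only. DemoProperty, DemoReportable and PropertyLister are hypothetical names, and the stand-in annotation is deliberately simpler than org.jhove2.annotation.ReportableProperty.

```java
import java.lang.annotation.ElementType;
import java.lang.annotation.Retention;
import java.lang.annotation.RetentionPolicy;
import java.lang.annotation.Target;
import java.lang.reflect.Method;
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

class PropertyLister {
    // Returns "Name = value" for every annotated no-arg getter, sorted by order().
    static List<String> list(Object reportable) {
        List<Method> getters = new ArrayList<>();
        for (Method m : reportable.getClass().getMethods()) {
            if (m.isAnnotationPresent(DemoProperty.class) && m.getParameterCount() == 0) {
                getters.add(m);
            }
        }
        getters.sort(Comparator.comparingInt(
            (Method m) -> m.getAnnotation(DemoProperty.class).order()));
        List<String> lines = new ArrayList<>();
        for (Method m : getters) {
            // Derive the property name by dropping the "get" prefix.
            String name = m.getName().startsWith("get") ? m.getName().substring(3) : m.getName();
            try {
                lines.add(name + " = " + m.invoke(reportable));
            } catch (ReflectiveOperationException e) {
                throw new IllegalStateException("Cannot read property " + name, e);
            }
        }
        return lines;
    }

    public static void main(String[] args) {
        for (String line : list(new DemoReportable())) {
            System.out.println(line);
        }
    }
}

// Hypothetical stand-in annotation; retained at runtime so reflection can see it.
@Retention(RetentionPolicy.RUNTIME)
@Target(ElementType.METHOD)
@interface DemoProperty {
    int order();
    String desc() default "";
}

// Hypothetical reportable with two annotated accessors, declared out of order on purpose.
class DemoReportable {
    @DemoProperty(order = 2, desc = "Name of the document's root element")
    public String getRootElementName() { return "mets"; }

    @DemoProperty(order = 1, desc = "Java class used to parse the XML")
    public String getSaxParser() { return "org.apache.xerces.parsers.SAXParser"; }
}
```

The order attribute is what gives the report a stable layout, since reflection does not guarantee the order in which methods are returned.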
Wave Reportable Properties
chunks[ ]
formatChunkNotBeforeDataChunkMessage
missingRequiredFormatChunkMessage
missingRequiredDataChunkMessage
missingRequiredFactChunkMessage
isValid
childChunks[ ]
hasPadByte
identifier
isValid
size
UTF-8 Reportable Properties
byteOrderMark
c0Characters
c1Characters
codeBlocks
eOLMarkers
invalidCharacters[ ]
isValid
numCharacters
numLines
numNonCharacters
c0Control
c1Control
codeBlock
codePoint
codePointOutOfRange
coverage
invalidByteValues
isByteOrderMark
isC0Control
isC1Control
isNonCharacter
isValid
size
XML Reportable Properties
Fields for the reportable properties
protected String saxParser = "org.apache.xerces.parsers.SAXParser";
protected XmlDeclaration xmlDeclaration = new XmlDeclaration();
protected String xmlRootElementName;
protected List<XmlDTD> xmlDTDs;
protected HashMap<String,XmlNamespace> xmlNamespaceMap;
protected List<XmlNotation> xmlNotations;
protected List<String> xmlCharacterReferences;
protected List<XmlEntity> xmlEntitys;
protected List<XmlProcessingInstruction> xmlProcessingInstructions;
protected List<String> xmlComments;
protected XmlValidationResults xmlValidationResults;
protected boolean wellFormed;
Getter methods for reportable properties
import org.jhove2.annotation.ReportableProperty;

@ReportableProperty(order = 1, value = "Java class used to parse the XML")
public String getSaxParser() { return saxParser; }

@ReportableProperty(order = 2, value = "XML Declaration data")
public XmlDeclaration getXmlDeclaration() { return xmlDeclaration; }

@ReportableProperty(order = 3, value = "Name of the document's root element")
public String getXmlRootElementName() { return xmlRootElementName; }
Messages
if (position == start && ch.isByteOrderMark()) {
    Object[] messageParms = new Object[] {position};
    this.bomMessage = new Message(Severity.INFO,
                                  Context.OBJECT,
                                  "org.jhove2.module.format.utf8.UTF8Module.bomMessage",
                                  messageParms);
}
Messages
• Messages are reportable properties
– Unique identifier
  info:jhove2/message/…
– Context
  Process: condition arising from the process of characterization
  Object: condition arising in the object being characterized
– Severity
  Error
  Warning
  Info
– Internationalizable
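One way to picture an internationalizable message is a stable key plus parameters, with the human-readable template looked up per locale. This is a rough sketch under that assumption and not the JHOVE2 Message class. MessageDemo, its in-memory catalog and the key demo.utf8.bomMessage are all hypothetical.

```java
import java.text.MessageFormat;
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;

class MessageDemo {
    enum Severity { ERROR, WARNING, INFO }
    enum Context { PROCESS, OBJECT }

    // Hypothetical in-memory catalog: message key -> locale -> template.
    private static final Map<String, Map<Locale, String>> CATALOG = new HashMap<>();
    static {
        Map<Locale, String> bom = new HashMap<>();
        bom.put(Locale.ENGLISH, "Byte order mark found at offset {0}");
        bom.put(Locale.GERMAN, "Byte-Order-Mark bei Offset {0} gefunden");
        CATALOG.put("demo.utf8.bomMessage", bom);
    }

    // Looks up the template for the locale (falling back to English) and fills in the parameters.
    static String format(Severity severity, Context context, String key,
                         Locale locale, Object... params) {
        Map<Locale, String> templates = CATALOG.get(key);
        if (templates == null) {
            throw new IllegalArgumentException("Unknown message key: " + key);
        }
        String template = templates.getOrDefault(locale, templates.get(Locale.ENGLISH));
        return "[" + severity + "/" + context + "] "
            + new MessageFormat(template, locale).format(params);
    }

    public static void main(String[] args) {
        System.out.println(format(Severity.INFO, Context.OBJECT,
            "demo.utf8.bomMessage", Locale.ENGLISH, 0L));
        System.out.println(format(Severity.INFO, Context.OBJECT,
            "demo.utf8.bomMessage", Locale.GERMAN, 0L));
    }
}
```

Because the key and parameters are kept separate from the wording, the same message can be rendered in another language without touching the module that raised it.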
Assessment rules
Assertions (logical expressions) based on
– Presence/absence of a property– Constraints on property values– Combinations of properties/values
Predicate Logic
• Rules use a construct whose basic structure looks like this:
If (condition)
Then (consequent)
Else (alternative)
http://en.wikipedia.org/wiki/Conditional_(programming)
Condition
A condition is defined by a universal or existential quantifier
  ∀ “for all”   ∃ “for any”   ¬∃ “not any”
and an arbitrary set of predicates
  {ALL_OF | ANY_OF | NONE_OF} (predicate) (predicate) ...
http://www.csm.ornl.gov/~sheldon/ds/sec1.6.html
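The three quantifiers map directly onto "all match", "any match" and "none match" over a list of predicates. A minimal sketch of that mapping follows. QuantifierDemo is a hypothetical class written for illustration, and it uses java.util.function.Predicate rather than the string expressions the assessment rules use.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Predicate;

class QuantifierDemo {
    enum Quantifier { ALL_OF, ANY_OF, NONE_OF }

    // Applies the quantifier to the outcomes of all predicates for one subject.
    static <T> boolean evaluate(Quantifier quantifier, List<Predicate<T>> predicates, T subject) {
        switch (quantifier) {
            case ALL_OF:
                return predicates.stream().allMatch(p -> p.test(subject));
            case ANY_OF:
                return predicates.stream().anyMatch(p -> p.test(subject));
            case NONE_OF:
                return predicates.stream().noneMatch(p -> p.test(subject));
            default:
                throw new IllegalArgumentException("Unknown quantifier: " + quantifier);
        }
    }

    public static void main(String[] args) {
        List<Predicate<Integer>> predicates = new ArrayList<>();
        predicates.add(x -> x > 0);       // positive
        predicates.add(x -> x % 2 == 0);  // even
        for (Quantifier q : Quantifier.values()) {
            System.out.println(q + " for 3: " + evaluate(q, predicates, 3));
        }
    }
}
```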
Predicate
Each predicate is a string containing a boolean expression
xmlDeclaration.standalone == 'yes'
These assertions take the form:
property relation value
Supported relational operators include:
== != < > <= >=
contains
exists ( != null)
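To make the "property relation value" form concrete, here is a toy evaluator over a flat map of property values. It is only an illustration: SimplePredicateDemo is a hypothetical class, it treats dotted names as plain map keys, and it handles far less than a real expression language would.

```java
import java.util.HashMap;
import java.util.Map;

class SimplePredicateDemo {
    // Evaluates "property relation value"; quoted values compare as strings, others as numbers.
    static boolean evaluate(String predicate, Map<String, Object> properties) {
        String[] parts = predicate.trim().split("\\s+", 3);
        if (parts.length != 3) {
            throw new IllegalArgumentException("Expected: property relation value");
        }
        Object actual = properties.get(parts[0]);
        String relation = parts[1];
        String literal = parts[2];

        if (literal.length() >= 2 && literal.startsWith("'") && literal.endsWith("'")) {
            String expected = literal.substring(1, literal.length() - 1);
            boolean equal = actual != null && expected.equals(actual.toString());
            switch (relation) {
                case "==": return equal;
                case "!=": return !equal;
                default: throw new IllegalArgumentException("Unsupported for strings: " + relation);
            }
        }

        if (!(actual instanceof Number)) {
            throw new IllegalArgumentException("Not a numeric property: " + parts[0]);
        }
        double left = ((Number) actual).doubleValue();
        double right = Double.parseDouble(literal);
        switch (relation) {
            case "==": return left == right;
            case "!=": return left != right;
            case "<":  return left < right;
            case ">":  return left > right;
            case "<=": return left <= right;
            case ">=": return left >= right;
            default: throw new IllegalArgumentException("Unsupported relation: " + relation);
        }
    }

    public static void main(String[] args) {
        Map<String, Object> properties = new HashMap<>();
        properties.put("xmlDeclaration.standalone", "yes");
        properties.put("waveFormatChunk.nSamplesPerSec", 96000);
        System.out.println(evaluate("xmlDeclaration.standalone == 'yes'", properties));
        System.out.println(evaluate("waveFormatChunk.nSamplesPerSec >= 48000", properties));
    }
}
```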
XML Assessment rule
If ANY_OF
    validity == true ;
    (validity == undetermined) and (wellFormed == true)
Then Acceptable
Else Not acceptable
End If
JPEG 2000 Assessment Rule
If ALL_OF
    validity == true ;
    exists(colourBox) ;
    exists(resolutionBox.capture)
Then Acceptable
Else Not acceptable
End If
Wave Assessment rule
If ALL_OF
    validity == true ;
    exists(broadcastWaveExtensionChunk) ;
    waveFormatChunk.nSamplesPerSec == 96000 ;
    waveFormatChunk.nBitsPerSample == 24
Then Acceptable
Else Not acceptable
End If
TIFF Assessment rule
If ANY_OF
    validity == true ;
    ((ifd.messages contains 'offsetNotByteAligned') or (ifd.messages contains 'dateNotWellFormed'))
Then Acceptable
Else Not acceptable
End If
Rules Engines
• JSR 94: Java™ Rule Engine API
  http://jcp.org/en/jsr/detail?id=94
• Rule Engines Overview
  http://jadex-rules.informatik.uni-hamburg.de/xwiki/bin/view/Resources/Rule+Engines
• Top 10 Java Business Rule Engines
  http://blog.taragana.com/index.php/archive/top-10-java-business-rule-engines/
• Introduction to Drools
  http://www.intltechventures.com/presentations/2008-01-26-Introduction-to-Drools.pdf
Expression Languages
• Predicates (conditions) are evaluated using a domain-specific language that supports scripted examination of Java objects
• MVEL (MVFLEX Expression Language)
  http://mvel.codehaus.org/
• OGNL (Object-Graph Navigation Language)
  http://www.opensymphony.com/ognl
• Groovy
  http://groovy.codehaus.org/
• Open Source Expression Languages in Java
  http://java-source.net/open-source/expression-languages
  http://www.java-opensource.com/open-source/expression-languages.html
Assessment Module at work
public void assess(JHOVE2 jhove2, Source source) throws JHOVE2Exception {
    /* Assess the source unit. */
    this.configInfo = jhove2.getConfigInfo();
    List<Module> modules = source.getModules();
    // Assess each module that has already processed this source
    for (Module module : modules) {
        assessObject(module);
        this.getModuleAccessor().persistModule(this);
    }
    // Then assess the source unit itself
    assessObject(source);
    this.getModuleAccessor().persistModule(this);
}
AssessObject Method
private void assessObject(Object assessedObject) throws JHOVE2Exception {
    // Rule sets are looked up by the class name of the object being assessed
    String objectFilter = assessedObject.getClass().getName();
    List<RuleSet> ruleSetList = getRuleSetFactory().getRuleSetList(objectFilter);
    if (ruleSetList != null) {
        for (RuleSet ruleSet : ruleSetList) {
            if (ruleSet.isEnabled()) {
                AssessmentResultSet resultSet = new AssessmentResultSet();
                assessmentResultSets.add(resultSet);
                resultSet.setRuleSet(ruleSet);
                resultSet.fireAllRules(assessedObject);
            }
        }
    }
}
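The selection step in assessObject comes down to matching the assessed object's class name against each rule set's object filter and skipping disabled rule sets. The sketch below isolates that step. RuleSetLookupDemo and DemoRuleSet are hypothetical, and ordinary JDK classes stand in for JHOVE2 modules.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

class RuleSetLookupDemo {
    // Hypothetical, simplified rule set holding only what selection needs.
    static class DemoRuleSet {
        final String name;
        final String objectFilter; // fully qualified class name the rule set applies to
        final boolean enabled;

        DemoRuleSet(String name, String objectFilter, boolean enabled) {
            this.name = name;
            this.objectFilter = objectFilter;
            this.enabled = enabled;
        }
    }

    // Returns the names of enabled rule sets whose filter equals the object's class name.
    static List<String> applicable(List<DemoRuleSet> all, Object assessedObject) {
        String filter = assessedObject.getClass().getName();
        List<String> names = new ArrayList<>();
        for (DemoRuleSet ruleSet : all) {
            if (ruleSet.enabled && ruleSet.objectFilter.equals(filter)) {
                names.add(ruleSet.name);
            }
        }
        return names;
    }

    public static void main(String[] args) {
        List<DemoRuleSet> all = Arrays.asList(
            new DemoRuleSet("StringRules", "java.lang.String", true),
            new DemoRuleSet("DisabledStringRules", "java.lang.String", false),
            new DemoRuleSet("IntegerRules", "java.lang.Integer", true));
        System.out.println(applicable(all, "some text"));
        System.out.println(applicable(all, 42));
    }
}
```

Keying on the exact class name means a rule set written for one module never fires against another, at the cost of not matching subclasses.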
Fire Off the Rules
Sequence Diagram
[Figure: sequence diagram covering Identification, Feature extraction, and Assessment]
Assessment Configuration
• Lists of properties for a Module can be generated using the ReportableInstanceTraverser utility
USAGE: java -cp CLASSPATH org.jhove2.app.util.traverser.ReportableInstanceTraverser fully-qualified-class-name output-file-path {optional boolean should-recurse (default true)}
• wave-property-list.txt
• tiff-module-properties.txt
Assessment Configuration
• Rules are configured using the ARules utility
– Utility developed by CDL to create a rule set in XML
– Future plans: a GUI
• ARules output is a Spring config file
ARules configuration
ruleset XmlRuleSet enabled org.jhove2.module.format.xml.XmlModule
desc Ruleset for XML module
rule XmlStandaloneRule enabled
desc Does XML Declaration specify standalone status?
cons Is Standalone
alt Is Not Standalone
quant all
pred xmlDeclaration.standalone == "yes"
rule XmlAcceptableRule enabled
desc Is the XML status acceptable?
cons Acceptable
alt Not Acceptable
quant any
pred valid.name() == "True"
pred (valid.name() == "Undetermined") && (wellFormed.name() == "True")
RuleSet Spring Bean
<!-- RuleSet bean for the XmlModule -->
<bean id="XmlRuleSet" class="org.jhove2.module.assess.RuleSet" scope="singleton">
  <property name="name" value="XmlRuleSet"/>
  <property name="description" value="RuleSet for Xml Module"/>
  <property name="objectFilter" value="org.jhove2.module.format.xml.XmlModule"/>
  <property name="rules">
    <list value-type="org.jhove2.module.assess.Rule">
      <ref local="XmlStandaloneRule"/>
      <ref local="XmlValidityRule"/>
    </list>
  </property>
  <property name="enabled" value="true"/>
</bean>
Rule Spring Bean
<!-- Rule bean for evaluating validity value -->
<bean id="XmlValidityRule" class="org.jhove2.module.assess.Rule" scope="singleton">
  <property name="name" value="XmlValidityRule"/>
  <property name="description" value="Is the XML validity status acceptable?"/>
  <property name="consequent" value="Acceptable"/>
  <property name="alternative" value="Not Acceptable"/>
  <property name="quantifier" value="ANY_OF"/>
  <property name="predicates">
    <list value-type="java.lang.String">
      <value><![CDATA[ valid.toString() == 'true' ]]></value>
      <value><![CDATA[ (valid.toString() == 'undetermined') && (wellFormed.toString() == 'true') ]]></value>
    </list>
  </property>
  <property name="enabled" value="true"/>
</bean>
Spring Config Files
config
└───spring
    └───module
        ├───aggrefy
        │       jhove2-aggrefy-config.xml
        ├───assess
        │       jhove2-assess-config.xml
        │       jhove2-ruleset-xml-config.xml
        ├───digest
        │       jhove2-digest-config.xml
        ├───display
        │       jhove2-display-config.xml
        ├───identify
        │       jhove2-display-config.xml
Assessment Output
Results stored as new characterization properties
Rule evaluation output includes
– Rule's name and brief description
– Boolean value of the condition that was evaluated
– Text value of the consequent or alternative
– Details of the predicate evaluation results
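A rule result therefore carries four things: the rule name, the boolean outcome, the consequent or alternative text, and a per-predicate trace. The sketch below assembles such a result for an ANY_OF rule. RuleResultDemo is hypothetical, the one-line output format is invented for the example, and predicates are plain Java lambdas labelled with descriptive strings.

```java
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.function.Predicate;

class RuleResultDemo {
    // Evaluates every predicate (no short-circuit) so the details list each outcome.
    static String evaluateAnyOf(String ruleName, String consequent, String alternative,
                                Map<String, Predicate<Map<String, String>>> predicates,
                                Map<String, String> properties) {
        boolean result = false;
        StringBuilder details = new StringBuilder("ANY_OF { ");
        for (Map.Entry<String, Predicate<Map<String, String>>> entry : predicates.entrySet()) {
            boolean outcome = entry.getValue().test(properties);
            result = result || outcome;
            details.append(entry.getKey()).append(" => ").append(outcome).append("; ");
        }
        details.append("}");
        return ruleName + ": " + result + ", "
            + (result ? consequent : alternative) + ", " + details;
    }

    public static void main(String[] args) {
        Map<String, String> properties = new HashMap<>();
        properties.put("valid", "Undetermined");
        properties.put("wellFormed", "True");

        // LinkedHashMap keeps the predicates in the order they were configured.
        Map<String, Predicate<Map<String, String>>> predicates = new LinkedHashMap<>();
        predicates.put("valid == True", p -> "True".equals(p.get("valid")));
        predicates.put("valid == Undetermined && wellFormed == True",
            p -> "Undetermined".equals(p.get("valid")) && "True".equals(p.get("wellFormed")));

        System.out.println(evaluateAnyOf("XmlAcceptableRule", "Acceptable", "Not Acceptable",
            predicates, properties));
    }
}
```

Evaluating every predicate even after one has matched costs a little time but keeps the details complete, which is what makes a failed assessment explainable.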
Assessment Output Example
Module {AssessmentModule}:
  AssessmentResultSets:
    AssessmentResultSet:
      RuleSetName: XmlRuleSet
      RuleSetDescription: Ruleset for XML module
      ObjectFilter: org.jhove2.module.format.xml.XmlModule
      BooleanResult: false
      AssessmentResults:
        AssessmentResult:
          RuleName: XmlStandaloneRule
          RuleDescription: Does XML Declaration specify standalone status?
          BooleanResult: false
          NarrativeResult: Is Not Standalone
          AssessmentDetails: ALL_OF { xmlDeclaration.standalone == "yes" => false; }
        AssessmentResult:
          RuleName: XmlAcceptableRule
          RuleDescription: Is the XML status acceptable?
          BooleanResult: true
          NarrativeResult: Acceptable
          AssessmentDetails: ANY_OF { valid.name() == "True" => true; (valid.name() == "Undetermined") && (wellFormed.name() == "True") => false; }
Actionable Outcomes?
– Assessment outcome is informational data
– Surrounding workflows may use assessment results to guide their control mechanisms
– JHOVE2 provides an API, but does not initiate actions
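Since acting on the results is left to the surrounding workflow, the decision logic lives outside JHOVE2. Here is a hedged sketch of what an ingest step might do with per-rule boolean results. IngestRouterDemo, its Action values and the "every rule must pass" policy are all assumptions made for illustration.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

class IngestRouterDemo {
    enum Action { ACCEPT, QUARANTINE }

    // Collects the names of rules whose boolean result was not true.
    static List<String> failedRules(Map<String, Boolean> results) {
        List<String> failed = new ArrayList<>();
        for (Map.Entry<String, Boolean> entry : results.entrySet()) {
            if (!Boolean.TRUE.equals(entry.getValue())) {
                failed.add(entry.getKey());
            }
        }
        return failed;
    }

    // Assumed local policy: accept only when every rule passed.
    static Action route(Map<String, Boolean> results) {
        return failedRules(results).isEmpty() ? Action.ACCEPT : Action.QUARANTINE;
    }

    public static void main(String[] args) {
        Map<String, Boolean> results = new LinkedHashMap<>();
        results.put("XmlStandaloneRule", false);
        results.put("XmlAcceptableRule", true);
        System.out.println(route(results) + " " + failedRules(results));
    }
}
```

A different institution could route on a single rule or on the narrative result instead; the point is only that the policy decision is made by the caller.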
Assessment Enhancements
• Assessment config file editing
– Make it easier for a non-programmer to edit
– Editing should be bullet-proofed if possible
• GUI user interface
– Presents a GUI treeview that lists reportable properties in a navigable hierarchy
• Sanity checking
– Pre-test config files to ensure compatibility
  • Command-line invocation of the sanity checker
  • Run check whenever installed modules have been changed
– Also have robust reporting in case a property is missing
JHOVE2 Community
Wiki
– http://jhove2.org/
– https://bitbucket.org/jhove2/main/wiki/Modules
Mailing lists
– [email protected]
– [email protected]