Superset Me—Not: Why the JPTS Is Sufficient if You Use Appropriate Layer Validation

Alexander (“Sasha”) Schwarzman
American Geophysical Union (AGU)

JATS-Con, November 2, 2010

Summary

We built a superset of the NLM Journal Publishing Tag Set (JPTS) in order to enforce business rules, data types, and house style. Having done that, we realized that a JPTS subset could have been sufficient to meet AGU's needs if used in conjunction with appropriate layer validation technology, such as Schematron.

Contents

• Why we built a JPTS superset
• DTD vs. Schematron
  – Attribute values
  – Number of element occurrences
  – Element position & sequence
  – References
• Lessons learned

Why we built a JPTS superset

• No generic book model
• Lack of familiarity with Schematron
• Lack of mature tool support (running SVRL not a viable option in Production environment)
• Lack of expertise on integrating Schematron with validation against relational DB
• JATS v2.3: no Compound Keywords, not all content models parameterized

DTD vs. Schematron: Attribute values

Requirement: Article type is required and can be one of three types: a regular article (rga), a correction (cor), or an editorial (edt)

Strict DTD

<!ATTLIST article article-type (rga | cor | edt) #REQUIRED >

 JPTS

<!ATTLIST article article-type CDATA #IMPLIED >

DTD vs. Schematron: Attribute values (cont’d)

XML instance (contains non-allowed article type)

<article article-type='xxx'/>

Schematron

<rule context="article">
  <assert test="@article-type=('rga','cor','edt')">@article-type '<value-of select='@article-type'/>' not allowed, must be 'rga', 'cor', or 'edt'</assert>
</rule>

Schematron message

@article-type 'xxx' not allowed, must be 'rga', 'cor', or 'edt'
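
The logic of this rule can also be tried out without a Schematron engine. The following is a minimal Python sketch using only the standard library; it illustrates the check itself, not how a Schematron processor works, and the function name `check_article_type` is hypothetical.

```python
import xml.etree.ElementTree as ET

# The three article types named in the requirement.
ALLOWED_ARTICLE_TYPES = ("rga", "cor", "edt")


def check_article_type(xml_text):
    """Return error messages for a missing or disallowed article-type."""
    root = ET.fromstring(xml_text)
    errors = []
    # iter() yields the root element itself as well as its descendants.
    for article in root.iter("article"):
        # A missing attribute is treated as an empty, disallowed value.
        value = article.get("article-type", "")
        if value not in ALLOWED_ARTICLE_TYPES:
            errors.append(
                "@article-type '%s' not allowed, must be 'rga', 'cor', or 'edt'"
                % value
            )
    return errors


print(check_article_type("<article article-type='xxx'/>"))
print(check_article_type("<article article-type='cor'/>"))
```

As with the Schematron assertion, a missing attribute fails the check, so the rule covers both the "required" and the "one of three values" parts of the requirement.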

DTD vs. Schematron: Number of element occurrences

Requirement: Acknowledgments, if present, must contain exactly one paragraph, except for two journals (journal code ‘ja’ and ‘rg’) where Acknowledgments must contain two paragraphs

Strict DTD

<!ELEMENT ack (p, p?) >

JPTS

<!ELEMENT ack (p*) >

DTD vs. Schematron: Number of occurrences (cont’d)

XML instance (wrong number of paragraphs)

<article>
  ...
  <journal-id>jb</journal-id>
  ...
  <ack>
    <p>Blah</p>
    <p>Blah-blah</p>
  </ack>
</article>

DTD vs. Schematron: Number of occurrences (cont’d)

Schematron

<rule context="ack[ancestor::*/journal-id=('ja','rg')]">
  <assert test="count(p) eq 2">'<name/>' in '<value-of select="ancestor::*/journal-id"/>' must contain exactly two paragraphs</assert>
</rule>

<rule context="ack">
  <assert test="count(p) eq 1">'<name/>' in '<value-of select="ancestor::*/journal-id"/>' must contain only one paragraph</assert>
</rule>

DTD vs. Schematron: Number of occurrences (cont’d)

Schematron message

'ack' in 'jb' must contain only one paragraph
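
The journal-dependent count can be sketched in Python as well. This is a rough illustration under simplifying assumptions: the function name `check_ack` is hypothetical, and `journal-id` is looked up anywhere in the document rather than through the `ancestor` axis.

```python
import xml.etree.ElementTree as ET

# Journals whose Acknowledgments must contain two paragraphs.
TWO_PARAGRAPH_JOURNALS = ("ja", "rg")


def check_ack(xml_text):
    """Return error messages for ack elements with the wrong number of p children."""
    root = ET.fromstring(xml_text)
    # Simplification: take the first journal-id found anywhere below the root.
    journal = root.findtext(".//journal-id", default="")
    errors = []
    for ack in root.iter("ack"):
        count = len(ack.findall("p"))  # direct p children only, like count(p)
        if journal in TWO_PARAGRAPH_JOURNALS:
            if count != 2:
                errors.append(
                    "'ack' in '%s' must contain exactly two paragraphs" % journal
                )
        elif count != 1:
            errors.append(
                "'ack' in '%s' must contain only one paragraph" % journal
            )
    return errors


SAMPLE = (
    "<article><journal-id>jb</journal-id>"
    "<ack><p>Blah</p><p>Blah-blah</p></ack></article>"
)
print(check_ack(SAMPLE))
```

The `if`/`elif` mirrors the ordering of the two Schematron rules: within a pattern, a node is handled only by the first rule whose context matches it, which is why the journal-specific rule has to come before the general `ack` rule.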

DTD vs. Schematron: Element position & sequence

Requirement: If a journal has subject grouping (ToC category, subset) and the article belongs to a special collection (special section, theme), then the subject grouping information must precede the special collection information

Strict DTD

<!ELEMENT article-categories (subject-group*, special-collection?) >

JPTS

<!ELEMENT article-categories (subj-group*) >

DTD vs. Schematron: Element position & sequence (cont’d)

XML instance (wrong sequence of subject groups)

<article-categories>
  <subj-group subj-group-type="special-section">
    <subject content-type="EARLYWARN1">New Methods and Applications of Earthquake Early Warning</subject>
  </subj-group>
  <subj-group subj-group-type="toc-category">
    <subject content-type="SDE">Solid Earth</subject>
  </subj-group>
</article-categories>

DTD vs. Schematron: Element position & sequence (cont’d)

Schematron

<rule context="article-categories/subj-group[@subj-group-type=('special-section','theme')]">
  <assert test="not(following-sibling::subj-group[@subj-group-type=('toc-category','subset')])"><name/>/@subj-group-type='<value-of select='@subj-group-type'/>' must appear after a ToC Category or a Subset when either is present</assert>
</rule>

Schematron message

subj-group/@subj-group-type='special-section' must appear after a ToC Category or a Subset when either is present
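
The ordering rule amounts to "no subject grouping may follow a special collection". A minimal Python sketch of that test follows, using only the standard library; the function name `check_category_order` is hypothetical and the code is an illustration of the logic, not a replacement for the Schematron rule.

```python
import xml.etree.ElementTree as ET

COLLECTION_TYPES = ("special-section", "theme")  # special collection info
GROUPING_TYPES = ("toc-category", "subset")      # subject grouping info


def check_category_order(xml_text):
    """Flag special-collection groups that are followed by a subject grouping."""
    root = ET.fromstring(xml_text)
    errors = []
    for categories in root.iter("article-categories"):
        groups = categories.findall("subj-group")  # children, in document order
        for i, group in enumerate(groups):
            group_type = group.get("subj-group-type")
            if group_type not in COLLECTION_TYPES:
                continue
            # Equivalent of following-sibling::subj-group[...]
            later_types = [g.get("subj-group-type") for g in groups[i + 1:]]
            if any(t in GROUPING_TYPES for t in later_types):
                errors.append(
                    "subj-group/@subj-group-type='%s' must appear after a "
                    "ToC Category or a Subset when either is present" % group_type
                )
    return errors


WRONG_ORDER = (
    "<article-categories>"
    "<subj-group subj-group-type='special-section'><subject>A</subject></subj-group>"
    "<subj-group subj-group-type='toc-category'><subject>B</subject></subj-group>"
    "</article-categories>"
)
print(check_category_order(WRONG_ORDER))
```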

DTD vs. Schematron: References

Validating references is a challenge:
• Variety vs. the need to enforce editorial style

Strict DTD:
• Fixed element order, no mixed content
• Punctuation, spacing, face markup generated on output

JPTS:
• Lots of elements, any order, mixed content
• Punctuation, spacing, face markup included

DTD vs. Schematron: References (cont’d)

Strict DTD

<!ELEMENT book-standalone-citation
  ((person-group | string-name), year, source, edition?,
   (person-group | string-name)?, size?, elocation-id?,
   publisher-name, publisher-loc) >
<!ATTLIST book-standalone-citation
  id ID #REQUIRED >

DTD vs. Schematron: References (cont’d)

JPTS

<!ELEMENT mixed-citation (#PCDATA | person-group | string-name | year | source | edition | size | elocation-id | publisher-name | publisher-loc | ... | ...)* >

<!ATTLIST mixed-citation id ID #IMPLIED publication-type CDATA #IMPLIED >

DTD vs. Schematron: References (cont’d)

Example:

Mood, A. M., and F. A. Graybill (1963), Introduction to the Theory of Statistics, 2nd ed., 295 pp., McGraw-Hill, New York.

DTD vs. Schematron: References (cont’d)

XML instance (strict DTD)

<book-standalone-citation id="mood63">
  <person-group person-group-type="author">
    <name><surname>Mood</surname> <given-names>A. M.</given-names></name>
    <name><surname>Graybill</surname> <given-names>F. A.</given-names></name>
  </person-group>
  <year>1963</year>
  <source>Introduction to the Theory of Statistics</source>
  <edition>2nd</edition>
  <size units="page">295 pp</size>
  <publisher-name>McGraw-Hill</publisher-name>
  <publisher-loc>New York</publisher-loc>
</book-standalone-citation>

DTD vs. Schematron: References (cont’d)

XML instance (JPTS)

<mixed-citation publication-type="book-standalone">
  <string-name><surname>Mood</surname>, <given-names>A. M.</given-names></string-name>, and
  <string-name><given-names>F. A.</given-names> <surname>Graybill</surname></string-name>
  (<year>1963</year>),
  <source><italic>Introduction to the Theory of Statistics</italic></source>,
  <edition>2</edition>nd ed.,
  <size units="page">295</size> pp.,
  <publisher-name>McGraw-Hill</publisher-name>,
  <publisher-loc>New York</publisher-loc>.
</mixed-citation>

DTD vs. Schematron: References (cont’d)

Schematron can check that all required elements are present and are in the correct sequence (note the required elements and that edition, if present, follows source):

<!ELEMENT book-standalone-citation
  ((person-group | string-name), year, source, edition?,
   (person-group | string-name)?, size?, elocation-id?,
   publisher-name, publisher-loc) >

DTD vs. Schematron: References (cont’d)

• Schematron can check that all required elements are present:

<rule context="mixed-citation[@publication-type='book-standalone']">
  <assert test="(person-group | string-name) and year and source and publisher-name and publisher-loc">required element missing</assert>
</rule>

• And that the elements are in the correct sequence:

DTD vs. Schematron: References (cont’d)

XML instance (JPTS) (edition is in the wrong place)

<mixed-citation publication-type="book-standalone">
  <string-name><surname>Mood</surname>, <given-names>A. M.</given-names></string-name>, and
  <string-name><given-names>F. A.</given-names> <surname>Graybill</surname></string-name>
  (<year>1963</year>),
  <edition>2</edition>nd ed.,
  <source><italic>Introduction to the Theory …</italic></source>,
  <size units="page">295</size> pp.,
  <publisher-name>McGraw-Hill</publisher-name>,
  <publisher-loc>New York</publisher-loc>.
</mixed-citation>

DTD vs. Schematron: References (cont’d)

This Schematron uses the positional predicate [1] to check that year is immediately followed by source:

<rule context="mixed-citation[@publication-type='book-standalone']/year">
  <assert test="following-sibling::*[1]/self::source">'<name/>' must be immediately followed by 'source', not by '<value-of select='name(following-sibling::*[1])'/>'</assert>
</rule>

Schematron message

'year' must be immediately followed by 'source', not by 'edition'

DTD vs. Schematron: References (cont’d)

But how to check the sequence of required elements when there might be optional elements interspersed between them?

This Schematron checks that the required publisher-name is preceded by the required source, regardless of any optional elements that may occur in between:

<rule context="mixed-citation[@publication-type='book-standalone']/publisher-name">
  <assert test="preceding-sibling::source">'<name/>' must be preceded by 'source'</assert>
</rule>

DTD vs. Schematron: References (cont’d)

• Rick Jelliffe’s approach combines the flexibility of JPTS with the benefits of a DTD-like fixed element order:
  – Each citation is rewritten as a string of its child element names
  – Content model represented as a regular expression
  – Schematron checks the string of names against the regex
  – Schematron generates an error message if the content does not match the model

DTD vs. Schematron: References (cont’d)

An XML file, e.g., citation-models.xml, specifies structured citation models:

...
<model publication-type="book-standalone">
  ((string-name | person-group), year, source, edition?,
   (string-name | person-group)?, size?, elocation-id?,
   publisher-name, publisher-loc)
</model>
...
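
The regex-based approach can be sketched outside Schematron to show the idea. The following Python example is a minimal illustration under stated assumptions, not Rick Jelliffe's implementation: the helper names `model_to_regex` and `check_citation` are hypothetical, the model uses `edition?` as in the strict DTD, and the two sample citations are shortened, made-up instances with a single author, because the model as written admits only one `string-name` or `person-group` at the start.

```python
import re
import xml.etree.ElementTree as ET

# Content model in DTD-like syntax, as it might appear in citation-models.xml.
MODEL = """((string-name | person-group), year, source, edition?,
            (string-name | person-group)?, size?, elocation-id?,
            publisher-name, publisher-loc)"""


def model_to_regex(model):
    """Translate a DTD-like content model into a regex over space-terminated names."""
    parts = []
    for token in re.findall(r"[A-Za-z][\w.-]*|[()|?*+,]", model):
        if token == "(":
            parts.append("(?:")        # non-capturing group
        elif token == ",":
            continue                   # a sequence is plain concatenation
        elif token in ")|?*+":
            parts.append(token)        # same meaning in DTD and regex syntax
        else:
            parts.append("(?:%s )" % re.escape(token))  # element name + separator
    return re.compile("".join(parts))


def check_citation(citation, pattern):
    """Return None if the child element sequence matches, else an error message."""
    # Rewrite the citation as a string of its child element names.
    names = "".join(child.tag + " " for child in citation)
    if pattern.fullmatch(names):
        return None
    return "content '%s' does not match the model" % names.strip()


PATTERN = model_to_regex(MODEL)

GOOD = ET.fromstring(
    '<mixed-citation publication-type="book-standalone">'
    "<string-name>Mood, A. M.</string-name> (<year>1963</year>), "
    "<source>Introduction to the Theory of Statistics</source>, "
    "<edition>2</edition>nd ed., <publisher-name>McGraw-Hill</publisher-name>, "
    "<publisher-loc>New York</publisher-loc>.</mixed-citation>"
)
BAD = ET.fromstring(
    '<mixed-citation publication-type="book-standalone">'
    "<string-name>Mood, A. M.</string-name> (<year>1963</year>), "
    "<edition>2</edition>nd ed., "
    "<source>Introduction to the Theory of Statistics</source>, "
    "<publisher-name>McGraw-Hill</publisher-name>, "
    "<publisher-loc>New York</publisher-loc>.</mixed-citation>"
)

print(check_citation(GOOD, PATTERN))
print(check_citation(BAD, PATTERN))
```

Text and punctuation between the elements are ignored, so mixed content stays permitted while the element order is still held to the model.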

DTD vs. Schematron: References (cont’d)

• Advantages:
  – Document is still DTD-valid
  – Mixed content is permitted
  – Type-sensitive handling of references is possible
• Caveat: requires XSLT 2.0!

Lessons learned

• AGU Tag Set + Schematron (200+ checks)
  – Ensures data quality
  – Ensures markup integrity
  – Provides control over production processes
• AGU Tag Set is a superset of JPTS
  – Based on JPTS
  – Uses the same modularization principles
  – Can be easily mapped to JPTS
• Were we to do this again, we would develop a JPTS subset and a Schematron schema

Lessons learned (cont’d)

• Appropriate layer validation
  – Even the most “Prussian” DTD can’t enforce all business rules, data types, and house style
  – Rules-based checking needed anyway
  – May as well use “Californian” JPTS (de facto industry standard) adopted by publishers, conversion & composition vendors, archives, etc.
• Paradigm shift: the crux of validation shifts from XML parser to Schematron engine

Lessons learned (cont’d)

• This shift is not without costs:
  – Content may be valid to JPTS but make no sense
  – Dependency on Schematron for semantic integrity
  – Constraints on business partners: must be Schematron-capable and have tools
  – Schematron does not “fix” problems; people do. Processes and procedures must be well-defined

Lessons learned (cont’d)

• Writing a simple Schematron is easy; building a complex and efficient one is not:
  – Elicit, document, convey, and clarify the requirements
  – Ensure Schematron fits into your workflow
  – Modularize Schematron
  – Ensure that individual Schematron rules aren’t in conflict
  – Optimize Schematron performance
  – Employ XSLT 2.0
  – Test, test, test
  – Cultivate Schematron & XSLT 2.0 expertise in-house

Conclusion

• What about content that is not like a journal article, e.g., generic (non-NCBI) books and their parts/chapters?
• When this deficiency is addressed, the NLM Archiving and Interchange Tag Suite could truly say:

“Superset Me—Not!”
