
XML and Content Management Lecture 8: Case study : Dictionary & encyclopedia

XML- dictionary & encyclopedia

Case study

Mirosław Prywata

2012/11/19


Agenda

• Introduction

• Case 1: medical dictionary & translation

– Input

– Process & tools

• Case 2: Great Encyclopedia

– SGML vs. XML

– Process & tools


MEDICAL DICTIONARY


What we do

• English medical dictionary (Stedman/Medipage)

• About 100k entries/subentries

• Text is SGML/XML (no DTD)

• We need to translate all of it

• The English sequence remains (with English headers)


Entry form


Process

Translation -> Verification -> Etymology -> Proofing -> DTP -> Final proofing


Input

• Input format – XML

• No DTD

• Tags description

• Formatting is event based

CODE: <ETYMON>
FUNCTION: Specifies the beginning of an etymon. Change the current font to small italic.

CODE: </ETYMON>
FUNCTION: Specifies the end of an etymon. Change the current font to small lightface.
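The tag description above treats every tag, opening or closing, as an event that sets the current font. A minimal Python sketch of that idea follows. Only the two ETYMON rows come from the description; the starting font and the helper name `format_runs` are assumptions made for illustration.

```python
import re

# Each tag (opening or closing) is a formatting event that sets the font.
# Only the two ETYMON rows are taken from the tag description; a real
# table would list every tag in the dictionary.
FONT_EVENTS = {
    "<ETYMON>": "small italic",
    "</ETYMON>": "small lightface",
}

# Matches simple start and end tags without attributes.
TOKEN = re.compile(r"(</?[A-Za-z][\w-]*>)")


def format_runs(text, initial_font="small lightface"):
    """Split tagged text into (font, text) runs by replaying tag events.

    initial_font is an assumed starting state, not part of the source.
    """
    font = initial_font
    runs = []
    for token in TOKEN.split(text):
        if not token:
            continue
        if TOKEN.fullmatch(token):
            # Unknown tags leave the current font unchanged.
            font = FONT_EVENTS.get(token, font)
        else:
            runs.append((font, token))
    return runs


runs = format_runs("ath <ETYMON>G. athere</ETYMON> end")
```

Because the closing tag sets a font rather than restoring the previous one, the result depends on the order of events and not on the element structure, which is what makes a later element-based stylesheet hard to write.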


Event based formatting

<xsl:when
  test="count(ancestor::VAR) = 0 and
        (((count(ancestor::SYNX) > 0 or count(ancestor::SYXT) > 0)
          and count(ancestor::INSUB) = 0)
         or count(ancestor::SEEAL) > 0 or count(ancestor::REF) > 0
         or count(ancestor::SEE) > 0)"> {\i<xsl:value-of select="."/>}
</xsl:when>
<xsl:otherwise>
  [...]
  <xsl:if
    test="(count(ancestor::*[@FOL='INFLECT']))
          or ((substring(following::text(),1,1)!='.')
          and (count(ancestor::SYNX) = 0
               or generate-id(ancestor::SYNX[1])=
                  generate-id(following::text()[1]/ancestor::SYNX[1]))
          and (count(ancestor::SYXT) = 0
          and [...]
  </xsl:if>
  <xsl:if test="count(ancestor::*[@FOL='INFLECT' or @FOL='CINFLECT']) > 0">'s</xsl:if>
</xsl:otherwise>

INSTEAD of simple element-based templates such as:

<xsl:template match="A"> [....] </xsl:template>
<xsl:template match="A/B"> [....] </xsl:template>
<xsl:template match="A[@attr='abcd']"> [....] </xsl:template>


Translations

• Many translators, no XML awareness

• The choice: MS Word and tables

• Each entry becomes a row in a table

• XML -> RTF -> XML

• Beware of subentries (syndrome has 790 subentries)
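The XML -> RTF -> XML round trip in the list above depends on being able to match each table row back to its entry. The Python sketch below shows one way to flatten an entry and its subentries into rows that carry an identifier. The element names ENT, HEAD, DEF and SUB and the `id` attribute are hypothetical placeholders, not the dictionary's real markup.

```python
import xml.etree.ElementTree as ET

# Hypothetical markup: ENT, HEAD, DEF, SUB and the id attribute are
# placeholders chosen for this sketch.
SAMPLE = """
<ENT id="e1">
  <HEAD>atherectomy</HEAD>
  <DEF>Any removal of an atheroma.</DEF>
  <SUB id="e1.1"><HEAD>coronary a.</HEAD><DEF>Removal via catheter.</DEF></SUB>
  <SUB id="e1.2"><HEAD>directional a.</HEAD><DEF>Removal with a positionable cutter.</DEF></SUB>
</ENT>
""".strip()


def entry_rows(entry):
    """Yield (id, field, English text, empty Polish cell) per table row."""
    # The main entry comes first, then each subentry in document order.
    for unit in [entry, *entry.findall("SUB")]:
        for field in ("HEAD", "DEF"):
            node = unit.find(field)  # direct children only
            if node is not None:
                yield (unit.get("id"), field, (node.text or "").strip(), "")


rows = list(entry_rows(ET.fromstring(SAMPLE)))
```

With the identifier and field name kept in each row, the translated cell can be written back to the right element on import. An entry with hundreds of subentries produces hundreds of rows, so a real export would probably split such entries across several tables.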


RTF example

Source entry:

atherectomy (ath′e-rek′tō-mē) Any removal by surgery or specialized catheterization of an atheroma in the coronary or any other artery. coronary a., instrumental removal, via catheter, of atheromas in coronary arteries. directional a., removal of an atheroma with a catheter whose cutting device can be positioned both rotationally and longitudinally within an artery.

Translation table (English | Polish):

atherectomy | aterektomia

Any removal by surgery or specialized catheterization of an atheroma in the coronary or any other artery. | usunięcie metodą chirurgiczną lub cewnikiem śródnaczyniowym zmiany miażdżycowej w tętnicy wieńcowej lub każdej innej tętnicy

coronary atherectomy | udrożnienie tętnicy wieńcowej

instrumental removal, via catheter, of atheromas in coronary arteries. | usunięcie za pomocą specjalnego cewnika zmian miażdżycowych z tętnic wieńcowych

directional atherectomy | a. kierunkowa

removal of an atheroma with a catheter whose cutting device can be positioned both rotationally and longitudinally within an artery. | usunięcie zmiany miażdżycowej za pomocą cewnika wyposażonego w urządzenie tnące, którym można wykonywać cięcia obrotowe i podłużne wewnątrz tętnicy


Etymology

• Almost every entry contains etymology

• Etymology is translated in parallel with the other translations

• Doctors do not translate etymology

• We use a WYSIWYG XML editor: Authentic (free version)

• You need to buy StyleVision (to design the editing styles)

• All elements other than the etymology are locked for editing


Etymology - example

[Figure: editor view with areas labeled "English etymology" and "Polish etymology"]


Money

• Translators are paid according to the amount of translated text

• We need to report the number of translated characters when importing texts from RTF

• We need to keep records of all translated texts
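The reporting requirement above can be sketched in Python as a count taken at import time. The row layout, the translator names and the rule that whitespace is not counted are all assumptions made for illustration; the slides do not say how characters are counted.

```python
from collections import defaultdict


def count_chars(text):
    """Count characters, ignoring whitespace (an assumed billing rule)."""
    return sum(1 for ch in text if not ch.isspace())


def import_report(rows):
    """Summarize translated text read back from the returned tables.

    rows: iterable of (entry_id, translator, polish_text) tuples; this
    layout is hypothetical. Returns per-translator totals plus a
    per-entry log that can be stored as the record of translated texts.
    """
    totals = defaultdict(int)
    log = []
    for entry_id, translator, polish_text in rows:
        n = count_chars(polish_text)
        totals[translator] += n
        log.append((entry_id, translator, n))
    return dict(totals), log


totals, log = import_report([
    ("e1", "anna", "aterektomia"),
    ("e1.2", "anna", "a. kierunkowa"),
    ("e2", "jan", ""),
])
```

Keeping the per-entry log alongside the totals makes it possible to re-check a payment later or to recount after a row is imported again.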


Proofing

• We need an XML editor

• Element-based formatting

• Each entry should be formatted similarly to the final layout

• We use the same editor (Authentic) but with a different style


Authentic style example

<globaltemplate match="LOR-PL">
  <editorproperties/>
  <properties/>
  <styles/>
  <children>
    <template match="LOR-PL">
      <editorproperties markupmode="small" elementstodisplay="1"
                        elementstofetch="all"/>
      <properties/>
      <styles/>
      <children>
        <content>
          <editorproperties/>
          <properties/>
          <styles color="green" font-size="inherit" font-weight="bold"/>
          <children/>
          <addvalidations/>
          <format/>
        </content>
      </children>
      <addvalidations/>
      <sort/>
    </template>
  </children>
</globaltemplate>


Authentic example


DTP

• We need someone who can use XML in DTP

• Another DTD is created for the Polish translation

• All context-dependent words and characters are inserted (and tagged for future use)


Final result (printed)


Conversion to HTML


XSLT (XML to (X)HTML)

• HTML/XHTML

• Many simple rules

• We use CSS

<xsl:output
  doctype-system="http://www.w3.org/TR/REC-html40/loose.dtd"
  doctype-public="-//W3C//DTD HTML 4.0 Transitional//EN"
  method="html"/>

<xsl:template match="LOR">
  <span class="bold"><xsl:apply-templates/></span>
</xsl:template>

<xsl:template match="LEM/VORL">
  <span class="bold"><xsl:apply-templates/></span>
</xsl:template>
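The two templates above are typical of the "many simple rules": each maps an element, optionally in the context of its parent, to a CSS class. The Python sketch below imitates that lookup with `xml.etree.ElementTree`. It is a stand-in for illustration only, not the XSLT processor used in the project, and the `RULES` table and `to_html` helper are hypothetical names.

```python
import xml.etree.ElementTree as ET
from html import escape

# (parent tag or None, element tag) -> CSS class.
# The two entries mirror the templates match="LOR" and match="LEM/VORL".
RULES = {
    (None, "LOR"): "bold",
    ("LEM", "VORL"): "bold",
}


def to_html(node, parent_tag=None):
    """Render an element as HTML text, wrapping it in a span if a rule applies."""
    inner = escape(node.text or "")
    for child in node:
        # Render the child, then the text that follows it inside this node.
        inner += to_html(child, node.tag) + escape(child.tail or "")
    # A parent-specific rule wins over a rule that applies anywhere.
    css = RULES.get((parent_tag, node.tag)) or RULES.get((None, node.tag))
    return f'<span class="{css}">{inner}</span>' if css else inner


html_out = to_html(ET.fromstring(
    "<ENT><LEM><VORL>a</VORL></LEM><VORL>b</VORL> &amp; <LOR>c</LOR></ENT>"
))
```

Elements without a rule contribute only their content, which corresponds to the built-in XSLT behavior of descending into children when no template matches.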


XSLT – some more complicated rules

<xsl:template match="LOR-PL">
  <xsl:if test="
      not(preceding-sibling::LOR-PL)
      and not(preceding-sibling::SAPEC)
      and not(ancestor::ENT/SPECIAL/CLASS/text()='bio')">
    <xsl:text> </xsl:text>❖ </xsl:if>
  <span class="green normal bold"><xsl:apply-templates/></span>
</xsl:template>

<xsl:template match="MAIN|SU">
  <p>
    <xsl:apply-templates/>
    <xsl:if test="not(descendant::PUSE-PL)">.</xsl:if>
  </p>
</xsl:template>


Mediteka

Source: http://www.mediteka.pl/


GREAT PWN ENCYCLOPEDIA


Great PWN encyclopedia

• Great Encyclopedia

• 140k entries/subentries, 15k illustrations, 700 maps

• 30 volumes + index volume

• 100 editors, ~3k authors

• Complicated production process

• Database for future encyclopedia


SGML

• SGML specification

• DTD + DCL

• Element ≠ tag (tags can be omitted)

• Whitespace: significant or not?

• Mixed content model

• DSSSL ("XSLT + XSL-FO for SGML")

• No editor that implements full SGML


SGML vs XML

SGML                                              XML
Very complicated                                  Simplified SGML
Tag omission                                      Element = tag
Complicated mixed content                         Simplified: (#PCDATA|B1|B2)*
Charset must be defined (DCL)                     Predefined UTF-8
DSSSL (one tool: jade)                            XSLT, XSL-FO (plenty)
Few parsers (SP)                                  Plenty of parsers
No fully conformant editor (apart from Notepad)   Many editors
DTD                                               DTD, XML Schema + many extensions
...                                               ...


DSSSL example

• SGML->HTML

• SGML->RTF

(element STRZ
  (make element gi: "FONT"
    attributes: '(("color" "red"))
    (make sequence (literal "\U-2192 "))))

(element IMIE
  (make element gi: "FONT"
    attributes: '(("color" "blue"))))

(element LISTA (make element gi: "OL"))
(element (LISTA WSTEP) (make element gi: "B"))
(element (LISTA PUNKT) (make element gi: "LI"))

(define ($glowka-hasla$)
  (make sequence
    font-family-name: %body-font-family%
    font-size: (* %base-font-size% 1.2)))

(define ($glowka-tytul$)
  (make sequence
    font-family-name: %body-font-family%
    font-size: (* %base-font-size% 1.2)
    font-weight: 'bold))

(element GLOWKA-R ($glowka-hasla$))
(element GLOWKA-B ($glowka-hasla$))
(element GLOWKA-G ($glowka-hasla$))

(element (GLOWKA-R TYTUL) ($glowka-tytul$))
(element (GLOWKA-G TYTUL) ($glowka-tytul$))
(element (GLOWKA-B TYTUL) ($glowka-tytul$))


Overall process

• Author

• Editor

• Chief editor

• Names transcription (including geographical names)

• Lexicography (combining entries)

• Consultant (consulting board of professors)

• Proofing

• (DTP + proofing) × 3

• Finally -> print


Thank you