XML and Content Management Lecture 8: Case study : Dictionary & encyclopedia
XML- dictionary & encyclopedia
Case study
Mirosław Prywata
2012/11/19
Agenda
• Introduction
• Case 1: medical dictionary & translation
– Input
– Process & tools
• Case 2: Great Encyclopedia
– SGML vs. XML
– Process & tools
MEDICAL DICTIONARY
What we do
• English medical dictionary (Stedman/Medipage)
• About 100k entries/subentries
• Text is SGML/XML (no DTD)
• We need to translate everything
• The English entry order is kept (with English headwords)
Entry form
Process
• Translation
• Verification
• Etymology
• Proofing
• DTP
• Final proofing
Input
• Input format – XML
• No DTD
• Tag descriptions
• Formatting is event-based
CODE: <ETYMON>
FUNCTION: Specifies the beginning of an etymon. Change the current font to small italic.
CODE: </ETYMON>
FUNCTION: Specifies the end of an etymon. Change the current font to small lightface.
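The tag descriptions above can be read as a small state machine: every start or end tag is an event that switches the current font. A minimal Python sketch of that idea follows. Only the ETYMON row of the font table comes from the description; the sample markup, the initial font, and the function name are assumptions made for illustration.

```python
import xml.parsers.expat

# Font switched to on a start tag and on an end tag.
# Only ETYMON is taken from the tag description; extend as needed.
FONT_EVENTS = {
    "ETYMON": ("small italic", "small lightface"),
}

def format_events(xml_text, initial_font="small lightface"):
    """Return (font, text) runs produced by tag events (illustrative helper)."""
    runs = []
    state = {"font": initial_font}  # assumed starting font

    def on_start(name, attrs):
        if name in FONT_EVENTS:
            state["font"] = FONT_EVENTS[name][0]

    def on_end(name):
        # The end tag sets a fixed font; it does not restore the previous one.
        if name in FONT_EVENTS:
            state["font"] = FONT_EVENTS[name][1]

    def on_text(data):
        if data.strip():
            runs.append((state["font"], data))

    parser = xml.parsers.expat.ParserCreate()
    parser.buffer_text = True  # deliver each text node in one piece
    parser.StartElementHandler = on_start
    parser.EndElementHandler = on_end
    parser.CharacterDataHandler = on_text
    parser.Parse(xml_text, True)
    return runs

# Made-up sample entry, not taken from the dictionary.
SAMPLE = "<ENT>headword <ETYMON>etymon text</ETYMON> definition</ENT>"

for font, text in format_events(SAMPLE):
    print(font, "|", text)
```

Because the end tag sets a fixed font instead of restoring the enclosing one, the output depends on the order of events, which is why such input is hard to map onto simple element-based rules.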
Event based formatting
<xsl:when
test="count (ancestor::VAR) = 0 and
(((count(ancestor::SYNX) > 0 or count(ancestor::SYXT) > 0)
and count(ancestor::INSUB) = 0)
or count(ancestor::SEEAL) > 0 or count(ancestor::REF) > 0
or count(ancestor::SEE) > 0)"> {\i<xsl:value-of select="."/>}
</xsl:when>
<xsl:otherwise>
[...]
<xsl:if
test="(count (ancestor::*[@FOL='INFLECT']))
or ((substring(following::text(),1,1)!='.')
and (count (ancestor::SYNX) = 0
or generate-id(ancestor::SYNX[1])=
generate-id(following::text()[1]/ancestor::SYNX[1]))
and (count (ancestor::SYXT) = 0
and [...]
</xsl:if>
<xsl:if test="count(ancestor::*[@FOL='INFLECT' or @FOL='CINFLECT']) > 0">'s</xsl:if>
</xsl:otherwise>
INSTEAD OF
<xsl:template match="A"> [....] </xsl:template>
<xsl:template match="A/B"> [....] </xsl:template>
<xsl:template match="A[@attr='abcd']"> [....] </xsl:template>
Translations
• Many translators, no XML awareness
• The choice: MS Word and tables
• Each entry becomes a row in a table
• XML -> RTF -> XML
• Beware of subentries (syndrome has 790 subentries)
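The round trip above (XML -> table -> XML) can be sketched in Python. This is an illustration under stated assumptions, not the tooling actually used: the element names `entry`, `unit`, `en`, `pl` and the `rid` attribute are hypothetical, and plain Python lists stand in for the RTF table rows edited in MS Word.

```python
import xml.etree.ElementTree as ET

def to_rows(root):
    """Flatten entries and subentries into rows: [row id, English, Polish]."""
    rows = []
    for i, unit in enumerate(root.iter("unit")):
        unit.set("rid", str(i))  # key used later to merge the row back
        rows.append([str(i), unit.findtext("en", ""), ""])
    return rows

def merge_rows(root, rows):
    """Write the translated cells back into the XML tree."""
    by_id = {unit.get("rid"): unit for unit in root.iter("unit")}
    for rid, _english, polish in rows:
        if polish:  # untranslated rows are left untouched
            ET.SubElement(by_id[rid], "pl").text = polish

# Hypothetical structure: a subentry is a unit nested in its main entry.
SAMPLE = (
    "<entry><unit><en>atherectomy</en>"
    "<unit><en>coronary atherectomy</en></unit>"
    "</unit></entry>"
)

root = ET.fromstring(SAMPLE)
rows = to_rows(root)
rows[0][2] = "aterektomia"  # the translator fills in the Polish cell
merge_rows(root, rows)
print(ET.tostring(root, encoding="unicode"))
```

Keying rows by an identifier rather than by position keeps the merge safe when an entry such as "syndrome" expands into hundreds of subentry rows.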
RTF example
atherectomy (ath′e-rek′tō-mē) Any removal by surgery or specialized catheterization of an atheroma in the coronary or any other artery. coronary a., instrumental removal, via catheter, of atheromas in coronary arteries. directional a., removal of an atheroma with a catheter whose cutting device can be positioned both rotationally and longitudinally within an artery.
English | Polish
atherectomy | aterektomia
Any removal by surgery or specialized catheterization of an atheroma in the coronary or any other artery. | usunięcie metodą chirurgiczną lub cewnikiem śródnaczyniowym zmiany miażdżycowej w tętnicy wieńcowej lub każdej innej tętnicy
coronary atherectomy | udrożnienie tętnicy wieńcowej
instrumental removal, via catheter, of atheromas in coronary arteries. | usunięcie za pomocą specjalnego cewnika zmian miażdżycowych z tętnic wieńcowych
directional atherectomy | a. kierunkowa
removal of an atheroma with a catheter whose cutting device can be positioned both rotationally and longitudinally within an artery. | usunięcie zmiany miażdżycowej za pomocą cewnika wyposażonego w urządzenie tnące, którym można wykonywać cięcia obrotowe i podłużne wewnątrz tętnicy
Etymology
• Almost every entry contains etymology
• Etymology is translated in parallel with the other translations
• Doctors do not translate etymology
• We use a WYSIWYG XML editor: Authentic (free version)
• You need to buy StyleVision
• All elements other than etymology are locked for editing
Etymology - example
English etymology
Polish etymology
Money
• Translators are paid according to the amount of translated text
• We need to report the number of translated characters when importing texts from RTF
• We need to keep records of all translated texts
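The reporting requirement above can be sketched as a small import hook in Python. The function name, the row format, and the decision to count every character including spaces are assumptions; the slides do not say how characters were counted.

```python
from collections import defaultdict

def report_import(rows, translator, ledger, archive):
    """Record one imported file.

    rows: iterable of (row_id, translated_text) pairs (hypothetical format).
    ledger: running character totals per translator.
    archive: list keeping every translated text for the record.
    Returns the translator's running total.
    """
    for row_id, text in rows:
        if not text:
            continue  # empty cells are not paid for
        ledger[translator] += len(text)  # assumption: spaces count too
        archive.append((translator, row_id, text))
    return ledger[translator]

ledger = defaultdict(int)
archive = []
total = report_import(
    [("1", "aterektomia"), ("2", "")], "translator-a", ledger, archive
)
print(total, len(archive))
```

Counting at import time means the payment report and the archive of translated texts come from the same pass over the data.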
Proofing
• We need an XML editor
• Element-based formatting
• Entries should be formatted similarly to the final layout
• We use the same editor (Authentic) but with a different style
Authentic style example
<globaltemplate match="LOR-PL">
  <editorproperties/>
  <properties/>
  <styles/>
  <children>
    <template match="LOR-PL">
      <editorproperties markupmode="small" elementstodisplay="1"
                        elementstofetch="all"/>
      <properties/>
      <styles/>
      <children>
        <content>
          <editorproperties/>
          <properties/>
          <styles color="green" font-size="inherit" font-weight="bold"/>
          <children/>
          <addvalidations/>
          <format/>
        </content>
      </children>
      <addvalidations/>
      <sort/>
    </template>
  </children>
</globaltemplate>
Authentic example
DTP
• We need someone who can use XML in DTP
• Another DTD is created for the Polish translation
• All context-dependent words and characters are inserted (and tagged for future use)
XML and Content Management Lecture 8: Case study : Dictionary & encyclopedia
Final result (printed)
Conversion to HTML
XSLT (XML to (X)HTML)
• HTML/XHTML
• (many) Simple rules
• We use CSS
<xsl:output
    doctype-system="http://www.w3.org/TR/REC-html40/loose.dtd"
    doctype-public="-//W3C//DTD HTML 4.0 Transitional//EN"
    method="html"/>
<xsl:template match="LOR">
  <span class="bold"><xsl:apply-templates/></span>
</xsl:template>
<xsl:template match="LEM/VORL">
  <span class="bold"><xsl:apply-templates/></span>
</xsl:template>
XSLT – some more complicated rules
<xsl:template match="LOR-PL">
  <xsl:if test="
      not(preceding-sibling::LOR-PL)
      and not(preceding-sibling::SAPEC)
      and not(ancestor::ENT/SPECIAL/CLASS/text()='bio')">
    <xsl:text> </xsl:text>❖ </xsl:if>
  <span class="green normal bold"><xsl:apply-templates/></span>
</xsl:template>
<xsl:template match="MAIN|SU">
  <p>
    <xsl:apply-templates/>
    <xsl:if test="not(descendant::PUSE-PL)">.</xsl:if>
  </p>
</xsl:template>
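To see what the MAIN|SU rule does, here is a rough Python equivalent. It is a simplified sketch: child elements are flattened to plain text instead of being passed to their own templates, and the sample inputs are made up.

```python
import xml.etree.ElementTree as ET
from html import escape

def render_paragraph(elem):
    """Wrap the element text in <p>; add '.' unless a PUSE-PL is inside."""
    body = escape("".join(elem.itertext()))
    # Same condition as not(descendant::PUSE-PL) in the stylesheet.
    has_puse = next(elem.iter("PUSE-PL"), None) is not None
    if not has_puse:
        body += "."
    return "<p>" + body + "</p>"

# Made-up inputs, not taken from the dictionary data.
print(render_paragraph(ET.fromstring("<MAIN>definition text</MAIN>")))
print(render_paragraph(
    ET.fromstring("<SU>definition text <PUSE-PL>usage note.</PUSE-PL></SU>")
))
```

The rule is an example of text that exists only in the output: the final period is generated from context rather than stored in the XML.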
Mediteka
Source: http://www.mediteka.pl/
GREAT PWN ENCYCLOPEDIA
Great PWN encyclopedia
• Great Encyclopedia
• 140k entries/subentries, 15k illustrations, 700 maps
• 30 volumes + index volume
• 100 editors, ~3k authors
• Complicated production process
• Database for future encyclopedia
SGML
• SGML specification
• DTD + DCL
• Element ≠ tag
• Space or not space
• Mixed content model
• DSSSL ("XSLT + XSL-FO for SGML")
• No editor that implements full SGML
SGML vs XML
SGML                                              XML
Very complicated                                  Simplified SGML
Tag omission                                      Element = tag
Complicated mixed content                         Simplified (#PCDATA|B1|B2)*
Charset must be defined (DCL)                     Predefined UTF-8
DSSSL (one tool: jade)                            XSLT, XSL-FO (plenty)
Few parsers (SP)                                  Plenty of parsers
No fully conformant editor (apart from notepad)   Many editors
DTD                                               DTD, XML Schema + many extensions
...                                               ...
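The simplified XML mixed-content model, (#PCDATA|B1|B2)*, means that text and the listed child elements may alternate in any order. A small Python sketch shows how such content looks to a program; the sample element A and its contents are made up, and B1 and B2 are the placeholder names from the content model.

```python
import xml.etree.ElementTree as ET

def mixed_content(elem):
    """List the content of elem in document order.

    Plain text is returned as str; child elements as (tag, text) pairs.
    """
    parts = []
    if elem.text:
        parts.append(elem.text)
    for child in elem:
        parts.append((child.tag, "".join(child.itertext())))
        if child.tail:  # text that follows the child inside elem
            parts.append(child.tail)
    return parts

# Made-up sample conforming to <!ELEMENT A (#PCDATA|B1|B2)*>.
SAMPLE = "<A>plain <B1>first</B1> and <B2>second</B2> text</A>"
print(mixed_content(ET.fromstring(SAMPLE)))
```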
DSSSL example
• SGML -> HTML
(element STRZ (make element gi: "FONT"
  attributes: '(("color" "red"))
  (make sequence (literal "\U-2192 "))
))
(element IMIE (make element gi: "FONT"
  attributes: '(("color" "blue"))
))
(element LISTA (make element gi: "OL"))
(element (LISTA WSTEP) (make element gi: "B"))
(element (LISTA PUNKT) (make element gi: "LI"))
• SGML -> RTF
(define ($glowka-hasla$) (make sequence
  font-family-name: %body-font-family%
  font-size: (* %base-font-size% 1.2)))
(define ($glowka-tytul$) (make sequence
  font-family-name: %body-font-family%
  font-size: (* %base-font-size% 1.2)
  font-weight: 'bold))
(element GLOWKA-R ($glowka-hasla$))
(element GLOWKA-B ($glowka-hasla$))
(element GLOWKA-G ($glowka-hasla$))
(element (GLOWKA-R TYTUL) ($glowka-tytul$))
(element (GLOWKA-G TYTUL) ($glowka-tytul$))
(element (GLOWKA-B TYTUL) ($glowka-tytul$))
Overall process
• Author
• Editor
• Chief editor
• Names transcription (including geographical names)
• Lexicography (combining entries)
• Consultant (Consultant Board of professors)
• Proofing
• (DTP + proofing) × 3
• Finally: print
Thank you