EAGLES/ISLE Workshop, LREC 2000 • Athens, Greece
The XML Framework: Its Implications for Corpus Access and Use
Nancy Ide
Department of Computer Science
Vassar College
Data Architectures and Software Support for Large Corpora: Towards an American National Corpus
XML
• emerging standard for data representation and exchange on the World Wide Web
• powerful tool for data representation and access
• obvious standard for interchange of language resources
  – supports text, speech, video, audio
  – ...and linkage among them!
XML provides more than SGML
• better linkage mechanisms
• XSLT for document access and transformation
• XML schemas
• provision for accessing all or part of multiple DTDs
XML Links
• "stand-off" annotation is the accepted norm for annotated resources
• maintain all or most annotations in separate documents
  – each references appropriate locations in the original data
  – yields a finely linked hypertext format where the links specify a semantic role rather than navigational options
Requirements of the stand-off architecture
• address XML elements
• address characters and chains of characters within those elements
• address elements and characters both within the same document and in other XML documents
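The three requirements above can be sketched in Python rather than in XPointer syntax: a stand-off annotation names a target document, an element inside it, and a span of characters inside that element. The document name, the `Target` record, and the `resolve` helper are illustrative assumptions, not part of XCES or of any XML recommendation.

```python
import xml.etree.ElementTree as ET
from dataclasses import dataclass

# Hypothetical primary documents, keyed by an invented document name.
DOCUMENTS = {
    "primary.xml": (
        "<text><p><s>It was a bright cold day.</s>"
        "<s>The clocks struck.</s></p></text>"
    ),
}

@dataclass
class Target:
    doc: str      # which document the annotation points into
    path: str     # ElementTree path to the addressed element
    start: int    # 1-based index of the first addressed character
    length: int   # number of characters addressed

def resolve(target: Target) -> str:
    """Return the characters that a stand-off annotation addresses."""
    root = ET.fromstring(DOCUMENTS[target.doc])
    element = root.find(target.path)
    if element is None:
        raise LookupError(f"no element at {target.path} in {target.doc}")
    text = element.text or ""
    first = target.start - 1          # convert to a 0-based offset
    return text[first:first + target.length]

# An annotation kept apart from the text it describes.
annotation = {"ctag": "PPER3", "target": Target("primary.xml", "./p/s[1]", 1, 2)}
print(resolve(annotation["target"]))
```

Keeping the target as data, separate from the annotation value, is what lets several annotation documents point into the same unmodified primary text.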
XML Path Language (XPath)
• concise notation for element localization in the document tree
  – /div/p[2]/s[3] - third sentence of the second paragraph of the root <div>
  – /descendant::p - all <p> elements
• string functions for accessing characters within elements
  – substring(/p/s[2]/text(),10,12) - 12 characters starting at the 10th
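A rough Python illustration of these expressions, assuming the standard-library `xml.etree.ElementTree` module, which implements only a subset of XPath and evaluates paths relative to an element rather than from the document root. The sample document and the `xpath_substring` helper are invented for the example; the helper only mimics XPath `substring()` for integer arguments.

```python
import xml.etree.ElementTree as ET

# Invented sample: one <div> holding two paragraphs of <s> sentences.
SAMPLE = """<div>
  <p><s>One.</s><s>Two.</s></p>
  <p><s>Alpha.</s><s>Beta.</s><s>It was a bright cold day in April.</s></p>
</div>"""

root = ET.fromstring(SAMPLE)   # root is the <div> element itself

# /div/p[2]/s[3]: relative to the root <div>, this is p[2]/s[3].
third = root.find("./p[2]/s[3]")

# /descendant::p: ElementTree spells the descendant axis as ".//".
paragraphs = root.findall(".//p")

def xpath_substring(text: str, start: int, length: int) -> str:
    """Mimic XPath substring() for integer arguments with start >= 1."""
    first = start - 1              # XPath counts characters from 1
    return text[first:first + length]

print(third.text)
print(len(paragraphs))
print(repr(xpath_substring(third.text, 10, 12)))
```

Note the 1-based indexing: XPath positions and character offsets both start at 1, whereas Python slices start at 0.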
XPointer
• extends XPath syntax to allow:
  – addressing points and ranges as well as nodes
  – locating information by string matching
  – use of addressing expressions in URI references as fragment identifiers
XLink
• uni- or multi-directional links
• can specify how a link is to be activated
  – by hand or automatically by the browser
• can specify what to do with the target fragment
  – replace it or insert it into the source document
Links to External Documents
• None in SGML
• HyTime/TEI invented "doc" attribute
• CES used "doc" with inheritance to avoid repetition of the attribute
  – not supported by SGML processors
• XML: XLink and xml:base attribute
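A small sketch of how an inherited `xml:base` removes the need to repeat the target document on every link, much as the inherited "doc" attribute did in CES. It uses Python's standard library; the sample fragment, the example.org URLs, and the `resolved_links` helper are assumptions made for illustration, and the resolution shown is plain relative-URI joining rather than a full XLink implementation.

```python
import xml.etree.ElementTree as ET
from urllib.parse import urljoin

# ElementTree exposes namespaced attributes in {namespace}name form.
XML_BASE = "{http://www.w3.org/XML/1998/namespace}base"
XLINK_HREF = "{http://www.w3.org/1999/xlink}href"

# Invented annotation fragment; the URLs are placeholders.
SAMPLE = """<chunk xmlns:xlink="http://www.w3.org/1999/xlink"
       xml:base="http://example.org/corpus/">
  <par xlink:href="doc1.xml#p1">
    <s xlink:href="doc1.xml#p1s1"/>
  </par>
  <par xml:base="http://example.org/other/" xlink:href="doc2.xml#p1"/>
</chunk>"""

def resolved_links(element, base=""):
    """Yield (tag, absolute URI) pairs, inheriting the nearest xml:base."""
    # An element's own xml:base, if present, overrides the inherited one.
    base = urljoin(base, element.get(XML_BASE, ""))
    href = element.get(XLINK_HREF)
    if href is not None:
        yield element.tag, urljoin(base, href)
    for child in element:
        yield from resolved_links(child, base)

for tag, uri in resolved_links(ET.fromstring(SAMPLE)):
    print(tag, uri)
```

The base is passed down the recursion, so each link is resolved against the closest enclosing `xml:base` without any repetition in the markup.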
XSLT
• a powerful tree-traversal language
• translate any XML document into another document in any form
  – HTML
  – XML
  – plain text
  – etc.
• has the most to offer for handling annotated resources
XSLT Capabilities
• selection of elements or portions of element content using the XPath syntax
• rearrangement, transformation of extracted information (text content, element names, etc.) in the target document
• addition of information to the target document
A Simple Example
<?xml version="1.0"?>
<chunk type="BODY" lang="en"
       xml:base="http://www.cs.vassar.edu/~ME/Oen.xcesDoc#">
  <par xlink:href="xptr(substring(//p[1]))">
    <s xlink:href="xptr(substring(//p/s[1]))">
      <tok type="WORD"
           xlink:href="xptr(substring(//p/s[1]/text(),1,2))">
        <orth>It</orth>
        <!-- <disamb>: the analysis selected for this token in context -->
        <disamb>
          <base>it</base>
          <msd>Pp3ns</msd>
          <ctag>PPER3</ctag></disamb>
        <!-- <lex>: a candidate lexical analysis for the token -->
        <lex>
          <base>it</base>
          <msd>Pp3ns</msd>
          <ctag>PPER3</ctag></lex></tok>...
xcesAna document
<xsl:stylesheet version="1.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <!-- Wrap the whole output in a minimal HTML page -->
  <xsl:template match="/">
    <html> <body> <xsl:apply-templates/> </body> </html>
  </xsl:template>
  <!-- For every token in a paragraph, emit orth|base|ctag -->
  <xsl:template match="//par">
    <xsl:for-each select=".//tok">
      <xsl:value-of select="orth"/>
      <xsl:text>|</xsl:text>
      <xsl:value-of select="disamb/base"/>
      <xsl:text>|</xsl:text>
      <xsl:value-of select="disamb/ctag"/>
      <!-- a space keeps consecutive tokens apart in the output -->
      <xsl:text> </xsl:text>
    </xsl:for-each>
  </xsl:template>
</xsl:stylesheet>
XSLT creates HTML
XSLT document
Result
It|it|PPER3 was|be|PAST3 a|a|DINT bright|bright|ADJE cold|cold|ADJE day|day|NN…
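For readers without an XSLT processor at hand, the same extraction can be approximated with Python's standard library. This is a procedural stand-in for the transformation, not XSLT itself. The cut-down sample keeps only the elements the output needs, the values for the second token are taken from the result line, and the `token_lines` helper is a hypothetical name.

```python
import xml.etree.ElementTree as ET

# Cut-down annotation document: only <orth>, <base> and <ctag> are kept.
SAMPLE = """<chunk>
  <par>
    <s>
      <tok><orth>It</orth><disamb><base>it</base><ctag>PPER3</ctag></disamb></tok>
      <tok><orth>was</orth><disamb><base>be</base><ctag>PAST3</ctag></disamb></tok>
    </s>
  </par>
</chunk>"""

def token_lines(root):
    """Return one 'orth|base|ctag' string per <tok>, in document order."""
    lines = []
    for tok in root.iter("tok"):
        fields = (
            tok.findtext("orth", default=""),
            tok.findtext("disamb/base", default=""),
            tok.findtext("disamb/ctag", default=""),
        )
        lines.append("|".join(fields))
    return lines

print(" ".join(token_lines(ET.fromstring(SAMPLE))))
```

The three `findtext` calls correspond to the three `xsl:value-of` selections, and the loop over `root.iter("tok")` plays the part of `xsl:for-each`.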
Possibilities
• create new documents containing selected annotations
• transduce XML encoded documents to tool-internal formats
• generate a new document with all phonemes that appear in a certain context (or all the unique contexts of a certain phoneme), etc.
XML Schemas
• constrain and document the meaning, usage and relationships of the constituent parts of XML documents
  – datatypes
  – elements and their content
  – attributes and their values
• provide default values for attributes and elements
Impact for language resources
• provide means to define an abstract data model for a class of documents
  – e.g., data model for annotations and annotated objects
  – one of the most important tasks for corpus and tool creators
• provide for much tighter validation of document form and content
Capabilities
• different attribute declarations and/or content models can apply to elements with the same name in different contexts
  – allows for more tightly constrained content models than possible with DTDs
  – e.g., <name> in header and <name> in text likely have different content constraints
• define equivalence classes for groups of elements and/or attributes
  – may be used in the same ways as defined for a particular named element
• CES used parameter entities to make a class of phrase-level objects (for example)
  – a "kludge"
• constrain attribute or element values (or combinations) to be unique, e.g.,
  – only one entry in a computational lexicon can be defined with a given word form
  – only one paragraph can have an attribute indicating that it is the 23rd
  – only one disambiguated form is given for each token
  – only one correspondence for a given item in an alignment document
Useful for error detection and prevention
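The first of these uniqueness constraints can be sketched as a procedural check in Python. A schema would state the rule declaratively (XML Schema offers identity constraints such as `xs:unique` and `xs:key` for this); the code below only shows what a validator would then have to detect. The lexicon markup, the `form` attribute, and the `duplicate_values` helper are invented for the example.

```python
import xml.etree.ElementTree as ET
from collections import Counter

# Invented lexicon; element and attribute names are assumptions.
LEXICON = """<lexicon>
  <entry form="day"/>
  <entry form="cold"/>
  <entry form="day"/>
</lexicon>"""

def duplicate_values(root, path, attribute):
    """Return attribute values carried by more than one matching element."""
    counts = Counter(el.get(attribute) for el in root.findall(path))
    return sorted(
        value for value, n in counts.items()
        if value is not None and n > 1
    )

# Any value listed here violates "one entry per word form".
print(duplicate_values(ET.fromstring(LEXICON), "./entry", "form"))
```

The same helper covers the other cases on the slide by changing the path and attribute, for example paragraph numbers or alignment identifiers.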
• establish dependencies based on element or attribute values, for example:
  – prevent nouns from being assigned a tense
  – specify that tokens with type attribute value PUNCT include only <orth> elements containing specific characters
  – specify annotation labels elsewhere, constrain element content to these values only
    • e.g., constrain the values of the <msd> element in an XCES annotation document to the EAGLES morpho-syntactic specifications
Another means for error control and validation
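The last case, restricting `<msd>` content to a label set specified elsewhere, amounts to the following check, sketched here in Python. The allowed set is a placeholder holding only the one label that appears in the example earlier in this talk; a real set would be drawn from the EAGLES specifications. The second token's value is deliberately invalid, and `invalid_msd` is a hypothetical helper name.

```python
import xml.etree.ElementTree as ET

# Placeholder label set; not the actual EAGLES inventory.
ALLOWED_MSD = {"Pp3ns"}

# Invented annotation fragment; "Zz9" is deliberately not an allowed label.
ANNOTATION = """<par>
  <tok><orth>It</orth><disamb><base>it</base><msd>Pp3ns</msd></disamb></tok>
  <tok><orth>day</orth><disamb><base>day</base><msd>Zz9</msd></disamb></tok>
</par>"""

def invalid_msd(root, allowed):
    """Return (orth, msd) pairs whose <msd> value is outside the allowed set."""
    errors = []
    for tok in root.iter("tok"):
        for msd in tok.iter("msd"):
            if (msd.text or "") not in allowed:
                errors.append((tok.findtext("orth", default=""), msd.text))
    return errors

print(invalid_msd(ET.fromstring(ANNOTATION), ALLOWED_MSD))
```

Holding the label set outside the annotation document means the labels can be revised in one place while every annotation file is checked against the same list.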
Why is XML a good thing?
• search, extraction, and transformation capabilities answer most current and foreseen needs for corpus-based language engineering
• means to fully implement the stand-off data architecture
• processing tools for XML recommendations are freely distributed
  – no need for costly and time-consuming tool development
Conclusion
• XML will allow for
  – representation of multi-lingual, multi-modal resources
  – implementation of the stand-off scheme
  – compatibility with the WWW, enabling
    • exploitation by LE researchers via the web
    • harmonization and combination of LRE resources with other WWW data
  – distributed model for data delivery
P.S....
• A set of XML recommendations for encoding language resources exists:
  – XCES (XML version of the Corpus Encoding Standard, CES)
  – http://www.cs.vassar.edu/XCES
Acknowledgements
• Laurent Romary (LORIA/CNRS)
• Patrice Bonhomme (LORIA/CNRS)