Encoding language corpora: current trends and future directions
Tomaž Erjavec
Department of Knowledge Technologies
Jožef Stefan Institute, Ljubljana, Slovenia
[email protected], http://nl.ijs.si/et/
National Institute for Japanese Language, 2006-09-28
National Institute for Japanese Language, 2006-09-28
Tomaž Erjavec, Dept. of Knowledge Technologies, Jozef Stefan Institute
Overview
1. History and current practices in corpus encoding: TEI P4, CES
2. Open issues: multiple annotations, metadata and analytical tools
3. Future directions: TEI P5, ISO TC 37
I. Some history
- 1980s: corpora (and other language resources) were encoded in idiosyncratic formats, usually bound to specific tools
- corpora were expensive to produce, but:
  – difficult to exchange and reuse
  – quickly became obsolete
- to address these problems, the Text Encoding Initiative was established in 1987
- the initiative came from humanities computing: sponsorship by ACH, ALLC, ACL
Text Encoding Initiative
- TEI is the only systematized attempt to develop a fully general text encoding model and a set of encoding conventions based upon it
- intended for processing and analysis of any type of text, in any language
- main result: the TEI Guidelines for Electronic Text Encoding and Interchange
- SGML was chosen as the underlying standard for the TEI Guidelines
- drafts: TEI P1 (1990), TEI P2 (1993)
TEI P3 and P4
- the third version of the Guidelines, TEI P3 (1994), was published in two substantial green volumes (1200 pp.) and soon also on the Web
- a major revision, TEI P4, was published in 2002
- TEI P4 addresses the following issues:
  – error correction
  – provides equal support for XML and SGML
  – retains backward compatibility with TEI P3
- today, TEI P4 is the most widely used version of TEI: over 130 projects are listed on the TEI web pages
The TEI scheme
- TEI P4 consists of the written guidelines plus a set of DTD fragments
- to obtain a project-specific DTD (TEI parameterisation), the DTD fragments are combined:
  1. core tagset (always present): includes the TEI header
  2. base tagsets (specific text types): e.g. prose, dictionaries, drama
  3. additional tagsets (particular analyses): e.g. dates, certainty, simple linguistic analysis
  4. user extensions, which extend or modify the TEI
- a widely used parameterisation of TEI: TEI Lite
What is good about TEI
- is a “standard”
- offers a rich vocabulary of tags with extensive documentation
- can be extended and modified
- many best-practice scenarios
- software and user community support (tei-c web pages & tei-l mailing list)
- tutorials teaching TEI
What is bad about TEI
- steep learning curve (difficult to start using it)
- TEI is general, so tags are often too generic for the needs of particular projects; also, too deeply nested (tag bloat)
- it is often not clear how to encode a particular phenomenon (more than one possibility exists)
- while TEI is modular, it still allows lots of tags that a project (encoder) has no need for
- never really became accepted in the computational linguistics community
- some areas are missing or not up to date: computational lexicons, terminological databases, complex linguistic annotations
TEI for corpus encoding
- base module: TEI.prose
- additional modules:
  – TEI.corpus: additional tags in the header
  – TEI.analysis: tags for simple analytic mechanisms
  – TEI.linking: tags for linking, segmentation, and alignment
  – TEI.fs: tags for feature structure analysis
Example annotated text
<seg id="orwl.en.24" corresp="orwl.sl.24">
  <s id="Oen.1.1.4.5">
    <c type="open" ctag='"'>"</c>
    <w ana="Af" lemma="big">Big</w>
    <w ana="Ncms" lemma="brother">Brother</w>
    <w ana="Vaip3s" lemma="be">is</w>
    <w ana="Vmpp" lemma="watch">watching</w>
    <w ana="Pp2" lemma="you">you</w>
    <c ctag='"'>"</c>
    <w ana="Dd" lemma="the">the</w>
    <w ana="Ncns" lemma="caption">caption</w>
    <w ana="Vmis" lemma="say">said</w>
    <c ctag=".">.</c>
  </s>
</seg>
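One practical benefit of this word-level encoding is that generic XML tools can read it directly. The following is a minimal Python sketch, not part of the original slides, that extracts (word form, lemma, MSD) triples from the annotated sentence using only the standard library. The function name read_words is hypothetical.

```python
import xml.etree.ElementTree as ET

# The annotated sentence from the slide, unchanged.
SAMPLE = """<seg id="orwl.en.24" corresp="orwl.sl.24">
  <s id="Oen.1.1.4.5">
    <c type="open" ctag='"'>"</c>
    <w ana="Af" lemma="big">Big</w>
    <w ana="Ncms" lemma="brother">Brother</w>
    <w ana="Vaip3s" lemma="be">is</w>
    <w ana="Vmpp" lemma="watch">watching</w>
    <w ana="Pp2" lemma="you">you</w>
    <c ctag='"'>"</c>
    <w ana="Dd" lemma="the">the</w>
    <w ana="Ncns" lemma="caption">caption</w>
    <w ana="Vmis" lemma="say">said</w>
    <c ctag=".">.</c>
  </s>
</seg>"""


def read_words(xml_text):
    """Return (form, lemma, msd) for every <w> element, in document order."""
    root = ET.fromstring(xml_text)
    return [(w.text, w.get("lemma"), w.get("ana")) for w in root.iter("w")]


if __name__ == "__main__":
    for form, lemma, msd in read_words(SAMPLE):
        print(f"{form}\t{lemma}\t{msd}")
```

Punctuation (<c> elements) is skipped here on purpose; a real tool would probably keep it so that the sentence can be reconstructed.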
Example morphosyntactic encoding
In text:
<w ana="Ncfda" lemma="ženska">ženskama</w>
In the MSD specification:
<fsLib>
  <fs type="Noun" id="Ncfda" select="sl" feats="N1.c N2.f N3.d N4.a"/>
  <fs type="Noun" id="Ncfdd" select="sl" feats="N1.c N2.f N3.d N4.d"/>
  <fs type="Noun" id="Ncfdg" select="sl" feats="N1.c N2.f N3.d N4.g"/>
  ...
</fsLib>
<fLib>
  <f id="N1.c" select="en ro sl cs bg et hu hr" name="Type">
    <sym value="common"/>
  </f>
  <f id="N1.p" select="en ro sl cs bg et hu hr" name="Type">
    <sym value="proper"/>
  </f>
  <f id="N2.m" select="en ro sl cs bg hr" name="Gender">
    <sym value="masculine"/>
  </f>
  <f id="N2.f" select="en ro sl cs bg hr" name="Gender">
    <sym value="feminine"/>
  </f>
  <f id="N2.n" select="en ro sl cs bg hr" name="Gender">
    <sym value="neuter"/>
  </f>
  ...
</fLib>
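In the morphosyntactic encoding example, the ana value on a word points to an <fs> entry, whose feats attribute in turn points to <f> definitions. A rough Python sketch of that two-step lookup follows. It is an illustration only: the function names are hypothetical, and the libraries are cut down to the entries visible on the slide, so the features N3.d and N4.a (elided there) are reported as unresolved rather than guessed.

```python
import xml.etree.ElementTree as ET

# Excerpts of the two libraries shown on the slide.
FS_LIB = """<fsLib>
  <fs type="Noun" id="Ncfda" select="sl" feats="N1.c N2.f N3.d N4.a"/>
</fsLib>"""

F_LIB = """<fLib>
  <f id="N1.c" select="en ro sl cs bg et hu hr" name="Type"><sym value="common"/></f>
  <f id="N1.p" select="en ro sl cs bg et hu hr" name="Type"><sym value="proper"/></f>
  <f id="N2.f" select="en ro sl cs bg hr" name="Gender"><sym value="feminine"/></f>
</fLib>"""


def load_features(f_lib_xml):
    """Map each feature id to its (name, value) pair."""
    features = {}
    for f in ET.fromstring(f_lib_xml).iter("f"):
        sym = f.find("sym")
        value = sym.get("value") if sym is not None else None
        features[f.get("id")] = (f.get("name"), value)
    return features


def expand_msd(msd, fs_lib_xml, features):
    """Expand an MSD code into attribute-value pairs.

    Returns (expanded, unresolved): a dict of resolved features and a list
    of feature ids that have no definition in the feature library.
    """
    for fs in ET.fromstring(fs_lib_xml).iter("fs"):
        if fs.get("id") != msd:
            continue
        expanded = {"category": fs.get("type")}
        unresolved = []
        for feature_id in fs.get("feats", "").split():
            if feature_id in features:
                name, value = features[feature_id]
                expanded[name] = value
            else:
                unresolved.append(feature_id)
        return expanded, unresolved
    raise KeyError(f"MSD {msd!r} not found in the feature-structure library")


if __name__ == "__main__":
    print(expand_msd("Ncfda", FS_LIB, load_features(F_LIB)))
```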
CES: the Corpus Encoding Standard
- CES was developed in the scope of EU EAGLES, the Expert Advisory Group on Language Engineering Standards (1996)
- CES is an SGML DTD and is a particular parameterization (and modification) of TEI P3
- XCES (2002) is the XML version of CES
- (X)CES has been used in a number of corpus projects, mainly because it is simpler to use and understand than the full TEI
- however, there is no prescribed way to modify or extend it
- also, it is less strictly maintained than the TEI
II. Open issues
- multiple annotations
- metadata
- corpus analytical tools
Multiple annotations
- more and more linguistic annotation is being added to the data, e.g. sentences, words, punctuation, part-of-speech (morphosyntactic) tags, multi-word units (terms), named entities, syntactic structure, co-reference annotation (anaphora), word-sense information
- also rhetorical structure: quoted speech, paragraphs, lists, …
- even more annotation can be added to multimodal data, e.g. speech signals
- furthermore, the same level of analysis can be marked up by more than one tool / annotator
How to combine these annotations?
- simply have distinct tags & attributes for each of the phenomena covered:
  – easy to understand and hand-edit
  – easy to validate
  – easy to process
- but XML requires a tree structure; what if the tags do not nest properly?
Crossing hierarchies
- simple example, page breaks vs. paragraph boundaries:
  <page> … <p> … </page> … </p>
- a well-known problem for XML encoding, but with multiple annotations it is now becoming more severe
Solutions to crossing hierarchies
Discussed in TEI chapter 14 “Linking, Segmentation, and Alignment”:
- split elements:
  <page broken="yes" id="p1" next="p2">…</page>
  <p> <page broken="yes" id="p2" prev="p1">…</page> </p>
- “milestones”, i.e. empty elements:
  <page/> … <p> … <page/> … </p>
- but somewhat difficult to process and not very general
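To illustrate why milestones are "somewhat difficult to process": the page is no longer an element with content, so a program has to walk the document in order and remember which milestone it saw last. The sketch below does this in Python. It is an assumption-laden illustration: the sample document, the n attribute on <page/>, and the function name text_by_page are all made up for the example.

```python
import xml.etree.ElementTree as ET

# Hypothetical document: the first paragraph crosses a page boundary.
DOC = (
    '<text><page n="1"/>'
    '<p>First paragraph starts on page one <page n="2"/>and ends on page two.</p>'
    '<p>Second paragraph.</p></text>'
)


def text_by_page(xml_text):
    """Group running text by the most recent <page/> milestone."""
    pages = {}
    current = None  # page number of the last milestone seen

    def add(chunk):
        if chunk and chunk.strip():
            pages.setdefault(current, []).append(chunk.strip())

    def walk(element):
        nonlocal current
        if element.tag == "page":
            current = element.get("n")  # a milestone only changes the state
            return
        add(element.text)
        for child in element:
            walk(child)
            add(child.tail)  # text following the child, still inside element

    walk(ET.fromstring(xml_text))
    return {page: " ".join(chunks) for page, chunks in pages.items()}


if __name__ == "__main__":
    for page, text in text_by_page(DOC).items():
        print(page, "->", text)
```

The paragraph hierarchy stays a proper tree, while the page hierarchy has to be rebuilt by code like this each time it is needed.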
Stand-off markup
- the general solution to crossing hierarchies is to keep markup in separate documents that only point into the text (or other markup)
- several specific recommendations and projects:
  – TEI P5 and the TEI Workgroup on Stand-Off Markup, XLink and XPointer
  – Annotation Graphs with AGTK
  – TIGER annotation scheme
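The idea of stand-off markup can be sketched independently of any of the formats listed above. In the following Python illustration each annotation layer is simply a list of (start, end, label) character offsets into an unchanged text. This representation, the layer names and the function names are assumptions made for the example; it is not the TEI, AGTK or TIGER format. Because the layers are independent, spans from different layers may cross freely, and crossings can be detected rather than forbidden.

```python
TEXT = "Big Brother is watching you"

# Independent annotation layers; each entry is (start, end, label).
LAYERS = {
    "phrase": [(0, 11, "NP"), (12, 27, "VP")],
    "line": [(0, 14, "line 1"), (15, 27, "line 2")],
}


def covered_text(span):
    """Return the stretch of TEXT that a span points to."""
    start, end, _label = span
    return TEXT[start:end]


def crosses(a, b):
    """True if two spans overlap without one containing the other."""
    (s1, e1, _), (s2, e2, _) = a, b
    overlap = s1 < e2 and s2 < e1
    nested = (s1 <= s2 and e2 <= e1) or (s2 <= s1 and e1 <= e2)
    return overlap and not nested


def crossing_pairs(layer_a, layer_b):
    """List label pairs from two layers whose spans cross."""
    return [
        (a[2], b[2])
        for a in LAYERS[layer_a]
        for b in LAYERS[layer_b]
        if crosses(a, b)
    ]


if __name__ == "__main__":
    for span in LAYERS["phrase"]:
        print(span[2], "=", repr(covered_text(span)))
    print(crossing_pairs("phrase", "line"))
```

Here the verb phrase "is watching you" starts on the first line and ends on the second, which could not be expressed with two properly nested inline elements.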
Stand-off markup example: TIGER
<s id="s5">
  <graph root="s5_504">
    <terminals>
      <t id="s5_1" word="Die" pos="ART" morph="Def.Fem.Nom.Sg"/>
      <t id="s5_2" word="Tagung" pos="NN" morph="Fem.Nom.Sg.*"/>
      <t id="s5_3" word="hat" pos="VVFIN" morph="3.Sg.Pres.Ind"/>
      <t id="s5_4" word="mehr" pos="PIAT" morph="--"/>
      <t id="s5_5" word="Teilnehmer" pos="NN" morph="Masc.Akk.Pl.*"/>
      <t id="s5_6" word="als" pos="KOKOM" morph="--"/>
      <t id="s5_7" word="je" pos="ADV" morph="--"/>
      <t id="s5_8" word="zuvor" pos="ADV" morph="--"/>
    </terminals>
    <nonterminals>
      <nt id="s5_500" cat="NP">
        <edge label="NK" idref="s5_1"/>
        <edge label="NK" idref="s5_2"/>
      </nt>
      <nt id="s5_501" cat="AVP">
        <edge label="CM" idref="s5_6"/>
        <edge label="MO" idref="s5_7"/>
        <edge label="HD" idref="s5_8"/>
      </nt>
      …
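In the TIGER example the syntactic nodes do not contain the words; they point to them through idref. A small Python sketch of following those pointers is given below. It is only an illustration: the excerpt is cut down to the two non-terminals visible on the slide and closed so that it parses, and the function name constituents is hypothetical. The root node s5_504 is not part of the excerpt and is therefore not resolved.

```python
import xml.etree.ElementTree as ET

# Excerpt from the slide, closed so that it is well-formed XML.
SAMPLE = """<s id="s5">
  <graph root="s5_504">
    <terminals>
      <t id="s5_1" word="Die" pos="ART" morph="Def.Fem.Nom.Sg"/>
      <t id="s5_2" word="Tagung" pos="NN" morph="Fem.Nom.Sg.*"/>
      <t id="s5_3" word="hat" pos="VVFIN" morph="3.Sg.Pres.Ind"/>
      <t id="s5_4" word="mehr" pos="PIAT" morph="--"/>
      <t id="s5_5" word="Teilnehmer" pos="NN" morph="Masc.Akk.Pl.*"/>
      <t id="s5_6" word="als" pos="KOKOM" morph="--"/>
      <t id="s5_7" word="je" pos="ADV" morph="--"/>
      <t id="s5_8" word="zuvor" pos="ADV" morph="--"/>
    </terminals>
    <nonterminals>
      <nt id="s5_500" cat="NP">
        <edge label="NK" idref="s5_1"/>
        <edge label="NK" idref="s5_2"/>
      </nt>
      <nt id="s5_501" cat="AVP">
        <edge label="CM" idref="s5_6"/>
        <edge label="MO" idref="s5_7"/>
        <edge label="HD" idref="s5_8"/>
      </nt>
    </nonterminals>
  </graph>
</s>"""


def constituents(xml_text):
    """Map each non-terminal id to (category, [(edge label, target), ...]).

    Targets that are terminals are replaced by their word form; other
    targets (e.g. nested non-terminals) are left as ids.
    """
    root = ET.fromstring(xml_text)
    words = {t.get("id"): t.get("word") for t in root.iter("t")}
    result = {}
    for nt in root.iter("nt"):
        edges = [
            (edge.get("label"), words.get(edge.get("idref"), edge.get("idref")))
            for edge in nt.findall("edge")
        ]
        result[nt.get("id")] = (nt.get("cat"), edges)
    return result


if __name__ == "__main__":
    for node_id, (cat, edges) in constituents(SAMPLE).items():
        print(node_id, cat, edges)
```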
Problems with stand-off markup
- need tools to link the data: more difficult processing and editing
- no automatic validity checking: consistency, cycles
- difficult to change (correct) primary data or downstream annotations
Metadata
- description of the corpus or corpus elements
- traditional bibliographic standards (MARC)
- but computer corpora also need to be documented along other dimensions: availability, size, markup used, relation of digital file to source text, etc.
- EAD was developed for archives, but has many similarities to corpus description
- a metadata recommendation closely coupled with the data itself is the TEI header
TEI header
<teiHeader> is an obligatory part of every TEI document and consists of:
- <fileDesc>, file description: full bibliographical description of the computer file itself; includes information about the source or sources of the electronic text
- <encodingDesc>, encoding description: describes the relationship between the electronic text and its source: normalization, ambiguity resolution, levels of encoding or analysis, etc.
- <profileDesc>, text profile: classificatory & contextual information, e.g. subject matter. Important for corpora, to perform retrievals from a body of text in terms of text type or origin (taxonomies)
- <revisionDesc>, revision history: history of changes made during the development of the electronic text
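As a small illustration of the four-part TEI header structure, the Python sketch below reports which of <fileDesc>, <encodingDesc>, <profileDesc> and <revisionDesc> are present in a given header. The sample header and the function name header_parts are made up for the example; a real project would rather validate against its TEI DTD or schema.

```python
import xml.etree.ElementTree as ET

# The four header components named on the slide.
HEADER_PARTS = ("fileDesc", "encodingDesc", "profileDesc", "revisionDesc")

# Hypothetical, deliberately incomplete header.
SAMPLE_HEADER = """<teiHeader>
  <fileDesc>
    <titleStmt><title>Sample corpus</title></titleStmt>
  </fileDesc>
  <profileDesc/>
</teiHeader>"""


def header_parts(xml_text):
    """Return {component name: present?} for the first <teiHeader> found."""
    root = ET.fromstring(xml_text)
    header = root if root.tag == "teiHeader" else root.find(".//teiHeader")
    if header is None:
        raise ValueError("no <teiHeader> element found")
    return {part: header.find(part) is not None for part in HEADER_PARTS}


if __name__ == "__main__":
    for part, present in header_parts(SAMPLE_HEADER).items():
        print(f"{part}: {'present' if present else 'missing'}")
```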
TEI header II.
- an example of a TEI header
- very detailed information is possible, but again, there are many ways to express the same information (e.g. free text or structured in elements)
- stricter but poorer alternatives exist: Dublin Core
Dublin Core
- the Dublin Core Metadata Initiative (DCMI) was founded in 1995 with the aim of creating a core set of metadata descriptions for Web-based resources that would be useful for categorizing the Web for easier search and retrieval
- the Dublin Core Metadata Element Set (DCES) defines 15 elements: Title, Creator, Subject, Description, Publisher, Contributor, Date, Type, Format, Identifier, Source, Language, Relation, Coverage, Rights
- can be extended
- DC is used e.g. by the Open Language Archives Community (OLAC)
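Because the element set is small and flat, a Dublin Core description is easy to generate and check mechanically. The following Python sketch builds a simple XML record from a dictionary and rejects any field outside the 15 elements. The wrapper element name metadata, the function name dc_record and the sample values are assumptions for illustration; the namespace is the standard one for the Dublin Core elements.

```python
import xml.etree.ElementTree as ET

DC_NS = "http://purl.org/dc/elements/1.1/"

# The 15 elements of the Dublin Core Metadata Element Set.
DC_ELEMENTS = (
    "title", "creator", "subject", "description", "publisher",
    "contributor", "date", "type", "format", "identifier",
    "source", "language", "relation", "coverage", "rights",
)


def dc_record(fields):
    """Serialize a {element: value} dict as a flat Dublin Core record."""
    unknown = sorted(set(fields) - set(DC_ELEMENTS))
    if unknown:
        raise ValueError(f"not Dublin Core elements: {', '.join(unknown)}")
    ET.register_namespace("dc", DC_NS)
    record = ET.Element("metadata")  # hypothetical wrapper element
    for name in DC_ELEMENTS:  # emit in the canonical order
        if name in fields:
            ET.SubElement(record, f"{{{DC_NS}}}{name}").text = fields[name]
    return ET.tostring(record, encoding="unicode")


if __name__ == "__main__":
    print(dc_record({"title": "Sample corpus", "language": "sl"}))
```

Corpus-specific facts such as size or markup used have no element of their own here, which is the sense in which Dublin Core is stricter but poorer than the TEI header.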
Corpus analytical tools
Currently, many corpus exploration tools exist, and they typically offer:
- search with regular expressions over strings
- sometimes search over (lemma/PoS) annotations
- concordance and word frequency list display of results
- sometimes search and display of parallel corpora
- sometimes basic statistical tests (keywordness, collocation strength)
- examples: WordSmith, MonoConc, IMS CQP, Manatee/Bonito, SARA/Xaira, TIGERSearch
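The two most basic functions in this list, regular-expression search with concordance display and a word frequency list, can be sketched in a few lines of Python. This is a toy illustration and does not reflect how any of the tools named above are implemented; the tokenizer, the function names and the sample text are assumptions.

```python
import re
from collections import Counter

# Made-up sample text for the demonstration.
TEXT = "Big Brother is watching you, the caption said. Big Brother is everywhere."


def tokenize(text):
    """Split into word tokens and single punctuation marks."""
    return re.findall(r"\w+|[^\w\s]", text)


def concordance(tokens, pattern, width=3):
    """Return (left context, keyword, right context) for matching tokens."""
    regex = re.compile(pattern)
    lines = []
    for i, token in enumerate(tokens):
        if regex.fullmatch(token):
            left = " ".join(tokens[max(0, i - width):i])
            right = " ".join(tokens[i + 1:i + 1 + width])
            lines.append((left, token, right))
    return lines


def frequency_list(tokens):
    """Case-insensitive word frequencies, most frequent first."""
    return Counter(t.lower() for t in tokens if t.isalnum()).most_common()


if __name__ == "__main__":
    toks = tokenize(TEXT)
    for left, key, right in concordance(toks, r"watch\w*", width=2):
        print(f"{left:>20} [{key}] {right}")
    print(frequency_list(toks)[:3])
```

Searching over lemma or PoS annotations would work the same way, with the pattern applied to an annotation value instead of the word form.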
What is missing
- possibility to combine different types of annotation in queries and displays, esp. for multimodal corpora
- integration of more powerful statistical methods, esp. for collocations and parallel corpora
- tools targeted to different types of users (e.g. Sketch Engine)
- merging of digital library viewers with corpus concordancing software
Corpora vs. digital libraries
- classical reference corpora were composed of samples, and were interesting only for their linguistic content
- today, more and more corpora contain integral texts, which are of interest in themselves (e.g. historical texts)
- conversely, digital libraries are growing in size and accessibility and are also becoming interesting for linguistic research
- what is needed is a system that can perform two tasks: enable selection of (fragments of) heavily structured (multimedia, text-critical) texts for reading, and allow for concordance views of selections
- currently the only available (open-source) system that attempts this is PhiloLogic from the University of Chicago
III. Future directions
Two directions in standardisation of corpus and language resource annotation:
- next version of TEI, version P5
- work by ISO TC 37 SC4
TEI P5
- the next version of TEI, currently at beta stage: available, but not stable
- significantly revised and brought in line with current practices
- not backward compatible with P3/P4 (although scripts exist for conversion)
- formal specification based on the ISO RELAX NG schema language (although DTD and W3C schemas are also available)
- parameterisation also produces dedicated documentation
ISO TC 37
- ISO TC 37: ISO Technical Committee on Terminology, est. 1952
- maybe best known for ISO 639 and MARTIF
- in 2002 changed its name to Technical Committee on Terminology and Other Language Resources
- also established ISO TC 37/SC 4, the Sub-Committee on Language Resource Management
ISO TC 37 SC4 WGs
- WG 1: Basic descriptors and mechanisms for language resources
  – terminology used in language resources
  – basic mechanisms and data structures for linguistic representation
  – metadata representation scheme to document linguistic information structures and processes
- WG 2: Representation schemes
  – definition of annotation/representation schemes for morpho-syntax and syntax
  – representation scheme for the semantic content of multimodal information
  – metadata for discourse level representation scheme
- WG 3: Multilingual text representation
  – translation memory and alignment of parallel corpora
  – segmentation and counting algorithms
  – meta-markup for Globalization, Internationalization and Localization (GIL)
- WG 4: Lexical databases
  – standardization of lexical representation formats for the various types of NLP applications (Machine Readable Lexica)
- WG 5: Workflow of language resource management
  – standardization of guidelines for language validation and net-based distributed cooperative work
WG4 standards
- Language Resource Management: Feature Structures
- Language Resource Management: Lexical Markup Framework (LMF)
- Language Resource Management: Morpho-syntactic Annotation Framework (MAF)
- all under development!
Conclusions
- I presented some history, the current state and possible future directions in the field of encoding standardisation, mainly for corpora
- the main recommendation (for me!) still seems to be TEI: it combines tradition with innovation
- Thank you!