Exploring data models for heterogenous dialect data: the case of explore.bread.AT!
Jack T. Bowers, Melanie Seltmann
Austrian Academy of Sciences - Austrian Center for Digital Humanities
Outline of Presentation
Part I: Overview of project & data
Part II: Overview of possible solutions using XML-based markup standards for representing onomasiological dialectal language
explore.AT!
Overview:
• DBÖ: collection of Bavarian dialectal speech began in 1911
• 2015-2016: converted from TUSTEP to TEI
Goals
• Gain cultural and linguistic insights into Bavarian dialects in the former Austro-Hungarian Empire;
• Update and improve the existing body of resources by converting it to conform with standards and best practice (ISOcat, ISOconcept, etc.);
• Enhance usability and compatibility of the data in order to share it with project partners;
• Integrate semantic web/LOD resources.
Project Overview: Datasets
DBÖ@TEI
WBÖ@TEI
BaseX Database
place inventory (TEI-listPlace)
concept inventory (TEI-feature structures)
gram features inventory (TEI-feature structures)
questionnaires (TEI-list)
DBÖ@emaSQL
BaseX Database
Extracted Topical Datasets
explore.bread
The language of Color
lexicon(location(a))
inventory(lexicalFeature(a))
Possible basis for examination of sub-datasets
• Domain/Topic-based (exploreBread)
• Location
• Lexical/grammatical features
Visualization
DBÖ Questionnaires
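Questionnaires:
While questionnaires are generally topical, they are a complicated mixture of semasiological (term-based) and onomasiological (concept-based) questions,
e.g. (31B5) bes. Weißgebäcke: länglich flaches, gerundetes Weißgebäck, z.B. Strutz (l.!), Strutzen, Strützel, Wecken u.a.; scherzhafte Bez. wie Schendarm
[English gloss: special white-bread pastries: elongated, flat, rounded white bread, e.g. Strutz, Strutzen, Strützel, Wecken and others; jocular names such as Schendarm]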
Means of extracting this information were initially limited to:
• Questionnaires
• String searches in certain data fields
The dataset requires significant manual editing and curation due to the nature of the questionnaires
Desired Enhancements
In most sub-topical studies such as ExploreBread!, being able to format data onomasiologically would bring benefits, for example:
• Domain- and/or concept-oriented entries better represent the content of interest
• Information retrieval
• Ontology mapping
• Etymological and/or morphosyntactic analysis
• Cross-linguistic (or cross-dialectal) comparison or translation
Problem: > TEI has no explicitly designated means of encoding onomasiological data!
Enhancing original data
• Adding domain (onomasiological) and ontology-based sense tags
  <sense corresp="concept:Weißgebäck">Weißgebäck</sense>
  <usg type="dom" corresp="concept:Brot">Brot</usg>
• Normalization of phonetic notation*
  <form type="lautung" n="1">
    <pron notation="tustep">>str-uts</pron>
    <pron notation="ipa" resp="#JB" change="01.2">ʒt̊ruːts</pron>
  </form>
• Adding Morphological/Compositional Analysis*
  <form type="hauptlemma">
    <orth>(S:emmel)zipfel</orth>
  </form>
  <form type="hauptlemma" resp="#MS">
    <orth>(<seg corresp="concept:Semmel">S:emm<seg ana="#dimin">el</seg></seg>)
      <seg ana="#stem" corresp="concept:Zipf">zipf<seg ana="#dimin">el</seg></seg>
    </orth>
  </form>
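The enhancement steps above can be sketched in code. The following is a minimal illustration only, assuming Python's standard xml.etree.ElementTree and an abbreviated, namespace-free stand-in for a DBÖ entry; the helper names add_domain_tag and add_ipa are hypothetical and are not part of the project's tooling.

```python
import xml.etree.ElementTree as ET

# Abbreviated stand-in for a DBÖ entry (assumption: no TEI namespace declared).
ENTRY_XML = """
<entry xml:lang="bar">
  <form type="lautung" n="1">
    <pron notation="tustep">Struzn</pron>
  </form>
</entry>
"""


def add_domain_tag(entry, concept, label):
    """Append <usg type="dom" corresp="concept:..."> to an entry (hypothetical helper)."""
    usg = ET.SubElement(entry, "usg", {"type": "dom", "corresp": f"concept:{concept}"})
    usg.text = label
    return usg


def add_ipa(form, ipa, resp):
    """Add a normalized IPA <pron> next to the original TUSTEP one (hypothetical helper)."""
    pron = ET.SubElement(form, "pron", {"notation": "ipa", "resp": resp})
    pron.text = ipa
    return pron


entry = ET.fromstring(ENTRY_XML)
add_domain_tag(entry, "Brot", "Brot")
add_ipa(entry.find("form"), "ʃtruzn̩", "#JB")
print(ET.tostring(entry, encoding="unicode"))
```

The original TUSTEP transcription is kept untouched and the IPA form is added beside it, mirroring the paired <pron> elements in the slide.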
Lexical Organization
Semasiological: the starting point is a word form; the model identifies the meanings and senses associated with it
Onomasiological: the starting point is a concept; the model looks at the forms used to represent it
Semasiological Lexical Model
Form → meaning(i), meaning(ii), meaning(iii)
Onomasiological Lexical Model
Concept → Form(i), Form(ii), Form(iii)
Headword
Lemma(i..n)
BROT
brot broet brɛot
Prôt Prôt Prôt
Core DBÖ entry datatypes
• Archive record
• Headword (Form)
• POS
• Dialect lemma (Form)
• Gram info
• Meaning (Sense)
• Usage example
• Source
• Place
• Questionnaire
• Etymology
Desired Data Structure
Desired Onomasiological Model for Extracted Terminological DBÖ Datasets
TermEntry
Concept(a)
DialectEntry(i) DialectEntry(ii) DialectEntry(n)
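The desired structure can be sketched as a grouping of dialect entries under a shared concept. This is an illustrative sketch only: the class and field names (TermEntry, DialectEntry, group_by_concept) are assumptions modelled on the labels in the diagram, and the two sample records are taken from the bread examples used elsewhere in the presentation.

```python
from dataclasses import dataclass, field


@dataclass
class DialectEntry:
    headword: str  # DBÖ "hauptlemma"
    lemma: str     # dialect form as recorded (TUSTEP notation here)
    place: str
    concept: str   # concept the form denotes


@dataclass
class TermEntry:
    concept: str
    entries: list = field(default_factory=list)


def group_by_concept(records):
    """Pivot form-first (semasiological) records into concept-first term entries."""
    grouped = {}
    for rec in records:
        grouped.setdefault(rec.concept, TermEntry(rec.concept)).entries.append(rec)
    return grouped


records = [
    DialectEntry("Wecken", "W.eiggn", "St.Michael/B. Bgl.", "Wecken"),
    DialectEntry("Strutzen", "Struzn", "Rohrb. OÖ", "Wecken"),
]
terms = group_by_concept(records)
for term in terms.values():
    print(term.concept, [e.headword for e in term.entries])
```

Each original record stays intact; only the entry point changes from the form to the concept.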
Options using XML-Based Standards
(i) TEI Hacks: Alternate TEI Dictionary format (<entryFree>)
(ii) TEI-TBX Hybrid (Romary, 2014)
OR… use TEI P4
TEI <entryFree> Model
[Diagram: element hierarchy with cardinality labels (0…n), (1…n), (1…1)]
<entryFree @xml:id>
  <sense @corresp/>                    concept:meaning
    <usg @type="dom">                  concept:domain
    <def @xml:lang> (0…n)
  <superEntry>
    <form type="hauptlemma">           Form (headword(i))
      <orth>
    <entry @xml:id @xml:lang="bar">    DBÖ entry (headword(i)): Form (dialect(a)), Metadata
      (all other elements content from original copied without alteration)
  <superEntry>
    <form type="hauptlemma">           Form (headword(ii))
    <entry @xml:id @xml:lang="bar">    DBÖ entry (headword(ii)): Form (dialect(b)), Metadata
TEI <entryFree> Model
<entryFree>
  <sense corresp="concept:Wecken">
    <usg type="dom" corresp="concept:Brot">Brot</usg>
    <def xml:lang="en" resp="#JB">Oblong loaf of bread</def>
  </sense>
  <superEntry>
    <!-- for each unique hauptlemma for concept entry -->
    <form type="hauptlemma">
      <orth>Wecken</orth>
    </form>
    <entry xml:id="w834_qdb-d1e602b" xml:lang="bar">
      <!-- hauptlemma removed from here; entry content abbreviated -->
      <form type="lautung" n="1">
        <pron notation="tustep">W.eiggn</pron>
        <pron notation="ipa" resp="#JB" change="01.2">ʋɛiggn̩</pron>
      </form>
      <usg type="geo">
        <placeName>St.Michael/B. Bgl.</placeName>
      </usg>
    </entry>
    <!-- all entries with headword "Wecken" (ii..n) -->
  </superEntry>
  <superEntry>
    <form type="hauptlemma">
      <orth>Strutzen</orth>
    </form>
    <entry xml:id="s806_qdb-d1e43847b" xml:lang="bar">
      <!-- hauptlemma removed from here; entry content abbreviated -->
      <form type="lautung" n="1">
        <pron notation="tustep">Struzn</pron>
        <pron notation="ipa" resp="#JB" change="01.2">ʃtruzn̩</pron>
      </form>
      <usg type="geo">
        <placeName>Rohrb. OÖ</placeName>
      </usg>
    </entry>
    <!-- all entries with headword "Strutzen" (ii..n) -->
  </superEntry>
</entryFree>
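To show how such a concept-first entry could be read back, here is a minimal sketch using Python's standard xml.etree.ElementTree. It assumes a cut-down, namespace-free version of the <entryFree> example (real TEI files would declare the TEI namespace), and forms_by_concept is a hypothetical helper name.

```python
import xml.etree.ElementTree as ET

# Cut-down version of the <entryFree> example (assumption: no namespace).
SAMPLE = """
<entryFree>
  <sense corresp="concept:Wecken">
    <usg type="dom" corresp="concept:Brot">Brot</usg>
  </sense>
  <superEntry>
    <form type="hauptlemma"><orth>Wecken</orth></form>
    <entry xml:lang="bar">
      <form type="lautung" n="1"><pron notation="tustep">W.eiggn</pron></form>
    </entry>
  </superEntry>
  <superEntry>
    <form type="hauptlemma"><orth>Strutzen</orth></form>
    <entry xml:lang="bar">
      <form type="lautung" n="1"><pron notation="tustep">Struzn</pron></form>
    </entry>
  </superEntry>
</entryFree>
"""


def forms_by_concept(entry_free):
    """Return the concept reference and, per headword, the recorded pronunciations."""
    concept = entry_free.find("sense").get("corresp")
    headwords = {}
    for sup in entry_free.findall("superEntry"):
        headword = sup.findtext("form[@type='hauptlemma']/orth")
        headwords[headword] = [p.text for p in sup.findall("entry/form/pron")]
    return concept, headwords


concept, headwords = forms_by_concept(ET.fromstring(SAMPLE))
print(concept, headwords)
```

The lookup starts from the concept and descends to headwords and dialect forms, which is the access path the model is meant to support.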
Problems with <entryFree> model
• It is a hack!
• Current TEI guidelines and data model are inherently and intentionally semasiological; this use of the vocabulary is valid only by chance, not by intention.
> Thus using this data model within the TEI will not bring any of the advantages that generally come with its use
TBX-TEI Hybrid
Romary (2014):
Attempts to customize the TEI guidelines to incorporate TBX (ISO 30042) terminological entries in order to provide TEI with an onomasiological model
https://github.com/laurentromary/TBXinTEI
TBX-TEI Hybrid
<tbx:termEntry xmlns="http://www.tbx.org"><!-- @xml:id; -->
  <descrip type="concept" target="concept:Wecken"/> <!-- sense not normally included in TBX! -->
  <descrip type="domain" target="concept:Brot" xml:lang="de">Brot</descrip>
  <descrip type="definition" xml:lang="en">Oblong loaf of bread</descrip>
  <!-- no headword form may occur outside of <langSet> -->
  <langSet xml:id="w834_qdb-d1e602" xml:lang="bar-x-smichael"><!-- language/dialect i) @xml:id; --><!-- No sense allowed! -->
    <tei:note type="anmerkung" resp="O" corresp="#BD">deren Grundriß ein Oval ist</tei:note><!-- @corresp allowed in TEI <note> but not here -->
    <!-- Most metadata element valid using <tei:ref> but syntactically required to occur before <tig> -->
    <admin type="geo">
      <tei:placeName>St.Michael/B. Bgl.</tei:placeName>
    </admin>
    <tig><!-- <tei:form> would be better -->
      <tei:term type="hauptlemma">Wecken</tei:term>
      <termNote type="transcription">orth</termNote><!-- this is inefficient: need to allow <orth> & <pron> -->
      <termNote type="pos">Subst</termNote><!-- this actually should be applicable to all forms (headword & lemmas) -->
    </tig>
    <tig>
      <tei:term type="lautung" n="1">W.eiggn</tei:term>
      <termNote type="transcription">pron</termNote>
      <termNote type="notation">tustep</termNote><!-- we also need to allow @notation -->
    </tig>
    <tig><!-- TBX doesn't allow multiple instances of <term> in same <tig> as TEI does with <orth>,<pron> w/in <form> -->
      <tei:term type="lautung" n="1" resp="#JB">ʋɛiggn̩</tei:term>
      <termNote type="transcription" change="1.2">pron</termNote><!-- @change in original not allowed in hybrid schema -->
      <termNote type="notation">ipa</termNote>
    </tig>
  </langSet>
  ….
Problems with TEI-TBX Hybrid model as per the ODD Schema from Romary (2014)
• <tig> is verbose and would be better replaced with <form>
• the order of occurrence of elements is too restricted
• TBX-dominated schema lacks too many attributes (e.g. @notation) and elements (e.g. <orth>, <pron>) that are key to the storage and representation of lexical data as used in TEI
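The verbosity point can be made concrete with a small sketch that rewrites one TEI-style <form> into the <tig> pattern shown in the hybrid example. This is an illustration under assumptions: element names are used without namespace prefixes, form_to_tigs is a hypothetical helper, and the mapping simply follows the pattern in the slide rather than the actual ODD schema.

```python
import xml.etree.ElementTree as ET

# One TEI-style <form> holding two transcriptions of the same dialect form.
FORM = ET.fromstring(
    '<form type="lautung" n="1">'
    '<pron notation="tustep">W.eiggn</pron>'
    '<pron notation="ipa">ʋɛiggn̩</pron>'
    "</form>"
)


def form_to_tigs(form):
    """Rewrite each <orth>/<pron> child of a <form> as its own <tig> (hypothetical mapping)."""
    tigs = []
    for child in form:
        tig = ET.Element("tig")
        term = ET.SubElement(tig, "term", {"type": form.get("type")})
        term.text = child.text
        kind = ET.SubElement(tig, "termNote", {"type": "transcription"})
        kind.text = child.tag  # "orth" or "pron" ends up as text, not as an element
        if child.get("notation"):
            note = ET.SubElement(tig, "termNote", {"type": "notation"})
            note.text = child.get("notation")
        tigs.append(tig)
    return tigs


tigs = form_to_tigs(FORM)
for tig in tigs:
    print(ET.tostring(tig, encoding="unicode"))
```

In this sketch a single <form> with two children turns into two <tig> blocks with three children each, and the @notation attribute has to be carried as a separate <termNote>.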
Conclusion
(i) TEI lacks a legitimate means of encoding terminological/onomasiological entries;
(ii) Given that we need to include sense (or a parallel equivalent) and the headword at the top of an entry, a TBX-TEI hybrid does not work either without serious modification via ODD, mostly to introduce elements and features from TEI, and without stretching the traditional usage of the system;
(iii) TEI needs to re-introduce a means of onomasiological data representation (such as <termEntry>), but with an expanded set of elements and attributes matching the expressivity of the Dictionary module