Exploring data models for heterogenous dialect data: the case of explore.bread.AT!

Jack T. Bowers, Melanie Seltmann

Austrian Academy of Sciences - Austrian Center for Digital Humanities


Outline of Presentation

Part I: Overview of project & data
Part II: Overview of possible solutions using XML-based markup standards for representing onomasiological dialectal language

explore.AT! Overview:

• DBÖ: collection of Bavarian dialectal speech began in 1911
• 2015-2016: converted from TUSTEP to TEI

Goals
• Gain cultural and linguistic insights into Bavarian dialects in the former Austro-Hungarian Empire;
• Update and improve the existing body of resources by converting them to conform with standards and best practice (ISOcat, ISOconcept, etc.);
• Enhance usability and compatibility of data in order to share with project partners;
• Integrate semantic web/LOD resources.

Project Overview: Datasets

DBÖ@TEI

WBÖ@TEI

BaseX Database

place inventory (TEI-listPlace)

concept inventory (TEI-feature structures)

gram features inventory (TEI-feature structures)

questionnaires (TEI-list)

DBÖ@emaSQL

BaseX Database

Extracted Topical Datasets
explore.bread
The language of Color

lexicon(location(a))

inventory(lexicalFeature(a))

Possible basis for examination of sub-datasets
• Domain/Topic-based (exploreBread)
• Location
• Lexical/grammatical features

Visualization

DBÖ Questionnaires

Questionnaires:
While questionnaires are topical in general, they are a complicated mixture of semasiological (term-based) and onomasiological (concept-based) questions.

e.g. (31B5) bes. Weißgebäcke: länglich flaches, gerundetes Weißgebäck, z.B. Strutz (l.!), Strutzen, Strützel, Wecken u.a.; scherzhafte Bez. wie Schendarm
[roughly: special white baked goods: oblong, flat, rounded white bread, e.g. Strutz (l.!), Strutzen, Strützel, Wecken, etc.; jocular terms such as Schendarm]

Current means of extracting this information were initially limited to:
• Questionnaires
• String searches in certain data fields

The dataset requires significant manual editing and curation due to the nature of the questionnaires.

Desired Enhancements
In most sub-topical studies such as ExploreBread!, there would be potential benefits in being able to format data onomasiologically, for example:

• Domain and/or concept-oriented entries better represent the content of interest
• Information retrieval
• Ontology mapping
• Etymological and/or morphosyntactic analysis
• Cross-linguistic (or cross-dialectal) comparison or translation

Problem: TEI has no explicitly designated means of encoding onomasiological data!

Enhancing original data
• Adding domain (onomasiological) and ontology-based sense tags

  <sense corresp="concept:Weißgebäck">Weißgebäck</sense>
  <usg type="dom" corresp="concept:Brot">Brot</usg>

• Normalization of phonetic notation*

  <form type="lautung" n="1">
    <pron notation="tustep">>str-uts</pron>
    <pron notation="ipa" resp="#JB" change="01.2">ʒt̊ruːts</pron>
  </form>

• Adding Morphological/Compositional Analysis*

  Original:
  <form type="hauptlemma">
    <orth>(S:emmel)zipfel</orth>
  </form>

  Enhanced:
  <form type="hauptlemma" resp="#MS">
    <orth>(<seg corresp="concept:Semmel">S:emm<seg ana="#dimin">el</seg></seg>)<seg ana="#stem" corresp="concept:Zipf">zipf<seg ana="#dimin">el</seg></seg></orth>
  </form>
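One payoff of the segment-level markup in the enhanced form is that the concept links become machine-readable. The following is a minimal sketch in Python, assuming the enhanced <form> is available as a standalone, well-formed fragment; the variable names are illustrative and not part of the project's tooling.

```python
import xml.etree.ElementTree as ET

# The enhanced <form> from the slide, as a standalone fragment (assumption:
# straight quotes, no namespace declarations).
FORM_XML = """
<form type="hauptlemma" resp="#MS">
  <orth>(<seg corresp="concept:Semmel">S:emm<seg ana="#dimin">el</seg></seg>)<seg ana="#stem" corresp="concept:Zipf">zipf<seg ana="#dimin">el</seg></seg></orth>
</form>
"""

form = ET.fromstring(FORM_XML)
orth = form.find("orth")

# Surface string: text of <orth> plus the text of all nested <seg> elements.
surface = "".join(orth.itertext())

# Map each concept reference to the text of the segment that carries it.
concept_segments = {
    seg.get("corresp"): "".join(seg.itertext())
    for seg in orth.iter("seg")
    if seg.get("corresp") is not None
}

print(surface)
print(concept_segments)
```

The original orthography is still recoverable from the segmented version, so the enhancement adds information without losing the headword as recorded.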

Lexical Organization

Semasiological: the starting point is a word form; the entry identifies its associated meanings and senses

Onomasiological: the starting point is a concept; the entry looks at the forms used to represent it

Semasiological Lexical Model

Form → meaning(i), meaning(ii), meaning(iii)

Onomasiological Lexical Model

Concept → Form(i), Form(ii), Form(iii)

Headword

Lemma(i..n)

BROT

brot broet brɛot

Prôt Prôt Prôt
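Seen as data, the two lexical models are inversions of the same mapping. A minimal sketch in Python, assuming forms and concepts are plain strings; the `to_onomasiological` helper and the sample data are illustrative, and `concept:Example` is a made-up placeholder used only to show a form with more than one meaning.

```python
from collections import defaultdict

# Semasiological view: form -> meanings.
# "concept:Wecken" follows the labelling used in the slides;
# "concept:Example" is a hypothetical second meaning.
semasiological = {
    "Wecken": ["concept:Wecken", "concept:Example"],
    "Strutzen": ["concept:Wecken"],
}


def to_onomasiological(form_to_meanings):
    """Invert form -> meanings into concept -> forms."""
    concept_to_forms = defaultdict(list)
    for form, meanings in form_to_meanings.items():
        for meaning in meanings:
            concept_to_forms[meaning].append(form)
    return dict(concept_to_forms)


onomasiological = to_onomasiological(semasiological)
print(onomasiological)
```

The inversion is only this simple when every sense already points to a concept identifier, which is why the sense tagging step matters.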

Core DBÖ entry datatypes
• Archive record
• Headword (Form)
• POS
• Dialect lemma (Form)
• Gram info
• Meaning (Sense)
• Usage example
• Source
• Place
• Questionnaire
• Etymology

Desired Data Structure

Desired Onomasiological Model for Extracted Terminological DBÖ Datasets

TermEntry

Concept(a)

DialectEntry(i) DialectEntry(ii) DialectEntry(n)

Options using XML-Based Standards

(i) TEI Hacks: Alternate TEI Dictionary format (<entryFree>)

(ii) TEI-TBX Hybrid (Romary, 2014)

OR… use TEI P4

TEI <entryFree> Model

[Diagram: nesting of the <entryFree> model. Elements shown: <entryFree @xml:id>, <sense @corresp/>, <usg @type="dom">, <def @xml:lang> (0…n), <superEntry>, <form type="hauptlemma">, <orth>, <entry @xml:id @xml:lang="bar">, <sense>. Cardinality labels on the nesting levels: (0…n), (1…n), (1…1). Annotations: concept:meaning; concept:domain; Form (headword(i)), Form (dialect(a)), Metadata, DBÖ entry (headword (i)); Form (headword(ii)), Form (dialect(b)), Metadata, DBÖ entry (headword (ii)). Note in diagram: all other element content from the original is copied without alteration.]

TEI <entryFree> Model

<entryFree>
  <sense corresp="concept:Wecken">
    <usg type="dom" corresp="concept:Brot">Brot</usg>
    <def xml:lang="en" resp="#JB">Oblong loaf of bread</def>
  </sense>
  <superEntry>
    <!-- for each unique hauptlemma for concept entry -->
    <form type="hauptlemma">
      <orth>Wecken</orth>
    </form>
    <entry xml:id="w834_qdb-d1e602b" xml:lang="bar">
      <form type="lautung" n="1">
        <pron notation="tustep">W.eiggn</pron>
        <pron notation="ipa" resp="#JB" change="01.2">ʋɛiggn̩</pron>
      </form>
      <usg type="geo">
        <placeName>St.Michael/B. Bgl.</placeName>
      </usg>
    </entry>
    <!-- all entries with headword "Wecken" (ii..n) -->
  </superEntry>
  <superEntry>
    <form type="hauptlemma">
      <orth>Strutzen</orth>
    </form>
    <entry xml:id="s806_qdb-d1e43847b" xml:lang="bar">
      <form type="lautung" n="1">
        <pron notation="tustep">Struzn</pron>
        <pron notation="ipa" resp="#JB" change="01.2">ʃtruzn̩</pron>
      </form>
      <usg type="geo">
        <placeName>Rohrb. OÖ</placeName>
      </usg>
    </entry>
    <!-- all entries with headword "Strutzen" (ii..n) -->
  </superEntry>
</entryFree>

Slide annotations: concept:meaning; concept:domain; Form (headword(i)), Form (dialect(a)), Metadata, DBÖ entry (headword (i)); Form (headword(ii)), Form (dialect(b)), Metadata, DBÖ entry (headword (ii))
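The nesting in this example (one concept, then one <superEntry> per hauptlemma, then the individual DBÖ entries) can be produced by grouping flat records. A minimal sketch in Python, assuming each entry has been reduced to a small dictionary; the record layout and the `build_entry_free` helper are hypothetical, and only values from the slide are used as data.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict

# ElementTree serializes this namespace with the reserved xml: prefix.
XML_NS = "{http://www.w3.org/XML/1998/namespace}"

# Hypothetical flat records; the values are taken from the slide example.
records = [
    {"id": "w834_qdb-d1e602b", "hauptlemma": "Wecken",
     "tustep": "W.eiggn", "place": "St.Michael/B. Bgl."},
    {"id": "s806_qdb-d1e43847b", "hauptlemma": "Strutzen",
     "tustep": "Struzn", "place": "Rohrb. OÖ"},
]


def build_entry_free(concept, domain, records):
    """Group records by hauptlemma under one concept-level <entryFree>."""
    root = ET.Element("entryFree")
    sense = ET.SubElement(root, "sense", {"corresp": f"concept:{concept}"})
    usg = ET.SubElement(sense, "usg", {"type": "dom", "corresp": f"concept:{domain}"})
    usg.text = domain

    # One bucket per unique hauptlemma, in order of first appearance.
    by_lemma = defaultdict(list)
    for record in records:
        by_lemma[record["hauptlemma"]].append(record)

    for lemma, group in by_lemma.items():
        super_entry = ET.SubElement(root, "superEntry")
        head = ET.SubElement(super_entry, "form", {"type": "hauptlemma"})
        ET.SubElement(head, "orth").text = lemma
        for record in group:
            entry = ET.SubElement(
                super_entry, "entry",
                {XML_NS + "id": record["id"], XML_NS + "lang": "bar"},
            )
            form = ET.SubElement(entry, "form", {"type": "lautung", "n": "1"})
            ET.SubElement(form, "pron", {"notation": "tustep"}).text = record["tustep"]
            geo = ET.SubElement(entry, "usg", {"type": "geo"})
            ET.SubElement(geo, "placeName").text = record["place"]
    return root


entry_free = build_entry_free("Wecken", "Brot", records)
print(ET.tostring(entry_free, encoding="unicode"))
```

The sketch leaves out the <def> and the IPA <pron>, which in the slides are added by hand (resp="#JB") rather than derived from the source records.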

Problems with <entryFree> model

• It is a hack!
• Current TEI guidelines and data model are inherently and intentionally semasiological; this use of the vocabulary is valid only by chance, not by intention.

Thus, using this data model within TEI brings none of the advantages that generally come with using TEI.

TBX-TEI Hybrid
Romary (2014):

Attempts to customize the TEI guidelines to incorporate TBX (ISO 30042) terminological entries in order to provide TEI with an onomasiological model

https://github.com/laurentromary/TBXinTEI

TBX-TEI Hybrid

<tbx:termEntry xmlns="http://www.tbx.org">
  <descrip type="concept" target="concept:Wecken"/>
  <descrip type="domain" target="concept:Brot" xml:lang="de">Brot</descrip>
  <descrip type="definition" xml:lang="en">Oblong loaf of bread</descrip>
  <langSet>
    <tig>
      <tei:term type="hauptlemma">Wecken</tei:term>
      <termNote type="transcription">orth</termNote>
    </tig>
    <tig>
      <tei:term type="lautung" n="1">W.eiggn</tei:term>
      <termNote type="transcription">pron</termNote>
      <termNote type="notation">tustep</termNote>
    </tig>
    <tig>
      <tei:term type="lautung" n="1" resp="#JB">ʋɛiggn̩</tei:term>
      <termNote type="transcription" change="1.2">pron</termNote>
      <termNote type="notation">ipa</termNote>
    </tig>
  </langSet>
  ….

Problems with TEI-TBX Hybrid model as per the ODD Schema from Romary (2014)

• <tig> is verbose and would be better replaced with <form>
• The order of occurrence of elements is too restricted
• The TBX-dominated schema lacks too many attributes (e.g. @notation) and elements (e.g. <orth>, <pron>) that are key to the storage and representation of lexical data as used in TEI

Conclusion
(i) TEI lacks a legitimate means of encoding terminological/onomasiological entries;

(ii) Given that we need to include sense (or a parallel equivalent) and the headword at the top of an entry, a TBX-TEI hybrid doesn’t work either without serious modification via ODD (mostly to introduce elements and features from TEI) and some stretching of the system's traditional usage;

(iii) TEI needs to re-introduce a means of onomasiological data representation (such as <termEntry>), but with an expanded set of elements and attributes based on the degree of expressivity in the Dictionary module.

