Exploring data models for heterogenous dialect data: the case of explore.bread.AT!

Jack T. Bowers, Melanie Seltmann (Austrian Academy of Sciences - Austrian Center for Digital Humanities): Exploring data models for heterogenous dialect data: the case of explore.bread.AT!


Post on 08-Feb-2017


TRANSCRIPT

Page 1: Exploring data models for heterogenous dialect data: the case of explore.bread.AT!

Jack T. Bowers, Melanie Seltmann

Austrian Academy of Sciences - Austrian Center for Digital Humanities

Exploring data models for heterogenous dialect data:

the case of explore.bread.AT!

Page 2: Exploring data models for heterogenous dialect data: the case of explore.bread.AT!

Outline of Presentation

Part I: Overview of project & data

Part II: Overview of possible solutions using XML-based markup standards for representing onomasiological dialectal language

Page 3: Exploring data models for heterogenous dialect data: the case of explore.bread.AT!

explore.AT! Overview:

• DBÖ: collection of Bavarian dialectal speech, begun in 1911
• 2015-2016: converted from TUSTEP to TEI

Goals

• Gain cultural and linguistic insights into Bavarian dialects in the former Austro-Hungarian Empire;
• Update and improve the existing body of resources by converting it to conform with standards and best practice (ISOcat, ISOconcept, etc.);
• Enhance usability and compatibility of the data in order to share it with project partners;
• Integrate semantic web/LOD resources.

Page 4: Exploring data models for heterogenous dialect data: the case of explore.bread.AT!

Project Overview: Datasets

DBÖ@TEI

WBÖ@TEI

BaseX Database

place inventory (TEI-listPlace)

concept inventory (TEI-feature structures)

gram features inventory (TEI-feature structures)

questionnaires (TEI-list)

DBÖ@emaSQL

BaseX Database

Extracted Topical Datasets

explore.bread

The language of Color

lexicon(location(a))

inventory(lexicalFeature(a))

Possible basis for examination of sub-datasets:

• Domain/Topic-based (exploreBread)
• Location
• Lexical/grammatical features

Page 5: Exploring data models for heterogenous dialect data: the case of explore.bread.AT!

Visualization

Page 6: Exploring data models for heterogenous dialect data: the case of explore.bread.AT!

DBÖ Questionnaires

Questionnaires: While questionnaires are topical in general, they are a complicated mixture of semasiological (term-based) and onomasiological (concept-based) questions.

e.g. (31B5) bes. Weißgebäcke: länglich flaches, gerundetes Weißgebäck, z.B. Strutz (l.!), Strutzen, Strützel, Wecken u.a.; scherzhafte Bez. wie Schendarm

[English gloss: special white breads: elongated, flat, rounded white bread, e.g. Strutz, Strutzen, Strützel, Wecken, among others; jocular names such as Schendarm]

The means of extracting this information were initially limited to:

• Questionnaires
• String searches in certain data fields

The dataset requires significant manual editing and curation due to the nature of the questionnaires.

Page 7: Exploring data models for heterogenous dialect data: the case of explore.bread.AT!

Desired Enhancements

In most sub-topical studies such as ExploreBread!, it would be beneficial to be able to format the data onomasiologically, for example:

• Domain and/or concept-oriented entries better represent the content of interest
• Information retrieval
• Ontology mapping
• Etymological and/or morphosyntactic analysis
• Cross-linguistic (or cross-dialectal) comparison or translation

Problem: > TEI has no explicitly designated means of encoding onomasiological data!

Page 8: Exploring data models for heterogenous dialect data: the case of explore.bread.AT!

Enhancing original data

• Adding domain (onomasiological) and ontology-based sense tags

<sense corresp="concept:Weißgebäck">Weißgebäck</sense>
<usg type="dom" corresp="concept:Brot">Brot</usg>

• Normalization of phonetic notation*

<form type="lautung" n="1">
   <pron notation="tustep">&gt;str-uts</pron>
   <pron notation="ipa" resp="#JB" change="01.2">ʒt̊ruːts</pron>
</form>

• Adding Morphological/Compositional Analysis*

<form type="hauptlemma">
   <orth>(S:emmel)zipfel</orth>
</form>

<form type="hauptlemma" resp="#MS">
   <orth>(<seg corresp="concept:Semmel">S:emm<seg ana="#dimin">el</seg></seg>)<seg ana="#stem" corresp="concept:Zipf">zipf<seg ana="#dimin">el</seg></seg></orth>
</form>
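As a rough illustration of what the added segmentation makes possible, the following sketch reads the annotated headword with Python's standard `xml.etree.ElementTree` and lists each top-level segment together with its linked concept. The function name `segments` and the tuple layout are assumptions made for this example; they are not part of the project's tooling.

```python
import xml.etree.ElementTree as ET

# The annotated headword from the slide, with straight quotes.
FORM = (
    '<form type="hauptlemma" resp="#MS">'
    '<orth>(<seg corresp="concept:Semmel">S:emm<seg ana="#dimin">el</seg></seg>)'
    '<seg ana="#stem" corresp="concept:Zipf">zipf<seg ana="#dimin">el</seg></seg>'
    '</orth></form>'
)


def segments(form_xml):
    """Return (surface text, concept, nested analyses) per top-level <seg>.

    Hypothetical helper: it only looks at <seg> children of <orth>.
    """
    orth = ET.fromstring(form_xml).find("orth")
    result = []
    for seg in orth.findall("seg"):  # direct children of <orth> only
        text = "".join(seg.itertext())  # includes the text of nested <seg>
        nested = [inner.get("ana") for inner in seg.findall("seg")]
        result.append((text, seg.get("corresp"), nested))
    return result


segs = segments(FORM)
for text, concept, nested in segs:
    print(text, concept, nested)
```

Because each component carries its own `corresp` pointer, a compound such as this one can be retrieved through either of its component concepts.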

Page 9: Exploring data models for heterogenous dialect data: the case of explore.bread.AT!

Lexical Organization

Semasiological: the starting point is a word form; the model identifies the meanings and senses associated with it.

Onomasiological: the starting point is a concept; the model looks at the forms used to represent it.

Semasiological Lexical Model

Form → meaning(i), meaning(ii), meaning(iii)

Onomasiological Lexical Model

Concept → Form(i), Form(ii), Form(iii)
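The relation between the two models can be sketched as a simple inversion of a mapping. The snippet below is a minimal illustration, assuming a plain dictionary from forms to concept identifiers; the function name `to_onomasiological` is invented for the example, and the two forms are taken from the Wecken example used elsewhere in this presentation.

```python
from collections import defaultdict

# Semasiological view: each form points to its meanings (concept identifiers).
semasiological = {
    "Wecken": ["concept:Wecken"],
    "Strutzen": ["concept:Wecken"],
}


def to_onomasiological(form_to_meanings):
    """Invert form -> meanings into concept -> forms (illustrative helper)."""
    concept_to_forms = defaultdict(list)
    for form, meanings in form_to_meanings.items():
        for meaning in meanings:
            concept_to_forms[meaning].append(form)
    return dict(concept_to_forms)


onomasiological = to_onomasiological(semasiological)
print(onomasiological)
```

The inversion itself is trivial; the difficulty discussed in the following slides is finding a markup model in which the inverted structure can be expressed.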

Page 10: Exploring data models for heterogenous dialect data: the case of explore.bread.AT!

Headword

Lemma(i..n)

BROT

brot broet brɛot

Prôt Prôt Prôt

Core DBÖ entry datatypes:

• Archive record
• Headword (Form)
• POS
• Dialect lemma (Form)
• Gram info
• Meaning (Sense)
• Usage example
• Source
• Place
• Questionnaire
• Etymology

Desired Data Structure

Desired Onomasiological Model for Extracted Terminological DBÖ Datasets

TermEntry

Concept(a)

DialectEntry(i) DialectEntry(ii) DialectEntry(n)

Page 11: Exploring data models for heterogenous dialect data: the case of explore.bread.AT!

Options using XML-Based Standards

(i) TEI Hacks: Alternate TEI Dictionary format (<entryFree>)

(ii) TEI-TBX Hybrid (Romary, 2014)

OR… use TEI P4

Page 12: Exploring data models for heterogenous dialect data: the case of explore.bread.AT!

TEI <entryFree> Model

[Diagram: element hierarchy of the <entryFree> model, with the conceptual labels from the slide in parentheses]

<entryFree @xml:id>
   <sense @corresp> (concept:meaning)
      <usg @type="dom"> (concept:domain)
      <def @xml:lang> (0…n)
   <superEntry> (Form (headword(i)), Form (headword(ii)))
      <form type="hauptlemma">
         <orth>
      <entry @xml:id @xml:lang="bar"> (DBÖ entry (headword(i)), DBÖ entry (headword(ii)); Form (dialect(a)), Form (dialect(b)); Metadata)
         (all other element content from the original copied without alteration)

Cardinality labels shown in the diagram: (1…n), (0…n), (1…1).

Page 13: Exploring data models for heterogenous dialect data: the case of explore.bread.AT!

TEI <entryFree> Model

concept:meaning

<entryFree>
   <sense corresp="concept:Wecken">
      <usg type="dom" corresp="concept:Brot">Brot</usg>
      <def xml:lang="en" resp="#JB">Oblong loaf of bread</def>
   </sense>

   <superEntry> <!-- for each unique hauptlemma for concept entry -->
      <form type="hauptlemma">
         <orth>Wecken</orth>
      </form>
      <entry xml:id="w834_qdb-d1e602b" xml:lang="bar">
         <!-- hauptlemma removed from here; entry content abbreviated -->
         <form type="lautung" n="1">
            <pron notation="tustep">W.eiggn</pron>
            <pron notation="ipa" resp="#JB" change="01.2">ʋɛiggn̩</pron>
         </form>
         <usg type="geo">
            <placeName>St.Michael/B. Bgl.</placeName>
         </usg>
      </entry>
      <!-- all entries with headword "Wecken" (ii..n) -->
   </superEntry>

   <superEntry>
      <form type="hauptlemma">
         <orth>Strutzen</orth>
      </form>
      <entry xml:id="s806_qdb-d1e43847b" xml:lang="bar">
         <!-- hauptlemma removed from here; entry content abbreviated -->
         <form type="lautung" n="1">
            <pron notation="tustep">Struzn</pron>
            <pron notation="ipa" resp="#JB" change="01.2">ʃtruzn̩</pron>
         </form>
         <usg type="geo">
            <placeName>Rohrb. OÖ</placeName>
         </usg>
      </entry>
      <!-- all entries with headword "Strutzen" (ii..n) -->
   </superEntry>
</entryFree>

concept:domain

Form (headword(i))

Form (dialect(a))

Metadata:

DBÖ entry (headword (i))

Form (headword(ii))

Form (dialect(b))

Metadata:

DBÖ entry (headword (ii))
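The grouping behind this model (one <sense> per concept, one <superEntry> per unique headword, one <entry> per DBÖ record) can be sketched with Python's standard `xml.etree.ElementTree`. This is only an illustration under stated assumptions: the flat record layout, the function name `build_entry_free`, and the second "Wecken" record are invented for the example and do not reflect the project's actual conversion pipeline.

```python
import xml.etree.ElementTree as ET

XML_NS = "{http://www.w3.org/XML/1998/namespace}"  # serialized as the xml: prefix

# Flat records; the layout is an assumption made for this sketch.
RECORDS = [
    {"id": "w834_qdb-d1e602b", "headword": "Wecken",
     "tustep": "W.eiggn", "place": "St.Michael/B. Bgl."},
    # Invented record, added only to show several entries under one headword.
    {"id": "hypothetical-0001", "headword": "Wecken",
     "tustep": "W.eiggn", "place": "St.Michael/B. Bgl."},
    {"id": "s806_qdb-d1e43847b", "headword": "Strutzen",
     "tustep": "Struzn", "place": "Rohrb. OÖ"},
]


def build_entry_free(concept, domain, records):
    """Build one <entryFree> for a concept, with a <superEntry> per headword."""
    root = ET.Element("entryFree")
    sense = ET.SubElement(root, "sense", corresp="concept:" + concept)
    usg = ET.SubElement(sense, "usg", type="dom", corresp="concept:" + domain)
    usg.text = domain

    super_entries = {}  # headword -> its <superEntry> element
    for rec in records:
        sup = super_entries.get(rec["headword"])
        if sup is None:
            sup = ET.SubElement(root, "superEntry")
            form = ET.SubElement(sup, "form", type="hauptlemma")
            ET.SubElement(form, "orth").text = rec["headword"]
            super_entries[rec["headword"]] = sup
        entry = ET.SubElement(
            sup, "entry", {XML_NS + "id": rec["id"], XML_NS + "lang": "bar"}
        )
        lautung = ET.SubElement(entry, "form", type="lautung", n="1")
        ET.SubElement(lautung, "pron", notation="tustep").text = rec["tustep"]
        geo = ET.SubElement(entry, "usg", type="geo")
        ET.SubElement(geo, "placeName").text = rec["place"]
    return root


entry_free = build_entry_free("Wecken", "Brot", RECORDS)
print(ET.tostring(entry_free, encoding="unicode"))
```

The headword form is emitted once per <superEntry> rather than once per record, which mirrors the "hauptlemma removed from here" comments in the example above.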

Page 14: Exploring data models for heterogenous dialect data: the case of explore.bread.AT!

Problems with <entryFree> model

• It is a hack!
• The current TEI guidelines and data model are inherently and intentionally semasiological, and this use of the vocabulary is valid only by chance, not by intention.

> Thus, using this data model within TEI brings none of the advantages that generally come with using TEI.

Page 15: Exploring data models for heterogenous dialect data: the case of explore.bread.AT!

TBX-TEI Hybrid

Romary (2014):

Attempts to customize the TEI guidelines to incorporate TBX (ISO 30042) terminological entries, in order to provide TEI with an onomasiological model

https://github.com/laurentromary/TBXinTEI

Page 16: Exploring data models for heterogenous dialect data: the case of explore.bread.AT!

TBX-TEI Hybrid

<tbx:termEntry xmlns="http://www.tbx.org"><!-- @xml:id;  -->
   <descrip type="concept" target="concept:Wecken"/> <!-- sense not normally included in TBX! -->
   <descrip type="domain" target="concept:Brot" xml:lang="de">Brot</descrip>
   <descrip type="definition" xml:lang="en">Oblong loaf of bread</descrip>
   <!-- no headword form may occur outside of <langSet> -->

   <langSet xml:id="w834_qdb-d1e602" xml:lang="bar-x-smichael"><!-- language/dialect i) @xml:id;  --><!-- No sense allowed! -->
      <tei:note type="anmerkung" resp="O" corresp="#BD">deren Grundriß ein Oval ist</tei:note><!-- @corresp allowed in TEI <note> but not here -->
      <!-- Most metadata element valid using <tei:ref> but syntactically required to occur before <tig> -->
      <admin type="geo">
         <tei:placeName>St.Michael/B. Bgl.</tei:placeName>
      </admin>
      <tig><!-- <tei:form> would be better -->
         <tei:term type="hauptlemma">Wecken</tei:term>
         <termNote type="transcription">orth</termNote><!-- this is inefficient: need to allow <orth> & <pron> -->
         <termNote type="pos">Subst</termNote><!-- this actually should be applicable to all forms (headword & lemmas) -->
      </tig>
      <tig>
         <tei:term type="lautung" n="1">W.eiggn</tei:term>
         <termNote type="transcription">pron</termNote>
         <termNote type="notation">tustep</termNote><!-- we also need to allow @notation  -->
      </tig>
      <tig><!-- TBX doesn't allow multiple instances of <term> in same <tig> as TEI does with <orth>,<pron> w/in <form> -->
         <tei:term type="lautung" n="1" resp="#JB">ʋɛiggn̩</tei:term>
         <termNote type="transcription" change="1.2">pron</termNote><!-- @change in original not allowed in hybrid schema -->
         <termNote type="notation">ipa</termNote>
      </tig>
   </langSet>
….

Page 17: Exploring data models for heterogenous dialect data: the case of explore.bread.AT!

Problems with TEI-TBX Hybrid model as per the ODD Schema from Romary (2014)

• <tig> is verbose and would be better replaced with <form>
• The order of occurrence of elements is too restricted
• The TBX-dominated schema lacks too many attributes (e.g. @notation) and elements (e.g. <orth>, <pron>) that are key to the storage and representation of lexical data as used in TEI
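The verbosity point can be made concrete with a small sketch that re-expresses one TEI <form> holding two <pron> children as hybrid-style <tig> groups, following the pattern in the example on the previous slide. The function name `form_to_tigs` is an assumption for illustration, and namespaces (the `tei:` prefix on <term>) are left out for brevity; this is not a converter for the actual Romary (2014) schema.

```python
import xml.etree.ElementTree as ET

# One TEI <form> carrying both transcriptions of the same dialect form.
TEI_FORM = (
    '<form type="lautung" n="1">'
    '<pron notation="tustep">W.eiggn</pron>'
    '<pron notation="ipa">ʋɛiggn̩</pron>'
    '</form>'
)


def form_to_tigs(form_xml):
    """Re-express each <orth>/<pron> child as its own <tig> group."""
    form = ET.fromstring(form_xml)
    tigs = []
    for child in form:
        tig = ET.Element("tig")
        term = ET.SubElement(tig, "term", type=form.get("type"))
        if form.get("n") is not None:
            term.set("n", form.get("n"))
        term.text = child.text
        # The element name (orth/pron) becomes a termNote value.
        ET.SubElement(tig, "termNote", type="transcription").text = child.tag
        # The @notation attribute also has to become a termNote.
        if child.get("notation") is not None:
            ET.SubElement(tig, "termNote", type="notation").text = child.get("notation")
        tigs.append(tig)
    return tigs


tigs = form_to_tigs(TEI_FORM)
tei_count = len(list(ET.fromstring(TEI_FORM).iter()))
tig_count = sum(len(list(t.iter())) for t in tigs)
print(tei_count, "TEI elements versus", tig_count, "hybrid elements")
```

In this sketch, information that TEI carries in element names and attributes has to be moved into separate <termNote> elements, and each transcription needs its own <tig>.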

Page 18: Exploring data models for heterogenous dialect data: the case of explore.bread.AT!

Conclusion

(i) TEI lacks a legitimate means of encoding terminological/onomasiological entries;

(ii) Given that we need to include the sense (or a parallel equivalent) and the headword at the top of an entry, a TBX-TEI hybrid doesn't work either without serious modification via ODD, mostly to introduce elements and features from TEI, and without stretching the traditional usage of the system;

(iii) TEI needs to re-introduce a means of onomasiological data representation (such as <termEntry>), but with an expanded set of elements and attributes that matches the degree of expressivity of the Dictionary module.