Querying GrAF data in linguistic analysis
Peter Bouda
Centro Interdisciplinar de Documentação Linguística e Social
[email protected]
Overview
Existing infrastructure and workflows
GrAF
GrAF and TEI
Poio API
Queries in Poio API
Queries in GrAF API
Fieldwork
Photos
Existing Infrastructure
LD tools and standards
Elan: EAF, MPEG, WAV
Toolbox: TXT, XML, WAV
Arbil: IMDI/CMDI (Component MetaData Infrastructure)
Praat: XML, WAV
...
No standards for tier hierarchies, tier names or annotation schemes
Efforts in ISOcat
Interlinear Glossed Text
GrAF
GrAF: Graph Annotation Framework
ISO 24612: Language resource management - Linguistic annotation framework (LAF)
Started as stand-off version of XCES
API and representation as data structures, not a file format
GrAF/XML as XML representation
Used for the Manually Annotated Sub-Corpus (MASC) of the American National Corpus (ANC)
Nodes, edges, regions, annotations, feature structures
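The entities listed above can be pictured as a handful of small data structures. The following is a minimal sketch only: the class names, field names and the sample sentence are assumptions made for illustration and are not the classes of the GrAF Java or Python libraries.

```python
from dataclasses import dataclass, field

# Hypothetical, simplified stand-ins for the GrAF entities.

@dataclass
class Region:
    id: str
    start: int  # character offset into the primary data (inclusive)
    end: int    # character offset into the primary data (exclusive)

@dataclass
class Annotation:
    label: str
    features: dict = field(default_factory=dict)  # flat feature structure

@dataclass
class Node:
    id: str
    regions: list = field(default_factory=list)
    annotations: list = field(default_factory=list)

@dataclass
class Edge:
    id: str
    from_node: Node
    to_node: Node

# The primary data is never modified; all annotations are stand-off.
primary_data = "the dog barks"

word = Node("n1",
            regions=[Region("r1", 4, 7)],
            annotations=[Annotation("pos", {"tag": "N"})])
phrase = Node("n0", annotations=[Annotation("phrase", {"cat": "NP"})])
edge = Edge("e1", phrase, word)

def covered_text(node, text):
    """Return the stretches of primary data that a node's regions point to."""
    return [text[region.start:region.end] for region in node.regions]

print(covered_text(word, primary_data))  # ['dog']
```

Only regions refer to the primary data, so further annotation layers can be added as new nodes and edges without touching the text itself.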
GrAF entities
GrAF structure
GrAF-XML
TEI and GrAF
Schemata for GrAF created with TEI Roma
Customized version of TEI P5 schema
ODD: One Document Does it all
GrAF is not TEI compliant
Share data types and feature structures of annotations
TEI has a stand-off variant, uses XPointer/XLink
Primary data has to be XML
Why we use GrAF
No inline markup
Radical stand-off approach
Easier to share and manage data
Preferred solution to archive cultural heritage
Ideal for sparse annotations
Existing code: Java and Python
API vs. XQuery
The beauty of annotation graphs
Poio API
Think of GrAF as an assembly language for linguistic annotation; Poio API is then a library that maps to and from higher-level languages
Subset of GrAF to represent tier-based annotation
Filters and filter chains for search
Plugin mechanism for file formats
Mapping semantics: tiers and annotations to nodes and edges
Efforts to map between TEI and GrAF
Retro-digitized dictionary data at the University of Marburg are published as GrAF files
We want to publish as TEI
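The mapping from tiers and annotations to nodes and edges can be sketched as follows. This is an illustration under stated assumptions, not Poio API's implementation: the tier names, the sample utterance and the node id scheme ("tier..nN") are all hypothetical.

```python
# Hypothetical tier-based data: one utterance tier, a word tier below it,
# and a part-of-speech tier below each word.
utterances = [
    {"utterance": "the dog barks",
     "words": [("the", "DET"), ("dog", "N"), ("barks", "V")]},
]

nodes = {}   # node id -> (tier name, annotation value)
edges = []   # (parent node id, child node id)

def add_node(tier, value):
    """Create one node per annotation; the id scheme is an assumption."""
    node_id = "{0}..n{1}".format(tier, len(nodes))
    nodes[node_id] = (tier, value)
    return node_id

for utt in utterances:
    utt_id = add_node("utterance", utt["utterance"])
    for word, pos in utt["words"]:
        # The tier hierarchy becomes parent-to-child edges.
        word_id = add_node("word", word)
        edges.append((utt_id, word_id))
        pos_id = add_node("pos", pos)
        edges.append((word_id, pos_id))

print(len(nodes), len(edges))  # 7 6
```

Each annotation becomes a node and each parent-child relation in the tier hierarchy becomes an edge, so a format plugin only has to enumerate annotations in hierarchy order.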
Queries in GrAF API
All queries are in-memory
Users can load parts of the full graph
Annotation graph to network conversion
Python library networkx
Example: Semantic similarity
Queries in GrAF API
for (node_id, node) in graf_graph.nodes.items():
    if node_id.endswith("entry"):
        for e in node.out_edges:
            if e.annotations.get_first().label == "head" or \
                    e.annotations.get_first().label == "translation":
                features = e.to_node.annotations.get_first().features
                substr = features.get_value("substring")
                [...]
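One way to get from the extracted head and translation substrings to a similarity network is to link heads that share a translation. The sketch below is an assumption about that step, not the code used in the talk: the word pairs are made up, and a plain dictionary of sets stands in for the networkx graph mentioned above to keep the example free of dependencies (with networkx, each pair would be passed to Graph.add_edge instead).

```python
import collections
import itertools

# Made-up (head, translation) pairs standing in for the substrings that the
# query above extracts from dictionary entries.
pairs = [
    ("casa", "house"), ("hogar", "house"), ("hogar", "home"),
    ("perro", "dog"),
]

# Step 1: group heads by the translation they share.
heads_by_translation = collections.defaultdict(set)
for head, translation in pairs:
    heads_by_translation[translation].add(head)

# Step 2: connect two heads whenever they share at least one translation.
neighbours = collections.defaultdict(set)
for heads in heads_by_translation.values():
    for a, b in itertools.combinations(sorted(heads), 2):
        neighbours[a].add(b)
        neighbours[b].add(a)

print(sorted(neighbours["casa"]))  # ['hogar']
```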
Queries in Poio API
Example: Word order in Hinuq
Queries in Poio API
import collections

# from_excel() is not shown on the slides
ag = from_excel("data/Hinuq2.csv")
clause_unit_nodes = ag.nodes_for_tier("clause_id")

verbs = [ 'COP', 'cop', 'SAY', 'say', 'v.tr', 'v.intr', 'v.aff' ]
others = [ 'A', 'S', 'P', 'EXP', 'STIM' ]
search_terms = verbs + others

# Maps each observed word order (a tuple of labels) to its frequency
word_orders = collections.defaultdict(int)

for parent_node in clause_unit_nodes:
    word_order = []
    for word_n in parent_node.iter_children():
        a_list = ag.annotations_for_tier("grammatical_relation", word_n)
        if len(a_list) > 0:
            a_value = ag.annotation_value_for_annotation(a_list[0])
            if a_value in search_terms:
                # Collapse all verb labels into a single 'V'
                if a_value in verbs:
                    word_order.append('V')
                else:
                    word_order.append(a_value)
    word_orders[tuple(word_order)] += 1
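Once the counts are collected, they can be ranked by frequency. The following sketch shows one way to do that; the counts are made up to mirror the shape of word_orders and are not results from the Hinuq data.

```python
import collections

# Made-up counts in the shape produced by the loop above.
word_orders = collections.defaultdict(int)
word_orders[("S", "V")] += 3
word_orders[("A", "P", "V")] += 2
word_orders[("V", "S")] += 1

# Sort by descending frequency; each key is a tuple of labels.
ranking = sorted(word_orders.items(), key=lambda item: item[1], reverse=True)

for order, count in ranking:
    # Join with spaces so multi-letter labels such as EXP stay readable.
    print("{0}\t{1}".format(" ".join(order), count))
```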
Filters and filter chains
import poioapi.annotationgraph
import poioapi.data

# Load an ELAN file and use its first tier hierarchy as data structure
ag = poioapi.annotationgraph.AnnotationGraph()
ag.from_elan("elan-example3.eaf")
ag.structure_type_handler = poioapi.data.DataStructureType(ag.tier_hierarchies[0])

# Build a filter tier by tier and append it to the filter chain
af = poioapi.annotationgraph.AnnotationGraphFilter(ag)
af.set_filter_for_tier("words..W-Words", "follow")
af.set_filter_for_tier("part_of_speech..W-POS", r"\bpro\b")
ag.append_filter(af)

print("Filtered root nodes:")
print(ag.filtered_node_ids)

# Alternative: create the filter from a dict of tier names and search terms
search_terms = {
    "words..W-Words": "follow",
    "part_of_speech..W-POS": r"\bpro\b"
}
af = ag.create_filter_for_dict(search_terms)
ag.append_filter(af)
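Conceptually, a filter pairs tier names with regular expressions, and a filter chain narrows the set of root nodes step by step. The sketch below illustrates that idea with the standard library only. It is not Poio API's implementation: the rows, tier names and helper functions are hypothetical, and it assumes that all tier conditions of a filter must match.

```python
import re

# Made-up rows: one dict of tier name -> annotation value per root node.
rows = {
    "n1": {"words": "they follow him", "pos": "pro v pro"},
    "n2": {"words": "dogs follow cats", "pos": "n v n"},
    "n3": {"words": "she sleeps", "pos": "pro v"},
}

def make_filter(patterns):
    """Build a predicate that is true when every tier matches its regex."""
    compiled = {tier: re.compile(p) for tier, p in patterns.items()}
    return lambda row: all(
        rx.search(row.get(tier, "")) for tier, rx in compiled.items())

def apply_chain(rows, chain):
    """Return the ids of the rows that pass every filter in the chain."""
    ids = list(rows)
    for flt in chain:
        ids = [i for i in ids if flt(rows[i])]
    return ids

chain = [make_filter({"words": "follow"}), make_filter({"pos": r"\bpro\b"})]
print(apply_chain(rows, chain))  # ['n1']
```

Because each filter only sees the ids that survived the previous one, appending a filter can only narrow the result.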
Poio Analyzer
Developed for and with Prof. Johannes Helmbrecht, University of Regensburg
How to query the corpus in order to write a descriptive grammar?
Started with a list of requirements
Need to publish and archive queries and results
Thank you for your attention!
Links
Clarin curation project: http://de.clarin.eu/en/discipline-specific-working-groups/wg-3-linguistic-fieldwork-anthropology-language-typology/curation-project-1.html
Poio: http://media.cidles.eu/poio/
GrAF: http://www.xces.org/ns/GrAF/1.0/