How community software supports language documentation and data analysis
DESCRIPTION
Field linguists have increasingly adopted new technologies and tools for language documentation. Their needs have driven remarkable developments in software and archiving, exemplified by work at the MPI in Nijmegen, which leads the innovation cycles in the digital working environments of field linguists. The next step in research is the analysis and theoretical exploitation of the large amount of data collected in the many language documentation projects that use these environments. This research will also rely on computer-based strategies, since the data is readily available in digital formats. In this talk I will introduce some of the lesser-known tools and software packages for annotation and analysis tasks. Some of these tools were created within DOBES projects and/or as community projects by small teams; they can be combined with well-known tools like ELAN or Toolbox to give researchers access to their data. I will focus on how a combination of simple, special-purpose tools makes researchers more productive, and how existing software libraries allow scientific projects to create task-specific software tools tailored to their own needs.
TRANSCRIPT
![Page 1: How community software supports language documentation and data analysis](https://reader035.vdocuments.us/reader035/viewer/2022070304/54c674b14a7959d4168b45a0/html5/thumbnails/1.jpg)
How community software supports language documentation and data analysis
Peter Bouda
Centro Interdisciplinar de Documentação Linguística e Social
Minde/Portugal
![Page 2: How community software supports language documentation and data analysis](https://reader035.vdocuments.us/reader035/viewer/2022070304/54c674b14a7959d4168b45a0/html5/thumbnails/2.jpg)
What is „open“ in software?
• Open Source license (but be careful about restrictions!)
• Make participation easy
  – Documentation
  – Transparent development process (e.g. discuss features publicly)
  – Attract programmers (code quality, make „giving back“ easy, online meetings, code sprints, …)
• Try to create and support a community from the beginning, otherwise nobody will use your code
![Page 3: How community software supports language documentation and data analysis](https://reader035.vdocuments.us/reader035/viewer/2022070304/54c674b14a7959d4168b45a0/html5/thumbnails/3.jpg)
Community
• Software projects are not only source code:
  – Feedback from users
  – Write documentation
  – Test!!! Report bugs
  – In our case: provide data for tests
  – Propose features
• Best code and software quality
• Websites for community development (GitHub, Bitbucket, …)
![Page 4: How community software supports language documentation and data analysis](https://reader035.vdocuments.us/reader035/viewer/2022070304/54c674b14a7959d4168b45a0/html5/thumbnails/4.jpg)
Examples
• EOPAS
• LingPy and qlc
• NLTK (Natural Language Toolkit)
• Poio and PyAnnotation
• Scientific Python
![Page 5: How community software supports language documentation and data analysis](https://reader035.vdocuments.us/reader035/viewer/2022070304/54c674b14a7959d4168b45a0/html5/thumbnails/5.jpg)
EOPAS
• „Ethnographic E-Research Online Presentation System for Interlinear Text”
• Present interlinear text online
• Supported files:
  – Elan
  – Transcriber
  – Toolbox interlinear glossed text
![Page 6: How community software supports language documentation and data analysis](https://reader035.vdocuments.us/reader035/viewer/2022070304/54c674b14a7959d4168b45a0/html5/thumbnails/6.jpg)
EOPAS
![Page 7: How community software supports language documentation and data analysis](https://reader035.vdocuments.us/reader035/viewer/2022070304/54c674b14a7959d4168b45a0/html5/thumbnails/7.jpg)
Why EOPAS is open
• Published on GitHub
• EOPAS is community software:
  – Based on a modern web framework (Ruby on Rails)
  – Clear and documented deployment strategy (how to use it on your own)
  – Easy to maintain, low entry level, good code quality
  – Several options for participation listed on website
• Publish your data with EOPAS and support the development!
![Page 8: How community software supports language documentation and data analysis](https://reader035.vdocuments.us/reader035/viewer/2022070304/54c674b14a7959d4168b45a0/html5/thumbnails/8.jpg)
Poio and PyAnnotation
• Started during my internship in DoBeS project „Minderico - An endangered language in Portugal“
• Ideas and support by Prof. Johannes Helmbrecht
• Support by Institute for General Linguistics and Language Typology at University of Munich
• Support by Institute for General Linguistics at University of Bamberg
![Page 9: How community software supports language documentation and data analysis](https://reader035.vdocuments.us/reader035/viewer/2022070304/54c674b14a7959d4168b45a0/html5/thumbnails/9.jpg)
Poio and PyAnnotation
• PyAnnotation provides access to different file formats
• Provides access to data programmatically (API)
• Poio is a graphical user interface (GUI) on top of PyAnnotation
• Two software packages:
  – Poio Editor
  – Poio Analyzer
![Page 10: How community software supports language documentation and data analysis](https://reader035.vdocuments.us/reader035/viewer/2022070304/54c674b14a7959d4168b45a0/html5/thumbnails/10.jpg)
Poio Editor and Analyzer
• Start: Poio Editor as an „add-on“ to Elan
• Open Elan transcription and add morpho-syntactic annotations
• Analyzer to search in Elan and Toolbox files
• Now: adding support for GRAID (Grammatical Relations and Animacy in Discourse) and any other annotation types
• Goal: a highly customizable desktop software for diverse annotation and analysis scenarios / sparse annotation
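To make the idea of searching interlinear Toolbox data concrete, here is a minimal sketch using only the Python standard library. It is not the Poio Analyzer implementation and does not use the PyAnnotation API; the function names `parse_records` and `search`, the sample records, and the choice of `\ref` as record marker are assumptions made for illustration.

```python
import re

# Invented sample in Toolbox "backslash marker" layout (not real language data).
TOOLBOX_SAMPLE = r"""\ref 001
\tx kawa nipa
\ge dog sleep
\ft The dog sleeps.

\ref 002
\tx nipa mita
\ge sleep child
\ft The child sleeps.
"""


def parse_records(text, record_marker="ref"):
    """Split Toolbox text into a list of {marker: value} dictionaries.

    Assumption: every record starts with the given record marker and each
    field fits on one line; wrapped lines are ignored in this sketch.
    """
    records = []
    current = None
    for line in text.splitlines():
        match = re.match(r"\\(\w+)\s*(.*)", line)
        if not match:
            continue  # blank or continuation line
        marker, value = match.groups()
        if marker == record_marker:
            current = {}
            records.append(current)
        if current is not None:
            current[marker] = value.strip()
    return records


def search(records, marker, pattern):
    """Return all records whose field `marker` matches the regex `pattern`."""
    regex = re.compile(pattern)
    return [record for record in records if regex.search(record.get(marker, ""))]


records = parse_records(TOOLBOX_SAMPLE)
hits = search(records, "ge", r"\bdog\b")
print([record["ref"] for record in hits])  # ['001']
```

Searching on the gloss tier (`ge`) rather than the text tier is the typical case for analysis queries; the same function works for any marker present in the records.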
![Page 11: How community software supports language documentation and data analysis](https://reader035.vdocuments.us/reader035/viewer/2022070304/54c674b14a7959d4168b45a0/html5/thumbnails/11.jpg)
Live Demo
![Page 12: How community software supports language documentation and data analysis](https://reader035.vdocuments.us/reader035/viewer/2022070304/54c674b14a7959d4168b45a0/html5/thumbnails/12.jpg)
PyAnnotation
• Parses LD files:
  – Elan
  – Toolbox
  – Kura
• Unified data access through API
• Modify data structure and write Elan files again
  – good for batch processing
• Combines well with Scientific Python for analysis
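The read, modify, and write-back cycle described above can be sketched with the Python standard library alone. This is not the PyAnnotation API: the helper names `read_tiers` and `replace_in_tier` and the sample document are hypothetical. The sketch relies only on the general layout of ELAN's EAF files, which are XML documents with `TIER`, `ANNOTATION` and `ANNOTATION_VALUE` elements.

```python
import xml.etree.ElementTree as ET

# Hand-written, heavily reduced EAF fragment (illustrative only).
EAF_SAMPLE = """<ANNOTATION_DOCUMENT>
  <TIER TIER_ID="utterance" LINGUISTIC_TYPE_REF="default">
    <ANNOTATION>
      <ALIGNABLE_ANNOTATION ANNOTATION_ID="a1" TIME_SLOT_REF1="ts1" TIME_SLOT_REF2="ts2">
        <ANNOTATION_VALUE>first utterance</ANNOTATION_VALUE>
      </ALIGNABLE_ANNOTATION>
    </ANNOTATION>
  </TIER>
  <TIER TIER_ID="translation" LINGUISTIC_TYPE_REF="translation" PARENT_REF="utterance">
    <ANNOTATION>
      <REF_ANNOTATION ANNOTATION_ID="a2" ANNOTATION_REF="a1">
        <ANNOTATION_VALUE>free translation</ANNOTATION_VALUE>
      </REF_ANNOTATION>
    </ANNOTATION>
  </TIER>
</ANNOTATION_DOCUMENT>
"""


def read_tiers(eaf_text):
    """Return {tier_id: [(annotation_id, value), ...]} for an EAF document."""
    root = ET.fromstring(eaf_text)
    tiers = {}
    for tier in root.iter("TIER"):
        rows = []
        for annotation in tier.findall("ANNOTATION"):
            # Each ANNOTATION wraps one ALIGNABLE_ANNOTATION or REF_ANNOTATION.
            inner = annotation[0]
            value = inner.findtext("ANNOTATION_VALUE", default="")
            rows.append((inner.get("ANNOTATION_ID"), value))
        tiers[tier.get("TIER_ID")] = rows
    return tiers


def replace_in_tier(eaf_text, tier_id, old, new):
    """Batch-edit one tier and return the modified document as XML text."""
    root = ET.fromstring(eaf_text)
    for tier in root.iter("TIER"):
        if tier.get("TIER_ID") != tier_id:
            continue
        for value in tier.iter("ANNOTATION_VALUE"):
            if value.text:
                value.text = value.text.replace(old, new)
    return ET.tostring(root, encoding="unicode")


print(read_tiers(EAF_SAMPLE)["utterance"])  # [('a1', 'first utterance')]
updated = replace_in_tier(EAF_SAMPLE, "translation", "free", "literal")
print(read_tiers(updated)["translation"])  # [('a2', 'literal translation')]
```

Because the functions take and return plain text, running them over a whole directory of files is a short loop, which is where batch processing pays off.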
![Page 13: How community software supports language documentation and data analysis](https://reader035.vdocuments.us/reader035/viewer/2022070304/54c674b14a7959d4168b45a0/html5/thumbnails/13.jpg)
Scientific Python
• Python programming language
• Collection of tools and scientific libraries:
  – IPython
  – NumPy and SciPy, scikit-learn, networkx, …
  – An easy installer: Python(x,y)
• Alternative to Matlab, R, and other mathematical tools
• But: general usage for software development
![Page 14: How community software supports language documentation and data analysis](https://reader035.vdocuments.us/reader035/viewer/2022070304/54c674b14a7959d4168b45a0/html5/thumbnails/14.jpg)
Live Demo
![Page 15: How community software supports language documentation and data analysis](https://reader035.vdocuments.us/reader035/viewer/2022070304/54c674b14a7959d4168b45a0/html5/thumbnails/15.jpg)
CLARIN
• „Common Language Resources and Technology Infrastructure“
• Current projects of CLARIN-D:
  – Weblicht: SOA to create annotated corpora
  – Virtual Language Observatory
• “Kurationsprojekt” to develop software framework to access fieldwork data
• Among others: based on code of PyAnnotation
![Page 16: How community software supports language documentation and data analysis](https://reader035.vdocuments.us/reader035/viewer/2022070304/54c674b14a7959d4168b45a0/html5/thumbnails/16.jpg)
Framework for Fieldwork Data
• Improve annotation and analysis tasks based on documentation data
• Build a bridge between LD and NLP data formats and technology
  – Lexan, UIMA, …
• DoBeS corpus as central resource
• Develop a basic software library
• Web API and web app as reference implementation
![Page 17: How community software supports language documentation and data analysis](https://reader035.vdocuments.us/reader035/viewer/2022070304/54c674b14a7959d4168b45a0/html5/thumbnails/17.jpg)
Library to access LD data
![Page 18: How community software supports language documentation and data analysis](https://reader035.vdocuments.us/reader035/viewer/2022070304/54c674b14a7959d4168b45a0/html5/thumbnails/18.jpg)
Generic representation
• GrAF: Graph Annotation Framework
• Based on annotation graphs
• Developed at American National Corpus
• Common representation helps to process and analyze data from different sources
• Map LD data (Elan, Toolbox, …) to GrAF
• Users work with „structures“ (GRAID, Morph-Syntax, POS, …) that can be mapped to a GUI
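As a rough illustration of what an annotation graph looks like in code, the following sketch stores tier units as nodes and parent-child relations as edges. It follows the general idea of annotation graphs only; it is not the GrAF schema or any library's API. The class names, node identifiers and sample data are assumptions.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    """One annotated unit, e.g. an utterance, a word or a gloss."""
    node_id: str
    annotations: dict = field(default_factory=dict)


@dataclass
class AnnotationGraph:
    nodes: dict = field(default_factory=dict)  # node_id -> Node
    edges: list = field(default_factory=list)  # (parent_id, child_id) pairs

    def add_node(self, node_id, parent_id=None, **annotations):
        """Add a node and, optionally, an edge from its parent."""
        self.nodes[node_id] = Node(node_id, dict(annotations))
        if parent_id is not None:
            self.edges.append((parent_id, node_id))
        return self.nodes[node_id]

    def children(self, node_id):
        """Return the child nodes of `node_id` in insertion order."""
        return [self.nodes[c] for p, c in self.edges if p == node_id]


# Map one invented interlinear record (utterance > word > gloss) to the graph.
graph = AnnotationGraph()
graph.add_node("utterance/1", text="kawa nipa")
graph.add_node("word/1", parent_id="utterance/1", text="kawa")
graph.add_node("word/2", parent_id="utterance/1", text="nipa")
graph.add_node("gloss/1", parent_id="word/1", text="dog")
graph.add_node("gloss/2", parent_id="word/2", text="sleep")

words = [node.annotations["text"] for node in graph.children("utterance/1")]
print(words)  # ['kawa', 'nipa']
```

Once Elan and Toolbox data are both mapped to such a structure, the same traversal code works regardless of the file format the data came from.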
![Page 19: How community software supports language documentation and data analysis](https://reader035.vdocuments.us/reader035/viewer/2022070304/54c674b14a7959d4168b45a0/html5/thumbnails/19.jpg)
Morpho-syntactic structure
![Page 20: How community software supports language documentation and data analysis](https://reader035.vdocuments.us/reader035/viewer/2022070304/54c674b14a7959d4168b45a0/html5/thumbnails/20.jpg)
Custom structures
• Morpho-syntactic vs. GRAID
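One way to picture the difference between the two structure types is to write each as a nested list of tier names and derive the parent of every tier from it. The nested-list convention, the tier names (`clause_unit`, `wfw`, `graid1`, `graid2`) and the helper `parent_map` are assumptions for this sketch, not definitions taken from the slides.

```python
# First element of each list is a tier; the remaining elements are its children.
MORPHO_SYNTACTIC = ["utterance", ["word", ["morpheme", ["gloss"]]], "translation"]
GRAID = [
    "utterance",
    ["clause_unit", ["word", "wfw", "graid1"], "graid2"],
    "translation",
]


def parent_map(hierarchy, parent=None):
    """Return {tier: parent_tier} for a nested-list tier hierarchy."""
    head, *children = hierarchy
    mapping = {head: parent}
    for child in children:
        if isinstance(child, list):
            mapping.update(parent_map(child, head))
        else:
            mapping[child] = head
    return mapping


print(parent_map(MORPHO_SYNTACTIC)["gloss"])  # morpheme
print(parent_map(GRAID)["graid2"])  # clause_unit
```

A user interface can be generated from such a description: each tier becomes an editable row, nested under the row of its parent tier.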
![Page 21: How community software supports language documentation and data analysis](https://reader035.vdocuments.us/reader035/viewer/2022070304/54c674b14a7959d4168b45a0/html5/thumbnails/21.jpg)
Development
• Library is developed at CIDLeS
• Web API and app are developed at CCeH/University of Cologne
• Coordination at Institute for Linguistics, University of Cologne
• August/September 2012 – July 2013
![Page 22: How community software supports language documentation and data analysis](https://reader035.vdocuments.us/reader035/viewer/2022070304/54c674b14a7959d4168b45a0/html5/thumbnails/22.jpg)
Support Open Software!
• Use existing projects whenever possible and contribute by giving feedback
• In LD: data drives development, developers need files to test
• Share your code as soon as possible
• Use existing infrastructure like GitHub to share
![Page 23: How community software supports language documentation and data analysis](https://reader035.vdocuments.us/reader035/viewer/2022070304/54c674b14a7959d4168b45a0/html5/thumbnails/23.jpg)
Thank you for your attention!
• [email protected]
• Become a member of CIDLeS to support our software development: www.cidles.eu
![Page 24: How community software supports language documentation and data analysis](https://reader035.vdocuments.us/reader035/viewer/2022070304/54c674b14a7959d4168b45a0/html5/thumbnails/24.jpg)
Links
• EOPAS: http://www.eopas.org/
• Poio and PyAnnotation: http://www.cidles.eu/ltll/poio
• Python(x,y): http://code.google.com/p/pythonxy/
• LingPy: http://lingulist.de/lingpy/
• QLC: https://github.com/pbouda/qlc
• NLTK: http://nltk.org/
![Page 25: How community software supports language documentation and data analysis](https://reader035.vdocuments.us/reader035/viewer/2022070304/54c674b14a7959d4168b45a0/html5/thumbnails/25.jpg)
Links
• Apache UIMA: http://uima.apache.org/
• CLARIN-D: http://de.clarin.eu/index.php/en/