CLARIN-PL
Language Technology for Polish in Practice
G4.19 Language Technology and Computational Linguistics Research Group and CLARIN-PL Research Infrastructure in brief
Maciej Piasecki, Marcin Pol, Tomasz Walkowiak Wrocław University of Science and Technology
G4.19 Research Group [email protected]
2017-01-17
G4.19 Research Group
§ Location
  § Department of Computational Intelligence
  § Faculty of Computer Science and Management
  § Wrocław University of Science and Technology
§ Subgroups
  § Computational Semantics
  § Information Extraction
  § Language Technology
  § Corpus Linguistics
  § Polish Lexicography
  § Polish-English Lexicography
  § Sentiment and Emotional Resources
§ Staff and permanent collaborators: 35
§ http://nlp.pwr.edu.pl
Samsung R&D Institute
Invit. Lecture 2017-01-17
CLARIN-PL
CLARIN Support for Humanities & Social Sciences
§ CLARIN is an ERIC-type consortium of
  § 19 members: Austria, Bulgaria, Czech Republic, Denmark, Estonia, Finland, Germany, Greece, Hungary, Italy, Latvia, Lithuania, The Netherlands, Norway, Poland, Portugal, Slovenia, Sweden and The Dutch Language Union
    § … Poland … - founding members
  § 1 observer: United Kingdom
§ Focus area: supporting research in Humanities and Social Sciences
§ CLARIN Mission
  § To significantly lower the barriers for the use of Language Technology in Humanities & Social Sciences (H&SS)
  § To facilitate or enable research methods based on automated analysis of text and speech resources
http://www.clarin.eu
Basic Notions
§ Language Technology (LT)
  § language resources and tools
  § robust in terms of quality and coverage
  § multipurpose
  § component based
§ Language Technology Infrastructure
  § a software framework (architecture or platform)
  § for combining language tools with language resources into processing chains (or pipelines)
  § the defined processing chains are then applied to language data sources
  § interoperability, also with external systems
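The idea of combining tools into a processing chain can be illustrated with a minimal Python sketch. The helper `make_chain` and the toy tools `segment`, `lowercase` and `count_forms` are hypothetical stand-ins invented for this illustration; they are not part of the CLARIN-PL infrastructure.

```python
from functools import reduce
from typing import Callable, Dict, List

# A "tool" is any function that maps one representation of the data to the next.
Tool = Callable[[object], object]


def make_chain(*tools: Tool) -> Tool:
    """Compose tools left to right into one processing chain."""
    return lambda data: reduce(lambda acc, tool: tool(acc), tools, data)


def segment(text: str) -> List[str]:
    """Toy segmentation: split on whitespace (stand-in for a real tokeniser)."""
    return text.split()


def lowercase(tokens: List[str]) -> List[str]:
    """Toy normalisation step."""
    return [t.lower() for t in tokens]


def count_forms(tokens: List[str]) -> Dict[str, int]:
    """Toy analysis step: count occurrences of each word form."""
    counts: Dict[str, int] = {}
    for t in tokens:
        counts[t] = counts.get(t, 0) + 1
    return counts


# The chain is defined once and can then be applied to any data source.
chain = make_chain(segment, lowercase, count_forms)
result = chain("Ala ma kota a kot ma Alę")
```

Because every tool only depends on the output of the previous one, a tool can be swapped for another with the same input and output type without touching the rest of the chain.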
LT in Humanities and Social Sciences: Problems
§ Limited usage of LT in Humanities and Social Sciences
  § hard to find: dispersed across the Web, poorly described in technical language
  § a variety of technological solutions, insufficiently powerful user computers
  § required programming skills or knowledge of natural language engineering
§ LT Infrastructure for H&SS
  § common standards, combined platforms, open approaches
  § joint catalogues and search facilities
  § focused on H&SS users
  § Web Services and Web Applications: no installation needed, processing focused on H&SS research tasks
LT in Humanities and Social Sciences: Barriers
§ Physical – language tools and resources are not accessible on the Internet
§ Informational – descriptions are not available or there is no means of searching them
§ Technological – lack of commonly accepted standards for LT, lack of a common platform, a variety of technological solutions, insufficiently powerful user computers
§ Related to knowledge – the use of LT requires programming skills or knowledge of natural language engineering
§ Legal – licences for language resources and tools (LRTs) limit their applications
CLARIN Offer
§ Integration of different LT components into one interoperable system
§ Common, flexible meta-data standard (CMDI)
§ Central searching for resources (Virtual Language Observatory)
§ One sign-on and one login into the distributed infrastructure
§ Common standards: promoting, co-ordinating, harmonising
§ Web Services for Language Tools and Resources
§ Installation-free Web Applications for research tasks
§ Common licences and promotion of open access
CLARIN: Central Services https://www.clarin.eu/
CLARIN-PL: the Consortium
§ Consortium
  § Wrocław University of Science and Technology, Computational Intelligence Department, G4.19 Research Group
  § Institute of Computer Science, Polish Academy of Sciences
  § Polish-Japanese Institute of Information Technology, Chair of Multimedia
  § University of Łódź, PELCRA group at Chair of English Language and Applied Linguistics
  § Institute of Slavic Studies, Polish Academy of Sciences
  § Wrocław University
§ Goal:
  § implementation of the Polish part of the CLARIN ERIC LTI
§ CLARIN-PL Language Technology Centre http://ww.clarin-pl.eu
CLARIN-PL Development
§ Bi-directional development of LTI (Piasecki, 2014)
  § Bottom-up - development of the necessary basic elements of LTI
    § a distributed network infrastructure
    § basic LT processing chain
  § Top-down
    § user’s needs → web-based research applications
    § close co-operation with key users from the H&SS domain
    § amendments to the shape of the technical basis: LRTs, standards,
    § inspirations, identification of the further user needs, next iterations …
CLARIN-PL: the Consortium
§ Polish scientific consortium
  § Wrocław University of Technology, G4.19 Research Group
  § Institute of Computer Science, Polish Academy of Sciences
  § Polish-Japanese Institute of Information Technology, Chair of Multimedia
  § University of Łódź, PELCRA group at Chair of English Language and Applied Linguistics
  § Institute of Slavic Studies, Polish Academy of Sciences
  § Wrocław University
§ Goal: implementation of the Polish part of the CLARIN ERIC LTI
§ Follows the bi-directional approach to LTI development
CLARIN-PL Language Technology Centre
§ Located in Wrocław University of Technology
  § modified D-Space system (Lindat, Czech CLARIN)
  § One sign-on, one login (Pioneer.id)
§ Advanced repository system for language resources
  § Persistent Identifiers for resources and tools
  § CMDI meta-data standard (Virtual Language Observatory)
  § Interface for Federated Content Search
  § depositing service for researchers from H&SS
§ Web Services for LRTs:
  § Basic processing chain of Polish
  § Prototype system for flexible composition of the natural language processing chains
  § Support for developers: SOAP & REST interfaces
§ Web Applications for LRTs
§ Knowledge Sharing: expertise and support for the users of LT
CLARIN-PL: Language Resources
1. Polish Morphological Dictionary
2. Polish Speech Corpora
3. Annotated Polish Corpora
4. Bilingual Corpora
5. Polish Historical Corpus
6. Semantic lexicon
  § Wordnet for Polish
  § formal description of lexical meanings
7. Dictionary of Multiword Expressions
8. Bilingual semantic lexicon
9. Lexicon of Proper Names
10. Syntactic-semantic Valency Dictionary
11. Robust syntactic-semantic grammar
CLARIN-PL: Language Resources
1. Polish Morphological Dictionary
2. Polish Speech Corpora
3. Annotated Polish Corpora
4. Bilingual Corpora
5. Polish Historical Corpus
6. Semantic lexicon
  § plWordNet 3.0
  § formal description of lexical meanings
7. Dictionary of Multiword Expressions
8. Bilingual semantic lexicon
9. Lexicon of Proper Names
10. Syntactic-semantic Valency Dictionary Walenty
11. Robust syntactic-semantic grammar
Basic Language Tools for Polish
1. Segmentation into tokens and sentences
2. Morphological analysis
3. Morphological guessing of unknown words (both without context and context sensitive)
4. Morpho-syntactic tagging
5. Word Sense Disambiguation
6. Chunker and shallow syntactic parser
7. Named Entity Recognition and disambiguation
8. Co-reference and anaphora resolution
9. Temporal expression recognition
10. Semantic relation recognition
11. Event recognition
12. Shallow semantic parser
13. Deep syntactic parser with disambiguated output: dependency and constituent
14. Deep semantic parser
Services - architecture
[Architecture diagram. Components shown: G4.19 Web applications; NLP Services (REST, SOAP); Server; NLP Engine; NLP Workers: Worker 1 (WCRFT2), Worker 2 (Liner2), Worker 3 (WSD), Worker n+1 (Serel); NFS; Monitoring; Internal network]
§ Efficiency
  § Parallel processing (Walkowiak, 2015)
  § Private cloud, scaling
  § File identifiers on input/output of tools
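The efficiency points above can be sketched in Python as follows. This is only an illustration of the general idea that workers run in parallel and exchange file identifiers rather than file contents. The `FileStore` class and `upper_worker` function are hypothetical; the in-memory dictionary stands in for shared storage, and nothing here reflects the actual engine's code.

```python
import uuid
from concurrent.futures import ThreadPoolExecutor
from threading import Lock
from typing import Dict, List


class FileStore:
    """In-memory stand-in for shared storage: maps file identifiers to content."""

    def __init__(self) -> None:
        self._files: Dict[str, str] = {}
        self._lock = Lock()

    def put(self, content: str) -> str:
        """Store content and return a fresh identifier for it."""
        file_id = uuid.uuid4().hex
        with self._lock:
            self._files[file_id] = content
        return file_id

    def get(self, file_id: str) -> str:
        with self._lock:
            return self._files[file_id]


def upper_worker(store: FileStore, in_id: str) -> str:
    """Hypothetical tool: reads its input by identifier and returns
    the identifier of its output instead of the output itself."""
    return store.put(store.get(in_id).upper())


store = FileStore()
input_ids: List[str] = [store.put(t) for t in ["ala ma kota", "kot ma ale"]]

# Each input is handled by a separate worker thread; map() keeps input order.
with ThreadPoolExecutor(max_workers=2) as pool:
    output_ids = list(pool.map(lambda i: upper_worker(store, i), input_ids))

outputs = [store.get(i) for i in output_ids]
```

Passing identifiers keeps the messages between services small, and the output identifier of one tool can be handed directly to the next tool as its input.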
Web Services - choreography
§ Flexibility
  § complex processing pipelines
  § tools from the machine learning area
[Diagram. Components shown: two chains of WCRFT, LINER2, SEREL; SuperMatrix]
Services in LTC CLARIN-PL
§ Examples of implemented services
  § Conversion: any2txt
  § Language tools: wcrft2, chunker, chunkrel, serel, liner2, wosedon
  § Extraction of feature vectors for texts: Fextor, FextorBis
  § Text clustering and classification: stylo, cluto, SVM
  § Communication (files, URLs, e-mails), integration with DSpace
§ Ongoing work
  § Format converters, monitoring
  § Applications for specific research tasks
§ Possible linking of other tools
  § Virtual machines + simple API
  § Redirecting to foreign services (WebLicht, Multiservice)
Integrated environment
§ Repository is integrated with Language Tools
§ Simple corpus preprocessing for systems like Inforex
§ One user account for all tools and DSpace
[Diagram. Labels shown: Processing pipeline (WS1, WS2, WS3); D-SPACE; API for Language Tools; Temporary data; Resources / data; Request from DSpace; Inforex; Prepared data]
Bi-directional - Top-down Part: First Applications
§ Approaching users
  § already active, interested, working on large textual and speech resources, …
  § covering a maximal variety of research areas, e.g. linguistics, literary studies, psychology, political studies and sociology
  § matching the available language tools for Polish
  § the first set of several prototype applications illustrating possibilities and facilitating identification of the needs
§ First applications
  § Spokes – searching corpora of conversational data
  § A system for collecting Polish text corpora from the Web
  § An open textometric and stylometric system focused on Polish
  § Semantic text classification for sociology
  § Literary Map
Spokes (University of Łódź) http://spokes.clarin-pl.eu
System for Collecting Polish Text Corpora from the Web
§ Results from the gaps in the available technology revealed by the users
  § existing corpus building systems were too sensitive to text encoding errors found on the web
  § not designed for informal corpora like blogs
§ A system for collecting Polish text corpora from the Web had to be constructed:
  § based on tools from the Masaryk University in Brno
  § detects texts containing a large number of errors (by morphological analysis)
  § supports semi-automated extraction of texts from blogs, posts on forums, etc.
  § integrated with tools for processing
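Detecting error-laden texts through morphological analysis can be approximated, for illustration only, by measuring the share of tokens that an analyser does not recognise. In the Python sketch below the toy `LEXICON` set stands in for a real morphological analyser, and the function names and the 0.3 threshold are assumptions made for this example, not values from the described system.

```python
import re
from typing import Set


def unknown_ratio(text: str, lexicon: Set[str]) -> float:
    """Share of tokens not recognised by the (toy) lexicon."""
    tokens = re.findall(r"\w+", text.lower())
    if not tokens:
        return 1.0  # treat empty texts as unusable
    unknown = sum(1 for t in tokens if t not in lexicon)
    return unknown / len(tokens)


def keep_text(text: str, lexicon: Set[str], max_unknown: float = 0.3) -> bool:
    """Keep a text only if the share of unrecognised tokens is small enough."""
    return unknown_ratio(text, lexicon) <= max_unknown


# Toy lexicon: a real system would query a morphological analyser instead.
LEXICON = {"ala", "ma", "kota", "i", "psa"}

clean = "Ala ma kota i psa"
noisy = "Ala ma k0ta i ps@ xzq"

clean_kept = keep_text(clean, LEXICON)
noisy_kept = keep_text(noisy, LEXICON)
```

With this toy lexicon the clean sentence is kept, while the noisy one is rejected because half of its tokens are unrecognised.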
Open Textometric and Stylometric System
§ System designed for characteristic features of Polish
  § like rich inflection, weakly constrained word order
§ Based on several existing components including Stylo (Eder & Rybicki)
§ Enabling the use of features defined on any level of the linguistic structure:
  § from the level of word forms
  § up to the level of the semantic-pragmatic structures.
§ Available as a Web Application and a Web Service
§ Stylometric techniques appear to be applicable in many tasks of H&SS
  § sociology (features that are characteristic of different subgroups), political studies (similarities and differences between political parties), literary studies …
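At the lowest level mentioned above, the level of word forms, a stylometric comparison can be sketched in Python as follows. The sketch uses plain relative frequencies and a Manhattan distance; it is a simplified illustration and not the method implemented in Stylo or in the described system. All function names and the sample texts are invented for the example.

```python
from collections import Counter
from typing import Dict, List


def relative_freqs(tokens: List[str]) -> Dict[str, float]:
    """Relative frequency of each word form in a tokenised text."""
    counts = Counter(tokens)
    total = len(tokens)
    return {w: c / total for w, c in counts.items()}


def feature_vector(tokens: List[str], vocabulary: List[str]) -> List[float]:
    """Describe a text by the relative frequencies of the chosen word forms."""
    freqs = relative_freqs(tokens)
    return [freqs.get(w, 0.0) for w in vocabulary]


def manhattan(a: List[float], b: List[float]) -> float:
    """Sum of absolute differences between two feature vectors."""
    return sum(abs(x - y) for x, y in zip(a, b))


doc_a = "i tak i nie i tak".split()
doc_b = "nie i nie tak".split()
vocab = ["i", "tak", "nie"]  # in practice: the most frequent forms in the corpus

vec_a = feature_vector(doc_a, vocab)
vec_b = feature_vector(doc_b, vocab)
distance = manhattan(vec_a, vec_b)
```

Features from higher levels of linguistic structure, such as lemmas or grammatical classes, would fit the same scheme: only the items being counted change, not the comparison.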
Semantic Text Classification for Sociology
§ Users: Collegium Civitas, Warsaw
§ Goal
  § Support for large scale analysis of the source materials
  § Automatically annotate documents and text fragments with pre-defined semantic categories
  § Definition of categories by examples
  § Automated semantic grouping of documents and text fragments
§ Support for
  § Corpus building
  § Manual annotation of the learning sub-corpus
  § Automated annotation process
  § Statistical analysis of the results
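Defining categories by examples can be illustrated with a very small Python classifier. Each category is represented by the summed bag of words of its example texts, and a new text receives the category whose profile is most similar under cosine similarity. This is a generic technique chosen for illustration; the slides do not state which classifier the actual system uses, and the categories and example texts below are invented.

```python
import math
from collections import Counter
from typing import Dict, List


def bag(text: str) -> Counter:
    """Bag-of-words representation of a text."""
    return Counter(text.lower().split())


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two bags of words."""
    dot = sum(a[w] * b[w] for w in a if w in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


def train(examples: Dict[str, List[str]]) -> Dict[str, Counter]:
    """Build one word profile per category from its example texts."""
    profiles: Dict[str, Counter] = {}
    for category, texts in examples.items():
        profile: Counter = Counter()
        for t in texts:
            profile.update(bag(t))
        profiles[category] = profile
    return profiles


def classify(text: str, profiles: Dict[str, Counter]) -> str:
    """Assign the category whose profile is most similar to the text."""
    doc = bag(text)
    return max(profiles, key=lambda c: cosine(doc, profiles[c]))


# Invented examples playing the role of the manually annotated sub-corpus.
EXAMPLES = {
    "education": ["school teacher lesson", "students school exam"],
    "health": ["doctor hospital patient", "patient nurse clinic"],
}

profiles = train(EXAMPLES)
label = classify("the teacher prepared the exam", profiles)
```

Here the annotated examples take the place of explicit category definitions: adding or correcting examples changes the profiles, and with them the automated annotation.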
Conclusions
§ Application of LT to research in Humanities & Social Sciences seems to be much more challenging than in commercial systems!
§ LT for Polish has reached a stage at which valuable support can be provided for research applications
§ Bi-directional approach combines
  § development of the basic, universal set of language tools and resources
  § with inspirations from the research applications
Bibliography
§ Piasecki, Maciej (2014) User-driven Language Technology Infrastructure – the Case of CLARIN-PL. In Proceedings of the Ninth Language Technologies Conference, Ljubljana, Slovenia, 2014. http://nl.ijs.si/isjt14/proceedings/isjt2014_01.pdf
§ Pęzik, Piotr (2015) Spokes – a search and exploration service for conversational corpus data. In Selected Papers from the CLARIN 2014 Conference, October 24-25, 2014, Soesterberg, The Netherlands, pp. 99-109, Linköping University Electronic Press, Linköpings universitet, ISBN: 978-91-7685-954-4. http://www.ep.liu.se/ecp/116/009/ecp15116009.pdf
§ Walkowiak, Tomasz (2015) Web based engine for processing and clustering of Polish texts. In Proceedings of the Tenth International Conference on Dependability and Complex Systems DepCoS-RELCOMEX 2015, Springer-Verlag.