Page 1

CLARIN-PL

CLARIN-PL – Language Technology Infrastructure Open for Users

Maciej Piasecki Wrocław University of Technology, G4.19 Research Group

Violetta Koseska-Toszewa Institute of Slavic Studies PAS

Krzysztof Marasek Polish-Japanese Academy of Information Technology

Adam Pawłowski Wrocław University

Piotr Pęzik University of Łódź

Adam Przepiórkowski Institute of Computer Science PAS

2015-09-16

Page 2: CLARIN-PL – Language Technology Infrastructure …...2015/10/16  · 2.Disambiguation of the referents 3.Mapping the referents onto a map (geo-location) 4.Recognition of the semantic

CLARIN-PL: the Consortium

§ Wrocław University of Technology
  § G4.19 Language Technology and Computational Linguistics Research Group
§ Institute of Computer Science, Polish Academy of Sciences
§ Institute of Slavic Studies, Polish Academy of Sciences
§ Polish-Japanese Academy of Information Technology
  § Chair of Multimedia
§ University of Łódź
  § PELCRA group at the Chair of English Language and Applied Linguistics
§ Wrocław University
  § Institute of Library Studies and Scientific Information

CLARIN Annual Conference 2015, Wrocław, 2015-10-16

Page 3

Development Paradigm

§ Bi-directional approach
  § Technology-centred
    § CLARIN centre
    § Language Resources and Tools: publishing, linking, developing
  § User-centred
    § development of a set of research applications
§ Bottom-up
  § a collected offer approach
  § focus on accessibility, technical interoperability and processing chains
§ Top-down
  § following user-centred design paradigm
  § research applications for H&SS are a starting point

Page 4

CLARIN-PL: Pillars

§ CLARIN-PL Language Technology Centre www.clarin-pl.eu
  § the Polish node of the CLARIN distributed infrastructure
§ Complete set of the basic Language Resources & Tools for Polish
  § filling gaps in the set of basic Language Resources and Tools for Polish
§ Research applications for H&SS
  § first set for key users and selected sub-domains of H&SS

Page 5

CLARIN-PL Language Technology Centre

§ Location: Wrocław University of Technology
§ Based on a modified DSpace system from LINDAT (Czech CLARIN)
§ Certified B-type centre
§ Single login based on the PIONIER.Id federation
§ Repository system for language resources
  § persistent identifiers for resources and tools
  § CMDI meta-data
  § interface for Federated Content Search
  § depositing services
§ Web Services for LRTs (REST and SOAP):
  § basic processing chain for Polish
  § prototype system for flexible composition of natural language processing chains
§ Web Applications for LRTs
§ Knowledge Sharing: expertise and support for the users

Page 6

Wrocław University of Technology Resources (selected)

§ plWordNet 3.0 emo
  § a comprehensive description of the Polish lexico-semantic system (~200 000 lemmas, ~280 000 senses)
  § the largest wordnet in the world, annotated with sentiment and basic emotions, manually mapped to Princeton WordNet
§ enWordNet 1.0: expanded Princeton WordNet 3.1 (+10 000 lemmas)
§ Korpus Politechniki Wrocławskiej
  § an open Polish corpus with rich annotation on several levels
§ Dictionary of multiword expressions described syntactically
§ NELexicon 2.0 – a very large lexicon of Polish Proper Names (2.5 million)

Page 7

plWordNet 2.3 emo & enWordNet 0.1

http://plwordnet.pwr.edu.pl

Page 8

plWordNet 2.3 emo http://plwordnet.pwr.edu.pl

Page 9

Wrocław University of Technology Tools (selected)

§ Inforex – a web-based system for corpus annotation
§ MeWeX – a system for extraction of multiword expressions (collocations)
§ WoSeDon – Word Sense Disambiguation and sense-based statistical analysis
§ Information Extraction (Text mining)
  § recognition of Proper Names, anaphoric links, time expressions and spatial expressions
  § event recognition
§ Shallow semantic dependency parser
§ Extraction of the semantic-pragmatic information
  § keywords, text semantic relations and text summaries
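As an illustration of the collocation extraction named for MeWeX above, the sketch below ranks adjacent word pairs by pointwise mutual information (PMI). The slides do not say which association measures MeWeX uses, so PMI, the function name `pmi_bigrams` and the `min_count` threshold are assumptions chosen only to show the general idea.

```python
from collections import Counter
from math import log2


def pmi_bigrams(tokens, min_count=2):
    """Rank adjacent word pairs by PMI (an assumed measure, not MeWeX's own)."""
    unigrams = Counter(tokens)
    bigrams = Counter(zip(tokens, tokens[1:]))
    n_uni = len(tokens)
    n_bi = max(len(tokens) - 1, 1)
    scores = {}
    for (w1, w2), count in bigrams.items():
        if count < min_count:
            continue  # rare pairs give unreliable PMI estimates
        p_xy = count / n_bi
        p_x = unigrams[w1] / n_uni
        p_y = unigrams[w2] / n_uni
        scores[(w1, w2)] = log2(p_xy / (p_x * p_y))
    # Highest PMI first: pairs that co-occur more often than chance predicts
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)


tokens = "strong tea and strong tea and milk and tea and sugar".split()
ranked = pmi_bigrams(tokens)
```

On this toy input only two pairs pass the frequency threshold, and the pair ("strong", "tea") ranks above ("tea", "and") because "and" is frequent on its own.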

Page 10

Inforex – corpus annotation editor

Page 11

Statistical analysis of word sense frequencies (WoSeDon)

Page 12

Institute of Computer Science Resources (selected)

§ Resources
  § A large semantic valency lexicon for Polish predicative lexical units
  § POLFIE – a formal syntactic-semantic grammar of Polish in the LFG formalism
  § Treebanks with different syntactic and semantic annotation
§ Systems for dictionaries
  § Slowal – an editor for the valency dictionary
  § Kuźnia – a system for editing morphological resources
  § Toposław – an editor for a dictionary of Multiword Expressions
§ Poliqarp 2.0
  § Search engine for very large richly annotated corpora
    § Including treebanks and semantic annotation
  § Powerful language for specifying queries

Page 13

Institute of Computer Science Tools (selected)

§ Tools adaptable to the domain and user needs:
  § Segmenter based on user-modified rules
  § Morfeusz 2.0 – an adaptable morphological analyser
  § Lemmatiser based on combining taggers
  § Extended Named Entity Recognition
  § Hybrid dependency parser based on combining taggers and dependency parsers
§ Deep parsers for Polish
  § Świgra – a syntactic parser based on DCG grammar
  § Syntactic-semantic parser based on LFG grammar
  § Syntactic-semantic parser based on Categorial Grammar
§ Terminology extraction from domain corpora
  § Statistical method combined with simple extraction rules

Page 14

Walenty – a valency dictionary

Page 15

POLFIE – an LFG grammar of Polish

Page 16

Polish-Japanese Academy of Information Technology

§ Technology
  § System for long-term archiving based on a unique hardware and software solution
§ Resources (selected)
  § Transcribed speech database
§ Tools (selected)
  § Phonetic transcription of texts
  § Text-to-speech alignment
  § Speech segmentation
    § Recognition of speaker changes in speech
    § Recognition of events in speech
  § Searching for keywords in speech recordings

Page 17

Services for speech

Polish-Japanese Academy of Information Technology

Page 18

Services for speech: integration with Praat

Polish-Japanese Academy of Information Technology

Page 19

University of Łódź

§ Resources
  § Parallel Polish-English corpus (expanded)
  § Conversational corpus (expanded)
    § Recorded in real-life situations
    § Described with meta-data
§ Tools
  § Paralela – a search engine for parallel corpora
  § Spokes – a search engine for
    § Monolingual corpora
    § Conversational corpora
    § Corpora described with meta-data
  § Thematic classifier based on Wikipedia categories
    § Assigns semantic categories to texts

Page 20

Spokes (University of Łódź) http://spokes.clarin-pl.eu

Page 21

Paralela (University of Łódź) http://spokes.clarin-pl.eu

Page 22

Wrocław University

§ ChronoPress – the Polish Chronological Corpus
  § Text samples from the years 1945-1954
    § 5760 samples per year, each sample 300 words
  § Described with meta-data
  § Statistical representation of the language changes
§ Tools
  § Lexical trend analysis
  § Calculation of the descriptive parameters: average, correlation, cross-correlation etc.
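The descriptive parameters listed above can be illustrated with a minimal Python sketch over yearly word counts. The sample size figures (5760 samples of 300 words per year) come from the slide; the function names are hypothetical and nothing here reflects how the ChronoPress tools are actually implemented.

```python
from math import sqrt

SAMPLES_PER_YEAR = 5760   # from the slide
WORDS_PER_SAMPLE = 300    # from the slide
WORDS_PER_YEAR = SAMPLES_PER_YEAR * WORDS_PER_SAMPLE


def per_million(count):
    """Normalise a raw yearly count to occurrences per million words."""
    return count * 1_000_000 / WORDS_PER_YEAR


def mean(xs):
    return sum(xs) / len(xs)


def pearson(xs, ys):
    """Pearson correlation of two equally long series."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sqrt(sum((x - mx) ** 2 for x in xs))
    sy = sqrt(sum((y - my) ** 2 for y in ys))
    if sx == 0 or sy == 0:
        return 0.0  # a constant series has no defined correlation
    return cov / (sx * sy)


def cross_correlation(xs, ys, lag):
    """Correlation between xs[t] and ys[t + lag] for a non-negative lag."""
    if lag < 0 or lag >= len(xs) - 1:
        raise ValueError("lag must satisfy 0 <= lag < len(xs) - 1")
    if lag == 0:
        return pearson(xs, ys)
    return pearson(xs[:-lag], ys[lag:])
```

For example, if the yearly frequency series of one word follows that of another with a one-year delay, `cross_correlation(a, b, 1)` will be close to 1 while `pearson(a, b)` may be much lower.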

Page 23

ChronoPress – Polish Chronological Corpus

Wrocław University

Page 24

Institute of Slavic Studies Resources (selected)

§ Polish-Bulgarian-Russian text corpus
  § Contemporary texts
  § Manually aligned on the level of sentences
  § Sub-corpus semantically annotated
§ Polish-Lithuanian text corpus
  § Contemporary texts
  § Manually aligned on the level of sentences

Page 25

Bi-directional - Top-down Part: First Applications

§ Approaching users
  § already active, interested, working on large textual and speech resources, …
  § covering a maximal variety of research areas, e.g. linguistics, literary studies, psychology, political studies and sociology
  § matching the available language tools for Polish
  § the first set of several prototype applications illustrating possibilities and facilitating identification of the needs
§ First applications
  § Spokes – searching corpora of conversational data
  § A system for collecting Polish text corpora from the Web
  § An open textometric and stylometric system focused on Polish
  § Semantic text classification for sociology
  § Literary Map

Page 26

Open Textometric and Stylometric System

§ System designed for characteristic features of Polish
§ Links language tools and feature extraction with frameworks for stylometry and clustering, e.g. Stylo (Eder & Rybicki)
§ Enables the use of features defined on any level of the linguistic structure:
  § from the level of word forms
  § up to the level of the semantic-pragmatic structures
§ Available as a Web Application and a Web Service
§ Combines
  § the stylometry system
  § with a semantic classification and tagging system
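To make the word-form end of this feature range concrete, the sketch below compares texts with Burrows's Delta computed over most-frequent-word frequencies, a distance measure commonly used in stylometry and available in Stylo. The slides do not state which features or distances the CLARIN-PL system applies, so the choice of Delta and all function names here are assumptions.

```python
from collections import Counter
from statistics import mean, pstdev


def most_frequent_words(texts, k):
    """Return the k most frequent word forms across all texts."""
    total = Counter()
    for tokens in texts.values():
        total.update(tokens)
    return [word for word, _ in total.most_common(k)]


def relative_freqs(tokens, vocab):
    """Relative frequency of each vocabulary word in one text."""
    counts = Counter(tokens)
    return [counts[word] / len(tokens) for word in vocab]


def z_scores(profiles):
    """Standardise each feature (column) across all texts."""
    names = list(profiles)
    columns = list(zip(*(profiles[n] for n in names)))
    mus = [mean(col) for col in columns]
    sds = [pstdev(col) or 1.0 for col in columns]  # avoid division by zero
    return {
        n: [(f - m) / s for f, m, s in zip(profiles[n], mus, sds)]
        for n in names
    }


def delta(za, zb):
    """Burrows's Delta: mean absolute difference of z-scores."""
    return mean(abs(a - b) for a, b in zip(za, zb))


texts = {
    "A": "the cat and the dog and the bird".split(),
    "B": "the fox and the owl and the hen".split(),
    "C": "a cat a dog a bird a fox".split(),
}
vocab = most_frequent_words(texts, 3)
z = z_scores({name: relative_freqs(tok, vocab) for name, tok in texts.items()})
```

In this toy data texts A and B share the same function-word profile, so `delta(z["A"], z["B"])` is zero while `delta(z["A"], z["C"])` is clearly larger. Features from higher linguistic levels (lemmas, tags, semantic classes) could replace the word forms without changing the distance computation.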

Page 27

Open Textometric and Stylometric System

Page 28

Literary Map

§ Goal
  § Support for using maps in literary criticism
  § Tool for the identification of all geographical names in the literary text (or a corpus) and mapping them onto a geographical map
§ Tasks
  1. Identification and semantic classification of the referring language expressions
  2. Disambiguation of the referents
  3. Mapping the referents onto a map (geo-location)
  4. Recognition of the semantic relations and statistical analysis
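The four tasks above can be sketched as a small Python pipeline. Everything in it is an assumption made for illustration: the toy gazetteer, its rough placeholder coordinates, the `prior` score used to choose between referents, and the function names. It matches base forms only, whereas a real system for Polish would need lemmatisation and named entity recognition, and it covers only the statistical part of task 4.

```python
import re
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class Place:
    name: str
    kind: str     # semantic class of the referent
    lat: float    # placeholder coordinate, not authoritative
    lon: float    # placeholder coordinate, not authoritative
    prior: float  # hypothetical prominence score


# Toy gazetteer: "Odra" is deliberately ambiguous
GAZETTEER = {
    "Wrocław": [Place("Wrocław", "city", 51.1, 17.0, 1.0)],
    "Odra": [
        Place("Odra", "river", 51.0, 17.0, 0.9),
        Place("Odra", "village", 50.0, 18.3, 0.1),
    ],
}


def identify(text):
    """Task 1: find tokens that match a gazetteer entry."""
    return [tok for tok in re.findall(r"\w+", text) if tok in GAZETTEER]


def disambiguate(mention):
    """Task 2: choose the candidate referent with the highest prior."""
    return max(GAZETTEER[mention], key=lambda place: place.prior)


def geolocate(place):
    """Task 3: map a referent to a (latitude, longitude) point."""
    return (place.lat, place.lon)


def analyse(text):
    """Run tasks 1-3 and count mentions per place (part of task 4)."""
    referents = [disambiguate(m) for m in identify(text)]
    points = {p.name: geolocate(p) for p in referents}
    counts = Counter(p.name for p in referents)
    return referents, points, counts


TEXT = "From Wrocław the road follows the Odra, and the Odra is wide here."
referents, points, counts = analyse(TEXT)
```

Here `counts` records two mentions of Odra and one of Wrocław, and the ambiguous name resolves to the river because of its higher prior. A context-aware disambiguation step would be the natural replacement for the fixed prior.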

Page 29

Literary Map

Page 30

Workshops: Results and Requests

§ Training workshops for CLARIN-PL centre and services
§ Results
  § large interest in participation: three candidates for one place!
  § workshops: three cities, three days each, min. 25 hours
  § participants: more than 140 persons from different domains of H&SS, full professors (>5), professors, researchers, PhD students
§ Requests (on the basis of more than 30 questionnaires)
  § very warm reception
  § some criticism, but unexpectedly rare
  § many concrete suggestions concerning the services and applications
  § described research tasks, concrete needs, two proposals to organise domain-focused workshops

Page 31

CLARIN-PL

Thank you very much for your attention! www.clarin-pl.eu

Supported by the Polish Ministry of Science and Higher Education [CLARIN-PL]