Dan Cristea
Alexandru Ioan Cuza University of Iasi
Romanian Academy – Institute of Computer Science
According to Ethnologue – Languages of the World (SIL):
◦ Spoken in: Romania (22 million), Moldavia (2.7 million), Serbia and Montenegro (300,000), Ukraine (250,000), Israel (250,000), Hungary (100,000), USA, Canada, Spain, Italy, etc.
◦ Native speakers: 24 million, plus 4 million as a second language
◦ Romanian (Rumanian, Moldavian, Moldovan, Daco-Romanian)
◦ Linguistic lineage: Indo-European > Italic > Romance > Eastern
◦ Dialects: Istro Romanian (Croatia), Macedo Romanian (Greece), Megleno Romanian (Greece)
◦ Lexical similarity: 77% with Italian, 75% with French, 74% with Sardinian, 73% with Catalan, 72% with Portuguese and Rheto-Romance, 71% with Spanish
◦ Other influences: Slavic, Hungarian, Turkish, etc.
LT Days, Luxembourg, 14-15 Jan, 2009
Since 1900: linguistics & lexicography research (in the Academy and the universities)
1960: early trials of Machine Translation; after that – no financing for more than 45 years
1980s: first NLP models and systems
◦ semantic networks, dialogue systems (IURES, QUERNAL), paradigmatic morphology and morphological analysers, unification-based formalisms, generation, grammars and parsers, etc.
Good computer science and computer engineering schools (in Bucharest, Iasi, Cluj-Napoca, Timisoara)
Master level:
◦ Iasi (UAIC-FII, since 2001), University of Bucharest
PhD level:
◦ Bucharest (RACAI), Iasi (UAIC-FII)
◦ 6 PhD theses will be defended this year
Summer schools, international and national conferences:
◦ EUROLAN, since 1993, second in significance in Europe (after ESSLLI)
◦ SPED (since 2001) – Speech Technology and Human-Computer Dialogue conferences
◦ ConsILR (since 2002) – the national conference of the Consortium for Informatisation of the Romanian Language
Alumni:
◦ >30 PhDs and PhD students doing LT all over the world
Bucharest
◦ Romanian Academy, RACAI (acad. Dan Tufis): 10 researchers (3 PhDs); Romanian resources, language independent tools, human-computer interfaces, statistical models of Romanian, NLP Web services
◦ Romanian Academy, Institute of Linguistics (acad. Marius Sala): lexicography, old Romanian texts corpora
◦ University of Bucharest: formal models, resources
◦ Technical University of Bucharest & Military Academy: speech processing (prof. Corneliu Burileanu, prof. Olteanu)
Iasi
◦ Alexandru Ioan Cuza University – Dept. of Computer Science (UAIC-FII, my group): 8 PhDs (2 in co-tutelle with prof. E. Munteanu, Dept. of Letters), 4 researchers, >20 masters in CL, undergraduate projects; resources, language independent tools in written LT, NLP Web services, computational lexicography, multimodal interfaces, NL user interfaces
◦ Romanian Academy, Institute of Computer Science (acad. Horia-Neculai Teodorescu): 4 PhDs, 8 researchers; speech processing and resource building, tools and annotated resources in written language processing
◦ Romanian Academy, Institute of Philology: lexicography, old manuscripts (including in old Cyrillic)
Word Alignment (Ro-En):
◦ RACAI 2003, 2005: ranked first
Question Answering (CLEF - Ro, En):
◦ RACAI 2006: Ro-En 7/13, 2007: Ro-Ro 1/2
◦ UAIC 2008: Ro-Ro 1/2
Answer Validation Exercise (CLEF - En):
◦ UAIC 2007: 1/7, 2008: 1/7
Anaphora Resolution Exercise (En):
◦ UAIC 2007: ranked first
Textual Entailment (En):
◦ UAIC 2007: 2-way task – 3/26, 3-way task – 4/10
◦ UAIC 2008: 2-way task – 2/26, 3-way task – 1/13
Morphological and POS tagger (En/Ro)
Lemmatizer (En/Ro)
Dependency Linker (En/Ro)
Sentence splitting (En/Ro)
Spell checker (Ro)
Word aligner (En-Ro)
Anaphora resolver (En/Ro)
Discourse parser (En/Ro)
Summarisation (En/Ro)
Q&A (En/Ro)
SMT (En-Ro-En, En-Gr-En, En-Sl-En)
Definitions extractor (En/Ro)
Information Retrieval (Ro Wikipedia)
Ro WordNet aligned with Princeton En WN (ILI)
◦ the second largest in the world (55,000 synsets)
Mono and multilingual corpora
◦ various RO classical novels (about 3,000,000 words)
  richest annotation: Orwell’s “1984” (110,000 words)
◦ tagged, lemmatized, chunked, word-aligned (XCES):
  Semcor (En, Ro): 1,000,000 words
  Ev.Zilei (En, Ro): 1,000,000 words
  Acquis Communautaire (22 languages), Ro: 30,832,212 words
  Wikipedia-Ro (fragment): 3,405,324 words
◦ dictionaries: Dictionary of Modern Romanian – DEX, Thesaurus Dictionary of Romanian Language (eDTLR)
Language models, grammars, NE lists, complete inflexional lists, AR models, sentence splitting models, discourse cue words, etc.
European past:
◦ ELSNET (ESPRIT), ELSNET-Goes-EAST (Copernicus), TELRI (Copernicus), FF-POIROT (FP5), Balkanet (FP5), RolTech (INTAS), LT4eL (FP6)… (more than 30 projects, see lists at www.racai.ro, www.info.uaic.ro/~dcristea)
European active:
◦ CLARIN: design & build the European LT infrastructure for HSS (representation in SB and EB, 2 partners and 5 member institutions)
◦ FlareNet: Nicoletta’s speech
◦ ALEAR: models of language evolution in humanoid agents (robots): unification optimisation and discourse modelling
Language Technology and preservation of national heritage – national priorities in the Ro research plan
Massive financing over the last 2 years (compared to previous years)…
◦ Under the Ministry of Culture and Arts (dir. Dan Matei)
◦ Digitisation of the Ro literature
@ RACAI
A follow-up of a successful SEE-ERA.net project (Ro, Bg, Gr, Sl, Sr)
Encouraging pilot experiments for Ro-En-Ro, Gr-En-Gr, Sl-En-Sl
Language pair          Google translation         RACAI translation
                       NIST score   BLEU score    NIST score   BLEU score
English to Greek       3.5705       0.2934        3.9730       0.3533
English to Slovene     3.5340       0.2653        3.6719       0.2450
English to Romanian    4.4057       0.4508        4.9348       0.5464
Greek to English       3.5427       0.2868        3.7733       0.2981
Slovene to English     4.0424       0.2215        4.0589       0.2293
Romanian to English    4.3573       0.2827        4.5426       0.4604
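The scores in the table above can be compared pair by pair with a few lines of arithmetic. The sketch below only copies the twelve pairs of figures from the table and subtracts them; the names `SCORES`, `bleu_gain` and `nist_gain` are illustrative and not part of any RACAI tool.

```python
# Scores copied from the table: (Google NIST, Google BLEU, RACAI NIST, RACAI BLEU).
SCORES = {
    "English to Greek":    (3.5705, 0.2934, 3.9730, 0.3533),
    "English to Slovene":  (3.5340, 0.2653, 3.6719, 0.2450),
    "English to Romanian": (4.4057, 0.4508, 4.9348, 0.5464),
    "Greek to English":    (3.5427, 0.2868, 3.7733, 0.2981),
    "Slovene to English":  (4.0424, 0.2215, 4.0589, 0.2293),
    "Romanian to English": (4.3573, 0.2827, 4.5426, 0.4604),
}


def nist_gain(pair):
    """RACAI NIST minus Google NIST for one language pair."""
    google_nist, _, racai_nist, _ = SCORES[pair]
    return round(racai_nist - google_nist, 4)


def bleu_gain(pair):
    """RACAI BLEU minus Google BLEU for one language pair."""
    _, google_bleu, _, racai_bleu = SCORES[pair]
    return round(racai_bleu - google_bleu, 4)


# Print the differences; a negative value means Google scored higher.
for pair in SCORES:
    print(f"{pair:<20} NIST {nist_gain(pair):+.4f}  BLEU {bleu_gain(pair):+.4f}")
```

Read this way, the RACAI system has the higher NIST score for all six pairs and the higher BLEU score for five of them, with English to Slovene as the exception.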
ALPE: a model of anchoring specifications of NLP applications on XML annotation schemas (standards)
build a pipeline/parallel architecture without any need to program
just input your own file and indicate the form of the output
use the federation of tools as bricks for new applications
cooking: the more ingredients you have, the longer the list of possible recipes you may go for
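One way to picture "input your own file and indicate the form of the output" is as a search over annotation layers: each tool needs some layers and adds others, and a pipeline is a path from the input format to the requested one. The sketch below is an illustration under that assumption, not ALPE's actual implementation; the registry `TOOLS`, the layer names, the dependencies between tools and the function `plan_pipeline` are all hypothetical.

```python
from collections import deque

# Hypothetical tool registry: name -> (layers required, layers added).
# Tool names echo the list of available tools; the dependencies are assumed.
TOOLS = {
    "sentence_splitter": (frozenset(), frozenset({"sentences"})),
    "pos_tagger": (frozenset({"sentences"}), frozenset({"pos"})),
    "lemmatizer": (frozenset({"pos"}), frozenset({"lemmas"})),
    "dependency_linker": (frozenset({"pos", "lemmas"}), frozenset({"dependencies"})),
}


def plan_pipeline(input_layers, output_layers, tools=TOOLS):
    """Breadth-first search for a shortest tool sequence that turns a file
    carrying `input_layers` into one carrying at least `output_layers`.
    Returns a list of tool names, or None if no sequence exists."""
    start = frozenset(input_layers)
    goal = frozenset(output_layers)
    queue = deque([(start, [])])
    seen = {start}
    while queue:
        layers, path = queue.popleft()
        if goal <= layers:
            return path
        for name, (needs, adds) in tools.items():
            # A tool applies if its inputs are present and it adds something new.
            if needs <= layers and not adds <= layers:
                reached = layers | adds
                if reached not in seen:
                    seen.add(reached)
                    queue.append((reached, path + [name]))
    return None


# Raw text in, lemmas out: the plan is derived, not programmed by hand.
print(plan_pipeline(set(), {"lemmas"}))
```

Adding a tool to the registry enlarges the set of reachable formats without touching the planner, which is the "more ingredients, more recipes" point in miniature.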
◦ Explosion of formats: difficulty of standardisation
◦ Standards are like laws: they help to organise society, but they also reduce freedom
◦ Standards usually come late
◦ We are in a hurry to do things instantly
Invent heuristics able to guess the semantics of new formats
‘Compute’ wrappers to transform non-standard input into standard