Dan Cristea

Alexandru Ioan Cuza University of Iasi

Romanian Academy – Institute of Computer Science

dcristea@info.uaic.ro


According to Ethnologue – Languages of the World (SIL)

◦ Spoken in: Romania (22 million), Moldavia (2.7 million), Serbia, Montenegro (300,000), Ukraine (250,000), Israel (250,000), Hungary (100,000), USA, Canada, Spain, Italy, etc.

◦ Native speakers: 24 million, plus 4 million as a second language

◦ Romanian (Rumanian, Moldavian, Moldovan, Daco-Romanian)

◦ Linguistic lineage: Indo-European > Italic > Romance > Eastern

◦ Dialects: Istro-Romanian (Croatia), Macedo-Romanian (Greece), Megleno-Romanian (Greece)

◦ Lexical similarity: 77% with Italian, 75% with French, 74% with Sardinian, 73% with Catalan, 72% with Portuguese and Rheto-Romance, 71% with Spanish

◦ Other influences: Slavic, Hungarian, Turkish, etc.

LT Days, Luxembourg, 14-15 Jan, 2009


Since 1900: linguistics & lexicography research (in the Academy and the universities)

1960: early trials of Machine Translation; after that – no financing for more than 45 years

1980s: first NLP models and systems

◦ semantic networks, dialogue systems (IURES, QUERNAL), paradigmatic morphology and morphological analysers, unification-based formalisms, generation, grammars and parsers, etc.

Good computer science and computer engineering schools (in Bucharest, Iasi, Cluj-Napoca, Timisoara)


Master level:

◦ Iasi (UAIC-FII, since 2001), University of Bucharest

PhD level:

◦ Bucharest (RACAI), Iasi (UAIC-FII)

◦ 6 PhD theses will be defended this year

Summer schools, international and national conferences

◦ EUROLAN, since 1993, second in significance in Europe (after ESSLLI)

◦ SPED (since 2001) – Speech Technology and Human-Computer Dialogue conferences

◦ ConsILR (since 2002) – the national conference of the Consortium for Informatisation of the Romanian Language

Alumni:

◦ >30 PhDs and PhD students doing LT all over the world


Bucharest

◦ Romanian Academy, RACAI (acad. Dan Tufis)
  10 researchers (3 PhDs): Romanian resources, language independent tools, human-computer interfaces, statistical models of Romanian, NLP Web services

◦ Romanian Academy, Institute of Linguistics (acad. Marius Sala)
  lexicography, old Romanian texts corpora

◦ University of Bucharest
  formal models, resources

◦ Technical University of Bucharest & Military Academy
  speech processing (prof. Corneliu Burileanu, prof. Olteanu)


Iasi

◦ Alexandru Ioan Cuza University – Dept. of Computer Science (UAIC-FII, my group)
  8 PhDs (2 in co-tutelle with prof. E. Munteanu, Dept. of Letters), 4 researchers, >20 masters in CL, undergraduate projects
  resources, language independent tools in written LT, NLP Web services, computational lexicography, multimodal interfaces, NL user interfaces

◦ Romanian Academy, Institute of Computer Science (acad. Horia-Neculai Teodorescu)
  4 PhDs, 8 researchers
  speech processing and resource building, tools and annotated resources in written language processing

◦ Romanian Academy, Institute of Philology
  lexicography, old manuscripts (including in old Cyrillic)


Word Alignment (Ro-En):

◦ RACAI 2003, 2005: ranked first

Question Answering (CLEF - Ro, En):

◦ RACAI 2006: Ro-En 7/13, 2007: Ro-Ro 1/2

◦ UAIC 2008: Ro-Ro 1/2

Answer Validation Exercise (CLEF - En):

◦ UAIC 2007: 1/7, 2008: 1/7

Anaphora Resolution Exercise (En):

◦ UAIC 2007: ranked first

Textual Entailment (En):

◦ UAIC 2007: 2-way task – 3/26, 3-way task – 4/10

◦ UAIC 2008: 2-way task – 2/26, 3-way task – 1/13


Morphological and POS tagger (En/Ro)
Lemmatizer (En/Ro)
Dependency Linker (En/Ro)
Sentence splitting (En/Ro)
Spell checker (Ro)
Word aligner (En-Ro)
Anaphora resolver (En/Ro)
Discourse parser (En/Ro)
Summarisation (En/Ro)
Q&A (En/Ro)
SMT (En-Ro-En, En-Gr-En, En-Sl-En)
Definitions extractor (En/Ro)
Information Retrieval (Ro Wikipedia)


Ro WordNet aligned with Princeton En WN (ILI)

◦ the second largest in the world (55,000 synsets)

Mono and multilingual corpora

◦ various RO classical novels (about 3,000,000 words)
  richest annotation: Orwell’s “1984” (110,000 words)

◦ tagged, lemmatized, chunked, word-aligned (XCES):
  Semcor (En, Ro): 1,000,000 words
  Ev.Zilei (En, Ro): 1,000,000 words
  Acquis Communautaire (22 languages), Ro: 30,832,212 words
  Wikipedia-Ro (fragment): 3,405,324 words

◦ dictionaries: Dictionary of Modern Romanian – DEX, Thesaurus Dictionary of Romanian Language (eDTLR)

Language models, grammars, NE lists, complete inflexional lists, AR models, sentence splitting models, discourse cue words, etc.


European past:

◦ ELSNET (ESPRIT), ELSNET-Goes-EAST (Copernicus), TELRI (Copernicus), FF-POIROT (FP5), Balkanet (FP5), RolTech (INTAS), LT4eL (FP6)… (more than 30 projects, see lists at www.racai.ro, www.info.uaic.ro/~dcristea)

European active:

◦ CLARIN: design & build the European LT infrastructure for HSS (representation in SB and EB, 2 partners and 5 member institutions)

◦ FlareNet: Nicoletta’s speech

◦ ALEAR: models of language evolution in humanoid agents (robots): unification optimisation and discourse modelling


Language Technology and preservation of national heritage – national priorities in the Ro research plan

Massive financing over the last 2 years (compared to previous years)…


◦ Under the Ministry of Culture and Arts (dir. Dan Matei)

◦ Digitisation of the Ro literature


@ RACAI

A follow-up of a successful SEE-ERA.net project (Ro, Bg, Gr, Sl, Sr)

Encouraging pilot experiments for Ro-En-Ro, Gr-En-Gr, Sl-En-Sl


Language pair          Google translation        RACAI translation
                       NIST score   BLEU score   NIST score   BLEU score
English to Greek       3.5705       0.2934       3.9730       0.3533
English to Slovene     3.5340       0.2653       3.6719       0.2450
English to Romanian    4.4057       0.4508       4.9348       0.5464
Greek to English       3.5427       0.2868       3.7733       0.2981
Slovene to English     4.0424       0.2215       4.0589       0.2293
Romanian to English    4.3573       0.2827       4.5426       0.4604
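
To make the comparison easier to read, the scores above can be turned into per-pair differences (RACAI minus Google). The snippet below is only an illustrative sketch: the numbers are copied from the table, while the function names `score_deltas` and `pairs_where_racai_leads` are invented here and do not come from the presentation.

```python
# Scores copied from the table above, per language pair:
# (Google NIST, Google BLEU, RACAI NIST, RACAI BLEU)
SCORES = {
    "English to Greek":    (3.5705, 0.2934, 3.9730, 0.3533),
    "English to Slovene":  (3.5340, 0.2653, 3.6719, 0.2450),
    "English to Romanian": (4.4057, 0.4508, 4.9348, 0.5464),
    "Greek to English":    (3.5427, 0.2868, 3.7733, 0.2981),
    "Slovene to English":  (4.0424, 0.2215, 4.0589, 0.2293),
    "Romanian to English": (4.3573, 0.2827, 4.5426, 0.4604),
}


def score_deltas(scores):
    """Return {pair: (NIST delta, BLEU delta)}, computed as RACAI minus Google."""
    return {
        pair: (round(r_nist - g_nist, 4), round(r_bleu - g_bleu, 4))
        for pair, (g_nist, g_bleu, r_nist, r_bleu) in scores.items()
    }


def pairs_where_racai_leads(scores, metric):
    """List the language pairs where the RACAI score is higher for 'nist' or 'bleu'."""
    idx = {"nist": 0, "bleu": 1}[metric]
    return [pair for pair, delta in score_deltas(scores).items() if delta[idx] > 0]


if __name__ == "__main__":
    for pair, (d_nist, d_bleu) in score_deltas(SCORES).items():
        print(f"{pair:<20} NIST {d_nist:+.4f}  BLEU {d_bleu:+.4f}")
```

On these figures the RACAI system has the higher NIST score for all six pairs and the higher BLEU score for five of them; English to Slovene is the one pair where Google's BLEU score is higher.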


ALPE: a model of anchoring specifications of NLP applications on XML annotation schemas (standards)

build a pipeline/parallel architecture without any need to program

just input your own file and indicate the form of the output

use the federation of tools as bricks for new applications

cooking: the more ingredients you have, the longer the list of possible recipes you may go for
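
The idea of obtaining a pipeline just by naming the input and the desired output can be pictured as a path search over annotation schemas. The sketch below is a simplified illustration, not ALPE's actual algorithm: the `TOOLS` registry, the schema names and `plan_pipeline` are all hypothetical, and each tool is assumed to read exactly one schema and produce exactly one.

```python
from collections import deque

# Hypothetical registry: (tool name, schema it requires, schema it produces).
TOOLS = [
    ("sentence_splitter", "raw_text", "sentences"),
    ("tokenizer", "sentences", "tokens"),
    ("pos_tagger", "tokens", "pos"),
    ("lemmatizer", "pos", "lemmas"),
    ("np_chunker", "pos", "chunks"),
    ("anaphora_resolver", "chunks", "coreference"),
]


def plan_pipeline(source, target, tools=TOOLS):
    """Breadth-first search for the shortest chain of tools from source to target.

    Returns the list of tool names, an empty list if source already equals
    target, or None if no chain exists.
    """
    queue = deque([(source, [])])
    seen = {source}
    while queue:
        schema, chain = queue.popleft()
        if schema == target:
            return chain
        for name, needs, produces in tools:
            if needs == schema and produces not in seen:
                seen.add(produces)
                queue.append((produces, chain + [name]))
    return None


if __name__ == "__main__":
    print(plan_pipeline("raw_text", "coreference"))
```

Adding a tool to the registry adds an edge to the graph, which is the "more ingredients, more recipes" point: every new edge can only increase the number of source/target pairs for which a chain exists.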


◦ Explosion of formats: difficulty of standardisation

◦ Standards are like laws: they help to organise society, but they also reduce freedom

◦ Standards usually come late

◦ We are in a hurry to do things instantly

Invent heuristics able to guess the semantics of new formats

‘Compute’ wrappers to transform non-standard input into standard
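
As a toy illustration of what such a heuristic plus wrapper might look like, the sketch below guesses which element of an unknown XML format marks tokens and renames it to a standard tag. Everything here is an assumption made for the example: the heuristic (the most frequent childless element whose text is a single word), the function names, and the target tag `w` are not taken from the presentation.

```python
import xml.etree.ElementTree as ET
from collections import Counter


def guess_token_tag(root):
    """Guess the token element: the most frequent leaf element holding one word."""
    counts = Counter(
        el.tag
        for el in root.iter()
        if len(el) == 0 and el.text and len(el.text.split()) == 1
    )
    return counts.most_common(1)[0][0] if counts else None


def wrap_to_standard(xml_text, standard_tag="w"):
    """Rename the guessed token element to standard_tag (a hypothetical standard name)."""
    root = ET.fromstring(xml_text)
    guessed = guess_token_tag(root)
    if guessed is None:
        return xml_text  # nothing recognisable as a token; leave the input untouched
    for el in list(root.iter(guessed)):
        el.tag = standard_tag
    return ET.tostring(root, encoding="unicode")


if __name__ == "__main__":
    sample = "<doc><s><tok>Ana</tok><tok>are</tok><tok>mere</tok></s><note>free text here</note></doc>"
    print(wrap_to_standard(sample))
```

A realistic wrapper would need many such heuristics (for sentences, attributes, stand-off links) and some way to rank competing guesses; this only shows the shape of the idea.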
