Post on 19-Dec-2015
New Slovene corpora within the »Communication in Slovene« project
Nataša Logar Berginc
University of Ljubljana, Faculty of Social Sciences
natasa.logar@fdv.uni-lj.si

Simon Krek
Amebis, Kamnik; Jozef Stefan Institute
simon.krek@guest.arnes.si
“Communication in Slovene”
• Web site: http://www.slovenscina.eu
• Leading partner: Amebis, d. o. o., Kamnik
• Duration: June 2008 - December 2013
• Total value: 3.2 million euros
• Project consortium:
  – Amebis, d. o. o., Kamnik
  – Jozef Stefan Institute
  – University of Ljubljana
  – Scientific Research Centre of the Slovenian Academy of Sciences and Arts
  – Trojina, Institute for Applied Slovene Studies
Language data
• Three corpora of Slovene:
  – a billion word written corpus GigaFIDA
  – a 100 million word balanced subcorpus KRES
  – a million word corpus of spoken Slovene GOS
Other activities
• NLP tools & resources
  – statistical tagger and parser
  – training corpus (500,000 words)
  – lexicon (100,000 lemmas)
• Language learning
  – integration of resources & tools in Slovene language teaching
  – pedagogical corpus interface
  – pedagogical corpus-based grammar
• Language description
  – lexical database (NLP & lexicography)
  – manual of style
Goals
GigaFIDA
• a billion word written corpus
• linguistic annotation
  – lemmatized
  – morpho-syntactically annotated
  – partly syntactically annotated
• format
  – XML TEI P5 format
• purpose
  – data for the new Slovene lexical database, pedagogical grammar and manual of style
  – freely available on the web
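To make the annotation layers concrete, here is a minimal sketch of reading lemma and morpho-syntactic tags from a TEI-style fragment. The fragment is a made-up illustration, not an excerpt from GigaFIDA, and it is not a complete TEI document (it has no header). Storing the tag in the `ana` attribute of `<w>` is an assumption; the actual corpus encoding may use different attributes.

```python
import xml.etree.ElementTree as ET

TEI_NS = "http://www.tei-c.org/ns/1.0"

# Hypothetical sentence; the words, lemmas and tags are illustrative only.
SAMPLE = """<TEI xmlns="http://www.tei-c.org/ns/1.0">
  <text><body><p>
    <s>
      <w lemma="korpus" ana="#Ncmsn">Korpus</w>
      <w lemma="biti" ana="#Va-r3s-n">je</w>
      <w lemma="velik" ana="#Agpmsnn">velik</w>
      <pc>.</pc>
    </s>
  </p></body></text>
</TEI>"""


def read_tokens(xml_text):
    """Return (word form, lemma, tag) for every <w> element in the fragment."""
    root = ET.fromstring(xml_text)
    tokens = []
    for w in root.iter(f"{{{TEI_NS}}}w"):
        # Assumed convention: the tag is a "#"-prefixed pointer in @ana.
        tag = w.get("ana", "").lstrip("#")
        tokens.append((w.text, w.get("lemma"), tag))
    return tokens


for form, lemma, tag in read_tokens(SAMPLE):
    print(form, lemma, tag)
```

Punctuation (`<pc>`) is skipped here because only `<w>` elements are iterated; a real reader would probably keep it for sentence reconstruction.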
A bit of FIDA history
• FIDA corpus
  – 1997-2000
  – 100 million words
  – available for project partners (academic & industrial)
• FidaPLUS corpus
  – 2005-2006
  – 620 million words
  – publicly available in the web concordancer
  – available for partners as a data set
  – text type: fiction 3.5%, non-fiction 96.5% (90% newspapers and magazines)
KRES
• a 100 million word written subcorpus
• criteria
  – balanced (text types, production-reception etc.)
  – text quality (processing & annotation)
  – copyright issues: 10%
• purpose
  – downloadable as a data set
  – freely available for research (BNC style)
  – Creative Commons (Authorship, Non-Commercial)
New taxonomy

Text type                    KRES (%)   GigaFIDA (%)
Print                        80         50 <> 90
  Books                      35         15 <> 35
    Fiction                  17         20 <> 50
    Non-fiction              18         30 <> 60
  Periodicals                40         20 <> 40
    Newspapers               20         30 <> 70
    Magazines                20         30 <> 70
  Other                      5          5 <> 10
Internet                     20         10 <> 50
  News sites                 8          30 <> 70
  Corp. & govern. sites      12         30 <> 70
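The KRES column can be checked for internal consistency with a few lines of code. This is a sketch under two assumptions that the slide does not state outright: the values are percentages of the whole subcorpus, and the parent-child grouping (for example Books under Print) is the one implied by the sums. The `CHILDREN` mapping and the `check_sums` helper are hypothetical names introduced for illustration.

```python
# KRES shares as given in the taxonomy slide (assumed to be percentages).
KRES_SHARE = {
    "Print": 80, "Books": 35, "Fiction": 17, "Non-fiction": 18,
    "Periodicals": 40, "Newspapers": 20, "Magazines": 20, "Other": 5,
    "Internet": 20, "News sites": 8, "Corp. & govern. sites": 12,
}

# Assumed hierarchy; "Total" stands for the whole subcorpus (100%).
CHILDREN = {
    "Total": ["Print", "Internet"],
    "Print": ["Books", "Periodicals", "Other"],
    "Books": ["Fiction", "Non-fiction"],
    "Periodicals": ["Newspapers", "Magazines"],
    "Internet": ["News sites", "Corp. & govern. sites"],
}


def check_sums(share, children, total=100):
    """Return {parent: (expected, actual)} for every group that does not add up."""
    mismatches = {}
    for parent, kids in children.items():
        expected = total if parent == "Total" else share[parent]
        actual = sum(share[kid] for kid in kids)
        if actual != expected:
            mismatches[parent] = (expected, actual)
    return mismatches


print(check_sums(KRES_SHARE, CHILDREN))
```

An empty result means every group adds up to its parent. The GigaFIDA column is left out because it gives ranges rather than single values.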
GOS
• a million word corpus of spoken Slovene
  – 120 hours of speech
• criteria
  – demographic
  – speech type/situation
  – additional (language learning, 15%)
• transcription
  – pronunciation-based
  – standardized
Demographic criteria
– sex: 50% M
– age: <34: 40%
– education: primary/secondary school: 70%
– region:
  • SW: 35%
  • Ljubljana r.: 25%
  • NE: 25%
  • Maribor r.: 15%
Speech type/situation criteria
– public/non-public discourse: 60% : 40%
– media:
  • face to face c.: 50%
  • telephone: 10%
  • radio: 20%
  • TV: 20%
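As a rough illustration, sampling shares like these can be turned into collection targets. The sketch below assumes the media shares apply to the 120 hours of recordings and the regional shares to the one million words; the slides do not say which unit the percentages refer to, so both choices are assumptions, and `quota` is a hypothetical helper.

```python
TOTAL_HOURS = 120        # size of GOS in hours of speech
TOTAL_WORDS = 1_000_000  # size of GOS in words

MEDIA_SHARE = {"face to face": 50, "telephone": 10, "radio": 20, "TV": 20}
REGION_SHARE = {"SW": 35, "Ljubljana r.": 25, "NE": 25, "Maribor r.": 15}


def quota(total, shares):
    """Split `total` according to percentage shares that must sum to 100."""
    if sum(shares.values()) != 100:
        raise ValueError("shares must sum to 100")
    return {name: total * pct / 100 for name, pct in shares.items()}


# Assumption: media shares are measured in hours of recordings.
print(quota(TOTAL_HOURS, MEDIA_SHARE))
# Assumption: regional shares are measured in transcribed words.
print(quota(TOTAL_WORDS, REGION_SHARE))
```

The sum check guards against a mistyped share silently skewing the targets.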
Tools for linguistic annotation
• Tokenization & segmentation
  – new, more transparent rules
• Lemmatizer & tagger
  – rule-based (Amebis)
  – statistical (JSI)
  – metatagger (JSI)
• Parser
  – statistical (based on MSTParser)
• Online services (beta)
  – tagger: http://oznacevalnik.slovenscina.eu/
  – parser: http://razclenjevalnik.slovenscina.eu/
March 2011
• Three publicly and freely available annotated corpora of modern Slovene, all texts copyright (+ gathering of new texts still in progress)
• New user-friendly interface (see Iztok Kosem's presentation)
• Freely available tools for linguistic annotation of Slovene (tagger, parser)
… and not much further down the road: new, up-to-date language descriptions and manuals
See: www.slovenscina.eu