SLOVENE SPECIALIZED TEXT CORPUS OF LIBRARY AND INFORMATION SCIENCE
– AN ADVANCED LEXICOGRAPHIC TOOL FOR
LIBRARY TERMINOLOGY RESEARCH
Ivan Kanič, University of Ljubljana, Faculty of Economics
International scientific conference «Corpus linguistics», Saint-Petersburg State University, June 25 – 27, 2013
SLOVENIA
• Population: 1,992,690
• Ljubljana (capital): 260,000
• Independence: 25 June 1991 (from Yugoslavia)
• Surface: 20,273 sq km
• Border countries: Austria, Croatia, Hungary, Italy
• Adriatic coastline: 46.6 km
• Highest point: Triglav 2,864 m
SLOVENIA (2)
• Language: Slovene (var.: Slovenian)
• Ethnic composition: Slovene 83.1%, Serb 2%, Croat 1.8%, Bosniak 1.1%, other or unspecified 12%
• Religions: Catholic 57.8%, Muslim 2.4%, Orthodox 2.3%, other or unspecified 28%, none 10.1% (2002 census)
• GDP per capita: $28,700 (2012)
• Currency: euro (introduced in 2007)
SLOVENE LANGUAGE
• Slovenski jezik, slovenščina
• Western South Slavic language
• Approx. 2.4 million speakers (1.85 million as first language)
• 50 regional dialects (limited mutual intelligibility: „most diverse Slavic language“)
• Latin alphabet with Č, Š, Ž
• Highly inflected language
• Particularities: dual
SLOVENE TEXT CORPORA
• More than 20 corpora available online
• Representative (general) corpora
• Synchronous corpora
• Nova Beseda
– 240 million words, 2004 (approx. 10 years' coverage)
• GigaFida
– 1.2 billion words, 1990-2011
• Specialised corpora
– DSI, Jos, Evrokorpus, VAYNA . . .
– EduKorp, Bibliotekarstvo
Slovene LIS Terminology
• Long professional tradition
• Linguistic shortage in the subject field
– Lack of written technical texts
– German language tradition
– Later English influences
– NO dictionaries in LIS terminology
– Terminology Project 1987
– Important tangible results
Usable results
• International Project – Multilingual Dictionaries of Library Terminology
• English-Slovene Dictionary of Library Terminology
• (Slovene) Dictionary of Library Terminology
– Printed edition
– Electronic edition (web, public access)
• Text Corpus – Korpus bibliotekarstva
Korpus bibliotekarstva
• Specialized corpus: Library and Information Science & practice
• Synchronous
• Open public access
• Dedicated in-house software
– PC data processing
– Web-based usage
– Rich experience (e.g. Dictionaries of the Slovene Academy of Sciences and Arts)
Texts
• Defined selection criteria
• Subject & Level
• Written texts
• Electronically published texts only
– Digital born
– Digitized & published
– NO scanning for the corpus
• Technical limitations and barriers
Selected texts & Functions
Basic functions
• Simple/basic search
– Single words & phrases
– N-grams (N = 1 – 5)
– Concordances
– Global corpus – selected document segment(s)
– Exact matching
– Truncation (*)
– Upper / lower case: Knjižnica - knjižnica
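The basic functions listed above can be illustrated with a minimal Python sketch. It is not the corpus's in-house software: the function names (`tokenize`, `ngrams`, `concordance`), the `width` parameter and the sample sentence are all assumptions made for illustration.

```python
import re

def tokenize(text):
    # \w is Unicode-aware in Python 3, so Slovene letters (č, š, ž) stay inside tokens.
    return re.findall(r"\w+", text)

def ngrams(tokens, n):
    # Every run of n consecutive tokens (the slide mentions N = 1 to 5).
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def concordance(tokens, query, width=2, ignore_case=True):
    # Keyword-in-context lines: up to `width` tokens on each side of every hit.
    # A trailing * in the query is treated as truncation (prefix match);
    # without it the match is exact.
    prefix = query.endswith("*")
    key = query.rstrip("*")
    if ignore_case:
        key = key.lower()
    lines = []
    for i, tok in enumerate(tokens):
        form = tok.lower() if ignore_case else tok
        hit = form.startswith(key) if prefix else form == key
        if hit:
            left = " ".join(tokens[max(0, i - width):i])
            right = " ".join(tokens[i + 1:i + 1 + width])
            lines.append((left, tok, right))
    return lines

# Hypothetical sample text, not taken from the corpus.
sample = "Knjižnica ima katalog. Knjižnični katalog je dostopen v knjižnici."
tokens = tokenize(sample)

for left, hit, right in concordance(tokens, "knjižni*"):
    print(f"{left:>20} [{hit}] {right}")
```

With `ignore_case=False` the same query distinguishes Knjižnica from knjižnica, which is the upper / lower case option named on the slide.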
Basic functions (2)
• Advanced search
– Frequency search = , < , >
  Fr>1000
  Fr>200 in be:kata*
– Word length = , < , >
  Do=15
• Word masking *
  adjective + substantive
  * katalog
  knjižnični *
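The advanced filters above (frequency threshold, word length, truncation and word masking) could be combined roughly as in the following Python sketch. The corpus's own query syntax is not reproduced here; `filter_words`, `masked_bigrams` and their parameters are hypothetical names, and the token list is made up for the example.

```python
from collections import Counter
from fnmatch import fnmatchcase

def filter_words(freq, min_freq=0, pattern="*", length=None):
    # Keep words whose frequency exceeds min_freq, that match a * pattern
    # (truncation), and that optionally have an exact length in characters.
    return sorted(
        w for w, f in freq.items()
        if f > min_freq
        and fnmatchcase(w, pattern)
        and (length is None or len(w) == length)
    )

def masked_bigrams(tokens, first="*", second="*"):
    # Count adjacent word pairs; a * in either slot masks that word.
    pairs = Counter(zip(tokens, tokens[1:]))
    return Counter({
        pair: count for pair, count in pairs.items()
        if fnmatchcase(pair[0], first) and fnmatchcase(pair[1], second)
    })

# Hypothetical token stream, not taken from the corpus.
tokens = "knjižnični katalog in vzajemni katalog ter knjižnični katalog".split()
freq = Counter(tokens)

print(filter_words(freq, min_freq=1, pattern="kata*"))   # frequency + truncation
print(filter_words(freq, length=10))                     # word length
print(masked_bigrams(tokens, second="katalog"))          # "* katalog"
print(masked_bigrams(tokens, first="knjižnični"))        # "knjižnični *"
```

Masking one slot of a bigram is a cheap way to list every adjective that modifies a given noun, or every noun that follows a given adjective.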
Hyperlinked list of texts & authors
Concordance list
Citation
Full-text access
Single word
Bigrams
Bigrams (2)
4-grams
Insight
• 625 texts
• 353 authors (single or co-authors)
• 3.66 million words
• Lemmatisation
• Part of speech tagging
• 28,808 individual distinct words
• Highest frequency: 172,031 (aux. v. „to be“)
• Hapax legomena: 7,310
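Figures of this kind (running words, distinct words, highest frequency, hapax legomena) can be derived from a frequency list. The sketch below shows one way to do it in Python on a toy token list; `corpus_stats` is a hypothetical helper, and whether the slide's counts refer to word forms or lemmas is not assumed here.

```python
from collections import Counter

def corpus_stats(tokens):
    # Frequency list: how often each distinct word occurs.
    freq = Counter(tokens)
    top_word, top_count = freq.most_common(1)[0]
    return {
        "tokens": len(tokens),      # running words
        "distinct": len(freq),      # individual distinct words
        "top": (top_word, top_count),
        # Hapax legomena: words that occur exactly once.
        "hapax": sum(1 for count in freq.values() if count == 1),
    }

# Toy input, not corpus data.
tokens = "je katalog je knjižnica je katalog gradivo".split()
stats = corpus_stats(tokens)
print(stats)
```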
Frequency distribution
• First 50
Zipf's Law vs. experience
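As a reminder of what is being compared: Zipf's law in its simplest form predicts that the frequency of a word is roughly the top frequency divided by its rank, so rank times frequency stays about constant. The Python sketch below computes both sides on toy data; the function names and the `exponent` parameter are assumptions for illustration, and the printed value is pure arithmetic, not a measurement from the corpus.

```python
from collections import Counter

def rank_frequency(tokens):
    # Observed side: (rank, frequency, rank * frequency), most frequent first.
    ranked = sorted(Counter(tokens).values(), reverse=True)
    return [(rank, f, rank * f) for rank, f in enumerate(ranked, start=1)]

def zipf_expected(top_frequency, rank, exponent=1.0):
    # Predicted side: frequency at a given rank under the idealized law.
    return top_frequency / rank ** exponent

# Toy data: the rank * frequency column stays close to the top frequency.
print(rank_frequency("a a a a b b c".split()))

# If the top word occurs 172,031 times, the idealized law would put
# the word at rank 2 at about half of that.
print(zipf_expected(172031, 2))
```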
Parts of speech
[Chart: noun, adjective, verb, adverb, the rest; axis from 0 to 14,000]
Verbs
Nouns
Adjectives
Accessibility
• Open Access
• CC License
• BLOG Bibliotekarska terminologija: http://terminologija.blogspot.com
Problems & Challenges
• Choice & acquisition of texts
• „Analogue“ texts
• Copyright issues
• Technical barriers
– PDF protected data
– Special characters
– Special text formatting
– Typing errors
– Genuine OCR errors
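One way a special-character barrier can show up, offered here as an assumption rather than as the problem the project actually met, is that text extracted from some files stores Č, Š, Ž as a base letter plus a combining caron. Such forms look identical on screen but do not match the precomposed letters in a search. A minimal Python sketch of the usual remedy, Unicode NFC normalization:

```python
import unicodedata

def normalize_text(text):
    # NFC composes a base letter followed by a combining mark
    # into a single precomposed code point where one exists.
    return unicodedata.normalize("NFC", text)

decomposed = "knjiz\u030cnica"   # z + combining caron (U+030C)
composed = "knji\u017enica"      # precomposed ž (U+017E)

print(decomposed == composed)                   # False before normalization
print(normalize_text(decomposed) == composed)   # True after normalization
```

Normalizing every text once at import time keeps exact matching and frequency counts consistent across sources.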
Problems & Challenges (2)
• Linguistic
– Highly inflected language
  • Data processing
  • Search
  • Analysis
  • Part of speech tagging
– Foreign language „contamination“
• General
– Resources
  • Human
  • Financial
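Why inflection complicates search and analysis can be seen in a small Python sketch: one noun appears in many surface forms, and only after mapping them to a lemma do the counts add up. The lookup table below is hand-made and hypothetical; a real system would rely on a morphological lexicon or a trained lemmatiser.

```python
from collections import Counter

# Hypothetical hand-made lookup covering a few forms of one noun.
LEMMAS = {
    "knjižnica": "knjižnica",
    "knjižnice": "knjižnica",
    "knjižnici": "knjižnica",
    "knjižnico": "knjižnica",
    "knjižnic": "knjižnica",
    "knjižnicah": "knjižnica",
}

def lemma_counts(tokens, lemmas=LEMMAS):
    # Count by lemma; unknown words fall back to their lowercased form.
    return Counter(lemmas.get(t.lower(), t.lower()) for t in tokens)

tokens = ["Knjižnica", "knjižnice", "v", "knjižnici", "knjižnico"]
print(Counter(t.lower() for t in tokens))   # four different word forms, once each
print(lemma_counts(tokens))                 # one lemma, counted four times
```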
Plans
• Harvesting new texts
– Recent / current digital born publications
– Recently digitized (e.g. „Knjižnica“)
– „Backlog“
  • 120 graduate theses
  • 28 master theses
  • 25 monographs & proceedings
– Scientific analysis
– Dictionary updating and supplementing
THANK YOU FOR YOUR ATTENTION!
Check: http://terminologija.blogspot.com
Contact: [email protected]
http://www2.arnes.si/~ljnuk4/kanic.html