SLOVENE SPECIALIZED TEXT CORPUS OF LIBRARY AND INFORMATION SCIENCE AN ADVANCED LEXICOGRAPHIC TOOL FOR LIBRARY TERMINOLOGY RESEARCH Ivan Kanič University of Ljubljana, Faculty of Economics International scientific conference «Corpus linguistics» Saint-Petersburg State University, June 25 – 27, 2013


SLOVENE SPECIALIZED TEXT CORPUS OF LIBRARY AND INFORMATION SCIENCE – AN ADVANCED LEXICOGRAPHIC TOOL FOR LIBRARY TERMINOLOGY RESEARCH

Ivan Kanič
University of Ljubljana, Faculty of Economics

International scientific conference «Corpus linguistics»
Saint-Petersburg State University, June 25 – 27, 2013


SLOVENIA

• Population: 1,992,690
• Ljubljana (capital): 260,000
• Independence: 25 June 1991 (from Yugoslavia)
• Surface: 20,273 sq km
• Border countries: Austria, Croatia, Hungary, Italy
• Adriatic coastline: 46.6 km
• Highest point: Triglav, 2,864 m


SLOVENIA (2)

• Language: Slovene (var.: Slovenian)
• Ethnic composition: Slovene 83.1%, Serb 2%, Croat 1.8%, Bosniak 1.1%, other or unspecified 12%
• Religions: Catholic 57.8%, Muslim 2.4%, Orthodox 2.3%, other or unspecified 28%, none 10.1% (2002 census)
• GDP per capita: $28,700 (2012)
• Currency: euro (introduced in 2007)


SLOVENE LANGUAGE

• Slovenski jezik, slovenščina
• Western South Slavic language
• ca. 2.4 million speakers (1.85 million as first language)
• 50 regional dialects (limited mutual understanding: „most diverse Slavic language“)
• Latin alphabet with Č, Š, Ž
• Highly inflected language
• Particularities: dual


SLOVENE TEXT CORPORA

• More than 20 corpora available online
• Representative (general) corpora
• Synchronous corpora
• Nova Beseda
– 240 million words, 2004 (ca. 10 years' coverage)
• GigaFida
– 1.2 billion words, 1990-2011
• Specialised corpora
– DSI, Jos, Evrokorpus, VAYNA . . .
– EduKorp, Bibliotekarstvo


Slovene LIS Terminology

• Long professional tradition
• Linguistic shortage in the subject field
– Lack of written technical texts
– German language tradition
– Later English influences
– NO dictionaries in LIS terminology
– Terminology Project 1987
– Important tangible results


Usable results

• International Project – Multilingual Dictionaries of Library Terminology

• English-Slovene Dictionary of Library Terminology

• (Slovene) Dictionary of Library Terminology
– Printed edition
– Electronic edition (web, public access)

• Text Corpus – Korpus bibliotekarstva


Korpus bibliotekarstva

• Specialized corpus: Library and Information Science & practice
• Synchronous
• Open public access
• Dedicated in-house software
– PC data processing
– Web-based usage
– Rich experience (e.g. dictionaries of the Slovene Academy of Sciences and Arts)


Texts

• Defined selection criteria
• Subject & level
• Written texts
• Electronically published texts only
– Born-digital
– Digitized & published
– NO scanning for the corpus
• Technical limitations and barriers


Selected texts & Functions


Basic functions

• Simple/basic search
– Single words & phrases
– N-grams (N = 1 – 5)
– Concordances
– Global corpus – selected document segment(s)
– Exact matching
– Truncation (*)
– Upper / lower case: Knjižnica - knjižnica
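
The exact-matching, truncation and upper/lower-case options listed above can be illustrated with a minimal Python sketch. It is not the corpus software itself: the function names `compile_query` and `search` and the toy word list are assumptions made for this example, and `*` is simply read as "any run of word characters".

```python
import re

def compile_query(query: str, case_sensitive: bool = True) -> re.Pattern:
    """Turn a query with optional '*' truncation into a regex (hypothetical helper)."""
    # Escape each literal piece; every '*' becomes "zero or more word characters".
    parts = [re.escape(piece) for piece in query.split("*")]
    flags = 0 if case_sensitive else re.IGNORECASE
    return re.compile(r"\w*".join(parts), flags)

def search(tokens, query, case_sensitive=True):
    """Return the tokens that match the whole query (exact match unless '*' is used)."""
    pattern = compile_query(query, case_sensitive)
    return [tok for tok in tokens if pattern.fullmatch(tok)]

# Toy word list, invented for illustration only.
tokens = ["Knjižnica", "knjižnica", "knjižnice", "katalog", "kataloga", "knjiga"]

print(search(tokens, "knjižnica"))                        # exact, case-sensitive
print(search(tokens, "knjižnic*"))                        # right truncation
print(search(tokens, "knjižnica", case_sensitive=False))  # ignore upper / lower case
```

In a highly inflected language, truncation is what lets one query reach several case forms of the same word (here knjižnica and knjižnice).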


Basic functions (2)

• Advanced search
– Frequency search = , < , >
   Fr>1000
   Fr>200 in be:kata*
– Word length = , < , >
   Do=15
• Word masking *
– adjective + substantive
   * katalog
   knjižnični *
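
A rough Python sketch of how such filters could work on a frequency list is given here. It is an illustration under assumptions, not the tool's query engine: the query `Fr>200 in be:kata*` is read as "frequency above 200 and word starting with kata" (Slovene "in" means "and"; treating `be:` as a word field and `Do` as word length is an assumption), the helper names `filter_words` and `masked_bigrams` are hypothetical, the sample counts are invented, and the mask ignores part of speech.

```python
from collections import Counter

def filter_words(freq, freq_gt=None, length=None, prefix=None):
    """Return words passing every given filter, most frequent first."""
    hits = []
    for word, count in freq.most_common():
        if freq_gt is not None and count <= freq_gt:
            continue  # frequency filter, like Fr>200
        if length is not None and len(word) != length:
            continue  # word-length filter, like Do=15
        if prefix is not None and not word.startswith(prefix):
            continue  # truncated word, like kata*
        hits.append(word)
    return hits

def masked_bigrams(tokens, pattern):
    """Count two-word sequences matching a pair in which '*' masks a whole word."""
    first, second = pattern
    return Counter(
        (a, b)
        for a, b in zip(tokens, tokens[1:])
        if first in ("*", a) and second in ("*", b)
    )

# Invented counts and tokens, for illustration only.
freq = Counter({"knjižnica": 1500, "katalog": 250, "kataloga": 201, "katalogizacija": 90})
print(filter_words(freq, freq_gt=200, prefix="kata"))
print(filter_words(freq, length=14))

tokens = "knjižnični katalog je listkovni katalog in knjižnični sistem".split()
print(masked_bigrams(tokens, ("*", "katalog")))
print(masked_bigrams(tokens, ("knjižnični", "*")))
```

Restricting the masked slot to adjectives or nouns, as the slide suggests, would additionally require the part-of-speech tags stored with each token.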


Hyperlinked list of texts & authors


Concordance list
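
For readers unfamiliar with concordance lists, a keyword-in-context (KWIC) listing can be sketched in a few lines of Python. The function name `kwic`, the window size and the sample sentence are assumptions for illustration; the actual corpus interface also links each hit to its citation and full text, which this sketch does not attempt.

```python
def kwic(tokens, keyword, window=3):
    """Return (left context, keyword, right context) for every exact hit."""
    rows = []
    for i, tok in enumerate(tokens):
        if tok == keyword:
            left = " ".join(tokens[max(0, i - window):i])
            right = " ".join(tokens[i + 1:i + 1 + window])
            rows.append((left, tok, right))
    return rows

# Invented sample sentence, for illustration only.
tokens = "splošna knjižnica ima katalog in vsaka knjižnica ima bralce".split()

for left, hit, right in kwic(tokens, "knjižnica", window=2):
    # Right-align the left context so the keyword lines up in one column.
    print(f"{left:>20} | {hit} | {right}")
```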


Citation


Full-text access


Single word


Bigrams


Bigrams (2)


4-grams
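
The single-word, bigram and 4-gram lists are all instances of the same counting step for N = 1 to 5. A minimal Python sketch, assuming a plain list of tokens and a hypothetical helper `ngram_counts` (the sample phrase is invented):

```python
from collections import Counter

def ngram_counts(tokens, n):
    """Count every run of n consecutive tokens."""
    if n < 1:
        raise ValueError("n must be at least 1")
    # range() is empty when the text is shorter than n, giving an empty Counter.
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

# Invented sample, for illustration only.
tokens = "vzajemni katalog in vzajemni katalog knjižnic".split()

bigrams = ngram_counts(tokens, 2)
print(bigrams.most_common(1))           # the most frequent two-word sequence
print(len(ngram_counts(tokens, 4)))     # number of distinct 4-grams
```

Frequent n-grams found this way are candidates for multi-word terms, which still need to be checked by a lexicographer.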


Insight

• 625 texts
• 353 authors (single or co-authors)
• 3.66 million words
• Lemmatisation
• Part-of-speech tagging
• 28,808 distinct words
• Highest frequency: 172,031 (auxiliary verb „to be“)
• Hapax legomena: 7,310
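
Figures of this kind (running words, distinct words, the most frequent item, hapax legomena) can be derived from a token list roughly as sketched below in Python. The function name `corpus_summary` and the toy input are assumptions; the real corpus counts lemmatised words, which this sketch does not do.

```python
from collections import Counter

def corpus_summary(tokens):
    """Basic frequency statistics for a list of tokens (or lemmas)."""
    freq = Counter(tokens)
    top = freq.most_common(1)[0] if freq else None
    return {
        "tokens": len(tokens),                                # running words
        "types": len(freq),                                   # distinct words
        "hapax": sum(1 for n in freq.values() if n == 1),     # words occurring once
        "top": top,                                           # (word, frequency)
    }

# Toy input, invented for illustration only.
tokens = "je knjižnica je katalog je knjižnica gradivo".split()
print(corpus_summary(tokens))
```

With the numbers on the slide, hapax legomena make up 7,310 / 28,808, or about 25.4%, of the distinct words.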


Frequency distribution

• First 50


Zipf's Law vs. experience
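
As a reminder of what is being compared: in its simplest form Zipf's law predicts that the word at rank r has a frequency of about f(1) / r. The Python sketch below sets that prediction beside observed counts; the function names and the short frequency list are assumptions for illustration, and only the top frequency of 172,031 is taken from the corpus figures.

```python
def zipf_expected(top_frequency, rank, exponent=1.0):
    """Frequency predicted by Zipf's law for a given rank."""
    return top_frequency / rank ** exponent

def compare_to_zipf(frequencies):
    """Return (rank, observed, expected) rows, with rank 1 as the reference point."""
    ranked = sorted(frequencies, reverse=True)
    top = ranked[0]
    return [(rank, obs, zipf_expected(top, rank)) for rank, obs in enumerate(ranked, start=1)]

# Prediction for the ranks following the most frequent word (172,031 occurrences).
for rank in (2, 10, 50):
    print(rank, zipf_expected(172031, rank))

# Invented observed counts, for illustration only.
for row in compare_to_zipf([120, 300, 95, 160]):
    print(row)
```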


Parts of speech / Verbs

[Bar chart: number of words by part of speech (noun, adjective, verb, adverb, the rest); vertical axis from 0 to 14,000]


Nouns Adjectives


Accessibility

• Open Access
• CC License
• Blog: Bibliotekarska terminologija, http://terminologija.blogspot.com


Problems & Challenges

• Choice & acquisition of texts
• „Analogue“ texts
• Copyright issues
• Technical barriers
– PDF protected data
– Special characters
– Special text formatting
– Typing errors
– Genuine OCR errors


Problems & Challenges (2)

• Linguistic
– Highly inflected language
   • Data processing
   • Search
   • Analysis
   • Part-of-speech tagging
– Foreign language „contamination“
• General
– Resources
   • Human
   • Financial


Plans

• Harvesting new texts
– Recent / current born-digital publications
– Recently digitized (e.g. „Knjižnica“)
– „Backlog“
   • 120 graduate theses
   • 28 master's theses
   • 25 monographs & proceedings
– Scientific analysis
– Dictionary updating and supplementing


THANK YOU FOR YOUR ATTENTION!

Check: http://terminologija.blogspot.com
Contact: [email protected]

http://www2.arnes.si/~ljnuk4/kanic.html