Corpus lexicography in Russia: recent trends and perspectives

Maria Khokhlova

St. Petersburg State University

Philological Faculty

[email protected]@gmail.com

Prehistory of Russian Corpus Linguistics

Frequency Dictionary of Russian (L. N. Zasorina, 1977)

The text database contained about 1 million units.

During its compilation a number of well-known, difficult issues were discussed: representativeness; tokenization; lemmatization...

In effect, it was the earliest computer corpus of Russian.
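
The two processing steps named above, tokenization and lemmatization, can be illustrated with a minimal sketch of how a frequency list is built. This is not the procedure used for the 1977 dictionary: the lemma table `TOY_LEMMAS`, the sample sentence, and all function names are assumptions made up for illustration, and a real project would use a full morphological analyser instead of a lookup table.

```python
import re
from collections import Counter

# Hypothetical toy lemma table (assumption): maps word forms to lemmas.
TOY_LEMMAS = {
    "книги": "книга",
    "книгу": "книга",
    "читал": "читать",
    "читает": "читать",
}

def tokenize(text):
    """Tokenization: split text into lowercase runs of letters."""
    return re.findall(r"[^\W\d_]+", text.lower())

def lemmatize(token):
    """Lemmatization: look the form up; unknown forms are kept unchanged."""
    return TOY_LEMMAS.get(token, token)

def frequency_list(text):
    """Count lemma frequencies over all tokens of the text."""
    return Counter(lemmatize(token) for token in tokenize(text))

sample = "Он читал книги. Она читает книгу."
freq = frequency_list(sample)
for lemma, count in freq.most_common():
    print(lemma, count)
```

Counting lemmas instead of word forms is what makes the list usable as a dictionary: the forms "книги" and "книгу" are counted together under one headword.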

Prehistory of Russian Corpus Linguistics

«Computer Fund of the Russian Language»

Idea: Acad. Andrey Yershov

Andrey Petrovich Yershov (1931-1988)

Jeršov A.P. "On methodology of constructing dialogue systems: the phenomenon of business prose" (1978)

The idea was formulated as follows: "Any progress in the field of constructing models and algorithms will remain a purely academic exercise, unless a most important problem of creating a Computer fund of the Russian language is solved. We hope that creation of such a Computer fund by linguists, qualified for the task, will precede construction of large systems for application purposes. This would minimize labour costs and simultaneously would protect the Russian language from arbitrary and incompetent intervention".

Russian Corpora (1)

The Uppsala Russian Corpus (1960s), the earliest corpus

The Tübingen Russian Corpus (Universität Tübingen, 1999-2004, under the guidance of T. Berger)

The HANCO corpus (Helsinki Annotated Corpus), Helsinki University, Slavic and Baltic Languages Department (2001-2004, A. Mustajoki, M. Kopotev). It is a small teaching corpus with morphological and syntactic annotation.

Russian Corpora (2)

Three big corpora of Russian:

The National Corpus of Russian Language (NCRL, about 364 million words) (http://ruscorpora.ru)

Corpora at the University of Leeds created by S. Sharoff (about 2000 million words) (http://corpus.leeds.ac.uk/ruscorpora.html)

A corpus of Russian Fiction at the Automatic Text Processing initiative team (AOT), 680 million words (http://aot.ru)

Russian National Corpus (1)

Over 364 million words

Based on Yandex Search:

Search by exact form(s); lexico-grammatical search. See www.yandex.ru (Advanced Search) and www.ruscorpora.ru (Search in the Corpus).

Additional options: morphological features; semantic features; metadata.
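
The difference between search by exact form and lexico-grammatical search can be shown with a small sketch over annotated tokens. It does not reproduce the corpus's actual search engine or tag set: the `Token` class, the `search` function, the tags (`S`, `V`, `acc`, `nom`, `pl`, `sg`), and the three sample tokens are all hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Token:
    form: str        # word form as it appears in the text
    lemma: str       # dictionary headword
    gram: frozenset  # grammatical tags (hypothetical tag set)

def search(tokens, form=None, lemma=None, gram=()):
    """Return tokens matching every criterion that was supplied.

    form  -> search by exact form
    lemma -> lexical part of a lexico-grammatical query
    gram  -> grammatical tags that must all be present on the token
    """
    wanted = set(gram)
    return [
        t for t in tokens
        if (form is None or t.form == form)
        and (lemma is None or t.lemma == lemma)
        and wanted <= t.gram
    ]

# Hypothetical annotated sample (assumption, not real corpus data).
corpus = [
    Token("книги", "книга", frozenset({"S", "nom", "pl"})),
    Token("книгу", "книга", frozenset({"S", "acc", "sg"})),
    Token("читает", "читать", frozenset({"V", "sg"})),
]

exact = search(corpus, form="книгу")                  # exact form only
by_lemma = search(corpus, lemma="книга")              # all forms of a lemma
lex_gram = search(corpus, lemma="книга", gram={"acc"})  # lemma plus grammar
```

An exact-form query finds one spelling only, while a query by lemma and grammatical tags retrieves every form that fits the description, which is what makes such a search useful for dictionary work.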

Russian National Corpus (2)

Subcorpora: Modern Russian corpus, Diachronic corpus (the Church Slavonic language), Syntactic corpus, Spoken corpus, News corpus, Parallel corpora, Poetic corpus, Dialect corpus, Speech corpus, Multimodal corpus

Dictionaries based on the Russian National Corpus

Grammatical Dictionary of Russian Neologisms;

New Frequency Dictionary of Russian;

The Combinatory Dictionary of Russian Intensifiers;

The Verbal Combinatory Dictionary of Russian Abstract Nouns

http://dict.lang.ru

AOT (1)

AOT (2)

Russian Corpora (Leeds University, Serge Sharoff)

Russian Reference Corpus

Russian Reference Corpus, another version

Russian Fiction (disambiguated)

Russian Newspapers

Russian Internet Corpus

Russian National Corpus

…

Collocations

St. Petersburg Corpus of Hagiographic Texts

Biographies of saints and holy people;

50 manuscripts; 500 000 tokens

http://project.phil.spbu.ru/scat/page.php?page=project

The Fundamental Digital Library of Russian Literature and Folklore

FEB-web accumulates information in text, audio, visual, and other forms on 11th-20th-century Russian literature, Russian folklore, and the history of Russian literary scholarship and folklore studies.

Conference “Corpus Linguistics”

2002, 2004, 2006, 2008, 2011, 2013 (late June)

Saint Petersburg

St. Petersburg State University, Department of Mathematical Linguistics

Thank you for your attention!