Introducing Czech National Corpus
Brown University, 04/09/16
Václav Cvrček
Introduction
Czech National Corpus project
Basic facts about the CNC
▶ est. in 1994 by prof. František Čermák
▶ 2 departments of Faculty of Arts, Charles University in Prague (ICNC & ITCL)
▶ in 2012 acknowledged by MEYS as a research infrastructure for social sciences and humanities
▶ 4,500+ registered users
▶ 1,900 queries per day
▶ web portal: www.korpus.cz
Data
Language data
SYN        2.3 bil.
InterCorp  1.4 bil.
ORAL       5 mil.
Diakorp    3.4 mil.
SYN series
Written (i.e. published) synchronic texts:
SYN2000      100M   representative (1990–1999)
SYN2005      100M   representative (2000–2004)
SYN2006PUB   300M   journalistic texts (1989–2004)
SYN2009PUB   700M   journalistic texts (1995–2007)
SYN2010      100M   representative (2005–2009)
SYN2013PUB   935M   journalistic texts (2005–2009)
SYN2015      100M   representative (2010–2014)
SYN (v. 3)   2.3G   union of all SYN* corpora
All corpora are lemmatized, morphologically tagged and enriched by metadata (biblio information + text-type/genre classification)
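As a quick consistency check, the sizes listed for the individual SYN corpora can be added up and compared with the roughly 2.3 billion reported for SYN (v. 3). This is only a sketch: it assumes the union is a plain sum of non-overlapping corpora, which the slide does not state, and the variable names are illustrative.

```python
# Sizes in millions, copied from the SYN series slide.
syn_sizes_m = {
    "SYN2000": 100,
    "SYN2005": 100,
    "SYN2006PUB": 300,
    "SYN2009PUB": 700,
    "SYN2010": 100,
    "SYN2013PUB": 935,
    "SYN2015": 100,
}

# Assumption: the corpora do not share texts, so the union is a simple sum.
total_m = sum(syn_sizes_m.values())
total_g = round(total_m / 1000, 1)

print(f"{total_m}M in total, i.e. about {total_g}G")
```

Under that assumption the sum comes to 2,335M, which rounds to the 2.3G shown for SYN (v. 3).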
ORAL series
Unprepared, dialogical, informal spoken language
One-layer transcription corpora:
ORAL2006   1.0M   Bohemian Czech only
ORAL2008   1.0M   sociolinguistically balanced, Bohemian Czech only
ORAL2013   2.8M   sociolinguistically balanced, whole CR, text-to-sound alignment
Older spoken corpora: Prague spoken corpus (0.5M), Brno spoken corpus (0.5M)
Diachronic corpus
DIAKORP
▶ diachronic part of the CNC – 2.5 mil. words (v. 5)
▶ the end of the 13th century to the beginning of the SYN section (1945)
▶ texts are transcribed, not transliterated
▶ current focus on 19th century – lemmatization
Multilingual parallel corpus InterCorp
Czech texts with translations to or from 30+ languages
InterCorp (v. 8)
▶ core (=fiction) and collections (=journalism, subtitles…)
          Core   Collections
cs        85M    90M
foreign   194M   1,229M
▶ partly lemmatized and tagged
▶ uneven amount of texts in language pairs
     Language    Core    Total
bg   Bulgarian   5.2M    28.1M
da   Danish      3.0M    53.0M
de   German      27.7M   77.1M
en   English     15.5M   113.9M
es   Spanish     17.5M   103.9M
fi   Finnish     3.4M    45.2M
fr   French      9.2M    87.0M
hr   Croatian    15.5M   34.6M
hu   Hungarian   5.4M    58.1M
it   Italian     7.2M    65.6M
pl   Polish      17.5M   79.9M
ru   Russian     3.3M    13.4M
sk   Slovak      7.4M    44.5M
sl   Slovenian   0.9M    49.8M
sr   Serbian     8.8M    29.6M
uk   Ukrainian   5.1M    5.3M
Tools
CNC Tools
▶ main concordancer
▶ analysis of variants
▶ derivational morphology
▶ discourse analysis
▶ translation equivalents
All tools are available on-line within the portal www.korpus.cz
CNC research portal www.korpus.cz
KonText – CNC concordancer
SyD – exploring variants
Treq – translation equivalents
Translation candidates for “workshop”
%      Czech                          English
39.4   dílna (‘workroom’)             workshop
30.4   seminář (‘seminar’)            workshop
8.7    workshop                       workshop
4.6    pracovní (‘working’)           workshop
2.3    kurs (‘course’)                workshop
1.7    garáž (‘garage’)               workshop
1.7    krejčovna (‘tailor’s shop’)    workshop
0.9    ateliér (‘studio’)             workshop
0.8    továrna (‘factory’)            workshop
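One way to read the percentage column is as each Czech candidate's share among all aligned occurrences of the English word. The sketch below shows that calculation only. It is not a description of how Treq works internally, the function name translation_shares is hypothetical, and the word pairs are invented toy data rather than InterCorp counts.

```python
from collections import Counter


def translation_shares(aligned_pairs, english_word):
    """Return (czech_word, percent) tuples, most frequent candidate first."""
    # Count how often each Czech word is aligned with the given English word.
    counts = Counter(cs for cs, en in aligned_pairs if en == english_word)
    total = sum(counts.values())
    if total == 0:
        return []
    return [(cs, round(100 * n / total, 1)) for cs, n in counts.most_common()]


# Invented (Czech, English) word alignments, for illustration only.
toy_pairs = (
    [("dílna", "workshop")] * 5
    + [("seminář", "workshop")] * 3
    + [("workshop", "workshop")] * 2
    + [("dům", "house")] * 4
)

shares = translation_shares(toy_pairs, "workshop")
for czech, percent in shares:
    print(f"{percent:5.1f}  {czech}")
```

With the toy data, the three candidates for “workshop” receive 50.0, 30.0 and 20.0 percent; the pair aligned with “house” is ignored.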
User Services
User services
▶ hosting of corpora
▶ providing data packages (NLP)
▶ analysis of user data
▶ consulting, education, training
Repository Biblio
User forum and user support
CNC Wiki
www.korpus.cz