care henk vd heuvel

Upload: clariah

Post on 14-Apr-2017

CARE: Curation of Dutch Regional Dialect Dictionaries

Nicoline van der Sijs, Henk van den Heuvel, Roeland van Hout, Eric Sanders

CLS/CLST, Radboud University Nijmegen, The Netherlands

The CARE project is funded by CLARIN-NL under grant number 15-004.

Aim of project

• Definition of a generic database structure for dialect dictionaries (LMF)

• Link the structure to Woordenboek van de Vlaamse Dialecten (WVD) and other regional dictionaries

• Curation of Woordenboek van de Brabantse Dialecten (WBD) and Woordenboek van de Limburgse Dialecten (WLD), Parts I and II

• Update curation of WBD and WLD, Part III

• Include Woordenboek van de Gelderse Dialecten (WGD)

Where we start

• OCR version of PDF files (WBD & WLD, Parts I and II)

• Formerly curated TSV files for WBD & WLD, Part III

• FP5 files of WGD

What we deliver

• Generic LMF model for dialect dictionaries

• WBD, WLD as CSV files and LMF files
  - For at least 32 of 42 books of Parts I and II
  - For all 28 books of Part III

• Original PDFs of books

• CMDI files per Part

• Curation Reports

Generic aspects

• LMF model suited for all sorts of dialect dictionaries

• CMDI metadata profile

• Very flexible LMF conversion script

[Workflow diagram; labelled nodes: PDF book, CSV files, LMF files, CMDI files, CLARIN Data Centre]

CLARIN Data Centre: Meertens Institute

• Adding Persistent Identifiers

• Storage

CMDI
- Metadata profile includes:
  - Link to LMF

LMF script
- Converts CSV file into LMF
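The CSV-to-LMF step can be illustrated with a minimal sketch. It is not the CARE conversion script: the column names (`lemma`, `dialect_form`, `kloeke_code`), the sample rows, and the choice of LMF elements (`LexicalResource`, `Lexicon`, `LexicalEntry`, `Lemma`, `WordForm`, `feat`) are assumptions made for illustration only.

```python
import csv
import io
import xml.etree.ElementTree as ET


def csv_to_lmf(csv_text: str) -> ET.Element:
    """Group CSV rows by lemma and emit a small LMF-style XML tree.

    ASSUMPTION: the columns 'lemma', 'dialect_form' and 'kloeke_code'
    are hypothetical; the real CARE CSV layout is not described here.
    """
    resource = ET.Element("LexicalResource")
    lexicon = ET.SubElement(resource, "Lexicon")
    entries = {}  # lemma -> its LexicalEntry element

    for row in csv.DictReader(io.StringIO(csv_text)):
        lemma = row["lemma"]
        if lemma not in entries:
            entry = ET.SubElement(lexicon, "LexicalEntry")
            lemma_el = ET.SubElement(entry, "Lemma")
            ET.SubElement(lemma_el, "feat", att="writtenForm", val=lemma)
            entries[lemma] = entry
        # One WordForm per attested dialect form, tagged with its location code.
        form = ET.SubElement(entries[lemma], "WordForm")
        ET.SubElement(form, "feat", att="writtenForm", val=row["dialect_form"])
        ET.SubElement(form, "feat", att="location", val=row["kloeke_code"])
    return resource


# Made-up rows, for illustration only (not taken from WBD or WLD).
sample = (
    "lemma,dialect_form,kloeke_code\n"
    "lemma_a,form_1,X001\n"
    "lemma_a,form_2,X002\n"
    "lemma_b,form_3,X001\n"
)
root = csv_to_lmf(sample)
print(ET.tostring(root, encoding="unicode"))
```

Grouping rows by lemma keeps every dialect form of a headword under a single `LexicalEntry`, which is one plausible way to map a flat CSV onto LMF's nested structure.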

CSV script
- Converts typographically marked-up text file into CSV file by:
  - Typographic & text cleaning
  - Categorization of information based on typography
  - Recoding dialect forms
  - Checking and expanding Kloeke codes
- Logfile is used for iterative manual correction
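The combination of code checking and a logfile for manual correction can be sketched as follows. This is a hedged illustration, not the project's script: the assumed code shape (one capital letter, one to three digits, an optional lowercase letter), the zero-padding normalization, and the function names are all hypothetical. The actual expansion rules used in CARE are not described in this poster and are not modelled here.

```python
import re

# ASSUMPTION: a Kloeke code is one capital letter, 1-3 digits and an
# optional lowercase letter, possibly with a space after the capital.
KLOEKE_PATTERN = re.compile(r"^([A-Z])\s*(\d{1,3})([a-z]?)$")


def normalize_kloeke(raw: str):
    """Return a zero-padded code such as 'L286', or None if malformed."""
    match = KLOEKE_PATTERN.match(raw.strip())
    if match is None:
        return None
    letter, number, suffix = match.groups()
    return f"{letter}{int(number):03d}{suffix}"


def check_codes(raw_codes):
    """Split raw codes into normalized codes and log lines for manual review."""
    valid, log = [], []
    for item_no, raw in enumerate(raw_codes, start=1):
        code = normalize_kloeke(raw)
        if code is None:
            # Malformed codes go to the log instead of being guessed at.
            log.append(f"item {item_no}: cannot parse Kloeke code {raw!r}")
        else:
            valid.append(code)
    return valid, log


# Illustrative input strings only.
valid, log = check_codes(["L 286", "K183a", "l286", "Q3", "12B"])
print(valid)
print("\n".join(log))
```

Writing unparseable codes to a log rather than repairing them automatically matches the iterative workflow: an assistant corrects the source text and the script is run again.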

Manual preprocessing by trained assistants, whose work is gratefully acknowledged:

Aukje Borkent, Maaike Borst, Eline Dimmendaal, Jorik van Engeland and Inge Otto

- Addition of typographic codes for comments (“Toelichting”) in text file

- Correcting script errors