![Page 1: The Sketch Engine for Dutch with the ANW corpus Carole Tiberius](https://reader033.vdocuments.us/reader033/viewer/2022051517/56815a69550346895dc7bd45/html5/thumbnails/1.jpg)
The Sketch Engine for Dutch with the ANW corpus
Carole Tiberius
![Page 2: The Sketch Engine for Dutch with the ANW corpus Carole Tiberius](https://reader033.vdocuments.us/reader033/viewer/2022051517/56815a69550346895dc7bd45/html5/thumbnails/2.jpg)
Outline
• The Algemeen Nederlands Woordenboek
  – Main features
  – The ANW corpus
• The Sketch Engine
  – Background
  – Word Sketches for Dutch
![Page 3: The Sketch Engine for Dutch with the ANW corpus Carole Tiberius](https://reader033.vdocuments.us/reader033/viewer/2022051517/56815a69550346895dc7bd45/html5/thumbnails/3.jpg)
The ANW dictionary
• Online scholarly dictionary
• Contemporary standard Dutch in the Netherlands and Flanders
• General (mainly written) language
• Period: 1970-2018
• Size: 70,000 main entries and 250,000 subentries
• Users: from laymen to professionals
• No clone of an existing printed dictionary
• Semasiological and onomasiological
• Modular editing and publication
• Corpus-based
![Page 4: The Sketch Engine for Dutch with the ANW corpus Carole Tiberius](https://reader033.vdocuments.us/reader033/viewer/2022051517/56815a69550346895dc7bd45/html5/thumbnails/4.jpg)
ANW: main content features
• meaning
• grammar
• morphology and compounding (slide example: sportveld, 'sports field')
• spelling
• combinations; collocations
• multimedia
![Page 5: The Sketch Engine for Dutch with the ANW corpus Carole Tiberius](https://reader033.vdocuments.us/reader033/viewer/2022051517/56815a69550346895dc7bd45/html5/thumbnails/5.jpg)
ANW Corpus
Compiled from:
• Electronic texts already available at the INL
• Internet
• Scanning

Subcorpora:
• Corpus of domains: 32 million tokens
• Corpus of literary texts: 20 million tokens
• Newspaper corpus: 40 million tokens
• Corpus of neologisms: 5.5 million tokens
• Pluscorpus: 5 million tokens

Total: 102.5 million tokens
![Page 6: The Sketch Engine for Dutch with the ANW corpus Carole Tiberius](https://reader033.vdocuments.us/reader033/viewer/2022051517/56815a69550346895dc7bd45/html5/thumbnails/6.jpg)
Corpus preparation
• Conversion to vertical format: word-form, tag, lempos
  – Inclusion of <g> tag for punctuation
  – Removal of duplicate texts
• Conversion to UTF-8
• More uniform document headers
  – subcorpus; ID; variant; dates etc.
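
The preparation steps above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the ANW conversion pipeline: the tag strings, the header attribute values, the rule for when to emit the glue tag, and the lempos convention (lemma, hyphen, lower-cased first letter of the tag) are all assumed for the example.

```python
from xml.sax.saxutils import quoteattr

# Assumption: punctuation that attaches to the preceding token gets a glue tag.
GLUE_BEFORE = set(".,;:!?)")


def lempos(lemma, tag):
    """Build a lempos value: lemma + '-' + lower-cased first letter of the tag."""
    return f"{lemma}-{tag[0].lower()}"


def to_vertical(doc_attrs, tokens):
    """Render one document as vertical text: one token per line, tab-separated.

    doc_attrs: header attributes (e.g. subcorpus, id, variant, date).
    tokens:    iterable of (word_form, tag, lemma) tuples.
    """
    attrs = " ".join(f"{key}={quoteattr(str(value))}" for key, value in doc_attrs.items())
    lines = [f"<doc {attrs}>"]
    for word, tag, lemma in tokens:
        if word and all(ch in GLUE_BEFORE for ch in word):
            lines.append("<g/>")  # no space between previous token and this one
        lines.append("\t".join((word, tag, lempos(lemma, tag))))
    lines.append("</doc>")
    return "\n".join(lines)


# Hypothetical header values and tags, for illustration only.
example = to_vertical(
    {"subcorpus": "newspaper", "id": "doc0001", "variant": "NL", "date": "2005-03-14"},
    [
        ("Hij", "P.pers.nom", "hij"),
        ("ziet", "V.mai", "zien"),
        ("de", "T.def", "de"),
        ("man", "N.com", "man"),
        (".", "Punc", "."),
    ],
)
print(example)
```

Keeping the header attributes uniform across documents is what later makes it possible to define subcorpora by attribute value.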
![Page 7: The Sketch Engine for Dutch with the ANW corpus Carole Tiberius](https://reader033.vdocuments.us/reader033/viewer/2022051517/56815a69550346895dc7bd45/html5/thumbnails/7.jpg)
Changes to the editor
The ANW editor was adapted so that lexicographers can automatically copy examples, together with their source information, from the Sketch Engine into the editor.
![Page 8: The Sketch Engine for Dutch with the ANW corpus Carole Tiberius](https://reader033.vdocuments.us/reader033/viewer/2022051517/56815a69550346895dc7bd45/html5/thumbnails/8.jpg)
![Page 9: The Sketch Engine for Dutch with the ANW corpus Carole Tiberius](https://reader033.vdocuments.us/reader033/viewer/2022051517/56815a69550346895dc7bd45/html5/thumbnails/9.jpg)
![Page 10: The Sketch Engine for Dutch with the ANW corpus Carole Tiberius](https://reader033.vdocuments.us/reader033/viewer/2022051517/56815a69550346895dc7bd45/html5/thumbnails/10.jpg)
![Page 11: The Sketch Engine for Dutch with the ANW corpus Carole Tiberius](https://reader033.vdocuments.us/reader033/viewer/2022051517/56815a69550346895dc7bd45/html5/thumbnails/11.jpg)
![Page 12: The Sketch Engine for Dutch with the ANW corpus Carole Tiberius](https://reader033.vdocuments.us/reader033/viewer/2022051517/56815a69550346895dc7bd45/html5/thumbnails/12.jpg)
![Page 13: The Sketch Engine for Dutch with the ANW corpus Carole Tiberius](https://reader033.vdocuments.us/reader033/viewer/2022051517/56815a69550346895dc7bd45/html5/thumbnails/13.jpg)
ANW Grammatical Relations for nouns
• object-of
• with ‘dat’ (that)-compl
• subject-of
• with wh-compl
• with auxiliary
• with ‘of’ (whether)-compl
• premodifying adjective
• with ‘alsof’ (as if)-compl
• premodifying present participle
• with demonstrative pronoun
• premodifying past participle
• with possessive pronoun
• with infinitive plus ‘om te’
• with PP
• in PP
• with indefinite pronoun
• with personal pronoun
• premodifying noun
• premodifying genitive
• postmodifying noun
• postmodifying genitive
• premodifying numeral
• with proper noun
• postmodifying numeral
• with article
• postmodifying adjective
• with coordinated noun
• with infinitive plus ‘te’
• other
![Page 14: The Sketch Engine for Dutch with the ANW corpus Carole Tiberius](https://reader033.vdocuments.us/reader033/viewer/2022051517/56815a69550346895dc7bd45/html5/thumbnails/14.jpg)
Dutch Sketch Grammar
• Geared completely towards the ANW requirements
• Covers ± 50 of the 70 relations
• Types of relations:
  – Symmetric (e.g. and/or)
  – Trinary (e.g. headword + pp + noun)
  – Dual (e.g. adj + headword)
  – Unary (e.g. + relative clause – dat)
![Page 15: The Sketch Engine for Dutch with the ANW corpus Carole Tiberius](https://reader033.vdocuments.us/reader033/viewer/2022051517/56815a69550346895dc7bd45/html5/thumbnails/15.jpg)
Specific problems for Dutch
• Verb-subject and verb-object relations are hard to extract, as word order is not a reliable cue, e.g.

BOONEN zou Voigt in de sprint geklopt hebben
Boonen would Voigt in the sprint beaten have
‘Boonen would have beaten Voigt in the sprint.’

VOIGT zou Boonen in de sprint geklopt hebben
Voigt would Boonen in the sprint beaten have
‘Voigt would have beaten Boonen in the sprint.’
(Bouma 2008:20)
![Page 16: The Sketch Engine for Dutch with the ANW corpus Carole Tiberius](https://reader033.vdocuments.us/reader033/viewer/2022051517/56815a69550346895dc7bd45/html5/thumbnails/16.jpg)
Sketch Grammar rules

*DUAL
=object/object_of
# hij ziet de man / hij heeft de man gezien (he sees the man / he has seen the man)
"P.*pers.*nom.*" 1:"V.*mai.*" [tag="[T|D|M|R|A].*"]{0,3} 2:"N.*" [tag!="N.*" & tag!="S.pre.*"]
"P.*pers.*nom.*" "V.*aux.*" [tag="[T|D|M|R|A].*"]{0,3} 2:"N.*" 1:"V.*mai.*"
# gisteren zag Piet Jan (yesterday Piet saw Jan)
[word="(gisteren|morgen|vanmorgen|vandaag|jaar)"] 1:"V.*mai.*" [tag="[T|D|M|R|A].*"]{0,3} "N.*" [tag="[T|D|M|R|A].*"]{0,3} 2:"N.*"
[word="(gisteren|morgen|vanmorgen|vandaag|jaar)"] [tag="V.*aux.*"] [tag="[T|D|M|R|A].*"]{0,3} 2:"N.*" 1:"V.*mai.*"
# omdat Piet Jan ziet (because Piet sees Jan)
"C.*sub.*" [tag="[T|D|M|R|A].*"]{0,3} "N.*" [tag="[T|D|M|R|A].*"]{0,3} 2:"N.*" 1:"V.*mai.*"

*DUAL
=subject/subject_of
# gisteren zag Piet Jan (yesterday Piet saw Jan)
[word="(gisteren|morgen|vanmorgen|vandaag|jaar)"] 1:"V.*mai.*" [tag="[T|D|M|R|A].*"]{0,3} 2:"N.*" [tag="[T|D|M|R|A].*"]{0,3} "N.*"
[word="(gisteren|morgen|vanmorgen|vandaag|jaar)"] [tag="V.*aux.*"] [tag="[T|D|M|R|A].*"]{0,3} 2:"N.*" [tag="[T|D|M|R|A].*"]{0,3} "N.*" 1:"V.*mai.*"
# omdat Piet Jan ziet (because Piet sees Jan)
[word="omdat" | word="dat" & tag="C.*sub.*"] [tag="[T|D|M|R|A].*"]{0,3} 2:"N.*" [tag="[T|D|M|R|A].*"]{0,3} "N.*" 1:"V.*mai.*"
# gepleegd door de moordenaar (committed by the murderer)
1:"V.*mai.*part.*past.*" [word="door"] [tag="[T|D|M|R|A].*"]{0,3} 2:"N.*"
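
To make the mechanics of such a rule concrete, here is a rough Python imitation of the first object rule (pronoun subject, main verb labelled 1:, up to three intervening tokens, noun labelled 2:). It is a sketch under assumptions, not how the Sketch Engine evaluates rules: the tag sequence is flattened into a string and matched with an ordinary regular expression, the tag values are invented, and the trailing negative context of the original rule is left out. Because the relation is declared as dual, each match yields a pair of triples, one per direction.

```python
import re

# Regex stand-in for:
#   "P.*pers.*nom.*" 1:"V.*mai.*" [tag="[T|D|M|R|A].*"]{0,3} 2:"N.*"
# Tags are joined with single spaces, so \S* keeps a pattern inside one tag.
OBJECT_RULE = re.compile(
    r"(?<!\S)P\S*pers\S*nom\S* "   # nominative personal pronoun (unlabelled)
    r"(?P<verb>V\S*mai\S* )"       # 1: main verb
    r"(?:[TDMRA]\S* ){0,3}"        # up to three article/adjective-like tokens
    r"(?P<noun>N\S* )"             # 2: noun
)


def find_objects(tokens):
    """tokens: list of (word, tag, lemma). Returns (lemma, relation, lemma) triples."""
    tag_string = "".join(tag + " " for _, tag, _ in tokens)
    # Map the character offset where each tag starts back to its token index.
    index_at = {}
    offset = 0
    for i, (_, tag, _) in enumerate(tokens):
        index_at[offset] = i
        offset += len(tag) + 1
    triples = []
    for match in OBJECT_RULE.finditer(tag_string):
        verb = tokens[index_at[match.start("verb")]][2]
        noun = tokens[index_at[match.start("noun")]][2]
        triples.append((verb, "object", noun))     # relation seen from the verb
        triples.append((noun, "object_of", verb))  # same pair seen from the noun
    return triples


# "hij ziet de oude man" (he sees the old man); tags are hypothetical.
sentence = [
    ("hij", "P.pers.nom", "hij"),
    ("ziet", "V.mai.pres", "zien"),
    ("de", "T.def", "de"),
    ("oude", "A.attr", "oud"),
    ("man", "N.com.sg", "man"),
]
print(find_objects(sentence))
```

The rule assigns roles purely by position, which is why several variants per relation are needed to cover main clauses, fronted adverbials, and subordinate clauses.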
![Page 17: The Sketch Engine for Dutch with the ANW corpus Carole Tiberius](https://reader033.vdocuments.us/reader033/viewer/2022051517/56815a69550346895dc7bd45/html5/thumbnails/17.jpg)
Specific problems for Dutch
• Separable verbs, e.g.
Hij at een hele boterham op (from ‘opeten’)
He ate a whole sandwich up
‘He ate a whole sandwich’

omdat hij een hele boterham op heeft gegeten
because he a whole sandwich up has eaten
‘because he has eaten a whole sandwich’
![Page 18: The Sketch Engine for Dutch with the ANW corpus Carole Tiberius](https://reader033.vdocuments.us/reader033/viewer/2022051517/56815a69550346895dc7bd45/html5/thumbnails/18.jpg)
Sketch Grammar rules

=bijw+WW
# separable verbs
"N.*" 2:"S.*nonpre.*" "U.*infmark.*" "V.*aux.*"{0,3} 1:"V.*mai.*inf.*"
"N.*" 2:"S.*nonpre.*" "U.*infmark.*" 1:"V.*mai.*inf.*"
"A.*partpast.*" 2:"S.*nonpre.*" "U.*infmark.*" "V.*aux.*"{0,3} 1:"V.*mai.*inf.*"
"N.*partpast.*" 2:"S.*nonpre.*" "U.*infmark.*" 1:"V.*mai.*inf.*"
"V.*mai.*" 2:"S.*nonpre.*" "U.*infmark.*" "V.*aux.*"{0,3} 1:"V.*mai.*inf.*"
"V.*mai.*" 2:"S.*nonpre.*" "U.*infmark.*" 1:"V.*mai.*inf.*"
"N.*" "C.*sub.*" 2:"S.*nonpre.*" "U.*infmark.*" "V.*aux.*"{0,3} 1:"V.*mai.*inf.*"
"N.*" "C.*sub.*" 2:"S.*nonpre.*" "U.*infmark.*" 1:"V.*mai.*inf.*"
"A.*" "C.*sub.*" 2:"S.*nonpre.*" "U.*infmark.*" "V.*aux.*"{0,3} 1:"V.*mai.*inf.*"
"A.*" "C.*sub.*" 2:"S.*nonpre.*" "U.*infmark.*" 1:"V.*mai.*inf.*"
"V.*mai.*" "C.*sub.*" 2:"S.*nonpre.*" "U.*infmark.*" "V.*aux.*"{0,3} 1:"V.*mai.*inf.*"
"V.*mai.*" "C.*sub.*" 2:"S.*nonpre.*" "U.*infmark.*" 1:"V.*mai.*inf.*"
![Page 19: The Sketch Engine for Dutch with the ANW corpus Carole Tiberius](https://reader033.vdocuments.us/reader033/viewer/2022051517/56815a69550346895dc7bd45/html5/thumbnails/19.jpg)
Subcorpora
Within the ANW corpus, 7 subcorpora were defined:

– Belgian Dutch
– Dutch Dutch
– Corpus Literary Texts
– Domain-dependent Texts
– Newspaper Texts
– Neologisms
– Pluscorpus
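
One way to picture these subcorpora is as filters over the document header attributes, with the two language varieties cutting across the text-type subcorpora. The Python sketch below shows that idea only; the attribute names follow the header fields listed under corpus preparation, but the attribute values ("BE", "NL", "literary", and so on) are hypothetical.

```python
from collections import defaultdict

# Assumed attribute values; the real header values are not given in the slides.
SUBCORPUS_FILTERS = {
    "Belgian Dutch": lambda doc: doc.get("variant") == "BE",
    "Dutch Dutch": lambda doc: doc.get("variant") == "NL",
    "Corpus Literary Texts": lambda doc: doc.get("subcorpus") == "literary",
    "Domain-dependent Texts": lambda doc: doc.get("subcorpus") == "domains",
    "Newspaper Texts": lambda doc: doc.get("subcorpus") == "newspaper",
    "Neologisms": lambda doc: doc.get("subcorpus") == "neologisms",
    "Pluscorpus": lambda doc: doc.get("subcorpus") == "plus",
}


def assign_subcorpora(docs):
    """Return {subcorpus name: [document ids]}; a document may appear in several."""
    members = defaultdict(list)
    for doc in docs:
        for name, matches in SUBCORPUS_FILTERS.items():
            if matches(doc):
                members[name].append(doc["id"])
    return dict(members)


docs = [
    {"id": "d1", "subcorpus": "newspaper", "variant": "BE"},
    {"id": "d2", "subcorpus": "literary", "variant": "NL"},
    {"id": "d3", "subcorpus": "newspaper", "variant": "NL"},
]
print(assign_subcorpora(docs))
```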
![Page 20: The Sketch Engine for Dutch with the ANW corpus Carole Tiberius](https://reader033.vdocuments.us/reader033/viewer/2022051517/56815a69550346895dc7bd45/html5/thumbnails/20.jpg)
Language variety: Belgian Dutch
![Page 21: The Sketch Engine for Dutch with the ANW corpus Carole Tiberius](https://reader033.vdocuments.us/reader033/viewer/2022051517/56815a69550346895dc7bd45/html5/thumbnails/21.jpg)
Language variety: Dutch Dutch
![Page 22: The Sketch Engine for Dutch with the ANW corpus Carole Tiberius](https://reader033.vdocuments.us/reader033/viewer/2022051517/56815a69550346895dc7bd45/html5/thumbnails/22.jpg)
Wish list / Questions
• Fixed order of display
• Efficient dealing with different tag sets
• Correct display of unary relations
• Possible formats of dates in document headers
• Use of morphological information in Sketch Engine
![Page 23: The Sketch Engine for Dutch with the ANW corpus Carole Tiberius](https://reader033.vdocuments.us/reader033/viewer/2022051517/56815a69550346895dc7bd45/html5/thumbnails/23.jpg)
http://anw.inl.nl
![Page 24: The Sketch Engine for Dutch with the ANW corpus Carole Tiberius](https://reader033.vdocuments.us/reader033/viewer/2022051517/56815a69550346895dc7bd45/html5/thumbnails/24.jpg)