GDEX: Automatically finding good dictionary examples in a corpus

Auckland 2012 Kilgarriff: GDEX



Users appreciate examples
  Paper: space constraints
  Electronic: no space constraints
    Give lots of examples
  Constraint: cost of selection, editing


Project
  Macmillan English Dictionary
    Already had 1000 collocation boxes
    Average 8 per box
  New electronic version
    All 8000 collocations need examples
    Authentic; from corpus


Old method
  Lexicographer
    Gets concordance for collocation
    Reads through until they find a good example
    Cut, paste, edit


New method
  Lexicographer
    Gets sorted concordance
      20 best examples in spreadsheet
    Less reading through
    Tick the first good one, edit


What makes a good example?
  Readable
    EFL users
  Informative
    Typical, for the collocation
    Gives context which helps user understand the target word/phrase


Readability
  70 years of research
  Not just (or mainly) EFL
    Educational theory
      Teaching children to read
    Instruction manuals
      Early work: US military
    Publishing
      People like newspapers and magazines that they find easy to read


Readability tests
  Flesch Reading Ease test
    1948
    Average sentence length, average word length
    In some word processing software
  Many similar measures
  Recent work
    Training data for different reading levels
    Language modelling
  Target levels
    US grades
    Now, increasingly: Common European Framework
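As an illustration of the "average sentence length, average word length" idea, here is a minimal sketch of the Flesch Reading Ease score using the commonly published constants (word length is measured in syllables). The syllable counter and the helper names `count_syllables` and `score_text` are assumptions made for this example; counting vowel groups is only a crude approximation, and nothing here is taken from GDEX itself.

```python
import re

def flesch_reading_ease(n_words, n_sentences, n_syllables):
    """Flesch Reading Ease from raw counts; higher means easier to read."""
    return (206.835
            - 1.015 * (n_words / n_sentences)
            - 84.6 * (n_syllables / n_words))

def count_syllables(word):
    """Crude syllable estimate: number of vowel groups, at least one."""
    return max(1, len(re.findall(r"[aeiouy]+", word.lower())))

def score_text(text):
    """Apply the formula to raw text using naive tokenisation."""
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z']+", text)
    syllables = sum(count_syllables(w) for w in words)
    return flesch_reading_ease(len(words), len(sentences), syllables)

easy = score_text("The cat sat on the mat.")
hard = score_text("Institutional considerations necessitate comprehensive reorganisation.")
print(easy > hard)
```

Short sentences made of short words score high; long, polysyllabic sentences score low, which is the behaviour a readability filter for EFL users would want.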


GDEX
  Get concordance for collocation
  For each sentence
    Score it
  Sort
  Show best ones to lexicographer
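The score, sort, show loop can be sketched in a few lines. This is an illustrative outline only: `rank_examples` and `toy_score` are hypothetical names, and the toy scoring function (prefer sentences near 18 words) merely stands in for whatever scoring is actually used.

```python
def rank_examples(sentences, score, top_n=20):
    """Score every concordance sentence and return the top_n, best first."""
    scored = [(score(s), s) for s in sentences]
    # Sort on the score only, highest first; ties keep corpus order.
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [sentence for _, sentence in scored[:top_n]]

def toy_score(sentence):
    """Placeholder score (assumption): closer to 18 words is better."""
    return -abs(len(sentence.split()) - 18)

concordance = [
    "a b c",
    " ".join(["w"] * 18),
    " ".join(["w"] * 30),
]
best = rank_examples(concordance, toy_score, top_n=2)
print([len(s.split()) for s in best])
```

The lexicographer then only needs to read down the short ranked list rather than the whole concordance.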


GDEX heuristics
  Sentence length (10-26 words)
  Mostly common words is good
    Rare words are bad
  Sentences
    Start with capital, end with one of .!?
  No [, ], <, >, http, \
  Not much other punctuation, numbers
  Not too many capitals
  Typicality: third collocate is a plus
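Several of these heuristics are simple enough to express as small functions. The sketch below is an assumption-laden illustration, not the GDEX implementation: the function names are hypothetical, `common_words` is a word list the caller must supply, and only the 10-26 word range and the forbidden strings come from the slide.

```python
import re

# Strings the slide lists as disqualifying.
FORBIDDEN = ("[", "]", "<", ">", "http", "\\")

def length_ok(sentence, lo=10, hi=26):
    """True if the sentence has between lo and hi whitespace-separated words."""
    return lo <= len(sentence.split()) <= hi

def well_formed(sentence):
    """True if it starts with a capital letter and ends with . ! or ?"""
    s = sentence.strip()
    return bool(s) and s[0].isupper() and s[-1] in ".!?"

def no_forbidden(sentence):
    """True if none of the forbidden strings occur anywhere in the sentence."""
    return not any(token in sentence for token in FORBIDDEN)

def common_word_ratio(sentence, common_words):
    """Fraction of words found in a caller-supplied list of common words."""
    words = re.findall(r"[a-z']+", sentence.lower())
    if not words:
        return 0.0
    return sum(w in common_words for w in words) / len(words)

def capital_ratio(sentence):
    """Fraction of letters that are upper case ("not too many capitals")."""
    letters = [c for c in sentence if c.isalpha()]
    if not letters:
        return 0.0
    return sum(c.isupper() for c in letters) / len(letters)

good = "The committee reached a broad consensus on the new policy after a long debate."
bad = "see http://example.com [edit]"
print(length_ok(good), well_formed(good), no_forbidden(good))
print(length_ok(bad), well_formed(bad), no_forbidden(bad))
```

Each function returns a value that can be fed into a combined score; the "third collocate" typicality check is omitted because it needs collocation data from the corpus.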


Weighting
  For each sentence
    Score on each heuristic
    Weight scores
    Add together weighted scores
  How to set weights?
    Two students: manually judged 1000 “good examples”
    Weights set so system makes same choices as students
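A weighted sum of heuristic scores, plus a way to check how often a given set of weights picks the same sentence as a human judge, might look like the sketch below. The slides do not say how the weights were actually fitted, so this only shows the scoring and an agreement measure that any fitting procedure could try to maximise. The names `weighted_score` and `agreement`, the heuristic names, and the data layout (a list of candidate score dictionaries with the index of the human-chosen one) are all assumptions.

```python
def weighted_score(scores, weights):
    """Combine per-heuristic scores into one number via a weighted sum."""
    return sum(weights[name] * value for name, value in scores.items())

def agreement(weights, judged):
    """Share of judged sets where the top-scoring candidate is the human choice.

    judged: list of (candidates, chosen_index), where candidates is a list of
    per-heuristic score dicts (hypothetical layout for this sketch).
    """
    hits = 0
    for candidates, chosen_index in judged:
        totals = [weighted_score(c, weights) for c in candidates]
        if totals.index(max(totals)) == chosen_index:
            hits += 1
    return hits / len(judged)

judged = [
    ([{"length": 1.0, "common": 0.0},
      {"length": 0.0, "common": 1.0}], 1),  # the judge preferred the second sentence
]
print(agreement({"length": 1.0, "common": 2.0}, judged))
print(agreement({"length": 2.0, "common": 1.0}, judged))
```

Trying different weight settings and keeping the one with the highest agreement is one simple way to make the system's choices track the judges' choices.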


Was it successful?
  Did it save lexicographer time?
    Definitely (says project manager)
  Rough guess
    Average number of corpus lines to read until you find a good one:
      Unsorted: 20
      Sorted: 5


Corpus choice
  Started with BNC but
    Too old
    Not enough examples
      If no good examples in corpus, GDEX can’t help
  Changed to UKWaC
    20 times bigger; from web; contemporary
    Better
    Most web junk filtered out
    Usually a good example in top twenty


GDEX and TALC
  TALC (Teaching and Language Corpora)
    Goal: bring corpora into language teaching
  Usual problem
    Concordances are tough for learners to read
  Way forward
    GDEX examples
    Half way between dictionary and corpus


GDEX: Models for use
  More examples for dictionaries
    Speed up, as with MED
    or
    Fully automatic “more examples”
  Corpus query tool
    Sort concordances, best first
    Now an option in the Sketch Engine
  Automatic collocations dictionary
    http://forbetterenglish.com

Recent developments
  Configurable GDEX
    For other languages
    Interface to help set up
  Commonest string
    Between ‘bare collocate’ and example
