Modern Lexicography – Developments, Prospects, and Problems. Patrick Hanks, Research Institute of Information and Language Processing, University of Wolverhampton. EFNIL, Budapest, 25 October, 2012


Modern Lexicography – Developments, Prospects, and Problems

Patrick Hanks

Research Institute of Information and Language Processing

University of Wolverhampton

EFNIL, Budapest, 25 October, 2012


Outline of the talk

• Technology and lexicography

– During the Renaissance

– Now

• Philosophy, linguistics, and lexicography

– During the Enlightenment

– Now

• Lexicography of the future

– The corpus revolution

– Presenting the facts to the public

– Can (should) natural language be regulated?


PART 1: Lexicography and technology

Lexicography as we know it today is possible because of two technological developments during the Renaissance:

• The invention of printing (Gutenberg, Mainz, c. 1440)

– Enabling many copies of a work to be disseminated rapidly, regardless of its size, bulk, and complexity.

• The invention of modern typography by Nicolas Jenson (Venice, 1470)

And the scholarship of Aldus Manutius (1449-1515) in Venice

• Manutius collected Latin and Greek manuscripts from all over Europe and had them typeset and printed.

But in the past 10 years this kind of lexicography has become obsolete!

• It has been superseded by a new kind of technology – text processing by computer. I will discuss this in Part 3.


The typography of Gutenberg’s Bible (c. 1455)


Nicolas Jenson’s Roman Antiqua typeface (c. 1468)


Palsgrave (1530)


R. Estienne (1531)


Calepino: Basle edition, 1550


Promptorium Parvulorum in print (Pynson 1499)


Present-day technology: lexicographical evidence

hazard, verb.

1. No one at this stage is prepared to hazard a guess at the outcome of the poll on
2. name -- Chicken.” “Not Hen Chicken?” I hazarded, as this humorous diminutive was part
3. the wall. Stifling a giggle, she hazarded a guess that the wardrobe would be full
4. It seemed sensible to hazard that a man of this standing would have
5. can result in lost profits. When staff hazard a guess as to the price of goods – or
6. them as Part I and Part 2. One might hazard a guess that Part I was concerned with
7. North American standards. He does not hazard any opinions on how costs depend on the
8. ecoming proficient. Perhaps we can now hazard an attempt at defining `a good reader'.
9. builder, nor an architect, I can only hazard a guess. During construction in the mid-19
10. hair and eyes like her mother. I would hazard a guess and say she would be, at the time
11. Where do your art materials live? We hazard a guess that they're lurking in a shoebox
12. excitement than others, and I would hazard a guess that, even if they've never played
13. age and some movies date. I would hazard the guess that The Graduate belongs in
14. What the connection is we can only hazard a guess at but it confirms all our worst
15. have been lost and commandos were not hazarded in foolish risks, although often taking
16. shapes and colours from which we hazard the inference that a leaping dog is in
17. and a principle strong enough to hazard lives for, America cannot hope to lead
18. of the farmer is not revealed; we may hazard the guess that he was William Hardeley,
19. to begin restoring. But I'd hazard a guess that if you restore the directory
20. from time to time admire people who hazard their entire company on one major throw
21. supreme grade of evil'. It may be hazarded that it was this inevitable alliance with
22. his achievement, such as it was, and hazarded the opinion that he might best be remembered
23. in those stations' heyday, but I hazard a guess that considerably more passengers
24. the day's racing. In fact I would hazard a guess that one, if not both of these
25. of society itself. Indeed, one could hazard a further, and more general, observation
26. The Phillips curve. Although Phillips hazarded some theoretical conjectures concerning
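
Concordance lines like these are typically produced by a keyword-in-context (KWIC) search. A minimal sketch in Python, assuming the corpus is a plain-text string; the function name `kwic`, the two-sentence sample, and the context width are illustrative assumptions, not the tool used to prepare the talk:

```python
import re

def kwic(text, pattern, width=40):
    """Return keyword-in-context lines for every regex match of `pattern` in `text`."""
    lines = []
    for m in re.finditer(pattern, text, flags=re.IGNORECASE):
        left = text[max(0, m.start() - width):m.start()]
        right = text[m.end():m.end() + width]
        # Right-align the left context so the keyword lines up in a column.
        lines.append(f"{left:>{width}}{m.group(0)}{right}")
    return lines

# Hypothetical sample; a real study would search a large corpus such as the BNC.
sample = ("I would hazard a guess that The Graduate belongs in that group. "
          "He hazarded the opinion that he might best be remembered for it.")

# \bhazard\w* matches hazard, hazards, hazarded, hazarding.
for line in kwic(sample, r"\bhazard\w*", width=30):
    print(line)
```

Aligning the keyword in a fixed column is what lets a lexicographer scan the right-hand context for recurrent collocates such as guess.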


PART 2: Philosophy, linguistics, and lexicography

• Do words have meaning?

• If not in words, where do meanings reside?

– Nowhere!

– Meanings are ephemeral interpersonal events, not stable objects with a ‘residence’.

– But then how can anyone know what anyone else means?

– What do philosophers say?

– What do linguists say?

• And what is the nature of linguistic creativity?


Do words have meaning?

• Let’s think of a word: blow

• What does blow mean?


The meaning potential of a word

• What’s the meaning of blow?

– What the wind does? A disappointment? Something you do with your fist? With your nose? With a whistle? Spend a lot of money? …

• What’s the meaning of blow up?

– Destroying a building? What you do to a balloon? Lose your temper? …

– All of these things and more! Words are hopelessly ambiguous.

– A checklist of word meanings cannot, for principled reasons, be exhaustive.

– But put a word in context, and ambiguity is reduced or eliminated.


Meaning potentials

• If words don’t have meanings, how come dictionaries have been so successful?

• Strictly speaking, dictionaries list meaning potentials, not meanings.

– The distinction is subtle but the theoretical consequences are far-reaching

– When consulting a dictionary, human beings use their imaginations to put words in a relevant context – a context for which they are already primed (Hoey 2005)

– Computer programs and logic-based theories are not so primed.


Philosophical background

• H. P. Grice (1957, 1975) argued that meanings are not just in the head – they are events; interactions between people:

– between speaker (S) and hearer (H);

– (and with displacement in time) between writer and reader

• For this to work, S and H must share a body of linguistic conventions having the same meanings.

• Grice did not specify what the conventions are.

– He left that task to linguists and lexicographers

– So far, we seem to have let him down rather badly


Lexis and grammar

• Are the conventions that underlie conversational co-operation conventions of grammar (syntax)?

– Only partly. Discussed in more detail in Hanks (2012): ‘How people use words to make meanings’.

• Perhaps the conventions that we rely on for conversational co-operation are words, with meanings as given in dictionaries?

– But two decades of research in Word Sense Disambiguation by computational linguists (using LDOCE and other existing lexical resources) is now seen as a failure (Ide and Wilks 2005)

– maybe, at least in part, because dictionaries don’t say enough about phraseology

• Something else is needed.


Firth and Sinclair

“We must separate from the mush of general goings-on those features of repeated events which appear to be part of a patterned process.” —J. R. Firth (1950)


Idiomaticity vs. Open Choice

• “The principle of idiom is that a language user has available to him or her a large number of semi-preconstructed phrases that constitute single choices, even though they might appear to be analysable into segments.”

—Sinclair 1991. Corpus, Concordance, Collocation, p. 110

• “Tending towards open choice is what we can dub the terminological tendency, which is the tendency for a word to have a fixed meaning in reference to the world. ... tending towards idiomaticity is the phraseological tendency, where words tend to go together and make meanings by their combinations.”

—Sinclair 2004. Trust the Text, p. 29


The importance of context

• “More often than not, activation of a particular meaning depends on the co-occurrence of two or more lexical items” – Sinclair

– The study of collocations is still in its infancy

– Empirical measurement of word co-occurrences (collocations) only became possible with very large corpora (i.e. since the early 1990s)

– Problem with small corpora (Brown, LOB, ICE):

• Impossible to distinguish significant collocations from chance

– We now have very large corpora – billions of words of text

• Contemporary corpora, historical corpora, domain corpora, …

– But serious analysis of corpus data has hardly started

– It requires both new tools and revision of received theories


Idiom and Open Choice

• The range of collocational norms varies greatly from word to word

• What do you abandon?

– a car, [NO DET] ship, an old fridge, a plan, a theory, a baby, a dog ( = as a pet), your wife and children, …

– Very open choice in the direct object slot.

• What do you hazard?

– The direct object slot is idiomatically highly constrained:

• just one word (guess) accounts for over 50% of uses of this verb
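
The “over 50%” figure can be checked roughly against the 26 concordance lines for hazard shown earlier in the talk. A minimal sketch, assuming a hand-tagged list of what fills the object slot in each of those lines; the labels are one reader's tagging of that small sample, not data reported in the talk:

```python
from collections import Counter

# Object-slot fillers for the 26 'hazard' concordance lines, tagged by hand.
# "CLAUSE" marks a that-clause and "QUOTE" reported speech; both labels are assumptions.
objects = [
    "guess", "QUOTE", "guess", "CLAUSE", "guess", "guess", "opinion", "attempt",
    "guess", "guess", "guess", "guess", "guess", "guess", "commando", "inference",
    "life", "guess", "guess", "company", "CLAUSE", "opinion", "guess", "guess",
    "observation", "conjecture",
]

counts = Counter(objects)
total = len(objects)

# Print each filler with its share of all uses, most frequent first.
for filler, n in counts.most_common():
    print(f"{filler:12} {n:2}  {n / total:.0%}")

guess_share = counts["guess"] / total
```

On this small sample guess fills 14 of the 26 slots (about 54%), which is consistent with the claim; a sample of 26 lines is of course far too small to establish it.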


Exploiting the norm

• I hazarded various Stuartesque destinations like Florida, Bali, Crete and Western Turkey.

—Julian Barnes

• Is it normal to hazard destinations or locations? – No.

• This is an exploitation of a norm.

• We need a theory (and an artefact) that distinguishes the normal, conventional, idiomatic phraseology of each word from exploitations of those phraseological norms


Extended context (Several exploitations here)

Stuart needlessly scraped a fetid plastic comb over his cranium.

—‘Where are you going? You know, just in case I need to get in touch.’

—‘State secret. Even Gillie doesn’t know. Just told her to take light clothes.’

He was still smirking, so I presumed that some juvenile guessing game was required of me. I hazarded various Stuartesque destinations like Florida, Bali, Crete and Western Turkey, each of which was greeted by a smug nod of negativity. I essayed all the Disneylands of the world and a selection of tarmacked spice islands; I patronised him with Marbella, applauded him with Zanzibar, tried aiming straight with Santorini. I got nowhere.


PART 3: Lexicography of the future

• Will draw on prototype theory (Rosch 1972)

• Will aim to map cognitive prototypes (meanings, beliefs, etc., associated with each word) onto phraseological prototypes of those words in use

• There will be an emphasis on analysing statistically significant collocations


James Murray (1878) predicts the need for corpus data

• “The editor and his assistants have to spend precious hours searching for examples of common everyday words. Thus, in the slips, we have 50 citations for abusion, but for abuse, not five.” – James Murray, Presidential address to the Philological Society, 1878


The need for a pattern dictionary

• To record all and only the normal patterns of use for each word

– Not meanings

– Not all possible patterns

• A pattern dictionary will be a benchmark against which actual usage can be measured

• Meanings, implicatures, translations, and whatever-else-you-like are attached to patterns (not to isolated words)

– A word is no more than an entry point to a set of patterns


What is a pattern dictionary?

• A semantically driven syntagmatic inventory of normal word uses and meanings (implicatures).

– Based on analysis of significant colligations and a statistically valid random sample.

– Shows comparative frequency of each pattern of a polysemous word.

• Meanings are associated with patterns, not with words.

– The colligational preferences of a word are part of its patterns.

• Created by means of a painstaking technique called Corpus Pattern Analysis (CPA).


Norms and exploitations

• A pattern dictionary aims to record all and only the normal uses of each word.

– Exploitation of norms is a subject for separate analysis.

– Types of ‘exploitation’ include creative metaphor, ellipsis, and (in particular) anomalous realizations. Consider:

• The goat ate the newspaper.

• The verb eat has a preference for nouns of semantic type [[Food]] in the direct object clause role.

• ‘[[Animate]] eat [[Document]]’ is not a normal pattern of English.

• Compare John devoured the newspaper.

• ‘[[Human]] devour [[Document]]’ is a normal pattern of English. It is a conventional metaphor.
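
The contrast between the two newspaper examples can be sketched as a lookup against a list of normal patterns. This is a toy illustration only: the tiny ontology, the noun lexicon, and the function names are hypothetical, and the pattern “[[Animate]] devour [[Food]]” is an added assumption that does not appear on the slide.

```python
# Toy ontology: each semantic type points to its parent (None at the top).
PARENT = {"Human": "Animate", "Animal": "Animate", "Animate": "Entity",
          "Food": "Entity", "Document": "Entity", "Entity": None}

# Toy lexicon assigning one semantic type per noun (an assumption for illustration).
NOUN_TYPE = {"goat": "Animal", "John": "Human", "newspaper": "Document", "grass": "Food"}

# Normal patterns per verb as (subject type, object type).
# ("Animate", "Food") for devour is assumed; the others follow the slide.
PATTERNS = {
    "eat": [("Animate", "Food")],
    "devour": [("Animate", "Food"), ("Human", "Document")],
}

def is_a(sem_type, ancestor):
    """True if sem_type equals ancestor or inherits from it."""
    while sem_type is not None:
        if sem_type == ancestor:
            return True
        sem_type = PARENT[sem_type]
    return False

def classify(subject, verb, obj):
    """Label a clause 'norm' if it matches a listed pattern, else 'exploitation'."""
    s, o = NOUN_TYPE[subject], NOUN_TYPE[obj]
    for subj_type, obj_type in PATTERNS[verb]:
        if is_a(s, subj_type) and is_a(o, obj_type):
            return "norm"
    return "exploitation"

print(classify("goat", "eat", "newspaper"))     # exploitation
print(classify("John", "devour", "newspaper"))  # norm
```

A clause that matches no listed pattern is not thereby an error: in this approach it is set aside as a candidate exploitation for separate analysis.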


Specifically, ...

The Pattern Dictionary of English Verbs

• aims to list all normal patterns of each verb lemma in the BNC.

– with practical applications and theoretical consequences (see later).

• A benchmark for comparative studies of and identification of norms in other corpora

– by time period: historical corpora, future corpora

– by region: e.g. American English

– by domain, e.g.

• ‘[[Human]] abate [[Problem = Nuisance]]’ is a domain-specific norm in the domain of legal jargon

• abate is not normally a transitive verb.


A typical Pattern Dictionary entry

• irritate

PATTERN 1 (90%): [[Anything]] irritate [[Human]]

IMPLICATURE: [[Anything]] causes [[Human]] to feel mildly annoyed.

PATTERN 2 (8%): [[Phys Obj | Stuff]] irritate [[Body Part]]

IMPLICATURE: [[Phys Obj | Stuff]] causes [[Body Part]] to become inflamed and somewhat painful.

• Notes:

1. Both these patterns are transitive but they have different meanings. They are distinguished by the semantic types of the nouns.

2. Getting the right level of semantic generalization for each noun is hard. It must select normal, prototypical uses – not all possible uses.
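
An entry of this kind maps naturally onto a small record type. A minimal sketch, assuming a hypothetical `Pattern` class whose field names are invented here for illustration and are not PDEV's actual data format:

```python
from dataclasses import dataclass

@dataclass
class Pattern:
    share: float        # proportion of sampled uses, e.g. 0.90 for 90%
    subject: str        # semantic type(s) of the subject
    verb: str
    obj: str            # semantic type(s) of the direct object
    implicature: str

    def template(self):
        """Render the pattern in the [[Type]] verb [[Type]] notation used above."""
        return f"[[{self.subject}]] {self.verb} [[{self.obj}]]"

# The two patterns for 'irritate' as given in the entry above.
irritate = [
    Pattern(0.90, "Anything", "irritate", "Human",
            "[[Anything]] causes [[Human]] to feel mildly annoyed."),
    Pattern(0.08, "Phys Obj | Stuff", "irritate", "Body Part",
            "[[Phys Obj | Stuff]] causes [[Body Part]] to become inflamed and somewhat painful."),
]

for i, p in enumerate(irritate, start=1):
    print(f"PATTERN {i} ({p.share:.0%}): {p.template()}")
```

Attaching the implicature to the pattern rather than to the verb reflects the principle that meanings belong to patterns, not to isolated words. The two shares sum to 98%; the entry does not say what the remaining 2% covers.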


Semantic type vs. contextual role

• Mr Woods sentenced Bailey to seven years | life imprisonment

PATTERN: [[Human 1]] sentence [[Human 2]] {to [[Time Period | Punishment]]}

• Semantic type: [[Human]]

• Contextual roles: [[Human 1 = Judge]], [[Human 2 = Convicted Criminal]], seven years [[Time Period = Punishment in jail]]

– Semantic type is an intrinsic semantic property of a lexical item.

– Contextual role is extrinsic; the meaning is imposed (activated, selected) by the context in which the word is used.


Nouns and verbs

• The analytical apparatus required for nouns is different in kind from that required for predicators (verbs, adjectives).

– Nouns are grouped into lexical sets in relation to the predicators that they normally collocate with.

– The lexical sets are normally united by a semantic type.

– A shallow ontology of nouns (grouped by their semantic type) is therefore part of the apparatus of a pattern dictionary.

– Semantic types in real texts are more complex than might be expected at first sight or from invented examples.

– Lexical sets include alternations, parts, and properties of types


What would an empirically well-founded ontology be like? (1)

• A hierarchy of about 250 semantic types (not more)

• Representing the intrinsic conceptual semantic properties of words

– [[Eventuality]] and [[Entity]] at the top

– [[Eventuality]] = [[Event | State of Affairs]]

– [[Entity]] = [[Physical Object | Abstract Object]]

• Each semantic type is governed by corpus evidence of colligations, e.g.:

• [[Human]]s and [[Animal]]s eat, run, sleep, etc.

• [[Human]]s and [[Institution]]s think, say, negotiate, etc.

• So snakes (for this purpose) are not animals

• The hierarchy of [[Artefact]]s has many members, because different artefacts are used for different purposes (= with different verbs).

• Ref. James Pustejovsky, 1995. The Generative Lexicon.


What would an empirically well-founded ontology be like? (2)

• It would have to take account of verb-specific lexical alternations (parts and properties).

• For example, Pattern 2 (of 8) for calm, verb, is: [[Human 1 | Event]] calm [[Human 2]]

– Alternation of Human (2): [[Animal]]

– Parts of Human (2): nerves [[Body Part | Psyche Part]]

– Attributes: [[Possessive Determiner]] fear, anxiety, agitation, .... [[Emotion]]


Argument alternation and focus

1. Straightforward alternations:

– People negotiate, governments negotiate, …

– Humans eat, horses eat, dogs eat, alligators eat …

– Horses gallop, humans gallop [ambiguous]

2. Another function of argument variation is focus:

– repair one’s car, repair the fender, repair the damage

– treat a person, treat his ankle, treat the injured, treat their injuries

– The meaning of treat here contrasts with the meaning in treat a person well/badly

The presence or absence of a manner adverbial is all-important


How to Measure Collocations?

Various statistical tools are available, e.g.:

• Mutual Information (“MI”; Church and Hanks 1990) – tends to favour content words as collocates

• t-score tends to favour function words as collocates.

• Sketch Engine (Kilgarriff, Rychlý, et al., 2004)

– measures salience scores for pairs of collocates in pre-determined colligational patterns

• Take your pick – but measuring must be done, one way or the other, if we are to have any hope of understanding the nature of meaning in language and getting our dictionaries to report accurately how words are used

– because a natural language is a fuzzy, variable, analogical, unstable system for making meanings
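
The different biases of MI and t-score can be seen in a small worked example. A minimal sketch, assuming the usual frequency-based formulations (MI as log2 of observed over expected co-occurrence, and the t-score approximation (observed − expected) / √observed); all counts below are invented for illustration and come from no corpus:

```python
import math

def mutual_information(f_xy, f_x, f_y, n):
    """Pointwise MI in bits: log2 of observed over expected co-occurrence frequency."""
    expected = f_x * f_y / n
    return math.log2(f_xy / expected)

def t_score(f_xy, f_x, f_y, n):
    """t-score approximation: (observed - expected) / sqrt(observed)."""
    expected = f_x * f_y / n
    return (f_xy - expected) / math.sqrt(f_xy)

# Invented counts for illustration only.
n = 1_000_000                        # corpus size in tokens
f_node = 100                         # frequency of the node word
rare = dict(f_xy=50, f_y=200)        # a rare content-word collocate
common = dict(f_xy=60, f_y=50_000)   # a frequent function-word collocate

for name, c in [("rare content word", rare), ("frequent function word", common)]:
    mi = mutual_information(c["f_xy"], f_node, c["f_y"], n)
    t = t_score(c["f_xy"], f_node, c["f_y"], n)
    print(f"{name:24} MI={mi:6.2f}  t={t:5.2f}")
```

With these invented counts MI ranks the rare collocate far above the frequent one, while the t-score puts the frequent one slightly ahead, which is the tendency described in the bullets above.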


The Pattern Dictionary and FrameNet

PDEV is corpus-driven (ruthlessly empirical) and proceeds word by word, investigating syntagmatic criteria for distinguishing different meanings of polysemous words, in a “semantically shallow” way.

FrameNet proceeds frame by frame. It:

• expresses the deep semantics of situations (frames);

• proceeds frame by frame, not word by word;

• analyses situations in terms of frame elements;

• studies meaning differences and similarities between different words in a frame;

• does not explicitly study meaning differences of polysemous words;

• does not analyse corpus data systematically, but goes fishing in corpora for examples in support of hypotheses;

• has problems grouping words into frames, and misses some;

• has no established inventory of frames;

• has no criteria for completeness of a lexical entry.


Construction Grammar (1)

• Focus on meaning, not just well-formedness.

• Challenges reductionist theories of language

• Meaning is (in part) associated with constructions.

• Anything from a word to a clause can be a construction.

– Example: ‘she slept her way to the top.’

– Sleep is not normally a goal-achievement verb.

– But in this sentence, it is coerced into being one by the construction “[V] one’s way to [[Status]]”.

– This meaning is not arrived at by a concatenation of the meanings of the lexical items of which the sentence is composed.


Construction Grammar (2)

• So far so good – but Construction Grammar is in the speculative tradition. It is not based on analysis of evidence.

• It is based largely on made-up examples, many of which are bizarre, e.g. The gardener watered the flowers flat.

• Corpus evidence shows that the verb water does not normally participate in the resultative construction.

• A distinction between normal usage and exploitation of norms must be made.

– Abnormal examples are conducive to distortions in the theory.

– CG needs corpus analysis.

– Some sort of synthesis between PG and CG is desirable.


Theoretical consequences and practical applications (1)

Pedagogical:

• Anyone acquiring a language must learn competence in two kinds of rule-governed linguistic behaviour:

– How to use words normally

– How to exploit the norms (creative metaphors, ellipsis, etc.)

• A pattern dictionary gives comparative frequency of patterns.

– A lexical syllabus will focus on statistically significant patterns of use.

• In error analysis: what norm was aimed at?

– If learners are exploiting norms creatively, do you (the teacher) really want them to?


Theoretical consequences and practical applications (2)

For theoretical linguistics:

• Are some grammars better than others for representing how words are used to make meanings?

– ‘S → NP VP’: confuses language with predicate logic

• The third argument (‘adjunct’, ‘adverbial’):

– Not well analysed in generative grammar (or, indeed, any other grammar)

– CPA shows that a new grammar of adverbials is needed.

• Metaphor analysis:

– CPA distinguishes conventional metaphors from exploitations.


Theoretical consequences and practical applications (3)

For computational linguistics and AI:

• Improving machine translation

– Getting the right pattern is more likely to select the right translation.

• Parsing and word-class tagging:

– CLAWS achieves ~90% accuracy in word-class tagging in BNC

– CPA reveals some systematic errors in CLAWS tagging.

• Anaphora resolution:

– He found a glass of water on the table and drank it.

– ‘[[Animate]] drink [[Liquid]]’ selects water as a direct object of drink
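
How a pattern's semantic-type preference can pick the antecedent of it may be sketched as a filter over candidate nouns. This is a toy illustration: the lexicon, the type labels “Container” and “Furniture”, and the function name are hypothetical, and a real resolver would combine this cue with syntactic and discourse evidence.

```python
# Toy lexicon of semantic types (hypothetical; a real system would use a corpus-derived ontology).
NOUN_TYPE = {"glass": "Container", "water": "Liquid", "table": "Furniture"}

# Object-slot preference per verb, following the pattern '[[Animate]] drink [[Liquid]]'.
OBJECT_PREFERENCE = {"drink": "Liquid"}

def resolve_pronoun(verb, candidates):
    """Return the candidate antecedents whose semantic type fits the verb's object slot."""
    wanted = OBJECT_PREFERENCE[verb]
    return [noun for noun in candidates if NOUN_TYPE[noun] == wanted]

# Candidate antecedents for 'it' in: He found a glass of water on the table and drank it.
print(resolve_pronoun("drink", ["glass", "water", "table"]))
```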


Presenting the facts to the public (2)

• Dictionaries of the future will be electronic products

– Space constraints removed

– leading to a danger of verbosity

• They will pay more attention to phraseology and collocation

• Language communities will still need lexicographers to analyse the lexical content of corpora, Internet data, conversation, etc., and to identify the phraseological conventions on which successful communication depends

• You can’t just plonk language learners down in front of a concordance (corpus data) and expect them to work out what is going on. The data needs an interpretation.


Phraseological Lexicography and Computational Linguistics

• At present NLP applications such as machine translation are having great success with “knowledge-poor” statistical methods.

– Sooner or later the pendulum will swing back: lexicographical methods will be needed to augment the raw statistical approach

• According to Ken Church, in 1987 the single most productive contribution to the NLP text-to-speech generation system at AT&T Bell Labs came from the IPA transcriptions in Collins dictionaries

• Can we expect a similar contribution from phraseological lexicography to computational message understanding?


Phraseological Lexicography and the Semantic Web

• Semantic Web: the original dream:

– “Web technology must not discriminate between the scribbled draft and the polished performance.” – Berners-Lee, Hendler, and Lassila, in Scientific American 2001

• At present Semantic Web research is very far from being able to interpret polished performances, let alone scribbled drafts

– It confines itself to identifying names, dates, addresses, and appointments, and to processing tags that have been added to elements in text.

– It is “the apotheosis of annotation – but what are its semantics?” (asks Yorick Wilks)

• Realizing the dream will require lexicographic input – a radical new kind of lexicography, one possibility for which I have tried to outline in this presentation.


A model presentation

• OED3 is a model of electronic presentation

– but its lexicographical principles are old: they are (rightly) those of the Renaissance and the Enlightenment

– These principles need revision in the light of corpus evidence

– But you interfere with a national monument at your peril

– One of many unacknowledged theoretical problems is a confusion between the (stipulative) meaning of scientific concepts and the meanings of words in natural language.

• Dictionaries of the future will be based on the principles of Wittgenstein, Rosch, Putnam, Grice, Firth, and Sinclair.


Can (should) natural language be regulated? (1)

• Johnson’s dictionary (1755) was based on citations from “the best authorities”.

• “Those who have been persuaded to think well of my design require that it should fix our language...

• “When we see men grow old and die ... we laugh at the elixir that promises to prolong life to a thousand years; and with equal justice may the lexicographer be derided who, being able to produce no example of a nation that has preserved their words and phrases from mutability, shall imagine that his dictionary can embalm his language and secure it from corruption and decay.” —Preface, Dictionary, 1755


Can (should) natural language be regulated? (2)

• Johnson’s liberal empirical descriptivism is OK for English

• But what about other language situations, e.g.

– Norwegian (institutionalized diglossia)

– Czech (Every literate user of Czech must be able to use standard literary Czech, as well as his or her local dialect – but standard literary Czech is not a natural language)

– Greek? (katharevousa is obsolescent)

– Languages without a strong literary convention, e.g. Bantu languages, such as Northern Sotho, Zulu, Luganda. An element of prescriptivism seems to be inevitable here.

– What about French? What is the role of the Académie Française in this brave new world of computational language processing?

• These are subjects on which I am not qualified to speak.


Thanks

• To you, for listening,

• To the late John Sinclair and the (still extant) James Pustejovsky, who have inspired this approach,

• To the Academy of Sciences of the Czech Republic (project T100300419) and the Czech Ministry of Education (National Research Program II project 2C06009), who, in part, funded the pilot study on which PDEV is based,

• And to Karel Pala, Pavel Rychlý, Adam Rambousek, and Adam Kilgarriff, who have created tools that make this kind of analysis possible


Invitation to browse the Pattern Dictionary

• Fire up a Firefox browser window.

• VISIT: http://nlp.fi.muni.cz/projects/cpa

• Pattern Dictionary of English Verbs:

• http://deb.fi.muni.cz/pdev/
