collocations, dictionaries, and mt · 2 collocations and semantics previous work ([church and...

12
Collocations, Dictionaries and MT Dirk Heylen [email protected] OTS, Utrecht University, The Netherlands Kerry G. Maxwell [email protected] CL/MT Group, University of Essex, United Kingdom Susan Armstrong-Warwick susan~divsun.unige.ch ISSCO, University of Geneva, Switzerland 1 Introduction Collocations pose specific problems in trans- lation (both human and machine translation). For the native speaker of English it maybe ob- vious that you ’pay attention’, but for a native speaker of Dutch it would have been muchsim- pler if in English people ’donated attention.’ Within an MT system, we can deal with these mismatches in different ways. Simply adding the entry to our bilingual dictionary saying ’pay’ is the translation of ’schenken’, leaves us with the job of specifying in which contexts we can use this equivalence. A more elaborate dictionary might list the complete collocation alongside with its translation. Another possi- bility would be to adopt an interlingua and have (harmonious) monolingual components take care of mapping ’pay’ and ’schenken’ to the same meaning representation. The idea of investigating the latter possibility suggested itself while looking at the analysis of colloca- tions in terms of lexical functions by Mel’5uk and his followers (as presented for instance in their Explanatory Combinatory Dictionar- ies [Mel’~uk et al., 1984]). The claim that these lexical functions are universal and only few in number, would make them interesting for the interlingua option. The ET-10 project ’Collocations or the lexicalisation of seman- tic operations ’1 is concerned with investigat- ing these claims and their consequences for the automatic translation of collocations. Adopting this approach, leads rather nat- urally towards an interlingua system, using the lexical functions as some sort of semantic primitives. Being rather sceptical about the use of such concepts, we set ourselves the task to find out whether the ’empirical basis’ (collo- cations) constitute sufficient ground to postu- late this specific inventory of semantic primi- tives. This has lead us into the investigation of corpora and dictionaries to confront abstract concepts with real data. Although not one of our tasks, these investigations have lead to a number of suggestions on how one could pos- sibly extract relevant information from these sources for the construction of collocational dictionaries. In this paper we will focus our attention on the methods to extract informa- tion from dictionaries. 1The research project (ET-10/75) is sponsored the EuropeanCommission, L’Association Suissetra and Oxford University Press. 75 From: AAAI Technical Report SS-93-02. Compilation copyright © 1993, AAAI (www.aaai.org). All rights reserved.

Upload: phamnga

Post on 09-Apr-2018

221 views

Category:

Documents


3 download

TRANSCRIPT

Collocations, Dictionaries and MT

Dirk Heylen

[email protected]

OTS, Utrecht University, The Netherlands

Kerry G. [email protected]

CL/MT Group, University of Essex, United Kingdom

Susan Armstrong-Warwicksusan~divsun.unige.ch

ISSCO, University of Geneva, Switzerland

1 Introduction

Collocations pose specific problems in trans-

lation (both human and machine translation).For the native speaker of English it may be ob-vious that you ’pay attention’, but for a nativespeaker of Dutch it would have been much sim-pler if in English people ’donated attention.’Within an MT system, we can deal with thesemismatches in different ways. Simply addingthe entry to our bilingual dictionary saying’pay’ is the translation of ’schenken’, leaves uswith the job of specifying in which contextswe can use this equivalence. A more elaboratedictionary might list the complete collocationalongside with its translation. Another possi-bility would be to adopt an interlingua andhave (harmonious) monolingual componentstake care of mapping ’pay’ and ’schenken’ tothe same meaning representation. The ideaof investigating the latter possibility suggesteditself while looking at the analysis of colloca-tions in terms of lexical functions by Mel’5ukand his followers (as presented for instancein their Explanatory Combinatory Dictionar-ies [Mel’~uk et al., 1984]). The claim thatthese lexical functions are universal and onlyfew in number, would make them interestingfor the interlingua option. The ET-10 project’Collocations or the lexicalisation of seman-

tic operations’1 is concerned with investigat-ing these claims and their consequences for theautomatic translation of collocations.

Adopting this approach, leads rather nat-urally towards an interlingua system, usingthe lexical functions as some sort of semanticprimitives. Being rather sceptical about theuse of such concepts, we set ourselves the taskto find out whether the ’empirical basis’ (collo-cations) constitute sufficient ground to postu-late this specific inventory of semantic primi-tives. This has lead us into the investigation ofcorpora and dictionaries to confront abstractconcepts with real data. Although not one ofour tasks, these investigations have lead to anumber of suggestions on how one could pos-sibly extract relevant information from thesesources for the construction of collocationaldictionaries. In this paper we will focus ourattention on the methods to extract informa-tion from dictionaries.

1The research project (ET-10/75) is sponsored the European Commission, L’Association Suissetraand Oxford University Press.

75

From: AAAI Technical Report SS-93-02. Compilation copyright © 1993, AAAI (www.aaai.org). All rights reserved.

2 Collocations andSemantics

Previous work ([Church and Hanks, 1989],[Smadja, to appear] and earlier work) on theacquisition of collocational information hasconcentrated on the extraction of collocationsfrom corpora based on some statistical mea-sures. Programs such as XTRACT ([Smadja,to appear]) are able to select an interesting col-lection of potential ’collocations’. The notionof collocation that lies behind these investiga-tions is that of ’recurrent combination’. Thenotion we are investigating however, is differ-ent as it only concerns collocations in which wecan clearly distinguish a collocate which has a’general’ or ’non-literal’ reading. The modeof defining ’collocations’ in our approach isdifferent from the performance or probabilityapproach. Instead, we are interested in alinguistic or lexicographic way of defining col-locations and more particularly the semanticsof collocations. The first approach leads tothe notion of collocations we have termed p-collocation elsewhere ([Sloksma el at., 1992]),whereas the second mode leads to a set of l-collocations2.

But what are the semantic aspects of col-locations we are interested in exactly? In theanalysis by Mel’~uk of the kind of collocationswe are interested in (mainly adjective-nouncombinations and light verb constructions), construction such as heavy smoker is analyzedby means of a lexical function Magn. In theExplanatory Combinatory Dictionary z the en-try for smoker contains information of the fol-lowing kind.

Magn(smoker) = heavy

Mel’~uk claims that we need about 50 ofthese ’universal’ functions to describe the se-mantics of most collocations4. A lexical func-

Sin fact, it turns out that a further distinctionshould be made between the lexlcographlc and the lln-guistic approach to collocations. We will not go intofurther details here.

3The dictionary that goes with the Meaning TextModel.

4 Understood as l-collocations.

tion represents some meaningful operationwhich yields the lexeme (collocate) that typ-ically combines with the argument (head ofthe collocation) of the function to expressthe meaning of the function. Simplifying,Magn -- meaning something like "intense"-- is expressed by heavy when said of smoker.More complex functions can be constructed by(1) combining functions; (2) adding differentkinds of sub- and superscripts. The followingexamples illustrate this.

"IncepFuncl(vent) = se leverFinFunel(vent) = se calmerMagn[, con,equence,,] ( maladie ) = sdrieuse

The last example, clearly illustrates that thebasic lexical functions are in need of somerefinement. The semantic component of theargument which the intensifier relates to isexplicitly mentioned in the subscript. In[Heylen, 1992] we have suggested some waysto make lexical functions sensitive to the se-mantic structure of the arguments, by incorpo-rating for instance qualia information ([Puste-jovsky, 1991]) in the semantic representationof nouns. Notice also that lexical functionsgo with the ideology that the association be-tween heads and collocates is to be specifiedlexically, because it is unpredictable. They donot consider any kind of (sub)regularities thatmay exist.

One of the aims of the project is to evaluateMel’~uk’s system of lexical functions. Amongthe questions we are interested in are the fol-lowing.

¯ Can collocations5 be associated with lexicalfunctions as the theory assumes they can?

¯ In what ways can the notion of lexical func-tion be refined?

- Can we make them sensitive to the se-mantic structure of their arguments?

-Can we make precise the parame-ters that differentiate the possible val-ues of a function? For instance, wemight want to differentiate betweensharp/wide turns, or for instance be-tween certain "degrees of intensity"when we are considering Magn etc.

5At least a definable subset.

76

¯ Can we find some regular patterns in the for-mation of collocations?

From a methodological point of view, itis important to be able to go beyond possi-bly deceiving intuitions. We therefore haveto provide for more reliable ways to get atour data. Listings of collocations as providedby XTRACT-Iike programs, are useful as fil-ters on the data to consider. But as we areconcerned predominantly with semantic issueswhen evaluating the lexical function approach,we need programs that can be made sensitiveto the meaning of the expressions considered.To get at a more interesting collection of data,we have looked for material from which cer-tain aspects of meaning can be read off inone way or another. Of course, dictionariesare one source which deals directly with therepresentation of meaning. But it is not theexplicit encoding which is most interesting tolook at. Indeed, it turns out that most dictio-naries do not provide the kind of informationwe are interested in directly6. It is the less ob-vious structure of the dictionary entry whichgives rise to interesting pieces of information.In the following sections we list a number ofways we can use dictionaries to collect data onthe semantic issues we are concerned with.

3 Monolingual Dictionar-ies

In this section we investigate how far wecan gather information about collocations andcollocational behaviour by scrutinizing theformats of learner’s dictionaries. We fo-cus our attention on the Oxford AdvancedLearner’s Dictionary, third edition [Hornby,1974] (henceforth OALD3), but we would ex-pect these observations to have more generalvalidity. We examine the dictionary on botha macro- and microlexicographical level, ie:in addition to looking at the details of entry

6First of all, we are interested in ~abstract’ seman-tic concept. Secondly, lexicographic tradition treats antunber of different constructions in the same way theytreat the kind of collocations we are interested in.

organisation we investigate the lexicographicconventions adopted and the overall rationalebehind the organization of the dictionary.

3.1 Lexicographic Conventions

Of course in producing the dictionary a set ofrules and general practices have to be devel-oped in order to ensure internal consistency.In other words, where there is perceived flexi-bility in description/methodology, the lexicog-rapher may adopt a certain path simply be-cause this corresponds to a convention amongthose established for the construction of theparticular dictionary. Some of these conven-tions will ultimately reflect linguistic facts,whilst others will be purely arbitrary. Suchconventions often remain implicit in the na-ture of the resulting dictionary and are neverdirectly explained, whilst others may occasion-ally be alluded to in the frontmatter of thedictionary. For instance, in the frontmatter ofOALD3 we find an explanation of the listing ofdefinitions and meanings of words, where theuser is told that:-

"...definitions are listed in order ofmeanings from the most common ormost simple to the most rare or mostcomplicated" (p.xiv)

Such a convention may seem arbitrary 7 in thesense that it seems to rest on the intuitionsof individual lexicographers, but what mightit imply in our quest for information on thesemantics of collocates? Of course we shouldnote that "rare" and "complicated" and "sim-ple" and "common" are not always mutuallyprevailing characteristics, ie: a common mean-ing is not necesarily simple nor a rare meaningcomplicated. A useful interpretation for ourpurposes is perhaps that "rare/complicated"corresponds to "figurative/nonstandard" andaccordingly "simple/common" corresponds to

7However we should point out that lexicographerswork to a set of explicit compiler instructions whichmay determine the most arbitrary decisions. Further,lexicographers should be highly skilled with rich intu-itions about the language under description.

77

"standard" or "regular", so that in the caseof entries for adjectives which form collocates,the definitions which constitute the secondary,or later explanations of meaning (ie: thosethat occur "further down" the entry) will bethe more significant ones for our interests.

In the discussion of the representation of id-iom structures, we are told that:-

"..to find an idiom, look for it in theentry for the most important wordin the phrase or sentence (usually noun, verb or adjective)." (p. xvi)

This means for instance that the idiom pickholes in can be found in the entry for hole andthat the idiom get hold of the wrong end of thestick can be found under the entry for stick. Ofcourse, most "important" merely correspondsto what is perceived to be most salient, thoughit may be that observation of what are consid-ered to be the most salient elements of idiomsmay lead us to intuitions about where the se-mantic load lies in such structures.

In sum, even though the conventions under-lying the description of such lexical phenom-ena may appear arbitrary at first sight, it maybe that some reveal general linguistic insights,explain omissions, or provide pointers to theinformation in an entry that we are most in-terested in. We should at least be aware ofthem as we embark upon extraction of infor-mation, automatic or otherwise.

3.2 User Needs

A natural consequence of the learner orien-tation of the OALD3 is that the dictionaryweighs heavily on explanatory and exemplarydata, generally residing in specific examplefields, and, as will be shown in later sections,this field of information proves extremely fruit-ful in the collection of collocational data. Thefrontmatter details several reasons for the ex-tensive inclusion of example fields (p. xvii),the key one being of course their general rolein reinforcing the explanation of the meaningof (groups of) words. Also worthy of note the aim of illustrating words and multiword

forms in different sentence patternss, and theindication of the "style" or "context" in whichthe word or phrase is usually used. The lattermeans that example fields often:-

"...include words or sorts of wordsthat the headword is usually usedwith.., for example at sensa-tional (2) there is a sensationalwriter/newspaper." (p. xviii)

Such examples illustrate the potential forstumbling across example collocations, as wellas hints about lexical selection, which may inturn reveal issues in the semantics of colloca-tional structures (cf: section 3.3.1).

To summarize - an integral aspect of therationale behind the dictionary is its exem-plary function in the learner context. Thismeans that information of relevance to thestudy of collocations may be implicit in exem-plary form, as well as in standard headwordand definition formats.

3.3 Collocational

within Entries

Information

In this section we will investigate to whatextent information about collocational struc-tures is available in OALD3 entries. The ob-servations made in this section are based ona study of the electronic version of the dictio-nary (henceforth OALD3E) undertaken dur-ing our project. For further documentation ofthe results see [Heylen et al., 1992]. The elec-tronic version of the dictionary contains en-tries whose information fields are tagged by

STiffs information may be patchy where the factsabout collocational structures axe concerned; for in-stance in the entry for attention OALD3 illustratesthat the related support verb construction can be pas-sivised,eg: No attention was paid to my advice, butsays nothing in the entry for heavy about restrictionson predicative use where heavy has a collocationalreading ie: this smoker ii heavy can only mean thatthe smoker has weight problems! Such gaps in cover-age often arise where dictionaries concentrate on pro-viding positive rather than negative information, ie:providing well-formed examples but not giving explicitguidance as to what is no._.~t possible.

78

SGML-style markers. A simple search tool de-veloped in-house was used to extract entriesbearing a headword or example matching thespecified search string9.

The focus for this study was an investigationof the coverage of Adjective Noun (henceforthAN) collocations in OALD3E. In particular weinvestigate the information available in adjec-tive entries, these proving to be a more fruit-ful source of interesting AN structures thannoun entries. Adjective entries tend to list typ-ical (nominal) modificands, the reverse is notnecessarily true (presumably since the set ofmodifiers given a particular noun is potentiallymuch larger and heterogenous). Further, ad-jective entries provide us with secondary sensedefinitions and corresponding examples whichmay relate directly to semantic aspects of col-locational behaviour, as we discuss below.

3.3.1 Searches on Adjective

The structures/combinations we are primarilyinterested in are those which can be deemed tobe collocational on some kind of linguistic ba-sis. For AN-collocations we refer to the noun’base’, a semantically autonomous elementpreserving its standard interpretation, and anadjectival ’collocate’, an adjective which hassome kind of figurative or non-standard read-ing, its precise interpretation being only deriv-able in the context of the noun base. Thesebeing the kind of phenomena we are attempt-ing to investigate, it would seem that we needto be particularly attentive to those aspects ofan adjective entry devoted to explanations ofsecondary (or further alternative) senses. Suchexplanations give clues as to the potential col-locational readings of an adjective, and relatedexample fields contain candidate collocationalstructures. An example would be the OALD3description of primary and alternative sensesfor the adjective violent, and related examples,as shown below:-

1. using, showing, accompanied by, greatforce

9Brined on a Perl regular expression, cf: [Wall andSchwartz, 1990], pp.103-106.

violent wind/attack/blows/temver/abuse2. caused by violent attackviolent death3. severeviolent toothache

The example details three senses of the ad-jective. We might designate collocational sta-tus to the third, violent assumes an intensify-ing function not explicitly related to its regularuse indicating force (cf: sense 1).

Glues about collocational readings are alsofound in glosses to individual examples, eg: wesee the gloss extreme against the sense of vi-olent in the example combination violent con-trast, where we might assume the adjective hasa similar function to that in sense 3 above.

So, although there is no explicit discrim-ination of "collocational readings" listed inthe dictionary, in sifting through the varioussenses of adjectives we find a number of can-didate collocates (and indeed example colloca-tions).

But it may not be merely the semantics ofthe adjectival collocate which is alluded to inglosses and secondary senses. We might as-sume that the semantics of a noun is cruciallyrelated to such factors as form, function andcausation (cf: the notion of Qualia as coinedby Pustejovsky in a discussion of lexical se-mantic decomposition of nouns [Pustejovsky,1991]). These meaning aspects may also bereflected in sense definitions pertaining to par-ticular examples where the noun appears. Forinstance, the sense definition against the ex-ample violent death says "caused by violent at-tack", reflecting the fact that death essentiallyinvolves a "cause" of some kind (cf: Puste-jovsky’s Agentive role). The third sense ofthe adjective regular (cf: below) refers to ex-plicit properties of the nouns provided in theexample, eg: "qualified" and "trained" of sol-diers/army, (cf: Pustejovsky’s formal role).

3. properly qualified; recognised; trainedregular soldiers/army

Example fields often group potential mod-ificands of a specified adjective according

79

to some unwritten semantic criteria (whichsometimes may be linked back to a partic-ular sense definition). For instance the sec-ondary definition of the adjective formidableis supplied as "requiring great effort to dealwith or overcome", the modificands obsta-cle/opposition/enemies//debts are supplied onmass, each in some sense representing a situ-ation or body which must be confronted andultimately surmounted or "overcome". So weobserve that example fields frequently list setsof modificands which may be perceived as be-longing to a homogenous semantic class, ie:they have a common property, or, more for-mally, they share a semantic feature. We mayspeculate about what this feature is, but, aswe illustrated, the sense definition of the ad-jective may give clues. In some sense thenwe gain pointers as to potential selectional re-strictions between an adjectival modifier andthe set of exemplary modificands. Certain ex-amples seem to show adjectives predicating ofnouns which share particular properties.

Beyond sense definitions and related exam-ples we also find candidate collocations in in-formation fields which are intended to housecomplex expressions such as idioms or com-pound forms. An example is the entry for theadjective blind, where we find the examplesblind spot, blind turning, blind flying, blindalley, blind date. These complex expressionsmay undoubtedly warrant various labels ac-cording to a given classification, but it may bethat some fit our criteria for collocational sta-tus. Eg: if we take blind flying and blind turn-ing, we observe that both have a base with itsregular compositional meaning, and a modifierwhich derives its precise interpretation in thecontext of the base1°. Such examples indicatethat we should also target compound groupfields as a potential "quarry" of candidate col-locations.

If a collocational form occurs within an in-formation slot devoted to compounds or fixed

1°blind in blind turning means "not easily seen bydrivers" and blind in blind flying means "unable tosee due to cloud/fog, therefore flying with the aid ofinstruments only".

expressions, this may give clues as to thedegree of syntactic/morphological versatilitydisplayed by the collocate. In other words,there are certain AN combinations which wemay classify as collocational according to ourbasic perception, but which resemble com-pounds/fixed expressions in terms of eg: thelack of morphological versatility (compara-tive/superlative formation) of the non-head.Such examples typically reside in compoundgroup fields (eg: tall/*taller/*tallest story, tallstory being listed in the compound group fieldof the OALD3 entry for the adjective tall).By contrast, those collocations higher on theversatility 11 scale seem more likely to crop upin example fields eg: heavy rain.

3.4 Summary and Conclusions

In sum, the observations made in the previ-ous sections show that information on collo-cational structures can be obtained from anon-line dictionary, but that this information ispractically always implicit, embedded in as-pects of the formal structure of the dictio-nary. We have performed automatic searchesfor particular strings and then hand-filtered in-formation from the entries which formed theoutput. The result of such hand-filtering re-veals where the "hot spots" of information lie.

Various factors interact to place informa-tion at a certain place and present informa-tion in a certain way. The information gleanedfrom such ’first-pass’ searches, coupled witha knowledge of underlying lexicographic con-ventions and the rationale behind the dictio-nary (ie: user needs, look-up strategies), canpromote the simulation of more "intelligent"searches at a subsequent stage. For instance,we now know that example fields will probablyform one of the most productive active searchspaces. If we want data on adjective-noun col-locations, we should target adjective entriesprimarily. Example collocations are likely tobe found at example slots for sense definitions

llBy ~versatile’ we may also refer to syntactic as-pects, eg: predicative use: the rain was heavy/~thestory was tall

8O

which are "lower down" the entry. It is in theselater sense definitions that we may find thedescription of senses of adjectival collocates,from which we may be able to unpack somekind of semantic information. If example mod-ificands are ’lumped’ together, (commonly de-limited by /) then we may have turned up potential selectional restriction between mod-ifier and head (collocate and base).., and on. We therefore illustrate how on-line lexicalinformation, though only partially formal innature, can be the focus of automatic searcheswhich are crucially based on some kind of "in-telligent" strategy derived from knowledge ofthe nature of the dictionary.

4 Bilingual Dictionaries

Examination of bilingual dictionaries can betelling in various ways, not only in the collec-tion of contrastive data but also in the affir-mation of ideas about the nature of colloca-tional behaviours. Of course contrastive anal-ysis of collocational structures reveals a degreeof complexity in translation, but it can alsobring to light regularities about the monolin-gual system. In addition the data can helpinstantiate or repudiate definitions of interlin-gua concepts in view of translation.

In this section we will discuss what types ofinformation we can find in bilingual dictionar-ies and how this might help us towards a moresystematic account of collocations (in view ofautomatic translation).

Though the investigation of collocations hasnot been a focal point for past work with ma-chine readable dictionaries, we can identify anumber of related studies which have focussedon semantics and translation issues. Bilingualstudies carried out in the ACQUILEX project,for example, have investigated the linking ofsemantic hierarchies which have been built onlanguage internal concepts [Sanfilippo el al.,1992]. In a comparaison of noun taxonomiescross-linguistically (using word senses and hi-erarchies derived from mono- and bilingualdictionaries) [Vossen, 1991] defends the hy-

pothesis that "meaning is a language internalaffair". Translation equivalence is seen as amapping between different conceptualizations.Though the work does not focus on colloca-tions, it does recognize that a major problemremains in how to select the best language ex-pression for a given conceptualization. It isprecisely this problem that we are trying toaccount for with the lexical functions.

Other studies of interest in view of extract-ing information from bilingual machine read-able dictionaries includes the work reportedon in [Calzolari, 1983] on establishing seman-tic links and [Byrd et al., 1987] on mappingbetween entries and problems of symmetry.The monolingual dictionary work on navigat-ing through entries reported on in [Chodorowand Byrd, 1985] provides the basis for ourwork on exploring the data in bilingual dic-tionaries.

4.1 Accessing the data

The bilingual dictionaries we have availableon-line are the Collins German, French andEnglish, pocket edition [Collins, 1990] in all re-spective pairs. The dictionaries cover the basicvocabulary of the languages (ca. 15,000 wordsper direction) and include the major sense dis-tinctions (differentiated only in so far as nec-essary for translation purposes). Though rel-atively small, the advantages are that theyare easy to parse and contain only minimalbut essential information. Thus, for a firstattempt at organizing some of the data inview of our goals to systematically accountfor the relations between the words in colloca-tional phrases, they provide an ideal startingpoint. As suggested above in the discussion ofmonolingual dictionaries, we assume that thedata contained in these dictionaries reflects aninitial synthesis and organization of essentialsemantic and translational considerations re-garding the use of the words.

The dictionaries under consideration can beaccessed via a network based dictionary con-sultation tool developed at ISSCO for theUniversity of Geneva [Petitpierre and Robert,

81

1991]. The underlying format is an SGML12

mark-up of the fields, identifying headwords,grammatical information, sense distinctions,contextual information, translations, examplephrases, etc. This structure allows access tothe entries not only by headword, but alsoby words found in the different fields (via sec-ondary indexes).

4.2 Collocations and translation

As was the case with the monolingual informa-tion found in different fields, glosses and sensedefinitions in bilingual entries play a crucialrole in providing more information about thesemantics of the collocate. From the user per-spective, they are of course there to assist thechoice of the correct translation, ie: they arelinked to the specification of a particular trans-lation where several alternatives are possible.From our perspective, they give pointers to thepossible interpretation of the collocate. Eg:in the Collins French-English/English-FrenchDictionary [Collins, 1990] we see (gain, re-quire, bring/carry, etc) listed as senses of theverb take. It is precisely these sense definitionswhich may give clues as to the interlingual con-cepts which underlie such support verbs, con-cepts which are independent of their realisa-tion in a given language. Eg. take in Englishtranslated as remporter in French, but mean-ing GAI~ in both.

4.2.1 Bilingual lexicographic conven-tions

There are a number of conventions (often im-plicit) used in bilingual lexicographical workwhich come under the global heading of "se-mantic indicators". The intention is to iden-tify or restrict the semantic range of the wordto be translated. Aside from the closed class ofsubject field markers (eg. NAUT, GEO, MED,etc.), these semantic indicators are little morethan clues for human readers well-versed in

12The explicit SGML mark-up also provides themeans for displaying the results of a given query asif a printed dictionary had been consulted.

one of the two languages and cannot be in-terpreted automatically. The types of indica-tors employed are only partially formalizable(examples given here are German to Frenchtranslations):

¯ synonymseg. stark (mKchtig) puissantegloss: ’strong’ in the sense of powerful

¯ typical collocationseg. stark (Schmerzen) vi olentegloss: ’strong’ pains

¯ meta-type sem/syn informationeg. stark (bei Massangabe)gloss: ’strong’ used with measurementsexample: 2 cm " ---* 2 cm d’epaisseur

¯ multl-word expressionseg. " Rancher --* grand fumeurgloss: ’strong’ smoker (i.e. heavy smoker)

4.2.2 Typical collocations

The information in the entries is rich in bothmonolingual and bilingual semantic informa-tion. The ’semantic indicators’ and typicalphrases are often indicative of a set colloca-tions (hypernyms, for example, are often usedto stand for a class of collocations).

Following is a list of collocations taken fromjust a few adjective entries.

¯ deep -~(water, sorrow, thoughts)profond(e)(voice) grave

¯ hard --* dur(e)(work) dur(think, try) serieusement

¯ high --+ haut(e)(speed, respect, number)grand(e)(price) aev (e)(wind) fort(e), violent(e)(voice) aigu(aigu~)

These simple entries are clearly useful forgathering a starting list of collocations, all ofthem linguistically relevant l-collocations (asopposed to an initial set of p-collocations foundin corpora which would require hand filtering).

The grouping of objects may reflect seman-tic patterning, though not always, and in fact,

82

not very often. Support verbs in particular areoften listed with coherent semantic classes, eg.under "take" we can find classes like step, walkor effort, courage. But as the sample transla-tions for deep and high illustrate, we are of-ten confronted with groups that could hardlybe viewed as semantically homogeneous. Nev-ertheless, the examples given in the entriesusually cover the basic sense distinctions andthus provide a useful set to test initial hy-potheses. A larger collection of data and moreextensive empirical studies may help to iden-tify more general and also more refined classesof regularities that are not apparent in theseselected examples.

One immediate way to extend this list of ex-amples is to look for phrase or candidate collo-cations within the translations, i.e. searchingon the word contained in the translation andexample fields. Following is a list of exam-ples extracted from a search for adjectives inthe translation field (the headword is given inboldface).

¯ deep voice ~-- (voix/son) grave¯ deep hate ~ (mgpris) profond¯ deep/great ~ vif (regret/deception)¯ hard choice/problem ~ (choiz/probl~me)

difflcile¯ hard work *-- (travail) dur/ferme¯ hard (insensible) person (c oeur/-

personae) sec¯ high price ~ (prix/sommel) 616v6¯ high returns ~ rendement fort¯ high/low tide ~ mar6e haute/basse¯ high vitamin/mineral/etc content ~-

richesse en vitamines¯ high blood pressure ~-- faire ou avoir de

la tension

In comparing this list with the examplescited above (searching on headwords) it is im-mediately apparent that the information pro-vided is not symmetric. Given the seriousspace restrictions on these dictionaries in par-ticular, this is not surprising, though a quicklook at larger dictionaries will show similardiscrepancies.13 More importantly, what does

13Note also that only a subset of the entries con-

become apparent in these examples, is thecross-linguistic verification of our definition ofcollocations. It also substantiates the claimthat the nouns (bases) the adjective (collo-cate) selects have some kind of semantic ho-mogeneity. In essentially all cases (the ex-ceptions seem to represent metaphorical andidiomatic expressions) the nouns retain theirregular meaning, the adjectives are selected(and ultimately interpreted) in the contextsof the nouns, which do have a straightforwardtranslation. (cf: observations in section 3.3.1).Given our relatively limited understanding ofthese phenomena and lack of ability to formal-ize this system, we can perhaps better appre-ciate the rather intuitive and incomplete in-formation provided by the lexicographers andhence the assymetries across entries.

4.2.3 Sense distinctions and transla-tion

Another type of information we find in the en-tries (other than specific collocations) could interpreted as sense disambiguation informa-tion, i.e. when more than one translationalequivalent is given without further restric-tions, we are usually confronted with at leastpartial synonyms. In the translation of theGerman adjective stark (strong) from French,for example we find the following translations:

¯ farouche---* (voiontg, haine, rgsistance)stark, heftig (gloss: heavy or forceful)

¯ intense --* stark, intensiv (gloss: inten-sive)

¯ violent ~ heftig, stark¯ virulent ~ stark, thdlich (gloss: deadly)

The synonyms supplied for each transla-tion choice seem to indicate a possible sensedistinction. The underlying basic concept isthat of intensification which is precisely the in-terlingual concept mediated by Mel’~uk’s LFMAGN. However, in each case the meaning

talnlng the words we are interested in have been listedhere; fixed expressions and other idiosyncractic exam-ples were deleted.

83

is refined, as suggested by the glosses sup-plied above. Following Pustejovsky’s notion ofQualia [Pustejovsky, 1991], part of the choicefor a given collocate will depend on the se-mantics of the noun, but this must perhaps bejudged in interaction with particular senses ofa given adjective rather than the word itself.It is unfortunate that translation dictionariesin general do not represent sense distinctionsin the translation fields. As these examplessuggest, if we are looking to use the lexicalfunctions for generation, we may need to re-fine them on the basis of the senses in order tochoose from the set of possible translations.

A next step would be to look at whether thesense distinctions found in monolingual dic-tionaries can be of service in helping to clas-sify translation equivalents (cf. the discus-sion in the following section); another is tolook for the semantic structures and primitivesprovided in thesauri. A search on the RogetThesaurus [Roget, 1990], again searching fromwithin, i.e. all classes where strong is givenas example, provided some initial categories.Three major categories which included strongas a possible adjective were "quantity by com-parison with a standard", "degree of power"and "physical energy". Though the first cat-egory does not seem to help in distinguishingthe translations given above, the latter two dohelp in distinguishing stark as "intense" andstark as "violent". This very preliminary in-vestigation, though far from giving a system-atic account of this problem, suggests the po-tential of combining the information from anumber of resources in view of working to-wards more general accounts of the data.

4.3 Semantic classes and chains

Given the numerous pairs of bilingual dictio-naries and the ability to search inside entries,we explored the idea of chaining through en-tries. Starting with a headword in one lan-guage we find a number of translations whichin turn serve as the query both as headword oras occurring in a translation field. This workis similar to the "sprouting" mechanism de-

scribed in [Chodorow and Byrd, 1985] whichestablished chains through entries in order tobuild up a semantic hierarchy and the subse-quent work on "bilingual sprouting" discussedin [Byrd et al., 1987]. In contrast to the workcited above, we did not attempt to find explicitand formalized semantic information (such as+HUMAN), nor to automatically derive tax-onomies. The intuition is simply that chain-ing through words and their translations (fromheadwords to translations and vice versa) wewill discover synonym classes across the lan-guages which can help refine our insights intothe semantic or conceptual differences withinand across languages.

In a sample chain beginning with the wordheavy in English, we find three basic trans-lations in French, i.e. iourde, gros and grand.The entry for the adjective lourd lists only theone translation heavy but occurs in the entriesfor ’close’, ’heavy’, ’hefty’, ’weighty’, amongothers. The entry for gros gives as transla-tions big, large, fat, extensive, thick, heavy andoccurs in ’big’, ’broadly’, ’fat’, ’heavy’, ’hefty’,’roughly’, etc.

As the example above suggests, after onlyone step through the chain, we can quickly col-lect a set of words that clearly share a subsetof semantic properties. The differences willhave to be accounted for in a systematic wayif we are not going to list every possible com-bination in view of translation and subsequentgeneration. It is clear that a lexical functionsuch a MAGN will account for any number ofcollocations containing the words given above,but not why one word is chosen over another.

In the entry for heavy, for example, weare also supplied with some contextual infor-mation necessary for choosing the appropriateequivalent. I.e. it translates as gros in thecontext of work, sea, rain, eater but as grandin the case of drinker and smoker. It is alsointeresting to note that two of the three mainsense distinctions given in OALD correspondapproximately to the first two translations, i.e.1. weight, and 2. size, force, amount. A pos-sible line of investigation would be to see howwell these distinctions overlap with the seman-

84

tic groupings found in thesuri14.

Another study starting with the adjectivefort in French and chaining through Germanand English, yielded the following set of se-mantically related adjectives.

farouche, intense, puissant, violent, viru-lent, fort, doug, capable, vigoureux, solide, vif,haul, grand, dlevd, aigu, grand, formidable,gros, corpulente, dur, s~rieusement, agressive,bruyant, sonore, beaucoup

Though every native speaker would knowprecisely which of these adjectives could oc-cur with any given noun, we do not have anyeven semi-formal account of why this is thecase let alone how they will behave in trans-lation. One question to be studied is whetherwe can identify subsets of this class that be-have similarly with respect to collocations. Itis this type of data and more that we will needto analyze and test our hypotheses on.

Conclusion

As the title suggests, the main focus of theproject "Collocations and the Lexicalisation ofSemantic operations" is the semantics involvedin collocational constructions. In Mel’Suk’sapproach, the meaning of each ’collocate’ canbe identified as an instance of one of the fiftyor so Lexical Functions. Whether or not eachcollocation can be assigned such a function,and whether or not these functions make anysense, is partly a matter of empirical investi-gation.

In this paper, we have discussed some pos-sibilities in which this empirical investigationcan be carried out. Although most dictionariesdo not immediately provide the informationwe are after, they are a rich source of lexico-semantic knowledge which can be exploited inmany ways for various purposes. The primaryuse we have been talking about involves the re-search on the semantics of collocations ratherthan the automatic acquisition of lexical infor-mation.

14A simple program to extract phrases from Rogeridentified ca. 20 expressions containing heavy as anadjective.

For instance, simple chaining through thedictionaries provides us with synonym sets (orsome kind of equivalence set) which capturefacets of meaning which may point to a lex-ical function. But one could try to go be-yond the simple identification of lexical func-tions and look for further evidence for gener-alizations and refinements. Indeed, one of thedrawbacks of the lexical function approach asused in the ECD is that each function has tobe enumerated completely. In this account, asin many others, collocations are considered tobe totally unpredictable combinations. How-ever, looking at collocations one is sometimesstruck by the semantic patterns suggested bythe data. Again there is a need to find moreevidence to be able to tackle such questions as:what kind of regularities are we talking about,to what extent are they there, and how can weaccount for them?

A second drawback concerns the ’refine-ment’ of the functions. Lexical functions of-ten pick out one semantic facet of their argu-ment to operate on. Mel’~uk himself alreadymakes some provisions, but does not provide asystematic theoretical account of these exten-sions. One could also turn to other theoriesof semantic structure such as Pustejovsky’sQualia theory. We are then lead to the ques-tion how these theories fit the phenomenon weare describing, and again, what kind of evi-dence we can use to decide on particular anal-yses.

Collocations are an inspiring source of in-teresting hypotheses concerning all kinds oflexical semantic issues. Working on them wehave realised however, that conjectures needthorough testing. The previous sections haveserved to identify the different types of lexi-cal resources we can exploit in our study ofcollocations. Electronic access to various col-lections of data, i.e. mono- and bilingual dic-tionaries, and thesauri provide the basis for amore in-depth exploration of the phenomena.A next step is to refine the methods for pickingout relevant data and combining the informa-tion gained from different perspectives.

85

References

[Bloksma et al., 1992] Laura Bloksma, DirkHeylen, and R. Lee IIumphreys. Charac-terisation. Deliverable ET-10/75 [C.1], Col-locations and the Lexicalisation of SemanticOperations, Utrecht, July 1992.

[Byrd et al., 1987] R. Byrd, N. Calzolari,M. Chodorow, J. Klavans, and M. Neff.Tools and methods for computational lin-guistics. Computational Linguistics, 13(3-4):219-240, 1987.

[Calzolari, 1983] N. Calzolari. Semantic linksand the dictionary. In Proceedings of theSixth International Conference on Comput-ers and the Humanities, pages 47-50, Mary-land, 1983.

[Chodorow and Byrd, 1985] M. Chodorowand R. Byrd. Extracting semantic hierar-chies from a large on-line dictionary. InAssociation for Computational Linguisticsproceedings, pages 299-304, Chicago, 1985.

[Church and Hanks, 1989] Kenneth WardChurch and Patrick Hanks. Word associ-ation norms, mutual information and lex-icography. In 27th Annual Meeting of theAssociation for Computational Linguistics,Vancouver, 1989.

[Collins, 1990] Collins, editor. Collins French-English English-French Dictionary, CollinsGerman-English English-German Dictio-nary, Collins French-German German-French Dictionary. Collins, London, Glas-gow, 1990.

[iieylen et al., 1992] Dirk Heylen, Kerry G.Maxwell, Tim Nicolas, and Susan Warwick-Armstrong. Analysis of existing dictionar-ies. Deliverable, Collocations and the Lexi-calisation of Semantic Operations, Utrecht,July 1992.

[tIeylen, 1992] Dirk Heylen. Lexical func-tions and knowledge representation. In Pro-ceedings of the Second International Work-shop on Computational Lexical Semantics,Toulouse, 1992.

[Hornby, 1974] A.S. Hornby. The Oxford Ad-vanced Learner’s Dictionary of English. Ox-ford University Press, third edition edition,1974.

[Mel’~uk el al., 1984] IgorMel’~uk, Nadia Arbatchewsky-Jumarie, L~oElnitsky, Lidija Iordanskaja, and AddleLessard. Diclionnaire explicalif el combina-toire du fran~ais contemporain. Les Pressesde l’Universitd de Montreal, Montreal, 1984.

[Petitpierre and Robert, 1991] D. Petitpierreand G. Robert. Dico- a network based dic-tionary consultation tool. In SGAICO pro-ceedings, Neuchatel, 1991.

[Pustejovsky, 1991] James Pustejovsky. Thegenerative lexicon. Computational Linguis-tics, 17(4), October-December 1991.

[Roger, 1990] Roger. Roger’s thesaurus, 1911,1990.

[Sanfilippo et al., 1992] A. Sanfil-ippo, T. Briscoe, A. Copestake, Maria An-tonia Martl, Mariona Taul~, and AntoniettaAlong. Translation equivalence and lexical-ization in the acquilex lk. In Proceedings ofTMI 4, pages 1-11, Montreal, 1992.

[Smadja, to appear] Frank Smadja. Retriev-ing collocations from text: Xtract. Compu-tational Linguistics, to appear.

[Vossen, 1991] P. Vossen. Comparing noun-taxonomies cross-linguistically. ESPRITBRA-3030 ACQUILEX Working PaperNumber.014, English Department, Univer-sity of Amsterdam, 1991.

[Wall and Schwartz, 1990] L. Wall and R. L.Schwartz. Programming Perl. O’Reilly LzAssociates Inc., Sebastopol, CA., 1990.

86