
An Electronic Corpus-based analysis of a set of near synonyms across two registers.

Tomoki KOYA, Danielle RAHARINTSOA, Deniz TOPRAK, Dildar KEREM WU

Université Marc Bloch, 22, Rue Descartes, 67 084 Strasbourg Cedex, France

Abstract

In this paper, we compare a set of near-synonymous nouns (appliance, device, gadget, machine) across two registers. We attempt to demonstrate the semantic nuances of these terms in a phraseological context, and to show that words behave differently once their surrounding contexts and registers are taken into consideration.

Résumé

Dans cet article, nous allons comparer différents synonymes de appliance, device,

gadget, machine à travers deux registres. Nous essayerons de démontrer les nuances

sémantiques de ces termes dans un contexte de phraséologie. Nous avons l'intention de

démontrer que les mots se comportent différemment quand nous prenons en considération

leurs contextes environnants et leurs registres.

Keywords: corpus linguistics, corpus-based analysis, lexicographic investigation, dictionaries, comparison of synonyms, collocation.

1. Introduction

The aim of this paper is to explore the different uses of the synonyms appliance, device, gadget and machine. The method is a corpus-based analysis. The main reason for this choice is that the semantic boundaries between these words are close. Some words may be inherent to certain collocations in a given context, and the structure of the collocation does not allow a word to be replaced by its synonym. We compare and contrast the above-mentioned near synonyms by means of an empirical analysis. We believe that our approach is of interest because our central element, the notion of collocations put forward by Gledhill (Gledhill, 2000), differs from the notion of collocations put forward by other linguists. This notion will be explained in detail later.

In applied linguistics, the study of the meaning and use of terms is called lexicography. The study is carried out by lexicographic investigation, which is traditionally used as a methodology for dictionary-building. Many linguists, particularly Sinclair (Sinclair, 1991), have developed the methodology of 'corpus linguistics', in which corpus analysis provides the evidence for the uses and meanings of words. Sinclair even used this methodology to construct the Collins Cobuild English Dictionary, which was edited according to the results of analysing a two-hundred-million-word corpus.

According to Biber (Biber et al., 1998), a corpus-based analysis allows us to establish frequency lists (occurrences of words), concordances (occurrences of words with their surrounding contexts) and collocations (the patterned ways in which words group together). He claims that an empirical corpus-based analysis can establish that synonyms have their own contextual preferences when associated with other collocates or registers.
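The first two notions can be illustrated with a minimal Python sketch. It is not the procedure of any particular concordancing tool: the function names tokenize, frequency_list and concordance are hypothetical, the tokenization is deliberately naive, and the sample text is invented for the illustration.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into word tokens (naive tokenizer)."""
    return re.findall(r"[a-z]+(?:'[a-z]+)?", text.lower())


def frequency_list(tokens):
    """Return (word, count) pairs, most frequent first."""
    return Counter(tokens).most_common()


def concordance(tokens, node, width=2):
    """Return every occurrence of `node` with `width` tokens of context per side."""
    lines = []
    for i, tok in enumerate(tokens):
        if tok == node:
            left = " ".join(tokens[max(0, i - width):i])
            right = " ".join(tokens[i + 1:i + 1 + width])
            lines.append((left, tok, right))
    return lines


# Invented sample text, used only to exercise the functions.
sample = ("The device was then assembled with a heater. "
          "The machine halts when the device fails.")
tokens = tokenize(sample)

for word, count in frequency_list(tokens)[:3]:
    print(word, count)

for left, node, right in concordance(tokens, "device"):
    print(f"{left:>10} [{node}] {right}")
```

A real study would rely on a proper tokenizer and on the register annotation of the corpus; the sketch only shows the shape of the two outputs, a ranked word list and keyword-in-context lines.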

There is no disagreement about what a corpus is. Linguists agree that a corpus is a collection of language data, selected and organized according to specified criteria so as to serve as a language model (Sinclair 1996, Habert et al., 1997). The notion of collocations, on the other hand, is very complex, and linguists adopt different perspectives. Thierry Fontenelle (Fontenelle, 1994) pointed out that the combinations of words depend on various factors. In his article he refers to Carter and McCarthy, who believe that the concept of collocations is independent of grammatical categories, notably in the case of a grammatical collocation, which involves one element from an open class and one element from a closed class: for example, the verb depend collocates with on but not with of. Gledhill (Gledhill, 1998) claims that collocations are

unmarked language expressions; he further explains that even though collocations are relatively fixed sequences of words, they differ from idioms because they are not recognized culturally or stylistically as expressions in themselves. In his example, to take a break is easily interpreted and therefore unmarked, while to kick the bucket is difficult to interpret and therefore opaque and marked. In (Gledhill, 2000), he gives a full description of the different views that linguists have adopted on the notion of collocations. He synthesizes three perspectives: Halliday's statistical/textual perspective, the semantic/syntactic perspective and the discourse/rhetorical perspective. According to him, in Halliday's statistical/textual perspective, collocations are framed in terms of statistical probabilities and co-occurrence. This perspective allows linguists to observe certain co-occurrences (for example, the case of set of, as shown by Sinclair) that could not be recognized using a traditional method. The semantic/syntactic perspective, by contrast, stresses the potential of an expression to enter into lexical combinations (shrug one's shoulders has no alternative for the verb shrug, while in make a decision, make can be replaced by reach, take, etc.). The discourse/rhetorical perspective examines collocations in terms of performance: it looks at their communicative function and carries out an external functional analysis (thus we know the difference between how do you do and how are you). The choice of expressions often reveals a rhetorical or ideological stance. In his words, "collocations are all related in phraseology", where phraseology refers to "The preferred way of saying things in a particular discourse". It is this notion of collocations that we explore in our work.


This paper is organized as follows. Section 2 introduces our methodology: we explain how our corpus was constructed, and we also summarize how the words are represented in the three chosen dictionaries. Section 3 presents our analysis and compares the behavior of these near synonyms. The final section summarizes our research and outlines perspectives for further work.

2. Methodology

For a linguistic analysis, the construction of the corpus is essential. To obtain a reliable result, it is preferable to use two corpora: a reference corpus and a corpus constructed for special needs. We use the BNC Baby as our reference corpus. BNC Baby is a subset of the British National Corpus (BNC)¹. It contains 4 million words and is made up of equal amounts of four kinds of material (academic writing, imaginative writing, newspaper texts, and spontaneous conversation). The corpus is queried with a software tool named Xaira². The specialized corpus was constructed according to criteria that we defined ourselves. Firstly, we decided to have a homogeneous corpus, consisting of a series of theses. Secondly, even though we are well aware that a large corpus is preferable, we agreed to construct a relatively small one, mainly for lack of time. The size of our corpus is 240,000 words. For the analysis of this corpus, we use the software WordSmith Tools³.

1 The British National Corpus (BNC) is a 100 million word collection of samples of written and spoken language from a wide range of sources, designed to represent a wide cross-section of current British English, both spoken and written. (http://www.natcorp.ox.ac.uk/).

2 Xaira is the current name for a new version of SARA, the text searching software originally developed at OUCS for use with the British National Corpus.

3 WordSmith Tools is an integrated suite of programs for looking at how words behave in texts. The tools, which are based on lexical analysis, have been used by Oxford University Press for their own lexicographic work in preparing dictionaries. For details, please consult: http://www.lexically.net/downloads/version5/HTML/index.html


To study this group of near-synonyms (appliance, device, gadget and machine), we first need to examine their meanings in dictionaries. We consulted eleven dictionaries and finally selected three of them:

• Collins Cobuild, Harper Collins Publishers, 2nd edition, 1995 (hereinafter Collins).

• The Oxford English Dictionary, Clarendon Press, 2nd edition, 1989 (hereinafter

OED).

• Webster's Third New International Dictionary, G. & C. Merriam Company, 1961

(hereinafter Webster's).

Collins is the only one of the three dictionaries that acknowledges the use of a large corpus, the Bank of English, for this edition. Its chief editor is Sinclair himself. All the examples in this dictionary contain typical patterning associated with the word. The OED claims to be the largest and most authoritative dictionary of the English language and the ultimate source of information on the usage and meaning of English words and phrases. It covers the vocabulary of the English language since AD 1150. We chose Webster's first because it represents American English, and second because it compares the subtle shades of meaning among synonyms.

3. Comparison and analysis

3.1. Comparison of the definition representations in the dictionaries

Since this section discusses how the words are represented, we begin by pointing out two basic units of representation: the entry and the sense. Of the four words, machine and device have two entries in some dictionaries. The reason is that machine is not only a noun but also a verb; as for device, its second entry is an old form of the verb devise. As for sense, the four words share a common one: a manufactured object whose purpose is to accomplish a task in place of the human hand.

The word machine comes from the ideas of "make" and "power". The concept of a machine generally involves some kind of power, in most cases electricity, steam, gas or human effort. The word gadget comes from the French word "gâchette" (trigger). A gadget is a small, useful object that performs a specific task, and it is generally cleverly designed. Device comes from the French word "devise", which means to divide. A device is an object, or part of an object, that is used in a specific domain and for a specific purpose. Appliance contains the idea of "applying something". An appliance can be part of a larger object and is designed especially for domestic tasks (for a specific use such as washing or cooking); it runs mainly on electricity. Machine seems to be the only word with a generic sense, one that appears to subsume all the others.

But the dictionaries also reveal that these words are not totally identical. For example, the fixed expression leave someone to their own devices is specific to the word device in its plural form. The OED records a use of gadget specific to glass-making, where gadget denotes a spring clip used for gripping the foot of a wine glass or other footed glass while it is being shaped. The word appliance is used specifically in British English to refer to a fire engine. Machine is the only word for which the OED provides a list of synonyms (engine, apparatus, appliance) (page 1353).

From the above comparison, we can conclude that even though these synonyms share a common sense, the dictionaries also define uses and senses proper to each of them. In the section that follows, we rely on our corpus-based analysis to show that these synonyms are not identical in meaning and usage.

3.2. Frequency distributions of appliance, device, gadget, machine

We have examined the definitions provided by the three dictionaries. In this section, we first analyse and compare the frequency distributions of these synonyms in different registers. We then study their collocations, focusing on their immediate collocates (immediate right and immediate left).

A quick analysis of BNC Baby confirms that these words are used differently across registers. The analysis shows that the word machine is the most commonly used item. Compared with machine and device, the words gadgets and appliances are rarely used. For lack of space, we do not display the BNC Baby results for the plural forms. We did notice, however, that appliance(s) and gadget do not appear in written academic prose, whereas device(s) is widely used there. Machine(s) is widely used in both spoken demographic and written academic prose, but machine is more widely used in spoken English than machines. Gadgets appears mostly in written newspapers, and it is worth mentioning that the number of occurrences of gadget in the singular is zero.

Table 1: Analysis of appliance in BNC Baby

Class                    Hits   Words     Pc (%)   Hit texts   Texts
header                   0      11558     0.000    0           2
Written academic prose   0      1145004   0.000    0           30
Spoken demographic       0      1203554   0.000    0           30
Written fiction          1      1214069   0.000    1           25
Written newspapers       4      1116508   0.000    4           97
#unc                     0      0         0.000    0           0
Total                    5      4690693   0.000    5           184


Table 2: Analysis of device in BNC Baby

Class                    Hits   Words     Pc (%)   Hit texts   Texts
header                   0      11558     0.000    0           2
Written academic prose   82     1145004   0.007    14          30
Spoken demographic       8      1203554   0.001    4           30
Written fiction          24     1214069   0.002    8           25
Written newspapers       10     1116508   0.001    9           97
#unc                     0      0         0.001    0           0
Total                    124    4690693   0.003    35          184

Table 3: Analysis of gadget in BNC Baby

Class                    Hits   Words     Pc (%)   Hit texts   Texts
header                   0      11558     0.000    0           2
Written academic prose   0      1145004   0.000    0           30
Spoken demographic       0      1203554   0.000    0           30
Written fiction          1      1214069   0.000    1           25
Written newspapers       4      1116508   0.000    3           97
#unc                     0      0         0.000    0           0
Total                    5      4690693   0.000    4           184

Table 4: Analysis of machine in BNC Baby

Class                    Hits   Words     Pc (%)   Hit texts   Texts
header                   3      11558     0.026    2           2
Written academic prose   48     1145004   0.004    14          30
Spoken demographic       115    1203554   0.010    22          30
Written fiction          51     1214069   0.004    15          25
Written newspapers       59     1116508   0.005    27          97
#unc                     0      0         0.005    0           0
Total                    276    4690693   0.006    80          184
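The Pc column in Tables 1 to 4 appears to give the number of hits as a percentage of the words in each class. The short Python sketch below rests on that assumption and checks it against rows of Tables 2 and 4. The helper names percent and per_million are hypothetical, and the per-million figure is an additional normalization often used when comparing registers of different sizes, not a value reported in the tables.

```python
def percent(hits, words):
    """Hits as a percentage of the words in a class, rounded to 3 decimals."""
    return round(100 * hits / words, 3)


def per_million(hits, words):
    """Hits normalized to a frequency per million words, rounded to 1 decimal."""
    return round(1_000_000 * hits / words, 1)


# Hits and word counts copied from Tables 2 and 4.
device_academic = percent(82, 1145004)   # 0.007, as in Table 2
machine_spoken = percent(115, 1203554)   # 0.01, printed as 0.010 in Table 4
machine_total = percent(276, 4690693)    # 0.006, as in Table 4

print(device_academic, machine_spoken, machine_total)
print(per_million(82, 1145004), per_million(115, 1203554))
```

With only three decimals, the percentage hides most of the difference between rare items, which is why a per-million figure is easier to read for words such as appliance and gadget.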


Below is the result of the analysis carried out with WordSmith Tools on our constructed corpus. Table 5 is a word list showing the frequency distribution of the words in that corpus.

Table 5: WordSmith Tools -- 10/11/2008 21:55:45 corpus.txt

N      Word              Frequency   %
544    APPLIANCE         301         0.12
545    APPLIANCE’S       1
546    APPLIANCES        377         0.15
2697   DEVICE            186         0.07
2698   DEVICEAPPLICAT+   1
2699   DEVICES           80          0.03
2700   DEVICESCOMPUTE+   1
4093   GADGET            50          0.02
1094   GADGETS           60          0.02
5847   MACHINE           443         0.18
5848   MACHINE’S         4
5849   MACHINEABILITY    1
5850   MACHINEABLE       1
5851   MACHINED          3
5852   MACHINEIS         1
5853   MACHINERY         6
5854   MACHINES          222         0.09
5855   MACHINING         1

The frequency list for our constructed corpus, produced with WordSmith Tools and displayed in Table 5, shows that appliance / appliances are among the most frequent of these words; it also reveals that the frequencies of the singular and plural forms are quite similar. Gadget exists in both singular and plural forms in our constructed corpus. This analysis follows Halliday's statistical/textual perspective (Gledhill, 2000). We obtained surprising results in this analysis. The different uses of machine / gadget in their plural or singular forms are unexpected; these uses are not mentioned in the dictionaries. This confirms once again that even though words are similar in meaning, they are employed differently across registers.


3.3. The immediate right and left collocates of appliance, device, gadget, machine

We begin with an analysis of the immediate right and left collocates. The following table supports our assumption. We do not display the collocates of every word form; rather, we summarize our findings with examples drawn from our analysis.

Table 6: Most common left and right collocates of appliance, device, gadget, machine

Synonyms                 Left collocate   Frequency   Right collocate   Frequency
appliance / appliances   safety           179         defect(s)         40
                         The / the        23          inspection(s)     29
                         An / an          15          standards         21
                         this             4           bad               15
                         kitchen          2           act(s)            11
                         france           1           load              10
                         small            1           model             3
device / devices         software         7           that              5
                         profile          5           allegedly
                         application      3
gadget / gadgets         learning         25          corresponding     2
                         and              16          in                3
                         turn             14          verb              16
                         the              11
machine / machines       turing           160         vision            86
                         a                26          can / cannot      35
                         virtual          22          that              12
                         and              7           abstraction(s)    5
                         washing          5           Model(s)          3
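Counts of immediate collocates such as those in Table 6 can be obtained by tallying the token directly before and directly after each occurrence of the node word. The following Python sketch shows the idea under simplifying assumptions: the function name immediate_collocates is hypothetical, the text is an invented sample, and sentence boundaries are ignored, so a collocate may come from a neighbouring sentence.

```python
import re
from collections import Counter


def immediate_collocates(tokens, node_forms):
    """Count the tokens found immediately left and right of any node form."""
    left, right = Counter(), Counter()
    for i, tok in enumerate(tokens):
        if tok in node_forms:
            if i > 0:
                left[tokens[i - 1]] += 1
            if i + 1 < len(tokens):
                right[tokens[i + 1]] += 1
    return left, right


# Invented sample text; singular and plural forms are grouped as in Table 6.
text = ("A Turing machine halts. The virtual machine can run. "
        "A Turing machine cannot stop.")
tokens = re.findall(r"[a-z]+", text.lower())
left, right = immediate_collocates(tokens, {"machine", "machines"})

print(left.most_common())
print(right.most_common())
```

Grouping word forms in node_forms mirrors the table's rows (machine / machines); splitting them would show whether the singular and the plural attract different collocates.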

From the syntactic perspective, we noticed in our corpus that appliance is as productive as device in terms of syntactic combinations. Appliance can take adjective modifiers (domestic appliance, household appliance, electrical appliance, etc.), and it can form noun phrases (appliance's operating state, the power of appliances, safety appliance inspections, etc.). It can also modify another noun (appliance models, appliance state, appliance defects, etc.). Examples include:

1. FRA safety appliance defects must be repaired before a train can depart a yard, there are other

forms of deformation that could be repaired as time allows.

2. For each of the domestic appliances investigated here, the table reports the effect that owning a

particular appliance has on the amount of time spent in various kinds of household work.

3. This expanded group of appliances included grab irons, ladders, sill steps, hand brakes, running

boards, and other similar equipment.

4. Small resistive appliances in the power range 50 W < P < 210 W are grouped together with a

special algorithm.

5. Safety appliances on rail cars are the interface between humans and rolling stock with regard to

movement of rail cars.

Device shares most of the characteristics of appliance, but we also extracted certain verbs used in the passive after device (device is made / constructed / fabricated / developed / assembled). Device can be followed by relative clauses introduced by that / which. From the distribution of collocates, we also noted that phone / mobile are preferably placed before device, whereas applications, software and hardware tend to be placed after device and act as head words. Information is never placed after device, and profile is always placed after device in the corpus. This is because together they form the proper noun Mobile Information Device Profile (MIDP). The prepositions used with device are with, for and from. With follows device most of the time, whereas for and from tend to be placed before device.

6. The device was then assembled with a heater, a thermoelectric module and thermocouples.

7. There are commercially available devices for doing slab gel analysis.

8. The CLDC specification is made up of Mobile Information Device Profile (MIDP) applications,

whose technology stack is......


9. The automation of mobile device software product-lines requires generic software technologies to

support …...

10. A mobile device that people can carry with, is a constant factor to rely upon, when design......

Gadget functions as a modifier of another noun (gadget elements, gadget combinations). But it can also act as the head element, modified by an adjective or by a verb in the -ing form (electronic gadget, learning gadget). Gadget is used in combination with the word turn to form the proper noun turn gadget in the corpus.

11. Additionally, the participatory design of the leaning gadgets, which is based on the reuse of

proven solutions in the …..

12. Thus, for any parallel morph between two drawings of a turn gadget that keeps vertices of VS

static …

13. To complete the collection, turn gadgets are required to connect a turn gadget to another gadget

above it and another gadget either to the left or right.

14. Only the interface gadget elements are declared as part of the code in the fields form and text Field

of the class MobApp.

15. …... the hub operator will deliver enough gadgets for the customer when the future demand is

uncertain.

Syntactically, the word machine is the most productive. It can be a subject or an object (machine halts, computes, produces, operates; repair a machine, build a machine). It shares the combinatory characteristics of the other three synonyms: it can modify, be modified, or be followed by relative clauses. Adjectives are placed before machine as modifiers (small / domestic / virtual / powerful machine), as are nouns or verbs in the -ing form (washing / lapping / fax / PC machine). It also modifies nouns placed after it (machine language / loop / rules). One striking characteristic of machine in our corpus is its combination with other words to refer to specific proper nouns (turing machine, machine vision system, o-machine, etc.). It is also very often used as a noun in a suffixed form (automatic machinery / machinery and parts / a piece of machinery). Its co-occurrence with appliance is also observed in the corpus.

16. The location of a machine vision installation for monitoring safety appliances can be broken down

…...

17. We could imagine, for instance, a machine that included an accelerated Turing machine (M) as a

part.

18. Turing's later paper 'computing Machinery and Intelligence' can be taken as saying that even

mathematician …...

19. For example, as almost every household possesses a washing machine or stove, no statistically

valid comparison between the behavior of …...

20. As usual, if there is no appropriate step to execute at some point, the machine halts.

From the rhetorical perspective, we noticed that many proper nouns in our corpus are constructed with these synonyms and act as terms specific to a domain: Virtual Machine and Turing Machine belong to computer science, turn gadget is used specifically in graphic data structures, mobile device is used specifically in telecommunications, and safety appliance / Act / inspection / standards / deformation / defects are largely restricted to railroad safety inspections. Even though the synonyms keep their original meanings in these restricted domains, the proper nouns they form are no longer considered separable. Their lexical constructions are therefore not interchangeable with those of the other synonyms. In this case, the synonyms become part of the domain terminology, and these special uses have to be learned as part of learning that terminology.


4. Discussion

The analysis from a statistical perspective allowed us to see how the distributions of these synonyms differ between corpora. While appliance, device and machine are frequent words in both corpora, the frequency of the word gadget changes greatly from one corpus to the other. From a syntactic perspective, all the synonyms can be modified by an adjective or a verb in the -ing form; they can equally modify other nouns and form a noun phrase or a proper noun. The word machine turns out to be very productive in its syntactic combinations, which reflects the fact that it is the most commonly used word in the group. From the rhetorical perspective, these nouns are mostly used to form a proper noun with another word. The ability to use these terms correctly reflects the user's skill in the domain, and understanding these terms is tied to understanding the phraseology.

This research paper shows, from three perspectives, the differences among a set of near-synonyms. We can conclude that these synonyms share a common sense but are not identical. Collocations allow us to notice certain properties that might not be revealed by dictionaries. The results presented here offer a fine-grained analysis of the phenomena. The analysis would be more interesting with a larger and more diversified corpus, but it is encouraging that a small corpus already produced interesting results. We believe that the same method applied to other groups of synonyms could yield results as unexpected as those reported in this paper.


5. References

Biber, D., Conrad, S. & Reppen, R. (1998). Corpus Linguistics: Investigating Language Structure and Use. Cambridge: Cambridge University Press.

Fontenelle, T. (1994). What on earth are collocations? English Today 40, Vol. 10, No. 4. Cambridge University Press.

Gledhill, C. (1998). Towards a Description of English and French Phraseology. In Langue and Parole in Synchronic and Diachronic Perspective. St. Andrews, pp. 221-237.

Gledhill, C. (2000). Collocations in Science Writing. Tübingen: Gunter Narr Verlag.

Gledhill, C. & Frath, P. (2005). Une tournure peut en cacher une autre : l’innovation phraséologique dans Trainspotting. In Les langues modernes, edited by Guillaume, A. Paris, pp. 68-79.

Grossmann, F. & Tutin, A. (2003). Les collocations : analyse et traitement. Travaux et recherches en linguistique appliquée. Amsterdam: De Werelt.

Habert, B., Nazarenko, A. & Salem, A. (1997). Les linguistiques de corpus. Paris: Armand Colin.

Sinclair, J. (1991). Corpus, Concordance, Collocation. Oxford: Oxford University Press.
