An Electronic Corpus-based analysis of a set of near synonyms across two registers.
Tomoki KOYA, Danielle RAHARINTSOA, Deniz TOPRAK, Dildar KEREM WU
Université Marc Bloch, 22, Rue Descartes, 67 084 Strasbourg Cedex, France
Abstract
In this paper, we compare a set of near-synonymous nouns (appliance, device, gadget, machine)
across two registers. We attempt to demonstrate the semantic nuances of these terms in a
phraseological context, and to show that words behave differently once their surrounding
contexts and registers are taken into consideration.
Résumé
Dans cet article, nous allons comparer différents synonymes de appliance, device,
gadget, machine à travers deux registres. Nous essayerons de démontrer les nuances
sémantiques de ces termes dans un contexte de phraséologie. Nous avons l'intention de
démontrer que les mots se comportent différemment quand nous prenons en considération
leurs contextes environnants et leurs registres.
Keywords: corpus linguistics, corpus-based analysis, lexicographic investigation,
dictionaries, comparison of synonyms, collocation.
1. Introduction
The aim of this paper is to explore the different uses of the near synonyms appliance, device,
gadget and machine. The method is a corpus-based analysis. The main reason for this choice is
that these words have close semantic boundaries. A word may nevertheless be bound to certain
collocations in a given context, and the structure of the collocation does not allow it to be
replaced by one of its synonyms. We compare and contrast the above-mentioned
near synonyms by means of an empirical analysis. We believe that our approach is
interesting because its central element, the notion of collocation put forward by Gledhill
(Gledhill, 2000), differs from the notion of collocation adopted by other linguists. This
notion is explained in detail later.
In applied linguistics, the study of the meaning and use of terms is called lexicography. The
study is carried out by lexicographic investigation, which is traditionally used as a
methodology for dictionary-building. Many linguists, particularly Sinclair (Sinclair, 1991),
have developed the methodology of 'corpus linguistics', in which corpus
analysis provides the evidence for the uses and meanings of words. Sinclair even used this
methodology for the construction of the Collins Cobuild English Dictionary, which was
edited according to the results of the analysis of a two-hundred-million-word corpus.
According to Biber (Biber et al., 1998), a corpus-based analysis allows us to establish
frequency lists (occurrences of words), concordances (occurrences of a word with its
surrounding context) and collocations (the patterned ways in which words group together). He
claims that an empirical corpus-based analysis can establish that synonyms have their own
contextual preferences when associated with particular collocates or registers.
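The first two of these outputs, the frequency list and the concordance, can be illustrated with a minimal Python sketch. It does not reproduce the procedure of any particular concordancing software; the naive tokenizer, the function names (tokenize, frequency_list, concordance) and the sample sentences are assumptions introduced here for illustration only.

```python
import re
from collections import Counter

def tokenize(text):
    """Lowercase the text and split it into word tokens (a deliberately naive tokenizer)."""
    return re.findall(r"[a-z]+(?:'[a-z]+)?", text.lower())

def frequency_list(tokens):
    """Return (word, count) pairs sorted by descending count, then alphabetically."""
    counts = Counter(tokens)
    return sorted(counts.items(), key=lambda item: (-item[1], item[0]))

def concordance(tokens, node, span=3):
    """Return one (left context, node, right context) triple per occurrence of node."""
    lines = []
    for i, token in enumerate(tokens):
        if token == node:
            left = " ".join(tokens[max(0, i - span):i])
            right = " ".join(tokens[i + 1:i + 1 + span])
            lines.append((left, token, right))
    return lines

# Hypothetical sample text, not taken from the corpora used in this paper.
sample = "The washing machine halts. A virtual machine is a device that runs software."
tokens = tokenize(sample)

print(frequency_list(tokens)[:3])
for left, node, right in concordance(tokens, "machine"):
    print(f"{left:>25} | {node} | {right}")
```

The context window ignores sentence boundaries, which keeps the sketch short; a fuller treatment would segment sentences first.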
There is no disagreement about what a corpus is. Linguists agree that a corpus is a
collection of language data, selected and organized according to specified criteria so as to serve as
a language model (Sinclair 1996, Habert et al., 1997). The notion of collocation, on the other
hand, is very complex, and linguists adopt different perspectives. Thierry Fontenelle
(Fontenelle, 1994) pointed out that combinations of words depend on various factors. In his
article he refers to Carter and McCarthy, who believe that the concept of collocation is
independent of grammatical categories, particularly in the case of a grammatical collocation, which involves one
element from an open class and one element from a closed class; for example, the verb depend
collocates with on but not with of. Gledhill (Gledhill, 1998) claims that collocations are
unmarked language expressions; he further explains that even though collocations are
relatively fixed sequences of words, they differ from idioms because they are not
recognized culturally or stylistically as expressions in themselves. As his example
shows, to take a break is easily interpreted and therefore unmarked, whereas to kick the
bucket is difficult to interpret and is therefore opaque and marked. In (Gledhill, 2000), he gives
a full description of the different views that linguists have adopted on the notion of
collocation. He synthesizes three perspectives: Halliday's statistical/textual
perspective, the semantic/syntactic perspective and the discourse/rhetorical perspective.
According to him, in Halliday's statistical/textual perspective, collocations are framed in
terms of statistical probabilities and co-occurrence. This perspective allows linguists to
observe certain co-occurrences (for example, the case of set of, as shown by Sinclair) that
could not be recognized using a traditional method. The semantic/syntactic perspective,
by contrast, stresses the potential for lexical combination of an expression (shrug
one's shoulders has no alternatives for the verb shrug, while in make a decision, make can be
replaced by reach, take, etc.). The discourse/rhetorical perspective examines collocations in
terms of performance: it examines their communicative function and carries out an external
functional analysis (thus we know the difference between how do you do and how are you).
The choice of expressions often reveals a rhetorical or ideological stance. In his words,
"collocations are all related in phraseology", where phraseology refers to "The preferred way
of saying things in a particular discourse". It is this notion of collocation that we explore
in our work.
We organize this paper as follows. Section 2 introduces our methodology: we explain the
construction of our corpus and summarize how the words are
represented in the three chosen dictionaries. Section 3 presents our analysis and compares
the behavior of these near synonyms. The final section summarizes our research and
lays out perspectives for future work.
2. Methodology
For a linguistic analysis, the construction of the corpus is essential. To obtain a
reliable result, it is preferable to use two corpora: a reference corpus and a corpus constructed
for special needs. We use BNC Baby as our reference corpus. BNC Baby is a subset of the
British National Corpus (BNC)1. It contains 4 million words and is made up of equal
amounts of four kinds of material (academic writing, imaginative writing, newspaper texts and
spontaneous conversation). The corpus is queried with software named Xaira2. The
specialized corpus was constructed according to criteria that we defined ourselves. Firstly, we
decided to have a homogeneous corpus, consisting of a series of theses. Secondly, even though we
are aware that a large corpus is preferable, we agreed to construct a relatively small
one, mainly for lack of time. The size of our corpus is 240,000 words. For the
analysis of this corpus, we use the software WordSmith Tools3.
1 The British National Corpus (BNC) is a 100 million word collection of samples of written and spoken language from a wide range of sources, designed to represent a wide cross-section of current British English, both spoken and written. (http://www.natcorp.ox.ac.uk/).
2 Xaira is the current name for a new version of SARA, the text searching software originally developed at OUCS for use with the British National Corpus.
3 WordSmith Tools is an integrated suite of programs, based on lexical analysis, for looking at how words behave in texts. The tools have been used by Oxford University Press for its own lexicographic work in preparing dictionaries. For details, please consult: http://www.lexically.net/downloads/version5/HTML/index.html
To study this group of near synonyms (appliance, device, gadget and machine), we
first examine their meanings in dictionaries. We consulted eleven
dictionaries and finally selected three of them:
• Collins Cobuild, Harper Collins Publishers, 2nd edition, 1995 (hereinafter Collins).
• The Oxford English Dictionary, Clarendon Press, 2nd edition, 1989 (hereinafter
OED).
• Webster's Third New International Dictionary, G. & C. Merriam Company, 1961
(hereinafter Webster's).
Collins is the only dictionary among the three that acknowledges the use of a large corpus,
the Bank of English, for this edition. The chief editor is Sinclair himself. All the examples in
this dictionary contain typical patterning associated with the word. The OED claims to be the
largest, most authoritative dictionary of the English language and the ultimate source of
information on the usage and meaning of English words and phrases. It covers the vocabulary
of the English language since AD 1150. We chose Webster's firstly because it covers
American English, and secondly because it compares the subtle shades of meaning among
synonyms.
3. Comparison and analysis
3.1. Comparison of the definition representations in the dictionaries
Before discussing how the words are represented, we point out in this section
two basic units of representation: entry and sense. Of the four words, machine
and device have two entries in some dictionaries. The reason is that machine is not
only a noun but also a verb; as for device, its second entry is an old form of the verb devise.
As regards sense, the four words share a common sense: a manufactured object
whose purpose is to carry out a task in place of the human hand.
The word machine comes from the ideas of "make" and "power". The concept of a
machine generally involves some kind of power. In most cases this power is electricity,
steam, gas or human effort. The word gadget comes from the French word "gachette"
(trigger). Gadgets are small, useful things that perform a specific task. They are generally cleverly
designed. Device comes from the French word "devise", which means to divide. A device is an object,
or part of an object, that is used in a specific domain and for a specific purpose. Appliance
contains the idea of "applying something". An appliance can be a part of a larger object and is designed
especially for domestic tasks (for a specific use such as washing or cooking). It works mainly with
electricity. Machine seems to be the only word that has a generic sense, and it seems to subsume
all the others.
But the dictionaries also reveal that these words are not totally identical. For example,
the fixed expression leave someone to their own devices is specific to device in the plural;
the OED records a use of gadget specific to glass-making, where gadget denotes a spring clip
used for gripping the foot of a wine glass or other footed glass while it is being shaped. The word
appliance is used specifically in British English to refer to a fire engine. Machine is the only
word that is given a list of synonyms (engine, apparatus, appliance) in the OED (page
1353).
From the above comparison, we can conclude that even though these synonyms share
a common sense, dictionaries also define uses and senses specific to each of them. In the section
that follows, we draw on our corpus-based analysis to show that these synonyms are not
identical in meaning and usage.
3.2. Frequency distributions of appliance, device, gadget, machine
Having examined the definitions provided by the three dictionaries, in this
section we first analyze and compare the frequency
distribution of these synonyms in different registers. We then study their collocations,
focusing on their immediate collocates (immediate right and immediate left).
A quick analysis of BNC Baby confirms that these words are used differently across
registers. The analysis shows that machine is the most commonly used item.
Compared with machine and device, gadget(s) and appliance(s) are rarely used. For
lack of space, we do not display the BNC Baby results for the plural forms.
We have nevertheless noticed that appliance(s) and gadget do not appear in written academic
prose, whereas device(s) is widely used in academic prose. Even though machine(s) is widely
used in both spoken demographic and written academic prose, machine is
more widely used in spoken English than machines. Gadgets appears
mostly in written newspapers, and it is worth mentioning that the number of occurrences of
gadget in the singular is zero.
Table 1: Analysis of appliance in BNC Baby
Class                    Hits   Words     Pc      Hit texts   Texts
header                   0      11558     0.000   0           2
Written academic prose   0      1145004   0.000   0           30
Spoken demographic       0      1203554   0.000   0           30
Written fiction          1      1214069   0.000   1           25
Written newspapers       4      1116508   0.000   4           97
#unc                     0      0         0.000   0           0
Total                    5      4690693   0.000   5           184
Table 2: Analysis of device in BNC Baby
Class                    Hits   Words     Pc      Hit texts   Texts
header                   0      11558     0.000   0           2
Written academic prose   82     1145004   0.007   14          30
Spoken demographic       8      1203554   0.001   4           30
Written fiction          24     1214069   0.002   8           25
Written newspapers       10     1116508   0.001   9           97
#unc                     0      0         0.001   0           0
Total                    124    4690693   0.003   35          184
Table 3: Analysis of gadget in BNC Baby
Class                    Hits   Words     Pc      Hit texts   Texts
header                   0      11558     0.000   0           2
Written academic prose   0      1145004   0.000   0           30
Spoken demographic       0      1203554   0.000   0           30
Written fiction          1      1214069   0.000   1           25
Written newspapers       4      1116508   0.000   3           97
#unc                     0      0         0.000   0           0
Total                    5      4690693   0.000   4           184
Table 4: Analysis of machine in BNC Baby
Class                    Hits   Words     Pc      Hit texts   Texts
header                   3      11558     0.026   2           2
Written academic prose   48     1145004   0.004   14          30
Spoken demographic       115    1203554   0.010   22          30
Written fiction          51     1214069   0.004   15          25
Written newspapers       59     1116508   0.005   27          97
#unc                     0      0         0.005   0           0
Total                    276    4690693   0.006   80          184
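The Pc column in Tables 1 to 4 appears to be the number of hits expressed as a percentage of the words in each register (for example, 82 / 1145004 is about 0.007 % for device in academic prose). Under that assumption, the Python sketch below recomputes the column for device and machine and adds a per-million figure, which makes such small values easier to compare. The per-million normalization and the names percent, per_million and dominant_register are introduced here for illustration and are not part of the Xaira output.

```python
# Register sizes (in words) copied from Tables 1 to 4.
REGISTER_WORDS = {
    "Written academic prose": 1145004,
    "Spoken demographic": 1203554,
    "Written fiction": 1214069,
    "Written newspapers": 1116508,
}

# Hits for the singular forms, copied from Table 2 (device) and Table 4 (machine).
HITS = {
    "device": {"Written academic prose": 82, "Spoken demographic": 8,
               "Written fiction": 24, "Written newspapers": 10},
    "machine": {"Written academic prose": 48, "Spoken demographic": 115,
                "Written fiction": 51, "Written newspapers": 59},
}

def percent(hits, words):
    """Hits as a percentage of register size, to three decimals (assumed meaning of Pc)."""
    return round(hits / words * 100, 3)

def per_million(hits, words):
    """Hits per million words, a common normalization that the tables do not use."""
    return hits / words * 1_000_000

def dominant_register(word):
    """Register in which the word has the highest relative frequency."""
    return max(HITS[word], key=lambda reg: HITS[word][reg] / REGISTER_WORDS[reg])

for word in HITS:
    print(word, "is relatively most frequent in:", dominant_register(word))
    for register, hits in HITS[word].items():
        words = REGISTER_WORDS[register]
        print(f"  {register:<24} {percent(hits, words):.3f} %"
              f"  {per_million(hits, words):6.1f} per million")
```

Because the four registers are of nearly equal size, raw hits and relative frequencies rank the registers in the same order here; the normalization matters more when corpora of different sizes are compared.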
Below is the result of the analysis carried out with WordSmith Tools on our constructed
corpus. Table 5 is a word list showing the frequency distribution of the words
in our constructed corpus.
Table 5: WordSmith Tools -- 10/11/2008 21:55:45 corpus.txt
N      Word              Frequency   %
544    APPLIANCE         301         0.12
545    APPLIANCE’S       1
546    APPLIANCES        377         0.15
2697   DEVICE            186         0.07
2698   DEVICEAPPLICAT+   1
2699   DEVICES           80          0.03
2700   DEVICESCOMPUTE+   1
4093   GADGET            50          0.02
1094   GADGETS           60          0.02
5847   MACHINE           443         0.18
5848   MACHINE’S         4
5649   MACHINEABILITY    1
5850   MACHINEABLE       1
5851   MACHINED          3
5852   MACHINEIS         1
5853   MACHINERY         6
5854   MACHINES          222         0.09
5855   MACHINING         1
The frequency listing of our constructed corpus produced by WordSmith
Tools in Table 5 shows that appliance/appliances are among the most frequent of these words; it also
reveals that the frequencies of their plural and singular forms are quite similar. Gadget occurs in
both singular and plural forms in our constructed corpus. This analysis follows
Halliday's statistical/textual perspective (Gledhill, 2000). We obtained surprising results
in this analysis. The different uses of machine/gadget in their plural or singular forms are
unexpected; these uses are not mentioned in the dictionaries. This confirms once again that
even though words are similar in meaning, they are employed differently across registers.
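The comparison between singular and plural forms can be made explicit with a little arithmetic. The following Python sketch takes the singular and plural frequencies from Table 5 and computes the share of plural occurrences for each noun; the helper name plural_share is hypothetical, and possessive or fused forms such as APPLIANCE’S and MACHINEIS are left out as a simplifying assumption.

```python
# (singular, plural) frequencies copied from Table 5 (constructed corpus).
WORDLIST = {
    "appliance": (301, 377),
    "device": (186, 80),
    "gadget": (50, 60),
    "machine": (443, 222),
}

def plural_share(singular, plural):
    """Proportion of occurrences of a noun that are in the plural form."""
    return plural / (singular + plural)

for noun, (singular, plural) in WORDLIST.items():
    total = singular + plural
    print(f"{noun:<10} total={total:<4} plural share={plural_share(singular, plural):.2f}")
```

On these figures, appliance and gadget are split almost evenly between singular and plural (plural shares of about 0.56 and 0.55), whereas device and machine occur mostly in the singular (about 0.30 and 0.33).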
3.3. The immediate right and left collocates of appliance, device, gadget, machine
We begin with an analysis of the immediate right and left collocates. The
following table supports our assumption. We do not display the collocates of every word form;
instead, we summarize our findings with examples drawn from our analysis.
Table 6: Most common left and right collocates of appliance, device, gadget, machine
Synonyms                 Left collocate   Frequency   Right collocate    Frequency
appliance / appliances   safety           179         defect (s)         40
                         The / the        23          inspection (s)     29
                         An / an          15          standards          21
                         this             4           bad                15
                         kitchen          2           act (s)            11
                         france           1           load               10
                         small            1           model              3
device / devices         software         7           that               5
                         profile          5           allegedly
                         application      3
gadget / gadgets         learning         25          corresponding      2
                         and              16          in                 3
                         turn             14          verb               16
                         the              11
machine / machines       turing           160         vision             86
                         a                26          can / cannot       35
                         virtual          22          that               12
                         and              7           abstraction (s)    5
                         washing          5           Model (s)          3
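Counts of immediate left and right collocates such as those in Table 6 can be obtained with a few lines of code. The Python sketch below is not the WordSmith Tools procedure; it is a minimal illustration under stated assumptions: a naive tokenizer, a hypothetical function name (immediate_collocates) and invented sample sentences that only echo the kind of text found in the corpus.

```python
import re
from collections import Counter

def immediate_collocates(text, forms):
    """Count the words immediately to the left and right of any of the given word forms."""
    tokens = re.findall(r"[a-z]+", text.lower())
    left, right = Counter(), Counter()
    for i, token in enumerate(tokens):
        if token in forms:
            if i > 0:
                left[tokens[i - 1]] += 1
            if i + 1 < len(tokens):
                right[tokens[i + 1]] += 1
    return left, right

# Hypothetical sample sentences, not quoted from the constructed corpus.
sample = ("A Turing machine halts. The washing machine can stop. "
          "Safety appliance defects must be repaired. A Turing machine can loop.")

left, right = immediate_collocates(sample, {"machine", "machines"})
print("left: ", left.most_common(2))
print("right:", right.most_common(2))
```

The sketch counts across sentence boundaries and groups singular and plural forms together, as Table 6 does; function words such as the or and are not filtered out, which is why they also appear among the collocates reported above.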
From the syntactic perspective, we have noticed in our corpus that appliance is as
productive as device in terms of syntactic combinations. Appliance can take adjective
modifiers (domestic appliance, household appliance, electrical appliance, etc.), and it can
form noun phrases (appliance's operating state, the power of appliances, safety appliance
inspections, etc.). It can also modify another noun (appliance models, appliance state,
appliance defects, etc.). For example:
1. FRA safety appliance defects must be repaired before a train can depart a yard, there are other
forms of deformation that could be repaired as time allows.
2. For each of the domestic appliances investigated here, the table reports the effect that owning a
particular appliance has on the amount of time spent in various kinds of household work.
3. This expanded group of appliances included grab irons, ladders, sill steps, hand brakes, running
boards, and other similar equipment.
4. Small resistive appliances in the power range 50 W < P < 210 W are grouped together with a
special algorithm.
5. Safety appliances on rail cars are the interface between humans and rolling stock with regard to
movement of rail cars.
Device shares most of the characteristics of appliance, but we have also extracted certain
verbs that are used in the passive after device (device is made / constructed / fabricated /
developed / assembled). Device can be followed by relative clauses introduced by that /
which. From the distribution of collocates, we also noted that phone / mobile tend to be
placed before device, whereas applications, software, hardware tend to be placed after
device and act as head words. Information is never placed after device, and profile is
always placed after device in the corpus, because together they form a proper noun,
Mobile Information Device Profile (MIDP). The prepositions used with device are with, for
and from. With follows device most of the time, whereas for and from tend to be placed
before device.
6. The device was then assembled with a heater, a thermoelectric module and thermocouples.
7. There are commercially available devices for doing slab gel analysis.
8. The CLDC specification is made up of Mobile Information Device Profile (MIDP) applications,
whose technology stack is......
9. The automation of mobile device software product-lines requires generic software technologies to
support …...
10. A mobile device that people can carry with, is a constant factor to rely upon, when design......
Gadget can function as a modifier of another noun (gadget elements, gadget
combinations). It can also be modified by an adjective or by a verb in the -ing form, in which case
gadget itself acts as the head element (electronic gadget, learning gadget). Gadget is also used in
combination with the word turn to form a proper noun, turn gadget, in the corpus.
11. Additionally, the participatory design of the leaning gadgets, which is based on the reuse of
proven solutions in the …..
12. Thus, for any parallel morph between two drawings of a turn gadget that keeps vertices of VS
static …
13. To complete the collection, turn gadgets are required to connect a turn gadget to another gadget
above it and another gadget either to the left or right.
14. Only the interface gadget elements are declared as part of the code in the fields form and text Field
of the class MobApp.
15. …... the hub operator will deliver enough gadgets for the customer when the future demand is
uncertain.
Syntactically, the word machine is the most productive. It can be a subject or an object
(machine halts, computes, produces, operates; repair a machine, build a machine). It
shares the combinatory characteristics of the other three synonyms: it can be a modifier, be
modified, or be followed by a relative clause. It is modified by adjectives placed before it (small /
domestic / virtual / powerful machine) and by nouns or -ing forms placed before it
(washing / lapping / fax / PC machine). It also modifies nouns placed after it
(machine language / loop / rules). One striking characteristic of machine in our corpus is
its use in combinations that refer to specific proper nouns (Turing machine, machine vision system,
o-machine, etc.). It is also very often used as a noun in a suffixed form (automatic
machinery / machinery and parts / a piece of machinery). Its co-occurrence with appliance is
also observed in the corpus.
16. The location of a machine vision installation for monitoring safety appliances can be broken down
…...
17. We could imagine, for instance, a machine that included an accelerated Turing machine (M) as a
part.
18. Turing's later paper 'computing Machinery and Intelligence' can be taken as saying that even
mathematician …...
19. For example, as almost every household possesses a washing machine or stove, no statistically
valid comparison between the behavior of …...
20. As usual, if there is no appropriate step to execute at some point, the machine halts.
From the rhetorical perspective, we noticed that in our corpus many proper nouns are
constructed with these synonyms, and that they act as terms specific to particular domains: Virtual
Machine and Turing Machine belong to computer science, turn gadget is used
specifically in graphic data structures, mobile device is used specifically in
telecommunications, and safety appliance / Act / inspection / standards / deformation /
defects are restricted to railroad safety inspections. Even though the
synonyms keep their original meanings in these restricted domains, the
proper nouns they form are no longer considered separable. Their lexical constructions
are therefore not interchangeable with those of the other synonyms. In this case, the synonyms become
part of the domain terminology, and these special uses have to be learned as part of
learning that terminology.
4. Discussion
The analysis from a statistical perspective allowed us to see the differences in the
distribution of these synonyms in different corpora. While appliance, device and machine are
frequent words in both corpora, the frequency of gadget changes greatly
from one corpus to the other. From a syntactic perspective, all the synonyms can be
modified by an adjective or a verb in the -ing form; they can equally modify other nouns and
form a noun phrase or a proper noun. The word machine turns out to be very productive
in its syntactic combinations, which reflects the fact that it is the most commonly
used of the group of synonyms. From the rhetorical perspective, these nouns are mostly
used to form a proper noun with another word. The ability to use these terms correctly reflects
the user's skill in the domain, and the comprehension of these terms is related to the
comprehension of the phraseology.
This research paper shows, from three perspectives, the differences within a set of near
synonyms. We can conclude that these synonyms share a common sense, but that they are not
identical. Collocations allow us to notice certain properties that might not be revealed by
dictionaries. The results presented here offer a fine-grained analysis of the
phenomena. This analysis would be more informative if we constructed a larger and more diversified
corpus. It is encouraging that we have seen interesting results with a small corpus. We
believe that the same method applied to other groups of synonyms could yield unexpected
results, as it has in this paper.
5. References
Biber, D., Conrad, S. & Reppen, R. (1998). Corpus Linguistics: Investigating Language
Structure and Use. Cambridge: Cambridge University Press.
Fontenelle, T. (1994). What on earth are collocations? English Today 40, Volume 10,
Number 4. Cambridge University Press.
Gledhill, C. (1998). Towards a Description of English and French phraseologie. In
Langue and Parole in Synchronic and Diachronic Perspective. St. Andrews, pp. 221-237.
Gledhill, C. (2000). Collocations in Science Writing. Tübingen: Gunter Narr Verlag.
Gledhill, C. & Frath, P. (2005). Une tournure peut en cacher une autre : l’innovation
phraséologique dans Trainspotting. In Les langues modernes, edited by Guillaume Astrid.
Paris, pp. 68-79.
Grossmann, F. & Tutin, A. (2003). Les collocations : analyse et traitement. Travaux et
recherches en linguistique appliquée. Amsterdam: De Werelt.
Habert, B., Nazarenko, A. & Salem, A. (1997). Les linguistiques de corpus. Paris:
Armand Colin.
Sinclair, J. (1991). Corpus, Concordance, Collocation. Oxford: Oxford University
Press.