downloaded from -...

11
SCIEN CE -SUPPLEMENT. FRIDAY. MARCH 11, 1887. THE CHARACTERISTIC CURVES OF COM- POSITION. AUGUSTUs DEMORGAN somewhere remarks (I think it is in his 'Budget of paradoxes') that some time somebody will institute a conmparison among writers m regard to the average length of ocrt I mean word-length suggested itself. The new method, while scarcely more laborious than that proposed by DeMorgan, promised to yield results more quickly and of a definitely higher order. It also had the advantage of including, in its ap- plication, all that was necessary to the determina- tion of mean word-length; so that, in reality, it furnished two distinct tests. Preliminary trials of the method have furnished 200 - ISO 100 --- so Ai i< A i < L iii 2 3 4 5 6 7 8 9 10 11 12 13 14 16 FIG. 1. -FIRST ONE THOUSAND WORDS IN 'OLIVER TWIST.7 words used in composition, and that it may be found possible to identify the author of a book, a poem, or a play, in this way. In reflecting upon this remark at various times within the past five or six years, always with the determination to test the value of the suggestion whenever time for the work seemed available, a more comprehensive and satisfactory method of analysis than that based simply upon strong grounds for the belief that it may prove useful as a method of analysis leading to identi- fication or discrimination of authorship, and it is therefore brought to the attention of the scientific and literary public in the hope that some one may be found who is at once able and willing to secure a satisfactory test of its validity. The nature of the process is extremely simple, but it may be useful to point out its similarity to \ 1111 CL [11 IFTII %J I on February 20, 2019 http://science.sciencemag.org/ Downloaded from

Upload: others

Post on 29-Jul-2020

1 views

Category:

Documents


0 download

TRANSCRIPT

Page 1: Downloaded from - elmo.sbs.arizona.eduelmo.sbs.arizona.edu/sandiway/ling388-20/Mendenhall1887.pdf · 238 a well-known method of material analysis, the consideration of whichactually

SCIENCE -SUPPLEMENT.FRIDAY. MARCH 11, 1887.

THE CHARACTERISTIC CURVES OF COM-POSITION.

AUGUSTUs DEMORGAN somewhere remarks (Ithink it is in his 'Budget of paradoxes') thatsome time somebody will institute a conmparisonamong writers m regard to the average length of

ocrt I

mean word-length suggested itself. The newmethod, while scarcely more laborious than thatproposed by DeMorgan, promised to yield resultsmore quickly and of a definitely higher order.It also had the advantage of including, in its ap-plication, all that was necessary to the determina-tion of mean word-length; so that, in reality, itfurnished two distinct tests.Preliminary trials of the method have furnished

200 -

ISO

100 ---

soAi i< Ai < Liii

2 3 4 5 6 7 8 9 10 11 12 13 14 16FIG. 1. -FIRST ONE THOUSAND WORDS IN 'OLIVER TWIST.7

words used in composition, and that it may befound possible to identify the author of a book, apoem, or a play, in this way.

In reflecting upon this remark at varioustimes within the past five or six years, alwayswith the determination to test the value of thesuggestion whenever time for the work seemedavailable, a more comprehensive and satisfactorymethod of analysis than that based simply upon

strong grounds for the belief that it may proveuseful as a method of analysis leading to identi-fication or discrimination of authorship, and it istherefore brought to the attention of the scientificand literary public in the hope that some one maybe found who is at once able and willing to securea satisfactory test of its validity.The nature of the process is extremely simple,

but it may be useful to point out its similarity to

\

1111CL[11 IFTII%J I

on February 20, 2019

http://science.sciencem

ag.org/D

ownloaded from

Page 2: Downloaded from - elmo.sbs.arizona.eduelmo.sbs.arizona.edu/sandiway/ling388-20/Mendenhall1887.pdf · 238 a well-known method of material analysis, the consideration of whichactually

238

a well-known method of material analysis, theconsideration of which actually first suggested tothe writer its literary analogue.By the use of the spectroscope, a beam of non-

homogeneous light is analyzed, and its compo-nents assorted according to their wave-length. Asis well known, each element,when intensely heatedunder proper conditions, sends forth light which,upon prismatic analysis, is found to consist ofgroups of waves of definite length, and appearing

250

200

[VOL. IX., No. 214

every author, as with every element, this spec-trum persists in its form and appearance, thevalue of the method will be at once conceded. Ithas been proved that the spectrum of hydrogenis the same, whether that element is obtainedfrom the water of the ocean or from the vapor ofthe atmosphere. Wherever and whenever it ap-pears, it means bydrogen. If it can be provedthat the word-spectrum or characteristic curveexhibited by an analysis of ' David Copperfield'

50 R~ - - - -Vx q - i ii

3 4 5 6 7 8 9 10 11 12 18 14

Fia. 2.- SHOWING FIVE GROUPS, OF ONE THOUSAND WORDS EACH, FROM 'OLIVER TWIT.'15 16

in certain definite proportions. So certain anduniform are the results of this analysis, that theappearance of a particular spectrum is indispu-table evidence of the presence of the elemenit towhich it belongs.

In a manner very similar, it is proposed toanalyze a composition by forming what-may becalled a 'word-spectrum,' or 'characteristiccurve,' which shall be a graphic representationof an arrangement of words according to theirlength and to the relative frequency of their oc-currence. If, now, it shall be found that with

is identical with that of ' Oliver Twist,' of 4 Bar-naby Rudge.' of -Great expectations,' of the' Child's history of England,' etc., and that it dif-fers sensibly from that of ' Vanity fair,' or'Eugene Aram,' or IRobinson Crusoe,' or 'DonQuixote,' or any thing else in fact, then the con-clusion will be tolerably certain that when it ap-pears it means Dickens.The validity of the method as a test of author-

ship, tben, implies the following assumptions:that every writer makes use of a vocabularywhich is peculiar to himself, and the character of

SCIENCE.

II4r X I I1F- -

.-

:1 ',\ T T -;TII\ -in0 % ----

IDU k I I

'I \

lI,

I NJ\,%

. I

---- I-I.-.

--#j

on February 20, 2019

http://science.sciencem

ag.org/D

ownloaded from

Page 3: Downloaded from - elmo.sbs.arizona.eduelmo.sbs.arizona.edu/sandiway/ling388-20/Mendenhall1887.pdf · 238 a well-known method of material analysis, the consideration of whichactually

MARCH l1, 1887.] SOCIE]

which does not materially change from year toyear during his productive period; that, in theuse of that vocabulary in composition, personalpeculiarities in the construction of sentences will,in the long-run, recur with such regularity thatshort words, long words, and words of mediumlengtb, will occur with definite relative fre-quencies.The first assumption will, perhaps, be admitted

in a general way, without debate. It is easily

250

150

L00

50

S 4 5 6 7 b

YCE. 239

in their curves, and consequently as a severe testof the method, two contemporaneous novelists,Dickens and Thackeray, were selected for the firstexamination. The operation consisted simply incounting the number of letters in every word, andrecording the number of words of one letter, twoletters, three letters, etc. The count began inboth cases at the beginning of the volume, and,after a few thousand words had been counted inorder, the book was opened at random near the

9 10 11 I 13 14 15 1.6FIG. 3.-Two CONSECUTIVE GROUPS, OF ONE THOUSAND WORDS EACH, FROM 'VANITY FAIR.' THESE GROUPS

SHOW SENSIBLY THE SAME AVERAGE WORD-LENGTHS.

seen that to prove or disprove the second will re-quire the expenditure of an enormous amount oflabor. The following results are offered as ameans of properly exhibiting the method, andas evidence, in some degree at least, of its realvalue.

It is important, first, to determine to what ex-te-it an author may be said to agree with himself;and, second, to what extent does he differ fromothers.As an instance in which two writers might

well be expected to greatly resemble each other

middle, and the count continued. In no case wasany personal choice exercised, except that bothcounts began with the first chapter. Words werecounted always in groups of one thousand. Thegraphic display of the result was made by thecommon method of rectangular co-ordinates,using the number of letters in a word as anabscissa, and the corresponding number of suchwords in a thousand as an ordinate. As an illus-tration, the first one thousand words counted from' Oliver Twist' may be cited; they were as fol-lows:-

I' I_ _I ' __ _ _I

k,

X - - -_-- -< -

0-

i i txuui i i .4N i i i

I 2

on February 20, 2019

http://science.sciencem

ag.org/D

ownloaded from

Page 4: Downloaded from - elmo.sbs.arizona.eduelmo.sbs.arizona.edu/sandiway/ling388-20/Mendenhall1887.pdf · 238 a well-known method of material analysis, the consideration of whichactually

240 SCI1E

Number of letters 1 2 3 4 5 6 7 8 9 10 11 12

Number of words 38 170 235 175 123 91 62 41 35 10 13 7

Even in so small a number as one thousand,the relative distribution of words is approximatelythe same as in a much larger number, although,as would naturally be expected, accidental varia-tions or ' runs' overshadow personal characteris-

YCE. [VOL. IX., No. 214

placing the numbers showing letters in each wordat points along a horizontal line separated fromeach other by equal distances, above each of theseplace other points whose distance from the baseline shall be proportional to the number of suchwords in a thousand; then join these points by abroken line, and the characteristic curve is shown.Fig. I shows the curve thus constructed fromthe first thousand words in ' Oliver Twist,' thenumerical analysis of which is shown above.

FIG. 4. -Two GROUPS, OF FIVE THOUSAND WORDS EACH, FROM 'OLIVER TWIST.'

tics to a great extent; but not completely, as willbe seen in the characteristic curves shown in thefollowing pages. In fact, when the ten groups,of a thousand words each, from Dickens, arecompared with ten similar groups from JohnStuart Mill, no one of the first set could byany possibility be mistaken for any one of. thesecond.The graphic representation of the results will be

readily understood. It is only necessary to take asheet of 'squared' paper, or paper ruled in twodirections at right angles to each other, and, after

The next diagram (fig. 2) exhibits five curvesconstructed from the first five thousand wordsthe same from work, in groups of one thousandeach. It is presented in order to show the varia-tion among groups based on a relatively smallnumber of words.The superiority of this method over that of

simple word averages, as suggested by DeMorgan,is clearly shown in fig. 3, which exhibits two con-secutive groups, of one thousand words each,from 'Vanity fair.' The numerical analysis ofthese groups is as follows :

on February 20, 2019

http://science.sciencem

ag.org/D

ownloaded from

Page 5: Downloaded from - elmo.sbs.arizona.eduelmo.sbs.arizona.edu/sandiway/ling388-20/Mendenhall1887.pdf · 238 a well-known method of material analysis, the consideration of whichactually

SCIENCE.

Letters............1 2 3 41 5 6 7 8 9 10 11 12 13 14

Words in 1st group 25 169 232 187 109 78 79 48 28 20 10 10 2 3

Words in 2d group 33 146 248 164 135 76 73 52 35 23 6 5 2 2

It will be seen that the total number of lettersin the first group is 4,507, and in the second 4,508,or an average of 4.507 and 4.508 letters to eachword in the respective groups. If this average,

241

ist. One of the curves shows an excess of nine-letter words, which does not appear in the other.They agree in showing a greater number ofsix-letter words than a smooth curve would de-mand. This excess may persist, and prove tobe a real characteristic of Dickens's composition.Fig. 5 exhibits these two groups of five thousandwords combined in one of ten thousand, giving acurve of greater smoothness, and approximatingstill more closely to the normal curve of the writer.

5 6 7 8 9 10 11 1i 13 14FIG. 5. -CURVE FOR TEN THOUSAND WORDS FROM 'OLIVER TWIST.'

or ' mean word-length,' be alone considered, thetwo groups must be regarded as sensibly identi-cal; but an inspection of the diagram shows thatthey are in reality quite different.When the number of words in a group is in-

creased to fi ce thousand, the accidental irregu-larities begin to disappear, the curve becomessmoother, approximating more nearly to the nor-mal curve whicb, it is assumed, is characteristicof the writer. Fig. 4 exhibits two groups, each offive thousand words, from ' Oliver Twist,' and itwill be seen that considerable differences still ex-

In fig. 6, two groups of five thousand wordseach, from 'Vanity fair,' are shown; and infig. 7, two groups of ten thiousand each, from' Oliver Twist' and ' Vanity fair,' are placed sideby side for comparison, the former being repre-sented by the continuous line, and the latter bythe broken line. Although these curves differ,and while it is believed that the difference willpersist with an increased number of words, it iscertainly surprising, that in the analysis of tenthousand words from Dickens, and the samenumber from Thackeray, so close an agreement

MARCH 11, 1887.1

on February 20, 2019

http://science.sciencem

ag.org/D

ownloaded from

Page 6: Downloaded from - elmo.sbs.arizona.eduelmo.sbs.arizona.edu/sandiway/ling388-20/Mendenhall1887.pdf · 238 a well-known method of material analysis, the consideration of whichactually

242 SCIENCE.

should be found. This agreement is particularlystriking in words of eleven, twelve, and thirteenletters, the numerical comparison of which is asfollows:-

Number of letters................... 11 12 13

Number of words in Dickens....... 85 57 29

Number of words in Thackeray..... 85 53 29

Z50

200

150

100

50

[VOL. 1x., No. 214

ists; but I confess to considerable surprise on find-ing from the very beginning, that although, onthe whole, this anticipation was realized, the wordwhich occurred most frequently was not the three-letter word, as with both Dickens and Thackeray,but the word of two letters. Indeed, the word oftwo letters was not only relatively more frequent,but absolutely; that is to say, it occurred morefrequently in the composition of Mill than in thatof either of the novelists, and with great uniform-

I I I 1 1

6 7 ¶9 10 1 12 13 14 15 16FIG. 6. - TwO GROUPS, OF FIVE THOUSAND WORDS EACH, FROM 'VANITY FAIR.7

This closeness to identity must be largely the re-sult of accident, and it would not be likely torepeat itself in another analysis.The writer next examined was John Stuart

Mill; and to test the persistence of form in com-positions belonging to different periods of the au-thor's life, and upon different subjects, two groupsof five thousand words each were taken,- onefrom his ' Political economy,' and the other fromhis 'Essay on liberty.' It was anticipated, ofcourse, that words of greater length would occurfar more frequently than in the case of the novel-

ity, as it was in excess in each thousand of theten analyzed. The explanation is easv, and is tobe found in the liberal use of prepositions in sen-tence-building. The proposed method of analysisis designed to reveal any peculiarity of this kind,and the exemplification of its power thus early inthe work was encouraging.

Figs. 8 and 9 show the curves for five thousandwords from the ' Political economy' and from thef Essay on liberty.' It will be observed, that,while they differ considerably, there is still, in ageneral way, a striking resemblance, and that

I n. -N

z 5

on February 20, 2019

http://science.sciencem

ag.org/D

ownloaded from

Page 7: Downloaded from - elmo.sbs.arizona.eduelmo.sbs.arizona.edu/sandiway/ling388-20/Mendenhall1887.pdf · 238 a well-known method of material analysis, the consideration of whichactually

SCIENCE.they are in marked contrast with the curves ofthe novelists. An interesting case was furnishedin two recent addresses on the labor question byMr. Edward Atkinson. In reality, one addresswas given to two very different audiences. Onewas made up from the workingmen of Provi-dence, and the other from the alumni of the An-dover theological seminary. On reading the two,one cannot avoid being struck by the marked dif-ference in style, although the two papers are much

243

The average length of ten thousand words in his ad-dresses on the labor question is 4.298 letters. Themean word-length of the writers thus far exam-ined, based upon a count of ten thousand wordsfrom each, is as follows:-

Atkinson ..............................4.298Dickens .................................4.342Thackeray ...............................4.481Mill ................................4.775

A friend has furnished me with the result ofthe count of the first five thousand five hundred

150-w-X--- i- - -200t I

II150 -lTl fFFH r

Iinn J. - - I

4 5 6 7 & 10 ll 12 13 14FIG. 7.-TWO GROUPS, OF TEN THOUSAND WORDS EACH, FROM 'OLIVER TWIST,' , AND FROM

'VANITY FAIR,'----

alike in substance. It was interesting, then, toinquire whether their curves of compositionwould show any marked resemblance. An analy-sis of five thousand words from each paper wasmade, and the result is shown in fig. 10. Averv satisfactory, indeed a striking, general re-semblance will be observed; and it wvill also beseen that Mr. Atkinson's curve differs decidedlyfrom others previously figured and described. Itis shown in contrast with that of John Stuart Millin fig. 11. Mr. Atkinson's composition is remark-able in respect to the shortness of the words used.

words of Caesar's 'Commentaries.' The meanword-length is 6.065. The most extensive word-counting that I know of is that of the words andletters in the Bible. I cannot vouch for the reli-ability of the information which periodically floatsthrough the columns of the public press, that theOld Testament contains 592,493 words with 2,728,-100 letters, and the New Testament 181.253 wordswith 838,380 letters. It is interesting to note, how-ever, that these numbers give averages of 4.604and 4.625 respectively, agreeing within less thanone-half of one per cent.

MARCH 11, 1887.]

50

PNItt T T> - I T-

S 15 16

ch ir 0%

on February 20, 2019

http://science.sciencem

ag.org/D

ownloaded from

Page 8: Downloaded from - elmo.sbs.arizona.eduelmo.sbs.arizona.edu/sandiway/ling388-20/Mendenhall1887.pdf · 238 a well-known method of material analysis, the consideration of whichactually

244 SCIENCE.Before making an analysis of Mr. Atkinson's

composition, and after having counted more thanthirty thousand from other writers, I had con-cluded that a group of one thousand words whoseaverage length was less than four letters wouldnot occur, except in compositions especially writ-ten in short words. Out of ten such groups fromMr. Atkinson's addresses, however, one was foundwhose mean word-length was 3.991. I have re-cently received from him a brief paper, entitled

[VOL. 1X., No. 214

method of analysis and identification has beenfurnished by several friends who have had thepatience to enumerate the letters in "many thou-sand words from different sources. Prof. StanleyCoulter sends me the result of a count of tenthousand from Dickens's 'IChristmas carol.' Hewrites, II became exceedingly interested in watch-ing how little tricks of composition affected the'curve.' For instance, one of the characters,'Scrooge,' appears in one place very often, and an

FIG. 8. -CURVE OF FIVE THOUSAND WORDS FROM 1MILL'S 'POLITICAL ECONOMY.'

'How do we all get a living?'Iwhicb was pub-lished in Work and wages, and in the preparationof which he made a special effort to use the sim-plest language possible. The article contains alittle more than two thousand word!, the numberbeing too small for the construction of a curvewhich would be comparable with those alreadyexhibited. The general form of one based upontwo thousand words is similar to that previouslyobtained from the same writer, and the meanword-length is 3.771.

Interesting evidence of the validity of this

excess of 7's is the result; in another place ' Fiz-ziwig,' and the 8's creep up [this is doubtless owingto the frequent appearance of the names]. Othervariations and excesses seem to come from Dick-ens's love of certain forms of description, whichhe iterates and reiterates upon a single page."

I have plotted these ten thousand words fromthe ' Carol' with the ten thousand already shownfrom 'Oliver Twist.' in fig. 12. A very close re-semblance will be observed, and it will be noticedthat the mean of these two curves would be freefrom certain irregularities which occur in both,

on February 20, 2019

http://science.sciencem

ag.org/D

ownloaded from

Page 9: Downloaded from - elmo.sbs.arizona.eduelmo.sbs.arizona.edu/sandiway/ling388-20/Mendenhall1887.pdf · 238 a well-known method of material analysis, the consideration of whichactually

MARCH 11, 1687.1 SCICE. 24and would be a much closer approximation to thenormal characteristic curve of Dickens.

It is hardly necessary to say that the method isnot necessarily confined to the analysis of a com-position by means of its mean word-length: itmay equally well be applied to the study of syl-lables, of words in sentences, and in various otherways. The results thus far obtained from its ap-plication would appear to justify the claim that itis worthy of a thorough test through which the

250

zoo

150

100

5 6 7

Many interesting applications of the processwill suggest themselves to every reader; the mostnotable, of course, being the attempt to solvequestions of disputed authorship, such as exist inreference to the letters of Junius, the plays ofShakspeare, and other less widely knovn ex-

amples. It might also be utilized in comparativelanguage studies, in tracing the growth of a lan-guage, in studying the growth of the vocabularyfrom childhood to manlhood, and in other direc-

8 9 10 11 1Z 1 14 15 16FIG. 9. -CURVE OF FIVE THOUSAND WORDS FROM MILL'S 'ESSAY ON LIBERTY.'

validity of its assumptions might be proved ordisproved. Its principal merits are, that it offersa means of investigating and displaying the meremechanism of composition, and that it is purelymechanical in its application. In 'virtue of thefirst, it might reveal characteristics which a writerwould make no attempt to conceal, being himselfunaware of their existence; and, of the second,theconclusions reached through its use would beindependent of personal bias, the work of oneperson in the studiy of an author being at oncecomparable with that of any other.

tions too numerous to be cataloguel. An illus-tration of its application to another language isshown in the analysis of more than five thousandwords in Caesar's ' Commentaries,' already referredto, which is represented in fig. 13. The curveshows a relatively large use of long words, andits peculiar feature is the evident indication oftwo maximum ordinates nearly equal to eachother.From the examinations tbus far made, I am

convinced that one hundred thousand words willbe necessarv and sufficient to furnish the charac-

I I

LI I<lVIZLITT

I I4X\-

I I I I I

i , ,

24,I ,

on February 20, 2019

http://science.sciencem

ag.org/D

ownloaded from

Page 10: Downloaded from - elmo.sbs.arizona.eduelmo.sbs.arizona.edu/sandiway/ling388-20/Mendenhall1887.pdf · 238 a well-known method of material analysis, the consideration of whichactually

246

teristic curve of a writer, -that is to say, if acurve is constructed from one hundred thousandwords of a writer, taken from any one of his pro-ductions, then a second curve constructed fromanother hundred thousand words would be prac-tically identical with the first, - and that this curvewould, in general, differ from that formed in thesame way from the composition of another writer,to such an extent that one could always be dis-tinguished from the other. To demonstrate the

250 - '

.Oo

200 +150

.- -A-

10 - -. -

52 3456 7 i

FIG. 10. -TWO GROUPS, OF FIVE THOUSAND WORDS EATO WORKINGMEN, -; TO ALUMNI

existence of such a curve will require the enum-eration of the letters in several hundred thousandwords from each of a number of writers. Shouldits existence be established, the method mightthen be applied to cases of disputed authorship.If striking differences are found between thecurves of known and suspected compositions ofany writer, the evidence against identity of author-ship would be quite conclusive. If the two com-positions should produce curves which are practi-cally identical, the proof of a common originwould be less convincing; for it is possible, al-

[VOL. IX.. No. 214

though not probable, that two writers might showidentical characteristic curves.

T. C. MENDENHALL.

TIDAL OBSERVATIONS OF THE GREELYEXPEDITION.

THE principal tidal observations were made atFort Conger, on Lady Franklin Bay, by variousmembers of the expeditionary force working under

12 13 14 15 1HLCH, FROM ADDRESSES OF EDWARD ATKINSON: ADDRESSOF THEOLOGICAL SEMINARY,

direction of Sergt. Edward Israel, and with ageneral supervision by the commanding officer ofthe expedition. They consisted of hourly heightsof the tide from Aug. 20, 1881, to July 1, 1882,and the times and heights of high and low watersfrom Aug. 20, 1881, to June 30, 1883, both seriesread from fixed staff gauges and practically con-tinuous. A broken series of high and low watersfrom July 1 to Aug. 8, 1883, obtained under un-favorable conditions, were not used in the discus-sion. There were also short series at seven out-lying stations on the coasts of Greenland and

SCIENCE.

on February 20, 2019

http://science.sciencem

ag.org/D

ownloaded from

Page 11: Downloaded from - elmo.sbs.arizona.eduelmo.sbs.arizona.edu/sandiway/ling388-20/Mendenhall1887.pdf · 238 a well-known method of material analysis, the consideration of whichactually

THE CHARACTERISTIC CURVES OF COMPOSITIONT. C. Mendenhall

DOI: 10.1126/science.ns-9.214S.237 (214S), 237-246.ns-9Science 

ARTICLE TOOLS http://science.sciencemag.org/content/ns-9/214S/237.citation

PERMISSIONS http://www.sciencemag.org/help/reprints-and-permissions

Terms of ServiceUse of this article is subject to the

trademark of AAAS. is a registeredScienceof Science. No claim to original U.S. Government Works. The title

The Authors, some rights reserved; exclusive licensee American Association for the Advancementfor the Advancement of Science, 1200 New York Avenue NW, Washington, DC 20005. 2017 ©

(print ISSN 0036-8075; online ISSN 1095-9203) is published by the American AssociationScience

on February 20, 2019

http://science.sciencem

ag.org/D

ownloaded from