

Folia Linguistica 1967.1.1-2


    A COMPUTERIZED STATISTICAL METHODOLOGY FOR LINGUISTIC GEOGRAPHY:

    A PILOT STUDY*

    CHARLES L. HOUCK

    1. INTRODUCTION

    The need for a statistical methodology for analyzing linguistic data is, I believe, vital when these data are a function of either geographical or sociolinguistic factors. In analyzing these data one is called upon to draw inferences from a large number of linguistic phenomena from a large set of informants. Wide experience in other behavioral sciences has shown, however, that if objective inferences are to be obtained, quantification of these linguistic events is necessary. And since linguistic phenomena of this kind usually do not constitute measurement which is fully reproducible, as it is in the physical sciences, and are generally subject to error, they are best analyzed statistically. Statistics provides, first of all, empirically tested formulae for drawing accurate inferences about the differences or similarities of a given population from a sample, even though any particular set of data may be very inaccurate. And secondly, these statistical formulae provide an estimate of the degree of error involved in making these inferences. Lack of this estimate of error usually renders statements about a population from a sample unreliable.1

    Although David W. Reed and John L. Spicer proposed as early as 1952 correlation methods which provided a rigorous means by which linguistic geographers could test their data for significant relationships and differences,2 linguistic geographers in general have failed to apply these correlation methods or other statistical procedures which can test whether the differences or similarities obtained were due to chance. Using the Reed and Spicer study as a point of departure, I will propose and describe a computerized statistical methodology as it was used in a study of dialects in Johnson County, Iowa.

    * An earlier version of this paper was presented at the meeting of the Midwest Modern Language Association in Chicago on 8 May 1965. I am especially indebted to John W. Bowers, Associate Professor of Speech, University of Iowa, for his numerous editorial suggestions and help with the statistical design and analysis. I also wish to express my thanks to Robert Howren, Jr., Associate Professor of English, University of Iowa, Pavle Ivic, Faculty of Philosophy, Novi Sad University, Yugoslavia, and Roger Shuy, Assistant Professor of English, Michigan State University, for the advice, comments, and encouragement they have given me in the pursuit of this study.

    1 The statistical concepts expressed in this introduction were obtained from the introductory material in T. G. Connolly and W. Sluckin, Statistics for the social sciences (London, 1962), 1-3; George A. Ferguson, Statistical analysis in psychology and education (New York, 1959), 1-12; and Sidney Siegel, Nonparametric statistics for the behavioral sciences (New York, 1965), 1-5.

    The correlation method proposed and used in the Reed and Spicer article was the phi coefficient with the chi-square test for significance. In addition to the correlation method, Reed and Spicer also used Wilhelm Milke's method of cartographic representation of correlation coefficients with the phi coefficient as input.3

    In their study, Reed and Spicer used data obtained from the three tables compiled by Alva L. Davis and Raven I. McDavid, Jr.,4 which showed "the distribution respectively of thirty-nine items of vocabulary, ten items of pronunciation, and seven items of grammar, among ten informants, two each in five communities in northwestern Ohio. Most of these items [i.e. test questions] have several variants [i.e. multiple response items]; only the pronunciation was not organized directly to show simple presence or absence of all variants in each item. For statistical purposes, each variant of an item in the vocabulary and grammar tables was considered to be a separate item [i.e. a separate response item], thus yielding ninety-one vocabulary items and fourteen grammar items. The pronunciation table was reorganized to indicate simple presence or absence of each pronunciation characteristic, with a resultant total of forty-three pronunciation items. After this preliminary organization of the Davis-McDavid material, it was subjected to the normalized phi coefficient with the chi-square test for significance."5

    The results of their study, in spite of its obvious limitations of sample size and test completeness, were impressive, for the method was able to delineate the dialect patterns of the area concisely and convincingly. Unfortunately, the implied promise by the authors of future application of their correlation methods to linguistic atlas materials has not been fulfilled, to my knowledge, in print.6 I believe it is safe to say that, on the basis of published linguistic atlas materials, the level of statistical sophistication in the field of linguistic geography has remained low and static since the appearance of their article in 1952.7

    2 "Correlation methods of comparing idiolects in a transition area", Language, 28 (1952), 348-359.

    3 "The quantitative distribution of cultural similarities and their cartographic representation", American Anthropologist, 51 (1949), 127-152.

    4 "Northwestern Ohio: a transition area", Language, 26 (1950), 264-273.

    5 Reed and Spicer, Language, 28, 351.

    6 Reed and Spicer, Language, 28, 359.

    The purpose of this pilot study is fivefold: (1) It will apply the concept of density as an attempt to obtain statistical data which largely eliminates the chance factor. It will do this by providing the phi coefficient and the tetrachoric correlation with input which is derived, first, from a larger informant sample concentrated in a smaller geographical area, and second, from a much larger sample of lexical test questions and response items. (2) It will explore the use of fourfold correlation analyses in which the correlation coefficients are used not only terminally as in Reed and Spicer,8 but also instrumentally as input for factor analyses which attempt to determine from these intercorrelations whether the "variation represented can be accounted for adequately by a number of basic categories smaller than that with which the investigation was started".9 (3) It will provide a simple frequency count analysis which, first, provides a basis for the rejection or retention of test questions and response items in the questionnaire, and, second, provides data in an accessible form so that a statistical test for differences can be executed on the response to dialect lexical items and a given geographical area can be classified dialectally. (4) It will provide computer programs which can process a large amount of linguistic data for correlation, count, and factor analyses. (5) It will report substantive findings of the pilot study.

    2. METHOD

    The following sections will describe the methodology used in this study.

    2.1. GEOGRAPHICAL AREA. Since the orientation of this study is primarily methodological, no attempt was made to pick a county which would have important dialectal findings. Johnson County, Iowa, however, falls within the Davenport-Cedar Rapids-Dubuque triangle of Iowa which is described by Harold B. Allen as showing, in his view, strong Northern elements although the contrasts to Midland features are not as strong as they are at the major boundary.10 The resultant findings by the proposed methodology should point to a rejection or retention of Allen's findings in reference only, of course, to lexical usage.

    7 See Glenna Ruth Pickford, "American linguistic geography: a sociological appraisal", Word, 12 (1956), 211-233, for a comment on linguistic geography methodology which is still apropos. For a more recent comment, see Charles A. Ferguson, Social science research council, vol. 19, no. 1 (1965).

    8 Reed and Spicer, Language, 28, 348-359.

    9 Benjamin Fruchter, Introduction to factor analysis (Princeton, New Jersey, 1954), 1.

    10 Harold B. Allen, "The primary dialect areas of the upper midwest", in Harold B. Allen (ed.), Readings in applied English linguistics (New York, 1964), 233 and 241.

    2.2. DENSITY. Density in this study was defined with the township as the minimal geographical unit rather than the more commonly used county. In practice, however, this definition turned out to be unfeasible, and a more realistic unit was devised before analysis took place: namely, by dividing the county arbitrarily into five sections with four to six adjacent townships each, except for the two townships which contained Iowa City. In each section five to seven informants were questioned.11

    2.3. INFORMANTS. Thirty-two informants were used in the study. In general, the informants were selected if they met the following criteria: (1) that they were born and reared in Johnson County or were life-long residents of the county (i.e. had lived in the county since they were five years old or less); and (2) that they were sixty-five years old or older. Two informants who failed to meet these criteria were kept in the study because both were native Iowans, and their idiolects could be compared with those native to the county. The mean age was 72, with the oldest informant being 87. The one informant who failed to meet the age criterion was 38 years old. Six of the thirty-two informants were women. In education they ranged from people who had had only four years at school to holders of professional and post-graduate college degrees. By occupation they included housewives, farmers, a county extension officer, a carpenter, a retired lawyer and large landowner, a pharmacist, a machine shop operator-owner, a streetcar conductor and bodyshop operator-owner, and a bulk-oil dealer.

    2.4. THE LEXICAL QUESTIONNAIRE. The lexical questionnaire was compiled from the Iowa Atlas checklist and workbook. Hans Kurath's Word Geography12 and Robert Howren's Ocracoke checklist13 were also consulted for relevant lexical response items. The resultant form contained two hundred and thirty lexical test questions composed of one thousand and eighty lexical response items. The test questions covered the following categories: (1) time; (2) weather; (3) household; (4) farmstead; (5) farming; (6) farm animal terms; (7) farm animal sounds; (8) calls to farm animals; (9) landscape; (10) fishing; (11) roads; (12) food; (13) nature; (14) kinship terms (primarily parental); (15) idioms; (16) childhood terms for playthings and games; and (17) miscellaneous. In Table 3 is a sample of fifty-four key questions, their respective response items, and the frequency with which each was chosen. The questionnaires were distributed in person to help insure the high return necessary for a methodological pilot study. The informants were provided with a stamped envelope for the return of the questionnaire.

    11 The township was discarded as the basic geographical unit because it was too small a unit, especially in the relatively sparsely settled areas, to provide enough informants meeting the requirements set forth in 2.3; moreover, farmers, at least in Johnson County, many times move from township to township in quest of better farms and living conditions, or simply to town to retire, but remain in the county, and, most of the time, in the same sections of the county as devised for this study.

    12 A word geography of the eastern United States (Ann Arbor, U. of Michigan, 1949).

    13 "The speech of Ocracoke, North Carolina", American Speech, 37 (1962), 163-175. A copy of the questionnaire was also made available to me by Robert Howren.

    2.5. THE PHI COEFFICIENT WITH THE CHI-SQUARE TEST. The phi coefficient, or fourfold point correlation, measures, like other tests I will describe later, what statistical relationships exist among informants on the criterion of lexical similarity. It assumes that a given lexical response item is either present or absent in a given idiolect. Given a phi coefficient, one can determine by referring to appropriate theoretical distributions the likelihood of the apparent relationships having occurred by chance. Such a test is the chi-square test for significance. By referring to a chi-square table for the critical value required for significance at an accepted significance level for the appropriate degrees of freedom, one can determine whether the values for the differences between the observed and the expected frequencies are significant and cannot reasonably be explained by sampling fluctuation or chance.14 The phi coefficient is used here to provide input for Guttman's Radex Analysis15 and the cluster analysis.
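The computation described in 2.5 can be illustrated with a minimal sketch. It assumes the standard textbook formulas for a 2 x 2 presence/absence table, phi = (ad - bc) / sqrt((a+b)(c+d)(a+c)(b+d)) and chi-square = N * phi^2 with one degree of freedom; it is not the program used in the study. The function names and the two informants are invented for illustration.

```python
import math


def fourfold_cells(first, second, all_items):
    """Return the cells (a, b, c, d) of the presence/absence table."""
    a = len(first & second)               # item present in both idiolects
    b = len(first - second)               # present in the first only
    c = len(second - first)               # present in the second only
    d = len(all_items - first - second)   # absent from both
    return a, b, c, d


def phi_coefficient(a, b, c, d):
    """Fourfold point correlation for a 2 x 2 table."""
    denom = math.sqrt((a + b) * (c + d) * (a + c) * (b + d))
    if denom == 0:
        raise ValueError("a marginal total is zero; phi is undefined")
    return (a * d - b * c) / denom


def chi_square(phi, n):
    """Chi-square with one degree of freedom: n times phi squared."""
    return n * phi ** 2


CRITICAL_VALUE = 6.64  # df = 1, .01 level, the value cited in this paper

# Hypothetical data: 100 response items and two invented informants.
all_items = set(range(100))
informant_x = set(range(0, 50))
informant_y = set(range(10, 60))

a, b, c, d = fourfold_cells(informant_x, informant_y, all_items)
phi = phi_coefficient(a, b, c, d)
chi2 = chi_square(phi, len(all_items))
significant = chi2 > CRITICAL_VALUE
```

In this invented case the two informants share 40 items, each has 10 the other lacks, and 40 items are absent from both, so the table is symmetric and phi depends only on the balance between agreement and disagreement.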

    2.6. THE TETRACHORIC CORRELATION. The tetrachoric correlation is also a fourfold correlation which treats the dichotomy, presence and absence, as though it is on a continuum; i.e. sometimes present, sometimes absent, depending, e.g. on the speech situation.16 The tetrachoric is used here primarily to provide input for the multiple factor analysis.
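The paper does not say how its program computed the tetrachoric correlation. As a rough illustration only, the sketch below uses the cosine-pi approximation, r = cos(pi / (1 + sqrt(ad / bc))), which is a common hand-computation shortcut and not necessarily the method used in the study. The function name and the cell counts are invented.

```python
import math


def tetrachoric_cos_pi(a, b, c, d):
    """Cosine-pi approximation to the tetrachoric correlation.

    a and d are the agreeing cells (both present, both absent);
    b and c are the disagreeing cells.
    """
    if b * c == 0:
        if a * d == 0:
            raise ValueError("both cell products are zero; undefined")
        return 1.0  # no disagreement at all
    ratio = (a * d) / (b * c)
    return math.cos(math.pi / (1.0 + math.sqrt(ratio)))


# Hypothetical fourfold table: 40 shared, 10 + 10 unshared, 40 absent from both.
r_tet = tetrachoric_cos_pi(40, 10, 10, 40)

# A table with no association gives a value of (numerically) zero.
r_none = tetrachoric_cos_pi(25, 25, 25, 25)
```

For the same invented table the phi coefficient is .60, while this approximation gives about .81. The tetrachoric values in Tables 1 and 2 are likewise higher than the corresponding phi values.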

    2.7. FACTOR ANALYSIS. Three kinds of factor analysis are employed in this study: (1) Guttman's Radex approach to factor analysis;17 (2) a multiple factor analysis computer program assembled by Professor Harold Bechtoldt, Department of Psychology, University of Iowa;18 and (3) Robert C. Tryon's cluster analysis.19 All three of these factor analyses provide "a mathematical model which can be used to describe certain areas [of linguistic behavior such as the use of lexical items]. A series ... of measures [e.g. responses to lexical items in a questionnaire] are intercorrelated to determine the number of dimensions the test space occupies, and to identify these dimensions in terms [of linguistic or socio-geographical categories]. The interpretations are done by observing which tests fall on a given dimension and inferring what these tests have in common [e.g. geography, occupation, age, sex, or education] that is absent from tests not falling on the dimension. Tests correlate to the extent that they measure common traits ... [Responses to a checklist questionnaire or to a fieldworker can be studied] to detect possible common sources of variation or variance, [or factors; and factors represent] the fundamental underlying sources of variation operating in a given set of scores or other data observed under a specified set of conditions."20

    2.8. FREQUENCY COUNT OF ITEMS ACROSS INFORMANTS. The purpose of the frequency count is to provide a tabulation of response items across informants, so that the total number of responses to particular response items in the questionnaire can be readily determined and analyzed. This is important for editorial purposes, for the count can determine meaningfulness of response items in the questionnaire for a particular geographical area. The frequency count also provides input for the t-test.21

    2.9. THE t-TEST FOR THE DIFFERENCE BETWEEN TWO MEANS. The t-test determines whether an apparent difference between two means can easily be accounted for by chance.22 In this study, it will be used to determine whether Johnson County natives employ Northern lexical items significantly more often than they use Midland lexical items.

    14 J. P. Guilford, Fundamental statistics in psychology and education, 3d ed. (New York, 1965), 311-316. See also George A. Ferguson, Statistical analysis in psychology and education (New York, 1959), 158-77.

    15 L. Guttman, "A new approach to factor analysis; the radex", in P. Lazarsfeld (ed.), Mathematical thinking in the social sciences (Glencoe, Illinois, 1954), 258-348.

    16 Guilford, op. cit., 305-311.

    17 Guttman, loc. cit., 258-348; also "A generalized simplex for factor analysis", Psychometrika, 20 (1955), 173-191.

    18 A program write-up (and a description of the type of factor analysis used) is available upon request from the State University of Iowa Computer Center, Iowa City, Iowa 52240.

    19 R. C. Tryon, Cluster analysis: correlation profile and orthometric (factor) analysis for the isolation of unities in mind and personality (Ann Arbor, 1939), especially 41-48.

    20 Fruchter, op. cit., 2-4 (see fn. 9).

    21 The frequency count analysis has been expanded to three types. The primary addition is the tabulation and percentage of response items across informant profile, which includes sex, age, education, and occupation. The program identifies the profile, totals the number of informants who belong to each profile, and indicates how each profile responds in toto to each lexical item in the questionnaire. This output can then be fed into a Type I analysis of variance which tests whether each profile differs significantly in relation to each lexical response item.

    22 George A. Ferguson, op. cit., 126-128 (see fn. 1).
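The t-test of 2.9 can be sketched as follows. The paper does not state which form of the test was run, so this sketch assumes a paired comparison in which each informant contributes one count of Northern items and one count of Midland items. The function name and the four informants' counts are invented for illustration.

```python
import math
from statistics import mean, stdev


def paired_t(first, second):
    """Return (t, degrees of freedom) for paired observations."""
    if len(first) != len(second) or len(first) < 2:
        raise ValueError("need two equally long samples of at least 2 values")
    diffs = [x - y for x, y in zip(first, second)]
    spread = stdev(diffs)  # sample standard deviation of the differences
    if spread == 0:
        raise ValueError("all differences are identical; t is undefined")
    t = mean(diffs) / (spread / math.sqrt(len(diffs)))
    return t, len(diffs) - 1


# Hypothetical counts of Northern and Midland items for four informants.
northern = [12, 15, 11, 14]
midland = [11, 15, 12, 13]

t_value, df = paired_t(northern, midland)
```

The resulting t is then compared with the tabled critical value for the given degrees of freedom, in the same way as the chi-square value in 2.5.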


    2.10. THE COMPUTER. The study was designed to make full use of the computer for two reasons: (1) accuracy, for correlation and factor analysis studies entail a great amount of intricate mathematical computation and counting which, by their very nature, are greatly error prone when done by hand; and (2) efficiency, for, since there is a great amount of mathematical computation and counting, the computer saves time. It is, of course, in this area that a computer provides the linguistic geographer with his greatest boon, for it allows him to increase his informant sample for more reliable results. In this study, for example, the estimated time for manual computation of a 32 × 32 phi coefficient and tetrachoric matrix was more than one thousand hours. The estimated time for programming, keypunching, and eliminating program errors ('de-bugging') is around one hundred hours. Although the saving of time here is large, the real saving comes when data from a new study are to be analyzed, for all that remains is the preparation of the data, a minor part of the process.

    In this study, then, a computer program was used for each type of analysis except for the t-test and the cluster analysis. A computer program for the cluster analysis is now operational.23

    The following sequence was used for analysis on the computer: (1) The data were readied on computer data cards. (2) The frequency count program was then run. This program not only provided the necessary input for the t-test, but also provided automatically another input deck for the phi and tetrachoric program in which all the response items that none of the informants responded to were deleted. This was done so that the 'D' cell of the fourfold contingency table for the two correlation computations would not be inflated, thus providing greater correlation discrimination. (3) The phi and tetrachoric program was then run. This program also automatically provided an input deck in the form of a symmetrical tetrachoric intercorrelation matrix for the multiple factor analysis program. (4) The multiple factor analysis was run in two stages: (a) exploratory; and (b) confirmatory.

    23 The use of the computer was first made possible through the interest of Garry A. Flint, a computer programmer at the Indiana University Computing Research Center in the summer of 1964. He was responsible for the phi and tetrachoric program used in this pilot study. I am also indebted to him for his help in learning the basics of computer programming. Since the completion of this pilot study I have expanded the analyses and have increased the data processing capacity of the various computer programs through the generous help of the University of Iowa Computer Center. This expanded methodology has been applied to the Iowa Atlas checklist materials, and the results will appear in a monograph by Robert Howren, Jr. and myself, to be published by the Iowa State University Press, Ames, Iowa. The complete methodology will also be described in my doctoral dissertation.
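Step (2) of the processing sequence, the frequency count followed by deletion of response items that no informant chose, can be sketched as below. This is an illustration under stated assumptions, not the original card-based program. The function names, the informant labels, and the responses are invented; only the lexical items themselves are taken from Table 3.

```python
import math
from collections import Counter


def frequency_count(responses, items):
    """Count how many informants chose each response item."""
    counts = Counter()
    for chosen in responses.values():
        counts.update(chosen)
    return {item: counts[item] for item in items}


def drop_unanswered(items, counts):
    """Keep only the response items chosen by at least one informant."""
    return [item for item in items if counts[item] > 0]


def phi(first, second, items):
    """Phi coefficient for two informants over a given item list."""
    a = sum(1 for i in items if i in first and i in second)
    b = sum(1 for i in items if i in first and i not in second)
    c = sum(1 for i in items if i not in first and i in second)
    d = len(items) - a - b - c  # the 'D' cell: absent from both
    denom = math.sqrt((a + b) * (c + d) * (a + c) * (b + d))
    if denom == 0:
        raise ValueError("a marginal total is zero; phi is undefined")
    return (a * d - b * c) / denom


items = ["quarter of", "quarter till", "haystack", "hayrick",
         "harrow", "drag", "tumble", "doodle"]

# Hypothetical informants and responses.
responses = {
    "informant_A": {"quarter of", "harrow", "haystack"},
    "informant_B": {"quarter of", "harrow", "hayrick"},
    "informant_C": {"quarter till"},
}

counts = frequency_count(responses, items)
kept = drop_unanswered(items, counts)

phi_all = phi(responses["informant_A"], responses["informant_B"], items)
phi_kept = phi(responses["informant_A"], responses["informant_B"], kept)
```

With the three unanswered items left in, the 'D' cell holds four items and phi for informants A and B is about .47; once they are removed the 'D' cell holds one item and phi falls to about .17. This is the inflation that the deletion step is meant to prevent.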


    3. RESULTS AND DISCUSSION

    The overall results were encouraging, for the degree of density provided highly reliable data input for the phi, tetrachoric, and count analyses. The phi and tetrachoric intercorrelation matrices consistently showed middle to relatively high but homogeneous intercorrelations, indicating perhaps dialectal homogeneity, while at the same time revealing idiolectal discrimination. The range for the phi coefficient intercorrelations was .05 to .60; the range for the tetrachoric intercorrelations was .09 to .83. All the intercorrelations except four were significant (if χ² > 6.64, df = 1, p < .01); i.e. if chi-square is greater than 6.64 at one degree of freedom, the probability is that fewer than one intercorrelation out of 100 would be due to chance. A randomly selected sample of phi (with their χ² values) and tetrachoric intercorrelations is shown in Table 1.

    TABLE 1

    A phi coefficient and tetrachoric intercorrelation matrix of randomly selected informants from the five county-sections of Johnson County

    Informants  2      7                     18                    23                    28
    2           1.00   .50  .73*  176.59**   .49  .72   168.80     .47  .70   118.42     .47  .70   152.48
    7                  1.00                  .47  .71   161.72     .45  .68   143.04     .46  .69   146.52
    18                                       1.00                  .39  .60   104.63     .42  .64   123.04
    23                                                             1.00                  .37  .58    94.43
    28                                                                                   1.00

    Each off-diagonal cell gives, in order, the phi coefficient, the tetrachoric intercorrelation, and the chi-square value.

    * Tetrachoric intercorrelations.
    ** Chi-square values.

    The four non-significant intercorrelations were caused by one informant who also showed marked deviation from the rest of the informants, even though he correlated significantly with them in some respects. No explanation can be offered for this deviation, for there is nothing in his biographical data which would indicate even a post hoc explanation for the deviation. On the criteria set up for the selection of informants in the Linguistic Atlas of the United States and Canada, he would have been an ideal informant: he was a native and life-long resident of Johnson County, Iowa; he was 71 years old; he was a farmer who owned his own farm; and he had only four years of education. I believe this case of deviance points up rather concretely the need to exercise care in assuming that an informant who meets the informant selection criteria of the Linguistic Atlas of the United States and Canada necessarily represents the norm of his geographical area, and to note that he may in fact contribute spurious data to a survey.

    TABLE 2

    The Guttman 'quasi-simplex covariance structure'

    Informants  2      7                     18                    28                    23
    2           1.00   .50  .73*  176.59**   .49  .72   168.80     .47  .70   152.48     .47  .70   118.42
    7                  1.00                  .47  .71   161.72     .46  .69   146.52     .45  .68   143.04
    18                                       1.00                  .42  .64   123.04     .39  .60   104.63
    28                                                             1.00                  .37  .58    94.43
    23                                                                                   1.00

    Each off-diagonal cell gives, in order, the phi coefficient, the tetrachoric intercorrelation, and the chi-square value.

    * Tetrachoric intercorrelations.
    ** Chi-square values.

    The results gained by using factor analysis were just as encouraging. The patterning exhibited by the intercorrelation matrix in Table 2 demonstrates the Guttman Quasi-simplex Covariance Structure.24 That is, the diagonal of the matrix shows a non-equidistant ranking from high to low. The same phenomenon also occurs in each column. The theoretical concept underlying the ranking in the matrix is that of 'complexity'. 'Complexity' according to Guttman25 is that factor of greater inter-individual difference which is hypothesized in this study as language behavior in relation to some geographical area. Language behavior as it varies among idiolects can be conceived here in terms of uniqueness and ranked accordingly: from a more simple, i.e. homogeneous, to a more complex, i.e. unique, dimension. This is to say that each idiolect is ranked in terms of the number of unique items it contains in relation to the other idiolects: the lower the number of unique terms, the higher the intercorrelation.26 Therefore, in Table 2, idiolect 2 has fewer unique items in relation to idiolect 7 than to other idiolects in the sample; thus, it correlates more strongly with 7 than the other idiolects in this matrix. This interpretation is reinforced by the independent tetrachoric correlation computation whose intercorrelations are marked with an asterisk in Table 2.

    24 Guttman, loc. cit. (1954), 258-348 (see fn. 15).

    25 Guttman, loc. cit. (1954), 258-348; Psychometrika, 20 (1955), 173-191; "Empirical verification of the radex structure of mental abilities and personality traits", Educ. Psychol. Measm., 17 (1957), 291-407; "What lies ahead for factor analysis?", Educ. Psychol. Measm., 18 (1958), 497-515.
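The high-to-low ranking described for Table 2 can be checked mechanically. The sketch below tests only that ranking, namely that the phi values never rise from left to right within a row or from top to bottom within a column of the upper triangle; it does not carry out Guttman's radex analysis. The phi values are those printed in Table 2, and the function names are invented for illustration.

```python
# Phi coefficients for the five informants, as printed in Table 2.
PHI = {
    (2, 7): .50, (2, 18): .49, (2, 23): .47, (2, 28): .47,
    (7, 18): .47, (7, 23): .45, (7, 28): .46,
    (18, 23): .39, (18, 28): .42,
    (23, 28): .37,
}


def pair_value(x, y):
    """Look up the phi coefficient for a pair in either order."""
    return PHI[(x, y)] if (x, y) in PHI else PHI[(y, x)]


def ranks_high_to_low(order):
    """True if no row or column of the upper triangle ever rises."""
    n = len(order)
    for i in range(n):  # rows, read left to right
        row = [pair_value(order[i], order[j]) for j in range(i + 1, n)]
        if any(x < y for x, y in zip(row, row[1:])):
            return False
    for j in range(n):  # columns, read top to bottom
        col = [pair_value(order[i], order[j]) for i in range(j)]
        if any(x < y for x, y in zip(col, col[1:])):
            return False
    return True


table_1_order = [2, 7, 18, 23, 28]
table_2_order = [2, 7, 18, 28, 23]

table_1_ranked = ranks_high_to_low(table_1_order)
table_2_ranked = ranks_high_to_low(table_2_order)
```

Under the informant order of Table 1 the check fails, because informant 7 correlates .45 with 23 but .46 with 28. Exchanging informants 23 and 28, as in Table 2, removes that reversal.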

    However, the hypothesis that this uniqueness demonstrated by the Guttman Quasi-simplex Covariance Structure among the Johnson County idiolects stems from geographical location is to be rejected, because neither the multiple factor analysis, the cluster analysis, nor the t-test supported such a hypothesis.

    The multiple factor analysis loaded all thirty-two informants into one factor and by-passed the estimate-of-factor-loadings step because it could not meet the significance criterion of two factors. This one factor was confirmed when the estimate-of-factor-loadings step was programmed to run on those factor-loadings which contained the largest amount of variance in common. This step rejected all loadings, thus confirming that none of the informants' intercorrelations could be used as a criterion for establishing more than one factor. These results do not indicate, then, that the informants' lexical behavior in Johnson County is anything but homogeneous. No significant differences occurred due to geography, occupation, education, age, or sex.

    The cluster analysis also justified retaining an assumption of homogeneity, for the two highest intercorrelations in the matrix failed to reach the ratio (2.00) of the average intercorrelations of the variables in a cluster to their average correlation with the variables not included in the cluster.
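The ratio used as the clustering criterion can be sketched as follows: the average intercorrelation among the members of a candidate cluster, divided by their average correlation with the variables outside it. The sketch applies this to the five informants of Table 2 only, using the printed phi values, so it illustrates the arithmetic rather than reproducing the full 32-informant analysis. The function names are invented.

```python
from itertools import combinations
from statistics import mean

# Phi coefficients for the five informants, as printed in Table 2.
PHI = {
    (2, 7): .50, (2, 18): .49, (2, 23): .47, (2, 28): .47,
    (7, 18): .47, (7, 23): .45, (7, 28): .46,
    (18, 23): .39, (18, 28): .42,
    (23, 28): .37,
}


def pair_value(x, y):
    """Look up the phi coefficient for a pair in either order."""
    return PHI[(x, y)] if (x, y) in PHI else PHI[(y, x)]


def cluster_ratio(cluster, others):
    """Average correlation inside the cluster over average outside it."""
    inside = [pair_value(x, y) for x, y in combinations(cluster, 2)]
    outside = [pair_value(x, y) for x in cluster for y in others]
    outside_mean = mean(outside)
    if outside_mean == 0:
        raise ValueError("average outside correlation is zero")
    return mean(inside) / outside_mean


# Candidate cluster: the most highly correlated pair in Table 2.
ratio = cluster_ratio([2, 7], [18, 23, 28])
meets_criterion = ratio >= 2.00
```

For informants 2 and 7 the ratio is about 1.07, well short of the 2.00 criterion, which is consistent with the homogeneity reported above.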

    In the t-test, no significant difference (t = .14, df = 32, p > .01) was found between the incidence-means of Northern and Midland responses. This again supported the assumption that Johnson County is lexically a homogeneous dialect area.

    The most important revelation of the frequency-count analysis was the large number of response-items to which no informant responded. There were three hundred and forty-two of these. All informants responded to only ten response-items. In terms of the above correlational analyses, the informants were correlated over 738 response-items rather than the 1,080 items of the original list. This type of information is important for editorial purposes. A questionnaire of 1,080 lexical response items presents a formidable task for many informants; thus, a frequency count analysis which indicates that three hundred and forty-two of these one thousand and eighty response items were responded to by none of the informants means that this questionnaire contained considerable excess baggage. Since studies in other social sciences show that short questionnaires obtained a better return percentage than long ones, it seems almost mandatory that those three hundred and forty-two response items be deleted in this case.

    26 This interpretation was offered to me by Bishwa Nath Mukherjee, formerly on the Psychology faculty at Indiana University and now at Jakkanpur, Patna-1, Behar State, India.

    4. SUBSTANTIVE RESULTS

    As clearly indicated by the representative sample of key lexical test questions and their respective response-items in Table 3,27 the frequency distribution of the response-items is predominately leptokurtic; i.e., one response-item was generally chosen overwhelmingly more frequently than the other response-items belonging to the same set of response-items. If this distribution were graphically represented, the curve marking the central location would be highly peaked. Test questions 4, 9, 10, 11, 23, 33, 49, and 53 are obvious examples of leptokurtic distribution. These test questions also show that definite dialect mixture occurs between different sets of response-items, since in each instance the response-item picked within a particular set is definitely either Northern or Midland. In test questions 1, 2, 17, 25, 30, 34, 42, and 43, the response pattern shows dialect mixture within a set of response-items, since the modal distribution is not so extreme between Northern and Midland response-items and is, in some instances, almost bi-modal, as in 14 and 2L. In this sample, only test questions 47 and 51 show true bi-modal distribution. These results show, then, that, while dialectal mixture occurs, there is no central tendency for Northern or Midland response-items to occur overall more frequently, thus indicating once again dialectal homogeneity.

    27 The sample of fifty-four key lexical test questions and their respective response items in Table 3 was chosen on the basis of the findings in H. Kurath's A word geography of the eastern United States, Roger W. Shuy's monograph, The northern-midland dialect boundary in Illinois (Publication of the American Dialect Society, no. 38) (U. of Alabama, 1962), and the previously cited Davis and McDavid article. These lexical response items consistently showed dialectal variation as a function of geography. Words marked N, M, SM, and S are Northern, Midland, South Midland, and Southern respectively. This dialectal classification is not, however, absolute, for the drawing of isoglosses tends to be more of an art than a science, and dialectal overlap is the rule rather than the exception; but there is, for the most part, general agreement that each classified response item in Table 3 represents the dialect area indicated. The unclassified response items are either nondiscriminate or more restricted in relation to dialect areas. The frequency with which each response item was chosen was compiled by the frequency count program.
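The distinction drawn above between a highly peaked distribution, a mixed one, and a true bi-modal one can be illustrated with a minimal Python sketch. The frequencies are copied from test questions 30, 47, and 53 of Table 3. The function names, the use of the modal share as a stand-in for peakedness, and the 0.75 cutoff are assumptions made for illustration only; the paper does not state a numerical criterion, and the sketch does not compute a kurtosis statistic.

```python
from typing import Dict, List

# Frequencies copied from Table 3 (test questions 30, 47, and 53).
TABLE3_SAMPLE: Dict[int, Dict[str, int]] = {
    30: {"stone wall": 20, "rock fence": 7, "rock wall": 5},
    47: {"darning needle": 4, "devil's darning needle": 7, "sewing needle": 0,
         "mosquito hawk": 0, "snake feeder": 11, "dragonfly": 11, "ear-sewer": 0},
    53: {"serenade": 1, "chivaree": 31, "belling": 1, "dish-panning": 0,
         "skimmelton": 0, "callathump": 0},
}


def modal_responses(counts: Dict[str, int]) -> List[str]:
    """Return every response-item that shares the highest frequency."""
    top = max(counts.values())
    return [item for item, n in counts.items() if n == top]


def modal_share(counts: Dict[str, int]) -> float:
    """Share of all recorded responses taken by the highest frequency."""
    total = sum(counts.values())
    return max(counts.values()) / total if total else 0.0


def describe(counts: Dict[str, int], peaked_at: float = 0.75) -> str:
    """Label a distribution; `peaked_at` is an assumed, arbitrary cutoff."""
    n_modes = len(modal_responses(counts))
    if n_modes == 2:
        return "bi-modal"
    if n_modes > 2:
        return "multi-modal"
    return "peaked" if modal_share(counts) >= peaked_at else "mixed"


if __name__ == "__main__":
    for number, counts in TABLE3_SAMPLE.items():
        print(number, modal_responses(counts),
              round(modal_share(counts), 3), describe(counts))
```

Under these assumptions, question 53 (chivaree) comes out as peaked, question 30 as mixed, and question 47 as bi-modal, which agrees with the way the text characterizes them.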

    TABLE 3

    Representative sample of key lexical test questions and their respective response-items

    * Modal response

    1. IT IS FIVE: quarter of (N) 20*; quarter to (N) 5; quarter till (M) 11
    2. sunrise 16*; sun-up 15
    3. CHANNEL FOR RAINWATER ON EDGE OF ROOF: eavetroughs (N) 13*; eavestrough (N) 8; gutters (M,S) 11; spouts (M) 4
    4. BUILDING FOR CORN: corn crib (N) 31*; corn barn 0; corn house (M) 0; crib (N) 4
    5. LARGE OBLONG STACK OF HAY: hayrick (M); haymow; Dutch cap; barrack (N); haystack [frequencies, unaligned: 1 2000 30*]
    6. SMALL STACK FOR DRYING HAY IN FIELD: haycock (N) 19*; tumble (N) 0; doodle (M) 0; heap (N) 0; cock 4; coil 1; pile 5
    7. POLE TO STEER AND PULL WAGON: neap (N) 0; tongue 31*; pole (N) 1; spear 0
    8. TWIN POLES OF BUGGY: shafts (M); shavs; thills (N); drafts [frequencies, unaligned: 28*5100]
    9. PIVOTED CROSSBAR FOR ONE HORSE: whiffletree (N) 0; whippletree (N) 0; swingletree 1; singletree (M) 31*
    10. PIVOTED CROSSBAR FOR TWO HORSES: evener (N) 0; doubletree 31*; spreader 0; double singletree 1
    11. WOOD IN WAGON: hauling (M) 32*; drawing (N) 0; carting (N) 0; teaming 0

    12. IMPLEMENT FOR BREAKING CLODS AFTER PLOWING: drag (N) 0; harrow 32*
    13. SETTING HEN: cluck (M); cluck hen; setting hen; hatching hen; brooder
    14. HORSE ON THE LEFT: horse (N)

    [...] poke (M) 1; sack 5; bag [illegible]
    29. HEMP OR BURLAP CONTAINER: burlap sack 8; burlap bag 3; gunny sack (M) 25*; potato sack 0; grain sack 4
    30. WALL OF LOOSE STONE: stone wall (N) 20*; rock fence (M) 7; rock wall (S) 5
    31. SMALL WIND INSTRUMENT PLAYED WITH THE MOUTH: harmonica 11; mouth organ (N) 19*; french harp (SM) 1; breath harp 0; mouth harp 6; harp 0; juice harp 0; jew's harp 1
    32. VESSEL TO CARRY COAL: coal hod (N) 4; scuttle (N) 5; coal pail 2; coal bucket (M) 23*
    33. PETROLEUM PRODUCT BURNED IN LAMPS: coal oil (M) 0; kerosene 31*; lamp [illegible]
    34. A TIED, FILLED BEDCOVER: tied quilt 0; comforter (N) 19*; comfort (M) 11; comfortable (N) 1; [illegible] 0
    35. A SMALL FRESH BODY OF RUNNING WATER: creek 28*; stream 4; prong 0; fork 0; branch (M) 2; [illegible] (N) 3; rindet 0; riverlet 0; glitter 0
    36. BREAD MADE OF CORN MEAL IN LARGE CAKES: corn bread 29*; johnny cake (N) 2; cornpone (M) 1; [illegible] 0
    37. SMALL RING-SHAPED CAKE MADE WITH CAKE DOUGH: doughnut 30*; fried-cake (N) 2; cruller (M) 1; fat-cake (M) 0
    38. SIDE MEAT OF HOGS, SALTED, NOT SMOKED: side pork (N) 8; side meat (M) 19*; sowbelly 3; fatback 0; [illegible] 0; streak-o-lean 0
    39. THICK, SOUR MILK: curled milk 17*; bonny-clabber (N) 1; lobbered milk 0; thick milk (M) 1; clabber (M) 5; loppered milk 1; clabbered milk (M) 6; clabber milk 4
    40. A LOOSE, WHITE, LUMPY CHEESE: pot cheese (N) 1; Dutch cheese (N) 2; smear cheese 1; [illegible] (SM) 0; smearcase (M) 12; clabber cheese 1; sourmilk cheese (N) 0; curd cheese 0; cottage cheese 25*
    41. FOOD EATEN BETWEEN MEALS: bite (N) 4; snack (M,S) 15*; piece (M) 5; lunch 8

    42. CENTER OF A CHERRY: seed (M) 16; pit (N) 19*; stone 1; kernel 0; heart 0
    43. CENTER OF A PEACH: stone (N) 14; seed (M) 16*; pit (N) 5
    44. GREEN OUTER COVER OF WALNUT: hull (M) 25*; shuck (N) 2; [illegible] 2; shell 3
    45. - BEANS: shell (N) 19*; hull (M) 9; shuck 0
    46. KIND OF WORM: earthworm (N); angleworm (N); [illegible] worm; mudworm; redworm (M); fishworm (M); fishing worm (M); eelworm; rainworm; eaceworm [frequencies, unaligned: 6 75111 26*10102]
    47. LARGE WINGED INSECT SEEN AROUND WATER: darning needle (N) 4; devil's darning needle (N) 7; sewing needle 0; mosquito hawk 0; snake feeder (M) 11*; dragonfly 11*; ear-sewer 0
    48. INSECT THAT GLOWS AT NIGHT: firefly (N) 5; lightning bug 25*; firebug (M) 4; candlefly 0
    49. maple 5; [illegible] (M) 0; [illegible] (N) 22*; maple (N) 2
    50. PLACE WHERE SAP IS GATHERED: maple grove 13*; [illegible] (N) 2; [illegible] (N) 0; orchard 0; maple grove 10; camp (M) 4; [illegible] (M) 2
    51. HE IS SICK: at his stomach (M) 16*; to his stomach (N) 16*; in his stomach (M) 1; on his stomach (M,S) 0
    52. THE GAME OF: quoits (N) 1; quates 0; horseshoes 30*
    53. A NOISY BURLESQUE SERENADE AFTER A WEDDING: serenade 1; chivaree (N) 31*; belling (M) 1; dish-panning 0; skimmelton (N) 0; callathump 0
    54. BABY (ON ALL FOURS) ACROSS THE FLOOR: creeps (N) 23*; crawls (M) 10
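Frequencies of the kind listed in Table 3 can be tallied from raw questionnaire returns along the following lines. This is only a sketch of a frequency count, not the author's frequency count program: the function names and the five-informant data set are hypothetical, and the sketch assumes that an informant may offer more than one response-item for a test question and that each item is counted once per informant.

```python
from collections import Counter
from typing import Iterable, List, Sequence


def frequency_count(responses: Iterable[Iterable[str]]) -> Counter:
    """Tally response-items, counting each item at most once per informant."""
    tally: Counter = Counter()
    for informant_items in responses:
        tally.update(set(informant_items))
    return tally


def format_item(prompt: str, choices: Sequence[str], tally: Counter) -> List[str]:
    """Lay out one test question, starring the modal response(s)."""
    top = max((tally[c] for c in choices), default=0)
    lines = [prompt]
    for choice in choices:
        star = "*" if top > 0 and tally[choice] == top else ""
        lines.append(f"  {choice} {tally[choice]}{star}")
    return lines


if __name__ == "__main__":
    # Hypothetical returns from five informants for one test question.
    raw = [
        ["stone wall"],
        ["stone wall", "rock fence"],
        ["rock wall"],
        ["stone wall"],
        ["rock fence"],
    ]
    tally = frequency_count(raw)
    for line in format_item("WALL OF LOOSE STONE:",
                            ["stone wall", "rock fence", "rock wall"], tally):
        print(line)
```

Listing the choices explicitly, rather than printing whatever was tallied, keeps unchosen response-items in the output with a frequency of 0, as in the table.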

    5. SUMMARY AND CONCLUSION

    In summary, then, the following conclusions may be drawn from the study: (1) The degree of density applied in this study should be seriously considered as a critical part of future dialect studies. Although the sampling techniques used in the study were relatively unsophisticated and definitely need to be improved, the reliability of the data was apparent in all of the significance tests. The degree of density in relation to questionnaire size will have to be revised in the light of the above findings. (2) Although the phi coefficient as well as the tetrachoric correlation describe the relatedness of linguistic phenomena under analysis reliably, they function more crucially as input for either the multiple factor analysis or cluster analysis, for the factors they describe must meet some significance criterion if they are to be considered valid. (3) Although the Guttman quasi-simplex covariance structure can show apparent 'complexity' among idiolects, it is unreliable as an indicator of the statistically significant factors which underlie the intercorrelations. (4) The count analysis proved to be editorially informative as well as an instrument to provide an accurate frequency count and input for the t-test. (5) The results of the proposed statistical methodology overwhelmingly did not support previous assumptions about lexical usage in Johnson County, Iowa, and demonstrated the need for an analytic methodology which can test for significant differences. It should also be pointed out that the proposed methodology can be used to analyze phonological, morphological, and syntactical dialect materials. (6) The computer can make the necessary degree of density feasible and be an extremely time-saving and powerful tool in counting and computation for the linguistic geographer.
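The phi coefficient named in conclusion (2) can be sketched as follows, using the standard textbook definition for a 2 x 2 table, phi = (ad - bc) / sqrt((a+b)(c+d)(a+c)(b+d)). The formula is not spelled out in this part of the paper, so the definition is supplied here as an assumption, and the two informants' response vectors are hypothetical (1 = the Northern response-item was chosen, 0 = it was not).

```python
import math
from typing import Sequence, Tuple


def two_by_two(x: Sequence[int], y: Sequence[int]) -> Tuple[int, int, int, int]:
    """Cross-classify two binary vectors into the cells a, b, c, d."""
    if len(x) != len(y):
        raise ValueError("both informants must answer the same test questions")
    a = sum(1 for xi, yi in zip(x, y) if xi == 1 and yi == 1)
    b = sum(1 for xi, yi in zip(x, y) if xi == 1 and yi == 0)
    c = sum(1 for xi, yi in zip(x, y) if xi == 0 and yi == 1)
    d = sum(1 for xi, yi in zip(x, y) if xi == 0 and yi == 0)
    return a, b, c, d


def phi_coefficient(x: Sequence[int], y: Sequence[int]) -> float:
    """Phi coefficient of association between two binary vectors."""
    a, b, c, d = two_by_two(x, y)
    denom = math.sqrt((a + b) * (c + d) * (a + c) * (b + d))
    if denom == 0:
        raise ValueError("phi is undefined when a row or column total is zero")
    return (a * d - b * c) / denom


if __name__ == "__main__":
    # Hypothetical informants, eight test questions each.
    informant_a = [1, 1, 1, 0, 0, 1, 0, 1]
    informant_b = [1, 1, 0, 0, 0, 1, 1, 1]
    print(round(phi_coefficient(informant_a, informant_b), 4))
```

Computing this coefficient for every pair of informants would give the kind of intercorrelation matrix that the text describes as input for factor analysis or cluster analysis.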

    30 vi 1966
    University of Iowa
    Iowa City, Iowa 52240
    U.S.A.