

    Delineating Europe's Cultural Regions: Population Structure and Surname Clustering

    JAMES CHESHIRE,1* PABLO MATEOS,1 AND PAUL A. LONGLEY1

    Abstract Surnames (family names) show distinctive geographical patterning and in many disciplines remain an underutilized source of information about population origins, migration and identity. This paper investigates the geographical structure of surnames, using a unique individual-level database assembled from registers and telephone directories from 16 European countries. We develop a novel combination of methods for exhaustively analyzing this multinational data set, based upon the Lasker Distance, consensus clustering and multidimensional scaling. Our analysis is both data rich and computationally intensive, entailing as it does the aggregation, clustering and mapping of 8 million surnames collected from 152 million individuals. The resulting regionalization has applications in developing our understanding of the social and cultural complexion of Europe, and offers potential insights into the long and short-term dynamics of migration and residential mobility. The research also contributes a range of methodological insights for future studies concerning spatial clustering of surnames and population data more widely. In short, this paper further demonstrates the value of surnames in multinational population studies and also the increasing sophistication of techniques available to analyze them.

    Family names, also known as surnames, are widely understood to provide good indicators of the geographic, ethnic, cultural and genetic structure of human populations. This is mainly because surnames were "fixed" in most populations several centuries ago, and their transmission over generations (mostly patrilinearly) typically conforms to socio-economic, religious and cultural characteristics (Smith 2002) as well as geographical constraints (Manni et al. 2004). The outcome is a variety of spatial patterns that manifest processes of biological inheritance (Lasker 1985) and intergenerational inheritance of culture (Cavalli-Sforza and Feldman 1981). The vast literature in this area is principally concerned with analyzing population structure in surname frequency distributions at national or sub-national levels (for a review see Colantonio et al. 2003). Here we are solely concerned with how such population structure is manifest across space, rather than between religious, ethno-cultural or social groups per se.

    1Department of Geography, University College London, United Kingdom.
    *E-mail: james.cheshire@ucl.ac.uk.

    Human Biology, October 2011, v. 83, no. 5, pp. 573-598.
    Copyright © 2011 Wayne State University Press, Detroit, Michigan 48201-1309

    KEY WORDS: SURNAMES, CONSENSUS CLUSTERING, LASKER DISTANCE, EUROPE, GIS.


    One of the primary methodological concerns of these studies is the development of: a) adequate measures of surname relatedness, or surname distance, between localities or regions and b) areal classification algorithms to partition space according to such distances. In this paper we seek to make two contributions to this line of research: first, we investigate the geographical structure of surnames at a continental level in 16 European countries, and second, we consider a relatively new regional clustering technique at this pan-European scale. In so doing we draw upon expertise developed in population genetics and geography. The result is a cultural regionalization of Europe based purely on the geography of surname frequencies that is key to the search for Europe's cultural regions. We use techniques derived from population genetics to devise and cluster measures of surname distance between populations, and use regionalization concepts and spatial database skills from geography to structure millions of address records and map the results.

    Our analysis is both data rich and computationally intensive, entailing as it does the aggregation, clustering and mapping of 8 million surnames collected from 152 million individuals. The resulting regionalization can be used to infer cultural, linguistic and genealogical information about the European population over the preceding centuries, for example with a view to designing a genetic sampling framework.

    Cultural and Surname Distance between Areas

    Surnames first appeared in Europe during the Middle Ages (Hanks 2003) and can be characterized by frequency distributions within a population that are driven by initial population size, rate of endogamy between populations and socio-cultural preferences within a group's reproduction patterns. Such processes are in turn a product of demographic, geographic, ethno-cultural, and migration factors. One of the most striking and recurrent findings of surname research is that, in spite of the relative mobility of modern populations, surnames usually remain highly concentrated in or around the localities in which they were first coined many centuries ago (e.g., Longley et al. 2011). The size of the databases available for the study of surnames has been increasing in line with the computational resources required to process them (see Scapoli et al. 2007; Cheshire et al. 2010; Longley et al. 2011). Such advances enable the continued progression of surname research in the context of the many exemplary studies outlined below.

    Following the early work of Cavalli-Sforza and colleagues using Italian telephone directories in magnetic tape form in the 1970s (see Piazza et al. 1987 and Cavalli-Sforza et al. 2004), the increasingly wide availability of digitally encoded names registers has led to a host of studies of the surname structure of populations of individual countries. Throughout, one group has dominated this research through the publication of a succession of national-level surname analyses: their studies include Austria (Barrai et al. 2000); Switzerland (Rodriguez-Larralde et al. 1998); Germany (Rodriguez-Larralde et al. 1998); Italy (Manni and Barrai 2001); Belgium (Barrai et al. 2004); the Netherlands (Manni et al. 2005); and France (Mourrieras et al. 1995; Scapoli et al. 2005).


    More recently they have amalgamated these findings for eight countries in Western Europe, analyzing a sample of 2094 towns and cities, grouped into 125 regions (Scapoli et al. 2007). This study found clear regionalization patterns in surname frequency distributions, closely matching the national borders for eight countries, but also highlights anomalies arising from the historical geography of languages.

    Whilst being wide-ranging, both geographically and in terms of the number of surnames sampled, the work by Scapoli et al. (2007) is still limited to the partial sampling of "representative" locations. The motivation for the work reported here is to expand this work in methodological terms by including more European countries (16 in this paper), to use data representing complete populations (i.e., without sampling), and to use new classification algorithms in the form of consensus clustering to delineate cultural surname regions and barriers to population interactions over space.

    The remainder of this paper is organized as follows: first, we outline the choice of surname distance metric used in this analysis; second, we review the most commonly used regional classification algorithms and suggest a new methodological approach; third, we present the materials and methods used in the analysis; and fourth, we present and discuss the resulting surname regionalization of Europe and the benefits and challenges of the proposed methodology.

    Measuring Surname Distance. Interest in the relationship between surnames and genetic characteristics first emerged in the late 19th century when George Darwin (1875), son of Charles and himself offspring of first cousins, used surnames to calculate the probability of first cousin marriages in Britain. Little further research was undertaken until the 1960s, when Crow and Mange (1965) proposed a probability of relatedness defined as the frequency of repetition of the same surname, known as isonymy (Lasker 2002). In addition to applications to the study of inbreeding between marital partners or social groups, isonymy can also be used to establish the degree of relatedness between two or more population groups at different geographic locations (Smith 2002). It is this latter, regional, interpretation of isonymy that has gained greater currency over the last decades and is the one used here. The coefficient of isonymy extends the idea of monophyly (sharing a single common ancestor) between two populations and is defined by Lasker (1985) as "the probability of members of two populations or subpopulations having genes in common by descent as estimated from sharing the same surnames" (Lasker 1985: 142). This coefficient is based on the similarity of the surname frequency distribution between two populations. In the two-region case, isonymy is calculated as:

    \[ R_{AB} = \frac{1}{2} \sum_{i} p_{iA}\, p_{iB} \tag{1} \]

    where $p_{iA}$ is the relative frequency of the ith surname in population A and $p_{iB}$ is the relative frequency of the ith surname in population B. In many cases, especially when comparing international populations, the similarity between population groups is very small and this creates very small values of $R_{AB}$. Therefore, a more meaningful transformation of this measure, termed the Lasker Distance (Rodriguez-Larralde et al. 1994), is used here. It is defined as:

    \[ L = -\ln(2 R_{AB}) \tag{2} \]

    where $R_{AB} = \frac{1}{2} \sum_{i} p_{iA}\, p_{iB}$. Taking the negative natural logarithm creates a more intuitive measure that can be thought of as distance in surname space, such that larger values between populations indicate greater differences between them (that is, less commonality in their surnames). Scapoli et al. (2007) suggest that this measure can be used to isolate differences in cultural inheritance because two populations that are genetically homogeneous, yet distant in Lasker Distance terms, are likely to exhibit subtle differences in cultural behavior.
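The two definitions above can be sketched in a few lines of Python. This is a minimal illustration rather than the authors' implementation: it assumes the forms $R_{AB} = \frac{1}{2}\sum_i p_{iA} p_{iB}$ and $L = -\ln(2R_{AB})$, and the function names and the toy surname lists are hypothetical.

```python
import math
from collections import Counter

def relative_frequencies(surnames):
    """Map each surname to its share of the population (shares sum to 1)."""
    counts = Counter(surnames)
    total = sum(counts.values())
    return {name: n / total for name, n in counts.items()}

def isonymy(pop_a, pop_b):
    """R_AB = (1/2) * sum_i p_iA * p_iB; only shared surnames contribute."""
    p_a = relative_frequencies(pop_a)
    p_b = relative_frequencies(pop_b)
    shared = p_a.keys() & p_b.keys()
    return 0.5 * sum(p_a[s] * p_b[s] for s in shared)

def lasker_distance(pop_a, pop_b):
    """L = -ln(2 * R_AB); infinite when the two areas share no surname."""
    r_ab = isonymy(pop_a, pop_b)
    return math.inf if r_ab == 0 else -math.log(2 * r_ab)

# Hypothetical toy populations (one list entry per person).
area_a = ["Smith", "Smith", "Jones", "Taylor"]
area_b = ["Smith", "Jones", "Jones", "Evans"]
area_c = ["Garcia", "Garcia", "Smith", "Lopez"]

print(lasker_distance(area_a, area_b))  # areas sharing two surnames
print(lasker_distance(area_a, area_c))  # areas sharing only one surname
```

In this toy case the pair of areas that shares more of its surname mass receives the smaller distance, which is the behaviour the text describes.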

    Doubts about the value of isonymy studies are founded upon the fundamental assumptions that they entail. An implicit assumption is that at some previous generation each male had a unique (monophyletic) surname, and that all surnames were first coined synchronically in the same generation (Rogers 1991). We know this not to be the case in several countries, for example in Great Britain, where for a multitude of reasons permanent surnames were acquired gradually in a number of distinctive and separate sub-populations. The name "Smith," for example, describes an occupation found within every community across the country and hence resulted in a heterophyletic surname. However, it is also the case that even if two populations with very similar surname distributions do not share unique common ancestors, they are nevertheless much more likely to be genetically related to one another, in comparison with a population that has a very different surname makeup (Lasker 1985).

    One important alternative to the Lasker Distance was proposed by Nei (1973). His measure of genetic distance, originally intended for the study of allele similarities between populations (Nei 1978), has been applied to surnames as Nei's distance of isonymy in a number of studies (such as Scapoli et al. 2007). Others have also successfully used the measure (see Manni et al. 2008 and Manni et al. 2006) and found it less sensitive to heterophyletic surnames and also likely to be more correlated with geographic distance. The purpose of this paper is to propose an innovative set of clustering techniques across a large number of countries. Therefore, it was thought best to avoid comparisons of multiple established distance measures and focus our clustering efforts on a single surname (dis)similarity measure so as to keep this aspect of the analysis fixed and concentrate on clustering and representational issues. On the basis that the work presented here is an extension of previous national level studies with the Lasker Distance (see Longley et al. 2011 and Cheshire 2011) the authors felt most comfortable using the measure here. The intention is to conduct further research into the utility of dissimilarity measures from both population genetics and demographics more widely, and Nei's distance will form part of this.


    Delineating Surname Regions: Consensus Clustering and MDS. As the result of the studies outlined above, the similarities between frequency distributions of surnames and the genetic structure of populations across space are now quite well known. However, there continues to be an important research gap with respect to the most appropriate spatial analysis techniques to automatically detect the geographical patterns of surname distributions at various scales. In population genetics, most studies posit clinal transitions in genetic characteristics punctuated by abrupt barriers (Lasker and Mascie-Taylor 1985). In contrast, and with a few exceptions, surname geography research is usually founded upon discrete administrative areal building blocks, and as such produces valid generalizations for only a pre-specified range of scales. We are not the first to apply clustering and data reduction techniques to surnames (in addition to the studies listed above, see Chen and Cavalli-Sforza 1983) but we hope to improve on previous research by suggesting a good compromise between the continuous and discrete representations of space by using two areal classification methods: consensus clustering and multidimensional scaling (MDS). The former creates discrete groupings of pre-specified areal units whilst the latter, when used to inform areal color values on a map, can produce a more continuous representation of population change over space.

    Indicating the certainty of a clustering outcome is an important aspect of population geography research, especially in regionalization. Readers should refer to Kaufman and Rousseeuw (1990) and also Gordon (1999) for a review of these. Of direct relevance here is Nerbonne et al.'s (2008) use of the aggregate data matrices produced in dialectometrics as a basis for identifying linguistic regions. The certainty of such regions was determined through bootstrapping and composite clustering techniques and visualized both as a dendrogram and as composite cluster maps. In the former, each branch has information about the number of times a particular grouping between its sub-branches occurred, while in the latter lines between geographic regions were drawn with increasingly dark shading, corresponding to the number of times contiguous spatial units on both sides of the line were assigned to different clusters. Using a different approach but with a similar cartographic effect, Manni et al. (2004) and Manni et al. (2006) implement Monmonier's (1973) boundary algorithm to detect dissimilarities between contiguous regions. The mapped results of both methodologies do not require the assignment of all spatial units to a particular cluster; rather, the objective is to identify only the most abrupt boundaries.

    In this paper consensus clustering (Monti et al. 2003) was chosen as a promising method of creating a robust cluster outcome, consistent with providing a number of metrics to indicate the optimal number of clusters and the certainty associated with each cluster assignment. Such metrics are useful because they give context to the final clustering outcome: in particular they address the issue that, contrary to what many surname regionalization maps suggest, not all resulting clusters are equally probable to occur within the data.

    Consensus clustering, first proposed by Monti et al. (2003), is a relatively new method for class discovery. It has become increasingly popular in the genetics literature (Monti et al. (2003) is highly cited) and there are a number of papers, such as Grotkjær et al. (2005), that compare its effectiveness to other more established clustering methods. The underlying hypothesis states that items consistently grouped together are more likely to be similar than those appearing in the same group less frequently (Simpson et al. 2010). The method is designed to increase the stability of the final cluster outcomes by taking the consensus of multiple runs of a single cluster algorithm. Simpson et al. (2010) have provided an extension to this approach, called merged consensus clustering, by enabling the cluster assignments to be the product of multiple runs of multiple algorithms or kinds of data. By merging the results from different algorithms it is thought that the confidence in the result will increase because the limitations of one clustering algorithm will be offset by the strengths of another. For example, Ward's hierarchical clustering is sensitive to outliers in the data, but offers a stable solution overall in terms of consistency of cluster outcome; by contrast, the overall arrangement of K-means clusters is relatively unstable, but the solutions are less sensitive to outliers. In addition to the increased stability of the results, consensus clustering can provide a range of metrics to help inform the optimum number of clusters as well as the robustness of the resulting cluster outcome in terms of its structure and the membership of individual clusters.

    Before undertaking the merged consensus clustering procedure, the user has to select the clustering algorithms to be used. Theoretically there is no limit on the number of algorithms that contribute to the final result aside from the practical constraints related to computation time and the degree to which the result will actually improve. Some of the most popular data classification methods are Ward's hierarchical clustering (Ward 1963), K-means (Hartigan and Wong 1979), partitioning around medoids (PAM) [see Kaufman and Rousseeuw (1990)], and self-organizing maps (SOM) (Kohonen 1990). The algorithms selected for this study are listed under the analysis section below. Table 1 shows the definitions of the variables and the algorithms used; the latter are adapted from Monti et al. (2003) to make them more applicable in this context.

    Table 1. Variables and Definitions Used in Merged Consensus Clustering. Adapted from Monti et al. (2003)

    Symbol                          Description
    $D = \{e_1, \ldots, e_N\}$      Data, in this case the surname distance matrix with geographic units ($e_i$) to be clustered.
    $N$                             The number of geographic units (or number of rows) in the distance matrix.
    $P = \{P_1, \ldots, P_K\}$      Partition of $D$ into $K$ clusters.
    $K$, $K_{max}$                  Number of clusters, maximum number of clusters.
    $N_k$                           Number of items in cluster $k$.
    $H$                             Number of resampling iterations.
    $D^{(h)}$                       Dataset obtained by resampling $D$ ($h$th iteration).
    $M^{(h)}$                       Connectivity matrix, corresponding to the $h$th iteration.
    $\mathcal{M}^{(K)}$             Consensus matrix, corresponding to $K$ clusters.
    $I^{(h)}$                       $N \times N$ indicator matrix.

    Consensus clustering first samples the complete data set $D$ to create a new subset $D^{(h)}$ before clustering using the specified algorithm(s). The sampling (using methods such as bootstrapping) and clustering are repeated multiple times in order to gauge sensitivity to repeat sampling from the total number $N$ of randomly selected geographic units $e_i$. The results from each iteration are stored in a consensus matrix $\mathcal{M}$, which records for each possible pair of $e_i$ the proportion of the clustering runs in which they are both clustered together. The consensus matrix is derived by taking the average over the connectivity matrices of every perturbed data set (Monti et al. 2003). The entries of the connectivity matrix are defined in the following way:

    \[ M^{(h)}(i,j) = \begin{cases} 1 & \text{if } i \text{ and } j \text{ belong to the same cluster} \\ 0 & \text{otherwise} \end{cases} \tag{3} \]

    $I^{(h)}$ is the $(N \times N)$ indicator matrix, required to keep track of the number of iterations in which both geographic units are selected by resampling, such that its $(i,j)$th entry is equal to 1 if both $i$ and $j$ are present in $D^{(h)}$, and 0 otherwise. According to Monti et al. (2003) the consensus matrix $\mathcal{M}$ is the normalized sum of the connectivity matrices of all the perturbed data sets $\{D^{(h)} : h = 1, 2, \ldots, H\}$:

    \[ \mathcal{M}(i,j) = \frac{\sum_{h} M^{(h)}(i,j)}{\sum_{h} I^{(h)}(i,j)} \tag{4} \]

    The $(i,j)$th entry in the consensus matrix records the number of times the two items have been assigned to the same cluster divided by the number of times both items have been selected (sampled). It therefore follows that a perfect consensus result would produce a matrix containing only 0s and 1s. $\mathcal{M}$ in essence provides a similarity measure to be used in further clustering or agglomerative hierarchical tree construction (Simpson et al. 2010).

    To create a merged result, a merge matrix provides a way of combining the outcomes of multiple methods by weighted averaging of their respective consensus matrices (Simpson et al. 2010). The weighting can be adjusted to increase or decrease the influence of certain clustering methods. The advantage of this approach is that it mitigates the issues associated with the different classification properties in each of the algorithms discussed above.
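The resampling procedure behind the consensus matrix can be sketched as follows. This is an illustrative sketch under stated assumptions, not the clusterCons implementation: the base clusterer is a small average-linkage routine swapped in for brevity (the paper names K-means, PAM and Ward's method), subsampling is done without replacement at a hypothetical fraction of 0.8, and all function names and the toy distance matrix are invented for the example.

```python
import random

def average_linkage(dist, k):
    """Agglomerative average-linkage clustering, cut at k clusters (stand-in clusterer)."""
    clusters = [[i] for i in range(len(dist))]

    def mean_gap(a, b):
        return sum(dist[i][j] for i in a for j in b) / (len(a) * len(b))

    while len(clusters) > k:
        pairs = [(p, q) for p in range(len(clusters)) for q in range(p + 1, len(clusters))]
        # Merge the two clusters with the smallest mean pairwise distance.
        p, q = min(pairs, key=lambda pq: mean_gap(clusters[pq[0]], clusters[pq[1]]))
        clusters[p].extend(clusters.pop(q))
    labels = [0] * len(dist)
    for c, members in enumerate(clusters):
        for i in members:
            labels[i] = c
    return labels

def consensus_matrix(dist, k, cluster_fn, n_resamples=200, frac=0.8, seed=0):
    """Share of resamples in which each pair of units lands in the same cluster."""
    n = len(dist)
    rng = random.Random(seed)
    together = [[0] * n for _ in range(n)]  # sum over h of the connectivity entries
    sampled = [[0] * n for _ in range(n)]   # sum over h of the indicator entries
    for _ in range(n_resamples):
        idx = sorted(rng.sample(range(n), max(k, int(frac * n))))
        sub = [[dist[a][b] for b in idx] for a in idx]
        labels = cluster_fn(sub, k)
        for x, a in enumerate(idx):
            for y, b in enumerate(idx):
                sampled[a][b] += 1
                together[a][b] += labels[x] == labels[y]
    # Pairs that were never sampled together are reported as 0.0.
    return [[together[i][j] / sampled[i][j] if sampled[i][j] else 0.0
             for j in range(n)] for i in range(n)]

# Hypothetical one-dimensional "surname space" with two well-separated groups.
positions = [0, 1, 2, 100, 101, 102]
dist = [[abs(p - q) for q in positions] for p in positions]
consensus = consensus_matrix(dist, k=2, cluster_fn=average_linkage, n_resamples=50)
```

Because the two toy groups are far apart, every resample recovers them and the matrix contains only 0s and 1s, the perfect-consensus case mentioned in the text. Real surname data would give intermediate values.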

    Two types of clustering robustness measures can be calculated. The first relates to the cluster structure [called cluster consensus $m(k)$] and the second to the cluster membership [called item consensus $m_k(i)$]. In regionalization problems, the latter is especially useful because it enables the comparative visualization of the geographic units' cluster allocations alongside their summary measures of cluster robustness. As is often the case, a geographic unit may only be assigned to a particular cluster on the basis that all units have to be assigned to one of the set of clusters. Where allocations are marginal, there will be low confidence in the allocation and it can therefore be interpreted accordingly. Monti et al. (2003) first define $I_k$ as the set of indices of items (geographic units in this case) belonging to cluster $k$. This can then be used to define the cluster's consensus as the average consensus index between all pairs of items belonging to the same cluster.

    \[ m(k) = \frac{1}{N_k (N_k - 1)/2} \sum_{\substack{i,j \in I_k \\ i < j}} \mathcal{M}(i,j) \tag{5} \]
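As a small worked illustration of the cluster consensus measure, the sketch below averages the consensus index over all within-cluster pairs. The function name, the toy consensus matrix and the labels are hypothetical, and returning `None` for single-unit clusters is a choice made for this example only.

```python
from itertools import combinations

def cluster_consensus(consensus, labels):
    """Mean consensus index over all pairs of units that share a cluster label."""
    result = {}
    for k in sorted(set(labels)):
        members = [i for i, lab in enumerate(labels) if lab == k]
        pairs = list(combinations(members, 2))  # the N_k (N_k - 1) / 2 pairs with i < j
        # A single-unit cluster has no pairs, so its consensus is left undefined.
        result[k] = sum(consensus[i][j] for i, j in pairs) / len(pairs) if pairs else None
    return result

# Hypothetical consensus matrix for four geographic units and their final labels.
toy_consensus = [
    [1.0, 0.9, 0.2, 0.1],
    [0.9, 1.0, 0.3, 0.0],
    [0.2, 0.3, 1.0, 0.6],
    [0.1, 0.0, 0.6, 1.0],
]
toy_labels = [0, 0, 1, 1]
scores = cluster_consensus(toy_consensus, toy_labels)
print(scores)
```

A cluster whose members were almost always grouped together scores close to 1, while a loosely held cluster scores lower and can be flagged on the map accordingly.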


    result. The result with the number of groups that comes closest to this can therefore be considered the optimum number of clusters. To establish the best outcome the quantity $\Delta(K)$ is calculated, which is the change in AUC as $K$ varies. The optimal $K$ value is broadly considered to coincide with the peak in $\Delta(K)$. Using Simpson et al.'s (2010) merged method the resulting consensus matrices (one from each cluster method used) from the optimal $K$ are combined through weighted averaging. The merged matrix maintains the same properties as a consensus matrix and can therefore be used as a dissimilarity matrix for re-clustering.
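The weighted averaging step can be sketched as below. This is an assumption-laden illustration rather than the clusterCons code: the function names, the equal default weights and the three toy matrices are hypothetical, and converting the merged similarity to a dissimilarity as one minus the consensus value is one common convention that the text does not spell out.

```python
def merge_consensus(matrices, weights=None):
    """Weighted element-wise average of consensus matrices (one per algorithm)."""
    n = len(matrices[0])
    if weights is None:
        weights = [1.0] * len(matrices)  # equal influence for every algorithm
    total = sum(weights)
    return [[sum(w * m[i][j] for w, m in zip(weights, matrices)) / total
             for j in range(n)] for i in range(n)]

def to_dissimilarity(merged):
    """Turn consensus (similarity in [0, 1]) into a dissimilarity for re-clustering."""
    return [[1.0 - v for v in row] for row in merged]

# Hypothetical consensus matrices for three units from three algorithms.
from_kmeans = [[1.0, 1.0, 0.0], [1.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
from_pam = [[1.0, 1.0, 0.0], [1.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
from_ward = [[1.0, 0.0, 0.0], [0.0, 1.0, 1.0], [0.0, 1.0, 1.0]]

merged = merge_consensus([from_kmeans, from_pam, from_ward])
merged_ward_heavy = merge_consensus([from_kmeans, from_pam, from_ward], weights=[1, 1, 2])
dissimilarity = to_dissimilarity(merged)
```

Raising the weight of one algorithm pulls the merged values toward that algorithm's grouping, which is how the influence of individual methods can be tuned.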

    In addition to the identification of discrete surname regions we also use multidimensional scaling (MDS) to show more subtle and continuous differences that depict trends or surfaces of closeness or dissimilarity between populations. MDS provides an effective summary of the degree to which regions are related to each other in "surname space." Following Golledge and Rushton's (1972) pioneering work, MDS has found many spatial analysis applications (Gatrell 1981). MDS reduces the dimensionality of a (dis)similarity matrix of m rows by n columns with a large value of n, to one with very few values of n. In geographic applications, the dissimilarity matrix between areas can be converted through MDS into a space of minimum dimensionality (typically two or three dimensions or number of n) closely matching the observed (dis)similarities in the data (Gatrell 1981). MDS can either be metric or nonmetric; both seek a regression of the distances on the (dis)similarity matrix, with the former utilizing the numerical values of the (dis)similarities and the latter their rank-order. For its application in this paper, we acknowledge Manni et al.'s (2004) concerns that MDS (like principal components analysis) does not provide a statistical analysis of the pattern of change, instead portraying an interpolated landscape in geographic space. This, of course, differs little from the maps produced by Lao et al. (2008), or Cavalli-Sforza (2000), which rely on spatial interpolation techniques to infer genetic characteristics in areas where samples have actually not been taken. This, in part, is the reason why we adopt a mixed approach here by combining MDS with cluster analysis in order that one set of results can provide context to the other.
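To make the dimensionality reduction concrete, the following sketch implements classical (Torgerson) metric MDS in plain Python. It is an assumption that this variant is a fair stand-in: the text does not say which MDS variant or software routine was used. The eigenvectors are found by simple power iteration, which is adequate only when the leading eigenvalues are positive and distinct, and the function name and the toy rectangle are hypothetical.

```python
import math
import random

def classical_mds(dist, dims=2, n_iter=500, seed=0):
    """Embed a symmetric distance matrix in `dims` dimensions (classical MDS)."""
    n = len(dist)
    sq = [[d * d for d in row] for row in dist]
    row_mean = [sum(r) / n for r in sq]
    grand = sum(row_mean) / n
    # Double-centre the squared distances: B = -1/2 * J * D^2 * J.
    b = [[-0.5 * (sq[i][j] - row_mean[i] - row_mean[j] + grand) for j in range(n)]
         for i in range(n)]
    rng = random.Random(seed)
    coords = [[0.0] * dims for _ in range(n)]
    for d in range(dims):
        v = [rng.random() for _ in range(n)]
        for _ in range(n_iter):  # power iteration towards the leading eigenvector
            w = [sum(b[i][j] * v[j] for j in range(n)) for i in range(n)]
            norm = math.sqrt(sum(x * x for x in w))
            if norm == 0:
                break
            v = [x / norm for x in w]
        eigval = sum(v[i] * sum(b[i][j] * v[j] for j in range(n)) for i in range(n))
        scale = math.sqrt(max(eigval, 0.0))
        for i in range(n):
            coords[i][d] = scale * v[i]
        # Deflate B so the next pass finds the next-largest eigenvector.
        b = [[b[i][j] - eigval * v[i] * v[j] for j in range(n)] for i in range(n)]
    return coords

# Hypothetical example: four "regions" at the corners of a 3 x 4 rectangle.
corners = [(0, 0), (3, 0), (3, 4), (0, 4)]
toy_dist = [[math.hypot(p[0] - q[0], p[1] - q[1]) for q in corners] for p in corners]
embedding = classical_mds(toy_dist)
```

For genuinely two-dimensional input such as this rectangle, the distances between the embedded points reproduce the input distances up to rotation and reflection. With Lasker Distances the fit would be approximate, and a nonmetric variant would only try to preserve their rank-order.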

    Materials and Methods

    Data and Geography. The UCL Worldnames database (see worldnames.publicprofiler.org) contains the names and addresses of more than 400 million people in 26 countries, derived from a range of publicly available population registers and telephone directories collected since 2000. For the purposes of this paper, surname data for 16 European countries in Worldnames were extracted (more than 8 million unique surnames) along with their geographical locations and frequencies of occurrence. A list of countries, name frequencies and geographical characteristics is shown in Table 2. The countries used in this study reflect those available in the Worldnames database, and thus omissions reflect an inability to source the requisite data, rather than a deliberate exclusion of particular countries.

    Table 2. The Countries and Their Data Used in This Study

    Country                          Year   Total Population   Worldnames Population   Unique Surnames   NUTS Level   Number of Spatial Units
    Poland                           2007         38,518,241               8,015,455           339,339            2                        16
    Serbia, Montenegro and Kosovo    2006         10,159,046               1,704,559            69,977            2                         4
    Austria                          1996          8,316,487               2,520,012            81,387            2                        98
    Belgium                          2007         10,511,382               3,489,068           852,492            3                        11
    Denmark                          2006          5,457,415               3,074,871           153,134            2                        15
    France                           2006         64,102,140              20,280,551         1,197,684            3                        96
    Germany                          2006         82,314,900              28,541,078         1,226,841            2                        39
    Great Britain                    2001         60,587,300              45,690,258         1,612,599            3                       131
    Republic of Ireland              2007          4,239,848               2,916,744            46,507            3                        26
    Italy                            2006         59,131,282              15,927,926         1,305,554            3                       103
    Luxemburg                        2006            480,222                 117,619            75,267            3                        12
    Netherlands                      2006         16,570,613               4,672,344           531,970            2                        12
    Norway                           2006          4,770,000               3,536,524           123,240            3                        19
    Spain                            2004         45,116,894               9,545,104           260,469            3                        50
    Sweden                           2004          9,142,817                 791,143           135,830            3                        24
    Switzerland                      2006          7,508,700               1,565,098            19,270            3                        26

    Totals                                       426,927,287             152,388,352         8,031,560

    "NUTS Level" refers to the geographic unit of analysis used. There are 495,059 Hapax (occurring only once) surnames in the data.

    The assembly of this database is a major ongoing enterprise, involving the acquisition, normalization, cleaning and maintenance of available telephone directories and commercial versions of public registers of electors. The extract used in this paper comprises a commercially enhanced version of the 2001 Electoral Register for the United Kingdom and landline telephone directories from the remaining countries identified as current during the period 2001-2006. There are many potential sources of bias in these sources, and some are likely to be systematic in their operation. Non-electors (of different types) are likely to be under-represented in the United Kingdom data, for example, and such individuals are more likely than average to bear names recently imported from abroad. Landline rental may introduce some socioeconomic and geographic bias in some European countries, while the bearers of some names may be more likely to withhold their telephone numbers from public directories than others. These are all complicated issues to address and thus, in order to expedite analysis, we have taken the decision to accept the data in the form in which they were supplied to us. We view the time period as helpful in sustaining this decision, in that it predated the period of mass mobile phone ownership, which may have reduced the penetration of landline services amongst some groups, and the heightened privacy concerns that are leading to attrition in the size of the public version of the United Kingdom Electoral Register.


    Selection and calibration of appropriate spatial units is a key problem in geographical research (Openshaw 1984) and one that requires much more thorough consideration in the population genetics literature. In order to analyze Europe's surname regions we first had to adopt a geographical unit of analysis that was as consistent as possible throughout the study area. The international nature of the Worldnames database means that it contains data at geographic scales ranging from an individual's address through to name frequencies within administrative areas. Individual addresses have been carefully geocoded to a set of geographical coordinates (latitude and longitude) at levels of resolution ranging from the building level to the street, postcode, city, metropolitan area and administrative region. Since this study is concerned with general patterns at the pan-European level we are interested in aggregating detailed locations onto a set of standard geographical regions of similar size and population. European Union (EU) NUTS regions (Nomenclature d'Unités Territoriales Statistiques) provide a convenient set of geographic units that broadly conform to these aims. NUTS are a standard referencing system for the hierarchy of five levels of administrative sub-divisions of EU countries for statistical purposes, ranging from broad country regions (NUTS 1) to municipalities (NUTS 5). Initially all surname data were aggregated to NUTS 3 level (the province or department), but it subsequently became apparent that some countries with relatively large numbers of NUTS 3 units relative to their population sizes (such as Germany, where these correspond to 429 Kreise or Districts) were having an undue influence on the results. This was especially evident at the clustering and MDS stages of the analysis. Therefore, for this study we have opted for a combination of NUTS 2 and NUTS 3 regions in an attempt to address this problem and to produce a set of homogeneous areas in terms of population size and geographical extent. In so doing, we follow common practice in geographical analysis of NUTS data in Europe. Table 2 identifies the NUTS level selected for each country and the number of areal units. This resulted in a total number of 685 geographic units across the 16 countries.

    Analysis. Our analysis consisted of applying consensus clustering and MDS to the 685 geographic units. It was implemented using the statistical software R (R Development Core Team 2010); in particular the consensus clustering required the clusterCons package, developed by Simpson et al. (2010). The package is a new release and designed primarily for gene expression microarray data and we provide its first documented use in the context of population genetics/geography.

    A matrix of the Lasker Distances between all pairs of NUTS geographic units provided the input for the clusterCons package. Consensus clustering was performed using three different algorithms: K-Means, partitioning around medoids (PAM), and Ward's hierarchical clustering. These were chosen for their success in previous studies (see Cheshire et al. 2010; Longley et al. 2011). In order to select the most appropriate number of clusters (K) in which to group the geographic units, each of these algorithms was run using K values ranging between 5 and 45. For each value of K, subsampling was used to provide 200 selections for each algorithm in the consensus clustering. This process produced a merged consensus matrix (an average of the three consensus matrices, one for each clustering methodology) for each value of K, resulting in the creation of 40 matrices. The merged consensus matrices provided the basis for the ΔK calculations, the results of which are shown in Figure 1.

    Figure 1. The ΔK plot used to inform the decision to cluster the Lasker Distance matrix into 14 groups. It shows the change in AUC values as calculated in Equation 8. Horizontal axis: Number of Clusters (K).
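    The subsampling and merging steps described above can be sketched as follows. This is a minimal illustration and not the clusterCons implementation: the tiny k-medoids routine, the 80% subsampling fraction, the function names, and the toy distance matrix are all assumptions made for the example, and two differently seeded k-medoids runs stand in for the three algorithms used in the study.

```python
import random
from itertools import combinations


def k_medoids(dist, items, k, rng, max_iter=20):
    """Tiny PAM-style k-medoids on a precomputed distance matrix.

    Returns a dict mapping each item to the medoid of its cluster.
    """
    medoids = rng.sample(items, k)
    labels = {}
    for _ in range(max_iter):
        # Assignment step: each item joins its nearest medoid.
        labels = {i: min(medoids, key=lambda m: dist[i][m]) for i in items}
        clusters = {}
        for i, m in labels.items():
            clusters.setdefault(m, []).append(i)
        # Update step: the new medoid minimises total distance to members.
        new_medoids = [
            min(members, key=lambda c: sum(dist[c][j] for j in members))
            for members in clusters.values()
        ]
        if set(new_medoids) == set(medoids):
            break
        medoids = new_medoids
    return labels


def consensus_matrix(dist, k, n_runs=200, frac=0.8, seed=0):
    """Share of runs in which each pair is co-clustered, counted only
    over the runs in which both members of the pair were subsampled."""
    n = len(dist)
    rng = random.Random(seed)
    together = [[0] * n for _ in range(n)]
    sampled = [[0] * n for _ in range(n)]
    size = min(n, max(k + 1, round(frac * n)))
    for _ in range(n_runs):
        subset = rng.sample(range(n), size)
        labels = k_medoids(dist, subset, k, rng)
        for i, j in combinations(subset, 2):
            sampled[i][j] += 1
            sampled[j][i] += 1
            if labels[i] == labels[j]:
                together[i][j] += 1
                together[j][i] += 1
    return [
        [1.0 if i == j
         else (together[i][j] / sampled[i][j] if sampled[i][j] else 0.0)
         for j in range(n)]
        for i in range(n)
    ]


def merge_consensus(matrices):
    """Element-wise mean of several consensus matrices."""
    n = len(matrices[0])
    return [[sum(m[i][j] for m in matrices) / len(matrices) for j in range(n)]
            for i in range(n)]


# Toy data (hypothetical): six "regions" on a line forming two groups.
positions = [0, 1, 2, 10, 11, 12]
toy_dist = [[abs(a - b) for b in positions] for a in positions]

# Stand-in for averaging the per-algorithm consensus matrices.
merged = merge_consensus(
    [consensus_matrix(toy_dist, k=2, seed=s) for s in (1, 2)]
)
```

    In the toy example the two end regions are never placed in the same cluster, so their merged consensus value is zero, whereas neighbouring regions within a group receive a positive value.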

    Figure 1 shows a dramatic decrease in ΔK values between K = 5 and K = 12, fluctuating between 12 and 20 before stabilizing after K = 21. Solely on the basis of Monti et al.'s (2003) number of clusters criterion (outlined in Section 1.2.3), 10 should have provided the best outcome. It was however decided to relax this criterion and select 14 clusters for a number of reasons. Firstly, this does not exceed a practical number of clusters for visualizing regions in a choropleth map and secondly it makes intuitive sense as it approximates the [...] 14 thus contained some questionable regional groupings within countries. The picture at K > 9 but < 14 appeared too generalized when mapped (although it was more stable) for the purposes here.

    Figure 2. Box plots showing the robustness values associated with the structure of each of the cluster outcomes. White boxes are produced from direct clustering of the distance matrix and grey boxes are produced from clustering the merged consensus matrix. For reasons outlined in the text, PAM provides the best solution in this instance. Horizontal axis categories: K-Means, Wards, PAM.
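    The ΔK values used to choose the number of clusters can be illustrated with a short sketch. It assumes the formulation usually attributed to Monti et al. (2003), in which the area under the empirical CDF of the consensus values is computed for each K and ΔK is the relative change in that area between successive K. The paper's Equation 8 is not reproduced in this section, so the exact form, the function names, and the toy numbers below are assumptions.

```python
from bisect import bisect_right


def cdf_area(consensus):
    """Area under the empirical CDF of the pairwise consensus values."""
    n = len(consensus)
    # Use each pair once: the upper triangle, excluding the diagonal.
    values = sorted(consensus[i][j] for i in range(n) for j in range(i + 1, n))
    m = len(values)
    area = 0.0
    for idx in range(1, m):
        # Empirical CDF at values[idx]: share of entries <= that value.
        cdf = bisect_right(values, values[idx]) / m
        area += (values[idx] - values[idx - 1]) * cdf
    return area


def delta_k(areas):
    """Relative change in CDF area between successive values of K.

    `areas` maps each K to its cdf_area; the result maps K to
    (A(K + 1) - A(K)) / A(K) wherever both K and K + 1 are present.
    """
    return {
        k: (areas[k + 1] - areas[k]) / areas[k]
        for k in sorted(areas)
        if k + 1 in areas
    }


# Hypothetical 3 x 3 consensus matrix; its pairwise values are 0.0, 1.0, 0.5.
toy_consensus = [
    [1.0, 0.0, 1.0],
    [0.0, 1.0, 0.5],
    [1.0, 0.5, 1.0],
]
toy_area = cdf_area(toy_consensus)

# Hypothetical areas for K = 5, 6, 7 (not values from the study).
toy_delta = delta_k({5: 0.50, 6: 0.60, 7: 0.63})
```

    A flattening ΔK curve, as in the hypothetical drop from K = 5 to K = 6 here, indicates that adding further clusters changes the consensus distribution little.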

    Figure 2 shows a box plot with the robustness values associated with the final cluster structures at 14 clusters (as outlined in Equation 5). In addition to the results from clustering the final merge matrix, those from the non-merged consensus clustering are also included for comparison. In agreement with preliminary research using different data (Cheshire and Adnan 2011), the merge matrix result produced higher median robustness values across all algorithms when compared with the non-merged results. Overall, based on Figure 2, it was thought that PAM on the merge matrix produced the most robust cluster structure. Although the PAM inter-quartile range was greater than that for Ward's algorithm, six of the "Ward clusters" (nearly half) were classified as outliers. The membership robustness values were also highest, on average, for the PAM clustering result: these have been mapped alongside the final cluster outcome in Figure 3.

    Figure 3. Maps showing the spatial distributions of each of the 14 cluster allocations (A) and their respective robustness values (B). Higher robustness values represent a better result. On the left hand plot each cluster has been assigned a unique pattern. A full-color version can be found at spatialanalysis.co.uk/surnames.
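    The robustness measures referred to above can be sketched from a consensus matrix. The definitions below (mean consensus over all member pairs for a cluster, and mean consensus between one unit and the other members for membership) follow a common formulation and are assumptions, since the paper's Equation 5 is not reproduced in this section. The function names, the toy matrix, and the convention of returning 1.0 for a singleton are hypothetical.

```python
from itertools import combinations


def cluster_robustness(consensus, members):
    """Mean consensus value over all pairs of units in one cluster."""
    pairs = list(combinations(members, 2))
    if not pairs:
        return 1.0  # assumed convention: a singleton has no pairs
    return sum(consensus[i][j] for i, j in pairs) / len(pairs)


def membership_robustness(consensus, members, item):
    """Mean consensus between `item` and the other units of a cluster."""
    others = [j for j in members if j != item]
    if not others:
        return 1.0  # assumed convention for a singleton cluster
    return sum(consensus[item][j] for j in others) / len(others)


# Hypothetical consensus matrix for four spatial units.
toy = [
    [1.0, 0.9, 0.5, 0.1],
    [0.9, 1.0, 0.7, 0.2],
    [0.5, 0.7, 1.0, 0.3],
    [0.1, 0.2, 0.3, 1.0],
]
cluster = [0, 1, 2]

whole = cluster_robustness(toy, cluster)          # mean of 0.9, 0.5, 0.7
weakest = membership_robustness(toy, cluster, 2)  # mean of 0.5, 0.7
outsider = membership_robustness(toy, cluster, 3) # mean of 0.1, 0.2, 0.3
```

    Mapping the per-unit value, as in Figure 3B, shows which units sit firmly in their cluster and which are weak allocations.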

    In this study, guided by the visual interpretability of the results, we also use MDS in two and three dimensions. MDS undertaken for greater than three dimensions had little impact (see stress values in Figures 4 and 5) on the positioning of the NUTS regions in relative space and becomes increasingly hard to visualize effectively in print. Results from the MDS are shown in two ways. Figure 4 shows a conventional plot of the results from two-dimensional MDS for each country, where each dot represents a NUTS region and each axis one of the two MDS dimensions. Figure 5 is a more novel representation, previously used in linguistics (see Nerbonne 2010), and shows the three-dimensional MDS values on a two-dimensional map. In this figure the raw MDS coordinates have been [...]


    Figure 5. Maps showing the spatial distributions of each dimension produced from the 3 dimensional MDS. Each dimension has been rescaled to a value of between 0 and 255 to facilitate the creation of RGB colors (best viewed online: spatialanalysis.co.uk/surnames). Stress values for 3 dimensions = 0.11[...]4 and 4 dimensions = 0.9838. Panels: MDS Dimension 1, MDS Dimension 2, MDS Dimension 3, and the combined three-dimensional map.
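    The rescaling described in the Figure 5 caption can be sketched as a min-max transformation of each MDS dimension onto the range 0 to 255, with the three rescaled values then read as red, green and blue. The function names and the sample coordinates are hypothetical, and linear min-max scaling is an assumption: the caption gives only the target range.

```python
def rescale_to_byte(values):
    """Linearly map a list of numbers onto the integers 0..255."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0 for _ in values]  # a constant dimension carries no contrast
    return [round(255 * (v - lo) / (hi - lo)) for v in values]


def mds_to_hex(coords):
    """Turn 3-D MDS coordinates into one hex color per region."""
    channels = [rescale_to_byte([c[d] for c in coords]) for d in range(3)]
    return ["#{:02x}{:02x}{:02x}".format(r, g, b) for r, g, b in zip(*channels)]


# Hypothetical 3-D MDS coordinates for three regions.
toy_coords = [(0, 30, 0), (1, 14, 10), (5, 10, 2)]
toy_colors = mds_to_hex(toy_coords)
```

    Regions with similar coordinates on all three dimensions receive similar colors, which is what allows gradual transitions as well as abrupt boundaries to appear on a single map.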


    Figure 7. A plot showing the relationships between the Lasker Distance measures and log geographic distance (km) within each European country studied here. Every possible region-pair is represented. Panels include Austria, Belgium, Switzerland, Germany, Denmark, Spain, United Kingdom, Luxembourg, Netherlands, Norway, Poland, Serbia + Montenegro, and Sweden. Horizontal axis: Log Geographic Distance.

    [...] across Europe is 10.45 with the maximum value (19.68) occurring between Northern Ireland and southern Italy, hinting at a measure of isonymy with a low dispersion across Europe compared to geographic distances.

    At the country level, the relationship between surname and geographical distance presents some interesting and particular national trends, as shown in Figure 7. Multilingual countries, such as Belgium and Switzerland, unsurprisingly show the strongest relationship between geographic distance and differences in the surname composition of their regions. Counter-intuitively perhaps, the plot for Norway suggests that surname diversity increases with proximity. This is most probably because of the greater surname diversity (resulting from domestic and international migration) in urban areas that are close to one another in the southwest of the country. This diversity appears to be sufficiently strong, and the areas sufficiently close together, to offset the more distant but more homogeneous rural areas. In countries such as Denmark, a de facto archipelago, Euclidean distance does not reflect actual population interaction. Moreover, the plots in Figure 7 provide an important indication of the sub-national interactions between distance and surname diversity.

    Consensus Clustering. The clustering results shown in Figure 3A conform to many well-known national and linguistic divisions across Europe and, most notably, follow linguistic or historical political boundaries, in some cases reflecting the effects of contemporary global migration to large urban areas.

    The clusters generally follow national borders, with some interesting exceptions: multilingual countries and those with unique regional patterns. Large parts of Switzerland have been allocated to the same cluster as the Alsace region in France, Southern Luxembourg, and the Bolzano region in northern Italy, denoting similar surname characteristics shared by these multilingual areas with German language heritage. The analysis has also split Belgium along linguistic lines, assigning Flanders to the same cluster as the Netherlands and Wallonia to the French cluster, with part of Brussels appearing as a French enclave within Flanders.

    Denmark, Norway, and Sweden have been assigned to the same cluster except for one sparsely populated area of northern Sweden that is well known to have commonalities with its Finnish neighbor. This particular area has been grouped together with more "peripheral" countries such as Poland and Serbia, Montenegro and Kosovo. The robustness values associated with this area in Sweden are low, suggesting the region shares relatively little in common with the countries included in this cluster. This is essentially a Polish cluster: the ex-Yugoslavia region is associated with it because of its small size in relative terms and is in effect an outlier, like the aforementioned northern Swedish region.

    Beyond contemporary national political boundaries there are some interesting within-country regionalizations that derive from the analysis. In the United Kingdom, historical linguistic regions such as Wales and the Scottish Islands are clearly distinguishable from the rest of the United Kingdom. The urban corridor around London also stands out, suggesting that the surname composition of these areas is much more diverse and hence disconnected from the national picture. This demonstrates the uniqueness in the surname composition of contemporary global migrants to the London area (see also Longley et al. 2011). In the rest of the British Isles, Ireland (Eire) is grouped under a single cluster that includes most of Northern Ireland, except for the eastern coast, reflecting the close migration and trade flows with Great Britain.

    In France, the mainland except for Alsace-Lorraine has been allocated to a single cluster that includes the island of Corsica and the Geneva region in Switzerland, as well as the Wallonia region in Belgium. This is hence a "tight French surnames cluster" automatically identified by the clustering algorithm. Italy has been split in two clusters, with a northern and western cluster separated from the rest of the country. Spain solidly belongs to a single cluster, despite its strong multilingual cleavages (Mateos and Tucker 2008), perhaps because of its overall low surname diversity (Scapoli et al. 2007). Most of Germany is allocated to a single cluster, while most of Austria belongs to a separate cluster, with some spillover regions between the two.

    Multidimensional Scaling (MDS). The results from the multidimensional scaling largely support the consensus clustering outcome. The two-dimensional MDS plots for individual countries shown in Figure 4 provide an indication of the location of each of the spatial units in their multidimensional surname space. Those countries that have largely homogeneous surname distributions form very tight clusters, such as Germany, Ireland, or Denmark. Others such as Switzerland, Luxembourg, France, or Spain show a greater degree of scatter, reflecting present or historic multilingualism. Of most interest are the outlier points for each of the countries. For example, the three highlighted points in Italy's distribution are spatial units on the island of Sardinia, and those highlighted in France represent the border region of Alsace-Lorraine.

    Figure 5 provides the geographic context to the results of the MDS analysis and is, in many ways, much more informative as a result. The maps (best viewed electronically at spatialanalysis.co.uk/surnames) create a similar impression to those in Figure 3 in addition to some more subtle distinctions. For example, MDS Dimension 3 suggests a rather strong north-south split within Germany that is not noticeable in the consensus clustering results or the three-color map in the same figure. Multilingual countries are also clearly identified in this figure, as well as some of the diversity within the Netherlands identified by Barrai et al. (2002). It is clear from Figure 5 that the European map has a number of abrupt transitions in its surname compositions. There are clear splits between the British Isles and the Continent, between Romance and Germanic languages, between Scandinavia and the rest of Europe, and between Poland and Germany. The latter abrupt transition is especially striking since the current Polish-German border only dates to 1945. This probably reflects rapid population movement during World War II and the practice of surname change or forced migration on the Soviet side during the Cold War. Such distinctions are perhaps unsurprising but these maps show, for the first time, how abrupt boundaries across Europe can simply be captured by surname frequencies derived from data assembled from telephone and other directories.

    Discussion

    Regionalization Methods. The fact that the outcomes from the two separate regionalization techniques used in this paper, consensus clustering and MDS, are in broad agreement with previous research in this area is encouraging and serves to endorse their use in geographic analysis of population structure. Clustering the merged matrix provided a more consistent outcome than consensus clustering, which in turn was more reliable than clustering areas using a single algorithm.


    The method does not obviate the need for the selection of a single algorithm to produce the final result, but it does provide some useful metrics upon which to base this decision. As Figure 3 demonstrates, the ability to map the cluster membership robustness of each spatial unit to its respective final cluster provides a powerful way of assessing the appropriateness of the outcome for each specific area. A key flaw with conventional clustering routines is the requirement to assign every item to one of a limited set of clusters, since this may result in questionable cluster allocations. Using robustness measures, such "weak" allocations can be identified and interpreted with an appropriate degree of caution. In addition the ΔK measure is useful for indicating the optimal number of clusters that should be used as an input to the algorithm. It should be noted that "optimal" in the quantitative sense might not be optimal in the practical sense. If the outcomes were to be mapped, for example, there would be a limit on the number of cluster outcomes that can be readily discriminated by the map user. A substantial advantage of the methods presented here is in the visual outputs that they provide, so this limitation should not be underestimated.

    A final consideration relates to the opposite scenario where the ΔK measure indicates that a very low cluster number is optimal but the researchers may wish to identify a greater number of clusters to highlight diversity. In this case the desired clustering result can be shown alongside that which is optimal. Merged consensus clustering cannot therefore entirely remove the need for subjective guidance of cluster analysis, but it does provide measures upon which researchers can base their decisions. We do not claim that our use of consensus clustering has provided a panacea to the many issues surrounding the clustering of surname data. We do hope, however, to have made a substantial empirical contribution to the debates surrounding such issues through the application of the method to such a large data set.

    The maps shown in Figure 5 demonstrate the power of mapping MDS values in this context. The resulting impression of regionalization is similar to that produced by the computationally more intensive consensus clustering, with the additional advantage that less discrete phenomena such as isolation by distance are also shown. Assigning discrete groupings to the visual impressions created by the maps is best left to the sorts of clustering methodologies shown here, but the relative simplicity (using most widely available statistical software packages) and speed of the MDS classification make it a powerful one in this context.

    Issues of Geographical Scale and Size. The data set used here contains information at the level of the individual for most countries, and therefore offers the potential for much finer-scale analysis than has been presented here for the 685 NUTS 2/3 areas. Very fine spatial units will create different regionalization outcomes out of the same input data set from those based at a coarser scale. This effect is clearly seen if Figure 7 in Scapoli et al. (2007) is contrasted with Figure 3A above. For example, Scapoli et al. (2007) have clustered the entire region of Lorraine as part of the Franco-German border area using NUTS 2 regions, while the smaller geographical units presented in Figure 3 (NUTS 3) suggest that it is only those departments contiguous with the German border (and not with Belgium or in the interior) that fall into this category.

    The issue of scale is partially resolved through the application and context of the surname research being undertaken. If, for example, surname analysis is used as a proxy for genetic information at the European level then fine-scale analysis may be unnecessary since most traits are only noticeable at coarse granularity (Cavalli-Sforza 2000). That said, as Longley et al. (2011) demonstrate using similar methods for Great Britain, the use of fine granularity units of analysis will still preserve the large-scale trends if these are legitimate and not just artifacts of the units used. A major advantage of smaller spatial units is their ability to highlight detail, such as that arising out of more recent migration events. This may be especially useful in the context of understanding segregation in global cities such as London, Paris, and other large European cities. Whilst such fine-scale analysis would not be practical at a European level, it could nevertheless be undertaken within each of the 14 or so groupings created in this study in order to identify the dynamics within each of these surname sub-regions.

    An issue to be considered alongside the size of spatial unit selected is the size, concentration and distribution of the populations contained within them. The (dis)similarity between the surname compositions of populations has been established between areas with the Lasker Distance. The subsequent clustering of the measure is sensitive to the different levels of aggregation and sampling associated with the inconsistent population sizes represented by each spatial unit. Dissimilarity measures, such as the Lasker Distance, rely on comparisons between aggregate population groups that are often equally weighted for the analysis. A spatial unit representing 100 people is therefore treated in the same way as one with 1000 or even 10,000. A country's influence on the analysis is in part based on the number of spatial units it has rather than the size of its population. The likely result is an apparent increase in diversity for countries partitioned into large numbers of regions, despite relatively uniform surname compositions. It is therefore the case that the resulting classification is dependent on the size of the spatial units, the population size of each spatial unit and the surname heterogeneity within and between the spatial units. The use of merged consensus clustering has helped to accommodate some of these effects, in addition to minimizing the impact of outliers in the cluster analysis. Future work will seek to establish a number of heuristics around which to base a suitable weighting methodology to account for the varying populations in each spatial unit across Europe.
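    The equal weighting of spatial units can be seen in a short sketch of the distance calculation. It assumes the commonly used form of the Lasker Distance, the negative natural logarithm of the isonymy between two areas (the sum over surnames of the product of their relative frequencies); the paper's own equation is not reproduced in this section. The function name and the surname counts are hypothetical.

```python
import math


def lasker_distance(counts_a, counts_b):
    """Negative log of the isonymy between two surname count tables."""
    total_a = sum(counts_a.values())
    total_b = sum(counts_b.values())
    # Isonymy: sum over shared surnames of the product of relative frequencies.
    isonymy = sum(
        (n / total_a) * (counts_b.get(name, 0) / total_b)
        for name, n in counts_a.items()
    )
    if isonymy == 0:
        return math.inf  # no surnames in common
    return -math.log(isonymy)


# Hypothetical surname counts for three spatial units.
small_unit = {"Smith": 60, "Jones": 40}                    # 100 people
large_unit = {k: v * 100 for k, v in small_unit.items()}   # 10,000 people
other_unit = {"Smith": 10, "Garcia": 90}

d_small = lasker_distance(small_unit, other_unit)
d_large = lasker_distance(large_unit, other_unit)
```

    Because only relative frequencies enter the calculation, the unit of 100 people and the unit of 10,000 people with the same surname proportions lie at the same distance from any third unit.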

    A number of approaches could be used to mitigate the drawbacks associated with inconsistent levels of aggregation within distance measures. The obvious solution would be the greater standardization of spatial units across Europe, in order that they better reflect population density. This, however, leads to complications such as whether the size of the resulting units should reflect the target population density or the sampled population density. In addition, more sparsely populated areas are going to require larger units (in terms of geographic extent) in order to meet a population threshold and this is likely to risk amalgamating culturally distinct groups as potential surname boundaries are crossed. This solution would present a major undertaking at the European level and may not produce significantly improved results. More practical options could therefore include weighting the dissimilarity calculation or its subsequent clustering. One possible approach, in this context, would simply be to multiply the elements of the Lasker Distance matrix by a suitably normalized population weight. Such an approach may also require some nationally varying α value to alter the influence of the population weighting on the cluster outcome.
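    One way the weighting idea might look is sketched below. The text does not specify a weighting scheme, so everything here is an assumption made for illustration: the pair weight is taken as the geometric mean of the two unit populations, normalized by the largest pair weight, and raised to an exponent `alpha` that stands in for the tunable value mentioned above. The function name and the numbers are hypothetical.

```python
def population_weighted(dist, populations, alpha=1.0):
    """Multiply each distance by a normalized population weight.

    Assumed scheme: weight(i, j) = (sqrt(pop_i * pop_j) / largest) ** alpha,
    so weights lie in (0, 1] and alpha = 0 leaves the matrix unchanged.
    """
    n = len(dist)
    raw = [[(populations[i] * populations[j]) ** 0.5 for j in range(n)]
           for i in range(n)]
    largest = max(max(row) for row in raw)
    return [[dist[i][j] * (raw[i][j] / largest) ** alpha for j in range(n)]
            for i in range(n)]


# Hypothetical distances and populations for two spatial units.
toy_dist = [[0.0, 3.0], [3.0, 0.0]]
toy_pops = [100, 10000]

unweighted = population_weighted(toy_dist, toy_pops, alpha=0.0)
weighted = population_weighted(toy_dist, toy_pops, alpha=1.0)
```

    Whether shrinking the distances involving small units is desirable depends on the application; the sketch shows only the mechanics of applying a weight and of tuning its influence through the exponent.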

    We consider the disparities in sample size for each population a lesser issue because, as Fox and Lasker (1983) demonstrate, the relative proportions of each surname tend to be consistent whatever percentage of the population is sampled, so long as the sample is representative. We believe that our data sources are broadly representative of their target populations (with the caveats below) and therefore will have adequate proportions of each surname to calculate realistic pairwise distances. Finally, an element of uncertainty has also been introduced in this analysis by the different provenance of the surname frequency data for each country. While the ultimate data source for most of the countries is the national telephone directory (except the United Kingdom where an enhanced electoral register was used), these obviously do not present identical characteristics across the 16 countries. The differences include national variations in the gender bias toward male registration in telephone directories, variable penetration of landline rental in the population, different conventions for subscribers removing their entries from directories, different customs in registering names to telephone lines and different procedures and conventions by the companies that commercialized the data. The previous discussion on geographical scale also applies to geographical extent: if we had clustered surname distances individually within each country the results would have been somewhat different from doing so for the whole of Europe in a single step.

    Conclusions

    This research has offered a number of important contributions to our understanding of the spatial distributions of surnames. It has combined a commonly used method of establishing the similarities in the surname composition between different populations or areas (isonymy) with novel forms of data clustering and geographic visualization (consensus clustering and MDS). It has created the most comprehensive surname regionalization of Europe to date by examining the 8 million surnames of over 150 million people who can reasonably be deemed representative of the entire populations of each of the 16 countries included here. The unprecedented size and comprehensiveness of the data set used has provided new insights into the problem of identifying the regionalization of European populations using surname distributions as a proxy for cultural and genetic structure. The introduction of a new method, merged consensus clustering, in this context has greatly increased the stability and consistency of traditional clustering algorithms. In addition the mapping of a measure of cluster robustness alongside the final results provides important context about the strength of the resulting regions. This information is augmented by the results of MDS analysis that, as shown in Figure 5, capture both the abrupt transitions in surname composition as well as more gradual trends. This goes some way toward combining the traditionally continuous models of genetic diversity with the discrete transitions commonly established in surname analysis.

    In conclusion, this paper has sought to demonstrate the utility of an inductive approach to summarizing and analyzing large population data sets across cultural and geographic space, the outcomes of which can provide the basis for hypothesis generation about social and cultural patterning and the dynamics of migration and residential mobility.

    Received 12 March 2011; revision accepted for publication 21 July 2011.

    Literature Cited

    Barrai, I., A. Rodriguez-Larralde, E. Mamolini et al. 2000. Elements of the surname structure of Austria. Annals of Hum. Biol. 27(6):607-622.
    Barrai, I., A. Rodriguez-Larralde, F. Manni et al. 2002. Isonymy and isolation by distance in the Netherlands. Hum. Biol. 74(2):263-283.
    Barrai, I., A. Rodriguez-Larralde, F. Manni et al. 2004. Isolation by language and distance in Belgium. Annals of Human Genetics 68(1):1-16.
    Cavalli-Sforza, L. L., A. Moroni, and G. Zei. 2004. Consanguinity, Inbreeding, and Genetic Drift in Italy. Princeton, NJ: Princeton University Press.
    Cavalli-Sforza, L. L. 2000. Genes, Peoples and Languages. London: Penguin Books.
    Cavalli-Sforza, L. L., and M. Feldman. 1981. Cultural Transmission and Evolution. Princeton, NJ: Princeton University Press.
    Cheshire, J., M. Adnan, and C. Gale. 2011. The use of con[...]
    [...] Geography, by N. Wrigley and R. J. Bennett. Oxford: Routledge.
    Golledge, R. G., and G. Rushton. 1972. Multidimensional scaling: Review and geographical applications. Association of American Geographers, Commission on College Geography, Technical Paper No. 10.
    Gordon, A. 1999. Classification. 2nd edition. London: Chapman and Hall.
    Grotkjær, T., O. Winther, B. Regenberg et al. 2005. Robust multi-scale clustering of large DNA microarray datasets with the consensus algorithm. Bioinformatics 22(1):58-67.
    Hanks, P. 2003. Dictionary of American Family Names. New York: Oxford University Press.
    Hartigan, J. A., and M. A. Wong. 1979. A K-means clustering algorithm. Applied Statistics 28:100-108.
    Kaufman, L., and P. Rousseeuw. 1990. Finding Groups in Data. New York: Wiley-Interscience.
    Kohonen, T. 1990. The self-organizing map. Proceedings of the IEEE 9:1464-1480.
    Lao, O. et al. 2008. Correlation between genetic and geographic structure in Europe. Curr. Biol. 18(16):1241-1248.
    Lasker, G. 1985. Surnames and Genetic Structure. Cambridge: Cambridge University Press.
    Lasker, G. 2002. Using surnames to analyse population structure. In Naming, Society and Regional Identity, D. Postles, ed. Oxford: Leopard's Head Press, 3-24.
    Lasker, G., and C. Mascie-Taylor. 1985. The geographical distribution of selected surnames in Britain model gene frequency clines. Journal of Human Evolution 14:385-392.
    Longley, P. A., J. A. Cheshire, and P. Mateos. 2011. Creating a regional geography of Britain through the spatial analysis of surnames. Geoforum 42(4):506-516.
    Longley, P. A., R. Webber, and D. Lloyd. 2006. The quantitative analysis of family names: Historic migration and the present day neighbourhood structure of Middlesborough, United Kingdom. Annals of the Association of American Geographers 97(1):31-48.
    Manni, F., and I. Barrai. 2001. Genetic structures and linguistic boundaries in Italy: A microregional approach. Hum. Biol. 73(3):335-347.
    Manni, F., W. Heeringa, B. Toupance et al. 2008. Do surname differences mirror dialect variation? Hum. Biol. 80(1):41-64.
    Manni, F., E. Guérard, and E. Heyer. 2004. Geographic patterns of (genetic, morphologic, linguistic) variation: How barriers can be detected by using Monmonier's algorithm. Hum. Biol. 76(2):173-190.
    Manni, F., W. Heeringa, and J. Nerbonne. 2006. To what extent are surnames words? Comparing geographic patterns of surname and dialect variation in the Netherlands. Literary and Linguistic Computing 21(4):507-528.
    Manni, F., B. Toupance, A. Sabbagh et al. 2005. New method for surname studies of ancient patrilineal population structures, and possible application to improvement of Y-Chromosom[...]
    Piazz