Can we track the geographic origin of surnames based on bibliographic data?

Nicolas Robinson-Garcia, Ed Noyons & Rodrigo Costas

15th INTERNATIONAL CONFERENCE ON SCIENTOMETRICS & INFORMETRICS

29 June – 3 July, 2015, Bogazici University, Istanbul, Turkey

EC3metrics spin off CWTS Leiden University

Agenda

o Background

o Bibliographic data

o Method 1. Kullback-Leibler divergence

o Method 2. Concentration Index

o The ‘golden list’

o Next or previous steps

Background

“the use of surnames in human population biology dates back to 1875, when George Darwin used frequency of occurrences of the same surname in married couples to study in-breeding”

Kissin, 2011

WHAT IS IN A SURNAME?

o Proxy for genetic/ethnic origin -> Epidemiology, Biomedical research

o Proxy for country of origin -> Demographic studies, migratory movements

Background

o The representation of Jewish surnames in biomedical journals and US patents

Kissin, 2011; Kissin & Bradley, 2013

o Relation between ethnic mix collaboration and citation impact

Freeman & Huang, 2014

… in the field of bibliometrics

Background

HOW CAN WE DETERMINE THE GEOGRAPHIC ORIGIN OF SURNAMES?

METHODS

o Manually curated lists

o Probability and Bayesian methods

o Clustering techniques

DATA SOURCES

o National census

o Dispersion of sources

o Lack of international coverage

Bibliographic data

o Scientific databases as international surnames data sources

› Regional restrictions
› Temporal restrictions

o Establishing ‘trusted’ linkages between surnames and countries

› Reprint address
› First author-First address
› One country publications
› Author-address linkages (2008)

Some figures:

› 1,568,052 distinct surnames assigned to 119 countries
› France 8.8%; Germany 8.0%; Russia 7.1%; Spain 4.9%
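The ‘trusted’ linkage rules on this slide could be sketched roughly as follows. This is only an illustration: the record layout (`authors`, `addresses`, `reprint`, `links`) and the function names are hypothetical and do not come from the slides or from any real database export.

```python
from collections import Counter

def trusted_pairs(record):
    """Return the (surname, country) pairs of one record that count as 'trusted'.

    Hypothetical record layout (an assumption for this sketch):
      authors   - list of surnames in byline order
      addresses - list of country names in address order
      reprint   - optional (surname, country) tuple of the reprint author
      links     - optional list of explicit (surname, country) links
    """
    authors = record["authors"]
    addresses = record["addresses"]
    pairs = set()

    countries = set(addresses)
    if len(countries) == 1:
        # One-country publication: every author is linked to that country.
        country = next(iter(countries))
        pairs.update((surname, country) for surname in authors)
    elif authors and addresses:
        # Otherwise only the first author is paired with the first address.
        pairs.add((authors[0], addresses[0]))

    # Reprint address, when present.
    if record.get("reprint") is not None:
        pairs.add(record["reprint"])

    # Explicit author-address linkages, when the record provides them.
    pairs.update(record.get("links", []))
    return pairs

def count_pairs(records):
    """Count how often each trusted (surname, country) pair occurs."""
    counts = Counter()
    for record in records:
        counts.update(trusted_pairs(record))
    return counts

# Made-up records, for illustration only.
records = [
    {"authors": ["GARCIA", "NOYONS"],
     "addresses": ["SPAIN", "NETHERLANDS"],
     "reprint": ("NOYONS", "NETHERLANDS")},
    {"authors": ["GARCIA", "HERRERA"],
     "addresses": ["SPAIN", "SPAIN"]},
]
pair_counts = count_pairs(records)
```

The resulting counts per (surname, country) pair are the kind of input both methods below would start from.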

Assumptions

HYPOTHESIS 1

A surname should be assigned to the country where it occurs most frequently.

HYPOTHESIS 2

A surname should be assigned to the country where it is most concentrated.
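The two hypotheses can point to different countries for the same surname. A toy sketch, assuming that "concentration" means the share of a country's authors who carry the surname (the slides do not define it, so this is one possible reading); all numbers are made up:

```python
# Made-up counts of authors carrying one surname, per country.
surname_counts = {"SPAIN": 900, "CUBA": 120, "USA": 400}
# Made-up total number of authors per country.
country_totals = {"SPAIN": 100_000, "CUBA": 5_000, "USA": 400_000}

def by_frequency(counts):
    """Hypothesis 1: the country where the surname occurs most often."""
    return max(counts, key=counts.get)

def by_concentration(counts, totals):
    """Hypothesis 2 (assumed reading): the country where the surname makes up
    the largest share of that country's authors."""
    return max(counts, key=lambda country: counts[country] / totals[country])

frequency_choice = by_frequency(surname_counts)
concentration_choice = by_concentration(surname_counts, country_totals)
```

With these toy numbers the frequency rule picks the large country and the concentration rule picks the small one, which mirrors the kind of disagreement the two methods show later for surnames such as GARCIA.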

Method 1. Kullback-Leibler

OPERATIONALIZATION

A surname will be assigned to a country if 1) it has the highest frequency, and 2) there are “certain levels of assurance”.

METHOD 1

Kullback-Leibler divergence indicates the (dis)similarity of a global surname distribution with its distribution in each country.
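A minimal sketch of how such an assignment might work. The slides do not give the formula or the "levels of assurance", so everything here is an assumption: the divergence compares one surname's distribution over countries with the distribution of all authors over countries, and the `threshold` parameter is a hypothetical stand-in for the assurance level.

```python
import math

def normalise(counts):
    """Turn raw counts into a probability distribution."""
    total = sum(counts.values())
    return {key: value / total for key, value in counts.items()}

def kl_divergence(p, q):
    """D_KL(p || q) in bits; p and q share the same keys and q[k] > 0 wherever p[k] > 0."""
    return sum(p[k] * math.log2(p[k] / q[k]) for k in p if p[k] > 0)

def assign_country(surname_counts, country_totals, threshold):
    """Return the most frequent country if the surname's distribution diverges
    enough from the overall one, else None (no assignment).

    `threshold` is a hypothetical parameter, not a value from the slides.
    """
    p = normalise(surname_counts)
    q = normalise(country_totals)
    if kl_divergence(p, q) < threshold:
        return None
    return max(surname_counts, key=surname_counts.get)

# Made-up counts, for illustration only.
surname_counts = {"SPAIN": 900, "CUBA": 120, "USA": 400}
country_totals = {"SPAIN": 100_000, "CUBA": 5_000, "USA": 400_000}

divergence = kl_divergence(normalise(surname_counts), normalise(country_totals))
assigned = assign_country(surname_counts, country_totals, threshold=0.5)
unassigned = assign_country(surname_counts, country_totals, threshold=2.0)
```

Leaving a surname unassigned when the divergence is too low is what would make the coverage of this method fall below 100%.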

Method 2. Gini Index

OPERATIONALIZATION

A surname will be assigned to a country if it is the one with the highest concentration of such surname.

METHOD 2

Gini Index is an inequality indicator already employed for other purposes in bibliometrics. It weighs, on a scale from 0 to 1, the concentration of a surname in a country.
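For reference, a small sketch of the Gini coefficient itself. The slides do not say which values the index is computed over; applying it to one surname's counts across countries, as below, is an assumption made only for illustration.

```python
def gini(values):
    """Gini coefficient of non-negative values: 0 means perfectly even,
    values near 1 mean everything sits in a single entry."""
    xs = sorted(values)
    n = len(xs)
    total = sum(xs)
    if n == 0 or total == 0:
        return 0.0
    # Closed form on sorted data: G = 2 * sum(i * x_i) / (n * total) - (n + 1) / n
    weighted = sum(rank * x for rank, x in enumerate(xs, start=1))
    return 2 * weighted / (n * total) - (n + 1) / n

# Made-up per-country counts for two surnames.
spread_out = [50, 50, 50, 50]      # same count in every country
concentrated = [0, 0, 0, 10]       # found in one country only

gini_spread = gini(spread_out)
gini_concentrated = gini(concentrated)
```

Note that with n entries the maximum of this formula is (n - 1) / n, so it only approaches 1 as the number of countries grows.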

Kullback-Leibler vs. Gini index

Top 10 countries with the highest number of surnames assigned

Method 1. Kullback-Leibler          Method 2. Gini index
Country        No. surnames         Country        No. surnames
FRANCE         138349               USA            310739
GERMANY        112445               FRANCE         117938
RUSSIA         111716               GERMANY        111375
SPAIN          83529                RUSSIA         94369
USA            76219                ITALY          65699
ITALY          69637                JAPAN          52399
ENGLAND        63885                ENGLAND        47521
JAPAN          56345                CANADA         46146
CANADA         49775                POLAND         44087
NETHERLANDS    41306                INDIA          42897

Kullback-Leibler vs. Gini index

Surname     Method 1. Kullback-Leibler     Method 2. Gini index
CLINTON     USA                            USA
EGGHE       BELGIUM                        BELGIUM
GARFIELD    USA                            USA
HERRERA     SPAIN                          CUBA
GARCIA      SPAIN                          CUBA
EINSTEIN    USA                            ISRAEL
NOYONS      NETHERLANDS                    NETHERLANDS
PEREIRA     BRAZIL                         PORTUGAL

The ‘golden list’

Validating the methods proposed

SEARCHING A ‘GOLDEN LIST’ TO VALIDATE THE RESULTS

o Coverage

o Criteria

› Language
› Ethnicity
› Historical origin

o Reliability and double assignments

The ‘golden list’

Validating the methods proposed

Unified country    Languages
Denmark            Danish
England            Celtic; Anglo-Cornish; English; Scottish; Irish
Finland            Finnish
France             Breton; French
Germany            German
Greece             Greek
Iceland            Icelandic
Italy              Italian
Japan              Japanese
Netherlands        Afrikaans; Dutch
Portugal           Portuguese
Spain              Basque; Catalan; Galician;

In search of a ‘golden list’ of surnames assigned to countries/languages/ethnicities

http://en.wikipedia.org/wiki/Category:Surnames_by_language

The ‘golden list’

               METHOD 1                   METHOD 2
Countries      % coverage   % correct     % coverage   % correct
DENMARK        91.1%        68.75%        100%         60.16%
ENGLAND        28.8%        80.97%        100%         58.56%
FINLAND        99.11%       94.62%        100%         91.96%
FRANCE         88.08%       68.28%        100%         50.54%
GERMANY        52.24%       69.00%        100%         43.78%
GREECE         84.12%       78.32%        100%         78.57%
ICELAND        100.00%      65.52%        100%         100.00%
ITALY          87.65%       86.97%        100%         64.77%
JAPAN          98.74%       98.95%        100%         91.39%
NETHERLANDS    88.11%       60.96%        100%         41.67%
PORTUGAL       98.54%       92.59%        100%         91.91%
SPAIN          93.18%       48.74%        100%         54.74%
Total          73.22%       79.03%        100%         61.29%
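The two columns could be computed along these lines. The definitions are assumptions, since the slides do not spell them out: coverage is taken as the share of golden-list surnames that a method assigned to any country, and % correct as the share of those assigned surnames whose country matches the golden list. The surnames and assignments below are toy values, not results from the study.

```python
def evaluate(assigned, golden):
    """Return (coverage, correct) as fractions between 0 and 1.

    assigned: surname -> country chosen by a method, or None if unassigned
    golden:   surname -> expected country from the golden list
    """
    covered = [s for s in golden if assigned.get(s) is not None]
    matching = [s for s in covered if assigned[s] == golden[s]]
    coverage = len(covered) / len(golden) if golden else 0.0
    correct = len(matching) / len(covered) if covered else 0.0
    return coverage, correct

# Toy golden list and toy method outputs, for illustration only.
golden = {"GARCIA": "SPAIN", "HERRERA": "SPAIN",
          "NOYONS": "NETHERLANDS", "SMITH": "ENGLAND"}
method_1 = {"GARCIA": "SPAIN", "HERRERA": "SPAIN",
            "NOYONS": "NETHERLANDS", "SMITH": None}
method_2 = {"GARCIA": "CUBA", "HERRERA": "CUBA",
            "NOYONS": "NETHERLANDS", "SMITH": "USA"}

coverage_1, correct_1 = evaluate(method_1, golden)
coverage_2, correct_2 = evaluate(method_2, golden)
```

A method that always assigns some country reaches 100% coverage by construction, so its % correct is measured over the whole golden list.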

Next or previous steps

o Is the Web of Science a good sample of the world population?

› Country census crossed with the WoS

o Time frames and migratory movements

› Apply methods to different periods

o Validation and comparison with other techniques

› Bayesian, probability, clustering

o Multiple assignments of countries (e.g., Lee, Santos)

Thank you! [email protected]
