icc2013 country names

International Cartographic Conference, Dresden. Variability of country names and identifiers in datasets – Reconciling practical and cultural perspectives. GOVERNMENT AND COMMERCIAL SERVICES THEME. Laura Kostanski | Sara-Jane Farmer | Rob Atkinson. August 2013.


Page 1: Icc2013 country names

International Cartographic Conference, Dresden

Variability of country names and identifiers in datasets – Reconciling practical and cultural perspectives

GOVERNMENT AND COMMERCIAL SERVICES THEME

Laura Kostanski | Sara-Jane Farmer | Rob Atkinson

August 2013

Page 2: Icc2013 country names

Today’s Presentation

• Overview
• Cultural Reasons for Multiple Country Names
• Impact of Cultural Reasons
• Multiple Country Name Datasets
• Reconciling Information
• Spatial Identifier Reference Framework (SIRF) Approach

Page 3: Icc2013 country names

Overview

• There are multiple country name datasets in use
• e.g. ISO 3166, UNSTATS, Alexandria Digital Library, CIA Fact Book, UN-FAO

• Multiple stakeholders in creation and use of data using these names
• e.g. World Bank, Statistics Agencies, Crisis Response and Social Protection Groups.

• Time spent accessing and reconciling data is costly and delays production of results from analysis

• The same issues apply to most, perhaps all, identifiers of spatial objects

• Preview of how we might tackle this problem

Page 4: Icc2013 country names

Context

CSIRO. UNSDI Gazetteer for Social Protection in Indonesia

Page 5: Icc2013 country names

Data Analysis

Utopia Way Inc. investigated files in the data.un.org dataset. …

Country names were discovered in multiple fields, such as:

• country of birth
• country of citizenship
• country or area
• country or territory
• country or territory of asylum or residence
• country or territory of origin
• reference area

The investigation also identified significant issues with country name alignments and mismatches.

An automated matching process was set up to explore the extent of the issue.

In all, 21,195,188 rows of data were analysed.
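A minimal sketch of what such an automated matching pass might look like, assuming rows arrive as dictionaries and a reference set of accepted names is available. The field list comes from the slide; `COUNTRY_FIELDS`, `REFERENCE_NAMES` and `tally_unmatched` are hypothetical names, and exact string comparison is an assumption, not a description of the process actually used.

```python
from collections import Counter

# Field names taken from the slide, lower-cased for comparison.
COUNTRY_FIELDS = {
    "country of birth",
    "country of citizenship",
    "country or area",
    "country or territory",
    "country or territory of asylum or residence",
    "country or territory of origin",
    "reference area",
}

# Hypothetical reference list; a real run would load a complete standard list.
REFERENCE_NAMES = {"Yemen", "Germany", "Indonesia"}


def tally_unmatched(rows, reference=REFERENCE_NAMES):
    """Count values in country-like fields that are not in the reference set."""
    unmatched = Counter()
    for row in rows:
        for field, value in row.items():
            if field.strip().lower() in COUNTRY_FIELDS and value not in reference:
                unmatched[value] += 1
    return unmatched


# Illustrative rows only, not taken from data.un.org.
rows = [
    {"Country or Area": "Yemen", "Value": "1"},
    {"Country or Area": "YEMEN", "Value": "2"},
    {"Reference Area": "Germany, Federal Republic of", "Value": "3"},
]
print(tally_unmatched(rows))
```

Counting the unmatched values, rather than only flagging them, gives a rough picture of which variants are frequent enough to be worth a dedicated rule.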

Page 6: Icc2013 country names

Common “Errors”

Index error | Examples
Withdrawn countries with no ISO3166 code | “East Timor”, “Czechoslovakia, Czechoslovak Socialist Republic”, “USSR, Union of Soviet Socialist Republics”, “Yemen, Yemen Arab Republic”, “Yemen, Democratic, People's Democratic Republic of”, “Yugoslavia, Socialist Federal Republic of”, “Germany, Federal Republic of”, “German Democratic Republic”, “US Miscellaneous Pacific Islands”, “Wake Island”, “Serbia and Montenegro”.
Abbreviation | “Rep.” for “Republic”, “St.” for “Saint”, “Is.” for “Island”, “Isds” for “Islands”, “&” for “and”.
Added markers | “+” added to the end of region names, to differentiate them from country names. “MDG_” added to region names, e.g. “MDG_Southern Asia”.
Capitalisation | “YEMEN” for “Yemen”, “republic” for “Republic”, “The” for “the”, “the” for “The”.
Brackets | “()” or “[]” instead of commas, e.g. “Virgin Islands (British)” for “British Virgin Islands”.
Standards confusion | The ISO3166 labels “name” and “official_name” were both used in the same datasets (“name” is available for all countries; “official_name” is not).
Use of familiar names | Brunei, Ivory Coast, China, Libya
Issues with character translation | Cote d'Ivoire, Åland Islands, Curaçao, Réunion
Misspellings | Double spaces, trailing spaces, “South Asia” vs “Southern Asia”.
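Several of these error classes are mechanical and could be handled by a normalisation step before matching. The following is a sketch under stated assumptions: the rules mirror the categories on the slide, `normalise` and `ABBREVIATIONS` are hypothetical names, rewriting bracketed qualifiers as comma forms is one possible convention, and the inputs in the usage lines are illustrative.

```python
import re

# Abbreviations listed on the slide, mapped to their expansions.
ABBREVIATIONS = [
    (r"\bRep\.", "Republic"),
    (r"\bSt\.", "Saint"),
    (r"\bIsds\b\.?", "Islands"),
    (r"\bIs\.", "Island"),
    (r"\s*&\s*", " and "),
]


def normalise(name):
    """Apply simple, rule-based clean-up to a country or region name."""
    name = re.sub(r"\s+", " ", name).strip()   # double and trailing spaces
    name = re.sub(r"^MDG_", "", name)          # "MDG_" region prefix
    name = name.rstrip("+").strip()            # "+" region suffix
    for pattern, expansion in ABBREVIATIONS:
        name = re.sub(pattern, expansion, name)
    # Assumed convention: "Name (Qualifier)" becomes "Name, Qualifier".
    name = re.sub(r"\s*[\(\[]([^\)\]]+)[\)\]]", r", \1", name)
    name = re.sub(r"\s+", " ", name).strip()
    if name.isupper():                         # e.g. "YEMEN"
        name = name.title()
    return name


print(normalise("YEMEN "))
print(normalise("Virgin Islands (British)"))
print(normalise("MDG_Southern Asia"))
```

Rules like these do not help with withdrawn countries, familiar names or genuine misspellings, which need a lookup table or human review.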

Page 7: Icc2013 country names

Long names, short names

Page 8: Icc2013 country names

Data sets providing country names

Organisation | Name of Data Set
United Nations Statistics Division | Country and Region Codes for Statistical Use
Working Group on Country Names, United Nations Group of Experts on Geographic Names | List of Country Names
Terminology Section, Department for General Assembly and Conference Management | Multilingual Terminology Database (UNTERM)
International Organization for Standardization (ISO) | ISO 3166: Codes for the representation of names of countries and their subdivisions (parts 1, 2 and 3)
Food and Agriculture Organisation of the United Nations | Global Administrative Unit Layers (GAUL)
United Nations Geospatial Information Working Group (UNGIWG) | Second Administrative Level Boundaries (SALB)
National Geospatial Intelligence Agency | Federal Information Processing Standard (FIPS) 10-4: Countries, Dependencies, Areas of Special Sovereignty, and their Principal Administrative Divisions
NATO | Standards Agreement (STANAG) 1059

Page 9: Icc2013 country names

Two Aspects of Country Name Datasets

1: Development of datasets

Why is there a proliferation of country name sources?

• Cultural issues

• Development practices

2: Usage

How, in a digital age of ‘big data’ analytics and SDIs, can newly emerging technologies such as the Spatial Identifier Reference Framework (SIRF) assist in reducing the ambiguity associated with multiple, heterogeneous country name sources?

• Can we do better? What do we need to do it?

Page 10: Icc2013 country names

Cultural Issues

• Toponyms provide communities with identity (Toponymic Identity is both reflected and reinforced)

• Country names are the highest-order toponyms

• Problems are similar at lower levels, compounded by scale (size of problem) and higher rates of change (e.g. electoral boundaries, urban growth)

Page 11: Icc2013 country names

Endonym/Exonym

Beyond an individual’s attachment to the Endonym of their country, other languages often use multiple Exonyms for the same country.

e.g. Deutschland (Endonym) = Germany or Allemagne (Exonyms)
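One way to picture the endonym/exonym relationship in data terms is a lookup that maps every known name, in any language, to a single stable code. This is an illustrative sketch only: the `NAMES` structure, `build_index` and `resolve` are hypothetical, and the single entry uses the example from the slide with the ISO 3166-1 alpha-2 code for Germany.

```python
# Illustrative entry only: one country, keyed by a stable code,
# with its endonym and exonyms per language.
NAMES = {
    "DE": {
        "endonym": "Deutschland",
        "exonyms": {"en": "Germany", "fr": "Allemagne"},
    },
}


def build_index(names):
    """Map every endonym and exonym (case-insensitively) to its code."""
    index = {}
    for code, entry in names.items():
        index[entry["endonym"].casefold()] = code
        for exonym in entry["exonyms"].values():
            index[exonym.casefold()] = code
    return index


INDEX = build_index(NAMES)


def resolve(name):
    """Return the code for a name, or None if the name is unknown."""
    return INDEX.get(name.strip().casefold())


print(resolve("Allemagne"))
print(resolve("Oz"))
```

Informal names such as ‘Oz’ return `None` here until someone adds them, which makes the gaps in a name list explicit.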

Page 12: Icc2013 country names

Other Cultural Country Naming Considerations

Formal/Informal naming applications
(particularly prevalent in the social media world, e.g. ‘Oz’ for Australia)

Political/Non-Political Usage
e.g. ‘Commonwealth of Australia’

Change over time
e.g. Czechoslovakia

Non-standardised international conventions
e.g. Saint or St? The or none?

Page 13: Icc2013 country names

The Impact

All of these cultural mores impact on the ability of people and organisations to record country name information in a standardised, transparent manner.

Thus, there exists a proliferation of country name lists which are officially promoted by international agencies.

This impact is then intensified in usage.

Page 14: Icc2013 country names

Options

Suggested improvements to the indices and standards include:

1. Improve access to source data
a. Make the UN’s regions list available as a csv file online, to include withdrawn country codes, assignment dates and withdrawal dates (these are needed to match names for earlier years).
b. Make the UN’s economic status list available as a csv file online.

2. Lobby to improve content
a. ISO to create a region (Africa, West Africa, North America etc.) code standard.
b. ISO to correct inconsistencies in the ISO countries list (e.g. republic not Republic in Bolivia’s name).

3. Policy
a. Make a definitive statement about which GIS naming standard (ISO, UNstats etc) UN online development data should attempt to adhere to.

4. Better citation mechanisms
– Standardised metadata and identifiers that “resolve” (i.e. links back to data)
– Shared infrastructure to link all the information together
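To illustrate why assignment and withdrawal dates matter for option 1a, the sketch below looks up which code was valid for a name in a given year. The CSV layout, the column names and the `load_codes` and `codes_valid_in` helpers are all assumptions for illustration; the three rows are illustrative values, not an authoritative extract of any published list.

```python
import csv
import io

# Hypothetical CSV layout; values are illustrative, not an authoritative extract.
SAMPLE_CSV = """code,name,assigned,withdrawn
CS,Czechoslovakia,1974,1993
CZ,Czech Republic,1993,
SK,Slovakia,1993,
"""


def load_codes(text):
    """Parse the CSV text into rows with integer years (withdrawn may be None)."""
    rows = []
    for row in csv.DictReader(io.StringIO(text)):
        rows.append({
            "code": row["code"],
            "name": row["name"],
            "assigned": int(row["assigned"]),
            "withdrawn": int(row["withdrawn"]) if row["withdrawn"] else None,
        })
    return rows


def codes_valid_in(rows, name, year):
    """Return codes whose name matches and whose validity period covers `year`."""
    return [
        r["code"] for r in rows
        if r["name"] == name
        and r["assigned"] <= year
        and (r["withdrawn"] is None or year < r["withdrawn"])
    ]


CODES = load_codes(SAMPLE_CSV)
print(codes_valid_in(CODES, "Czechoslovakia", 1985))
print(codes_valid_in(CODES, "Czechoslovakia", 2000))
```

Without the withdrawal date, a record dated 1985 and labelled “Czechoslovakia” has nothing in a current-only list to match against.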

Page 15: Icc2013 country names

Spatial Identifier Reference Framework

CSIRO has been working with stakeholders, including the UN, national agencies and others, on a set of standards and infrastructure services to support discovering and linking multiple sources of spatial references.

This is being presented in more detail in:

6D.3 Spatial Identifier Reference Framework (SIRF): Realising the potential of SDI Using Spatial Identifiers to Link Multiple Information Systems (#633)

Paul Box 1, Robert Atkinson 1, Laura Kostanski 2

S6-D - SDI

Tuesday, August 27, 2013 04:30 p.m. - 05:45 p.m. - Room: Conference Level - C1

Page 16: Icc2013 country names

One real world feature: a bus station

Represented in multiple systems using different names, and classified and represented in different ways

Gazetteer | Identifier (Gazetteer Entry) | Feature Type | Footprint
Terminus Dataset | Merak | Terminal | Polygon
Gazetir Indonesia | Merak, Stasiun Bis | Transport | Point

Sources shown: BIG National Gazetteer of Indonesia; Department of Transport Bus Terminals

Spatial Identifier Reference Framework: links gazetteers (based on same feature in different gazetteers) used in web applications and other online resources.

Linked resources: Passenger Travel Stats Application, Online Public Transport Map, Navigation application

Relationships shown: “Same as” (between gazetteer entries), “Used in” (between gazetteer entries and linked resources)

Currently systems are disconnected and difficult to integrate
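The linking idea on this slide can be sketched as a small data model: gazetteer entries, “same as” links between entries, and “used in” links from entries to online resources. This is not the SIRF implementation; `GazetteerEntry`, `LinkSet` and their methods are hypothetical names, and the assignment of each linked resource to a particular entry is an assumption, since the transcript does not say which application uses which gazetteer.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class GazetteerEntry:
    gazetteer: str
    identifier: str
    feature_type: str
    footprint: str


@dataclass
class LinkSet:
    same_as: set = field(default_factory=set)    # unordered pairs of entries
    used_in: dict = field(default_factory=dict)  # resource name -> entry

    def link_same(self, a, b):
        self.same_as.add(frozenset((a, b)))

    def link_resource(self, resource, entry):
        self.used_in[resource] = entry

    def equivalents(self, entry):
        """Entries directly declared to be the same feature as `entry`."""
        out = set()
        for pair in self.same_as:
            if entry in pair:
                out |= pair - {entry}
        return out

    def resources_for_feature(self, entry):
        """Resources using `entry` or any entry declared the same as it."""
        group = {entry} | self.equivalents(entry)
        return sorted(r for r, e in self.used_in.items() if e in group)


terminus = GazetteerEntry("Terminus Dataset", "Merak", "Terminal", "Polygon")
gazetir = GazetteerEntry("Gazetir Indonesia", "Merak, Stasiun Bis", "Transport", "Point")

links = LinkSet()
links.link_same(terminus, gazetir)
# Assumed pairings of resources to entries, for illustration only.
links.link_resource("Passenger Travel Stats Application", terminus)
links.link_resource("Online Public Transport Map", gazetir)
links.link_resource("Navigation application", gazetir)

print(links.resources_for_feature(terminus))
```

Once the two entries are linked, a query starting from either name reaches all three resources, which is the integration the disconnected systems lack.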

Page 17: Icc2013 country names

Identifiers

This is the “tricky part”

Let’s start with the practical implication…

Catchment Boundary Area Geometry

1123343 33535.4 151.3344,-35.330…….

Catchment ExtractionRate Storage

1123343 730 300

Page 18: Icc2013 country names

“Distributed” references

Catchment ExtractionRate Storage

1123343 730 300

Internet

How to ask for this entity

How to deliver this entity

Catchment Boundary Area Geometry

1123343 33535.4 151.3344,-35.330…….
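The two catchment tables share only the identifier 1123343, so that identifier is what has to be asked for and delivered across systems. The sketch below, under stated assumptions, turns the identifier into a URI and merges what each source holds about it. `BASE_URI` is a hypothetical namespace, the column interpretation (area, extraction rate, storage) is an assumption, the lookup is in memory rather than over the Internet, and the geometry string is left truncated as on the slide.

```python
# Records held by two different systems, sharing only the catchment identifier.
# The geometry value is truncated in the source and is kept as an opaque string.
BOUNDARIES = {"1123343": {"area": 33535.4, "geometry": "151.3344,-35.330..."}}
USAGE = {"1123343": {"extraction_rate": 730, "storage": 300}}

BASE_URI = "http://example.org/id/catchment/"  # hypothetical namespace


def to_uri(identifier):
    """How to ask for the entity: a URI built from the local identifier."""
    return BASE_URI + identifier


def resolve(uri, sources):
    """How to deliver the entity: merge what each source knows about it."""
    if not uri.startswith(BASE_URI):
        raise ValueError("URI is outside the known namespace: " + uri)
    identifier = uri[len(BASE_URI):]
    merged = {"id": identifier}
    for source in sources:
        merged.update(source.get(identifier, {}))
    return merged


print(resolve(to_uri("1123343"), [BOUNDARIES, USAGE]))
```

In a distributed setting each source would sit behind its own service, but the identifier would play the same role as the join key.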

Page 19: Icc2013 country names

One real world feature: a bus station

[Diagram repeated from Page 16, with added labels: URI, Describe, Link, Discover, Provenance, SDI resource access]

Page 20: Icc2013 country names

Thank you

For more [email protected]
