icc2013 country names
International Cartographic Conference, Dresden
Variability of country names and identifiers in datasets – Reconciling practical and cultural perspectives
GOVERNMENT AND COMMERCIAL SERVICES THEME
Laura Kostanski | Sara-Jane Farmer | Rob Atkinson
August 2013
Today’s Presentation
• Overview
• Cultural Reasons for Multiple Country Names
• Impact of Cultural Reasons
• Multiple Country Name Datasets
• Reconciling Information
• Spatial Identifier Reference Framework (SIRF) Approach
Overview
• There are multiple country name datasets in use
  • e.g. ISO 3166, UNSTATS, Alexandria Digital Library, CIA Fact Book, UN-FAO
• Multiple stakeholders in creation and use of data using these names
  • e.g. World Bank, Statistics Agencies, Crisis Response and Social Protection Groups.
• Time spent accessing and reconciling data is costly and delays production of results from analysis
• The same issues apply to most, perhaps all, identifiers of spatial objects
• Preview of how we might tackle this problem
Context
CSIRO. UNSDI Gazetteer for Social Protection in Indonesia
Data Analysis
Utopia Way Inc. investigated files in the data.un.org dataset. …
Country names were discovered in multiple fields, such as:
• country of birth
• country of citizenship
• country or area
• country or territory
• country or territory of asylum or residence
• country or territory of origin
• reference area
The investigation identified significant issues with country name alignment and mismatches.
An automated matching process was set up to explore the extent of the issue.
In all, 21,195,188 rows of data were analysed.
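The slides do not describe how the matching process worked. As a minimal sketch of the general idea, assuming a small hand-written reference set (`REFERENCE_NAMES`) and hypothetical row dictionaries rather than the real data.un.org files, a check of this kind might look like:

```python
from collections import Counter

# Hypothetical reference list; a real run would load a full standard such as ISO 3166.
REFERENCE_NAMES = {"Yemen", "Germany", "Indonesia"}

# A subset of the field names reported in the slides as carrying country names.
COUNTRY_FIELDS = ("country of birth", "country or area", "reference area")


def count_mismatches(rows):
    """Count values in country-name fields that are not in the reference list."""
    mismatches = Counter()
    for row in rows:
        for field in COUNTRY_FIELDS:
            value = row.get(field)
            if value is not None and value not in REFERENCE_NAMES:
                mismatches[value] += 1
    return mismatches


# Invented sample rows, for illustration only.
rows = [
    {"country or area": "Yemen"},
    {"country or area": "YEMEN"},
    {"reference area": "Germany, Federal Republic of"},
    {"country of birth": "YEMEN"},
]
result = count_mismatches(rows)
```

Exact string comparison is deliberately strict here: it surfaces every variant spelling so that the variants can be inspected and categorised.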
Common “Errors”
Index error | Examples
Withdrawn countries with no ISO3166 code | “East Timor”, “Czechoslovakia, Czechoslovak Socialist Republic”, “USSR, Union of Soviet Socialist Republics”, “Yemen, Yemen Arab Republic”, “Yemen, Democratic, People's Democratic Republic of”, “Yugoslavia, Socialist Federal Republic of”, “Germany, Federal Republic of”, “German Democratic Republic”, “US Miscellaneous Pacific Islands”, “Wake Island”, “Serbia and Montenegro”.
Abbreviation | “Rep.” for “Republic”, “St.” for “Saint”, “Is.” for “Island”, “Isds” for “Islands”, “&” for “and”.
Added markers | “+” added to the end of region names, to differentiate them from country names. “MDG_” added to region names, e.g. “MDG_Southern Asia”.
Capitalisation | “YEMEN” for “Yemen”, “republic” for “Republic”, “The” for “the”, “the” for “The”.
Brackets | “()” or “[]” instead of commas, e.g. “Virgin Islands (British)” for “British Virgin Islands”.
Standards confusion | The ISO3166 labels “name” and “official_name” were both used in the same datasets (“name” is available for all countries; “official_name” is not).
Use of familiar names | Brunei, Ivory Coast, China, Libya
Issues with character translation | Cote d'Ivoire, Åland Islands, Curaçao, Réunion
Misspellings | Double spaces, trailing spaces, “South Asia” vs “Southern Asia”.
Long names, short names
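Several of the error types listed above (abbreviations, added markers, capitalisation, brackets, stray spaces) are mechanical and could be reduced before matching. The following is a sketch only: the function name `normalise` and the rule set are assumptions for illustration, not the process used in the study.

```python
import re

# Abbreviations reported in the slides; the mapping is illustrative, not exhaustive.
ABBREVIATIONS = {
    "Rep.": "Republic",
    "St.": "Saint",
    "Is.": "Island",
    "Isds": "Islands",
    "&": "and",
}


def normalise(name):
    """Return a case-insensitive comparison key for a country or region name."""
    name = name.strip()
    # Remove the region markers described in the slides.
    if name.startswith("MDG_"):
        name = name[len("MDG_"):]
    name = name.rstrip("+")
    # Rewrite bracketed qualifiers with a comma: "X (Y)" becomes "X, Y".
    name = re.sub(r"\s*[\(\[]([^\)\]]*)[\)\]]", r", \1", name)
    # split() with no argument also collapses double spaces.
    words = [ABBREVIATIONS.get(word, word) for word in name.split()]
    return " ".join(words).casefold()
```

For example, `normalise("  YEMEN ")` and `normalise("Yemen")` produce the same key, as do `normalise("MDG_Southern Asia")` and `normalise("Southern  Asia+")`. Rules like these do not address familiar names, withdrawn countries, word order (“Virgin Islands, British” vs “British Virgin Islands”) or genuine misspellings such as “South Asia”; those need a lookup table or human review.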
Data sets providing country names
Organisation | Name of Data Set
United Nations Statistics Division | Country and Region Codes for Statistical Use
Working Group on Country Names, United Nations Group of Experts on Geographic Names | List of Country Names
Terminology Section, Department for General Assembly and Conference Management | Multilingual Terminology Database (UNTERM)
International Organization for Standardization (ISO) | ISO 3166: Codes for the representation of names of countries and their subdivisions (parts 1, 2 and 3)
Food and Agriculture Organisation of the United Nations | Global Administrative Unit Layers (GAUL)
United Nations Geospatial Information Working Group (UNGIWG) | Second Administrative Level Boundaries (SALB)
National Geospatial Intelligence Agency | Federal Information Processing Standard (FIPS) 10-4: Countries, Dependencies, Areas of Special Sovereignty, and their Principal Administrative Divisions
NATO | Standards Agreement (STANAG) 1059
Two Aspects of Country Name Datasets
1: Development of datasets
Why is there a proliferation of country name sources?
• Cultural issues
• Development practices
2: Usage
How, in a digital age of ‘big data’ analytics and SDIs, can newly emerging technologies such as the Spatial Identifier Reference Framework (SIRF) assist in reducing the ambiguity associated with multiple, heterogeneous country name sources?
• Can we do better? What do we need to do it?
Cultural Issues
• Toponyms provide communities with identity (Toponymic Identity is both reflected and reinforced)
• Country names are the highest-order toponyms
• Problems are similar at lower levels, compounded by scale (size of problem) and higher rates of change (e.g. electoral boundaries, urban growth)
Endonym/Exonym
Beyond an individual’s attachment to the Endonym of their country, there are often multiple Exonyms used in other languages.
e.g. Deutschland (Endonym) = Germany or Allemagne (Exonyms)
Other Cultural Country Naming Considerations
Formal/Informal naming applications (particularly prevalent in the social media world, e.g. ‘Oz’ for Australia)
Political/Non-Political Usage, e.g. ‘Commonwealth of Australia’
Change over time, e.g. Czechoslovakia
Non-standardised international conventions, e.g. Saint or St? The or none?
The Impact
All of these cultural mores impact on the ability of people and organisations to record country name information in a standardised, transparent manner.
Thus, there exists a proliferation of country name lists which are officially promoted by international agencies.
This impact is then intensified in usage.
Options
Suggested improvements to the indices and standards include:
1. Improve access to source data
a. Make the UN’s regions list available as a csv file online, to include withdrawn country codes, assignment dates and withdrawal dates (these are needed to match names for earlier years).
b. Make the UN’s economic status list available as a csv file online.
2. Lobby to improve content
a. ISO to create a region (Africa, West Africa, North America etc.) code standard.
b. ISO to correct inconsistencies in the ISO countries list (e.g. republic not Republic in Bolivia’s name).
3. Policy
a. Make a definitive statement about which GIS naming standard (ISO, UNstats etc) UN online development data should attempt to adhere to.
4. Better citation mechanisms
– Standardised metadata and identifiers that “resolve”, i.e. link back to data
– Shared infrastructure to link all the information together
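Option 1a notes that assignment and withdrawal dates are needed to match names for earlier years. A minimal sketch of how such a file could be used follows. The column names and the two sample rows are assumptions for illustration only; no such UN file layout is given in the slides, and the years should be checked against the published standard.

```python
import csv
import io

# Illustrative rows only; the layout and the years are assumptions, not a UN file.
SAMPLE_CSV = """code,name,assigned,withdrawn
CS,Czechoslovakia,1974,1993
CZ,Czech Republic,1993,
"""


def load_codes(text):
    """Parse the CSV text into records with integer years."""
    records = []
    for row in csv.DictReader(io.StringIO(text)):
        records.append({
            "code": row["code"],
            "name": row["name"],
            "assigned": int(row["assigned"]),
            # An empty withdrawal field means the code is still in force.
            "withdrawn": int(row["withdrawn"]) if row["withdrawn"] else None,
        })
    return records


def codes_valid_in(records, year):
    """Return the codes in force during the given year."""
    return [
        r["code"] for r in records
        if r["assigned"] <= year and (r["withdrawn"] is None or year < r["withdrawn"])
    ]


records = load_codes(SAMPLE_CSV)
```

With these sample rows, `codes_valid_in(records, 1985)` selects the withdrawn code and `codes_valid_in(records, 2000)` selects its successor, which is the kind of period-aware lookup that the withdrawal dates make possible.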
Spatial Identifier Reference Framework
CSIRO has been working with stakeholders, including the UN, national agencies and others, on a set of standards and infrastructure services to support discovering and linking multiple sources of spatial references.
This is being presented in more detail in:
6D.3 Spatial Identifier Reference Framework (SIRF): Realising the potential of SDI Using Spatial Identifiers to Link Multiple Information Systems (#633)
Paul Box, Robert Atkinson, Laura Kostanski
S6-D - SDI
Tuesday, August 27, 2013 04:30 p.m. - 05:45 p.m. - Room: Conference Level - C1
One real world feature: a bus station
Represented in multiple systems using different names, and classified and represented in different ways
Gazetteer | Source | Identifier | Feature Type | Footprint
Terminus Dataset | Department of Transport Bus Terminals | Merak | Terminal | Polygon
Gazetir Indonesia | BIG National Gazetteer of Indonesia | Merak, Stasiun Bis | Transport | Point
Spatial Identifier Reference Framework
Links gazetteers (based on same feature in different gazetteers) used in web applications and other online resources.
Link types shown: “Same as” (between gazetteer entries) and “Used in” (between gazetteer entries and linked resources)
Linked resources shown: Passenger Travel Stats Application, Online Public Transport Map, Navigation application
Currently systems are disconnected and difficult to integrate
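The linking idea in this slide can be sketched as a small in-memory registry. This is not SIRF's data model or API, which the slides do not specify; the class and method names are hypothetical, and the pairing of gazetteer entries with particular applications is assumed because the diagram's arrows are not recoverable from the transcript.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class GazetteerEntry:
    gazetteer: str
    identifier: str
    feature_type: str
    footprint: str


class LinkRegistry:
    """Records 'same as' links between entries and 'used in' links to resources."""

    def __init__(self):
        self._same_as = {}  # entry -> set of equivalent entries (including itself)
        self._used_in = {}  # entry -> set of resource names

    def same_as(self, a, b):
        # Merge the equivalence groups of both entries.
        group = self._same_as.get(a, {a}) | self._same_as.get(b, {b})
        for entry in group:
            self._same_as[entry] = group

    def used_in(self, entry, resource):
        self._used_in.setdefault(entry, set()).add(resource)

    def resources_for(self, entry):
        """All resources that use this entry or any equivalent entry."""
        found = set()
        for equivalent in self._same_as.get(entry, {entry}):
            found |= self._used_in.get(equivalent, set())
        return found


merak = GazetteerEntry("Terminus Dataset", "Merak", "Terminal", "Polygon")
stasiun = GazetteerEntry("Gazetir Indonesia", "Merak, Stasiun Bis", "Transport", "Point")

registry = LinkRegistry()
registry.same_as(merak, stasiun)
# Assumed pairings, for illustration only.
registry.used_in(merak, "Passenger Travel Stats Application")
registry.used_in(stasiun, "Navigation application")
found = registry.resources_for(merak)
```

Once the two entries are linked as the same feature, a query starting from either one reaches the resources attached to both, which is the integration the disconnected systems currently lack.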
Identifiers
This is the “tricky part”
Let’s start with the practical implication…
Catchment | Boundary Area | Geometry
1123343 | 33535.4 | 151.3344,-35.330…….
Catchment | ExtractionRate | Storage
1123343 | 730 | 300
“Distributed” references
The same two tables, held in separate systems and connected across the Internet by the shared catchment identifier (1123343).
How to ask for this entity
How to deliver this entity
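A minimal sketch of what a “distributed” reference implies in practice: one system holds the usage figures, another holds the boundary, and the shared identifier is turned into something that can be asked for and delivered. The base URI, the function names and the dictionary stand-in for a network call are all assumptions; the slides do not specify an identifier scheme. The geometry is omitted because it is truncated in the source.

```python
# Hypothetical base URI; the slides do not specify an identifier scheme.
BASE_URI = "http://example.org/catchment/"

# Two independently managed tables, keyed by the same catchment identifier.
BOUNDARIES = {"1123343": {"area": 33535.4}}
USAGE = {"1123343": {"extraction_rate": 730, "storage": 300}}


def to_uri(catchment_id):
    """How to ask for the entity: build a resolvable identifier."""
    return BASE_URI + catchment_id


def resolve(uri):
    """How to deliver the entity: a stand-in for an HTTP lookup of the boundary record."""
    if not uri.startswith(BASE_URI):
        raise ValueError(f"unknown identifier: {uri}")
    return BOUNDARIES[uri[len(BASE_URI):]]


def describe(catchment_id):
    """Combine the local usage record with the remotely held boundary record."""
    merged = dict(USAGE[catchment_id])
    merged.update(resolve(to_uri(catchment_id)))
    return merged


combined = describe("1123343")
```

The point of the sketch is that neither table needs to copy the other's columns: the identifier alone is enough to fetch the boundary when it is needed.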
URI
Describe
Link
Discover
Provenance
SDI resource access
Thank you