

IPUMS to IHSN: Leveraging structured metadata for discovering multi-national census and survey data

Wendy L. Thomas

4th Conference of the European Survey Research Association (ESRA)

Session 2.10 The role and benefit of structured metadata in survey research II


Outline

The Project:
  Creation of DDI-compliant metadata from the contents of a proprietary database
  Project for the World Bank supported by funding from PARIS21

The Source:
  IPUMS International database of metadata

The Target:
  IHSN Microdata Toolkit [DDI Codebook based]

The Process:
  Mapping a proprietary database to the DDI Codebook standard

The Results:
  Identification of improvements for IPUMS
  Limitations of the IHSN Microdata Toolkit not attributable to DDI
  Limitations of distribution metadata for supporting on-the-ground processes

2011-07-20


Project

The project was to extract metadata from the IPUMS database system and to create IHSN Microdata Toolkit compatible metadata for the original data files and for the IPUMS extracts of data for developing countries.

IPUMS contains metadata and data for 106 censuses from developing countries worldwide.

Goals:
  Facilitate access to existing full data sets within the IPUMS member country
  Populate a number of microdata catalogs based on NADA


Integrated Public Use Microdata Series, International

Census microdata for social and economic research (62 countries, 185 censuses, 397 million person records)

IPUMS-International is a project dedicated to collecting and distributing census microdata from around the world. Its purpose is to:
  Collect and preserve data and documentation
  Harmonize data
  Disseminate the data to registered researchers at no cost

https://international.ipums.org/international/


IPUMS Metadata

Extensive, detailed metadata held as original documents, structured text files, and a structured database

Basic division is between metadata to:
  Inform the user
  Drive the harmonization and/or search and extraction engine

Retains the original metadata for all included original variables:
  Description
  Source Question
  Enumerator Instructions

Adds information on harmonization process and variables
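As a rough illustration of how the retained variable-level metadata could be carried into DDI Codebook, the sketch below builds a `<var>` element with Python's standard library. The element names (`var`, `labl`, `qstn`, `qstnLit`, `ivuInstr`) come from DDI Codebook 2.x; the dictionary keys and the sample record are hypothetical and do not reflect the actual IPUMS database schema or the project's code.

```python
import xml.etree.ElementTree as ET


def variable_to_ddi(var: dict) -> ET.Element:
    """Map one variable record (hypothetical keys) to a DDI Codebook <var> element."""
    elem = ET.Element("var", {"ID": var["id"], "name": var["name"]})
    # Variable description becomes the variable label.
    ET.SubElement(elem, "labl").text = var["description"]
    # Source question and enumerator instructions live under <qstn>.
    qstn = ET.SubElement(elem, "qstn")
    ET.SubElement(qstn, "qstnLit").text = var["source_question"]
    ET.SubElement(qstn, "ivuInstr").text = var["enumerator_instructions"]
    return elem


# Made-up sample record, for illustration only.
record = {
    "id": "V1",
    "name": "AGE",
    "description": "Age in years",
    "source_question": "How old was this person at their last birthday?",
    "enumerator_instructions": "Record age in completed years.",
}
xml_text = ET.tostring(variable_to_ddi(record), encoding="unicode")
print(xml_text)
```

A real export would also need the enclosing `codeBook` and `dataDscr` elements and whatever additional conventions the IHSN Microdata Toolkit expects, which are not shown here.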


IHSN Microdata Toolkit

Developed by the World Bank Data Group for the IHSN to promote the adoption of international standards and best practices for microdata documentation, dissemination and preservation.

Consists of 3 modules:
  Microdata Editor: Documents using DDI-Codebook and Dublin Core standards
  Explorer: Views metadata and exports to common formats
  CD-ROM Builder: Creates user-friendly CD-ROM and web products

http://ihsn.org/home/index.php?q=tools/toolkit


Process


The project focused on metadata within the IPUMS database system that could be mapped out to generate an IHSN Microdata Toolkit compatible set of metadata.

The desired output consisted of:
  A set of Dublin Core descriptions of related technical documents and copies of those documents
  A set of metadata for the original country data
  A set of metadata and data for the content of the public extract system (harmonized and country-specific variables)

All outputs would be delivered to the country NSO for their use

Metadata content and document copies would be provided to IHSN for inclusion in their catalog
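To make the first output more concrete, here is a minimal sketch of a Dublin Core description for one technical document. The namespace URI is the standard Dublin Core element set; the `metadata` wrapper element, the helper name, and the sample values are assumptions for illustration, not the format the project delivered.

```python
import xml.etree.ElementTree as ET

DC_NS = "http://purl.org/dc/elements/1.1/"
ET.register_namespace("dc", DC_NS)


def dublin_core_record(fields: dict) -> ET.Element:
    """Build a simple Dublin Core record; the wrapper element name is an assumption."""
    root = ET.Element("metadata")
    for name, value in fields.items():
        # Each key must be a Dublin Core element name such as "title" or "type".
        ET.SubElement(root, f"{{{DC_NS}}}{name}").text = value
    return root


# Illustrative values only; the file name is hypothetical.
doc = dublin_core_record({
    "title": "Enumerator manual (illustrative)",
    "type": "Text",
    "language": "en",
    "identifier": "enumerator_manual.pdf",
})
dc_xml = ET.tostring(doc, encoding="unicode")
print(dc_xml)
```

Keeping the document copy's file name in `dc:identifier` is one way to tie the description to the delivered copy; other linking schemes would work equally well.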


IPUMS Process: Creating the Database


STEP 1: Incoming data and metadata
  Prepare data for entry into IPUMS harmonization process and database

Data arrives in non-specific formats and reflects the original structure of the data at the NSO or a standard extract format

Metadata arrives in non-specific format with inconsistent content coverage (may or may not have English language equivalents)


STEP 2: Intermediary data and metadata
  Clean and restrict for confidentiality
  Harmonize
  Establish metadata links

Data is selected and reformatted into household and person records; IPUMS-generated variables are added for processing

Basic metadata is entered into a spreadsheet, including: original name, description, IPUMS name, translations, frequency (generated), and all extant content types (including errors).

Creation of database
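The spreadsheet row described above can be pictured as a small record with a generated frequency table. The sketch below is only an illustration: the class, field names, and sample codes are hypothetical, and counting every observed code (including unexpected ones) is one plausible reading of "all extant content types (including errors)".

```python
from collections import Counter
from dataclasses import dataclass, field


@dataclass
class VariableRow:
    """One spreadsheet row of basic metadata (hypothetical field names)."""
    original_name: str
    description: str
    ipums_name: str
    translation: str
    frequency: dict = field(default_factory=dict)


def generate_frequency(values):
    """Count every code that occurs in the data, including undocumented or erroneous ones."""
    return dict(Counter(values))


# Made-up example: code "9" is not a documented category but is still counted.
row = VariableRow("P03", "Sexo", "SEX", "Sex")
row.frequency = generate_frequency(["1", "2", "2", "1", "2", "9"])
print(row)
```

Counting undocumented codes rather than discarding them keeps the evidence of data errors alongside the descriptive metadata.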


STEP 3: IPUMS
  Clean, confidential, harmonized data

Metadata includes links to original data, questions, instructions, and transformation processes

Original data structures have not been retained in the database; it includes only the cleaned, restricted content and harmonized variables created by IPUMS


Process Constraints


Working from the IPUMS database we were only able to construct microdata from STEP 2 and STEP 3
  Linking back to original metadata and data structure would require individualized processing (entry and linking)

Had we been able to construct metadata for STEP 1 it would not have been IHSN Microdata Toolkit compatible, just DDI-Codebook compatible
  Linking information held as file names
  Alphabetic codes
  Person records only (households required construction)
  Multi-record relational database structures for data
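The note that households "required construction" from person-only files can be sketched as a grouping step. This assumes each person record carries a household identifier, which the slides do not state; the key names and sample records are hypothetical.

```python
from collections import defaultdict


def build_households(persons):
    """Derive one household record per household ID found in the person records."""
    grouped = defaultdict(list)
    for person in persons:
        # Assumption: every person record includes a household identifier.
        grouped[person["hh_id"]].append(person)
    return [
        {"hh_id": hh_id, "n_persons": len(members)}
        for hh_id, members in grouped.items()
    ]


# Made-up person records for illustration.
persons = [
    {"hh_id": "H1", "person_no": 1, "age": 41},
    {"hh_id": "H1", "person_no": 2, "age": 39},
    {"hh_id": "H2", "person_no": 1, "age": 67},
]
households = build_households(persons)
print(households)
```

Where no household identifier exists in the source file, this construction would need file-specific rules, which is consistent with the individualized processing noted above.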


Results


Modified products delivered

Within 24 hours of submission, all metadata files were loaded into the IHSN NADA-based catalog and were searchable down to the variable level in that system

The MPC is now capable of creating DDI/IHSN/NESSTAR compatible codebooks for all extracts on all microdata systems

Countries can create full extracts of their census data to match the intermediary structure (household/person record) found in the metadata, import them into the Microdata Toolkit, and explore or create desired output easily

Countries can easily compare or create products from the harmonized data from their country or download additional data from the IPUMS International system for cross-national comparison


Insights


The IPUMS database captured all required metadata elements either directly or in related structured files
  More information should be moved to the database
  Informational items that could be structured (e.g., dates) should have both unstructured and structured content
  Database should be expanded to capture input structures and retain the full process metadata from Step 1 to Step 3 (for curation as well as process purposes)
  Simplifies the shift to DDI Lifecycle output by simple addition of comparative information

The IHSN Microdata Toolkit is limited in its ability to support legacy data and legacy data systems on the ground
  Most limitations are due to software rather than standards constraints
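The point about holding dates in both unstructured and structured form can be sketched as a small record that keeps the original text and, where it parses, a machine-readable date. The class, function, and accepted formats below are assumptions chosen for illustration.

```python
from dataclasses import dataclass
from datetime import date, datetime
from typing import Optional


@dataclass
class DateField:
    """Keeps the text as received plus a parsed date when one can be derived."""
    original_text: str
    structured: Optional[date]


def capture_date(text, formats=("%Y-%m-%d", "%d/%m/%Y")):
    """Try each assumed format in turn; keep the original text either way."""
    for fmt in formats:
        try:
            return DateField(text, datetime.strptime(text, fmt).date())
        except ValueError:
            continue
    # Nothing matched: the unstructured text is still preserved.
    return DateField(text, None)


census_day = capture_date("15/03/2001")
unparsed = capture_date("spring of 2001")
print(census_day, unparsed)
```

Storing both forms means nothing from the source documentation is lost, while the structured value can populate date elements in an exported codebook.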


Distribution and Process Metadata


IPUMS, like most metadata databases, contains the majority of required objects for supporting distribution metadata in a DDI format (Codebook or Lifecycle)

The process of mapping to a detailed standard raised specific issues and highlighted possible improvements to the database content and structure

Brought to light the need to capture more metadata at an earlier point in the process
  Metadata produced outside the database structure was lost
  Object-level version control was missing, so intermediary content was lost
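One way to picture object-level version control is an append-only store in which each save adds a new version instead of overwriting the previous one. This is a generic sketch, not a description of any IPUMS system; the class and method names are hypothetical.

```python
class VersionedStore:
    """Append-only store: every save of an object adds a version, nothing is overwritten."""

    def __init__(self):
        self._history = {}

    def save(self, object_id, content):
        """Record a copy of the content and return its 1-based version number."""
        versions = self._history.setdefault(object_id, [])
        versions.append(dict(content))
        return len(versions)

    def get(self, object_id, version=None):
        """Return the latest version, or a specific earlier one if requested."""
        versions = self._history[object_id]
        return versions[-1] if version is None else versions[version - 1]


store = VersionedStore()
first = store.save("AGE", {"label": "Edad"})  # intermediary content
second = store.save("AGE", {"label": "Age"})  # later revision
print(first, second, store.get("AGE", 1), store.get("AGE"))
```

With a structure like this, the intermediary label remains retrievable after the revision, which is the kind of content the slide reports as lost.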


Conclusions


The ability to create distribution metadata in a standard format increases utility for users, provides new product options, and provides an easy means of populating external catalogs

The process of mapping to a well-formed standard provides an opportunity for content and process improvement

Capture of incoming metadata is critical to shifting from a distribution metadata structure to a process metadata structure

Process structures can not only capture the development of the data within the system but can also be used to drive the process and the data and metadata checks


Contacts


Wendy Thomas – Project manager [email protected]

Benjamin Youngdahl – Programmer [email protected]