integrating cross-domain metadata

26
Integrating Cross-Domain Metadata Terra Populus experience report Peter Clark [email protected] @pclark Alex Jokela [email protected] @alexjokela

Upload: pete-clark

Post on 29-May-2015

123 views

Category:

Technology


2 download

DESCRIPTION

Slides from #BigDataMN presentation discussing metadata integration in the Terra Populus project at the Minnesota Population Center at the University of Minnesota

TRANSCRIPT

Page 1: Integrating Cross-Domain Metadata

Integrating Cross-Domain Metadata

Terra Populus experience report

Peter Clark [email protected] @pclarkAlex Jokela [email protected] @alexjokela

Page 2: Integrating Cross-Domain Metadata

Overview of Terra Populus

● National Science Foundation○ Award ID: 0940818

● Core goal:Integrate, preserve, and disseminate data describing changes in the human population and environment over time.

● Focus for this talk is on integrating and linking interdisciplinary data○ Challenges for both data and metadata

Page 3: Integrating Cross-Domain Metadata

Motivation: Demographics

Data: Population Reference Bureau

Page 4: Integrating Cross-Domain Metadata

Motivation: Environment

Image credit: United Nations Environmental Programme (via http://climate.nasa.gov/state_of_flux)

Page 5: Integrating Cross-Domain Metadata

Flavors of Data and Metadata

● Four types of data in Terra Populus○ Individual demographic

■ Brazil 1970-2000 (4 points in time)■ Malawi 1987-2009 (3 points in time)

○ Area-based demographic and economic○ Environment and land use

■ Global Landscapes Initiative■ Global Land Cover GLC2000

○ Climate■ WorldClim: 1950-2000

● Three different formats:○ Microdata (survey)○ Area-level (tables)○ Raster (images/bitmaps)

Page 6: Integrating Cross-Domain Metadata

Microdata

● Individual-level survey response data○ Usually grouped into households○ Usually has a couple hundred variables○ Usually doesn't give good geographic locality

due to confidentiality● All censuses tend to ask about similar topics:

○ location, dwelling characteristics, age, sex, occupation, marital status, family size, home ownership, nativity and immigration...

● Each of these is a Variable

Page 7: Integrating Cross-Domain Metadata

US Census 2000 MicrodataH000012712724 27600 51205120 22 34473

P00001270100004501000010600011110000010147031400100005002699901200000 0027010000

P00001270300004503011010170010110000010147050600215007003299901200000 0027010000

P00001270200004519000010540010110000010147051600100008003299901200000 0027010000

H000013412724 27720 51205120 23 2862

P00001340100008023000010200010110000010147010200216011008203202200000 0027010000

H000014112724 27500 51205120 22 22517

P00001410100027001000010540010110000010147010100100009003299901200000 0055010000

P00001410200022602000220440010110000010147010100100009002202602200000 0027010000

P00001410300031503000010210010110000010147050600206011003202202200000 0027010000

P00001410400029303011010170010110000010147050600205007003202202200000 0027010000

P00001410500024703011010130010110000010147050000204004003202202200000 0027010000

Page 8: Integrating Cross-Domain Metadata

US Census 2000 MicrodataH000012712724 27600 51205120 22 34473

P00001270100004501000010600011110000010147031400100005002699901200000 0027010000

P00001270300004503011010170010110000010147050600215007003299901200000 0027010000

P00001270200004519000010540010110000010147051600100008003299901200000 0027010000

Page 9: Integrating Cross-Domain Metadata

Area-level aggregate data

● Demographic and economic characteristics of a location○ Location can mean nation, state, county, city,

metropolitan area○ Might be a political/administrative entity or just a

statistical boundary.● Some aggregate characteristics are scalar:

○ Total Population, GDP, % literacy, average household size

● More often it's a crosstab or table...○ We treat each cell in the table as a variable

Page 10: Integrating Cross-Domain Metadata

Educational Attainment by Sex: Minnesota

Subject Estimate (Total)

Estimate (Male)

Estimate (Female)

Population 18 to 24 years 506,399 258,167 248,232

Less than high school graduate 12.9% 14.2% 11.6%

High school graduate (inc. equivalency) 25.8% 28.3% 23.1%

Some college or associate's degree 50.2% 48.5% 51.9%

Bachelor's degree or higher 11.2% 9.0% 13.4%

Page 11: Integrating Cross-Domain Metadata

Raster data

● Basically, just a picture○ But usually heavily processed

● Geocoded● Pixel = Measurement

○ either direct or interpolated● Type

○ Categorical (such as land use)○ Continuous (temperature, elevation)

Page 12: Integrating Cross-Domain Metadata

Global Land Cover 2000

Page 13: Integrating Cross-Domain Metadata

Metadata Integration - same domain

● Census/Survey○ Different questions○ Different answers○ Different encodings

● Area-level○ Different table structures○ Different geographies○ Different categories

● Raster○ Different alignments○ Different resolutions○ Different processing

Page 14: Integrating Cross-Domain Metadata

Metadata integration - cross-domain

● Semantics are key○ Terms can mean different things to different

disciplines.○ What does "data" mean?

● How do you link across domains?○ What questions are you trying to answer?

Page 15: Integrating Cross-Domain Metadata

Microdata integration

● Old data usually presents recovery, cleanup and parsing issues

● The older the data, the less likely to have a machine readable codebook

● If you're lucky, you get the data as an SPSS or SAS file

Page 16: Integrating Cross-Domain Metadata

Historical Data challenges: ETL

Page 17: Integrating Cross-Domain Metadata

Historical Microdata - metadata

Page 18: Integrating Cross-Domain Metadata

Microdata Integration in TerraPop

● Data structure can vary on many dimensions○ questions/variables○ categories & topcoding/bottomcoding○ universe statements

● We've defined a unified coding scheme● We deal with these using translation tables

Page 19: Integrating Cross-Domain Metadata

Translation Tables for microdataCODE LABEL China 1982 Columbia 1973 Kenya 1989 Mexico 1970 USA 1990

100 Single/Never Married 1=never 4=single 1=single 9=single 6=never

200 Married/In Union

210 Married (Not Specified) 2=married 2=married 2=monogamous 1=married

211 Civil 3=only civil

212 Religious 4=only religious

213 Civil and Religious 2=civil and religious

214 Polygamous 3=polygamous

200 Consensual Union 1=free union 5=free union

300 Separated/Divorced 3=sep/divorced

310 Separated 6=separated 8=separated 3=separated

330 Divorced 4=divorced 5=divorced 7=divorced 4=divorced

400 Widowed 3=widowed 5=widowed 4=widowed 6=widowed 5=widowed

999 Missing/Unknown 0=missing 6=unknown B=Blank 1=unknown

Page 20: Integrating Cross-Domain Metadata

Area-level metadata

Presents all the same problems as microdata, but two points of additional complexity

● Geography○ boundaries, names, and codes change

● Table Structure○ Changes over time in concert with changes in survey

structure and binning

Page 21: Integrating Cross-Domain Metadata

Area-level Metadata in TerraPop

We're cheating!

● Generating our own from microdata○ Minimizes structural differences○ But doesn't eliminate semantic differences

● But we lose out on fine-grain geography

Page 22: Integrating Cross-Domain Metadata

Raster metadata

● Best opportunity for things to be easy○ ISO-19115 standard○ ArcGIS○ GeoTIFF

● Need to integrate:○ Spatial: projection, origin, resolution○ Semantic integration is harder○ Across datasets and within a dataset over time

● Raster data is often heavily processed, and there's no standard for describing that.

Page 23: Integrating Cross-Domain Metadata

Raster metadata...INTERNATIONAL JOURNAL OF CLIMATOLOGY

Int. J. Climatol. 25: 1965–1978 (2005)

Published online in Wiley InterScience (www.interscience.wiley.com). DOI: 10.1002/joc.1276

VERY HIGH RESOLUTION INTERPOLATED CLIMATE SURFACES FOR GLOBAL LAND AREAS

ROBERT J. HIJMANS,a,* SUSAN E. CAMERON,a,b JUAN L. PARRA,a PETER G. JONESc and ANDY JARVISc,d a Museum of Vertebrate Zoology, University of California, 3101 Valley Life Sciences Building, Berkeley, CA, USAb Department of Environmental Science and Policy, University of California, Davis, CA, USA; and Rainforest Cooperative Research Centre, University of Queensland, Australia

c International Center for Tropical Agriculture, Cali, Colombia

d International Plant Genetic Resources Institute, Cali, Colombia

Received 18 November 2004 Revised 25 May 2005 Accepted 6 September 2005

ABSTRACTWe developed interpolated climate surfaces for global land areas (excluding Antarctica) at a spatial resolution of 30 arc s (often referred to as 1-km spatial resolution). The climate elements considered were monthly precipitation and mean, minimum, and maximum temperature. Input data were gathered from a variety of sources and, where possible, were restricted to records from the 1950–2000 period. We used the thin-plate smoothing spline algorithm implemented in the ANUSPLIN package for interpolation, using latitude, longitude, and elevation as independent variables. We quantified uncertainty arising from the input data and the interpolation by mapping weather station density, elevation bias in the weather stations, and elevation variation within grid cells and through data partitioning and cross validation. Elevation bias tended to be negative (stations lower than expected) at high latitudes but positive in the tropics. Uncertainty is highest in mountainous and in poorly sampled areas. Data partitioning showed high uncertainty of the surfaces on isolated islands, e.g. in the Pacific. Aggregating the elevation and climate data to 10 arc min resolution showed an enormous variation within grid cells, illustrating the value of high-resolution surfaces. A comparison with an existing data set at 10 arc min resolution showed overall agreement, but with significant variation in some regions. A comparison with two high-resolution data sets for the United States also identified areas with large local differences, particularly in mountainous areas. Compared to previous global climatologies, ours has the following advantages: the data are at a higher spatial resolution (400 times greater or more); more weather station records were used; improved elevation data were used; and more information about spatial patterns of uncertainty in the data is available. Owing to the overall low density of available climate stations, our surfaces do not capture of all variation that may occur at a resolution of 1 km, particularly of precipitation in mountainous areas. In future work, such variation might be captured through knowledge- based methods and inclusion of additional co-variates, particularly layers obtained through remote sensing. Copyright 2005 Royal Meteorological Society.

Page 24: Integrating Cross-Domain Metadata

Integration in Terra Populus

● Everything is a Variable.

● Deal with each domain separately

● Integration across domains is geospatial○ Tabulate microdata to area-level○ Summarize raster to area-level, based on boundary

files

● Provide output in all three domains○ Different disciplines have different preferences

Page 25: Integrating Cross-Domain Metadata

Observations and Lessons

● Unifying view of data as Variables

● Need a human in the loop:○ Integration is messy, hard & difficult to automate○ Many issues of semantics to sort out

● Mappings let you automate ~80%.● Support hand-coding of special cases.● Useful to have a sense of what the output

should look like for sanity checks.● Simple is good.

Page 26: Integrating Cross-Domain Metadata

Coming this summer!

http://www.terrapop.org@terrapopulus

http://www.ipums.orghttp://www.nhgis.org

http://www.pop.umn.edu