GLOBE: Analytics for Assessing Global Representativeness

Matthew D. Schmill, Lindsey Gordon, Erle Ellis, Nicholas Magliocca, Tim Oates

University of Maryland, Baltimore County

GLOBE: Enhancing Scientific Workflows

The goal: accelerate and improve scientific workflows for land change science

Joint work with Wayne Lutters, Erle Ellis, Tim Oates, Penny Rheingans at University of Maryland, Baltimore County (IS, CSEE, GES)

Supported by NSF’s Cyber-Enabled Discovery & Innovation program

Fourth and final year of the program

Centerpiece is the GLOBE system

Enabling better science through real-time statistical assessments and interactive geovisualization tools

Scientific collaboration platform

Land Change Science

Study of the interaction between human systems, ecosystems, the atmosphere, and other Earth systems as mediated through human use of land

Cross-cuts many disciplines of social and natural science

Typified by this challenge: how to integrate and synthesize local studies into “globalized” results

Though GLOBE is targeted at land change scientists, the concept of representativeness is a very general concern

The GLOBE system is appropriate to any discipline engaged in synthesizing local studies into global results

Representativeness

The degree to which a sample represents a global pattern

A converse to bias: a representative sample is not biased, and a biased sample is not representative

Sampling bias: a typical criticism anywhere that samples are used to make inferences

A land change science example: are you representing only accessible sites?

Accessibility as a measure of travel time to a city (Nelson, 2008)

A measure of representativeness should be

Intuitive, understandable

Statistically sound

Measures of Representativeness

Pearson’s Chi Square: requires the variable space be discrete; unreliable with small sample sizes

Kolmogorov-Smirnov Goodness-of-Fit Test: does not require discrete space; scaling and visualizing beyond 1-D is hard

f-Divergence (Hellinger, Jensen-Shannon): requires discrete variable space
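To make the three measures concrete, here is a minimal sketch in plain Python. The function names and input layouts (bin counts, bin probabilities, raw value lists) are assumptions chosen for illustration; the slides do not show how GLOBE computes these statistics internally.

```python
import bisect
import math


def chi_square_stat(sample_counts, pop_probs):
    """Pearson's chi-square statistic for binned (discrete) data."""
    n = sum(sample_counts)
    return sum((obs - n * p) ** 2 / (n * p)
               for obs, p in zip(sample_counts, pop_probs) if p > 0)


def ecdf(sorted_values, x):
    """Fraction of values <= x; sorted_values must be sorted ascending."""
    return bisect.bisect_right(sorted_values, x) / len(sorted_values)


def ks_distance(sample, population):
    """Largest gap between the two empirical CDFs (no binning needed)."""
    s, p = sorted(sample), sorted(population)
    return max(abs(ecdf(s, x) - ecdf(p, x)) for x in s + p)


def hellinger(sample_probs, pop_probs):
    """Hellinger distance between two discrete distributions, in [0, 1]."""
    total = sum((math.sqrt(a) - math.sqrt(b)) ** 2
                for a, b in zip(sample_probs, pop_probs))
    return math.sqrt(total / 2)
```

The chi-square and Hellinger functions take binned inputs, which is the "discrete variable space" requirement in practice; the KS distance works directly on raw values.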

Probability Estimates

Chi Square: simple

Monte Carlo methods for the rest
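One way a Monte Carlo probability estimate could work for the KS distance is sketched below: draw many random samples of the same size from the population, and count how often a random sample looks at least as unrepresentative as the observed one. The resampling scheme (without replacement), the trial count, and all names are assumptions for illustration, not a description of the GLOBE implementation.

```python
import bisect
import random


def ks_distance(sample, population_sorted):
    """Largest gap between the sample ECDF and the population ECDF."""
    s = sorted(sample)
    n, m = len(s), len(population_sorted)
    return max(abs(bisect.bisect_right(s, x) / n
                   - bisect.bisect_right(population_sorted, x) / m)
               for x in s + population_sorted)


def monte_carlo_p_value(sample, population, trials=500, seed=0):
    """Estimate P(distance >= observed) if the sample were drawn at random."""
    rng = random.Random(seed)          # fixed seed so the estimate is repeatable
    pop = sorted(population)
    observed = ks_distance(sample, pop)
    hits = sum(
        ks_distance(rng.sample(pop, len(sample)), pop) >= observed
        for _ in range(trials)
    )
    # Add-one correction keeps the estimate strictly above zero.
    return (hits + 1) / (trials + 1)
```

A small p-value says that random sampling rarely produces a distance this large, so the sample is probably biased on this variable. The same loop works for any statistic, including an f-divergence, by swapping out `ks_distance`.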

Representativeness

Gives you:

Quick metric for judging level of bias

Basis for comparing samples/sampling methods

A way to compute the probability of incorrectly concluding a sample is biased

Does not give you:

Any guidance on where to look to address sampling bias

Any way to view this geographically

Representedness

The degree to which a location or member of the population is represented by the collection

The complement of representativeness

Useful for visualization and analysis

Heat maps that show geographically where gaps lie

Can be used as a basis for case study search to fill study gaps

Computing Representedness

Get datum for land unit (precipitation): 1573 mm/yr

Locate datum in global distribution

Compute representativeness for that value

Chi Square: p-value of χ² times the sign of the difference between sample and population

KS Distance: difference in ECDF for population versus sample at unit datum
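The KS Distance branch of these steps can be sketched as follows. Only the 1573 mm/yr datum comes from the slide; the precipitation values for the sample and population are invented for illustration, and the sign convention (sample ECDF minus population ECDF) is an assumption, since the slide does not state the direction of the difference.

```python
import bisect


def ecdf(sorted_values, x):
    """Fraction of values <= x; sorted_values must be sorted ascending."""
    return bisect.bisect_right(sorted_values, x) / len(sorted_values)


def representedness_ks(datum, sample, population):
    """Sample ECDF minus population ECDF, evaluated at one land unit's datum."""
    return ecdf(sorted(sample), datum) - ecdf(sorted(population), datum)


# Hypothetical precipitation values in mm/yr (not GLOBE data).
population = [200, 400, 600, 800, 1000, 1200, 1400, 1600, 1800, 2000]
sample = [300, 500, 700, 900]

value = representedness_ks(1573, sample, population)
```

Under this convention a value far from zero means the sample and the population have accumulated very different shares of their mass by the time they reach the land unit's datum.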

Computing Representedness

Get datum for land unit: 49.2 m

Locate datum in global distribution

Compute representativeness for that value

Discrete: p-value of χ² times the sign of the difference between sample and population

Continuous: difference in ECDF for population versus sample at unit datum

Compute RGB (heat map)
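The final step, turning a signed representedness value into a heat map color, could look like the sketch below. The diverging blue-white-red palette and the `limit` parameter are assumptions for illustration; the slides do not specify the color scheme GLOBE uses.

```python
def heat_rgb(value, limit=1.0):
    """Map a signed value in [-limit, limit] to an (r, g, b) tuple of 0-255 ints."""
    t = max(-1.0, min(1.0, value / limit))   # clamp to [-1, 1]
    fade = int(round(255 * (1 - abs(t))))    # 255 at zero, 0 at the extremes
    if t < 0:
        return (fade, fade, 255)   # negative: white fading to blue
    return (255, fade, fade)       # positive: white fading to red
```

Values near zero render as white, so only land units that deviate noticeably from the global distribution stand out on the map.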

Addressing Bias

Study Gap Search

Identify areas where density in the population is significantly higher than in the sample

Search case database using that criterion

Additional criteria available (full-text search, metadata)
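A study gap search along these lines might be sketched as below. The fixed `min_shortfall` threshold is a stand-in for whatever significance criterion GLOBE applies, and the case records, field names, and bin edges are hypothetical.

```python
def find_gaps(sample_counts, pop_probs, min_shortfall=0.05):
    """Indices of bins where the population share exceeds the sample share."""
    n = sum(sample_counts)
    return [i for i, (count, p) in enumerate(zip(sample_counts, pop_probs))
            if p - count / n >= min_shortfall]


def search_cases(cases, edges, gap_bins, variable):
    """Select candidate cases whose value of `variable` falls in a gap bin."""
    hits = []
    for case in cases:
        v = case[variable]
        if any(edges[i] <= v < edges[i + 1] for i in gap_bins):
            hits.append(case)
    return hits


# Hypothetical bins of annual precipitation (mm/yr) and candidate cases.
edges = [0, 500, 1000, 2000, 4000]
gaps = find_gaps([30, 40, 10, 20], [0.25, 0.25, 0.30, 0.20])
candidates = [{"id": "a", "precip": 1573}, {"id": "b", "precip": 300}]
matches = search_cases(candidates, edges, gaps, "precip")
```

Here only the 1000 to 2000 mm/yr bin is under-sampled, so the search returns the one candidate case that falls inside it.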

Case Weighting

Addresses biases in statistical analysis by over-weighting (> 1.0) cases in under-represented areas and under-weighting (< 1.0) cases in over-represented areas

Computed using representedness
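As an illustration of the weighting idea, the sketch below uses a simple ratio of population share to sample share per bin. This ratio is an assumption standing in for the actual formula: the slide says weights are computed from representedness but does not give the calculation.

```python
def case_weights(sample_counts, pop_probs):
    """Weight per bin = population share / sample share (None for empty bins)."""
    n = sum(sample_counts)
    return [p / (count / n) if count else None
            for count, p in zip(sample_counts, pop_probs)]


def weighted_mean(values, weights):
    """Mean of case values after applying per-case weights."""
    return sum(v * w for v, w in zip(values, weights)) / sum(weights)


# Hypothetical bin counts for a sample and matching population shares.
weights = case_weights([30, 40, 10, 20], [0.25, 0.25, 0.30, 0.20])
```

Cases in the third bin (10% of the sample but 30% of the population) get a weight near 3.0, while cases in the second bin (40% versus 25%) are weighted down to about 0.6.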

The GLOBE Application

Our platform for better Land Change Science

By improving workflows

As a social/collaborative platform

Formally introduced to GLP OSM in March 2014

Features

Allows researchers to create and manage case studies and their geometry

Integrates global data layers to augment user cases

Provides real-time analytics and visual tools: similarity search, representativeness analysis

Global Data

Organized into a Discrete Global Grid [Sahr, White, and Kimerling, 2003]

ISEA Aperture 3, Hexagonal

1.5M equal-area hexagons of 96 km² each at resolution 12 (native GLOBE resolution)

Downsampled grid at resolution 10 (863.8 km²) for approximate calculations

Currently 75 global variables; variables can be processed and submitted to GLOBE

Variable categories: human, remote sensing, biological, surface, climate
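The two cell areas quoted above can be checked with a little arithmetic. The sketch assumes the usual cell count for an aperture 3 icosahedral grid, 10 × 3^r + 2 cells at resolution r, and an approximate Earth surface area; both figures are assumptions brought in for the check rather than values from the slides.

```python
EARTH_SURFACE_KM2 = 510_065_622  # approximate total surface area of the Earth


def isea3h_cell_count(resolution):
    """Cells in an aperture 3 icosahedral grid (assumed formula)."""
    return 10 * 3 ** resolution + 2


def isea3h_cell_area_km2(resolution):
    """Mean cell area when the whole globe is divided into equal-area cells."""
    return EARTH_SURFACE_KM2 / isea3h_cell_count(resolution)


area_12 = isea3h_cell_area_km2(12)   # native GLOBE resolution
area_10 = isea3h_cell_area_km2(10)   # downsampled grid
ratio = area_10 / area_12            # two aperture 3 steps, so close to 3 * 3
```

Each step down in resolution multiplies cell area by the aperture, which is why the resolution 10 cells are about nine times the size of the resolution 12 cells.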

GLOBE Cases

GLOBE GES team has georeferenced and entered 630 cases

Currently a total of 927 georeferenced, completed cases

Similarity Assessment

Representativeness Analysis – Monte Carlo

Representativeness Analysis – χ²

Representativeness Analysis – Gap Search

In Summary

Representativeness is an issue anywhere inferences are made from samples

Representedness is a companion piece that enables geovisualization and gap search

Can be implemented many ways: classical hypothesis test (χ²); Monte Carlo methods (f-divergence, KS distance)

GLOBE application enables a representativeness workflow for land change science: real-time assessment and visualizations, gap search and case weighting

In the Pipeline

Multidimensional analysis

Quantifying the impact of data scarcity (small sample size)

Heuristic tools for guiding the user

Improved visual tools

Dimensionality reduction: identifying if and when it is possible

Automated exploratory analysis: helping the user to identify what analysis they should be running

Thanks! Visit us at http://globe.umbc.edu

Representativeness Analysis – KS

Conceptual Overview

Global Data: discrete global grid

GLOBE Cases: geography + data

GLOBE GCE: analytical & computational engine

GLOBE Web App: visual & interactive tools
