Data Intensive Challenges in Biodiversity Conservation
Steve Kelling
Environmental Science Challenges
• Climate Change
• Biodiversity Loss
• Invasive Species
• Water Depletion
• Disease Spread
• Green Energy
• Habitat Loss
Habitat Loss
From: University of California Press Blog Earth Day 2010
Habitat loss is the major issue for Biodiversity Conservation.
The increasing availability of massive volumes of scientific data requires new synthetic analysis techniques to explore and identify interesting patterns that were otherwise not apparent. For biodiversity studies a “data driven” approach is necessary due to the complexity of ecological systems, particularly when viewed at large spatial and temporal scales.
Presentation Goals:
• Observation Networks: Description of eBird: http://www.ebird.org
• Species Distribution Models: Description of the Avian Knowledge Network: http://avianknowledge.net
• Data Intensive Science: Description of the outcomes of the DataONE Exploration, Visualization, and Analysis Working Group
eBird is a global online program that gathers bird observations from citizen scientists, predominantly across the Western Hemisphere. eBird gathers checklists of birds with associated effort information from well-defined locations, passing each record through a two-tiered verification system.
eBird is a joint project between the Cornell Lab of Ornithology and the National Audubon Society, and has more than two dozen regional partners.
Sullivan, B.L., C.L. Wood, M.J. Iliff, R.E. Bonney, D. Fink, and S. Kelling. 2009. eBird: A citizen-based bird observation network in the biological sciences. Biological Conservation 142: 2282-2292.
eBird uses crowdsourcing techniques to gather observations of birds.
Crowdsourcing is the act of outsourcing tasks, traditionally performed by an employee or contractor, to an undefined, large group of people or community (a "crowd"), through an open call.
Jeff Howe, one of the first authors to employ the term, established that the concept of crowdsourcing depends essentially on the fact that, because it is an open call to an undefined group of people, it gathers those who are most fit to perform tasks, solve complex problems, and contribute the most relevant and fresh ideas. For example, the public may be invited to develop a new technology, carry out a design task, refine or carry out the steps of an algorithm, or help capture, systematize, or analyze large amounts of data (CITIZEN SCIENCE). (From Wikipedia)
eBird Checklists
Volunteers submit checklists of bird observations from specific locations using protocols that collect information on date, time, and distance traveled.
Flagged Records
• 4% of submitted records were flagged for review
• 60% of those records were reviewed and validated
eBird contains a two-stage verification system:
(1) Instantaneous automated evaluation of submissions based on species count limits for a given date and location;
(2) A growing network of more than 500 regional editors composed of local experts who vet records flagged by the automated filters.
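A minimal sketch of how the first, automated stage could work, assuming a lookup table of maximum plausible counts keyed by species, region, and month. The record fields, the table contents, and the rule that records with no known limit are also flagged are illustrative assumptions, not eBird's actual filter implementation.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class Record:
    species: str
    count: int
    region: str          # hypothetical region code
    observed_on: date


# Hypothetical limits: maximum plausible count per (species, region, month).
COUNT_LIMITS = {
    ("Sooty Shearwater", "US-NY", 1): 0,
    ("Indigo Bunting", "US-NY", 6): 40,
}


def needs_review(record, limits=COUNT_LIMITS):
    """Return True if the record should be passed to a regional editor."""
    key = (record.species, record.region, record.observed_on.month)
    limit = limits.get(key)
    if limit is None:
        # Assumption: with no limit on file, err on the side of review.
        return True
    return record.count > limit
```

Records for which `needs_review` returns True would then go to the second stage, the regional editors.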
Understanding our Audience
eBird is building a web-enabled community of bird watchers who collect, manage, and store their observations in a globally accessible unified database. Through its development as a tool that addresses the needs of the birding community, eBird sustains and grows participation.
Give Birders What They Want!
eBird contains an array of data visualization and analysis tools that provide birders, land managers, and scientists with summary information about bird distribution.
Sooty Shearwater
eBird data can be used to examine the timing of migration across large geographic areas.
Because each eBird observation is recorded at a specific location, eBird can generate maps depicting species distribution at multiple spatio-temporal scales.
Bird Occurrence Patterns in Upstate New York
eBird provides “bar charts” (i.e., frequency histograms) based on frequency of detection for individual species.
These visualizations provide users with occurrence information at specific locations at 1-week increments and indicate the likelihood of detecting a species based on its frequency in that area (darker and wider bars indicate increased frequency).
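The frequency of detection behind such a bar chart can be sketched as the share of checklists in each week that report the species. The input layout (a date plus a set of species names per checklist) and the use of ISO week numbers are assumptions made for illustration.

```python
from collections import defaultdict
from datetime import date


def weekly_frequency(checklists, species):
    """Fraction of checklists per week that report `species`.

    checklists: iterable of (observation_date, set_of_species_names).
    Weeks are ISO week numbers (an assumption for this sketch).
    """
    totals = defaultdict(int)
    hits = defaultdict(int)
    for observed_on, species_seen in checklists:
        week = observed_on.isocalendar()[1]
        totals[week] += 1
        if species in species_seen:
            hits[week] += 1
    # Only weeks with at least one checklist appear in the result.
    return {week: hits[week] / totals[week] for week in sorted(totals)}
```

Weeks without any checklists are left out rather than reported as zero, because no effort means no information about detection.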
[Figure: Growth in eBird Observations and Checklists, 2003 to 2011. Two series (Observations, Checklists) plotted on two vertical axes (0 to 2,400,000 and 0 to 240,000); annotation: eBird 2.0 launch.]
Statistics 2010
More than…
18,214,480 observations submitted
1,300,029 hours collecting bird observations.
1,293,480 checklists entered
22,136 contributors
351,000 unique visitors to eBird
20 million page views
Introducing BirdsEye, an eBird-powered iPhone app
Estimating Species Distributions
Determining the patterns of species occurrence through time and space, and understanding their links with features of the environment, are central themes in ecology. Identifying the factors that influence species distributions is a complex task, requiring the examination of multiple facets of a species’ natural history and their relationships with the complex and variable environments in which they live.
Fink, D., W. M. Hochachka, D. Winkler, B. Shaby, G. Hooker, B. Zuckerberg, M. A. Munson, D. Sheldon, M. Riedewald, and S. Kelling. 2010. Spatiotemporal Exploratory Models for Large-scale Survey Data. Ecological Applications 20:2131-2147.
Observational Data Model
The most crucial aspect of predicting species occurrence is to learn a model, called the observation model, from observed measurements and make probabilistic inferences over regions or variables where measurements were not made. This approach joins organism observations with a multitude of "drivers", covariates that could potentially influence the occurrence of the organism. While a single source (or a few sources) of noisy observations may not be sufficient to accurately model distributions, combining many measurements (e.g., species occurrence, weather, organism occurrence, landscape mosaic, human population data, etc.) greatly improves the accuracy of the models.
Munson, M. A., K. Webb, D. Sheldon, D. Fink, W. M. Hochachka, M. J. Iliff, M. Riedewald, D. Sorokina, B. L. Sullivan, C. L. Wood, and S. Kelling. 2009. The eBird Reference Dataset (http://www.avianknowledge.net/content/features/archive/eBird_Ref).
The Multi-scale Modeling Challenge
Goal: Analysis at broad scale with fine resolution
Challenge: spatiotemporal patterning at multiple scales
• Local-scale
– Fine-scale spatial and temporal resource patterns
• Large-scale
– Regional & seasonal variation in species’ habitat utilization
Wood Thrush
SpatioTemporal Exploratory Model (STEM)
Current nonparametric SDMs are very good for local-scale modeling by relating environmental predictors (X) to observed occurrences (y).
Multi-scale strategy: differentiate between local and global-scale ST structure.
1. Make explicit time (t) and location (s)
2. “Regionalize” by restricting support
3. Predictions at time (t) and location (s) are made by averaging across a set of local models containing that time and location
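Local-scale model: $y = f(X)$
ST explicit model with restricted support: $f(X, s, t)\, I(s, t \in \theta)$
Ensemble prediction at $(s, t)$: $\frac{1}{n(s,t)} \sum_{i=1}^{m} f_i(X, s, t)\, I(s, t \in \theta_i)$
where $\theta$ is the restricted support set, $n(s,t)$ is the number of models supporting $(s,t)$, and $f_i$ is the $i$th ST explicit base model.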
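The averaging step (predictions at a time and location are the mean over the local models whose support contains that time and location) can be sketched as follows. Each base model is represented as a pair of callables, and `make_constant_model` is a hypothetical helper that builds a rectangular support with a constant prediction; both are assumptions for illustration, not the STEM implementation.

```python
def stem_predict(base_models, x, s, t):
    """Average predictions of all base models whose support contains (s, t).

    base_models: list of (contains, predict) pairs, where
      contains(s, t) -> bool   acts as the indicator for the support set
      predict(x, s, t) -> float is the local base model
    Returns None when no base model supports (s, t).
    """
    preds = [predict(x, s, t) for contains, predict in base_models if contains(s, t)]
    if not preds:
        return None
    return sum(preds) / len(preds)


def make_constant_model(lon_min, lon_max, lat_min, lat_max, t_min, t_max, value):
    """Hypothetical base model: rectangular support, constant prediction."""
    def contains(s, t):
        lon, lat = s
        return lon_min <= lon < lon_max and lat_min <= lat < lat_max and t_min <= t < t_max

    def predict(x, s, t):
        return value

    return contains, predict
```

In practice each `predict` would be a fitted local model that uses the predictors `x`; the constant model only keeps the sketch self-contained.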
“Slice and dice” ST extent into stixels
• With sufficient overlap
• Adapt to different dynamics
Temporal Design:
• 40-day intervals
• 80 evenly spaced windows throughout year
Spatial Design
• For each time interval
• Random sample rectangles (12 deg lon x 9 deg lat)
• Minimum 25 unique locations.
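The temporal and spatial design above can be sketched in a few lines. The 365-day year, the uniform placement of rectangles inside a bounding extent, and the retry limit are assumptions made for this sketch; only the 40-day width, the 80 windows, the 12 x 9 degree rectangles, and the 25-location minimum come from the slide.

```python
import random


def temporal_windows(n_windows=80, width_days=40, year_days=365):
    """(start, end) day-of-year pairs for evenly spaced, overlapping windows.

    End values past `year_days` are meant to wrap into the next year.
    """
    step = year_days / n_windows
    return [(i * step, i * step + width_days) for i in range(n_windows)]


def sample_rectangle(locations, extent, rng, width=12.0, height=9.0,
                     min_locations=25, max_tries=1000):
    """Draw a random rectangle holding at least `min_locations` unique locations.

    locations: iterable of (lon, lat) pairs
    extent: (lon_min, lon_max, lat_min, lat_max) of the study area
    Returns ((lon0, lat0, lon1, lat1), set_of_locations) or None.
    """
    lon_min, lon_max, lat_min, lat_max = extent
    unique = set(locations)
    for _ in range(max_tries):
        lon0 = rng.uniform(lon_min, lon_max - width)
        lat0 = rng.uniform(lat_min, lat_max - height)
        inside = {(lon, lat) for lon, lat in unique
                  if lon0 <= lon < lon0 + width and lat0 <= lat < lat0 + height}
        if len(inside) >= min_locations:
            return (lon0, lat0, lon0 + width, lat0 + height), inside
    return None
```

With 80 windows of 40 days, consecutive windows start about 4.6 days apart, so each day of the year is covered by several windows, which gives the overlap the design calls for.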
The ST Ensemble
Western Meadowlark
SpatioTemporal Variation of Local-scale Predictor Effects
Non-stationarity of species-habitat associations
Exploratory Inference:
Although many ecological processes are known or expected to vary in space and time, the vast majority of SDM is done for a single region and/or season. So, our goal is to develop techniques to explore patterns of variation in space and time to provide ecologists and land managers with more accurate information about how species-habitat associations (requirements) change.
Chimney Swift
Indigo Bunting
Taking a data intensive science approach requires a data management and research environment that supports the entire data life cycle; from acquisition, storage, management, and integration, to data exploration, analysis, visualization and other computing and information processing services.
Kelling, S., W. M. Hochachka, D. Fink, M. Riedewald, R. Caruana, G. Ballard, and G. Hooker. 2009. Data-intensive Science: A New Paradigm for Biodiversity Studies. BioScience 59:613-620.
• Data Discovery, Access, and Synthesis
• Model Development
• Managing Computational Requirements
• Exploring and Visualizing Model Results
• Examples
Scientific Exploration, Visualization, and Analysis Working Group
Steve Kelling (co-chair), Cornell Lab of Ornithology
Bob Cook (co-chair), Oak Ridge National Lab
John Cobb, Oak Ridge National Lab
Theo Damoulas, Cornell University
Tom Dietterich, Oregon State
Juliana Freire, University of Utah
Daniel Fink, Cornell Lab of Ornithology
Damian Gesler, iPlant
Bill Michener, University of New Mexico
Jeff Morisette, USGS
Patrick O’Leary, U of Idaho
Alyssa Rosemartin, NPN
Suresh SanthanaVannan, Oak Ridge National Lab
Claudio Silva, University of Utah
Kevin Webb, Cornell Lab of Ornithology
Kelling, S., R. Cook, T. Damoulas, D. Fink, J. Freire, W. M. Hochachka, W. K. Michener, K. Rosenberg, and C. Silva. 2011, in press. Estimating species distributions, across space through time and with features of the environment.
Observational Data Sources
Photo courtesy of www.carboafrica.net
Sensors, sensor networks, and remote sensing gather observations
Data Interoperability
Our major data interoperability challenge is reconciling object-based models (i.e., vector entities such as locations where birds are observed) with field-based models (i.e., raster imagery comprised of attribute values gridded in space) of storing geographic information. To make data interoperable we had to apply methods that conflate point-location-based observations (e.g., bird observations) to match raster attribute data at the resolution of the raster data. For each observation location, we determine the cell in the raster grid into which the observation's location falls. We use the value of that cell's attribute as the attribute value for each observation.
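The point-to-raster lookup described above can be sketched as follows, assuming a north-up grid with square cells stored as a list of rows, with its upper-left corner at a known longitude and latitude. The function names, the dictionary layout of an observation, and the `raster_value` key are illustrative assumptions.

```python
import math


def cell_index(lon, lat, origin_lon, origin_lat, cell_size):
    """Row and column of the raster cell containing the point (lon, lat).

    Assumes (origin_lon, origin_lat) is the upper-left corner of the grid,
    rows run north to south, and cells are square with side `cell_size`.
    """
    col = math.floor((lon - origin_lon) / cell_size)
    row = math.floor((origin_lat - lat) / cell_size)
    return row, col


def attach_raster_value(observations, raster, origin_lon, origin_lat, cell_size):
    """Copy each observation and add the attribute value of the cell it falls in.

    Observations outside the grid get None.
    """
    out = []
    for obs in observations:
        row, col = cell_index(obs["lon"], obs["lat"], origin_lon, origin_lat, cell_size)
        inside = 0 <= row < len(raster) and 0 <= col < len(raster[0])
        value = raster[row][col] if inside else None
        out.append({**obs, "raster_value": value})
    return out
```

The bounds check matters because a negative row or column would otherwise index the Python list from the end and silently return a value from the wrong cell.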
Spatio-Temporal Exploratory Models predict the probability of occurrence of bird species across the United States at a 35 km x 35 km grid.
Patterns in Bird Species Occurrence Explored through Data Intensive Analysis and Visualization
Bird observations and environmental data from > 100,000 locations in US integrated and analyzed using High Performance Computing Resources
Potential Uses
• Examine patterns of migration
• Infer impacts of climate change
• Measure patterns of habitat usage
• Measure population trends
[Figure labels: eBird, Land Cover, Meteorology, MODIS remote sensing data, Model results. Panels: Occurrence of Indigo Bunting (2008) for Jan, Apr, Jun, Sep, Dec.]
Observations from bird watchers (citizen scientists): a huge number of birders collect 16 million observations each year.
Combine with environmental factors like land cover, landscape fragmentation, topography, human population, weather, and remote sensing data (greenness of terrestrial vegetation).
Integrating the data into one database is a challenge.
This huge amount of data can only be analyzed on supercomputers, using NSF TeraGrid High Performance Computing.
Models were used in the creation of the 2011 United States of America State of the Birds Report, entitled Birds in Public Lands and Waters.
Gaining insight into the complexities and processes of natural systems is no longer an exclusive realm of theory and experiment; computation and access to large quantities of data is now an equal and indispensable partner for advances in scientific knowledge, land management, and informed decision making.
Biodiversity Research and Conservation in a Digital World
Funding and Acknowledgements
• National Science Foundation
• Leon Levy Foundation
• Wolf Creek Foundation
The volunteers who contributed millions of hours gathering bird observations.
eBird and the Avian Knowledge Network
Art Munson - CU
Daniel Fink - CU
Wesley Hochachka - CU
Denis Lepage - BSC
Rich Caruana - MS
Mirek Riedewald - NEU
Daria Sorokina - CMU
Kevin Webb - CU
Giles Hooker - CU
Brian Sullivan - CU
Chris Wood - CU
Marshall Iliff - CU
Computational Sustainability
Carla Gomes - CU
Tom Dietterich - OSU
Daniel Sheldon - OCU
Ken Rosenberg - CU
Rebecca Hutchinson - OSU
Weng-Keen Wong - OSU
Megan MacDonald - CU
Stefan Hames - CU
Theo Damoulas - CU
Bistra Dilkina - CU
DataONE
Bill Michener - UNM
Bob Cook - ORNL
Jeff Morisette - USGS
Juliana Freire - UUT
Claudio Silva - UUT
Matt Jones - UCSB
Suresh SanthanaVannan - ORNL
Acknowledgements