
Page 1: Data, Metadata, and Ontology in Ecology Matthew B. Jones National Center for Ecological Analysis and Synthesis (NCEAS) University of California Santa Barbara

Data, Metadata, and Ontology in Ecology

Matthew B. Jones

National Center for Ecological Analysis and Synthesis (NCEAS)
University of California Santa Barbara

and many major collaborators: Mark Schildhauer, Josh Madin, Jing Tao, Chad Berkley, Dan Higgins, Peter McCartney, Chris Jones, Shawn Bowers, Bertram Ludaescher, and others

April 24, 2007

Page 2:

Scaling-up Synthesis

• More than 400 projects at NCEAS
  – have produced over 1000 publications that synthesize and re-use existing data
  – massive investment in compiling, integrating, and analyzing data

• Building a custom database for each project is not logistically feasible

• Instead, need loosely-coupled systems that accommodate heterogeneity

Page 3:

Dilemma: no unified model

• No single database suffices
  – Data warehouses use federated schemas
    • any data that does not fit is not captured
    • original data transformed to fit federation
      – this is a form of data integration for one purpose
  – Numerous data warehouses exist
    • not extensible for all data
    • VegBank, ClimDB, GenBank, PDB, etc.

Page 4:

Data Collections

• Metadata-based data collections
  – Loosely-coupled metadata and data collections
  – No constraints on data schemas
  – Data discovery based on metadata
  – Dynamic data loading and query based on metadata descriptions

Page 5:

What is EML?

• Ecological Metadata Language

A …
• modular
• extensible
• comprehensive

[Figure: <EML> components: Identity and Discovery Information; Coverage: Space, Time, Taxa; Methods; Logical Data Model; Physical Data Format; Access and Distribution]

Page 6:

EML: Selected relationships

[Figure: timeline, 1991 to 2009, relating EML versions (EML 1.0.0, EML 1.3.0, EML 1.4.x, EML 2.0.0, EML 2.0.1) to related standards and documents: CSDGM 1.0, Michener ’97 paper, ESA FLED Report, NBII BDP, ISO 19115, Dublin Core, XML 1.0, OBOE]

Page 7:

A simple EML example

eml
  packageId: sbclter.316.18
  system: knb
  dataset
    title: Kelp Forest Community Dynamics: Benthic Fish
    creator
      individualName
        surName: Reed
    contact
      individualName
        surName: Evans
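The tree on this slide can be written out as a small XML document and read back with a standard parser. The sketch below is illustrative only: it assumes the EML 2.0.1 namespace URI, assumes that Reed is the creator and Evans the contact (the slide layout does not make the pairing certain), and has not been validated against the EML schema. The `summarize` helper is a hypothetical name, not part of any EML toolkit.

```python
import xml.etree.ElementTree as ET

# Minimal EML-like document built from the values on the slide.
# Assumption: creator = Reed, contact = Evans.
EML_DOC = """<eml:eml xmlns:eml="eml://ecoinformatics.org/eml-2.0.1"
         packageId="sbclter.316.18" system="knb">
  <dataset>
    <title>Kelp Forest Community Dynamics: Benthic Fish</title>
    <creator><individualName><surName>Reed</surName></individualName></creator>
    <contact><individualName><surName>Evans</surName></individualName></contact>
  </dataset>
</eml:eml>"""


def summarize(eml_text):
    """Return the identifying fields of an EML-like document as a dict."""
    root = ET.fromstring(eml_text)
    dataset = root.find("dataset")  # child elements are not namespace-qualified
    return {
        "packageId": root.get("packageId"),
        "system": root.get("system"),
        "title": dataset.findtext("title"),
        "creators": [e.text for e in dataset.findall("creator/individualName/surName")],
        "contacts": [e.text for e in dataset.findall("contact/individualName/surName")],
    }


summary = summarize(EML_DOC)
print(summary)
```

Because the identifier, title, and people sit at fixed paths, a catalog can index them without knowing anything about the data table itself.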

Page 8:

Data Discovery

Geographic, Temporal, and Taxonomic coverage
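One way to picture coverage-based discovery is as an overlap test between a query and each data package's geographic bounding box, date range, and taxon list. The following is a rough sketch under stated assumptions: the `Coverage` record, the `matches` and `discover` functions, the package identifiers, and all coordinates and dates are hypothetical, and bounding boxes that cross the antimeridian are not handled.

```python
from dataclasses import dataclass
from datetime import date
from typing import FrozenSet, List, Optional


@dataclass(frozen=True)
class Coverage:
    """Hypothetical summary of one package's coverage metadata."""
    package_id: str
    west: float
    east: float
    south: float
    north: float
    begin: date
    end: date
    taxa: FrozenSet[str]


def matches(cov: Coverage, west: float, east: float, south: float, north: float,
            begin: date, end: date, taxon: Optional[str] = None) -> bool:
    """True when the package overlaps the query in space, time, and taxa."""
    spatial = (cov.west <= east and cov.east >= west
               and cov.south <= north and cov.north >= south)
    temporal = cov.begin <= end and cov.end >= begin
    taxonomic = taxon is None or taxon in cov.taxa
    return spatial and temporal and taxonomic


def discover(catalog: List[Coverage], **query) -> List[str]:
    """Return the identifiers of all packages matching the query."""
    return [cov.package_id for cov in catalog if matches(cov, **query)]


# Hypothetical catalog entries, for illustration only.
CATALOG = [
    Coverage("example.1.1", -120.5, -119.5, 34.0, 34.6,
             date(2000, 1, 1), date(2006, 12, 31), frozenset({"Macrocystis"})),
    Coverage("example.2.1", -72.3, -72.1, 42.4, 42.6,
             date(1990, 1, 1), date(1995, 12, 31), frozenset({"Quercus"})),
]

hits = discover(CATALOG, west=-121.0, east=-119.0, south=33.5, north=35.0,
                begin=date(2001, 1, 1), end=date(2001, 12, 31), taxon="Macrocystis")
print(hits)
```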

Page 9:

Logical Model: Attribute structure

• Describes data tables and their variables/attributes

• a typical data table with 10 attributes
  – some metadata are likely apparent, others ambiguous
  – missing value code is present
  – definitions need to be explicit, as well as data typing

YEAR  MONTH  DATE        SITE  TRANSECT  SECTION  SP_CODE  SIZE  OBS_CODE  NOTES
2001  8      2001-08-22  ABUR  1         0-20     CLIN     5     06        .
2001  8      2001-08-22  ABUR  1         21-40    OPIC     11    06        .
2001  8      2001-08-22  ABUR  1         21-40    OPIC     10    06        .
2001  8      2001-08-22  ABUR  1         21-40    OPIC     14    06        .
2001  8      2001-08-22  ABUR  1         21-40    OPIC     7     06        .
2001  8      2001-08-22  ABUR  1         21-40    OPIC     19    06        .
2001  8      2001-08-22  ABUR  1         21-40    COTT     5     06        .
2001  8      2001-08-22  ABUR  2         0-20     CLIN     5     06        .
2001  8      2001-08-22  ABUR  2         21-40    NF       0     06        .
2001  8      2001-08-27  AHND  1         0-20     NF       0     03        .

[Figure callouts on the table: Species Codes, Value bounds, Date Format, Code definitions]
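To illustrate why explicit typing, date formats, and missing-value codes matter, here is a minimal sketch that parses one row of the table above using per-attribute metadata. The `ATTRIBUTES` structure and `parse_row` function are hypothetical and much simpler than EML's attribute descriptions; the lower bound on SIZE and the choice to treat OBS_CODE as text are assumptions made for the example.

```python
from datetime import datetime

# Hypothetical, simplified attribute metadata for the 10 columns on the slide.
ATTRIBUTES = [
    {"name": "YEAR", "kind": "int"},
    {"name": "MONTH", "kind": "int"},
    {"name": "DATE", "kind": "date", "format": "%Y-%m-%d"},
    {"name": "SITE", "kind": "text"},
    {"name": "TRANSECT", "kind": "int"},
    {"name": "SECTION", "kind": "text"},
    {"name": "SP_CODE", "kind": "text"},
    {"name": "SIZE", "kind": "int", "min": 0},        # assumed value bound
    {"name": "OBS_CODE", "kind": "text"},             # a code, so keep the leading zero
    {"name": "NOTES", "kind": "text", "missing": "."},
]


def parse_row(line, attributes=ATTRIBUTES):
    """Convert one whitespace-delimited row into typed values."""
    fields = line.split()
    if len(fields) != len(attributes):
        raise ValueError(f"expected {len(attributes)} fields, got {len(fields)}")
    record = {}
    for attr, raw in zip(attributes, fields):
        if raw == attr.get("missing"):
            record[attr["name"]] = None  # missing-value code becomes None
            continue
        if attr["kind"] == "int":
            value = int(raw)
            if "min" in attr and value < attr["min"]:
                raise ValueError(f"{attr['name']} below bound: {value}")
        elif attr["kind"] == "date":
            value = datetime.strptime(raw, attr["format"]).date()
        else:
            value = raw
        record[attr["name"]] = value
    return record


row = parse_row("2001 8 2001-08-22 ABUR 1 0-20 CLIN 5 06 .")
print(row)
```

Without the metadata, a generic loader could plausibly read OBS_CODE `06` as the integer 6 and NOTES `.` as literal text, which is the ambiguity the slide points out.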

Page 10:

EML Measurement Scale

Group    Scale     Description                                    Example
Textual  Nominal   Categories                                     Male, Female
Textual  Ordinal   Ordered categories                             Low, Medium, High
Numeric  Interval  Equidistant on number scale                    3 Celsius
Numeric  Ratio     Equidistant on number scale, meaningful ratio  5 meter
Dates    Datetime  Points on calendar timescale                   6-Oct-2004
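The practical consequence of a measurement scale is which operations are meaningful on the values. The sketch below encodes the conventional reading of the five scales on this slide; the `Scale` enum, the `ALLOWED` table, and the `supports` function are hypothetical names, and treating datetime values like an interval scale is an assumption of this example rather than something the slide states.

```python
from enum import Enum


class Scale(Enum):
    NOMINAL = "nominal"
    ORDINAL = "ordinal"
    INTERVAL = "interval"
    RATIO = "ratio"
    DATETIME = "dateTime"


# Operations assumed meaningful for each scale.
ALLOWED = {
    Scale.NOMINAL: {"equality"},
    Scale.ORDINAL: {"equality", "order"},
    Scale.INTERVAL: {"equality", "order", "difference"},
    Scale.RATIO: {"equality", "order", "difference", "ratio"},
    Scale.DATETIME: {"equality", "order", "difference"},
}


def supports(scale: Scale, operation: str) -> bool:
    """True when the operation is meaningful for values on this scale."""
    return operation in ALLOWED[scale]


# "5 meter" is on a ratio scale, so 10 meter is twice as long.
print(supports(Scale.RATIO, "ratio"))
# "3 Celsius" is on an interval scale, so 6 Celsius is not "twice as warm".
print(supports(Scale.INTERVAL, "ratio"))
```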

Page 11:

Logical Model: Unit Dictionary

• Consistent assignment of measurement units
  – Quantitative definitions in terms of SI units
  – ‘unitType’ expresses dimensionality
    • time, length, mass, energy are all ‘unitType’s
    • second, meter, gram, pound, joule are all ‘unit’s

[Figure: UnitType "Mass" with Units "kilogram" and "gram", related by a factor of 1000 (x1000)]
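A unit dictionary of this kind makes conversions mechanical: each unit carries its unitType and a multiplier to the SI unit of that type, and conversion is only allowed within one unitType. The sketch below is a simplified illustration; the `UNITS` table layout and `convert` function are hypothetical and are not the format of the actual EML unit dictionary.

```python
# unit name -> (unitType, multiplier to the SI unit of that type)
UNITS = {
    "kilogram": ("mass", 1.0),
    "gram": ("mass", 0.001),
    "pound": ("mass", 0.45359237),
    "meter": ("length", 1.0),
    "second": ("time", 1.0),
    "joule": ("energy", 1.0),
}


def convert(value, from_unit, to_unit):
    """Convert a value between two units that share a unitType."""
    from_type, from_factor = UNITS[from_unit]
    to_type, to_factor = UNITS[to_unit]
    if from_type != to_type:
        raise ValueError(f"cannot convert {from_type} to {to_type}")
    # Go through the SI unit: scale up to SI, then down to the target unit.
    return value * from_factor / to_factor


grams = convert(2.5, "kilogram", "gram")
print(grams)
```

Asking for `convert(1, "kilogram", "meter")` raises `ValueError`, which is the kind of dimensional check the unitType makes possible.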

Page 12:

Collating metadata

• Most scientists know all of this information about their data
  – EML simply provides a standardized format for recording the information

• Enables data exchange across organizations and software systems

Page 13:

Knowledge Network for Biocomplexity (KNB)

Building a community data network

• Simplified data sharing
• Immediate change tracking
• Redundant backup
• Data maintained by individuals
• Access controlled by individuals

[Figure: network nodes labeled PISCO, KNB II, AND, ... (26), GCE, LTER, NCEAS, ESA, OBFS, KNB 1]

Page 14:

EML-described data in the KNB

[Figure: Data Packages in the KNB. Cumulative count (y-axis, 0 to 12000) by Year (x-axis, 2002 to 2006)]

Page 15:

Kepler: dynamic data loading

Data source from EcoGrid (metadata-driven ingestion)

R processing script:

# fit a linear model of BARO as a function of T_AIR
res <- lm(BARO ~ T_AIR)
# print the fitted model
res
# scatter plot of the two variables
plot(T_AIR, BARO)
# overlay the fitted regression line
abline(res)

Kepler supports dynamic data loading:

• Data sources are discovered via metadata queries

• EML metadata allows arbitrary schemas to be loaded into an embedded database

• Data queries can be performed before data flows downstream
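The loading step described in these bullets can be sketched as: read the attribute descriptions, create a matching table in an embedded database, insert the rows, and query before passing data on. The example below is a minimal illustration, not Kepler's implementation: it uses Python's built-in `sqlite3` as a stand-in embedded database, the `load_table` and `quote_identifier` helpers are hypothetical, the column names are borrowed from the R script on this slide, and the column types and data values are made up.

```python
import sqlite3

# Hypothetical attribute metadata: (column name, storage type).
ATTRIBUTES = [("T_AIR", "REAL"), ("BARO", "REAL")]
# Made-up values, for illustration only.
ROWS = [(12.5, 1013.2), (15.0, 1011.8), (18.3, 1009.4)]

ALLOWED_TYPES = {"REAL", "INTEGER", "TEXT"}


def quote_identifier(name):
    """Quote a table or column name so arbitrary schemas are safe in SQL."""
    return '"' + name.replace('"', '""') + '"'


def load_table(conn, table, attributes, rows):
    """Create a table from attribute metadata and insert the rows."""
    for _, sql_type in attributes:
        if sql_type not in ALLOWED_TYPES:
            raise ValueError(f"unsupported type: {sql_type}")
    columns = ", ".join(f"{quote_identifier(n)} {t}" for n, t in attributes)
    conn.execute(f"CREATE TABLE {quote_identifier(table)} ({columns})")
    placeholders = ", ".join("?" for _ in attributes)
    conn.executemany(
        f"INSERT INTO {quote_identifier(table)} VALUES ({placeholders})", rows
    )


conn = sqlite3.connect(":memory:")
load_table(conn, "weather", ATTRIBUTES, ROWS)

# Query the loaded data before anything flows downstream.
warm = conn.execute(
    'SELECT "T_AIR", "BARO" FROM "weather" WHERE "T_AIR" > ? ORDER BY "T_AIR"',
    (14.0,),
).fetchall()
print(warm)
```

Nothing in `load_table` depends on the particular columns, which is the point of driving ingestion from metadata rather than from a fixed schema.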

Page 16:

Importance of semantics

• So far we’ve dealt only with the logical data model
  – any semantics in EML are expressed in natural language

• The computer doesn’t really understand:
  – what is being measured
  – how measurements relate to one another
  – how semantics map to logical structure

• Analysis depends on understanding the semantic contextual relationships among data measurements
  – e.g., density measured within subplot

Page 17:

Observation ontology (OBOE)

Goal: semantically describe the structure of scientific observation and measurement as found in a data set

Entities represent real-world objects or concepts that can be measured.

Observations are made about particular entities.

Every measurement has a characteristic, which defines the property of the entity being measured.

Observations can provide context for other observations.

Provide extension points for loading specialized domain ontologies

slide from J. Madin
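The four statements on this slide suggest a small object model. The sketch below is one possible rendering in plain Python, not the OBOE ontology itself (which is not defined as Python classes); the class and field names are chosen for illustration. The example values come from the first row of the Height table later in the deck, and the assignment of characteristics (Name, Label, TaxonomicName, Height) to columns is an assumption.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Entity:
    """A real-world object or concept that can be measured."""
    name: str


@dataclass
class Characteristic:
    """The property of an entity that a measurement records."""
    name: str


@dataclass
class Measurement:
    characteristic: Characteristic
    value: object
    unit: Optional[str] = None  # the slides give no unit for Height


@dataclass
class Observation:
    """An observation of one entity, optionally within the context of others."""
    entity: Entity
    measurements: List[Measurement] = field(default_factory=list)
    context: List["Observation"] = field(default_factory=list)


# Assumed annotation of the row: 10/12, Hendricks, 1, AHYA, 12.2
site = Observation(Entity("Space"), [Measurement(Characteristic("Name"), "Hendricks")])
plot = Observation(Entity("Space"), [Measurement(Characteristic("Label"), 1)], context=[site])
organism = Observation(
    Entity("Organism"),
    [
        Measurement(Characteristic("TaxonomicName"), "AHYA"),
        Measurement(Characteristic("Height"), 12.2),
    ],
    context=[plot],
)

print(organism.context[0].context[0].measurements[0].value)
```

Following the `context` links from the organism to the plot to the site recovers the nesting that a flat table leaves implicit.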

Page 18:

Semantic annotation

• Relational data lacks critical semantic information
  – no way for computer to determine that “Ht.” represents a “height” measurement
  – no way for computer to determine if Plot is nested within Site or vice-versa
  – no way for computer to determine if the Temp applies to Site or Plot or Species

Mapping between data and the ontology via semantic annotation

[Figure labels: Observation Ontology, Data set]

slide from J. Madin

Page 19:

Date   Site       Plot  Species  Height
10/12  Hendricks  1     AHYA     12.2
10/12  Hendricks  1     AHYA     11.0
10/12  Hendricks  1     AHYA     9.7
…      …          …     …        …

[Figure: semantic annotation of the table columns. Entity: Time, Space, Space, Organism. Characteristic: Date, LocationName, Height, TaxonomicName, Label, Area. Entities are linked by three hasContext relationships.]

Page 20:

Tree  Plot  Species  Count
A     1     AHYA     3
A     2     AHYA     2
A     3     AHYA     8
…     …     …        …

[Figure: semantic annotation of the table columns. Entity: Organism, Space, Organism. Characteristic: Label, Abundance, TaxonomicName, Replicate, Area. Entities are linked by two hasContext relationships. Diagram labels: A, B, C.]

Page 21:

Observation ontology

slide from J. Madin

Extension points

Page 22:

Observation

A high-level assertion that a thing was observed

?

Page 23:

All things (concrete and conceptual) that are observable

Entity

Page 24:

An extension point for domain-specific terms

Entity extension

Page 25:

Asserts a “containment” relationship between entities

Context

Page 26:

Context is transitive

Context
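Transitivity means that if a tree is observed within a plot and the plot within a site, the tree is also within the site. A minimal sketch of how a tool might compute this follows; the `all_contexts` function and the `direct` mapping are hypothetical, and the tree, plot, and site names are placeholders.

```python
def all_contexts(direct, start):
    """Return every context reachable from `start`, directly or transitively.

    `direct` maps an item to the list of items that directly contain it.
    """
    found = set()
    stack = list(direct.get(start, []))
    while stack:
        current = stack.pop()
        if current in found:
            continue  # already visited; also guards against cycles
        found.add(current)
        stack.extend(direct.get(current, []))
    return found


# Only the direct containment links are stated.
direct = {"tree": ["plot"], "plot": ["site"]}

contexts = all_contexts(direct, "tree")
print(sorted(contexts))
```

Only two links are asserted, yet the query for the tree returns both the plot and the site, so measurements taken at site level can be related to individual trees.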

Page 27:

Observations are composed of measurements, which relate measurable characteristics to the entity being observed

Measurement

Page 28:

Characteristic

Page 29:

Summary

• EML captures critical metadata
• OBOE adds critical semantic descriptions

• Data discovery and integration tools can be built that leverage metadata and ontologies

• Metadata and ontologies permit:
  – Loosely-coupled systems
  – Schema independence in data systems
  – Semantic data integration
  – Capturing the data as collected, rather than a derived product

Page 30:

Vegetation Schema Questions

• Vegetation schema
  – Exchange standard or federation?

• Can we accommodate all data that is collected in vegetation plots?
  – or just a transformed subset

• XML? RDF? OWL? other?
• Should a vegetation schema link to other evolving community standards?
  – EML?
  – OBOE?

Page 31:

Questions?

• http://www.nceas.ucsb.edu/ecoinformatics/
• http://knb.ecoinformatics.org/
• http://seek.ecoinformatics.org/
• http://kepler-project.org/

Page 32:

Acknowledgements

• Knowledge Representation Working Group
  – Mark Schildhauer, Matt Jones (NCEAS)
  – Shawn Bowers, Bertram Ludaescher, Dave Thau (UCD)
  – Deana Pennington (UNM)
  – Serguei Krivov, Ferdinando Villa (UVM)
  – Corinna Gries, Peter McCartney (ASU)
  – Rich Williams (Microsoft)

Page 33:

Acknowledgments

• This material is based upon work supported by:

• The National Science Foundation under Grant Numbers 9980154, 9904777, 0131178, 9905838, 0129792, and 0225676.

• Collaborators: NCEAS (UC Santa Barbara), University of New Mexico (Long Term Ecological Research Network Office), San Diego Supercomputer Center, University of Kansas (Center for Biodiversity Research), University of Vermont, University of North Carolina, Napier University, Arizona State University, UC Davis

• The National Center for Ecological Analysis and Synthesis, a Center funded by NSF (Grant Number 0072909), the University of California, and the UC Santa Barbara campus.

• The Andrew W. Mellon Foundation.

• Kepler contributors: SEEK, Ptolemy II, SDM/SciDAC, GEON, RoadNet, EOL, Resurgence