scott jensen data to insight center indiana university university of chicago – march 2, 2012

Scott Jensen

Data to Insight Center

Indiana University

Adaptable and Incremental Metadata Capture in e-Science

University of Chicago – March 2, 2012


What is Metadata?

Data About Data

• “structured information that describes, explains, locates, or otherwise makes it easier to retrieve, use, or manage any other resource” (National Information Standards Organization)

• Alternately, answers the who, what, when, and why questions about a dataset (ISO 19115 standard)
– Where (spatial metadata)
– How (configuration)


Why Does Metadata Matter?

• Data Reuse
– “Metadata is key to being able to share results” (U.K. e-Science Core Programme)
– “A significant need exists in many disciplines for long-term, distributed, and stable data and metadata repositories” (NSF Blue-Ribbon Advisory Panel on Cyberinfrastructure)
– “Preservation of digital data is arguably a ‘grand challenge’ of the information age” (Francine Berman)

• Trusting and Understanding Data
– The ability to understand and evaluate the quality of data is key to reuse after discovery. If researchers have too much uncertainty about the data, they will not use it. (Ann Zimmermann)

• Data that is Costly and Irreplaceable
– Can other data be regenerated?

• Data Management Plans


Metadata Capture

• Historically done at the end of the data lifecycle
– Research is completed
– Data and results tarred up as a dataset – metadata at the dataset level
– Inserts are full metadata documents

• Metadata often captured at the collection level
– Generalized and not specific to each data product
– Collection level metadata for discovery (e.g., WCS)
– Detailed metadata stored as an object

• Data search is coarse
– Based on keywords or text search
– Spatial bounding box and temporal range
– Not specific to a data product, details not searchable
– Sometimes just browse capabilities


How Much Metadata to Capture?

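[Figure: Cost / Benefit Trade-offs. A spectrum running from less structure to more structure. Less structure gives lower barriers to entry; more structure gives richer metadata to search over. Points along the spectrum: name/value pairs, flat schemata (unqualified DC), core metadata, structured metadata schemata (FGDC, EML).]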


Research Problem

• Early Capture of Ephemeral Metadata
– Incremental, not at the end of the lifecycle
– Incremental capture must be efficient

• Deluge, Tsunami, Bonanza
– Requires automation
– Detailed metadata for discovery
– Scalability

• Variable and Dynamic Data
– Must accommodate new metadata
– Accommodate different domains and schemata


Research Focus

• Identified the concept-based character of scientific metadata schemas that differentiates them as a class from other XML schemas.

• Capture metadata incrementally and efficiently early in the scientific process
– Capture detailed metadata without full update
– Reconstruct metadata on-the-fly after incremental capture
– Automated metadata extraction from data objects

• Incremental capture must be efficient and scalable

• Architecture must generalize across schemas and domains

• Detailed metadata must be discoverable

• Extensible without schema modifications


Metadata Schemas - a Bag of Concepts

Core FGDC Spatial Schema (Standard Metadata)
• Identification
• Data Quality
• Spatial Data
• Spatial Reference
• Entity Attribute
• Distribution
• Metadata Reference

ISO 19115
• Identification
• Constraints
• Data Quality
• Spatial
• Reference System
• Distribution
• Metadata Extension
• and more …

DDI (version 2.0)
• Description
• Study description
• Physical file description
• Logical description (variables)
• other

Astronomy
• Identity
• Curation
• Content
• Coverage
• Spatial
• Temporal
• Data Quality

Ecology (EML)
• General
• Geographic
• Temporal
• Taxonomic
• Methods
• Data table metadata


Concepts have Complex Structure

• Schemata are often composed of complex concepts (compound elements)
– “Compound elements represent higher-level concepts that cannot be represented by an individual data element”

• Increased structure → Increased reusability

• Flat schema → difficulty harvesting
– Harvesting Dublin Core led to incomplete and inconsistent data (California Digital Libraries)
– Similar issues at the National Science Digital Library made it difficult to build services on harvested Dublin Core.

• Performance bottleneck when converting XML to name/value pairs


Concepts & Incremental Metadata Capture

• As an experiment runs, adding a concept does not require editing the existing metadata.

• Can capture ephemeral metadata such as workflow notifications and add them to a detailed metadata document.

• Metadata can be harvested from files and added as queryable metadata at different levels of the hierarchy.


Partitioning a Schema on Concepts

[Figure: a schema tree partitioned into metadata concepts and the elements within each concept. Nodes carry global-order numbers (1, 2, 3, 5, 6, 7, 12, 13, and 16 appear in the figure).]

Metadata
• Identification
– Citation: Originator, Publication Date, Publication Time, Title, ..., Publication Info (Publication Place, Publisher), Larger Work Citation, ...
– Keywords: Theme (Thesaurus, Theme Keyword), Temporal (Thesaurus, Temporal Keyword), ...
• Entity and Attribute
– Detailed Desc: Entity Type (Type Label, Type Definition, Definition Source), Attribute (Attribute Label, Definition, Definition Source, Domain Values)
• Distribution
– Distributor
– Standard Order Process

Concept Requirements:
• Recursion is within a concept
• Elements where cardinality can exceed one are concepts or contained in concepts
• Beneficial when CRUD operations are at the concept level or higher

Global ordering of concept elements and higher levels
• Incremental ingest – no need to modify existing concepts.
• Efficient reconstruction based on concept-sized fragments
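The two consequences of global ordering (append-only ingest and reconstruction from concept-sized fragments) can be sketched in a few lines of Python. This is a minimal illustration, not the XMC Cat implementation: the `ConceptStore` class, the object ID, and the order numbers are hypothetical, and the reconstruction ignores the ancestor elements that a real schema would wrap around each concept.

```python
from collections import defaultdict


class ConceptStore:
    """Toy in-memory store holding one XML fragment (CLOB) per concept instance."""

    def __init__(self):
        # object_id -> list of (global_order, arrival_sequence, xml_fragment)
        self._clobs = defaultdict(list)
        self._seq = 0

    def add_concept(self, object_id, global_order, xml_fragment):
        # Incremental ingest: append only, existing fragments are never edited.
        self._seq += 1
        self._clobs[object_id].append((global_order, self._seq, xml_fragment))

    def reconstruct(self, object_id, root="metadata"):
        # Sort by schema position, then by arrival order for repeated concepts.
        fragments = sorted(self._clobs[object_id])
        body = "".join(fragment for _, _, fragment in fragments)
        return f"<{root}>{body}</{root}>"


store = ConceptStore()
# Concepts arrive in whatever order the experiment produces them.
store.add_concept("obj-1", 12, "<keywords><themekey>ADAS</themekey></keywords>")
store.add_concept("obj-1", 5, "<citation><title>CONUS ADAS 10km</title></citation>")
document = store.reconstruct("obj-1")
print(document)
```

The keywords concept was added first, yet it appears after the citation in the rebuilt document because ordering comes from the schema position stored with each fragment.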


Shredding XML Concepts

• Metadata documents are “shredded” into concepts and then concepts are shredded into elements using XSLT.

• Once CLOBs are stored, metadata cannot be lost.

• CLOBs are indexed on Object ID and their global ordering.

• Shredded metadata is only a search index, allowing for strong typing – even if types do not match XML.

[Figure: a metadata document (ID; Concept, Concept, Concept …) is stored per concept in two forms. The CLOB, kept with its global order, gives a fast response. The shredded concept (name, source, sub-concept *, element *, where each element has a name, source, and typed value) supports detailed search.]
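As a rough illustration of what shredding a concept into elements produces, the sketch below walks a citation fragment and emits one row per leaf element. The slides state that XMC Cat does this with XSLT; Python's standard `xml.etree.ElementTree` is swapped in here only to keep the example self-contained. The `shred` and `local_name` helpers are hypothetical, and the rows keep the raw tag names rather than the renamed ones (for example `originator`) that the real templates emit.

```python
import xml.etree.ElementTree as ET

FGDC = "http://schemas.leadproject.org/2007/01/lms/fgdc"
LEAD = "http://schemas.leadproject.org/2007/01/lms/lead"

CITATION = f"""
<lead:citation xmlns:lead="{LEAD}" xmlns:fgdc="{FGDC}">
  <fgdc:origin>/C=US/O=National Center for Supercomputing Applications/CN=Anne Wilson</fgdc:origin>
  <fgdc:pubdate>Unknown</fgdc:pubdate>
  <fgdc:title>LEAD CONUS ADAS Catalog/CONUS ADAS 10km</fgdc:title>
  <fgdc:pubinfo>
    <fgdc:pubplace>unknown</fgdc:pubplace>
    <fgdc:publish>IU/GEOG</fgdc:publish>
  </fgdc:pubinfo>
</lead:citation>
"""


def local_name(tag):
    # ElementTree reports tags as "{namespace}name"; keep only the name.
    return tag.rsplit("}", 1)[-1]


def shred(element, source, path=()):
    """Yield (path, name, source, value) for every leaf below `element`."""
    name = local_name(element.tag)
    children = list(element)
    if not children:
        yield (path, name, source, (element.text or "").strip())
    else:
        # A non-leaf node is a concept or sub-concept; descend into it.
        for child in children:
            yield from shred(child, source, path + (name,))


rows = list(shred(ET.fromstring(CITATION.strip()), source="LEAD"))
for row in rows:
    print(row)
```

Each row carries the path of enclosing concepts, so `pubplace` and `publish` are recorded under the `pubinfo` sub-concept of `citation`.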


Ingest & Search Using Incremental Capture

[Figure: ingest and search paths through XMC Cat and its database, which holds concept CLOBs and shredded concepts.]

• Ingest: new concept → determine schema → validate → shred

• Search: build query (search based on concepts) → query shredded metadata for matching objects → object IDs → query for CLOBs based on IDs → build result
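The two-step search in the figure (shredded rows yield object IDs, then the IDs yield CLOBs) might look like the following against a relational store. This is a sketch under assumptions: it uses SQLite from the Python standard library, and the table names `concept_clobs` and `shredded_elements`, their columns, and the sample rows are hypothetical stand-ins for the real catalog schema.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE concept_clobs (
    object_id TEXT, global_order INTEGER, clob TEXT
);
CREATE INDEX idx_clobs ON concept_clobs (object_id, global_order);
CREATE TABLE shredded_elements (
    object_id TEXT, concept TEXT, element TEXT, source TEXT, value TEXT
);
""")


def ingest(object_id, global_order, concept, source, clob, elements):
    # Store the fragment itself, then the searchable rows derived from it.
    conn.execute("INSERT INTO concept_clobs VALUES (?, ?, ?)",
                 (object_id, global_order, clob))
    conn.executemany(
        "INSERT INTO shredded_elements VALUES (?, ?, ?, ?, ?)",
        [(object_id, concept, name, source, value) for name, value in elements])


def search(concept, element, value):
    # Step 1: query the shredded metadata for matching object IDs.
    ids = [row[0] for row in conn.execute(
        "SELECT DISTINCT object_id FROM shredded_elements "
        "WHERE concept = ? AND element = ? AND value = ? ORDER BY object_id",
        (concept, element, value))]
    # Step 2: fetch each object's CLOBs by ID, in global order, to build the result.
    results = {}
    for object_id in ids:
        clobs = conn.execute(
            "SELECT clob FROM concept_clobs WHERE object_id = ? "
            "ORDER BY global_order", (object_id,))
        results[object_id] = "".join(row[0] for row in clobs)
    return results


ingest("obj-1", 12, "theme", "FGDC",
       "<theme><themekey>ADAS</themekey></theme>", [("themekey", "ADAS")])
ingest("obj-1", 5, "citation", "FGDC",
       "<citation><title>CONUS ADAS 10km</title></citation>",
       [("title", "CONUS ADAS 10km")])
ingest("obj-2", 5, "citation", "FGDC",
       "<citation><title>Other</title></citation>", [("title", "Other")])
hits = search("theme", "themekey", "ADAS")
print(hits)
```

The response is assembled entirely from stored CLOBs; the shredded rows are only consulted to find which objects match.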


Exploded Datasets (Describing data in a broader context)

• Not a tarball at the end of a project

• Automated capture during an experiment

• Data objects are generated throughout a workflow

• Experiment data hierarchies vary by domain

• Provides scientists access to incremental metadata

[Figure: Incremental Capture During a Workflow or Experiment. A gateway (browse, search, compose workflow, query for data) and the running workflow are connected by a message bus carrying workflow notifications. Workflow inputs, intermediate results, workflow outputs, and workflow notifications are captured in the metadata catalog.]


Automated Metadata Capture

[Figure: workflow nodes register data products with the data management agent in the science gateway. Data is archived to the data repository, and minimal source metadata is recorded in the XMC Cat metadata catalog. Data registration events are added to a data registration event queue (database); workers with plug-ins handle post-processing of the data registration events.]


Domain Schema → Generalized Architecture

[Figure: components of the generalized architecture. From a domain metadata schema, the XMC Cat Builder generates the domain schema XML Bean, the domain schema XSLT shredding templates, and the domain concept database script. Other components shown: XMC Cat browser, XMC Cat web service (XMC Cat WSDL, functionality plug-ins), post-processing plug-ins, distributed shredders, XMC Cat database, data store, and external services. Labels on the ingest path: shred data on ingest, cast metadata on ingest.]


Adaptable Metadata Store

• Shares characteristics of clinical genomics databases and relational RDF stores such as Jena.

• Definition of concepts is based on schema structure.

• Dynamic concepts can be defined based on metadata content instead of structure.

• Every concept is stored as a CLOB

• Concepts can optionally be parsed into concepts, sub-concepts, and elements.


A Generic Structure for Searching Domain Concepts

[Figure: schema-based XML shredding splits XML into shredded leaf data elements and XML fragments stored as CLOBs. A detailed query based on shredded data elements queries the shredded data elements for object IDs; the object IDs are then used to query for CLOBs and build the XML response.]

Complex domain-specific concepts are mapped to generalized concepts, sub-concepts and elements:

Metadata Schema : Concept +

Concept : Sub-Concept *, Atomic Element *

Sub-Concept : Sub-Concept *, Atomic Element *

Atomic Element : date | time | timestamp | integer | float | spatial | string
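The generalized grammar (concepts contain sub-concepts and atomic elements, and atomic elements are typed) translates naturally into a small recursive data structure. The sketch below is one possible rendering in Python dataclasses, not the XMC Cat schema: the class names, the `leaves` helper, and the choice to model `spatial` as a four-number bounding box are assumptions made for illustration.

```python
from dataclasses import dataclass, field
from datetime import date, datetime, time
from typing import Iterator, List, Tuple, Union

# Atomic Element : date | time | timestamp | integer | float | spatial | string
# Assumption: "spatial" is modeled as a (west, south, east, north) bounding box.
Spatial = Tuple[float, float, float, float]
AtomicValue = Union[date, time, datetime, int, float, Spatial, str]


@dataclass
class AtomicElement:
    name: str
    source: str
    value: AtomicValue


@dataclass
class SubConcept:
    # Sub-Concept : Sub-Concept *, Atomic Element *
    name: str
    source: str
    sub_concepts: List["SubConcept"] = field(default_factory=list)
    elements: List[AtomicElement] = field(default_factory=list)


@dataclass
class Concept(SubConcept):
    """Concept : Sub-Concept *, Atomic Element * (same shape as a sub-concept)."""


def leaves(node: SubConcept,
           path: Tuple[str, ...] = ()) -> Iterator[Tuple[Tuple[str, ...], AtomicElement]]:
    """Walk a concept and yield each atomic element with its enclosing path."""
    here = path + (node.name,)
    for element in node.elements:
        yield here, element
    for sub in node.sub_concepts:
        yield from leaves(sub, here)


citation = Concept(
    name="citation", source="LEAD",
    sub_concepts=[SubConcept("pubInfo", "LEAD", elements=[
        AtomicElement("pubPlace", "LEAD", "unknown"),
        AtomicElement("publisher", "LEAD", "IU/GEOG"),
    ])],
    elements=[AtomicElement("title", "LEAD", "LEAD CONUS ADAS Catalog/CONUS ADAS 10km")],
)
flat = [(path, element.name) for path, element in leaves(citation)]
print(flat)
```

Because every domain schema maps onto these same three node types, a single search routine over `leaves` would work regardless of which domain the concept came from.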


Shredding Domain Metadata

<lead:LEADresource xmlns:lead="http://schemas.leadproject.org/2007/01/lms/lead"
                   xmlns:le="http://schemas.leadproject.org/2007/01/lms/leadelements"
                   xmlns:fgdc="http://schemas.leadproject.org/2007/01/lms/fgdc">
  <le:resourceID>urn:uuid:97afbef7-58c8-4143-9b05-f0b9d82d27ef</le:resourceID>
  <lead:data>
    <lead:idinfo>
      <lead:citation>
        <fgdc:origin>/C=US/O=National Center for Supercomputing Applications/CN=Anne Wilson</fgdc:origin>
        <fgdc:pubdate>Unknown</fgdc:pubdate>
        <fgdc:title>LEAD CONUS ADAS Catalog/CONUS ADAS 10km</fgdc:title>
        <fgdc:pubinfo>
          <fgdc:pubplace>unknown</fgdc:pubplace>
          <fgdc:publish>IU/GEOG</fgdc:publish>
        </fgdc:pubinfo>
      </lead:citation>
      <lead:descript>
        <fgdc:abstract>Real-time meteorological data assimilations with CONUS coverage at 10km resolution produced hourly by CAPS at OU. The List of contents provides the OPeNDAP URLs for the files within the collection. They have a form: http://lead.unidata.ucar.edu/cgi-bin/nph-dods/test-data/ADAS/OU/ad{date}.nc where {date} has the form: YYYYMMDDHH and indicates the hour for which the data assimilation is valid. </fgdc:abstract>
        <fgdc:purpose>Scientific research and education</fgdc:purpose>
      </lead:descript>
      . . .
      <lead:keywords>
        <fgdc:theme>
          <fgdc:themekt>DatasetTypes.lead.org</fgdc:themekt>
          <fgdc:themekey>ADAS</fgdc:themekey>
        </fgdc:theme>
        <fgdc:theme>
          <fgdc:themekt>CF-1.0</fgdc:themekt>
          <fgdc:themekey>projection_x_coordinate</fgdc:themekey>
          <fgdc:themekey>projection_y_coordinate</fgdc:themekey>
          <fgdc:themekey>height</fgdc:themekey>
          <fgdc:themekey>geopotential_height</fgdc:themekey>

Callouts in the figure: Citation Concept (lead:citation), Description Concept (lead:descript), 2nd Theme Keyword Concept (the second fgdc:theme).



Shredded Citation Metadata

<objectClobProperty myPos="5" (namespaces omitted here) >
  <objectClob>
    <lead:citation xmlns:lead="http://schemas.leadproject.org/2007/01/lms/lead"
                   xmlns="http://schemas.leadproject.org/2007/01/lms/lead"
                   xmlns:fgdc="http://schemas.leadproject.org/2007/01/lms/fgdc"
                   xmlns:soapenv="http://schemas.xmlsoap.org/soap/envelope/">
      <fgdc:origin>/C=US/O=National Center for Supercomputing Applications/CN=Anne Wilson</fgdc:origin>
      <fgdc:pubdate>Unknown</fgdc:pubdate>
      <fgdc:title>LEAD CONUS ADAS Catalog/CONUS ADAS 10km</fgdc:title>
      <fgdc:pubinfo>
        <fgdc:pubplace>unknown</fgdc:pubplace>
        <fgdc:publish>IU/GEOG</fgdc:publish>
      </fgdc:pubinfo>
    </lead:citation>
  </objectClob>
  <objectProperty myName="citation" mySource="LEAD">
    <objectProperty myName="pubInfo" mySource="LEAD">
      <objectElement myName="pubPlace" mySource="LEAD" myVal="unknown"/>
      <objectElement myName="publisher" mySource="LEAD" myVal="IU/GEOG"/>
    </objectProperty>
    <objectElement myName="originator" mySource="LEAD" myVal="/C=US/O=National Center for Supercomputing Applications/CN=Anne Wilson"/>
    <objectElement myName="pubDate" mySource="LEAD" myVal="Unknown"/>
    <objectElement myName="pubDateTime" mySource="LEAD" myVal="Unknown"/>
    <objectElement myName="title" mySource="LEAD" myVal="LEAD CONUS ADAS Catalog/CONUS ADAS 10km"/>
  </objectProperty>
</objectClobProperty>

Callouts in the figure: CLOB for Citation Concept (the objectClob element), pubInfo Sub-concept (the nested objectProperty).

All Shredded Metadata Conforms to the Same Schema



Dynamic Concepts Based on Content

[Figure: the Entity and Attribute branch of the schema tree. Metadata → Entity and Attribute → Detailed Desc → Entity Type (Type Label, Type Definition, Definition Source) and Attribute (Attribute Label, Definition, Definition Source, Domain Values). Global-order labels 1, 12, and 13 appear on the nodes.]

• CLOB parsed out and saved based on global order (schema structure)

• Concept defined based on “entity” label and source

• Sub-concept and elements defined based on “attribute” label and source

New domain concepts without schema changes

• Concept CLOBs are always saved based on global order – even if the concept is not defined.

• To be queryable, new concepts and elements must be defined, but no schema change is required.


XMC Cat Builder: Concepts


Deployed in Diverse Domains

• Linked Environments for Atmospheric Discovery (LEAD) – NSF-funded science gateway
– Metadata describing 500TB of data, intermediate results, and workflow output
– Data objects each described by up to 2,202 elements
– Individual workspaces of up to 15,000 objects

• One Degree Imager (ODI), WIYN Consortium
– Component in the data subsystem
– Data-driven workflows

• SEAD Project
– Sustainability science
– Provide search capability over archived use metadata


Comparing to a Native XML Database

• Except for queries based on object IDs, XMC Cat at 8X the base workload performs better than Berkeley XML at 1/10th of the base workload.

• XMC Cat experiment inserts include validation not reflected in the Berkeley results; without validation, the XMC Cat experiment insert at 8X the workload takes 2,477 ms.

Concurrent Insert/Query Execution Time in Milliseconds

                     |------------- Inserting Metadata --------------|   |----- Queries -----|
Workload             Minimal   Moderate   Extensive      Additional     Query     Context
                     (core)    (file)     (experiment)   Concept        on ID     Query
Berkeley 1/10th         87       138        2,659           78            23       1,086
XMC Cat 1X              52        74        2,316           26            27          63
XMC Cat 2X              53        76        2,954           27            27          63
XMC Cat 4X              60        80        4,803           29            31          69
XMC Cat 6X              67        88        4,628           32            54          72
XMC Cat 8X              69        89        4,719           36            34         145

XMC Cat rows are multiples of the base workload. Projected insert and query workload as multiples of projected LEAD workload based on LEAD technical report and insert/query ratios of the TPC-E benchmark.

Scott Jensen, Devarshi Ghoshal, and Beth Plale, Evaluation of Two XML Storage Approaches for Scientific Metadata, Indiana University CS Technical Report TR698, October 2011.


Performance Compared to Inlining

[Figure: two panels, Inlining at 4X base workload and XMC Cat at 9X base workload. x-axis: processing start time (minute in test), 0 to 20. y-axis: average processing time (ms), 0 to 200,000. Series: file query, experiment query, batch file insert, experiment insert.]

Scott Jensen and Beth Plale, Using Characteristics of Computational Science Schemas for Workflow Metadata Management, In Proceedings of the 2008 IEEE Congress on Services, IEEE 2008 Second International Workshop on Scientific Workflows (SWF 2008) , Hawaii, July 2008.


Eventual Consistency: Browse versus Search Metadata

[Figure: the metadata catalog web service and the distributed concept shredders, with steps labeled 1a, 1b, 2, 3, 4, 5, 6a, and 6b.]

• Storage of concepts so a user can browse their workspace: experiments adding metadata (or adding metadata to existing experiments) call the metadata catalog web service, which parses the metadata into concept CLOBs, stores the concept CLOBs to the object’s metadata, and queues each concept’s ID for eventual shredding.

• Shred of concepts for eventually consistent querying of the workspace: the distributed concept shredders query the catalog queue for a batch of concepts; the CLOBs are added to a local queue; worker threads dequeue each CLOB and shred it into sub-concepts and elements. If a concept is successfully shredded, the shredded metadata is added to the metadata catalog and the entry for the concept is removed from the processing queue.

Scott Jensen and Beth Plale, Trading Consistency for Scalability in Scientific Metadata, In Proceedings of the 2010 IEEE International Conference on e-Science, Brisbane, Australia, December 2010.
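The split between an immediately consistent browse path and an eventually consistent search path can be simulated in a few lines. This is a single-process sketch, not the distributed design from the paper: the `Catalog` class, its method names, and the toy `shred` function are hypothetical, and real shredders would run as worker threads on separate servers.

```python
from collections import deque


class Catalog:
    def __init__(self):
        self.clobs = {}          # concept_id -> XML fragment (browse path)
        self.pending = deque()   # concept IDs queued for eventual shredding
        self.taken = set()       # IDs tagged as taken by a shredder
        self.search_index = {}   # concept_id -> shredded rows (search path)

    def add_concept(self, concept_id, clob):
        # Store the CLOB at once so browsing sees it, then queue its ID.
        self.clobs[concept_id] = clob
        self.pending.append(concept_id)

    def fetch_batch(self, size):
        # Tag up to `size` queued IDs as taken, then return their CLOBs.
        batch = []
        while self.pending and len(batch) < size:
            concept_id = self.pending.popleft()
            self.taken.add(concept_id)
            batch.append((concept_id, self.clobs[concept_id]))
        return batch

    def complete(self, concept_id, rows):
        # Add the shredded metadata, then remove the processing-queue entry.
        self.search_index[concept_id] = rows
        self.taken.discard(concept_id)


def run_shredder(catalog, shred, batch_size=100):
    for concept_id, clob in catalog.fetch_batch(batch_size):
        rows = shred(clob)
        if rows is not None:  # successfully shredded?
            catalog.complete(concept_id, rows)


def shred(clob):
    # Toy stand-in for real shredding: index the fragment's length.
    return [("length", len(clob))]


catalog = Catalog()
catalog.add_concept("c1", "<citation/>")
browse_before = "c1" in catalog.clobs            # visible to browse right away
search_before = "c1" in catalog.search_index     # not yet searchable
run_shredder(catalog, shred)
search_after = "c1" in catalog.search_index      # searchable after shredding
print(browse_before, search_before, search_after)
```

Between `add_concept` and `run_shredder` the concept can be browsed but not searched, which is the window of inconsistency the design accepts in exchange for cheaper inserts.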


Bounds on Eventual Consistency

$EC_t = W_t + T_t + R_t + S_t + I_t$

$W_t$: Time a concept ID is queued

$T_t$: Time to “tag” as taken when fetching (64.42ms per batch of 100 concepts)

$R_t$: Time to fetch tagged concepts (74.58ms per batch of 100 concepts)

$S_t$: Time in local shredder queue and shredded by a worker thread (3.48ms)

$I_t$: Time to insert shredded concept into the metadata catalog (13.74ms)

The above times are averages for fetching a batch of 100 concepts ($T_t$ and $R_t$) and then processing each concept ($S_t$ and $I_t$).

Total wait time is dominated by $W_t$. If the distributed shredders keep pace with the ingest rate, the frequency of the shredders fetching determines $W_t$.
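A quick worked calculation shows why the queue wait dominates. The measured terms below come from the slide; the polling interval is a hypothetical value chosen only for illustration, and the sketch assumes (as the slide's condition suggests) that a shredder keeping pace with ingest bounds the queue wait by roughly one polling interval. Like the formula itself, it adds the per-batch and per-concept averages directly.

```python
# Measured averages from the slide, in milliseconds.
T_T = 64.42   # tag a batch of 100 concepts as taken
R_T = 74.58   # fetch the tagged batch
S_T = 3.48    # local shredder queue plus shredding by a worker thread
I_T = 13.74   # insert the shredded concept into the metadata catalog


def eventual_consistency_ms(wait_ms):
    """EC_t = W_t + T_t + R_t + S_t + I_t for a given queue wait W_t."""
    return wait_ms + T_T + R_T + S_T + I_T


# Everything except the queue wait.
fixed_ms = eventual_consistency_ms(0.0)

# Hypothetical: shredders poll every 5 seconds and keep pace with ingest,
# so a concept ID waits at most about one polling interval.
poll_interval_ms = 5000.0
worst_case_ms = eventual_consistency_ms(poll_interval_ms)

print(round(fixed_ms, 2), round(worst_case_ms, 2))
```

The non-wait terms sum to about 156 ms, so with any polling interval measured in seconds the delay before a concept becomes searchable is set almost entirely by how often the shredders fetch.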


Evaluation of Eventual Consistency

• Eventual consistency scales higher

• Strict consistency scaled to 8X the projected workload

• Mostly due to deferred shredding

• Using two eventually consistent shredders on a separate server

[Figure: mean execution time (ms), 0 to 140, versus multiple of base workload, 0 to 8. Measures: total processing and inserting shredded metadata. Series: strict consistency (total with shredding) and eventual consistency (total with deferred shredding). Annotation: strict consistency is 42% longer at 6X the base workload.]


Domain-Adaptable Metadata Search

• Metadata search criteria are often limited to keywords or text, a spatial bounding box, and temporal bounds.

• If rich metadata is captured as a BLOB, it is available as use metadata, but not discovery metadata.

Instead …

• Use domain concepts and dynamic concepts to define search criteria.

• Generic architecture for shredded metadata -> search criteria can include any shredded domain metadata.


Dynamic Search Definition

metadata_categories: Category_id, Category_description

concept_definitions: Concept_id, Category_id, Concept_name, Concept_source, Schema_order_id, Concept_description, Concept_short_desc, Top_concept_id, Parent_sequence, Parent_id

element_definitions: Element_id, Concept_id, Element_type, Element_name, Element_source, Element_description, Element_short_desc

Callouts in the figure: general information (category), citation (concept), pubinfo (sub-concept), publisher name (element definition).

<metadataDefinition>
  <metadataCategoryDef>
    <categoryId>1</categoryId>
    <categoryName>General Information</categoryName>
    <metadataConceptDef>
      <conceptId>1</conceptId>
      <conceptName>citation</conceptName>
      <conceptSource>FGDC</conceptSource>
      <conceptDesc>citation</conceptDesc>
      <conceptShortDesc>citation</conceptShortDesc>
      <metadataElementDef>
        <elementId>1</elementId>
        <elementName>originator</elementName>
        <elementSource>FGDC</elementSource>
        <elementDesc>citation originator</elementDesc>
        <elementShortDesc>originator</elementShortDesc>
        <elementType>6</elementType>
      </metadataElementDef>
      . . .
      <metadataConceptDef>
        <conceptId>3</conceptId>
        <conceptName>pubinfo</conceptName>
        <conceptSource>FGDC</conceptSource>
        <conceptDesc>publication information</conceptDesc>
        <conceptShortDesc>pub info</conceptShortDesc>
        . . .
        <metadataElementDef>
          <elementId>12</elementId>
          <elementName>publish</elementName>
          <elementSource>FGDC</elementSource>
          <elementDesc>publisher name</elementDesc>
          <elementShortDesc>publisher</elementShortDesc>
          <elementType>6</elementType>
        </metadataElementDef>
      </metadataConceptDef>
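To show how definition tables like these could drive a search form, the sketch below loads the citation and pubinfo definitions into SQLite and lists the options offered once a concept is selected. It is an illustration under assumptions: only a subset of the columns named on the slide is created, the foreign keys and the `search_options` helper are hypothetical, and assigning pubinfo to category 1 with citation as its parent is inferred from the XML nesting rather than stated in the source.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE metadata_categories (
    category_id INTEGER PRIMARY KEY,
    category_description TEXT
);
CREATE TABLE concept_definitions (
    concept_id INTEGER PRIMARY KEY,
    category_id INTEGER REFERENCES metadata_categories(category_id),
    concept_name TEXT,
    concept_source TEXT,
    parent_id INTEGER REFERENCES concept_definitions(concept_id)
);
CREATE TABLE element_definitions (
    element_id INTEGER PRIMARY KEY,
    concept_id INTEGER REFERENCES concept_definitions(concept_id),
    element_type INTEGER,
    element_name TEXT,
    element_source TEXT
);
""")

# IDs, names, sources, and element type 6 are taken from the XML on the slide.
conn.execute("INSERT INTO metadata_categories VALUES (1, 'General Information')")
conn.executemany(
    "INSERT INTO concept_definitions VALUES (?, ?, ?, ?, ?)",
    [(1, 1, "citation", "FGDC", None), (3, 1, "pubinfo", "FGDC", 1)])
conn.executemany(
    "INSERT INTO element_definitions VALUES (?, ?, ?, ?, ?)",
    [(1, 1, 6, "originator", "FGDC"), (12, 3, 6, "publish", "FGDC")])


def search_options(concept_name):
    """Return (sub-concepts, elements) to offer once a concept is selected."""
    (concept_id,) = conn.execute(
        "SELECT concept_id FROM concept_definitions WHERE concept_name = ?",
        (concept_name,)).fetchone()
    sub_concepts = [row[0] for row in conn.execute(
        "SELECT concept_name FROM concept_definitions "
        "WHERE parent_id = ? ORDER BY concept_id", (concept_id,))]
    elements = [row[0] for row in conn.execute(
        "SELECT element_name FROM element_definitions "
        "WHERE concept_id = ? ORDER BY element_id", (concept_id,))]
    return sub_concepts, elements


options = search_options("citation")
print(options)
```

Adding a new searchable concept then amounts to inserting rows into these tables; no table definitions change.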


Search Adjusts to Domain Concepts

When the target is selected, all concepts are listed as search options – grouped by their categories.

When a concept is selected, all of its sub-concepts and elements are listed as options.


Strongly Typed Search Criteria


Current Work

• Handle hierarchies based on multiple schemas
– Experiments bringing together data from multiple sources described by different standards.
– Data described by different metadata standards can be combined in a single dataset.
– Metadata can be queried based on different schemas.

• Faceted search
– Added to XMC Cat web service.
– Can alternate between facets and details.
– Unified criteria for multiple schemas.

[Figure labels: Simulation, Forecast, Sensor Data, Ecological Data, Satellite Data, Census Data]


Thank You!

Scott Jensen

[email protected]

Thanks also to:
– The NSF-funded Linked Environments for Atmospheric Discovery (LEAD) project
– Data to Insight Center
