www.nationaldataservice.org

CONNECTING THE DOMAINS THROUGH THE NDS

Ray Plante
National Center for Supercomputing Applications
University of Illinois Urbana Champaign

Integrating Domain Repositories into the National Data Infrastructure -- Ann Arbor, MI

Connecting the Domains

• The National Data Service and Domain Repositories
• Sharing and Publishing Pathways
• Data Discovery
• Supporting progress toward a national infrastructure

Vision for a National Data Service

• National infrastructure for data sharing, publishing, and reuse
  – Builds and operates robust data services around
    • Discovery
    • Publishing
    • Linking to literature
    • Reusing: data access, transfer, connecting to processing
  – Builds on and connects to community-specific resources
    • Preserve access to the community-specific capabilities and knowledge
• A partner with the RDA
  – Production services that implement RDA recommendations
  – Provide platform for developing new practices

Assembling an ecosystem

Assembling a broad community

• Organizing an NDS Consortium
  – Data providers, community federations, university libraries, publishers, cyberinfrastructure providers, researchers across many disciplines
  – To coordinate development, operation, integration with community resources
  – Two meetings to date
    • June 2014 (Boulder), October 2014 (Rockville)
    • Organize architecture, governance; develop pilot projects
  – Next meeting: late March in Austin, TX

NDS Architecture: an ecosystem in 3 parts

• The Portal
  Complete end-to-end set of vanilla national services for storing, sharing, publishing, finding and re-using data
  Materials Data Facility – an NDS prototype in the form of a domain repo
• The Framework
  The system into which a community can plug specialized tools, portals, and services
  Connecting domain repositories
• The Infrastructure
  Foundational storage, hosting environment and software that allow communities to build their own specialized data services
  Space to build new domain repositories

We’re exploring all three of these with pilot activities now underway

Connecting publishing workflows

• Example: SEAD and repositories
  – Uses Medici tool for assembling data collections for publishing
  – Extract and collect metadata
  – Collection can be forwarded via the Virtual Archiver (and SWORD) to author’s choice of repository
• What if researchers want to combine data collected in another tool, e.g. SciDrive?
  – (SciDrive: Dropbox-like app that supports metadata-extracting plugins)
  – Want to transfer both data and collected metadata from SciDrive

[Diagram labels: Research Group; SEAD (Medici, Virtual Archiver); SWORD; Repo; Repo; Another Group; SciDrive]

Towards a publishing framework

• Topic of NDS Hackathon (Sept. 2014)
  – Imagine “standard” service interface for retrieving different aspects/views of a digital object
    • Base URL + Dataset ID + “what” qualifier
    • To get object, metadata, “landing page”, or …
    • Metadata returned in JSON-LD
    • Note related effort in RDA PID Information Types WG
  – Connecting different storage systems
    • Used OwnCloud to pull data from many different storage systems (SciDrive, Dropbox, IRODS, …)
    • Leveraged various service interfaces
• What’s a good balance between a common use of standards and “mashing up” via existing (open) interfaces?
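
The "Base URL + Dataset ID + what qualifier" idea above can be sketched roughly as follows. This is only an illustration, not the hackathon's actual interface: the base URL, the qualifier names (`object`, `metadata`, `landing`), and the JSON-LD fields are all assumptions made for the example.

```python
import json
from urllib.parse import quote

# Hypothetical qualifier names; the slide only mentions object, metadata, "landing page".
VIEWS = {"object", "metadata", "landing"}


def view_url(base_url: str, dataset_id: str, what: str) -> str:
    """Compose base URL + dataset ID + "what" qualifier into one address."""
    if what not in VIEWS:
        raise ValueError(f"unknown view: {what!r}")
    # Percent-encode the ID so identifiers containing "/" (e.g. DOIs) stay one path segment.
    return f"{base_url.rstrip('/')}/{quote(dataset_id, safe='')}/{what}"


def metadata_jsonld(dataset_id: str, title: str) -> str:
    """Return a tiny JSON-LD document using Dublin Core terms (illustrative fields only)."""
    doc = {
        "@context": {"dc": "http://purl.org/dc/terms/"},
        "@id": dataset_id,
        "dc:title": title,
    }
    return json.dumps(doc, indent=2)


# Hypothetical repository address and identifier.
example_url = view_url("https://repo.example.org/api", "10.1234/abcd", "metadata")
print(example_url)
print(metadata_jsonld("doi:10.1234/abcd", "Example dataset"))
```

Keeping the qualifier as the last path segment means one dataset ID can resolve to several views without the client knowing anything about the storage system behind the base URL.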

Aside: what do we mean by “standard”

A practice that is…
• Well described in an open document in a well-known location
• Broadly recognized and accepted by a community
Examples: SWORD, OAI-PMH, OpenID, VO specs, W3C specs, IETF RFCs

Thinking about the Framework

[Diagram labels: National Data Service Portal; Identity & Group Management; Data Movement & Access Services; Collection Creation & Sharing Space; Publishing Repository; Scholarly Journal; Mass data import tool; Lab experiment; Simulations; Computations; Data Mining Service; Community Instrument; Cross-disciplinary search service; Data coming from…; legend: Data, Metadata]

Thinking about the Framework

[Diagram labels: Identity & Group Management; Data Movement & Access Services; Collection Creation & Sharing Space; Publishing Repository; Project Repository; Data Discovery System; Data-Literature Linking Service; Scholarly Journal; Access service; Lab notebook tool; Mass data import tool; Lab experiment; Simulations; Computations; Data Mining Service; Community Instrument; Large Mission or Project; Archive data; Data coming from…; interfaces: SWORD, OAI-PMH, SHARE, OAI-ORE; legend: Data, Metadata]

Communities can replace any/all of the vanilla services with specialized versions
The Framework defines the interfaces to enable interoperability

Data discovery across disciplines

• Two key use cases
  – Broad, top-down, cross-disciplinary search
    • Imagine a search tool in an NDS portal that can discover data in any repository, data service
    • (Note: not all data sources will look like a traditional repository)
  – Finding data in related fields
    • Imagine using a search tool in a discipline-specific portal which can “reach out” to other disciplines’ systems for data
    • Could leverage an NDS search service behind the scenes
    • Perhaps limit search to specific fields

Data discovery across disciplines

• Challenges
  – The Problem of Too Many Matches
    • How does one effectively drill down?
  – How can the user smoothly transition from a generic discovery tool/service to a community-specific one?
    • Will need to engage the specific metadata, tools of the discipline/community
    • What is the role of standards in that transition?
  – How important is take-up of common standards for broad coverage?
    • Can we combine multiple approaches?
• Approaches
  – Leverage solutions developed in specific communities
  – Mash-up open APIs where practical
  – Look for capability gaps where a standard could play a role

Approaches to discovery: VO

• The Virtual Observatory landscape
  – Variety of data source types: repositories, databases, information services
    • Categories of datasets: images, spectra, time-series, tables, event-lists, …
  – Mix of static and highly-dynamic contents
• Sharing and indexing
  – Data centers provide collection-level descriptions of holdings and services
    • VOResource metadata standard: DC + astronomy-specific
    • Exposed via OAI-PMH
  – Searchable registries regularly harvest descriptions
  – Data centers support standard search services for searching tables and discovering individual datasets
• Discovery is hierarchical: search client tool will…
  – query searchable registry to discover relevant data collections and databases
    • Usually based on topic
  – query discovered collections and databases directly for datasets and records
    • Typically based on position in the sky
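
The two-stage, hierarchical discovery described above might look like the following in-memory sketch. It is not the VO protocols themselves: the `Collection` and `Dataset` classes, the `discover` function, and the sample data are hypothetical stand-ins for a searchable registry and for per-collection position searches.

```python
import math
from dataclasses import dataclass, field


@dataclass
class Dataset:
    name: str
    ra: float   # right ascension, degrees
    dec: float  # declination, degrees


@dataclass
class Collection:
    ident: str
    subjects: set
    datasets: list = field(default_factory=list)

    def cone_search(self, ra, dec, radius):
        """Stage 2: return datasets within `radius` degrees of (ra, dec)."""
        return [d for d in self.datasets
                if separation(ra, dec, d.ra, d.dec) <= radius]


def separation(ra1, dec1, ra2, dec2):
    """Angular separation in degrees between two sky positions (haversine formula)."""
    r1, d1, r2, d2 = map(math.radians, (ra1, dec1, ra2, dec2))
    a = (math.sin((d2 - d1) / 2) ** 2
         + math.cos(d1) * math.cos(d2) * math.sin((r2 - r1) / 2) ** 2)
    # Clamp to guard against rounding slightly above 1 for near-antipodal points.
    return math.degrees(2 * math.asin(min(1.0, math.sqrt(a))))


def discover(registry, topic, ra, dec, radius):
    """Stage 1: pick collections by topic; stage 2: query each one by sky position."""
    relevant = [c for c in registry if topic in c.subjects]
    return {c.ident: c.cone_search(ra, dec, radius) for c in relevant}


# Hypothetical registry contents.
registry = [
    Collection("xray-survey", {"x-ray", "galaxies"},
               [Dataset("img1", 10.0, 20.0), Dataset("img2", 200.0, -5.0)]),
    Collection("radio-catalog", {"radio"}, [Dataset("tab1", 10.1, 20.0)]),
]
hits = discover(registry, "x-ray", 10.0, 20.0, 0.5)
print({ident: [d.name for d in found] for ident, found in hits.items()})
```

Only collections that match the topic are asked about sky position, which is the point of the hierarchy: the registry narrows the set of data centers before any per-dataset query is sent.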

VAO Data Discovery Tool

Approaches to discovery: VO

• Shortcomings
  – Search effectiveness is only as good as the underlying metadata
  – Replicated and similar-but-derived resources confuse the user
  – Needs fast inventory capability
    • Often, all data centers will be queried unnecessarily
      – E.g. a collection contains data from only a particular part of the sky
    • Need centralized indexes that can indicate how many matches a collection might have for a user query
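
One way to picture the centralized inventory index called for above is a coarse sky grid with per-collection counts. The sketch below is an assumption-laden illustration: the 10-degree cell size, the `InventoryIndex` class, and the collection names are invented for the example, and a real index would also need to handle queries that straddle cell borders.

```python
from collections import Counter, defaultdict

CELL = 10.0  # cell size in degrees; an arbitrary choice for this sketch


def cell_of(ra, dec):
    """Map a sky position (degrees) to a coarse grid cell."""
    return (int((ra % 360.0) // CELL), int((dec + 90.0) // CELL))


class InventoryIndex:
    """Central index: per collection, how many datasets fall in each cell."""

    def __init__(self):
        self._counts = defaultdict(Counter)

    def add(self, collection_id, ra, dec):
        self._counts[collection_id][cell_of(ra, dec)] += 1

    def estimate(self, ra, dec):
        """Return {collection_id: count} for collections with data in the query's cell."""
        cell = cell_of(ra, dec)
        return {cid: cells[cell]
                for cid, cells in self._counts.items() if cells[cell] > 0}


# Hypothetical holdings of two data centers.
index = InventoryIndex()
index.add("north-survey", 15.0, 62.0)
index.add("north-survey", 17.5, 65.0)
index.add("south-survey", 15.0, -62.0)
estimate = index.estimate(16.0, 61.0)
print(estimate)
```

A client could consult such an index first and skip any data center whose collection has no entry for the queried region.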

Approaches to discovery: SHARE

• Registry filled via notifications, drives discovery
  – Push rather than harvesting pull
  – Normalize references to avoid duplication
  – Track relationships between articles, data, authors (Note: VIVO-based searching)
• My questions
  – Will it require a pull component, too?
  – What are the assumptions about the data “things” and their sources in registry?
    • Granularity
    • Types of data sources?
  – Can model leverage community-specific discovery capabilities?
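
As a rough picture of a push-filled registry that normalizes references and tracks relationships, consider the sketch below. It does not reflect SHARE's actual notification format or API: the event fields, the `PushRegistry` class, and the DOI normalization rules are assumptions chosen for illustration.

```python
def normalize_doi(ref: str) -> str:
    """Reduce a few common DOI spellings to one canonical lower-case form."""
    ref = ref.strip().lower()
    for prefix in ("https://doi.org/", "http://dx.doi.org/", "doi:"):
        if ref.startswith(prefix):
            ref = ref[len(prefix):]
            break
    return ref


class PushRegistry:
    """Registry filled by notifications rather than by harvesting."""

    def __init__(self):
        self.records = {}   # normalized id -> latest description
        self.links = set()  # (source id, relation, target id)

    def notify(self, event: dict) -> None:
        """Accept one pushed notification (the event fields are hypothetical)."""
        key = normalize_doi(event["id"])
        self.records[key] = {"title": event["title"], "type": event["type"]}
        for relation, target in event.get("relations", []):
            self.links.add((key, relation, normalize_doi(target)))


share_like = PushRegistry()
share_like.notify({"id": "doi:10.1234/DATA.1", "title": "Survey data", "type": "dataset"})
share_like.notify({
    "id": "https://doi.org/10.1234/data.1",
    "title": "Survey data",
    "type": "dataset",
    "relations": [("isReferencedBy", "DOI:10.5678/Paper.9")],
})
print(len(share_like.records), sorted(share_like.links))
```

Both notifications refer to the same dataset under different spellings, so the registry ends up with a single record plus one data-to-article link.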

What a solution might look like

• Technical perspective
  – Collection of high-level metadata to central registry(-ies)
    • Via both push (SHARE) and pull (OAI-PMH)
    • Includes inventory information: # of datasets over different facets
  – Metadata
    • DC/DataCite largely enough
    • Discipline/community categories
      – E.g. astronomy & space physics, geophysics, geography, biology, behavioral sciences, etc.
      – 8-10 categories, so that they can be easily integrated into tool/service interfaces
      – Can limit search by discipline, see results faceted by discipline
    • Identifiers for the federations the data is known to
      – For transitioning to community tools
  – Connecting to discipline/community-specific tools
    • Standard ways of launching tools pre-seeded with a query
    • Encourage exposure of REST-like interfaces and ID-resolving services
    • Mash-up multiple search mechanisms as needed
  – Leverages data format, data type registries
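
The pull side (OAI-PMH) of the metadata collection above could be sketched as follows. The `ListRecords` verb, the `oai_dc` metadata prefix, and the `resumptionToken` argument come from the OAI-PMH 2.0 protocol; the repository address and the sample response are made up, and the sketch parses a canned string instead of contacting a server.

```python
import xml.etree.ElementTree as ET
from urllib.parse import urlencode

NS = {
    "oai": "http://www.openarchives.org/OAI/2.0/",
    "dc": "http://purl.org/dc/elements/1.1/",
}


def list_records_url(base_url, resumption_token=None):
    """Build an OAI-PMH ListRecords request URL for Dublin Core records."""
    if resumption_token:
        # In OAI-PMH, resumptionToken is an exclusive argument alongside the verb.
        params = {"verb": "ListRecords", "resumptionToken": resumption_token}
    else:
        params = {"verb": "ListRecords", "metadataPrefix": "oai_dc"}
    return f"{base_url}?{urlencode(params)}"


def parse_records(xml_text):
    """Return ([(identifier, title), ...], resumption_token_or_None)."""
    root = ET.fromstring(xml_text)
    records = []
    for rec in root.iterfind(".//oai:record", NS):
        ident = rec.findtext("oai:header/oai:identifier", namespaces=NS)
        title = rec.findtext(".//dc:title", namespaces=NS)
        records.append((ident, title))
    token = root.findtext(".//oai:resumptionToken", namespaces=NS)
    return records, (token or None)


# Made-up response from a hypothetical repository.
SAMPLE = """<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/">
  <ListRecords>
    <record>
      <header><identifier>oai:repo.example.org:42</identifier></header>
      <metadata>
        <oai_dc:dc xmlns:oai_dc="http://www.openarchives.org/OAI/2.0/oai_dc/"
                   xmlns:dc="http://purl.org/dc/elements/1.1/">
          <dc:title>Example collection</dc:title>
        </oai_dc:dc>
      </metadata>
    </record>
    <resumptionToken>page-2</resumptionToken>
  </ListRecords>
</OAI-PMH>"""

records, token = parse_records(SAMPLE)
print(records, token)
print(list_records_url("https://repo.example.org/oai", token))
```

A harvester would repeat the request with each returned token until none is present, then feed the collected descriptions into the central registry.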

What a solution might look like

• User perspective
  – Both portal-based and REST-like interfaces supported
  – Queries can be limited by discipline
  – Results indicate inventories:
    • Simple results: # of hits in each discipline category, can drill down to see details
    • Faceted browsing
  – Results also organized by accessibility by community federations/portals
    • Clicking on set launches community tool with search constraints
  – Some discipline-specific capabilities may be leveraged at a high-level; e.g., geographic coordinates
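
From the user's side, the inventory-style results and the hand-off to a community tool could be sketched like this. The result records, the tool address, and the query parameter names are hypothetical; no existing NDS or community interface is implied.

```python
from collections import Counter
from urllib.parse import urlencode


def hits_by_discipline(results):
    """Count hits per discipline category for a simple inventory view."""
    return Counter(r["discipline"] for r in results)


def launch_url(tool_base_url, constraints):
    """Build a link that opens a community tool pre-seeded with search constraints."""
    return f"{tool_base_url}?{urlencode(sorted(constraints.items()))}"


# Hypothetical search results from a generic discovery service.
results = [
    {"id": "a", "discipline": "astronomy & space physics"},
    {"id": "b", "discipline": "geophysics"},
    {"id": "c", "discipline": "astronomy & space physics"},
]
facets = hits_by_discipline(results)
link = launch_url("https://tool.example.org/search",
                  {"q": "supernova", "ra": 10.0, "dec": 20.0})
print(dict(facets))
print(link)
```

The facet counts give the "# of hits in each discipline" view, and the generated link carries the same constraints into the discipline-specific tool so the user does not have to re-enter the query.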

NDS Community Development

NDS Consortium is encouraging integration of existing solutions and development of new ones as needed

• Pilot Projects
  – Explore parts of problem
  – Support development of funding proposals
  – Data-Literature Pilot: shared database of data-literature links
  – Discovery: would like to support SHARE in any way that is useful
• NDS Labs and NDS Share
  – Labs: developer environment with access to substantial storage, virtualization, data management software
  – Share: platform for exposing solutions to friendly users
• Hackathons

We welcome new partners; visit www.nationaldataservice.org