Materials Data Facility - eResearch Online 2020


Post on 07-Oct-2020


Page 1: Materials Data Facility - eResearch Online 2020

Logan Ward (1), Ben Blaiszik (1,2), Ian Foster (1,2), Ryan Chard (2), Jonathon Gaff (1), Kyle Chard (1), Jim Pruyne (1), Rachana Ananthakrishnan (1), Steven Tuecke (1), Michael Ondrejcek (3), Kenton McHenry (3), John Towns (3)

(1) University of Chicago, (2) Argonne National Laboratory, (3) University of Illinois at Urbana-Champaign

materialsdatafacility.org

globus.org

Materials Data Facility: A Distributed Model for the Materials Data Community

Page 2: Data-Intensive Materials Science

Panels: Materials Databases, High-Throughput Screening, Machine Learning, Multi-scale Modeling

Image credits: Kirklin et al., Acta Mat. (2016); de Jong et al., Sci. Rep. (2016); Sparks et al., Scr. Mat. (2015); https://www.mpg.de/

Page 3: Data-Intensive Materials Science

Science is becoming limited by the ability to handle data

- Where to get it?

- How to selectively share it?

- Where to store it?

- How to know what it is?

- How to build software that uses it?

- How to get others to share theirs?

- How to keep track of provenance?

- ….?

Our goal is to create infrastructure that provides easy answers to these questions

Page 4: What is the MDF?

[Diagram: the three components of the MDF]

1. Distributed data storage: endpoints (EP)
2. Data publication service: Publish, Mint DOIs, Associate metadata
3. Data discovery service: Deep indexing, Query, Browse, Aggregate (over Databases, Datasets, APIs, LIMS, etc.)

Page 5: Globus Background

Endpoint

• E.g. a laptop or server running a Globus client (analogous to a Dropbox client)
• Enables advanced file transfer and sharing
• Currently GridFTP; in future, GridFTP + HTTP

Some Key Features

• REST API for automation and interoperability
• Web UI for convenience
• Optimizes and verifies transfers
• Handles auto-restarts
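The "verifies transfers" and "handles auto-restarts" features can be pictured with a small local analogue. The sketch below is not Globus code and does not use GridFTP; it is a standard-library illustration, with hypothetical names (`verified_copy`, `max_attempts`), of copying a file, comparing checksums, and retrying on a mismatch.

```python
import hashlib
import shutil
import tempfile
from pathlib import Path


def sha256_of(path: Path) -> str:
    """Return the hex SHA-256 digest of a file, reading it in 1 MiB chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verified_copy(src: Path, dst: Path, max_attempts: int = 3) -> int:
    """Copy src to dst, retrying until the checksums match.

    Returns the number of attempts used. This is an illustrative
    stand-in for transfer verification, not how Globus implements it.
    """
    expected = sha256_of(src)
    for attempt in range(1, max_attempts + 1):
        shutil.copyfile(src, dst)
        if sha256_of(dst) == expected:
            return attempt
    raise IOError(f"checksum mismatch after {max_attempts} attempts")


def demo_verified_copy() -> int:
    """Copy a small throwaway file inside a temporary directory."""
    with tempfile.TemporaryDirectory() as tmp:
        src = Path(tmp) / "source.dat"
        dst = Path(tmp) / "copy.dat"
        src.write_bytes(b"example payload " * 1000)
        return verified_copy(src, dst)
```

On a healthy local disk `demo_verified_copy()` should succeed on the first attempt; the retry loop only matters when a copy is interrupted or corrupted.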

Page 6: Globus Platform-as-a-Service (PaaS)

Data sharing

• Share directly from your storage device (laptop or cluster)
• File and directory-level ACLs

User groups

• Manage user group creation and administration flows
• Share data with user groups

Data transfer

• High-performance data transfer from a web browser
• Optimize transfer settings and verify transfer integrity
• Add your laptop to the Globus cloud with Globus Connect Personal

Identity management

• Create and manage a unique identity linked to external identities for authentication

Publication Discovery

Page 7: Data sharing and Globus

Easily control who gains access to your data:

- Globus can use University/Laboratory credentials

- You can establish groups of authorized users

Page 8: REST APIs, Clients, and Docs

• New Python SDK available
  ▪ https://github.com/globusonline/globus-sdk-python
• Jupyter Notebook Examples
  ▪ https://github.com/globus/globus-jupyter-notebooks
• Sample Data Portal
  ▪ https://github.com/globus/globus-sample-data-portal
• (alpha) MDF Data Publication Service API

Page 9: DATA PUBLICATION

[Diagram: endpoints (EP) for distributed data storage, the data publication service, and the data discovery service]

Page 10: Materials Data Publication Service

Page 11: Datasets Are Citable

Page 12: Publication statistics

Data volumes: 15.0 TB; 13.4 TB out

- Publication authors: 94
- Institutions: 14
- Accesses: >1000
- Total datasets: 50
- CHiMaD datasets: 16

Pipeline

- CHiMaD datasets: +14
- Total datasets: +30

Page 13: Publication Route #1: MDF Storage

~30 datasets

~6.5 TB

MATIN (GT): ~10 datasets

Used in education

• X-ray Scattering Image Classification Using Deep Learning (Yager et al.)
  http://dx.doi.org/10.18126/M2Z30Z
• Electron Backscattering and Diffraction Datasets for Ni, Mg, Fe, Si (Marc De Graef et al.)
• Phase Field Benchmark I Dataset (Jokisaari et al.)
• Grain Structure, Grain-averaged Lattice Strains, and Macro-scale Strain Data for Superelastic Nickel-Titanium Shape Memory Alloy Polycrystal Loaded in Tension (Paranjape et al.)
• Largest dataset to date (>1.5 TB). Showcases MDF's unique capabilities and makes a unique dataset discoverable for code development, analysis, and benchmarking

Page 14: Customization: Collection Model

Page 15: Customization: Collection Model

• Collections might be a research group or a research topic...
• Collections have specified
  ▪ Mapping to storage endpoint
    ▪ Currently handled as automatically created shared endpoints
  ▪ Metadata schemas
  ▪ Access control policies
  ▪ Licenses
  ▪ Curation workflows
• Collections contain
  ▪ Datasets
  ▪ Data
  ▪ Metadata
• Metadata Persistence
  ▪ Metadata log file with dataset
  ▪ Metadata replicated in search index
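As a rough illustration of the collection model, the sketch below represents a collection as a plain data structure. It is an assumption-laden toy, not the MDF publication service's data model: the class names, field names, endpoint label, license string, and schema keys are all hypothetical. It shows a collection carrying a storage mapping, a metadata schema, and a license, and mirroring dataset metadata into a search index.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple


@dataclass
class Dataset:
    title: str
    metadata: Dict[str, str]
    files: List[str] = field(default_factory=list)


@dataclass
class Collection:
    name: str
    storage_endpoint: str              # placeholder label, not a real endpoint ID
    metadata_schema: Tuple[str, ...]   # metadata keys every dataset must supply
    license: str
    datasets: List[Dataset] = field(default_factory=list)
    search_index: Dict[str, Dict[str, str]] = field(default_factory=dict)

    def missing_fields(self, dataset: Dataset) -> List[str]:
        """List schema keys that the dataset's metadata leaves empty."""
        return [key for key in self.metadata_schema if not dataset.metadata.get(key)]

    def add(self, dataset: Dataset) -> None:
        """Accept a dataset only if it satisfies the collection's schema."""
        missing = self.missing_fields(dataset)
        if missing:
            raise ValueError("missing metadata: " + ", ".join(missing))
        self.datasets.append(dataset)
        # Keep a second copy of the metadata in the (toy) search index.
        self.search_index[dataset.title] = dict(dataset.metadata)


def build_example_collection() -> Collection:
    """Build a collection with one dataset; all values are invented."""
    collection = Collection(
        name="example-group",
        storage_endpoint="example-shared-endpoint",
        metadata_schema=("material", "technique"),
        license="CC-BY-4.0",
    )
    collection.add(
        Dataset("Example dataset", {"material": "NiTi", "technique": "diffraction"})
    )
    return collection
```

Keeping the schema on the collection rather than on each dataset reflects the slide's point that schemas, licenses, and policies are specified per collection.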

Page 16: Share Data with Flexible ACLs

• Share data publicly, with a set of users, or keep data private

Leverage Curation Workflows

• Collection administrators can specify the level of curation workflow required for a given collection, e.g.
  ▪ No curation
  ▪ Curation of metadata only
  ▪ Curation of metadata and files
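The sharing options and curation levels on this slide can be sketched as two small enumerations plus the checks they imply. This is a hypothetical illustration, not the service's access-control or curation code: the names `Visibility`, `CurationLevel`, `can_read`, and `ready_to_publish` are assumptions, as is the idea that a curator records approvals as a set of strings.

```python
from enum import Enum
from typing import FrozenSet, Set


class Visibility(Enum):
    PUBLIC = "public"
    SHARED = "shared"    # visible to a named set of users
    PRIVATE = "private"


class CurationLevel(Enum):
    NONE = "no curation"
    METADATA_ONLY = "curation of metadata only"
    METADATA_AND_FILES = "curation of metadata and files"


# What a curator must approve at each level before publication (assumed mapping).
REQUIRED_APPROVALS = {
    CurationLevel.NONE: frozenset(),
    CurationLevel.METADATA_ONLY: frozenset({"metadata"}),
    CurationLevel.METADATA_AND_FILES: frozenset({"metadata", "files"}),
}


def can_read(
    user: str,
    owner: str,
    visibility: Visibility,
    shared_with: FrozenSet[str] = frozenset(),
) -> bool:
    """Return True if `user` may read data owned by `owner`."""
    if visibility is Visibility.PUBLIC or user == owner:
        return True
    if visibility is Visibility.SHARED:
        return user in shared_with
    return False


def ready_to_publish(level: CurationLevel, approved: Set[str]) -> bool:
    """Return True once every approval required at this level is present."""
    return REQUIRED_APPROVALS[level] <= set(approved)
```

For example, `ready_to_publish(CurationLevel.METADATA_AND_FILES, {"metadata"})` is false until the files have also been approved, while a collection with no curation publishes immediately.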

Page 17: Example: NUCAPT Data Publication

Goal:

- Aid metadata capture

- Simplify data publication

Approach: Lightweight web service

- Form-based metadata capture

- Automatic file management

- “One-click” data publication

Results:

- Beta version deployed Sept ‘17

Figure callouts: Organizes data, co-locates metadata; Form-based metadata capture
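One way to picture form-based capture with co-located metadata is a function that validates submitted form fields and writes them next to the data files. The sketch below is not the NUCAPT service; the required field names, the `metadata.json` file name, and the directory layout are assumptions made for illustration.

```python
import json
import tempfile
from pathlib import Path
from typing import Dict, Iterable

# Hypothetical required form fields; the real form is not described in the slides.
REQUIRED_FIELDS = ("title", "authors", "sample")


def capture_metadata(
    form: Dict[str, object],
    dataset_dir: Path,
    required: Iterable[str] = REQUIRED_FIELDS,
) -> Path:
    """Validate form input and store it alongside the dataset's files."""
    missing = [name for name in required if not form.get(name)]
    if missing:
        raise ValueError("missing form fields: " + ", ".join(missing))
    dataset_dir.mkdir(parents=True, exist_ok=True)
    target = dataset_dir / "metadata.json"
    target.write_text(json.dumps(form, indent=2, sort_keys=True), encoding="utf-8")
    return target


def demo_capture() -> Dict[str, object]:
    """Write metadata for an invented dataset and read it back."""
    with tempfile.TemporaryDirectory() as tmp:
        form = {
            "title": "Example specimen",
            "authors": ["A. Researcher"],
            "sample": "specimen-001",
        }
        path = capture_metadata(form, Path(tmp) / "run-001")
        return json.loads(path.read_text(encoding="utf-8"))
```

Because the metadata file lives in the dataset directory, moving or publishing the directory carries the metadata with it.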

Page 18: DATA DISCOVERY [AND USE]

[Diagram: endpoints (EP) for distributed data storage, the data publication service, and the data discovery service]

Page 19: Part 1: Linking with the Data Community

Materials Project, Citrination, Materials Commons

Other Facilities (APS, SNS, NSLS, …), Institutional Repositories, Publishers!

Diagram link labels: Metadata, Publishing; Metadata, MD, Pub., Compute; Metadata, Publishing

Also shown: NCSA-PIRE, HV/TMS, MBDH

Page 20: Many Databases, Single Search

Page 21: MDF + NIST Database Tools

[Diagram: MDCS, NIST MRR, and the MDF data discovery service]

MDF automates publicizing data and provides a uniform search interface
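The idea of a uniform search interface over several databases can be sketched as a fan-out query that tags each hit with its origin. The code below is a toy illustration under stated assumptions: it does not call the MDF discovery service or any NIST tool, and the source names, record titles, and function names are invented.

```python
from typing import Callable, Dict, List

Record = Dict[str, str]
SearchFn = Callable[[str], List[Record]]


def make_source(name: str, records: List[Record]) -> SearchFn:
    """Wrap an in-memory record list as a searchable source."""

    def search(term: str) -> List[Record]:
        needle = term.lower()
        return [
            dict(record, source=name)
            for record in records
            if needle in record["title"].lower()
        ]

    return search


def search_all(sources: List[SearchFn], term: str) -> List[Record]:
    """Send one query to every source and merge the hits, sorted by title."""
    hits: List[Record] = []
    for search in sources:
        hits.extend(search(term))
    return sorted(hits, key=lambda hit: (hit["title"], hit["source"]))


def demo_search() -> List[Record]:
    """Query two invented sources with a single call."""
    source_a = make_source(
        "source_a",
        [
            {"title": "Example nickel alloy diffraction data"},
            {"title": "Example polymer scattering images"},
        ],
    )
    source_b = make_source(
        "source_b",
        [{"title": "Example nickel phase-field benchmark"}],
    )
    return search_all([source_a, source_b], "nickel")
```

The caller issues one query and never needs to know how many sources exist; each hit records where it came from so that provenance is not lost in the merge.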

Page 22: MDF data discovery ecosystem

Diagram labels:

- Data Sources: EP, NIST MRR, MDF Pub Service
- Indexing: Identify resources for indexing, Harvest, Deep index, Register / Sync
- Data discovery service
- User Interfaces: Query, Browse, Aggregate
- Services, Bots: Automate, Process, Refine, Analyze (Data Input, Data Output)

Page 23: Summary

Three Major Components of Materials Data Facility

1. Globus
   ▪ High speed data transfer
   ▪ Easy data sharing
2. Data Publication Service
   ▪ Simple data publication, from your own
   ▪ Free data publication
3. Data Discovery Service
   ▪ Single search engine for many materials databases
   ▪ Python API for accessing these databases

Page 24: Thanks to our sponsors!

U.S. Department of Energy