
+ Common Framework Working Groups

Owen White and many more

+ Why this is confusing

■ Several different initiatives
  ■ BD2K, Common Fund, Global Alliance, Genomic Data Commons

■ Several different virtual spaces
  ■ GDC, Hutch Data Commonwealth, Cloud pilots

■ Co-opting several pre-existing activities
  ■ MODs, GA4GH, HMP

+ Why this is REALLY confusing

+ Ready or not… we are building an ecosystem

■ Living, thriving, dynamic and very new concept

■ Composed of many incubators

■ Some technologies will prevail, some will not

■ It is not appropriate or possible to:
  ■ do this in isolation
  ■ burn resources while just doing our own research

+ Data Commons Components

Reference Data Sets

Resource Search & Index

Cloud Credit Model

Commons Framework Pilots

• GDC
• Human Microbiome Project
• Global Alliance
• MODs
• RFI – engage community

Winter 2017

• FOAs – place high-impact data sets in the cloud

Spring 2017

+ Data Commons Components

Reference Data Sets

Resource Search & Index

Cloud Credit Model

Commons Framework Pilots

• Data Discovery Index (DDI) Consortium (bioCADDIE, DataMed, OmicsDI, and others)

• Aggregation of metadata presented on web

• Driving metadata standards

• Search/query services

+ Data Commons Components

Reference Data Sets

Resource Search & Index

Cloud Credit Model

Commons Framework Pilots

• 3-year pilot to test business model
• Investigators receive credits for use with cloud providers
• Provider debits against account in pay-as-you-go model
• Amazon Reseller, IBM, Google Reseller, Broad and NCI Cloud Pilots

+ Data Commons Components

Reference Data Sets

Resource Search & Index

Cloud Credit Model

Commons Framework Pilots

Several examples…

+The CEDAR Approach to Metadata

+ AzTec

Building a Technology Platform to Integrate Biomedical Resources

https://aztec.bio

+

Faceted search
Metadata editor
API testing
Repository API

smartAPI Interoperability Pilot
Development of a community-based standard
Intelligent authoring of API metadata

+

Brian O’Connor - UCSC

+ Motivation – broad goals…and why you should participate

■ Everyone has a lot to share – let’s ensure we socialize our research products

■ Vision for Commons implementation

■ Self-governance

■ Managing standards proliferation

■ We are not in competition with each other

■ Setting guidelines in RFAs

+ Common Framework Working Group

■ Development of FAIRness Metrics
  ■ objective measures for the degree of data availability

■ Metadata documentation of APIs
  ■ creating a "minimal list" that describes available APIs

■ Data-object registry / indexing
  ■ approaches to make all data findable

■ Workflow sharing and Docker registry
  ■ we have lots of workflows; how do we share them?

■ Commons publication initiative
  ■ a coordinated publication plan

FAIRness

FAIRness Metrics
Optimizing the FAIR alignment of research assets, roles & relationships

Neil McKenna, Ph.D., Baylor College of Medicine

Co-Chair, FAIRness Metrics Subgroup

Co-chair: Michel Dumontier, Ph.D

What are FAIRness Metrics?

• The FAIR principles articulate ideals in research
• FAIRness Metrics give effect to these principles to advance FAIRness in research
• The Commons FAIRness Metrics Subgroup (FMSG) has been tasked with developing FAIRness Metrics
• The first (ongoing) step for the FMSG is to comprehensively define the components of the research ecosystem:
  – Research assets
  – Research roles
  – Research relationships

What are the assets of research?

Datasets

Metadata & Standards

Research Resources

Applications, services & tools

Defining research roles: two examples

Asset Producers: individual bench researchers, tool & app developers

Asset Stewards: ontology organizations, primary data repos, software registries, research resource stores

What are the roles in research?

Asset Consumers

Asset Producers

Asset Stewards

Publishers

Asset Sponsors

Asset Indexers & Registries

What are the relationships between these roles?

What are the relationships between assets and roles?

+

Web-based analysis widget (diagram: Asset Producers, Asset Stewards, Publishers, Asset Sponsors, Asset Consumers, Asset Indexers & Registries)

FAIRness Metrics & Indexes

• FAIRness metrics seek to optimize the alignment of research assets, roles and relationships with the FAIR principles

• Unique roles & relationships require unique sets of metrics
  – FAIRness Index: a custom set of metrics tailored to a specific research role, its assets, and its relationships with other roles

FAIRness Indexes: holding a mirror up to research roles

• Asset Producers: How well are the products of my research shared with other researchers?
• Asset Stewards: Are assets optimally exposed to both machines & humans?
• Publishers: Is the relationship between research articles and their supporting assets properly recognized?
• Asset Consumers: Do I give full attribution when I re-use assets?
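A role-specific FAIRness Index can be thought of as a weighted checklist. A minimal sketch, assuming hypothetical metric names and weights (the FMSG's actual metrics were still being defined at the time of this deck):

```python
# Hypothetical FAIRness Index for the Asset Producer role.
# Metric names and weights are illustrative, not FMSG-endorsed.
PRODUCER_INDEX = {
    "has_persistent_identifier": 0.3,    # Findable
    "metadata_in_public_registry": 0.3,  # Findable / Accessible
    "uses_community_vocabulary": 0.2,    # Interoperable
    "has_usage_license": 0.2,            # Reusable
}

def fairness_score(asset: dict, index: dict) -> float:
    """Sum the weights of the metrics this asset satisfies (0.0-1.0)."""
    return sum(weight for metric, weight in index.items() if asset.get(metric))
```

A different role (e.g. a publisher) would get its own index with metrics about article-asset linking, matching the tailoring described above.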


Get involved – please!

• Long-term goal is to have FAIRness Indexes adopted by funding agencies & incorporated into FOAs
• We need help from the community:
  – Defining & identifying research assets and roles
  – Developing & refining use cases that define the relationships between research assets & roles
  – The more roles that are represented in the FMSG, the better FAIRness Indexes will reflect the real research world
• To get involved in the FMSG, complain, or just find out more about what we’re doing, contact:
  – Neil McKenna (nmckenna@bcm.edu)
  – Michel Dumontier (michel.dumontier@stanford.edu)
• Or stop by Poster 135 later on!

Current FMSG roster: thank you
Mark Wilkinson (University of Madrid)
Alejandra Gonzalez-Beltran, Philippe Rocca-Serra, Susanna Sansone (Oxford University)
Allen Dearry, Elaine Collier (NIH)
Lucila Ohno-Machado, Jeff Grethe (UCSD)
Mark Musen (Stanford University)
Tim Clark (Harvard Medical School)
Nolan Nichols (SRI/Stanford)
Tobias Kuhn (VU University Amsterdam)
Carole Goble (The University of Manchester)
Jo McEntyre (EBI)
Luiz Bonino (DTL/VU)
Alasdair Gray (Heriot-Watt University)
Marco Roos, Katy Wolstencroft, Mark Thompson (Leiden University Medical Center)
Richard Finkers (Wageningen UR)
Christina Lohr, Holly Falk-Krzesinski, Anita deWaard, Paul Groth (Elsevier)
Ronak Patel (Baylor College of Medicine)
Lisa Federer (NIH Library)

CFWG API Interoperability Working Group
Improving the discoverability, accessibility, interoperability and reuse of web APIs

Co-Chairs: Michel Dumontier and Chunlei Wu

@micheldumontier::CFWG:30-11-2016

Motivation

Biomedical science is increasingly being done using cloud-based, web-friendly application programming interfaces (APIs).

BUT it’s pretty much impossible to automatically discover which API to use and how to connect these together to create an effective workflow.

→ barrier to discovery.

51 APIs → 1,184 APIs → 14,952 APIs

Examining the metadata for the myGene.info web API

Gene → myGene.info → ?

GenBank identifier

Affymetrix identifier

Taxonomy identifier

… 1340 lines …

HGNC symbol

?

NCBI Gene Terminology

Profiling the API output

What do these symbols refer to? How do we find out more?

How does myGene.info connect with myVariant.info?

Gene → myGene.info → ? → myVariant.info

Knowing how APIs connect is essential for (automated) workflow composition

Problem Statement

There is an overwhelming lack of explicit knowledge pertaining to the structure and data type of web API inputs and outputs.

If web APIs were annotated with semantic metadata, they would be easier to discover, connect together, and reuse.

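If each API operation carried machine-readable input/output concepts, workflow composition becomes a graph search. A minimal sketch; the concept labels and operation names below are hypothetical annotations for illustration, not actual smartAPI metadata:

```python
from collections import deque

# Hypothetical semantic annotations: each operation declares the concept of its
# input and its output (identifiers.org-style prefixes, chosen for illustration).
APIS = [
    {"name": "mygene.info/query",    "input": "hgnc.symbol", "output": "ncbigene"},
    {"name": "mygene.info/gene",     "input": "ncbigene",    "output": "ensembl.gene"},
    {"name": "myvariant.info/query", "input": "ncbigene",    "output": "dbsnp"},
]

def compose(start, goal, apis):
    """Breadth-first search for a chain of operations whose output concept
    feeds the next operation's input concept."""
    queue = deque([(start, [])])
    seen = {start}
    while queue:
        concept, path = queue.popleft()
        if concept == goal:
            return path
        for api in apis:
            if api["input"] == concept and api["output"] not in seen:
                seen.add(api["output"])
                queue.append((api["output"], path + [api["name"]]))
    return None  # no chain of annotated operations reaches the goal concept
```

With such annotations, finding a gene-symbol-to-variant workflow is a lookup rather than a manual reading of API docs; without them, as the slide says, the connection stays implicit.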

API Interoperability CFWG

To foster a collaborative environment for the discussion, development and evaluation of infrastructure and guidelines that facilitate the discoverability, implementation, deployment, interoperability and reuse of web APIs


API Interoperability WG People
Michel Dumontier
Amrapali Zaveri
Shima Dastgheib
Chunlei Wu
Caty Chung
Raymond Terryn
Paul Avillach
Kevin Osborn
David Steinberg
Kathleen Jagodnik
Gregg Kellogg
Nolan Nichols
Mark Wilkinson
Ruben Verborgh
Mary Shimoyama
Jeff De Pons
Denise Luna

http://mygene.info
http://ruben.verborgh.org/blog/2013/11/29/the-lie-of-the-api/
http://dumontierlab.com
http://www.lincsproject.org
http://bd2k-picsure.hms.harvard.edu
https://spec-ops.io
http://nidm.nidash.org/
https://cgl.genomics.ucsc.edu/
http://sadiframework.org
https://bd2kccc.org/
http://rgd.mcw.edu/

Metadata Survey


We performed a survey of 3 repositories (BioCatalogue, ProgrammableWeb, ELIXIR Tools & Services Registry) and 5 specifications (MIAS, OpenAPI, SADI, schema.org, and a preliminary smartAPI metadata specification).


Metadata Elements: 20 basic, 6 provider, 10 operation, 12 parameter, 6 response


smartAPI metadata authoring tool

Metadata authoring made easy. We extended the Swagger Editor to validate using the smartAPI specification and to suggest metadata elements and values from the smartAPI repository API.

Unify API data with Linked Open Data


WG members are documenting their APIs!

API Interoperability CFWG
Mission: To foster a collaborative environment for the discussion, development and evaluation of infrastructure and guidelines that facilitate the discoverability, implementation, deployment, interoperability and reuse of web APIs.

Planned Activities
– Finalizing vision and API metadata specification
– Demonstrating and evaluating the usability and utility of our work
– Implementing and using smartAPIs in reproducible discovery science
– Coordinating activities with the GA4GH API group
– Investigating FAIR metrics for APIs
– Your idea here!

Participation
– Join the mailing list and participate in biweekly teleconference calls
– Work with an excellent group of people with broad expertise
– Take credit for transforming the API ecosystem in BD2K … and beyond!


michel.dumontier@stanford.edu
Website: http://dumontierlab.com

Presentations: http://slideshare.com/micheldumontier

BD2K Indexing Working Group

a consolidated effort of the Commons Framework Pilots WG, the Centers of Excellence Coordination Center,

and the Data Discovery Index Consortium

Current co-chairs


Wei Wang, UC Los Angeles

Michel Dumontier, Stanford

Lucila Ohno-Machado, UC San Diego

Founding members (everyone is welcome to join)
George Alter, Univ. Michigan
Elizabeth Bell, UCSD
Alejandra Gonzalez-Beltran, Philippe Rocca-Serra, Susanna Sansone, Univ. Oxford
Judith Blake, The Jackson Laboratory
Brian Bleakley, BD2K Centers Coordinating Center
Benjamin Hitz, Stanford
Iyad Obeid, Temple
Joe Picone, Temple
Kevin Read, NYU

Operating Principles

• Data integration is key to functional and comparative biomedicine (-omics, clinical medicine, public health, health economics)
  – Allows data to be evaluated in new contexts
• Standards are key to data integration
  – Nomenclature
    • Standardized nomenclature, keywords, etc.
  – Knowledge representation
    • Gene Ontology (GO)
    • Mammalian Phenotype Ontology
    • Others

Adapted from J Blake’s slide, The Jackson Laboratory

Gaps in the Metadata Workflow

Most data are “born digital,” but metadata are orphans


• Curating data is an expensive manual process
• When data are created in silico, why are annotations entered manually?
• There are gaps in the scientific workflow because tools for managing, transforming, and analyzing data are not metadata-aware
• Tools to automate the capture and maintenance of metadata are needed
• Example:
  – Many types of data are analyzed in statistical packages (R, SAS, etc.) that do not read or write metadata (data transformations from statistical software must be annotated by hand)
  – Other analytical software should also read/write metadata (and be indexed)

Adapted from G Alter’s slide, University of Michigan
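One way to close this gap is for analysis code to emit provenance metadata as a side effect of each transformation, rather than relying on hand annotation afterward. A minimal sketch; the function and record field names are illustrative, not from any particular metadata standard:

```python
import datetime

def transform_with_metadata(values, fn, fn_name, log):
    """Apply a transformation and append a machine-readable provenance record,
    so the metadata travels with the analysis instead of being written by hand."""
    result = [fn(v) for v in values]
    log.append({
        "operation": fn_name,
        "records_in": len(values),
        "records_out": len(result),
        "timestamp": datetime.datetime.now(datetime.timezone.utc).isoformat(),
    })
    return result

# Usage: the provenance log accumulates one record per transformation step.
provenance = []
scaled = transform_with_metadata([1, 2, 3], lambda v: v * 10, "scale_x10", provenance)
```

A metadata-aware statistical package would do the equivalent internally, making its output directly indexable.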

Annotating Data Repositories

(Diagram: data sources — repositories, data sets, funding agencies, data producers, publishers — feed metadata ingestion; a terminology server supports query expansion and result ranking; metadata management handles mapping and indexing; the DataMed user interface sits on top of a search engine.)

Dataset Ingestion Challenges and Costs (1)

Challenges we have encountered → Costs

1. Lack of metadata documentation → Human labor and time spent investigating the repository website to understand the data it provides, and to find solutions for obtaining metadata

2. Limited readily accessible metadata → Human labor and time spent designing a web crawler to collect available data from the repository website before translating them into the metadata required for indexing; hardware to meet computational needs for web-crawling tasks

3. Lack of domain knowledge (from the indexing team) → Human labor and time spent understanding the biological and/or technical contents of the data repository

4. Heterogeneity in metadata and data formats → Human labor and time spent on iterative refinement of DATS mapping as well as transformation and ingestion scripts

Adapted from H Kim’s slide, UC San Diego

Dataset Ingestion Challenges and Costs (2)

Challenges
• Setting up the ingestion pipeline is complicated and time-consuming (one-time process)
• Metadata download and ingestion requires domain expertise to verify validity & granularity
• Domain experts required to verify indexing
• Heterogeneity across curators during the mapping process
• Code for harvesting metadata invariably needs to be customized for each repository
• Poor documentation (including lack of APIs, no defined metadata) in a large number of repositories
• Requires interaction and communication with repository personnel (time-consuming) to initiate the ingestion process

Costs
• Personnel (domain experts & programmers)
• Time-consuming process

Adapted from H Xu’s slide, University of Texas Houston

Just as JATS (Journal Article Tag Suite) is used by PubMed to index literature, DATS (DatA Tag Suite) is needed as a scalable way to index data sources in the DataMed prototype

A community effort

A community effort

Adapted from a slide by Sansone, Gonzalez-Beltran, and Rocca-Serra, University of Oxford
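A DATS record is, at bottom, structured metadata that an indexer like DataMed can validate before ingestion. A minimal sketch of such a record and a completeness check; the required-field set and nesting shown are a plausible subset for illustration, not the normative DATS schema, and the identifier and URL are hypothetical:

```python
# Illustrative minimal-field check, loosely modeled on DATS dataset records.
REQUIRED_FIELDS = {"title", "types", "identifier", "creators", "distributions"}

record = {
    "title": "Example expression dataset",
    "types": [{"information": {"value": "transcription profiling"}}],
    "identifier": {"identifier": "example-id-0001"},      # hypothetical ID
    "creators": [{"fullName": "A. Researcher"}],
    "distributions": [{"access": {"landingPage": "https://example.org/data"}}],
}

def missing_fields(rec: dict) -> set:
    """Fields an ingestion pipeline would flag before indexing the record."""
    return REQUIRED_FIELDS - rec.keys()
```

This is the kind of mechanical check that makes indexing scalable: repositories map into the standard once, and the pipeline validates rather than hand-curates.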

Example of a model for scalable indexing

DATS converges elements extracted from competency questions and from existing generic and biomedical data models (incl. DataCite, DCAT, schema.org, HCLS dataset, RIF-CS, ISA-Tab, SRA-xml, etc.), organized into core entities and extended entities.

Adapted from a slide by Sansone, Gonzalez-Beltran, Rocca-Serra, University of Oxford

Interlinking to other indexes

Adapted from a slide by Sansone, Gonzalez-Beltran, and Rocca-Serra, University of Oxford

Two Fronts

Annotating existing data
• Continue to work with data repositories to map into a minimal standard
• Incentive$ for data producers/repositories to facilitate mapping
• Incentive$ for data reuse/citation

Annotating new data
• Could be done at the source, as publishers do for JATS
• Additional re$ource$ need to be provided for data producers/repositories to prepare data for sharing (e.g., after the grant funding period ends)

Re$ource$ for data producers and/or repositories to maintain data and their annotations are needed

Leveraging resources from various paid projects, consolidating efforts, and incentivizing data producers/keepers saves time and money

Working Group Charter

Make recommendations to funders to increase adoption of standardized metadata by the biomedical science community

• Establish a framework for calculating costs and sustainability
• Propose mechanisms to enable effective metadata curation
  – What: Re$ource$
  – When: Timelines
  – How: Minimal metadata
  – Who: Self- or assisted mapping

Workflow Sharing and Docker Registries Work Group

Umberto Ravaioli, University of Illinois

Brian O’Connor, University of California Santa Cruz

FAIR-ness

• Adherence to FAIR principles: registries to make tools Findable and Accessible, and (Docker) container adoption to make tool components Interoperable and Reusable.

• An important mission of the NIH Commons should be to develop a culture of open-source development, data sharing, and accessible tools for reproducible science.

Overview of Activities – Membership
Umberto Ravaioli, University of Illinois at Urbana-Champaign
Brian O'Connor, University of California, Santa Cruz
Mark Diekhans, University of California, Santa Cruz
Benedict Paten, University of California, Santa Cruz
Charles Blatti, University of Illinois at Urbana-Champaign
Milt Epstein, University of Illinois at Urbana-Champaign
Don Armstrong, University of Illinois at Urbana-Champaign
Ravi Madduri, University of Chicago & Argonne National Lab
Rommie Amaro, University of California, San Diego
Stephen Ramsey, Oregon State University
Benjamin Hitz, Stanford University
Michael Crusoe, Common Workflow Language Project
Heidi Sofia, National Human Genome Research Institute, NIH
Steve Hsinyi Tsang, NIH/NCI & Attain

Organization of Activities

• Monthly conference calls (3rd Thursday of the month)

• Use of Google tools and workspaces to communicate and share documents

• Administrative assistance received from the Coordinating Center (UCLA – Denise Luna)

Position Paper
• Discussion of the state of the art
• Goals of the Work Group
• Sharing mechanisms
• Docker containers
• Workflow languages
• Case studies/prototypes/etc.
• Recommendations based on experience and the future path of technologies, e.g.:
  • Standards, APIs / external collaborations
  • Other considerations (security, legal, etc.)
  • Adherence to FAIR Commons concepts

Workflow Languages and Specs

• This area keeps evolving
• There are two main languages (CWL and WDL) used by the genomics community
• Workflow execution services:
  – Seven Bridges
  – FireCloud (Broad Institute, specialized for Google)
  – Consonance (Java)
  – Toil (UCSC – Python; wide support of computing systems)

Common Workflow Language (CWL)

• CWL is a way to describe command-line tools and connect them to create workflows.
• CWL is a specification and not a piece of software.
• Tools and workflows described using CWL are portable across a variety of platforms that support the CWL standard.
• The CWL approach emphasizes execution features and machine-readability, and serves a core target audience of software and platform developers.
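To make this concrete, a minimal sketch of a CWL v1.0 CommandLineTool that wraps `wc -l`; the tool choice and file names are arbitrary illustrations, not part of any WG deliverable:

```yaml
#!/usr/bin/env cwl-runner
cwlVersion: v1.0
class: CommandLineTool
baseCommand: [wc, -l]
inputs:
  infile:
    type: File
    inputBinding:
      position: 1   # passed as the first positional argument to wc
outputs:
  line_count:
    type: stdout    # capture standard output as the tool's result
stdout: counts.txt
```

Because the description is declarative, any engine that supports the CWL standard can run it, which is the portability point made above.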

Workflow Description Language (WDL)

• Developed by the Broad Institute engineering team supporting genome-analysis pipelines
• WDL emphasizes scripting and is designed from the ground up as a human-readable and -writable way to express tasks and workflows
• A WDL script provides a complete analysis solution: workflow, task, call, command and output
• Work is underway to ensure interoperability between CWL and WDL, through conversion and related utilities

Reaching out to GA4GH

• We are in contact with the GA4GH Containers and Workflows Group to coordinate technical discussions and possibly to merge development of the position paper into a joint activity (Brian taking the lead on this)

GA4GH – API proposal for:

• ability to request a workflow run using CWL or WDL (and maybe future formats)

• ability to parameterize that workflow using a JSON schema that's simple and used in common between CWL and WDL

• ability to get information about running workflows, status, errors, output file locations

• ISSUE: standardization of terms
  – job, workflow, steps, tools, etc.

GA4GH – API (continued)

• Having this standard API supported by multiple execution engines will give options for processing the same workflow (e.g., CWL or WDL) across different workflow execution platforms running on various clouds/environments.

• Example of a possible scenario:
  – Get a workflow in CWL on Dockstore.org
  – Use Dockstore to generate a JSON parameterization file
  – Submit to SevenBridges/FireCloud/Consonance or some other GA4GH-compliant workflow execution service (if the API is supported!)
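The run request in such an API can be sketched as a small structure. The field names below follow the GA4GH Workflow Execution Service (WES) schema that later grew out of this proposal (`workflow_url`, `workflow_type`, `workflow_type_version`, `workflow_params`); treat them as illustrative rather than the WG's final design:

```python
import json

def make_run_request(workflow_url, workflow_type, params):
    """Build a workflow-run request in the spirit of the proposed GA4GH API.
    Field names follow the later GA4GH WES schema; versions here are examples."""
    assert workflow_type in {"CWL", "WDL"}  # the two languages the WG targets
    return {
        "workflow_url": workflow_url,
        "workflow_type": workflow_type,
        "workflow_type_version": "v1.0" if workflow_type == "CWL" else "1.0",
        # Contents of the JSON parameterization file from the scenario above:
        "workflow_params": json.dumps(params),
    }

# Usage with a placeholder workflow location:
request = make_run_request("https://example.org/workflow.cwl", "CWL", {"threads": 4})
```

Because the request names only the workflow language, its location, and its JSON parameters, any compliant execution engine can accept it, which is what makes cross-platform processing possible.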

Containerization

• How do we approach standardization of Docker containers to promote reusability?

• Computational efficiency goes hand in hand with workflow definition and execution.

• Parallelization: macrotasking vs. microtasking.

• Optimization of numerical procedures is of paramount importance.

• Discoverability: the standardization of terms mentioned before is very important.

Computing Landscape

• Cloud architectures and providers are proliferating in a climate of competition

• Will platforms standardize and perhaps consolidate over time?

• Need to understand the trade-off between cost, efficiency and adherence to FAIR principles.

Disruptive Technologies on the Horizon

• Amazon’s “Lambda” serverless computing paradigm is intended to maximize utilization of resources.

• A server virtual machine is not “allocated permanently” to a given system; compute instances are fired up only when needed.

• Considerable cost reduction under the present charging scheme.

• Need to understand how the design of workflows and containers may be affected.

+

BD2K Collections Issue

Owen White
Ian Foster
Xinzhi Zhang
Susanna-Assunta Sansone

+ The idea

http://collections.plos.org/hmp

+ Oversight Committee

Develop rules of engagement regarding consortium membership, disclosing intended publications to the group, and areas of professional conduct.

Discussion of topical areas.

Search and open call for possible manuscript authors.

Promote the publication plan across the BD2K network.

Promote coordination with European or other international networks.

+ Oversight Committee

Generation of an overview publication describing the BD2K Commons and the general NIH data-management ecosystem.

Organization and general announcements to the larger group.

Hold periodic meetings to discuss progress.

Coordination and milestone completion.

Generation of artwork for the special collection.

+ Timeline

November 2016 BD2K meeting: broad announcement for special collections

November: formation of Steering Committee

November: Oversight Committee representative contacts potential journal editors

January 2017: finalize target journal for special collections

Present to November 2017: manuscript generation

July–November: periodic meetings for exposure of content, discussion of progress

November 2017: submission deadline

December 2017/January 2018: review and revision

February 2018: publication appears

+ Open Issues

Iterative process / multiple deadlines

Publish an earlier marker paper or set of position papers

+ The CFWG: Looming Issues

+ Looming Issues

■ Consortium-wide tools

■ Diversity of datatypes
  ■ Genomic / `omic / variants
  ■ Phenotypes / patient
  ■ Clinical studies

■ Overlapping working groups
  ■ Funding / identity / mandate
  ■ NIH, trans-agency, international, NGOs

■ Awareness

■ Longevity / sustainability
