embl-abr - bd2k and why bioinformatics matters...bd2k and why bioinformatics matters relevance to...

BD2K and why bioinformatics matters

relevance to Australia

EMBL - Australia AHM 2016

Vivien Bonazzi Senior Advisor for Data Science Technologies ADDs (Assoc. Director for Data Science) Office

Office of the Director (OD) National Institutes of Health (NIH)

The NIH Data Commons Digital Ecosystems for using and sharing FAIR Data

EMBL - Australia AHM 2016

Vivien Bonazzi Senior Advisor for Data Science Technologies ADDs (Assoc. Director for Data Science) Office Office of the Director (OD) National Institutes of Health (NIH)

http://datascience.nih.gov/bd2k

A word about BD2K

What’s driving the need for a

Data Commons?

Convergence of factors

¤ Mountains of Data

¤ Increasing need and support for Data sharing

¤ Availability of digital technologies and infrastructures that support Data at scale

https://gds.nih.gov/ Went into effect January 25, 2015 NCI guidance: http://www.cancer.gov/grants-training/grants-management/nci-policies/genomic-data Requires public sharing of genomic data sets

Recommendation #4: A national cancer data ecosystem for sharing and analysis. Create a National Cancer Data Ecosystem to collect, share, and interconnect a broad array of large datasets so that researchers, clinicians, and patients will be able to both contribute and analyze data, facilitating discovery that will ultimately improve patient care and outcomes.

Challenges with Biomedical Data The Journal Article is the end goal Data is a means to an ends (low value) Data is not FAIR Findable, Accessible, Interoperable, Reproducible Limited e-infrastructures to support FAIR data

What’s Changing?

Digital ecosystems

Development of the

NIH Data Commons

¤  How do we find data, software, standards?

¤  How can we make (large) data, annotations, software, metadata accessible?

¤  How do we reuse data, tools and standards?

¤  How do we make more data machine readable?

¤  How do we leverage existing digital technologies systems, infrastructures?

¤  How do we collaborate?

¤  How do we enable digital ecosystem?

Changing the conversation around Data sharing and access

NIH Data Commons

Data Commons enabling data driven science

Enable investigators to leverage all possible data and tools in the effort to accelerate biomedical discoveries, therapies and cures

driving the development of data infrastructure and data science capabilities through collaborative research and robust engineering

Matthew Trunnel, FHC

Data Commons’s

Developing a Data Commons

¤ Treats products of research – data, methods, papers etc. as digital objects

¤ These digital objects exist in a shared virtual space •  Find, Deposit, Manage, Share, and Reuse data,

software, metadata and workflows

¤ Digital object compliance through FAIR principles: •  Findable •  Accessible (and usable) •  Interoperable •  Reusable

The Data Commons

is a framework

that supports

FAIR data access and sharing

fosters the development

of a digital ecosystem

https://datascience.nih.gov/commons

The Data Commons Framework

Compute Platform: Cloud

Services: APIs, Containers, Indexing,

Software: Services & Tools

scientific analysis tools/workflows

Data “Reference” Data Sets

User defined data

ital O

App store/User Interface

Current Data Commons Pilots

Explore feasibility of the Commons Framework

Facilitate collaboration and interoperability

Making large and/or high impact NIH funded data sets

and tools accessible in the cloud

Developing Data and Software indexing methods

Leveraging BD2K Efforts: bioCADDIE and others.

Collaborating with external groups

Provide access to cloud (IaaS) and PaaS/SaaS via credits

Connecting credits to the grants system

Reference Data Sets Pilot

Large, High-Impact Datasets in the Cloud

Commons Framework Pilots Software and Services

Commons Framework

•  FAIRness Metrics

•  Data-object registry

•  Interoperability of APIs

•  Workflow sharing and docker registry

•  Commons Framework Publications

Resource Search & Indexing

Discoverability of data and software

Cloud Credits Model

$ denominated NIH credits to use cloud resources (IaaS) and services (PaaS/SaaS)

The Data Commons Framework

Compute Platform: Cloud

Services: APIs, Containers, Indexing,

Software: Services & Tools

scientific analysis tools/workflows

Data “Reference” Data Sets

User defined data

ital O

App store/User Interface

Authorization /authentication layer

Digital Ecosystem

Considerations and

Concluding Thoughts

Considerations

¤  Metrics – Understanding and accounting of data usage patterns

¤  Cost •  Cloud Storage

•  Pay for use cloud compute (NIH credits pilot)

•  Indirect costs for cloud

¤  Hybrid Clouds – Institution (private) and commercial (public) clouds

¤  Managing Open vs Controlled access data •  Auth: single sign on - dreams/nightmares?

¤  Archive vs Working and versioning Copies of data

¤  Interoperability with other Commons (clouds)

¤  Standards – Metadata, UIDs, APIs

¤  Discoverability – Finding digital objects across clouds

¤  Interfaces – For users with different needs and capabilities

¤  Consent – Reconsenting data, Dynamic consents?

¤  Policies •  Data sharing policies that are useful and effective

•  Keep pace with use of technology (e.g. dbGAP data in the Cloud)

¤  Incentives •  Access to, and shareability of FAIR Data as part of NIH grant review

criteria

¤  Governance – Community involvement in governance models

¤  Sustainability – Long term support

Relevance to Australia?

Relevance to Australia

¤  The value of Australian Data * ¤  Unique flora and fauna

¤  e.g Marsupials ¤  Indigenous Australians

¤  Understanding of genomic structure – health & disease ¤  Medicinal products

¤  Making this data (securely) available ¤  With high quality annotation and metadata ¤  Attributions to original authors ¤  On the cloud ¤  Via open standard APIs

¤  Aggregation of data via an Australian wide Commons?

Authorization /authentication layer

Oz Digital Ecosystem

Summary

¤  We need an unprecedented level of convergence and

collaboration to drive biomedical science to the next level.

¤  Supporting this model of data-intensive collaborative

science requires a shift in academic research culture and

new investments in data infrastructure and capabilities.

Matthew Trunnel, FHC

Acknowledgments •  ADDS Office: Jennie Larkin, Phil Bourne, Michelle Dunn,Mark Guyer, Allen Dearry, Sonynka Ngosso,

Tonya Scott, Lisa Dunneback, Vivek Navale (CIT/ADDS)

•  NCBI: George Komatsoulis

•  NHGRI: Valentina di Francesco

•  NIGMS: Susan Gregurick

•  CIT: Andrea Norris, Debbie Sinmao

•  NIH Common Fund: Jim Anderson , Betsy Wilder, Leslie Derr

•  NCI Cloud Pilots/ GDC: Warren Kibbe, Tony Kerlavage, Tanja Davidsen

•  Commons Reference Data Set Working Group: Weiniu Gan (HL), Ajay Pillai (HG), Elaine Ayres, (BITRIS), Sean Davis (NCI), Vinay Pai (NIBIB), Maria Giovanni (AI), Leslie Derr (CF), Claire Schulkey (AI)

•  RIWG Core Team: Ron Margolis (DK), Ian Fore, (NCI), Alison Yao (AI), Claire Schulkey (AI), Eric Choi (AI)

•  OSP: Dina Paltoo, Kris Langlais, Erin Luetkemeier, Agnes Rooke,

•  Research and Industry: Mathew Trunnell (FHC), Bob Grossman (Chicago), Toby Bloom (NYGC)

Stay in

QR Business Card

@Vivien.Bonazzi

Slideshare

Blog (Coming soon!)

embl-abr - bd2k and why bioinformatics matters...bd2k and why bioinformatics matters relevance to...

Documents

the druggable genome - embl-ebi · anna gaulton embl -...

ab3acbs 2016: embl australia bioinformatics resource

data submission services of embl australia bioinformatics...

embl-ebi data archives – an overview. the embl-ebi mission...

the complex portal - an encyclopaedia of macromolecular ......

the european bioinformatics institute industry programme ·...

embl-ebi msd search tools. embl-ebi msdlite embl-ebi msdlite

best practice data life cycle approaches for the life...

nicolas le novère, babraham institute, embl-ebi n...

data science & bd2k update · bd2k supports the development...

embl australian bioinformatics resource ahm - data commons

the embl-european bioinformatics institute sme forum...

embl-european bioinformatics institute annual scientific

bd2k update

embl australia bioinformatics resource (embl-abr) · all...

embl-european bioinformatics institute annual scientific...

introduction to bioinformatics -...

nicolas le novère, babraham institute, embl-ebi n.lenovere...

protein database bioinformatics lab. sequence databases...

1 distributed big data & analytics university of cincinnati...