the future of the commons

30
George A. Komatsoulis, Ph.D. National Center for Biotechnology Information National Library of Medicine National Institutes of Health U.S. Department of Health and Human Services The Future of the Commons: Data, Software and Beyond

Upload: george-komatsoulis

Post on 20-Jan-2017

204 views

Category:

Government & Nonprofit


0 download

TRANSCRIPT

Page 1: The future of the commons

George A. Komatsoulis, Ph.D.National Center for Biotechnology Information

National Library of MedicineNational Institutes of Health

U.S. Department of Health and Human Services

The Future of the Commons:Data, Software and Beyond

Page 2: The future of the commons

The Current “Big Data” problem in biology

Page 3: The future of the commons

All of this has happened before…

1960 1970 1980 1990 2000 2010 2020

Page 4: The future of the commons

… and other scientific fields have similar problemsSensor Stream = 500 EB/dayStores 69 TB/day

Collection = 14 EB/dayStore 1PB/day

Total Data = 14 PBStore an average of 3.3TB/day for 10 years!

Page 5: The future of the commons

A wide range of data categoriesMultiple technology platforms used to generate the same

class of dataPrivacy and Patient protection

Together = Greater Complexity

Although biomedicine has some unique challenges

Page 6: The future of the commons

Precision Medicine Initiative

DNA Sequence alone could total 250 PB

Page 7: The future of the commons

Budgets are stagnant to declining in constant dollars

Source: AAAS (http://www.aaas.org/fy16budget/national-institutes-health)

Page 8: The future of the commons

Public Data Repositories

Local Data

U N I V E R S I T YU N I V E R S I T Y

Locally Developed Software

Publicly AvailableSoftware

Local storage andcompute resources

Standard Model of Biomedical Computation

Page 9: The future of the commons

Lots of data = Difficult to move from place to placeExpensive to store multiple copiesExpensive to provide computing capacityEven more potentially on the wayExpanding “data inequality” among researchers

Budgets are stagnant = More resources going to managing dataFewer resources available for scienceCurrent methods of providing IT resources are not efficient

So to summarize, current issues:

Page 10: The future of the commons

The Commons: A shared virtual spaceIs scalable and exploits new computing modelsIs more cost effective given digital growthSimplifies sharing digital research objects such

as data, software, metadata and workflowsMakes digital research objects more FAIR:

Findable, Accessible, Interoperable and Reusable

DOES NOT replace existing, well-curated databases

Phil Bourne, 2014

Page 11: The future of the commons

Software: Services and Tools

The Commons: Conceptual Model

Infrastructure

Data

User Data Sets

Reference Data Sets

Services: APIs, IDs, Encapsulation

Analysis Tools/Workflows

UI/App Store

Vivien Bonazzi, 2015

Storage and Compute

Page 12: The future of the commons

NCI GDC and Cloud Pilots (2014/15)

QA/QCValidation

Aggregation

Authoritative NCI Reference Data Set

Data Coordinating Center

NCI Genomic Data Commons

NCI Clouds

High PerformanceComputing

Search/Retrieve

Download

Analysis

NCI Cloud Pilots (X3)

Page 13: The future of the commons

Commons Supplements – Data, analysis tools, APIs, containers BD2K Centers MODs (Model Organism Databases) Interoperability (some) HMP (Human Microbiome Project)

The Cloud Credits ModelAccessing cloud services for NIH grantees

NIH Affiliated Commons projectsNIAID/CF/ADDS - Microbiome/HMP Cloud PilotNCI Cloud Pilots & Genomic Data Commons

NIH Launches The Commons

Page 14: The future of the commons

The Commons in Context

The Commons(infrastructure)

Well Curated Databases

Search Researchers

Indexes Develops capabilities

Makes AvailableInvestigator

Computes

Downloads

Searches Uses

Indexes Develops capabilities

Page 15: The future of the commons

Commons Credits Model

The Commons(infrastructure)

Cloud ProviderA

Cloud ProviderB

Cloud ProviderC

InvestigatorNIH

Provides credits Enables Search

Discovery Index

Uses credits inthe Commons

IndexesOption:Direct Funding

Page 16: The future of the commons

Supports simplified data sharing by driving science into publicly accessible computing environments that still provide for investigator level access control

Scalable for the needs of the scientific community for the next 5 years

Democratize access to data and computational toolsCost effective

Competitive marketplace for biomedical computing servicesReduces redundancy Uses resources efficiently

Advantages of this Model

Page 17: The future of the commons

Novelty: Never been tried, so we don’t have data about likelihood of success

Cost Models: Predicated on stable or declining prices among providers True for the last several years, but we can’t guarantee that it will

continue, particularly if there is significant consolidation in industry Service Providers:

Predicated on service providers willing to make the investment to become conformant

Market research suggests 3-5 providers within 2-3 months of program launch

Persistence: The model is ‘Pay As You Go’ which means if you stop paying it stops going Giving investigators an unprecedented level of control over what lives (or

dies) in the Commons

Potential Disadvantages of this Model

Page 18: The future of the commons

Minimum set of requirements forBusiness relationships (coordinating center, investigators)Interfaces (upload, download, manage, compute)Capacity (storage, compute)Networking and ConnectivityInformation AssuranceAuthentication and authorization

A conformant cloud is not necessarily a provider of Infrastructure as a Service (IaaS) although all providers must provide IaaS

Current Conformance Guidelines: http://bit.ly/1SWjBeF

What does it mean for a provider to be conformant?

Page 19: The future of the commons

NIH intends to run a 3 year pilot to test the efficacy of this business model in enhancing data sharing and reducing costs.

Pilot will not directly interact with the existing grant system, rather, it is being modeled on the mechanisms being used to gain access to NSF and DOE national resources (HPC, light sources, etc.)

The only required qualification for applying for credits will be that the investigator has an existing NIH grant

A major element will be the collection of metrics to assess effectiveness of this model

Pilot of the Commons Business Model

Page 20: The future of the commons

NIH recently completed a contract with the CAMH Federally Funded Research and Development Center (FFRDC) to act as the coordinating center for this effort.

We need you to:Identify what capabilities will be useful to investigatorsProvide guidance on the conformance requirements

http://bit.ly/1SWjBeF Help identify good metricsDefine the criteria that are used to decide if credit

requests are selected

Status and requests

Page 21: The future of the commons

Two Closing Thoughts(or three depending on how you choose to count them)

Page 22: The future of the commons

Making data understandable over time

Demotic Inscription (c. 330 BCE) Crypto-Minoan Inscription (c. 1150 BCE)

Page 23: The future of the commons

Data loss in medicineSEER data

1 | Male 2 | Female 3 | Other Hermaphrodite 4 | Transsexual 9 | Unknown

• ECOG– 121102 | Other sex– 121104 | Ambiguous sex– F | Female– FC | Female changed to male– FP | Female pseudohermaphrodite– H | Hermaphrodite– M | Male– MC | Male changed to female– MP | Male pseudohermaphrodite– O | Undetermined sex– U | Unknown sex

Page 24: The future of the commons

The Importance of Standards

Page 25: The future of the commons

Standards reduce the complexity of interpreting data

The modern five-line staff was first adopted in France and became almost universal by the 16th century (although the use of staffs with other numbers of lines was still widespread well into the 17th century).

-Wikipedia

Guido [d’Arezzo 992-1050 CE] used the first syllable of each line, Ut, Re, Mi, Fa, Sol and La, to read notated music in terms of hexachords; they were not note names, and each could, depending on context, be applied to any note…

Much early music has been “lost” because there was no standard representation, and the metadata for the non-standard representations were lost (or never recorded).

Page 26: The future of the commons

“Are you a good standard…”

• Preserves information with sufficient detail to enable reproduction• Well documented and used by essentially all western musicians• Highly flexible, capable of representing a very wide range of

expected states• Excellent cross-platform interoperability• Readily transformed into variant representations without loss

(MIDI, ABC, tablature, etc.)

Page 27: The future of the commons

“… or a bad standard?”Advocates of standards need

to resist certain temptations:Standardizing too earlyStandardizing data that is

not widely collectedAllowing information

theories to trump usability

Developing standards in the absence of real practitioners.

Page 28: The future of the commons

Should we ever let data ‘die’?Traditional Answer for non-digital data is no!

Does this make sense for digital (or, for that matter, non-digital) data?

Are there circumstances where we can or should ‘throw away’ data?

Page 29: The future of the commons

What are the considerations for letting data die?

“Usage” ?“Importance” ?Ability to reproduce experiment?Publication Associated?Time?Raw vs. Derived?

Page 30: The future of the commons

NCI (cloud pilots) Ishwar ChandramouliswaranStephen Chanock, MDTanja Davidsen, Ph.D.Anthony Kerlavage, Ph.D.Warren Kibbe, Ph.D.Juli Klemm, Ph.D.Daoud Meerzaman, Ph.D.Louis Staudt, MD, Ph.D.Barbara Wold, Ph.D.

Harold Varmus, MD

NIH Office of ADDSVivien Bonazzi, Ph.D.Philip Bourne, Ph.DJennie Larkin, Ph.D.Mark Guyer, Ph.D.

NCBIDennis Benson, Ph.D.Alan GraeffDavid Lipman, MDJim Ostell, Ph.D.Don Preuss

Acknowledgements