the future of the commons
TRANSCRIPT
The Future of the Commons: Data, Software and Beyond
George A. Komatsoulis, Ph.D.
National Center for Biotechnology Information
National Library of Medicine
National Institutes of Health
U.S. Department of Health and Human Services
The Current “Big Data” problem in biology
All of this has happened before…
[Timeline: 1960 to 2020]
… and other scientific fields have similar problems
• Sensor Stream = 500 EB/day; stores 69 TB/day
• Collection = 14 EB/day; stores 1 PB/day
• Total Data = 14 PB; stores an average of 3.3 TB/day for 10 years!
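The figures on this slide can be put on a common scale with a little arithmetic. The sketch below is only a sanity check, not part of the talk: it assumes decimal (SI) units (1 PB = 1,000 TB, 1 EB = 1,000,000 TB) and a 365-day year, and the helper name `retained_fraction` is made up for illustration.

```python
# Decimal (SI) storage units expressed in terabytes (an assumption;
# the slide does not say whether binary or decimal units are meant).
TB_PER_PB = 1_000
TB_PER_EB = 1_000_000


def retained_fraction(stored_tb_per_day, produced_tb_per_day):
    """Fraction of the daily data stream that is actually kept."""
    return stored_tb_per_day / produced_tb_per_day


# Sensor stream: 500 EB/day produced, 69 TB/day stored.
sensor = retained_fraction(69, 500 * TB_PER_EB)

# Collection: 14 EB/day produced, 1 PB/day stored.
collection = retained_fraction(1 * TB_PER_PB, 14 * TB_PER_EB)

# 3.3 TB/day stored for 10 years, expressed in petabytes.
ten_year_total_pb = 3.3 * 365 * 10 / TB_PER_PB

print(f"sensor stream kept: {sensor:.2e}")       # 1.38e-07
print(f"collection kept:    {collection:.2e}")   # 7.14e-05
print(f"ten-year total:     {ten_year_total_pb:.1f} PB")  # 12.0 PB
```

Under these assumptions, both high-volume instruments keep far less than a thousandth of what they generate, and 3.3 TB/day over ten years comes to about 12 PB, the same order as the 14 PB total quoted on the slide.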
Although biomedicine has some unique challenges
• A wide range of data categories
• Multiple technology platforms used to generate the same class of data
• Privacy and patient protection
Together = Greater Complexity
Precision Medicine Initiative
DNA Sequence alone could total 250 PB
Budgets are stagnant to declining in constant dollars
Source: AAAS (http://www.aaas.org/fy16budget/national-institutes-health)
Standard Model of Biomedical Computation
Public Data Repositories
Local Data
University
Locally Developed Software
Publicly Available Software
Local storage and compute resources
So to summarize, current issues:
Lots of data =
• Difficult to move from place to place
• Expensive to store multiple copies
• Expensive to provide computing capacity
• Even more potentially on the way
• Expanding “data inequality” among researchers
Budgets are stagnant =
• More resources going to managing data
• Fewer resources available for science
Current methods of providing IT resources are not efficient
The Commons: A shared virtual space
• Is scalable and exploits new computing models
• Is more cost effective given digital growth
• Simplifies sharing digital research objects such as data, software, metadata and workflows
• Makes digital research objects more FAIR: Findable, Accessible, Interoperable and Reusable
DOES NOT replace existing, well-curated databases
Phil Bourne, 2014
Software: Services and Tools
The Commons: Conceptual Model
Infrastructure
Data
User Data Sets
Reference Data Sets
Services: APIs, IDs, Encapsulation
Analysis Tools/Workflows
UI/App Store
Vivien Bonazzi, 2015
Storage and Compute
NCI GDC and Cloud Pilots (2014/15)
QA/QC
Validation
Aggregation
Authoritative NCI Reference Data Set
Data Coordinating Center
NCI Genomic Data Commons
NCI Clouds
High Performance Computing
Search/Retrieve
Download
Analysis
NCI Cloud Pilots (X3)
NIH Launches The Commons
Commons Supplements – Data, analysis tools, APIs, containers
• BD2K Centers
• MODs (Model Organism Databases)
• Interoperability (some)
• HMP (Human Microbiome Project)
The Cloud Credits Model
• Accessing cloud services for NIH grantees
NIH Affiliated Commons projects
• NIAID/CF/ADDS - Microbiome/HMP Cloud Pilot
• NCI Cloud Pilots & Genomic Data Commons
The Commons in Context
The Commons (infrastructure)
Well Curated Databases
Search Researchers
Indexes Develops capabilities
Makes Available
Investigator
Computes
Downloads
Searches Uses
Indexes Develops capabilities
Commons Credits Model
The Commons (infrastructure)
Cloud Provider A
Cloud Provider B
Cloud Provider C
Investigator
NIH
Provides credits
Enables Search
Discovery Index
Uses credits in the Commons
Indexes
Option: Direct Funding
Advantages of this Model
• Supports simplified data sharing by driving science into publicly accessible computing environments that still provide for investigator-level access control
• Scalable for the needs of the scientific community for the next 5 years
• Democratizes access to data and computational tools
• Cost effective
• Competitive marketplace for biomedical computing services
• Reduces redundancy
• Uses resources efficiently
Potential Disadvantages of this Model
• Novelty: Never been tried, so we don’t have data about likelihood of success
• Cost Models: Predicated on stable or declining prices among providers
– True for the last several years, but we can’t guarantee that it will continue, particularly if there is significant consolidation in industry
• Service Providers: Predicated on service providers willing to make the investment to become conformant
– Market research suggests 3-5 providers within 2-3 months of program launch
• Persistence: The model is ‘Pay As You Go’, which means if you stop paying it stops going
– Giving investigators an unprecedented level of control over what lives (or dies) in the Commons
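The ‘Pay As You Go’ persistence point can be made concrete with a toy ledger. This is a minimal sketch under stated assumptions, not a description of how NIH or any cloud provider actually bills: the `CreditAccount` class, the dataset names, and the credit amounts are all hypothetical, and real providers would charge by usage rather than a flat monthly fee.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class CreditAccount:
    """Hypothetical ledger for one investigator's Commons credits."""
    balance: int
    # dataset name -> monthly storage charge in credits (illustrative units)
    datasets: Dict[str, int] = field(default_factory=dict)

    def store(self, name: str, monthly_cost: int) -> None:
        """Register a dataset that will be billed every month."""
        self.datasets[name] = monthly_cost

    def bill_month(self) -> List[str]:
        """Charge one month of storage; drop what can no longer be paid for."""
        dropped = []
        for name, cost in list(self.datasets.items()):
            if self.balance >= cost:
                self.balance -= cost
            else:
                # "If you stop paying it stops going."
                del self.datasets[name]
                dropped.append(name)
        return dropped


account = CreditAccount(balance=100)
account.store("raw_reads", 40)
account.store("derived_variants", 10)

for month in range(1, 4):
    dropped = account.bill_month()
    print(month, account.balance, dropped)
# 1 50 []
# 2 0 []
# 3 0 ['raw_reads', 'derived_variants']
```

In this toy run the credits cover two months; in the third month nothing can be paid for and both datasets leave the Commons, which is the control (and the risk) the slide describes.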
What does it mean for a provider to be conformant?
Minimum set of requirements for
• Business relationships (coordinating center, investigators)
• Interfaces (upload, download, manage, compute)
• Capacity (storage, compute)
• Networking and Connectivity
• Information Assurance
• Authentication and authorization
A conformant cloud is not necessarily a provider of Infrastructure as a Service (IaaS) although all providers must provide IaaS
Current Conformance Guidelines: http://bit.ly/1SWjBeF
Pilot of the Commons Business Model
• NIH intends to run a 3-year pilot to test the efficacy of this business model in enhancing data sharing and reducing costs.
• The pilot will not directly interact with the existing grant system; rather, it is being modeled on the mechanisms used to gain access to NSF and DOE national resources (HPC, light sources, etc.).
• The only required qualification for applying for credits will be that the investigator has an existing NIH grant.
• A major element will be the collection of metrics to assess the effectiveness of this model.
Status and requests
NIH recently completed a contract with the CAMH Federally Funded Research and Development Center (FFRDC) to act as the coordinating center for this effort.
We need you to:
• Identify what capabilities will be useful to investigators
• Provide guidance on the conformance requirements (http://bit.ly/1SWjBeF)
• Help identify good metrics
• Define the criteria that are used to decide if credit requests are selected
Two Closing Thoughts
(or three depending on how you choose to count them)
Making data understandable over time
Demotic Inscription (c. 330 BCE)
Cypro-Minoan Inscription (c. 1150 BCE)
Data loss in medicine
• SEER data
– 1 | Male
– 2 | Female
– 3 | Other Hermaphrodite
– 4 | Transsexual
– 9 | Unknown
• ECOG
– 121102 | Other sex
– 121104 | Ambiguous sex
– F | Female
– FC | Female changed to male
– FP | Female pseudohermaphrodite
– H | Hermaphrodite
– M | Male
– MC | Male changed to female
– MP | Male pseudohermaphrodite
– O | Undetermined sex
– U | Unknown sex
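One way to see the data loss is to map the finer ECOG value set onto the coarser SEER one and count how many distinct ECOG codes land on the same SEER code. The sketch below is illustrative only: the crosswalk `ECOG_TO_SEER` is a hypothetical mapping invented for this example, not an official SEER or ECOG crosswalk, and the helper `collapsed_groups` is likewise made up.

```python
from collections import defaultdict

# HYPOTHETICAL crosswalk from the ECOG codes listed above to the SEER codes
# listed above. The specific assignments are assumptions for illustration.
ECOG_TO_SEER = {
    "M": "1",        # Male -> Male
    "F": "2",        # Female -> Female
    "H": "3",        # Hermaphrodite -> Other Hermaphrodite
    "FP": "3",       # Female pseudohermaphrodite -> Other Hermaphrodite
    "MP": "3",       # Male pseudohermaphrodite -> Other Hermaphrodite
    "121102": "3",   # Other sex -> Other Hermaphrodite
    "FC": "4",       # Female changed to male -> Transsexual
    "MC": "4",       # Male changed to female -> Transsexual
    "121104": "9",   # Ambiguous sex -> Unknown
    "O": "9",        # Undetermined sex -> Unknown
    "U": "9",        # Unknown sex -> Unknown
}


def collapsed_groups(crosswalk):
    """Return target codes that receive more than one source code."""
    groups = defaultdict(list)
    for source, target in crosswalk.items():
        groups[target].append(source)
    return {t: srcs for t, srcs in groups.items() if len(srcs) > 1}


lossy = collapsed_groups(ECOG_TO_SEER)
for seer_code, ecog_codes in sorted(lossy.items()):
    print(seer_code, ecog_codes)

print(len(ECOG_TO_SEER), "ECOG codes ->", len(set(ECOG_TO_SEER.values())), "SEER codes")
```

Whatever the exact assignments, eleven source values cannot fit into five target values without collisions, so a record converted from the ECOG coding to the SEER coding cannot be converted back without loss.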
The Importance of Standards
Standards reduce the complexity of interpreting data
The modern five-line staff was first adopted in France and became almost universal by the 16th century (although the use of staffs with other numbers of lines was still widespread well into the 17th century).
-Wikipedia
Guido [d’Arezzo 992-1050 CE] used the first syllable of each line, Ut, Re, Mi, Fa, Sol and La, to read notated music in terms of hexachords; they were not note names, and each could, depending on context, be applied to any note…
Much early music has been “lost” because there was no standard representation, and the metadata for the non-standard representations were lost (or never recorded).
“Are you a good standard…”
• Preserves information with sufficient detail to enable reproduction
• Well documented and used by essentially all western musicians
• Highly flexible, capable of representing a very wide range of expected states
• Excellent cross-platform interoperability
• Readily transformed into variant representations without loss (MIDI, ABC, tablature, etc.)
“… or a bad standard?”
Advocates of standards need to resist certain temptations:
• Standardizing too early
• Standardizing data that is not widely collected
• Allowing information theories to trump usability
• Developing standards in the absence of real practitioners.
Should we ever let data ‘die’?
Traditional answer for non-digital data is no!
Does this make sense for digital (or, for that matter, non-digital) data?
Are there circumstances where we can or should ‘throw away’ data?
What are the considerations for letting data die?
• “Usage”?
• “Importance”?
• Ability to reproduce experiment?
• Publication associated?
• Time?
• Raw vs. derived?
Acknowledgements
NCI (cloud pilots)
• Ishwar Chandramouliswaran
• Stephen Chanock, MD
• Tanja Davidsen, Ph.D.
• Anthony Kerlavage, Ph.D.
• Warren Kibbe, Ph.D.
• Juli Klemm, Ph.D.
• Daoud Meerzaman, Ph.D.
• Louis Staudt, MD, Ph.D.
• Barbara Wold, Ph.D.
• Harold Varmus, MD
NIH Office of ADDS
• Vivien Bonazzi, Ph.D.
• Philip Bourne, Ph.D.
• Jennie Larkin, Ph.D.
• Mark Guyer, Ph.D.
NCBI
• Dennis Benson, Ph.D.
• Alan Graeff
• David Lipman, MD
• Jim Ostell, Ph.D.
• Don Preuss