2014 bosc-keynote
Post on 10-Sep-2014
Monday, July 11th, 2039
It’s hard to make predictions –
especially about the future.
-- attributed to Niels Bohr
A History of “Bioinformatics”
C. Titus Brown
Invited to reminisce!
…and perhaps inform the BRAIN2050 initiative.
Note for the young: “bioinformatics” and “systems biology” are now simply “biology”.
The 20-teens and onwards
1. Too Much Data: The Datapocalypse
2. Great results, seen once: the reproducibility crisis.
3. Mind the gap: computation in biology.
1. The Datapocalypse
Too… much… data…
Between –omics, automated sensor data, and data sharing, biology grew into a data-intensive science.
Volume, velocity, variety: the general problem.
But also!
Biology was optimized for hypothesis-driven investigation, not data exploration!
Long arguments over “which is better”, with the people who controlled the funding => winning.
HTC, not HPC
For lots of data, High Throughput Computing was needed – but compute was cheap, not throughput!
Figure from bbc.co.uk
2. The reproducibility crisis
Trials
Failed
The reproducibility crisis - why??
It was well known within biotech that the majority of published experiments were largely lab-specific.
Neither career incentives nor funding were there! (In fact, quite the contrary…)
This slowly started to change later in the decade, as the public caught on…
Shift in “publication” recognition
Hard to believe now, but back then, people were rewarded for the first (claimed) “observation” of an effect.
Two-lab rule was only instated as best practice in the early 2020s, once reviewers started rejecting papers unaccompanied by a replication report.
Funding shift followed, of course.
3. Computing & data in biology
Of the sciences, biology had always been the weakest in terms of computing education.
This became a complete disaster once the data tsunami hit – labs generated data sets they couldn’t analyze, graduate students planned experiments that relied on computing they couldn’t do.
Photo from Wikipedia
The “easy to use” tools fiasco
Immense investment in the late ’teens in tools that were “easy to use” – push-button data analysis, etc.
This worked well outside of research; however, it turns out you can’t place most data analysis in a black box.
“Easy to use” tools embodied so many assumptions that most results were simply invalid.
=> Bioinformatics “sweatshops”
Cadre of students and low-paid employees devoted to “service bioinformatics”
No career path, no significant authorship…
…but necessary for big labs to make progress!
Things came to a head…
www.sanantonio-urbanliving.com
The tipping point
The well-trained students left for the data science industry;
More and more papers were being written by people who didn’t understand the computing…
…and an increasing number of them were being rejected…
…until the supply of reviewers ran out…
And then… California.
Map from Wikimedia
Bioinformaticians, revolt!
Bioinformatics reviewers essentially unionized and laid down three rules:
1. All of the data and source code must be provided for any paper.
2. Full methods sections and references are included in the primary paper review.
3. No unpublished methods can be used in data analysis.
In the end, the only people that complained were companies like MS Elsevier, because preprints.
Replication “parties”
A community of practice emerged around replication!
Part of a larger renaissance for biology!
Starting in ~2020,
1. Biomedical enterprise rediscovers basic biology;
2. Rise and triumph of open science;
3. A transition to networked science;
4. Massive investment in the people;
1. Rediscovering basic biology
The biomedical community backs away from translational medicine.
Several veterinary and agricultural animals proved to be better model organisms for human disease than mouse;
Ecology and evolution provided valuable theoretical and empirical observations for understanding human genetics.
Microbial interactions between environment and human proved to be important as well; built environment, disease reservoirs, etc.
Cheap sequencing enabled a vast array of studies.
2. Open science triumphs!
The computational community knew this by 2016, but it took a few years for the rest of biology…
A curious story!
1. Biotech pressured congresspeople into decreasing funding for experiments, since analysis was usually wrong and raw data was never available;
2. Funding crunch, more generally, tightened the screws further;
3. Hypothesis-driven labs couldn’t compete…
…hypothesis-driven lab science joined with discovery.
Eventually, funders mandated data availability;
Labs that made use of available data had a dramatic edge in hypothesis-driven experimentation;
Data-driven modeling and model-driven data interpretation blossomed!
Image from emory.edu
3. A transition to networked science
Universities collapsed!
So all the senior professors and administrators retired…
Massive brain drain…
… enabled a massive increase in creativity in the research enterprise!
Collaboration tools, data sharing, distributed team science…
“Walled garden” model
Pioneered by Sage Bionetworks in ~2010s
Data collection done by small consortia;
Data made available to all, but publication in step.
Model is of course obsolete nowadays, but was quite effective back then.
4. Massive investment in people
The NIH finally invested heavily in training.
Among other things:
Data Carpentry
Model Carpentry
(We won! Yay!)
There are still problems, of course!
What do most genes do? Functional annotations are still poor. Some approaches --
Biogeochemistry
Synthetic biology
Career paths for experimental biologists are very uncertain.
“Glam data”
Cancer is cured, but many complex diseases – especially neurodegenerative ones – remain poorly understood.
BRAIN2050
Ambitious 10-year proposal to “understand the brain” by 2050.
Focus on neurodegenerative diseases, regeneration, and a mechanistic understanding of intelligence.
What mistakes can they avoid, with the benefit of hindsight?
Correlation is not causation
You’d think we’d have learned this by now!?
Original MIND project 25 years ago failed for this reason. (“Record ALL the neurons”)
Image from Wikipedia
(Computational) modeling is critical
Can we develop models that embody hypotheses that we can then “test” against the data?
Holistic multidisciplinary research.
(Brain community has always been better off here…)
Focus less on reproducibility
A strict requirement for independent replication is strangling us!
Completely independent replication is a strong requirement; understandable, given disasters of the past, but also slow.
Can we compromise?
“Replication debt”
Can we borrow the idea of “technical debt” from software engineering?
Semi-independent replication after initial exploratory phase, followed by articulation of protocols and independent replication.
Image from blog.crisp.se
Public acknowledgement of debt is important.
Invest in infrastructure for collaboration and sharing
Data sharing is a given
But existing tools still merely support rather than drive science with data sharing!
Push for collaborative process from the outset.
Can we help drive collaboration with technology?
See e.g. pebourne.wordpress.com/2014/01/04/universities-as-big-data/
Tool up! But evaluate, compare, understand.
Having a robust and competitive software ecosystem is important for innovation and creativity.
Available, open, reusable, remixable: all critical!
Benchmarks are not always useful; understanding always is.
Build commercial software only when basics are understood
Invest in training as a first-class research citizen!
The high school students of yesterday are the research scientists of tomorrow.
It’s the network, dummies.
Single molecule full genome sequences did not provide understanding.
Reductionist studies of gene function did not provide understanding.
Neither will high resolution ensemble neuronal sampling.
Our main obstacle in understanding aging has been that it seems to be systemic, just like neurodegeneration.
Concluding thoughts (I)
Many things the BRAIN2050 field can do to invest in its own future and accelerate progress!
Bitter lessons learned from decades of mistakes in other fields; maybe we can do better?
All right…
Future talk over
I thought I’d use this as a foil to highlight issues that I think are important for the future.
But:
We have to get used to the idea that radical change keeps happening ... even after 1997.
First published by Broadway Books on May 5, 1997. Via Erich Schwarz
"Among the pessimists, molecular biologist Gunther Stent suggests that science is reaching a point of incremental, diminishing returns as it comes up against the limits of knowledge..." --review by Publishers Weekly
Robert Heinlein's four curves of predicted human progress (described in 1950)
Ref.: Heinlein, R.A. (1950), "Where To?".
"The solid curve ... represents many things -- use of power, speed of transport, numbers of scientific and technical workers, advance in communication, average miles traveled per person per year, advances in mathematics ... Call it the curve of human achievement."
Via Erich Schwarz
"Despite everything, there is a stubborn 'common sense' tendency to project it along dotted line number (1) like the patent office official of a hundred years back who quit his job 'because everything had already been invented'."
"Even those who don't expect a slowing up at once tend to expect us to reach a point of diminishing returns -- dotted line number (2)."
"Very daring minds are willing to predict that we will continue our present rate of progress -- dotted line number (3) -- a tangent."
"But the proper way to project the curve is dotted line number (4), because there is no reason, mathematical, scientific, or historical, to expect that curve to flatten out... The correct projection ... is for the curve to go on up indefinitely with increasing steepness..."
Conclusion -- I certainly don’t know where we’re headed; no one else does either.
We must invest in people and process; we must help figure out what the right process is and then provide career incentives for people to do things that way.
This community should be leading the way:
Bioinformatics Open Source Conference
(Reminder: we will win.)
But: economics matter
50-million mark note. Weimar Germany, 1923.
Economics matter
Ref.: U.S. Government Accountability Office, Citizen's Guide of 2010.
Prospects for U.S. public funding of science
Ref.: U.S. Government Accountability Office, Citizen's Guide of 2010.
Public support for science matters!
Data sharing, openness => maximizing return.
Must figure out how to align career and funding incentives.
We are currently doing a horrible job of this…
…I’m looking forward to Phil Bourne’s talk :)
Thanks!
Discussions with Phil Bourne (NIH), Erich Schwarz (Caltech & Cornell), Katherine Mejia-Guerra (OSU) and Jeffrey Campbell (OSU).
All of this will be (is already?) posted online.
“The next 10 years of quant bio” by Mike Schatz
…with apologies to Gary Bernhardt (Birth & Death of JavaScript – go watch it!)