science and machines - tuvalutuvalu.santafe.edu/~aaronc/slides/clauset_2017_science... ·...

Data-driven predictions and their limits for the science of science

Aaron Clauset
@aaronclauset
Computer Science Dept. & BioFrontiers Institute
University of Colorado, Boulder
External Faculty, Santa Fe Institute

20 March 2017
© 2017 Aaron Clauset

current position

professor of Computer Science

external faculty

current interests

broadly theory and practice of computational science

specifically network science, machine learning, rare events, statistical forecasting, macroevolutionary theory, terrorism and war, competition and sports, genetics and immunology, science of science, gender inequality

History and Future of Computing

small undergraduate CS seminar

weekly reading assignments

weekly writing assignments

7 weeks on history: written language, mechanical computing (Babbage), theoretical computing (Lovelace), the information "quickening", Gödel, Turing & Shannon, transistors, Moore’s law, modern computing, etc.

7 weeks on future: algorithmic bias and FAT ML, autonomous vehicles, ubiquitous computing, surveillance and privacy, weak AI, strong AI, the singularity is in our past light cone, etc.

public reading list: http://bit.ly/1S8zDzx

the desire to predict science pervades society

what will be discovered? by whom, when, and where?


individuals: what questions are interesting, impactful, fundable?

publishers, funders: what manuscripts or projects will be most impactful?

hiring committees: which applicant will perform best? which will make most valuable contributions?

society: how can tax and other dollars be invested to make technological, biomedical, and scientific advances?


how predictable are scientific discoveries?

simple question with a 150+ year history, e.g.:

Bolesław Prus (1847-1912)

Florian Znaniecki (1882-1958)

Freeman Dyson

Steven Weinberg (Nobel Physics, 1979)

• mainly conceptual, focusing on goals and general approaches

(Weinberg: "to explain the world") (Dyson: "birds and frogs")

• progress toward a genuine "science of science" was slow
  hard to get good data
  judgement of experts seemed good enough

ESSAY

Data-driven predictions in the science of science
Aaron Clauset,1,2* Daniel B. Larremore,2 Roberta Sinatra3,4

The desire to predict discoveries—to have some idea, in advance, of what will be discovered, by whom, when, and where—pervades nearly all aspects of modern science, from individual scientists to publishers, from funding agencies to hiring committees. In this Essay, we survey the emerging and interdisciplinary field of the “science of science” and what it teaches us about the predictability of scientific discovery. We then discuss future opportunities for improving predictions derived from the science of science and its potential impact, positive and negative, on the scientific community.

Today, the desire to predict discoveries—to have some idea, in advance, of what will be discovered, by whom, when, and where—pervades nearly all aspects of modern science. Individual scientists routinely make predictions about which research questions or topics are interesting, impactful, and fundable. Publishers and funding agencies evaluate manuscripts and project proposals in part by predicting their future impact. Faculty hiring committees make predictions about which candidates will make important scientific contributions over the course of their careers. And predictions are important to the public, who fund the majority of all scientific research through tax dollars. The more predictable we can make the process of scientific discovery, the more efficiently those resources can be used to support worthwhile technological, biomedical, and scientific advances.

Despite this pervasive need, our understanding of how discoveries emerge is limited, and relatively few predictions by individuals, publishers, funders, or hiring committees are made in a scientific way. How, then, can we know what is predictable and what is not? Although it can be difficult to separate the discovery from the discoverer, the primary focus of this Essay is the science of science: an interdisciplinary effort to scientifically understand the social processes that lead to scientific discoveries. [For the current thinking on the philosophy of science and how scientists make progress on individual scientific challenges, see (1).]

Interest in predicting discoveries stretches back nearly 150 years, to work by the philosopher Boleslaw Prus (1847–1912) and the empirical sociologist Florian Znaniecki (1882–1958). Znaniecki, in particular, called for the establishment of a data-driven study of the social processes of science. For most of the 20th century, progress toward this goal came slowly, in part because good data were difficult to obtain and most people were satisfied with the judgment of experts.

Today, the scientific community is a vast and varied ecosystem, with hundreds of loosely interacting fields, tens of thousands of researchers, and a dizzying number of new results each year. This daunting size and complexity has broadened the appeal of a science of science and encouraged a focus on generic measurable quantities such as citations to past works, production of new works, career trajectories, grant funding, scholarly prizes, and so forth. Digital technology makes such information abundant, and researchers are developing powerful new computational tools for analyzing it—for instance, to extract and categorize the content of papers in order to automatically quantify progress on specific scientific questions (2, 3). It is now widely believed that exploiting this information can produce predictions that are more objectively accurate than expert opinions. Bibliographic databases and online platforms—Google Scholar, PubMed, Web of Science, JSTOR, ORCID, EasyChair, and “altmetrics,” to name a few—are enabling a new generation of researchers to develop deeper insights into the scientific process.

These efforts raise a provocative question: Will we eventually be able to predict important discoveries or their discoverers, such as Yoshinori Ohsumi’s Nobel Prize–winning work on the autophagy system in animal cells? We do not yet know the answer, but work toward one will substantially

Clauset et al., Science 355, 477–480 (2017) 3 February 2017 1 of 4

1Department of Computer Science, and BioFrontiers Institute, University of Colorado, Boulder, CO 80309, USA.
2Santa Fe Institute, Santa Fe, NM 87501, USA.
3Center for Network Science and Department of Mathematics, Central European University, Budapest, Hungary.
4Center for Complex Network Research and Physics Department, Northeastern University, Boston, MA 02115, USA.
*Corresponding author. Email: [email protected]

[Figure: discoveries placed along an axis running from Unexpected to Expected. Labels: Penicillin; Cosmic microwave background (CMB); CRISPR gene editing; Gödel’s incompleteness theorem; x-rays; Theory of evolution; Microwave cooking; Ribozymes; Church-Turing Thesis (general purpose computers); Calculus; Quasicrystals; Fermat’s last theorem; Polio vaccine; Structure of DNA; Human genome sequence; Higgs boson; Gravitational waves (LIGO).]

Fig. 1. How unexpected is a discovery? Scientific discoveries vary in how unexpected they were relative to existing knowledge. To illustrate this perspective, 17 examples of major scientific discoveries are arranged from the unanticipated (like antibiotics, programmable gene editing, and cosmic microwave background radiation) to expected discoveries (like the observation of gravitational waves, the structure of DNA, or the decoding of the human genome).

Graphic: adapted by K. Sutliff/Science

predictability depends on context

Clauset, Larremore & Sinatra, Science 355, 477-480 (2017)

http://tuvalu.santafe.edu/~aaronc/papers/CLS_2017_LimitsOfPredictionInScientificDiscovery.pdf


unexpected discovery: changes the way we understand the world, or finds novel use elsewhere




expected discovery: accumulation of theory and evidence, fits with other ideas




"normal" discovery: some elements surprising, but fits partly within existing ideas



a modern science of science

1. abundant data and computation

Google Scholar, PubMed, Web of Science, arXiv, JSTOR, ORCID, EasyChair, etc.
supercomputers, cloud computing, etc.

2. interdisciplinary community

physicists, computer scientists, biologists, economists, sociologists, statisticians, etc.

3. surely all this data must enable better predictions of future discoveries!

a modern science of science

data and computation are generating new insights into the successes and limitations of prediction:

four notable examples

1. citations to past discoveries

2. who gets hired to make discoveries

3. productivity over a career

4. timing of discoveries within a career


the canonical narrative (50+ years of evidence):

1. rapid rise to an early peak
2. decline or flattening


Publication rates in psychology, 1986
Horner, et al. Psychology and Aging 1(4), 319 (1986)

. . . in Russian science & math, 1954
Lehman, The Scientific Monthly 78, 321-326 (1954)

. . . hunter-gatherer groups
Kaplan et al., Evol. Anth.: Issues, News, and Rev., 9, 156-185 (2000)

. . . French & Philly criminals, 1835
. . . French artists, 1835
Quetelet, 1835.

. . . Many others, 1950s - present

. . . Computer scientists
Way, et al., Preprint, arXiv:1612.08228 (2016)

[Figure: publication count (0 to 8) versus years post-hire (0 to 20), one curve per department prestige group: π < 50, 50 ≤ π < 100, π ≥ 100.]

faculty hiring project



https://arxiv.org/abs/1612.08228


average productivity appears to be predictable


[Figure: four panels, each plotting publication count against years post-hire. Panel labels: Conventional narrative; N)357 (32.7%); Stable (74.6%); Unstable (25.4%).]


except it’s not — conventional narrative holds only for 33% of people

Way, et al., Preprint, arXiv:1612.08228 (2016)
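One way to ask whether a single person's record follows the rise-then-decline story is to fit two straight lines to their yearly publication counts and check the signs of the two slopes. The sketch below is illustrative only: the two-segment fit, the function names, the `min_len` parameter, and the example series are all assumptions made here, not the model or the data used by Way et al.

```python
from typing import Sequence, Tuple


def fit_line(xs: Sequence[float], ys: Sequence[float]) -> Tuple[float, float]:
    """Ordinary least-squares line through (xs, ys); returns (slope, squared error)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / sxx
    intercept = mean_y - slope * mean_x
    sse = sum((y - (slope * x + intercept)) ** 2 for x, y in zip(xs, ys))
    return slope, sse


def two_segment_slopes(counts: Sequence[float], min_len: int = 3) -> Tuple[float, float]:
    """Try every breakpoint, fit one line per side, keep the split with least error.

    Hypothetical helper: the two lines are fitted independently, so they are
    not forced to meet at the breakpoint.
    """
    years = list(range(len(counts)))
    best = None
    for b in range(min_len, len(counts) - min_len + 1):
        early_slope, early_sse = fit_line(years[:b], counts[:b])
        late_slope, late_sse = fit_line(years[b:], counts[b:])
        total = early_sse + late_sse
        if best is None or total < best[0]:
            best = (total, early_slope, late_slope)
    if best is None:
        raise ValueError("series too short for two segments")
    return best[1], best[2]


def follows_conventional_narrative(counts: Sequence[float]) -> bool:
    """True when the early slope is positive and the late slope is negative."""
    early_slope, late_slope = two_segment_slopes(counts)
    return early_slope > 0 and late_slope < 0


# Made-up yearly publication counts, for illustration only.
rise_then_decline = [0, 1, 2, 3, 4, 5, 4, 3, 2, 1, 0]
steady_growth = [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10]

print(follows_conventional_narrative(rise_then_decline))
print(follows_conventional_narrative(steady_growth))
```

Applying a per-person check like this across a whole population, rather than to the population average, is what makes it possible to count how many individuals actually match the canonical shape.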


4. timing of biggest discoveries

conventional narrative: creativity peaks early



Nobel in Physics (1982): phase transitions and renormalization group

Figure 1: Patterns of productivity during a scientific career. (A) Publication history of Kenneth G. Wilson (Nobel Prize in Physics, 1982). The horizontal axis indicates the number of years after the scientist’s first publication and each vertical line corresponds to a research paper. The height of each line corresponds to $c_{10}$, i.e. the number of citations the paper received after 10 years (SM S1.3 and S1.6). The highest impact paper of Wilson was published in 1974, 9 years after his first publication and it is the 17th of his 48 papers, hence $t^* = 9$, $N^* = 17$, $N = 48$. (B) Distribution of the highest impact paper $P(c^*_{10})$ across all scientists. We highlight in blue the bottom 20% of the area, corresponding to low maximum impact scientists ($c^*_{10} \le 20$); the red area indicates the high maximum impact scientists (top 5%, $c^*_{10} \ge 200$); yellow corresponds to the remaining 75% medium maximum impact scientists ($20 < c^*_{10} < 200$). These cutoffs do not change if we exclude review papers from our analysis (see Fig. S4 and Fig. S36). (C) Number of papers $N(t)$ published up to time $t$, for three scientists with low, medium and high impact, but comparable final number of papers throughout their career. (D) Distribution of the productivity exponents $\gamma$ (18). The productivity of high impact scientists grows faster than that of low impact scientists. (E) Dynamics of productivity, as captured by the average number of papers $\langle n(t) \rangle$ published each year for high, average and low impact scientists. $t = 0$ corresponds to the year of a scientist’s first publication.

[Figure panels A to E. Axis labels: time since first publication (years), $c_{10}$, $c^*_{10}$, $P(c^*_{10})$, $N(t)$, $\gamma$, $P(\gamma)$, $\langle n(t) \rangle$. Legend: low max impact ($c^*_{10} \le 20$), medium max impact ($20 < c^*_{10} < 200$), high max impact ($c^*_{10} \ge 200$). Panel A is labeled Kenneth G. Wilson.]

450,000 articles from Physical Review, 1893-2016
Sinatra et al, Science 354, 596 (2016)

[Figure: sequence of publications from first paper to last paper for a sample of scientists, with each scientist's highest impact paper marked; companion panel: fraction of highest impact paper in sequence of publications.]


Sinatra et al, Science 354, 596 (2016)

conventional narrative: wrong again
all physicists, publications ordered from first to last
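The quantity behind this slide is where a scientist's most cited paper falls in their ordered list of publications. A minimal sketch of that bookkeeping follows; the function name and the citation counts are invented for illustration, and only the position (17th of 48 papers) is taken from the Kenneth G. Wilson example in the figure caption.

```python
from typing import Sequence, Tuple


def highest_impact_position(c10: Sequence[int]) -> Tuple[int, int, float]:
    """Locate the most cited paper in one career.

    c10 holds each paper's 10-year citation count, in publication order.
    Returns (n_star, n_total, n_star / n_total), where n_star is the 1-based
    position of the most cited paper. Ties go to the earliest such paper.
    """
    if not c10:
        raise ValueError("career has no papers")
    n_total = len(c10)
    best_index = max(range(n_total), key=lambda i: c10[i])
    n_star = best_index + 1
    return n_star, n_total, n_star / n_total


# Hypothetical career: 48 papers, with the 17th made the most cited.
# The citation values themselves are placeholders, not real data.
career = [5] * 48
career[16] = 300

n_star, n_total, fraction = highest_impact_position(career)
print(n_star, n_total, round(fraction, 3))  # 17 48 0.354
```

Collecting the relative position `n_star / n_total` over many careers gives the kind of distribution the slide refers to: if the biggest hit tended to come early, those values would pile up near 0 rather than spreading across the whole sequence.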

predicting discoveries

some aspects of science are highly predictable
  most citation counts, institution of origin, maximum impact, etc.
  interdisciplinary research is harder to publish & fund
  under-represented groups (women, minorities) receive less funding

other aspects appear fundamentally unpredictable
  productivity over a career, timing of biggest discovery, etc.
  likely long-term impact of proposed project or manuscript


bibliographic information is abundant but crude, and is a lagging indicator of scientific innovation

could we make better predictions with more data? contents of papers, preprints, workshops, research team communication, rejected manuscripts or proposals, peer reviews, post-publication reviews

risks of automation

citations and publications prone to feedback loops
  rich-get-richer dynamic
  can amplify inequalities if opportunities for future success allocated by markers of recent success
  can create self-fulfilling predictions, which narrow innovation

action item: what measures of success are not susceptible to feedback loops?
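The rich-get-richer feedback loop can be made concrete with a toy simulation in which each new citation goes to a paper with probability proportional to the citations it already has, plus one. This is a generic preferential-attachment sketch, not a model from the talk; the function name, the "+1" offset, and the parameter values are assumptions chosen for illustration.

```python
import random
from typing import List


def simulate_rich_get_richer(n_papers: int, n_citations: int, seed: int = 0) -> List[int]:
    """Hand out citations one at a time, favoring already-cited papers.

    Every paper starts at zero citations. The +1 in the weight lets uncited
    papers still be chosen. Returns the final citation count per paper.
    """
    rng = random.Random(seed)
    counts = [0] * n_papers
    for _ in range(n_citations):
        weights = [c + 1 for c in counts]
        winner = rng.choices(range(n_papers), weights=weights, k=1)[0]
        counts[winner] += 1
    return counts


# Identical papers at the start, yet the final counts end up uneven.
counts = simulate_rich_get_richer(n_papers=20, n_citations=1000)
print(sorted(counts, reverse=True))
```

All papers here are interchangeable by construction, so any spread in the final counts comes from early luck being amplified. A success measure that feeds back into future opportunity in this way is the kind the action item asks to avoid.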


selection by automatic prediction of future "impact"
  at request of universities or publishers
  can induce herding behavior (e.g., automated stock traders)
  need fairness, accountability and transparency (FAT) in ML
  need humans in the loop

action item: how do we build FAT ML principles into systems?

the future could be bright

science is a large and diverse ecosystem

machines could expand or contract it

what lessons can we learn from ecology and evolutionary theory? design principles of robustness, diversifying selection, stabilizing feedback, etc.

if discovery is inherently unpredictable, better to cultivate a diverse scientific ecosystem than try to automate its prediction

"novel discoveries are valuable precisely because they have never been seen before, while data-driven prediction techniques can only learn about what’s been done in the past"


Daniel B Larremore (Santa Fe)

Roberta Sinatra (Central Eur. U.)

Samuel F Way (Colorado)

Allison C Morgan (Colorado)

Clauset, Larremore & Sinatra, Science 355, 477-480 (2017)
Way, et al., Preprint, arXiv:1612.08228 (2016)

The misleading narrative of the canonical faculty productivity trajectory

Samuel F. Way,1,* Allison C. Morgan,1,† Aaron Clauset,1,2,3,‡ and Daniel B. Larremore3,§

1Department of Computer Science, University of Colorado, Boulder, CO, USA
2BioFrontiers Institute, University of Colorado, Boulder, CO, USA
3Santa Fe Institute, Santa Fe, NM, USA

A researcher may publish tens or hundreds of papers, yet these contributions to the literature are not uniformly distributed over a career. Past analyses of the trajectories of faculty productivity suggest an intuitive and canonical pattern: after being hired, productivity tends to rise rapidly to an early peak and then gradually declines. Here, we test the universality of this conventional narrative by analyzing the structures of individual faculty productivity time series, constructed from over 200,000 publications matched with hiring data for 2453 tenure-track faculty in all 205 Ph.D-granting computer science departments in the U.S. and Canada. Unlike prior studies, which considered only some faculty or some institutions, or lacked common career reference points, here we combine a large bibliographic dataset with comprehensive information on career transitions that covers an entire field of study. We show that the conventional narrative describes only one third of faculty, regardless of department prestige, and the remaining two thirds of faculty exhibit a rich diversity of productivity patterns. To explain this diversity, we introduce a simple model of productivity trajectories, and explore which factors correlate with its parameters, showing that both individual productivity and the transition from first- to last-author publications correlate with departmental prestige.

INTRODUCTION

Scholarly publications serve as the primary mode of communication through which scientific knowledge is developed, discussed, and disseminated. The amount that an individual researcher contributes to this dialogue—their scholarly productivity—thus serves as an important measure of the rate at which they contribute units of knowledge to the field, and this measure is known to influence the placement of graduates into faculty jobs [1], the likelihood of being granted tenure [2, 3], and the ability to secure funding for future research [4].

The trajectory of productivity over the course of a researcher’s lifetime has been studied for at least 60 years, with the common observation that a researcher’s productivity rises rapidly to a peak and then slowly declines [5–9], inspiring the construction of mechanistic models with a similar profile [7, 9–11]. These models have included factors like cognitive decline with age, career age, finite supplies of human capital, knowledge advantages conferred by recent education, as well as skill deficits among the young, among others, and have been supported by the observation that productivity curves are not well described by even fourth-degree polynomial models [9]. Indeed, every study we found to date proposes or confirms a “rise and decline,” “curvilinear,” or “peak and tapering” productivity trajectory, regardless of whether researchers are binned by chronological age [5–8, 10–12], career age [9, 10], or (only for young researchers) years since first publication [13]. In fact, this conventional narrative of the life course is not restricted to academia, with similar trajectories observed in criminal behavior and artistic production in 1800s France [14] and even productivity of food acquisition by hunter-gatherers [15].

While these past studies have firmly established that the conventional academic productivity narrative is equally descriptive across fields and time, their analyses are based on averages over hundreds or thousands of individuals [5–15]. This raises two crucial and previously unanswered questions: is this average trajectory representative of individual faculty, and how much diversity is hidden by a focus on a central tendency over a population? To answer these questions, we combine and study two comprehensive datasets that span forty years of productivity for nearly every tenure-track professor in a North American Ph.D.-granting computer science department. By introducing a simple mathematical description, we map individuals’ publication histories to a low-dimensional parameter space, revealing enormous diversity in the publication trends of individual faculty and showing that only a minority follow the conventional narrative of productivity. In fact, even among the conventional trajectories, individuals exhibit large fluctuations in their productivity around the average trend. Together, these results reveal that productivity patterns are both more diverse and less predictable than previously thought, and that population averages provide a dramatically inaccurate picture of intellectual contributions over time.

Moreover, while we show that the distribution of productivity trajectories resist natural categorizations, it is nevertheless possible to explore covariates that are associated with different regions of parameter space. The literature on such associations has avoided detailed trajectories and instead focused on the complicated relationship between prestige, productivity, and hiring. Past studies have found that researchers trained at prestigious

arXiv:1612.08228v1 [cs.DL] 25 Dec 2016
