20200208-MetaEval-Keynote… · Experiment I and II • We surveyed 400 papers. • 100 papers...


Post on 13-Aug-2020


Page 1:

52% Yes, a significant crisis

3% No, there is no crisis

7% Don’t know

38% Yes, a slight crisis

1,576 RESEARCHERS SURVEYED (M. Baker, Nature, 2016)

Page 2:

(M. Baker, Nature, 2016)

Computer Science

Page 3:

ICLR 2018 Reproducibility Challenge

(J. Pineau, ICLR keynote, 2018)

Page 4:

Odd Erik Gundersen, dr. philos.
Chief AI Officer, TrønderEnergi AS
Adjunct Associate Professor, NTNU
[email protected]

How can we know it is shoulders we stand on?

Page 5:

The Scientific Method in AI Research

Page 6:

The Scientific Method in AI Research

Page 7:

Many Definitions of Reproducibility

(S. N. Goodman, D. Fanelli, J. P. A. Ioannidis, Science Translational Medicine, 2016)

(V. Stodden, Amstat News, 2011)

(R. D. Peng, Science, 2011)

Page 8:

Definition of Reproducibility

Reproducibility in empirical AI research is the ability of an independent research team to produce the same results using the same AI method based on the documentation made by the original research team.

Page 9:

Degree of Reproducibility

Page 10:
Page 11:

Reproducibility Metrics

Page 12:

WHAT WE GAIN

Page 13:

We Can Specify How Well Research is Documented

[Figure: three panels (Experiment, Method, Data), each with an axis marked 25%, 50%, 75%, 100%. Variable labels: Pseudo code, Research question, Research method, Objective / Goal, Problem, Results, Test, Validation, Train, Experiment code, Experiment setup, Software dep., Hardware specs, Method code, Prediction, Hypothesis. Values as extracted: 54%, 6%, 2%, 22%, 47%, 4%, 30%, 16%, 56%, 6%, 69%, 16%, 27%, 8%, 5%, 1%.]

(Gundersen, Kjensmo, AAAI, 2018)

Page 14:

We Can Measure Improvement

[Figure: y-axis from 0.00 to 0.35; x-axis 2013, 2014, 2015, 2016; series IJCAI R1D, IJCAI R2D, IJCAI R3D, AAAI R1D, AAAI R2D, AAAI R3D.]

(Gundersen, Kjensmo, AAAI, 2018)

Page 15:

We Can Compare Research: Papers

(Gundersen et al, forthcoming)

Page 16:

We Can Compare Research: Conferences

(Gundersen, Kjensmo, AAAI, 2018)

Page 17:

We Can Compare Research: Groups

(Gundersen, AI Magazine, forthcoming)

[Figure panels: Method, Data, Experiment]

Academia versus Industry

Page 18:

We Can Compare Software Frameworks

(Isdahl et al, forthcoming)

Page 19:

We Could Empirically Find What Entails Well-Documented Research

?

Page 20:

Compute the Likelihood of Success?

[Figure: diagram with node labels Description, Problem Description, Data, Hardware, Ancillary Software, R1 Results, Experiment, Test, Training, Validation, Output, Hypothesis, Hyper-parameters, Prediction, Pseudo code, Research Questions, Version, Version, Name, Code, AI Method, Experiment, Memory, CPU.]

Page 21:

We Should Be Able to Measure Success

Success: 3%

Partial success: 30%

Failure: 30%

No result: 23%

Filtered out (R3): 27%

(Gundersen et al, forthcoming)

Page 22:

We Can Set the Bar Based on What We Want to Achieve

Page 23:

EXPERIMENTS

Page 24:

Experiment I and II

• We surveyed 400 papers.

• 100 papers from each installment of AAAI 2014, AAAI 2016, IJCAI 2013 and IJCAI 2016.

• Six reproducibility metrics proposed for quantifying reproducibility.

(Gundersen, Kjensmo, AAAI, 2018)
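The slides name the metrics R1D, R2D and R3D but do not show how they are computed. The following is a minimal sketch of one way such degree-style scores could be calculated. It assumes that each factor (Method, Data, Experiment) is scored as the fraction of its variables a paper documents, and that each metric averages the factors it covers. The variable names, the grouping into factors and the averaging rule are all illustrative assumptions, not taken from the slides.

```python
# Hypothetical variable names per factor, loosely following labels on the slides.
FACTORS = {
    "method": ["problem", "objective", "research_method",
               "research_question", "pseudo_code"],
    "data": ["train", "validation", "test", "results"],
    "experiment": ["hypothesis", "prediction", "method_code", "experiment_code",
                   "experiment_setup", "software_dependencies", "hardware_specs"],
}


def factor_score(documented: set[str], factor: str) -> float:
    """Fraction of a factor's variables that a paper documents (assumed scoring)."""
    variables = FACTORS[factor]
    return sum(v in documented for v in variables) / len(variables)


def degree_scores(documented: set[str]) -> dict[str, float]:
    """Assumed degree metrics: each averages the factors it covers."""
    m = factor_score(documented, "method")
    d = factor_score(documented, "data")
    e = factor_score(documented, "experiment")
    return {
        "R1D": (m + d + e) / 3,  # method, data and experiment
        "R2D": (m + d) / 2,      # method and data
        "R3D": m,                # method only
    }


# A made-up paper that documents four of the sixteen variables.
example_paper = {"problem", "pseudo_code", "train", "test"}
scores = degree_scores(example_paper)
```

Under these assumptions the example paper scores 2/5 on Method, 2/4 on Data and 0 on Experiment, so sharing experiment code or setup details is what would raise its R1D score.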

Page 25:

Results I: Factors and Variables

[Figure panels: Experiment, Method, Data]

(Gundersen, Kjensmo, AAAI, 2018)

Page 26:

Results II: Reproducibility Degree

(Gundersen, Kjensmo, AAAI, 2018)

Page 27:

Results III: Change over Time

(Gundersen, Kjensmo, AAAI, 2018)

Page 28:

Results IV: Industry vs Academia

(Gundersen, AI Magazine, 2019)

[Figure panels: Method, Data, Experiment]

Page 29:

Results V: Industry vs Academia

(Gundersen, AI Magazine, 2019)

Page 30:

Experiment III

(Isdahl and Gundersen, eScience, 2019)

Page 31:
Page 32:

Experiment IV

• We selected 30 papers to reproduce.

• The ten most cited AI papers from each of 2012, 2014 and 2016, based on citation counts from Scopus.

• Structured research procedure.

(Gundersen et al, forthcoming)

Page 33:

Research Procedure

• Reproduce research that shared code and data, or data only (R3 papers were filtered out).

• Time-boxed the work on each paper to 40 hours of effective work time.

• Stopping criteria (computing resources, paywalled data sets, only qualitative results presented).

(Gundersen et al, forthcoming)
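The procedure above can be read as a triage step applied to each paper before any reproduction work starts. The sketch below is one possible encoding of it. The `Paper` class, its field names, the `triage` function and the returned labels are hypothetical and chosen for illustration. Only the 40-hour time box, the data-sharing filter and the three stopping criteria come from the slide.

```python
from dataclasses import dataclass

TIME_BOX_HOURS = 40  # effective work time per paper, as stated on the slide


@dataclass
class Paper:
    """Hypothetical record of what a paper shares and what blocks reproduction."""
    title: str
    shares_code: bool
    shares_data: bool
    needs_unavailable_compute: bool = False
    data_behind_paywall: bool = False
    only_qualitative_results: bool = False


def triage(paper: Paper) -> str:
    """Decide whether a reproduction attempt should start (assumed decision order)."""
    # Papers that share no data are filtered out, whether or not code is shared.
    if not paper.shares_data:
        return "filtered out (R3)"
    # Any one of the three stopping criteria ends the attempt.
    if (paper.needs_unavailable_compute
            or paper.data_behind_paywall
            or paper.only_qualitative_results):
        return "stopped"
    return "attempt"


def within_time_box(hours_spent: float) -> bool:
    """True while work on a paper is still inside the 40-hour budget."""
    return hours_spent <= TIME_BOX_HOURS
```

For example, `triage(Paper("example", shares_code=False, shares_data=True))` returns `"attempt"`, because sharing data alone is enough to pass the filter.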

Page 34:

Results: Outcome per paper

Success: 20%

Partial success: 13%

Failure: 23%

No result: 17%

Filtered out (R3): 27%

(Gundersen et al, forthcoming)
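As a quick arithmetic check, the rounded percentages above can be converted back to paper counts using the 30 papers selected for Experiment IV. The counts below are inferred from the percentages and are not reported on the slide. The variable names are illustrative.

```python
TOTAL_PAPERS = 30  # from the Experiment IV slide

# Percentages as reported on the slide.
reported_percent = {
    "Success": 20,
    "Partial success": 13,
    "Failure": 23,
    "No result": 17,
    "Filtered out (R3)": 27,
}

# Inferred counts: nearest whole number of papers for each percentage.
implied_counts = {
    outcome: round(percent / 100 * TOTAL_PAPERS)
    for outcome, percent in reported_percent.items()
}

# Recompute percentages from the inferred counts to confirm they round back.
recomputed_percent = {
    outcome: round(count / TOTAL_PAPERS * 100)
    for outcome, count in implied_counts.items()
}

print(implied_counts)
# {'Success': 6, 'Partial success': 4, 'Failure': 7, 'No result': 5, 'Filtered out (R3)': 8}
```

The inferred counts sum to 30 and round back to the reported percentages, so the five outcome categories are consistent with a sample of 30 papers.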

Page 35:

Top Six Causes of Failure

• Aspect of implementation not described or ambiguous (R2).

• Aspect of experiment not described or ambiguous (R2).

• Not all hyper-parameters are specified (R2).

• Mismatch between data in paper and data available online (R1+R2).

• Method code shared, experiment code not shared (R1).

• Method not described in enough detail (R2).

(Gundersen et al, forthcoming)

Page 36:

“IT IS MORE LIKE WE ARE STANDING ON EACH OTHER’S FEET”

Page 37:

EVALUATIONS

Page 38:
Page 39:
Page 40:
Page 41:

Research

• State of the Art: Reproducibility in Artificial Intelligence, O. E. Gundersen and S. Kjensmo, AAAI 2018.

• On Reproducible AI, O. E. Gundersen, Y. Gil and D. W. Aha, AI Magazine, Fall 2018.

• Standing on the Feet of Giants, O. E. Gundersen, AI Magazine, Winter 2019.

• Out-of-the-box Reproducibility: A Survey of Machine Learning Platforms, R. Isdahl and O. E. Gundersen, eScience 2019.

• What We Learned When Reproducing the Most Cited AI Research, O. E. Gundersen, O. Cappelen, N. Grimstad, M. Mølnå, forthcoming.

Odd Erik Gundersen [email protected]

Page 42: