The Reproducibility Mindset ... - Cancer Research UK Roger Peng’s Coursera course and notes (2013)...

The Reproducibility Mindset: Enhancing Big Data Quality in Medicine and Research Keith A. Baggerly Bioinformatics and Computational Biology UT M. D. Anderson Cancer Center [email protected] Big Data Analytics, June 13, 2017


Page 1: The Reproducibility Mindset ... - Cancer Research UKRoger Peng’s Coursera course and notes (2013) Christopher Gandrud’s book (2e, 2015) Yihui Xie’s book (2e, 2015) Hadley Wickham’s

The Reproducibility Mindset: Enhancing Big Data Quality in Medicine and Research

Keith A. Baggerly
Bioinformatics and Computational Biology
UT M. D. Anderson Cancer Center
[email protected]

Big Data Analytics, June 13, 2017


What Makes Data High Quality?

My (bioinformatics) viewpoint: big data is thousands of measurements per sample

Findability / Accessibility (publication, data sharing)

Accuracy / Precision (bias, variance)

Clarity / Labeling / Metadata (sanity checks)

Generality (experimental design, confounding)

Relevance to the Problem at Hand


What Makes Inference High Quality?

Mostly the same criteria applied to the methods used

Can we find/access the code?

Can we understand it? Is the workflow clear?

Are the methods employed appropriate?

Will results replicate? (design, prespecification, train/test)

These criteria also apply to any preprocessing of “raw” data into the final form (often important for big data)

None of this is inherently complicated


Relevance to Medicine and Research?

Many biomedical claims are failing tests of replicability

2005, 2009 Ioannidis (et al)

2012 Begley and Ellis

2015 Freedman et al

2016 NIH Rigor and Reproducibility Initiative

2017 NSF, NASEM

We’re getting better, but there’s still room for improvement

Let’s look at some examples...


A Proteomics Case Study

Petricoin et al (2002), Lancet, 359(9306):572-77

100 ovarian cancer patients
100 normal controls
16 patients with “benign disease”

Use 50 cancer and 50 normal spectra to train a classification method; test the algorithm on the remaining samples.
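The split described above can be sketched in R. This is only an illustration of the train/test idea, not the procedure Petricoin et al used: the labels are synthetic, the seed is arbitrary, and the names `labels`, `trainIdx`, and `testIdx` are hypothetical.

```r
# Synthetic group labels matching the study sizes quoted above
# (assumption: no real spectra are involved, only labels).
labels <- c(rep("cancer", 100), rep("normal", 100), rep("benign", 16))

set.seed(1)  # arbitrary seed so the split can be regenerated exactly

# Draw 50 cancer and 50 normal samples for training.
trainIdx <- c(sample(which(labels == "cancer"), 50),
              sample(which(labels == "normal"), 50))

# Everything else (50 cancer, 50 normal, 16 benign) is held out for testing.
testIdx <- setdiff(seq_along(labels), trainIdx)

table(labels[trainIdx])
table(labels[testIdx])
```

Fixing the seed and keeping the index vectors makes it possible for someone else to regenerate exactly the same split.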


Which Group is Different?


Really?


Processing Can Trump Biology: Design!


Using Cell Lines to Predict Sensitivity

Potti et al (2006), Nature Medicine, 12:1294-300.

The main conclusion: we can use microarray data from cell lines (the NCI60) to define drug response “signatures”, which can predict whether patients will respond.

They provide examples using 7 commonly used agents.

This got people at MDA very excited.


Their Gene List and Ours

> temp <- cbind(
      sort(rownames(pottiUpdated)[fuRows]),
      sort(rownames(pottiUpdated)[[email protected] <= fuCut]));
> colnames(temp) <- c("Theirs", "Ours");
> temp
     Theirs       Ours
...
[3,] "1881_at"    "1882_g_at"
[4,] "31321_at"   "31322_at"
[5,] "31725_s_at" "31726_at"
[6,] "32307_r_at" "32308_r_at"
...
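In the table above, each of "their" probe IDs appears to sit one position before "ours". A minimal R sketch of how such an offset could be checked follows. The helper `matchAtOffset` and the tiny `probeIds` ordering are hypothetical; the IDs come from the table, but treating them as adjacent rows is an assumption made for illustration.

```r
# Hypothetical stand-in for the array's probe ordering (assumed adjacency).
probeIds <- c("1881_at", "1882_g_at", "31321_at", "31322_at",
              "31725_s_at", "31726_at", "32307_r_at", "32308_r_at")

theirs <- c("1881_at", "31321_at", "31725_s_at", "32307_r_at")
ours   <- c("1882_g_at", "31322_at", "31726_at", "32308_r_at")

# Fraction of 'expected' IDs found 'offset' rows away from 'reported' IDs.
matchAtOffset <- function(reported, expected, ordering, offset) {
  idx <- match(reported, ordering) + offset
  idx[idx < 1 | idx > length(ordering)] <- NA  # out-of-range rows become NA
  mean(ordering[idx] == expected, na.rm = TRUE)
}

# Try shifting by -1, 0, and +1 rows.
sapply(c(-1, 0, 1), function(k) matchAtOffset(theirs, ours, probeIds, k))
```

With this toy ordering, only the +1 shift reproduces "ours" exactly, which is the kind of simple offset the check is meant to surface.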


Predicting Response: Docetaxel

Potti et al (2006), Nature Medicine, 12:1294-300, Fig 1d

Chang et al, Lancet 2003, 362:362-9, Fig 2 top


Predicting Response: Adriamycin

Potti et al (2006), Nature Medicine, 12:1294-300, Fig 2c

Holleman et al, NEJM 2004, 351:533-42, Fig 1


We Tried Matching Their Validation Samples

43 samples are mislabeled.
16 samples don’t match because the genes are mislabeled.
All of the validation data are wrong.

We reported this to Duke and the NCI in mid-Nov 2009.


A Catalyzing Event: July 16, 2010

Jul 19/20: Letter to Varmus; Duke resuspends trials.
Oct 22/9: First call for paper retraction.
Nov 9: Duke terminates trials.
Nov 19: Call for Nat Med retraction; Potti resigns.


Age-Related Macular Degeneration (AMD)

AREDS, 2001

Awh et al, 2013 Genomics matters!


However...

Chew et al, 2014 No it doesn’t!

Awh et al, 2015 Yes it does!


Do We Need to Genotype?

Awh et al, 2015, Fig 2
Statistical arbitration sought


Challenges

Data cleaning

Experimenter degrees of freedom

How many genes were examined before choosing CFH and ARMS2?

How were genotype groups defined? Was this algorithmic?

Are we working with the same data?

Will the claims hold in an independent test set?


Meta-Analysis: Dietary Reference Intakes (DRIs)

Recommended Dietary Allowance (RDA): the average daily dietary intake level that is sufficient to meet the nutrient requirement of nearly all (97 to 98 percent) healthy individuals in a group.

Estimated Average Requirement (EAR): a nutrient intake value that is estimated to meet the requirement of half the healthy individuals in a group.

Tolerable Upper Intake Level (UL): the highest daily nutrient intake likely to pose no risk of adverse health effects to almost all individuals in the general population. As intake increases above the UL, the risk of adverse effects increases.

Adequate Intake (AI): the fallback


Nomenclature

Intakes in IU: we consume this

Serum levels in ng/mL (1 ng/mL = 2.5 nmol/L): we link this to outcome

Requirements in terms of either


Modeling Intakes and Requirements

IOM 2000: DRIs in Dietary Assessment, Fig 4.2
Assume normality and model.
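Under the normality assumption, the EAR and RDA defined earlier can be related as follows. This is the standard textbook relation rather than anything read off the figure; the symbols R (an individual's requirement) and σ_R (the SD of requirements) are introduced here for illustration.

```latex
R \sim \mathcal{N}\!\left(\mathrm{EAR},\, \sigma_R^2\right), \qquad
\mathrm{RDA} = \mathrm{EAR} + 2\,\sigma_R, \qquad
P(R \le \mathrm{RDA}) = \Phi(2) \approx 0.977
```

The value of about 97.7% is consistent with the "97 to 98 percent" in the RDA definition.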


Defining Requirements in Other Units

Durazo-Arvizu et al, 2010, Fig 1

When is the nutrient product not biochemically limiting?

If vitamin D is too low to regulate calcium, parathyroid hormone (PTH) will increase and leach Ca from bones.

Requirements use serum levels (ng/mL); intakes use IU.


Priemel et al, Figure 4d: OV/BV

OV/BV values ≥ 1.2% or 2% are bad.
Priemel et al recommended targeting >30 ng/mL.


The Official Cutpoint

20 ng/mL = 50 nmol/L.

Why? Because 97.5% isn’t 100%.

“The number ... above 50 nmol/L was counted by inspection...

At ... 50 nmol/L, there were seven data points reflecting ... (OV/BV > 2 percent).

This suggested ... 50 nmol/L met the needs of 99 percent ... (that is, only 7 of 675 surpassed the measure).”

IOM report, p.276.
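The arithmetic behind the quoted "99 percent" can be written out directly from the two numbers in the passage:

```latex
\frac{7}{675} \approx 0.0104 \approx 1\%, \qquad
1 - \frac{7}{675} \approx 0.990
```

Here 675 appears to be the size of the whole cohort, so the result depends on accepting all subjects, at every serum level, as the denominator.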


Come Again?

Is this picture reasonable?


Zooming In on 20 ng/mL

This rate of problems is way too high


Mapping Serum to Intake, IOM 2011 Fig 5.4

Are cohort averages (dots) vertically close to a curve?
How can we model variation in attainments?


SD(Attainments) for the Studies Used

Here, σY /σX ≈ 4. (One study used SEM.)
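The parenthetical matters because a standard error of the mean (SEM) and a standard deviation (SD) differ by a factor of sqrt(n). A small R sketch of the conversion, using made-up numbers: the helper `semToSd` and the values of `sem` and `n` are hypothetical and not taken from any of the studies.

```r
# Convert a reported SEM to an SD: SD = SEM * sqrt(n).
semToSd <- function(sem, n) sem * sqrt(n)

sem <- 2.0   # hypothetical SEM, in ng/mL
n   <- 100   # hypothetical cohort size

semToSd(sem, n)  # 20
```

Reading an SEM as if it were an SD would understate the spread of individual attainments by that same factor.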


All the Data

IOM in red, about 2 SEM; prediction in black.


Doing it Better: IMPACT, May 8

Zehir et al, Nat Med
10945 samples from 10336 patients


Most Data are Publicly Available

From the Paper

The Supplementary Information (meta-data, annotation)

The cBio Portal http://cbioportal.org/msk-impact

GitHub repositories of their data processing pipelines

Not BAM level raw data, but somatic mutation calls, variant allele fractions, and the like


We Can Check It: TP53

The uber-tumor suppressor: break anywhere


We Can Check It: KRAS

A key oncogene: break in very specific places


The Bottom Lines

These cases may be pathological.

But we see similar problems a lot.

The most common mistakes are simple.

Confounding in the experimental design
Mixing up the sample labels
Mixing up the gene labels
Mixing up the group labels
(Most mixups involve simple switches or offsets)

This simplicity is often hidden.

Incomplete documentation

This is fixable.


Reasons for Hope

1. Our Own (Evolving!) Experience & Sanity Checks

2. Better tools (knitr, markdown, GitHub)

3. Journals, Code and Data

4. The IOM, the FDA, and IDEs*

5. The NCI and Trials it Funds

6. OSTP, Congress, Science, Nature

As I perform an analysis, am I confident I or someone else could easily get the same results again, or modify the analysis if need be?
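One small habit that supports this, sketched in R under stated assumptions: fix the random seed and record the software environment alongside the results. The seed value and the variable names `x` and `y` are arbitrary choices for illustration.

```r
# Fix the seed so any random steps can be regenerated exactly.
set.seed(20170613)   # arbitrary seed (assumption)
x <- rnorm(10)

# Re-running with the same seed reproduces the same draws.
set.seed(20170613)
y <- rnorm(10)
identical(x, y)      # TRUE

# Record the R version and loaded packages with the results.
sessionInfo()
```

Placing such a chunk in a knitr or markdown report keeps the code, its output, and the environment record in one document.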


Thanks!