How to make more from exposure data? An integrated machine learning pipeline to predict pathogen exposure
Nicholas M. Fountain-Jones*1, Gustavo Machado2, Scott Carver3, Craig Packer4, Mariana Recamonde-Mendoza5 & Meggan E. Craft1
1 Department of Veterinary Population Medicine, University of Minnesota, USA.
2 Department of Population Health and Pathobiology, North Carolina State University, North Carolina, USA.
3 Department of Biological Sciences, University of Tasmania, Australia.
4 Department of Ecology, Evolution and Behavior, University of Minnesota, USA.
5 Institute of Informatics, Universidade Federal do Rio Grande do Sul, Brazil.
*Department of Veterinary Population Medicine, University of Minnesota, 1365 Gortner Avenue, St Paul, Minnesota 55108. [email protected]
Author Contributions
The study was designed by NMFJ, CP and MEC and the data were provided by CP. NMFJ, GM and MRM conducted all statistical analyses, MRM led the development of the R optimization functions, and NMFJ wrote the first draft of the manuscript. All authors (NMFJ, GM, SC, CP, MRM and MEC) contributed substantially to revisions. NMFJ and GM prepared the vignette.
Data Accessibility Statement
Should the manuscript be accepted, the data supporting the results will be archived on GitHub and Dryad with DOI provided.

Running Title
Machine learning and pathogen exposure risk

Keywords
Boosted regression trees, Disease ecology, Gradient boosting models, Machine learning, Model agnostic methods, Random forests, Serology, Support vector machines.
bioRxiv preprint doi: https://doi.org/10.1101/569012; this version posted March 6, 2019. The copyright holder for this preprint (which was not certified by peer review) is the author/funder. It is made available under a CC-BY-NC-ND 4.0 International license.
Abstract
1. Predicting infectious disease dynamics is a central challenge in disease ecology. Models that can assess which individuals are most at risk of being exposed to a pathogen not only provide valuable insights into disease transmission and dynamics but can also guide management interventions. Constructing such models for wild animal populations, however, is particularly challenging; often only serological data are available on a subset of individuals, and non-linear relationships between variables are common.

2. Here we take advantage of the latest advances in statistical machine learning to construct pathogen-risk models that automatically incorporate complex non-linear relationships with minimal statistical assumptions from ecological data with missing values. Our approach compares multiple machine learning algorithms in a unified environment to find the model with the best predictive performance and uses game theory to better interpret results. We apply this framework to two major pathogens that infect African lions: canine distemper virus (CDV) and feline parvovirus.

3. Our modelling approach provided enhanced predictive performance compared to more traditional approaches, as well as new insights into disease risks in a wild population. We were able to efficiently capture and visualise strong non-linear patterns, as well as model complex interactions between variables in shaping exposure risk from CDV and feline parvovirus. For example, we found that lions were more likely to be exposed to CDV at a young age, but only in low rainfall years.

4. When combined with our data calibration approach, our framework helped us to answer questions about risk of pathogen exposure that are difficult to address with previous methods. Our framework not only has the potential to aid in predicting disease risk in animal populations, but can also be used to build robust predictive models suitable for other ecological applications such as modelling species distribution or diversity patterns.
1. Introduction

An individual’s risk of infection by a pathogen depends on a wide variety of host and pathogen traits that are often complicated by heterogeneities in landscapes and individual behavior (Altizer et al., 2003; Carver et al., 2016; Ezenwa, 2004; Nunn, Jordán, McCabe, Verdolin, & Fewell, 2015). Evidence of infection in animals is often determined by serology, which can establish whether a host has likely been infected (or not) sometime in the past (‘exposed’ vs ‘non-exposed’ host; Gilbert et al., 2013). Models that attempt to quantify exposure risk often assume that exposed and non-exposed hosts can be linearly separated by a predictor, even though non-linear relationships may be more biologically plausible, e.g., where exposure risk reaches an asymptote by a certain age (Denison & Holmes, 2001). Further, exposure risk may depend on interactions between predictors (Diez-Roux, 1998), e.g., exposure risk is elevated only in the youngest and oldest individuals in one subset of the population. Generalized linear models are not effective at capturing such non-linear responses and complex interactions, which is an important advantage of most machine-learning algorithms (Elith, Leathwick, & Hastie, 2008; Tu, 1996).

Machine learning methods offer a powerful and flexible alternative for exploring important patterns in exposure data, often with substantially better predictive performance (e.g., Elith et al., 2008). Machine learning models can include thousands of predictors without overfitting and can analyze either categorical or continuous data, even in datasets containing large numbers of missing values. However, choosing the most appropriate machine learning model can be challenging; thus the development of methods for identifying the best-performing model has become an important area of research over the last 15 years (Ho & Pepyne, 2002). Supervised machine learning models, such as support vector machines (SVMs), or tree-based algorithms, such as random forests (RF), gradient boosting models (GBMs), or boosted regression trees (BRTs), operate in markedly different ways, potentially leading to different performance and results (Machado, Mendoza, & Corbellini, 2015). While machine learning models are increasingly popular in disease ecology and epidemiology (e.g., Babayan, Orton, & Streicker, 2018; Han et al., 2016; Hollings et al., 2017), they have not yet been widely adopted.

In part, the limited application of machine learning models in disease ecology is due to the perception of these models as black boxes. Recent advances have helped unveil the black box by clarifying the predictive capabilities of these models (Brdar, Gavrić, Ćulibrk, & Crnojević, 2016; Lundberg & Lee, 2016; Molnar, 2018b). For example, coalitional game theory can be used to untangle how variables contribute to a machine learning model prediction (Štrumbelj & Kononenko, 2014), providing an explanation for why an individual host was predicted to be positive or negative for pathogen exposure.

The conditions of when and where an individual was infected by a pathogen are often unknown (Gilbert et al., 2013). Exposure risk is often quantified by analyzing variables such as age, group size, or landscape context based on the date the individual was sampled, even though the start of the infection was likely to have occurred sometime in the past (Carver et al., 2016; Packer et al., 1999). Further, risk factors of pathogen exposure are often assumed to be fixed in time, even though the ecological context could change between exposure and time of sampling (Goldstein et al., 2017; Munson et al., 2008). The spatio-temporal mismatch between the location and date of infection and the location and date of sampling is well known (Gilbert et al., 2013; Goldstein et al., 2017), yet few pathogen risk models have explicitly considered these limitations.

Here we introduce a multi-algorithm machine learning ensemble pipeline that incorporates recent advances in data science to better predict pathogen exposure risk. Our pipeline is flexible, and many of the post-processing methods are model agnostic in that they can be used to interrogate any regression or classification model. To demonstrate the utility of our approach, we model pathogen exposure risk for two different viruses that infect African lions (Panthera leo): canine distemper virus (CDV) and feline parvovirus (parvovirus). To build our exposure risk models, we leverage age estimates, long-term behavioural observations, and environmental data from the Serengeti Lion Project (SLP) as predictors (hereafter called ‘features’ in line with computer science terminology). Using these data, we develop exposure risk models based on features calibrated to reflect the potential date of exposure (rather than sampling date), and also compare exposure risk factors between the two pathogens. While we focus on pathogens in wildlife populations, our pipeline could be applied in a similar way to human, domestic animal, or livestock exposure data, as well as to any classification or regression problem. We provide code and a fully worked example in a vignette (Text S1).

2. Material and Methods

If exposure data have been collected across multiple years and there are reliable estimates of host age, researchers can take advantage of our calibration steps below. If not, researchers can proceed to 2.2 Pre-processing.
2.1 Data calibration

Plotting prevalence relationships across host age and time is an essential first step in calibrating exposure models. When evaluating year-prevalence curves, spikes in prevalence in certain years may indicate the timing of likely epidemics, particularly when the year-prevalence curves of juveniles are considered (Fig. 1; Packer et al., 1999). Knowledge of pathogen natural history (e.g., chronic vs acute infection) and tests such as chi-squared testing for temporal fluctuations can be used to support these estimates (Fig. 1; Packer et al., 1999). In populations with age estimates more precise than juvenile/adult classes, the age of likely exposure can be calculated based on how old the individual was when the last epidemic occurred. All other features can then be calibrated to the year of exposure (e.g., what was the group size for that individual in the potential year of exposure); some host characteristics, such as sex, do not change with time. If the analysis uses serological evidence, detection of exposure is often based on a cut-off threshold (i.e., above a certain titre an individual is classified as exposed). This is imperfect, as antibody titre tends to diminish over time (Gilbert et al., 2013; Packer et al., 1999); therefore adding ‘time since epidemic’ as a feature can account for imperfect detection. Accounting for imperfect detection can allow for more accurate measures of infection risk in disease surveys (Lachish, Gopalaswamy, Knowles, & Sheldon, 2012).
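The calibration logic above can be sketched in a few lines. This is an illustrative Python analogue (our pipeline itself is implemented in R); the epidemic years are those reported for CDV in section 2.5, but the function and field names are hypothetical:

```python
# Hypothetical sketch: calibrate temporal features to the most recent
# epidemic year preceding each individual's sampling date.
EPIDEMIC_YEARS = [1977, 1981, 1994]  # example epidemic years (CDV, section 2.5)

def calibrate(birth_year, sample_year, epidemic_years=EPIDEMIC_YEARS):
    """Return calibrated temporal features for one individual."""
    past = [y for y in epidemic_years if y <= sample_year]
    if not past:
        return None  # no epidemic before sampling; calibration not possible
    last_epidemic = max(past)
    return {
        "last_epidemic_year": last_epidemic,
        "age_exposed": last_epidemic - birth_year,           # age at likely exposure
        "time_since_epidemic": sample_year - last_epidemic,  # proxy for titre decay
        "age_sampled": sample_year - birth_year,
    }

# A lion born in 1979 and sampled in 1984 was likely exposed in the 1981
# epidemic, at age 2, three years before sampling.
print(calibrate(birth_year=1979, sample_year=1984))
```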
Age-prevalence relationships can also help calibrate possible exposure times for pathogens that are more endemic, or constantly present, in a population (Fig. 1). If prevalence peaks at an early age and is stable over time, this may be an indication that the pathogen is more endemic in the population; the relevant features can be calibrated to reflect that the individual was likely exposed early in life (Fig. 1). Age-specific force of infection models (FOI, or the per capita rate at which susceptible individuals acquire infection) can help support these estimates (i.e., at what age is the FOI highest; Anderson & May, 1985). If prevalence increases in a linear fashion with age and the FOI is equal across age groups (Fig. 1), gaining a reliable estimate of the year of exposure (and therefore calibration, as described below) is difficult.
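One classic form of such a model is the constant-FOI catalytic model, under which expected seroprevalence at age a is 1 − e^(−λa). A minimal Python sketch (not part of our R pipeline; the crude grid-search fit and synthetic data are purely illustrative):

```python
import math

def seroprevalence(age, foi):
    """Expected seroprevalence at a given age under a constant-FOI
    catalytic model: p(a) = 1 - exp(-lambda * a)."""
    return 1.0 - math.exp(-foi * age)

def fit_constant_foi(ages, prevalences, grid_steps=10000):
    """Crude grid-search fit of lambda by least squares over (0, 10]."""
    best_foi, best_err = None, float("inf")
    for k in range(1, grid_steps + 1):
        foi = k / 1000.0
        err = sum((seroprevalence(a, foi) - p) ** 2
                  for a, p in zip(ages, prevalences))
        if err < best_err:
            best_foi, best_err = foi, err
    return best_foi

# Synthetic age-prevalence data generated with lambda = 0.3
ages = [1, 2, 4, 8]
prev = [seroprevalence(a, 0.3) for a in ages]
print(fit_constant_foi(ages, prev))  # -> 0.3
```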
Depending on the interpretation of the age- and year-prevalence curves, temporal features can be estimated, or calibrated, including age exposed, time since last epidemic, last epidemic year, year exposed, and age sampled (Fig. 1).

Fig. 1: Flow chart showing the pre-processing steps in our machine learning pipeline. Each colour represents a separate pathogen. FOI: force of infection. *: indicates optional step.
2.2 Pre-processing

It is important to account for missing data, either by imputation or removal, prior to model construction (Fig. 2). Some machine learning algorithms, such as gradient boosting, bin missing data as a separate node in the decision tree (Friedman, 2002; Fig. S1); however, other algorithms such as SVM are less flexible. In order to compare predictive performance across models, missing data can either be imputed or removed from the dataset. Although providing specific advice on whether to include missing data is outside the scope of this paper (see Nakagawa & Freckleton, 2008), we provide an option if imputation is suitable for the study problem. We integrated the ‘missForest’ (Stekhoven & Bühlmann, 2012) machine-learning imputation routine (using the RF algorithm) into our pipeline, as it has been found to have low error rates with ecological data (Penone et al., 2014).
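The impute-or-remove choice can be illustrated with a deliberately simple Python sketch; mean imputation here is only a stand-in for the far more sophisticated RF-based missForest routine our R pipeline actually uses:

```python
# Minimal stand-in for the impute-vs-remove choice. Rows are lists of
# feature values, with None marking a missing value.
def handle_missing(rows, strategy="impute"):
    if strategy == "remove":
        return [r for r in rows if None not in r]
    # Mean-impute each column (a simple placeholder for missForest).
    ncol = len(rows[0])
    means = []
    for j in range(ncol):
        vals = [r[j] for r in rows if r[j] is not None]
        means.append(sum(vals) / len(vals))
    return [[means[j] if r[j] is None else r[j] for j in range(ncol)]
            for r in rows]

data = [[1.0, 2.0], [None, 4.0], [3.0, None]]
print(handle_missing(data, "remove"))  # -> [[1.0, 2.0]]
print(handle_missing(data, "impute"))  # column means fill the gaps
```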
Fig. 2: Flow chart showing the steps in our machine learning pipeline. Vertical text in bold indicates how the chart relates to sections of the main text and vignette (Text S1, sections 1-3 respectively). TP: true positive, FP: false positive, FN: false negative, TN: true negative. Yellow boxes indicate which data split is being tested in that particular ‘fold’.
Another important step before constructing a model, particularly for relatively small data sets with large numbers of features, is to select features that are relevant and informative for prediction. Reducing the feature set used in models not only increases the efficiency of the algorithms, but also improves the accuracy of many machine learning algorithms (Kohavi & John, 1997; Kursa & Rudnicki, 2010). In our pipeline, we use the ‘Boruta’ (Kursa & Rudnicki, 2010) feature selection algorithm to reduce the feature set to just those features relevant for prediction prior to model building. The Boruta routine has been found to be one of the most powerful approaches for selecting relevant features (Degenhardt, Seifert, & Szymczak, 2017) and does so by measuring the importance of each variable using an RF-based algorithm (Kursa & Rudnicki, 2010).
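The core idea behind Boruta, comparing each feature's importance against randomly permuted 'shadow' copies and keeping only features that beat the best shadow, can be sketched as follows. This Python illustration substitutes absolute correlation for the RF importance measure the real algorithm uses, and omits Boruta's repeated statistical testing:

```python
import random

def abs_corr(xs, ys):
    """Absolute Pearson correlation, used here as a toy importance score."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs) ** 0.5
    vy = sum((y - my) ** 2 for y in ys) ** 0.5
    return abs(cov / (vx * vy)) if vx and vy else 0.0

def boruta_like(features, target, seed=0):
    """Keep features whose score beats the best shuffled 'shadow' score."""
    rng = random.Random(seed)
    shadow_scores = []
    for col in features.values():
        shuffled = col[:]
        rng.shuffle(shuffled)  # shadow copy: same values, random order
        shadow_scores.append(abs_corr(shuffled, target))
    threshold = max(shadow_scores)
    return [name for name, col in features.items()
            if abs_corr(col, target) > threshold]

y = [0, 1, 0, 1, 0, 1, 0, 1]
feats = {"age": [1, 5, 2, 6, 1, 7, 2, 6],     # informative
         "noise": [3, 1, 4, 1, 5, 9, 2, 6]}   # uninformative
print(boruta_like(feats, y))
```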
2.3 Model training

We incorporated an internal repeated 10-fold cross-validation (CV) process to estimate model performance. CV can help prevent overfitting and artificial inflation of accuracy due to use of the same data for training and validation steps (Fig. 2). To run and evaluate each model, our pipeline uses the ‘caret’ (classification and regression training) package in R (Kuhn, 2008). Not only does this package provide a streamlined approach to tuning parameters for a wide variety of models, including machine learning models, it also offers functions that directly enable robust comparisons of model performance. The ‘train’ function uses resampling to evaluate how tuning parameters such as learning rate (see Elith et al., 2008) can impact model performance and ‘chooses’ the optimal model with the highest performance via the confusion matrix (e.g., sensitivity and specificity for classification models). Another advantage of this package is that it can perform classification or regression using 237 different types of models, from generalized linear models (GLMs such as logistic regression) to complex machine learning and Bayesian models, using a standardized approach (see Kuhn, 2008 for a complete list of models).
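The mechanics of k-fold CV model comparison can be illustrated with a hand-rolled sketch. This is illustrative Python, not pipeline code: the two toy classifiers (a feature-mean threshold rule and a majority-class baseline) stand in for the RF, SVM, GBM, and logistic models that caret actually tunes and compares:

```python
# Hand-rolled k-fold cross-validation comparing two toy classifiers.
def k_fold_indices(n, k):
    """Yield (train_idx, test_idx) pairs for k contiguous folds."""
    fold = n // k
    for i in range(k):
        test = list(range(i * fold, (i + 1) * fold if i < k - 1 else n))
        train = [j for j in range(n) if j not in test]
        yield train, test

def threshold_model(train_x, train_y):
    cut = sum(train_x) / len(train_x)  # split at the training-set feature mean
    return lambda x: 1 if x > cut else 0

def majority_model(train_x, train_y):
    maj = 1 if sum(train_y) * 2 >= len(train_y) else 0
    return lambda x: maj

def cv_accuracy(fit, xs, ys, k=5):
    """Held-out accuracy pooled across the k folds."""
    correct = total = 0
    for train, test in k_fold_indices(len(xs), k):
        model = fit([xs[i] for i in train], [ys[i] for i in train])
        correct += sum(model(xs[i]) == ys[i] for i in test)
        total += len(test)
    return correct / total

xs = [1, 2, 3, 4, 10, 11, 12, 13, 1, 12]
ys = [0, 0, 0, 0, 1, 1, 1, 1, 0, 1]
print(cv_accuracy(threshold_model, xs, ys))  # -> 1.0
print(cv_accuracy(majority_model, xs, ys))   # -> 0.1
```

In the full pipeline, AUC from the confusion matrix, not raw accuracy, is the comparison metric.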
In our pipeline, we compare supervised machine learning algorithms (RF, SVM, and GBM) as well as logistic regression. These models are among the most popular and best tested machine learning methods, but all operate in different ways, and this can in turn impact predictive performance as measured by AUC (Marmion et al., 2009; Ogutu, Piepho, & Schulz-Streeck, 2011). In brief, for classification problems, SVMs use vectors and kernel functions that maximize the margin between classes of data on a hyperplane (see Scholkopf & Smola, 2002 for model details). In contrast, RF and GBM are tree-based algorithms that iteratively split data into increasingly pure ‘sets’, with the main difference being that GBM fits trees sequentially, with each new tree helping correct errors from the previous one (Friedman, 2002). For RF, the trees are fit independently with each iteration (see Fig. S1 for a more detailed comparison).
Imbalanced proportions of the outcome of interest are common in disease ecology, where, for example, prevalence of the pathogen is < 50%; imbalanced proportions can influence the predictive performance of the algorithm (He & Garcia, 2009). To address this issue, we included a down-sampling strategy in our pipeline that can be used to compare model performance for datasets with imbalanced data (Fig. 2).
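Down-sampling itself is simple: randomly discard majority-class observations until the classes are balanced. A minimal Python sketch (the data are invented; a 30% prevalence mirrors the parvovirus example in section 2.5):

```python
import random

def down_sample(rows, labels, seed=42):
    """Randomly discard majority-class rows until classes are balanced."""
    pos = [i for i, y in enumerate(labels) if y == 1]
    neg = [i for i, y in enumerate(labels) if y == 0]
    minority, majority = (pos, neg) if len(pos) < len(neg) else (neg, pos)
    rng = random.Random(seed)
    kept = minority + rng.sample(majority, len(minority))
    kept.sort()  # preserve original row order
    return [rows[i] for i in kept], [labels[i] for i in kept]

X = [[i] for i in range(10)]
y = [1, 0, 0, 0, 0, 1, 0, 0, 0, 1]  # 30% seropositive
Xb, yb = down_sample(X, y)
print(sum(yb), len(yb))  # -> 3 6 (now perfectly balanced)
```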
2.4 Interpreting the model

Once the best model is selected, we provide a variety of tool sets to visualize and interrogate model results (Fig. 2) based on the ‘iml’ (interpretable machine learning) R package (Molnar, 2018a). Measures of variable importance are often used to understand which features contributed to the model, in ranked order of importance. However, comparing variable importance across multiple types of models can be challenging, as models often quantify variable importance in different ways. Further, in some cases different models can have the same predictive performance but rely on different features, known as the ‘Rashomon’ effect (Breiman, 2001). Here, to help overcome these limitations, our pipeline presents a model agnostic method called ‘model class reliance’ (Fisher, Rudin, & Dominici, 2018), an extension of Breiman’s (2001) permutation approach. In essence, the algorithm quantifies the expected loss in predictive performance (i.e., how well the model classifies exposed vs non-exposed individuals) for a pair of observations, compared to the full model, in which the particular feature has been switched (see Fisher et al., 2018; Molnar, 2018b for more details). Because this algorithm compares the ratio of error rather than the differences in model error rates, it can be used to compare different models. A feature is not considered important if switching it does not alter model performance. Further, this method is probabilistic and has the advantage that feature permutation automatically takes feature interaction effects into account in the importance calculation (Molnar, 2018b).
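The error-ratio idea behind permutation importance can be sketched as follows. Note this simplified Python illustration permutes a whole feature column rather than performing Fisher et al.'s pairwise switching, and the one-feature threshold 'model' and data are invented:

```python
import random

def model(row):
    """Toy fitted model: uses only feature 0."""
    return 1 if row[0] > 5 else 0

def error_rate(rows, labels):
    return sum(model(r) != y for r, y in zip(rows, labels)) / len(labels)

def permutation_ratio(rows, labels, feature, seed=0):
    """Ratio of permuted-feature error to baseline error.
    A ratio near 1 means the feature is unimportant."""
    rng = random.Random(seed)
    col = [r[feature] for r in rows]
    rng.shuffle(col)
    permuted = [r[:feature] + [v] + r[feature + 1:] for r, v in zip(rows, col)]
    base = error_rate(rows, labels)
    return error_rate(permuted, labels) / max(base, 1e-9)

rows = [[1, 7], [2, 1], [8, 3], [9, 9], [3, 2], [10, 4]]
labels = [0, 0, 1, 1, 1, 1]
# Feature 1 is never used by the model, so its ratio is exactly 1.
print(permutation_ratio(rows, labels, 1))  # -> 1.0
# Permuting the informative feature 0 typically inflates the ratio.
print(permutation_ratio(rows, labels, 0))
```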
Our pipeline also provides the appropriate code to calculate and plot both partial dependence (PD) plots and individual conditional expectation (ICE) plots to examine the marginal effect of a particular feature on the response (Goldstein et al., 2015). PD plots visualize the global (i.e., average model) effect of a feature on the response, whilst ICE plots examine the effect of a feature for each observation, whilst holding all other features constant (Goldstein et al., 2015; Molnar, 2018b). For example, to assess the effect of a feature on the exposure risk of individual ‘A’, the ICE approach would vary the feature across a grid of values. For each feature variant, the selected machine learning model is used to make new predictions of exposure risk for individual A. A line is then fit to the predictions across feature values, and the process is repeated for each individual observation. The PD plot represents the average of the lines of an ICE plot (e.g., the yellow line is the average whereas the black lines are the individual observations in the bottom panel of Fig. 2). Combined, they overcome the limitations of each approach used separately, as PD plots alone obscure when interactions shape a feature-response relationship, whilst ICE plots can be challenging to read without plotting the average effect (Molnar, 2018b). We recommend ICE plots centred on the minimum feature value (cICE) in order to more easily visualize when ICE estimates vary between observations (Molnar, 2018a).
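The ICE-then-average construction can be sketched directly. In this Python illustration the toy risk surface (with an age × rainfall interaction, echoing the CDV result in the abstract) is a stand-in for the selected machine learning model, and the feature names and values are invented:

```python
def model(age, rainfall):
    """Toy exposure-risk surface with an age x rainfall interaction."""
    return (age / (age + 1)) * (1.0 if rainfall < 50 else 0.5)

def ice_curves(observations, grid):
    """One predicted-risk curve per observation, varying 'age' only and
    holding each observation's other features fixed."""
    return [[model(a, obs["rainfall"]) for a in grid] for obs in observations]

def pd_curve(observations, grid):
    """PD curve = pointwise average of the ICE curves."""
    curves = ice_curves(observations, grid)
    n = len(curves)
    return [sum(c[i] for c in curves) / n for i in range(len(grid))]

obs = [{"rainfall": 30}, {"rainfall": 80}]
grid = [0, 1, 4]
print(ice_curves(obs, grid))  # two individual curves (ICE)
print(pd_curve(obs, grid))    # their average (PD)
```

Averaging here hides the fact that the low-rainfall individual's risk rises twice as steeply, which is exactly why the text recommends plotting ICE curves alongside the PD average.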
To further interrogate interactions in each model, our pipeline provides the option to calculate Friedman’s H-statistic (Friedman & Popescu, 2008) for all features (and combinations thereof) in the model. The strongest interactions can then be visualized using 3D PD plots or heatmaps (Fig. 2). In brief, Friedman’s H-statistic is a flexible way to quantify feature interaction strength using partial dependence decomposition and represents the portion of variance explained by the interaction (see Friedman & Popescu, 2008 for details). Whilst computationally expensive, this method is also model agnostic and can be upscaled to quantify higher order interactions, i.e., three- or four-way interactions between features (Molnar, 2018b).
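For a pair of features, the two-way H² compares the centred two-way partial dependence with the sum of the centred one-way partial dependences. A brute-force Python sketch (the toy models and four-row sample are invented; real use would evaluate a fitted model, and this naive version is quadratic in the sample size):

```python
def h_statistic(rows, f, j, k):
    """Friedman's two-way H^2 for features j and k, estimated on the
    sample itself: sum[(PD_jk - PD_j - PD_k)^2] / sum[PD_jk^2],
    with each PD centred to mean zero."""
    n = len(rows)
    def pd(fixed):
        """Centred partial dependence over the listed feature indices."""
        vals = []
        for r in rows:              # grid point = this row's fixed-feature values
            total = 0.0
            for b in rows:          # marginalise over the remaining features
                z = b[:]
                for idx in fixed:
                    z[idx] = r[idx]
                total += f(z)
            vals.append(total / n)
        mean = sum(vals) / n
        return [v - mean for v in vals]
    pd_jk, pd_j, pd_k = pd([j, k]), pd([j]), pd([k])
    num = sum((pd_jk[i] - pd_j[i] - pd_k[i]) ** 2 for i in range(n))
    den = sum(v ** 2 for v in pd_jk)
    return num / den if den else 0.0

rows = [[1.0, 2.0, 0.0], [2.0, 1.0, 1.0], [3.0, 3.0, 0.5], [0.5, 2.5, 2.0]]
additive = lambda x: x[0] + x[1]   # no interaction between features 0 and 1
interact = lambda x: x[0] * x[1]   # pure interaction
print(h_statistic(rows, additive, 0, 1))  # ~0: additive model
print(h_statistic(rows, interact, 0, 1))  # > 0: interaction present
```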
Finally, our pipeline provides functions to calculate Shapley values (Shapley, 1953) to assess how individual features affect the prediction of single observations (i.e., an individual lion). This approach calculates Shapley values from the final model that assign ‘payouts’ to ‘players’ depending on their contribution towards the prediction (Brdar et al., 2016; Molnar, 2018b; Štrumbelj & Kononenko, 2014). The players cooperate with each other and obtain a certain ‘gain’ from that cooperation. In this case, the ‘payout’ is the model prediction, the ‘players’ are the feature values of the observation (e.g., host sex), and the ‘gain’ is the model prediction for this observation (e.g., was the host exposed or not) minus the average prediction over all observations (on average, what is the probability of a host being exposed) (Molnar, 2018b). More specifically, Shapley values, defined by a value function Va, compute feature effects (ϕij) for any predictive model f:
\[
\phi_{ij}(Va) \;=\; \sum_{S \subseteq \{x_{i1},\dots,x_{ip}\}\setminus\{x_{ij}\}} \frac{|S|!\,(p-|S|-1)!}{p!}\,\bigl(Va(S \cup \{x_{ij}\}) - Va(S)\bigr)
\]
where S is a subset of the features (or players), x_i is the vector of feature values for observation i, and p is the number of features. Va_{x_i}(S) is the prediction for the feature values in S, marginalized over the features not in S:

\[
Va_{x_i}(S) \;=\; \int f(x_{i1},\dots,x_{ip})\, dP_{X_{i\cdot}\notin S} \;-\; E_X\bigl(f(X)\bigr)
\]
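For small p, the formula above can be implemented directly by brute force. In this Python sketch, Va is approximated by averaging the model over a background sample for features outside S; the toy additive model and background data are invented. The exact computation is exponential in p, which is why practical implementations approximate it:

```python
from itertools import combinations
from math import factorial

def shapley_values(f, x, background):
    """Exact Shapley values: for every subset S of the other features,
    weight the marginal contribution of adding feature j."""
    p = len(x)
    def value(S):
        """Va(S): average prediction with the features in S fixed to x
        and the rest drawn from the background sample."""
        total = 0.0
        for b in background:
            z = [x[i] if i in S else b[i] for i in range(p)]
            total += f(z)
        return total / len(background)
    base = value(frozenset())  # E[f(X)] over the background
    phis = []
    for j in range(p):
        others = [i for i in range(p) if i != j]
        phi = 0.0
        for size in range(p):
            for S in combinations(others, size):
                w = factorial(size) * factorial(p - size - 1) / factorial(p)
                phi += w * (value(frozenset(S) | {j}) - value(frozenset(S)))
        phis.append(phi)
    return phis, base

f = lambda z: 2 * z[0] + z[1]          # toy additive model
background = [[0.0, 0.0], [2.0, 2.0]]  # E[f] = (0 + 6) / 2 = 3
phis, base = shapley_values(f, [4.0, 1.0], background)
print(phis, base)  # contributions sum to f(x) - E[f] = 9 - 3 = 6
```

The 'efficiency' property of Shapley values guarantees that the per-feature contributions always sum to this observation's prediction minus the average prediction, which is exactly the 'gain' described in the text.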
2.5 Modelling lion exposure risk

Blood samples were collected from 300 individual lions of known age (estimated to the month) in the Serengeti Ecosystem, Tanzania, between 1984 and 1994. Of these individuals, 40% were seropositive for CDV and 30% for parvovirus (see Hofmann-Lehmann et al., 1996; Roelke-Parker et al., 1996 for method details). Both viruses can infect multiple species, yet CDV is directly transmitted via aerosol, whereas parvovirus can be transmitted either directly or through environmental contamination. CDV and parvovirus are considered to have epidemic cycles in the Serengeti lion population (Hofmann-Lehmann et al., 1996; Packer et al., 1999). Chi-square tests of year-prevalence relationships supported three epidemic years for both viruses: ~1977, 1981 and 1994 for CDV, and 1977, 1985 and 1992 for parvovirus (Packer et al., 1999). Subsequent Bayesian state-space analysis of these data further confirmed these years as epidemic years for CDV and parvovirus (Behdenna et al., in press; Viana et al., 2015).
We selected 17 features and calibrated them where appropriate (as detailed in 2.1 Data 303
calibration). See Text S2 for feature selection details. Each model, including logistic regression, 304
was performed following the steps outlined in our pipeline. We compared the predictive 305
performance of the models using calibrated and uncalibrated feature sets (hereafter calibrated or 306
uncalibrated models) to assess how differences in calibration could change the exposure risk 307
predictions. Uncalibrated feature sets were calculated based on the date an individual was 308
sampled, rather than going through the process outlined in 2.1 Data calibration. 309
310
Results 311
312
Machine learning models with calibrated feature sets have higher predictive
performance 314
315
For both CDV and parvovirus, machine learning models had higher predictive performance 316
(higher AUC) compared to logistic regression models (Table 1). For example, predictions of 317
parvovirus were only just better than random using the logistic regression model (AUC = 0.54),
whereas the GBM model correctly predicted exposure 73% of the time (AUC = 72.98).
Using our calibration approach further improved the overall predictive performance for each
pathogen by increasing the sensitivity of the models (i.e., they were better able to correctly identify
positives); however, there was a trade-off with reduced specificity. For example, our calibrated
CDV model had a 7% increase in AUC, with a 23% increase in sensitivity but an 18% decrease in
specificity, compared to the uncalibrated model (Table 1). The features considered relevant in
each model were relatively consistent (Table 1). However, there were some important exceptions:
for example, sex was relevant only for parvovirus exposure in the calibrated model, whereas
rainfall was relevant for predicting parvovirus exposure in the uncalibrated model but not in the
calibrated model.
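The sensitivity-specificity trade-off described above depends on where the classification threshold is set. As a minimal illustration (Python, with simulated exposure scores rather than the lion data), AUC can be computed from the rank (Mann-Whitney) statistic, and sensitivity/specificity recomputed at different thresholds:

```python
import numpy as np

def auc(y, score):
    """AUC via the rank (Mann-Whitney U) statistic."""
    order = np.argsort(score)
    ranks = np.empty(len(score))
    ranks[order] = np.arange(1, len(score) + 1)
    n_pos, n_neg = (y == 1).sum(), (y == 0).sum()
    return (ranks[y == 1].sum() - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)

def sens_spec(y, score, threshold=0.5):
    """Sensitivity and specificity at a given classification threshold."""
    pred = (score >= threshold).astype(int)
    sens = ((pred == 1) & (y == 1)).sum() / (y == 1).sum()
    spec = ((pred == 0) & (y == 0)).sum() / (y == 0).sum()
    return sens, spec

# simulated exposure scores: positives tend to score higher
rng = np.random.default_rng(1)
y = rng.integers(0, 2, 500)
score = np.clip(0.35 * y + rng.normal(0.3, 0.15, 500), 0, 1)
print("AUC:", round(auc(y, score), 3))
for t in (0.4, 0.6):  # lowering the threshold raises sensitivity, lowers specificity
    print(t, sens_spec(y, score, t))
```

Lowering the threshold trades specificity for sensitivity, the same pattern seen between the calibrated and uncalibrated models in Table 1.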
bioRxiv preprint doi: https://doi.org/10.1101/569012; this version posted March 6, 2019. The copyright holder for this preprint (which was not certified by peer review) is the author/funder. It is made available under a CC-BY-NC-ND 4.0 International license.
Table 1: Model performance and relevant features for each model. If a feature was not relevant in
any model it was excluded using the Boruta algorithm.

             Model                AUC     Spec    Sens    Age
CDV          Log regression       68.80   69.74   66.51   NA
             Uncalibrated (RF)    78.04   83.74   69.49   NA
             Calibrated (RF)      85.08   65.63   92.80   *
PARVOVIRUS   Log regression       54.21   62.25   53.56   NA
             Uncalibrated (GBM)   72.98   89.08   35.54   NA
             Calibrated (RF)      76.47   83.08   62.90   *

Tick indicates that a feature (Sex, Age, Epidemic year, FIVPle, Territory size, Overlap, Group size,
Habitat quality or Rainfall) was relevant in the model, with a variable importance score > 1.1 (i.e.,
the permuted error is 1.1 times the baseline error). Spec: specificity; Sens: sensitivity. RF: a Random
Forest model had the best predictive performance. GBM: a Gradient Boosting Model had the best
performance. Features not included in the table were not the most important in any model. *: age
exposed rather than age sampled. See Fig. S3 for Boruta results and Table S2 for GBM and SVM
calibrated models.
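The permutation-based importance score referenced in the table footnote can be sketched as an error ratio: shuffle one feature, remeasure the error, and divide by the baseline. A minimal Python illustration (the model, data, and metric here are hypothetical, not the fitted RF/GBM models):

```python
import numpy as np

def permutation_importance_ratio(model, X, y, metric, n_repeats=10, seed=0):
    """Importance of each feature as the ratio of permuted to baseline error.

    A ratio > 1 means shuffling the feature (breaking its link with the
    response) degrades predictions; a ratio near 1 means it is irrelevant.
    """
    rng = np.random.default_rng(seed)
    baseline = metric(y, model(X))
    ratios = np.zeros(X.shape[1])
    for j in range(X.shape[1]):
        errs = []
        for _ in range(n_repeats):
            Xp = X.copy()
            rng.shuffle(Xp[:, j])             # permute one column in place
            errs.append(metric(y, model(Xp)))
        ratios[j] = np.mean(errs) / baseline
    return ratios

# hypothetical model that uses only feature 0
rng = np.random.default_rng(2)
X = rng.normal(size=(500, 2))
y = X[:, 0] + rng.normal(scale=0.3, size=500)
model = lambda Z: Z[:, 0]
mse = lambda y_true, pred: np.mean((y_true - pred) ** 2)
print(permutation_importance_ratio(model, X, y, mse))
```

Under this construction the used feature scores well above the 1.1 cut-off, while the unused feature scores exactly 1.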
335
Model calibration alters feature-risk relationships 336
337
Age and rainfall were the most important features predicting CDV exposure, but both features 338
were relatively less important in the calibrated models (Fig. 3). Even though the features 339
associated with exposure risk in each model were broadly similar for both pathogens, the 340
relationships between each feature and exposure risk varied. Partial dependency plots showed
that the risk of CDV increased relatively linearly across age classes in the uncalibrated model
(Fig. 3b), whereas in the calibrated model exposure risk was much more constant across age
classes, with an increase in risk in individuals between 1-2 y.o. (Fig. 3f). Rainfall also showed
different relationships in each model: in the uncalibrated model, exposure risk dropped when
average monthly rainfall exceeded 40 mm (Fig. 3c), whereas the decline in CDV risk associated
with rainfall was much shallower in the calibrated model (Fig. 3e).
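Partial dependence curves such as those in Fig. 3 simply average model predictions while one feature is forced to each value on a grid. A minimal Python sketch with a hypothetical risk model (the model and features below are illustrative assumptions, not the fitted lion models):

```python
import numpy as np

def partial_dependence(model, X, j, grid):
    """Average prediction with feature j forced to each value in `grid`."""
    out = []
    for v in grid:
        Xv = X.copy()
        Xv[:, j] = v                 # force feature j for every observation
        out.append(model(Xv).mean()) # marginalize over the other features
    return np.array(out)

# hypothetical risk model: exposure probability rises with age (feature 0)
# and is shifted by a second feature (e.g., territory overlap, feature 1)
risk = lambda Z: 1 / (1 + np.exp(-(Z[:, 0] - 2 + 0.5 * Z[:, 1])))

rng = np.random.default_rng(3)
X = np.column_stack([rng.uniform(0, 8, 300), rng.normal(size=300)])
ages = np.linspace(0, 8, 9)
pd_curve = partial_dependence(risk, X, 0, ages)
print(np.round(pd_curve, 3))  # risk increases monotonically with forced age
```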
349
Fig. 3: Plots showing the differences in model predictions and the features that contribute to 350
CDV exposure risk in the Serengeti lions for the uncalibrated models (a-c) compared to the 351
calibrated models (d-f). Age sampled followed by rainfall are the most important features 352
associated with CDV exposure risk in the uncalibrated models (a). In contrast, rainfall followed 353
by age exposed were the most important features in the calibrated model (d). However, cICE 354
plots show that the relationship with exposure risk differs substantially (b & c: uncalibrated 355
model, e & f: calibrated model). Feature name colours reflect feature type (orange = individual, 356
blue = pride-level, red = environmental) and * indicates that the variable was averaged across the 357
pride’s territory. See Text S2 for further details including sample size. Tick marks on the x-axes 358
(b, c, e & f) show the distribution of data. Yellow/red line indicates the average value across all 359
individuals. The y-axes reflect probability of being positive for CDV. 360
361
Similar to CDV, age sampled followed by rainfall were the most important features associated 362
with parvovirus exposure risk in the uncalibrated models (Fig. S4a). Parvovirus exposure risk 363
increased slightly across age classes in the uncalibrated models; however, in the calibrated
models exposure risk increased rapidly at early ages (0-1) but then was relatively constant
across ages > 3 (Fig. S4b). The signature of rainfall on parvovirus risk in the uncalibrated model
was remarkably similar to that of CDV, with a large drop in risk when monthly rainfall exceeded
40 mm (Fig. S4c). However, rainfall was much less important in the calibrated model (Fig.
S4d). Strikingly, epidemic year was important in the calibrated model, with exposure risk much
higher for animals likely exposed in the 1992 epidemic (Fig. S4f).
371
Interactions 372
373
We further interrogated the calibrated models to visualize how interactions between features 374
could be important for exposure risk of both pathogens. We focused on interactions with
epidemic year, as we were interested in whether exposure risk varied with each outbreak (see
Fig. S5 for a summary of all interactions detected). For CDV, the strongest interaction with 377
epidemic year was age exposed (Fig. 4a). Even though exposure risk was predicted to be 378
reasonably similar in each CDV outbreak (Fig. S6), exposure risk was higher for lions 2-5 y.o. in 379
the 1994 epidemic compared to the 1981 epidemic (Fig. 4b). For parvovirus, territory overlap 380
had the strongest interaction with epidemic year (Fig. 4c) with individuals in prides with low 381
overlap having an increased risk of parvovirus during the 1992 epidemic only. There was also an 382
interaction with epidemic year and age exposed with lions aged 10-12 y.o. more likely to be 383
exposed to parvovirus during the 1992 epidemic (Fig. 4d). See Text S3 and Fig. S7 for a
description of how individual features affect the predicted exposure risk of individual lions.
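Interactions of this kind can be visualized by computing a separate partial dependence curve for each level of a categorical feature, analogous to the per-epidemic curves in Fig. 4. A Python sketch (the model and the 'epidemic year' indicator are hypothetical, not the authors' R code):

```python
import numpy as np

def grouped_partial_dependence(model, X, j_cont, grid, j_cat):
    """One partial-dependence curve of feature j_cont per level of j_cat."""
    curves = {}
    for level in np.unique(X[:, j_cat]):
        out = []
        for v in grid:
            Xv = X.copy()
            Xv[:, j_cont] = v      # force the continuous feature (e.g., age)
            Xv[:, j_cat] = level   # condition on one epidemic 'year'
            out.append(model(Xv).mean())
        curves[level] = np.array(out)
    return curves

# hypothetical model with an age x epidemic-year interaction:
# risk rises faster with age in 'year 1' than in 'year 0'
risk = lambda Z: 1 / (1 + np.exp(-((1 + Z[:, 1]) * Z[:, 0] - 3)))

rng = np.random.default_rng(4)
X = np.column_stack([rng.uniform(0, 4, 300),
                     rng.integers(0, 2, 300).astype(float)])
curves = grouped_partial_dependence(risk, X, 0, np.linspace(0, 4, 9), 1)
for level, curve in curves.items():
    print(level, np.round(curve, 2))
```

Diverging curves across levels, as in Fig. 4b and 4d, indicate that the feature's effect on risk depends on the epidemic.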
386
Fig. 4: Interactions between epidemic year and the other features in shaping exposure risk of 387
CDV (a-b) and parvovirus (c-d) in the Serengeti lions. Differently coloured partial dependence
curves (b & d) represent different epidemics. The y-axis (ŷ) is the relative risk of being
exposed (with all other feature combinations marginalized). See Fig. S5 for the interactions
detected for other features. 391
Discussion 392
393
Our machine learning pipeline offers a powerful approach that can enable researchers to generate 394
robust exposure risk models and interrogate them in new ways. We found that models in which
host and environmental features were calibrated based on likely exposure year had higher
predictive power than uncalibrated models (i.e., models composed of features based on sample
date). The calibrated models not only provided better predictive power but also more robust 398
insights into pathogen dynamics in a wild population that can aid future surveillance. 399
Furthermore, use of this pipeline is not restricted to research on pathogen exposure, but also 400
could be employed with minimal change to model other ecological systems. 401
402
Robust insights into exposure risk for CDV and parvovirus
404
The machine learning algorithms employed in our pipeline had higher predictive performance 405
compared to the logistic regression models because they were able to quantify non-linear 406
relationships in our data. For both pathogens, non-linear relationships between host and 407
environmental features were common and provided new perspectives of pathogen dynamics in 408
this lion population. For example, in our calibrated model of parvovirus, risk saturated once
individuals were > 3 y.o. but decreased after age 7 (Fig. S4d-e). Puppies are thought to be the highest
risk category for canine parvovirus (Pollock & Coyne, 1993) and, based on our calibrated 411
models, this may be true for feline parvovirus in lions as well. Even though this pattern is 412
relatively intuitive, neither the logistic regression (Table 1), the uncalibrated model (Fig. S4a-b), 413
or age-prevalence curves (Fountain-Jones et al., in press) captured this pattern, highlighting the 414
need for these types of models to quantify exposure risk. 415
416
Calibrating features provided not only a more nuanced interpretation of pathogen dynamics but 417
also higher predictive performance. The greater predictive performance could be because we 418
removed individuals that were not on the landscape during the inter-epidemic years, and thus 419
were negative for the pathogen not due to a behavioral or landscape trait, but due to the cyclic
nature of epidemics. As exposure risk changed over time, including ‘epidemic year’ in the 421
models may have been responsible for the greater predictive power of the calibrated models. 422
Further, even though it was not relevant in the best performing CDV and parvovirus models,
including 'time since exposure' as a feature provides support that the observed relationships
were unlikely to be due to differences in detectability caused by titre decline over time (Packer et
al., 1999).
427
Another advantage of the machine learning methods is the ability to model the non-linear 428
interactions shaping exposure risk in the lions. In our case we focused on how host and 429
environmental characteristics interacted with epidemic year to better understand the temporal 430
dimension of exposure risk: an important question rarely tested in wildlife populations (Gilbert et 431
al., 2013). For parvovirus, in particular, the risk associated with lion age varied between 432
epidemics with the 1992 epidemic standing out with increased risk in older as well as younger 433
individuals (Fig. 4). For CDV in contrast, there were remarkably similar risk factors for the 1981 434
and 1994 epidemics, except for an increase in risk in juveniles in 1994 (Fig. 4b). However, there
was no known increase in mortality in 1981 as there was in 1994, further supporting the idea
that co-infection with Babesia was the principal agent driving mortality (Munson et al., 2008).
When and why exposure risk is stable or dynamic across epidemic years is an open question in 438
disease ecology, and analysis pipelines such as ours can help address this gap. 439
440
Further, calculating Shapley values from predictive models can unpack, in an intuitive way, what
each model means in terms of exposure risk, which can inform practitioners, guide surveillance,
and help forecast future risk. For example, when rainfall is below average, individuals
from prides with relatively high overlap are more at risk of being exposed to CDV, perhaps due
to an increased chance of congregating at waterholes. The sex and age of individuals are less
important when choosing which individuals to test or vaccinate.
447
Caveats and future directions 448
449
Even though serological or qPCR evidence is often used (including here) as evidence of past or 450
current infection, there are numerous limitations associated with this type of data that need to be 451
considered (Gilbert et al., 2013; Lachish et al., 2012). Further, for pathogens that cause high
mortality, the individuals sampled represent survivors of the epidemic, and this may bias
exposure risk models for wildlife populations that are infrequently monitored. For example, if 454
males experienced high mortality in a particular epidemic compared to females, post-epidemic
sampling of this population may incorrectly infer that females were more likely to be exposed.
How to incorporate biases such as these into exposure risk models could be a fruitful future 457
direction for research. 458
459
A more general limitation of this approach is that uncertainty in model predictions for the 460
machine learning algorithms we used is not well characterized. Bayesian additive regression 461
trees (BART) may be a promising technique that could fill this gap, but this technique comes at a 462
computational cost that may be prohibitive for larger datasets (Chipman, George, & McCulloch, 463
2010). Furthermore, none of the methods we employed in our pipeline explicitly accounts for
spatial relationships, which could be a limitation for studies using data across a larger
geographic area, where spatial autocorrelation can be an important problem.
467
Lastly, this analysis pipeline could be used for a variety of ecological applications, including both
classification and regression problems with continuous or categorical response variables. For
example, our pipeline could easily be incorporated into species distribution models. Moreover, as 470
the R package ‘caret’ is highly flexible and can perform and compare a large number of 471
algorithms with minimal change to the R syntax, the end user could easily test any model of their 472
choosing in a similar way (e.g. generalized additive models or discriminant analysis). Similarly, 473
the code we provide from the R package ‘iml’ is model agnostic and can provide an increase in 474
the interpretability for many regression or classification models (Molnar, 2018a). 475
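The compare-many-algorithms pattern that 'caret' provides in R can also be sketched in Python with scikit-learn, swapping algorithms with minimal syntax changes (a sketch on synthetic data, not the lion dataset or the authors' pipeline):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC

# synthetic stand-in for a binary exposure dataset
X, y = make_classification(n_samples=400, n_features=10, n_informative=4,
                           random_state=0)

# the same fit/score call works for every algorithm
models = {
    "logistic": LogisticRegression(max_iter=1000),
    "rf": RandomForestClassifier(n_estimators=200, random_state=0),
    "gbm": GradientBoostingClassifier(random_state=0),
    "svm": SVC(),  # roc_auc scoring uses its decision_function
}
scores = {name: cross_val_score(m, X, y, cv=5, scoring="roc_auc").mean()
          for name, m in models.items()}
for name, s in sorted(scores.items(), key=lambda kv: -kv[1]):
    print(f"{name}: cross-validated AUC = {s:.3f}")
```

As with 'caret', adding a candidate algorithm only requires adding one entry to the dictionary; the cross-validation and scoring code is unchanged.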
Conclusions 476
477
Our analysis pipeline offers a flexible, unified approach that not only improves understanding of
pathogen exposure risk but could also be applied to any system where non-linear responses and
interactions are important in shaping a response variable. This study has highlighted the value of
using advances in data science coupled with feature calibration to increase our understanding of 481
how host and landscape characteristics shape pathogen risk in populations, and how exposure 482
risk could change over time. Machine learning approaches are rapidly evolving and future 483
developments coupled with advances in pathogen detection are likely to provide even more 484
resolution on the drivers of pathogen exposure risk. 485
486
Acknowledgements 487
N.M.F-J. and M.E.C. were funded by the National Science Foundation (DEB-1413925), and M.E.C.
was funded by the National Science Foundation (DEB-1654609) and CVM Research Office UMN
Ag Experiment Station General Ag Research Funds. 490
491
References 492
493
Altizer, S., Nunn, C. L., Thrall, P. H., Gittleman, J. L., Antonovics, J., Cunningham, A. A., …
Pulliam, J. R. C. (2003). Social organization and parasite risk in mammals: Integrating
theory and empirical studies. Annual Review of Ecology, Evolution, and Systematics, 34(1), 496
517–547. doi:10.1146/annurev.ecolsys.34.030102.151725 497
Anderson, R. M., & May, R. M. (1985). Age-related changes in the rate of disease transmission: 498
implications for the design of vaccination programmes. The Journal of Hygiene, 94(3), 499
365–436. Retrieved from http://www.ncbi.nlm.nih.gov/pubmed/4008922 500
Babayan, S. A., Orton, R. J., & Streicker, D. G. (2018). Predicting reservoir hosts and arthropod 501
vectors from evolutionary signatures in RNA virus genomes. Science, 362(6414), 577–580. 502
doi:10.1126/science.aap9072 503
Behdenna, A., Lembo, T., Cleaveland, S., Halliday, J. E. B., Packer, C., Lankester, F., … Viana, 504
M. (in press). Transmission ecology of canine parvovirus in a multi-host, multi-pathogen 505
system. Proc R Soc B. 506
Brdar, S., Gavrić, K., Ćulibrk, D., & Crnojević, V. (2016). Unveiling spatial epidemiology of 507
HIV with mobile phone data. Scientific Reports, 6(1), 19342. doi:10.1038/srep19342 508
Breiman, L. (2001). Random forests. Machine Learning, 45(1), 5–32. 509
doi:10.1023/A:1010933404324 510
Carver, S., Bevins, S. N., Lappin, M. R., Boydston, E. E., Lyren, L. M., Alldredge, M., … 511
VandeWoude, S. (2016). Pathogen exposure varies widely among sympatric populations of 512
wild and domestic felids across the United States. Ecological Applications, 26(2), 367–381. 513
Chipman, H. A., George, E. I., & McCulloch, R. E. (2010). BART: Bayesian additive regression 514
trees. The Annals of Applied Statistics, 4(1), 266–298. doi:10.1214/09-AOAS285 515
Denison, D. G. T., & Holmes, C. C. (2001). Bayesian partitioning for estimating disease risk. 516
Biometrics, 57(1), 143–149. doi:10.1111/j.0006-341X.2001.00143.x 517
Diez-Roux, A. V. (1998). Bringing context back into epidemiology: variables and fallacies in 518
multilevel analysis. American Journal of Public Health, 88(2), 216–22. 519
doi:10.2105/AJPH.88.2.216 520
Elith, J., Leathwick, J. R., & Hastie, T. (2008). A working guide to boosted regression trees. 521
Journal of Animal Ecology, 77(4), 802–813. doi:10.1111/j.1365-2656.2008.01390.x 522
Ezenwa, V. O. (2004). Host social behavior and parasitic infection: a multifactorial approach. 523
Behavioral Ecology, 15(3), 446–454. doi:10.1093/beheco/arh028 524
Fisher, A., Rudin, C., & Dominici, F. (2018). Model class reliance: Variable importance 525
measures for any machine learning model class, from the "Rashomon" 526
perspective. Retrieved from http://arxiv.org/abs/1801.01489
Fountain-Jones, N. M., Packer, C., Jacquot, M., Blanchet, G., Terio, K., & Craft, M. E. (in press).
Endemic infection can shape exposure to novel pathogens: Pathogen co-occurrence 529
networks in the Serengeti lions. Ecology Letters. 530
Friedman, J. H. (2002). Stochastic gradient boosting. Computational Statistics & Data Analysis, 531
38(4), 367–378. doi:10.1016/S0167-9473(01)00065-2 532
Friedman, J. H., & Popescu, B. E. (2008). Predictive learning via rule ensembles. The Annals of 533
Applied Statistics, 2(3), 916–954. doi:10.1214/07-AOAS148 534
Gilbert, A. T., Fooks, A. R., Hayman, D. T. S., Horton, D. L., Müller, T., Plowright, R., … 535
Rupprecht, C. E. (2013). Deciphering serology to understand the ecology of infectious 536
diseases in wildlife. EcoHealth, 10(3), 298–313. doi:10.1007/s10393-013-0856-0 537
Goldstein, A., Kapelner, A., Bleich, J., & Pitkin, E. (2015). Peeking inside the black box: 538
Visualizing statistical learning with plots of individual conditional expectation. Journal of 539
Computational and Graphical Statistics, 24(1), 44–65. doi:10.1080/10618600.2014.907095 540
Goldstein, E., Pitzer, V. E., O’Hagan, J. J., & Lipsitch, M. (2017). Temporally varying relative 541
risks for infectious diseases. Epidemiology. NIH Public Access. 542
doi:10.1097/EDE.0000000000000571 543
Han, B. A., Schmidt, J. P., Alexander, L. W., Bowden, S. E., Hayman, D. T. S., & Drake, J. M. 544
(2016). Undiscovered bat hosts of Filoviruses. PLOS Neglected Tropical Diseases, 10(7), 545
e0004815. doi:10.1371/journal.pntd.0004815 546
Ho, Y. C., & Pepyne, D. L. (2002). Simple explanation of the no-free-lunch theorem and its 547
implications. Journal of Optimization Theory and Applications, 115(3), 549–570. 548
doi:10.1023/A:1021251113462 549
Hofmann-Lehmann, R., Fehr, D., Grob, M., Elgizoli, M., Packer, C., Martenson, J. S., … Lutz, 550
H. (1996). Prevalence of antibodies to feline parvovirus, calicivirus, herpesvirus, 551
coronavirus, and immunodeficiency virus and of feline leukemia virus antigen and the 552
interrelationship of these viral infections in free-ranging lions in East Africa. Clinical and 553
Vaccine Immunology, 3(5), 554–562. Retrieved from 554
https://www.ncbi.nlm.nih.gov/pmc/articles/PMC170405/pdf/030554.pdf 555
Hollings, T., Robinson, A., van Andel, M., Jewell, C., & Burgman, M. (2017). Species 556
distribution models: A comparison of statistical approaches for livestock and disease 557
epidemics. PLOS ONE, 12(8), e0183626. doi:10.1371/journal.pone.0183626 558
Kohavi, R., & John, G. H. (1997). Wrappers for feature subset selection. Artificial Intelligence, 559
97(1–2), 273–324. doi:10.1016/S0004-3702(97)00043-X 560
Kuhn, M. (2008). Building Predictive Models in R Using the caret Package. Journal of 561
Statistical Software, 28(5), 1–26. doi:10.18637/jss.v028.i05 562
Kursa, M. B., & Rudnicki, W. R. (2010). Feature selection with the Boruta Package. Journal of 563
Statistical Software, 36(11), 1–13. doi:10.18637/jss.v036.i11 564
Lachish, S., Gopalaswamy, A. M., Knowles, S. C. L., & Sheldon, B. C. (2012). Site-occupancy 565
modelling as a novel framework for assessing test sensitivity and estimating wildlife disease 566
prevalence from imperfect diagnostic tests. Methods in Ecology and Evolution, 3(2), 339–567
348. doi:10.1111/j.2041-210X.2011.00156.x 568
Lundberg, S., & Lee, S.-I. (2016). An unexpected unity among methods for interpreting model 569
predictions. In NIPS 2016 Workshop on Interpretable Machine Learning in Complex 570
Systems. Retrieved from http://arxiv.org/abs/1611.07478 571
Machado, G., Mendoza, M. R., & Corbellini, L. G. (2015). What variables are important in 572
predicting bovine viral diarrhea virus? A random forest approach. Veterinary Research, 573
46(1), 85. doi:10.1186/s13567-015-0219-7 574
Marmion, M., Parviainen, M., Luoto, M., Heikkinen, R. K., & Thuiller, W. (2009). Evaluation of 575
consensus methods in predictive species distribution modelling. Diversity and Distributions, 576
15(1), 59–69. doi:10.1111/j.1472-4642.2008.00491.x 577
Molnar, C. (2018a). iml: An R package for Interpretable Machine Learning. Journal of Open 578
Source Software, 3(26), 786. doi:10.21105/joss.00786 579
Molnar, C. (2018b). Interpretable Machine Learning.
Munson, L., Terio, K. A., Kock, R., Mlengeya, T., Roelke, M. E., Dubovi, E., … Packer, C.
(2008). Climate extremes promote fatal co-infections during canine distemper epidemics in
African lions. PLoS ONE, 3(6), e2545. doi:10.1371/journal.pone.0002545
Nakagawa, S., & Freckleton, R. P. (2008). Missing inaction: the dangers of ignoring missing
data. Trends in Ecology & Evolution, 23(11), 592–596. doi:10.1016/j.tree.2008.06.014
Nunn, C. L., Jordán, F., McCabe, C. M., Verdolin, J. L., & Fewell, J. H. (2015). Infectious
disease and group size: more than just a numbers game. Philosophical Transactions of the
Royal Society of London B: Biological Sciences, 370(1669). Retrieved from
http://rstb.royalsocietypublishing.org/content/370/1669/20140111.long
Ogutu, J. O., Piepho, H.-P., & Schulz-Streeck, T. (2011). A comparison of random forests,
boosting and support vector machines for genomic selection. BMC Proceedings, 5(Suppl 3),
S11. doi:10.1186/1753-6561-5-S3-S11
Packer, C., Altizer, S., Appel, M., Brown, E., Martenson, J., O’Brien, S. J., … Lutz, H. (1999).
Viruses of the Serengeti: patterns of infection and mortality in African lions. Journal of
Animal Ecology, 68(6), 1161–1178.
Penone, C., Davidson, A. D., Shoemaker, K. T., Di Marco, M., Rondinini, C., Brooks, T. M., …
Costa, G. C. (2014). Imputation of missing data in life-history trait datasets: which approach
performs the best? Methods in Ecology and Evolution, 5(9), 961–970. doi:10.1111/2041-210X.12232
Pollock, R. V., & Coyne, M. J. (1993). Canine parvovirus. The Veterinary Clinics of North
America. Small Animal Practice, 23(3), 555–568. Retrieved from
http://www.ncbi.nlm.nih.gov/pubmed/8389070
Roelke-Parker, M. E., Munson, L., Packer, C., Kock, R., Cleaveland, S., Carpenter, M., …
Appel, M. J. (1996). A canine distemper virus epidemic in Serengeti lions (Panthera leo).
Nature, 379(6564), 441–445.
Schölkopf, B., & Smola, A. J. (2002). Learning with kernels: Support vector machines,
regularization, optimization, and beyond. MIT Press. Retrieved from
https://dl.acm.org/citation.cfm?id=559923
Shapley, L. S. (1953). Stochastic games. Proceedings of the National Academy of Sciences of the
United States of America, 39(10), 1095–1100. doi:10.1073/pnas.39.10.1095
Stekhoven, D. J., & Bühlmann, P. (2012). MissForest: non-parametric missing value imputation
for mixed-type data. Bioinformatics, 28(1), 112–118. doi:10.1093/bioinformatics/btr597
Štrumbelj, E., & Kononenko, I. (2014). Explaining prediction models and individual predictions
with feature contributions. Knowledge and Information Systems, 41(3), 647–665.
doi:10.1007/s10115-013-0679-x
Tu, J. V. (1996). Advantages and disadvantages of using artificial neural networks versus logistic
regression for predicting medical outcomes. Journal of Clinical Epidemiology, 49(11),
1225–1231. doi:10.1016/S0895-4356(96)00002-9
Viana, M., Cleaveland, S., Matthiopoulos, J., Halliday, J., Packer, C., Craft, M. E., … Lembo, T.
(2015). Dynamics of a morbillivirus at the domestic–wildlife interface: Canine distemper
virus in domestic dogs and lions. Proceedings of the National Academy of Sciences of the
United States of America, 112(5), 1464–1469. doi:10.1073/pnas.1411623112